Discussion: shared-memory based stats collector

shared-memory based stats collector

From: Kyotaro HORIGUCHI
Date:
Hello.

This is intended to make it possible to provide more statistics,
as discussed in the following thread.

https://www.postgresql.org/message-id/20171010.192616.108347483.horiguchi.kyotaro@lab.ntt.co.jp

The biggest obstacle to having more items in the statistics
collector views is the mechanism used to share the values among
backends. It currently uses a file: the stats collector writes
the file when triggered by backends, and the backends then read
the written file. A larger file makes the latency longer, and we
have no spare bandwidth for additional statistics items.

Nowadays PostgreSQL has a dynamic shared hash (dshash), so we can
use it as the main storage for statistics. With it we can share
the data without that strain.

A previously posted PoC tried to use a "locally copied" dshash,
but it didn't look good, so I steered in a different direction.

With this patch, dshash can create a local copy based on dynahash.
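
As a rough illustration of the local-copy idea, here is a toy model
in plain C. It is only a sketch under stated assumptions, not the
patch's code: Table, Item, table_find, table_enter, table_clear and
table_take_snapshot are hypothetical names, one struct stands in for
both the shared dshash table and the local dynahash, and the
partition locking that the real scan takes is left out.

```c
#include <assert.h>
#include <stdlib.h>

#define NBUCKETS 8

/* One chained entry; key stands in for an Oid, counter for a stats value. */
typedef struct Item
{
    struct Item *next;
    unsigned int key;
    long         counter;
} Item;

/* Toy hash table; the same type models both the shared and the local table. */
typedef struct Table
{
    Item *buckets[NBUCKETS];
} Table;

/* Look up a key; returns NULL if it is absent. */
Item *
table_find(const Table *t, unsigned int key)
{
    Item *it;

    for (it = t->buckets[key % NBUCKETS]; it != NULL; it = it->next)
        if (it->key == key)
            return it;
    return NULL;
}

/* Return the entry for key, creating it if needed; NULL if out of memory. */
Item *
table_enter(Table *t, unsigned int key)
{
    Item *it = table_find(t, key);

    if (it == NULL && (it = calloc(1, sizeof(Item))) != NULL)
    {
        it->key = key;
        it->next = t->buckets[key % NBUCKETS];
        t->buckets[key % NBUCKETS] = it;
    }
    return it;
}

/* Free every entry and leave the table empty. */
void
table_clear(Table *t)
{
    int b;

    for (b = 0; b < NBUCKETS; b++)
    {
        while (t->buckets[b] != NULL)
        {
            Item *it = t->buckets[b];

            t->buckets[b] = it->next;
            free(it);
        }
    }
}

/*
 * Scan src bucket by bucket and copy every entry into dst, which must be
 * empty.  Returns 0 on success, -1 if memory runs out (dst is then cleared).
 * The real patch holds partition locks during the scan; this toy does not.
 */
int
table_take_snapshot(const Table *src, Table *dst)
{
    int b;

    assert(src != NULL && dst != NULL);
    for (b = 0; b < NBUCKETS; b++)
    {
        const Item *it;

        for (it = src->buckets[b]; it != NULL; it = it->next)
        {
            Item *copy = table_enter(dst, it->key);

            if (copy == NULL)
            {
                table_clear(dst);
                return -1;
            }
            copy->counter = it->counter;
        }
    }
    return 0;
}
```

A caller would fill one Table, call table_take_snapshot(&shared, &snap)
on an empty second Table, and then read from snap: later updates to
the first table are not visible through the copy, which is the
property a backend wants from its local view of the statistics.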

This patch set consists of three files.

v1-0001-Give-dshash-ability-to-make-a-local-snapshot.patch

  adds to dshash the ability to make a local copy backed by dynahash.

v1-0002-Change-stats-collector-to-an-axiliary-process.patch

  changes the stats collector into an auxiliary process so that it
  can attach to dynamic shared memory.

v1-0003-dshash-based-stats-collector.patch

  implements the shared-memory based stats collector.
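
The second patch also changes how the collector is asked to stop:
the postmaster sends SIGTERM instead of SIGQUIT, and the handler
only sets a flag that the main loop polls. The flag-and-poll part
can be sketched in standard C as below. This is a minimal sketch,
not the patch's code: install_shutdown_handler and run_loop are
hypothetical names, the loop raises the signal at itself purely for
demonstration, and the latch wakeup done by the real handler is
omitted.

```c
#include <assert.h>
#include <signal.h>

/* Set by the handler, polled by the main loop. */
static volatile sig_atomic_t got_sigterm = 0;

/* The handler does nothing except record that shutdown was requested. */
static void
shutdown_handler(int signo)
{
    (void) signo;
    got_sigterm = 1;
}

/* Install the SIGTERM handler; returns 0 on success, -1 on failure. */
int
install_shutdown_handler(void)
{
    return signal(SIGTERM, shutdown_handler) == SIG_ERR ? -1 : 0;
}

/*
 * Process up to max_items work items, stopping early once shutdown has been
 * requested.  For demonstration only, the loop sends SIGTERM to itself just
 * before handling item number signal_at (pass -1 to never do so).
 * Returns the number of items processed.
 */
int
run_loop(int max_items, int signal_at)
{
    int processed = 0;

    assert(max_items >= 0);
    while (!got_sigterm && processed < max_items)
    {
        if (processed == signal_at)
            raise(SIGTERM);
        processed++;            /* the item in progress is still finished */
    }
    return processed;
}
```

Because the handler only sets a flag, the loop finishes the item it
is working on and then exits through the normal path, so ordinary
cleanup can run.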

I'll post a more detailed explanation later.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From f1e033f308b9bbd2f1b1a3bb4a71f0fe2c538e82 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:41:04 +0900
Subject: [PATCH 1/3] Give dshash ability to make a local snapshot

Add a snapshot feature to dshash that builds a dynahash containing
the same data.
---
 src/backend/lib/dshash.c | 164 +++++++++++++++++++++++++++++++++++++++++++++++
 src/include/lib/dshash.h |  21 +++++-
 2 files changed, 184 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index b46f7c4cfd..758435efe7 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -354,6 +354,48 @@ dshash_destroy(dshash_table *hash_table)
     pfree(hash_table);
 }
 
+/*
+ * take a local snapshot of a dshash table
+ */
+HTAB *
+dshash_take_snapshot(dshash_table *org_table)
+{
+    HTAB    *dest_hash;
+    HASHCTL ctl;
+    int        num_entries = 0;
+    dshash_seq_status s;
+    void *ps;
+
+    ctl.keysize = org_table->params.key_size;
+    ctl.entrysize = org_table->params.entry_size;
+
+    /*
+     *  num_entries is not a strict parameter. We don't care if new entries
+     *    are added before taking snapshot
+     */
+    num_entries = dshash_get_num_entries(org_table);
+
+    /*
+     * dshash only supports binary hash and comparator functions, which won't
+     * stop at intermediate NUL(\x0) bytes. Just specify HASH_BLOBS so that the
+     * local hash behaves in the same way.
+     */
+    dest_hash = hash_create("local snapshot of dshash",
+                            num_entries, &ctl,
+                            HASH_ELEM | HASH_BLOBS);
+                            
+    dshash_seq_init(&s, org_table, true);
+    while ((ps = dshash_seq_next(&s)) != NULL)
+    {
+        bool found;
+        void *pd = hash_search(dest_hash, ps, HASH_ENTER, &found);
+        Assert(!found);
+        memcpy(pd, ps, ctl.entrysize);
+    }
+
+    return dest_hash;
+}
+
 /*
  * Get a handle that can be used by other processes to attach to this hash
  * table.
@@ -592,6 +634,128 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * Initialize a sequential scan on the hash_table. Allows no modifications
+ * during a scan if consistent = true.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool consistent)
+{
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = ((size_t) 1) << hash_table->control->size_log2;
+    status->curitem = NULL;
+    status->curpartition = -1;
+    status->consistent = consistent;
+
+    /*
+     * Protect all partitions from modification if the caller wants a
+     * consistent result.
+     */
+    if (consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+        {
+            Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i)));
+
+            LWLockAcquire(PARTITION_LOCK(hash_table, i), LW_SHARED);
+        }
+    }
+    ensure_valid_bucket_pointers(hash_table);
+}
+
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    if (status->curitem == NULL)
+    {
+        Assert (status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+        status->hash_table->find_locked = true;
+    }
+    else
+        next_item_pointer = status->curitem->next;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            dshash_seq_release(status);
+            return NULL;
+        }
+        Assert(status->hash_table->find_locked);
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+
+        /*
+         * we need a lock on the partition being scanned even if the caller
+         * didn't request a consistent snapshot.
+         */
+        if (!status->consistent && DsaPointerIsValid(next_item_pointer))
+        {
+            dshash_table_item  *item = dsa_get_address(status->hash_table->area,
+                                                       next_item_pointer);
+            int next_partition = PARTITION_FOR_HASH(item->hash);
+            if (status->curpartition != next_partition)
+            {
+                if (status->curpartition >= 0)
+                    LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                                 status->curpartition));
+                LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                             next_partition),
+                              LW_SHARED);
+                status->curpartition = next_partition;
+            }
+        }
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+void
+dshash_seq_release(dshash_seq_status *status)
+{
+    Assert(status->hash_table->find_locked);
+    status->hash_table->find_locked = false;
+
+    if (status->consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+            LWLockRelease(PARTITION_LOCK(status->hash_table, i));
+    }
+    else if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+int
+dshash_get_num_entries(dshash_table *hash_table)
+{
+    /* a shortcut implementation; should be improved */
+    dshash_seq_status s;
+    void *p;
+    int n = 0;
+
+    dshash_seq_init(&s, hash_table, false);
+    while ((p = dshash_seq_next(&s)) != NULL)
+        n++;
+
+    return n;
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 8c733bfe25..eb2e45cf66 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -15,6 +15,7 @@
 #define DSHASH_H
 
 #include "utils/dsa.h"
+#include "utils/hsearch.h"
 
 /* The opaque type representing a hash table. */
 struct dshash_table;
@@ -59,6 +60,18 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+struct dshash_seq_status
+{
+    dshash_table       *hash_table;
+    int                    curbucket;
+    int                    nbuckets;
+    dshash_table_item  *curitem;
+    int                    curpartition;
+    bool                consistent;
+};
+
+typedef struct dshash_seq_status dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
               const dshash_parameters *params,
@@ -70,7 +83,7 @@ extern dshash_table *dshash_attach(dsa_area *area,
 extern void dshash_detach(dshash_table *hash_table);
 extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
 extern void dshash_destroy(dshash_table *hash_table);
-
+extern HTAB * dshash_take_snapshot(dshash_table *org_table);
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
             const void *key, bool exclusive);
@@ -80,6 +93,12 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool consistent);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_release(dshash_seq_status *status);
+extern int dshash_get_num_entries(dshash_table *hash_table);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.16.3

From 26d1d99e5584cc4868ffc7cb6c20d4303276bdf5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:58:32 +0900
Subject: [PATCH 2/3] Change stats collector to an auxiliary process.

Shared memory and LWLocks are required to let the stats collector use
dshash. This patch makes the stats collector an auxiliary process.
---
 src/backend/bootstrap/bootstrap.c   |  8 +++++
 src/backend/postmaster/pgstat.c     | 58 +++++++++++++++++++++++++------------
 src/backend/postmaster/postmaster.c | 24 +++++++++------
 src/include/miscadmin.h             |  3 +-
 src/include/pgstat.h                |  6 +++-
 5 files changed, 70 insertions(+), 29 deletions(-)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 7e34bee63e..8f8327495a 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -335,6 +335,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case WalReceiverProcess:
                 statmsg = pgstat_get_backend_desc(B_WAL_RECEIVER);
                 break;
+            case StatsCollectorProcess:
+                statmsg = pgstat_get_backend_desc(B_STATS_COLLECTOR);
+                break;
             default:
                 statmsg = "??? process";
                 break;
@@ -462,6 +465,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
             WalReceiverMain();
             proc_exit(1);        /* should never return */
 
+        case StatsCollectorProcess:
+            /* don't set signals, stats collector has its own agenda */
+            PgstatCollectorMain();
+            proc_exit(1);        /* should never return */
+
         default:
             elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
             proc_exit(1);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 084573e77c..6d9344fcca 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -267,6 +267,7 @@ static List *pending_write_requests = NIL;
 /* Signal handler flags */
 static volatile bool need_exit = false;
 static volatile bool got_SIGHUP = false;
+static volatile bool got_SIGTERM = false;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -284,8 +285,8 @@ static instr_time total_func_time;
 static pid_t pgstat_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgstat_exit(SIGNAL_ARGS);
+static void pgstat_shutdown_handler(SIGNAL_ARGS);
+static void pgstat_quickdie_handler(SIGNAL_ARGS);
 static void pgstat_beshutdown_hook(int code, Datum arg);
 static void pgstat_sighup_handler(SIGNAL_ARGS);
 
@@ -770,11 +771,7 @@ pgstat_start(void)
             /* Close the postmaster's sockets */
             ClosePostmasterPorts(false);
 
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
+            PgstatCollectorMain();
             break;
 #endif
 
@@ -2870,6 +2867,9 @@ pgstat_bestart(void)
             case WalReceiverProcess:
                 beentry->st_backendType = B_WAL_RECEIVER;
                 break;
+            case StatsCollectorProcess:
+                beentry->st_backendType = B_STATS_COLLECTOR;
+                break;
             default:
                 elog(FATAL, "unrecognized process type: %d",
                      (int) MyAuxProcType);
@@ -4132,6 +4132,9 @@ pgstat_get_backend_desc(BackendType backendType)
         case B_WAL_WRITER:
             backendDesc = "walwriter";
             break;
+        case B_STATS_COLLECTOR:
+            backendDesc = "stats collector";
+            break;
     }
 
     return backendDesc;
@@ -4249,8 +4252,8 @@ pgstat_send_bgwriter(void)
  *    The argc/argv parameters are valid only in EXEC_BACKEND case.
  * ----------
  */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
+void
+PgstatCollectorMain(void)
 {
     int            len;
     PgStat_Msg    msg;
@@ -4263,8 +4266,8 @@ PgstatCollectorMain(int argc, char *argv[])
      */
     pqsignal(SIGHUP, pgstat_sighup_handler);
     pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, pgstat_exit);
+    pqsignal(SIGTERM, pgstat_shutdown_handler);
+    pqsignal(SIGQUIT, pgstat_quickdie_handler);
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
     pqsignal(SIGUSR1, SIG_IGN);
@@ -4309,14 +4312,14 @@ PgstatCollectorMain(int argc, char *argv[])
         /*
          * Quit if we get SIGQUIT from the postmaster.
          */
-        if (need_exit)
+        if (got_SIGTERM)
             break;
 
         /*
          * Inner loop iterates as long as we keep getting messages, or until
          * need_exit becomes set.
          */
-        while (!need_exit)
+        while (!got_SIGTERM)
         {
             /*
              * Reload configuration if we got SIGHUP from the postmaster.
@@ -4504,14 +4507,21 @@ PgstatCollectorMain(int argc, char *argv[])
 
 /* SIGQUIT signal handler for collector process */
 static void
-pgstat_exit(SIGNAL_ARGS)
+pgstat_quickdie_handler(SIGNAL_ARGS)
 {
-    int            save_errno = errno;
+    PG_SETMASK(&BlockSig);
 
-    need_exit = true;
-    SetLatch(MyLatch);
+    /*
+     * We DO NOT want to run proc_exit() callbacks -- we're here because
+     * shared memory may be corrupted, so we don't want to try to clean up our
+     * transaction.  Just nail the windows shut and get out of town.  Now that
+     * there's an atexit callback to prevent third-party code from breaking
+     * things by calling exit() directly, we have to reset the callbacks
+     * explicitly to make this work as intended.
+     */
+    on_exit_reset();
 
-    errno = save_errno;
+    exit(2);
 }
 
 /* SIGHUP handler for collector process */
@@ -4526,6 +4536,18 @@ pgstat_sighup_handler(SIGNAL_ARGS)
     errno = save_errno;
 }
 
+static void
+pgstat_shutdown_handler(SIGNAL_ARGS)
+{
+    int save_errno = errno;
+
+    got_SIGTERM = true;
+
+    SetLatch(MyLatch);
+
+    errno = save_errno;
+}
+
 /*
  * Subroutine to clear stats in a database entry
  *
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a4b53b33cd..a6209cd749 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -145,7 +145,8 @@
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
 #define BACKEND_TYPE_BGWORKER    0x0008    /* bgworker process */
-#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
+#define BACKEND_TYPE_STATS        0x0010    /* stats collector process */
+#define BACKEND_TYPE_ALL        0x001F    /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -551,6 +552,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
 #define StartWalReceiver()        StartChildProcess(WalReceiverProcess)
+#define StartStatsCollector()    StartChildProcess(StatsCollectorProcess)
 
 /* Macros to check exit status of a child process */
 #define EXIT_STATUS_0(st)  ((st) == 0)
@@ -1760,7 +1762,7 @@ ServerLoop(void)
         /* If we have lost the stats collector, try to start a new one */
         if (PgStatPID == 0 &&
             (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
+            PgStatPID = StartStatsCollector();
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
@@ -2878,7 +2880,7 @@ reaper(SIGNAL_ARGS)
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = pgarch_start();
             if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
+                PgStatPID = StartStatsCollector();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -2951,7 +2953,7 @@ reaper(SIGNAL_ARGS)
                  * nothing left for it to do.
                  */
                 if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
+                    signal_child(PgStatPID, SIGTERM);
             }
             else
             {
@@ -3037,10 +3039,10 @@ reaper(SIGNAL_ARGS)
         {
             PgStatPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
+                HandleChildCrash(pid, exitstatus,
+                                 _("statistics collector process"));
             if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
+                PgStatPID = StartStatsCollector();
             continue;
         }
 
@@ -3270,7 +3272,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, stats collector or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -5071,7 +5073,7 @@ sigusr1_handler(SIGNAL_ARGS)
          * Likewise, start other special children as needed.
          */
         Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
+        PgStatPID = StartStatsCollector();
 
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
@@ -5361,6 +5363,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork WAL receiver process: %m")));
                 break;
+            case StatsCollectorProcess:
+                ereport(LOG,
+                        (errmsg("could not fork stats collector process: %m")));
+                break;
             default:
                 ereport(LOG,
                         (errmsg("could not fork process: %m")));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index e167ee8fcb..53b260cb1f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -400,7 +400,7 @@ typedef enum
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
-
+    StatsCollectorProcess,
     NUM_AUXPROCTYPES            /* Must be last! */
 } AuxProcType;
 
@@ -412,6 +412,7 @@ extern AuxProcType MyAuxProcType;
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
+#define AmStatsCollectorProcess()    (MyAuxProcType == StatsCollectorProcess)
 
 
 /*****************************************************************************
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index be2f59239b..0b9609f96e 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -710,7 +710,8 @@ typedef enum BackendType
     B_STARTUP,
     B_WAL_RECEIVER,
     B_WAL_SENDER,
-    B_WAL_WRITER
+    B_WAL_WRITER,
+    B_STATS_COLLECTOR
 } BackendType;
 
 
@@ -1352,4 +1353,7 @@ extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
 
+/* Main loop */
+extern void PgstatCollectorMain(void);
+
 #endif                            /* PGSTAT_H */
-- 
2.16.3

From 02891365848ad56494864c8107da436e3f7909d1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 17:05:46 +0900
Subject: [PATCH 3/3] dshash-based stats collector

The stats collector no longer uses files to distribute stats numbers.
They are now stored in a dynamic shared hash.
---
 src/backend/postmaster/autovacuum.c           |    6 +-
 src/backend/postmaster/pgstat.c               | 1259 ++++++++++---------------
 src/backend/replication/basebackup.c          |   36 -
 src/backend/storage/ipc/ipci.c                |    2 +
 src/backend/storage/lmgr/lwlock.c             |    3 +
 src/backend/storage/lmgr/lwlocknames.txt      |    1 +
 src/backend/utils/misc/guc.c                  |   41 -
 src/backend/utils/misc/postgresql.conf.sample |    1 -
 src/bin/initdb/initdb.c                       |    1 -
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |    2 +-
 src/include/pgstat.h                          |   51 +-
 src/include/storage/lwlock.h                  |    3 +
 12 files changed, 547 insertions(+), 859 deletions(-)

diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 02e6d8131e..4ece3f411d 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2768,12 +2768,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
     if (isshared)
     {
         if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
+            tabentry = backend_get_tab_entry(shared, relid);
     }
     else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
+        tabentry = backend_get_tab_entry(dbentry, relid);
 
     return tabentry;
 }
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 6d9344fcca..626e3c6867 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -77,22 +77,10 @@
 #define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
                                      * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
 #define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
                                      * failed statistics collector; in
                                      * seconds. */
 
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
 /* Minimum receive buffer size for the collector's socket. */
 #define PGSTAT_MIN_RCVBUF        (100 * 1024)
 
@@ -101,7 +89,6 @@
  * The initial size hints for the hash tables used in the collector.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
 #define PGSTAT_TAB_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
@@ -127,14 +114,6 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
- */
-char       *pgstat_stat_directory = NULL;
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
-
 /*
  * BgWriter global statistics counters (unused in other processes).
  * Stored directly in a stats message structure so it can be sent
@@ -154,6 +133,42 @@ static time_t last_pgstat_start_time;
 
 static bool pgStatRunningInCollector = false;
 
+/* Shared stats bootstrap information */
+typedef struct StatsShmemStruct {
+    dsa_handle stats_dsa_handle;
+    dshash_table_handle db_stats_handle;
+    dsa_pointer    global_stats;
+    dsa_pointer    archiver_stats;
+} StatsShmemStruct;
+
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
+static dshash_table *db_stats;
+static HTAB *snapshot_db_stats;
+
+/* dshash parameter for each type of table */
+static const dshash_parameters dsh_dbparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatDBEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_DB
+};
+static const dshash_parameters dsh_tblparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatTabEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_FUNC_TABLE
+};
+static const dshash_parameters dsh_funcparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatFuncEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_FUNC_TABLE
+};
+
 /*
  * Structures in which backends store per-table info that's waiting to be
  * sent to the collector.
@@ -250,12 +265,16 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Cluster wide statistics.
+ * Contains statistics that are not collected per database or per table.
+ * shared_* are the statistics maintained by pgstats and snapshot_* are the
+ * snapshots taken on backends.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats *snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats *snapshot_globalStats;
+
 
 /*
  * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
@@ -285,23 +304,22 @@ static instr_time total_func_time;
 static pid_t pgstat_forkexec(void);
 #endif
 
+/* functions used in stats collector */
 static void pgstat_shutdown_handler(SIGNAL_ARGS);
 static void pgstat_quickdie_handler(SIGNAL_ARGS);
 static void pgstat_beshutdown_hook(int code, Datum arg);
 static void pgstat_sighup_handler(SIGNAL_ARGS);
 
 static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                     Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
-static void pgstat_read_current_status(void);
+static PgStat_StatTabEntry *pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create);
+static void pgstat_write_statsfiles(void);
+static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_statsfiles(void);
+static void pgstat_read_db_statsfile(Oid databaseid, dshash_table *tabhash, dshash_table *funchash);
 
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
+/* functions used in backends */
+static bool backend_take_stats_snapshot(void);
+static void pgstat_read_current_status(void);
 
 static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
 static void pgstat_send_funcstats(void);
@@ -320,7 +338,6 @@ static const char *pgstat_get_wait_io(WaitEventIO w);
 static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
 static void pgstat_send(void *msg, int len);
 
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
 static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
 static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
 static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
@@ -685,7 +702,6 @@ pgstat_reset_remove_files(const char *directory)
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
     pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
 }
 
@@ -1009,6 +1025,93 @@ pgstat_send_funcstats(void)
 }
 
 
+/* ----------
+ * pgstat_attach_shared_stats() -
+ *
+ *    attach existing shared stats memory
+ * ----------
+ */
+static bool
+pgstat_attach_shared_stats(void)
+{
+    MemoryContext oldcontext;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID || area != NULL)
+    {
+        LWLockRelease(StatsLock);
+        return area != NULL;
+    }
+
+    /* top-level variables; they live for the lifetime of the process */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    area = dsa_attach(StatsShmem->stats_dsa_handle);
+    dsa_pin_mapping(area);
+    db_stats = dshash_attach(area, &dsh_dbparams,
+                             StatsShmem->db_stats_handle, 0);
+    snapshot_db_stats = NULL;
+    shared_globalStats = (PgStat_GlobalStats *)
+        dsa_get_address(area, StatsShmem->global_stats);
+    shared_archiverStats =    (PgStat_ArchiverStats *)
+        dsa_get_address(area, StatsShmem->archiver_stats);
+    MemoryContextSwitchTo(oldcontext);
+    LWLockRelease(StatsLock);
+
+    return true;
+}
+
+/* ----------
+ * pgstat_create_shared_stats() -
+ *
+ *    create shared stats memory
+ * ----------
+ */
+static void
+pgstat_create_shared_stats(void)
+{
+    MemoryContext oldcontext;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
+
+    /* lives for the lifetime of the process */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    area = dsa_create(LWTRANCHE_STATS_DSA);
+    dsa_pin_mapping(area);
+
+    db_stats = dshash_create(area, &dsh_dbparams, 0);
+
+    /* create shared area and write bootstrap information */
+    StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+    StatsShmem->global_stats =
+        dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+    StatsShmem->archiver_stats =
+        dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+    StatsShmem->db_stats_handle =
+        dshash_get_hash_table_handle(db_stats);
+
+    /* connect to the memory */
+    snapshot_db_stats = NULL;
+    shared_globalStats = (PgStat_GlobalStats *)
+        dsa_get_address(area, StatsShmem->global_stats);
+    shared_archiverStats = (PgStat_ArchiverStats *)
+        dsa_get_address(area, StatsShmem->archiver_stats);
+    MemoryContextSwitchTo(oldcontext);
+    LWLockRelease(StatsLock);
+}
+
+/* ----------
+ * backend_get_tab_entry() -
+ *
+ *    Find a table stats entry in a backend's snapshot of a database entry.
+ */
+PgStat_StatTabEntry *
+backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid relid)
+{
+    Assert(dbent->snapshot_tables);
+    return hash_search(dbent->snapshot_tables, &relid, HASH_FIND, NULL);
+}
+
 /* ----------
  * pgstat_vacuum_stat() -
  *
@@ -1030,11 +1133,9 @@ pgstat_vacuum_stat(void)
     if (pgStatSock == PGINVALID_SOCKET)
         return;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* If not done for this transaction, take a snapshot of stats */
+    if (!backend_take_stats_snapshot())
+        return;
 
     /*
      * Read pg_database and make a list of OIDs of all existing databases
@@ -1045,7 +1146,7 @@ pgstat_vacuum_stat(void)
      * Search the database hash table for dead databases and tell the
      * collector to drop them.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
+    hash_seq_init(&hstat, snapshot_db_stats);
     while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
     {
         Oid            dbid = dbentry->databaseid;
@@ -1064,10 +1165,10 @@ pgstat_vacuum_stat(void)
     /*
      * Lookup our own database entry; if not found, nothing more to do.
      */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
+    dbentry = (PgStat_StatDBEntry *) hash_search(snapshot_db_stats,
                                                  (void *) &MyDatabaseId,
                                                  HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
+    if (dbentry == NULL || dbentry->snapshot_tables == NULL)
         return;
 
     /*
@@ -1083,7 +1184,7 @@ pgstat_vacuum_stat(void)
     /*
      * Check for all tables listed in stats hashtable if they still exist.
      */
-    hash_seq_init(&hstat, dbentry->tables);
+    hash_seq_init(&hstat, dbentry->snapshot_tables);
     while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
     {
         Oid            tabid = tabentry->tableid;
@@ -1134,8 +1235,8 @@ pgstat_vacuum_stat(void)
      * Now repeat the above steps for functions.  However, we needn't bother
      * in the common case where no function stats are being collected.
      */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
+    if (dbentry->snapshot_functions != NULL &&
+        hash_get_num_entries(dbentry->snapshot_functions) > 0)
     {
         htab = pgstat_collect_oids(ProcedureRelationId);
 
@@ -1143,7 +1244,7 @@ pgstat_vacuum_stat(void)
         f_msg.m_databaseid = MyDatabaseId;
         f_msg.m_nentries = 0;
 
-        hash_seq_init(&hstat, dbentry->functions);
+        hash_seq_init(&hstat, dbentry->snapshot_functions);
         while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
         {
             Oid            funcid = funcentry->functionid;
@@ -1551,24 +1652,6 @@ pgstat_ping(void)
     pgstat_send(&msg, sizeof(msg));
 }
 
-/* ----------
- * pgstat_send_inquiry() -
- *
- *    Notify collector that we need fresh data.
- * ----------
- */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
 
 /*
  * Initialize function call usage data.
@@ -2383,16 +2466,14 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    if (!backend_take_stats_snapshot())
+        return NULL;
 
     /*
      * Lookup the requested database; return NULL if not found
      */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
+    return (PgStat_StatDBEntry *) hash_search(snapshot_db_stats,
                                               (void *) &dbid,
                                               HASH_FIND, NULL);
 }
@@ -2415,23 +2496,22 @@ pgstat_fetch_stat_tabentry(Oid relid)
     PgStat_StatTabEntry *tabentry;
 
     /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
+     * If not done for this transaction, take a stats snapshot
      */
-    backend_read_statsfile();
+    if (!backend_take_stats_snapshot())
+        return NULL;
 
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
+    /* Lookup our database, then look in its table hash table. */
     dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
+    dbentry = (PgStat_StatDBEntry *) hash_search(snapshot_db_stats,
+                                                 (void *) &dbid,
                                                  HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
+    if (dbentry != NULL && dbentry->snapshot_tables != NULL)
     {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
+        tabentry = (PgStat_StatTabEntry *)
+            hash_search(dbentry->snapshot_tables, (void *) &relid,
+                        HASH_FIND, NULL);
+
         if (tabentry)
             return tabentry;
     }
@@ -2440,14 +2520,15 @@ pgstat_fetch_stat_tabentry(Oid relid)
      * If we didn't find it, maybe it's a shared table.
      */
     dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
+    dbentry = (PgStat_StatDBEntry *) hash_search(snapshot_db_stats,
                                                  (void *) &dbid,
                                                  HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
+    if (dbentry != NULL && dbentry->snapshot_tables != NULL)
     {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
+        tabentry = (PgStat_StatTabEntry *)
+            hash_search(dbentry->snapshot_tables, (void *) &relid,
+                        HASH_FIND, NULL);
+
         if (tabentry)
             return tabentry;
     }
@@ -2469,18 +2550,15 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     PgStat_StatDBEntry *dbentry;
     PgStat_StatFuncEntry *funcentry = NULL;
 
-    /* load the stats file if needed */
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    if (!backend_take_stats_snapshot())
+        return NULL;
 
-    /* Lookup our database, then find the requested function.  */
+    /* Lookup our database, then find the requested function */
     dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
-
+    if (dbentry != NULL && dbentry->snapshot_functions != NULL)
+        funcentry = hash_search(dbentry->snapshot_functions,
+                                (void *) &func_id, HASH_FIND, NULL);
     return funcentry;
 }
 
@@ -2555,9 +2633,11 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    if (!backend_take_stats_snapshot())
+        return NULL;
 
-    return &archiverStats;
+    return snapshot_archiverStats;
 }
 
 
@@ -2572,9 +2652,11 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    if (!backend_take_stats_snapshot())
+        return NULL;
 
-    return &globalStats;
+    return snapshot_globalStats;
 }
 
 
@@ -4277,18 +4359,14 @@ PgstatCollectorMain(void)
     pqsignal(SIGTTOU, SIG_DFL);
     pqsignal(SIGCONT, SIG_DFL);
     pqsignal(SIGWINCH, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
 
-    /*
-     * Identify myself via ps
-     */
-    init_ps_display("stats collector", "", "", "");
+    PG_SETMASK(&UnBlockSig);
 
     /*
      * Read in existing stats files or initialize the stats to zero.
      */
     pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
+    pgstat_read_statsfiles();
 
     /*
      * Loop to process messages until we get SIGQUIT or detect ungraceful
@@ -4330,13 +4408,6 @@ PgstatCollectorMain(void)
                 ProcessConfigFile(PGC_SIGHUP);
             }
 
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
             /*
              * Try to receive and process a message.  This will not block,
              * since the socket is set to non-blocking mode.
@@ -4385,10 +4456,6 @@ PgstatCollectorMain(void)
                 case PGSTAT_MTYPE_DUMMY:
                     break;
 
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry((PgStat_MsgInquiry *) &msg, len);
-                    break;
-
                 case PGSTAT_MTYPE_TABSTAT:
                     pgstat_recv_tabstat((PgStat_MsgTabstat *) &msg, len);
                     break;
@@ -4477,9 +4544,7 @@ PgstatCollectorMain(void)
          * fixes that, so don't sleep indefinitely.  This is a crock of the
          * first water, but until somebody wants to debug exactly what's
          * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
+         * timeout matches our pre-9.2 behavior.
          */
         wr = WaitLatchOrSocket(MyLatch,
                                WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
@@ -4499,7 +4564,7 @@ PgstatCollectorMain(void)
     /*
      * Save the final stats to reuse at next startup.
      */
-    pgstat_write_statsfiles(true, true);
+    pgstat_write_statsfiles();
 
     exit(0);
 }
@@ -4549,14 +4614,14 @@ pgstat_shutdown_handler(SIGNAL_ARGS)
 }
 
 /*
- * Subroutine to clear stats in a database entry
+ * Subroutine to reset stats in a shared database entry
  *
  * Tables and functions hashes are initialized to empty.
  */
 static void
 reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
 {
-    HASHCTL        hash_ctl;
+    dshash_table *tbl;
 
     dbentry->n_xact_commit = 0;
     dbentry->n_xact_rollback = 0;
@@ -4582,20 +4647,14 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
     dbentry->stat_reset_timestamp = GetCurrentTimestamp();
     dbentry->stats_timestamp = 0;
 
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
 
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
+    tbl = dshash_create(area, &dsh_tblparams, 0);
+    dbentry->tables = dshash_get_hash_table_handle(tbl);
+    dshash_detach(tbl);
+
+    tbl = dshash_create(area, &dsh_funcparams, 0);
+    dbentry->functions = dshash_get_hash_table_handle(tbl);
+    dshash_detach(tbl);
 }
 
 /*
@@ -4608,15 +4667,18 @@ pgstat_get_db_entry(Oid databaseid, bool create)
 {
     PgStat_StatDBEntry *result;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
+
+    Assert(pgStatRunningInCollector);
 
     /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
+    if (create)
+        result = (PgStat_StatDBEntry *)
+            dshash_find_or_insert(db_stats, &databaseid, &found);
+    else
+        result = (PgStat_StatDBEntry *) dshash_find(db_stats, &databaseid, true);
 
-    if (!create && !found)
-        return NULL;
+    if (!create)
+        return result;
 
     /*
      * If not found, initialize the new one.  This creates empty hash tables
@@ -4628,23 +4690,23 @@ pgstat_get_db_entry(Oid databaseid, bool create)
     return result;
 }
 
-
 /*
  * Lookup the hash table entry for the specified table. If no hash
  * table entry exists, initialize it, if the create parameter is true.
  * Else, return NULL.
  */
 static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create)
 {
     PgStat_StatTabEntry *result;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
 
     /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
+    if (create)
+        result = dshash_find_or_insert(table, &tableoid, &found);
+    else
+        result = dshash_find(table, &tableoid, false);
+    found = create ? found : (result != NULL);    /* dshash_find() does not set found */
 
     if (!create && !found)
         return NULL;
@@ -4682,25 +4744,20 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
  * pgstat_write_statsfiles() -
  *        Write the global statistics file, as well as requested DB files.
  *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
  *    When 'allDbs' is false, only the requested databases (listed in
  *    pending_write_requests) will be written; otherwise, all databases
  *    will be written.
  * ----------
  */
 static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+pgstat_write_statsfiles(void)
 {
-    HASH_SEQ_STATUS hstat;
+    dshash_seq_status hstat;
     PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
@@ -4721,7 +4778,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
 
     /*
      * Write the file header --- currently just a format ID.
@@ -4733,32 +4790,29 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Write global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write archiver stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Walk through the database table.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_init(&hstat, db_stats, false);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&hstat)) != NULL)
     {
         /*
          * Write out the table and function stats for this DB into the
          * appropriate per-DB stat file, if required.
          */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
+        pgstat_write_db_statsfile(dbentry);
 
         /*
          * Write out the DB entry. We don't write the tables or functions
@@ -4802,9 +4856,6 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
         unlink(tmpfile);
     }
 
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
     /*
      * Now throw away the list of requests.  Note that requests sent after we
      * started the write are still waiting on the network socket.
@@ -4818,15 +4869,14 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
  * of length len.
  */
 static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
+get_dbstat_filename(bool tempname, Oid databaseid,
                     char *filename, int len)
 {
     int            printed;
 
     /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
     printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
                        databaseid,
                        tempname ? "tmp" : "stat");
     if (printed > len)
@@ -4844,10 +4894,10 @@ get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
  * ----------
  */
 static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
+pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry)
 {
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
+    dshash_seq_status tstat;
+    dshash_seq_status fstat;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatFuncEntry *funcentry;
     FILE       *fpout;
@@ -4856,9 +4906,10 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     int            rc;
     char        tmpfile[MAXPGPATH];
     char        statfile[MAXPGPATH];
+    dshash_table *tbl;
 
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -4885,24 +4936,28 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     /*
      * Walk through the database's access stats per table.
      */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&tstat, tbl, false);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&tstat)) != NULL)
     {
         fputc('T', fpout);
         rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_detach(tbl);
 
     /*
      * Walk through the database's function stats table.
      */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+    tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+    dshash_seq_init(&fstat, tbl, false);
+    while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&fstat)) != NULL)
     {
         fputc('F', fpout);
         rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_detach(tbl);
 
     /*
      * No more output to be done. Close the temp file and replace the old
@@ -4936,76 +4991,45 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
 }
 
 /* ----------
  * pgstat_read_statsfiles() -
  *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ *    Reads in some existing statistics collector files into the shared stats
+ *    hash.
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+static void
+pgstat_read_statsfiles(void)
 {
     PgStat_StatDBEntry *dbentry;
     PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    dshash_table *tblstats = NULL;
+    dshash_table *funcstats = NULL;
 
+    Assert(pgStatRunningInCollector);
     /*
      * The tables will live in pgStatLocalContext.
      */
     pgstat_setup_memcxt();
 
     /*
-     * Create the DB hashtable
+     * Create the DB hashtable and global stats area
      */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
+    pgstat_create_shared_stats();
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp = shared_globalStats->stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
@@ -5023,7 +5047,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -5040,11 +5064,11 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     /*
      * Read global stats struct
      */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) != sizeof(*shared_globalStats))
     {
         ereport(pgStatRunningInCollector ? LOG : WARNING,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
+        memset(shared_globalStats, 0, sizeof(*shared_globalStats));
         goto done;
     }
 
@@ -5055,17 +5079,16 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
      * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
      * an unusual scenario.
      */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
+    shared_globalStats->stats_timestamp = 0;
 
     /*
      * Read archiver stats struct
      */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) != sizeof(*shared_archiverStats))
     {
         ereport(pgStatRunningInCollector ? LOG : WARNING,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
+        memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
         goto done;
     }
 
@@ -5094,12 +5117,12 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 /*
                  * Add to the DB hash
                  */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
+                dbentry = (PgStat_StatDBEntry *)
+                    dshash_find_or_insert(db_stats, (void *) &dbbuf.databaseid,
+                                          &found);
                 if (found)
                 {
+                    dshash_release_lock(db_stats, dbentry);
                     ereport(pgStatRunningInCollector ? LOG : WARNING,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
@@ -5107,8 +5130,8 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 }
 
                 memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
+                dbentry->tables = DSM_HANDLE_INVALID;
+                dbentry->functions = DSM_HANDLE_INVALID;
 
                 /*
                  * In the collector, disregard the timestamp we read from the
@@ -5116,47 +5139,23 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                  * stats file immediately upon the first request from any
                  * backend.
                  */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+                Assert(pgStatRunningInCollector);
+                dbentry->stats_timestamp = 0;
 
                 /*
                  * If requested, read the data from the database-specific
                  * file.  Otherwise we just leave the hashtables empty.
                  */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
+                tblstats = dshash_create(area, &dsh_tblparams, 0);
+                dbentry->tables = dshash_get_hash_table_handle(tblstats);
+                funcstats = dshash_create(area, &dsh_funcparams, 0);
+                dbentry->functions =
+                    dshash_get_hash_table_handle(funcstats);
+                dshash_release_lock(db_stats, dbentry);
+                pgstat_read_db_statsfile(dbentry->databaseid,
+                                         tblstats, funcstats);
+                dshash_detach(tblstats);
+                dshash_detach(funcstats);
                 break;
 
             case 'E':
@@ -5173,34 +5172,47 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 done:
     FreeFile(fpin);
 
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 
-    return dbhash;
+    return;
 }
 
 
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+void
+StatsShmemInit(void)
+{
+    bool    found;
+
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+    }
+    else
+        Assert(found);
+}
+
 /* ----------
  * pgstat_read_db_statsfile() -
  *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
+ *    Reads in the permanent statistics collector file and creates shared
+ *    statistics tables. The file is removed after reading.
  * ----------
  */
 static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
+pgstat_read_db_statsfile(Oid databaseid,
+                         dshash_table *tabhash, dshash_table *funchash)
 {
     PgStat_StatTabEntry *tabentry;
     PgStat_StatTabEntry tabbuf;
@@ -5211,7 +5223,8 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     bool        found;
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
+    Assert(pgStatRunningInCollector);
+    get_dbstat_filename(false, databaseid, statfile, MAXPGPATH);
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
@@ -5270,12 +5283,13 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (tabhash == NULL)
                     break;
 
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
+                tabentry = (PgStat_StatTabEntry *)
+                    dshash_find_or_insert(tabhash,
+                                          (void *) &tabbuf.tableid, &found);
 
                 if (found)
                 {
+                    dshash_release_lock(tabhash, tabentry);
                     ereport(pgStatRunningInCollector ? LOG : WARNING,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
@@ -5283,6 +5297,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 }
 
                 memcpy(tabentry, &tabbuf, sizeof(tabbuf));
+                dshash_release_lock(tabhash, tabentry);
                 break;
 
                 /*
@@ -5304,9 +5319,9 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (funchash == NULL)
                     break;
 
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
+                funcentry = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert(funchash,
+                                          (void *) &funcbuf.functionid, &found);
 
                 if (found)
                 {
@@ -5317,6 +5332,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 }
 
                 memcpy(funcentry, &funcbuf, sizeof(funcbuf));
+                dshash_release_lock(funchash, funcentry);
                 break;
 
                 /*
@@ -5336,142 +5352,50 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
 done:
     FreeFile(fpin);
 
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 }
 
 /* ----------
- * pgstat_read_db_statsfile_timestamp() -
+ * backend_clean_snapshot_callback() -
  *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
+ *    This is usually called with arg = NULL when the memory context where the
+ *    current snapshot has been taken is reset or deleted. Don't bother to
+ *    release memory in that case.
  * ----------
  */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
+static void
+backend_clean_snapshot_callback(void *arg)
 {
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    if (arg != NULL)
     {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
+        /* explicitly called, so explicitly free resources */
+        if (snapshot_globalStats)
+            pfree(snapshot_globalStats);
 
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
+        if (snapshot_archiverStats)
+            pfree(snapshot_archiverStats);
 
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
+        if (snapshot_db_stats)
         {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
+            HASH_SEQ_STATUS seq;
+            PgStat_StatDBEntry *dbent;
 
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
+            hash_seq_init(&seq, snapshot_db_stats);
+            while ((dbent = hash_seq_search(&seq)) != NULL)
+            {
+                if (dbent->snapshot_tables)
+                    hash_destroy(dbent->snapshot_tables);
+                if (dbent->snapshot_functions)
+                    hash_destroy(dbent->snapshot_functions);
+            }
+            hash_destroy(snapshot_db_stats);
         }
     }
 
-done:
-    FreeFile(fpin);
-    return true;
+    snapshot_globalStats = NULL;
+    snapshot_archiverStats = NULL;
+    snapshot_db_stats = NULL;
 }
 
 /*
@@ -5479,131 +5403,75 @@ done:
  * some hash tables.  The results will be kept until pgstat_clear_snapshot()
  * is called (typically, at end of transaction).
  */
-static void
-backend_read_statsfile(void)
+static bool
+backend_take_stats_snapshot(void)
 {
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
+    PgStat_StatDBEntry  *dbent;
+    HASH_SEQ_STATUS seq;
+    MemoryContext oldcontext;
+    MemoryContextCallback *mcxt_cb;
 
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
     Assert(!pgStatRunningInCollector);
 
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    if (!pgstat_attach_shared_stats())
     {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
+        MemoryContextSwitchTo(oldcontext);
+        return false;
+    }
+    MemoryContextSwitchTo(oldcontext);
 
-        CHECK_FOR_INTERRUPTS();
+    if (snapshot_globalStats)
+        return true;
 
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
+    Assert(snapshot_archiverStats == NULL);
+    Assert(snapshot_db_stats == NULL);
 
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
+    /*
+     * the snapshot lives within the current transaction if any, the current
+     * memory context lifetime otherwise.
+     */
+    if (IsTransactionState())
+        MemoryContextSwitchTo(TopTransactionContext);
 
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
+    /* global stats can be just copied  */
+    snapshot_globalStats = palloc(sizeof(PgStat_GlobalStats));
+    memcpy(snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
 
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
+    snapshot_archiverStats = palloc(sizeof(PgStat_ArchiverStats));
+    memcpy(snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
 
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
+    /*
+     * take a local snapshot for every dshash. It's ok if the snapshots are
+     * not strictly consistent.
+     */
+    snapshot_db_stats = dshash_take_snapshot(db_stats);
+    hash_seq_init(&seq, snapshot_db_stats);
+    while ((dbent = (PgStat_StatDBEntry *) hash_seq_search(&seq)) != NULL)
+    {
+        dshash_table *t;
 
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
+        t = dshash_attach(area, &dsh_tblparams, dbent->tables, 0);
+        dbent->snapshot_tables = dshash_take_snapshot(t);
+        dshash_detach(t);
+        t = dshash_attach(area, &dsh_funcparams, dbent->functions, 0);
+        dbent->snapshot_functions = dshash_take_snapshot(t);
+        dshash_detach(t);
     }
 
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
+    /* set the timestamp of this snapshot */
+    snapshot_globalStats->stats_timestamp = GetCurrentTimestamp();
 
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
+    /* register callback to clear snapshot */
+    mcxt_cb = (MemoryContextCallback *)palloc(sizeof(MemoryContextCallback));
+    mcxt_cb->func = backend_clean_snapshot_callback;
+    mcxt_cb->arg = NULL;
+    MemoryContextRegisterResetCallback(CurrentMemoryContext, mcxt_cb);
+    MemoryContextSwitchTo(oldcontext);
+
+    return true;
 }
 
 
@@ -5636,6 +5504,8 @@ pgstat_setup_memcxt(void)
 void
 pgstat_clear_snapshot(void)
 {
+    int param = 0;    /* only the address is significant */
+
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
         MemoryContextDelete(pgStatLocalContext);
@@ -5645,99 +5515,12 @@ pgstat_clear_snapshot(void)
     pgStatDBHash = NULL;
     localBackendStatusTable = NULL;
     localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
 
     /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
+     * the parameter informs the function that it is not called as a
+     * memory context callback
      */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
+    backend_clean_snapshot_callback(&param);
 }
 
 
@@ -5750,6 +5533,7 @@ pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
 static void
 pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
 {
+    dshash_table *tabhash;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
     int            i;
@@ -5765,6 +5549,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
     dbentry->n_block_read_time += msg->m_block_read_time;
     dbentry->n_block_write_time += msg->m_block_write_time;
 
+    tabhash = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
     /*
      * Process all table entries in the message.
      */
@@ -5772,9 +5557,8 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
     {
         PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
 
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
+        tabentry = (PgStat_StatTabEntry *)
+            dshash_find_or_insert(tabhash, (void *) &(tabmsg->t_id), &found);
 
         if (!found)
         {
@@ -5833,6 +5617,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
         tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
         /* Likewise for n_dead_tuples */
         tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
+        dshash_release_lock(tabhash, tabentry);
 
         /*
          * Add per-table stats to the per-database entry, too.
@@ -5845,6 +5630,8 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
         dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
         dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
     }
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 
@@ -5857,27 +5644,33 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
 static void
 pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
 {
+    dshash_table *tbl;
     PgStat_StatDBEntry *dbentry;
     int            i;
 
     dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
     /*
      * No need to purge if we don't even know the database.
      */
-    if (!dbentry || !dbentry->tables)
+    if (!dbentry || dbentry->tables == DSM_HANDLE_INVALID)
+    {
+        if (dbentry)
+            dshash_release_lock(db_stats, dbentry);
         return;
+    }
 
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
     /*
      * Process all table entries in the message.
      */
     for (i = 0; i < msg->m_nentries; i++)
     {
         /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
+        (void) dshash_delete_key(tbl, (void *) &(msg->m_tableid[i]));
     }
+
+    dshash_release_lock(db_stats, dbentry);
+
 }
 
 
@@ -5903,23 +5696,20 @@ pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
      */
     if (dbentry)
     {
-        char        statfile[MAXPGPATH];
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+        }
 
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
+        dshash_delete_entry(db_stats, (void *)dbentry);
     }
 }
 
@@ -5947,19 +5737,28 @@ pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
      * We simply throw away all the database's table entries by recreating a
      * new hash table for them.
      */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
+    if (dbentry->tables != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_destroy(t);
+        dbentry->tables = DSM_HANDLE_INVALID;
+    }
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_destroy(t);
+        dbentry->functions = DSM_HANDLE_INVALID;
+    }
 
     /*
      * Reset database-level stats, too.  This creates empty hash tables for
      * tables and functions.
      */
     reset_dbentry_counters(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -5974,14 +5773,14 @@ pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
     if (msg->m_resettarget == RESET_BGWRITER)
     {
         /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
+        memset(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
     }
     else if (msg->m_resettarget == RESET_ARCHIVER)
     {
         /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
+        memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
     }
 
     /*
@@ -6011,11 +5810,19 @@ pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
 
     /* Remove object if it exists, ignore it if not */
     if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_delete_key(t, (void *) &(msg->m_objectid));
+    }
     else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_delete_key(t, (void *) &(msg->m_objectid));
+    }
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -6035,6 +5842,8 @@ pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
     dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
 
     dbentry->last_autovac_time = msg->m_start_time;
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -6048,13 +5857,13 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
 {
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
-
+    dshash_table *table;
     /*
      * Store the data in the table's hashtable entry.
      */
     dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, msg->m_tableoid, true);
 
     tabentry->n_live_tuples = msg->m_live_tuples;
     tabentry->n_dead_tuples = msg->m_dead_tuples;
@@ -6069,6 +5878,9 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
         tabentry->vacuum_timestamp = msg->m_vacuumtime;
         tabentry->vacuum_count++;
     }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -6082,13 +5894,15 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
 {
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
 
     /*
      * Store the data in the table's hashtable entry.
      */
     dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
 
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, msg->m_tableoid, true);
 
     tabentry->n_live_tuples = msg->m_live_tuples;
     tabentry->n_dead_tuples = msg->m_dead_tuples;
@@ -6111,6 +5925,9 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
         tabentry->analyze_timestamp = msg->m_analyzetime;
         tabentry->analyze_count++;
     }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 
@@ -6126,18 +5943,18 @@ pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
     if (msg->m_failed)
     {
         /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
+        ++shared_archiverStats->failed_count;
+        memcpy(shared_archiverStats->last_failed_wal, msg->m_xlog,
+               sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = msg->m_timestamp;
     }
     else
     {
         /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
+        ++shared_archiverStats->archived_count;
+        memcpy(shared_archiverStats->last_archived_wal, msg->m_xlog,
+               sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = msg->m_timestamp;
     }
 }
 
@@ -6150,16 +5967,16 @@ pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
 static void
 pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
 {
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
+    shared_globalStats->timed_checkpoints += msg->m_timed_checkpoints;
+    shared_globalStats->requested_checkpoints += msg->m_requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += msg->m_checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += msg->m_checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += msg->m_buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += msg->m_buf_written_clean;
+    shared_globalStats->maxwritten_clean += msg->m_maxwritten_clean;
+    shared_globalStats->buf_written_backend += msg->m_buf_written_backend;
+    shared_globalStats->buf_fsync_backend += msg->m_buf_fsync_backend;
+    shared_globalStats->buf_alloc += msg->m_buf_alloc;
 }
 
 /* ----------
@@ -6200,6 +6017,8 @@ pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
             dbentry->n_conflict_startup_deadlock++;
             break;
     }
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -6216,6 +6035,8 @@ pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
     dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
 
     dbentry->n_deadlocks++;
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -6233,6 +6054,8 @@ pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
 
     dbentry->n_temp_bytes += msg->m_filesize;
     dbentry->n_temp_files += 1;
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -6244,6 +6067,7 @@ pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
 static void
 pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
 {
+    dshash_table *t;
     PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
     PgStat_StatDBEntry *dbentry;
     PgStat_StatFuncEntry *funcentry;
@@ -6252,14 +6076,14 @@ pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
 
     dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
 
+    t = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
     /*
      * Process all function entries in the message.
      */
     for (i = 0; i < msg->m_nentries; i++, funcmsg++)
     {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
+        funcentry = (PgStat_StatFuncEntry *)
+            dshash_find_or_insert(t, (void *) &(funcmsg->f_id), &found);
 
         if (!found)
         {
@@ -6280,7 +6104,11 @@ pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
             funcentry->f_total_time += funcmsg->f_total_time;
             funcentry->f_self_time += funcmsg->f_self_time;
         }
+        dshash_release_lock(t, funcentry);
     }
+
+    dshash_detach(t);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -6292,6 +6120,7 @@ pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
 static void
 pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
 {
+    dshash_table *t;
     PgStat_StatDBEntry *dbentry;
     int            i;
 
@@ -6300,60 +6129,20 @@ pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
     /*
      * No need to purge if we don't even know the database.
      */
-    if (!dbentry || !dbentry->functions)
+    if (!dbentry || dbentry->functions == DSM_HANDLE_INVALID)
         return;
 
+    t = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
     /*
      * Process all function entries in the message.
      */
     for (i = 0; i < msg->m_nentries; i++)
     {
         /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
+        dshash_delete_key(t, (void *) &(msg->m_functionid[i]));
     }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
+    dshash_detach(t);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /*
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 3f1eae38a9..a517bf62b6 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -77,9 +77,6 @@ static bool is_checksummed_file(const char *fullpath, const char *filename);
 /* Was the backup currently in-progress initiated in recovery mode? */
 static bool backup_started_in_recovery = false;
 
-/* Relative path of temporary statistics directory */
-static char *statrelpath = NULL;
-
 /*
  * Size of each block sent into the tar stream for larger files.
  */
@@ -121,13 +118,6 @@ static bool noverify_checksums = false;
  */
 static const char *excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    PG_STAT_TMP_DIR,
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another master. See backup.sgml
@@ -223,11 +213,8 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -254,18 +241,6 @@ perform_base_backup(basebackup_options *opt)
 
         SendXlogRecPtrResult(startptr, starttli);
 
-        /*
-         * Calculate the relative path of temporary statistics directory in
-         * order to skip the files which are located in that directory later.
-         */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
-
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
         ti->size = opt->progress ? sendDir(".", 1, true, tablespaces, true) : -1;
@@ -1174,17 +1149,6 @@ sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces,
         if (excludeFound)
             continue;
 
-        /*
-         * Exclude contents of directory specified by statrelpath if not set
-         * to the default (pg_stat_tmp) which is caught in the loop above.
-         */
-        if (statrelpath != NULL && strcmp(pathbuf, statrelpath) == 0)
-        {
-            elog(DEBUG1, "contents of directory \"%s\" excluded from backup", statrelpath);
-            size += _tarWriteDir(pathbuf, basepathlen, &statbuf, sizeonly);
-            continue;
-        }
-
         /*
          * We can skip pg_wal, the WAL segments need to be fetched from the
          * WAL archive anyway. But include it as an empty directory anyway, so
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 0c86a581c0..ee30e8a14f 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -150,6 +150,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
         size = add_size(size, BackendRandomShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -270,6 +271,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
     SyncScanShmemInit();
     AsyncShmemInit();
     BackendRandomShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index a6fda81feb..c46bb8d057 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -521,6 +521,9 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_TBM, "tbm");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
+    LWLockRegisterTranche(LWTRANCHE_STATS_DSA, "stats table dsa");
+    LWLockRegisterTranche(LWTRANCHE_STATS_DB, "db stats");
+    LWLockRegisterTranche(LWTRANCHE_STATS_FUNC_TABLE, "table/func stats");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index e6025ecedb..798af9f168 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -50,3 +50,4 @@ OldSnapshotTimeMapLock                42
 BackendRandomLock                    43
 LogicalRepWorkerLock                44
 CLogTruncationLock                    45
+StatsLock                            46
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 859ef931e7..50043eb5fc 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -186,7 +186,6 @@ static bool check_autovacuum_max_workers(int *newval, void **extra, GucSource so
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static void assign_effective_io_concurrency(int newval, void *extra);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -3755,17 +3754,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -10691,35 +10679,6 @@ assign_effective_io_concurrency(int newval, void *extra)
 #endif                            /* USE_PREFETCH */
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9e39baf466..7aa57bc489 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -509,7 +509,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index ae22e7d9fb..0c3b82b455 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -216,7 +216,6 @@ static const char *const subdirs[] = {
     "pg_replslot",
     "pg_tblspc",
     "pg_stat",
-    "pg_stat_tmp",
     "pg_xact",
     "pg_logical",
     "pg_logical/snapshots",
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index aab2e1eecf..a646d3f972 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0b9609f96e..6608c9be93 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -13,6 +13,7 @@
 
 #include "datatype/timestamp.h"
 #include "fmgr.h"
+#include "lib/dshash.h"
 #include "libpq/pqcomm.h"
 #include "port/atomics.h"
 #include "portability/instr_time.h"
@@ -30,9 +31,6 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
-#define PG_STAT_TMP_DIR        "pg_stat_tmp"
-
 /* Values for track_functions GUC variable --- order is significant! */
 typedef enum TrackFunctionsLevel
 {
@@ -48,7 +46,6 @@ typedef enum TrackFunctionsLevel
 typedef enum StatMsgType
 {
     PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
     PGSTAT_MTYPE_TABSTAT,
     PGSTAT_MTYPE_TABPURGE,
     PGSTAT_MTYPE_DROPDB,
@@ -216,35 +213,6 @@ typedef struct PgStat_MsgDummy
     PgStat_MsgHdr m_hdr;
 } PgStat_MsgDummy;
 
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
 /* ----------
  * PgStat_TableEntry            Per-table info in a MsgTabstat
  * ----------
@@ -539,7 +507,6 @@ typedef union PgStat_Msg
 {
     PgStat_MsgHdr msg_hdr;
     PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
     PgStat_MsgTabstat msg_tabstat;
     PgStat_MsgTabpurge msg_tabpurge;
     PgStat_MsgDropdb msg_dropdb;
@@ -601,10 +568,13 @@ typedef struct PgStat_StatDBEntry
 
     /*
      * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * the handles and pointers out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables;
+    dshash_table_handle functions;
+    /* for snapshot tables */
+    HTAB *snapshot_tables;
+    HTAB *snapshot_functions;
 } PgStat_StatDBEntry;
 
 
@@ -1217,6 +1187,7 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 extern void pgstat_initstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
+extern PgStat_StatTabEntry *backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid relid);
 
 /* ----------
  * pgstat_report_wait_start() -
@@ -1344,6 +1315,9 @@ extern void pgstat_send_bgwriter(void);
  * generate the pgstat* views.
  * ----------
  */
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+extern void PgstatCollectorMain(void);
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
@@ -1353,7 +1327,4 @@ extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
 
-/* Main loop */
-extern void PgstatCollectorMain(void);
-
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index c21bfe2f66..2cdd10c2fd 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -219,6 +219,9 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SHARED_TUPLESTORE,
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
+    LWTRANCHE_STATS_DSA,
+    LWTRANCHE_STATS_DB,
+    LWTRANCHE_STATS_FUNC_TABLE,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
-- 
2.16.3


Re: shared-memory based stats collector

From
Robert Haas
Date:
On Fri, Jun 29, 2018 at 4:34 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Nowadays PostgreSQL has dynamic shared hash (dshash) so we can
> use this as the main storage of statistics. We can share data
> without a stress using this.
>
> A PoC previously posted tried to use "locally copied" dshash but
> it doesn't looks fine so I steered to different direction.
>
> With this patch dshash can create a local copy based on dynhash.

Copying the whole hash table kind of sucks, partly because of the
time it will take to copy it, but also because it means that memory
usage is still O(nbackends * ntables).  Without looking at the patch,
I'm guessing that you're doing that because we need a way to show each
transaction a consistent snapshot of the data, and I admit that I
don't see another obvious way to tackle that problem.  Still, it would
be nice if we had a better idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: shared-memory based stats collector

From
Kyotaro HORIGUCHI
Date:
Hello. Thanks for the comment.

At Mon, 2 Jul 2018 14:25:58 -0400, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+TgmoYQhr30eAcgJCi1v0FhA+3RP1FZVnXqSTLe=6fHy9e5oA@mail.gmail.com>
> On Fri, Jun 29, 2018 at 4:34 AM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > Nowadays PostgreSQL has dynamic shared hash (dshash) so we can
> > use this as the main storage of statistics. We can share data
> > without a stress using this.
> >
> > A PoC previously posted tried to use "locally copied" dshash but
> > it doesn't looks fine so I steered to different direction.
> >
> > With this patch dshash can create a local copy based on dynhash.
> 
> Copying the whole hash table kind of sucks, partly because of the
> time it will take to copy it, but also because it means that memory
> usage is still O(nbackends * ntables).  Without looking at the patch,
> I'm guessing that you're doing that because we need a way to show each
> transaction a consistent snapshot of the data, and I admit that I
> don't see another obvious way to tackle that problem.  Still, it would
> be nice if we had a better idea.

The consistency here means "repeatable read" of an object's stats
entry, not a snapshot covering all objects. Under this definition
we don't need to copy all the entries at once. The attached
version creates a cache entry only for requested objects.

In addition, vacuum doesn't require even repeatable-read
consistency, so it doesn't need cached entries at all. In that
case backend_get_tab_entry now returns an isolated (that is,
not stored in the hash) palloc'ed copy without creating a local
cache entry.

As a result, this version behaves as follows.

- The stats collector stores the results in shared memory.

- In a backend, a cache entry is created only for requested
  objects and lasts for the transaction.

- Vacuum reads the shared stats directly and doesn't create a
  local copy.

The non-behavioral difference from v1 is as follows.

- The snapshot feature of dshash is removed.
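
The behavior listed above can be sketched with a small toy model. This
is only an illustration under stated assumptions, not code from the
patch: StatEntry, shared_stats, fetch_cached, fetch_uncached and
end_transaction are hypothetical names, a plain array stands in for the
dshash table in shared memory, the OID is used directly as an array
index, malloc stands in for palloc, and all locking is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define MAX_OBJECTS 16          /* hypothetical capacity of the toy model */

typedef struct StatEntry
{
    unsigned int oid;
    long         n_scans;
} StatEntry;

/* Stand-in for the shared-memory table that the collector updates. */
static StatEntry shared_stats[MAX_OBJECTS];

/* Backend-local cache: filled lazily, discarded at transaction end. */
static StatEntry local_cache[MAX_OBJECTS];
static bool      cached[MAX_OBJECTS];

/*
 * Cached fetch: the first call for an object copies the shared entry;
 * later calls in the same transaction return that same copy, which
 * gives "repeatable read" per object.
 */
static const StatEntry *
fetch_cached(unsigned int oid)
{
    assert(oid < MAX_OBJECTS);
    if (!cached[oid])
    {
        local_cache[oid] = shared_stats[oid];
        cached[oid] = true;
    }
    return &local_cache[oid];
}

/*
 * Uncached fetch (the vacuum-like path): returns a private heap copy of
 * the current shared entry without touching the cache. Caller frees it.
 */
static StatEntry *
fetch_uncached(unsigned int oid)
{
    StatEntry *copy;

    assert(oid < MAX_OBJECTS);
    copy = malloc(sizeof(StatEntry));
    if (copy != NULL)
        memcpy(copy, &shared_stats[oid], sizeof(StatEntry));
    return copy;
}

/* Drop every cached entry so the next transaction sees fresh values. */
static void
end_transaction(void)
{
    memset(cached, 0, sizeof(cached));
}
```

In this model, memory use in a backend grows only with the number of
objects it actually asked about, rather than with the total number of
tables.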

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 0f29e118092b85882dfa7a89f5de5db35f576ad5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:41:04 +0900
Subject: [PATCH 1/3] sequential scan for dshash

Add sequential scan feature to dshash.
---
 src/backend/lib/dshash.c | 122 +++++++++++++++++++++++++++++++++++++++++++++++
 src/include/lib/dshash.h |  20 +++++++-
 2 files changed, 141 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index b46f7c4cfd..e6c1ef44f1 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -592,6 +592,128 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * Initialize a sequential scan on the hash_table. Allows no modifications
+ * during a scan if consistent = true.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool consistent)
+{
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = ((size_t) 1) << hash_table->control->size_log2;
+    status->curitem = NULL;
+    status->curpartition = -1;
+    status->consistent = consistent;
+
+    /*
+     * Protect all partitions from modification if the caller wants a
+     * consistent result.
+     */
+    if (consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+        {
+            Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i)));
+
+            LWLockAcquire(PARTITION_LOCK(hash_table, i), LW_SHARED);
+        }
+    }
+    ensure_valid_bucket_pointers(hash_table);
+}
+
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    if (status->curitem == NULL)
+    {
+        Assert (status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+        status->hash_table->find_locked = true;
+    }
+    else
+        next_item_pointer = status->curitem->next;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            dshash_seq_release(status);
+            return NULL;
+        }
+        Assert(status->hash_table->find_locked);
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+
+        /*
+         * We need a lock on the partition being scanned even if the caller
+         * didn't request a consistent snapshot.
+         */
+        if (!status->consistent && DsaPointerIsValid(next_item_pointer))
+        {
+            dshash_table_item  *item = dsa_get_address(status->hash_table->area,
+                                                       next_item_pointer);
+            int next_partition = PARTITION_FOR_HASH(item->hash);
+            if (status->curpartition != next_partition)
+            {
+                if (status->curpartition >= 0)
+                    LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                                 status->curpartition));
+                LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                             next_partition),
+                              LW_SHARED);
+                status->curpartition = next_partition;
+            }
+        }
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+void
+dshash_seq_release(dshash_seq_status *status)
+{
+    Assert(status->hash_table->find_locked);
+    status->hash_table->find_locked = false;
+
+    if (status->consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+            LWLockRelease(PARTITION_LOCK(status->hash_table, i));
+    }
+    else if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+int
+dshash_get_num_entries(dshash_table *hash_table)
+{
+    /* A shortcut implementation; should be improved. */
+    dshash_seq_status s;
+    void *p;
+    int n = 0;
+
+    dshash_seq_init(&s, hash_table, false);
+    while ((p = dshash_seq_next(&s)) != NULL)
+        n++;
+
+    return n;
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 8c733bfe25..a4565d3219 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -15,6 +15,7 @@
 #define DSHASH_H
 
 #include "utils/dsa.h"
+#include "utils/hsearch.h"
 
 /* The opaque type representing a hash table. */
 struct dshash_table;
@@ -59,6 +60,18 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+struct dshash_seq_status
+{
+    dshash_table       *hash_table;
+    int                    curbucket;
+    int                    nbuckets;
+    dshash_table_item  *curitem;
+    int                    curpartition;
+    bool                consistent;
+};
+
+typedef struct dshash_seq_status dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
               const dshash_parameters *params,
@@ -70,7 +83,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
 extern void dshash_detach(dshash_table *hash_table);
 extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
 extern void dshash_destroy(dshash_table *hash_table);
-
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
             const void *key, bool exclusive);
@@ -80,6 +92,12 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool consistent);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_release(dshash_seq_status *status);
+extern int dshash_get_num_entries(dshash_table *hash_table);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.16.3
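
The non-consistent scan mode in the patch above walks the buckets in
order and holds at most one partition lock at a time. The toy model
below sketches that idea only; it is not the patch's code. Table,
SeqScan, partition_of and the integer lock flags are hypothetical
stand-ins (real dshash uses LWLocks and dsa pointers), the
bucket-to-partition mapping is an assumption of the toy, and unlike the
patch the sketch takes the lock before reading each bucket head,
including the first one.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PARTITIONS 4    /* hypothetical; stands in for DSHASH_NUM_PARTITIONS */
#define NUM_BUCKETS    8    /* must be a multiple of NUM_PARTITIONS */

typedef struct Item
{
    struct Item *next;
    int          value;
} Item;

typedef struct Table
{
    Item *buckets[NUM_BUCKETS];
    int   lock_held[NUM_PARTITIONS];    /* 1 while "locked" */
} Table;

typedef struct SeqScan
{
    Table *table;
    int    curbucket;       /* bucket being scanned, -1 before the start */
    Item  *curitem;         /* last item returned, NULL before the start */
    int    curpartition;    /* partition currently locked, -1 if none */
} SeqScan;

/* Toy mapping: contiguous runs of buckets share one partition. */
static int
partition_of(int bucket)
{
    return bucket / (NUM_BUCKETS / NUM_PARTITIONS);
}

/* Push an item onto the front of a bucket's chain. */
static void
table_insert(Table *t, int bucket, Item *item)
{
    assert(bucket >= 0 && bucket < NUM_BUCKETS);
    item->next = t->buckets[bucket];
    t->buckets[bucket] = item;
}

static void
seq_init(SeqScan *s, Table *t)
{
    s->table = t;
    s->curbucket = -1;
    s->curitem = NULL;
    s->curpartition = -1;
}

/* Release the partition lock held by the scan, if any. */
static void
seq_release(SeqScan *s)
{
    if (s->curpartition >= 0)
    {
        s->table->lock_held[s->curpartition] = 0;
        s->curpartition = -1;
    }
}

/* Return the next item, or NULL (after releasing the lock) when done. */
static Item *
seq_next(SeqScan *s)
{
    Item *next = s->curitem ? s->curitem->next : NULL;

    /* Advance bucket by bucket until a non-empty chain is found. */
    while (next == NULL)
    {
        int part;

        if (++s->curbucket >= NUM_BUCKETS)
        {
            seq_release(s);
            return NULL;
        }

        /* Swap locks only when the scan crosses a partition boundary. */
        part = partition_of(s->curbucket);
        if (part != s->curpartition)
        {
            seq_release(s);
            assert(!s->table->lock_held[part]);
            s->table->lock_held[part] = 1;
            s->curpartition = part;
        }
        next = s->table->buckets[s->curbucket];
    }

    s->curitem = next;
    return next;
}
```

A caller that stops early would call seq_release itself, in the same way
that dshash_seq_release is exposed in the header.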

From 29ad0c5dda308eda447fa48eb920131cd818a1e4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:58:32 +0900
Subject: [PATCH 2/3] Change stats collector to an auxiliary process.

Shared memory and LWLocks are required to let stats collector use
dshash. This patch makes stats collector an auxiliary process.
---
 src/backend/bootstrap/bootstrap.c   |  8 +++++
 src/backend/postmaster/pgstat.c     | 58 +++++++++++++++++++++++++------------
 src/backend/postmaster/postmaster.c | 24 +++++++++------
 src/include/miscadmin.h             |  3 +-
 src/include/pgstat.h                |  6 +++-
 5 files changed, 70 insertions(+), 29 deletions(-)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 7e34bee63e..8f8327495a 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -335,6 +335,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case WalReceiverProcess:
                 statmsg = pgstat_get_backend_desc(B_WAL_RECEIVER);
                 break;
+            case StatsCollectorProcess:
+                statmsg = pgstat_get_backend_desc(B_STATS_COLLECTOR);
+                break;
             default:
                 statmsg = "??? process";
                 break;
@@ -462,6 +465,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
             WalReceiverMain();
             proc_exit(1);        /* should never return */
 
+        case StatsCollectorProcess:
+            /* don't set signals, stats collector has its own agenda */
+            PgstatCollectorMain();
+            proc_exit(1);        /* should never return */
+
         default:
             elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
             proc_exit(1);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index bbe73618c7..c09d837afd 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -267,6 +267,7 @@ static List *pending_write_requests = NIL;
 /* Signal handler flags */
 static volatile bool need_exit = false;
 static volatile bool got_SIGHUP = false;
+static volatile bool got_SIGTERM = false;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -284,8 +285,8 @@ static instr_time total_func_time;
 static pid_t pgstat_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgstat_exit(SIGNAL_ARGS);
+static void pgstat_shutdown_handler(SIGNAL_ARGS);
+static void pgstat_quickdie_handler(SIGNAL_ARGS);
 static void pgstat_beshutdown_hook(int code, Datum arg);
 static void pgstat_sighup_handler(SIGNAL_ARGS);
 
@@ -770,11 +771,7 @@ pgstat_start(void)
             /* Close the postmaster's sockets */
             ClosePostmasterPorts(false);
 
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
+            PgstatCollectorMain();
             break;
 #endif
 
@@ -2870,6 +2867,9 @@ pgstat_bestart(void)
             case WalReceiverProcess:
                 beentry->st_backendType = B_WAL_RECEIVER;
                 break;
+            case StatsCollectorProcess:
+                beentry->st_backendType = B_STATS_COLLECTOR;
+                break;
             default:
                 elog(FATAL, "unrecognized process type: %d",
                      (int) MyAuxProcType);
@@ -4135,6 +4135,9 @@ pgstat_get_backend_desc(BackendType backendType)
         case B_WAL_WRITER:
             backendDesc = "walwriter";
             break;
+        case B_STATS_COLLECTOR:
+            backendDesc = "stats collector";
+            break;
     }
 
     return backendDesc;
@@ -4252,8 +4255,8 @@ pgstat_send_bgwriter(void)
  *    The argc/argv parameters are valid only in EXEC_BACKEND case.
  * ----------
  */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
+void
+PgstatCollectorMain(void)
 {
     int            len;
     PgStat_Msg    msg;
@@ -4266,8 +4269,8 @@ PgstatCollectorMain(int argc, char *argv[])
      */
     pqsignal(SIGHUP, pgstat_sighup_handler);
     pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, pgstat_exit);
+    pqsignal(SIGTERM, pgstat_shutdown_handler);
+    pqsignal(SIGQUIT, pgstat_quickdie_handler);
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
     pqsignal(SIGUSR1, SIG_IGN);
@@ -4312,14 +4315,14 @@ PgstatCollectorMain(int argc, char *argv[])
         /*
          * Quit if we get SIGQUIT from the postmaster.
          */
-        if (need_exit)
+        if (got_SIGTERM)
             break;
 
         /*
          * Inner loop iterates as long as we keep getting messages, or until
          * need_exit becomes set.
          */
-        while (!need_exit)
+        while (!got_SIGTERM)
         {
             /*
              * Reload configuration if we got SIGHUP from the postmaster.
@@ -4507,14 +4510,21 @@ PgstatCollectorMain(int argc, char *argv[])
 
 /* SIGQUIT signal handler for collector process */
 static void
-pgstat_exit(SIGNAL_ARGS)
+pgstat_quickdie_handler(SIGNAL_ARGS)
 {
-    int            save_errno = errno;
+    PG_SETMASK(&BlockSig);
 
-    need_exit = true;
-    SetLatch(MyLatch);
+    /*
+     * We DO NOT want to run proc_exit() callbacks -- we're here because
+     * shared memory may be corrupted, so we don't want to try to clean up our
+     * transaction.  Just nail the windows shut and get out of town.  Now that
+     * there's an atexit callback to prevent third-party code from breaking
+     * things by calling exit() directly, we have to reset the callbacks
+     * explicitly to make this work as intended.
+     */
+    on_exit_reset();
 
-    errno = save_errno;
+    exit(2);
 }
 
 /* SIGHUP handler for collector process */
@@ -4529,6 +4539,18 @@ pgstat_sighup_handler(SIGNAL_ARGS)
     errno = save_errno;
 }
 
+static void
+pgstat_shutdown_handler(SIGNAL_ARGS)
+{
+    int save_errno = errno;
+
+    got_SIGTERM = true;
+
+    SetLatch(MyLatch);
+
+    errno = save_errno;
+}
+
 /*
  * Subroutine to clear stats in a database entry
  *
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a4b53b33cd..a6209cd749 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -145,7 +145,8 @@
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
 #define BACKEND_TYPE_BGWORKER    0x0008    /* bgworker process */
-#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
+#define BACKEND_TYPE_STATS        0x0010    /* stats collector process */
+#define BACKEND_TYPE_ALL        0x001F    /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -551,6 +552,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
 #define StartWalReceiver()        StartChildProcess(WalReceiverProcess)
+#define StartStatsCollector()    StartChildProcess(StatsCollectorProcess)
 
 /* Macros to check exit status of a child process */
 #define EXIT_STATUS_0(st)  ((st) == 0)
@@ -1760,7 +1762,7 @@ ServerLoop(void)
         /* If we have lost the stats collector, try to start a new one */
         if (PgStatPID == 0 &&
             (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
+            PgStatPID = StartStatsCollector();
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
@@ -2878,7 +2880,7 @@ reaper(SIGNAL_ARGS)
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = pgarch_start();
             if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
+                PgStatPID = StartStatsCollector();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -2951,7 +2953,7 @@ reaper(SIGNAL_ARGS)
                  * nothing left for it to do.
                  */
                 if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
+                    signal_child(PgStatPID, SIGTERM);
             }
             else
             {
@@ -3037,10 +3039,10 @@ reaper(SIGNAL_ARGS)
         {
             PgStatPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
+                HandleChildCrash(pid, exitstatus,
+                                 _("statistics collector process"));
             if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
+                PgStatPID = StartStatsCollector();
             continue;
         }
 
@@ -3270,7 +3272,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, stats collector or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -5071,7 +5073,7 @@ sigusr1_handler(SIGNAL_ARGS)
          * Likewise, start other special children as needed.
          */
         Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
+        PgStatPID = StartStatsCollector();
 
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
@@ -5361,6 +5363,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork WAL receiver process: %m")));
                 break;
+            case StatsCollectorProcess:
+                ereport(LOG,
+                        (errmsg("could not fork stats collector process: %m")));
+                break;
             default:
                 ereport(LOG,
                         (errmsg("could not fork process: %m")));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index e167ee8fcb..53b260cb1f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -400,7 +400,7 @@ typedef enum
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
-
+    StatsCollectorProcess,
     NUM_AUXPROCTYPES            /* Must be last! */
 } AuxProcType;
 
@@ -412,6 +412,7 @@ extern AuxProcType MyAuxProcType;
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
+#define AmStatsCollectorProcess()    (MyAuxProcType == StatsCollectorProcess)
 
 
 /*****************************************************************************
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index d59c24ae23..87957e093a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -710,7 +710,8 @@ typedef enum BackendType
     B_STARTUP,
     B_WAL_RECEIVER,
     B_WAL_SENDER,
-    B_WAL_WRITER
+    B_WAL_WRITER,
+    B_STATS_COLLECTOR
 } BackendType;
 
 
@@ -1353,4 +1354,7 @@ extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
 
+/* Main loop */
+extern void PgstatCollectorMain(void);
+
 #endif                            /* PGSTAT_H */
-- 
2.16.3
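
Note on the patch above: it replaces the collector's old need_exit flag with a SIGTERM handler that only sets got_SIGTERM and wakes the main loop, so the loop can exit and write the final stats itself. The following is a minimal standalone sketch of that deferred-shutdown pattern in plain POSIX C, not the patch's code. The names shutdown_requested, shutdown_handler and collector_loop are hypothetical, and the latch wakeup (SetLatch(MyLatch) in the patch) is left out because it depends on PostgreSQL internals.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>

/* Set by the handler, polled by the main loop (plays the role of got_SIGTERM). */
static volatile sig_atomic_t shutdown_requested = 0;

/* The handler does the minimum: record the request and preserve errno. */
static void
shutdown_handler(int signo)
{
    int save_errno = errno;

    (void) signo;
    shutdown_requested = 1;
    /* The patch additionally calls SetLatch(MyLatch) here to wake the loop. */

    errno = save_errno;
}

/*
 * Hypothetical stand-in for the collector loop.  It handles "messages" until
 * a shutdown is requested and returns how many it handled, so the caller can
 * do its final cleanup afterwards (the patch writes the stats files there).
 * raise_after simulates the postmaster sending SIGTERM after that many
 * iterations.  Returns -1 if the handler could not be installed.
 */
static int
collector_loop(int raise_after)
{
    int processed = 0;

    shutdown_requested = 0;
    if (signal(SIGTERM, shutdown_handler) == SIG_ERR)
        return -1;

    while (!shutdown_requested)
    {
        processed++;
        if (processed == raise_after)
            raise(SIGTERM);     /* handler runs before raise() returns */
    }

    return processed;
}
```

Calling collector_loop(3) from a test program runs three iterations and then leaves the loop normally, which is the point of the design: the signal handler never exits the process itself, unlike the SIGQUIT path in the patch, which calls exit(2) directly.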

From 29db0930b04f2c2728b779d41873f15a413c2c6b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 17:05:46 +0900
Subject: [PATCH 3/3] dshash-based stats collector

The stats collector no longer uses files to distribute statistics. They
are now stored in a dynamic shared hash (dshash). Stats entries are
cached one by one to give a consistent snapshot during a transaction. On
the other hand, vacuum no longer takes a complete cache of stats.
---
 src/backend/postmaster/autovacuum.c           |   11 +-
 src/backend/postmaster/pgstat.c               | 1499 ++++++++++++-------------
 src/backend/replication/basebackup.c          |   36 -
 src/backend/storage/ipc/ipci.c                |    2 +
 src/backend/storage/lmgr/lwlock.c             |    3 +
 src/backend/storage/lmgr/lwlocknames.txt      |    1 +
 src/backend/utils/misc/guc.c                  |   41 -
 src/backend/utils/misc/postgresql.conf.sample |    1 -
 src/bin/initdb/initdb.c                       |    1 -
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |    2 +-
 src/include/pgstat.h                          |   51 +-
 src/include/storage/lwlock.h                  |    3 +
 12 files changed, 756 insertions(+), 895 deletions(-)
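
Note on "cached one by one": the sketch below is one way to read that sentence together with the call sites in this patch, where autovacuum passes oneshot = true and frees the result while the pgstat_fetch_* functions pass false. It is an illustration under those assumptions, not the patch's implementation. The types and names (StatEntry, shared_stats, get_entry, end_of_transaction) are hypothetical, plain arrays stand in for dshash and dynahash, and the locking that real shared memory needs is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct StatEntry
{
    unsigned int id;        /* stands in for a table or function OID */
    long         counter;
} StatEntry;

#define NSHARED 4
#define NCACHE  4

/* Stand-in for the shared hash that the collector keeps updating. */
static StatEntry shared_stats[NSHARED] = {{1, 10}, {2, 20}, {3, 30}, {4, 40}};

/* Backend-local cache, filled lazily and emptied at transaction end. */
static StatEntry cache[NCACHE];
static size_t    cache_used = 0;

static StatEntry *
find_entry(StatEntry *entries, size_t n, unsigned int id)
{
    for (size_t i = 0; i < n; i++)
        if (entries[i].id == id)
            return &entries[i];
    return NULL;
}

/*
 * Return a private copy of the entry, or NULL if it does not exist.
 * oneshot = 0: copy into the local cache on first access and return the
 *              cached copy afterwards, so repeated reads agree.
 * oneshot = 1: bypass the cache and return a malloc'd copy that the
 *              caller must free.
 */
static StatEntry *
get_entry(unsigned int id, int oneshot)
{
    StatEntry *src;
    StatEntry *dst;

    if (!oneshot && (dst = find_entry(cache, cache_used, id)) != NULL)
        return dst;

    src = find_entry(shared_stats, NSHARED, id);
    if (src == NULL)
        return NULL;

    if (oneshot)
    {
        dst = malloc(sizeof(StatEntry));
        if (dst == NULL)
            return NULL;
    }
    else
    {
        if (cache_used == NCACHE)
            return NULL;        /* cache full; a real hash table would grow */
        dst = &cache[cache_used++];
    }

    memcpy(dst, src, sizeof(StatEntry));
    return dst;
}

/* Dropping the cache lets the next transaction see fresh values. */
static void
end_of_transaction(void)
{
    cache_used = 0;
}
```

In this sketch a cached read keeps returning the value seen at first access even if the shared entry changes later, while a oneshot read always reflects the current shared value. That matches the trade-off the commit message describes: regular readers get a stable view per transaction, and vacuum avoids building a full cache.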

diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 02e6d8131e..74e3ab6167 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2114,6 +2114,8 @@ do_autovacuum(void)
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
                                   &dovacuum, &doanalyze, &wraparound);
+        if (tabentry)
+            pfree(tabentry);
 
         /* Relations that need work are added to table_oids */
         if (dovacuum || doanalyze)
@@ -2193,10 +2195,11 @@ do_autovacuum(void)
         /* Fetch the pgstat entry for this table */
         tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
                                              shared, dbentry);
-
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
                                   &dovacuum, &doanalyze, &wraparound);
+        if (tabentry)
+            pfree(tabentry);
 
         /* ignore analyze for toast tables */
         if (dovacuum)
@@ -2768,12 +2771,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
     if (isshared)
     {
         if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
+            tabentry = backend_get_tab_entry(shared, relid, true);
     }
     else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
+        tabentry = backend_get_tab_entry(dbentry, relid, true);
 
     return tabentry;
 }
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index c09d837afd..40b252540f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -77,22 +77,10 @@
 #define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
                                      * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
 #define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
                                      * failed statistics collector; in
                                      * seconds. */
 
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
 /* Minimum receive buffer size for the collector's socket. */
 #define PGSTAT_MIN_RCVBUF        (100 * 1024)
 
@@ -101,7 +89,6 @@
  * The initial size hints for the hash tables used in the collector.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
 #define PGSTAT_TAB_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
@@ -127,14 +114,6 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
- */
-char       *pgstat_stat_directory = NULL;
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
-
 /*
  * BgWriter global statistics counters (unused in other processes).
  * Stored directly in a stats message structure so it can be sent
@@ -154,6 +133,42 @@ static time_t last_pgstat_start_time;
 
 static bool pgStatRunningInCollector = false;
 
+/* Shared stats bootstrap information */
+typedef struct StatsShmemStruct {
+    dsa_handle stats_dsa_handle;
+    dshash_table_handle db_stats_handle;
+    dsa_pointer    global_stats;
+    dsa_pointer    archiver_stats;
+} StatsShmemStruct;
+
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
+static dshash_table *db_stats;
+static HTAB *snapshot_db_stats;
+
+/* dshash parameter for each type of table */
+static const dshash_parameters dsh_dbparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatDBEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_DB
+};
+static const dshash_parameters dsh_tblparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatTabEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_FUNC_TABLE
+};
+static const dshash_parameters dsh_funcparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatFuncEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_FUNC_TABLE
+};
+
 /*
  * Structures in which backends store per-table info that's waiting to be
  * sent to the collector.
@@ -250,12 +265,16 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Cluster wide statistics.
+ * Contains statistics that are not collected per database or per table.
+ * shared_* are the statistics maintained by pgstats and snapshot_* are the
+ * snapshots taken on backends.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats *snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats *snapshot_globalStats;
+
 
 /*
  * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
@@ -285,23 +304,23 @@ static instr_time total_func_time;
 static pid_t pgstat_forkexec(void);
 #endif
 
+/* functions used in stats collector */
 static void pgstat_shutdown_handler(SIGNAL_ARGS);
 static void pgstat_quickdie_handler(SIGNAL_ARGS);
 static void pgstat_beshutdown_hook(int code, Datum arg);
 static void pgstat_sighup_handler(SIGNAL_ARGS);
 
 static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                     Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
-static void pgstat_read_current_status(void);
+static PgStat_StatTabEntry *pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create);
+static void pgstat_write_statsfiles(void);
+static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_statsfiles(void);
+static void pgstat_read_db_statsfile(Oid databaseid, dshash_table *tabhash, dshash_table *funchash);
 
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
+/* functions used in backends */
+static bool backend_snapshot_database_stats(void);
+static PgStat_StatFuncEntry *backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot);
+static void pgstat_read_current_status(void);
 
 static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
 static void pgstat_send_funcstats(void);
@@ -320,7 +339,6 @@ static const char *pgstat_get_wait_io(WaitEventIO w);
 static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
 static void pgstat_send(void *msg, int len);
 
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
 static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
 static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
 static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
@@ -685,7 +703,6 @@ pgstat_reset_remove_files(const char *directory)
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
     pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
 }
 
@@ -1009,6 +1026,81 @@ pgstat_send_funcstats(void)
 }
 
 
+/* ----------
+ * pgstat_attach_shared_stats() -
+ *
+ *    attach existing shared stats memory
+ * ----------
+ */
+static bool
+pgstat_attach_shared_stats(void)
+{
+    MemoryContext oldcontext;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID || area != NULL)
+    {
+        LWLockRelease(StatsLock);
+        return area != NULL;
+    }
+
+    /* top-level variables; they live for the lifetime of the process */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    area = dsa_attach(StatsShmem->stats_dsa_handle);
+    dsa_pin_mapping(area);
+    db_stats = dshash_attach(area, &dsh_dbparams,
+                             StatsShmem->db_stats_handle, 0);
+    snapshot_db_stats = NULL;
+    shared_globalStats = (PgStat_GlobalStats *)
+        dsa_get_address(area, StatsShmem->global_stats);
+    shared_archiverStats =    (PgStat_ArchiverStats *)
+        dsa_get_address(area, StatsShmem->archiver_stats);
+    MemoryContextSwitchTo(oldcontext);
+    LWLockRelease(StatsLock);
+
+    return true;
+}
+
+/* ----------
+ * pgstat_create_shared_stats() -
+ *
+ *    create shared stats memory
+ * ----------
+ */
+static void
+pgstat_create_shared_stats(void)
+{
+    MemoryContext oldcontext;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
+
+    /* lives for the lifetime of the process */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    area = dsa_create(LWTRANCHE_STATS_DSA);
+    dsa_pin_mapping(area);
+
+    db_stats = dshash_create(area, &dsh_dbparams, 0);
+
+    /* create shared area and write bootstrap information */
+    StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+    StatsShmem->global_stats =
+        dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+    StatsShmem->archiver_stats =
+        dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+    StatsShmem->db_stats_handle =
+        dshash_get_hash_table_handle(db_stats);
+
+    /* connect to the memory */
+    snapshot_db_stats = NULL;
+    shared_globalStats = (PgStat_GlobalStats *)
+        dsa_get_address(area, StatsShmem->global_stats);
+    shared_archiverStats = (PgStat_ArchiverStats *)
+        dsa_get_address(area, StatsShmem->archiver_stats);
+    MemoryContextSwitchTo(oldcontext);
+    LWLockRelease(StatsLock);
+}
+
 /* ----------
  * pgstat_vacuum_stat() -
  *
@@ -1022,6 +1114,8 @@ pgstat_vacuum_stat(void)
     PgStat_MsgTabpurge msg;
     PgStat_MsgFuncpurge f_msg;
     HASH_SEQ_STATUS hstat;
+    dshash_table *dshtable;
+    dshash_seq_status dshstat;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatFuncEntry *funcentry;
@@ -1030,11 +1124,9 @@ pgstat_vacuum_stat(void)
     if (pgStatSock == PGINVALID_SOCKET)
         return;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* If not done for this transaction, take a snapshot of stats */
+    if (!backend_snapshot_database_stats())
+        return;
 
     /*
      * Read pg_database and make a list of OIDs of all existing databases
@@ -1045,7 +1137,7 @@ pgstat_vacuum_stat(void)
      * Search the database hash table for dead databases and tell the
      * collector to drop them.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
+    hash_seq_init(&hstat, snapshot_db_stats);
     while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
     {
         Oid            dbid = dbentry->databaseid;
@@ -1064,12 +1156,12 @@ pgstat_vacuum_stat(void)
     /*
      * Lookup our own database entry; if not found, nothing more to do.
      */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
+    dbentry = (PgStat_StatDBEntry *) hash_search(snapshot_db_stats,
                                                  (void *) &MyDatabaseId,
                                                  HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
+    if (dbentry == NULL)
         return;
-
+
     /*
      * Similarly to above, make a list of all known relations in this DB.
      */
@@ -1082,9 +1174,11 @@ pgstat_vacuum_stat(void)
 
     /*
      * Check for all tables listed in stats hashtable if they still exist.
+     * Stats cache is useless here so directly search the shared hash.
      */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+    dshtable = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&dshstat, dshtable, false);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            tabid = tabentry->tableid;
 
@@ -1113,6 +1207,7 @@ pgstat_vacuum_stat(void)
             msg.m_nentries = 0;
         }
     }
+    dshash_detach(dshtable);
 
     /*
      * Send the rest
@@ -1134,8 +1229,8 @@ pgstat_vacuum_stat(void)
      * Now repeat the above steps for functions.  However, we needn't bother
      * in the common case where no function stats are being collected.
      */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
+    dshtable = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+    if (dshash_get_num_entries(dshtable) > 0)
     {
         htab = pgstat_collect_oids(ProcedureRelationId);
 
@@ -1143,8 +1238,8 @@ pgstat_vacuum_stat(void)
         f_msg.m_databaseid = MyDatabaseId;
         f_msg.m_nentries = 0;
 
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
+        dshash_seq_init(&dshstat, dshtable, false);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&dshstat)) != NULL)
         {
             Oid            funcid = funcentry->functionid;
 
@@ -1185,6 +1280,7 @@ pgstat_vacuum_stat(void)
 
         hash_destroy(htab);
     }
+    dshash_detach(dshtable);
 }
 
 
@@ -1551,24 +1647,6 @@ pgstat_ping(void)
     pgstat_send(&msg, sizeof(msg));
 }
 
-/* ----------
- * pgstat_send_inquiry() -
- *
- *    Notify collector that we need fresh data.
- * ----------
- */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
 
 /*
  * Initialize function call usage data.
@@ -2383,18 +2461,15 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    if (!backend_snapshot_database_stats())
+        return NULL;
 
     /*
      * Lookup the requested database; return NULL if not found
      */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    return (PgStat_StatDBEntry *)hash_search(snapshot_db_stats,
+                                             &dbid, HASH_FIND, NULL);
 }
 
 
@@ -2410,47 +2485,28 @@ pgstat_fetch_stat_dbentry(Oid dbid)
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* Lookup our database, then look in its table hash table. */
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    if (dbentry == NULL)
+        return NULL;
 
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = backend_get_tab_entry(dbentry, relid, false);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    dbentry = pgstat_fetch_stat_dbentry(InvalidOid);
+    if (dbentry == NULL)
+        return NULL;
+
+    tabentry = backend_get_tab_entry(dbentry, relid, false);
+    if (tabentry != NULL)
+        return tabentry;
 
     return NULL;
 }
@@ -2469,17 +2525,18 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     PgStat_StatDBEntry *dbentry;
     PgStat_StatFuncEntry *funcentry = NULL;
 
-    /* load the stats file if needed */
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    if (!backend_snapshot_database_stats())
+        return NULL;
 
-    /* Lookup our database, then find the requested function.  */
+    /* Lookup our database, then find the requested function */
     dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
+    if (dbentry == NULL)
+        return NULL;
+
+    funcentry = backend_get_func_etnry(dbentry, func_id, false);
+    if (funcentry == NULL)
+        return NULL;
 
     return funcentry;
 }
@@ -2555,9 +2612,11 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    if (!backend_snapshot_database_stats())
+        return NULL;
 
-    return &archiverStats;
+    return snapshot_archiverStats;
 }
 
 
@@ -2572,9 +2631,11 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    if (!backend_snapshot_database_stats())
+        return NULL;
 
-    return &globalStats;
+    return snapshot_globalStats;
 }
 
 
@@ -4280,18 +4341,14 @@ PgstatCollectorMain(void)
     pqsignal(SIGTTOU, SIG_DFL);
     pqsignal(SIGCONT, SIG_DFL);
     pqsignal(SIGWINCH, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
 
-    /*
-     * Identify myself via ps
-     */
-    init_ps_display("stats collector", "", "", "");
+    PG_SETMASK(&UnBlockSig);
 
     /*
      * Read in existing stats files or initialize the stats to zero.
      */
     pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
+    pgstat_read_statsfiles();
 
     /*
      * Loop to process messages until we get SIGQUIT or detect ungraceful
@@ -4333,13 +4390,6 @@ PgstatCollectorMain(void)
                 ProcessConfigFile(PGC_SIGHUP);
             }
 
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
             /*
              * Try to receive and process a message.  This will not block,
              * since the socket is set to non-blocking mode.
@@ -4388,10 +4438,6 @@ PgstatCollectorMain(void)
                 case PGSTAT_MTYPE_DUMMY:
                     break;
 
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry((PgStat_MsgInquiry *) &msg, len);
-                    break;
-
                 case PGSTAT_MTYPE_TABSTAT:
                     pgstat_recv_tabstat((PgStat_MsgTabstat *) &msg, len);
                     break;
@@ -4480,9 +4526,7 @@ PgstatCollectorMain(void)
          * fixes that, so don't sleep indefinitely.  This is a crock of the
          * first water, but until somebody wants to debug exactly what's
          * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
+         * timeout matches our pre-9.2 behavior.
          */
         wr = WaitLatchOrSocket(MyLatch,
                                WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
@@ -4502,7 +4546,7 @@ PgstatCollectorMain(void)
     /*
      * Save the final stats to reuse at next startup.
      */
-    pgstat_write_statsfiles(true, true);
+    pgstat_write_statsfiles();
 
     exit(0);
 }
@@ -4552,14 +4596,14 @@ pgstat_shutdown_handler(SIGNAL_ARGS)
 }
 
 /*
- * Subroutine to clear stats in a database entry
+ * Subroutine to reset stats in a shared database entry
  *
  * Tables and functions hashes are initialized to empty.
  */
 static void
 reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
 {
-    HASHCTL        hash_ctl;
+    dshash_table *tbl;
 
     dbentry->n_xact_commit = 0;
     dbentry->n_xact_rollback = 0;
@@ -4585,20 +4629,14 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
     dbentry->stat_reset_timestamp = GetCurrentTimestamp();
     dbentry->stats_timestamp = 0;
 
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
 
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
+    tbl = dshash_create(area, &dsh_tblparams, 0);
+    dbentry->tables = dshash_get_hash_table_handle(tbl);
+    dshash_detach(tbl);
+
+    tbl = dshash_create(area, &dsh_funcparams, 0);
+    dbentry->functions = dshash_get_hash_table_handle(tbl);
+    dshash_detach(tbl);
 }
 
 /*
@@ -4611,15 +4649,18 @@ pgstat_get_db_entry(Oid databaseid, bool create)
 {
     PgStat_StatDBEntry *result;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
+
+    Assert(pgStatRunningInCollector);
 
     /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
+    if (create)
+        result = (PgStat_StatDBEntry *)
+            dshash_find_or_insert(db_stats, &databaseid, &found);
+    else
+        result = (PgStat_StatDBEntry *) dshash_find(db_stats, &databaseid, true);
 
-    if (!create && !found)
-        return NULL;
+    if (!create)
+        return result;
 
     /*
      * If not found, initialize the new one.  This creates empty hash tables
@@ -4631,23 +4672,23 @@ pgstat_get_db_entry(Oid databaseid, bool create)
     return result;
 }
 
-
 /*
  * Lookup the hash table entry for the specified table. If no hash
  * table entry exists, initialize it, if the create parameter is true.
  * Else, return NULL.
  */
 static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create)
 {
     PgStat_StatTabEntry *result;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
 
     /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
+    if (create)
+        result = (PgStat_StatTabEntry *)
+            dshash_find_or_insert(table, &tableoid, &found);
+    else
+        result = (PgStat_StatTabEntry *) dshash_find(table, &tableoid, false);
 
     if (!create && !found)
         return NULL;
@@ -4685,25 +4726,20 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
  * pgstat_write_statsfiles() -
  *        Write the global statistics file, as well as requested DB files.
  *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
  *    When 'allDbs' is false, only the requested databases (listed in
  *    pending_write_requests) will be written; otherwise, all databases
  *    will be written.
  * ----------
  */
 static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+pgstat_write_statsfiles(void)
 {
-    HASH_SEQ_STATUS hstat;
+    dshash_seq_status hstat;
     PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
@@ -4724,7 +4760,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
 
     /*
      * Write the file header --- currently just a format ID.
@@ -4736,32 +4772,29 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Write global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write archiver stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Walk through the database table.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_init(&hstat, db_stats, false);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&hstat)) != NULL)
     {
         /*
          * Write out the table and function stats for this DB into the
          * appropriate per-DB stat file, if required.
          */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
+        pgstat_write_db_statsfile(dbentry);
 
         /*
          * Write out the DB entry. We don't write the tables or functions
@@ -4805,9 +4838,6 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
         unlink(tmpfile);
     }
 
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
     /*
      * Now throw away the list of requests.  Note that requests sent after we
      * started the write are still waiting on the network socket.
@@ -4821,15 +4851,14 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
  * of length len.
  */
 static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
+get_dbstat_filename(bool tempname, Oid databaseid,
                     char *filename, int len)
 {
     int            printed;
 
     /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
     printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
                        databaseid,
                        tempname ? "tmp" : "stat");
     if (printed > len)
@@ -4847,10 +4876,10 @@ get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
  * ----------
  */
 static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
+pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry)
 {
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
+    dshash_seq_status tstat;
+    dshash_seq_status fstat;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatFuncEntry *funcentry;
     FILE       *fpout;
@@ -4859,9 +4888,10 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     int            rc;
     char        tmpfile[MAXPGPATH];
     char        statfile[MAXPGPATH];
+    dshash_table *tbl;
 
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -4888,24 +4918,28 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     /*
      * Walk through the database's access stats per table.
      */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&tstat, tbl, false);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&tstat)) != NULL)
     {
         fputc('T', fpout);
         rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_detach(tbl);
 
     /*
      * Walk through the database's function stats table.
      */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+    tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+    dshash_seq_init(&fstat, tbl, false);
+    while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&fstat)) != NULL)
     {
         fputc('F', fpout);
         rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_detach(tbl);
 
     /*
      * No more output to be done. Close the temp file and replace the old
@@ -4939,76 +4973,45 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
 }
 
 /* ----------
  * pgstat_read_statsfiles() -
  *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ *    Reads in some existing statistics collector files into the shared stats
+ *    hash.
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+static void
+pgstat_read_statsfiles(void)
 {
     PgStat_StatDBEntry *dbentry;
     PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    dshash_table *tblstats = NULL;
+    dshash_table *funcstats = NULL;
 
+    Assert(pgStatRunningInCollector);
     /*
      * The tables will live in pgStatLocalContext.
      */
     pgstat_setup_memcxt();
 
     /*
-     * Create the DB hashtable
+     * Create the DB hashtable and global stats area
      */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
+    pgstat_create_shared_stats();
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp = shared_globalStats->stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
@@ -5026,7 +5029,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -5043,11 +5046,11 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     /*
      * Read global stats struct
      */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) != sizeof(*shared_globalStats))
     {
         ereport(pgStatRunningInCollector ? LOG : WARNING,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
+        memset(shared_globalStats, 0, sizeof(*shared_globalStats));
         goto done;
     }
 
@@ -5058,17 +5061,16 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
      * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
      * an unusual scenario.
      */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
+    shared_globalStats->stats_timestamp = 0;
 
     /*
      * Read archiver stats struct
      */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) != sizeof(*shared_archiverStats))
     {
         ereport(pgStatRunningInCollector ? LOG : WARNING,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
+        memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
         goto done;
     }
 
@@ -5097,12 +5099,12 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 /*
                  * Add to the DB hash
                  */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
+                dbentry = (PgStat_StatDBEntry *)
+                    dshash_find_or_insert(db_stats, (void *) &dbbuf.databaseid,
+                                          &found);
                 if (found)
                 {
+                    dshash_release_lock(db_stats, dbentry);
                     ereport(pgStatRunningInCollector ? LOG : WARNING,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
@@ -5110,8 +5112,8 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 }
 
                 memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
+                dbentry->tables = DSM_HANDLE_INVALID;
+                dbentry->functions = DSM_HANDLE_INVALID;
 
                 /*
                  * In the collector, disregard the timestamp we read from the
@@ -5119,47 +5121,23 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                  * stats file immediately upon the first request from any
                  * backend.
                  */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+                Assert(pgStatRunningInCollector);
+                dbentry->stats_timestamp = 0;
 
                 /*
                  * If requested, read the data from the database-specific
                  * file.  Otherwise we just leave the hashtables empty.
                  */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
+                tblstats = dshash_create(area, &dsh_tblparams, 0);
+                dbentry->tables = dshash_get_hash_table_handle(tblstats);
+                funcstats = dshash_create(area, &dsh_funcparams, 0);
+                dbentry->functions =
+                    dshash_get_hash_table_handle(funcstats);
+                dshash_release_lock(db_stats, dbentry);
+                pgstat_read_db_statsfile(dbentry->databaseid,
+                                         tblstats, funcstats);
+                dshash_detach(tblstats);
+                dshash_detach(funcstats);
                 break;
 
             case 'E':
@@ -5176,34 +5154,47 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 done:
     FreeFile(fpin);
 
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 
-    return dbhash;
+    return;
 }
 
 
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+void
+StatsShmemInit(void)
+{
+    bool    found;
+
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+    }
+    else
+        Assert(found);
+}
+
 /* ----------
  * pgstat_read_db_statsfile() -
  *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
+ *    Reads in the permanent statistics collector file and creates shared
+ *    statistics tables. The file is removed after reading.
  * ----------
  */
 static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
+pgstat_read_db_statsfile(Oid databaseid,
+                         dshash_table *tabhash, dshash_table *funchash)
 {
     PgStat_StatTabEntry *tabentry;
     PgStat_StatTabEntry tabbuf;
@@ -5214,7 +5205,8 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     bool        found;
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
+    Assert(pgStatRunningInCollector);
+    get_dbstat_filename(false, databaseid, statfile, MAXPGPATH);
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
@@ -5273,12 +5265,13 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (tabhash == NULL)
                     break;
 
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
+                tabentry = (PgStat_StatTabEntry *)
+                    dshash_find_or_insert(tabhash,
+                                          (void *) &tabbuf.tableid, &found);
 
                 if (found)
                 {
+                    dshash_release_lock(tabhash, tabentry);
                     ereport(pgStatRunningInCollector ? LOG : WARNING,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
@@ -5286,6 +5279,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 }
 
                 memcpy(tabentry, &tabbuf, sizeof(tabbuf));
+                dshash_release_lock(tabhash, tabentry);
                 break;
 
                 /*
@@ -5307,9 +5301,9 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (funchash == NULL)
                     break;
 
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
+                funcentry = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert(funchash,
+                                          (void *) &funcbuf.functionid, &found);
 
                 if (found)
                 {
@@ -5320,6 +5314,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 }
 
                 memcpy(funcentry, &funcbuf, sizeof(funcbuf));
+                dshash_release_lock(funchash, funcentry);
                 break;
 
                 /*
@@ -5339,276 +5334,319 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
 done:
     FreeFile(fpin);
 
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 }
 
 /* ----------
- * pgstat_read_db_statsfile_timestamp() -
+ * backend_clean_snapshot_callback() -
  *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
+ *    This is usually called with arg = NULL when the memory context in which
+ *    the current snapshot was taken is reset or deleted. Don't bother
+ *    releasing memory in that case.
  * ----------
  */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
+static void
+backend_clean_snapshot_callback(void *arg)
 {
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    if (arg != NULL)
     {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
+        /* explicitly called, so explicitly free resources */
+        if (snapshot_globalStats)
+            pfree(snapshot_globalStats);
 
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
+        if (snapshot_archiverStats)
+            pfree(snapshot_archiverStats);
 
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
+        if (snapshot_db_stats)
         {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
+            HASH_SEQ_STATUS seq;
+            PgStat_StatDBEntry *dbent;
 
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
+            hash_seq_init(&seq, snapshot_db_stats);
+            while ((dbent = hash_seq_search(&seq)) != NULL)
+            {
+                if (dbent->snapshot_tables)
+                    hash_destroy(dbent->snapshot_tables);
+                if (dbent->snapshot_functions)
+                    hash_destroy(dbent->snapshot_functions);
+            }
+            hash_destroy(snapshot_db_stats);
         }
     }
 
-done:
-    FreeFile(fpin);
-    return true;
+    /* mark the resources as not allocated */
+    snapshot_globalStats = NULL;
+    snapshot_archiverStats = NULL;
+    snapshot_db_stats = NULL;
 }
 
 /*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
+ * create_local_stats_hash() -
+ *
+ * Creates a dynahash used for table/function stats cache
  */
-static void
-backend_read_statsfile(void)
+static HTAB *
+create_local_stats_hash(size_t keysize, size_t entrysize, int nentries)
 {
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
+    HTAB *result;
+    HASHCTL ctl;
 
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
+    ctl.keysize        = keysize;
+    ctl.entrysize    = entrysize;
+    result = hash_create("local database stats hash",
+                         nentries, &ctl, HASH_ELEM | HASH_BLOBS);
 
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
+    return result;
 }
 
+/*
+ * take_db_stats_snapshot() -
+ *
+ * Takes a snapshot of the database stats hash. Unlike table/function stats,
+ * we make a full copy of the shared stats hash.
+ */
+static void
+take_db_stats_snapshot(void)
+{
+    HASHCTL ctl;
+    dshash_seq_status s;
+    int num_entries;
+    void *ps;
+
+    Assert(!snapshot_db_stats);
+    Assert(db_stats);
+
+    ctl.keysize = dsh_dbparams.key_size;
+    ctl.entrysize = dsh_dbparams.entry_size;
+
+    num_entries = dshash_get_num_entries(db_stats);
+
+    snapshot_db_stats = hash_create("local database stats hash",
+                                    num_entries, &ctl, HASH_ELEM | HASH_BLOBS);
+
+    /* Copy all entries from dshash */
+    dshash_seq_init(&s, db_stats, true);
+    while ((ps = dshash_seq_next(&s)) != NULL)
+    {
+        void *pd = hash_search(snapshot_db_stats, ps, HASH_ENTER, NULL);
+        memcpy(pd, ps, ctl.entrysize);
+    }
+}
+
+/*
+ * snapshot_statentry() - Find an entry in the source dshash.
+ *
+ * Returns the entry for the key, or NULL if not found. If dest is a valid
+ * pointer, it points to a local hash used as a cache, so that requests for
+ * the same key return the same entry within a transaction. Otherwise the
+ * returned entry is a palloc'ed copy, which may differ for every request.
+ *
+ * If key is InvalidOid, just fills the local cache with all existing stats
+ * entries and returns NULL.
+ */
+static void *
+snapshot_statentry(HTAB **dest,
+                   dshash_table_handle dsh_handle,
+                   const dshash_parameters *dsh_params,
+                   Oid key, size_t keysize, size_t entrysize)
+{
+    void *lentry = NULL;
+
+    /* Fill the local cache with all entries when no key is given */
+    if (key == InvalidOid)
+    {
+        dshash_table *t;
+        dshash_seq_status s;
+        void *ps;
+        int num_entries;
+
+        /* a destination hash is required for full caching */
+        Assert(dest);
+
+        /* destroy existing local hash */
+
+        if (*dest)
+        {
+            hash_destroy(*dest);
+            *dest = NULL;
+        }
+
+        t = dshash_attach(area, dsh_params, dsh_handle, NULL);
+
+        /* No need to create new hash if no entry exists */
+        num_entries = dshash_get_num_entries(t);
+        if (num_entries == 0)
+        {
+            dshash_detach(t);
+            return NULL;
+        }
+
+        *dest = create_local_stats_hash(keysize, entrysize, num_entries);
+
+        dshash_seq_init(&s, t, true);
+        while ((ps = dshash_seq_next(&s)) != NULL)
+        {
+            bool found;
+            void *pd = hash_search(*dest, ps, HASH_ENTER, &found);
+            Assert(!found);
+            memcpy(pd, ps, entrysize);
+            /* dshash_seq_next releases the entry lock automatically */
+        }
+        dshash_detach(t);
+
+        return NULL;
+    }
+    else if (dest)
+    {
+        bool found;
+
+        /*
+         * Create a new hash for the entry cache. Use an arbitrary initial
+         * size since we don't know how large this hash will grow.
+         */
+        if (!*dest)
+            *dest = create_local_stats_hash(keysize, entrysize, 32);
+
+        lentry = hash_search(*dest, &key, HASH_ENTER, &found);
+        if (!found)
+        {
+            dshash_table *t = dshash_attach(area, dsh_params, dsh_handle, NULL);
+            void *sentry = dshash_find(t, &key, false);
+
+            /*
+             * We expect that the stats entry for the specified key exists in
+             * most cases.
+             */
+            if (!sentry)
+            {
+                hash_search(*dest, &key, HASH_REMOVE, NULL);
+                dshash_detach(t);
+                return NULL;
+            }
+            memcpy(lentry, sentry, entrysize);
+            dshash_release_lock(t, sentry);
+            dshash_detach(t);
+        }
+    }
+    else
+    {
+        /*
+         * The caller doesn't want caching. Just make a copy of the entry and
+         * return it.
+         */
+        dshash_table *t;
+        void *sentry;
+
+        t = dshash_attach(area, dsh_params, dsh_handle, NULL);
+        sentry = dshash_find(t, &key, false);
+        if (sentry)
+        {
+            lentry = palloc(entrysize);
+            memcpy(lentry, sentry, entrysize);
+            dshash_release_lock(t, sentry);
+        }
+        dshash_detach(t);
+    }
+
+    return lentry;
+}
+
+/*
+ * backend_snapshot_database_stats() -
+ *
+ * Makes a local copy of the global and database stats tables if not already
+ * done. They are kept until pgstat_clear_snapshot() is called or until the
+ * current memory context (typically TopTransactionContext) is reset or deleted.
+ * Returns false if the shared stats have not been created yet.
+ */
+static bool
+backend_snapshot_database_stats(void)
+{
+    MemoryContext oldcontext;
+    MemoryContextCallback *mcxt_cb;
+
+    /* Nothing to do if already done */
+    if (snapshot_globalStats)
+        return true;
+
+    Assert(!pgStatRunningInCollector);
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    if (!pgstat_attach_shared_stats())
+    {
+        MemoryContextSwitchTo(oldcontext);
+        return false;
+    }
+    MemoryContextSwitchTo(oldcontext);
+
+    Assert(snapshot_archiverStats == NULL);
+    Assert(snapshot_db_stats == NULL);
+
+    /*
+     * The snapshot lives within the current top transaction if any, or for
+     * the lifetime of the current memory context otherwise.
+     */
+    if (IsTransactionState())
+        MemoryContextSwitchTo(TopTransactionContext);
+
+    /* global stats can be just copied */
+    snapshot_globalStats = palloc(sizeof(PgStat_GlobalStats));
+    memcpy(snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+
+    snapshot_archiverStats = palloc(sizeof(PgStat_ArchiverStats));
+    memcpy(snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+
+    /* set the timestamp of this snapshot */
+    snapshot_globalStats->stats_timestamp = GetCurrentTimestamp();
+
+    take_db_stats_snapshot();
+
+    /* register callback to clear snapshot */
+    mcxt_cb = (MemoryContextCallback *)palloc(sizeof(MemoryContextCallback));
+    mcxt_cb->func = backend_clean_snapshot_callback;
+    mcxt_cb->arg = NULL;
+    MemoryContextRegisterResetCallback(CurrentMemoryContext, mcxt_cb);
+    MemoryContextSwitchTo(oldcontext);
+
+    return true;
+}
+
+/* ----------
+ * backend_get_tab_entry() -
+ *
+ *    Find table stats entry on backends. The returned entries are cached until
+ *    transaction end. If oneshot is true, they are not cached and are returned
+ *    in palloc'ed memory.
+ */
+PgStat_StatTabEntry *
+backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid reloid, bool oneshot)
+{
+    /* take a local snapshot if we don't have one */
+    return snapshot_statentry(oneshot ? NULL : &dbent->snapshot_tables,
+                              dbent->tables, &dsh_tblparams,
+                              reloid,
+                              dsh_tblparams.key_size,
+                              dsh_tblparams.entry_size);
+}
+
+/* ----------
+ * backend_get_func_entry() -
+ *
+ *    Find function stats entry on backends. The returned entries are cached
+ *    until transaction end. If oneshot is true, they are not cached and are
+ *    returned in palloc'ed memory.
+ */
+static PgStat_StatFuncEntry *
+backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot)
+{
+    return snapshot_statentry(oneshot ? NULL : &dbent->snapshot_functions,
+                              dbent->functions, &dsh_funcparams,
+                              funcid,
+                              dsh_funcparams.key_size,
+                              dsh_funcparams.entry_size);
+}
 
 /* ----------
  * pgstat_setup_memcxt() -
@@ -5639,6 +5677,8 @@ pgstat_setup_memcxt(void)
 void
 pgstat_clear_snapshot(void)
 {
+    int param = 0;    /* only the address is significant */
+
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
         MemoryContextDelete(pgStatLocalContext);
@@ -5648,99 +5688,12 @@ pgstat_clear_snapshot(void)
     pgStatDBHash = NULL;
     localBackendStatusTable = NULL;
     localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
 
     /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
+     * The parameter informs the function that it is not being called as a
+     * MemoryContextCallback.
      */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
+    backend_clean_snapshot_callback(&param);
 }
 
 
@@ -5753,6 +5706,7 @@ pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
 static void
 pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
 {
+    dshash_table *tabhash;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
     int            i;
@@ -5768,6 +5722,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
     dbentry->n_block_read_time += msg->m_block_read_time;
     dbentry->n_block_write_time += msg->m_block_write_time;
 
+    tabhash = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
     /*
      * Process all table entries in the message.
      */
@@ -5775,9 +5730,8 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
     {
         PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
 
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
+        tabentry = (PgStat_StatTabEntry *)
+            dshash_find_or_insert(tabhash, (void *) &(tabmsg->t_id), &found);
 
         if (!found)
         {
@@ -5836,6 +5790,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
         tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
         /* Likewise for n_dead_tuples */
         tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
+        dshash_release_lock(tabhash, tabentry);
 
         /*
          * Add per-table stats to the per-database entry, too.
@@ -5848,6 +5803,8 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
         dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
         dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
     }
+    dshash_detach(tabhash);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 
@@ -5860,27 +5817,33 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
 static void
 pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
 {
+    dshash_table *tbl;
     PgStat_StatDBEntry *dbentry;
     int            i;
 
     dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
     /*
      * No need to purge if we don't even know the database.
      */
-    if (!dbentry || !dbentry->tables)
+    if (!dbentry || dbentry->tables == DSM_HANDLE_INVALID)
+    {
+        if (dbentry)
+            dshash_release_lock(db_stats, dbentry);
         return;
+    }
 
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
     /*
      * Process all table entries in the message.
      */
     for (i = 0; i < msg->m_nentries; i++)
     {
         /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
+        (void) dshash_delete_key(tbl, (void *) &(msg->m_tableid[i]));
     }
+
+    dshash_detach(tbl);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 
@@ -5906,23 +5869,20 @@ pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
      */
     if (dbentry)
     {
-        char        statfile[MAXPGPATH];
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+        }
 
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
+        dshash_delete_entry(db_stats, (void *)dbentry);
     }
 }
 
@@ -5950,19 +5910,28 @@ pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
      * We simply throw away all the database's table entries by recreating a
      * new hash table for them.
      */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
+    if (dbentry->tables != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_destroy(t);
+        dbentry->tables = DSM_HANDLE_INVALID;
+    }
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_destroy(t);
+        dbentry->functions = DSM_HANDLE_INVALID;
+    }
 
     /*
      * Reset database-level stats, too.  This creates empty hash tables for
      * tables and functions.
      */
     reset_dbentry_counters(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -5977,14 +5946,14 @@ pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
     if (msg->m_resettarget == RESET_BGWRITER)
     {
         /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
+        memset(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
     }
     else if (msg->m_resettarget == RESET_ARCHIVER)
     {
         /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
+        memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
     }
 
     /*
@@ -6014,11 +5983,19 @@ pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
 
     /* Remove object if it exists, ignore it if not */
     if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
+    {
+        dshash_table *t = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_delete_key(t, (void *) &(msg->m_objectid));
+        dshash_detach(t);
+    }
     else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
+    {
+        dshash_table *t = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_delete_key(t, (void *) &(msg->m_objectid));
+        dshash_detach(t);
+    }
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -6038,6 +6015,8 @@ pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
     dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
 
     dbentry->last_autovac_time = msg->m_start_time;
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -6051,13 +6030,13 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
 {
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
-
+    dshash_table *table;
     /*
      * Store the data in the table's hashtable entry.
      */
     dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, msg->m_tableoid, true);
 
     tabentry->n_live_tuples = msg->m_live_tuples;
     tabentry->n_dead_tuples = msg->m_dead_tuples;
@@ -6072,6 +6051,9 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
         tabentry->vacuum_timestamp = msg->m_vacuumtime;
         tabentry->vacuum_count++;
     }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -6085,13 +6067,15 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
 {
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
 
     /*
      * Store the data in the table's hashtable entry.
      */
     dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
 
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, msg->m_tableoid, true);
 
     tabentry->n_live_tuples = msg->m_live_tuples;
     tabentry->n_dead_tuples = msg->m_dead_tuples;
@@ -6114,6 +6098,9 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
         tabentry->analyze_timestamp = msg->m_analyzetime;
         tabentry->analyze_count++;
     }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 
@@ -6129,18 +6116,18 @@ pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
     if (msg->m_failed)
     {
         /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
+        ++shared_archiverStats->failed_count;
+        memcpy(shared_archiverStats->last_failed_wal, msg->m_xlog,
+               sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = msg->m_timestamp;
     }
     else
     {
         /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
+        ++shared_archiverStats->archived_count;
+        memcpy(shared_archiverStats->last_archived_wal, msg->m_xlog,
+               sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = msg->m_timestamp;
     }
 }
 
@@ -6153,16 +6140,16 @@ pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
 static void
 pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
 {
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
+    shared_globalStats->timed_checkpoints += msg->m_timed_checkpoints;
+    shared_globalStats->requested_checkpoints += msg->m_requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += msg->m_checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += msg->m_checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += msg->m_buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += msg->m_buf_written_clean;
+    shared_globalStats->maxwritten_clean += msg->m_maxwritten_clean;
+    shared_globalStats->buf_written_backend += msg->m_buf_written_backend;
+    shared_globalStats->buf_fsync_backend += msg->m_buf_fsync_backend;
+    shared_globalStats->buf_alloc += msg->m_buf_alloc;
 }
 
 /* ----------
@@ -6203,6 +6190,8 @@ pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
             dbentry->n_conflict_startup_deadlock++;
             break;
     }
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -6219,6 +6208,8 @@ pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
     dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
 
     dbentry->n_deadlocks++;
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -6236,6 +6227,8 @@ pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
 
     dbentry->n_temp_bytes += msg->m_filesize;
     dbentry->n_temp_files += 1;
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -6247,6 +6240,7 @@ pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
 static void
 pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
 {
+    dshash_table *t;
     PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
     PgStat_StatDBEntry *dbentry;
     PgStat_StatFuncEntry *funcentry;
@@ -6255,14 +6249,14 @@ pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
 
     dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
 
+    t = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
     /*
      * Process all function entries in the message.
      */
     for (i = 0; i < msg->m_nentries; i++, funcmsg++)
     {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
+        funcentry = (PgStat_StatFuncEntry *)
+            dshash_find_or_insert(t, (void *) &(funcmsg->f_id), &found);
 
         if (!found)
         {
@@ -6283,7 +6277,11 @@ pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
             funcentry->f_total_time += funcmsg->f_total_time;
             funcentry->f_self_time += funcmsg->f_self_time;
         }
+        dshash_release_lock(t, funcentry);
     }
+
+    dshash_detach(t);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -6295,6 +6293,7 @@ pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
 static void
 pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
 {
+    dshash_table *t;
     PgStat_StatDBEntry *dbentry;
     int            i;
 
@@ -6303,60 +6302,24 @@ pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
     /*
      * No need to purge if we don't even know the database.
      */
-    if (!dbentry || !dbentry->functions)
+    if (!dbentry || dbentry->functions == DSM_HANDLE_INVALID)
+    {
+        if (dbentry)
+            dshash_release_lock(db_stats, dbentry);
         return;
+    }
 
+    t = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
     /*
      * Process all function entries in the message.
      */
     for (i = 0; i < msg->m_nentries; i++)
     {
         /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
+        dshash_delete_key(t, (void *) &(msg->m_functionid[i]));
     }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
+    dshash_detach(t);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /*
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 3f1eae38a9..a517bf62b6 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -77,9 +77,6 @@ static bool is_checksummed_file(const char *fullpath, const char *filename);
 /* Was the backup currently in-progress initiated in recovery mode? */
 static bool backup_started_in_recovery = false;
 
-/* Relative path of temporary statistics directory */
-static char *statrelpath = NULL;
-
 /*
  * Size of each block sent into the tar stream for larger files.
  */
@@ -121,13 +118,6 @@ static bool noverify_checksums = false;
  */
 static const char *excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    PG_STAT_TMP_DIR,
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another master. See backup.sgml
@@ -223,11 +213,8 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -254,18 +241,6 @@ perform_base_backup(basebackup_options *opt)
 
         SendXlogRecPtrResult(startptr, starttli);
 
-        /*
-         * Calculate the relative path of temporary statistics directory in
-         * order to skip the files which are located in that directory later.
-         */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
-
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
         ti->size = opt->progress ? sendDir(".", 1, true, tablespaces, true) : -1;
@@ -1174,17 +1149,6 @@ sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces,
         if (excludeFound)
             continue;
 
-        /*
-         * Exclude contents of directory specified by statrelpath if not set
-         * to the default (pg_stat_tmp) which is caught in the loop above.
-         */
-        if (statrelpath != NULL && strcmp(pathbuf, statrelpath) == 0)
-        {
-            elog(DEBUG1, "contents of directory \"%s\" excluded from backup", statrelpath);
-            size += _tarWriteDir(pathbuf, basepathlen, &statbuf, sizeonly);
-            continue;
-        }
-
         /*
          * We can skip pg_wal, the WAL segments need to be fetched from the
          * WAL archive anyway. But include it as an empty directory anyway, so
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 0c86a581c0..ee30e8a14f 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -150,6 +150,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
         size = add_size(size, BackendRandomShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -270,6 +271,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
     SyncScanShmemInit();
     AsyncShmemInit();
     BackendRandomShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index a6fda81feb..c46bb8d057 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -521,6 +521,9 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_TBM, "tbm");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
+    LWLockRegisterTranche(LWTRANCHE_STATS_DSA, "stats table dsa");
+    LWLockRegisterTranche(LWTRANCHE_STATS_DB, "db stats");
+    LWLockRegisterTranche(LWTRANCHE_STATS_FUNC_TABLE, "table/func stats");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index e6025ecedb..798af9f168 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -50,3 +50,4 @@ OldSnapshotTimeMapLock                42
 BackendRandomLock                    43
 LogicalRepWorkerLock                44
 CLogTruncationLock                    45
+StatsLock                            46
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b05fb209bb..2c1e092afe 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -186,7 +186,6 @@ static bool check_autovacuum_max_workers(int *newval, void **extra, GucSource so
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static void assign_effective_io_concurrency(int newval, void *extra);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -3755,17 +3754,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -10691,35 +10679,6 @@ assign_effective_io_concurrency(int newval, void *extra)
 #endif                            /* USE_PREFETCH */
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9e39baf466..7aa57bc489 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -509,7 +509,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index ae22e7d9fb..0c3b82b455 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -216,7 +216,6 @@ static const char *const subdirs[] = {
     "pg_replslot",
     "pg_tblspc",
     "pg_stat",
-    "pg_stat_tmp",
     "pg_xact",
     "pg_logical",
     "pg_logical/snapshots",
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 2211d90c6f..e6f4d30658 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 87957e093a..253b42d71f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -13,6 +13,7 @@
 
 #include "datatype/timestamp.h"
 #include "fmgr.h"
+#include "lib/dshash.h"
 #include "libpq/pqcomm.h"
 #include "port/atomics.h"
 #include "portability/instr_time.h"
@@ -30,9 +31,6 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
-#define PG_STAT_TMP_DIR        "pg_stat_tmp"
-
 /* Values for track_functions GUC variable --- order is significant! */
 typedef enum TrackFunctionsLevel
 {
@@ -48,7 +46,6 @@ typedef enum TrackFunctionsLevel
 typedef enum StatMsgType
 {
     PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
     PGSTAT_MTYPE_TABSTAT,
     PGSTAT_MTYPE_TABPURGE,
     PGSTAT_MTYPE_DROPDB,
@@ -216,35 +213,6 @@ typedef struct PgStat_MsgDummy
     PgStat_MsgHdr m_hdr;
 } PgStat_MsgDummy;
 
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
 /* ----------
  * PgStat_TableEntry            Per-table info in a MsgTabstat
  * ----------
@@ -539,7 +507,6 @@ typedef union PgStat_Msg
 {
     PgStat_MsgHdr msg_hdr;
     PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
     PgStat_MsgTabstat msg_tabstat;
     PgStat_MsgTabpurge msg_tabpurge;
     PgStat_MsgDropdb msg_dropdb;
@@ -601,10 +568,13 @@ typedef struct PgStat_StatDBEntry
 
     /*
      * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * the handles and pointers out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables;
+    dshash_table_handle functions;
+    /* for snapshot tables */
+    HTAB *snapshot_tables;
+    HTAB *snapshot_functions;
 } PgStat_StatDBEntry;
 
 
@@ -1218,6 +1188,7 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 extern void pgstat_initstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
+extern PgStat_StatTabEntry *backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid relid, bool oneshot);
 
 /* ----------
  * pgstat_report_wait_start() -
@@ -1345,6 +1316,9 @@ extern void pgstat_send_bgwriter(void);
  * generate the pgstat* views.
  * ----------
  */
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+extern void PgstatCollectorMain(void);
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
@@ -1354,7 +1328,4 @@ extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
 
-/* Main loop */
-extern void PgstatCollectorMain(void);
-
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index c21bfe2f66..2cdd10c2fd 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -219,6 +219,9 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SHARED_TUPLESTORE,
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
+    LWTRANCHE_STATS_DSA,
+    LWTRANCHE_STATS_DB,
+    LWTRANCHE_STATS_FUNC_TABLE,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
-- 
2.16.3


Re: shared-memory based stats collector

From
Kyotaro HORIGUCHI
Date:
Hello. This is a new version that fixes the Windows build.

At Tue, 03 Jul 2018 19:01:44 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20180703.190144.222427588.horiguchi.kyotaro@lab.ntt.co.jp>
> Hello. Thanks for the comment.
> 
> At Mon, 2 Jul 2018 14:25:58 -0400, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+TgmoYQhr30eAcgJCi1v0FhA+3RP1FZVnXqSTLe=6fHy9e5oA@mail.gmail.com>
> > Copying the whole hash table kinds of sucks, partly because of the
> > time it will take to copy it, but also because it means that memory
> > usage is still O(nbackends * ntables).  Without looking at the patch,
> > I'm guessing that you're doing that because we need a way to show each
> > transaction a consistent snapshot of the data, and I admit that I
> > don't see another obvious way to tackle that problem.  Still, it would
> > be nice if we had a better idea.
> 
> The consistency here means "repeatable read" of an object's stats
> entry, not a snapshot covering all objects. We don't need to copy
> all the entries at once following this definition. The attached
> version makes a cache entry only for requested objects.
> 
> In addition, vacuum doesn't require even repeatable-read
> consistency, so we don't need to cache the entries at all.
> backend_get_tab_entry now returns an isolated (that is,
> not stored in the hash) palloc'ed copy without adding it to
> the local cache in that case.
> 
> As a result, this version behaves as follows.
> 
> - Stats collector stores the results in shared memory.
> 
> - In a backend, a cache entry is created only for requested
>   objects and lasts for the transaction.
> 
> - Vacuum directly reads the shared stats and doesn't create a
>   local copy.
> 
> The non-behavioral difference from v1 is as follows.
> 
> - snapshot feature of dshash is removed.
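
The caching behavior listed above can be illustrated with a small standalone sketch. This is not code from the patch: TabStat, shared_stats, get_tab_entry and end_of_transaction are hypothetical names, a plain array stands in for the dshash table in shared memory, and malloc stands in for palloc. It only shows the intended policy: a normal reader gets a "repeatable read" of each entry it asked for, while a one-shot reader such as vacuum gets a fresh private copy that is never cached.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a per-table stats entry. */
typedef struct TabStat
{
    unsigned int relid;         /* table OID, used as the key */
    long        n_scans;        /* a sample counter */
} TabStat;

#define NSHARED 4
#define NCACHE  8

/* Toy "shared" store; the patch keeps this data in a dshash table. */
TabStat     shared_stats[NSHARED] = {
    {1001, 10}, {1002, 20}, {1003, 30}, {1004, 40}
};

/* Backend-local cache: filled lazily, dropped at end of transaction. */
static TabStat *local_cache[NCACHE];
static int  n_cached = 0;

static TabStat *
shared_lookup(unsigned int relid)
{
    for (int i = 0; i < NSHARED; i++)
        if (shared_stats[i].relid == relid)
            return &shared_stats[i];
    return NULL;
}

/*
 * Fetch the stats of one table.
 *
 * oneshot = false: copy the shared entry on first request and keep returning
 * that same copy until end_of_transaction() (repeatable read per object).
 * oneshot = true: return a fresh private copy that is not cached; the caller
 * frees it.
 *
 * Returns NULL if the table is unknown, memory is exhausted, or the toy
 * fixed-size cache is full.
 */
TabStat *
get_tab_entry(unsigned int relid, bool oneshot)
{
    TabStat    *src;
    TabStat    *copy;

    if (!oneshot)
        for (int i = 0; i < n_cached; i++)
            if (local_cache[i]->relid == relid)
                return local_cache[i];

    src = shared_lookup(relid);
    if (src == NULL || (!oneshot && n_cached >= NCACHE))
        return NULL;

    copy = malloc(sizeof(TabStat));
    if (copy == NULL)
        return NULL;
    memcpy(copy, src, sizeof(TabStat));

    if (!oneshot)
        local_cache[n_cached++] = copy;
    return copy;
}

/* Drop every cached copy; the next request rereads the shared data. */
void
end_of_transaction(void)
{
    for (int i = 0; i < n_cached; i++)
        free(local_cache[i]);
    n_cached = 0;
}
```

With this shape, local memory grows with the number of objects a backend actually asks about rather than with the total number of tables, which is the point of caching only requested objects.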

This version includes some additional patches. 0003 removes
PG_STAT_TMP_DIR, which affects pg_stat_statements, pg_basebackup
and pg_rewind. Of these, pg_stat_statements fails to build
because it uses the directory to save query texts. 0005 is a new
patch that moves that file to the permanent stats directory. With
this change pg_basebackup and pg_rewind no longer ignore the
query text file.

I haven't explicitly mentioned this before, but
dynamic_shared_memory_type = none prevents the server from
starting. This patch does not provide a fallback path for that
case. I expect that 'none' will be removed in v12.

v3-0001-sequential-scan-for-dshash.patch
  - Functionally the same as v2, with cosmetic changes.

v3-0002-Change-stats-collector-to-an-axiliary-process.patch
  - Fixed for Windows build.

v3-0003-dshash-based-stats-collector.patch
  - Cosmetic changes from v2.

v3-0004-Documentation-update.patch
  - New patch in v3 with documentation edits.

v3-0005-Let-pg_stat_statements-not-to-use-PG_STAT_TMP_DIR.patch
  - New patch with a tentative change to pg_stat_statements.

v3-0006-Remove-pg_stat_tmp-exclusion-from-pg_resetwal.patch
  - New patch with a tentative change to pg_rewind.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 319535ed679495f9e86d417b1a23bb30ed80495f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:41:04 +0900
Subject: [PATCH 1/6] sequential scan for dshash

Add sequential scan feature to dshash.
---
 src/backend/lib/dshash.c | 138 +++++++++++++++++++++++++++++++++++++++++++++++
 src/include/lib/dshash.h |  23 +++++++-
 2 files changed, 160 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index b46f7c4cfd..5b133226ac 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -592,6 +592,144 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should be called if and only if the scan is abandoned
+ * before completion; if dshash_seq_next returns NULL then it has already done
+ * the end-of-scan cleanup.
+ *
+ * A returned element is locked, as is the case with dshash_find.
+ * However, the caller must not release the lock; it is released as
+ * necessary as the scan continues.
+ *
+ * As opposed to the equivalent for dynahash, the caller is not supposed to
+ * delete the returned element before continuing the scan.
+ *
+ * If consistent is set for dshash_seq_init, the whole hash table is
+ * non-exclusively locked. Otherwise a part of the hash table is locked in the
+ * same mode (partition lock).
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool consistent)
+{
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = ((size_t) 1) << hash_table->control->size_log2;
+    status->curitem = NULL;
+    status->curpartition = -1;
+    status->consistent = consistent;
+
+    /*
+     * Protect all partitions from modification if the caller wants a
+     * consistent result.
+     */
+    if (consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+        {
+            Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i)));
+
+            LWLockAcquire(PARTITION_LOCK(hash_table, i), LW_SHARED);
+        }
+    }
+    ensure_valid_bucket_pointers(hash_table);
+}
+
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    if (status->curitem == NULL)
+    {
+        Assert(status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+        status->hash_table->find_locked = true;
+    }
+    else
+        next_item_pointer = status->curitem->next;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            dshash_seq_term(status);
+            return NULL;
+        }
+        Assert(status->hash_table->find_locked);
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+
+        /*
+         * we need a lock on the partition being scanned even if the caller
+         * didn't request a consistent snapshot.
+         */
+        if (!status->consistent && DsaPointerIsValid(next_item_pointer))
+        {
+            dshash_table_item  *item = dsa_get_address(status->hash_table->area,
+                                                       next_item_pointer);
+            int next_partition = PARTITION_FOR_HASH(item->hash);
+            if (status->curpartition != next_partition)
+            {
+                if (status->curpartition >= 0)
+                    LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                                 status->curpartition));
+                LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                             next_partition),
+                              LW_SHARED);
+                status->curpartition = next_partition;
+            }
+        }
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    Assert(status->hash_table->find_locked);
+    status->hash_table->find_locked = false;
+
+    if (status->consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+            LWLockRelease(PARTITION_LOCK(status->hash_table, i));
+    }
+    else if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+int
+dshash_get_num_entries(dshash_table *hash_table)
+{
+    /* a shortcut implementation; should be improved */
+    dshash_seq_status s;
+    void *p;
+    int n = 0;
+
+    dshash_seq_init(&s, hash_table, false);
+    while ((p = dshash_seq_next(&s)) != NULL)
+        n++;
+
+    return n;
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 8c733bfe25..8598d9ed84 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -15,6 +15,7 @@
 #define DSHASH_H
 
 #include "utils/dsa.h"
+#include "utils/hsearch.h"
 
 /* The opaque type representing a hash table. */
 struct dshash_table;
@@ -59,6 +60,21 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state of dshash. The details are exposed because callers
+ * need to know the storage size, but callers should treat it as an opaque
+ * type.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;
+    int                    curbucket;
+    int                    nbuckets;
+    dshash_table_item  *curitem;
+    int                    curpartition;
+    bool                consistent;
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
               const dshash_parameters *params,
@@ -70,7 +86,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
 extern void dshash_detach(dshash_table *hash_table);
 extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
 extern void dshash_destroy(dshash_table *hash_table);
-
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
             const void *key, bool exclusive);
@@ -80,6 +95,12 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool consistent);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
+extern int dshash_get_num_entries(dshash_table *hash_table);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.16.3
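
The init/next/term contract described in the comment of the patch above can be modeled with a toy table. This is only a sketch under stated assumptions, not the dshash implementation: Table, Item, SeqStatus, seq_init, seq_next, seq_term, count_items and locks_held are hypothetical names, plain pointers replace dsa_pointer, and integer counters replace the partition LWLocks. Unlike the patch, the sketch takes the partition lock for every non-empty bucket it enters, including the first one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NBUCKETS    8
#define NPARTITIONS 4

typedef struct Item
{
    struct Item *next;
    int         value;
} Item;

typedef struct Table
{
    Item       *buckets[NBUCKETS];
    int         lock_held[NPARTITIONS]; /* toy shared-lock counters */
} Table;

typedef struct SeqStatus
{
    Table      *table;
    int         curbucket;      /* -1 before the first call to seq_next */
    Item       *curitem;
    int         curpartition;   /* -1 when no single partition is locked */
    bool        consistent;     /* true: all partitions stay locked */
} SeqStatus;

/* Buckets are assigned to partitions in contiguous runs. */
static int
partition_for_bucket(int bucket)
{
    return bucket / (NBUCKETS / NPARTITIONS);
}

void
seq_init(SeqStatus *st, Table *table, bool consistent)
{
    st->table = table;
    st->curbucket = -1;
    st->curitem = NULL;
    st->curpartition = -1;
    st->consistent = consistent;

    if (consistent)
        for (int i = 0; i < NPARTITIONS; i++)
            table->lock_held[i]++;
}

/* Release whatever the scan still holds; call only when abandoning a scan. */
void
seq_term(SeqStatus *st)
{
    if (st->consistent)
    {
        for (int i = 0; i < NPARTITIONS; i++)
            st->table->lock_held[i]--;
        st->consistent = false;
    }
    else if (st->curpartition >= 0)
        st->table->lock_held[st->curpartition]--;
    st->curpartition = -1;
}

/* Return the next item, or NULL after doing the end-of-scan cleanup. */
Item *
seq_next(SeqStatus *st)
{
    Item       *next = (st->curitem != NULL) ? st->curitem->next : NULL;

    while (next == NULL)
    {
        if (++st->curbucket >= NBUCKETS)
        {
            seq_term(st);
            return NULL;
        }
        next = st->table->buckets[st->curbucket];

        /* Hand the lock over when the scan moves into another partition. */
        if (next != NULL && !st->consistent)
        {
            int         p = partition_for_bucket(st->curbucket);

            if (p != st->curpartition)
            {
                if (st->curpartition >= 0)
                    st->table->lock_held[st->curpartition]--;
                st->table->lock_held[p]++;
                st->curpartition = p;
            }
        }
    }
    st->curitem = next;
    return next;
}

/* Full scan: no seq_term call is needed because seq_next ran to NULL. */
int
count_items(Table *table, bool consistent)
{
    SeqStatus   st;
    int         n = 0;

    seq_init(&st, table, consistent);
    while (seq_next(&st) != NULL)
        n++;
    return n;
}

/* Total number of toy locks currently held on the table. */
int
locks_held(const Table *table)
{
    int         n = 0;

    for (int i = 0; i < NPARTITIONS; i++)
        n += table->lock_held[i];
    return n;
}
```

A caller that stops early must call seq_term itself, otherwise the lock on the current partition (or on all partitions in consistent mode) stays held. That is the "if and only if the scan is abandoned" rule from the comment.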

From 2349b3f66c217e55a8fb8b4fe33fa8ed74c66054 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:58:32 +0900
Subject: [PATCH 2/6] Change stats collector to an auxiliary process.

Shared memory and LWLocks are required to let the stats collector use
dshash. This patch makes the stats collector an auxiliary process.
---
 src/backend/bootstrap/bootstrap.c   |   8 ++
 src/backend/postmaster/pgstat.c     | 158 +++++++++++-------------------------
 src/backend/postmaster/postmaster.c |  30 +++----
 src/include/miscadmin.h             |   3 +-
 src/include/pgstat.h                |  11 ++-
 5 files changed, 77 insertions(+), 133 deletions(-)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 7e34bee63e..8f8327495a 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -335,6 +335,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case WalReceiverProcess:
                 statmsg = pgstat_get_backend_desc(B_WAL_RECEIVER);
                 break;
+            case StatsCollectorProcess:
+                statmsg = pgstat_get_backend_desc(B_STATS_COLLECTOR);
+                break;
             default:
                 statmsg = "??? process";
                 break;
@@ -462,6 +465,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
             WalReceiverMain();
             proc_exit(1);        /* should never return */
 
+        case StatsCollectorProcess:
+            /* don't set signals, stats collector has its own agenda */
+            PgstatCollectorMain();
+            proc_exit(1);        /* should never return */
+
         default:
             elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
             proc_exit(1);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index bbe73618c7..3193178f32 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -267,6 +267,7 @@ static List *pending_write_requests = NIL;
 /* Signal handler flags */
 static volatile bool need_exit = false;
 static volatile bool got_SIGHUP = false;
+static volatile bool got_SIGTERM = false;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -284,8 +285,8 @@ static instr_time total_func_time;
 static pid_t pgstat_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgstat_exit(SIGNAL_ARGS);
+static void pgstat_shutdown_handler(SIGNAL_ARGS);
+static void pgstat_quickdie_handler(SIGNAL_ARGS);
 static void pgstat_beshutdown_hook(int code, Datum arg);
 static void pgstat_sighup_handler(SIGNAL_ARGS);
 
@@ -688,104 +689,6 @@ pgstat_reset_all(void)
     pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
 }
 
-#ifdef EXEC_BACKEND
-
-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
-    char       *av[10];
-    int            ac = 0;
-
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
-
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
-
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
-
-
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
-
-    /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
-     */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
-
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
-
-    /*
-     * Okay, fork off the collector.
-     */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 void
 allow_immediate_pgstat_restart(void)
 {
@@ -2870,6 +2773,9 @@ pgstat_bestart(void)
             case WalReceiverProcess:
                 beentry->st_backendType = B_WAL_RECEIVER;
                 break;
+            case StatsCollectorProcess:
+                beentry->st_backendType = B_STATS_COLLECTOR;
+                break;
             default:
                 elog(FATAL, "unrecognized process type: %d",
                      (int) MyAuxProcType);
@@ -4135,6 +4041,9 @@ pgstat_get_backend_desc(BackendType backendType)
         case B_WAL_WRITER:
             backendDesc = "walwriter";
             break;
+        case B_STATS_COLLECTOR:
+            backendDesc = "stats collector";
+            break;
     }
 
     return backendDesc;
@@ -4252,8 +4161,8 @@ pgstat_send_bgwriter(void)
  *    The argc/argv parameters are valid only in EXEC_BACKEND case.
  * ----------
  */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
+void
+PgstatCollectorMain(void)
 {
     int            len;
     PgStat_Msg    msg;
@@ -4266,8 +4175,8 @@ PgstatCollectorMain(int argc, char *argv[])
      */
     pqsignal(SIGHUP, pgstat_sighup_handler);
     pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, pgstat_exit);
+    pqsignal(SIGTERM, pgstat_shutdown_handler);
+    pqsignal(SIGQUIT, pgstat_quickdie_handler);
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
     pqsignal(SIGUSR1, SIG_IGN);
@@ -4312,14 +4221,14 @@ PgstatCollectorMain(int argc, char *argv[])
         /*
          * Quit if we get SIGQUIT from the postmaster.
          */
-        if (need_exit)
+        if (got_SIGTERM)
             break;
 
         /*
          * Inner loop iterates as long as we keep getting messages, or until
          * need_exit becomes set.
          */
-        while (!need_exit)
+        while (!got_SIGTERM)
         {
             /*
              * Reload configuration if we got SIGHUP from the postmaster.
@@ -4507,14 +4416,29 @@ PgstatCollectorMain(int argc, char *argv[])
 
 /* SIGQUIT signal handler for collector process */
 static void
-pgstat_exit(SIGNAL_ARGS)
+pgstat_quickdie_handler(SIGNAL_ARGS)
 {
-    int            save_errno = errno;
+    PG_SETMASK(&BlockSig);
 
-    need_exit = true;
-    SetLatch(MyLatch);
+    /*
+     * We DO NOT want to run proc_exit() callbacks -- we're here because
+     * shared memory may be corrupted, so we don't want to try to clean up our
+     * transaction.  Just nail the windows shut and get out of town.  Now that
+     * there's an atexit callback to prevent third-party code from breaking
+     * things by calling exit() directly, we have to reset the callbacks
+     * explicitly to make this work as intended.
+     */
+    on_exit_reset();
 
-    errno = save_errno;
+    /*
+     * Note we do exit(2) not exit(0).  This is to force the postmaster into a
+     * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+     * backend.  This is necessary precisely because we don't clean up our
+     * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+     * should ensure the postmaster sees this as a crash, too, but no harm in
+     * being doubly sure.)
+     */
+    exit(2);
 }
 
 /* SIGHUP handler for collector process */
@@ -4529,6 +4453,18 @@ pgstat_sighup_handler(SIGNAL_ARGS)
     errno = save_errno;
 }
 
+static void
+pgstat_shutdown_handler(SIGNAL_ARGS)
+{
+    int save_errno = errno;
+
+    got_SIGTERM = true;
+
+    SetLatch(MyLatch);
+
+    errno = save_errno;
+}
+
 /*
  * Subroutine to clear stats in a database entry
  *
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a4b53b33cd..a37a18696e 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -145,7 +145,8 @@
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
 #define BACKEND_TYPE_BGWORKER    0x0008    /* bgworker process */
-#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
+#define BACKEND_TYPE_STATS        0x0010    /* stats collector process */
+#define BACKEND_TYPE_ALL        0x001F    /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -551,6 +552,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
 #define StartWalReceiver()        StartChildProcess(WalReceiverProcess)
+#define StartStatsCollector()    StartChildProcess(StatsCollectorProcess)
 
 /* Macros to check exit status of a child process */
 #define EXIT_STATUS_0(st)  ((st) == 0)
@@ -1760,7 +1762,7 @@ ServerLoop(void)
         /* If we have lost the stats collector, try to start a new one */
         if (PgStatPID == 0 &&
             (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
+            PgStatPID = StartStatsCollector();
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
@@ -2878,7 +2880,7 @@ reaper(SIGNAL_ARGS)
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = pgarch_start();
             if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
+                PgStatPID = StartStatsCollector();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -2951,7 +2953,7 @@ reaper(SIGNAL_ARGS)
                  * nothing left for it to do.
                  */
                 if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
+                    signal_child(PgStatPID, SIGTERM);
             }
             else
             {
@@ -3037,10 +3039,10 @@ reaper(SIGNAL_ARGS)
         {
             PgStatPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
+                HandleChildCrash(pid, exitstatus,
+                                 _("statistics collector process"));
             if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
+                PgStatPID = StartStatsCollector();
             continue;
         }
 
@@ -3270,7 +3272,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, stats collector or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -4949,12 +4951,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         PgArchiverMain(argc, argv); /* does not return */
     }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5071,7 +5067,7 @@ sigusr1_handler(SIGNAL_ARGS)
          * Likewise, start other special children as needed.
          */
         Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
+        PgStatPID = StartStatsCollector();
 
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
@@ -5361,6 +5357,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork WAL receiver process: %m")));
                 break;
+            case StatsCollectorProcess:
+                ereport(LOG,
+                        (errmsg("could not fork stats collector process: %m")));
+                break;
             default:
                 ereport(LOG,
                         (errmsg("could not fork process: %m")));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index e167ee8fcb..53b260cb1f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -400,7 +400,7 @@ typedef enum
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
-
+    StatsCollectorProcess,
     NUM_AUXPROCTYPES            /* Must be last! */
 } AuxProcType;
 
@@ -412,6 +412,7 @@ extern AuxProcType MyAuxProcType;
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
+#define AmStatsCollectorProcess()    (MyAuxProcType == StatsCollectorProcess)
 
 
 /*****************************************************************************
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index d59c24ae23..e97b25bd72 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -710,7 +710,8 @@ typedef enum BackendType
     B_STARTUP,
     B_WAL_RECEIVER,
     B_WAL_SENDER,
-    B_WAL_WRITER
+    B_WAL_WRITER,
+    B_STATS_COLLECTOR
 } BackendType;
 
 
@@ -1160,11 +1161,6 @@ extern int    pgstat_start(void);
 extern void pgstat_reset_all(void);
 extern void allow_immediate_pgstat_restart(void);
 
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
-
-
 /* ----------
  * Functions called from backends
  * ----------
@@ -1353,4 +1349,7 @@ extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
 
+/* Main loop */
+extern void PgstatCollectorMain(void) pg_attribute_noreturn();
+
 #endif                            /* PGSTAT_H */
-- 
2.16.3
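
Note on the shutdown handling in the patch above: the collector's SIGTERM handler only sets a flag (and a latch), and the main loop polls that flag. Below is a minimal standalone sketch of that flag-based pattern in plain C. It is not code from the patch: run_until_sigterm() and the self-delivered raise() are hypothetical stand-ins for the collector loop and for the postmaster's signal, and the latch is left out.

```c
#include <errno.h>
#include <signal.h>

/* Set from the signal handler, polled by the loop below. */
static volatile sig_atomic_t got_sigterm = 0;

/* The handler only records the request and preserves errno. */
static void
shutdown_handler(int signo)
{
    int save_errno = errno;

    (void) signo;
    got_sigterm = 1;
    errno = save_errno;
}

/*
 * Hypothetical stand-in for the collector loop: iterate until the flag is
 * set.  To keep the sketch self-contained, the process sends SIGTERM to
 * itself after raise_after iterations.  Returns the number of iterations
 * performed, or -1 on bad input or if the handler cannot be installed.
 */
int
run_until_sigterm(int raise_after)
{
    int iterations = 0;

    if (raise_after < 1)
        return -1;

    got_sigterm = 0;
    if (signal(SIGTERM, shutdown_handler) == SIG_ERR)
        return -1;

    while (!got_sigterm)
    {
        iterations++;               /* one unit of "work" */
        if (iterations == raise_after)
            raise(SIGTERM);         /* handler runs before raise() returns */
    }

    /* A real process would save its final state here and then exit. */
    return iterations;
}
```

Because the handler does nothing but set the flag, the work of saving state stays in the normal control flow after the loop, which is where the patch writes the final stats file.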

From 4b7fa88be0d65bb922d1dd0330711692bef77042 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 17:05:46 +0900
Subject: [PATCH 3/6] dshash-based stats collector

The stats collector no longer uses files to distribute statistics. They
are now stored in a dynamic shared hash (dshash). Stats entries are
cached one by one to give a consistent snapshot during a transaction.
On the other hand, vacuum no longer takes a complete cache of stats.

This patch removes PG_STAT_TMP_DIR and the GUC stats_temp_directory.
That affects pg_basebackup and pg_stat_statements, but this patch fixes
only pg_basebackup. The fix for pg_stat_statements is done in another
patch.
---
 src/backend/postmaster/autovacuum.c           |   11 +-
 src/backend/postmaster/pgstat.c               | 1499 ++++++++++++-------------
 src/backend/replication/basebackup.c          |   36 -
 src/backend/storage/ipc/ipci.c                |    2 +
 src/backend/storage/lmgr/lwlock.c             |    3 +
 src/backend/storage/lmgr/lwlocknames.txt      |    1 +
 src/backend/utils/misc/guc.c                  |   41 -
 src/backend/utils/misc/postgresql.conf.sample |    1 -
 src/bin/initdb/initdb.c                       |    1 -
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |    2 +-
 src/include/pgstat.h                          |   48 +-
 src/include/storage/lwlock.h                  |    3 +
 12 files changed, 756 insertions(+), 892 deletions(-)
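
Note on "cached one by one": the idea is that the first lookup of an entry copies it from the shared table into a backend-local cache, and later lookups in the same transaction return that copy even if the shared value has moved on. Below is a minimal standalone sketch of that behavior. It is not code from the patch: fixed-size arrays stand in for dshash and dynahash, and every name in it (StatEntry, shared_set, snapshot_lookup, snapshot_clear) is hypothetical.

```c
#include <stddef.h>

#define MAX_ENTRIES 8

typedef struct StatEntry
{
    unsigned int id;        /* key, e.g. a table OID */
    long         counter;   /* some statistics value */
} StatEntry;

/* Stand-in for the shared hash that the collector updates. */
static StatEntry shared_stats[MAX_ENTRIES];
static size_t    shared_count = 0;

/* Backend-local cache, filled one entry at a time on first lookup. */
static StatEntry snapshot[MAX_ENTRIES];
static size_t    snapshot_count = 0;

/* Insert or update a shared entry; returns 1 on success, 0 if full. */
int
shared_set(unsigned int id, long counter)
{
    size_t i;

    for (i = 0; i < shared_count; i++)
    {
        if (shared_stats[i].id == id)
        {
            shared_stats[i].counter = counter;
            return 1;
        }
    }
    if (shared_count >= MAX_ENTRIES)
        return 0;
    shared_stats[shared_count].id = id;
    shared_stats[shared_count].counter = counter;
    shared_count++;
    return 1;
}

/* Return the cached copy, copying from the shared table on first use. */
const StatEntry *
snapshot_lookup(unsigned int id)
{
    size_t i;

    for (i = 0; i < snapshot_count; i++)
        if (snapshot[i].id == id)
            return &snapshot[i];

    for (i = 0; i < shared_count; i++)
    {
        if (shared_stats[i].id == id && snapshot_count < MAX_ENTRIES)
        {
            snapshot[snapshot_count] = shared_stats[i];
            return &snapshot[snapshot_count++];
        }
    }
    return NULL;                /* unknown entry, or cache is full */
}

/* Drop the cache, as would happen at the end of a transaction. */
void
snapshot_clear(void)
{
    snapshot_count = 0;
}
```

With this shape, a reader that touches only a few entries copies only those entries, whereas the file-based scheme read a whole stats file up front.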

diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 02e6d8131e..74e3ab6167 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2114,6 +2114,8 @@ do_autovacuum(void)
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
                                   &dovacuum, &doanalyze, &wraparound);
+        if (tabentry)
+            pfree(tabentry);
 
         /* Relations that need work are added to table_oids */
         if (dovacuum || doanalyze)
@@ -2193,10 +2195,11 @@ do_autovacuum(void)
         /* Fetch the pgstat entry for this table */
         tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
                                              shared, dbentry);
-
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
                                   &dovacuum, &doanalyze, &wraparound);
+        if (tabentry)
+            pfree(tabentry);
 
         /* ignore analyze for toast tables */
         if (dovacuum)
@@ -2768,12 +2771,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
     if (isshared)
     {
         if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
+            tabentry = backend_get_tab_entry(shared, relid, true);
     }
     else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
+        tabentry = backend_get_tab_entry(dbentry, relid, true);
 
     return tabentry;
 }
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3193178f32..ef283f17ab 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -77,22 +77,10 @@
 #define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
                                      * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
 #define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
                                      * failed statistics collector; in
                                      * seconds. */
 
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
 /* Minimum receive buffer size for the collector's socket. */
 #define PGSTAT_MIN_RCVBUF        (100 * 1024)
 
@@ -101,7 +89,6 @@
  * The initial size hints for the hash tables used in the collector.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
 #define PGSTAT_TAB_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
@@ -127,14 +114,6 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
- */
-char       *pgstat_stat_directory = NULL;
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
-
 /*
  * BgWriter global statistics counters (unused in other processes).
  * Stored directly in a stats message structure so it can be sent
@@ -154,6 +133,42 @@ static time_t last_pgstat_start_time;
 
 static bool pgStatRunningInCollector = false;
 
+/* Shared stats bootstrap information */
+typedef struct StatsShmemStruct {
+    dsa_handle stats_dsa_handle;
+    dshash_table_handle db_stats_handle;
+    dsa_pointer    global_stats;
+    dsa_pointer    archiver_stats;
+} StatsShmemStruct;
+
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
+static dshash_table *db_stats;
+static HTAB *snapshot_db_stats;
+
+/* dshash parameter for each type of table */
+static const dshash_parameters dsh_dbparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatDBEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_DB
+};
+static const dshash_parameters dsh_tblparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatTabEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_FUNC_TABLE
+};
+static const dshash_parameters dsh_funcparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatFuncEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_FUNC_TABLE
+};
+
 /*
  * Structures in which backends store per-table info that's waiting to be
  * sent to the collector.
@@ -250,12 +265,16 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Cluster wide statistics.
+ * Contains statistics that are not collected per database or per table.
+ * shared_* point to the statistics maintained by the stats collector and
+ * snapshot_* point to the snapshots taken in backends.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats *snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats *snapshot_globalStats;
+
 
 /*
  * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
@@ -285,23 +304,23 @@ static instr_time total_func_time;
 static pid_t pgstat_forkexec(void);
 #endif
 
+/* functions used in stats collector */
 static void pgstat_shutdown_handler(SIGNAL_ARGS);
 static void pgstat_quickdie_handler(SIGNAL_ARGS);
 static void pgstat_beshutdown_hook(int code, Datum arg);
 static void pgstat_sighup_handler(SIGNAL_ARGS);
 
 static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                     Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
-static void pgstat_read_current_status(void);
+static PgStat_StatTabEntry *pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create);
+static void pgstat_write_statsfiles(void);
+static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_statsfiles(void);
+static void pgstat_read_db_statsfile(Oid databaseid, dshash_table *tabhash, dshash_table *funchash);
 
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
+/* functions used in backends */
+static bool backend_snapshot_database_stats(void);
+static PgStat_StatFuncEntry *backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot);
+static void pgstat_read_current_status(void);
 
 static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
 static void pgstat_send_funcstats(void);
@@ -320,7 +339,6 @@ static const char *pgstat_get_wait_io(WaitEventIO w);
 static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
 static void pgstat_send(void *msg, int len);
 
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
 static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
 static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
 static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
@@ -685,7 +703,6 @@ pgstat_reset_remove_files(const char *directory)
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
     pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
 }
 
@@ -915,6 +932,81 @@ pgstat_send_funcstats(void)
 }
 
 
+/* ----------
+ * pgstat_attach_shared_stats() -
+ *
+ *    attach existing shared stats memory
+ * ----------
+ */
+static bool
+pgstat_attach_shared_stats(void)
+{
+    MemoryContext oldcontext;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID || area != NULL)
+    {
+        LWLockRelease(StatsLock);
+        return area != NULL;
+    }
+
+    /* top-level variables; they live for the lifetime of the process */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    area = dsa_attach(StatsShmem->stats_dsa_handle);
+    dsa_pin_mapping(area);
+    db_stats = dshash_attach(area, &dsh_dbparams,
+                             StatsShmem->db_stats_handle, 0);
+    snapshot_db_stats = NULL;
+    shared_globalStats = (PgStat_GlobalStats *)
+        dsa_get_address(area, StatsShmem->global_stats);
+    shared_archiverStats =    (PgStat_ArchiverStats *)
+        dsa_get_address(area, StatsShmem->archiver_stats);
+    MemoryContextSwitchTo(oldcontext);
+    LWLockRelease(StatsLock);
+
+    return true;
+}
+
+/* ----------
+ * pgstat_create_shared_stats() -
+ *
+ *    create shared stats memory
+ * ----------
+ */
+static void
+pgstat_create_shared_stats(void)
+{
+    MemoryContext oldcontext;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
+
+    /* lives for the lifetime of the process */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    area = dsa_create(LWTRANCHE_STATS_DSA);
+    dsa_pin_mapping(area);
+
+    db_stats = dshash_create(area, &dsh_dbparams, 0);
+
+    /* create shared area and write bootstrap information */
+    StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+    StatsShmem->global_stats =
+        dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+    StatsShmem->archiver_stats =
+        dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+    StatsShmem->db_stats_handle =
+        dshash_get_hash_table_handle(db_stats);
+
+    /* connect to the memory */
+    snapshot_db_stats = NULL;
+    shared_globalStats = (PgStat_GlobalStats *)
+        dsa_get_address(area, StatsShmem->global_stats);
+    shared_archiverStats = (PgStat_ArchiverStats *)
+        dsa_get_address(area, StatsShmem->archiver_stats);
+    MemoryContextSwitchTo(oldcontext);
+    LWLockRelease(StatsLock);
+}
+
 /* ----------
  * pgstat_vacuum_stat() -
  *
@@ -928,6 +1020,8 @@ pgstat_vacuum_stat(void)
     PgStat_MsgTabpurge msg;
     PgStat_MsgFuncpurge f_msg;
     HASH_SEQ_STATUS hstat;
+    dshash_table *dshtable;
+    dshash_seq_status dshstat;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatFuncEntry *funcentry;
@@ -936,11 +1030,9 @@ pgstat_vacuum_stat(void)
     if (pgStatSock == PGINVALID_SOCKET)
         return;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* If not done for this transaction, take a snapshot of stats */
+    if (!backend_snapshot_database_stats())
+        return;
 
     /*
      * Read pg_database and make a list of OIDs of all existing databases
@@ -951,7 +1043,7 @@ pgstat_vacuum_stat(void)
      * Search the database hash table for dead databases and tell the
      * collector to drop them.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
+    hash_seq_init(&hstat, snapshot_db_stats);
     while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
     {
         Oid            dbid = dbentry->databaseid;
@@ -970,12 +1062,12 @@ pgstat_vacuum_stat(void)
     /*
      * Lookup our own database entry; if not found, nothing more to do.
      */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
+    dbentry = (PgStat_StatDBEntry *) hash_search(snapshot_db_stats,
                                                  (void *) &MyDatabaseId,
                                                  HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
+    if (dbentry == NULL)
         return;
-
+    
     /*
      * Similarly to above, make a list of all known relations in this DB.
      */
@@ -988,9 +1080,11 @@ pgstat_vacuum_stat(void)
 
     /*
      * Check for all tables listed in stats hashtable if they still exist.
+     * Stats cache is useless here so directly search the shared hash.
      */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+    dshtable = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&dshstat, dshtable, false);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            tabid = tabentry->tableid;
 
@@ -1019,6 +1113,7 @@ pgstat_vacuum_stat(void)
             msg.m_nentries = 0;
         }
     }
+    dshash_detach(dshtable);
 
     /*
      * Send the rest
@@ -1040,8 +1135,8 @@ pgstat_vacuum_stat(void)
      * Now repeat the above steps for functions.  However, we needn't bother
      * in the common case where no function stats are being collected.
      */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
+    dshtable = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+    if (dshash_get_num_entries(dshtable) > 0)
     {
         htab = pgstat_collect_oids(ProcedureRelationId);
 
@@ -1049,8 +1144,8 @@ pgstat_vacuum_stat(void)
         f_msg.m_databaseid = MyDatabaseId;
         f_msg.m_nentries = 0;
 
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
+        dshash_seq_init(&dshstat, dshtable, false);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&dshstat)) != NULL)
         {
             Oid            funcid = funcentry->functionid;
 
@@ -1091,6 +1186,7 @@ pgstat_vacuum_stat(void)
 
         hash_destroy(htab);
     }
+    dshash_detach(dshtable);
 }
 
 
@@ -1457,24 +1553,6 @@ pgstat_ping(void)
     pgstat_send(&msg, sizeof(msg));
 }
 
-/* ----------
- * pgstat_send_inquiry() -
- *
- *    Notify collector that we need fresh data.
- * ----------
- */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
 
 /*
  * Initialize function call usage data.
@@ -2289,18 +2367,15 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    if (!backend_snapshot_database_stats())
+        return NULL;
 
     /*
      * Lookup the requested database; return NULL if not found
      */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    return (PgStat_StatDBEntry *)hash_search(snapshot_db_stats,
+                                             &dbid, HASH_FIND, NULL);
 }

 
@@ -2316,47 +2391,28 @@ pgstat_fetch_stat_dbentry(Oid dbid)
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* Lookup our database, then look in its table hash table. */
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    if (dbentry == NULL)
+        return NULL;
 
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = backend_get_tab_entry(dbentry, relid, false);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    dbentry = pgstat_fetch_stat_dbentry(InvalidOid);
+    if (dbentry == NULL)
+        return NULL;
+
+    tabentry = backend_get_tab_entry(dbentry, relid, false);
+    if (tabentry != NULL)
+        return tabentry;
 
     return NULL;
 }
@@ -2375,17 +2431,18 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     PgStat_StatDBEntry *dbentry;
     PgStat_StatFuncEntry *funcentry = NULL;
 
-    /* load the stats file if needed */
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    if (!backend_snapshot_database_stats())
+        return NULL;
 
-    /* Lookup our database, then find the requested function.  */
+    /* Lookup our database, then find the requested function */
     dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
+    if (dbentry == NULL)
+        return NULL;
+
+    funcentry = backend_get_func_etnry(dbentry, func_id, false);
+    if (funcentry == NULL)
+        return NULL;
 
     return funcentry;
 }
@@ -2461,9 +2518,11 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    if (!backend_snapshot_database_stats())
+        return NULL;
 
-    return &archiverStats;
+    return snapshot_archiverStats;
 }
 
 
@@ -2478,9 +2537,11 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    if (!backend_snapshot_database_stats())
+        return NULL;
 
-    return &globalStats;
+    return snapshot_globalStats;
 }
 
 
@@ -4186,18 +4247,14 @@ PgstatCollectorMain(void)
     pqsignal(SIGTTOU, SIG_DFL);
     pqsignal(SIGCONT, SIG_DFL);
     pqsignal(SIGWINCH, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
 
-    /*
-     * Identify myself via ps
-     */
-    init_ps_display("stats collector", "", "", "");
+    PG_SETMASK(&UnBlockSig);
 
     /*
      * Read in existing stats files or initialize the stats to zero.
      */
     pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
+    pgstat_read_statsfiles();
 
     /*
      * Loop to process messages until we get SIGQUIT or detect ungraceful
@@ -4239,13 +4296,6 @@ PgstatCollectorMain(void)
                 ProcessConfigFile(PGC_SIGHUP);
             }
 
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
             /*
              * Try to receive and process a message.  This will not block,
              * since the socket is set to non-blocking mode.
@@ -4294,10 +4344,6 @@ PgstatCollectorMain(void)
                 case PGSTAT_MTYPE_DUMMY:
                     break;
 
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry((PgStat_MsgInquiry *) &msg, len);
-                    break;
-
                 case PGSTAT_MTYPE_TABSTAT:
                     pgstat_recv_tabstat((PgStat_MsgTabstat *) &msg, len);
                     break;
@@ -4386,9 +4432,7 @@ PgstatCollectorMain(void)
          * fixes that, so don't sleep indefinitely.  This is a crock of the
          * first water, but until somebody wants to debug exactly what's
          * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
+         * timeout matches our pre-9.2 behavior.
          */
         wr = WaitLatchOrSocket(MyLatch,
                                WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
@@ -4408,7 +4452,7 @@ PgstatCollectorMain(void)
     /*
      * Save the final stats to reuse at next startup.
      */
-    pgstat_write_statsfiles(true, true);
+    pgstat_write_statsfiles();
 
     exit(0);
 }
@@ -4466,14 +4510,14 @@ pgstat_shutdown_handler(SIGNAL_ARGS)
 }
 
 /*
- * Subroutine to clear stats in a database entry
+ * Subroutine to reset stats in a shared database entry
  *
  * Tables and functions hashes are initialized to empty.
  */
 static void
 reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
 {
-    HASHCTL        hash_ctl;
+    dshash_table *tbl;
 
     dbentry->n_xact_commit = 0;
     dbentry->n_xact_rollback = 0;
@@ -4499,20 +4543,14 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
     dbentry->stat_reset_timestamp = GetCurrentTimestamp();
     dbentry->stats_timestamp = 0;
 
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
 
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
+    tbl = dshash_create(area, &dsh_tblparams, 0);
+    dbentry->tables = dshash_get_hash_table_handle(tbl);
+    dshash_detach(tbl);
+
+    tbl = dshash_create(area, &dsh_funcparams, 0);
+    dbentry->functions = dshash_get_hash_table_handle(tbl);
+    dshash_detach(tbl);
 }
 
 /*
@@ -4525,15 +4563,18 @@ pgstat_get_db_entry(Oid databaseid, bool create)
 {
     PgStat_StatDBEntry *result;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
+
+    Assert(pgStatRunningInCollector);
 
     /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
+    if (create)
+        result = (PgStat_StatDBEntry *)
+            dshash_find_or_insert(db_stats, &databaseid, &found);
+    else
+        result = (PgStat_StatDBEntry *) dshash_find(db_stats, &databaseid, true);
 
-    if (!create && !found)
-        return NULL;
+    if (!create)
+        return result;
 
     /*
      * If not found, initialize the new one.  This creates empty hash tables
@@ -4545,23 +4586,23 @@ pgstat_get_db_entry(Oid databaseid, bool create)
     return result;
 }
 
-
 /*
  * Lookup the hash table entry for the specified table. If no hash
  * table entry exists, initialize it, if the create parameter is true.
  * Else, return NULL.
  */
 static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create)
 {
     PgStat_StatTabEntry *result;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
 
     /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
+    if (create)
+        result = (PgStat_StatTabEntry *)
+            dshash_find_or_insert(table, &tableoid, &found);
+    else
+        result = (PgStat_StatTabEntry *) dshash_find(table, &tableoid, false);
 
     if (!create && !found)
         return NULL;
@@ -4599,25 +4640,20 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
  * pgstat_write_statsfiles() -
  *        Write the global statistics file, as well as requested DB files.
  *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
  *    When 'allDbs' is false, only the requested databases (listed in
  *    pending_write_requests) will be written; otherwise, all databases
  *    will be written.
  * ----------
  */
 static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+pgstat_write_statsfiles(void)
 {
-    HASH_SEQ_STATUS hstat;
+    dshash_seq_status hstat;
     PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
@@ -4638,7 +4674,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
 
     /*
      * Write the file header --- currently just a format ID.
@@ -4650,32 +4686,29 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Write global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write archiver stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Walk through the database table.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_init(&hstat, db_stats, false);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&hstat)) != NULL)
     {
         /*
          * Write out the table and function stats for this DB into the
          * appropriate per-DB stat file, if required.
          */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
+        pgstat_write_db_statsfile(dbentry);
 
         /*
          * Write out the DB entry. We don't write the tables or functions
@@ -4719,9 +4752,6 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
         unlink(tmpfile);
     }
 
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
     /*
      * Now throw away the list of requests.  Note that requests sent after we
      * started the write are still waiting on the network socket.
@@ -4735,15 +4765,14 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
  * of length len.
  */
 static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
+get_dbstat_filename(bool tempname, Oid databaseid,
                     char *filename, int len)
 {
     int            printed;
 
     /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
     printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
                        databaseid,
                        tempname ? "tmp" : "stat");
     if (printed > len)
@@ -4761,10 +4790,10 @@ get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
  * ----------
  */
 static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
+pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry)
 {
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
+    dshash_seq_status tstat;
+    dshash_seq_status fstat;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatFuncEntry *funcentry;
     FILE       *fpout;
@@ -4773,9 +4802,10 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     int            rc;
     char        tmpfile[MAXPGPATH];
     char        statfile[MAXPGPATH];
+    dshash_table *tbl;
 
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -4802,24 +4832,28 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     /*
      * Walk through the database's access stats per table.
      */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&tstat, tbl, false);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&tstat)) != NULL)
     {
         fputc('T', fpout);
         rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_detach(tbl);
 
     /*
      * Walk through the database's function stats table.
      */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+    tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+    dshash_seq_init(&fstat, tbl, false);
+    while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&fstat)) != NULL)
     {
         fputc('F', fpout);
         rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_detach(tbl);
 
     /*
      * No more output to be done. Close the temp file and replace the old
@@ -4853,76 +4887,45 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
 }
 
 /* ----------
  * pgstat_read_statsfiles() -
  *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ *    Reads in some existing statistics collector files into the shared stats
+ *    hash.
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+static void
+pgstat_read_statsfiles(void)
 {
     PgStat_StatDBEntry *dbentry;
     PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    dshash_table *tblstats = NULL;
+    dshash_table *funcstats = NULL;
 
+    Assert(pgStatRunningInCollector);
     /*
      * The tables will live in pgStatLocalContext.
      */
     pgstat_setup_memcxt();
 
     /*
-     * Create the DB hashtable
+     * Create the DB hashtable and global stats area
      */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
+    pgstat_create_shared_stats();
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp = shared_globalStats->stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
@@ -4940,7 +4943,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -4957,11 +4960,11 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     /*
      * Read global stats struct
      */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) != sizeof(*shared_globalStats))
     {
         ereport(pgStatRunningInCollector ? LOG : WARNING,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
+        memset(shared_globalStats, 0, sizeof(*shared_globalStats));
         goto done;
     }
 
@@ -4972,17 +4975,16 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
      * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
      * an unusual scenario.
      */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
+    shared_globalStats->stats_timestamp = 0;
 
     /*
      * Read archiver stats struct
      */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) != sizeof(*shared_archiverStats))
     {
         ereport(pgStatRunningInCollector ? LOG : WARNING,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
+        memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
         goto done;
     }
 
@@ -5011,12 +5013,12 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 /*
                  * Add to the DB hash
                  */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
+                dbentry = (PgStat_StatDBEntry *)
+                    dshash_find_or_insert(db_stats, (void *) &dbbuf.databaseid,
+                                          &found);
                 if (found)
                 {
+                    dshash_release_lock(db_stats, dbentry);
                     ereport(pgStatRunningInCollector ? LOG : WARNING,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
@@ -5024,8 +5026,8 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 }
 
                 memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
+                dbentry->tables = DSM_HANDLE_INVALID;
+                dbentry->functions = DSM_HANDLE_INVALID;
 
                 /*
                  * In the collector, disregard the timestamp we read from the
@@ -5033,47 +5035,23 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                  * stats file immediately upon the first request from any
                  * backend.
                  */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+                Assert(pgStatRunningInCollector);
+                dbentry->stats_timestamp = 0;
 
                 /*
                  * If requested, read the data from the database-specific
                  * file.  Otherwise we just leave the hashtables empty.
                  */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
+                tblstats = dshash_create(area, &dsh_tblparams, 0);
+                dbentry->tables = dshash_get_hash_table_handle(tblstats);
+                funcstats = dshash_create(area, &dsh_funcparams, 0);
+                dbentry->functions =
+                    dshash_get_hash_table_handle(funcstats);
+                dshash_release_lock(db_stats, dbentry);
+                pgstat_read_db_statsfile(dbbuf.databaseid,
+                                         tblstats, funcstats);
+                dshash_detach(tblstats);
+                dshash_detach(funcstats);
                 break;
 
             case 'E':
@@ -5090,34 +5068,47 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 done:
     FreeFile(fpin);
 
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 
-    return dbhash;
+    return;
 }
 
 
+Size
+StatsShmemSize(void)
+{
+    return sizeof(dsa_handle);
+}
+
+void
+StatsShmemInit(void)
+{
+    bool    found;
+
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+    }
+    else
+        Assert(found);
+}
+
 /* ----------
  * pgstat_read_db_statsfile() -
  *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
+ *    Reads in the permanent statistics collector file and creates shared
+ *    statistics tables. The file is removed after reading.
  * ----------
  */
 static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
+pgstat_read_db_statsfile(Oid databaseid,
+                         dshash_table *tabhash, dshash_table *funchash)
 {
     PgStat_StatTabEntry *tabentry;
     PgStat_StatTabEntry tabbuf;
@@ -5128,7 +5119,8 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     bool        found;
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
+    Assert(pgStatRunningInCollector);
+    get_dbstat_filename(false, databaseid, statfile, MAXPGPATH);
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
@@ -5187,12 +5179,13 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (tabhash == NULL)
                     break;
 
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
+                tabentry = (PgStat_StatTabEntry *)
+                    dshash_find_or_insert(tabhash,
+                                          (void *) &tabbuf.tableid, &found);
 
                 if (found)
                 {
+                    dshash_release_lock(tabhash, tabentry);
                     ereport(pgStatRunningInCollector ? LOG : WARNING,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
@@ -5200,6 +5193,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 }
 
                 memcpy(tabentry, &tabbuf, sizeof(tabbuf));
+                dshash_release_lock(tabhash, tabentry);
                 break;
 
                 /*
@@ -5221,9 +5215,9 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (funchash == NULL)
                     break;
 
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
+                funcentry = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert(funchash,
+                                          (void *) &funcbuf.functionid, &found);
 
                 if (found)
                 {
@@ -5234,6 +5228,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 }
 
                 memcpy(funcentry, &funcbuf, sizeof(funcbuf));
+                dshash_release_lock(funchash, funcentry);
                 break;
 
                 /*
@@ -5253,276 +5248,319 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
 done:
     FreeFile(fpin);
 
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 }
 
 /* ----------
- * pgstat_read_db_statsfile_timestamp() -
+ * backend_clean_snapshot_callback() -
  *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
+ *    This is usually called with arg = NULL when the memory context in which
+ *    the current snapshot was taken is reset or deleted.  Don't bother
+ *    releasing memory in that case.
  * ----------
  */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
+static void
+backend_clean_snapshot_callback(void *arg)
 {
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    if (arg != NULL)
     {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
+        /* explicitly called, so explicitly free resources */
+        if (snapshot_globalStats)
+            pfree(snapshot_globalStats);
 
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
+        if (snapshot_archiverStats)
+            pfree(snapshot_archiverStats);
 
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
+        if (snapshot_db_stats)
         {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
+            HASH_SEQ_STATUS seq;
+            PgStat_StatDBEntry *dbent;
 
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
+            hash_seq_init(&seq, snapshot_db_stats);
+            while ((dbent = hash_seq_search(&seq)) != NULL)
+            {
+                if (dbent->snapshot_tables)
+                    hash_destroy(dbent->snapshot_tables);
+                if (dbent->snapshot_functions)
+                    hash_destroy(dbent->snapshot_functions);
+            }
+            hash_destroy(snapshot_db_stats);
         }
     }
 
-done:
-    FreeFile(fpin);
-    return true;
+    /* mark the resources as not allocated */
+    snapshot_globalStats = NULL;
+    snapshot_archiverStats = NULL;
+    snapshot_db_stats = NULL;
 }
 
 /*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
+ * create_local_stats_hash() -
+ *
+ * Creates a dynahash used for table/function stats cache
  */
-static void
-backend_read_statsfile(void)
+static HTAB *
+create_local_stats_hash(size_t keysize, size_t entrysize, int nentries)
 {
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
+    HTAB *result;
+    HASHCTL ctl;
 
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
+    ctl.keysize        = keysize;
+    ctl.entrysize    = entrysize;
+    result = hash_create("local database stats hash",
+                         nentries, &ctl, HASH_ELEM | HASH_BLOBS);
 
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
+    return result;
 }
 
+/*
+ * take_db_stats_snapshot() -
+ *
+ * Takes a snapshot of the database stats hash. Unlike table/function stats,
+ * we make a full copy of the shared stats hash.
+ */
+static void
+take_db_stats_snapshot(void)
+{
+    HASHCTL ctl;
+    dshash_seq_status s;
+    int num_entries;
+    void *ps;
+
+    Assert(!snapshot_db_stats);
+    Assert(db_stats);
+
+    ctl.keysize = dsh_dbparams.key_size;
+    ctl.entrysize = dsh_dbparams.entry_size;
+
+    num_entries = dshash_get_num_entries(db_stats);
+
+    snapshot_db_stats = hash_create("local database stats hash",
+                                    num_entries, &ctl, HASH_ELEM | HASH_BLOBS);
+
+    /* Copy all entries from dshash */
+    dshash_seq_init(&s, db_stats, true);
+    while ((ps = dshash_seq_next(&s)) != NULL)
+    {
+        void *pd = hash_search(snapshot_db_stats, ps, HASH_ENTER, NULL);
+        memcpy(pd, ps, ctl.entrysize);
+    }
+}
+
+/*
+ * snapshot_statentry() - Find an entry in the source dshash.
+ *
+ * Returns the entry for the key, or NULL if not found. If dest is a valid
+ * pointer, it is used as a local hash so that requests for the same key
+ * return the same entry within a transaction. Otherwise the returned entry
+ * is a palloc'ed copy, which may differ for every request.
+ *
+ * If key is InvalidOid, just builds a local cache filled with all existing
+ * stats entries and returns NULL.
+ */
+static void *
+snapshot_statentry(HTAB **dest,
+                   dshash_table_handle dsh_handle,
+                   const dshash_parameters *dsh_params,
+                   Oid key, size_t keysize, size_t entrysize)
+{
+    void *lentry = NULL;
+
+    /* Build a fully populated cache in this case */
+    if (key == InvalidOid)
+    {
+        dshash_table *t;
+        dshash_seq_status s;
+        void *ps;
+        int num_entries;
+
+        /* a destination hash is required for full caching */
+        Assert(dest);
+
+        /* destroy existing local hash */
+
+        if (*dest)
+        {
+            hash_destroy(*dest);
+            *dest = NULL;
+        }
+
+        t = dshash_attach(area, dsh_params, dsh_handle, NULL);
+
+        /* No need to create new hash if no entry exists */
+        num_entries = dshash_get_num_entries(t);
+        if (num_entries == 0)
+        {
+            dshash_detach(t);
+            return NULL;
+        }
+
+        *dest = create_local_stats_hash(keysize, entrysize, num_entries);
+
+        dshash_seq_init(&s, t, true);
+        while ((ps = dshash_seq_next(&s)) != NULL)
+        {
+            bool found;
+            void *pd = hash_search(*dest, ps, HASH_ENTER, &found);
+            Assert(!found);
+            memcpy(pd, ps, entrysize);
+            /* dshash_seq_next releases the entry lock automatically */
+        }
+        dshash_detach(t);
+
+        return NULL;
+    }
+    else if (dest)
+    {
+        bool found;
+
+        /*
+         * Create a new hash for the entry cache. Use an arbitrary initial
+         * size since we don't know how this hash will grow.
+         */
+        if (!*dest)
+            *dest = create_local_stats_hash(keysize, entrysize, 32);
+
+        lentry = hash_search(*dest, &key, HASH_ENTER, &found);
+        if (!found)
+        {
+            dshash_table *t = dshash_attach(area, dsh_params, dsh_handle, NULL);
+            void *sentry = dshash_find(t, &key, false);
+
+            /*
+             * We expect that the stats for the specified database exist in
+             * most cases.
+             */
+            if (!sentry)
+            {
+                hash_search(*dest, &key, HASH_REMOVE, NULL);
+                dshash_detach(t);
+                return NULL;
+            }
+            memcpy(lentry, sentry, entrysize);
+            dshash_release_lock(t, sentry);
+            dshash_detach(t);
+        }
+    }
+    else
+    {
+        /*
+         * The caller doesn't want caching. Just make a copy of the entry and
+         * return it.
+         */
+        dshash_table *t;
+        void *sentry;
+
+        t = dshash_attach(area, dsh_params, dsh_handle, NULL);
+        sentry = dshash_find(t, &key, false);
+        if (sentry)
+        {
+            lentry = palloc(entrysize);
+            memcpy(lentry, sentry, entrysize);
+            dshash_release_lock(t, sentry);
+        }
+        dshash_detach(t);
+    }
+    
+    return lentry;
+}
+
+/*
+ * backend_snapshot_database_stats() -
+ *
+ * Makes a local copy of global and database stats table if not already done.
+ * They will be kept until pgstat_clear_snapshot() is called or the end of the
+ * current memory context (typically TopTransactionContext).
+ * Returns false if the shared stats are not created yet.
+ */
+static bool
+backend_snapshot_database_stats(void)
+{
+    MemoryContext oldcontext;
+    MemoryContextCallback *mcxt_cb;
+
+    /* Nothing to do if already done */
+    if (snapshot_globalStats)
+        return true;
+
+    Assert(!pgStatRunningInCollector);
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    if (!pgstat_attach_shared_stats())
+    {
+        MemoryContextSwitchTo(oldcontext);
+        return false;
+    }
+    MemoryContextSwitchTo(oldcontext);
+
+    Assert(snapshot_archiverStats == NULL);
+    Assert(snapshot_db_stats == NULL);
+
+    /*
+     * The snapshot lives as long as the current top transaction if any,
+     * otherwise for the lifetime of the current memory context.
+     */
+    if (IsTransactionState())
+        MemoryContextSwitchTo(TopTransactionContext);
+
+    /* global stats can be just copied  */
+    snapshot_globalStats = palloc(sizeof(PgStat_GlobalStats));
+    memcpy(snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+
+    snapshot_archiverStats = palloc(sizeof(PgStat_ArchiverStats));
+    memcpy(snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+
+    /* set the timestamp of this snapshot */
+    snapshot_globalStats->stats_timestamp = GetCurrentTimestamp();
+
+    take_db_stats_snapshot();
+
+    /* register callback to clear snapshot */
+    mcxt_cb = (MemoryContextCallback *)palloc(sizeof(MemoryContextCallback));
+    mcxt_cb->func = backend_clean_snapshot_callback;
+    mcxt_cb->arg = NULL;
+    MemoryContextRegisterResetCallback(CurrentMemoryContext, mcxt_cb);
+    MemoryContextSwitchTo(oldcontext);
+
+    return true;
+}
+
+/* ----------
+ * backend_get_tab_entry() -
+ *
+ *    Find a table stats entry on backends. The returned entries are cached
+ *    until transaction end. If oneshot is true, they are not cached and are
+ *    returned in palloc'ed memory.
+ */
+PgStat_StatTabEntry *
+backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid reloid, bool oneshot)
+{
+    /* take a local snapshot if we don't have one */
+    return snapshot_statentry(oneshot ? NULL : &dbent->snapshot_tables,
+                              dbent->tables, &dsh_tblparams,
+                              reloid,
+                              dsh_tblparams.key_size,
+                              dsh_tblparams.entry_size);
+}
+
+/* ----------
+ * backend_get_func_entry() -
+ *
+ *    Find a function stats entry on backends. The returned entries are cached
+ *    until transaction end. If oneshot is true, they are not cached and are
+ *    returned in palloc'ed memory.
+ */
+static PgStat_StatFuncEntry *
+backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot)
+{
+    return snapshot_statentry(oneshot ? NULL : &dbent->snapshot_functions,
+                              dbent->functions, &dsh_funcparams,
+                              funcid,
+                              dsh_funcparams.key_size,
+                              dsh_funcparams.entry_size);
+}
 
 /* ----------
  * pgstat_setup_memcxt() -
@@ -5553,6 +5591,8 @@ pgstat_setup_memcxt(void)
 void
 pgstat_clear_snapshot(void)
 {
+    int param = 0;    /* only the address is significant */
+
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
         MemoryContextDelete(pgStatLocalContext);
@@ -5562,99 +5602,12 @@ pgstat_clear_snapshot(void)
     pgStatDBHash = NULL;
     localBackendStatusTable = NULL;
     localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
 
     /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
+     * The non-NULL parameter informs the function that it is not called
+     * from a MemoryContextCallback.
      */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
+    backend_clean_snapshot_callback(&param);
 }
 
 
@@ -5667,6 +5620,7 @@ pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
 static void
 pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
 {
+    dshash_table *tabhash;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
     int            i;
@@ -5682,6 +5636,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
     dbentry->n_block_read_time += msg->m_block_read_time;
     dbentry->n_block_write_time += msg->m_block_write_time;
 
+    tabhash = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
     /*
      * Process all table entries in the message.
      */
@@ -5689,9 +5644,8 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
     {
         PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
 
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
+        tabentry = (PgStat_StatTabEntry *)
+            dshash_find_or_insert(tabhash, (void *) &(tabmsg->t_id), &found);
 
         if (!found)
         {
@@ -5750,6 +5704,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
         tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
         /* Likewise for n_dead_tuples */
         tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
+        dshash_release_lock(tabhash, tabentry);
 
         /*
          * Add per-table stats to the per-database entry, too.
@@ -5762,6 +5717,8 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
         dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
         dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
     }
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 
@@ -5774,27 +5731,33 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
 static void
 pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
 {
+    dshash_table *tbl;
     PgStat_StatDBEntry *dbentry;
     int            i;
 
     dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
     /*
      * No need to purge if we don't even know the database.
      */
-    if (!dbentry || !dbentry->tables)
+    if (!dbentry || dbentry->tables == DSM_HANDLE_INVALID)
+    {
+        if (dbentry)
+            dshash_release_lock(db_stats, dbentry);
         return;
+    }
 
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
     /*
      * Process all table entries in the message.
      */
     for (i = 0; i < msg->m_nentries; i++)
     {
         /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
+        (void) dshash_delete_key(tbl, (void *) &(msg->m_tableid[i]));
     }
+
+    dshash_detach(tbl);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 
@@ -5820,23 +5783,20 @@ pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
      */
     if (dbentry)
     {
-        char        statfile[MAXPGPATH];
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+        }
 
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
+        dshash_delete_entry(db_stats, (void *)dbentry);
     }
 }
 
@@ -5864,19 +5824,28 @@ pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
      * We simply throw away all the database's table entries by recreating a
      * new hash table for them.
      */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
+    if (dbentry->tables != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_destroy(t);
+        dbentry->tables = DSM_HANDLE_INVALID;
+    }
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_destroy(t);
+        dbentry->functions = DSM_HANDLE_INVALID;
+    }
 
     /*
      * Reset database-level stats, too.  This creates empty hash tables for
      * tables and functions.
      */
     reset_dbentry_counters(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -5891,14 +5860,14 @@ pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
     if (msg->m_resettarget == RESET_BGWRITER)
     {
         /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
+        memset(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
     }
     else if (msg->m_resettarget == RESET_ARCHIVER)
     {
         /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
+        memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
     }
 
     /*
@@ -5928,11 +5897,19 @@ pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
 
     /* Remove object if it exists, ignore it if not */
     if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_delete_key(t, (void *) &(msg->m_objectid));
+    }
     else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_delete_key(t, (void *) &(msg->m_objectid));
+    }
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -5952,6 +5929,8 @@ pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
     dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
 
     dbentry->last_autovac_time = msg->m_start_time;
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -5965,13 +5944,13 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
 {
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
-
+    dshash_table *table;
     /*
      * Store the data in the table's hashtable entry.
      */
     dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, msg->m_tableoid, true);
 
     tabentry->n_live_tuples = msg->m_live_tuples;
     tabentry->n_dead_tuples = msg->m_dead_tuples;
@@ -5986,6 +5965,9 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
         tabentry->vacuum_timestamp = msg->m_vacuumtime;
         tabentry->vacuum_count++;
     }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -5999,13 +5981,15 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
 {
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
 
     /*
      * Store the data in the table's hashtable entry.
      */
     dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
 
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, msg->m_tableoid, true);
 
     tabentry->n_live_tuples = msg->m_live_tuples;
     tabentry->n_dead_tuples = msg->m_dead_tuples;
@@ -6028,6 +6012,9 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
         tabentry->analyze_timestamp = msg->m_analyzetime;
         tabentry->analyze_count++;
     }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 
@@ -6043,18 +6030,18 @@ pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
     if (msg->m_failed)
     {
         /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
+        ++shared_archiverStats->failed_count;
+        memcpy(shared_archiverStats->last_failed_wal, msg->m_xlog,
+               sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = msg->m_timestamp;
     }
     else
     {
         /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
+        ++shared_archiverStats->archived_count;
+        memcpy(shared_archiverStats->last_archived_wal, msg->m_xlog,
+               sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = msg->m_timestamp;
     }
 }
 
@@ -6067,16 +6054,16 @@ pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
 static void
 pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
 {
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
+    shared_globalStats->timed_checkpoints += msg->m_timed_checkpoints;
+    shared_globalStats->requested_checkpoints += msg->m_requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += msg->m_checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += msg->m_checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += msg->m_buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += msg->m_buf_written_clean;
+    shared_globalStats->maxwritten_clean += msg->m_maxwritten_clean;
+    shared_globalStats->buf_written_backend += msg->m_buf_written_backend;
+    shared_globalStats->buf_fsync_backend += msg->m_buf_fsync_backend;
+    shared_globalStats->buf_alloc += msg->m_buf_alloc;
 }
 
 /* ----------
@@ -6117,6 +6104,8 @@ pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
             dbentry->n_conflict_startup_deadlock++;
             break;
     }
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -6133,6 +6122,8 @@ pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
     dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
 
     dbentry->n_deadlocks++;
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -6150,6 +6141,8 @@ pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
 
     dbentry->n_temp_bytes += msg->m_filesize;
     dbentry->n_temp_files += 1;
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -6161,6 +6154,7 @@ pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
 static void
 pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
 {
+    dshash_table *t;
     PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
     PgStat_StatDBEntry *dbentry;
     PgStat_StatFuncEntry *funcentry;
@@ -6169,14 +6163,14 @@ pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
 
     dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
 
+    t = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
     /*
      * Process all function entries in the message.
      */
     for (i = 0; i < msg->m_nentries; i++, funcmsg++)
     {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
+        funcentry = (PgStat_StatFuncEntry *)
+            dshash_find_or_insert(t, (void *) &(funcmsg->f_id), &found);
 
         if (!found)
         {
@@ -6197,7 +6191,11 @@ pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
             funcentry->f_total_time += funcmsg->f_total_time;
             funcentry->f_self_time += funcmsg->f_self_time;
         }
+        dshash_release_lock(t, funcentry);
     }
+
+    dshash_detach(t);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -6209,6 +6207,7 @@ pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
 static void
 pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
 {
+    dshash_table *t;
     PgStat_StatDBEntry *dbentry;
     int            i;
 
@@ -6217,60 +6216,20 @@ pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
     /*
      * No need to purge if we don't even know the database.
      */
-    if (!dbentry || !dbentry->functions)
+    if (!dbentry || dbentry->functions == DSM_HANDLE_INVALID)
         return;
 
+    t = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
     /*
      * Process all function entries in the message.
      */
     for (i = 0; i < msg->m_nentries; i++)
     {
         /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
+        dshash_delete_key(t, (void *) &(msg->m_functionid[i]));
     }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
+    dshash_detach(t);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /*
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 3f1eae38a9..a517bf62b6 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -77,9 +77,6 @@ static bool is_checksummed_file(const char *fullpath, const char *filename);
 /* Was the backup currently in-progress initiated in recovery mode? */
 static bool backup_started_in_recovery = false;
 
-/* Relative path of temporary statistics directory */
-static char *statrelpath = NULL;
-
 /*
  * Size of each block sent into the tar stream for larger files.
  */
@@ -121,13 +118,6 @@ static bool noverify_checksums = false;
  */
 static const char *excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    PG_STAT_TMP_DIR,
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another master. See backup.sgml
@@ -223,11 +213,8 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -254,18 +241,6 @@ perform_base_backup(basebackup_options *opt)
 
         SendXlogRecPtrResult(startptr, starttli);
 
-        /*
-         * Calculate the relative path of temporary statistics directory in
-         * order to skip the files which are located in that directory later.
-         */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
-
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
         ti->size = opt->progress ? sendDir(".", 1, true, tablespaces, true) : -1;
@@ -1174,17 +1149,6 @@ sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces,
         if (excludeFound)
             continue;
 
-        /*
-         * Exclude contents of directory specified by statrelpath if not set
-         * to the default (pg_stat_tmp) which is caught in the loop above.
-         */
-        if (statrelpath != NULL && strcmp(pathbuf, statrelpath) == 0)
-        {
-            elog(DEBUG1, "contents of directory \"%s\" excluded from backup", statrelpath);
-            size += _tarWriteDir(pathbuf, basepathlen, &statbuf, sizeonly);
-            continue;
-        }
-
         /*
          * We can skip pg_wal, the WAL segments need to be fetched from the
          * WAL archive anyway. But include it as an empty directory anyway, so
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 0c86a581c0..ee30e8a14f 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -150,6 +150,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
         size = add_size(size, BackendRandomShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -270,6 +271,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
     SyncScanShmemInit();
     AsyncShmemInit();
     BackendRandomShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index a6fda81feb..c46bb8d057 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -521,6 +521,9 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_TBM, "tbm");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
+    LWLockRegisterTranche(LWTRANCHE_STATS_DSA, "stats table dsa");
+    LWLockRegisterTranche(LWTRANCHE_STATS_DB, "db stats");
+    LWLockRegisterTranche(LWTRANCHE_STATS_FUNC_TABLE, "table/func stats");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index e6025ecedb..798af9f168 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -50,3 +50,4 @@ OldSnapshotTimeMapLock                42
 BackendRandomLock                    43
 LogicalRepWorkerLock                44
 CLogTruncationLock                    45
+StatsLock                            46
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b05fb209bb..2c1e092afe 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -186,7 +186,6 @@ static bool check_autovacuum_max_workers(int *newval, void **extra, GucSource so
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static void assign_effective_io_concurrency(int newval, void *extra);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -3755,17 +3754,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -10691,35 +10679,6 @@ assign_effective_io_concurrency(int newval, void *extra)
 #endif                            /* USE_PREFETCH */
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9e39baf466..7aa57bc489 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -509,7 +509,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index ae22e7d9fb..0c3b82b455 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -216,7 +216,6 @@ static const char *const subdirs[] = {
     "pg_replslot",
     "pg_tblspc",
     "pg_stat",
-    "pg_stat_tmp",
     "pg_xact",
     "pg_logical",
     "pg_logical/snapshots",
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 2211d90c6f..e6f4d30658 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index e97b25bd72..88bb1e636b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -13,6 +13,7 @@
 
 #include "datatype/timestamp.h"
 #include "fmgr.h"
+#include "lib/dshash.h"
 #include "libpq/pqcomm.h"
 #include "port/atomics.h"
 #include "portability/instr_time.h"
@@ -30,9 +31,6 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
-#define PG_STAT_TMP_DIR        "pg_stat_tmp"
-
 /* Values for track_functions GUC variable --- order is significant! */
 typedef enum TrackFunctionsLevel
 {
@@ -48,7 +46,6 @@ typedef enum TrackFunctionsLevel
 typedef enum StatMsgType
 {
     PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
     PGSTAT_MTYPE_TABSTAT,
     PGSTAT_MTYPE_TABPURGE,
     PGSTAT_MTYPE_DROPDB,
@@ -216,35 +213,6 @@ typedef struct PgStat_MsgDummy
     PgStat_MsgHdr m_hdr;
 } PgStat_MsgDummy;
 
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
 /* ----------
  * PgStat_TableEntry            Per-table info in a MsgTabstat
  * ----------
@@ -539,7 +507,6 @@ typedef union PgStat_Msg
 {
     PgStat_MsgHdr msg_hdr;
     PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
     PgStat_MsgTabstat msg_tabstat;
     PgStat_MsgTabpurge msg_tabpurge;
     PgStat_MsgDropdb msg_dropdb;
@@ -601,10 +568,13 @@ typedef struct PgStat_StatDBEntry
 
     /*
      * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * the handles and pointers out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables;
+    dshash_table_handle functions;
+    /* for snapshot tables */
+    HTAB *snapshot_tables;
+    HTAB *snapshot_functions;
 } PgStat_StatDBEntry;
 
 
@@ -1213,6 +1183,7 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 extern void pgstat_initstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
+extern PgStat_StatTabEntry *backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid relid, bool oneshot);
 
 /* ----------
  * pgstat_report_wait_start() -
@@ -1352,4 +1323,7 @@ extern PgStat_GlobalStats *pgstat_fetch_global(void);
 /* Main loop */
 extern void PgstatCollectorMain(void) pg_attribute_noreturn();
 
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index c21bfe2f66..2cdd10c2fd 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -219,6 +219,9 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SHARED_TUPLESTORE,
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
+    LWTRANCHE_STATS_DSA,
+    LWTRANCHE_STATS_DB,
+    LWTRANCHE_STATS_FUNC_TABLE,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
-- 
2.16.3

From 0bcdef42f678e5a1bbf936d3315f1dce6bf06000 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 4 Jul 2018 11:44:31 +0900
Subject: [PATCH 4/6] Documentation update

Remove all descriptions of the pg_stat_tmp directory from the documentation.
---
 doc/src/sgml/backup.sgml        |  4 +---
 doc/src/sgml/config.sgml        | 19 -------------------
 doc/src/sgml/func.sgml          |  3 +--
 doc/src/sgml/monitoring.sgml    |  7 +------
 doc/src/sgml/protocol.sgml      |  2 +-
 doc/src/sgml/ref/pg_rewind.sgml |  3 +--
 doc/src/sgml/storage.sgml       |  6 ------
 7 files changed, 5 insertions(+), 39 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 3fa5efdd78..31e94c1fe9 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1116,11 +1116,9 @@ SELECT pg_stop_backup();
    <para>
     The contents of the directories <filename>pg_dynshmem/</filename>,
     <filename>pg_notify/</filename>, <filename>pg_serial/</filename>,
-    <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
+    <filename>pg_snapshots/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5b913f00c1..8430c1a3cb 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6081,25 +6081,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index edc9be92a6..a01f68e99b 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -15889,8 +15889,7 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
  PG_VERSION      | 15
  pg_wal          | 16
  pg_hba.conf     | 17
- pg_stat_tmp     | 18
- pg_subtrans     | 19
+ pg_subtrans     | 18
 (19 rows)
 </programlisting>
   </para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 36d393d329..f23dd2aaa0 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -197,12 +197,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
+   <productname>PostgreSQL</productname> processes through shared memory.
    When the server shuts down cleanly, a permanent copy of the statistics
    data is stored in the <filename>pg_stat</filename> subdirectory, so that
    statistics can be retained across server restarts.  When recovery is
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index cfc805f785..4b8ea2a6b8 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2624,7 +2624,7 @@ The commands accepted in replication mode are:
         <para>
          <filename>pg_dynshmem</filename>, <filename>pg_notify</filename>,
          <filename>pg_replslot</filename>, <filename>pg_serial</filename>,
-         <filename>pg_snapshots</filename>, <filename>pg_stat_tmp</filename>, and
+         <filename>pg_snapshots</filename>, and
          <filename>pg_subtrans</filename> are copied as empty directories (even if
          they are symbolic links).
         </para>
diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 520d843f0e..416cb0f4f5 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -234,8 +234,7 @@ PostgreSQL documentation
       (everything except the relation files). Similarly to base backups,
       the contents of the directories <filename>pg_dynshmem/</filename>,
       <filename>pg_notify/</filename>, <filename>pg_replslot/</filename>,
-      <filename>pg_serial/</filename>, <filename>pg_snapshots/</filename>,
-      <filename>pg_stat_tmp/</filename>, and
+      <filename>pg_serial/</filename>, <filename>pg_snapshots/</filename>, and
       <filename>pg_subtrans/</filename> are omitted from the data copied
       from the source cluster. Any file or directory beginning with
       <filename>pgsql_tmp</filename> is omitted, as well as are
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 8ef2ac8010..5ee7493970 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -120,12 +120,6 @@ Item
   subsystem</entry>
 </row>
 
-<row>
- <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
-</row>
-
 <row>
  <entry><filename>pg_subtrans</filename></entry>
  <entry>Subdirectory containing subtransaction status data</entry>
-- 
2.16.3

From 3fec49dd2e0c959dc4da9fd3b89df841a63ce786 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 4 Jul 2018 10:59:17 +0900
Subject: [PATCH 5/6] Make pg_stat_statements not use PG_STAT_TMP_DIR.

This patchset removes the definition because pgstat.c no longer uses
the directory and there is no other suitable module to hold it. As a
tentative solution, this patch moves the query text file into the
permanent stats directory. pg_basebackup and pg_rewind are aware of
PG_STAT_TMP_DIR: they currently omit the text file, but will copy it
after this change.
---
 contrib/pg_stat_statements/pg_stat_statements.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index cc9efab243..cdff585e76 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -86,14 +86,11 @@ PG_MODULE_MAGIC;
 #define PGSS_DUMP_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pg_stat_statements.stat"
 
 /*
- * Location of external query text file.  We don't keep it in the core
- * system's stats_temp_directory.  The core system can safely use that GUC
- * setting, because the statistics collector temp file paths are set only once
- * as part of changing the GUC, but pg_stat_statements has no way of avoiding
- * race conditions.  Besides, we only expect modest, infrequent I/O for query
- * strings, so placing the file on a faster filesystem is not compelling.
+ * Location of external query text file. We only expect modest, infrequent I/O
+ * for query strings, so placing the file on a faster filesystem is not
+ * compelling.
  */
-#define PGSS_TEXT_FILE    PG_STAT_TMP_DIR "/pgss_query_texts.stat"
+#define PGSS_TEXT_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pgss_query_texts.stat"
 
 /* Magic number identifying the stats file format */
 static const uint32 PGSS_FILE_HEADER = 0x20171004;
-- 
2.16.3

From f2179d3e1745ee05e975cd85525e4c64e6dedae5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 4 Jul 2018 11:46:43 +0900
Subject: [PATCH 6/6] Remove pg_stat_tmp exclusion from pg_rewind

The directory "pg_stat_tmp" no longer exists so remove it from the
exclusion list.
---
 src/bin/pg_rewind/filemap.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 8f49d34652..a849e62558 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -43,13 +43,6 @@ static bool check_file_excluded(const char *path, const char *type);
  */
 static const char *excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    "pg_stat_tmp",                /* defined as PG_STAT_TMP_DIR */
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another master. See backup.sgml
-- 
2.16.3


Re: shared-memory based stats collector

From
Tom Lane
Date:
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
> At Mon, 2 Jul 2018 14:25:58 -0400, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+TgmoYQhr30eAcgJCi1v0FhA+3RP1FZVnXqSTLe=6fHy9e5oA@mail.gmail.com>
>> Copying the whole hash table kinds of sucks, partly because of the
>> time it will take to copy it, but also because it means that memory
>> usage is still O(nbackends * ntables).  Without looking at the patch,
>> I'm guessing that you're doing that because we need a way to show each
>> transaction a consistent snapshot of the data, and I admit that I
>> don't see another obvious way to tackle that problem.  Still, it would
>> be nice if we had a better idea.

> The consistency here means "repeatable read" of an object's stats
> entry, not a snapshot covering all objects. We don't need to copy
> all the entries at once following this definition. The attached
> version makes a cache entry only for requested objects.

Uh, what?  That's basically destroying the long-standing semantics of
statistics snapshots.  I do not think we can consider that acceptable.
As an example, it would mean that scan counts for indexes would not
match up with scan counts for their tables.

            regards, tom lane


Re: shared-memory based stats collector

From
Kyotaro HORIGUCHI
Date:
Hello.

At Wed, 04 Jul 2018 17:23:51 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <67470.1530739431@sss.pgh.pa.us>
> Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
> > At Mon, 2 Jul 2018 14:25:58 -0400, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+TgmoYQhr30eAcgJCi1v0FhA+3RP1FZVnXqSTLe=6fHy9e5oA@mail.gmail.com>
> >> Copying the whole hash table kinds of sucks, partly because of the
> >> time it will take to copy it, but also because it means that memory
> >> usage is still O(nbackends * ntables).  Without looking at the patch,
> >> I'm guessing that you're doing that because we need a way to show each
> >> transaction a consistent snapshot of the data, and I admit that I
> >> don't see another obvious way to tackle that problem.  Still, it would
> >> be nice if we had a better idea.
> 
> > The consistency here means "repeatable read" of an object's stats
> > entry, not a snapshot covering all objects. We don't need to copy
> > all the entries at once following this definition. The attached
> > version makes a cache entry only for requested objects.
> 
> Uh, what?  That's basically destroying the long-standing semantics of
> statistics snapshots.  I do not think we can consider that acceptable.
> As an example, it would mean that scan counts for indexes would not
> match up with scan counts for their tables.

The current stats collector mechanism sends at most 8 table stats
in a single message. Split messages from multiple transactions
can reach the collector in shuffled order. The resulting snapshot
can be "inconsistent" if an INQUIRY message arrives between such
split messages.  Of course a single message is enough for common
transactions, but not for all.
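
To make the point above concrete, here is a minimal standalone sketch
(not PostgreSQL code). It assumes a transaction that touched 10 tables,
so its stats are split into two messages of at most 8 entries each; the
names TabMsg, split_into_messages, apply_message and snapshot_is_torn
are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define ENTRIES_PER_MSG 8       /* per-message limit quoted above */
#define NTABLES         10      /* assumption: one xact touched 10 tables */
#define MAX_MSGS ((NTABLES + ENTRIES_PER_MSG - 1) / ENTRIES_PER_MSG)

/* Hypothetical message layout: a short list of table ids. */
typedef struct TabMsg
{
    int         nentries;
    int         relid[ENTRIES_PER_MSG];
} TabMsg;

/* Pack table ids 0..ntables-1 into messages; returns the message count. */
static int
split_into_messages(int ntables, TabMsg *msgs)
{
    int         nmsgs = 0;
    int         i;

    for (i = 0; i < ntables; i++)
    {
        TabMsg     *m;

        if (i % ENTRIES_PER_MSG == 0)
            msgs[nmsgs++].nentries = 0;     /* start a new message */
        m = &msgs[nmsgs - 1];
        m->relid[m->nentries++] = i;
    }
    return nmsgs;
}

/* Collector side: add one scan to every table named in the message. */
static void
apply_message(long *seq_scan, const TabMsg *m)
{
    int         i;

    for (i = 0; i < m->nentries; i++)
        seq_scan[m->relid[i]]++;
}

/*
 * Take a snapshot after the first message only, then apply the rest.
 * Returns true if the snapshot holds just part of the transaction's
 * updates (first table counted, last table not yet counted).
 */
static bool
snapshot_is_torn(void)
{
    TabMsg      msgs[MAX_MSGS];
    long        counters[NTABLES] = {0};
    long        snapshot[NTABLES];
    int         nmsgs = split_into_messages(NTABLES, msgs);
    int         i;

    assert(nmsgs == 2);
    apply_message(counters, &msgs[0]);
    memcpy(snapshot, counters, sizeof(snapshot));   /* reader arrives here */
    for (i = 1; i < nmsgs; i++)
        apply_message(counters, &msgs[i]);

    return snapshot[0] == 1 && snapshot[NTABLES - 1] == 0 &&
        counters[NTABLES - 1] == 1;
}
```

In this sketch the reader sees the scan on table 0 but not the one on
table 9, although both belong to the same transaction.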

Even though the inconsistency happens more frequently with this
patch, I don't think users expect such strict consistency of
table stats, especially on a busy system. And I believe it would
be a good trade if users got more "useful" information in exchange
for the relaxed consistency. (The actual meaning of "useful" is
outside the current focus :p)


Meanwhile, if we should keep the practical consistency, a giant
lock is out of the question. So we need transactional stats in
some shape. It can be a whole-image snapshot, a regular MVCC
table, or maybe the current dshash with UNDO logs. Since there
are actually many states, storage is inevitably required to
reproduce each state.

I think the consensus is that the whole-image snapshot takes
too much memory. MVCC is apparently too much for the purpose.

UNDO logs seem a bit promising. If we look at stats in a long
transaction, the memory required for UNDO information easily
reaches the same amount as the whole-image snapshot. But I
expect that is not so common.

I'll consider that apart from the current patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: shared-memory based stats collector

From
Kyotaro HORIGUCHI
Date:
Hello.

At Thu, 05 Jul 2018 12:04:23 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20180705.120423.49626073.horiguchi.kyotaro@lab.ntt.co.jp>
> UNDO logs seem a bit promising. If we look at stats in a long
> transaction, the memory required for UNDO information easily
> reaches the same amount as the whole-image snapshot. But I
> expect that is not so common.
> 
> I'll consider that apart from the current patch.

Done as a PoC. (Sorry for the format, since filterdiff generates
garbage from the patch.)

The attached v3-0008 is that: a PoC of UNDO logging of server
stats. It records undo logs only for table stats, and only if some
transaction has started to access stats data, so the logging is
rarely performed. The undo logs are used at the first access to
each relation's stats, and the result is then cached. autovacuum
and vacuum don't request undoing since they just don't need it.

# v3-0007 is a trivial fix for v3-0003, which will be merged.

I see several arguable points on this feature.

- The undo logs are stored in a ring buffer with a fixed size,
  currently 1000 entries.  If it fills up, consistency
  will be broken.  An undo log is recorded just once after the
  latest undo-recording transaction comes. It is likely to be
  read in rather short-lived transactions, and it's likely that
  there are no more than several such transactions
  simultaneously. It's possible to provide a dynamic resizing
  feature, but it doesn't seem worth the complexity.

- Undoing runs a linear search on the ring buffer. It is done
  the first time the stats for each relation are
  accessed. It can be (a bit) slow when many log entries
  reside there. (Concurrent vacuum can generate many undo log
  entries.)

- Undo logs for other stats don't seem to me to be needed,
  but..


A=> select relname, seq_scan from pg_stat_user_tables where relname = 't1';
relname |  seq_scan
 t1     |         0

A=> select relname, seq_scan from pg_stat_user_tables where relname = 't2';
relname |  seq_scan
 t2     |         0

A=> BEGIN;

-- This takes effect because no stats access has been made yet
B=> select * from t1;
B=> select * from t2;

A=> select relname, seq_scan from pg_stat_user_tables where relname = 't1';
relname |  seq_scan
 t1     |         1

-- This has no effect because undo logging is now working
B=> select * from t1;
B=> select * from t2;

<repeat two times>

-- This is the second request for t1 in this xact, so it
-- just returns the cached result.
A=> select relname, seq_scan from pg_stat_user_tables where relname = 't1';
relname |  seq_scan
 t1     |         1

-- This is the first request for t2 in this xact. The
-- result is the undone one.
A=> select relname, seq_scan from pg_stat_user_tables where relname = 't2';
relname |  seq_scan
 t2     |         1
A=> COMMIT;

A=> select relname, seq_scan from pg_stat_user_tables where relname = 't1';
relname |  seq_scan
 t1     |         4
A=> select relname, seq_scan from pg_stat_user_tables where relname = 't2';
relname |  seq_scan
 t2     |         4

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From c055ec1d58664609607e352b49b12a9d41b53465 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 5 Jul 2018 19:36:15 +0900
Subject: [PATCH 7/8] Fix of v3-0003-dshash-based-stats-collector

Trivial bug fix, not yet merged into v3-0003.
---
 src/backend/postmaster/pgstat.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ef283f17ab..af97e6b46b 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -5078,7 +5078,7 @@ done:
 Size
 StatsShmemSize(void)
 {
-    return sizeof(dsa_handle);
+    return sizeof(StatsShmemStruct);
 }
 
 void
-- 
2.16.3

From 0e45c21801c81f35c8718f7d2eafc8b45abeae4c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 5 Jul 2018 18:50:27 +0900
Subject: [PATCH 8/8] PoC implementation of undo log

To keep stats data consistent within a transaction, this patch
adds an undo log feature to the stats collector. It collects undo
logs only when some backend needs them. The undo log has a fixed size
(1000 entries), and once it fills up, consistency is no longer kept.
In the common case, stats are not referenced in long-lived
transactions, and consistency matters little over such a long term.
---
 src/backend/postmaster/autovacuum.c |  16 +--
 src/backend/postmaster/pgstat.c     | 222 ++++++++++++++++++++++++++++++++----
 src/backend/utils/adt/pgstatfuncs.c |  42 +++----
 src/include/pgstat.h                |   2 +-
 4 files changed, 227 insertions(+), 55 deletions(-)

diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 74e3ab6167..dc911a7952 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -981,7 +981,7 @@ rebuild_database_list(Oid newdb)
         PgStat_StatDBEntry *entry;
 
         /* only consider this database if it has a pgstat entry */
-        entry = pgstat_fetch_stat_dbentry(newdb);
+        entry = pgstat_fetch_stat_dbentry(newdb, false);
         if (entry != NULL)
         {
             /* we assume it isn't found because the hash was just created */
@@ -1005,7 +1005,7 @@ rebuild_database_list(Oid newdb)
          * skip databases with no stat entries -- in particular, this gets rid
          * of dropped databases
          */
-        entry = pgstat_fetch_stat_dbentry(avdb->adl_datid);
+        entry = pgstat_fetch_stat_dbentry(avdb->adl_datid, false);
         if (entry == NULL)
             continue;
 
@@ -1029,7 +1029,7 @@ rebuild_database_list(Oid newdb)
         PgStat_StatDBEntry *entry;
 
         /* only consider databases with a pgstat entry */
-        entry = pgstat_fetch_stat_dbentry(avdb->adw_datid);
+        entry = pgstat_fetch_stat_dbentry(avdb->adw_datid, false);
         if (entry == NULL)
             continue;
 
@@ -1239,7 +1239,7 @@ do_start_worker(void)
             continue;            /* ignore not-at-risk DBs */
 
         /* Find pgstat entry if any */
-        tmp->adw_entry = pgstat_fetch_stat_dbentry(tmp->adw_datid);
+        tmp->adw_entry = pgstat_fetch_stat_dbentry(tmp->adw_datid, false);
 
         /*
          * Skip a database with no pgstat entry; it means it hasn't seen any
@@ -1975,7 +1975,7 @@ do_autovacuum(void)
      * may be NULL if we couldn't find an entry (only happens if we are
      * forcing a vacuum for anti-wrap purposes).
      */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, false);
 
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
@@ -2025,7 +2025,7 @@ do_autovacuum(void)
     MemoryContextSwitchTo(AutovacMemCxt);
 
     /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
+    shared = pgstat_fetch_stat_dbentry(InvalidOid, false);
 
     classRel = heap_open(RelationRelationId, AccessShareLock);
 
@@ -2806,8 +2806,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     /* use fresh stats */
     autovac_refresh_stats();
 
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    shared = pgstat_fetch_stat_dbentry(InvalidOid, false);
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, false);
 
     /* fetch the relation's relcache entry */
     classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index af97e6b46b..3e16718057 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -92,6 +92,8 @@
 #define PGSTAT_TAB_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
+/* Maximum number of stats undo entries */
+#define INIT_NUM_UNDOLOGS 1000
 
 /* ----------
  * Total number of backends including auxiliary
@@ -133,18 +135,41 @@ static time_t last_pgstat_start_time;
 
 static bool pgStatRunningInCollector = false;
 
+typedef struct PgStat_TabUndoLogEnt
+{
+    Oid        dboid;
+    PgStat_StatTabEntry ent;
+}  PgStat_TabUndoLogEnt;
+
 /* Shared stats bootstrap infomation */
 typedef struct StatsShmemStruct {
     dsa_handle stats_dsa_handle;
     dshash_table_handle db_stats_handle;
     dsa_pointer    global_stats;
     dsa_pointer    archiver_stats;
+    /* Undo log stuff */
+    int            nundoers;
+    dsa_pointer undoarea;
+    uint32        nundologs;
+    uint32        undoinsptr;
+    uint32        undoinsround;
+    uint32        lastundopos;
 } StatsShmemStruct;
 
-static StatsShmemStruct * StatsShmem = NULL;
+typedef struct PgStat_RelIdEnt
+{
+    Oid    dboid;
+    Oid    reloid;
+} PgStat_RelIdEnt;
+
+static bool   StatsUndoing = false;    /* Does this backend need undoing? */
+static uint32 StatsUndoPtr = 0;        /* The first undo log this backend sees */
+static uint32 StatsUndoRound = 0;    /* The undo round for this backend */
+static StatsShmemStruct *StatsShmem = NULL;
 static dsa_area *area = NULL;
 static dshash_table *db_stats;
 static HTAB *snapshot_db_stats;
+static HTAB *undo_logged_tables = NULL;    /* to remember undo logged tables */
 
 /* dshash parameter for each type of table */
 static const dshash_parameters dsh_dbparams = {
@@ -169,6 +194,11 @@ static const dshash_parameters dsh_funcparams = {
     LWTRANCHE_STATS_FUNC_TABLE
 };
 
+typedef struct {
+    Oid                    dbid;
+    PgStat_StatTabEntry    ent;
+} PgStat_StatTabUndo;
+
 /*
  * Structures in which backends store per-table info that's waiting to be
  * sent to the collector.
@@ -274,7 +304,7 @@ static PgStat_ArchiverStats *shared_archiverStats;
 static PgStat_ArchiverStats *snapshot_archiverStats;
 static PgStat_GlobalStats *shared_globalStats;
 static PgStat_GlobalStats *snapshot_globalStats;
-
+static PgStat_TabUndoLogEnt *undodata;
 
 /*
  * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
@@ -318,7 +348,7 @@ static void pgstat_read_statsfiles(void);
 static void pgstat_read_db_statsfile(Oid databaseid, dshash_table *tabhash, dshash_table *funchash);
 
 /* functions used in backends */
-static bool backend_snapshot_database_stats(void);
+static bool backend_snapshot_database_stats(bool snapshot);
 static PgStat_StatFuncEntry *backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot);
 static void pgstat_read_current_status(void);
 
@@ -996,6 +1026,10 @@ pgstat_create_shared_stats(void)
         dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
     StatsShmem->db_stats_handle =
         dshash_get_hash_table_handle(db_stats);
+    StatsShmem->nundologs = INIT_NUM_UNDOLOGS;
+    StatsShmem->undoarea =
+        dsa_allocate0(area, sizeof(PgStat_TabUndoLogEnt) * INIT_NUM_UNDOLOGS);
+    StatsShmem->undoinsptr = 0;
 
     /* connect to the memory */
     snapshot_db_stats = NULL;
@@ -1003,6 +1037,8 @@ pgstat_create_shared_stats(void)
         dsa_get_address(area, StatsShmem->global_stats);
     shared_archiverStats = (PgStat_ArchiverStats *)
         dsa_get_address(area, StatsShmem->archiver_stats);
+    undodata = (PgStat_TabUndoLogEnt *)
+        dsa_get_address(area, StatsShmem->undoarea);
     MemoryContextSwitchTo(oldcontext);
     LWLockRelease(StatsLock);
 }
@@ -1031,7 +1067,7 @@ pgstat_vacuum_stat(void)
         return;
 
     /* If not done for this transaction, take a snapshot of stats */
-    if (!backend_snapshot_database_stats())
+    if (!backend_snapshot_database_stats(false))
         return;
 
     /*
@@ -2365,10 +2401,10 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
  * ----------
  */
 PgStat_StatDBEntry *
-pgstat_fetch_stat_dbentry(Oid dbid)
+pgstat_fetch_stat_dbentry(Oid dbid, bool snapshot)
 {
     /* If not done for this transaction, take a stats snapshot */
-    if (!backend_snapshot_database_stats())
+    if (!backend_snapshot_database_stats(snapshot))
         return NULL;
 
     /*
@@ -2395,7 +2431,7 @@ pgstat_fetch_stat_tabentry(Oid relid)
     PgStat_StatTabEntry *tabentry;
 
     /* Lookup our database, then look in its table hash table. */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, true);
     if (dbentry == NULL)
         return NULL;
 
@@ -2406,7 +2442,7 @@ pgstat_fetch_stat_tabentry(Oid relid)
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbentry = pgstat_fetch_stat_dbentry(InvalidOid);
+    dbentry = pgstat_fetch_stat_dbentry(InvalidOid, true);
     if (dbentry == NULL)
         return NULL;
 
@@ -2432,11 +2468,11 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     PgStat_StatFuncEntry *funcentry = NULL;
 
     /* If not done for this transaction, take a stats snapshot */
-    if (!backend_snapshot_database_stats())
+    if (!backend_snapshot_database_stats(IsTransactionState()))
         return NULL;
 
     /* Lookup our database, then find the requested function */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, true);
     if (dbentry == NULL)
         return NULL;
 
@@ -2519,7 +2555,7 @@ PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
     /* If not done for this transaction, take a stats snapshot */
-    if (!backend_snapshot_database_stats())
+    if (!backend_snapshot_database_stats(IsTransactionState()))
         return NULL;
 
     return snapshot_archiverStats;
@@ -2538,7 +2574,7 @@ PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
     /* If not done for this transaction, take a stats snapshot */
-    if (!backend_snapshot_database_stats())
+    if (!backend_snapshot_database_stats(IsTransactionState()))
         return NULL;
 
     return snapshot_globalStats;
@@ -3908,7 +3944,6 @@ pgstat_get_wait_io(WaitEventIO w)
     return event_name;
 }
 
-
 /* ----------
  * pgstat_get_backend_current_activity() -
  *
@@ -5293,6 +5328,22 @@ backend_clean_snapshot_callback(void *arg)
     snapshot_globalStats = NULL;
     snapshot_archiverStats = NULL;
     snapshot_db_stats = NULL;
+    if (StatsUndoing)
+    {
+        Assert(StatsShmem->nundoers > 0);
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        StatsShmem->nundoers--;
+
+        /* If I was the last undoer, reset shared pointers */
+        if (StatsShmem->nundoers == 0)
+        {
+            StatsShmem->undoinsptr = 0;
+            StatsShmem->undoinsround = 0;
+        }
+        LWLockRelease(StatsLock);
+        StatsUndoing = false;
+    }
 }
 
 /*
@@ -5363,7 +5414,8 @@ static void *
 snapshot_statentry(HTAB **dest,
                    dshash_table_handle dsh_handle,
                    const dshash_parameters *dsh_params,
-                   Oid key, size_t keysize, size_t entrysize)
+                   Oid key, size_t keysize, size_t entrysize,
+                   bool *found_in_cache)
 {
     void *lentry = NULL;
 
@@ -5409,12 +5461,13 @@ snapshot_statentry(HTAB **dest,
         }
         dshash_detach(t);
 
+        if (found_in_cache)
+            *found_in_cache = true;
+
         return NULL;
     }
     else if (dest)
     {
-        bool found;
-
         /*
          * Create new hash for entry cache. Make with arbitrary initial
          * entries since we don't know how this hash will grow.
@@ -5422,8 +5475,8 @@ snapshot_statentry(HTAB **dest,
         if (!*dest)
             *dest = create_local_stats_hash(keysize, entrysize, 32);
 
-        lentry = hash_search(*dest, &key, HASH_ENTER, &found);
-        if (!found)
+        lentry = hash_search(*dest, &key, HASH_ENTER, found_in_cache);
+        if (!*found_in_cache)
         {
             dshash_table *t = dshash_attach(area, dsh_params, dsh_handle, NULL);
             void *sentry = dshash_find(t, &key, false);
@@ -5460,6 +5513,10 @@ snapshot_statentry(HTAB **dest,
             memcpy(lentry, sentry, entrysize);
             dshash_release_lock(t, sentry);
         }
+        
+        if (found_in_cache)
+            *found_in_cache = false;
+
         dshash_detach(t);
     }
     
@@ -5475,7 +5532,7 @@ snapshot_statentry(HTAB **dest,
  * Returns false if the shared stats is not created yet.
  */
 static bool
-backend_snapshot_database_stats(void)
+backend_snapshot_database_stats(bool snapshot)
 {
     MemoryContext oldcontext;
     MemoryContextCallback *mcxt_cb;
@@ -5515,6 +5572,18 @@ backend_snapshot_database_stats(void)
     /* set the timestamp of this snapshot */
     snapshot_globalStats->stats_timestamp = GetCurrentTimestamp();
 
+    if (snapshot && IsTransactionState())
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        StatsUndoing = true;
+        StatsShmem->nundoers++;
+        StatsShmem->lastundopos = StatsShmem->undoinsptr;
+        StatsUndoPtr = StatsShmem->undoinsptr;
+        StatsUndoRound = StatsShmem->undoinsround;
+        LWLockRelease(StatsLock);
+        hash_destroy(undo_logged_tables);
+        undo_logged_tables = NULL;
+    }    
     take_db_stats_snapshot();
 
     /* register callback to clear snapshot */
@@ -5537,12 +5606,63 @@ backend_snapshot_database_stats(void)
 PgStat_StatTabEntry *
 backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid reloid, bool oneshot)
 {
+    bool found_in_cache;
+    uint32 myround, round, start, end, nent, nundologs;
+    PgStat_TabUndoLogEnt *undolist, *p;
+    int i;
+
     /* take a local snapshot if we don't have one */
-    return snapshot_statentry(oneshot ? NULL : &dbent->snapshot_tables,
-                              dbent->tables, &dsh_tblparams,
-                              reloid,
-                              dsh_tblparams.key_size,
-                              dsh_tblparams.entry_size);
+    PgStat_StatTabEntry *ent =
+        snapshot_statentry(oneshot ? NULL : &dbent->snapshot_tables,
+                           dbent->tables, &dsh_tblparams,
+                           reloid,
+                           dsh_tblparams.key_size,
+                           dsh_tblparams.entry_size,
+                           &found_in_cache);
+
+    /* Just return the result if caching is not required */
+    if (oneshot || found_in_cache || reloid == InvalidOid || !StatsUndoing)
+        return ent;
+
+    /* Search for undo list */
+    LWLockAcquire(StatsLock, LW_SHARED);
+    round = StatsShmem->undoinsround;
+    end = StatsShmem->undoinsptr;
+    nundologs = StatsShmem->nundologs;
+    LWLockRelease(StatsLock);
+
+    myround = StatsUndoRound;
+    start = StatsUndoPtr;
+
+    /* Check for undo wraparound */
+    if (myround <= round)
+        nent = (round - myround) * nundologs + (end - start);
+    else
+        nent = (nundologs - (myround - round)) * nundologs + (end - start);
+
+    undolist = (PgStat_TabUndoLogEnt *)
+        dsa_get_address(area, StatsShmem->undoarea);
+    if (nent > nundologs)
+    {
+        elog(WARNING, "stats undo list wrapped around; older state is lost");
+        return ent;
+    }
+
+    for (p = undolist + start, i = 0 ; i < nent ; i++)
+    {
+        if (p->dboid == dbent->databaseid &&
+            p->ent.tableid == reloid)
+        {
+            memcpy(ent, &p->ent, sizeof(PgStat_StatTabEntry));
+            break;
+        }
+        if (start + i < nundologs)
+            p++;
+        else
+            p = undolist;
+    }
+
+    return ent;
 }
 
 /* ----------
@@ -5559,7 +5679,8 @@ backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot)
                               dbent->functions, &dsh_funcparams,
                               funcid,
                               dsh_funcparams.key_size,
-                              dsh_funcparams.entry_size);
+                              dsh_funcparams.entry_size,
+                              NULL);
 }
 
 /* ----------
@@ -5610,6 +5731,56 @@ pgstat_clear_snapshot(void)
     backend_clean_snapshot_callback(&param);
 }
 
+/*
+ * record_stats_undo_log() -
+ *
+ * Stores table stats undo log.
+ */
+static void
+record_stats_undo_log(Oid dboid, PgStat_StatTabEntry *tabentry)
+{
+    int inspos;
+    PgStat_RelIdEnt db_relid;
+    bool    found = false;
+
+    /* No lock needed since this check doesn't need to be strict */
+    Assert(StatsShmem->undoinsptr < StatsShmem->nundologs);
+    if (StatsShmem->nundoers == 0)
+        return;
+
+    /*
+     * We need at most one undo entry per relation since the last undoer
+     * arrived. undo_logged_tables is cleared when a new undoer arrives.
+     */
+    if (!undo_logged_tables)
+    {
+        HASHCTL        ctl;
+
+        /* Create undo record hash */
+        ctl.keysize = ctl.entrysize = sizeof(PgStat_RelIdEnt);
+        undo_logged_tables = hash_create("pgstat undo record hash",
+                                    128, &ctl, HASH_ELEM | HASH_BLOBS);
+    }
+    db_relid.dboid = dboid;
+    db_relid.reloid = tabentry->tableid;
+    hash_search(undo_logged_tables, &db_relid, HASH_ENTER, &found);
+
+    if (found)
+        return;
+
+    inspos = StatsShmem->undoinsptr;
+    undodata[inspos].dboid = dboid;
+    memcpy(&undodata[inspos].ent, tabentry, sizeof(PgStat_StatTabEntry));
+
+    /* Expose the entry just entered. */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (++StatsShmem->undoinsptr >= StatsShmem->nundologs)
+    {
+        StatsShmem->undoinsptr = 0;
+        StatsShmem->undoinsround++; /* Don't care for overflow */
+    }
+    LWLockRelease(StatsLock);
+}
 
 /* ----------
  * pgstat_recv_tabstat() -
@@ -5677,6 +5848,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
         }
         else
         {
+            record_stats_undo_log(msg->m_databaseid, tabentry);
             /*
              * Otherwise add the values to the existing entry.
              */
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index e95e347184..62aa520c1e 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1176,7 +1176,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, true)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_xact_commit);
@@ -1192,7 +1192,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, true)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_xact_rollback);
@@ -1208,7 +1208,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, true)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_blocks_fetched);
@@ -1224,7 +1224,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, true)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_blocks_hit);
@@ -1240,7 +1240,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, true)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_returned);
@@ -1256,7 +1256,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, true)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_fetched);
@@ -1272,7 +1272,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, true)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_inserted);
@@ -1288,7 +1288,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, true)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_updated);
@@ -1304,7 +1304,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, true)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_deleted);
@@ -1319,7 +1319,7 @@ pg_stat_get_db_stat_reset_time(PG_FUNCTION_ARGS)
     TimestampTz result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, true)) == NULL)
         result = 0;
     else
         result = dbentry->stat_reset_timestamp;
@@ -1337,7 +1337,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, true)) == NULL)
         result = 0;
     else
         result = dbentry->n_temp_files;
@@ -1353,7 +1353,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, true)) == NULL)
         result = 0;
     else
         result = dbentry->n_temp_bytes;
@@ -1368,7 +1368,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, true)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_tablespace);
@@ -1383,7 +1383,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, true)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_lock);
@@ -1398,7 +1398,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, true)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_snapshot);
@@ -1413,7 +1413,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, true)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_bufferpin);
@@ -1428,7 +1428,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, true)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_startup_deadlock);
@@ -1443,7 +1443,7 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, true)) == NULL)
         result = 0;
     else
         result = (int64) (
@@ -1463,7 +1463,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, true)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_deadlocks);
@@ -1479,7 +1479,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     PgStat_StatDBEntry *dbentry;
 
     /* convert counter from microsec to millisec for display */
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, true)) == NULL)
         result = 0;
     else
         result = ((double) dbentry->n_block_read_time) / 1000.0;
@@ -1495,7 +1495,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     PgStat_StatDBEntry *dbentry;
 
     /* convert counter from microsec to millisec for display */
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, true)) == NULL)
         result = 0;
     else
         result = ((double) dbentry->n_block_write_time) / 1000.0;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 88bb1e636b..438a9e2fb9 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1311,7 +1311,7 @@ extern void pgstat_send_bgwriter(void);
  * generate the pgstat* views.
  * ----------
  */
-extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
+extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid, bool snapshot);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
-- 
2.16.3
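
As a rough illustration of the undo-log bookkeeping in patch 8/8 above (a write pointer plus a wrap counter, with each reader remembering where it started), here is a minimal standalone sketch. It is not code from the patch: `UndoRing`, `UndoReader`, `undo_lookup` and the tiny capacity of 4 are hypothetical names and values chosen for illustration, and the sketch ignores locking and overflow of the round counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define UNDO_CAPACITY 4          /* hypothetical; the patch uses 1000 */

typedef struct UndoEntry
{
    uint32_t dboid;
    uint32_t reloid;
    int64_t  old_value;          /* value before the update was applied */
} UndoEntry;

typedef struct UndoRing
{
    UndoEntry slots[UNDO_CAPACITY];
    uint32_t  insptr;            /* next slot to write */
    uint32_t  round;             /* how many times insptr has wrapped */
} UndoRing;

typedef struct UndoReader
{
    uint32_t start;              /* insptr when the reader began */
    uint32_t round;              /* round when the reader began */
} UndoReader;

static void
undo_ring_init(UndoRing *r)
{
    assert(r != NULL);
    r->insptr = 0;
    r->round = 0;
}

/* Append one pre-image, wrapping to slot 0 when the ring is full. */
static void
undo_ring_push(UndoRing *r, UndoEntry e)
{
    r->slots[r->insptr] = e;
    if (++r->insptr >= UNDO_CAPACITY)
    {
        r->insptr = 0;
        r->round++;
    }
}

static UndoReader
undo_reader_begin(const UndoRing *r)
{
    UndoReader rd = { r->insptr, r->round };
    return rd;
}

/*
 * Find the oldest entry for (dboid, reloid) written since the reader began.
 * Sets *lost when more than UNDO_CAPACITY entries were written in between,
 * in which case the reader's starting point has been overwritten.
 */
static bool
undo_lookup(const UndoRing *r, const UndoReader *rd,
            uint32_t dboid, uint32_t reloid,
            int64_t *old_value, bool *lost)
{
    uint64_t now = (uint64_t) r->round * UNDO_CAPACITY + r->insptr;
    uint64_t then = (uint64_t) rd->round * UNDO_CAPACITY + rd->start;
    uint64_t n = now - then;

    *lost = (n > UNDO_CAPACITY);
    if (*lost)
        return false;

    for (uint64_t i = 0; i < n; i++)
    {
        const UndoEntry *e = &r->slots[(rd->start + i) % UNDO_CAPACITY];

        if (e->dboid == dboid && e->reloid == reloid)
        {
            *old_value = e->old_value;
            return true;
        }
    }
    return false;
}
```

A reader that finds more than `UNDO_CAPACITY` entries written since it began can only report that older state is lost, which corresponds to the WARNING path in the patch.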


Re: shared-memory based stats collector

From
Magnus Hagander
Date:


On Wed, Jul 4, 2018 at 11:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
> > At Mon, 2 Jul 2018 14:25:58 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoYQhr30eAcgJCi1v0FhA+3RP1FZVnXqSTLe=6fHy9e5oA@mail.gmail.com>
> >> Copying the whole hash table kind of sucks, partly because of the
> >> time it will take to copy it, but also because it means that memory
> >> usage is still O(nbackends * ntables).  Without looking at the patch,
> >> I'm guessing that you're doing that because we need a way to show each
> >> transaction a consistent snapshot of the data, and I admit that I
> >> don't see another obvious way to tackle that problem.  Still, it would
> >> be nice if we had a better idea.
>
> > The consistency here means "repeatable read" of an object's stats
> > entry, not a snapshot covering all objects. We don't need to copy
> > all the entries at once following this definition. The attached
> > version makes a cache entry only for requested objects.
>
> Uh, what?  That's basically destroying the long-standing semantics of
> statistics snapshots.  I do not think we can consider that acceptable.
> As an example, it would mean that scan counts for indexes would not
> match up with scan counts for their tables.

I agree that this is definitely something that needs to be considered. I took a look some time ago at the same thing, and ran up against exactly that one (and at the time did not have time to fix it).

I have not yet had time to look at the downstream suggested handling (UNDO). However, I had one other thing from my notes I wanted to mention :)

We should probably consider adding an API to fetch counters that *don't* follow these rules, in case it's not needed. When going through files we're still stuck at that bottleneck, but if going through shared memory it should be possible to make it a lot cheaper by volunteering to "not need that". 

We should also consider the ability to fetch stats for a single object, which would require no copying of the whole structure at all. I think something like this could for example be used for autovacuum rechecks. On top of the file based transfer that would help very little, but through shared memory it could be a lot lighter weight.
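
To make the single-object idea concrete, here is a minimal standalone sketch of a lookup that copies out exactly one entry under a short lock instead of snapshotting the whole table. Everything in it is an assumption for illustration: `SharedStats`, `TabStats`, `stats_fetch_one` and the spinlock built on C11 `atomic_flag` are hypothetical stand-ins, not PostgreSQL APIs, and a real implementation would go through dshash and its partition locks.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NTABLES 3                /* hypothetical fixed-size table */

typedef struct TabStats
{
    uint32_t reloid;
    int64_t  seq_scans;
    int64_t  idx_scans;
} TabStats;

typedef struct SharedStats
{
    atomic_flag lock;            /* stand-in for a real shared-memory lock */
    TabStats    tables[NTABLES];
} SharedStats;

static void
stats_init(SharedStats *s)
{
    atomic_flag_clear(&s->lock);
    memset(s->tables, 0, sizeof(s->tables));
}

static void
stats_lock(SharedStats *s)
{
    /* Spin until the flag was previously clear. */
    while (atomic_flag_test_and_set(&s->lock))
        ;
}

static void
stats_unlock(SharedStats *s)
{
    atomic_flag_clear(&s->lock);
}

/*
 * Copy the entry for one relation into *out. Only sizeof(TabStats) bytes
 * are copied, regardless of how many tables exist. Returns false if the
 * relation has no entry.
 */
static bool
stats_fetch_one(SharedStats *s, uint32_t reloid, TabStats *out)
{
    bool found = false;

    assert(out != NULL);
    stats_lock(s);
    for (size_t i = 0; i < NTABLES; i++)
    {
        if (s->tables[i].reloid == reloid)
        {
            *out = s->tables[i];
            found = true;
            break;
        }
    }
    stats_unlock(s);
    return found;
}
```

A caller such as an autovacuum recheck would pay for one entry copy per call, at the price of seeing values that may be newer than those from an earlier call.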

--

Re: shared-memory based stats collector

From
Robert Haas
Date:
On Fri, Jul 6, 2018 at 10:29 AM, Magnus Hagander <magnus@hagander.net> wrote:
> I agree that this is definitely something that needs to be considered. I
> took a look some time ago at the same thing, and ran up against exactly that
> one (and at the time did not have time to fix it).
>
> I have not yet had time to look at the downstream suggested handling (UNDO).
> However, I had one other thing from my notes I wanted to mention :)
>
> We should probably consider adding an API to fetch counters that *don't*
> follow these rules, in case it's not needed. When going through files we're
> still stuck at that bottleneck, but if going through shared memory it should
> be possible to make it a lot cheaper by volunteering to "not need that".
>
> We should also consider the ability to fetch stats for a single object,
> which would require no copying of the whole structure at all. I think
> something like this could for example be used for autovacuum rechecks. On
> top of the file based transfer that would help very little, but through
> shared memory it could be a lot lighter weight.

I think we also have to ask ourselves in general whether snapshots of
this data are worth what they cost.  I don't think anyone would doubt
that a consistent snapshot of the data is better than an inconsistent
view of the data if the costs were equal.  However, if we can avoid a
huge amount of memory usage and complexity on large systems with
hundreds of backends by ditching the snapshot requirement, then we
should ask ourselves how important we think the snapshot behavior
really is.

Note that commit 3cba8999b34 relaxed the synchronization requirements
around GetLockStatusData().  In other words, since 2011, you can no
longer be certain that 'select * from pg_locks' is returning a
perfectly consistent view of the lock status.  If this has caused
anybody a major problem, I'm unaware of it.  Maybe the same would end
up being true here.  The amount of memory we're consuming for this
data may be a bigger problem than minor inconsistencies in the view of
the data would be.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: shared-memory based stats collector

From
Andres Freund
Date:
On 2018-07-06 14:49:53 -0400, Robert Haas wrote:
> I think we also have to ask ourselves in general whether snapshots of
> this data are worth what they cost.  I don't think anyone would doubt
> that a consistent snapshot of the data is better than an inconsistent
> view of the data if the costs were equal.  However, if we can avoid a
> huge amount of memory usage and complexity on large systems with
> hundreds of backends by ditching the snapshot requirement, then we
> should ask ourselves how important we think the snapshot behavior
> really is.

Indeed. I don't think it's worth major additional memory or code
complexity in this situation. The likelihood of benefitting from more /
better stats seems far higher than a more accurate view of the stats -
which aren't particularly accurate themselves. They don't even survive
crashes right now, so I don't think the current accuracy is very high.

Greetings,

Andres Freund


Re: shared-memory based stats collector

From
"Joshua D. Drake"
Date:
On 07/06/2018 11:57 AM, Andres Freund wrote:
> On 2018-07-06 14:49:53 -0400, Robert Haas wrote:
>> I think we also have to ask ourselves in general whether snapshots of
>> this data are worth what they cost.  I don't think anyone would doubt
>> that a consistent snapshot of the data is better than an inconsistent
>> view of the data if the costs were equal.  However, if we can avoid a
>> huge amount of memory usage and complexity on large systems with
>> hundreds of backends by ditching the snapshot requirement, then we
>> should ask ourselves how important we think the snapshot behavior
>> really is.
> Indeed. I don't think it's worth major additional memory or code
> complexity in this situation. The likelihood of benefitting from more /
> better stats seems far higher than a more accurate view of the stats -
> which aren't particularly accurate themselves. They don't even survive
> crashes right now, so I don't think the current accuracy is very high.


Will stats, if we move toward the suggested changes, be "less" accurate
than they are now? We already know that stats are generally not accurate,
but they are close enough. If we move toward this change, will they still
be close enough?

JD

>
> Greetings,
>
> Andres Freund
>

-- 
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
***  A fault and talent of mine is to tell it exactly how it is.  ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
*****     Unless otherwise stated, opinions are my own.   *****



Re: shared-memory based stats collector

From
Robert Haas
Date:
On Fri, Jul 6, 2018 at 3:02 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> Will stats, if we move toward the suggested changes be "less" accurate than
> they are now? We already know that stats are generally not accurate but they
> are close enough. If we move toward this change will it still be close
> enough?

The proposed change would have no impact at all on the long-term
accuracy of the statistics.  It would just mean that there would be
race conditions when reading them, so that for example you would be
more likely to see a count of heap scans that doesn't match the count
of index scans, because an update arrives in between when you read the
first value and when you read the second one.  I don't see that
mattering a whole lot, TBH, but maybe I'm missing something.
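
The race described above can be sketched in a few lines of C. This is a
toy model, not PostgreSQL code: the TableCounters struct and all function
names are hypothetical, and the "concurrent" update is simulated by an
ordinary function call placed between the two reads so that the outcome is
deterministic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified per-table counters; not the real pgstat structs. */
typedef struct TableCounters
{
    long seq_scans;     /* heap (sequential) scans */
    long idx_scans;     /* index scans */
} TableCounters;

/* Simulated update from another backend: one more scan of each kind. */
void
concurrent_update(TableCounters *shared)
{
    assert(shared != NULL);
    shared->seq_scans++;
    shared->idx_scans++;
}

/* Snapshot-style read: copy the whole entry once, then use only the copy. */
TableCounters
read_with_snapshot(TableCounters *shared)
{
    TableCounters copy;

    assert(shared != NULL);
    copy = *shared;
    concurrent_update(shared);  /* arrives after the copy; not visible in it */
    return copy;
}

/* Direct read: each field is fetched separately from the live entry. */
TableCounters
read_without_snapshot(TableCounters *shared)
{
    TableCounters view;

    assert(shared != NULL);
    view.seq_scans = shared->seq_scans;
    concurrent_update(shared);  /* arrives between the two reads */
    view.idx_scans = shared->idx_scans;
    return view;
}
```

Starting from 100/100, the snapshot reader returns 100/100 while the
direct reader returns 100/101. The totals in the shared entry are the same
either way; only the view taken at one instant can be momentarily
mismatched.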

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: shared-memory based stats collector

From
Andres Freund
Date:
On 2018-07-06 12:02:39 -0700, Joshua D. Drake wrote:
> On 07/06/2018 11:57 AM, Andres Freund wrote:
> > On 2018-07-06 14:49:53 -0400, Robert Haas wrote:
> > > I think we also have to ask ourselves in general whether snapshots of
> > > this data are worth what they cost.  I don't think anyone would doubt
> > > that a consistent snapshot of the data is better than an inconsistent
> > > view of the data if the costs were equal.  However, if we can avoid a
> > > huge amount of memory usage and complexity on large systems with
> > > hundreds of backends by ditching the snapshot requirement, then we
> > > should ask ourselves how important we think the snapshot behavior
> > > really is.
> > Indeed. I don't think it's worthwhile major additional memory or code
> > complexity in this situation. The likelihood of benefitting from more /
> > better stats seems far higher than a more accurate view of the stats -
> > which aren't particularly accurate themselves. They don't even survive
> > crashes right now, so I don't think the current accuracy is very high.
> 
> 
> Will stats, if we move toward the suggested changes be "less" accurate than
> they are now? We already know that stats are generally not accurate but they
> are close enough. If we move toward this change will it still be close
> enough?

I don't think there's a meaningful difference to before. And at the same
time less duplication / hardcoded structure will allow us to increase
the amount of stats we keep.

Greetings,

Andres Freund


Re: shared-memory based stats collector

From
"Joshua D. Drake"
Date:
On 07/06/2018 12:34 PM, Robert Haas wrote:
> On Fri, Jul 6, 2018 at 3:02 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
>> Will stats, if we move toward the suggested changes be "less" accurate than
>> they are now? We already know that stats are generally not accurate but they
>> are close enough. If we move toward this change will it still be close
>> enough?
> The proposed change would have no impact at all on the long-term
> accuracy of the statistics.  It would just mean that there would be
> race conditions when reading them, so that for example you would be
> more likely to see a count of heap scans that doesn't match the count
> of index scans, because an update arrives in between when you read the
> first value and when you read the second one.  I don't see that
> mattering a whole lot, TBH, but maybe I'm missing something.

I agree that it probably isn't a big deal. Generally speaking when we 
look at stats it is to get an "idea" of what is going on. We don't care 
if we are missing an increase/decrease of 20 of any particular value 
within stats. Based on this and what Andres said, it seems like a net 
win to me.

JD

>

-- 
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
***  A fault and talent of mine is to tell it exactly how it is.  ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
*****     Unless otherwise stated, opinions are my own.   *****



Re: shared-memory based stats collector

From
Magnus Hagander
Date:
On Fri, Jul 6, 2018 at 8:57 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2018-07-06 14:49:53 -0400, Robert Haas wrote:
> > I think we also have to ask ourselves in general whether snapshots of
> > this data are worth what they cost.  I don't think anyone would doubt
> > that a consistent snapshot of the data is better than an inconsistent
> > view of the data if the costs were equal.  However, if we can avoid a
> > huge amount of memory usage and complexity on large systems with
> > hundreds of backends by ditching the snapshot requirement, then we
> > should ask ourselves how important we think the snapshot behavior
> > really is.
>
> Indeed. I don't think it's worthwhile major additional memory or code
> complexity in this situation. The likelihood of benefitting from more /
> better stats seems far higher than a more accurate view of the stats -
> which aren't particularly accurate themselves. They don't even survive
> crashes right now, so I don't think the current accuracy is very high.

Definitely agreed.

*If* we can provide the snapshot view of them without too much overhead I think it's worth looking into that while *also* providing a lower overhead interface for those that don't care about it.

If it ends up that keeping the snapshots becomes too much overhead in either performance or code maintenance, then I agree we can probably drop that. But we should at least properly investigate the cost.

--

Re: shared-memory based stats collector

From
Andres Freund
Date:
Hi,

On 2018-07-06 22:03:12 +0200, Magnus Hagander wrote:
> *If* we can provide the snapshots view of them without too much overhead I
> think it's worth looking into that while *also* providing a lower overhead
> interface for those that don't care about it.

I don't see how that's possible without adding significant amounts of
complexity and probably memory / cpu overhead. The current stats already
are quite inconsistent (often outdated, partially updated, messages
dropped when busy) - I don't see what we really gain by building
something MVCC like in the "new" stats subsystem.


> If it ends up that keeping the snapshots becomes too much overhead in either
> performance or code-maintenance, then I agree we can probably drop that.
> But we should at least properly investigate the cost.

I don't think it's worthwhile to more than think a bit about it. There's
fairly obvious tradeoffs in complexity here. Trying to get there seems
like a good way to make the feature too big.

Greetings,

Andres Freund


Re: shared-memory based stats collector

From
Kyotaro HORIGUCHI
Date:
Hello. Thanks for the opinions.

At Fri, 6 Jul 2018 13:10:36 -0700, Andres Freund <andres@anarazel.de> wrote in
<20180706201036.awheoi6tk556x6aj@alap3.anarazel.de>
> Hi,
> 
> On 2018-07-06 22:03:12 +0200, Magnus Hagander wrote:
> > *If* we can provide the snapshots view of them without too much overhead I
> > think it's worth looking into that while *also* providing a lower overhead
> > interface for those that don't care about it.
> 
> I don't see how that's possible without adding significant amounts of
> complexity and probably memory / cpu overhead. The current stats already
> are quite inconsistent (often outdated, partially updated, messages
> dropped when busy) - I don't see what we really gain by building
> something MVCC like in the "new" stats subsystem.
> 
> 
> > If it ends up that keeping the snapshots becomes too much overhead in either
> > performance or code-maintenance, then I agree we can probably drop that.
> > But we should at least properly investigate the cost.
> 
> I don't think it's worthwhile to more than think a bit about it. There's
> fairly obvious tradeoffs in complexity here. Trying to get there seems
> like a good way to make the feature too big.

Agreed.

Well, if we accept losing some consistency in exchange for better
performance and a smaller footprint, relaxing the consistency of
database stats can reduce the footprint further, especially on a
cluster with many databases. Backends are interested only in
the database they reside in, and vacuum doesn't cache stats at all. A
possible problem is that vacuum and the stats collector can run into a
race condition. I'm not sure, but I suppose that is no worse than being
caught in I/O congestion.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 889237d7381d334df3a35e1fa5350298352fb9fe Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:41:04 +0900
Subject: [PATCH 1/6] sequential scan for dshash

Add sequential scan feature to dshash.
---
 src/backend/lib/dshash.c | 138 +++++++++++++++++++++++++++++++++++++++++++++++
 src/include/lib/dshash.h |  23 +++++++-
 2 files changed, 160 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index b46f7c4cfd..5b133226ac 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -592,6 +592,144 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should be called if and only if the scan is abandoned
+ * before completion; if dshash_seq_next returns NULL then it has already done
+ * the end-of-scan cleanup.
+ *
+ * On returning element, it is locked as is the case with dshash_find.
+ * However, the caller must not release the lock. The lock is released as
+ * necessary in continued scan.
+ *
+ * As opposed to the equivalent for dynahash, the caller is not supposed to
+ * delete the returned element before continuing the scan.
+ *
+ * If consistent is set for dshash_seq_init, the whole hash table is
+ * non-exclusively locked. Otherwise a part of the hash table is locked in the
+ * same mode (partition lock).
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool consistent)
+{
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = ((size_t) 1) << hash_table->control->size_log2;
+    status->curitem = NULL;
+    status->curpartition = -1;
+    status->consistent = consistent;
+
+    /*
+     * Protect all partitions from modification if the caller wants a
+     * consistent result.
+     */
+    if (consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+        {
+            Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i)));
+
+            LWLockAcquire(PARTITION_LOCK(hash_table, i), LW_SHARED);
+        }
+    }
+    ensure_valid_bucket_pointers(hash_table);
+}
+
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    if (status->curitem == NULL)
+    {
+        Assert (status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+        status->hash_table->find_locked = true;
+    }
+    else
+        next_item_pointer = status->curitem->next;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            dshash_seq_term(status);
+            return NULL;
+        }
+        Assert(status->hash_table->find_locked);
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+
+        /*
+         * we need a lock on the partition being scanned even if the caller
+         * did not request a consistent snapshot.
+         */
+        if (!status->consistent && DsaPointerIsValid(next_item_pointer))
+        {
+            dshash_table_item  *item = dsa_get_address(status->hash_table->area,
+                                                       next_item_pointer);
+            int next_partition = PARTITION_FOR_HASH(item->hash);
+            if (status->curpartition != next_partition)
+            {
+                if (status->curpartition >= 0)
+                    LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                                 status->curpartition));
+                LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                             next_partition),
+                              LW_SHARED);
+                status->curpartition = next_partition;
+            }
+        }
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    Assert(status->hash_table->find_locked);
+    status->hash_table->find_locked = false;
+
+    if (status->consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+            LWLockRelease(PARTITION_LOCK(status->hash_table, i));
+    }
+    else if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+int
+dshash_get_num_entries(dshash_table *hash_table)
+{
+    /* a shortcut implementation; should be improved */
+    dshash_seq_status s;
+    void *p;
+    int n = 0;
+
+    dshash_seq_init(&s, hash_table, false);
+    while ((p = dshash_seq_next(&s)) != NULL)
+        n++;
+
+    return n;
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 8c733bfe25..8598d9ed84 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -15,6 +15,7 @@
 #define DSHASH_H
 
 #include "utils/dsa.h"
+#include "utils/hsearch.h"
 
 /* The opaque type representing a hash table. */
 struct dshash_table;
@@ -59,6 +60,21 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state of dshash. The detail is exposed since the storage
+ * size should be known to users but it should be considered as an opaque
+ * type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;
+    int                    curbucket;
+    int                    nbuckets;
+    dshash_table_item  *curitem;
+    int                    curpartition;
+    bool                consistent;
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
               const dshash_parameters *params,
@@ -70,7 +86,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
 extern void dshash_detach(dshash_table *hash_table);
 extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
 extern void dshash_destroy(dshash_table *hash_table);
-
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
             const void *key, bool exclusive);
@@ -80,6 +95,12 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool consistent);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
+extern int dshash_get_num_entries(dshash_table *hash_table);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.16.3
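
The init/next/term protocol added by this patch can be tried out on a toy
model. The sketch below is not dshash: the table lives in ordinary memory,
"locks" are plain holder counters rather than LWLocks, and every toy_*
name is hypothetical. It also maps each bucket to a partition by index and
takes the partition lock before reading any bucket, including the first,
which is a simplification of this sketch and not a claim about what the
patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NBUCKETS    8
#define TOY_NPARTITIONS 4
#define TOY_PARTITION_FOR_BUCKET(b) ((b) / (TOY_NBUCKETS / TOY_NPARTITIONS))

typedef struct toy_item { struct toy_item *next; int key; } toy_item;

typedef struct toy_table
{
    toy_item *buckets[TOY_NBUCKETS];
    int       lock_holders[TOY_NPARTITIONS];   /* stand-in for LWLocks */
} toy_table;

typedef struct toy_seq_status
{
    toy_table *table;
    int        curbucket;
    toy_item  *curitem;
    int        curpartition;    /* -1 while no partition lock is held */
    bool       consistent;
} toy_seq_status;

static void toy_lock(toy_table *t, int p) { t->lock_holders[p]++; }
static void toy_unlock(toy_table *t, int p)
{
    assert(t->lock_holders[p] > 0);
    t->lock_holders[p]--;
}

void toy_init(toy_table *t)
{
    for (int b = 0; b < TOY_NBUCKETS; b++) t->buckets[b] = NULL;
    for (int p = 0; p < TOY_NPARTITIONS; p++) t->lock_holders[p] = 0;
}

void toy_insert(toy_table *t, toy_item *item)
{
    int b = (unsigned) item->key % TOY_NBUCKETS;

    item->next = t->buckets[b];
    t->buckets[b] = item;
}

int toy_locks_held(const toy_table *t)
{
    int n = 0;

    for (int p = 0; p < TOY_NPARTITIONS; p++) n += t->lock_holders[p];
    return n;
}

void toy_seq_init(toy_seq_status *st, toy_table *t, bool consistent)
{
    st->table = t;
    st->curbucket = -1;         /* the first next() call advances to bucket 0 */
    st->curitem = NULL;
    st->curpartition = -1;
    st->consistent = consistent;
    if (consistent)             /* freeze the whole table for the scan */
        for (int p = 0; p < TOY_NPARTITIONS; p++) toy_lock(t, p);
}

/* Call only when abandoning a scan early; next() calls it at end of scan. */
void toy_seq_term(toy_seq_status *st)
{
    if (st->consistent)
        for (int p = 0; p < TOY_NPARTITIONS; p++) toy_unlock(st->table, p);
    else if (st->curpartition >= 0)
        toy_unlock(st->table, st->curpartition);
    st->curpartition = -1;
}

toy_item *toy_seq_next(toy_seq_status *st)
{
    toy_item *next = st->curitem ? st->curitem->next : NULL;

    while (next == NULL)        /* current chain exhausted: go to next bucket */
    {
        if (++st->curbucket >= TOY_NBUCKETS)
        {
            toy_seq_term(st);
            return NULL;
        }
        if (!st->consistent)
        {
            int p = TOY_PARTITION_FOR_BUCKET(st->curbucket);

            if (p != st->curpartition)  /* hand over to the new partition */
            {
                if (st->curpartition >= 0)
                    toy_unlock(st->table, st->curpartition);
                toy_lock(st->table, p);
                st->curpartition = p;
            }
        }
        next = st->table->buckets[st->curbucket];
    }
    st->curitem = next;
    return next;
}

int toy_count(toy_table *t, bool consistent)
{
    toy_seq_status st;
    int n = 0;

    toy_seq_init(&st, t, consistent);
    while (toy_seq_next(&st) != NULL)
        n++;
    return n;
}
```

In the non-consistent mode at most one partition lock is held at any
moment, so writers to other partitions are not blocked, at the price of
possibly seeing a table that changed during the scan. In the consistent
mode every partition stays locked from init until the scan ends.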

From f1b7c2524e4fb86437411190594d858f6e2fc45b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:58:32 +0900
Subject: [PATCH 2/6] Change stats collector to an auxiliary process.

Shared memory and LWLocks are required to let stats collector use
dshash. This patch makes stats collector an auxiliary process.
---
 src/backend/bootstrap/bootstrap.c   |   8 ++
 src/backend/postmaster/pgstat.c     | 158 +++++++++++-------------------------
 src/backend/postmaster/postmaster.c |  30 +++----
 src/include/miscadmin.h             |   3 +-
 src/include/pgstat.h                |  11 ++-
 5 files changed, 77 insertions(+), 133 deletions(-)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 7e34bee63e..8f8327495a 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -335,6 +335,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case WalReceiverProcess:
                 statmsg = pgstat_get_backend_desc(B_WAL_RECEIVER);
                 break;
+            case StatsCollectorProcess:
+                statmsg = pgstat_get_backend_desc(B_STATS_COLLECTOR);
+                break;
             default:
                 statmsg = "??? process";
                 break;
@@ -462,6 +465,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
             WalReceiverMain();
             proc_exit(1);        /* should never return */
 
+        case StatsCollectorProcess:
+            /* don't set signals, stats collector has its own agenda */
+            PgstatCollectorMain();
+            proc_exit(1);        /* should never return */
+
         default:
             elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
             proc_exit(1);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index bbe73618c7..3193178f32 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -267,6 +267,7 @@ static List *pending_write_requests = NIL;
 /* Signal handler flags */
 static volatile bool need_exit = false;
 static volatile bool got_SIGHUP = false;
+static volatile bool got_SIGTERM = false;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -284,8 +285,8 @@ static instr_time total_func_time;
 static pid_t pgstat_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgstat_exit(SIGNAL_ARGS);
+static void pgstat_shutdown_handler(SIGNAL_ARGS);
+static void pgstat_quickdie_handler(SIGNAL_ARGS);
 static void pgstat_beshutdown_hook(int code, Datum arg);
 static void pgstat_sighup_handler(SIGNAL_ARGS);
 
@@ -688,104 +689,6 @@ pgstat_reset_all(void)
     pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
 }
 
-#ifdef EXEC_BACKEND
-
-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
-    char       *av[10];
-    int            ac = 0;
-
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
-
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
-
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
-
-
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
-
-    /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
-     */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
-
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
-
-    /*
-     * Okay, fork off the collector.
-     */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 void
 allow_immediate_pgstat_restart(void)
 {
@@ -2870,6 +2773,9 @@ pgstat_bestart(void)
             case WalReceiverProcess:
                 beentry->st_backendType = B_WAL_RECEIVER;
                 break;
+            case StatsCollectorProcess:
+                beentry->st_backendType = B_STATS_COLLECTOR;
+                break;
             default:
                 elog(FATAL, "unrecognized process type: %d",
                      (int) MyAuxProcType);
@@ -4135,6 +4041,9 @@ pgstat_get_backend_desc(BackendType backendType)
         case B_WAL_WRITER:
             backendDesc = "walwriter";
             break;
+        case B_STATS_COLLECTOR:
+            backendDesc = "stats collector";
+            break;
     }
 
     return backendDesc;
@@ -4252,8 +4161,8 @@ pgstat_send_bgwriter(void)
  *    The argc/argv parameters are valid only in EXEC_BACKEND case.
  * ----------
  */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
+void
+PgstatCollectorMain(void)
 {
     int            len;
     PgStat_Msg    msg;
@@ -4266,8 +4175,8 @@ PgstatCollectorMain(int argc, char *argv[])
      */
     pqsignal(SIGHUP, pgstat_sighup_handler);
     pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, pgstat_exit);
+    pqsignal(SIGTERM, pgstat_shutdown_handler);
+    pqsignal(SIGQUIT, pgstat_quickdie_handler);
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
     pqsignal(SIGUSR1, SIG_IGN);
@@ -4312,14 +4221,14 @@ PgstatCollectorMain(int argc, char *argv[])
         /*
          * Quit if we get SIGQUIT from the postmaster.
          */
-        if (need_exit)
+        if (got_SIGTERM)
             break;
 
         /*
          * Inner loop iterates as long as we keep getting messages, or until
          * need_exit becomes set.
          */
-        while (!need_exit)
+        while (!got_SIGTERM)
         {
             /*
              * Reload configuration if we got SIGHUP from the postmaster.
@@ -4507,14 +4416,29 @@ PgstatCollectorMain(int argc, char *argv[])
 
 /* SIGQUIT signal handler for collector process */
 static void
-pgstat_exit(SIGNAL_ARGS)
+pgstat_quickdie_handler(SIGNAL_ARGS)
 {
-    int            save_errno = errno;
+    PG_SETMASK(&BlockSig);
 
-    need_exit = true;
-    SetLatch(MyLatch);
+    /*
+     * We DO NOT want to run proc_exit() callbacks -- we're here because
+     * shared memory may be corrupted, so we don't want to try to clean up our
+     * transaction.  Just nail the windows shut and get out of town.  Now that
+     * there's an atexit callback to prevent third-party code from breaking
+     * things by calling exit() directly, we have to reset the callbacks
+     * explicitly to make this work as intended.
+     */
+    on_exit_reset();
 
-    errno = save_errno;
+    /*
+     * Note we do exit(2) not exit(0).  This is to force the postmaster into a
+     * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+     * backend.  This is necessary precisely because we don't clean up our
+     * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+     * should ensure the postmaster sees this as a crash, too, but no harm in
+     * being doubly sure.)
+     */
+    exit(2);
 }
 
 /* SIGHUP handler for collector process */
@@ -4529,6 +4453,18 @@ pgstat_sighup_handler(SIGNAL_ARGS)
     errno = save_errno;
 }
 
+static void
+pgstat_shutdown_handler(SIGNAL_ARGS)
+{
+    int save_errno = errno;
+
+    got_SIGTERM = true;
+
+    SetLatch(MyLatch);
+
+    errno = save_errno;
+}
+
 /*
  * Subroutine to clear stats in a database entry
  *
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a4b53b33cd..a37a18696e 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -145,7 +145,8 @@
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
 #define BACKEND_TYPE_BGWORKER    0x0008    /* bgworker process */
-#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
+#define BACKEND_TYPE_STATS        0x0010    /* stats collector process */
+#define BACKEND_TYPE_ALL        0x001F    /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -551,6 +552,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
 #define StartWalReceiver()        StartChildProcess(WalReceiverProcess)
+#define StartStatsCollector()    StartChildProcess(StatsCollectorProcess)
 
 /* Macros to check exit status of a child process */
 #define EXIT_STATUS_0(st)  ((st) == 0)
@@ -1760,7 +1762,7 @@ ServerLoop(void)
         /* If we have lost the stats collector, try to start a new one */
         if (PgStatPID == 0 &&
             (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
+            PgStatPID = StartStatsCollector();
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
@@ -2878,7 +2880,7 @@ reaper(SIGNAL_ARGS)
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = pgarch_start();
             if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
+                PgStatPID = StartStatsCollector();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -2951,7 +2953,7 @@ reaper(SIGNAL_ARGS)
                  * nothing left for it to do.
                  */
                 if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
+                    signal_child(PgStatPID, SIGTERM);
             }
             else
             {
@@ -3037,10 +3039,10 @@ reaper(SIGNAL_ARGS)
         {
             PgStatPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
+                HandleChildCrash(pid, exitstatus,
+                                 _("statistics collector process"));
             if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
+                PgStatPID = StartStatsCollector();
             continue;
         }
 
@@ -3270,7 +3272,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, stats collector or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -4949,12 +4951,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         PgArchiverMain(argc, argv); /* does not return */
     }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5071,7 +5067,7 @@ sigusr1_handler(SIGNAL_ARGS)
          * Likewise, start other special children as needed.
          */
         Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
+        PgStatPID = StartStatsCollector();
 
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
@@ -5361,6 +5357,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork WAL receiver process: %m")));
                 break;
+            case StatsCollectorProcess:
+                ereport(LOG,
+                        (errmsg("could not fork stats collector process: %m")));
+                break;
             default:
                 ereport(LOG,
                         (errmsg("could not fork process: %m")));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index e167ee8fcb..53b260cb1f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -400,7 +400,7 @@ typedef enum
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
-
+    StatsCollectorProcess,
     NUM_AUXPROCTYPES            /* Must be last! */
 } AuxProcType;
 
@@ -412,6 +412,7 @@ extern AuxProcType MyAuxProcType;
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
+#define AmStatsCollectorProcess()    (MyAuxProcType == StatsCollectorProcess)
 
 
 /*****************************************************************************
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index d59c24ae23..e97b25bd72 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -710,7 +710,8 @@ typedef enum BackendType
     B_STARTUP,
     B_WAL_RECEIVER,
     B_WAL_SENDER,
-    B_WAL_WRITER
+    B_WAL_WRITER,
+    B_STATS_COLLECTOR
 } BackendType;
 
 
@@ -1160,11 +1161,6 @@ extern int    pgstat_start(void);
 extern void pgstat_reset_all(void);
 extern void allow_immediate_pgstat_restart(void);
 
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
-
-
 /* ----------
  * Functions called from backends
  * ----------
@@ -1353,4 +1349,7 @@ extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
 
+/* Main loop */
+extern void PgstatCollectorMain(void) pg_attribute_noreturn();
+
 #endif                            /* PGSTAT_H */
-- 
2.16.3
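
The SIGTERM path in this patch follows the common "set a flag in the
handler, act on it in the main loop" pattern. A minimal standalone sketch
of that pattern in plain ISO C is shown below. The function names are
hypothetical, signal() stands in for pqsignal(), and the SetLatch(MyLatch)
wakeup that the real handler performs has no equivalent here, so the loop
simply polls the flag.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>

/* Written by the handler, read by the main loop. */
static volatile sig_atomic_t got_sigterm = 0;

/* Handler does the minimum: record the request and preserve errno. */
static void
shutdown_handler(int signo)
{
    int save_errno = errno;

    (void) signo;
    got_sigterm = 1;
    /* PostgreSQL would also call SetLatch(MyLatch) here to wake the loop. */
    errno = save_errno;
}

/* Returns 0 on success, -1 if the handler could not be installed. */
int
install_shutdown_handler(void)
{
    return (signal(SIGTERM, shutdown_handler) == SIG_ERR) ? -1 : 0;
}

bool
shutdown_requested(void)
{
    return got_sigterm != 0;
}

/*
 * Stand-in for the collector main loop: handle one "message" per iteration
 * until shutdown is requested.  To keep the sketch deterministic, the loop
 * sends SIGTERM to itself after terminate_at messages, in place of the
 * postmaster doing so.  Returns the number of messages handled.
 */
int
run_collector_loop(int max_messages, int terminate_at)
{
    int handled = 0;

    while (!got_sigterm && handled < max_messages)
    {
        handled++;
        if (handled == terminate_at)
            raise(SIGTERM);
    }
    return handled;
}
```

The SIGQUIT handler in the patch deliberately does not follow this
pattern: it resets the exit callbacks and calls exit(2) at once, because
shared memory may be corrupted by the time it runs.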

From f962c7de80d68796c0cfa96ff1dd33bde5efa36f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 17:05:46 +0900
Subject: [PATCH 3/6] dshash-based stats collector

The stats collector no longer uses files to distribute stats numbers. They
are now stored in a dynamic shared hash. The stats entries are cached
one by one to give a consistent snapshot during a transaction. On the
other hand, vacuum no longer takes a complete cache of stats.

This patch removes PG_STAT_TMP_DIR and the GUC stats_temp_directory.  That
affects pg_basebackup and pg_stat_statements, but this patch fixes only
pg_basebackup. The fix for pg_stat_statements is done in another patch.
---
 src/backend/postmaster/autovacuum.c           |   59 +-
 src/backend/postmaster/pgstat.c               | 1566 ++++++++++++-------------
 src/backend/replication/basebackup.c          |   36 -
 src/backend/storage/ipc/ipci.c                |    2 +
 src/backend/storage/lmgr/lwlock.c             |    3 +
 src/backend/storage/lmgr/lwlocknames.txt      |    1 +
 src/backend/utils/misc/guc.c                  |   41 -
 src/backend/utils/misc/postgresql.conf.sample |    1 -
 src/bin/initdb/initdb.c                       |    1 -
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |    2 +-
 src/include/pgstat.h                          |   50 +-
 src/include/storage/lwlock.h                  |    3 +
 12 files changed, 836 insertions(+), 929 deletions(-)
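
Note on the "cached one by one" behavior described in the commit
message: the following is only a minimal standalone sketch of the idea,
not code from this patch. It assumes a fixed-size array as a stand-in
for the shared hash, and every name in it (StatEntry, shared_add,
snapshot_get, snapshot_clear) is hypothetical. An entry is copied into a
backend-local cache on first access, later reads in the same transaction
return that copy, and the cache is dropped at transaction end.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ENTRIES 8

/* Hypothetical stats entry; the real entries carry many more counters. */
typedef struct StatEntry
{
    unsigned int id;
    long         counter;
} StatEntry;

/* Stand-in for the shared hash updated by the collector (assumption). */
static StatEntry shared_stats[MAX_ENTRIES];
static int       n_shared = 0;

/* Backend-local snapshot cache, filled lazily one entry at a time. */
static StatEntry snapshot[MAX_ENTRIES];
static int       n_snapshot = 0;

/* Linear lookup by id; returns NULL when the id is not present. */
static StatEntry *
find_entry(StatEntry *tab, int n, unsigned int id)
{
    for (int i = 0; i < n; i++)
        if (tab[i].id == id)
            return &tab[i];
    return NULL;
}

/* Collector side: add delta to the live entry, creating it if needed. */
void
shared_add(unsigned int id, long delta)
{
    StatEntry *e = find_entry(shared_stats, n_shared, id);

    if (e == NULL)
    {
        assert(n_shared < MAX_ENTRIES);
        e = &shared_stats[n_shared++];
        e->id = id;
        e->counter = 0;
    }
    e->counter += delta;
}

/*
 * Backend side: return the cached copy if this entry was already read in
 * the current transaction; otherwise copy the live entry into the cache.
 * Misses are not cached in this sketch.
 */
const StatEntry *
snapshot_get(unsigned int id)
{
    StatEntry *cached = find_entry(snapshot, n_snapshot, id);
    StatEntry *live;

    if (cached != NULL)
        return cached;

    live = find_entry(shared_stats, n_shared, id);
    if (live == NULL)
        return NULL;

    assert(n_snapshot < MAX_ENTRIES);
    snapshot[n_snapshot] = *live;
    return &snapshot[n_snapshot++];
}

/* Transaction end: forget all cached copies. */
void
snapshot_clear(void)
{
    n_snapshot = 0;
}
```

In this sketch, reading an entry, updating the shared side, and reading
again returns the first value until snapshot_clear() is called. Entries
that have not been read yet are not frozen, so consistency holds per
entry rather than across the whole stats set.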

diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 02e6d8131e..87a2ffc20f 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -981,7 +981,7 @@ rebuild_database_list(Oid newdb)
         PgStat_StatDBEntry *entry;
 
         /* only consider this database if it has a pgstat entry */
-        entry = pgstat_fetch_stat_dbentry(newdb);
+        entry = backend_get_db_entry(newdb, true);
         if (entry != NULL)
         {
             /* we assume it isn't found because the hash was just created */
@@ -990,6 +990,7 @@ rebuild_database_list(Oid newdb)
             /* hash_search already filled in the key */
             db->adl_score = score++;
             /* next_worker is filled in later */
+            pfree(entry);
         }
     }
 
@@ -1005,7 +1006,7 @@ rebuild_database_list(Oid newdb)
          * skip databases with no stat entries -- in particular, this gets rid
          * of dropped databases
          */
-        entry = pgstat_fetch_stat_dbentry(avdb->adl_datid);
+        entry = backend_get_db_entry(avdb->adl_datid, true);
         if (entry == NULL)
             continue;
 
@@ -1017,6 +1018,7 @@ rebuild_database_list(Oid newdb)
             db->adl_score = score++;
             /* next_worker is filled in later */
         }
+        pfree(entry);
     }
 
     /* finally, insert all qualifying databases not previously inserted */
@@ -1029,7 +1031,7 @@ rebuild_database_list(Oid newdb)
         PgStat_StatDBEntry *entry;
 
         /* only consider databases with a pgstat entry */
-        entry = pgstat_fetch_stat_dbentry(avdb->adw_datid);
+        entry = backend_get_db_entry(avdb->adw_datid, true);
         if (entry == NULL)
             continue;
 
@@ -1041,6 +1043,7 @@ rebuild_database_list(Oid newdb)
             db->adl_score = score++;
             /* next_worker is filled in later */
         }
+        pfree(entry);
     }
     nelems = score;
 
@@ -1239,7 +1242,7 @@ do_start_worker(void)
             continue;            /* ignore not-at-risk DBs */
 
         /* Find pgstat entry if any */
-        tmp->adw_entry = pgstat_fetch_stat_dbentry(tmp->adw_datid);
+        tmp->adw_entry = backend_get_db_entry(tmp->adw_datid, true);
 
         /*
          * Skip a database with no pgstat entry; it means it hasn't seen any
@@ -1277,16 +1280,22 @@ do_start_worker(void)
                 break;
             }
         }
-        if (skipit)
-            continue;
+        if (!skipit)
+        {
+            /* Remember the db with oldest autovac time. */
+            if (avdb == NULL ||
+                tmp->adw_entry->last_autovac_time <
+                avdb->adw_entry->last_autovac_time)
+            {
+                if (avdb)
+                    pfree(avdb->adw_entry);
+                avdb = tmp;
+            }
+        }
 
-        /*
-         * Remember the db with oldest autovac time.  (If we are here, both
-         * tmp->entry and db->entry must be non-null.)
-         */
-        if (avdb == NULL ||
-            tmp->adw_entry->last_autovac_time < avdb->adw_entry->last_autovac_time)
-            avdb = tmp;
+        /* Immediately free it if not used */
+        if (avdb != tmp)
+            pfree(tmp->adw_entry);
     }
 
     /* Found a database -- process it */
@@ -1975,7 +1984,7 @@ do_autovacuum(void)
      * may be NULL if we couldn't find an entry (only happens if we are
      * forcing a vacuum for anti-wrap purposes).
      */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    dbentry = backend_get_db_entry(MyDatabaseId, true);
 
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
@@ -2025,7 +2034,7 @@ do_autovacuum(void)
     MemoryContextSwitchTo(AutovacMemCxt);
 
     /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
+    shared = backend_get_db_entry(InvalidOid, true);
 
     classRel = heap_open(RelationRelationId, AccessShareLock);
 
@@ -2114,6 +2123,8 @@ do_autovacuum(void)
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
                                   &dovacuum, &doanalyze, &wraparound);
+        if (tabentry)
+            pfree(tabentry);
 
         /* Relations that need work are added to table_oids */
         if (dovacuum || doanalyze)
@@ -2193,10 +2204,11 @@ do_autovacuum(void)
         /* Fetch the pgstat entry for this table */
         tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
                                              shared, dbentry);
-
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
                                   &dovacuum, &doanalyze, &wraparound);
+        if (tabentry)
+            pfree(tabentry);
 
         /* ignore analyze for toast tables */
         if (dovacuum)
@@ -2768,12 +2780,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
     if (isshared)
     {
         if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
+            tabentry = backend_get_tab_entry(shared, relid, true);
     }
     else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
+        tabentry = backend_get_tab_entry(dbentry, relid, true);
 
     return tabentry;
 }
@@ -2805,8 +2815,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     /* use fresh stats */
     autovac_refresh_stats();
 
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    shared = backend_get_db_entry(InvalidOid, true);
+    dbentry = backend_get_db_entry(MyDatabaseId, true);
 
     /* fetch the relation's relcache entry */
     classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
@@ -2837,6 +2847,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
                               &dovacuum, &doanalyze, &wraparound);
+    if (tabentry)
+        pfree(tabentry);
 
     /* ignore ANALYZE for toast tables */
     if (classForm->relkind == RELKIND_TOASTVALUE)
@@ -2927,7 +2939,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     }
 
     heap_freetuple(classTup);
-
+    pfree(shared);
+    pfree(dbentry);
     return tab;
 }
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3193178f32..93ffdfe5aa 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -77,22 +77,10 @@
 #define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
                                      * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
 #define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
                                      * failed statistics collector; in
                                      * seconds. */
 
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
 /* Minimum receive buffer size for the collector's socket. */
 #define PGSTAT_MIN_RCVBUF        (100 * 1024)
 
@@ -101,7 +89,6 @@
  * The initial size hints for the hash tables used in the collector.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
 #define PGSTAT_TAB_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
@@ -127,14 +114,6 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
- */
-char       *pgstat_stat_directory = NULL;
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
-
 /*
  * BgWriter global statistics counters (unused in other processes).
  * Stored directly in a stats message structure so it can be sent
@@ -154,6 +133,43 @@ static time_t last_pgstat_start_time;
 
 static bool pgStatRunningInCollector = false;
 
+/* Shared stats bootstrap information */
+typedef struct StatsShmemStruct {
+    dsa_handle stats_dsa_handle;
+    dshash_table_handle db_stats_handle;
+    dsa_pointer    global_stats;
+    dsa_pointer    archiver_stats;
+} StatsShmemStruct;
+
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
+static dshash_table *db_stats;
+static HTAB *snapshot_db_stats;
+static MemoryContext stats_cxt;
+
+/* dshash parameter for each type of table */
+static const dshash_parameters dsh_dbparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatDBEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_DB
+};
+static const dshash_parameters dsh_tblparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatTabEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_FUNC_TABLE
+};
+static const dshash_parameters dsh_funcparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatFuncEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_FUNC_TABLE
+};
+
 /*
  * Structures in which backends store per-table info that's waiting to be
  * sent to the collector.
@@ -250,12 +266,16 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Cluster wide statistics.
+ * Contains statistics that are not collected per database or per table.
+ * shared_* are the statistics maintained by pgstats and snapshot_* are the
+ * snapshots taken on backends.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats *snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats *snapshot_globalStats;
+
 
 /*
  * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
@@ -285,23 +305,23 @@ static instr_time total_func_time;
 static pid_t pgstat_forkexec(void);
 #endif
 
+/* functions used in stats collector */
 static void pgstat_shutdown_handler(SIGNAL_ARGS);
 static void pgstat_quickdie_handler(SIGNAL_ARGS);
 static void pgstat_beshutdown_hook(int code, Datum arg);
 static void pgstat_sighup_handler(SIGNAL_ARGS);
 
 static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                     Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
-static void pgstat_read_current_status(void);
+static PgStat_StatTabEntry *pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create);
+static void pgstat_write_statsfiles(void);
+static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_statsfiles(void);
+static void pgstat_read_db_statsfile(Oid databaseid, dshash_table *tabhash, dshash_table *funchash);
 
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
+/* functions used in backends */
+static bool backend_snapshot_global_stats(void);
+static PgStat_StatFuncEntry *backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot);
+static void pgstat_read_current_status(void);
 
 static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
 static void pgstat_send_funcstats(void);
@@ -320,7 +340,6 @@ static const char *pgstat_get_wait_io(WaitEventIO w);
 static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
 static void pgstat_send(void *msg, int len);
 
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
 static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
 static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
 static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
@@ -685,7 +704,6 @@ pgstat_reset_remove_files(const char *directory)
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
     pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
 }
 
@@ -915,6 +933,81 @@ pgstat_send_funcstats(void)
 }
 
 
+/* ----------
+ * pgstat_attach_shared_stats() -
+ *
+ *    attach existing shared stats memory
+ * ----------
+ */
+static bool
+pgstat_attach_shared_stats(void)
+{
+    MemoryContext oldcontext;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID || area != NULL)
+    {
+        LWLockRelease(StatsLock);
+        return area != NULL;
+    }
+
+    /* top-level variables; they live for the lifetime of the process */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    area = dsa_attach(StatsShmem->stats_dsa_handle);
+    dsa_pin_mapping(area);
+    db_stats = dshash_attach(area, &dsh_dbparams,
+                             StatsShmem->db_stats_handle, 0);
+    snapshot_db_stats = NULL;
+    shared_globalStats = (PgStat_GlobalStats *)
+        dsa_get_address(area, StatsShmem->global_stats);
+    shared_archiverStats =    (PgStat_ArchiverStats *)
+        dsa_get_address(area, StatsShmem->archiver_stats);
+    MemoryContextSwitchTo(oldcontext);
+    LWLockRelease(StatsLock);
+
+    return true;
+}
+
+/* ----------
+ * pgstat_create_shared_stats() -
+ *
+ *    create shared stats memory
+ * ----------
+ */
+static void
+pgstat_create_shared_stats(void)
+{
+    MemoryContext oldcontext;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
+
+    /* lives for the lifetime of the process */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    area = dsa_create(LWTRANCHE_STATS_DSA);
+    dsa_pin_mapping(area);
+
+    db_stats = dshash_create(area, &dsh_dbparams, 0);
+
+    /* create shared area and write bootstrap information */
+    StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+    StatsShmem->global_stats =
+        dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+    StatsShmem->archiver_stats =
+        dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+    StatsShmem->db_stats_handle =
+        dshash_get_hash_table_handle(db_stats);
+
+    /* connect to the memory */
+    snapshot_db_stats = NULL;
+    shared_globalStats = (PgStat_GlobalStats *)
+        dsa_get_address(area, StatsShmem->global_stats);
+    shared_archiverStats = (PgStat_ArchiverStats *)
+        dsa_get_address(area, StatsShmem->archiver_stats);
+    MemoryContextSwitchTo(oldcontext);
+    LWLockRelease(StatsLock);
+}
+
 /* ----------
  * pgstat_vacuum_stat() -
  *
@@ -924,10 +1017,11 @@ pgstat_send_funcstats(void)
 void
 pgstat_vacuum_stat(void)
 {
-    HTAB       *htab;
+    HTAB       *oidtab;
     PgStat_MsgTabpurge msg;
     PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
+    dshash_table *dshtable;
+    dshash_seq_status dshstat;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatFuncEntry *funcentry;
@@ -936,23 +1030,22 @@ pgstat_vacuum_stat(void)
     if (pgStatSock == PGINVALID_SOCKET)
         return;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* If not done for this transaction, take a snapshot of stats */
+    if (!backend_snapshot_global_stats())
+        return;
 
     /*
      * Read pg_database and make a list of OIDs of all existing databases
      */
-    htab = pgstat_collect_oids(DatabaseRelationId);
+    oidtab = pgstat_collect_oids(DatabaseRelationId);
 
     /*
      * Search the database hash table for dead databases and tell the
      * collector to drop them.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+
+    dshash_seq_init(&dshstat, db_stats, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            dbid = dbentry->databaseid;
 
@@ -960,26 +1053,24 @@ pgstat_vacuum_stat(void)
 
         /* the DB entry for shared tables (with InvalidOid) is never dropped */
         if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
+            hash_search(oidtab, (void *) &dbid, HASH_FIND, NULL) == NULL)
             pgstat_drop_database(dbid);
     }
 
     /* Clean up */
-    hash_destroy(htab);
+    hash_destroy(oidtab);
 
     /*
      * Lookup our own database entry; if not found, nothing more to do.
      */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
+    dbentry = backend_get_db_entry(MyDatabaseId, true);
+    if (dbentry == NULL)
         return;
-
+
     /*
      * Similarly to above, make a list of all known relations in this DB.
      */
-    htab = pgstat_collect_oids(RelationRelationId);
+    oidtab = pgstat_collect_oids(RelationRelationId);
 
     /*
      * Initialize our messages table counter to zero
@@ -988,15 +1079,17 @@ pgstat_vacuum_stat(void)
 
     /*
      * Check for all tables listed in stats hashtable if they still exist.
+     * Stats cache is useless here so directly search the shared hash.
      */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+    dshtable = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&dshstat, dshtable, false);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            tabid = tabentry->tableid;
 
         CHECK_FOR_INTERRUPTS();
 
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+        if (hash_search(oidtab, (void *) &tabid, HASH_FIND, NULL) != NULL)
             continue;
 
         /*
@@ -1019,6 +1112,7 @@ pgstat_vacuum_stat(void)
             msg.m_nentries = 0;
         }
     }
+    dshash_detach(dshtable);
 
     /*
      * Send the rest
@@ -1034,29 +1128,29 @@ pgstat_vacuum_stat(void)
     }
 
     /* Clean up */
-    hash_destroy(htab);
+    hash_destroy(oidtab);
 
     /*
      * Now repeat the above steps for functions.  However, we needn't bother
      * in the common case where no function stats are being collected.
      */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
+    dshtable = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+    if (dshash_get_num_entries(dshtable) > 0)
     {
-        htab = pgstat_collect_oids(ProcedureRelationId);
+        oidtab = pgstat_collect_oids(ProcedureRelationId);
 
         pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
         f_msg.m_databaseid = MyDatabaseId;
         f_msg.m_nentries = 0;
 
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
+        dshash_seq_init(&dshstat, dshtable, false);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&dshstat)) != NULL)
         {
             Oid            funcid = funcentry->functionid;
 
             CHECK_FOR_INTERRUPTS();
 
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
+            if (hash_search(oidtab, (void *) &funcid, HASH_FIND, NULL) != NULL)
                 continue;
 
             /*
@@ -1089,8 +1183,9 @@ pgstat_vacuum_stat(void)
             pgstat_send(&f_msg, len);
         }
 
-        hash_destroy(htab);
+        hash_destroy(oidtab);
     }
+    dshash_detach(dshtable);
 }
 
 
@@ -1457,24 +1552,6 @@ pgstat_ping(void)
     pgstat_send(&msg, sizeof(msg));
 }
 
-/* ----------
- * pgstat_send_inquiry() -
- *
- *    Notify collector that we need fresh data.
- * ----------
- */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
 
 /*
  * Initialize function call usage data.
@@ -2289,18 +2366,10 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    PgStat_StatDBEntry *dbentry;
 
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    dbentry = backend_get_db_entry(dbid, false);
+    return dbentry;
 }
 
 
@@ -2316,47 +2385,28 @@ pgstat_fetch_stat_dbentry(Oid dbid)
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* Lookup our database, then look in its table hash table. */
+    dbentry = backend_get_db_entry(MyDatabaseId, false);
+    if (dbentry == NULL)
+        return NULL;
 
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = backend_get_tab_entry(dbentry, relid, false);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    dbentry = backend_get_db_entry(InvalidOid, false);
+    if (dbentry == NULL)
+        return NULL;
+
+    tabentry = backend_get_tab_entry(dbentry, relid, false);
+    if (tabentry != NULL)
+        return tabentry;
 
     return NULL;
 }
@@ -2375,17 +2425,14 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     PgStat_StatDBEntry *dbentry;
     PgStat_StatFuncEntry *funcentry = NULL;
 
-    /* load the stats file if needed */
-    backend_read_statsfile();
+    /* Lookup our database, then find the requested function */
+    dbentry = backend_get_db_entry(MyDatabaseId, false);
+    if (dbentry == NULL)
+        return NULL;
 
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
+    funcentry = backend_get_func_etnry(dbentry, func_id, false);
+    if (funcentry == NULL)
+        return NULL;
 
     return funcentry;
 }
@@ -2461,9 +2508,11 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    if (!backend_snapshot_global_stats())
+        return NULL;
 
-    return &archiverStats;
+    return snapshot_archiverStats;
 }
 
 
@@ -2478,9 +2527,11 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    if (!backend_snapshot_global_stats())
+        return NULL;
 
-    return &globalStats;
+    return snapshot_globalStats;
 }
 
 
@@ -4186,18 +4237,14 @@ PgstatCollectorMain(void)
     pqsignal(SIGTTOU, SIG_DFL);
     pqsignal(SIGCONT, SIG_DFL);
     pqsignal(SIGWINCH, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
 
-    /*
-     * Identify myself via ps
-     */
-    init_ps_display("stats collector", "", "", "");
+    PG_SETMASK(&UnBlockSig);
 
     /*
      * Read in existing stats files or initialize the stats to zero.
      */
     pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
+    pgstat_read_statsfiles();
 
     /*
      * Loop to process messages until we get SIGQUIT or detect ungraceful
@@ -4239,13 +4286,6 @@ PgstatCollectorMain(void)
                 ProcessConfigFile(PGC_SIGHUP);
             }
 
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
             /*
              * Try to receive and process a message.  This will not block,
              * since the socket is set to non-blocking mode.
@@ -4294,10 +4334,6 @@ PgstatCollectorMain(void)
                 case PGSTAT_MTYPE_DUMMY:
                     break;
 
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry((PgStat_MsgInquiry *) &msg, len);
-                    break;
-
                 case PGSTAT_MTYPE_TABSTAT:
                     pgstat_recv_tabstat((PgStat_MsgTabstat *) &msg, len);
                     break;
@@ -4386,9 +4422,7 @@ PgstatCollectorMain(void)
          * fixes that, so don't sleep indefinitely.  This is a crock of the
          * first water, but until somebody wants to debug exactly what's
          * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
+         * timeout matches our pre-9.2 behavior.
          */
         wr = WaitLatchOrSocket(MyLatch,
                                WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
@@ -4408,7 +4442,7 @@ PgstatCollectorMain(void)
     /*
      * Save the final stats to reuse at next startup.
      */
-    pgstat_write_statsfiles(true, true);
+    pgstat_write_statsfiles();
 
     exit(0);
 }
@@ -4466,14 +4500,14 @@ pgstat_shutdown_handler(SIGNAL_ARGS)
 }
 
 /*
- * Subroutine to clear stats in a database entry
+ * Subroutine to reset stats in a shared database entry
  *
  * Tables and functions hashes are initialized to empty.
  */
 static void
 reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
 {
-    HASHCTL        hash_ctl;
+    dshash_table *tbl;
 
     dbentry->n_xact_commit = 0;
     dbentry->n_xact_rollback = 0;
@@ -4499,20 +4533,17 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
     dbentry->stat_reset_timestamp = GetCurrentTimestamp();
     dbentry->stats_timestamp = 0;
 
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
 
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
+    tbl = dshash_create(area, &dsh_tblparams, 0);
+    dbentry->tables = dshash_get_hash_table_handle(tbl);
+    dshash_detach(tbl);
+
+    tbl = dshash_create(area, &dsh_funcparams, 0);
+    dbentry->functions = dshash_get_hash_table_handle(tbl);
+    dshash_detach(tbl);
+
+    dbentry->snapshot_tables = NULL;
+    dbentry->snapshot_functions = NULL;
 }
 
 /*
@@ -4525,15 +4556,18 @@ pgstat_get_db_entry(Oid databaseid, bool create)
 {
     PgStat_StatDBEntry *result;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
+
+    Assert(pgStatRunningInCollector);
 
     /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
+    if (create)
+        result = (PgStat_StatDBEntry *)
+            dshash_find_or_insert(db_stats,    &databaseid, &found);
+    else
+        result = (PgStat_StatDBEntry *)    dshash_find(db_stats, &databaseid, true);
 
-    if (!create && !found)
-        return NULL;
+    if (!create)
+        return result;
 
     /*
      * If not found, initialize the new one.  This creates empty hash tables
@@ -4545,23 +4579,23 @@ pgstat_get_db_entry(Oid databaseid, bool create)
     return result;
 }
 
-
 /*
  * Lookup the hash table entry for the specified table. If no hash
  * table entry exists, initialize it, if the create parameter is true.
  * Else, return NULL.
  */
 static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create)
 {
     PgStat_StatTabEntry *result;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
 
     /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
+    if (create)
+        result = (PgStat_StatTabEntry *)
+            dshash_find_or_insert(table, &tableoid, &found);
+    else
+        result = (PgStat_StatTabEntry *) dshash_find(table, &tableoid, false);
 
     if (!create && !found)
         return NULL;
@@ -4599,25 +4633,20 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
  * pgstat_write_statsfiles() -
  *        Write the global statistics file, as well as requested DB files.
  *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
  *    When 'allDbs' is false, only the requested databases (listed in
  *    pending_write_requests) will be written; otherwise, all databases
  *    will be written.
  * ----------
  */
 static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+pgstat_write_statsfiles(void)
 {
-    HASH_SEQ_STATUS hstat;
+    dshash_seq_status hstat;
     PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
@@ -4638,7 +4667,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
 
     /*
      * Write the file header --- currently just a format ID.
@@ -4650,32 +4679,29 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Write global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write archiver stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Walk through the database table.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_init(&hstat, db_stats, false);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&hstat)) != NULL)
     {
         /*
          * Write out the table and function stats for this DB into the
          * appropriate per-DB stat file, if required.
          */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
+        pgstat_write_db_statsfile(dbentry);
 
         /*
          * Write out the DB entry. We don't write the tables or functions
@@ -4719,9 +4745,6 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
         unlink(tmpfile);
     }
 
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
     /*
      * Now throw away the list of requests.  Note that requests sent after we
      * started the write are still waiting on the network socket.
@@ -4735,15 +4758,14 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
  * of length len.
  */
 static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
+get_dbstat_filename(bool tempname, Oid databaseid,
                     char *filename, int len)
 {
     int            printed;
 
     /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
     printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
                        databaseid,
                        tempname ? "tmp" : "stat");
     if (printed > len)
@@ -4761,10 +4783,10 @@ get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
  * ----------
  */
 static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
+pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry)
 {
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
+    dshash_seq_status tstat;
+    dshash_seq_status fstat;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatFuncEntry *funcentry;
     FILE       *fpout;
@@ -4773,9 +4795,10 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     int            rc;
     char        tmpfile[MAXPGPATH];
     char        statfile[MAXPGPATH];
+    dshash_table *tbl;
 
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -4802,24 +4825,28 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     /*
      * Walk through the database's access stats per table.
      */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&tstat, tbl, false);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&tstat)) != NULL)
     {
         fputc('T', fpout);
         rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_detach(tbl);
 
     /*
      * Walk through the database's function stats table.
      */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+    tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+    dshash_seq_init(&fstat, tbl, false);
+    while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&fstat)) != NULL)
     {
         fputc('F', fpout);
         rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_detach(tbl);
 
     /*
      * No more output to be done. Close the temp file and replace the old
@@ -4853,76 +4880,45 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
 }
 
 /* ----------
  * pgstat_read_statsfiles() -
  *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ *    Reads in some existing statistics collector files into the shared stats
+ *    hash.
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+static void
+pgstat_read_statsfiles(void)
 {
     PgStat_StatDBEntry *dbentry;
     PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    dshash_table *tblstats = NULL;
+    dshash_table *funcstats = NULL;
 
+    Assert(pgStatRunningInCollector);
     /*
      * The tables will live in pgStatLocalContext.
      */
     pgstat_setup_memcxt();
 
     /*
-     * Create the DB hashtable
+     * Create the DB hashtable and global stats area
      */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
+    pgstat_create_shared_stats();
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp = shared_globalStats->stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
@@ -4940,7 +4936,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -4957,11 +4953,11 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     /*
      * Read global stats struct
      */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) != sizeof(*shared_globalStats))
     {
         ereport(pgStatRunningInCollector ? LOG : WARNING,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
+        memset(shared_globalStats, 0, sizeof(*shared_globalStats));
         goto done;
     }
 
@@ -4972,17 +4968,16 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
      * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
      * an unusual scenario.
      */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
+    shared_globalStats->stats_timestamp = 0;
 
     /*
      * Read archiver stats struct
      */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) != sizeof(*shared_archiverStats))
     {
         ereport(pgStatRunningInCollector ? LOG : WARNING,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
+        memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
         goto done;
     }
 
@@ -5011,12 +5006,12 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 /*
                  * Add to the DB hash
                  */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
+                dbentry = (PgStat_StatDBEntry *)
+                    dshash_find_or_insert(db_stats, (void *) &dbbuf.databaseid,
+                                          &found);
                 if (found)
                 {
+                    dshash_release_lock(db_stats, dbentry);
                     ereport(pgStatRunningInCollector ? LOG : WARNING,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
@@ -5024,8 +5019,8 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 }
 
                 memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
+                dbentry->tables = DSM_HANDLE_INVALID;
+                dbentry->functions = DSM_HANDLE_INVALID;
 
                 /*
                  * In the collector, disregard the timestamp we read from the
@@ -5033,47 +5028,23 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                  * stats file immediately upon the first request from any
                  * backend.
                  */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+                Assert(pgStatRunningInCollector);
+                dbentry->stats_timestamp = 0;
 
                 /*
                  * If requested, read the data from the database-specific
                  * file.  Otherwise we just leave the hashtables empty.
                  */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
+                tblstats = dshash_create(area, &dsh_tblparams, 0);
+                dbentry->tables = dshash_get_hash_table_handle(tblstats);
+                funcstats = dshash_create(area, &dsh_funcparams, 0);
+                dbentry->functions =
+                    dshash_get_hash_table_handle(funcstats);
+                dshash_release_lock(db_stats, dbentry);
+                pgstat_read_db_statsfile(dbentry->databaseid,
+                                         tblstats, funcstats);
+                dshash_detach(tblstats);
+                dshash_detach(funcstats);
                 break;
 
             case 'E':
@@ -5090,34 +5061,47 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 done:
     FreeFile(fpin);
 
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 
-    return dbhash;
+    return;
 }
 
 
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+void
+StatsShmemInit(void)
+{
+    bool    found;
+
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+    }
+    else
+        Assert(found);
+}
+
 /* ----------
  * pgstat_read_db_statsfile() -
  *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
+ *    Reads in the permanent statistics collector file and creates shared
+ *    statistics tables. The file is removed after reading.
  * ----------
  */
 static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
+pgstat_read_db_statsfile(Oid databaseid,
+                         dshash_table *tabhash, dshash_table *funchash)
 {
     PgStat_StatTabEntry *tabentry;
     PgStat_StatTabEntry tabbuf;
@@ -5128,7 +5112,8 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     bool        found;
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
+    Assert(pgStatRunningInCollector);
+    get_dbstat_filename(false, databaseid, statfile, MAXPGPATH);
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
@@ -5187,12 +5172,13 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (tabhash == NULL)
                     break;
 
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
+                tabentry = (PgStat_StatTabEntry *)
+                    dshash_find_or_insert(tabhash,
+                                          (void *) &tabbuf.tableid, &found);
 
                 if (found)
                 {
+                    dshash_release_lock(tabhash, tabentry);
                     ereport(pgStatRunningInCollector ? LOG : WARNING,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
@@ -5200,6 +5186,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 }
 
                 memcpy(tabentry, &tabbuf, sizeof(tabbuf));
+                dshash_release_lock(tabhash, tabentry);
                 break;
 
                 /*
@@ -5221,9 +5208,9 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (funchash == NULL)
                     break;
 
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
+                funcentry = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert(funchash,
+                                          (void *) &funcbuf.functionid, &found);
 
                 if (found)
                 {
@@ -5234,6 +5221,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 }
 
                 memcpy(funcentry, &funcbuf, sizeof(funcbuf));
+                dshash_release_lock(funchash, funcentry);
                 break;
 
                 /*
@@ -5253,276 +5241,355 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
 done:
     FreeFile(fpin);
 
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 }
 
 /* ----------
- * pgstat_read_db_statsfile_timestamp() -
+ * backend_clean_snapshot_callback() -
  *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
+ *    This is usually called with arg = NULL when the memory context in which
+ *    the current snapshot was taken is reset or deleted. Don't bother
+ *    releasing memory in that case.
  * ----------
  */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
+static void
+backend_clean_snapshot_callback(void *arg)
 {
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    if (arg != NULL)
     {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
+        /* explicitly called, so explicitly free resources */
+        if (snapshot_globalStats)
+            pfree(snapshot_globalStats);
 
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
+        if (snapshot_archiverStats)
+            pfree(snapshot_archiverStats);
 
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
+        if (snapshot_db_stats)
         {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
+            HASH_SEQ_STATUS seq;
+            PgStat_StatDBEntry *dbent;
 
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
+            hash_seq_init(&seq, snapshot_db_stats);
+            while ((dbent = hash_seq_search(&seq)) != NULL)
+            {
+                if (dbent->snapshot_tables)
+                    hash_destroy(dbent->snapshot_tables);
+                if (dbent->snapshot_functions)
+                    hash_destroy(dbent->snapshot_functions);
+            }
+            hash_destroy(snapshot_db_stats);
         }
     }
 
-done:
-    FreeFile(fpin);
-    return true;
+    /* mark the resources as not allocated */
+    snapshot_globalStats = NULL;
+    snapshot_archiverStats = NULL;
+    snapshot_db_stats = NULL;
 }
 
 /*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
+ * create_local_stats_hash() -
+ *
+ * Creates a dynahash used for table/function stats cache.
  */
-static void
-backend_read_statsfile(void)
+static HTAB *
+create_local_stats_hash(const char *name, size_t keysize, size_t entrysize,
+                        int nentries)
 {
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
+    HTAB *result;
+    HASHCTL ctl;
 
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
+    /* Create the hash in the stats context */
+    ctl.keysize        = keysize;
+    ctl.entrysize    = entrysize;
+    ctl.hcxt        = stats_cxt;
+    result = hash_create(name, nentries, &ctl,
+                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    return result;
 }
 
+/*
+ * snapshot_statentry() - Find an entry in the source dshash.
+ *
+ * Returns the entry for key, or NULL if not found. If dest is not NULL,
+ * *dest is used as a local cache; it is created with the same key and entry
+ * sizes as the given dshash if *dest is NULL. The result is cached there,
+ * and subsequent calls for the same key return the same entry.
+ *
+ * Otherwise the returned entry is a copy palloc'ed in the current memory
+ * context, and its content may differ on every call.
+ *
+ * If dshash is NULL, temporarily attaches to dsh_handle instead.
+ */
+static void *
+snapshot_statentry(HTAB **dest, const char *hashname,
+                   dshash_table *dshash, dshash_table_handle dsh_handle,
+                   const dshash_parameters *dsh_params, Oid key)
+{
+    void *lentry = NULL;
+    size_t keysize = dsh_params->key_size;
+    size_t entrysize = dsh_params->entry_size;
+
+    if (dest)
+    {
+        /* caches the result entry */
+        bool found;
+
+        /*
+         * Create new hash with arbitrary initial entries since we don't know
+         * how this hash will grow.
+         */
+        if (!*dest)
+        {
+            Assert(hashname);
+            *dest = create_local_stats_hash(hashname, keysize, entrysize, 32);
+        }
+
+        lentry = hash_search(*dest, &key, HASH_ENTER, &found);
+        if (!found)
+        {
+            dshash_table *t = dshash;
+            void *sentry;
+
+            if (!t)
+                t = dshash_attach(area, dsh_params, dsh_handle, NULL);
+
+            sentry = dshash_find(t, &key, false);
+
+            /*
+             * We expect the stats entry for the given key to exist in most
+             * cases.
+             */
+            if (!sentry)
+            {
+                hash_search(*dest, &key, HASH_REMOVE, NULL);
+                if (!dshash)
+                    dshash_detach(t);
+                return NULL;
+            }
+            memcpy(lentry, sentry, entrysize);
+            dshash_release_lock(t, sentry);
+
+            if (!dshash)
+                dshash_detach(t);
+        }
+    }
+    else
+    {
+        /*
+         * The caller doesn't want caching. Just make a copy of the entry and
+         * return it.
+         */
+        dshash_table *t = dshash;
+        void *sentry;
+
+        if (!t)
+            t = dshash_attach(area, dsh_params, dsh_handle, NULL);
+
+        sentry = dshash_find(t, &key, false);
+        if (sentry)
+        {
+            lentry = palloc(entrysize);
+            memcpy(lentry, sentry, entrysize);
+            dshash_release_lock(t, sentry);
+        }
+
+        if (!dshash)
+            dshash_detach(t);
+    }
+
+    return lentry;
+}
+
+/*
+ * snapshot_statentry_all() - Take a snapshot of all shared stats entries
+ *
+ * Returns a local hash containing all entries in the shared stats.
+ *
+ * The given dshash is used if any. Otherwise temporarily attaches dsh_handle.
+ */
+static HTAB *
+snapshot_statentry_all(const char *hashname,
+                       dshash_table *dshash, dshash_table_handle dsh_handle,
+                       const dshash_parameters *dsh_params)
+{
+    dshash_table *t;
+    dshash_seq_status s;
+    size_t keysize = dsh_params->key_size;
+    size_t entrysize = dsh_params->entry_size;
+    void *ps;
+    int num_entries;
+    HTAB *dest;
+
+    t = dshash ? dshash : dshash_attach(area, dsh_params, dsh_handle, NULL);
+
+    /*
+     * No need to create a new hash if no entry exists. The number can change
+     * after this, but dynahash can store the extra entries in that case.
+     */
+    num_entries = dshash_get_num_entries(t);
+    if (num_entries == 0)
+    {
+        if (!dshash)
+            dshash_detach(t);
+        return NULL;
+    }
+
+    Assert(hashname);
+    dest = create_local_stats_hash(hashname, keysize, entrysize, num_entries);
+
+    dshash_seq_init(&s, t, true);
+    while ((ps = dshash_seq_next(&s)) != NULL)
+    {
+        bool found;
+        void *pd = hash_search(dest, ps, HASH_ENTER, &found);
+        Assert(!found);
+        memcpy(pd, ps, entrysize);
+        /* dshash_seq_next releases entry lock automatically */
+    }
+
+    if (!dshash)
+        dshash_detach(t);
+
+    return dest;
+}
+
+/*
+ * backend_snapshot_global_stats() -
+ *
+ * Makes a local copy of global stats if not already done.  The copy is kept
+ * until pgstat_clear_snapshot() is called or the current memory context
+ * (typically TopTransactionContext) is reset.  Returns false if the shared
+ * stats have not been created yet.
+ */
+static bool
+backend_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext;
+    MemoryContextCallback *mcxt_cb;
+
+    /* Nothing to do if already done */
+    if (snapshot_globalStats)
+        return true;
+
+    Assert(!pgStatRunningInCollector);
+
+    /* Attached shared memory lives for the process lifetime */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    if (!pgstat_attach_shared_stats())
+    {
+        MemoryContextSwitchTo(oldcontext);
+        return false;
+    }
+    MemoryContextSwitchTo(oldcontext);
+
+    Assert(snapshot_archiverStats == NULL);
+
+    /*
+     * The snapshot lives within the current top transaction if any, or for
+     * the lifetime of the current memory context otherwise.
+     */
+    if (IsTransactionState())
+        MemoryContextSwitchTo(TopTransactionContext);
+
+    /* Remember for stats memory allocation later */
+    stats_cxt = CurrentMemoryContext;
+
+    /* global stats can be just copied  */
+    snapshot_globalStats = palloc(sizeof(PgStat_GlobalStats));
+    memcpy(snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+
+    snapshot_archiverStats = palloc(sizeof(PgStat_ArchiverStats));
+    memcpy(snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+
+    /* set the timestamp of this snapshot */
+    snapshot_globalStats->stats_timestamp = GetCurrentTimestamp();
+
+    /* register callback to clear snapshot */
+    mcxt_cb = (MemoryContextCallback *)palloc(sizeof(MemoryContextCallback));
+    mcxt_cb->func = backend_clean_snapshot_callback;
+    mcxt_cb->arg = NULL;
+    MemoryContextRegisterResetCallback(CurrentMemoryContext, mcxt_cb);
+    MemoryContextSwitchTo(oldcontext);
+
+    return true;
+}
+
+/* ----------
+ * backend_get_db_entry() -
+ *
+ *    Find database stats entry on backends. The returned entries are cached
+ *    until transaction end. If oneshot is true, they are not cached and are
+ *    returned in palloc'ed memory.
+ */
+PgStat_StatDBEntry *
+backend_get_db_entry(Oid dbid, bool oneshot)
+{
+    /* take a local snapshot if we don't have one */
+    char *hashname = "local database stats hash";
+
+    /* If not done for this transaction, take a snapshot of global stats */
+    if (!backend_snapshot_global_stats())
+        return NULL;
+
+    return snapshot_statentry(oneshot ? NULL : &snapshot_db_stats,
+                              hashname, db_stats, 0, &dsh_dbparams,
+                              dbid);
+}
+
+/* ----------
+ * backend_snapshot_all_db_entries() -
+ *
+ *    Take a snapshot of all database stats at once into the returned hash.
+ */
+HTAB *
+backend_snapshot_all_db_entries(void)
+{
+    /* take a local snapshot if we don't have one */
+    char *hashname = "local database stats hash";
+
+    /* If not done for this transaction, take a snapshot of global stats */
+    if (!backend_snapshot_global_stats())
+        return NULL;
+
+    return snapshot_statentry_all(hashname, db_stats, 0, &dsh_dbparams);
+}
+
+/* ----------
+ * backend_get_tab_entry() -
+ *
+ *    Find table stats entry on backends. The returned entries are cached
+ *    until transaction end. If oneshot is true, they are not cached and are
+ *    returned in palloc'ed memory.
+ */
+PgStat_StatTabEntry *
+backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid reloid, bool oneshot)
+{
+    /* take a local snapshot if we don't have one */
+    char *hashname = "local table stats hash";
+    return snapshot_statentry(oneshot ? NULL : &dbent->snapshot_tables,
+                              hashname, NULL, dbent->tables, &dsh_tblparams,
+                              reloid);
+}
+
+/* ----------
+ * backend_get_func_entry() -
+ *
+ *    Find function stats entry on backends. The returned entries are cached
+ *    until transaction end. If oneshot is true, they are not cached and are
+ *    returned in palloc'ed memory.
+ */
+static PgStat_StatFuncEntry *
+backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot)
+{
+    char *hashname = "local function stats hash";
+    return snapshot_statentry(oneshot ? NULL : &dbent->snapshot_tables,
+                              hashname, NULL, dbent->functions, &dsh_funcparams,
+                              funcid);
+}
 
 /* ----------
  * pgstat_setup_memcxt() -
@@ -5553,6 +5620,8 @@ pgstat_setup_memcxt(void)
 void
 pgstat_clear_snapshot(void)
 {
+    int param = 0;    /* only the address is significant */
+
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
         MemoryContextDelete(pgStatLocalContext);
@@ -5562,99 +5631,12 @@ pgstat_clear_snapshot(void)
     pgStatDBHash = NULL;
     localBackendStatusTable = NULL;
     localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
 
     /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
+     * The parameter informs the function that it is not being called as a
+     * MemoryContextCallback.
      */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
+    backend_clean_snapshot_callback(&param);
 }
 
 
@@ -5667,6 +5649,7 @@ pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
 static void
 pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
 {
+    dshash_table *tabhash;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
     int            i;
@@ -5682,6 +5665,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
     dbentry->n_block_read_time += msg->m_block_read_time;
     dbentry->n_block_write_time += msg->m_block_write_time;
 
+    tabhash = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
     /*
      * Process all table entries in the message.
      */
@@ -5689,9 +5673,8 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
     {
         PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
 
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
+        tabentry = (PgStat_StatTabEntry *)
+            dshash_find_or_insert(tabhash, (void *) &(tabmsg->t_id), &found);
 
         if (!found)
         {
@@ -5750,6 +5733,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
         tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
         /* Likewise for n_dead_tuples */
         tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
+        dshash_release_lock(tabhash, tabentry);
 
         /*
          * Add per-table stats to the per-database entry, too.
@@ -5762,6 +5746,8 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
         dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
         dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
     }
+    dshash_detach(tabhash);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 
@@ -5774,27 +5760,33 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
 static void
 pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
 {
+    dshash_table *tbl;
     PgStat_StatDBEntry *dbentry;
     int            i;
 
     dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
     /*
      * No need to purge if we don't even know the database.
      */
-    if (!dbentry || !dbentry->tables)
+    if (!dbentry || dbentry->tables == DSM_HANDLE_INVALID)
+    {
+        if (dbentry)
+            dshash_release_lock(db_stats, dbentry);
         return;
+    }
 
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
     /*
      * Process all table entries in the message.
      */
     for (i = 0; i < msg->m_nentries; i++)
     {
         /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
+        (void) dshash_delete_key(tbl, (void *) &(msg->m_tableid[i]));
     }
+
+    dshash_detach(tbl);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 
@@ -5820,23 +5812,20 @@ pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
      */
     if (dbentry)
     {
-        char        statfile[MAXPGPATH];
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+        }
 
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
+        dshash_delete_entry(db_stats, (void *)dbentry);
     }
 }
 
@@ -5864,19 +5853,28 @@ pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
      * We simply throw away all the database's table entries by recreating a
      * new hash table for them.
      */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
+    if (dbentry->tables != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_destroy(t);
+        dbentry->tables = DSM_HANDLE_INVALID;
+    }
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_destroy(t);
+        dbentry->functions = DSM_HANDLE_INVALID;
+    }
 
     /*
      * Reset database-level stats, too.  This creates empty hash tables for
      * tables and functions.
      */
     reset_dbentry_counters(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -5891,14 +5889,14 @@ pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
     if (msg->m_resettarget == RESET_BGWRITER)
     {
         /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
+        memset(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
     }
     else if (msg->m_resettarget == RESET_ARCHIVER)
     {
         /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
+        memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
     }
 
     /*
@@ -5928,11 +5926,19 @@ pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
 
     /* Remove object if it exists, ignore it if not */
     if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_delete_key(t, (void *) &(msg->m_objectid));
+    }
     else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_delete_key(t, (void *) &(msg->m_objectid));
+    }
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -5952,6 +5958,8 @@ pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
     dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
 
     dbentry->last_autovac_time = msg->m_start_time;
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -5965,13 +5973,13 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
 {
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
-
+    dshash_table *table;
     /*
      * Store the data in the table's hashtable entry.
      */
     dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, msg->m_tableoid, true);
 
     tabentry->n_live_tuples = msg->m_live_tuples;
     tabentry->n_dead_tuples = msg->m_dead_tuples;
@@ -5986,6 +5994,9 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
         tabentry->vacuum_timestamp = msg->m_vacuumtime;
         tabentry->vacuum_count++;
     }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -5999,13 +6010,15 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
 {
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
 
     /*
      * Store the data in the table's hashtable entry.
      */
     dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
 
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, msg->m_tableoid, true);
 
     tabentry->n_live_tuples = msg->m_live_tuples;
     tabentry->n_dead_tuples = msg->m_dead_tuples;
@@ -6028,6 +6041,9 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
         tabentry->analyze_timestamp = msg->m_analyzetime;
         tabentry->analyze_count++;
     }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 
@@ -6043,18 +6059,18 @@ pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
     if (msg->m_failed)
     {
         /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
+        ++shared_archiverStats->failed_count;
+        memcpy(shared_archiverStats->last_failed_wal, msg->m_xlog,
+               sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = msg->m_timestamp;
     }
     else
     {
         /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
+        ++shared_archiverStats->archived_count;
+        memcpy(shared_archiverStats->last_archived_wal, msg->m_xlog,
+               sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = msg->m_timestamp;
     }
 }
 
@@ -6067,16 +6083,16 @@ pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
 static void
 pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
 {
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
+    shared_globalStats->timed_checkpoints += msg->m_timed_checkpoints;
+    shared_globalStats->requested_checkpoints += msg->m_requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += msg->m_checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += msg->m_checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += msg->m_buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += msg->m_buf_written_clean;
+    shared_globalStats->maxwritten_clean += msg->m_maxwritten_clean;
+    shared_globalStats->buf_written_backend += msg->m_buf_written_backend;
+    shared_globalStats->buf_fsync_backend += msg->m_buf_fsync_backend;
+    shared_globalStats->buf_alloc += msg->m_buf_alloc;
 }
 
 /* ----------
@@ -6117,6 +6133,8 @@ pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
             dbentry->n_conflict_startup_deadlock++;
             break;
     }
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -6133,6 +6151,8 @@ pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
     dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
 
     dbentry->n_deadlocks++;
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -6150,6 +6170,8 @@ pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
 
     dbentry->n_temp_bytes += msg->m_filesize;
     dbentry->n_temp_files += 1;
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -6161,6 +6183,7 @@ pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
 static void
 pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
 {
+    dshash_table *t;
     PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
     PgStat_StatDBEntry *dbentry;
     PgStat_StatFuncEntry *funcentry;
@@ -6169,14 +6192,14 @@ pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
 
     dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
 
+    t = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
     /*
      * Process all function entries in the message.
      */
     for (i = 0; i < msg->m_nentries; i++, funcmsg++)
     {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
+        funcentry = (PgStat_StatFuncEntry *)
+            dshash_find_or_insert(t, (void *) &(funcmsg->f_id), &found);
 
         if (!found)
         {
@@ -6197,7 +6220,11 @@ pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
             funcentry->f_total_time += funcmsg->f_total_time;
             funcentry->f_self_time += funcmsg->f_self_time;
         }
+        dshash_release_lock(t, funcentry);
     }
+
+    dshash_detach(t);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -6209,6 +6236,7 @@ pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
 static void
 pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
 {
+    dshash_table *t;
     PgStat_StatDBEntry *dbentry;
     int            i;
 
@@ -6217,60 +6245,20 @@ pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
     /*
      * No need to purge if we don't even know the database.
      */
-    if (!dbentry || !dbentry->functions)
+    if (!dbentry || dbentry->functions == DSM_HANDLE_INVALID)
         return;
 
+    t = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
     /*
      * Process all function entries in the message.
      */
     for (i = 0; i < msg->m_nentries; i++)
     {
         /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
+        dshash_delete_key(t, (void *) &(msg->m_functionid[i]));
     }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
+    dshash_detach(t);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /*
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 3f1eae38a9..a517bf62b6 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -77,9 +77,6 @@ static bool is_checksummed_file(const char *fullpath, const char *filename);
 /* Was the backup currently in-progress initiated in recovery mode? */
 static bool backup_started_in_recovery = false;
 
-/* Relative path of temporary statistics directory */
-static char *statrelpath = NULL;
-
 /*
  * Size of each block sent into the tar stream for larger files.
  */
@@ -121,13 +118,6 @@ static bool noverify_checksums = false;
  */
 static const char *excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    PG_STAT_TMP_DIR,
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another master. See backup.sgml
@@ -223,11 +213,8 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -254,18 +241,6 @@ perform_base_backup(basebackup_options *opt)
 
         SendXlogRecPtrResult(startptr, starttli);
 
-        /*
-         * Calculate the relative path of temporary statistics directory in
-         * order to skip the files which are located in that directory later.
-         */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
-
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
         ti->size = opt->progress ? sendDir(".", 1, true, tablespaces, true) : -1;
@@ -1174,17 +1149,6 @@ sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces,
         if (excludeFound)
             continue;
 
-        /*
-         * Exclude contents of directory specified by statrelpath if not set
-         * to the default (pg_stat_tmp) which is caught in the loop above.
-         */
-        if (statrelpath != NULL && strcmp(pathbuf, statrelpath) == 0)
-        {
-            elog(DEBUG1, "contents of directory \"%s\" excluded from backup", statrelpath);
-            size += _tarWriteDir(pathbuf, basepathlen, &statbuf, sizeonly);
-            continue;
-        }
-
         /*
          * We can skip pg_wal, the WAL segments need to be fetched from the
          * WAL archive anyway. But include it as an empty directory anyway, so
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 0c86a581c0..ee30e8a14f 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -150,6 +150,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
         size = add_size(size, BackendRandomShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -270,6 +271,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
     SyncScanShmemInit();
     AsyncShmemInit();
     BackendRandomShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index a6fda81feb..c46bb8d057 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -521,6 +521,9 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_TBM, "tbm");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
+    LWLockRegisterTranche(LWTRANCHE_STATS_DSA, "stats table dsa");
+    LWLockRegisterTranche(LWTRANCHE_STATS_DB, "db stats");
+    LWLockRegisterTranche(LWTRANCHE_STATS_FUNC_TABLE, "table/func stats");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index e6025ecedb..798af9f168 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -50,3 +50,4 @@ OldSnapshotTimeMapLock                42
 BackendRandomLock                    43
 LogicalRepWorkerLock                44
 CLogTruncationLock                    45
+StatsLock                            46
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 17292e04fe..2f20035523 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -186,7 +186,6 @@ static bool check_autovacuum_max_workers(int *newval, void **extra, GucSource so
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static void assign_effective_io_concurrency(int newval, void *extra);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -3755,17 +3754,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -10691,35 +10679,6 @@ assign_effective_io_concurrency(int newval, void *extra)
 #endif                            /* USE_PREFETCH */
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9e39baf466..7aa57bc489 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -509,7 +509,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index ae22e7d9fb..0c3b82b455 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -216,7 +216,6 @@ static const char *const subdirs[] = {
     "pg_replslot",
     "pg_tblspc",
     "pg_stat",
-    "pg_stat_tmp",
     "pg_xact",
     "pg_logical",
     "pg_logical/snapshots",
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 2211d90c6f..e6f4d30658 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index e97b25bd72..afc1927250 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -13,6 +13,7 @@
 
 #include "datatype/timestamp.h"
 #include "fmgr.h"
+#include "lib/dshash.h"
 #include "libpq/pqcomm.h"
 #include "port/atomics.h"
 #include "portability/instr_time.h"
@@ -30,9 +31,6 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
-#define PG_STAT_TMP_DIR        "pg_stat_tmp"
-
 /* Values for track_functions GUC variable --- order is significant! */
 typedef enum TrackFunctionsLevel
 {
@@ -48,7 +46,6 @@ typedef enum TrackFunctionsLevel
 typedef enum StatMsgType
 {
     PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
     PGSTAT_MTYPE_TABSTAT,
     PGSTAT_MTYPE_TABPURGE,
     PGSTAT_MTYPE_DROPDB,
@@ -216,35 +213,6 @@ typedef struct PgStat_MsgDummy
     PgStat_MsgHdr m_hdr;
 } PgStat_MsgDummy;
 
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
 /* ----------
  * PgStat_TableEntry            Per-table info in a MsgTabstat
  * ----------
@@ -539,7 +507,6 @@ typedef union PgStat_Msg
 {
     PgStat_MsgHdr msg_hdr;
     PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
     PgStat_MsgTabstat msg_tabstat;
     PgStat_MsgTabpurge msg_tabpurge;
     PgStat_MsgDropdb msg_dropdb;
@@ -601,10 +568,13 @@ typedef struct PgStat_StatDBEntry
 
     /*
      * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * the handles and pointers out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables;
+    dshash_table_handle functions;
+    /* for snapshot tables */
+    HTAB *snapshot_tables;
+    HTAB *snapshot_functions;
 } PgStat_StatDBEntry;
 
 
@@ -1213,6 +1183,9 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 extern void pgstat_initstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
+extern PgStat_StatDBEntry *backend_get_db_entry(Oid dbid, bool oneshot);
+extern HTAB *backend_snapshot_all_db_entries(void);
+extern PgStat_StatTabEntry *backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid relid, bool oneshot);
 
 /* ----------
  * pgstat_report_wait_start() -
@@ -1352,4 +1325,7 @@ extern PgStat_GlobalStats *pgstat_fetch_global(void);
 /* Main loop */
 extern void PgstatCollectorMain(void) pg_attribute_noreturn();
 
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index c21bfe2f66..2cdd10c2fd 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -219,6 +219,9 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SHARED_TUPLESTORE,
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
+    LWTRANCHE_STATS_DSA,
+    LWTRANCHE_STATS_DB,
+    LWTRANCHE_STATS_FUNC_TABLE,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
-- 
2.16.3

From a780f467b3352b7eff29339d09bdc60325524c41 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 4 Jul 2018 11:44:31 +0900
Subject: [PATCH 4/6] Documentation update

Remove all descriptions of the pg_stat_tmp directory from the documentation.
---
 doc/src/sgml/backup.sgml        |  4 +---
 doc/src/sgml/config.sgml        | 19 -------------------
 doc/src/sgml/func.sgml          |  3 +--
 doc/src/sgml/monitoring.sgml    |  7 +------
 doc/src/sgml/protocol.sgml      |  2 +-
 doc/src/sgml/ref/pg_rewind.sgml |  3 +--
 doc/src/sgml/storage.sgml       |  6 ------
 7 files changed, 5 insertions(+), 39 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 3fa5efdd78..31e94c1fe9 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1116,11 +1116,9 @@ SELECT pg_stop_backup();
    <para>
     The contents of the directories <filename>pg_dynshmem/</filename>,
     <filename>pg_notify/</filename>, <filename>pg_serial/</filename>,
-    <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
+    <filename>pg_snapshots/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5b913f00c1..8430c1a3cb 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6081,25 +6081,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index edc9be92a6..a01f68e99b 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -15889,8 +15889,7 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
  PG_VERSION      | 15
  pg_wal          | 16
  pg_hba.conf     | 17
- pg_stat_tmp     | 18
- pg_subtrans     | 19
+ pg_subtrans     | 18
 (19 rows)
 </programlisting>
   </para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 0484cfa77a..bd50efcec8 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -197,12 +197,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
+   <productname>PostgreSQL</productname> processes through shared memory.
    When the server shuts down cleanly, a permanent copy of the statistics
    data is stored in the <filename>pg_stat</filename> subdirectory, so that
    statistics can be retained across server restarts.  When recovery is
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index cfc805f785..4b8ea2a6b8 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2624,7 +2624,7 @@ The commands accepted in replication mode are:
         <para>
          <filename>pg_dynshmem</filename>, <filename>pg_notify</filename>,
          <filename>pg_replslot</filename>, <filename>pg_serial</filename>,
-         <filename>pg_snapshots</filename>, <filename>pg_stat_tmp</filename>, and
+         <filename>pg_snapshots</filename>, and
          <filename>pg_subtrans</filename> are copied as empty directories (even if
          they are symbolic links).
         </para>
diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index e2662bbf81..bf9c5dd580 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -270,8 +270,7 @@ PostgreSQL documentation
       (everything except the relation files). Similarly to base backups,
       the contents of the directories <filename>pg_dynshmem/</filename>,
       <filename>pg_notify/</filename>, <filename>pg_replslot/</filename>,
-      <filename>pg_serial/</filename>, <filename>pg_snapshots/</filename>,
-      <filename>pg_stat_tmp/</filename>, and
+      <filename>pg_serial/</filename>, <filename>pg_snapshots/</filename>, and
       <filename>pg_subtrans/</filename> are omitted from the data copied
       from the source cluster. Any file or directory beginning with
       <filename>pgsql_tmp</filename> is omitted, as well as are
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 8ef2ac8010..5ee7493970 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -120,12 +120,6 @@ Item
   subsystem</entry>
 </row>
 
-<row>
- <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
-</row>
-
 <row>
  <entry><filename>pg_subtrans</filename></entry>
  <entry>Subdirectory containing subtransaction status data</entry>
-- 
2.16.3

From 9c757eeb0b0212909493f540f8aa55116f3fdbc3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 4 Jul 2018 10:59:17 +0900
Subject: [PATCH 5/6] Make pg_stat_statements not use PG_STAT_TMP_DIR.

This patchset removes the definition because pgstat.c no longer uses
the directory and there is no other suitable module to take it over. As a
tentative solution this patch moves the query text file into the permanent
stats directory. pg_basebackup and pg_rewind are aware of the
directory. They currently omit the text file, but will copy it after
this change.
---
 contrib/pg_stat_statements/pg_stat_statements.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index cc9efab243..cdff585e76 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -86,14 +86,11 @@ PG_MODULE_MAGIC;
 #define PGSS_DUMP_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pg_stat_statements.stat"
 
 /*
- * Location of external query text file.  We don't keep it in the core
- * system's stats_temp_directory.  The core system can safely use that GUC
- * setting, because the statistics collector temp file paths are set only once
- * as part of changing the GUC, but pg_stat_statements has no way of avoiding
- * race conditions.  Besides, we only expect modest, infrequent I/O for query
- * strings, so placing the file on a faster filesystem is not compelling.
+ * Location of external query text file. We only expect modest, infrequent I/O
+ * for query strings, so placing the file on a faster filesystem is not
+ * compelling.
  */
-#define PGSS_TEXT_FILE    PG_STAT_TMP_DIR "/pgss_query_texts.stat"
+#define PGSS_TEXT_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pgss_query_texts.stat"
 
 /* Magic number identifying the stats file format */
 static const uint32 PGSS_FILE_HEADER = 0x20171004;
-- 
2.16.3

From 3f3bff14680fd752268719540c45c5fad4150fd6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 4 Jul 2018 11:46:43 +0900
Subject: [PATCH 6/6] Remove pg_stat_tmp exclusion from pg_rewind

The directory "pg_stat_tmp" no longer exists so remove it from the
exclusion list.
---
 src/bin/pg_rewind/filemap.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 8f49d34652..a849e62558 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -43,13 +43,6 @@ static bool check_file_excluded(const char *path, const char *type);
  */
 static const char *excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    "pg_stat_tmp",                /* defined as PG_STAT_TMP_DIR */
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another master. See backup.sgml
-- 
2.16.3


Re: shared-memory based stats collector

From
Tomas Vondra
Date:
On 07/10/2018 02:07 PM, Kyotaro HORIGUCHI wrote:
> Hello. Thanks for the opinions.
> 
> At Fri, 6 Jul 2018 13:10:36 -0700, Andres Freund <andres@anarazel.de> wrote in
> <20180706201036.awheoi6tk556x6aj@alap3.anarazel.de>
>> Hi,
>>
>> On 2018-07-06 22:03:12 +0200, Magnus Hagander wrote:
>>> *If* we can provide the snapshots view of them without too much overhead I
>>> think it's worth looking into that while *also* providing a lower overhead
>>> interface for those that don't care about it.
>>
>> I don't see how that's possible without adding significant amounts of
>> complexity and probably memory / cpu overhead. The current stats already
>> are quite inconsistent (often outdated, partially updated, messages
>> dropped when busy) - I don't see what we really gain by building
>> something MVCC like in the "new" stats subsystem.
>>
>>
>>> If it ends up that keeping the snapshots becomes too much overhead in
>>> either performance or code maintenance, then I agree we can probably drop
>>> that. But we should at least properly investigate the cost.
>>
>> I don't think it's worthwhile to more than think a bit about it. There's
>> fairly obvious tradeoffs in complexity here. Trying to get there seems
>> like a good way to make the feature too big.
> 
> Agreed.
> 
> Well, if we allow losing consistency to some extent for improved
> performance and a smaller footprint, relaxing the consistency of
> database stats can reduce the footprint further, especially on a
> cluster with many databases. Backends are interested only in
> the database they reside in, and vacuum doesn't cache stats at all. A
> possible problem is that vacuum and the stats collector can get into a
> race condition. I'm not sure, but I suppose it is not worse than being
> involved in I/O congestion.
> 

As someone who regularly analyzes stats collected from user systems, I
think there's certainly some value in keeping the snapshots reasonably
consistent. But I agree it doesn't need to be perfect, and some level of
inconsistency is acceptable (and the amount of complexity/overhead 
needed to maintain perfect consistency seems rather excessive here).

There's one more reason why attempts to keep stats snapshots "perfectly" 
consistent are likely doomed to fail - the messages are sent over UDP, 
which does not guarantee delivery etc. So there's always some level of 
possible inconsistency even with "perfectly consistent" snapshots.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: shared-memory based stats collector

From
Andres Freund
Date:
On 2018-07-10 14:52:13 +0200, Tomas Vondra wrote:
> There's one more reason why attempts to keep stats snapshots "perfectly"
> consistent are likely doomed to fail - the messages are sent over UDP, which
> does not guarantee delivery etc. So there's always some level of possible
> inconsistency even with "perfectly consistent" snapshots.

FWIW, I don't see us continuing to do so if we go for a shared hashtable
for stats.

- Andres


Re: shared-memory based stats collector

From
Antonin Houska
Date:
I've spent some time reviewing this version.

Design
------

1. Even with your patch the stats collector still uses a UDP socket to
   receive data. Now that the shared memory API is there, shouldn't the
   messages be sent via a shared memory queue? [1] That would increase the
   reliability of message delivery.

   I can actually imagine backends inserting data into the shared hash tables
   themselves, but that might make them wait if the same entries are accessed
   by another backend. It should be much cheaper just to insert a message into
   the queue and let the collector process it. In a future version the
   collector can launch parallel workers so that writes by backends do not
   get blocked due to a full queue.

2. I think the access to the shared hash tables introduces more contention
   than necessary. For example, pgstat_recv_tabstat() retrieves "dbentry" and
   leaves the containing hash table partition locked *exclusively* even while
   it changes only the contained table entries, i.e. after the changes to
   the dbentry itself are done.

   It appears that the shared hash tables are only modified by the stats
   collector. The unnecessary use of the exclusive lock might become a bigger
   issue in the future if the stats collector uses parallel
   workers. Monitoring functions and autovacuum are affected by the locking
   now.

   (I see that it's not trivial to get a just-created entry locked in shared
   mode: it may need a loop in which we release the exclusive lock and acquire
   the shared lock unless the entry was already removed.)

3. Data in both shared_archiverStats and shared_globalStats is mostly accessed
   w/o locking. Is that ok? I'd expect the StatsLock to be used for these.

Coding
------

* git apply v4-0003-dshash-based-stats-collector.patch needed manual
  resolution of one conflict.

* pgstat_quickdie_handler() appears to be the only "quickdie handler" that
  calls on_exit_reset(), although the comments are almost copy & pasted from
  such a handler of other processes. Can you please explain what's specific
  about pgstat.c?

* the variable name "area" would be sufficient if it was local to some
  function, otherwise I think the name is too generic.

* likewise db_stats is too generic for a global variable. How about
  "snapshot_db_stats_local"?

* backend_get_db_entry() passes 0 for handle to snapshot_statentry(). How
  about DSM_HANDLE_INVALID ?

* I only see one call of snapshot_statentry_all() and it receives 0 for
  handle. Thus the argument can be removed and the function does not have to
  attach / detach to / from the shared hash table.

* backend_snapshot_global_stats() switches to TopMemoryContext before it calls
  pgstat_attach_shared_stats(), but the latter takes care of the context
  itself.

* pgstat_attach_shared_stats() - header comment should explain what the return
  value means.

* reset_dbentry_counters() does more than just resetting the counters. Name
  like initialize_dbentry() would be more descriptive.

* typos:

    ** backend_snapshot_global_stats(): "liftime" -> "lifetime"

    ** snapshot_statentry(): "entriy" -> "entry"

    ** backend_get_func_etnry(): "onshot" -> "oneshot"

    ** snapshot_statentry_all(): "Returns a local hash contains ..." -> "Returns a local hash containing ..."


[1] https://www.postgresql.org/message-id/20180711000605.sqjik3vqe5opqz33@alap3.anarazel.de

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


Re: shared-memory based stats collector

From
Andres Freund
Date:
Hi,

On 2018-09-20 09:55:27 +0200, Antonin Houska wrote:
> I've spent some time reviewing this version.
> 
> Design
> ------
> 
> 1. Even with your patch the stats collector still uses a UDP socket to
>    receive data. Now that the shared memory API is there, shouldn't the
>    messages be sent via a shared memory queue? [1] That would increase the
>    reliability of message delivery.
> 
>    I can actually imagine backends inserting data into the shared hash tables
>    themselves, but that might make them wait if the same entries are accessed
>    by another backend. It should be much cheaper just to insert a message into
>    the queue and let the collector process it. In a future version the
>    collector can launch parallel workers so that writes by backends do not
>    get blocked due to a full queue.

I don't think either of these is right. I think it's crucial to get rid
of the UDP socket, but I think using a shmem queue is the wrong
approach. Not just because postgres' shm_mq is single-reader/writer, but
also because it's plainly unnecessary.  Backends should attempt to
update the shared hashtable, but acquire the necessary lock
conditionally, and leave the pending updates of the shared hashtable to
a later time if they cannot acquire the lock.

Greetings,

Andres Freund


Re: shared-memory based stats collector

From
Kyotaro HORIGUCHI
Date:
Hello. Thank you for the comments.

At Thu, 20 Sep 2018 10:37:24 -0700, Andres Freund <andres@anarazel.de> wrote in
<20180920173724.5w2n2nwkxtyi4azw@alap3.anarazel.de>
> Hi,
> 
> On 2018-09-20 09:55:27 +0200, Antonin Houska wrote:
> > I've spent some time reviewing this version.
> > 
> > Design
> > ------
> > 
> > 1. Even with your patch the stats collector still uses a UDP socket to
> >    receive data. Now that the shared memory API is there, shouldn't the
> >    messages be sent via a shared memory queue? [1] That would increase the
> >    reliability of message delivery.
> > 
> >    I can actually imagine backends inserting data into the shared hash tables
> >    themselves, but that might make them wait if the same entries are accessed
> >    by another backend. It should be much cheaper just to insert a message into
> >    the queue and let the collector process it. In a future version the
> >    collector can launch parallel workers so that writes by backends do not
> >    get blocked due to a full queue.
> 
> I don't think either of these is right. I think it's crucial to get rid
> of the UDP socket, but I think using a shmem queue is the wrong
> approach. Not just because postgres' shm_mq is single-reader/writer, but
> also because it's plainly unnecessary.  Backends should attempt to
> update the shared hashtable, but acquire the necessary lock
> conditionally, and leave the pending updates of the shared hashtable to
> a later time if they cannot acquire the lock.

OK, I just intended to avoid reading many bytes from a file, and
thought that the writer side could be resolved later.

Currently, locks on the shared stats table are acquired by the dshash
mechanism partition by partition. The number of
partitions is currently fixed at 2^7 = 128, but writes to the
same table conflict with each other regardless of the number of
partitions. As the first step, I'm going to add a
conditional-locking capability to dshash_find_or_insert, and each
backend will hold a queue of its pending updates.
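
To illustrate why the partition count does not help for writes to the
same table, here is a minimal sketch in C. It assumes the partition is
chosen from the top bits of a 32-bit hash; hash_oid() and
partition_for_oid() are hypothetical names used only for this example,
and the mixer is a stand-in, not necessarily the hash function dshash
uses.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define NUM_PARTITIONS_LOG2 7
#define NUM_PARTITIONS (1 << NUM_PARTITIONS_LOG2)   /* 2^7 = 128 */

/* Hypothetical 32-bit mixer standing in for the real hash function. */
static uint32_t
hash_oid(uint32_t oid)
{
    uint32_t    h = oid;

    h ^= h >> 16;
    h *= 0x85ebca6bU;
    h ^= h >> 13;
    h *= 0xc2b2ae35U;
    h ^= h >> 16;
    return h;
}

/*
 * Assumption for this sketch: the partition (and therefore the lock) is
 * selected by the top NUM_PARTITIONS_LOG2 bits of the hash value.
 */
static unsigned
partition_for_oid(uint32_t oid)
{
    unsigned    part;

    part = hash_oid(oid) >> (sizeof(uint32_t) * CHAR_BIT - NUM_PARTITIONS_LOG2);
    assert(part < NUM_PARTITIONS);
    return part;
}
```

Because the mapping is a pure function of the key, two backends that
update the entry of the same table always contend for the same partition
lock, however many partitions exist. More partitions only reduce
collisions between different keys.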

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: shared-memory based stats collector

From
Kyotaro HORIGUCHI
Date:
Hello. This is a super-PoC of a no-UDP stats collector.

At Wed, 26 Sep 2018 09:55:09 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20180926.095509.182252925.horiguchi.kyotaro@lab.ntt.co.jp>
> > I don't think either of these is right. I think it's crucial to get rid
> > of the UDP socket, but I think using a shmem queue is the wrong
> > approach. Not just because postgres' shm_mq is single-reader/writer, but
> > also because it's plainly unnecessary.  Backends should attempt to
> > update the shared hashtable, but acquire the necessary lock
> > conditionally, and leave the pending updates of the shared hashtable to
> > a later time if they cannot acquire the lock.
> 
> OK, I only intended to avoid reading many bytes from a file, and
> thought that the writer side could be resolved later.
> 
> Currently, locks on the shared stats table are acquired by the
> dshash mechanism in a partition-wise manner. The number of
> partitions is currently fixed at 2^7 = 128, but writes to the
> same table conflict with each other regardless of the number of
> partitions. As the first step, I'm going to add a
> conditional-locking capability to dshash_find_or_insert and have
> each backend hold a queue of its pending updates.

I don't have any more time until next Monday, so this is just a
PoC (sorry).

- 0001 to 0006 are a rebased version of v4.
- 0007 adds conditional locking to dshash.
- 0008 is the no-UDP stats collector.

If the required lock cannot be acquired for some stats items, the
report functions return immediately after storing the values
locally. The stored values are merged by later calls. Explicitly
calling pgstat_cleanup_pending_stat() at a convenient time tries
to apply the pending values, but the function is not called
anywhere for now.
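
The store-locally-and-merge-later behavior can be sketched as
follows. This is a simplified illustration, not the patch's code: the
types `SharedCounter` and `PendingCounter` and the functions
`report_count` and `cleanup_pending` are hypothetical, and a C11
atomic_flag stands in for the conditional lock that the patch takes
on a dshash partition.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one shared stats entry. */
typedef struct SharedCounter
{
    atomic_flag lock;           /* set while some backend holds the entry */
    uint64_t    value;          /* shared total */
} SharedCounter;

/* Backend-local accumulator for values that could not be applied yet. */
typedef struct PendingCounter
{
    uint64_t    pending;
} PendingCounter;

static void
shared_counter_init(SharedCounter *c)
{
    atomic_flag_clear(&c->lock);
    c->value = 0;
}

/* Try to take the lock without waiting; true on success. */
static bool
try_lock(SharedCounter *c)
{
    return !atomic_flag_test_and_set(&c->lock);
}

static void
unlock(SharedCounter *c)
{
    atomic_flag_clear(&c->lock);
}

/*
 * Report 'delta'.  The delta is merged into the local pending value
 * first; if the lock is free, everything pending is applied at once.
 * Returns true when nothing is left pending.
 */
static bool
report_count(SharedCounter *shared, PendingCounter *local, uint64_t delta)
{
    assert(shared && local);
    local->pending += delta;

    if (!try_lock(shared))
        return false;           /* lock busy: keep the values locally */

    shared->value += local->pending;
    local->pending = 0;
    unlock(shared);
    return true;
}

/* Retry pending values at a convenient time. */
static bool
cleanup_pending(SharedCounter *shared, PendingCounter *local)
{
    if (local->pending == 0)
        return true;
    return report_count(shared, local, 0);
}
```

Because the delta is folded into the local value before the lock
attempt, a failed attempt loses nothing, and the next successful call
(or an explicit cleanup_pending()) applies the accumulated total in
one step.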

The stats collector process is now used only to save and load the
stats files and to create the shared memory for stats. I'm going
to remove the stats collector.

I'll continue working in this direction.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 86c11126fabafd1ca5637ed415b537ad7b1dec08 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 27 Sep 2018 21:36:06 +0900
Subject: [PATCH 8/8] Ultra PoC of full-shared-memory stats collector.

This patch is a super-ultra PoC of a full-shared-memory stats collector,
which means UDP is no longer involved in the stats collector mechanism.
Some statistics items can be postponed when the required lock is not
available, and they can be applied later by calling
pgstat_cleanup_pending_stat() at a convenient time (not called in this
patch).
---
 src/backend/postmaster/pgstat.c     | 2416 +++++++++++++----------------------
 src/backend/storage/lmgr/deadlock.c |    2 +-
 src/include/pgstat.h                |  357 +-----
 3 files changed, 917 insertions(+), 1858 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index a3d5f4856f..b73da9a7f2 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -93,6 +93,23 @@
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
 
+#define PGSTAT_MAX_QUEUE_LEN    100
+
+/*
+ * Operation mode of pgstat_get_db_entry.
+ */
+#define PGSTAT_TABLE_READ    0
+#define PGSTAT_TABLE_WRITE    1
+#define PGSTAT_TABLE_CREATE 2
+#define    PGSTAT_TABLE_NOWAIT 4
+
+typedef enum
+{
+    PGSTAT_TABLE_NOT_FOUND,
+    PGSTAT_TABLE_FOUND,
+    PGSTAT_TABLE_LOCK_FAILED
+} pg_stat_table_result_status;
+
 /* ----------
  * Total number of backends including auxiliary
  *
@@ -119,16 +136,12 @@ int            pgstat_track_activity_query_size = 1024;
  * Stored directly in a stats message structure so it can be sent
  * without needing to copy things around.  We assume this inits to zeroes.
  */
-PgStat_MsgBgWriter BgWriterStats;
+PgStat_BgWriter BgWriterStats;
 
 /* ----------
  * Local data
  * ----------
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
-
 static time_t last_pgstat_start_time;
 
 static bool pgStatRunningInCollector = false;
@@ -212,12 +225,6 @@ static HTAB *pgStatTabHash = NULL;
  */
 static HTAB *pgStatFunctions = NULL;
 
-/*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
- */
-static bool have_function_stats = false;
-
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
  * into PgStat_TableStatus counters until we know if it is going to commit
@@ -311,7 +318,8 @@ static void pgstat_quickdie_handler(SIGNAL_ARGS);
 static void pgstat_beshutdown_hook(int code, Datum arg);
 static void pgstat_sighup_handler(SIGNAL_ARGS);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
+static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, int op,
+                                    pg_stat_table_result_status *status);
 static PgStat_StatTabEntry *pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create);
 static void pgstat_write_statsfiles(void);
 static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry);
@@ -323,10 +331,9 @@ static bool backend_snapshot_global_stats(void);
 static PgStat_StatFuncEntry *backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot);
 static void pgstat_read_current_status(void);
 
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
 static void pgstat_send_funcstats(void);
 static HTAB *pgstat_collect_oids(Oid catalogid);
-
+static void reset_dbentry_counters(PgStat_StatDBEntry *dbentry);
 static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
 
 static void pgstat_setup_memcxt(void);
@@ -337,25 +344,13 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+static dshash_table *pgstat_update_dbentry(Oid dboid);
+static bool pgstat_update_tabentry(dshash_table *tabhash,
+                                   PgStat_TableStatus *stat);
+static bool pgstat_update_funcentry(dshash_table *funchash,
+                                    PgStat_BackendFunctionEntry *stat);
+static bool pgstat_tabpurge(Oid dboid, Oid taboid);
+static bool pgstat_funcpurge(Oid dboid, Oid funcoid);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
@@ -374,280 +369,7 @@ static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
 void
 pgstat_init(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
-
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
-
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
-    {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
-    }
-
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
-
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
-
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
-
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
-    }
-
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
-
-    /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
-     */
-    if (!pg_set_noblock(pgStatSock))
-    {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
-    }
-
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
-    }
-
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
-
     return;
-
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
-
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
-
-    /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
-     */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
 }
 
 /*
@@ -713,226 +435,6 @@ allow_immediate_pgstat_restart(void)
     last_pgstat_start_time = 0;
 }
 
-/* ------------------------------------------------------------
- * Public functions used by backends follow
- *------------------------------------------------------------
- */
-
-
-/* ----------
- * pgstat_report_stat() -
- *
- *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
- *    transaction stop time as an approximation of current time.
- * ----------
- */
-void
-pgstat_report_stat(bool force)
-{
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
-    TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
-    int            i;
-
-    /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
-
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
-    now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
-
-    /*
-     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
-     */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
-    pgStatTabHash = NULL;
-
-    /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
-     */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
-    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
-    {
-        for (i = 0; i < tsa->tsa_used; i++)
-        {
-            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
-
-            /* Shouldn't have any pending transaction-dependent counts */
-            Assert(entry->trans == NULL);
-
-            /*
-             * Ignore entries that didn't accumulate any actual counts, such
-             * as indexes that were opened by the planner but not used.
-             */
-            if (memcmp(&entry->t_counts, &all_zeroes,
-                       sizeof(PgStat_TableCounts)) == 0)
-                continue;
-
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
-            {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
-            }
-        }
-        /* zero out TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
-    }
-
-    /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
-     */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
-
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
-}
-
-/*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
- */
-static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
-{
-    int            n;
-    int            len;
-
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
-     */
-    if (OidIsValid(tsmsg->m_databaseid))
-    {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
-        pgStatXactCommit = 0;
-        pgStatXactRollback = 0;
-        pgStatBlockReadTime = 0;
-        pgStatBlockWriteTime = 0;
-    }
-    else
-    {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
-    }
-
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
-
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
-}
-
-/*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
- */
-static void
-pgstat_send_funcstats(void)
-{
-    /* we assume this inits to all zeroes: */
-    static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
-
-    if (pgStatFunctions == NULL)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
-
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
-
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
-
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
-
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
-
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
-}
-
-
 /* ----------
  * pgstat_attach_shared_stats() -
  *
@@ -1008,6 +510,223 @@ pgstat_create_shared_stats(void)
     LWLockRelease(StatsLock);
 }
 
+
+/* ------------------------------------------------------------
+ * Public functions used by backends follow
+ *------------------------------------------------------------
+ */
+static bool pgstat_pending_tabstats = false;
+static bool pgstat_pending_funcstats = false;
+static bool pgstat_pending_vacstats = false;
+static bool pgstat_pending_dropdb = false;
+static bool pgstat_pending_resetcounter = false;
+static bool pgstat_pending_resetsharedcounter = false;
+static bool pgstat_pending_resetsinglecounter = false;
+static bool pgstat_pending_autovac = false;
+static bool pgstat_pending_vacuum = false;
+static bool pgstat_pending_analyze = false;
+static bool pgstat_pending_recoveryconflict = false;
+static bool pgstat_pending_deadlock = false;
+static bool pgstat_pending_tempfile = false;
+
+void
+pgstat_cleanup_pending_stat(void)
+{
+    if (pgstat_pending_tabstats)
+        pgstat_report_stat(true);
+    if (pgstat_pending_funcstats)
+        pgstat_send_funcstats();
+    if (pgstat_pending_vacstats)
+        pgstat_vacuum_stat();
+    if (pgstat_pending_dropdb)
+        pgstat_drop_database(InvalidOid);
+    if (pgstat_pending_resetcounter)
+        pgstat_reset_counters();
+    if (pgstat_pending_resetsharedcounter)
+        pgstat_reset_shared_counters(NULL);
+    if (pgstat_pending_resetsinglecounter)
+        pgstat_reset_single_counter(InvalidOid, 0);
+    if (pgstat_pending_autovac)
+        pgstat_report_autovac(InvalidOid);
+    if (pgstat_pending_vacuum)
+        pgstat_report_vacuum(InvalidOid, false, 0, 0);
+    if (pgstat_pending_analyze)
+        pgstat_report_analyze(NULL, 0, 0, false);
+    if (pgstat_pending_recoveryconflict)
+        pgstat_report_recovery_conflict(-1);
+    if (pgstat_pending_deadlock)
+        pgstat_report_deadlock(true);
+    if (pgstat_pending_tempfile)
+        pgstat_report_tempfile(0);
+}
+
+/* ----------
+ * pgstat_report_stat() -
+ *
+ *    Must be called by processes that performs DML: tcop/postgres.c, logical
+ *    receiver processes, SPI worker, etc. to send the so far collected
+ *    per-table and function usage statistics to the collector.  Note that this
+ *    is called only when not within a transaction, so it is fair to use
+ *    transaction stop time as an approximation of current time.
+ * ----------
+ */
+void
+pgstat_report_stat(bool force)
+{
+    /* we assume this inits to all zeroes: */
+    static const PgStat_TableCounts all_zeroes;
+    static TimestampTz last_report = 0;
+    TimestampTz now;
+    TabStatusArray *tsa;
+    int            i;
+    dshash_table *shared_tabhash = NULL;
+    dshash_table *regular_tabhash = NULL;
+
+    /* Don't expend a clock check if nothing to do */
+    if (!pgstat_pending_tabstats &&
+        (pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
+        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
+        !pgstat_pending_funcstats)
+        return;
+
+    /*
+     * Don't update shared stats unless it's been at least
+     * PGSTAT_STAT_INTERVAL msec since we last sent one, or the caller wants
+     * to force stats out.
+     */
+    now = GetCurrentTransactionStopTimestamp();
+    if (!force &&
+        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
+        return;
+    last_report = now;
+
+    /*
+     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
+     * entries it points to.  (Should we fail partway through the loop below,
+     * it's okay to have removed the hashtable already --- the only
+     * consequence is we'd get multiple entries for the same table in the
+     * pgStatTabList, and that's safe.)
+     */
+    if (pgStatTabHash)
+        hash_destroy(pgStatTabHash);
+    pgStatTabHash = NULL;
+
+    pgstat_pending_tabstats = false;
+    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+    {
+        int move_skipped_to = 0;
+
+        for (i = 0; i < tsa->tsa_used; i++)
+        {
+            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
+            dshash_table *tabhash;
+
+            /* Shouldn't have any pending transaction-dependent counts */
+            Assert(entry->trans == NULL);
+
+            /*
+             * Ignore entries that didn't accumulate any actual counts, such
+             * as indexes that were opened by the planner but not used.
+             */
+            if (memcmp(&entry->t_counts, &all_zeroes,
+                       sizeof(PgStat_TableCounts)) != 0)
+            {
+                /*
+                 * OK, apply the counts to the appropriate shared hash
+                 * table (shared relations or this database's relations).
+                 */
+                if (entry->t_shared)
+                {
+                    if (!shared_tabhash)
+                        shared_tabhash = pgstat_update_dbentry(InvalidOid);
+                    tabhash = shared_tabhash;
+                }
+                else
+                {
+                    if (!regular_tabhash)
+                        regular_tabhash = pgstat_update_dbentry(MyDatabaseId);
+                    tabhash = regular_tabhash;
+                }
+
+                /*
+                 * If this entry could not be processed, leave it for the
+                 * next turn. Leftover entries are packed at the array head.
+                 */
+                if (!pgstat_update_tabentry(tabhash, entry))
+                {
+                    if (move_skipped_to < i)
+                        memmove(&tsa->tsa_entries[move_skipped_to],
+                                &tsa->tsa_entries[i],
+                                sizeof(PgStat_TableStatus));
+                    move_skipped_to++;
+                }
+            }
+        }
+
+        /* note that unapplied items exist */
+        if (move_skipped_to > 0)
+            pgstat_pending_tabstats = true;
+
+        tsa->tsa_used = move_skipped_to;
+        /* zero out TableStatus structs after use */
+        MemSet(&tsa->tsa_entries[tsa->tsa_used], 0,
+               (TABSTAT_QUANTUM - tsa->tsa_used) * sizeof(PgStat_TableStatus));
+    }
+
+    /* Now, send function statistics */
+    pgstat_send_funcstats();
+}
+
+/*
+ * Subroutine for pgstat_report_stat: populate and send a function stat message
+ */
+static void
+pgstat_send_funcstats(void)
+{
+    /* we assume this inits to all zeroes: */
+    static const PgStat_FunctionCounts all_zeroes;
+
+    PgStat_BackendFunctionEntry *entry;
+    HASH_SEQ_STATUS fstat;
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
+    dshash_table *funchash;
+
+    if (pgStatFunctions == NULL)
+        return;
+
+    pgstat_pending_funcstats = false;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_TABLE_WRITE
+                                  | PGSTAT_TABLE_CREATE
+                                  | PGSTAT_TABLE_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_TABLE_LOCK_FAILED)
+        return;
+
+    funchash = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+
+    hash_seq_init(&fstat, pgStatFunctions);
+    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    {
+        /* Skip it if no counts accumulated since last time */
+        if (memcmp(&entry->f_counts, &all_zeroes,
+                   sizeof(PgStat_FunctionCounts)) == 0)
+            continue;
+
+        if (pgstat_update_funcentry(funchash, entry))
+        {
+            /* reset the entry's counts */
+            MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
+        }
+        else
+            pgstat_pending_funcstats = true;
+    }
+}
+
+
 /* ----------
  * pgstat_vacuum_stat() -
  *
@@ -1018,17 +737,13 @@ void
 pgstat_vacuum_stat(void)
 {
     HTAB       *oidtab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
     dshash_table *dshtable;
     dshash_seq_status dshstat;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatFuncEntry *funcentry;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    pg_stat_table_result_status status;
+    bool        no_pending_stats;
 
     /* If not done for this transaction, take a snapshot of stats */
     if (!backend_snapshot_global_stats())
@@ -1060,11 +775,15 @@ pgstat_vacuum_stat(void)
     /* Clean up */
     hash_destroy(oidtab);
 
+    pgstat_pending_vacstats = true;
+
     /*
      * Lookup our own database entry; if not found, nothing more to do.
      */
-    dbentry = backend_get_db_entry(MyDatabaseId, true);
-    if (dbentry == NULL)
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_TABLE_WRITE | PGSTAT_TABLE_NOWAIT,
+                                  &status);
+    if (status == PGSTAT_TABLE_LOCK_FAILED)
         return;
     
     /*
@@ -1072,17 +791,13 @@ pgstat_vacuum_stat(void)
      */
     oidtab = pgstat_collect_oids(RelationRelationId);
 
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
-
     /*
      * Check for all tables listed in stats hashtable if they still exist.
      * Stats cache is useless here so directly search the shared hash.
      */
     dshtable = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
     dshash_seq_init(&dshstat, dshtable, false);
+    no_pending_stats = true;
     while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            tabid = tabentry->tableid;
@@ -1092,41 +807,11 @@ pgstat_vacuum_stat(void)
         if (hash_search(oidtab, (void *) &tabid, HASH_FIND, NULL) != NULL)
             continue;
 
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
+        /* Not there, so purge this table */
+        if (!pgstat_tabpurge(MyDatabaseId, tabid))
+            no_pending_stats = false;
     }
     dshash_detach(dshtable);
-
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
     /* Clean up */
     hash_destroy(oidtab);
 
@@ -1139,10 +824,6 @@ pgstat_vacuum_stat(void)
     {
         oidtab = pgstat_collect_oids(ProcedureRelationId);
 
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
         dshash_seq_init(&dshstat, dshtable, false);
         while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&dshstat)) != NULL)
         {
@@ -1153,39 +834,16 @@ pgstat_vacuum_stat(void)
             if (hash_search(oidtab, (void *) &funcid, HASH_FIND, NULL) != NULL)
                 continue;
 
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
+            /* Not there, so purge this function */
+            if (!pgstat_funcpurge(MyDatabaseId, funcid))
+                no_pending_stats = false;
         }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
-
         hash_destroy(oidtab);
     }
     dshash_detach(dshtable);
+
+    if (no_pending_stats)
+        pgstat_pending_vacstats = false;
 }
 
 
@@ -1247,50 +905,69 @@ pgstat_collect_oids(Oid catalogid)
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
+    static List *pending_dbid = NIL;
+    List *left_dbid = NIL;
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
+    ListCell *lc;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (OidIsValid(databaseid))
+        pending_dbid = lappend_oid(pending_dbid, databaseid);
+    pgstat_pending_dropdb = true;
+
+    if (db_stats == NULL)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
+    foreach (lc, pending_dbid)
+    {
+        Oid dbid = lfirst_oid(lc);
+
+        /*
+         * Lookup the database in the hashtable.
+         */
+        dbentry = pgstat_get_db_entry(dbid,
+                                      PGSTAT_TABLE_WRITE | PGSTAT_TABLE_NOWAIT,
+                                      &status);
+
+        /* skip on lock failure */
+        if (status == PGSTAT_TABLE_LOCK_FAILED)
+        {
+            left_dbid = lappend_oid(left_dbid, dbid);
+            continue;
+        }
+
+        /*
+         * If found, remove it (along with the db statfile).
+         */
+        if (dbentry)
+        {
+            if (dbentry->tables != DSM_HANDLE_INVALID)
+            {
+                dshash_table *tbl =
+                    dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+                dshash_destroy(tbl);
+            }
+            if (dbentry->functions != DSM_HANDLE_INVALID)
+            {
+                dshash_table *tbl =
+                    dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+                dshash_destroy(tbl);
+            }
+
+
+            dshash_delete_entry(db_stats, (void *)dbentry);
+        }
+    }
+
+    list_free(pending_dbid);
+    pending_dbid = left_dbid;
+
+    /*  we're done if no pending database ids */
+    if (list_length(pending_dbid) == 0)
+        pgstat_pending_dropdb = false;
 }
 
 
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
-
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
-}
-#endif                            /* NOT_USED */
-
-
 /* ----------
  * pgstat_reset_counters() -
  *
@@ -1303,14 +980,59 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    pgstat_pending_resetcounter = true;
+
+    if (db_stats == NULL)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Lookup the database in the hashtable.  Nothing to do if not there.
+     */
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_TABLE_WRITE | PGSTAT_TABLE_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_TABLE_LOCK_FAILED)
+        return;
+
+    if (!dbentry)
+    {
+        pgstat_pending_resetcounter = false;
+        return;
+    }
+
+    /*
+     * We simply throw away all the database's table entries by recreating a
+     * new hash table for them.
+     */
+    if (dbentry->tables != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_destroy(t);
+        dbentry->tables = DSM_HANDLE_INVALID;
+    }
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_destroy(t);
+        dbentry->functions = DSM_HANDLE_INVALID;
+    }
+
+    /*
+     * Reset database-level stats, too.  This creates empty hash tables for
+     * tables and functions.
+     */
+    reset_dbentry_counters(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
+
+    /*  we're done */
+    pgstat_pending_resetcounter = false;
 }
 
 /* ----------
@@ -1325,23 +1047,53 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    static bool archiver_pending = false;
+    static bool bgwriter_pending = false;
+    bool    have_lock = false;
 
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+    {
+        archiver_pending = true;
+    }
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
-    else
+    {
+        bgwriter_pending = true;
+    }
+    else if (target != NULL)
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
+    pgstat_pending_resetsharedcounter = true;
+
+    if (db_stats == NULL)
+        return;
+
+    /* Reset the archiver statistics for the cluster. */
+    if (archiver_pending && LWLockConditionalAcquire(StatsLock, LW_EXCLUSIVE))
+    {
+        memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
+        archiver_pending = false;
+        have_lock = true;
+    }
+
+    if (bgwriter_pending &&
+        (have_lock || LWLockConditionalAcquire(StatsLock, LW_EXCLUSIVE)))
+    {
+        /* Reset the global background writer statistics for the cluster. */
+        memset(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+        bgwriter_pending = false;
+        have_lock = true;
+    }
+
+    if (have_lock)
+        LWLockRelease(StatsLock);
+
+    /* notify any pending update  */
+    pgstat_pending_resetsharedcounter = (archiver_pending || bgwriter_pending);
 }
 
 /* ----------
@@ -1356,17 +1108,37 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* Don't defer */
+
+    if (db_stats == NULL)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_TABLE_WRITE, NULL);
 
-    pgstat_send(&msg, sizeof(msg));
+    if (!dbentry)
+        return;
+
+    /* Set the reset timestamp for the whole database */
+    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
+
+    /* Remove object if it exists, ignore it if not */
+    if (type == RESET_TABLE)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_delete_key(t, (void *) &objoid);
+    }
+
+    if (type == RESET_FUNCTION)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_delete_key(t, (void *) &objoid);
+    }
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -1380,16 +1152,23 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* Don't defer */
+
+    if (db_stats == NULL)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    /*
+     * Store the last autovacuum time in the database's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid,
+                                  PGSTAT_TABLE_WRITE | PGSTAT_TABLE_CREATE,
+                                  NULL);
 
-    pgstat_send(&msg, sizeof(msg));
+    dbentry->last_autovac_time = GetCurrentTimestamp();
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 
@@ -1403,19 +1182,43 @@ void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* Don't defer */
+
+    if (db_stats == NULL || !pgstat_track_counts)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid,
+                                  PGSTAT_TABLE_WRITE | PGSTAT_TABLE_CREATE,
+                                  NULL);
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, tableoid, true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->vacuum_count++;
+    }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* --------
@@ -1432,9 +1235,14 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* Don't defer */
+
+    if (db_stats == NULL || !pgstat_track_counts)
         return;
 
     /*
@@ -1463,15 +1271,42 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid,
+                                  PGSTAT_TABLE_WRITE | PGSTAT_TABLE_CREATE,
+                                  NULL);
+
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, RelationGetRelid(rel), true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* --------
@@ -1483,15 +1318,67 @@ pgstat_report_analyze(Relation rel,
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    static int pending_conflict_tablespace = 0;
+    static int pending_conflict_lock = 0;
+    static int pending_conflict_snapshot = 0;
+    static int pending_conflict_bufferpin = 0;
+    static int pending_conflict_startup_deadlock = 0;
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    if (db_stats == NULL || !pgstat_track_counts)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            pending_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            pending_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            pending_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            pending_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            pending_conflict_startup_deadlock++;
+            break;
+    }
+    pgstat_pending_recoveryconflict = true;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_TABLE_WRITE | PGSTAT_TABLE_CREATE |
+                                  PGSTAT_TABLE_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_TABLE_LOCK_FAILED)
+        return;
+
+    dbentry->n_conflict_tablespace    += pending_conflict_tablespace;
+    dbentry->n_conflict_lock         += pending_conflict_lock;
+    dbentry->n_conflict_snapshot    += pending_conflict_snapshot;
+    dbentry->n_conflict_bufferpin    += pending_conflict_bufferpin;
+    dbentry->n_conflict_startup_deadlock += pending_conflict_startup_deadlock;
+
+    pending_conflict_tablespace = 0;
+    pending_conflict_lock = 0;
+    pending_conflict_snapshot = 0;
+    pending_conflict_bufferpin = 0;
+    pending_conflict_startup_deadlock = 0;
+
+    dshash_release_lock(db_stats, dbentry);
+    
+    pgstat_pending_recoveryconflict = false;
 }
 
 /* --------
@@ -1501,16 +1388,31 @@ pgstat_report_recovery_conflict(int reason)
  * --------
  */
 void
-pgstat_report_deadlock(void)
+pgstat_report_deadlock(bool pending)
 {
-    PgStat_MsgDeadlock msg;
+    static int pending_deadlocks = 0;
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    if (db_stats == NULL || !pgstat_track_counts)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    pending_deadlocks++;
+    pgstat_pending_deadlock = true;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_TABLE_WRITE | PGSTAT_TABLE_CREATE |
+                                  PGSTAT_TABLE_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_TABLE_LOCK_FAILED)
+        return;
+
+    dbentry->n_deadlocks += pending_deadlocks;
+    pending_deadlocks = 0;
+    pgstat_pending_deadlock = false;
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* --------
@@ -1522,34 +1424,36 @@ pgstat_report_deadlock(void)
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    static size_t pending_filesize = 0;
+    static size_t pending_files = 0;
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    if (filesize > 0) /* Isn't there a case where filesize is really 0? */
+    {
+        pending_filesize += filesize; /* needs an overflow check */
+        pending_files++;
+    }
+    pgstat_pending_tempfile = true;
+
+    if (db_stats == NULL || !pgstat_track_counts)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
-}
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_TABLE_WRITE | PGSTAT_TABLE_CREATE |
+                                  PGSTAT_TABLE_NOWAIT,
+                                  &status);
 
-
-/* ----------
- * pgstat_ping() -
- *
- *    Send some junk data to the collector to increase traffic.
- * ----------
- */
-void
-pgstat_ping(void)
-{
-    PgStat_MsgDummy msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (status == PGSTAT_TABLE_LOCK_FAILED)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dbentry->n_temp_bytes += pending_filesize;
+    dbentry->n_temp_files += pending_files;
+    pending_filesize = 0;
+    pending_files = 0;
+    pgstat_pending_tempfile = false;
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 
@@ -1670,7 +1574,7 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
     INSTR_TIME_ADD(fs->f_self_time, f_self);
 
     /* indicate that we have something to send */
-    have_function_stats = true;
+    pgstat_pending_funcstats = true;
 }
 
 
@@ -1691,6 +1595,7 @@ pgstat_initstats(Relation rel)
 {
     Oid            rel_id = rel->rd_id;
     char        relkind = rel->rd_rel->relkind;
+    MemoryContext oldcontext;
 
     /* We only count stats for things that have storage */
     if (!(relkind == RELKIND_RELATION ||
@@ -1703,7 +1608,14 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* Attached shared memory lives for the process lifetime */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    while (!pgstat_attach_shared_stats())
+        sleep(1);
+
+    MemoryContextSwitchTo(oldcontext);
+
+    if (db_stats == NULL || !pgstat_track_counts)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -2426,7 +2338,7 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     PgStat_StatFuncEntry *funcentry = NULL;
 
     /* Lookup our database, then find the requested function */
-    dbentry = pgstat_get_db_entry(MyDatabaseId, false);
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_TABLE_READ, NULL);
     if (dbentry == NULL)
         return NULL;
 
@@ -2721,7 +2633,7 @@ pgstat_initialize(void)
     }
 
     /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -4105,49 +4017,6 @@ pgstat_get_backend_desc(BackendType backendType)
  * ------------------------------------------------------------
  */
 
-
-/* ----------
- * pgstat_setheader() -
- *
- *        Set common header fields in a statistics message
- * ----------
- */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
-    {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
 /* ----------
  * pgstat_send_archiver() -
  *
@@ -4158,16 +4027,22 @@ pgstat_send(void *msg, int len)
 void
 pgstat_send_archiver(const char *xlog, bool failed)
 {
-    PgStat_MsgArchiver msg;
-
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+    if (failed)
+    {
+        /* Failed archival attempt */
+        ++shared_archiverStats->failed_count;
+        StrNCpy(shared_archiverStats->last_failed_wal, xlog,
+                sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = GetCurrentTimestamp();
+    }
+    else
+    {
+        /* Successful archival operation */
+        ++shared_archiverStats->archived_count;
+        StrNCpy(shared_archiverStats->last_archived_wal, xlog,
+                sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = GetCurrentTimestamp();
+    }
 }
 
 /* ----------
@@ -4180,21 +4055,36 @@ void
 pgstat_send_bgwriter(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+
+    PgStat_BgWriter *s = &BgWriterStats;
+    MemoryContext oldcontext;
 
     /*
      * This function can be called even if nothing at all has happened. In
      * this case, avoid sending a completely empty message to the stats
      * collector.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    /* Attached shared memory lives for the process lifetime */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    while (!pgstat_attach_shared_stats())
+        sleep(1);
+
+    MemoryContextSwitchTo(oldcontext);
+
+    shared_globalStats->timed_checkpoints += s->timed_checkpoints;
+    shared_globalStats->requested_checkpoints += s->requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += s->checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += s->checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += s->buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += s->buf_written_clean;
+    shared_globalStats->maxwritten_clean += s->maxwritten_clean;
+    shared_globalStats->buf_written_backend += s->buf_written_backend;
+    shared_globalStats->buf_fsync_backend += s->buf_fsync_backend;
+    shared_globalStats->buf_alloc += s->buf_alloc;
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4215,8 +4105,6 @@ pgstat_send_bgwriter(void)
 void
 PgstatCollectorMain(void)
 {
-    int            len;
-    PgStat_Msg    msg;
     int            wr;
 
     /*
@@ -4246,20 +4134,6 @@ PgstatCollectorMain(void)
     pgStatRunningInCollector = true;
     pgstat_read_statsfiles();
 
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do recognize got_SIGHUP inside
-     * the inner loop, which means that such interrupts will get serviced but
-     * the latch won't get cleared until next time there is a break in the
-     * action.
-     */
     for (;;)
     {
         /* Clear any already-pending wakeups */
@@ -4272,164 +4146,16 @@ PgstatCollectorMain(void)
             break;
 
         /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * need_exit becomes set.
+         * Reload configuration if we got SIGHUP from the postmaster.
          */
-        while (!got_SIGTERM)
+        if (got_SIGHUP)
         {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (got_SIGHUP)
-            {
-                got_SIGHUP = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat((PgStat_MsgTabstat *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge((PgStat_MsgTabpurge *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb((PgStat_MsgDropdb *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter((PgStat_MsgResetcounter *) &msg,
-                                             len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(
-                                                   (PgStat_MsgResetsharedcounter *) &msg,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(
-                                                   (PgStat_MsgResetsinglecounter *) &msg,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac((PgStat_MsgAutovacStart *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum((PgStat_MsgVacuum *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze((PgStat_MsgAnalyze *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver((PgStat_MsgArchiver *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter((PgStat_MsgBgWriter *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat((PgStat_MsgFuncstat *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge((PgStat_MsgFuncpurge *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict((PgStat_MsgRecoveryConflict *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock((PgStat_MsgDeadlock *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile((PgStat_MsgTempFile *) &msg, len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
+            got_SIGHUP = false;
+            ProcessConfigFile(PGC_SIGHUP);
+        }
+
+        wr = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1L,
+                       WAIT_EVENT_PGSTAT_MAIN);
 
         /*
          * Emergency bailout if postmaster has died.  This is to avoid the
@@ -4437,8 +4163,8 @@ PgstatCollectorMain(void)
          */
         if (wr & WL_POSTMASTER_DEATH)
             break;
-    }                            /* end of outer loop */
-
+    }
+
     /*
      * Save the final stats to reuse at next startup.
      */
@@ -4552,29 +4278,62 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
  * Else, return NULL.
  */
 static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
+pgstat_get_db_entry(Oid databaseid, int op, pg_stat_table_result_status *status)
 {
     PgStat_StatDBEntry *result;
-    bool        found;
+    bool        nowait = ((op & PGSTAT_TABLE_NOWAIT) != 0);
+    bool        lock_acquired = true;
+    bool        found = true;
+    MemoryContext oldcontext;
 
-    Assert(pgStatRunningInCollector);
+    /* Attach to shared stats, allocating in TopMemoryContext */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    if (!pgstat_attach_shared_stats())
+    {
+        MemoryContextSwitchTo(oldcontext);
+        return NULL;
+    }
+    MemoryContextSwitchTo(oldcontext);
 
     /* Lookup or create the hash table entry for this database */
-    if (create)
+    if (op & PGSTAT_TABLE_CREATE)
+    {
         result = (PgStat_StatDBEntry *)
-            dshash_find_or_insert(db_stats,    &databaseid, &found);
+            dshash_find_or_insert_extended(db_stats, &databaseid,
+                                           &found, nowait);
+        if (result == NULL)
+            lock_acquired = false;
+        else if (!found)
+        {
+            /*
+             * If not found, initialize the new one.  This creates empty hash
+             * tables for tables and functions, too.
+             */
+            reset_dbentry_counters(result);
+        }
+    }
     else
-        result = (PgStat_StatDBEntry *)    dshash_find(db_stats, &databaseid, true);
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_extended(db_stats, &databaseid,
+                                 &lock_acquired, true, nowait);
+        if (result == NULL)
+            found = false;
+    }
 
-    if (!create)
-        return result;
-
-    /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
-     */
-    if (!found)
-        reset_dbentry_counters(result);
+    /* Set return status if requested */
+    if (status)
+    {
+        if (!lock_acquired)
+        {
+            Assert(nowait);
+            *status = PGSTAT_TABLE_LOCK_FAILED;
+        }
+        else if (!found)
+            *status = PGSTAT_TABLE_NOT_FOUND;
+        else
+            *status = PGSTAT_TABLE_FOUND;
+    }
 
     return result;
 }
@@ -5646,108 +5405,124 @@ pgstat_clear_snapshot(void)
  *    Count what the backend has done.
  * ----------
  */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
+static dshash_table *
+pgstat_update_dbentry(Oid dboid)
 {
-    dshash_table *tabhash;
     PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
+    pg_stat_table_result_status status;
+    dshash_table *tabhash;
 
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
+    dbentry = pgstat_get_db_entry(dboid,
+                                  PGSTAT_TABLE_WRITE
+                                  | PGSTAT_TABLE_CREATE
+                                  | PGSTAT_TABLE_NOWAIT,
+                                  &status);
+
+    /* return if lock failed */
+    if (status == PGSTAT_TABLE_LOCK_FAILED)
+        return NULL;
 
     tabhash = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
+
+    if (OidIsValid(dboid))
     {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *)
-            dshash_find_or_insert(tabhash, (void *) &(tabmsg->t_id), &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-        dshash_release_lock(tabhash, tabentry);
-
         /*
-         * Add per-table stats to the per-database entry, too.
+         * Update database-wide stats.
          */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
+        dbentry->n_xact_commit += pgStatXactCommit;
+        dbentry->n_xact_rollback += pgStatXactRollback;
+        dbentry->n_block_read_time += pgStatBlockReadTime;
+        dbentry->n_block_write_time += pgStatBlockWriteTime;
     }
 
+    pgStatXactCommit = 0;
+    pgStatXactRollback = 0;
+    pgStatBlockReadTime = 0;
+    pgStatBlockWriteTime = 0;
+
     dshash_release_lock(db_stats, dbentry);
+
+    return tabhash;
+}
+
+static bool
+pgstat_update_tabentry(dshash_table *tabhash, PgStat_TableStatus *stat)
+{
+    PgStat_StatTabEntry *tabentry;
+    bool    found;
+
+    if (tabhash == NULL)
+        return false;
+
+    tabentry = (PgStat_StatTabEntry *)
+        dshash_find_or_insert_extended(tabhash, (void *) &(stat->t_id),
+                                       &found, true);
+
+    /* failed to acquire lock */
+    if (tabentry == NULL)
+        return false;
+
+    if (!found)
+    {
+        /*
+         * If it's a new table entry, initialize counters to the values we
+         * just got.
+         */
+        tabentry->numscans = stat->t_counts.t_numscans;
+        tabentry->tuples_returned = stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched = stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted = stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated = stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted = stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated = stat->t_counts.t_tuples_hot_updated;
+        tabentry->n_live_tuples = stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples = stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze = stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched = stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit = stat->t_counts.t_blocks_hit;
+
+        tabentry->vacuum_timestamp = 0;
+        tabentry->vacuum_count = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_count = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->analyze_count = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+        tabentry->autovac_analyze_count = 0;
+    }
+    else
+    {
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        tabentry->numscans += stat->t_counts.t_numscans;
+        tabentry->tuples_returned += stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched += stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted += stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated += stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted += stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated += stat->t_counts.t_tuples_hot_updated;
+        /* If table was truncated, first reset the live/dead counters */
+        if (stat->t_counts.t_truncated)
+        {
+            tabentry->n_live_tuples = 0;
+            tabentry->n_dead_tuples = 0;
+        }
+        tabentry->n_live_tuples += stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples += stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze += stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched += stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit += stat->t_counts.t_blocks_hit;
+    }
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
+
+    dshash_release_lock(tabhash, tabentry);
+
+    return true;
 }
 
 
@@ -5757,14 +5532,15 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
  *    Arrange for dead table removal.
  * ----------
  */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
+static bool
+pgstat_tabpurge(Oid dboid, Oid taboid)
 {
     dshash_table *tbl;
     PgStat_StatDBEntry *dbentry;
-    int            i;
 
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
+    /* wait for lock */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_TABLE_WRITE, NULL);
+
     /*
      * No need to purge if we don't even know the database.
      */
@@ -5772,459 +5548,67 @@ pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
     {
         if (dbentry)
             dshash_release_lock(db_stats, dbentry);
-        return;
+        return true;
     }
 
     tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) dshash_delete_key(tbl, (void *) &(msg->m_tableid[i]));
-    }
+
+    /* Remove from hashtable if present; we don't care if it's not. */
+    (void) dshash_delete_key(tbl, (void *) &taboid);
 
     dshash_release_lock(db_stats, dbentry);
 
+    return true;
 }
 
 
 /* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        if (dbentry->tables != DSM_HANDLE_INVALID)
-        {
-            dshash_table *tbl =
-                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
-            dshash_destroy(tbl);
-        }
-        if (dbentry->functions != DSM_HANDLE_INVALID)
-        {
-            dshash_table *tbl =
-                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
-            dshash_destroy(tbl);
-        }
-
-        dshash_delete_entry(db_stats, (void *)dbentry);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != DSM_HANDLE_INVALID)
-    {
-        dshash_table *t =
-            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
-        dshash_destroy(t);
-        dbentry->tables = DSM_HANDLE_INVALID;
-    }
-    if (dbentry->functions != DSM_HANDLE_INVALID)
-    {
-        dshash_table *t =
-            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
-        dshash_destroy(t);
-        dbentry->functions = DSM_HANDLE_INVALID;
-    }
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-
-    dshash_release_lock(db_stats, dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetshared() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&shared_globalStats, 0, sizeof(shared_globalStats));
-        shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&shared_archiverStats, 0, sizeof(shared_archiverStats));
-        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-    {
-        dshash_table *t =
-            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
-        dshash_delete_key(t, (void *) &(msg->m_objectid));
-    }
-    else if (msg->m_resettype == RESET_FUNCTION)
-    {
-        dshash_table *t =
-            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
-        dshash_delete_key(t, (void *) &(msg->m_objectid));
-    }
-
-    dshash_release_lock(db_stats, dbentry);
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-
-    dshash_release_lock(db_stats, dbentry);
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    dshash_table *table;
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
-    tabentry = pgstat_get_tab_entry(table, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-    dshash_release_lock(table, tabentry);
-    dshash_detach(table);
-    dshash_release_lock(db_stats, dbentry);
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    dshash_table *table;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
-    tabentry = pgstat_get_tab_entry(table, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-    dshash_release_lock(table, tabentry);
-    dshash_detach(table);
-    dshash_release_lock(db_stats, dbentry);
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++shared_archiverStats->failed_count;
-        memcpy(shared_archiverStats->last_failed_wal, msg->m_xlog,
-               sizeof(shared_archiverStats->last_failed_wal));
-        shared_archiverStats->last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++shared_archiverStats->archived_count;
-        memcpy(shared_archiverStats->last_archived_wal, msg->m_xlog,
-               sizeof(shared_archiverStats->last_archived_wal));
-        shared_archiverStats->last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    shared_globalStats->timed_checkpoints += msg->m_timed_checkpoints;
-    shared_globalStats->requested_checkpoints += msg->m_requested_checkpoints;
-    shared_globalStats->checkpoint_write_time += msg->m_checkpoint_write_time;
-    shared_globalStats->checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    shared_globalStats->buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    shared_globalStats->buf_written_clean += msg->m_buf_written_clean;
-    shared_globalStats->maxwritten_clean += msg->m_maxwritten_clean;
-    shared_globalStats->buf_written_backend += msg->m_buf_written_backend;
-    shared_globalStats->buf_fsync_backend += msg->m_buf_fsync_backend;
-    shared_globalStats->buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-
-    dshash_release_lock(db_stats, dbentry);
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-
-    dshash_release_lock(db_stats, dbentry);
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-
-    dshash_release_lock(db_stats, dbentry);
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
+ * pgstat_update_funcentry() -
  *
  *    Count what the backend has done.
  * ----------
  */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
+static bool
+pgstat_update_funcentry(dshash_table *funchash,
+                        PgStat_BackendFunctionEntry *stat)
 {
-    dshash_table *t;
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
     PgStat_StatFuncEntry *funcentry;
-    int            i;
     bool        found;
 
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
+    funcentry = (PgStat_StatFuncEntry *)
+        dshash_find_or_insert_extended(funchash, (void *) &(stat->f_id),
+                                       &found, true);
 
-    t = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
+    if (funcentry == NULL)
+        return false;
+
+    if (!found)
     {
-        funcentry = (PgStat_StatFuncEntry *)
-            dshash_find_or_insert(t, (void *) &(funcmsg->f_id), &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-        dshash_release_lock(t, funcentry);
+        /*
+         * If it's a new function entry, initialize counters to the values
+         * we just got.
+         */
+        funcentry->f_numcalls = stat->f_counts.f_numcalls;
+        funcentry->f_total_time =
+            INSTR_TIME_GET_MICROSEC(stat->f_counts.f_total_time);
+        funcentry->f_self_time =
+            INSTR_TIME_GET_MICROSEC(stat->f_counts.f_self_time);
+    }
+    else
+    {
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        funcentry->f_numcalls += stat->f_counts.f_numcalls;
+        funcentry->f_total_time +=
+            INSTR_TIME_GET_MICROSEC(stat->f_counts.f_total_time);
+        funcentry->f_self_time +=
+            INSTR_TIME_GET_MICROSEC(stat->f_counts.f_self_time);
     }
 
-    dshash_detach(t);
-    dshash_release_lock(db_stats, dbentry);
+    dshash_release_lock(funchash, funcentry);
+
+    return true;
 }
 
 /* ----------
@@ -6233,32 +5617,30 @@ pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
  *    Arrange for dead function removal.
  * ----------
  */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
+static bool
+pgstat_funcpurge(Oid dboid, Oid funcoid)
 {
     dshash_table *t;
     PgStat_StatDBEntry *dbentry;
-    int            i;
 
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
+    /* wait for lock */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_TABLE_WRITE, NULL);
 
     /*
      * No need to purge if we don't even know the database.
      */
     if (!dbentry || dbentry->functions == DSM_HANDLE_INVALID)
-        return;
+        return true;
 
     t = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        dshash_delete_key(t, (void *) &(msg->m_functionid[i]));
-    }
+
+    /* Remove from hashtable if present; we don't care if it's not. */
+    dshash_delete_key(t, (void *) &funcoid);
     dshash_detach(t);
+
     dshash_release_lock(db_stats, dbentry);
+
+    return true;
 }
 
 /*
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index aeaf1f3ab4..2b64d313b9 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -1130,7 +1130,7 @@ DeadLockReport(void)
                          pgstat_get_backend_current_activity(info->pid, false));
     }
 
-    pgstat_report_deadlock();
+    pgstat_report_deadlock(false);
 
     ereport(ERROR,
             (errcode(ERRCODE_T_R_DEADLOCK_DETECTED),
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index afc1927250..ff97d6ab6e 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -39,31 +39,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -112,13 +87,6 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_blocks_hit;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -177,242 +145,23 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
 /* ----------
- * PgStat_MsgHdr                The common message header
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgHdr
+typedef struct PgStat_BgWriter
 {
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter buf_alloc;
+    PgStat_Counter checkpoint_write_time; /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+} PgStat_BgWriter;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
@@ -453,78 +202,6 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgAutovacStart msg_autovacuum;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
  * Statistic collector data structures follow
  *
@@ -1111,7 +788,7 @@ extern char *pgstat_stat_filename;
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1135,8 +812,6 @@ extern void allow_immediate_pgstat_restart(void);
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
 extern void pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
@@ -1154,7 +829,7 @@ extern void pgstat_report_analyze(Relation rel,
                       bool resetcounter);
 
 extern void pgstat_report_recovery_conflict(int reason);
-extern void pgstat_report_deadlock(void);
+extern void pgstat_report_deadlock(bool pending);
 
 extern void pgstat_initialize(void);
 extern void pgstat_bestart(void);
@@ -1328,4 +1003,6 @@ extern void PgstatCollectorMain(void) pg_attribute_noreturn();
 extern Size StatsShmemSize(void);
 extern void StatsShmemInit(void);
 
+extern void pgstat_cleanup_pending_stat(void);
+
 #endif                            /* PGSTAT_H */
-- 
2.16.3
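
Editorial illustration (not part of the patch series): the function-stats
hunk above appears to follow a "try to lock, else keep pending" pattern. The
update returns false when the shared entry cannot be locked without waiting;
otherwise it initializes a new entry or adds to an existing one. A minimal
standalone sketch of that pattern follows. It assumes a C11 atomic_flag as a
stand-in for the dshash partition lock, and every name in it (ToyFuncEntry,
ToyFuncCounts, toy_func_entry_init, toy_flush_func_counts) is hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a shared per-function stats entry. */
typedef struct ToyFuncEntry
{
    atomic_flag lock;           /* stand-in for the dshash partition lock */
    bool        initialized;    /* false until the first flush ("!found") */
    int64_t     numcalls;
    int64_t     total_time_us;
} ToyFuncEntry;

/* Hypothetical backend-local counters waiting to be flushed. */
typedef struct ToyFuncCounts
{
    int64_t     numcalls;
    int64_t     total_time_us;
} ToyFuncCounts;

void
toy_func_entry_init(ToyFuncEntry *entry)
{
    atomic_flag_clear(&entry->lock);
    entry->initialized = false;
    entry->numcalls = 0;
    entry->total_time_us = 0;
}

/*
 * Flush local counts into the shared entry.  With nowait = true, return
 * false at once if the lock is busy, so the caller can keep the counts
 * pending and retry later.  Return true once the counts are applied.
 */
bool
toy_flush_func_counts(ToyFuncEntry *entry, const ToyFuncCounts *local,
                      bool nowait)
{
    if (nowait)
    {
        if (atomic_flag_test_and_set(&entry->lock))
            return false;       /* lock busy: leave the stats pending */
    }
    else
    {
        while (atomic_flag_test_and_set(&entry->lock))
            ;                   /* spin until the lock is free */
    }

    if (!entry->initialized)
    {
        /* New entry: initialize counters to the values we just got. */
        entry->initialized = true;
        entry->numcalls = local->numcalls;
        entry->total_time_us = local->total_time_us;
    }
    else
    {
        /* Existing entry: add the values. */
        entry->numcalls += local->numcalls;
        entry->total_time_us += local->total_time_us;
    }

    atomic_flag_clear(&entry->lock);
    return true;
}
```

A caller would clear its local counters only when the function returns true,
which is presumably what the "pending" machinery in the patch is for.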

From fdc3f14554cb80aa0cef1b3f75aa8978e1671cde Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 27 Sep 2018 11:15:19 +0900
Subject: [PATCH 7/8] Add conditional lock feature to dshash

dshash currently waits for locks unconditionally. This commit adds new
interfaces for dshash_find and dshash_find_or_insert. The new
interfaces have an extra parameter "nowait" that tells them to return
immediately if the required lock cannot be acquired.
---
 src/backend/lib/dshash.c | 58 ++++++++++++++++++++++++++++++++++++++++++++----
 src/include/lib/dshash.h |  6 ++++-
 2 files changed, 59 insertions(+), 5 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 5b133226ac..7584931515 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -383,6 +383,17 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  */
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
+{
+    return dshash_find_extended(hash_table, key, NULL, exclusive, false);
+}
+
+/*
+ * Like dshash_find, but returns NULL immediately when nowait is true and the
+ * lock was not acquired. *lock_acquired, if not NULL, gets the lock status.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool *lock_acquired, bool exclusive, bool nowait)
 {
     dshash_hash hash;
     size_t        partition;
@@ -394,8 +405,23 @@ dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
     Assert(hash_table->control->magic == DSHASH_MAGIC);
     Assert(!hash_table->find_locked);
 
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partition),
+                                      exclusive ? LW_EXCLUSIVE : LW_SHARED))
+        {
+            if (lock_acquired)
+                *lock_acquired = false;
+            return NULL;
+        }
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition),
+                      exclusive ? LW_EXCLUSIVE : LW_SHARED);
+
+    if (lock_acquired)
+        *lock_acquired = true;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -430,6 +456,22 @@ void *
 dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
+{
+    return dshash_find_or_insert_extended(hash_table, key, found, false);
+}
+
+/*
+ * Like dshash_find_or_insert, but returns NULL if nowait is true and the lock
+ * was not acquired.
+ *
+ * Notes above dshash_find_extended() regarding locking and error handling
+ * equally apply here.
+ */
+void *
+dshash_find_or_insert_extended(dshash_table *hash_table,
+                               const void *key,
+                               bool *found,
+                               bool nowait)
 {
     dshash_hash hash;
     size_t        partition_index;
@@ -444,8 +486,16 @@ dshash_find_or_insert(dshash_table *hash_table,
     Assert(!hash_table->find_locked);
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(
+                PARTITION_LOCK(hash_table, partition_index),
+                LW_EXCLUSIVE))
+            return NULL;
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
+                      LW_EXCLUSIVE);
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 8598d9ed84..b207585eeb 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -89,8 +89,12 @@ extern void dshash_destroy(dshash_table *hash_table);
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
             const void *key, bool exclusive);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+            bool *lock_acquired, bool exclusive, bool nowait);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
-                      const void *key, bool *found);
+            const void *key, bool *found);
+extern void *dshash_find_or_insert_extended(dshash_table *hash_table,
+            const void *key, bool *found, bool nowait);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.16.3
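
Editorial illustration (not part of the patch series): with "nowait", a NULL
result from the extended find is ambiguous. Either the key is absent or the
lock was busy, and the lock_acquired output tells the two apart. The
standalone sketch below mimics that contract on a toy table. It assumes a
single C11 atomic_flag in place of the LWLock partition locks and a linear
scan in place of hashing. All names (ToyTable, toy_init, toy_find_extended,
toy_release) are hypothetical. As with dshash_find, the lock stays held when
an entry is returned and is dropped on a miss.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NSLOTS 4

/* Hypothetical toy table guarded by one lock. */
typedef struct ToyTable
{
    atomic_flag lock;
    int         nused;
    int         keys[TOY_NSLOTS];
    int         values[TOY_NSLOTS];
} ToyTable;

void
toy_init(ToyTable *t)
{
    atomic_flag_clear(&t->lock);
    t->nused = 0;
}

/*
 * Look up key.  On a hit, return a pointer to the value with the lock still
 * held; the caller must call toy_release().  On a miss, release the lock
 * and return NULL.  If nowait is true and the lock is busy, return NULL
 * without waiting.  *lock_acquired (if not NULL) reports which NULL it was.
 */
int *
toy_find_extended(ToyTable *t, int key, bool *lock_acquired, bool nowait)
{
    if (nowait)
    {
        if (atomic_flag_test_and_set(&t->lock))
        {
            if (lock_acquired)
                *lock_acquired = false;
            return NULL;
        }
    }
    else
    {
        while (atomic_flag_test_and_set(&t->lock))
            ;                   /* spin until the lock is free */
    }

    if (lock_acquired)
        *lock_acquired = true;

    for (int i = 0; i < t->nused; i++)
    {
        if (t->keys[i] == key)
            return &t->values[i];   /* lock remains held for the caller */
    }

    atomic_flag_clear(&t->lock);    /* not found: drop the lock */
    return NULL;
}

void
toy_release(ToyTable *t)
{
    atomic_flag_clear(&t->lock);
}
```

A caller that gets NULL with *lock_acquired == false can postpone the work
and retry, while NULL with *lock_acquired == true means the key really is
absent.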

From d926ffaa496198d5b6aec91163f3b5e38bd1dec2 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 4 Jul 2018 11:46:43 +0900
Subject: [PATCH 6/8] Remove pg_stat_tmp exclusion from pg_rewind

The directory "pg_stat_tmp" no longer exists so remove it from the
exclusion list.
---
 src/bin/pg_rewind/filemap.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 4ad7b2f207..075697be44 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -43,13 +43,6 @@ static bool check_file_excluded(const char *path, bool is_source);
  */
 static const char *excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    "pg_stat_tmp",                /* defined as PG_STAT_TMP_DIR */
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another master. See backup.sgml
-- 
2.16.3

From c002149ee9eb8ef352930bedd132d70d8f83e898 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 4 Jul 2018 10:59:17 +0900
Subject: [PATCH 5/8] Let pg_stat_statements not to use PG_STAT_TMP_DIR.

This patchset removes the definition of PG_STAT_TMP_DIR because pgstat.c
no longer uses the directory and there is no other suitable module to take
it over. As a tentative solution, this patch moves the query text file into
the permanent stats directory. pg_basebackup and pg_rewind are aware of
that directory. They currently omit the text file, but will copy it after
this change.
---
 contrib/pg_stat_statements/pg_stat_statements.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 33f9a79f54..ec2fa9881c 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -86,14 +86,11 @@ PG_MODULE_MAGIC;
 #define PGSS_DUMP_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pg_stat_statements.stat"
 
 /*
- * Location of external query text file.  We don't keep it in the core
- * system's stats_temp_directory.  The core system can safely use that GUC
- * setting, because the statistics collector temp file paths are set only once
- * as part of changing the GUC, but pg_stat_statements has no way of avoiding
- * race conditions.  Besides, we only expect modest, infrequent I/O for query
- * strings, so placing the file on a faster filesystem is not compelling.
+ * Location of external query text file. We only expect modest, infrequent I/O
+ * for query strings, so placing the file on a faster filesystem is not
+ * compelling.
  */
-#define PGSS_TEXT_FILE    PG_STAT_TMP_DIR "/pgss_query_texts.stat"
+#define PGSS_TEXT_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pgss_query_texts.stat"
 
 /* Magic number identifying the stats file format */
 static const uint32 PGSS_FILE_HEADER = 0x20171004;
-- 
2.16.3

From f8d185198145e9319cbbefd9355aa07d32606ba4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 4 Jul 2018 11:44:31 +0900
Subject: [PATCH 4/8] Documentation update

Remove all descriptions of the pg_stat_tmp directory from the documentation.
---
 doc/src/sgml/backup.sgml        |  4 +---
 doc/src/sgml/config.sgml        | 19 -------------------
 doc/src/sgml/func.sgml          |  3 +--
 doc/src/sgml/monitoring.sgml    |  7 +------
 doc/src/sgml/protocol.sgml      |  2 +-
 doc/src/sgml/ref/pg_rewind.sgml |  3 +--
 doc/src/sgml/storage.sgml       |  6 ------
 7 files changed, 5 insertions(+), 39 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 3fa5efdd78..31e94c1fe9 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1116,11 +1116,9 @@ SELECT pg_stop_backup();
    <para>
     The contents of the directories <filename>pg_dynshmem/</filename>,
     <filename>pg_notify/</filename>, <filename>pg_serial/</filename>,
-    <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
+    <filename>pg_snapshots/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f11b8f724c..7a2cf74e6c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6114,25 +6114,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 9a7f683658..b6ad7bbed5 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -15953,8 +15953,7 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
  PG_VERSION      | 15
  pg_wal          | 16
  pg_hba.conf     | 17
- pg_stat_tmp     | 18
- pg_subtrans     | 19
+ pg_subtrans     | 18
 (19 rows)
 </programlisting>
   </para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 0484cfa77a..bd50efcec8 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -197,12 +197,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
+   <productname>PostgreSQL</productname> processes through shared memory.
    When the server shuts down cleanly, a permanent copy of the statistics
    data is stored in the <filename>pg_stat</filename> subdirectory, so that
    statistics can be retained across server restarts.  When recovery is
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index f0b2145208..11f263f378 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2612,7 +2612,7 @@ The commands accepted in replication mode are:
         <para>
          <filename>pg_dynshmem</filename>, <filename>pg_notify</filename>,
          <filename>pg_replslot</filename>, <filename>pg_serial</filename>,
-         <filename>pg_snapshots</filename>, <filename>pg_stat_tmp</filename>, and
+         <filename>pg_snapshots</filename>, and
          <filename>pg_subtrans</filename> are copied as empty directories (even if
          they are symbolic links).
         </para>
diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index e2662bbf81..bf9c5dd580 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -270,8 +270,7 @@ PostgreSQL documentation
       (everything except the relation files). Similarly to base backups,
       the contents of the directories <filename>pg_dynshmem/</filename>,
       <filename>pg_notify/</filename>, <filename>pg_replslot/</filename>,
-      <filename>pg_serial/</filename>, <filename>pg_snapshots/</filename>,
-      <filename>pg_stat_tmp/</filename>, and
+      <filename>pg_serial/</filename>, <filename>pg_snapshots/</filename>, and
       <filename>pg_subtrans/</filename> are omitted from the data copied
       from the source cluster. Any file or directory beginning with
       <filename>pgsql_tmp</filename> is omitted, as well as are
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 8ef2ac8010..5ee7493970 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -120,12 +120,6 @@ Item
   subsystem</entry>
 </row>
 
-<row>
- <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
-</row>
-
 <row>
  <entry><filename>pg_subtrans</filename></entry>
  <entry>Subdirectory containing subtransaction status data</entry>
-- 
2.16.3

From 66841965d11bff65b029c7f1fa8b5ecf0cd47b2f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 17:05:46 +0900
Subject: [PATCH 3/8] dshash-based stats collector

The stats collector no longer uses files to distribute stats numbers. They
are now stored in a dynamic shared hash. Stats entries are cached one by
one to give a consistent snapshot during a transaction. On the other hand,
vacuum no longer takes a complete cache of the stats.

This patch removes PG_STAT_TMP_DIR and the GUC stats_temp_directory.  That
affects pg_basebackup and pg_stat_statements, but this patch fixes only
pg_basebackup. The fix for pg_stat_statements is done in another patch.
---
 src/backend/postmaster/autovacuum.c           |   59 +-
 src/backend/postmaster/pgstat.c               | 1566 ++++++++++++-------------
 src/backend/replication/basebackup.c          |   36 -
 src/backend/storage/ipc/ipci.c                |    2 +
 src/backend/storage/lmgr/lwlock.c             |    3 +
 src/backend/storage/lmgr/lwlocknames.txt      |    1 +
 src/backend/utils/misc/guc.c                  |   41 -
 src/backend/utils/misc/postgresql.conf.sample |    1 -
 src/bin/initdb/initdb.c                       |    1 -
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |    2 +-
 src/include/pgstat.h                          |   50 +-
 src/include/storage/lwlock.h                  |    3 +
 12 files changed, 836 insertions(+), 929 deletions(-)

diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 978089575b..65956c0c35 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -977,7 +977,7 @@ rebuild_database_list(Oid newdb)
         PgStat_StatDBEntry *entry;
 
         /* only consider this database if it has a pgstat entry */
-        entry = pgstat_fetch_stat_dbentry(newdb);
+        entry = backend_get_db_entry(newdb, true);
         if (entry != NULL)
         {
             /* we assume it isn't found because the hash was just created */
@@ -986,6 +986,7 @@ rebuild_database_list(Oid newdb)
             /* hash_search already filled in the key */
             db->adl_score = score++;
             /* next_worker is filled in later */
+            pfree(entry);
         }
     }
 
@@ -1001,7 +1002,7 @@ rebuild_database_list(Oid newdb)
          * skip databases with no stat entries -- in particular, this gets rid
          * of dropped databases
          */
-        entry = pgstat_fetch_stat_dbentry(avdb->adl_datid);
+        entry = backend_get_db_entry(avdb->adl_datid, true);
         if (entry == NULL)
             continue;
 
@@ -1013,6 +1014,7 @@ rebuild_database_list(Oid newdb)
             db->adl_score = score++;
             /* next_worker is filled in later */
         }
+        pfree(entry);
     }
 
     /* finally, insert all qualifying databases not previously inserted */
@@ -1025,7 +1027,7 @@ rebuild_database_list(Oid newdb)
         PgStat_StatDBEntry *entry;
 
         /* only consider databases with a pgstat entry */
-        entry = pgstat_fetch_stat_dbentry(avdb->adw_datid);
+        entry = backend_get_db_entry(avdb->adw_datid, true);
         if (entry == NULL)
             continue;
 
@@ -1037,6 +1039,7 @@ rebuild_database_list(Oid newdb)
             db->adl_score = score++;
             /* next_worker is filled in later */
         }
+        pfree(entry);
     }
     nelems = score;
 
@@ -1235,7 +1238,7 @@ do_start_worker(void)
             continue;            /* ignore not-at-risk DBs */
 
         /* Find pgstat entry if any */
-        tmp->adw_entry = pgstat_fetch_stat_dbentry(tmp->adw_datid);
+        tmp->adw_entry = backend_get_db_entry(tmp->adw_datid, true);
 
         /*
          * Skip a database with no pgstat entry; it means it hasn't seen any
@@ -1273,16 +1276,22 @@ do_start_worker(void)
                 break;
             }
         }
-        if (skipit)
-            continue;
+        if (!skipit)
+        {
+            /* Remember the db with oldest autovac time. */
+            if (avdb == NULL ||
+                tmp->adw_entry->last_autovac_time <
+                avdb->adw_entry->last_autovac_time)
+            {
+                if (avdb)
+                    pfree(avdb->adw_entry);
+                avdb = tmp;
+            }
+        }
 
-        /*
-         * Remember the db with oldest autovac time.  (If we are here, both
-         * tmp->entry and db->entry must be non-null.)
-         */
-        if (avdb == NULL ||
-            tmp->adw_entry->last_autovac_time < avdb->adw_entry->last_autovac_time)
-            avdb = tmp;
+        /* Immediately free it if not used */
+        if (avdb != tmp)
+            pfree(tmp->adw_entry);
     }
 
     /* Found a database -- process it */
@@ -1971,7 +1980,7 @@ do_autovacuum(void)
      * may be NULL if we couldn't find an entry (only happens if we are
      * forcing a vacuum for anti-wrap purposes).
      */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    dbentry = backend_get_db_entry(MyDatabaseId, true);
 
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
@@ -2021,7 +2030,7 @@ do_autovacuum(void)
     MemoryContextSwitchTo(AutovacMemCxt);
 
     /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
+    shared = backend_get_db_entry(InvalidOid, true);
 
     classRel = heap_open(RelationRelationId, AccessShareLock);
 
@@ -2107,6 +2116,8 @@ do_autovacuum(void)
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
                                   &dovacuum, &doanalyze, &wraparound);
+        if (tabentry)
+            pfree(tabentry);
 
         /* Relations that need work are added to table_oids */
         if (dovacuum || doanalyze)
@@ -2186,10 +2197,11 @@ do_autovacuum(void)
         /* Fetch the pgstat entry for this table */
         tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
                                              shared, dbentry);
-
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
                                   &dovacuum, &doanalyze, &wraparound);
+        if (tabentry)
+            pfree(tabentry);
 
         /* ignore analyze for toast tables */
         if (dovacuum)
@@ -2758,12 +2770,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
     if (isshared)
     {
         if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
+            tabentry = backend_get_tab_entry(shared, relid, true);
     }
     else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
+        tabentry = backend_get_tab_entry(dbentry, relid, true);
 
     return tabentry;
 }
@@ -2795,8 +2805,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     /* use fresh stats */
     autovac_refresh_stats();
 
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    shared = backend_get_db_entry(InvalidOid, true);
+    dbentry = backend_get_db_entry(MyDatabaseId, true);
 
     /* fetch the relation's relcache entry */
     classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
@@ -2827,6 +2837,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
                               &dovacuum, &doanalyze, &wraparound);
+    if (tabentry)
+        pfree(tabentry);
 
     /* ignore ANALYZE for toast tables */
     if (classForm->relkind == RELKIND_TOASTVALUE)
@@ -2917,7 +2929,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     }
 
     heap_freetuple(classTup);
-
+    pfree(shared);
+    pfree(dbentry);
     return tab;
 }
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 999325ae53..a3d5f4856f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -77,22 +77,10 @@
 #define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
                                      * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
 #define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
                                      * failed statistics collector; in
                                      * seconds. */
 
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
 /* Minimum receive buffer size for the collector's socket. */
 #define PGSTAT_MIN_RCVBUF        (100 * 1024)
 
@@ -101,7 +89,6 @@
  * The initial size hints for the hash tables used in the collector.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
 #define PGSTAT_TAB_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
@@ -127,14 +114,6 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
- */
-char       *pgstat_stat_directory = NULL;
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
-
 /*
  * BgWriter global statistics counters (unused in other processes).
  * Stored directly in a stats message structure so it can be sent
@@ -154,6 +133,43 @@ static time_t last_pgstat_start_time;
 
 static bool pgStatRunningInCollector = false;
 
+/* Shared stats bootstrap information */
+typedef struct StatsShmemStruct {
+    dsa_handle stats_dsa_handle;
+    dshash_table_handle db_stats_handle;
+    dsa_pointer    global_stats;
+    dsa_pointer    archiver_stats;
+} StatsShmemStruct;
+
+static StatsShmemStruct *StatsShmem = NULL;
+static dsa_area *area = NULL;
+static dshash_table *db_stats;
+static HTAB *snapshot_db_stats;
+static MemoryContext stats_cxt;
+
+/* dshash parameter for each type of table */
+static const dshash_parameters dsh_dbparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatDBEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_DB
+};
+static const dshash_parameters dsh_tblparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatTabEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_FUNC_TABLE
+};
+static const dshash_parameters dsh_funcparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatFuncEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_FUNC_TABLE
+};
+
 /*
  * Structures in which backends store per-table info that's waiting to be
  * sent to the collector.
@@ -250,12 +266,16 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Cluster wide statistics.
+ * Contains statistics that are not collected per database or per table.
+ * shared_* point to the statistics maintained by the stats collector in
+ * shared memory; snapshot_* point to the snapshots taken by backends.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats *snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats *snapshot_globalStats;
+
 
 /*
  * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
@@ -285,23 +305,23 @@ static instr_time total_func_time;
 static pid_t pgstat_forkexec(void);
 #endif
 
+/* functions used in stats collector */
 static void pgstat_shutdown_handler(SIGNAL_ARGS);
 static void pgstat_quickdie_handler(SIGNAL_ARGS);
 static void pgstat_beshutdown_hook(int code, Datum arg);
 static void pgstat_sighup_handler(SIGNAL_ARGS);
 
 static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                     Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
-static void pgstat_read_current_status(void);
+static PgStat_StatTabEntry *pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create);
+static void pgstat_write_statsfiles(void);
+static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_statsfiles(void);
+static void pgstat_read_db_statsfile(Oid databaseid, dshash_table *tabhash, dshash_table *funchash);
 
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
+/* functions used in backends */
+static bool backend_snapshot_global_stats(void);
+static PgStat_StatFuncEntry *backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot);
+static void pgstat_read_current_status(void);
 
 static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
 static void pgstat_send_funcstats(void);
@@ -320,7 +340,6 @@ static const char *pgstat_get_wait_io(WaitEventIO w);
 static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
 static void pgstat_send(void *msg, int len);
 
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
 static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
 static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
 static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
@@ -685,7 +704,6 @@ pgstat_reset_remove_files(const char *directory)
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
     pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
 }
 
@@ -915,6 +933,81 @@ pgstat_send_funcstats(void)
 }
 
 
+/* ----------
+ * pgstat_attach_shared_stats() -
+ *
+ *    attach existing shared stats memory
+ * ----------
+ */
+static bool
+pgstat_attach_shared_stats(void)
+{
+    MemoryContext oldcontext;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID || area != NULL)
+    {
+        LWLockRelease(StatsLock);
+        return area != NULL;
+    }
+
+    /* top-level variables; they live for the lifetime of the process */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    area = dsa_attach(StatsShmem->stats_dsa_handle);
+    dsa_pin_mapping(area);
+    db_stats = dshash_attach(area, &dsh_dbparams,
+                             StatsShmem->db_stats_handle, 0);
+    snapshot_db_stats = NULL;
+    shared_globalStats = (PgStat_GlobalStats *)
+        dsa_get_address(area, StatsShmem->global_stats);
+    shared_archiverStats = (PgStat_ArchiverStats *)
+        dsa_get_address(area, StatsShmem->archiver_stats);
+    MemoryContextSwitchTo(oldcontext);
+    LWLockRelease(StatsLock);
+
+    return true;
+}
+
+/* ----------
+ * pgstat_create_shared_stats() -
+ *
+ *    create shared stats memory
+ * ----------
+ */
+static void
+pgstat_create_shared_stats(void)
+{
+    MemoryContext oldcontext;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
+
+    /* lives for the lifetime of the process */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    area = dsa_create(LWTRANCHE_STATS_DSA);
+    dsa_pin_mapping(area);
+
+    db_stats = dshash_create(area, &dsh_dbparams, 0);
+
+    /* create shared area and write bootstrap information */
+    StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+    StatsShmem->global_stats =
+        dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+    StatsShmem->archiver_stats =
+        dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+    StatsShmem->db_stats_handle =
+        dshash_get_hash_table_handle(db_stats);
+
+    /* connect to the memory */
+    snapshot_db_stats = NULL;
+    shared_globalStats = (PgStat_GlobalStats *)
+        dsa_get_address(area, StatsShmem->global_stats);
+    shared_archiverStats = (PgStat_ArchiverStats *)
+        dsa_get_address(area, StatsShmem->archiver_stats);
+    MemoryContextSwitchTo(oldcontext);
+    LWLockRelease(StatsLock);
+}
+
 /* ----------
  * pgstat_vacuum_stat() -
  *
@@ -924,10 +1017,11 @@ pgstat_send_funcstats(void)
 void
 pgstat_vacuum_stat(void)
 {
-    HTAB       *htab;
+    HTAB       *oidtab;
     PgStat_MsgTabpurge msg;
     PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
+    dshash_table *dshtable;
+    dshash_seq_status dshstat;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatFuncEntry *funcentry;
@@ -936,23 +1030,22 @@ pgstat_vacuum_stat(void)
     if (pgStatSock == PGINVALID_SOCKET)
         return;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* If not done for this transaction, take a snapshot of stats */
+    if (!backend_snapshot_global_stats())
+        return;
 
     /*
      * Read pg_database and make a list of OIDs of all existing databases
      */
-    htab = pgstat_collect_oids(DatabaseRelationId);
+    oidtab = pgstat_collect_oids(DatabaseRelationId);
 
     /*
      * Search the database hash table for dead databases and tell the
      * collector to drop them.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+
+    dshash_seq_init(&dshstat, db_stats, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            dbid = dbentry->databaseid;
 
@@ -960,26 +1053,24 @@ pgstat_vacuum_stat(void)
 
         /* the DB entry for shared tables (with InvalidOid) is never dropped */
         if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
+            hash_search(oidtab, (void *) &dbid, HASH_FIND, NULL) == NULL)
             pgstat_drop_database(dbid);
     }
 
     /* Clean up */
-    hash_destroy(htab);
+    hash_destroy(oidtab);
 
     /*
      * Lookup our own database entry; if not found, nothing more to do.
      */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
+    dbentry = backend_get_db_entry(MyDatabaseId, true);
+    if (dbentry == NULL)
         return;
-
+
     /*
      * Similarly to above, make a list of all known relations in this DB.
      */
-    htab = pgstat_collect_oids(RelationRelationId);
+    oidtab = pgstat_collect_oids(RelationRelationId);
 
     /*
      * Initialize our messages table counter to zero
@@ -988,15 +1079,17 @@ pgstat_vacuum_stat(void)
 
     /*
      * Check for all tables listed in stats hashtable if they still exist.
+     * Stats cache is useless here so directly search the shared hash.
      */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+    dshtable = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&dshstat, dshtable, false);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            tabid = tabentry->tableid;
 
         CHECK_FOR_INTERRUPTS();
 
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+        if (hash_search(oidtab, (void *) &tabid, HASH_FIND, NULL) != NULL)
             continue;
 
         /*
@@ -1019,6 +1112,7 @@ pgstat_vacuum_stat(void)
             msg.m_nentries = 0;
         }
     }
+    dshash_detach(dshtable);
 
     /*
      * Send the rest
@@ -1034,29 +1128,29 @@ pgstat_vacuum_stat(void)
     }
 
     /* Clean up */
-    hash_destroy(htab);
+    hash_destroy(oidtab);
 
     /*
      * Now repeat the above steps for functions.  However, we needn't bother
      * in the common case where no function stats are being collected.
      */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
+    dshtable = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+    if (dshash_get_num_entries(dshtable) > 0)
     {
-        htab = pgstat_collect_oids(ProcedureRelationId);
+        oidtab = pgstat_collect_oids(ProcedureRelationId);
 
         pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
         f_msg.m_databaseid = MyDatabaseId;
         f_msg.m_nentries = 0;
 
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
+        dshash_seq_init(&dshstat, dshtable, false);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&dshstat)) != NULL)
         {
             Oid            funcid = funcentry->functionid;
 
             CHECK_FOR_INTERRUPTS();
 
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
+            if (hash_search(oidtab, (void *) &funcid, HASH_FIND, NULL) != NULL)
                 continue;
 
             /*
@@ -1089,8 +1183,9 @@ pgstat_vacuum_stat(void)
             pgstat_send(&f_msg, len);
         }
 
-        hash_destroy(htab);
+        hash_destroy(oidtab);
     }
+    dshash_detach(dshtable);
 }
 
 
@@ -1457,24 +1552,6 @@ pgstat_ping(void)
     pgstat_send(&msg, sizeof(msg));
 }
 
-/* ----------
- * pgstat_send_inquiry() -
- *
- *    Notify collector that we need fresh data.
- * ----------
- */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
 
 /*
  * Initialize function call usage data.
@@ -2289,18 +2366,10 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    PgStat_StatDBEntry *dbentry;
 
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    dbentry = backend_get_db_entry(dbid, false);
+    return dbentry;
 }
 
 
@@ -2316,47 +2385,28 @@ pgstat_fetch_stat_dbentry(Oid dbid)
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* Lookup our database, then look in its table hash table. */
+    dbentry = backend_get_db_entry(MyDatabaseId, false);
+    if (dbentry == NULL)
+        return NULL;
 
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = backend_get_tab_entry(dbentry, relid, false);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    dbentry = backend_get_db_entry(InvalidOid, false);
+    if (dbentry == NULL)
+        return NULL;
+
+    tabentry = backend_get_tab_entry(dbentry, relid, false);
+    if (tabentry != NULL)
+        return tabentry;
 
     return NULL;
 }
@@ -2375,17 +2425,14 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     PgStat_StatDBEntry *dbentry;
     PgStat_StatFuncEntry *funcentry = NULL;
 
-    /* load the stats file if needed */
-    backend_read_statsfile();
+    /* Lookup our database, then find the requested function */
+    dbentry = backend_get_db_entry(MyDatabaseId, false);
+    if (dbentry == NULL)
+        return NULL;
 
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
+    funcentry = backend_get_func_etnry(dbentry, func_id, false);
+    if (funcentry == NULL)
+        return NULL;
 
     return funcentry;
 }
@@ -2461,9 +2508,11 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    if (!backend_snapshot_global_stats())
+        return NULL;
 
-    return &archiverStats;
+    return snapshot_archiverStats;
 }
 
 
@@ -2478,9 +2527,11 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    if (!backend_snapshot_global_stats())
+        return NULL;
 
-    return &globalStats;
+    return snapshot_globalStats;
 }
 
 
@@ -4186,18 +4237,14 @@ PgstatCollectorMain(void)
     pqsignal(SIGTTOU, SIG_DFL);
     pqsignal(SIGCONT, SIG_DFL);
     pqsignal(SIGWINCH, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
 
-    /*
-     * Identify myself via ps
-     */
-    init_ps_display("stats collector", "", "", "");
+    PG_SETMASK(&UnBlockSig);
 
     /*
      * Read in existing stats files or initialize the stats to zero.
      */
     pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
+    pgstat_read_statsfiles();
 
     /*
      * Loop to process messages until we get SIGQUIT or detect ungraceful
@@ -4239,13 +4286,6 @@ PgstatCollectorMain(void)
                 ProcessConfigFile(PGC_SIGHUP);
             }
 
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
             /*
              * Try to receive and process a message.  This will not block,
              * since the socket is set to non-blocking mode.
@@ -4294,10 +4334,6 @@ PgstatCollectorMain(void)
                 case PGSTAT_MTYPE_DUMMY:
                     break;
 
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry((PgStat_MsgInquiry *) &msg, len);
-                    break;
-
                 case PGSTAT_MTYPE_TABSTAT:
                     pgstat_recv_tabstat((PgStat_MsgTabstat *) &msg, len);
                     break;
@@ -4386,9 +4422,7 @@ PgstatCollectorMain(void)
          * fixes that, so don't sleep indefinitely.  This is a crock of the
          * first water, but until somebody wants to debug exactly what's
          * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
+         * timeout matches our pre-9.2 behavior.
          */
         wr = WaitLatchOrSocket(MyLatch,
                                WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
@@ -4408,7 +4442,7 @@ PgstatCollectorMain(void)
     /*
      * Save the final stats to reuse at next startup.
      */
-    pgstat_write_statsfiles(true, true);
+    pgstat_write_statsfiles();
 
     exit(0);
 }
@@ -4466,14 +4500,14 @@ pgstat_shutdown_handler(SIGNAL_ARGS)
 }
 
 /*
- * Subroutine to clear stats in a database entry
+ * Subroutine to reset stats in a shared database entry
  *
  * Tables and functions hashes are initialized to empty.
  */
 static void
 reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
 {
-    HASHCTL        hash_ctl;
+    dshash_table *tbl;
 
     dbentry->n_xact_commit = 0;
     dbentry->n_xact_rollback = 0;
@@ -4499,20 +4533,17 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
     dbentry->stat_reset_timestamp = GetCurrentTimestamp();
     dbentry->stats_timestamp = 0;
 
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
 
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
+    tbl = dshash_create(area, &dsh_tblparams, 0);
+    dbentry->tables = dshash_get_hash_table_handle(tbl);
+    dshash_detach(tbl);
+
+    tbl = dshash_create(area, &dsh_funcparams, 0);
+    dbentry->functions = dshash_get_hash_table_handle(tbl);
+    dshash_detach(tbl);
+
+    dbentry->snapshot_tables = NULL;
+    dbentry->snapshot_functions = NULL;
 }
 
 /*
@@ -4525,15 +4556,18 @@ pgstat_get_db_entry(Oid databaseid, bool create)
 {
     PgStat_StatDBEntry *result;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
+
+    Assert(pgStatRunningInCollector);
 
     /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
+    if (create)
+        result = (PgStat_StatDBEntry *)
+            dshash_find_or_insert(db_stats, &databaseid, &found);
+    else
+        result = (PgStat_StatDBEntry *) dshash_find(db_stats, &databaseid, true);
 
-    if (!create && !found)
-        return NULL;
+    if (!create)
+        return result;
 
     /*
      * If not found, initialize the new one.  This creates empty hash tables
@@ -4545,23 +4579,23 @@ pgstat_get_db_entry(Oid databaseid, bool create)
     return result;
 }
 
-
 /*
  * Lookup the hash table entry for the specified table. If no hash
  * table entry exists, initialize it, if the create parameter is true.
  * Else, return NULL.
  */
 static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create)
 {
     PgStat_StatTabEntry *result;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
 
     /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
+    if (create)
+        result = (PgStat_StatTabEntry *)
+            dshash_find_or_insert(table, &tableoid, &found);
+    else
+        result = (PgStat_StatTabEntry *) dshash_find(table, &tableoid, false);
 
     if (!create && !found)
         return NULL;
@@ -4599,25 +4633,20 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
  * pgstat_write_statsfiles() -
  *        Write the global statistics file, as well as requested DB files.
  *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
  *    When 'allDbs' is false, only the requested databases (listed in
  *    pending_write_requests) will be written; otherwise, all databases
  *    will be written.
  * ----------
  */
 static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+pgstat_write_statsfiles(void)
 {
-    HASH_SEQ_STATUS hstat;
+    dshash_seq_status hstat;
     PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
@@ -4638,7 +4667,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
 
     /*
      * Write the file header --- currently just a format ID.
@@ -4650,32 +4679,29 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Write global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write archiver stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Walk through the database table.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_init(&hstat, db_stats, false);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&hstat)) != NULL)
     {
         /*
          * Write out the table and function stats for this DB into the
          * appropriate per-DB stat file, if required.
          */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
+        pgstat_write_db_statsfile(dbentry);
 
         /*
          * Write out the DB entry. We don't write the tables or functions
@@ -4719,9 +4745,6 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
         unlink(tmpfile);
     }
 
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
     /*
      * Now throw away the list of requests.  Note that requests sent after we
      * started the write are still waiting on the network socket.
@@ -4735,15 +4758,14 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
  * of length len.
  */
 static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
+get_dbstat_filename(bool tempname, Oid databaseid,
                     char *filename, int len)
 {
     int            printed;
 
     /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
     printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
                        databaseid,
                        tempname ? "tmp" : "stat");
     if (printed >= len)
@@ -4761,10 +4783,10 @@ get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
  * ----------
  */
 static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
+pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry)
 {
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
+    dshash_seq_status tstat;
+    dshash_seq_status fstat;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatFuncEntry *funcentry;
     FILE       *fpout;
@@ -4773,9 +4795,10 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     int            rc;
     char        tmpfile[MAXPGPATH];
     char        statfile[MAXPGPATH];
+    dshash_table *tbl;
 
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -4802,24 +4825,28 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     /*
      * Walk through the database's access stats per table.
      */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&tstat, tbl, false);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&tstat)) != NULL)
     {
         fputc('T', fpout);
         rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_detach(tbl);
 
     /*
      * Walk through the database's function stats table.
      */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+    tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+    dshash_seq_init(&fstat, tbl, false);
+    while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&fstat)) != NULL)
     {
         fputc('F', fpout);
         rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_detach(tbl);
 
     /*
      * No more output to be done. Close the temp file and replace the old
@@ -4853,76 +4880,45 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
 }
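
Reviewer note: the writers and readers in this file hand `fwrite`/`fread` a buffer plus a byte count. When the buffer is reached through a pointer, as with the shared stats structs, the count has to come from the pointed-to type, because `sizeof` applied to the pointer itself yields only the pointer's size. The sketch below is a minimal standalone illustration; `DemoStats` and the helper names are hypothetical and are not part of the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stats struct, deliberately larger than a pointer. */
typedef struct DemoStats
{
    long        a;
    long        b;
    long        c;
    long        d;
} DemoStats;

/* Write the whole struct; sizeof(*stats) is the struct size. */
static int
demo_write_stats(FILE *fp, const DemoStats *stats)
{
    return fwrite(stats, sizeof(*stats), 1, fp) == 1;
}

/* Read the whole struct back; a short read reports failure. */
static int
demo_read_stats(FILE *fp, DemoStats *stats)
{
    return fread(stats, 1, sizeof(*stats), fp) == sizeof(*stats);
}

/* Round-trip a struct through a temporary file and compare. */
static int
demo_roundtrip_ok(void)
{
    DemoStats   in = {1, 2, 3, 4};
    DemoStats   out;
    FILE       *fp = tmpfile();
    int         ok;

    if (fp == NULL)
        return 0;
    memset(&out, 0, sizeof(out));
    ok = demo_write_stats(fp, &in);
    rewind(fp);
    ok = ok && demo_read_stats(fp, &out);
    fclose(fp);
    return ok && memcmp(&in, &out, sizeof(in)) == 0;
}
```

Had the helpers used `sizeof(stats)` instead, only the first pointer-sized bytes would be written and the round trip would lose the remaining fields.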
 
 /* ----------
  * pgstat_read_statsfiles() -
  *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ *    Reads in some existing statistics collector files into the shared stats
+ *    hash.
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+static void
+pgstat_read_statsfiles(void)
 {
     PgStat_StatDBEntry *dbentry;
     PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    dshash_table *tblstats = NULL;
+    dshash_table *funcstats = NULL;
 
+    Assert(pgStatRunningInCollector);
     /*
      * The tables will live in pgStatLocalContext.
      */
     pgstat_setup_memcxt();
 
     /*
-     * Create the DB hashtable
+     * Create the DB hashtable and global stats area
      */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
+    pgstat_create_shared_stats();
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp = shared_globalStats->stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
@@ -4940,7 +4936,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -4957,11 +4953,11 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     /*
      * Read global stats struct
      */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) != sizeof(*shared_globalStats))
     {
         ereport(pgStatRunningInCollector ? LOG : WARNING,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
+        memset(shared_globalStats, 0, sizeof(*shared_globalStats));
         goto done;
     }
 
@@ -4972,17 +4968,16 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
      * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
      * an unusual scenario.
      */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
+    shared_globalStats->stats_timestamp = 0;
 
     /*
      * Read archiver stats struct
      */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) != sizeof(*shared_archiverStats))
     {
         ereport(pgStatRunningInCollector ? LOG : WARNING,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
+        memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
         goto done;
     }
 
@@ -5011,12 +5006,12 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 /*
                  * Add to the DB hash
                  */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
+                dbentry = (PgStat_StatDBEntry *)
+                    dshash_find_or_insert(db_stats, (void *) &dbbuf.databaseid,
+                                          &found);
                 if (found)
                 {
+                    dshash_release_lock(db_stats, dbentry);
                     ereport(pgStatRunningInCollector ? LOG : WARNING,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
@@ -5024,8 +5019,8 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 }
 
                 memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
+                dbentry->tables = DSM_HANDLE_INVALID;
+                dbentry->functions = DSM_HANDLE_INVALID;
 
                 /*
                  * In the collector, disregard the timestamp we read from the
@@ -5033,47 +5028,23 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                  * stats file immediately upon the first request from any
                  * backend.
                  */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+                Assert(pgStatRunningInCollector);
+                dbentry->stats_timestamp = 0;
 
                 /*
                  * If requested, read the data from the database-specific
                  * file.  Otherwise we just leave the hashtables empty.
                  */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
+                tblstats = dshash_create(area, &dsh_tblparams, 0);
+                dbentry->tables = dshash_get_hash_table_handle(tblstats);
+                funcstats = dshash_create(area, &dsh_funcparams, 0);
+                dbentry->functions =
+                    dshash_get_hash_table_handle(funcstats);
+                dshash_release_lock(db_stats, dbentry);
+                pgstat_read_db_statsfile(dbentry->databaseid,
+                                         tblstats, funcstats);
+                dshash_detach(tblstats);
+                dshash_detach(funcstats);
                 break;
 
             case 'E':
@@ -5090,34 +5061,47 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 done:
     FreeFile(fpin);
 
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 
-    return dbhash;
+    return;
 }
 
 
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+void
+StatsShmemInit(void)
+{
+    bool    found;
+
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+    }
+    else
+        Assert(found);
+}
+
 /* ----------
  * pgstat_read_db_statsfile() -
  *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
+ *    Reads in the permanent statistics collector file and creates shared
+ *    statistics tables. The file is removed after reading.
  * ----------
  */
 static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
+pgstat_read_db_statsfile(Oid databaseid,
+                         dshash_table *tabhash, dshash_table *funchash)
 {
     PgStat_StatTabEntry *tabentry;
     PgStat_StatTabEntry tabbuf;
@@ -5128,7 +5112,8 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     bool        found;
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
+    Assert(pgStatRunningInCollector);
+    get_dbstat_filename(false, databaseid, statfile, MAXPGPATH);
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
@@ -5187,12 +5172,13 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (tabhash == NULL)
                     break;
 
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
+                tabentry = (PgStat_StatTabEntry *)
+                    dshash_find_or_insert(tabhash,
+                                          (void *) &tabbuf.tableid, &found);
 
                 if (found)
                 {
+                    dshash_release_lock(tabhash, tabentry);
                     ereport(pgStatRunningInCollector ? LOG : WARNING,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
@@ -5200,6 +5186,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 }
 
                 memcpy(tabentry, &tabbuf, sizeof(tabbuf));
+                dshash_release_lock(tabhash, tabentry);
                 break;
 
                 /*
@@ -5221,9 +5208,9 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (funchash == NULL)
                     break;
 
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
+                funcentry = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert(funchash,
+                                          (void *) &funcbuf.functionid, &found);
 
                 if (found)
                 {
@@ -5234,6 +5221,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 }
 
                 memcpy(funcentry, &funcbuf, sizeof(funcbuf));
+                dshash_release_lock(funchash, funcentry);
                 break;
 
                 /*
@@ -5253,276 +5241,355 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
 done:
     FreeFile(fpin);
 
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 }
 
 /* ----------
- * pgstat_read_db_statsfile_timestamp() -
+ * backend_clean_snapshot_callback() -
  *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
+ *    This is usually called with arg = NULL when the memory context where the
+ *    current snapshot was taken is reset or deleted. Don't bother releasing
+ *    memory in that case.
  * ----------
  */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
+static void
+backend_clean_snapshot_callback(void *arg)
 {
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    if (arg != NULL)
     {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
+        /* explicitly called, so explicitly free resources */
+        if (snapshot_globalStats)
+            pfree(snapshot_globalStats);
 
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
+        if (snapshot_archiverStats)
+            pfree(snapshot_archiverStats);
 
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
+        if (snapshot_db_stats)
         {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
+            HASH_SEQ_STATUS seq;
+            PgStat_StatDBEntry *dbent;
 
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
+            hash_seq_init(&seq, snapshot_db_stats);
+            while ((dbent = hash_seq_search(&seq)) != NULL)
+            {
+                if (dbent->snapshot_tables)
+                    hash_destroy(dbent->snapshot_tables);
+                if (dbent->snapshot_functions)
+                    hash_destroy(dbent->snapshot_functions);
+            }
+            hash_destroy(snapshot_db_stats);
         }
     }
 
-done:
-    FreeFile(fpin);
-    return true;
+    /* mark the resources as not allocated */
+    snapshot_globalStats = NULL;
+    snapshot_archiverStats = NULL;
+    snapshot_db_stats = NULL;
 }
 
 /*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
+ * create_local_stats_hash() -
+ *
+ * Creates a dynahash used for table/function stats cache.
  */
-static void
-backend_read_statsfile(void)
+static HTAB *
+create_local_stats_hash(const char *name, size_t keysize, size_t entrysize,
+                        int nentries)
 {
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
+    HTAB *result;
+    HASHCTL ctl;
 
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
+    /* Create the hash in the stats context */
+    ctl.keysize        = keysize;
+    ctl.entrysize    = entrysize;
+    ctl.hcxt        = stats_cxt;
+    result = hash_create(name, nentries, &ctl,
+                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    return result;
 }
 
+/*
+ * snapshot_statentry() - Find an entry in the source dshash.
+ *
+ * Returns the entry for key or NULL if not found. If dest is not NULL, uses
+ * *dest as a local cache, which is created with the same shape as the given
+ * dshash when *dest is NULL. In both cases the result is cached in the hash
+ * and the same entry is returned to subsequent calls for the same key.
+ *
+ * Otherwise the returned entry is a copy that is palloc'ed in the current
+ * memory context. Its content may differ for every request.
+ *
+ * If dshash is NULL, temporarily attaches dsh_handle instead.
+ */
+static void *
+snapshot_statentry(HTAB **dest, const char *hashname,
+                   dshash_table *dshash, dshash_table_handle dsh_handle,
+                   const dshash_parameters *dsh_params, Oid key)
+{
+    void *lentry = NULL;
+    size_t keysize = dsh_params->key_size;
+    size_t entrysize = dsh_params->entry_size;
+
+    if (dest)
+    {
+        /* caches the result entry */
+        bool found;
+
+        /*
+         * Create a new hash with an arbitrary initial size since we don't
+         * know how this hash will grow.
+         */
+        if (!*dest)
+        {
+            Assert(hashname);
+            *dest = create_local_stats_hash(hashname, keysize, entrysize, 32);
+        }
+
+        lentry = hash_search(*dest, &key, HASH_ENTER, &found);
+        if (!found)
+        {
+            dshash_table *t = dshash;
+            void *sentry;
+
+            if (!t)
+                t = dshash_attach(area, dsh_params, dsh_handle, NULL);
+
+            sentry = dshash_find(t, &key, false);
+
+            /*
+             * We expect that the stats for the specified database exist in
+             * most cases.
+             */
+            if (!sentry)
+            {
+                hash_search(*dest, &key, HASH_REMOVE, NULL);
+                if (!dshash)
+                    dshash_detach(t);
+                return NULL;
+            }
+            memcpy(lentry, sentry, entrysize);
+            dshash_release_lock(t, sentry);
+
+            if (!dshash)
+                dshash_detach(t);
+        }
+    }
+    else
+    {
+        /*
+         * The caller doesn't want caching. Just make a copy of the entry and
+         * return it.
+         */
+        dshash_table *t = dshash;
+        void *sentry;
+
+        if (!t)
+            t = dshash_attach(area, dsh_params, dsh_handle, NULL);
+
+        sentry = dshash_find(t, &key, false);
+        if (sentry)
+        {
+            lentry = palloc(entrysize);
+            memcpy(lentry, sentry, entrysize);
+            dshash_release_lock(t, sentry);
+        }
+
+        if (!dshash)
+            dshash_detach(t);
+    }
+
+    return lentry;
+}
+
+/*
+ * snapshot_statentry_all() - Take a snapshot of all shared stats entries
+ *
+ * Returns a local hash that contains all entries in the shared stats.
+ *
+ * The given dshash is used if any. Otherwise temporarily attaches dsh_handle.
+ */
+static HTAB *
+snapshot_statentry_all(const char *hashname,
+                       dshash_table *dshash, dshash_table_handle dsh_handle,
+                       const dshash_parameters *dsh_params)
+{
+    dshash_table *t;
+    dshash_seq_status s;
+    size_t keysize = dsh_params->key_size;
+    size_t entrysize = dsh_params->entry_size;
+    void *ps;
+    int num_entries;
+    HTAB *dest;
+
+    t = dshash ? dshash : dshash_attach(area, dsh_params, dsh_handle, NULL);
+
+    /*
+     * No need to create a new hash if no entry exists. The number can change
+     * after this, but dynahash can store extra entries in that case.
+     */
+    num_entries = dshash_get_num_entries(t);
+    if (num_entries == 0)
+    {
+        if (!dshash)
+            dshash_detach(t);
+        return NULL;
+    }
+
+    Assert(hashname);
+    dest = create_local_stats_hash(hashname,
+                                    keysize, entrysize, num_entries);
+
+    dshash_seq_init(&s, t, true);
+    while ((ps = dshash_seq_next(&s)) != NULL)
+    {
+        bool found;
+        void *pd = hash_search(dest, ps, HASH_ENTER, &found);
+        Assert(!found);
+        memcpy(pd, ps, entrysize);
+        /* dshash_seq_next releases entry lock automatically */
+    }
+
+    if (!dshash)
+        dshash_detach(t);
+
+    return dest;
+}
+
+/*
+ * backend_snapshot_global_stats() -
+ *
+ * Makes a local copy of global stats if not already done.  The copy is kept
+ * until pgstat_clear_snapshot() is called or until the end of the current
+ * memory context (typically TopTransactionContext).  Returns false if the
+ * shared stats have not been created yet.
+ */
+static bool
+backend_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext;
+    MemoryContextCallback *mcxt_cb;
+
+    /* Nothing to do if already done */
+    if (snapshot_globalStats)
+        return true;
+
+    Assert(!pgStatRunningInCollector);
+
+    /* Attached shared memory lives for the process lifetime */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    if (!pgstat_attach_shared_stats())
+    {
+        MemoryContextSwitchTo(oldcontext);
+        return false;
+    }
+    MemoryContextSwitchTo(oldcontext);
+
+    Assert(snapshot_archiverStats == NULL);
+
+    /*
+     * The snapshot lives within the current top transaction if any, or the
+     * current memory context lifetime otherwise.
+     */
+    if (IsTransactionState())
+        MemoryContextSwitchTo(TopTransactionContext);
+
+    /* Remember for stats memory allocation later */
+    stats_cxt = CurrentMemoryContext;
+
+    /* global stats can simply be copied */
+    snapshot_globalStats = palloc(sizeof(PgStat_GlobalStats));
+    memcpy(snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+
+    snapshot_archiverStats = palloc(sizeof(PgStat_ArchiverStats));
+    memcpy(snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+
+    /* set the timestamp of this snapshot */
+    snapshot_globalStats->stats_timestamp = GetCurrentTimestamp();
+
+    /* register callback to clear snapshot */
+    mcxt_cb = (MemoryContextCallback *)palloc(sizeof(MemoryContextCallback));
+    mcxt_cb->func = backend_clean_snapshot_callback;
+    mcxt_cb->arg = NULL;
+    MemoryContextRegisterResetCallback(CurrentMemoryContext, mcxt_cb);
+    MemoryContextSwitchTo(oldcontext);
+
+    return true;
+}
+
+/* ----------
+ * backend_get_db_entry() -
+ *
+ *    Find database stats entry on backends. The returned entries are cached
+ *    until transaction end. If oneshot is true, they are not cached and are
+ *    returned in palloc'ed memory.
+ */
+PgStat_StatDBEntry *
+backend_get_db_entry(Oid dbid, bool oneshot)
+{
+    /* take a local snapshot if we don't have one */
+    char *hashname = "local database stats hash";
+
+    /* If not done for this transaction, take a snapshot of global stats */
+    if (!backend_snapshot_global_stats())
+        return NULL;
+
+    return snapshot_statentry(oneshot ? NULL : &snapshot_db_stats,
+                              hashname, db_stats, 0, &dsh_dbparams,
+                              dbid);
+}
+
+/* ----------
+ * backend_snapshot_all_db_entries() -
+ *
+ *    Take a snapshot of all database stats at once into the returned hash.
+ */
+HTAB *
+backend_snapshot_all_db_entries(void)
+{
+    /* take a local snapshot if we don't have one */
+    char *hashname = "local database stats hash";
+
+    /* If not done for this transaction, take a snapshot of global stats */
+    if (!backend_snapshot_global_stats())
+        return NULL;
+
+    return snapshot_statentry_all(hashname, db_stats, 0, &dsh_dbparams);
+}
+
+/* ----------
+ * backend_get_tab_entry() -
+ *
+ *    Find table stats entry on backends. The returned entries are cached until
+ *    transaction end. If oneshot is true, they are not cached and are returned
+ *    in palloc'ed memory.
+ */
+PgStat_StatTabEntry *
+backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid reloid, bool oneshot)
+{
+    /* take a local snapshot if we don't have one */
+    char *hashname = "local table stats hash";
+    return snapshot_statentry(oneshot ? NULL : &dbent->snapshot_tables,
+                              hashname, NULL, dbent->tables, &dsh_tblparams,
+                              reloid);
+}
+
+/* ----------
+ * backend_get_func_entry() -
+ *
+ *    Find function stats entry on backends. The returned entries are cached
+ *    until transaction end. If oneshot is true, they are not cached and are
+ *    returned in palloc'ed memory.
+ */
+static PgStat_StatFuncEntry *
+backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot)
+{
+    char *hashname = "local function stats hash";
+    return snapshot_statentry(oneshot ? NULL : &dbent->snapshot_tables,
+                              hashname, NULL, dbent->functions, &dsh_funcparams,
+                              funcid);
+}
 
 /* ----------
  * pgstat_setup_memcxt() -
@@ -5553,6 +5620,8 @@ pgstat_setup_memcxt(void)
 void
 pgstat_clear_snapshot(void)
 {
+    int param = 0;    /* only the address is significant */
+
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
         MemoryContextDelete(pgStatLocalContext);
@@ -5562,99 +5631,12 @@ pgstat_clear_snapshot(void)
     pgStatDBHash = NULL;
     localBackendStatusTable = NULL;
     localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
 
     /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
+     * The parameter informs the function that it is not being called from a
+     * MemoryContextCallback.
      */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
+    backend_clean_snapshot_callback(&param);
 }
 
 
@@ -5667,6 +5649,7 @@ pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
 static void
 pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
 {
+    dshash_table *tabhash;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
     int            i;
@@ -5682,6 +5665,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
     dbentry->n_block_read_time += msg->m_block_read_time;
     dbentry->n_block_write_time += msg->m_block_write_time;
 
+    tabhash = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
     /*
      * Process all table entries in the message.
      */
@@ -5689,9 +5673,8 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
     {
         PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
 
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
+        tabentry = (PgStat_StatTabEntry *)
+            dshash_find_or_insert(tabhash, (void *) &(tabmsg->t_id), &found);
 
         if (!found)
         {
@@ -5750,6 +5733,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
         tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
         /* Likewise for n_dead_tuples */
         tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
+        dshash_release_lock(tabhash, tabentry);
 
         /*
          * Add per-table stats to the per-database entry, too.
@@ -5762,6 +5746,9 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
         dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
         dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
     }
+
+    dshash_detach(tabhash);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 
@@ -5774,27 +5760,33 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
 static void
 pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
 {
+    dshash_table *tbl;
     PgStat_StatDBEntry *dbentry;
     int            i;
 
     dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
     /*
      * No need to purge if we don't even know the database.
      */
-    if (!dbentry || !dbentry->tables)
+    if (!dbentry || dbentry->tables == DSM_HANDLE_INVALID)
+    {
+        if (dbentry)
+            dshash_release_lock(db_stats, dbentry);
         return;
+    }
 
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
     /*
      * Process all table entries in the message.
      */
     for (i = 0; i < msg->m_nentries; i++)
     {
         /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
+        (void) dshash_delete_key(tbl, (void *) &(msg->m_tableid[i]));
     }
+
+    dshash_detach(tbl);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 
@@ -5820,23 +5812,20 @@ pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
      */
     if (dbentry)
     {
-        char        statfile[MAXPGPATH];
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+        }
 
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
+        dshash_delete_entry(db_stats, (void *)dbentry);
     }
 }
 
@@ -5864,19 +5853,28 @@ pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
      * We simply throw away all the database's table entries by recreating a
      * new hash table for them.
      */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
+    if (dbentry->tables != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_destroy(t);
+        dbentry->tables = DSM_HANDLE_INVALID;
+    }
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_destroy(t);
+        dbentry->functions = DSM_HANDLE_INVALID;
+    }
 
     /*
      * Reset database-level stats, too.  This creates empty hash tables for
      * tables and functions.
      */
     reset_dbentry_counters(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -5891,14 +5889,14 @@ pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
     if (msg->m_resettarget == RESET_BGWRITER)
     {
         /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
+        memset(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
     }
     else if (msg->m_resettarget == RESET_ARCHIVER)
     {
         /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
+        memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
     }
 
     /*
@@ -5928,11 +5926,21 @@ pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
 
     /* Remove object if it exists, ignore it if not */
     if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_delete_key(t, (void *) &(msg->m_objectid));
+        dshash_detach(t);
+    }
     else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_delete_key(t, (void *) &(msg->m_objectid));
+        dshash_detach(t);
+    }
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -5952,6 +5958,8 @@ pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
     dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
 
     dbentry->last_autovac_time = msg->m_start_time;
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -5965,13 +5973,13 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
 {
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
-
+    dshash_table *table;
     /*
      * Store the data in the table's hashtable entry.
      */
     dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, msg->m_tableoid, true);
 
     tabentry->n_live_tuples = msg->m_live_tuples;
     tabentry->n_dead_tuples = msg->m_dead_tuples;
@@ -5986,6 +5994,9 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
         tabentry->vacuum_timestamp = msg->m_vacuumtime;
         tabentry->vacuum_count++;
     }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -5999,13 +6010,15 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
 {
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
 
     /*
      * Store the data in the table's hashtable entry.
      */
     dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
 
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, msg->m_tableoid, true);
 
     tabentry->n_live_tuples = msg->m_live_tuples;
     tabentry->n_dead_tuples = msg->m_dead_tuples;
@@ -6028,6 +6041,9 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
         tabentry->analyze_timestamp = msg->m_analyzetime;
         tabentry->analyze_count++;
     }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 
@@ -6043,18 +6059,18 @@ pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
     if (msg->m_failed)
     {
         /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
+        ++shared_archiverStats->failed_count;
+        memcpy(shared_archiverStats->last_failed_wal, msg->m_xlog,
+               sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = msg->m_timestamp;
     }
     else
     {
         /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
+        ++shared_archiverStats->archived_count;
+        memcpy(shared_archiverStats->last_archived_wal, msg->m_xlog,
+               sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = msg->m_timestamp;
     }
 }
 
@@ -6067,16 +6083,16 @@ pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
 static void
 pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
 {
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
+    shared_globalStats->timed_checkpoints += msg->m_timed_checkpoints;
+    shared_globalStats->requested_checkpoints += msg->m_requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += msg->m_checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += msg->m_checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += msg->m_buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += msg->m_buf_written_clean;
+    shared_globalStats->maxwritten_clean += msg->m_maxwritten_clean;
+    shared_globalStats->buf_written_backend += msg->m_buf_written_backend;
+    shared_globalStats->buf_fsync_backend += msg->m_buf_fsync_backend;
+    shared_globalStats->buf_alloc += msg->m_buf_alloc;
 }
 
 /* ----------
@@ -6117,6 +6133,8 @@ pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
             dbentry->n_conflict_startup_deadlock++;
             break;
     }
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -6133,6 +6151,8 @@ pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
     dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
 
     dbentry->n_deadlocks++;
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -6150,6 +6170,8 @@ pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
 
     dbentry->n_temp_bytes += msg->m_filesize;
     dbentry->n_temp_files += 1;
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -6161,6 +6183,7 @@ pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
 static void
 pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
 {
+    dshash_table *t;
     PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
     PgStat_StatDBEntry *dbentry;
     PgStat_StatFuncEntry *funcentry;
@@ -6169,14 +6192,14 @@ pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
 
     dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
 
+    t = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
     /*
      * Process all function entries in the message.
      */
     for (i = 0; i < msg->m_nentries; i++, funcmsg++)
     {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
+        funcentry = (PgStat_StatFuncEntry *)
+            dshash_find_or_insert(t, (void *) &(funcmsg->f_id), &found);
 
         if (!found)
         {
@@ -6197,7 +6220,11 @@ pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
             funcentry->f_total_time += funcmsg->f_total_time;
             funcentry->f_self_time += funcmsg->f_self_time;
         }
+        dshash_release_lock(t, funcentry);
     }
+
+    dshash_detach(t);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -6209,6 +6236,7 @@ pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
 static void
 pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
 {
+    dshash_table *t;
     PgStat_StatDBEntry *dbentry;
     int            i;
 
@@ -6217,60 +6245,24 @@ pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
     /*
      * No need to purge if we don't even know the database.
      */
-    if (!dbentry || !dbentry->functions)
+    if (!dbentry || dbentry->functions == DSM_HANDLE_INVALID)
+    {
+        if (dbentry)
+            dshash_release_lock(db_stats, dbentry);
         return;
+    }
 
+    t = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
     /*
      * Process all function entries in the message.
      */
     for (i = 0; i < msg->m_nentries; i++)
     {
         /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
+        dshash_delete_key(t, (void *) &(msg->m_functionid[i]));
     }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
+    dshash_detach(t);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /*
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 91ae448955..5ff62fa0dc 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -77,9 +77,6 @@ static bool is_checksummed_file(const char *fullpath, const char *filename);
 /* Was the backup currently in-progress initiated in recovery mode? */
 static bool backup_started_in_recovery = false;
 
-/* Relative path of temporary statistics directory */
-static char *statrelpath = NULL;
-
 /*
  * Size of each block sent into the tar stream for larger files.
  */
@@ -121,13 +118,6 @@ static bool noverify_checksums = false;
  */
 static const char *excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    PG_STAT_TMP_DIR,
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another master. See backup.sgml
@@ -223,11 +213,8 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -254,18 +241,6 @@ perform_base_backup(basebackup_options *opt)
 
         SendXlogRecPtrResult(startptr, starttli);
 
-        /*
-         * Calculate the relative path of temporary statistics directory in
-         * order to skip the files which are located in that directory later.
-         */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
-
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
         ti->size = opt->progress ? sendDir(".", 1, true, tablespaces, true) : -1;
@@ -1174,17 +1149,6 @@ sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces,
         if (excludeFound)
             continue;
 
-        /*
-         * Exclude contents of directory specified by statrelpath if not set
-         * to the default (pg_stat_tmp) which is caught in the loop above.
-         */
-        if (statrelpath != NULL && strcmp(pathbuf, statrelpath) == 0)
-        {
-            elog(DEBUG1, "contents of directory \"%s\" excluded from backup", statrelpath);
-            size += _tarWriteDir(pathbuf, basepathlen, &statbuf, sizeonly);
-            continue;
-        }
-
         /*
          * We can skip pg_wal, the WAL segments need to be fetched from the
          * WAL archive anyway. But include it as an empty directory anyway, so
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 0c86a581c0..ee30e8a14f 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -150,6 +150,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
         size = add_size(size, BackendRandomShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -270,6 +271,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
     SyncScanShmemInit();
     AsyncShmemInit();
     BackendRandomShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index a6fda81feb..c46bb8d057 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -521,6 +521,9 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_TBM, "tbm");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
+    LWLockRegisterTranche(LWTRANCHE_STATS_DSA, "stats table dsa");
+    LWLockRegisterTranche(LWTRANCHE_STATS_DB, "db stats");
+    LWLockRegisterTranche(LWTRANCHE_STATS_FUNC_TABLE, "table/func stats");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index e6025ecedb..798af9f168 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -50,3 +50,4 @@ OldSnapshotTimeMapLock                42
 BackendRandomLock                    43
 LogicalRepWorkerLock                44
 CLogTruncationLock                    45
+StatsLock                            46
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e9f542cfed..a0fa3dac4e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -187,7 +187,6 @@ static bool check_autovacuum_max_workers(int *newval, void **extra, GucSource so
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static void assign_effective_io_concurrency(int newval, void *extra);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -3778,17 +3777,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -10721,35 +10709,6 @@ assign_effective_io_concurrency(int newval, void *extra)
 #endif                            /* USE_PREFETCH */
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 4e61bc6521..1277740473 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -512,7 +512,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index ab5cb7f0c1..f13b2dde6b 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -217,7 +217,6 @@ static const char *const subdirs[] = {
     "pg_replslot",
     "pg_tblspc",
     "pg_stat",
-    "pg_stat_tmp",
     "pg_xact",
     "pg_logical",
     "pg_logical/snapshots",
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 2211d90c6f..e6f4d30658 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index e97b25bd72..afc1927250 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -13,6 +13,7 @@
 
 #include "datatype/timestamp.h"
 #include "fmgr.h"
+#include "lib/dshash.h"
 #include "libpq/pqcomm.h"
 #include "port/atomics.h"
 #include "portability/instr_time.h"
@@ -30,9 +31,6 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
-#define PG_STAT_TMP_DIR        "pg_stat_tmp"
-
 /* Values for track_functions GUC variable --- order is significant! */
 typedef enum TrackFunctionsLevel
 {
@@ -48,7 +46,6 @@ typedef enum TrackFunctionsLevel
 typedef enum StatMsgType
 {
     PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
     PGSTAT_MTYPE_TABSTAT,
     PGSTAT_MTYPE_TABPURGE,
     PGSTAT_MTYPE_DROPDB,
@@ -216,35 +213,6 @@ typedef struct PgStat_MsgDummy
     PgStat_MsgHdr m_hdr;
 } PgStat_MsgDummy;
 
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
 /* ----------
  * PgStat_TableEntry            Per-table info in a MsgTabstat
  * ----------
@@ -539,7 +507,6 @@ typedef union PgStat_Msg
 {
     PgStat_MsgHdr msg_hdr;
     PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
     PgStat_MsgTabstat msg_tabstat;
     PgStat_MsgTabpurge msg_tabpurge;
     PgStat_MsgDropdb msg_dropdb;
@@ -601,10 +568,13 @@ typedef struct PgStat_StatDBEntry
 
     /*
      * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * the handles and pointers out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables;
+    dshash_table_handle functions;
+    /* for snapshot tables */
+    HTAB *snapshot_tables;
+    HTAB *snapshot_functions;
 } PgStat_StatDBEntry;
 
 
@@ -1213,6 +1183,9 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 extern void pgstat_initstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
+extern PgStat_StatDBEntry *backend_get_db_entry(Oid dbid, bool oneshot);
+extern HTAB *backend_snapshot_all_db_entries(void);
+extern PgStat_StatTabEntry *backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid relid, bool oneshot);
 
 /* ----------
  * pgstat_report_wait_start() -
@@ -1352,4 +1325,7 @@ extern PgStat_GlobalStats *pgstat_fetch_global(void);
 /* Main loop */
 extern void PgstatCollectorMain(void) pg_attribute_noreturn();
 
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index c21bfe2f66..2cdd10c2fd 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -219,6 +219,9 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SHARED_TUPLESTORE,
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
+    LWTRANCHE_STATS_DSA,
+    LWTRANCHE_STATS_DB,
+    LWTRANCHE_STATS_FUNC_TABLE,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
-- 
2.16.3

From 0f68698ac32dff2c8fad1984cc2da55b5aac7113 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:58:32 +0900
Subject: [PATCH 2/8] Change stats collector to an auxiliary process.

Shared memory and LWLocks are required to let stats collector use
dshash. This patch makes stats collector an auxiliary process.
---
 src/backend/bootstrap/bootstrap.c   |   8 ++
 src/backend/postmaster/pgstat.c     | 158 +++++++++++-------------------------
 src/backend/postmaster/postmaster.c |  30 +++----
 src/include/miscadmin.h             |   3 +-
 src/include/pgstat.h                |  11 ++-
 5 files changed, 77 insertions(+), 133 deletions(-)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 578af2e66d..ece200877c 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -336,6 +336,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case WalReceiverProcess:
                 statmsg = pgstat_get_backend_desc(B_WAL_RECEIVER);
                 break;
+            case StatsCollectorProcess:
+                statmsg = pgstat_get_backend_desc(B_STATS_COLLECTOR);
+                break;
             default:
                 statmsg = "??? process";
                 break;
@@ -470,6 +473,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
             WalReceiverMain();
             proc_exit(1);        /* should never return */
 
+        case StatsCollectorProcess:
+            /* don't set signals, stats collector has its own agenda */
+            PgstatCollectorMain();
+            proc_exit(1);        /* should never return */
+
         default:
             elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
             proc_exit(1);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8a5b2b3b42..999325ae53 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -267,6 +267,7 @@ static List *pending_write_requests = NIL;
 /* Signal handler flags */
 static volatile bool need_exit = false;
 static volatile bool got_SIGHUP = false;
+static volatile bool got_SIGTERM = false;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -284,8 +285,8 @@ static instr_time total_func_time;
 static pid_t pgstat_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgstat_exit(SIGNAL_ARGS);
+static void pgstat_shutdown_handler(SIGNAL_ARGS);
+static void pgstat_quickdie_handler(SIGNAL_ARGS);
 static void pgstat_beshutdown_hook(int code, Datum arg);
 static void pgstat_sighup_handler(SIGNAL_ARGS);
 
@@ -688,104 +689,6 @@ pgstat_reset_all(void)
     pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
 }
 
-#ifdef EXEC_BACKEND
-
-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
-    char       *av[10];
-    int            ac = 0;
-
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
-
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
-
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
-
-
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
-
-    /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
-     */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
-
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
-
-    /*
-     * Okay, fork off the collector.
-     */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 void
 allow_immediate_pgstat_restart(void)
 {
@@ -2870,6 +2773,9 @@ pgstat_bestart(void)
             case WalReceiverProcess:
                 beentry->st_backendType = B_WAL_RECEIVER;
                 break;
+            case StatsCollectorProcess:
+                beentry->st_backendType = B_STATS_COLLECTOR;
+                break;
             default:
                 elog(FATAL, "unrecognized process type: %d",
                      (int) MyAuxProcType);
@@ -4135,6 +4041,9 @@ pgstat_get_backend_desc(BackendType backendType)
         case B_WAL_WRITER:
             backendDesc = "walwriter";
             break;
+        case B_STATS_COLLECTOR:
+            backendDesc = "stats collector";
+            break;
     }
 
     return backendDesc;
@@ -4252,8 +4161,8 @@ pgstat_send_bgwriter(void)
  *    The argc/argv parameters are valid only in EXEC_BACKEND case.
  * ----------
  */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
+void
+PgstatCollectorMain(void)
 {
     int            len;
     PgStat_Msg    msg;
@@ -4266,8 +4175,8 @@ PgstatCollectorMain(int argc, char *argv[])
      */
     pqsignal(SIGHUP, pgstat_sighup_handler);
     pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, pgstat_exit);
+    pqsignal(SIGTERM, pgstat_shutdown_handler);
+    pqsignal(SIGQUIT, pgstat_quickdie_handler);
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
     pqsignal(SIGUSR1, SIG_IGN);
@@ -4312,14 +4221,14 @@ PgstatCollectorMain(int argc, char *argv[])
         /*
          * Quit if we get SIGQUIT from the postmaster.
          */
-        if (need_exit)
+        if (got_SIGTERM)
             break;
 
         /*
          * Inner loop iterates as long as we keep getting messages, or until
          * need_exit becomes set.
          */
-        while (!need_exit)
+        while (!got_SIGTERM)
         {
             /*
              * Reload configuration if we got SIGHUP from the postmaster.
@@ -4507,14 +4416,29 @@ PgstatCollectorMain(int argc, char *argv[])
 
 /* SIGQUIT signal handler for collector process */
 static void
-pgstat_exit(SIGNAL_ARGS)
+pgstat_quickdie_handler(SIGNAL_ARGS)
 {
-    int            save_errno = errno;
+    PG_SETMASK(&BlockSig);
 
-    need_exit = true;
-    SetLatch(MyLatch);
+    /*
+     * We DO NOT want to run proc_exit() callbacks -- we're here because
+     * shared memory may be corrupted, so we don't want to try to clean up our
+     * transaction.  Just nail the windows shut and get out of town.  Now that
+     * there's an atexit callback to prevent third-party code from breaking
+     * things by calling exit() directly, we have to reset the callbacks
+     * explicitly to make this work as intended.
+     */
+    on_exit_reset();
 
-    errno = save_errno;
+    /*
+     * Note we do exit(2) not exit(0).  This is to force the postmaster into a
+     * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+     * backend.  This is necessary precisely because we don't clean up our
+     * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+     * should ensure the postmaster sees this as a crash, too, but no harm in
+     * being doubly sure.)
+     */
+    exit(2);
 }
 
 /* SIGHUP handler for collector process */
@@ -4529,6 +4453,18 @@ pgstat_sighup_handler(SIGNAL_ARGS)
     errno = save_errno;
 }
 
+static void
+pgstat_shutdown_handler(SIGNAL_ARGS)
+{
+    int save_errno = errno;
+
+    got_SIGTERM = true;
+
+    SetLatch(MyLatch);
+
+    errno = save_errno;
+}
+
 /*
  * Subroutine to clear stats in a database entry
  *
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 305ff36258..b273fa0717 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -144,7 +144,8 @@
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
 #define BACKEND_TYPE_BGWORKER    0x0008    /* bgworker process */
-#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
+#define BACKEND_TYPE_STATS        0x0010    /* stats collector process */
+#define BACKEND_TYPE_ALL        0x001F    /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -550,6 +551,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
 #define StartWalReceiver()        StartChildProcess(WalReceiverProcess)
+#define StartStatsCollector()    StartChildProcess(StatsCollectorProcess)
 
 /* Macros to check exit status of a child process */
 #define EXIT_STATUS_0(st)  ((st) == 0)
@@ -1762,7 +1764,7 @@ ServerLoop(void)
         /* If we have lost the stats collector, try to start a new one */
         if (PgStatPID == 0 &&
             (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
+            PgStatPID = StartStatsCollector();
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
@@ -2880,7 +2882,7 @@ reaper(SIGNAL_ARGS)
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = pgarch_start();
             if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
+                PgStatPID = StartStatsCollector();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -2953,7 +2955,7 @@ reaper(SIGNAL_ARGS)
                  * nothing left for it to do.
                  */
                 if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
+                    signal_child(PgStatPID, SIGTERM);
             }
             else
             {
@@ -3039,10 +3041,10 @@ reaper(SIGNAL_ARGS)
         {
             PgStatPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
+                HandleChildCrash(pid, exitstatus,
+                                 _("statistics collector process"));
             if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
+                PgStatPID = StartStatsCollector();
             continue;
         }
 
@@ -3272,7 +3274,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, stats collector or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -4951,12 +4953,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         PgArchiverMain(argc, argv); /* does not return */
     }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5073,7 +5069,7 @@ sigusr1_handler(SIGNAL_ARGS)
          * Likewise, start other special children as needed.
          */
         Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
+        PgStatPID = StartStatsCollector();
 
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
@@ -5370,6 +5366,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork WAL receiver process: %m")));
                 break;
+            case StatsCollectorProcess:
+                ereport(LOG,
+                        (errmsg("could not fork stats collector process: %m")));
+                break;
             default:
                 ereport(LOG,
                         (errmsg("could not fork process: %m")));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 69f356f8cd..433d1ed0eb 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -400,7 +400,7 @@ typedef enum
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
-
+    StatsCollectorProcess,
     NUM_AUXPROCTYPES            /* Must be last! */
 } AuxProcType;
 
@@ -412,6 +412,7 @@ extern AuxProcType MyAuxProcType;
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
+#define AmStatsCollectorProcess()    (MyAuxProcType == StatsCollectorProcess)
 
 
 /*****************************************************************************
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index d59c24ae23..e97b25bd72 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -710,7 +710,8 @@ typedef enum BackendType
     B_STARTUP,
     B_WAL_RECEIVER,
     B_WAL_SENDER,
-    B_WAL_WRITER
+    B_WAL_WRITER,
+    B_STATS_COLLECTOR
 } BackendType;
 
 
@@ -1160,11 +1161,6 @@ extern int    pgstat_start(void);
 extern void pgstat_reset_all(void);
 extern void allow_immediate_pgstat_restart(void);
 
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
-
-
 /* ----------
  * Functions called from backends
  * ----------
@@ -1353,4 +1349,7 @@ extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
 
+/* Main loop */
+extern void PgstatCollectorMain(void) pg_attribute_noreturn();
+
 #endif                            /* PGSTAT_H */
-- 
2.16.3

From 0a9053c6d54c5b80649504d7192fe5fd772110c4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:41:04 +0900
Subject: [PATCH 1/8] sequential scan for dshash

Add sequential scan feature to dshash.
---
 src/backend/lib/dshash.c | 138 +++++++++++++++++++++++++++++++++++++++++++++++
 src/include/lib/dshash.h |  23 +++++++-
 2 files changed, 160 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index b46f7c4cfd..5b133226ac 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -592,6 +592,144 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should be called if and only if the scan is abandoned
+ * before completion; if dshash_seq_next returns NULL then it has already done
+ * the end-of-scan cleanup.
+ *
+ * On returning element, it is locked as is the case with dshash_find.
+ * However, the caller must not release the lock. The lock is released as
+ * necessary in continued scan.
+ *
+ * As opposed to the equivalent for dynahash, the caller is not supposed to
+ * delete the returned element before continuing the scan.
+ *
+ * If consistent is set for dshash_seq_init, the whole hash table is
+ * non-exclusively locked. Otherwise a part of the hash table is locked in the
+ * same mode (partition lock).
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool consistent)
+{
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = ((size_t) 1) << hash_table->control->size_log2;
+    status->curitem = NULL;
+    status->curpartition = -1;
+    status->consistent = consistent;
+
+    /*
+     * Protect all partitions from modification if the caller wants a
+     * consistent result.
+     */
+    if (consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+        {
+            Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i)));
+
+            LWLockAcquire(PARTITION_LOCK(hash_table, i), LW_SHARED);
+        }
+    }
+    ensure_valid_bucket_pointers(hash_table);
+}
+
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    if (status->curitem == NULL)
+    {
+        Assert (status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+        status->hash_table->find_locked = true;
+    }
+    else
+        next_item_pointer = status->curitem->next;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            dshash_seq_term(status);
+            return NULL;
+        }
+        Assert(status->hash_table->find_locked);
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+
+        /*
+         * we need a lock on the scanning partition even if the caller didn't
+         * request a consistent snapshot.
+         */
+        if (!status->consistent && DsaPointerIsValid(next_item_pointer))
+        {
+            dshash_table_item  *item = dsa_get_address(status->hash_table->area,
+                                                       next_item_pointer);
+            int next_partition = PARTITION_FOR_HASH(item->hash);
+            if (status->curpartition != next_partition)
+            {
+                if (status->curpartition >= 0)
+                    LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                                 status->curpartition));
+                LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                             next_partition),
+                              LW_SHARED);
+                status->curpartition = next_partition;
+            }
+        }
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    Assert(status->hash_table->find_locked);
+    status->hash_table->find_locked = false;
+
+    if (status->consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+            LWLockRelease(PARTITION_LOCK(status->hash_table, i));
+    }
+    else if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+int
+dshash_get_num_entries(dshash_table *hash_table)
+{
+    /* a shortcut implementation; should be improved */
+    dshash_seq_status s;
+    void *p;
+    int n = 0;
+
+    dshash_seq_init(&s, hash_table, false);
+    while ((p = dshash_seq_next(&s)) != NULL)
+        n++;
+
+    return n;
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 8c733bfe25..8598d9ed84 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -15,6 +15,7 @@
 #define DSHASH_H
 
 #include "utils/dsa.h"
+#include "utils/hsearch.h"
 
 /* The opaque type representing a hash table. */
 struct dshash_table;
@@ -59,6 +60,21 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state of dshash. The detail is exposed since the storage
+ * size should be known to users but it should be considered as an opaque
+ * type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;
+    int                    curbucket;
+    int                    nbuckets;
+    dshash_table_item  *curitem;
+    int                    curpartition;
+    bool                consistent;
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
               const dshash_parameters *params,
@@ -70,7 +86,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
 extern void dshash_detach(dshash_table *hash_table);
 extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
 extern void dshash_destroy(dshash_table *hash_table);
-
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
             const void *key, bool exclusive);
@@ -80,6 +95,12 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool consistent);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
+extern int dshash_get_num_entries(dshash_table *hash_table);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.16.3
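
The calling convention documented for dshash_seq_init/_next/_term in the patch above (the scan cleans up after itself when _next returns NULL, and _term is needed only when the caller abandons the scan early) can be illustrated with a toy model. This is a sketch under stated assumptions, not dshash itself: every `toy_*` name is hypothetical, a plain counter `locks_held` stands in for the partition LWLocks, and the table is an ordinary in-process array of linked lists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NBUCKETS 4

typedef struct toy_item
{
    int              value;
    struct toy_item *next;
} toy_item;

typedef struct toy_table
{
    toy_item *buckets[TOY_NBUCKETS];
    int       locks_held;       /* stand-in for partition LWLocks */
} toy_table;

typedef struct toy_seq_status
{
    toy_table *table;
    int        curbucket;
    toy_item  *curitem;
} toy_seq_status;

/* Push an item onto the bucket chosen by its value. */
void
toy_insert(toy_table *t, toy_item *item)
{
    unsigned b = (unsigned) item->value % TOY_NBUCKETS;

    item->next = t->buckets[b];
    t->buckets[b] = item;
}

void
toy_seq_init(toy_seq_status *st, toy_table *t)
{
    st->table = t;
    st->curbucket = 0;
    st->curitem = NULL;
    t->locks_held++;            /* "lock" for the duration of the scan */
}

void
toy_seq_term(toy_seq_status *st)
{
    st->table->locks_held--;
}

/* Return the next value, or NULL (after cleaning up) when exhausted. */
int *
toy_seq_next(toy_seq_status *st)
{
    toy_item *next = (st->curitem == NULL)
        ? st->table->buckets[st->curbucket]
        : st->curitem->next;

    while (next == NULL)
    {
        if (++st->curbucket >= TOY_NBUCKETS)
        {
            toy_seq_term(st);   /* end-of-scan cleanup happens here */
            return NULL;
        }
        next = st->table->buckets[st->curbucket];
    }
    st->curitem = next;
    return &next->value;
}

/* Full scan: no explicit term call, _next already cleaned up. */
int
toy_sum_all(toy_table *t)
{
    toy_seq_status st;
    int           *p;
    int            sum = 0;

    toy_seq_init(&st, t);
    while ((p = toy_seq_next(&st)) != NULL)
        sum += *p;
    return sum;
}

/* Abandoned scan: the caller must call term before leaving early. */
bool
toy_contains(toy_table *t, int wanted)
{
    toy_seq_status st;
    int           *p;

    toy_seq_init(&st, t);
    while ((p = toy_seq_next(&st)) != NULL)
    {
        if (*p == wanted)
        {
            toy_seq_term(&st);
            return true;
        }
    }
    return false;
}
```

In both helpers the lock counter is back at zero on return, which is the property the "if and only if the scan is abandoned" rule is meant to guarantee.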


Re: shared-memory based stats collector

From
Kyotaro HORIGUCHI
Date:
The previous patch doesn't work...

At Thu, 27 Sep 2018 22:00:49 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20180927.220049.168546206.horiguchi.kyotaro@lab.ntt.co.jp>
> - 0001 to 0006 is rebased version of v4.
> - 0007 adds conditional locking to dshash
> 
> - 0008 is the no-UDP stats collector.
> 
> If the required lock cannot be acquired for some stats items, the
> report functions return immediately after storing the values
> locally. The stored values are merged by later calls. Explicitly
> calling pgstat_cleanup_pending_stat() at a convenient time tries to
> apply the pending values, but the function is not called anywhere
> for now.
> 
> The stats collector process is used only to save and load the stats
> files and to create the shared memory for stats. I'm going to remove
> the stats collector.
> 
> I'll continue working this way.

It neither works nor even compiles, since I failed to include some
changes. The attached v6-0008 at least compiles and works.

0001-0007 are not attached since they still apply to the
master head with offsets.
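
The deferral described in the quoted text (a report function that cannot
get the lock stores its values locally and merges them on a later call)
can be illustrated with a small standalone sketch. This is not code from
the patch: try_lock_shared(), report_count() and cleanup_pending() are
hypothetical names, and a C11 atomic_flag merely stands in for the
conditional dshash lock added in 0007.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a lock-protected shared stats entry. */
static atomic_flag shared_lock = ATOMIC_FLAG_INIT;
static long shared_count = 0;

/* Backend-local counts that have not reached shared memory yet. */
static long pending_count = 0;

/* Conditional lock: returns true only if the lock was free. */
static bool
try_lock_shared(void)
{
    return !atomic_flag_test_and_set(&shared_lock);
}

static void
unlock_shared(void)
{
    atomic_flag_clear(&shared_lock);
}

/*
 * Report 'delta' events without ever blocking.  The value is always
 * accumulated locally first; it is merged into the shared counter only
 * if the lock can be taken right now.  Returns true if nothing is left
 * pending afterwards.
 */
static bool
report_count(long delta)
{
    assert(delta >= 0);
    pending_count += delta;

    if (!try_lock_shared())
        return false;           /* lock busy: keep the value for later */

    shared_count += pending_count;
    pending_count = 0;
    unlock_shared();
    return true;
}

/* Explicit retry, similar in spirit to pgstat_cleanup_pending_stat(). */
static bool
cleanup_pending(void)
{
    if (pending_count == 0)
        return true;
    return report_count(0);
}
```

While the lock is held elsewhere, report_count(5) returns false and leaves
5 pending; once the lock is free, a following report_count(2) applies both
values together, so the shared counter advances by 7.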

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From ffbe9d78239352df9ff9edac3e66675117703d88 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 27 Sep 2018 21:36:06 +0900
Subject: [PATCH 8/8] Ultra PoC of full-shared-memory stats collector.

This patch is a super-ultra PoC of a full-shared-memory stats collector,
which means UDP is no longer involved in the stats collector mechanism.
Some statistics updates are postponed when the required lock is not
available; they can be applied later by calling
pgstat_cleanup_pending_stat() at a convenient time (not called in this
patch).
---
 src/backend/access/transam/xlog.c     |    4 +-
 src/backend/postmaster/checkpointer.c |    8 +-
 src/backend/postmaster/pgstat.c       | 2424 ++++++++++++---------------------
 src/backend/storage/buffer/bufmgr.c   |    8 +-
 src/backend/storage/lmgr/deadlock.c   |    2 +-
 src/backend/utils/adt/pgstatfuncs.c   |    2 +-
 src/include/pgstat.h                  |  357 +----
 7 files changed, 936 insertions(+), 1869 deletions(-)
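
The new pgstat_report_stat() in the pgstat.c hunks keeps table entries
that could not be applied and packs them at the head of their array, so
that the next call retries them. A minimal standalone sketch of that
compaction, assuming a hypothetical Entry type and apply callback (neither
is from the patch), might look like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical local stats entry. */
typedef struct Entry
{
    int         id;
    long        count;
} Entry;

/* Hypothetical callback: returns true when the entry was applied. */
typedef bool (*apply_fn) (const Entry *entry);

/*
 * Try to apply entries[0 .. used-1].  Entries that fail are moved to the
 * head of the array in their original order; the rest of the array is
 * zeroed.  Returns the number of entries still waiting.
 */
static size_t
apply_and_compact(Entry *entries, size_t used, size_t capacity,
                  apply_fn apply)
{
    size_t      kept = 0;

    assert(used <= capacity);

    for (size_t i = 0; i < used; i++)
    {
        if (apply(&entries[i]))
            continue;           /* applied, slot can be reused */

        if (kept < i)
            entries[kept] = entries[i];
        kept++;
    }

    /* Clear everything behind the kept entries. */
    memset(&entries[kept], 0, (capacity - kept) * sizeof(Entry));
    return kept;
}

/* Example callback: pretend entries with odd ids hit a busy lock. */
static bool
apply_even_ids(const Entry *entry)
{
    return entry->id % 2 == 0;
}
```

Given entries with ids 1, 2 and 3, apply_even_ids() applies only id 2, so
apply_and_compact() returns 2 and leaves ids 1 and 3 in the first two
slots.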

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7375a78ffc..980c7e9e0e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8604,9 +8604,9 @@ LogCheckpointEnd(bool restartpoint)
                         &sync_secs, &sync_usecs);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time +=
+    BgWriterStats.checkpoint_write_time +=
         write_secs * 1000 + write_usecs / 1000;
-    BgWriterStats.m_checkpoint_sync_time +=
+    BgWriterStats.checkpoint_sync_time +=
         sync_secs * 1000 + sync_usecs / 1000;
 
     /*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 1a033093c5..62e1ee7ace 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -376,7 +376,7 @@ CheckpointerMain(void)
         {
             checkpoint_requested = false;
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            BgWriterStats.requested_checkpoints++;
         }
         if (shutdown_requested)
         {
@@ -402,7 +402,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                BgWriterStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -1296,8 +1296,8 @@ AbsorbFsyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    BgWriterStats.buf_written_backend += CheckpointerShmem->num_backend_writes;
+    BgWriterStats.buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index a3d5f4856f..339425720f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -93,6 +93,23 @@
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
 
+#define PGSTAT_MAX_QUEUE_LEN    100
+
+/*
+ * Operation mode of pgstat_get_db_entry.
+ */
+#define PGSTAT_TABLE_READ    0
+#define PGSTAT_TABLE_WRITE    1
+#define PGSTAT_TABLE_CREATE 2
+#define    PGSTAT_TABLE_NOWAIT 4
+
+typedef enum
+{
+    PGSTAT_TABLE_NOT_FOUND,
+    PGSTAT_TABLE_FOUND,
+    PGSTAT_TABLE_LOCK_FAILED
+} pg_stat_table_result_status;
+
 /* ----------
  * Total number of backends including auxiliary
  *
@@ -119,16 +136,12 @@ int            pgstat_track_activity_query_size = 1024;
  * Stored directly in a stats message structure so it can be sent
  * without needing to copy things around.  We assume this inits to zeroes.
  */
-PgStat_MsgBgWriter BgWriterStats;
+PgStat_BgWriter BgWriterStats;
 
 /* ----------
  * Local data
  * ----------
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
-
 static time_t last_pgstat_start_time;
 
 static bool pgStatRunningInCollector = false;
@@ -212,12 +225,6 @@ static HTAB *pgStatTabHash = NULL;
  */
 static HTAB *pgStatFunctions = NULL;
 
-/*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
- */
-static bool have_function_stats = false;
-
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
  * into PgStat_TableStatus counters until we know if it is going to commit
@@ -311,7 +318,8 @@ static void pgstat_quickdie_handler(SIGNAL_ARGS);
 static void pgstat_beshutdown_hook(int code, Datum arg);
 static void pgstat_sighup_handler(SIGNAL_ARGS);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
+static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, int op,
+                                    pg_stat_table_result_status *status);
 static PgStat_StatTabEntry *pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create);
 static void pgstat_write_statsfiles(void);
 static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry);
@@ -323,10 +331,9 @@ static bool backend_snapshot_global_stats(void);
 static PgStat_StatFuncEntry *backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot);
 static void pgstat_read_current_status(void);
 
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
 static void pgstat_send_funcstats(void);
 static HTAB *pgstat_collect_oids(Oid catalogid);
-
+static void reset_dbentry_counters(PgStat_StatDBEntry *dbentry);
 static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
 
 static void pgstat_setup_memcxt(void);
@@ -337,25 +344,13 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+static dshash_table *pgstat_update_dbentry(Oid dboid);
+static bool pgstat_update_tabentry(dshash_table *tabhash,
+                                   PgStat_TableStatus *stat);
+static bool pgstat_update_funcentry(dshash_table *funchash,
+                                    PgStat_BackendFunctionEntry *stat);
+static bool pgstat_tabpurge(Oid dboid, Oid taboid);
+static bool pgstat_funcpurge(Oid dboid, Oid funcoid);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
@@ -374,280 +369,7 @@ static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
 void
 pgstat_init(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
-
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
-
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
-    {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
-    }
-
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
-
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
-
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
-
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
-    }
-
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
-
-    /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
-     */
-    if (!pg_set_noblock(pgStatSock))
-    {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
-    }
-
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
-    }
-
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
-
     return;
-
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
-
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
-
-    /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
-     */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
 }
 
 /*
@@ -713,226 +435,6 @@ allow_immediate_pgstat_restart(void)
     last_pgstat_start_time = 0;
 }
 
-/* ------------------------------------------------------------
- * Public functions used by backends follow
- *------------------------------------------------------------
- */
-
-
-/* ----------
- * pgstat_report_stat() -
- *
- *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
- *    transaction stop time as an approximation of current time.
- * ----------
- */
-void
-pgstat_report_stat(bool force)
-{
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
-    TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
-    int            i;
-
-    /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
-
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
-    now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
-
-    /*
-     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
-     */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
-    pgStatTabHash = NULL;
-
-    /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
-     */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
-    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
-    {
-        for (i = 0; i < tsa->tsa_used; i++)
-        {
-            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
-
-            /* Shouldn't have any pending transaction-dependent counts */
-            Assert(entry->trans == NULL);
-
-            /*
-             * Ignore entries that didn't accumulate any actual counts, such
-             * as indexes that were opened by the planner but not used.
-             */
-            if (memcmp(&entry->t_counts, &all_zeroes,
-                       sizeof(PgStat_TableCounts)) == 0)
-                continue;
-
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
-            {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
-            }
-        }
-        /* zero out TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
-    }
-
-    /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
-     */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
-
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
-}
-
-/*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
- */
-static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
-{
-    int            n;
-    int            len;
-
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
-     */
-    if (OidIsValid(tsmsg->m_databaseid))
-    {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
-        pgStatXactCommit = 0;
-        pgStatXactRollback = 0;
-        pgStatBlockReadTime = 0;
-        pgStatBlockWriteTime = 0;
-    }
-    else
-    {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
-    }
-
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
-
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
-}
-
-/*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
- */
-static void
-pgstat_send_funcstats(void)
-{
-    /* we assume this inits to all zeroes: */
-    static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
-
-    if (pgStatFunctions == NULL)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
-
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
-
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
-
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
-
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
-
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
-}
-
-
 /* ----------
  * pgstat_attach_shared_stats() -
  *
@@ -1008,6 +510,225 @@ pgstat_create_shared_stats(void)
     LWLockRelease(StatsLock);
 }
 
+
+/* ------------------------------------------------------------
+ * Public functions used by backends follow
+ *------------------------------------------------------------
+ */
+static bool pgstat_pending_tabstats = false;
+static bool pgstat_pending_funcstats = false;
+static bool pgstat_pending_vacstats = false;
+static bool pgstat_pending_dropdb = false;
+static bool pgstat_pending_resetcounter = false;
+static bool pgstat_pending_resetsharedcounter = false;
+static bool pgstat_pending_resetsinglecounter = false;
+static bool pgstat_pending_autovac = false;
+static bool pgstat_pending_vacuum = false;
+static bool pgstat_pending_analyze = false;
+static bool pgstat_pending_recoveryconflict = false;
+static bool pgstat_pending_deadlock = false;
+static bool pgstat_pending_tempfile = false;
+
+void
+pgstat_cleanup_pending_stat(void)
+{
+    if (pgstat_pending_tabstats)
+        pgstat_report_stat(true);
+    if (pgstat_pending_funcstats)
+        pgstat_send_funcstats();
+    if (pgstat_pending_vacstats)
+        pgstat_vacuum_stat();
+    if (pgstat_pending_dropdb)
+        pgstat_drop_database(InvalidOid);
+    if (pgstat_pending_resetcounter)
+        pgstat_reset_counters();
+    if (pgstat_pending_resetsharedcounter)
+        pgstat_reset_shared_counters(NULL);
+    if (pgstat_pending_resetsinglecounter)
+        pgstat_reset_single_counter(InvalidOid, 0);
+    if (pgstat_pending_autovac)
+        pgstat_report_autovac(InvalidOid);
+    if (pgstat_pending_vacuum)
+        pgstat_report_vacuum(InvalidOid, false, 0, 0);
+    if (pgstat_pending_analyze)
+        pgstat_report_analyze(NULL, 0, 0, false);
+    if (pgstat_pending_recoveryconflict)
+        pgstat_report_recovery_conflict(-1);
+    if (pgstat_pending_deadlock)
+        pgstat_report_deadlock(true);
+    if (pgstat_pending_tempfile)
+        pgstat_report_tempfile(0);
+}
+
+/* ----------
+ * pgstat_report_stat() -
+ *
+ *    Must be called by processes that performs DML: tcop/postgres.c, logical
+ *    receiver processes, SPI worker, etc. to send the so far collected
+ *    per-table and function usage statistics to the collector.  Note that this
+ *    is called only when not within a transaction, so it is fair to use
+ *    transaction stop time as an approximation of current time.
+ * ----------
+ */
+void
+pgstat_report_stat(bool force)
+{
+    /* we assume this inits to all zeroes: */
+    static const PgStat_TableCounts all_zeroes;
+    static TimestampTz last_report = 0;
+    TimestampTz now;
+    TabStatusArray *tsa;
+    int            i;
+    dshash_table *shared_tabhash = NULL;
+    dshash_table *regular_tabhash = NULL;
+
+    /* Don't expend a clock check if nothing to do */
+    if (!pgstat_pending_tabstats &&
+        (pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
+        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
+        !pgstat_pending_funcstats)
+        return;
+
+    /*
+     * Don't update shared stats unless it's been at least
+     * PGSTAT_STAT_INTERVAL msec since we last did so, or the caller wants
+     * to force stats out.
+     */
+    now = GetCurrentTransactionStopTimestamp();
+    if (!force &&
+        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
+        return;
+    last_report = now;
+
+    /*
+     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
+     * entries it points to.  (Should we fail partway through the loop below,
+     * it's okay to have removed the hashtable already --- the only
+     * consequence is we'd get multiple entries for the same table in the
+     * pgStatTabList, and that's safe.)
+     */
+    if (pgStatTabHash)
+        hash_destroy(pgStatTabHash);
+    pgStatTabHash = NULL;
+
+    pgstat_pending_tabstats = false;
+    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+    {
+        int move_skipped_to = 0;
+
+        for (i = 0; i < tsa->tsa_used; i++)
+        {
+            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
+            dshash_table *tabhash;
+
+            /* Shouldn't have any pending transaction-dependent counts */
+            Assert(entry->trans == NULL);
+
+            /*
+             * Ignore entries that didn't accumulate any actual counts, such
+             * as indexes that were opened by the planner but not used.
+             */
+            if (memcmp(&entry->t_counts, &all_zeroes,
+                       sizeof(PgStat_TableCounts)) != 0)
+            {
+                /*
+                 * OK, apply the counts to the appropriate shared hash
+                 * table.
+                 */
+                if (entry->t_shared)
+                {
+                    if (!shared_tabhash)
+                        shared_tabhash = pgstat_update_dbentry(InvalidOid);
+                    tabhash = shared_tabhash;
+                }
+                else
+                {
+                    if (!regular_tabhash)
+                        regular_tabhash = pgstat_update_dbentry(MyDatabaseId);
+                    tabhash = regular_tabhash;
+                }
+
+                /*
+                 * If this entry could not be applied, keep it for the next
+                 * turn. Kept entries are packed at the head of the array.
+                 */
+                if (!pgstat_update_tabentry(tabhash, entry))
+                {
+                    if (move_skipped_to < i)
+                        memmove(&tsa->tsa_entries[move_skipped_to],
+                                &tsa->tsa_entries[i],
+                                sizeof(PgStat_TableStatus));
+                    move_skipped_to++;
+                }
+            }
+        }
+
+        /* remember that unapplied entries remain */
+        if (move_skipped_to > 0)
+            pgstat_pending_tabstats = true;
+
+        tsa->tsa_used = move_skipped_to;
+        /* zero out TableStatus structs after use */
+        MemSet(&tsa->tsa_entries[tsa->tsa_used], 0,
+               (TABSTAT_QUANTUM - tsa->tsa_used) * sizeof(PgStat_TableStatus));
+    }
+
+    /* Now, send function statistics */
+    pgstat_send_funcstats();
+}
+
+/*
+ * Subroutine for pgstat_report_stat: apply function stats to the shared hash
+ */
+static void
+pgstat_send_funcstats(void)
+{
+    /* we assume this inits to all zeroes: */
+    static const PgStat_FunctionCounts all_zeroes;
+
+    PgStat_BackendFunctionEntry *entry;
+    HASH_SEQ_STATUS fstat;
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
+    dshash_table *funchash;
+
+    if (pgStatFunctions == NULL)
+        return;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_TABLE_WRITE
+                                  | PGSTAT_TABLE_CREATE
+                                  | PGSTAT_TABLE_NOWAIT,
+                                  &status);
+
+    /* on lock failure, keep the pending flag set and retry later */
+    if (status == PGSTAT_TABLE_LOCK_FAILED)
+        return;
+    pgstat_pending_funcstats = false;
+
+    funchash = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+
+    hash_seq_init(&fstat, pgStatFunctions);
+    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    {
+        /* Skip it if no counts accumulated since last time */
+        if (memcmp(&entry->f_counts, &all_zeroes,
+                   sizeof(PgStat_FunctionCounts)) == 0)
+            continue;
+
+        if (pgstat_update_funcentry(funchash, entry))
+        {
+            /* reset the entry's counts */
+            MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
+        }
+        else
+            pgstat_pending_funcstats = true;
+    }
+    dshash_detach(funchash);
+    dshash_release_lock(db_stats, dbentry);
+}
+
+
 /* ----------
  * pgstat_vacuum_stat() -
  *
@@ -1018,17 +739,13 @@ void
 pgstat_vacuum_stat(void)
 {
     HTAB       *oidtab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
     dshash_table *dshtable;
     dshash_seq_status dshstat;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatFuncEntry *funcentry;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    pg_stat_table_result_status status;
+    bool        no_pending_stats;
 
     /* If not done for this transaction, take a snapshot of stats */
     if (!backend_snapshot_global_stats())
@@ -1060,11 +777,15 @@ pgstat_vacuum_stat(void)
     /* Clean up */
     hash_destroy(oidtab);
 
+    pgstat_pending_vacstats = true;
+
     /*
      * Lookup our own database entry; if not found, nothing more to do.
      */
-    dbentry = backend_get_db_entry(MyDatabaseId, true);
-    if (dbentry == NULL)
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_TABLE_WRITE | PGSTAT_TABLE_NOWAIT,
+                                  &status);
+    if (status == PGSTAT_TABLE_LOCK_FAILED || dbentry == NULL)
         return;
     
     /*
@@ -1072,17 +793,13 @@ pgstat_vacuum_stat(void)
      */
     oidtab = pgstat_collect_oids(RelationRelationId);
 
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
-
     /*
      * Check for all tables listed in stats hashtable if they still exist.
      * Stats cache is useless here so directly search the shared hash.
      */
     dshtable = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
     dshash_seq_init(&dshstat, dshtable, false);
+    no_pending_stats = true;
     while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            tabid = tabentry->tableid;
@@ -1092,41 +809,11 @@ pgstat_vacuum_stat(void)
         if (hash_search(oidtab, (void *) &tabid, HASH_FIND, NULL) != NULL)
             continue;
 
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
+        /* Not there, so purge this table */
+        if (!pgstat_tabpurge(MyDatabaseId, tabid))
+            no_pending_stats = false;
     }
     dshash_detach(dshtable);
-
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
     /* Clean up */
     hash_destroy(oidtab);
 
@@ -1139,10 +826,6 @@ pgstat_vacuum_stat(void)
     {
         oidtab = pgstat_collect_oids(ProcedureRelationId);
 
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
         dshash_seq_init(&dshstat, dshtable, false);
         while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&dshstat)) != NULL)
         {
@@ -1153,39 +836,17 @@ pgstat_vacuum_stat(void)
             if (hash_search(oidtab, (void *) &funcid, HASH_FIND, NULL) != NULL)
                 continue;
 
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
+            /* Not there, so purge this function */
+            if (!pgstat_funcpurge(MyDatabaseId, funcid))
+                no_pending_stats = false;
         }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
-
         hash_destroy(oidtab);
     }
     dshash_detach(dshtable);
+    dshash_release_lock(db_stats, dbentry);
+
+    if (no_pending_stats)
+        pgstat_pending_vacstats = false;
 }
 
 
@@ -1247,50 +908,69 @@ pgstat_collect_oids(Oid catalogid)
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
+    static List *pending_dbid = NIL;
+    List *left_dbid = NIL;
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
+    ListCell *lc;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (OidIsValid(databaseid))
+        pending_dbid = lappend_oid(pending_dbid, databaseid);
+    pgstat_pending_dropdb = true;
+
+    if (db_stats == NULL)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
+    foreach (lc, pending_dbid)
+    {
+        Oid dbid = lfirst_oid(lc);
+
+        /*
+         * Lookup the database in the hashtable.
+         */
+        dbentry = pgstat_get_db_entry(dbid,
+                                      PGSTAT_TABLE_WRITE | PGSTAT_TABLE_NOWAIT,
+                                      &status);
+
+        /* skip on lock failure */
+        if (status == PGSTAT_TABLE_LOCK_FAILED)
+        {
+            left_dbid = lappend_oid(left_dbid, dbid);
+            continue;
+        }
+
+        /*
+         * If found, remove it along with its table and function hashes.
+         */
+        if (dbentry)
+        {
+            if (dbentry->tables != DSM_HANDLE_INVALID)
+            {
+                dshash_table *tbl =
+                    dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+                dshash_destroy(tbl);
+            }
+            if (dbentry->functions != DSM_HANDLE_INVALID)
+            {
+                dshash_table *tbl =
+                    dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+                dshash_destroy(tbl);
+            }
+
+
+            dshash_delete_entry(db_stats, (void *)dbentry);
+        }
+    }
+
+    list_free(pending_dbid);
+    pending_dbid = left_dbid;
+
+    /* we're done if no database ids are left pending */
+    if (list_length(pending_dbid) == 0)
+        pgstat_pending_dropdb = false;
 }
 
 
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
-
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
-}
-#endif                            /* NOT_USED */
-
-
 /* ----------
  * pgstat_reset_counters() -
  *
@@ -1303,14 +983,59 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    pgstat_pending_resetcounter = true;
+
+    if (db_stats == NULL)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Lookup the database in the hashtable.  Nothing to do if not there.
+     */
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_TABLE_WRITE | PGSTAT_TABLE_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_TABLE_LOCK_FAILED)
+        return;
+
+    if (!dbentry)
+    {
+        pgstat_pending_resetcounter = false;
+        return;
+    }
+
+    /*
+     * We simply throw away all the database's table entries by recreating a
+     * new hash table for them.
+     */
+    if (dbentry->tables != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_destroy(t);
+        dbentry->tables = DSM_HANDLE_INVALID;
+    }
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_destroy(t);
+        dbentry->functions = DSM_HANDLE_INVALID;
+    }
+
+    /*
+     * Reset database-level stats, too.  This creates empty hash tables for
+     * tables and functions.
+     */
+    reset_dbentry_counters(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
+
+    /* we're done */
+    pgstat_pending_resetcounter = false;
 }
 
 /* ----------
@@ -1325,23 +1050,53 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    static bool archiver_pending = false;
+    static bool bgwriter_pending = false;
+    bool    have_lock = false;
 
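-    if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+    if (target != NULL && strcmp(target, "archiver") == 0)
+    {
+        archiver_pending = true;
+    }
-    else if (strcmp(target, "bgwriter") == 0)
+    else if (target != NULL && strcmp(target, "bgwriter") == 0)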
-        msg.m_resettarget = RESET_BGWRITER;
-    else
+    {
+        bgwriter_pending = true;
+    }
+    else if (target != NULL)
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
+    pgstat_pending_resetsharedcounter = true;
+
+    if (db_stats == NULL)
+        return;
+
+    /* Reset the archiver statistics for the cluster. */
+    if (archiver_pending && LWLockConditionalAcquire(StatsLock, LW_EXCLUSIVE))
+    {
+        memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
+        archiver_pending = false;
+        have_lock = true;
+    }
+
+    if (bgwriter_pending &&
+        (have_lock || LWLockConditionalAcquire(StatsLock, LW_EXCLUSIVE)))
+    {
+        /* Reset the global background writer statistics for the cluster. */
+        memset(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+        bgwriter_pending = false;
+        have_lock = true;
+    }
+
+    if (have_lock)
+        LWLockRelease(StatsLock);
+
+    /* remember whether any reset is still pending */
+    pgstat_pending_resetsharedcounter = (archiver_pending || bgwriter_pending);
 }
 
 /* ----------
@@ -1356,17 +1111,37 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* Don't defer */
+
+    if (db_stats == NULL)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_TABLE_WRITE, NULL);
 
-    pgstat_send(&msg, sizeof(msg));
+    if (!dbentry)
+        return;
+
+    /* Set the reset timestamp for the whole database */
+    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
+
+    /* Remove object if it exists, ignore it if not */
+    if (type == RESET_TABLE)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_delete_key(t, (void *) &objoid);
+    }
+
+    if (type == RESET_FUNCTION)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_delete_key(t, (void *) &objoid);
+    }
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -1380,16 +1155,23 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* Don't defer */
+
+    if (db_stats == NULL)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    /*
+     * Store the last autovacuum time in the database's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid,
+                                  PGSTAT_TABLE_WRITE | PGSTAT_TABLE_CREATE,
+                                  NULL);
 
-    pgstat_send(&msg, sizeof(msg));
+    dbentry->last_autovac_time = GetCurrentTimestamp();
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 
@@ -1403,19 +1185,43 @@ void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* Don't defer */
+
+    if (db_stats == NULL || !pgstat_track_counts)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid,
+                                  PGSTAT_TABLE_WRITE | PGSTAT_TABLE_CREATE,
+                                  NULL);
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, tableoid, true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->vacuum_count++;
+    }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* --------
@@ -1432,9 +1238,14 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* Don't defer */
+
+    if (db_stats == NULL || !pgstat_track_counts)
         return;
 
     /*
@@ -1463,15 +1274,42 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid,
+                                  PGSTAT_TABLE_WRITE | PGSTAT_TABLE_CREATE,
+                                  NULL);
+
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, RelationGetRelid(rel), true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* --------
@@ -1483,15 +1321,67 @@ pgstat_report_analyze(Relation rel,
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    static int pending_conflict_tablespace = 0;
+    static int pending_conflict_lock = 0;
+    static int pending_conflict_snapshot = 0;
+    static int pending_conflict_bufferpin = 0;
+    static int pending_conflict_startup_deadlock = 0;
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    if (db_stats == NULL || !pgstat_track_counts)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            pending_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            pending_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            pending_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            pending_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            pending_conflict_startup_deadlock++;
+            break;
+    }
+    pgstat_pending_recoveryconflict = true;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_TABLE_WRITE | PGSTAT_TABLE_CREATE |
+                                  PGSTAT_TABLE_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_TABLE_LOCK_FAILED)
+        return;
+
+    dbentry->n_conflict_tablespace    += pending_conflict_tablespace;
+    dbentry->n_conflict_lock         += pending_conflict_lock;
+    dbentry->n_conflict_snapshot    += pending_conflict_snapshot;
+    dbentry->n_conflict_bufferpin    += pending_conflict_bufferpin;
+    dbentry->n_conflict_startup_deadlock += pending_conflict_startup_deadlock;
+
+    pending_conflict_tablespace = 0;
+    pending_conflict_lock = 0;
+    pending_conflict_snapshot = 0;
+    pending_conflict_bufferpin = 0;
+    pending_conflict_startup_deadlock = 0;
+
+    dshash_release_lock(db_stats, dbentry);
+
+    pgstat_pending_recoveryconflict = false;
 }
 
 /* --------
@@ -1501,16 +1391,31 @@ pgstat_report_recovery_conflict(int reason)
  * --------
  */
 void
-pgstat_report_deadlock(void)
+pgstat_report_deadlock(bool pending)
 {
-    PgStat_MsgDeadlock msg;
+    static int pending_deadlocks = 0;
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    if (db_stats == NULL || !pgstat_track_counts)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    pending_deadlocks++;
+    pgstat_pending_deadlock = true;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_TABLE_WRITE | PGSTAT_TABLE_CREATE |
+                                  PGSTAT_TABLE_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_TABLE_LOCK_FAILED)
+        return;
+
+    dbentry->n_deadlocks += pending_deadlocks;
+    pending_deadlocks = 0;
+    pgstat_pending_deadlock = false;
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* --------
@@ -1522,34 +1427,36 @@ pgstat_report_deadlock(void)
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    static size_t pending_filesize = 0;
+    static size_t pending_files = 0;
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    if (filesize > 0) /* Isn't there a case where filesize is really 0? */
+    {
+        pending_filesize += filesize; /* XXX: needs an overflow check */
+        pending_files++;
+    }
+    pgstat_pending_tempfile = true;
+
+    if (db_stats == NULL || !pgstat_track_counts)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
-}
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_TABLE_WRITE | PGSTAT_TABLE_CREATE |
+                                  PGSTAT_TABLE_NOWAIT,
+                                  &status);
 
-
-/* ----------
- * pgstat_ping() -
- *
- *    Send some junk data to the collector to increase traffic.
- * ----------
- */
-void
-pgstat_ping(void)
-{
-    PgStat_MsgDummy msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (status == PGSTAT_TABLE_LOCK_FAILED)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dbentry->n_temp_bytes += pending_filesize;
+    dbentry->n_temp_files += pending_files;
+    pending_filesize = 0;
+    pending_files = 0;
+    pgstat_pending_tempfile = false;
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 
@@ -1670,7 +1577,7 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
     INSTR_TIME_ADD(fs->f_self_time, f_self);
 
     /* indicate that we have something to send */
-    have_function_stats = true;
+    pgstat_pending_funcstats = true;
 }
 
 
@@ -1691,6 +1598,7 @@ pgstat_initstats(Relation rel)
 {
     Oid            rel_id = rel->rd_id;
     char        relkind = rel->rd_rel->relkind;
+    MemoryContext oldcontext;
 
     /* We only count stats for things that have storage */
     if (!(relkind == RELKIND_RELATION ||
@@ -1703,7 +1611,18 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* Standalone processes have no shared stats area to attach to */
+    if (!IsUnderPostmaster)
+        return;
+
+    /* Attached shared memory lives for the process lifetime */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    while (!pgstat_attach_shared_stats())
+        pg_usleep(1000000L);
+
+    MemoryContextSwitchTo(oldcontext);
+
+    if (db_stats == NULL || !pgstat_track_counts)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -2426,7 +2345,7 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     PgStat_StatFuncEntry *funcentry = NULL;
 
     /* Lookup our database, then find the requested function */
-    dbentry = pgstat_get_db_entry(MyDatabaseId, false);
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_TABLE_READ, NULL);
     if (dbentry == NULL)
         return NULL;
 
@@ -2434,6 +2353,7 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     if (funcentry == NULL)
         return NULL;
 
+    dshash_release_lock(db_stats, dbentry);
     return funcentry;
 }
 
@@ -2721,7 +2641,7 @@ pgstat_initialize(void)
     }
 
     /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -4105,49 +4025,6 @@ pgstat_get_backend_desc(BackendType backendType)
  * ------------------------------------------------------------
  */
 
-
-/* ----------
- * pgstat_setheader() -
- *
- *        Set common header fields in a statistics message
- * ----------
- */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
-    {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
 /* ----------
  * pgstat_send_archiver() -
  *
@@ -4158,16 +4035,22 @@ pgstat_send(void *msg, int len)
 void
 pgstat_send_archiver(const char *xlog, bool failed)
 {
-    PgStat_MsgArchiver msg;
-
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+    if (failed)
+    {
+        /* Failed archival attempt */
+        ++shared_archiverStats->failed_count;
+        strlcpy(shared_archiverStats->last_failed_wal, xlog,
+                sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = GetCurrentTimestamp();
+    }
+    else
+    {
+        /* Successful archival operation */
+        ++shared_archiverStats->archived_count;
+        strlcpy(shared_archiverStats->last_archived_wal, xlog,
+                sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = GetCurrentTimestamp();
+    }
 }
 
 /* ----------
@@ -4180,21 +4063,36 @@ void
 pgstat_send_bgwriter(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+
+    PgStat_BgWriter *s = &BgWriterStats;
+    MemoryContext oldcontext;
 
     /*
      * This function can be called even if nothing at all has happened. In
      * this case, avoid sending a completely empty message to the stats
      * collector.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    /* Attached shared memory lives for the process lifetime */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    while (!pgstat_attach_shared_stats())
+        pg_usleep(1000000L);
+
+    MemoryContextSwitchTo(oldcontext);
+
+    shared_globalStats->timed_checkpoints += s->timed_checkpoints;
+    shared_globalStats->requested_checkpoints += s->requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += s->checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += s->checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += s->buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += s->buf_written_clean;
+    shared_globalStats->maxwritten_clean += s->maxwritten_clean;
+    shared_globalStats->buf_written_backend += s->buf_written_backend;
+    shared_globalStats->buf_fsync_backend += s->buf_fsync_backend;
+    shared_globalStats->buf_alloc += s->buf_alloc;
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4215,8 +4113,6 @@ pgstat_send_bgwriter(void)
 void
 PgstatCollectorMain(void)
 {
-    int            len;
-    PgStat_Msg    msg;
     int            wr;
 
     /*
@@ -4246,20 +4142,6 @@ PgstatCollectorMain(void)
     pgStatRunningInCollector = true;
     pgstat_read_statsfiles();
 
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do recognize got_SIGHUP inside
-     * the inner loop, which means that such interrupts will get serviced but
-     * the latch won't get cleared until next time there is a break in the
-     * action.
-     */
     for (;;)
     {
         /* Clear any already-pending wakeups */
@@ -4272,164 +4154,16 @@ PgstatCollectorMain(void)
             break;
 
         /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * need_exit becomes set.
+         * Reload configuration if we got SIGHUP from the postmaster.
          */
-        while (!got_SIGTERM)
+        if (got_SIGHUP)
         {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (got_SIGHUP)
-            {
-                got_SIGHUP = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat((PgStat_MsgTabstat *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge((PgStat_MsgTabpurge *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb((PgStat_MsgDropdb *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter((PgStat_MsgResetcounter *) &msg,
-                                             len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(
-                                                   (PgStat_MsgResetsharedcounter *) &msg,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(
-                                                   (PgStat_MsgResetsinglecounter *) &msg,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac((PgStat_MsgAutovacStart *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum((PgStat_MsgVacuum *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze((PgStat_MsgAnalyze *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver((PgStat_MsgArchiver *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter((PgStat_MsgBgWriter *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat((PgStat_MsgFuncstat *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge((PgStat_MsgFuncpurge *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict((PgStat_MsgRecoveryConflict *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock((PgStat_MsgDeadlock *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile((PgStat_MsgTempFile *) &msg, len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
+            got_SIGHUP = false;
+            ProcessConfigFile(PGC_SIGHUP);
+        }
+
+        wr = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1L,
+                       WAIT_EVENT_PGSTAT_MAIN);
 
         /*
          * Emergency bailout if postmaster has died.  This is to avoid the
@@ -4437,8 +4171,8 @@ PgstatCollectorMain(void)
          */
         if (wr & WL_POSTMASTER_DEATH)
             break;
-    }                            /* end of outer loop */
-
+    }
+
     /*
      * Save the final stats to reuse at next startup.
      */
@@ -4552,29 +4286,62 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
  * Else, return NULL.
  */
 static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
+pgstat_get_db_entry(Oid databaseid, int op, pg_stat_table_result_status *status)
 {
     PgStat_StatDBEntry *result;
-    bool        found;
+    bool        nowait = ((op & PGSTAT_TABLE_NOWAIT) != 0);
+    bool        lock_acquired = true;
+    bool        found = true;
+    MemoryContext oldcontext;
 
-    Assert(pgStatRunningInCollector);
+    /* Attached shared memory lives for the process lifetime */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    if (!pgstat_attach_shared_stats())
+    {
+        MemoryContextSwitchTo(oldcontext);
+        return NULL;
+    }
+    MemoryContextSwitchTo(oldcontext);
 
     /* Lookup or create the hash table entry for this database */
-    if (create)
+    if (op & PGSTAT_TABLE_CREATE)
+    {
         result = (PgStat_StatDBEntry *)
-            dshash_find_or_insert(db_stats,    &databaseid, &found);
+            dshash_find_or_insert_extended(db_stats, &databaseid,
+                                           &found, nowait);
+        if (result == NULL)
+            lock_acquired = false;
+        else if (!found)
+        {
+            /*
+             * If not found, initialize the new one.  This creates empty hash
+             * tables for tables and functions, too.
+             */
+            reset_dbentry_counters(result);
+        }
+    }
     else
-        result = (PgStat_StatDBEntry *)    dshash_find(db_stats, &databaseid, true);
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_extended(db_stats, &databaseid,
+                                 &lock_acquired, true, nowait);
+        if (result == NULL)
+            found = false;
+    }
 
-    if (!create)
-        return result;
-
-    /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
-     */
-    if (!found)
-        reset_dbentry_counters(result);
+    /* Set return status if requested */
+    if (status)
+    {
+        if (!lock_acquired)
+        {
+            Assert(nowait);
+            *status = PGSTAT_TABLE_LOCK_FAILED;
+        }
+        else if (!found)
+            *status = PGSTAT_TABLE_NOT_FOUND;
+        else
+            *status = PGSTAT_TABLE_FOUND;
+    }
 
     return result;
 }
@@ -5646,108 +5413,124 @@ pgstat_clear_snapshot(void)
  *    Count what the backend has done.
  * ----------
  */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
+static dshash_table *
+pgstat_update_dbentry(Oid dboid)
 {
-    dshash_table *tabhash;
     PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
+    pg_stat_table_result_status status;
+    dshash_table *tabhash;
 
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
+    dbentry = pgstat_get_db_entry(dboid,
+                                  PGSTAT_TABLE_WRITE
+                                  | PGSTAT_TABLE_CREATE
+                                  | PGSTAT_TABLE_NOWAIT,
+                                  &status);
+
+    /* return if the lock was not acquired or shared stats are unavailable */
+    if (dbentry == NULL)
+        return NULL;
 
     tabhash = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
+
+    if (OidIsValid(dboid))
     {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *)
-            dshash_find_or_insert(tabhash, (void *) &(tabmsg->t_id), &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-        dshash_release_lock(tabhash, tabentry);
-
         /*
-         * Add per-table stats to the per-database entry, too.
+         * Update database-wide stats.
          */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
+        dbentry->n_xact_commit += pgStatXactCommit;
+        dbentry->n_xact_rollback += pgStatXactRollback;
+        dbentry->n_block_read_time += pgStatBlockReadTime;
+        dbentry->n_block_write_time += pgStatBlockWriteTime;
     }
 
+    pgStatXactCommit = 0;
+    pgStatXactRollback = 0;
+    pgStatBlockReadTime = 0;
+    pgStatBlockWriteTime = 0;
+
     dshash_release_lock(db_stats, dbentry);
+
+    return tabhash;
+}
+
+static bool
+pgstat_update_tabentry(dshash_table *tabhash, PgStat_TableStatus *stat)
+{
+    PgStat_StatTabEntry *tabentry;
+    bool        found;
+
+    if (tabhash == NULL)
+        return false;
+
+    tabentry = (PgStat_StatTabEntry *)
+        dshash_find_or_insert_extended(tabhash, (void *) &(stat->t_id),
+                                       &found, true);
+
+    /* failed to acquire lock */
+    if (tabentry == NULL)
+        return false;
+
+    if (!found)
+    {
+        /*
+         * If it's a new table entry, initialize counters to the values we
+         * just got.
+         */
+        tabentry->numscans = stat->t_counts.t_numscans;
+        tabentry->tuples_returned = stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched = stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted = stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated = stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted = stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated = stat->t_counts.t_tuples_hot_updated;
+        tabentry->n_live_tuples = stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples = stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze = stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched = stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit = stat->t_counts.t_blocks_hit;
+
+        tabentry->vacuum_timestamp = 0;
+        tabentry->vacuum_count = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_count = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->analyze_count = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+        tabentry->autovac_analyze_count = 0;
+    }
+    else
+    {
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        tabentry->numscans += stat->t_counts.t_numscans;
+        tabentry->tuples_returned += stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched += stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted += stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated += stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted += stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated += stat->t_counts.t_tuples_hot_updated;
+        /* If table was truncated, first reset the live/dead counters */
+        if (stat->t_counts.t_truncated)
+        {
+            tabentry->n_live_tuples = 0;
+            tabentry->n_dead_tuples = 0;
+        }
+        tabentry->n_live_tuples += stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples += stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze += stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched += stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit += stat->t_counts.t_blocks_hit;
+    }
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
+
+    dshash_release_lock(tabhash, tabentry);
+
+    return true;
 }
 
 
@@ -5757,14 +5540,15 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
  *    Arrange for dead table removal.
  * ----------
  */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
+static bool
+pgstat_tabpurge(Oid dboid, Oid taboid)
 {
     dshash_table *tbl;
     PgStat_StatDBEntry *dbentry;
-    int            i;
 
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
+    /* wait for lock */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_TABLE_WRITE, NULL);
+
     /*
      * No need to purge if we don't even know the database.
      */
@@ -5772,459 +5556,67 @@ pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
     {
         if (dbentry)
             dshash_release_lock(db_stats, dbentry);
-        return;
+        return true;
     }
 
     tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) dshash_delete_key(tbl, (void *) &(msg->m_tableid[i]));
-    }
+
+    /* Remove from hashtable if present; we don't care if it's not. */
+    (void) dshash_delete_key(tbl, (void *) &taboid);
 
     dshash_release_lock(db_stats, dbentry);
 
+    return true;
 }
 
 
 /* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        if (dbentry->tables != DSM_HANDLE_INVALID)
-        {
-            dshash_table *tbl =
-                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
-            dshash_destroy(tbl);
-        }
-        if (dbentry->functions != DSM_HANDLE_INVALID)
-        {
-            dshash_table *tbl =
-                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
-            dshash_destroy(tbl);
-        }
-
-        dshash_delete_entry(db_stats, (void *)dbentry);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != DSM_HANDLE_INVALID)
-    {
-        dshash_table *t =
-            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
-        dshash_destroy(t);
-        dbentry->tables = DSM_HANDLE_INVALID;
-    }
-    if (dbentry->functions != DSM_HANDLE_INVALID)
-    {
-        dshash_table *t =
-            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
-        dshash_destroy(t);
-        dbentry->functions = DSM_HANDLE_INVALID;
-    }
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-
-    dshash_release_lock(db_stats, dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetshared() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&shared_globalStats, 0, sizeof(shared_globalStats));
-        shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&shared_archiverStats, 0, sizeof(shared_archiverStats));
-        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-    {
-        dshash_table *t =
-            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
-        dshash_delete_key(t, (void *) &(msg->m_objectid));
-    }
-    else if (msg->m_resettype == RESET_FUNCTION)
-    {
-        dshash_table *t =
-            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
-        dshash_delete_key(t, (void *) &(msg->m_objectid));
-    }
-
-    dshash_release_lock(db_stats, dbentry);
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-
-    dshash_release_lock(db_stats, dbentry);
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    dshash_table *table;
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
-    tabentry = pgstat_get_tab_entry(table, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-    dshash_release_lock(table, tabentry);
-    dshash_detach(table);
-    dshash_release_lock(db_stats, dbentry);
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    dshash_table *table;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
-    tabentry = pgstat_get_tab_entry(table, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-    dshash_release_lock(table, tabentry);
-    dshash_detach(table);
-    dshash_release_lock(db_stats, dbentry);
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++shared_archiverStats->failed_count;
-        memcpy(shared_archiverStats->last_failed_wal, msg->m_xlog,
-               sizeof(shared_archiverStats->last_failed_wal));
-        shared_archiverStats->last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++shared_archiverStats->archived_count;
-        memcpy(shared_archiverStats->last_archived_wal, msg->m_xlog,
-               sizeof(shared_archiverStats->last_archived_wal));
-        shared_archiverStats->last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    shared_globalStats->timed_checkpoints += msg->m_timed_checkpoints;
-    shared_globalStats->requested_checkpoints += msg->m_requested_checkpoints;
-    shared_globalStats->checkpoint_write_time += msg->m_checkpoint_write_time;
-    shared_globalStats->checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    shared_globalStats->buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    shared_globalStats->buf_written_clean += msg->m_buf_written_clean;
-    shared_globalStats->maxwritten_clean += msg->m_maxwritten_clean;
-    shared_globalStats->buf_written_backend += msg->m_buf_written_backend;
-    shared_globalStats->buf_fsync_backend += msg->m_buf_fsync_backend;
-    shared_globalStats->buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-
-    dshash_release_lock(db_stats, dbentry);
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-
-    dshash_release_lock(db_stats, dbentry);
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-
-    dshash_release_lock(db_stats, dbentry);
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
+ * pgstat_update_funcentry() -
  *
  *    Count what the backend has done.
  * ----------
  */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
+static bool
+pgstat_update_funcentry(dshash_table *funchash,
+                        PgStat_BackendFunctionEntry *stat)
 {
-    dshash_table *t;
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
     PgStat_StatFuncEntry *funcentry;
-    int            i;
     bool        found;
 
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
+    funcentry = (PgStat_StatFuncEntry *)
+        dshash_find_or_insert_extended(funchash, (void *) &(stat->f_id),
+                                       &found, true);
 
-    t = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
+    if (funcentry == NULL)
+        return false;
+
+    if (!found)
     {
-        funcentry = (PgStat_StatFuncEntry *)
-            dshash_find_or_insert(t, (void *) &(funcmsg->f_id), &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-        dshash_release_lock(t, funcentry);
+        /*
+         * If it's a new function entry, initialize counters to the values
+         * we just got.
+         */
+        funcentry->f_numcalls = stat->f_counts.f_numcalls;
+        funcentry->f_total_time =
+            INSTR_TIME_GET_MICROSEC(stat->f_counts.f_total_time);
+        funcentry->f_self_time =
+            INSTR_TIME_GET_MICROSEC(stat->f_counts.f_self_time);
+    }
+    else
+    {
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        funcentry->f_numcalls += stat->f_counts.f_numcalls;
+        funcentry->f_total_time +=
+            INSTR_TIME_GET_MICROSEC(stat->f_counts.f_total_time);
+        funcentry->f_self_time +=
+            INSTR_TIME_GET_MICROSEC(stat->f_counts.f_self_time);
     }
 
-    dshash_detach(t);
-    dshash_release_lock(db_stats, dbentry);
+    dshash_release_lock(funchash, funcentry);
+
+    return true;
 }
 
 /* ----------
@@ -6233,32 +5625,30 @@ pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
  *    Arrange for dead function removal.
  * ----------
  */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
+static bool
+pgstat_funcpurge(Oid dboid, Oid funcoid)
 {
     dshash_table *t;
     PgStat_StatDBEntry *dbentry;
-    int            i;
 
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
+    /* wait for lock */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_TABLE_WRITE, NULL);
 
     /*
      * No need to purge if we don't even know the database.
      */
     if (!dbentry || dbentry->functions == DSM_HANDLE_INVALID)
-        return;
+        return true;
 
     t = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        dshash_delete_key(t, (void *) &(msg->m_functionid[i]));
-    }
+
+    /* Remove from hashtable if present; we don't care if it's not. */
+    dshash_delete_key(t, (void *) &funcoid);
     dshash_detach(t);
+
     dshash_release_lock(db_stats, dbentry);
+
+    return true;
 }
 
 /*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 01eabe5706..e794a81c4c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1984,7 +1984,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                BgWriterStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2092,7 +2092,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2282,7 +2282,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2290,7 +2290,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
 
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index aeaf1f3ab4..2b64d313b9 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -1130,7 +1130,7 @@ DeadLockReport(void)
                          pgstat_get_backend_current_activity(info->pid, false));
     }
 
-    pgstat_report_deadlock();
+    pgstat_report_deadlock(false);
 
     ereport(ERROR,
             (errcode(ERRCODE_T_R_DEADLOCK_DETECTED),
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index e95e347184..6112e04820 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -33,7 +33,7 @@
 #define UINT32_ACCESS_ONCE(var)         ((uint32)(*((volatile uint32 *)&(var))))
 
 /* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
+extern PgStat_BgWriter bgwriterStats;
 
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index afc1927250..ff97d6ab6e 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -39,31 +39,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -112,13 +87,6 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_blocks_hit;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -177,242 +145,23 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
 /* ----------
- * PgStat_MsgHdr                The common message header
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgHdr
+typedef struct PgStat_BgWriter
 {
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter buf_alloc;
+    PgStat_Counter checkpoint_write_time; /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+} PgStat_BgWriter;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
@@ -453,78 +202,6 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgAutovacStart msg_autovacuum;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
  * Statistic collector data structures follow
  *
@@ -1111,7 +788,7 @@ extern char *pgstat_stat_filename;
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1135,8 +812,6 @@ extern void allow_immediate_pgstat_restart(void);
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
 extern void pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
@@ -1154,7 +829,7 @@ extern void pgstat_report_analyze(Relation rel,
                       bool resetcounter);
 
 extern void pgstat_report_recovery_conflict(int reason);
-extern void pgstat_report_deadlock(void);
+extern void pgstat_report_deadlock(bool pending);
 
 extern void pgstat_initialize(void);
 extern void pgstat_bestart(void);
@@ -1328,4 +1003,6 @@ extern void PgstatCollectorMain(void) pg_attribute_noreturn();
 extern Size StatsShmemSize(void);
 extern void StatsShmemInit(void);
 
+extern void pgstat_cleanup_pending_stat(void);
+
 #endif                            /* PGSTAT_H */
-- 
2.16.3


Re: shared-memory based stats collector

From
Kyotaro HORIGUCHI
Date:
Hello.

At Tue, 02 Oct 2018 16:06:51 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20181002.160651.117284090.horiguchi.kyotaro@lab.ntt.co.jp>
> It neither works nor even compiles, since I failed to include some
> changes. The attached v6-0008 at least compiles and works.
> 
> 0001-0007 are not attached since they still apply to the
> master head with offsets.

In this patchset 0001-0007 are still the same as in the previous
version. I'll reorganize the whole patchset in the next version.

This is a saner version of the previous v5-0008, which didn't pass
the regression tests. v6-0008 to v6-0010 are attached; they apply
on top of v5-0001 to v5-0007.

- The stats collector has been removed.

- dshash has been modified further so that deletion is allowed
  during a sequential scan.

- I'm not sure about the following existing comment at the
  beginning of pgstat.c:

  *    - Add a pgstat config column to pg_database, so this
  *      entire thing can be enabled/disabled on a per db basis.
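
To illustrate the second point above (deleting entries while a sequential scan is in progress), here is a minimal single-process sketch. It is not the dshash code from the patch: `Table`, `SeqScan`, `seq_next` and `seq_delete_current` are hypothetical names, and partition locking is left out entirely. The idea shown is only that the scan keeps the address of the link pointing at the current entry, so the entry can be unlinked and freed without losing the scan position.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define NBUCKETS 4

typedef struct Entry
{
    unsigned      key;
    struct Entry *next;
} Entry;

typedef struct Table
{
    Entry *buckets[NBUCKETS];
} Table;

/* Scan state (hypothetical; dshash keeps different bookkeeping). */
typedef struct SeqScan
{
    Table  *table;
    int     bucket;   /* index of the bucket being scanned, -1 before start */
    Entry  *curr;     /* entry returned last, NULL if none or deleted */
    Entry **link;     /* address of the pointer that refers to curr */
} SeqScan;

/* Push a new entry at the head of its bucket chain. */
static bool
table_insert(Table *t, unsigned key)
{
    Entry *e = malloc(sizeof(Entry));

    if (e == NULL)
        return false;
    e->key = key;
    e->next = t->buckets[key % NBUCKETS];
    t->buckets[key % NBUCKETS] = e;
    return true;
}

static void
seq_init(SeqScan *s, Table *t)
{
    s->table = t;
    s->bucket = -1;
    s->curr = NULL;
    s->link = NULL;
}

/* Return the next entry, or NULL when the scan is finished. */
static Entry *
seq_next(SeqScan *s)
{
    /* If the previous entry was kept, step past it. */
    if (s->curr != NULL)
        s->link = &s->curr->next;
    s->curr = NULL;

    for (;;)
    {
        if (s->link != NULL && *s->link != NULL)
        {
            s->curr = *s->link;
            return s->curr;
        }
        if (++s->bucket >= NBUCKETS)
            return NULL;
        s->link = &s->table->buckets[s->bucket];
    }
}

/* Unlink and free the entry returned by the last seq_next() call. */
static void
seq_delete_current(SeqScan *s)
{
    Entry *victim = s->curr;

    if (victim == NULL)
        return;
    *s->link = victim->next;    /* link now refers to the successor */
    free(victim);
    s->curr = NULL;             /* tells seq_next() not to step again */
}
```

After `seq_delete_current()` the next `seq_next()` call continues from the successor of the deleted entry, so a loop can drop any subset of entries in a single pass.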


Some points known to need further consideration:

1. Concurrency is controlled by the per-database entry in the
   db_stats dshash. The dshash has 127 lock partitions, but all
   backends on the same database share just one lock, and only one
   backend at a time gets the right to update stats. (As with the
   current stats collector, no backend updates stats at intervals
   shorter than 500ms.)  Table stats can be removed by DROP DB
   simultaneously with stats updates, so that needs to be blocked
   using the per-database lock. Some locking means other than
   dshash might be needed.

2. Since dshash cannot hold multiple locks at once because of
   resizing, pgstat_update_stat is forced to be a bit inefficient.
   It loops over the stats list twice, once for shared tables and
   once for regular tables, since we can hold the lock on only one
   database at a time.

   Maybe providing a separate TabStatusArray for each of the two
   will fix it; I will do that in the next version.

3. This adds a new timeout, IDLE_STATS_UPDATE_TIMEOUT. It works
   similarly to IDLE_IN_TRANSACTION_SESSION_TIMEOUT. It fires
   after at most PGSTAT_STAT_MIN_INTERVAL (500) ms to clean up
   pending statistics updates.
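
The interplay of points 1 and 3 can be sketched roughly as follows. This is an illustration under assumptions, not the patch's code: `PendingStats`, `SharedStats` and `flush_pending` are hypothetical names, the lock is simulated by a plain flag, and the `timeout_armed` flag only stands in for arming something like IDLE_STATS_UPDATE_TIMEOUT. A backend keeps counts locally, flushes them no more often than every 500ms, and leaves them pending when the per-database lock is busy.

```c
#include <stdbool.h>
#include <stdint.h>

#define MIN_FLUSH_INTERVAL_MS 500   /* assumed to mirror PGSTAT_STAT_MIN_INTERVAL */

/* Hypothetical per-backend state. */
typedef struct PendingStats
{
    int64_t pending_calls;   /* counts accumulated since the last flush */
    int64_t last_flush_ms;   /* time of the last successful flush */
    bool    timeout_armed;   /* stand-in for an armed idle-update timeout */
} PendingStats;

/* Hypothetical shared counter guarded by one per-database lock. */
typedef struct SharedStats
{
    bool    locked;          /* true while another backend holds the lock */
    int64_t total_calls;
} SharedStats;

/*
 * Try to move locally accumulated counts into the shared counter.
 * Returns true if nothing is left pending, false if the counts stay
 * local (too soon, or lock busy) and a later retry has been requested.
 */
static bool
flush_pending(PendingStats *local, SharedStats *shared,
              int64_t now_ms, bool force)
{
    if (local->pending_calls == 0)
        return true;

    /* Rate limit: skip unless the minimum interval has passed. */
    if (!force && now_ms - local->last_flush_ms < MIN_FLUSH_INTERVAL_MS)
    {
        local->timeout_armed = true;
        return false;
    }

    /* Conditional acquire: never wait for the lock, retry later instead. */
    if (shared->locked)
    {
        local->timeout_armed = true;
        return false;
    }

    shared->locked = true;
    shared->total_calls += local->pending_calls;
    shared->locked = false;

    local->pending_calls = 0;
    local->last_flush_ms = now_ms;
    local->timeout_armed = false;
    return true;
}
```

When the timeout fires in an otherwise idle backend, the handler would call the same routine again, so pending counts reach shared memory even if the backend runs no further transactions.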

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From e3d415d4a3d691d2bd7864fdb1ec088d445209e5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 5 Oct 2018 16:57:40 +0900
Subject: [PATCH 10/10] Remove stats collector process

Since stats are now updated directly by every backend, the stats
collector is no longer useful. Remove it.
---
 src/backend/bootstrap/bootstrap.c   |   8 --
 src/backend/postmaster/pgstat.c     | 205 ++++--------------------------------
 src/backend/postmaster/postmaster.c |  78 ++------------
 src/backend/storage/ipc/dsm.c       |  15 ++-
 src/include/pgstat.h                |   7 +-
 src/include/storage/dsm.h           |   1 +
 6 files changed, 42 insertions(+), 272 deletions(-)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index ece200877c..578af2e66d 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -336,9 +336,6 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case WalReceiverProcess:
                 statmsg = pgstat_get_backend_desc(B_WAL_RECEIVER);
                 break;
-            case StatsCollectorProcess:
-                statmsg = pgstat_get_backend_desc(B_STATS_COLLECTOR);
-                break;
             default:
                 statmsg = "??? process";
                 break;
@@ -473,11 +470,6 @@ AuxiliaryProcessMain(int argc, char *argv[])
             WalReceiverMain();
             proc_exit(1);        /* should never return */
 
-        case StatsCollectorProcess:
-            /* don't set signals, stats collector has its own agenda */
-            PgstatCollectorMain();
-            proc_exit(1);        /* should never return */
-
         default:
             elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
             proc_exit(1);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index cef3097da9..c8818af5b9 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -129,8 +129,6 @@ PgStat_BgWriter BgWriterStats;
  * Local data
  * ----------
  */
-static time_t last_pgstat_start_time;
-
 /* Shared stats bootstrap infomation */
 typedef struct StatsShmemStruct {
     dsa_handle stats_dsa_handle;
@@ -286,11 +284,6 @@ static PgStat_ArchiverStats *snapshot_archiverStats;
 static PgStat_GlobalStats *shared_globalStats;
 static PgStat_GlobalStats *snapshot_globalStats;
 
-/* Signal handler flags */
-static volatile bool need_exit = false;
-static volatile bool got_SIGHUP = false;
-static volatile bool got_SIGTERM = false;
-
 /*
  * Total time charged to functions so far in the current backend.
  * We use this to help separate "self" and "other" time charges.
@@ -303,14 +296,8 @@ static instr_time total_func_time;
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-static void pgstat_shutdown_handler(SIGNAL_ARGS);
-static void pgstat_quickdie_handler(SIGNAL_ARGS);
+/* functions used in backends */
 static void pgstat_beshutdown_hook(int code, Datum arg);
-static void pgstat_sighup_handler(SIGNAL_ARGS);
 
 static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, int op,
                                     pg_stat_table_result_status *status);
@@ -357,27 +344,25 @@ static bool pgstat_update_tabentry(dshash_table *tabhash,
 static void pgstat_update_dbentry(PgStat_StatDBEntry *dbentry,
                                   PgStat_TableStatus *stat);
 
-/* file in/out functions */
-static void pgstat_read_statsfiles(void);
-static void pgstat_write_statsfiles(void);
-
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
 /* ----------
- * pgstat_init() -
+ * pgstat_child_init() -
  *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
+ *    Called from InitPostmasterChild at process startup. Initialize static
+ *    variables.
  * ----------
  */
 void
-pgstat_init(void)
+pgstat_child_init(void)
 {
+    StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+    area = NULL;
+
+    return;
 }
 
 /*
@@ -437,12 +422,6 @@ pgstat_reset_all(void)
     pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
 }
 
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
-}
-
 /* ----------
  * pgstat_attach_shared_stats() -
  *
@@ -4384,154 +4363,6 @@ pgstat_update_bgwriter(void)
     MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
 }
 
-
-/* ----------
- * PgstatCollectorMain() -
- *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-void
-PgstatCollectorMain(void)
-{
-    int            wr;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, pgstat_sighup_handler);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, pgstat_shutdown_handler);
-    pqsignal(SIGQUIT, pgstat_quickdie_handler);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    pqsignal(SIGCHLD, SIG_DFL);
-    pqsignal(SIGTTIN, SIG_DFL);
-    pqsignal(SIGTTOU, SIG_DFL);
-    pqsignal(SIGCONT, SIG_DFL);
-    pqsignal(SIGWINCH, SIG_DFL);
-
-    PG_SETMASK(&UnBlockSig);
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgstat_read_statsfiles();
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do recognize got_SIGHUP inside
-     * the inner loop, which means that such interrupts will get serviced but
-     * the latch won't get cleared until next time there is a break in the
-     * action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (got_SIGTERM)
-            break;
-
-        /*
-         * Reload configuration if we got SIGHUP from the postmaster.
-         */
-        if (got_SIGHUP)
-        {
-            got_SIGHUP = false;
-            ProcessConfigFile(PGC_SIGHUP);
-        }
-
-        wr = WaitLatch(MyLatch,
-                       WL_LATCH_SET | WL_POSTMASTER_DEATH,
-                       -1L, WAIT_EVENT_PGSTAT_MAIN);
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr & WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles();
-
-    exit(0);
-}
-
-
-/* SIGQUIT signal handler for collector process */
-static void
-pgstat_quickdie_handler(SIGNAL_ARGS)
-{
-    PG_SETMASK(&BlockSig);
-
-    /*
-     * We DO NOT want to run proc_exit() callbacks -- we're here because
-     * shared memory may be corrupted, so we don't want to try to clean up our
-     * transaction.  Just nail the windows shut and get out of town.  Now that
-     * there's an atexit callback to prevent third-party code from breaking
-     * things by calling exit() directly, we have to reset the callbacks
-     * explicitly to make this work as intended.
-     */
-    on_exit_reset();
-
-    /*
-     * Note we do exit(2) not exit(0).  This is to force the postmaster into a
-     * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
-     * backend.  This is necessary precisely because we don't clean up our
-     * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
-     * should ensure the postmaster sees this as a crash, too, but no harm in
-     * being doubly sure.)
-     */
-    exit(2);
-}
-
-/* SIGHUP handler for collector process */
-static void
-pgstat_sighup_handler(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    got_SIGHUP = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
-static void
-pgstat_shutdown_handler(SIGNAL_ARGS)
-{
-    int save_errno = errno;
-
-    got_SIGTERM = true;
-
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
 /*
  * Subroutine to reset stats in a shared database entry
  *
@@ -4700,7 +4531,7 @@ pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create)
  *        Write the global statistics file, as well as DB files.
  * ----------
  */
-static void
+void
 pgstat_write_statsfiles(void)
 {
     dshash_seq_status hstat;
@@ -4711,8 +4542,8 @@ pgstat_write_statsfiles(void)
     const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
 
-    /* should be called in stats collector  */
-    Assert(pgStatRunningInCollector);
+    /* should be called from postmaster  */
+    Assert(!IsUnderPostmaster);
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -4855,8 +4686,6 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry)
     char        statfile[MAXPGPATH];
     dshash_table *tbl;
 
-    Assert(pgStatRunningInCollector);
-
     get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
     get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
 
@@ -4950,7 +4779,7 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry)
  *
  * ----------
  */
-static void
+void
 pgstat_read_statsfiles(void)
 {
     PgStat_StatDBEntry *dbentry;
@@ -4962,7 +4791,9 @@ pgstat_read_statsfiles(void)
     dshash_table *tblstats = NULL;
     dshash_table *funcstats = NULL;
 
-    Assert(pgStatRunningInCollector);
+    /* should be called from postmaster  */
+    Assert(!IsUnderPostmaster);
+
     /*
      * The tables will live in pgStatLocalContext.
      */
@@ -5173,7 +5004,9 @@ pgstat_read_db_statsfile(Oid databaseid,
     bool        found;
     char        statfile[MAXPGPATH];
 
-    Assert(pgStatRunningInCollector);
+    /* should be called from postmaster  */
+    Assert(!IsUnderPostmaster);
+
     get_dbstat_filename(false, databaseid, statfile, MAXPGPATH);
 
     /*
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 798df0767d..edf6e64e25 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -145,8 +145,7 @@
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
 #define BACKEND_TYPE_BGWORKER    0x0008    /* bgworker process */
-#define BACKEND_TYPE_STATS        0x0010    /* bgworker process */
-#define BACKEND_TYPE_ALL        0x001F    /* OR of all the above */
+#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -254,7 +253,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -552,7 +550,6 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
 #define StartWalReceiver()        StartChildProcess(WalReceiverProcess)
-#define StartStatsCollector()    StartChildProcess(StatsCollectorProcess)
 
 /* Macros to check exit status of a child process */
 #define EXIT_STATUS_0(st)  ((st) == 0)
@@ -1304,12 +1301,6 @@ PostmasterMain(int argc, char *argv[])
 
     whereToSendOutput = DestNone;
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1381,6 +1372,9 @@ PostmasterMain(int argc, char *argv[])
     /* Some workers may be scheduled to start now */
     maybe_start_bgworkers();
 
+    /* Activate stats collector facility */
+    pgstat_read_statsfiles();
+
     status = ServerLoop();
 
     /*
@@ -1762,11 +1756,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = StartStatsCollector();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = pgarch_start();
@@ -2565,8 +2554,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -2897,8 +2884,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = pgarch_start();
-            if (PgStatPID == 0)
-                PgStatPID = StartStatsCollector();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -2970,8 +2955,7 @@ reaper(SIGNAL_ARGS)
                  * We can also shut down the stats collector now; there's
                  * nothing left for it to do.
                  */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGTERM);
+                pgstat_write_statsfiles();
             }
             else
             {
@@ -3048,22 +3032,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                HandleChildCrash(pid, exitstatus,
-                                 _("statistics collector process"));
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = StartStatsCollector();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3510,22 +3478,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3729,8 +3681,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3769,8 +3719,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -3970,8 +3919,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -4969,16 +4916,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         PgArchiverMain(argc, argv); /* does not return */
     }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Need a PGPROC to run CreateSharedMemoryAndSemaphores */
-        InitProcess();
-
-        /* Attach process to shared data structures */
-        CreateSharedMemoryAndSemaphores(false, 0);
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5094,8 +5031,7 @@ sigusr1_handler(SIGNAL_ARGS)
         /*
          * Likewise, start other special children as needed.
          */
-        Assert(PgStatPID == 0);
-        PgStatPID = StartStatsCollector();
+        pgstat_read_statsfiles();
 
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index ff0ad4cce3..32508353b7 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -432,6 +432,15 @@ dsm_set_control_handle(dsm_handle h)
 }
 #endif
 
+/*
+ * Return true if the DSM facility is available in this process.
+ */
+bool
+dsm_is_available(void)
+{
+    return dsm_control != NULL;
+}
+
 /*
  * Create a new dynamic shared memory segment.
  *
@@ -449,8 +458,7 @@ dsm_create(Size size, int flags)
     uint32        i;
     uint32        nitems;
 
-    /* Unsafe in postmaster (and pointless in a stand-alone backend). */
-    Assert(IsUnderPostmaster);
+    Assert(dsm_is_available());
 
     if (!dsm_init_done)
         dsm_backend_startup();
@@ -546,8 +554,7 @@ dsm_attach(dsm_handle h)
     uint32        i;
     uint32        nitems;
 
-    /* Unsafe in postmaster (and pointless in a stand-alone backend). */
-    Assert(IsUnderPostmaster);
+    Assert(dsm_is_available());
 
     if (!dsm_init_done)
         dsm_backend_startup();
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5db631c54c..dad3440568 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -800,10 +800,8 @@ extern PgStat_Counter pgStatBlockWriteTime;
  * Functions called from postmaster
  * ----------
  */
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern void pgstat_child_init(void);
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
 
 /* ----------
  * Functions called from backends
@@ -1005,4 +1003,7 @@ extern void CreateSharedBackendStatus(void);
 extern Size StatsShmemSize(void);
 extern void StatsShmemInit(void);
 
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
+
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/dsm.h b/src/include/storage/dsm.h
index 3c3da7ace6..be50f43db0 100644
--- a/src/include/storage/dsm.h
+++ b/src/include/storage/dsm.h
@@ -34,6 +34,7 @@ extern void dsm_detach_all(void);
 extern void dsm_set_control_handle(dsm_handle h);
 #endif
 
+extern bool dsm_is_available(void);
 /* Functions that create, update, or remove mappings. */
 extern dsm_segment *dsm_create(Size size, int flags);
 extern dsm_segment *dsm_attach(dsm_handle h);
-- 
2.16.3

From 68d608a77661d64625247fd04d89d9447b36046a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 4 Oct 2018 09:26:18 +0900
Subject: [PATCH 09/10] Full shared-memory based stats collector

At this point in the patchset, statistics are still collected through a
socket. In that scheme the stats collector process is the only writer of
the shared memory, so locking among multiple writers need not be
considered, but the design is inefficient.

With this patch every backend writes directly to the stats tables. An
updating backend must acquire the lock on the database stats entry and
gives up if the lock cannot be acquired. In regular backends, stats
updates do not happen at intervals shorter than about
PGSTAT_STAT_MIN_INTERVAL (500 ms), as with the previous stats collector.
On the other hand, if pending updates remain unapplied for longer than
PGSTAT_STAT_MAX_INTERVAL (1000 ms), they are applied forcibly.
---
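
Note: the interval rule in the commit message above can be sketched as a
small decision function. This is only an illustration of one reading of
that description, not code from the patch. The two interval constants are
taken from the patch; the enum, the function name decide_flush and its
parameters are hypothetical, and the sketch assumes that at least one
update is pending when it is called.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the patch; both are in milliseconds. */
#define PGSTAT_STAT_MIN_INTERVAL 500
#define PGSTAT_STAT_MAX_INTERVAL 1000

/* Hypothetical result type, not part of the patch. */
typedef enum
{
    FLUSH_SKIP,        /* too soon since the last update, keep pending */
    FLUSH_TRY_NOWAIT,  /* try the lock, give up if it is not available */
    FLUSH_FORCE        /* pending for too long, wait for the lock */
} flush_action;

/*
 * Hypothetical helper: decide what a backend should do with its pending
 * stats.  now_ms is the current time, last_flush_ms the time of the last
 * successful update, pending_since_ms the time the oldest unapplied
 * update was recorded.
 */
static flush_action
decide_flush(int64_t now_ms, int64_t last_flush_ms, int64_t pending_since_ms)
{
    assert(now_ms >= last_flush_ms);
    assert(now_ms >= pending_since_ms);

    /* Updates withheld for longer than the maximum are applied forcibly. */
    if (now_ms - pending_since_ms > PGSTAT_STAT_MAX_INTERVAL)
        return FLUSH_FORCE;

    /* Otherwise do not update more often than the minimum interval. */
    if (now_ms - last_flush_ms < PGSTAT_STAT_MIN_INTERVAL)
        return FLUSH_SKIP;

    return FLUSH_TRY_NOWAIT;
}
```

In this reading, FLUSH_TRY_NOWAIT corresponds to "decline on lock
failure": the counters stay pending and are retried later, until the
maximum interval forces them through.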
 src/backend/access/transam/xlog.c           |    4 +-
 src/backend/postmaster/autovacuum.c         |   16 +-
 src/backend/postmaster/bgwriter.c           |    4 +-
 src/backend/postmaster/checkpointer.c       |   24 +-
 src/backend/postmaster/pgarch.c             |    4 +-
 src/backend/postmaster/pgstat.c             | 2872 +++++++++++----------------
 src/backend/postmaster/postmaster.c         |   10 +
 src/backend/replication/logical/tablesync.c |    9 +-
 src/backend/replication/logical/worker.c    |    4 +-
 src/backend/storage/buffer/bufmgr.c         |    8 +-
 src/backend/storage/ipc/dsm.c               |    9 +
 src/backend/tcop/postgres.c                 |   27 +-
 src/backend/utils/adt/pgstatfuncs.c         |   44 +-
 src/backend/utils/init/globals.c            |    1 +
 src/backend/utils/init/postinit.c           |   11 +
 src/include/miscadmin.h                     |    1 +
 src/include/pgstat.h                        |  381 +---
 src/include/storage/dsm.h                   |    1 +
 src/include/utils/timeout.h                 |    1 +
 src/test/modules/worker_spi/worker_spi.c    |    2 +-
 20 files changed, 1361 insertions(+), 2072 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7375a78ffc..980c7e9e0e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8604,9 +8604,9 @@ LogCheckpointEnd(bool restartpoint)
                         &sync_secs, &sync_usecs);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time +=
+    BgWriterStats.checkpoint_write_time +=
         write_secs * 1000 + write_usecs / 1000;
-    BgWriterStats.m_checkpoint_sync_time +=
+    BgWriterStats.checkpoint_sync_time +=
         sync_secs * 1000 + sync_usecs / 1000;
 
     /*
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 65956c0c35..10e707e9a1 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -977,7 +977,7 @@ rebuild_database_list(Oid newdb)
         PgStat_StatDBEntry *entry;
 
         /* only consider this database if it has a pgstat entry */
-        entry = backend_get_db_entry(newdb, true);
+        entry = pgstat_fetch_stat_dbentry(newdb, true);
         if (entry != NULL)
         {
             /* we assume it isn't found because the hash was just created */
@@ -1002,7 +1002,7 @@ rebuild_database_list(Oid newdb)
          * skip databases with no stat entries -- in particular, this gets rid
          * of dropped databases
          */
-        entry = backend_get_db_entry(avdb->adl_datid, true);
+        entry = pgstat_fetch_stat_dbentry(avdb->adl_datid, true);
         if (entry == NULL)
             continue;
 
@@ -1027,7 +1027,7 @@ rebuild_database_list(Oid newdb)
         PgStat_StatDBEntry *entry;
 
         /* only consider databases with a pgstat entry */
-        entry = backend_get_db_entry(avdb->adw_datid, true);
+        entry = pgstat_fetch_stat_dbentry(avdb->adw_datid, true);
         if (entry == NULL)
             continue;
 
@@ -1238,7 +1238,7 @@ do_start_worker(void)
             continue;            /* ignore not-at-risk DBs */
 
         /* Find pgstat entry if any */
-        tmp->adw_entry = backend_get_db_entry(tmp->adw_datid, true);
+        tmp->adw_entry = pgstat_fetch_stat_dbentry(tmp->adw_datid, true);
 
         /*
          * Skip a database with no pgstat entry; it means it hasn't seen any
@@ -1980,7 +1980,7 @@ do_autovacuum(void)
      * may be NULL if we couldn't find an entry (only happens if we are
      * forcing a vacuum for anti-wrap purposes).
      */
-    dbentry = backend_get_db_entry(MyDatabaseId, true);
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, true);
 
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
@@ -2030,7 +2030,7 @@ do_autovacuum(void)
     MemoryContextSwitchTo(AutovacMemCxt);
 
     /* The database hash where pgstat keeps shared relations */
-    shared = backend_get_db_entry(InvalidOid, true);
+    shared = pgstat_fetch_stat_dbentry(InvalidOid, true);
 
     classRel = heap_open(RelationRelationId, AccessShareLock);
 
@@ -2805,8 +2805,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     /* use fresh stats */
     autovac_refresh_stats();
 
-    shared = backend_get_db_entry(InvalidOid, true);
-    dbentry = backend_get_db_entry(MyDatabaseId, true);
+    shared = pgstat_fetch_stat_dbentry(InvalidOid, true);
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, true);
 
     /* fetch the relation's relcache entry */
     classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index b1e9bb2c53..a4b1079e60 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -271,9 +271,9 @@ BackgroundWriterMain(void)
         can_hibernate = BgBufferSync(&wb_context);
 
         /*
-         * Send off activity statistics to the stats collector
+         * Update activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_update_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 1a033093c5..9235390bc6 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -376,7 +376,7 @@ CheckpointerMain(void)
         {
             checkpoint_requested = false;
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            BgWriterStats.requested_checkpoints++;
         }
         if (shutdown_requested)
         {
@@ -402,7 +402,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                BgWriterStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -520,13 +520,13 @@ CheckpointerMain(void)
         CheckArchiveTimeout();
 
         /*
-         * Send off activity statistics to the stats collector.  (The reason
-         * why we re-use bgwriter-related code for this is that the bgwriter
-         * and checkpointer used to be just one process.  It's probably not
-         * worth the trouble to split the stats support into two independent
-         * stats message types.)
+         * Update activity statistics.  (The reason why we re-use
+         * bgwriter-related code for this is that the bgwriter and
+         * checkpointer used to be just one process.  It's probably not worth
+         * the trouble to split the stats support into two independent
+         * functions.)
          */
-        pgstat_send_bgwriter();
+        pgstat_update_bgwriter();
 
         /*
          * Sleep until we are signaled or it's time for another checkpoint or
@@ -694,9 +694,9 @@ CheckpointWriteDelay(int flags, double progress)
         CheckArchiveTimeout();
 
         /*
-         * Report interim activity statistics to the stats collector.
+         * Register interim activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_update_bgwriter();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1296,8 +1296,8 @@ AbsorbFsyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    BgWriterStats.buf_written_backend += CheckpointerShmem->num_backend_writes;
+    BgWriterStats.buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 885e85ad8a..3ca36d62a4 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -466,7 +466,7 @@ pgarch_ArchiverCopyLoop(void)
                  * Tell the collector about the WAL file that we successfully
                  * archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_update_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
@@ -476,7 +476,7 @@ pgarch_ArchiverCopyLoop(void)
                  * Tell the collector about the WAL file that we failed to
                  * archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_update_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index a3d5f4856f..cef3097da9 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,10 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Statistics collector facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
- *
- *            - Add some automatic call for pgstat vacuuming.
- *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *    Statistics data is stored in dynamic shared memory. Every backend
+ *    updates and reads it individually.
  *
  *    Copyright (c) 2001-2018, PostgreSQL Global Development Group
  *
@@ -74,16 +69,11 @@
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_STAT_MIN_INTERVAL    500 /* Minimum time between stats data
+                                         * updates; in milliseconds. */
 
+#define PGSTAT_STAT_MAX_INTERVAL   1000 /* Maximum time between stats data
+                                         * updates; in milliseconds. */
 
 /* ----------
  * The initial size hints for the hash tables used in the collector.
@@ -92,6 +82,20 @@
 #define PGSTAT_TAB_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
+/*
+ * Operation mode of pgstat_get_db_entry.
+ */
+#define PGSTAT_TABLE_READ    0
+#define PGSTAT_TABLE_WRITE    1
+#define PGSTAT_TABLE_CREATE 2
+#define PGSTAT_TABLE_NOWAIT 4
+
+typedef enum
+{
+    PGSTAT_TABLE_NOT_FOUND,
+    PGSTAT_TABLE_FOUND,
+    PGSTAT_TABLE_LOCK_FAILED
+} pg_stat_table_result_status;
 
 /* ----------
  * Total number of backends including auxiliary
@@ -119,20 +123,14 @@ int            pgstat_track_activity_query_size = 1024;
  * Stored directly in a stats message structure so it can be sent
  * without needing to copy things around.  We assume this inits to zeroes.
  */
-PgStat_MsgBgWriter BgWriterStats;
+PgStat_BgWriter BgWriterStats;
 
 /* ----------
  * Local data
  * ----------
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
-
 static time_t last_pgstat_start_time;
 
-static bool pgStatRunningInCollector = false;
-
 /* Shared stats bootstrap infomation */
 typedef struct StatsShmemStruct {
     dsa_handle stats_dsa_handle;
@@ -147,6 +145,16 @@ static dshash_table *db_stats;
 static HTAB *snapshot_db_stats;
 static MemoryContext stats_cxt;
 
+/*
+ *  Report withholding facility.
+ *
+ *  Some report items are withheld if the required lock cannot be acquired
+ *  immediately.
+ */
+static bool pgstat_pending_recoveryconflict = false;
+static bool pgstat_pending_deadlock = false;
+static bool pgstat_pending_tempfile = false;
+
 /* dshash parameter for each type of table */
 static const dshash_parameters dsh_dbparams = {
     sizeof(Oid),
@@ -205,18 +213,14 @@ typedef struct TabStatHashEntry
  * Hash table for O(1) t_id -> tsa_entry lookup
  */
 static HTAB *pgStatTabHash = NULL;
+static HTAB *pgStatPendingTabHash = NULL;
 
 /*
  * Backends store per-function info that's waiting to be sent to the collector
  * in this hash table (indexed by function OID).
  */
 static HTAB *pgStatFunctions = NULL;
-
-/*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
- */
-static bool have_function_stats = false;
+static HTAB *pgStatPendingFunctions = NULL;
 
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
@@ -253,6 +257,12 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
+typedef struct
+{
+    dshash_table *tabhash;
+    PgStat_StatDBEntry *dbentry;
+} pgstat_apply_tabstat_context;
+
 /*
  * Info about current "snapshot" of stats file
  */
@@ -276,14 +286,6 @@ static PgStat_ArchiverStats *snapshot_archiverStats;
 static PgStat_GlobalStats *shared_globalStats;
 static PgStat_GlobalStats *snapshot_globalStats;
 
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
-
 /* Signal handler flags */
 static volatile bool need_exit = false;
 static volatile bool got_SIGHUP = false;
@@ -305,17 +307,15 @@ static instr_time total_func_time;
 static pid_t pgstat_forkexec(void);
 #endif
 
-/* functions used in stats collector */
 static void pgstat_shutdown_handler(SIGNAL_ARGS);
 static void pgstat_quickdie_handler(SIGNAL_ARGS);
 static void pgstat_beshutdown_hook(int code, Datum arg);
 static void pgstat_sighup_handler(SIGNAL_ARGS);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
+static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, int op,
+                                    pg_stat_table_result_status *status);
 static PgStat_StatTabEntry *pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create);
-static void pgstat_write_statsfiles(void);
 static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry);
-static void pgstat_read_statsfiles(void);
 static void pgstat_read_db_statsfile(Oid databaseid, dshash_table *tabhash, dshash_table *funchash);
 
 /* functions used in backends */
@@ -323,10 +323,25 @@ static bool backend_snapshot_global_stats(void);
 static PgStat_StatFuncEntry *backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot);
 static void pgstat_read_current_status(void);
 
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
-static HTAB *pgstat_collect_oids(Oid catalogid);
+static void pgstat_update_loop(bool shared, bool force,
+                               pgstat_apply_tabstat_context *cxt);
+static bool pgstat_apply_tabstat(pgstat_apply_tabstat_context *cxt,
+                                 PgStat_TableStatus *entry, bool nowait);
+static void pgstat_update_pending_tabhash(PgStat_TableStatus *deststat,
+                                          PgStat_TableStatus *srcstat,
+                                          bool init);
+static void pgstat_update_funcstats(bool force, PgStat_StatDBEntry *dbentry);
+static void pgstat_cleanup_recovery_conflict(PgStat_StatDBEntry *dbentry);
+static void pgstat_cleanup_deadlock(PgStat_StatDBEntry *dbentry);
+static void pgstat_cleanup_tempfile(PgStat_StatDBEntry *dbentry);
 
+static inline void pgstat_update_funcentry_by_backendstats(
+    PgStat_StatFuncEntry *dest, PgStat_BackendFunctionEntry *src, bool init);
+static inline void pgstat_update_funcentry_by_stats(
+    PgStat_StatFuncEntry *dest, PgStat_StatFuncEntry *src, bool init);
+
+static HTAB *pgstat_collect_oids(Oid catalogid);
+static void reset_dbentry_counters(PgStat_StatDBEntry *dbentry);
 static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
 
 static void pgstat_setup_memcxt(void);
@@ -337,25 +352,14 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
+static bool pgstat_update_tabentry(dshash_table *tabhash,
+                                   PgStat_TableStatus *stat, bool nowait);
+static void pgstat_update_dbentry(PgStat_StatDBEntry *dbentry,
+                                  PgStat_TableStatus *stat);
 
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+/* file in/out functions */
+static void pgstat_read_statsfiles(void);
+static void pgstat_write_statsfiles(void);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
@@ -374,280 +378,6 @@ static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
 void
 pgstat_init(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
-
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
-
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
-    {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
-    }
-
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
-
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
-
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
-
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
-    }
-
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
-
-    /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
-     */
-    if (!pg_set_noblock(pgStatSock))
-    {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
-    }
-
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
-    }
-
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    return;
-
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
-
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
-
-    /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
-     */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
 }
 
 /*
@@ -713,226 +443,6 @@ allow_immediate_pgstat_restart(void)
     last_pgstat_start_time = 0;
 }
 
-/* ------------------------------------------------------------
- * Public functions used by backends follow
- *------------------------------------------------------------
- */
-
-
-/* ----------
- * pgstat_report_stat() -
- *
- *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
- *    transaction stop time as an approximation of current time.
- * ----------
- */
-void
-pgstat_report_stat(bool force)
-{
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
-    TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
-    int            i;
-
-    /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
-
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
-    now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
-
-    /*
-     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
-     */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
-    pgStatTabHash = NULL;
-
-    /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
-     */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
-    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
-    {
-        for (i = 0; i < tsa->tsa_used; i++)
-        {
-            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
-
-            /* Shouldn't have any pending transaction-dependent counts */
-            Assert(entry->trans == NULL);
-
-            /*
-             * Ignore entries that didn't accumulate any actual counts, such
-             * as indexes that were opened by the planner but not used.
-             */
-            if (memcmp(&entry->t_counts, &all_zeroes,
-                       sizeof(PgStat_TableCounts)) == 0)
-                continue;
-
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
-            {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
-            }
-        }
-        /* zero out TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
-    }
-
-    /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
-     */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
-
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
-}
-
-/*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
- */
-static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
-{
-    int            n;
-    int            len;
-
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
-     */
-    if (OidIsValid(tsmsg->m_databaseid))
-    {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
-        pgStatXactCommit = 0;
-        pgStatXactRollback = 0;
-        pgStatBlockReadTime = 0;
-        pgStatBlockWriteTime = 0;
-    }
-    else
-    {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
-    }
-
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
-
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
-}
-
-/*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
- */
-static void
-pgstat_send_funcstats(void)
-{
-    /* we assume this inits to all zeroes: */
-    static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
-
-    if (pgStatFunctions == NULL)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
-
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
-
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
-
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
-
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
-
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
-}
-
-
 /* ----------
  * pgstat_attach_shared_stats() -
  *
@@ -1008,6 +518,564 @@ pgstat_create_shared_stats(void)
     LWLockRelease(StatsLock);
 }
 
+
+/* ------------------------------------------------------------
+ * Public functions used by backends follow
+ *------------------------------------------------------------
+ */
+
+
+/* ----------
+ * pgstat_update_stat() -
+ *
+ *    Must be called by processes that perform DML: tcop/postgres.c, logical
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *    This requires taking some locks on the shared statistics hashes, and
+ *    some updates may be withheld on lock failure. Pending updates are
+ *    retried in later calls of this function and are finally flushed when
+ *    this function is called with force = true or when more than
+ *    PGSTAT_STAT_MAX_INTERVAL milliseconds have elapsed since they became
+ *    pending. On the other hand, updates by regular backends happen at
+ *    intervals no shorter than PGSTAT_STAT_MIN_INTERVAL when force = false.
+ *
+ *    Returns the time in milliseconds until the next update is allowed, or 0.
+ *
+ *    Note that this is called only when not within a transaction, so it is fair
+ *    to use transaction stop time as an approximation of current time.
+ * ----------
+ */
+long
+pgstat_update_stat(bool force)
+{
+    /* we assume this inits to all zeroes: */
+    static TimestampTz last_report = 0;
+    static TimestampTz oldest_pending = 0;
+    TimestampTz now;
+    TabStatusArray *tsa;
+    pgstat_apply_tabstat_context cxt;
+    bool        other_pending_stats = false;
+    long elapsed;
+    long secs;
+    int     usecs;
+
+    if (pgstat_pending_recoveryconflict ||
+        pgstat_pending_deadlock ||
+        pgstat_pending_tempfile ||
+        pgStatPendingFunctions)
+        other_pending_stats = true;
+
+    /* Don't expend a clock check if nothing to do */
+    if (!other_pending_stats && !pgStatPendingTabHash &&
+        (pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
+        pgStatXactCommit == 0 && pgStatXactRollback == 0)
+        return 0;
+
+    now = GetCurrentTransactionStopTimestamp();
+    if (!force)
+    {
+        /*
+         * Don't update shared stats unless it's been at least
+         * PGSTAT_STAT_MIN_INTERVAL msec since we last updated them.
+         * Return the time to wait in that case.
+         */
+        TimestampDifference(last_report, now, &secs, &usecs);
+        elapsed = secs * 1000 + usecs / 1000;
+
+        if (elapsed < PGSTAT_STAT_MIN_INTERVAL)
+        {
+            /* we know we have some statistics */
+            if (oldest_pending == 0)
+                oldest_pending = now;
+
+            return PGSTAT_STAT_MIN_INTERVAL - elapsed;
+        }
+
+
+        /*
+         * Don't keep pending stats for longer than PGSTAT_STAT_MAX_INTERVAL.
+         */
+        if (oldest_pending > 0)
+        {
+            TimestampDifference(oldest_pending, now, &secs, &usecs);
+            elapsed = secs * 1000 + usecs / 1000;
+
+            if (elapsed > PGSTAT_STAT_MAX_INTERVAL)
+                force = true;
+        }
+    }
+
+    last_report = now;
+
+    /* set up stats update context */
+    cxt.dbentry = NULL;
+    cxt.tabhash = NULL;
+
+    /* Forcibly update other stats if any. */
+    if (other_pending_stats)
+    {
+        cxt.dbentry =
+            pgstat_get_db_entry(MyDatabaseId,
+                                PGSTAT_TABLE_WRITE | PGSTAT_TABLE_CREATE,
+                                NULL);
+
+        /* clean up pending statistics if any */
+        if (pgStatPendingFunctions)
+            pgstat_update_funcstats(true, cxt.dbentry);
+        if (pgstat_pending_recoveryconflict)
+            pgstat_cleanup_recovery_conflict(cxt.dbentry);
+        if (pgstat_pending_deadlock)
+            pgstat_cleanup_deadlock(cxt.dbentry);
+        if (pgstat_pending_tempfile)
+            pgstat_cleanup_tempfile(cxt.dbentry);
+    }
+
+    /*
+     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
+     * entries it points to.  (Should we fail partway through the loop below,
+     * it's okay to have removed the hashtable already --- the only
+     * consequence is we'd get multiple entries for the same table in the
+     * pgStatTabList, and that's safe.)
+     */
+    if (pgStatTabHash)
+        hash_destroy(pgStatTabHash);
+    pgStatTabHash = NULL;
+
+    /*
+     * XXX: We cannot lock two dshash entries at once. Since we must keep the
+     * lock while table stats are being updated, we have no choice other than
+     * handling shared table stats and regular table stats separately.
+     * Looping over the array twice is apparently inefficient, and a more
+     * efficient way is desirable.
+     */
+
+    /* The first call below uses the dbentry obtained above, if any. */
+    pgstat_update_loop(false, force, &cxt);
+    pgstat_update_loop(true, force, &cxt);
+
+    /* zero out TableStatus structs after use */
+    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+    {
+        MemSet(tsa->tsa_entries, 0,
+               tsa->tsa_used * sizeof(PgStat_TableStatus));
+        tsa->tsa_used = 0;
+    }
+
+    /* record oldest pending update time */
+    if (pgStatPendingTabHash == NULL)
+        oldest_pending = 0;
+    else if (oldest_pending == 0)
+        oldest_pending = now;
+
+    return 0;
+}
+
+static void
+pgstat_update_loop(bool shared, bool force, pgstat_apply_tabstat_context *cxt)
+{
+    static const PgStat_TableCounts all_zeroes;
+    TabStatusArray *tsa;
+    int i;
+
+    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+    {
+        for (i = 0; i < tsa->tsa_used; i++)
+        {
+            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
+            PgStat_TableStatus *pentry = NULL;
+
+            /* Shouldn't have any pending transaction-dependent counts */
+            Assert(entry->trans == NULL);
+
+            /*
+             * Ignore entries that didn't accumulate any actual counts, such
+             * as indexes that were opened by the planner but not used.
+             */
+            if (memcmp(&entry->t_counts, &all_zeroes,
+                       sizeof(PgStat_TableCounts)) == 0)
+                continue;
+
+            /* Skip if this entry does not match the request */
+            if (entry->t_shared != shared)
+                continue;
+
+            /* if a pending update exists, merge into it and apply both */
+            if (pgStatPendingTabHash != NULL)
+            {
+                pentry = hash_search(pgStatPendingTabHash,
+                                     (void *) entry, HASH_FIND, NULL);
+
+                if (pentry)
+                {
+                    /* merge new update into pending updates */
+                    pgstat_update_pending_tabhash(pentry, entry, false);
+                    entry = pentry;
+                }
+            }
+
+            /* try it */
+            if (pgstat_apply_tabstat(cxt, entry, !force))
+            {
+                /* succeeded; remove the entry if it was a pending one */
+                if (pentry)
+                    hash_search(pgStatPendingTabHash,
+                                (void *) pentry, HASH_REMOVE, NULL);
+            }
+            else if (!pentry)
+            {
+                /* failed and there was no pending entry, create new one. */
+                bool found;
+
+                if (pgStatPendingTabHash == NULL)
+                {
+                    HASHCTL        ctl;
+
+                    memset(&ctl, 0, sizeof(ctl));
+                    ctl.keysize = sizeof(Oid);
+                    ctl.entrysize = sizeof(PgStat_TableStatus);
+                    pgStatPendingTabHash =
+                        hash_create("pgstat pending table stats hash",
+                                    TABSTAT_QUANTUM,
+                                    &ctl,
+                                    HASH_ELEM | HASH_BLOBS);
+                }
+
+                pentry = hash_search(pgStatPendingTabHash,
+                                     (void *) entry, HASH_ENTER, &found);
+                Assert (!found);
+
+                *pentry = *entry;
+            }
+        }
+    }
+
+    /* if any pending stats exist, try to clean them up */
+    if (pgStatPendingTabHash != NULL)
+    {
+        HASH_SEQ_STATUS pstat;
+        PgStat_TableStatus *pentry;
+
+        hash_seq_init(&pstat, pgStatPendingTabHash);
+        while((pentry = (PgStat_TableStatus *) hash_seq_search(&pstat)) != NULL)
+        {
+            /* Skip if this entry does not match the request */
+            if (pentry->t_shared != shared)
+                continue;
+
+            /* apply pending entry and remove on success */
+            if (pgstat_apply_tabstat(cxt, pentry, !force))
+                hash_search(pgStatPendingTabHash,
+                            (void *) pentry, HASH_REMOVE, NULL);
+        }
+
+        /* destroy the hash if no entry is left */
+        if (hash_get_num_entries(pgStatPendingTabHash) == 0)
+        {
+            hash_destroy(pgStatPendingTabHash);
+            pgStatPendingTabHash = NULL;
+        }
+    }
+
+    if (cxt->tabhash)
+        dshash_detach(cxt->tabhash);
+    if (cxt->dbentry)
+        dshash_release_lock(db_stats, cxt->dbentry);
+    cxt->tabhash = NULL;
+    cxt->dbentry = NULL;
+}
+
+
+/*
+ * pgstat_apply_tabstat: update shared statistics using given entry
+ *
+ * If nowait is true, skips work and returns false on lock failure.
+ * The database entry lock and the attached table stats dshash are kept in
+ * cxt. The caller must release and detach them after use.
+ */
+bool
+pgstat_apply_tabstat(pgstat_apply_tabstat_context *cxt,
+                     PgStat_TableStatus *entry, bool nowait)
+{
+    Oid dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
+    int        table_mode = PGSTAT_TABLE_WRITE | PGSTAT_TABLE_CREATE;
+    bool updated = false;
+
+    /* fix table search mode */
+    if (nowait)
+        table_mode |= PGSTAT_TABLE_NOWAIT;
+
+    /*
+     * We need to keep lock on dbentries for regular tables to avoid race
+     * condition with drop database. So we hold it in the context variable. We
+     * don't need that for dbentries for shared tables.
+     */
+    if (!cxt->dbentry)
+        cxt->dbentry = pgstat_get_db_entry(dboid, table_mode, NULL);
+
+    /* could not acquire the lock, just return */
+    if (!cxt->dbentry)
+        return false;
+
+    /* attach shared stats table if not yet */
+    if (!cxt->tabhash)
+    {
+        /* apply database stats  */
+        if (!entry->t_shared)
+        {
+            /* Update database-wide stats  */
+            cxt->dbentry->n_xact_commit += pgStatXactCommit;
+            cxt->dbentry->n_xact_rollback += pgStatXactRollback;
+            cxt->dbentry->n_block_read_time += pgStatBlockReadTime;
+            cxt->dbentry->n_block_write_time += pgStatBlockWriteTime;
+            pgStatXactCommit = 0;
+            pgStatXactRollback = 0;
+            pgStatBlockReadTime = 0;
+            pgStatBlockWriteTime = 0;
+        }
+        
+        cxt->tabhash =
+            dshash_attach(area, &dsh_tblparams, cxt->dbentry->tables, 0);
+    }
+
+    /*
+     * If we have access to the required data, try to update table stats
+     * first. Update database stats only if the first step succeeded.
+     */
+    if (pgstat_update_tabentry(cxt->tabhash, entry, nowait))
+    {
+        pgstat_update_dbentry(cxt->dbentry, entry);
+        updated = true;
+    }
+
+    return updated;
+}
+
+/*
+ * pgstat_update_pending_tabhash:
+ *
+ * Updates deststat by adding srcstat. Existing values in deststat are
+ * overwritten if init is true.
+ */
+static void
+pgstat_update_pending_tabhash(PgStat_TableStatus *deststat,
+                              PgStat_TableStatus *srcstat,
+                              bool init)
+{
+    Assert (deststat != srcstat);
+
+    if (init)
+        deststat->t_counts = srcstat->t_counts;
+    else
+    {
+        PgStat_TableCounts *dest = &deststat->t_counts;
+        PgStat_TableCounts *src = &srcstat->t_counts;
+
+        dest->t_numscans += src->t_numscans;
+        dest->t_tuples_returned += src->t_tuples_returned;
+        dest->t_tuples_fetched += src->t_tuples_fetched;
+        dest->t_tuples_inserted += src->t_tuples_inserted;
+        dest->t_tuples_updated += src->t_tuples_updated;
+        dest->t_tuples_deleted += src->t_tuples_deleted;
+        dest->t_tuples_hot_updated += src->t_tuples_hot_updated;
+        dest->t_truncated |= src->t_truncated;
+
+        /* If table was truncated, first reset the live/dead counters */
+        if (src->t_truncated)
+        {
+            dest->t_delta_live_tuples = 0;
+            dest->t_delta_dead_tuples = 0;
+        }
+        dest->t_delta_live_tuples += src->t_delta_live_tuples;
+        dest->t_delta_dead_tuples += src->t_delta_dead_tuples;
+        dest->t_changed_tuples += src->t_changed_tuples;
+        dest->t_blocks_fetched += src->t_blocks_fetched;
+        dest->t_blocks_hit += src->t_blocks_hit;
+    }
+}
+        
+/*
+ * Subroutine for pgstat_update_stat: update a function stat
+ */
+static void
+pgstat_update_funcstats(bool force, PgStat_StatDBEntry *dbentry)
+{
+    /* we assume this inits to all zeroes: */
+    static const PgStat_FunctionCounts all_zeroes;
+    pg_stat_table_result_status status = 0;
+    dshash_table *funchash;
+    bool          nowait = !force;
+    bool          release_db = false;
+    int              table_op = PGSTAT_TABLE_WRITE | PGSTAT_TABLE_CREATE;
+
+    if (pgStatFunctions == NULL && pgStatPendingFunctions == NULL)
+        return;
+
+    if (nowait)
+        table_op |= PGSTAT_TABLE_NOWAIT;
+
+    /* find the shared function stats table */
+    if (!dbentry)
+    {
+        dbentry = pgstat_get_db_entry(MyDatabaseId, table_op, &status);
+        release_db = true;
+    }
+
+    /* lock failure, return. */
+    if (status == PGSTAT_TABLE_LOCK_FAILED)
+        return;
+
+    funchash = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+
+    /*
+     * First, we empty the transaction stats. Just move numbers to pending
+     * stats if any. Otherwise try to directly update the shared stats but
+     * create a new pending entry on lock failure.
+     */
+    if (pgStatFunctions)
+    {
+        HASH_SEQ_STATUS fstat;
+        PgStat_BackendFunctionEntry *srcent;
+
+        hash_seq_init(&fstat, pgStatFunctions);
+        while ((srcent = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+        {
+            bool found;
+            bool init = false;
+            PgStat_StatFuncEntry *destent = NULL;
+
+            /* Skip it if no counts accumulated since last time */
+            if (memcmp(&srcent->f_counts, &all_zeroes,
+                       sizeof(PgStat_FunctionCounts)) == 0)
+                continue;
+
+            /* find pending entry */
+            if (pgStatPendingFunctions)
+                destent = (PgStat_StatFuncEntry *)
+                    hash_search(pgStatPendingFunctions,
+                                (void *) &(srcent->f_id), HASH_FIND, NULL);
+
+            if (!destent)
+            {
+                /* pending entry not found, find shared stats entry */
+                destent = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert_extended(funchash,
+                                                   (void *) &(srcent->f_id),
+                                                   &found, nowait);
+                if (destent)
+                    init = !found;
+                else
+                {
+                    /* no shared stats entry. create a new pending one */
+                    destent = (PgStat_StatFuncEntry *)
+                        hash_search(pgStatPendingFunctions,
+                                    (void *) &(srcent->f_id), HASH_ENTER, NULL);
+                    init = true;
+                }
+            }
+            Assert (destent != NULL);
+
+            pgstat_update_funcentry_by_backendstats(destent, srcent, init);
+
+            /* reset used counts */
+            MemSet(&srcent->f_counts, 0, sizeof(PgStat_FunctionCounts));
+        }
+    }
+
+    /* Second, apply pending stats numbers to shared table */
+    if (pgStatPendingFunctions)
+    {
+        HASH_SEQ_STATUS fstat;
+        PgStat_StatFuncEntry *srcent;
+
+        hash_seq_init(&fstat, pgStatPendingFunctions);
+        while ((srcent = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+        {
+            PgStat_StatFuncEntry *destent;
+            bool found;
+
+            destent = (PgStat_StatFuncEntry *)
+                dshash_find_or_insert_extended(funchash,
+                                               (void *) &(srcent->functionid),
+                                               &found, nowait);
+            if (destent)
+            {
+                pgstat_update_funcentry_by_stats(destent, srcent, !found);
+                hash_search(pgStatPendingFunctions,
+                            (void *) &(srcent->functionid), HASH_REMOVE, NULL);
+            }
+        }
+
+        /* destroy the hash if no entry remains */
+        if (hash_get_num_entries(pgStatPendingFunctions) == 0)
+        {
+            hash_destroy(pgStatPendingFunctions);
+            pgStatPendingFunctions = NULL;
+        }
+    }
+
+    if (release_db)
+        dshash_release_lock(db_stats, dbentry);
+}
+
+static inline void
+pgstat_update_funcentry_by_backendstats(PgStat_StatFuncEntry *dest,
+                                        PgStat_BackendFunctionEntry *src,
+                                        bool init)
+{
+    if (init)
+    {
+        /*
+         * If it's a new function entry, initialize counters to the values
+         * we just got.
+         */
+        dest->f_numcalls = src->f_counts.f_numcalls;
+        dest->f_total_time =
+            INSTR_TIME_GET_MICROSEC(src->f_counts.f_total_time);
+        dest->f_self_time =
+            INSTR_TIME_GET_MICROSEC(src->f_counts.f_self_time);
+    }
+    else
+    {
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        dest->f_numcalls += src->f_counts.f_numcalls;
+        dest->f_total_time +=
+            INSTR_TIME_GET_MICROSEC(src->f_counts.f_total_time);
+        dest->f_self_time +=
+            INSTR_TIME_GET_MICROSEC(src->f_counts.f_self_time);
+    }
+}
+
+static inline void
+pgstat_update_funcentry_by_stats(PgStat_StatFuncEntry *dest,
+                                 PgStat_StatFuncEntry *src,
+                                 bool init)
+{
+    if (init)
+    {
+        /*
+         * If it's a new function entry, initialize counters to the values
+         * we just got.
+         */
+        dest->f_numcalls = src->f_numcalls;
+        dest->f_total_time = src->f_total_time;
+        dest->f_self_time = src->f_self_time;
+    }
+    else
+    {
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        dest->f_numcalls += src->f_numcalls;
+        dest->f_total_time += src->f_total_time;
+        dest->f_self_time += src->f_self_time;
+    }
+}
+
+
+
 /* ----------
  * pgstat_vacuum_stat() -
  *
@@ -1018,17 +1086,11 @@ void
 pgstat_vacuum_stat(void)
 {
     HTAB       *oidtab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
     dshash_table *dshtable;
     dshash_seq_status dshstat;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatFuncEntry *funcentry;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
 
     /* If not done for this transaction, take a snapshot of stats */
     if (!backend_snapshot_global_stats())
@@ -1044,7 +1106,7 @@ pgstat_vacuum_stat(void)
      * collector to drop them.
      */
 
-    dshash_seq_init(&dshstat, db_stats, true);
+    dshash_seq_init(&dshstat, db_stats, true, true);
     while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            dbid = dbentry->databaseid;
@@ -1063,8 +1125,8 @@ pgstat_vacuum_stat(void)
     /*
      * Lookup our own database entry; if not found, nothing more to do.
      */
-    dbentry = backend_get_db_entry(MyDatabaseId, true);
-    if (dbentry == NULL)
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_TABLE_WRITE, NULL);
+    if (!dbentry)
         return;
     
     /*
@@ -1072,17 +1134,12 @@ pgstat_vacuum_stat(void)
      */
     oidtab = pgstat_collect_oids(RelationRelationId);
 
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
-
     /*
      * Check for all tables listed in stats hashtable if they still exist.
      * Stats cache is useless here so directly search the shared hash.
      */
     dshtable = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
-    dshash_seq_init(&dshstat, dshtable, false);
+    dshash_seq_init(&dshstat, dshtable, false, true);
     while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            tabid = tabentry->tableid;
@@ -1092,41 +1149,11 @@ pgstat_vacuum_stat(void)
         if (hash_search(oidtab, (void *) &tabid, HASH_FIND, NULL) != NULL)
             continue;
 
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
+        /* Not there, so purge this table */
+        dshash_delete_entry(dshtable, tabentry);
     }
     dshash_detach(dshtable);
 
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
     /* Clean up */
     hash_destroy(oidtab);
 
@@ -1139,11 +1166,7 @@ pgstat_vacuum_stat(void)
     {
         oidtab = pgstat_collect_oids(ProcedureRelationId);
 
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        dshash_seq_init(&dshstat, dshtable, false);
+        dshash_seq_init(&dshstat, dshtable, false, true);
         while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&dshstat)) != NULL)
         {
             Oid            funcid = funcentry->functionid;
@@ -1153,39 +1176,14 @@ pgstat_vacuum_stat(void)
             if (hash_search(oidtab, (void *) &funcid, HASH_FIND, NULL) != NULL)
                 continue;
 
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
-        }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
+            /* Not there, so remove this function */
+            dshash_delete_entry(dshtable, funcentry);
         }
 
         hash_destroy(oidtab);
     }
     dshash_detach(dshtable);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 
@@ -1240,57 +1238,50 @@ pgstat_collect_oids(Oid catalogid)
  * pgstat_drop_database() -
  *
  *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
+ *    (If some stats update happens after this, the entry will be re-created,
+ *    but we will still clean up the dead DB eventually via future invocations
+ *    of pgstat_vacuum_stat().)
  * ----------
  */
+
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    Assert(OidIsValid(databaseid));
+
+    if (db_stats == NULL)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Look up the database in the hashtable with an exclusive lock.
+     */
+    dbentry = pgstat_get_db_entry(databaseid, PGSTAT_TABLE_WRITE, NULL);
+
+    /*
+     * If found, remove it (along with the db statfile).
+     */
+    if (dbentry)
+    {
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+        }
+
+        dshash_delete_entry(db_stats, (void *)dbentry);
+    }
 }
 
 
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
-
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
-}
-#endif                            /* NOT_USED */
-
-
 /* ----------
  * pgstat_reset_counters() -
  *
@@ -1303,14 +1294,46 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    PgStat_StatDBEntry           *dbentry;
+    pg_stat_table_result_status status;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (db_stats == NULL)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Lookup the database in the hashtable.  Nothing to do if not there.
+     */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_TABLE_WRITE, &status);
+
+    if (!dbentry)
+        return;
+
+    /*
+     * We simply throw away all the database's table entries by recreating a
+     * new hash table for them.
+     */
+    if (dbentry->tables != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_destroy(t);
+        dbentry->tables = DSM_HANDLE_INVALID;
+    }
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_destroy(t);
+        dbentry->functions = DSM_HANDLE_INVALID;
+    }
+
+    /*
+     * Reset database-level stats, too.  This creates empty hash tables for
+     * tables and functions.
+     */
+    reset_dbentry_counters(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -1325,23 +1348,32 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (db_stats == NULL)
         return;
 
+    /* Reset the archiver statistics for the cluster. */
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+        memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
+    }
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+        /* Reset the global background writer statistics for the cluster. */
+        memset(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    }
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
+
+    LWLockRelease(StatsLock);
 }
 
 /* ----------
@@ -1356,17 +1388,38 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatDBEntry *dbentry;
+
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* Don't defer */
+
+    if (db_stats == NULL)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_TABLE_WRITE, NULL);
 
-    pgstat_send(&msg, sizeof(msg));
+    if (!dbentry)
+        return;
+
+    /* Set the reset timestamp for the whole database */
+    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
+
+    /* Remove object if it exists, ignore it if not */
+    if (type == RESET_TABLE)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_delete_key(t, (void *) &objoid);
+    }
+
+    if (type == RESET_FUNCTION)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_delete_key(t, (void *) &objoid);
+    }
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -1380,16 +1433,23 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* Don't defer */
+
+    if (db_stats == NULL)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    /*
+     * Store the last autovacuum time in the database's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid,
+                                  PGSTAT_TABLE_WRITE | PGSTAT_TABLE_CREATE,
+                                  NULL);
 
-    pgstat_send(&msg, sizeof(msg));
+    dbentry->last_autovac_time = GetCurrentTimestamp();
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 
@@ -1403,19 +1463,43 @@ void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* Don't defer */
+
+    if (db_stats == NULL || !pgstat_track_counts)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid,
+                                  PGSTAT_TABLE_WRITE | PGSTAT_TABLE_CREATE,
+                                  NULL);
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, tableoid, true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->vacuum_count++;
+    }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* --------
@@ -1432,9 +1516,14 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* Don't defer */
+
+    if (db_stats == NULL || !pgstat_track_counts)
         return;
 
     /*
@@ -1463,15 +1552,42 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid,
+                                  PGSTAT_TABLE_WRITE | PGSTAT_TABLE_CREATE,
+                                  NULL);
+
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, RelationGetRelid(rel), true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* --------
@@ -1480,18 +1596,81 @@ pgstat_report_analyze(Relation rel,
  *    Tell the collector about a Hot Standby recovery conflict.
  * --------
  */
+static int pending_conflict_tablespace = 0;
+static int pending_conflict_lock = 0;
+static int pending_conflict_snapshot = 0;
+static int pending_conflict_bufferpin = 0;
+static int pending_conflict_startup_deadlock = 0;
+
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    pgstat_pending_recoveryconflict = false;
+
+    if (db_stats == NULL || !pgstat_track_counts)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            pending_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            pending_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            pending_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            pending_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            pending_conflict_startup_deadlock++;
+            break;
+    }
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_TABLE_WRITE | PGSTAT_TABLE_CREATE |
+                                  PGSTAT_TABLE_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_TABLE_LOCK_FAILED)
+    {
+        pgstat_pending_recoveryconflict = true;
+        return;
+    }
+
+    pgstat_cleanup_recovery_conflict(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+static void
+pgstat_cleanup_recovery_conflict(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_conflict_tablespace    += pending_conflict_tablespace;
+    dbentry->n_conflict_lock         += pending_conflict_lock;
+    dbentry->n_conflict_snapshot    += pending_conflict_snapshot;
+    dbentry->n_conflict_bufferpin    += pending_conflict_bufferpin;
+    dbentry->n_conflict_startup_deadlock += pending_conflict_startup_deadlock;
+
+    pending_conflict_tablespace = 0;
+    pending_conflict_lock = 0;
+    pending_conflict_snapshot = 0;
+    pending_conflict_bufferpin = 0;
+    pending_conflict_startup_deadlock = 0;
+
+    pgstat_pending_recoveryconflict = false;
 }
 
 /* --------
@@ -1500,17 +1679,38 @@ pgstat_report_recovery_conflict(int reason)
  *    Tell the collector about a deadlock detected.
  * --------
  */
+static int pending_deadlocks = 0;
+
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    if (db_stats == NULL || !pgstat_track_counts)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    pending_deadlocks++;
+    pgstat_pending_deadlock = true;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_TABLE_WRITE | PGSTAT_TABLE_CREATE |
+                                  PGSTAT_TABLE_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_TABLE_LOCK_FAILED)
+        return;
+
+    pgstat_cleanup_deadlock(dbentry);
+    dshash_release_lock(db_stats, dbentry);
+    pgstat_pending_deadlock = false;
+}
+
+static void
+pgstat_cleanup_deadlock(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_deadlocks += pending_deadlocks;
+    pending_deadlocks = 0;
 }
 
 /* --------
@@ -1519,40 +1719,51 @@ pgstat_report_deadlock(void)
  *    Tell the collector about a temporary file.
  * --------
  */
+static size_t pending_filesize = 0;
+static size_t pending_files = 0;
+
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    pgstat_pending_tempfile = false;
+    if (db_stats == NULL || !pgstat_track_counts)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
+    if (filesize > 0) /* Isn't there a case where filesize is really 0? */
+    {
+        pgstat_pending_tempfile = true;
+        pending_filesize += filesize; /* needs an overflow check */
+        pending_files++;
+    }
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_TABLE_WRITE | PGSTAT_TABLE_CREATE |
+                                  PGSTAT_TABLE_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_TABLE_LOCK_FAILED)
+        return;
+
+    pgstat_cleanup_tempfile(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
-
-/* ----------
- * pgstat_ping() -
- *
- *    Send some junk data to the collector to increase traffic.
- * ----------
- */
-void
-pgstat_ping(void)
+static void
+pgstat_cleanup_tempfile(PgStat_StatDBEntry *dbentry)
 {
-    PgStat_MsgDummy msg;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    dbentry->n_temp_bytes += pending_filesize;
+    dbentry->n_temp_files += pending_files;
+    pending_filesize = 0;
+    pending_files = 0;
+    pgstat_pending_tempfile = false;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
 }
 
-
 /*
  * Initialize function call usage data.
  * Called by the executor before invoking a function.
@@ -1668,9 +1879,6 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
         fs->f_numcalls++;
     fs->f_total_time = f_total;
     INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
 }
 
 
@@ -1691,6 +1899,7 @@ pgstat_initstats(Relation rel)
 {
     Oid            rel_id = rel->rd_id;
     char        relkind = rel->rd_rel->relkind;
+    MemoryContext oldcontext;
 
     /* We only count stats for things that have storage */
     if (!(relkind == RELKIND_RELATION ||
@@ -1703,7 +1912,18 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* Nothing to attach to when not running under the postmaster */
+    if (!IsUnderPostmaster)
+        return;
+
+    /* Attached shared memory lives for the process lifetime */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    while (!pgstat_attach_shared_stats())
+        sleep(1);
+
+    MemoryContextSwitchTo(oldcontext);
+
+    if (db_stats == NULL || !pgstat_track_counts)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -2353,26 +2573,6 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
         rec->tuples_inserted + rec->tuples_updated;
 }
 
-
-/* ----------
- * pgstat_fetch_stat_dbentry() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
- * ----------
- */
-PgStat_StatDBEntry *
-pgstat_fetch_stat_dbentry(Oid dbid)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = backend_get_db_entry(dbid, false);
-    return dbentry;
-}
-
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
@@ -2389,7 +2589,7 @@ pgstat_fetch_stat_tabentry(Oid relid)
     PgStat_StatTabEntry *tabentry;
 
     /* Lookup our database, then look in its table hash table. */
-    dbentry = backend_get_db_entry(MyDatabaseId, false);
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, false);
     if (dbentry == NULL)
         return NULL;
 
@@ -2400,7 +2600,7 @@ pgstat_fetch_stat_tabentry(Oid relid)
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbentry = backend_get_db_entry(InvalidOid, false);
+    dbentry = pgstat_fetch_stat_dbentry(InvalidOid, false);
     if (dbentry == NULL)
         return NULL;
 
@@ -2426,7 +2626,7 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     PgStat_StatFuncEntry *funcentry = NULL;
 
     /* Lookup our database, then find the requested function */
-    dbentry = pgstat_get_db_entry(MyDatabaseId, false);
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_TABLE_READ, NULL);
     if (dbentry == NULL)
         return NULL;
 
@@ -2434,6 +2634,7 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     if (funcentry == NULL)
         return NULL;
 
+    dshash_release_lock(db_stats, dbentry);
     return funcentry;
 }
 
@@ -2721,7 +2922,7 @@ pgstat_initialize(void)
     }
 
     /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -2921,7 +3122,7 @@ pgstat_beshutdown_hook(int code, Datum arg)
      * during failed backend starts might never get counted.)
      */
     if (OidIsValid(MyDatabaseId))
-        pgstat_report_stat(true);
+        pgstat_update_stat(true);
 
     /*
      * Clear my status entry, following the protocol of bumping st_changecount
@@ -3188,7 +3389,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -4105,96 +4307,76 @@ pgstat_get_backend_desc(BackendType backendType)
  * ------------------------------------------------------------
  */
 
-
 /* ----------
- * pgstat_setheader() -
+ * pgstat_update_archiver() -
  *
- *        Set common header fields in a statistics message
+ *    Update the stats data about the WAL file that we successfully archived or
+ *    failed to archive.
  * ----------
  */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
+void
+pgstat_update_archiver(const char *xlog, bool failed)
 {
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
+    if (failed)
     {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
+        /* Failed archival attempt */
+        ++shared_archiverStats->failed_count;
+        strlcpy(shared_archiverStats->last_failed_wal, xlog,
+                sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = GetCurrentTimestamp();
+    }
+    else
+    {
+        /* Successful archival operation */
+        ++shared_archiverStats->archived_count;
+        strlcpy(shared_archiverStats->last_archived_wal, xlog,
+                sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = GetCurrentTimestamp();
+    }
 }
 
 /* ----------
- * pgstat_send_archiver() -
+ * pgstat_update_bgwriter() -
  *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
+ *        Update bgwriter statistics
  * ----------
  */
 void
-pgstat_send_archiver(const char *xlog, bool failed)
-{
-    PgStat_MsgArchiver msg;
-
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* ----------
- * pgstat_send_bgwriter() -
- *
- *        Send bgwriter statistics to the collector
- * ----------
- */
-void
-pgstat_send_bgwriter(void)
+pgstat_update_bgwriter(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+
+    PgStat_BgWriter *s = &BgWriterStats;
+    MemoryContext oldcontext;
 
     /*
      * This function can be called even if nothing at all has happened. In
      * this case, avoid sending a completely empty message to the stats
      * collector.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    /* Attached shared memory lives for the process lifetime */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    while (!pgstat_attach_shared_stats())
+        sleep(1);
+
+    MemoryContextSwitchTo(oldcontext);
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += s->timed_checkpoints;
+    shared_globalStats->requested_checkpoints += s->requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += s->checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += s->checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += s->buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += s->buf_written_clean;
+    shared_globalStats->maxwritten_clean += s->maxwritten_clean;
+    shared_globalStats->buf_written_backend += s->buf_written_backend;
+    shared_globalStats->buf_fsync_backend += s->buf_fsync_backend;
+    shared_globalStats->buf_alloc += s->buf_alloc;
+    LWLockRelease(StatsLock);
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4215,8 +4397,6 @@ pgstat_send_bgwriter(void)
 void
 PgstatCollectorMain(void)
 {
-    int            len;
-    PgStat_Msg    msg;
     int            wr;
 
     /*
@@ -4272,164 +4452,17 @@ PgstatCollectorMain(void)
             break;
 
         /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * need_exit becomes set.
+         * Reload configuration if we got SIGHUP from the postmaster.
          */
-        while (!got_SIGTERM)
+        if (got_SIGHUP)
         {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (got_SIGHUP)
-            {
-                got_SIGHUP = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
+            got_SIGHUP = false;
+            ProcessConfigFile(PGC_SIGHUP);
+        }
 
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat((PgStat_MsgTabstat *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge((PgStat_MsgTabpurge *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb((PgStat_MsgDropdb *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter((PgStat_MsgResetcounter *) &msg,
-                                             len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(
-                                                   (PgStat_MsgResetsharedcounter *) &msg,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(
-                                                   (PgStat_MsgResetsinglecounter *) &msg,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac((PgStat_MsgAutovacStart *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum((PgStat_MsgVacuum *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze((PgStat_MsgAnalyze *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver((PgStat_MsgArchiver *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter((PgStat_MsgBgWriter *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat((PgStat_MsgFuncstat *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge((PgStat_MsgFuncpurge *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict((PgStat_MsgRecoveryConflict *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock((PgStat_MsgDeadlock *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile((PgStat_MsgTempFile *) &msg, len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
+        wr = WaitLatch(MyLatch,
+                       WL_LATCH_SET | WL_POSTMASTER_DEATH,
+                       -1L, WAIT_EVENT_PGSTAT_MAIN);
 
         /*
          * Emergency bailout if postmaster has died.  This is to avoid the
@@ -4552,29 +4585,62 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
  * Else, return NULL.
  */
 static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
+pgstat_get_db_entry(Oid databaseid, int op, pg_stat_table_result_status *status)
 {
     PgStat_StatDBEntry *result;
-    bool        found;
+    bool        nowait = ((op & PGSTAT_TABLE_NOWAIT) != 0);
+    bool        lock_acquired = true;
+    bool        found = true;
+    MemoryContext oldcontext;
 
-    Assert(pgStatRunningInCollector);
+    /* Attached shared memory lives for the process lifetime */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    if (!pgstat_attach_shared_stats())
+    {
+        MemoryContextSwitchTo(oldcontext);
+        return NULL;
+    }
+    MemoryContextSwitchTo(oldcontext);
 
     /* Lookup or create the hash table entry for this database */
-    if (create)
+    if (op & PGSTAT_TABLE_CREATE)
+    {
         result = (PgStat_StatDBEntry *)
-            dshash_find_or_insert(db_stats,    &databaseid, &found);
+            dshash_find_or_insert_extended(db_stats, &databaseid,
+                                           &found, nowait);
+        if (result == NULL)
+            lock_acquired = false;
+        else if (!found)
+        {
+            /*
+             * If not found, initialize the new one.  This creates empty hash
+             * tables for tables and functions, too.
+             */
+            reset_dbentry_counters(result);
+        }
+    }
     else
-        result = (PgStat_StatDBEntry *)    dshash_find(db_stats, &databaseid, true);
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_extended(db_stats, &databaseid,
+                                 &lock_acquired, true, nowait);
+        if (result == NULL)
+            found = false;
+    }
 
-    if (!create)
-        return result;
-
-    /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
-     */
-    if (!found)
-        reset_dbentry_counters(result);
+    /* Set return status if requested */
+    if (status)
+    {
+        if (!lock_acquired)
+        {
+            Assert(nowait);
+            *status = PGSTAT_TABLE_LOCK_FAILED;
+        }
+        else if (!found)
+            *status = PGSTAT_TABLE_NOT_FOUND;
+        else
+            *status = PGSTAT_TABLE_FOUND;
+    }
 
     return result;
 }
@@ -4631,11 +4697,7 @@ pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create)
 
 /* ----------
  * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ *        Write the global statistics file, as well as DB files.
  * ----------
  */
 static void
@@ -4649,6 +4711,9 @@ pgstat_write_statsfiles(void)
     const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
 
+    /* should only be called in the stats collector */
+    Assert(pgStatRunningInCollector);
+
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
     /*
@@ -4691,7 +4756,7 @@ pgstat_write_statsfiles(void)
     /*
      * Walk through the database table.
      */
-    dshash_seq_init(&hstat, db_stats, false);
+    dshash_seq_init(&hstat, db_stats, false, false);
     while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&hstat)) != NULL)
     {
         /*
@@ -4744,13 +4809,6 @@ pgstat_write_statsfiles(void)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
 }
 
 /*
@@ -4797,6 +4855,8 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry)
     char        statfile[MAXPGPATH];
     dshash_table *tbl;
 
+    Assert(pgStatRunningInCollector);
+
     get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
     get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
 
@@ -4826,7 +4886,7 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry)
      * Walk through the database's access stats per table.
      */
     tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
-    dshash_seq_init(&tstat, tbl, false);
+    dshash_seq_init(&tstat, tbl, false, false);
     while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&tstat)) != NULL)
     {
         fputc('T', fpout);
@@ -4839,7 +4899,7 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry)
      * Walk through the database's function stats table.
      */
     tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
-    dshash_seq_init(&fstat, tbl, false);
+    dshash_seq_init(&fstat, tbl, false, false);
     while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&fstat)) != NULL)
     {
         fputc('F', fpout);
@@ -4932,7 +4992,7 @@ pgstat_read_statsfiles(void)
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
@@ -4945,7 +5005,7 @@ pgstat_read_statsfiles(void)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
@@ -4955,7 +5015,7 @@ pgstat_read_statsfiles(void)
      */
     if (fread(shared_globalStats, 1, sizeof(shared_globalStats), fpin) != sizeof(shared_globalStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         memset(shared_globalStats, 0, sizeof(*shared_globalStats));
         goto done;
@@ -4975,7 +5035,7 @@ pgstat_read_statsfiles(void)
      */
     if (fread(shared_archiverStats, 1, sizeof(shared_archiverStats), fpin) != sizeof(shared_archiverStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
         goto done;
@@ -4997,7 +5057,7 @@ pgstat_read_statsfiles(void)
                 if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
                           fpin) != offsetof(PgStat_StatDBEntry, tables))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5012,7 +5072,7 @@ pgstat_read_statsfiles(void)
                 if (found)
                 {
                     dshash_release_lock(db_stats, dbentry);
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5021,6 +5081,8 @@ pgstat_read_statsfiles(void)
                 memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
                 dbentry->tables = DSM_HANDLE_INVALID;
                 dbentry->functions = DSM_HANDLE_INVALID;
+                dbentry->snapshot_tables = NULL;
+                dbentry->snapshot_functions = NULL;
 
                 /*
                  * In the collector, disregard the timestamp we read from the
@@ -5028,7 +5090,6 @@ pgstat_read_statsfiles(void)
                  * stats file immediately upon the first request from any
                  * backend.
                  */
-                Assert(pgStatRunningInCollector);
                 dbentry->stats_timestamp = 0;
 
                 /*
@@ -5051,7 +5112,7 @@ pgstat_read_statsfiles(void)
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5127,7 +5188,7 @@ pgstat_read_db_statsfile(Oid databaseid,
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
@@ -5140,7 +5201,7 @@ pgstat_read_db_statsfile(Oid databaseid,
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
@@ -5160,7 +5221,7 @@ pgstat_read_db_statsfile(Oid databaseid,
                 if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
                           fpin) != sizeof(PgStat_StatTabEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5179,7 +5240,7 @@ pgstat_read_db_statsfile(Oid databaseid,
                 if (found)
                 {
                     dshash_release_lock(tabhash, tabentry);
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5196,7 +5257,7 @@ pgstat_read_db_statsfile(Oid databaseid,
                 if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
                           fpin) != sizeof(PgStat_StatFuncEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5214,7 +5275,7 @@ pgstat_read_db_statsfile(Oid databaseid,
 
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5231,7 +5292,7 @@ pgstat_read_db_statsfile(Oid databaseid,
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5439,7 +5500,7 @@ snapshot_statentry_all(const char *hashname,
     dest = create_local_stats_hash(hashname,
                                     keysize, entrysize, num_entries);
 
-    dshash_seq_init(&s, t, true);
+    dshash_seq_init(&s, t, true, false);
     while ((ps = dshash_seq_next(&s)) != NULL)
     {
         bool found;
@@ -5473,8 +5534,6 @@ backend_snapshot_global_stats(void)
     if (snapshot_globalStats)
         return true;
 
-    Assert(!pgStatRunningInCollector);
-
     /* Attached shared memory lives for the process lifetime */
     oldcontext = MemoryContextSwitchTo(TopMemoryContext);
     if (!pgstat_attach_shared_stats())
@@ -5519,25 +5578,31 @@ backend_snapshot_global_stats(void)
 }
 
 /* ----------
- * backend_get_db_entry() -
+ * pgstat_fetch_stat_dbentry() -
  *
  *    Find database stats entry on backends. The returned entries are cached
  *    until transaction end. If onshot is true, they are not cached and returned
  *    in a palloc'ed memory.
  */
 PgStat_StatDBEntry *
-backend_get_db_entry(Oid dbid, bool oneshot)
+pgstat_fetch_stat_dbentry(Oid dbid, bool oneshot)
 {
     /* take a local snapshot if we don't have one */
     char *hashname = "local database stats hash";
+    PgStat_StatDBEntry *dbentry;
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
 
     /* If not done for this transaction, take a snapshot of global stats */
     if (!backend_snapshot_global_stats())
         return NULL;
 
-    return snapshot_statentry(oneshot ? NULL : &snapshot_db_stats,
-                              hashname, db_stats, 0, &dsh_dbparams,
-                              dbid);
+    dbentry = snapshot_statentry(oneshot ? NULL : &snapshot_db_stats,
+                                 hashname, db_stats, 0, &dsh_dbparams,
+                                 dbid);
+
+    return dbentry;
 }
 
 /* ----------
@@ -5551,6 +5616,9 @@ backend_snapshot_all_db_entries(void)
     /* take a local snapshot if we don't have one */
     char *hashname = "local database stats hash";
 
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
     /* If not done for this transaction, take a snapshot of global stats */
     if (!backend_snapshot_global_stats())
         return NULL;
@@ -5570,6 +5638,10 @@ backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid reloid, bool oneshot)
 {
     /* take a local snapshot if we don't have one */
     char *hashname = "local table stats hash";
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
     return snapshot_statentry(oneshot ? NULL : &dbent->snapshot_tables,
                               hashname, NULL, dbent->tables, &dsh_tblparams,
                               reloid);
@@ -5586,6 +5658,10 @@ static PgStat_StatFuncEntry *
 backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot)
 {
     char *hashname = "local table stats hash";
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
     return snapshot_statentry(oneshot ? NULL : &dbent->snapshot_tables,
                               hashname, NULL, dbent->functions, &dsh_funcparams,
                               funcid);
@@ -5640,627 +5716,103 @@ pgstat_clear_snapshot(void)
 }
 
 
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
+static bool
+pgstat_update_tabentry(dshash_table *tabhash, PgStat_TableStatus *stat,
+                       bool nowait)
 {
-    dshash_table *tabhash;
-    PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
+    bool    found;
 
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
+    if (tabhash == NULL)
+        return false;
 
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
+    tabentry = (PgStat_StatTabEntry *)
+        dshash_find_or_insert_extended(tabhash, (void *) &(stat->t_id),
+                                       &found, nowait);
 
-    tabhash = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
+    /* failed to acquire lock */
+    if (tabentry == NULL)
+        return false;
+
+    if (!found)
     {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *)
-            dshash_find_or_insert(tabhash, (void *) &(tabmsg->t_id), &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-        dshash_release_lock(tabhash, tabentry);
-
         /*
-         * Add per-table stats to the per-database entry, too.
+         * If it's a new table entry, initialize counters to the values we
+         * just got.
          */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
+        tabentry->numscans = stat->t_counts.t_numscans;
+        tabentry->tuples_returned = stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched = stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted = stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated = stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted = stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated = stat->t_counts.t_tuples_hot_updated;
+        tabentry->n_live_tuples = stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples = stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze = stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched = stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit = stat->t_counts.t_blocks_hit;
 
-    dshash_release_lock(db_stats, dbentry);
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    dshash_table *tbl;
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || dbentry->tables == DSM_HANDLE_INVALID)
-    {
-        if (dbentry)
-            dshash_release_lock(db_stats, dbentry);
-        return;
-    }
-
-    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) dshash_delete_key(tbl, (void *) &(msg->m_tableid[i]));
-    }
-
-    dshash_release_lock(db_stats, dbentry);
-
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        if (dbentry->tables != DSM_HANDLE_INVALID)
-        {
-            dshash_table *tbl =
-                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
-            dshash_destroy(tbl);
-        }
-        if (dbentry->functions != DSM_HANDLE_INVALID)
-        {
-            dshash_table *tbl =
-                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
-            dshash_destroy(tbl);
-        }
-
-        dshash_delete_entry(db_stats, (void *)dbentry);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != DSM_HANDLE_INVALID)
-    {
-        dshash_table *t =
-            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
-        dshash_destroy(t);
-        dbentry->tables = DSM_HANDLE_INVALID;
-    }
-    if (dbentry->functions != DSM_HANDLE_INVALID)
-    {
-        dshash_table *t =
-            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
-        dshash_destroy(t);
-        dbentry->functions = DSM_HANDLE_INVALID;
-    }
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-
-    dshash_release_lock(db_stats, dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetshared() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&shared_globalStats, 0, sizeof(shared_globalStats));
-        shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&shared_archiverStats, 0, sizeof(shared_archiverStats));
-        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-    {
-        dshash_table *t =
-            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
-        dshash_delete_key(t, (void *) &(msg->m_objectid));
-    }
-    else if (msg->m_resettype == RESET_FUNCTION)
-    {
-        dshash_table *t =
-            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
-        dshash_delete_key(t, (void *) &(msg->m_objectid));
-    }
-
-    dshash_release_lock(db_stats, dbentry);
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-
-    dshash_release_lock(db_stats, dbentry);
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    dshash_table *table;
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
-    tabentry = pgstat_get_tab_entry(table, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
+        tabentry->vacuum_timestamp = 0;
+        tabentry->vacuum_count = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_count = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->analyze_count = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+        tabentry->autovac_analyze_count = 0;
     }
     else
     {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-    dshash_release_lock(table, tabentry);
-    dshash_detach(table);
-    dshash_release_lock(db_stats, dbentry);
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    dshash_table *table;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
-    tabentry = pgstat_get_tab_entry(table, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-    dshash_release_lock(table, tabentry);
-    dshash_detach(table);
-    dshash_release_lock(db_stats, dbentry);
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++shared_archiverStats->failed_count;
-        memcpy(shared_archiverStats->last_failed_wal, msg->m_xlog,
-               sizeof(shared_archiverStats->last_failed_wal));
-        shared_archiverStats->last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++shared_archiverStats->archived_count;
-        memcpy(shared_archiverStats->last_archived_wal, msg->m_xlog,
-               sizeof(shared_archiverStats->last_archived_wal));
-        shared_archiverStats->last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    shared_globalStats->timed_checkpoints += msg->m_timed_checkpoints;
-    shared_globalStats->requested_checkpoints += msg->m_requested_checkpoints;
-    shared_globalStats->checkpoint_write_time += msg->m_checkpoint_write_time;
-    shared_globalStats->checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    shared_globalStats->buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    shared_globalStats->buf_written_clean += msg->m_buf_written_clean;
-    shared_globalStats->maxwritten_clean += msg->m_maxwritten_clean;
-    shared_globalStats->buf_written_backend += msg->m_buf_written_backend;
-    shared_globalStats->buf_fsync_backend += msg->m_buf_fsync_backend;
-    shared_globalStats->buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-
-    dshash_release_lock(db_stats, dbentry);
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-
-    dshash_release_lock(db_stats, dbentry);
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-
-    dshash_release_lock(db_stats, dbentry);
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    dshash_table *t;
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    t = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *)
-            dshash_find_or_insert(t, (void *) &(funcmsg->f_id), &found);
-
-        if (!found)
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        tabentry->numscans += stat->t_counts.t_numscans;
+        tabentry->tuples_returned += stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched += stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted += stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated += stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted += stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated += stat->t_counts.t_tuples_hot_updated;
+        /* If table was truncated, first reset the live/dead counters */
+        if (stat->t_counts.t_truncated)
         {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
+            tabentry->n_live_tuples = 0;
+            tabentry->n_dead_tuples = 0;
         }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-        dshash_release_lock(t, funcentry);
+        tabentry->n_live_tuples += stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples += stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze += stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched += stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit += stat->t_counts.t_blocks_hit;
     }
 
-    dshash_detach(t);
-    dshash_release_lock(db_stats, dbentry);
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
+    
+    dshash_release_lock(tabhash, tabentry);
+
+    return true;
 }
 
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
 static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
+pgstat_update_dbentry(PgStat_StatDBEntry *dbentry, PgStat_TableStatus *stat)
 {
-    dshash_table *t;
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
     /*
-     * No need to purge if we don't even know the database.
+     * Add per-table stats to the per-database entry, too.
      */
-    if (!dbentry || dbentry->functions == DSM_HANDLE_INVALID)
-        return;
-
-    t = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        dshash_delete_key(t, (void *) &(msg->m_functionid[i]));
-    }
-    dshash_detach(t);
-    dshash_release_lock(db_stats, dbentry);
+    dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
+    dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
+    dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
+    dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
+    dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
+    dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
+    dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
 }
 
+
 /*
  * Convert a potentially unsafely truncated activity string (see
  * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 54f39ec9bb..798df0767d 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -4969,6 +4969,16 @@ SubPostmasterMain(int argc, char *argv[])
 
         PgArchiverMain(argc, argv); /* does not return */
     }
+    if (strcmp(argv[1], "--forkcol") == 0)
+    {
+        /* Need a PGPROC to run CreateSharedMemoryAndSemaphores */
+        InitProcess();
+
+        /* Attach process to shared data structures */
+        CreateSharedMemoryAndSemaphores(false, 0);
+
+        PgstatCollectorMain(argc, argv);    /* does not return */
+    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 6e420d893c..862582da23 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -127,7 +127,7 @@ finish_sync_worker(void)
     if (IsTransactionState())
     {
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(true);
     }
 
     /* And flush all writes. */
@@ -143,6 +143,9 @@ finish_sync_worker(void)
     /* Find the main apply worker and signal it. */
     logicalrep_worker_wakeup(MyLogicalRepWorker->subid, InvalidOid);
 
+    /* clean up retained statistics */
+    pgstat_update_stat(true);
+    
     /* Stop gracefully */
     proc_exit(0);
 }
@@ -533,7 +536,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
     if (started_tx)
     {
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(false);
     }
 }
 
@@ -876,7 +879,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
                                            MyLogicalRepWorker->relstate,
                                            MyLogicalRepWorker->relstate_lsn);
                 CommitTransactionCommand();
-                pgstat_report_stat(false);
+                pgstat_update_stat(false);
 
                 /*
                  * We want to do the table data sync in a single transaction.
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index e247539d7b..9719257793 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -491,7 +491,7 @@ apply_handle_commit(StringInfo s)
         replorigin_session_origin_timestamp = commit_data.committime;
 
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(false);
 
         store_flush_position(commit_data.end_lsn);
     }
@@ -1324,6 +1324,8 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
             }
 
             send_feedback(last_received, requestReply, requestReply);
+
+            pgstat_update_stat(false);
         }
     }
 }
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 01eabe5706..e794a81c4c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1984,7 +1984,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                BgWriterStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2092,7 +2092,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2282,7 +2282,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2290,7 +2290,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index 9629f22f7a..ff0ad4cce3 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -197,6 +197,15 @@ dsm_postmaster_startup(PGShmemHeader *shim)
     dsm_control->maxitems = maxitems;
 }
 
+/*
+ * clear dsm_init state on child start.
+ */
+void
+dsm_child_init(void)
+{
+    dsm_init_done = false;
+}
+
 /*
  * Determine whether the control segment from the previous postmaster
  * invocation still exists.  If so, remove the dynamic shared memory
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index e4c6e3d406..60296a4cef 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3123,6 +3123,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_update_stat(true);
+    }
 }
 
 
@@ -3697,6 +3703,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4137,9 +4144,17 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
-                ProcessCompletedNotifies();
-                pgstat_report_stat(false);
+                long stats_timeout;
 
+                ProcessCompletedNotifies();
+
+                stats_timeout = pgstat_update_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4174,7 +4189,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4182,6 +4197,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index e95e347184..a5b3323f12 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -33,7 +33,7 @@
 #define UINT32_ACCESS_ONCE(var)         ((uint32)(*((volatile uint32 *)&(var))))
 
 /* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
+extern PgStat_BgWriter bgwriterStats;
 
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
@@ -1176,7 +1176,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_xact_commit);
@@ -1192,7 +1192,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_xact_rollback);
@@ -1208,7 +1208,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_blocks_fetched);
@@ -1224,7 +1224,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_blocks_hit);
@@ -1240,7 +1240,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_returned);
@@ -1256,7 +1256,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_fetched);
@@ -1272,7 +1272,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_inserted);
@@ -1288,7 +1288,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_updated);
@@ -1304,7 +1304,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_deleted);
@@ -1319,7 +1319,7 @@ pg_stat_get_db_stat_reset_time(PG_FUNCTION_ARGS)
     TimestampTz result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = dbentry->stat_reset_timestamp;
@@ -1337,7 +1337,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = dbentry->n_temp_files;
@@ -1353,7 +1353,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = dbentry->n_temp_bytes;
@@ -1368,7 +1368,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_tablespace);
@@ -1383,7 +1383,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_lock);
@@ -1398,7 +1398,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_snapshot);
@@ -1413,7 +1413,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_bufferpin);
@@ -1428,7 +1428,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_startup_deadlock);
@@ -1443,7 +1443,7 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (
@@ -1463,7 +1463,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_deadlocks);
@@ -1479,7 +1479,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     PgStat_StatDBEntry *dbentry;
 
     /* convert counter from microsec to millisec for display */
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = ((double) dbentry->n_block_read_time) / 1000.0;
@@ -1495,7 +1495,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     PgStat_StatDBEntry *dbentry;
 
     /* convert counter from microsec to millisec for display */
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = ((double) dbentry->n_block_write_time) / 1000.0;
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 5971310aab..73fa583425 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 4f1d2a0d28..1e4fa89135 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -72,6 +72,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -628,6 +629,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1239,6 +1242,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 433d1ed0eb..9cd31fae7b 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -81,6 +81,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index afc1927250..5db631c54c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -39,31 +39,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -112,13 +87,6 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_blocks_hit;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -177,242 +145,23 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
 /* ----------
- * PgStat_MsgHdr                The common message header
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgHdr
+typedef struct PgStat_BgWriter
 {
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter buf_alloc;
+    PgStat_Counter checkpoint_write_time; /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+} PgStat_BgWriter;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
@@ -453,78 +202,6 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgAutovacStart msg_autovacuum;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
  * Statistic collector data structures follow
  *
@@ -1111,7 +788,7 @@ extern char *pgstat_stat_filename;
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1123,9 +800,6 @@ extern PgStat_Counter pgStatBlockWriteTime;
  * Functions called from postmaster
  * ----------
  */
-extern Size BackendStatusShmemSize(void);
-extern void CreateSharedBackendStatus(void);
-
 extern void pgstat_init(void);
 extern int    pgstat_start(void);
 extern void pgstat_reset_all(void);
@@ -1135,17 +809,14 @@ extern void allow_immediate_pgstat_restart(void);
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_update_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
-extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
-extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
-
+extern void pgstat_reset_shared_counters(const char *target);
+extern void pgstat_reset_single_counter(Oid objectid,
+                                        PgStat_Single_Reset_Type type);
 extern void pgstat_report_autovac(Oid dboid);
 extern void pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples);
@@ -1156,6 +827,8 @@ extern void pgstat_report_analyze(Relation rel,
 extern void pgstat_report_recovery_conflict(int reason);
 extern void pgstat_report_deadlock(void);
 
+extern void pgstat_clear_snapshot(void);
+
 extern void pgstat_initialize(void);
 extern void pgstat_bestart(void);
 
@@ -1305,15 +978,15 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                           void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
+extern void pgstat_update_archiver(const char *xlog, bool failed);
+extern void pgstat_update_bgwriter(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
  * generate the pgstat* views.
  * ----------
  */
-extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
+extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid relid, bool oneshot);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
@@ -1322,9 +995,13 @@ extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
 
-/* Main loop */
-extern void PgstatCollectorMain(void) pg_attribute_noreturn();
+/* File input/output functions  */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
 
+/* For shared memory allocation/initialize */
+extern Size BackendStatusShmemSize(void);
+extern void CreateSharedBackendStatus(void);
 extern Size StatsShmemSize(void);
 extern void StatsShmemInit(void);
 
diff --git a/src/include/storage/dsm.h b/src/include/storage/dsm.h
index 169de946f7..3c3da7ace6 100644
--- a/src/include/storage/dsm.h
+++ b/src/include/storage/dsm.h
@@ -26,6 +26,7 @@ typedef struct dsm_segment dsm_segment;
 struct PGShmemHeader;            /* avoid including pg_shmem.h */
 extern void dsm_cleanup_using_control_segment(dsm_handle old_control_handle);
 extern void dsm_postmaster_startup(struct PGShmemHeader *);
+extern void dsm_child_init(void);
 extern void dsm_backend_shutdown(void);
 extern void dsm_detach_all(void);
 
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index dcc7307c16..b8a56645b6 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
diff --git a/src/test/modules/worker_spi/worker_spi.c b/src/test/modules/worker_spi/worker_spi.c
index 0d705a3f2e..da488ebfd4 100644
--- a/src/test/modules/worker_spi/worker_spi.c
+++ b/src/test/modules/worker_spi/worker_spi.c
@@ -295,7 +295,7 @@ worker_spi_main(Datum main_arg)
         SPI_finish();
         PopActiveSnapshot();
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(false);
         pgstat_report_activity(STATE_IDLE, NULL);
     }
 
-- 
2.16.3

From 1738edcfed77f5cb709f3c200f1db2d2837d853d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 3 Oct 2018 17:10:46 +0900
Subject: [PATCH 08/10] Allow deletion during dshash sequential scan

The previous version of dshash_seq_scan did not allow deletion while
scanning. This patch allows it, in the same way as dynahash does.
---
 src/backend/lib/dshash.c | 89 +++++++++++++++++++++++++++++++++++-------------
 src/include/lib/dshash.h |  4 ++-
 2 files changed, 68 insertions(+), 25 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 7584931515..e71d189ab6 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -112,6 +112,7 @@ struct dshash_table
     size_t        size_log2;        /* log2(number of buckets) */
     bool        find_locked;    /* Is any partition lock held by 'find'? */
     bool        find_exclusively_locked;    /* ... exclusively? */
+    bool        seqscan_running;/* now under sequential scan */
 };
 
 /* Given a pointer to an item, find the entry (user data) it holds. */
@@ -153,6 +154,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -228,6 +233,7 @@ dshash_create(dsa_area *area, const dshash_parameters *params, void *arg)
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
 
     /*
      * Set up the initial array of buckets.  Our initial size is the same as
@@ -279,6 +285,7 @@ dshash_attach(dsa_area *area, const dshash_parameters *params,
     hash_table->control = dsa_get_address(area, control);
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
     Assert(hash_table->control->magic == DSHASH_MAGIC);
 
     /*
@@ -599,9 +606,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
                                 LW_EXCLUSIVE));
 
     delete_item(hash_table, item);
-    hash_table->find_locked = false;
-    hash_table->find_exclusively_locked = false;
-    LWLockRelease(PARTITION_LOCK(hash_table, partition));
+
+    /* We need to keep partition lock while sequential scan */
+    if (!hash_table->seqscan_running)
+    {
+        hash_table->find_locked = false;
+        hash_table->find_exclusively_locked = false;
+        LWLockRelease(PARTITION_LOCK(hash_table, partition));
+    }
 }
 
 /*
@@ -618,6 +630,8 @@ dshash_release_lock(dshash_table *hash_table, void *entry)
     Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition_index),
                                 hash_table->find_exclusively_locked
                                 ? LW_EXCLUSIVE : LW_SHARED));
+    /* don't allow during sequential scan */
+    Assert(!hash_table->seqscan_running);
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
@@ -664,14 +678,20 @@ dshash_memhash(const void *v, size_t size, void *arg)
  */
 void
 dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
-                bool consistent)
+                bool consistent, bool exclusive)
 {
+    /* allowed at most one scan at once */
+    Assert(!hash_table->seqscan_running);
+
     status->hash_table = hash_table;
     status->curbucket = 0;
     status->nbuckets = ((size_t) 1) << hash_table->control->size_log2;
     status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
     status->curpartition = -1;
     status->consistent = consistent;
+    status->exclusive = exclusive;
+    hash_table->seqscan_running = true;
 
     /*
      * Protect all partitions from modification if the caller wants a
@@ -685,7 +705,8 @@ dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
         {
             Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i)));
 
-            LWLockAcquire(PARTITION_LOCK(hash_table, i), LW_SHARED);
+            LWLockAcquire(PARTITION_LOCK(hash_table, i),
+                          exclusive ? LW_EXCLUSIVE : LW_SHARED);
         }
     }
     ensure_valid_bucket_pointers(hash_table);
@@ -696,17 +717,29 @@ dshash_seq_next(dshash_seq_status *status)
 {
     dsa_pointer next_item_pointer;
 
+    Assert(status->hash_table->seqscan_running);
     if (status->curitem == NULL)
     {
+        int partition;
+
         Assert (status->curbucket == 0);
         Assert(!status->hash_table->find_locked);
 
         /* first shot. grab the first item. */
+        if (!status->consistent)
+        {
+            partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+            LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            status->curpartition = partition;
+        }
+        
         next_item_pointer = status->hash_table->buckets[status->curbucket];
-        status->hash_table->find_locked = true;
     }
     else
-        next_item_pointer = status->curitem->next;
+        next_item_pointer = status->pnextitem;
 
     /* Move to the next bucket if we finished the current bucket */
     while (!DsaPointerIsValid(next_item_pointer))
@@ -717,42 +750,50 @@ dshash_seq_next(dshash_seq_status *status)
             dshash_seq_term(status);
             return NULL;
         }
-        Assert(status->hash_table->find_locked);
 
-        next_item_pointer = status->hash_table->buckets[status->curbucket];
-
-        /*
-         * we need a lock on the scanning partition even if the caller don't
-         * requested a consistent snapshot.
-         */
-        if (!status->consistent && DsaPointerIsValid(next_item_pointer))
+        /* Also move parititon lock if needed */
+        if (!status->consistent)
         {
-            dshash_table_item  *item = dsa_get_address(status->hash_table->area,
-                                                       next_item_pointer);
-            int next_partition = PARTITION_FOR_HASH(item->hash);
+            int next_partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+
+            /* Move lock along with partition for the bucket */
             if (status->curpartition != next_partition)
             {
-                if (status->curpartition >= 0)
-                    LWLockRelease(PARTITION_LOCK(status->hash_table,
-                                                 status->curpartition));
+                LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                             status->curpartition));
                 LWLockAcquire(PARTITION_LOCK(status->hash_table,
                                              next_partition),
-                              LW_SHARED);
+                              status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
                 status->curpartition = next_partition;
             }
         }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
     }
 
     status->curitem =
         dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * This item can be deleted by the caller. Store the next item for the
+     * next iteration for the occasion.
+     */
+    status->pnextitem = status->curitem->next;
+
     return ENTRY_FROM_ITEM(status->curitem);
 }
 
 void
 dshash_seq_term(dshash_seq_status *status)
 {
-    Assert(status->hash_table->find_locked);
+    Assert(status->hash_table->seqscan_running);
     status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+    status->hash_table->seqscan_running = false;
 
     if (status->consistent)
     {
@@ -773,7 +814,7 @@ dshash_get_num_entries(dshash_table *hash_table)
     void *p;
     int n = 0;
 
-    dshash_seq_init(&s, hash_table, false);
+    dshash_seq_init(&s, hash_table, false, false);
     while ((p = dshash_seq_next(&s)) != NULL)
         n++;
 
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b207585eeb..1b3e01000b 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -71,8 +71,10 @@ typedef struct dshash_seq_status
     int                    curbucket;
     int                    nbuckets;
     dshash_table_item  *curitem;
+    dsa_pointer            pnextitem;
     int                    curpartition;
     bool                consistent;
+    bool                exclusive;
 } dshash_seq_status;
 
 /* Creating, sharing and destroying from hash tables. */
@@ -101,7 +103,7 @@ extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
 /* seq scan support */
 extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
-                            bool exclusive);
+                            bool consistent, bool exclusive);
 extern void *dshash_seq_next(dshash_seq_status *status);
 extern void dshash_seq_term(dshash_seq_status *status);
 extern int dshash_get_num_entries(dshash_table *hash_table);
-- 
2.16.3
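
Note on the technique: the patch keeps the scan valid across a deletion by saving the successor of the current item (pnextitem) before handing the item to the caller. The following is a minimal, self-contained sketch of that idea on a toy singly linked chain. It is not dshash code: item, scan_status, scan_init(), scan_next(), scan_delete_current() and the other names are hypothetical, and partition locking is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy chain item; stands in for one hash bucket chain, not dshash itself. */
typedef struct item
{
    int          value;
    struct item *next;
} item;

/* Hypothetical scan state, loosely mirroring curitem/pnextitem. */
typedef struct scan_status
{
    item      **head;       /* chain being scanned */
    item       *curitem;    /* item most recently returned, or NULL */
    item       *nextitem;   /* successor saved before returning curitem */
    int         started;    /* has scan_next() been called yet? */
} scan_status;

static void
scan_init(scan_status *s, item **head)
{
    s->head = head;
    s->curitem = NULL;
    s->nextitem = NULL;
    s->started = 0;
}

static item *
scan_next(scan_status *s)
{
    item       *it = s->started ? s->nextitem : *s->head;

    s->started = 1;
    s->curitem = it;
    /* Save the successor now: the caller may free 'it' before the next call. */
    s->nextitem = (it != NULL) ? it->next : NULL;
    return it;
}

/* Unlink and free the item most recently returned by scan_next(). */
static void
scan_delete_current(scan_status *s)
{
    item      **link = s->head;

    assert(s->curitem != NULL);
    while (*link != s->curitem)
        link = &(*link)->next;
    *link = s->curitem->next;
    free(s->curitem);
    s->curitem = NULL;
}

/* Prepend a new item and return the new head. */
static item *
push(item *head, int value)
{
    item       *it = malloc(sizeof(item));

    if (it == NULL)
        abort();
    it->value = value;
    it->next = head;
    return it;
}

/* Delete every even value while scanning; return how many items remain. */
static int
delete_even_values(item **head)
{
    scan_status s;
    item       *it;
    int         remaining = 0;

    scan_init(&s, head);
    while ((it = scan_next(&s)) != NULL)
    {
        if (it->value % 2 == 0)
            scan_delete_current(&s);
        else
            remaining++;
    }
    return remaining;
}

/* Sum the values left in the chain. */
static int
chain_sum(const item *head)
{
    int         sum = 0;

    for (; head != NULL; head = head->next)
        sum += head->value;
    return sum;
}
```

Because scan_next() never dereferences curitem after returning it, the caller can free the current item without disturbing the iteration. Deleting any other item during the scan, in particular the saved successor, would still be unsafe in this sketch.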


Re: shared-memory based stats collector

From
Antonin Houska
Date:
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

> This is a saner version of the previous v5-0008, which didn't pass
> regression test. v6-0008 to v6-0010 are attached and they are
> applied on top of v5-0001-0007.
>
> - stats collector has been removed.
>
> - modified dshash further so that deletion is allowed during
>   sequential scan.
>
> - I'm not sure about the following existing comment at the
>   beginning of pgstat.c
>
>   *    - Add a pgstat config column to pg_database, so this
>   *      entire thing can be enabled/disabled on a per db basis.

Following is the next handful of my comments:

* If you remove the stats collector, I think the remaining code in pgstat.c
  no longer fits into the backend/postmaster/ directory.

* I'm not sure it's o.k. to call pgstat_write_statsfiles() from
  postmaster.c:reaper(): the function can raise ERROR (I see at least one code
  path: pgstat_write_statsfile() -> get_dbstat_filename()) and, as reaper() is
  a signal handler, it's hard to imagine the consequences. Maybe a reason to
  leave some functionality in a separate worker, although the "stats
  collector" would have to be changed.

* The question still remains whether all the statistics should be loaded into
  shared memory, see the note on paging near the bottom of [1].

* If dshash_seq_init() is passed consistent=false, shouldn't we call
  ensure_valid_bucket_pointers() also from dshash_seq_next()? If the scan
  needs to access the next partition and the old partition lock got released,
  the table can be resized before the next partition lock is acquired, and
  thus the backend-local copy of buckets becomes obsolete.

* Neither snapshot_statentry_all() nor backend_snapshot_all_db_entries() seems
  to be used in the current patch version.

* pgstat_initstats(): I think WaitLatch() should be used instead of sleep().

* pgstat_get_db_entry(): "return false" should probably be "return NULL".

* Is the PGSTAT_TABLE_WRITE flag actually used? Unlike PGSTAT_TABLE_CREATE, I
  couldn't find a place where its value is tested.

* dshash_seq_init(): does it need to be called with consistent=true from
  pgstat_vacuum_stat() when the entries returned by the scan are just
  dropped?

    dshash_seq_init(&dshstat, db_stats, true, true);

I suspect this is a thinko because another call from the same function looks
like

    dshash_seq_init(&dshstat, dshtable, false, true);

* I'm not sure about the usefulness of dshash_get_num_entries(). It passes
  consistent=false to dshash_seq_init(), so the number of entries can change
  during the execution. And even if the function requested a "consistent
  scan", entries can be added / removed as soon as the scan is over.

  If the returned value is used to decide whether the hashtable should be
  scanned or not, I think you don't break anything if you simply start the
  scan unconditionally and see if you find some entries.

  And if you really need to count the entries, I suggest that you use the
  per-partition counts (dshash_partition.count) instead of scanning individual
  entries.
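
To illustrate the last suggestion, here is a rough sketch of counting via per-partition counters instead of walking every entry. It uses a toy table, not the real dshash structures: toy_table, toy_partition, TOY_NUM_PARTITIONS and both functions are made up for this example, and the locking the real code would need is only indicated in comments.

```c
#include <stddef.h>

#define TOY_NUM_PARTITIONS 8    /* arbitrary size chosen for the sketch */

/* Toy partition: the real dshash_partition also holds a lock for 'count'. */
typedef struct toy_partition
{
    size_t      count;          /* number of entries in this partition */
} toy_partition;

typedef struct toy_table
{
    toy_partition partitions[TOY_NUM_PARTITIONS];
} toy_table;

/*
 * Sum the per-partition counters.  The cost is proportional to the number
 * of partitions, not to the number of entries.
 */
static size_t
toy_count_entries(const toy_table *table)
{
    size_t      n = 0;

    for (int i = 0; i < TOY_NUM_PARTITIONS; i++)
    {
        /* real code would take partition i's lock in shared mode here ... */
        n += table->partitions[i].count;
        /* ... and release it here */
    }
    return n;
}

/* Emptiness check that can stop at the first non-empty partition. */
static int
toy_is_empty(const toy_table *table)
{
    for (int i = 0; i < TOY_NUM_PARTITIONS; i++)
    {
        if (table->partitions[i].count > 0)
            return 0;
    }
    return 1;
}
```

As with the scan-based count, the result is only a snapshot: entries can be added or removed as soon as a partition's lock is released.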


[1] https://www.postgresql.org/message-id/CA+TgmobQVbz4K_+RSmiM9HeRKpy3vS5xnbkL95gSEnWijzprKQ@mail.gmail.com

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com


Re: shared-memory based stats collector

From
Tomas Vondra
Date:
Hi,

I've started looking at the patch over the past few days. I don't have 
any deep insights at this point, but there seems to be some sort of 
issue in pgstat_update_stat. When building using gcc, I do get this warning:

pgstat.c: In function ‘pgstat_update_stat’:
pgstat.c:648:18: warning: ‘now’ may be used uninitialized in this function [-Wmaybe-uninitialized]
    oldest_pending = now;
    ~~~~~~~~~~~~~~~^~~~~
PostgreSQL installation complete.


which kinda makes sense, because 'now' is set only in the (!force) 
branch. So if the very first call to pgstat_update_stat is with 
force=true, it's not set, and the code executes this:

     /* record oldest pending update time */
     if (pgStatPendingTabHash == NULL)
         oldest_pending = 0;
     else if (oldest_pending == 0)
         oldest_pending = now;

at which point we set "oldest_pending = now" with "now" containing some 
random garbage.
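
One way to avoid this would be to read the clock before the branch, so that 'now' is valid on every path. The sketch below shows only the shape of the problem and of that fix; it is not the patch's code. fake_current_timestamp(), have_pending and update_stat_sketch() are hypothetical placeholders standing in for GetCurrentTimestamp(), the pgStatPendingTabHash check and pgstat_update_stat().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t TimestampTz;

/* Hypothetical stand-in for GetCurrentTimestamp(); returns a fixed value. */
static TimestampTz
fake_current_timestamp(void)
{
    return 1000;
}

static TimestampTz oldest_pending = 0;

/* Stands in for "pgStatPendingTabHash != NULL". */
static bool have_pending = true;

/* Simplified shape of the update path, with 'now' always initialized. */
static TimestampTz
update_stat_sketch(bool force)
{
    /* Read the clock unconditionally so that every path sees a valid value. */
    TimestampTz now = fake_current_timestamp();

    if (!force)
    {
        /* rate-limiting checks that compare against 'now' would go here */
    }

    /* record oldest pending update time */
    if (!have_pending)
        oldest_pending = 0;
    else if (oldest_pending == 0)
        oldest_pending = now;

    /* With pending updates, the recorded time must now be a real timestamp. */
    assert(!have_pending || oldest_pending != 0);

    return oldest_pending;
}
```

With this shape, a first call with force=true records the current time rather than whatever happened to be on the stack.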

When running this under valgrind, I get a couple of warnings in this 
area of code - see the attached log with a small sample. Judging by the 
locations I assume those are related to the same issue, but I have not 
looked into that.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: shared-memory based stats collector

From
Tomas Vondra
Date:

On 10/05/2018 10:30 AM, Kyotaro HORIGUCHI wrote:
> Hello.
> 
> At Tue, 02 Oct 2018 16:06:51 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
> <20181002.160651.117284090.horiguchi.kyotaro@lab.ntt.co.jp>
>
>> It doesn't work nor even compile since I failed to include some
>> changes. The attached v6-0008 at least compiles and works.
>>
>> 0001-0007 are not attached since they are still applicable on
>> master head with offsets.
> 
> In this patchset 0001-0007 are still the same as in the previous
> version. I'll reorganize the whole patchset in the next version.
> 
> This is a saner version of the previous v5-0008, which didn't pass
> regression test. v6-0008 to v6-0010 are attached and they are
> applied on top of v5-0001-0007.
> 

BTW one more thing - I strongly recommend always attaching the whole
patch series, even if some of the parts did not change.

Firstly, it makes the reviewer's life much easier, because it's not
necessary to hunt through past messages for all the bits and resolve
potential conflicts (e.g. there are two 0008 in recent messages).

Secondly, it makes http://commitfest.cputube.org/ work - it tries to
apply patches from a single message, which fails when some of the parts
are omitted.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: shared-memory based stats collector

From
Kyotaro HORIGUCHI
Date:
Thank you for the comments, Antonin, Tomas.

At Tue, 30 Oct 2018 13:35:23 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
<b38591e0-54ca-7a27-813b-0bf91a204c5b@2ndquadrant.com>
> 
> 
> On 10/05/2018 10:30 AM, Kyotaro HORIGUCHI wrote:
> > Hello.
> > 
> > At Tue, 02 Oct 2018 16:06:51 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
> > <20181002.160651.117284090.horiguchi.kyotaro@lab.ntt.co.jp>
> >
> >> It doesn't work nor even compile since I failed to include some
> >> changes. The attached v6-0008 at least compiles and works.
> >>
> >> 0001-0007 are not attached since they are still applicable on
> >> master head with offsets.
> > 
> > In this patchset 0001-0007 are still the same as in the previous
> > version. I'll reorganize the whole patchset in the next version.
> > 
> > This is a saner version of the previous v5-0008, which didn't pass
> > regression test. v6-0008 to v6-0010 are attached and they are
> > applied on top of v5-0001-0007.
> > 
> 
> BTW one more thing - I strongly recommend always attaching the whole
> patch series, even if some of the parts did not change.
> 
> Firstly, it makes the reviewer's life much easier, because it's not
> necessary to hunt through past messages for all the bits and resolve
> potential conflicts (e.g. there are two 0008 in recent messages).
> 
> Secondly, it makes http://commitfest.cputube.org/ work - it tries to
> apply patches from a single message, which fails when some of the parts
> are omitted.

Yeah, I know about the second point. As for the first point, I'm
not sure which makes review easier: confirming the differences in all
files, or finding the unmodified portions upthread. But OK, I'll
follow your suggestion. I'll attach the whole patch set and add a
note about the differences.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: shared-memory based stats collector

From
Kyotaro HORIGUCHI
Date:
Thank you for the comments.

This message contains the whole refactored patch set.

At Mon, 29 Oct 2018 15:10:10 +0100, Antonin Houska <ah@cybertec.at> wrote in <28855.1540822210@localhost>
> Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> 
> > This is more saner version of previous v5-0008, which didn't pass
> > regression test. v6-0008 to v6-0010 are attached and they are
> > applied on top of v5-0001-0007.
> > 
> > - stats collector has been removed.
> > 
> > - modified dshash further so that deletion is allowed during
> >   sequential scan.
> > 
> > - I'm not sure about the following existing comment at the
> >   beginning of pgstat.c
> > 
> >   *    - Add a pgstat config column to pg_database, so this
> >   *      entire thing can be enabled/disabled on a per db basis.
> 
> Following is the next handful of my comments:
> 
> * If you remove the stats collector, I think the remaining code in pgstat.c
>   does no longer fit into the backend/postmaster/ directory.

I didn't consider that, but I still can't find a nice place among
the existing directories. backend/statistics may be it, but I feel
somewhat uneasy about that. In the end I split pgstat.c into two files
(as suggested in the file comment) and put both of them into a new
directory, backend/statmon. One part is the backend status facility,
named bestatus (BackEnd Status), and the other is pgstat, the access
statistics part. The last patch, 0008, does that. I tried to move
it earlier but it was a bit tough.

> * I'm not sure it's o.k. to call pgstat_write_statsfiles() from
>   postmaster.c:reaper(): the function can raise ERROR (I see at least one code
>   path: pgstat_write_statsfile() -> get_dbstat_filename()) and, as reaper() is
>   a signal handler, it's hard to imagine the consequences. Maybe a reason to
>   leave some functionality in a separate worker, although the "stats
>   collector" would have to be changed.

I was careless about that. longjmp() in a signal handler is
prohibited, so we mustn't emit ERROR there. In the first place, they
were in the wrong places from several perspectives. I changed
the way loading and storing are done. In the attached patch (0003),
loading is performed while initializing shared memory in the postmaster,
and storing is done in a shutdown hook in the postmaster. Since the
shared memory area is inherited by all children, no process actually
does the initial attaching any longer. In addition to that, the archiver
process became an auxiliary process (0004) since it writes to the
shared stats.
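
As an illustration of the constraint: a common way to keep error-raising
work out of a signal handler is to let the handler only set a flag and do
the real work from the normal control flow. The following is a minimal
sketch of that general pattern, not what the patch does (the patch moves
loading into shared memory initialization and storing into a shutdown
hook). write_stats_files(), process_pending_requests() and the use of
SIGUSR1 are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Set by the handler, consumed by the ordinary control flow. */
static volatile sig_atomic_t write_stats_requested = 0;

/* Hypothetical counter standing in for "stats files were written". */
static int stats_writes_done = 0;

/* Handler: only records that work is pending; no I/O, no error exits. */
static void
request_stats_write(int signo)
{
    (void) signo;
    write_stats_requested = 1;
}

/* Hypothetical work function; failing here is safe, we are not in a handler. */
static bool
write_stats_files(void)
{
    stats_writes_done++;
    return true;
}

/* Called from the normal control flow, e.g. once per main-loop iteration. */
static bool
process_pending_requests(void)
{
    if (!write_stats_requested)
        return false;
    write_stats_requested = 0;
    return write_stats_files();
}

/* Install the handler; returns false on failure. */
static bool
install_handler(void)
{
    return signal(SIGUSR1, request_stats_write) != SIG_ERR;
}
```

The handler touches nothing but a sig_atomic_t flag, so whatever
write_stats_files() does on failure never runs in signal context.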

> * Question still remains whether all the statistics should be loaded into
>   shared memory, see the note on paging near the bottom of [1].

Even counting possible page-in latency when writing stats, I agree
with what Robert said in that message: we will win on average
for users who don't create so many databases. If a stats page
to be written had been paged out, the related heap page would also
have been evicted from shared buffers (or the buffer page
itself may have been paged out), and every resource that can be
swapped out may have been swapped out. So I don't think it becomes a
serious problem. As for reading stats, we currently read a file
and sometimes wait for an up-to-date file to be made, so I think
that case needs no further discussion.

For a cluster with very many databases, a backend running on a
database will mainly see only the stats for the current database (and
for shared tables), so we can split stats by that criterion in the
next step.

> * if dshash_seq_init() is passed consistent=false, shouldn't we call
>   ensure_valid_bucket_pointers() also from dshash_seq_next()? If the scan
>   needs to access the next partition and the old partition lock got released,
>   the table can be resized before the next partition lock is acquired, and
>   thus the backend-local copy of buckets becomes obsolete.

Oops. You're right. In addition to that, resizing can happen while
dshash_seq_next moves the lock to the next partition. A resize
happening midway breaks sequential scan semantics. I added
ensure_valid_bucket_pointers() after the initial acquisition of the
partition lock, and the lock now moves seamlessly during the scan. (0001)
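
The "seamless" lock movement can be sketched as follows: the scan takes
the lock on the next partition before releasing the current one, so there
is never a moment without a partition lock held. This is a toy model
under stated assumptions, not the dshash code: the locks are plain flags
instead of LWLocks, and toy_table, toy_acquire(), toy_release() and
toy_seq_scan_sum() are hypothetical names.

```c
#include <assert.h>

#define NUM_PARTITIONS        4
#define BUCKETS_PER_PARTITION 2
#define NUM_BUCKETS           (NUM_PARTITIONS * BUCKETS_PER_PARTITION)

typedef struct toy_table
{
    int lock_held[NUM_PARTITIONS];  /* toy partition locks: 1 = held */
    int bucket_value[NUM_BUCKETS];  /* one value per bucket */
    int locks_held_now;             /* number of locks currently held */
    int scanning;                   /* 1 while a scan is in progress */
    int min_held_during_scan;       /* lowest lock count seen mid-scan */
} toy_table;

static void
toy_acquire(toy_table *t, int partition)
{
    assert(!t->lock_held[partition]);
    t->lock_held[partition] = 1;
    t->locks_held_now++;
}

static void
toy_release(toy_table *t, int partition)
{
    assert(t->lock_held[partition]);
    t->lock_held[partition] = 0;
    t->locks_held_now--;
    if (t->scanning && t->locks_held_now < t->min_held_during_scan)
        t->min_held_during_scan = t->locks_held_now;
}

/* Sum every bucket, moving the partition lock hand over hand. */
static int
toy_seq_scan_sum(toy_table *t)
{
    int sum = 0;
    int cur = 0;

    toy_acquire(t, cur);
    t->scanning = 1;
    t->min_held_during_scan = t->locks_held_now;

    for (int bucket = 0; bucket < NUM_BUCKETS; bucket++)
    {
        int next = bucket / BUCKETS_PER_PARTITION;

        if (next != cur)
        {
            /* Take the next lock first, then drop the current one. */
            toy_acquire(t, next);
            toy_release(t, cur);
            cur = next;
        }
        sum += t->bucket_value[bucket];
    }

    t->scanning = 0;
    toy_release(t, cur);
    return sum;
}
```

If the two calls in the middle were swapped, min_held_during_scan would
drop to 0, which corresponds to the window in which a resize could slip in.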

> * Neither snapshot_statentry_all() nor backend_snapshot_all_db_entries() seems
>   to be used in the current patch version.

Thanks. This is not used since we concluded that we no longer
need strict consistency in stats numbers. Removed. (0003)

> * pgstat_initstats(): I think WaitLatch() should be used instead of sleep().

Bgwriter and checkpointer waited for the postmaster's loading of
stats files. But I changed the startup sequence (as mentioned
above), so the wait became useless. Most of them are replaced
with Assert. (0003)

> * pgstat_get_db_entry(): "return false" should probably be "return NULL".

I don't find that. (Isn't it caught by compiler?) Maybe it is
"found = false"?  (it might be a bit tricky)

> * Is the PGSTAT_TABLE_WRITE flag actually used? Unlike PGSTAT_TABLE_CREATE, I
>   couldn't find a place where it's value is tested.

Thank you for finding that. As pointed out, PGSTAT_TABLE_WRITE ended
up unused since WRITE is always accompanied by CREATE in
the patch. I think WRITE is more readable than CREATE there, so I
removed CREATE. I renamed all PGSTAT_TABLE_ symbols as
follows while fixing this.

PGSTAT_TABLE_READ        -> PGSTAT_FETCH_SHARED
PGSTAT_TABLE_WRITE       -> PGSTAT_FETCH_EXCLUSIVE
PGSTAT_TABLE_NOWAIT      -> PGSTAT_FETCH_NOWAIT

PGSTAT_TABLE_NOT_FOUND   -> PGSTAT_ENTRY_NOT_FOUND
PGSTAT_TABLE_FOUND       -> PGSTAT_ENTRY_FOUND
PGSTAT_TABLE_LOCK_FAILED -> PGSTAT_LOCK_FAILED


> * dshash_seq_init(): does it need to be called with consistent=true from
>   pgstat_vacuum_stat() when the the entries returned by the scan are just
>   dropped?
> 
>     dshash_seq_init(&dshstat, db_stats, true, true);
> 
> I suspect this is a thinko because another call from the same function looks
> like
> 
>     dshash_seq_init(&dshstat, dshtable, false, true);

It's a leftover, which was just migrated from the previous (v4)
snapshot-based code. Snapshots had such consistency, but it
no longer looks useful. (0003)

As a result, consistent=false at all call sites, so I could remove
the parameter, but I'll leave it alone for a while.

> * I'm not sure about usefulness of dshash_get_num_entries(). It passes
>   consistent=false to dshash_seq_init(), so the number of entries can change
>   during the execution. And even if the function requested a "consistent
>   scan", entries can be added / removed as soon as the scan is over.
> 
>   If the returned value is used to decide whether the hashtable should be
>   scanned or not, I think you don't break anything if you simply start the
>   scan unconditionally and see if you find some entries.
>   And if you really need to count the entries, I suggest that you use the
>   per-partition counts (dshash_partition.count) instead of scanning individual
>   entries.

It is mainly to avoid a useless call to pgstat_collect_oid(). The
shortcut is useful because function stats are usually not collected.
Instead, I removed the function and now create the function stats
dshash on demand. Then I changed the condition "dshash_get_num_entries() >
0" to "dbentry->functions != DSM_HANDLE_INVALID". (0003)
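
The on-demand creation with an "invalid handle" check can be sketched
like this. It is a simplified stand-in, assuming a small static array in
place of dynamic shared memory; toy_handle, TOY_HANDLE_INVALID,
toy_db_entry and the functions below are hypothetical and only mirror the
shape of the "functions != DSM_HANDLE_INVALID" test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t toy_handle;
#define TOY_HANDLE_INVALID ((toy_handle) 0)

#define MAX_TABLES 8

/* Stand-in for a per-database function stats hash. */
typedef struct toy_func_table
{
    int nentries;
} toy_func_table;

static toy_func_table tables[MAX_TABLES];
static toy_handle next_handle = 1;      /* 0 is reserved as "invalid" */

/* Stand-in for a database stats entry holding the handle. */
typedef struct toy_db_entry
{
    toy_handle functions;
} toy_db_entry;

/* Return the function stats table, creating it only if asked to. */
static toy_func_table *
toy_get_func_table(toy_db_entry *db, bool create)
{
    if (db->functions == TOY_HANDLE_INVALID)
    {
        if (!create || next_handle >= MAX_TABLES)
            return NULL;
        db->functions = next_handle++;
        tables[db->functions].nentries = 0;
    }
    return &tables[db->functions];
}

/* Cheap presence test: no scan or entry count needed. */
static bool
toy_has_func_stats(const toy_db_entry *db)
{
    return db->functions != TOY_HANDLE_INVALID;
}
```

The presence test is a single comparison, and unlike an entry count it
cannot be made stale by entries being added or removed during a scan.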

> [1] https://www.postgresql.org/message-id/CA+TgmobQVbz4K_+RSmiM9HeRKpy3vS5xnbkL95gSEnWijzprKQ@mail.gmail.com

The new version of this patch is attached to my reply to Tomas's message.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: shared-memory based stats collector

From
Kyotaro HORIGUCHI
Date:
Hello. Thank you for looking at this.

At Tue, 30 Oct 2018 01:49:59 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
<5253d750-890b-069b-031f-2a9b73e47832@2ndquadrant.com>
> Hi,
> 
> I've started looking at the patch over the past few days. I don't have
> any deep insights at this point, but there seems to be some sort of
> issue in pgstat_update_stat. When building using gcc, I do get this
> warning:
> 
> pgstat.c: In function ‘pgstat_update_stat’:
> pgstat.c:648:18: warning: ‘now’ may be used uninitialized in this
> function [-Wmaybe-uninitialized]
>    oldest_pending = now;
>    ~~~~~~~~~~~~~~~^~~~~
> PostgreSQL installation complete.

Uggh! The reason is that "last_report = now" comes later
than the surrounding code... Fixed.

> When running this under valgrind, I get a couple of warnings in this
> area of code - see the attached log with a small sample. Judging by
> the locations I assume those are related to the same issue, but I have
> not looked into that.

There were several typos/thinkos related to pointers converted from
the original variables. There was code like the following in the
original code.

  memset(&shared_globalStats, 0, sizeof(shared_globalStats));

It was not fixed even though this patch changes the type of the
variable from PgStat_GlobalStats to (PgStat_GlobalStats *). As
a result, the major part of the variable remained uninitialized.
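
The effect of such a size mismatch can be reproduced in isolation. The
sketch below assumes the first argument already points at the struct and
only the size expression still refers to the pointer; toy_global_stats
and the function names are hypothetical, not the patch's types.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct toy_global_stats
{
    long counters[16];
} toy_global_stats;

/* Buggy: the size is that of the pointer, not of the pointed-to struct. */
static void
reset_stats_buggy(toy_global_stats *stats)
{
    size_t wrong_size = sizeof(stats);  /* sizeof(toy_global_stats *) */

    memset(stats, 0, wrong_size);
}

/* Fixed: take the size of what the pointer refers to. */
static void
reset_stats_fixed(toy_global_stats *stats)
{
    memset(stats, 0, sizeof(*stats));
}

/* Count bytes of the struct that are still non-zero. */
static size_t
count_nonzero_bytes(const toy_global_stats *stats)
{
    const unsigned char *p = (const unsigned char *) stats;
    size_t n = 0;

    for (size_t i = 0; i < sizeof(*stats); i++)
        if (p[i] != 0)
            n++;
    return n;
}
```

With the buggy version only the first pointer-sized chunk is cleared, and
everything after it keeps whatever the memory held before.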

I re-ran this version under valgrind and didn't see that kind of
problem. Thank you for the testing.



I refactored the patch into a complete set consisting of 8 files.

v7-0001-sequential-scan-for-dshash.patch
  - dshash sequential scan feature

v7-0002-Add-conditional-lock-feature-to-dshash.patch
  - dshash conditional lock feature

v7-0003-Shared-memory-based-stats-collector.patch
  - Shared-memory based stats collector.
    - Remove stats collector process
    - Change stats collector to be shared-memory based
  (This needs 0004 to work, but it is currently separate from
   this for readability)

v7-0004-Make-archiver-process-an-auxiliary-process.patch
  - Archiver process needs to touch shared memory.
    (I didn't check EXEC_BACKEND case)

v7-0005-Let-pg_stat_statements-not-to-use-PG_STAT_TMP_DIR.patch
  - I removed the pg_stat_tmp directory, so this moves the pg_stat_statements
    file stored there to the pg_stat directory. (This would need a fix)

v7-0006-Remove-pg_stat_tmp-exclusion-from-pg_rewind.patch
  - For the same reason as 0005.

v7-0007-Documentation-update.patch
  - Removes description related to pg_stat_tmp.

v7-0008-Split-out-backend-status-monitor-part-from-pgstat.patch
  - Just refactoring. Splits the current postmaster/pgstat.c into
    two files statmon/pgstat.c and statmon/bestatus.c.
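
How the conditional lock in 0002 and the "pending stats" described in the
0003 commit message fit together can be sketched roughly as below: if the
lock cannot be taken without waiting, the numbers stay in local memory and
are merged at the next successful write. This is a minimal sketch under
stated assumptions, not the patch code: a C11 atomic_flag stands in for
the partition LWLock, and toy_shared_stats, toy_report_stat() and
pending_tuples_inserted are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for one shared stats entry protected by a lock. */
typedef struct toy_shared_stats
{
    atomic_flag lock;               /* set = held by someone */
    long        tuples_inserted;
} toy_shared_stats;

/* Backend-local numbers not yet merged into the shared entry. */
static long pending_tuples_inserted = 0;

/*
 * Add delta to the shared counter.  With nowait = true, give up instead of
 * waiting when the lock is busy; the delta is then kept as pending and is
 * flushed by a later successful call.  Returns true if shared was updated.
 */
static bool
toy_report_stat(toy_shared_stats *shared, long delta, bool nowait)
{
    pending_tuples_inserted += delta;

    while (atomic_flag_test_and_set(&shared->lock))
    {
        if (nowait)
            return false;           /* keep the numbers pending */
    }

    shared->tuples_inserted += pending_tuples_inserted;
    pending_tuples_inserted = 0;
    atomic_flag_clear(&shared->lock);
    return true;
}
```

A caller that must not lose numbers at exit would make a final call with
nowait = false so that anything still pending gets merged.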


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From e83b2c82caed6bdaa9257f95c0011f333ecb9d51 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:41:04 +0900
Subject: [PATCH 1/8] sequential scan for dshash

Add sequential scan feature to dshash.
---
 src/backend/lib/dshash.c | 188 ++++++++++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  23 +++++-
 2 files changed, 206 insertions(+), 5 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index b2b8fe60e1..af904c034e 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -112,6 +112,7 @@ struct dshash_table
     size_t        size_log2;        /* log2(number of buckets) */
     bool        find_locked;    /* Is any partition lock held by 'find'? */
     bool        find_exclusively_locked;    /* ... exclusively? */
+    bool        seqscan_running;/* now under sequential scan */
 };
 
 /* Given a pointer to an item, find the entry (user data) it holds. */
@@ -127,6 +128,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +158,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -228,6 +237,7 @@ dshash_create(dsa_area *area, const dshash_parameters *params, void *arg)
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
 
     /*
      * Set up the initial array of buckets.  Our initial size is the same as
@@ -279,6 +289,7 @@ dshash_attach(dsa_area *area, const dshash_parameters *params,
     hash_table->control = dsa_get_address(area, control);
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
     Assert(hash_table->control->magic == DSHASH_MAGIC);
 
     /*
@@ -324,7 +335,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
                                 LW_EXCLUSIVE));
 
     delete_item(hash_table, item);
-    hash_table->find_locked = false;
-    hash_table->find_exclusively_locked = false;
-    LWLockRelease(PARTITION_LOCK(hash_table, partition));
+
+    /* We need to keep the partition lock during a sequential scan */
+    if (!hash_table->seqscan_running)
+    {
+        hash_table->find_locked = false;
+        hash_table->find_exclusively_locked = false;
+        LWLockRelease(PARTITION_LOCK(hash_table, partition));
+    }
 }
 
 /*
@@ -568,6 +584,8 @@ dshash_release_lock(dshash_table *hash_table, void *entry)
     Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition_index),
                                 hash_table->find_exclusively_locked
                                 ? LW_EXCLUSIVE : LW_SHARED));
+    /* lock is under control of sequential scan */
+    Assert(!hash_table->seqscan_running);
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
@@ -592,6 +610,168 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should be called if and only if the scan is abandoned
+ * before completion; if dshash_seq_next returns NULL then it has already done
+ * the end-of-scan cleanup.
+ *
+ * On returning element, it is locked as is the case with dshash_find.
+ * However, the caller must not release the lock. The lock is released as
+ * necessary in continued scan.
+ *
+ * As opposed to the equivalent for dynahash, the caller is not supposed to
+ * delete the returned element before continuing the scan.
+ *
+ * If consistent is set for dshash_seq_init, the whole hash table is
+ * non-exclusively locked. Otherwise a part of the hash table is locked in the
+ * same mode (partition lock).
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool consistent, bool exclusive)
+{
+    /* allowed at most one scan at once */
+    Assert(!hash_table->seqscan_running);
+
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->consistent = consistent;
+    status->exclusive = exclusive;
+    hash_table->seqscan_running = true;
+
+    /*
+     * Protect all partitions from modification if the caller wants a
+     * consistent result.
+     */
+    if (consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+        {
+            Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i)));
+
+            LWLockAcquire(PARTITION_LOCK(hash_table, i),
+                          exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        }
+        ensure_valid_bucket_pointers(hash_table);
+    }
+}
+
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    Assert(status->hash_table->seqscan_running);
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert (status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        if (!status->consistent)
+        {
+            partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+            LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            status->curpartition = partition;
+
+            /* resize doesn't happen from now until seq scan ends */
+            status->nbuckets =
+                NUM_BUCKETS(status->hash_table->control->size_log2);
+            ensure_valid_bucket_pointers(status->hash_table);
+        }
+        
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            dshash_seq_term(status);
+            return NULL;
+        }
+
+        /* Also move partition lock if needed */
+        if (!status->consistent)
+        {
+            int next_partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+
+            /* Move lock along with partition for the bucket */
+            if (status->curpartition != next_partition)
+            {
+                /*
+                 * Take lock on the next partition then release the current,
+                 * not in the reverse order. This is required to avoid
+                 * resizing from happening during a sequential scan. Locks are
+                 * taken in partition order so no deadlock happens with other
+                 * seq scans or resizing.
+                 */
+                LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                             next_partition),
+                              status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+                LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                             status->curpartition));
+                status->curpartition = next_partition;
+            }
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * This item can be deleted by the caller. Store the next item for the
+     * next iteration for the occasion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    Assert(status->hash_table->seqscan_running);
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+    status->hash_table->seqscan_running = false;
+
+    if (status->consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+            LWLockRelease(PARTITION_LOCK(status->hash_table, i));
+    }
+    else if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 8c733bfe25..8ab1a21f3e 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,23 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state of dshash. The detail is exposed since the storage
+ * size should be known to users but it should be considered as an opaque
+ * type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;
+    int                    curbucket;
+    int                    nbuckets;
+    dshash_table_item  *curitem;
+    dsa_pointer            pnextitem;
+    int                    curpartition;
+    bool                consistent;
+    bool                exclusive;
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
               const dshash_parameters *params,
@@ -70,7 +87,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
 extern void dshash_detach(dshash_table *hash_table);
 extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
 extern void dshash_destroy(dshash_table *hash_table);
-
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
             const void *key, bool exclusive);
@@ -80,6 +96,11 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool consistent, bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.16.3

From f916bd97c27ef08c8a13aff97b277058961784a6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 27 Sep 2018 11:15:19 +0900
Subject: [PATCH 2/8] Add conditional lock feature to dshash

Dshash currently waits for lock unconditionally. This commit adds new
interfaces for dshash_find and dshash_find_or_insert. The new
interfaces have an extra parameter "nowait" that commands not to wait
for lock.
---
 src/backend/lib/dshash.c | 58 ++++++++++++++++++++++++++++++++++++++++++++----
 src/include/lib/dshash.h |  6 ++++-
 2 files changed, 59 insertions(+), 5 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index af904c034e..d8bdaecae5 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -394,6 +394,17 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  */
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
+{
+    return dshash_find_extended(hash_table, key, NULL, exclusive, false);
+}
+
+/*
+ * In addition to dshash_find, returns immediately when nowait is true and the
+ * lock was not acquired. Lock status is set to *lock_acquired if non-NULL.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool *lock_acquired, bool exclusive, bool nowait)
 {
     dshash_hash hash;
     size_t        partition;
@@ -405,8 +416,23 @@ dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
     Assert(hash_table->control->magic == DSHASH_MAGIC);
     Assert(!hash_table->find_locked);
 
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partition),
+                                      exclusive ? LW_EXCLUSIVE : LW_SHARED))
+        {
+            if (lock_acquired)
+                *lock_acquired = false;
+            return NULL;
+        }
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition),
+                      exclusive ? LW_EXCLUSIVE : LW_SHARED);
+
+    if (lock_acquired)
+        *lock_acquired = true;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -441,6 +467,22 @@ void *
 dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
+{
+    return dshash_find_or_insert_extended(hash_table, key, found, false);
+}
+
+/*
+ * Addition to dshash_find_or_insert, returns NULL if nowait is true and lock
+ * was not acquired.
+ *
+ * Notes above dshash_find_extended() regarding locking and error handling
+ * equally apply here.
+ */
+void *
+dshash_find_or_insert_extended(dshash_table *hash_table,
+                               const void *key,
+                               bool *found,
+                               bool nowait)
 {
     dshash_hash hash;
     size_t        partition_index;
@@ -455,8 +497,16 @@ dshash_find_or_insert(dshash_table *hash_table,
     Assert(!hash_table->find_locked);
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(
+                PARTITION_LOCK(hash_table, partition_index),
+                LW_EXCLUSIVE))
+            return NULL;
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
+                      LW_EXCLUSIVE);
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 8ab1a21f3e..475d22ab55 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -90,8 +90,12 @@ extern void dshash_destroy(dshash_table *hash_table);
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
             const void *key, bool exclusive);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+            bool *lock_acquired, bool exclusive, bool nowait);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
-                      const void *key, bool *found);
+            const void *key, bool *found);
+extern void *dshash_find_or_insert_extended(dshash_table *hash_table,
+            const void *key, bool *found, bool nowait);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.16.3

From 0c02d9ff5c154a8184ca07a81b1cdb0e69d25262 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:58:32 +0900
Subject: [PATCH 3/8] Shared-memory based stats collector

This replaces the means to share server stats numbers from file to
dynamic shared memory. Every backend directly reads and writes to the
stats tables. Stats collector process is removed and archiver process
was changed to an auxiliary process in order to access shared memory.
Update of shared stats happens with the intervals not shorter than
500ms and no longer than 1s. If the shared stats hash is busy and a
backend cannot obtain lock on the shared stats, usually the numbers
are stashed into "pending stats" on local memory and merged with the
next writing.
---
 src/backend/access/transam/xlog.c             |    4 +-
 src/backend/postmaster/autovacuum.c           |   59 +-
 src/backend/postmaster/bgwriter.c             |    4 +-
 src/backend/postmaster/checkpointer.c         |   24 +-
 src/backend/postmaster/pgarch.c               |    4 +-
 src/backend/postmaster/pgstat.c               | 4208 ++++++++++---------------
 src/backend/postmaster/postmaster.c           |   82 +-
 src/backend/replication/basebackup.c          |   36 -
 src/backend/replication/logical/tablesync.c   |    9 +-
 src/backend/replication/logical/worker.c      |    4 +-
 src/backend/storage/buffer/bufmgr.c           |    8 +-
 src/backend/storage/ipc/dsm.c                 |   24 +-
 src/backend/storage/ipc/ipci.c                |    6 +
 src/backend/storage/lmgr/lwlock.c             |    3 +
 src/backend/storage/lmgr/lwlocknames.txt      |    1 +
 src/backend/tcop/postgres.c                   |   27 +-
 src/backend/utils/adt/pgstatfuncs.c           |   50 +-
 src/backend/utils/init/globals.c              |    1 +
 src/backend/utils/init/postinit.c             |   11 +
 src/backend/utils/misc/guc.c                  |   41 -
 src/backend/utils/misc/postgresql.conf.sample |    1 -
 src/bin/initdb/initdb.c                       |    1 -
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |    2 +-
 src/include/miscadmin.h                       |    2 +-
 src/include/pgstat.h                          |  438 +--
 src/include/storage/dsm.h                     |    2 +
 src/include/storage/lwlock.h                  |    3 +
 src/include/utils/timeout.h                   |    1 +
 src/test/modules/worker_spi/worker_spi.c      |    2 +-
 29 files changed, 1927 insertions(+), 3131 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 246869bba2..e72040178a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8598,9 +8598,9 @@ LogCheckpointEnd(bool restartpoint)
                         &sync_secs, &sync_usecs);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time +=
+    BgWriterStats.checkpoint_write_time +=
         write_secs * 1000 + write_usecs / 1000;
-    BgWriterStats.m_checkpoint_sync_time +=
+    BgWriterStats.checkpoint_sync_time +=
         sync_secs * 1000 + sync_usecs / 1000;
 
     /*
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 978089575b..10e707e9a1 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -977,7 +977,7 @@ rebuild_database_list(Oid newdb)
         PgStat_StatDBEntry *entry;
 
         /* only consider this database if it has a pgstat entry */
-        entry = pgstat_fetch_stat_dbentry(newdb);
+        entry = pgstat_fetch_stat_dbentry(newdb, true);
         if (entry != NULL)
         {
             /* we assume it isn't found because the hash was just created */
@@ -986,6 +986,7 @@ rebuild_database_list(Oid newdb)
             /* hash_search already filled in the key */
             db->adl_score = score++;
             /* next_worker is filled in later */
+            pfree(entry);
         }
     }
 
@@ -1001,7 +1002,7 @@ rebuild_database_list(Oid newdb)
          * skip databases with no stat entries -- in particular, this gets rid
          * of dropped databases
          */
-        entry = pgstat_fetch_stat_dbentry(avdb->adl_datid);
+        entry = pgstat_fetch_stat_dbentry(avdb->adl_datid, true);
         if (entry == NULL)
             continue;
 
@@ -1013,6 +1014,7 @@ rebuild_database_list(Oid newdb)
             db->adl_score = score++;
             /* next_worker is filled in later */
         }
+        pfree(entry);
     }
 
     /* finally, insert all qualifying databases not previously inserted */
@@ -1025,7 +1027,7 @@ rebuild_database_list(Oid newdb)
         PgStat_StatDBEntry *entry;
 
         /* only consider databases with a pgstat entry */
-        entry = pgstat_fetch_stat_dbentry(avdb->adw_datid);
+        entry = pgstat_fetch_stat_dbentry(avdb->adw_datid, true);
         if (entry == NULL)
             continue;
 
@@ -1037,6 +1039,7 @@ rebuild_database_list(Oid newdb)
             db->adl_score = score++;
             /* next_worker is filled in later */
         }
+        pfree(entry);
     }
     nelems = score;
 
@@ -1235,7 +1238,7 @@ do_start_worker(void)
             continue;            /* ignore not-at-risk DBs */
 
         /* Find pgstat entry if any */
-        tmp->adw_entry = pgstat_fetch_stat_dbentry(tmp->adw_datid);
+        tmp->adw_entry = pgstat_fetch_stat_dbentry(tmp->adw_datid, true);
 
         /*
          * Skip a database with no pgstat entry; it means it hasn't seen any
@@ -1273,16 +1276,22 @@ do_start_worker(void)
                 break;
             }
         }
-        if (skipit)
-            continue;
+        if (!skipit)
+        {
+            /* Remember the db with oldest autovac time. */
+            if (avdb == NULL ||
+                tmp->adw_entry->last_autovac_time <
+                avdb->adw_entry->last_autovac_time)
+            {
+                if (avdb)
+                    pfree(avdb->adw_entry);
+                avdb = tmp;
+            }
+        }
 
-        /*
-         * Remember the db with oldest autovac time.  (If we are here, both
-         * tmp->entry and db->entry must be non-null.)
-         */
-        if (avdb == NULL ||
-            tmp->adw_entry->last_autovac_time < avdb->adw_entry->last_autovac_time)
-            avdb = tmp;
+        /* Immediately free it if not used */
+        if (avdb != tmp)
+            pfree(tmp->adw_entry);
     }
 
     /* Found a database -- process it */
@@ -1971,7 +1980,7 @@ do_autovacuum(void)
      * may be NULL if we couldn't find an entry (only happens if we are
      * forcing a vacuum for anti-wrap purposes).
      */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, true);
 
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
@@ -2021,7 +2030,7 @@ do_autovacuum(void)
     MemoryContextSwitchTo(AutovacMemCxt);
 
     /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
+    shared = pgstat_fetch_stat_dbentry(InvalidOid, true);
 
     classRel = heap_open(RelationRelationId, AccessShareLock);
 
@@ -2107,6 +2116,8 @@ do_autovacuum(void)
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
                                   &dovacuum, &doanalyze, &wraparound);
+        if (tabentry)
+            pfree(tabentry);
 
         /* Relations that need work are added to table_oids */
         if (dovacuum || doanalyze)
@@ -2186,10 +2197,11 @@ do_autovacuum(void)
         /* Fetch the pgstat entry for this table */
         tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
                                              shared, dbentry);
-
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
                                   &dovacuum, &doanalyze, &wraparound);
+        if (tabentry)
+            pfree(tabentry);
 
         /* ignore analyze for toast tables */
         if (dovacuum)
@@ -2758,12 +2770,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
     if (isshared)
     {
         if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
+            tabentry = backend_get_tab_entry(shared, relid, true);
     }
     else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
+        tabentry = backend_get_tab_entry(dbentry, relid, true);
 
     return tabentry;
 }
@@ -2795,8 +2805,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     /* use fresh stats */
     autovac_refresh_stats();
 
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    shared = pgstat_fetch_stat_dbentry(InvalidOid, true);
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, true);
 
     /* fetch the relation's relcache entry */
     classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
@@ -2827,6 +2837,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
                               &dovacuum, &doanalyze, &wraparound);
+    if (tabentry)
+        pfree(tabentry);
 
     /* ignore ANALYZE for toast tables */
     if (classForm->relkind == RELKIND_TOASTVALUE)
@@ -2917,7 +2929,10 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     }
 
     heap_freetuple(classTup);
-
+    if (shared)
+        pfree(shared);
+    if (dbentry)
+        pfree(dbentry);
     return tab;
     return tab;
 }
 
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index b1e9bb2c53..a4b1079e60 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -271,9 +271,9 @@ BackgroundWriterMain(void)
         can_hibernate = BgBufferSync(&wb_context);
 
         /*
-         * Send off activity statistics to the stats collector
+         * Update activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_update_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 1a033093c5..9235390bc6 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -376,7 +376,7 @@ CheckpointerMain(void)
         {
             checkpoint_requested = false;
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            BgWriterStats.requested_checkpoints++;
         }
         if (shutdown_requested)
         {
@@ -402,7 +402,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                BgWriterStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -520,13 +520,13 @@ CheckpointerMain(void)
         CheckArchiveTimeout();
 
         /*
-         * Send off activity statistics to the stats collector.  (The reason
-         * why we re-use bgwriter-related code for this is that the bgwriter
-         * and checkpointer used to be just one process.  It's probably not
-         * worth the trouble to split the stats support into two independent
-         * stats message types.)
+         * Update activity statistics.  (The reason why we re-use
+         * bgwriter-related code for this is that the bgwriter and
+         * checkpointer used to be just one process.  It's probably not worth
+         * the trouble to split the stats support into two independent
+         * functions.)
          */
-        pgstat_send_bgwriter();
+        pgstat_update_bgwriter();
 
         /*
          * Sleep until we are signaled or it's time for another checkpoint or
@@ -694,9 +694,9 @@ CheckpointWriteDelay(int flags, double progress)
         CheckArchiveTimeout();
 
         /*
-         * Report interim activity statistics to the stats collector.
+         * Register interim activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_update_bgwriter();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1296,8 +1296,8 @@ AbsorbFsyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    BgWriterStats.buf_written_backend += CheckpointerShmem->num_backend_writes;
+    BgWriterStats.buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 885e85ad8a..3ca36d62a4 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -466,7 +466,7 @@ pgarch_ArchiverCopyLoop(void)
                  * Tell the collector about the WAL file that we successfully
                  * archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_update_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
@@ -476,7 +476,7 @@ pgarch_ArchiverCopyLoop(void)
                  * Tell the collector about the WAL file that we failed to
                  * archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_update_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 42bccce0af..cca64eca83 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,10 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Statistics collector facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
- *
- *            - Add some automatic call for pgstat vacuuming.
- *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *    Statistics data is stored in dynamic shared memory. Every backend
+ *    updates and reads it individually.
  *
  *    Copyright (c) 2001-2018, PostgreSQL Global Development Group
  *
@@ -19,92 +14,59 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "pgstat.h"
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
-#include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "postmaster/autovacuum.h"
-#include "postmaster/fork_process.h"
-#include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
 #include "utils/snapmgr.h"
-#include "utils/timestamp.h"
-#include "utils/tqual.h"
-
 
 /* ----------
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
-
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_STAT_MIN_INTERVAL    500 /* Minimum time between stats data
+                                         * updates; in milliseconds. */
 
+#define PGSTAT_STAT_MAX_INTERVAL   1000 /* Maximum time between stats data
+                                         * updates; in milliseconds. */
 
 /* ----------
  * The initial size hints for the hash tables used in the collector.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
 #define PGSTAT_TAB_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
+/*
+ * Operation mode of pgstat_get_db_entry.
+ */
+#define PGSTAT_FETCH_SHARED    0
+#define PGSTAT_FETCH_EXCLUSIVE    1
+#define PGSTAT_FETCH_NOWAIT    2
+
+typedef enum
+{
+    PGSTAT_ENTRY_NOT_FOUND,
+    PGSTAT_ENTRY_FOUND,
+    PGSTAT_ENTRY_LOCK_FAILED
+} pg_stat_table_result_status;
 
 /* ----------
  * Total number of backends including auxiliary
@@ -127,32 +89,64 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
- */
-char       *pgstat_stat_directory = NULL;
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+/* Shared stats bootstrap information */
+typedef struct StatsShmemStruct {
+    dsa_handle stats_dsa_handle;
+    dshash_table_handle db_stats_handle;
+    dsa_pointer    global_stats;
+    dsa_pointer    archiver_stats;
+} StatsShmemStruct;
+
 
 /*
  * BgWriter global statistics counters (unused in other processes).
  * Stored directly in a stats message structure so it can be sent
  * without needing to copy things around.  We assume this inits to zeroes.
  */
-PgStat_MsgBgWriter BgWriterStats;
+PgStat_BgWriter BgWriterStats;
 
 /* ----------
  * Local data
  * ----------
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
+static dshash_table *db_stats;
+static HTAB *snapshot_db_stats;
+static MemoryContext stats_cxt;
 
-static struct sockaddr_storage pgStatAddr;
+/*
+ *  Report withholding facility.
+ *
+ *  Some report items are withheld if the required lock is not acquired
+ *  immediately.
+ */
+static bool pgstat_pending_recoveryconflict = false;
+static bool pgstat_pending_deadlock = false;
+static bool pgstat_pending_tempfile = false;
 
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
+/* dshash parameter for each type of table */
+static const dshash_parameters dsh_dbparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatDBEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_DB
+};
+static const dshash_parameters dsh_tblparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatTabEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_FUNC_TABLE
+};
+static const dshash_parameters dsh_funcparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatFuncEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_FUNC_TABLE
+};
 
 /*
  * Structures in which backends store per-table info that's waiting to be
@@ -189,18 +183,14 @@ typedef struct TabStatHashEntry
  * Hash table for O(1) t_id -> tsa_entry lookup
  */
 static HTAB *pgStatTabHash = NULL;
+static HTAB *pgStatPendingTabHash = NULL;
 
 /*
  * Backends store per-function info that's waiting to be sent to the collector
  * in this hash table (indexed by function OID).
  */
 static HTAB *pgStatFunctions = NULL;
-
-/*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
- */
-static bool have_function_stats = false;
+static HTAB *pgStatPendingFunctions = NULL;
 
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
@@ -237,6 +227,12 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
+typedef struct
+{
+    dshash_table *tabhash;
+    PgStat_StatDBEntry *dbentry;
+} pgstat_apply_tabstat_context;
+
 /*
  * Info about current "snapshot" of stats file
  */
@@ -250,23 +246,15 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Cluster wide statistics.
+ * Contains statistics that are not collected per database or per table.
+ * shared_* are the statistics maintained by pgstats and snapshot_* are the
+ * snapshots taken on backends.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
-
-/* Signal handler flags */
-static volatile bool need_exit = false;
-static volatile bool got_SIGHUP = false;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats *snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats *snapshot_globalStats;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -280,32 +268,41 @@ static instr_time total_func_time;
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgstat_exit(SIGNAL_ARGS);
+/* functions used in backends */
 static void pgstat_beshutdown_hook(int code, Datum arg);
-static void pgstat_sighup_handler(SIGNAL_ARGS);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                     Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, int op,
+                                    pg_stat_table_result_status *status);
+static PgStat_StatTabEntry *pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create);
+static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_db_statsfile(Oid databaseid, dshash_table *tabhash, dshash_table *funchash);
+
+/* functions used in backends */
+static bool backend_snapshot_global_stats(void);
+static PgStat_StatFuncEntry *backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot);
 static void pgstat_read_current_status(void);
 
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
+static void pgstat_postmaster_shutdown(int code, Datum arg);
+static void pgstat_apply_pending_tabstats(bool shared, bool force,
+                               pgstat_apply_tabstat_context *cxt);
+static bool pgstat_apply_tabstat(pgstat_apply_tabstat_context *cxt,
+                                 PgStat_TableStatus *entry, bool nowait);
+static void pgstat_merge_tabentry(PgStat_TableStatus *deststat,
+                                          PgStat_TableStatus *srcstat,
+                                          bool init);
+static void pgstat_update_funcstats(bool force, PgStat_StatDBEntry *dbentry);
+static void pgstat_reset_all_counters(void);
+static void pgstat_cleanup_recovery_conflict(PgStat_StatDBEntry *dbentry);
+static void pgstat_cleanup_deadlock(PgStat_StatDBEntry *dbentry);
+static void pgstat_cleanup_tempfile(PgStat_StatDBEntry *dbentry);
+
+static inline void pgstat_merge_backendstats_to_funcentry(
+    PgStat_StatFuncEntry *dest, PgStat_BackendFunctionEntry *src, bool init);
+static inline void pgstat_merge_funcentry(
+    PgStat_StatFuncEntry *dest, PgStat_StatFuncEntry *src, bool init);
 
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
 static HTAB *pgstat_collect_oids(Oid catalogid);
-
+static void reset_dbentry_counters(PgStat_StatDBEntry *dbentry);
 static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
 
 static void pgstat_setup_memcxt(void);
@@ -316,320 +313,16 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+static bool pgstat_update_tabentry(dshash_table *tabhash,
+                                   PgStat_TableStatus *stat, bool nowait);
+static void pgstat_update_dbentry(PgStat_StatDBEntry *dbentry,
+                                  PgStat_TableStatus *stat);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
- */
-void
-pgstat_init(void)
-{
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
-
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
-
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
-    {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
-    }
-
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
-
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
-
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
-
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
-    }
-
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
-
-    /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
-     */
-    if (!pg_set_noblock(pgStatSock))
-    {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
-    }
-
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
-    }
-
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    return;
-
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
-
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
-
-    /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
-     */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
-}
-
 /*
  * subroutine for pgstat_reset_all
  */
@@ -678,119 +371,54 @@ pgstat_reset_remove_files(const char *directory)
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats files and in-memory counters.  This is currently used only
+ * if WAL recovery is needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
     pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    pgstat_reset_all_counters();
 }
 
-#ifdef EXEC_BACKEND
-
-/*
- * pgstat_forkexec() -
+/* ----------
+ * pgstat_create_shared_stats() -
  *
- * Format up the arglist for, then fork and exec, statistics collector process
+ *    create shared stats memory
+ * ----------
  */
-static pid_t
-pgstat_forkexec(void)
+static void
+pgstat_create_shared_stats(void)
 {
-    char       *av[10];
-    int            ac = 0;
+    MemoryContext oldcontext;
 
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
+    Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
 
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
+    /* lives for the lifetime of the process */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    area = dsa_create(LWTRANCHE_STATS_DSA);
+    dsa_pin_mapping(area);
 
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
+    db_stats = dshash_create(area, &dsh_dbparams, 0);
 
+    /* create shared area and write bootstrap information */
+    StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+    StatsShmem->global_stats =
+        dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+    StatsShmem->archiver_stats =
+        dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+    StatsShmem->db_stats_handle =
+        dshash_get_hash_table_handle(db_stats);
 
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
-
-    /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
-     */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
-
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
-
-    /*
-     * Okay, fork off the collector.
-     */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
+    /* connect to the memory */
+    snapshot_db_stats = NULL;
+    shared_globalStats = (PgStat_GlobalStats *)
+        dsa_get_address(area, StatsShmem->global_stats);
+    shared_archiverStats = (PgStat_ArchiverStats *)
+        dsa_get_address(area, StatsShmem->archiver_stats);
+    MemoryContextSwitchTo(oldcontext);
 }
 
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
-}
 
 /* ------------------------------------------------------------
  * Public functions used by backends follow
@@ -802,41 +430,107 @@ allow_immediate_pgstat_restart(void)
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
- *    transaction stop time as an approximation of current time.
- * ----------
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *  This requires taking locks on the shared statistics hashes, and some
+ *  updates may be withheld on lock failure. Pending updates are retried
+ *  in later calls of this function and are finally flushed when it is
+ *  called with force = true or when PGSTAT_STAT_MAX_INTERVAL milliseconds
+ *  have elapsed since the oldest pending update. When force = false,
+ *  updates by regular backends happen at intervals no shorter than
+ *  PGSTAT_STAT_MIN_INTERVAL.
+ *
+ *  Returns time in milliseconds until the next update time.
+ *
+ *    Note that this is called only when not within a transaction, so it is fair
+ *    to use transaction stop time as an approximation of current time.
+ *    ----------
  */
-void
-pgstat_report_stat(bool force)
+long
+pgstat_update_stat(bool force)
 {
     /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
     static TimestampTz last_report = 0;
-
+    static TimestampTz oldest_pending = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
     TabStatusArray *tsa;
-    int            i;
+    pgstat_apply_tabstat_context cxt;
+    bool        other_pending_stats = false;
+    long elapsed;
+    long secs;
+    int     usecs;
+
+    if (pgstat_pending_recoveryconflict ||
+        pgstat_pending_deadlock ||
+        pgstat_pending_tempfile ||
+        pgStatPendingFunctions)
+        other_pending_stats = true;
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (!other_pending_stats && !pgStatPendingTabHash &&
+        (pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
+        pgStatXactCommit == 0 && pgStatXactRollback == 0)
+        return 0;
 
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
     now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
+
+    if (!force)
+    {
+        /*
+         * Don't update shared stats unless it's been at least
+         * PGSTAT_STAT_MIN_INTERVAL msec since we last updated them.
+         * Otherwise, return the time left to wait.
+         */
+        TimestampDifference(last_report, now, &secs, &usecs);
+        elapsed = secs * 1000 + usecs / 1000;
+
+        if (elapsed < PGSTAT_STAT_MIN_INTERVAL)
+        {
+            /* we know we have some statistics */
+            if (oldest_pending == 0)
+                oldest_pending = now;
+
+            return PGSTAT_STAT_MIN_INTERVAL - elapsed;
+        }
+
+
+        /*
+         * Don't keep pending stats for longer than PGSTAT_STAT_MAX_INTERVAL.
+         */
+        if (oldest_pending > 0)
+        {
+            TimestampDifference(oldest_pending, now, &secs, &usecs);
+            elapsed = secs * 1000 + usecs / 1000;
+
+            if (elapsed > PGSTAT_STAT_MAX_INTERVAL)
+                force = true;
+        }
+    }
+
     last_report = now;
 
+    /* set up stats update context */
+    cxt.dbentry = NULL;
+    cxt.tabhash = NULL;
+
+    /* Forcibly update other stats, if any. */
+    if (other_pending_stats)
+    {
+        cxt.dbentry =
+            pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+        /* clean up pending statistics if any */
+        if (pgStatPendingFunctions)
+            pgstat_update_funcstats(true, cxt.dbentry);
+        if (pgstat_pending_recoveryconflict)
+            pgstat_cleanup_recovery_conflict(cxt.dbentry);
+        if (pgstat_pending_deadlock)
+            pgstat_cleanup_deadlock(cxt.dbentry);
+        if (pgstat_pending_tempfile)
+            pgstat_cleanup_tempfile(cxt.dbentry);
+    }
+
     /*
      * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
      * entries it points to.  (Should we fail partway through the loop below,
@@ -849,23 +543,55 @@ pgstat_report_stat(bool force)
     pgStatTabHash = NULL;
 
     /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
+     * XXX: We cannot lock two dshash entries at once. Since we must keep the
+     * lock while table stats are being updated, we have no choice but to
+     * process shared table stats and regular table stats separately.
+     * Looping over the array twice is apparently inefficient, and a more
+     * efficient way is desirable.
      */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
+
+    /* The first of the following calls uses the dbentry obtained above, if any. */
+    pgstat_apply_pending_tabstats(false, force, &cxt);
+    pgstat_apply_pending_tabstats(true, force, &cxt);
+
+    /* zero out TableStatus structs after use */
+    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+    {
+        MemSet(tsa->tsa_entries, 0,
+               tsa->tsa_used * sizeof(PgStat_TableStatus));
+        tsa->tsa_used = 0;
+    }
+
+    /* record oldest pending update time */
+    if (pgStatPendingTabHash == NULL)
+        oldest_pending = 0;
+    else if (oldest_pending == 0)
+        oldest_pending = now;
+
+    return 0;
+}
+
+/*
+ * Subroutine for pgstat_update_stat.
+ *
+ * Applies table stats in the table status array, merging in pending stats if
+ * any. If force is true, waits until the required locks are acquired.
+ * Otherwise stores the merged stats as pending and retries on the next call.
+ */
+static void
+pgstat_apply_pending_tabstats(bool shared, bool force,
+                              pgstat_apply_tabstat_context *cxt)
+{
+    static const PgStat_TableCounts all_zeroes;
+    TabStatusArray *tsa;
+    int i;
 
     for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
     {
         for (i = 0; i < tsa->tsa_used; i++)
         {
             PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
+            PgStat_TableStatus *pentry = NULL;
 
             /* Shouldn't have any pending transaction-dependent counts */
             Assert(entry->trans == NULL);
@@ -878,178 +604,440 @@ pgstat_report_stat(bool force)
                        sizeof(PgStat_TableCounts)) == 0)
                 continue;
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* Skip if this entry does not match the request */
+            if (entry->t_shared != shared)
+                continue;
+
+            /* if a pending update exists, apply it along with this one */
+            if (pgStatPendingTabHash != NULL)
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                pentry = hash_search(pgStatPendingTabHash,
+                                     (void *) entry, HASH_FIND, NULL);
+
+                if (pentry)
+                {
+                    /* merge new update into pending updates */
+                    pgstat_merge_tabentry(pentry, entry, false);
+                    entry = pentry;
+                }
+            }
+
+            /* try to apply the merged stats */
+            if (pgstat_apply_tabstat(cxt, entry, !force))
+            {
+                /* succeeded. remove it if it was pending stats */
+                if (pentry && entry != pentry)
+                    hash_search(pgStatPendingTabHash,
+                                (void *) pentry, HASH_REMOVE, NULL);
+            }
+            else if (!pentry)
+            {
+                /* failed and there was no pending entry, create new one. */
+                bool found;
+
+                if (pgStatPendingTabHash == NULL)
+                {
+                    HASHCTL        ctl;
+
+                    memset(&ctl, 0, sizeof(ctl));
+                    ctl.keysize = sizeof(Oid);
+                    ctl.entrysize = sizeof(PgStat_TableStatus);
+                    pgStatPendingTabHash =
+                        hash_create("pgstat pending table stats hash",
+                                    TABSTAT_QUANTUM,
+                                    &ctl,
+                                    HASH_ELEM | HASH_BLOBS);
+                }
+
+                pentry = hash_search(pgStatPendingTabHash,
+                                     (void *) entry, HASH_ENTER, &found);
+                Assert (!found);
+
+                *pentry = *entry;
             }
         }
-        /* zero out TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
+    }
+
+    /* if any pending stats exists, try to clean it up */
+    if (pgStatPendingTabHash != NULL)
+    {
+        HASH_SEQ_STATUS pstat;
+        PgStat_TableStatus *pentry;
+
+        hash_seq_init(&pstat, pgStatPendingTabHash);
+        while((pentry = (PgStat_TableStatus *) hash_seq_search(&pstat)) != NULL)
+        {
+            /* Skip if this entry does not match the request */
+            if (pentry->t_shared != shared)
+                continue;
+
+            /* apply pending entry and remove on success */
+            if (pgstat_apply_tabstat(cxt, pentry, !force))
+                hash_search(pgStatPendingTabHash,
+                            (void *) pentry, HASH_REMOVE, NULL);
+        }
+
+        /* destroy the hash if no entry is left */
+        if (hash_get_num_entries(pgStatPendingTabHash) == 0)
+        {
+            hash_destroy(pgStatPendingTabHash);
+            pgStatPendingTabHash = NULL;
+        }
+    }
+
+    if (cxt->tabhash)
+        dshash_detach(cxt->tabhash);
+    if (cxt->dbentry)
+        dshash_release_lock(db_stats, cxt->dbentry);
+    cxt->tabhash = NULL;
+    cxt->dbentry = NULL;
+}
+
+
+/*
+ * pgstat_apply_tabstat: update shared stats entry using given entry
+ *
+ * If nowait is true, just returns false on lock failure.  Dshashes for table
+ * and function stats are kept attached and stored in cxt. The caller must
+ * detach them after use.
+ */
+bool
+pgstat_apply_tabstat(pgstat_apply_tabstat_context *cxt,
+                     PgStat_TableStatus *entry, bool nowait)
+{
+    Oid dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
+    int        table_mode = PGSTAT_FETCH_EXCLUSIVE;
+    bool updated = false;
+
+    if (nowait)
+        table_mode |= PGSTAT_FETCH_NOWAIT;
+
+    /*
+     * We need to keep lock on dbentries for regular tables to avoid race
+     * condition with drop database. So we hold it in the context variable. We
+     * don't need that for shared tables.
+     */
+    if (!cxt->dbentry)
+        cxt->dbentry = pgstat_get_db_entry(dboid, table_mode, NULL);
+
+    /* we cannot acquire lock, just return */
+    if (!cxt->dbentry)
+        return false;
+
+    /* attach shared stats table if not yet */
+    if (!cxt->tabhash)
+    {
+        /* apply database stats  */
+        if (!entry->t_shared)
+        {
+            /* Update database-wide stats  */
+            cxt->dbentry->n_xact_commit += pgStatXactCommit;
+            cxt->dbentry->n_xact_rollback += pgStatXactRollback;
+            cxt->dbentry->n_block_read_time += pgStatBlockReadTime;
+            cxt->dbentry->n_block_write_time += pgStatBlockWriteTime;
+            pgStatXactCommit = 0;
+            pgStatXactRollback = 0;
+            pgStatBlockReadTime = 0;
+            pgStatBlockWriteTime = 0;
+        }
+        
+        cxt->tabhash =
+            dshash_attach(area, &dsh_tblparams, cxt->dbentry->tables, 0);
     }
 
     /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
+     * If we have access to the required data, try to update table stats first.
+     * Update database stats only if the first step succeeded.
      */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    if (pgstat_update_tabentry(cxt->tabhash, entry, nowait))
+    {
+        pgstat_update_dbentry(cxt->dbentry, entry);
+        updated = true;
+    }
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+    return updated;
 }
 
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * pgstat_merge_tabentry: subroutine for pgstat_update_stat
+ *
+ * Merge srcstat into deststat. Existing values in deststat are cleared if
+ * init is true.
  */
 static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+pgstat_merge_tabentry(PgStat_TableStatus *deststat,
+                      PgStat_TableStatus *srcstat,
+                      bool init)
 {
-    int            n;
-    int            len;
+    Assert (deststat != srcstat);
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
-     */
-    if (OidIsValid(tsmsg->m_databaseid))
-    {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
-        pgStatXactCommit = 0;
-        pgStatXactRollback = 0;
-        pgStatBlockReadTime = 0;
-        pgStatBlockWriteTime = 0;
-    }
+    if (init)
+        deststat->t_counts = srcstat->t_counts;
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        PgStat_TableCounts *dest = &deststat->t_counts;
+        PgStat_TableCounts *src = &srcstat->t_counts;
+
+        dest->t_numscans += src->t_numscans;
+        dest->t_tuples_returned += src->t_tuples_returned;
+        dest->t_tuples_fetched += src->t_tuples_fetched;
+        dest->t_tuples_inserted += src->t_tuples_inserted;
+        dest->t_tuples_updated += src->t_tuples_updated;
+        dest->t_tuples_deleted += src->t_tuples_deleted;
+        dest->t_tuples_hot_updated += src->t_tuples_hot_updated;
+        dest->t_truncated |= src->t_truncated;
+
+        /* If table was truncated, first reset the live/dead counters */
+        if (src->t_truncated)
+        {
+            dest->t_delta_live_tuples = 0;
+            dest->t_delta_dead_tuples = 0;
+        }
+        dest->t_delta_live_tuples += src->t_delta_live_tuples;
+        dest->t_delta_dead_tuples += src->t_delta_dead_tuples;
+        dest->t_changed_tuples += src->t_changed_tuples;
+        dest->t_blocks_fetched += src->t_blocks_fetched;
+        dest->t_blocks_hit += src->t_blocks_hit;
     }
-
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
-
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
 }
-
+        
 /*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
+ * pgstat_update_funcstats: subroutine for pgstat_update_stat
+ *
+ *  updates a function stat
  */
 static void
-pgstat_send_funcstats(void)
+pgstat_update_funcstats(bool force, PgStat_StatDBEntry *dbentry)
 {
     /* we assume this inits to all zeroes: */
     static const PgStat_FunctionCounts all_zeroes;
+    pg_stat_table_result_status status = 0;
+    dshash_table *funchash;
+    bool          nowait = !force;
+    bool          release_db = false;
+    int              table_op = PGSTAT_FETCH_EXCLUSIVE;
 
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
-
-    if (pgStatFunctions == NULL)
+    if (pgStatFunctions == NULL && pgStatPendingFunctions == NULL)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
+    if (nowait)
+        table_op += PGSTAT_FETCH_NOWAIT;
 
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    /* find the shared function stats table */
+    if (!dbentry)
     {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
-
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
-
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
-
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
+        dbentry = pgstat_get_db_entry(MyDatabaseId, table_op, &status);
+        release_db = true;
     }
 
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
+    /* lock failure, return. */
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
 
-    have_function_stats = false;
+    /* create hash if not yet */
+    if (dbentry->functions == DSM_HANDLE_INVALID)
+    {
+        funchash = dshash_create(area, &dsh_funcparams, 0);
+        dbentry->functions = dshash_get_hash_table_handle(funchash);
+    }
+    else
+        funchash = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+
+    /*
+     * First, we empty the transaction stats. Just move numbers to pending
+     * stats if any. Otherwise try to update the shared stats directly, but
+     * create a new pending entry on lock failure.
+     */
+    if (pgStatFunctions)
+    {
+        HASH_SEQ_STATUS fstat;
+        PgStat_BackendFunctionEntry *bestat;
+
+        hash_seq_init(&fstat, pgStatFunctions);
+        while ((bestat = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+        {
+            bool found;
+            bool init = false;
+            PgStat_StatFuncEntry *funcent = NULL;
+
+            /* Skip it if no counts accumulated since last time */
+            if (memcmp(&bestat->f_counts, &all_zeroes,
+                       sizeof(PgStat_FunctionCounts)) == 0)
+                continue;
+
+            /* find pending entry */
+            if (pgStatPendingFunctions)
+                funcent = (PgStat_StatFuncEntry *)
+                    hash_search(pgStatPendingFunctions,
+                                (void *) &(bestat->f_id), HASH_FIND, NULL);
+
+            if (!funcent)
+            {
+                /* pending entry not found, find shared stats entry */
+                funcent = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert_extended(funchash,
+                                                   (void *) &(bestat->f_id),
+                                                   &found, nowait);
+                if (funcent)
+                    init = !found;
+                else
+                {
+                    /* no shared stats entry. create a new pending one */
+                    funcent = (PgStat_StatFuncEntry *)
+                        hash_search(pgStatPendingFunctions,
+                                    (void *) &(bestat->f_id), HASH_ENTER, NULL);
+                    init = true;
+                }
+            }
+            Assert (funcent != NULL);
+
+            pgstat_merge_backendstats_to_funcentry(funcent, bestat, init);
+
+            /* reset used counts */
+            MemSet(&bestat->f_counts, 0, sizeof(PgStat_FunctionCounts));
+        }
+    }
+
+    /* Second, apply pending stats numbers to shared table */
+    if (pgStatPendingFunctions)
+    {
+        HASH_SEQ_STATUS fstat;
+        PgStat_StatFuncEntry *pendent;
+
+        hash_seq_init(&fstat, pgStatPendingFunctions);
+        while ((pendent = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+        {
+            PgStat_StatFuncEntry *funcent;
+            bool found;
+
+            funcent = (PgStat_StatFuncEntry *)
+                dshash_find_or_insert_extended(funchash,
+                                               (void *) &(pendent->functionid),
+                                               &found, nowait);
+            if (funcent)
+            {
+                pgstat_merge_funcentry(funcent, pendent, !found);
+                hash_search(pgStatPendingFunctions,
+                            (void *) &(pendent->functionid), HASH_REMOVE, NULL);
+            }
+        }    
+
+        /* destroy the hash if no entry remains */
+        if (hash_get_num_entries(pgStatPendingFunctions) == 0)
+        {
+            hash_destroy(pgStatPendingFunctions);
+            pgStatPendingFunctions = NULL;
+        }
+    }
+
+    if (release_db)
+        dshash_release_lock(db_stats, dbentry);
 }
 
+/*
+ * pgstat_merge_backendstats_to_funcentry: subroutine for
+ *                                             pgstat_update_funcstats
+ *
+ * Merges BackendFunctionEntry into StatFuncEntry
+ */
+static inline void
+pgstat_merge_backendstats_to_funcentry(PgStat_StatFuncEntry *dest,
+                                       PgStat_BackendFunctionEntry *src,
+                                       bool init)
+{
+    if (init)
+    {
+        /*
+         * If it's a new function entry, initialize counters to the values
+         * we just got.
+         */
+        dest->f_numcalls = src->f_counts.f_numcalls;
+        dest->f_total_time =
+            INSTR_TIME_GET_MICROSEC(src->f_counts.f_total_time);
+        dest->f_self_time =
+            INSTR_TIME_GET_MICROSEC(src->f_counts.f_self_time);
+    }
+    else
+    {
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        dest->f_numcalls += src->f_counts.f_numcalls;
+        dest->f_total_time +=
+            INSTR_TIME_GET_MICROSEC(src->f_counts.f_total_time);
+        dest->f_self_time +=
+            INSTR_TIME_GET_MICROSEC(src->f_counts.f_self_time);
+    }
+}
+
+/*
+ * pgstat_merge_funcentry: subroutine for pgstat_update_funcstats
+ *
+ * Merges two StatFuncEntrys
+ */
+static inline void
+pgstat_merge_funcentry(PgStat_StatFuncEntry *dest, PgStat_StatFuncEntry *src,
+                       bool init)
+{
+    if (init)
+    {
+        /*
+         * If it's a new function entry, initialize counters to the values
+         * we just got.
+         */
+        dest->f_numcalls = src->f_numcalls;
+        dest->f_total_time = src->f_total_time;
+        dest->f_self_time = src->f_self_time;
+    }
+    else
+    {
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        dest->f_numcalls += src->f_numcalls;
+        dest->f_total_time += src->f_total_time;
+        dest->f_self_time += src->f_self_time;
+    }
+}
+
+
 
 /* ----------
  * pgstat_vacuum_stat() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *    Remove stats for objects that we can get rid of.
  * ----------
  */
 void
 pgstat_vacuum_stat(void)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
+    HTAB       *oidtab;
+    dshash_table *dshtable;
+    dshash_seq_status dshstat;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatFuncEntry *funcentry;
-    int            len;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* we don't collect statistics under standalone mode */
+    if (!IsUnderPostmaster)
         return;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* If not done for this transaction, take a snapshot of stats */
+    if (!backend_snapshot_global_stats())
+        return;
 
     /*
      * Read pg_database and make a list of OIDs of all existing databases
      */
-    htab = pgstat_collect_oids(DatabaseRelationId);
+    oidtab = pgstat_collect_oids(DatabaseRelationId);
 
     /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
+     * Search the database hash table for dead databases and drop them
+     * from the hash.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+
+    dshash_seq_init(&dshstat, db_stats, false, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            dbid = dbentry->databaseid;
 
@@ -1057,148 +1045,86 @@ pgstat_vacuum_stat(void)
 
         /* the DB entry for shared tables (with InvalidOid) is never dropped */
         if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
+            hash_search(oidtab, (void *) &dbid, HASH_FIND, NULL) == NULL)
             pgstat_drop_database(dbid);
     }
 
     /* Clean up */
-    hash_destroy(htab);
+    hash_destroy(oidtab);
 
     /*
      * Lookup our own database entry; if not found, nothing more to do.
      */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, NULL);
+    if (!dbentry)
         return;
-
+
     /*
      * Similarly to above, make a list of all known relations in this DB.
      */
-    htab = pgstat_collect_oids(RelationRelationId);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
+    oidtab = pgstat_collect_oids(RelationRelationId);
 
     /*
      * Check for all tables listed in stats hashtable if they still exist.
+     * The stats cache is useless here, so search the shared hash directly.
      */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+    dshtable = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&dshstat, dshtable, false, true);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            tabid = tabentry->tableid;
 
         CHECK_FOR_INTERRUPTS();
 
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+        if (hash_search(oidtab, (void *) &tabid, HASH_FIND, NULL) != NULL)
             continue;
 
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
-    }
-
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
+        /* Not there, so purge this table */
+        dshash_delete_entry(dshtable, tabentry);
     }
+    dshash_detach(dshtable);
 
     /* Clean up */
-    hash_destroy(htab);
+    hash_destroy(oidtab);
 
     /*
      * Now repeat the above steps for functions.  However, we needn't bother
      * in the common case where no function stats are being collected.
      */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
+    if (dbentry->functions != DSM_HANDLE_INVALID)
     {
-        htab = pgstat_collect_oids(ProcedureRelationId);
+        dshtable = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        oidtab = pgstat_collect_oids(ProcedureRelationId);
 
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
+        dshash_seq_init(&dshstat, dshtable, false, true);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&dshstat)) != NULL)
         {
             Oid            funcid = funcentry->functionid;
 
             CHECK_FOR_INTERRUPTS();
 
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
+            if (hash_search(oidtab, (void *) &funcid, HASH_FIND, NULL) != NULL)
                 continue;
 
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
+            /* Not there, so remove this function */
+            dshash_delete_entry(dshtable, funcentry);
         }
 
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
+        hash_destroy(oidtab);
 
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
+        dshash_detach(dshtable);
     }
+    dshash_release_lock(db_stats, dbentry);
 }
 
 
-/* ----------
+/*
  * pgstat_collect_oids() -
  *
  *    Collect the OIDs of all objects listed in the specified system catalog
- *    into a temporary hash table.  Caller should hash_destroy the result
- *    when done with it.  (However, we make the table in CurrentMemoryContext
- *    so that it will be freed properly in event of an error.)
- * ----------
+ *    into a temporary hash table.  Caller should hash_destroy the result after
+ *    use.  (However, we make the table in CurrentMemoryContext so that it will
+ *    be freed properly in event of an error.)
  */
 static HTAB *
 pgstat_collect_oids(Oid catalogid)
@@ -1241,62 +1167,54 @@ pgstat_collect_oids(Oid catalogid)
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
+ *    Remove entry for the database that we just dropped.
+ *
+ *    If a stats update happens after this, the entry will be re-created, but
+ *    we will still clean up the dead DB eventually via future invocations of
+ *    pgstat_vacuum_stat().
  * ----------
  */
+
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    Assert(OidIsValid(databaseid));
+    Assert(db_stats);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Lookup the database in the hashtable with exclusive lock.
+     */
+    dbentry = pgstat_get_db_entry(databaseid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    /*
+     * If found, remove it (along with the db statfile).
+     */
+    if (dbentry)
+    {
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+        }
+
+        dshash_delete_entry(db_stats, (void *)dbentry);
+    }
 }
 
 
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
-
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
-}
-#endif                            /* NOT_USED */
-
-
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1305,20 +1223,51 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    PgStat_StatDBEntry           *dbentry;
+    pg_stat_table_result_status status;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    Assert(db_stats);
+
+    /*
+     * Lookup the database in the hashtable.  Nothing to do if not there.
+     */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, &status);
+
+    if (!dbentry)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * We simply throw away all the database's table entries by recreating a
+     * new hash table for them.
+     */
+    if (dbentry->tables != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_destroy(t);
+        dbentry->tables = DSM_HANDLE_INVALID;
+    }
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_destroy(t);
+        dbentry->functions = DSM_HANDLE_INVALID;
+    }
+
+    /*
+     * Reset database-level stats, too.  This creates empty hash tables for
+     * tables and functions.
+     */
+    reset_dbentry_counters(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1327,29 +1276,37 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    Assert(db_stats);
 
+    /* Reset the archiver statistics for the cluster. */
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+        memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
+    }
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+        /* Reset the global background writer statistics for the cluster. */
+        memset(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    }
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
+
+    LWLockRelease(StatsLock);
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1358,17 +1315,90 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatDBEntry *dbentry;
+    
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    Assert(db_stats);
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    if (!dbentry)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* Set the reset timestamp for the whole database */
+    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Remove object if it exists, ignore it if not */
+    if (type == RESET_TABLE)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_delete_key(t, (void *) &objoid);
+    }
+
+    if (type == RESET_FUNCTION && dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_delete_key(t, (void *) &objoid);
+    }
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/*
+ * pgstat_reset_all_counters: subroutine for pgstat_reset_all
+ *
+ * Clears all counters in shared memory.
+ */
+static void
+pgstat_reset_all_counters(void)
+{
+    dshash_seq_status dshstat;
+    PgStat_StatDBEntry           *dbentry;
+
+    Assert(db_stats);
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    dshash_seq_init(&dshstat, db_stats, false, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
+    {
+        /*
+         * We simply throw away all the database's table hashes
+         */
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *t =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(t);
+            dbentry->tables = DSM_HANDLE_INVALID;
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *t =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(t);
+            dbentry->functions = DSM_HANDLE_INVALID;
+        }
+
+        /*
+         * Reset database-level stats, too.  This creates empty hash tables
+         * for tables and functions.
+         */
+        reset_dbentry_counters(dbentry);
+        dshash_release_lock(db_stats, dbentry);
+
+    }
+
+    /*
+     * Reset global counters
+     */
+    memset(shared_globalStats, 0, sizeof(*shared_globalStats));
+    memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+    shared_globalStats->stat_reset_timestamp =
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
+
+    LWLockRelease(StatsLock);
 }
 
 /* ----------
@@ -1382,48 +1412,75 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    /*
+     * Store the last autovacuum time in the database's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
 
-    pgstat_send(&msg, sizeof(msg));
+    dbentry->last_autovac_time = GetCurrentTimestamp();
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report about the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, tableoid, true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->vacuum_count++;
+    }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report about the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1434,9 +1491,14 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
     /*
@@ -1465,114 +1527,228 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, RelationGetRelid(rel), true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
+static int pending_conflict_tablespace = 0;
+static int pending_conflict_lock = 0;
+static int pending_conflict_snapshot = 0;
+static int pending_conflict_bufferpin = 0;
+static int pending_conflict_startup_deadlock = 0;
+
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    pgstat_pending_recoveryconflict = true;
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            pending_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            pending_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            pending_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            pending_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            pending_conflict_startup_deadlock++;
+            break;
+    }
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    pgstat_cleanup_recovery_conflict(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/*
+ * clean up function for pending recovery conflicts
+ */
+static void
+pgstat_cleanup_recovery_conflict(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_conflict_tablespace    += pending_conflict_tablespace;
+    dbentry->n_conflict_lock         += pending_conflict_lock;
+    dbentry->n_conflict_snapshot    += pending_conflict_snapshot;
+    dbentry->n_conflict_bufferpin    += pending_conflict_bufferpin;
+    dbentry->n_conflict_startup_deadlock += pending_conflict_startup_deadlock;
+
+    pending_conflict_tablespace = 0;
+    pending_conflict_lock = 0;
+    pending_conflict_snapshot = 0;
+    pending_conflict_bufferpin = 0;
+    pending_conflict_startup_deadlock = 0;
+
+    pgstat_pending_recoveryconflict = false;
 }
 
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
+static int pending_deadlocks = 0;
+
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    pending_deadlocks++;
+    pgstat_pending_deadlock = true;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    pgstat_cleanup_deadlock(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/*
+ * clean up function for pending deadlocks
+ */
+static void
+pgstat_cleanup_deadlock(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_deadlocks += pending_deadlocks;
+    pending_deadlocks = 0;
+    pgstat_pending_deadlock = false;
 }
 
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
+static size_t pending_filesize = 0;
+static size_t pending_files = 0;
+
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
-}
+    if (filesize > 0) /* Isn't there a case where filesize is really 0? */
+    {
+        pgstat_pending_tempfile = true;
+        pending_filesize += filesize; /* needs an overflow check */
+        pending_files++;
+    }
 
-
-/* ----------
- * pgstat_ping() -
- *
- *    Send some junk data to the collector to increase traffic.
- * ----------
- */
-void
-pgstat_ping(void)
-{
-    PgStat_MsgDummy msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!pgstat_pending_tempfile)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    pgstat_cleanup_tempfile(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
-/* ----------
- * pgstat_send_inquiry() -
- *
- *    Notify collector that we need fresh data.
- * ----------
+/*
+ * clean up function for temporary files
  */
 static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
+pgstat_cleanup_tempfile(PgStat_StatDBEntry *dbentry)
 {
-    PgStat_MsgInquiry msg;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
+    dbentry->n_temp_bytes += pending_filesize;
+    dbentry->n_temp_files += pending_files;
+    pending_filesize = 0;
+    pending_files = 0;
+    pgstat_pending_tempfile = false;
+
 }
 
-
 /*
  * Initialize function call usage data.
  * Called by the executor before invoking a function.
@@ -1688,9 +1864,6 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
         fs->f_numcalls++;
     fs->f_total_time = f_total;
     INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
 }
 
 
@@ -1712,6 +1885,15 @@ pgstat_initstats(Relation rel)
     Oid            rel_id = rel->rd_id;
     char        relkind = rel->rd_rel->relkind;
 
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+    {
+        /* We're not counting at all */
+        rel->pgstat_info = NULL;
+        return;
+    }
+
     /* We only count stats for things that have storage */
     if (!(relkind == RELKIND_RELATION ||
           relkind == RELKIND_MATVIEW ||
@@ -1723,13 +1905,6 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-    {
-        /* We're not counting at all */
-        rel->pgstat_info = NULL;
-        return;
-    }
-
     /*
      * If we already set up this relation in the current transaction, nothing
      * to do.
@@ -2373,34 +2548,6 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
         rec->tuples_inserted + rec->tuples_updated;
 }
 
-
-/* ----------
- * pgstat_fetch_stat_dbentry() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
- * ----------
- */
-PgStat_StatDBEntry *
-pgstat_fetch_stat_dbentry(Oid dbid)
-{
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
-}
-
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
@@ -2413,47 +2560,28 @@ pgstat_fetch_stat_dbentry(Oid dbid)
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* Lookup our database, then look in its table hash table. */
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, false);
+    if (dbentry == NULL)
+        return NULL;
 
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = backend_get_tab_entry(dbentry, relid, false);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    dbentry = pgstat_fetch_stat_dbentry(InvalidOid, false);
+    if (dbentry == NULL)
+        return NULL;
+
+    tabentry = backend_get_tab_entry(dbentry, relid, false);
+    if (tabentry != NULL)
+        return tabentry;
 
     return NULL;
 }
@@ -2472,18 +2600,14 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     PgStat_StatDBEntry *dbentry;
     PgStat_StatFuncEntry *funcentry = NULL;
 
-    /* load the stats file if needed */
-    backend_read_statsfile();
+    /* Lookup our database, then find the requested function */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_SHARED, NULL);
+    if (dbentry == NULL)
+        return NULL;
 
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
+    funcentry = backend_get_func_etnry(dbentry, func_id, false);
 
+    dshash_release_lock(db_stats, dbentry);
     return funcentry;
 }
 
@@ -2558,9 +2682,11 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    if (!backend_snapshot_global_stats())
+        return NULL;
 
-    return &archiverStats;
+    return snapshot_archiverStats;
 }
 
 
@@ -2575,9 +2701,11 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    if (!backend_snapshot_global_stats())
+        return NULL;
 
-    return &globalStats;
+    return snapshot_globalStats;
 }
 
 
@@ -2767,7 +2895,7 @@ pgstat_initialize(void)
     }
 
     /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -2956,7 +3084,7 @@ pgstat_beshutdown_hook(int code, Datum arg)
      * during failed backend starts might never get counted.)
      */
     if (OidIsValid(MyDatabaseId))
-        pgstat_report_stat(true);
+        pgstat_update_stat(true);
 
     /*
      * Clear my status entry, following the protocol of bumping st_changecount
@@ -3223,7 +3351,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -4140,96 +4269,68 @@ pgstat_get_backend_desc(BackendType backendType)
  * ------------------------------------------------------------
  */
 
-
 /* ----------
- * pgstat_setheader() -
+ * pgstat_update_archiver() -
  *
- *        Set common header fields in a statistics message
+ *    Update the stats data about the WAL file that we successfully archived or
+ *    failed to archive.
  * ----------
  */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
+void
+pgstat_update_archiver(const char *xlog, bool failed)
 {
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
+    if (failed)
     {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
+        /* Failed archival attempt */
+        ++shared_archiverStats->failed_count;
+        StrNCpy(shared_archiverStats->last_failed_wal, xlog,
+                sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = GetCurrentTimestamp();
+    }
+    else
+    {
+        /* Successful archival operation */
+        ++shared_archiverStats->archived_count;
+        StrNCpy(shared_archiverStats->last_archived_wal, xlog,
+                sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = GetCurrentTimestamp();
+    }
 }
 
 /* ----------
- * pgstat_send_archiver() -
+ * pgstat_update_bgwriter() -
  *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
+ *        Update bgwriter statistics
  * ----------
  */
 void
-pgstat_send_archiver(const char *xlog, bool failed)
-{
-    PgStat_MsgArchiver msg;
-
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* ----------
- * pgstat_send_bgwriter() -
- *
- *        Send bgwriter statistics to the collector
- * ----------
- */
-void
-pgstat_send_bgwriter(void)
+pgstat_update_bgwriter(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+
+    PgStat_BgWriter *s = &BgWriterStats;
 
     /*
      * This function can be called even if nothing at all has happened. In
      * this case, avoid sending a completely empty message to the stats
      * collector.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += s->timed_checkpoints;
+    shared_globalStats->requested_checkpoints += s->requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += s->checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += s->checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += s->buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += s->buf_written_clean;
+    shared_globalStats->maxwritten_clean += s->maxwritten_clean;
+    shared_globalStats->buf_written_backend += s->buf_written_backend;
+    shared_globalStats->buf_fsync_backend += s->buf_fsync_backend;
+    shared_globalStats->buf_alloc += s->buf_alloc;
+    LWLockRelease(StatsLock);
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4237,302 +4338,15 @@ pgstat_send_bgwriter(void)
     MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
 }
 
-
-/* ----------
- * PgstatCollectorMain() -
- *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, pgstat_sighup_handler);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, pgstat_exit);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    pqsignal(SIGCHLD, SIG_DFL);
-    pqsignal(SIGTTIN, SIG_DFL);
-    pqsignal(SIGTTOU, SIG_DFL);
-    pqsignal(SIGCONT, SIG_DFL);
-    pqsignal(SIGWINCH, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    /*
-     * Identify myself via ps
-     */
-    init_ps_display("stats collector", "", "", "");
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do recognize got_SIGHUP inside
-     * the inner loop, which means that such interrupts will get serviced but
-     * the latch won't get cleared until next time there is a break in the
-     * action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (need_exit)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * need_exit becomes set.
-         */
-        while (!need_exit)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (got_SIGHUP)
-            {
-                got_SIGHUP = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry((PgStat_MsgInquiry *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat((PgStat_MsgTabstat *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge((PgStat_MsgTabpurge *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb((PgStat_MsgDropdb *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter((PgStat_MsgResetcounter *) &msg,
-                                             len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(
-                                                   (PgStat_MsgResetsharedcounter *) &msg,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(
-                                                   (PgStat_MsgResetsinglecounter *) &msg,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac((PgStat_MsgAutovacStart *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum((PgStat_MsgVacuum *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze((PgStat_MsgAnalyze *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver((PgStat_MsgArchiver *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter((PgStat_MsgBgWriter *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat((PgStat_MsgFuncstat *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge((PgStat_MsgFuncpurge *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict((PgStat_MsgRecoveryConflict *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock((PgStat_MsgDeadlock *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile((PgStat_MsgTempFile *) &msg, len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr & WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    exit(0);
-}
-
-
-/* SIGQUIT signal handler for collector process */
-static void
-pgstat_exit(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    need_exit = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
-/* SIGHUP handler for collector process */
-static void
-pgstat_sighup_handler(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    got_SIGHUP = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
 /*
- * Subroutine to clear stats in a database entry
+ * Subroutine to reset stats in a shared database entry
  *
  * Tables and functions hashes are initialized to empty.
  */
 static void
 reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
 {
-    HASHCTL        hash_ctl;
+    dshash_table *tbl;
 
     dbentry->n_xact_commit = 0;
     dbentry->n_xact_rollback = 0;
@@ -4558,20 +4372,17 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
     dbentry->stat_reset_timestamp = GetCurrentTimestamp();
     dbentry->stats_timestamp = 0;
 
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
 
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
+    Assert(dbentry->tables == DSM_HANDLE_INVALID);
+    tbl = dshash_create(area, &dsh_tblparams, 0);
+    dbentry->tables = dshash_get_hash_table_handle(tbl);
+    dshash_detach(tbl);
+
+    Assert(dbentry->functions == DSM_HANDLE_INVALID);
+    /* The function hash is created on demand */
+
+    dbentry->snapshot_tables = NULL;
+    dbentry->snapshot_functions = NULL;
 }
 
 /*
@@ -4580,47 +4391,76 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
  * Else, return NULL.
  */
 static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
+pgstat_get_db_entry(Oid databaseid, int op, pg_stat_table_result_status *status)
 {
     PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
+    bool        nowait = ((op & PGSTAT_FETCH_NOWAIT) != 0);
+    bool        lock_acquired = true;
+    bool        found = true;
 
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
+    if (!IsUnderPostmaster)
         return NULL;
 
-    /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
-     */
-    if (!found)
-        reset_dbentry_counters(result);
+    /* Lookup or create the hash table entry for this database */
+    if (op & PGSTAT_FETCH_EXCLUSIVE)
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_or_insert_extended(db_stats, &databaseid,
+                                           &found, nowait);
+        if (result == NULL)
+            lock_acquired = false;
+        else if (!found)
+        {
+            /*
+             * If not found, initialize the new one.  This creates an empty
+             * table hash; the function hash is created on demand.
+             */
+            reset_dbentry_counters(result);
+        }
+    }
+    else
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_extended(db_stats, &databaseid,
+                                 &lock_acquired, true, nowait);
+        if (result == NULL)
+            found = false;
+    }
+
+    /* Set return status if requested */
+    if (status)
+    {
+        if (!lock_acquired)
+        {
+            Assert(nowait);
+            *status = PGSTAT_ENTRY_LOCK_FAILED;
+        }
+        else if (!found)
+            *status = PGSTAT_ENTRY_NOT_FOUND;
+        else
+            *status = PGSTAT_ENTRY_FOUND;
+    }
 
     return result;
 }
 
-
 /*
  * Lookup the hash table entry for the specified table. If no hash
  * table entry exists, initialize it, if the create parameter is true.
  * Else, return NULL.
  */
 static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create)
 {
     PgStat_StatTabEntry *result;
-    bool        found;
+    bool        found = true;    /* not set on the dshash_find() path */
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
 
     /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
+    if (create)
+        result = (PgStat_StatTabEntry *)
+            dshash_find_or_insert(table, &tableoid, &found);
+    else
+        result = (PgStat_StatTabEntry *) dshash_find(table, &tableoid, false);
 
     if (!create && !found)
         return NULL;
@@ -4656,29 +4496,23 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
 
 /* ----------
  * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ *        Write the global statistics file, as well as DB files.
  * ----------
  */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+void
+pgstat_write_statsfiles(void)
 {
-    HASH_SEQ_STATUS hstat;
+    dshash_seq_status hstat;
     PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
 
+    /* should be called from the postmaster */
+    Assert(!IsUnderPostmaster);
+
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
     /*
@@ -4697,7 +4531,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
 
     /*
      * Write the file header --- currently just a format ID.
@@ -4709,32 +4543,29 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Write global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write archiver stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Walk through the database table.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_init(&hstat, db_stats, false, false);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&hstat)) != NULL)
     {
         /*
          * Write out the table and function stats for this DB into the
          * appropriate per-DB stat file, if required.
          */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
+        pgstat_write_db_statsfile(dbentry);
 
         /*
          * Write out the DB entry. We don't write the tables or functions
@@ -4777,16 +4608,6 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
 }
 
 /*
@@ -4794,15 +4615,14 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
  * of length len.
  */
 static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
+get_dbstat_filename(bool tempname, Oid databaseid,
                     char *filename, int len)
 {
     int            printed;
 
     /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
     printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
                        databaseid,
                        tempname ? "tmp" : "stat");
     if (printed >= len)
@@ -4820,10 +4640,10 @@ get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
  * ----------
  */
 static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
+pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry)
 {
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
+    dshash_seq_status tstat;
+    dshash_seq_status fstat;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatFuncEntry *funcentry;
     FILE       *fpout;
@@ -4832,9 +4652,10 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     int            rc;
     char        tmpfile[MAXPGPATH];
     char        statfile[MAXPGPATH];
+    dshash_table *tbl;
 
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -4861,23 +4682,30 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     /*
      * Walk through the database's access stats per table.
      */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&tstat, tbl, false, false);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&tstat)) != NULL)
     {
         fputc('T', fpout);
         rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_detach(tbl);
 
     /*
      * Walk through the database's function stats table.
      */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+    if (dbentry->functions != DSM_HANDLE_INVALID)
     {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
+        tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_seq_init(&fstat, tbl, false, false);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&fstat)) != NULL)
+        {
+            fputc('F', fpout);
+            rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+            (void) rc;                /* we'll check for error with ferror */
+        }
+        dshash_detach(tbl);
     }
 
     /*
@@ -4912,47 +4740,30 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
 }
 
 /* ----------
  * pgstat_read_statsfiles() -
  *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ *    Reads in some existing statistics collector files into the shared stats
+ *    hash.
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+void
+pgstat_read_statsfiles(void)
 {
     PgStat_StatDBEntry *dbentry;
     PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    dshash_table *tblstats = NULL;
+    dshash_table *funcstats = NULL;
+
+    /* should be called from the postmaster */
+    Assert(!IsUnderPostmaster);
 
     /*
      * The tables will live in pgStatLocalContext.
@@ -4960,28 +4771,18 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     pgstat_setup_memcxt();
 
     /*
-     * Create the DB hashtable
+     * Create the DB hashtable and global stats area
      */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
+    /* Hold the lock so that no other process sees empty stats */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    pgstat_create_shared_stats();
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp = shared_globalStats->stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
@@ -4995,11 +4796,12 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        LWLockRelease(StatsLock);
+        return;
     }
 
     /*
@@ -5008,7 +4810,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
@@ -5016,11 +4818,12 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     /*
      * Read global stats struct
      */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+        sizeof(*shared_globalStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
+        memset(shared_globalStats, 0, sizeof(*shared_globalStats));
         goto done;
     }
 
@@ -5031,17 +4834,17 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
      * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
      * an unusual scenario.
      */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
+    shared_globalStats->stats_timestamp = 0;
 
     /*
      * Read archiver stats struct
      */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+        sizeof(*shared_archiverStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
+        memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
         goto done;
     }
 
@@ -5061,7 +4864,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
                           fpin) != offsetof(PgStat_StatDBEntry, tables))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5070,21 +4873,23 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 /*
                  * Add to the DB hash
                  */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
+                dbentry = (PgStat_StatDBEntry *)
+                    dshash_find_or_insert(db_stats, (void *) &dbbuf.databaseid,
+                                          &found);
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    dshash_release_lock(db_stats, dbentry);
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
                 memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
+                dbentry->tables = DSM_HANDLE_INVALID;
+                dbentry->functions = DSM_HANDLE_INVALID;
+                dbentry->snapshot_tables = NULL;
+                dbentry->snapshot_functions = NULL;
 
                 /*
                  * In the collector, disregard the timestamp we read from the
@@ -5092,54 +4897,26 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                  * stats file immediately upon the first request from any
                  * backend.
                  */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+                dbentry->stats_timestamp = 0;
 
                 /*
                  * If requested, read the data from the database-specific
                  * file.  Otherwise we just leave the hashtables empty.
                  */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
+                tblstats = dshash_create(area, &dsh_tblparams, 0);
+                dbentry->tables = dshash_get_hash_table_handle(tblstats);
+                /* we don't create the function hash at present */
+                dshash_release_lock(db_stats, dbentry);
+                pgstat_read_db_statsfile(dbentry->databaseid,
+                                         tblstats, funcstats);
+                dshash_detach(tblstats);
                 break;
 
             case 'E':
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5147,36 +4924,62 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     }
 
 done:
+    LWLockRelease(StatsLock);
     FreeFile(fpin);
 
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 
-    return dbhash;
+    return;
 }
 
 
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+void
+StatsShmemInit(void)
+{
+    bool    found;
+
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+
+        /* Load saved data if any */
+        pgstat_read_statsfiles();
+
+        /* needs to be called before dsm shutdown */
+        before_shmem_exit(pgstat_postmaster_shutdown, (Datum) 0);
+    }
+}
+
+static void
+pgstat_postmaster_shutdown(int code, Datum arg)
+{
+    /* we trash the stats on crash */
+    if (code == 0)
+        pgstat_write_statsfiles();
+}
+
 /* ----------
  * pgstat_read_db_statsfile() -
  *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
+ *    Reads in the permanent statistics collector file and creates shared
+ *    statistics tables. The file is removed after reading.
  * ----------
  */
 static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
+pgstat_read_db_statsfile(Oid databaseid,
+                         dshash_table *tabhash, dshash_table *funchash)
 {
     PgStat_StatTabEntry *tabentry;
     PgStat_StatTabEntry tabbuf;
@@ -5187,7 +4990,10 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     bool        found;
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
+    /* should be called from postmaster  */
+    Assert(!IsUnderPostmaster);
+
+    get_dbstat_filename(false, databaseid, statfile, MAXPGPATH);
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
@@ -5201,7 +5007,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
@@ -5214,7 +5020,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
@@ -5234,7 +5040,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
                           fpin) != sizeof(PgStat_StatTabEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5246,19 +5052,21 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (tabhash == NULL)
                     break;
 
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
+                tabentry = (PgStat_StatTabEntry *)
+                    dshash_find_or_insert(tabhash,
+                                          (void *) &tabbuf.tableid, &found);
 
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    dshash_release_lock(tabhash, tabentry);
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
                 memcpy(tabentry, &tabbuf, sizeof(tabbuf));
+                dshash_release_lock(tabhash, tabentry);
                 break;
 
                 /*
@@ -5268,7 +5076,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
                           fpin) != sizeof(PgStat_StatFuncEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5280,19 +5088,20 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (funchash == NULL)
                     break;
 
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
+                funcentry = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert(funchash,
+                                          (void *) &funcbuf.functionid, &found);
 
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
                 memcpy(funcentry, &funcbuf, sizeof(funcbuf));
+                dshash_release_lock(funchash, funcentry);
                 break;
 
                 /*
@@ -5302,7 +5111,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5312,276 +5121,290 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
 done:
     FreeFile(fpin);
 
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 }
 
 /* ----------
- * pgstat_read_db_statsfile_timestamp() -
+ * backend_clean_snapshot_callback() -
  *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
+ *    This is usually called with arg = NULL when the memory context in which
+ *    the current snapshot was taken is reset or deleted. Don't bother
+ *    releasing memory in that case.
  * ----------
  */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
+static void
+backend_clean_snapshot_callback(void *arg)
 {
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    if (arg != NULL)
     {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
+        /* explicitly called, so explicitly free resources */
+        if (snapshot_globalStats)
+            pfree(snapshot_globalStats);
 
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
+        if (snapshot_archiverStats)
+            pfree(snapshot_archiverStats);
 
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
+        if (snapshot_db_stats)
         {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
+            HASH_SEQ_STATUS seq;
+            PgStat_StatDBEntry *dbent;
 
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
+            hash_seq_init(&seq, snapshot_db_stats);
+            while ((dbent = hash_seq_search(&seq)) != NULL)
+            {
+                if (dbent->snapshot_tables)
+                    hash_destroy(dbent->snapshot_tables);
+                if (dbent->snapshot_functions)
+                    hash_destroy(dbent->snapshot_functions);
+            }
+            hash_destroy(snapshot_db_stats);
         }
     }
 
-done:
-    FreeFile(fpin);
-    return true;
+    /* mark the resources as not allocated */
+    snapshot_globalStats = NULL;
+    snapshot_archiverStats = NULL;
+    snapshot_db_stats = NULL;
 }
 
 /*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
+ * create_local_stats_hash() -
+ *
+ * Creates a dynahash used for table/function stats cache.
  */
-static void
-backend_read_statsfile(void)
+static HTAB *
+create_local_stats_hash(const char *name, size_t keysize, size_t entrysize,
+                        int nentries)
 {
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
+    HTAB *result;
+    HASHCTL ctl;
 
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
+    /* Create the hash in the stats context */
+    ctl.keysize        = keysize;
+    ctl.entrysize    = entrysize;
+    ctl.hcxt        = stats_cxt;
+    result = hash_create(name, nentries, &ctl,
+                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    return result;
 }
 
+/*
+ * snapshot_statentry() - Find an entry from source dshash.
+ *
+ * Returns the entry for key or NULL if not found. If dest is not NULL, uses
+ * *dest as a local cache, which is created with the same shape as the given
+ * dshash when *dest is NULL. In both cases the result is cached in the hash
+ * and the same entry is returned to subsequent calls for the same key.
+ *
+ * Otherwise the returned entry is a copy palloc'ed in the current memory
+ * context. Its content may differ for every request.
+ *
+ * If dshash is NULL, temporarily attaches dsh_handle instead.
+ */
+static void *
+snapshot_statentry(HTAB **dest, const char *hashname,
+                   dshash_table *dshash, dshash_table_handle dsh_handle,
+                   const dshash_parameters *dsh_params, Oid key)
+{
+    void *lentry = NULL;
+    size_t keysize = dsh_params->key_size;
+    size_t entrysize = dsh_params->entry_size;
+
+    if (dest)
+    {
+        /* caches the result entry */
+        bool found;
+
+        /*
+         * Create new hash with arbitrary initial entries since we don't know
+         * how this hash will grow.
+         */
+        if (!*dest)
+        {
+            Assert(hashname);
+            *dest = create_local_stats_hash(hashname, keysize, entrysize, 32);
+        }
+
+        lentry = hash_search(*dest, &key, HASH_ENTER, &found);
+        if (!found)
+        {
+            dshash_table *t = dshash;
+            void *sentry;
+
+            if (!t)
+                t = dshash_attach(area, dsh_params, dsh_handle, NULL);
+
+            sentry = dshash_find(t, &key, false);
+
+            /*
+             * We expect that the stats for the specified database exist in
+             * most cases.
+             */
+            if (!sentry)
+            {
+                hash_search(*dest, &key, HASH_REMOVE, NULL);
+                if (!dshash)
+                    dshash_detach(t);
+                return NULL;
+            }
+            memcpy(lentry, sentry, entrysize);
+            dshash_release_lock(t, sentry);
+
+            if (!dshash)
+                dshash_detach(t);
+        }
+    }
+    else
+    {
+        /*
+         * The caller doesn't want caching. Just make a copy of the entry then
+         * return.
+         */
+        dshash_table *t = dshash;
+        void *sentry;
+
+        if (!t)
+            t = dshash_attach(area, dsh_params, dsh_handle, NULL);
+
+        sentry = dshash_find(t, &key, false);
+        if (sentry)
+        {
+            lentry = palloc(entrysize);
+            memcpy(lentry, sentry, entrysize);
+            dshash_release_lock(t, sentry);
+        }
+
+        if (!dshash)
+            dshash_detach(t);
+    }
+
+    return lentry;
+}
+
+/*
+ * backend_snapshot_global_stats() -
+ *
+ * Makes a local copy of global stats if not already done.  They will be kept
+ * until pgstat_clear_snapshot() is called or the end of the current memory
+ * context (typically TopTransactionContext).  Returns false if the shared
+ * stats are not created yet.
+ */
+static bool
+backend_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext = CurrentMemoryContext;
+    MemoryContextCallback *mcxt_cb;
+
+    /* Nothing to do if already done */
+    if (snapshot_globalStats)
+        return true;
+
+    Assert(snapshot_archiverStats == NULL);
+
+    /*
+     * The snapshot lives within the current top transaction if any, or the
+     * current memory context lifetime otherwise.
+     */
+    if (IsTransactionState())
+        oldcontext = MemoryContextSwitchTo(TopTransactionContext);
+
+    /* Remember for stats memory allocation later */
+    stats_cxt = CurrentMemoryContext;
+
+    /* global stats can be just copied  */
+    snapshot_globalStats = palloc(sizeof(PgStat_GlobalStats));
+    memcpy(snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+
+    snapshot_archiverStats = palloc(sizeof(PgStat_ArchiverStats));
+    memcpy(snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+
+    /* set the timestamp of this snapshot */
+    snapshot_globalStats->stats_timestamp = GetCurrentTimestamp();
+
+    /* register callback to clear snapshot */
+    mcxt_cb = (MemoryContextCallback *)palloc(sizeof(MemoryContextCallback));
+    mcxt_cb->func = backend_clean_snapshot_callback;
+    mcxt_cb->arg = NULL;
+    MemoryContextRegisterResetCallback(CurrentMemoryContext, mcxt_cb);
+    MemoryContextSwitchTo(oldcontext);
+
+    return true;
+}
+
+/* ----------
+ * pgstat_fetch_stat_dbentry() -
+ *
+ *    Find database stats entry on backends. The returned entries are cached
+ *    until transaction end. If oneshot is true, they are not cached and are
+ *    returned in palloc'ed memory.
+ */
+PgStat_StatDBEntry *
+pgstat_fetch_stat_dbentry(Oid dbid, bool oneshot)
+{
+    /* take a local snapshot if we don't have one */
+    char *hashname = "local database stats hash";
+    PgStat_StatDBEntry *dbentry;
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    /* If not done for this transaction, take a snapshot of global stats */
+    if (!backend_snapshot_global_stats())
+        return NULL;
+
+    dbentry = snapshot_statentry(oneshot ? NULL : &snapshot_db_stats,
+                                 hashname, db_stats, 0, &dsh_dbparams,
+                                 dbid);
+
+    return dbentry;
+}
+
+/* ----------
+ * backend_get_tab_entry() -
+ *
+ *    Find table stats entry on backends. The returned entries are cached until
+ *    transaction end. If oneshot is true, they are not cached and are returned
+ *    in palloc'ed memory.
+ */
+PgStat_StatTabEntry *
+backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid reloid, bool oneshot)
+{
+    /* take a local snapshot if we don't have one */
+    char *hashname = "local table stats hash";
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    return snapshot_statentry(oneshot ? NULL : &dbent->snapshot_tables,
+                              hashname, NULL, dbent->tables, &dsh_tblparams,
+                              reloid);
+}
+
+/* ----------
+ * backend_get_func_entry() -
+ *
+ *    Find function stats entry on backends. The returned entries are cached
+ *    until transaction end. If oneshot is true, they are not cached and are
+ *    returned in palloc'ed memory.
+ */
+static PgStat_StatFuncEntry *
+backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot)
+{
+    char *hashname = "local function stats hash";
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    if (dbent->functions == DSM_HANDLE_INVALID)
+        return NULL;
+
+    return snapshot_statentry(oneshot ? NULL : &dbent->snapshot_tables,
+                              hashname, NULL, dbent->functions, &dsh_funcparams,
+                              funcid);
+}
 
 /* ----------
  * pgstat_setup_memcxt() -
@@ -5612,6 +5435,8 @@ pgstat_setup_memcxt(void)
 void
 pgstat_clear_snapshot(void)
 {
+    int param = 0;    /* only the address is significant */
+
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
         MemoryContextDelete(pgStatLocalContext);
@@ -5621,717 +5446,112 @@ pgstat_clear_snapshot(void)
     pgStatDBHash = NULL;
     localBackendStatusTable = NULL;
     localNumBackends = 0;
+
+    /*
+     * The parameter informs the function that it is not being called from a
+     * MemoryContextCallback.
+     */
+    backend_clean_snapshot_callback(&param);
 }
 
 
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
+static bool
+pgstat_update_tabentry(dshash_table *tabhash, PgStat_TableStatus *stat,
+                       bool nowait)
 {
-    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    bool    found;
 
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
+    if (tabhash == NULL)
+        return false;
 
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
+    tabentry = (PgStat_StatTabEntry *)
+        dshash_find_or_insert_extended(tabhash, (void *) &(stat->t_id),
+                                       &found, nowait);
 
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
+    /* failed to acquire lock */
+    if (tabentry == NULL)
+        return false;
+
+    if (!found)
     {
         /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
+         * If it's a new table entry, initialize counters to the values we
+         * just got.
          */
+        tabentry->numscans = stat->t_counts.t_numscans;
+        tabentry->tuples_returned = stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched = stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted = stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated = stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted = stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated = stat->t_counts.t_tuples_hot_updated;
+        tabentry->n_live_tuples = stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples = stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze = stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched = stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit = stat->t_counts.t_blocks_hit;
+
+        tabentry->vacuum_timestamp = 0;
+        tabentry->vacuum_count = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_count = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->analyze_count = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+        tabentry->autovac_analyze_count = 0;
     }
-    else if (msg->clock_time < dbentry->stats_timestamp)
+    else
     {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
         /*
-         * Add per-table stats to the per-database entry, too.
+         * Otherwise add the values to the existing entry.
          */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetshared() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
+        tabentry->numscans += stat->t_counts.t_numscans;
+        tabentry->tuples_returned += stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched += stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted += stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated += stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted += stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated += stat->t_counts.t_tuples_hot_updated;
+        /* If table was truncated, first reset the live/dead counters */
+        if (stat->t_counts.t_truncated)
         {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
+            tabentry->n_live_tuples = 0;
+            tabentry->n_dead_tuples = 0;
         }
+        tabentry->n_live_tuples += stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples += stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze += stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched += stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit += stat->t_counts.t_blocks_hit;
     }
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
+
+    dshash_release_lock(tabhash, tabentry);
+
+    return true;
 }
 
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
 static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
+pgstat_update_dbentry(PgStat_StatDBEntry *dbentry, PgStat_TableStatus *stat)
 {
     /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
+     * Add per-table stats to the per-database entry, too.
      */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
+    dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
+    dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
+    dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
+    dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
+    dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
+    dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
+    dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
 }
 
+
 /*
  * Convert a potentially unsafely truncated activity string (see
  * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 688f462e7d..a7d6ddeac7 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -254,7 +254,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -1293,12 +1292,6 @@ PostmasterMain(int argc, char *argv[])
 
     whereToSendOutput = DestNone;
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1751,11 +1744,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = pgarch_start();
@@ -2580,8 +2568,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -2912,8 +2898,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = pgarch_start();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -2980,13 +2964,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3063,22 +3040,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3305,7 +3266,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, stats collector, or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3525,22 +3486,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3744,8 +3689,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3784,8 +3727,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -3985,8 +3927,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -4959,18 +4899,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5083,12 +5011,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index b20f6c379c..20cf33354a 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -77,9 +77,6 @@ static bool is_checksummed_file(const char *fullpath, const char *filename);
 /* Was the backup currently in-progress initiated in recovery mode? */
 static bool backup_started_in_recovery = false;
 
-/* Relative path of temporary statistics directory */
-static char *statrelpath = NULL;
-
 /*
  * Size of each block sent into the tar stream for larger files.
  */
@@ -121,13 +118,6 @@ static bool noverify_checksums = false;
  */
 static const char *excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    PG_STAT_TMP_DIR,
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another master. See backup.sgml
@@ -223,11 +213,8 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -254,18 +241,6 @@ perform_base_backup(basebackup_options *opt)
 
         SendXlogRecPtrResult(startptr, starttli);
 
-        /*
-         * Calculate the relative path of temporary statistics directory in
-         * order to skip the files which are located in that directory later.
-         */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
-
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
         ti->size = opt->progress ? sendDir(".", 1, true, tablespaces, true) : -1;
@@ -1174,17 +1149,6 @@ sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces,
         if (excludeFound)
             continue;
 
-        /*
-         * Exclude contents of directory specified by statrelpath if not set
-         * to the default (pg_stat_tmp) which is caught in the loop above.
-         */
-        if (statrelpath != NULL && strcmp(pathbuf, statrelpath) == 0)
-        {
-            elog(DEBUG1, "contents of directory \"%s\" excluded from backup", statrelpath);
-            size += _tarWriteDir(pathbuf, basepathlen, &statbuf, sizeonly);
-            continue;
-        }
-
         /*
          * We can skip pg_wal, the WAL segments need to be fetched from the
          * WAL archive anyway. But include it as an empty directory anyway, so
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 6e420d893c..862582da23 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -127,7 +127,7 @@ finish_sync_worker(void)
     if (IsTransactionState())
     {
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(true);
     }
 
     /* And flush all writes. */
@@ -143,6 +143,9 @@ finish_sync_worker(void)
     /* Find the main apply worker and signal it. */
     logicalrep_worker_wakeup(MyLogicalRepWorker->subid, InvalidOid);
 
+    /* clean up retained statistics */
+    pgstat_update_stat(true);
+
     /* Stop gracefully */
     proc_exit(0);
 }
@@ -533,7 +536,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
     if (started_tx)
     {
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(false);
     }
 }
 
@@ -876,7 +879,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
                                            MyLogicalRepWorker->relstate,
                                            MyLogicalRepWorker->relstate_lsn);
                 CommitTransactionCommand();
-                pgstat_report_stat(false);
+                pgstat_update_stat(false);
 
                 /*
                  * We want to do the table data sync in a single transaction.
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 277da69fa6..087850d089 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -491,7 +491,7 @@ apply_handle_commit(StringInfo s)
         replorigin_session_origin_timestamp = commit_data.committime;
 
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(false);
 
         store_flush_position(commit_data.end_lsn);
     }
@@ -1324,6 +1324,8 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
             }
 
             send_feedback(last_received, requestReply, requestReply);
+
+            pgstat_update_stat(false);
         }
     }
 }
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 01eabe5706..e794a81c4c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1984,7 +1984,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                BgWriterStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2092,7 +2092,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2282,7 +2282,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2290,7 +2290,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
 
diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index 9629f22f7a..32508353b7 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -197,6 +197,15 @@ dsm_postmaster_startup(PGShmemHeader *shim)
     dsm_control->maxitems = maxitems;
 }
 
+/*
+ * Clear the dsm_init_done flag on child start.
+ */
+void
+dsm_child_init(void)
+{
+    dsm_init_done = false;
+}
+
 /*
  * Determine whether the control segment from the previous postmaster
  * invocation still exists.  If so, remove the dynamic shared memory
@@ -423,6 +432,15 @@ dsm_set_control_handle(dsm_handle h)
 }
 #endif
 
+/*
+ * Return whether the DSM feature is available in this process.
+ */
+bool
+dsm_is_available(void)
+{
+    return dsm_control != NULL;
+}
+
 /*
  * Create a new dynamic shared memory segment.
  *
@@ -440,8 +458,7 @@ dsm_create(Size size, int flags)
     uint32        i;
     uint32        nitems;
 
-    /* Unsafe in postmaster (and pointless in a stand-alone backend). */
-    Assert(IsUnderPostmaster);
+    Assert(dsm_is_available());
 
     if (!dsm_init_done)
         dsm_backend_startup();
@@ -537,8 +554,7 @@ dsm_attach(dsm_handle h)
     uint32        i;
     uint32        nitems;
 
-    /* Unsafe in postmaster (and pointless in a stand-alone backend). */
-    Assert(IsUnderPostmaster);
+    Assert(dsm_is_available());
 
     if (!dsm_init_done)
         dsm_backend_startup();
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 0c86a581c0..cce6d3ffa2 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -150,6 +150,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
         size = add_size(size, BackendRandomShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -282,8 +283,13 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 
     /* Initialize dynamic shared memory facilities. */
     if (!IsUnderPostmaster)
+    {
         dsm_postmaster_startup(shim);
 
+        /* Stats collector uses dynamic shared memory */
+        StatsShmemInit();
+    }
+
     /*
      * Now give loadable modules a chance to set up their shmem allocations
      */
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index a6fda81feb..c46bb8d057 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -521,6 +521,9 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_TBM, "tbm");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
+    LWLockRegisterTranche(LWTRANCHE_STATS_DSA, "stats table dsa");
+    LWLockRegisterTranche(LWTRANCHE_STATS_DB, "db stats");
+    LWLockRegisterTranche(LWTRANCHE_STATS_FUNC_TABLE, "table/func stats");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index e6025ecedb..798af9f168 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -50,3 +50,4 @@ OldSnapshotTimeMapLock                42
 BackendRandomLock                    43
 LogicalRepWorkerLock                44
 CLogTruncationLock                    45
+StatsLock                            46
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index a3b9757565..ee4e43331b 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3146,6 +3146,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_update_stat(true);
+    }
 }
 
 
@@ -3720,6 +3726,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4160,9 +4167,17 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
-                ProcessCompletedNotifies();
-                pgstat_report_stat(false);
+                long stats_timeout;
 
+                ProcessCompletedNotifies();
+
+                stats_timeout = pgstat_update_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4197,7 +4212,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4205,6 +4220,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index e95e347184..eca801eeed 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -33,7 +33,7 @@
 #define UINT32_ACCESS_ONCE(var)         ((uint32)(*((volatile uint32 *)&(var))))
 
 /* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
+extern PgStat_BgWriter bgwriterStats;
 
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
@@ -1176,7 +1176,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_xact_commit);
@@ -1192,7 +1192,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_xact_rollback);
@@ -1208,7 +1208,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_blocks_fetched);
@@ -1224,7 +1224,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_blocks_hit);
@@ -1240,7 +1240,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_returned);
@@ -1256,7 +1256,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_fetched);
@@ -1272,7 +1272,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_inserted);
@@ -1288,7 +1288,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_updated);
@@ -1304,7 +1304,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_deleted);
@@ -1319,7 +1319,7 @@ pg_stat_get_db_stat_reset_time(PG_FUNCTION_ARGS)
     TimestampTz result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = dbentry->stat_reset_timestamp;
@@ -1337,7 +1337,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = dbentry->n_temp_files;
@@ -1353,7 +1353,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = dbentry->n_temp_bytes;
@@ -1368,7 +1368,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_tablespace);
@@ -1383,7 +1383,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_lock);
@@ -1398,7 +1398,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_snapshot);
@@ -1413,7 +1413,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_bufferpin);
@@ -1428,7 +1428,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_startup_deadlock);
@@ -1443,7 +1443,7 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (
@@ -1463,7 +1463,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_deadlocks);
@@ -1479,7 +1479,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     PgStat_StatDBEntry *dbentry;
 
     /* convert counter from microsec to millisec for display */
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = ((double) dbentry->n_block_read_time) / 1000.0;
@@ -1495,7 +1495,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     PgStat_StatDBEntry *dbentry;
 
     /* convert counter from microsec to millisec for display */
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = ((double) dbentry->n_block_write_time) / 1000.0;
@@ -1850,6 +1850,9 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     /* Get statistics about the archiver process */
     archiver_stats = pgstat_fetch_stat_archiver();
 
+    if (archiver_stats == NULL)
+        PG_RETURN_NULL();
+
     /* Fill values and NULLs */
     values[0] = Int64GetDatum(archiver_stats->archived_count);
     if (*(archiver_stats->last_archived_wal) == '\0')
@@ -1879,6 +1882,5 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
         values[6] = TimestampTzGetDatum(archiver_stats->stat_reset_timestamp);
 
     /* Returns the record as Datum */
-    PG_RETURN_DATUM(HeapTupleGetDatum(
-                                      heap_form_tuple(tupdesc, values, nulls)));
+    PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
 }
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index c6939779b9..1377bbbbdb 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 4f1d2a0d28..1e4fa89135 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -72,6 +72,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -628,6 +629,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1239,6 +1242,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 2317e8be6b..37f3389bd0 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -188,7 +188,6 @@ static bool check_autovacuum_max_workers(int *newval, void **extra, GucSource so
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static void assign_effective_io_concurrency(int newval, void *extra);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -3768,17 +3767,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -10727,35 +10715,6 @@ assign_effective_io_concurrency(int newval, void *extra)
 #endif                            /* USE_PREFETCH */
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 4e61bc6521..1277740473 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -512,7 +512,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index ab5cb7f0c1..f13b2dde6b 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -217,7 +217,6 @@ static const char *const subdirs[] = {
     "pg_replslot",
     "pg_tblspc",
     "pg_stat",
-    "pg_stat_tmp",
     "pg_xact",
     "pg_logical",
     "pg_logical/snapshots",
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 2211d90c6f..e6f4d30658 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index d6b32c070c..2e74ec9f60 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
@@ -402,7 +403,6 @@ typedef enum
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
-
     NUM_AUXPROCTYPES            /* Must be last! */
 } AuxProcType;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f1c10d16b8..e5f912cf71 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -13,6 +13,7 @@
 
 #include "datatype/timestamp.h"
 #include "fmgr.h"
+#include "lib/dshash.h"
 #include "libpq/pqcomm.h"
 #include "port/atomics.h"
 #include "portability/instr_time.h"
@@ -30,9 +31,6 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
-#define PG_STAT_TMP_DIR        "pg_stat_tmp"
-
 /* Values for track_functions GUC variable --- order is significant! */
 typedef enum TrackFunctionsLevel
 {
@@ -41,32 +39,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -115,13 +87,6 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_blocks_hit;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -180,271 +145,23 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
 /* ----------
- * PgStat_MsgHdr                The common message header
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgHdr
+typedef struct PgStat_BgWriter
 {
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter buf_alloc;
+    PgStat_Counter checkpoint_write_time; /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+} PgStat_BgWriter;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
@@ -485,79 +202,6 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgAutovacStart msg_autovacuum;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
  * Statistic collector data structures follow
  *
@@ -601,10 +245,13 @@ typedef struct PgStat_StatDBEntry
 
     /*
      * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * the handles and pointers out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables;
+    dshash_table_handle functions;
+    /* for snapshot tables */
+    HTAB *snapshot_tables;
+    HTAB *snapshot_functions;
 } PgStat_StatDBEntry;
 
 
@@ -1141,7 +788,7 @@ extern char *pgstat_stat_filename;
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1153,34 +800,20 @@ extern PgStat_Counter pgStatBlockWriteTime;
  * Functions called from postmaster
  * ----------
  */
-extern Size BackendStatusShmemSize(void);
-extern void CreateSharedBackendStatus(void);
-
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
-
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_update_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
-extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
-extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
-
+extern void pgstat_reset_shared_counters(const char *target);
+extern void pgstat_reset_single_counter(Oid objectid,
+                                        PgStat_Single_Reset_Type type);
 extern void pgstat_report_autovac(Oid dboid);
 extern void pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples);
@@ -1191,6 +824,8 @@ extern void pgstat_report_analyze(Relation rel,
 extern void pgstat_report_recovery_conflict(int reason);
 extern void pgstat_report_deadlock(void);
 
+extern void pgstat_clear_snapshot(void);
+
 extern void pgstat_initialize(void);
 extern void pgstat_bestart(void);
 
@@ -1218,6 +853,9 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 extern void pgstat_initstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
+extern PgStat_StatDBEntry *backend_get_db_entry(Oid dbid, bool oneshot);
+extern HTAB *backend_snapshot_all_db_entries(void);
+extern PgStat_StatTabEntry *backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid relid, bool oneshot);
 
 /* ----------
  * pgstat_report_wait_start() -
@@ -1337,15 +975,15 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                           void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
+extern void pgstat_update_archiver(const char *xlog, bool failed);
+extern void pgstat_update_bgwriter(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
  * generate the pgstat* views.
  * ----------
  */
-extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
+extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid, bool oneshot);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
@@ -1354,4 +992,14 @@ extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
 
+/* File input/output functions  */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
+
+/* For shared memory allocation/initialization */
+extern Size BackendStatusShmemSize(void);
+extern void CreateSharedBackendStatus(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/dsm.h b/src/include/storage/dsm.h
index 169de946f7..be50f43db0 100644
--- a/src/include/storage/dsm.h
+++ b/src/include/storage/dsm.h
@@ -26,6 +26,7 @@ typedef struct dsm_segment dsm_segment;
 struct PGShmemHeader;            /* avoid including pg_shmem.h */
 extern void dsm_cleanup_using_control_segment(dsm_handle old_control_handle);
 extern void dsm_postmaster_startup(struct PGShmemHeader *);
+extern void dsm_child_init(void);
 extern void dsm_backend_shutdown(void);
 extern void dsm_detach_all(void);
 
@@ -33,6 +34,7 @@ extern void dsm_detach_all(void);
 extern void dsm_set_control_handle(dsm_handle h);
 #endif
 
+extern bool dsm_is_available(void);
 /* Functions that create, update, or remove mappings. */
 extern dsm_segment *dsm_create(Size size, int flags);
 extern dsm_segment *dsm_attach(dsm_handle h);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index b2dcb73287..4cb628b15f 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -219,6 +219,9 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SHARED_TUPLESTORE,
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
+    LWTRANCHE_STATS_DSA,
+    LWTRANCHE_STATS_DB,
+    LWTRANCHE_STATS_FUNC_TABLE,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index dcc7307c16..b8a56645b6 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
diff --git a/src/test/modules/worker_spi/worker_spi.c b/src/test/modules/worker_spi/worker_spi.c
index 0d705a3f2e..da488ebfd4 100644
--- a/src/test/modules/worker_spi/worker_spi.c
+++ b/src/test/modules/worker_spi/worker_spi.c
@@ -295,7 +295,7 @@ worker_spi_main(Datum main_arg)
         SPI_finish();
         PopActiveSnapshot();
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(false);
         pgstat_report_activity(STATE_IDLE, NULL);
     }
 
-- 
2.16.3

From c1c17e0a186968a17cac7dabeb5655a0d4b39cae Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 7 Nov 2018 16:53:49 +0900
Subject: [PATCH 4/8] Make archiver process an auxiliary process

The shared-memory based stats collector requires the archiver process to
use shared memory. Make it an auxiliary process for that reason.
---
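Note on the postmaster.c hunk: BACKEND_TYPE_ALL has to stay the OR of every
individual BACKEND_TYPE_* bit, which is why it moves from 0x000F to 0x001F
when BACKEND_TYPE_ARCHIVER (0x0010) is added. The following is only a minimal
standalone sketch of that invariant, not code from the patch. The value of
BACKEND_TYPE_NORMAL is not visible in the hunk and is assumed here, and
backend_type_selected() is a hypothetical helper introduced for illustration.

```c
#include <stdbool.h>

/* Bit values as they appear in the postmaster.c hunk of this patch. */
#define BACKEND_TYPE_NORMAL    0x0001  /* assumed; not shown in the hunk */
#define BACKEND_TYPE_AUTOVAC   0x0002  /* autovacuum worker process */
#define BACKEND_TYPE_WALSND    0x0004  /* walsender process */
#define BACKEND_TYPE_BGWORKER  0x0008  /* bgworker process */
#define BACKEND_TYPE_ARCHIVER  0x0010  /* archiver process */
#define BACKEND_TYPE_ALL       0x001F  /* OR of all the above */

/* Compile-time check that the "ALL" mask really covers every bit. */
_Static_assert(BACKEND_TYPE_ALL ==
               (BACKEND_TYPE_NORMAL | BACKEND_TYPE_AUTOVAC |
                BACKEND_TYPE_WALSND | BACKEND_TYPE_BGWORKER |
                BACKEND_TYPE_ARCHIVER),
               "BACKEND_TYPE_ALL must be the OR of all backend type bits");

/*
 * Hypothetical helper: returns true when a process of type bkend_type
 * is selected by the target mask.
 */
bool
backend_type_selected(int target, int bkend_type)
{
    return (target & bkend_type) != 0;
}
```

With the old mask 0x000F, a selection over "all" types would silently skip
the archiver bit; the static assertion is one way to catch that kind of
omission at build time.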
 src/backend/bootstrap/bootstrap.c   |  8 +++
 src/backend/postmaster/pgarch.c     | 98 +++++++++----------------------------
 src/backend/postmaster/pgstat.c     |  6 +++
 src/backend/postmaster/postmaster.c | 35 +++++++++----
 src/include/miscadmin.h             |  2 +
 src/include/pgstat.h                |  1 +
 src/include/postmaster/pgarch.h     |  4 +-
 7 files changed, 67 insertions(+), 87 deletions(-)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 578af2e66d..dab0addd8b 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -327,6 +327,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case BgWriterProcess:
                 statmsg = pgstat_get_backend_desc(B_BG_WRITER);
                 break;
+            case ArchiverProcess:
+                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
+                break;
             case CheckpointerProcess:
                 statmsg = pgstat_get_backend_desc(B_CHECKPOINTER);
                 break;
@@ -454,6 +457,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
             BackgroundWriterMain();
             proc_exit(1);        /* should never return */
 
+        case ArchiverProcess:
+            /* don't set signals, archiver has its own agenda */
+            PgArchiverMain();
+            proc_exit(1);        /* should never return */
+
         case CheckpointerProcess:
             /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 3ca36d62a4..7d4e528096 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -66,7 +66,6 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
@@ -85,7 +84,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgarch_exit(SIGNAL_ARGS);
 static void ArchSigHupHandler(SIGNAL_ARGS);
 static void ArchSigTermHandler(SIGNAL_ARGS);
@@ -103,75 +101,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -211,8 +140,8 @@ pgarch_forkexec(void)
  *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
  *    since we don't use 'em, it hardly matters...
  */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -247,8 +176,27 @@ PgArchiverMain(int argc, char *argv[])
 static void
 pgarch_exit(SIGNAL_ARGS)
 {
-    /* SIGQUIT means curl up and die ... */
-    exit(1);
+    PG_SETMASK(&BlockSig);
+
+    /*
+     * We DO NOT want to run proc_exit() callbacks -- we're here because
+     * shared memory may be corrupted, so we don't want to try to clean up our
+     * transaction.  Just nail the windows shut and get out of town.  Now that
+     * there's an atexit callback to prevent third-party code from breaking
+     * things by calling exit() directly, we have to reset the callbacks
+     * explicitly to make this work as intended.
+     */
+    on_exit_reset();
+
+    /*
+     * Note we do exit(2) not exit(0).  This is to force the postmaster into a
+     * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+     * backend.  This is necessary precisely because we don't clean up our
+     * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+     * should ensure the postmaster sees this as a crash, too, but no harm in
+     * being doubly sure.)
+     */
+    exit(2);
 }
 
 /* SIGHUP signal handler for archiver process */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index cca64eca83..2d3f7cb898 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2981,6 +2981,9 @@ pgstat_bestart(void)
             case BgWriterProcess:
                 beentry->st_backendType = B_BG_WRITER;
                 break;
+            case ArchiverProcess:
+                beentry->st_backendType = B_ARCHIVER;
+                break;
             case CheckpointerProcess:
                 beentry->st_backendType = B_CHECKPOINTER;
                 break;
@@ -4244,6 +4247,9 @@ pgstat_get_backend_desc(BackendType backendType)
         case B_BG_WRITER:
             backendDesc = "background writer";
             break;
+        case B_ARCHIVER:
+            backendDesc = "archiver";
+            break;
         case B_CHECKPOINTER:
             backendDesc = "checkpointer";
             break;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a7d6ddeac7..559aeedb6e 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -146,7 +146,8 @@
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
 #define BACKEND_TYPE_BGWORKER    0x0008    /* bgworker process */
-#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
+#define BACKEND_TYPE_ARCHIVER    0x0010    /* archiver process */
+#define BACKEND_TYPE_ALL        0x001F    /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -548,6 +549,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
 #define StartWalReceiver()        StartChildProcess(WalReceiverProcess)
@@ -1746,7 +1748,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -2897,7 +2899,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3033,10 +3035,8 @@ reaper(SIGNAL_ARGS)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3266,7 +3266,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, stats collector or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3471,6 +3471,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3736,6 +3748,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -4991,7 +5004,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5294,6 +5307,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case CheckpointerProcess:
                 ereport(LOG,
                         (errmsg("could not fork checkpointer process: %m")));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 2e74ec9f60..91c3fb1a0a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -400,6 +400,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -411,6 +412,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index e5f912cf71..4e51580076 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -353,6 +353,7 @@ typedef enum BackendType
     B_BACKEND,
     B_BG_WORKER,
     B_BG_WRITER,
+    B_ARCHIVER,
     B_CHECKPOINTER,
     B_STARTUP,
     B_WAL_RECEIVER,
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index 292e63a26a..5db1d7a5ea 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
-- 
2.16.3

From d4147889d0d0489776831cff59be627335a1f82b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 4 Jul 2018 10:59:17 +0900
Subject: [PATCH 5/8] Make pg_stat_statements stop using PG_STAT_TMP_DIR.

This patchset removes the definition of PG_STAT_TMP_DIR because pgstat.c
no longer uses the directory and there is no other suitable module to
take it over. As a tentative solution, this patch moves the query text
file into the permanent stats directory. pg_basebackup and pg_rewind are
aware of the directory. They currently omit the text file, but with this
change they start copying it.
---
 contrib/pg_stat_statements/pg_stat_statements.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 33f9a79f54..ec2fa9881c 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -86,14 +86,11 @@ PG_MODULE_MAGIC;
 #define PGSS_DUMP_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pg_stat_statements.stat"
 
 /*
- * Location of external query text file.  We don't keep it in the core
- * system's stats_temp_directory.  The core system can safely use that GUC
- * setting, because the statistics collector temp file paths are set only once
- * as part of changing the GUC, but pg_stat_statements has no way of avoiding
- * race conditions.  Besides, we only expect modest, infrequent I/O for query
- * strings, so placing the file on a faster filesystem is not compelling.
+ * Location of external query text file. We only expect modest, infrequent I/O
+ * for query strings, so placing the file on a faster filesystem is not
+ * compelling.
  */
-#define PGSS_TEXT_FILE    PG_STAT_TMP_DIR "/pgss_query_texts.stat"
+#define PGSS_TEXT_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pgss_query_texts.stat"
 
 /* Magic number identifying the stats file format */
 static const uint32 PGSS_FILE_HEADER = 0x20171004;
-- 
2.16.3

From 054460602cc4d02b440072ba56b8d00fae7fcbf5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 4 Jul 2018 11:46:43 +0900
Subject: [PATCH 6/8] Remove pg_stat_tmp exclusion from pg_rewind

The directory "pg_stat_tmp" no longer exists, so remove it from the
exclusion list.
---
 src/bin/pg_rewind/filemap.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 222b56f58a..ef2d594c91 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -43,13 +43,6 @@ static bool check_file_excluded(const char *path, bool is_source);
  */
 static const char *excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    "pg_stat_tmp",                /* defined as PG_STAT_TMP_DIR */
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another master. See backup.sgml
-- 
2.16.3

From 2ddf0046c001f1aa38a100f75f832f0bfb82c61d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 4 Jul 2018 11:44:31 +0900
Subject: [PATCH 7/8] Documentation update

Remove all descriptions of the pg_stat_tmp directory from the documentation.
---
 doc/src/sgml/backup.sgml        |  4 +---
 doc/src/sgml/config.sgml        | 19 -------------------
 doc/src/sgml/func.sgml          |  3 +--
 doc/src/sgml/monitoring.sgml    |  7 +------
 doc/src/sgml/protocol.sgml      |  2 +-
 doc/src/sgml/ref/pg_rewind.sgml |  3 +--
 doc/src/sgml/storage.sgml       |  6 ------
 7 files changed, 5 insertions(+), 39 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 3fa5efdd78..31e94c1fe9 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1116,11 +1116,9 @@ SELECT pg_stop_backup();
    <para>
     The contents of the directories <filename>pg_dynshmem/</filename>,
     <filename>pg_notify/</filename>, <filename>pg_serial/</filename>,
-    <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
+    <filename>pg_snapshots/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 7554cba3f9..cd6b4b6096 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6115,25 +6115,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 10816f556b..a759bfc70d 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -15953,8 +15953,7 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
  PG_VERSION      | 15
  pg_wal          | 16
  pg_hba.conf     | 17
- pg_stat_tmp     | 18
- pg_subtrans     | 19
+ pg_subtrans     | 18
 (19 rows)
 </programlisting>
   </para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index add71458e2..a1031b3b2a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -197,12 +197,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
+   <productname>PostgreSQL</productname> processes through shared memory.
    When the server shuts down cleanly, a permanent copy of the statistics
    data is stored in the <filename>pg_stat</filename> subdirectory, so that
    statistics can be retained across server restarts.  When recovery is
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index f0b2145208..11f263f378 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2612,7 +2612,7 @@ The commands accepted in replication mode are:
         <para>
          <filename>pg_dynshmem</filename>, <filename>pg_notify</filename>,
          <filename>pg_replslot</filename>, <filename>pg_serial</filename>,
-         <filename>pg_snapshots</filename>, <filename>pg_stat_tmp</filename>, and
+         <filename>pg_snapshots</filename>, and
          <filename>pg_subtrans</filename> are copied as empty directories (even if
          they are symbolic links).
         </para>
diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index e2662bbf81..bf9c5dd580 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -270,8 +270,7 @@ PostgreSQL documentation
       (everything except the relation files). Similarly to base backups,
       the contents of the directories <filename>pg_dynshmem/</filename>,
       <filename>pg_notify/</filename>, <filename>pg_replslot/</filename>,
-      <filename>pg_serial/</filename>, <filename>pg_snapshots/</filename>,
-      <filename>pg_stat_tmp/</filename>, and
+      <filename>pg_serial/</filename>, <filename>pg_snapshots/</filename>, and
       <filename>pg_subtrans/</filename> are omitted from the data copied
       from the source cluster. Any file or directory beginning with
       <filename>pgsql_tmp</filename> is omitted, as well as are
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 8ef2ac8010..5ee7493970 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -120,12 +120,6 @@ Item
   subsystem</entry>
 </row>
 
-<row>
- <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
-</row>
-
 <row>
  <entry><filename>pg_subtrans</filename></entry>
  <entry>Subdirectory containing subtransaction status data</entry>
-- 
2.16.3

From e45cd1072350fe5e0c3d5f8e75fbe38f2f588dd3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 8 Nov 2018 18:43:55 +0900
Subject: [PATCH 8/8] Split out backend status monitor part from pgstat

pgstat.c is a large file containing two major facilities: the backend
status monitor and the database usage monitor. Split the former out of
the file into a new module named "bestatus".
---
 src/backend/Makefile                               |    2 +-
 src/backend/access/heap/rewriteheap.c              |   20 +-
 src/backend/access/nbtree/nbtree.c                 |    2 +-
 src/backend/access/nbtree/nbtsort.c                |    4 +-
 src/backend/access/transam/clog.c                  |    6 +-
 src/backend/access/transam/parallel.c              |    2 +-
 src/backend/access/transam/slru.c                  |   24 +-
 src/backend/access/transam/timeline.c              |   22 +-
 src/backend/access/transam/twophase.c              |   17 +-
 src/backend/access/transam/xact.c                  |   17 +-
 src/backend/access/transam/xlog.c                  |   67 +-
 src/backend/access/transam/xlogfuncs.c             |    2 +-
 src/backend/access/transam/xlogutils.c             |    6 +-
 src/backend/bootstrap/bootstrap.c                  |   22 +-
 src/backend/commands/vacuumlazy.c                  |   37 +-
 src/backend/executor/execParallel.c                |    4 +-
 src/backend/executor/nodeBitmapHeapscan.c          |    1 +
 src/backend/executor/nodeGather.c                  |    2 +-
 src/backend/executor/nodeHash.c                    |    2 +-
 src/backend/executor/nodeHashjoin.c                |    2 +-
 src/backend/libpq/be-secure-openssl.c              |    2 +-
 src/backend/libpq/be-secure.c                      |    2 +-
 src/backend/libpq/pqmq.c                           |    2 +-
 src/backend/postmaster/Makefile                    |    2 +-
 src/backend/postmaster/autovacuum.c                |   11 +-
 src/backend/postmaster/bgworker.c                  |    2 +-
 src/backend/postmaster/bgwriter.c                  |    3 +-
 src/backend/postmaster/checkpointer.c              |    3 +-
 src/backend/postmaster/pgarch.c                    |    1 +
 src/backend/postmaster/postmaster.c                |    5 +-
 src/backend/postmaster/syslogger.c                 |    2 +-
 src/backend/postmaster/walwriter.c                 |    4 +-
 src/backend/replication/basebackup.c               |    2 +-
 .../libpqwalreceiver/libpqwalreceiver.c            |    2 +-
 src/backend/replication/logical/launcher.c         |    2 +-
 src/backend/replication/logical/origin.c           |    3 +-
 src/backend/replication/logical/reorderbuffer.c    |   18 +-
 src/backend/replication/logical/snapbuild.c        |   26 +-
 src/backend/replication/logical/tablesync.c        |    6 +-
 src/backend/replication/logical/worker.c           |   17 +-
 src/backend/replication/slot.c                     |   26 +-
 src/backend/replication/syncrep.c                  |    2 +-
 src/backend/replication/walreceiver.c              |    2 +-
 src/backend/replication/walsender.c                |   18 +-
 src/backend/statmon/Makefile                       |   17 +
 src/backend/statmon/bestatus.c                     | 1759 ++++++++++++++++++++
 src/backend/{postmaster => statmon}/pgstat.c       | 1717 +------------------
 src/backend/storage/buffer/bufmgr.c                |    1 +
 src/backend/storage/file/buffile.c                 |    2 +-
 src/backend/storage/file/copydir.c                 |   12 +-
 src/backend/storage/file/fd.c                      |   25 +-
 src/backend/storage/ipc/dsm_impl.c                 |    6 +-
 src/backend/storage/ipc/latch.c                    |    6 +-
 src/backend/storage/ipc/procarray.c                |    6 +-
 src/backend/storage/ipc/shm_mq.c                   |    2 +-
 src/backend/storage/ipc/standby.c                  |    2 +-
 src/backend/storage/lmgr/deadlock.c                |    3 +-
 src/backend/storage/lmgr/lwlock.c                  |    6 +-
 src/backend/storage/lmgr/predicate.c               |    2 +-
 src/backend/storage/lmgr/proc.c                    |    2 +-
 src/backend/storage/smgr/md.c                      |    2 +-
 src/backend/tcop/postgres.c                        |   17 +-
 src/backend/utils/adt/misc.c                       |    2 +-
 src/backend/utils/adt/pgstatfuncs.c                |   61 +-
 src/backend/utils/cache/relmapper.c                |   14 +-
 src/backend/utils/init/miscinit.c                  |   32 +-
 src/backend/utils/init/postinit.c                  |   12 +-
 src/backend/utils/misc/guc.c                       |    7 +-
 src/include/bestatus.h                             |  545 ++++++
 src/include/pgstat.h                               |  514 +-----
 70 files changed, 2691 insertions(+), 2507 deletions(-)
 create mode 100644 src/backend/statmon/Makefile
 create mode 100644 src/backend/statmon/bestatus.c
 rename src/backend/{postmaster => statmon}/pgstat.c (70%)
 create mode 100644 src/include/bestatus.h

diff --git a/src/backend/Makefile b/src/backend/Makefile
index 3a58bf6685..9921dca7f9 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -20,7 +20,7 @@ include $(top_builddir)/src/Makefile.global
 SUBDIRS = access bootstrap catalog parser commands executor foreign lib libpq \
     main nodes optimizer partitioning port postmaster \
     regex replication rewrite \
-    statistics storage tcop tsearch utils $(top_builddir)/src/timezone \
+    statistics statmon storage tcop tsearch utils $(top_builddir)/src/timezone \
     jit
 
 include $(srcdir)/common.mk
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 7127788964..4a7c284d81 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -115,12 +115,12 @@
 #include "access/xact.h"
 #include "access/xloginsert.h"
 
+#include "bestatus.h"
+
 #include "catalog/catalog.h"
 
 #include "lib/ilist.h"
 
-#include "pgstat.h"
-
 #include "replication/logical.h"
 #include "replication/slot.h"
 
@@ -1159,13 +1159,13 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
      * Truncate all data that's not guaranteed to have been safely fsynced (by
      * previous record or by the last checkpoint).
      */
-    pgstat_report_wait_start(WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE);
+    bestatus_report_wait_start(WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE);
     if (ftruncate(fd, xlrec->offset) != 0)
         ereport(ERROR,
                 (errcode_for_file_access(),
                  errmsg("could not truncate file \"%s\" to %u: %m",
                         path, (uint32) xlrec->offset)));
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
     /* now seek to the position we want to write our data to */
     if (lseek(fd, xlrec->offset, SEEK_SET) != xlrec->offset)
@@ -1180,7 +1180,7 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
 
     /* write out tail end of mapping file (again) */
     errno = 0;
-    pgstat_report_wait_start(WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE);
+    bestatus_report_wait_start(WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE);
     if (write(fd, data, len) != len)
     {
         /* if write didn't set errno, assume problem is no disk space */
@@ -1190,19 +1190,19 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
                 (errcode_for_file_access(),
                  errmsg("could not write to file \"%s\": %m", path)));
     }
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
     /*
      * Now fsync all previously written data. We could improve things and only
      * do this for the last write to a file, but the required bookkeeping
      * doesn't seem worth the trouble.
      */
-    pgstat_report_wait_start(WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC);
+    bestatus_report_wait_start(WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC);
     if (pg_fsync(fd) != 0)
         ereport(ERROR,
                 (errcode_for_file_access(),
                  errmsg("could not fsync file \"%s\": %m", path)));
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
     CloseTransientFile(fd);
 }
@@ -1296,12 +1296,12 @@ CheckPointLogicalRewriteHeap(void)
              * changed or have only been created since the checkpoint's start,
              * but it's currently not deemed worth the effort.
              */
-            pgstat_report_wait_start(WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC);
+            bestatus_report_wait_start(WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC);
             if (pg_fsync(fd) != 0)
                 ereport(ERROR,
                         (errcode_for_file_access(),
                          errmsg("could not fsync file \"%s\": %m", path)));
-            pgstat_report_wait_end();
+            bestatus_report_wait_end();
             CloseTransientFile(fd);
         }
     }
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index e8725fbbe1..6679dbc3a5 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -22,10 +22,10 @@
 #include "access/nbtxlog.h"
 #include "access/relscan.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
-#include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "storage/condition_variable.h"
 #include "storage/indexfsm.h"
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 16f5755777..6f4fd371c8 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -63,9 +63,9 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "bestatus.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/smgr.h"
 #include "tcop/tcopprot.h"        /* pgrminclude ignore */
 #include "utils/rel.h"
@@ -1538,7 +1538,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
     debug_query_string = sharedquery;
 
     /* Report the query string from leader */
-    pgstat_report_activity(STATE_RUNNING, debug_query_string);
+    bestatus_report_activity(STATE_RUNNING, debug_query_string);
 
     /* Look up nbtree shared state */
     btshared = shm_toc_lookup(toc, PARALLEL_KEY_BTREE_SHARED, false);
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 8b7ff5b0c2..85e364b117 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -38,8 +38,8 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "pg_trace.h"
 #include "storage/proc.h"
 
@@ -480,7 +480,7 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
         int            extraWaits = 0;
 
         /* Sleep until the leader updates our XID status. */
-        pgstat_report_wait_start(WAIT_EVENT_CLOG_GROUP_UPDATE);
+        bestatus_report_wait_start(WAIT_EVENT_CLOG_GROUP_UPDATE);
         for (;;)
         {
             /* acts as a read barrier */
@@ -489,7 +489,7 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
                 break;
             extraWaits++;
         }
-        pgstat_report_wait_end();
+        bestatus_report_wait_end();
 
         Assert(pg_atomic_read_u32(&proc->clogGroupNext) == INVALID_PGPROCNO);
 
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 84197192ec..7e5c84bd5f 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -19,6 +19,7 @@
 #include "access/session.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/pg_enum.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
@@ -29,7 +30,6 @@
 #include "libpq/pqmq.h"
 #include "miscadmin.h"
 #include "optimizer/planmain.h"
-#include "pgstat.h"
 #include "storage/ipc.h"
 #include "storage/sinval.h"
 #include "storage/spin.h"
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 1132eef038..d564cf8ff3 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -54,7 +54,7 @@
 #include "access/slru.h"
 #include "access/transam.h"
 #include "access/xlog.h"
-#include "pgstat.h"
+#include "bestatus.h"
 #include "storage/fd.h"
 #include "storage/shmem.h"
 #include "miscadmin.h"
@@ -680,16 +680,16 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
     }
 
     errno = 0;
-    pgstat_report_wait_start(WAIT_EVENT_SLRU_READ);
+    bestatus_report_wait_start(WAIT_EVENT_SLRU_READ);
     if (read(fd, shared->page_buffer[slotno], BLCKSZ) != BLCKSZ)
     {
-        pgstat_report_wait_end();
+        bestatus_report_wait_end();
         slru_errcause = SLRU_READ_FAILED;
         slru_errno = errno;
         CloseTransientFile(fd);
         return false;
     }
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
     if (CloseTransientFile(fd))
     {
@@ -841,10 +841,10 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
     }
 
     errno = 0;
-    pgstat_report_wait_start(WAIT_EVENT_SLRU_WRITE);
+    bestatus_report_wait_start(WAIT_EVENT_SLRU_WRITE);
     if (write(fd, shared->page_buffer[slotno], BLCKSZ) != BLCKSZ)
     {
-        pgstat_report_wait_end();
+        bestatus_report_wait_end();
         /* if write didn't set errno, assume problem is no disk space */
         if (errno == 0)
             errno = ENOSPC;
@@ -854,7 +854,7 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
             CloseTransientFile(fd);
         return false;
     }
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
     /*
      * If not part of Flush, need to fsync now.  We assume this happens
@@ -862,16 +862,16 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
      */
     if (!fdata)
     {
-        pgstat_report_wait_start(WAIT_EVENT_SLRU_SYNC);
+        bestatus_report_wait_start(WAIT_EVENT_SLRU_SYNC);
         if (ctl->do_fsync && pg_fsync(fd))
         {
-            pgstat_report_wait_end();
+            bestatus_report_wait_end();
             slru_errcause = SLRU_FSYNC_FAILED;
             slru_errno = errno;
             CloseTransientFile(fd);
             return false;
         }
-        pgstat_report_wait_end();
+        bestatus_report_wait_end();
 
         if (CloseTransientFile(fd))
         {
@@ -1139,7 +1139,7 @@ SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
     ok = true;
     for (i = 0; i < fdata.num_files; i++)
     {
-        pgstat_report_wait_start(WAIT_EVENT_SLRU_FLUSH_SYNC);
+        bestatus_report_wait_start(WAIT_EVENT_SLRU_FLUSH_SYNC);
         if (ctl->do_fsync && pg_fsync(fdata.fd[i]))
         {
             slru_errcause = SLRU_FSYNC_FAILED;
@@ -1147,7 +1147,7 @@ SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
             pageno = fdata.segno[i] * SLRU_PAGES_PER_SEGMENT;
             ok = false;
         }
-        pgstat_report_wait_end();
+        bestatus_report_wait_end();
 
         if (CloseTransientFile(fdata.fd[i]))
         {
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 61d36050c3..626eb0ead2 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -38,7 +38,7 @@
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogdefs.h"
-#include "pgstat.h"
+#include "bestatus.h"
 #include "storage/fd.h"
 
 /*
@@ -338,9 +338,9 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
         for (;;)
         {
             errno = 0;
-            pgstat_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_READ);
+            bestatus_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_READ);
             nbytes = (int) read(srcfd, buffer, sizeof(buffer));
-            pgstat_report_wait_end();
+            bestatus_report_wait_end();
             if (nbytes < 0 || errno != 0)
                 ereport(ERROR,
                         (errcode_for_file_access(),
@@ -348,7 +348,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
             if (nbytes == 0)
                 break;
             errno = 0;
-            pgstat_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_WRITE);
+            bestatus_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_WRITE);
             if ((int) write(fd, buffer, nbytes) != nbytes)
             {
                 int            save_errno = errno;
@@ -368,7 +368,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
                         (errcode_for_file_access(),
                          errmsg("could not write to file \"%s\": %m", tmppath)));
             }
-            pgstat_report_wait_end();
+            bestatus_report_wait_end();
         }
         CloseTransientFile(srcfd);
     }
@@ -404,12 +404,12 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
                  errmsg("could not write to file \"%s\": %m", tmppath)));
     }
 
-    pgstat_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_SYNC);
+    bestatus_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_SYNC);
     if (pg_fsync(fd) != 0)
         ereport(ERROR,
                 (errcode_for_file_access(),
                  errmsg("could not fsync file \"%s\": %m", tmppath)));
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
     if (CloseTransientFile(fd))
         ereport(ERROR,
@@ -465,7 +465,7 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
                  errmsg("could not create file \"%s\": %m", tmppath)));
 
     errno = 0;
-    pgstat_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE);
+    bestatus_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE);
     if ((int) write(fd, content, size) != size)
     {
         int            save_errno = errno;
@@ -481,14 +481,14 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
                 (errcode_for_file_access(),
                  errmsg("could not write to file \"%s\": %m", tmppath)));
     }
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
-    pgstat_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC);
+    bestatus_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC);
     if (pg_fsync(fd) != 0)
         ereport(ERROR,
                 (errcode_for_file_access(),
                  errmsg("could not fsync file \"%s\": %m", tmppath)));
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
     if (CloseTransientFile(fd))
         ereport(ERROR,
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 3942734e5a..eaf0c5ac64 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -87,6 +87,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "access/xlogreader.h"
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "catalog/storage.h"
 #include "funcapi.h"
@@ -1271,7 +1272,7 @@ ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
      */
     buf = (char *) palloc(stat.st_size);
 
-    pgstat_report_wait_start(WAIT_EVENT_TWOPHASE_FILE_READ);
+    bestatus_report_wait_start(WAIT_EVENT_TWOPHASE_FILE_READ);
     r = read(fd, buf, stat.st_size);
     if (r != stat.st_size)
     {
@@ -1285,7 +1286,7 @@ ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
                             path, r, (Size) stat.st_size)));
     }
 
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
     CloseTransientFile(fd);
 
     hdr = (TwoPhaseFileHeader *) buf;
@@ -1660,12 +1661,12 @@ RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
 
     /* Write content and CRC */
     errno = 0;
-    pgstat_report_wait_start(WAIT_EVENT_TWOPHASE_FILE_WRITE);
+    bestatus_report_wait_start(WAIT_EVENT_TWOPHASE_FILE_WRITE);
     if (write(fd, content, len) != len)
     {
         int            save_errno = errno;
 
-        pgstat_report_wait_end();
+        bestatus_report_wait_end();
         CloseTransientFile(fd);
 
         /* if write didn't set errno, assume problem is no disk space */
@@ -1678,7 +1679,7 @@ RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
     {
         int            save_errno = errno;
 
-        pgstat_report_wait_end();
+        bestatus_report_wait_end();
         CloseTransientFile(fd);
 
         /* if write didn't set errno, assume problem is no disk space */
@@ -1687,13 +1688,13 @@ RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
                 (errcode_for_file_access(),
                  errmsg("could not write file \"%s\": %m", path)));
     }
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
     /*
      * We must fsync the file because the end-of-replay checkpoint will not do
      * so, there being no GXACT in shared memory yet to tell it to.
      */
-    pgstat_report_wait_start(WAIT_EVENT_TWOPHASE_FILE_SYNC);
+    bestatus_report_wait_start(WAIT_EVENT_TWOPHASE_FILE_SYNC);
     if (pg_fsync(fd) != 0)
     {
         int            save_errno = errno;
@@ -1704,7 +1705,7 @@ RecreateTwoPhaseFile(TransactionId xid, void *content, int len)
                 (errcode_for_file_access(),
                  errmsg("could not fsync file \"%s\": %m", path)));
     }
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
     if (CloseTransientFile(fd) != 0)
         ereport(ERROR,
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 8c1621d949..a0f9dd0b2e 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -30,6 +30,7 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
 #include "catalog/storage.h"
@@ -1905,7 +1906,7 @@ StartTransaction(void)
     }
     else
         Assert(xactStartTimestamp != 0);
-    pgstat_report_xact_timestamp(xactStartTimestamp);
+    bestatus_report_xact_timestamp(xactStartTimestamp);
     /* Mark xactStopTimestamp as unset. */
     xactStopTimestamp = 0;
 
@@ -2146,7 +2147,7 @@ CommitTransaction(void)
     AtEOXact_PgStat(true);
     AtEOXact_Snapshot(true, false);
     AtEOXact_ApplyLauncher(true);
-    pgstat_report_xact_timestamp(0);
+    bestatus_report_xact_timestamp(0);
 
     CurrentResourceOwner = NULL;
     ResourceOwnerDelete(TopTransactionResourceOwner);
@@ -2424,7 +2425,7 @@ PrepareTransaction(void)
     AtEOXact_HashTables(true);
     /* don't call AtEOXact_PgStat here; we fixed pgstat state above */
     AtEOXact_Snapshot(true, true);
-    pgstat_report_xact_timestamp(0);
+    bestatus_report_xact_timestamp(0);
 
     CurrentResourceOwner = NULL;
     ResourceOwnerDelete(TopTransactionResourceOwner);
@@ -2481,8 +2482,8 @@ AbortTransaction(void)
     LWLockReleaseAll();
 
     /* Clear wait information and command progress indicator */
-    pgstat_report_wait_end();
-    pgstat_progress_end_command();
+    bestatus_report_wait_end();
+    bestatus_progress_end_command();
 
     /* Clean up buffer I/O and buffer context locks, too */
     AbortBufferIO();
@@ -2627,7 +2628,7 @@ AbortTransaction(void)
         AtEOXact_HashTables(false);
         AtEOXact_PgStat(false);
         AtEOXact_ApplyLauncher(false);
-        pgstat_report_xact_timestamp(0);
+        bestatus_report_xact_timestamp(0);
     }
 
     /*
@@ -4703,8 +4704,8 @@ AbortSubTransaction(void)
      */
     LWLockReleaseAll();
 
-    pgstat_report_wait_end();
-    pgstat_progress_end_command();
+    bestatus_report_wait_end();
+    bestatus_progress_end_command();
     AbortBufferIO();
     UnlockBuffers();
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e72040178a..b5289197fa 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -36,6 +36,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
 #include "catalog/pg_database.h"
@@ -2497,9 +2498,9 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
             do
             {
                 errno = 0;
-                pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
+                bestatus_report_wait_start(WAIT_EVENT_WAL_WRITE);
                 written = write(openLogFile, from, nleft);
-                pgstat_report_wait_end();
+                bestatus_report_wait_end();
                 if (written <= 0)
                 {
                     if (errno == EINTR)
@@ -3268,7 +3269,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
     for (nbytes = 0; nbytes < wal_segment_size; nbytes += XLOG_BLCKSZ)
     {
         errno = 0;
-        pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
+        bestatus_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
         if ((int) write(fd, zbuffer.data, XLOG_BLCKSZ) != (int) XLOG_BLCKSZ)
         {
             int            save_errno = errno;
@@ -3287,10 +3288,10 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
                     (errcode_for_file_access(),
                      errmsg("could not write to file \"%s\": %m", tmppath)));
         }
-        pgstat_report_wait_end();
+        bestatus_report_wait_end();
     }
 
-    pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
+    bestatus_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
     if (pg_fsync(fd) != 0)
     {
         int            save_errno = errno;
@@ -3301,7 +3302,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
                 (errcode_for_file_access(),
                  errmsg("could not fsync file \"%s\": %m", tmppath)));
     }
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
     if (close(fd))
         ereport(ERROR,
@@ -3427,7 +3428,7 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 
             if (nread > sizeof(buffer))
                 nread = sizeof(buffer);
-            pgstat_report_wait_start(WAIT_EVENT_WAL_COPY_READ);
+            bestatus_report_wait_start(WAIT_EVENT_WAL_COPY_READ);
             r = read(srcfd, buffer.data, nread);
             if (r != nread)
             {
@@ -3442,10 +3443,10 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
                              errmsg("could not read file \"%s\": read %d of %zu",
                                     path, r, (Size) nread)));
             }
-            pgstat_report_wait_end();
+            bestatus_report_wait_end();
         }
         errno = 0;
-        pgstat_report_wait_start(WAIT_EVENT_WAL_COPY_WRITE);
+        bestatus_report_wait_start(WAIT_EVENT_WAL_COPY_WRITE);
         if ((int) write(fd, buffer.data, sizeof(buffer)) != (int) sizeof(buffer))
         {
             int            save_errno = errno;
@@ -3461,15 +3462,15 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
                     (errcode_for_file_access(),
                      errmsg("could not write to file \"%s\": %m", tmppath)));
         }
-        pgstat_report_wait_end();
+        bestatus_report_wait_end();
     }
 
-    pgstat_report_wait_start(WAIT_EVENT_WAL_COPY_SYNC);
+    bestatus_report_wait_start(WAIT_EVENT_WAL_COPY_SYNC);
     if (pg_fsync(fd) != 0)
         ereport(ERROR,
                 (errcode_for_file_access(),
                  errmsg("could not fsync file \"%s\": %m", tmppath)));
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
     if (CloseTransientFile(fd))
         ereport(ERROR,
@@ -4521,7 +4522,7 @@ WriteControlFile(void)
                         XLOG_CONTROL_FILE)));
 
     errno = 0;
-    pgstat_report_wait_start(WAIT_EVENT_CONTROL_FILE_WRITE);
+    bestatus_report_wait_start(WAIT_EVENT_CONTROL_FILE_WRITE);
     if (write(fd, buffer, PG_CONTROL_FILE_SIZE) != PG_CONTROL_FILE_SIZE)
     {
         /* if write didn't set errno, assume problem is no disk space */
@@ -4532,15 +4533,15 @@ WriteControlFile(void)
                  errmsg("could not write to file \"%s\": %m",
                         XLOG_CONTROL_FILE)));
     }
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
-    pgstat_report_wait_start(WAIT_EVENT_CONTROL_FILE_SYNC);
+    bestatus_report_wait_start(WAIT_EVENT_CONTROL_FILE_SYNC);
     if (pg_fsync(fd) != 0)
         ereport(PANIC,
                 (errcode_for_file_access(),
                  errmsg("could not fsync file \"%s\": %m",
                         XLOG_CONTROL_FILE)));
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
     if (close(fd))
         ereport(PANIC,
@@ -4568,7 +4569,7 @@ ReadControlFile(void)
                  errmsg("could not open file \"%s\": %m",
                         XLOG_CONTROL_FILE)));
 
-    pgstat_report_wait_start(WAIT_EVENT_CONTROL_FILE_READ);
+    bestatus_report_wait_start(WAIT_EVENT_CONTROL_FILE_READ);
     r = read(fd, ControlFile, sizeof(ControlFileData));
     if (r != sizeof(ControlFileData))
     {
@@ -4583,7 +4584,7 @@ ReadControlFile(void)
                      errmsg("could not read file \"%s\": read %d of %zu",
                             XLOG_CONTROL_FILE, r, sizeof(ControlFileData))));
     }
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
     close(fd);
 
@@ -4780,7 +4781,7 @@ UpdateControlFile(void)
                  errmsg("could not open file \"%s\": %m", XLOG_CONTROL_FILE)));
 
     errno = 0;
-    pgstat_report_wait_start(WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE);
+    bestatus_report_wait_start(WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE);
     if (write(fd, ControlFile, sizeof(ControlFileData)) != sizeof(ControlFileData))
     {
         /* if write didn't set errno, assume problem is no disk space */
@@ -4791,15 +4792,15 @@ UpdateControlFile(void)
                  errmsg("could not write to file \"%s\": %m",
                         XLOG_CONTROL_FILE)));
     }
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
-    pgstat_report_wait_start(WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE);
+    bestatus_report_wait_start(WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE);
     if (pg_fsync(fd) != 0)
         ereport(PANIC,
                 (errcode_for_file_access(),
                  errmsg("could not fsync file \"%s\": %m",
                         XLOG_CONTROL_FILE)));
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
     if (close(fd))
         ereport(PANIC,
@@ -5219,7 +5220,7 @@ BootStrapXLOG(void)
 
     /* Write the first page with the initial record */
     errno = 0;
-    pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
+    bestatus_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
     if (write(openLogFile, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
     {
         /* if write didn't set errno, assume problem is no disk space */
@@ -5229,14 +5230,14 @@ BootStrapXLOG(void)
                 (errcode_for_file_access(),
                  errmsg("could not write bootstrap write-ahead log file: %m")));
     }
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
-    pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
+    bestatus_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
     if (pg_fsync(openLogFile) != 0)
         ereport(PANIC,
                 (errcode_for_file_access(),
                  errmsg("could not fsync bootstrap write-ahead log file: %m")));
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
     if (close(openLogFile))
         ereport(PANIC,
@@ -10276,13 +10277,13 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
          */
         if (openLogFile >= 0)
         {
-            pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN);
+            bestatus_report_wait_start(WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN);
             if (pg_fsync(openLogFile) != 0)
                 ereport(PANIC,
                         (errcode_for_file_access(),
                          errmsg("could not fsync file \"%s\": %m",
                                 XLogFileNameP(ThisTimeLineID, openLogSegNo))));
-            pgstat_report_wait_end();
+            bestatus_report_wait_end();
             if (get_sync_bit(sync_method) != get_sync_bit(new_sync_method))
                 XLogFileClose();
         }
@@ -10299,7 +10300,7 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
 void
 issue_xlog_fsync(int fd, XLogSegNo segno)
 {
-    pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
+    bestatus_report_wait_start(WAIT_EVENT_WAL_SYNC);
     switch (sync_method)
     {
         case SYNC_METHOD_FSYNC:
@@ -10335,7 +10336,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
             elog(PANIC, "unrecognized wal_sync_method: %d", sync_method);
             break;
     }
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 }
 
 /*
@@ -11835,14 +11836,14 @@ retry:
         goto next_record_is_invalid;
     }
 
-    pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+    bestatus_report_wait_start(WAIT_EVENT_WAL_READ);
     r = read(readFile, readBuf, XLOG_BLCKSZ);
     if (r != XLOG_BLCKSZ)
     {
         char        fname[MAXFNAMELEN];
         int            save_errno = errno;
 
-        pgstat_report_wait_end();
+        bestatus_report_wait_end();
         XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
         if (r < 0)
         {
@@ -11859,7 +11860,7 @@ retry:
                             fname, readOff, r, (Size) XLOG_BLCKSZ)));
         goto next_record_is_invalid;
     }
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
     Assert(targetSegNo == readSegNo);
     Assert(targetPageOff == readOff);
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index a31adcca5e..b72da3f45f 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -23,9 +23,9 @@
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "funcapi.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/walreceiver.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 4ecdc9220f..1d9e1ca2b9 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -23,8 +23,8 @@
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/smgr.h"
 #include "utils/guc.h"
 #include "utils/hsearch.h"
@@ -736,9 +736,9 @@ XLogRead(char *buf, int segsize, TimeLineID tli, XLogRecPtr startptr,
         else
             segbytes = nbytes;
 
-        pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+        bestatus_report_wait_start(WAIT_EVENT_WAL_READ);
         readbytes = read(sendFile, p, segbytes);
-        pgstat_report_wait_end();
+        bestatus_report_wait_end();
         if (readbytes <= 0)
         {
             char        path[MAXPGPATH];
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index dab0addd8b..1053985670 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -20,6 +20,7 @@
 #include "access/htup_details.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/index.h"
 #include "catalog/pg_collation.h"
@@ -322,22 +323,22 @@ AuxiliaryProcessMain(int argc, char *argv[])
         switch (MyAuxProcType)
         {
             case StartupProcess:
-                statmsg = pgstat_get_backend_desc(B_STARTUP);
+                statmsg = bestatus_get_backend_desc(B_STARTUP);
                 break;
             case BgWriterProcess:
-                statmsg = pgstat_get_backend_desc(B_BG_WRITER);
-                break;
-            case ArchiverProcess:
-                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
+                statmsg = bestatus_get_backend_desc(B_BG_WRITER);
                 break;
             case CheckpointerProcess:
-                statmsg = pgstat_get_backend_desc(B_CHECKPOINTER);
+                statmsg = bestatus_get_backend_desc(B_CHECKPOINTER);
                 break;
             case WalWriterProcess:
-                statmsg = pgstat_get_backend_desc(B_WAL_WRITER);
+                statmsg = bestatus_get_backend_desc(B_WAL_WRITER);
                 break;
             case WalReceiverProcess:
-                statmsg = pgstat_get_backend_desc(B_WAL_RECEIVER);
+                statmsg = bestatus_get_backend_desc(B_WAL_RECEIVER);
+                break;
+            case ArchiverProcess:
+                statmsg = bestatus_get_backend_desc(B_ARCHIVER);
                 break;
             default:
                 statmsg = "??? process";
@@ -416,7 +417,8 @@ AuxiliaryProcessMain(int argc, char *argv[])
 
         /* Initialize backend status information */
         pgstat_initialize();
-        pgstat_bestart();
+        bestatus_initialize();
+        bestatus_bestart();
 
         /* register a before-shutdown callback for LWLock cleanup */
         before_shmem_exit(ShutdownAuxiliaryProcess, 0);
@@ -583,7 +585,7 @@ ShutdownAuxiliaryProcess(int code, Datum arg)
 {
     LWLockReleaseAll();
     ConditionVariableCancelSleep();
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 8996d366e9..6a14165907 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -44,6 +44,7 @@
 #include "access/transam.h"
 #include "access/visibilitymap.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/storage.h"
 #include "commands/dbcommands.h"
 #include "commands/progress.h"
@@ -223,7 +224,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
     else
         elevel = DEBUG2;
 
-    pgstat_progress_start_command(PROGRESS_COMMAND_VACUUM,
+    bestatus_progress_start_command(PROGRESS_COMMAND_VACUUM,
                                   RelationGetRelid(onerel));
 
     vac_strategy = bstrategy;
@@ -290,7 +291,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
         lazy_truncate_heap(onerel, vacrelstats);
 
     /* Report that we are now doing final cleanup */
-    pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
+    bestatus_progress_update_param(PROGRESS_VACUUM_PHASE,
                                  PROGRESS_VACUUM_PHASE_FINAL_CLEANUP);
 
     /*
@@ -343,7 +344,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
                          onerel->rd_rel->relisshared,
                          new_live_tuples,
                          vacrelstats->new_dead_tuples);
-    pgstat_progress_end_command();
+    bestatus_progress_end_command();
 
     /* and log the action if appropriate */
     if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
@@ -537,7 +538,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
     initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
     initprog_val[1] = nblocks;
     initprog_val[2] = vacrelstats->max_dead_tuples;
-    pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
+    bestatus_progress_update_multi_param(3, initprog_index, initprog_val);
 
     /*
      * Except when aggressive is set, we want to skip pages that are
@@ -633,7 +634,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 #define FORCE_CHECK_PAGE() \
         (blkno == nblocks - 1 && should_attempt_truncation(vacrelstats))
 
-        pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
+        bestatus_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
 
         if (blkno == next_unskippable_block)
         {
@@ -739,7 +740,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
             vacuum_log_cleanup_info(onerel, vacrelstats);
 
             /* Report that we are now vacuuming indexes */
-            pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
+            bestatus_progress_update_param(PROGRESS_VACUUM_PHASE,
                                          PROGRESS_VACUUM_PHASE_VACUUM_INDEX);
 
             /* Remove index entries */
@@ -751,12 +752,12 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
             /*
              * Report that we are now vacuuming the heap.  We also increase
              * the number of index scans here; note that by using
-             * pgstat_progress_update_multi_param we can update both
+             * bestatus_progress_update_multi_param we can update both
              * parameters atomically.
              */
             hvp_val[0] = PROGRESS_VACUUM_PHASE_VACUUM_HEAP;
             hvp_val[1] = vacrelstats->num_index_scans + 1;
-            pgstat_progress_update_multi_param(2, hvp_index, hvp_val);
+            bestatus_progress_update_multi_param(2, hvp_index, hvp_val);
 
             /* Remove tuples from heap */
             lazy_vacuum_heap(onerel, vacrelstats);
@@ -777,7 +778,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
             next_fsm_block_to_vacuum = blkno;
 
             /* Report that we are once again scanning the heap */
-            pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
+            bestatus_progress_update_param(PROGRESS_VACUUM_PHASE,
                                          PROGRESS_VACUUM_PHASE_SCAN_HEAP);
         }
 
@@ -1343,7 +1344,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
     }
 
     /* report that everything is scanned and vacuumed */
-    pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
+    bestatus_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
 
     pfree(frozen);
 
@@ -1384,7 +1385,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
         vacuum_log_cleanup_info(onerel, vacrelstats);
 
         /* Report that we are now vacuuming indexes */
-        pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
+        bestatus_progress_update_param(PROGRESS_VACUUM_PHASE,
                                      PROGRESS_VACUUM_PHASE_VACUUM_INDEX);
 
         /* Remove index entries */
@@ -1396,10 +1397,10 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
         /* Report that we are now vacuuming the heap */
         hvp_val[0] = PROGRESS_VACUUM_PHASE_VACUUM_HEAP;
         hvp_val[1] = vacrelstats->num_index_scans + 1;
-        pgstat_progress_update_multi_param(2, hvp_index, hvp_val);
+        bestatus_progress_update_multi_param(2, hvp_index, hvp_val);
 
         /* Remove tuples from heap */
-        pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
+        bestatus_progress_update_param(PROGRESS_VACUUM_PHASE,
                                      PROGRESS_VACUUM_PHASE_VACUUM_HEAP);
         lazy_vacuum_heap(onerel, vacrelstats);
         vacrelstats->num_index_scans++;
@@ -1413,8 +1414,8 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
         FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno);
 
     /* report all blocks vacuumed; and that we're cleaning up */
-    pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
-    pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
+    bestatus_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
+    bestatus_progress_update_param(PROGRESS_VACUUM_PHASE,
                                  PROGRESS_VACUUM_PHASE_INDEX_CLEANUP);
 
     /* Do post-vacuum cleanup and statistics update for each index */
@@ -1548,7 +1549,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
     TransactionId visibility_cutoff_xid;
     bool        all_frozen;
 
-    pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
+    bestatus_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
 
     START_CRIT_SECTION();
 
@@ -1825,7 +1826,7 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
     pg_rusage_init(&ru0);
 
     /* Report that we are now truncating */
-    pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
+    bestatus_progress_update_param(PROGRESS_VACUUM_PHASE,
                                  PROGRESS_VACUUM_PHASE_TRUNCATE);
 
     /*
@@ -2132,7 +2133,7 @@ lazy_record_dead_tuple(LVRelStats *vacrelstats,
     {
         vacrelstats->dead_tuples[vacrelstats->num_dead_tuples] = *itemptr;
         vacrelstats->num_dead_tuples++;
-        pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
+        bestatus_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
                                      vacrelstats->num_dead_tuples);
     }
 }
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 13ef232d39..b368da619f 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -48,7 +48,7 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
-#include "pgstat.h"
+#include "bestatus.h"
 
 /*
  * Magic numbers for parallel executor communication.  We use constants
@@ -1358,7 +1358,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
     debug_query_string = queryDesc->sourceText;
 
     /* Report workers' query for monitoring purposes */
-    pgstat_report_activity(STATE_RUNNING, debug_query_string);
+    bestatus_report_activity(STATE_RUNNING, debug_query_string);
 
     /* Attach to the dynamic shared memory area. */
     area_space = shm_toc_lookup(toc, PARALLEL_KEY_DSA, false);
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 304ef07f2c..856f497d51 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -40,6 +40,7 @@
 #include "access/relscan.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "bestatus.h"
 #include "executor/execdebug.h"
 #include "executor/nodeBitmapHeapscan.h"
 #include "miscadmin.h"
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index ad16c783bd..281f27998a 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -32,6 +32,7 @@
 
 #include "access/relscan.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "executor/execdebug.h"
 #include "executor/execParallel.h"
 #include "executor/nodeGather.h"
@@ -39,7 +40,6 @@
 #include "executor/tqueue.h"
 #include "miscadmin.h"
 #include "optimizer/planmain.h"
-#include "pgstat.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 6ffaa751f2..7a850e8192 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -28,6 +28,7 @@
 
 #include "access/htup_details.h"
 #include "access/parallel.h"
+#include "bestatus.h"
 #include "catalog/pg_statistic.h"
 #include "commands/tablespace.h"
 #include "executor/execdebug.h"
@@ -35,7 +36,6 @@
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "port/atomics.h"
 #include "utils/dynahash.h"
 #include "utils/memutils.h"
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index dd94cffbd1..91d3a0bbbd 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -108,12 +108,12 @@
 
 #include "access/htup_details.h"
 #include "access/parallel.h"
+#include "bestatus.h"
 #include "executor/executor.h"
 #include "executor/hashjoin.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "utils/memutils.h"
 #include "utils/sharedtuplestore.h"
 
diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c
index 6a576572bb..5a304c7405 100644
--- a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c
@@ -36,9 +36,9 @@
 #include <openssl/ec.h>
 #endif
 
+#include "bestatus.h"
 #include "libpq/libpq.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/latch.h"
 #include "tcop/tcopprot.h"
diff --git a/src/backend/libpq/be-secure.c b/src/backend/libpq/be-secure.c
index 4eb21fe89d..517b22a694 100644
--- a/src/backend/libpq/be-secure.c
+++ b/src/backend/libpq/be-secure.c
@@ -29,9 +29,9 @@
 #include <arpa/inet.h>
 #endif
 
+#include "bestatus.h"
 #include "libpq/libpq.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "tcop/tcopprot.h"
 #include "utils/memutils.h"
 #include "storage/ipc.h"
diff --git a/src/backend/libpq/pqmq.c b/src/backend/libpq/pqmq.c
index 6eaed5bf0c..5906682fbf 100644
--- a/src/backend/libpq/pqmq.c
+++ b/src/backend/libpq/pqmq.c
@@ -13,11 +13,11 @@
 
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
 #include "libpq/pqmq.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c23211b2..311e63017d 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
-    pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+    pgarch.o postmaster.o startup.o syslogger.o walwriter.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 10e707e9a1..2005021468 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -71,6 +71,7 @@
 #include "access/reloptions.h"
 #include "access/transam.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "catalog/dependency.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
@@ -436,7 +437,7 @@ AutoVacLauncherMain(int argc, char *argv[])
     am_autovacuum_launcher = true;
 
     /* Identify myself via ps */
-    init_ps_display(pgstat_get_backend_desc(B_AUTOVAC_LAUNCHER), "", "", "");
+    init_ps_display(bestatus_get_backend_desc(B_AUTOVAC_LAUNCHER), "", "", "");
 
     ereport(DEBUG1,
             (errmsg("autovacuum launcher started")));
@@ -519,7 +520,7 @@ AutoVacLauncherMain(int argc, char *argv[])
          * transaction.
          */
         LWLockReleaseAll();
-        pgstat_report_wait_end();
+        bestatus_report_wait_end();
         AbortBufferIO();
         UnlockBuffers();
         /* this is probably dead code, but let's be safe: */
@@ -1530,7 +1531,7 @@ AutoVacWorkerMain(int argc, char *argv[])
     am_autovacuum_worker = true;
 
     /* Identify myself via ps */
-    init_ps_display(pgstat_get_backend_desc(B_AUTOVAC_WORKER), "", "", "");
+    init_ps_display(bestatus_get_backend_desc(B_AUTOVAC_WORKER), "", "", "");
 
     SetProcessingMode(InitProcessing);
 
@@ -3173,7 +3174,7 @@ autovac_report_activity(autovac_table *tab)
     /* Set statement_timestamp() to current time for pg_stat_activity */
     SetCurrentStatementStartTimestamp();
 
-    pgstat_report_activity(STATE_RUNNING, activity);
+    bestatus_report_activity(STATE_RUNNING, activity);
 }
 
 /*
@@ -3212,7 +3213,7 @@ autovac_report_workitem(AutoVacuumWorkItem *workitem,
     /* Set statement_timestamp() to current time for pg_stat_activity */
     SetCurrentStatementStartTimestamp();
 
-    pgstat_report_activity(STATE_RUNNING, activity);
+    bestatus_report_activity(STATE_RUNNING, activity);
 }
 
 /*
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index d2b695e146..01eaa187ff 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -16,8 +16,8 @@
 
 #include "libpq/pqsignal.h"
 #include "access/parallel.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "port/atomics.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/postmaster.h"
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index a4b1079e60..36f3c91286 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -40,6 +40,7 @@
 
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -222,7 +223,7 @@ BackgroundWriterMain(void)
         smgrcloseall();
 
         /* Report wait end here, when there is no further possibility of wait */
-        pgstat_report_wait_end();
+        bestatus_report_wait_end();
     }
 
     /* We can now handle ereport(ERROR) */
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 9235390bc6..02aa48743f 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -43,6 +43,7 @@
 
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -266,7 +267,7 @@ CheckpointerMain(void)
          */
         LWLockReleaseAll();
         ConditionVariableCancelSleep();
-        pgstat_report_wait_end();
+        bestatus_report_wait_end();
         AbortBufferIO();
         UnlockBuffers();
         ReleaseAuxProcessResources(false);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 7d4e528096..deec58b057 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -34,6 +34,7 @@
 
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 559aeedb6e..db343b86c6 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -95,6 +95,7 @@
 
 #include "access/transam.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/pg_control.h"
 #include "common/file_perm.h"
@@ -3534,7 +3535,7 @@ LogChildExit(int lev, const char *procname, int pid, int exitstatus)
     const char *activity = NULL;
 
     if (!EXIT_STATUS_0(exitstatus))
-        activity = pgstat_get_crashed_backend_activity(pid,
+        activity = bestatus_get_crashed_backend_activity(pid,
                                                        activity_buffer,
                                                        sizeof(activity_buffer));
 
@@ -4254,7 +4255,7 @@ BackendInitialize(Port *port)
      * init_ps_display() to avoid abusing the parameters like this.
      */
     if (am_walsender)
-        init_ps_display(pgstat_get_backend_desc(B_WAL_SENDER), port->user_name, remote_ps_data,
+        init_ps_display(bestatus_get_backend_desc(B_WAL_SENDER), port->user_name, remote_ps_data,
                         update_process_title ? "authentication" : "");
     else
         init_ps_display(port->user_name, port->database_name, remote_ps_data,
diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index 29bdcec895..d23987b20e 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -31,11 +31,11 @@
 #include <sys/stat.h>
 #include <sys/time.h>
 
+#include "bestatus.h"
 #include "lib/stringinfo.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "nodes/pg_list.h"
-#include "pgstat.h"
 #include "pgtime.h"
 #include "postmaster/fork_process.h"
 #include "postmaster/postmaster.h"
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index fb66bceeed..b2c59d9d5f 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -45,9 +45,9 @@
 #include <unistd.h>
 
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/walwriter.h"
 #include "storage/bufmgr.h"
 #include "storage/condition_variable.h"
@@ -163,7 +163,7 @@ WalWriterMain(void)
          */
         LWLockReleaseAll();
         ConditionVariableCancelSleep();
-        pgstat_report_wait_end();
+        bestatus_report_wait_end();
         AbortBufferIO();
         UnlockBuffers();
         ReleaseAuxProcessResources(false);
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 20cf33354a..1ce0809361 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -17,6 +17,7 @@
 #include <time.h>
 
 #include "access/xlog_internal.h"    /* for pg_start/stop_backup */
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "common/file_perm.h"
 #include "lib/stringinfo.h"
@@ -25,7 +26,6 @@
 #include "miscadmin.h"
 #include "nodes/pg_list.h"
 #include "pgtar.h"
-#include "pgstat.h"
 #include "port.h"
 #include "postmaster/syslogger.h"
 #include "replication/basebackup.h"
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 1e1695ef4f..b992473fd4 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -22,11 +22,11 @@
 #include "libpq-fe.h"
 #include "pqexpbuffer.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "funcapi.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/walreceiver.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index ada16adb67..bf7ac927f7 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -19,7 +19,7 @@
 
 #include "funcapi.h"
 #include "miscadmin.h"
-#include "pgstat.h"
+#include "bestatus.h"
 
 #include "access/heapam.h"
 #include "access/htup.h"
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index bf97dcdee4..a60ef0a9f1 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -77,13 +77,12 @@
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/xact.h"
-
+#include "bestatus.h"
 #include "catalog/indexing.h"
 #include "nodes/execnodes.h"
 
 #include "replication/origin.h"
 #include "replication/logical.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index bed63c768e..c8ffd885f5 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -61,10 +61,10 @@
 #include "access/tuptoaster.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "lib/binaryheap.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/logical.h"
 #include "replication/reorderbuffer.h"
 #include "replication/slot.h"
@@ -2494,7 +2494,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
     ondisk->size = sz;
 
     errno = 0;
-    pgstat_report_wait_start(WAIT_EVENT_REORDER_BUFFER_WRITE);
+    bestatus_report_wait_start(WAIT_EVENT_REORDER_BUFFER_WRITE);
     if (write(fd, rb->outbuf, ondisk->size) != ondisk->size)
     {
         int            save_errno = errno;
@@ -2508,7 +2508,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
                  errmsg("could not write to data file for XID %u: %m",
                         txn->xid)));
     }
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
     Assert(ondisk->change.action == change->action);
 }
@@ -2583,9 +2583,9 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
          * end of this file.
          */
         ReorderBufferSerializeReserve(rb, sizeof(ReorderBufferDiskChange));
-        pgstat_report_wait_start(WAIT_EVENT_REORDER_BUFFER_READ);
+        bestatus_report_wait_start(WAIT_EVENT_REORDER_BUFFER_READ);
         readBytes = read(*fd, rb->outbuf, sizeof(ReorderBufferDiskChange));
-        pgstat_report_wait_end();
+        bestatus_report_wait_end();
 
         /* eof */
         if (readBytes == 0)
@@ -2612,10 +2612,10 @@ ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
                                       sizeof(ReorderBufferDiskChange) + ondisk->size);
         ondisk = (ReorderBufferDiskChange *) rb->outbuf;
 
-        pgstat_report_wait_start(WAIT_EVENT_REORDER_BUFFER_READ);
+        bestatus_report_wait_start(WAIT_EVENT_REORDER_BUFFER_READ);
         readBytes = read(*fd, rb->outbuf + sizeof(ReorderBufferDiskChange),
                          ondisk->size - sizeof(ReorderBufferDiskChange));
-        pgstat_report_wait_end();
+        bestatus_report_wait_end();
 
         if (readBytes < 0)
             ereport(ERROR,
@@ -3299,9 +3299,9 @@ ApplyLogicalMappingFile(HTAB *tuplecid_data, Oid relid, const char *fname)
         memset(&key, 0, sizeof(ReorderBufferTupleCidKey));
 
         /* read all mappings till the end of the file */
-        pgstat_report_wait_start(WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ);
+        bestatus_report_wait_start(WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ);
         readBytes = read(fd, &map, sizeof(LogicalRewriteMappingData));
-        pgstat_report_wait_end();
+        bestatus_report_wait_end();
 
         if (readBytes < 0)
             ereport(ERROR,
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index a6cd6c67d1..e3c2d79919 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -126,7 +126,7 @@
 #include "access/transam.h"
 #include "access/xact.h"
 
-#include "pgstat.h"
+#include "bestatus.h"
 
 #include "replication/logical.h"
 #include "replication/reorderbuffer.h"
@@ -1610,7 +1610,7 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
                 (errmsg("could not open file \"%s\": %m", path)));
 
     errno = 0;
-    pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_WRITE);
+    bestatus_report_wait_start(WAIT_EVENT_SNAPBUILD_WRITE);
     if ((write(fd, ondisk, needed_length)) != needed_length)
     {
         int            save_errno = errno;
@@ -1623,7 +1623,7 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
                 (errcode_for_file_access(),
                  errmsg("could not write to file \"%s\": %m", tmppath)));
     }
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
     /*
      * fsync the file before renaming so that even if we crash after this we
@@ -1633,7 +1633,7 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
      * some noticeable overhead since it's performed synchronously during
      * decoding?
      */
-    pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_SYNC);
+    bestatus_report_wait_start(WAIT_EVENT_SNAPBUILD_SYNC);
     if (pg_fsync(fd) != 0)
     {
         int            save_errno = errno;
@@ -1644,7 +1644,7 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
                 (errcode_for_file_access(),
                  errmsg("could not fsync file \"%s\": %m", tmppath)));
     }
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
     CloseTransientFile(fd);
 
     fsync_fname("pg_logical/snapshots", true);
@@ -1719,9 +1719,9 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 
 
     /* read statically sized portion of snapshot */
-    pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
+    bestatus_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
     readBytes = read(fd, &ondisk, SnapBuildOnDiskConstantSize);
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
     if (readBytes != SnapBuildOnDiskConstantSize)
     {
         int            save_errno = errno;
@@ -1759,9 +1759,9 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
                 SnapBuildOnDiskConstantSize - SnapBuildOnDiskNotChecksummedSize);
 
     /* read SnapBuild */
-    pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
+    bestatus_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
     readBytes = read(fd, &ondisk.builder, sizeof(SnapBuild));
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
     if (readBytes != sizeof(SnapBuild))
     {
         int            save_errno = errno;
@@ -1787,9 +1787,9 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
     sz = sizeof(TransactionId) * ondisk.builder.was_running.was_xcnt_space;
     ondisk.builder.was_running.was_xip =
         MemoryContextAllocZero(builder->context, sz);
-    pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
+    bestatus_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
     readBytes = read(fd, ondisk.builder.was_running.was_xip, sz);
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
     if (readBytes != sz)
     {
         int            save_errno = errno;
@@ -1814,9 +1814,9 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
     /* restore committed xacts information */
     sz = sizeof(TransactionId) * ondisk.builder.committed.xcnt;
     ondisk.builder.committed.xip = MemoryContextAllocZero(builder->context, sz);
-    pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
+    bestatus_report_wait_start(WAIT_EVENT_SNAPBUILD_READ);
     readBytes = read(fd, ondisk.builder.committed.xip, sz);
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
     if (readBytes != sz)
     {
         int            save_errno = errno;
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 862582da23..670552593f 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -86,25 +86,27 @@
 #include "postgres.h"
 
 #include "miscadmin.h"
-#include "pgstat.h"
 
 #include "access/xact.h"
 
+#include "bestatus.h"
+
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 
 #include "commands/copy.h"
 
 #include "parser/parse_relation.h"
+#include "pgstat.h"
 
 #include "replication/logicallauncher.h"
 #include "replication/logicalrelation.h"
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 
-#include "utils/snapmgr.h"
 #include "storage/ipc.h"
 
+#include "utils/snapmgr.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 087850d089..4b7ae0d7ff 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -23,13 +23,11 @@
 
 #include "postgres.h"
 
-#include "miscadmin.h"
-#include "pgstat.h"
-#include "funcapi.h"
-
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 
+#include "bestatus.h"
+
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_subscription.h"
@@ -41,17 +39,20 @@
 #include "executor/executor.h"
 #include "executor/nodeModifyTable.h"
 
+#include "funcapi.h"
+
 #include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
 
 #include "mb/pg_wchar.h"
+#include "miscadmin.h"
 
 #include "nodes/makefuncs.h"
 
 #include "optimizer/planner.h"
 
 #include "parser/parse_relation.h"
-
+#include "pgstat.h"
 #include "postmaster/bgworker.h"
 #include "postmaster/postmaster.h"
 #include "postmaster/walwriter.h"
@@ -463,7 +464,7 @@ apply_handle_begin(StringInfo s)
 
     in_remote_transaction = true;
 
-    pgstat_report_activity(STATE_RUNNING, NULL);
+    bestatus_report_activity(STATE_RUNNING, NULL);
 }
 
 /*
@@ -507,7 +508,7 @@ apply_handle_commit(StringInfo s)
     /* Process any tables that are being synchronized in parallel. */
     process_syncing_tables(commit_data.end_lsn);
 
-    pgstat_report_activity(STATE_IDLE, NULL);
+    bestatus_report_activity(STATE_IDLE, NULL);
 }
 
 /*
@@ -1113,7 +1114,7 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
                                                 ALLOCSET_DEFAULT_SIZES);
 
     /* mark as idle, before starting to loop */
-    pgstat_report_activity(STATE_IDLE, NULL);
+    bestatus_report_activity(STATE_IDLE, NULL);
 
     for (;;)
     {
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index b30332abad..eed9d65947 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -41,9 +41,9 @@
 
 #include "access/transam.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "common/string.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/slot.h"
 #include "storage/fd.h"
 #include "storage/proc.h"
@@ -1280,12 +1280,12 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
     FIN_CRC32C(cp.checksum);
 
     errno = 0;
-    pgstat_report_wait_start(WAIT_EVENT_REPLICATION_SLOT_WRITE);
+    bestatus_report_wait_start(WAIT_EVENT_REPLICATION_SLOT_WRITE);
     if ((write(fd, &cp, sizeof(cp))) != sizeof(cp))
     {
         int            save_errno = errno;
 
-        pgstat_report_wait_end();
+        bestatus_report_wait_end();
         CloseTransientFile(fd);
 
         /* if write didn't set errno, assume problem is no disk space */
@@ -1296,15 +1296,15 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
                         tmppath)));
         return;
     }
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
     /* fsync the temporary file */
-    pgstat_report_wait_start(WAIT_EVENT_REPLICATION_SLOT_SYNC);
+    bestatus_report_wait_start(WAIT_EVENT_REPLICATION_SLOT_SYNC);
     if (pg_fsync(fd) != 0)
     {
         int            save_errno = errno;
 
-        pgstat_report_wait_end();
+        bestatus_report_wait_end();
         CloseTransientFile(fd);
         errno = save_errno;
         ereport(elevel,
@@ -1313,7 +1313,7 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel)
                         tmppath)));
         return;
     }
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
     CloseTransientFile(fd);
 
@@ -1392,7 +1392,7 @@ RestoreSlotFromDisk(const char *name)
      * Sync state file before we're reading from it. We might have crashed
      * while it wasn't synced yet and we shouldn't continue on that basis.
      */
-    pgstat_report_wait_start(WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC);
+    bestatus_report_wait_start(WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC);
     if (pg_fsync(fd) != 0)
     {
         int            save_errno = errno;
@@ -1404,7 +1404,7 @@ RestoreSlotFromDisk(const char *name)
                  errmsg("could not fsync file \"%s\": %m",
                         path)));
     }
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
     /* Also sync the parent directory */
     START_CRIT_SECTION();
@@ -1412,9 +1412,9 @@ RestoreSlotFromDisk(const char *name)
     END_CRIT_SECTION();
 
     /* read part of statefile that's guaranteed to be version independent */
-    pgstat_report_wait_start(WAIT_EVENT_REPLICATION_SLOT_READ);
+    bestatus_report_wait_start(WAIT_EVENT_REPLICATION_SLOT_READ);
     readBytes = read(fd, &cp, ReplicationSlotOnDiskConstantSize);
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
     if (readBytes != ReplicationSlotOnDiskConstantSize)
     {
         int            saved_errno = errno;
@@ -1455,11 +1455,11 @@ RestoreSlotFromDisk(const char *name)
                         path, cp.length)));
 
     /* Now that we know the size, read the entire file */
-    pgstat_report_wait_start(WAIT_EVENT_REPLICATION_SLOT_READ);
+    bestatus_report_wait_start(WAIT_EVENT_REPLICATION_SLOT_READ);
     readBytes = read(fd,
                      (char *) &cp + ReplicationSlotOnDiskConstantSize,
                      cp.length);
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
     if (readBytes != cp.length)
     {
         int            saved_errno = errno;
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index af5ad5fe66..957aea0a7d 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -75,8 +75,8 @@
 #include <unistd.h>
 
 #include "access/xact.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/syncrep.h"
 #include "replication/walsender.h"
 #include "replication/walsender_private.h"
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 6f4b3538ac..0d65ed8f2a 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -50,6 +50,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
@@ -57,7 +58,6 @@
 #include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "storage/ipc.h"
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 2683385ca6..6fc9a8f658 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -56,6 +56,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
 
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "commands/dbcommands.h"
@@ -65,7 +66,6 @@
 #include "libpq/pqformat.h"
 #include "miscadmin.h"
 #include "nodes/replnodes.h"
-#include "pgstat.h"
 #include "replication/basebackup.h"
 #include "replication/decode.h"
 #include "replication/logical.h"
@@ -299,7 +299,7 @@ WalSndErrorCleanup(void)
 {
     LWLockReleaseAll();
     ConditionVariableCancelSleep();
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
     if (sendFile >= 0)
     {
@@ -504,9 +504,9 @@ SendTimeLineHistory(TimeLineHistoryCmd *cmd)
         PGAlignedBlock rbuf;
         int            nread;
 
-        pgstat_report_wait_start(WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ);
+        bestatus_report_wait_start(WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ);
         nread = read(fd, rbuf.data, sizeof(rbuf));
-        pgstat_report_wait_end();
+        bestatus_report_wait_end();
         if (nread < 0)
             ereport(ERROR,
                     (errcode_for_file_access(),
@@ -1517,7 +1517,7 @@ exec_replication_command(const char *cmd_string)
     initStringInfo(&tmpbuf);
 
     /* Report to pgstat that this process is running */
-    pgstat_report_activity(STATE_RUNNING, NULL);
+    bestatus_report_activity(STATE_RUNNING, NULL);
 
     switch (cmd_node->type)
     {
@@ -1571,7 +1571,7 @@ exec_replication_command(const char *cmd_string)
                         (errmsg("cannot execute SQL commands in WAL sender for physical replication")));
 
             /* Report to pgstat that this process is now idle */
-            pgstat_report_activity(STATE_IDLE, NULL);
+            bestatus_report_activity(STATE_IDLE, NULL);
 
             /* Tell the caller that this wasn't a WalSender command. */
             return false;
@@ -1589,7 +1589,7 @@ exec_replication_command(const char *cmd_string)
     EndCommand("SELECT", DestRemote);
 
     /* Report to pgstat that this process is now idle */
-    pgstat_report_activity(STATE_IDLE, NULL);
+    bestatus_report_activity(STATE_IDLE, NULL);
 
     return true;
 }
@@ -2442,9 +2442,9 @@ retry:
         else
             segbytes = nbytes;
 
-        pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+        bestatus_report_wait_start(WAIT_EVENT_WAL_READ);
         readbytes = read(sendFile, p, segbytes);
-        pgstat_report_wait_end();
+        bestatus_report_wait_end();
         if (readbytes < 0)
         {
             ereport(ERROR,
diff --git a/src/backend/statmon/Makefile b/src/backend/statmon/Makefile
new file mode 100644
index 0000000000..64a04878e3
--- /dev/null
+++ b/src/backend/statmon/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for src/backend/statmon
+#
+# IDENTIFICATION
+#    src/backend/statmon/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/statmon
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = pgstat.o bestatus.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/statmon/bestatus.c b/src/backend/statmon/bestatus.c
new file mode 100644
index 0000000000..5fda088258
--- /dev/null
+++ b/src/backend/statmon/bestatus.c
@@ -0,0 +1,1759 @@
+/* ----------
+ * bestatus.c
+ *
+ *    Backend status monitor
+ *
+ *    Status data is stored in shared memory. Every backend updates and reads
+ *    it individually.
+ *
+ *    Copyright (c) 2001-2018, PostgreSQL Global Development Group
+ *
+ *    src/backend/statmon/bestatus.c
+ * ----------
+ */
+#include "postgres.h"
+
+#include "bestatus.h"
+
+#include "access/xact.h"
+#include "libpq/libpq.h"
+#include "miscadmin.h"
+#include "postmaster/autovacuum.h"
+#include "replication/walsender.h"
+#include "storage/ipc.h"
+#include "storage/lmgr.h"
+#include "storage/sinvaladt.h"
+#include "utils/ascii.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/probes.h"
+
+
+/* Status for backends including auxiliary */
+static LocalPgBackendStatus *localBackendStatusTable = NULL;
+
+/* Total number of backends including auxiliary */
+static int    localNumBackends = 0;
+
+/* ----------
+ * Total number of backends including auxiliary
+ *
+ * We reserve a slot for each possible BackendId, plus one for each
+ * possible auxiliary process type.  (This scheme assumes there is not
+ * more than one of any auxiliary process type at a time.) MaxBackends
+ * includes autovacuum workers and background workers as well.
+ * ----------
+ */
+#define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
+
+
+/* ----------
+ * GUC parameters
+ * ----------
+ */
+bool        bestatus_track_activities = false;
+int            bestatus_track_activity_query_size = 1024;
+
+static MemoryContext pgBeStatLocalContext = NULL;
+
+/* ------------------------------------------------------------
+ * Functions for management of the shared-memory PgBackendStatus array
+ * ------------------------------------------------------------
+ */
+
+static PgBackendStatus *BackendStatusArray = NULL;
+static PgBackendStatus *MyBEEntry = NULL;
+static char *BackendAppnameBuffer = NULL;
+static char *BackendClientHostnameBuffer = NULL;
+static char *BackendActivityBuffer = NULL;
+static Size BackendActivityBufferSize = 0;
+#ifdef USE_SSL
+static PgBackendSSLStatus *BackendSslStatusBuffer = NULL;
+#endif
+
+static const char *bestatus_get_wait_activity(WaitEventActivity w);
+static const char *bestatus_get_wait_client(WaitEventClient w);
+static const char *bestatus_get_wait_ipc(WaitEventIPC w);
+static const char *bestatus_get_wait_timeout(WaitEventTimeout w);
+static const char *bestatus_get_wait_io(WaitEventIO w);
+static void bestatus_setup_memcxt(void);
+static void bestatus_beshutdown_hook(int code, Datum arg);
+/*
+ * Report shared-memory space needed by CreateSharedBackendStatus.
+ */
+Size
+BackendStatusShmemSize(void)
+{
+    Size        size;
+
+    /* BackendStatusArray: */
+    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
+    /* BackendAppnameBuffer: */
+    size = add_size(size,
+                    mul_size(NAMEDATALEN, NumBackendStatSlots));
+    /* BackendClientHostnameBuffer: */
+    size = add_size(size,
+                    mul_size(NAMEDATALEN, NumBackendStatSlots));
+    /* BackendActivityBuffer: */
+    size = add_size(size,
+                    mul_size(bestatus_track_activity_query_size, NumBackendStatSlots));
+#ifdef USE_SSL
+    /* BackendSslStatusBuffer: */
+    size = add_size(size,
+                    mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots));
+#endif
+    return size;
+}
+
+/*
+ * Initialize the shared status array and several string buffers
+ * during postmaster startup.
+ */
+void
+CreateSharedBackendStatus(void)
+{
+    Size        size;
+    bool        found;
+    int            i;
+    char       *buffer;
+
+    /* Create or attach to the shared array */
+    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
+    BackendStatusArray = (PgBackendStatus *)
+        ShmemInitStruct("Backend Status Array", size, &found);
+
+    if (!found)
+    {
+        /*
+         * We're the first - initialize.
+         */
+        MemSet(BackendStatusArray, 0, size);
+    }
+
+    /* Create or attach to the shared appname buffer */
+    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
+    BackendAppnameBuffer = (char *)
+        ShmemInitStruct("Backend Application Name Buffer", size, &found);
+
+    if (!found)
+    {
+        MemSet(BackendAppnameBuffer, 0, size);
+
+        /* Initialize st_appname pointers. */
+        buffer = BackendAppnameBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_appname = buffer;
+            buffer += NAMEDATALEN;
+        }
+    }
+
+    /* Create or attach to the shared client hostname buffer */
+    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
+    BackendClientHostnameBuffer = (char *)
+        ShmemInitStruct("Backend Client Host Name Buffer", size, &found);
+
+    if (!found)
+    {
+        MemSet(BackendClientHostnameBuffer, 0, size);
+
+        /* Initialize st_clienthostname pointers. */
+        buffer = BackendClientHostnameBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_clienthostname = buffer;
+            buffer += NAMEDATALEN;
+        }
+    }
+
+    /* Create or attach to the shared activity buffer */
+    BackendActivityBufferSize = mul_size(bestatus_track_activity_query_size,
+                                         NumBackendStatSlots);
+    BackendActivityBuffer = (char *)
+        ShmemInitStruct("Backend Activity Buffer",
+                        BackendActivityBufferSize,
+                        &found);
+
+    if (!found)
+    {
+        MemSet(BackendActivityBuffer, 0, BackendActivityBufferSize);
+
+        /* Initialize st_activity pointers. */
+        buffer = BackendActivityBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_activity_raw = buffer;
+            buffer += bestatus_track_activity_query_size;
+        }
+    }
+
+#ifdef USE_SSL
+    /* Create or attach to the shared SSL status buffer */
+    size = mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots);
+    BackendSslStatusBuffer = (PgBackendSSLStatus *)
+        ShmemInitStruct("Backend SSL Status Buffer", size, &found);
+
+    if (!found)
+    {
+        PgBackendSSLStatus *ptr;
+
+        MemSet(BackendSslStatusBuffer, 0, size);
+
+        /* Initialize st_sslstatus pointers. */
+        ptr = BackendSslStatusBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_sslstatus = ptr;
+            ptr++;
+        }
+    }
+#endif
+}
+
+/* ----------
+ * bestatus_initialize() -
+ *
+ *    Initialize pgstats state, and set up our on-proc-exit hook.
+ *    Called from InitPostgres and AuxiliaryProcessMain. For an auxiliary
+ *    process, MyBackendId is invalid. Otherwise, MyBackendId must be set,
+ *    but we must not have started any transaction yet (since the
+ *    exit hook must run after the last transaction exit).
+ *    NOTE: MyDatabaseId isn't set yet; so the shutdown hook has to be careful.
+ * ----------
+ */
+void
+bestatus_initialize(void)
+{
+    /* Initialize MyBEEntry */
+    if (MyBackendId != InvalidBackendId)
+    {
+        Assert(MyBackendId >= 1 && MyBackendId <= MaxBackends);
+        MyBEEntry = &BackendStatusArray[MyBackendId - 1];
+    }
+    else
+    {
+        /* Must be an auxiliary process */
+        Assert(MyAuxProcType != NotAnAuxProcess);
+
+        /*
+         * Assign the MyBEEntry for an auxiliary process.  Since it doesn't
+         * have a BackendId, the slot is statically allocated based on the
+         * auxiliary process type (MyAuxProcType).  Backends use slots numbered
+         * from 1 to MaxBackends (inclusive), so an auxiliary process uses slot
+         * number MaxBackends + MyAuxProcType + 1, which is array element
+         * MaxBackends + MyAuxProcType.
+         */
+        MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
+    }
+
+    /* Set up a process-exit hook to clean up */
+    before_shmem_exit(bestatus_beshutdown_hook, 0);
+}
+
+/*
+ * Shut down a single backend's status reporting at process exit.
+ *
+ * This hook only clears our entry in the PgBackendStatus array, so that
+ * readers no longer treat the slot as belonging to a live process.
+ * Flushing any remaining statistics counts is not done here.
+ *
+ * The entry is marked invalid by setting st_procpid to 0.
+ */
+static void
+bestatus_beshutdown_hook(int code, Datum arg)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    /*
+     * Clear my status entry, following the protocol of bumping st_changecount
+     * before and after.  We use a volatile pointer here to ensure the
+     * compiler doesn't try to get cute.
+     */
+    bestatus_increment_changecount_before(beentry);
+
+    beentry->st_procpid = 0;    /* mark invalid */
+
+    bestatus_increment_changecount_after(beentry);
+}
+
+
+/* ----------
+ * bestatus_bestart() -
+ *
+ *    Initialize this backend's entry in the PgBackendStatus array.
+ *    Called from InitPostgres.
+ *
+ *    Apart from auxiliary processes, MyBackendId, MyDatabaseId,
+ *    session userid, and application_name must be set for a
+ *    backend (hence, this cannot be combined with bestatus_initialize).
+ * ----------
+ */
+void
+bestatus_bestart(void)
+{
+    SockAddr    clientaddr;
+    volatile PgBackendStatus *beentry;
+
+    /*
+     * To minimize the time spent modifying the PgBackendStatus entry, fetch
+     * all the needed data first.
+     */
+
+    /*
+     * We may not have a MyProcPort (eg, if this is the autovacuum process).
+     * If so, use all-zeroes client address, which is dealt with specially in
+     * pg_stat_get_backend_client_addr and pg_stat_get_backend_client_port.
+     */
+    if (MyProcPort)
+        memcpy(&clientaddr, &MyProcPort->raddr, sizeof(clientaddr));
+    else
+        MemSet(&clientaddr, 0, sizeof(clientaddr));
+
+    /*
+     * Initialize my status entry, following the protocol of bumping
+     * st_changecount before and after; and make sure it's even afterwards. We
+     * use a volatile pointer here to ensure the compiler doesn't try to get
+     * cute.
+     */
+    beentry = MyBEEntry;
+
+    /* MyBEEntry must have been set up by bestatus_initialize() */
+    Assert(beentry != NULL);
+
+    if (MyBackendId != InvalidBackendId)
+    {
+        if (IsAutoVacuumLauncherProcess())
+        {
+            /* Autovacuum Launcher */
+            beentry->st_backendType = B_AUTOVAC_LAUNCHER;
+        }
+        else if (IsAutoVacuumWorkerProcess())
+        {
+            /* Autovacuum Worker */
+            beentry->st_backendType = B_AUTOVAC_WORKER;
+        }
+        else if (am_walsender)
+        {
+            /* Wal sender */
+            beentry->st_backendType = B_WAL_SENDER;
+        }
+        else if (IsBackgroundWorker)
+        {
+            /* bgworker */
+            beentry->st_backendType = B_BG_WORKER;
+        }
+        else
+        {
+            /* client-backend */
+            beentry->st_backendType = B_BACKEND;
+        }
+    }
+    else
+    {
+        /* Must be an auxiliary process */
+        Assert(MyAuxProcType != NotAnAuxProcess);
+        switch (MyAuxProcType)
+        {
+            case StartupProcess:
+                beentry->st_backendType = B_STARTUP;
+                break;
+            case BgWriterProcess:
+                beentry->st_backendType = B_BG_WRITER;
+                break;
+            case CheckpointerProcess:
+                beentry->st_backendType = B_CHECKPOINTER;
+                break;
+            case WalWriterProcess:
+                beentry->st_backendType = B_WAL_WRITER;
+                break;
+            case WalReceiverProcess:
+                beentry->st_backendType = B_WAL_RECEIVER;
+                break;
+            case ArchiverProcess:
+                beentry->st_backendType = B_ARCHIVER;
+                break;
+            default:
+                elog(FATAL, "unrecognized process type: %d",
+                     (int) MyAuxProcType);
+                proc_exit(1);
+        }
+    }
+
+    do
+    {
+        bestatus_increment_changecount_before(beentry);
+    } while ((beentry->st_changecount & 1) == 0);
+
+    beentry->st_procpid = MyProcPid;
+    beentry->st_proc_start_timestamp = MyStartTimestamp;
+    beentry->st_activity_start_timestamp = 0;
+    beentry->st_state_start_timestamp = 0;
+    beentry->st_xact_start_timestamp = 0;
+    beentry->st_databaseid = MyDatabaseId;
+
+    /* We have userid for client-backends, wal-sender and bgworker processes */
+    if (beentry->st_backendType == B_BACKEND
+        || beentry->st_backendType == B_WAL_SENDER
+        || beentry->st_backendType == B_BG_WORKER)
+        beentry->st_userid = GetSessionUserId();
+    else
+        beentry->st_userid = InvalidOid;
+
+    beentry->st_clientaddr = clientaddr;
+    if (MyProcPort && MyProcPort->remote_hostname)
+        strlcpy(beentry->st_clienthostname, MyProcPort->remote_hostname,
+                NAMEDATALEN);
+    else
+        beentry->st_clienthostname[0] = '\0';
+#ifdef USE_SSL
+    if (MyProcPort && MyProcPort->ssl != NULL)
+    {
+        beentry->st_ssl = true;
+        beentry->st_sslstatus->ssl_bits = be_tls_get_cipher_bits(MyProcPort);
+        beentry->st_sslstatus->ssl_compression = be_tls_get_compression(MyProcPort);
+        strlcpy(beentry->st_sslstatus->ssl_version, be_tls_get_version(MyProcPort), NAMEDATALEN);
+        strlcpy(beentry->st_sslstatus->ssl_cipher, be_tls_get_cipher(MyProcPort), NAMEDATALEN);
+        be_tls_get_peerdn_name(MyProcPort, beentry->st_sslstatus->ssl_clientdn, NAMEDATALEN);
+    }
+    else
+    {
+        beentry->st_ssl = false;
+    }
+#else
+    beentry->st_ssl = false;
+#endif
+    beentry->st_state = STATE_UNDEFINED;
+    beentry->st_appname[0] = '\0';
+    beentry->st_activity_raw[0] = '\0';
+    /* Also make sure the last byte in each string area is always 0 */
+    beentry->st_clienthostname[NAMEDATALEN - 1] = '\0';
+    beentry->st_appname[NAMEDATALEN - 1] = '\0';
+    beentry->st_activity_raw[bestatus_track_activity_query_size - 1] = '\0';
+    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
+    beentry->st_progress_command_target = InvalidOid;
+
+    /*
+     * we don't zero st_progress_param here to save cycles; nobody should
+     * examine it until st_progress_command has been set to something other
+     * than PROGRESS_COMMAND_INVALID
+     */
+
+    bestatus_increment_changecount_after(beentry);
+
+    /* Update app name to current GUC setting */
+    if (application_name)
+        bestatus_report_appname(application_name);
+}
+
+/* ----------
+ * bestatus_read_current_status() -
+ *
+ *    Copy the current contents of the PgBackendStatus array to local memory,
+ *    if not already done in this transaction.
+ * ----------
+ */
+static void
+bestatus_read_current_status(void)
+{
+    volatile PgBackendStatus *beentry;
+    LocalPgBackendStatus *localtable;
+    LocalPgBackendStatus *localentry;
+    char       *localappname,
+               *localclienthostname,
+               *localactivity;
+#ifdef USE_SSL
+    PgBackendSSLStatus *localsslstatus;
+#endif
+    int            i;
+
+    Assert(IsUnderPostmaster);
+
+    if (localBackendStatusTable)
+        return;                    /* already done */
+
+    bestatus_setup_memcxt();
+
+    localtable = (LocalPgBackendStatus *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           sizeof(LocalPgBackendStatus) * NumBackendStatSlots);
+    localappname = (char *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           NAMEDATALEN * NumBackendStatSlots);
+    localclienthostname = (char *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           NAMEDATALEN * NumBackendStatSlots);
+    localactivity = (char *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           bestatus_track_activity_query_size * NumBackendStatSlots);
+#ifdef USE_SSL
+    localsslstatus = (PgBackendSSLStatus *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           sizeof(PgBackendSSLStatus) * NumBackendStatSlots);
+#endif
+
+    localNumBackends = 0;
+
+    beentry = BackendStatusArray;
+    localentry = localtable;
+    for (i = 1; i <= NumBackendStatSlots; i++)
+    {
+        /*
+         * Follow the protocol of retrying if st_changecount changes while we
+         * copy the entry, or if it's odd.  (The check for odd is needed to
+         * cover the case where we are able to completely copy the entry while
+         * the source backend is between increment steps.)    We use a volatile
+         * pointer here to ensure the compiler doesn't try to get cute.
+         */
+        for (;;)
+        {
+            int            before_changecount;
+            int            after_changecount;
+
+            bestatus_save_changecount_before(beentry, before_changecount);
+
+            localentry->backendStatus.st_procpid = beentry->st_procpid;
+            if (localentry->backendStatus.st_procpid > 0)
+            {
+                memcpy(&localentry->backendStatus, (char *) beentry, sizeof(PgBackendStatus));
+
+                /*
+                 * strcpy is safe even if the string is modified concurrently,
+                 * because there's always a \0 at the end of the buffer.
+                 */
+                strcpy(localappname, (char *) beentry->st_appname);
+                localentry->backendStatus.st_appname = localappname;
+                strcpy(localclienthostname, (char *) beentry->st_clienthostname);
+                localentry->backendStatus.st_clienthostname = localclienthostname;
+                strcpy(localactivity, (char *) beentry->st_activity_raw);
+                localentry->backendStatus.st_activity_raw = localactivity;
+                localentry->backendStatus.st_ssl = beentry->st_ssl;
+#ifdef USE_SSL
+                if (beentry->st_ssl)
+                {
+                    memcpy(localsslstatus, beentry->st_sslstatus, sizeof(PgBackendSSLStatus));
+                    localentry->backendStatus.st_sslstatus = localsslstatus;
+                }
+#endif
+            }
+
+            bestatus_save_changecount_after(beentry, after_changecount);
+            if (before_changecount == after_changecount &&
+                (before_changecount & 1) == 0)
+                break;
+
+            /* Make sure we can break out of loop if stuck... */
+            CHECK_FOR_INTERRUPTS();
+        }
+
+        beentry++;
+        /* Only valid entries get included into the local array */
+        if (localentry->backendStatus.st_procpid > 0)
+        {
+            BackendIdGetTransactionIds(i,
+                                       &localentry->backend_xid,
+                                       &localentry->backend_xmin);
+
+            localentry++;
+            localappname += NAMEDATALEN;
+            localclienthostname += NAMEDATALEN;
+            localactivity += bestatus_track_activity_query_size;
+#ifdef USE_SSL
+            localsslstatus++;
+#endif
+            localNumBackends++;
+        }
+    }
+
+    /* Set the pointer only after completion of a valid table */
+    localBackendStatusTable = localtable;
+}
+
+
+/* ----------
+ * bestatus_fetch_stat_beentry() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    our local copy of the current-activity entry for one backend.
+ *
+ *    NB: the caller is responsible for checking whether the user is
+ *    permitted to see this info (especially the query string).
+ * ----------
+ */
+PgBackendStatus *
+bestatus_fetch_stat_beentry(int beid)
+{
+    bestatus_read_current_status();
+
+    if (beid < 1 || beid > localNumBackends)
+        return NULL;
+
+    return &localBackendStatusTable[beid - 1].backendStatus;
+}
+
+
+/* ----------
+ * bestatus_fetch_stat_local_beentry() -
+ *
+ *    Like bestatus_fetch_stat_beentry() but with locally computed additions (like
+ *    xid and xmin values of the backend)
+ *
+ *    NB: the caller is responsible for checking whether the user is
+ *    permitted to see this info (especially the query string).
+ * ----------
+ */
+LocalPgBackendStatus *
+bestatus_fetch_stat_local_beentry(int beid)
+{
+    bestatus_read_current_status();
+
+    if (beid < 1 || beid > localNumBackends)
+        return NULL;
+
+    return &localBackendStatusTable[beid - 1];
+}
+
+
+/* ----------
+ * bestatus_fetch_stat_numbackends() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    the number of entries in our local copy (the largest valid beid).
+ * ----------
+ */
+int
+bestatus_fetch_stat_numbackends(void)
+{
+    bestatus_read_current_status();
+
+    return localNumBackends;
+}
+
+/* ----------
+ * bestatus_get_wait_event_type() -
+ *
+ *    Return a string representing the type of wait event the backend is
+ *    currently waiting on, or NULL if it is not waiting.
+ */
+const char *
+bestatus_get_wait_event_type(uint32 wait_event_info)
+{
+    uint32        classId;
+    const char *event_type;
+
+    /* report process as not waiting. */
+    if (wait_event_info == 0)
+        return NULL;
+
+    classId = wait_event_info & 0xFF000000;
+
+    switch (classId)
+    {
+        case PG_WAIT_LWLOCK:
+            event_type = "LWLock";
+            break;
+        case PG_WAIT_LOCK:
+            event_type = "Lock";
+            break;
+        case PG_WAIT_BUFFER_PIN:
+            event_type = "BufferPin";
+            break;
+        case PG_WAIT_ACTIVITY:
+            event_type = "Activity";
+            break;
+        case PG_WAIT_CLIENT:
+            event_type = "Client";
+            break;
+        case PG_WAIT_EXTENSION:
+            event_type = "Extension";
+            break;
+        case PG_WAIT_IPC:
+            event_type = "IPC";
+            break;
+        case PG_WAIT_TIMEOUT:
+            event_type = "Timeout";
+            break;
+        case PG_WAIT_IO:
+            event_type = "IO";
+            break;
+        default:
+            event_type = "???";
+            break;
+    }
+
+    return event_type;
+}
+
+/* ----------
+ * bestatus_get_wait_event() -
+ *
+ *    Return a string representing the wait event the backend is currently
+ *    waiting on, or NULL if it is not waiting.
+ */
+const char *
+bestatus_get_wait_event(uint32 wait_event_info)
+{
+    uint32        classId;
+    uint16        eventId;
+    const char *event_name;
+
+    /* report process as not waiting. */
+    if (wait_event_info == 0)
+        return NULL;
+
+    classId = wait_event_info & 0xFF000000;
+    eventId = wait_event_info & 0x0000FFFF;
+
+    switch (classId)
+    {
+        case PG_WAIT_LWLOCK:
+            event_name = GetLWLockIdentifier(classId, eventId);
+            break;
+        case PG_WAIT_LOCK:
+            event_name = GetLockNameFromTagType(eventId);
+            break;
+        case PG_WAIT_BUFFER_PIN:
+            event_name = "BufferPin";
+            break;
+        case PG_WAIT_ACTIVITY:
+            {
+                WaitEventActivity w = (WaitEventActivity) wait_event_info;
+
+                event_name = bestatus_get_wait_activity(w);
+                break;
+            }
+        case PG_WAIT_CLIENT:
+            {
+                WaitEventClient w = (WaitEventClient) wait_event_info;
+
+                event_name = bestatus_get_wait_client(w);
+                break;
+            }
+        case PG_WAIT_EXTENSION:
+            event_name = "Extension";
+            break;
+        case PG_WAIT_IPC:
+            {
+                WaitEventIPC w = (WaitEventIPC) wait_event_info;
+
+                event_name = bestatus_get_wait_ipc(w);
+                break;
+            }
+        case PG_WAIT_TIMEOUT:
+            {
+                WaitEventTimeout w = (WaitEventTimeout) wait_event_info;
+
+                event_name = bestatus_get_wait_timeout(w);
+                break;
+            }
+        case PG_WAIT_IO:
+            {
+                WaitEventIO w = (WaitEventIO) wait_event_info;
+
+                event_name = bestatus_get_wait_io(w);
+                break;
+            }
+        default:
+            event_name = "unknown wait event";
+            break;
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * bestatus_get_wait_activity() -
+ *
+ * Convert WaitEventActivity to string.
+ * ----------
+ */
+static const char *
+bestatus_get_wait_activity(WaitEventActivity w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_ARCHIVER_MAIN:
+            event_name = "ArchiverMain";
+            break;
+        case WAIT_EVENT_AUTOVACUUM_MAIN:
+            event_name = "AutoVacuumMain";
+            break;
+        case WAIT_EVENT_BGWRITER_HIBERNATE:
+            event_name = "BgWriterHibernate";
+            break;
+        case WAIT_EVENT_BGWRITER_MAIN:
+            event_name = "BgWriterMain";
+            break;
+        case WAIT_EVENT_CHECKPOINTER_MAIN:
+            event_name = "CheckpointerMain";
+            break;
+        case WAIT_EVENT_LOGICAL_APPLY_MAIN:
+            event_name = "LogicalApplyMain";
+            break;
+        case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
+            event_name = "LogicalLauncherMain";
+            break;
+        case WAIT_EVENT_BESTATUS_MAIN:
+            event_name = "PgStatMain";
+            break;
+        case WAIT_EVENT_RECOVERY_WAL_ALL:
+            event_name = "RecoveryWalAll";
+            break;
+        case WAIT_EVENT_RECOVERY_WAL_STREAM:
+            event_name = "RecoveryWalStream";
+            break;
+        case WAIT_EVENT_SYSLOGGER_MAIN:
+            event_name = "SysLoggerMain";
+            break;
+        case WAIT_EVENT_WAL_RECEIVER_MAIN:
+            event_name = "WalReceiverMain";
+            break;
+        case WAIT_EVENT_WAL_SENDER_MAIN:
+            event_name = "WalSenderMain";
+            break;
+        case WAIT_EVENT_WAL_WRITER_MAIN:
+            event_name = "WalWriterMain";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * bestatus_get_wait_client() -
+ *
+ * Convert WaitEventClient to string.
+ * ----------
+ */
+static const char *
+bestatus_get_wait_client(WaitEventClient w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_CLIENT_READ:
+            event_name = "ClientRead";
+            break;
+        case WAIT_EVENT_CLIENT_WRITE:
+            event_name = "ClientWrite";
+            break;
+        case WAIT_EVENT_LIBPQWALRECEIVER_CONNECT:
+            event_name = "LibPQWalReceiverConnect";
+            break;
+        case WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE:
+            event_name = "LibPQWalReceiverReceive";
+            break;
+        case WAIT_EVENT_SSL_OPEN_SERVER:
+            event_name = "SSLOpenServer";
+            break;
+        case WAIT_EVENT_WAL_RECEIVER_WAIT_START:
+            event_name = "WalReceiverWaitStart";
+            break;
+        case WAIT_EVENT_WAL_SENDER_WAIT_WAL:
+            event_name = "WalSenderWaitForWAL";
+            break;
+        case WAIT_EVENT_WAL_SENDER_WRITE_DATA:
+            event_name = "WalSenderWriteData";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * bestatus_get_wait_ipc() -
+ *
+ * Convert WaitEventIPC to string.
+ * ----------
+ */
+static const char *
+bestatus_get_wait_ipc(WaitEventIPC w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_BGWORKER_SHUTDOWN:
+            event_name = "BgWorkerShutdown";
+            break;
+        case WAIT_EVENT_BGWORKER_STARTUP:
+            event_name = "BgWorkerStartup";
+            break;
+        case WAIT_EVENT_BTREE_PAGE:
+            event_name = "BtreePage";
+            break;
+        case WAIT_EVENT_CLOG_GROUP_UPDATE:
+            event_name = "ClogGroupUpdate";
+            break;
+        case WAIT_EVENT_EXECUTE_GATHER:
+            event_name = "ExecuteGather";
+            break;
+        case WAIT_EVENT_HASH_BATCH_ALLOCATING:
+            event_name = "Hash/Batch/Allocating";
+            break;
+        case WAIT_EVENT_HASH_BATCH_ELECTING:
+            event_name = "Hash/Batch/Electing";
+            break;
+        case WAIT_EVENT_HASH_BATCH_LOADING:
+            event_name = "Hash/Batch/Loading";
+            break;
+        case WAIT_EVENT_HASH_BUILD_ALLOCATING:
+            event_name = "Hash/Build/Allocating";
+            break;
+        case WAIT_EVENT_HASH_BUILD_ELECTING:
+            event_name = "Hash/Build/Electing";
+            break;
+        case WAIT_EVENT_HASH_BUILD_HASHING_INNER:
+            event_name = "Hash/Build/HashingInner";
+            break;
+        case WAIT_EVENT_HASH_BUILD_HASHING_OUTER:
+            event_name = "Hash/Build/HashingOuter";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING:
+            event_name = "Hash/GrowBatches/Allocating";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_DECIDING:
+            event_name = "Hash/GrowBatches/Deciding";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_ELECTING:
+            event_name = "Hash/GrowBatches/Electing";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_FINISHING:
+            event_name = "Hash/GrowBatches/Finishing";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING:
+            event_name = "Hash/GrowBatches/Repartitioning";
+            break;
+        case WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING:
+            event_name = "Hash/GrowBuckets/Allocating";
+            break;
+        case WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING:
+            event_name = "Hash/GrowBuckets/Electing";
+            break;
+        case WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING:
+            event_name = "Hash/GrowBuckets/Reinserting";
+            break;
+        case WAIT_EVENT_LOGICAL_SYNC_DATA:
+            event_name = "LogicalSyncData";
+            break;
+        case WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE:
+            event_name = "LogicalSyncStateChange";
+            break;
+        case WAIT_EVENT_MQ_INTERNAL:
+            event_name = "MessageQueueInternal";
+            break;
+        case WAIT_EVENT_MQ_PUT_MESSAGE:
+            event_name = "MessageQueuePutMessage";
+            break;
+        case WAIT_EVENT_MQ_RECEIVE:
+            event_name = "MessageQueueReceive";
+            break;
+        case WAIT_EVENT_MQ_SEND:
+            event_name = "MessageQueueSend";
+            break;
+        case WAIT_EVENT_PARALLEL_BITMAP_SCAN:
+            event_name = "ParallelBitmapScan";
+            break;
+        case WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN:
+            event_name = "ParallelCreateIndexScan";
+            break;
+        case WAIT_EVENT_PARALLEL_FINISH:
+            event_name = "ParallelFinish";
+            break;
+        case WAIT_EVENT_PROCARRAY_GROUP_UPDATE:
+            event_name = "ProcArrayGroupUpdate";
+            break;
+        case WAIT_EVENT_PROMOTE:
+            event_name = "Promote";
+            break;
+        case WAIT_EVENT_REPLICATION_ORIGIN_DROP:
+            event_name = "ReplicationOriginDrop";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_DROP:
+            event_name = "ReplicationSlotDrop";
+            break;
+        case WAIT_EVENT_SAFE_SNAPSHOT:
+            event_name = "SafeSnapshot";
+            break;
+        case WAIT_EVENT_SYNC_REP:
+            event_name = "SyncRep";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * bestatus_get_wait_timeout() -
+ *
+ * Convert WaitEventTimeout to string.
+ * ----------
+ */
+static const char *
+bestatus_get_wait_timeout(WaitEventTimeout w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_BASE_BACKUP_THROTTLE:
+            event_name = "BaseBackupThrottle";
+            break;
+        case WAIT_EVENT_PG_SLEEP:
+            event_name = "PgSleep";
+            break;
+        case WAIT_EVENT_RECOVERY_APPLY_DELAY:
+            event_name = "RecoveryApplyDelay";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * bestatus_get_wait_io() -
+ *
+ * Convert WaitEventIO to string.
+ * ----------
+ */
+static const char *
+bestatus_get_wait_io(WaitEventIO w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_BUFFILE_READ:
+            event_name = "BufFileRead";
+            break;
+        case WAIT_EVENT_BUFFILE_WRITE:
+            event_name = "BufFileWrite";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_READ:
+            event_name = "ControlFileRead";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_SYNC:
+            event_name = "ControlFileSync";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE:
+            event_name = "ControlFileSyncUpdate";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_WRITE:
+            event_name = "ControlFileWrite";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE:
+            event_name = "ControlFileWriteUpdate";
+            break;
+        case WAIT_EVENT_COPY_FILE_READ:
+            event_name = "CopyFileRead";
+            break;
+        case WAIT_EVENT_COPY_FILE_WRITE:
+            event_name = "CopyFileWrite";
+            break;
+        case WAIT_EVENT_DATA_FILE_EXTEND:
+            event_name = "DataFileExtend";
+            break;
+        case WAIT_EVENT_DATA_FILE_FLUSH:
+            event_name = "DataFileFlush";
+            break;
+        case WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC:
+            event_name = "DataFileImmediateSync";
+            break;
+        case WAIT_EVENT_DATA_FILE_PREFETCH:
+            event_name = "DataFilePrefetch";
+            break;
+        case WAIT_EVENT_DATA_FILE_READ:
+            event_name = "DataFileRead";
+            break;
+        case WAIT_EVENT_DATA_FILE_SYNC:
+            event_name = "DataFileSync";
+            break;
+        case WAIT_EVENT_DATA_FILE_TRUNCATE:
+            event_name = "DataFileTruncate";
+            break;
+        case WAIT_EVENT_DATA_FILE_WRITE:
+            event_name = "DataFileWrite";
+            break;
+        case WAIT_EVENT_DSM_FILL_ZERO_WRITE:
+            event_name = "DSMFillZeroWrite";
+            break;
+        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ:
+            event_name = "LockFileAddToDataDirRead";
+            break;
+        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC:
+            event_name = "LockFileAddToDataDirSync";
+            break;
+        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE:
+            event_name = "LockFileAddToDataDirWrite";
+            break;
+        case WAIT_EVENT_LOCK_FILE_CREATE_READ:
+            event_name = "LockFileCreateRead";
+            break;
+        case WAIT_EVENT_LOCK_FILE_CREATE_SYNC:
+            event_name = "LockFileCreateSync";
+            break;
+        case WAIT_EVENT_LOCK_FILE_CREATE_WRITE:
+            event_name = "LockFileCreateWrite";
+            break;
+        case WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ:
+            event_name = "LockFileReCheckDataDirRead";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC:
+            event_name = "LogicalRewriteCheckpointSync";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC:
+            event_name = "LogicalRewriteMappingSync";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE:
+            event_name = "LogicalRewriteMappingWrite";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_SYNC:
+            event_name = "LogicalRewriteSync";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE:
+            event_name = "LogicalRewriteTruncate";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_WRITE:
+            event_name = "LogicalRewriteWrite";
+            break;
+        case WAIT_EVENT_RELATION_MAP_READ:
+            event_name = "RelationMapRead";
+            break;
+        case WAIT_EVENT_RELATION_MAP_SYNC:
+            event_name = "RelationMapSync";
+            break;
+        case WAIT_EVENT_RELATION_MAP_WRITE:
+            event_name = "RelationMapWrite";
+            break;
+        case WAIT_EVENT_REORDER_BUFFER_READ:
+            event_name = "ReorderBufferRead";
+            break;
+        case WAIT_EVENT_REORDER_BUFFER_WRITE:
+            event_name = "ReorderBufferWrite";
+            break;
+        case WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ:
+            event_name = "ReorderLogicalMappingRead";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_READ:
+            event_name = "ReplicationSlotRead";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC:
+            event_name = "ReplicationSlotRestoreSync";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_SYNC:
+            event_name = "ReplicationSlotSync";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_WRITE:
+            event_name = "ReplicationSlotWrite";
+            break;
+        case WAIT_EVENT_SLRU_FLUSH_SYNC:
+            event_name = "SLRUFlushSync";
+            break;
+        case WAIT_EVENT_SLRU_READ:
+            event_name = "SLRURead";
+            break;
+        case WAIT_EVENT_SLRU_SYNC:
+            event_name = "SLRUSync";
+            break;
+        case WAIT_EVENT_SLRU_WRITE:
+            event_name = "SLRUWrite";
+            break;
+        case WAIT_EVENT_SNAPBUILD_READ:
+            event_name = "SnapbuildRead";
+            break;
+        case WAIT_EVENT_SNAPBUILD_SYNC:
+            event_name = "SnapbuildSync";
+            break;
+        case WAIT_EVENT_SNAPBUILD_WRITE:
+            event_name = "SnapbuildWrite";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC:
+            event_name = "TimelineHistoryFileSync";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE:
+            event_name = "TimelineHistoryFileWrite";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_READ:
+            event_name = "TimelineHistoryRead";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_SYNC:
+            event_name = "TimelineHistorySync";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_WRITE:
+            event_name = "TimelineHistoryWrite";
+            break;
+        case WAIT_EVENT_TWOPHASE_FILE_READ:
+            event_name = "TwophaseFileRead";
+            break;
+        case WAIT_EVENT_TWOPHASE_FILE_SYNC:
+            event_name = "TwophaseFileSync";
+            break;
+        case WAIT_EVENT_TWOPHASE_FILE_WRITE:
+            event_name = "TwophaseFileWrite";
+            break;
+        case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ:
+            event_name = "WALSenderTimelineHistoryRead";
+            break;
+        case WAIT_EVENT_WAL_BOOTSTRAP_SYNC:
+            event_name = "WALBootstrapSync";
+            break;
+        case WAIT_EVENT_WAL_BOOTSTRAP_WRITE:
+            event_name = "WALBootstrapWrite";
+            break;
+        case WAIT_EVENT_WAL_COPY_READ:
+            event_name = "WALCopyRead";
+            break;
+        case WAIT_EVENT_WAL_COPY_SYNC:
+            event_name = "WALCopySync";
+            break;
+        case WAIT_EVENT_WAL_COPY_WRITE:
+            event_name = "WALCopyWrite";
+            break;
+        case WAIT_EVENT_WAL_INIT_SYNC:
+            event_name = "WALInitSync";
+            break;
+        case WAIT_EVENT_WAL_INIT_WRITE:
+            event_name = "WALInitWrite";
+            break;
+        case WAIT_EVENT_WAL_READ:
+            event_name = "WALRead";
+            break;
+        case WAIT_EVENT_WAL_SYNC:
+            event_name = "WALSync";
+            break;
+        case WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN:
+            event_name = "WALSyncMethodAssign";
+            break;
+        case WAIT_EVENT_WAL_WRITE:
+            event_name = "WALWrite";
+            break;
+
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+
+/* ----------
+ * bestatus_get_backend_current_activity() -
+ *
+ *    Return a string representing the current activity of the backend with
+ *    the specified PID.  This looks directly at the BackendStatusArray,
+ *    and so will provide current information regardless of the age of our
+ *    transaction's snapshot of the status array.
+ *
+ *    It is the caller's responsibility to invoke this only for backends whose
+ *    state is expected to remain stable while the result is in use.  The
+ *    only current use is in deadlock reporting, where we can expect that
+ *    the target backend is blocked on a lock.  (There are corner cases
+ *    where the target's wait could get aborted while we are looking at it,
+ *    but the very worst consequence is to return a pointer to a string
+ *    that's been changed, so we won't worry too much.)
+ *
+ *    Note: return strings for special cases match pg_stat_get_backend_activity.
+ * ----------
+ */
+const char *
+bestatus_get_backend_current_activity(int pid, bool checkUser)
+{
+    PgBackendStatus *beentry;
+    int            i;
+
+    beentry = BackendStatusArray;
+    for (i = 1; i <= MaxBackends; i++)
+    {
+        /*
+         * Although we expect the target backend's entry to be stable, that
+         * doesn't imply that anyone else's is.  To avoid identifying the
+         * wrong backend, while we check for a match to the desired PID we
+         * must follow the protocol of retrying if st_changecount changes
+         * while we examine the entry, or if it's odd.  (This might be
+         * unnecessary, since fetching or storing an int is almost certainly
+         * atomic, but let's play it safe.)  We use a volatile pointer here to
+         * ensure the compiler doesn't try to get cute.
+         */
+        volatile PgBackendStatus *vbeentry = beentry;
+        bool        found;
+
+        for (;;)
+        {
+            int            before_changecount;
+            int            after_changecount;
+
+            bestatus_save_changecount_before(vbeentry, before_changecount);
+
+            found = (vbeentry->st_procpid == pid);
+
+            bestatus_save_changecount_after(vbeentry, after_changecount);
+
+            if (before_changecount == after_changecount &&
+                (before_changecount & 1) == 0)
+                break;
+
+            /* Make sure we can break out of loop if stuck... */
+            CHECK_FOR_INTERRUPTS();
+        }
+
+        if (found)
+        {
+            /* Now it is safe to use the non-volatile pointer */
+            if (checkUser && !superuser() && beentry->st_userid != GetUserId())
+                return "<insufficient privilege>";
+            else if (*(beentry->st_activity_raw) == '\0')
+                return "<command string not enabled>";
+            else
+            {
+                /* this'll leak a bit of memory, but that seems acceptable */
+                return bestatus_clip_activity(beentry->st_activity_raw);
+            }
+        }
+
+        beentry++;
+    }
+
+    /* If we get here, caller is in error ... */
+    return "<backend information not available>";
+}
+
+/* ----------
+ * bestatus_get_crashed_backend_activity() -
+ *
+ *    Return a string representing the current activity of the backend with
+ *    the specified PID.  Like the function above, but reads shared memory with
+ *    the expectation that it may be corrupt.  On success, copy the string
+ *    into the "buffer" argument and return that pointer.  On failure,
+ *    return NULL.
+ *
+ *    This function is only intended to be used by the postmaster to report the
+ *    query that crashed a backend.  In particular, no attempt is made to
+ *    follow the correct concurrency protocol when accessing the
+ *    BackendStatusArray.  That's OK: in the worst case we'll return a
+ *    corrupted message.  We also must take care not to trip on ereport(ERROR).
+ * ----------
+ */
+const char *
+bestatus_get_crashed_backend_activity(int pid, char *buffer, int buflen)
+{
+    volatile PgBackendStatus *beentry;
+    int            i;
+
+    beentry = BackendStatusArray;
+
+    /*
+     * We probably shouldn't get here before shared memory has been set up,
+     * but be safe.
+     */
+    if (beentry == NULL || BackendActivityBuffer == NULL)
+        return NULL;
+
+    for (i = 1; i <= MaxBackends; i++)
+    {
+        if (beentry->st_procpid == pid)
+        {
+            /* Read pointer just once, so it can't change after validation */
+            const char *activity = beentry->st_activity_raw;
+            const char *activity_last;
+
+            /*
+             * We mustn't access activity string before we verify that it
+             * falls within the BackendActivityBuffer. To make sure that the
+             * entire string including its ending is contained within the
+             * buffer, subtract one activity length from the buffer size.
+             */
+            activity_last = BackendActivityBuffer + BackendActivityBufferSize
+                - bestatus_track_activity_query_size;
+
+            if (activity < BackendActivityBuffer ||
+                activity > activity_last)
+                return NULL;
+
+            /* If no string available, no point in a report */
+            if (activity[0] == '\0')
+                return NULL;
+
+            /*
+             * Copy only ASCII-safe characters so we don't run into encoding
+             * problems when reporting the message; and be sure not to run off
+             * the end of memory.  As only ASCII characters are reported, it
+             * doesn't seem necessary to perform multibyte aware clipping.
+             */
+            ascii_safe_strlcpy(buffer, activity,
+                               Min(buflen, bestatus_track_activity_query_size));
+
+            return buffer;
+        }
+
+        beentry++;
+    }
+
+    /* PID not found */
+    return NULL;
+}
+
+const char *
+bestatus_get_backend_desc(BackendType backendType)
+{
+    const char *backendDesc = "unknown process type";
+
+    switch (backendType)
+    {
+        case B_AUTOVAC_LAUNCHER:
+            backendDesc = "autovacuum launcher";
+            break;
+        case B_AUTOVAC_WORKER:
+            backendDesc = "autovacuum worker";
+            break;
+        case B_BACKEND:
+            backendDesc = "client backend";
+            break;
+        case B_BG_WORKER:
+            backendDesc = "background worker";
+            break;
+        case B_BG_WRITER:
+            backendDesc = "background writer";
+            break;
+        case B_CHECKPOINTER:
+            backendDesc = "checkpointer";
+            break;
+        case B_STARTUP:
+            backendDesc = "startup";
+            break;
+        case B_WAL_RECEIVER:
+            backendDesc = "walreceiver";
+            break;
+        case B_WAL_SENDER:
+            backendDesc = "walsender";
+            break;
+        case B_WAL_WRITER:
+            backendDesc = "walwriter";
+            break;
+        case B_ARCHIVER:
+            backendDesc = "archiver";
+            break;
+    }
+
+    return backendDesc;
+}
+
+/* ----------
+ * bestatus_report_appname() -
+ *
+ *    Called to update our application name.
+ * ----------
+ */
+void
+bestatus_report_appname(const char *appname)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+    int            len;
+
+    if (!beentry)
+        return;
+
+    /* This should be unnecessary if GUC did its job, but be safe */
+    len = pg_mbcliplen(appname, strlen(appname), NAMEDATALEN - 1);
+
+    /*
+     * Update my status entry, following the protocol of bumping
+     * st_changecount before and after.  We use a volatile pointer here to
+     * ensure the compiler doesn't try to get cute.
+     */
+    bestatus_increment_changecount_before(beentry);
+
+    memcpy((char *) beentry->st_appname, appname, len);
+    beentry->st_appname[len] = '\0';
+
+    bestatus_increment_changecount_after(beentry);
+}
+
+/*
+ * Report current transaction start timestamp as the specified value.
+ * Zero means there is no active transaction.
+ */
+void
+bestatus_report_xact_timestamp(TimestampTz tstamp)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    if (!bestatus_track_activities || !beentry)
+        return;
+
+    /*
+     * Update my status entry, following the protocol of bumping
+     * st_changecount before and after.  We use a volatile pointer here to
+     * ensure the compiler doesn't try to get cute.
+     */
+    bestatus_increment_changecount_before(beentry);
+    beentry->st_xact_start_timestamp = tstamp;
+    bestatus_increment_changecount_after(beentry);
+}
+
+/* ----------
+ * bestatus_setup_memcxt() -
+ *
+ *    Create pgBeStatLocalContext, if not already done.
+ * ----------
+ */
+static void
+bestatus_setup_memcxt(void)
+{
+    if (!pgBeStatLocalContext)
+        pgBeStatLocalContext = AllocSetContextCreate(TopMemoryContext,
+                                                     "Backend status snapshot",
+                                                     ALLOCSET_SMALL_SIZES);
+}
+
+/* ----------
+ * bestatus_clear_snapshot() -
+ *
+ *    Discard any data collected in the current transaction.  Any subsequent
+ *    request will cause new snapshots to be read.
+ *
+ *    This is also invoked during transaction commit or abort to discard
+ *    the no-longer-wanted snapshot.
+ * ----------
+ */
+void
+bestatus_clear_snapshot(void)
+{
+    /* Release memory, if any was allocated */
+    if (pgBeStatLocalContext)
+        MemoryContextDelete(pgBeStatLocalContext);
+
+    /* Reset variables */
+    pgBeStatLocalContext = NULL;
+    localBackendStatusTable = NULL;
+    localNumBackends = 0;
+}
+
+
+
+/* ----------
+ * bestatus_report_activity() -
+ *
+ *    Called from tcop/postgres.c to report what the backend is actually doing
+ *    (but note cmd_str can be NULL for certain cases).
+ *
+ * All updates of the status entry follow the protocol of bumping
+ * st_changecount before and after.  We use a volatile pointer here to
+ * ensure the compiler doesn't try to get cute.
+ * ----------
+ */
+void
+bestatus_report_activity(BackendState state, const char *cmd_str)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+    TimestampTz start_timestamp;
+    TimestampTz current_timestamp;
+    int            len = 0;
+
+    TRACE_POSTGRESQL_STATEMENT_STATUS(cmd_str);
+
+    if (!beentry)
+        return;
+
+    if (!bestatus_track_activities)
+    {
+        if (beentry->st_state != STATE_DISABLED)
+        {
+            volatile PGPROC *proc = MyProc;
+
+            /*
+             * track_activities is disabled, but we last reported a
+             * non-disabled state.  As our final update, change the state and
+             * clear fields we will not be updating anymore.
+             */
+            bestatus_increment_changecount_before(beentry);
+            beentry->st_state = STATE_DISABLED;
+            beentry->st_state_start_timestamp = 0;
+            beentry->st_activity_raw[0] = '\0';
+            beentry->st_activity_start_timestamp = 0;
+            /* st_xact_start_timestamp and wait_event_info are also disabled */
+            beentry->st_xact_start_timestamp = 0;
+            proc->wait_event_info = 0;
+            bestatus_increment_changecount_after(beentry);
+        }
+        return;
+    }
+
+    /*
+     * To minimize the time spent modifying the entry, fetch all the needed
+     * data first.
+     */
+    start_timestamp = GetCurrentStatementStartTimestamp();
+    if (cmd_str != NULL)
+    {
+        /*
+         * Compute the length of the to-be-stored string without regard to
+         * multi-byte characters.  For speed, a truncation in the middle of a
+         * character is corrected on read rather than on every write.
+         */
+        len = Min(strlen(cmd_str), bestatus_track_activity_query_size - 1);
+    }
+    current_timestamp = GetCurrentTimestamp();
+
+    /*
+     * Now update the status entry
+     */
+    bestatus_increment_changecount_before(beentry);
+
+    beentry->st_state = state;
+    beentry->st_state_start_timestamp = current_timestamp;
+
+    if (cmd_str != NULL)
+    {
+        memcpy((char *) beentry->st_activity_raw, cmd_str, len);
+        beentry->st_activity_raw[len] = '\0';
+        beentry->st_activity_start_timestamp = start_timestamp;
+    }
+
+    bestatus_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * bestatus_progress_start_command() -
+ *
+ * Set st_progress_command (and st_progress_command_target) in own backend
+ * entry.  Also, zero-initialize st_progress_param array.
+ *-----------
+ */
+void
+bestatus_progress_start_command(ProgressCommandType cmdtype, Oid relid)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    if (!beentry || !bestatus_track_activities)
+        return;
+
+    bestatus_increment_changecount_before(beentry);
+    beentry->st_progress_command = cmdtype;
+    beentry->st_progress_command_target = relid;
+    MemSet(&beentry->st_progress_param, 0, sizeof(beentry->st_progress_param));
+    bestatus_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * bestatus_progress_update_param() -
+ *
+ * Update index'th member in st_progress_param[] of own backend entry.
+ *-----------
+ */
+void
+bestatus_progress_update_param(int index, int64 val)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    Assert(index >= 0 && index < BESTATUS_NUM_PROGRESS_PARAM);
+
+    if (!beentry || !bestatus_track_activities)
+        return;
+
+    bestatus_increment_changecount_before(beentry);
+    beentry->st_progress_param[index] = val;
+    bestatus_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * bestatus_progress_update_multi_param() -
+ *
+ * Update multiple members in st_progress_param[] of own backend entry.
+ * This is atomic; readers won't see intermediate states.
+ *-----------
+ */
+void
+bestatus_progress_update_multi_param(int nparam, const int *index,
+                                     const int64 *val)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+    int            i;
+
+    if (!beentry || !bestatus_track_activities || nparam == 0)
+        return;
+
+    bestatus_increment_changecount_before(beentry);
+
+    for (i = 0; i < nparam; ++i)
+    {
+        Assert(index[i] >= 0 && index[i] < BESTATUS_NUM_PROGRESS_PARAM);
+
+        beentry->st_progress_param[index[i]] = val[i];
+    }
+
+    bestatus_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * bestatus_progress_end_command() -
+ *
+ * Reset st_progress_command (and st_progress_command_target) in own backend
+ * entry.  This signals the end of the command.
+ *-----------
+ */
+void
+bestatus_progress_end_command(void)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    if (!beentry)
+        return;
+    if (!bestatus_track_activities
+        && beentry->st_progress_command == PROGRESS_COMMAND_INVALID)
+        return;
+
+    bestatus_increment_changecount_before(beentry);
+    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
+    beentry->st_progress_command_target = InvalidOid;
+    bestatus_increment_changecount_after(beentry);
+}
+
+
+/*
+ * Convert a potentially unsafely truncated activity string (see
+ * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
+ * one.
+ *
+ * The returned string is allocated in the caller's memory context and may be
+ * freed.
+ */
+char *
+bestatus_clip_activity(const char *raw_activity)
+{
+    char       *activity;
+    int            rawlen;
+    int            cliplen;
+
+    /*
+     * Some callers, like bestatus_get_backend_current_activity(), do not
+     * guarantee that the buffer isn't concurrently modified. We try to take
+     * care that the buffer is always terminated by a NUL byte regardless, but
+     * let's still be paranoid about the string's length. In those cases the
+     * underlying buffer is guaranteed to be bestatus_track_activity_query_size
+     * large.
+     */
+    activity = pnstrdup(raw_activity, bestatus_track_activity_query_size - 1);
+
+    /* now double-guaranteed to be NUL terminated */
+    rawlen = strlen(activity);
+
+    /*
+     * All supported server encodings make it possible to determine the length
+     * of a multi-byte character from its first byte (this is not the case for
+     * client encodings, see GB18030). As st_activity_raw is always stored in
+     * the server encoding, this allows us to perform multi-byte aware
+     * truncation, even if the string was earlier truncated in the middle of a
+     * multi-byte character.
+     */
+    cliplen = pg_mbcliplen(activity, rawlen,
+                           bestatus_track_activity_query_size - 1);
+
+    activity[cliplen] = '\0';
+
+    return activity;
+}
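
(Not part of the patch.) The st_changecount protocol used by the functions above can be sketched in isolation as follows. This is only an illustration under stated assumptions: DemoEntry and every demo_* function are hypothetical names, and C11 atomics stand in for the bestatus_*_changecount_* macros, whose definitions are not shown in this part of the patch. The writer makes the counter odd before touching the entry and even again afterwards. A reader accepts its copy only if the counter was even and did not change while the copy was made.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

#define DEMO_NAMELEN 64

/* Hypothetical stand-in for one PgBackendStatus slot (not the real struct). */
typedef struct DemoEntry
{
    atomic_int  changecount;    /* even: stable; odd: update in progress */
    int         procpid;
    char        appname[DEMO_NAMELEN];
} DemoEntry;

static void
demo_init(DemoEntry *e)
{
    atomic_init(&e->changecount, 0);
    e->procpid = 0;
    memset(e->appname, 0, sizeof(e->appname));
}

/* Writer side: bump the counter before and after touching the fields. */
static void
demo_write_begin(DemoEntry *e)
{
    atomic_fetch_add(&e->changecount, 1);   /* counter is now odd */
}

static void
demo_write_end(DemoEntry *e)
{
    atomic_fetch_add(&e->changecount, 1);   /* counter is even again */
}

static void
demo_report(DemoEntry *e, int pid, const char *appname)
{
    size_t      len = strlen(appname);

    if (len > DEMO_NAMELEN - 1)
        len = DEMO_NAMELEN - 1;     /* byte-wise clip; no multibyte handling */

    demo_write_begin(e);
    e->procpid = pid;
    memcpy(e->appname, appname, len);
    e->appname[len] = '\0';
    demo_write_end(e);
}

/* Reader side: a single attempt; returns false if the copy may be torn. */
static bool
demo_try_read(DemoEntry *e, int *pid, char *appname)
{
    int         before;
    int         after;

    before = atomic_load(&e->changecount);
    *pid = e->procpid;
    memcpy(appname, e->appname, DEMO_NAMELEN);
    after = atomic_load(&e->changecount);

    return before == after && (before & 1) == 0;
}

/* Reader side: retry until a consistent copy is obtained. */
static void
demo_read(DemoEntry *e, int *pid, char *appname)
{
    while (!demo_try_read(e, pid, appname))
        ;                       /* the patch calls CHECK_FOR_INTERRUPTS() here */
}
```

A caller would run demo_init() once, call demo_report() from the owning process, and call demo_read() from any observer. The sketch copies plain fields between two atomic loads, which is enough to show the retry rule but is not a complete memory-ordering argument for concurrent use.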
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/statmon/pgstat.c
similarity index 70%
rename from src/backend/postmaster/pgstat.c
rename to src/backend/statmon/pgstat.c
index 2d3f7cb898..df7995fb74 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/statmon/pgstat.c
@@ -8,7 +8,7 @@
  *
  *    Copyright (c) 2001-2018, PostgreSQL Global Development Group
  *
- *    src/backend/postmaster/pgstat.c
+ *    src/backend/statmon/pgstat.c
  * ----------
  */
 #include "postgres.h"
@@ -21,19 +21,14 @@
 #include "access/htup_details.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "libpq/libpq.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "postmaster/autovacuum.h"
-#include "replication/walsender.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/procsignal.h"
-#include "storage/sinvaladt.h"
-#include "utils/ascii.h"
-#include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
 
@@ -68,26 +63,12 @@ typedef enum
     PGSTAT_ENTRY_LOCK_FAILED
 } pg_stat_table_result_status;
 
-/* ----------
- * Total number of backends including auxiliary
- *
- * We reserve a slot for each possible BackendId, plus one for each
- * possible auxiliary process type.  (This scheme assumes there is not
- * more than one of any auxiliary process type at a time.) MaxBackends
- * includes autovacuum workers and background workers as well.
- * ----------
- */
-#define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
-
-
 /* ----------
  * GUC parameters
  * ----------
  */
-bool        pgstat_track_activities = false;
 bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
-int            pgstat_track_activity_query_size = 1024;
 
 /* Shared stats bootstrap infomation */
 typedef struct StatsShmemStruct {
@@ -125,6 +106,8 @@ static bool pgstat_pending_recoveryconflict = false;
 static bool pgstat_pending_deadlock = false;
 static bool pgstat_pending_tempfile = false;
 
+static MemoryContext pgStatLocalContext = NULL;
+
 /* dshash parameter for each type of table */
 static const dshash_parameters dsh_dbparams = {
     sizeof(Oid),
@@ -236,15 +219,8 @@ typedef struct
 /*
  * Info about current "snapshot" of stats file
  */
-static MemoryContext pgStatLocalContext = NULL;
 static HTAB *pgStatDBHash = NULL;
 
-/* Status for backends including auxiliary */
-static LocalPgBackendStatus *localBackendStatusTable = NULL;
-
-/* Total number of backends including auxiliary */
-static int    localNumBackends = 0;
-
 /*
  * Cluster wide statistics.
  * Contains statistics that are not collected per database or per table.
@@ -280,7 +256,6 @@ static void pgstat_read_db_statsfile(Oid databaseid, dshash_table *tabhash, dsha
 /* functions used in backends */
 static bool backend_snapshot_global_stats(void);
 static PgStat_StatFuncEntry *backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot);
-static void pgstat_read_current_status(void);
 
 static void pgstat_postmaster_shutdown(int code, Datum arg);
 static void pgstat_apply_pending_tabstats(bool shared, bool force,
@@ -307,12 +282,6 @@ static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
 
 static void pgstat_setup_memcxt(void);
 
-static const char *pgstat_get_wait_activity(WaitEventActivity w);
-static const char *pgstat_get_wait_client(WaitEventClient w);
-static const char *pgstat_get_wait_ipc(WaitEventIPC w);
-static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
-static const char *pgstat_get_wait_io(WaitEventIO w);
-
 static bool pgstat_update_tabentry(dshash_table *tabhash,
                                    PgStat_TableStatus *stat, bool nowait);
 static void pgstat_update_dbentry(PgStat_StatDBEntry *dbentry,
@@ -323,6 +292,14 @@ static void pgstat_update_dbentry(PgStat_StatDBEntry *dbentry,
  * ------------------------------------------------------------
  */
 
+
+void
+pgstat_initialize(void)
+{
+    /* Set up a process-exit hook to clean up */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
+}
+
 /*
  * subroutine for pgstat_reset_all
  */
@@ -2611,66 +2588,6 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     return funcentry;
 }
 
-
-/* ----------
- * pgstat_fetch_stat_beentry() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    our local copy of the current-activity entry for one backend.
- *
- *    NB: caller is responsible for a check if the user is permitted to see
- *    this info (especially the querystring).
- * ----------
- */
-PgBackendStatus *
-pgstat_fetch_stat_beentry(int beid)
-{
-    pgstat_read_current_status();
-
-    if (beid < 1 || beid > localNumBackends)
-        return NULL;
-
-    return &localBackendStatusTable[beid - 1].backendStatus;
-}
-
-
-/* ----------
- * pgstat_fetch_stat_local_beentry() -
- *
- *    Like pgstat_fetch_stat_beentry() but with locally computed additions (like
- *    xid and xmin values of the backend)
- *
- *    NB: caller is responsible for a check if the user is permitted to see
- *    this info (especially the querystring).
- * ----------
- */
-LocalPgBackendStatus *
-pgstat_fetch_stat_local_beentry(int beid)
-{
-    pgstat_read_current_status();
-
-    if (beid < 1 || beid > localNumBackends)
-        return NULL;
-
-    return &localBackendStatusTable[beid - 1];
-}
-
-
-/* ----------
- * pgstat_fetch_stat_numbackends() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the maximum current backend id.
- * ----------
- */
-int
-pgstat_fetch_stat_numbackends(void)
-{
-    pgstat_read_current_status();
-
-    return localNumBackends;
-}
-
 /*
  * ---------
  * pgstat_fetch_stat_archiver() -
@@ -2708,364 +2625,6 @@ pgstat_fetch_global(void)
     return snapshot_globalStats;
 }
 
-
-/* ------------------------------------------------------------
- * Functions for management of the shared-memory PgBackendStatus array
- * ------------------------------------------------------------
- */
-
-static PgBackendStatus *BackendStatusArray = NULL;
-static PgBackendStatus *MyBEEntry = NULL;
-static char *BackendAppnameBuffer = NULL;
-static char *BackendClientHostnameBuffer = NULL;
-static char *BackendActivityBuffer = NULL;
-static Size BackendActivityBufferSize = 0;
-#ifdef USE_SSL
-static PgBackendSSLStatus *BackendSslStatusBuffer = NULL;
-#endif
-
-
-/*
- * Report shared-memory space needed by CreateSharedBackendStatus.
- */
-Size
-BackendStatusShmemSize(void)
-{
-    Size        size;
-
-    /* BackendStatusArray: */
-    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
-    /* BackendAppnameBuffer: */
-    size = add_size(size,
-                    mul_size(NAMEDATALEN, NumBackendStatSlots));
-    /* BackendClientHostnameBuffer: */
-    size = add_size(size,
-                    mul_size(NAMEDATALEN, NumBackendStatSlots));
-    /* BackendActivityBuffer: */
-    size = add_size(size,
-                    mul_size(pgstat_track_activity_query_size, NumBackendStatSlots));
-#ifdef USE_SSL
-    /* BackendSslStatusBuffer: */
-    size = add_size(size,
-                    mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots));
-#endif
-    return size;
-}
-
-/*
- * Initialize the shared status array and several string buffers
- * during postmaster startup.
- */
-void
-CreateSharedBackendStatus(void)
-{
-    Size        size;
-    bool        found;
-    int            i;
-    char       *buffer;
-
-    /* Create or attach to the shared array */
-    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
-    BackendStatusArray = (PgBackendStatus *)
-        ShmemInitStruct("Backend Status Array", size, &found);
-
-    if (!found)
-    {
-        /*
-         * We're the first - initialize.
-         */
-        MemSet(BackendStatusArray, 0, size);
-    }
-
-    /* Create or attach to the shared appname buffer */
-    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
-    BackendAppnameBuffer = (char *)
-        ShmemInitStruct("Backend Application Name Buffer", size, &found);
-
-    if (!found)
-    {
-        MemSet(BackendAppnameBuffer, 0, size);
-
-        /* Initialize st_appname pointers. */
-        buffer = BackendAppnameBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_appname = buffer;
-            buffer += NAMEDATALEN;
-        }
-    }
-
-    /* Create or attach to the shared client hostname buffer */
-    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
-    BackendClientHostnameBuffer = (char *)
-        ShmemInitStruct("Backend Client Host Name Buffer", size, &found);
-
-    if (!found)
-    {
-        MemSet(BackendClientHostnameBuffer, 0, size);
-
-        /* Initialize st_clienthostname pointers. */
-        buffer = BackendClientHostnameBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_clienthostname = buffer;
-            buffer += NAMEDATALEN;
-        }
-    }
-
-    /* Create or attach to the shared activity buffer */
-    BackendActivityBufferSize = mul_size(pgstat_track_activity_query_size,
-                                         NumBackendStatSlots);
-    BackendActivityBuffer = (char *)
-        ShmemInitStruct("Backend Activity Buffer",
-                        BackendActivityBufferSize,
-                        &found);
-
-    if (!found)
-    {
-        MemSet(BackendActivityBuffer, 0, BackendActivityBufferSize);
-
-        /* Initialize st_activity pointers. */
-        buffer = BackendActivityBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_activity_raw = buffer;
-            buffer += pgstat_track_activity_query_size;
-        }
-    }
-
-#ifdef USE_SSL
-    /* Create or attach to the shared SSL status buffer */
-    size = mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots);
-    BackendSslStatusBuffer = (PgBackendSSLStatus *)
-        ShmemInitStruct("Backend SSL Status Buffer", size, &found);
-
-    if (!found)
-    {
-        PgBackendSSLStatus *ptr;
-
-        MemSet(BackendSslStatusBuffer, 0, size);
-
-        /* Initialize st_sslstatus pointers. */
-        ptr = BackendSslStatusBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_sslstatus = ptr;
-            ptr++;
-        }
-    }
-#endif
-}
-
-
-/* ----------
- * pgstat_initialize() -
- *
- *    Initialize pgstats state, and set up our on-proc-exit hook.
- *    Called from InitPostgres and AuxiliaryProcessMain. For auxiliary process,
- *    MyBackendId is invalid. Otherwise, MyBackendId must be set,
- *    but we must not have started any transaction yet (since the
- *    exit hook must run after the last transaction exit).
- *    NOTE: MyDatabaseId isn't set yet; so the shutdown hook has to be careful.
- * ----------
- */
-void
-pgstat_initialize(void)
-{
-    /* Initialize MyBEEntry */
-    if (MyBackendId != InvalidBackendId)
-    {
-        Assert(MyBackendId >= 1 && MyBackendId <= MaxBackends);
-        MyBEEntry = &BackendStatusArray[MyBackendId - 1];
-    }
-    else
-    {
-        /* Must be an auxiliary process */
-        Assert(MyAuxProcType != NotAnAuxProcess);
-
-        /*
-         * Assign the MyBEEntry for an auxiliary process.  Since it doesn't
-         * have a BackendId, the slot is statically allocated based on the
-         * auxiliary process type (MyAuxProcType).  Backends use slots indexed
-         * in the range from 1 to MaxBackends (inclusive), so we use
-         * MaxBackends + AuxBackendType + 1 as the index of the slot for an
-         * auxiliary process.
-         */
-        MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
-    }
-
-    /* Set up a process-exit hook to clean up */
-    before_shmem_exit(pgstat_beshutdown_hook, 0);
-}
-
-/* ----------
- * pgstat_bestart() -
- *
- *    Initialize this backend's entry in the PgBackendStatus array.
- *    Called from InitPostgres.
- *
- *    Apart from auxiliary processes, MyBackendId, MyDatabaseId,
- *    session userid, and application_name must be set for a
- *    backend (hence, this cannot be combined with pgstat_initialize).
- * ----------
- */
-void
-pgstat_bestart(void)
-{
-    SockAddr    clientaddr;
-    volatile PgBackendStatus *beentry;
-
-    /*
-     * To minimize the time spent modifying the PgBackendStatus entry, fetch
-     * all the needed data first.
-     */
-
-    /*
-     * We may not have a MyProcPort (eg, if this is the autovacuum process).
-     * If so, use all-zeroes client address, which is dealt with specially in
-     * pg_stat_get_backend_client_addr and pg_stat_get_backend_client_port.
-     */
-    if (MyProcPort)
-        memcpy(&clientaddr, &MyProcPort->raddr, sizeof(clientaddr));
-    else
-        MemSet(&clientaddr, 0, sizeof(clientaddr));
-
-    /*
-     * Initialize my status entry, following the protocol of bumping
-     * st_changecount before and after; and make sure it's even afterwards. We
-     * use a volatile pointer here to ensure the compiler doesn't try to get
-     * cute.
-     */
-    beentry = MyBEEntry;
-
-    /* pgstats state must be initialized from pgstat_initialize() */
-    Assert(beentry != NULL);
-
-    if (MyBackendId != InvalidBackendId)
-    {
-        if (IsAutoVacuumLauncherProcess())
-        {
-            /* Autovacuum Launcher */
-            beentry->st_backendType = B_AUTOVAC_LAUNCHER;
-        }
-        else if (IsAutoVacuumWorkerProcess())
-        {
-            /* Autovacuum Worker */
-            beentry->st_backendType = B_AUTOVAC_WORKER;
-        }
-        else if (am_walsender)
-        {
-            /* Wal sender */
-            beentry->st_backendType = B_WAL_SENDER;
-        }
-        else if (IsBackgroundWorker)
-        {
-            /* bgworker */
-            beentry->st_backendType = B_BG_WORKER;
-        }
-        else
-        {
-            /* client-backend */
-            beentry->st_backendType = B_BACKEND;
-        }
-    }
-    else
-    {
-        /* Must be an auxiliary process */
-        Assert(MyAuxProcType != NotAnAuxProcess);
-        switch (MyAuxProcType)
-        {
-            case StartupProcess:
-                beentry->st_backendType = B_STARTUP;
-                break;
-            case BgWriterProcess:
-                beentry->st_backendType = B_BG_WRITER;
-                break;
-            case ArchiverProcess:
-                beentry->st_backendType = B_ARCHIVER;
-                break;
-            case CheckpointerProcess:
-                beentry->st_backendType = B_CHECKPOINTER;
-                break;
-            case WalWriterProcess:
-                beentry->st_backendType = B_WAL_WRITER;
-                break;
-            case WalReceiverProcess:
-                beentry->st_backendType = B_WAL_RECEIVER;
-                break;
-            default:
-                elog(FATAL, "unrecognized process type: %d",
-                     (int) MyAuxProcType);
-                proc_exit(1);
-        }
-    }
-
-    do
-    {
-        pgstat_increment_changecount_before(beentry);
-    } while ((beentry->st_changecount & 1) == 0);
-
-    beentry->st_procpid = MyProcPid;
-    beentry->st_proc_start_timestamp = MyStartTimestamp;
-    beentry->st_activity_start_timestamp = 0;
-    beentry->st_state_start_timestamp = 0;
-    beentry->st_xact_start_timestamp = 0;
-    beentry->st_databaseid = MyDatabaseId;
-
-    /* We have userid for client-backends, wal-sender and bgworker processes */
-    if (beentry->st_backendType == B_BACKEND
-        || beentry->st_backendType == B_WAL_SENDER
-        || beentry->st_backendType == B_BG_WORKER)
-        beentry->st_userid = GetSessionUserId();
-    else
-        beentry->st_userid = InvalidOid;
-
-    beentry->st_clientaddr = clientaddr;
-    if (MyProcPort && MyProcPort->remote_hostname)
-        strlcpy(beentry->st_clienthostname, MyProcPort->remote_hostname,
-                NAMEDATALEN);
-    else
-        beentry->st_clienthostname[0] = '\0';
-#ifdef USE_SSL
-    if (MyProcPort && MyProcPort->ssl != NULL)
-    {
-        beentry->st_ssl = true;
-        beentry->st_sslstatus->ssl_bits = be_tls_get_cipher_bits(MyProcPort);
-        beentry->st_sslstatus->ssl_compression = be_tls_get_compression(MyProcPort);
-        strlcpy(beentry->st_sslstatus->ssl_version, be_tls_get_version(MyProcPort), NAMEDATALEN);
-        strlcpy(beentry->st_sslstatus->ssl_cipher, be_tls_get_cipher(MyProcPort), NAMEDATALEN);
-        be_tls_get_peerdn_name(MyProcPort, beentry->st_sslstatus->ssl_clientdn, NAMEDATALEN);
-    }
-    else
-    {
-        beentry->st_ssl = false;
-    }
-#else
-    beentry->st_ssl = false;
-#endif
-    beentry->st_state = STATE_UNDEFINED;
-    beentry->st_appname[0] = '\0';
-    beentry->st_activity_raw[0] = '\0';
-    /* Also make sure the last byte in each string area is always 0 */
-    beentry->st_clienthostname[NAMEDATALEN - 1] = '\0';
-    beentry->st_appname[NAMEDATALEN - 1] = '\0';
-    beentry->st_activity_raw[pgstat_track_activity_query_size - 1] = '\0';
-    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
-    beentry->st_progress_command_target = InvalidOid;
-
-    /*
-     * we don't zero st_progress_param here to save cycles; nobody should
-     * examine it until st_progress_command has been set to something other
-     * than PROGRESS_COMMAND_INVALID
-     */
-
-    pgstat_increment_changecount_after(beentry);
-
-    /* Update app name to current GUC setting */
-    if (application_name)
-        pgstat_report_appname(application_name);
-}
-
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
@@ -3078,8 +2637,6 @@ pgstat_bestart(void)
 static void
 pgstat_beshutdown_hook(int code, Datum arg)
 {
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
     /*
      * If we got as far as discovering our own database ID, we can report what
      * we did to the collector.  Otherwise, we'd be sending an invalid
@@ -3088,1188 +2645,9 @@ pgstat_beshutdown_hook(int code, Datum arg)
      */
     if (OidIsValid(MyDatabaseId))
         pgstat_update_stat(true);
-
-    /*
-     * Clear my status entry, following the protocol of bumping st_changecount
-     * before and after.  We use a volatile pointer here to ensure the
-     * compiler doesn't try to get cute.
-     */
-    pgstat_increment_changecount_before(beentry);
-
-    beentry->st_procpid = 0;    /* mark invalid */
-
-    pgstat_increment_changecount_after(beentry);
 }
 
 
-/* ----------
- * pgstat_report_activity() -
- *
- *    Called from tcop/postgres.c to report what the backend is actually doing
- *    (but note cmd_str can be NULL for certain cases).
- *
- * All updates of the status entry follow the protocol of bumping
- * st_changecount before and after.  We use a volatile pointer here to
- * ensure the compiler doesn't try to get cute.
- * ----------
- */
-void
-pgstat_report_activity(BackendState state, const char *cmd_str)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-    TimestampTz start_timestamp;
-    TimestampTz current_timestamp;
-    int            len = 0;
-
-    TRACE_POSTGRESQL_STATEMENT_STATUS(cmd_str);
-
-    if (!beentry)
-        return;
-
-    if (!pgstat_track_activities)
-    {
-        if (beentry->st_state != STATE_DISABLED)
-        {
-            volatile PGPROC *proc = MyProc;
-
-            /*
-             * track_activities is disabled, but we last reported a
-             * non-disabled state.  As our final update, change the state and
-             * clear fields we will not be updating anymore.
-             */
-            pgstat_increment_changecount_before(beentry);
-            beentry->st_state = STATE_DISABLED;
-            beentry->st_state_start_timestamp = 0;
-            beentry->st_activity_raw[0] = '\0';
-            beentry->st_activity_start_timestamp = 0;
-            /* st_xact_start_timestamp and wait_event_info are also disabled */
-            beentry->st_xact_start_timestamp = 0;
-            proc->wait_event_info = 0;
-            pgstat_increment_changecount_after(beentry);
-        }
-        return;
-    }
-
-    /*
-     * To minimize the time spent modifying the entry, fetch all the needed
-     * data first.
-     */
-    start_timestamp = GetCurrentStatementStartTimestamp();
-    if (cmd_str != NULL)
-    {
-        /*
-         * Compute length of to-be-stored string unaware of multi-byte
-         * characters. For speed reasons that'll get corrected on read, rather
-         * than computed every write.
-         */
-        len = Min(strlen(cmd_str), pgstat_track_activity_query_size - 1);
-    }
-    current_timestamp = GetCurrentTimestamp();
-
-    /*
-     * Now update the status entry
-     */
-    pgstat_increment_changecount_before(beentry);
-
-    beentry->st_state = state;
-    beentry->st_state_start_timestamp = current_timestamp;
-
-    if (cmd_str != NULL)
-    {
-        memcpy((char *) beentry->st_activity_raw, cmd_str, len);
-        beentry->st_activity_raw[len] = '\0';
-        beentry->st_activity_start_timestamp = start_timestamp;
-    }
-
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_start_command() -
- *
- * Set st_progress_command (and st_progress_command_target) in own backend
- * entry.  Also, zero-initialize st_progress_param array.
- *-----------
- */
-void
-pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    if (!beentry || !pgstat_track_activities)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_progress_command = cmdtype;
-    beentry->st_progress_command_target = relid;
-    MemSet(&beentry->st_progress_param, 0, sizeof(beentry->st_progress_param));
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_update_param() -
- *
- * Update index'th member in st_progress_param[] of own backend entry.
- *-----------
- */
-void
-pgstat_progress_update_param(int index, int64 val)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    Assert(index >= 0 && index < PGSTAT_NUM_PROGRESS_PARAM);
-
-    if (!beentry || !pgstat_track_activities)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_progress_param[index] = val;
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_update_multi_param() -
- *
- * Update multiple members in st_progress_param[] of own backend entry.
- * This is atomic; readers won't see intermediate states.
- *-----------
- */
-void
-pgstat_progress_update_multi_param(int nparam, const int *index,
-                                   const int64 *val)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-    int            i;
-
-    if (!beentry || !pgstat_track_activities || nparam == 0)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-
-    for (i = 0; i < nparam; ++i)
-    {
-        Assert(index[i] >= 0 && index[i] < PGSTAT_NUM_PROGRESS_PARAM);
-
-        beentry->st_progress_param[index[i]] = val[i];
-    }
-
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_end_command() -
- *
- * Reset st_progress_command (and st_progress_command_target) in own backend
- * entry.  This signals the end of the command.
- *-----------
- */
-void
-pgstat_progress_end_command(void)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    if (!beentry)
-        return;
-    if (!pgstat_track_activities
-        && beentry->st_progress_command == PROGRESS_COMMAND_INVALID)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
-    beentry->st_progress_command_target = InvalidOid;
-    pgstat_increment_changecount_after(beentry);
-}
-
-/* ----------
- * pgstat_report_appname() -
- *
- *    Called to update our application name.
- * ----------
- */
-void
-pgstat_report_appname(const char *appname)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-    int            len;
-
-    if (!beentry)
-        return;
-
-    /* This should be unnecessary if GUC did its job, but be safe */
-    len = pg_mbcliplen(appname, strlen(appname), NAMEDATALEN - 1);
-
-    /*
-     * Update my status entry, following the protocol of bumping
-     * st_changecount before and after.  We use a volatile pointer here to
-     * ensure the compiler doesn't try to get cute.
-     */
-    pgstat_increment_changecount_before(beentry);
-
-    memcpy((char *) beentry->st_appname, appname, len);
-    beentry->st_appname[len] = '\0';
-
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*
- * Report current transaction start timestamp as the specified value.
- * Zero means there is no active transaction.
- */
-void
-pgstat_report_xact_timestamp(TimestampTz tstamp)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    if (!pgstat_track_activities || !beentry)
-        return;
-
-    /*
-     * Update my status entry, following the protocol of bumping
-     * st_changecount before and after.  We use a volatile pointer here to
-     * ensure the compiler doesn't try to get cute.
-     */
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_xact_start_timestamp = tstamp;
-    pgstat_increment_changecount_after(beentry);
-}
-
-/* ----------
- * pgstat_read_current_status() -
- *
- *    Copy the current contents of the PgBackendStatus array to local memory,
- *    if not already done in this transaction.
- * ----------
- */
-static void
-pgstat_read_current_status(void)
-{
-    volatile PgBackendStatus *beentry;
-    LocalPgBackendStatus *localtable;
-    LocalPgBackendStatus *localentry;
-    char       *localappname,
-               *localclienthostname,
-               *localactivity;
-#ifdef USE_SSL
-    PgBackendSSLStatus *localsslstatus;
-#endif
-    int            i;
-
-    Assert(IsUnderPostmaster);
-
-    if (localBackendStatusTable)
-        return;                    /* already done */
-
-    pgstat_setup_memcxt();
-
-    localtable = (LocalPgBackendStatus *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           sizeof(LocalPgBackendStatus) * NumBackendStatSlots);
-    localappname = (char *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           NAMEDATALEN * NumBackendStatSlots);
-    localclienthostname = (char *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           NAMEDATALEN * NumBackendStatSlots);
-    localactivity = (char *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           pgstat_track_activity_query_size * NumBackendStatSlots);
-#ifdef USE_SSL
-    localsslstatus = (PgBackendSSLStatus *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           sizeof(PgBackendSSLStatus) * NumBackendStatSlots);
-#endif
-
-    localNumBackends = 0;
-
-    beentry = BackendStatusArray;
-    localentry = localtable;
-    for (i = 1; i <= NumBackendStatSlots; i++)
-    {
-        /*
-         * Follow the protocol of retrying if st_changecount changes while we
-         * copy the entry, or if it's odd.  (The check for odd is needed to
-         * cover the case where we are able to completely copy the entry while
-         * the source backend is between increment steps.)    We use a volatile
-         * pointer here to ensure the compiler doesn't try to get cute.
-         */
-        for (;;)
-        {
-            int            before_changecount;
-            int            after_changecount;
-
-            pgstat_save_changecount_before(beentry, before_changecount);
-
-            localentry->backendStatus.st_procpid = beentry->st_procpid;
-            if (localentry->backendStatus.st_procpid > 0)
-            {
-                memcpy(&localentry->backendStatus, (char *) beentry, sizeof(PgBackendStatus));
-
-                /*
-                 * strcpy is safe even if the string is modified concurrently,
-                 * because there's always a \0 at the end of the buffer.
-                 */
-                strcpy(localappname, (char *) beentry->st_appname);
-                localentry->backendStatus.st_appname = localappname;
-                strcpy(localclienthostname, (char *) beentry->st_clienthostname);
-                localentry->backendStatus.st_clienthostname = localclienthostname;
-                strcpy(localactivity, (char *) beentry->st_activity_raw);
-                localentry->backendStatus.st_activity_raw = localactivity;
-                localentry->backendStatus.st_ssl = beentry->st_ssl;
-#ifdef USE_SSL
-                if (beentry->st_ssl)
-                {
-                    memcpy(localsslstatus, beentry->st_sslstatus, sizeof(PgBackendSSLStatus));
-                    localentry->backendStatus.st_sslstatus = localsslstatus;
-                }
-#endif
-            }
-
-            pgstat_save_changecount_after(beentry, after_changecount);
-            if (before_changecount == after_changecount &&
-                (before_changecount & 1) == 0)
-                break;
-
-            /* Make sure we can break out of loop if stuck... */
-            CHECK_FOR_INTERRUPTS();
-        }
-
-        beentry++;
-        /* Only valid entries get included into the local array */
-        if (localentry->backendStatus.st_procpid > 0)
-        {
-            BackendIdGetTransactionIds(i,
-                                       &localentry->backend_xid,
-                                       &localentry->backend_xmin);
-
-            localentry++;
-            localappname += NAMEDATALEN;
-            localclienthostname += NAMEDATALEN;
-            localactivity += pgstat_track_activity_query_size;
-#ifdef USE_SSL
-            localsslstatus++;
-#endif
-            localNumBackends++;
-        }
-    }
-
-    /* Set the pointer only after completion of a valid table */
-    localBackendStatusTable = localtable;
-}
-
-/* ----------
- * pgstat_get_wait_event_type() -
- *
- *    Return a string representing the current wait event type, backend is
- *    waiting on.
- */
-const char *
-pgstat_get_wait_event_type(uint32 wait_event_info)
-{
-    uint32        classId;
-    const char *event_type;
-
-    /* report process as not waiting. */
-    if (wait_event_info == 0)
-        return NULL;
-
-    classId = wait_event_info & 0xFF000000;
-
-    switch (classId)
-    {
-        case PG_WAIT_LWLOCK:
-            event_type = "LWLock";
-            break;
-        case PG_WAIT_LOCK:
-            event_type = "Lock";
-            break;
-        case PG_WAIT_BUFFER_PIN:
-            event_type = "BufferPin";
-            break;
-        case PG_WAIT_ACTIVITY:
-            event_type = "Activity";
-            break;
-        case PG_WAIT_CLIENT:
-            event_type = "Client";
-            break;
-        case PG_WAIT_EXTENSION:
-            event_type = "Extension";
-            break;
-        case PG_WAIT_IPC:
-            event_type = "IPC";
-            break;
-        case PG_WAIT_TIMEOUT:
-            event_type = "Timeout";
-            break;
-        case PG_WAIT_IO:
-            event_type = "IO";
-            break;
-        default:
-            event_type = "???";
-            break;
-    }
-
-    return event_type;
-}
-
-/* ----------
- * pgstat_get_wait_event() -
- *
- *    Return a string representing the current wait event, backend is
- *    waiting on.
- */
-const char *
-pgstat_get_wait_event(uint32 wait_event_info)
-{
-    uint32        classId;
-    uint16        eventId;
-    const char *event_name;
-
-    /* report process as not waiting. */
-    if (wait_event_info == 0)
-        return NULL;
-
-    classId = wait_event_info & 0xFF000000;
-    eventId = wait_event_info & 0x0000FFFF;
-
-    switch (classId)
-    {
-        case PG_WAIT_LWLOCK:
-            event_name = GetLWLockIdentifier(classId, eventId);
-            break;
-        case PG_WAIT_LOCK:
-            event_name = GetLockNameFromTagType(eventId);
-            break;
-        case PG_WAIT_BUFFER_PIN:
-            event_name = "BufferPin";
-            break;
-        case PG_WAIT_ACTIVITY:
-            {
-                WaitEventActivity w = (WaitEventActivity) wait_event_info;
-
-                event_name = pgstat_get_wait_activity(w);
-                break;
-            }
-        case PG_WAIT_CLIENT:
-            {
-                WaitEventClient w = (WaitEventClient) wait_event_info;
-
-                event_name = pgstat_get_wait_client(w);
-                break;
-            }
-        case PG_WAIT_EXTENSION:
-            event_name = "Extension";
-            break;
-        case PG_WAIT_IPC:
-            {
-                WaitEventIPC w = (WaitEventIPC) wait_event_info;
-
-                event_name = pgstat_get_wait_ipc(w);
-                break;
-            }
-        case PG_WAIT_TIMEOUT:
-            {
-                WaitEventTimeout w = (WaitEventTimeout) wait_event_info;
-
-                event_name = pgstat_get_wait_timeout(w);
-                break;
-            }
-        case PG_WAIT_IO:
-            {
-                WaitEventIO w = (WaitEventIO) wait_event_info;
-
-                event_name = pgstat_get_wait_io(w);
-                break;
-            }
-        default:
-            event_name = "unknown wait event";
-            break;
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_activity() -
- *
- * Convert WaitEventActivity to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_activity(WaitEventActivity w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_ARCHIVER_MAIN:
-            event_name = "ArchiverMain";
-            break;
-        case WAIT_EVENT_AUTOVACUUM_MAIN:
-            event_name = "AutoVacuumMain";
-            break;
-        case WAIT_EVENT_BGWRITER_HIBERNATE:
-            event_name = "BgWriterHibernate";
-            break;
-        case WAIT_EVENT_BGWRITER_MAIN:
-            event_name = "BgWriterMain";
-            break;
-        case WAIT_EVENT_CHECKPOINTER_MAIN:
-            event_name = "CheckpointerMain";
-            break;
-        case WAIT_EVENT_LOGICAL_APPLY_MAIN:
-            event_name = "LogicalApplyMain";
-            break;
-        case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
-            event_name = "LogicalLauncherMain";
-            break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
-            break;
-        case WAIT_EVENT_RECOVERY_WAL_ALL:
-            event_name = "RecoveryWalAll";
-            break;
-        case WAIT_EVENT_RECOVERY_WAL_STREAM:
-            event_name = "RecoveryWalStream";
-            break;
-        case WAIT_EVENT_SYSLOGGER_MAIN:
-            event_name = "SysLoggerMain";
-            break;
-        case WAIT_EVENT_WAL_RECEIVER_MAIN:
-            event_name = "WalReceiverMain";
-            break;
-        case WAIT_EVENT_WAL_SENDER_MAIN:
-            event_name = "WalSenderMain";
-            break;
-        case WAIT_EVENT_WAL_WRITER_MAIN:
-            event_name = "WalWriterMain";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_client() -
- *
- * Convert WaitEventClient to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_client(WaitEventClient w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_CLIENT_READ:
-            event_name = "ClientRead";
-            break;
-        case WAIT_EVENT_CLIENT_WRITE:
-            event_name = "ClientWrite";
-            break;
-        case WAIT_EVENT_LIBPQWALRECEIVER_CONNECT:
-            event_name = "LibPQWalReceiverConnect";
-            break;
-        case WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE:
-            event_name = "LibPQWalReceiverReceive";
-            break;
-        case WAIT_EVENT_SSL_OPEN_SERVER:
-            event_name = "SSLOpenServer";
-            break;
-        case WAIT_EVENT_WAL_RECEIVER_WAIT_START:
-            event_name = "WalReceiverWaitStart";
-            break;
-        case WAIT_EVENT_WAL_SENDER_WAIT_WAL:
-            event_name = "WalSenderWaitForWAL";
-            break;
-        case WAIT_EVENT_WAL_SENDER_WRITE_DATA:
-            event_name = "WalSenderWriteData";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_ipc() -
- *
- * Convert WaitEventIPC to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_ipc(WaitEventIPC w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_BGWORKER_SHUTDOWN:
-            event_name = "BgWorkerShutdown";
-            break;
-        case WAIT_EVENT_BGWORKER_STARTUP:
-            event_name = "BgWorkerStartup";
-            break;
-        case WAIT_EVENT_BTREE_PAGE:
-            event_name = "BtreePage";
-            break;
-        case WAIT_EVENT_CLOG_GROUP_UPDATE:
-            event_name = "ClogGroupUpdate";
-            break;
-        case WAIT_EVENT_EXECUTE_GATHER:
-            event_name = "ExecuteGather";
-            break;
-        case WAIT_EVENT_HASH_BATCH_ALLOCATING:
-            event_name = "Hash/Batch/Allocating";
-            break;
-        case WAIT_EVENT_HASH_BATCH_ELECTING:
-            event_name = "Hash/Batch/Electing";
-            break;
-        case WAIT_EVENT_HASH_BATCH_LOADING:
-            event_name = "Hash/Batch/Loading";
-            break;
-        case WAIT_EVENT_HASH_BUILD_ALLOCATING:
-            event_name = "Hash/Build/Allocating";
-            break;
-        case WAIT_EVENT_HASH_BUILD_ELECTING:
-            event_name = "Hash/Build/Electing";
-            break;
-        case WAIT_EVENT_HASH_BUILD_HASHING_INNER:
-            event_name = "Hash/Build/HashingInner";
-            break;
-        case WAIT_EVENT_HASH_BUILD_HASHING_OUTER:
-            event_name = "Hash/Build/HashingOuter";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING:
-            event_name = "Hash/GrowBatches/Allocating";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_DECIDING:
-            event_name = "Hash/GrowBatches/Deciding";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_ELECTING:
-            event_name = "Hash/GrowBatches/Electing";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_FINISHING:
-            event_name = "Hash/GrowBatches/Finishing";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING:
-            event_name = "Hash/GrowBatches/Repartitioning";
-            break;
-        case WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING:
-            event_name = "Hash/GrowBuckets/Allocating";
-            break;
-        case WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING:
-            event_name = "Hash/GrowBuckets/Electing";
-            break;
-        case WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING:
-            event_name = "Hash/GrowBuckets/Reinserting";
-            break;
-        case WAIT_EVENT_LOGICAL_SYNC_DATA:
-            event_name = "LogicalSyncData";
-            break;
-        case WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE:
-            event_name = "LogicalSyncStateChange";
-            break;
-        case WAIT_EVENT_MQ_INTERNAL:
-            event_name = "MessageQueueInternal";
-            break;
-        case WAIT_EVENT_MQ_PUT_MESSAGE:
-            event_name = "MessageQueuePutMessage";
-            break;
-        case WAIT_EVENT_MQ_RECEIVE:
-            event_name = "MessageQueueReceive";
-            break;
-        case WAIT_EVENT_MQ_SEND:
-            event_name = "MessageQueueSend";
-            break;
-        case WAIT_EVENT_PARALLEL_BITMAP_SCAN:
-            event_name = "ParallelBitmapScan";
-            break;
-        case WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN:
-            event_name = "ParallelCreateIndexScan";
-            break;
-        case WAIT_EVENT_PARALLEL_FINISH:
-            event_name = "ParallelFinish";
-            break;
-        case WAIT_EVENT_PROCARRAY_GROUP_UPDATE:
-            event_name = "ProcArrayGroupUpdate";
-            break;
-        case WAIT_EVENT_PROMOTE:
-            event_name = "Promote";
-            break;
-        case WAIT_EVENT_REPLICATION_ORIGIN_DROP:
-            event_name = "ReplicationOriginDrop";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_DROP:
-            event_name = "ReplicationSlotDrop";
-            break;
-        case WAIT_EVENT_SAFE_SNAPSHOT:
-            event_name = "SafeSnapshot";
-            break;
-        case WAIT_EVENT_SYNC_REP:
-            event_name = "SyncRep";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_timeout() -
- *
- * Convert WaitEventTimeout to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_timeout(WaitEventTimeout w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_BASE_BACKUP_THROTTLE:
-            event_name = "BaseBackupThrottle";
-            break;
-        case WAIT_EVENT_PG_SLEEP:
-            event_name = "PgSleep";
-            break;
-        case WAIT_EVENT_RECOVERY_APPLY_DELAY:
-            event_name = "RecoveryApplyDelay";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_io() -
- *
- * Convert WaitEventIO to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_io(WaitEventIO w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_BUFFILE_READ:
-            event_name = "BufFileRead";
-            break;
-        case WAIT_EVENT_BUFFILE_WRITE:
-            event_name = "BufFileWrite";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_READ:
-            event_name = "ControlFileRead";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_SYNC:
-            event_name = "ControlFileSync";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE:
-            event_name = "ControlFileSyncUpdate";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_WRITE:
-            event_name = "ControlFileWrite";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE:
-            event_name = "ControlFileWriteUpdate";
-            break;
-        case WAIT_EVENT_COPY_FILE_READ:
-            event_name = "CopyFileRead";
-            break;
-        case WAIT_EVENT_COPY_FILE_WRITE:
-            event_name = "CopyFileWrite";
-            break;
-        case WAIT_EVENT_DATA_FILE_EXTEND:
-            event_name = "DataFileExtend";
-            break;
-        case WAIT_EVENT_DATA_FILE_FLUSH:
-            event_name = "DataFileFlush";
-            break;
-        case WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC:
-            event_name = "DataFileImmediateSync";
-            break;
-        case WAIT_EVENT_DATA_FILE_PREFETCH:
-            event_name = "DataFilePrefetch";
-            break;
-        case WAIT_EVENT_DATA_FILE_READ:
-            event_name = "DataFileRead";
-            break;
-        case WAIT_EVENT_DATA_FILE_SYNC:
-            event_name = "DataFileSync";
-            break;
-        case WAIT_EVENT_DATA_FILE_TRUNCATE:
-            event_name = "DataFileTruncate";
-            break;
-        case WAIT_EVENT_DATA_FILE_WRITE:
-            event_name = "DataFileWrite";
-            break;
-        case WAIT_EVENT_DSM_FILL_ZERO_WRITE:
-            event_name = "DSMFillZeroWrite";
-            break;
-        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ:
-            event_name = "LockFileAddToDataDirRead";
-            break;
-        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC:
-            event_name = "LockFileAddToDataDirSync";
-            break;
-        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE:
-            event_name = "LockFileAddToDataDirWrite";
-            break;
-        case WAIT_EVENT_LOCK_FILE_CREATE_READ:
-            event_name = "LockFileCreateRead";
-            break;
-        case WAIT_EVENT_LOCK_FILE_CREATE_SYNC:
-            event_name = "LockFileCreateSync";
-            break;
-        case WAIT_EVENT_LOCK_FILE_CREATE_WRITE:
-            event_name = "LockFileCreateWRITE";
-            break;
-        case WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ:
-            event_name = "LockFileReCheckDataDirRead";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC:
-            event_name = "LogicalRewriteCheckpointSync";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC:
-            event_name = "LogicalRewriteMappingSync";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE:
-            event_name = "LogicalRewriteMappingWrite";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_SYNC:
-            event_name = "LogicalRewriteSync";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE:
-            event_name = "LogicalRewriteTruncate";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_WRITE:
-            event_name = "LogicalRewriteWrite";
-            break;
-        case WAIT_EVENT_RELATION_MAP_READ:
-            event_name = "RelationMapRead";
-            break;
-        case WAIT_EVENT_RELATION_MAP_SYNC:
-            event_name = "RelationMapSync";
-            break;
-        case WAIT_EVENT_RELATION_MAP_WRITE:
-            event_name = "RelationMapWrite";
-            break;
-        case WAIT_EVENT_REORDER_BUFFER_READ:
-            event_name = "ReorderBufferRead";
-            break;
-        case WAIT_EVENT_REORDER_BUFFER_WRITE:
-            event_name = "ReorderBufferWrite";
-            break;
-        case WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ:
-            event_name = "ReorderLogicalMappingRead";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_READ:
-            event_name = "ReplicationSlotRead";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC:
-            event_name = "ReplicationSlotRestoreSync";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_SYNC:
-            event_name = "ReplicationSlotSync";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_WRITE:
-            event_name = "ReplicationSlotWrite";
-            break;
-        case WAIT_EVENT_SLRU_FLUSH_SYNC:
-            event_name = "SLRUFlushSync";
-            break;
-        case WAIT_EVENT_SLRU_READ:
-            event_name = "SLRURead";
-            break;
-        case WAIT_EVENT_SLRU_SYNC:
-            event_name = "SLRUSync";
-            break;
-        case WAIT_EVENT_SLRU_WRITE:
-            event_name = "SLRUWrite";
-            break;
-        case WAIT_EVENT_SNAPBUILD_READ:
-            event_name = "SnapbuildRead";
-            break;
-        case WAIT_EVENT_SNAPBUILD_SYNC:
-            event_name = "SnapbuildSync";
-            break;
-        case WAIT_EVENT_SNAPBUILD_WRITE:
-            event_name = "SnapbuildWrite";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC:
-            event_name = "TimelineHistoryFileSync";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE:
-            event_name = "TimelineHistoryFileWrite";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_READ:
-            event_name = "TimelineHistoryRead";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_SYNC:
-            event_name = "TimelineHistorySync";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_WRITE:
-            event_name = "TimelineHistoryWrite";
-            break;
-        case WAIT_EVENT_TWOPHASE_FILE_READ:
-            event_name = "TwophaseFileRead";
-            break;
-        case WAIT_EVENT_TWOPHASE_FILE_SYNC:
-            event_name = "TwophaseFileSync";
-            break;
-        case WAIT_EVENT_TWOPHASE_FILE_WRITE:
-            event_name = "TwophaseFileWrite";
-            break;
-        case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ:
-            event_name = "WALSenderTimelineHistoryRead";
-            break;
-        case WAIT_EVENT_WAL_BOOTSTRAP_SYNC:
-            event_name = "WALBootstrapSync";
-            break;
-        case WAIT_EVENT_WAL_BOOTSTRAP_WRITE:
-            event_name = "WALBootstrapWrite";
-            break;
-        case WAIT_EVENT_WAL_COPY_READ:
-            event_name = "WALCopyRead";
-            break;
-        case WAIT_EVENT_WAL_COPY_SYNC:
-            event_name = "WALCopySync";
-            break;
-        case WAIT_EVENT_WAL_COPY_WRITE:
-            event_name = "WALCopyWrite";
-            break;
-        case WAIT_EVENT_WAL_INIT_SYNC:
-            event_name = "WALInitSync";
-            break;
-        case WAIT_EVENT_WAL_INIT_WRITE:
-            event_name = "WALInitWrite";
-            break;
-        case WAIT_EVENT_WAL_READ:
-            event_name = "WALRead";
-            break;
-        case WAIT_EVENT_WAL_SYNC:
-            event_name = "WALSync";
-            break;
-        case WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN:
-            event_name = "WALSyncMethodAssign";
-            break;
-        case WAIT_EVENT_WAL_WRITE:
-            event_name = "WALWrite";
-            break;
-
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-
-/* ----------
- * pgstat_get_backend_current_activity() -
- *
- *    Return a string representing the current activity of the backend with
- *    the specified PID.  This looks directly at the BackendStatusArray,
- *    and so will provide current information regardless of the age of our
- *    transaction's snapshot of the status array.
- *
- *    It is the caller's responsibility to invoke this only for backends whose
- *    state is expected to remain stable while the result is in use.  The
- *    only current use is in deadlock reporting, where we can expect that
- *    the target backend is blocked on a lock.  (There are corner cases
- *    where the target's wait could get aborted while we are looking at it,
- *    but the very worst consequence is to return a pointer to a string
- *    that's been changed, so we won't worry too much.)
- *
- *    Note: return strings for special cases match pg_stat_get_backend_activity.
- * ----------
- */
-const char *
-pgstat_get_backend_current_activity(int pid, bool checkUser)
-{
-    PgBackendStatus *beentry;
-    int            i;
-
-    beentry = BackendStatusArray;
-    for (i = 1; i <= MaxBackends; i++)
-    {
-        /*
-         * Although we expect the target backend's entry to be stable, that
-         * doesn't imply that anyone else's is.  To avoid identifying the
-         * wrong backend, while we check for a match to the desired PID we
-         * must follow the protocol of retrying if st_changecount changes
-         * while we examine the entry, or if it's odd.  (This might be
-         * unnecessary, since fetching or storing an int is almost certainly
-         * atomic, but let's play it safe.)  We use a volatile pointer here to
-         * ensure the compiler doesn't try to get cute.
-         */
-        volatile PgBackendStatus *vbeentry = beentry;
-        bool        found;
-
-        for (;;)
-        {
-            int            before_changecount;
-            int            after_changecount;
-
-            pgstat_save_changecount_before(vbeentry, before_changecount);
-
-            found = (vbeentry->st_procpid == pid);
-
-            pgstat_save_changecount_after(vbeentry, after_changecount);
-
-            if (before_changecount == after_changecount &&
-                (before_changecount & 1) == 0)
-                break;
-
-            /* Make sure we can break out of loop if stuck... */
-            CHECK_FOR_INTERRUPTS();
-        }
-
-        if (found)
-        {
-            /* Now it is safe to use the non-volatile pointer */
-            if (checkUser && !superuser() && beentry->st_userid != GetUserId())
-                return "<insufficient privilege>";
-            else if (*(beentry->st_activity_raw) == '\0')
-                return "<command string not enabled>";
-            else
-            {
-                /* this'll leak a bit of memory, but that seems acceptable */
-                return pgstat_clip_activity(beentry->st_activity_raw);
-            }
-        }
-
-        beentry++;
-    }
-
-    /* If we get here, caller is in error ... */
-    return "<backend information not available>";
-}
-
-/* ----------
- * pgstat_get_crashed_backend_activity() -
- *
- *    Return a string representing the current activity of the backend with
- *    the specified PID.  Like the function above, but reads shared memory with
- *    the expectation that it may be corrupt.  On success, copy the string
- *    into the "buffer" argument and return that pointer.  On failure,
- *    return NULL.
- *
- *    This function is only intended to be used by the postmaster to report the
- *    query that crashed a backend.  In particular, no attempt is made to
- *    follow the correct concurrency protocol when accessing the
- *    BackendStatusArray.  But that's OK, in the worst case we'll return a
- *    corrupted message.  We also must take care not to trip on ereport(ERROR).
- * ----------
- */
-const char *
-pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
-{
-    volatile PgBackendStatus *beentry;
-    int            i;
-
-    beentry = BackendStatusArray;
-
-    /*
-     * We probably shouldn't get here before shared memory has been set up,
-     * but be safe.
-     */
-    if (beentry == NULL || BackendActivityBuffer == NULL)
-        return NULL;
-
-    for (i = 1; i <= MaxBackends; i++)
-    {
-        if (beentry->st_procpid == pid)
-        {
-            /* Read pointer just once, so it can't change after validation */
-            const char *activity = beentry->st_activity_raw;
-            const char *activity_last;
-
-            /*
-             * We mustn't access activity string before we verify that it
-             * falls within the BackendActivityBuffer. To make sure that the
-             * entire string including its ending is contained within the
-             * buffer, subtract one activity length from the buffer size.
-             */
-            activity_last = BackendActivityBuffer + BackendActivityBufferSize
-                - pgstat_track_activity_query_size;
-
-            if (activity < BackendActivityBuffer ||
-                activity > activity_last)
-                return NULL;
-
-            /* If no string available, no point in a report */
-            if (activity[0] == '\0')
-                return NULL;
-
-            /*
-             * Copy only ASCII-safe characters so we don't run into encoding
-             * problems when reporting the message; and be sure not to run off
-             * the end of memory.  As only ASCII characters are reported, it
-             * doesn't seem necessary to perform multibyte aware clipping.
-             */
-            ascii_safe_strlcpy(buffer, activity,
-                               Min(buflen, pgstat_track_activity_query_size));
-
-            return buffer;
-        }
-
-        beentry++;
-    }
-
-    /* PID not found */
-    return NULL;
-}
-
-const char *
-pgstat_get_backend_desc(BackendType backendType)
-{
-    const char *backendDesc = "unknown process type";
-
-    switch (backendType)
-    {
-        case B_AUTOVAC_LAUNCHER:
-            backendDesc = "autovacuum launcher";
-            break;
-        case B_AUTOVAC_WORKER:
-            backendDesc = "autovacuum worker";
-            break;
-        case B_BACKEND:
-            backendDesc = "client backend";
-            break;
-        case B_BG_WORKER:
-            backendDesc = "background worker";
-            break;
-        case B_BG_WRITER:
-            backendDesc = "background writer";
-            break;
-        case B_ARCHIVER:
-            backendDesc = "archiver";
-            break;
-        case B_CHECKPOINTER:
-            backendDesc = "checkpointer";
-            break;
-        case B_STARTUP:
-            backendDesc = "startup";
-            break;
-        case B_WAL_RECEIVER:
-            backendDesc = "walreceiver";
-            break;
-        case B_WAL_SENDER:
-            backendDesc = "walsender";
-            break;
-        case B_WAL_WRITER:
-            backendDesc = "walwriter";
-            break;
-    }
-
-    return backendDesc;
-}
-
 /* ------------------------------------------------------------
  * Local support functions follow
  * ------------------------------------------------------------
@@ -5412,22 +3790,6 @@ backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot)
                               funcid);
 }
 
-/* ----------
- * pgstat_setup_memcxt() -
- *
- *    Create pgStatLocalContext, if not already done.
- * ----------
- */
-static void
-pgstat_setup_memcxt(void)
-{
-    if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
-}
-
-
 /* ----------
  * pgstat_clear_snapshot() -
  *
@@ -5443,6 +3805,8 @@ pgstat_clear_snapshot(void)
 {
     int param = 0;    /* only the address is significant */
 
+    bestatus_clear_snapshot();
+
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
         MemoryContextDelete(pgStatLocalContext);
@@ -5450,8 +3814,6 @@ pgstat_clear_snapshot(void)
     /* Reset variables */
     pgStatLocalContext = NULL;
     pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
 
     /*
      * the parameter inform the function that it is not called from
@@ -5557,47 +3919,18 @@ pgstat_update_dbentry(PgStat_StatDBEntry *dbentry, PgStat_TableStatus *stat)
     dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
 }
 
-
-/*
- * Convert a potentially unsafely truncated activity string (see
- * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
- * one.
+/* ----------
+ * pgstat_setup_memcxt() -
  *
- * The returned string is allocated in the caller's memory context and may be
- * freed.
+ *    Create pgStatLocalContext, if not already done.
+ * ----------
  */
-char *
-pgstat_clip_activity(const char *raw_activity)
+static void
+pgstat_setup_memcxt(void)
 {
-    char       *activity;
-    int            rawlen;
-    int            cliplen;
-
-    /*
-     * Some callers, like pgstat_get_backend_current_activity(), do not
-     * guarantee that the buffer isn't concurrently modified. We try to take
-     * care that the buffer is always terminated by a NUL byte regardless, but
-     * let's still be paranoid about the string's length. In those cases the
-     * underlying buffer is guaranteed to be pgstat_track_activity_query_size
-     * large.
-     */
-    activity = pnstrdup(raw_activity, pgstat_track_activity_query_size - 1);
-
-    /* now double-guaranteed to be NUL terminated */
-    rawlen = strlen(activity);
-
-    /*
-     * All supported server-encodings make it possible to determine the length
-     * of a multi-byte character from its first byte (this is not the case for
-     * client encodings, see GB18030). As st_activity is always stored using
-     * server encoding, this allows us to perform multi-byte aware truncation,
-     * even if the string earlier was truncated in the middle of a multi-byte
-     * character.
-     */
-    cliplen = pg_mbcliplen(activity, rawlen,
-                           pgstat_track_activity_query_size - 1);
-
-    activity[cliplen] = '\0';
-
-    return activity;
+    if (!pgStatLocalContext)
+        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
+                                                   "Statistics snapshot",
+                                                   ALLOCSET_SMALL_SIZES);
 }
+
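Reviewer note on the code moved out of pgstat.c above: pgstat_get_crashed_backend_activity() validates an untrusted pointer against the activity buffer before dereferencing it, then copies only ASCII-safe bytes. A minimal standalone sketch of that idea follows. All names here (demo_activity_buffer, DEMO_SLOT_SIZE, demo_ascii_safe_copy, demo_read_untrusted_activity) are hypothetical stand-ins, not the patch's API, and the copy helper is a simplified approximation of ascii_safe_strlcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins: DEMO_SLOT_SIZE plays the role of
 * pgstat_track_activity_query_size, demo_activity_buffer the role of
 * BackendActivityBuffer. */
#define DEMO_SLOT_SIZE 16
#define DEMO_NUM_SLOTS 4

static char demo_activity_buffer[DEMO_SLOT_SIZE * DEMO_NUM_SLOTS];

/* Copy at most dstlen-1 bytes; anything outside printable ASCII becomes '?'.
 * Simplified compared with the real ascii_safe_strlcpy(). */
static void
demo_ascii_safe_copy(char *dst, const char *src, size_t dstlen)
{
    size_t      i;

    assert(dstlen > 0);
    for (i = 0; i + 1 < dstlen && src[i] != '\0'; i++)
    {
        unsigned char ch = (unsigned char) src[i];

        dst[i] = (ch >= 32 && ch < 127) ? (char) ch : '?';
    }
    dst[i] = '\0';
}

/* Return "out" on success, NULL if the pointer is out of range or the
 * string is empty.  The pointer is checked before it is ever dereferenced. */
static const char *
demo_read_untrusted_activity(const char *activity, char *out, size_t outlen)
{
    uintptr_t   p = (uintptr_t) activity;
    uintptr_t   first = (uintptr_t) demo_activity_buffer;
    /* Last address at which a full slot still fits inside the buffer. */
    uintptr_t   last = first + sizeof(demo_activity_buffer) - DEMO_SLOT_SIZE;

    if (p < first || p > last)
        return NULL;
    if (activity[0] == '\0')
        return NULL;

    demo_ascii_safe_copy(out, activity,
                         outlen < DEMO_SLOT_SIZE ? outlen : DEMO_SLOT_SIZE);
    return out;
}
```

Subtracting one slot length from the buffer end is what guarantees that the whole string, including its terminator, lies inside the buffer even when the pointer itself is the only thing that has been validated.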
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e794a81c4c..d92c7c935d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -34,6 +34,7 @@
 #include <unistd.h>
 
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/storage.h"
 #include "executor/instrument.h"
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index e93813d973..24c15be240 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -41,9 +41,9 @@
 
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "executor/instrument.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/buffile.h"
 #include "storage/buf_internals.h"
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index 4a0d23b11e..bee2923068 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -22,10 +22,10 @@
 #include <unistd.h>
 #include <sys/stat.h>
 
+#include "bestatus.h"
 #include "storage/copydir.h"
 #include "storage/fd.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 
 /*
  * copydir: copy a directory
@@ -186,9 +186,9 @@ copy_file(char *fromfile, char *tofile)
             flush_offset = offset;
         }
 
-        pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_READ);
+        bestatus_report_wait_start(WAIT_EVENT_COPY_FILE_READ);
         nbytes = read(srcfd, buffer, COPY_BUF_SIZE);
-        pgstat_report_wait_end();
+        bestatus_report_wait_end();
         if (nbytes < 0)
             ereport(ERROR,
                     (errcode_for_file_access(),
@@ -196,10 +196,10 @@ copy_file(char *fromfile, char *tofile)
         if (nbytes == 0)
             break;
         errno = 0;
-        pgstat_report_wait_start(WAIT_EVENT_COPY_FILE_WRITE);
+        bestatus_report_wait_start(WAIT_EVENT_COPY_FILE_WRITE);
         if ((int) write(dstfd, buffer, nbytes) != nbytes)
         {
-            pgstat_report_wait_end();
+            bestatus_report_wait_end();
             /* if write didn't set errno, assume problem is no disk space */
             if (errno == 0)
                 errno = ENOSPC;
@@ -207,7 +207,7 @@ copy_file(char *fromfile, char *tofile)
                     (errcode_for_file_access(),
                      errmsg("could not write to file \"%s\": %m", tofile)));
         }
-        pgstat_report_wait_end();
+        bestatus_report_wait_end();
     }
 
     if (offset > flush_offset)
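Reviewer note: the renames in copy_file() keep the existing pattern of bracketing each blocking system call with a wait-start/wait-end pair, and of ending the wait on the error path as well as the success path. The sketch below illustrates that pattern in isolation. The names demo_report_wait_start, demo_report_wait_end, demo_current_wait_event and DEMO_WAIT_EVENT_WRITE are made up for illustration; they are not the bestatus_* functions from the patch, which update per-backend shared state rather than a local variable.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical stand-in for the per-backend wait_event_info field. */
static uint32_t demo_current_wait_event = 0;

/* Made-up event code, used only in this sketch. */
#define DEMO_WAIT_EVENT_WRITE 0x0A000001U

static void
demo_report_wait_start(uint32_t wait_event_info)
{
    demo_current_wait_event = wait_event_info;
}

static void
demo_report_wait_end(void)
{
    demo_current_wait_event = 0;
}

/* Write the whole buffer or fail.  Returns 0 on success, -1 with errno set
 * on failure.  The wait event is cleared on every path. */
static int
demo_write_with_wait_event(int fd, const void *buf, size_t len)
{
    ssize_t     written;

    errno = 0;
    demo_report_wait_start(DEMO_WAIT_EVENT_WRITE);
    written = write(fd, buf, len);
    demo_report_wait_end();     /* end the wait before any error handling */

    if (written < 0 || (size_t) written != len)
    {
        /* if write didn't set errno, assume the problem is no disk space */
        if (errno == 0)
            errno = ENOSPC;
        return -1;
    }
    return 0;
}
```

Ending the wait immediately after the call, rather than separately in each branch, is one way to make sure an error exit cannot leave a stale wait event visible to monitoring views.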
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 8dd51f1767..c0e973c953 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -82,6 +82,7 @@
 #include "miscadmin.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/pg_tablespace.h"
 #include "common/file_perm.h"
 #include "pgstat.h"
@@ -1844,10 +1845,10 @@ FilePrefetch(File file, off_t offset, int amount, uint32 wait_event_info)
     if (returnCode < 0)
         return returnCode;
 
-    pgstat_report_wait_start(wait_event_info);
+    bestatus_report_wait_start(wait_event_info);
     returnCode = posix_fadvise(VfdCache[file].fd, offset, amount,
                                POSIX_FADV_WILLNEED);
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
     return returnCode;
 #else
@@ -1878,9 +1879,9 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
     if (returnCode < 0)
         return;
 
-    pgstat_report_wait_start(wait_event_info);
+    bestatus_report_wait_start(wait_event_info);
     pg_flush_data(VfdCache[file].fd, offset, nbytes);
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 }
 
 int
@@ -1903,9 +1904,9 @@ FileRead(File file, char *buffer, int amount, uint32 wait_event_info)
     vfdP = &VfdCache[file];
 
 retry:
-    pgstat_report_wait_start(wait_event_info);
+    bestatus_report_wait_start(wait_event_info);
     returnCode = read(vfdP->fd, buffer, amount);
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
     if (returnCode >= 0)
     {
@@ -2006,9 +2007,9 @@ FileWrite(File file, char *buffer, int amount, uint32 wait_event_info)
 
 retry:
     errno = 0;
-    pgstat_report_wait_start(wait_event_info);
+    bestatus_report_wait_start(wait_event_info);
     returnCode = write(vfdP->fd, buffer, amount);
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
     /* if write didn't set errno, assume problem is no disk space */
     if (returnCode != amount && errno == 0)
@@ -2082,9 +2083,9 @@ FileSync(File file, uint32 wait_event_info)
     if (returnCode < 0)
         return returnCode;
 
-    pgstat_report_wait_start(wait_event_info);
+    bestatus_report_wait_start(wait_event_info);
     returnCode = pg_fsync(VfdCache[file].fd);
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
     return returnCode;
 }
@@ -2191,9 +2192,9 @@ FileTruncate(File file, off_t offset, uint32 wait_event_info)
     if (returnCode < 0)
         return returnCode;
 
-    pgstat_report_wait_start(wait_event_info);
+    bestatus_report_wait_start(wait_event_info);
     returnCode = ftruncate(VfdCache[file].fd, offset);
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
     if (returnCode == 0 && VfdCache[file].fileSize > offset)
     {
diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
index 70f899e765..6616751378 100644
--- a/src/backend/storage/ipc/dsm_impl.c
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -61,8 +61,8 @@
 #ifdef HAVE_SYS_SHM_H
 #include <sys/shm.h>
 #endif
+#include "bestatus.h"
 #include "common/file_perm.h"
-#include "pgstat.h"
 
 #include "portability/mem.h"
 #include "storage/dsm_impl.h"
@@ -970,12 +970,12 @@ dsm_impl_mmap(dsm_op op, dsm_handle handle, Size request_size,
 
             if (goal > ZBUFFER_SIZE)
                 goal = ZBUFFER_SIZE;
-            pgstat_report_wait_start(WAIT_EVENT_DSM_FILL_ZERO_WRITE);
+            bestatus_report_wait_start(WAIT_EVENT_DSM_FILL_ZERO_WRITE);
             if (write(fd, zbuffer, goal) == goal)
                 remaining -= goal;
             else
                 success = false;
-            pgstat_report_wait_end();
+            bestatus_report_wait_end();
         }
 
         if (!success)
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index f6dda9cc9a..3b646a4cbb 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -43,8 +43,8 @@
 #include <poll.h>
 #endif
 
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "port/atomics.h"
 #include "portability/instr_time.h"
 #include "postmaster/postmaster.h"
@@ -940,7 +940,7 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
         cur_timeout = timeout;
     }
 
-    pgstat_report_wait_start(wait_event_info);
+    bestatus_report_wait_start(wait_event_info);
 
 #ifndef WIN32
     waiting = true;
@@ -1019,7 +1019,7 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
     waiting = false;
 #endif
 
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
     return returned_events;
 }
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 908f62d37e..8c86239463 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -51,9 +51,9 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/spin.h"
@@ -515,7 +515,7 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
         int            extraWaits = 0;
 
         /* Sleep until the leader clears our XID. */
-        pgstat_report_wait_start(WAIT_EVENT_PROCARRAY_GROUP_UPDATE);
+        bestatus_report_wait_start(WAIT_EVENT_PROCARRAY_GROUP_UPDATE);
         for (;;)
         {
             /* acts as a read barrier */
@@ -524,7 +524,7 @@ ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
                 break;
             extraWaits++;
         }
-        pgstat_report_wait_end();
+        bestatus_report_wait_end();
 
         Assert(pg_atomic_read_u32(&proc->procArrayGroupNext) == INVALID_PGPROCNO);
 
diff --git a/src/backend/storage/ipc/shm_mq.c b/src/backend/storage/ipc/shm_mq.c
index fde71afd47..a0a5582aac 100644
--- a/src/backend/storage/ipc/shm_mq.c
+++ b/src/backend/storage/ipc/shm_mq.c
@@ -18,8 +18,8 @@
 
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/bgworker.h"
 #include "storage/procsignal.h"
 #include "storage/shm_mq.h"
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index c9bb3e987d..0a9181cd9d 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -21,8 +21,8 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index aeaf1f3ab4..a473dee3a8 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -25,6 +25,7 @@
  */
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "pgstat.h"
@@ -1127,7 +1128,7 @@ DeadLockReport(void)
         appendStringInfo(&logbuf,
                          _("Process %d: %s"),
                          info->pid,
-                         pgstat_get_backend_current_activity(info->pid, false));
+                         bestatus_get_backend_current_activity(info->pid, false));
     }
 
     pgstat_report_deadlock();
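Reviewer note: the function called from DeadLockReport() relies on the st_changecount retry protocol described in the comments of the code being moved. A reader accepts a value only if the counter was the same before and after the read and was even. The sketch below shows just that acceptance test, single-threaded. DemoStatus and the demo_* functions are hypothetical names, and the memory barriers that concurrent code needs are deliberately omitted here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, cut-down stand-in for PgBackendStatus. */
typedef struct DemoStatus
{
    int         changecount;    /* odd while a writer is mid-update */
    int         procpid;
} DemoStatus;

/* Writer side: bump the counter before and after modifying the entry.
 * Real concurrent code would also need write barriers here. */
static void
demo_begin_write(volatile DemoStatus *entry)
{
    entry->changecount++;
}

static void
demo_end_write(volatile DemoStatus *entry)
{
    entry->changecount++;
}

/* Reader side: one attempt.  Returns true and stores the pid only if no
 * write was in progress or completed while we looked.  A caller would
 * retry in a loop on false. */
static bool
demo_try_read_pid(volatile DemoStatus *entry, int *pid_out)
{
    int         before = entry->changecount;
    int         pid = entry->procpid;
    int         after = entry->changecount;

    if (before != after || (before & 1) != 0)
        return false;

    *pid_out = pid;
    return true;
}
```

In the moved code the retry loop also calls CHECK_FOR_INTERRUPTS() so that a reader cannot spin forever on an entry whose counter is stuck at an odd value.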
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index c46bb8d057..01597d1f93 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -76,8 +76,8 @@
  */
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "pg_trace.h"
 #include "postmaster/postmaster.h"
 #include "replication/slot.h"
@@ -697,7 +697,7 @@ LWLockInitialize(LWLock *lock, int tranche_id)
 static inline void
 LWLockReportWaitStart(LWLock *lock)
 {
-    pgstat_report_wait_start(PG_WAIT_LWLOCK | lock->tranche);
+    bestatus_report_wait_start(PG_WAIT_LWLOCK | lock->tranche);
 }
 
 /*
@@ -706,7 +706,7 @@ LWLockReportWaitStart(LWLock *lock)
 static inline void
 LWLockReportWaitEnd(void)
 {
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 }
 
 /*
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index e8390311d0..ac352885f3 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -193,8 +193,8 @@
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
 #include "storage/predicate_internals.h"
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 6f9aaa52fa..0ecaa24b1a 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -38,8 +38,8 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "replication/slot.h"
 #include "replication/syncrep.h"
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index f4374d077b..625e50e129 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -28,7 +28,7 @@
 #include "miscadmin.h"
 #include "access/xlogutils.h"
 #include "access/xlog.h"
-#include "pgstat.h"
+#include "bestatus.h"
 #include "portability/instr_time.h"
 #include "postmaster/bgwriter.h"
 #include "storage/fd.h"
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index ee4e43331b..16148a3298 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -39,6 +39,7 @@
 #include "access/parallel.h"
 #include "access/printtup.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
@@ -997,7 +998,7 @@ exec_simple_query(const char *query_string)
      */
     debug_query_string = query_string;
 
-    pgstat_report_activity(STATE_RUNNING, query_string);
+    bestatus_report_activity(STATE_RUNNING, query_string);
 
     TRACE_POSTGRESQL_QUERY_START(query_string);
 
@@ -1332,7 +1333,7 @@ exec_parse_message(const char *query_string,    /* string to execute */
      */
     debug_query_string = query_string;
 
-    pgstat_report_activity(STATE_RUNNING, query_string);
+    bestatus_report_activity(STATE_RUNNING, query_string);
 
     set_ps_display("PARSE", false);
 
@@ -1625,7 +1626,7 @@ exec_bind_message(StringInfo input_message)
      */
     debug_query_string = psrc->query_string;
 
-    pgstat_report_activity(STATE_RUNNING, psrc->query_string);
+    bestatus_report_activity(STATE_RUNNING, psrc->query_string);
 
     set_ps_display("BIND", false);
 
@@ -2027,7 +2028,7 @@ exec_execute_message(const char *portal_name, long max_rows)
      */
     debug_query_string = sourceText;
 
-    pgstat_report_activity(STATE_RUNNING, sourceText);
+    bestatus_report_activity(STATE_RUNNING, sourceText);
 
     set_ps_display(portal->commandTag, false);
 
@@ -4142,7 +4143,7 @@ PostgresMain(int argc, char *argv[],
             if (IsAbortedTransactionBlockState())
             {
                 set_ps_display("idle in transaction (aborted)", false);
-                pgstat_report_activity(STATE_IDLEINTRANSACTION_ABORTED, NULL);
+                bestatus_report_activity(STATE_IDLEINTRANSACTION_ABORTED, NULL);
 
                 /* Start the idle-in-transaction timer */
                 if (IdleInTransactionSessionTimeout > 0)
@@ -4155,7 +4156,7 @@ PostgresMain(int argc, char *argv[],
             else if (IsTransactionOrTransactionBlock())
             {
                 set_ps_display("idle in transaction", false);
-                pgstat_report_activity(STATE_IDLEINTRANSACTION, NULL);
+                bestatus_report_activity(STATE_IDLEINTRANSACTION, NULL);
 
                 /* Start the idle-in-transaction timer */
                 if (IdleInTransactionSessionTimeout > 0)
@@ -4179,7 +4180,7 @@ PostgresMain(int argc, char *argv[],
                                          stats_timeout);
                 }
                 set_ps_display("idle", false);
-                pgstat_report_activity(STATE_IDLE, NULL);
+                bestatus_report_activity(STATE_IDLE, NULL);
             }
 
             ReadyForQuery(whereToSendOutput);
@@ -4333,7 +4334,7 @@ PostgresMain(int argc, char *argv[],
                 SetCurrentStatementStartTimestamp();
 
                 /* Report query to various monitoring facilities. */
-                pgstat_report_activity(STATE_FASTPATH, NULL);
+                bestatus_report_activity(STATE_FASTPATH, NULL);
                 set_ps_display("<FASTPATH>", false);
 
                 /* start an xact for this function invocation */
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index 309eb2935c..b229f42622 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -20,6 +20,7 @@
 #include <unistd.h>
 
 #include "access/sysattr.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/pg_tablespace.h"
 #include "catalog/pg_type.h"
@@ -28,7 +29,6 @@
 #include "common/keywords.h"
 #include "funcapi.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "parser/scansup.h"
 #include "postmaster/syslogger.h"
 #include "rewrite/rewriteHandler.h"
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index eca801eeed..1673aeba93 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
@@ -411,7 +412,7 @@ pg_stat_get_backend_idset(PG_FUNCTION_ARGS)
         funcctx->user_fctx = fctx;
 
         fctx[0] = 0;
-        fctx[1] = pgstat_fetch_stat_numbackends();
+        fctx[1] = bestatus_fetch_stat_numbackends();
     }
 
     /* stuff done on every call of the function */
@@ -439,8 +440,8 @@ pg_stat_get_backend_idset(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_progress_info(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_PROGRESS_COLS    PGSTAT_NUM_PROGRESS_PARAM + 3
-    int            num_backends = pgstat_fetch_stat_numbackends();
+#define PG_STAT_GET_PROGRESS_COLS    BESTATUS_NUM_PROGRESS_PARAM + 3
+    int            num_backends = bestatus_fetch_stat_numbackends();
     int            curr_backend;
     char       *cmd = text_to_cstring(PG_GETARG_TEXT_PP(0));
     ProgressCommandType cmdtype;
@@ -494,7 +495,7 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
         MemSet(values, 0, sizeof(values));
         MemSet(nulls, 0, sizeof(nulls));
 
-        local_beentry = pgstat_fetch_stat_local_beentry(curr_backend);
+        local_beentry = bestatus_fetch_stat_local_beentry(curr_backend);
 
         if (!local_beentry)
             continue;
@@ -516,13 +517,13 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
         if (has_privs_of_role(GetUserId(), beentry->st_userid))
         {
             values[2] = ObjectIdGetDatum(beentry->st_progress_command_target);
-            for (i = 0; i < PGSTAT_NUM_PROGRESS_PARAM; i++)
+            for (i = 0; i < BESTATUS_NUM_PROGRESS_PARAM; i++)
                 values[i + 3] = Int64GetDatum(beentry->st_progress_param[i]);
         }
         else
         {
             nulls[2] = true;
-            for (i = 0; i < PGSTAT_NUM_PROGRESS_PARAM; i++)
+            for (i = 0; i < BESTATUS_NUM_PROGRESS_PARAM; i++)
                 nulls[i + 3] = true;
         }
 
@@ -542,7 +543,7 @@ Datum
 pg_stat_get_activity(PG_FUNCTION_ARGS)
 {
 #define PG_STAT_GET_ACTIVITY_COLS    24
-    int            num_backends = pgstat_fetch_stat_numbackends();
+    int            num_backends = bestatus_fetch_stat_numbackends();
     int            curr_backend;
     int            pid = PG_ARGISNULL(0) ? -1 : PG_GETARG_INT32(0);
     ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
@@ -592,7 +593,7 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
         MemSet(nulls, 0, sizeof(nulls));
 
         /* Get the next one in the list */
-        local_beentry = pgstat_fetch_stat_local_beentry(curr_backend);
+        local_beentry = bestatus_fetch_stat_local_beentry(curr_backend);
         if (!local_beentry)
         {
             int            i;
@@ -692,7 +693,7 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
                     break;
             }
 
-            clipped_activity = pgstat_clip_activity(beentry->st_activity_raw);
+            clipped_activity = bestatus_clip_activity(beentry->st_activity_raw);
             values[5] = CStringGetTextDatum(clipped_activity);
             pfree(clipped_activity);
 
@@ -702,8 +703,8 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
                 uint32        raw_wait_event;
 
                 raw_wait_event = UINT32_ACCESS_ONCE(proc->wait_event_info);
-                wait_event_type = pgstat_get_wait_event_type(raw_wait_event);
-                wait_event = pgstat_get_wait_event(raw_wait_event);
+                wait_event_type = bestatus_get_wait_event_type(raw_wait_event);
+                wait_event = bestatus_get_wait_event(raw_wait_event);
 
             }
             else if (beentry->st_backendType != B_BACKEND)
@@ -721,8 +722,8 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
                     raw_wait_event =
                         UINT32_ACCESS_ONCE(proc->wait_event_info);
                     wait_event_type =
-                        pgstat_get_wait_event_type(raw_wait_event);
-                    wait_event = pgstat_get_wait_event(raw_wait_event);
+                        bestatus_get_wait_event_type(raw_wait_event);
+                    wait_event = bestatus_get_wait_event(raw_wait_event);
                 }
             }
 
@@ -836,7 +837,7 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
             }
             else
                 values[17] =
-                    CStringGetTextDatum(pgstat_get_backend_desc(beentry->st_backendType));
+                    CStringGetTextDatum(bestatus_get_backend_desc(beentry->st_backendType));
         }
         else
         {
@@ -882,7 +883,7 @@ pg_stat_get_backend_pid(PG_FUNCTION_ARGS)
     int32        beid = PG_GETARG_INT32(0);
     PgBackendStatus *beentry;
 
-    if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL)
+    if ((beentry = bestatus_fetch_stat_beentry(beid)) == NULL)
         PG_RETURN_NULL();
 
     PG_RETURN_INT32(beentry->st_procpid);
@@ -895,7 +896,7 @@ pg_stat_get_backend_dbid(PG_FUNCTION_ARGS)
     int32        beid = PG_GETARG_INT32(0);
     PgBackendStatus *beentry;
 
-    if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL)
+    if ((beentry = bestatus_fetch_stat_beentry(beid)) == NULL)
         PG_RETURN_NULL();
 
     PG_RETURN_OID(beentry->st_databaseid);
@@ -908,7 +909,7 @@ pg_stat_get_backend_userid(PG_FUNCTION_ARGS)
     int32        beid = PG_GETARG_INT32(0);
     PgBackendStatus *beentry;
 
-    if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL)
+    if ((beentry = bestatus_fetch_stat_beentry(beid)) == NULL)
         PG_RETURN_NULL();
 
     PG_RETURN_OID(beentry->st_userid);
@@ -924,7 +925,7 @@ pg_stat_get_backend_activity(PG_FUNCTION_ARGS)
     char       *clipped_activity;
     text       *ret;
 
-    if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL)
+    if ((beentry = bestatus_fetch_stat_beentry(beid)) == NULL)
         activity = "<backend information not available>";
     else if (!has_privs_of_role(GetUserId(), beentry->st_userid))
         activity = "<insufficient privilege>";
@@ -933,7 +934,7 @@ pg_stat_get_backend_activity(PG_FUNCTION_ARGS)
     else
         activity = beentry->st_activity_raw;
 
-    clipped_activity = pgstat_clip_activity(activity);
+    clipped_activity = bestatus_clip_activity(activity);
     ret = cstring_to_text(activity);
     pfree(clipped_activity);
 
@@ -948,12 +949,12 @@ pg_stat_get_backend_wait_event_type(PG_FUNCTION_ARGS)
     PGPROC       *proc;
     const char *wait_event_type = NULL;
 
-    if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL)
+    if ((beentry = bestatus_fetch_stat_beentry(beid)) == NULL)
         wait_event_type = "<backend information not available>";
     else if (!has_privs_of_role(GetUserId(), beentry->st_userid))
         wait_event_type = "<insufficient privilege>";
     else if ((proc = BackendPidGetProc(beentry->st_procpid)) != NULL)
-        wait_event_type = pgstat_get_wait_event_type(proc->wait_event_info);
+        wait_event_type = bestatus_get_wait_event_type(proc->wait_event_info);
 
     if (!wait_event_type)
         PG_RETURN_NULL();
@@ -969,12 +970,12 @@ pg_stat_get_backend_wait_event(PG_FUNCTION_ARGS)
     PGPROC       *proc;
     const char *wait_event = NULL;
 
-    if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL)
+    if ((beentry = bestatus_fetch_stat_beentry(beid)) == NULL)
         wait_event = "<backend information not available>";
     else if (!has_privs_of_role(GetUserId(), beentry->st_userid))
         wait_event = "<insufficient privilege>";
     else if ((proc = BackendPidGetProc(beentry->st_procpid)) != NULL)
-        wait_event = pgstat_get_wait_event(proc->wait_event_info);
+        wait_event = bestatus_get_wait_event(proc->wait_event_info);
 
     if (!wait_event)
         PG_RETURN_NULL();
@@ -990,7 +991,7 @@ pg_stat_get_backend_activity_start(PG_FUNCTION_ARGS)
     TimestampTz result;
     PgBackendStatus *beentry;
 
-    if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL)
+    if ((beentry = bestatus_fetch_stat_beentry(beid)) == NULL)
         PG_RETURN_NULL();
 
     if (!has_privs_of_role(GetUserId(), beentry->st_userid))
@@ -1016,7 +1017,7 @@ pg_stat_get_backend_xact_start(PG_FUNCTION_ARGS)
     TimestampTz result;
     PgBackendStatus *beentry;
 
-    if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL)
+    if ((beentry = bestatus_fetch_stat_beentry(beid)) == NULL)
         PG_RETURN_NULL();
 
     if (!has_privs_of_role(GetUserId(), beentry->st_userid))
@@ -1038,7 +1039,7 @@ pg_stat_get_backend_start(PG_FUNCTION_ARGS)
     TimestampTz result;
     PgBackendStatus *beentry;
 
-    if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL)
+    if ((beentry = bestatus_fetch_stat_beentry(beid)) == NULL)
         PG_RETURN_NULL();
 
     if (!has_privs_of_role(GetUserId(), beentry->st_userid))
@@ -1062,7 +1063,7 @@ pg_stat_get_backend_client_addr(PG_FUNCTION_ARGS)
     char        remote_host[NI_MAXHOST];
     int            ret;
 
-    if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL)
+    if ((beentry = bestatus_fetch_stat_beentry(beid)) == NULL)
         PG_RETURN_NULL();
 
     if (!has_privs_of_role(GetUserId(), beentry->st_userid))
@@ -1109,7 +1110,7 @@ pg_stat_get_backend_client_port(PG_FUNCTION_ARGS)
     char        remote_port[NI_MAXSERV];
     int            ret;
 
-    if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL)
+    if ((beentry = bestatus_fetch_stat_beentry(beid)) == NULL)
         PG_RETURN_NULL();
 
     if (!has_privs_of_role(GetUserId(), beentry->st_userid))
@@ -1153,13 +1154,13 @@ pg_stat_get_db_numbackends(PG_FUNCTION_ARGS)
 {
     Oid            dbid = PG_GETARG_OID(0);
     int32        result;
-    int            tot_backends = pgstat_fetch_stat_numbackends();
+    int            tot_backends = bestatus_fetch_stat_numbackends();
     int            beid;
 
     result = 0;
     for (beid = 1; beid <= tot_backends; beid++)
     {
-        PgBackendStatus *beentry = pgstat_fetch_stat_beentry(beid);
+        PgBackendStatus *beentry = bestatus_fetch_stat_beentry(beid);
 
         if (beentry && beentry->st_databaseid == dbid)
             result++;
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 905867dc76..3303d9ce35 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -46,11 +46,11 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/pg_tablespace.h"
 #include "catalog/storage.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/lwlock.h"
 #include "utils/inval.h"
@@ -731,7 +731,7 @@ load_relmap_file(bool shared)
      * look, the sinval signaling mechanism will make us re-read it before we
      * are able to access any relation that's affected by the change.
      */
-    pgstat_report_wait_start(WAIT_EVENT_RELATION_MAP_READ);
+    bestatus_report_wait_start(WAIT_EVENT_RELATION_MAP_READ);
     r = read(fd, map, sizeof(RelMapFile));
     if (r != sizeof(RelMapFile))
     {
@@ -745,7 +745,7 @@ load_relmap_file(bool shared)
                      errmsg("could not read file \"%s\": read %d of %zu",
                             mapfilename, r, sizeof(RelMapFile))));
     }
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
     CloseTransientFile(fd);
 
@@ -855,7 +855,7 @@ write_relmap_file(bool shared, RelMapFile *newmap,
     }
 
     errno = 0;
-    pgstat_report_wait_start(WAIT_EVENT_RELATION_MAP_WRITE);
+    bestatus_report_wait_start(WAIT_EVENT_RELATION_MAP_WRITE);
     if (write(fd, newmap, sizeof(RelMapFile)) != sizeof(RelMapFile))
     {
         /* if write didn't set errno, assume problem is no disk space */
@@ -866,7 +866,7 @@ write_relmap_file(bool shared, RelMapFile *newmap,
                  errmsg("could not write file \"%s\": %m",
                         mapfilename)));
     }
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
     /*
      * We choose to fsync the data to disk before considering the task done.
@@ -874,13 +874,13 @@ write_relmap_file(bool shared, RelMapFile *newmap,
      * issue, but it would complicate checkpointing --- see notes for
      * CheckPointRelationMap.
      */
-    pgstat_report_wait_start(WAIT_EVENT_RELATION_MAP_SYNC);
+    bestatus_report_wait_start(WAIT_EVENT_RELATION_MAP_SYNC);
     if (pg_fsync(fd) != 0)
         ereport(ERROR,
                 (errcode_for_file_access(),
                  errmsg("could not fsync file \"%s\": %m",
                         mapfilename)));
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
     if (CloseTransientFile(fd))
         ereport(ERROR,
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 238fe1deec..a00e5618f2 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -31,12 +31,12 @@
 #endif
 
 #include "access/htup_details.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "common/file_perm.h"
 #include "libpq/libpq.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/postmaster.h"
 #include "storage/fd.h"
@@ -961,13 +961,13 @@ CreateLockFile(const char *filename, bool amPostmaster,
                      errmsg("could not open lock file \"%s\": %m",
                             filename)));
         }
-        pgstat_report_wait_start(WAIT_EVENT_LOCK_FILE_CREATE_READ);
+        bestatus_report_wait_start(WAIT_EVENT_LOCK_FILE_CREATE_READ);
         if ((len = read(fd, buffer, sizeof(buffer) - 1)) < 0)
             ereport(FATAL,
                     (errcode_for_file_access(),
                      errmsg("could not read lock file \"%s\": %m",
                             filename)));
-        pgstat_report_wait_end();
+        bestatus_report_wait_end();
         close(fd);
 
         if (len == 0)
@@ -1113,7 +1113,7 @@ CreateLockFile(const char *filename, bool amPostmaster,
         strlcat(buffer, "\n", sizeof(buffer));
 
     errno = 0;
-    pgstat_report_wait_start(WAIT_EVENT_LOCK_FILE_CREATE_WRITE);
+    bestatus_report_wait_start(WAIT_EVENT_LOCK_FILE_CREATE_WRITE);
     if (write(fd, buffer, strlen(buffer)) != strlen(buffer))
     {
         int            save_errno = errno;
@@ -1126,9 +1126,9 @@ CreateLockFile(const char *filename, bool amPostmaster,
                 (errcode_for_file_access(),
                  errmsg("could not write lock file \"%s\": %m", filename)));
     }
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
 
-    pgstat_report_wait_start(WAIT_EVENT_LOCK_FILE_CREATE_SYNC);
+    bestatus_report_wait_start(WAIT_EVENT_LOCK_FILE_CREATE_SYNC);
     if (pg_fsync(fd) != 0)
     {
         int            save_errno = errno;
@@ -1140,7 +1140,7 @@ CreateLockFile(const char *filename, bool amPostmaster,
                 (errcode_for_file_access(),
                  errmsg("could not write lock file \"%s\": %m", filename)));
     }
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
     if (close(fd) != 0)
     {
         int            save_errno = errno;
@@ -1274,9 +1274,9 @@ AddToDataDirLockFile(int target_line, const char *str)
                         DIRECTORY_LOCK_FILE)));
         return;
     }
-    pgstat_report_wait_start(WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ);
+    bestatus_report_wait_start(WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ);
     len = read(fd, srcbuffer, sizeof(srcbuffer) - 1);
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
     if (len < 0)
     {
         ereport(LOG,
@@ -1336,11 +1336,11 @@ AddToDataDirLockFile(int target_line, const char *str)
      */
     len = strlen(destbuffer);
     errno = 0;
-    pgstat_report_wait_start(WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE);
+    bestatus_report_wait_start(WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE);
     if (lseek(fd, (off_t) 0, SEEK_SET) != 0 ||
         (int) write(fd, destbuffer, len) != len)
     {
-        pgstat_report_wait_end();
+        bestatus_report_wait_end();
         /* if write didn't set errno, assume problem is no disk space */
         if (errno == 0)
             errno = ENOSPC;
@@ -1351,8 +1351,8 @@ AddToDataDirLockFile(int target_line, const char *str)
         close(fd);
         return;
     }
-    pgstat_report_wait_end();
-    pgstat_report_wait_start(WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC);
+    bestatus_report_wait_end();
+    bestatus_report_wait_start(WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC);
     if (pg_fsync(fd) != 0)
     {
         ereport(LOG,
@@ -1360,7 +1360,7 @@ AddToDataDirLockFile(int target_line, const char *str)
                  errmsg("could not write to file \"%s\": %m",
                         DIRECTORY_LOCK_FILE)));
     }
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
     if (close(fd) != 0)
     {
         ereport(LOG,
@@ -1417,9 +1417,9 @@ RecheckDataDirLockFile(void)
                 return true;
         }
     }
-    pgstat_report_wait_start(WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ);
+    bestatus_report_wait_start(WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ);
     len = read(fd, buffer, sizeof(buffer) - 1);
-    pgstat_report_wait_end();
+    bestatus_report_wait_end();
     if (len < 0)
     {
         ereport(LOG,
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 1e4fa89135..9465c7d624 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -25,6 +25,7 @@
 #include "access/sysattr.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/indexing.h"
 #include "catalog/namespace.h"
@@ -688,7 +689,10 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 
     /* Initialize stats collection --- must happen before first xact */
     if (!bootstrap)
+    {
         pgstat_initialize();
+        bestatus_initialize();
+    }
 
     /*
      * Load relcache entries for the shared system catalogs.  This must create
@@ -710,7 +714,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
     if (IsAutoVacuumLauncherProcess())
     {
         /* report this backend in the PgBackendStatus array */
-        pgstat_bestart();
+        bestatus_bestart();
 
         return;
     }
@@ -857,7 +861,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         InitializeClientEncoding();
 
         /* report this backend in the PgBackendStatus array */
-        pgstat_bestart();
+        bestatus_bestart();
 
         /* close the transaction we started above */
         CommitTransactionCommand();
@@ -923,7 +927,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
          */
         if (!bootstrap)
         {
-            pgstat_bestart();
+            bestatus_bestart();
             CommitTransactionCommand();
         }
         return;
@@ -1075,7 +1079,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 
     /* report this backend in the PgBackendStatus array */
     if (!bootstrap)
-        pgstat_bestart();
+        bestatus_bestart();
 
     /* close the transaction we started above */
     if (!bootstrap)
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 37f3389bd0..36e84a13e8 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -33,6 +33,7 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
 #include "commands/async.h"
@@ -1291,7 +1292,7 @@ static struct config_bool ConfigureNamesBool[] =
                          "executing command of each session, along with "
                          "the time at which that command began execution.")
         },
-        &pgstat_track_activities,
+        &bestatus_track_activities,
         true,
         NULL, NULL, NULL
     },
@@ -3048,7 +3049,7 @@ static struct config_int ConfigureNamesInt[] =
             NULL,
             GUC_UNIT_BYTE
         },
-        &pgstat_track_activity_query_size,
+        &bestatus_track_activity_query_size,
         1024, 100, 102400,
         NULL, NULL, NULL
     },
@@ -10728,7 +10729,7 @@ static void
 assign_application_name(const char *newval, void *extra)
 {
     /* Update the pg_stat_activity view */
-    pgstat_report_appname(newval);
+    bestatus_report_appname(newval);
 }
 
 static bool
diff --git a/src/include/bestatus.h b/src/include/bestatus.h
new file mode 100644
index 0000000000..72b3e9f5f1
--- /dev/null
+++ b/src/include/bestatus.h
@@ -0,0 +1,545 @@
+/* ----------
+ *    bestatus.h
+ *
+ *    Definitions for the PostgreSQL backend status monitor facility
+ *
+ *    Copyright (c) 2001-2018, PostgreSQL Global Development Group
+ *
+ *    src/include/bestatus.h
+ * ----------
+ */
+#ifndef BESTATUS_H
+#define BESTATUS_H
+
+#include "datatype/timestamp.h"
+#include "libpq/pqcomm.h"
+#include "storage/proc.h"
+
+/* ----------
+ * Backend types
+ * ----------
+ */
+typedef enum BackendType
+{
+    B_AUTOVAC_LAUNCHER,
+    B_AUTOVAC_WORKER,
+    B_BACKEND,
+    B_BG_WORKER,
+    B_BG_WRITER,
+    B_CHECKPOINTER,
+    B_STARTUP,
+    B_WAL_RECEIVER,
+    B_WAL_SENDER,
+    B_WAL_WRITER,
+    B_ARCHIVER
+} BackendType;
+
+
+/* ----------
+ * Backend states
+ * ----------
+ */
+typedef enum BackendState
+{
+    STATE_UNDEFINED,
+    STATE_IDLE,
+    STATE_RUNNING,
+    STATE_IDLEINTRANSACTION,
+    STATE_FASTPATH,
+    STATE_IDLEINTRANSACTION_ABORTED,
+    STATE_DISABLED
+} BackendState;
+
+
+/* ----------
+ * Wait Classes
+ * ----------
+ */
+#define PG_WAIT_LWLOCK                0x01000000U
+#define PG_WAIT_LOCK                0x03000000U
+#define PG_WAIT_BUFFER_PIN            0x04000000U
+#define PG_WAIT_ACTIVITY            0x05000000U
+#define PG_WAIT_CLIENT                0x06000000U
+#define PG_WAIT_EXTENSION            0x07000000U
+#define PG_WAIT_IPC                    0x08000000U
+#define PG_WAIT_TIMEOUT                0x09000000U
+#define PG_WAIT_IO                    0x0A000000U
+
+/* ----------
+ * Wait Events - Activity
+ *
+ * Use this category when a process is waiting because it has no work to do,
+ * unless the "Client" or "Timeout" category describes the situation better.
+ * Typically, this should only be used for background processes.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_ARCHIVER_MAIN = PG_WAIT_ACTIVITY,
+    WAIT_EVENT_AUTOVACUUM_MAIN,
+    WAIT_EVENT_BGWRITER_HIBERNATE,
+    WAIT_EVENT_BGWRITER_MAIN,
+    WAIT_EVENT_CHECKPOINTER_MAIN,
+    WAIT_EVENT_LOGICAL_APPLY_MAIN,
+    WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
+    WAIT_EVENT_BESTATUS_MAIN,
+    WAIT_EVENT_RECOVERY_WAL_ALL,
+    WAIT_EVENT_RECOVERY_WAL_STREAM,
+    WAIT_EVENT_SYSLOGGER_MAIN,
+    WAIT_EVENT_WAL_RECEIVER_MAIN,
+    WAIT_EVENT_WAL_SENDER_MAIN,
+    WAIT_EVENT_WAL_WRITER_MAIN
+} WaitEventActivity;
+
+/* ----------
+ * Wait Events - Client
+ *
+ * Use this category when a process is waiting to send data to or receive data
+ * from the frontend process to which it is connected.  This is never used for
+ * a background process, which has no client connection.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_CLIENT_READ = PG_WAIT_CLIENT,
+    WAIT_EVENT_CLIENT_WRITE,
+    WAIT_EVENT_LIBPQWALRECEIVER_CONNECT,
+    WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE,
+    WAIT_EVENT_SSL_OPEN_SERVER,
+    WAIT_EVENT_WAL_RECEIVER_WAIT_START,
+    WAIT_EVENT_WAL_SENDER_WAIT_WAL,
+    WAIT_EVENT_WAL_SENDER_WRITE_DATA
+} WaitEventClient;
+
+/* ----------
+ * Wait Events - IPC
+ *
+ * Use this category when a process cannot complete the work it is doing because
+ * it is waiting for a notification from another process.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_BGWORKER_SHUTDOWN = PG_WAIT_IPC,
+    WAIT_EVENT_BGWORKER_STARTUP,
+    WAIT_EVENT_BTREE_PAGE,
+    WAIT_EVENT_CLOG_GROUP_UPDATE,
+    WAIT_EVENT_EXECUTE_GATHER,
+    WAIT_EVENT_HASH_BATCH_ALLOCATING,
+    WAIT_EVENT_HASH_BATCH_ELECTING,
+    WAIT_EVENT_HASH_BATCH_LOADING,
+    WAIT_EVENT_HASH_BUILD_ALLOCATING,
+    WAIT_EVENT_HASH_BUILD_ELECTING,
+    WAIT_EVENT_HASH_BUILD_HASHING_INNER,
+    WAIT_EVENT_HASH_BUILD_HASHING_OUTER,
+    WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING,
+    WAIT_EVENT_HASH_GROW_BATCHES_DECIDING,
+    WAIT_EVENT_HASH_GROW_BATCHES_ELECTING,
+    WAIT_EVENT_HASH_GROW_BATCHES_FINISHING,
+    WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING,
+    WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING,
+    WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING,
+    WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING,
+    WAIT_EVENT_LOGICAL_SYNC_DATA,
+    WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
+    WAIT_EVENT_MQ_INTERNAL,
+    WAIT_EVENT_MQ_PUT_MESSAGE,
+    WAIT_EVENT_MQ_RECEIVE,
+    WAIT_EVENT_MQ_SEND,
+    WAIT_EVENT_PARALLEL_BITMAP_SCAN,
+    WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN,
+    WAIT_EVENT_PARALLEL_FINISH,
+    WAIT_EVENT_PROCARRAY_GROUP_UPDATE,
+    WAIT_EVENT_PROMOTE,
+    WAIT_EVENT_REPLICATION_ORIGIN_DROP,
+    WAIT_EVENT_REPLICATION_SLOT_DROP,
+    WAIT_EVENT_SAFE_SNAPSHOT,
+    WAIT_EVENT_SYNC_REP
+} WaitEventIPC;
+
+/* ----------
+ * Wait Events - Timeout
+ *
+ * Use this category when a process is waiting for a timeout to expire.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
+    WAIT_EVENT_PG_SLEEP,
+    WAIT_EVENT_RECOVERY_APPLY_DELAY
+} WaitEventTimeout;
+
+/* ----------
+ * Wait Events - IO
+ *
+ * Use this category when a process is waiting for an IO.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_BUFFILE_READ = PG_WAIT_IO,
+    WAIT_EVENT_BUFFILE_WRITE,
+    WAIT_EVENT_CONTROL_FILE_READ,
+    WAIT_EVENT_CONTROL_FILE_SYNC,
+    WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
+    WAIT_EVENT_CONTROL_FILE_WRITE,
+    WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE,
+    WAIT_EVENT_COPY_FILE_READ,
+    WAIT_EVENT_COPY_FILE_WRITE,
+    WAIT_EVENT_DATA_FILE_EXTEND,
+    WAIT_EVENT_DATA_FILE_FLUSH,
+    WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC,
+    WAIT_EVENT_DATA_FILE_PREFETCH,
+    WAIT_EVENT_DATA_FILE_READ,
+    WAIT_EVENT_DATA_FILE_SYNC,
+    WAIT_EVENT_DATA_FILE_TRUNCATE,
+    WAIT_EVENT_DATA_FILE_WRITE,
+    WAIT_EVENT_DSM_FILL_ZERO_WRITE,
+    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ,
+    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC,
+    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE,
+    WAIT_EVENT_LOCK_FILE_CREATE_READ,
+    WAIT_EVENT_LOCK_FILE_CREATE_SYNC,
+    WAIT_EVENT_LOCK_FILE_CREATE_WRITE,
+    WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ,
+    WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC,
+    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC,
+    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE,
+    WAIT_EVENT_LOGICAL_REWRITE_SYNC,
+    WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE,
+    WAIT_EVENT_LOGICAL_REWRITE_WRITE,
+    WAIT_EVENT_RELATION_MAP_READ,
+    WAIT_EVENT_RELATION_MAP_SYNC,
+    WAIT_EVENT_RELATION_MAP_WRITE,
+    WAIT_EVENT_REORDER_BUFFER_READ,
+    WAIT_EVENT_REORDER_BUFFER_WRITE,
+    WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ,
+    WAIT_EVENT_REPLICATION_SLOT_READ,
+    WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
+    WAIT_EVENT_REPLICATION_SLOT_SYNC,
+    WAIT_EVENT_REPLICATION_SLOT_WRITE,
+    WAIT_EVENT_SLRU_FLUSH_SYNC,
+    WAIT_EVENT_SLRU_READ,
+    WAIT_EVENT_SLRU_SYNC,
+    WAIT_EVENT_SLRU_WRITE,
+    WAIT_EVENT_SNAPBUILD_READ,
+    WAIT_EVENT_SNAPBUILD_SYNC,
+    WAIT_EVENT_SNAPBUILD_WRITE,
+    WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC,
+    WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE,
+    WAIT_EVENT_TIMELINE_HISTORY_READ,
+    WAIT_EVENT_TIMELINE_HISTORY_SYNC,
+    WAIT_EVENT_TIMELINE_HISTORY_WRITE,
+    WAIT_EVENT_TWOPHASE_FILE_READ,
+    WAIT_EVENT_TWOPHASE_FILE_SYNC,
+    WAIT_EVENT_TWOPHASE_FILE_WRITE,
+    WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ,
+    WAIT_EVENT_WAL_BOOTSTRAP_SYNC,
+    WAIT_EVENT_WAL_BOOTSTRAP_WRITE,
+    WAIT_EVENT_WAL_COPY_READ,
+    WAIT_EVENT_WAL_COPY_SYNC,
+    WAIT_EVENT_WAL_COPY_WRITE,
+    WAIT_EVENT_WAL_INIT_SYNC,
+    WAIT_EVENT_WAL_INIT_WRITE,
+    WAIT_EVENT_WAL_READ,
+    WAIT_EVENT_WAL_SYNC,
+    WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
+    WAIT_EVENT_WAL_WRITE
+} WaitEventIO;
+
+/* ----------
+ * Command type for progress reporting purposes
+ * ----------
+ */
+typedef enum ProgressCommandType
+{
+    PROGRESS_COMMAND_INVALID,
+    PROGRESS_COMMAND_VACUUM
+} ProgressCommandType;
+
+#define BESTATUS_NUM_PROGRESS_PARAM    10
+
+/* ----------
+ * Shared-memory data structures
+ * ----------
+ */
+
+
+/*
+ * PgBackendSSLStatus
+ *
+ * For each backend, we keep the SSL status in a separate struct, which
+ * is only filled in if SSL is enabled.
+ */
+typedef struct PgBackendSSLStatus
+{
+    /* Information about SSL connection */
+    int            ssl_bits;
+    bool        ssl_compression;
+    char        ssl_version[NAMEDATALEN];    /* MUST be null-terminated */
+    char        ssl_cipher[NAMEDATALEN];    /* MUST be null-terminated */
+    char        ssl_clientdn[NAMEDATALEN];    /* MUST be null-terminated */
+} PgBackendSSLStatus;
+
+
+/* ----------
+ * PgBackendStatus
+ *
+ * Each live backend maintains a PgBackendStatus struct in shared memory
+ * showing its current activity.  (The structs are allocated according to
+ * BackendId, but that is not critical.)  Note that the collector process
+ * has no involvement in, or even access to, these structs.
+ *
+ * Each auxiliary process also maintains a PgBackendStatus struct in shared
+ * memory.
+ * ----------
+ */
+typedef struct PgBackendStatus
+{
+    /*
+     * To avoid locking overhead, we use the following protocol: a backend
+     * increments st_changecount before modifying its entry, and again after
+     * finishing a modification.  A would-be reader should note the value of
+     * st_changecount, copy the entry into private memory, then check
+     * st_changecount again.  If the value hasn't changed, and if it's even,
+     * the copy is valid; otherwise start over.  This makes updates cheap
+     * while reads are potentially expensive, but that's the tradeoff we want.
+     *
+     * The above protocol needs memory barriers to ensure that the apparent
+     * order of execution is as intended. Otherwise, for example, on a
+     * machine with weak memory ordering the CPU might reorder the code so
+     * that st_changecount is incremented twice before the modification.
+     * Such a surprising result can lead to bugs.
+     */
+    int            st_changecount;
+
+    /* The entry is valid iff st_procpid > 0, unused if st_procpid == 0 */
+    int            st_procpid;
+
+    /* Type of backend */
+    BackendType st_backendType;
+
+    /* Times when current backend, transaction, and activity started */
+    TimestampTz st_proc_start_timestamp;
+    TimestampTz st_xact_start_timestamp;
+    TimestampTz st_activity_start_timestamp;
+    TimestampTz st_state_start_timestamp;
+
+    /* Database OID, owning user's OID, connection client address */
+    Oid            st_databaseid;
+    Oid            st_userid;
+    SockAddr    st_clientaddr;
+    char       *st_clienthostname;    /* MUST be null-terminated */
+
+    /* Information about SSL connection */
+    bool        st_ssl;
+    PgBackendSSLStatus *st_sslstatus;
+
+    /* current state */
+    BackendState st_state;
+
+    /* application name; MUST be null-terminated */
+    char       *st_appname;
+
+    /*
+     * Current command string; MUST be null-terminated. Note that this string
+     * may be truncated in the middle of a multi-byte character. As activity
+     * strings are stored more frequently than read, this moves the cost of
+     * correct truncation to the display side. Use bestatus_clip_activity()
+     * to truncate correctly.
+     */
+    char       *st_activity_raw;
+
+    /*
+     * Command progress reporting.  Any command which wishes can advertise
+     * that it is running by setting st_progress_command,
+     * st_progress_command_target, and st_progress_param[].
+     * st_progress_command_target should be the OID of the relation which the
+     * command targets (we assume there's just one, as this is meant for
+     * utility commands), but the meaning of each element in the
+     * st_progress_param array is command-specific.
+     */
+    ProgressCommandType st_progress_command;
+    Oid            st_progress_command_target;
+    int64        st_progress_param[BESTATUS_NUM_PROGRESS_PARAM];
+} PgBackendStatus;
+
+/*
+ * Macros to load and store st_changecount with memory barriers.
+ *
+ * bestatus_increment_changecount_before() and
+ * bestatus_increment_changecount_after() need to be called before and after
+ * PgBackendStatus entries are modified, respectively. This makes sure that
+ * st_changecount is incremented around the modification.
+ *
+ * Also bestatus_save_changecount_before() and bestatus_save_changecount_after()
+ * need to be called before and after PgBackendStatus entries are copied into
+ * private memory, respectively.
+ */
+#define bestatus_increment_changecount_before(beentry)    \
+    do {    \
+        beentry->st_changecount++;    \
+        pg_write_barrier(); \
+    } while (0)
+
+#define bestatus_increment_changecount_after(beentry) \
+    do {    \
+        pg_write_barrier(); \
+        beentry->st_changecount++;    \
+        Assert((beentry->st_changecount & 1) == 0); \
+    } while (0)
+
+#define bestatus_save_changecount_before(beentry, save_changecount)    \
+    do {    \
+        save_changecount = beentry->st_changecount; \
+        pg_read_barrier();    \
+    } while (0)
+
+#define bestatus_save_changecount_after(beentry, save_changecount)    \
+    do {    \
+        pg_read_barrier();    \
+        save_changecount = beentry->st_changecount; \
+    } while (0)
+
+/* ----------
+ * LocalPgBackendStatus
+ *
+ * When we build the backend status array, we use LocalPgBackendStatus to be
+ * able to add new values to the struct when needed without adding new fields
+ * to the shared memory. It contains the backend status as a first member.
+ * ----------
+ */
+typedef struct LocalPgBackendStatus
+{
+    /*
+     * Local version of the backend status entry.
+     */
+    PgBackendStatus backendStatus;
+
+    /*
+     * The xid of the current transaction if available, InvalidTransactionId
+     * if not.
+     */
+    TransactionId backend_xid;
+
+    /*
+     * The xmin of the current session if available, InvalidTransactionId if
+     * not.
+     */
+    TransactionId backend_xmin;
+} LocalPgBackendStatus;
+
+/* ----------
+ * GUC parameters
+ * ----------
+ */
+extern bool bestatus_track_activities;
+extern PGDLLIMPORT int bestatus_track_activity_query_size;
+
+/* ----------
+ * Functions called from backends
+ * ----------
+ */
+extern void bestatus_clear_snapshot(void);
+extern void bestatus_initialize(void);
+extern void bestatus_bestart(void);
+
+extern const char *bestatus_get_wait_event(uint32 wait_event_info);
+extern const char *bestatus_get_wait_event_type(uint32 wait_event_info);
+extern const char *bestatus_get_backend_current_activity(int pid, bool checkUser);
+extern const char *bestatus_get_crashed_backend_activity(int pid, char *buffer,
+                                    int buflen);
+extern const char *bestatus_get_backend_desc(BackendType backendType);
+
+extern void bestatus_progress_start_command(ProgressCommandType cmdtype,
+                              Oid relid);
+extern void bestatus_progress_update_param(int index, int64 val);
+extern void bestatus_progress_update_multi_param(int nparam, const int *index,
+                                   const int64 *val);
+extern void bestatus_progress_end_command(void);
+
+extern char *bestatus_clip_activity(const char *raw_activity);
+
+/* ----------
+ * bestatus_report_wait_start() -
+ *
+ *    Called from places where server process needs to wait.  This is called
+ *    to report wait event information.  The wait information is stored
+ *    as 4-bytes where first byte represents the wait event class (type of
+ *    wait, for different types of wait, refer WaitClass) and the next
+ *    3-bytes represent the actual wait event.  Currently 2-bytes are used
+ *    for wait event which is sufficient for current usage, 1-byte is
+ *    reserved for future usage.
+ *
+ * NB: this *must* be able to survive being called before MyProc has been
+ * initialized.
+ * ----------
+ */
+static inline void
+bestatus_report_wait_start(uint32 wait_event_info)
+{
+    volatile PGPROC *proc = MyProc;
+
+    if (!bestatus_track_activities || !proc)
+        return;
+
+    /*
+     * Since this is a four-byte field which is always read and written as
+     * four-bytes, updates are atomic.
+     */
+    proc->wait_event_info = wait_event_info;
+}
+
+/* ----------
+ * bestatus_report_wait_end() -
+ *
+ *    Called to report end of a wait.
+ *
+ * NB: this *must* be able to survive being called before MyProc has been
+ * initialized.
+ * ----------
+ */
+static inline void
+bestatus_report_wait_end(void)
+{
+    volatile PGPROC *proc = MyProc;
+
+    if (!bestatus_track_activities || !proc)
+        return;
+
+    /*
+     * Since this is a four-byte field which is always read and written as
+     * four-bytes, updates are atomic.
+     */
+    proc->wait_event_info = 0;
+}
+extern PgBackendStatus *bestatus_fetch_stat_beentry(int beid);
+extern LocalPgBackendStatus *bestatus_fetch_stat_local_beentry(int beid);
+extern int    bestatus_fetch_stat_numbackends(void);
+
+/* For shared memory allocation/initialize */
+extern Size BackendStatusShmemSize(void);
+extern void CreateSharedBackendStatus(void);
+
+void bestatus_report_xact_timestamp(TimestampTz tstamp);
+void bestatus_bestat_initialize(void);
+
+extern void bestatus_report_activity(BackendState state, const char *cmd_str);
+extern void bestatus_report_appname(const char *appname);
+extern void bestatus_report_xact_timestamp(TimestampTz tstamp);
+extern const char *bestatus_get_wait_event(uint32 wait_event_info);
+extern const char *bestatus_get_wait_event_type(uint32 wait_event_info);
+extern const char *bestatus_get_backend_current_activity(int pid, bool checkUser);
+extern const char *bestatus_get_crashed_backend_activity(int pid, char *buffer,
+                                    int buflen);
+extern const char *bestatus_get_backend_desc(BackendType backendType);
+
+extern void bestatus_progress_start_command(ProgressCommandType cmdtype,
+                              Oid relid);
+extern void bestatus_progress_update_param(int index, int64 val);
+extern void bestatus_progress_update_multi_param(int nparam, const int *index,
+                                   const int64 *val);
+extern void bestatus_progress_end_command(void);
+
+#endif                            /* BESTATUS_H */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 4e51580076..4d1fc422ab 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL statistics collector facility.
  *
  *    Copyright (c) 2001-2018, PostgreSQL Global Development Group
  *
@@ -14,11 +14,8 @@
 #include "datatype/timestamp.h"
 #include "fmgr.h"
 #include "lib/dshash.h"
-#include "libpq/pqcomm.h"
-#include "port/atomics.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
-#include "storage/proc.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -94,12 +91,11 @@ typedef enum PgStat_Single_Reset_Type
     RESET_FUNCTION
 } PgStat_Single_Reset_Type;
 
+
 /* ------------------------------------------------------------
  * Structures kept in backend local memory while accumulating counts
  * ------------------------------------------------------------
  */
-
-
 /* ----------
  * PgStat_TableStatus            Per-table status within a backend
  *
@@ -167,10 +163,10 @@ typedef struct PgStat_BgWriter
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to apply.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when applying to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -203,7 +199,7 @@ typedef struct PgStat_FunctionEntry
 } PgStat_FunctionEntry;
 
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Statistic collector data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -307,7 +303,7 @@ typedef struct PgStat_StatFuncEntry
 
 
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -323,7 +319,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -341,422 +337,6 @@ typedef struct PgStat_GlobalStats
     TimestampTz stat_reset_timestamp;
 } PgStat_GlobalStats;
 
-
-/* ----------
- * Backend types
- * ----------
- */
-typedef enum BackendType
-{
-    B_AUTOVAC_LAUNCHER,
-    B_AUTOVAC_WORKER,
-    B_BACKEND,
-    B_BG_WORKER,
-    B_BG_WRITER,
-    B_ARCHIVER,
-    B_CHECKPOINTER,
-    B_STARTUP,
-    B_WAL_RECEIVER,
-    B_WAL_SENDER,
-    B_WAL_WRITER
-} BackendType;
-
-
-/* ----------
- * Backend states
- * ----------
- */
-typedef enum BackendState
-{
-    STATE_UNDEFINED,
-    STATE_IDLE,
-    STATE_RUNNING,
-    STATE_IDLEINTRANSACTION,
-    STATE_FASTPATH,
-    STATE_IDLEINTRANSACTION_ABORTED,
-    STATE_DISABLED
-} BackendState;
-
-
-/* ----------
- * Wait Classes
- * ----------
- */
-#define PG_WAIT_LWLOCK                0x01000000U
-#define PG_WAIT_LOCK                0x03000000U
-#define PG_WAIT_BUFFER_PIN            0x04000000U
-#define PG_WAIT_ACTIVITY            0x05000000U
-#define PG_WAIT_CLIENT                0x06000000U
-#define PG_WAIT_EXTENSION            0x07000000U
-#define PG_WAIT_IPC                    0x08000000U
-#define PG_WAIT_TIMEOUT                0x09000000U
-#define PG_WAIT_IO                    0x0A000000U
-
-/* ----------
- * Wait Events - Activity
- *
- * Use this category when a process is waiting because it has no work to do,
- * unless the "Client" or "Timeout" category describes the situation better.
- * Typically, this should only be used for background processes.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_ARCHIVER_MAIN = PG_WAIT_ACTIVITY,
-    WAIT_EVENT_AUTOVACUUM_MAIN,
-    WAIT_EVENT_BGWRITER_HIBERNATE,
-    WAIT_EVENT_BGWRITER_MAIN,
-    WAIT_EVENT_CHECKPOINTER_MAIN,
-    WAIT_EVENT_LOGICAL_APPLY_MAIN,
-    WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
-    WAIT_EVENT_RECOVERY_WAL_ALL,
-    WAIT_EVENT_RECOVERY_WAL_STREAM,
-    WAIT_EVENT_SYSLOGGER_MAIN,
-    WAIT_EVENT_WAL_RECEIVER_MAIN,
-    WAIT_EVENT_WAL_SENDER_MAIN,
-    WAIT_EVENT_WAL_WRITER_MAIN
-} WaitEventActivity;
-
-/* ----------
- * Wait Events - Client
- *
- * Use this category when a process is waiting to send data to or receive data
- * from the frontend process to which it is connected.  This is never used for
- * a background process, which has no client connection.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_CLIENT_READ = PG_WAIT_CLIENT,
-    WAIT_EVENT_CLIENT_WRITE,
-    WAIT_EVENT_LIBPQWALRECEIVER_CONNECT,
-    WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE,
-    WAIT_EVENT_SSL_OPEN_SERVER,
-    WAIT_EVENT_WAL_RECEIVER_WAIT_START,
-    WAIT_EVENT_WAL_SENDER_WAIT_WAL,
-    WAIT_EVENT_WAL_SENDER_WRITE_DATA
-} WaitEventClient;
-
-/* ----------
- * Wait Events - IPC
- *
- * Use this category when a process cannot complete the work it is doing because
- * it is waiting for a notification from another process.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_BGWORKER_SHUTDOWN = PG_WAIT_IPC,
-    WAIT_EVENT_BGWORKER_STARTUP,
-    WAIT_EVENT_BTREE_PAGE,
-    WAIT_EVENT_CLOG_GROUP_UPDATE,
-    WAIT_EVENT_EXECUTE_GATHER,
-    WAIT_EVENT_HASH_BATCH_ALLOCATING,
-    WAIT_EVENT_HASH_BATCH_ELECTING,
-    WAIT_EVENT_HASH_BATCH_LOADING,
-    WAIT_EVENT_HASH_BUILD_ALLOCATING,
-    WAIT_EVENT_HASH_BUILD_ELECTING,
-    WAIT_EVENT_HASH_BUILD_HASHING_INNER,
-    WAIT_EVENT_HASH_BUILD_HASHING_OUTER,
-    WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING,
-    WAIT_EVENT_HASH_GROW_BATCHES_DECIDING,
-    WAIT_EVENT_HASH_GROW_BATCHES_ELECTING,
-    WAIT_EVENT_HASH_GROW_BATCHES_FINISHING,
-    WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING,
-    WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING,
-    WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING,
-    WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING,
-    WAIT_EVENT_LOGICAL_SYNC_DATA,
-    WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
-    WAIT_EVENT_MQ_INTERNAL,
-    WAIT_EVENT_MQ_PUT_MESSAGE,
-    WAIT_EVENT_MQ_RECEIVE,
-    WAIT_EVENT_MQ_SEND,
-    WAIT_EVENT_PARALLEL_BITMAP_SCAN,
-    WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN,
-    WAIT_EVENT_PARALLEL_FINISH,
-    WAIT_EVENT_PROCARRAY_GROUP_UPDATE,
-    WAIT_EVENT_PROMOTE,
-    WAIT_EVENT_REPLICATION_ORIGIN_DROP,
-    WAIT_EVENT_REPLICATION_SLOT_DROP,
-    WAIT_EVENT_SAFE_SNAPSHOT,
-    WAIT_EVENT_SYNC_REP
-} WaitEventIPC;
-
-/* ----------
- * Wait Events - Timeout
- *
- * Use this category when a process is waiting for a timeout to expire.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
-    WAIT_EVENT_PG_SLEEP,
-    WAIT_EVENT_RECOVERY_APPLY_DELAY
-} WaitEventTimeout;
-
-/* ----------
- * Wait Events - IO
- *
- * Use this category when a process is waiting for a IO.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_BUFFILE_READ = PG_WAIT_IO,
-    WAIT_EVENT_BUFFILE_WRITE,
-    WAIT_EVENT_CONTROL_FILE_READ,
-    WAIT_EVENT_CONTROL_FILE_SYNC,
-    WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
-    WAIT_EVENT_CONTROL_FILE_WRITE,
-    WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE,
-    WAIT_EVENT_COPY_FILE_READ,
-    WAIT_EVENT_COPY_FILE_WRITE,
-    WAIT_EVENT_DATA_FILE_EXTEND,
-    WAIT_EVENT_DATA_FILE_FLUSH,
-    WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC,
-    WAIT_EVENT_DATA_FILE_PREFETCH,
-    WAIT_EVENT_DATA_FILE_READ,
-    WAIT_EVENT_DATA_FILE_SYNC,
-    WAIT_EVENT_DATA_FILE_TRUNCATE,
-    WAIT_EVENT_DATA_FILE_WRITE,
-    WAIT_EVENT_DSM_FILL_ZERO_WRITE,
-    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ,
-    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC,
-    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE,
-    WAIT_EVENT_LOCK_FILE_CREATE_READ,
-    WAIT_EVENT_LOCK_FILE_CREATE_SYNC,
-    WAIT_EVENT_LOCK_FILE_CREATE_WRITE,
-    WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ,
-    WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC,
-    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC,
-    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE,
-    WAIT_EVENT_LOGICAL_REWRITE_SYNC,
-    WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE,
-    WAIT_EVENT_LOGICAL_REWRITE_WRITE,
-    WAIT_EVENT_RELATION_MAP_READ,
-    WAIT_EVENT_RELATION_MAP_SYNC,
-    WAIT_EVENT_RELATION_MAP_WRITE,
-    WAIT_EVENT_REORDER_BUFFER_READ,
-    WAIT_EVENT_REORDER_BUFFER_WRITE,
-    WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ,
-    WAIT_EVENT_REPLICATION_SLOT_READ,
-    WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
-    WAIT_EVENT_REPLICATION_SLOT_SYNC,
-    WAIT_EVENT_REPLICATION_SLOT_WRITE,
-    WAIT_EVENT_SLRU_FLUSH_SYNC,
-    WAIT_EVENT_SLRU_READ,
-    WAIT_EVENT_SLRU_SYNC,
-    WAIT_EVENT_SLRU_WRITE,
-    WAIT_EVENT_SNAPBUILD_READ,
-    WAIT_EVENT_SNAPBUILD_SYNC,
-    WAIT_EVENT_SNAPBUILD_WRITE,
-    WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC,
-    WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE,
-    WAIT_EVENT_TIMELINE_HISTORY_READ,
-    WAIT_EVENT_TIMELINE_HISTORY_SYNC,
-    WAIT_EVENT_TIMELINE_HISTORY_WRITE,
-    WAIT_EVENT_TWOPHASE_FILE_READ,
-    WAIT_EVENT_TWOPHASE_FILE_SYNC,
-    WAIT_EVENT_TWOPHASE_FILE_WRITE,
-    WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ,
-    WAIT_EVENT_WAL_BOOTSTRAP_SYNC,
-    WAIT_EVENT_WAL_BOOTSTRAP_WRITE,
-    WAIT_EVENT_WAL_COPY_READ,
-    WAIT_EVENT_WAL_COPY_SYNC,
-    WAIT_EVENT_WAL_COPY_WRITE,
-    WAIT_EVENT_WAL_INIT_SYNC,
-    WAIT_EVENT_WAL_INIT_WRITE,
-    WAIT_EVENT_WAL_READ,
-    WAIT_EVENT_WAL_SYNC,
-    WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-    WAIT_EVENT_WAL_WRITE
-} WaitEventIO;
-
-/* ----------
- * Command type for progress reporting purposes
- * ----------
- */
-typedef enum ProgressCommandType
-{
-    PROGRESS_COMMAND_INVALID,
-    PROGRESS_COMMAND_VACUUM
-} ProgressCommandType;
-
-#define PGSTAT_NUM_PROGRESS_PARAM    10
-
-/* ----------
- * Shared-memory data structures
- * ----------
- */
-
-
-/*
- * PgBackendSSLStatus
- *
- * For each backend, we keep the SSL status in a separate struct, that
- * is only filled in if SSL is enabled.
- */
-typedef struct PgBackendSSLStatus
-{
-    /* Information about SSL connection */
-    int            ssl_bits;
-    bool        ssl_compression;
-    char        ssl_version[NAMEDATALEN];    /* MUST be null-terminated */
-    char        ssl_cipher[NAMEDATALEN];    /* MUST be null-terminated */
-    char        ssl_clientdn[NAMEDATALEN];    /* MUST be null-terminated */
-} PgBackendSSLStatus;
-
-
-/* ----------
- * PgBackendStatus
- *
- * Each live backend maintains a PgBackendStatus struct in shared memory
- * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
- * has no involvement in, or even access to, these structs.
- *
- * Each auxiliary process also maintains a PgBackendStatus struct in shared
- * memory.
- * ----------
- */
-typedef struct PgBackendStatus
-{
-    /*
-     * To avoid locking overhead, we use the following protocol: a backend
-     * increments st_changecount before modifying its entry, and again after
-     * finishing a modification.  A would-be reader should note the value of
-     * st_changecount, copy the entry into private memory, then check
-     * st_changecount again.  If the value hasn't changed, and if it's even,
-     * the copy is valid; otherwise start over.  This makes updates cheap
-     * while reads are potentially expensive, but that's the tradeoff we want.
-     *
-     * The above protocol needs the memory barriers to ensure that the
-     * apparent order of execution is as it desires. Otherwise, for example,
-     * the CPU might rearrange the code so that st_changecount is incremented
-     * twice before the modification on a machine with weak memory ordering.
-     * This surprising result can lead to bugs.
-     */
-    int            st_changecount;
-
-    /* The entry is valid iff st_procpid > 0, unused if st_procpid == 0 */
-    int            st_procpid;
-
-    /* Type of backends */
-    BackendType st_backendType;
-
-    /* Times when current backend, transaction, and activity started */
-    TimestampTz st_proc_start_timestamp;
-    TimestampTz st_xact_start_timestamp;
-    TimestampTz st_activity_start_timestamp;
-    TimestampTz st_state_start_timestamp;
-
-    /* Database OID, owning user's OID, connection client address */
-    Oid            st_databaseid;
-    Oid            st_userid;
-    SockAddr    st_clientaddr;
-    char       *st_clienthostname;    /* MUST be null-terminated */
-
-    /* Information about SSL connection */
-    bool        st_ssl;
-    PgBackendSSLStatus *st_sslstatus;
-
-    /* current state */
-    BackendState st_state;
-
-    /* application name; MUST be null-terminated */
-    char       *st_appname;
-
-    /*
-     * Current command string; MUST be null-terminated. Note that this string
-     * possibly is truncated in the middle of a multi-byte character. As
-     * activity strings are stored more frequently than read, that allows to
-     * move the cost of correct truncation to the display side. Use
-     * pgstat_clip_activity() to truncate correctly.
-     */
-    char       *st_activity_raw;
-
-    /*
-     * Command progress reporting.  Any command which wishes can advertise
-     * that it is running by setting st_progress_command,
-     * st_progress_command_target, and st_progress_param[].
-     * st_progress_command_target should be the OID of the relation which the
-     * command targets (we assume there's just one, as this is meant for
-     * utility commands), but the meaning of each element in the
-     * st_progress_param array is command-specific.
-     */
-    ProgressCommandType st_progress_command;
-    Oid            st_progress_command_target;
-    int64        st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
-} PgBackendStatus;
-
-/*
- * Macros to load and store st_changecount with the memory barriers.
- *
- * pgstat_increment_changecount_before() and
- * pgstat_increment_changecount_after() need to be called before and after
- * PgBackendStatus entries are modified, respectively. This makes sure that
- * st_changecount is incremented around the modification.
- *
- * Also pgstat_save_changecount_before() and pgstat_save_changecount_after()
- * need to be called before and after PgBackendStatus entries are copied into
- * private memory, respectively.
- */
-#define pgstat_increment_changecount_before(beentry)    \
-    do {    \
-        beentry->st_changecount++;    \
-        pg_write_barrier(); \
-    } while (0)
-
-#define pgstat_increment_changecount_after(beentry) \
-    do {    \
-        pg_write_barrier(); \
-        beentry->st_changecount++;    \
-        Assert((beentry->st_changecount & 1) == 0); \
-    } while (0)
-
-#define pgstat_save_changecount_before(beentry, save_changecount)    \
-    do {    \
-        save_changecount = beentry->st_changecount; \
-        pg_read_barrier();    \
-    } while (0)
-
-#define pgstat_save_changecount_after(beentry, save_changecount)    \
-    do {    \
-        pg_read_barrier();    \
-        save_changecount = beentry->st_changecount; \
-    } while (0)
-
-/* ----------
- * LocalPgBackendStatus
- *
- * When we build the backend status array, we use LocalPgBackendStatus to be
- * able to add new values to the struct when needed without adding new fields
- * to the shared memory. It contains the backend status as a first member.
- * ----------
- */
-typedef struct LocalPgBackendStatus
-{
-    /*
-     * Local version of the backend status entry.
-     */
-    PgBackendStatus backendStatus;
-
-    /*
-     * The xid of the current transaction if available, InvalidTransactionId
-     * if not.
-     */
-    TransactionId backend_xid;
-
-    /*
-     * The xmin of the current session if available, InvalidTransactionId if
-     * not.
-     */
-    TransactionId backend_xmin;
-} LocalPgBackendStatus;
-
 /*
  * Working state needed to accumulate per-function-call timing statistics.
  */
@@ -778,10 +358,8 @@ typedef struct PgStat_FunctionCallUsage
  * GUC parameters
  * ----------
  */
-extern bool pgstat_track_activities;
 extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
-extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
@@ -828,26 +406,9 @@ extern void pgstat_report_deadlock(void);
 extern void pgstat_clear_snapshot(void);
 
 extern void pgstat_initialize(void);
+extern void pgstat_bearray_initialize(void);
 extern void pgstat_bestart(void);
 
-extern void pgstat_report_activity(BackendState state, const char *cmd_str);
-extern void pgstat_report_tempfile(size_t filesize);
-extern void pgstat_report_appname(const char *appname);
-extern void pgstat_report_xact_timestamp(TimestampTz tstamp);
-extern const char *pgstat_get_wait_event(uint32 wait_event_info);
-extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
-extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
-extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
-                                    int buflen);
-extern const char *pgstat_get_backend_desc(BackendType backendType);
-
-extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
-                              Oid relid);
-extern void pgstat_progress_update_param(int index, int64 val);
-extern void pgstat_progress_update_multi_param(int nparam, const int *index,
-                                   const int64 *val);
-extern void pgstat_progress_end_command(void);
-
 extern PgStat_TableStatus *find_tabstat_entry(Oid rel_id);
 extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 
@@ -858,60 +419,6 @@ extern PgStat_StatDBEntry *backend_get_db_entry(Oid dbid, bool oneshot);
 extern HTAB *backend_snapshot_all_db_entries(void);
 extern PgStat_StatTabEntry *backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid relid, bool oneshot);
 
-/* ----------
- * pgstat_report_wait_start() -
- *
- *    Called from places where server process needs to wait.  This is called
- *    to report wait event information.  The wait information is stored
- *    as 4-bytes where first byte represents the wait event class (type of
- *    wait, for different types of wait, refer WaitClass) and the next
- *    3-bytes represent the actual wait event.  Currently 2-bytes are used
- *    for wait event which is sufficient for current usage, 1-byte is
- *    reserved for future usage.
- *
- * NB: this *must* be able to survive being called before MyProc has been
- * initialized.
- * ----------
- */
-static inline void
-pgstat_report_wait_start(uint32 wait_event_info)
-{
-    volatile PGPROC *proc = MyProc;
-
-    if (!pgstat_track_activities || !proc)
-        return;
-
-    /*
-     * Since this is a four-byte field which is always read and written as
-     * four-bytes, updates are atomic.
-     */
-    proc->wait_event_info = wait_event_info;
-}
-
-/* ----------
- * pgstat_report_wait_end() -
- *
- *    Called to report end of a wait.
- *
- * NB: this *must* be able to survive being called before MyProc has been
- * initialized.
- * ----------
- */
-static inline void
-pgstat_report_wait_end(void)
-{
-    volatile PGPROC *proc = MyProc;
-
-    if (!pgstat_track_activities || !proc)
-        return;
-
-    /*
-     * Since this is a four-byte field which is always read and written as
-     * four-bytes, updates are atomic.
-     */
-    proc->wait_event_info = 0;
-}
-
 /* nontransactional event counts are simple enough to inline */
 
 #define pgstat_count_heap_scan(rel)                                    \
@@ -979,6 +486,8 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 extern void pgstat_update_archiver(const char *xlog, bool failed);
 extern void pgstat_update_bgwriter(void);
 
+extern void pgstat_report_tempfile(size_t filesize);
+
 /* ----------
  * Support functions for the SQL-callable functions to
  * generate the pgstat* views.
@@ -986,10 +495,7 @@ extern void pgstat_update_bgwriter(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid relid, bool oneshot);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
-extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
-extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
-extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
 
-- 
2.16.3
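
As a reading aid for the st_changecount comments in the patch above: the
protocol is essentially a sequence lock. The following is a minimal
standalone sketch, not code from the patch. DemoEntry, demo_update() and
demo_read() are hypothetical names, and C11 fences stand in for
pg_write_barrier()/pg_read_barrier(); that substitution is an assumption
made only so the example compiles outside the PostgreSQL tree.

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

/* Hypothetical stand-in for PgBackendStatus: a counter plus some payload. */
typedef struct DemoEntry
{
    int         changecount;    /* odd while a modification is in progress */
    int         pid;
    char        activity[32];   /* always null-terminated */
} DemoEntry;

/* Assumed stand-ins for pg_write_barrier() and pg_read_barrier(). */
#define demo_write_barrier() atomic_thread_fence(memory_order_release)
#define demo_read_barrier()  atomic_thread_fence(memory_order_acquire)

/* Writer side: increment the counter before and after the modification. */
static void
demo_update(DemoEntry *e, int pid, const char *activity)
{
    e->changecount++;           /* becomes odd: readers must retry */
    demo_write_barrier();

    e->pid = pid;
    strncpy(e->activity, activity, sizeof(e->activity) - 1);
    e->activity[sizeof(e->activity) - 1] = '\0';

    demo_write_barrier();
    e->changecount++;           /* even again: entry is stable */
}

/* Reader side: copy the entry, accept it only if the counter is unchanged and even. */
static void
demo_read(const DemoEntry *e, DemoEntry *copy)
{
    for (;;)
    {
        int         before;
        int         after;

        before = e->changecount;
        demo_read_barrier();

        memcpy(copy, e, sizeof(*copy));

        demo_read_barrier();
        after = e->changecount;

        if (before == after && (before & 1) == 0)
            break;              /* consistent copy */
    }
}
```

Calling demo_update(&e, 42, "SELECT 1") on a zeroed entry and then
demo_read(&e, &copy) yields a private copy with pid 42 and an even
changecount. This keeps updates cheap and pushes the retry cost onto
readers, which is the tradeoff the PgBackendStatus comment describes. A
truly concurrent reader would likely also need to access the shared entry
through a volatile pointer.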

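The wait_event_info layout described in the bestatus_report_wait_start()
comment (wait class in the first byte, wait event in the remaining three)
can be illustrated with a small sketch. The two PG_WAIT_* values are copied
from the patch; demo_wait_class() and demo_wait_event_id() are hypothetical
helpers, not functions from the patch or from PostgreSQL.

```c
#include <assert.h>
#include <stdint.h>

/* Wait class values as they appear in the patch. */
#define PG_WAIT_ACTIVITY    0x05000000U
#define PG_WAIT_IO          0x0A000000U

/* Hypothetical helper: the wait class lives in the most significant byte. */
static inline uint32_t
demo_wait_class(uint32_t wait_event_info)
{
    return wait_event_info & 0xFF000000U;
}

/* Hypothetical helper: the event number within its class is the low three bytes. */
static inline uint32_t
demo_wait_event_id(uint32_t wait_event_info)
{
    return wait_event_info & 0x00FFFFFFU;
}
```

For instance, the third member of the WaitEventActivity enum has the value
PG_WAIT_ACTIVITY + 2, so demo_wait_class() returns PG_WAIT_ACTIVITY and
demo_wait_event_id() returns 2 for it. Because the whole value fits in one
four-byte field, it can be stored with a single assignment.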

Re: shared-memory based stats collector

From
Tomas Vondra
Date:

On 11/8/18 12:46 PM, Kyotaro HORIGUCHI wrote:
> Hello. Thank you for looking at this.
> 
> At Tue, 30 Oct 2018 01:49:59 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
> <5253d750-890b-069b-031f-2a9b73e47832@2ndquadrant.com>
>> Hi,
>>
>> I've started looking at the patch over the past few days. I don't have
>> any deep insights at this point, but there seems to be some sort of
>> issue in pgstat_update_stat. When building using gcc, I do get this
>> warning:
>>
>> pgstat.c: In function ‘pgstat_update_stat’:
>> pgstat.c:648:18: warning: ‘now’ may be used uninitialized in this
>> function [-Wmaybe-uninitialized]
>>     oldest_pending = now;
>>     ~~~~~~~~~~~~~~~^~~~~
>> PostgreSQL installation complete.
> 
> Ugh! The cause is that "last_report = now" comes later than the
> surrounding code... Fixed.
> 
>> When running this under valgrind, I get a couple of warnings in this
>> area of code - see the attached log with a small sample. Judging by
>> the locations I assume those are related to the same issue, but I have
>> not looked into that.
> 
> There were several typos/thinkos related to variables that were
> changed into pointers. The original code contained something like
> the following.
> 
>    memset(&shared_globalStats, 0, sizeof(shared_globalStats));
> 
> It was left unchanged even though this patch changes the type of the
> variable from PgStat_GlobalStats to (PgStat_GlobalStats *). As a
> result, the major part of the variable remained uninitialized.
> 
> I re-ran this version under valgrind and didn't see that kind of
> problem. Thank you for testing.
> 

OK, regression tests now seem to pass without any valgrind issues.
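
For reference, the memset() problem described in the quoted text is the
usual sizeof-on-a-pointer mistake. A minimal standalone sketch follows;
DemoStats, demo_reset() and demo_old_clear_size() are made-up names for
illustration, not the ones used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a stats struct that moved to shared memory. */
typedef struct DemoStats
{
    long        counters[16];
} DemoStats;

/*
 * Correct form once the variable is a pointer: pass the pointer itself
 * and the size of the struct it points to.
 */
static void
demo_reset(DemoStats *stats)
{
    memset(stats, 0, sizeof(*stats));
}

/*
 * What an unchanged "sizeof(variable)" yields after the variable became
 * a pointer: only the size of the pointer, not of the struct.
 */
static size_t
demo_old_clear_size(void)
{
    DemoStats  *stats = NULL;

    return sizeof(stats);
}
```

With the unchanged call, at most pointer-sized storage gets cleared, so most
of the struct keeps whatever bytes it had before, which is the kind of thing
valgrind reports as use of uninitialized memory.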

However, quite a few extensions in contrib seem to be broken now. It
seems fixing it is as simple as including the new bestatus.h next to
pgstat.h.

I'm not sure splitting the headers like this is needed, actually. It's 
true we're replacing pgstat.c with something else, but it's still 
related to stats, backing pg_stat_* system views etc. So I'd keep as 
much of the definitions in pgstat.h, so that it's enough to include that 
one header file. That would "unbreak" the extensions.

Renaming pgstat_report_* functions to bestatus_report_* seems 
unnecessary to me too. The original names seem quite fine to me.

BTW the updated patches no longer apply cleanly. Apparently it got 
broken since Tuesday, most likely by the pread/pwrite patch.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: shared-memory based stats collector

From
Alvaro Herrera
Date:
On 2018-Nov-08, Tomas Vondra wrote:

> I'm not sure splitting the headers like this is needed, actually. It's true
> we're replacing pgstat.c with something else, but it's still related to
> stats, backing pg_stat_* system views etc. So I'd keep as much of the
> definitions in pgstat.h, so that it's enough to include that one header
> file. That would "unbreak" the extensions.

pgstat.h includes a lot of other stuff that presumably isn't needed if
all some .c wants is in bestatus.h, so my vote would be to make this
change *if it's actually possible to do it*: you want the affected
headers to compile standalone (use cpluspluscheck or similar to verify
this), for one thing.

> Renaming pgstat_report_* functions to bestatus_report_* seems unnecessary to
> me too. The original names seem quite fine to me.

Yeah, this probably keeps churn to a minimum.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: shared-memory based stats collector

From
Kyotaro HORIGUCHI
Date:
Hello. This is a rebased version.

At Thu, 8 Nov 2018 16:06:49 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
<de249c3f-79c9-b75c-79a3-5e2d008548a8@2ndquadrant.com>
> However, quite a few extensions in contrib seem to be broken now. It
> seems fixing it is as simple as including the new bestatus.h next to
> pgstat.h.

The additional 0009 does that.

At Thu, 8 Nov 2018 12:39:41 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
<20181108153941.txjb6rg3y7q26ldm@alvherre.pgsql>
> On 2018-Nov-08, Tomas Vondra wrote:
> 
> > I'm not sure splitting the headers like this is needed, actually. It's true
> > we're replacing pgstat.c with something else, but it's still related to
> > stats, backing pg_stat_* system views etc. So I'd keep as much of the
> > definitions in pgstat.h, so that it's enough to include that one header
> > file. That would "unbreak" the extensions.
>
> pgstat.h includes a lot of other stuff that presumably isn't needed if
> all some .c wants is in bestatus.h, so my vote would be to make this
> change *if it's actually possible to do it*: you want the affected
> headers to compile standalone (use cpluspluscheck or similar to verify
> this), for one thing.

cpluspluscheck doesn't complain about this change. I'm afraid I
didn't understand you clearly, but I counted the files in $(TOPDIR)
(including contrib) that need only pgstat.h, only bestatus.h, or both:

only (new) pgstat.h : 33
only bestatus.h     : 47
both                : 22


> > Renaming pgstat_report_* functions to bestatus_report_* seems unnecessary to
> > me too. The original names seem quite fine to me.
> 
> Yeah, this probably keeps churn to a minimum.

Reverted the names. pgstat_initialize() and
pgstat_clear_snapshot() were each split into two functions, one in
pgstat.c and one in bestatus.c. (pgstat_clear_snapshot() now calls
pgstat_bestatus_clear_snapshot()!, and the function names look
somewhat inconsistent. Will fix later.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 08b691072c8109d102b7908d8a2bb19c7338255c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:41:04 +0900
Subject: [PATCH 1/9] sequential scan for dshash

Add sequential scan feature to dshash.
---
 src/backend/lib/dshash.c | 188 ++++++++++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  23 +++++-
 2 files changed, 206 insertions(+), 5 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index b2b8fe60e1..af904c034e 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -112,6 +112,7 @@ struct dshash_table
     size_t        size_log2;        /* log2(number of buckets) */
     bool        find_locked;    /* Is any partition lock held by 'find'? */
     bool        find_exclusively_locked;    /* ... exclusively? */
+    bool        seqscan_running;/* now under sequential scan */
 };
 
 /* Given a pointer to an item, find the entry (user data) it holds. */
@@ -127,6 +128,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +158,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -228,6 +237,7 @@ dshash_create(dsa_area *area, const dshash_parameters *params, void *arg)
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
 
     /*
      * Set up the initial array of buckets.  Our initial size is the same as
@@ -279,6 +289,7 @@ dshash_attach(dsa_area *area, const dshash_parameters *params,
     hash_table->control = dsa_get_address(area, control);
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
     Assert(hash_table->control->magic == DSHASH_MAGIC);
 
     /*
@@ -324,7 +335,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
                                 LW_EXCLUSIVE));
 
     delete_item(hash_table, item);
-    hash_table->find_locked = false;
-    hash_table->find_exclusively_locked = false;
-    LWLockRelease(PARTITION_LOCK(hash_table, partition));
+
+    /* We need to keep the partition lock during a sequential scan */
+    if (!hash_table->seqscan_running)
+    {
+        hash_table->find_locked = false;
+        hash_table->find_exclusively_locked = false;
+        LWLockRelease(PARTITION_LOCK(hash_table, partition));
+    }
 }
 
 /*
@@ -568,6 +584,8 @@ dshash_release_lock(dshash_table *hash_table, void *entry)
     Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition_index),
                                 hash_table->find_exclusively_locked
                                 ? LW_EXCLUSIVE : LW_SHARED));
+    /* lock is under control of sequential scan */
+    Assert(!hash_table->seqscan_running);
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
@@ -592,6 +610,168 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should be called if and only if the scan is abandoned
+ * before completion; if dshash_seq_next returns NULL then it has already done
+ * the end-of-scan cleanup.
+ *
+ * On returning element, it is locked as is the case with dshash_find.
+ * However, the caller must not release the lock. The lock is released as
+ * necessary in continued scan.
+ *
+ * As opposed to the equivalent for dynahash, the caller is not supposed to
+ * delete the returned element before continuing the scan.
+ *
+ * If consistent is set for dshash_seq_init, the whole hash table is
+ * non-exclusively locked. Otherwise a part of the hash table is locked in the
+ * same mode (partition lock).
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool consistent, bool exclusive)
+{
+    /* at most one scan is allowed at a time */
+    Assert(!hash_table->seqscan_running);
+
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->consistent = consistent;
+    status->exclusive = exclusive;
+    hash_table->seqscan_running = true;
+
+    /*
+     * Protect all partitions from modification if the caller wants a
+     * consistent result.
+     */
+    if (consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+        {
+            Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i)));
+
+            LWLockAcquire(PARTITION_LOCK(hash_table, i),
+                          exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        }
+        ensure_valid_bucket_pointers(hash_table);
+    }
+}
+
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    Assert(status->hash_table->seqscan_running);
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert(status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        if (!status->consistent)
+        {
+            partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+            LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            status->curpartition = partition;
+
+            /* resize doesn't happen from now until seq scan ends */
+            status->nbuckets =
+                NUM_BUCKETS(status->hash_table->control->size_log2);
+            ensure_valid_bucket_pointers(status->hash_table);
+        }
+        
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            dshash_seq_term(status);
+            return NULL;
+        }
+
+        /* Also move partition lock if needed */
+        if (!status->consistent)
+        {
+            int next_partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+
+            /* Move lock along with partition for the bucket */
+            if (status->curpartition != next_partition)
+            {
+                /*
+                 * Take lock on the next partition then release the current,
+                 * not in the reverse order. This is required to prevent
+                 * resizing from happening during a sequential scan. Locks are
+                 * taken in partition order, so no deadlock can occur with
+                 * other seq scans or resizing.
+                 */
+                LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                             next_partition),
+                              status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+                LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                             status->curpartition));
+                status->curpartition = next_partition;
+            }
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * This item can be deleted by the caller. Store the next item for the
+     * next iteration for the occasion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    Assert(status->hash_table->seqscan_running);
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+    status->hash_table->seqscan_running = false;
+
+    if (status->consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+            LWLockRelease(PARTITION_LOCK(status->hash_table, i));
+    }
+    else if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 8c733bfe25..8ab1a21f3e 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,23 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state of dshash. The detail is exposed since the storage
+ * size should be known to users but it should be considered as an opaque
+ * type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;
+    int                    curbucket;
+    int                    nbuckets;
+    dshash_table_item  *curitem;
+    dsa_pointer            pnextitem;
+    int                    curpartition;
+    bool                consistent;
+    bool                exclusive;
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
               const dshash_parameters *params,
@@ -70,7 +87,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
 extern void dshash_detach(dshash_table *hash_table);
 extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
 extern void dshash_destroy(dshash_table *hash_table);
-
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
             const void *key, bool exclusive);
@@ -80,6 +96,11 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool consistent, bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.16.3
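
To illustrate the lock hand-over that dshash_seq_next does in the
non-consistent case, here is a minimal standalone sketch. It is not
part of the patch: toy_table, toy_acquire, toy_release and toy_scan are
hypothetical names, the "locks" are plain flags rather than LWLocks, and
the partition count is a toy value. Only the bucket/partition arithmetic
is copied from the macros above. It shows that the next partition is
locked before the current one is released, so at most two partition
locks are held at any moment and they are always taken in ascending
order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy value for illustration only, not the real dshash constant. */
#define TOY_NUM_PARTITIONS_LOG2 2
#define TOY_NUM_PARTITIONS (1 << TOY_NUM_PARTITIONS_LOG2)

/* Same arithmetic as the macros in the patch. */
#define NUM_SPLITS(size_log2) ((size_log2) - TOY_NUM_PARTITIONS_LOG2)
#define NUM_BUCKETS(size_log2) (((size_t) 1) << (size_log2))
#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2) \
    ((bucket_idx) >> NUM_SPLITS(size_log2))

/* Hypothetical stand-in for a dshash table: flags instead of LWLocks. */
typedef struct toy_table
{
    size_t size_log2;                   /* log2(number of buckets) */
    bool   held[TOY_NUM_PARTITIONS];    /* is this partition "locked"? */
    int    nacquired;                   /* total acquisitions so far */
    int    max_held;                    /* most locks held at the same time */
} toy_table;

static int
count_held(const toy_table *t)
{
    int n = 0;

    for (int i = 0; i < TOY_NUM_PARTITIONS; i++)
        if (t->held[i])
            n++;
    return n;
}

static void
toy_acquire(toy_table *t, int partition)
{
    assert(!t->held[partition]);
    t->held[partition] = true;
    t->nacquired++;
    if (count_held(t) > t->max_held)
        t->max_held = count_held(t);
}

static void
toy_release(toy_table *t, int partition)
{
    assert(t->held[partition]);
    t->held[partition] = false;
}

/* Visit every bucket, moving the partition lock along; returns buckets seen. */
static size_t
toy_scan(toy_table *t)
{
    size_t nbuckets = NUM_BUCKETS(t->size_log2);
    size_t visited = 0;
    int    cur = (int) PARTITION_FOR_BUCKET_INDEX((size_t) 0, t->size_log2);

    toy_acquire(t, cur);
    for (size_t b = 0; b < nbuckets; b++)
    {
        int next = (int) PARTITION_FOR_BUCKET_INDEX(b, t->size_log2);

        if (next != cur)
        {
            /* Take the next lock first, then drop the current one. */
            toy_acquire(t, next);
            toy_release(t, cur);
            cur = next;
        }
        visited++;              /* a real scan would walk the bucket chain */
    }
    toy_release(t, cur);
    return visited;
}
```

With size_log2 = 4 and four toy partitions, each partition covers four
consecutive buckets, so a full scan takes each partition lock exactly
once.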

From 3abcfd05c76057eb6b02c58f6b0604ca940e0023 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 27 Sep 2018 11:15:19 +0900
Subject: [PATCH 2/9] Add conditional lock feature to dshash

Dshash currently waits for locks unconditionally. This commit adds new
interfaces for dshash_find and dshash_find_or_insert. The new
interfaces have an extra parameter "nowait" that tells them not to wait
for the lock.
---
 src/backend/lib/dshash.c | 58 ++++++++++++++++++++++++++++++++++++++++++++----
 src/include/lib/dshash.h |  6 ++++-
 2 files changed, 59 insertions(+), 5 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index af904c034e..d8bdaecae5 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -394,6 +394,17 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  */
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
+{
+    return dshash_find_extended(hash_table, key, NULL, exclusive, false);
+}
+
+/*
+ * Like dshash_find, but returns NULL immediately when nowait is true and the
+ * lock is not available. Lock status is stored in *lock_acquired, if given.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool *lock_acquired, bool exclusive, bool nowait)
 {
     dshash_hash hash;
     size_t        partition;
@@ -405,8 +416,23 @@ dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
     Assert(hash_table->control->magic == DSHASH_MAGIC);
     Assert(!hash_table->find_locked);
 
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partition),
+                                      exclusive ? LW_EXCLUSIVE : LW_SHARED))
+        {
+            if (lock_acquired)
+                *lock_acquired = false;
+            return NULL;
+        }
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition),
+                      exclusive ? LW_EXCLUSIVE : LW_SHARED);
+
+    if (lock_acquired)
+        *lock_acquired = true;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -441,6 +467,22 @@ void *
 dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
+{
+    return dshash_find_or_insert_extended(hash_table, key, found, false);
+}
+
+/*
+ * Like dshash_find_or_insert, but returns NULL if nowait is true and the
+ * lock could not be acquired.
+ *
+ * Notes above dshash_find_extended() regarding locking and error handling
+ * equally apply here.
+ */
+void *
+dshash_find_or_insert_extended(dshash_table *hash_table,
+                               const void *key,
+                               bool *found,
+                               bool nowait)
 {
     dshash_hash hash;
     size_t        partition_index;
@@ -455,8 +497,16 @@ dshash_find_or_insert(dshash_table *hash_table,
     Assert(!hash_table->find_locked);
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(
+                PARTITION_LOCK(hash_table, partition_index),
+                LW_EXCLUSIVE))
+            return NULL;
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
+                      LW_EXCLUSIVE);
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 8ab1a21f3e..475d22ab55 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -90,8 +90,12 @@ extern void dshash_destroy(dshash_table *hash_table);
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
             const void *key, bool exclusive);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+            bool *lock_acquired, bool exclusive, bool nowait);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
-                      const void *key, bool *found);
+            const void *key, bool *found);
+extern void *dshash_find_or_insert_extended(dshash_table *hash_table,
+            const void *key, bool *found, bool nowait);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.16.3
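
One point about dshash_find_extended that is easy to miss: with nowait
set, NULL can mean either "lock was busy" or "key not found", and
*lock_acquired is what tells the two apart. The standalone sketch below
models only that contract. It is not the patch code: toy_table,
toy_find_extended, toy_release_lock and toy_report are hypothetical
names, and a bool stands in for the partition LWLock. toy_report shows
one way a caller might react to a busy lock, by keeping the numbers in a
local pending counter and merging them on the next successful attempt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct toy_entry
{
    int  key;
    long count;
} toy_entry;

/* Hypothetical stand-in for a one-partition dshash table. */
typedef struct toy_table
{
    bool      locked;       /* stands in for the partition LWLock */
    toy_entry entries[4];
    size_t    nentries;
} toy_table;

/*
 * Toy model of the dshash_find_extended() contract.  On success the entry
 * is returned with the lock still held; if the key is missing the lock is
 * released before returning NULL.
 */
static toy_entry *
toy_find_extended(toy_table *t, int key, bool *lock_acquired, bool nowait)
{
    if (t->locked)
    {
        /* A waiting acquire would block here; the toy models nowait only. */
        assert(nowait);
        if (lock_acquired)
            *lock_acquired = false;
        return NULL;
    }

    t->locked = true;
    if (lock_acquired)
        *lock_acquired = true;

    for (size_t i = 0; i < t->nentries; i++)
        if (t->entries[i].key == key)
            return &t->entries[i];

    t->locked = false;
    return NULL;
}

static void
toy_release_lock(toy_table *t)
{
    assert(t->locked);
    t->locked = false;
}

/*
 * Add delta to the shared counter if possible.  Otherwise leave it in
 * *pending so a later call can merge it.  Returns true when flushed.
 */
static bool
toy_report(toy_table *t, int key, long delta, long *pending)
{
    bool       lock_acquired;
    toy_entry *e = toy_find_extended(t, key, &lock_acquired, true);

    *pending += delta;
    if (e == NULL)
        return false;           /* busy or missing: keep numbers local */

    e->count += *pending;
    *pending = 0;
    toy_release_lock(t);
    return true;
}
```

A caller that needs to create a missing entry would check lock_acquired
after a NULL return: true means the key is absent, false means it is
worth retrying later.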

From 8d3b7f5b6a5cce79e4c3a1ec9b9e5abf45ccd158 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:58:32 +0900
Subject: [PATCH 3/9] Shared-memory based stats collector

This replaces the means of sharing server statistics, moving from a
file to dynamic shared memory. Every backend directly reads and writes
the stats tables. The stats collector process is removed, and the
archiver process is changed to an auxiliary process so that it can
access shared memory. Shared stats are updated at intervals no shorter
than 500ms and no longer than 1s. If the shared stats hash is busy and
a backend cannot obtain a lock on it, the numbers are usually stashed
into "pending stats" in local memory and merged at the next write.
---
 src/backend/access/transam/xlog.c             |    4 +-
 src/backend/postmaster/autovacuum.c           |   59 +-
 src/backend/postmaster/bgwriter.c             |    4 +-
 src/backend/postmaster/checkpointer.c         |   24 +-
 src/backend/postmaster/pgarch.c               |    4 +-
 src/backend/postmaster/pgstat.c               | 4208 ++++++++++---------------
 src/backend/postmaster/postmaster.c           |   82 +-
 src/backend/replication/basebackup.c          |   36 -
 src/backend/replication/logical/tablesync.c   |    9 +-
 src/backend/replication/logical/worker.c      |    4 +-
 src/backend/storage/buffer/bufmgr.c           |    8 +-
 src/backend/storage/ipc/dsm.c                 |   24 +-
 src/backend/storage/ipc/ipci.c                |    6 +
 src/backend/storage/lmgr/lwlock.c             |    3 +
 src/backend/storage/lmgr/lwlocknames.txt      |    1 +
 src/backend/tcop/postgres.c                   |   27 +-
 src/backend/utils/adt/pgstatfuncs.c           |   50 +-
 src/backend/utils/init/globals.c              |    1 +
 src/backend/utils/init/postinit.c             |   11 +
 src/backend/utils/misc/guc.c                  |   41 -
 src/backend/utils/misc/postgresql.conf.sample |    1 -
 src/bin/initdb/initdb.c                       |    1 -
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |    2 +-
 src/include/miscadmin.h                       |    2 +-
 src/include/pgstat.h                          |  438 +--
 src/include/storage/dsm.h                     |    3 +
 src/include/storage/lwlock.h                  |    3 +
 src/include/utils/timeout.h                   |    1 +
 src/test/modules/worker_spi/worker_spi.c      |    2 +-
 29 files changed, 1928 insertions(+), 3131 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7eed5866d2..e52ae54821 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8587,9 +8587,9 @@ LogCheckpointEnd(bool restartpoint)
                         &sync_secs, &sync_usecs);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time +=
+    BgWriterStats.checkpoint_write_time +=
         write_secs * 1000 + write_usecs / 1000;
-    BgWriterStats.m_checkpoint_sync_time +=
+    BgWriterStats.checkpoint_sync_time +=
         sync_secs * 1000 + sync_usecs / 1000;
 
     /*
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 978089575b..10e707e9a1 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -977,7 +977,7 @@ rebuild_database_list(Oid newdb)
         PgStat_StatDBEntry *entry;
 
         /* only consider this database if it has a pgstat entry */
-        entry = pgstat_fetch_stat_dbentry(newdb);
+        entry = pgstat_fetch_stat_dbentry(newdb, true);
         if (entry != NULL)
         {
             /* we assume it isn't found because the hash was just created */
@@ -986,6 +986,7 @@ rebuild_database_list(Oid newdb)
             /* hash_search already filled in the key */
             db->adl_score = score++;
             /* next_worker is filled in later */
+            pfree(entry);
         }
     }
 
@@ -1001,7 +1002,7 @@ rebuild_database_list(Oid newdb)
          * skip databases with no stat entries -- in particular, this gets rid
          * of dropped databases
          */
-        entry = pgstat_fetch_stat_dbentry(avdb->adl_datid);
+        entry = pgstat_fetch_stat_dbentry(avdb->adl_datid, true);
         if (entry == NULL)
             continue;
 
@@ -1013,6 +1014,7 @@ rebuild_database_list(Oid newdb)
             db->adl_score = score++;
             /* next_worker is filled in later */
         }
+        pfree(entry);
     }
 
     /* finally, insert all qualifying databases not previously inserted */
@@ -1025,7 +1027,7 @@ rebuild_database_list(Oid newdb)
         PgStat_StatDBEntry *entry;
 
         /* only consider databases with a pgstat entry */
-        entry = pgstat_fetch_stat_dbentry(avdb->adw_datid);
+        entry = pgstat_fetch_stat_dbentry(avdb->adw_datid, true);
         if (entry == NULL)
             continue;
 
@@ -1037,6 +1039,7 @@ rebuild_database_list(Oid newdb)
             db->adl_score = score++;
             /* next_worker is filled in later */
         }
+        pfree(entry);
     }
     nelems = score;
 
@@ -1235,7 +1238,7 @@ do_start_worker(void)
             continue;            /* ignore not-at-risk DBs */
 
         /* Find pgstat entry if any */
-        tmp->adw_entry = pgstat_fetch_stat_dbentry(tmp->adw_datid);
+        tmp->adw_entry = pgstat_fetch_stat_dbentry(tmp->adw_datid, true);
 
         /*
          * Skip a database with no pgstat entry; it means it hasn't seen any
@@ -1273,16 +1276,22 @@ do_start_worker(void)
                 break;
             }
         }
-        if (skipit)
-            continue;
+        if (!skipit)
+        {
+            /* Remember the db with oldest autovac time. */
+            if (avdb == NULL ||
+                tmp->adw_entry->last_autovac_time <
+                avdb->adw_entry->last_autovac_time)
+            {
+                if (avdb)
+                    pfree(avdb->adw_entry);
+                avdb = tmp;
+            }
+        }
 
-        /*
-         * Remember the db with oldest autovac time.  (If we are here, both
-         * tmp->entry and db->entry must be non-null.)
-         */
-        if (avdb == NULL ||
-            tmp->adw_entry->last_autovac_time < avdb->adw_entry->last_autovac_time)
-            avdb = tmp;
+        /* Immediately free it if not used */
+        if (avdb != tmp)
+            pfree(tmp->adw_entry);
     }
 
     /* Found a database -- process it */
@@ -1971,7 +1980,7 @@ do_autovacuum(void)
      * may be NULL if we couldn't find an entry (only happens if we are
      * forcing a vacuum for anti-wrap purposes).
      */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, true);
 
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
@@ -2021,7 +2030,7 @@ do_autovacuum(void)
     MemoryContextSwitchTo(AutovacMemCxt);
 
     /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
+    shared = pgstat_fetch_stat_dbentry(InvalidOid, true);
 
     classRel = heap_open(RelationRelationId, AccessShareLock);
 
@@ -2107,6 +2116,8 @@ do_autovacuum(void)
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
                                   &dovacuum, &doanalyze, &wraparound);
+        if (tabentry)
+            pfree(tabentry);
 
         /* Relations that need work are added to table_oids */
         if (dovacuum || doanalyze)
@@ -2186,10 +2197,11 @@ do_autovacuum(void)
         /* Fetch the pgstat entry for this table */
         tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
                                              shared, dbentry);
-
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
                                   &dovacuum, &doanalyze, &wraparound);
+        if (tabentry)
+            pfree(tabentry);
 
         /* ignore analyze for toast tables */
         if (dovacuum)
@@ -2758,12 +2770,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
     if (isshared)
     {
         if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
+            tabentry = backend_get_tab_entry(shared, relid, true);
     }
     else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
+        tabentry = backend_get_tab_entry(dbentry, relid, true);
 
     return tabentry;
 }
@@ -2795,8 +2805,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     /* use fresh stats */
     autovac_refresh_stats();
 
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    shared = pgstat_fetch_stat_dbentry(InvalidOid, true);
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, true);
 
     /* fetch the relation's relcache entry */
     classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
@@ -2827,6 +2837,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
                               &dovacuum, &doanalyze, &wraparound);
+    if (tabentry)
+        pfree(tabentry);
 
     /* ignore ANALYZE for toast tables */
     if (classForm->relkind == RELKIND_TOASTVALUE)
@@ -2917,7 +2929,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     }
 
     heap_freetuple(classTup);
-
+    pfree(shared);
+    pfree(dbentry);
     return tab;
 }
 
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index b1e9bb2c53..a4b1079e60 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -271,9 +271,9 @@ BackgroundWriterMain(void)
         can_hibernate = BgBufferSync(&wb_context);
 
         /*
-         * Send off activity statistics to the stats collector
+         * Update activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_update_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 1a033093c5..9235390bc6 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -376,7 +376,7 @@ CheckpointerMain(void)
         {
             checkpoint_requested = false;
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            BgWriterStats.requested_checkpoints++;
         }
         if (shutdown_requested)
         {
@@ -402,7 +402,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                BgWriterStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -520,13 +520,13 @@ CheckpointerMain(void)
         CheckArchiveTimeout();
 
         /*
-         * Send off activity statistics to the stats collector.  (The reason
-         * why we re-use bgwriter-related code for this is that the bgwriter
-         * and checkpointer used to be just one process.  It's probably not
-         * worth the trouble to split the stats support into two independent
-         * stats message types.)
+         * Update activity statistics.  (The reason why we re-use
+         * bgwriter-related code for this is that the bgwriter and
+         * checkpointer used to be just one process.  It's probably not worth
+         * the trouble to split the stats support into two independent
+         * functions.)
          */
-        pgstat_send_bgwriter();
+        pgstat_update_bgwriter();
 
         /*
          * Sleep until we are signaled or it's time for another checkpoint or
@@ -694,9 +694,9 @@ CheckpointWriteDelay(int flags, double progress)
         CheckArchiveTimeout();
 
         /*
-         * Report interim activity statistics to the stats collector.
+         * Register interim activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_update_bgwriter();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1296,8 +1296,8 @@ AbsorbFsyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    BgWriterStats.buf_written_backend += CheckpointerShmem->num_backend_writes;
+    BgWriterStats.buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 885e85ad8a..3ca36d62a4 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -466,7 +466,7 @@ pgarch_ArchiverCopyLoop(void)
                  * Tell the collector about the WAL file that we successfully
                  * archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_update_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
@@ -476,7 +476,7 @@ pgarch_ArchiverCopyLoop(void)
                  * Tell the collector about the WAL file that we failed to
                  * archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_update_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 42bccce0af..cca64eca83 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,10 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Statistics collector facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
- *
- *            - Add some automatic call for pgstat vacuuming.
- *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *    Statistics data is stored in dynamic shared memory. Every backend
+ *    updates and reads it individually.
  *
  *    Copyright (c) 2001-2018, PostgreSQL Global Development Group
  *
@@ -19,92 +14,59 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "pgstat.h"
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
-#include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "postmaster/autovacuum.h"
-#include "postmaster/fork_process.h"
-#include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
 #include "utils/snapmgr.h"
-#include "utils/timestamp.h"
-#include "utils/tqual.h"
-
 
 /* ----------
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
-
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_STAT_MIN_INTERVAL    500 /* Minimum time between stats data
+                                         * updates; in milliseconds. */
 
+#define PGSTAT_STAT_MAX_INTERVAL   1000 /* Maximum time between stats data
+                                         * updates; in milliseconds. */
 
 /* ----------
  * The initial size hints for the hash tables used in the collector.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
 #define PGSTAT_TAB_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
+/*
+ * Operation mode of pgstat_get_db_entry.
+ */
+#define PGSTAT_FETCH_SHARED    0
+#define PGSTAT_FETCH_EXCLUSIVE    1
+#define PGSTAT_FETCH_NOWAIT    2
+
+typedef enum
+{
+    PGSTAT_ENTRY_NOT_FOUND,
+    PGSTAT_ENTRY_FOUND,
+    PGSTAT_ENTRY_LOCK_FAILED
+} pg_stat_table_result_status;
 
 /* ----------
  * Total number of backends including auxiliary
@@ -127,32 +89,64 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
- */
-char       *pgstat_stat_directory = NULL;
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+/* Shared stats bootstrap information */
+typedef struct StatsShmemStruct {
+    dsa_handle stats_dsa_handle;
+    dshash_table_handle db_stats_handle;
+    dsa_pointer    global_stats;
+    dsa_pointer    archiver_stats;
+} StatsShmemStruct;
+
 
 /*
  * BgWriter global statistics counters (unused in other processes).
  * Stored directly in a stats message structure so it can be sent
  * without needing to copy things around.  We assume this inits to zeroes.
  */
-PgStat_MsgBgWriter BgWriterStats;
+PgStat_BgWriter BgWriterStats;
 
 /* ----------
  * Local data
  * ----------
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
+static dshash_table *db_stats;
+static HTAB *snapshot_db_stats;
+static MemoryContext stats_cxt;
 
-static struct sockaddr_storage pgStatAddr;
+/*
+ *  Report withholding facility.
+ *
+ *  Some report items are withheld if the required lock cannot be acquired
+ *  immediately.
+ */
+static bool pgstat_pending_recoveryconflict = false;
+static bool pgstat_pending_deadlock = false;
+static bool pgstat_pending_tempfile = false;
 
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
+/* dshash parameter for each type of table */
+static const dshash_parameters dsh_dbparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatDBEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_DB
+};
+static const dshash_parameters dsh_tblparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatTabEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_FUNC_TABLE
+};
+static const dshash_parameters dsh_funcparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatFuncEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_FUNC_TABLE
+};
 
 /*
  * Structures in which backends store per-table info that's waiting to be
@@ -189,18 +183,14 @@ typedef struct TabStatHashEntry
  * Hash table for O(1) t_id -> tsa_entry lookup
  */
 static HTAB *pgStatTabHash = NULL;
+static HTAB *pgStatPendingTabHash = NULL;
 
 /*
  * Backends store per-function info that's waiting to be sent to the collector
  * in this hash table (indexed by function OID).
  */
 static HTAB *pgStatFunctions = NULL;
-
-/*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
- */
-static bool have_function_stats = false;
+static HTAB *pgStatPendingFunctions = NULL;
 
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
@@ -237,6 +227,12 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
+typedef struct
+{
+    dshash_table *tabhash;
+    PgStat_StatDBEntry *dbentry;
+} pgstat_apply_tabstat_context;
+
 /*
  * Info about current "snapshot" of stats file
  */
@@ -250,23 +246,15 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Cluster wide statistics.
+ * Contains statistics that are not collected per database or per table.
+ * shared_* are the statistics maintained by pgstats and snapshot_* are the
+ * snapshots taken on backends.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
-
-/* Signal handler flags */
-static volatile bool need_exit = false;
-static volatile bool got_SIGHUP = false;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats *snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats *snapshot_globalStats;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -280,32 +268,41 @@ static instr_time total_func_time;
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgstat_exit(SIGNAL_ARGS);
+/* functions used in backends */
 static void pgstat_beshutdown_hook(int code, Datum arg);
-static void pgstat_sighup_handler(SIGNAL_ARGS);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                     Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, int op,
+                                    pg_stat_table_result_status *status);
+static PgStat_StatTabEntry *pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create);
+static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_db_statsfile(Oid databaseid, dshash_table *tabhash, dshash_table *funchash);
+
+/* functions used in backends */
+static bool backend_snapshot_global_stats(void);
+static PgStat_StatFuncEntry *backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot);
 static void pgstat_read_current_status(void);
 
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
+static void pgstat_postmaster_shutdown(int code, Datum arg);
+static void pgstat_apply_pending_tabstats(bool shared, bool force,
+                               pgstat_apply_tabstat_context *cxt);
+static bool pgstat_apply_tabstat(pgstat_apply_tabstat_context *cxt,
+                                 PgStat_TableStatus *entry, bool nowait);
+static void pgstat_merge_tabentry(PgStat_TableStatus *deststat,
+                                          PgStat_TableStatus *srcstat,
+                                          bool init);
+static void pgstat_update_funcstats(bool force, PgStat_StatDBEntry *dbentry);
+static void pgstat_reset_all_counters(void);
+static void pgstat_cleanup_recovery_conflict(PgStat_StatDBEntry *dbentry);
+static void pgstat_cleanup_deadlock(PgStat_StatDBEntry *dbentry);
+static void pgstat_cleanup_tempfile(PgStat_StatDBEntry *dbentry);
+
+static inline void pgstat_merge_backendstats_to_funcentry(
+    PgStat_StatFuncEntry *dest, PgStat_BackendFunctionEntry *src, bool init);
+static inline void pgstat_merge_funcentry(
+    PgStat_StatFuncEntry *dest, PgStat_StatFuncEntry *src, bool init);
 
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
 static HTAB *pgstat_collect_oids(Oid catalogid);
-
+static void reset_dbentry_counters(PgStat_StatDBEntry *dbentry);
 static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
 
 static void pgstat_setup_memcxt(void);
@@ -316,320 +313,16 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+static bool pgstat_update_tabentry(dshash_table *tabhash,
+                                   PgStat_TableStatus *stat, bool nowait);
+static void pgstat_update_dbentry(PgStat_StatDBEntry *dbentry,
+                                  PgStat_TableStatus *stat);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
- */
-void
-pgstat_init(void)
-{
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
-
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
-
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
-    {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
-    }
-
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
-
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
-
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
-
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
-    }
-
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
-
-    /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
-     */
-    if (!pg_set_noblock(pgStatSock))
-    {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
-    }
-
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
-    }
-
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    return;
-
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
-
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
-
-    /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
-     */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
-}
-
 /*
  * subroutine for pgstat_reset_all
  */
@@ -678,119 +371,54 @@ pgstat_reset_remove_files(const char *directory)
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats files and in-memory counters.  This is currently used only
+ * if WAL recovery is needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
     pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    pgstat_reset_all_counters();
 }
 
-#ifdef EXEC_BACKEND
-
-/*
- * pgstat_forkexec() -
+/* ----------
+ * pgstat_create_shared_stats() -
  *
- * Format up the arglist for, then fork and exec, statistics collector process
+ *    create shared stats memory
+ * ----------
  */
-static pid_t
-pgstat_forkexec(void)
+static void
+pgstat_create_shared_stats(void)
 {
-    char       *av[10];
-    int            ac = 0;
+    MemoryContext oldcontext;
 
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
+    Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
 
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
+    /* lives for the lifetime of the process */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    area = dsa_create(LWTRANCHE_STATS_DSA);
+    dsa_pin_mapping(area);
 
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
+    db_stats = dshash_create(area, &dsh_dbparams, 0);
 
+    /* create shared area and write bootstrap information */
+    StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+    StatsShmem->global_stats =
+        dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+    StatsShmem->archiver_stats =
+        dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+    StatsShmem->db_stats_handle =
+        dshash_get_hash_table_handle(db_stats);
 
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
-
-    /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
-     */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
-
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
-
-    /*
-     * Okay, fork off the collector.
-     */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
+    /* connect to the memory */
+    snapshot_db_stats = NULL;
+    shared_globalStats = (PgStat_GlobalStats *)
+        dsa_get_address(area, StatsShmem->global_stats);
+    shared_archiverStats = (PgStat_ArchiverStats *)
+        dsa_get_address(area, StatsShmem->archiver_stats);
+    MemoryContextSwitchTo(oldcontext);
 }
 
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
-}
 
 /* ------------------------------------------------------------
  * Public functions used by backends follow
@@ -802,41 +430,107 @@ allow_immediate_pgstat_restart(void)
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
- *    transaction stop time as an approximation of current time.
- * ----------
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *  This requires taking some locks on the shared statistics hashes, and some
+ *  updates may be withheld on lock failure. Pending updates are retried in
+ *  later calls of this function and are finally flushed when this function
+ *  is called with force = true or when PGSTAT_STAT_MAX_INTERVAL milliseconds
+ *  have elapsed since the last cleanup. On the other hand, updates by regular
+ *  backends happen at intervals no shorter than PGSTAT_STAT_MIN_INTERVAL
+ *  when force = false.
+ *
+ *  Returns time in milliseconds until the next update time.
+ *
+ *    Note that this is called only when not within a transaction, so it is fair
+ *    to use transaction stop time as an approximation of current time.
+ *    ----------
  */
-void
-pgstat_report_stat(bool force)
+long
+pgstat_update_stat(bool force)
 {
     /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
     static TimestampTz last_report = 0;
-
+    static TimestampTz oldest_pending = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
     TabStatusArray *tsa;
-    int            i;
+    pgstat_apply_tabstat_context cxt;
+    bool        other_pending_stats = false;
+    long elapsed;
+    long secs;
+    int     usecs;
+
+    if (pgstat_pending_recoveryconflict ||
+        pgstat_pending_deadlock ||
+        pgstat_pending_tempfile ||
+        pgStatPendingFunctions)
+        other_pending_stats = true;
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (!other_pending_stats && !pgStatPendingTabHash &&
+        (pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
+        pgStatXactCommit == 0 && pgStatXactRollback == 0)
+        return 0;
 
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
     now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
+
+    if (!force)
+    {
+        /*
+         * Don't update shared stats unless it's been at least
+         * PGSTAT_STAT_MIN_INTERVAL msec since we last updated them.
+         * Return the time to wait in that case.
+         */
+        TimestampDifference(last_report, now, &secs, &usecs);
+        elapsed = secs * 1000 + usecs / 1000;
+
+        if (elapsed < PGSTAT_STAT_MIN_INTERVAL)
+        {
+            /* we know we have some statistics */
+            if (oldest_pending == 0)
+                oldest_pending = now;
+
+            return PGSTAT_STAT_MIN_INTERVAL - elapsed;
+        }
+
+
+        /*
+         * Don't keep pending stats for longer than PGSTAT_STAT_MAX_INTERVAL.
+         */
+        if (oldest_pending > 0)
+        {
+            TimestampDifference(oldest_pending, now, &secs, &usecs);
+            elapsed = secs * 1000 + usecs / 1000;
+
+            if (elapsed > PGSTAT_STAT_MAX_INTERVAL)
+                force = true;
+        }
+    }
+
     last_report = now;
 
+    /* set up stats update context */
+    cxt.dbentry = NULL;
+    cxt.tabhash = NULL;
+
+    /* Forcibly update other stats if any. */
+    if (other_pending_stats)
+    {
+        cxt.dbentry =
+            pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+        /* clean up pending statistics if any */
+        if (pgStatPendingFunctions)
+            pgstat_update_funcstats(true, cxt.dbentry);
+        if (pgstat_pending_recoveryconflict)
+            pgstat_cleanup_recovery_conflict(cxt.dbentry);
+        if (pgstat_pending_deadlock)
+            pgstat_cleanup_deadlock(cxt.dbentry);
+        if (pgstat_pending_tempfile)
+            pgstat_cleanup_tempfile(cxt.dbentry);
+    }
+
     /*
      * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
      * entries it points to.  (Should we fail partway through the loop below,
@@ -849,23 +543,55 @@ pgstat_report_stat(bool force)
     pgStatTabHash = NULL;
 
     /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
+     * XXX: We cannot lock two dshash entries at once.  Since we must hold the
+     * lock while table stats are being updated, we have no choice but to
+     * process shared-table stats and regular-table stats in separate passes.
+     * Looping over the array twice is apparently inefficient; a more
+     * efficient way is desirable.
      */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
+
+    /* The first of the following calls uses the dbentry obtained above, if any. */
+    pgstat_apply_pending_tabstats(false, force, &cxt);
+    pgstat_apply_pending_tabstats(true, force, &cxt);
+
+    /* zero out TableStatus structs after use */
+    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+    {
+        MemSet(tsa->tsa_entries, 0,
+               tsa->tsa_used * sizeof(PgStat_TableStatus));
+        tsa->tsa_used = 0;
+    }
+
+    /* record oldest pending update time */
+    if (pgStatPendingTabHash == NULL)
+        oldest_pending = 0;
+    else if (oldest_pending == 0)
+        oldest_pending = now;
+
+    return 0;
+}
+
+/*
+ * Subroutine for pgstat_update_stat.
+ *
+ * Applies table stats in the table status array, merging them with pending
+ * stats if any.  If force is true, waits until the required locks are
+ * acquired.  Otherwise keeps the merged stats as pending for the next chance.
+ */
+static void
+pgstat_apply_pending_tabstats(bool shared, bool force,
+                              pgstat_apply_tabstat_context *cxt)
+{
+    static const PgStat_TableCounts all_zeroes;
+    TabStatusArray *tsa;
+    int i;
 
     for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
     {
         for (i = 0; i < tsa->tsa_used; i++)
         {
             PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
+            PgStat_TableStatus *pentry = NULL;
 
             /* Shouldn't have any pending transaction-dependent counts */
             Assert(entry->trans == NULL);
@@ -878,178 +604,440 @@ pgstat_report_stat(bool force)
                        sizeof(PgStat_TableCounts)) == 0)
                 continue;
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* Skip if this entry does not match the request */
+            if (entry->t_shared != shared)
+                continue;
+
+            /* if a pending update exists, apply it along with the new one */
+            if (pgStatPendingTabHash != NULL)
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                pentry = hash_search(pgStatPendingTabHash,
+                                     (void *) entry, HASH_FIND, NULL);
+
+                if (pentry)
+                {
+                    /* merge new update into pending updates */
+                    pgstat_merge_tabentry(pentry, entry, false);
+                    entry = pentry;
+                }
+            }
+
+            /* try to apply the merged stats */
+            if (pgstat_apply_tabstat(cxt, entry, !force))
+            {
+                /* succeeded. remove it if it was a pending entry */
+                if (pentry)
+                    hash_search(pgStatPendingTabHash,
+                                (void *) pentry, HASH_REMOVE, NULL);
+            }
+            else if (!pentry)
+            {
+                /* failed and there was no pending entry, create new one. */
+                bool found;
+
+                if (pgStatPendingTabHash == NULL)
+                {
+                    HASHCTL        ctl;
+
+                    memset(&ctl, 0, sizeof(ctl));
+                    ctl.keysize = sizeof(Oid);
+                    ctl.entrysize = sizeof(PgStat_TableStatus);
+                    pgStatPendingTabHash =
+                        hash_create("pgstat pending table stats hash",
+                                    TABSTAT_QUANTUM,
+                                    &ctl,
+                                    HASH_ELEM | HASH_BLOBS);
+                }
+
+                pentry = hash_search(pgStatPendingTabHash,
+                                     (void *) entry, HASH_ENTER, &found);
+                Assert (!found);
+
+                *pentry = *entry;
             }
         }
-        /* zero out TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
+    }
+
+    /* if any pending stats exist, try to clean them up */
+    if (pgStatPendingTabHash != NULL)
+    {
+        HASH_SEQ_STATUS pstat;
+        PgStat_TableStatus *pentry;
+
+        hash_seq_init(&pstat, pgStatPendingTabHash);
+        while ((pentry = (PgStat_TableStatus *) hash_seq_search(&pstat)) != NULL)
+        {
+            /* Skip if this entry does not match the request */
+            if (pentry->t_shared != shared)
+                continue;
+
+            /* apply pending entry and remove on success */
+            if (pgstat_apply_tabstat(cxt, pentry, !force))
+                hash_search(pgStatPendingTabHash,
+                            (void *) pentry, HASH_REMOVE, NULL);
+        }
+
+        /* destroy the hash if no entry is left */
+        if (hash_get_num_entries(pgStatPendingTabHash) == 0)
+        {
+            hash_destroy(pgStatPendingTabHash);
+            pgStatPendingTabHash = NULL;
+        }
+    }
+
+    if (cxt->tabhash)
+        dshash_detach(cxt->tabhash);
+    if (cxt->dbentry)
+        dshash_release_lock(db_stats, cxt->dbentry);
+    cxt->tabhash = NULL;
+    cxt->dbentry = NULL;
+}
+
+
+/*
+ * pgstat_apply_tabstat: update shared stats entry using given entry
+ *
+ * If nowait is true, just returns false on lock failure.  Dshashes for table
+ * and function stats are kept attached and stored in cxt.  The caller must
+ * detach them after use.
+ */
+bool
+pgstat_apply_tabstat(pgstat_apply_tabstat_context *cxt,
+                     PgStat_TableStatus *entry, bool nowait)
+{
+    Oid dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
+    int        table_mode = PGSTAT_FETCH_EXCLUSIVE;
+    bool updated = false;
+
+    if (nowait)
+        table_mode |= PGSTAT_FETCH_NOWAIT;
+
+    /*
+     * We need to keep the lock on the dbentry for regular tables to avoid a
+     * race condition with DROP DATABASE.  So we hold it in the context
+     * variable.  We don't need that for shared tables.
+     */
+    if (!cxt->dbentry)
+        cxt->dbentry = pgstat_get_db_entry(dboid, table_mode, NULL);
+
+    /* could not acquire the lock, just return */
+    if (!cxt->dbentry)
+        return false;
+
+    /* attach the shared stats table if not yet attached */
+    if (!cxt->tabhash)
+    {
+        /* apply database stats  */
+        if (!entry->t_shared)
+        {
+            /* Update database-wide stats  */
+            cxt->dbentry->n_xact_commit += pgStatXactCommit;
+            cxt->dbentry->n_xact_rollback += pgStatXactRollback;
+            cxt->dbentry->n_block_read_time += pgStatBlockReadTime;
+            cxt->dbentry->n_block_write_time += pgStatBlockWriteTime;
+            pgStatXactCommit = 0;
+            pgStatXactRollback = 0;
+            pgStatBlockReadTime = 0;
+            pgStatBlockWriteTime = 0;
+        }
+        
+        cxt->tabhash =
+            dshash_attach(area, &dsh_tblparams, cxt->dbentry->tables, 0);
     }
 
     /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
+     * If we have access to the required data, try to update table stats first.
+     * Update database stats only if the first step succeeded.
      */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    if (pgstat_update_tabentry(cxt->tabhash, entry, nowait))
+    {
+        pgstat_update_dbentry(cxt->dbentry, entry);
+        updated = true;
+    }
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+    return updated;
 }
 
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * pgstat_merge_tabentry: subroutine for pgstat_update_stat
+ *
+ * Merge srcstat into deststat.  The existing value in deststat is overwritten
+ * if init is true.
  */
 static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+pgstat_merge_tabentry(PgStat_TableStatus *deststat,
+                      PgStat_TableStatus *srcstat,
+                      bool init)
 {
-    int            n;
-    int            len;
+    Assert (deststat != srcstat);
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
-     */
-    if (OidIsValid(tsmsg->m_databaseid))
-    {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
-        pgStatXactCommit = 0;
-        pgStatXactRollback = 0;
-        pgStatBlockReadTime = 0;
-        pgStatBlockWriteTime = 0;
-    }
+    if (init)
+        deststat->t_counts = srcstat->t_counts;
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        PgStat_TableCounts *dest = &deststat->t_counts;
+        PgStat_TableCounts *src = &srcstat->t_counts;
+
+        dest->t_numscans += src->t_numscans;
+        dest->t_tuples_returned += src->t_tuples_returned;
+        dest->t_tuples_fetched += src->t_tuples_fetched;
+        dest->t_tuples_inserted += src->t_tuples_inserted;
+        dest->t_tuples_updated += src->t_tuples_updated;
+        dest->t_tuples_deleted += src->t_tuples_deleted;
+        dest->t_tuples_hot_updated += src->t_tuples_hot_updated;
+        dest->t_truncated |= src->t_truncated;
+
+        /* If table was truncated, first reset the live/dead counters */
+        if (src->t_truncated)
+        {
+            dest->t_delta_live_tuples = 0;
+            dest->t_delta_dead_tuples = 0;
+        }
+        dest->t_delta_live_tuples += src->t_delta_live_tuples;
+        dest->t_delta_dead_tuples += src->t_delta_dead_tuples;
+        dest->t_changed_tuples += src->t_changed_tuples;
+        dest->t_blocks_fetched += src->t_blocks_fetched;
+        dest->t_blocks_hit += src->t_blocks_hit;
     }
-
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
-
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
 }
-
+        
 /*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
+ * pgstat_update_funcstats: subroutine for pgstat_update_stat
+ *
+ *  Updates function stats.
  */
 static void
-pgstat_send_funcstats(void)
+pgstat_update_funcstats(bool force, PgStat_StatDBEntry *dbentry)
 {
     /* we assume this inits to all zeroes: */
     static const PgStat_FunctionCounts all_zeroes;
+    pg_stat_table_result_status status = 0;
+    dshash_table *funchash;
+    bool          nowait = !force;
+    bool          release_db = false;
+    int              table_op = PGSTAT_FETCH_EXCLUSIVE;
 
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
-
-    if (pgStatFunctions == NULL)
+    if (pgStatFunctions == NULL && pgStatPendingFunctions == NULL)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
+    if (nowait)
+        table_op |= PGSTAT_FETCH_NOWAIT;
 
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    /* find the shared function stats table */
+    if (!dbentry)
     {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
-
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
-
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
-
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
+        dbentry = pgstat_get_db_entry(MyDatabaseId, table_op, &status);
+        release_db = true;
     }
 
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
+    /* lock failure, return. */
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
 
-    have_function_stats = false;
+    /* create the hash if it does not exist yet */
+    if (dbentry->functions == DSM_HANDLE_INVALID)
+    {
+        funchash = dshash_create(area, &dsh_funcparams, 0);
+        dbentry->functions = dshash_get_hash_table_handle(funchash);
+    }
+    else
+        funchash = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+
+    /*
+     * First, we empty the transaction stats.  Just move the numbers to pending
+     * stats if any.  Otherwise try to update the shared stats directly, but
+     * create a new pending entry on lock failure.
+     */
+    if (pgStatFunctions)
+    {
+        HASH_SEQ_STATUS fstat;
+        PgStat_BackendFunctionEntry *bestat;
+
+        hash_seq_init(&fstat, pgStatFunctions);
+        while ((bestat = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+        {
+            bool found;
+            bool init = false;
+            PgStat_StatFuncEntry *funcent = NULL;
+
+            /* Skip it if no counts accumulated since last time */
+            if (memcmp(&bestat->f_counts, &all_zeroes,
+                       sizeof(PgStat_FunctionCounts)) == 0)
+                continue;
+
+            /* find pending entry */
+            if (pgStatPendingFunctions)
+                funcent = (PgStat_StatFuncEntry *)
+                    hash_search(pgStatPendingFunctions,
+                                (void *) &(bestat->f_id), HASH_FIND, NULL);
+
+            if (!funcent)
+            {
+                /* pending entry not found, find shared stats entry */
+                funcent = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert_extended(funchash,
+                                                   (void *) &(bestat->f_id),
+                                                   &found, nowait);
+                if (funcent)
+                    init = !found;
+                else
+                {
+                    /* no shared stats entry. create a new pending one */
+                    funcent = (PgStat_StatFuncEntry *)
+                        hash_search(pgStatPendingFunctions,
+                                    (void *) &(bestat->f_id), HASH_ENTER, NULL);
+                    init = true;
+                }
+            }
+            Assert (funcent != NULL);
+
+            pgstat_merge_backendstats_to_funcentry(funcent, bestat, init);
+
+            /* reset used counts */
+            MemSet(&bestat->f_counts, 0, sizeof(PgStat_FunctionCounts));
+        }
+    }
+
+    /* Second, apply pending stats numbers to shared table */
+    if (pgStatPendingFunctions)
+    {
+        HASH_SEQ_STATUS fstat;
+        PgStat_StatFuncEntry *pendent;
+
+        hash_seq_init(&fstat, pgStatPendingFunctions);
+        while ((pendent = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+        {
+            PgStat_StatFuncEntry *funcent;
+            bool found;
+
+            funcent = (PgStat_StatFuncEntry *)
+                dshash_find_or_insert_extended(funchash,
+                                               (void *) &(pendent->functionid),
+                                               &found, nowait);
+            if (funcent)
+            {
+                pgstat_merge_funcentry(funcent, pendent, !found);
+                hash_search(pgStatPendingFunctions,
+                            (void *) &(pendent->functionid), HASH_REMOVE, NULL);
+            }
+        }
+
+        /* destroy the hash if no entry remains */
+        if (hash_get_num_entries(pgStatPendingFunctions) == 0)
+        {
+            hash_destroy(pgStatPendingFunctions);
+            pgStatPendingFunctions = NULL;
+        }
+    }
+
+    if (release_db)
+        dshash_release_lock(db_stats, dbentry);
 }
 
+/*
+ * pgstat_merge_backendstats_to_funcentry: subroutine for
+ *                                             pgstat_update_funcstats
+ *
+ * Merges BackendFunctionEntry into StatFuncEntry
+ */
+static inline void
+pgstat_merge_backendstats_to_funcentry(PgStat_StatFuncEntry *dest,
+                                       PgStat_BackendFunctionEntry *src,
+                                       bool init)
+{
+    if (init)
+    {
+        /*
+         * If it's a new function entry, initialize counters to the values
+         * we just got.
+         */
+        dest->f_numcalls = src->f_counts.f_numcalls;
+        dest->f_total_time =
+            INSTR_TIME_GET_MICROSEC(src->f_counts.f_total_time);
+        dest->f_self_time =
+            INSTR_TIME_GET_MICROSEC(src->f_counts.f_self_time);
+    }
+    else
+    {
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        dest->f_numcalls += src->f_counts.f_numcalls;
+        dest->f_total_time +=
+            INSTR_TIME_GET_MICROSEC(src->f_counts.f_total_time);
+        dest->f_self_time +=
+            INSTR_TIME_GET_MICROSEC(src->f_counts.f_self_time);
+    }
+}
+
+/*
+ * pgstat_merge_funcentry: subroutine for pgstat_update_funcstats
+ *
+ * Merges two StatFuncEntry structs
+ */
+static inline void
+pgstat_merge_funcentry(PgStat_StatFuncEntry *dest, PgStat_StatFuncEntry *src,
+                       bool init)
+{
+    if (init)
+    {
+        /*
+         * If it's a new function entry, initialize counters to the values
+         * we just got.
+         */
+        dest->f_numcalls = src->f_numcalls;
+        dest->f_total_time = src->f_total_time;
+        dest->f_self_time = src->f_self_time;
+    }
+    else
+    {
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        dest->f_numcalls += src->f_numcalls;
+        dest->f_total_time += src->f_total_time;
+        dest->f_self_time += src->f_self_time;
+    }
+}
+
+
 
 /* ----------
  * pgstat_vacuum_stat() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *    Remove stats entries of objects that no longer exist.
  * ----------
  */
 void
 pgstat_vacuum_stat(void)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
+    HTAB       *oidtab;
+    dshash_table *dshtable;
+    dshash_seq_status dshstat;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatFuncEntry *funcentry;
-    int            len;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* we don't collect statistics in standalone mode */
+    if (!IsUnderPostmaster)
         return;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* If not done for this transaction, take a snapshot of stats */
+    if (!backend_snapshot_global_stats())
+        return;
 
     /*
      * Read pg_database and make a list of OIDs of all existing databases
      */
-    htab = pgstat_collect_oids(DatabaseRelationId);
+    oidtab = pgstat_collect_oids(DatabaseRelationId);
 
     /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
+     * Search the database hash table for dead databases and drop them
+     * from the hash.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+
+    dshash_seq_init(&dshstat, db_stats, false, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            dbid = dbentry->databaseid;
 
@@ -1057,148 +1045,86 @@ pgstat_vacuum_stat(void)
 
         /* the DB entry for shared tables (with InvalidOid) is never dropped */
         if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
+            hash_search(oidtab, (void *) &dbid, HASH_FIND, NULL) == NULL)
             pgstat_drop_database(dbid);
     }
 
     /* Clean up */
-    hash_destroy(htab);
+    hash_destroy(oidtab);
 
     /*
      * Lookup our own database entry; if not found, nothing more to do.
      */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, NULL);
+    if (!dbentry)
         return;
-
+    
     /*
      * Similarly to above, make a list of all known relations in this DB.
      */
-    htab = pgstat_collect_oids(RelationRelationId);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
+    oidtab = pgstat_collect_oids(RelationRelationId);
 
     /*
      * Check for all tables listed in stats hashtable if they still exist.
+     * The stats cache is useless here, so search the shared hash directly.
      */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+    dshtable = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&dshstat, dshtable, false, true);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            tabid = tabentry->tableid;
 
         CHECK_FOR_INTERRUPTS();
 
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+        if (hash_search(oidtab, (void *) &tabid, HASH_FIND, NULL) != NULL)
             continue;
 
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
-    }
-
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
+        /* Not there, so purge this table */
+        dshash_delete_entry(dshtable, tabentry);
     }
+    dshash_detach(dshtable);
 
     /* Clean up */
-    hash_destroy(htab);
+    hash_destroy(oidtab);
 
     /*
      * Now repeat the above steps for functions.  However, we needn't bother
      * in the common case where no function stats are being collected.
      */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
+    if (dbentry->functions != DSM_HANDLE_INVALID)
     {
-        htab = pgstat_collect_oids(ProcedureRelationId);
+        dshtable = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        oidtab = pgstat_collect_oids(ProcedureRelationId);
 
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
+        dshash_seq_init(&dshstat, dshtable, false, true);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&dshstat)) != NULL)
         {
             Oid            funcid = funcentry->functionid;
 
             CHECK_FOR_INTERRUPTS();
 
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
+            if (hash_search(oidtab, (void *) &funcid, HASH_FIND, NULL) != NULL)
                 continue;
 
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
+            /* Not there, so remove this function */
+            dshash_delete_entry(dshtable, funcentry);
         }
 
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
+        hash_destroy(oidtab);
 
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
+        dshash_detach(dshtable);
     }
+    dshash_release_lock(db_stats, dbentry);
 }
 
 
-/* ----------
+/*
  * pgstat_collect_oids() -
  *
  *    Collect the OIDs of all objects listed in the specified system catalog
- *    into a temporary hash table.  Caller should hash_destroy the result
- *    when done with it.  (However, we make the table in CurrentMemoryContext
- *    so that it will be freed properly in event of an error.)
- * ----------
+ *    into a temporary hash table.  Caller should hash_destroy the result after
+ *    use.  (However, we make the table in CurrentMemoryContext so that it will
+ *    be freed properly in event of an error.)
  */
 static HTAB *
 pgstat_collect_oids(Oid catalogid)
@@ -1241,62 +1167,54 @@ pgstat_collect_oids(Oid catalogid)
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
+ *    Remove entry for the database that we just dropped.
+ *
+ *    If some stats update happens after this, the entry will be re-created,
+ *    but we will still clean up the dead DB eventually via future invocations
+ *    of pgstat_vacuum_stat().
  * ----------
  */
+
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    Assert (OidIsValid(databaseid));
+    Assert(db_stats);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Lookup the database in the hashtable with exclusive lock.
+     */
+    dbentry = pgstat_get_db_entry(databaseid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    /*
+     * If found, remove it (along with the db statfile).
+     */
+    if (dbentry)
+    {
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+        }
+
+        dshash_delete_entry(db_stats, (void *)dbentry);
+    }
 }
 
 
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
-
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
-}
-#endif                            /* NOT_USED */
-
-
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1305,20 +1223,51 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    PgStat_StatDBEntry           *dbentry;
+    pg_stat_table_result_status status;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    Assert(db_stats);
+
+    /*
+     * Lookup the database in the hashtable.  Nothing to do if not there.
+     */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, &status);
+
+    if (!dbentry)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * We simply throw away all the database's table entries by recreating a
+     * new hash table for them.
+     */
+    if (dbentry->tables != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_destroy(t);
+        dbentry->tables = DSM_HANDLE_INVALID;
+    }
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_destroy(t);
+        dbentry->functions = DSM_HANDLE_INVALID;
+    }
+
+    /*
+     * Reset database-level stats, too.  This creates empty hash tables for
+     * tables and functions.
+     */
+    reset_dbentry_counters(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1327,29 +1276,37 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    Assert(db_stats);
 
+    /* Reset the archiver statistics for the cluster. */
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+        memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
+    }
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+        /* Reset the global background writer statistics for the cluster. */
+        memset(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    }
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
+
+    LWLockRelease(StatsLock);
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1358,17 +1315,90 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatDBEntry *dbentry;
+
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    Assert(db_stats);
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    if (!dbentry)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* Set the reset timestamp for the whole database */
+    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Remove object if it exists, ignore it if not */
+    if (type == RESET_TABLE && dbentry->tables != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_delete_key(t, (void *) &objoid);
+    }
+
+    if (type == RESET_FUNCTION && dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_delete_key(t, (void *) &objoid);
+    }
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/*
+ * pgstat_reset_all_counters: subroutine for pgstat_reset_all
+ *
+ * Clear all counters in shared memory.
+ */
+static void
+pgstat_reset_all_counters(void)
+{
+    dshash_seq_status dshstat;
+    PgStat_StatDBEntry           *dbentry;
+
+    Assert(db_stats);
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    dshash_seq_init(&dshstat, db_stats, false, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
+    {
+        /*
+         * We simply throw away all the database's table hashes
+         */
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *t =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(t);
+            dbentry->tables = DSM_HANDLE_INVALID;
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *t =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(t);
+            dbentry->functions = DSM_HANDLE_INVALID;
+        }
+
+        /*
+         * Reset database-level stats, too.  This creates empty hash tables
+         * for tables and functions.
+         */
+        reset_dbentry_counters(dbentry);
+        dshash_release_lock(db_stats, dbentry);
+
+    }
+
+    /*
+     * Reset global counters
+     */
+    memset(shared_globalStats, 0, sizeof(*shared_globalStats));
+    memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+    shared_globalStats->stat_reset_timestamp =
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
+
+    LWLockRelease(StatsLock);
 }
 
 /* ----------
@@ -1382,48 +1412,75 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    /*
+     * Store the last autovacuum time in the database's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
 
-    pgstat_send(&msg, sizeof(msg));
+    dbentry->last_autovac_time = GetCurrentTimestamp();
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report about the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, tableoid, true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->vacuum_count++;
+    }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report about the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1434,9 +1491,14 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
     /*
@@ -1465,114 +1527,228 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, RelationGetRelid(rel), true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
+static int pending_conflict_tablespace = 0;
+static int pending_conflict_lock = 0;
+static int pending_conflict_snapshot = 0;
+static int pending_conflict_bufferpin = 0;
+static int pending_conflict_startup_deadlock = 0;
+
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    pgstat_pending_recoveryconflict = true;
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            pending_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            pending_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            pending_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            pending_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            pending_conflict_startup_deadlock++;
+            break;
+    }
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    pgstat_cleanup_recovery_conflict(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/*
+ * clean up function for pending recovery conflicts
+ */
+static void
+pgstat_cleanup_recovery_conflict(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_conflict_tablespace    += pending_conflict_tablespace;
+    dbentry->n_conflict_lock         += pending_conflict_lock;
+    dbentry->n_conflict_snapshot    += pending_conflict_snapshot;
+    dbentry->n_conflict_bufferpin    += pending_conflict_bufferpin;
+    dbentry->n_conflict_startup_deadlock += pending_conflict_startup_deadlock;
+
+    pending_conflict_tablespace = 0;
+    pending_conflict_lock = 0;
+    pending_conflict_snapshot = 0;
+    pending_conflict_bufferpin = 0;
+    pending_conflict_startup_deadlock = 0;
+
+    pgstat_pending_recoveryconflict = false;
 }
 
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a deadlock detected.
  * --------
  */
+static int pending_deadlocks = 0;
+
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    pending_deadlocks++;
+    pgstat_pending_deadlock = true;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    pgstat_cleanup_deadlock(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/*
+ * clean up function for pending deadlocks
+ */
+static void
+pgstat_cleanup_deadlock(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_deadlocks += pending_deadlocks;
+    pending_deadlocks = 0;
+    pgstat_pending_deadlock = false;
 }
 
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
+static size_t pending_filesize = 0;
+static size_t pending_files = 0;
+
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
-}
+    if (filesize > 0) /* Isn't there a case where filesize is really 0? */
+    {
+        pgstat_pending_tempfile = true;
+        pending_filesize += filesize; /* needs an overflow check */
+        pending_files++;
+    }
 
-
-/* ----------
- * pgstat_ping() -
- *
- *    Send some junk data to the collector to increase traffic.
- * ----------
- */
-void
-pgstat_ping(void)
-{
-    PgStat_MsgDummy msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!pgstat_pending_tempfile)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    pgstat_cleanup_tempfile(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
-/* ----------
- * pgstat_send_inquiry() -
- *
- *    Notify collector that we need fresh data.
- * ----------
+/*
+ * clean up function for temporary files
  */
 static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
+pgstat_cleanup_tempfile(PgStat_StatDBEntry *dbentry)
 {
-    PgStat_MsgInquiry msg;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
+    dbentry->n_temp_bytes += pending_filesize;
+    dbentry->n_temp_files += pending_files;
+    pending_filesize = 0;
+    pending_files = 0;
+    pgstat_pending_tempfile = false;
+
 }
 
-
 /*
  * Initialize function call usage data.
  * Called by the executor before invoking a function.
@@ -1688,9 +1864,6 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
         fs->f_numcalls++;
     fs->f_total_time = f_total;
     INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
 }
 
 
@@ -1712,6 +1885,15 @@ pgstat_initstats(Relation rel)
     Oid            rel_id = rel->rd_id;
     char        relkind = rel->rd_rel->relkind;
 
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+    {
+        /* We're not counting at all */
+        rel->pgstat_info = NULL;
+        return;
+    }
+
     /* We only count stats for things that have storage */
     if (!(relkind == RELKIND_RELATION ||
           relkind == RELKIND_MATVIEW ||
@@ -1723,13 +1905,6 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-    {
-        /* We're not counting at all */
-        rel->pgstat_info = NULL;
-        return;
-    }
-
     /*
      * If we already set up this relation in the current transaction, nothing
      * to do.
@@ -2373,34 +2548,6 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
         rec->tuples_inserted + rec->tuples_updated;
 }
 
-
-/* ----------
- * pgstat_fetch_stat_dbentry() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
- * ----------
- */
-PgStat_StatDBEntry *
-pgstat_fetch_stat_dbentry(Oid dbid)
-{
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
-}
-
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
@@ -2413,47 +2560,28 @@ pgstat_fetch_stat_dbentry(Oid dbid)
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* Lookup our database, then look in its table hash table. */
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, false);
+    if (dbentry == NULL)
+        return NULL;
 
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = backend_get_tab_entry(dbentry, relid, false);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    dbentry = pgstat_fetch_stat_dbentry(InvalidOid, false);
+    if (dbentry == NULL)
+        return NULL;
+
+    tabentry = backend_get_tab_entry(dbentry, relid, false);
+    if (tabentry != NULL)
+        return tabentry;
 
     return NULL;
 }
@@ -2472,18 +2600,14 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     PgStat_StatDBEntry *dbentry;
     PgStat_StatFuncEntry *funcentry = NULL;
 
-    /* load the stats file if needed */
-    backend_read_statsfile();
+    /* Lookup our database, then find the requested function */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_SHARED, NULL);
+    if (dbentry == NULL)
+        return NULL;
 
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
+    funcentry = backend_get_func_etnry(dbentry, func_id, false);
 
+    dshash_release_lock(db_stats, dbentry);
     return funcentry;
 }
 
@@ -2558,9 +2682,11 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    if (!backend_snapshot_global_stats())
+        return NULL;
 
-    return &archiverStats;
+    return snapshot_archiverStats;
 }
 
 
@@ -2575,9 +2701,11 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    if (!backend_snapshot_global_stats())
+        return NULL;
 
-    return &globalStats;
+    return snapshot_globalStats;
 }
 
 
@@ -2767,7 +2895,7 @@ pgstat_initialize(void)
     }
 
     /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -2956,7 +3084,7 @@ pgstat_beshutdown_hook(int code, Datum arg)
      * during failed backend starts might never get counted.)
      */
     if (OidIsValid(MyDatabaseId))
-        pgstat_report_stat(true);
+        pgstat_update_stat(true);
 
     /*
      * Clear my status entry, following the protocol of bumping st_changecount
@@ -3223,7 +3351,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -4140,96 +4269,68 @@ pgstat_get_backend_desc(BackendType backendType)
  * ------------------------------------------------------------
  */
 
-
 /* ----------
- * pgstat_setheader() -
+ * pgstat_update_archiver() -
  *
- *        Set common header fields in a statistics message
+ *    Update the stats data about the WAL file that we successfully archived or
+ *    failed to archive.
  * ----------
  */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
+void
+pgstat_update_archiver(const char *xlog, bool failed)
 {
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
+    if (failed)
     {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
+        /* Failed archival attempt */
+        ++shared_archiverStats->failed_count;
+        memcpy(shared_archiverStats->last_failed_wal, xlog,
+               sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = GetCurrentTimestamp();
+    }
+    else
+    {
+        /* Successful archival operation */
+        ++shared_archiverStats->archived_count;
+        memcpy(shared_archiverStats->last_archived_wal, xlog,
+               sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = GetCurrentTimestamp();
+    }
 }
 
 /* ----------
- * pgstat_send_archiver() -
+ * pgstat_update_bgwriter() -
  *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
+ *        Update bgwriter statistics
  * ----------
  */
 void
-pgstat_send_archiver(const char *xlog, bool failed)
-{
-    PgStat_MsgArchiver msg;
-
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* ----------
- * pgstat_send_bgwriter() -
- *
- *        Send bgwriter statistics to the collector
- * ----------
- */
-void
-pgstat_send_bgwriter(void)
+pgstat_update_bgwriter(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+
+    PgStat_BgWriter *s = &BgWriterStats;
 
     /*
      * This function can be called even if nothing at all has happened. In
      * this case, avoid sending a completely empty message to the stats
      * collector.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += s->timed_checkpoints;
+    shared_globalStats->requested_checkpoints += s->requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += s->checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += s->checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += s->buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += s->buf_written_clean;
+    shared_globalStats->maxwritten_clean += s->maxwritten_clean;
+    shared_globalStats->buf_written_backend += s->buf_written_backend;
+    shared_globalStats->buf_fsync_backend += s->buf_fsync_backend;
+    shared_globalStats->buf_alloc += s->buf_alloc;
+    LWLockRelease(StatsLock);
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4237,302 +4338,15 @@ pgstat_send_bgwriter(void)
     MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
 }
 
-
-/* ----------
- * PgstatCollectorMain() -
- *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, pgstat_sighup_handler);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, pgstat_exit);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    pqsignal(SIGCHLD, SIG_DFL);
-    pqsignal(SIGTTIN, SIG_DFL);
-    pqsignal(SIGTTOU, SIG_DFL);
-    pqsignal(SIGCONT, SIG_DFL);
-    pqsignal(SIGWINCH, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    /*
-     * Identify myself via ps
-     */
-    init_ps_display("stats collector", "", "", "");
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do recognize got_SIGHUP inside
-     * the inner loop, which means that such interrupts will get serviced but
-     * the latch won't get cleared until next time there is a break in the
-     * action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (need_exit)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * need_exit becomes set.
-         */
-        while (!need_exit)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (got_SIGHUP)
-            {
-                got_SIGHUP = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry((PgStat_MsgInquiry *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat((PgStat_MsgTabstat *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge((PgStat_MsgTabpurge *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb((PgStat_MsgDropdb *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter((PgStat_MsgResetcounter *) &msg,
-                                             len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(
-                                                   (PgStat_MsgResetsharedcounter *) &msg,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(
-                                                   (PgStat_MsgResetsinglecounter *) &msg,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac((PgStat_MsgAutovacStart *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum((PgStat_MsgVacuum *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze((PgStat_MsgAnalyze *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver((PgStat_MsgArchiver *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter((PgStat_MsgBgWriter *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat((PgStat_MsgFuncstat *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge((PgStat_MsgFuncpurge *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict((PgStat_MsgRecoveryConflict *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock((PgStat_MsgDeadlock *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile((PgStat_MsgTempFile *) &msg, len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr & WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    exit(0);
-}
-
-
-/* SIGQUIT signal handler for collector process */
-static void
-pgstat_exit(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    need_exit = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
-/* SIGHUP handler for collector process */
-static void
-pgstat_sighup_handler(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    got_SIGHUP = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
 /*
- * Subroutine to clear stats in a database entry
+ * Subroutine to reset stats in a shared database entry
  *
  * Tables and functions hashes are initialized to empty.
  */
 static void
 reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
 {
-    HASHCTL        hash_ctl;
+    dshash_table *tbl;
 
     dbentry->n_xact_commit = 0;
     dbentry->n_xact_rollback = 0;
@@ -4558,20 +4372,17 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
     dbentry->stat_reset_timestamp = GetCurrentTimestamp();
     dbentry->stats_timestamp = 0;
 
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
 
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
+    Assert(dbentry->tables == DSM_HANDLE_INVALID);
+    tbl = dshash_create(area, &dsh_tblparams, 0);
+    dbentry->tables = dshash_get_hash_table_handle(tbl);
+    dshash_detach(tbl);
+
+    Assert(dbentry->functions == DSM_HANDLE_INVALID);
+    /* we create function hash as needed */
+
+    dbentry->snapshot_tables = NULL;
+    dbentry->snapshot_functions = NULL;
 }
 
 /*
@@ -4580,47 +4391,76 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
  * Else, return NULL.
  */
 static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
+pgstat_get_db_entry(Oid databaseid, int op, pg_stat_table_result_status *status)
 {
     PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
+    bool        nowait = ((op & PGSTAT_FETCH_NOWAIT) != 0);
+    bool        lock_acquired = true;
+    bool        found = true;
 
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
+    if (!IsUnderPostmaster)
         return NULL;
 
-    /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
-     */
-    if (!found)
-        reset_dbentry_counters(result);
+    /* Lookup or create the hash table entry for this database */
+    if (op & PGSTAT_FETCH_EXCLUSIVE)
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_or_insert_extended(db_stats, &databaseid,
+                                           &found, nowait);
+        if (result == NULL)
+            lock_acquired = false;
+        else if (!found)
+        {
+            /*
+             * If not found, initialize the new one.  This creates an empty
+             * hash table for tables; the function hash is created on demand.
+             */
+            reset_dbentry_counters(result);
+        }
+    }
+    else
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_extended(db_stats, &databaseid,
+                                 &lock_acquired, true, nowait);
+        if (result == NULL)
+            found = false;
+    }
+
+    /* Set return status if requested */
+    if (status)
+    {
+        if (!lock_acquired)
+        {
+            Assert(nowait);
+            *status = PGSTAT_ENTRY_LOCK_FAILED;
+        }
+        else if (!found)
+            *status = PGSTAT_ENTRY_NOT_FOUND;
+        else
+            *status = PGSTAT_ENTRY_FOUND;
+    }
 
     return result;
 }
 
-
 /*
  * Lookup the hash table entry for the specified table. If no hash
  * table entry exists, initialize it, if the create parameter is true.
  * Else, return NULL.
  */
 static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create)
 {
     PgStat_StatTabEntry *result;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
 
     /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
+    if (create)
+        result = (PgStat_StatTabEntry *)
+            dshash_find_or_insert(table, &tableoid, &found);
+    else
+        found = ((result = dshash_find(table, &tableoid, false)) != NULL);
 
     if (!create && !found)
         return NULL;
@@ -4656,29 +4496,23 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
 
 /* ----------
  * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ *        Write the global statistics file, as well as DB files.
  * ----------
  */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+void
+pgstat_write_statsfiles(void)
 {
-    HASH_SEQ_STATUS hstat;
+    dshash_seq_status hstat;
     PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
 
+    /* should be called only from the postmaster */
+    Assert(!IsUnderPostmaster);
+
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
     /*
@@ -4697,7 +4531,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
 
     /*
      * Write the file header --- currently just a format ID.
@@ -4709,32 +4543,29 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Write global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write archiver stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Walk through the database table.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_init(&hstat, db_stats, false, false);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&hstat)) != NULL)
     {
         /*
          * Write out the table and function stats for this DB into the
          * appropriate per-DB stat file, if required.
          */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
+        pgstat_write_db_statsfile(dbentry);
 
         /*
          * Write out the DB entry. We don't write the tables or functions
@@ -4777,16 +4608,6 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
 }
 
 /*
@@ -4794,15 +4615,14 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
  * of length len.
  */
 static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
+get_dbstat_filename(bool tempname, Oid databaseid,
                     char *filename, int len)
 {
     int            printed;
 
     /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
     printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
                        databaseid,
                        tempname ? "tmp" : "stat");
     if (printed >= len)
@@ -4820,10 +4640,10 @@ get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
  * ----------
  */
 static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
+pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry)
 {
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
+    dshash_seq_status tstat;
+    dshash_seq_status fstat;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatFuncEntry *funcentry;
     FILE       *fpout;
@@ -4832,9 +4652,10 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     int            rc;
     char        tmpfile[MAXPGPATH];
     char        statfile[MAXPGPATH];
+    dshash_table *tbl;
 
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -4861,23 +4682,30 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     /*
      * Walk through the database's access stats per table.
      */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&tstat, tbl, false, false);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&tstat)) != NULL)
     {
         fputc('T', fpout);
         rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_detach(tbl);
 
     /*
      * Walk through the database's function stats table.
      */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+    if (dbentry->functions != DSM_HANDLE_INVALID)
     {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
+        tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_seq_init(&fstat, tbl, false, false);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&fstat)) != NULL)
+        {
+            fputc('F', fpout);
+            rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+            (void) rc;                /* we'll check for error with ferror */
+        }
+        dshash_detach(tbl);
     }
 
     /*
@@ -4912,47 +4740,30 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
 }
 
 /* ----------
  * pgstat_read_statsfiles() -
  *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ *    Reads in some existing statistics collector files into the shared stats
+ *    hash.
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+void
+pgstat_read_statsfiles(void)
 {
     PgStat_StatDBEntry *dbentry;
     PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    dshash_table *tblstats = NULL;
+    dshash_table *funcstats = NULL;
+
+    /* should be called only from the postmaster */
+    Assert(!IsUnderPostmaster);
 
     /*
      * The tables will live in pgStatLocalContext.
@@ -4960,28 +4771,18 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     pgstat_setup_memcxt();
 
     /*
-     * Create the DB hashtable
+     * Create the DB hashtable and global stats area
      */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
+    /* Hold the lock so that no other process sees empty stats */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    pgstat_create_shared_stats();
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp = shared_globalStats->stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
@@ -4995,11 +4796,12 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        LWLockRelease(StatsLock);
+        return;
     }
 
     /*
@@ -5008,7 +4810,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
@@ -5016,11 +4818,12 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     /*
      * Read global stats struct
      */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+        sizeof(*shared_globalStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
+        memset(shared_globalStats, 0, sizeof(*shared_globalStats));
         goto done;
     }
 
@@ -5031,17 +4834,17 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
      * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
      * an unusual scenario.
      */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
+    shared_globalStats->stats_timestamp = 0;
 
     /*
      * Read archiver stats struct
      */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+        sizeof(*shared_archiverStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
+        memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
         goto done;
     }
 
@@ -5061,7 +4864,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
                           fpin) != offsetof(PgStat_StatDBEntry, tables))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5070,21 +4873,23 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 /*
                  * Add to the DB hash
                  */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
+                dbentry = (PgStat_StatDBEntry *)
+                    dshash_find_or_insert(db_stats, (void *) &dbbuf.databaseid,
+                                          &found);
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    dshash_release_lock(db_stats, dbentry);
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
                 memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
+                dbentry->tables = DSM_HANDLE_INVALID;
+                dbentry->functions = DSM_HANDLE_INVALID;
+                dbentry->snapshot_tables = NULL;
+                dbentry->snapshot_functions = NULL;
 
                 /*
                  * In the collector, disregard the timestamp we read from the
@@ -5092,54 +4897,26 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                  * stats file immediately upon the first request from any
                  * backend.
                  */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+                dbentry->stats_timestamp = 0;
 
                 /*
                  * If requested, read the data from the database-specific
                  * file.  Otherwise we just leave the hashtables empty.
                  */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
+                tblstats = dshash_create(area, &dsh_tblparams, 0);
+                dbentry->tables = dshash_get_hash_table_handle(tblstats);
+                /* we don't create the function hash at present */
+                dshash_release_lock(db_stats, dbentry);
+                pgstat_read_db_statsfile(dbentry->databaseid,
+                                         tblstats, funcstats);
+                dshash_detach(tblstats);
                 break;
 
             case 'E':
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5147,36 +4924,62 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     }
 
 done:
+    LWLockRelease(StatsLock);
     FreeFile(fpin);
 
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 
-    return dbhash;
+    return;
 }
 
 
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+void
+StatsShmemInit(void)
+{
+    bool    found;
+
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+
+        /* Load saved data if any */
+        pgstat_read_statsfiles();
+
+        /* needs to be called before dsm shutdown */
+        before_shmem_exit(pgstat_postmaster_shutdown, (Datum) 0);
+    }
+}
+
+static void
+pgstat_postmaster_shutdown(int code, Datum arg)
+{
+    /* we trash the stats on crash */
+    if (code == 0)
+        pgstat_write_statsfiles();
+}
+
 /* ----------
  * pgstat_read_db_statsfile() -
  *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
+ *    Reads in the permanent statistics collector file and creates shared
+ *    statistics tables. The file is removed after reading.
  * ----------
  */
 static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
+pgstat_read_db_statsfile(Oid databaseid,
+                         dshash_table *tabhash, dshash_table *funchash)
 {
     PgStat_StatTabEntry *tabentry;
     PgStat_StatTabEntry tabbuf;
@@ -5187,7 +4990,10 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     bool        found;
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
+    /* should be called only from the postmaster */
+    Assert(!IsUnderPostmaster);
+
+    get_dbstat_filename(false, databaseid, statfile, MAXPGPATH);
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
@@ -5201,7 +5007,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
@@ -5214,7 +5020,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
@@ -5234,7 +5040,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
                           fpin) != sizeof(PgStat_StatTabEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5246,19 +5052,21 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (tabhash == NULL)
                     break;
 
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
+                tabentry = (PgStat_StatTabEntry *)
+                    dshash_find_or_insert(tabhash,
+                                          (void *) &tabbuf.tableid, &found);
 
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    dshash_release_lock(tabhash, tabentry);
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
                 memcpy(tabentry, &tabbuf, sizeof(tabbuf));
+                dshash_release_lock(tabhash, tabentry);
                 break;
 
                 /*
@@ -5268,7 +5076,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
                           fpin) != sizeof(PgStat_StatFuncEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5280,19 +5088,20 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (funchash == NULL)
                     break;
 
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
+                funcentry = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert(funchash,
+                                          (void *) &funcbuf.functionid, &found);
 
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
                 memcpy(funcentry, &funcbuf, sizeof(funcbuf));
+                dshash_release_lock(funchash, funcentry);
                 break;
 
                 /*
@@ -5302,7 +5111,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5312,276 +5121,290 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
 done:
     FreeFile(fpin);
 
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 }
 
 /* ----------
- * pgstat_read_db_statsfile_timestamp() -
+ * backend_clean_snapshot_callback() -
  *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
+ *    This is usually called with arg = NULL when the memory context where the
+ *    current snapshot was taken is reset or deleted. Don't bother releasing
+ *    memory in that case.
  * ----------
  */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
+static void
+backend_clean_snapshot_callback(void *arg)
 {
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    if (arg != NULL)
     {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
+        /* explicitly called, so explicitly free resources */
+        if (snapshot_globalStats)
+            pfree(snapshot_globalStats);
 
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
+        if (snapshot_archiverStats)
+            pfree(snapshot_archiverStats);
 
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
+        if (snapshot_db_stats)
         {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
+            HASH_SEQ_STATUS seq;
+            PgStat_StatDBEntry *dbent;
 
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
+            hash_seq_init(&seq, snapshot_db_stats);
+            while ((dbent = hash_seq_search(&seq)) != NULL)
+            {
+                if (dbent->snapshot_tables)
+                    hash_destroy(dbent->snapshot_tables);
+                if (dbent->snapshot_functions)
+                    hash_destroy(dbent->snapshot_functions);
+            }
+            hash_destroy(snapshot_db_stats);
         }
     }
 
-done:
-    FreeFile(fpin);
-    return true;
+    /* mark the resources as not allocated */
+    snapshot_globalStats = NULL;
+    snapshot_archiverStats = NULL;
+    snapshot_db_stats = NULL;
 }
 
 /*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
+ * create_local_stats_hash() -
+ *
+ * Creates a dynahash used for table/function stats cache.
  */
-static void
-backend_read_statsfile(void)
+static HTAB *
+create_local_stats_hash(const char *name, size_t keysize, size_t entrysize,
+                        int nentries)
 {
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
+    HTAB *result;
+    HASHCTL ctl;
 
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
+    /* Create the hash in the stats context */
+    ctl.keysize        = keysize;
+    ctl.entrysize    = entrysize;
+    ctl.hcxt        = stats_cxt;
+    result = hash_create(name, nentries, &ctl,
+                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    return result;
 }
 
+/*
+ * snapshot_statentry() - Find an entry in the source dshash.
+ *
+ * Returns the entry for key or NULL if not found. If dest is not NULL, uses
+ * *dest as a local cache, which is created with the same shape as the given
+ * dshash when *dest is NULL. In both cases the result is cached in the hash
+ * and the same entry is returned to subsequent calls for the same key.
+ *
+ * Otherwise the returned entry is a copy palloc'ed in the current memory
+ * context. Its content may differ for every request.
+ *
+ * If dshash is NULL, temporarily attaches dsh_handle instead.
+ */
+static void *
+snapshot_statentry(HTAB **dest, const char *hashname,
+                   dshash_table *dshash, dshash_table_handle dsh_handle,
+                   const dshash_parameters *dsh_params, Oid key)
+{
+    void *lentry = NULL;
+    size_t keysize = dsh_params->key_size;
+    size_t entrysize = dsh_params->entry_size;
+
+    if (dest)
+    {
+        /* caches the result entry */
+        bool found;
+
+        /*
+         * Create new hash with arbitrary initial entries since we don't know
+         * how this hash will grow.
+         */
+        if (!*dest)
+        {
+            Assert(hashname);
+            *dest = create_local_stats_hash(hashname, keysize, entrysize, 32);
+        }
+
+        lentry = hash_search(*dest, &key, HASH_ENTER, &found);
+        if (!found)
+        {
+            dshash_table *t = dshash;
+            void *sentry;
+
+            if (!t)
+                t = dshash_attach(area, dsh_params, dsh_handle, NULL);
+
+            sentry = dshash_find(t, &key, false);
+
+            /*
+             * We expect that the stats for specified database exists in most
+             * cases.
+             */
+            if (!sentry)
+            {
+                hash_search(*dest, &key, HASH_REMOVE, NULL);
+                if (!dshash)
+                    dshash_detach(t);
+                return NULL;
+            }
+            memcpy(lentry, sentry, entrysize);
+            dshash_release_lock(t, sentry);
+
+            if (!dshash)
+                dshash_detach(t);
+        }
+    }
+    else
+    {
+        /*
+         * The caller doesn't want caching. Just make a copy of the entry and
+         * return it.
+         */
+        dshash_table *t = dshash;
+        void *sentry;
+
+        if (!t)
+            t = dshash_attach(area, dsh_params, dsh_handle, NULL);
+
+        sentry = dshash_find(t, &key, false);
+        if (sentry)
+        {
+            lentry = palloc(entrysize);
+            memcpy(lentry, sentry, entrysize);
+            dshash_release_lock(t, sentry);
+        }
+
+        if (!dshash)
+            dshash_detach(t);
+    }
+
+    return lentry;
+}
+
+/*
+ * backend_snapshot_global_stats() -
+ *
+ * Makes a local copy of global stats if not already done.  They will be kept
+ * until pgstat_clear_snapshot() is called or the end of the current memory
+ * context (typically TopTransactionContext).  Returns false if the shared
+ * stats have not been created yet.
+ */
+static bool
+backend_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext = CurrentMemoryContext;
+    MemoryContextCallback *mcxt_cb;
+
+    /* Nothing to do if already done */
+    if (snapshot_globalStats)
+        return true;
+
+    Assert(snapshot_archiverStats == NULL);
+
+    /*
+     * The snapshot lives within the current top transaction if any, or the
+     * current memory context lifetime otherwise.
+     */
+    if (IsTransactionState())
+        oldcontext = MemoryContextSwitchTo(TopTransactionContext);
+
+    /* Remember for stats memory allocation later */
+    stats_cxt = CurrentMemoryContext;
+
+    /* global stats can be just copied  */
+    snapshot_globalStats = palloc(sizeof(PgStat_GlobalStats));
+    memcpy(snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+
+    snapshot_archiverStats = palloc(sizeof(PgStat_ArchiverStats));
+    memcpy(snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+
+    /* set the timestamp of this snapshot */
+    snapshot_globalStats->stats_timestamp = GetCurrentTimestamp();
+
+    /* register callback to clear snapshot */
+    mcxt_cb = (MemoryContextCallback *)palloc(sizeof(MemoryContextCallback));
+    mcxt_cb->func = backend_clean_snapshot_callback;
+    mcxt_cb->arg = NULL;
+    MemoryContextRegisterResetCallback(CurrentMemoryContext, mcxt_cb);
+    MemoryContextSwitchTo(oldcontext);
+
+    return true;
+}
+
+/* ----------
+ * pgstat_fetch_stat_dbentry() -
+ *
+ *    Find database stats entry on backends. The returned entries are cached
+ *    until transaction end. If oneshot is true, they are not cached and are
+ *    returned in palloc'ed memory.
+ */
+PgStat_StatDBEntry *
+pgstat_fetch_stat_dbentry(Oid dbid, bool oneshot)
+{
+    /* take a local snapshot if we don't have one */
+    char *hashname = "local database stats hash";
+    PgStat_StatDBEntry *dbentry;
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    /* If not done for this transaction, take a snapshot of global stats */
+    if (!backend_snapshot_global_stats())
+        return NULL;
+
+    dbentry = snapshot_statentry(oneshot ? NULL : &snapshot_db_stats,
+                                 hashname, db_stats, 0, &dsh_dbparams,
+                                 dbid);
+    
+    return dbentry;
+}
+
+/* ----------
+ * backend_get_tab_entry() -
+ *
+ *    Find table stats entry on backends. The returned entries are cached until
+ *    transaction end. If oneshot is true, they are not cached and are returned
+ *    in palloc'ed memory.
+ */
+PgStat_StatTabEntry *
+backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid reloid, bool oneshot)
+{
+    /* take a local snapshot if we don't have one */
+    char *hashname = "local table stats hash";
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    return snapshot_statentry(oneshot ? NULL : &dbent->snapshot_tables,
+                              hashname, NULL, dbent->tables, &dsh_tblparams,
+                              reloid);
+}
+
+/* ----------
+ * backend_get_func_entry() -
+ *
+ *    Find function stats entry on backends. The returned entries are cached
+ *    until transaction end. If oneshot is true, they are not cached and are
+ *    returned in palloc'ed memory.
+ */
+static PgStat_StatFuncEntry *
+backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot)
+{
+    char *hashname = "local function stats hash";
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    if (dbent->functions == DSM_HANDLE_INVALID)
+        return NULL;
+
+    return snapshot_statentry(oneshot ? NULL : &dbent->snapshot_functions,
+                              hashname, NULL, dbent->functions, &dsh_funcparams,
+                              funcid);
+}
 
 /* ----------
  * pgstat_setup_memcxt() -
@@ -5612,6 +5435,8 @@ pgstat_setup_memcxt(void)
 void
 pgstat_clear_snapshot(void)
 {
+    int param = 0;    /* only the address is significant */
+
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
         MemoryContextDelete(pgStatLocalContext);
@@ -5621,717 +5446,112 @@ pgstat_clear_snapshot(void)
     pgStatDBHash = NULL;
     localBackendStatusTable = NULL;
     localNumBackends = 0;
+
+    /*
+     * the parameter informs the function that it is not being called from
+     * a MemoryContextCallback
+     */
+    backend_clean_snapshot_callback(&param);
 }
 
 
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
+static bool
+pgstat_update_tabentry(dshash_table *tabhash, PgStat_TableStatus *stat,
+                       bool nowait)
 {
-    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    bool    found;
 
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
+    if (tabhash == NULL)
+        return false;
 
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
+    tabentry = (PgStat_StatTabEntry *)
+        dshash_find_or_insert_extended(tabhash, (void *) &(stat->t_id),
+                                       &found, nowait);
 
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
+    /* failed to acquire lock */
+    if (tabentry == NULL)
+        return false;
+
+    if (!found)
     {
         /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
+         * If it's a new table entry, initialize counters to the values we
+         * just got.
          */
+        tabentry->numscans = stat->t_counts.t_numscans;
+        tabentry->tuples_returned = stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched = stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted = stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated = stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted = stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated = stat->t_counts.t_tuples_hot_updated;
+        tabentry->n_live_tuples = stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples = stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze = stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched = stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit = stat->t_counts.t_blocks_hit;
+
+        tabentry->vacuum_timestamp = 0;
+        tabentry->vacuum_count = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_count = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->analyze_count = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+        tabentry->autovac_analyze_count = 0;
     }
-    else if (msg->clock_time < dbentry->stats_timestamp)
+    else
     {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
         /*
-         * Add per-table stats to the per-database entry, too.
+         * Otherwise add the values to the existing entry.
          */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetshared() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
+        tabentry->numscans += stat->t_counts.t_numscans;
+        tabentry->tuples_returned += stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched += stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted += stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated += stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted += stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated += stat->t_counts.t_tuples_hot_updated;
+        /* If table was truncated, first reset the live/dead counters */
+        if (stat->t_counts.t_truncated)
         {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
+            tabentry->n_live_tuples = 0;
+            tabentry->n_dead_tuples = 0;
         }
+        tabentry->n_live_tuples += stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples += stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze += stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched += stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit += stat->t_counts.t_blocks_hit;
     }
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
+
+    dshash_release_lock(tabhash, tabentry);
+
+    return true;
 }
 
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
 static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
+pgstat_update_dbentry(PgStat_StatDBEntry *dbentry, PgStat_TableStatus *stat)
 {
     /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
+     * Add per-table stats to the per-database entry, too.
      */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
+    dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
+    dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
+    dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
+    dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
+    dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
+    dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
+    dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
 }
 
+
 /*
  * Convert a potentially unsafely truncated activity string (see
  * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 688f462e7d..a7d6ddeac7 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -254,7 +254,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -1293,12 +1292,6 @@ PostmasterMain(int argc, char *argv[])
 
     whereToSendOutput = DestNone;
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1751,11 +1744,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = pgarch_start();
@@ -2580,8 +2568,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -2912,8 +2898,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = pgarch_start();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -2980,13 +2964,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3063,22 +3040,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3305,7 +3266,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, stats collector, or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3525,22 +3486,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3744,8 +3689,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3784,8 +3727,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -3985,8 +3927,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -4959,18 +4899,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5083,12 +5011,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index b20f6c379c..20cf33354a 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -77,9 +77,6 @@ static bool is_checksummed_file(const char *fullpath, const char *filename);
 /* Was the backup currently in-progress initiated in recovery mode? */
 static bool backup_started_in_recovery = false;
 
-/* Relative path of temporary statistics directory */
-static char *statrelpath = NULL;
-
 /*
  * Size of each block sent into the tar stream for larger files.
  */
@@ -121,13 +118,6 @@ static bool noverify_checksums = false;
  */
 static const char *excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    PG_STAT_TMP_DIR,
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another master. See backup.sgml
@@ -223,11 +213,8 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -254,18 +241,6 @@ perform_base_backup(basebackup_options *opt)
 
         SendXlogRecPtrResult(startptr, starttli);
 
-        /*
-         * Calculate the relative path of temporary statistics directory in
-         * order to skip the files which are located in that directory later.
-         */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
-
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
         ti->size = opt->progress ? sendDir(".", 1, true, tablespaces, true) : -1;
@@ -1174,17 +1149,6 @@ sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces,
         if (excludeFound)
             continue;
 
-        /*
-         * Exclude contents of directory specified by statrelpath if not set
-         * to the default (pg_stat_tmp) which is caught in the loop above.
-         */
-        if (statrelpath != NULL && strcmp(pathbuf, statrelpath) == 0)
-        {
-            elog(DEBUG1, "contents of directory \"%s\" excluded from backup", statrelpath);
-            size += _tarWriteDir(pathbuf, basepathlen, &statbuf, sizeonly);
-            continue;
-        }
-
         /*
          * We can skip pg_wal, the WAL segments need to be fetched from the
          * WAL archive anyway. But include it as an empty directory anyway, so
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 6e420d893c..862582da23 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -127,7 +127,7 @@ finish_sync_worker(void)
     if (IsTransactionState())
     {
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(true);
     }
 
     /* And flush all writes. */
@@ -143,6 +143,9 @@ finish_sync_worker(void)
     /* Find the main apply worker and signal it. */
     logicalrep_worker_wakeup(MyLogicalRepWorker->subid, InvalidOid);
 
+    /* clean up retained statistics */
+    pgstat_update_stat(true);
+
     /* Stop gracefully */
     proc_exit(0);
 }
@@ -533,7 +536,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
     if (started_tx)
     {
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(false);
     }
 }
 
@@ -876,7 +879,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
                                            MyLogicalRepWorker->relstate,
                                            MyLogicalRepWorker->relstate_lsn);
                 CommitTransactionCommand();
-                pgstat_report_stat(false);
+                pgstat_update_stat(false);
 
                 /*
                  * We want to do the table data sync in a single transaction.
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 277da69fa6..087850d089 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -491,7 +491,7 @@ apply_handle_commit(StringInfo s)
         replorigin_session_origin_timestamp = commit_data.committime;
 
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(false);
 
         store_flush_position(commit_data.end_lsn);
     }
@@ -1324,6 +1324,8 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
             }
 
             send_feedback(last_received, requestReply, requestReply);
+
+            pgstat_update_stat(false);
         }
     }
 }
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 01eabe5706..e794a81c4c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1984,7 +1984,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                BgWriterStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2092,7 +2092,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2282,7 +2282,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2290,7 +2290,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
 
diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index edee89c116..18e73b0288 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -197,6 +197,15 @@ dsm_postmaster_startup(PGShmemHeader *shim)
     dsm_control->maxitems = maxitems;
 }
 
+/*
+ * Clear the dsm_init_done flag when a child process starts.
+ */
+void
+dsm_child_init(void)
+{
+    dsm_init_done = false;
+}
+
 /*
  * Determine whether the control segment from the previous postmaster
  * invocation still exists.  If so, remove the dynamic shared memory
@@ -423,6 +432,15 @@ dsm_set_control_handle(dsm_handle h)
 }
 #endif
 
+/*
+ * Return whether the DSM facility is available in this process.
+ */
+bool
+dsm_is_available(void)
+{
+    return dsm_control != NULL;
+}
+
 /*
  * Create a new dynamic shared memory segment.
  *
@@ -440,8 +458,7 @@ dsm_create(Size size, int flags)
     uint32        i;
     uint32        nitems;
 
-    /* Unsafe in postmaster (and pointless in a stand-alone backend). */
-    Assert(IsUnderPostmaster);
+    Assert(dsm_is_available());
 
     if (!dsm_init_done)
         dsm_backend_startup();
@@ -537,8 +554,7 @@ dsm_attach(dsm_handle h)
     uint32        i;
     uint32        nitems;
 
-    /* Unsafe in postmaster (and pointless in a stand-alone backend). */
-    Assert(IsUnderPostmaster);
+    Assert(dsm_is_available());
 
     if (!dsm_init_done)
         dsm_backend_startup();
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 0c86a581c0..cce6d3ffa2 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -150,6 +150,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
         size = add_size(size, BackendRandomShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -282,8 +283,13 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 
     /* Initialize dynamic shared memory facilities. */
     if (!IsUnderPostmaster)
+    {
         dsm_postmaster_startup(shim);
 
+        /* The stats collector uses dynamic shared memory */
+        StatsShmemInit();
+    }
+
     /*
      * Now give loadable modules a chance to set up their shmem allocations
      */
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index a6fda81feb..c46bb8d057 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -521,6 +521,9 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_TBM, "tbm");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
+    LWLockRegisterTranche(LWTRANCHE_STATS_DSA, "stats table dsa");
+    LWLockRegisterTranche(LWTRANCHE_STATS_DB, "db stats");
+    LWLockRegisterTranche(LWTRANCHE_STATS_FUNC_TABLE, "table/func stats");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index e6025ecedb..798af9f168 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -50,3 +50,4 @@ OldSnapshotTimeMapLock                42
 BackendRandomLock                    43
 LogicalRepWorkerLock                44
 CLogTruncationLock                    45
+StatsLock                            46
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index a3b9757565..ee4e43331b 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3146,6 +3146,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_update_stat(true);
+    }
 }
 
 
@@ -3720,6 +3726,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4160,9 +4167,17 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
-                ProcessCompletedNotifies();
-                pgstat_report_stat(false);
+                long stats_timeout;
 
+                ProcessCompletedNotifies();
+
+                stats_timeout = pgstat_update_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4197,7 +4212,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4205,6 +4220,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index e95e347184..eca801eeed 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -33,7 +33,7 @@
 #define UINT32_ACCESS_ONCE(var)         ((uint32)(*((volatile uint32 *)&(var))))
 
 /* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
+extern PgStat_BgWriter bgwriterStats;
 
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
@@ -1176,7 +1176,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_xact_commit);
@@ -1192,7 +1192,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_xact_rollback);
@@ -1208,7 +1208,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_blocks_fetched);
@@ -1224,7 +1224,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_blocks_hit);
@@ -1240,7 +1240,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_returned);
@@ -1256,7 +1256,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_fetched);
@@ -1272,7 +1272,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_inserted);
@@ -1288,7 +1288,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_updated);
@@ -1304,7 +1304,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_deleted);
@@ -1319,7 +1319,7 @@ pg_stat_get_db_stat_reset_time(PG_FUNCTION_ARGS)
     TimestampTz result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = dbentry->stat_reset_timestamp;
@@ -1337,7 +1337,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = dbentry->n_temp_files;
@@ -1353,7 +1353,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = dbentry->n_temp_bytes;
@@ -1368,7 +1368,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_tablespace);
@@ -1383,7 +1383,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_lock);
@@ -1398,7 +1398,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_snapshot);
@@ -1413,7 +1413,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_bufferpin);
@@ -1428,7 +1428,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_startup_deadlock);
@@ -1443,7 +1443,7 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (
@@ -1463,7 +1463,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_deadlocks);
@@ -1479,7 +1479,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     PgStat_StatDBEntry *dbentry;
 
     /* convert counter from microsec to millisec for display */
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = ((double) dbentry->n_block_read_time) / 1000.0;
@@ -1495,7 +1495,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     PgStat_StatDBEntry *dbentry;
 
     /* convert counter from microsec to millisec for display */
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = ((double) dbentry->n_block_write_time) / 1000.0;
@@ -1850,6 +1850,9 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     /* Get statistics about the archiver process */
     archiver_stats = pgstat_fetch_stat_archiver();
 
+    if (archiver_stats == NULL)
+        PG_RETURN_NULL();
+
     /* Fill values and NULLs */
     values[0] = Int64GetDatum(archiver_stats->archived_count);
     if (*(archiver_stats->last_archived_wal) == '\0')
@@ -1879,6 +1882,5 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
         values[6] = TimestampTzGetDatum(archiver_stats->stat_reset_timestamp);
 
     /* Returns the record as Datum */
-    PG_RETURN_DATUM(HeapTupleGetDatum(
-                                      heap_form_tuple(tupdesc, values, nulls)));
+    PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
 }
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index c6939779b9..1377bbbbdb 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 4f1d2a0d28..1e4fa89135 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -72,6 +72,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -628,6 +629,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1239,6 +1242,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 0327b295da..cfdffbca2b 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -188,7 +188,6 @@ static bool check_autovacuum_max_workers(int *newval, void **extra, GucSource so
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static void assign_effective_io_concurrency(int newval, void *extra);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -3766,17 +3765,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -10725,35 +10713,6 @@ assign_effective_io_concurrency(int newval, void *extra)
 #endif                            /* USE_PREFETCH */
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 3fe257c53f..58cb38e00d 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -501,7 +501,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index ab5cb7f0c1..f13b2dde6b 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -217,7 +217,6 @@ static const char *const subdirs[] = {
     "pg_replslot",
     "pg_tblspc",
     "pg_stat",
-    "pg_stat_tmp",
     "pg_xact",
     "pg_logical",
     "pg_logical/snapshots",
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 2211d90c6f..e6f4d30658 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index d6b32c070c..2e74ec9f60 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
@@ -402,7 +403,6 @@ typedef enum
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
-
     NUM_AUXPROCTYPES            /* Must be last! */
 } AuxProcType;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f1c10d16b8..e5f912cf71 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -13,6 +13,7 @@
 
 #include "datatype/timestamp.h"
 #include "fmgr.h"
+#include "lib/dshash.h"
 #include "libpq/pqcomm.h"
 #include "port/atomics.h"
 #include "portability/instr_time.h"
@@ -30,9 +31,6 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
-#define PG_STAT_TMP_DIR        "pg_stat_tmp"
-
 /* Values for track_functions GUC variable --- order is significant! */
 typedef enum TrackFunctionsLevel
 {
@@ -41,32 +39,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -115,13 +87,6 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_blocks_hit;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -180,271 +145,23 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
 /* ----------
- * PgStat_MsgHdr                The common message header
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgHdr
+typedef struct PgStat_BgWriter
 {
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter buf_alloc;
+    PgStat_Counter checkpoint_write_time; /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+} PgStat_BgWriter;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
@@ -485,79 +202,6 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgAutovacStart msg_autovacuum;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
  * Statistic collector data structures follow
  *
@@ -601,10 +245,13 @@ typedef struct PgStat_StatDBEntry
 
     /*
      * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * the handles and pointers out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables;
+    dshash_table_handle functions;
+    /* for snapshot tables */
+    HTAB *snapshot_tables;
+    HTAB *snapshot_functions;
 } PgStat_StatDBEntry;
 
 
@@ -1141,7 +788,7 @@ extern char *pgstat_stat_filename;
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1153,34 +800,20 @@ extern PgStat_Counter pgStatBlockWriteTime;
  * Functions called from postmaster
  * ----------
  */
-extern Size BackendStatusShmemSize(void);
-extern void CreateSharedBackendStatus(void);
-
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
-
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_update_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
-extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
-extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
-
+extern void pgstat_reset_shared_counters(const char *target);
+extern void pgstat_reset_single_counter(Oid objectid,
+                                        PgStat_Single_Reset_Type type);
 extern void pgstat_report_autovac(Oid dboid);
 extern void pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples);
@@ -1191,6 +824,8 @@ extern void pgstat_report_analyze(Relation rel,
 extern void pgstat_report_recovery_conflict(int reason);
 extern void pgstat_report_deadlock(void);
 
+extern void pgstat_clear_snapshot(void);
+
 extern void pgstat_initialize(void);
 extern void pgstat_bestart(void);
 
@@ -1218,6 +853,9 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 extern void pgstat_initstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
+extern PgStat_StatDBEntry *backend_get_db_entry(Oid dbid, bool oneshot);
+extern HTAB *backend_snapshot_all_db_entries(void);
+extern PgStat_StatTabEntry *backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid relid, bool oneshot);
 
 /* ----------
  * pgstat_report_wait_start() -
@@ -1337,15 +975,15 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                           void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
+extern void pgstat_update_archiver(const char *xlog, bool failed);
+extern void pgstat_update_bgwriter(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
  * generate the pgstat* views.
  * ----------
  */
-extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
+extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid, bool oneshot);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
@@ -1354,4 +992,14 @@ extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
 
+/* File input/output functions  */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
+
+/* For shared memory allocation/initialization */
+extern Size BackendStatusShmemSize(void);
+extern void CreateSharedBackendStatus(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/dsm.h b/src/include/storage/dsm.h
index b4654cb5ca..379f0bc5c0 100644
--- a/src/include/storage/dsm.h
+++ b/src/include/storage/dsm.h
@@ -26,6 +26,7 @@ typedef struct dsm_segment dsm_segment;
 struct PGShmemHeader;            /* avoid including pg_shmem.h */
 extern void dsm_cleanup_using_control_segment(dsm_handle old_control_handle);
 extern void dsm_postmaster_startup(struct PGShmemHeader *);
+extern void dsm_child_init(void);
 extern void dsm_backend_shutdown(void);
 extern void dsm_detach_all(void);
 
@@ -33,6 +34,8 @@ extern void dsm_detach_all(void);
 extern void dsm_set_control_handle(dsm_handle h);
 #endif
 
+extern bool dsm_is_available(void);
+
 /* Functions that create or remove mappings. */
 extern dsm_segment *dsm_create(Size size, int flags);
 extern dsm_segment *dsm_attach(dsm_handle h);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index b2dcb73287..4cb628b15f 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -219,6 +219,9 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SHARED_TUPLESTORE,
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
+    LWTRANCHE_STATS_DSA,
+    LWTRANCHE_STATS_DB,
+    LWTRANCHE_STATS_FUNC_TABLE,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index dcc7307c16..b8a56645b6 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
diff --git a/src/test/modules/worker_spi/worker_spi.c b/src/test/modules/worker_spi/worker_spi.c
index 0d705a3f2e..da488ebfd4 100644
--- a/src/test/modules/worker_spi/worker_spi.c
+++ b/src/test/modules/worker_spi/worker_spi.c
@@ -295,7 +295,7 @@ worker_spi_main(Datum main_arg)
         SPI_finish();
         PopActiveSnapshot();
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(false);
         pgstat_report_activity(STATE_IDLE, NULL);
     }
 
-- 
2.16.3
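
The PgStat_StatDBEntry hunk in the patch above replaces the HTAB pointers with dshash_table_handle values. A raw pointer is meaningful only in the address space of the process that created it, while a handle can be resolved by any process attached to the shared area. The sketch below illustrates that idea with a toy arena and offset-based handles. All names in it (toy_arena, toy_handle, toy_alloc, toy_resolve) are hypothetical and chosen for illustration; this is not the dshash or DSA API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy types; NOT PostgreSQL's dsa/dshash API. */
typedef size_t toy_handle;      /* offset into the arena; 0 means invalid */

typedef struct toy_arena
{
    unsigned char *base;        /* where this "process" mapped the area */
    size_t      size;           /* total bytes available */
    size_t      used;           /* bytes handed out so far */
} toy_arena;

/* Set up an arena over caller-provided, suitably aligned memory. */
static void
toy_arena_init(toy_arena *a, void *mem, size_t size)
{
    a->base = mem;
    a->size = size;
    a->used = sizeof(long);     /* reserve offset 0 as the invalid handle */
}

/* Allocate len bytes and return a handle, or 0 if the arena is full. */
static toy_handle
toy_alloc(toy_arena *a, size_t len)
{
    size_t      aligned = (len + sizeof(long) - 1) & ~(sizeof(long) - 1);
    toy_handle  h;

    if (a->used + aligned > a->size)
        return 0;
    h = a->used;
    a->used += aligned;
    return h;
}

/* Turn a handle into a pointer that is valid for this mapping only. */
static void *
toy_resolve(const toy_arena *a, toy_handle h)
{
    assert(h != 0 && h < a->size);
    return a->base + h;
}
```

Copying the backing memory to a second buffer and resolving the same handle against a second toy_arena stands in for another process mapping the segment at a different address: the handle still finds the data, whereas the first arena's pointer would refer to the wrong mapping.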

From ec1227e81c8440cf16637ab63782a8addfd50f12 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 7 Nov 2018 16:53:49 +0900
Subject: [PATCH 4/9] Make archiver process an auxiliary process

The shared-memory based stats collector requires the archiver process
to use shared memory. Make it an auxiliary process for that reason.
---
 src/backend/bootstrap/bootstrap.c   |  8 +++
 src/backend/postmaster/pgarch.c     | 98 +++++++++----------------------------
 src/backend/postmaster/pgstat.c     |  6 +++
 src/backend/postmaster/postmaster.c | 35 +++++++++----
 src/include/miscadmin.h             |  2 +
 src/include/pgstat.h                |  1 +
 src/include/postmaster/pgarch.h     |  4 +-
 7 files changed, 67 insertions(+), 87 deletions(-)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 578af2e66d..dab0addd8b 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -327,6 +327,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case BgWriterProcess:
                 statmsg = pgstat_get_backend_desc(B_BG_WRITER);
                 break;
+            case ArchiverProcess:
+                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
+                break;
             case CheckpointerProcess:
                 statmsg = pgstat_get_backend_desc(B_CHECKPOINTER);
                 break;
@@ -454,6 +457,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
             BackgroundWriterMain();
             proc_exit(1);        /* should never return */
 
+        case ArchiverProcess:
+            /* don't set signals, archiver has its own agenda */
+            PgArchiverMain();
+            proc_exit(1);        /* should never return */
+
         case CheckpointerProcess:
             /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 3ca36d62a4..7d4e528096 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -66,7 +66,6 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
@@ -85,7 +84,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgarch_exit(SIGNAL_ARGS);
 static void ArchSigHupHandler(SIGNAL_ARGS);
 static void ArchSigTermHandler(SIGNAL_ARGS);
@@ -103,75 +101,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -211,8 +140,8 @@ pgarch_forkexec(void)
  *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
  *    since we don't use 'em, it hardly matters...
  */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -247,8 +176,27 @@ PgArchiverMain(int argc, char *argv[])
 static void
 pgarch_exit(SIGNAL_ARGS)
 {
-    /* SIGQUIT means curl up and die ... */
-    exit(1);
+    PG_SETMASK(&BlockSig);
+
+    /*
+     * We DO NOT want to run proc_exit() callbacks -- we're here because
+     * shared memory may be corrupted, so we don't want to try to clean up our
+     * transaction.  Just nail the windows shut and get out of town.  Now that
+     * there's an atexit callback to prevent third-party code from breaking
+     * things by calling exit() directly, we have to reset the callbacks
+     * explicitly to make this work as intended.
+     */
+    on_exit_reset();
+
+    /*
+     * Note we do exit(2) not exit(0).  This is to force the postmaster into a
+     * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+     * backend.  This is necessary precisely because we don't clean up our
+     * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+     * should ensure the postmaster sees this as a crash, too, but no harm in
+     * being doubly sure.)
+     */
+    exit(2);
 }
 
 /* SIGHUP signal handler for archiver process */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index cca64eca83..2d3f7cb898 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2981,6 +2981,9 @@ pgstat_bestart(void)
             case BgWriterProcess:
                 beentry->st_backendType = B_BG_WRITER;
                 break;
+            case ArchiverProcess:
+                beentry->st_backendType = B_ARCHIVER;
+                break;
             case CheckpointerProcess:
                 beentry->st_backendType = B_CHECKPOINTER;
                 break;
@@ -4244,6 +4247,9 @@ pgstat_get_backend_desc(BackendType backendType)
         case B_BG_WRITER:
             backendDesc = "background writer";
             break;
+        case B_ARCHIVER:
+            backendDesc = "archiver";
+            break;
         case B_CHECKPOINTER:
             backendDesc = "checkpointer";
             break;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a7d6ddeac7..559aeedb6e 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -146,7 +146,8 @@
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
 #define BACKEND_TYPE_BGWORKER    0x0008    /* bgworker process */
-#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
+#define BACKEND_TYPE_ARCHIVER    0x0010    /* archiver process */
+#define BACKEND_TYPE_ALL        0x001F    /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -548,6 +549,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
 #define StartWalReceiver()        StartChildProcess(WalReceiverProcess)
@@ -1746,7 +1748,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -2897,7 +2899,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3033,10 +3035,8 @@ reaper(SIGNAL_ARGS)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3266,7 +3266,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, stats collector or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3471,6 +3471,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3736,6 +3748,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -4991,7 +5004,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5294,6 +5307,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case CheckpointerProcess:
                 ereport(LOG,
                         (errmsg("could not fork checkpointer process: %m")));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 2e74ec9f60..91c3fb1a0a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -400,6 +400,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -411,6 +412,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index e5f912cf71..4e51580076 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -353,6 +353,7 @@ typedef enum BackendType
     B_BACKEND,
     B_BG_WORKER,
     B_BG_WRITER,
+    B_ARCHIVER,
     B_CHECKPOINTER,
     B_STARTUP,
     B_WAL_RECEIVER,
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index 292e63a26a..5db1d7a5ea 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
-- 
2.16.3
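
The postmaster.c hunk above adds BACKEND_TYPE_ARCHIVER as bit 0x0010 and widens BACKEND_TYPE_ALL from 0x000F to 0x001F. The sketch below reuses those values to show why the mask has to grow together with the new bit. BACKEND_TYPE_NORMAL is not visible in the hunk; its value 0x0001 is assumed from the surrounding definitions. The two helper functions are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Values as they appear in the postmaster.c hunk. */
#define BACKEND_TYPE_NORMAL     0x0001  /* assumed; not shown in the hunk */
#define BACKEND_TYPE_AUTOVAC    0x0002  /* autovacuum worker process */
#define BACKEND_TYPE_WALSND     0x0004  /* walsender process */
#define BACKEND_TYPE_BGWORKER   0x0008  /* bgworker process */
#define BACKEND_TYPE_ARCHIVER   0x0010  /* archiver process */
#define BACKEND_TYPE_ALL        0x001F  /* OR of all the above */

#define BACKEND_TYPE_WORKER     (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)

/* Hypothetical helper: is a child of this type selected by the target mask? */
static bool
backend_type_matches(int bkend_type, int target)
{
    return (bkend_type & target) != 0;
}

/* Hypothetical helper: OR together every individual type bit. */
static int
backend_type_union(void)
{
    return BACKEND_TYPE_NORMAL | BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_WALSND |
        BACKEND_TYPE_BGWORKER | BACKEND_TYPE_ARCHIVER;
}
```

With the old mask value 0x000F, backend_type_matches(BACKEND_TYPE_ARCHIVER, 0x000F) is false, so code that selects children by the "all" mask would skip the archiver.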

From 38230bf48e2e8721e62ad48f6ad9f15978cab0c6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 4 Jul 2018 10:59:17 +0900
Subject: [PATCH 5/9] Make pg_stat_statements not use PG_STAT_TMP_DIR.

This patchset removes the definition of PG_STAT_TMP_DIR because pgstat.c
no longer uses the directory and there is no other suitable module to
take it over. As a tentative solution, this patch moves the query text
file into the permanent stats directory. pg_basebackup and pg_rewind are
aware of that directory. They currently omit the text file, but with
this change they will copy it.
---
 contrib/pg_stat_statements/pg_stat_statements.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 33f9a79f54..ec2fa9881c 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -86,14 +86,11 @@ PG_MODULE_MAGIC;
 #define PGSS_DUMP_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pg_stat_statements.stat"
 
 /*
- * Location of external query text file.  We don't keep it in the core
- * system's stats_temp_directory.  The core system can safely use that GUC
- * setting, because the statistics collector temp file paths are set only once
- * as part of changing the GUC, but pg_stat_statements has no way of avoiding
- * race conditions.  Besides, we only expect modest, infrequent I/O for query
- * strings, so placing the file on a faster filesystem is not compelling.
+ * Location of external query text file. We only expect modest, infrequent I/O
+ * for query strings, so placing the file on a faster filesystem is not
+ * compelling.
  */
-#define PGSS_TEXT_FILE    PG_STAT_TMP_DIR "/pgss_query_texts.stat"
+#define PGSS_TEXT_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pgss_query_texts.stat"
 
 /* Magic number identifying the stats file format */
 static const uint32 PGSS_FILE_HEADER = 0x20171004;
-- 
2.16.3
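
The PGSS_TEXT_FILE change above relies on C's compile-time concatenation of adjacent string literals, so the path is fixed when the module is built and no longer depends on a GUC. The sketch below shows the resulting path. The value "pg_stat" for PGSTAT_STAT_PERMANENT_DIRECTORY is an assumption here, chosen to match the documentation hunk that mentions the pg_stat subdirectory. pgss_text_file_path() is a hypothetical helper added for illustration.

```c
#include <assert.h>
#include <string.h>

/* Assumed value; in the real tree this macro is defined in pgstat.h. */
#define PGSTAT_STAT_PERMANENT_DIRECTORY "pg_stat"

/* Same form as in the patch: adjacent literals are joined by the compiler. */
#define PGSS_TEXT_FILE  PGSTAT_STAT_PERMANENT_DIRECTORY "/pgss_query_texts.stat"

/* Hypothetical helper returning the path relative to the data directory. */
static const char *
pgss_text_file_path(void)
{
    return PGSS_TEXT_FILE;
}
```

Under that assumption the query text file sits next to pg_stat_statements.stat, which is why the commit message notes that backup tools would start copying it.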

From 2601fadfe97d0a250c4543cb33d52a6925e7eda6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 4 Jul 2018 11:46:43 +0900
Subject: [PATCH 6/9] Remove pg_stat_tmp exclusion from pg_rewind

The directory "pg_stat_tmp" no longer exists so remove it from the
exclusion list.
---
 src/bin/pg_rewind/filemap.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 222b56f58a..ef2d594c91 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -43,13 +43,6 @@ static bool check_file_excluded(const char *path, bool is_source);
  */
 static const char *excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    "pg_stat_tmp",                /* defined as PG_STAT_TMP_DIR */
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another master. See backup.sgml
-- 
2.16.3

From 5d8f703faa7d65aff5ad770b1f83b09be1436145 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 4 Jul 2018 11:44:31 +0900
Subject: [PATCH 7/9] Documentation update

Remove all descriptions of the pg_stat_tmp directory from the documentation.
---
 doc/src/sgml/backup.sgml        |  4 +---
 doc/src/sgml/config.sgml        | 19 -------------------
 doc/src/sgml/func.sgml          |  3 +--
 doc/src/sgml/monitoring.sgml    |  7 +------
 doc/src/sgml/protocol.sgml      |  2 +-
 doc/src/sgml/ref/pg_rewind.sgml |  3 +--
 doc/src/sgml/storage.sgml       |  6 ------
 7 files changed, 5 insertions(+), 39 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 3fa5efdd78..31e94c1fe9 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1116,11 +1116,9 @@ SELECT pg_stop_backup();
    <para>
     The contents of the directories <filename>pg_dynshmem/</filename>,
     <filename>pg_notify/</filename>, <filename>pg_serial/</filename>,
-    <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
+    <filename>pg_snapshots/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 567d2246e8..0c93988d2d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6094,25 +6094,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 1678c8cbac..36184faf34 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -15953,8 +15953,7 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
  PG_VERSION      | 15
  pg_wal          | 16
  pg_hba.conf     | 17
- pg_stat_tmp     | 18
- pg_subtrans     | 19
+ pg_subtrans     | 18
 (19 rows)
 </programlisting>
   </para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index add71458e2..a1031b3b2a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -197,12 +197,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
+   <productname>PostgreSQL</productname> processes through shared memory.
    When the server shuts down cleanly, a permanent copy of the statistics
    data is stored in the <filename>pg_stat</filename> subdirectory, so that
    statistics can be retained across server restarts.  When recovery is
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index f0b2145208..11f263f378 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2612,7 +2612,7 @@ The commands accepted in replication mode are:
         <para>
          <filename>pg_dynshmem</filename>, <filename>pg_notify</filename>,
          <filename>pg_replslot</filename>, <filename>pg_serial</filename>,
-         <filename>pg_snapshots</filename>, <filename>pg_stat_tmp</filename>, and
+         <filename>pg_snapshots</filename>, and
          <filename>pg_subtrans</filename> are copied as empty directories (even if
          they are symbolic links).
         </para>
diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index e2662bbf81..bf9c5dd580 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -270,8 +270,7 @@ PostgreSQL documentation
       (everything except the relation files). Similarly to base backups,
       the contents of the directories <filename>pg_dynshmem/</filename>,
       <filename>pg_notify/</filename>, <filename>pg_replslot/</filename>,
-      <filename>pg_serial/</filename>, <filename>pg_snapshots/</filename>,
-      <filename>pg_stat_tmp/</filename>, and
+      <filename>pg_serial/</filename>, <filename>pg_snapshots/</filename>, and
       <filename>pg_subtrans/</filename> are omitted from the data copied
       from the source cluster. Any file or directory beginning with
       <filename>pgsql_tmp</filename> is omitted, as well as are
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 8ef2ac8010..5ee7493970 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -120,12 +120,6 @@ Item
   subsystem</entry>
 </row>
 
-<row>
- <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
-</row>
-
 <row>
  <entry><filename>pg_subtrans</filename></entry>
  <entry>Subdirectory containing subtransaction status data</entry>
-- 
2.16.3

From 228ebf04b84ec8abafa306fb98832d2512df724f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 9 Nov 2018 15:48:49 +0900
Subject: [PATCH 8/9] Split out backend status monitor part from pgstat

A large file, pgstat.c, contained two major facilities: the backend
status monitor and the database usage monitor. Split the former out of
the file and name the new module "bestatus". The names of individual
functions are left alone except for some functions whose names conflict.
---
 src/backend/Makefile                               |    2 +-
 src/backend/access/heap/rewriteheap.c              |    4 +-
 src/backend/access/nbtree/nbtree.c                 |    2 +-
 src/backend/access/nbtree/nbtsort.c                |    2 +-
 src/backend/access/transam/clog.c                  |    2 +-
 src/backend/access/transam/parallel.c              |    2 +-
 src/backend/access/transam/slru.c                  |    2 +-
 src/backend/access/transam/timeline.c              |    2 +-
 src/backend/access/transam/twophase.c              |    1 +
 src/backend/access/transam/xact.c                  |    1 +
 src/backend/access/transam/xlog.c                  |    1 +
 src/backend/access/transam/xlogfuncs.c             |    2 +-
 src/backend/access/transam/xlogutils.c             |    2 +-
 src/backend/bootstrap/bootstrap.c                  |    8 +-
 src/backend/commands/vacuumlazy.c                  |    1 +
 src/backend/executor/execParallel.c                |    2 +-
 src/backend/executor/nodeBitmapHeapscan.c          |    1 +
 src/backend/executor/nodeGather.c                  |    2 +-
 src/backend/executor/nodeHash.c                    |    2 +-
 src/backend/executor/nodeHashjoin.c                |    2 +-
 src/backend/libpq/be-secure-openssl.c              |    2 +-
 src/backend/libpq/be-secure.c                      |    2 +-
 src/backend/libpq/pqmq.c                           |    2 +-
 src/backend/postmaster/Makefile                    |    2 +-
 src/backend/postmaster/autovacuum.c                |    1 +
 src/backend/postmaster/bgworker.c                  |    2 +-
 src/backend/postmaster/bgwriter.c                  |    1 +
 src/backend/postmaster/checkpointer.c              |    1 +
 src/backend/postmaster/pgarch.c                    |    1 +
 src/backend/postmaster/postmaster.c                |    1 +
 src/backend/postmaster/syslogger.c                 |    2 +-
 src/backend/postmaster/walwriter.c                 |    2 +-
 src/backend/replication/basebackup.c               |    2 +-
 .../libpqwalreceiver/libpqwalreceiver.c            |    2 +-
 src/backend/replication/logical/launcher.c         |    2 +-
 src/backend/replication/logical/origin.c           |    3 +-
 src/backend/replication/logical/reorderbuffer.c    |    2 +-
 src/backend/replication/logical/snapbuild.c        |    2 +-
 src/backend/replication/logical/tablesync.c        |    6 +-
 src/backend/replication/logical/worker.c           |   11 +-
 src/backend/replication/slot.c                     |    2 +-
 src/backend/replication/syncrep.c                  |    2 +-
 src/backend/replication/walreceiver.c              |    2 +-
 src/backend/replication/walsender.c                |    2 +-
 src/backend/statmon/Makefile                       |   17 +
 src/backend/statmon/bestatus.c                     | 1756 ++++++++++++++++++++
 src/backend/{postmaster => statmon}/pgstat.c       | 1727 +------------------
 src/backend/storage/buffer/bufmgr.c                |    1 +
 src/backend/storage/file/buffile.c                 |    2 +-
 src/backend/storage/file/copydir.c                 |    2 +-
 src/backend/storage/file/fd.c                      |    1 +
 src/backend/storage/ipc/dsm_impl.c                 |    2 +-
 src/backend/storage/ipc/latch.c                    |    2 +-
 src/backend/storage/ipc/procarray.c                |    2 +-
 src/backend/storage/ipc/shm_mq.c                   |    2 +-
 src/backend/storage/ipc/standby.c                  |    2 +-
 src/backend/storage/lmgr/deadlock.c                |    1 +
 src/backend/storage/lmgr/lwlock.c                  |    2 +-
 src/backend/storage/lmgr/predicate.c               |    2 +-
 src/backend/storage/lmgr/proc.c                    |    2 +-
 src/backend/storage/smgr/md.c                      |    2 +-
 src/backend/tcop/postgres.c                        |    1 +
 src/backend/utils/adt/misc.c                       |    2 +-
 src/backend/utils/adt/pgstatfuncs.c                |    1 +
 src/backend/utils/cache/relmapper.c                |    2 +-
 src/backend/utils/init/miscinit.c                  |    2 +-
 src/backend/utils/init/postinit.c                  |    4 +
 src/backend/utils/misc/guc.c                       |    1 +
 src/include/bestatus.h                             |  544 ++++++
 src/include/pgstat.h                               |  514 +-----
 70 files changed, 2438 insertions(+), 2258 deletions(-)
 create mode 100644 src/backend/statmon/Makefile
 create mode 100644 src/backend/statmon/bestatus.c
 rename src/backend/{postmaster => statmon}/pgstat.c (70%)
 create mode 100644 src/include/bestatus.h

diff --git a/src/backend/Makefile b/src/backend/Makefile
index 3a58bf6685..9921dca7f9 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -20,7 +20,7 @@ include $(top_builddir)/src/Makefile.global
 SUBDIRS = access bootstrap catalog parser commands executor foreign lib libpq \
     main nodes optimizer partitioning port postmaster \
     regex replication rewrite \
-    statistics storage tcop tsearch utils $(top_builddir)/src/timezone \
+    statistics statmon storage tcop tsearch utils $(top_builddir)/src/timezone \
     jit
 
 include $(srcdir)/common.mk
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index c5db75afa1..30890f11ea 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -115,12 +115,12 @@
 #include "access/xact.h"
 #include "access/xloginsert.h"
 
+#include "bestatus.h"
+
 #include "catalog/catalog.h"
 
 #include "lib/ilist.h"
 
-#include "pgstat.h"
-
 #include "replication/logical.h"
 #include "replication/slot.h"
 
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index e8725fbbe1..6679dbc3a5 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -22,10 +22,10 @@
 #include "access/nbtxlog.h"
 #include "access/relscan.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
-#include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "storage/condition_variable.h"
 #include "storage/indexfsm.h"
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 16f5755777..14d183b0da 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -63,9 +63,9 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "bestatus.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/smgr.h"
 #include "tcop/tcopprot.h"        /* pgrminclude ignore */
 #include "utils/rel.h"
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 8b7ff5b0c2..9971bfe4f2 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -38,8 +38,8 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "pg_trace.h"
 #include "storage/proc.h"
 
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 84197192ec..7e5c84bd5f 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -19,6 +19,7 @@
 #include "access/session.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/pg_enum.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
@@ -29,7 +30,6 @@
 #include "libpq/pqmq.h"
 #include "miscadmin.h"
 #include "optimizer/planmain.h"
-#include "pgstat.h"
 #include "storage/ipc.h"
 #include "storage/sinval.h"
 #include "storage/spin.h"
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 1132eef038..90a6f14899 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -54,7 +54,7 @@
 #include "access/slru.h"
 #include "access/transam.h"
 #include "access/xlog.h"
-#include "pgstat.h"
+#include "bestatus.h"
 #include "storage/fd.h"
 #include "storage/shmem.h"
 #include "miscadmin.h"
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 61d36050c3..ba78461ff0 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -38,7 +38,7 @@
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogdefs.h"
-#include "pgstat.h"
+#include "bestatus.h"
 #include "storage/fd.h"
 
 /*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 3942734e5a..e2c1be7422 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -87,6 +87,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "access/xlogreader.h"
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "catalog/storage.h"
 #include "funcapi.h"
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a979d7e07b..84c51c6ac8 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -30,6 +30,7 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
 #include "catalog/storage.h"
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e52ae54821..018a3737dc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -36,6 +36,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
 #include "catalog/pg_database.h"
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index a31adcca5e..b72da3f45f 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -23,9 +23,9 @@
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
 #include "catalog/pg_type.h"
+#include "bestatus.h"
 #include "funcapi.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/walreceiver.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 4ecdc9220f..b739f650d6 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -23,8 +23,8 @@
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/smgr.h"
 #include "utils/guc.h"
 #include "utils/hsearch.h"
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index dab0addd8b..0782cf11b9 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -20,6 +20,7 @@
 #include "access/htup_details.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/index.h"
 #include "catalog/pg_collation.h"
@@ -327,9 +328,6 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case BgWriterProcess:
                 statmsg = pgstat_get_backend_desc(B_BG_WRITER);
                 break;
-            case ArchiverProcess:
-                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
-                break;
             case CheckpointerProcess:
                 statmsg = pgstat_get_backend_desc(B_CHECKPOINTER);
                 break;
@@ -339,6 +337,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case WalReceiverProcess:
                 statmsg = pgstat_get_backend_desc(B_WAL_RECEIVER);
                 break;
+            case ArchiverProcess:
+                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
+                break;
             default:
                 statmsg = "??? process";
                 break;
@@ -415,6 +416,7 @@ AuxiliaryProcessMain(int argc, char *argv[])
         CreateAuxProcessResourceOwner();
 
         /* Initialize backend status information */
+        pgstat_bearray_initialize();
         pgstat_initialize();
         pgstat_bestart();
 
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 8996d366e9..dff87c1d84 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -44,6 +44,7 @@
 #include "access/transam.h"
 #include "access/visibilitymap.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/storage.h"
 #include "commands/dbcommands.h"
 #include "commands/progress.h"
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 13ef232d39..a99ea3dbfe 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -48,7 +48,7 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
-#include "pgstat.h"
+#include "bestatus.h"
 
 /*
  * Magic numbers for parallel executor communication.  We use constants
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 304ef07f2c..856f497d51 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -40,6 +40,7 @@
 #include "access/relscan.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "bestatus.h"
 #include "executor/execdebug.h"
 #include "executor/nodeBitmapHeapscan.h"
 #include "miscadmin.h"
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index ad16c783bd..281f27998a 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -32,6 +32,7 @@
 
 #include "access/relscan.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "executor/execdebug.h"
 #include "executor/execParallel.h"
 #include "executor/nodeGather.h"
@@ -39,7 +40,6 @@
 #include "executor/tqueue.h"
 #include "miscadmin.h"
 #include "optimizer/planmain.h"
-#include "pgstat.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 6ffaa751f2..7a850e8192 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -28,6 +28,7 @@
 
 #include "access/htup_details.h"
 #include "access/parallel.h"
+#include "bestatus.h"
 #include "catalog/pg_statistic.h"
 #include "commands/tablespace.h"
 #include "executor/execdebug.h"
@@ -35,7 +36,6 @@
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "port/atomics.h"
 #include "utils/dynahash.h"
 #include "utils/memutils.h"
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index d017bbfbd3..84ba04da57 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -108,12 +108,12 @@
 
 #include "access/htup_details.h"
 #include "access/parallel.h"
+#include "bestatus.h"
 #include "executor/executor.h"
 #include "executor/hashjoin.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "utils/memutils.h"
 #include "utils/sharedtuplestore.h"
 
diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c
index 6a576572bb..5a304c7405 100644
--- a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c
@@ -36,9 +36,9 @@
 #include <openssl/ec.h>
 #endif
 
+#include "bestatus.h"
 #include "libpq/libpq.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/latch.h"
 #include "tcop/tcopprot.h"
diff --git a/src/backend/libpq/be-secure.c b/src/backend/libpq/be-secure.c
index 4eb21fe89d..517b22a694 100644
--- a/src/backend/libpq/be-secure.c
+++ b/src/backend/libpq/be-secure.c
@@ -29,9 +29,9 @@
 #include <arpa/inet.h>
 #endif
 
+#include "bestatus.h"
 #include "libpq/libpq.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "tcop/tcopprot.h"
 #include "utils/memutils.h"
 #include "storage/ipc.h"
diff --git a/src/backend/libpq/pqmq.c b/src/backend/libpq/pqmq.c
index 6eaed5bf0c..5906682fbf 100644
--- a/src/backend/libpq/pqmq.c
+++ b/src/backend/libpq/pqmq.c
@@ -13,11 +13,11 @@
 
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
 #include "libpq/pqmq.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c23211b2..311e63017d 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
-    pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+    pgarch.o postmaster.o startup.o syslogger.o walwriter.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 10e707e9a1..ce2b441c37 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -71,6 +71,7 @@
 #include "access/reloptions.h"
 #include "access/transam.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "catalog/dependency.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index d2b695e146..01eaa187ff 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -16,8 +16,8 @@
 
 #include "libpq/pqsignal.h"
 #include "access/parallel.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "port/atomics.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/postmaster.h"
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index a4b1079e60..aea6a15b74 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -40,6 +40,7 @@
 
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 9235390bc6..2968d356ed 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -43,6 +43,7 @@
 
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 7d4e528096..deec58b057 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -34,6 +34,7 @@
 
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 559aeedb6e..1719bb8d31 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -95,6 +95,7 @@
 
 #include "access/transam.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/pg_control.h"
 #include "common/file_perm.h"
diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index 29bdcec895..d23987b20e 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -31,11 +31,11 @@
 #include <sys/stat.h>
 #include <sys/time.h>
 
+#include "bestatus.h"
 #include "lib/stringinfo.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "nodes/pg_list.h"
-#include "pgstat.h"
 #include "pgtime.h"
 #include "postmaster/fork_process.h"
 #include "postmaster/postmaster.h"
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index fb66bceeed..09021b54c4 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -45,9 +45,9 @@
 #include <unistd.h>
 
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/walwriter.h"
 #include "storage/bufmgr.h"
 #include "storage/condition_variable.h"
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 20cf33354a..1ce0809361 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -17,6 +17,7 @@
 #include <time.h>
 
 #include "access/xlog_internal.h"    /* for pg_start/stop_backup */
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "common/file_perm.h"
 #include "lib/stringinfo.h"
@@ -25,7 +26,6 @@
 #include "miscadmin.h"
 #include "nodes/pg_list.h"
 #include "pgtar.h"
-#include "pgstat.h"
 #include "port.h"
 #include "postmaster/syslogger.h"
 #include "replication/basebackup.h"
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 1e1695ef4f..b992473fd4 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -22,11 +22,11 @@
 #include "libpq-fe.h"
 #include "pqexpbuffer.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "funcapi.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/walreceiver.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index ada16adb67..bf7ac927f7 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -19,7 +19,7 @@
 
 #include "funcapi.h"
 #include "miscadmin.h"
-#include "pgstat.h"
+#include "bestatus.h"
 
 #include "access/heapam.h"
 #include "access/htup.h"
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index bf97dcdee4..a60ef0a9f1 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -77,13 +77,12 @@
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/xact.h"
-
+#include "bestatus.h"
 #include "catalog/indexing.h"
 #include "nodes/execnodes.h"
 
 #include "replication/origin.h"
 #include "replication/logical.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index bed63c768e..7af1e6b6b1 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -61,10 +61,10 @@
 #include "access/tuptoaster.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "lib/binaryheap.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/logical.h"
 #include "replication/reorderbuffer.h"
 #include "replication/slot.h"
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index a6cd6c67d1..846e1e7267 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -126,7 +126,7 @@
 #include "access/transam.h"
 #include "access/xact.h"
 
-#include "pgstat.h"
+#include "bestatus.h"
 
 #include "replication/logical.h"
 #include "replication/reorderbuffer.h"
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 862582da23..670552593f 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -86,25 +86,27 @@
 #include "postgres.h"
 
 #include "miscadmin.h"
-#include "pgstat.h"
 
 #include "access/xact.h"
 
+#include "bestatus.h"
+
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 
 #include "commands/copy.h"
 
 #include "parser/parse_relation.h"
+#include "pgstat.h"
 
 #include "replication/logicallauncher.h"
 #include "replication/logicalrelation.h"
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 
-#include "utils/snapmgr.h"
 #include "storage/ipc.h"
 
+#include "utils/snapmgr.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 087850d089..3dc686f0df 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -23,13 +23,11 @@
 
 #include "postgres.h"
 
-#include "miscadmin.h"
-#include "pgstat.h"
-#include "funcapi.h"
-
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 
+#include "bestatus.h"
+
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_subscription.h"
@@ -41,17 +39,20 @@
 #include "executor/executor.h"
 #include "executor/nodeModifyTable.h"
 
+#include "funcapi.h"
+
 #include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
 
 #include "mb/pg_wchar.h"
+#include "miscadmin.h"
 
 #include "nodes/makefuncs.h"
 
 #include "optimizer/planner.h"
 
 #include "parser/parse_relation.h"
-
+#include "pgstat.h"
 #include "postmaster/bgworker.h"
 #include "postmaster/postmaster.h"
 #include "postmaster/walwriter.h"
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 1f2e7139a7..1620313c55 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -41,9 +41,9 @@
 
 #include "access/transam.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "common/string.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/slot.h"
 #include "storage/fd.h"
 #include "storage/proc.h"
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index af5ad5fe66..957aea0a7d 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -75,8 +75,8 @@
 #include <unistd.h>
 
 #include "access/xact.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/syncrep.h"
 #include "replication/walsender.h"
 #include "replication/walsender_private.h"
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 6f4b3538ac..0d65ed8f2a 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -50,6 +50,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
@@ -57,7 +58,6 @@
 #include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "storage/ipc.h"
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 2683385ca6..bfe18c860b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -56,6 +56,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
 
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "commands/dbcommands.h"
@@ -65,7 +66,6 @@
 #include "libpq/pqformat.h"
 #include "miscadmin.h"
 #include "nodes/replnodes.h"
-#include "pgstat.h"
 #include "replication/basebackup.h"
 #include "replication/decode.h"
 #include "replication/logical.h"
diff --git a/src/backend/statmon/Makefile b/src/backend/statmon/Makefile
new file mode 100644
index 0000000000..64a04878e3
--- /dev/null
+++ b/src/backend/statmon/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for src/backend/statmon
+#
+# IDENTIFICATION
+#    src/backend/statmon/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/statmon
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = pgstat.o bestatus.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/statmon/bestatus.c b/src/backend/statmon/bestatus.c
new file mode 100644
index 0000000000..1ea4f80a58
--- /dev/null
+++ b/src/backend/statmon/bestatus.c
@@ -0,0 +1,1756 @@
+/* ----------
+ * bestatus.c
+ *
+ *    Backend status monitor
+ *
+ *    Status data is stored in shared memory. Each backend updates and reads
+ *    it individually.
+ *
+ *    Copyright (c) 2001-2018, PostgreSQL Global Development Group
+ *
+ *    src/backend/statmon/bestatus.c
+ * ----------
+ */
+#include "postgres.h"
+
+#include "bestatus.h"
+
+#include "access/xact.h"
+#include "libpq/libpq.h"
+#include "miscadmin.h"
+#include "postmaster/autovacuum.h"
+#include "replication/walsender.h"
+#include "storage/ipc.h"
+#include "storage/lmgr.h"
+#include "storage/sinvaladt.h"
+#include "utils/ascii.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/probes.h"
+
+
+/* Status for backends including auxiliary */
+static LocalPgBackendStatus *localBackendStatusTable = NULL;
+
+/* Total number of backends including auxiliary */
+static int    localNumBackends = 0;
+
+/* ----------
+ * Total number of backends including auxiliary
+ *
+ * We reserve a slot for each possible BackendId, plus one for each
+ * possible auxiliary process type.  (This scheme assumes there is not
+ * more than one of any auxiliary process type at a time.) MaxBackends
+ * includes autovacuum workers and background workers as well.
+ * ----------
+ */
+#define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
+
+
+/* ----------
+ * GUC parameters
+ * ----------
+ */
+bool        pgstat_track_activities = false;
+int            pgstat_track_activity_query_size = 1024;
+
+static MemoryContext pgBeStatLocalContext = NULL;
+
+/* ------------------------------------------------------------
+ * Functions for management of the shared-memory PgBackendStatus array
+ * ------------------------------------------------------------
+ */
+
+static PgBackendStatus *BackendStatusArray = NULL;
+static PgBackendStatus *MyBEEntry = NULL;
+static char *BackendAppnameBuffer = NULL;
+static char *BackendClientHostnameBuffer = NULL;
+static char *BackendActivityBuffer = NULL;
+static Size BackendActivityBufferSize = 0;
+#ifdef USE_SSL
+static PgBackendSSLStatus *BackendSslStatusBuffer = NULL;
+#endif
+
+static const char *pgstat_get_wait_activity(WaitEventActivity w);
+static const char *pgstat_get_wait_client(WaitEventClient w);
+static const char *pgstat_get_wait_ipc(WaitEventIPC w);
+static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
+static const char *pgstat_get_wait_io(WaitEventIO w);
+static void pgstat_setup_memcxt(void);
+static void pgstat_beshutdown_hook(int code, Datum arg);
+/*
+ * Report shared-memory space needed by CreateSharedBackendStatus.
+ */
+Size
+BackendStatusShmemSize(void)
+{
+    Size        size;
+
+    /* BackendStatusArray: */
+    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
+    /* BackendAppnameBuffer: */
+    size = add_size(size,
+                    mul_size(NAMEDATALEN, NumBackendStatSlots));
+    /* BackendClientHostnameBuffer: */
+    size = add_size(size,
+                    mul_size(NAMEDATALEN, NumBackendStatSlots));
+    /* BackendActivityBuffer: */
+    size = add_size(size,
+                    mul_size(pgstat_track_activity_query_size, NumBackendStatSlots));
+#ifdef USE_SSL
+    /* BackendSslStatusBuffer: */
+    size = add_size(size,
+                    mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots));
+#endif
+    return size;
+}
+
+/*
+ * Initialize the shared status array and several string buffers
+ * during postmaster startup.
+ */
+void
+CreateSharedBackendStatus(void)
+{
+    Size        size;
+    bool        found;
+    int            i;
+    char       *buffer;
+
+    /* Create or attach to the shared array */
+    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
+    BackendStatusArray = (PgBackendStatus *)
+        ShmemInitStruct("Backend Status Array", size, &found);
+
+    if (!found)
+    {
+        /*
+         * We're the first - initialize.
+         */
+        MemSet(BackendStatusArray, 0, size);
+    }
+
+    /* Create or attach to the shared appname buffer */
+    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
+    BackendAppnameBuffer = (char *)
+        ShmemInitStruct("Backend Application Name Buffer", size, &found);
+
+    if (!found)
+    {
+        MemSet(BackendAppnameBuffer, 0, size);
+
+        /* Initialize st_appname pointers. */
+        buffer = BackendAppnameBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_appname = buffer;
+            buffer += NAMEDATALEN;
+        }
+    }
+
+    /* Create or attach to the shared client hostname buffer */
+    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
+    BackendClientHostnameBuffer = (char *)
+        ShmemInitStruct("Backend Client Host Name Buffer", size, &found);
+
+    if (!found)
+    {
+        MemSet(BackendClientHostnameBuffer, 0, size);
+
+        /* Initialize st_clienthostname pointers. */
+        buffer = BackendClientHostnameBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_clienthostname = buffer;
+            buffer += NAMEDATALEN;
+        }
+    }
+
+    /* Create or attach to the shared activity buffer */
+    BackendActivityBufferSize = mul_size(pgstat_track_activity_query_size,
+                                         NumBackendStatSlots);
+    BackendActivityBuffer = (char *)
+        ShmemInitStruct("Backend Activity Buffer",
+                        BackendActivityBufferSize,
+                        &found);
+
+    if (!found)
+    {
+        MemSet(BackendActivityBuffer, 0, BackendActivityBufferSize);
+
+        /* Initialize st_activity pointers. */
+        buffer = BackendActivityBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_activity_raw = buffer;
+            buffer += pgstat_track_activity_query_size;
+        }
+    }
+
+#ifdef USE_SSL
+    /* Create or attach to the shared SSL status buffer */
+    size = mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots);
+    BackendSslStatusBuffer = (PgBackendSSLStatus *)
+        ShmemInitStruct("Backend SSL Status Buffer", size, &found);
+
+    if (!found)
+    {
+        PgBackendSSLStatus *ptr;
+
+        MemSet(BackendSslStatusBuffer, 0, size);
+
+        /* Initialize st_sslstatus pointers. */
+        ptr = BackendSslStatusBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_sslstatus = ptr;
+            ptr++;
+        }
+    }
+#endif
+}
+
+/* ----------
+ * pgstat_bearray_initialize() -
+ *
+ *    Initialize pgstats state, and set up our on-proc-exit hook.
+ *    Called from InitPostgres and AuxiliaryProcessMain. For auxiliary process,
+ *    MyBackendId is invalid. Otherwise, MyBackendId must be set,
+ *    but we must not have started any transaction yet (since the
+ *    exit hook must run after the last transaction exit).
+ *    NOTE: MyDatabaseId isn't set yet; so the shutdown hook has to be careful.
+ * ----------
+ */
+void
+pgstat_bearray_initialize(void)
+{
+    /* Initialize MyBEEntry */
+    if (MyBackendId != InvalidBackendId)
+    {
+        Assert(MyBackendId >= 1 && MyBackendId <= MaxBackends);
+        MyBEEntry = &BackendStatusArray[MyBackendId - 1];
+    }
+    else
+    {
+        /* Must be an auxiliary process */
+        Assert(MyAuxProcType != NotAnAuxProcess);
+
+        /*
+         * Assign the MyBEEntry for an auxiliary process.  Since it doesn't
+         * have a BackendId, the slot is statically allocated based on the
+         * auxiliary process type (MyAuxProcType).  Backends use slots indexed
+         * in the range from 1 to MaxBackends (inclusive), so we use
+         * MaxBackends + MyAuxProcType + 1 as the 1-based slot number for an
+         * auxiliary process, i.e. array element MaxBackends + MyAuxProcType.
+         */
+        MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
+    }
+
+    /* Set up a process-exit hook to clean up */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
+}
+
+/*
+ * Shut down a single backend's status reporting at process exit.
+ *
+ * This hook does not flush any remaining statistics counts; it only
+ * touches the backend status array.
+ *
+ * Clear out our entry in the PgBackendStatus array, so that readers stop
+ * seeing this process.
+ */
+static void
+pgstat_beshutdown_hook(int code, Datum arg)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    /*
+     * Clear my status entry, following the protocol of bumping st_changecount
+     * before and after.  We use a volatile pointer here to ensure the
+     * compiler doesn't try to get cute.
+     */
+    pgstat_increment_changecount_before(beentry);
+
+    beentry->st_procpid = 0;    /* mark invalid */
+
+    pgstat_increment_changecount_after(beentry);
+}
+
+
+/* ----------
+ * pgstat_bestart() -
+ *
+ *    Initialize this backend's entry in the PgBackendStatus array.
+ *    Called from InitPostgres.
+ *
+ *    Apart from auxiliary processes, MyBackendId, MyDatabaseId,
+ *    session userid, and application_name must be set for a backend
+ *    (hence, this cannot be combined with pgstat_bearray_initialize).
+ * ----------
+ */
+void
+pgstat_bestart(void)
+{
+    SockAddr    clientaddr;
+    volatile PgBackendStatus *beentry;
+
+    /*
+     * To minimize the time spent modifying the PgBackendStatus entry, fetch
+     * all the needed data first.
+     */
+
+    /*
+     * We may not have a MyProcPort (eg, if this is the autovacuum process).
+     * If so, use all-zeroes client address, which is dealt with specially in
+     * pg_stat_get_backend_client_addr and pg_stat_get_backend_client_port.
+     */
+    if (MyProcPort)
+        memcpy(&clientaddr, &MyProcPort->raddr, sizeof(clientaddr));
+    else
+        MemSet(&clientaddr, 0, sizeof(clientaddr));
+
+    /*
+     * Initialize my status entry, following the protocol of bumping
+     * st_changecount before and after; and make sure it's even afterwards. We
+     * use a volatile pointer here to ensure the compiler doesn't try to get
+     * cute.
+     */
+    beentry = MyBEEntry;
+
+    /* MyBEEntry must have been set up by pgstat_bearray_initialize() */
+    Assert(beentry != NULL);
+
+    if (MyBackendId != InvalidBackendId)
+    {
+        if (IsAutoVacuumLauncherProcess())
+        {
+            /* Autovacuum Launcher */
+            beentry->st_backendType = B_AUTOVAC_LAUNCHER;
+        }
+        else if (IsAutoVacuumWorkerProcess())
+        {
+            /* Autovacuum Worker */
+            beentry->st_backendType = B_AUTOVAC_WORKER;
+        }
+        else if (am_walsender)
+        {
+            /* Wal sender */
+            beentry->st_backendType = B_WAL_SENDER;
+        }
+        else if (IsBackgroundWorker)
+        {
+            /* bgworker */
+            beentry->st_backendType = B_BG_WORKER;
+        }
+        else
+        {
+            /* client-backend */
+            beentry->st_backendType = B_BACKEND;
+        }
+    }
+    else
+    {
+        /* Must be an auxiliary process */
+        Assert(MyAuxProcType != NotAnAuxProcess);
+        switch (MyAuxProcType)
+        {
+            case StartupProcess:
+                beentry->st_backendType = B_STARTUP;
+                break;
+            case BgWriterProcess:
+                beentry->st_backendType = B_BG_WRITER;
+                break;
+            case CheckpointerProcess:
+                beentry->st_backendType = B_CHECKPOINTER;
+                break;
+            case WalWriterProcess:
+                beentry->st_backendType = B_WAL_WRITER;
+                break;
+            case WalReceiverProcess:
+                beentry->st_backendType = B_WAL_RECEIVER;
+                break;
+            case ArchiverProcess:
+                beentry->st_backendType = B_ARCHIVER;
+                break;
+            default:
+                elog(FATAL, "unrecognized process type: %d",
+                     (int) MyAuxProcType);
+                proc_exit(1);
+        }
+    }
+
+    do
+    {
+        pgstat_increment_changecount_before(beentry);
+    } while ((beentry->st_changecount & 1) == 0);
+
+    beentry->st_procpid = MyProcPid;
+    beentry->st_proc_start_timestamp = MyStartTimestamp;
+    beentry->st_activity_start_timestamp = 0;
+    beentry->st_state_start_timestamp = 0;
+    beentry->st_xact_start_timestamp = 0;
+    beentry->st_databaseid = MyDatabaseId;
+
+    /* We have userid for client-backends, wal-sender and bgworker processes */
+    if (beentry->st_backendType == B_BACKEND
+        || beentry->st_backendType == B_WAL_SENDER
+        || beentry->st_backendType == B_BG_WORKER)
+        beentry->st_userid = GetSessionUserId();
+    else
+        beentry->st_userid = InvalidOid;
+
+    beentry->st_clientaddr = clientaddr;
+    if (MyProcPort && MyProcPort->remote_hostname)
+        strlcpy(beentry->st_clienthostname, MyProcPort->remote_hostname,
+                NAMEDATALEN);
+    else
+        beentry->st_clienthostname[0] = '\0';
+#ifdef USE_SSL
+    if (MyProcPort && MyProcPort->ssl != NULL)
+    {
+        beentry->st_ssl = true;
+        beentry->st_sslstatus->ssl_bits = be_tls_get_cipher_bits(MyProcPort);
+        beentry->st_sslstatus->ssl_compression = be_tls_get_compression(MyProcPort);
+        strlcpy(beentry->st_sslstatus->ssl_version, be_tls_get_version(MyProcPort), NAMEDATALEN);
+        strlcpy(beentry->st_sslstatus->ssl_cipher, be_tls_get_cipher(MyProcPort), NAMEDATALEN);
+        be_tls_get_peerdn_name(MyProcPort, beentry->st_sslstatus->ssl_clientdn, NAMEDATALEN);
+    }
+    else
+    {
+        beentry->st_ssl = false;
+    }
+#else
+    beentry->st_ssl = false;
+#endif
+    beentry->st_state = STATE_UNDEFINED;
+    beentry->st_appname[0] = '\0';
+    beentry->st_activity_raw[0] = '\0';
+    /* Also make sure the last byte in each string area is always 0 */
+    beentry->st_clienthostname[NAMEDATALEN - 1] = '\0';
+    beentry->st_appname[NAMEDATALEN - 1] = '\0';
+    beentry->st_activity_raw[pgstat_track_activity_query_size - 1] = '\0';
+    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
+    beentry->st_progress_command_target = InvalidOid;
+
+    /*
+     * we don't zero st_progress_param here to save cycles; nobody should
+     * examine it until st_progress_command has been set to something other
+     * than PROGRESS_COMMAND_INVALID
+     */
+
+    pgstat_increment_changecount_after(beentry);
+
+    /* Update app name to current GUC setting */
+    if (application_name)
+        pgstat_report_appname(application_name);
+}
+
+/* ----------
+ * pgstat_read_current_status() -
+ *
+ *    Copy the current contents of the PgBackendStatus array to local memory,
+ *    if not already done in this transaction.
+ * ----------
+ */
+static void
+pgstat_read_current_status(void)
+{
+    volatile PgBackendStatus *beentry;
+    LocalPgBackendStatus *localtable;
+    LocalPgBackendStatus *localentry;
+    char       *localappname,
+               *localclienthostname,
+               *localactivity;
+#ifdef USE_SSL
+    PgBackendSSLStatus *localsslstatus;
+#endif
+    int            i;
+
+    Assert(IsUnderPostmaster);
+
+    if (localBackendStatusTable)
+        return;                    /* already done */
+
+    pgstat_setup_memcxt();
+
+    localtable = (LocalPgBackendStatus *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           sizeof(LocalPgBackendStatus) * NumBackendStatSlots);
+    localappname = (char *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           NAMEDATALEN * NumBackendStatSlots);
+    localclienthostname = (char *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           NAMEDATALEN * NumBackendStatSlots);
+    localactivity = (char *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           pgstat_track_activity_query_size * NumBackendStatSlots);
+#ifdef USE_SSL
+    localsslstatus = (PgBackendSSLStatus *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           sizeof(PgBackendSSLStatus) * NumBackendStatSlots);
+#endif
+
+    localNumBackends = 0;
+
+    beentry = BackendStatusArray;
+    localentry = localtable;
+    for (i = 1; i <= NumBackendStatSlots; i++)
+    {
+        /*
+         * Follow the protocol of retrying if st_changecount changes while we
+         * copy the entry, or if it's odd.  (The check for odd is needed to
+         * cover the case where we are able to completely copy the entry while
+         * the source backend is between increment steps.)    We use a volatile
+         * pointer here to ensure the compiler doesn't try to get cute.
+         */
+        for (;;)
+        {
+            int            before_changecount;
+            int            after_changecount;
+
+            pgstat_save_changecount_before(beentry, before_changecount);
+
+            localentry->backendStatus.st_procpid = beentry->st_procpid;
+            if (localentry->backendStatus.st_procpid > 0)
+            {
+                memcpy(&localentry->backendStatus, (char *) beentry, sizeof(PgBackendStatus));
+
+                /*
+                 * strcpy is safe even if the string is modified concurrently,
+                 * because there's always a \0 at the end of the buffer.
+                 */
+                strcpy(localappname, (char *) beentry->st_appname);
+                localentry->backendStatus.st_appname = localappname;
+                strcpy(localclienthostname, (char *) beentry->st_clienthostname);
+                localentry->backendStatus.st_clienthostname = localclienthostname;
+                strcpy(localactivity, (char *) beentry->st_activity_raw);
+                localentry->backendStatus.st_activity_raw = localactivity;
+                localentry->backendStatus.st_ssl = beentry->st_ssl;
+#ifdef USE_SSL
+                if (beentry->st_ssl)
+                {
+                    memcpy(localsslstatus, beentry->st_sslstatus, sizeof(PgBackendSSLStatus));
+                    localentry->backendStatus.st_sslstatus = localsslstatus;
+                }
+#endif
+            }
+
+            pgstat_save_changecount_after(beentry, after_changecount);
+            if (before_changecount == after_changecount &&
+                (before_changecount & 1) == 0)
+                break;
+
+            /* Make sure we can break out of loop if stuck... */
+            CHECK_FOR_INTERRUPTS();
+        }
+
+        beentry++;
+        /* Only valid entries are included in the local array */
+        if (localentry->backendStatus.st_procpid > 0)
+        {
+            BackendIdGetTransactionIds(i,
+                                       &localentry->backend_xid,
+                                       &localentry->backend_xmin);
+
+            localentry++;
+            localappname += NAMEDATALEN;
+            localclienthostname += NAMEDATALEN;
+            localactivity += pgstat_track_activity_query_size;
+#ifdef USE_SSL
+            localsslstatus++;
+#endif
+            localNumBackends++;
+        }
+    }
+
+    /* Set the pointer only after completion of a valid table */
+    localBackendStatusTable = localtable;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_beentry() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    our local copy of the current-activity entry for one backend.
+ *
+ *    NB: caller is responsible for checking that the user is permitted to
+ *    see this info (especially the querystring).
+ * ----------
+ */
+PgBackendStatus *
+pgstat_fetch_stat_beentry(int beid)
+{
+    pgstat_read_current_status();
+
+    if (beid < 1 || beid > localNumBackends)
+        return NULL;
+
+    return &localBackendStatusTable[beid - 1].backendStatus;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_local_beentry() -
+ *
+ *    Like pgstat_fetch_stat_beentry() but with locally computed additions (like
+ *    xid and xmin values of the backend)
+ *
+ *    NB: caller is responsible for checking that the user is permitted to
+ *    see this info (especially the querystring).
+ * ----------
+ */
+LocalPgBackendStatus *
+pgstat_fetch_stat_local_beentry(int beid)
+{
+    pgstat_read_current_status();
+
+    if (beid < 1 || beid > localNumBackends)
+        return NULL;
+
+    return &localBackendStatusTable[beid - 1];
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_numbackends() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    the maximum current backend id.
+ * ----------
+ */
+int
+pgstat_fetch_stat_numbackends(void)
+{
+    pgstat_read_current_status();
+
+    return localNumBackends;
+}
+
+/* ----------
+ * pgstat_get_wait_event_type() -
+ *
+ *    Return a string representing the type of the wait event the backend
+ *    is currently waiting on, or NULL if it is not waiting.
+ */
+const char *
+pgstat_get_wait_event_type(uint32 wait_event_info)
+{
+    uint32        classId;
+    const char *event_type;
+
+    /* report process as not waiting. */
+    if (wait_event_info == 0)
+        return NULL;
+
+    classId = wait_event_info & 0xFF000000;
+
+    switch (classId)
+    {
+        case PG_WAIT_LWLOCK:
+            event_type = "LWLock";
+            break;
+        case PG_WAIT_LOCK:
+            event_type = "Lock";
+            break;
+        case PG_WAIT_BUFFER_PIN:
+            event_type = "BufferPin";
+            break;
+        case PG_WAIT_ACTIVITY:
+            event_type = "Activity";
+            break;
+        case PG_WAIT_CLIENT:
+            event_type = "Client";
+            break;
+        case PG_WAIT_EXTENSION:
+            event_type = "Extension";
+            break;
+        case PG_WAIT_IPC:
+            event_type = "IPC";
+            break;
+        case PG_WAIT_TIMEOUT:
+            event_type = "Timeout";
+            break;
+        case PG_WAIT_IO:
+            event_type = "IO";
+            break;
+        default:
+            event_type = "???";
+            break;
+    }
+
+    return event_type;
+}
+
+/* ----------
+ * pgstat_get_wait_event() -
+ *
+ *    Return a string representing the wait event the backend is currently
+ *    waiting on, or NULL if it is not waiting.
+ */
+const char *
+pgstat_get_wait_event(uint32 wait_event_info)
+{
+    uint32        classId;
+    uint16        eventId;
+    const char *event_name;
+
+    /* report process as not waiting. */
+    if (wait_event_info == 0)
+        return NULL;
+
+    classId = wait_event_info & 0xFF000000;
+    eventId = wait_event_info & 0x0000FFFF;
+
+    switch (classId)
+    {
+        case PG_WAIT_LWLOCK:
+            event_name = GetLWLockIdentifier(classId, eventId);
+            break;
+        case PG_WAIT_LOCK:
+            event_name = GetLockNameFromTagType(eventId);
+            break;
+        case PG_WAIT_BUFFER_PIN:
+            event_name = "BufferPin";
+            break;
+        case PG_WAIT_ACTIVITY:
+            {
+                WaitEventActivity w = (WaitEventActivity) wait_event_info;
+
+                event_name = pgstat_get_wait_activity(w);
+                break;
+            }
+        case PG_WAIT_CLIENT:
+            {
+                WaitEventClient w = (WaitEventClient) wait_event_info;
+
+                event_name = pgstat_get_wait_client(w);
+                break;
+            }
+        case PG_WAIT_EXTENSION:
+            event_name = "Extension";
+            break;
+        case PG_WAIT_IPC:
+            {
+                WaitEventIPC w = (WaitEventIPC) wait_event_info;
+
+                event_name = pgstat_get_wait_ipc(w);
+                break;
+            }
+        case PG_WAIT_TIMEOUT:
+            {
+                WaitEventTimeout w = (WaitEventTimeout) wait_event_info;
+
+                event_name = pgstat_get_wait_timeout(w);
+                break;
+            }
+        case PG_WAIT_IO:
+            {
+                WaitEventIO w = (WaitEventIO) wait_event_info;
+
+                event_name = pgstat_get_wait_io(w);
+                break;
+            }
+        default:
+            event_name = "unknown wait event";
+            break;
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_activity() -
+ *
+ * Convert WaitEventActivity to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_activity(WaitEventActivity w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_ARCHIVER_MAIN:
+            event_name = "ArchiverMain";
+            break;
+        case WAIT_EVENT_AUTOVACUUM_MAIN:
+            event_name = "AutoVacuumMain";
+            break;
+        case WAIT_EVENT_BGWRITER_HIBERNATE:
+            event_name = "BgWriterHibernate";
+            break;
+        case WAIT_EVENT_BGWRITER_MAIN:
+            event_name = "BgWriterMain";
+            break;
+        case WAIT_EVENT_CHECKPOINTER_MAIN:
+            event_name = "CheckpointerMain";
+            break;
+        case WAIT_EVENT_LOGICAL_APPLY_MAIN:
+            event_name = "LogicalApplyMain";
+            break;
+        case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
+            event_name = "LogicalLauncherMain";
+            break;
+        case WAIT_EVENT_RECOVERY_WAL_ALL:
+            event_name = "RecoveryWalAll";
+            break;
+        case WAIT_EVENT_RECOVERY_WAL_STREAM:
+            event_name = "RecoveryWalStream";
+            break;
+        case WAIT_EVENT_SYSLOGGER_MAIN:
+            event_name = "SysLoggerMain";
+            break;
+        case WAIT_EVENT_WAL_RECEIVER_MAIN:
+            event_name = "WalReceiverMain";
+            break;
+        case WAIT_EVENT_WAL_SENDER_MAIN:
+            event_name = "WalSenderMain";
+            break;
+        case WAIT_EVENT_WAL_WRITER_MAIN:
+            event_name = "WalWriterMain";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_client() -
+ *
+ * Convert WaitEventClient to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_client(WaitEventClient w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_CLIENT_READ:
+            event_name = "ClientRead";
+            break;
+        case WAIT_EVENT_CLIENT_WRITE:
+            event_name = "ClientWrite";
+            break;
+        case WAIT_EVENT_LIBPQWALRECEIVER_CONNECT:
+            event_name = "LibPQWalReceiverConnect";
+            break;
+        case WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE:
+            event_name = "LibPQWalReceiverReceive";
+            break;
+        case WAIT_EVENT_SSL_OPEN_SERVER:
+            event_name = "SSLOpenServer";
+            break;
+        case WAIT_EVENT_WAL_RECEIVER_WAIT_START:
+            event_name = "WalReceiverWaitStart";
+            break;
+        case WAIT_EVENT_WAL_SENDER_WAIT_WAL:
+            event_name = "WalSenderWaitForWAL";
+            break;
+        case WAIT_EVENT_WAL_SENDER_WRITE_DATA:
+            event_name = "WalSenderWriteData";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_ipc() -
+ *
+ * Convert WaitEventIPC to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_ipc(WaitEventIPC w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_BGWORKER_SHUTDOWN:
+            event_name = "BgWorkerShutdown";
+            break;
+        case WAIT_EVENT_BGWORKER_STARTUP:
+            event_name = "BgWorkerStartup";
+            break;
+        case WAIT_EVENT_BTREE_PAGE:
+            event_name = "BtreePage";
+            break;
+        case WAIT_EVENT_CLOG_GROUP_UPDATE:
+            event_name = "ClogGroupUpdate";
+            break;
+        case WAIT_EVENT_EXECUTE_GATHER:
+            event_name = "ExecuteGather";
+            break;
+        case WAIT_EVENT_HASH_BATCH_ALLOCATING:
+            event_name = "Hash/Batch/Allocating";
+            break;
+        case WAIT_EVENT_HASH_BATCH_ELECTING:
+            event_name = "Hash/Batch/Electing";
+            break;
+        case WAIT_EVENT_HASH_BATCH_LOADING:
+            event_name = "Hash/Batch/Loading";
+            break;
+        case WAIT_EVENT_HASH_BUILD_ALLOCATING:
+            event_name = "Hash/Build/Allocating";
+            break;
+        case WAIT_EVENT_HASH_BUILD_ELECTING:
+            event_name = "Hash/Build/Electing";
+            break;
+        case WAIT_EVENT_HASH_BUILD_HASHING_INNER:
+            event_name = "Hash/Build/HashingInner";
+            break;
+        case WAIT_EVENT_HASH_BUILD_HASHING_OUTER:
+            event_name = "Hash/Build/HashingOuter";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING:
+            event_name = "Hash/GrowBatches/Allocating";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_DECIDING:
+            event_name = "Hash/GrowBatches/Deciding";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_ELECTING:
+            event_name = "Hash/GrowBatches/Electing";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_FINISHING:
+            event_name = "Hash/GrowBatches/Finishing";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING:
+            event_name = "Hash/GrowBatches/Repartitioning";
+            break;
+        case WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING:
+            event_name = "Hash/GrowBuckets/Allocating";
+            break;
+        case WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING:
+            event_name = "Hash/GrowBuckets/Electing";
+            break;
+        case WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING:
+            event_name = "Hash/GrowBuckets/Reinserting";
+            break;
+        case WAIT_EVENT_LOGICAL_SYNC_DATA:
+            event_name = "LogicalSyncData";
+            break;
+        case WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE:
+            event_name = "LogicalSyncStateChange";
+            break;
+        case WAIT_EVENT_MQ_INTERNAL:
+            event_name = "MessageQueueInternal";
+            break;
+        case WAIT_EVENT_MQ_PUT_MESSAGE:
+            event_name = "MessageQueuePutMessage";
+            break;
+        case WAIT_EVENT_MQ_RECEIVE:
+            event_name = "MessageQueueReceive";
+            break;
+        case WAIT_EVENT_MQ_SEND:
+            event_name = "MessageQueueSend";
+            break;
+        case WAIT_EVENT_PARALLEL_BITMAP_SCAN:
+            event_name = "ParallelBitmapScan";
+            break;
+        case WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN:
+            event_name = "ParallelCreateIndexScan";
+            break;
+        case WAIT_EVENT_PARALLEL_FINISH:
+            event_name = "ParallelFinish";
+            break;
+        case WAIT_EVENT_PROCARRAY_GROUP_UPDATE:
+            event_name = "ProcArrayGroupUpdate";
+            break;
+        case WAIT_EVENT_PROMOTE:
+            event_name = "Promote";
+            break;
+        case WAIT_EVENT_REPLICATION_ORIGIN_DROP:
+            event_name = "ReplicationOriginDrop";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_DROP:
+            event_name = "ReplicationSlotDrop";
+            break;
+        case WAIT_EVENT_SAFE_SNAPSHOT:
+            event_name = "SafeSnapshot";
+            break;
+        case WAIT_EVENT_SYNC_REP:
+            event_name = "SyncRep";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_timeout() -
+ *
+ * Convert WaitEventTimeout to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_timeout(WaitEventTimeout w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_BASE_BACKUP_THROTTLE:
+            event_name = "BaseBackupThrottle";
+            break;
+        case WAIT_EVENT_PG_SLEEP:
+            event_name = "PgSleep";
+            break;
+        case WAIT_EVENT_RECOVERY_APPLY_DELAY:
+            event_name = "RecoveryApplyDelay";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_io() -
+ *
+ * Convert WaitEventIO to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_io(WaitEventIO w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_BUFFILE_READ:
+            event_name = "BufFileRead";
+            break;
+        case WAIT_EVENT_BUFFILE_WRITE:
+            event_name = "BufFileWrite";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_READ:
+            event_name = "ControlFileRead";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_SYNC:
+            event_name = "ControlFileSync";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE:
+            event_name = "ControlFileSyncUpdate";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_WRITE:
+            event_name = "ControlFileWrite";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE:
+            event_name = "ControlFileWriteUpdate";
+            break;
+        case WAIT_EVENT_COPY_FILE_READ:
+            event_name = "CopyFileRead";
+            break;
+        case WAIT_EVENT_COPY_FILE_WRITE:
+            event_name = "CopyFileWrite";
+            break;
+        case WAIT_EVENT_DATA_FILE_EXTEND:
+            event_name = "DataFileExtend";
+            break;
+        case WAIT_EVENT_DATA_FILE_FLUSH:
+            event_name = "DataFileFlush";
+            break;
+        case WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC:
+            event_name = "DataFileImmediateSync";
+            break;
+        case WAIT_EVENT_DATA_FILE_PREFETCH:
+            event_name = "DataFilePrefetch";
+            break;
+        case WAIT_EVENT_DATA_FILE_READ:
+            event_name = "DataFileRead";
+            break;
+        case WAIT_EVENT_DATA_FILE_SYNC:
+            event_name = "DataFileSync";
+            break;
+        case WAIT_EVENT_DATA_FILE_TRUNCATE:
+            event_name = "DataFileTruncate";
+            break;
+        case WAIT_EVENT_DATA_FILE_WRITE:
+            event_name = "DataFileWrite";
+            break;
+        case WAIT_EVENT_DSM_FILL_ZERO_WRITE:
+            event_name = "DSMFillZeroWrite";
+            break;
+        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ:
+            event_name = "LockFileAddToDataDirRead";
+            break;
+        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC:
+            event_name = "LockFileAddToDataDirSync";
+            break;
+        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE:
+            event_name = "LockFileAddToDataDirWrite";
+            break;
+        case WAIT_EVENT_LOCK_FILE_CREATE_READ:
+            event_name = "LockFileCreateRead";
+            break;
+        case WAIT_EVENT_LOCK_FILE_CREATE_SYNC:
+            event_name = "LockFileCreateSync";
+            break;
+        case WAIT_EVENT_LOCK_FILE_CREATE_WRITE:
+            event_name = "LockFileCreateWrite";
+            break;
+        case WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ:
+            event_name = "LockFileReCheckDataDirRead";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC:
+            event_name = "LogicalRewriteCheckpointSync";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC:
+            event_name = "LogicalRewriteMappingSync";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE:
+            event_name = "LogicalRewriteMappingWrite";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_SYNC:
+            event_name = "LogicalRewriteSync";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE:
+            event_name = "LogicalRewriteTruncate";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_WRITE:
+            event_name = "LogicalRewriteWrite";
+            break;
+        case WAIT_EVENT_RELATION_MAP_READ:
+            event_name = "RelationMapRead";
+            break;
+        case WAIT_EVENT_RELATION_MAP_SYNC:
+            event_name = "RelationMapSync";
+            break;
+        case WAIT_EVENT_RELATION_MAP_WRITE:
+            event_name = "RelationMapWrite";
+            break;
+        case WAIT_EVENT_REORDER_BUFFER_READ:
+            event_name = "ReorderBufferRead";
+            break;
+        case WAIT_EVENT_REORDER_BUFFER_WRITE:
+            event_name = "ReorderBufferWrite";
+            break;
+        case WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ:
+            event_name = "ReorderLogicalMappingRead";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_READ:
+            event_name = "ReplicationSlotRead";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC:
+            event_name = "ReplicationSlotRestoreSync";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_SYNC:
+            event_name = "ReplicationSlotSync";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_WRITE:
+            event_name = "ReplicationSlotWrite";
+            break;
+        case WAIT_EVENT_SLRU_FLUSH_SYNC:
+            event_name = "SLRUFlushSync";
+            break;
+        case WAIT_EVENT_SLRU_READ:
+            event_name = "SLRURead";
+            break;
+        case WAIT_EVENT_SLRU_SYNC:
+            event_name = "SLRUSync";
+            break;
+        case WAIT_EVENT_SLRU_WRITE:
+            event_name = "SLRUWrite";
+            break;
+        case WAIT_EVENT_SNAPBUILD_READ:
+            event_name = "SnapbuildRead";
+            break;
+        case WAIT_EVENT_SNAPBUILD_SYNC:
+            event_name = "SnapbuildSync";
+            break;
+        case WAIT_EVENT_SNAPBUILD_WRITE:
+            event_name = "SnapbuildWrite";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC:
+            event_name = "TimelineHistoryFileSync";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE:
+            event_name = "TimelineHistoryFileWrite";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_READ:
+            event_name = "TimelineHistoryRead";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_SYNC:
+            event_name = "TimelineHistorySync";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_WRITE:
+            event_name = "TimelineHistoryWrite";
+            break;
+        case WAIT_EVENT_TWOPHASE_FILE_READ:
+            event_name = "TwophaseFileRead";
+            break;
+        case WAIT_EVENT_TWOPHASE_FILE_SYNC:
+            event_name = "TwophaseFileSync";
+            break;
+        case WAIT_EVENT_TWOPHASE_FILE_WRITE:
+            event_name = "TwophaseFileWrite";
+            break;
+        case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ:
+            event_name = "WALSenderTimelineHistoryRead";
+            break;
+        case WAIT_EVENT_WAL_BOOTSTRAP_SYNC:
+            event_name = "WALBootstrapSync";
+            break;
+        case WAIT_EVENT_WAL_BOOTSTRAP_WRITE:
+            event_name = "WALBootstrapWrite";
+            break;
+        case WAIT_EVENT_WAL_COPY_READ:
+            event_name = "WALCopyRead";
+            break;
+        case WAIT_EVENT_WAL_COPY_SYNC:
+            event_name = "WALCopySync";
+            break;
+        case WAIT_EVENT_WAL_COPY_WRITE:
+            event_name = "WALCopyWrite";
+            break;
+        case WAIT_EVENT_WAL_INIT_SYNC:
+            event_name = "WALInitSync";
+            break;
+        case WAIT_EVENT_WAL_INIT_WRITE:
+            event_name = "WALInitWrite";
+            break;
+        case WAIT_EVENT_WAL_READ:
+            event_name = "WALRead";
+            break;
+        case WAIT_EVENT_WAL_SYNC:
+            event_name = "WALSync";
+            break;
+        case WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN:
+            event_name = "WALSyncMethodAssign";
+            break;
+        case WAIT_EVENT_WAL_WRITE:
+            event_name = "WALWrite";
+            break;
+
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+
+/* ----------
+ * pgstat_get_backend_current_activity() -
+ *
+ *    Return a string representing the current activity of the backend with
+ *    the specified PID.  This looks directly at the BackendStatusArray,
+ *    and so will provide current information regardless of the age of our
+ *    transaction's snapshot of the status array.
+ *
+ *    It is the caller's responsibility to invoke this only for backends whose
+ *    state is expected to remain stable while the result is in use.  The
+ *    only current use is in deadlock reporting, where we can expect that
+ *    the target backend is blocked on a lock.  (There are corner cases
+ *    where the target's wait could get aborted while we are looking at it,
+ *    but the very worst consequence is to return a pointer to a string
+ *    that's been changed, so we won't worry too much.)
+ *
+ *    Note: return strings for special cases match pg_stat_get_backend_activity.
+ * ----------
+ */
+const char *
+pgstat_get_backend_current_activity(int pid, bool checkUser)
+{
+    PgBackendStatus *beentry;
+    int            i;
+
+    beentry = BackendStatusArray;
+    for (i = 1; i <= MaxBackends; i++)
+    {
+        /*
+         * Although we expect the target backend's entry to be stable, that
+         * doesn't imply that anyone else's is.  To avoid identifying the
+         * wrong backend, while we check for a match to the desired PID we
+         * must follow the protocol of retrying if st_changecount changes
+         * while we examine the entry, or if it's odd.  (This might be
+         * unnecessary, since fetching or storing an int is almost certainly
+         * atomic, but let's play it safe.)  We use a volatile pointer here to
+         * ensure the compiler doesn't try to get cute.
+         */
+        volatile PgBackendStatus *vbeentry = beentry;
+        bool        found;
+
+        for (;;)
+        {
+            int            before_changecount;
+            int            after_changecount;
+
+            pgstat_save_changecount_before(vbeentry, before_changecount);
+
+            found = (vbeentry->st_procpid == pid);
+
+            pgstat_save_changecount_after(vbeentry, after_changecount);
+
+            if (before_changecount == after_changecount &&
+                (before_changecount & 1) == 0)
+                break;
+
+            /* Make sure we can break out of loop if stuck... */
+            CHECK_FOR_INTERRUPTS();
+        }
+
+        if (found)
+        {
+            /* Now it is safe to use the non-volatile pointer */
+            if (checkUser && !superuser() && beentry->st_userid != GetUserId())
+                return "<insufficient privilege>";
+            else if (*(beentry->st_activity_raw) == '\0')
+                return "<command string not enabled>";
+            else
+            {
+                /* this'll leak a bit of memory, but that seems acceptable */
+                return pgstat_clip_activity(beentry->st_activity_raw);
+            }
+        }
+
+        beentry++;
+    }
+
+    /* If we get here, caller is in error ... */
+    return "<backend information not available>";
+}
+
+/* ----------
+ * pgstat_get_crashed_backend_activity() -
+ *
+ *    Return a string representing the current activity of the backend with
+ *    the specified PID.  Like the function above, but reads shared memory with
+ *    the expectation that it may be corrupt.  On success, copy the string
+ *    into the "buffer" argument and return that pointer.  On failure,
+ *    return NULL.
+ *
+ *    This function is only intended to be used by the postmaster to report the
+ *    query that crashed a backend.  In particular, no attempt is made to
+ *    follow the correct concurrency protocol when accessing the
+ *    BackendStatusArray.  But that's OK, in the worst case we'll return a
+ *    corrupted message.  We also must take care not to trip on ereport(ERROR).
+ * ----------
+ */
+const char *
+pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
+{
+    volatile PgBackendStatus *beentry;
+    int            i;
+
+    beentry = BackendStatusArray;
+
+    /*
+     * We probably shouldn't get here before shared memory has been set up,
+     * but be safe.
+     */
+    if (beentry == NULL || BackendActivityBuffer == NULL)
+        return NULL;
+
+    for (i = 1; i <= MaxBackends; i++)
+    {
+        if (beentry->st_procpid == pid)
+        {
+            /* Read pointer just once, so it can't change after validation */
+            const char *activity = beentry->st_activity_raw;
+            const char *activity_last;
+
+            /*
+             * We mustn't access activity string before we verify that it
+             * falls within the BackendActivityBuffer. To make sure that the
+             * entire string including its ending is contained within the
+             * buffer, subtract one activity length from the buffer size.
+             */
+            activity_last = BackendActivityBuffer + BackendActivityBufferSize
+                - pgstat_track_activity_query_size;
+
+            if (activity < BackendActivityBuffer ||
+                activity > activity_last)
+                return NULL;
+
+            /* If no string available, no point in a report */
+            if (activity[0] == '\0')
+                return NULL;
+
+            /*
+             * Copy only ASCII-safe characters so we don't run into encoding
+             * problems when reporting the message; and be sure not to run off
+             * the end of memory.  As only ASCII characters are reported, it
+             * doesn't seem necessary to perform multibyte aware clipping.
+             */
+            ascii_safe_strlcpy(buffer, activity,
+                               Min(buflen, pgstat_track_activity_query_size));
+
+            return buffer;
+        }
+
+        beentry++;
+    }
+
+    /* PID not found */
+    return NULL;
+}
+
+const char *
+pgstat_get_backend_desc(BackendType backendType)
+{
+    const char *backendDesc = "unknown process type";
+
+    switch (backendType)
+    {
+        case B_AUTOVAC_LAUNCHER:
+            backendDesc = "autovacuum launcher";
+            break;
+        case B_AUTOVAC_WORKER:
+            backendDesc = "autovacuum worker";
+            break;
+        case B_BACKEND:
+            backendDesc = "client backend";
+            break;
+        case B_BG_WORKER:
+            backendDesc = "background worker";
+            break;
+        case B_BG_WRITER:
+            backendDesc = "background writer";
+            break;
+        case B_CHECKPOINTER:
+            backendDesc = "checkpointer";
+            break;
+        case B_STARTUP:
+            backendDesc = "startup";
+            break;
+        case B_WAL_RECEIVER:
+            backendDesc = "walreceiver";
+            break;
+        case B_WAL_SENDER:
+            backendDesc = "walsender";
+            break;
+        case B_WAL_WRITER:
+            backendDesc = "walwriter";
+            break;
+        case B_ARCHIVER:
+            backendDesc = "archiver";
+            break;
+    }
+
+    return backendDesc;
+}
+
+/* ----------
+ * pgstat_report_appname() -
+ *
+ *    Called to update our application name.
+ * ----------
+ */
+void
+pgstat_report_appname(const char *appname)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+    int            len;
+
+    if (!beentry)
+        return;
+
+    /* This should be unnecessary if GUC did its job, but be safe */
+    len = pg_mbcliplen(appname, strlen(appname), NAMEDATALEN - 1);
+
+    /*
+     * Update my status entry, following the protocol of bumping
+     * st_changecount before and after.  We use a volatile pointer here to
+     * ensure the compiler doesn't try to get cute.
+     */
+    pgstat_increment_changecount_before(beentry);
+
+    memcpy((char *) beentry->st_appname, appname, len);
+    beentry->st_appname[len] = '\0';
+
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*
+ * Report current transaction start timestamp as the specified value.
+ * Zero means there is no active transaction.
+ */
+void
+pgstat_report_xact_timestamp(TimestampTz tstamp)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    if (!pgstat_track_activities || !beentry)
+        return;
+
+    /*
+     * Update my status entry, following the protocol of bumping
+     * st_changecount before and after.  We use a volatile pointer here to
+     * ensure the compiler doesn't try to get cute.
+     */
+    pgstat_increment_changecount_before(beentry);
+    beentry->st_xact_start_timestamp = tstamp;
+    pgstat_increment_changecount_after(beentry);
+}
+
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ *    Create pgBeStatLocalContext, if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+    if (!pgBeStatLocalContext)
+        pgBeStatLocalContext = AllocSetContextCreate(TopMemoryContext,
+                                                     "Backend status snapshot",
+                                                     ALLOCSET_SMALL_SIZES);
+}
+
+/* ----------
+ * pgstat_bestatus_clear_snapshot() -
+ *
+ *    Discard any data collected in the current transaction.  Any subsequent
+ *    request will cause new snapshots to be read.
+ *
+ *    This is also invoked during transaction commit or abort to discard
+ *    the no-longer-wanted snapshot.
+ * ----------
+ */
+void
+pgstat_bestatus_clear_snapshot(void)
+{
+    /* Release memory, if any was allocated */
+    if (pgBeStatLocalContext)
+        MemoryContextDelete(pgBeStatLocalContext);
+
+    /* Reset variables */
+    pgBeStatLocalContext = NULL;
+    localBackendStatusTable = NULL;
+    localNumBackends = 0;
+}
+
+
+
+/* ----------
+ * pgstat_report_activity() -
+ *
+ *    Called from tcop/postgres.c to report what the backend is actually doing
+ *    (but note cmd_str can be NULL for certain cases).
+ *
+ * All updates of the status entry follow the protocol of bumping
+ * st_changecount before and after.  We use a volatile pointer here to
+ * ensure the compiler doesn't try to get cute.
+ * ----------
+ */
+void
+pgstat_report_activity(BackendState state, const char *cmd_str)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+    TimestampTz start_timestamp;
+    TimestampTz current_timestamp;
+    int            len = 0;
+
+    TRACE_POSTGRESQL_STATEMENT_STATUS(cmd_str);
+
+    if (!beentry)
+        return;
+
+    if (!pgstat_track_activities)
+    {
+        if (beentry->st_state != STATE_DISABLED)
+        {
+            volatile PGPROC *proc = MyProc;
+
+            /*
+             * track_activities is disabled, but we last reported a
+             * non-disabled state.  As our final update, change the state and
+             * clear fields we will not be updating anymore.
+             */
+            pgstat_increment_changecount_before(beentry);
+            beentry->st_state = STATE_DISABLED;
+            beentry->st_state_start_timestamp = 0;
+            beentry->st_activity_raw[0] = '\0';
+            beentry->st_activity_start_timestamp = 0;
+            /* st_xact_start_timestamp and wait_event_info are also disabled */
+            beentry->st_xact_start_timestamp = 0;
+            proc->wait_event_info = 0;
+            pgstat_increment_changecount_after(beentry);
+        }
+        return;
+    }
+
+    /*
+     * To minimize the time spent modifying the entry, fetch all the needed
+     * data first.
+     */
+    start_timestamp = GetCurrentStatementStartTimestamp();
+    if (cmd_str != NULL)
+    {
+        /*
+         * Compute the length of the string to store without regard to
+         * multi-byte character boundaries.  For speed, any mid-character
+         * truncation is corrected on read rather than on every write.
+         */
+        len = Min(strlen(cmd_str), pgstat_track_activity_query_size - 1);
+    }
+    current_timestamp = GetCurrentTimestamp();
+
+    /*
+     * Now update the status entry
+     */
+    pgstat_increment_changecount_before(beentry);
+
+    beentry->st_state = state;
+    beentry->st_state_start_timestamp = current_timestamp;
+
+    if (cmd_str != NULL)
+    {
+        memcpy((char *) beentry->st_activity_raw, cmd_str, len);
+        beentry->st_activity_raw[len] = '\0';
+        beentry->st_activity_start_timestamp = start_timestamp;
+    }
+
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * pgstat_progress_start_command() -
+ *
+ * Set st_progress_command (and st_progress_command_target) in own backend
+ * entry.  Also, zero-initialize st_progress_param array.
+ *-----------
+ */
+void
+pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    if (!beentry || !pgstat_track_activities)
+        return;
+
+    pgstat_increment_changecount_before(beentry);
+    beentry->st_progress_command = cmdtype;
+    beentry->st_progress_command_target = relid;
+    MemSet(&beentry->st_progress_param, 0, sizeof(beentry->st_progress_param));
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * pgstat_progress_update_param() -
+ *
+ * Update index'th member in st_progress_param[] of own backend entry.
+ *-----------
+ */
+void
+pgstat_progress_update_param(int index, int64 val)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    Assert(index >= 0 && index < PGSTAT_NUM_PROGRESS_PARAM);
+
+    if (!beentry || !pgstat_track_activities)
+        return;
+
+    pgstat_increment_changecount_before(beentry);
+    beentry->st_progress_param[index] = val;
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * pgstat_progress_update_multi_param() -
+ *
+ * Update multiple members in st_progress_param[] of own backend entry.
+ * This is atomic; readers won't see intermediate states.
+ *-----------
+ */
+void
+pgstat_progress_update_multi_param(int nparam, const int *index,
+                                   const int64 *val)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+    int            i;
+
+    if (!beentry || !pgstat_track_activities || nparam == 0)
+        return;
+
+    pgstat_increment_changecount_before(beentry);
+
+    for (i = 0; i < nparam; ++i)
+    {
+        Assert(index[i] >= 0 && index[i] < PGSTAT_NUM_PROGRESS_PARAM);
+
+        beentry->st_progress_param[index[i]] = val[i];
+    }
+
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * pgstat_progress_end_command() -
+ *
+ * Reset st_progress_command (and st_progress_command_target) in own backend
+ * entry.  This signals the end of the command.
+ *-----------
+ */
+void
+pgstat_progress_end_command(void)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    if (!beentry)
+        return;
+    if (!pgstat_track_activities
+        && beentry->st_progress_command == PROGRESS_COMMAND_INVALID)
+        return;
+
+    pgstat_increment_changecount_before(beentry);
+    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
+    beentry->st_progress_command_target = InvalidOid;
+    pgstat_increment_changecount_after(beentry);
+}
+
+
+/*
+ * Convert a potentially unsafely truncated activity string (see
+ * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
+ * one.
+ *
+ * The returned string is allocated in the caller's memory context and may be
+ * freed.
+ */
+char *
+pgstat_clip_activity(const char *raw_activity)
+{
+    char       *activity;
+    int            rawlen;
+    int            cliplen;
+
+    /*
+     * Some callers, like pgstat_get_backend_current_activity(), do not
+     * guarantee that the buffer isn't concurrently modified. We try to take
+     * care that the buffer is always terminated by a NUL byte regardless, but
+     * let's still be paranoid about the string's length. In those cases the
+     * underlying buffer is guaranteed to be pgstat_track_activity_query_size
+     * large.
+     */
+    activity = pnstrdup(raw_activity, pgstat_track_activity_query_size - 1);
+
+    /* now double-guaranteed to be NUL terminated */
+    rawlen = strlen(activity);
+
+    /*
+     * All supported server-encodings make it possible to determine the length
+     * of a multi-byte character from its first byte (this is not the case for
+     * client encodings, see GB18030). As st_activity is always stored using
+     * server encoding, this allows us to perform multi-byte aware truncation,
+     * even if the string earlier was truncated in the middle of a multi-byte
+     * character.
+     */
+    cliplen = pg_mbcliplen(activity, rawlen,
+                           pgstat_track_activity_query_size - 1);
+
+    activity[cliplen] = '\0';
+
+    return activity;
+}
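
A note for readers of the bestatus.c hunk above, placed between the two file diffs so that it does not fall inside a hunk. Nearly every writer in that file brackets its update with pgstat_increment_changecount_before() and pgstat_increment_changecount_after(), and pgstat_get_backend_current_activity() retries its read until st_changecount is unchanged and even. The standalone sketch below shows that pattern in isolation. It is not part of the patch: DemoStatus and the demo_* functions are hypothetical names, and using C11 atomics for the counter is an assumption standing in for whatever barriers the real macros issue.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for one status slot; NOT the real PgBackendStatus. */
typedef struct DemoStatus
{
    atomic_int  changecount;    /* odd while a write is in progress */
    int         procpid;
    char        appname[64];
} DemoStatus;

/* Start with an even counter and empty payload. */
void
demo_init(DemoStatus *st)
{
    atomic_init(&st->changecount, 0);
    st->procpid = 0;
    st->appname[0] = '\0';
}

/*
 * Writer side: bump the counter before and after touching the payload,
 * so that it is odd for exactly the duration of the update.
 */
void
demo_write(DemoStatus *st, int pid, const char *appname)
{
    size_t      len;

    assert(appname != NULL);
    len = strlen(appname);
    if (len > sizeof(st->appname) - 1)
        len = sizeof(st->appname) - 1;

    atomic_fetch_add(&st->changecount, 1);  /* counter becomes odd */
    st->procpid = pid;
    memcpy(st->appname, appname, len);
    st->appname[len] = '\0';
    atomic_fetch_add(&st->changecount, 1);  /* counter is even again */
}

/*
 * Reader side, one attempt: copy the payload and report whether the copy
 * is trustworthy, i.e. the counter did not move and was even.
 *
 * The payload fields are plain (non-atomic), as in a classic seqlock.
 * This shows the control flow only; a strictly conforming concurrent
 * C11 version would need more care.
 */
bool
demo_try_read(DemoStatus *st, DemoStatus *out)
{
    int         before = atomic_load(&st->changecount);
    int         after;

    out->procpid = st->procpid;
    memcpy(out->appname, st->appname, sizeof(out->appname));

    after = atomic_load(&st->changecount);
    return before == after && (before & 1) == 0;
}

/* Reader side, retrying until a consistent copy is obtained. */
void
demo_read(DemoStatus *st, DemoStatus *out)
{
    while (!demo_try_read(st, out))
    {
        /* the code in the patch calls CHECK_FOR_INTERRUPTS() here */
    }
}
```

A caller would run demo_init() once, let the owning process call demo_write(), and have any observer call demo_read(). The reader never blocks the writer; it only pays for a retry when it happens to overlap an update, which fits a status array that is written far more often than it is read.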
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/statmon/pgstat.c
similarity index 70%
rename from src/backend/postmaster/pgstat.c
rename to src/backend/statmon/pgstat.c
index 2d3f7cb898..4a1101c2b0 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/statmon/pgstat.c
@@ -8,7 +8,7 @@
  *
  *    Copyright (c) 2001-2018, PostgreSQL Global Development Group
  *
- *    src/backend/postmaster/pgstat.c
+ *    src/backend/statmon/pgstat.c
  * ----------
  */
 #include "postgres.h"
@@ -21,19 +21,14 @@
 #include "access/htup_details.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "libpq/libpq.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "postmaster/autovacuum.h"
-#include "replication/walsender.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/procsignal.h"
-#include "storage/sinvaladt.h"
-#include "utils/ascii.h"
-#include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
 
@@ -68,26 +63,12 @@ typedef enum
     PGSTAT_ENTRY_LOCK_FAILED
 } pg_stat_table_result_status;
 
-/* ----------
- * Total number of backends including auxiliary
- *
- * We reserve a slot for each possible BackendId, plus one for each
- * possible auxiliary process type.  (This scheme assumes there is not
- * more than one of any auxiliary process type at a time.) MaxBackends
- * includes autovacuum workers and background workers as well.
- * ----------
- */
-#define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
-
-
 /* ----------
  * GUC parameters
  * ----------
  */
-bool        pgstat_track_activities = false;
 bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
-int            pgstat_track_activity_query_size = 1024;
 
 /* Shared stats bootstrap infomation */
 typedef struct StatsShmemStruct {
@@ -125,6 +106,8 @@ static bool pgstat_pending_recoveryconflict = false;
 static bool pgstat_pending_deadlock = false;
 static bool pgstat_pending_tempfile = false;
 
+static MemoryContext pgStatLocalContext = NULL;
+
 /* dshash parameter for each type of table */
 static const dshash_parameters dsh_dbparams = {
     sizeof(Oid),
@@ -236,15 +219,8 @@ typedef struct
 /*
  * Info about current "snapshot" of stats file
  */
-static MemoryContext pgStatLocalContext = NULL;
 static HTAB *pgStatDBHash = NULL;
 
-/* Status for backends including auxiliary */
-static LocalPgBackendStatus *localBackendStatusTable = NULL;
-
-/* Total number of backends including auxiliary */
-static int    localNumBackends = 0;
-
 /*
  * Cluster wide statistics.
  * Contains statistics that are not collected per database or per table.
@@ -280,7 +256,6 @@ static void pgstat_read_db_statsfile(Oid databaseid, dshash_table *tabhash, dsha
 /* functions used in backends */
 static bool backend_snapshot_global_stats(void);
 static PgStat_StatFuncEntry *backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot);
-static void pgstat_read_current_status(void);
 
 static void pgstat_postmaster_shutdown(int code, Datum arg);
 static void pgstat_apply_pending_tabstats(bool shared, bool force,
@@ -307,12 +282,6 @@ static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
 
 static void pgstat_setup_memcxt(void);
 
-static const char *pgstat_get_wait_activity(WaitEventActivity w);
-static const char *pgstat_get_wait_client(WaitEventClient w);
-static const char *pgstat_get_wait_ipc(WaitEventIPC w);
-static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
-static const char *pgstat_get_wait_io(WaitEventIO w);
-
 static bool pgstat_update_tabentry(dshash_table *tabhash,
                                    PgStat_TableStatus *stat, bool nowait);
 static void pgstat_update_dbentry(PgStat_StatDBEntry *dbentry,
@@ -323,6 +292,14 @@ static void pgstat_update_dbentry(PgStat_StatDBEntry *dbentry,
  * ------------------------------------------------------------
  */
 
+
+void
+pgstat_initialize(void)
+{
+    /* Set up a process-exit hook to clean up */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
+}
+
 /*
  * subroutine for pgstat_reset_all
  */
@@ -484,7 +461,7 @@ pgstat_update_stat(bool force)
          */
         TimestampDifference(last_report, now, &secs, &usecs);
         elapsed = secs * 1000 + usecs /1000;
-        
+
         if(elapsed < PGSTAT_STAT_MIN_INTERVAL)
         {
             /* we know we have some statistics */
@@ -740,7 +717,7 @@ pgstat_apply_tabstat(pgstat_apply_tabstat_context *cxt,
             pgStatBlockReadTime = 0;
             pgStatBlockWriteTime = 0;
         }
-        
+
         cxt->tabhash =
             dshash_attach(area, &dsh_tblparams, cxt->dbentry->tables, 0);
     }
@@ -800,7 +777,7 @@ pgstat_merge_tabentry(PgStat_TableStatus *deststat,
         dest->t_blocks_hit += src->t_blocks_hit;
     }
 }
-        
+
 /*
  * pgstat_update_funcstats: subroutine for pgstat_update_stat
  *
@@ -920,7 +897,7 @@ pgstat_update_funcstats(bool force, PgStat_StatDBEntry *dbentry)
                 hash_search(pgStatPendingFunctions,
                             (void *) &(pendent->functionid), HASH_REMOVE, NULL);
             }
-        }    
+        }
 
         /* destroy the hsah if no entry remains */
         if (hash_get_num_entries(pgStatPendingFunctions) == 0)
@@ -1058,7 +1035,7 @@ pgstat_vacuum_stat(void)
     dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, NULL);
     if (!dbentry)
         return;
-    
+
     /*
      * Similarly to above, make a list of all known relations in this DB.
      */
@@ -2611,66 +2588,6 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     return funcentry;
 }
 
-
-/* ----------
- * pgstat_fetch_stat_beentry() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    our local copy of the current-activity entry for one backend.
- *
- *    NB: caller is responsible for a check if the user is permitted to see
- *    this info (especially the querystring).
- * ----------
- */
-PgBackendStatus *
-pgstat_fetch_stat_beentry(int beid)
-{
-    pgstat_read_current_status();
-
-    if (beid < 1 || beid > localNumBackends)
-        return NULL;
-
-    return &localBackendStatusTable[beid - 1].backendStatus;
-}
-
-
-/* ----------
- * pgstat_fetch_stat_local_beentry() -
- *
- *    Like pgstat_fetch_stat_beentry() but with locally computed additions (like
- *    xid and xmin values of the backend)
- *
- *    NB: caller is responsible for a check if the user is permitted to see
- *    this info (especially the querystring).
- * ----------
- */
-LocalPgBackendStatus *
-pgstat_fetch_stat_local_beentry(int beid)
-{
-    pgstat_read_current_status();
-
-    if (beid < 1 || beid > localNumBackends)
-        return NULL;
-
-    return &localBackendStatusTable[beid - 1];
-}
-
-
-/* ----------
- * pgstat_fetch_stat_numbackends() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the maximum current backend id.
- * ----------
- */
-int
-pgstat_fetch_stat_numbackends(void)
-{
-    pgstat_read_current_status();
-
-    return localNumBackends;
-}
-
 /*
  * ---------
  * pgstat_fetch_stat_archiver() -
@@ -2708,364 +2625,6 @@ pgstat_fetch_global(void)
     return snapshot_globalStats;
 }
 
-
-/* ------------------------------------------------------------
- * Functions for management of the shared-memory PgBackendStatus array
- * ------------------------------------------------------------
- */
-
-static PgBackendStatus *BackendStatusArray = NULL;
-static PgBackendStatus *MyBEEntry = NULL;
-static char *BackendAppnameBuffer = NULL;
-static char *BackendClientHostnameBuffer = NULL;
-static char *BackendActivityBuffer = NULL;
-static Size BackendActivityBufferSize = 0;
-#ifdef USE_SSL
-static PgBackendSSLStatus *BackendSslStatusBuffer = NULL;
-#endif
-
-
-/*
- * Report shared-memory space needed by CreateSharedBackendStatus.
- */
-Size
-BackendStatusShmemSize(void)
-{
-    Size        size;
-
-    /* BackendStatusArray: */
-    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
-    /* BackendAppnameBuffer: */
-    size = add_size(size,
-                    mul_size(NAMEDATALEN, NumBackendStatSlots));
-    /* BackendClientHostnameBuffer: */
-    size = add_size(size,
-                    mul_size(NAMEDATALEN, NumBackendStatSlots));
-    /* BackendActivityBuffer: */
-    size = add_size(size,
-                    mul_size(pgstat_track_activity_query_size, NumBackendStatSlots));
-#ifdef USE_SSL
-    /* BackendSslStatusBuffer: */
-    size = add_size(size,
-                    mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots));
-#endif
-    return size;
-}
-
-/*
- * Initialize the shared status array and several string buffers
- * during postmaster startup.
- */
-void
-CreateSharedBackendStatus(void)
-{
-    Size        size;
-    bool        found;
-    int            i;
-    char       *buffer;
-
-    /* Create or attach to the shared array */
-    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
-    BackendStatusArray = (PgBackendStatus *)
-        ShmemInitStruct("Backend Status Array", size, &found);
-
-    if (!found)
-    {
-        /*
-         * We're the first - initialize.
-         */
-        MemSet(BackendStatusArray, 0, size);
-    }
-
-    /* Create or attach to the shared appname buffer */
-    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
-    BackendAppnameBuffer = (char *)
-        ShmemInitStruct("Backend Application Name Buffer", size, &found);
-
-    if (!found)
-    {
-        MemSet(BackendAppnameBuffer, 0, size);
-
-        /* Initialize st_appname pointers. */
-        buffer = BackendAppnameBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_appname = buffer;
-            buffer += NAMEDATALEN;
-        }
-    }
-
-    /* Create or attach to the shared client hostname buffer */
-    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
-    BackendClientHostnameBuffer = (char *)
-        ShmemInitStruct("Backend Client Host Name Buffer", size, &found);
-
-    if (!found)
-    {
-        MemSet(BackendClientHostnameBuffer, 0, size);
-
-        /* Initialize st_clienthostname pointers. */
-        buffer = BackendClientHostnameBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_clienthostname = buffer;
-            buffer += NAMEDATALEN;
-        }
-    }
-
-    /* Create or attach to the shared activity buffer */
-    BackendActivityBufferSize = mul_size(pgstat_track_activity_query_size,
-                                         NumBackendStatSlots);
-    BackendActivityBuffer = (char *)
-        ShmemInitStruct("Backend Activity Buffer",
-                        BackendActivityBufferSize,
-                        &found);
-
-    if (!found)
-    {
-        MemSet(BackendActivityBuffer, 0, BackendActivityBufferSize);
-
-        /* Initialize st_activity pointers. */
-        buffer = BackendActivityBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_activity_raw = buffer;
-            buffer += pgstat_track_activity_query_size;
-        }
-    }
-
-#ifdef USE_SSL
-    /* Create or attach to the shared SSL status buffer */
-    size = mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots);
-    BackendSslStatusBuffer = (PgBackendSSLStatus *)
-        ShmemInitStruct("Backend SSL Status Buffer", size, &found);
-
-    if (!found)
-    {
-        PgBackendSSLStatus *ptr;
-
-        MemSet(BackendSslStatusBuffer, 0, size);
-
-        /* Initialize st_sslstatus pointers. */
-        ptr = BackendSslStatusBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_sslstatus = ptr;
-            ptr++;
-        }
-    }
-#endif
-}
-
-
-/* ----------
- * pgstat_initialize() -
- *
- *    Initialize pgstats state, and set up our on-proc-exit hook.
- *    Called from InitPostgres and AuxiliaryProcessMain. For auxiliary process,
- *    MyBackendId is invalid. Otherwise, MyBackendId must be set,
- *    but we must not have started any transaction yet (since the
- *    exit hook must run after the last transaction exit).
- *    NOTE: MyDatabaseId isn't set yet; so the shutdown hook has to be careful.
- * ----------
- */
-void
-pgstat_initialize(void)
-{
-    /* Initialize MyBEEntry */
-    if (MyBackendId != InvalidBackendId)
-    {
-        Assert(MyBackendId >= 1 && MyBackendId <= MaxBackends);
-        MyBEEntry = &BackendStatusArray[MyBackendId - 1];
-    }
-    else
-    {
-        /* Must be an auxiliary process */
-        Assert(MyAuxProcType != NotAnAuxProcess);
-
-        /*
-         * Assign the MyBEEntry for an auxiliary process.  Since it doesn't
-         * have a BackendId, the slot is statically allocated based on the
-         * auxiliary process type (MyAuxProcType).  Backends use slots indexed
-         * in the range from 1 to MaxBackends (inclusive), so we use
-         * MaxBackends + AuxBackendType + 1 as the index of the slot for an
-         * auxiliary process.
-         */
-        MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
-    }
-
-    /* Set up a process-exit hook to clean up */
-    before_shmem_exit(pgstat_beshutdown_hook, 0);
-}
-
-/* ----------
- * pgstat_bestart() -
- *
- *    Initialize this backend's entry in the PgBackendStatus array.
- *    Called from InitPostgres.
- *
- *    Apart from auxiliary processes, MyBackendId, MyDatabaseId,
- *    session userid, and application_name must be set for a
- *    backend (hence, this cannot be combined with pgstat_initialize).
- * ----------
- */
-void
-pgstat_bestart(void)
-{
-    SockAddr    clientaddr;
-    volatile PgBackendStatus *beentry;
-
-    /*
-     * To minimize the time spent modifying the PgBackendStatus entry, fetch
-     * all the needed data first.
-     */
-
-    /*
-     * We may not have a MyProcPort (eg, if this is the autovacuum process).
-     * If so, use all-zeroes client address, which is dealt with specially in
-     * pg_stat_get_backend_client_addr and pg_stat_get_backend_client_port.
-     */
-    if (MyProcPort)
-        memcpy(&clientaddr, &MyProcPort->raddr, sizeof(clientaddr));
-    else
-        MemSet(&clientaddr, 0, sizeof(clientaddr));
-
-    /*
-     * Initialize my status entry, following the protocol of bumping
-     * st_changecount before and after; and make sure it's even afterwards. We
-     * use a volatile pointer here to ensure the compiler doesn't try to get
-     * cute.
-     */
-    beentry = MyBEEntry;
-
-    /* pgstats state must be initialized from pgstat_initialize() */
-    Assert(beentry != NULL);
-
-    if (MyBackendId != InvalidBackendId)
-    {
-        if (IsAutoVacuumLauncherProcess())
-        {
-            /* Autovacuum Launcher */
-            beentry->st_backendType = B_AUTOVAC_LAUNCHER;
-        }
-        else if (IsAutoVacuumWorkerProcess())
-        {
-            /* Autovacuum Worker */
-            beentry->st_backendType = B_AUTOVAC_WORKER;
-        }
-        else if (am_walsender)
-        {
-            /* Wal sender */
-            beentry->st_backendType = B_WAL_SENDER;
-        }
-        else if (IsBackgroundWorker)
-        {
-            /* bgworker */
-            beentry->st_backendType = B_BG_WORKER;
-        }
-        else
-        {
-            /* client-backend */
-            beentry->st_backendType = B_BACKEND;
-        }
-    }
-    else
-    {
-        /* Must be an auxiliary process */
-        Assert(MyAuxProcType != NotAnAuxProcess);
-        switch (MyAuxProcType)
-        {
-            case StartupProcess:
-                beentry->st_backendType = B_STARTUP;
-                break;
-            case BgWriterProcess:
-                beentry->st_backendType = B_BG_WRITER;
-                break;
-            case ArchiverProcess:
-                beentry->st_backendType = B_ARCHIVER;
-                break;
-            case CheckpointerProcess:
-                beentry->st_backendType = B_CHECKPOINTER;
-                break;
-            case WalWriterProcess:
-                beentry->st_backendType = B_WAL_WRITER;
-                break;
-            case WalReceiverProcess:
-                beentry->st_backendType = B_WAL_RECEIVER;
-                break;
-            default:
-                elog(FATAL, "unrecognized process type: %d",
-                     (int) MyAuxProcType);
-                proc_exit(1);
-        }
-    }
-
-    do
-    {
-        pgstat_increment_changecount_before(beentry);
-    } while ((beentry->st_changecount & 1) == 0);
-
-    beentry->st_procpid = MyProcPid;
-    beentry->st_proc_start_timestamp = MyStartTimestamp;
-    beentry->st_activity_start_timestamp = 0;
-    beentry->st_state_start_timestamp = 0;
-    beentry->st_xact_start_timestamp = 0;
-    beentry->st_databaseid = MyDatabaseId;
-
-    /* We have userid for client-backends, wal-sender and bgworker processes */
-    if (beentry->st_backendType == B_BACKEND
-        || beentry->st_backendType == B_WAL_SENDER
-        || beentry->st_backendType == B_BG_WORKER)
-        beentry->st_userid = GetSessionUserId();
-    else
-        beentry->st_userid = InvalidOid;
-
-    beentry->st_clientaddr = clientaddr;
-    if (MyProcPort && MyProcPort->remote_hostname)
-        strlcpy(beentry->st_clienthostname, MyProcPort->remote_hostname,
-                NAMEDATALEN);
-    else
-        beentry->st_clienthostname[0] = '\0';
-#ifdef USE_SSL
-    if (MyProcPort && MyProcPort->ssl != NULL)
-    {
-        beentry->st_ssl = true;
-        beentry->st_sslstatus->ssl_bits = be_tls_get_cipher_bits(MyProcPort);
-        beentry->st_sslstatus->ssl_compression = be_tls_get_compression(MyProcPort);
-        strlcpy(beentry->st_sslstatus->ssl_version, be_tls_get_version(MyProcPort), NAMEDATALEN);
-        strlcpy(beentry->st_sslstatus->ssl_cipher, be_tls_get_cipher(MyProcPort), NAMEDATALEN);
-        be_tls_get_peerdn_name(MyProcPort, beentry->st_sslstatus->ssl_clientdn, NAMEDATALEN);
-    }
-    else
-    {
-        beentry->st_ssl = false;
-    }
-#else
-    beentry->st_ssl = false;
-#endif
-    beentry->st_state = STATE_UNDEFINED;
-    beentry->st_appname[0] = '\0';
-    beentry->st_activity_raw[0] = '\0';
-    /* Also make sure the last byte in each string area is always 0 */
-    beentry->st_clienthostname[NAMEDATALEN - 1] = '\0';
-    beentry->st_appname[NAMEDATALEN - 1] = '\0';
-    beentry->st_activity_raw[pgstat_track_activity_query_size - 1] = '\0';
-    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
-    beentry->st_progress_command_target = InvalidOid;
-
-    /*
-     * we don't zero st_progress_param here to save cycles; nobody should
-     * examine it until st_progress_command has been set to something other
-     * than PROGRESS_COMMAND_INVALID
-     */
-
-    pgstat_increment_changecount_after(beentry);
-
-    /* Update app name to current GUC setting */
-    if (application_name)
-        pgstat_report_appname(application_name);
-}
-
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
@@ -3078,8 +2637,6 @@ pgstat_bestart(void)
 static void
 pgstat_beshutdown_hook(int code, Datum arg)
 {
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
     /*
      * If we got as far as discovering our own database ID, we can report what
      * we did to the collector.  Otherwise, we'd be sending an invalid
@@ -3088,1188 +2645,9 @@ pgstat_beshutdown_hook(int code, Datum arg)
      */
     if (OidIsValid(MyDatabaseId))
         pgstat_update_stat(true);
-
-    /*
-     * Clear my status entry, following the protocol of bumping st_changecount
-     * before and after.  We use a volatile pointer here to ensure the
-     * compiler doesn't try to get cute.
-     */
-    pgstat_increment_changecount_before(beentry);
-
-    beentry->st_procpid = 0;    /* mark invalid */
-
-    pgstat_increment_changecount_after(beentry);
 }
 
 
-/* ----------
- * pgstat_report_activity() -
- *
- *    Called from tcop/postgres.c to report what the backend is actually doing
- *    (but note cmd_str can be NULL for certain cases).
- *
- * All updates of the status entry follow the protocol of bumping
- * st_changecount before and after.  We use a volatile pointer here to
- * ensure the compiler doesn't try to get cute.
- * ----------
- */
-void
-pgstat_report_activity(BackendState state, const char *cmd_str)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-    TimestampTz start_timestamp;
-    TimestampTz current_timestamp;
-    int            len = 0;
-
-    TRACE_POSTGRESQL_STATEMENT_STATUS(cmd_str);
-
-    if (!beentry)
-        return;
-
-    if (!pgstat_track_activities)
-    {
-        if (beentry->st_state != STATE_DISABLED)
-        {
-            volatile PGPROC *proc = MyProc;
-
-            /*
-             * track_activities is disabled, but we last reported a
-             * non-disabled state.  As our final update, change the state and
-             * clear fields we will not be updating anymore.
-             */
-            pgstat_increment_changecount_before(beentry);
-            beentry->st_state = STATE_DISABLED;
-            beentry->st_state_start_timestamp = 0;
-            beentry->st_activity_raw[0] = '\0';
-            beentry->st_activity_start_timestamp = 0;
-            /* st_xact_start_timestamp and wait_event_info are also disabled */
-            beentry->st_xact_start_timestamp = 0;
-            proc->wait_event_info = 0;
-            pgstat_increment_changecount_after(beentry);
-        }
-        return;
-    }
-
-    /*
-     * To minimize the time spent modifying the entry, fetch all the needed
-     * data first.
-     */
-    start_timestamp = GetCurrentStatementStartTimestamp();
-    if (cmd_str != NULL)
-    {
-        /*
-         * Compute length of to-be-stored string unaware of multi-byte
-         * characters. For speed reasons that'll get corrected on read, rather
-         * than computed every write.
-         */
-        len = Min(strlen(cmd_str), pgstat_track_activity_query_size - 1);
-    }
-    current_timestamp = GetCurrentTimestamp();
-
-    /*
-     * Now update the status entry
-     */
-    pgstat_increment_changecount_before(beentry);
-
-    beentry->st_state = state;
-    beentry->st_state_start_timestamp = current_timestamp;
-
-    if (cmd_str != NULL)
-    {
-        memcpy((char *) beentry->st_activity_raw, cmd_str, len);
-        beentry->st_activity_raw[len] = '\0';
-        beentry->st_activity_start_timestamp = start_timestamp;
-    }
-
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_start_command() -
- *
- * Set st_progress_command (and st_progress_command_target) in own backend
- * entry.  Also, zero-initialize st_progress_param array.
- *-----------
- */
-void
-pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    if (!beentry || !pgstat_track_activities)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_progress_command = cmdtype;
-    beentry->st_progress_command_target = relid;
-    MemSet(&beentry->st_progress_param, 0, sizeof(beentry->st_progress_param));
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_update_param() -
- *
- * Update index'th member in st_progress_param[] of own backend entry.
- *-----------
- */
-void
-pgstat_progress_update_param(int index, int64 val)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    Assert(index >= 0 && index < PGSTAT_NUM_PROGRESS_PARAM);
-
-    if (!beentry || !pgstat_track_activities)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_progress_param[index] = val;
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_update_multi_param() -
- *
- * Update multiple members in st_progress_param[] of own backend entry.
- * This is atomic; readers won't see intermediate states.
- *-----------
- */
-void
-pgstat_progress_update_multi_param(int nparam, const int *index,
-                                   const int64 *val)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-    int            i;
-
-    if (!beentry || !pgstat_track_activities || nparam == 0)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-
-    for (i = 0; i < nparam; ++i)
-    {
-        Assert(index[i] >= 0 && index[i] < PGSTAT_NUM_PROGRESS_PARAM);
-
-        beentry->st_progress_param[index[i]] = val[i];
-    }
-
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_end_command() -
- *
- * Reset st_progress_command (and st_progress_command_target) in own backend
- * entry.  This signals the end of the command.
- *-----------
- */
-void
-pgstat_progress_end_command(void)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    if (!beentry)
-        return;
-    if (!pgstat_track_activities
-        && beentry->st_progress_command == PROGRESS_COMMAND_INVALID)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
-    beentry->st_progress_command_target = InvalidOid;
-    pgstat_increment_changecount_after(beentry);
-}
-
-/* ----------
- * pgstat_report_appname() -
- *
- *    Called to update our application name.
- * ----------
- */
-void
-pgstat_report_appname(const char *appname)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-    int            len;
-
-    if (!beentry)
-        return;
-
-    /* This should be unnecessary if GUC did its job, but be safe */
-    len = pg_mbcliplen(appname, strlen(appname), NAMEDATALEN - 1);
-
-    /*
-     * Update my status entry, following the protocol of bumping
-     * st_changecount before and after.  We use a volatile pointer here to
-     * ensure the compiler doesn't try to get cute.
-     */
-    pgstat_increment_changecount_before(beentry);
-
-    memcpy((char *) beentry->st_appname, appname, len);
-    beentry->st_appname[len] = '\0';
-
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*
- * Report current transaction start timestamp as the specified value.
- * Zero means there is no active transaction.
- */
-void
-pgstat_report_xact_timestamp(TimestampTz tstamp)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    if (!pgstat_track_activities || !beentry)
-        return;
-
-    /*
-     * Update my status entry, following the protocol of bumping
-     * st_changecount before and after.  We use a volatile pointer here to
-     * ensure the compiler doesn't try to get cute.
-     */
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_xact_start_timestamp = tstamp;
-    pgstat_increment_changecount_after(beentry);
-}
-
-/* ----------
- * pgstat_read_current_status() -
- *
- *    Copy the current contents of the PgBackendStatus array to local memory,
- *    if not already done in this transaction.
- * ----------
- */
-static void
-pgstat_read_current_status(void)
-{
-    volatile PgBackendStatus *beentry;
-    LocalPgBackendStatus *localtable;
-    LocalPgBackendStatus *localentry;
-    char       *localappname,
-               *localclienthostname,
-               *localactivity;
-#ifdef USE_SSL
-    PgBackendSSLStatus *localsslstatus;
-#endif
-    int            i;
-
-    Assert(IsUnderPostmaster);
-
-    if (localBackendStatusTable)
-        return;                    /* already done */
-
-    pgstat_setup_memcxt();
-
-    localtable = (LocalPgBackendStatus *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           sizeof(LocalPgBackendStatus) * NumBackendStatSlots);
-    localappname = (char *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           NAMEDATALEN * NumBackendStatSlots);
-    localclienthostname = (char *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           NAMEDATALEN * NumBackendStatSlots);
-    localactivity = (char *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           pgstat_track_activity_query_size * NumBackendStatSlots);
-#ifdef USE_SSL
-    localsslstatus = (PgBackendSSLStatus *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           sizeof(PgBackendSSLStatus) * NumBackendStatSlots);
-#endif
-
-    localNumBackends = 0;
-
-    beentry = BackendStatusArray;
-    localentry = localtable;
-    for (i = 1; i <= NumBackendStatSlots; i++)
-    {
-        /*
-         * Follow the protocol of retrying if st_changecount changes while we
-         * copy the entry, or if it's odd.  (The check for odd is needed to
-         * cover the case where we are able to completely copy the entry while
-         * the source backend is between increment steps.)    We use a volatile
-         * pointer here to ensure the compiler doesn't try to get cute.
-         */
-        for (;;)
-        {
-            int            before_changecount;
-            int            after_changecount;
-
-            pgstat_save_changecount_before(beentry, before_changecount);
-
-            localentry->backendStatus.st_procpid = beentry->st_procpid;
-            if (localentry->backendStatus.st_procpid > 0)
-            {
-                memcpy(&localentry->backendStatus, (char *) beentry, sizeof(PgBackendStatus));
-
-                /*
-                 * strcpy is safe even if the string is modified concurrently,
-                 * because there's always a \0 at the end of the buffer.
-                 */
-                strcpy(localappname, (char *) beentry->st_appname);
-                localentry->backendStatus.st_appname = localappname;
-                strcpy(localclienthostname, (char *) beentry->st_clienthostname);
-                localentry->backendStatus.st_clienthostname = localclienthostname;
-                strcpy(localactivity, (char *) beentry->st_activity_raw);
-                localentry->backendStatus.st_activity_raw = localactivity;
-                localentry->backendStatus.st_ssl = beentry->st_ssl;
-#ifdef USE_SSL
-                if (beentry->st_ssl)
-                {
-                    memcpy(localsslstatus, beentry->st_sslstatus, sizeof(PgBackendSSLStatus));
-                    localentry->backendStatus.st_sslstatus = localsslstatus;
-                }
-#endif
-            }
-
-            pgstat_save_changecount_after(beentry, after_changecount);
-            if (before_changecount == after_changecount &&
-                (before_changecount & 1) == 0)
-                break;
-
-            /* Make sure we can break out of loop if stuck... */
-            CHECK_FOR_INTERRUPTS();
-        }
-
-        beentry++;
-        /* Only valid entries get included into the local array */
-        if (localentry->backendStatus.st_procpid > 0)
-        {
-            BackendIdGetTransactionIds(i,
-                                       &localentry->backend_xid,
-                                       &localentry->backend_xmin);
-
-            localentry++;
-            localappname += NAMEDATALEN;
-            localclienthostname += NAMEDATALEN;
-            localactivity += pgstat_track_activity_query_size;
-#ifdef USE_SSL
-            localsslstatus++;
-#endif
-            localNumBackends++;
-        }
-    }
-
-    /* Set the pointer only after completion of a valid table */
-    localBackendStatusTable = localtable;
-}
-
-/* ----------
- * pgstat_get_wait_event_type() -
- *
- *    Return a string representing the current wait event type, backend is
- *    waiting on.
- */
-const char *
-pgstat_get_wait_event_type(uint32 wait_event_info)
-{
-    uint32        classId;
-    const char *event_type;
-
-    /* report process as not waiting. */
-    if (wait_event_info == 0)
-        return NULL;
-
-    classId = wait_event_info & 0xFF000000;
-
-    switch (classId)
-    {
-        case PG_WAIT_LWLOCK:
-            event_type = "LWLock";
-            break;
-        case PG_WAIT_LOCK:
-            event_type = "Lock";
-            break;
-        case PG_WAIT_BUFFER_PIN:
-            event_type = "BufferPin";
-            break;
-        case PG_WAIT_ACTIVITY:
-            event_type = "Activity";
-            break;
-        case PG_WAIT_CLIENT:
-            event_type = "Client";
-            break;
-        case PG_WAIT_EXTENSION:
-            event_type = "Extension";
-            break;
-        case PG_WAIT_IPC:
-            event_type = "IPC";
-            break;
-        case PG_WAIT_TIMEOUT:
-            event_type = "Timeout";
-            break;
-        case PG_WAIT_IO:
-            event_type = "IO";
-            break;
-        default:
-            event_type = "???";
-            break;
-    }
-
-    return event_type;
-}
-
-/* ----------
- * pgstat_get_wait_event() -
- *
- *    Return a string representing the current wait event, backend is
- *    waiting on.
- */
-const char *
-pgstat_get_wait_event(uint32 wait_event_info)
-{
-    uint32        classId;
-    uint16        eventId;
-    const char *event_name;
-
-    /* report process as not waiting. */
-    if (wait_event_info == 0)
-        return NULL;
-
-    classId = wait_event_info & 0xFF000000;
-    eventId = wait_event_info & 0x0000FFFF;
-
-    switch (classId)
-    {
-        case PG_WAIT_LWLOCK:
-            event_name = GetLWLockIdentifier(classId, eventId);
-            break;
-        case PG_WAIT_LOCK:
-            event_name = GetLockNameFromTagType(eventId);
-            break;
-        case PG_WAIT_BUFFER_PIN:
-            event_name = "BufferPin";
-            break;
-        case PG_WAIT_ACTIVITY:
-            {
-                WaitEventActivity w = (WaitEventActivity) wait_event_info;
-
-                event_name = pgstat_get_wait_activity(w);
-                break;
-            }
-        case PG_WAIT_CLIENT:
-            {
-                WaitEventClient w = (WaitEventClient) wait_event_info;
-
-                event_name = pgstat_get_wait_client(w);
-                break;
-            }
-        case PG_WAIT_EXTENSION:
-            event_name = "Extension";
-            break;
-        case PG_WAIT_IPC:
-            {
-                WaitEventIPC w = (WaitEventIPC) wait_event_info;
-
-                event_name = pgstat_get_wait_ipc(w);
-                break;
-            }
-        case PG_WAIT_TIMEOUT:
-            {
-                WaitEventTimeout w = (WaitEventTimeout) wait_event_info;
-
-                event_name = pgstat_get_wait_timeout(w);
-                break;
-            }
-        case PG_WAIT_IO:
-            {
-                WaitEventIO w = (WaitEventIO) wait_event_info;
-
-                event_name = pgstat_get_wait_io(w);
-                break;
-            }
-        default:
-            event_name = "unknown wait event";
-            break;
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_activity() -
- *
- * Convert WaitEventActivity to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_activity(WaitEventActivity w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_ARCHIVER_MAIN:
-            event_name = "ArchiverMain";
-            break;
-        case WAIT_EVENT_AUTOVACUUM_MAIN:
-            event_name = "AutoVacuumMain";
-            break;
-        case WAIT_EVENT_BGWRITER_HIBERNATE:
-            event_name = "BgWriterHibernate";
-            break;
-        case WAIT_EVENT_BGWRITER_MAIN:
-            event_name = "BgWriterMain";
-            break;
-        case WAIT_EVENT_CHECKPOINTER_MAIN:
-            event_name = "CheckpointerMain";
-            break;
-        case WAIT_EVENT_LOGICAL_APPLY_MAIN:
-            event_name = "LogicalApplyMain";
-            break;
-        case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
-            event_name = "LogicalLauncherMain";
-            break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
-            break;
-        case WAIT_EVENT_RECOVERY_WAL_ALL:
-            event_name = "RecoveryWalAll";
-            break;
-        case WAIT_EVENT_RECOVERY_WAL_STREAM:
-            event_name = "RecoveryWalStream";
-            break;
-        case WAIT_EVENT_SYSLOGGER_MAIN:
-            event_name = "SysLoggerMain";
-            break;
-        case WAIT_EVENT_WAL_RECEIVER_MAIN:
-            event_name = "WalReceiverMain";
-            break;
-        case WAIT_EVENT_WAL_SENDER_MAIN:
-            event_name = "WalSenderMain";
-            break;
-        case WAIT_EVENT_WAL_WRITER_MAIN:
-            event_name = "WalWriterMain";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_client() -
- *
- * Convert WaitEventClient to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_client(WaitEventClient w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_CLIENT_READ:
-            event_name = "ClientRead";
-            break;
-        case WAIT_EVENT_CLIENT_WRITE:
-            event_name = "ClientWrite";
-            break;
-        case WAIT_EVENT_LIBPQWALRECEIVER_CONNECT:
-            event_name = "LibPQWalReceiverConnect";
-            break;
-        case WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE:
-            event_name = "LibPQWalReceiverReceive";
-            break;
-        case WAIT_EVENT_SSL_OPEN_SERVER:
-            event_name = "SSLOpenServer";
-            break;
-        case WAIT_EVENT_WAL_RECEIVER_WAIT_START:
-            event_name = "WalReceiverWaitStart";
-            break;
-        case WAIT_EVENT_WAL_SENDER_WAIT_WAL:
-            event_name = "WalSenderWaitForWAL";
-            break;
-        case WAIT_EVENT_WAL_SENDER_WRITE_DATA:
-            event_name = "WalSenderWriteData";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_ipc() -
- *
- * Convert WaitEventIPC to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_ipc(WaitEventIPC w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_BGWORKER_SHUTDOWN:
-            event_name = "BgWorkerShutdown";
-            break;
-        case WAIT_EVENT_BGWORKER_STARTUP:
-            event_name = "BgWorkerStartup";
-            break;
-        case WAIT_EVENT_BTREE_PAGE:
-            event_name = "BtreePage";
-            break;
-        case WAIT_EVENT_CLOG_GROUP_UPDATE:
-            event_name = "ClogGroupUpdate";
-            break;
-        case WAIT_EVENT_EXECUTE_GATHER:
-            event_name = "ExecuteGather";
-            break;
-        case WAIT_EVENT_HASH_BATCH_ALLOCATING:
-            event_name = "Hash/Batch/Allocating";
-            break;
-        case WAIT_EVENT_HASH_BATCH_ELECTING:
-            event_name = "Hash/Batch/Electing";
-            break;
-        case WAIT_EVENT_HASH_BATCH_LOADING:
-            event_name = "Hash/Batch/Loading";
-            break;
-        case WAIT_EVENT_HASH_BUILD_ALLOCATING:
-            event_name = "Hash/Build/Allocating";
-            break;
-        case WAIT_EVENT_HASH_BUILD_ELECTING:
-            event_name = "Hash/Build/Electing";
-            break;
-        case WAIT_EVENT_HASH_BUILD_HASHING_INNER:
-            event_name = "Hash/Build/HashingInner";
-            break;
-        case WAIT_EVENT_HASH_BUILD_HASHING_OUTER:
-            event_name = "Hash/Build/HashingOuter";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING:
-            event_name = "Hash/GrowBatches/Allocating";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_DECIDING:
-            event_name = "Hash/GrowBatches/Deciding";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_ELECTING:
-            event_name = "Hash/GrowBatches/Electing";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_FINISHING:
-            event_name = "Hash/GrowBatches/Finishing";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING:
-            event_name = "Hash/GrowBatches/Repartitioning";
-            break;
-        case WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING:
-            event_name = "Hash/GrowBuckets/Allocating";
-            break;
-        case WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING:
-            event_name = "Hash/GrowBuckets/Electing";
-            break;
-        case WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING:
-            event_name = "Hash/GrowBuckets/Reinserting";
-            break;
-        case WAIT_EVENT_LOGICAL_SYNC_DATA:
-            event_name = "LogicalSyncData";
-            break;
-        case WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE:
-            event_name = "LogicalSyncStateChange";
-            break;
-        case WAIT_EVENT_MQ_INTERNAL:
-            event_name = "MessageQueueInternal";
-            break;
-        case WAIT_EVENT_MQ_PUT_MESSAGE:
-            event_name = "MessageQueuePutMessage";
-            break;
-        case WAIT_EVENT_MQ_RECEIVE:
-            event_name = "MessageQueueReceive";
-            break;
-        case WAIT_EVENT_MQ_SEND:
-            event_name = "MessageQueueSend";
-            break;
-        case WAIT_EVENT_PARALLEL_BITMAP_SCAN:
-            event_name = "ParallelBitmapScan";
-            break;
-        case WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN:
-            event_name = "ParallelCreateIndexScan";
-            break;
-        case WAIT_EVENT_PARALLEL_FINISH:
-            event_name = "ParallelFinish";
-            break;
-        case WAIT_EVENT_PROCARRAY_GROUP_UPDATE:
-            event_name = "ProcArrayGroupUpdate";
-            break;
-        case WAIT_EVENT_PROMOTE:
-            event_name = "Promote";
-            break;
-        case WAIT_EVENT_REPLICATION_ORIGIN_DROP:
-            event_name = "ReplicationOriginDrop";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_DROP:
-            event_name = "ReplicationSlotDrop";
-            break;
-        case WAIT_EVENT_SAFE_SNAPSHOT:
-            event_name = "SafeSnapshot";
-            break;
-        case WAIT_EVENT_SYNC_REP:
-            event_name = "SyncRep";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_timeout() -
- *
- * Convert WaitEventTimeout to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_timeout(WaitEventTimeout w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_BASE_BACKUP_THROTTLE:
-            event_name = "BaseBackupThrottle";
-            break;
-        case WAIT_EVENT_PG_SLEEP:
-            event_name = "PgSleep";
-            break;
-        case WAIT_EVENT_RECOVERY_APPLY_DELAY:
-            event_name = "RecoveryApplyDelay";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_io() -
- *
- * Convert WaitEventIO to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_io(WaitEventIO w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_BUFFILE_READ:
-            event_name = "BufFileRead";
-            break;
-        case WAIT_EVENT_BUFFILE_WRITE:
-            event_name = "BufFileWrite";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_READ:
-            event_name = "ControlFileRead";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_SYNC:
-            event_name = "ControlFileSync";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE:
-            event_name = "ControlFileSyncUpdate";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_WRITE:
-            event_name = "ControlFileWrite";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE:
-            event_name = "ControlFileWriteUpdate";
-            break;
-        case WAIT_EVENT_COPY_FILE_READ:
-            event_name = "CopyFileRead";
-            break;
-        case WAIT_EVENT_COPY_FILE_WRITE:
-            event_name = "CopyFileWrite";
-            break;
-        case WAIT_EVENT_DATA_FILE_EXTEND:
-            event_name = "DataFileExtend";
-            break;
-        case WAIT_EVENT_DATA_FILE_FLUSH:
-            event_name = "DataFileFlush";
-            break;
-        case WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC:
-            event_name = "DataFileImmediateSync";
-            break;
-        case WAIT_EVENT_DATA_FILE_PREFETCH:
-            event_name = "DataFilePrefetch";
-            break;
-        case WAIT_EVENT_DATA_FILE_READ:
-            event_name = "DataFileRead";
-            break;
-        case WAIT_EVENT_DATA_FILE_SYNC:
-            event_name = "DataFileSync";
-            break;
-        case WAIT_EVENT_DATA_FILE_TRUNCATE:
-            event_name = "DataFileTruncate";
-            break;
-        case WAIT_EVENT_DATA_FILE_WRITE:
-            event_name = "DataFileWrite";
-            break;
-        case WAIT_EVENT_DSM_FILL_ZERO_WRITE:
-            event_name = "DSMFillZeroWrite";
-            break;
-        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ:
-            event_name = "LockFileAddToDataDirRead";
-            break;
-        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC:
-            event_name = "LockFileAddToDataDirSync";
-            break;
-        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE:
-            event_name = "LockFileAddToDataDirWrite";
-            break;
-        case WAIT_EVENT_LOCK_FILE_CREATE_READ:
-            event_name = "LockFileCreateRead";
-            break;
-        case WAIT_EVENT_LOCK_FILE_CREATE_SYNC:
-            event_name = "LockFileCreateSync";
-            break;
-        case WAIT_EVENT_LOCK_FILE_CREATE_WRITE:
-            event_name = "LockFileCreateWRITE";
-            break;
-        case WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ:
-            event_name = "LockFileReCheckDataDirRead";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC:
-            event_name = "LogicalRewriteCheckpointSync";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC:
-            event_name = "LogicalRewriteMappingSync";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE:
-            event_name = "LogicalRewriteMappingWrite";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_SYNC:
-            event_name = "LogicalRewriteSync";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE:
-            event_name = "LogicalRewriteTruncate";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_WRITE:
-            event_name = "LogicalRewriteWrite";
-            break;
-        case WAIT_EVENT_RELATION_MAP_READ:
-            event_name = "RelationMapRead";
-            break;
-        case WAIT_EVENT_RELATION_MAP_SYNC:
-            event_name = "RelationMapSync";
-            break;
-        case WAIT_EVENT_RELATION_MAP_WRITE:
-            event_name = "RelationMapWrite";
-            break;
-        case WAIT_EVENT_REORDER_BUFFER_READ:
-            event_name = "ReorderBufferRead";
-            break;
-        case WAIT_EVENT_REORDER_BUFFER_WRITE:
-            event_name = "ReorderBufferWrite";
-            break;
-        case WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ:
-            event_name = "ReorderLogicalMappingRead";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_READ:
-            event_name = "ReplicationSlotRead";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC:
-            event_name = "ReplicationSlotRestoreSync";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_SYNC:
-            event_name = "ReplicationSlotSync";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_WRITE:
-            event_name = "ReplicationSlotWrite";
-            break;
-        case WAIT_EVENT_SLRU_FLUSH_SYNC:
-            event_name = "SLRUFlushSync";
-            break;
-        case WAIT_EVENT_SLRU_READ:
-            event_name = "SLRURead";
-            break;
-        case WAIT_EVENT_SLRU_SYNC:
-            event_name = "SLRUSync";
-            break;
-        case WAIT_EVENT_SLRU_WRITE:
-            event_name = "SLRUWrite";
-            break;
-        case WAIT_EVENT_SNAPBUILD_READ:
-            event_name = "SnapbuildRead";
-            break;
-        case WAIT_EVENT_SNAPBUILD_SYNC:
-            event_name = "SnapbuildSync";
-            break;
-        case WAIT_EVENT_SNAPBUILD_WRITE:
-            event_name = "SnapbuildWrite";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC:
-            event_name = "TimelineHistoryFileSync";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE:
-            event_name = "TimelineHistoryFileWrite";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_READ:
-            event_name = "TimelineHistoryRead";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_SYNC:
-            event_name = "TimelineHistorySync";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_WRITE:
-            event_name = "TimelineHistoryWrite";
-            break;
-        case WAIT_EVENT_TWOPHASE_FILE_READ:
-            event_name = "TwophaseFileRead";
-            break;
-        case WAIT_EVENT_TWOPHASE_FILE_SYNC:
-            event_name = "TwophaseFileSync";
-            break;
-        case WAIT_EVENT_TWOPHASE_FILE_WRITE:
-            event_name = "TwophaseFileWrite";
-            break;
-        case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ:
-            event_name = "WALSenderTimelineHistoryRead";
-            break;
-        case WAIT_EVENT_WAL_BOOTSTRAP_SYNC:
-            event_name = "WALBootstrapSync";
-            break;
-        case WAIT_EVENT_WAL_BOOTSTRAP_WRITE:
-            event_name = "WALBootstrapWrite";
-            break;
-        case WAIT_EVENT_WAL_COPY_READ:
-            event_name = "WALCopyRead";
-            break;
-        case WAIT_EVENT_WAL_COPY_SYNC:
-            event_name = "WALCopySync";
-            break;
-        case WAIT_EVENT_WAL_COPY_WRITE:
-            event_name = "WALCopyWrite";
-            break;
-        case WAIT_EVENT_WAL_INIT_SYNC:
-            event_name = "WALInitSync";
-            break;
-        case WAIT_EVENT_WAL_INIT_WRITE:
-            event_name = "WALInitWrite";
-            break;
-        case WAIT_EVENT_WAL_READ:
-            event_name = "WALRead";
-            break;
-        case WAIT_EVENT_WAL_SYNC:
-            event_name = "WALSync";
-            break;
-        case WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN:
-            event_name = "WALSyncMethodAssign";
-            break;
-        case WAIT_EVENT_WAL_WRITE:
-            event_name = "WALWrite";
-            break;
-
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-
-/* ----------
- * pgstat_get_backend_current_activity() -
- *
- *    Return a string representing the current activity of the backend with
- *    the specified PID.  This looks directly at the BackendStatusArray,
- *    and so will provide current information regardless of the age of our
- *    transaction's snapshot of the status array.
- *
- *    It is the caller's responsibility to invoke this only for backends whose
- *    state is expected to remain stable while the result is in use.  The
- *    only current use is in deadlock reporting, where we can expect that
- *    the target backend is blocked on a lock.  (There are corner cases
- *    where the target's wait could get aborted while we are looking at it,
- *    but the very worst consequence is to return a pointer to a string
- *    that's been changed, so we won't worry too much.)
- *
- *    Note: return strings for special cases match pg_stat_get_backend_activity.
- * ----------
- */
-const char *
-pgstat_get_backend_current_activity(int pid, bool checkUser)
-{
-    PgBackendStatus *beentry;
-    int            i;
-
-    beentry = BackendStatusArray;
-    for (i = 1; i <= MaxBackends; i++)
-    {
-        /*
-         * Although we expect the target backend's entry to be stable, that
-         * doesn't imply that anyone else's is.  To avoid identifying the
-         * wrong backend, while we check for a match to the desired PID we
-         * must follow the protocol of retrying if st_changecount changes
-         * while we examine the entry, or if it's odd.  (This might be
-         * unnecessary, since fetching or storing an int is almost certainly
-         * atomic, but let's play it safe.)  We use a volatile pointer here to
-         * ensure the compiler doesn't try to get cute.
-         */
-        volatile PgBackendStatus *vbeentry = beentry;
-        bool        found;
-
-        for (;;)
-        {
-            int            before_changecount;
-            int            after_changecount;
-
-            pgstat_save_changecount_before(vbeentry, before_changecount);
-
-            found = (vbeentry->st_procpid == pid);
-
-            pgstat_save_changecount_after(vbeentry, after_changecount);
-
-            if (before_changecount == after_changecount &&
-                (before_changecount & 1) == 0)
-                break;
-
-            /* Make sure we can break out of loop if stuck... */
-            CHECK_FOR_INTERRUPTS();
-        }
-
-        if (found)
-        {
-            /* Now it is safe to use the non-volatile pointer */
-            if (checkUser && !superuser() && beentry->st_userid != GetUserId())
-                return "<insufficient privilege>";
-            else if (*(beentry->st_activity_raw) == '\0')
-                return "<command string not enabled>";
-            else
-            {
-                /* this'll leak a bit of memory, but that seems acceptable */
-                return pgstat_clip_activity(beentry->st_activity_raw);
-            }
-        }
-
-        beentry++;
-    }
-
-    /* If we get here, caller is in error ... */
-    return "<backend information not available>";
-}
-
-/* ----------
- * pgstat_get_crashed_backend_activity() -
- *
- *    Return a string representing the current activity of the backend with
- *    the specified PID.  Like the function above, but reads shared memory with
- *    the expectation that it may be corrupt.  On success, copy the string
- *    into the "buffer" argument and return that pointer.  On failure,
- *    return NULL.
- *
- *    This function is only intended to be used by the postmaster to report the
- *    query that crashed a backend.  In particular, no attempt is made to
- *    follow the correct concurrency protocol when accessing the
- *    BackendStatusArray.  But that's OK, in the worst case we'll return a
- *    corrupted message.  We also must take care not to trip on ereport(ERROR).
- * ----------
- */
-const char *
-pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
-{
-    volatile PgBackendStatus *beentry;
-    int            i;
-
-    beentry = BackendStatusArray;
-
-    /*
-     * We probably shouldn't get here before shared memory has been set up,
-     * but be safe.
-     */
-    if (beentry == NULL || BackendActivityBuffer == NULL)
-        return NULL;
-
-    for (i = 1; i <= MaxBackends; i++)
-    {
-        if (beentry->st_procpid == pid)
-        {
-            /* Read pointer just once, so it can't change after validation */
-            const char *activity = beentry->st_activity_raw;
-            const char *activity_last;
-
-            /*
-             * We mustn't access activity string before we verify that it
-             * falls within the BackendActivityBuffer. To make sure that the
-             * entire string including its ending is contained within the
-             * buffer, subtract one activity length from the buffer size.
-             */
-            activity_last = BackendActivityBuffer + BackendActivityBufferSize
-                - pgstat_track_activity_query_size;
-
-            if (activity < BackendActivityBuffer ||
-                activity > activity_last)
-                return NULL;
-
-            /* If no string available, no point in a report */
-            if (activity[0] == '\0')
-                return NULL;
-
-            /*
-             * Copy only ASCII-safe characters so we don't run into encoding
-             * problems when reporting the message; and be sure not to run off
-             * the end of memory.  As only ASCII characters are reported, it
-             * doesn't seem necessary to perform multibyte aware clipping.
-             */
-            ascii_safe_strlcpy(buffer, activity,
-                               Min(buflen, pgstat_track_activity_query_size));
-
-            return buffer;
-        }
-
-        beentry++;
-    }
-
-    /* PID not found */
-    return NULL;
-}
-
-const char *
-pgstat_get_backend_desc(BackendType backendType)
-{
-    const char *backendDesc = "unknown process type";
-
-    switch (backendType)
-    {
-        case B_AUTOVAC_LAUNCHER:
-            backendDesc = "autovacuum launcher";
-            break;
-        case B_AUTOVAC_WORKER:
-            backendDesc = "autovacuum worker";
-            break;
-        case B_BACKEND:
-            backendDesc = "client backend";
-            break;
-        case B_BG_WORKER:
-            backendDesc = "background worker";
-            break;
-        case B_BG_WRITER:
-            backendDesc = "background writer";
-            break;
-        case B_ARCHIVER:
-            backendDesc = "archiver";
-            break;
-        case B_CHECKPOINTER:
-            backendDesc = "checkpointer";
-            break;
-        case B_STARTUP:
-            backendDesc = "startup";
-            break;
-        case B_WAL_RECEIVER:
-            backendDesc = "walreceiver";
-            break;
-        case B_WAL_SENDER:
-            backendDesc = "walsender";
-            break;
-        case B_WAL_WRITER:
-            backendDesc = "walwriter";
-            break;
-    }
-
-    return backendDesc;
-}
-
 /* ------------------------------------------------------------
  * Local support functions follow
  * ------------------------------------------------------------
@@ -5412,22 +3790,6 @@ backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot)
                               funcid);
 }
 
-/* ----------
- * pgstat_setup_memcxt() -
- *
- *    Create pgStatLocalContext, if not already done.
- * ----------
- */
-static void
-pgstat_setup_memcxt(void)
-{
-    if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
-}
-
-
 /* ----------
  * pgstat_clear_snapshot() -
  *
@@ -5443,6 +3805,8 @@ pgstat_clear_snapshot(void)
 {
     int param = 0;    /* only the address is significant */
 
+    pgstat_bestatus_clear_snapshot();
+
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
         MemoryContextDelete(pgStatLocalContext);
@@ -5450,8 +3814,6 @@ pgstat_clear_snapshot(void)
     /* Reset variables */
     pgStatLocalContext = NULL;
     pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
 
     /*
      * the parameter inform the function that it is not called from
@@ -5557,47 +3919,18 @@ pgstat_update_dbentry(PgStat_StatDBEntry *dbentry, PgStat_TableStatus *stat)
     dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
 }
 
-
-/*
- * Convert a potentially unsafely truncated activity string (see
- * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
- * one.
+/* ----------
+ * pgstat_setup_memcxt() -
  *
- * The returned string is allocated in the caller's memory context and may be
- * freed.
+ *    Create pgStatLocalContext, if not already done.
+ * ----------
  */
-char *
-pgstat_clip_activity(const char *raw_activity)
+static void
+pgstat_setup_memcxt(void)
 {
-    char       *activity;
-    int            rawlen;
-    int            cliplen;
-
-    /*
-     * Some callers, like pgstat_get_backend_current_activity(), do not
-     * guarantee that the buffer isn't concurrently modified. We try to take
-     * care that the buffer is always terminated by a NUL byte regardless, but
-     * let's still be paranoid about the string's length. In those cases the
-     * underlying buffer is guaranteed to be pgstat_track_activity_query_size
-     * large.
-     */
-    activity = pnstrdup(raw_activity, pgstat_track_activity_query_size - 1);
-
-    /* now double-guaranteed to be NUL terminated */
-    rawlen = strlen(activity);
-
-    /*
-     * All supported server-encodings make it possible to determine the length
-     * of a multi-byte character from its first byte (this is not the case for
-     * client encodings, see GB18030). As st_activity is always stored using
-     * server encoding, this allows us to perform multi-byte aware truncation,
-     * even if the string earlier was truncated in the middle of a multi-byte
-     * character.
-     */
-    cliplen = pg_mbcliplen(activity, rawlen,
-                           pgstat_track_activity_query_size - 1);
-
-    activity[cliplen] = '\0';
-
-    return activity;
+    if (!pgStatLocalContext)
+        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
+                                                   "Statistics snapshot",
+                                                   ALLOCSET_SMALL_SIZES);
 }
+
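
Note on the changecount protocol used by pgstat_get_backend_current_activity(), which this hunk set removes from pgstat.c: the reader retries until st_changecount is even and has the same value before and after the read. What follows is a minimal, self-contained sketch of that idea, not the patch's code. DemoStatus, demo_write_pid and demo_pid_matches are hypothetical names. C11 atomics stand in for the pgstat_save_changecount_before/after macros, whose definitions are not shown in this part of the patch.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for one status slot; not the real PgBackendStatus. */
typedef struct DemoStatus
{
    atomic_int  changecount;    /* odd while a writer is mid-update */
    atomic_int  procpid;        /* payload guarded by changecount */
} DemoStatus;

/* Writer side: bump the counter before and after touching the payload. */
static void
demo_write_pid(DemoStatus *s, int pid)
{
    atomic_fetch_add(&s->changecount, 1);   /* now odd: update in progress */
    atomic_store(&s->procpid, pid);
    atomic_fetch_add(&s->changecount, 1);   /* even again: update complete */
}

/*
 * Reader side: accept the value only if the counter was even and did not
 * change across the read; otherwise retry.
 */
static bool
demo_pid_matches(DemoStatus *s, int pid)
{
    for (;;)
    {
        int     before = atomic_load(&s->changecount);
        bool    found = (atomic_load(&s->procpid) == pid);
        int     after = atomic_load(&s->changecount);

        if (before == after && (before & 1) == 0)
            return found;

        /* A real backend would call CHECK_FOR_INTERRUPTS() here. */
    }
}
```

With a zero-initialized DemoStatus, calling demo_write_pid(&s, 4242) leaves changecount at 2, so demo_pid_matches(&s, 4242) returns true on its first pass and demo_pid_matches(&s, 1) returns false.
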
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e794a81c4c..d92c7c935d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -34,6 +34,7 @@
 #include <unistd.h>
 
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/storage.h"
 #include "executor/instrument.h"
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index dd687dfe71..34ef69d8d0 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -41,9 +41,9 @@
 
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "executor/instrument.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/buffile.h"
 #include "storage/buf_internals.h"
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index 4a0d23b11e..4054ac5108 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -22,10 +22,10 @@
 #include <unistd.h>
 #include <sys/stat.h>
 
+#include "bestatus.h"
 #include "storage/copydir.h"
 #include "storage/fd.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 
 /*
  * copydir: copy a directory
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 2d75773ef0..4342fb3e39 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -82,6 +82,7 @@
 #include "miscadmin.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/pg_tablespace.h"
 #include "common/file_perm.h"
 #include "pgstat.h"
diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
index 0ff1f5be91..a3465f57ae 100644
--- a/src/backend/storage/ipc/dsm_impl.c
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -61,8 +61,8 @@
 #ifdef HAVE_SYS_SHM_H
 #include <sys/shm.h>
 #endif
+#include "bestatus.h"
 #include "common/file_perm.h"
-#include "pgstat.h"
 
 #include "portability/mem.h"
 #include "storage/dsm_impl.h"
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index c129446f9c..51c6fff11c 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -43,8 +43,8 @@
 #include <poll.h>
 #endif
 
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "port/atomics.h"
 #include "portability/instr_time.h"
 #include "postmaster/postmaster.h"
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 908f62d37e..7c6d327e2f 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -51,9 +51,9 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/spin.h"
diff --git a/src/backend/storage/ipc/shm_mq.c b/src/backend/storage/ipc/shm_mq.c
index fde71afd47..a0a5582aac 100644
--- a/src/backend/storage/ipc/shm_mq.c
+++ b/src/backend/storage/ipc/shm_mq.c
@@ -18,8 +18,8 @@
 
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/bgworker.h"
 #include "storage/procsignal.h"
 #include "storage/shm_mq.h"
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index c9bb3e987d..0a9181cd9d 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -21,8 +21,8 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index aeaf1f3ab4..22a42b9977 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -25,6 +25,7 @@
  */
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "pgstat.h"
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index c46bb8d057..8ddb4c88e0 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -76,8 +76,8 @@
  */
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "pg_trace.h"
 #include "postmaster/postmaster.h"
 #include "replication/slot.h"
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index e8390311d0..ac352885f3 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -193,8 +193,8 @@
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
 #include "storage/predicate_internals.h"
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 6f9aaa52fa..0ecaa24b1a 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -38,8 +38,8 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "replication/slot.h"
 #include "replication/syncrep.h"
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 86013a5c8b..92b3cd8b55 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -28,7 +28,7 @@
 #include "miscadmin.h"
 #include "access/xlogutils.h"
 #include "access/xlog.h"
-#include "pgstat.h"
+#include "bestatus.h"
 #include "portability/instr_time.h"
 #include "postmaster/bgwriter.h"
 #include "storage/fd.h"
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index ee4e43331b..f894bac680 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -39,6 +39,7 @@
 #include "access/parallel.h"
 #include "access/printtup.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index 309eb2935c..b229f42622 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -20,6 +20,7 @@
 #include <unistd.h>
 
 #include "access/sysattr.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/pg_tablespace.h"
 #include "catalog/pg_type.h"
@@ -28,7 +29,6 @@
 #include "common/keywords.h"
 #include "funcapi.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "parser/scansup.h"
 #include "postmaster/syslogger.h"
 #include "rewrite/rewriteHandler.h"
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index eca801eeed..29da24b91d 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 905867dc76..29df1e9773 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -46,11 +46,11 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/pg_tablespace.h"
 #include "catalog/storage.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/lwlock.h"
 #include "utils/inval.h"
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 238fe1deec..1a25c813f2 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -31,12 +31,12 @@
 #endif
 
 #include "access/htup_details.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "common/file_perm.h"
 #include "libpq/libpq.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/postmaster.h"
 #include "storage/fd.h"
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 1e4fa89135..d4774d717f 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -25,6 +25,7 @@
 #include "access/sysattr.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/indexing.h"
 #include "catalog/namespace.h"
@@ -688,7 +689,10 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 
     /* Initialize stats collection --- must happen before first xact */
     if (!bootstrap)
+    {
+        pgstat_bearray_initialize();
         pgstat_initialize();
+    }
 
     /*
      * Load relcache entries for the shared system catalogs.  This must create
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index cfdffbca2b..45974082b9 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -33,6 +33,7 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
 #include "commands/async.h"
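
Note on the wait-event encoding that the next file, bestatus.h, carries over: each PG_WAIT_* class occupies the top byte of a 32-bit value, and each enum of events starts at its class value, so one integer identifies both the class and the event. The sketch below shows how such a value can be split. The DEMO_* names and both helper functions are hypothetical, and the two masks are an assumption about how the lookup code decodes the value, since that code is not part of this section.

```c
#include <stdint.h>

/* Two class values copied from the PG_WAIT_* list; names are hypothetical. */
#define DEMO_WAIT_TIMEOUT   0x09000000U
#define DEMO_WAIT_IO        0x0A000000U

/* Assumed masks: top byte selects the class, low 16 bits select the event. */
#define DEMO_CLASS_MASK     0xFF000000U
#define DEMO_EVENT_MASK     0x0000FFFFU

/* Map the class byte to a display name. */
static const char *
demo_wait_class_name(uint32_t wait_event_info)
{
    switch (wait_event_info & DEMO_CLASS_MASK)
    {
        case DEMO_WAIT_TIMEOUT:
            return "Timeout";
        case DEMO_WAIT_IO:
            return "IO";
        default:
            return "unknown wait class";
    }
}

/* Position of the event inside its class enum, counting from zero. */
static uint16_t
demo_wait_event_index(uint32_t wait_event_info)
{
    return (uint16_t) (wait_event_info & DEMO_EVENT_MASK);
}
```

For example, WAIT_EVENT_PG_SLEEP is the second member of WaitEventTimeout, so its value is 0x09000001; under the assumed masks it decodes to class "Timeout" and index 1.
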
diff --git a/src/include/bestatus.h b/src/include/bestatus.h
new file mode 100644
index 0000000000..3b47e9c063
--- /dev/null
+++ b/src/include/bestatus.h
@@ -0,0 +1,544 @@
+/* ----------
+ *    bestatus.h
+ *
+ *    Definitions for the PostgreSQL backend status monitor facility
+ *
+ *    Copyright (c) 2001-2018, PostgreSQL Global Development Group
+ *
+ *    src/include/bestatus.h
+ * ----------
+ */
+#ifndef BESTATUS_H
+#define BESTATUS_H
+
+#include "datatype/timestamp.h"
+#include "libpq/pqcomm.h"
+#include "storage/proc.h"
+
+/* ----------
+ * Backend types
+ * ----------
+ */
+typedef enum BackendType
+{
+    B_AUTOVAC_LAUNCHER,
+    B_AUTOVAC_WORKER,
+    B_BACKEND,
+    B_BG_WORKER,
+    B_BG_WRITER,
+    B_CHECKPOINTER,
+    B_STARTUP,
+    B_WAL_RECEIVER,
+    B_WAL_SENDER,
+    B_WAL_WRITER,
+    B_ARCHIVER
+} BackendType;
+
+
+/* ----------
+ * Backend states
+ * ----------
+ */
+typedef enum BackendState
+{
+    STATE_UNDEFINED,
+    STATE_IDLE,
+    STATE_RUNNING,
+    STATE_IDLEINTRANSACTION,
+    STATE_FASTPATH,
+    STATE_IDLEINTRANSACTION_ABORTED,
+    STATE_DISABLED
+} BackendState;
+
+
+/* ----------
+ * Wait Classes
+ * ----------
+ */
+#define PG_WAIT_LWLOCK                0x01000000U
+#define PG_WAIT_LOCK                0x03000000U
+#define PG_WAIT_BUFFER_PIN            0x04000000U
+#define PG_WAIT_ACTIVITY            0x05000000U
+#define PG_WAIT_CLIENT                0x06000000U
+#define PG_WAIT_EXTENSION            0x07000000U
+#define PG_WAIT_IPC                    0x08000000U
+#define PG_WAIT_TIMEOUT                0x09000000U
+#define PG_WAIT_IO                    0x0A000000U
+
+/* ----------
+ * Wait Events - Activity
+ *
+ * Use this category when a process is waiting because it has no work to do,
+ * unless the "Client" or "Timeout" category describes the situation better.
+ * Typically, this should only be used for background processes.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_ARCHIVER_MAIN = PG_WAIT_ACTIVITY,
+    WAIT_EVENT_AUTOVACUUM_MAIN,
+    WAIT_EVENT_BGWRITER_HIBERNATE,
+    WAIT_EVENT_BGWRITER_MAIN,
+    WAIT_EVENT_CHECKPOINTER_MAIN,
+    WAIT_EVENT_LOGICAL_APPLY_MAIN,
+    WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
+    WAIT_EVENT_RECOVERY_WAL_ALL,
+    WAIT_EVENT_RECOVERY_WAL_STREAM,
+    WAIT_EVENT_SYSLOGGER_MAIN,
+    WAIT_EVENT_WAL_RECEIVER_MAIN,
+    WAIT_EVENT_WAL_SENDER_MAIN,
+    WAIT_EVENT_WAL_WRITER_MAIN
+} WaitEventActivity;
+
+/* ----------
+ * Wait Events - Client
+ *
+ * Use this category when a process is waiting to send data to or receive data
+ * from the frontend process to which it is connected.  This is never used for
+ * a background process, which has no client connection.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_CLIENT_READ = PG_WAIT_CLIENT,
+    WAIT_EVENT_CLIENT_WRITE,
+    WAIT_EVENT_LIBPQWALRECEIVER_CONNECT,
+    WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE,
+    WAIT_EVENT_SSL_OPEN_SERVER,
+    WAIT_EVENT_WAL_RECEIVER_WAIT_START,
+    WAIT_EVENT_WAL_SENDER_WAIT_WAL,
+    WAIT_EVENT_WAL_SENDER_WRITE_DATA
+} WaitEventClient;
+
+/* ----------
+ * Wait Events - IPC
+ *
+ * Use this category when a process cannot complete the work it is doing because
+ * it is waiting for a notification from another process.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_BGWORKER_SHUTDOWN = PG_WAIT_IPC,
+    WAIT_EVENT_BGWORKER_STARTUP,
+    WAIT_EVENT_BTREE_PAGE,
+    WAIT_EVENT_CLOG_GROUP_UPDATE,
+    WAIT_EVENT_EXECUTE_GATHER,
+    WAIT_EVENT_HASH_BATCH_ALLOCATING,
+    WAIT_EVENT_HASH_BATCH_ELECTING,
+    WAIT_EVENT_HASH_BATCH_LOADING,
+    WAIT_EVENT_HASH_BUILD_ALLOCATING,
+    WAIT_EVENT_HASH_BUILD_ELECTING,
+    WAIT_EVENT_HASH_BUILD_HASHING_INNER,
+    WAIT_EVENT_HASH_BUILD_HASHING_OUTER,
+    WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING,
+    WAIT_EVENT_HASH_GROW_BATCHES_DECIDING,
+    WAIT_EVENT_HASH_GROW_BATCHES_ELECTING,
+    WAIT_EVENT_HASH_GROW_BATCHES_FINISHING,
+    WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING,
+    WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING,
+    WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING,
+    WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING,
+    WAIT_EVENT_LOGICAL_SYNC_DATA,
+    WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
+    WAIT_EVENT_MQ_INTERNAL,
+    WAIT_EVENT_MQ_PUT_MESSAGE,
+    WAIT_EVENT_MQ_RECEIVE,
+    WAIT_EVENT_MQ_SEND,
+    WAIT_EVENT_PARALLEL_BITMAP_SCAN,
+    WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN,
+    WAIT_EVENT_PARALLEL_FINISH,
+    WAIT_EVENT_PROCARRAY_GROUP_UPDATE,
+    WAIT_EVENT_PROMOTE,
+    WAIT_EVENT_REPLICATION_ORIGIN_DROP,
+    WAIT_EVENT_REPLICATION_SLOT_DROP,
+    WAIT_EVENT_SAFE_SNAPSHOT,
+    WAIT_EVENT_SYNC_REP
+} WaitEventIPC;
+
+/* ----------
+ * Wait Events - Timeout
+ *
+ * Use this category when a process is waiting for a timeout to expire.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
+    WAIT_EVENT_PG_SLEEP,
+    WAIT_EVENT_RECOVERY_APPLY_DELAY
+} WaitEventTimeout;
+
+/* ----------
+ * Wait Events - IO
+ *
+ * Use this category when a process is waiting for an I/O operation.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_BUFFILE_READ = PG_WAIT_IO,
+    WAIT_EVENT_BUFFILE_WRITE,
+    WAIT_EVENT_CONTROL_FILE_READ,
+    WAIT_EVENT_CONTROL_FILE_SYNC,
+    WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
+    WAIT_EVENT_CONTROL_FILE_WRITE,
+    WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE,
+    WAIT_EVENT_COPY_FILE_READ,
+    WAIT_EVENT_COPY_FILE_WRITE,
+    WAIT_EVENT_DATA_FILE_EXTEND,
+    WAIT_EVENT_DATA_FILE_FLUSH,
+    WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC,
+    WAIT_EVENT_DATA_FILE_PREFETCH,
+    WAIT_EVENT_DATA_FILE_READ,
+    WAIT_EVENT_DATA_FILE_SYNC,
+    WAIT_EVENT_DATA_FILE_TRUNCATE,
+    WAIT_EVENT_DATA_FILE_WRITE,
+    WAIT_EVENT_DSM_FILL_ZERO_WRITE,
+    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ,
+    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC,
+    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE,
+    WAIT_EVENT_LOCK_FILE_CREATE_READ,
+    WAIT_EVENT_LOCK_FILE_CREATE_SYNC,
+    WAIT_EVENT_LOCK_FILE_CREATE_WRITE,
+    WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ,
+    WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC,
+    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC,
+    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE,
+    WAIT_EVENT_LOGICAL_REWRITE_SYNC,
+    WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE,
+    WAIT_EVENT_LOGICAL_REWRITE_WRITE,
+    WAIT_EVENT_RELATION_MAP_READ,
+    WAIT_EVENT_RELATION_MAP_SYNC,
+    WAIT_EVENT_RELATION_MAP_WRITE,
+    WAIT_EVENT_REORDER_BUFFER_READ,
+    WAIT_EVENT_REORDER_BUFFER_WRITE,
+    WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ,
+    WAIT_EVENT_REPLICATION_SLOT_READ,
+    WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
+    WAIT_EVENT_REPLICATION_SLOT_SYNC,
+    WAIT_EVENT_REPLICATION_SLOT_WRITE,
+    WAIT_EVENT_SLRU_FLUSH_SYNC,
+    WAIT_EVENT_SLRU_READ,
+    WAIT_EVENT_SLRU_SYNC,
+    WAIT_EVENT_SLRU_WRITE,
+    WAIT_EVENT_SNAPBUILD_READ,
+    WAIT_EVENT_SNAPBUILD_SYNC,
+    WAIT_EVENT_SNAPBUILD_WRITE,
+    WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC,
+    WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE,
+    WAIT_EVENT_TIMELINE_HISTORY_READ,
+    WAIT_EVENT_TIMELINE_HISTORY_SYNC,
+    WAIT_EVENT_TIMELINE_HISTORY_WRITE,
+    WAIT_EVENT_TWOPHASE_FILE_READ,
+    WAIT_EVENT_TWOPHASE_FILE_SYNC,
+    WAIT_EVENT_TWOPHASE_FILE_WRITE,
+    WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ,
+    WAIT_EVENT_WAL_BOOTSTRAP_SYNC,
+    WAIT_EVENT_WAL_BOOTSTRAP_WRITE,
+    WAIT_EVENT_WAL_COPY_READ,
+    WAIT_EVENT_WAL_COPY_SYNC,
+    WAIT_EVENT_WAL_COPY_WRITE,
+    WAIT_EVENT_WAL_INIT_SYNC,
+    WAIT_EVENT_WAL_INIT_WRITE,
+    WAIT_EVENT_WAL_READ,
+    WAIT_EVENT_WAL_SYNC,
+    WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
+    WAIT_EVENT_WAL_WRITE
+} WaitEventIO;
+
+/* ----------
+ * Command type for progress reporting purposes
+ * ----------
+ */
+typedef enum ProgressCommandType
+{
+    PROGRESS_COMMAND_INVALID,
+    PROGRESS_COMMAND_VACUUM
+} ProgressCommandType;
+
+#define PGSTAT_NUM_PROGRESS_PARAM    10
+
+/* ----------
+ * Shared-memory data structures
+ * ----------
+ */
+
+
+/*
+ * PgBackendSSLStatus
+ *
+ * For each backend, we keep the SSL status in a separate struct, that
+ * is only filled in if SSL is enabled.
+ */
+typedef struct PgBackendSSLStatus
+{
+    /* Information about SSL connection */
+    int            ssl_bits;
+    bool        ssl_compression;
+    char        ssl_version[NAMEDATALEN];    /* MUST be null-terminated */
+    char        ssl_cipher[NAMEDATALEN];    /* MUST be null-terminated */
+    char        ssl_clientdn[NAMEDATALEN];    /* MUST be null-terminated */
+} PgBackendSSLStatus;
+
+
+/* ----------
+ * PgBackendStatus
+ *
+ * Each live backend maintains a PgBackendStatus struct in shared memory
+ * showing its current activity.  (The structs are allocated according to
+ * BackendId, but that is not critical.)  Note that the collector process
+ * has no involvement in, or even access to, these structs.
+ *
+ * Each auxiliary process also maintains a PgBackendStatus struct in shared
+ * memory.
+ * ----------
+ */
+typedef struct PgBackendStatus
+{
+    /*
+     * To avoid locking overhead, we use the following protocol: a backend
+     * increments st_changecount before modifying its entry, and again after
+     * finishing a modification.  A would-be reader should note the value of
+     * st_changecount, copy the entry into private memory, then check
+     * st_changecount again.  If the value hasn't changed, and if it's even,
+     * the copy is valid; otherwise start over.  This makes updates cheap
+     * while reads are potentially expensive, but that's the tradeoff we want.
+     *
+     * The above protocol needs the memory barriers to ensure that the
+     * apparent order of execution is as it desires. Otherwise, for example,
+     * the CPU might rearrange the code so that st_changecount is incremented
+     * twice before the modification on a machine with weak memory ordering.
+     * This surprising result can lead to bugs.
+     */
+    int            st_changecount;
+
+    /* The entry is valid iff st_procpid > 0, unused if st_procpid == 0 */
+    int            st_procpid;
+
+    /* Type of backends */
+    BackendType st_backendType;
+
+    /* Times when current backend, transaction, and activity started */
+    TimestampTz st_proc_start_timestamp;
+    TimestampTz st_xact_start_timestamp;
+    TimestampTz st_activity_start_timestamp;
+    TimestampTz st_state_start_timestamp;
+
+    /* Database OID, owning user's OID, connection client address */
+    Oid            st_databaseid;
+    Oid            st_userid;
+    SockAddr    st_clientaddr;
+    char       *st_clienthostname;    /* MUST be null-terminated */
+
+    /* Information about SSL connection */
+    bool        st_ssl;
+    PgBackendSSLStatus *st_sslstatus;
+
+    /* current state */
+    BackendState st_state;
+
+    /* application name; MUST be null-terminated */
+    char       *st_appname;
+
+    /*
+     * Current command string; MUST be null-terminated. Note that this string
+     * possibly is truncated in the middle of a multi-byte character. As
+     * activity strings are stored more frequently than read, that allows to
+     * move the cost of correct truncation to the display side. Use
+     * pgstat_clip_activity() to truncate correctly.
+     */
+    char       *st_activity_raw;
+
+    /*
+     * Command progress reporting.  Any command which wishes can advertise
+     * that it is running by setting st_progress_command,
+     * st_progress_command_target, and st_progress_param[].
+     * st_progress_command_target should be the OID of the relation which the
+     * command targets (we assume there's just one, as this is meant for
+     * utility commands), but the meaning of each element in the
+     * st_progress_param array is command-specific.
+     */
+    ProgressCommandType st_progress_command;
+    Oid            st_progress_command_target;
+    int64        st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
+} PgBackendStatus;
+
+/*
+ * Macros to load and store st_changecount with the memory barriers.
+ *
+ * pgstat_increment_changecount_before() and
+ * pgstat_increment_changecount_after() need to be called before and after
+ * PgBackendStatus entries are modified, respectively. This makes sure that
+ * st_changecount is incremented around the modification.
+ *
+ * Also pgstat_save_changecount_before() and pgstat_save_changecount_after()
+ * need to be called before and after PgBackendStatus entries are copied into
+ * private memory, respectively.
+ */
+#define pgstat_increment_changecount_before(beentry)    \
+    do {    \
+        beentry->st_changecount++;    \
+        pg_write_barrier(); \
+    } while (0)
+
+#define pgstat_increment_changecount_after(beentry) \
+    do {    \
+        pg_write_barrier(); \
+        beentry->st_changecount++;    \
+        Assert((beentry->st_changecount & 1) == 0); \
+    } while (0)
+
+#define pgstat_save_changecount_before(beentry, save_changecount)    \
+    do {    \
+        save_changecount = beentry->st_changecount; \
+        pg_read_barrier();    \
+    } while (0)
+
+#define pgstat_save_changecount_after(beentry, save_changecount)    \
+    do {    \
+        pg_read_barrier();    \
+        save_changecount = beentry->st_changecount; \
+    } while (0)
+
+/* ----------
+ * LocalPgBackendStatus
+ *
+ * When we build the backend status array, we use LocalPgBackendStatus to be
+ * able to add new values to the struct when needed without adding new fields
+ * to the shared memory. It contains the backend status as a first member.
+ * ----------
+ */
+typedef struct LocalPgBackendStatus
+{
+    /*
+     * Local version of the backend status entry.
+     */
+    PgBackendStatus backendStatus;
+
+    /*
+     * The xid of the current transaction if available, InvalidTransactionId
+     * if not.
+     */
+    TransactionId backend_xid;
+
+    /*
+     * The xmin of the current session if available, InvalidTransactionId if
+     * not.
+     */
+    TransactionId backend_xmin;
+} LocalPgBackendStatus;
+
+/* ----------
+ * GUC parameters
+ * ----------
+ */
+extern bool pgstat_track_activities;
+extern PGDLLIMPORT int pgstat_track_activity_query_size;
+
+/* ----------
+ * Functions called from backends
+ * ----------
+ */
+extern void pgstat_bestatus_clear_snapshot(void);
+extern void pgstat_bearray_initialize(void);
+extern void pgstat_bestart(void);
+
+extern const char *pgstat_get_wait_event(uint32 wait_event_info);
+extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
+extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
+extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
+                                    int buflen);
+extern const char *pgstat_get_backend_desc(BackendType backendType);
+
+extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
+                              Oid relid);
+extern void pgstat_progress_update_param(int index, int64 val);
+extern void pgstat_progress_update_multi_param(int nparam, const int *index,
+                                   const int64 *val);
+extern void pgstat_progress_end_command(void);
+
+extern char *pgstat_clip_activity(const char *raw_activity);
+
+/* ----------
+ * pgstat_report_wait_start() -
+ *
+ *    Called from places where server process needs to wait.  This is called
+ *    to report wait event information.  The wait information is stored
+ *    as 4-bytes where first byte represents the wait event class (type of
+ *    wait, for different types of wait, refer WaitClass) and the next
+ *    3-bytes represent the actual wait event.  Currently 2-bytes are used
+ *    for wait event which is sufficient for current usage, 1-byte is
+ *    reserved for future usage.
+ *
+ * NB: this *must* be able to survive being called before MyProc has been
+ * initialized.
+ * ----------
+ */
+static inline void
+pgstat_report_wait_start(uint32 wait_event_info)
+{
+    volatile PGPROC *proc = MyProc;
+
+    if (!pgstat_track_activities || !proc)
+        return;
+
+    /*
+     * Since this is a four-byte field which is always read and written as
+     * four-bytes, updates are atomic.
+     */
+    proc->wait_event_info = wait_event_info;
+}
+
+/* ----------
+ * pgstat_report_wait_end() -
+ *
+ *    Called to report end of a wait.
+ *
+ * NB: this *must* be able to survive being called before MyProc has been
+ * initialized.
+ * ----------
+ */
+static inline void
+pgstat_report_wait_end(void)
+{
+    volatile PGPROC *proc = MyProc;
+
+    if (!pgstat_track_activities || !proc)
+        return;
+
+    /*
+     * Since this is a four-byte field which is always read and written as
+     * four-bytes, updates are atomic.
+     */
+    proc->wait_event_info = 0;
+}
+extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
+extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
+extern int    pgstat_fetch_stat_numbackends(void);
+
+/* For shared memory allocation/initialization */
+extern Size BackendStatusShmemSize(void);
+extern void CreateSharedBackendStatus(void);
+
+extern void pgstat_report_xact_timestamp(TimestampTz tstamp);
+extern void pgstat_bestat_initialize(void);
+
+extern void pgstat_report_activity(BackendState state, const char *cmd_str);
+extern void pgstat_report_appname(const char *appname);
+extern void pgstat_report_xact_timestamp(TimestampTz tstamp);
+extern const char *pgstat_get_wait_event(uint32 wait_event_info);
+extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
+extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
+extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
+                                    int buflen);
+extern const char *pgstat_get_backend_desc(BackendType backendType);
+
+extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
+                              Oid relid);
+extern void pgstat_progress_update_param(int index, int64 val);
+extern void pgstat_progress_update_multi_param(int nparam, const int *index,
+                                   const int64 *val);
+extern void pgstat_progress_end_command(void);
+
+#endif                            /* BESTATUS_H */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 4e51580076..4d1fc422ab 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL statistics collector facility.
  *
  *    Copyright (c) 2001-2018, PostgreSQL Global Development Group
  *
@@ -14,11 +14,8 @@
 #include "datatype/timestamp.h"
 #include "fmgr.h"
 #include "lib/dshash.h"
-#include "libpq/pqcomm.h"
-#include "port/atomics.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
-#include "storage/proc.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -94,12 +91,11 @@ typedef enum PgStat_Single_Reset_Type
     RESET_FUNCTION
 } PgStat_Single_Reset_Type;
 
+
 /* ------------------------------------------------------------
  * Structures kept in backend local memory while accumulating counts
  * ------------------------------------------------------------
  */
-
-
 /* ----------
  * PgStat_TableStatus            Per-table status within a backend
  *
@@ -167,10 +163,10 @@ typedef struct PgStat_BgWriter
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to apply.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when applying to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -203,7 +199,7 @@ typedef struct PgStat_FunctionEntry
 } PgStat_FunctionEntry;
 
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Statistic collector data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -307,7 +303,7 @@ typedef struct PgStat_StatFuncEntry
 
 
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -323,7 +319,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -341,422 +337,6 @@ typedef struct PgStat_GlobalStats
     TimestampTz stat_reset_timestamp;
 } PgStat_GlobalStats;
 
-
-/* ----------
- * Backend types
- * ----------
- */
-typedef enum BackendType
-{
-    B_AUTOVAC_LAUNCHER,
-    B_AUTOVAC_WORKER,
-    B_BACKEND,
-    B_BG_WORKER,
-    B_BG_WRITER,
-    B_ARCHIVER,
-    B_CHECKPOINTER,
-    B_STARTUP,
-    B_WAL_RECEIVER,
-    B_WAL_SENDER,
-    B_WAL_WRITER
-} BackendType;
-
-
-/* ----------
- * Backend states
- * ----------
- */
-typedef enum BackendState
-{
-    STATE_UNDEFINED,
-    STATE_IDLE,
-    STATE_RUNNING,
-    STATE_IDLEINTRANSACTION,
-    STATE_FASTPATH,
-    STATE_IDLEINTRANSACTION_ABORTED,
-    STATE_DISABLED
-} BackendState;
-
-
-/* ----------
- * Wait Classes
- * ----------
- */
-#define PG_WAIT_LWLOCK                0x01000000U
-#define PG_WAIT_LOCK                0x03000000U
-#define PG_WAIT_BUFFER_PIN            0x04000000U
-#define PG_WAIT_ACTIVITY            0x05000000U
-#define PG_WAIT_CLIENT                0x06000000U
-#define PG_WAIT_EXTENSION            0x07000000U
-#define PG_WAIT_IPC                    0x08000000U
-#define PG_WAIT_TIMEOUT                0x09000000U
-#define PG_WAIT_IO                    0x0A000000U
-
-/* ----------
- * Wait Events - Activity
- *
- * Use this category when a process is waiting because it has no work to do,
- * unless the "Client" or "Timeout" category describes the situation better.
- * Typically, this should only be used for background processes.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_ARCHIVER_MAIN = PG_WAIT_ACTIVITY,
-    WAIT_EVENT_AUTOVACUUM_MAIN,
-    WAIT_EVENT_BGWRITER_HIBERNATE,
-    WAIT_EVENT_BGWRITER_MAIN,
-    WAIT_EVENT_CHECKPOINTER_MAIN,
-    WAIT_EVENT_LOGICAL_APPLY_MAIN,
-    WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
-    WAIT_EVENT_RECOVERY_WAL_ALL,
-    WAIT_EVENT_RECOVERY_WAL_STREAM,
-    WAIT_EVENT_SYSLOGGER_MAIN,
-    WAIT_EVENT_WAL_RECEIVER_MAIN,
-    WAIT_EVENT_WAL_SENDER_MAIN,
-    WAIT_EVENT_WAL_WRITER_MAIN
-} WaitEventActivity;
-
-/* ----------
- * Wait Events - Client
- *
- * Use this category when a process is waiting to send data to or receive data
- * from the frontend process to which it is connected.  This is never used for
- * a background process, which has no client connection.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_CLIENT_READ = PG_WAIT_CLIENT,
-    WAIT_EVENT_CLIENT_WRITE,
-    WAIT_EVENT_LIBPQWALRECEIVER_CONNECT,
-    WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE,
-    WAIT_EVENT_SSL_OPEN_SERVER,
-    WAIT_EVENT_WAL_RECEIVER_WAIT_START,
-    WAIT_EVENT_WAL_SENDER_WAIT_WAL,
-    WAIT_EVENT_WAL_SENDER_WRITE_DATA
-} WaitEventClient;
-
-/* ----------
- * Wait Events - IPC
- *
- * Use this category when a process cannot complete the work it is doing because
- * it is waiting for a notification from another process.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_BGWORKER_SHUTDOWN = PG_WAIT_IPC,
-    WAIT_EVENT_BGWORKER_STARTUP,
-    WAIT_EVENT_BTREE_PAGE,
-    WAIT_EVENT_CLOG_GROUP_UPDATE,
-    WAIT_EVENT_EXECUTE_GATHER,
-    WAIT_EVENT_HASH_BATCH_ALLOCATING,
-    WAIT_EVENT_HASH_BATCH_ELECTING,
-    WAIT_EVENT_HASH_BATCH_LOADING,
-    WAIT_EVENT_HASH_BUILD_ALLOCATING,
-    WAIT_EVENT_HASH_BUILD_ELECTING,
-    WAIT_EVENT_HASH_BUILD_HASHING_INNER,
-    WAIT_EVENT_HASH_BUILD_HASHING_OUTER,
-    WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING,
-    WAIT_EVENT_HASH_GROW_BATCHES_DECIDING,
-    WAIT_EVENT_HASH_GROW_BATCHES_ELECTING,
-    WAIT_EVENT_HASH_GROW_BATCHES_FINISHING,
-    WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING,
-    WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING,
-    WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING,
-    WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING,
-    WAIT_EVENT_LOGICAL_SYNC_DATA,
-    WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
-    WAIT_EVENT_MQ_INTERNAL,
-    WAIT_EVENT_MQ_PUT_MESSAGE,
-    WAIT_EVENT_MQ_RECEIVE,
-    WAIT_EVENT_MQ_SEND,
-    WAIT_EVENT_PARALLEL_BITMAP_SCAN,
-    WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN,
-    WAIT_EVENT_PARALLEL_FINISH,
-    WAIT_EVENT_PROCARRAY_GROUP_UPDATE,
-    WAIT_EVENT_PROMOTE,
-    WAIT_EVENT_REPLICATION_ORIGIN_DROP,
-    WAIT_EVENT_REPLICATION_SLOT_DROP,
-    WAIT_EVENT_SAFE_SNAPSHOT,
-    WAIT_EVENT_SYNC_REP
-} WaitEventIPC;
-
-/* ----------
- * Wait Events - Timeout
- *
- * Use this category when a process is waiting for a timeout to expire.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
-    WAIT_EVENT_PG_SLEEP,
-    WAIT_EVENT_RECOVERY_APPLY_DELAY
-} WaitEventTimeout;
-
-/* ----------
- * Wait Events - IO
- *
- * Use this category when a process is waiting for a IO.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_BUFFILE_READ = PG_WAIT_IO,
-    WAIT_EVENT_BUFFILE_WRITE,
-    WAIT_EVENT_CONTROL_FILE_READ,
-    WAIT_EVENT_CONTROL_FILE_SYNC,
-    WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
-    WAIT_EVENT_CONTROL_FILE_WRITE,
-    WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE,
-    WAIT_EVENT_COPY_FILE_READ,
-    WAIT_EVENT_COPY_FILE_WRITE,
-    WAIT_EVENT_DATA_FILE_EXTEND,
-    WAIT_EVENT_DATA_FILE_FLUSH,
-    WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC,
-    WAIT_EVENT_DATA_FILE_PREFETCH,
-    WAIT_EVENT_DATA_FILE_READ,
-    WAIT_EVENT_DATA_FILE_SYNC,
-    WAIT_EVENT_DATA_FILE_TRUNCATE,
-    WAIT_EVENT_DATA_FILE_WRITE,
-    WAIT_EVENT_DSM_FILL_ZERO_WRITE,
-    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ,
-    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC,
-    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE,
-    WAIT_EVENT_LOCK_FILE_CREATE_READ,
-    WAIT_EVENT_LOCK_FILE_CREATE_SYNC,
-    WAIT_EVENT_LOCK_FILE_CREATE_WRITE,
-    WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ,
-    WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC,
-    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC,
-    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE,
-    WAIT_EVENT_LOGICAL_REWRITE_SYNC,
-    WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE,
-    WAIT_EVENT_LOGICAL_REWRITE_WRITE,
-    WAIT_EVENT_RELATION_MAP_READ,
-    WAIT_EVENT_RELATION_MAP_SYNC,
-    WAIT_EVENT_RELATION_MAP_WRITE,
-    WAIT_EVENT_REORDER_BUFFER_READ,
-    WAIT_EVENT_REORDER_BUFFER_WRITE,
-    WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ,
-    WAIT_EVENT_REPLICATION_SLOT_READ,
-    WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
-    WAIT_EVENT_REPLICATION_SLOT_SYNC,
-    WAIT_EVENT_REPLICATION_SLOT_WRITE,
-    WAIT_EVENT_SLRU_FLUSH_SYNC,
-    WAIT_EVENT_SLRU_READ,
-    WAIT_EVENT_SLRU_SYNC,
-    WAIT_EVENT_SLRU_WRITE,
-    WAIT_EVENT_SNAPBUILD_READ,
-    WAIT_EVENT_SNAPBUILD_SYNC,
-    WAIT_EVENT_SNAPBUILD_WRITE,
-    WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC,
-    WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE,
-    WAIT_EVENT_TIMELINE_HISTORY_READ,
-    WAIT_EVENT_TIMELINE_HISTORY_SYNC,
-    WAIT_EVENT_TIMELINE_HISTORY_WRITE,
-    WAIT_EVENT_TWOPHASE_FILE_READ,
-    WAIT_EVENT_TWOPHASE_FILE_SYNC,
-    WAIT_EVENT_TWOPHASE_FILE_WRITE,
-    WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ,
-    WAIT_EVENT_WAL_BOOTSTRAP_SYNC,
-    WAIT_EVENT_WAL_BOOTSTRAP_WRITE,
-    WAIT_EVENT_WAL_COPY_READ,
-    WAIT_EVENT_WAL_COPY_SYNC,
-    WAIT_EVENT_WAL_COPY_WRITE,
-    WAIT_EVENT_WAL_INIT_SYNC,
-    WAIT_EVENT_WAL_INIT_WRITE,
-    WAIT_EVENT_WAL_READ,
-    WAIT_EVENT_WAL_SYNC,
-    WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-    WAIT_EVENT_WAL_WRITE
-} WaitEventIO;
-
-/* ----------
- * Command type for progress reporting purposes
- * ----------
- */
-typedef enum ProgressCommandType
-{
-    PROGRESS_COMMAND_INVALID,
-    PROGRESS_COMMAND_VACUUM
-} ProgressCommandType;
-
-#define PGSTAT_NUM_PROGRESS_PARAM    10
-
-/* ----------
- * Shared-memory data structures
- * ----------
- */
-
-
-/*
- * PgBackendSSLStatus
- *
- * For each backend, we keep the SSL status in a separate struct, that
- * is only filled in if SSL is enabled.
- */
-typedef struct PgBackendSSLStatus
-{
-    /* Information about SSL connection */
-    int            ssl_bits;
-    bool        ssl_compression;
-    char        ssl_version[NAMEDATALEN];    /* MUST be null-terminated */
-    char        ssl_cipher[NAMEDATALEN];    /* MUST be null-terminated */
-    char        ssl_clientdn[NAMEDATALEN];    /* MUST be null-terminated */
-} PgBackendSSLStatus;
-
-
-/* ----------
- * PgBackendStatus
- *
- * Each live backend maintains a PgBackendStatus struct in shared memory
- * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
- * has no involvement in, or even access to, these structs.
- *
- * Each auxiliary process also maintains a PgBackendStatus struct in shared
- * memory.
- * ----------
- */
-typedef struct PgBackendStatus
-{
-    /*
-     * To avoid locking overhead, we use the following protocol: a backend
-     * increments st_changecount before modifying its entry, and again after
-     * finishing a modification.  A would-be reader should note the value of
-     * st_changecount, copy the entry into private memory, then check
-     * st_changecount again.  If the value hasn't changed, and if it's even,
-     * the copy is valid; otherwise start over.  This makes updates cheap
-     * while reads are potentially expensive, but that's the tradeoff we want.
-     *
-     * The above protocol needs the memory barriers to ensure that the
-     * apparent order of execution is as it desires. Otherwise, for example,
-     * the CPU might rearrange the code so that st_changecount is incremented
-     * twice before the modification on a machine with weak memory ordering.
-     * This surprising result can lead to bugs.
-     */
-    int            st_changecount;
-
-    /* The entry is valid iff st_procpid > 0, unused if st_procpid == 0 */
-    int            st_procpid;
-
-    /* Type of backends */
-    BackendType st_backendType;
-
-    /* Times when current backend, transaction, and activity started */
-    TimestampTz st_proc_start_timestamp;
-    TimestampTz st_xact_start_timestamp;
-    TimestampTz st_activity_start_timestamp;
-    TimestampTz st_state_start_timestamp;
-
-    /* Database OID, owning user's OID, connection client address */
-    Oid            st_databaseid;
-    Oid            st_userid;
-    SockAddr    st_clientaddr;
-    char       *st_clienthostname;    /* MUST be null-terminated */
-
-    /* Information about SSL connection */
-    bool        st_ssl;
-    PgBackendSSLStatus *st_sslstatus;
-
-    /* current state */
-    BackendState st_state;
-
-    /* application name; MUST be null-terminated */
-    char       *st_appname;
-
-    /*
-     * Current command string; MUST be null-terminated. Note that this string
-     * possibly is truncated in the middle of a multi-byte character. As
-     * activity strings are stored more frequently than read, that allows to
-     * move the cost of correct truncation to the display side. Use
-     * pgstat_clip_activity() to truncate correctly.
-     */
-    char       *st_activity_raw;
-
-    /*
-     * Command progress reporting.  Any command which wishes can advertise
-     * that it is running by setting st_progress_command,
-     * st_progress_command_target, and st_progress_param[].
-     * st_progress_command_target should be the OID of the relation which the
-     * command targets (we assume there's just one, as this is meant for
-     * utility commands), but the meaning of each element in the
-     * st_progress_param array is command-specific.
-     */
-    ProgressCommandType st_progress_command;
-    Oid            st_progress_command_target;
-    int64        st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
-} PgBackendStatus;
-
-/*
- * Macros to load and store st_changecount with the memory barriers.
- *
- * pgstat_increment_changecount_before() and
- * pgstat_increment_changecount_after() need to be called before and after
- * PgBackendStatus entries are modified, respectively. This makes sure that
- * st_changecount is incremented around the modification.
- *
- * Also pgstat_save_changecount_before() and pgstat_save_changecount_after()
- * need to be called before and after PgBackendStatus entries are copied into
- * private memory, respectively.
- */
-#define pgstat_increment_changecount_before(beentry)    \
-    do {    \
-        beentry->st_changecount++;    \
-        pg_write_barrier(); \
-    } while (0)
-
-#define pgstat_increment_changecount_after(beentry) \
-    do {    \
-        pg_write_barrier(); \
-        beentry->st_changecount++;    \
-        Assert((beentry->st_changecount & 1) == 0); \
-    } while (0)
-
-#define pgstat_save_changecount_before(beentry, save_changecount)    \
-    do {    \
-        save_changecount = beentry->st_changecount; \
-        pg_read_barrier();    \
-    } while (0)
-
-#define pgstat_save_changecount_after(beentry, save_changecount)    \
-    do {    \
-        pg_read_barrier();    \
-        save_changecount = beentry->st_changecount; \
-    } while (0)
-
-/* ----------
- * LocalPgBackendStatus
- *
- * When we build the backend status array, we use LocalPgBackendStatus to be
- * able to add new values to the struct when needed without adding new fields
- * to the shared memory. It contains the backend status as a first member.
- * ----------
- */
-typedef struct LocalPgBackendStatus
-{
-    /*
-     * Local version of the backend status entry.
-     */
-    PgBackendStatus backendStatus;
-
-    /*
-     * The xid of the current transaction if available, InvalidTransactionId
-     * if not.
-     */
-    TransactionId backend_xid;
-
-    /*
-     * The xmin of the current session if available, InvalidTransactionId if
-     * not.
-     */
-    TransactionId backend_xmin;
-} LocalPgBackendStatus;
-
 /*
  * Working state needed to accumulate per-function-call timing statistics.
  */
@@ -778,10 +358,8 @@ typedef struct PgStat_FunctionCallUsage
  * GUC parameters
  * ----------
  */
-extern bool pgstat_track_activities;
 extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
-extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
@@ -828,26 +406,9 @@ extern void pgstat_report_deadlock(void);
 extern void pgstat_clear_snapshot(void);
 
 extern void pgstat_initialize(void);
+extern void pgstat_bearray_initialize(void);
 extern void pgstat_bestart(void);
 
-extern void pgstat_report_activity(BackendState state, const char *cmd_str);
-extern void pgstat_report_tempfile(size_t filesize);
-extern void pgstat_report_appname(const char *appname);
-extern void pgstat_report_xact_timestamp(TimestampTz tstamp);
-extern const char *pgstat_get_wait_event(uint32 wait_event_info);
-extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
-extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
-extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
-                                    int buflen);
-extern const char *pgstat_get_backend_desc(BackendType backendType);
-
-extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
-                              Oid relid);
-extern void pgstat_progress_update_param(int index, int64 val);
-extern void pgstat_progress_update_multi_param(int nparam, const int *index,
-                                   const int64 *val);
-extern void pgstat_progress_end_command(void);
-
 extern PgStat_TableStatus *find_tabstat_entry(Oid rel_id);
 extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 
@@ -858,60 +419,6 @@ extern PgStat_StatDBEntry *backend_get_db_entry(Oid dbid, bool oneshot);
 extern HTAB *backend_snapshot_all_db_entries(void);
 extern PgStat_StatTabEntry *backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid relid, bool oneshot);
 
-/* ----------
- * pgstat_report_wait_start() -
- *
- *    Called from places where server process needs to wait.  This is called
- *    to report wait event information.  The wait information is stored
- *    as 4-bytes where first byte represents the wait event class (type of
- *    wait, for different types of wait, refer WaitClass) and the next
- *    3-bytes represent the actual wait event.  Currently 2-bytes are used
- *    for wait event which is sufficient for current usage, 1-byte is
- *    reserved for future usage.
- *
- * NB: this *must* be able to survive being called before MyProc has been
- * initialized.
- * ----------
- */
-static inline void
-pgstat_report_wait_start(uint32 wait_event_info)
-{
-    volatile PGPROC *proc = MyProc;
-
-    if (!pgstat_track_activities || !proc)
-        return;
-
-    /*
-     * Since this is a four-byte field which is always read and written as
-     * four-bytes, updates are atomic.
-     */
-    proc->wait_event_info = wait_event_info;
-}
-
-/* ----------
- * pgstat_report_wait_end() -
- *
- *    Called to report end of a wait.
- *
- * NB: this *must* be able to survive being called before MyProc has been
- * initialized.
- * ----------
- */
-static inline void
-pgstat_report_wait_end(void)
-{
-    volatile PGPROC *proc = MyProc;
-
-    if (!pgstat_track_activities || !proc)
-        return;
-
-    /*
-     * Since this is a four-byte field which is always read and written as
-     * four-bytes, updates are atomic.
-     */
-    proc->wait_event_info = 0;
-}
-
 /* nontransactional event counts are simple enough to inline */
 
 #define pgstat_count_heap_scan(rel)                                    \
@@ -979,6 +486,8 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 extern void pgstat_update_archiver(const char *xlog, bool failed);
 extern void pgstat_update_bgwriter(void);
 
+extern void pgstat_report_tempfile(size_t filesize);
+
 /* ----------
  * Support functions for the SQL-callable functions to
  * generate the pgstat* views.
@@ -986,10 +495,7 @@ extern void pgstat_update_bgwriter(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid relid, bool oneshot);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
-extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
-extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
-extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
 
-- 
2.16.3
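
The comment on pgstat_report_wait_start() that the pgstat.h hunk above removes describes wait_event_info as a 4-byte value: one byte for the wait event class and three bytes for the event, of which two are currently used. Below is a minimal sketch of that layout. It assumes the class sits in the most significant byte and the event in the low two bytes. The macro and helper names are hypothetical and are not taken from the patch.

```c
#include <stdint.h>

/* Hypothetical masks for illustration only; not copied from the patch. */
#define WAIT_CLASS_MASK UINT32_C(0xFF000000)	/* assumed: class in top byte */
#define WAIT_EVENT_MASK UINT32_C(0x0000FFFF)	/* two event bytes in use */

/* Pack a class byte and a 16-bit event id into one 4-byte value. */
static inline uint32_t
wait_event_pack(uint8_t class_id, uint16_t event_id)
{
	return ((uint32_t) class_id << 24) | (uint32_t) event_id;
}

/* Extract the wait event class (top byte). */
static inline uint8_t
wait_event_class(uint32_t info)
{
	return (uint8_t) ((info & WAIT_CLASS_MASK) >> 24);
}

/* Extract the wait event id (low two bytes; the third byte is reserved). */
static inline uint16_t
wait_event_id(uint32_t info)
{
	return (uint16_t) (info & WAIT_EVENT_MASK);
}
```

Under these assumptions, wait_event_pack(0x01, 0x0005) yields 0x01000005, and the two accessors recover 0x01 and 0x0005 from it. Because the whole value fits in one aligned 4-byte word, a single store publishes both parts together, which is the property the removed comment relies on.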

From 3411b5bfd3a056ebd949a33f28fddd8e1fa82667 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 9 Nov 2018 16:14:48 +0900
Subject: [PATCH 9/9] Fix include files of contribs

Splitting pgstat.h affects some programs that used to include pgstat.h.
Fix them.
---
 contrib/pg_prewarm/autoprewarm.c                | 2 +-
 contrib/pg_stat_statements/pg_stat_statements.c | 1 +
 contrib/postgres_fdw/connection.c               | 2 +-
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index 03bf90ce2d..41be8c6c65 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -30,10 +30,10 @@
 
 #include "access/heapam.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "catalog/pg_class.h"
 #include "catalog/pg_type.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/bgworker.h"
 #include "storage/buf_internals.h"
 #include "storage/dsm.h"
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index ec2fa9881c..3a35798cfb 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -62,6 +62,7 @@
 #include <unistd.h>
 
 #include "access/hash.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index fe4893a8e0..e2b250fde4 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -15,11 +15,11 @@
 #include "postgres_fdw.h"
 
 #include "access/htup_details.h"
+#include "bestatus.h"
 #include "catalog/pg_user_mapping.h"
 #include "access/xact.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/latch.h"
 #include "utils/hsearch.h"
 #include "utils/inval.h"
-- 
2.16.3


Re: shared-memory based stats collector

From
Tomas Vondra
Date:

On 11/9/18 9:33 AM, Kyotaro HORIGUCHI wrote:
> Hello. This is rebased version.
> 
> At Thu, 8 Nov 2018 16:06:49 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
> <de249c3f-79c9-b75c-79a3-5e2d008548a8@2ndquadrant.com>
>> However, quite a few extensions in contrib seem to be broken now. It
>> seems fixing it is as simple as including the new bestatus.h next to
>> pgstat.h.
> 
> The additional 0009 does that.
> 

That does fix it, indeed. But the break happens in 0003, so that's where 
the fixes should be moved - I've tried to simply apply 0009 right after 
0003, but that does not seem to work because bestatus.h does not exist 
at that point yet :-/

The current split into 8 parts seems quite sensible to me, i.e. that's 
how it might get committed eventually. That however means each part 
needs to be correct on its own (hence fixes in 0009 are a problem).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: shared-memory based stats collector

From
Kyotaro HORIGUCHI
Date:
At Fri, 9 Nov 2018 14:16:31 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
<803f2d96-3b4b-f357-9a2e-45443212f13d@2ndquadrant.com>
> 
> 
> On 11/9/18 9:33 AM, Kyotaro HORIGUCHI wrote:
> > Hello. This is rebased version.
> > At Thu, 8 Nov 2018 16:06:49 +0100, Tomas Vondra
> > <tomas.vondra@2ndquadrant.com> wrote in
> > <de249c3f-79c9-b75c-79a3-5e2d008548a8@2ndquadrant.com>
> >> However, quite a few extensions in contrib seem to be broken now. It
> >> seems fixing it is as simple as including the new bestatus.h next to
> >> pgstat.h.
> > The additional 0009 does that.
> > 
> 
> That does fix it, indeed. But the break happens in 0003, so that's
> where the fixes should be moved - I've tried to simply apply 0009
> right after 0003, but that does not seem to work because bestatus.h
> does not exist at that point yet :-/

Sorry, I misunderstood you. The real reason 0003 broke, as you
saw, is that I had simply removed PG_STAT_TMP_DIR there; 0005 fixes
that later. I (half-intentionally) didn't keep the source tree
sound at v8-0003 and v8-0008.

> The current split into 8 parts seems quite sensible to me, i.e. that's
> how it might get committed eventually. That however means each part
> needs to be correct on its own (hence fixes in 0009 are a problem).

Thanks. I tidied up the patchset so that each individual patch
keeps the source buildable and doesn't break the programs' behavior.

v9-0001-sequential-scan-for-dshash.patch
v9-0002-Add-conditional-lock-feature-to-dshash.patch
  same as v8

v9-0003-Make-archiver-process-an-auxiliary-process.patch
  moved from v8-0004, since this can be applied independently of
  v8-0003.

v9-0004-Shared-memory-based-stats-collector.patch
  v8-0003 plus some fixes to make the contrib modules work, minus
  the initdb part.

v9-0005-Remove-statistics-temporary-directory.patch
  v8-0005 + v8-0006 + the initdb part of v8-0003. pg_stat_statements
  may still need a fix, since I changed only the directory for
  temporary query files.

v9-0006-Split-out-backend-status-monitor-part-from-pgstat.patch
  v8-0008 plus v8-0009

v9-0007-Documentation-update.patch
  this still leaves the description of the UDP-based stats collector
  in place and will need further editing.

 regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 5cc354bfd9bd37660173ddb940b036566ff31cca Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 4 Jul 2018 11:44:31 +0900
Subject: [PATCH 7/7] Documentation update

Remove all descriptions of the pg_stat_tmp directory from the documentation.
---
 doc/src/sgml/backup.sgml        |  4 +---
 doc/src/sgml/config.sgml        | 19 -------------------
 doc/src/sgml/func.sgml          |  3 +--
 doc/src/sgml/monitoring.sgml    |  7 +------
 doc/src/sgml/protocol.sgml      |  2 +-
 doc/src/sgml/ref/pg_rewind.sgml |  3 +--
 doc/src/sgml/storage.sgml       |  6 ------
 7 files changed, 5 insertions(+), 39 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 3fa5efdd78..31e94c1fe9 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1116,11 +1116,9 @@ SELECT pg_stop_backup();
    <para>
     The contents of the directories <filename>pg_dynshmem/</filename>,
     <filename>pg_notify/</filename>, <filename>pg_serial/</filename>,
-    <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
+    <filename>pg_snapshots/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0f8f2ef920..58d2b791b3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6094,25 +6094,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 1678c8cbac..36184faf34 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -15953,8 +15953,7 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
  PG_VERSION      | 15
  pg_wal          | 16
  pg_hba.conf     | 17
- pg_stat_tmp     | 18
- pg_subtrans     | 19
-(19 rows)
+ pg_subtrans     | 18
+(18 rows)
 </programlisting>
   </para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index add71458e2..a1031b3b2a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -197,12 +197,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
+   <productname>PostgreSQL</productname> processes through shared memory.
    When the server shuts down cleanly, a permanent copy of the statistics
    data is stored in the <filename>pg_stat</filename> subdirectory, so that
    statistics can be retained across server restarts.  When recovery is
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index f0b2145208..11f263f378 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2612,7 +2612,7 @@ The commands accepted in replication mode are:
         <para>
          <filename>pg_dynshmem</filename>, <filename>pg_notify</filename>,
          <filename>pg_replslot</filename>, <filename>pg_serial</filename>,
-         <filename>pg_snapshots</filename>, <filename>pg_stat_tmp</filename>, and
+         <filename>pg_snapshots</filename>, and
          <filename>pg_subtrans</filename> are copied as empty directories (even if
          they are symbolic links).
         </para>
diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index e2662bbf81..bf9c5dd580 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -270,8 +270,7 @@ PostgreSQL documentation
       (everything except the relation files). Similarly to base backups,
       the contents of the directories <filename>pg_dynshmem/</filename>,
       <filename>pg_notify/</filename>, <filename>pg_replslot/</filename>,
-      <filename>pg_serial/</filename>, <filename>pg_snapshots/</filename>,
-      <filename>pg_stat_tmp/</filename>, and
+      <filename>pg_serial/</filename>, <filename>pg_snapshots/</filename>, and
       <filename>pg_subtrans/</filename> are omitted from the data copied
       from the source cluster. Any file or directory beginning with
       <filename>pgsql_tmp</filename> is omitted, as well as are
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 8ef2ac8010..5ee7493970 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -120,12 +120,6 @@ Item
   subsystem</entry>
 </row>
 
-<row>
- <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
-</row>
-
 <row>
  <entry><filename>pg_subtrans</filename></entry>
  <entry>Subdirectory containing subtransaction status data</entry>
-- 
2.16.3

From cdbeaedb114216847c6c5f92df8dd285cf116d17 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 9 Nov 2018 15:48:49 +0900
Subject: [PATCH 6/7] Split out backend status monitor part from pgstat

pgstat.c, a large file, contained two major facilities: the backend
status monitor and the database usage monitor. Split the former out of
the file and name the module "bestatus". The names of individual
functions are left alone except for some conflicts.
---
 contrib/pg_prewarm/autoprewarm.c                   |    2 +-
 contrib/pg_stat_statements/pg_stat_statements.c    |    1 +
 contrib/postgres_fdw/connection.c                  |    2 +-
 src/backend/Makefile                               |    2 +-
 src/backend/access/heap/rewriteheap.c              |    4 +-
 src/backend/access/nbtree/nbtree.c                 |    2 +-
 src/backend/access/nbtree/nbtsort.c                |    2 +-
 src/backend/access/transam/clog.c                  |    2 +-
 src/backend/access/transam/parallel.c              |    2 +-
 src/backend/access/transam/slru.c                  |    2 +-
 src/backend/access/transam/timeline.c              |    2 +-
 src/backend/access/transam/twophase.c              |    1 +
 src/backend/access/transam/xact.c                  |    1 +
 src/backend/access/transam/xlog.c                  |    1 +
 src/backend/access/transam/xlogfuncs.c             |    2 +-
 src/backend/access/transam/xlogutils.c             |    2 +-
 src/backend/bootstrap/bootstrap.c                  |    8 +-
 src/backend/commands/vacuumlazy.c                  |    1 +
 src/backend/executor/execParallel.c                |    2 +-
 src/backend/executor/nodeBitmapHeapscan.c          |    1 +
 src/backend/executor/nodeGather.c                  |    2 +-
 src/backend/executor/nodeHash.c                    |    2 +-
 src/backend/executor/nodeHashjoin.c                |    2 +-
 src/backend/libpq/be-secure-openssl.c              |    2 +-
 src/backend/libpq/be-secure.c                      |    2 +-
 src/backend/libpq/pqmq.c                           |    2 +-
 src/backend/postmaster/Makefile                    |    2 +-
 src/backend/postmaster/autovacuum.c                |    1 +
 src/backend/postmaster/bgworker.c                  |    2 +-
 src/backend/postmaster/bgwriter.c                  |    1 +
 src/backend/postmaster/checkpointer.c              |    1 +
 src/backend/postmaster/pgarch.c                    |    1 +
 src/backend/postmaster/postmaster.c                |    1 +
 src/backend/postmaster/syslogger.c                 |    2 +-
 src/backend/postmaster/walwriter.c                 |    2 +-
 src/backend/replication/basebackup.c               |    2 +-
 .../libpqwalreceiver/libpqwalreceiver.c            |    2 +-
 src/backend/replication/logical/launcher.c         |    2 +-
 src/backend/replication/logical/origin.c           |    3 +-
 src/backend/replication/logical/reorderbuffer.c    |    2 +-
 src/backend/replication/logical/snapbuild.c        |    2 +-
 src/backend/replication/logical/tablesync.c        |    6 +-
 src/backend/replication/logical/worker.c           |   11 +-
 src/backend/replication/slot.c                     |    2 +-
 src/backend/replication/syncrep.c                  |    2 +-
 src/backend/replication/walreceiver.c              |    2 +-
 src/backend/replication/walsender.c                |    2 +-
 src/backend/statmon/Makefile                       |   17 +
 src/backend/statmon/bestatus.c                     | 1756 ++++++++++++++++++++
 src/backend/{postmaster => statmon}/pgstat.c       | 1727 +------------------
 src/backend/storage/buffer/bufmgr.c                |    1 +
 src/backend/storage/file/buffile.c                 |    2 +-
 src/backend/storage/file/copydir.c                 |    2 +-
 src/backend/storage/file/fd.c                      |    1 +
 src/backend/storage/ipc/dsm_impl.c                 |    2 +-
 src/backend/storage/ipc/latch.c                    |    2 +-
 src/backend/storage/ipc/procarray.c                |    2 +-
 src/backend/storage/ipc/shm_mq.c                   |    2 +-
 src/backend/storage/ipc/standby.c                  |    2 +-
 src/backend/storage/lmgr/deadlock.c                |    1 +
 src/backend/storage/lmgr/lwlock.c                  |    2 +-
 src/backend/storage/lmgr/predicate.c               |    2 +-
 src/backend/storage/lmgr/proc.c                    |    2 +-
 src/backend/storage/smgr/md.c                      |    2 +-
 src/backend/tcop/postgres.c                        |    1 +
 src/backend/utils/adt/misc.c                       |    2 +-
 src/backend/utils/adt/pgstatfuncs.c                |    1 +
 src/backend/utils/cache/relmapper.c                |    2 +-
 src/backend/utils/init/miscinit.c                  |    2 +-
 src/backend/utils/init/postinit.c                  |    4 +
 src/backend/utils/misc/guc.c                       |    1 +
 src/include/bestatus.h                             |  544 ++++++
 src/include/pgstat.h                               |  514 +-----
 73 files changed, 2441 insertions(+), 2260 deletions(-)
 create mode 100644 src/backend/statmon/Makefile
 create mode 100644 src/backend/statmon/bestatus.c
 rename src/backend/{postmaster => statmon}/pgstat.c (70%)
 create mode 100644 src/include/bestatus.h

diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index 03bf90ce2d..41be8c6c65 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -30,10 +30,10 @@
 
 #include "access/heapam.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "catalog/pg_class.h"
 #include "catalog/pg_type.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/bgworker.h"
 #include "storage/buf_internals.h"
 #include "storage/dsm.h"
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index ec2fa9881c..3a35798cfb 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -62,6 +62,7 @@
 #include <unistd.h>
 
 #include "access/hash.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index fe4893a8e0..e2b250fde4 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -15,11 +15,11 @@
 #include "postgres_fdw.h"
 
 #include "access/htup_details.h"
+#include "bestatus.h"
 #include "catalog/pg_user_mapping.h"
 #include "access/xact.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/latch.h"
 #include "utils/hsearch.h"
 #include "utils/inval.h"
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 3a58bf6685..9921dca7f9 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -20,7 +20,7 @@ include $(top_builddir)/src/Makefile.global
 SUBDIRS = access bootstrap catalog parser commands executor foreign lib libpq \
     main nodes optimizer partitioning port postmaster \
     regex replication rewrite \
-    statistics storage tcop tsearch utils $(top_builddir)/src/timezone \
+    statistics statmon storage tcop tsearch utils $(top_builddir)/src/timezone \
     jit
 
 include $(srcdir)/common.mk
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index c5db75afa1..30890f11ea 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -115,12 +115,12 @@
 #include "access/xact.h"
 #include "access/xloginsert.h"
 
+#include "bestatus.h"
+
 #include "catalog/catalog.h"
 
 #include "lib/ilist.h"
 
-#include "pgstat.h"
-
 #include "replication/logical.h"
 #include "replication/slot.h"
 
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index e8725fbbe1..6679dbc3a5 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -22,10 +22,10 @@
 #include "access/nbtxlog.h"
 #include "access/relscan.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
-#include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "storage/condition_variable.h"
 #include "storage/indexfsm.h"
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 16f5755777..14d183b0da 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -63,9 +63,9 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "bestatus.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/smgr.h"
 #include "tcop/tcopprot.h"        /* pgrminclude ignore */
 #include "utils/rel.h"
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 8b7ff5b0c2..9971bfe4f2 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -38,8 +38,8 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "pg_trace.h"
 #include "storage/proc.h"
 
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 84197192ec..7e5c84bd5f 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -19,6 +19,7 @@
 #include "access/session.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/pg_enum.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
@@ -29,7 +30,6 @@
 #include "libpq/pqmq.h"
 #include "miscadmin.h"
 #include "optimizer/planmain.h"
-#include "pgstat.h"
 #include "storage/ipc.h"
 #include "storage/sinval.h"
 #include "storage/spin.h"
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 1132eef038..90a6f14899 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -54,7 +54,7 @@
 #include "access/slru.h"
 #include "access/transam.h"
 #include "access/xlog.h"
-#include "pgstat.h"
+#include "bestatus.h"
 #include "storage/fd.h"
 #include "storage/shmem.h"
 #include "miscadmin.h"
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 61d36050c3..ba78461ff0 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -38,7 +38,7 @@
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogdefs.h"
-#include "pgstat.h"
+#include "bestatus.h"
 #include "storage/fd.h"
 
 /*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 3942734e5a..e2c1be7422 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -87,6 +87,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "access/xlogreader.h"
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "catalog/storage.h"
 #include "funcapi.h"
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a979d7e07b..84c51c6ac8 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -30,6 +30,7 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
 #include "catalog/storage.h"
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e52ae54821..018a3737dc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -36,6 +36,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
 #include "catalog/pg_database.h"
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index a31adcca5e..b72da3f45f 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -23,9 +23,9 @@
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
 #include "catalog/pg_type.h"
+#include "bestatus.h"
 #include "funcapi.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/walreceiver.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 4ecdc9220f..b739f650d6 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -23,8 +23,8 @@
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/smgr.h"
 #include "utils/guc.h"
 #include "utils/hsearch.h"
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index dab0addd8b..0782cf11b9 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -20,6 +20,7 @@
 #include "access/htup_details.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/index.h"
 #include "catalog/pg_collation.h"
@@ -327,9 +328,6 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case BgWriterProcess:
                 statmsg = pgstat_get_backend_desc(B_BG_WRITER);
                 break;
-            case ArchiverProcess:
-                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
-                break;
             case CheckpointerProcess:
                 statmsg = pgstat_get_backend_desc(B_CHECKPOINTER);
                 break;
@@ -339,6 +337,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case WalReceiverProcess:
                 statmsg = pgstat_get_backend_desc(B_WAL_RECEIVER);
                 break;
+            case ArchiverProcess:
+                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
+                break;
             default:
                 statmsg = "??? process";
                 break;
@@ -415,6 +416,7 @@ AuxiliaryProcessMain(int argc, char *argv[])
         CreateAuxProcessResourceOwner();
 
         /* Initialize backend status information */
+        pgstat_bearray_initialize();
         pgstat_initialize();
         pgstat_bestart();
 
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 8996d366e9..dff87c1d84 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -44,6 +44,7 @@
 #include "access/transam.h"
 #include "access/visibilitymap.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/storage.h"
 #include "commands/dbcommands.h"
 #include "commands/progress.h"
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 13ef232d39..a99ea3dbfe 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -48,7 +48,7 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
-#include "pgstat.h"
+#include "bestatus.h"
 
 /*
  * Magic numbers for parallel executor communication.  We use constants
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index c153d74f41..f163daf408 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -40,6 +40,7 @@
 #include "access/relscan.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "bestatus.h"
 #include "executor/execdebug.h"
 #include "executor/nodeBitmapHeapscan.h"
 #include "miscadmin.h"
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index afddb0a039..072a9cee23 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -32,6 +32,7 @@
 
 #include "access/relscan.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "executor/execdebug.h"
 #include "executor/execParallel.h"
 #include "executor/nodeGather.h"
@@ -39,7 +40,6 @@
 #include "executor/tqueue.h"
 #include "miscadmin.h"
 #include "optimizer/planmain.h"
-#include "pgstat.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index a9f812d66b..9545fb0994 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -28,6 +28,7 @@
 
 #include "access/htup_details.h"
 #include "access/parallel.h"
+#include "bestatus.h"
 #include "catalog/pg_statistic.h"
 #include "commands/tablespace.h"
 #include "executor/execdebug.h"
@@ -35,7 +36,6 @@
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "port/atomics.h"
 #include "utils/dynahash.h"
 #include "utils/memutils.h"
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 08a8bb3426..7b6f75805a 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -108,12 +108,12 @@
 
 #include "access/htup_details.h"
 #include "access/parallel.h"
+#include "bestatus.h"
 #include "executor/executor.h"
 #include "executor/hashjoin.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "utils/memutils.h"
 #include "utils/sharedtuplestore.h"
 
diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c
index 6a576572bb..5a304c7405 100644
--- a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c
@@ -36,9 +36,9 @@
 #include <openssl/ec.h>
 #endif
 
+#include "bestatus.h"
 #include "libpq/libpq.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/latch.h"
 #include "tcop/tcopprot.h"
diff --git a/src/backend/libpq/be-secure.c b/src/backend/libpq/be-secure.c
index 4eb21fe89d..517b22a694 100644
--- a/src/backend/libpq/be-secure.c
+++ b/src/backend/libpq/be-secure.c
@@ -29,9 +29,9 @@
 #include <arpa/inet.h>
 #endif
 
+#include "bestatus.h"
 #include "libpq/libpq.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "tcop/tcopprot.h"
 #include "utils/memutils.h"
 #include "storage/ipc.h"
diff --git a/src/backend/libpq/pqmq.c b/src/backend/libpq/pqmq.c
index 6eaed5bf0c..5906682fbf 100644
--- a/src/backend/libpq/pqmq.c
+++ b/src/backend/libpq/pqmq.c
@@ -13,11 +13,11 @@
 
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
 #include "libpq/pqmq.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c23211b2..311e63017d 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
-    pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+    pgarch.o postmaster.o startup.o syslogger.o walwriter.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 10e707e9a1..ce2b441c37 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -71,6 +71,7 @@
 #include "access/reloptions.h"
 #include "access/transam.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "catalog/dependency.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index d2b695e146..01eaa187ff 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -16,8 +16,8 @@
 
 #include "libpq/pqsignal.h"
 #include "access/parallel.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "port/atomics.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/postmaster.h"
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index a4b1079e60..aea6a15b74 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -40,6 +40,7 @@
 
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 9235390bc6..2968d356ed 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -43,6 +43,7 @@
 
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 7d4e528096..deec58b057 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -34,6 +34,7 @@
 
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 559aeedb6e..1719bb8d31 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -95,6 +95,7 @@
 
 #include "access/transam.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/pg_control.h"
 #include "common/file_perm.h"
diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index 29bdcec895..d23987b20e 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -31,11 +31,11 @@
 #include <sys/stat.h>
 #include <sys/time.h>
 
+#include "bestatus.h"
 #include "lib/stringinfo.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "nodes/pg_list.h"
-#include "pgstat.h"
 #include "pgtime.h"
 #include "postmaster/fork_process.h"
 #include "postmaster/postmaster.h"
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index fb66bceeed..09021b54c4 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -45,9 +45,9 @@
 #include <unistd.h>
 
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/walwriter.h"
 #include "storage/bufmgr.h"
 #include "storage/condition_variable.h"
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 20cf33354a..1ce0809361 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -17,6 +17,7 @@
 #include <time.h>
 
 #include "access/xlog_internal.h"    /* for pg_start/stop_backup */
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "common/file_perm.h"
 #include "lib/stringinfo.h"
@@ -25,7 +26,6 @@
 #include "miscadmin.h"
 #include "nodes/pg_list.h"
 #include "pgtar.h"
-#include "pgstat.h"
 #include "port.h"
 #include "postmaster/syslogger.h"
 #include "replication/basebackup.h"
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 1e1695ef4f..b992473fd4 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -22,11 +22,11 @@
 #include "libpq-fe.h"
 #include "pqexpbuffer.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "funcapi.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/walreceiver.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index ada16adb67..bf7ac927f7 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -19,7 +19,7 @@
 
 #include "funcapi.h"
 #include "miscadmin.h"
-#include "pgstat.h"
+#include "bestatus.h"
 
 #include "access/heapam.h"
 #include "access/htup.h"
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index bf97dcdee4..a60ef0a9f1 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -77,13 +77,12 @@
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/xact.h"
-
+#include "bestatus.h"
 #include "catalog/indexing.h"
 #include "nodes/execnodes.h"
 
 #include "replication/origin.h"
 #include "replication/logical.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index bed63c768e..7af1e6b6b1 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -61,10 +61,10 @@
 #include "access/tuptoaster.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "lib/binaryheap.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/logical.h"
 #include "replication/reorderbuffer.h"
 #include "replication/slot.h"
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index a6cd6c67d1..846e1e7267 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -126,7 +126,7 @@
 #include "access/transam.h"
 #include "access/xact.h"
 
-#include "pgstat.h"
+#include "bestatus.h"
 
 #include "replication/logical.h"
 #include "replication/reorderbuffer.h"
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 862582da23..670552593f 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -86,25 +86,27 @@
 #include "postgres.h"
 
 #include "miscadmin.h"
-#include "pgstat.h"
 
 #include "access/xact.h"
 
+#include "bestatus.h"
+
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 
 #include "commands/copy.h"
 
 #include "parser/parse_relation.h"
+#include "pgstat.h"
 
 #include "replication/logicallauncher.h"
 #include "replication/logicalrelation.h"
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 
-#include "utils/snapmgr.h"
 #include "storage/ipc.h"
 
+#include "utils/snapmgr.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 087850d089..3dc686f0df 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -23,13 +23,11 @@
 
 #include "postgres.h"
 
-#include "miscadmin.h"
-#include "pgstat.h"
-#include "funcapi.h"
-
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 
+#include "bestatus.h"
+
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_subscription.h"
@@ -41,17 +39,20 @@
 #include "executor/executor.h"
 #include "executor/nodeModifyTable.h"
 
+#include "funcapi.h"
+
 #include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
 
 #include "mb/pg_wchar.h"
+#include "miscadmin.h"
 
 #include "nodes/makefuncs.h"
 
 #include "optimizer/planner.h"
 
 #include "parser/parse_relation.h"
-
+#include "pgstat.h"
 #include "postmaster/bgworker.h"
 #include "postmaster/postmaster.h"
 #include "postmaster/walwriter.h"
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 1f2e7139a7..1620313c55 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -41,9 +41,9 @@
 
 #include "access/transam.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "common/string.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/slot.h"
 #include "storage/fd.h"
 #include "storage/proc.h"
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index af5ad5fe66..957aea0a7d 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -75,8 +75,8 @@
 #include <unistd.h>
 
 #include "access/xact.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/syncrep.h"
 #include "replication/walsender.h"
 #include "replication/walsender_private.h"
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 6f4b3538ac..0d65ed8f2a 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -50,6 +50,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
@@ -57,7 +58,6 @@
 #include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "storage/ipc.h"
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 2683385ca6..bfe18c860b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -56,6 +56,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
 
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "commands/dbcommands.h"
@@ -65,7 +66,6 @@
 #include "libpq/pqformat.h"
 #include "miscadmin.h"
 #include "nodes/replnodes.h"
-#include "pgstat.h"
 #include "replication/basebackup.h"
 #include "replication/decode.h"
 #include "replication/logical.h"
diff --git a/src/backend/statmon/Makefile b/src/backend/statmon/Makefile
new file mode 100644
index 0000000000..64a04878e3
--- /dev/null
+++ b/src/backend/statmon/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for src/backend/statmon
+#
+# IDENTIFICATION
+#    src/backend/statmon/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/statmon
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = pgstat.o bestatus.o
+
+include $(top_srcdir)/src/backend/common.mk
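
Note on the next file: bestatus.c keeps each backend's status in a shared array and guards every entry with the st_changecount protocol. A writer bumps the counter before and after its update; a reader copies the entry and retries if the counter changed during the copy or was odd. The following is only a minimal single-process sketch of that idea, not code from the patch. DemoStatus, demo_write and demo_read are hypothetical names, and the sketch leaves out the memory barriers a real multi-process implementation needs.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, simplified stand-in for one PgBackendStatus slot. */
typedef struct DemoStatus
{
    int  changecount;   /* even: entry is stable; odd: write in progress */
    int  procpid;
    char appname[16];
} DemoStatus;

/* Writer side: bump the counter before and after modifying the entry. */
static void
demo_write(volatile DemoStatus *entry, int pid, const char *appname)
{
    assert(appname != NULL);

    entry->changecount++;       /* now odd: readers must not trust a copy */
    entry->procpid = pid;
    strncpy((char *) entry->appname, appname, sizeof(entry->appname) - 1);
    entry->appname[sizeof(entry->appname) - 1] = '\0';
    entry->changecount++;       /* even again: entry is consistent */
}

/*
 * Reader side: copy the entry, then accept the copy only if the counter
 * was even and did not change while copying.  Returns 1 on success, or 0
 * after max_tries failed attempts (bounded here so the sketch cannot spin
 * forever).
 */
static int
demo_read(const volatile DemoStatus *entry, DemoStatus *copy, int max_tries)
{
    int         i;

    for (i = 0; i < max_tries; i++)
    {
        int         before = entry->changecount;
        int         after;

        memcpy(copy, (const void *) entry, sizeof(*copy));
        after = entry->changecount;

        if (before == after && (before & 1) == 0)
            return 1;
    }
    return 0;
}
```

A caller would run demo_write() in the process that owns the slot and demo_read() in any process that wants a snapshot. The check for an odd counter matters because a reader can finish a whole copy while the writer sits between its two increments.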
diff --git a/src/backend/statmon/bestatus.c b/src/backend/statmon/bestatus.c
new file mode 100644
index 0000000000..1ea4f80a58
--- /dev/null
+++ b/src/backend/statmon/bestatus.c
@@ -0,0 +1,1756 @@
+/* ----------
+ * bestatus.c
+ *
+ *    Backend status monitor
+ *
+ *    Status data is stored in shared memory. Every backend updates and reads
+ *    it individually.
+ *
+ *    Copyright (c) 2001-2018, PostgreSQL Global Development Group
+ *
+ *    src/backend/statmon/bestatus.c
+ * ----------
+ */
+#include "postgres.h"
+
+#include "bestatus.h"
+
+#include "access/xact.h"
+#include "libpq/libpq.h"
+#include "miscadmin.h"
+#include "postmaster/autovacuum.h"
+#include "replication/walsender.h"
+#include "storage/ipc.h"
+#include "storage/lmgr.h"
+#include "storage/sinvaladt.h"
+#include "utils/ascii.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/probes.h"
+
+
+/* Status for backends including auxiliary */
+static LocalPgBackendStatus *localBackendStatusTable = NULL;
+
+/* Total number of backends including auxiliary */
+static int    localNumBackends = 0;
+
+/* ----------
+ * Total number of backends including auxiliary
+ *
+ * We reserve a slot for each possible BackendId, plus one for each
+ * possible auxiliary process type.  (This scheme assumes there is not
+ * more than one of any auxiliary process type at a time.) MaxBackends
+ * includes autovacuum workers and background workers as well.
+ * ----------
+ */
+#define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
+
+
+/* ----------
+ * GUC parameters
+ * ----------
+ */
+bool        pgstat_track_activities = false;
+int            pgstat_track_activity_query_size = 1024;
+
+static MemoryContext pgBeStatLocalContext = NULL;
+
+/* ------------------------------------------------------------
+ * Functions for management of the shared-memory PgBackendStatus array
+ * ------------------------------------------------------------
+ */
+
+static PgBackendStatus *BackendStatusArray = NULL;
+static PgBackendStatus *MyBEEntry = NULL;
+static char *BackendAppnameBuffer = NULL;
+static char *BackendClientHostnameBuffer = NULL;
+static char *BackendActivityBuffer = NULL;
+static Size BackendActivityBufferSize = 0;
+#ifdef USE_SSL
+static PgBackendSSLStatus *BackendSslStatusBuffer = NULL;
+#endif
+
+static const char *pgstat_get_wait_activity(WaitEventActivity w);
+static const char *pgstat_get_wait_client(WaitEventClient w);
+static const char *pgstat_get_wait_ipc(WaitEventIPC w);
+static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
+static const char *pgstat_get_wait_io(WaitEventIO w);
+static void pgstat_setup_memcxt(void);
+static void pgstat_beshutdown_hook(int code, Datum arg);
+/*
+ * Report shared-memory space needed by CreateSharedBackendStatus.
+ */
+Size
+BackendStatusShmemSize(void)
+{
+    Size        size;
+
+    /* BackendStatusArray: */
+    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
+    /* BackendAppnameBuffer: */
+    size = add_size(size,
+                    mul_size(NAMEDATALEN, NumBackendStatSlots));
+    /* BackendClientHostnameBuffer: */
+    size = add_size(size,
+                    mul_size(NAMEDATALEN, NumBackendStatSlots));
+    /* BackendActivityBuffer: */
+    size = add_size(size,
+                    mul_size(pgstat_track_activity_query_size, NumBackendStatSlots));
+#ifdef USE_SSL
+    /* BackendSslStatusBuffer: */
+    size = add_size(size,
+                    mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots));
+#endif
+    return size;
+}
+
+/*
+ * Initialize the shared status array and several string buffers
+ * during postmaster startup.
+ */
+void
+CreateSharedBackendStatus(void)
+{
+    Size        size;
+    bool        found;
+    int            i;
+    char       *buffer;
+
+    /* Create or attach to the shared array */
+    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
+    BackendStatusArray = (PgBackendStatus *)
+        ShmemInitStruct("Backend Status Array", size, &found);
+
+    if (!found)
+    {
+        /*
+         * We're the first - initialize.
+         */
+        MemSet(BackendStatusArray, 0, size);
+    }
+
+    /* Create or attach to the shared appname buffer */
+    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
+    BackendAppnameBuffer = (char *)
+        ShmemInitStruct("Backend Application Name Buffer", size, &found);
+
+    if (!found)
+    {
+        MemSet(BackendAppnameBuffer, 0, size);
+
+        /* Initialize st_appname pointers. */
+        buffer = BackendAppnameBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_appname = buffer;
+            buffer += NAMEDATALEN;
+        }
+    }
+
+    /* Create or attach to the shared client hostname buffer */
+    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
+    BackendClientHostnameBuffer = (char *)
+        ShmemInitStruct("Backend Client Host Name Buffer", size, &found);
+
+    if (!found)
+    {
+        MemSet(BackendClientHostnameBuffer, 0, size);
+
+        /* Initialize st_clienthostname pointers. */
+        buffer = BackendClientHostnameBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_clienthostname = buffer;
+            buffer += NAMEDATALEN;
+        }
+    }
+
+    /* Create or attach to the shared activity buffer */
+    BackendActivityBufferSize = mul_size(pgstat_track_activity_query_size,
+                                         NumBackendStatSlots);
+    BackendActivityBuffer = (char *)
+        ShmemInitStruct("Backend Activity Buffer",
+                        BackendActivityBufferSize,
+                        &found);
+
+    if (!found)
+    {
+        MemSet(BackendActivityBuffer, 0, BackendActivityBufferSize);
+
+        /* Initialize st_activity pointers. */
+        buffer = BackendActivityBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_activity_raw = buffer;
+            buffer += pgstat_track_activity_query_size;
+        }
+    }
+
+#ifdef USE_SSL
+    /* Create or attach to the shared SSL status buffer */
+    size = mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots);
+    BackendSslStatusBuffer = (PgBackendSSLStatus *)
+        ShmemInitStruct("Backend SSL Status Buffer", size, &found);
+
+    if (!found)
+    {
+        PgBackendSSLStatus *ptr;
+
+        MemSet(BackendSslStatusBuffer, 0, size);
+
+        /* Initialize st_sslstatus pointers. */
+        ptr = BackendSslStatusBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_sslstatus = ptr;
+            ptr++;
+        }
+    }
+#endif
+}
+
+/* ----------
+ * pgstat_bearray_initialize() -
+ *
+ *    Initialize pgstats state, and set up our on-proc-exit hook.
+ *    Called from InitPostgres and AuxiliaryProcessMain. For auxiliary process,
+ *    MyBackendId is invalid. Otherwise, MyBackendId must be set,
+ *    but we must not have started any transaction yet (since the
+ *    exit hook must run after the last transaction exit).
+ *    NOTE: MyDatabaseId isn't set yet; so the shutdown hook has to be careful.
+ * ----------
+ */
+void
+pgstat_bearray_initialize(void)
+{
+    /* Initialize MyBEEntry */
+    if (MyBackendId != InvalidBackendId)
+    {
+        Assert(MyBackendId >= 1 && MyBackendId <= MaxBackends);
+        MyBEEntry = &BackendStatusArray[MyBackendId - 1];
+    }
+    else
+    {
+        /* Must be an auxiliary process */
+        Assert(MyAuxProcType != NotAnAuxProcess);
+
+        /*
+         * Assign the MyBEEntry for an auxiliary process.  Since it doesn't
+         * have a BackendId, the slot is statically allocated based on the
+         * auxiliary process type (MyAuxProcType).  Backends use slots
+         * numbered from 1 to MaxBackends (inclusive), so an auxiliary
+         * process gets slot number MaxBackends + MyAuxProcType + 1, that
+         * is, array element MaxBackends + MyAuxProcType.
+         */
+        MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
+    }
+
+    /* Set up a process-exit hook to clean up */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
+}
+
+/*
+ * Shut down a single backend's statistics reporting at process exit.
+ *
+ * This hook does not flush statistics counts; it only touches the
+ * status array, and is registered with before_shmem_exit so that shared
+ * memory is still attached when it runs.
+ *
+ * Clear out our entry in the PgBackendStatus array.
+ */
+static void
+pgstat_beshutdown_hook(int code, Datum arg)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    /*
+     * Clear my status entry, following the protocol of bumping st_changecount
+     * before and after.  We use a volatile pointer here to ensure the
+     * compiler doesn't try to get cute.
+     */
+    pgstat_increment_changecount_before(beentry);
+
+    beentry->st_procpid = 0;    /* mark invalid */
+
+    pgstat_increment_changecount_after(beentry);
+}
+
+
+/* ----------
+ * pgstat_bestart() -
+ *
+ *    Initialize this backend's entry in the PgBackendStatus array.
+ *    Called from InitPostgres.
+ *
+ *    Apart from auxiliary processes, MyBackendId, MyDatabaseId,
+ *    session userid, and application_name must be set for a
+ *    backend (hence, this cannot be combined with pgstat_bearray_initialize).
+ * ----------
+ */
+void
+pgstat_bestart(void)
+{
+    SockAddr    clientaddr;
+    volatile PgBackendStatus *beentry;
+
+    /*
+     * To minimize the time spent modifying the PgBackendStatus entry, fetch
+     * all the needed data first.
+     */
+
+    /*
+     * We may not have a MyProcPort (eg, if this is the autovacuum process).
+     * If so, use all-zeroes client address, which is dealt with specially in
+     * pg_stat_get_backend_client_addr and pg_stat_get_backend_client_port.
+     */
+    if (MyProcPort)
+        memcpy(&clientaddr, &MyProcPort->raddr, sizeof(clientaddr));
+    else
+        MemSet(&clientaddr, 0, sizeof(clientaddr));
+
+    /*
+     * Initialize my status entry, following the protocol of bumping
+     * st_changecount before and after; and make sure it's even afterwards. We
+     * use a volatile pointer here to ensure the compiler doesn't try to get
+     * cute.
+     */
+    beentry = MyBEEntry;
+
+    /* MyBEEntry must have been set up by pgstat_bearray_initialize() */
+    Assert(beentry != NULL);
+
+    if (MyBackendId != InvalidBackendId)
+    {
+        if (IsAutoVacuumLauncherProcess())
+        {
+            /* Autovacuum Launcher */
+            beentry->st_backendType = B_AUTOVAC_LAUNCHER;
+        }
+        else if (IsAutoVacuumWorkerProcess())
+        {
+            /* Autovacuum Worker */
+            beentry->st_backendType = B_AUTOVAC_WORKER;
+        }
+        else if (am_walsender)
+        {
+            /* Wal sender */
+            beentry->st_backendType = B_WAL_SENDER;
+        }
+        else if (IsBackgroundWorker)
+        {
+            /* bgworker */
+            beentry->st_backendType = B_BG_WORKER;
+        }
+        else
+        {
+            /* client-backend */
+            beentry->st_backendType = B_BACKEND;
+        }
+    }
+    else
+    {
+        /* Must be an auxiliary process */
+        Assert(MyAuxProcType != NotAnAuxProcess);
+        switch (MyAuxProcType)
+        {
+            case StartupProcess:
+                beentry->st_backendType = B_STARTUP;
+                break;
+            case BgWriterProcess:
+                beentry->st_backendType = B_BG_WRITER;
+                break;
+            case CheckpointerProcess:
+                beentry->st_backendType = B_CHECKPOINTER;
+                break;
+            case WalWriterProcess:
+                beentry->st_backendType = B_WAL_WRITER;
+                break;
+            case WalReceiverProcess:
+                beentry->st_backendType = B_WAL_RECEIVER;
+                break;
+            case ArchiverProcess:
+                beentry->st_backendType = B_ARCHIVER;
+                break;
+            default:
+                elog(FATAL, "unrecognized process type: %d",
+                     (int) MyAuxProcType);
+                proc_exit(1);
+        }
+    }
+
+    do
+    {
+        pgstat_increment_changecount_before(beentry);
+    } while ((beentry->st_changecount & 1) == 0);
+
+    beentry->st_procpid = MyProcPid;
+    beentry->st_proc_start_timestamp = MyStartTimestamp;
+    beentry->st_activity_start_timestamp = 0;
+    beentry->st_state_start_timestamp = 0;
+    beentry->st_xact_start_timestamp = 0;
+    beentry->st_databaseid = MyDatabaseId;
+
+    /* We have userid for client-backends, wal-sender and bgworker processes */
+    if (beentry->st_backendType == B_BACKEND
+        || beentry->st_backendType == B_WAL_SENDER
+        || beentry->st_backendType == B_BG_WORKER)
+        beentry->st_userid = GetSessionUserId();
+    else
+        beentry->st_userid = InvalidOid;
+
+    beentry->st_clientaddr = clientaddr;
+    if (MyProcPort && MyProcPort->remote_hostname)
+        strlcpy(beentry->st_clienthostname, MyProcPort->remote_hostname,
+                NAMEDATALEN);
+    else
+        beentry->st_clienthostname[0] = '\0';
+#ifdef USE_SSL
+    if (MyProcPort && MyProcPort->ssl != NULL)
+    {
+        beentry->st_ssl = true;
+        beentry->st_sslstatus->ssl_bits = be_tls_get_cipher_bits(MyProcPort);
+        beentry->st_sslstatus->ssl_compression = be_tls_get_compression(MyProcPort);
+        strlcpy(beentry->st_sslstatus->ssl_version, be_tls_get_version(MyProcPort), NAMEDATALEN);
+        strlcpy(beentry->st_sslstatus->ssl_cipher, be_tls_get_cipher(MyProcPort), NAMEDATALEN);
+        be_tls_get_peerdn_name(MyProcPort, beentry->st_sslstatus->ssl_clientdn, NAMEDATALEN);
+    }
+    else
+    {
+        beentry->st_ssl = false;
+    }
+#else
+    beentry->st_ssl = false;
+#endif
+    beentry->st_state = STATE_UNDEFINED;
+    beentry->st_appname[0] = '\0';
+    beentry->st_activity_raw[0] = '\0';
+    /* Also make sure the last byte in each string area is always 0 */
+    beentry->st_clienthostname[NAMEDATALEN - 1] = '\0';
+    beentry->st_appname[NAMEDATALEN - 1] = '\0';
+    beentry->st_activity_raw[pgstat_track_activity_query_size - 1] = '\0';
+    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
+    beentry->st_progress_command_target = InvalidOid;
+
+    /*
+     * we don't zero st_progress_param here to save cycles; nobody should
+     * examine it until st_progress_command has been set to something other
+     * than PROGRESS_COMMAND_INVALID
+     */
+
+    pgstat_increment_changecount_after(beentry);
+
+    /* Update app name to current GUC setting */
+    if (application_name)
+        pgstat_report_appname(application_name);
+}
+
+/* ----------
+ * pgstat_read_current_status() -
+ *
+ *    Copy the current contents of the PgBackendStatus array to local memory,
+ *    if not already done in this transaction.
+ * ----------
+ */
+static void
+pgstat_read_current_status(void)
+{
+    volatile PgBackendStatus *beentry;
+    LocalPgBackendStatus *localtable;
+    LocalPgBackendStatus *localentry;
+    char       *localappname,
+               *localclienthostname,
+               *localactivity;
+#ifdef USE_SSL
+    PgBackendSSLStatus *localsslstatus;
+#endif
+    int            i;
+
+    Assert(IsUnderPostmaster);
+
+    if (localBackendStatusTable)
+        return;                    /* already done */
+
+    pgstat_setup_memcxt();
+
+    localtable = (LocalPgBackendStatus *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           sizeof(LocalPgBackendStatus) * NumBackendStatSlots);
+    localappname = (char *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           NAMEDATALEN * NumBackendStatSlots);
+    localclienthostname = (char *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           NAMEDATALEN * NumBackendStatSlots);
+    localactivity = (char *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           pgstat_track_activity_query_size * NumBackendStatSlots);
+#ifdef USE_SSL
+    localsslstatus = (PgBackendSSLStatus *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           sizeof(PgBackendSSLStatus) * NumBackendStatSlots);
+#endif
+
+    localNumBackends = 0;
+
+    beentry = BackendStatusArray;
+    localentry = localtable;
+    for (i = 1; i <= NumBackendStatSlots; i++)
+    {
+        /*
+         * Follow the protocol of retrying if st_changecount changes while we
+         * copy the entry, or if it's odd.  (The check for odd is needed to
+         * cover the case where we are able to completely copy the entry while
+         * the source backend is between increment steps.)  We use a volatile
+         * pointer here to ensure the compiler doesn't try to get cute.
+         */
+        for (;;)
+        {
+            int            before_changecount;
+            int            after_changecount;
+
+            pgstat_save_changecount_before(beentry, before_changecount);
+
+            localentry->backendStatus.st_procpid = beentry->st_procpid;
+            if (localentry->backendStatus.st_procpid > 0)
+            {
+                memcpy(&localentry->backendStatus, (char *) beentry, sizeof(PgBackendStatus));
+
+                /*
+                 * strcpy is safe even if the string is modified concurrently,
+                 * because there's always a \0 at the end of the buffer.
+                 */
+                strcpy(localappname, (char *) beentry->st_appname);
+                localentry->backendStatus.st_appname = localappname;
+                strcpy(localclienthostname, (char *) beentry->st_clienthostname);
+                localentry->backendStatus.st_clienthostname = localclienthostname;
+                strcpy(localactivity, (char *) beentry->st_activity_raw);
+                localentry->backendStatus.st_activity_raw = localactivity;
+                localentry->backendStatus.st_ssl = beentry->st_ssl;
+#ifdef USE_SSL
+                if (beentry->st_ssl)
+                {
+                    memcpy(localsslstatus, beentry->st_sslstatus, sizeof(PgBackendSSLStatus));
+                    localentry->backendStatus.st_sslstatus = localsslstatus;
+                }
+#endif
+            }
+
+            pgstat_save_changecount_after(beentry, after_changecount);
+            if (before_changecount == after_changecount &&
+                (before_changecount & 1) == 0)
+                break;
+
+            /* Make sure we can break out of loop if stuck... */
+            CHECK_FOR_INTERRUPTS();
+        }
+
+        beentry++;
+        /* Only valid entries get included into the local array */
+        if (localentry->backendStatus.st_procpid > 0)
+        {
+            BackendIdGetTransactionIds(i,
+                                       &localentry->backend_xid,
+                                       &localentry->backend_xmin);
+
+            localentry++;
+            localappname += NAMEDATALEN;
+            localclienthostname += NAMEDATALEN;
+            localactivity += pgstat_track_activity_query_size;
+#ifdef USE_SSL
+            localsslstatus++;
+#endif
+            localNumBackends++;
+        }
+    }
+
+    /* Set the pointer only after completion of a valid table */
+    localBackendStatusTable = localtable;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_beentry() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    our local copy of the current-activity entry for one backend.
+ *
+ *    NB: the caller is responsible for checking whether the user is permitted
+ *    to see this info (especially the querystring).
+ * ----------
+ */
+PgBackendStatus *
+pgstat_fetch_stat_beentry(int beid)
+{
+    pgstat_read_current_status();
+
+    if (beid < 1 || beid > localNumBackends)
+        return NULL;
+
+    return &localBackendStatusTable[beid - 1].backendStatus;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_local_beentry() -
+ *
+ *    Like pgstat_fetch_stat_beentry() but with locally computed additions (like
+ *    xid and xmin values of the backend)
+ *
+ *    NB: the caller is responsible for checking whether the user is permitted
+ *    to see this info (especially the querystring).
+ * ----------
+ */
+LocalPgBackendStatus *
+pgstat_fetch_stat_local_beentry(int beid)
+{
+    pgstat_read_current_status();
+
+    if (beid < 1 || beid > localNumBackends)
+        return NULL;
+
+    return &localBackendStatusTable[beid - 1];
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_numbackends() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    the maximum current backend id.
+ * ----------
+ */
+int
+pgstat_fetch_stat_numbackends(void)
+{
+    pgstat_read_current_status();
+
+    return localNumBackends;
+}
+
+/* ----------
+ * pgstat_get_wait_event_type() -
+ *
+ *    Return a string representing the type of wait event the backend is
+ *    currently waiting on, or NULL if it is not waiting.
+ * ----------
+ */
+const char *
+pgstat_get_wait_event_type(uint32 wait_event_info)
+{
+    uint32        classId;
+    const char *event_type;
+
+    /* report process as not waiting. */
+    if (wait_event_info == 0)
+        return NULL;
+
+    classId = wait_event_info & 0xFF000000;
+
+    switch (classId)
+    {
+        case PG_WAIT_LWLOCK:
+            event_type = "LWLock";
+            break;
+        case PG_WAIT_LOCK:
+            event_type = "Lock";
+            break;
+        case PG_WAIT_BUFFER_PIN:
+            event_type = "BufferPin";
+            break;
+        case PG_WAIT_ACTIVITY:
+            event_type = "Activity";
+            break;
+        case PG_WAIT_CLIENT:
+            event_type = "Client";
+            break;
+        case PG_WAIT_EXTENSION:
+            event_type = "Extension";
+            break;
+        case PG_WAIT_IPC:
+            event_type = "IPC";
+            break;
+        case PG_WAIT_TIMEOUT:
+            event_type = "Timeout";
+            break;
+        case PG_WAIT_IO:
+            event_type = "IO";
+            break;
+        default:
+            event_type = "???";
+            break;
+    }
+
+    return event_type;
+}
+
+/* ----------
+ * pgstat_get_wait_event() -
+ *
+ *    Return a string representing the wait event the backend is currently
+ *    waiting on, or NULL if it is not waiting.
+ * ----------
+ */
+const char *
+pgstat_get_wait_event(uint32 wait_event_info)
+{
+    uint32        classId;
+    uint16        eventId;
+    const char *event_name;
+
+    /* report process as not waiting. */
+    if (wait_event_info == 0)
+        return NULL;
+
+    classId = wait_event_info & 0xFF000000;
+    eventId = wait_event_info & 0x0000FFFF;
+
+    switch (classId)
+    {
+        case PG_WAIT_LWLOCK:
+            event_name = GetLWLockIdentifier(classId, eventId);
+            break;
+        case PG_WAIT_LOCK:
+            event_name = GetLockNameFromTagType(eventId);
+            break;
+        case PG_WAIT_BUFFER_PIN:
+            event_name = "BufferPin";
+            break;
+        case PG_WAIT_ACTIVITY:
+            {
+                WaitEventActivity w = (WaitEventActivity) wait_event_info;
+
+                event_name = pgstat_get_wait_activity(w);
+                break;
+            }
+        case PG_WAIT_CLIENT:
+            {
+                WaitEventClient w = (WaitEventClient) wait_event_info;
+
+                event_name = pgstat_get_wait_client(w);
+                break;
+            }
+        case PG_WAIT_EXTENSION:
+            event_name = "Extension";
+            break;
+        case PG_WAIT_IPC:
+            {
+                WaitEventIPC w = (WaitEventIPC) wait_event_info;
+
+                event_name = pgstat_get_wait_ipc(w);
+                break;
+            }
+        case PG_WAIT_TIMEOUT:
+            {
+                WaitEventTimeout w = (WaitEventTimeout) wait_event_info;
+
+                event_name = pgstat_get_wait_timeout(w);
+                break;
+            }
+        case PG_WAIT_IO:
+            {
+                WaitEventIO w = (WaitEventIO) wait_event_info;
+
+                event_name = pgstat_get_wait_io(w);
+                break;
+            }
+        default:
+            event_name = "unknown wait event";
+            break;
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_activity() -
+ *
+ * Convert WaitEventActivity to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_activity(WaitEventActivity w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_ARCHIVER_MAIN:
+            event_name = "ArchiverMain";
+            break;
+        case WAIT_EVENT_AUTOVACUUM_MAIN:
+            event_name = "AutoVacuumMain";
+            break;
+        case WAIT_EVENT_BGWRITER_HIBERNATE:
+            event_name = "BgWriterHibernate";
+            break;
+        case WAIT_EVENT_BGWRITER_MAIN:
+            event_name = "BgWriterMain";
+            break;
+        case WAIT_EVENT_CHECKPOINTER_MAIN:
+            event_name = "CheckpointerMain";
+            break;
+        case WAIT_EVENT_LOGICAL_APPLY_MAIN:
+            event_name = "LogicalApplyMain";
+            break;
+        case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
+            event_name = "LogicalLauncherMain";
+            break;
+        case WAIT_EVENT_RECOVERY_WAL_ALL:
+            event_name = "RecoveryWalAll";
+            break;
+        case WAIT_EVENT_RECOVERY_WAL_STREAM:
+            event_name = "RecoveryWalStream";
+            break;
+        case WAIT_EVENT_SYSLOGGER_MAIN:
+            event_name = "SysLoggerMain";
+            break;
+        case WAIT_EVENT_WAL_RECEIVER_MAIN:
+            event_name = "WalReceiverMain";
+            break;
+        case WAIT_EVENT_WAL_SENDER_MAIN:
+            event_name = "WalSenderMain";
+            break;
+        case WAIT_EVENT_WAL_WRITER_MAIN:
+            event_name = "WalWriterMain";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_client() -
+ *
+ * Convert WaitEventClient to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_client(WaitEventClient w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_CLIENT_READ:
+            event_name = "ClientRead";
+            break;
+        case WAIT_EVENT_CLIENT_WRITE:
+            event_name = "ClientWrite";
+            break;
+        case WAIT_EVENT_LIBPQWALRECEIVER_CONNECT:
+            event_name = "LibPQWalReceiverConnect";
+            break;
+        case WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE:
+            event_name = "LibPQWalReceiverReceive";
+            break;
+        case WAIT_EVENT_SSL_OPEN_SERVER:
+            event_name = "SSLOpenServer";
+            break;
+        case WAIT_EVENT_WAL_RECEIVER_WAIT_START:
+            event_name = "WalReceiverWaitStart";
+            break;
+        case WAIT_EVENT_WAL_SENDER_WAIT_WAL:
+            event_name = "WalSenderWaitForWAL";
+            break;
+        case WAIT_EVENT_WAL_SENDER_WRITE_DATA:
+            event_name = "WalSenderWriteData";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_ipc() -
+ *
+ * Convert WaitEventIPC to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_ipc(WaitEventIPC w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_BGWORKER_SHUTDOWN:
+            event_name = "BgWorkerShutdown";
+            break;
+        case WAIT_EVENT_BGWORKER_STARTUP:
+            event_name = "BgWorkerStartup";
+            break;
+        case WAIT_EVENT_BTREE_PAGE:
+            event_name = "BtreePage";
+            break;
+        case WAIT_EVENT_CLOG_GROUP_UPDATE:
+            event_name = "ClogGroupUpdate";
+            break;
+        case WAIT_EVENT_EXECUTE_GATHER:
+            event_name = "ExecuteGather";
+            break;
+        case WAIT_EVENT_HASH_BATCH_ALLOCATING:
+            event_name = "Hash/Batch/Allocating";
+            break;
+        case WAIT_EVENT_HASH_BATCH_ELECTING:
+            event_name = "Hash/Batch/Electing";
+            break;
+        case WAIT_EVENT_HASH_BATCH_LOADING:
+            event_name = "Hash/Batch/Loading";
+            break;
+        case WAIT_EVENT_HASH_BUILD_ALLOCATING:
+            event_name = "Hash/Build/Allocating";
+            break;
+        case WAIT_EVENT_HASH_BUILD_ELECTING:
+            event_name = "Hash/Build/Electing";
+            break;
+        case WAIT_EVENT_HASH_BUILD_HASHING_INNER:
+            event_name = "Hash/Build/HashingInner";
+            break;
+        case WAIT_EVENT_HASH_BUILD_HASHING_OUTER:
+            event_name = "Hash/Build/HashingOuter";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING:
+            event_name = "Hash/GrowBatches/Allocating";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_DECIDING:
+            event_name = "Hash/GrowBatches/Deciding";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_ELECTING:
+            event_name = "Hash/GrowBatches/Electing";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_FINISHING:
+            event_name = "Hash/GrowBatches/Finishing";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING:
+            event_name = "Hash/GrowBatches/Repartitioning";
+            break;
+        case WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING:
+            event_name = "Hash/GrowBuckets/Allocating";
+            break;
+        case WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING:
+            event_name = "Hash/GrowBuckets/Electing";
+            break;
+        case WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING:
+            event_name = "Hash/GrowBuckets/Reinserting";
+            break;
+        case WAIT_EVENT_LOGICAL_SYNC_DATA:
+            event_name = "LogicalSyncData";
+            break;
+        case WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE:
+            event_name = "LogicalSyncStateChange";
+            break;
+        case WAIT_EVENT_MQ_INTERNAL:
+            event_name = "MessageQueueInternal";
+            break;
+        case WAIT_EVENT_MQ_PUT_MESSAGE:
+            event_name = "MessageQueuePutMessage";
+            break;
+        case WAIT_EVENT_MQ_RECEIVE:
+            event_name = "MessageQueueReceive";
+            break;
+        case WAIT_EVENT_MQ_SEND:
+            event_name = "MessageQueueSend";
+            break;
+        case WAIT_EVENT_PARALLEL_BITMAP_SCAN:
+            event_name = "ParallelBitmapScan";
+            break;
+        case WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN:
+            event_name = "ParallelCreateIndexScan";
+            break;
+        case WAIT_EVENT_PARALLEL_FINISH:
+            event_name = "ParallelFinish";
+            break;
+        case WAIT_EVENT_PROCARRAY_GROUP_UPDATE:
+            event_name = "ProcArrayGroupUpdate";
+            break;
+        case WAIT_EVENT_PROMOTE:
+            event_name = "Promote";
+            break;
+        case WAIT_EVENT_REPLICATION_ORIGIN_DROP:
+            event_name = "ReplicationOriginDrop";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_DROP:
+            event_name = "ReplicationSlotDrop";
+            break;
+        case WAIT_EVENT_SAFE_SNAPSHOT:
+            event_name = "SafeSnapshot";
+            break;
+        case WAIT_EVENT_SYNC_REP:
+            event_name = "SyncRep";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_timeout() -
+ *
+ * Convert WaitEventTimeout to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_timeout(WaitEventTimeout w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_BASE_BACKUP_THROTTLE:
+            event_name = "BaseBackupThrottle";
+            break;
+        case WAIT_EVENT_PG_SLEEP:
+            event_name = "PgSleep";
+            break;
+        case WAIT_EVENT_RECOVERY_APPLY_DELAY:
+            event_name = "RecoveryApplyDelay";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_io() -
+ *
+ * Convert WaitEventIO to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_io(WaitEventIO w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_BUFFILE_READ:
+            event_name = "BufFileRead";
+            break;
+        case WAIT_EVENT_BUFFILE_WRITE:
+            event_name = "BufFileWrite";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_READ:
+            event_name = "ControlFileRead";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_SYNC:
+            event_name = "ControlFileSync";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE:
+            event_name = "ControlFileSyncUpdate";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_WRITE:
+            event_name = "ControlFileWrite";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE:
+            event_name = "ControlFileWriteUpdate";
+            break;
+        case WAIT_EVENT_COPY_FILE_READ:
+            event_name = "CopyFileRead";
+            break;
+        case WAIT_EVENT_COPY_FILE_WRITE:
+            event_name = "CopyFileWrite";
+            break;
+        case WAIT_EVENT_DATA_FILE_EXTEND:
+            event_name = "DataFileExtend";
+            break;
+        case WAIT_EVENT_DATA_FILE_FLUSH:
+            event_name = "DataFileFlush";
+            break;
+        case WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC:
+            event_name = "DataFileImmediateSync";
+            break;
+        case WAIT_EVENT_DATA_FILE_PREFETCH:
+            event_name = "DataFilePrefetch";
+            break;
+        case WAIT_EVENT_DATA_FILE_READ:
+            event_name = "DataFileRead";
+            break;
+        case WAIT_EVENT_DATA_FILE_SYNC:
+            event_name = "DataFileSync";
+            break;
+        case WAIT_EVENT_DATA_FILE_TRUNCATE:
+            event_name = "DataFileTruncate";
+            break;
+        case WAIT_EVENT_DATA_FILE_WRITE:
+            event_name = "DataFileWrite";
+            break;
+        case WAIT_EVENT_DSM_FILL_ZERO_WRITE:
+            event_name = "DSMFillZeroWrite";
+            break;
+        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ:
+            event_name = "LockFileAddToDataDirRead";
+            break;
+        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC:
+            event_name = "LockFileAddToDataDirSync";
+            break;
+        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE:
+            event_name = "LockFileAddToDataDirWrite";
+            break;
+        case WAIT_EVENT_LOCK_FILE_CREATE_READ:
+            event_name = "LockFileCreateRead";
+            break;
+        case WAIT_EVENT_LOCK_FILE_CREATE_SYNC:
+            event_name = "LockFileCreateSync";
+            break;
+        case WAIT_EVENT_LOCK_FILE_CREATE_WRITE:
+            event_name = "LockFileCreateWrite";
+            break;
+        case WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ:
+            event_name = "LockFileReCheckDataDirRead";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC:
+            event_name = "LogicalRewriteCheckpointSync";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC:
+            event_name = "LogicalRewriteMappingSync";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE:
+            event_name = "LogicalRewriteMappingWrite";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_SYNC:
+            event_name = "LogicalRewriteSync";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE:
+            event_name = "LogicalRewriteTruncate";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_WRITE:
+            event_name = "LogicalRewriteWrite";
+            break;
+        case WAIT_EVENT_RELATION_MAP_READ:
+            event_name = "RelationMapRead";
+            break;
+        case WAIT_EVENT_RELATION_MAP_SYNC:
+            event_name = "RelationMapSync";
+            break;
+        case WAIT_EVENT_RELATION_MAP_WRITE:
+            event_name = "RelationMapWrite";
+            break;
+        case WAIT_EVENT_REORDER_BUFFER_READ:
+            event_name = "ReorderBufferRead";
+            break;
+        case WAIT_EVENT_REORDER_BUFFER_WRITE:
+            event_name = "ReorderBufferWrite";
+            break;
+        case WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ:
+            event_name = "ReorderLogicalMappingRead";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_READ:
+            event_name = "ReplicationSlotRead";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC:
+            event_name = "ReplicationSlotRestoreSync";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_SYNC:
+            event_name = "ReplicationSlotSync";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_WRITE:
+            event_name = "ReplicationSlotWrite";
+            break;
+        case WAIT_EVENT_SLRU_FLUSH_SYNC:
+            event_name = "SLRUFlushSync";
+            break;
+        case WAIT_EVENT_SLRU_READ:
+            event_name = "SLRURead";
+            break;
+        case WAIT_EVENT_SLRU_SYNC:
+            event_name = "SLRUSync";
+            break;
+        case WAIT_EVENT_SLRU_WRITE:
+            event_name = "SLRUWrite";
+            break;
+        case WAIT_EVENT_SNAPBUILD_READ:
+            event_name = "SnapbuildRead";
+            break;
+        case WAIT_EVENT_SNAPBUILD_SYNC:
+            event_name = "SnapbuildSync";
+            break;
+        case WAIT_EVENT_SNAPBUILD_WRITE:
+            event_name = "SnapbuildWrite";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC:
+            event_name = "TimelineHistoryFileSync";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE:
+            event_name = "TimelineHistoryFileWrite";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_READ:
+            event_name = "TimelineHistoryRead";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_SYNC:
+            event_name = "TimelineHistorySync";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_WRITE:
+            event_name = "TimelineHistoryWrite";
+            break;
+        case WAIT_EVENT_TWOPHASE_FILE_READ:
+            event_name = "TwophaseFileRead";
+            break;
+        case WAIT_EVENT_TWOPHASE_FILE_SYNC:
+            event_name = "TwophaseFileSync";
+            break;
+        case WAIT_EVENT_TWOPHASE_FILE_WRITE:
+            event_name = "TwophaseFileWrite";
+            break;
+        case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ:
+            event_name = "WALSenderTimelineHistoryRead";
+            break;
+        case WAIT_EVENT_WAL_BOOTSTRAP_SYNC:
+            event_name = "WALBootstrapSync";
+            break;
+        case WAIT_EVENT_WAL_BOOTSTRAP_WRITE:
+            event_name = "WALBootstrapWrite";
+            break;
+        case WAIT_EVENT_WAL_COPY_READ:
+            event_name = "WALCopyRead";
+            break;
+        case WAIT_EVENT_WAL_COPY_SYNC:
+            event_name = "WALCopySync";
+            break;
+        case WAIT_EVENT_WAL_COPY_WRITE:
+            event_name = "WALCopyWrite";
+            break;
+        case WAIT_EVENT_WAL_INIT_SYNC:
+            event_name = "WALInitSync";
+            break;
+        case WAIT_EVENT_WAL_INIT_WRITE:
+            event_name = "WALInitWrite";
+            break;
+        case WAIT_EVENT_WAL_READ:
+            event_name = "WALRead";
+            break;
+        case WAIT_EVENT_WAL_SYNC:
+            event_name = "WALSync";
+            break;
+        case WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN:
+            event_name = "WALSyncMethodAssign";
+            break;
+        case WAIT_EVENT_WAL_WRITE:
+            event_name = "WALWrite";
+            break;
+
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+
+/* ----------
+ * pgstat_get_backend_current_activity() -
+ *
+ *    Return a string representing the current activity of the backend with
+ *    the specified PID.  This looks directly at the BackendStatusArray,
+ *    and so will provide current information regardless of the age of our
+ *    transaction's snapshot of the status array.
+ *
+ *    It is the caller's responsibility to invoke this only for backends whose
+ *    state is expected to remain stable while the result is in use.  The
+ *    only current use is in deadlock reporting, where we can expect that
+ *    the target backend is blocked on a lock.  (There are corner cases
+ *    where the target's wait could get aborted while we are looking at it,
+ *    but the very worst consequence is to return a pointer to a string
+ *    that's been changed, so we won't worry too much.)
+ *
+ *    Note: return strings for special cases match pg_stat_get_backend_activity.
+ * ----------
+ */
+const char *
+pgstat_get_backend_current_activity(int pid, bool checkUser)
+{
+    PgBackendStatus *beentry;
+    int            i;
+
+    beentry = BackendStatusArray;
+    for (i = 1; i <= MaxBackends; i++)
+    {
+        /*
+         * Although we expect the target backend's entry to be stable, that
+         * doesn't imply that anyone else's is.  To avoid identifying the
+         * wrong backend, while we check for a match to the desired PID we
+         * must follow the protocol of retrying if st_changecount changes
+         * while we examine the entry, or if it's odd.  (This might be
+         * unnecessary, since fetching or storing an int is almost certainly
+         * atomic, but let's play it safe.)  We use a volatile pointer here to
+         * ensure the compiler doesn't try to get cute.
+         */
+        volatile PgBackendStatus *vbeentry = beentry;
+        bool        found;
+
+        for (;;)
+        {
+            int            before_changecount;
+            int            after_changecount;
+
+            pgstat_save_changecount_before(vbeentry, before_changecount);
+
+            found = (vbeentry->st_procpid == pid);
+
+            pgstat_save_changecount_after(vbeentry, after_changecount);
+
+            if (before_changecount == after_changecount &&
+                (before_changecount & 1) == 0)
+                break;
+
+            /* Make sure we can break out of loop if stuck... */
+            CHECK_FOR_INTERRUPTS();
+        }
+
+        if (found)
+        {
+            /* Now it is safe to use the non-volatile pointer */
+            if (checkUser && !superuser() && beentry->st_userid != GetUserId())
+                return "<insufficient privilege>";
+            else if (*(beentry->st_activity_raw) == '\0')
+                return "<command string not enabled>";
+            else
+            {
+                /* this'll leak a bit of memory, but that seems acceptable */
+                return pgstat_clip_activity(beentry->st_activity_raw);
+            }
+        }
+
+        beentry++;
+    }
+
+    /* If we get here, caller is in error ... */
+    return "<backend information not available>";
+}
+
+/* ----------
+ * pgstat_get_crashed_backend_activity() -
+ *
+ *    Return a string representing the current activity of the backend with
+ *    the specified PID.  Like the function above, but reads shared memory with
+ *    the expectation that it may be corrupt.  On success, copy the string
+ *    into the "buffer" argument and return that pointer.  On failure,
+ *    return NULL.
+ *
+ *    This function is only intended to be used by the postmaster to report the
+ *    query that crashed a backend.  In particular, no attempt is made to
+ *    follow the correct concurrency protocol when accessing the
+ *    BackendStatusArray.  But that's OK, in the worst case we'll return a
+ *    corrupted message.  We also must take care not to trip on ereport(ERROR).
+ * ----------
+ */
+const char *
+pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
+{
+    volatile PgBackendStatus *beentry;
+    int            i;
+
+    beentry = BackendStatusArray;
+
+    /*
+     * We probably shouldn't get here before shared memory has been set up,
+     * but be safe.
+     */
+    if (beentry == NULL || BackendActivityBuffer == NULL)
+        return NULL;
+
+    for (i = 1; i <= MaxBackends; i++)
+    {
+        if (beentry->st_procpid == pid)
+        {
+            /* Read pointer just once, so it can't change after validation */
+            const char *activity = beentry->st_activity_raw;
+            const char *activity_last;
+
+            /*
+             * We mustn't access activity string before we verify that it
+             * falls within the BackendActivityBuffer. To make sure that the
+             * entire string including its ending is contained within the
+             * buffer, subtract one activity length from the buffer size.
+             */
+            activity_last = BackendActivityBuffer + BackendActivityBufferSize
+                - pgstat_track_activity_query_size;
+
+            if (activity < BackendActivityBuffer ||
+                activity > activity_last)
+                return NULL;
+
+            /* If no string available, no point in a report */
+            if (activity[0] == '\0')
+                return NULL;
+
+            /*
+             * Copy only ASCII-safe characters so we don't run into encoding
+             * problems when reporting the message; and be sure not to run off
+             * the end of memory.  As only ASCII characters are reported, it
+             * doesn't seem necessary to perform multibyte aware clipping.
+             */
+            ascii_safe_strlcpy(buffer, activity,
+                               Min(buflen, pgstat_track_activity_query_size));
+
+            return buffer;
+        }
+
+        beentry++;
+    }
+
+    /* PID not found */
+    return NULL;
+}
+
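+/* ----------
+ * pgstat_get_backend_desc() -
+ *
+ *    Return a human-readable description of the given backend type.
+ * ----------
+ */
+const char *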
+pgstat_get_backend_desc(BackendType backendType)
+{
+    const char *backendDesc = "unknown process type";
+
+    switch (backendType)
+    {
+        case B_AUTOVAC_LAUNCHER:
+            backendDesc = "autovacuum launcher";
+            break;
+        case B_AUTOVAC_WORKER:
+            backendDesc = "autovacuum worker";
+            break;
+        case B_BACKEND:
+            backendDesc = "client backend";
+            break;
+        case B_BG_WORKER:
+            backendDesc = "background worker";
+            break;
+        case B_BG_WRITER:
+            backendDesc = "background writer";
+            break;
+        case B_CHECKPOINTER:
+            backendDesc = "checkpointer";
+            break;
+        case B_STARTUP:
+            backendDesc = "startup";
+            break;
+        case B_WAL_RECEIVER:
+            backendDesc = "walreceiver";
+            break;
+        case B_WAL_SENDER:
+            backendDesc = "walsender";
+            break;
+        case B_WAL_WRITER:
+            backendDesc = "walwriter";
+            break;
+        case B_ARCHIVER:
+            backendDesc = "archiver";
+            break;
+    }
+
+    return backendDesc;
+}
+
+/* ----------
+ * pgstat_report_appname() -
+ *
+ *    Called to update our application name.
+ * ----------
+ */
+void
+pgstat_report_appname(const char *appname)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+    int            len;
+
+    if (!beentry)
+        return;
+
+    /* This should be unnecessary if GUC did its job, but be safe */
+    len = pg_mbcliplen(appname, strlen(appname), NAMEDATALEN - 1);
+
+    /*
+     * Update my status entry, following the protocol of bumping
+     * st_changecount before and after.  We use a volatile pointer here to
+     * ensure the compiler doesn't try to get cute.
+     */
+    pgstat_increment_changecount_before(beentry);
+
+    memcpy((char *) beentry->st_appname, appname, len);
+    beentry->st_appname[len] = '\0';
+
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*
+ * Report current transaction start timestamp as the specified value.
+ * Zero means there is no active transaction.
+ */
+void
+pgstat_report_xact_timestamp(TimestampTz tstamp)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    if (!pgstat_track_activities || !beentry)
+        return;
+
+    /*
+     * Update my status entry, following the protocol of bumping
+     * st_changecount before and after.  We use a volatile pointer here to
+     * ensure the compiler doesn't try to get cute.
+     */
+    pgstat_increment_changecount_before(beentry);
+    beentry->st_xact_start_timestamp = tstamp;
+    pgstat_increment_changecount_after(beentry);
+}
+
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ *    Create pgBeStatLocalContext, if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+    if (!pgBeStatLocalContext)
+        pgBeStatLocalContext = AllocSetContextCreate(TopMemoryContext,
+                                                     "Backend status snapshot",
+                                                     ALLOCSET_SMALL_SIZES);
+}
+
+/* ----------
+ * pgstat_bestatus_clear_snapshot() -
+ *
+ *    Discard any data collected in the current transaction.  Any subsequent
+ *    request will cause new snapshots to be read.
+ *
+ *    This is also invoked during transaction commit or abort to discard
+ *    the no-longer-wanted snapshot.
+ * ----------
+ */
+void
+pgstat_bestatus_clear_snapshot(void)
+{
+    /* Release memory, if any was allocated */
+    if (pgBeStatLocalContext)
+        MemoryContextDelete(pgBeStatLocalContext);
+
+    /* Reset variables */
+    pgBeStatLocalContext = NULL;
+    localBackendStatusTable = NULL;
+    localNumBackends = 0;
+}
+
+
+
+/* ----------
+ * pgstat_report_activity() -
+ *
+ *    Called from tcop/postgres.c to report what the backend is actually doing
+ *    (but note cmd_str can be NULL for certain cases).
+ *
+ * All updates of the status entry follow the protocol of bumping
+ * st_changecount before and after.  We use a volatile pointer here to
+ * ensure the compiler doesn't try to get cute.
+ * ----------
+ */
+void
+pgstat_report_activity(BackendState state, const char *cmd_str)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+    TimestampTz start_timestamp;
+    TimestampTz current_timestamp;
+    int            len = 0;
+
+    TRACE_POSTGRESQL_STATEMENT_STATUS(cmd_str);
+
+    if (!beentry)
+        return;
+
+    if (!pgstat_track_activities)
+    {
+        if (beentry->st_state != STATE_DISABLED)
+        {
+            volatile PGPROC *proc = MyProc;
+
+            /*
+             * track_activities is disabled, but we last reported a
+             * non-disabled state.  As our final update, change the state and
+             * clear fields we will not be updating anymore.
+             */
+            pgstat_increment_changecount_before(beentry);
+            beentry->st_state = STATE_DISABLED;
+            beentry->st_state_start_timestamp = 0;
+            beentry->st_activity_raw[0] = '\0';
+            beentry->st_activity_start_timestamp = 0;
+            /* st_xact_start_timestamp and wait_event_info are also disabled */
+            beentry->st_xact_start_timestamp = 0;
+            proc->wait_event_info = 0;
+            pgstat_increment_changecount_after(beentry);
+        }
+        return;
+    }
+
+    /*
+     * To minimize the time spent modifying the entry, fetch all the needed
+     * data first.
+     */
+    start_timestamp = GetCurrentStatementStartTimestamp();
+    if (cmd_str != NULL)
+    {
+        /*
+         * Compute length of to-be-stored string unaware of multi-byte
+         * characters. For speed reasons that'll get corrected on read, rather
+         * than computed every write.
+         */
+        len = Min(strlen(cmd_str), pgstat_track_activity_query_size - 1);
+    }
+    current_timestamp = GetCurrentTimestamp();
+
+    /*
+     * Now update the status entry
+     */
+    pgstat_increment_changecount_before(beentry);
+
+    beentry->st_state = state;
+    beentry->st_state_start_timestamp = current_timestamp;
+
+    if (cmd_str != NULL)
+    {
+        memcpy((char *) beentry->st_activity_raw, cmd_str, len);
+        beentry->st_activity_raw[len] = '\0';
+        beentry->st_activity_start_timestamp = start_timestamp;
+    }
+
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * pgstat_progress_start_command() -
+ *
+ * Set st_progress_command (and st_progress_command_target) in own backend
+ * entry.  Also, zero-initialize st_progress_param array.
+ *-----------
+ */
+void
+pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    if (!beentry || !pgstat_track_activities)
+        return;
+
+    pgstat_increment_changecount_before(beentry);
+    beentry->st_progress_command = cmdtype;
+    beentry->st_progress_command_target = relid;
+    MemSet(&beentry->st_progress_param, 0, sizeof(beentry->st_progress_param));
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * pgstat_progress_update_param() -
+ *
+ * Update index'th member in st_progress_param[] of own backend entry.
+ *-----------
+ */
+void
+pgstat_progress_update_param(int index, int64 val)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    Assert(index >= 0 && index < PGSTAT_NUM_PROGRESS_PARAM);
+
+    if (!beentry || !pgstat_track_activities)
+        return;
+
+    pgstat_increment_changecount_before(beentry);
+    beentry->st_progress_param[index] = val;
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * pgstat_progress_update_multi_param() -
+ *
+ * Update multiple members in st_progress_param[] of own backend entry.
+ * This is atomic; readers won't see intermediate states.
+ *-----------
+ */
+void
+pgstat_progress_update_multi_param(int nparam, const int *index,
+                                   const int64 *val)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+    int            i;
+
+    if (!beentry || !pgstat_track_activities || nparam == 0)
+        return;
+
+    pgstat_increment_changecount_before(beentry);
+
+    for (i = 0; i < nparam; ++i)
+    {
+        Assert(index[i] >= 0 && index[i] < PGSTAT_NUM_PROGRESS_PARAM);
+
+        beentry->st_progress_param[index[i]] = val[i];
+    }
+
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * pgstat_progress_end_command() -
+ *
+ * Reset st_progress_command (and st_progress_command_target) in own backend
+ * entry.  This signals the end of the command.
+ *-----------
+ */
+void
+pgstat_progress_end_command(void)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    if (!beentry)
+        return;
+    if (!pgstat_track_activities
+        && beentry->st_progress_command == PROGRESS_COMMAND_INVALID)
+        return;
+
+    pgstat_increment_changecount_before(beentry);
+    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
+    beentry->st_progress_command_target = InvalidOid;
+    pgstat_increment_changecount_after(beentry);
+}
+
+
+/*
+ * Convert a potentially unsafely truncated activity string (see
+ * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
+ * one.
+ *
+ * The returned string is allocated in the caller's memory context and may be
+ * freed.
+ */
+char *
+pgstat_clip_activity(const char *raw_activity)
+{
+    char       *activity;
+    int            rawlen;
+    int            cliplen;
+
+    /*
+     * Some callers, like pgstat_get_backend_current_activity(), do not
+     * guarantee that the buffer isn't concurrently modified. We try to take
+     * care that the buffer is always terminated by a NUL byte regardless, but
+     * let's still be paranoid about the string's length. In those cases the
+     * underlying buffer is guaranteed to be pgstat_track_activity_query_size
+     * large.
+     */
+    activity = pnstrdup(raw_activity, pgstat_track_activity_query_size - 1);
+
+    /* now double-guaranteed to be NUL terminated */
+    rawlen = strlen(activity);
+
+    /*
+     * All supported server-encodings make it possible to determine the length
+     * of a multi-byte character from its first byte (this is not the case for
+     * client encodings, see GB18030). As st_activity is always stored using
+     * server encoding, this allows us to perform multi-byte aware truncation,
+     * even if the string earlier was truncated in the middle of a multi-byte
+     * character.
+     */
+    cliplen = pg_mbcliplen(activity, rawlen,
+                           pgstat_track_activity_query_size - 1);
+
+    activity[cliplen] = '\0';
+
+    return activity;
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/statmon/pgstat.c
similarity index 70%
rename from src/backend/postmaster/pgstat.c
rename to src/backend/statmon/pgstat.c
index 2d3f7cb898..4a1101c2b0 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/statmon/pgstat.c
@@ -8,7 +8,7 @@
  *
  *    Copyright (c) 2001-2018, PostgreSQL Global Development Group
  *
- *    src/backend/postmaster/pgstat.c
+ *    src/backend/statmon/pgstat.c
  * ----------
  */
 #include "postgres.h"
@@ -21,19 +21,14 @@
 #include "access/htup_details.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "libpq/libpq.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "postmaster/autovacuum.h"
-#include "replication/walsender.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/procsignal.h"
-#include "storage/sinvaladt.h"
-#include "utils/ascii.h"
-#include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
 
@@ -68,26 +63,12 @@ typedef enum
     PGSTAT_ENTRY_LOCK_FAILED
 } pg_stat_table_result_status;
 
-/* ----------
- * Total number of backends including auxiliary
- *
- * We reserve a slot for each possible BackendId, plus one for each
- * possible auxiliary process type.  (This scheme assumes there is not
- * more than one of any auxiliary process type at a time.) MaxBackends
- * includes autovacuum workers and background workers as well.
- * ----------
- */
-#define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
-
-
 /* ----------
  * GUC parameters
  * ----------
  */
-bool        pgstat_track_activities = false;
 bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
-int            pgstat_track_activity_query_size = 1024;
 
 /* Shared stats bootstrap infomation */
 typedef struct StatsShmemStruct {
@@ -125,6 +106,8 @@ static bool pgstat_pending_recoveryconflict = false;
 static bool pgstat_pending_deadlock = false;
 static bool pgstat_pending_tempfile = false;
 
+static MemoryContext pgStatLocalContext = NULL;
+
 /* dshash parameter for each type of table */
 static const dshash_parameters dsh_dbparams = {
     sizeof(Oid),
@@ -236,15 +219,8 @@ typedef struct
 /*
  * Info about current "snapshot" of stats file
  */
-static MemoryContext pgStatLocalContext = NULL;
 static HTAB *pgStatDBHash = NULL;
 
-/* Status for backends including auxiliary */
-static LocalPgBackendStatus *localBackendStatusTable = NULL;
-
-/* Total number of backends including auxiliary */
-static int    localNumBackends = 0;
-
 /*
  * Cluster wide statistics.
  * Contains statistics that are not collected per database or per table.
@@ -280,7 +256,6 @@ static void pgstat_read_db_statsfile(Oid databaseid, dshash_table *tabhash, dsha
 /* functions used in backends */
 static bool backend_snapshot_global_stats(void);
 static PgStat_StatFuncEntry *backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot);
-static void pgstat_read_current_status(void);
 
 static void pgstat_postmaster_shutdown(int code, Datum arg);
 static void pgstat_apply_pending_tabstats(bool shared, bool force,
@@ -307,12 +282,6 @@ static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
 
 static void pgstat_setup_memcxt(void);
 
-static const char *pgstat_get_wait_activity(WaitEventActivity w);
-static const char *pgstat_get_wait_client(WaitEventClient w);
-static const char *pgstat_get_wait_ipc(WaitEventIPC w);
-static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
-static const char *pgstat_get_wait_io(WaitEventIO w);
-
 static bool pgstat_update_tabentry(dshash_table *tabhash,
                                    PgStat_TableStatus *stat, bool nowait);
 static void pgstat_update_dbentry(PgStat_StatDBEntry *dbentry,
@@ -323,6 +292,14 @@ static void pgstat_update_dbentry(PgStat_StatDBEntry *dbentry,
  * ------------------------------------------------------------
  */
 
+
+void
+pgstat_initialize(void)
+{
+    /* Set up a process-exit hook to clean up */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
+}
+
 /*
  * subroutine for pgstat_reset_all
  */
@@ -484,7 +461,7 @@ pgstat_update_stat(bool force)
          */
         TimestampDifference(last_report, now, &secs, &usecs);
         elapsed = secs * 1000 + usecs /1000;
-        
+
         if(elapsed < PGSTAT_STAT_MIN_INTERVAL)
         {
             /* we know we have some statistics */
@@ -740,7 +717,7 @@ pgstat_apply_tabstat(pgstat_apply_tabstat_context *cxt,
             pgStatBlockReadTime = 0;
             pgStatBlockWriteTime = 0;
         }
-        
+
         cxt->tabhash =
             dshash_attach(area, &dsh_tblparams, cxt->dbentry->tables, 0);
     }
@@ -800,7 +777,7 @@ pgstat_merge_tabentry(PgStat_TableStatus *deststat,
         dest->t_blocks_hit += src->t_blocks_hit;
     }
 }
-        
+
 /*
  * pgstat_update_funcstats: subroutine for pgstat_update_stat
  *
@@ -920,7 +897,7 @@ pgstat_update_funcstats(bool force, PgStat_StatDBEntry *dbentry)
                 hash_search(pgStatPendingFunctions,
                             (void *) &(pendent->functionid), HASH_REMOVE, NULL);
             }
-        }    
+        }
 
         /* destroy the hsah if no entry remains */
         if (hash_get_num_entries(pgStatPendingFunctions) == 0)
@@ -1058,7 +1035,7 @@ pgstat_vacuum_stat(void)
     dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, NULL);
     if (!dbentry)
         return;
-    
+
     /*
      * Similarly to above, make a list of all known relations in this DB.
      */
@@ -2611,66 +2588,6 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     return funcentry;
 }
 
-
-/* ----------
- * pgstat_fetch_stat_beentry() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    our local copy of the current-activity entry for one backend.
- *
- *    NB: caller is responsible for a check if the user is permitted to see
- *    this info (especially the querystring).
- * ----------
- */
-PgBackendStatus *
-pgstat_fetch_stat_beentry(int beid)
-{
-    pgstat_read_current_status();
-
-    if (beid < 1 || beid > localNumBackends)
-        return NULL;
-
-    return &localBackendStatusTable[beid - 1].backendStatus;
-}
-
-
-/* ----------
- * pgstat_fetch_stat_local_beentry() -
- *
- *    Like pgstat_fetch_stat_beentry() but with locally computed additions (like
- *    xid and xmin values of the backend)
- *
- *    NB: caller is responsible for a check if the user is permitted to see
- *    this info (especially the querystring).
- * ----------
- */
-LocalPgBackendStatus *
-pgstat_fetch_stat_local_beentry(int beid)
-{
-    pgstat_read_current_status();
-
-    if (beid < 1 || beid > localNumBackends)
-        return NULL;
-
-    return &localBackendStatusTable[beid - 1];
-}
-
-
-/* ----------
- * pgstat_fetch_stat_numbackends() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the maximum current backend id.
- * ----------
- */
-int
-pgstat_fetch_stat_numbackends(void)
-{
-    pgstat_read_current_status();
-
-    return localNumBackends;
-}
-
 /*
  * ---------
  * pgstat_fetch_stat_archiver() -
@@ -2708,364 +2625,6 @@ pgstat_fetch_global(void)
     return snapshot_globalStats;
 }
 
-
-/* ------------------------------------------------------------
- * Functions for management of the shared-memory PgBackendStatus array
- * ------------------------------------------------------------
- */
-
-static PgBackendStatus *BackendStatusArray = NULL;
-static PgBackendStatus *MyBEEntry = NULL;
-static char *BackendAppnameBuffer = NULL;
-static char *BackendClientHostnameBuffer = NULL;
-static char *BackendActivityBuffer = NULL;
-static Size BackendActivityBufferSize = 0;
-#ifdef USE_SSL
-static PgBackendSSLStatus *BackendSslStatusBuffer = NULL;
-#endif
-
-
-/*
- * Report shared-memory space needed by CreateSharedBackendStatus.
- */
-Size
-BackendStatusShmemSize(void)
-{
-    Size        size;
-
-    /* BackendStatusArray: */
-    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
-    /* BackendAppnameBuffer: */
-    size = add_size(size,
-                    mul_size(NAMEDATALEN, NumBackendStatSlots));
-    /* BackendClientHostnameBuffer: */
-    size = add_size(size,
-                    mul_size(NAMEDATALEN, NumBackendStatSlots));
-    /* BackendActivityBuffer: */
-    size = add_size(size,
-                    mul_size(pgstat_track_activity_query_size, NumBackendStatSlots));
-#ifdef USE_SSL
-    /* BackendSslStatusBuffer: */
-    size = add_size(size,
-                    mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots));
-#endif
-    return size;
-}
-
-/*
- * Initialize the shared status array and several string buffers
- * during postmaster startup.
- */
-void
-CreateSharedBackendStatus(void)
-{
-    Size        size;
-    bool        found;
-    int            i;
-    char       *buffer;
-
-    /* Create or attach to the shared array */
-    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
-    BackendStatusArray = (PgBackendStatus *)
-        ShmemInitStruct("Backend Status Array", size, &found);
-
-    if (!found)
-    {
-        /*
-         * We're the first - initialize.
-         */
-        MemSet(BackendStatusArray, 0, size);
-    }
-
-    /* Create or attach to the shared appname buffer */
-    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
-    BackendAppnameBuffer = (char *)
-        ShmemInitStruct("Backend Application Name Buffer", size, &found);
-
-    if (!found)
-    {
-        MemSet(BackendAppnameBuffer, 0, size);
-
-        /* Initialize st_appname pointers. */
-        buffer = BackendAppnameBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_appname = buffer;
-            buffer += NAMEDATALEN;
-        }
-    }
-
-    /* Create or attach to the shared client hostname buffer */
-    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
-    BackendClientHostnameBuffer = (char *)
-        ShmemInitStruct("Backend Client Host Name Buffer", size, &found);
-
-    if (!found)
-    {
-        MemSet(BackendClientHostnameBuffer, 0, size);
-
-        /* Initialize st_clienthostname pointers. */
-        buffer = BackendClientHostnameBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_clienthostname = buffer;
-            buffer += NAMEDATALEN;
-        }
-    }
-
-    /* Create or attach to the shared activity buffer */
-    BackendActivityBufferSize = mul_size(pgstat_track_activity_query_size,
-                                         NumBackendStatSlots);
-    BackendActivityBuffer = (char *)
-        ShmemInitStruct("Backend Activity Buffer",
-                        BackendActivityBufferSize,
-                        &found);
-
-    if (!found)
-    {
-        MemSet(BackendActivityBuffer, 0, BackendActivityBufferSize);
-
-        /* Initialize st_activity pointers. */
-        buffer = BackendActivityBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_activity_raw = buffer;
-            buffer += pgstat_track_activity_query_size;
-        }
-    }
-
-#ifdef USE_SSL
-    /* Create or attach to the shared SSL status buffer */
-    size = mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots);
-    BackendSslStatusBuffer = (PgBackendSSLStatus *)
-        ShmemInitStruct("Backend SSL Status Buffer", size, &found);
-
-    if (!found)
-    {
-        PgBackendSSLStatus *ptr;
-
-        MemSet(BackendSslStatusBuffer, 0, size);
-
-        /* Initialize st_sslstatus pointers. */
-        ptr = BackendSslStatusBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_sslstatus = ptr;
-            ptr++;
-        }
-    }
-#endif
-}
-
-
-/* ----------
- * pgstat_initialize() -
- *
- *    Initialize pgstats state, and set up our on-proc-exit hook.
- *    Called from InitPostgres and AuxiliaryProcessMain. For auxiliary process,
- *    MyBackendId is invalid. Otherwise, MyBackendId must be set,
- *    but we must not have started any transaction yet (since the
- *    exit hook must run after the last transaction exit).
- *    NOTE: MyDatabaseId isn't set yet; so the shutdown hook has to be careful.
- * ----------
- */
-void
-pgstat_initialize(void)
-{
-    /* Initialize MyBEEntry */
-    if (MyBackendId != InvalidBackendId)
-    {
-        Assert(MyBackendId >= 1 && MyBackendId <= MaxBackends);
-        MyBEEntry = &BackendStatusArray[MyBackendId - 1];
-    }
-    else
-    {
-        /* Must be an auxiliary process */
-        Assert(MyAuxProcType != NotAnAuxProcess);
-
-        /*
-         * Assign the MyBEEntry for an auxiliary process.  Since it doesn't
-         * have a BackendId, the slot is statically allocated based on the
-         * auxiliary process type (MyAuxProcType).  Backends use slots indexed
-         * in the range from 1 to MaxBackends (inclusive), so we use
-         * MaxBackends + AuxBackendType + 1 as the index of the slot for an
-         * auxiliary process.
-         */
-        MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
-    }
-
-    /* Set up a process-exit hook to clean up */
-    before_shmem_exit(pgstat_beshutdown_hook, 0);
-}
-
-/* ----------
- * pgstat_bestart() -
- *
- *    Initialize this backend's entry in the PgBackendStatus array.
- *    Called from InitPostgres.
- *
- *    Apart from auxiliary processes, MyBackendId, MyDatabaseId,
- *    session userid, and application_name must be set for a
- *    backend (hence, this cannot be combined with pgstat_initialize).
- * ----------
- */
-void
-pgstat_bestart(void)
-{
-    SockAddr    clientaddr;
-    volatile PgBackendStatus *beentry;
-
-    /*
-     * To minimize the time spent modifying the PgBackendStatus entry, fetch
-     * all the needed data first.
-     */
-
-    /*
-     * We may not have a MyProcPort (eg, if this is the autovacuum process).
-     * If so, use all-zeroes client address, which is dealt with specially in
-     * pg_stat_get_backend_client_addr and pg_stat_get_backend_client_port.
-     */
-    if (MyProcPort)
-        memcpy(&clientaddr, &MyProcPort->raddr, sizeof(clientaddr));
-    else
-        MemSet(&clientaddr, 0, sizeof(clientaddr));
-
-    /*
-     * Initialize my status entry, following the protocol of bumping
-     * st_changecount before and after; and make sure it's even afterwards. We
-     * use a volatile pointer here to ensure the compiler doesn't try to get
-     * cute.
-     */
-    beentry = MyBEEntry;
-
-    /* pgstats state must be initialized from pgstat_initialize() */
-    Assert(beentry != NULL);
-
-    if (MyBackendId != InvalidBackendId)
-    {
-        if (IsAutoVacuumLauncherProcess())
-        {
-            /* Autovacuum Launcher */
-            beentry->st_backendType = B_AUTOVAC_LAUNCHER;
-        }
-        else if (IsAutoVacuumWorkerProcess())
-        {
-            /* Autovacuum Worker */
-            beentry->st_backendType = B_AUTOVAC_WORKER;
-        }
-        else if (am_walsender)
-        {
-            /* Wal sender */
-            beentry->st_backendType = B_WAL_SENDER;
-        }
-        else if (IsBackgroundWorker)
-        {
-            /* bgworker */
-            beentry->st_backendType = B_BG_WORKER;
-        }
-        else
-        {
-            /* client-backend */
-            beentry->st_backendType = B_BACKEND;
-        }
-    }
-    else
-    {
-        /* Must be an auxiliary process */
-        Assert(MyAuxProcType != NotAnAuxProcess);
-        switch (MyAuxProcType)
-        {
-            case StartupProcess:
-                beentry->st_backendType = B_STARTUP;
-                break;
-            case BgWriterProcess:
-                beentry->st_backendType = B_BG_WRITER;
-                break;
-            case ArchiverProcess:
-                beentry->st_backendType = B_ARCHIVER;
-                break;
-            case CheckpointerProcess:
-                beentry->st_backendType = B_CHECKPOINTER;
-                break;
-            case WalWriterProcess:
-                beentry->st_backendType = B_WAL_WRITER;
-                break;
-            case WalReceiverProcess:
-                beentry->st_backendType = B_WAL_RECEIVER;
-                break;
-            default:
-                elog(FATAL, "unrecognized process type: %d",
-                     (int) MyAuxProcType);
-                proc_exit(1);
-        }
-    }
-
-    do
-    {
-        pgstat_increment_changecount_before(beentry);
-    } while ((beentry->st_changecount & 1) == 0);
-
-    beentry->st_procpid = MyProcPid;
-    beentry->st_proc_start_timestamp = MyStartTimestamp;
-    beentry->st_activity_start_timestamp = 0;
-    beentry->st_state_start_timestamp = 0;
-    beentry->st_xact_start_timestamp = 0;
-    beentry->st_databaseid = MyDatabaseId;
-
-    /* We have userid for client-backends, wal-sender and bgworker processes */
-    if (beentry->st_backendType == B_BACKEND
-        || beentry->st_backendType == B_WAL_SENDER
-        || beentry->st_backendType == B_BG_WORKER)
-        beentry->st_userid = GetSessionUserId();
-    else
-        beentry->st_userid = InvalidOid;
-
-    beentry->st_clientaddr = clientaddr;
-    if (MyProcPort && MyProcPort->remote_hostname)
-        strlcpy(beentry->st_clienthostname, MyProcPort->remote_hostname,
-                NAMEDATALEN);
-    else
-        beentry->st_clienthostname[0] = '\0';
-#ifdef USE_SSL
-    if (MyProcPort && MyProcPort->ssl != NULL)
-    {
-        beentry->st_ssl = true;
-        beentry->st_sslstatus->ssl_bits = be_tls_get_cipher_bits(MyProcPort);
-        beentry->st_sslstatus->ssl_compression = be_tls_get_compression(MyProcPort);
-        strlcpy(beentry->st_sslstatus->ssl_version, be_tls_get_version(MyProcPort), NAMEDATALEN);
-        strlcpy(beentry->st_sslstatus->ssl_cipher, be_tls_get_cipher(MyProcPort), NAMEDATALEN);
-        be_tls_get_peerdn_name(MyProcPort, beentry->st_sslstatus->ssl_clientdn, NAMEDATALEN);
-    }
-    else
-    {
-        beentry->st_ssl = false;
-    }
-#else
-    beentry->st_ssl = false;
-#endif
-    beentry->st_state = STATE_UNDEFINED;
-    beentry->st_appname[0] = '\0';
-    beentry->st_activity_raw[0] = '\0';
-    /* Also make sure the last byte in each string area is always 0 */
-    beentry->st_clienthostname[NAMEDATALEN - 1] = '\0';
-    beentry->st_appname[NAMEDATALEN - 1] = '\0';
-    beentry->st_activity_raw[pgstat_track_activity_query_size - 1] = '\0';
-    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
-    beentry->st_progress_command_target = InvalidOid;
-
-    /*
-     * we don't zero st_progress_param here to save cycles; nobody should
-     * examine it until st_progress_command has been set to something other
-     * than PROGRESS_COMMAND_INVALID
-     */
-
-    pgstat_increment_changecount_after(beentry);
-
-    /* Update app name to current GUC setting */
-    if (application_name)
-        pgstat_report_appname(application_name);
-}
-
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
@@ -3078,8 +2637,6 @@ pgstat_bestart(void)
 static void
 pgstat_beshutdown_hook(int code, Datum arg)
 {
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
     /*
      * If we got as far as discovering our own database ID, we can report what
      * we did to the collector.  Otherwise, we'd be sending an invalid
@@ -3088,1188 +2645,9 @@ pgstat_beshutdown_hook(int code, Datum arg)
      */
     if (OidIsValid(MyDatabaseId))
         pgstat_update_stat(true);
-
-    /*
-     * Clear my status entry, following the protocol of bumping st_changecount
-     * before and after.  We use a volatile pointer here to ensure the
-     * compiler doesn't try to get cute.
-     */
-    pgstat_increment_changecount_before(beentry);
-
-    beentry->st_procpid = 0;    /* mark invalid */
-
-    pgstat_increment_changecount_after(beentry);
 }
 
 
-/* ----------
- * pgstat_report_activity() -
- *
- *    Called from tcop/postgres.c to report what the backend is actually doing
- *    (but note cmd_str can be NULL for certain cases).
- *
- * All updates of the status entry follow the protocol of bumping
- * st_changecount before and after.  We use a volatile pointer here to
- * ensure the compiler doesn't try to get cute.
- * ----------
- */
-void
-pgstat_report_activity(BackendState state, const char *cmd_str)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-    TimestampTz start_timestamp;
-    TimestampTz current_timestamp;
-    int            len = 0;
-
-    TRACE_POSTGRESQL_STATEMENT_STATUS(cmd_str);
-
-    if (!beentry)
-        return;
-
-    if (!pgstat_track_activities)
-    {
-        if (beentry->st_state != STATE_DISABLED)
-        {
-            volatile PGPROC *proc = MyProc;
-
-            /*
-             * track_activities is disabled, but we last reported a
-             * non-disabled state.  As our final update, change the state and
-             * clear fields we will not be updating anymore.
-             */
-            pgstat_increment_changecount_before(beentry);
-            beentry->st_state = STATE_DISABLED;
-            beentry->st_state_start_timestamp = 0;
-            beentry->st_activity_raw[0] = '\0';
-            beentry->st_activity_start_timestamp = 0;
-            /* st_xact_start_timestamp and wait_event_info are also disabled */
-            beentry->st_xact_start_timestamp = 0;
-            proc->wait_event_info = 0;
-            pgstat_increment_changecount_after(beentry);
-        }
-        return;
-    }
-
-    /*
-     * To minimize the time spent modifying the entry, fetch all the needed
-     * data first.
-     */
-    start_timestamp = GetCurrentStatementStartTimestamp();
-    if (cmd_str != NULL)
-    {
-        /*
-         * Compute length of to-be-stored string unaware of multi-byte
-         * characters. For speed reasons that'll get corrected on read, rather
-         * than computed every write.
-         */
-        len = Min(strlen(cmd_str), pgstat_track_activity_query_size - 1);
-    }
-    current_timestamp = GetCurrentTimestamp();
-
-    /*
-     * Now update the status entry
-     */
-    pgstat_increment_changecount_before(beentry);
-
-    beentry->st_state = state;
-    beentry->st_state_start_timestamp = current_timestamp;
-
-    if (cmd_str != NULL)
-    {
-        memcpy((char *) beentry->st_activity_raw, cmd_str, len);
-        beentry->st_activity_raw[len] = '\0';
-        beentry->st_activity_start_timestamp = start_timestamp;
-    }
-
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_start_command() -
- *
- * Set st_progress_command (and st_progress_command_target) in own backend
- * entry.  Also, zero-initialize st_progress_param array.
- *-----------
- */
-void
-pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    if (!beentry || !pgstat_track_activities)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_progress_command = cmdtype;
-    beentry->st_progress_command_target = relid;
-    MemSet(&beentry->st_progress_param, 0, sizeof(beentry->st_progress_param));
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_update_param() -
- *
- * Update index'th member in st_progress_param[] of own backend entry.
- *-----------
- */
-void
-pgstat_progress_update_param(int index, int64 val)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    Assert(index >= 0 && index < PGSTAT_NUM_PROGRESS_PARAM);
-
-    if (!beentry || !pgstat_track_activities)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_progress_param[index] = val;
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_update_multi_param() -
- *
- * Update multiple members in st_progress_param[] of own backend entry.
- * This is atomic; readers won't see intermediate states.
- *-----------
- */
-void
-pgstat_progress_update_multi_param(int nparam, const int *index,
-                                   const int64 *val)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-    int            i;
-
-    if (!beentry || !pgstat_track_activities || nparam == 0)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-
-    for (i = 0; i < nparam; ++i)
-    {
-        Assert(index[i] >= 0 && index[i] < PGSTAT_NUM_PROGRESS_PARAM);
-
-        beentry->st_progress_param[index[i]] = val[i];
-    }
-
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_end_command() -
- *
- * Reset st_progress_command (and st_progress_command_target) in own backend
- * entry.  This signals the end of the command.
- *-----------
- */
-void
-pgstat_progress_end_command(void)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    if (!beentry)
-        return;
-    if (!pgstat_track_activities
-        && beentry->st_progress_command == PROGRESS_COMMAND_INVALID)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
-    beentry->st_progress_command_target = InvalidOid;
-    pgstat_increment_changecount_after(beentry);
-}
-
-/* ----------
- * pgstat_report_appname() -
- *
- *    Called to update our application name.
- * ----------
- */
-void
-pgstat_report_appname(const char *appname)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-    int            len;
-
-    if (!beentry)
-        return;
-
-    /* This should be unnecessary if GUC did its job, but be safe */
-    len = pg_mbcliplen(appname, strlen(appname), NAMEDATALEN - 1);
-
-    /*
-     * Update my status entry, following the protocol of bumping
-     * st_changecount before and after.  We use a volatile pointer here to
-     * ensure the compiler doesn't try to get cute.
-     */
-    pgstat_increment_changecount_before(beentry);
-
-    memcpy((char *) beentry->st_appname, appname, len);
-    beentry->st_appname[len] = '\0';
-
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*
- * Report current transaction start timestamp as the specified value.
- * Zero means there is no active transaction.
- */
-void
-pgstat_report_xact_timestamp(TimestampTz tstamp)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    if (!pgstat_track_activities || !beentry)
-        return;
-
-    /*
-     * Update my status entry, following the protocol of bumping
-     * st_changecount before and after.  We use a volatile pointer here to
-     * ensure the compiler doesn't try to get cute.
-     */
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_xact_start_timestamp = tstamp;
-    pgstat_increment_changecount_after(beentry);
-}
-
-/* ----------
- * pgstat_read_current_status() -
- *
- *    Copy the current contents of the PgBackendStatus array to local memory,
- *    if not already done in this transaction.
- * ----------
- */
-static void
-pgstat_read_current_status(void)
-{
-    volatile PgBackendStatus *beentry;
-    LocalPgBackendStatus *localtable;
-    LocalPgBackendStatus *localentry;
-    char       *localappname,
-               *localclienthostname,
-               *localactivity;
-#ifdef USE_SSL
-    PgBackendSSLStatus *localsslstatus;
-#endif
-    int            i;
-
-    Assert(IsUnderPostmaster);
-
-    if (localBackendStatusTable)
-        return;                    /* already done */
-
-    pgstat_setup_memcxt();
-
-    localtable = (LocalPgBackendStatus *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           sizeof(LocalPgBackendStatus) * NumBackendStatSlots);
-    localappname = (char *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           NAMEDATALEN * NumBackendStatSlots);
-    localclienthostname = (char *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           NAMEDATALEN * NumBackendStatSlots);
-    localactivity = (char *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           pgstat_track_activity_query_size * NumBackendStatSlots);
-#ifdef USE_SSL
-    localsslstatus = (PgBackendSSLStatus *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           sizeof(PgBackendSSLStatus) * NumBackendStatSlots);
-#endif
-
-    localNumBackends = 0;
-
-    beentry = BackendStatusArray;
-    localentry = localtable;
-    for (i = 1; i <= NumBackendStatSlots; i++)
-    {
-        /*
-         * Follow the protocol of retrying if st_changecount changes while we
-         * copy the entry, or if it's odd.  (The check for odd is needed to
-         * cover the case where we are able to completely copy the entry while
-         * the source backend is between increment steps.)    We use a volatile
-         * pointer here to ensure the compiler doesn't try to get cute.
-         */
-        for (;;)
-        {
-            int            before_changecount;
-            int            after_changecount;
-
-            pgstat_save_changecount_before(beentry, before_changecount);
-
-            localentry->backendStatus.st_procpid = beentry->st_procpid;
-            if (localentry->backendStatus.st_procpid > 0)
-            {
-                memcpy(&localentry->backendStatus, (char *) beentry, sizeof(PgBackendStatus));
-
-                /*
-                 * strcpy is safe even if the string is modified concurrently,
-                 * because there's always a \0 at the end of the buffer.
-                 */
-                strcpy(localappname, (char *) beentry->st_appname);
-                localentry->backendStatus.st_appname = localappname;
-                strcpy(localclienthostname, (char *) beentry->st_clienthostname);
-                localentry->backendStatus.st_clienthostname = localclienthostname;
-                strcpy(localactivity, (char *) beentry->st_activity_raw);
-                localentry->backendStatus.st_activity_raw = localactivity;
-                localentry->backendStatus.st_ssl = beentry->st_ssl;
-#ifdef USE_SSL
-                if (beentry->st_ssl)
-                {
-                    memcpy(localsslstatus, beentry->st_sslstatus, sizeof(PgBackendSSLStatus));
-                    localentry->backendStatus.st_sslstatus = localsslstatus;
-                }
-#endif
-            }
-
-            pgstat_save_changecount_after(beentry, after_changecount);
-            if (before_changecount == after_changecount &&
-                (before_changecount & 1) == 0)
-                break;
-
-            /* Make sure we can break out of loop if stuck... */
-            CHECK_FOR_INTERRUPTS();
-        }
-
-        beentry++;
-        /* Only valid entries get included into the local array */
-        if (localentry->backendStatus.st_procpid > 0)
-        {
-            BackendIdGetTransactionIds(i,
-                                       &localentry->backend_xid,
-                                       &localentry->backend_xmin);
-
-            localentry++;
-            localappname += NAMEDATALEN;
-            localclienthostname += NAMEDATALEN;
-            localactivity += pgstat_track_activity_query_size;
-#ifdef USE_SSL
-            localsslstatus++;
-#endif
-            localNumBackends++;
-        }
-    }
-
-    /* Set the pointer only after completion of a valid table */
-    localBackendStatusTable = localtable;
-}
-
-/* ----------
- * pgstat_get_wait_event_type() -
- *
- *    Return a string representing the current wait event type, backend is
- *    waiting on.
- */
-const char *
-pgstat_get_wait_event_type(uint32 wait_event_info)
-{
-    uint32        classId;
-    const char *event_type;
-
-    /* report process as not waiting. */
-    if (wait_event_info == 0)
-        return NULL;
-
-    classId = wait_event_info & 0xFF000000;
-
-    switch (classId)
-    {
-        case PG_WAIT_LWLOCK:
-            event_type = "LWLock";
-            break;
-        case PG_WAIT_LOCK:
-            event_type = "Lock";
-            break;
-        case PG_WAIT_BUFFER_PIN:
-            event_type = "BufferPin";
-            break;
-        case PG_WAIT_ACTIVITY:
-            event_type = "Activity";
-            break;
-        case PG_WAIT_CLIENT:
-            event_type = "Client";
-            break;
-        case PG_WAIT_EXTENSION:
-            event_type = "Extension";
-            break;
-        case PG_WAIT_IPC:
-            event_type = "IPC";
-            break;
-        case PG_WAIT_TIMEOUT:
-            event_type = "Timeout";
-            break;
-        case PG_WAIT_IO:
-            event_type = "IO";
-            break;
-        default:
-            event_type = "???";
-            break;
-    }
-
-    return event_type;
-}
-
-/* ----------
- * pgstat_get_wait_event() -
- *
- *    Return a string representing the current wait event, backend is
- *    waiting on.
- */
-const char *
-pgstat_get_wait_event(uint32 wait_event_info)
-{
-    uint32        classId;
-    uint16        eventId;
-    const char *event_name;
-
-    /* report process as not waiting. */
-    if (wait_event_info == 0)
-        return NULL;
-
-    classId = wait_event_info & 0xFF000000;
-    eventId = wait_event_info & 0x0000FFFF;
-
-    switch (classId)
-    {
-        case PG_WAIT_LWLOCK:
-            event_name = GetLWLockIdentifier(classId, eventId);
-            break;
-        case PG_WAIT_LOCK:
-            event_name = GetLockNameFromTagType(eventId);
-            break;
-        case PG_WAIT_BUFFER_PIN:
-            event_name = "BufferPin";
-            break;
-        case PG_WAIT_ACTIVITY:
-            {
-                WaitEventActivity w = (WaitEventActivity) wait_event_info;
-
-                event_name = pgstat_get_wait_activity(w);
-                break;
-            }
-        case PG_WAIT_CLIENT:
-            {
-                WaitEventClient w = (WaitEventClient) wait_event_info;
-
-                event_name = pgstat_get_wait_client(w);
-                break;
-            }
-        case PG_WAIT_EXTENSION:
-            event_name = "Extension";
-            break;
-        case PG_WAIT_IPC:
-            {
-                WaitEventIPC w = (WaitEventIPC) wait_event_info;
-
-                event_name = pgstat_get_wait_ipc(w);
-                break;
-            }
-        case PG_WAIT_TIMEOUT:
-            {
-                WaitEventTimeout w = (WaitEventTimeout) wait_event_info;
-
-                event_name = pgstat_get_wait_timeout(w);
-                break;
-            }
-        case PG_WAIT_IO:
-            {
-                WaitEventIO w = (WaitEventIO) wait_event_info;
-
-                event_name = pgstat_get_wait_io(w);
-                break;
-            }
-        default:
-            event_name = "unknown wait event";
-            break;
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_activity() -
- *
- * Convert WaitEventActivity to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_activity(WaitEventActivity w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_ARCHIVER_MAIN:
-            event_name = "ArchiverMain";
-            break;
-        case WAIT_EVENT_AUTOVACUUM_MAIN:
-            event_name = "AutoVacuumMain";
-            break;
-        case WAIT_EVENT_BGWRITER_HIBERNATE:
-            event_name = "BgWriterHibernate";
-            break;
-        case WAIT_EVENT_BGWRITER_MAIN:
-            event_name = "BgWriterMain";
-            break;
-        case WAIT_EVENT_CHECKPOINTER_MAIN:
-            event_name = "CheckpointerMain";
-            break;
-        case WAIT_EVENT_LOGICAL_APPLY_MAIN:
-            event_name = "LogicalApplyMain";
-            break;
-        case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
-            event_name = "LogicalLauncherMain";
-            break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
-            break;
-        case WAIT_EVENT_RECOVERY_WAL_ALL:
-            event_name = "RecoveryWalAll";
-            break;
-        case WAIT_EVENT_RECOVERY_WAL_STREAM:
-            event_name = "RecoveryWalStream";
-            break;
-        case WAIT_EVENT_SYSLOGGER_MAIN:
-            event_name = "SysLoggerMain";
-            break;
-        case WAIT_EVENT_WAL_RECEIVER_MAIN:
-            event_name = "WalReceiverMain";
-            break;
-        case WAIT_EVENT_WAL_SENDER_MAIN:
-            event_name = "WalSenderMain";
-            break;
-        case WAIT_EVENT_WAL_WRITER_MAIN:
-            event_name = "WalWriterMain";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_client() -
- *
- * Convert WaitEventClient to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_client(WaitEventClient w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_CLIENT_READ:
-            event_name = "ClientRead";
-            break;
-        case WAIT_EVENT_CLIENT_WRITE:
-            event_name = "ClientWrite";
-            break;
-        case WAIT_EVENT_LIBPQWALRECEIVER_CONNECT:
-            event_name = "LibPQWalReceiverConnect";
-            break;
-        case WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE:
-            event_name = "LibPQWalReceiverReceive";
-            break;
-        case WAIT_EVENT_SSL_OPEN_SERVER:
-            event_name = "SSLOpenServer";
-            break;
-        case WAIT_EVENT_WAL_RECEIVER_WAIT_START:
-            event_name = "WalReceiverWaitStart";
-            break;
-        case WAIT_EVENT_WAL_SENDER_WAIT_WAL:
-            event_name = "WalSenderWaitForWAL";
-            break;
-        case WAIT_EVENT_WAL_SENDER_WRITE_DATA:
-            event_name = "WalSenderWriteData";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_ipc() -
- *
- * Convert WaitEventIPC to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_ipc(WaitEventIPC w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_BGWORKER_SHUTDOWN:
-            event_name = "BgWorkerShutdown";
-            break;
-        case WAIT_EVENT_BGWORKER_STARTUP:
-            event_name = "BgWorkerStartup";
-            break;
-        case WAIT_EVENT_BTREE_PAGE:
-            event_name = "BtreePage";
-            break;
-        case WAIT_EVENT_CLOG_GROUP_UPDATE:
-            event_name = "ClogGroupUpdate";
-            break;
-        case WAIT_EVENT_EXECUTE_GATHER:
-            event_name = "ExecuteGather";
-            break;
-        case WAIT_EVENT_HASH_BATCH_ALLOCATING:
-            event_name = "Hash/Batch/Allocating";
-            break;
-        case WAIT_EVENT_HASH_BATCH_ELECTING:
-            event_name = "Hash/Batch/Electing";
-            break;
-        case WAIT_EVENT_HASH_BATCH_LOADING:
-            event_name = "Hash/Batch/Loading";
-            break;
-        case WAIT_EVENT_HASH_BUILD_ALLOCATING:
-            event_name = "Hash/Build/Allocating";
-            break;
-        case WAIT_EVENT_HASH_BUILD_ELECTING:
-            event_name = "Hash/Build/Electing";
-            break;
-        case WAIT_EVENT_HASH_BUILD_HASHING_INNER:
-            event_name = "Hash/Build/HashingInner";
-            break;
-        case WAIT_EVENT_HASH_BUILD_HASHING_OUTER:
-            event_name = "Hash/Build/HashingOuter";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING:
-            event_name = "Hash/GrowBatches/Allocating";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_DECIDING:
-            event_name = "Hash/GrowBatches/Deciding";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_ELECTING:
-            event_name = "Hash/GrowBatches/Electing";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_FINISHING:
-            event_name = "Hash/GrowBatches/Finishing";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING:
-            event_name = "Hash/GrowBatches/Repartitioning";
-            break;
-        case WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING:
-            event_name = "Hash/GrowBuckets/Allocating";
-            break;
-        case WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING:
-            event_name = "Hash/GrowBuckets/Electing";
-            break;
-        case WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING:
-            event_name = "Hash/GrowBuckets/Reinserting";
-            break;
-        case WAIT_EVENT_LOGICAL_SYNC_DATA:
-            event_name = "LogicalSyncData";
-            break;
-        case WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE:
-            event_name = "LogicalSyncStateChange";
-            break;
-        case WAIT_EVENT_MQ_INTERNAL:
-            event_name = "MessageQueueInternal";
-            break;
-        case WAIT_EVENT_MQ_PUT_MESSAGE:
-            event_name = "MessageQueuePutMessage";
-            break;
-        case WAIT_EVENT_MQ_RECEIVE:
-            event_name = "MessageQueueReceive";
-            break;
-        case WAIT_EVENT_MQ_SEND:
-            event_name = "MessageQueueSend";
-            break;
-        case WAIT_EVENT_PARALLEL_BITMAP_SCAN:
-            event_name = "ParallelBitmapScan";
-            break;
-        case WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN:
-            event_name = "ParallelCreateIndexScan";
-            break;
-        case WAIT_EVENT_PARALLEL_FINISH:
-            event_name = "ParallelFinish";
-            break;
-        case WAIT_EVENT_PROCARRAY_GROUP_UPDATE:
-            event_name = "ProcArrayGroupUpdate";
-            break;
-        case WAIT_EVENT_PROMOTE:
-            event_name = "Promote";
-            break;
-        case WAIT_EVENT_REPLICATION_ORIGIN_DROP:
-            event_name = "ReplicationOriginDrop";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_DROP:
-            event_name = "ReplicationSlotDrop";
-            break;
-        case WAIT_EVENT_SAFE_SNAPSHOT:
-            event_name = "SafeSnapshot";
-            break;
-        case WAIT_EVENT_SYNC_REP:
-            event_name = "SyncRep";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_timeout() -
- *
- * Convert WaitEventTimeout to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_timeout(WaitEventTimeout w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_BASE_BACKUP_THROTTLE:
-            event_name = "BaseBackupThrottle";
-            break;
-        case WAIT_EVENT_PG_SLEEP:
-            event_name = "PgSleep";
-            break;
-        case WAIT_EVENT_RECOVERY_APPLY_DELAY:
-            event_name = "RecoveryApplyDelay";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_io() -
- *
- * Convert WaitEventIO to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_io(WaitEventIO w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_BUFFILE_READ:
-            event_name = "BufFileRead";
-            break;
-        case WAIT_EVENT_BUFFILE_WRITE:
-            event_name = "BufFileWrite";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_READ:
-            event_name = "ControlFileRead";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_SYNC:
-            event_name = "ControlFileSync";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE:
-            event_name = "ControlFileSyncUpdate";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_WRITE:
-            event_name = "ControlFileWrite";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE:
-            event_name = "ControlFileWriteUpdate";
-            break;
-        case WAIT_EVENT_COPY_FILE_READ:
-            event_name = "CopyFileRead";
-            break;
-        case WAIT_EVENT_COPY_FILE_WRITE:
-            event_name = "CopyFileWrite";
-            break;
-        case WAIT_EVENT_DATA_FILE_EXTEND:
-            event_name = "DataFileExtend";
-            break;
-        case WAIT_EVENT_DATA_FILE_FLUSH:
-            event_name = "DataFileFlush";
-            break;
-        case WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC:
-            event_name = "DataFileImmediateSync";
-            break;
-        case WAIT_EVENT_DATA_FILE_PREFETCH:
-            event_name = "DataFilePrefetch";
-            break;
-        case WAIT_EVENT_DATA_FILE_READ:
-            event_name = "DataFileRead";
-            break;
-        case WAIT_EVENT_DATA_FILE_SYNC:
-            event_name = "DataFileSync";
-            break;
-        case WAIT_EVENT_DATA_FILE_TRUNCATE:
-            event_name = "DataFileTruncate";
-            break;
-        case WAIT_EVENT_DATA_FILE_WRITE:
-            event_name = "DataFileWrite";
-            break;
-        case WAIT_EVENT_DSM_FILL_ZERO_WRITE:
-            event_name = "DSMFillZeroWrite";
-            break;
-        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ:
-            event_name = "LockFileAddToDataDirRead";
-            break;
-        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC:
-            event_name = "LockFileAddToDataDirSync";
-            break;
-        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE:
-            event_name = "LockFileAddToDataDirWrite";
-            break;
-        case WAIT_EVENT_LOCK_FILE_CREATE_READ:
-            event_name = "LockFileCreateRead";
-            break;
-        case WAIT_EVENT_LOCK_FILE_CREATE_SYNC:
-            event_name = "LockFileCreateSync";
-            break;
-        case WAIT_EVENT_LOCK_FILE_CREATE_WRITE:
-            event_name = "LockFileCreateWRITE";
-            break;
-        case WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ:
-            event_name = "LockFileReCheckDataDirRead";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC:
-            event_name = "LogicalRewriteCheckpointSync";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC:
-            event_name = "LogicalRewriteMappingSync";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE:
-            event_name = "LogicalRewriteMappingWrite";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_SYNC:
-            event_name = "LogicalRewriteSync";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE:
-            event_name = "LogicalRewriteTruncate";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_WRITE:
-            event_name = "LogicalRewriteWrite";
-            break;
-        case WAIT_EVENT_RELATION_MAP_READ:
-            event_name = "RelationMapRead";
-            break;
-        case WAIT_EVENT_RELATION_MAP_SYNC:
-            event_name = "RelationMapSync";
-            break;
-        case WAIT_EVENT_RELATION_MAP_WRITE:
-            event_name = "RelationMapWrite";
-            break;
-        case WAIT_EVENT_REORDER_BUFFER_READ:
-            event_name = "ReorderBufferRead";
-            break;
-        case WAIT_EVENT_REORDER_BUFFER_WRITE:
-            event_name = "ReorderBufferWrite";
-            break;
-        case WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ:
-            event_name = "ReorderLogicalMappingRead";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_READ:
-            event_name = "ReplicationSlotRead";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC:
-            event_name = "ReplicationSlotRestoreSync";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_SYNC:
-            event_name = "ReplicationSlotSync";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_WRITE:
-            event_name = "ReplicationSlotWrite";
-            break;
-        case WAIT_EVENT_SLRU_FLUSH_SYNC:
-            event_name = "SLRUFlushSync";
-            break;
-        case WAIT_EVENT_SLRU_READ:
-            event_name = "SLRURead";
-            break;
-        case WAIT_EVENT_SLRU_SYNC:
-            event_name = "SLRUSync";
-            break;
-        case WAIT_EVENT_SLRU_WRITE:
-            event_name = "SLRUWrite";
-            break;
-        case WAIT_EVENT_SNAPBUILD_READ:
-            event_name = "SnapbuildRead";
-            break;
-        case WAIT_EVENT_SNAPBUILD_SYNC:
-            event_name = "SnapbuildSync";
-            break;
-        case WAIT_EVENT_SNAPBUILD_WRITE:
-            event_name = "SnapbuildWrite";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC:
-            event_name = "TimelineHistoryFileSync";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE:
-            event_name = "TimelineHistoryFileWrite";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_READ:
-            event_name = "TimelineHistoryRead";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_SYNC:
-            event_name = "TimelineHistorySync";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_WRITE:
-            event_name = "TimelineHistoryWrite";
-            break;
-        case WAIT_EVENT_TWOPHASE_FILE_READ:
-            event_name = "TwophaseFileRead";
-            break;
-        case WAIT_EVENT_TWOPHASE_FILE_SYNC:
-            event_name = "TwophaseFileSync";
-            break;
-        case WAIT_EVENT_TWOPHASE_FILE_WRITE:
-            event_name = "TwophaseFileWrite";
-            break;
-        case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ:
-            event_name = "WALSenderTimelineHistoryRead";
-            break;
-        case WAIT_EVENT_WAL_BOOTSTRAP_SYNC:
-            event_name = "WALBootstrapSync";
-            break;
-        case WAIT_EVENT_WAL_BOOTSTRAP_WRITE:
-            event_name = "WALBootstrapWrite";
-            break;
-        case WAIT_EVENT_WAL_COPY_READ:
-            event_name = "WALCopyRead";
-            break;
-        case WAIT_EVENT_WAL_COPY_SYNC:
-            event_name = "WALCopySync";
-            break;
-        case WAIT_EVENT_WAL_COPY_WRITE:
-            event_name = "WALCopyWrite";
-            break;
-        case WAIT_EVENT_WAL_INIT_SYNC:
-            event_name = "WALInitSync";
-            break;
-        case WAIT_EVENT_WAL_INIT_WRITE:
-            event_name = "WALInitWrite";
-            break;
-        case WAIT_EVENT_WAL_READ:
-            event_name = "WALRead";
-            break;
-        case WAIT_EVENT_WAL_SYNC:
-            event_name = "WALSync";
-            break;
-        case WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN:
-            event_name = "WALSyncMethodAssign";
-            break;
-        case WAIT_EVENT_WAL_WRITE:
-            event_name = "WALWrite";
-            break;
-
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-
-/* ----------
- * pgstat_get_backend_current_activity() -
- *
- *    Return a string representing the current activity of the backend with
- *    the specified PID.  This looks directly at the BackendStatusArray,
- *    and so will provide current information regardless of the age of our
- *    transaction's snapshot of the status array.
- *
- *    It is the caller's responsibility to invoke this only for backends whose
- *    state is expected to remain stable while the result is in use.  The
- *    only current use is in deadlock reporting, where we can expect that
- *    the target backend is blocked on a lock.  (There are corner cases
- *    where the target's wait could get aborted while we are looking at it,
- *    but the very worst consequence is to return a pointer to a string
- *    that's been changed, so we won't worry too much.)
- *
- *    Note: return strings for special cases match pg_stat_get_backend_activity.
- * ----------
- */
-const char *
-pgstat_get_backend_current_activity(int pid, bool checkUser)
-{
-    PgBackendStatus *beentry;
-    int            i;
-
-    beentry = BackendStatusArray;
-    for (i = 1; i <= MaxBackends; i++)
-    {
-        /*
-         * Although we expect the target backend's entry to be stable, that
-         * doesn't imply that anyone else's is.  To avoid identifying the
-         * wrong backend, while we check for a match to the desired PID we
-         * must follow the protocol of retrying if st_changecount changes
-         * while we examine the entry, or if it's odd.  (This might be
-         * unnecessary, since fetching or storing an int is almost certainly
-         * atomic, but let's play it safe.)  We use a volatile pointer here to
-         * ensure the compiler doesn't try to get cute.
-         */
-        volatile PgBackendStatus *vbeentry = beentry;
-        bool        found;
-
-        for (;;)
-        {
-            int            before_changecount;
-            int            after_changecount;
-
-            pgstat_save_changecount_before(vbeentry, before_changecount);
-
-            found = (vbeentry->st_procpid == pid);
-
-            pgstat_save_changecount_after(vbeentry, after_changecount);
-
-            if (before_changecount == after_changecount &&
-                (before_changecount & 1) == 0)
-                break;
-
-            /* Make sure we can break out of loop if stuck... */
-            CHECK_FOR_INTERRUPTS();
-        }
-
-        if (found)
-        {
-            /* Now it is safe to use the non-volatile pointer */
-            if (checkUser && !superuser() && beentry->st_userid != GetUserId())
-                return "<insufficient privilege>";
-            else if (*(beentry->st_activity_raw) == '\0')
-                return "<command string not enabled>";
-            else
-            {
-                /* this'll leak a bit of memory, but that seems acceptable */
-                return pgstat_clip_activity(beentry->st_activity_raw);
-            }
-        }
-
-        beentry++;
-    }
-
-    /* If we get here, caller is in error ... */
-    return "<backend information not available>";
-}
-
-/* ----------
- * pgstat_get_crashed_backend_activity() -
- *
- *    Return a string representing the current activity of the backend with
- *    the specified PID.  Like the function above, but reads shared memory with
- *    the expectation that it may be corrupt.  On success, copy the string
- *    into the "buffer" argument and return that pointer.  On failure,
- *    return NULL.
- *
- *    This function is only intended to be used by the postmaster to report the
- *    query that crashed a backend.  In particular, no attempt is made to
- *    follow the correct concurrency protocol when accessing the
- *    BackendStatusArray.  But that's OK, in the worst case we'll return a
- *    corrupted message.  We also must take care not to trip on ereport(ERROR).
- * ----------
- */
-const char *
-pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
-{
-    volatile PgBackendStatus *beentry;
-    int            i;
-
-    beentry = BackendStatusArray;
-
-    /*
-     * We probably shouldn't get here before shared memory has been set up,
-     * but be safe.
-     */
-    if (beentry == NULL || BackendActivityBuffer == NULL)
-        return NULL;
-
-    for (i = 1; i <= MaxBackends; i++)
-    {
-        if (beentry->st_procpid == pid)
-        {
-            /* Read pointer just once, so it can't change after validation */
-            const char *activity = beentry->st_activity_raw;
-            const char *activity_last;
-
-            /*
-             * We mustn't access activity string before we verify that it
-             * falls within the BackendActivityBuffer. To make sure that the
-             * entire string including its ending is contained within the
-             * buffer, subtract one activity length from the buffer size.
-             */
-            activity_last = BackendActivityBuffer + BackendActivityBufferSize
-                - pgstat_track_activity_query_size;
-
-            if (activity < BackendActivityBuffer ||
-                activity > activity_last)
-                return NULL;
-
-            /* If no string available, no point in a report */
-            if (activity[0] == '\0')
-                return NULL;
-
-            /*
-             * Copy only ASCII-safe characters so we don't run into encoding
-             * problems when reporting the message; and be sure not to run off
-             * the end of memory.  As only ASCII characters are reported, it
-             * doesn't seem necessary to perform multibyte aware clipping.
-             */
-            ascii_safe_strlcpy(buffer, activity,
-                               Min(buflen, pgstat_track_activity_query_size));
-
-            return buffer;
-        }
-
-        beentry++;
-    }
-
-    /* PID not found */
-    return NULL;
-}
-
-const char *
-pgstat_get_backend_desc(BackendType backendType)
-{
-    const char *backendDesc = "unknown process type";
-
-    switch (backendType)
-    {
-        case B_AUTOVAC_LAUNCHER:
-            backendDesc = "autovacuum launcher";
-            break;
-        case B_AUTOVAC_WORKER:
-            backendDesc = "autovacuum worker";
-            break;
-        case B_BACKEND:
-            backendDesc = "client backend";
-            break;
-        case B_BG_WORKER:
-            backendDesc = "background worker";
-            break;
-        case B_BG_WRITER:
-            backendDesc = "background writer";
-            break;
-        case B_ARCHIVER:
-            backendDesc = "archiver";
-            break;
-        case B_CHECKPOINTER:
-            backendDesc = "checkpointer";
-            break;
-        case B_STARTUP:
-            backendDesc = "startup";
-            break;
-        case B_WAL_RECEIVER:
-            backendDesc = "walreceiver";
-            break;
-        case B_WAL_SENDER:
-            backendDesc = "walsender";
-            break;
-        case B_WAL_WRITER:
-            backendDesc = "walwriter";
-            break;
-    }
-
-    return backendDesc;
-}
-
 /* ------------------------------------------------------------
  * Local support functions follow
  * ------------------------------------------------------------
@@ -5412,22 +3790,6 @@ backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot)
                               funcid);
 }
 
-/* ----------
- * pgstat_setup_memcxt() -
- *
- *    Create pgStatLocalContext, if not already done.
- * ----------
- */
-static void
-pgstat_setup_memcxt(void)
-{
-    if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
-}
-
-
 /* ----------
  * pgstat_clear_snapshot() -
  *
@@ -5443,6 +3805,8 @@ pgstat_clear_snapshot(void)
 {
     int param = 0;    /* only the address is significant */
 
+    pgstat_bestatus_clear_snapshot();
+
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
         MemoryContextDelete(pgStatLocalContext);
@@ -5450,8 +3814,6 @@ pgstat_clear_snapshot(void)
     /* Reset variables */
     pgStatLocalContext = NULL;
     pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
 
     /*
      * the parameter inform the function that it is not called from
@@ -5557,47 +3919,18 @@ pgstat_update_dbentry(PgStat_StatDBEntry *dbentry, PgStat_TableStatus *stat)
     dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
 }
 
-
-/*
- * Convert a potentially unsafely truncated activity string (see
- * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
- * one.
+/* ----------
+ * pgstat_setup_memcxt() -
  *
- * The returned string is allocated in the caller's memory context and may be
- * freed.
+ *    Create pgStatLocalContext, if not already done.
+ * ----------
  */
-char *
-pgstat_clip_activity(const char *raw_activity)
+static void
+pgstat_setup_memcxt(void)
 {
-    char       *activity;
-    int            rawlen;
-    int            cliplen;
-
-    /*
-     * Some callers, like pgstat_get_backend_current_activity(), do not
-     * guarantee that the buffer isn't concurrently modified. We try to take
-     * care that the buffer is always terminated by a NUL byte regardless, but
-     * let's still be paranoid about the string's length. In those cases the
-     * underlying buffer is guaranteed to be pgstat_track_activity_query_size
-     * large.
-     */
-    activity = pnstrdup(raw_activity, pgstat_track_activity_query_size - 1);
-
-    /* now double-guaranteed to be NUL terminated */
-    rawlen = strlen(activity);
-
-    /*
-     * All supported server-encodings make it possible to determine the length
-     * of a multi-byte character from its first byte (this is not the case for
-     * client encodings, see GB18030). As st_activity is always stored using
-     * server encoding, this allows us to perform multi-byte aware truncation,
-     * even if the string earlier was truncated in the middle of a multi-byte
-     * character.
-     */
-    cliplen = pg_mbcliplen(activity, rawlen,
-                           pgstat_track_activity_query_size - 1);
-
-    activity[cliplen] = '\0';
-
-    return activity;
+    if (!pgStatLocalContext)
+        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
+                                                   "Statistics snapshot",
+                                                   ALLOCSET_SMALL_SIZES);
 }
+
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e794a81c4c..d92c7c935d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -34,6 +34,7 @@
 #include <unistd.h>
 
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/storage.h"
 #include "executor/instrument.h"
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index dd687dfe71..34ef69d8d0 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -41,9 +41,9 @@
 
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "executor/instrument.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/buffile.h"
 #include "storage/buf_internals.h"
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index 4a0d23b11e..4054ac5108 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -22,10 +22,10 @@
 #include <unistd.h>
 #include <sys/stat.h>
 
+#include "bestatus.h"
 #include "storage/copydir.h"
 #include "storage/fd.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 
 /*
  * copydir: copy a directory
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 2d75773ef0..4342fb3e39 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -82,6 +82,7 @@
 #include "miscadmin.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/pg_tablespace.h"
 #include "common/file_perm.h"
 #include "pgstat.h"
diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
index 0ff1f5be91..a3465f57ae 100644
--- a/src/backend/storage/ipc/dsm_impl.c
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -61,8 +61,8 @@
 #ifdef HAVE_SYS_SHM_H
 #include <sys/shm.h>
 #endif
+#include "bestatus.h"
 #include "common/file_perm.h"
-#include "pgstat.h"
 
 #include "portability/mem.h"
 #include "storage/dsm_impl.h"
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index c129446f9c..51c6fff11c 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -43,8 +43,8 @@
 #include <poll.h>
 #endif
 
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "port/atomics.h"
 #include "portability/instr_time.h"
 #include "postmaster/postmaster.h"
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index dc7e875680..293d15661a 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -51,9 +51,9 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/spin.h"
diff --git a/src/backend/storage/ipc/shm_mq.c b/src/backend/storage/ipc/shm_mq.c
index fde71afd47..a0a5582aac 100644
--- a/src/backend/storage/ipc/shm_mq.c
+++ b/src/backend/storage/ipc/shm_mq.c
@@ -18,8 +18,8 @@
 
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/bgworker.h"
 #include "storage/procsignal.h"
 #include "storage/shm_mq.h"
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index c9bb3e987d..0a9181cd9d 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -21,8 +21,8 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index aeaf1f3ab4..22a42b9977 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -25,6 +25,7 @@
  */
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "pgstat.h"
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index c46bb8d057..8ddb4c88e0 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -76,8 +76,8 @@
  */
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "pg_trace.h"
 #include "postmaster/postmaster.h"
 #include "replication/slot.h"
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index e8390311d0..ac352885f3 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -193,8 +193,8 @@
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
 #include "storage/predicate_internals.h"
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 6f9aaa52fa..0ecaa24b1a 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -38,8 +38,8 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "replication/slot.h"
 #include "replication/syncrep.h"
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 86013a5c8b..92b3cd8b55 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -28,7 +28,7 @@
 #include "miscadmin.h"
 #include "access/xlogutils.h"
 #include "access/xlog.h"
-#include "pgstat.h"
+#include "bestatus.h"
 #include "portability/instr_time.h"
 #include "postmaster/bgwriter.h"
 #include "storage/fd.h"
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index ee4e43331b..f894bac680 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -39,6 +39,7 @@
 #include "access/parallel.h"
 #include "access/printtup.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index 309eb2935c..b229f42622 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -20,6 +20,7 @@
 #include <unistd.h>
 
 #include "access/sysattr.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/pg_tablespace.h"
 #include "catalog/pg_type.h"
@@ -28,7 +29,6 @@
 #include "common/keywords.h"
 #include "funcapi.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "parser/scansup.h"
 #include "postmaster/syslogger.h"
 #include "rewrite/rewriteHandler.h"
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index eca801eeed..29da24b91d 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 905867dc76..29df1e9773 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -46,11 +46,11 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/pg_tablespace.h"
 #include "catalog/storage.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/lwlock.h"
 #include "utils/inval.h"
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 238fe1deec..1a25c813f2 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -31,12 +31,12 @@
 #endif
 
 #include "access/htup_details.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "common/file_perm.h"
 #include "libpq/libpq.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/postmaster.h"
 #include "storage/fd.h"
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 1e4fa89135..d4774d717f 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -25,6 +25,7 @@
 #include "access/sysattr.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/indexing.h"
 #include "catalog/namespace.h"
@@ -688,7 +689,10 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 
     /* Initialize stats collection --- must happen before first xact */
     if (!bootstrap)
+    {
+        pgstat_bearray_initialize();
         pgstat_initialize();
+    }
 
     /*
      * Load relcache entries for the shared system catalogs.  This must create
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index cfdffbca2b..45974082b9 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -33,6 +33,7 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
 #include "commands/async.h"
diff --git a/src/include/bestatus.h b/src/include/bestatus.h
new file mode 100644
index 0000000000..3b47e9c063
--- /dev/null
+++ b/src/include/bestatus.h
@@ -0,0 +1,544 @@
+/* ----------
+ *    bestatus.h
+ *
+ *    Definitions for the PostgreSQL backend status monitor facility
+ *
+ *    Copyright (c) 2001-2018, PostgreSQL Global Development Group
+ *
+ *    src/include/bestatus.h
+ * ----------
+ */
+#ifndef BESTATUS_H
+#define BESTATUS_H
+
+#include "datatype/timestamp.h"
+#include "libpq/pqcomm.h"
+#include "storage/proc.h"
+
+/* ----------
+ * Backend types
+ * ----------
+ */
+typedef enum BackendType
+{
+    B_AUTOVAC_LAUNCHER,
+    B_AUTOVAC_WORKER,
+    B_BACKEND,
+    B_BG_WORKER,
+    B_BG_WRITER,
+    B_CHECKPOINTER,
+    B_STARTUP,
+    B_WAL_RECEIVER,
+    B_WAL_SENDER,
+    B_WAL_WRITER,
+    B_ARCHIVER
+} BackendType;
+
+
+/* ----------
+ * Backend states
+ * ----------
+ */
+typedef enum BackendState
+{
+    STATE_UNDEFINED,
+    STATE_IDLE,
+    STATE_RUNNING,
+    STATE_IDLEINTRANSACTION,
+    STATE_FASTPATH,
+    STATE_IDLEINTRANSACTION_ABORTED,
+    STATE_DISABLED
+} BackendState;
+
+
+/* ----------
+ * Wait Classes
+ * ----------
+ */
+#define PG_WAIT_LWLOCK                0x01000000U
+#define PG_WAIT_LOCK                0x03000000U
+#define PG_WAIT_BUFFER_PIN            0x04000000U
+#define PG_WAIT_ACTIVITY            0x05000000U
+#define PG_WAIT_CLIENT                0x06000000U
+#define PG_WAIT_EXTENSION            0x07000000U
+#define PG_WAIT_IPC                    0x08000000U
+#define PG_WAIT_TIMEOUT                0x09000000U
+#define PG_WAIT_IO                    0x0A000000U
+
+/* ----------
+ * Wait Events - Activity
+ *
+ * Use this category when a process is waiting because it has no work to do,
+ * unless the "Client" or "Timeout" category describes the situation better.
+ * Typically, this should only be used for background processes.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_ARCHIVER_MAIN = PG_WAIT_ACTIVITY,
+    WAIT_EVENT_AUTOVACUUM_MAIN,
+    WAIT_EVENT_BGWRITER_HIBERNATE,
+    WAIT_EVENT_BGWRITER_MAIN,
+    WAIT_EVENT_CHECKPOINTER_MAIN,
+    WAIT_EVENT_LOGICAL_APPLY_MAIN,
+    WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
+    WAIT_EVENT_RECOVERY_WAL_ALL,
+    WAIT_EVENT_RECOVERY_WAL_STREAM,
+    WAIT_EVENT_SYSLOGGER_MAIN,
+    WAIT_EVENT_WAL_RECEIVER_MAIN,
+    WAIT_EVENT_WAL_SENDER_MAIN,
+    WAIT_EVENT_WAL_WRITER_MAIN
+} WaitEventActivity;
+
+/* ----------
+ * Wait Events - Client
+ *
+ * Use this category when a process is waiting to send data to or receive data
+ * from the frontend process to which it is connected.  This is never used for
+ * a background process, which has no client connection.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_CLIENT_READ = PG_WAIT_CLIENT,
+    WAIT_EVENT_CLIENT_WRITE,
+    WAIT_EVENT_LIBPQWALRECEIVER_CONNECT,
+    WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE,
+    WAIT_EVENT_SSL_OPEN_SERVER,
+    WAIT_EVENT_WAL_RECEIVER_WAIT_START,
+    WAIT_EVENT_WAL_SENDER_WAIT_WAL,
+    WAIT_EVENT_WAL_SENDER_WRITE_DATA
+} WaitEventClient;
+
+/* ----------
+ * Wait Events - IPC
+ *
+ * Use this category when a process cannot complete the work it is doing because
+ * it is waiting for a notification from another process.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_BGWORKER_SHUTDOWN = PG_WAIT_IPC,
+    WAIT_EVENT_BGWORKER_STARTUP,
+    WAIT_EVENT_BTREE_PAGE,
+    WAIT_EVENT_CLOG_GROUP_UPDATE,
+    WAIT_EVENT_EXECUTE_GATHER,
+    WAIT_EVENT_HASH_BATCH_ALLOCATING,
+    WAIT_EVENT_HASH_BATCH_ELECTING,
+    WAIT_EVENT_HASH_BATCH_LOADING,
+    WAIT_EVENT_HASH_BUILD_ALLOCATING,
+    WAIT_EVENT_HASH_BUILD_ELECTING,
+    WAIT_EVENT_HASH_BUILD_HASHING_INNER,
+    WAIT_EVENT_HASH_BUILD_HASHING_OUTER,
+    WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING,
+    WAIT_EVENT_HASH_GROW_BATCHES_DECIDING,
+    WAIT_EVENT_HASH_GROW_BATCHES_ELECTING,
+    WAIT_EVENT_HASH_GROW_BATCHES_FINISHING,
+    WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING,
+    WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING,
+    WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING,
+    WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING,
+    WAIT_EVENT_LOGICAL_SYNC_DATA,
+    WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
+    WAIT_EVENT_MQ_INTERNAL,
+    WAIT_EVENT_MQ_PUT_MESSAGE,
+    WAIT_EVENT_MQ_RECEIVE,
+    WAIT_EVENT_MQ_SEND,
+    WAIT_EVENT_PARALLEL_BITMAP_SCAN,
+    WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN,
+    WAIT_EVENT_PARALLEL_FINISH,
+    WAIT_EVENT_PROCARRAY_GROUP_UPDATE,
+    WAIT_EVENT_PROMOTE,
+    WAIT_EVENT_REPLICATION_ORIGIN_DROP,
+    WAIT_EVENT_REPLICATION_SLOT_DROP,
+    WAIT_EVENT_SAFE_SNAPSHOT,
+    WAIT_EVENT_SYNC_REP
+} WaitEventIPC;
+
+/* ----------
+ * Wait Events - Timeout
+ *
+ * Use this category when a process is waiting for a timeout to expire.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
+    WAIT_EVENT_PG_SLEEP,
+    WAIT_EVENT_RECOVERY_APPLY_DELAY
+} WaitEventTimeout;
+
+/* ----------
+ * Wait Events - IO
+ *
+ * Use this category when a process is waiting for an IO.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_BUFFILE_READ = PG_WAIT_IO,
+    WAIT_EVENT_BUFFILE_WRITE,
+    WAIT_EVENT_CONTROL_FILE_READ,
+    WAIT_EVENT_CONTROL_FILE_SYNC,
+    WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
+    WAIT_EVENT_CONTROL_FILE_WRITE,
+    WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE,
+    WAIT_EVENT_COPY_FILE_READ,
+    WAIT_EVENT_COPY_FILE_WRITE,
+    WAIT_EVENT_DATA_FILE_EXTEND,
+    WAIT_EVENT_DATA_FILE_FLUSH,
+    WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC,
+    WAIT_EVENT_DATA_FILE_PREFETCH,
+    WAIT_EVENT_DATA_FILE_READ,
+    WAIT_EVENT_DATA_FILE_SYNC,
+    WAIT_EVENT_DATA_FILE_TRUNCATE,
+    WAIT_EVENT_DATA_FILE_WRITE,
+    WAIT_EVENT_DSM_FILL_ZERO_WRITE,
+    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ,
+    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC,
+    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE,
+    WAIT_EVENT_LOCK_FILE_CREATE_READ,
+    WAIT_EVENT_LOCK_FILE_CREATE_SYNC,
+    WAIT_EVENT_LOCK_FILE_CREATE_WRITE,
+    WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ,
+    WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC,
+    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC,
+    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE,
+    WAIT_EVENT_LOGICAL_REWRITE_SYNC,
+    WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE,
+    WAIT_EVENT_LOGICAL_REWRITE_WRITE,
+    WAIT_EVENT_RELATION_MAP_READ,
+    WAIT_EVENT_RELATION_MAP_SYNC,
+    WAIT_EVENT_RELATION_MAP_WRITE,
+    WAIT_EVENT_REORDER_BUFFER_READ,
+    WAIT_EVENT_REORDER_BUFFER_WRITE,
+    WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ,
+    WAIT_EVENT_REPLICATION_SLOT_READ,
+    WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
+    WAIT_EVENT_REPLICATION_SLOT_SYNC,
+    WAIT_EVENT_REPLICATION_SLOT_WRITE,
+    WAIT_EVENT_SLRU_FLUSH_SYNC,
+    WAIT_EVENT_SLRU_READ,
+    WAIT_EVENT_SLRU_SYNC,
+    WAIT_EVENT_SLRU_WRITE,
+    WAIT_EVENT_SNAPBUILD_READ,
+    WAIT_EVENT_SNAPBUILD_SYNC,
+    WAIT_EVENT_SNAPBUILD_WRITE,
+    WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC,
+    WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE,
+    WAIT_EVENT_TIMELINE_HISTORY_READ,
+    WAIT_EVENT_TIMELINE_HISTORY_SYNC,
+    WAIT_EVENT_TIMELINE_HISTORY_WRITE,
+    WAIT_EVENT_TWOPHASE_FILE_READ,
+    WAIT_EVENT_TWOPHASE_FILE_SYNC,
+    WAIT_EVENT_TWOPHASE_FILE_WRITE,
+    WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ,
+    WAIT_EVENT_WAL_BOOTSTRAP_SYNC,
+    WAIT_EVENT_WAL_BOOTSTRAP_WRITE,
+    WAIT_EVENT_WAL_COPY_READ,
+    WAIT_EVENT_WAL_COPY_SYNC,
+    WAIT_EVENT_WAL_COPY_WRITE,
+    WAIT_EVENT_WAL_INIT_SYNC,
+    WAIT_EVENT_WAL_INIT_WRITE,
+    WAIT_EVENT_WAL_READ,
+    WAIT_EVENT_WAL_SYNC,
+    WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
+    WAIT_EVENT_WAL_WRITE
+} WaitEventIO;
+
+/* ----------
+ * Command type for progress reporting purposes
+ * ----------
+ */
+typedef enum ProgressCommandType
+{
+    PROGRESS_COMMAND_INVALID,
+    PROGRESS_COMMAND_VACUUM
+} ProgressCommandType;
+
+#define PGSTAT_NUM_PROGRESS_PARAM    10
+
+/* ----------
+ * Shared-memory data structures
+ * ----------
+ */
+
+
+/*
+ * PgBackendSSLStatus
+ *
+ * For each backend, we keep the SSL status in a separate struct, that
+ * is only filled in if SSL is enabled.
+ */
+typedef struct PgBackendSSLStatus
+{
+    /* Information about SSL connection */
+    int            ssl_bits;
+    bool        ssl_compression;
+    char        ssl_version[NAMEDATALEN];    /* MUST be null-terminated */
+    char        ssl_cipher[NAMEDATALEN];    /* MUST be null-terminated */
+    char        ssl_clientdn[NAMEDATALEN];    /* MUST be null-terminated */
+} PgBackendSSLStatus;
+
+
+/* ----------
+ * PgBackendStatus
+ *
+ * Each live backend maintains a PgBackendStatus struct in shared memory
+ * showing its current activity.  (The structs are allocated according to
+ * BackendId, but that is not critical.)  Note that the collector process
+ * has no involvement in, or even access to, these structs.
+ *
+ * Each auxiliary process also maintains a PgBackendStatus struct in shared
+ * memory.
+ * ----------
+ */
+typedef struct PgBackendStatus
+{
+    /*
+     * To avoid locking overhead, we use the following protocol: a backend
+     * increments st_changecount before modifying its entry, and again after
+     * finishing a modification.  A would-be reader should note the value of
+     * st_changecount, copy the entry into private memory, then check
+     * st_changecount again.  If the value hasn't changed, and if it's even,
+     * the copy is valid; otherwise start over.  This makes updates cheap
+     * while reads are potentially expensive, but that's the tradeoff we want.
+     *
+     * The above protocol needs memory barriers to ensure that the apparent
+     * order of execution is the intended one.  Otherwise, for example, on a
+     * machine with weak memory ordering the CPU might reorder the code so
+     * that st_changecount is incremented twice before the modification,
+     * and a reader could then accept a partially updated copy.
+     */
+    int            st_changecount;
+
+    /* The entry is valid iff st_procpid > 0, unused if st_procpid == 0 */
+    int            st_procpid;
+
+    /* Type of backend */
+    BackendType st_backendType;
+
+    /* Times when current backend, transaction, and activity started */
+    TimestampTz st_proc_start_timestamp;
+    TimestampTz st_xact_start_timestamp;
+    TimestampTz st_activity_start_timestamp;
+    TimestampTz st_state_start_timestamp;
+
+    /* Database OID, owning user's OID, connection client address */
+    Oid            st_databaseid;
+    Oid            st_userid;
+    SockAddr    st_clientaddr;
+    char       *st_clienthostname;    /* MUST be null-terminated */
+
+    /* Information about SSL connection */
+    bool        st_ssl;
+    PgBackendSSLStatus *st_sslstatus;
+
+    /* current state */
+    BackendState st_state;
+
+    /* application name; MUST be null-terminated */
+    char       *st_appname;
+
+    /*
+     * Current command string; MUST be null-terminated. Note that this string
+     * may be truncated in the middle of a multi-byte character. As activity
+     * strings are stored more frequently than they are read, that lets us
+     * move the cost of correct truncation to the display side. Use
+     * pgstat_clip_activity() to truncate correctly.
+     */
+    char       *st_activity_raw;
+
+    /*
+     * Command progress reporting.  Any command which wishes can advertise
+     * that it is running by setting st_progress_command,
+     * st_progress_command_target, and st_progress_param[].
+     * st_progress_command_target should be the OID of the relation which the
+     * command targets (we assume there's just one, as this is meant for
+     * utility commands), but the meaning of each element in the
+     * st_progress_param array is command-specific.
+     */
+    ProgressCommandType st_progress_command;
+    Oid            st_progress_command_target;
+    int64        st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
+} PgBackendStatus;
+
+/*
+ * Macros to load and store st_changecount with the memory barriers.
+ *
+ * pgstat_increment_changecount_before() and
+ * pgstat_increment_changecount_after() need to be called before and after
+ * PgBackendStatus entries are modified, respectively. This makes sure that
+ * st_changecount is incremented around the modification.
+ *
+ * Also pgstat_save_changecount_before() and pgstat_save_changecount_after()
+ * need to be called before and after PgBackendStatus entries are copied into
+ * private memory, respectively.
+ */
+#define pgstat_increment_changecount_before(beentry)    \
+    do {    \
+        beentry->st_changecount++;    \
+        pg_write_barrier(); \
+    } while (0)
+
+#define pgstat_increment_changecount_after(beentry) \
+    do {    \
+        pg_write_barrier(); \
+        beentry->st_changecount++;    \
+        Assert((beentry->st_changecount & 1) == 0); \
+    } while (0)
+
+#define pgstat_save_changecount_before(beentry, save_changecount)    \
+    do {    \
+        save_changecount = beentry->st_changecount; \
+        pg_read_barrier();    \
+    } while (0)
+
+#define pgstat_save_changecount_after(beentry, save_changecount)    \
+    do {    \
+        pg_read_barrier();    \
+        save_changecount = beentry->st_changecount; \
+    } while (0)
+
+/* ----------
+ * LocalPgBackendStatus
+ *
+ * When we build the backend status array, we use LocalPgBackendStatus to be
+ * able to add new values to the struct when needed without adding new fields
+ * to the shared memory. It contains the backend status as a first member.
+ * ----------
+ */
+typedef struct LocalPgBackendStatus
+{
+    /*
+     * Local version of the backend status entry.
+     */
+    PgBackendStatus backendStatus;
+
+    /*
+     * The xid of the current transaction if available, InvalidTransactionId
+     * if not.
+     */
+    TransactionId backend_xid;
+
+    /*
+     * The xmin of the current session if available, InvalidTransactionId if
+     * not.
+     */
+    TransactionId backend_xmin;
+} LocalPgBackendStatus;
+
+/* ----------
+ * GUC parameters
+ * ----------
+ */
+extern bool pgstat_track_activities;
+extern PGDLLIMPORT int pgstat_track_activity_query_size;
+
+/* ----------
+ * Functions called from backends
+ * ----------
+ */
+extern void pgstat_bestatus_clear_snapshot(void);
+extern void pgstat_bearray_initialize(void);
+extern void pgstat_bestart(void);
+
+extern const char *pgstat_get_wait_event(uint32 wait_event_info);
+extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
+extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
+extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
+                                    int buflen);
+extern const char *pgstat_get_backend_desc(BackendType backendType);
+
+extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
+                              Oid relid);
+extern void pgstat_progress_update_param(int index, int64 val);
+extern void pgstat_progress_update_multi_param(int nparam, const int *index,
+                                   const int64 *val);
+extern void pgstat_progress_end_command(void);
+
+extern char *pgstat_clip_activity(const char *raw_activity);
+
+/* ----------
+ * pgstat_report_wait_start() -
+ *
+ *    Called from places where a server process needs to wait, in order to
+ *    report wait event information.  The wait information is stored as
+ *    4 bytes, where the first byte represents the wait event class (the
+ *    type of wait; see the PG_WAIT_* wait class definitions) and the next
+ *    3 bytes represent the actual wait event.  Currently 2 bytes are used
+ *    for the wait event, which is sufficient for current usage; 1 byte is
+ *    reserved for future use.
+ *
+ * NB: this *must* be able to survive being called before MyProc has been
+ * initialized.
+ * ----------
+ */
+static inline void
+pgstat_report_wait_start(uint32 wait_event_info)
+{
+    volatile PGPROC *proc = MyProc;
+
+    if (!pgstat_track_activities || !proc)
+        return;
+
+    /*
+     * Since this is a four-byte field which is always read and written as
+     * four-bytes, updates are atomic.
+     */
+    proc->wait_event_info = wait_event_info;
+}
+
+/* ----------
+ * pgstat_report_wait_end() -
+ *
+ *    Called to report end of a wait.
+ *
+ * NB: this *must* be able to survive being called before MyProc has been
+ * initialized.
+ * ----------
+ */
+static inline void
+pgstat_report_wait_end(void)
+{
+    volatile PGPROC *proc = MyProc;
+
+    if (!pgstat_track_activities || !proc)
+        return;
+
+    /*
+     * Since this is a four-byte field which is always read and written as
+     * four-bytes, updates are atomic.
+     */
+    proc->wait_event_info = 0;
+}
+extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
+extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
+extern int    pgstat_fetch_stat_numbackends(void);
+
+/* For shared memory allocation/initialization */
+extern Size BackendStatusShmemSize(void);
+extern void CreateSharedBackendStatus(void);
+
+extern void pgstat_report_xact_timestamp(TimestampTz tstamp);
+extern void pgstat_bestat_initialize(void);
+
+extern void pgstat_report_activity(BackendState state, const char *cmd_str);
+extern void pgstat_report_appname(const char *appname);
+extern void pgstat_report_xact_timestamp(TimestampTz tstamp);
+extern const char *pgstat_get_wait_event(uint32 wait_event_info);
+extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
+extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
+extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
+                                    int buflen);
+extern const char *pgstat_get_backend_desc(BackendType backendType);
+
+extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
+                              Oid relid);
+extern void pgstat_progress_update_param(int index, int64 val);
+extern void pgstat_progress_update_multi_param(int nparam, const int *index,
+                                   const int64 *val);
+extern void pgstat_progress_end_command(void);
+
+#endif                            /* BESTATUS_H */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index bd09937457..5583f92902 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL statistics collector facility.
  *
  *    Copyright (c) 2001-2018, PostgreSQL Global Development Group
  *
@@ -14,11 +14,8 @@
 #include "datatype/timestamp.h"
 #include "fmgr.h"
 #include "lib/dshash.h"
-#include "libpq/pqcomm.h"
-#include "port/atomics.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
-#include "storage/proc.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -94,12 +91,11 @@ typedef enum PgStat_Single_Reset_Type
     RESET_FUNCTION
 } PgStat_Single_Reset_Type;
 
+
 /* ------------------------------------------------------------
  * Structures kept in backend local memory while accumulating counts
  * ------------------------------------------------------------
  */
-
-
 /* ----------
  * PgStat_TableStatus            Per-table status within a backend
  *
@@ -167,10 +163,10 @@ typedef struct PgStat_BgWriter
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to apply.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when applying to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -203,7 +199,7 @@ typedef struct PgStat_FunctionEntry
 } PgStat_FunctionEntry;
 
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Statistic collector data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -307,7 +303,7 @@ typedef struct PgStat_StatFuncEntry
 
 
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -323,7 +319,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -341,422 +337,6 @@ typedef struct PgStat_GlobalStats
     TimestampTz stat_reset_timestamp;
 } PgStat_GlobalStats;
 
-
-/* ----------
- * Backend types
- * ----------
- */
-typedef enum BackendType
-{
-    B_AUTOVAC_LAUNCHER,
-    B_AUTOVAC_WORKER,
-    B_BACKEND,
-    B_BG_WORKER,
-    B_BG_WRITER,
-    B_ARCHIVER,
-    B_CHECKPOINTER,
-    B_STARTUP,
-    B_WAL_RECEIVER,
-    B_WAL_SENDER,
-    B_WAL_WRITER
-} BackendType;
-
-
-/* ----------
- * Backend states
- * ----------
- */
-typedef enum BackendState
-{
-    STATE_UNDEFINED,
-    STATE_IDLE,
-    STATE_RUNNING,
-    STATE_IDLEINTRANSACTION,
-    STATE_FASTPATH,
-    STATE_IDLEINTRANSACTION_ABORTED,
-    STATE_DISABLED
-} BackendState;
-
-
-/* ----------
- * Wait Classes
- * ----------
- */
-#define PG_WAIT_LWLOCK                0x01000000U
-#define PG_WAIT_LOCK                0x03000000U
-#define PG_WAIT_BUFFER_PIN            0x04000000U
-#define PG_WAIT_ACTIVITY            0x05000000U
-#define PG_WAIT_CLIENT                0x06000000U
-#define PG_WAIT_EXTENSION            0x07000000U
-#define PG_WAIT_IPC                    0x08000000U
-#define PG_WAIT_TIMEOUT                0x09000000U
-#define PG_WAIT_IO                    0x0A000000U
-
-/* ----------
- * Wait Events - Activity
- *
- * Use this category when a process is waiting because it has no work to do,
- * unless the "Client" or "Timeout" category describes the situation better.
- * Typically, this should only be used for background processes.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_ARCHIVER_MAIN = PG_WAIT_ACTIVITY,
-    WAIT_EVENT_AUTOVACUUM_MAIN,
-    WAIT_EVENT_BGWRITER_HIBERNATE,
-    WAIT_EVENT_BGWRITER_MAIN,
-    WAIT_EVENT_CHECKPOINTER_MAIN,
-    WAIT_EVENT_LOGICAL_APPLY_MAIN,
-    WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
-    WAIT_EVENT_RECOVERY_WAL_ALL,
-    WAIT_EVENT_RECOVERY_WAL_STREAM,
-    WAIT_EVENT_SYSLOGGER_MAIN,
-    WAIT_EVENT_WAL_RECEIVER_MAIN,
-    WAIT_EVENT_WAL_SENDER_MAIN,
-    WAIT_EVENT_WAL_WRITER_MAIN
-} WaitEventActivity;
-
-/* ----------
- * Wait Events - Client
- *
- * Use this category when a process is waiting to send data to or receive data
- * from the frontend process to which it is connected.  This is never used for
- * a background process, which has no client connection.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_CLIENT_READ = PG_WAIT_CLIENT,
-    WAIT_EVENT_CLIENT_WRITE,
-    WAIT_EVENT_LIBPQWALRECEIVER_CONNECT,
-    WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE,
-    WAIT_EVENT_SSL_OPEN_SERVER,
-    WAIT_EVENT_WAL_RECEIVER_WAIT_START,
-    WAIT_EVENT_WAL_SENDER_WAIT_WAL,
-    WAIT_EVENT_WAL_SENDER_WRITE_DATA
-} WaitEventClient;
-
-/* ----------
- * Wait Events - IPC
- *
- * Use this category when a process cannot complete the work it is doing because
- * it is waiting for a notification from another process.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_BGWORKER_SHUTDOWN = PG_WAIT_IPC,
-    WAIT_EVENT_BGWORKER_STARTUP,
-    WAIT_EVENT_BTREE_PAGE,
-    WAIT_EVENT_CLOG_GROUP_UPDATE,
-    WAIT_EVENT_EXECUTE_GATHER,
-    WAIT_EVENT_HASH_BATCH_ALLOCATING,
-    WAIT_EVENT_HASH_BATCH_ELECTING,
-    WAIT_EVENT_HASH_BATCH_LOADING,
-    WAIT_EVENT_HASH_BUILD_ALLOCATING,
-    WAIT_EVENT_HASH_BUILD_ELECTING,
-    WAIT_EVENT_HASH_BUILD_HASHING_INNER,
-    WAIT_EVENT_HASH_BUILD_HASHING_OUTER,
-    WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING,
-    WAIT_EVENT_HASH_GROW_BATCHES_DECIDING,
-    WAIT_EVENT_HASH_GROW_BATCHES_ELECTING,
-    WAIT_EVENT_HASH_GROW_BATCHES_FINISHING,
-    WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING,
-    WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING,
-    WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING,
-    WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING,
-    WAIT_EVENT_LOGICAL_SYNC_DATA,
-    WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
-    WAIT_EVENT_MQ_INTERNAL,
-    WAIT_EVENT_MQ_PUT_MESSAGE,
-    WAIT_EVENT_MQ_RECEIVE,
-    WAIT_EVENT_MQ_SEND,
-    WAIT_EVENT_PARALLEL_BITMAP_SCAN,
-    WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN,
-    WAIT_EVENT_PARALLEL_FINISH,
-    WAIT_EVENT_PROCARRAY_GROUP_UPDATE,
-    WAIT_EVENT_PROMOTE,
-    WAIT_EVENT_REPLICATION_ORIGIN_DROP,
-    WAIT_EVENT_REPLICATION_SLOT_DROP,
-    WAIT_EVENT_SAFE_SNAPSHOT,
-    WAIT_EVENT_SYNC_REP
-} WaitEventIPC;
-
-/* ----------
- * Wait Events - Timeout
- *
- * Use this category when a process is waiting for a timeout to expire.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
-    WAIT_EVENT_PG_SLEEP,
-    WAIT_EVENT_RECOVERY_APPLY_DELAY
-} WaitEventTimeout;
-
-/* ----------
- * Wait Events - IO
- *
- * Use this category when a process is waiting for a IO.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_BUFFILE_READ = PG_WAIT_IO,
-    WAIT_EVENT_BUFFILE_WRITE,
-    WAIT_EVENT_CONTROL_FILE_READ,
-    WAIT_EVENT_CONTROL_FILE_SYNC,
-    WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
-    WAIT_EVENT_CONTROL_FILE_WRITE,
-    WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE,
-    WAIT_EVENT_COPY_FILE_READ,
-    WAIT_EVENT_COPY_FILE_WRITE,
-    WAIT_EVENT_DATA_FILE_EXTEND,
-    WAIT_EVENT_DATA_FILE_FLUSH,
-    WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC,
-    WAIT_EVENT_DATA_FILE_PREFETCH,
-    WAIT_EVENT_DATA_FILE_READ,
-    WAIT_EVENT_DATA_FILE_SYNC,
-    WAIT_EVENT_DATA_FILE_TRUNCATE,
-    WAIT_EVENT_DATA_FILE_WRITE,
-    WAIT_EVENT_DSM_FILL_ZERO_WRITE,
-    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ,
-    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC,
-    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE,
-    WAIT_EVENT_LOCK_FILE_CREATE_READ,
-    WAIT_EVENT_LOCK_FILE_CREATE_SYNC,
-    WAIT_EVENT_LOCK_FILE_CREATE_WRITE,
-    WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ,
-    WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC,
-    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC,
-    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE,
-    WAIT_EVENT_LOGICAL_REWRITE_SYNC,
-    WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE,
-    WAIT_EVENT_LOGICAL_REWRITE_WRITE,
-    WAIT_EVENT_RELATION_MAP_READ,
-    WAIT_EVENT_RELATION_MAP_SYNC,
-    WAIT_EVENT_RELATION_MAP_WRITE,
-    WAIT_EVENT_REORDER_BUFFER_READ,
-    WAIT_EVENT_REORDER_BUFFER_WRITE,
-    WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ,
-    WAIT_EVENT_REPLICATION_SLOT_READ,
-    WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
-    WAIT_EVENT_REPLICATION_SLOT_SYNC,
-    WAIT_EVENT_REPLICATION_SLOT_WRITE,
-    WAIT_EVENT_SLRU_FLUSH_SYNC,
-    WAIT_EVENT_SLRU_READ,
-    WAIT_EVENT_SLRU_SYNC,
-    WAIT_EVENT_SLRU_WRITE,
-    WAIT_EVENT_SNAPBUILD_READ,
-    WAIT_EVENT_SNAPBUILD_SYNC,
-    WAIT_EVENT_SNAPBUILD_WRITE,
-    WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC,
-    WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE,
-    WAIT_EVENT_TIMELINE_HISTORY_READ,
-    WAIT_EVENT_TIMELINE_HISTORY_SYNC,
-    WAIT_EVENT_TIMELINE_HISTORY_WRITE,
-    WAIT_EVENT_TWOPHASE_FILE_READ,
-    WAIT_EVENT_TWOPHASE_FILE_SYNC,
-    WAIT_EVENT_TWOPHASE_FILE_WRITE,
-    WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ,
-    WAIT_EVENT_WAL_BOOTSTRAP_SYNC,
-    WAIT_EVENT_WAL_BOOTSTRAP_WRITE,
-    WAIT_EVENT_WAL_COPY_READ,
-    WAIT_EVENT_WAL_COPY_SYNC,
-    WAIT_EVENT_WAL_COPY_WRITE,
-    WAIT_EVENT_WAL_INIT_SYNC,
-    WAIT_EVENT_WAL_INIT_WRITE,
-    WAIT_EVENT_WAL_READ,
-    WAIT_EVENT_WAL_SYNC,
-    WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-    WAIT_EVENT_WAL_WRITE
-} WaitEventIO;
-
-/* ----------
- * Command type for progress reporting purposes
- * ----------
- */
-typedef enum ProgressCommandType
-{
-    PROGRESS_COMMAND_INVALID,
-    PROGRESS_COMMAND_VACUUM
-} ProgressCommandType;
-
-#define PGSTAT_NUM_PROGRESS_PARAM    10
-
-/* ----------
- * Shared-memory data structures
- * ----------
- */
-
-
-/*
- * PgBackendSSLStatus
- *
- * For each backend, we keep the SSL status in a separate struct, that
- * is only filled in if SSL is enabled.
- */
-typedef struct PgBackendSSLStatus
-{
-    /* Information about SSL connection */
-    int            ssl_bits;
-    bool        ssl_compression;
-    char        ssl_version[NAMEDATALEN];    /* MUST be null-terminated */
-    char        ssl_cipher[NAMEDATALEN];    /* MUST be null-terminated */
-    char        ssl_clientdn[NAMEDATALEN];    /* MUST be null-terminated */
-} PgBackendSSLStatus;
-
-
-/* ----------
- * PgBackendStatus
- *
- * Each live backend maintains a PgBackendStatus struct in shared memory
- * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
- * has no involvement in, or even access to, these structs.
- *
- * Each auxiliary process also maintains a PgBackendStatus struct in shared
- * memory.
- * ----------
- */
-typedef struct PgBackendStatus
-{
-    /*
-     * To avoid locking overhead, we use the following protocol: a backend
-     * increments st_changecount before modifying its entry, and again after
-     * finishing a modification.  A would-be reader should note the value of
-     * st_changecount, copy the entry into private memory, then check
-     * st_changecount again.  If the value hasn't changed, and if it's even,
-     * the copy is valid; otherwise start over.  This makes updates cheap
-     * while reads are potentially expensive, but that's the tradeoff we want.
-     *
-     * The above protocol needs the memory barriers to ensure that the
-     * apparent order of execution is as it desires. Otherwise, for example,
-     * the CPU might rearrange the code so that st_changecount is incremented
-     * twice before the modification on a machine with weak memory ordering.
-     * This surprising result can lead to bugs.
-     */
-    int            st_changecount;
-
-    /* The entry is valid iff st_procpid > 0, unused if st_procpid == 0 */
-    int            st_procpid;
-
-    /* Type of backends */
-    BackendType st_backendType;
-
-    /* Times when current backend, transaction, and activity started */
-    TimestampTz st_proc_start_timestamp;
-    TimestampTz st_xact_start_timestamp;
-    TimestampTz st_activity_start_timestamp;
-    TimestampTz st_state_start_timestamp;
-
-    /* Database OID, owning user's OID, connection client address */
-    Oid            st_databaseid;
-    Oid            st_userid;
-    SockAddr    st_clientaddr;
-    char       *st_clienthostname;    /* MUST be null-terminated */
-
-    /* Information about SSL connection */
-    bool        st_ssl;
-    PgBackendSSLStatus *st_sslstatus;
-
-    /* current state */
-    BackendState st_state;
-
-    /* application name; MUST be null-terminated */
-    char       *st_appname;
-
-    /*
-     * Current command string; MUST be null-terminated. Note that this string
-     * possibly is truncated in the middle of a multi-byte character. As
-     * activity strings are stored more frequently than read, that allows to
-     * move the cost of correct truncation to the display side. Use
-     * pgstat_clip_activity() to truncate correctly.
-     */
-    char       *st_activity_raw;
-
-    /*
-     * Command progress reporting.  Any command which wishes can advertise
-     * that it is running by setting st_progress_command,
-     * st_progress_command_target, and st_progress_param[].
-     * st_progress_command_target should be the OID of the relation which the
-     * command targets (we assume there's just one, as this is meant for
-     * utility commands), but the meaning of each element in the
-     * st_progress_param array is command-specific.
-     */
-    ProgressCommandType st_progress_command;
-    Oid            st_progress_command_target;
-    int64        st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
-} PgBackendStatus;
-
-/*
- * Macros to load and store st_changecount with the memory barriers.
- *
- * pgstat_increment_changecount_before() and
- * pgstat_increment_changecount_after() need to be called before and after
- * PgBackendStatus entries are modified, respectively. This makes sure that
- * st_changecount is incremented around the modification.
- *
- * Also pgstat_save_changecount_before() and pgstat_save_changecount_after()
- * need to be called before and after PgBackendStatus entries are copied into
- * private memory, respectively.
- */
-#define pgstat_increment_changecount_before(beentry)    \
-    do {    \
-        beentry->st_changecount++;    \
-        pg_write_barrier(); \
-    } while (0)
-
-#define pgstat_increment_changecount_after(beentry) \
-    do {    \
-        pg_write_barrier(); \
-        beentry->st_changecount++;    \
-        Assert((beentry->st_changecount & 1) == 0); \
-    } while (0)
-
-#define pgstat_save_changecount_before(beentry, save_changecount)    \
-    do {    \
-        save_changecount = beentry->st_changecount; \
-        pg_read_barrier();    \
-    } while (0)
-
-#define pgstat_save_changecount_after(beentry, save_changecount)    \
-    do {    \
-        pg_read_barrier();    \
-        save_changecount = beentry->st_changecount; \
-    } while (0)
-
-/* ----------
- * LocalPgBackendStatus
- *
- * When we build the backend status array, we use LocalPgBackendStatus to be
- * able to add new values to the struct when needed without adding new fields
- * to the shared memory. It contains the backend status as a first member.
- * ----------
- */
-typedef struct LocalPgBackendStatus
-{
-    /*
-     * Local version of the backend status entry.
-     */
-    PgBackendStatus backendStatus;
-
-    /*
-     * The xid of the current transaction if available, InvalidTransactionId
-     * if not.
-     */
-    TransactionId backend_xid;
-
-    /*
-     * The xmin of the current session if available, InvalidTransactionId if
-     * not.
-     */
-    TransactionId backend_xmin;
-} LocalPgBackendStatus;
-
 /*
  * Working state needed to accumulate per-function-call timing statistics.
  */
@@ -778,10 +358,8 @@ typedef struct PgStat_FunctionCallUsage
  * GUC parameters
  * ----------
  */
-extern bool pgstat_track_activities;
 extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
-extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
 
 /*
@@ -826,26 +404,9 @@ extern void pgstat_report_deadlock(void);
 extern void pgstat_clear_snapshot(void);
 
 extern void pgstat_initialize(void);
+extern void pgstat_bearray_initialize(void);
 extern void pgstat_bestart(void);
 
-extern void pgstat_report_activity(BackendState state, const char *cmd_str);
-extern void pgstat_report_tempfile(size_t filesize);
-extern void pgstat_report_appname(const char *appname);
-extern void pgstat_report_xact_timestamp(TimestampTz tstamp);
-extern const char *pgstat_get_wait_event(uint32 wait_event_info);
-extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
-extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
-extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
-                                    int buflen);
-extern const char *pgstat_get_backend_desc(BackendType backendType);
-
-extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
-                              Oid relid);
-extern void pgstat_progress_update_param(int index, int64 val);
-extern void pgstat_progress_update_multi_param(int nparam, const int *index,
-                                   const int64 *val);
-extern void pgstat_progress_end_command(void);
-
 extern PgStat_TableStatus *find_tabstat_entry(Oid rel_id);
 extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 
@@ -856,60 +417,6 @@ extern PgStat_StatDBEntry *backend_get_db_entry(Oid dbid, bool oneshot);
 extern HTAB *backend_snapshot_all_db_entries(void);
 extern PgStat_StatTabEntry *backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid relid, bool oneshot);
 
-/* ----------
- * pgstat_report_wait_start() -
- *
- *    Called from places where server process needs to wait.  This is called
- *    to report wait event information.  The wait information is stored
- *    as 4-bytes where first byte represents the wait event class (type of
- *    wait, for different types of wait, refer WaitClass) and the next
- *    3-bytes represent the actual wait event.  Currently 2-bytes are used
- *    for wait event which is sufficient for current usage, 1-byte is
- *    reserved for future usage.
- *
- * NB: this *must* be able to survive being called before MyProc has been
- * initialized.
- * ----------
- */
-static inline void
-pgstat_report_wait_start(uint32 wait_event_info)
-{
-    volatile PGPROC *proc = MyProc;
-
-    if (!pgstat_track_activities || !proc)
-        return;
-
-    /*
-     * Since this is a four-byte field which is always read and written as
-     * four-bytes, updates are atomic.
-     */
-    proc->wait_event_info = wait_event_info;
-}
-
-/* ----------
- * pgstat_report_wait_end() -
- *
- *    Called to report end of a wait.
- *
- * NB: this *must* be able to survive being called before MyProc has been
- * initialized.
- * ----------
- */
-static inline void
-pgstat_report_wait_end(void)
-{
-    volatile PGPROC *proc = MyProc;
-
-    if (!pgstat_track_activities || !proc)
-        return;
-
-    /*
-     * Since this is a four-byte field which is always read and written as
-     * four-bytes, updates are atomic.
-     */
-    proc->wait_event_info = 0;
-}
-
 /* nontransactional event counts are simple enough to inline */
 
 #define pgstat_count_heap_scan(rel)                                    \
@@ -977,6 +484,8 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 extern void pgstat_update_archiver(const char *xlog, bool failed);
 extern void pgstat_update_bgwriter(void);
 
+extern void pgstat_report_tempfile(size_t filesize);
+
 /* ----------
  * Support functions for the SQL-callable functions to
  * generate the pgstat* views.
@@ -984,10 +493,7 @@ extern void pgstat_update_bgwriter(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid relid, bool oneshot);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
-extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
-extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
-extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
 
-- 
2.16.3
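
As a side note on the patch above: the comment on pgstat_report_wait_start() says the wait information is a 4-byte value whose first byte is the wait class and whose remaining bytes carry the event, with two of them in use today. A minimal sketch of that layout follows. The two PG_WAIT_* values are copied from the definitions in the patch; demo_wait_class() and demo_wait_event_id() are hypothetical helpers made up for illustration, not functions from the tree.

```c
#include <assert.h>
#include <stdint.h>

/* Class values copied from the PG_WAIT_* definitions in the patch. */
#define PG_WAIT_ACTIVITY	0x05000000U
#define PG_WAIT_IO			0x0A000000U

/* Hypothetical helper: the top byte holds the wait class. */
static inline uint32_t
demo_wait_class(uint32_t wait_event_info)
{
	return wait_event_info & 0xFF000000U;
}

/*
 * Hypothetical helper: the low two bytes hold the event number within the
 * class.  The third byte is the one the comment describes as reserved.
 */
static inline uint32_t
demo_wait_event_id(uint32_t wait_event_info)
{
	return wait_event_info & 0x0000FFFFU;
}
```

Since each enum starts at its class value and counts up, WAIT_EVENT_BUFFILE_WRITE (the second member of WaitEventIO) would split into class PG_WAIT_IO and event number 1 under this reading.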

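The st_changecount protocol described in the PgBackendStatus comment (moved, not changed, by the patch above) can be sketched roughly as below. This is only an illustration of the even/odd counter logic: DemoEntry and the demo_* functions are hypothetical names, and the C11 fences merely stand in for pg_write_barrier() and pg_read_barrier(). It is not a drop-in replacement for the real macros.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, much reduced stand-in for PgBackendStatus. */
typedef struct DemoEntry
{
	int			changecount;	/* even = stable, odd = update in progress */
	int			procpid;
	int			state;
} DemoEntry;

/* Writer side: bump the counter before touching the entry. */
static void
demo_begin_update(volatile DemoEntry *e)
{
	e->changecount++;
	atomic_thread_fence(memory_order_release);	/* stands in for pg_write_barrier() */
}

/* Writer side: bump the counter again once the entry is consistent. */
static void
demo_end_update(volatile DemoEntry *e)
{
	atomic_thread_fence(memory_order_release);	/* stands in for pg_write_barrier() */
	e->changecount++;
}

/*
 * Reader side: one attempt.  Returns true only if the counter was even and
 * unchanged across the copy; a real reader would loop until this succeeds.
 */
static bool
demo_try_read(volatile DemoEntry *e, DemoEntry *copy)
{
	int			before = e->changecount;
	int			after;

	atomic_thread_fence(memory_order_acquire);	/* stands in for pg_read_barrier() */
	copy->procpid = e->procpid;
	copy->state = e->state;
	atomic_thread_fence(memory_order_acquire);	/* stands in for pg_read_barrier() */
	after = e->changecount;

	return before == after && (before & 1) == 0;
}
```

The design point is the one the comment makes: the writer pays only two increments and two barriers, while the reader bears the cost of retrying.
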
From 34a990c156ce80321d007e3700c0ff4092dfe1db Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 12 Nov 2018 17:12:46 +0900
Subject: [PATCH 5/7] Remove statistics temporary directory.

The server used to store temporary statistics files in a dedicated
directory; the statistics collector wrote them and backends read them.
That was the mechanism for sharing statistics data, but it is no longer
needed now that the data lives in shared memory, so remove it.  Several
modules that referred to or used the directory need adjusting as well:
initdb no longer creates the directory; basebackup.c and pg_rewind
excluded the directory, so the related code is simply removed; and
pg_stat_statements placed its temporary query text file there, so that
file is moved to the "permanent" stats file directory.
pg_stat_statements still needs a proper fix later.
---
 contrib/pg_stat_statements/pg_stat_statements.c | 11 +++-----
 src/backend/postmaster/pgstat.c                 |  6 -----
 src/backend/replication/basebackup.c            | 36 -------------------------
 src/bin/initdb/initdb.c                         |  1 -
 src/bin/pg_rewind/filemap.c                     |  7 -----
 src/include/pgstat.h                            |  3 ---
 6 files changed, 4 insertions(+), 60 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 33f9a79f54..ec2fa9881c 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -86,14 +86,11 @@ PG_MODULE_MAGIC;
 #define PGSS_DUMP_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pg_stat_statements.stat"
 
 /*
- * Location of external query text file.  We don't keep it in the core
- * system's stats_temp_directory.  The core system can safely use that GUC
- * setting, because the statistics collector temp file paths are set only once
- * as part of changing the GUC, but pg_stat_statements has no way of avoiding
- * race conditions.  Besides, we only expect modest, infrequent I/O for query
- * strings, so placing the file on a faster filesystem is not compelling.
+ * Location of external query text file. We only expect modest, infrequent I/O
+ * for query strings, so placing the file on a faster filesystem is not
+ * compelling.
  */
-#define PGSS_TEXT_FILE    PG_STAT_TMP_DIR "/pgss_query_texts.stat"
+#define PGSS_TEXT_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pgss_query_texts.stat"
 
 /* Magic number identifying the stats file format */
 static const uint32 PGSS_FILE_HEADER = 0x20171004;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7f8c2111d0..2d3f7cb898 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -89,12 +89,6 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
- */
-char       *pgstat_stat_directory = NULL;
-
 /* Shared stats bootstrap infomation */
 typedef struct StatsShmemStruct {
     dsa_handle stats_dsa_handle;
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index b20f6c379c..20cf33354a 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -77,9 +77,6 @@ static bool is_checksummed_file(const char *fullpath, const char *filename);
 /* Was the backup currently in-progress initiated in recovery mode? */
 static bool backup_started_in_recovery = false;
 
-/* Relative path of temporary statistics directory */
-static char *statrelpath = NULL;
-
 /*
  * Size of each block sent into the tar stream for larger files.
  */
@@ -121,13 +118,6 @@ static bool noverify_checksums = false;
  */
 static const char *excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    PG_STAT_TMP_DIR,
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another master. See backup.sgml
@@ -223,11 +213,8 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -254,18 +241,6 @@ perform_base_backup(basebackup_options *opt)
 
         SendXlogRecPtrResult(startptr, starttli);
 
-        /*
-         * Calculate the relative path of temporary statistics directory in
-         * order to skip the files which are located in that directory later.
-         */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
-
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
         ti->size = opt->progress ? sendDir(".", 1, true, tablespaces, true) : -1;
@@ -1174,17 +1149,6 @@ sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces,
         if (excludeFound)
             continue;
 
-        /*
-         * Exclude contents of directory specified by statrelpath if not set
-         * to the default (pg_stat_tmp) which is caught in the loop above.
-         */
-        if (statrelpath != NULL && strcmp(pathbuf, statrelpath) == 0)
-        {
-            elog(DEBUG1, "contents of directory \"%s\" excluded from backup", statrelpath);
-            size += _tarWriteDir(pathbuf, basepathlen, &statbuf, sizeonly);
-            continue;
-        }
-
         /*
          * We can skip pg_wal, the WAL segments need to be fetched from the
          * WAL archive anyway. But include it as an empty directory anyway, so
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index ab5cb7f0c1..f13b2dde6b 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -217,7 +217,6 @@ static const char *const subdirs[] = {
     "pg_replslot",
     "pg_tblspc",
     "pg_stat",
-    "pg_stat_tmp",
     "pg_xact",
     "pg_logical",
     "pg_logical/snapshots",
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 222b56f58a..ef2d594c91 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -43,13 +43,6 @@ static bool check_file_excluded(const char *path, bool is_source);
  */
 static const char *excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    "pg_stat_tmp",                /* defined as PG_STAT_TMP_DIR */
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another master. See backup.sgml
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 8ed01499e5..bd09937457 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -31,9 +31,6 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
-#define PG_STAT_TMP_DIR        "pg_stat_tmp"
-
 /* Values for track_functions GUC variable --- order is significant! */
 typedef enum TrackFunctionsLevel
 {
-- 
2.16.3

From c4b92df8f6010cce2c0ecc26be905b5ec83f6023 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 12 Nov 2018 17:26:33 +0900
Subject: [PATCH 4/7] Shared-memory based stats collector

This replaces the means of sharing server statistics, moving from files
to dynamic shared memory. Every backend directly reads and writes the
stats tables. The stats collector process is removed.  Shared stats are
updated at intervals no shorter than 500ms and no longer than 1s. If
the shared memory data is busy and a backend cannot obtain the lock
immediately, the differences are usually stashed in "pending stats" in
local memory and merged into the shared numbers at the next
opportunity.
---
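A rough sketch of the "pending stats" idea from the message above, for readers who want the shape of it before reading the diff. Everything here is an assumption made for illustration: the struct and function names are hypothetical, a C11 atomic_flag stands in for the lock the patch actually takes on the shared tables, and the 500ms/1s timing is left out entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared counter guarded by a try-lock. */
typedef struct DemoSharedStats
{
	atomic_flag lock;			/* stand-in for the real shared-memory lock */
	int64_t		tuples_inserted;
} DemoSharedStats;

/* Hypothetical backend-local "pending stats". */
typedef struct DemoPendingStats
{
	int64_t		tuples_inserted;
} DemoPendingStats;

/*
 * Add delta to the pending counts, then try to merge them into the shared
 * counter.  If the lock is busy, leave the counts pending and return false
 * so that the caller can retry at the next opportunity.
 */
static bool
demo_flush_stats(DemoSharedStats *shared, DemoPendingStats *pending,
				 int64_t delta)
{
	pending->tuples_inserted += delta;

	/* test_and_set returns the previous value: true means someone holds it */
	if (atomic_flag_test_and_set_explicit(&shared->lock, memory_order_acquire))
		return false;

	shared->tuples_inserted += pending->tuples_inserted;
	pending->tuples_inserted = 0;

	atomic_flag_clear_explicit(&shared->lock, memory_order_release);
	return true;
}
```

Nothing is lost when the lock is busy; the delta simply rides along in local memory until a later call succeeds.
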
 src/backend/access/transam/xlog.c             |    4 +-
 src/backend/postmaster/autovacuum.c           |   59 +-
 src/backend/postmaster/bgwriter.c             |    4 +-
 src/backend/postmaster/checkpointer.c         |   24 +-
 src/backend/postmaster/pgarch.c               |    4 +-
 src/backend/postmaster/pgstat.c               | 4204 ++++++++++---------------
 src/backend/postmaster/postmaster.c           |   80 +-
 src/backend/replication/logical/tablesync.c   |    9 +-
 src/backend/replication/logical/worker.c      |    4 +-
 src/backend/storage/buffer/bufmgr.c           |    8 +-
 src/backend/storage/ipc/dsm.c                 |   24 +-
 src/backend/storage/ipc/ipci.c                |    6 +
 src/backend/storage/lmgr/lwlock.c             |    3 +
 src/backend/storage/lmgr/lwlocknames.txt      |    1 +
 src/backend/tcop/postgres.c                   |   27 +-
 src/backend/utils/adt/pgstatfuncs.c           |   50 +-
 src/backend/utils/init/globals.c              |    1 +
 src/backend/utils/init/postinit.c             |   11 +
 src/backend/utils/misc/guc.c                  |   41 -
 src/backend/utils/misc/postgresql.conf.sample |    1 -
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |    2 +-
 src/include/miscadmin.h                       |    2 +-
 src/include/pgstat.h                          |  437 +--
 src/include/storage/dsm.h                     |    3 +
 src/include/storage/lwlock.h                  |    3 +
 src/include/utils/timeout.h                   |    1 +
 src/test/modules/worker_spi/worker_spi.c      |    2 +-
 27 files changed, 1928 insertions(+), 3087 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7eed5866d2..e52ae54821 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8587,9 +8587,9 @@ LogCheckpointEnd(bool restartpoint)
                         &sync_secs, &sync_usecs);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time +=
+    BgWriterStats.checkpoint_write_time +=
         write_secs * 1000 + write_usecs / 1000;
-    BgWriterStats.m_checkpoint_sync_time +=
+    BgWriterStats.checkpoint_sync_time +=
         sync_secs * 1000 + sync_usecs / 1000;
 
     /*
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 978089575b..10e707e9a1 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -977,7 +977,7 @@ rebuild_database_list(Oid newdb)
         PgStat_StatDBEntry *entry;
 
         /* only consider this database if it has a pgstat entry */
-        entry = pgstat_fetch_stat_dbentry(newdb);
+        entry = pgstat_fetch_stat_dbentry(newdb, true);
         if (entry != NULL)
         {
             /* we assume it isn't found because the hash was just created */
@@ -986,6 +986,7 @@ rebuild_database_list(Oid newdb)
             /* hash_search already filled in the key */
             db->adl_score = score++;
             /* next_worker is filled in later */
+            pfree(entry);
         }
     }
 
@@ -1001,7 +1002,7 @@ rebuild_database_list(Oid newdb)
          * skip databases with no stat entries -- in particular, this gets rid
          * of dropped databases
          */
-        entry = pgstat_fetch_stat_dbentry(avdb->adl_datid);
+        entry = pgstat_fetch_stat_dbentry(avdb->adl_datid, true);
         if (entry == NULL)
             continue;
 
@@ -1013,6 +1014,7 @@ rebuild_database_list(Oid newdb)
             db->adl_score = score++;
             /* next_worker is filled in later */
         }
+        pfree(entry);
     }
 
     /* finally, insert all qualifying databases not previously inserted */
@@ -1025,7 +1027,7 @@ rebuild_database_list(Oid newdb)
         PgStat_StatDBEntry *entry;
 
         /* only consider databases with a pgstat entry */
-        entry = pgstat_fetch_stat_dbentry(avdb->adw_datid);
+        entry = pgstat_fetch_stat_dbentry(avdb->adw_datid, true);
         if (entry == NULL)
             continue;
 
@@ -1037,6 +1039,7 @@ rebuild_database_list(Oid newdb)
             db->adl_score = score++;
             /* next_worker is filled in later */
         }
+        pfree(entry);
     }
     nelems = score;
 
@@ -1235,7 +1238,7 @@ do_start_worker(void)
             continue;            /* ignore not-at-risk DBs */
 
         /* Find pgstat entry if any */
-        tmp->adw_entry = pgstat_fetch_stat_dbentry(tmp->adw_datid);
+        tmp->adw_entry = pgstat_fetch_stat_dbentry(tmp->adw_datid, true);
 
         /*
          * Skip a database with no pgstat entry; it means it hasn't seen any
@@ -1273,16 +1276,22 @@ do_start_worker(void)
                 break;
             }
         }
-        if (skipit)
-            continue;
+        if (!skipit)
+        {
+            /* Remember the db with oldest autovac time. */
+            if (avdb == NULL ||
+                tmp->adw_entry->last_autovac_time <
+                avdb->adw_entry->last_autovac_time)
+            {
+                if (avdb)
+                    pfree(avdb->adw_entry);
+                avdb = tmp;
+            }
+        }
 
-        /*
-         * Remember the db with oldest autovac time.  (If we are here, both
-         * tmp->entry and db->entry must be non-null.)
-         */
-        if (avdb == NULL ||
-            tmp->adw_entry->last_autovac_time < avdb->adw_entry->last_autovac_time)
-            avdb = tmp;
+        /* Immediately free it if not used */
+        if (avdb != tmp)
+            pfree(tmp->adw_entry);
     }
 
     /* Found a database -- process it */
@@ -1971,7 +1980,7 @@ do_autovacuum(void)
      * may be NULL if we couldn't find an entry (only happens if we are
      * forcing a vacuum for anti-wrap purposes).
      */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, true);
 
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
@@ -2021,7 +2030,7 @@ do_autovacuum(void)
     MemoryContextSwitchTo(AutovacMemCxt);
 
     /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
+    shared = pgstat_fetch_stat_dbentry(InvalidOid, true);
 
     classRel = heap_open(RelationRelationId, AccessShareLock);
 
@@ -2107,6 +2116,8 @@ do_autovacuum(void)
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
                                   &dovacuum, &doanalyze, &wraparound);
+        if (tabentry)
+            pfree(tabentry);
 
         /* Relations that need work are added to table_oids */
         if (dovacuum || doanalyze)
@@ -2186,10 +2197,11 @@ do_autovacuum(void)
         /* Fetch the pgstat entry for this table */
         tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
                                              shared, dbentry);
-
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
                                   &dovacuum, &doanalyze, &wraparound);
+        if (tabentry)
+            pfree(tabentry);
 
         /* ignore analyze for toast tables */
         if (dovacuum)
@@ -2758,12 +2770,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
     if (isshared)
     {
         if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
+            tabentry = backend_get_tab_entry(shared, relid, true);
     }
     else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
+        tabentry = backend_get_tab_entry(dbentry, relid, true);
 
     return tabentry;
 }
@@ -2795,8 +2805,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     /* use fresh stats */
     autovac_refresh_stats();
 
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    shared = pgstat_fetch_stat_dbentry(InvalidOid, true);
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, true);
 
     /* fetch the relation's relcache entry */
     classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
@@ -2827,6 +2837,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
                               &dovacuum, &doanalyze, &wraparound);
+    if (tabentry)
+        pfree(tabentry);
 
     /* ignore ANALYZE for toast tables */
     if (classForm->relkind == RELKIND_TOASTVALUE)
@@ -2917,7 +2929,10 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     }
 
     heap_freetuple(classTup);
-
+    if (shared)
+        pfree(shared);
+    if (dbentry)
+        pfree(dbentry);
     return tab;
 }
 
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index b1e9bb2c53..a4b1079e60 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -271,9 +271,9 @@ BackgroundWriterMain(void)
         can_hibernate = BgBufferSync(&wb_context);
 
         /*
-         * Send off activity statistics to the stats collector
+         * Update activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_update_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 1a033093c5..9235390bc6 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -376,7 +376,7 @@ CheckpointerMain(void)
         {
             checkpoint_requested = false;
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            BgWriterStats.requested_checkpoints++;
         }
         if (shutdown_requested)
         {
@@ -402,7 +402,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                BgWriterStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -520,13 +520,13 @@ CheckpointerMain(void)
         CheckArchiveTimeout();
 
         /*
-         * Send off activity statistics to the stats collector.  (The reason
-         * why we re-use bgwriter-related code for this is that the bgwriter
-         * and checkpointer used to be just one process.  It's probably not
-         * worth the trouble to split the stats support into two independent
-         * stats message types.)
+         * Update activity statistics.  (The reason why we re-use
+         * bgwriter-related code for this is that the bgwriter and
+         * checkpointer used to be just one process.  It's probably not worth
+         * the trouble to split the stats support into two independent
+         * functions.)
          */
-        pgstat_send_bgwriter();
+        pgstat_update_bgwriter();
 
         /*
          * Sleep until we are signaled or it's time for another checkpoint or
@@ -694,9 +694,9 @@ CheckpointWriteDelay(int flags, double progress)
         CheckArchiveTimeout();
 
         /*
-         * Report interim activity statistics to the stats collector.
+         * Register interim activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_update_bgwriter();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1296,8 +1296,8 @@ AbsorbFsyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    BgWriterStats.buf_written_backend += CheckpointerShmem->num_backend_writes;
+    BgWriterStats.buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 0b41857014..7d4e528096 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -414,7 +414,7 @@ pgarch_ArchiverCopyLoop(void)
                  * Tell the collector about the WAL file that we successfully
                  * archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_update_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
@@ -424,7 +424,7 @@ pgarch_ArchiverCopyLoop(void)
                  * Tell the collector about the WAL file that we failed to
                  * archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_update_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 4fd874ad77..7f8c2111d0 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,10 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Statistics collector facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
- *
- *            - Add some automatic call for pgstat vacuuming.
- *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *    Statistics data is stored in dynamic shared memory. Every backend
+ *    updates and reads it individually.
  *
  *    Copyright (c) 2001-2018, PostgreSQL Global Development Group
  *
@@ -19,92 +14,59 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "pgstat.h"
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
-#include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "postmaster/autovacuum.h"
-#include "postmaster/fork_process.h"
-#include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
 #include "utils/snapmgr.h"
-#include "utils/timestamp.h"
-#include "utils/tqual.h"
-
 
 /* ----------
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
-
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_STAT_MIN_INTERVAL    500 /* Minimum time between stats data
+                                         * updates; in milliseconds. */
 
+#define PGSTAT_STAT_MAX_INTERVAL   1000 /* Maximum time between stats data
+                                         * updates; in milliseconds. */
 
 /* ----------
  * The initial size hints for the hash tables used in the collector.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
 #define PGSTAT_TAB_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
+/*
+ * Operation mode of pgstat_get_db_entry.
+ */
+#define PGSTAT_FETCH_SHARED    0
+#define PGSTAT_FETCH_EXCLUSIVE    1
+#define PGSTAT_FETCH_NOWAIT    2
+
+typedef enum
+{
+    PGSTAT_ENTRY_NOT_FOUND,
+    PGSTAT_ENTRY_FOUND,
+    PGSTAT_ENTRY_LOCK_FAILED
+} pg_stat_table_result_status;
 
 /* ----------
  * Total number of backends including auxiliary
@@ -132,27 +94,65 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+
+/* Shared stats bootstrap information */
+typedef struct StatsShmemStruct {
+    dsa_handle stats_dsa_handle;
+    dshash_table_handle db_stats_handle;
+    dsa_pointer    global_stats;
+    dsa_pointer    archiver_stats;
+} StatsShmemStruct;
+
 
 /*
  * BgWriter global statistics counters (unused in other processes).
  * Stored directly in a stats message structure so it can be sent
  * without needing to copy things around.  We assume this inits to zeroes.
  */
-PgStat_MsgBgWriter BgWriterStats;
+PgStat_BgWriter BgWriterStats;
 
 /* ----------
  * Local data
  * ----------
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
+static dshash_table *db_stats;
+static HTAB *snapshot_db_stats;
+static MemoryContext stats_cxt;
 
-static struct sockaddr_storage pgStatAddr;
+/*
+ *  Report withholding facility.
+ *
+ *  Some report items are withheld if the required lock cannot be acquired
+ *  immediately.
+ */
+static bool pgstat_pending_recoveryconflict = false;
+static bool pgstat_pending_deadlock = false;
+static bool pgstat_pending_tempfile = false;
 
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
+/* dshash parameter for each type of table */
+static const dshash_parameters dsh_dbparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatDBEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_DB
+};
+static const dshash_parameters dsh_tblparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatTabEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_FUNC_TABLE
+};
+static const dshash_parameters dsh_funcparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatFuncEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_FUNC_TABLE
+};
 
 /*
  * Structures in which backends store per-table info that's waiting to be
@@ -189,18 +189,14 @@ typedef struct TabStatHashEntry
  * Hash table for O(1) t_id -> tsa_entry lookup
  */
 static HTAB *pgStatTabHash = NULL;
+static HTAB *pgStatPendingTabHash = NULL;
 
 /*
  * Backends store per-function info that's waiting to be sent to the collector
  * in this hash table (indexed by function OID).
  */
 static HTAB *pgStatFunctions = NULL;
-
-/*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
- */
-static bool have_function_stats = false;
+static HTAB *pgStatPendingFunctions = NULL;
 
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
@@ -237,6 +233,12 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
+typedef struct
+{
+    dshash_table *tabhash;
+    PgStat_StatDBEntry *dbentry;
+} pgstat_apply_tabstat_context;
+
 /*
  * Info about current "snapshot" of stats file
  */
@@ -250,23 +252,15 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Cluster wide statistics.
+ * Contains statistics that are not collected per database or per table.
+ * shared_* are the statistics maintained by pgstats and snapshot_* are the
+ * snapshots taken on backends.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
-
-/* Signal handler flags */
-static volatile bool need_exit = false;
-static volatile bool got_SIGHUP = false;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats *snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats *snapshot_globalStats;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -280,32 +274,41 @@ static instr_time total_func_time;
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgstat_exit(SIGNAL_ARGS);
+/* functions used in backends */
 static void pgstat_beshutdown_hook(int code, Datum arg);
-static void pgstat_sighup_handler(SIGNAL_ARGS);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                     Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, int op,
+                                    pg_stat_table_result_status *status);
+static PgStat_StatTabEntry *pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create);
+static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_db_statsfile(Oid databaseid, dshash_table *tabhash, dshash_table *funchash);
+
+/* functions used in backends */
+static bool backend_snapshot_global_stats(void);
+static PgStat_StatFuncEntry *backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot);
 static void pgstat_read_current_status(void);
 
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
+static void pgstat_postmaster_shutdown(int code, Datum arg);
+static void pgstat_apply_pending_tabstats(bool shared, bool force,
+                               pgstat_apply_tabstat_context *cxt);
+static bool pgstat_apply_tabstat(pgstat_apply_tabstat_context *cxt,
+                                 PgStat_TableStatus *entry, bool nowait);
+static void pgstat_merge_tabentry(PgStat_TableStatus *deststat,
+                                          PgStat_TableStatus *srcstat,
+                                          bool init);
+static void pgstat_update_funcstats(bool force, PgStat_StatDBEntry *dbentry);
+static void pgstat_reset_all_counters(void);
+static void pgstat_cleanup_recovery_conflict(PgStat_StatDBEntry *dbentry);
+static void pgstat_cleanup_deadlock(PgStat_StatDBEntry *dbentry);
+static void pgstat_cleanup_tempfile(PgStat_StatDBEntry *dbentry);
+
+static inline void pgstat_merge_backendstats_to_funcentry(
+    PgStat_StatFuncEntry *dest, PgStat_BackendFunctionEntry *src, bool init);
+static inline void pgstat_merge_funcentry(
+    PgStat_StatFuncEntry *dest, PgStat_StatFuncEntry *src, bool init);
 
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
 static HTAB *pgstat_collect_oids(Oid catalogid);
-
+static void reset_dbentry_counters(PgStat_StatDBEntry *dbentry);
 static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
 
 static void pgstat_setup_memcxt(void);
@@ -316,320 +319,16 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+static bool pgstat_update_tabentry(dshash_table *tabhash,
+                                   PgStat_TableStatus *stat, bool nowait);
+static void pgstat_update_dbentry(PgStat_StatDBEntry *dbentry,
+                                  PgStat_TableStatus *stat);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
- */
-void
-pgstat_init(void)
-{
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
-
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
-
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
-    {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
-    }
-
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
-
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
-
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
-
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
-    }
-
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
-
-    /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
-     */
-    if (!pg_set_noblock(pgStatSock))
-    {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
-    }
-
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
-    }
-
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    return;
-
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
-
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
-
-    /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
-     */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
-}
-
 /*
  * subroutine for pgstat_reset_all
  */
@@ -678,119 +377,54 @@ pgstat_reset_remove_files(const char *directory)
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats files and in-memory counters.  This is currently used only
+ * if WAL recovery is needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
     pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    pgstat_reset_all_counters();
 }
 
-#ifdef EXEC_BACKEND
-
-/*
- * pgstat_forkexec() -
+/* ----------
+ * pgstat_create_shared_stats() -
  *
- * Format up the arglist for, then fork and exec, statistics collector process
+ *    create shared stats memory
+ * ----------
  */
-static pid_t
-pgstat_forkexec(void)
+static void
+pgstat_create_shared_stats(void)
 {
-    char       *av[10];
-    int            ac = 0;
+    MemoryContext oldcontext;
 
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
+    Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
 
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
+    /* lives for the lifetime of the process */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    area = dsa_create(LWTRANCHE_STATS_DSA);
+    dsa_pin_mapping(area);
 
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
+    db_stats = dshash_create(area, &dsh_dbparams, 0);
 
+    /* create shared area and write bootstrap information */
+    StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+    StatsShmem->global_stats =
+        dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+    StatsShmem->archiver_stats =
+        dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+    StatsShmem->db_stats_handle =
+        dshash_get_hash_table_handle(db_stats);
 
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
-
-    /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
-     */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
-
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
-
-    /*
-     * Okay, fork off the collector.
-     */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
+    /* connect to the memory */
+    snapshot_db_stats = NULL;
+    shared_globalStats = (PgStat_GlobalStats *)
+        dsa_get_address(area, StatsShmem->global_stats);
+    shared_archiverStats = (PgStat_ArchiverStats *)
+        dsa_get_address(area, StatsShmem->archiver_stats);
+    MemoryContextSwitchTo(oldcontext);
 }
 
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
-}
 
 /* ------------------------------------------------------------
  * Public functions used by backends follow
@@ -802,41 +436,107 @@ allow_immediate_pgstat_restart(void)
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
- *    transaction stop time as an approximation of current time.
- * ----------
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *  This requires taking locks on the shared statistics hashes, and some
+ *  updates may be withheld on lock failure.  Pending updates are retried
+ *  in later calls of this function and are finally flushed when it is
+ *  called with force = true or when PGSTAT_STAT_MAX_INTERVAL milliseconds
+ *  have elapsed since the last cleanup.  On the other hand, updates by
+ *  regular backends happen at intervals no shorter than
+ *  PGSTAT_STAT_MIN_INTERVAL when force = false.
+ *
+ *  Returns time in milliseconds until the next update time.
+ *
+ *    Note that this is called only when not within a transaction, so it is fair
+ *    to use transaction stop time as an approximation of current time.
+ *    ----------
  */
-void
-pgstat_report_stat(bool force)
+long
+pgstat_update_stat(bool force)
 {
     /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
     static TimestampTz last_report = 0;
-
+    static TimestampTz oldest_pending = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
     TabStatusArray *tsa;
-    int            i;
+    pgstat_apply_tabstat_context cxt;
+    bool        other_pending_stats = false;
+    long elapsed;
+    long secs;
+    int     usecs;
+
+    if (pgstat_pending_recoveryconflict ||
+        pgstat_pending_deadlock ||
+        pgstat_pending_tempfile ||
+        pgStatPendingFunctions)
+        other_pending_stats = true;
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (!other_pending_stats && !pgStatPendingTabHash &&
+        (pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
+        pgStatXactCommit == 0 && pgStatXactRollback == 0)
+        return 0;
 
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
     now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
+
+    if (!force)
+    {
+        /*
+         * Don't update shared stats unless it's been at least
+         * PGSTAT_STAT_MIN_INTERVAL msec since we last updated them.
+         * Return the time to wait in that case.
+         */
+        TimestampDifference(last_report, now, &secs, &usecs);
+        elapsed = secs * 1000 + usecs / 1000;
+
+        if (elapsed < PGSTAT_STAT_MIN_INTERVAL)
+        {
+            /* we know we have some statistics */
+            if (oldest_pending == 0)
+                oldest_pending = now;
+
+            return PGSTAT_STAT_MIN_INTERVAL - elapsed;
+        }
+
+
+        /*
+         * Don't keep pending stats for longer than PGSTAT_STAT_MAX_INTERVAL.
+         */
+        if (oldest_pending > 0)
+        {
+            TimestampDifference(oldest_pending, now, &secs, &usecs);
+            elapsed = secs * 1000 + usecs / 1000;
+
+            if (elapsed > PGSTAT_STAT_MAX_INTERVAL)
+                force = true;
+        }
+    }
+
     last_report = now;
 
+    /* set up the stats update context */
+    cxt.dbentry = NULL;
+    cxt.tabhash = NULL;
+
+    /* Forcibly update other stats if any. */
+    if (other_pending_stats)
+    {
+        cxt.dbentry =
+            pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+        /* clean up pending statistics if any */
+        if (pgStatPendingFunctions)
+            pgstat_update_funcstats(true, cxt.dbentry);
+        if (pgstat_pending_recoveryconflict)
+            pgstat_cleanup_recovery_conflict(cxt.dbentry);
+        if (pgstat_pending_deadlock)
+            pgstat_cleanup_deadlock(cxt.dbentry);
+        if (pgstat_pending_tempfile)
+            pgstat_cleanup_tempfile(cxt.dbentry);
+    }
+
     /*
      * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
      * entries it points to.  (Should we fail partway through the loop below,
@@ -849,23 +549,55 @@ pgstat_report_stat(bool force)
     pgStatTabHash = NULL;
 
     /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
+     * XXX: We cannot lock two dshash entries at once.  Since we must hold the
+     * lock while table stats are being updated, we have no choice but to
+     * process shared table stats and regular table stats separately.
+     * Looping over the array twice is apparently inefficient; a more
+     * efficient way is desirable.
      */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
+
+    /* The first of the following calls uses the dbentry obtained above, if any. */
+    pgstat_apply_pending_tabstats(false, force, &cxt);
+    pgstat_apply_pending_tabstats(true, force, &cxt);
+
+    /* zero out TableStatus structs after use */
+    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+    {
+        MemSet(tsa->tsa_entries, 0,
+               tsa->tsa_used * sizeof(PgStat_TableStatus));
+        tsa->tsa_used = 0;
+    }
+
+    /* record oldest pending update time */
+    if (pgStatPendingTabHash == NULL)
+        oldest_pending = 0;
+    else if (oldest_pending == 0)
+        oldest_pending = now;
+
+    return 0;
+}
+
+/*
+ * Subroutine for pgstat_update_stat.
+ *
+ * Applies table stats in the table status array, merging with pending stats
+ * if any.  If force is true, waits until the required locks are acquired.
+ * Otherwise merged stats are kept as pending and retried at the next chance.
+ */
+static void
+pgstat_apply_pending_tabstats(bool shared, bool force,
+                              pgstat_apply_tabstat_context *cxt)
+{
+    static const PgStat_TableCounts all_zeroes;
+    TabStatusArray *tsa;
+    int i;
 
     for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
     {
         for (i = 0; i < tsa->tsa_used; i++)
         {
             PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
+            PgStat_TableStatus *pentry = NULL;
 
             /* Shouldn't have any pending transaction-dependent counts */
             Assert(entry->trans == NULL);
@@ -878,178 +610,440 @@ pgstat_report_stat(bool force)
                        sizeof(PgStat_TableCounts)) == 0)
                 continue;
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* Skip if this entry does not match the request */
+            if (entry->t_shared != shared)
+                continue;
+
+            /* if a pending update exists, apply it together with this one */
+            if (pgStatPendingTabHash != NULL)
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                pentry = hash_search(pgStatPendingTabHash,
+                                     (void *) entry, HASH_FIND, NULL);
+
+                if (pentry)
+                {
+                    /* merge new update into pending updates */
+                    pgstat_merge_tabentry(pentry, entry, false);
+                    entry = pentry;
+                }
+            }
+
+            /* try to apply the merged stats */
+            if (pgstat_apply_tabstat(cxt, entry, !force))
+            {
+                /* succeeded; remove the entry if it came from pending stats */
+                if (pentry)
+                    hash_search(pgStatPendingTabHash,
+                                (void *) pentry, HASH_REMOVE, NULL);
+            }
+            else if (!pentry)
+            {
+                /* failed and there was no pending entry, create new one. */
+                bool found;
+
+                if (pgStatPendingTabHash == NULL)
+                {
+                    HASHCTL        ctl;
+
+                    memset(&ctl, 0, sizeof(ctl));
+                    ctl.keysize = sizeof(Oid);
+                    ctl.entrysize = sizeof(PgStat_TableStatus);
+                    pgStatPendingTabHash =
+                        hash_create("pgstat pending table stats hash",
+                                    TABSTAT_QUANTUM,
+                                    &ctl,
+                                    HASH_ELEM | HASH_BLOBS);
+                }
+
+                pentry = hash_search(pgStatPendingTabHash,
+                                     (void *) entry, HASH_ENTER, &found);
+                Assert (!found);
+
+                *pentry = *entry;
             }
         }
-        /* zero out TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
+    }
+
+    /* if any pending stats exist, try to clean them up */
+    if (pgStatPendingTabHash != NULL)
+    {
+        HASH_SEQ_STATUS pstat;
+        PgStat_TableStatus *pentry;
+
+        hash_seq_init(&pstat, pgStatPendingTabHash);
+        while ((pentry = (PgStat_TableStatus *) hash_seq_search(&pstat)) != NULL)
+        {
+            /* Skip if this entry does not match the request */
+            if (pentry->t_shared != shared)
+                continue;
+
+            /* apply pending entry and remove on success */
+            if (pgstat_apply_tabstat(cxt, pentry, !force))
+                hash_search(pgStatPendingTabHash,
+                            (void *) pentry, HASH_REMOVE, NULL);
+        }
+
+        /* destroy the hash if no entry is left */
+        if (hash_get_num_entries(pgStatPendingTabHash) == 0)
+        {
+            hash_destroy(pgStatPendingTabHash);
+            pgStatPendingTabHash = NULL;
+        }
+    }
+
+    if (cxt->tabhash)
+        dshash_detach(cxt->tabhash);
+    if (cxt->dbentry)
+        dshash_release_lock(db_stats, cxt->dbentry);
+    cxt->tabhash = NULL;
+    cxt->dbentry = NULL;
+}
+
+
+/*
+ * pgstat_apply_tabstat: update shared stats entry using given entry
+ *
+ * If nowait is true, just returns false on lock failure.  Dshashes for table
+ * and function stats are kept attached and stored in cxt.  The caller must
+ * detach them after use.
+ */
+bool
+pgstat_apply_tabstat(pgstat_apply_tabstat_context *cxt,
+                     PgStat_TableStatus *entry, bool nowait)
+{
+    Oid dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
+    int        table_mode = PGSTAT_FETCH_EXCLUSIVE;
+    bool updated = false;
+
+    if (nowait)
+        table_mode |= PGSTAT_FETCH_NOWAIT;
+
+    /*
+     * We need to keep the lock on the dbentry for regular tables to avoid a
+     * race condition with DROP DATABASE, so we hold it in the context
+     * variable.  We don't need that for shared tables.
+     */
+    if (!cxt->dbentry)
+        cxt->dbentry = pgstat_get_db_entry(dboid, table_mode, NULL);
+
+    /* could not acquire the lock, just return */
+    if (!cxt->dbentry)
+        return false;
+
+    /* attach the shared stats table if not yet attached */
+    if (!cxt->tabhash)
+    {
+        /* apply database stats  */
+        if (!entry->t_shared)
+        {
+            /* Update database-wide stats  */
+            cxt->dbentry->n_xact_commit += pgStatXactCommit;
+            cxt->dbentry->n_xact_rollback += pgStatXactRollback;
+            cxt->dbentry->n_block_read_time += pgStatBlockReadTime;
+            cxt->dbentry->n_block_write_time += pgStatBlockWriteTime;
+            pgStatXactCommit = 0;
+            pgStatXactRollback = 0;
+            pgStatBlockReadTime = 0;
+            pgStatBlockWriteTime = 0;
+        }
+        
+        cxt->tabhash =
+            dshash_attach(area, &dsh_tblparams, cxt->dbentry->tables, 0);
     }
 
     /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
+     * If we have access to the required data, try to update table stats first.
+     * Update database stats only if the first step succeeded.
      */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    if (pgstat_update_tabentry(cxt->tabhash, entry, nowait))
+    {
+        pgstat_update_dbentry(cxt->dbentry, entry);
+        updated = true;
+    }
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+    return updated;
 }
 
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * pgstat_merge_tabentry: subroutine for pgstat_update_stat
+ *
+ * Merge srcstat into deststat. Existing values in deststat are cleared if
+ * init is true.
  */
 static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+pgstat_merge_tabentry(PgStat_TableStatus *deststat,
+                      PgStat_TableStatus *srcstat,
+                      bool init)
 {
-    int            n;
-    int            len;
+    Assert (deststat != srcstat);
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
-     */
-    if (OidIsValid(tsmsg->m_databaseid))
-    {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
-        pgStatXactCommit = 0;
-        pgStatXactRollback = 0;
-        pgStatBlockReadTime = 0;
-        pgStatBlockWriteTime = 0;
-    }
+    if (init)
+        deststat->t_counts = srcstat->t_counts;
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        PgStat_TableCounts *dest = &deststat->t_counts;
+        PgStat_TableCounts *src = &srcstat->t_counts;
+
+        dest->t_numscans += src->t_numscans;
+        dest->t_tuples_returned += src->t_tuples_returned;
+        dest->t_tuples_fetched += src->t_tuples_fetched;
+        dest->t_tuples_inserted += src->t_tuples_inserted;
+        dest->t_tuples_updated += src->t_tuples_updated;
+        dest->t_tuples_deleted += src->t_tuples_deleted;
+        dest->t_tuples_hot_updated += src->t_tuples_hot_updated;
+        dest->t_truncated |= src->t_truncated;
+
+        /* If table was truncated, first reset the live/dead counters */
+        if (src->t_truncated)
+        {
+            dest->t_delta_live_tuples = 0;
+            dest->t_delta_dead_tuples = 0;
+        }
+        dest->t_delta_live_tuples += src->t_delta_live_tuples;
+        dest->t_delta_dead_tuples += src->t_delta_dead_tuples;
+        dest->t_changed_tuples += src->t_changed_tuples;
+        dest->t_blocks_fetched += src->t_blocks_fetched;
+        dest->t_blocks_hit += src->t_blocks_hit;
     }
-
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
-
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
 }
-
+        
 /*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
+ * pgstat_update_funcstats: subroutine for pgstat_update_stat
+ *
+ *  updates a function stat
  */
 static void
-pgstat_send_funcstats(void)
+pgstat_update_funcstats(bool force, PgStat_StatDBEntry *dbentry)
 {
     /* we assume this inits to all zeroes: */
     static const PgStat_FunctionCounts all_zeroes;
+    pg_stat_table_result_status status = 0;
+    dshash_table *funchash;
+    bool          nowait = !force;
+    bool          release_db = false;
+    int              table_op = PGSTAT_FETCH_EXCLUSIVE;
 
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
-
-    if (pgStatFunctions == NULL)
+    if (pgStatFunctions == NULL && pgStatPendingFunctions == NULL)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
+    if (nowait)
+        table_op |= PGSTAT_FETCH_NOWAIT;
 
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    /* find the shared function stats table */
+    if (!dbentry)
     {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
-
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
-
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
-
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
+        dbentry = pgstat_get_db_entry(MyDatabaseId, table_op, &status);
+        release_db = true;
     }
 
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
+    /* lock failure, return. */
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
 
-    have_function_stats = false;
+    /* create the hash if it does not exist yet */
+    if (dbentry->functions == DSM_HANDLE_INVALID)
+    {
+        funchash = dshash_create(area, &dsh_funcparams, 0);
+        dbentry->functions = dshash_get_hash_table_handle(funchash);
+    }
+    else
+        funchash = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+
+    /*
+     * First, we empty the transaction stats.  Just move the numbers to the
+     * pending entry if one exists.  Otherwise try to update the shared stats
+     * directly, but create a new pending entry on lock failure.
+     */
+    if (pgStatFunctions)
+    {
+        HASH_SEQ_STATUS fstat;
+        PgStat_BackendFunctionEntry *bestat;
+
+        hash_seq_init(&fstat, pgStatFunctions);
+        while ((bestat = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+        {
+            bool found;
+            bool init = false;
+            PgStat_StatFuncEntry *funcent = NULL;
+
+            /* Skip it if no counts accumulated since last time */
+            if (memcmp(&bestat->f_counts, &all_zeroes,
+                       sizeof(PgStat_FunctionCounts)) == 0)
+                continue;
+
+            /* find pending entry */
+            if (pgStatPendingFunctions)
+                funcent = (PgStat_StatFuncEntry *)
+                    hash_search(pgStatPendingFunctions,
+                                (void *) &(bestat->f_id), HASH_FIND, NULL);
+
+            if (!funcent)
+            {
+                /* pending entry not found, find shared stats entry */
+                funcent = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert_extended(funchash,
+                                                   (void *) &(bestat->f_id),
+                                                   &found, nowait);
+                if (funcent)
+                    init = !found;
+                else
+                {
+                    /* no shared stats entry. create a new pending one */
+                    funcent = (PgStat_StatFuncEntry *)
+                        hash_search(pgStatPendingFunctions,
+                                    (void *) &(bestat->f_id), HASH_ENTER, NULL);
+                    init = true;
+                }
+            }
+            Assert (funcent != NULL);
+
+            pgstat_merge_backendstats_to_funcentry(funcent, bestat, init);
+
+            /* reset used counts */
+            MemSet(&bestat->f_counts, 0, sizeof(PgStat_FunctionCounts));
+        }
+    }
+
+    /* Second, apply pending stats numbers to shared table */
+    if (pgStatPendingFunctions)
+    {
+        HASH_SEQ_STATUS fstat;
+        PgStat_StatFuncEntry *pendent;
+
+        hash_seq_init(&fstat, pgStatPendingFunctions);
+        while ((pendent = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+        {
+            PgStat_StatFuncEntry *funcent;
+            bool found;
+
+            funcent = (PgStat_StatFuncEntry *)
+                dshash_find_or_insert_extended(funchash,
+                                               (void *) &(pendent->functionid),
+                                               &found, nowait);
+            if (funcent)
+            {
+                pgstat_merge_funcentry(funcent, pendent, !found);
+                hash_search(pgStatPendingFunctions,
+                            (void *) &(pendent->functionid), HASH_REMOVE, NULL);
+            }
+        }
+
+        /* destroy the hash if no entry remains */
+        if (hash_get_num_entries(pgStatPendingFunctions) == 0)
+        {
+            hash_destroy(pgStatPendingFunctions);
+            pgStatPendingFunctions = NULL;
+        }
+    }
+
+    if (release_db)
+        dshash_release_lock(db_stats, dbentry);
 }
 
+/*
+ * pgstat_merge_backendstats_to_funcentry: subroutine for
+ *                                             pgstat_update_funcstats
+ *
+ * Merges BackendFunctionEntry into StatFuncEntry
+ */
+static inline void
+pgstat_merge_backendstats_to_funcentry(PgStat_StatFuncEntry *dest,
+                                       PgStat_BackendFunctionEntry *src,
+                                       bool init)
+{
+    if (init)
+    {
+        /*
+         * If it's a new function entry, initialize counters to the values
+         * we just got.
+         */
+        dest->f_numcalls = src->f_counts.f_numcalls;
+        dest->f_total_time =
+            INSTR_TIME_GET_MICROSEC(src->f_counts.f_total_time);
+        dest->f_self_time =
+            INSTR_TIME_GET_MICROSEC(src->f_counts.f_self_time);
+    }
+    else
+    {
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        dest->f_numcalls += src->f_counts.f_numcalls;
+        dest->f_total_time +=
+            INSTR_TIME_GET_MICROSEC(src->f_counts.f_total_time);
+        dest->f_self_time +=
+            INSTR_TIME_GET_MICROSEC(src->f_counts.f_self_time);
+    }
+}
+
+/*
+ * pgstat_merge_funcentry: subroutine for pgstat_update_funcstats
+ *
+ * Merges two StatFuncEntrys
+ */
+static inline void
+pgstat_merge_funcentry(PgStat_StatFuncEntry *dest, PgStat_StatFuncEntry *src,
+                       bool init)
+{
+    if (init)
+    {
+        /*
+         * If it's a new function entry, initialize counters to the values
+         * we just got.
+         */
+        dest->f_numcalls = src->f_numcalls;
+        dest->f_total_time = src->f_total_time;
+        dest->f_self_time = src->f_self_time;
+    }
+    else
+    {
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        dest->f_numcalls += src->f_numcalls;
+        dest->f_total_time += src->f_total_time;
+        dest->f_self_time += src->f_self_time;
+    }
+}
+
+
 
 /* ----------
  * pgstat_vacuum_stat() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *    Remove stats entries for objects that no longer exist.
  * ----------
  */
 void
 pgstat_vacuum_stat(void)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
+    HTAB       *oidtab;
+    dshash_table *dshtable;
+    dshash_seq_status dshstat;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatFuncEntry *funcentry;
-    int            len;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* we don't collect statistics in standalone mode */
+    if (!IsUnderPostmaster)
         return;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* If not done for this transaction, take a snapshot of stats */
+    if (!backend_snapshot_global_stats())
+        return;
 
     /*
      * Read pg_database and make a list of OIDs of all existing databases
      */
-    htab = pgstat_collect_oids(DatabaseRelationId);
+    oidtab = pgstat_collect_oids(DatabaseRelationId);
 
     /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
+     * Search the database hash table for dead databases and drop them
+     * from the hash.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+
+    dshash_seq_init(&dshstat, db_stats, false, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            dbid = dbentry->databaseid;
 
@@ -1057,148 +1051,86 @@ pgstat_vacuum_stat(void)
 
         /* the DB entry for shared tables (with InvalidOid) is never dropped */
         if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
+            hash_search(oidtab, (void *) &dbid, HASH_FIND, NULL) == NULL)
             pgstat_drop_database(dbid);
     }
 
     /* Clean up */
-    hash_destroy(htab);
+    hash_destroy(oidtab);
 
     /*
      * Lookup our own database entry; if not found, nothing more to do.
      */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, NULL);
+    if (!dbentry)
         return;
-
+    
     /*
      * Similarly to above, make a list of all known relations in this DB.
      */
-    htab = pgstat_collect_oids(RelationRelationId);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
+    oidtab = pgstat_collect_oids(RelationRelationId);
 
     /*
      * Check for all tables listed in stats hashtable if they still exist.
+     * Stats cache is useless here so directly search the shared hash.
      */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+    dshtable = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&dshstat, dshtable, false, true);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            tabid = tabentry->tableid;
 
         CHECK_FOR_INTERRUPTS();
 
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+        if (hash_search(oidtab, (void *) &tabid, HASH_FIND, NULL) != NULL)
             continue;
 
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
-    }
-
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
+        /* Not there, so purge this table */
+        dshash_delete_entry(dshtable, tabentry);
     }
+    dshash_detach(dshtable);
 
     /* Clean up */
-    hash_destroy(htab);
+    hash_destroy(oidtab);
 
     /*
      * Now repeat the above steps for functions.  However, we needn't bother
      * in the common case where no function stats are being collected.
      */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
+    if (dbentry->functions != DSM_HANDLE_INVALID)
     {
-        htab = pgstat_collect_oids(ProcedureRelationId);
+        dshtable = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        oidtab = pgstat_collect_oids(ProcedureRelationId);
 
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
+        dshash_seq_init(&dshstat, dshtable, false, true);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&dshstat)) != NULL)
         {
             Oid            funcid = funcentry->functionid;
 
             CHECK_FOR_INTERRUPTS();
 
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
+            if (hash_search(oidtab, (void *) &funcid, HASH_FIND, NULL) != NULL)
                 continue;
 
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
+            /* Not there, so remove this function */
+            dshash_delete_entry(dshtable, funcentry);
         }
 
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
+        hash_destroy(oidtab);
 
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
+        dshash_detach(dshtable);
     }
+    dshash_release_lock(db_stats, dbentry);
 }
 
 
-/* ----------
+/*
  * pgstat_collect_oids() -
  *
  *    Collect the OIDs of all objects listed in the specified system catalog
- *    into a temporary hash table.  Caller should hash_destroy the result
- *    when done with it.  (However, we make the table in CurrentMemoryContext
- *    so that it will be freed properly in event of an error.)
- * ----------
+ *    into a temporary hash table.  Caller should hash_destroy the result after
+ *    use.  (However, we make the table in CurrentMemoryContext so that it will
+ *    be freed properly in event of an error.)
  */
 static HTAB *
 pgstat_collect_oids(Oid catalogid)
@@ -1241,62 +1173,54 @@ pgstat_collect_oids(Oid catalogid)
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
+ *    Remove entry for the database that we just dropped.
+ *
+ *    If some stats update happens after this, this entry will be re-created, but
+ *    we will still clean the dead DB eventually via future invocations of
+ *    pgstat_vacuum_stat().
  * ----------
  */
+
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    Assert(OidIsValid(databaseid));
+    Assert(db_stats);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Lookup the database in the hashtable with exclusive lock.
+     */
+    dbentry = pgstat_get_db_entry(databaseid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    /*
+     * If found, remove it (along with the db statfile).
+     */
+    if (dbentry)
+    {
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+        }
+
+        dshash_delete_entry(db_stats, (void *)dbentry);
+    }
 }
 
 
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
-
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
-}
-#endif                            /* NOT_USED */
-
-
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1305,20 +1229,51 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    PgStat_StatDBEntry           *dbentry;
+    pg_stat_table_result_status status;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    Assert(db_stats);
+
+    /*
+     * Lookup the database in the hashtable.  Nothing to do if not there.
+     */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, &status);
+
+    if (!dbentry)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * We simply throw away all the database's table entries by recreating a
+     * new hash table for them.
+     */
+    if (dbentry->tables != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_destroy(t);
+        dbentry->tables = DSM_HANDLE_INVALID;
+    }
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_destroy(t);
+        dbentry->functions = DSM_HANDLE_INVALID;
+    }
+
+    /*
+     * Reset database-level stats, too.  This creates empty hash tables for
+     * tables and functions.
+     */
+    reset_dbentry_counters(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1327,29 +1282,37 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    Assert(db_stats);
 
+    /* Reset the archiver statistics for the cluster. */
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+        memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
+    }
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+        /* Reset the global background writer statistics for the cluster. */
+        memset(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    }
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
+
+    LWLockRelease(StatsLock);
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1358,17 +1321,90 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatDBEntry *dbentry;
+    
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    Assert(db_stats);
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    if (!dbentry)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* Set the reset timestamp for the whole database */
+    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Remove object if it exists, ignore it if not */
+    if (type == RESET_TABLE)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_delete_key(t, (void *) &objoid);
+    }
+
+    if (type == RESET_FUNCTION && dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_delete_key(t, (void *) &objoid);
+    }
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/*
+ * pgstat_reset_all_counters: subroutine for pgstat_reset_all
+ *
+ * Clears all counters in shared memory
+ */
+static void
+pgstat_reset_all_counters(void)
+{
+    dshash_seq_status dshstat;
+    PgStat_StatDBEntry           *dbentry;
+
+    Assert(db_stats);
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    dshash_seq_init(&dshstat, db_stats, false, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
+    {
+        /*
+         * We simply throw away all the database's table hashes
+         */
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *t =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(t);
+            dbentry->tables = DSM_HANDLE_INVALID;
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *t =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(t);
+            dbentry->functions = DSM_HANDLE_INVALID;
+        }
+
+        /*
+         * Reset database-level stats, too.  This creates empty hash tables
+         * for tables and functions.
+         */
+        reset_dbentry_counters(dbentry);
+        dshash_release_lock(db_stats, dbentry);
+
+    }
+
+    /*
+     * Reset global counters
+     */
+    memset(shared_globalStats, 0, sizeof(*shared_globalStats));
+    memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+    shared_globalStats->stat_reset_timestamp =
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
+
+    LWLockRelease(StatsLock);
 }
 
 /* ----------
@@ -1382,48 +1418,75 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    /*
+     * Store the last autovacuum time in the database's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
 
-    pgstat_send(&msg, sizeof(msg));
+    dbentry->last_autovac_time = GetCurrentTimestamp();
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report on the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, tableoid, true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->vacuum_count++;
+    }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report on the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1434,9 +1497,14 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
     /*
@@ -1465,114 +1533,228 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, RelationGetRelid(rel), true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
+static int pending_conflict_tablespace = 0;
+static int pending_conflict_lock = 0;
+static int pending_conflict_snapshot = 0;
+static int pending_conflict_bufferpin = 0;
+static int pending_conflict_startup_deadlock = 0;
+
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    pgstat_pending_recoveryconflict = true;
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            pending_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            pending_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            pending_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            pending_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            pending_conflict_startup_deadlock++;
+            break;
+    }
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    pgstat_cleanup_recovery_conflict(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/*
+ * clean up function for pending recovery conflicts
+ */
+static void
+pgstat_cleanup_recovery_conflict(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_conflict_tablespace    += pending_conflict_tablespace;
+    dbentry->n_conflict_lock         += pending_conflict_lock;
+    dbentry->n_conflict_snapshot    += pending_conflict_snapshot;
+    dbentry->n_conflict_bufferpin    += pending_conflict_bufferpin;
+    dbentry->n_conflict_startup_deadlock += pending_conflict_startup_deadlock;
+
+    pending_conflict_tablespace = 0;
+    pending_conflict_lock = 0;
+    pending_conflict_snapshot = 0;
+    pending_conflict_bufferpin = 0;
+    pending_conflict_startup_deadlock = 0;
+
+    pgstat_pending_recoveryconflict = false;
 }
 
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
+static int pending_deadlocks = 0;
+
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    pending_deadlocks++;
+    pgstat_pending_deadlock = true;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    pgstat_cleanup_deadlock(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/*
+ * clean up function for pending deadlocks
+ */
+static void
+pgstat_cleanup_deadlock(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_deadlocks += pending_deadlocks;
+    pending_deadlocks = 0;
+    pgstat_pending_deadlock = false;
 }
 
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
+static size_t pending_filesize = 0;
+static size_t pending_files = 0;
+
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
-}
+    if (filesize > 0) /* Isn't there a case where filesize is really 0? */
+    {
+        pgstat_pending_tempfile = true;
+        pending_filesize += filesize; /* needs an overflow check */
+        pending_files++;
+    }
 
-
-/* ----------
- * pgstat_ping() -
- *
- *    Send some junk data to the collector to increase traffic.
- * ----------
- */
-void
-pgstat_ping(void)
-{
-    PgStat_MsgDummy msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!pgstat_pending_tempfile)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    pgstat_cleanup_tempfile(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
-/* ----------
- * pgstat_send_inquiry() -
- *
- *    Notify collector that we need fresh data.
- * ----------
+/*
+ * clean up function for temporary files
  */
 static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
+pgstat_cleanup_tempfile(PgStat_StatDBEntry *dbentry)
 {
-    PgStat_MsgInquiry msg;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
+    dbentry->n_temp_bytes += pending_filesize;
+    dbentry->n_temp_files += pending_files;
+    pending_filesize = 0;
+    pending_files = 0;
+    pgstat_pending_tempfile = false;
+
 }
 
-
 /*
  * Initialize function call usage data.
  * Called by the executor before invoking a function.
@@ -1688,9 +1870,6 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
         fs->f_numcalls++;
     fs->f_total_time = f_total;
     INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
 }
 
 
@@ -1712,6 +1891,15 @@ pgstat_initstats(Relation rel)
     Oid            rel_id = rel->rd_id;
     char        relkind = rel->rd_rel->relkind;
 
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+    {
+        /* We're not counting at all */
+        rel->pgstat_info = NULL;
+        return;
+    }
+
     /* We only count stats for things that have storage */
     if (!(relkind == RELKIND_RELATION ||
           relkind == RELKIND_MATVIEW ||
@@ -1723,13 +1911,6 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-    {
-        /* We're not counting at all */
-        rel->pgstat_info = NULL;
-        return;
-    }
-
     /*
      * If we already set up this relation in the current transaction, nothing
      * to do.
@@ -2373,34 +2554,6 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
         rec->tuples_inserted + rec->tuples_updated;
 }
 
-
-/* ----------
- * pgstat_fetch_stat_dbentry() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
- * ----------
- */
-PgStat_StatDBEntry *
-pgstat_fetch_stat_dbentry(Oid dbid)
-{
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
-}
-
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
@@ -2413,47 +2566,28 @@ pgstat_fetch_stat_dbentry(Oid dbid)
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* Lookup our database, then look in its table hash table. */
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, false);
+    if (dbentry == NULL)
+        return NULL;
 
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = backend_get_tab_entry(dbentry, relid, false);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    dbentry = pgstat_fetch_stat_dbentry(InvalidOid, false);
+    if (dbentry == NULL)
+        return NULL;
+
+    tabentry = backend_get_tab_entry(dbentry, relid, false);
+    if (tabentry != NULL)
+        return tabentry;
 
     return NULL;
 }
@@ -2472,18 +2606,14 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     PgStat_StatDBEntry *dbentry;
     PgStat_StatFuncEntry *funcentry = NULL;
 
-    /* load the stats file if needed */
-    backend_read_statsfile();
+    /* Lookup our database, then find the requested function */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_SHARED, NULL);
+    if (dbentry == NULL)
+        return NULL;
 
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
+    funcentry = backend_get_func_etnry(dbentry, func_id, false);
 
+    dshash_release_lock(db_stats, dbentry);
     return funcentry;
 }
 
@@ -2558,9 +2688,11 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    if (!backend_snapshot_global_stats())
+        return NULL;
 
-    return &archiverStats;
+    return snapshot_archiverStats;
 }
 
 
@@ -2575,9 +2707,11 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    if (!backend_snapshot_global_stats())
+        return NULL;
 
-    return &globalStats;
+    return snapshot_globalStats;
 }
 
 
@@ -2767,7 +2901,7 @@ pgstat_initialize(void)
     }
 
     /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -2959,7 +3093,7 @@ pgstat_beshutdown_hook(int code, Datum arg)
      * during failed backend starts might never get counted.)
      */
     if (OidIsValid(MyDatabaseId))
-        pgstat_report_stat(true);
+        pgstat_update_stat(true);
 
     /*
      * Clear my status entry, following the protocol of bumping st_changecount
@@ -3226,7 +3360,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -4146,96 +4281,68 @@ pgstat_get_backend_desc(BackendType backendType)
  * ------------------------------------------------------------
  */
 
-
 /* ----------
- * pgstat_setheader() -
+ * pgstat_update_archiver() -
  *
- *        Set common header fields in a statistics message
+ *    Update the stats data about the WAL file that we successfully archived or
+ *    failed to archive.
  * ----------
  */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
+void
+pgstat_update_archiver(const char *xlog, bool failed)
 {
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
+    if (failed)
     {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
+        /* Failed archival attempt */
+        ++shared_archiverStats->failed_count;
+        strlcpy(shared_archiverStats->last_failed_wal, xlog,
+                sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = GetCurrentTimestamp();
+    }
+    else
+    {
+        /* Successful archival operation */
+        ++shared_archiverStats->archived_count;
+        strlcpy(shared_archiverStats->last_archived_wal, xlog,
+                sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = GetCurrentTimestamp();
+    }
 }
 
 /* ----------
- * pgstat_send_archiver() -
+ * pgstat_update_bgwriter() -
  *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
+ *        Update bgwriter statistics
  * ----------
  */
 void
-pgstat_send_archiver(const char *xlog, bool failed)
-{
-    PgStat_MsgArchiver msg;
-
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* ----------
- * pgstat_send_bgwriter() -
- *
- *        Send bgwriter statistics to the collector
- * ----------
- */
-void
-pgstat_send_bgwriter(void)
+pgstat_update_bgwriter(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+
+    PgStat_BgWriter *s = &BgWriterStats;
 
     /*
      * This function can be called even if nothing at all has happened. In
      * this case, avoid sending a completely empty message to the stats
      * collector.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += s->timed_checkpoints;
+    shared_globalStats->requested_checkpoints += s->requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += s->checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += s->checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += s->buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += s->buf_written_clean;
+    shared_globalStats->maxwritten_clean += s->maxwritten_clean;
+    shared_globalStats->buf_written_backend += s->buf_written_backend;
+    shared_globalStats->buf_fsync_backend += s->buf_fsync_backend;
+    shared_globalStats->buf_alloc += s->buf_alloc;
+    LWLockRelease(StatsLock);
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4243,302 +4350,15 @@ pgstat_send_bgwriter(void)
     MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
 }
 
-
-/* ----------
- * PgstatCollectorMain() -
- *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, pgstat_sighup_handler);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, pgstat_exit);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    pqsignal(SIGCHLD, SIG_DFL);
-    pqsignal(SIGTTIN, SIG_DFL);
-    pqsignal(SIGTTOU, SIG_DFL);
-    pqsignal(SIGCONT, SIG_DFL);
-    pqsignal(SIGWINCH, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    /*
-     * Identify myself via ps
-     */
-    init_ps_display("stats collector", "", "", "");
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do recognize got_SIGHUP inside
-     * the inner loop, which means that such interrupts will get serviced but
-     * the latch won't get cleared until next time there is a break in the
-     * action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (need_exit)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * need_exit becomes set.
-         */
-        while (!need_exit)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (got_SIGHUP)
-            {
-                got_SIGHUP = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry((PgStat_MsgInquiry *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat((PgStat_MsgTabstat *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge((PgStat_MsgTabpurge *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb((PgStat_MsgDropdb *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter((PgStat_MsgResetcounter *) &msg,
-                                             len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(
-                                                   (PgStat_MsgResetsharedcounter *) &msg,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(
-                                                   (PgStat_MsgResetsinglecounter *) &msg,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac((PgStat_MsgAutovacStart *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum((PgStat_MsgVacuum *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze((PgStat_MsgAnalyze *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver((PgStat_MsgArchiver *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter((PgStat_MsgBgWriter *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat((PgStat_MsgFuncstat *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge((PgStat_MsgFuncpurge *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict((PgStat_MsgRecoveryConflict *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock((PgStat_MsgDeadlock *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile((PgStat_MsgTempFile *) &msg, len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr & WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    exit(0);
-}
-
-
-/* SIGQUIT signal handler for collector process */
-static void
-pgstat_exit(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    need_exit = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
-/* SIGHUP handler for collector process */
-static void
-pgstat_sighup_handler(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    got_SIGHUP = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
 /*
- * Subroutine to clear stats in a database entry
+ * Subroutine to reset stats in a shared database entry
  *
  * Tables and functions hashes are initialized to empty.
  */
 static void
 reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
 {
-    HASHCTL        hash_ctl;
+    dshash_table *tbl;
 
     dbentry->n_xact_commit = 0;
     dbentry->n_xact_rollback = 0;
@@ -4564,20 +4384,17 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
     dbentry->stat_reset_timestamp = GetCurrentTimestamp();
     dbentry->stats_timestamp = 0;
 
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
 
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
+    Assert(dbentry->tables == DSM_HANDLE_INVALID);
+    tbl = dshash_create(area, &dsh_tblparams, 0);
+    dbentry->tables = dshash_get_hash_table_handle(tbl);
+    dshash_detach(tbl);
+
+    Assert(dbentry->functions == DSM_HANDLE_INVALID);
+    /* the function hash is created on demand */
+
+    dbentry->snapshot_tables = NULL;
+    dbentry->snapshot_functions = NULL;
 }
 
 /*
@@ -4586,47 +4403,76 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
  * Else, return NULL.
  */
 static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
+pgstat_get_db_entry(Oid databaseid, int op, pg_stat_table_result_status *status)
 {
     PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
+    bool        nowait = ((op & PGSTAT_FETCH_NOWAIT) != 0);
+    bool        lock_acquired = true;
+    bool        found = true;
 
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
+    if (!IsUnderPostmaster)
         return NULL;
 
-    /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
-     */
-    if (!found)
-        reset_dbentry_counters(result);
+    /* Lookup or create the hash table entry for this database */
+    if (op & PGSTAT_FETCH_EXCLUSIVE)
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_or_insert_extended(db_stats, &databaseid,
+                                           &found, nowait);
+        if (result == NULL)
+            lock_acquired = false;
+        else if (!found)
+        {
+            /*
+             * If not found, initialize the new one.  This creates an empty
+             * hash for tables; the function hash is created on demand.
+             */
+            reset_dbentry_counters(result);
+        }
+    }
+    else
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_extended(db_stats, &databaseid,
+                                 &lock_acquired, true, nowait);
+        if (result == NULL)
+            found = false;
+    }
+
+    /* Set return status if requested */
+    if (status)
+    {
+        if (!lock_acquired)
+        {
+            Assert(nowait);
+            *status = PGSTAT_ENTRY_LOCK_FAILED;
+        }
+        else if (!found)
+            *status = PGSTAT_ENTRY_NOT_FOUND;
+        else
+            *status = PGSTAT_ENTRY_FOUND;
+    }
 
     return result;
 }
 
-
 /*
  * Lookup the hash table entry for the specified table. If no hash
  * table entry exists, initialize it, if the create parameter is true.
  * Else, return NULL.
  */
 static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create)
 {
     PgStat_StatTabEntry *result;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
 
     /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
+    result = (PgStat_StatTabEntry *)
+        (create ? dshash_find_or_insert(table, &tableoid, &found)
+                : dshash_find(table, &tableoid, false));
+    if (!create)
+        found = (result != NULL);    /* dshash_find() has no "found" output */
 
     if (!create && !found)
         return NULL;
@@ -4662,29 +4508,23 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
 
 /* ----------
  * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ *        Write the global statistics file, as well as DB files.
  * ----------
  */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+void
+pgstat_write_statsfiles(void)
 {
-    HASH_SEQ_STATUS hstat;
+    dshash_seq_status hstat;
     PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
 
+    /* should be called from postmaster */
+    Assert(!IsUnderPostmaster);
+
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
     /*
@@ -4703,7 +4543,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
 
     /*
      * Write the file header --- currently just a format ID.
@@ -4715,32 +4555,29 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Write global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write archiver stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Walk through the database table.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_init(&hstat, db_stats, false, false);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&hstat)) != NULL)
     {
         /*
          * Write out the table and function stats for this DB into the
          * appropriate per-DB stat file, if required.
          */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
+        pgstat_write_db_statsfile(dbentry);
 
         /*
          * Write out the DB entry. We don't write the tables or functions
@@ -4783,16 +4620,6 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
 }
 
 /*
@@ -4800,15 +4627,14 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
  * of length len.
  */
 static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
+get_dbstat_filename(bool tempname, Oid databaseid,
                     char *filename, int len)
 {
     int            printed;
 
     /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
     printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
                        databaseid,
                        tempname ? "tmp" : "stat");
     if (printed >= len)
@@ -4826,10 +4652,10 @@ get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
  * ----------
  */
 static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
+pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry)
 {
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
+    dshash_seq_status tstat;
+    dshash_seq_status fstat;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatFuncEntry *funcentry;
     FILE       *fpout;
@@ -4838,9 +4664,10 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     int            rc;
     char        tmpfile[MAXPGPATH];
     char        statfile[MAXPGPATH];
+    dshash_table *tbl;
 
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -4867,23 +4694,30 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     /*
      * Walk through the database's access stats per table.
      */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&tstat, tbl, false, false);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&tstat)) != NULL)
     {
         fputc('T', fpout);
         rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_detach(tbl);
 
     /*
      * Walk through the database's function stats table.
      */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+    if (dbentry->functions != DSM_HANDLE_INVALID)
     {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
+        tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_seq_init(&fstat, tbl, false, false);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&fstat)) != NULL)
+        {
+            fputc('F', fpout);
+            rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+            (void) rc;                /* we'll check for error with ferror */
+        }
+        dshash_detach(tbl);
     }
 
     /*
@@ -4918,47 +4752,30 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
 }
 
 /* ----------
  * pgstat_read_statsfiles() -
  *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ *    Reads in some existing statistics collector files into the shared stats
+ *    hash.
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+void
+pgstat_read_statsfiles(void)
 {
     PgStat_StatDBEntry *dbentry;
     PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    dshash_table *tblstats = NULL;
+    dshash_table *funcstats = NULL;
+
+    /* must be called from the postmaster */
+    Assert(!IsUnderPostmaster);
 
     /*
      * The tables will live in pgStatLocalContext.
@@ -4966,28 +4783,18 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     pgstat_setup_memcxt();
 
     /*
-     * Create the DB hashtable
+     * Create the DB hashtable and global stats area
      */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
+    /* Hold lock so that no other process sees empty stats */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    pgstat_create_shared_stats();
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp = shared_globalStats->stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
@@ -5001,11 +4808,12 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        LWLockRelease(StatsLock);
+        return;
     }
 
     /*
@@ -5014,7 +4822,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
@@ -5022,11 +4830,12 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     /*
      * Read global stats struct
      */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+        sizeof(*shared_globalStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
+        memset(shared_globalStats, 0, sizeof(*shared_globalStats));
         goto done;
     }
 
@@ -5037,17 +4846,17 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
      * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
      * an unusual scenario.
      */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
+    shared_globalStats->stats_timestamp = 0;
 
     /*
      * Read archiver stats struct
      */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+        sizeof(*shared_archiverStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
+        memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
         goto done;
     }
 
@@ -5067,7 +4876,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
                           fpin) != offsetof(PgStat_StatDBEntry, tables))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5076,21 +4885,23 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 /*
                  * Add to the DB hash
                  */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
+                dbentry = (PgStat_StatDBEntry *)
+                    dshash_find_or_insert(db_stats, (void *) &dbbuf.databaseid,
+                                          &found);
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    dshash_release_lock(db_stats, dbentry);
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
                 memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
+                dbentry->tables = DSM_HANDLE_INVALID;
+                dbentry->functions = DSM_HANDLE_INVALID;
+                dbentry->snapshot_tables = NULL;
+                dbentry->snapshot_functions = NULL;
 
                 /*
                  * In the collector, disregard the timestamp we read from the
@@ -5098,54 +4909,26 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                  * stats file immediately upon the first request from any
                  * backend.
                  */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+                dbentry->stats_timestamp = 0;
 
                 /*
                  * If requested, read the data from the database-specific
                  * file.  Otherwise we just leave the hashtables empty.
                  */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
+                tblstats = dshash_create(area, &dsh_tblparams, 0);
+                dbentry->tables = dshash_get_hash_table_handle(tblstats);
+                /* we don't create the function hash at present */
+                dshash_release_lock(db_stats, dbentry);
+                pgstat_read_db_statsfile(dbentry->databaseid,
+                                         tblstats, funcstats);
+                dshash_detach(tblstats);
                 break;
 
             case 'E':
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5153,36 +4936,62 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     }
 
 done:
+    LWLockRelease(StatsLock);
     FreeFile(fpin);
 
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 
-    return dbhash;
+    return;
 }
 
 
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+void
+StatsShmemInit(void)
+{
+    bool    found;
+
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+
+        /* Load saved data if any */
+        pgstat_read_statsfiles();
+
+        /* needs to be called before dsm shutdown */
+        before_shmem_exit(pgstat_postmaster_shutdown, (Datum) 0);
+    }
+}
+
+static void
+pgstat_postmaster_shutdown(int code, Datum arg)
+{
+    /* we trash the stats on crash */
+    if (code == 0)
+        pgstat_write_statsfiles();
+}
+
 /* ----------
  * pgstat_read_db_statsfile() -
  *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
+ *    Reads in the permanent statistics collector file and creates shared
+ *    statistics tables. The file is removed after reading.
  * ----------
  */
 static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
+pgstat_read_db_statsfile(Oid databaseid,
+                         dshash_table *tabhash, dshash_table *funchash)
 {
     PgStat_StatTabEntry *tabentry;
     PgStat_StatTabEntry tabbuf;
@@ -5193,7 +5002,10 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     bool        found;
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
+    /* must be called from the postmaster */
+    Assert(!IsUnderPostmaster);
+
+    get_dbstat_filename(false, databaseid, statfile, MAXPGPATH);
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
@@ -5207,7 +5019,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
@@ -5220,7 +5032,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
@@ -5240,7 +5052,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
                           fpin) != sizeof(PgStat_StatTabEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5252,19 +5064,21 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (tabhash == NULL)
                     break;
 
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
+                tabentry = (PgStat_StatTabEntry *)
+                    dshash_find_or_insert(tabhash,
+                                          (void *) &tabbuf.tableid, &found);
 
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    dshash_release_lock(tabhash, tabentry);
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
                 memcpy(tabentry, &tabbuf, sizeof(tabbuf));
+                dshash_release_lock(tabhash, tabentry);
                 break;
 
                 /*
@@ -5274,7 +5088,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
                           fpin) != sizeof(PgStat_StatFuncEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5286,19 +5100,20 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (funchash == NULL)
                     break;
 
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
+                funcentry = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert(funchash,
+                                          (void *) &funcbuf.functionid, &found);
 
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
                 memcpy(funcentry, &funcbuf, sizeof(funcbuf));
+                dshash_release_lock(funchash, funcentry);
                 break;
 
                 /*
@@ -5308,7 +5123,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5318,276 +5133,290 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
 done:
     FreeFile(fpin);
 
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 }
 
 /* ----------
- * pgstat_read_db_statsfile_timestamp() -
+ * backend_clean_snapshot_callback() -
  *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
+ *    This is usually called with arg = NULL when the memory context in which
+ *    the current snapshot was taken is reset or deleted. Don't bother
+ *    releasing memory in that case.
  * ----------
  */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
+static void
+backend_clean_snapshot_callback(void *arg)
 {
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    if (arg != NULL)
     {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
+        /* explicitly called, so explicitly free resources */
+        if (snapshot_globalStats)
+            pfree(snapshot_globalStats);
 
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
+        if (snapshot_archiverStats)
+            pfree(snapshot_archiverStats);
 
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
+        if (snapshot_db_stats)
         {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
+            HASH_SEQ_STATUS seq;
+            PgStat_StatDBEntry *dbent;
 
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
+            hash_seq_init(&seq, snapshot_db_stats);
+            while ((dbent = hash_seq_search(&seq)) != NULL)
+            {
+                if (dbent->snapshot_tables)
+                    hash_destroy(dbent->snapshot_tables);
+                if (dbent->snapshot_functions)
+                    hash_destroy(dbent->snapshot_functions);
+            }
+            hash_destroy(snapshot_db_stats);
         }
     }
 
-done:
-    FreeFile(fpin);
-    return true;
+    /* mark the resources as not allocated */
+    snapshot_globalStats = NULL;
+    snapshot_archiverStats = NULL;
+    snapshot_db_stats = NULL;
 }
 
 /*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
+ * create_local_stats_hash() -
+ *
+ * Creates a dynahash used for table/function stats cache.
  */
-static void
-backend_read_statsfile(void)
+static HTAB *
+create_local_stats_hash(const char *name, size_t keysize, size_t entrysize,
+                        int nentries)
 {
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
+    HTAB *result;
+    HASHCTL ctl;
 
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
+    /* Create the hash in the stats context */
+    ctl.keysize        = keysize;
+    ctl.entrysize    = entrysize;
+    ctl.hcxt        = stats_cxt;
+    result = hash_create(name, nentries, &ctl,
+                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    return result;
 }
 
+/*
+ * snapshot_statentry() - Find an entry in the source dshash.
+ *
+ * Returns the entry for key, or NULL if not found. If dest is not NULL, *dest
+ * is used as a local cache, created in the same shape as the given dshash
+ * when *dest is NULL. Either way the result is cached in that hash and the
+ * same entry is returned to subsequent calls for the same key.
+ *
+ * Otherwise the returned entry is a copy palloc'ed in the current memory
+ * context. Its content may differ for every request.
+ *
+ * If dshash is NULL, temporarily attaches dsh_handle instead.
+ */
+static void *
+snapshot_statentry(HTAB **dest, const char *hashname,
+                   dshash_table *dshash, dshash_table_handle dsh_handle,
+                   const dshash_parameters *dsh_params, Oid key)
+{
+    void *lentry = NULL;
+    size_t keysize = dsh_params->key_size;
+    size_t entrysize = dsh_params->entry_size;
+
+    if (dest)
+    {
+        /* caches the result entry */
+        bool found;
+
+        /*
+         * Create new hash with arbitrary initial entries since we don't know
+         * how this hash will grow.
+         */
+        if (!*dest)
+        {
+            Assert(hashname);
+            *dest = create_local_stats_hash(hashname, keysize, entrysize, 32);
+        }
+
+        lentry = hash_search(*dest, &key, HASH_ENTER, &found);
+        if (!found)
+        {
+            dshash_table *t = dshash;
+            void *sentry;
+
+            if (!t)
+                t = dshash_attach(area, dsh_params, dsh_handle, NULL);
+
+            sentry = dshash_find(t, &key, false);
+
+            /*
+             * We expect that the stats for specified database exists in most
+             * cases.
+             */
+            if (!sentry)
+            {
+                hash_search(*dest, &key, HASH_REMOVE, NULL);
+                if (!dshash)
+                    dshash_detach(t);
+                return NULL;
+            }
+            memcpy(lentry, sentry, entrysize);
+            dshash_release_lock(t, sentry);
+
+            if (!dshash)
+                dshash_detach(t);
+        }
+    }
+    else
+    {
+        /*
+         * The caller doesn't want caching. Just make a copy of the entry then
+         * return.
+         */
+        dshash_table *t = dshash;
+        void *sentry;
+
+        if (!t)
+            t = dshash_attach(area, dsh_params, dsh_handle, NULL);
+
+        sentry = dshash_find(t, &key, false);
+        if (sentry)
+        {
+            lentry = palloc(entrysize);
+            memcpy(lentry, sentry, entrysize);
+            dshash_release_lock(t, sentry);
+        }
+
+        if (!dshash)
+            dshash_detach(t);
+    }
+
+    return lentry;
+}
+
+/*
+ * backend_snapshot_global_stats() -
+ *
+ * Makes a local copy of global stats if not already done.  They will be kept
+ * until pgstat_clear_snapshot() is called or the end of the current memory
+ * context (typically TopTransactionContext).  Returns false if the shared
+ * stats have not been created yet.
+ */
+static bool
+backend_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext = CurrentMemoryContext;
+    MemoryContextCallback *mcxt_cb;
+
+    /* Nothing to do if already done */
+    if (snapshot_globalStats)
+        return true;
+
+    Assert(snapshot_archiverStats == NULL);
+
+    /*
+     * The snapshot lives within the current top transaction if any, or the
+     * current memory context lifetime otherwise.
+     */
+    if (IsTransactionState())
+        oldcontext = MemoryContextSwitchTo(TopTransactionContext);
+
+    /* Remember for stats memory allocation later */
+    stats_cxt = CurrentMemoryContext;
+
+    /* global stats can be just copied  */
+    snapshot_globalStats = palloc(sizeof(PgStat_GlobalStats));
+    memcpy(snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+
+    snapshot_archiverStats = palloc(sizeof(PgStat_ArchiverStats));
+    memcpy(snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+
+    /* set the timestamp of this snapshot */
+    snapshot_globalStats->stats_timestamp = GetCurrentTimestamp();
+
+    /* register callback to clear snapshot */
+    mcxt_cb = (MemoryContextCallback *)palloc(sizeof(MemoryContextCallback));
+    mcxt_cb->func = backend_clean_snapshot_callback;
+    mcxt_cb->arg = NULL;
+    MemoryContextRegisterResetCallback(CurrentMemoryContext, mcxt_cb);
+    MemoryContextSwitchTo(oldcontext);
+
+    return true;
+}
+
+/* ----------
+ * pgstat_fetch_stat_dbentry() -
+ *
+ *    Find database stats entry on backends. The returned entries are cached
+ *    until transaction end. If oneshot is true, they are not cached and are
+ *    returned in palloc'ed memory.
+ */
+PgStat_StatDBEntry *
+pgstat_fetch_stat_dbentry(Oid dbid, bool oneshot)
+{
+    /* take a local snapshot if we don't have one */
+    char *hashname = "local database stats hash";
+    PgStat_StatDBEntry *dbentry;
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    /* If not done for this transaction, take a snapshot of global stats */
+    if (!backend_snapshot_global_stats())
+        return NULL;
+
+    dbentry = snapshot_statentry(oneshot ? NULL : &snapshot_db_stats,
+                                 hashname, db_stats, 0, &dsh_dbparams,
+                                 dbid);
+    
+    return dbentry;
+}
+
+/* ----------
+ * backend_get_tab_entry() -
+ *
+ *    Find table stats entry on backends. The returned entries are cached until
+ *    transaction end. If oneshot is true, they are not cached and are returned
+ *    in palloc'ed memory.
+ */
+PgStat_StatTabEntry *
+backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid reloid, bool oneshot)
+{
+    /* take a local snapshot if we don't have one */
+    char *hashname = "local table stats hash";
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    return snapshot_statentry(oneshot ? NULL : &dbent->snapshot_tables,
+                              hashname, NULL, dbent->tables, &dsh_tblparams,
+                              reloid);
+}
+
+/* ----------
+ * backend_get_func_entry() -
+ *
+ *    Find function stats entry on backends. The returned entries are cached
+ *    until transaction end. If oneshot is true, they are not cached and are
+ *    returned in palloc'ed memory.
+ */
+static PgStat_StatFuncEntry *
+backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot)
+{
+    char *hashname = "local function stats hash";
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    if (dbent->functions == DSM_HANDLE_INVALID)
+        return NULL;
+
+    return snapshot_statentry(oneshot ? NULL : &dbent->snapshot_tables,
+                              hashname, NULL, dbent->functions, &dsh_funcparams,
+                              funcid);
+}
 
 /* ----------
  * pgstat_setup_memcxt() -
@@ -5618,6 +5447,8 @@ pgstat_setup_memcxt(void)
 void
 pgstat_clear_snapshot(void)
 {
+    int param = 0;    /* only the address is significant */
+
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
         MemoryContextDelete(pgStatLocalContext);
@@ -5627,717 +5458,112 @@ pgstat_clear_snapshot(void)
     pgStatDBHash = NULL;
     localBackendStatusTable = NULL;
     localNumBackends = 0;
+
+    /*
+     * The parameter informs the function that it is not being called as a
+     * MemoryContextCallback.
+     */
+    backend_clean_snapshot_callback(&param);
 }
 
 
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
+static bool
+pgstat_update_tabentry(dshash_table *tabhash, PgStat_TableStatus *stat,
+                       bool nowait)
 {
-    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    bool    found;
 
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
+    if (tabhash == NULL)
+        return false;
 
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
+    tabentry = (PgStat_StatTabEntry *)
+        dshash_find_or_insert_extended(tabhash, (void *) &(stat->t_id),
+                                       &found, nowait);
 
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
+    /* failed to acquire lock */
+    if (tabentry == NULL)
+        return false;
+
+    if (!found)
     {
         /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
+         * If it's a new table entry, initialize counters to the values we
+         * just got.
          */
+        tabentry->numscans = stat->t_counts.t_numscans;
+        tabentry->tuples_returned = stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched = stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted = stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated = stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted = stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated = stat->t_counts.t_tuples_hot_updated;
+        tabentry->n_live_tuples = stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples = stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze = stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched = stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit = stat->t_counts.t_blocks_hit;
+
+        tabentry->vacuum_timestamp = 0;
+        tabentry->vacuum_count = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_count = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->analyze_count = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+        tabentry->autovac_analyze_count = 0;
     }
-    else if (msg->clock_time < dbentry->stats_timestamp)
+    else
     {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
         /*
-         * Add per-table stats to the per-database entry, too.
+         * Otherwise add the values to the existing entry.
          */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetshared() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
+        tabentry->numscans += stat->t_counts.t_numscans;
+        tabentry->tuples_returned += stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched += stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted += stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated += stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted += stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated += stat->t_counts.t_tuples_hot_updated;
+        /* If table was truncated, first reset the live/dead counters */
+        if (stat->t_counts.t_truncated)
         {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
+            tabentry->n_live_tuples = 0;
+            tabentry->n_dead_tuples = 0;
         }
+        tabentry->n_live_tuples += stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples += stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze += stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched += stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit += stat->t_counts.t_blocks_hit;
     }
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
+    
+    dshash_release_lock(tabhash, tabentry);
+
+    return true;
 }
 
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
 static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
+pgstat_update_dbentry(PgStat_StatDBEntry *dbentry, PgStat_TableStatus *stat)
 {
     /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
+     * Add per-table stats to the per-database entry, too.
      */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
+    dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
+    dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
+    dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
+    dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
+    dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
+    dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
+    dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
 }
 
+
 /*
  * Convert a potentially unsafely truncated activity string (see
  * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 8336f136c7..559aeedb6e 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -255,7 +255,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -1295,12 +1294,6 @@ PostmasterMain(int argc, char *argv[])
 
     whereToSendOutput = DestNone;
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1753,11 +1746,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2582,8 +2570,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -2914,8 +2900,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -2982,13 +2966,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3063,22 +3040,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3537,22 +3498,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3756,8 +3701,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3796,8 +3739,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -3998,8 +3940,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -4972,18 +4912,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5096,12 +5024,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 6e420d893c..862582da23 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -127,7 +127,7 @@ finish_sync_worker(void)
     if (IsTransactionState())
     {
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(true);
     }
 
     /* And flush all writes. */
@@ -143,6 +143,9 @@ finish_sync_worker(void)
     /* Find the main apply worker and signal it. */
     logicalrep_worker_wakeup(MyLogicalRepWorker->subid, InvalidOid);
 
+    /* clean up retained statistics */
+    pgstat_update_stat(true);
+
     /* Stop gracefully */
     proc_exit(0);
 }
@@ -533,7 +536,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
     if (started_tx)
     {
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(false);
     }
 }
 
@@ -876,7 +879,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
                                            MyLogicalRepWorker->relstate,
                                            MyLogicalRepWorker->relstate_lsn);
                 CommitTransactionCommand();
-                pgstat_report_stat(false);
+                pgstat_update_stat(false);
 
                 /*
                  * We want to do the table data sync in a single transaction.
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 277da69fa6..087850d089 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -491,7 +491,7 @@ apply_handle_commit(StringInfo s)
         replorigin_session_origin_timestamp = commit_data.committime;
 
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(false);
 
         store_flush_position(commit_data.end_lsn);
     }
@@ -1324,6 +1324,8 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
             }
 
             send_feedback(last_received, requestReply, requestReply);
+
+            pgstat_update_stat(false);
         }
     }
 }
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 01eabe5706..e794a81c4c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1984,7 +1984,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                BgWriterStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2092,7 +2092,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2282,7 +2282,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2290,7 +2290,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
 
diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index edee89c116..18e73b0288 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -197,6 +197,15 @@ dsm_postmaster_startup(PGShmemHeader *shim)
     dsm_control->maxitems = maxitems;
 }
 
+/*
+ * Reset the dsm_init_done flag on child start.
+ */
+void
+dsm_child_init(void)
+{
+    dsm_init_done = false;
+}
+
 /*
  * Determine whether the control segment from the previous postmaster
  * invocation still exists.  If so, remove the dynamic shared memory
@@ -423,6 +432,15 @@ dsm_set_control_handle(dsm_handle h)
 }
 #endif
 
+/*
+ * Return whether the dsm feature is available in this process.
+ */
+bool
+dsm_is_available(void)
+{
+    return dsm_control != NULL;
+}
+
 /*
  * Create a new dynamic shared memory segment.
  *
@@ -440,8 +458,7 @@ dsm_create(Size size, int flags)
     uint32        i;
     uint32        nitems;
 
-    /* Unsafe in postmaster (and pointless in a stand-alone backend). */
-    Assert(IsUnderPostmaster);
+    Assert(dsm_is_available());
 
     if (!dsm_init_done)
         dsm_backend_startup();
@@ -537,8 +554,7 @@ dsm_attach(dsm_handle h)
     uint32        i;
     uint32        nitems;
 
-    /* Unsafe in postmaster (and pointless in a stand-alone backend). */
-    Assert(IsUnderPostmaster);
+    Assert(dsm_is_available());
 
     if (!dsm_init_done)
         dsm_backend_startup();
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 0c86a581c0..cce6d3ffa2 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -150,6 +150,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
         size = add_size(size, BackendRandomShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -282,8 +283,13 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 
     /* Initialize dynamic shared memory facilities. */
     if (!IsUnderPostmaster)
+    {
         dsm_postmaster_startup(shim);
 
+        /* Stats collector uses dynamic shared memory */
+        StatsShmemInit();
+    }
+
     /*
      * Now give loadable modules a chance to set up their shmem allocations
      */
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index a6fda81feb..c46bb8d057 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -521,6 +521,9 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_TBM, "tbm");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
+    LWLockRegisterTranche(LWTRANCHE_STATS_DSA, "stats table dsa");
+    LWLockRegisterTranche(LWTRANCHE_STATS_DB, "db stats");
+    LWLockRegisterTranche(LWTRANCHE_STATS_FUNC_TABLE, "table/func stats");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index e6025ecedb..798af9f168 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -50,3 +50,4 @@ OldSnapshotTimeMapLock                42
 BackendRandomLock                    43
 LogicalRepWorkerLock                44
 CLogTruncationLock                    45
+StatsLock                            46
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index a3b9757565..ee4e43331b 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3146,6 +3146,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_update_stat(true);
+    }
 }
 
 
@@ -3720,6 +3726,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4160,9 +4167,17 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
-                ProcessCompletedNotifies();
-                pgstat_report_stat(false);
+                long stats_timeout;
 
+                ProcessCompletedNotifies();
+
+                stats_timeout = pgstat_update_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4197,7 +4212,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4205,6 +4220,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index e95e347184..eca801eeed 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -33,7 +33,7 @@
 #define UINT32_ACCESS_ONCE(var)         ((uint32)(*((volatile uint32 *)&(var))))
 
 /* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
+extern PgStat_BgWriter bgwriterStats;
 
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
@@ -1176,7 +1176,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_xact_commit);
@@ -1192,7 +1192,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_xact_rollback);
@@ -1208,7 +1208,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_blocks_fetched);
@@ -1224,7 +1224,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_blocks_hit);
@@ -1240,7 +1240,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_returned);
@@ -1256,7 +1256,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_fetched);
@@ -1272,7 +1272,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_inserted);
@@ -1288,7 +1288,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_updated);
@@ -1304,7 +1304,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_deleted);
@@ -1319,7 +1319,7 @@ pg_stat_get_db_stat_reset_time(PG_FUNCTION_ARGS)
     TimestampTz result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = dbentry->stat_reset_timestamp;
@@ -1337,7 +1337,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = dbentry->n_temp_files;
@@ -1353,7 +1353,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = dbentry->n_temp_bytes;
@@ -1368,7 +1368,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_tablespace);
@@ -1383,7 +1383,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_lock);
@@ -1398,7 +1398,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_snapshot);
@@ -1413,7 +1413,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_bufferpin);
@@ -1428,7 +1428,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_startup_deadlock);
@@ -1443,7 +1443,7 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (
@@ -1463,7 +1463,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_deadlocks);
@@ -1479,7 +1479,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     PgStat_StatDBEntry *dbentry;
 
     /* convert counter from microsec to millisec for display */
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = ((double) dbentry->n_block_read_time) / 1000.0;
@@ -1495,7 +1495,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     PgStat_StatDBEntry *dbentry;
 
     /* convert counter from microsec to millisec for display */
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = ((double) dbentry->n_block_write_time) / 1000.0;
@@ -1850,6 +1850,9 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     /* Get statistics about the archiver process */
     archiver_stats = pgstat_fetch_stat_archiver();
 
+    if (archiver_stats == NULL)
+        PG_RETURN_NULL();
+
     /* Fill values and NULLs */
     values[0] = Int64GetDatum(archiver_stats->archived_count);
     if (*(archiver_stats->last_archived_wal) == '\0')
@@ -1879,6 +1882,5 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
         values[6] = TimestampTzGetDatum(archiver_stats->stat_reset_timestamp);
 
     /* Returns the record as Datum */
-    PG_RETURN_DATUM(HeapTupleGetDatum(
-                                      heap_form_tuple(tupdesc, values, nulls)));
+    PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
 }
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index c6939779b9..1377bbbbdb 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 4f1d2a0d28..1e4fa89135 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -72,6 +72,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -628,6 +629,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1239,6 +1242,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 0327b295da..cfdffbca2b 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -188,7 +188,6 @@ static bool check_autovacuum_max_workers(int *newval, void **extra, GucSource so
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static void assign_effective_io_concurrency(int newval, void *extra);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -3766,17 +3765,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -10725,35 +10713,6 @@ assign_effective_io_concurrency(int newval, void *extra)
 #endif                            /* USE_PREFETCH */
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 3fe257c53f..58cb38e00d 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -501,7 +501,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 2211d90c6f..e6f4d30658 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 3026f95728..91c3fb1a0a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
@@ -403,7 +404,6 @@ typedef enum
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
-
     NUM_AUXPROCTYPES            /* Must be last! */
 } AuxProcType;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 33c7372f00..8ed01499e5 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -13,6 +13,7 @@
 
 #include "datatype/timestamp.h"
 #include "fmgr.h"
+#include "lib/dshash.h"
 #include "libpq/pqcomm.h"
 #include "port/atomics.h"
 #include "portability/instr_time.h"
@@ -41,32 +42,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -115,13 +90,6 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_blocks_hit;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -180,271 +148,23 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
 /* ----------
- * PgStat_MsgHdr                The common message header
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgHdr
+typedef struct PgStat_BgWriter
 {
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter buf_alloc;
+    PgStat_Counter checkpoint_write_time; /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+} PgStat_BgWriter;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
@@ -485,79 +205,6 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgAutovacStart msg_autovacuum;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
  * Statistic collector data structures follow
  *
@@ -601,10 +248,13 @@ typedef struct PgStat_StatDBEntry
 
     /*
      * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * the handles and pointers out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables;
+    dshash_table_handle functions;
+    /* for snapshot tables */
+    HTAB *snapshot_tables;
+    HTAB *snapshot_functions;
 } PgStat_StatDBEntry;
 
 
@@ -1136,13 +786,11 @@ extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
-extern char *pgstat_stat_tmpname;
-extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1154,34 +802,20 @@ extern PgStat_Counter pgStatBlockWriteTime;
  * Functions called from postmaster
  * ----------
  */
-extern Size BackendStatusShmemSize(void);
-extern void CreateSharedBackendStatus(void);
-
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
-
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_update_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
-extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
-extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
-
+extern void pgstat_reset_shared_counters(const char *target);
+extern void pgstat_reset_single_counter(Oid objectid,
+                                        PgStat_Single_Reset_Type type);
 extern void pgstat_report_autovac(Oid dboid);
 extern void pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples);
@@ -1192,6 +826,8 @@ extern void pgstat_report_analyze(Relation rel,
 extern void pgstat_report_recovery_conflict(int reason);
 extern void pgstat_report_deadlock(void);
 
+extern void pgstat_clear_snapshot(void);
+
 extern void pgstat_initialize(void);
 extern void pgstat_bestart(void);
 
@@ -1219,6 +855,9 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 extern void pgstat_initstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
+extern PgStat_StatDBEntry *backend_get_db_entry(Oid dbid, bool oneshot);
+extern HTAB *backend_snapshot_all_db_entries(void);
+extern PgStat_StatTabEntry *backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid relid, bool oneshot);
 
 /* ----------
  * pgstat_report_wait_start() -
@@ -1338,15 +977,15 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                           void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
+extern void pgstat_update_archiver(const char *xlog, bool failed);
+extern void pgstat_update_bgwriter(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
  * generate the pgstat* views.
  * ----------
  */
-extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
+extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid, bool oneshot);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
@@ -1355,4 +994,14 @@ extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
 
+/* File input/output functions  */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
+
+/* For shared memory allocation/initialization */
+extern Size BackendStatusShmemSize(void);
+extern void CreateSharedBackendStatus(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/dsm.h b/src/include/storage/dsm.h
index b4654cb5ca..379f0bc5c0 100644
--- a/src/include/storage/dsm.h
+++ b/src/include/storage/dsm.h
@@ -26,6 +26,7 @@ typedef struct dsm_segment dsm_segment;
 struct PGShmemHeader;            /* avoid including pg_shmem.h */
 extern void dsm_cleanup_using_control_segment(dsm_handle old_control_handle);
 extern void dsm_postmaster_startup(struct PGShmemHeader *);
+extern void dsm_child_init(void);
 extern void dsm_backend_shutdown(void);
 extern void dsm_detach_all(void);
 
@@ -33,6 +34,8 @@ extern void dsm_detach_all(void);
 extern void dsm_set_control_handle(dsm_handle h);
 #endif
 
+extern bool dsm_is_available(void);
+
 /* Functions that create or remove mappings. */
 extern dsm_segment *dsm_create(Size size, int flags);
 extern dsm_segment *dsm_attach(dsm_handle h);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index b2dcb73287..4cb628b15f 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -219,6 +219,9 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SHARED_TUPLESTORE,
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
+    LWTRANCHE_STATS_DSA,
+    LWTRANCHE_STATS_DB,
+    LWTRANCHE_STATS_FUNC_TABLE,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index dcc7307c16..b8a56645b6 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
diff --git a/src/test/modules/worker_spi/worker_spi.c b/src/test/modules/worker_spi/worker_spi.c
index 0d705a3f2e..da488ebfd4 100644
--- a/src/test/modules/worker_spi/worker_spi.c
+++ b/src/test/modules/worker_spi/worker_spi.c
@@ -295,7 +295,7 @@ worker_spi_main(Datum main_arg)
         SPI_finish();
         PopActiveSnapshot();
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(false);
         pgstat_report_activity(STATE_IDLE, NULL);
     }
 
-- 
2.16.3

From 13915f12e2d0d1acf7c6f21151c173ab117b555f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 7 Nov 2018 16:53:49 +0900
Subject: [PATCH 3/7] Make archiver process an auxiliary process

This is a preliminary patch for the shared-memory based stats collector.
Once stats data is moved into shared memory, the archiver process uses
shared memory and therefore must be an auxiliary process. Make it an
auxiliary process so that it keeps working.
---
 src/backend/bootstrap/bootstrap.c   |  8 +++
 src/backend/postmaster/pgarch.c     | 98 +++++++++----------------------------
 src/backend/postmaster/pgstat.c     |  6 +++
 src/backend/postmaster/postmaster.c | 35 +++++++++----
 src/include/miscadmin.h             |  2 +
 src/include/pgstat.h                |  1 +
 src/include/postmaster/pgarch.h     |  4 +-
 7 files changed, 67 insertions(+), 87 deletions(-)
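
A note on the postmaster.c hunk in this patch: it adds a BACKEND_TYPE_ARCHIVER
bit (0x0010) and widens BACKEND_TYPE_ALL from 0x000F to 0x001F. The sketch
below only checks that mask arithmetic in isolation. BACKEND_TYPE_NORMAL
(0x0001) does not appear in the hunk and is assumed here, and
demo_type_matches() is a hypothetical helper, not a postmaster function.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit values as they appear in the postmaster.c hunk of this patch. */
#define BACKEND_TYPE_NORMAL     0x0001  /* assumed; not shown in the hunk */
#define BACKEND_TYPE_AUTOVAC    0x0002
#define BACKEND_TYPE_WALSND     0x0004
#define BACKEND_TYPE_BGWORKER   0x0008
#define BACKEND_TYPE_ARCHIVER   0x0010  /* new bit added by the patch */
#define BACKEND_TYPE_ALL        0x001F  /* OR of all the above */
#define BACKEND_TYPE_ALL_OLD    0x000F  /* value before the patch */

/* Hypothetical helper: does a child of type child_type fall under target_mask? */
static bool
demo_type_matches(int target_mask, int child_type)
{
    assert(child_type != 0);
    return (target_mask & child_type) != 0;
}
```

With the old mask, demo_type_matches(BACKEND_TYPE_ALL_OLD,
BACKEND_TYPE_ARCHIVER) is false, which is why the "all" mask has to grow
together with the new bit.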

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 578af2e66d..dab0addd8b 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -327,6 +327,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case BgWriterProcess:
                 statmsg = pgstat_get_backend_desc(B_BG_WRITER);
                 break;
+            case ArchiverProcess:
+                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
+                break;
             case CheckpointerProcess:
                 statmsg = pgstat_get_backend_desc(B_CHECKPOINTER);
                 break;
@@ -454,6 +457,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
             BackgroundWriterMain();
             proc_exit(1);        /* should never return */
 
+        case ArchiverProcess:
+            /* don't set signals, archiver has its own agenda */
+            PgArchiverMain();
+            proc_exit(1);        /* should never return */
+
         case CheckpointerProcess:
             /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 885e85ad8a..0b41857014 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -66,7 +66,6 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
@@ -85,7 +84,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgarch_exit(SIGNAL_ARGS);
 static void ArchSigHupHandler(SIGNAL_ARGS);
 static void ArchSigTermHandler(SIGNAL_ARGS);
@@ -103,75 +101,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -211,8 +140,8 @@ pgarch_forkexec(void)
  *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
  *    since we don't use 'em, it hardly matters...
  */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -247,8 +176,27 @@ PgArchiverMain(int argc, char *argv[])
 static void
 pgarch_exit(SIGNAL_ARGS)
 {
-    /* SIGQUIT means curl up and die ... */
-    exit(1);
+    PG_SETMASK(&BlockSig);
+
+    /*
+     * We DO NOT want to run proc_exit() callbacks -- we're here because
+     * shared memory may be corrupted, so we don't want to try to clean up our
+     * transaction.  Just nail the windows shut and get out of town.  Now that
+     * there's an atexit callback to prevent third-party code from breaking
+     * things by calling exit() directly, we have to reset the callbacks
+     * explicitly to make this work as intended.
+     */
+    on_exit_reset();
+
+    /*
+     * Note we do exit(2) not exit(0).  This is to force the postmaster into a
+     * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+     * backend.  This is necessary precisely because we don't clean up our
+     * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+     * should ensure the postmaster sees this as a crash, too, but no harm in
+     * being doubly sure.)
+     */
+    exit(2);
 }
 
 /* SIGHUP signal handler for archiver process */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 42bccce0af..4fd874ad77 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2853,6 +2853,9 @@ pgstat_bestart(void)
             case BgWriterProcess:
                 beentry->st_backendType = B_BG_WRITER;
                 break;
+            case ArchiverProcess:
+                beentry->st_backendType = B_ARCHIVER;
+                break;
             case CheckpointerProcess:
                 beentry->st_backendType = B_CHECKPOINTER;
                 break;
@@ -4115,6 +4118,9 @@ pgstat_get_backend_desc(BackendType backendType)
         case B_BG_WRITER:
             backendDesc = "background writer";
             break;
+        case B_ARCHIVER:
+            backendDesc = "archiver";
+            break;
         case B_CHECKPOINTER:
             backendDesc = "checkpointer";
             break;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 688f462e7d..8336f136c7 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -146,7 +146,8 @@
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
 #define BACKEND_TYPE_BGWORKER    0x0008    /* bgworker process */
-#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
+#define BACKEND_TYPE_ARCHIVER    0x0010    /* archiver process */
+#define BACKEND_TYPE_ALL        0x001F    /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -549,6 +550,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
 #define StartWalReceiver()        StartChildProcess(WalReceiverProcess)
@@ -1758,7 +1760,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -2911,7 +2913,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3056,10 +3058,8 @@ reaper(SIGNAL_ARGS)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3305,7 +3305,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3510,6 +3510,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3794,6 +3806,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5063,7 +5076,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5372,6 +5385,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case CheckpointerProcess:
                 ereport(LOG,
                         (errmsg("could not fork checkpointer process: %m")));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index d6b32c070c..3026f95728 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -399,6 +399,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -411,6 +412,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f1c10d16b8..33c7372f00 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -706,6 +706,7 @@ typedef enum BackendType
     B_BACKEND,
     B_BG_WORKER,
     B_BG_WRITER,
+    B_ARCHIVER,
     B_CHECKPOINTER,
     B_STARTUP,
     B_WAL_RECEIVER,
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index 292e63a26a..5db1d7a5ea 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
-- 
2.16.3

From 93a81386f61e9973ef5cae5e007a48c9bf042b3e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 27 Sep 2018 11:15:19 +0900
Subject: [PATCH 2/7] Add conditional lock feature to dshash

Dshash currently waits for a lock unconditionally. This commit adds
extended variants of dshash_find and dshash_find_or_insert, named
dshash_find_extended and dshash_find_or_insert_extended. They take an
extra parameter "nowait" that tells them not to wait for the lock.
---
 src/backend/lib/dshash.c | 58 ++++++++++++++++++++++++++++++++++++++++++++----
 src/include/lib/dshash.h |  6 ++++-
 2 files changed, 59 insertions(+), 5 deletions(-)
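
The calling convention added here can be illustrated outside PostgreSQL. The
following is a minimal standalone sketch, not the dshash code: DemoPartition,
demo_lock() and demo_find_extended() are hypothetical names, the "partition"
holds a single slot, and a C11 atomic flag stands in for the partition LWLock
(exclusive mode only). It shows why the extra lock_acquired output is needed:
a NULL result can mean either "no such key" or "lock was busy".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one hash partition: one lock, one slot. */
typedef struct DemoPartition
{
    atomic_bool locked;         /* true while some caller holds the lock */
    bool        used;           /* is the slot occupied? */
    int         key;
    int         value;
} DemoPartition;

static void
demo_init(DemoPartition *part)
{
    atomic_init(&part->locked, false);
    part->used = false;
    part->key = 0;
    part->value = 0;
}

/* Try once when nowait is set, otherwise spin; true if we now hold the lock. */
static bool
demo_lock(DemoPartition *part, bool nowait)
{
    while (atomic_exchange(&part->locked, true))
    {
        if (nowait)
            return false;       /* busy, and the caller asked not to wait */
    }
    return true;
}

static void
demo_release(DemoPartition *part)
{
    atomic_store(&part->locked, false);
}

/*
 * On a hit the lock stays held and the caller must call demo_release().
 * On a miss the lock is dropped before returning NULL.  If lock_acquired
 * is not NULL, it tells the caller which kind of NULL it got.
 */
static int *
demo_find_extended(DemoPartition *part, int key,
                   bool *lock_acquired, bool nowait)
{
    assert(part != NULL);

    if (!demo_lock(part, nowait))
    {
        if (lock_acquired)
            *lock_acquired = false;
        return NULL;
    }
    if (lock_acquired)
        *lock_acquired = true;

    if (part->used && part->key == key)
        return &part->value;

    demo_release(part);
    return NULL;
}
```

A caller that must not block would pass nowait = true and, on a NULL result
with *lock_acquired == false, postpone its update and retry later instead of
treating the key as missing.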

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index af904c034e..d8bdaecae5 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -394,6 +394,17 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  */
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
+{
+    return dshash_find_extended(hash_table, key, NULL, exclusive, false);
+}
+
+/*
+ * Like dshash_find, but returns NULL immediately if nowait is true and the
+ * lock is unavailable.  If lock_acquired is non-NULL, it reports the outcome.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool *lock_acquired, bool exclusive, bool nowait)
 {
     dshash_hash hash;
     size_t        partition;
@@ -405,8 +416,23 @@ dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
     Assert(hash_table->control->magic == DSHASH_MAGIC);
     Assert(!hash_table->find_locked);
 
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partition),
+                                      exclusive ? LW_EXCLUSIVE : LW_SHARED))
+        {
+            if (lock_acquired)
+                *lock_acquired = false;
+            return NULL;
+        }
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition),
+                      exclusive ? LW_EXCLUSIVE : LW_SHARED);
+
+    if (lock_acquired)
+        *lock_acquired = true;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -441,6 +467,22 @@ void *
 dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
+{
+    return dshash_find_or_insert_extended(hash_table, key, found, false);
+}
+
+/*
+ * Like dshash_find_or_insert, but returns NULL if nowait is true and the lock
+ * was not acquired.
+ *
+ * Notes above dshash_find_extended() regarding locking and error handling
+ * equally apply here.
+ */
+void *
+dshash_find_or_insert_extended(dshash_table *hash_table,
+                               const void *key,
+                               bool *found,
+                               bool nowait)
 {
     dshash_hash hash;
     size_t        partition_index;
@@ -455,8 +497,16 @@ dshash_find_or_insert(dshash_table *hash_table,
     Assert(!hash_table->find_locked);
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(
+                PARTITION_LOCK(hash_table, partition_index),
+                LW_EXCLUSIVE))
+            return NULL;
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
+                      LW_EXCLUSIVE);
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 8ab1a21f3e..475d22ab55 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -90,8 +90,12 @@ extern void dshash_destroy(dshash_table *hash_table);
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
             const void *key, bool exclusive);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+            bool *lock_acquired, bool exclusive, bool nowait);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
-                      const void *key, bool *found);
+            const void *key, bool *found);
+extern void *dshash_find_or_insert_extended(dshash_table *hash_table,
+            const void *key, bool *found, bool nowait);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.16.3

From cf5581b76449cfe9e1cb323a56ff35781290b730 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:41:04 +0900
Subject: [PATCH 1/7] sequential scan for dshash

Add sequential scan feature to dshash.
---
 src/backend/lib/dshash.c | 188 ++++++++++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  23 +++++-
 2 files changed, 206 insertions(+), 5 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index b2b8fe60e1..af904c034e 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -112,6 +112,7 @@ struct dshash_table
     size_t        size_log2;        /* log2(number of buckets) */
     bool        find_locked;    /* Is any partition lock held by 'find'? */
     bool        find_exclusively_locked;    /* ... exclusively? */
+    bool        seqscan_running;/* now under sequential scan */
 };
 
 /* Given a pointer to an item, find the entry (user data) it holds. */
@@ -127,6 +128,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +158,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -228,6 +237,7 @@ dshash_create(dsa_area *area, const dshash_parameters *params, void *arg)
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
 
     /*
      * Set up the initial array of buckets.  Our initial size is the same as
@@ -279,6 +289,7 @@ dshash_attach(dsa_area *area, const dshash_parameters *params,
     hash_table->control = dsa_get_address(area, control);
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
     Assert(hash_table->control->magic == DSHASH_MAGIC);
 
     /*
@@ -324,7 +335,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
                                 LW_EXCLUSIVE));
 
     delete_item(hash_table, item);
-    hash_table->find_locked = false;
-    hash_table->find_exclusively_locked = false;
-    LWLockRelease(PARTITION_LOCK(hash_table, partition));
+
+    /* We need to keep the partition lock during a sequential scan */
+    if (!hash_table->seqscan_running)
+    {
+        hash_table->find_locked = false;
+        hash_table->find_exclusively_locked = false;
+        LWLockRelease(PARTITION_LOCK(hash_table, partition));
+    }
 }
 
 /*
@@ -568,6 +584,8 @@ dshash_release_lock(dshash_table *hash_table, void *entry)
     Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition_index),
                                 hash_table->find_exclusively_locked
                                 ? LW_EXCLUSIVE : LW_SHARED));
+    /* lock is under control of sequential scan */
+    Assert(!hash_table->seqscan_running);
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
@@ -592,6 +610,168 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should be called if and only if the scan is abandoned
+ * before completion; if dshash_seq_next returns NULL then it has already done
+ * the end-of-scan cleanup.
+ *
+ * On returning element, it is locked as is the case with dshash_find.
+ * However, the caller must not release the lock. The lock is released as
+ * necessary in continued scan.
+ *
+ * As opposed to the equivalent for dynahash, the caller is not supposed to
+ * delete the returned element before continuing the scan.
+ *
+ * If consistent is set for dshash_seq_init, the whole hash table is
+ * non-exclusively locked. Otherwise a part of the hash table is locked in the
+ * same mode (partition lock).
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool consistent, bool exclusive)
+{
+    /* allowed at most one scan at once */
+    Assert(!hash_table->seqscan_running);
+
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->consistent = consistent;
+    status->exclusive = exclusive;
+    hash_table->seqscan_running = true;
+
+    /*
+     * Protect all partitions from modification if the caller wants a
+     * consistent result.
+     */
+    if (consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+        {
+            Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i)));
+
+            LWLockAcquire(PARTITION_LOCK(hash_table, i),
+                          exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        }
+        ensure_valid_bucket_pointers(hash_table);
+    }
+}
+
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    Assert(status->hash_table->seqscan_running);
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert (status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        if (!status->consistent)
+        {
+            partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+            LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            status->curpartition = partition;
+
+            /* resize doesn't happen from now until seq scan ends */
+            status->nbuckets =
+                NUM_BUCKETS(status->hash_table->control->size_log2);
+            ensure_valid_bucket_pointers(status->hash_table);
+        }
+        
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            dshash_seq_term(status);
+            return NULL;
+        }
+
+        /* Also move partition lock if needed */
+        if (!status->consistent)
+        {
+            int next_partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+
+            /* Move lock along with partition for the bucket */
+            if (status->curpartition != next_partition)
+            {
+                /*
+                 * Take lock on the next partition then release the current,
+                 * not in the reverse order. This is required to avoid
+                 * resizing from happening during a sequential scan. Locks are
+                 * taken in partition order, so no deadlock happens with other
+                 * seq scans or resizing.
+                 */
+                LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                             next_partition),
+                              status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+                LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                             status->curpartition));
+                status->curpartition = next_partition;
+            }
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * This item can be deleted by the caller. Store the next item for the
+     * next iteration for the occasion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    Assert(status->hash_table->seqscan_running);
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+    status->hash_table->seqscan_running = false;
+
+    if (status->consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+            LWLockRelease(PARTITION_LOCK(status->hash_table, i));
+    }
+    else if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 8c733bfe25..8ab1a21f3e 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,23 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state of dshash. The detail is exposed since the storage
+ * size should be known to users but it should be considered as an opaque
+ * type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;
+    int                    curbucket;
+    int                    nbuckets;
+    dshash_table_item  *curitem;
+    dsa_pointer            pnextitem;
+    int                    curpartition;
+    bool                consistent;
+    bool                exclusive;
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
               const dshash_parameters *params,
@@ -70,7 +87,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
 extern void dshash_detach(dshash_table *hash_table);
 extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
 extern void dshash_destroy(dshash_table *hash_table);
-
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
             const void *key, bool exclusive);
@@ -80,6 +96,11 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool consistent, bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.16.3


Re: shared-memory based stats collector

From
Tomas Vondra
Date:
On Mon, 2018-11-12 at 20:10 +0900, Kyotaro HORIGUCHI wrote:
> At Fri, 9 Nov 2018 14:16:31 +0100, Tomas Vondra <
> tomas.vondra@2ndquadrant.com> wrote in <
> 803f2d96-3b4b-f357-9a2e-45443212f13d@2ndquadrant.com>
> > 
> > 
> > On 11/9/18 9:33 AM, Kyotaro HORIGUCHI wrote:
> > > Hello. This is rebased version.
> > > At Thu, 8 Nov 2018 16:06:49 +0100, Tomas Vondra
> > > <tomas.vondra@2ndquadrant.com> wrote in
> > > <de249c3f-79c9-b75c-79a3-5e2d008548a8@2ndquadrant.com>
> > >> However, quite a few extensions in contrib seem to be broken now. It
> > >> seems fixing it is as simple as including the new bestatus.h next to
> > >> pgstat.h.
> > > The additional 0009 does that.
> > > 
> > 
> > That does fix it, indeed. But the break happens in 0003, so that's
> > where the fixes should be moved - I've tried to simply apply 0009
> > right after 0003, but that does not seem to work because bestatus.h
> > does not exist at that point yet :-/
> 
> Sorry, I misunderstood you. The real reason 0003 broke, as you
> saw, was that I had simply removed PG_STAT_TMP_DIR. 0005 fixes
> that later. I (half-intentionally) didn't keep the source tree
> sound at v8-0003 and v8-0008.
> 
> > The current split into 8 parts seems quite sensible to me, i.e. that's
> > how it might get committed eventually. That however means each part
> > needs to be correct on its own (hence fixes in 0009 are a problem).
> 
> Thanks. I tidied up the patchset so that each individual patch
> keeps the source buildable and doesn't break programs' behavior.
> 

OK, thanks. I'll take a look. I also plan to do much more testing, both
for correctness and performance - it's quite a piece of functionality.

If everything goes well I'd like to get this committed by the end of
January CF (with some of the initial parts in this CF, possibly).


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: shared-memory based stats collector

From
Tomas Vondra
Date:
Hi,

Unfortunately, the patch does not apply anymore - it seems it got broken
by the changes to signal handling and/or removal of magic OIDs :-(

I've done a review and testing when applied on top of 10074651e335.

Firstly, the testing - I was wondering if the patch has some performance
impact, so I've done some testing with a read-only workload on large
number of tables (1k, 10k and 100k) while concurrently selecting data
from pg_stat_* catalogs at the same time.

In one case both workloads were running against the same database, in
another there were two separate databases (and the selects from stat
catalogs were running against an "empty" database with no user tables).

In both cases there were 8 clients doing selects from the user tables,
and 4 clients accessing the pg_stat_* catalogs.

For the "single database" case the results look like this (this is just
patched / master throughput):

    # of tables     xact     stats
    ------------------------------
    1000          97.71%    98.76%
    10000        100.38%    97.97%
    100000       100.10%    98.50%

xact is throughput of the user workload (select from the large number of
tables) and stats is throughput for selects from system catalogs.

So pretty much no difference - 2% is within noise on this machine.

On two separate databases the results are a bit more interesting:

    # of tables     xact      stats
    -------------------------------
    1000         100.49%     80.38%
    10000        103.18%     80.28%
    100000       100.85%     81.95%

For the main workload there's pretty much no difference, but for selects
from the stats catalogs there's ~20% drop in throughput. In absolute
numbers this means drop from ~670tps to ~550tps. I haven't investigated
this, but I suppose this is due to dshash seqscan being more expensive
than reading the data from file.

I don't think any of this is an issue in practice, though. The important
thing is that there's no measurable impact on the regular workload.

Now, a couple of comments regarding individual parts of the patch.


0001-0003
---------

I do think 0001 - 0003 are ready, with some minor cosmetic issues:

1) I'd rephrase the last part of dshash_seq_init comment more like this:

* If consistent is set for dshash_seq_init, all hash table
* partitions are locked in the requested mode (as determined by the
* exclusive flag), and the locks are held until the end of the scan.
* Otherwise the partition locks are acquired and released as needed
* during the scan (up to two partitions may be locked at the same time).

Maybe it should briefly explain what the consistency guarantees are (and
aren't), but considering we're not materially changing the existing
behavior, that probably is not really necessary.

2) I think the dshash_find_extended() signature should be more like
dshash_find(), i.e. just appending parameters instead of moving them
around unnecessarily. Perhaps we should add

    Assert(nowait || !lock_acquired);

Using nowait=false with lock_acquired!=NULL does not seem sensible.

3) I suppose this comment in postmaster.c is just copy-paste:

-#define BACKEND_TYPE_ARCHIVER    0x0010    /* bgworker process */
+#define BACKEND_TYPE_ARCHIVER    0x0010    /* archiver process */

I wonder why the archiver wasn't a regular auxiliary process already? It
seems like a fairly natural thing, so why not?


0004 (+0005 and 0007)
---------------------

This seems fine, but I have my doubts about two changes - the removal of
stats_temp_directory and the IDLE_STATS_UPDATE_TIMEOUT thingy.

There's a couple of issues with the stats_temp_directory. Firstly, I
don't understand why it's spread over multiple parts of the patch. The
GUC is removed in 0004, the underlying variable is removed in 0005 and
then the docs are updated in 0007. If we really want to do this, it
should happen in a single patch.

But the main question is - do we really want to do that? I understand
this directory was meant for the stats data we're moving to shared
memory, so removing it seems natural. But clearly it's used by
pg_stat_statements - 0005 fixes that, of course, but I wonder if there
are other extensions using it to store files?

It's not just about how intensive I/O to those files is, but this also
means the files will now be included in backups / pg_rewind, and maybe
that's not really desirable?

Maybe it's fine but I'm not quite convinced about it ...

I'm not sure I understand what IDLE_STATS_UPDATE_TIMEOUT does. You've
described it as

   This adds a new timeout IDLE_STATS_UPDATE_TIMEOUT. This works
   similarly to IDLE_IN_TRANSACTION_SESSION_TIMEOUT. It fires in
   at most PGSTAT_STAT_MIN_INTERVAL(500)ms to clean up pending
   statistics update.

but I'm not sure which pending updates you mean. Aren't we updating
the stats at the end of each transaction? At least that's what we've
been doing before, so maybe this patch changes that?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: shared-memory based stats collector

From
Kyotaro HORIGUCHI
Date:
Thank you very much for the testing.

At Mon, 26 Nov 2018 02:52:30 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
<6c079a69-feba-e47c-7b85-8a9ff31adef3@2ndquadrant.com>
> Hi,
> 
> Unfortunately, the patch does not apply anymore - it seems it got broken
> by the changes to signal handling and/or removal of magic OIDs :-(

It was a big hit, but the fix was simple.

> I've done a review and testing when applied on top of 10074651e335.
> 
> Firstly, the testing - I was wondering if the patch has some performance
> impact, so I've done some testing with a read-only workload on large
> number of tables (1k, 10k and 100k) while concurrently selecting data
> from pg_stat_* catalogs at the same time.
> 
> In one case both workloads were running against the same database, in
> another there were two separate databases (and the selects from stat
> catalogs were running against an "empty" database with no user tables).
> 
> In both cases there were 8 clients doing selects from the user tables,
> and 4 clients accessing the pg_stat_* catalogs.
> 
> For the "single database" case the results look like this (this is just
> patched / master throughput):
> 
>     # of tables     xact     stats
>     ------------------------------
>     1000          97.71%    98.76%
>     10000        100.38%    97.97%
>     100000       100.10%    98.50%
> 
> xact is throughput of the user workload (select from the large number of
> tables) and stats is throughput for selects from system catalogs.
> 
> So pretty much no difference - 2% is within noise on this machine.

The lack of any trend with the number of tables seems to suggest that.

> On two separate databases the results are a bit more interesting:
> 
>     # of tables     xact      stats
>     -------------------------------
>     1000         100.49%     80.38%
>     10000        103.18%     80.28%
>     100000       100.85%     81.95%
> 
> For the main workload there's pretty much no difference, but for selects
> from the stats catalogs there's ~20% drop in throughput. In absolute
> numbers this means drop from ~670tps to ~550tps. I haven't investigated
> this, but I suppose this is due to dshash seqscan being more expensive
> than reading the data from file.

Thanks for finding that. The three seqscan loops in
pgstat_vacuum_stat cannot take such a long time, I think. I'll
investigate it.

> I don't think any of this is an issue in practice, though. The important
> thing is that there's no measurable impact on the regular workload.
> 
> Now, a couple of comments regarding individual parts of the patch.
> 
> 
> 0001-0003
> ---------
> 
> I do think 0001 - 0003 are ready, with some minor cosmetic issues:
> 
> 1) I'd rephrase the last part of dshash_seq_init comment more like this:
> 
> * If consistent is set for dshash_seq_init, all hash table
> * partitions are locked in the requested mode (as determined by the
> * exclusive flag), and the locks are held until the end of the scan.
> * Otherwise the partition locks are acquired and released as needed
> * during the scan (up to two partitions may be locked at the same time).

Replaced with this.

> Maybe it should briefly explain what the consistency guarantees are (and
> aren't), but considering we're not materially changing the existing
> behavior, that probably is not really necessary.

Mmm. Actually, sequential scan is a new thing altogether, but..
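
For readers following the locking discussion, here is a minimal sketch of
the two scan modes described in the rephrased comment. It is a
single-threaded model, not the dshash code: the demo_* names, the partition
and bucket counts, and the bool array standing in for the partition LWLocks
are all assumptions made for illustration. It only shows the lock ordering:
a consistent scan holds every partition lock from init until the end, while
a non-consistent scan takes the next partition's lock before releasing the
current one, so at most two are held at once.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_NUM_PARTITIONS        4
#define DEMO_BUCKETS_PER_PARTITION 2
#define DEMO_NUM_BUCKETS (DEMO_NUM_PARTITIONS * DEMO_BUCKETS_PER_PARTITION)

/* Hypothetical scan state; held[] stands in for the partition LWLocks. */
typedef struct demo_scan
{
    bool held[DEMO_NUM_PARTITIONS];
    int  nheld;         /* locks held right now */
    int  max_held;      /* high-water mark of nheld */
    int  curbucket;     /* -1 before the first call to demo_scan_next */
    int  curpartition;  /* -1 while no partition lock is held */
    bool consistent;
} demo_scan;

static void
demo_lock(demo_scan *s, int p)
{
    assert(!s->held[p]);
    s->held[p] = true;
    if (++s->nheld > s->max_held)
        s->max_held = s->nheld;
}

static void
demo_unlock(demo_scan *s, int p)
{
    assert(s->held[p]);
    s->held[p] = false;
    s->nheld--;
}

static void
demo_scan_init(demo_scan *s, bool consistent)
{
    int p;

    for (p = 0; p < DEMO_NUM_PARTITIONS; p++)
        s->held[p] = false;
    s->nheld = 0;
    s->max_held = 0;
    s->curbucket = -1;
    s->curpartition = -1;
    s->consistent = consistent;

    /* Consistent mode: lock every partition up front, in partition order. */
    if (consistent)
        for (p = 0; p < DEMO_NUM_PARTITIONS; p++)
            demo_lock(s, p);
}

/* Release whatever the scan still holds. */
static void
demo_scan_term(demo_scan *s)
{
    int p;

    if (s->consistent)
    {
        for (p = 0; p < DEMO_NUM_PARTITIONS; p++)
            demo_unlock(s, p);
    }
    else if (s->curpartition >= 0)
        demo_unlock(s, s->curpartition);
    s->curpartition = -1;
}

/* Return the next bucket index, or -1 (after cleanup) when the scan ends. */
static int
demo_scan_next(demo_scan *s)
{
    int next = s->curbucket + 1;

    if (next >= DEMO_NUM_BUCKETS)
    {
        demo_scan_term(s);
        return -1;
    }

    if (!s->consistent)
    {
        int p = next / DEMO_BUCKETS_PER_PARTITION;

        if (p != s->curpartition)
        {
            /* Take the next lock first, then release the current one. */
            demo_lock(s, p);
            if (s->curpartition >= 0)
                demo_unlock(s, s->curpartition);
            s->curpartition = p;
        }
    }

    s->curbucket = next;
    return next;
}
```

A caller would run demo_scan_init() once and then loop on demo_scan_next()
until it returns -1; max_held then records how many locks were held at the
busiest moment of the scan.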

> 2) I think the dshash_find_extended() signature should be more like
> dshash_find(), i.e. just appending parameters instead of moving them
> around unnecessarily. Perhaps we should add

Sure. It seems to have been done by my offline fingers ;p Fixed.

>     Assert(nowait || !lock_acquired);
> 
> Using nowait=false with lock_acquired!=NULL does not seem sensible.

Agreed. Added.
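
As an illustration of the nowait / lock_acquired contract discussed here, a
minimal standalone sketch follows. It is not the dshash implementation: the
demo_* names are hypothetical, the table is a tiny fixed array, and a C11
atomic_flag used as a try-lock stands in for the partition LWLock. The
assertion at the top of demo_find_extended() mirrors the suggested
Assert(nowait || !lock_acquired).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_ITEMS 4

/* Hypothetical one-partition table guarded by a spinlock-style flag. */
typedef struct demo_table
{
    atomic_flag lock;
    int         keys[DEMO_MAX_ITEMS];
    int         values[DEMO_MAX_ITEMS];
    size_t      nitems;
} demo_table;

static void
demo_init(demo_table *t)
{
    atomic_flag_clear(&t->lock);
    t->nitems = 0;
}

static void
demo_release(demo_table *t)
{
    atomic_flag_clear(&t->lock);
}

/*
 * Find key and return a pointer to its value with the lock still held; the
 * caller must call demo_release().  Returns NULL on a miss (lock released)
 * or, when nowait is true, if the lock is busy (lock never taken).
 * *lock_acquired, if given, tells those two NULL cases apart.
 */
static int *
demo_find_extended(demo_table *t, int key, bool nowait, bool *lock_acquired)
{
    size_t i;

    /* The rule suggested above: lock_acquired only makes sense with nowait. */
    assert(nowait || lock_acquired == NULL);

    if (nowait)
    {
        if (atomic_flag_test_and_set(&t->lock))
        {
            if (lock_acquired)
                *lock_acquired = false;
            return NULL;
        }
    }
    else
    {
        while (atomic_flag_test_and_set(&t->lock))
            ;                   /* spin until the lock is free */
    }

    if (lock_acquired)
        *lock_acquired = true;

    for (i = 0; i < t->nitems; i++)
        if (t->keys[i] == key)
            return &t->values[i];

    /* Not found: there is no entry to hand back, so drop the lock here. */
    demo_release(t);
    return NULL;
}
```

With this shape, a caller that gets NULL back can check *lock_acquired to
decide whether the key is really absent or whether it should retry later.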

> 3) I suppose this comment in postmaster.c is just copy-paste:
> 
> -#define BACKEND_TYPE_ARCHIVER    0x0010    /* bgworker process */
> +#define BACKEND_TYPE_ARCHIVER    0x0010    /* archiver process */

Ugh! Fixed.

> I wonder why the archiver wasn't a regular auxiliary process already? It
> seems like a fairly natural thing, so why not?

Perhaps it's just because it didn't need access to shared memory.

> 0004 (+0005 and 0007)
> ---------------------
> 
> This seems fine, but I have my doubts about two changes - the removal of
> stats_temp_directory and the IDLE_STATS_UPDATE_TIMEOUT thingy.
> 
> There's a couple of issues with the stats_temp_directory. Firstly, I
> don't understand why it's spread over multiple parts of the patch. The
> GUC is removed in 0004, the underlying variable is removed in 0005 and
> then the docs are updated in 0007. If we really want to do this, it
> should happen in a single patch.

Sure.

> But the main question is - do we really want to do that? I understand
> this directory was meant for the stats data we're moving to shared
> memory, so removing it seems natural. But clearly it's used by
> pg_stat_statements - 0005 fixes that, of course, but I wonder if there
> are other extensions using it to store files?
> It's not just about how intensive I/O to those files is, but this also
> means the files will now be included in backups / pg_rewind, and maybe
> that's not really desirable?
> 
> Maybe it's fine but I'm not quite convinced about it ...

It was also on my mind. Anyway, sorry for the strange separation.

I was confused about pgstat_stat_directory (the names are
actually very confusing..). In addition, pg_stat_statements
does *not* use the variable stats_temp_directory; it uses
PG_STAT_TMP_DIR. pgstat_stat_directory was used only by
basebackup.c.

The GUC base variable pgstat_temp_directory is not extern'ed, so
we can just remove it along with the GUC
definition. pgstat_stat_directory (it actually stores the *temporary*
stats directory) was extern'ed in pgstat.h, and PG_STAT_TMP_DIR is
defined in pgstat.h. They are not removed in the new version.
As a result, 0005 no longer breaks any other bins, contribs, or
external extensions.

> I'm not sure I understand what IDLE_STATS_UPDATE_TIMEOUT does. You've
> described it as
> 
>    This adds a new timeout IDLE_STATS_UPDATE_TIMEOUT. This works
>    similarly to IDLE_IN_TRANSACTION_SESSION_TIMEOUT. It fires in
>    at most PGSTAT_STAT_MIN_INTERVAL(500)ms to clean up pending
>    statistics update.
> 
> but I'm not sure which pending updates you mean. Aren't we updating
> the stats at the end of each transaction? At least that's what we've
> been doing before, so maybe this patch changes that?

Without the timeout, updates on shared memory happen at the same
rate as transaction traffic, which easily causes congestion. So
this patch limits the update frequency to the timeout, and
the local statistics made by transactions committed within the
timeout interval are merged into one shared stats update. Those
are the "pending statistics".

The socket-based stats collector doesn't update the temporary
stats file at intervals shorter than the timeout, so from the
readers' point of view the update timeout seemingly behaves the
same way as the socket-based stats collector.

If local statistics are not fully processed at the end of the last
transaction, we don't have a chance to flush them before the next
transaction ends. So the timeout is armed if any "pending stats"
remain (around postgres.c:4175). The pending stats are processed
forcibly in ProcessInterrupts().

postgres.c:4175
>        stats_timeout = pgstat_update_stat(false);
>        if (stats_timeout > 0)
>        {
>          disable_idle_stats_update_timeout = true;
>          enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
>                     stats_timeout);
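
To make the "pending statistics" idea concrete, here is a minimal sketch of
rate-limited flushing. It is not the patch's code: the demo_* names are
hypothetical, time is passed in explicitly as milliseconds, and the return
convention (0 when nothing is left pending, otherwise the number of
milliseconds to wait before a timer should force the flush) is an assumption
modeled loosely on the pgstat_update_stat() call quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to mirror the 500 ms PGSTAT_STAT_MIN_INTERVAL mentioned above. */
#define DEMO_MIN_INTERVAL_MS 500

typedef struct demo_stats
{
    long shared_count;   /* stand-in for a counter in shared memory */
    long pending_count;  /* local updates not yet published */
    long last_flush_ms;  /* time of the last flush to shared memory */
    int  nflushes;       /* how many times shared memory was touched */
} demo_stats;

/*
 * Try to publish pending updates.  Returns 0 if nothing remains pending,
 * otherwise the delay (ms) after which the caller should arm a timer and
 * call again with force = true.
 */
static long
demo_update_stat(demo_stats *s, long now_ms, bool force)
{
    long elapsed = now_ms - s->last_flush_ms;

    if (s->pending_count == 0)
        return 0;

    /* Too soon since the last flush: keep the updates pending. */
    if (!force && elapsed < DEMO_MIN_INTERVAL_MS)
        return DEMO_MIN_INTERVAL_MS - elapsed;

    /* Merge everything accumulated so far into one shared update. */
    s->shared_count += s->pending_count;
    s->pending_count = 0;
    s->last_flush_ms = now_ms;
    s->nflushes++;
    return 0;
}

/* Called at the end of each transaction with that transaction's delta. */
static long
demo_end_of_xact(demo_stats *s, long now_ms, long delta)
{
    s->pending_count += delta;
    return demo_update_stat(s, now_ms, false);
}
```

In this model, several transactions that commit within one interval cost a
single shared-memory update, and the positive return value is what would be
handed to something like enable_timeout_after() so that an idle backend
still publishes its last few updates.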

Attached is the new version, addressing all the comments here
(except the 20% regression), and rebased.

v10-0001-sequential-scan-for-dshash.patch
v10-0002-Add-conditional-lock-feature-to-dshash.patch
  fixed.
v10-0003-Make-archiver-process-an-auxiliary-process.patch
  fixed.
v10-0004-Shared-memory-based-stats-collector.patch
  updated not to touch guc.
v10-0005-Remove-the-GUC-stats_temp_directory.patch
  collected all guc-related changes.
  updated not to break other programs.
v10-0006-Split-out-backend-status-monitor-part-from-pgstat.patch
  basebackup.c requires both bestatus.h and pgstat.h
v10-0007-Documentation-update.patch
  small change related to 0005.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 19b514ab0bac737c9df96b8c397d8a70930fc366 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:41:04 +0900
Subject: [PATCH 1/7] sequential scan for dshash

Add sequential scan feature to dshash.
---
 src/backend/lib/dshash.c | 188 ++++++++++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  23 +++++-
 2 files changed, 206 insertions(+), 5 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index b2b8fe60e1..af904c034e 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -112,6 +112,7 @@ struct dshash_table
     size_t        size_log2;        /* log2(number of buckets) */
     bool        find_locked;    /* Is any partition lock held by 'find'? */
     bool        find_exclusively_locked;    /* ... exclusively? */
+    bool        seqscan_running;/* now under sequential scan */
 };
 
 /* Given a pointer to an item, find the entry (user data) it holds. */
@@ -127,6 +128,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +158,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -228,6 +237,7 @@ dshash_create(dsa_area *area, const dshash_parameters *params, void *arg)
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
 
     /*
      * Set up the initial array of buckets.  Our initial size is the same as
@@ -279,6 +289,7 @@ dshash_attach(dsa_area *area, const dshash_parameters *params,
     hash_table->control = dsa_get_address(area, control);
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
     Assert(hash_table->control->magic == DSHASH_MAGIC);
 
     /*
@@ -324,7 +335,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
                                 LW_EXCLUSIVE));
 
     delete_item(hash_table, item);
-    hash_table->find_locked = false;
-    hash_table->find_exclusively_locked = false;
-    LWLockRelease(PARTITION_LOCK(hash_table, partition));
+
+    /* We need to keep the partition lock during a sequential scan */
+    if (!hash_table->seqscan_running)
+    {
+        hash_table->find_locked = false;
+        hash_table->find_exclusively_locked = false;
+        LWLockRelease(PARTITION_LOCK(hash_table, partition));
+    }
 }
 
 /*
@@ -568,6 +584,8 @@ dshash_release_lock(dshash_table *hash_table, void *entry)
     Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition_index),
                                 hash_table->find_exclusively_locked
                                 ? LW_EXCLUSIVE : LW_SHARED));
+    /* lock is under control of sequential scan */
+    Assert(!hash_table->seqscan_running);
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
@@ -592,6 +610,168 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through a dshash table and return all the
+ *           elements one by one; return NULL when there are no more.
+ *
+ * dshash_seq_term should be called if and only if the scan is abandoned
+ * before completion; if dshash_seq_next returns NULL then it has already done
+ * the end-of-scan cleanup.
+ *
+ * A returned element is locked, as is the case with dshash_find.  However,
+ * the caller must not release the lock; it is released as necessary while
+ * the scan continues.
+ *
+ * As opposed to the equivalent for dynanash, the caller is not supposed to
+ * delete the returned element before continuing the scan.
+ *
+ * If consistent is set for dshash_seq_init, the whole hash table is
+ * non-exclusively locked. Otherwise a part of the hash table is locked in the
+ * same mode (partition lock).
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool consistent, bool exclusive)
+{
+    /* at most one scan is allowed at a time */
+    Assert(!hash_table->seqscan_running);
+
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->consistent = consistent;
+    status->exclusive = exclusive;
+    hash_table->seqscan_running = true;
+
+    /*
+     * Protect all partitions from modification if the caller wants a
+     * consistent result.
+     */
+    if (consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+        {
+            Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i)));
+
+            LWLockAcquire(PARTITION_LOCK(hash_table, i),
+                          exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        }
+        ensure_valid_bucket_pointers(hash_table);
+    }
+}
+
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    Assert(status->hash_table->seqscan_running);
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert(status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        if (!status->consistent)
+        {
+            partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+            LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            status->curpartition = partition;
+        }
+
+        /* needed in both modes; no resize happens until the scan ends */
+        status->nbuckets =
+            NUM_BUCKETS(status->hash_table->control->size_log2);
+        ensure_valid_bucket_pointers(status->hash_table);
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            dshash_seq_term(status);
+            return NULL;
+        }
+
+        /* Also move partition lock if needed */
+        if (!status->consistent)
+        {
+            int next_partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+
+            /* Move lock along with partition for the bucket */
+            if (status->curpartition != next_partition)
+            {
+                /*
+                 * Take lock on the next partition then release the current,
+                 * not in the reverse order. This is required to prevent
+                 * resizing from happening during a sequential scan. Locks are
+                 * taken in partition order so no deadlock happens with other
+                 * seq scans or resizing.
+                 */
+                LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                             next_partition),
+                              status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+                LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                             status->curpartition));
+                status->curpartition = next_partition;
+            }
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * This item can be deleted by the caller. Store the next item now so
+     * that the next iteration can continue in that case.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    Assert(status->hash_table->seqscan_running);
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+    status->hash_table->seqscan_running = false;
+
+    if (status->consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+            LWLockRelease(PARTITION_LOCK(status->hash_table, i));
+    }
+    else if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 8c733bfe25..8ab1a21f3e 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,23 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state of dshash. The detail is exposed since the storage
+ * size should be known to users but it should be considered as an opaque
+ * type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;
+    int                    curbucket;
+    int                    nbuckets;
+    dshash_table_item  *curitem;
+    dsa_pointer            pnextitem;
+    int                    curpartition;
+    bool                consistent;
+    bool                exclusive;
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
               const dshash_parameters *params,
@@ -70,7 +87,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
 extern void dshash_detach(dshash_table *hash_table);
 extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
 extern void dshash_destroy(dshash_table *hash_table);
-
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
             const void *key, bool exclusive);
@@ -80,6 +96,11 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool consistent, bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.16.3

From 51ad62d07213b36c9c3976ee48ad493b8a7ff73e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 27 Sep 2018 11:15:19 +0900
Subject: [PATCH 2/7] Add conditional lock feature to dshash

Dshash currently waits for lock unconditionally. This commit adds new
interfaces for dshash_find and dshash_find_or_insert. The new
interfaces have an extra parameter "nowait" that tells them not to wait
for the lock.
---
 src/backend/lib/dshash.c | 69 +++++++++++++++++++++++++++++++++++++++++++-----
 src/include/lib/dshash.h |  6 ++++-
 2 files changed, 67 insertions(+), 8 deletions(-)
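
The calling convention of the nowait variant can be illustrated without
PostgreSQL headers. The following is a single-threaded toy model and not the
patch's code: toy_partition, toy_try_lock, toy_unlock and toy_find_nowait are
hypothetical names, and a plain bool stands in for the partition LWLock. It
shows why the extra output flag is useful: NULL alone cannot tell "entry not
found" apart from "lock not available".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one dshash partition holding a single entry. */
typedef struct toy_partition
{
    bool locked;    /* stand-in for the partition LWLock */
    int  key;
    int  value;
} toy_partition;

/* Stand-in for a conditional lock acquisition: never waits. */
static bool
toy_try_lock(toy_partition *p)
{
    if (p->locked)
        return false;
    p->locked = true;
    return true;
}

static void
toy_unlock(toy_partition *p)
{
    p->locked = false;
}

/*
 * Modeled on dshash_find_extended with nowait = true.  Returns NULL both
 * when the key is absent and when the lock is busy; *lock_acquired, if
 * given, tells the two cases apart.  A found entry is returned with the
 * lock still held, and the caller has to release it.
 */
int *
toy_find_nowait(toy_partition *p, int key, bool *lock_acquired)
{
    if (!toy_try_lock(p))
    {
        if (lock_acquired)
            *lock_acquired = false;
        return NULL;
    }
    if (lock_acquired)
        *lock_acquired = true;

    if (p->key == key)
        return &p->value;

    /* not found: release the lock before returning */
    toy_unlock(p);
    return NULL;
}
```

A caller that gets NULL with *lock_acquired set to false can keep its update
locally and retry later instead of blocking.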

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index af904c034e..cb6c80b56a 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -394,19 +394,48 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  */
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
+{
+    return dshash_find_extended(hash_table, key, exclusive, false, NULL);
+}
+
+/*
+ * In addition to dshash_find, returns immediately when nowait is true and the
+ * lock was not acquired. The lock status is set in *lock_acquired, if given.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool *lock_acquired)
 {
     dshash_hash hash;
     size_t        partition;
     dshash_table_item *item;
 
+    /* reporting the lock status only makes sense when nowait is set */
+    Assert(nowait || !lock_acquired);
+
     hash = hash_key(hash_table, key);
     partition = PARTITION_FOR_HASH(hash);
 
     Assert(hash_table->control->magic == DSHASH_MAGIC);
     Assert(!hash_table->find_locked);
 
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partition),
+                                      exclusive ? LW_EXCLUSIVE : LW_SHARED))
+        {
+            if (lock_acquired)
+                *lock_acquired = false;
+            return NULL;
+        }
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition),
+                      exclusive ? LW_EXCLUSIVE : LW_SHARED);
+
+    if (lock_acquired)
+        *lock_acquired = true;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -441,6 +470,22 @@ void *
 dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
+{
+    return dshash_find_or_insert_extended(hash_table, key, found, false);
+}
+
+/*
+ * In addition to dshash_find_or_insert, returns NULL if nowait is true and
+ * the lock was not acquired.
+ *
+ * Notes above dshash_find_extended() regarding locking and error handling
+ * equally apply here.
+ */
+void *
+dshash_find_or_insert_extended(dshash_table *hash_table,
+                               const void *key,
+                               bool *found,
+                               bool nowait)
 {
     dshash_hash hash;
     size_t        partition_index;
@@ -455,8 +500,16 @@ dshash_find_or_insert(dshash_table *hash_table,
     Assert(!hash_table->find_locked);
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(
+                PARTITION_LOCK(hash_table, partition_index),
+                LW_EXCLUSIVE))
+            return NULL;
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
+                      LW_EXCLUSIVE);
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -626,9 +679,11 @@ dshash_memhash(const void *v, size_t size, void *arg)
  * As opposed to the equivalent for dynanash, the caller is not supposed to
  * delete the returned element before continuing the scan.
  *
- * If consistent is set for dshash_seq_init, the whole hash table is
- * non-exclusively locked. Otherwise a part of the hash table is locked in the
- * same mode (partition lock).
+ * If consistent is set for dshash_seq_init, all hash table
+ * partitions are locked in the requested mode (as determined by the
+ * exclusive flag), and the locks are held until the end of the scan.
+ * Otherwise the partition locks are acquired and released as needed
+ * during the scan (up to two partitions may be locked at the same time).
  */
 void
 dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 8ab1a21f3e..513808b74d 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -90,8 +90,12 @@ extern void dshash_destroy(dshash_table *hash_table);
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
             const void *key, bool exclusive);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+            bool exclusive, bool nowait, bool *lock_acquired);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
-                      const void *key, bool *found);
+            const void *key, bool *found);
+extern void *dshash_find_or_insert_extended(dshash_table *hash_table,
+            const void *key, bool *found, bool nowait);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.16.3

From 40b7b2262dbce4a3394ec504a812121018f49028 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 7 Nov 2018 16:53:49 +0900
Subject: [PATCH 3/7] Make archiver process an auxiliary process

This is a preliminary patch for shared-memory based stats collector.
Archiver process must be an auxiliary process since it uses shared
memory after stats data is moved onto shared memory. Make the process
an auxiliary process in order to make it work.
---
 src/backend/bootstrap/bootstrap.c   |  8 +++
 src/backend/postmaster/pgarch.c     | 98 +++++++++----------------------------
 src/backend/postmaster/pgstat.c     |  6 +++
 src/backend/postmaster/postmaster.c | 35 +++++++++----
 src/include/miscadmin.h             |  2 +
 src/include/pgstat.h                |  1 +
 src/include/postmaster/pgarch.h     |  4 +-
 7 files changed, 67 insertions(+), 87 deletions(-)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 7caab64ce7..2d936b593d 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -327,6 +327,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case BgWriterProcess:
                 statmsg = pgstat_get_backend_desc(B_BG_WRITER);
                 break;
+            case ArchiverProcess:
+                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
+                break;
             case CheckpointerProcess:
                 statmsg = pgstat_get_backend_desc(B_CHECKPOINTER);
                 break;
@@ -454,6 +457,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
             BackgroundWriterMain();
             proc_exit(1);        /* should never return */
 
+        case ArchiverProcess:
+            /* don't set signals, archiver has its own agenda */
+            PgArchiverMain();
+            proc_exit(1);        /* should never return */
+
         case CheckpointerProcess:
             /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 844b9d1b0e..16eb89a21c 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -66,7 +66,6 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
@@ -85,7 +84,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgarch_exit(SIGNAL_ARGS);
 static void ArchSigHupHandler(SIGNAL_ARGS);
 static void ArchSigTermHandler(SIGNAL_ARGS);
@@ -103,75 +101,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -211,8 +140,8 @@ pgarch_forkexec(void)
  *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
  *    since we don't use 'em, it hardly matters...
  */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -244,8 +173,27 @@ PgArchiverMain(int argc, char *argv[])
 static void
 pgarch_exit(SIGNAL_ARGS)
 {
-    /* SIGQUIT means curl up and die ... */
-    exit(1);
+    PG_SETMASK(&BlockSig);
+
+    /*
+     * We DO NOT want to run proc_exit() callbacks -- we're here because
+     * shared memory may be corrupted, so we don't want to try to clean up our
+     * transaction.  Just nail the windows shut and get out of town.  Now that
+     * there's an atexit callback to prevent third-party code from breaking
+     * things by calling exit() directly, we have to reset the callbacks
+     * explicitly to make this work as intended.
+     */
+    on_exit_reset();
+
+    /*
+     * Note we do exit(2) not exit(0).  This is to force the postmaster into a
+     * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+     * backend.  This is necessary precisely because we don't clean up our
+     * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+     * should ensure the postmaster sees this as a crash, too, but no harm in
+     * being doubly sure.)
+     */
+    exit(2);
 }
 
 /* SIGHUP signal handler for archiver process */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8676088e57..b1afe11a87 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2857,6 +2857,9 @@ pgstat_bestart(void)
             case BgWriterProcess:
                 beentry->st_backendType = B_BG_WRITER;
                 break;
+            case ArchiverProcess:
+                beentry->st_backendType = B_ARCHIVER;
+                break;
             case CheckpointerProcess:
                 beentry->st_backendType = B_CHECKPOINTER;
                 break;
@@ -4119,6 +4122,9 @@ pgstat_get_backend_desc(BackendType backendType)
         case B_BG_WRITER:
             backendDesc = "background writer";
             break;
+        case B_ARCHIVER:
+            backendDesc = "archiver";
+            break;
         case B_CHECKPOINTER:
             backendDesc = "checkpointer";
             break;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a33a131182..87d1426500 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -146,7 +146,8 @@
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
 #define BACKEND_TYPE_BGWORKER    0x0008    /* bgworker process */
-#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
+#define BACKEND_TYPE_ARCHIVER    0x0010    /* archiver process */
+#define BACKEND_TYPE_ALL        0x001F    /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -549,6 +550,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
 #define StartWalReceiver()        StartChildProcess(WalReceiverProcess)
@@ -1771,7 +1773,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -2932,7 +2934,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3077,10 +3079,8 @@ reaper(SIGNAL_ARGS)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3326,7 +3326,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3531,6 +3531,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3815,6 +3827,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5084,7 +5097,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5393,6 +5406,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case CheckpointerProcess:
                 ereport(LOG,
                         (errmsg("could not fork checkpointer process: %m")));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index d6b32c070c..3026f95728 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -399,6 +399,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -411,6 +412,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f1c10d16b8..33c7372f00 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -706,6 +706,7 @@ typedef enum BackendType
     B_BACKEND,
     B_BG_WORKER,
     B_BG_WRITER,
+    B_ARCHIVER,
     B_CHECKPOINTER,
     B_STARTUP,
     B_WAL_RECEIVER,
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index 292e63a26a..5db1d7a5ea 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
-- 
2.16.3

From 8b41677d0a1a254a5822728d91d3fe506680cf50 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 12 Nov 2018 17:26:33 +0900
Subject: [PATCH 4/7] Shared-memory based stats collector

This replaces the means of sharing server statistics from a file
with dynamic shared memory. Every backend directly reads and writes
the stats tables. The stats collector process is removed. Updates of
shared stats happen at intervals no shorter than 500ms and no
longer than 1s. If the shared memory data is busy and a backend cannot
obtain the lock immediately, the differences are usually stashed into
"pending stats" in local memory and merged into the shared numbers at
the next chance.
---
 src/backend/access/transam/xlog.c            |    4 +-
 src/backend/postmaster/autovacuum.c          |   59 +-
 src/backend/postmaster/bgwriter.c            |    4 +-
 src/backend/postmaster/checkpointer.c        |   24 +-
 src/backend/postmaster/pgarch.c              |    4 +-
 src/backend/postmaster/pgstat.c              | 4201 +++++++++++---------------
 src/backend/postmaster/postmaster.c          |   80 +-
 src/backend/replication/logical/tablesync.c  |    9 +-
 src/backend/replication/logical/worker.c     |    4 +-
 src/backend/storage/buffer/bufmgr.c          |    8 +-
 src/backend/storage/ipc/dsm.c                |   24 +-
 src/backend/storage/ipc/ipci.c               |    6 +
 src/backend/storage/lmgr/lwlock.c            |    3 +
 src/backend/storage/lmgr/lwlocknames.txt     |    1 +
 src/backend/tcop/postgres.c                  |   27 +-
 src/backend/utils/adt/pgstatfuncs.c          |   50 +-
 src/backend/utils/init/globals.c             |    1 +
 src/backend/utils/init/postinit.c            |   11 +
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |    2 +-
 src/include/miscadmin.h                      |    2 +-
 src/include/pgstat.h                         |  437 +--
 src/include/storage/dsm.h                    |    3 +
 src/include/storage/lwlock.h                 |    3 +
 src/include/utils/timeout.h                  |    1 +
 src/test/modules/worker_spi/worker_spi.c     |    2 +-
 25 files changed, 1932 insertions(+), 3038 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c80b14ed97..307010414c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8410,9 +8410,9 @@ LogCheckpointEnd(bool restartpoint)
                         &sync_secs, &sync_usecs);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time +=
+    BgWriterStats.checkpoint_write_time +=
         write_secs * 1000 + write_usecs / 1000;
-    BgWriterStats.m_checkpoint_sync_time +=
+    BgWriterStats.checkpoint_sync_time +=
         sync_secs * 1000 + sync_usecs / 1000;
 
     /*
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 2d5086d406..c0ad141715 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -969,7 +969,7 @@ rebuild_database_list(Oid newdb)
         PgStat_StatDBEntry *entry;
 
         /* only consider this database if it has a pgstat entry */
-        entry = pgstat_fetch_stat_dbentry(newdb);
+        entry = pgstat_fetch_stat_dbentry(newdb, true);
         if (entry != NULL)
         {
             /* we assume it isn't found because the hash was just created */
@@ -978,6 +978,7 @@ rebuild_database_list(Oid newdb)
             /* hash_search already filled in the key */
             db->adl_score = score++;
             /* next_worker is filled in later */
+            pfree(entry);
         }
     }
 
@@ -993,7 +994,7 @@ rebuild_database_list(Oid newdb)
          * skip databases with no stat entries -- in particular, this gets rid
          * of dropped databases
          */
-        entry = pgstat_fetch_stat_dbentry(avdb->adl_datid);
+        entry = pgstat_fetch_stat_dbentry(avdb->adl_datid, true);
         if (entry == NULL)
             continue;
 
@@ -1005,6 +1006,7 @@ rebuild_database_list(Oid newdb)
             db->adl_score = score++;
             /* next_worker is filled in later */
         }
+        pfree(entry);
     }
 
     /* finally, insert all qualifying databases not previously inserted */
@@ -1017,7 +1019,7 @@ rebuild_database_list(Oid newdb)
         PgStat_StatDBEntry *entry;
 
         /* only consider databases with a pgstat entry */
-        entry = pgstat_fetch_stat_dbentry(avdb->adw_datid);
+        entry = pgstat_fetch_stat_dbentry(avdb->adw_datid, true);
         if (entry == NULL)
             continue;
 
@@ -1029,6 +1031,7 @@ rebuild_database_list(Oid newdb)
             db->adl_score = score++;
             /* next_worker is filled in later */
         }
+        pfree(entry);
     }
     nelems = score;
 
@@ -1227,7 +1230,7 @@ do_start_worker(void)
             continue;            /* ignore not-at-risk DBs */
 
         /* Find pgstat entry if any */
-        tmp->adw_entry = pgstat_fetch_stat_dbentry(tmp->adw_datid);
+        tmp->adw_entry = pgstat_fetch_stat_dbentry(tmp->adw_datid, true);
 
         /*
          * Skip a database with no pgstat entry; it means it hasn't seen any
@@ -1265,16 +1268,22 @@ do_start_worker(void)
                 break;
             }
         }
-        if (skipit)
-            continue;
+        if (!skipit)
+        {
+            /* Remember the db with oldest autovac time. */
+            if (avdb == NULL ||
+                tmp->adw_entry->last_autovac_time <
+                avdb->adw_entry->last_autovac_time)
+            {
+                if (avdb)
+                    pfree(avdb->adw_entry);
+                avdb = tmp;
+            }
+        }
 
-        /*
-         * Remember the db with oldest autovac time.  (If we are here, both
-         * tmp->entry and db->entry must be non-null.)
-         */
-        if (avdb == NULL ||
-            tmp->adw_entry->last_autovac_time < avdb->adw_entry->last_autovac_time)
-            avdb = tmp;
+        /* Immediately free it if not used */
+        if (avdb != tmp)
+            pfree(tmp->adw_entry);
     }
 
     /* Found a database -- process it */
@@ -1963,7 +1972,7 @@ do_autovacuum(void)
      * may be NULL if we couldn't find an entry (only happens if we are
      * forcing a vacuum for anti-wrap purposes).
      */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, true);
 
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
@@ -2013,7 +2022,7 @@ do_autovacuum(void)
     MemoryContextSwitchTo(AutovacMemCxt);
 
     /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
+    shared = pgstat_fetch_stat_dbentry(InvalidOid, true);
 
     classRel = heap_open(RelationRelationId, AccessShareLock);
 
@@ -2099,6 +2108,8 @@ do_autovacuum(void)
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
                                   &dovacuum, &doanalyze, &wraparound);
+        if (tabentry)
+            pfree(tabentry);
 
         /* Relations that need work are added to table_oids */
         if (dovacuum || doanalyze)
@@ -2178,10 +2189,11 @@ do_autovacuum(void)
         /* Fetch the pgstat entry for this table */
         tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
                                              shared, dbentry);
-
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
                                   &dovacuum, &doanalyze, &wraparound);
+        if (tabentry)
+            pfree(tabentry);
 
         /* ignore analyze for toast tables */
         if (dovacuum)
@@ -2750,12 +2762,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
     if (isshared)
     {
         if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
+            tabentry = backend_get_tab_entry(shared, relid, true);
     }
     else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
+        tabentry = backend_get_tab_entry(dbentry, relid, true);
 
     return tabentry;
 }
@@ -2787,8 +2797,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     /* use fresh stats */
     autovac_refresh_stats();
 
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    shared = pgstat_fetch_stat_dbentry(InvalidOid, true);
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, true);
 
     /* fetch the relation's relcache entry */
     classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
@@ -2819,6 +2829,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
                               &dovacuum, &doanalyze, &wraparound);
+    if (tabentry)
+        pfree(tabentry);
 
     /* ignore ANALYZE for toast tables */
     if (classForm->relkind == RELKIND_TOASTVALUE)
@@ -2909,7 +2921,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     }
 
     heap_freetuple(classTup);
-
+    pfree(shared);
+    pfree(dbentry);
     return tab;
 }
 
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 7612b17b44..d6379e0773 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -267,9 +267,9 @@ BackgroundWriterMain(void)
         can_hibernate = BgBufferSync(&wb_context);
 
         /*
-         * Send off activity statistics to the stats collector
+         * Update activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_update_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index b9c118e156..0378c31d2f 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -371,7 +371,7 @@ CheckpointerMain(void)
         {
             checkpoint_requested = false;
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            BgWriterStats.requested_checkpoints++;
         }
         if (shutdown_requested)
         {
@@ -397,7 +397,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                BgWriterStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -515,13 +515,13 @@ CheckpointerMain(void)
         CheckArchiveTimeout();
 
         /*
-         * Send off activity statistics to the stats collector.  (The reason
-         * why we re-use bgwriter-related code for this is that the bgwriter
-         * and checkpointer used to be just one process.  It's probably not
-         * worth the trouble to split the stats support into two independent
-         * stats message types.)
+         * Update activity statistics.  (The reason why we re-use
+         * bgwriter-related code for this is that the bgwriter and
+         * checkpointer used to be just one process.  It's probably not worth
+         * the trouble to split the stats support into two independent
+         * functions.)
          */
-        pgstat_send_bgwriter();
+        pgstat_update_bgwriter();
 
         /*
          * Sleep until we are signaled or it's time for another checkpoint or
@@ -682,9 +682,9 @@ CheckpointWriteDelay(int flags, double progress)
         CheckArchiveTimeout();
 
         /*
-         * Report interim activity statistics to the stats collector.
+         * Register interim activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_update_bgwriter();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1284,8 +1284,8 @@ AbsorbFsyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    BgWriterStats.buf_written_backend += CheckpointerShmem->num_backend_writes;
+    BgWriterStats.buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 16eb89a21c..34979669a8 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -413,7 +413,7 @@ pgarch_ArchiverCopyLoop(void)
                  * Tell the collector about the WAL file that we successfully
                  * archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_update_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
@@ -423,7 +423,7 @@ pgarch_ArchiverCopyLoop(void)
                  * Tell the collector about the WAL file that we failed to
                  * archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_update_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b1afe11a87..e6c7869f5f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,10 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Statistics collector facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
- *
- *            - Add some automatic call for pgstat vacuuming.
- *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *    Statistics data is stored in dynamic shared memory. Every backend
+ *    updates and reads it individually.
  *
  *    Copyright (c) 2001-2018, PostgreSQL Global Development Group
  *
@@ -19,92 +14,59 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "pgstat.h"
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
-#include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "postmaster/autovacuum.h"
-#include "postmaster/fork_process.h"
-#include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
 #include "utils/snapmgr.h"
-#include "utils/timestamp.h"
-#include "utils/tqual.h"
-
 
 /* ----------
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
-
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_STAT_MIN_INTERVAL    500 /* Minimum time between stats data
+                                         * updates; in milliseconds. */
 
+#define PGSTAT_STAT_MAX_INTERVAL   1000 /* Maximum time between stats data
+                                         * updates; in milliseconds. */
 
 /* ----------
  * The initial size hints for the hash tables used in the collector.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
 #define PGSTAT_TAB_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
+/*
+ * Operation mode of pgstat_get_db_entry.
+ */
+#define PGSTAT_FETCH_SHARED    0
+#define PGSTAT_FETCH_EXCLUSIVE    1
+#define PGSTAT_FETCH_NOWAIT    2
+
+typedef enum
+{
+    PGSTAT_ENTRY_NOT_FOUND,
+    PGSTAT_ENTRY_FOUND,
+    PGSTAT_ENTRY_LOCK_FAILED
+} pg_stat_table_result_status;
 
 /* ----------
  * Total number of backends including auxiliary
@@ -132,27 +94,69 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used; to be removed together with the corresponding GUCs */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
+/* Shared stats bootstrap information */
+typedef struct StatsShmemStruct {
+    dsa_handle stats_dsa_handle;
+    dshash_table_handle db_stats_handle;
+    dsa_pointer    global_stats;
+    dsa_pointer    archiver_stats;
+} StatsShmemStruct;
+
+
 /*
  * BgWriter global statistics counters (unused in other processes).
  * Stored directly in a stats message structure so it can be sent
  * without needing to copy things around.  We assume this inits to zeroes.
  */
-PgStat_MsgBgWriter BgWriterStats;
+PgStat_BgWriter BgWriterStats;
 
 /* ----------
  * Local data
  * ----------
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
+static StatsShmemStruct *StatsShmem = NULL;
+static dsa_area *area = NULL;
+static dshash_table *db_stats;
+static HTAB *snapshot_db_stats;
+static MemoryContext stats_cxt;
 
-static struct sockaddr_storage pgStatAddr;
+/*
+ *  Report withholding facility.
+ *
+ *  Some report items are withheld if the required lock cannot be acquired
+ *  immediately.
+ */
+static bool pgstat_pending_recoveryconflict = false;
+static bool pgstat_pending_deadlock = false;
+static bool pgstat_pending_tempfile = false;
 
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
+/* dshash parameters for each type of table */
+static const dshash_parameters dsh_dbparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatDBEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_DB
+};
+static const dshash_parameters dsh_tblparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatTabEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_FUNC_TABLE
+};
+static const dshash_parameters dsh_funcparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatFuncEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_FUNC_TABLE
+};
 
 /*
  * Structures in which backends store per-table info that's waiting to be
@@ -189,18 +193,14 @@ typedef struct TabStatHashEntry
  * Hash table for O(1) t_id -> tsa_entry lookup
  */
 static HTAB *pgStatTabHash = NULL;
+static HTAB *pgStatPendingTabHash = NULL;
 
 /*
  * Backends store per-function info that's waiting to be sent to the collector
  * in this hash table (indexed by function OID).
  */
 static HTAB *pgStatFunctions = NULL;
-
-/*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
- */
-static bool have_function_stats = false;
+static HTAB *pgStatPendingFunctions = NULL;
 
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
@@ -237,6 +237,12 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
+typedef struct
+{
+    dshash_table *tabhash;
+    PgStat_StatDBEntry *dbentry;
+} pgstat_apply_tabstat_context;
+
 /*
  * Info about current "snapshot" of stats file
  */
@@ -250,23 +256,15 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Cluster wide statistics.
+ * Contains statistics that are not collected per database or per table.
+ * shared_* are the statistics maintained by pgstats and snapshot_* are the
+ * snapshots taken on backends.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
-
-/* Signal handler flags */
-static volatile bool need_exit = false;
-static volatile bool got_SIGHUP = false;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats *snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats *snapshot_globalStats;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -280,32 +278,41 @@ static instr_time total_func_time;
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgstat_exit(SIGNAL_ARGS);
+/* functions used in backends */
 static void pgstat_beshutdown_hook(int code, Datum arg);
-static void pgstat_sighup_handler(SIGNAL_ARGS);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                     Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, int op,
+                                    pg_stat_table_result_status *status);
+static PgStat_StatTabEntry *pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create);
+static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_db_statsfile(Oid databaseid, dshash_table *tabhash, dshash_table *funchash);
+
+/* functions used in backends */
+static bool backend_snapshot_global_stats(void);
+static PgStat_StatFuncEntry *backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot);
 static void pgstat_read_current_status(void);
 
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
+static void pgstat_postmaster_shutdown(int code, Datum arg);
+static void pgstat_apply_pending_tabstats(bool shared, bool force,
+                               pgstat_apply_tabstat_context *cxt);
+static bool pgstat_apply_tabstat(pgstat_apply_tabstat_context *cxt,
+                                 PgStat_TableStatus *entry, bool nowait);
+static void pgstat_merge_tabentry(PgStat_TableStatus *deststat,
+                                          PgStat_TableStatus *srcstat,
+                                          bool init);
+static void pgstat_update_funcstats(bool force, PgStat_StatDBEntry *dbentry);
+static void pgstat_reset_all_counters(void);
+static void pgstat_cleanup_recovery_conflict(PgStat_StatDBEntry *dbentry);
+static void pgstat_cleanup_deadlock(PgStat_StatDBEntry *dbentry);
+static void pgstat_cleanup_tempfile(PgStat_StatDBEntry *dbentry);
+
+static inline void pgstat_merge_backendstats_to_funcentry(
+    PgStat_StatFuncEntry *dest, PgStat_BackendFunctionEntry *src, bool init);
+static inline void pgstat_merge_funcentry(
+    PgStat_StatFuncEntry *dest, PgStat_StatFuncEntry *src, bool init);
 
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
 static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
-
+static void reset_dbentry_counters(PgStat_StatDBEntry *dbentry);
 static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
 
 static void pgstat_setup_memcxt(void);
@@ -316,320 +323,16 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+static bool pgstat_update_tabentry(dshash_table *tabhash,
+                                   PgStat_TableStatus *stat, bool nowait);
+static void pgstat_update_dbentry(PgStat_StatDBEntry *dbentry,
+                                  PgStat_TableStatus *stat);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
- */
-void
-pgstat_init(void)
-{
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
-
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
-
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
-    {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
-    }
-
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
-
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
-
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
-
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
-    }
-
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
-
-    /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
-     */
-    if (!pg_set_noblock(pgStatSock))
-    {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
-    }
-
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
-    }
-
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    return;
-
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
-
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
-
-    /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
-     */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
-}
-
 /*
  * subroutine for pgstat_reset_all
  */
@@ -678,119 +381,54 @@ pgstat_reset_remove_files(const char *directory)
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats files and in-memory counters.  This is currently used only
+ * if WAL recovery is needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
     pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    pgstat_reset_all_counters();
 }
 
-#ifdef EXEC_BACKEND
-
-/*
- * pgstat_forkexec() -
+/* ----------
+ * pgstat_create_shared_stats() -
  *
- * Format up the arglist for, then fork and exec, statistics collector process
+ *    create shared stats memory
+ * ----------
  */
-static pid_t
-pgstat_forkexec(void)
+static void
+pgstat_create_shared_stats(void)
 {
-    char       *av[10];
-    int            ac = 0;
+    MemoryContext oldcontext;
 
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
+    Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
 
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
+    /* lives for the lifetime of the process */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    area = dsa_create(LWTRANCHE_STATS_DSA);
+    dsa_pin_mapping(area);
 
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
+    db_stats = dshash_create(area, &dsh_dbparams, 0);
 
+    /* create shared area and write bootstrap information */
+    StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+    StatsShmem->global_stats =
+        dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+    StatsShmem->archiver_stats =
+        dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+    StatsShmem->db_stats_handle =
+        dshash_get_hash_table_handle(db_stats);
 
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
-
-    /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
-     */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
-
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
-
-    /*
-     * Okay, fork off the collector.
-     */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
+    /* connect to the memory */
+    snapshot_db_stats = NULL;
+    shared_globalStats = (PgStat_GlobalStats *)
+        dsa_get_address(area, StatsShmem->global_stats);
+    shared_archiverStats = (PgStat_ArchiverStats *)
+        dsa_get_address(area, StatsShmem->archiver_stats);
+    MemoryContextSwitchTo(oldcontext);
 }
 
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
-}
 
 /* ------------------------------------------------------------
  * Public functions used by backends follow
@@ -802,41 +440,107 @@ allow_immediate_pgstat_restart(void)
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
- *    transaction stop time as an approximation of current time.
- * ----------
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *  This requires taking some locks on the shared statistics hashes, and some
+ *  updates may be withheld on lock failure. Pending updates are retried in
+ *  later calls of this function and are finally flushed when it is called
+ *  with force = true or when PGSTAT_STAT_MAX_INTERVAL milliseconds have
+ *  elapsed since the last cleanup. On the other hand, updates by regular
+ *  backends happen at intervals no shorter than
+ *  PGSTAT_STAT_MIN_INTERVAL when force = false.
+ *
+ *  Returns time in milliseconds until the next update time.
+ *
+ *    Note that this is called only when not within a transaction, so it is fair
+ *    to use transaction stop time as an approximation of current time.
+ *    ----------
  */
-void
-pgstat_report_stat(bool force)
+long
+pgstat_update_stat(bool force)
 {
     /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
     static TimestampTz last_report = 0;
-
+    static TimestampTz oldest_pending = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
     TabStatusArray *tsa;
-    int            i;
+    pgstat_apply_tabstat_context cxt;
+    bool        other_pending_stats = false;
+    long elapsed;
+    long secs;
+    int     usecs;
+
+    if (pgstat_pending_recoveryconflict ||
+        pgstat_pending_deadlock ||
+        pgstat_pending_tempfile ||
+        pgStatPendingFunctions)
+        other_pending_stats = true;
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (!other_pending_stats && !pgStatPendingTabHash &&
+        (pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
+        pgStatXactCommit == 0 && pgStatXactRollback == 0)
+        return 0;
 
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
     now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
+
+    if (!force)
+    {
+        /*
+         * Don't update shared stats unless it's been at least
+         * PGSTAT_STAT_MIN_INTERVAL msec since we last updated them.
+         * Return the time to wait in that case.
+         */
+        TimestampDifference(last_report, now, &secs, &usecs);
+        elapsed = secs * 1000 + usecs / 1000;
+
+        if (elapsed < PGSTAT_STAT_MIN_INTERVAL)
+        {
+            /* we know we have some statistics */
+            if (oldest_pending == 0)
+                oldest_pending = now;
+
+            return PGSTAT_STAT_MIN_INTERVAL - elapsed;
+        }
+
+
+        /*
+         * Don't keep pending stats for longer than PGSTAT_STAT_MAX_INTERVAL.
+         */
+        if (oldest_pending > 0)
+        {
+            TimestampDifference(oldest_pending, now, &secs, &usecs);
+            elapsed = secs * 1000 + usecs / 1000;
+
+            if (elapsed > PGSTAT_STAT_MAX_INTERVAL)
+                force = true;
+        }
+    }
+
     last_report = now;
 
+    /* set up the stats update context */
+    cxt.dbentry = NULL;
+    cxt.tabhash = NULL;
+
+    /* Forcibly update other stats if any. */
+    if (other_pending_stats)
+    {
+        cxt.dbentry =
+            pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+        /* clean up pending statistics if any */
+        if (pgStatPendingFunctions)
+            pgstat_update_funcstats(true, cxt.dbentry);
+        if (pgstat_pending_recoveryconflict)
+            pgstat_cleanup_recovery_conflict(cxt.dbentry);
+        if (pgstat_pending_deadlock)
+            pgstat_cleanup_deadlock(cxt.dbentry);
+        if (pgstat_pending_tempfile)
+            pgstat_cleanup_tempfile(cxt.dbentry);
+    }
+
     /*
      * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
      * entries it points to.  (Should we fail partway through the loop below,
@@ -849,23 +553,55 @@ pgstat_report_stat(bool force)
     pgStatTabHash = NULL;
 
     /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
+     * XXX: We cannot lock two dshash entries at once. Since we must keep the
+     * lock while table stats are being updated, we have no choice other than
+     * separating the jobs for shared table stats and for regular tables.
+     * Looping over the array twice is apparently inefficient; a more
+     * efficient way is desirable.
      */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
+
+    /* The first of the following calls uses the dbentry obtained above, if any. */
+    pgstat_apply_pending_tabstats(false, force, &cxt);
+    pgstat_apply_pending_tabstats(true, force, &cxt);
+
+    /* zero out TableStatus structs after use */
+    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+    {
+        MemSet(tsa->tsa_entries, 0,
+               tsa->tsa_used * sizeof(PgStat_TableStatus));
+        tsa->tsa_used = 0;
+    }
+
+    /* record oldest pending update time */
+    if (pgStatPendingTabHash == NULL)
+        oldest_pending = 0;
+    else if (oldest_pending == 0)
+        oldest_pending = now;
+
+    return 0;
+}
+
+/*
+ * Subroutine for pgstat_update_stat.
+ *
+ * Applies table stats in the table status array, merging with pending stats
+ * if any.  If force is true, waits until the required locks are acquired.
+ * Otherwise keeps the merged stats as pending and retries at the next call.
+ */
+static void
+pgstat_apply_pending_tabstats(bool shared, bool force,
+                              pgstat_apply_tabstat_context *cxt)
+{
+    static const PgStat_TableCounts all_zeroes;
+    TabStatusArray *tsa;
+    int i;
 
     for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
     {
         for (i = 0; i < tsa->tsa_used; i++)
         {
             PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
+            PgStat_TableStatus *pentry = NULL;
 
             /* Shouldn't have any pending transaction-dependent counts */
             Assert(entry->trans == NULL);
@@ -878,178 +614,440 @@ pgstat_report_stat(bool force)
                        sizeof(PgStat_TableCounts)) == 0)
                 continue;
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* Skip if this entry does not match the request */
+            if (entry->t_shared != shared)
+                continue;
+
+            /* if a pending update exists, apply it together with the new one */
+            if (pgStatPendingTabHash != NULL)
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                pentry = hash_search(pgStatPendingTabHash,
+                                     (void *) entry, HASH_FIND, NULL);
+
+                if (pentry)
+                {
+                    /* merge new update into pending updates */
+                    pgstat_merge_tabentry(pentry, entry, false);
+                    entry = pentry;
+                }
+            }
+
+            /* try to apply the merged stats */
+            if (pgstat_apply_tabstat(cxt, entry, !force))
+            {
+                /* succeeded. remove it if it was pending stats */
+                if (pentry)
+                    hash_search(pgStatPendingTabHash,
+                                (void *) pentry, HASH_REMOVE, NULL);
+            }
+            else if (!pentry)
+            {
+                /* failed and there was no pending entry, create new one. */
+                bool found;
+
+                if (pgStatPendingTabHash == NULL)
+                {
+                    HASHCTL        ctl;
+
+                    memset(&ctl, 0, sizeof(ctl));
+                    ctl.keysize = sizeof(Oid);
+                    ctl.entrysize = sizeof(PgStat_TableStatus);
+                    pgStatPendingTabHash =
+                        hash_create("pgstat pending table stats hash",
+                                    TABSTAT_QUANTUM,
+                                    &ctl,
+                                    HASH_ELEM | HASH_BLOBS);
+                }
+
+                pentry = hash_search(pgStatPendingTabHash,
+                                     (void *) entry, HASH_ENTER, &found);
+                Assert (!found);
+
+                *pentry = *entry;
             }
         }
-        /* zero out TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
+    }
+
+    /* if any pending stats exist, try to clean them up */
+    if (pgStatPendingTabHash != NULL)
+    {
+        HASH_SEQ_STATUS pstat;
+        PgStat_TableStatus *pentry;
+
+        hash_seq_init(&pstat, pgStatPendingTabHash);
+        while((pentry = (PgStat_TableStatus *) hash_seq_search(&pstat)) != NULL)
+        {
+            /* Skip if this entry does not match the request */
+            if (pentry->t_shared != shared)
+                continue;
+
+            /* apply pending entry and remove on success */
+            if (pgstat_apply_tabstat(cxt, pentry, !force))
+                hash_search(pgStatPendingTabHash,
+                            (void *) pentry, HASH_REMOVE, NULL);
+        }
+
+        /* destroy the hash if no entry is left */
+        if (hash_get_num_entries(pgStatPendingTabHash) == 0)
+        {
+            hash_destroy(pgStatPendingTabHash);
+            pgStatPendingTabHash = NULL;
+        }
+    }
+
+    if (cxt->tabhash)
+        dshash_detach(cxt->tabhash);
+    if (cxt->dbentry)
+        dshash_release_lock(db_stats, cxt->dbentry);
+    cxt->tabhash = NULL;
+    cxt->dbentry = NULL;
+}
+
+
+/*
+ * pgstat_apply_tabstat: update shared stats entry using given entry
+ *
+ * If nowait is true, just returns false on lock failure.  Dshashes for table
+ * and function stats are kept attached and stored in cxt. The caller must
+ * detach them after use.
+ */
+bool
+pgstat_apply_tabstat(pgstat_apply_tabstat_context *cxt,
+                     PgStat_TableStatus *entry, bool nowait)
+{
+    Oid dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
+    int        table_mode = PGSTAT_FETCH_EXCLUSIVE;
+    bool updated = false;
+
+    if (nowait)
+        table_mode |= PGSTAT_FETCH_NOWAIT;
+
+    /*
+     * We need to keep the lock on dbentries for regular tables to avoid a race
+     * condition with DROP DATABASE. So we hold it in the context variable. We
+     * don't need that for shared tables.
+     */
+    if (!cxt->dbentry)
+        cxt->dbentry = pgstat_get_db_entry(dboid, table_mode, NULL);
+
+    /* we cannot acquire lock, just return */
+    if (!cxt->dbentry)
+        return false;
+
+    /* attach shared stats table if not yet */
+    if (!cxt->tabhash)
+    {
+        /* apply database stats  */
+        if (!entry->t_shared)
+        {
+            /* Update database-wide stats  */
+            cxt->dbentry->n_xact_commit += pgStatXactCommit;
+            cxt->dbentry->n_xact_rollback += pgStatXactRollback;
+            cxt->dbentry->n_block_read_time += pgStatBlockReadTime;
+            cxt->dbentry->n_block_write_time += pgStatBlockWriteTime;
+            pgStatXactCommit = 0;
+            pgStatXactRollback = 0;
+            pgStatBlockReadTime = 0;
+            pgStatBlockWriteTime = 0;
+        }
+        
+        cxt->tabhash =
+            dshash_attach(area, &dsh_tblparams, cxt->dbentry->tables, 0);
     }
 
     /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
+     * If we have access to the required data, try to update table stats first.
+     * Update database stats only if the first step succeeded.
      */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    if (pgstat_update_tabentry(cxt->tabhash, entry, nowait))
+    {
+        pgstat_update_dbentry(cxt->dbentry, entry);
+        updated = true;
+    }
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+    return updated;
 }
 
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * pgstat_merge_tabentry: subroutine for pgstat_update_stat
+ *
+ * Merge srcstat into deststat. Existing value in deststat is cleared if
+ * init is true.
  */
 static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+pgstat_merge_tabentry(PgStat_TableStatus *deststat,
+                      PgStat_TableStatus *srcstat,
+                      bool init)
 {
-    int            n;
-    int            len;
+    Assert (deststat != srcstat);
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
-     */
-    if (OidIsValid(tsmsg->m_databaseid))
-    {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
-        pgStatXactCommit = 0;
-        pgStatXactRollback = 0;
-        pgStatBlockReadTime = 0;
-        pgStatBlockWriteTime = 0;
-    }
+    if (init)
+        deststat->t_counts = srcstat->t_counts;
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        PgStat_TableCounts *dest = &deststat->t_counts;
+        PgStat_TableCounts *src = &srcstat->t_counts;
+
+        dest->t_numscans += src->t_numscans;
+        dest->t_tuples_returned += src->t_tuples_returned;
+        dest->t_tuples_fetched += src->t_tuples_fetched;
+        dest->t_tuples_inserted += src->t_tuples_inserted;
+        dest->t_tuples_updated += src->t_tuples_updated;
+        dest->t_tuples_deleted += src->t_tuples_deleted;
+        dest->t_tuples_hot_updated += src->t_tuples_hot_updated;
+        dest->t_truncated |= src->t_truncated;
+
+        /* If table was truncated, first reset the live/dead counters */
+        if (src->t_truncated)
+        {
+            dest->t_delta_live_tuples = 0;
+            dest->t_delta_dead_tuples = 0;
+        }
+        dest->t_delta_live_tuples += src->t_delta_live_tuples;
+        dest->t_delta_dead_tuples += src->t_delta_dead_tuples;
+        dest->t_changed_tuples += src->t_changed_tuples;
+        dest->t_blocks_fetched += src->t_blocks_fetched;
+        dest->t_blocks_hit += src->t_blocks_hit;
     }
-
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
-
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
 }
-
+        
 /*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
+ * pgstat_update_funcstats: subroutine for pgstat_update_stat
+ *
+ *  updates a function stat
  */
 static void
-pgstat_send_funcstats(void)
+pgstat_update_funcstats(bool force, PgStat_StatDBEntry *dbentry)
 {
     /* we assume this inits to all zeroes: */
     static const PgStat_FunctionCounts all_zeroes;
+    pg_stat_table_result_status status = 0;
+    dshash_table *funchash;
+    bool          nowait = !force;
+    bool          release_db = false;
+    int              table_op = PGSTAT_FETCH_EXCLUSIVE;
 
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
-
-    if (pgStatFunctions == NULL)
+    if (pgStatFunctions == NULL && pgStatPendingFunctions == NULL)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
+    if (nowait)
+        table_op |= PGSTAT_FETCH_NOWAIT;
 
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    /* find the shared function stats table */
+    if (!dbentry)
     {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
-
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
-
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
-
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
+        dbentry = pgstat_get_db_entry(MyDatabaseId, table_op, &status);
+        release_db = true;
     }
 
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
+    /* lock failure, return. */
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
 
-    have_function_stats = false;
+    /* create hash if not yet */
+    if (dbentry->functions == DSM_HANDLE_INVALID)
+    {
+        funchash = dshash_create(area, &dsh_funcparams, 0);
+        dbentry->functions = dshash_get_hash_table_handle(funchash);
+    }
+    else
+        funchash = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+
+    /*
+     * First, we empty the transaction stats. Just move numbers to pending
+     * stats if any. Otherwise try to directly update the shared stats but
+     * create a new pending entry on lock failure.
+     */
+    if (pgStatFunctions)
+    {
+        HASH_SEQ_STATUS fstat;
+        PgStat_BackendFunctionEntry *bestat;
+
+        hash_seq_init(&fstat, pgStatFunctions);
+        while ((bestat = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+        {
+            bool found;
+            bool init = false;
+            PgStat_StatFuncEntry *funcent = NULL;
+
+            /* Skip it if no counts accumulated since last time */
+            if (memcmp(&bestat->f_counts, &all_zeroes,
+                       sizeof(PgStat_FunctionCounts)) == 0)
+                continue;
+
+            /* find pending entry */
+            if (pgStatPendingFunctions)
+                funcent = (PgStat_StatFuncEntry *)
+                    hash_search(pgStatPendingFunctions,
+                                (void *) &(bestat->f_id), HASH_FIND, NULL);
+
+            if (!funcent)
+            {
+                /* pending entry not found, find shared stats entry */
+                funcent = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert_extended(funchash,
+                                                   (void *) &(bestat->f_id),
+                                                   &found, nowait);
+                if (funcent)
+                    init = !found;
+                else
+                {
+                    /* no shared stats entry. create a new pending one */
+                    funcent = (PgStat_StatFuncEntry *)
+                        hash_search(pgStatPendingFunctions,
+                                    (void *) &(bestat->f_id), HASH_ENTER, NULL);
+                    init = true;
+                }
+            }
+            Assert (funcent != NULL);
+
+            pgstat_merge_backendstats_to_funcentry(funcent, bestat, init);
+
+            /* reset used counts */
+            MemSet(&bestat->f_counts, 0, sizeof(PgStat_FunctionCounts));
+        }
+    }
+
+    /* Second, apply pending stats numbers to shared table */
+    if (pgStatPendingFunctions)
+    {
+        HASH_SEQ_STATUS fstat;
+        PgStat_StatFuncEntry *pendent;
+
+        hash_seq_init(&fstat, pgStatPendingFunctions);
+        while ((pendent = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+        {
+            PgStat_StatFuncEntry *funcent;
+            bool found;
+
+            funcent = (PgStat_StatFuncEntry *)
+                dshash_find_or_insert_extended(funchash,
+                                               (void *) &(pendent->functionid),
+                                               &found, nowait);
+            if (funcent)
+            {
+                pgstat_merge_funcentry(funcent, pendent, !found);
+                hash_search(pgStatPendingFunctions,
+                            (void *) &(pendent->functionid), HASH_REMOVE, NULL);
+            }
+        }    
+
+        /* destroy the hash if no entry remains */
+        if (hash_get_num_entries(pgStatPendingFunctions) == 0)
+        {
+            hash_destroy(pgStatPendingFunctions);
+            pgStatPendingFunctions = NULL;
+        }
+    }
+
+    if (release_db)
+        dshash_release_lock(db_stats, dbentry);
 }
 
+/*
+ * pgstat_merge_backendstats_to_funcentry: subroutine for
+ *                                             pgstat_update_funcstats
+ *
+ * Merges BackendFunctionEntry into StatFuncEntry
+ */
+static inline void
+pgstat_merge_backendstats_to_funcentry(PgStat_StatFuncEntry *dest,
+                                       PgStat_BackendFunctionEntry *src,
+                                       bool init)
+{
+    if (init)
+    {
+        /*
+         * If it's a new function entry, initialize counters to the values
+         * we just got.
+         */
+        dest->f_numcalls = src->f_counts.f_numcalls;
+        dest->f_total_time =
+            INSTR_TIME_GET_MICROSEC(src->f_counts.f_total_time);
+        dest->f_self_time =
+            INSTR_TIME_GET_MICROSEC(src->f_counts.f_self_time);
+    }
+    else
+    {
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        dest->f_numcalls += src->f_counts.f_numcalls;
+        dest->f_total_time +=
+            INSTR_TIME_GET_MICROSEC(src->f_counts.f_total_time);
+        dest->f_self_time +=
+            INSTR_TIME_GET_MICROSEC(src->f_counts.f_self_time);
+    }
+}
+
+/*
+ * pgstat_merge_funcentry: subroutine for pgstat_update_funcstats
+ *
+ * Merges one StatFuncEntry into another
+ */
+static inline void
+pgstat_merge_funcentry(PgStat_StatFuncEntry *dest, PgStat_StatFuncEntry *src,
+                       bool init)
+{
+    if (init)
+    {
+        /*
+         * If it's a new function entry, initialize counters to the values
+         * we just got.
+         */
+        dest->f_numcalls = src->f_numcalls;
+        dest->f_total_time = src->f_total_time;
+        dest->f_self_time = src->f_self_time;
+    }
+    else
+    {
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        dest->f_numcalls += src->f_numcalls;
+        dest->f_total_time += src->f_total_time;
+        dest->f_self_time += src->f_self_time;
+    }
+}
+
+
 
 /* ----------
  * pgstat_vacuum_stat() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *    Remove stats entries for objects that no longer exist.
  * ----------
  */
 void
 pgstat_vacuum_stat(void)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
+    HTAB       *oidtab;
+    dshash_table *dshtable;
+    dshash_seq_status dshstat;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatFuncEntry *funcentry;
-    int            len;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* we don't collect statistics in standalone mode */
+    if (!IsUnderPostmaster)
         return;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* If not done for this transaction, take a snapshot of stats */
+    if (!backend_snapshot_global_stats())
+        return;
 
     /*
      * Read pg_database and make a list of OIDs of all existing databases
      */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    oidtab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
 
     /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
+     * Search the database hash table for dead databases and drop them
+     * from the hash.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+
+    dshash_seq_init(&dshstat, db_stats, false, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            dbid = dbentry->databaseid;
 
@@ -1057,148 +1055,86 @@ pgstat_vacuum_stat(void)
 
         /* the DB entry for shared tables (with InvalidOid) is never dropped */
         if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
+            hash_search(oidtab, (void *) &dbid, HASH_FIND, NULL) == NULL)
             pgstat_drop_database(dbid);
     }
 
     /* Clean up */
-    hash_destroy(htab);
+    hash_destroy(oidtab);
 
     /*
      * Lookup our own database entry; if not found, nothing more to do.
      */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, NULL);
+    if (!dbentry)
         return;
-
+
     /*
      * Similarly to above, make a list of all known relations in this DB.
      */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
+    oidtab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
 
     /*
      * Check for all tables listed in stats hashtable if they still exist.
+     * The stats cache is useless here, so search the shared hash directly.
      */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+    dshtable = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&dshstat, dshtable, false, true);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            tabid = tabentry->tableid;
 
         CHECK_FOR_INTERRUPTS();
 
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+        if (hash_search(oidtab, (void *) &tabid, HASH_FIND, NULL) != NULL)
             continue;
 
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
-    }
-
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
+        /* Not there, so purge this table */
+        dshash_delete_entry(dshtable, tabentry);
     }
+    dshash_detach(dshtable);
 
     /* Clean up */
-    hash_destroy(htab);
+    hash_destroy(oidtab);
 
     /*
      * Now repeat the above steps for functions.  However, we needn't bother
      * in the common case where no function stats are being collected.
      */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
+    if (dbentry->functions != DSM_HANDLE_INVALID)
     {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
+        dshtable = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        oidtab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
 
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
+        dshash_seq_init(&dshstat, dshtable, false, true);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&dshstat)) != NULL)
         {
             Oid            funcid = funcentry->functionid;
 
             CHECK_FOR_INTERRUPTS();
 
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
+            if (hash_search(oidtab, (void *) &funcid, HASH_FIND, NULL) != NULL)
                 continue;
 
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
+            /* Not there, so remove this function */
+            dshash_delete_entry(dshtable, funcentry);
         }
 
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
+        hash_destroy(oidtab);
 
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
+        dshash_detach(dshtable);
     }
+    dshash_release_lock(db_stats, dbentry);
 }
 
 
-/* ----------
+/*
  * pgstat_collect_oids() -
  *
  *    Collect the OIDs of all objects listed in the specified system catalog
- *    into a temporary hash table.  Caller should hash_destroy the result
- *    when done with it.  (However, we make the table in CurrentMemoryContext
- *    so that it will be freed properly in event of an error.)
- * ----------
+ *    into a temporary hash table.  Caller should hash_destroy the result after
+ *    use.  (However, we make the table in CurrentMemoryContext so that it will
+ *    be freed properly in event of an error.)
  */
 static HTAB *
 pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
@@ -1245,62 +1181,54 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
+ *    Remove entry for the database that we just dropped.
+ *
+ *    If a stats update happens after this, the entry will be re-created, but
+ *    we will still clean up the dead DB eventually via future invocations of
+ *    pgstat_vacuum_stat().
  * ----------
  */
+
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    Assert(OidIsValid(databaseid));
+    Assert(db_stats);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Lookup the database in the hashtable with exclusive lock.
+     */
+    dbentry = pgstat_get_db_entry(databaseid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    /*
+     * If found, remove it (along with the db statfile).
+     */
+    if (dbentry)
+    {
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+        }
+
+        dshash_delete_entry(db_stats, (void *)dbentry);
+    }
 }
 
 
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
-
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
-}
-#endif                            /* NOT_USED */
-
-
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1309,20 +1237,51 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    PgStat_StatDBEntry           *dbentry;
+    pg_stat_table_result_status status;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    Assert(db_stats);
+
+    /*
+     * Lookup the database in the hashtable.  Nothing to do if not there.
+     */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, &status);
+
+    if (!dbentry)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * We simply throw away all the database's table entries by recreating a
+     * new hash table for them.
+     */
+    if (dbentry->tables != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_destroy(t);
+        dbentry->tables = DSM_HANDLE_INVALID;
+    }
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_destroy(t);
+        dbentry->functions = DSM_HANDLE_INVALID;
+    }
+
+    /*
+     * Reset database-level stats, too.  This creates empty hash tables for
+     * tables and functions.
+     */
+    reset_dbentry_counters(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1331,29 +1290,37 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    Assert(db_stats);
 
+    /* Reset the archiver statistics for the cluster. */
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+        memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
+    }
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+        /* Reset the global background writer statistics for the cluster. */
+        memset(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    }
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
+
+    LWLockRelease(StatsLock);
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1362,17 +1329,90 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatDBEntry *dbentry;
+
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    Assert(db_stats);
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    if (!dbentry)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* Set the reset timestamp for the whole database */
+    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Remove object if it exists, ignore it if not */
+    if (type == RESET_TABLE)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_delete_key(t, (void *) &objoid);
+    }
+
+    if (type == RESET_FUNCTION && dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_delete_key(t, (void *) &objoid);
+    }
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/*
+ * pgstat_reset_all_counters: subroutine for pgstat_reset_all
+ *
+ * Clears all counters in shared memory.
+ */
+static void
+pgstat_reset_all_counters(void)
+{
+    dshash_seq_status dshstat;
+    PgStat_StatDBEntry           *dbentry;
+
+    Assert(db_stats);
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    dshash_seq_init(&dshstat, db_stats, false, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
+    {
+        /*
+         * We simply throw away all the database's table hashes
+         */
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *t =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(t);
+            dbentry->tables = DSM_HANDLE_INVALID;
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *t =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(t);
+            dbentry->functions = DSM_HANDLE_INVALID;
+        }
+
+        /*
+         * Reset database-level stats, too.  This creates empty hash tables
+         * for tables and functions.
+         */
+        reset_dbentry_counters(dbentry);
+        dshash_release_lock(db_stats, dbentry);
+
+    }
+
+    /*
+     * Reset global counters
+     */
+    memset(shared_globalStats, 0, sizeof(*shared_globalStats));
+    memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+    shared_globalStats->stat_reset_timestamp =
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
+
+    LWLockRelease(StatsLock);
 }
 
 /* ----------
@@ -1386,48 +1426,75 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    /*
+     * Store the last autovacuum time in the database's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
 
-    pgstat_send(&msg, sizeof(msg));
+    dbentry->last_autovac_time = GetCurrentTimestamp();
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report about the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, tableoid, true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->vacuum_count++;
+    }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report about the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1438,9 +1505,14 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
     /*
@@ -1469,114 +1541,228 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, RelationGetRelid(rel), true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
+static int pending_conflict_tablespace = 0;
+static int pending_conflict_lock = 0;
+static int pending_conflict_snapshot = 0;
+static int pending_conflict_bufferpin = 0;
+static int pending_conflict_startup_deadlock = 0;
+
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    pgstat_pending_recoveryconflict = true;
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            pending_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            pending_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            pending_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            pending_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            pending_conflict_startup_deadlock++;
+            break;
+    }
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    pgstat_cleanup_recovery_conflict(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/*
+ * clean up function for pending recovery conflicts
+ */
+static void
+pgstat_cleanup_recovery_conflict(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_conflict_tablespace    += pending_conflict_tablespace;
+    dbentry->n_conflict_lock         += pending_conflict_lock;
+    dbentry->n_conflict_snapshot    += pending_conflict_snapshot;
+    dbentry->n_conflict_bufferpin    += pending_conflict_bufferpin;
+    dbentry->n_conflict_startup_deadlock += pending_conflict_startup_deadlock;
+
+    pending_conflict_tablespace = 0;
+    pending_conflict_lock = 0;
+    pending_conflict_snapshot = 0;
+    pending_conflict_bufferpin = 0;
+    pending_conflict_startup_deadlock = 0;
+
+    pgstat_pending_recoveryconflict = false;
 }
 
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
+static int pending_deadlocks = 0;
+
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    pending_deadlocks++;
+    pgstat_pending_deadlock = true;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    pgstat_cleanup_deadlock(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/*
+ * clean up function for pending deadlocks
+ */
+static void
+pgstat_cleanup_deadlock(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_deadlocks += pending_deadlocks;
+    pending_deadlocks = 0;
+    pgstat_pending_deadlock = false;
 }
 
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
+static size_t pending_filesize = 0;
+static size_t pending_files = 0;
+
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
-}
+    if (filesize > 0) /* Isn't there a case where filesize is really 0? */
+    {
+        pgstat_pending_tempfile = true;
+        pending_filesize += filesize; /* needs an overflow check */
+        pending_files++;
+    }
 
-
-/* ----------
- * pgstat_ping() -
- *
- *    Send some junk data to the collector to increase traffic.
- * ----------
- */
-void
-pgstat_ping(void)
-{
-    PgStat_MsgDummy msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!pgstat_pending_tempfile)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    pgstat_cleanup_tempfile(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
-/* ----------
- * pgstat_send_inquiry() -
- *
- *    Notify collector that we need fresh data.
- * ----------
+/*
+ * clean up function for temporary files
  */
 static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
+pgstat_cleanup_tempfile(PgStat_StatDBEntry *dbentry)
 {
-    PgStat_MsgInquiry msg;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
+    dbentry->n_temp_bytes += pending_filesize;
+    dbentry->n_temp_files += pending_files;
+    pending_filesize = 0;
+    pending_files = 0;
+    pgstat_pending_tempfile = false;
+
 }
 
-
 /*
  * Initialize function call usage data.
  * Called by the executor before invoking a function.
@@ -1692,9 +1878,6 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
         fs->f_numcalls++;
     fs->f_total_time = f_total;
     INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
 }
 
 
@@ -1716,6 +1899,15 @@ pgstat_initstats(Relation rel)
     Oid            rel_id = rel->rd_id;
     char        relkind = rel->rd_rel->relkind;
 
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+    {
+        /* We're not counting at all */
+        rel->pgstat_info = NULL;
+        return;
+    }
+
     /* We only count stats for things that have storage */
     if (!(relkind == RELKIND_RELATION ||
           relkind == RELKIND_MATVIEW ||
@@ -1727,13 +1919,6 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-    {
-        /* We're not counting at all */
-        rel->pgstat_info = NULL;
-        return;
-    }
-
     /*
      * If we already set up this relation in the current transaction, nothing
      * to do.
@@ -2377,34 +2562,6 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
         rec->tuples_inserted + rec->tuples_updated;
 }
 
-
-/* ----------
- * pgstat_fetch_stat_dbentry() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
- * ----------
- */
-PgStat_StatDBEntry *
-pgstat_fetch_stat_dbentry(Oid dbid)
-{
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
-}
-
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
@@ -2417,47 +2574,28 @@ pgstat_fetch_stat_dbentry(Oid dbid)
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* Lookup our database, then look in its table hash table. */
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, false);
+    if (dbentry == NULL)
+        return NULL;
 
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = backend_get_tab_entry(dbentry, relid, false);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    dbentry = pgstat_fetch_stat_dbentry(InvalidOid, false);
+    if (dbentry == NULL)
+        return NULL;
+
+    tabentry = backend_get_tab_entry(dbentry, relid, false);
+    if (tabentry != NULL)
+        return tabentry;
 
     return NULL;
 }
@@ -2476,18 +2614,14 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     PgStat_StatDBEntry *dbentry;
     PgStat_StatFuncEntry *funcentry = NULL;
 
-    /* load the stats file if needed */
-    backend_read_statsfile();
+    /* Lookup our database, then find the requested function */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_SHARED, NULL);
+    if (dbentry == NULL)
+        return NULL;
 
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
+    funcentry = backend_get_func_etnry(dbentry, func_id, false);
 
+    dshash_release_lock(db_stats, dbentry);
     return funcentry;
 }
 
@@ -2562,9 +2696,11 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    if (!backend_snapshot_global_stats())
+        return NULL;
 
-    return &archiverStats;
+    return snapshot_archiverStats;
 }
 
 
@@ -2579,9 +2715,11 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    if (!backend_snapshot_global_stats())
+        return NULL;
 
-    return &globalStats;
+    return snapshot_globalStats;
 }
 
 
@@ -2771,7 +2909,7 @@ pgstat_initialize(void)
     }
 
     /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -2963,7 +3101,7 @@ pgstat_beshutdown_hook(int code, Datum arg)
      * during failed backend starts might never get counted.)
      */
     if (OidIsValid(MyDatabaseId))
-        pgstat_report_stat(true);
+        pgstat_update_stat(true);
 
     /*
      * Clear my status entry, following the protocol of bumping st_changecount
@@ -3230,7 +3368,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -4150,96 +4289,68 @@ pgstat_get_backend_desc(BackendType backendType)
  * ------------------------------------------------------------
  */
 
-
 /* ----------
- * pgstat_setheader() -
+ * pgstat_update_archiver() -
  *
- *        Set common header fields in a statistics message
+ *    Update the stats data about the WAL file that we successfully archived or
+ *    failed to archive.
  * ----------
  */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
+void
+pgstat_update_archiver(const char *xlog, bool failed)
 {
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
+    if (failed)
     {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
+        /* Failed archival attempt */
+        ++shared_archiverStats->failed_count;
+        memcpy(shared_archiverStats->last_failed_wal, xlog,
+               sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = GetCurrentTimestamp();
+    }
+    else
+    {
+        /* Successful archival operation */
+        ++shared_archiverStats->archived_count;
+        memcpy(shared_archiverStats->last_archived_wal, xlog,
+               sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = GetCurrentTimestamp();
+    }
 }
 
 /* ----------
- * pgstat_send_archiver() -
+ * pgstat_update_bgwriter() -
  *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
+ *        Update bgwriter statistics
  * ----------
  */
 void
-pgstat_send_archiver(const char *xlog, bool failed)
-{
-    PgStat_MsgArchiver msg;
-
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* ----------
- * pgstat_send_bgwriter() -
- *
- *        Send bgwriter statistics to the collector
- * ----------
- */
-void
-pgstat_send_bgwriter(void)
+pgstat_update_bgwriter(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+
+    PgStat_BgWriter *s = &BgWriterStats;
 
     /*
      * This function can be called even if nothing at all has happened. In
      * this case, avoid sending a completely empty message to the stats
      * collector.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += s->timed_checkpoints;
+    shared_globalStats->requested_checkpoints += s->requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += s->checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += s->checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += s->buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += s->buf_written_clean;
+    shared_globalStats->maxwritten_clean += s->maxwritten_clean;
+    shared_globalStats->buf_written_backend += s->buf_written_backend;
+    shared_globalStats->buf_fsync_backend += s->buf_fsync_backend;
+    shared_globalStats->buf_alloc += s->buf_alloc;
+    LWLockRelease(StatsLock);
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4247,299 +4358,15 @@ pgstat_send_bgwriter(void)
     MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
 }
 
-
-/* ----------
- * PgstatCollectorMain() -
- *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, pgstat_sighup_handler);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, pgstat_exit);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    /*
-     * Identify myself via ps
-     */
-    init_ps_display("stats collector", "", "", "");
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do recognize got_SIGHUP inside
-     * the inner loop, which means that such interrupts will get serviced but
-     * the latch won't get cleared until next time there is a break in the
-     * action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (need_exit)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * need_exit becomes set.
-         */
-        while (!need_exit)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (got_SIGHUP)
-            {
-                got_SIGHUP = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry((PgStat_MsgInquiry *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat((PgStat_MsgTabstat *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge((PgStat_MsgTabpurge *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb((PgStat_MsgDropdb *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter((PgStat_MsgResetcounter *) &msg,
-                                             len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(
-                                                   (PgStat_MsgResetsharedcounter *) &msg,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(
-                                                   (PgStat_MsgResetsinglecounter *) &msg,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac((PgStat_MsgAutovacStart *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum((PgStat_MsgVacuum *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze((PgStat_MsgAnalyze *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver((PgStat_MsgArchiver *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter((PgStat_MsgBgWriter *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat((PgStat_MsgFuncstat *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge((PgStat_MsgFuncpurge *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict((PgStat_MsgRecoveryConflict *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock((PgStat_MsgDeadlock *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile((PgStat_MsgTempFile *) &msg, len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr & WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    exit(0);
-}
-
-
-/* SIGQUIT signal handler for collector process */
-static void
-pgstat_exit(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    need_exit = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
-/* SIGHUP handler for collector process */
-static void
-pgstat_sighup_handler(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    got_SIGHUP = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
 /*
- * Subroutine to clear stats in a database entry
+ * Subroutine to reset stats in a shared database entry
  *
  * Tables and functions hashes are initialized to empty.
  */
 static void
 reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
 {
-    HASHCTL        hash_ctl;
+    dshash_table *tbl;
 
     dbentry->n_xact_commit = 0;
     dbentry->n_xact_rollback = 0;
@@ -4565,20 +4392,17 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
     dbentry->stat_reset_timestamp = GetCurrentTimestamp();
     dbentry->stats_timestamp = 0;
 
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
 
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
+    Assert(dbentry->tables == DSM_HANDLE_INVALID);
+    tbl = dshash_create(area, &dsh_tblparams, 0);
+    dbentry->tables = dshash_get_hash_table_handle(tbl);
+    dshash_detach(tbl);
+
+    Assert(dbentry->functions == DSM_HANDLE_INVALID);
+    /* the function hash is created later, on demand */
+
+    dbentry->snapshot_tables = NULL;
+    dbentry->snapshot_functions = NULL;
 }
 
 /*
@@ -4587,47 +4411,76 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
  * Else, return NULL.
  */
 static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
+pgstat_get_db_entry(Oid databaseid, int op, pg_stat_table_result_status *status)
 {
     PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
+    bool        nowait = ((op & PGSTAT_FETCH_NOWAIT) != 0);
+    bool        lock_acquired = true;
+    bool        found = true;
 
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
+    if (!IsUnderPostmaster)
         return NULL;
 
-    /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
-     */
-    if (!found)
-        reset_dbentry_counters(result);
+    /* Lookup or create the hash table entry for this database */
+    if (op & PGSTAT_FETCH_EXCLUSIVE)
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_or_insert_extended(db_stats, &databaseid,
+                                           &found, nowait);
+        if (result == NULL)
+            lock_acquired = false;
+        else if (!found)
+        {
+            /*
+             * If not found, initialize the new one.  This creates an empty
+             * table hash; the function hash is created later, on demand.
+             */
+            reset_dbentry_counters(result);
+        }
+    }
+    else
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_extended(db_stats, &databaseid, true, nowait,
+                                 &lock_acquired);
+        if (result == NULL)
+            found = false;
+    }
+
+    /* Set return status if requested */
+    if (status)
+    {
+        if (!lock_acquired)
+        {
+            Assert(nowait);
+            *status = PGSTAT_ENTRY_LOCK_FAILED;
+        }
+        else if (!found)
+            *status = PGSTAT_ENTRY_NOT_FOUND;
+        else
+            *status = PGSTAT_ENTRY_FOUND;
+    }
 
     return result;
 }
 
-
 /*
  * Lookup the hash table entry for the specified table. If no hash
  * table entry exists, initialize it, if the create parameter is true.
  * Else, return NULL.
  */
 static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create)
 {
     PgStat_StatTabEntry *result;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
 
     /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
+    if (create)
+        result = (PgStat_StatTabEntry *)
+            dshash_find_or_insert(table, &tableoid, &found);
+    else
+        result = (PgStat_StatTabEntry *) dshash_find(table, &tableoid, false);
 
     if (!create && !found)
         return NULL;
@@ -4663,29 +4516,23 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
 
 /* ----------
  * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ *        Write the global statistics file, as well as DB files.
  * ----------
  */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+void
+pgstat_write_statsfiles(void)
 {
-    HASH_SEQ_STATUS hstat;
+    dshash_seq_status hstat;
     PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
 
+    /* should be called only from the postmaster */
+    Assert(!IsUnderPostmaster);
+
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
     /*
@@ -4704,7 +4551,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
 
     /*
      * Write the file header --- currently just a format ID.
@@ -4716,32 +4563,29 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Write global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write archiver stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Walk through the database table.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_init(&hstat, db_stats, false, false);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&hstat)) != NULL)
     {
         /*
          * Write out the table and function stats for this DB into the
          * appropriate per-DB stat file, if required.
          */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
+        pgstat_write_db_statsfile(dbentry);
 
         /*
          * Write out the DB entry. We don't write the tables or functions
@@ -4784,16 +4628,6 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
 }
 
 /*
@@ -4801,15 +4635,14 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
  * of length len.
  */
 static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
+get_dbstat_filename(bool tempname, Oid databaseid,
                     char *filename, int len)
 {
     int            printed;
 
     /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
     printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
                        databaseid,
                        tempname ? "tmp" : "stat");
     if (printed >= len)
@@ -4827,10 +4660,10 @@ get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
  * ----------
  */
 static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
+pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry)
 {
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
+    dshash_seq_status tstat;
+    dshash_seq_status fstat;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatFuncEntry *funcentry;
     FILE       *fpout;
@@ -4839,9 +4672,10 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     int            rc;
     char        tmpfile[MAXPGPATH];
     char        statfile[MAXPGPATH];
+    dshash_table *tbl;
 
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -4868,23 +4702,30 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     /*
      * Walk through the database's access stats per table.
      */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&tstat, tbl, false, false);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&tstat)) != NULL)
     {
         fputc('T', fpout);
         rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_detach(tbl);
 
     /*
      * Walk through the database's function stats table.
      */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+    if (dbentry->functions != DSM_HANDLE_INVALID)
     {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
+        tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_seq_init(&fstat, tbl, false, false);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&fstat)) != NULL)
+        {
+            fputc('F', fpout);
+            rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+            (void) rc;                /* we'll check for error with ferror */
+        }
+        dshash_detach(tbl);
     }
 
     /*
@@ -4919,47 +4760,30 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
 }
 
 /* ----------
  * pgstat_read_statsfiles() -
  *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ *    Reads in some existing statistics collector files into the shared stats
+ *    hash.
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+void
+pgstat_read_statsfiles(void)
 {
     PgStat_StatDBEntry *dbentry;
     PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    dshash_table *tblstats = NULL;
+    dshash_table *funcstats = NULL;
+
+    /* should be called from postmaster  */
+    Assert(!IsUnderPostmaster);
 
     /*
      * The tables will live in pgStatLocalContext.
@@ -4967,28 +4791,18 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     pgstat_setup_memcxt();
 
     /*
-     * Create the DB hashtable
+     * Create the DB hashtable and global stats area
      */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
+    /* Hold the lock so that no other process sees empty stats */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    pgstat_create_shared_stats();
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp = shared_globalStats->stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
@@ -5002,11 +4816,12 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        LWLockRelease(StatsLock);
+        return;
     }
 
     /*
@@ -5015,7 +4830,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
@@ -5023,11 +4838,12 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     /*
      * Read global stats struct
      */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+        sizeof(*shared_globalStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
+        memset(shared_globalStats, 0, sizeof(*shared_globalStats));
         goto done;
     }
 
@@ -5038,17 +4854,17 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
      * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
      * an unusual scenario.
      */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
+    shared_globalStats->stats_timestamp = 0;
 
     /*
      * Read archiver stats struct
      */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+        sizeof(*shared_archiverStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
+        memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
         goto done;
     }
 
@@ -5068,7 +4884,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
                           fpin) != offsetof(PgStat_StatDBEntry, tables))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5077,21 +4893,23 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 /*
                  * Add to the DB hash
                  */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
+                dbentry = (PgStat_StatDBEntry *)
+                    dshash_find_or_insert(db_stats, (void *) &dbbuf.databaseid,
+                                          &found);
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    dshash_release_lock(db_stats, dbentry);
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
                 memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
+                dbentry->tables = DSM_HANDLE_INVALID;
+                dbentry->functions = DSM_HANDLE_INVALID;
+                dbentry->snapshot_tables = NULL;
+                dbentry->snapshot_functions = NULL;
 
                 /*
                  * In the collector, disregard the timestamp we read from the
@@ -5099,54 +4917,26 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                  * stats file immediately upon the first request from any
                  * backend.
                  */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+                dbentry->stats_timestamp = 0;
 
                 /*
                  * If requested, read the data from the database-specific
                  * file.  Otherwise we just leave the hashtables empty.
                  */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
+                tblstats = dshash_create(area, &dsh_tblparams, 0);
+                dbentry->tables = dshash_get_hash_table_handle(tblstats);
+                /* we don't create the function hash at present */
+                dshash_release_lock(db_stats, dbentry);
+                pgstat_read_db_statsfile(dbentry->databaseid,
+                                         tblstats, funcstats);
+                dshash_detach(tblstats);
                 break;
 
             case 'E':
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5154,36 +4944,62 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     }
 
 done:
+    LWLockRelease(StatsLock);
     FreeFile(fpin);
 
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 
-    return dbhash;
+    return;
 }
 
 
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+void
+StatsShmemInit(void)
+{
+    bool    found;
+
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+
+        /* Load saved data if any */
+        pgstat_read_statsfiles();
+
+        /* needs to be called before dsm shutdown */
+        before_shmem_exit(pgstat_postmaster_shutdown, (Datum) 0);
+    }
+}
+
+static void
+pgstat_postmaster_shutdown(int code, Datum arg)
+{
+    /* we trash the stats on crash */
+    if (code == 0)
+        pgstat_write_statsfiles();
+}
+
 /* ----------
  * pgstat_read_db_statsfile() -
  *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
+ *    Reads in the permanent statistics collector file and creates shared
+ *    statistics tables. The file is removed after reading.
  * ----------
  */
 static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
+pgstat_read_db_statsfile(Oid databaseid,
+                         dshash_table *tabhash, dshash_table *funchash)
 {
     PgStat_StatTabEntry *tabentry;
     PgStat_StatTabEntry tabbuf;
@@ -5194,7 +5010,10 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     bool        found;
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
+    /* should be called from postmaster  */
+    Assert(!IsUnderPostmaster);
+
+    get_dbstat_filename(false, databaseid, statfile, MAXPGPATH);
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
@@ -5208,7 +5027,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
@@ -5221,7 +5040,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
@@ -5241,7 +5060,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
                           fpin) != sizeof(PgStat_StatTabEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5253,19 +5072,21 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (tabhash == NULL)
                     break;
 
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
+                tabentry = (PgStat_StatTabEntry *)
+                    dshash_find_or_insert(tabhash,
+                                          (void *) &tabbuf.tableid, &found);
 
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    dshash_release_lock(tabhash, tabentry);
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
                 memcpy(tabentry, &tabbuf, sizeof(tabbuf));
+                dshash_release_lock(tabhash, tabentry);
                 break;
 
                 /*
@@ -5275,7 +5096,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
                           fpin) != sizeof(PgStat_StatFuncEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5287,19 +5108,20 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (funchash == NULL)
                     break;
 
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
+                funcentry = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert(funchash,
+                                          (void *) &funcbuf.functionid, &found);
 
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
                 memcpy(funcentry, &funcbuf, sizeof(funcbuf));
+                dshash_release_lock(funchash, funcentry);
                 break;
 
                 /*
@@ -5309,7 +5131,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5319,276 +5141,290 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
 done:
     FreeFile(fpin);
 
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 }
 
 /* ----------
- * pgstat_read_db_statsfile_timestamp() -
+ * backend_clean_snapshot_callback() -
  *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
+ *    This is usually called with arg = NULL when the memory context where the
+ *    current snapshot was taken is reset or deleted. Don't bother releasing
+ *    memory in that case.
  * ----------
  */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
+static void
+backend_clean_snapshot_callback(void *arg)
 {
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    if (arg != NULL)
     {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
+        /* explicitly called, so explicitly free resources */
+        if (snapshot_globalStats)
+            pfree(snapshot_globalStats);
 
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
+        if (snapshot_archiverStats)
+            pfree(snapshot_archiverStats);
 
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
+        if (snapshot_db_stats)
         {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
+            HASH_SEQ_STATUS seq;
+            PgStat_StatDBEntry *dbent;
 
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
+            hash_seq_init(&seq, snapshot_db_stats);
+            while ((dbent = hash_seq_search(&seq)) != NULL)
+            {
+                if (dbent->snapshot_tables)
+                    hash_destroy(dbent->snapshot_tables);
+                if (dbent->snapshot_functions)
+                    hash_destroy(dbent->snapshot_functions);
+            }
+            hash_destroy(snapshot_db_stats);
         }
     }
 
-done:
-    FreeFile(fpin);
-    return true;
+    /* mark the resources as not allocated */
+    snapshot_globalStats = NULL;
+    snapshot_archiverStats = NULL;
+    snapshot_db_stats = NULL;
 }
 
 /*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
+ * create_local_stats_hash() -
+ *
+ * Creates a dynahash used for table/function stats cache.
  */
-static void
-backend_read_statsfile(void)
+static HTAB *
+create_local_stats_hash(const char *name, size_t keysize, size_t entrysize,
+                        int nentries)
 {
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
+    HTAB *result;
+    HASHCTL ctl;
 
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
+    /* Create the hash in the stats context */
+    ctl.keysize        = keysize;
+    ctl.entrysize    = entrysize;
+    ctl.hcxt        = stats_cxt;
+    result = hash_create(name, nentries, &ctl,
+                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    return result;
 }
 
+/*
+ * snapshot_statentry() - Find an entry in the source dshash.
+ *
+ * Returns the entry for key or NULL if not found. If dest is not null, uses
+ * *dest as local cache, which is created in the same shape as the given
+ * dshash when *dest is NULL. In both cases the result is cached in the hash
+ * and the same entry is returned to subsequent calls for the same key.
+ *
+ * Otherwise the returned entry is a copy that is palloc'ed in the current
+ * memory context. Its content may differ for every request.
+ *
+ * If dshash is NULL, temporarily attaches dsh_handle instead.
+ */
+static void *
+snapshot_statentry(HTAB **dest, const char *hashname,
+                   dshash_table *dshash, dshash_table_handle dsh_handle,
+                   const dshash_parameters *dsh_params, Oid key)
+{
+    void *lentry = NULL;
+    size_t keysize = dsh_params->key_size;
+    size_t entrysize = dsh_params->entry_size;
+
+    if (dest)
+    {
+        /* caches the result entry */
+        bool found;
+
+        /*
+         * Create new hash with arbitrary initial entries since we don't know
+         * how this hash will grow.
+         */
+        if (!*dest)
+        {
+            Assert(hashname);
+            *dest = create_local_stats_hash(hashname, keysize, entrysize, 32);
+        }
+
+        lentry = hash_search(*dest, &key, HASH_ENTER, &found);
+        if (!found)
+        {
+            dshash_table *t = dshash;
+            void *sentry;
+
+            if (!t)
+                t = dshash_attach(area, dsh_params, dsh_handle, NULL);
+
+            sentry = dshash_find(t, &key, false);
+
+            /*
+             * We expect that the stats for specified database exists in most
+             * cases.
+             */
+            if (!sentry)
+            {
+                hash_search(*dest, &key, HASH_REMOVE, NULL);
+                if (!dshash)
+                    dshash_detach(t);
+                return NULL;
+            }
+            memcpy(lentry, sentry, entrysize);
+            dshash_release_lock(t, sentry);
+
+            if (!dshash)
+                dshash_detach(t);
+        }
+    }
+    else
+    {
+        /*
+         * The caller doesn't want caching. Just make a copy of the entry and
+         * return it.
+         */
+        dshash_table *t = dshash;
+        void *sentry;
+
+        if (!t)
+            t = dshash_attach(area, dsh_params, dsh_handle, NULL);
+
+        sentry = dshash_find(t, &key, false);
+        if (sentry)
+        {
+            lentry = palloc(entrysize);
+            memcpy(lentry, sentry, entrysize);
+            dshash_release_lock(t, sentry);
+        }
+
+        if (!dshash)
+            dshash_detach(t);
+    }
+
+    return lentry;
+}
+
+/*
+ * backend_snapshot_global_stats() -
+ *
+ * Makes a local copy of global stats if not already done.  They will be kept
+ * until pgstat_clear_snapshot() is called or the end of the current memory
+ * context (typically TopTransactionContext).  Returns false if the shared
+ * stats have not been created yet.
+ */
+static bool
+backend_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext = CurrentMemoryContext;
+    MemoryContextCallback *mcxt_cb;
+
+    /* Nothing to do if already done */
+    if (snapshot_globalStats)
+        return true;
+
+    Assert(snapshot_archiverStats == NULL);
+
+    /*
+     * The snapshot lives within the current top transaction if any, or the
+     * current memory context lifetime otherwise.
+     */
+    if (IsTransactionState())
+        oldcontext = MemoryContextSwitchTo(TopTransactionContext);
+
+    /* Remember for stats memory allocation later */
+    stats_cxt = CurrentMemoryContext;
+
+    /* global stats can be just copied  */
+    snapshot_globalStats = palloc(sizeof(PgStat_GlobalStats));
+    memcpy(snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+
+    snapshot_archiverStats = palloc(sizeof(PgStat_ArchiverStats));
+    memcpy(snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+
+    /* set the timestamp of this snapshot */
+    snapshot_globalStats->stats_timestamp = GetCurrentTimestamp();
+
+    /* register callback to clear snapshot */
+    mcxt_cb = (MemoryContextCallback *)palloc(sizeof(MemoryContextCallback));
+    mcxt_cb->func = backend_clean_snapshot_callback;
+    mcxt_cb->arg = NULL;
+    MemoryContextRegisterResetCallback(CurrentMemoryContext, mcxt_cb);
+    MemoryContextSwitchTo(oldcontext);
+
+    return true;
+}
+
+/* ----------
+ * pgstat_fetch_stat_dbentry() -
+ *
+ *    Find database stats entry on backends. The returned entries are cached
+ *    until transaction end. If oneshot is true, they are not cached and are
+ *    returned in palloc'ed memory.
+ */
+PgStat_StatDBEntry *
+pgstat_fetch_stat_dbentry(Oid dbid, bool oneshot)
+{
+    /* take a local snapshot if we don't have one */
+    char *hashname = "local database stats hash";
+    PgStat_StatDBEntry *dbentry;
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    /* If not done for this transaction, take a snapshot of global stats */
+    if (!backend_snapshot_global_stats())
+        return NULL;
+
+    dbentry = snapshot_statentry(oneshot ? NULL : &snapshot_db_stats,
+                                 hashname, db_stats, 0, &dsh_dbparams,
+                                 dbid);
+    
+    return dbentry;
+}
+
+/* ----------
+ * backend_get_tab_entry() -
+ *
+ *    Find table stats entry on backends. The returned entries are cached until
+ *    transaction end. If oneshot is true, they are not cached and are returned
+ *    in palloc'ed memory.
+ */
+PgStat_StatTabEntry *
+backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid reloid, bool oneshot)
+{
+    /* take a local snapshot if we don't have one */
+    char *hashname = "local table stats hash";
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    return snapshot_statentry(oneshot ? NULL : &dbent->snapshot_tables,
+                              hashname, NULL, dbent->tables, &dsh_tblparams,
+                              reloid);
+}
+
+/* ----------
+ * backend_get_func_entry() -
+ *
+ *    Find function stats entry on backends. The returned entries are cached
+ *    until transaction end. If oneshot is true, they are not cached and are
+ *    returned in palloc'ed memory.
+ */
+static PgStat_StatFuncEntry *
+backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot)
+{
+    char *hashname = "local function stats hash";
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    if (dbent->functions == DSM_HANDLE_INVALID)
+        return NULL;
+
+    return snapshot_statentry(oneshot ? NULL : &dbent->snapshot_tables,
+                              hashname, NULL, dbent->functions, &dsh_funcparams,
+                              funcid);
+}
 
 /* ----------
  * pgstat_setup_memcxt() -
@@ -5619,6 +5455,8 @@ pgstat_setup_memcxt(void)
 void
 pgstat_clear_snapshot(void)
 {
+    int param = 0;    /* only the address is significant */
+
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
         MemoryContextDelete(pgStatLocalContext);
@@ -5628,717 +5466,112 @@ pgstat_clear_snapshot(void)
     pgStatDBHash = NULL;
     localBackendStatusTable = NULL;
     localNumBackends = 0;
+
+    /*
+     * The parameter informs the function that it is not being called from
+     * a MemoryContextCallback.
+     */
+    backend_clean_snapshot_callback(&param);
 }
 
 
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
+static bool
+pgstat_update_tabentry(dshash_table *tabhash, PgStat_TableStatus *stat,
+                       bool nowait)
 {
-    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    bool    found;
 
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
+    if (tabhash == NULL)
+        return false;
 
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
+    tabentry = (PgStat_StatTabEntry *)
+        dshash_find_or_insert_extended(tabhash, (void *) &(stat->t_id),
+                                       &found, nowait);
 
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
+    /* failed to acquire lock */
+    if (tabentry == NULL)
+        return false;
+
+    if (!found)
     {
         /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
+         * If it's a new table entry, initialize counters to the values we
+         * just got.
          */
+        tabentry->numscans = stat->t_counts.t_numscans;
+        tabentry->tuples_returned = stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched = stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted = stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated = stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted = stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated = stat->t_counts.t_tuples_hot_updated;
+        tabentry->n_live_tuples = stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples = stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze = stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched = stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit = stat->t_counts.t_blocks_hit;
+
+        tabentry->vacuum_timestamp = 0;
+        tabentry->vacuum_count = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_count = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->analyze_count = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+        tabentry->autovac_analyze_count = 0;
     }
-    else if (msg->clock_time < dbentry->stats_timestamp)
+    else
     {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
         /*
-         * Add per-table stats to the per-database entry, too.
+         * Otherwise add the values to the existing entry.
          */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetshared() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
+        tabentry->numscans += stat->t_counts.t_numscans;
+        tabentry->tuples_returned += stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched += stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted += stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated += stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted += stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated += stat->t_counts.t_tuples_hot_updated;
+        /* If table was truncated, first reset the live/dead counters */
+        if (stat->t_counts.t_truncated)
         {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
+            tabentry->n_live_tuples = 0;
+            tabentry->n_dead_tuples = 0;
         }
+        tabentry->n_live_tuples += stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples += stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze += stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched += stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit += stat->t_counts.t_blocks_hit;
     }
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
+    
+    dshash_release_lock(tabhash, tabentry);
+
+    return true;
 }
 
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
 static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
+pgstat_update_dbentry(PgStat_StatDBEntry *dbentry, PgStat_TableStatus *stat)
 {
     /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
+     * Add per-table stats to the per-database entry, too.
      */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
+    dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
+    dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
+    dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
+    dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
+    dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
+    dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
+    dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
 }
 
+
 /*
  * Convert a potentially unsafely truncated activity string (see
  * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 87d1426500..d4dcc591d0 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -255,7 +255,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -1308,12 +1307,6 @@ PostmasterMain(int argc, char *argv[])
 
     whereToSendOutput = DestNone;
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1766,11 +1759,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2603,8 +2591,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -2935,8 +2921,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3003,13 +2987,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3084,22 +3061,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3558,22 +3519,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3777,8 +3722,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3817,8 +3760,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4019,8 +3961,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -4993,18 +4933,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5117,12 +5045,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 38ae1b9ab8..4e38d5ef6c 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -127,7 +127,7 @@ finish_sync_worker(void)
     if (IsTransactionState())
     {
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(true);
     }
 
     /* And flush all writes. */
@@ -143,6 +143,9 @@ finish_sync_worker(void)
     /* Find the main apply worker and signal it. */
     logicalrep_worker_wakeup(MyLogicalRepWorker->subid, InvalidOid);
 
+    /* clean up retained statistics */
+    pgstat_update_stat(true);
+
     /* Stop gracefully */
     proc_exit(0);
 }
@@ -524,7 +527,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
     if (started_tx)
     {
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(false);
     }
 }
 
@@ -862,7 +865,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
                                            MyLogicalRepWorker->relstate,
                                            MyLogicalRepWorker->relstate_lsn);
                 CommitTransactionCommand();
-                pgstat_report_stat(false);
+                pgstat_update_stat(false);
 
                 /*
                  * We want to do the table data sync in a single transaction.
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 8d5e0946c4..a51c0de699 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -492,7 +492,7 @@ apply_handle_commit(StringInfo s)
         replorigin_session_origin_timestamp = commit_data.committime;
 
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(false);
 
         store_flush_position(commit_data.end_lsn);
     }
@@ -1326,6 +1326,8 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
             }
 
             send_feedback(last_received, requestReply, requestReply);
+
+            pgstat_update_stat(false);
         }
     }
 }
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 01eabe5706..e794a81c4c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1984,7 +1984,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                BgWriterStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2092,7 +2092,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2282,7 +2282,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2290,7 +2290,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
 
diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index edee89c116..18e73b0288 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -197,6 +197,15 @@ dsm_postmaster_startup(PGShmemHeader *shim)
     dsm_control->maxitems = maxitems;
 }
 
+/*
+ * Clear the dsm_init_done flag on child process start.
+ */
+void
+dsm_child_init(void)
+{
+    dsm_init_done = false;
+}
+
 /*
  * Determine whether the control segment from the previous postmaster
  * invocation still exists.  If so, remove the dynamic shared memory
@@ -423,6 +432,15 @@ dsm_set_control_handle(dsm_handle h)
 }
 #endif
 
+/*
+ * Return whether the DSM feature is available in this process.
+ */
+bool
+dsm_is_available(void)
+{
+    return dsm_control != NULL;
+}
+
 /*
  * Create a new dynamic shared memory segment.
  *
@@ -440,8 +458,7 @@ dsm_create(Size size, int flags)
     uint32        i;
     uint32        nitems;
 
-    /* Unsafe in postmaster (and pointless in a stand-alone backend). */
-    Assert(IsUnderPostmaster);
+    Assert(dsm_is_available());
 
     if (!dsm_init_done)
         dsm_backend_startup();
@@ -537,8 +554,7 @@ dsm_attach(dsm_handle h)
     uint32        i;
     uint32        nitems;
 
-    /* Unsafe in postmaster (and pointless in a stand-alone backend). */
-    Assert(IsUnderPostmaster);
+    Assert(dsm_is_available());
 
     if (!dsm_init_done)
         dsm_backend_startup();
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 0c86a581c0..cce6d3ffa2 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -150,6 +150,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
         size = add_size(size, BackendRandomShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -282,8 +283,13 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 
     /* Initialize dynamic shared memory facilities. */
     if (!IsUnderPostmaster)
+    {
         dsm_postmaster_startup(shim);
 
+        /* Stats collector uses dynamic shared memory */
+        StatsShmemInit();
+    }
+
     /*
      * Now give loadable modules a chance to set up their shmem allocations
      */
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index a6fda81feb..c46bb8d057 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -521,6 +521,9 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_TBM, "tbm");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
+    LWLockRegisterTranche(LWTRANCHE_STATS_DSA, "stats table dsa");
+    LWLockRegisterTranche(LWTRANCHE_STATS_DB, "db stats");
+    LWLockRegisterTranche(LWTRANCHE_STATS_FUNC_TABLE, "table/func stats");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index e6025ecedb..798af9f168 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -50,3 +50,4 @@ OldSnapshotTimeMapLock                42
 BackendRandomLock                    43
 LogicalRepWorkerLock                44
 CLogTruncationLock                    45
+StatsLock                            46
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index a3b9757565..ee4e43331b 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3146,6 +3146,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_update_stat(true);
+    }
 }
 
 
@@ -3720,6 +3726,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4160,9 +4167,17 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
-                ProcessCompletedNotifies();
-                pgstat_report_stat(false);
+                long stats_timeout;
 
+                ProcessCompletedNotifies();
+
+                stats_timeout = pgstat_update_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4197,7 +4212,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4205,6 +4220,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index f955f1912a..6e144d5aad 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -33,7 +33,7 @@
 #define UINT32_ACCESS_ONCE(var)         ((uint32)(*((volatile uint32 *)&(var))))
 
 /* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
+extern PgStat_BgWriter bgwriterStats;
 
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
@@ -1176,7 +1176,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_xact_commit);
@@ -1192,7 +1192,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_xact_rollback);
@@ -1208,7 +1208,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_blocks_fetched);
@@ -1224,7 +1224,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_blocks_hit);
@@ -1240,7 +1240,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_returned);
@@ -1256,7 +1256,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_fetched);
@@ -1272,7 +1272,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_inserted);
@@ -1288,7 +1288,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_updated);
@@ -1304,7 +1304,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_deleted);
@@ -1319,7 +1319,7 @@ pg_stat_get_db_stat_reset_time(PG_FUNCTION_ARGS)
     TimestampTz result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = dbentry->stat_reset_timestamp;
@@ -1337,7 +1337,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = dbentry->n_temp_files;
@@ -1353,7 +1353,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = dbentry->n_temp_bytes;
@@ -1368,7 +1368,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_tablespace);
@@ -1383,7 +1383,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_lock);
@@ -1398,7 +1398,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_snapshot);
@@ -1413,7 +1413,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_bufferpin);
@@ -1428,7 +1428,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_startup_deadlock);
@@ -1443,7 +1443,7 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (
@@ -1463,7 +1463,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_deadlocks);
@@ -1479,7 +1479,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     PgStat_StatDBEntry *dbentry;
 
     /* convert counter from microsec to millisec for display */
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = ((double) dbentry->n_block_read_time) / 1000.0;
@@ -1495,7 +1495,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     PgStat_StatDBEntry *dbentry;
 
     /* convert counter from microsec to millisec for display */
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = ((double) dbentry->n_block_write_time) / 1000.0;
@@ -1850,6 +1850,9 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     /* Get statistics about the archiver process */
     archiver_stats = pgstat_fetch_stat_archiver();
 
+    if (archiver_stats == NULL)
+        PG_RETURN_NULL();
+
     /* Fill values and NULLs */
     values[0] = Int64GetDatum(archiver_stats->archived_count);
     if (*(archiver_stats->last_archived_wal) == '\0')
@@ -1879,6 +1882,5 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
         values[6] = TimestampTzGetDatum(archiver_stats->stat_reset_timestamp);
 
     /* Returns the record as Datum */
-    PG_RETURN_DATUM(HeapTupleGetDatum(
-                                      heap_form_tuple(tupdesc, values, nulls)));
+    PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
 }
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index c6939779b9..1377bbbbdb 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index b636b1e262..701bfed727 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -72,6 +72,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -628,6 +629,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1239,6 +1242,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 3e1c3863c4..25b3b2a079 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 3026f95728..91c3fb1a0a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
@@ -403,7 +404,6 @@ typedef enum
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
-
     NUM_AUXPROCTYPES            /* Must be last! */
 } AuxProcType;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 33c7372f00..a6f7cd44ab 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -13,6 +13,7 @@
 
 #include "datatype/timestamp.h"
 #include "fmgr.h"
+#include "lib/dshash.h"
 #include "libpq/pqcomm.h"
 #include "port/atomics.h"
 #include "portability/instr_time.h"
@@ -41,32 +42,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -115,13 +90,6 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_blocks_hit;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -180,271 +148,23 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
 /* ----------
- * PgStat_MsgHdr                The common message header
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgHdr
+typedef struct PgStat_BgWriter
 {
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter buf_alloc;
+    PgStat_Counter checkpoint_write_time; /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+} PgStat_BgWriter;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
@@ -485,79 +205,6 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgAutovacStart msg_autovacuum;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
  * Statistic collector data structures follow
  *
@@ -601,10 +248,13 @@ typedef struct PgStat_StatDBEntry
 
     /*
      * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * the handles and pointers out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables;
+    dshash_table_handle functions;
+    /* for snapshot tables */
+    HTAB *snapshot_tables;
+    HTAB *snapshot_functions;
 } PgStat_StatDBEntry;
 
 
@@ -1136,13 +786,15 @@ extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used, but will be removed with GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1154,34 +806,20 @@ extern PgStat_Counter pgStatBlockWriteTime;
  * Functions called from postmaster
  * ----------
  */
-extern Size BackendStatusShmemSize(void);
-extern void CreateSharedBackendStatus(void);
-
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
-
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_update_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
-extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
-extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
-
+extern void pgstat_reset_shared_counters(const char *target);
+extern void pgstat_reset_single_counter(Oid objectid,
+                                        PgStat_Single_Reset_Type type);
 extern void pgstat_report_autovac(Oid dboid);
 extern void pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples);
@@ -1192,6 +830,8 @@ extern void pgstat_report_analyze(Relation rel,
 extern void pgstat_report_recovery_conflict(int reason);
 extern void pgstat_report_deadlock(void);
 
+extern void pgstat_clear_snapshot(void);
+
 extern void pgstat_initialize(void);
 extern void pgstat_bestart(void);
 
@@ -1219,6 +859,9 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 extern void pgstat_initstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
+extern PgStat_StatDBEntry *backend_get_db_entry(Oid dbid, bool oneshot);
+extern HTAB *backend_snapshot_all_db_entries(void);
+extern PgStat_StatTabEntry *backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid relid, bool oneshot);
 
 /* ----------
  * pgstat_report_wait_start() -
@@ -1338,15 +981,15 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                           void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
+extern void pgstat_update_archiver(const char *xlog, bool failed);
+extern void pgstat_update_bgwriter(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
  * generate the pgstat* views.
  * ----------
  */
-extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
+extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid relid, bool oneshot);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
@@ -1355,4 +998,14 @@ extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
 
+/* File input/output functions  */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
+
+/* For shared memory allocation/initialize */
+extern Size BackendStatusShmemSize(void);
+extern void CreateSharedBackendStatus(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/dsm.h b/src/include/storage/dsm.h
index b4654cb5ca..379f0bc5c0 100644
--- a/src/include/storage/dsm.h
+++ b/src/include/storage/dsm.h
@@ -26,6 +26,7 @@ typedef struct dsm_segment dsm_segment;
 struct PGShmemHeader;            /* avoid including pg_shmem.h */
 extern void dsm_cleanup_using_control_segment(dsm_handle old_control_handle);
 extern void dsm_postmaster_startup(struct PGShmemHeader *);
+extern void dsm_child_init(void);
 extern void dsm_backend_shutdown(void);
 extern void dsm_detach_all(void);
 
@@ -33,6 +34,8 @@ extern void dsm_detach_all(void);
 extern void dsm_set_control_handle(dsm_handle h);
 #endif
 
+extern bool dsm_is_available(void);
+
 /* Functions that create or remove mappings. */
 extern dsm_segment *dsm_create(Size size, int flags);
 extern dsm_segment *dsm_attach(dsm_handle h);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index b2dcb73287..4cb628b15f 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -219,6 +219,9 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SHARED_TUPLESTORE,
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
+    LWTRANCHE_STATS_DSA,
+    LWTRANCHE_STATS_DB,
+    LWTRANCHE_STATS_FUNC_TABLE,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index dcc7307c16..b8a56645b6 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
diff --git a/src/test/modules/worker_spi/worker_spi.c b/src/test/modules/worker_spi/worker_spi.c
index 838133a208..0c6fabad49 100644
--- a/src/test/modules/worker_spi/worker_spi.c
+++ b/src/test/modules/worker_spi/worker_spi.c
@@ -290,7 +290,7 @@ worker_spi_main(Datum main_arg)
         SPI_finish();
         PopActiveSnapshot();
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(false);
         pgstat_report_activity(STATE_IDLE, NULL);
     }
 
-- 
2.16.3
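
A note on the pgstat.h hunk above: PgStat_StatDBEntry keeps `tables` and
`functions` (and now the snapshot pointers) at the end of the struct because
those members are not written to the stats file. One common way to honor that
constraint is to serialize only the bytes that precede the first such member,
using offsetof(). The following is a minimal sketch of that idea, not code from
the patch: DemoDBEntry, demo_serialize and demo_deserialize are hypothetical
names, and the handle and pointer types are plain stand-ins for
dshash_table_handle and HTAB *.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for PgStat_StatDBEntry; fields are illustrative. */
typedef struct DemoDBEntry
{
    unsigned int  databaseid;
    long          n_xact_commit;
    long          n_xact_rollback;

    /* Everything from here on is process-local and is never serialized. */
    unsigned long tables;               /* stand-in for dshash_table_handle */
    unsigned long functions;
    void         *snapshot_tables;      /* stand-in for HTAB * */
    void         *snapshot_functions;
} DemoDBEntry;

/* Number of leading bytes that are safe to persist. */
#define DEMO_PERSISTED_SIZE offsetof(DemoDBEntry, tables)

/* Copy the persistent prefix into buf; returns bytes written, 0 if too small. */
static size_t
demo_serialize(const DemoDBEntry *entry, unsigned char *buf, size_t buflen)
{
    assert(entry != NULL && buf != NULL);
    if (buflen < DEMO_PERSISTED_SIZE)
        return 0;
    memcpy(buf, entry, DEMO_PERSISTED_SIZE);
    return DEMO_PERSISTED_SIZE;
}

/* Rebuild an entry; handles and pointers come back zeroed, never stale. */
static int
demo_deserialize(DemoDBEntry *entry, const unsigned char *buf, size_t len)
{
    assert(entry != NULL && buf != NULL);
    if (len != DEMO_PERSISTED_SIZE)
        return -1;
    memset(entry, 0, sizeof(*entry));
    memcpy(entry, buf, DEMO_PERSISTED_SIZE);
    return 0;
}
```

With this layout, adding another non-persistent member is safe only if it goes
after `tables`; anything placed before it becomes part of the on-disk format.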

From d8489d5da5bc06c34661ff461064b25fce8958a0 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 27 Nov 2018 14:42:12 +0900
Subject: [PATCH 5/7] Remove the GUC stats_temp_directory

The GUC used to specify the directory in which temporary statistics
files are stored. It is no longer needed by the stats collector but is
still referenced by programs in bin and contrib, and possibly by other
extensions. Thus this patch removes the GUC, but some backing variables
and macro definitions are left in place for backward compatibility.
---
 src/backend/postmaster/pgstat.c               | 12 +++-----
 src/backend/replication/basebackup.c          | 13 ++-------
 src/backend/utils/misc/guc.c                  | 41 ---------------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/pgstat.h                          |  5 +++-
 5 files changed, 11 insertions(+), 61 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index e6c7869f5f..b66a246182 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -89,15 +89,11 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
+/*
+ * This was a GUC parameter and is no longer used in this file. It is kept,
+ * holding the default value, only for backward compatibility with extensions.
  */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+char       *pgstat_stat_directory = PG_STAT_TMP_DIR;
 
 /* Shared stats bootstrap infomation */
 typedef struct StatsShmemStruct {
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index a7e3db2783..bcb93d1613 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -223,11 +223,8 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -258,13 +255,9 @@ perform_base_backup(basebackup_options *opt)
          * Calculate the relative path of temporary statistics directory in
          * order to skip the files which are located in that directory later.
          */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
+
+        Assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
+        statrelpath = psprintf("./%s", PG_STAT_TMP_DIR);
 
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 6497393c03..34bf9f1419 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -189,7 +189,6 @@ static bool check_autovacuum_max_workers(int *newval, void **extra, GucSource so
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static void assign_effective_io_concurrency(int newval, void *extra);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -3938,17 +3937,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -10931,35 +10919,6 @@ assign_effective_io_concurrency(int newval, void *extra)
 #endif                            /* USE_PREFETCH */
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ee9ec6a120..27108471da 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -550,7 +550,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index a6f7cd44ab..386f8040fe 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -31,7 +31,10 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
+/*
+ * This used to be the directory to store temporary statistics data in but is
+ * no longer used. Defined here for backward compatibility.
+ */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
 
 /* Values for track_functions GUC variable --- order is significant! */
-- 
2.16.3
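
A note on the basebackup.c hunk above: once the directory is a compile-time
constant, the relative path to skip reduces to "./" followed by
PG_STAT_TMP_DIR, guarded by an assertion that the constant contains no slash.
The following is a minimal standalone sketch of that logic, not code from the
patch: build_stat_relpath is a hypothetical helper, and snprintf into a
caller-supplied buffer stands in for the backend's psprintf.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Same value as the macro in pgstat.h; redefined here to stay standalone. */
#define PG_STAT_TMP_DIR "pg_stat_tmp"

/*
 * Build "./<dir>" into buf.  Returns 0 on success, or -1 if the result
 * (including the terminating NUL) does not fit in buflen bytes.
 */
static int
build_stat_relpath(char *buf, size_t buflen, const char *dir)
{
    int n;

    /* The directory must be a single path component, as the patch asserts. */
    assert(strchr(dir, '/') == NULL);

    n = snprintf(buf, buflen, "./%s", dir);
    if (n < 0 || (size_t) n >= buflen)
        return -1;
    return 0;
}
```

The assertion documents the assumption that made the old three-way branch
(absolute path, relative path without "./", relative path with "./")
unnecessary.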

From 7fff9db43c3db022da76fb8ba35e76e32b6c298b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 9 Nov 2018 15:48:49 +0900
Subject: [PATCH 6/7] Split out backend status monitor part from pgstat

A large file, pgstat.c, contained two major facilities: the backend
status monitor and the database usage monitor. Split out the former
from the file and name the module "bestatus". The names of individual
functions are left alone except for a few that conflict.
---
 contrib/pg_prewarm/autoprewarm.c                   |    2 +-
 contrib/pg_stat_statements/pg_stat_statements.c    |    1 +
 contrib/postgres_fdw/connection.c                  |    2 +-
 src/backend/Makefile                               |    2 +-
 src/backend/access/heap/rewriteheap.c              |    4 +-
 src/backend/access/nbtree/nbtree.c                 |    2 +-
 src/backend/access/nbtree/nbtsort.c                |    2 +-
 src/backend/access/transam/clog.c                  |    2 +-
 src/backend/access/transam/parallel.c              |    2 +-
 src/backend/access/transam/slru.c                  |    2 +-
 src/backend/access/transam/timeline.c              |    2 +-
 src/backend/access/transam/twophase.c              |    1 +
 src/backend/access/transam/xact.c                  |    1 +
 src/backend/access/transam/xlog.c                  |    1 +
 src/backend/access/transam/xlogfuncs.c             |    2 +-
 src/backend/access/transam/xlogutils.c             |    2 +-
 src/backend/bootstrap/bootstrap.c                  |    8 +-
 src/backend/commands/vacuumlazy.c                  |    1 +
 src/backend/executor/execParallel.c                |    2 +-
 src/backend/executor/nodeBitmapHeapscan.c          |    1 +
 src/backend/executor/nodeGather.c                  |    2 +-
 src/backend/executor/nodeHash.c                    |    2 +-
 src/backend/executor/nodeHashjoin.c                |    2 +-
 src/backend/libpq/be-secure-openssl.c              |    2 +-
 src/backend/libpq/be-secure.c                      |    2 +-
 src/backend/libpq/pqmq.c                           |    2 +-
 src/backend/postmaster/Makefile                    |    2 +-
 src/backend/postmaster/autovacuum.c                |    1 +
 src/backend/postmaster/bgworker.c                  |    2 +-
 src/backend/postmaster/bgwriter.c                  |    1 +
 src/backend/postmaster/checkpointer.c              |    1 +
 src/backend/postmaster/pgarch.c                    |    1 +
 src/backend/postmaster/postmaster.c                |    1 +
 src/backend/postmaster/syslogger.c                 |    2 +-
 src/backend/postmaster/walwriter.c                 |    2 +-
 src/backend/replication/basebackup.c               |    1 +
 .../libpqwalreceiver/libpqwalreceiver.c            |    2 +-
 src/backend/replication/logical/launcher.c         |    2 +-
 src/backend/replication/logical/origin.c           |    3 +-
 src/backend/replication/logical/reorderbuffer.c    |    2 +-
 src/backend/replication/logical/snapbuild.c        |    2 +-
 src/backend/replication/logical/tablesync.c        |    6 +-
 src/backend/replication/logical/worker.c           |   11 +-
 src/backend/replication/slot.c                     |    2 +-
 src/backend/replication/syncrep.c                  |    2 +-
 src/backend/replication/walreceiver.c              |    2 +-
 src/backend/replication/walsender.c                |    2 +-
 src/backend/statmon/Makefile                       |   17 +
 src/backend/statmon/bestatus.c                     | 1756 ++++++++++++++++++++
 src/backend/{postmaster => statmon}/pgstat.c       | 1727 +------------------
 src/backend/storage/buffer/bufmgr.c                |    1 +
 src/backend/storage/file/buffile.c                 |    2 +-
 src/backend/storage/file/copydir.c                 |    2 +-
 src/backend/storage/file/fd.c                      |    1 +
 src/backend/storage/ipc/dsm_impl.c                 |    2 +-
 src/backend/storage/ipc/latch.c                    |    2 +-
 src/backend/storage/ipc/procarray.c                |    2 +-
 src/backend/storage/ipc/shm_mq.c                   |    2 +-
 src/backend/storage/ipc/standby.c                  |    2 +-
 src/backend/storage/lmgr/deadlock.c                |    1 +
 src/backend/storage/lmgr/lwlock.c                  |    2 +-
 src/backend/storage/lmgr/predicate.c               |    2 +-
 src/backend/storage/lmgr/proc.c                    |    2 +-
 src/backend/storage/smgr/md.c                      |    2 +-
 src/backend/tcop/postgres.c                        |    1 +
 src/backend/utils/adt/misc.c                       |    2 +-
 src/backend/utils/adt/pgstatfuncs.c                |    1 +
 src/backend/utils/cache/relmapper.c                |    2 +-
 src/backend/utils/init/miscinit.c                  |    2 +-
 src/backend/utils/init/postinit.c                  |    4 +
 src/backend/utils/misc/guc.c                       |    1 +
 src/include/bestatus.h                             |  544 ++++++
 src/include/pgstat.h                               |  514 +-----
 73 files changed, 2441 insertions(+), 2259 deletions(-)
 create mode 100644 src/backend/statmon/Makefile
 create mode 100644 src/backend/statmon/bestatus.c
 rename src/backend/{postmaster => statmon}/pgstat.c (70%)
 create mode 100644 src/include/bestatus.h

diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index 033ad36477..fcb658c3dd 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -30,10 +30,10 @@
 
 #include "access/heapam.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "catalog/pg_class.h"
 #include "catalog/pg_type.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/bgworker.h"
 #include "storage/buf_internals.h"
 #include "storage/dsm.h"
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 33f9a79f54..fbfe062405 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -62,6 +62,7 @@
 #include <unistd.h>
 
 #include "access/hash.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index a6509932dc..ea2ceab312 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -15,11 +15,11 @@
 #include "postgres_fdw.h"
 
 #include "access/htup_details.h"
+#include "bestatus.h"
 #include "catalog/pg_user_mapping.h"
 #include "access/xact.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/latch.h"
 #include "utils/hsearch.h"
 #include "utils/inval.h"
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 25eb043941..888bf1cab8 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -20,7 +20,7 @@ include $(top_builddir)/src/Makefile.global
 SUBDIRS = access bootstrap catalog parser commands executor foreign lib libpq \
     main nodes optimizer partitioning port postmaster \
     regex replication rewrite \
-    statistics storage tcop tsearch utils $(top_builddir)/src/timezone \
+    statistics statmon storage tcop tsearch utils $(top_builddir)/src/timezone \
     jit
 
 include $(srcdir)/common.mk
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index d5bd282f8c..b293a8ef86 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -115,12 +115,12 @@
 #include "access/xact.h"
 #include "access/xloginsert.h"
 
+#include "bestatus.h"
+
 #include "catalog/catalog.h"
 
 #include "lib/ilist.h"
 
-#include "pgstat.h"
-
 #include "replication/logical.h"
 #include "replication/slot.h"
 
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index e8725fbbe1..6679dbc3a5 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -22,10 +22,10 @@
 #include "access/nbtxlog.h"
 #include "access/relscan.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
-#include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "storage/condition_variable.h"
 #include "storage/indexfsm.h"
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 16f5755777..14d183b0da 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -63,9 +63,9 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "bestatus.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/smgr.h"
 #include "tcop/tcopprot.h"        /* pgrminclude ignore */
 #include "utils/rel.h"
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 8b7ff5b0c2..9971bfe4f2 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -38,8 +38,8 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "pg_trace.h"
 #include "storage/proc.h"
 
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index b9a9ae5c73..199e8ebf24 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -19,6 +19,7 @@
 #include "access/session.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/pg_enum.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
@@ -29,7 +30,6 @@
 #include "libpq/pqmq.h"
 #include "miscadmin.h"
 #include "optimizer/planmain.h"
-#include "pgstat.h"
 #include "storage/ipc.h"
 #include "storage/sinval.h"
 #include "storage/spin.h"
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index fad5d363e3..365531f5bd 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -54,7 +54,7 @@
 #include "access/slru.h"
 #include "access/transam.h"
 #include "access/xlog.h"
-#include "pgstat.h"
+#include "bestatus.h"
 #include "storage/fd.h"
 #include "storage/shmem.h"
 #include "miscadmin.h"
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 70eec5676e..0438b9b16a 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -38,7 +38,7 @@
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogdefs.h"
-#include "pgstat.h"
+#include "bestatus.h"
 #include "storage/fd.h"
 
 /*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index e65dccc6a2..ab3e9c272a 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -87,6 +87,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "access/xlogreader.h"
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "catalog/storage.h"
 #include "funcapi.h"
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d967400384..f3a9b8e3ab 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -30,6 +30,7 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
 #include "catalog/storage.h"
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 307010414c..f3138cbd8a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -36,6 +36,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
 #include "catalog/pg_database.h"
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index f139eeff9f..35e42155d8 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -23,9 +23,9 @@
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
 #include "catalog/pg_type.h"
+#include "bestatus.h"
 #include "funcapi.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/walreceiver.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 4ecdc9220f..b739f650d6 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -23,8 +23,8 @@
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/smgr.h"
 #include "utils/guc.h"
 #include "utils/hsearch.h"
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 2d936b593d..11a854970d 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -20,6 +20,7 @@
 #include "access/htup_details.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/index.h"
 #include "catalog/pg_collation.h"
@@ -327,9 +328,6 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case BgWriterProcess:
                 statmsg = pgstat_get_backend_desc(B_BG_WRITER);
                 break;
-            case ArchiverProcess:
-                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
-                break;
             case CheckpointerProcess:
                 statmsg = pgstat_get_backend_desc(B_CHECKPOINTER);
                 break;
@@ -339,6 +337,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case WalReceiverProcess:
                 statmsg = pgstat_get_backend_desc(B_WAL_RECEIVER);
                 break;
+            case ArchiverProcess:
+                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
+                break;
             default:
                 statmsg = "??? process";
                 break;
@@ -415,6 +416,7 @@ AuxiliaryProcessMain(int argc, char *argv[])
         CreateAuxProcessResourceOwner();
 
         /* Initialize backend status information */
+        pgstat_bearray_initialize();
         pgstat_initialize();
         pgstat_bestart();
 
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 8134c52253..28b890d8c3 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -44,6 +44,7 @@
 #include "access/transam.h"
 #include "access/visibilitymap.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/storage.h"
 #include "commands/dbcommands.h"
 #include "commands/progress.h"
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 13ef232d39..a99ea3dbfe 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -48,7 +48,7 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
-#include "pgstat.h"
+#include "bestatus.h"
 
 /*
  * Magic numbers for parallel executor communication.  We use constants
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 00d02fd50f..1547b111c4 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -40,6 +40,7 @@
 #include "access/relscan.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "bestatus.h"
 #include "executor/execdebug.h"
 #include "executor/nodeBitmapHeapscan.h"
 #include "miscadmin.h"
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index e6367ade76..66cf577bd6 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -32,6 +32,7 @@
 
 #include "access/relscan.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "executor/execdebug.h"
 #include "executor/execParallel.h"
 #include "executor/nodeGather.h"
@@ -39,7 +40,6 @@
 #include "executor/tqueue.h"
 #include "miscadmin.h"
 #include "optimizer/planmain.h"
-#include "pgstat.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index ba2f6686cf..1545ed4c57 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -28,6 +28,7 @@
 
 #include "access/htup_details.h"
 #include "access/parallel.h"
+#include "bestatus.h"
 #include "catalog/pg_statistic.h"
 #include "commands/tablespace.h"
 #include "executor/execdebug.h"
@@ -35,7 +36,6 @@
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "port/atomics.h"
 #include "utils/dynahash.h"
 #include "utils/memutils.h"
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index c2c8beffc1..9747c6afc1 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -108,12 +108,12 @@
 
 #include "access/htup_details.h"
 #include "access/parallel.h"
+#include "bestatus.h"
 #include "executor/executor.h"
 #include "executor/hashjoin.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "utils/memutils.h"
 #include "utils/sharedtuplestore.h"
 
diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c
index 6955d7230c..5a2ff87bbd 100644
--- a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c
@@ -36,9 +36,9 @@
 #include <openssl/ec.h>
 #endif
 
+#include "bestatus.h"
 #include "libpq/libpq.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/latch.h"
 #include "tcop/tcopprot.h"
diff --git a/src/backend/libpq/be-secure.c b/src/backend/libpq/be-secure.c
index 7cfafb5908..1be1774556 100644
--- a/src/backend/libpq/be-secure.c
+++ b/src/backend/libpq/be-secure.c
@@ -29,9 +29,9 @@
 #include <arpa/inet.h>
 #endif
 
+#include "bestatus.h"
 #include "libpq/libpq.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "tcop/tcopprot.h"
 #include "utils/memutils.h"
 #include "storage/ipc.h"
diff --git a/src/backend/libpq/pqmq.c b/src/backend/libpq/pqmq.c
index 603d9016fd..9c4b81c3f9 100644
--- a/src/backend/libpq/pqmq.c
+++ b/src/backend/libpq/pqmq.c
@@ -13,11 +13,11 @@
 
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
 #include "libpq/pqmq.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c23211b2..311e63017d 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
-    pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+    pgarch.o postmaster.o startup.o syslogger.o walwriter.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index c0ad141715..540c70a518 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -71,6 +71,7 @@
 #include "access/reloptions.h"
 #include "access/transam.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "catalog/dependency.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index d2b695e146..01eaa187ff 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -16,8 +16,8 @@
 
 #include "libpq/pqsignal.h"
 #include "access/parallel.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "port/atomics.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/postmaster.h"
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index d6379e0773..9712d0801e 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -40,6 +40,7 @@
 
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 0378c31d2f..130a9c842d 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -43,6 +43,7 @@
 
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 34979669a8..56e33dae8d 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -34,6 +34,7 @@
 
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index d4dcc591d0..a893b6a9d4 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -95,6 +95,7 @@
 
 #include "access/transam.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/pg_control.h"
 #include "common/file_perm.h"
diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index fbeee31109..0e1cecb05d 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -31,11 +31,11 @@
 #include <sys/stat.h>
 #include <sys/time.h>
 
+#include "bestatus.h"
 #include "lib/stringinfo.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "nodes/pg_list.h"
-#include "pgstat.h"
 #include "pgtime.h"
 #include "postmaster/fork_process.h"
 #include "postmaster/postmaster.h"
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 0ae733e886..67503e32b7 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -45,9 +45,9 @@
 #include <unistd.h>
 
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/walwriter.h"
 #include "storage/bufmgr.h"
 #include "storage/condition_variable.h"
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index bcb93d1613..5eddb2eb95 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -17,6 +17,7 @@
 #include <time.h>
 
 #include "access/xlog_internal.h"    /* for pg_start/stop_backup */
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "common/file_perm.h"
 #include "lib/stringinfo.h"
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 9b75711ebd..4c28d30641 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -22,11 +22,11 @@
 #include "libpq-fe.h"
 #include "pqexpbuffer.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "funcapi.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/walreceiver.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 3a84d8ca86..be1e5b8de0 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -19,7 +19,7 @@
 
 #include "funcapi.h"
 #include "miscadmin.h"
-#include "pgstat.h"
+#include "bestatus.h"
 
 #include "access/heapam.h"
 #include "access/htup.h"
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index bf97dcdee4..a60ef0a9f1 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -77,13 +77,12 @@
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/xact.h"
-
+#include "bestatus.h"
 #include "catalog/indexing.h"
 #include "nodes/execnodes.h"
 
 #include "replication/origin.h"
 #include "replication/logical.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index bed63c768e..7af1e6b6b1 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -61,10 +61,10 @@
 #include "access/tuptoaster.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "lib/binaryheap.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/logical.h"
 #include "replication/reorderbuffer.h"
 #include "replication/slot.h"
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 363ddf4505..c9fa5baf2e 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -126,7 +126,7 @@
 #include "access/transam.h"
 #include "access/xact.h"
 
-#include "pgstat.h"
+#include "bestatus.h"
 
 #include "replication/logical.h"
 #include "replication/reorderbuffer.h"
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 4e38d5ef6c..80bce89104 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -86,25 +86,27 @@
 #include "postgres.h"
 
 #include "miscadmin.h"
-#include "pgstat.h"
 
 #include "access/xact.h"
 
+#include "bestatus.h"
+
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 
 #include "commands/copy.h"
 
 #include "parser/parse_relation.h"
+#include "pgstat.h"
 
 #include "replication/logicallauncher.h"
 #include "replication/logicalrelation.h"
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 
-#include "utils/snapmgr.h"
 #include "storage/ipc.h"
 
+#include "utils/snapmgr.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index a51c0de699..f4058ed619 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -23,13 +23,11 @@
 
 #include "postgres.h"
 
-#include "miscadmin.h"
-#include "pgstat.h"
-#include "funcapi.h"
-
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 
+#include "bestatus.h"
+
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_subscription.h"
@@ -41,17 +39,20 @@
 #include "executor/executor.h"
 #include "executor/nodeModifyTable.h"
 
+#include "funcapi.h"
+
 #include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
 
 #include "mb/pg_wchar.h"
+#include "miscadmin.h"
 
 #include "nodes/makefuncs.h"
 
 #include "optimizer/planner.h"
 
 #include "parser/parse_relation.h"
-
+#include "pgstat.h"
 #include "postmaster/bgworker.h"
 #include "postmaster/postmaster.h"
 #include "postmaster/walwriter.h"
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 1f2e7139a7..1620313c55 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -41,9 +41,9 @@
 
 #include "access/transam.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "common/string.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/slot.h"
 #include "storage/fd.h"
 #include "storage/proc.h"
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 9a13c50ce8..277460efe7 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -75,8 +75,8 @@
 #include <unistd.h>
 
 #include "access/xact.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/syncrep.h"
 #include "replication/walsender.h"
 #include "replication/walsender_private.h"
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 9643c2ed7b..f1447e5531 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -50,6 +50,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
@@ -57,7 +58,6 @@
 #include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "storage/ipc.h"
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 46edb525e8..284b3ce8e3 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -56,6 +56,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
 
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "commands/dbcommands.h"
@@ -65,7 +66,6 @@
 #include "libpq/pqformat.h"
 #include "miscadmin.h"
 #include "nodes/replnodes.h"
-#include "pgstat.h"
 #include "replication/basebackup.h"
 #include "replication/decode.h"
 #include "replication/logical.h"
diff --git a/src/backend/statmon/Makefile b/src/backend/statmon/Makefile
new file mode 100644
index 0000000000..64a04878e3
--- /dev/null
+++ b/src/backend/statmon/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for src/backend/statmon
+#
+# IDENTIFICATION
+#    src/backend/statmon/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/statmon
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = pgstat.o bestatus.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/statmon/bestatus.c b/src/backend/statmon/bestatus.c
new file mode 100644
index 0000000000..1ea4f80a58
--- /dev/null
+++ b/src/backend/statmon/bestatus.c
@@ -0,0 +1,1756 @@
+/* ----------
+ * bestatus.c
+ *
+ *    Backend status monitor
+ *
+ *    Status data is stored in shared memory. Every backend updates and reads it
+ *    individually.
+ *
+ *    Copyright (c) 2001-2018, PostgreSQL Global Development Group
+ *
+ *    src/backend/statmon/bestatus.c
+ * ----------
+ */
+#include "postgres.h"
+
+#include "bestatus.h"
+
+#include "access/xact.h"
+#include "libpq/libpq.h"
+#include "miscadmin.h"
+#include "postmaster/autovacuum.h"
+#include "replication/walsender.h"
+#include "storage/ipc.h"
+#include "storage/lmgr.h"
+#include "storage/sinvaladt.h"
+#include "utils/ascii.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/probes.h"
+
+
+/* Status for backends including auxiliary */
+static LocalPgBackendStatus *localBackendStatusTable = NULL;
+
+/* Total number of backends including auxiliary */
+static int    localNumBackends = 0;
+
+/* ----------
+ * Total number of backends including auxiliary
+ *
+ * We reserve a slot for each possible BackendId, plus one for each
+ * possible auxiliary process type.  (This scheme assumes there is not
+ * more than one of any auxiliary process type at a time.) MaxBackends
+ * includes autovacuum workers and background workers as well.
+ * ----------
+ */
+#define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
+
+
+/* ----------
+ * GUC parameters
+ * ----------
+ */
+bool        pgstat_track_activities = false;
+int            pgstat_track_activity_query_size = 1024;
+
+static MemoryContext pgBeStatLocalContext = NULL;
+
+/* ------------------------------------------------------------
+ * Functions for management of the shared-memory PgBackendStatus array
+ * ------------------------------------------------------------
+ */
+
+static PgBackendStatus *BackendStatusArray = NULL;
+static PgBackendStatus *MyBEEntry = NULL;
+static char *BackendAppnameBuffer = NULL;
+static char *BackendClientHostnameBuffer = NULL;
+static char *BackendActivityBuffer = NULL;
+static Size BackendActivityBufferSize = 0;
+#ifdef USE_SSL
+static PgBackendSSLStatus *BackendSslStatusBuffer = NULL;
+#endif
+
+static const char *pgstat_get_wait_activity(WaitEventActivity w);
+static const char *pgstat_get_wait_client(WaitEventClient w);
+static const char *pgstat_get_wait_ipc(WaitEventIPC w);
+static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
+static const char *pgstat_get_wait_io(WaitEventIO w);
+static void pgstat_setup_memcxt(void);
+static void pgstat_beshutdown_hook(int code, Datum arg);
+/*
+ * Report shared-memory space needed by CreateSharedBackendStatus.
+ */
+Size
+BackendStatusShmemSize(void)
+{
+    Size        size;
+
+    /* BackendStatusArray: */
+    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
+    /* BackendAppnameBuffer: */
+    size = add_size(size,
+                    mul_size(NAMEDATALEN, NumBackendStatSlots));
+    /* BackendClientHostnameBuffer: */
+    size = add_size(size,
+                    mul_size(NAMEDATALEN, NumBackendStatSlots));
+    /* BackendActivityBuffer: */
+    size = add_size(size,
+                    mul_size(pgstat_track_activity_query_size, NumBackendStatSlots));
+#ifdef USE_SSL
+    /* BackendSslStatusBuffer: */
+    size = add_size(size,
+                    mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots));
+#endif
+    return size;
+}
+
+/*
+ * Initialize the shared status array and several string buffers
+ * during postmaster startup.
+ */
+void
+CreateSharedBackendStatus(void)
+{
+    Size        size;
+    bool        found;
+    int            i;
+    char       *buffer;
+
+    /* Create or attach to the shared array */
+    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
+    BackendStatusArray = (PgBackendStatus *)
+        ShmemInitStruct("Backend Status Array", size, &found);
+
+    if (!found)
+    {
+        /*
+         * We're the first - initialize.
+         */
+        MemSet(BackendStatusArray, 0, size);
+    }
+
+    /* Create or attach to the shared appname buffer */
+    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
+    BackendAppnameBuffer = (char *)
+        ShmemInitStruct("Backend Application Name Buffer", size, &found);
+
+    if (!found)
+    {
+        MemSet(BackendAppnameBuffer, 0, size);
+
+        /* Initialize st_appname pointers. */
+        buffer = BackendAppnameBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_appname = buffer;
+            buffer += NAMEDATALEN;
+        }
+    }
+
+    /* Create or attach to the shared client hostname buffer */
+    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
+    BackendClientHostnameBuffer = (char *)
+        ShmemInitStruct("Backend Client Host Name Buffer", size, &found);
+
+    if (!found)
+    {
+        MemSet(BackendClientHostnameBuffer, 0, size);
+
+        /* Initialize st_clienthostname pointers. */
+        buffer = BackendClientHostnameBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_clienthostname = buffer;
+            buffer += NAMEDATALEN;
+        }
+    }
+
+    /* Create or attach to the shared activity buffer */
+    BackendActivityBufferSize = mul_size(pgstat_track_activity_query_size,
+                                         NumBackendStatSlots);
+    BackendActivityBuffer = (char *)
+        ShmemInitStruct("Backend Activity Buffer",
+                        BackendActivityBufferSize,
+                        &found);
+
+    if (!found)
+    {
+        MemSet(BackendActivityBuffer, 0, BackendActivityBufferSize);
+
+        /* Initialize st_activity pointers. */
+        buffer = BackendActivityBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_activity_raw = buffer;
+            buffer += pgstat_track_activity_query_size;
+        }
+    }
+
+#ifdef USE_SSL
+    /* Create or attach to the shared SSL status buffer */
+    size = mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots);
+    BackendSslStatusBuffer = (PgBackendSSLStatus *)
+        ShmemInitStruct("Backend SSL Status Buffer", size, &found);
+
+    if (!found)
+    {
+        PgBackendSSLStatus *ptr;
+
+        MemSet(BackendSslStatusBuffer, 0, size);
+
+        /* Initialize st_sslstatus pointers. */
+        ptr = BackendSslStatusBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_sslstatus = ptr;
+            ptr++;
+        }
+    }
+#endif
+}
+
+/* ----------
+ * pgstat_bearray_initialize() -
+ *
+ *    Initialize MyBEEntry, and set up our on-proc-exit hook.
+ *    Called from InitPostgres and AuxiliaryProcessMain. For an auxiliary process,
+ *    MyBackendId is invalid. Otherwise, MyBackendId must be set,
+ *    but we must not have started any transaction yet (since the
+ *    exit hook must run after the last transaction exit).
+ *    NOTE: MyDatabaseId isn't set yet; so the shutdown hook has to be careful.
+ * ----------
+ */
+void
+pgstat_bearray_initialize(void)
+{
+    /* Initialize MyBEEntry */
+    if (MyBackendId != InvalidBackendId)
+    {
+        Assert(MyBackendId >= 1 && MyBackendId <= MaxBackends);
+        MyBEEntry = &BackendStatusArray[MyBackendId - 1];
+    }
+    else
+    {
+        /* Must be an auxiliary process */
+        Assert(MyAuxProcType != NotAnAuxProcess);
+
+        /*
+         * Assign the MyBEEntry for an auxiliary process.  Since it doesn't
+         * have a BackendId, the slot is statically allocated based on the
+         * auxiliary process type (MyAuxProcType).  Backends use array
+         * elements 0 to MaxBackends - 1 (MyBackendId - 1), so we use
+         * MaxBackends + MyAuxProcType as the array index of the slot for an
+         * auxiliary process.
+         */
+        MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
+    }
+
+    /* Set up a process-exit hook to clean up */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
+}
+
+/*
+ * Shut down a single backend's status reporting at process exit.
+ *
+ * This hook only deals with the backend status array.  It does not
+ * flush any remaining statistics counts, so that must be arranged
+ * elsewhere if operations triggered during backend exit are to be counted.
+ *
+ * Clear out our entry in the PgBackendStatus array.
+ */
+static void
+pgstat_beshutdown_hook(int code, Datum arg)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    /*
+     * Clear my status entry, following the protocol of bumping st_changecount
+     * before and after.  We use a volatile pointer here to ensure the
+     * compiler doesn't try to get cute.
+     */
+    pgstat_increment_changecount_before(beentry);
+
+    beentry->st_procpid = 0;    /* mark invalid */
+
+    pgstat_increment_changecount_after(beentry);
+}
+
+
+/* ----------
+ * pgstat_bestart() -
+ *
+ *    Initialize this backend's entry in the PgBackendStatus array.
+ *    Called from InitPostgres and AuxiliaryProcessMain.
+ *
+ *    Apart from auxiliary processes, MyBackendId, MyDatabaseId,
+ *    session userid, and application_name must be set for a
+ *    backend (hence, this cannot be combined with pgstat_bearray_initialize).
+ * ----------
+ */
+void
+pgstat_bestart(void)
+{
+    SockAddr    clientaddr;
+    volatile PgBackendStatus *beentry;
+
+    /*
+     * To minimize the time spent modifying the PgBackendStatus entry, fetch
+     * all the needed data first.
+     */
+
+    /*
+     * We may not have a MyProcPort (eg, if this is the autovacuum process).
+     * If so, use all-zeroes client address, which is dealt with specially in
+     * pg_stat_get_backend_client_addr and pg_stat_get_backend_client_port.
+     */
+    if (MyProcPort)
+        memcpy(&clientaddr, &MyProcPort->raddr, sizeof(clientaddr));
+    else
+        MemSet(&clientaddr, 0, sizeof(clientaddr));
+
+    /*
+     * Initialize my status entry, following the protocol of bumping
+     * st_changecount before and after; and make sure it's even afterwards. We
+     * use a volatile pointer here to ensure the compiler doesn't try to get
+     * cute.
+     */
+    beentry = MyBEEntry;
+
+    /* MyBEEntry must have been set up by pgstat_bearray_initialize() */
+    Assert(beentry != NULL);
+
+    if (MyBackendId != InvalidBackendId)
+    {
+        if (IsAutoVacuumLauncherProcess())
+        {
+            /* Autovacuum Launcher */
+            beentry->st_backendType = B_AUTOVAC_LAUNCHER;
+        }
+        else if (IsAutoVacuumWorkerProcess())
+        {
+            /* Autovacuum Worker */
+            beentry->st_backendType = B_AUTOVAC_WORKER;
+        }
+        else if (am_walsender)
+        {
+            /* Wal sender */
+            beentry->st_backendType = B_WAL_SENDER;
+        }
+        else if (IsBackgroundWorker)
+        {
+            /* bgworker */
+            beentry->st_backendType = B_BG_WORKER;
+        }
+        else
+        {
+            /* client-backend */
+            beentry->st_backendType = B_BACKEND;
+        }
+    }
+    else
+    {
+        /* Must be an auxiliary process */
+        Assert(MyAuxProcType != NotAnAuxProcess);
+        switch (MyAuxProcType)
+        {
+            case StartupProcess:
+                beentry->st_backendType = B_STARTUP;
+                break;
+            case BgWriterProcess:
+                beentry->st_backendType = B_BG_WRITER;
+                break;
+            case CheckpointerProcess:
+                beentry->st_backendType = B_CHECKPOINTER;
+                break;
+            case WalWriterProcess:
+                beentry->st_backendType = B_WAL_WRITER;
+                break;
+            case WalReceiverProcess:
+                beentry->st_backendType = B_WAL_RECEIVER;
+                break;
+            case ArchiverProcess:
+                beentry->st_backendType = B_ARCHIVER;
+                break;
+            default:
+                elog(FATAL, "unrecognized process type: %d",
+                     (int) MyAuxProcType);
+                proc_exit(1);
+        }
+    }
+
+    do
+    {
+        pgstat_increment_changecount_before(beentry);
+    } while ((beentry->st_changecount & 1) == 0);
+
+    beentry->st_procpid = MyProcPid;
+    beentry->st_proc_start_timestamp = MyStartTimestamp;
+    beentry->st_activity_start_timestamp = 0;
+    beentry->st_state_start_timestamp = 0;
+    beentry->st_xact_start_timestamp = 0;
+    beentry->st_databaseid = MyDatabaseId;
+
+    /* We have userid for client-backends, wal-sender and bgworker processes */
+    if (beentry->st_backendType == B_BACKEND
+        || beentry->st_backendType == B_WAL_SENDER
+        || beentry->st_backendType == B_BG_WORKER)
+        beentry->st_userid = GetSessionUserId();
+    else
+        beentry->st_userid = InvalidOid;
+
+    beentry->st_clientaddr = clientaddr;
+    if (MyProcPort && MyProcPort->remote_hostname)
+        strlcpy(beentry->st_clienthostname, MyProcPort->remote_hostname,
+                NAMEDATALEN);
+    else
+        beentry->st_clienthostname[0] = '\0';
+#ifdef USE_SSL
+    if (MyProcPort && MyProcPort->ssl != NULL)
+    {
+        beentry->st_ssl = true;
+        beentry->st_sslstatus->ssl_bits = be_tls_get_cipher_bits(MyProcPort);
+        beentry->st_sslstatus->ssl_compression = be_tls_get_compression(MyProcPort);
+        strlcpy(beentry->st_sslstatus->ssl_version, be_tls_get_version(MyProcPort), NAMEDATALEN);
+        strlcpy(beentry->st_sslstatus->ssl_cipher, be_tls_get_cipher(MyProcPort), NAMEDATALEN);
+        be_tls_get_peerdn_name(MyProcPort, beentry->st_sslstatus->ssl_clientdn, NAMEDATALEN);
+    }
+    else
+    {
+        beentry->st_ssl = false;
+    }
+#else
+    beentry->st_ssl = false;
+#endif
+    beentry->st_state = STATE_UNDEFINED;
+    beentry->st_appname[0] = '\0';
+    beentry->st_activity_raw[0] = '\0';
+    /* Also make sure the last byte in each string area is always 0 */
+    beentry->st_clienthostname[NAMEDATALEN - 1] = '\0';
+    beentry->st_appname[NAMEDATALEN - 1] = '\0';
+    beentry->st_activity_raw[pgstat_track_activity_query_size - 1] = '\0';
+    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
+    beentry->st_progress_command_target = InvalidOid;
+
+    /*
+     * we don't zero st_progress_param here to save cycles; nobody should
+     * examine it until st_progress_command has been set to something other
+     * than PROGRESS_COMMAND_INVALID
+     */
+
+    pgstat_increment_changecount_after(beentry);
+
+    /* Update app name to current GUC setting */
+    if (application_name)
+        pgstat_report_appname(application_name);
+}
+
+/* ----------
+ * pgstat_read_current_status() -
+ *
+ *    Copy the current contents of the PgBackendStatus array to local memory,
+ *    if not already done in this transaction.
+ * ----------
+ */
+static void
+pgstat_read_current_status(void)
+{
+    volatile PgBackendStatus *beentry;
+    LocalPgBackendStatus *localtable;
+    LocalPgBackendStatus *localentry;
+    char       *localappname,
+               *localclienthostname,
+               *localactivity;
+#ifdef USE_SSL
+    PgBackendSSLStatus *localsslstatus;
+#endif
+    int            i;
+
+    Assert(IsUnderPostmaster);
+
+    if (localBackendStatusTable)
+        return;                    /* already done */
+
+    pgstat_setup_memcxt();
+
+    localtable = (LocalPgBackendStatus *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           sizeof(LocalPgBackendStatus) * NumBackendStatSlots);
+    localappname = (char *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           NAMEDATALEN * NumBackendStatSlots);
+    localclienthostname = (char *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           NAMEDATALEN * NumBackendStatSlots);
+    localactivity = (char *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           pgstat_track_activity_query_size * NumBackendStatSlots);
+#ifdef USE_SSL
+    localsslstatus = (PgBackendSSLStatus *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           sizeof(PgBackendSSLStatus) * NumBackendStatSlots);
+#endif
+
+    localNumBackends = 0;
+
+    beentry = BackendStatusArray;
+    localentry = localtable;
+    for (i = 1; i <= NumBackendStatSlots; i++)
+    {
+        /*
+         * Follow the protocol of retrying if st_changecount changes while we
+         * copy the entry, or if it's odd.  (The check for odd is needed to
+         * cover the case where we are able to completely copy the entry while
+         * the source backend is between increment steps.)  We use a volatile
+         * pointer here to ensure the compiler doesn't try to get cute.
+         */
+        for (;;)
+        {
+            int            before_changecount;
+            int            after_changecount;
+
+            pgstat_save_changecount_before(beentry, before_changecount);
+
+            localentry->backendStatus.st_procpid = beentry->st_procpid;
+            if (localentry->backendStatus.st_procpid > 0)
+            {
+                memcpy(&localentry->backendStatus, (char *) beentry, sizeof(PgBackendStatus));
+
+                /*
+                 * strcpy is safe even if the string is modified concurrently,
+                 * because there's always a \0 at the end of the buffer.
+                 */
+                strcpy(localappname, (char *) beentry->st_appname);
+                localentry->backendStatus.st_appname = localappname;
+                strcpy(localclienthostname, (char *) beentry->st_clienthostname);
+                localentry->backendStatus.st_clienthostname = localclienthostname;
+                strcpy(localactivity, (char *) beentry->st_activity_raw);
+                localentry->backendStatus.st_activity_raw = localactivity;
+                localentry->backendStatus.st_ssl = beentry->st_ssl;
+#ifdef USE_SSL
+                if (beentry->st_ssl)
+                {
+                    memcpy(localsslstatus, beentry->st_sslstatus, sizeof(PgBackendSSLStatus));
+                    localentry->backendStatus.st_sslstatus = localsslstatus;
+                }
+#endif
+            }
+
+            pgstat_save_changecount_after(beentry, after_changecount);
+            if (before_changecount == after_changecount &&
+                (before_changecount & 1) == 0)
+                break;
+
+            /* Make sure we can break out of loop if stuck... */
+            CHECK_FOR_INTERRUPTS();
+        }
+
+        beentry++;
+        /* Only valid entries get included into the local array */
+        if (localentry->backendStatus.st_procpid > 0)
+        {
+            BackendIdGetTransactionIds(i,
+                                       &localentry->backend_xid,
+                                       &localentry->backend_xmin);
+
+            localentry++;
+            localappname += NAMEDATALEN;
+            localclienthostname += NAMEDATALEN;
+            localactivity += pgstat_track_activity_query_size;
+#ifdef USE_SSL
+            localsslstatus++;
+#endif
+            localNumBackends++;
+        }
+    }
+
+    /* Set the pointer only after completion of a valid table */
+    localBackendStatusTable = localtable;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_beentry() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    our local copy of the current-activity entry for one backend.
+ *
+ *    NB: caller is responsible for checking whether the user is permitted to
+ *    see this info (especially the query string).
+ * ----------
+ */
+PgBackendStatus *
+pgstat_fetch_stat_beentry(int beid)
+{
+    pgstat_read_current_status();
+
+    if (beid < 1 || beid > localNumBackends)
+        return NULL;
+
+    return &localBackendStatusTable[beid - 1].backendStatus;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_local_beentry() -
+ *
+ *    Like pgstat_fetch_stat_beentry() but with locally computed additions (like
+ *    xid and xmin values of the backend)
+ *
+ *    NB: caller is responsible for checking whether the user is permitted to
+ *    see this info (especially the query string).
+ * ----------
+ */
+LocalPgBackendStatus *
+pgstat_fetch_stat_local_beentry(int beid)
+{
+    pgstat_read_current_status();
+
+    if (beid < 1 || beid > localNumBackends)
+        return NULL;
+
+    return &localBackendStatusTable[beid - 1];
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_numbackends() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    the maximum current backend id.
+ * ----------
+ */
+int
+pgstat_fetch_stat_numbackends(void)
+{
+    pgstat_read_current_status();
+
+    return localNumBackends;
+}
+
+/* ----------
+ * pgstat_get_wait_event_type() -
+ *
+ *    Return a string representing the type of the wait event the backend is
+ *    currently waiting on, or NULL if it is not waiting.
+ * ----------
+ */
+const char *
+pgstat_get_wait_event_type(uint32 wait_event_info)
+{
+    uint32        classId;
+    const char *event_type;
+
+    /* report process as not waiting. */
+    if (wait_event_info == 0)
+        return NULL;
+
+    classId = wait_event_info & 0xFF000000;
+
+    switch (classId)
+    {
+        case PG_WAIT_LWLOCK:
+            event_type = "LWLock";
+            break;
+        case PG_WAIT_LOCK:
+            event_type = "Lock";
+            break;
+        case PG_WAIT_BUFFER_PIN:
+            event_type = "BufferPin";
+            break;
+        case PG_WAIT_ACTIVITY:
+            event_type = "Activity";
+            break;
+        case PG_WAIT_CLIENT:
+            event_type = "Client";
+            break;
+        case PG_WAIT_EXTENSION:
+            event_type = "Extension";
+            break;
+        case PG_WAIT_IPC:
+            event_type = "IPC";
+            break;
+        case PG_WAIT_TIMEOUT:
+            event_type = "Timeout";
+            break;
+        case PG_WAIT_IO:
+            event_type = "IO";
+            break;
+        default:
+            event_type = "???";
+            break;
+    }
+
+    return event_type;
+}
+
+/* ----------
+ * pgstat_get_wait_event() -
+ *
+ *    Return a string representing the wait event the backend is currently
+ *    waiting on, or NULL if it is not waiting.
+ * ----------
+ */
+const char *
+pgstat_get_wait_event(uint32 wait_event_info)
+{
+    uint32        classId;
+    uint16        eventId;
+    const char *event_name;
+
+    /* report process as not waiting. */
+    if (wait_event_info == 0)
+        return NULL;
+
+    classId = wait_event_info & 0xFF000000;
+    eventId = wait_event_info & 0x0000FFFF;
+
+    switch (classId)
+    {
+        case PG_WAIT_LWLOCK:
+            event_name = GetLWLockIdentifier(classId, eventId);
+            break;
+        case PG_WAIT_LOCK:
+            event_name = GetLockNameFromTagType(eventId);
+            break;
+        case PG_WAIT_BUFFER_PIN:
+            event_name = "BufferPin";
+            break;
+        case PG_WAIT_ACTIVITY:
+            {
+                WaitEventActivity w = (WaitEventActivity) wait_event_info;
+
+                event_name = pgstat_get_wait_activity(w);
+                break;
+            }
+        case PG_WAIT_CLIENT:
+            {
+                WaitEventClient w = (WaitEventClient) wait_event_info;
+
+                event_name = pgstat_get_wait_client(w);
+                break;
+            }
+        case PG_WAIT_EXTENSION:
+            event_name = "Extension";
+            break;
+        case PG_WAIT_IPC:
+            {
+                WaitEventIPC w = (WaitEventIPC) wait_event_info;
+
+                event_name = pgstat_get_wait_ipc(w);
+                break;
+            }
+        case PG_WAIT_TIMEOUT:
+            {
+                WaitEventTimeout w = (WaitEventTimeout) wait_event_info;
+
+                event_name = pgstat_get_wait_timeout(w);
+                break;
+            }
+        case PG_WAIT_IO:
+            {
+                WaitEventIO w = (WaitEventIO) wait_event_info;
+
+                event_name = pgstat_get_wait_io(w);
+                break;
+            }
+        default:
+            event_name = "unknown wait event";
+            break;
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_activity() -
+ *
+ * Convert WaitEventActivity to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_activity(WaitEventActivity w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_ARCHIVER_MAIN:
+            event_name = "ArchiverMain";
+            break;
+        case WAIT_EVENT_AUTOVACUUM_MAIN:
+            event_name = "AutoVacuumMain";
+            break;
+        case WAIT_EVENT_BGWRITER_HIBERNATE:
+            event_name = "BgWriterHibernate";
+            break;
+        case WAIT_EVENT_BGWRITER_MAIN:
+            event_name = "BgWriterMain";
+            break;
+        case WAIT_EVENT_CHECKPOINTER_MAIN:
+            event_name = "CheckpointerMain";
+            break;
+        case WAIT_EVENT_LOGICAL_APPLY_MAIN:
+            event_name = "LogicalApplyMain";
+            break;
+        case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
+            event_name = "LogicalLauncherMain";
+            break;
+        case WAIT_EVENT_RECOVERY_WAL_ALL:
+            event_name = "RecoveryWalAll";
+            break;
+        case WAIT_EVENT_RECOVERY_WAL_STREAM:
+            event_name = "RecoveryWalStream";
+            break;
+        case WAIT_EVENT_SYSLOGGER_MAIN:
+            event_name = "SysLoggerMain";
+            break;
+        case WAIT_EVENT_WAL_RECEIVER_MAIN:
+            event_name = "WalReceiverMain";
+            break;
+        case WAIT_EVENT_WAL_SENDER_MAIN:
+            event_name = "WalSenderMain";
+            break;
+        case WAIT_EVENT_WAL_WRITER_MAIN:
+            event_name = "WalWriterMain";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_client() -
+ *
+ * Convert WaitEventClient to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_client(WaitEventClient w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_CLIENT_READ:
+            event_name = "ClientRead";
+            break;
+        case WAIT_EVENT_CLIENT_WRITE:
+            event_name = "ClientWrite";
+            break;
+        case WAIT_EVENT_LIBPQWALRECEIVER_CONNECT:
+            event_name = "LibPQWalReceiverConnect";
+            break;
+        case WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE:
+            event_name = "LibPQWalReceiverReceive";
+            break;
+        case WAIT_EVENT_SSL_OPEN_SERVER:
+            event_name = "SSLOpenServer";
+            break;
+        case WAIT_EVENT_WAL_RECEIVER_WAIT_START:
+            event_name = "WalReceiverWaitStart";
+            break;
+        case WAIT_EVENT_WAL_SENDER_WAIT_WAL:
+            event_name = "WalSenderWaitForWAL";
+            break;
+        case WAIT_EVENT_WAL_SENDER_WRITE_DATA:
+            event_name = "WalSenderWriteData";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_ipc() -
+ *
+ * Convert WaitEventIPC to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_ipc(WaitEventIPC w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_BGWORKER_SHUTDOWN:
+            event_name = "BgWorkerShutdown";
+            break;
+        case WAIT_EVENT_BGWORKER_STARTUP:
+            event_name = "BgWorkerStartup";
+            break;
+        case WAIT_EVENT_BTREE_PAGE:
+            event_name = "BtreePage";
+            break;
+        case WAIT_EVENT_CLOG_GROUP_UPDATE:
+            event_name = "ClogGroupUpdate";
+            break;
+        case WAIT_EVENT_EXECUTE_GATHER:
+            event_name = "ExecuteGather";
+            break;
+        case WAIT_EVENT_HASH_BATCH_ALLOCATING:
+            event_name = "Hash/Batch/Allocating";
+            break;
+        case WAIT_EVENT_HASH_BATCH_ELECTING:
+            event_name = "Hash/Batch/Electing";
+            break;
+        case WAIT_EVENT_HASH_BATCH_LOADING:
+            event_name = "Hash/Batch/Loading";
+            break;
+        case WAIT_EVENT_HASH_BUILD_ALLOCATING:
+            event_name = "Hash/Build/Allocating";
+            break;
+        case WAIT_EVENT_HASH_BUILD_ELECTING:
+            event_name = "Hash/Build/Electing";
+            break;
+        case WAIT_EVENT_HASH_BUILD_HASHING_INNER:
+            event_name = "Hash/Build/HashingInner";
+            break;
+        case WAIT_EVENT_HASH_BUILD_HASHING_OUTER:
+            event_name = "Hash/Build/HashingOuter";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING:
+            event_name = "Hash/GrowBatches/Allocating";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_DECIDING:
+            event_name = "Hash/GrowBatches/Deciding";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_ELECTING:
+            event_name = "Hash/GrowBatches/Electing";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_FINISHING:
+            event_name = "Hash/GrowBatches/Finishing";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING:
+            event_name = "Hash/GrowBatches/Repartitioning";
+            break;
+        case WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING:
+            event_name = "Hash/GrowBuckets/Allocating";
+            break;
+        case WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING:
+            event_name = "Hash/GrowBuckets/Electing";
+            break;
+        case WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING:
+            event_name = "Hash/GrowBuckets/Reinserting";
+            break;
+        case WAIT_EVENT_LOGICAL_SYNC_DATA:
+            event_name = "LogicalSyncData";
+            break;
+        case WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE:
+            event_name = "LogicalSyncStateChange";
+            break;
+        case WAIT_EVENT_MQ_INTERNAL:
+            event_name = "MessageQueueInternal";
+            break;
+        case WAIT_EVENT_MQ_PUT_MESSAGE:
+            event_name = "MessageQueuePutMessage";
+            break;
+        case WAIT_EVENT_MQ_RECEIVE:
+            event_name = "MessageQueueReceive";
+            break;
+        case WAIT_EVENT_MQ_SEND:
+            event_name = "MessageQueueSend";
+            break;
+        case WAIT_EVENT_PARALLEL_BITMAP_SCAN:
+            event_name = "ParallelBitmapScan";
+            break;
+        case WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN:
+            event_name = "ParallelCreateIndexScan";
+            break;
+        case WAIT_EVENT_PARALLEL_FINISH:
+            event_name = "ParallelFinish";
+            break;
+        case WAIT_EVENT_PROCARRAY_GROUP_UPDATE:
+            event_name = "ProcArrayGroupUpdate";
+            break;
+        case WAIT_EVENT_PROMOTE:
+            event_name = "Promote";
+            break;
+        case WAIT_EVENT_REPLICATION_ORIGIN_DROP:
+            event_name = "ReplicationOriginDrop";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_DROP:
+            event_name = "ReplicationSlotDrop";
+            break;
+        case WAIT_EVENT_SAFE_SNAPSHOT:
+            event_name = "SafeSnapshot";
+            break;
+        case WAIT_EVENT_SYNC_REP:
+            event_name = "SyncRep";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_timeout() -
+ *
+ * Convert WaitEventTimeout to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_timeout(WaitEventTimeout w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_BASE_BACKUP_THROTTLE:
+            event_name = "BaseBackupThrottle";
+            break;
+        case WAIT_EVENT_PG_SLEEP:
+            event_name = "PgSleep";
+            break;
+        case WAIT_EVENT_RECOVERY_APPLY_DELAY:
+            event_name = "RecoveryApplyDelay";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_io() -
+ *
+ * Convert WaitEventIO to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_io(WaitEventIO w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_BUFFILE_READ:
+            event_name = "BufFileRead";
+            break;
+        case WAIT_EVENT_BUFFILE_WRITE:
+            event_name = "BufFileWrite";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_READ:
+            event_name = "ControlFileRead";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_SYNC:
+            event_name = "ControlFileSync";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE:
+            event_name = "ControlFileSyncUpdate";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_WRITE:
+            event_name = "ControlFileWrite";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE:
+            event_name = "ControlFileWriteUpdate";
+            break;
+        case WAIT_EVENT_COPY_FILE_READ:
+            event_name = "CopyFileRead";
+            break;
+        case WAIT_EVENT_COPY_FILE_WRITE:
+            event_name = "CopyFileWrite";
+            break;
+        case WAIT_EVENT_DATA_FILE_EXTEND:
+            event_name = "DataFileExtend";
+            break;
+        case WAIT_EVENT_DATA_FILE_FLUSH:
+            event_name = "DataFileFlush";
+            break;
+        case WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC:
+            event_name = "DataFileImmediateSync";
+            break;
+        case WAIT_EVENT_DATA_FILE_PREFETCH:
+            event_name = "DataFilePrefetch";
+            break;
+        case WAIT_EVENT_DATA_FILE_READ:
+            event_name = "DataFileRead";
+            break;
+        case WAIT_EVENT_DATA_FILE_SYNC:
+            event_name = "DataFileSync";
+            break;
+        case WAIT_EVENT_DATA_FILE_TRUNCATE:
+            event_name = "DataFileTruncate";
+            break;
+        case WAIT_EVENT_DATA_FILE_WRITE:
+            event_name = "DataFileWrite";
+            break;
+        case WAIT_EVENT_DSM_FILL_ZERO_WRITE:
+            event_name = "DSMFillZeroWrite";
+            break;
+        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ:
+            event_name = "LockFileAddToDataDirRead";
+            break;
+        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC:
+            event_name = "LockFileAddToDataDirSync";
+            break;
+        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE:
+            event_name = "LockFileAddToDataDirWrite";
+            break;
+        case WAIT_EVENT_LOCK_FILE_CREATE_READ:
+            event_name = "LockFileCreateRead";
+            break;
+        case WAIT_EVENT_LOCK_FILE_CREATE_SYNC:
+            event_name = "LockFileCreateSync";
+            break;
+        case WAIT_EVENT_LOCK_FILE_CREATE_WRITE:
+            event_name = "LockFileCreateWrite";
+            break;
+        case WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ:
+            event_name = "LockFileReCheckDataDirRead";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC:
+            event_name = "LogicalRewriteCheckpointSync";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC:
+            event_name = "LogicalRewriteMappingSync";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE:
+            event_name = "LogicalRewriteMappingWrite";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_SYNC:
+            event_name = "LogicalRewriteSync";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE:
+            event_name = "LogicalRewriteTruncate";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_WRITE:
+            event_name = "LogicalRewriteWrite";
+            break;
+        case WAIT_EVENT_RELATION_MAP_READ:
+            event_name = "RelationMapRead";
+            break;
+        case WAIT_EVENT_RELATION_MAP_SYNC:
+            event_name = "RelationMapSync";
+            break;
+        case WAIT_EVENT_RELATION_MAP_WRITE:
+            event_name = "RelationMapWrite";
+            break;
+        case WAIT_EVENT_REORDER_BUFFER_READ:
+            event_name = "ReorderBufferRead";
+            break;
+        case WAIT_EVENT_REORDER_BUFFER_WRITE:
+            event_name = "ReorderBufferWrite";
+            break;
+        case WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ:
+            event_name = "ReorderLogicalMappingRead";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_READ:
+            event_name = "ReplicationSlotRead";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC:
+            event_name = "ReplicationSlotRestoreSync";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_SYNC:
+            event_name = "ReplicationSlotSync";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_WRITE:
+            event_name = "ReplicationSlotWrite";
+            break;
+        case WAIT_EVENT_SLRU_FLUSH_SYNC:
+            event_name = "SLRUFlushSync";
+            break;
+        case WAIT_EVENT_SLRU_READ:
+            event_name = "SLRURead";
+            break;
+        case WAIT_EVENT_SLRU_SYNC:
+            event_name = "SLRUSync";
+            break;
+        case WAIT_EVENT_SLRU_WRITE:
+            event_name = "SLRUWrite";
+            break;
+        case WAIT_EVENT_SNAPBUILD_READ:
+            event_name = "SnapbuildRead";
+            break;
+        case WAIT_EVENT_SNAPBUILD_SYNC:
+            event_name = "SnapbuildSync";
+            break;
+        case WAIT_EVENT_SNAPBUILD_WRITE:
+            event_name = "SnapbuildWrite";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC:
+            event_name = "TimelineHistoryFileSync";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE:
+            event_name = "TimelineHistoryFileWrite";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_READ:
+            event_name = "TimelineHistoryRead";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_SYNC:
+            event_name = "TimelineHistorySync";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_WRITE:
+            event_name = "TimelineHistoryWrite";
+            break;
+        case WAIT_EVENT_TWOPHASE_FILE_READ:
+            event_name = "TwophaseFileRead";
+            break;
+        case WAIT_EVENT_TWOPHASE_FILE_SYNC:
+            event_name = "TwophaseFileSync";
+            break;
+        case WAIT_EVENT_TWOPHASE_FILE_WRITE:
+            event_name = "TwophaseFileWrite";
+            break;
+        case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ:
+            event_name = "WALSenderTimelineHistoryRead";
+            break;
+        case WAIT_EVENT_WAL_BOOTSTRAP_SYNC:
+            event_name = "WALBootstrapSync";
+            break;
+        case WAIT_EVENT_WAL_BOOTSTRAP_WRITE:
+            event_name = "WALBootstrapWrite";
+            break;
+        case WAIT_EVENT_WAL_COPY_READ:
+            event_name = "WALCopyRead";
+            break;
+        case WAIT_EVENT_WAL_COPY_SYNC:
+            event_name = "WALCopySync";
+            break;
+        case WAIT_EVENT_WAL_COPY_WRITE:
+            event_name = "WALCopyWrite";
+            break;
+        case WAIT_EVENT_WAL_INIT_SYNC:
+            event_name = "WALInitSync";
+            break;
+        case WAIT_EVENT_WAL_INIT_WRITE:
+            event_name = "WALInitWrite";
+            break;
+        case WAIT_EVENT_WAL_READ:
+            event_name = "WALRead";
+            break;
+        case WAIT_EVENT_WAL_SYNC:
+            event_name = "WALSync";
+            break;
+        case WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN:
+            event_name = "WALSyncMethodAssign";
+            break;
+        case WAIT_EVENT_WAL_WRITE:
+            event_name = "WALWrite";
+            break;
+
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+
+/* ----------
+ * pgstat_get_backend_current_activity() -
+ *
+ *    Return a string representing the current activity of the backend with
+ *    the specified PID.  This looks directly at the BackendStatusArray,
+ *    and so will provide current information regardless of the age of our
+ *    transaction's snapshot of the status array.
+ *
+ *    It is the caller's responsibility to invoke this only for backends whose
+ *    state is expected to remain stable while the result is in use.  The
+ *    only current use is in deadlock reporting, where we can expect that
+ *    the target backend is blocked on a lock.  (There are corner cases
+ *    where the target's wait could get aborted while we are looking at it,
+ *    but the very worst consequence is to return a pointer to a string
+ *    that's been changed, so we won't worry too much.)
+ *
+ *    Note: return strings for special cases match pg_stat_get_backend_activity.
+ * ----------
+ */
+const char *
+pgstat_get_backend_current_activity(int pid, bool checkUser)
+{
+    PgBackendStatus *beentry;
+    int            i;
+
+    beentry = BackendStatusArray;
+    for (i = 1; i <= MaxBackends; i++)
+    {
+        /*
+         * Although we expect the target backend's entry to be stable, that
+         * doesn't imply that anyone else's is.  To avoid identifying the
+         * wrong backend, while we check for a match to the desired PID we
+         * must follow the protocol of retrying if st_changecount changes
+         * while we examine the entry, or if it's odd.  (This might be
+         * unnecessary, since fetching or storing an int is almost certainly
+         * atomic, but let's play it safe.)  We use a volatile pointer here to
+         * ensure the compiler doesn't try to get cute.
+         */
+        volatile PgBackendStatus *vbeentry = beentry;
+        bool        found;
+
+        for (;;)
+        {
+            int            before_changecount;
+            int            after_changecount;
+
+            pgstat_save_changecount_before(vbeentry, before_changecount);
+
+            found = (vbeentry->st_procpid == pid);
+
+            pgstat_save_changecount_after(vbeentry, after_changecount);
+
+            if (before_changecount == after_changecount &&
+                (before_changecount & 1) == 0)
+                break;
+
+            /* Make sure we can break out of loop if stuck... */
+            CHECK_FOR_INTERRUPTS();
+        }
+
+        if (found)
+        {
+            /* Now it is safe to use the non-volatile pointer */
+            if (checkUser && !superuser() && beentry->st_userid != GetUserId())
+                return "<insufficient privilege>";
+            else if (*(beentry->st_activity_raw) == '\0')
+                return "<command string not enabled>";
+            else
+            {
+                /* this'll leak a bit of memory, but that seems acceptable */
+                return pgstat_clip_activity(beentry->st_activity_raw);
+            }
+        }
+
+        beentry++;
+    }
+
+    /* If we get here, caller is in error ... */
+    return "<backend information not available>";
+}
+
+/* ----------
+ * pgstat_get_crashed_backend_activity() -
+ *
+ *    Return a string representing the current activity of the backend with
+ *    the specified PID.  Like the function above, but reads shared memory with
+ *    the expectation that it may be corrupt.  On success, copy the string
+ *    into the "buffer" argument and return that pointer.  On failure,
+ *    return NULL.
+ *
+ *    This function is only intended to be used by the postmaster to report the
+ *    query that crashed a backend.  In particular, no attempt is made to
+ *    follow the correct concurrency protocol when accessing the
+ *    BackendStatusArray.  But that's OK, in the worst case we'll return a
+ *    corrupted message.  We also must take care not to trip on ereport(ERROR).
+ * ----------
+ */
+const char *
+pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
+{
+    volatile PgBackendStatus *beentry;
+    int            i;
+
+    beentry = BackendStatusArray;
+
+    /*
+     * We probably shouldn't get here before shared memory has been set up,
+     * but be safe.
+     */
+    if (beentry == NULL || BackendActivityBuffer == NULL)
+        return NULL;
+
+    for (i = 1; i <= MaxBackends; i++)
+    {
+        if (beentry->st_procpid == pid)
+        {
+            /* Read pointer just once, so it can't change after validation */
+            const char *activity = beentry->st_activity_raw;
+            const char *activity_last;
+
+            /*
+             * We mustn't access activity string before we verify that it
+             * falls within the BackendActivityBuffer. To make sure that the
+             * entire string including its ending is contained within the
+             * buffer, subtract one activity length from the buffer size.
+             */
+            activity_last = BackendActivityBuffer + BackendActivityBufferSize
+                - pgstat_track_activity_query_size;
+
+            if (activity < BackendActivityBuffer ||
+                activity > activity_last)
+                return NULL;
+
+            /* If no string available, no point in a report */
+            if (activity[0] == '\0')
+                return NULL;
+
+            /*
+             * Copy only ASCII-safe characters so we don't run into encoding
+             * problems when reporting the message; and be sure not to run off
+             * the end of memory.  As only ASCII characters are reported, it
+             * doesn't seem necessary to perform multibyte aware clipping.
+             */
+            ascii_safe_strlcpy(buffer, activity,
+                               Min(buflen, pgstat_track_activity_query_size));
+
+            return buffer;
+        }
+
+        beentry++;
+    }
+
+    /* PID not found */
+    return NULL;
+}
+
+const char *
+pgstat_get_backend_desc(BackendType backendType)
+{
+    const char *backendDesc = "unknown process type";
+
+    switch (backendType)
+    {
+        case B_AUTOVAC_LAUNCHER:
+            backendDesc = "autovacuum launcher";
+            break;
+        case B_AUTOVAC_WORKER:
+            backendDesc = "autovacuum worker";
+            break;
+        case B_BACKEND:
+            backendDesc = "client backend";
+            break;
+        case B_BG_WORKER:
+            backendDesc = "background worker";
+            break;
+        case B_BG_WRITER:
+            backendDesc = "background writer";
+            break;
+        case B_CHECKPOINTER:
+            backendDesc = "checkpointer";
+            break;
+        case B_STARTUP:
+            backendDesc = "startup";
+            break;
+        case B_WAL_RECEIVER:
+            backendDesc = "walreceiver";
+            break;
+        case B_WAL_SENDER:
+            backendDesc = "walsender";
+            break;
+        case B_WAL_WRITER:
+            backendDesc = "walwriter";
+            break;
+        case B_ARCHIVER:
+            backendDesc = "archiver";
+            break;
+    }
+
+    return backendDesc;
+}
+
+/* ----------
+ * pgstat_report_appname() -
+ *
+ *    Called to update our application name.
+ * ----------
+ */
+void
+pgstat_report_appname(const char *appname)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+    int            len;
+
+    if (!beentry)
+        return;
+
+    /* This should be unnecessary if GUC did its job, but be safe */
+    len = pg_mbcliplen(appname, strlen(appname), NAMEDATALEN - 1);
+
+    /*
+     * Update my status entry, following the protocol of bumping
+     * st_changecount before and after.  We use a volatile pointer here to
+     * ensure the compiler doesn't try to get cute.
+     */
+    pgstat_increment_changecount_before(beentry);
+
+    memcpy((char *) beentry->st_appname, appname, len);
+    beentry->st_appname[len] = '\0';
+
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*
+ * Report current transaction start timestamp as the specified value.
+ * Zero means there is no active transaction.
+ */
+void
+pgstat_report_xact_timestamp(TimestampTz tstamp)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    if (!pgstat_track_activities || !beentry)
+        return;
+
+    /*
+     * Update my status entry, following the protocol of bumping
+     * st_changecount before and after.  We use a volatile pointer here to
+     * ensure the compiler doesn't try to get cute.
+     */
+    pgstat_increment_changecount_before(beentry);
+    beentry->st_xact_start_timestamp = tstamp;
+    pgstat_increment_changecount_after(beentry);
+}
+
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ *    Create pgBeStatLocalContext, if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+    if (!pgBeStatLocalContext)
+        pgBeStatLocalContext = AllocSetContextCreate(TopMemoryContext,
+                                                     "Backend status snapshot",
+                                                     ALLOCSET_SMALL_SIZES);
+}
+
+/* ----------
+ * pgstat_bestatus_clear_snapshot() -
+ *
+ *    Discard any data collected in the current transaction.  Any subsequent
+ *    request will cause new snapshots to be read.
+ *
+ *    This is also invoked during transaction commit or abort to discard
+ *    the no-longer-wanted snapshot.
+ * ----------
+ */
+void
+pgstat_bestatus_clear_snapshot(void)
+{
+    /* Release memory, if any was allocated */
+    if (pgBeStatLocalContext)
+        MemoryContextDelete(pgBeStatLocalContext);
+
+    /* Reset variables */
+    pgBeStatLocalContext = NULL;
+    localBackendStatusTable = NULL;
+    localNumBackends = 0;
+}
+
+
+
+/* ----------
+ * pgstat_report_activity() -
+ *
+ *    Called from tcop/postgres.c to report what the backend is actually doing
+ *    (but note cmd_str can be NULL for certain cases).
+ *
+ * All updates of the status entry follow the protocol of bumping
+ * st_changecount before and after.  We use a volatile pointer here to
+ * ensure the compiler doesn't try to get cute.
+ * ----------
+ */
+void
+pgstat_report_activity(BackendState state, const char *cmd_str)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+    TimestampTz start_timestamp;
+    TimestampTz current_timestamp;
+    int            len = 0;
+
+    TRACE_POSTGRESQL_STATEMENT_STATUS(cmd_str);
+
+    if (!beentry)
+        return;
+
+    if (!pgstat_track_activities)
+    {
+        if (beentry->st_state != STATE_DISABLED)
+        {
+            volatile PGPROC *proc = MyProc;
+
+            /*
+             * track_activities is disabled, but we last reported a
+             * non-disabled state.  As our final update, change the state and
+             * clear fields we will not be updating anymore.
+             */
+            pgstat_increment_changecount_before(beentry);
+            beentry->st_state = STATE_DISABLED;
+            beentry->st_state_start_timestamp = 0;
+            beentry->st_activity_raw[0] = '\0';
+            beentry->st_activity_start_timestamp = 0;
+            /* st_xact_start_timestamp and wait_event_info are also disabled */
+            beentry->st_xact_start_timestamp = 0;
+            proc->wait_event_info = 0;
+            pgstat_increment_changecount_after(beentry);
+        }
+        return;
+    }
+
+    /*
+     * To minimize the time spent modifying the entry, fetch all the needed
+     * data first.
+     */
+    start_timestamp = GetCurrentStatementStartTimestamp();
+    if (cmd_str != NULL)
+    {
+        /*
+         * Compute length of to-be-stored string unaware of multi-byte
+         * characters. For speed reasons that'll get corrected on read, rather
+         * than computed every write.
+         */
+        len = Min(strlen(cmd_str), pgstat_track_activity_query_size - 1);
+    }
+    current_timestamp = GetCurrentTimestamp();
+
+    /*
+     * Now update the status entry
+     */
+    pgstat_increment_changecount_before(beentry);
+
+    beentry->st_state = state;
+    beentry->st_state_start_timestamp = current_timestamp;
+
+    if (cmd_str != NULL)
+    {
+        memcpy((char *) beentry->st_activity_raw, cmd_str, len);
+        beentry->st_activity_raw[len] = '\0';
+        beentry->st_activity_start_timestamp = start_timestamp;
+    }
+
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * pgstat_progress_start_command() -
+ *
+ * Set st_progress_command (and st_progress_command_target) in own backend
+ * entry.  Also, zero-initialize st_progress_param array.
+ *-----------
+ */
+void
+pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    if (!beentry || !pgstat_track_activities)
+        return;
+
+    pgstat_increment_changecount_before(beentry);
+    beentry->st_progress_command = cmdtype;
+    beentry->st_progress_command_target = relid;
+    MemSet(&beentry->st_progress_param, 0, sizeof(beentry->st_progress_param));
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * pgstat_progress_update_param() -
+ *
+ * Update index'th member in st_progress_param[] of own backend entry.
+ *-----------
+ */
+void
+pgstat_progress_update_param(int index, int64 val)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    Assert(index >= 0 && index < PGSTAT_NUM_PROGRESS_PARAM);
+
+    if (!beentry || !pgstat_track_activities)
+        return;
+
+    pgstat_increment_changecount_before(beentry);
+    beentry->st_progress_param[index] = val;
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * pgstat_progress_update_multi_param() -
+ *
+ * Update multiple members in st_progress_param[] of own backend entry.
+ * This is atomic; readers won't see intermediate states.
+ *-----------
+ */
+void
+pgstat_progress_update_multi_param(int nparam, const int *index,
+                                   const int64 *val)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+    int            i;
+
+    if (!beentry || !pgstat_track_activities || nparam == 0)
+        return;
+
+    pgstat_increment_changecount_before(beentry);
+
+    for (i = 0; i < nparam; ++i)
+    {
+        Assert(index[i] >= 0 && index[i] < PGSTAT_NUM_PROGRESS_PARAM);
+
+        beentry->st_progress_param[index[i]] = val[i];
+    }
+
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * pgstat_progress_end_command() -
+ *
+ * Reset st_progress_command (and st_progress_command_target) in own backend
+ * entry.  This signals the end of the command.
+ *-----------
+ */
+void
+pgstat_progress_end_command(void)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    if (!beentry)
+        return;
+    if (!pgstat_track_activities
+        && beentry->st_progress_command == PROGRESS_COMMAND_INVALID)
+        return;
+
+    pgstat_increment_changecount_before(beentry);
+    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
+    beentry->st_progress_command_target = InvalidOid;
+    pgstat_increment_changecount_after(beentry);
+}
+
+
+/*
+ * Convert a potentially unsafely truncated activity string (see
+ * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
+ * one.
+ *
+ * The returned string is allocated in the caller's memory context and may be
+ * freed.
+ */
+char *
+pgstat_clip_activity(const char *raw_activity)
+{
+    char       *activity;
+    int            rawlen;
+    int            cliplen;
+
+    /*
+     * Some callers, like pgstat_get_backend_current_activity(), do not
+     * guarantee that the buffer isn't concurrently modified. We try to take
+     * care that the buffer is always terminated by a NUL byte regardless, but
+     * let's still be paranoid about the string's length. In those cases the
+     * underlying buffer is guaranteed to be pgstat_track_activity_query_size
+     * large.
+     */
+    activity = pnstrdup(raw_activity, pgstat_track_activity_query_size - 1);
+
+    /* now double-guaranteed to be NUL terminated */
+    rawlen = strlen(activity);
+
+    /*
+     * All supported server-encodings make it possible to determine the length
+     * of a multi-byte character from its first byte (this is not the case for
+     * client encodings, see GB18030). As st_activity_raw is always stored in
+     * the server encoding, this allows us to perform multi-byte aware
+     * truncation, even if the string earlier was truncated in the middle of a
+     * multi-byte character.
+     */
+    cliplen = pg_mbcliplen(activity, rawlen,
+                           pgstat_track_activity_query_size - 1);
+
+    activity[cliplen] = '\0';
+
+    return activity;
+}
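Reviewer note (not part of the patch): the functions moved into the new file all rely on the st_changecount protocol. A writer bumps the counter before and after touching its entry, and a reader retries if the counter changed or is odd. Below is a minimal single-threaded sketch of that idea. All names here (DemoStatus, demo_begin_write, demo_end_write, demo_report_activity, demo_try_read) are hypothetical stand-ins, not the patch's API, and the sketch leaves out the memory barriers that a real multi-process implementation would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define DEMO_ACTIVITY_LEN 64

/* Hypothetical, simplified stand-in for one backend status entry. */
typedef struct DemoStatus
{
    volatile int changecount;   /* odd while a write is in progress */
    int          procpid;
    char         activity[DEMO_ACTIVITY_LEN];
} DemoStatus;

/* Writer side: make the counter odd before modifying the entry. */
static void
demo_begin_write(DemoStatus *e)
{
    e->changecount++;
}

/* Writer side: make the counter even again once the entry is consistent. */
static void
demo_end_write(DemoStatus *e)
{
    e->changecount++;
    assert((e->changecount & 1) == 0);
}

/* Update pid and activity string under the changecount protocol. */
static void
demo_report_activity(DemoStatus *e, int pid, const char *cmd_str)
{
    size_t len = strlen(cmd_str);

    /* Compute the clipped length first to keep the write window short. */
    if (len > DEMO_ACTIVITY_LEN - 1)
        len = DEMO_ACTIVITY_LEN - 1;

    demo_begin_write(e);
    e->procpid = pid;
    memcpy(e->activity, cmd_str, len);
    e->activity[len] = '\0';
    demo_end_write(e);
}

/*
 * Reader side: one attempt to copy the entry.  Returns true only if the
 * counter was even and unchanged across the copy; a caller would loop
 * until this succeeds.
 */
static bool
demo_try_read(const DemoStatus *e, DemoStatus *out)
{
    int before = e->changecount;
    int after;

    out->procpid = e->procpid;
    memcpy(out->activity, e->activity, DEMO_ACTIVITY_LEN);

    after = e->changecount;
    out->changecount = after;

    return before == after && (before & 1) == 0;
}
```

A reader would call demo_try_read() in a loop and retry on false, which is the shape of the for (;;) loop in pgstat_get_backend_current_activity() above.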
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/statmon/pgstat.c
similarity index 70%
rename from src/backend/postmaster/pgstat.c
rename to src/backend/statmon/pgstat.c
index b66a246182..31d7ade130 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/statmon/pgstat.c
@@ -8,7 +8,7 @@
  *
  *    Copyright (c) 2001-2018, PostgreSQL Global Development Group
  *
- *    src/backend/postmaster/pgstat.c
+ *    src/backend/statmon/pgstat.c
  * ----------
  */
 #include "postgres.h"
@@ -21,19 +21,14 @@
 #include "access/htup_details.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "libpq/libpq.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "postmaster/autovacuum.h"
-#include "replication/walsender.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/procsignal.h"
-#include "storage/sinvaladt.h"
-#include "utils/ascii.h"
-#include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
 
@@ -68,26 +63,12 @@ typedef enum
     PGSTAT_ENTRY_LOCK_FAILED
 } pg_stat_table_result_status;
 
-/* ----------
- * Total number of backends including auxiliary
- *
- * We reserve a slot for each possible BackendId, plus one for each
- * possible auxiliary process type.  (This scheme assumes there is not
- * more than one of any auxiliary process type at a time.) MaxBackends
- * includes autovacuum workers and background workers as well.
- * ----------
- */
-#define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
-
-
 /* ----------
  * GUC parameters
  * ----------
  */
-bool        pgstat_track_activities = false;
 bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
-int            pgstat_track_activity_query_size = 1024;
 
 /*
  * This was a GUC parameter and no longer used in this file. But left alone
@@ -131,6 +112,8 @@ static bool pgstat_pending_recoveryconflict = false;
 static bool pgstat_pending_deadlock = false;
 static bool pgstat_pending_tempfile = false;
 
+static MemoryContext pgStatLocalContext = NULL;
+
 /* dshash parameter for each type of table */
 static const dshash_parameters dsh_dbparams = {
     sizeof(Oid),
@@ -242,15 +225,8 @@ typedef struct
 /*
  * Info about current "snapshot" of stats file
  */
-static MemoryContext pgStatLocalContext = NULL;
 static HTAB *pgStatDBHash = NULL;
 
-/* Status for backends including auxiliary */
-static LocalPgBackendStatus *localBackendStatusTable = NULL;
-
-/* Total number of backends including auxiliary */
-static int    localNumBackends = 0;
-
 /*
  * Cluster wide statistics.
  * Contains statistics that are not collected per database or per table.
@@ -286,7 +262,6 @@ static void pgstat_read_db_statsfile(Oid databaseid, dshash_table *tabhash, dsha
 /* functions used in backends */
 static bool backend_snapshot_global_stats(void);
 static PgStat_StatFuncEntry *backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot);
-static void pgstat_read_current_status(void);
 
 static void pgstat_postmaster_shutdown(int code, Datum arg);
 static void pgstat_apply_pending_tabstats(bool shared, bool force,
@@ -313,12 +288,6 @@ static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
 
 static void pgstat_setup_memcxt(void);
 
-static const char *pgstat_get_wait_activity(WaitEventActivity w);
-static const char *pgstat_get_wait_client(WaitEventClient w);
-static const char *pgstat_get_wait_ipc(WaitEventIPC w);
-static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
-static const char *pgstat_get_wait_io(WaitEventIO w);
-
 static bool pgstat_update_tabentry(dshash_table *tabhash,
                                    PgStat_TableStatus *stat, bool nowait);
 static void pgstat_update_dbentry(PgStat_StatDBEntry *dbentry,
@@ -329,6 +298,14 @@ static void pgstat_update_dbentry(PgStat_StatDBEntry *dbentry,
  * ------------------------------------------------------------
  */
 
+
+void
+pgstat_initialize(void)
+{
+    /* Set up a process-exit hook to clean up */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
+}
+
 /*
  * subroutine for pgstat_reset_all
  */
@@ -490,7 +467,7 @@ pgstat_update_stat(bool force)
          */
         TimestampDifference(last_report, now, &secs, &usecs);
         elapsed = secs * 1000 + usecs /1000;
-        
+
         if(elapsed < PGSTAT_STAT_MIN_INTERVAL)
         {
             /* we know we have some statistics */
@@ -746,7 +723,7 @@ pgstat_apply_tabstat(pgstat_apply_tabstat_context *cxt,
             pgStatBlockReadTime = 0;
             pgStatBlockWriteTime = 0;
         }
-        
+
         cxt->tabhash =
             dshash_attach(area, &dsh_tblparams, cxt->dbentry->tables, 0);
     }
@@ -806,7 +783,7 @@ pgstat_merge_tabentry(PgStat_TableStatus *deststat,
         dest->t_blocks_hit += src->t_blocks_hit;
     }
 }
-        
+
 /*
  * pgstat_update_funcstats: subroutine for pgstat_update_stat
  *
@@ -926,7 +903,7 @@ pgstat_update_funcstats(bool force, PgStat_StatDBEntry *dbentry)
                 hash_search(pgStatPendingFunctions,
                             (void *) &(pendent->functionid), HASH_REMOVE, NULL);
             }
-        }    
+        }
 
         /* destroy the hsah if no entry remains */
         if (hash_get_num_entries(pgStatPendingFunctions) == 0)
@@ -1064,7 +1041,7 @@ pgstat_vacuum_stat(void)
     dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, NULL);
     if (!dbentry)
         return;
-    
+
     /*
      * Similarly to above, make a list of all known relations in this DB.
      */
@@ -2621,66 +2598,6 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     return funcentry;
 }
 
-
-/* ----------
- * pgstat_fetch_stat_beentry() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    our local copy of the current-activity entry for one backend.
- *
- *    NB: caller is responsible for a check if the user is permitted to see
- *    this info (especially the querystring).
- * ----------
- */
-PgBackendStatus *
-pgstat_fetch_stat_beentry(int beid)
-{
-    pgstat_read_current_status();
-
-    if (beid < 1 || beid > localNumBackends)
-        return NULL;
-
-    return &localBackendStatusTable[beid - 1].backendStatus;
-}
-
-
-/* ----------
- * pgstat_fetch_stat_local_beentry() -
- *
- *    Like pgstat_fetch_stat_beentry() but with locally computed additions (like
- *    xid and xmin values of the backend)
- *
- *    NB: caller is responsible for a check if the user is permitted to see
- *    this info (especially the querystring).
- * ----------
- */
-LocalPgBackendStatus *
-pgstat_fetch_stat_local_beentry(int beid)
-{
-    pgstat_read_current_status();
-
-    if (beid < 1 || beid > localNumBackends)
-        return NULL;
-
-    return &localBackendStatusTable[beid - 1];
-}
-
-
-/* ----------
- * pgstat_fetch_stat_numbackends() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the maximum current backend id.
- * ----------
- */
-int
-pgstat_fetch_stat_numbackends(void)
-{
-    pgstat_read_current_status();
-
-    return localNumBackends;
-}
-
 /*
  * ---------
  * pgstat_fetch_stat_archiver() -
@@ -2718,364 +2635,6 @@ pgstat_fetch_global(void)
     return snapshot_globalStats;
 }
 
-
-/* ------------------------------------------------------------
- * Functions for management of the shared-memory PgBackendStatus array
- * ------------------------------------------------------------
- */
-
-static PgBackendStatus *BackendStatusArray = NULL;
-static PgBackendStatus *MyBEEntry = NULL;
-static char *BackendAppnameBuffer = NULL;
-static char *BackendClientHostnameBuffer = NULL;
-static char *BackendActivityBuffer = NULL;
-static Size BackendActivityBufferSize = 0;
-#ifdef USE_SSL
-static PgBackendSSLStatus *BackendSslStatusBuffer = NULL;
-#endif
-
-
-/*
- * Report shared-memory space needed by CreateSharedBackendStatus.
- */
-Size
-BackendStatusShmemSize(void)
-{
-    Size        size;
-
-    /* BackendStatusArray: */
-    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
-    /* BackendAppnameBuffer: */
-    size = add_size(size,
-                    mul_size(NAMEDATALEN, NumBackendStatSlots));
-    /* BackendClientHostnameBuffer: */
-    size = add_size(size,
-                    mul_size(NAMEDATALEN, NumBackendStatSlots));
-    /* BackendActivityBuffer: */
-    size = add_size(size,
-                    mul_size(pgstat_track_activity_query_size, NumBackendStatSlots));
-#ifdef USE_SSL
-    /* BackendSslStatusBuffer: */
-    size = add_size(size,
-                    mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots));
-#endif
-    return size;
-}
-
-/*
- * Initialize the shared status array and several string buffers
- * during postmaster startup.
- */
-void
-CreateSharedBackendStatus(void)
-{
-    Size        size;
-    bool        found;
-    int            i;
-    char       *buffer;
-
-    /* Create or attach to the shared array */
-    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
-    BackendStatusArray = (PgBackendStatus *)
-        ShmemInitStruct("Backend Status Array", size, &found);
-
-    if (!found)
-    {
-        /*
-         * We're the first - initialize.
-         */
-        MemSet(BackendStatusArray, 0, size);
-    }
-
-    /* Create or attach to the shared appname buffer */
-    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
-    BackendAppnameBuffer = (char *)
-        ShmemInitStruct("Backend Application Name Buffer", size, &found);
-
-    if (!found)
-    {
-        MemSet(BackendAppnameBuffer, 0, size);
-
-        /* Initialize st_appname pointers. */
-        buffer = BackendAppnameBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_appname = buffer;
-            buffer += NAMEDATALEN;
-        }
-    }
-
-    /* Create or attach to the shared client hostname buffer */
-    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
-    BackendClientHostnameBuffer = (char *)
-        ShmemInitStruct("Backend Client Host Name Buffer", size, &found);
-
-    if (!found)
-    {
-        MemSet(BackendClientHostnameBuffer, 0, size);
-
-        /* Initialize st_clienthostname pointers. */
-        buffer = BackendClientHostnameBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_clienthostname = buffer;
-            buffer += NAMEDATALEN;
-        }
-    }
-
-    /* Create or attach to the shared activity buffer */
-    BackendActivityBufferSize = mul_size(pgstat_track_activity_query_size,
-                                         NumBackendStatSlots);
-    BackendActivityBuffer = (char *)
-        ShmemInitStruct("Backend Activity Buffer",
-                        BackendActivityBufferSize,
-                        &found);
-
-    if (!found)
-    {
-        MemSet(BackendActivityBuffer, 0, BackendActivityBufferSize);
-
-        /* Initialize st_activity pointers. */
-        buffer = BackendActivityBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_activity_raw = buffer;
-            buffer += pgstat_track_activity_query_size;
-        }
-    }
-
-#ifdef USE_SSL
-    /* Create or attach to the shared SSL status buffer */
-    size = mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots);
-    BackendSslStatusBuffer = (PgBackendSSLStatus *)
-        ShmemInitStruct("Backend SSL Status Buffer", size, &found);
-
-    if (!found)
-    {
-        PgBackendSSLStatus *ptr;
-
-        MemSet(BackendSslStatusBuffer, 0, size);
-
-        /* Initialize st_sslstatus pointers. */
-        ptr = BackendSslStatusBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_sslstatus = ptr;
-            ptr++;
-        }
-    }
-#endif
-}
-
-
-/* ----------
- * pgstat_initialize() -
- *
- *    Initialize pgstats state, and set up our on-proc-exit hook.
- *    Called from InitPostgres and AuxiliaryProcessMain. For auxiliary process,
- *    MyBackendId is invalid. Otherwise, MyBackendId must be set,
- *    but we must not have started any transaction yet (since the
- *    exit hook must run after the last transaction exit).
- *    NOTE: MyDatabaseId isn't set yet; so the shutdown hook has to be careful.
- * ----------
- */
-void
-pgstat_initialize(void)
-{
-    /* Initialize MyBEEntry */
-    if (MyBackendId != InvalidBackendId)
-    {
-        Assert(MyBackendId >= 1 && MyBackendId <= MaxBackends);
-        MyBEEntry = &BackendStatusArray[MyBackendId - 1];
-    }
-    else
-    {
-        /* Must be an auxiliary process */
-        Assert(MyAuxProcType != NotAnAuxProcess);
-
-        /*
-         * Assign the MyBEEntry for an auxiliary process.  Since it doesn't
-         * have a BackendId, the slot is statically allocated based on the
-         * auxiliary process type (MyAuxProcType).  Backends use slots indexed
-         * in the range from 1 to MaxBackends (inclusive), so we use
-         * MaxBackends + AuxBackendType + 1 as the index of the slot for an
-         * auxiliary process.
-         */
-        MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
-    }
-
-    /* Set up a process-exit hook to clean up */
-    before_shmem_exit(pgstat_beshutdown_hook, 0);
-}
-
-/* ----------
- * pgstat_bestart() -
- *
- *    Initialize this backend's entry in the PgBackendStatus array.
- *    Called from InitPostgres.
- *
- *    Apart from auxiliary processes, MyBackendId, MyDatabaseId,
- *    session userid, and application_name must be set for a
- *    backend (hence, this cannot be combined with pgstat_initialize).
- * ----------
- */
-void
-pgstat_bestart(void)
-{
-    SockAddr    clientaddr;
-    volatile PgBackendStatus *beentry;
-
-    /*
-     * To minimize the time spent modifying the PgBackendStatus entry, fetch
-     * all the needed data first.
-     */
-
-    /*
-     * We may not have a MyProcPort (eg, if this is the autovacuum process).
-     * If so, use all-zeroes client address, which is dealt with specially in
-     * pg_stat_get_backend_client_addr and pg_stat_get_backend_client_port.
-     */
-    if (MyProcPort)
-        memcpy(&clientaddr, &MyProcPort->raddr, sizeof(clientaddr));
-    else
-        MemSet(&clientaddr, 0, sizeof(clientaddr));
-
-    /*
-     * Initialize my status entry, following the protocol of bumping
-     * st_changecount before and after; and make sure it's even afterwards. We
-     * use a volatile pointer here to ensure the compiler doesn't try to get
-     * cute.
-     */
-    beentry = MyBEEntry;
-
-    /* pgstats state must be initialized from pgstat_initialize() */
-    Assert(beentry != NULL);
-
-    if (MyBackendId != InvalidBackendId)
-    {
-        if (IsAutoVacuumLauncherProcess())
-        {
-            /* Autovacuum Launcher */
-            beentry->st_backendType = B_AUTOVAC_LAUNCHER;
-        }
-        else if (IsAutoVacuumWorkerProcess())
-        {
-            /* Autovacuum Worker */
-            beentry->st_backendType = B_AUTOVAC_WORKER;
-        }
-        else if (am_walsender)
-        {
-            /* Wal sender */
-            beentry->st_backendType = B_WAL_SENDER;
-        }
-        else if (IsBackgroundWorker)
-        {
-            /* bgworker */
-            beentry->st_backendType = B_BG_WORKER;
-        }
-        else
-        {
-            /* client-backend */
-            beentry->st_backendType = B_BACKEND;
-        }
-    }
-    else
-    {
-        /* Must be an auxiliary process */
-        Assert(MyAuxProcType != NotAnAuxProcess);
-        switch (MyAuxProcType)
-        {
-            case StartupProcess:
-                beentry->st_backendType = B_STARTUP;
-                break;
-            case BgWriterProcess:
-                beentry->st_backendType = B_BG_WRITER;
-                break;
-            case ArchiverProcess:
-                beentry->st_backendType = B_ARCHIVER;
-                break;
-            case CheckpointerProcess:
-                beentry->st_backendType = B_CHECKPOINTER;
-                break;
-            case WalWriterProcess:
-                beentry->st_backendType = B_WAL_WRITER;
-                break;
-            case WalReceiverProcess:
-                beentry->st_backendType = B_WAL_RECEIVER;
-                break;
-            default:
-                elog(FATAL, "unrecognized process type: %d",
-                     (int) MyAuxProcType);
-                proc_exit(1);
-        }
-    }
-
-    do
-    {
-        pgstat_increment_changecount_before(beentry);
-    } while ((beentry->st_changecount & 1) == 0);
-
-    beentry->st_procpid = MyProcPid;
-    beentry->st_proc_start_timestamp = MyStartTimestamp;
-    beentry->st_activity_start_timestamp = 0;
-    beentry->st_state_start_timestamp = 0;
-    beentry->st_xact_start_timestamp = 0;
-    beentry->st_databaseid = MyDatabaseId;
-
-    /* We have userid for client-backends, wal-sender and bgworker processes */
-    if (beentry->st_backendType == B_BACKEND
-        || beentry->st_backendType == B_WAL_SENDER
-        || beentry->st_backendType == B_BG_WORKER)
-        beentry->st_userid = GetSessionUserId();
-    else
-        beentry->st_userid = InvalidOid;
-
-    beentry->st_clientaddr = clientaddr;
-    if (MyProcPort && MyProcPort->remote_hostname)
-        strlcpy(beentry->st_clienthostname, MyProcPort->remote_hostname,
-                NAMEDATALEN);
-    else
-        beentry->st_clienthostname[0] = '\0';
-#ifdef USE_SSL
-    if (MyProcPort && MyProcPort->ssl != NULL)
-    {
-        beentry->st_ssl = true;
-        beentry->st_sslstatus->ssl_bits = be_tls_get_cipher_bits(MyProcPort);
-        beentry->st_sslstatus->ssl_compression = be_tls_get_compression(MyProcPort);
-        strlcpy(beentry->st_sslstatus->ssl_version, be_tls_get_version(MyProcPort), NAMEDATALEN);
-        strlcpy(beentry->st_sslstatus->ssl_cipher, be_tls_get_cipher(MyProcPort), NAMEDATALEN);
-        be_tls_get_peerdn_name(MyProcPort, beentry->st_sslstatus->ssl_clientdn, NAMEDATALEN);
-    }
-    else
-    {
-        beentry->st_ssl = false;
-    }
-#else
-    beentry->st_ssl = false;
-#endif
-    beentry->st_state = STATE_UNDEFINED;
-    beentry->st_appname[0] = '\0';
-    beentry->st_activity_raw[0] = '\0';
-    /* Also make sure the last byte in each string area is always 0 */
-    beentry->st_clienthostname[NAMEDATALEN - 1] = '\0';
-    beentry->st_appname[NAMEDATALEN - 1] = '\0';
-    beentry->st_activity_raw[pgstat_track_activity_query_size - 1] = '\0';
-    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
-    beentry->st_progress_command_target = InvalidOid;
-
-    /*
-     * we don't zero st_progress_param here to save cycles; nobody should
-     * examine it until st_progress_command has been set to something other
-     * than PROGRESS_COMMAND_INVALID
-     */
-
-    pgstat_increment_changecount_after(beentry);
-
-    /* Update app name to current GUC setting */
-    if (application_name)
-        pgstat_report_appname(application_name);
-}
-
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
@@ -3088,8 +2647,6 @@ pgstat_bestart(void)
 static void
 pgstat_beshutdown_hook(int code, Datum arg)
 {
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
     /*
      * If we got as far as discovering our own database ID, we can report what
      * we did to the collector.  Otherwise, we'd be sending an invalid
@@ -3098,1188 +2655,9 @@ pgstat_beshutdown_hook(int code, Datum arg)
      */
     if (OidIsValid(MyDatabaseId))
         pgstat_update_stat(true);
-
-    /*
-     * Clear my status entry, following the protocol of bumping st_changecount
-     * before and after.  We use a volatile pointer here to ensure the
-     * compiler doesn't try to get cute.
-     */
-    pgstat_increment_changecount_before(beentry);
-
-    beentry->st_procpid = 0;    /* mark invalid */
-
-    pgstat_increment_changecount_after(beentry);
 }
 
 
-/* ----------
- * pgstat_report_activity() -
- *
- *    Called from tcop/postgres.c to report what the backend is actually doing
- *    (but note cmd_str can be NULL for certain cases).
- *
- * All updates of the status entry follow the protocol of bumping
- * st_changecount before and after.  We use a volatile pointer here to
- * ensure the compiler doesn't try to get cute.
- * ----------
- */
-void
-pgstat_report_activity(BackendState state, const char *cmd_str)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-    TimestampTz start_timestamp;
-    TimestampTz current_timestamp;
-    int            len = 0;
-
-    TRACE_POSTGRESQL_STATEMENT_STATUS(cmd_str);
-
-    if (!beentry)
-        return;
-
-    if (!pgstat_track_activities)
-    {
-        if (beentry->st_state != STATE_DISABLED)
-        {
-            volatile PGPROC *proc = MyProc;
-
-            /*
-             * track_activities is disabled, but we last reported a
-             * non-disabled state.  As our final update, change the state and
-             * clear fields we will not be updating anymore.
-             */
-            pgstat_increment_changecount_before(beentry);
-            beentry->st_state = STATE_DISABLED;
-            beentry->st_state_start_timestamp = 0;
-            beentry->st_activity_raw[0] = '\0';
-            beentry->st_activity_start_timestamp = 0;
-            /* st_xact_start_timestamp and wait_event_info are also disabled */
-            beentry->st_xact_start_timestamp = 0;
-            proc->wait_event_info = 0;
-            pgstat_increment_changecount_after(beentry);
-        }
-        return;
-    }
-
-    /*
-     * To minimize the time spent modifying the entry, fetch all the needed
-     * data first.
-     */
-    start_timestamp = GetCurrentStatementStartTimestamp();
-    if (cmd_str != NULL)
-    {
-        /*
-         * Compute length of to-be-stored string unaware of multi-byte
-         * characters. For speed reasons that'll get corrected on read, rather
-         * than computed every write.
-         */
-        len = Min(strlen(cmd_str), pgstat_track_activity_query_size - 1);
-    }
-    current_timestamp = GetCurrentTimestamp();
-
-    /*
-     * Now update the status entry
-     */
-    pgstat_increment_changecount_before(beentry);
-
-    beentry->st_state = state;
-    beentry->st_state_start_timestamp = current_timestamp;
-
-    if (cmd_str != NULL)
-    {
-        memcpy((char *) beentry->st_activity_raw, cmd_str, len);
-        beentry->st_activity_raw[len] = '\0';
-        beentry->st_activity_start_timestamp = start_timestamp;
-    }
-
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_start_command() -
- *
- * Set st_progress_command (and st_progress_command_target) in own backend
- * entry.  Also, zero-initialize st_progress_param array.
- *-----------
- */
-void
-pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    if (!beentry || !pgstat_track_activities)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_progress_command = cmdtype;
-    beentry->st_progress_command_target = relid;
-    MemSet(&beentry->st_progress_param, 0, sizeof(beentry->st_progress_param));
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_update_param() -
- *
- * Update index'th member in st_progress_param[] of own backend entry.
- *-----------
- */
-void
-pgstat_progress_update_param(int index, int64 val)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    Assert(index >= 0 && index < PGSTAT_NUM_PROGRESS_PARAM);
-
-    if (!beentry || !pgstat_track_activities)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_progress_param[index] = val;
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_update_multi_param() -
- *
- * Update multiple members in st_progress_param[] of own backend entry.
- * This is atomic; readers won't see intermediate states.
- *-----------
- */
-void
-pgstat_progress_update_multi_param(int nparam, const int *index,
-                                   const int64 *val)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-    int            i;
-
-    if (!beentry || !pgstat_track_activities || nparam == 0)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-
-    for (i = 0; i < nparam; ++i)
-    {
-        Assert(index[i] >= 0 && index[i] < PGSTAT_NUM_PROGRESS_PARAM);
-
-        beentry->st_progress_param[index[i]] = val[i];
-    }
-
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_end_command() -
- *
- * Reset st_progress_command (and st_progress_command_target) in own backend
- * entry.  This signals the end of the command.
- *-----------
- */
-void
-pgstat_progress_end_command(void)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    if (!beentry)
-        return;
-    if (!pgstat_track_activities
-        && beentry->st_progress_command == PROGRESS_COMMAND_INVALID)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
-    beentry->st_progress_command_target = InvalidOid;
-    pgstat_increment_changecount_after(beentry);
-}
-
-/* ----------
- * pgstat_report_appname() -
- *
- *    Called to update our application name.
- * ----------
- */
-void
-pgstat_report_appname(const char *appname)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-    int            len;
-
-    if (!beentry)
-        return;
-
-    /* This should be unnecessary if GUC did its job, but be safe */
-    len = pg_mbcliplen(appname, strlen(appname), NAMEDATALEN - 1);
-
-    /*
-     * Update my status entry, following the protocol of bumping
-     * st_changecount before and after.  We use a volatile pointer here to
-     * ensure the compiler doesn't try to get cute.
-     */
-    pgstat_increment_changecount_before(beentry);
-
-    memcpy((char *) beentry->st_appname, appname, len);
-    beentry->st_appname[len] = '\0';
-
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*
- * Report current transaction start timestamp as the specified value.
- * Zero means there is no active transaction.
- */
-void
-pgstat_report_xact_timestamp(TimestampTz tstamp)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    if (!pgstat_track_activities || !beentry)
-        return;
-
-    /*
-     * Update my status entry, following the protocol of bumping
-     * st_changecount before and after.  We use a volatile pointer here to
-     * ensure the compiler doesn't try to get cute.
-     */
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_xact_start_timestamp = tstamp;
-    pgstat_increment_changecount_after(beentry);
-}
-
-/* ----------
- * pgstat_read_current_status() -
- *
- *    Copy the current contents of the PgBackendStatus array to local memory,
- *    if not already done in this transaction.
- * ----------
- */
-static void
-pgstat_read_current_status(void)
-{
-    volatile PgBackendStatus *beentry;
-    LocalPgBackendStatus *localtable;
-    LocalPgBackendStatus *localentry;
-    char       *localappname,
-               *localclienthostname,
-               *localactivity;
-#ifdef USE_SSL
-    PgBackendSSLStatus *localsslstatus;
-#endif
-    int            i;
-
-    Assert(IsUnderPostmaster);
-
-    if (localBackendStatusTable)
-        return;                    /* already done */
-
-    pgstat_setup_memcxt();
-
-    localtable = (LocalPgBackendStatus *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           sizeof(LocalPgBackendStatus) * NumBackendStatSlots);
-    localappname = (char *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           NAMEDATALEN * NumBackendStatSlots);
-    localclienthostname = (char *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           NAMEDATALEN * NumBackendStatSlots);
-    localactivity = (char *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           pgstat_track_activity_query_size * NumBackendStatSlots);
-#ifdef USE_SSL
-    localsslstatus = (PgBackendSSLStatus *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           sizeof(PgBackendSSLStatus) * NumBackendStatSlots);
-#endif
-
-    localNumBackends = 0;
-
-    beentry = BackendStatusArray;
-    localentry = localtable;
-    for (i = 1; i <= NumBackendStatSlots; i++)
-    {
-        /*
-         * Follow the protocol of retrying if st_changecount changes while we
-         * copy the entry, or if it's odd.  (The check for odd is needed to
-         * cover the case where we are able to completely copy the entry while
-         * the source backend is between increment steps.)    We use a volatile
-         * pointer here to ensure the compiler doesn't try to get cute.
-         */
-        for (;;)
-        {
-            int            before_changecount;
-            int            after_changecount;
-
-            pgstat_save_changecount_before(beentry, before_changecount);
-
-            localentry->backendStatus.st_procpid = beentry->st_procpid;
-            if (localentry->backendStatus.st_procpid > 0)
-            {
-                memcpy(&localentry->backendStatus, (char *) beentry, sizeof(PgBackendStatus));
-
-                /*
-                 * strcpy is safe even if the string is modified concurrently,
-                 * because there's always a \0 at the end of the buffer.
-                 */
-                strcpy(localappname, (char *) beentry->st_appname);
-                localentry->backendStatus.st_appname = localappname;
-                strcpy(localclienthostname, (char *) beentry->st_clienthostname);
-                localentry->backendStatus.st_clienthostname = localclienthostname;
-                strcpy(localactivity, (char *) beentry->st_activity_raw);
-                localentry->backendStatus.st_activity_raw = localactivity;
-                localentry->backendStatus.st_ssl = beentry->st_ssl;
-#ifdef USE_SSL
-                if (beentry->st_ssl)
-                {
-                    memcpy(localsslstatus, beentry->st_sslstatus, sizeof(PgBackendSSLStatus));
-                    localentry->backendStatus.st_sslstatus = localsslstatus;
-                }
-#endif
-            }
-
-            pgstat_save_changecount_after(beentry, after_changecount);
-            if (before_changecount == after_changecount &&
-                (before_changecount & 1) == 0)
-                break;
-
-            /* Make sure we can break out of loop if stuck... */
-            CHECK_FOR_INTERRUPTS();
-        }
-
-        beentry++;
-        /* Only valid entries get included into the local array */
-        if (localentry->backendStatus.st_procpid > 0)
-        {
-            BackendIdGetTransactionIds(i,
-                                       &localentry->backend_xid,
-                                       &localentry->backend_xmin);
-
-            localentry++;
-            localappname += NAMEDATALEN;
-            localclienthostname += NAMEDATALEN;
-            localactivity += pgstat_track_activity_query_size;
-#ifdef USE_SSL
-            localsslstatus++;
-#endif
-            localNumBackends++;
-        }
-    }
-
-    /* Set the pointer only after completion of a valid table */
-    localBackendStatusTable = localtable;
-}
-
-/* ----------
- * pgstat_get_wait_event_type() -
- *
- *    Return a string representing the current wait event type, backend is
- *    waiting on.
- */
-const char *
-pgstat_get_wait_event_type(uint32 wait_event_info)
-{
-    uint32        classId;
-    const char *event_type;
-
-    /* report process as not waiting. */
-    if (wait_event_info == 0)
-        return NULL;
-
-    classId = wait_event_info & 0xFF000000;
-
-    switch (classId)
-    {
-        case PG_WAIT_LWLOCK:
-            event_type = "LWLock";
-            break;
-        case PG_WAIT_LOCK:
-            event_type = "Lock";
-            break;
-        case PG_WAIT_BUFFER_PIN:
-            event_type = "BufferPin";
-            break;
-        case PG_WAIT_ACTIVITY:
-            event_type = "Activity";
-            break;
-        case PG_WAIT_CLIENT:
-            event_type = "Client";
-            break;
-        case PG_WAIT_EXTENSION:
-            event_type = "Extension";
-            break;
-        case PG_WAIT_IPC:
-            event_type = "IPC";
-            break;
-        case PG_WAIT_TIMEOUT:
-            event_type = "Timeout";
-            break;
-        case PG_WAIT_IO:
-            event_type = "IO";
-            break;
-        default:
-            event_type = "???";
-            break;
-    }
-
-    return event_type;
-}
-
-/* ----------
- * pgstat_get_wait_event() -
- *
- *    Return a string representing the current wait event, backend is
- *    waiting on.
- */
-const char *
-pgstat_get_wait_event(uint32 wait_event_info)
-{
-    uint32        classId;
-    uint16        eventId;
-    const char *event_name;
-
-    /* report process as not waiting. */
-    if (wait_event_info == 0)
-        return NULL;
-
-    classId = wait_event_info & 0xFF000000;
-    eventId = wait_event_info & 0x0000FFFF;
-
-    switch (classId)
-    {
-        case PG_WAIT_LWLOCK:
-            event_name = GetLWLockIdentifier(classId, eventId);
-            break;
-        case PG_WAIT_LOCK:
-            event_name = GetLockNameFromTagType(eventId);
-            break;
-        case PG_WAIT_BUFFER_PIN:
-            event_name = "BufferPin";
-            break;
-        case PG_WAIT_ACTIVITY:
-            {
-                WaitEventActivity w = (WaitEventActivity) wait_event_info;
-
-                event_name = pgstat_get_wait_activity(w);
-                break;
-            }
-        case PG_WAIT_CLIENT:
-            {
-                WaitEventClient w = (WaitEventClient) wait_event_info;
-
-                event_name = pgstat_get_wait_client(w);
-                break;
-            }
-        case PG_WAIT_EXTENSION:
-            event_name = "Extension";
-            break;
-        case PG_WAIT_IPC:
-            {
-                WaitEventIPC w = (WaitEventIPC) wait_event_info;
-
-                event_name = pgstat_get_wait_ipc(w);
-                break;
-            }
-        case PG_WAIT_TIMEOUT:
-            {
-                WaitEventTimeout w = (WaitEventTimeout) wait_event_info;
-
-                event_name = pgstat_get_wait_timeout(w);
-                break;
-            }
-        case PG_WAIT_IO:
-            {
-                WaitEventIO w = (WaitEventIO) wait_event_info;
-
-                event_name = pgstat_get_wait_io(w);
-                break;
-            }
-        default:
-            event_name = "unknown wait event";
-            break;
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_activity() -
- *
- * Convert WaitEventActivity to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_activity(WaitEventActivity w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_ARCHIVER_MAIN:
-            event_name = "ArchiverMain";
-            break;
-        case WAIT_EVENT_AUTOVACUUM_MAIN:
-            event_name = "AutoVacuumMain";
-            break;
-        case WAIT_EVENT_BGWRITER_HIBERNATE:
-            event_name = "BgWriterHibernate";
-            break;
-        case WAIT_EVENT_BGWRITER_MAIN:
-            event_name = "BgWriterMain";
-            break;
-        case WAIT_EVENT_CHECKPOINTER_MAIN:
-            event_name = "CheckpointerMain";
-            break;
-        case WAIT_EVENT_LOGICAL_APPLY_MAIN:
-            event_name = "LogicalApplyMain";
-            break;
-        case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
-            event_name = "LogicalLauncherMain";
-            break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
-            break;
-        case WAIT_EVENT_RECOVERY_WAL_ALL:
-            event_name = "RecoveryWalAll";
-            break;
-        case WAIT_EVENT_RECOVERY_WAL_STREAM:
-            event_name = "RecoveryWalStream";
-            break;
-        case WAIT_EVENT_SYSLOGGER_MAIN:
-            event_name = "SysLoggerMain";
-            break;
-        case WAIT_EVENT_WAL_RECEIVER_MAIN:
-            event_name = "WalReceiverMain";
-            break;
-        case WAIT_EVENT_WAL_SENDER_MAIN:
-            event_name = "WalSenderMain";
-            break;
-        case WAIT_EVENT_WAL_WRITER_MAIN:
-            event_name = "WalWriterMain";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_client() -
- *
- * Convert WaitEventClient to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_client(WaitEventClient w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_CLIENT_READ:
-            event_name = "ClientRead";
-            break;
-        case WAIT_EVENT_CLIENT_WRITE:
-            event_name = "ClientWrite";
-            break;
-        case WAIT_EVENT_LIBPQWALRECEIVER_CONNECT:
-            event_name = "LibPQWalReceiverConnect";
-            break;
-        case WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE:
-            event_name = "LibPQWalReceiverReceive";
-            break;
-        case WAIT_EVENT_SSL_OPEN_SERVER:
-            event_name = "SSLOpenServer";
-            break;
-        case WAIT_EVENT_WAL_RECEIVER_WAIT_START:
-            event_name = "WalReceiverWaitStart";
-            break;
-        case WAIT_EVENT_WAL_SENDER_WAIT_WAL:
-            event_name = "WalSenderWaitForWAL";
-            break;
-        case WAIT_EVENT_WAL_SENDER_WRITE_DATA:
-            event_name = "WalSenderWriteData";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_ipc() -
- *
- * Convert WaitEventIPC to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_ipc(WaitEventIPC w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_BGWORKER_SHUTDOWN:
-            event_name = "BgWorkerShutdown";
-            break;
-        case WAIT_EVENT_BGWORKER_STARTUP:
-            event_name = "BgWorkerStartup";
-            break;
-        case WAIT_EVENT_BTREE_PAGE:
-            event_name = "BtreePage";
-            break;
-        case WAIT_EVENT_CLOG_GROUP_UPDATE:
-            event_name = "ClogGroupUpdate";
-            break;
-        case WAIT_EVENT_EXECUTE_GATHER:
-            event_name = "ExecuteGather";
-            break;
-        case WAIT_EVENT_HASH_BATCH_ALLOCATING:
-            event_name = "Hash/Batch/Allocating";
-            break;
-        case WAIT_EVENT_HASH_BATCH_ELECTING:
-            event_name = "Hash/Batch/Electing";
-            break;
-        case WAIT_EVENT_HASH_BATCH_LOADING:
-            event_name = "Hash/Batch/Loading";
-            break;
-        case WAIT_EVENT_HASH_BUILD_ALLOCATING:
-            event_name = "Hash/Build/Allocating";
-            break;
-        case WAIT_EVENT_HASH_BUILD_ELECTING:
-            event_name = "Hash/Build/Electing";
-            break;
-        case WAIT_EVENT_HASH_BUILD_HASHING_INNER:
-            event_name = "Hash/Build/HashingInner";
-            break;
-        case WAIT_EVENT_HASH_BUILD_HASHING_OUTER:
-            event_name = "Hash/Build/HashingOuter";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING:
-            event_name = "Hash/GrowBatches/Allocating";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_DECIDING:
-            event_name = "Hash/GrowBatches/Deciding";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_ELECTING:
-            event_name = "Hash/GrowBatches/Electing";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_FINISHING:
-            event_name = "Hash/GrowBatches/Finishing";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING:
-            event_name = "Hash/GrowBatches/Repartitioning";
-            break;
-        case WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING:
-            event_name = "Hash/GrowBuckets/Allocating";
-            break;
-        case WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING:
-            event_name = "Hash/GrowBuckets/Electing";
-            break;
-        case WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING:
-            event_name = "Hash/GrowBuckets/Reinserting";
-            break;
-        case WAIT_EVENT_LOGICAL_SYNC_DATA:
-            event_name = "LogicalSyncData";
-            break;
-        case WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE:
-            event_name = "LogicalSyncStateChange";
-            break;
-        case WAIT_EVENT_MQ_INTERNAL:
-            event_name = "MessageQueueInternal";
-            break;
-        case WAIT_EVENT_MQ_PUT_MESSAGE:
-            event_name = "MessageQueuePutMessage";
-            break;
-        case WAIT_EVENT_MQ_RECEIVE:
-            event_name = "MessageQueueReceive";
-            break;
-        case WAIT_EVENT_MQ_SEND:
-            event_name = "MessageQueueSend";
-            break;
-        case WAIT_EVENT_PARALLEL_BITMAP_SCAN:
-            event_name = "ParallelBitmapScan";
-            break;
-        case WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN:
-            event_name = "ParallelCreateIndexScan";
-            break;
-        case WAIT_EVENT_PARALLEL_FINISH:
-            event_name = "ParallelFinish";
-            break;
-        case WAIT_EVENT_PROCARRAY_GROUP_UPDATE:
-            event_name = "ProcArrayGroupUpdate";
-            break;
-        case WAIT_EVENT_PROMOTE:
-            event_name = "Promote";
-            break;
-        case WAIT_EVENT_REPLICATION_ORIGIN_DROP:
-            event_name = "ReplicationOriginDrop";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_DROP:
-            event_name = "ReplicationSlotDrop";
-            break;
-        case WAIT_EVENT_SAFE_SNAPSHOT:
-            event_name = "SafeSnapshot";
-            break;
-        case WAIT_EVENT_SYNC_REP:
-            event_name = "SyncRep";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_timeout() -
- *
- * Convert WaitEventTimeout to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_timeout(WaitEventTimeout w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_BASE_BACKUP_THROTTLE:
-            event_name = "BaseBackupThrottle";
-            break;
-        case WAIT_EVENT_PG_SLEEP:
-            event_name = "PgSleep";
-            break;
-        case WAIT_EVENT_RECOVERY_APPLY_DELAY:
-            event_name = "RecoveryApplyDelay";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_io() -
- *
- * Convert WaitEventIO to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_io(WaitEventIO w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_BUFFILE_READ:
-            event_name = "BufFileRead";
-            break;
-        case WAIT_EVENT_BUFFILE_WRITE:
-            event_name = "BufFileWrite";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_READ:
-            event_name = "ControlFileRead";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_SYNC:
-            event_name = "ControlFileSync";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE:
-            event_name = "ControlFileSyncUpdate";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_WRITE:
-            event_name = "ControlFileWrite";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE:
-            event_name = "ControlFileWriteUpdate";
-            break;
-        case WAIT_EVENT_COPY_FILE_READ:
-            event_name = "CopyFileRead";
-            break;
-        case WAIT_EVENT_COPY_FILE_WRITE:
-            event_name = "CopyFileWrite";
-            break;
-        case WAIT_EVENT_DATA_FILE_EXTEND:
-            event_name = "DataFileExtend";
-            break;
-        case WAIT_EVENT_DATA_FILE_FLUSH:
-            event_name = "DataFileFlush";
-            break;
-        case WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC:
-            event_name = "DataFileImmediateSync";
-            break;
-        case WAIT_EVENT_DATA_FILE_PREFETCH:
-            event_name = "DataFilePrefetch";
-            break;
-        case WAIT_EVENT_DATA_FILE_READ:
-            event_name = "DataFileRead";
-            break;
-        case WAIT_EVENT_DATA_FILE_SYNC:
-            event_name = "DataFileSync";
-            break;
-        case WAIT_EVENT_DATA_FILE_TRUNCATE:
-            event_name = "DataFileTruncate";
-            break;
-        case WAIT_EVENT_DATA_FILE_WRITE:
-            event_name = "DataFileWrite";
-            break;
-        case WAIT_EVENT_DSM_FILL_ZERO_WRITE:
-            event_name = "DSMFillZeroWrite";
-            break;
-        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ:
-            event_name = "LockFileAddToDataDirRead";
-            break;
-        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC:
-            event_name = "LockFileAddToDataDirSync";
-            break;
-        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE:
-            event_name = "LockFileAddToDataDirWrite";
-            break;
-        case WAIT_EVENT_LOCK_FILE_CREATE_READ:
-            event_name = "LockFileCreateRead";
-            break;
-        case WAIT_EVENT_LOCK_FILE_CREATE_SYNC:
-            event_name = "LockFileCreateSync";
-            break;
-        case WAIT_EVENT_LOCK_FILE_CREATE_WRITE:
-            event_name = "LockFileCreateWRITE";
-            break;
-        case WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ:
-            event_name = "LockFileReCheckDataDirRead";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC:
-            event_name = "LogicalRewriteCheckpointSync";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC:
-            event_name = "LogicalRewriteMappingSync";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE:
-            event_name = "LogicalRewriteMappingWrite";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_SYNC:
-            event_name = "LogicalRewriteSync";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE:
-            event_name = "LogicalRewriteTruncate";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_WRITE:
-            event_name = "LogicalRewriteWrite";
-            break;
-        case WAIT_EVENT_RELATION_MAP_READ:
-            event_name = "RelationMapRead";
-            break;
-        case WAIT_EVENT_RELATION_MAP_SYNC:
-            event_name = "RelationMapSync";
-            break;
-        case WAIT_EVENT_RELATION_MAP_WRITE:
-            event_name = "RelationMapWrite";
-            break;
-        case WAIT_EVENT_REORDER_BUFFER_READ:
-            event_name = "ReorderBufferRead";
-            break;
-        case WAIT_EVENT_REORDER_BUFFER_WRITE:
-            event_name = "ReorderBufferWrite";
-            break;
-        case WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ:
-            event_name = "ReorderLogicalMappingRead";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_READ:
-            event_name = "ReplicationSlotRead";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC:
-            event_name = "ReplicationSlotRestoreSync";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_SYNC:
-            event_name = "ReplicationSlotSync";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_WRITE:
-            event_name = "ReplicationSlotWrite";
-            break;
-        case WAIT_EVENT_SLRU_FLUSH_SYNC:
-            event_name = "SLRUFlushSync";
-            break;
-        case WAIT_EVENT_SLRU_READ:
-            event_name = "SLRURead";
-            break;
-        case WAIT_EVENT_SLRU_SYNC:
-            event_name = "SLRUSync";
-            break;
-        case WAIT_EVENT_SLRU_WRITE:
-            event_name = "SLRUWrite";
-            break;
-        case WAIT_EVENT_SNAPBUILD_READ:
-            event_name = "SnapbuildRead";
-            break;
-        case WAIT_EVENT_SNAPBUILD_SYNC:
-            event_name = "SnapbuildSync";
-            break;
-        case WAIT_EVENT_SNAPBUILD_WRITE:
-            event_name = "SnapbuildWrite";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC:
-            event_name = "TimelineHistoryFileSync";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE:
-            event_name = "TimelineHistoryFileWrite";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_READ:
-            event_name = "TimelineHistoryRead";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_SYNC:
-            event_name = "TimelineHistorySync";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_WRITE:
-            event_name = "TimelineHistoryWrite";
-            break;
-        case WAIT_EVENT_TWOPHASE_FILE_READ:
-            event_name = "TwophaseFileRead";
-            break;
-        case WAIT_EVENT_TWOPHASE_FILE_SYNC:
-            event_name = "TwophaseFileSync";
-            break;
-        case WAIT_EVENT_TWOPHASE_FILE_WRITE:
-            event_name = "TwophaseFileWrite";
-            break;
-        case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ:
-            event_name = "WALSenderTimelineHistoryRead";
-            break;
-        case WAIT_EVENT_WAL_BOOTSTRAP_SYNC:
-            event_name = "WALBootstrapSync";
-            break;
-        case WAIT_EVENT_WAL_BOOTSTRAP_WRITE:
-            event_name = "WALBootstrapWrite";
-            break;
-        case WAIT_EVENT_WAL_COPY_READ:
-            event_name = "WALCopyRead";
-            break;
-        case WAIT_EVENT_WAL_COPY_SYNC:
-            event_name = "WALCopySync";
-            break;
-        case WAIT_EVENT_WAL_COPY_WRITE:
-            event_name = "WALCopyWrite";
-            break;
-        case WAIT_EVENT_WAL_INIT_SYNC:
-            event_name = "WALInitSync";
-            break;
-        case WAIT_EVENT_WAL_INIT_WRITE:
-            event_name = "WALInitWrite";
-            break;
-        case WAIT_EVENT_WAL_READ:
-            event_name = "WALRead";
-            break;
-        case WAIT_EVENT_WAL_SYNC:
-            event_name = "WALSync";
-            break;
-        case WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN:
-            event_name = "WALSyncMethodAssign";
-            break;
-        case WAIT_EVENT_WAL_WRITE:
-            event_name = "WALWrite";
-            break;
-
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-
-/* ----------
- * pgstat_get_backend_current_activity() -
- *
- *    Return a string representing the current activity of the backend with
- *    the specified PID.  This looks directly at the BackendStatusArray,
- *    and so will provide current information regardless of the age of our
- *    transaction's snapshot of the status array.
- *
- *    It is the caller's responsibility to invoke this only for backends whose
- *    state is expected to remain stable while the result is in use.  The
- *    only current use is in deadlock reporting, where we can expect that
- *    the target backend is blocked on a lock.  (There are corner cases
- *    where the target's wait could get aborted while we are looking at it,
- *    but the very worst consequence is to return a pointer to a string
- *    that's been changed, so we won't worry too much.)
- *
- *    Note: return strings for special cases match pg_stat_get_backend_activity.
- * ----------
- */
-const char *
-pgstat_get_backend_current_activity(int pid, bool checkUser)
-{
-    PgBackendStatus *beentry;
-    int            i;
-
-    beentry = BackendStatusArray;
-    for (i = 1; i <= MaxBackends; i++)
-    {
-        /*
-         * Although we expect the target backend's entry to be stable, that
-         * doesn't imply that anyone else's is.  To avoid identifying the
-         * wrong backend, while we check for a match to the desired PID we
-         * must follow the protocol of retrying if st_changecount changes
-         * while we examine the entry, or if it's odd.  (This might be
-         * unnecessary, since fetching or storing an int is almost certainly
-         * atomic, but let's play it safe.)  We use a volatile pointer here to
-         * ensure the compiler doesn't try to get cute.
-         */
-        volatile PgBackendStatus *vbeentry = beentry;
-        bool        found;
-
-        for (;;)
-        {
-            int            before_changecount;
-            int            after_changecount;
-
-            pgstat_save_changecount_before(vbeentry, before_changecount);
-
-            found = (vbeentry->st_procpid == pid);
-
-            pgstat_save_changecount_after(vbeentry, after_changecount);
-
-            if (before_changecount == after_changecount &&
-                (before_changecount & 1) == 0)
-                break;
-
-            /* Make sure we can break out of loop if stuck... */
-            CHECK_FOR_INTERRUPTS();
-        }
-
-        if (found)
-        {
-            /* Now it is safe to use the non-volatile pointer */
-            if (checkUser && !superuser() && beentry->st_userid != GetUserId())
-                return "<insufficient privilege>";
-            else if (*(beentry->st_activity_raw) == '\0')
-                return "<command string not enabled>";
-            else
-            {
-                /* this'll leak a bit of memory, but that seems acceptable */
-                return pgstat_clip_activity(beentry->st_activity_raw);
-            }
-        }
-
-        beentry++;
-    }
-
-    /* If we get here, caller is in error ... */
-    return "<backend information not available>";
-}
-
-/* ----------
- * pgstat_get_crashed_backend_activity() -
- *
- *    Return a string representing the current activity of the backend with
- *    the specified PID.  Like the function above, but reads shared memory with
- *    the expectation that it may be corrupt.  On success, copy the string
- *    into the "buffer" argument and return that pointer.  On failure,
- *    return NULL.
- *
- *    This function is only intended to be used by the postmaster to report the
- *    query that crashed a backend.  In particular, no attempt is made to
- *    follow the correct concurrency protocol when accessing the
- *    BackendStatusArray.  But that's OK, in the worst case we'll return a
- *    corrupted message.  We also must take care not to trip on ereport(ERROR).
- * ----------
- */
-const char *
-pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
-{
-    volatile PgBackendStatus *beentry;
-    int            i;
-
-    beentry = BackendStatusArray;
-
-    /*
-     * We probably shouldn't get here before shared memory has been set up,
-     * but be safe.
-     */
-    if (beentry == NULL || BackendActivityBuffer == NULL)
-        return NULL;
-
-    for (i = 1; i <= MaxBackends; i++)
-    {
-        if (beentry->st_procpid == pid)
-        {
-            /* Read pointer just once, so it can't change after validation */
-            const char *activity = beentry->st_activity_raw;
-            const char *activity_last;
-
-            /*
-             * We mustn't access activity string before we verify that it
-             * falls within the BackendActivityBuffer. To make sure that the
-             * entire string including its ending is contained within the
-             * buffer, subtract one activity length from the buffer size.
-             */
-            activity_last = BackendActivityBuffer + BackendActivityBufferSize
-                - pgstat_track_activity_query_size;
-
-            if (activity < BackendActivityBuffer ||
-                activity > activity_last)
-                return NULL;
-
-            /* If no string available, no point in a report */
-            if (activity[0] == '\0')
-                return NULL;
-
-            /*
-             * Copy only ASCII-safe characters so we don't run into encoding
-             * problems when reporting the message; and be sure not to run off
-             * the end of memory.  As only ASCII characters are reported, it
-             * doesn't seem necessary to perform multibyte aware clipping.
-             */
-            ascii_safe_strlcpy(buffer, activity,
-                               Min(buflen, pgstat_track_activity_query_size));
-
-            return buffer;
-        }
-
-        beentry++;
-    }
-
-    /* PID not found */
-    return NULL;
-}
-
-const char *
-pgstat_get_backend_desc(BackendType backendType)
-{
-    const char *backendDesc = "unknown process type";
-
-    switch (backendType)
-    {
-        case B_AUTOVAC_LAUNCHER:
-            backendDesc = "autovacuum launcher";
-            break;
-        case B_AUTOVAC_WORKER:
-            backendDesc = "autovacuum worker";
-            break;
-        case B_BACKEND:
-            backendDesc = "client backend";
-            break;
-        case B_BG_WORKER:
-            backendDesc = "background worker";
-            break;
-        case B_BG_WRITER:
-            backendDesc = "background writer";
-            break;
-        case B_ARCHIVER:
-            backendDesc = "archiver";
-            break;
-        case B_CHECKPOINTER:
-            backendDesc = "checkpointer";
-            break;
-        case B_STARTUP:
-            backendDesc = "startup";
-            break;
-        case B_WAL_RECEIVER:
-            backendDesc = "walreceiver";
-            break;
-        case B_WAL_SENDER:
-            backendDesc = "walsender";
-            break;
-        case B_WAL_WRITER:
-            backendDesc = "walwriter";
-            break;
-    }
-
-    return backendDesc;
-}
-
 /* ------------------------------------------------------------
  * Local support functions follow
  * ------------------------------------------------------------
@@ -5422,22 +3800,6 @@ backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot)
                               funcid);
 }
 
-/* ----------
- * pgstat_setup_memcxt() -
- *
- *    Create pgStatLocalContext, if not already done.
- * ----------
- */
-static void
-pgstat_setup_memcxt(void)
-{
-    if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
-}
-
-
 /* ----------
  * pgstat_clear_snapshot() -
  *
@@ -5453,6 +3815,8 @@ pgstat_clear_snapshot(void)
 {
     int param = 0;    /* only the address is significant */
 
+    pgstat_bestatus_clear_snapshot();
+
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
         MemoryContextDelete(pgStatLocalContext);
@@ -5460,8 +3824,6 @@ pgstat_clear_snapshot(void)
     /* Reset variables */
     pgStatLocalContext = NULL;
     pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
 
     /*
      * the parameter inform the function that it is not called from
@@ -5567,47 +3929,18 @@ pgstat_update_dbentry(PgStat_StatDBEntry *dbentry, PgStat_TableStatus *stat)
     dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
 }
 
-
-/*
- * Convert a potentially unsafely truncated activity string (see
- * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
- * one.
+/* ----------
+ * pgstat_setup_memcxt() -
  *
- * The returned string is allocated in the caller's memory context and may be
- * freed.
+ *    Create pgStatLocalContext, if not already done.
+ * ----------
  */
-char *
-pgstat_clip_activity(const char *raw_activity)
+static void
+pgstat_setup_memcxt(void)
 {
-    char       *activity;
-    int            rawlen;
-    int            cliplen;
-
-    /*
-     * Some callers, like pgstat_get_backend_current_activity(), do not
-     * guarantee that the buffer isn't concurrently modified. We try to take
-     * care that the buffer is always terminated by a NUL byte regardless, but
-     * let's still be paranoid about the string's length. In those cases the
-     * underlying buffer is guaranteed to be pgstat_track_activity_query_size
-     * large.
-     */
-    activity = pnstrdup(raw_activity, pgstat_track_activity_query_size - 1);
-
-    /* now double-guaranteed to be NUL terminated */
-    rawlen = strlen(activity);
-
-    /*
-     * All supported server-encodings make it possible to determine the length
-     * of a multi-byte character from its first byte (this is not the case for
-     * client encodings, see GB18030). As st_activity is always stored using
-     * server encoding, this allows us to perform multi-byte aware truncation,
-     * even if the string earlier was truncated in the middle of a multi-byte
-     * character.
-     */
-    cliplen = pg_mbcliplen(activity, rawlen,
-                           pgstat_track_activity_query_size - 1);
-
-    activity[cliplen] = '\0';
-
-    return activity;
+    if (!pgStatLocalContext)
+        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
+                                                   "Statistics snapshot",
+                                                   ALLOCSET_SMALL_SIZES);
 }
+
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e794a81c4c..d92c7c935d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -34,6 +34,7 @@
 #include <unistd.h>
 
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/storage.h"
 #include "executor/instrument.h"
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index d3722616bb..d9f1e94d99 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -41,9 +41,9 @@
 
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "executor/instrument.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/buffile.h"
 #include "storage/buf_internals.h"
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index 4a0d23b11e..4054ac5108 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -22,10 +22,10 @@
 #include <unistd.h>
 #include <sys/stat.h>
 
+#include "bestatus.h"
 #include "storage/copydir.h"
 #include "storage/fd.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 
 /*
  * copydir: copy a directory
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 9e596e7868..466622a1c2 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -82,6 +82,7 @@
 #include "miscadmin.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/pg_tablespace.h"
 #include "common/file_perm.h"
 #include "pgstat.h"
diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
index 0ff1f5be91..a3465f57ae 100644
--- a/src/backend/storage/ipc/dsm_impl.c
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -61,8 +61,8 @@
 #ifdef HAVE_SYS_SHM_H
 #include <sys/shm.h>
 #endif
+#include "bestatus.h"
 #include "common/file_perm.h"
-#include "pgstat.h"
 
 #include "portability/mem.h"
 #include "storage/dsm_impl.h"
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index b0804537cf..33ab24102d 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -43,8 +43,8 @@
 #include <poll.h>
 #endif
 
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "port/atomics.h"
 #include "portability/instr_time.h"
 #include "postmaster/postmaster.h"
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index dc7e875680..293d15661a 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -51,9 +51,9 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/spin.h"
diff --git a/src/backend/storage/ipc/shm_mq.c b/src/backend/storage/ipc/shm_mq.c
index ec0ddd537b..04b933986f 100644
--- a/src/backend/storage/ipc/shm_mq.c
+++ b/src/backend/storage/ipc/shm_mq.c
@@ -18,8 +18,8 @@
 
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/bgworker.h"
 #include "storage/procsignal.h"
 #include "storage/shm_mq.h"
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index c9bb3e987d..0a9181cd9d 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -21,8 +21,8 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index aeaf1f3ab4..22a42b9977 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -25,6 +25,7 @@
  */
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "pgstat.h"
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index c46bb8d057..8ddb4c88e0 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -76,8 +76,8 @@
  */
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "pg_trace.h"
 #include "postmaster/postmaster.h"
 #include "replication/slot.h"
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index e8390311d0..ac352885f3 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -193,8 +193,8 @@
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
 #include "storage/predicate_internals.h"
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 33387fb71b..64594a402e 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -38,8 +38,8 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "replication/slot.h"
 #include "replication/syncrep.h"
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 4c6a50509f..9089cad93f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -28,7 +28,7 @@
 #include "miscadmin.h"
 #include "access/xlogutils.h"
 #include "access/xlog.h"
-#include "pgstat.h"
+#include "bestatus.h"
 #include "portability/instr_time.h"
 #include "postmaster/bgwriter.h"
 #include "storage/fd.h"
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index ee4e43331b..f894bac680 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -39,6 +39,7 @@
 #include "access/parallel.h"
 #include "access/printtup.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index b8f86973dc..9506f4e671 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -20,6 +20,7 @@
 #include <unistd.h>
 
 #include "access/sysattr.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/pg_tablespace.h"
 #include "catalog/pg_type.h"
@@ -28,7 +29,6 @@
 #include "common/keywords.h"
 #include "funcapi.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "parser/scansup.h"
 #include "postmaster/syslogger.h"
 #include "rewrite/rewriteHandler.h"
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 6e144d5aad..93f62aab71 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 328d4aae7b..b7c3d27c8f 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -46,11 +46,11 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/pg_tablespace.h"
 #include "catalog/storage.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/lwlock.h"
 #include "utils/inval.h"
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 3d10aa5707..2c82261709 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -31,12 +31,12 @@
 #endif
 
 #include "access/htup_details.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "common/file_perm.h"
 #include "libpq/libpq.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/postmaster.h"
 #include "storage/fd.h"
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 701bfed727..4c152be61b 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -25,6 +25,7 @@
 #include "access/sysattr.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/indexing.h"
 #include "catalog/namespace.h"
@@ -688,7 +689,10 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 
     /* Initialize stats collection --- must happen before first xact */
     if (!bootstrap)
+    {
+        pgstat_bearray_initialize();
         pgstat_initialize();
+    }
 
     /*
      * Load relcache entries for the shared system catalogs.  This must create
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 34bf9f1419..dd6618855f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -33,6 +33,7 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
 #include "commands/async.h"
diff --git a/src/include/bestatus.h b/src/include/bestatus.h
new file mode 100644
index 0000000000..3b47e9c063
--- /dev/null
+++ b/src/include/bestatus.h
@@ -0,0 +1,544 @@
+/* ----------
+ *    bestatus.h
+ *
+ *    Definitions for the PostgreSQL backend status monitor facility
+ *
+ *    Copyright (c) 2001-2018, PostgreSQL Global Development Group
+ *
+ *    src/include/bestatus.h
+ * ----------
+ */
+#ifndef BESTATUS_H
+#define BESTATUS_H
+
+#include "datatype/timestamp.h"
+#include "libpq/pqcomm.h"
+#include "storage/proc.h"
+
+/* ----------
+ * Backend types
+ * ----------
+ */
+typedef enum BackendType
+{
+    B_AUTOVAC_LAUNCHER,
+    B_AUTOVAC_WORKER,
+    B_BACKEND,
+    B_BG_WORKER,
+    B_BG_WRITER,
+    B_CHECKPOINTER,
+    B_STARTUP,
+    B_WAL_RECEIVER,
+    B_WAL_SENDER,
+    B_WAL_WRITER,
+    B_ARCHIVER
+} BackendType;
+
+
+/* ----------
+ * Backend states
+ * ----------
+ */
+typedef enum BackendState
+{
+    STATE_UNDEFINED,
+    STATE_IDLE,
+    STATE_RUNNING,
+    STATE_IDLEINTRANSACTION,
+    STATE_FASTPATH,
+    STATE_IDLEINTRANSACTION_ABORTED,
+    STATE_DISABLED
+} BackendState;
+
+
+/* ----------
+ * Wait Classes
+ * ----------
+ */
+#define PG_WAIT_LWLOCK                0x01000000U
+#define PG_WAIT_LOCK                0x03000000U
+#define PG_WAIT_BUFFER_PIN            0x04000000U
+#define PG_WAIT_ACTIVITY            0x05000000U
+#define PG_WAIT_CLIENT                0x06000000U
+#define PG_WAIT_EXTENSION            0x07000000U
+#define PG_WAIT_IPC                    0x08000000U
+#define PG_WAIT_TIMEOUT                0x09000000U
+#define PG_WAIT_IO                    0x0A000000U
+
+/* ----------
+ * Wait Events - Activity
+ *
+ * Use this category when a process is waiting because it has no work to do,
+ * unless the "Client" or "Timeout" category describes the situation better.
+ * Typically, this should only be used for background processes.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_ARCHIVER_MAIN = PG_WAIT_ACTIVITY,
+    WAIT_EVENT_AUTOVACUUM_MAIN,
+    WAIT_EVENT_BGWRITER_HIBERNATE,
+    WAIT_EVENT_BGWRITER_MAIN,
+    WAIT_EVENT_CHECKPOINTER_MAIN,
+    WAIT_EVENT_LOGICAL_APPLY_MAIN,
+    WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
+    WAIT_EVENT_RECOVERY_WAL_ALL,
+    WAIT_EVENT_RECOVERY_WAL_STREAM,
+    WAIT_EVENT_SYSLOGGER_MAIN,
+    WAIT_EVENT_WAL_RECEIVER_MAIN,
+    WAIT_EVENT_WAL_SENDER_MAIN,
+    WAIT_EVENT_WAL_WRITER_MAIN
+} WaitEventActivity;
+
+/* ----------
+ * Wait Events - Client
+ *
+ * Use this category when a process is waiting to send data to or receive data
+ * from the frontend process to which it is connected.  This is never used for
+ * a background process, which has no client connection.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_CLIENT_READ = PG_WAIT_CLIENT,
+    WAIT_EVENT_CLIENT_WRITE,
+    WAIT_EVENT_LIBPQWALRECEIVER_CONNECT,
+    WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE,
+    WAIT_EVENT_SSL_OPEN_SERVER,
+    WAIT_EVENT_WAL_RECEIVER_WAIT_START,
+    WAIT_EVENT_WAL_SENDER_WAIT_WAL,
+    WAIT_EVENT_WAL_SENDER_WRITE_DATA
+} WaitEventClient;
+
+/* ----------
+ * Wait Events - IPC
+ *
+ * Use this category when a process cannot complete the work it is doing because
+ * it is waiting for a notification from another process.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_BGWORKER_SHUTDOWN = PG_WAIT_IPC,
+    WAIT_EVENT_BGWORKER_STARTUP,
+    WAIT_EVENT_BTREE_PAGE,
+    WAIT_EVENT_CLOG_GROUP_UPDATE,
+    WAIT_EVENT_EXECUTE_GATHER,
+    WAIT_EVENT_HASH_BATCH_ALLOCATING,
+    WAIT_EVENT_HASH_BATCH_ELECTING,
+    WAIT_EVENT_HASH_BATCH_LOADING,
+    WAIT_EVENT_HASH_BUILD_ALLOCATING,
+    WAIT_EVENT_HASH_BUILD_ELECTING,
+    WAIT_EVENT_HASH_BUILD_HASHING_INNER,
+    WAIT_EVENT_HASH_BUILD_HASHING_OUTER,
+    WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING,
+    WAIT_EVENT_HASH_GROW_BATCHES_DECIDING,
+    WAIT_EVENT_HASH_GROW_BATCHES_ELECTING,
+    WAIT_EVENT_HASH_GROW_BATCHES_FINISHING,
+    WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING,
+    WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING,
+    WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING,
+    WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING,
+    WAIT_EVENT_LOGICAL_SYNC_DATA,
+    WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
+    WAIT_EVENT_MQ_INTERNAL,
+    WAIT_EVENT_MQ_PUT_MESSAGE,
+    WAIT_EVENT_MQ_RECEIVE,
+    WAIT_EVENT_MQ_SEND,
+    WAIT_EVENT_PARALLEL_BITMAP_SCAN,
+    WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN,
+    WAIT_EVENT_PARALLEL_FINISH,
+    WAIT_EVENT_PROCARRAY_GROUP_UPDATE,
+    WAIT_EVENT_PROMOTE,
+    WAIT_EVENT_REPLICATION_ORIGIN_DROP,
+    WAIT_EVENT_REPLICATION_SLOT_DROP,
+    WAIT_EVENT_SAFE_SNAPSHOT,
+    WAIT_EVENT_SYNC_REP
+} WaitEventIPC;
+
+/* ----------
+ * Wait Events - Timeout
+ *
+ * Use this category when a process is waiting for a timeout to expire.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
+    WAIT_EVENT_PG_SLEEP,
+    WAIT_EVENT_RECOVERY_APPLY_DELAY
+} WaitEventTimeout;
+
+/* ----------
+ * Wait Events - IO
+ *
+ * Use this category when a process is waiting for an IO.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_BUFFILE_READ = PG_WAIT_IO,
+    WAIT_EVENT_BUFFILE_WRITE,
+    WAIT_EVENT_CONTROL_FILE_READ,
+    WAIT_EVENT_CONTROL_FILE_SYNC,
+    WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
+    WAIT_EVENT_CONTROL_FILE_WRITE,
+    WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE,
+    WAIT_EVENT_COPY_FILE_READ,
+    WAIT_EVENT_COPY_FILE_WRITE,
+    WAIT_EVENT_DATA_FILE_EXTEND,
+    WAIT_EVENT_DATA_FILE_FLUSH,
+    WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC,
+    WAIT_EVENT_DATA_FILE_PREFETCH,
+    WAIT_EVENT_DATA_FILE_READ,
+    WAIT_EVENT_DATA_FILE_SYNC,
+    WAIT_EVENT_DATA_FILE_TRUNCATE,
+    WAIT_EVENT_DATA_FILE_WRITE,
+    WAIT_EVENT_DSM_FILL_ZERO_WRITE,
+    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ,
+    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC,
+    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE,
+    WAIT_EVENT_LOCK_FILE_CREATE_READ,
+    WAIT_EVENT_LOCK_FILE_CREATE_SYNC,
+    WAIT_EVENT_LOCK_FILE_CREATE_WRITE,
+    WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ,
+    WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC,
+    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC,
+    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE,
+    WAIT_EVENT_LOGICAL_REWRITE_SYNC,
+    WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE,
+    WAIT_EVENT_LOGICAL_REWRITE_WRITE,
+    WAIT_EVENT_RELATION_MAP_READ,
+    WAIT_EVENT_RELATION_MAP_SYNC,
+    WAIT_EVENT_RELATION_MAP_WRITE,
+    WAIT_EVENT_REORDER_BUFFER_READ,
+    WAIT_EVENT_REORDER_BUFFER_WRITE,
+    WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ,
+    WAIT_EVENT_REPLICATION_SLOT_READ,
+    WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
+    WAIT_EVENT_REPLICATION_SLOT_SYNC,
+    WAIT_EVENT_REPLICATION_SLOT_WRITE,
+    WAIT_EVENT_SLRU_FLUSH_SYNC,
+    WAIT_EVENT_SLRU_READ,
+    WAIT_EVENT_SLRU_SYNC,
+    WAIT_EVENT_SLRU_WRITE,
+    WAIT_EVENT_SNAPBUILD_READ,
+    WAIT_EVENT_SNAPBUILD_SYNC,
+    WAIT_EVENT_SNAPBUILD_WRITE,
+    WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC,
+    WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE,
+    WAIT_EVENT_TIMELINE_HISTORY_READ,
+    WAIT_EVENT_TIMELINE_HISTORY_SYNC,
+    WAIT_EVENT_TIMELINE_HISTORY_WRITE,
+    WAIT_EVENT_TWOPHASE_FILE_READ,
+    WAIT_EVENT_TWOPHASE_FILE_SYNC,
+    WAIT_EVENT_TWOPHASE_FILE_WRITE,
+    WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ,
+    WAIT_EVENT_WAL_BOOTSTRAP_SYNC,
+    WAIT_EVENT_WAL_BOOTSTRAP_WRITE,
+    WAIT_EVENT_WAL_COPY_READ,
+    WAIT_EVENT_WAL_COPY_SYNC,
+    WAIT_EVENT_WAL_COPY_WRITE,
+    WAIT_EVENT_WAL_INIT_SYNC,
+    WAIT_EVENT_WAL_INIT_WRITE,
+    WAIT_EVENT_WAL_READ,
+    WAIT_EVENT_WAL_SYNC,
+    WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
+    WAIT_EVENT_WAL_WRITE
+} WaitEventIO;
+
+/* ----------
+ * Command type for progress reporting purposes
+ * ----------
+ */
+typedef enum ProgressCommandType
+{
+    PROGRESS_COMMAND_INVALID,
+    PROGRESS_COMMAND_VACUUM
+} ProgressCommandType;
+
+#define PGSTAT_NUM_PROGRESS_PARAM    10
+
+/* ----------
+ * Shared-memory data structures
+ * ----------
+ */
+
+
+/*
+ * PgBackendSSLStatus
+ *
+ * For each backend, we keep the SSL status in a separate struct, which
+ * is only filled in if SSL is enabled.
+ */
+typedef struct PgBackendSSLStatus
+{
+    /* Information about SSL connection */
+    int            ssl_bits;
+    bool        ssl_compression;
+    char        ssl_version[NAMEDATALEN];    /* MUST be null-terminated */
+    char        ssl_cipher[NAMEDATALEN];    /* MUST be null-terminated */
+    char        ssl_clientdn[NAMEDATALEN];    /* MUST be null-terminated */
+} PgBackendSSLStatus;
+
+
+/* ----------
+ * PgBackendStatus
+ *
+ * Each live backend maintains a PgBackendStatus struct in shared memory
+ * showing its current activity.  (The structs are allocated according to
+ * BackendId, but that is not critical.)  Note that the collector process
+ * has no involvement in, or even access to, these structs.
+ *
+ * Each auxiliary process also maintains a PgBackendStatus struct in shared
+ * memory.
+ * ----------
+ */
+typedef struct PgBackendStatus
+{
+    /*
+     * To avoid locking overhead, we use the following protocol: a backend
+     * increments st_changecount before modifying its entry, and again after
+     * finishing a modification.  A would-be reader should note the value of
+     * st_changecount, copy the entry into private memory, then check
+     * st_changecount again.  If the value hasn't changed, and if it's even,
+     * the copy is valid; otherwise start over.  This makes updates cheap
+     * while reads are potentially expensive, but that's the tradeoff we want.
+     *
+     * The above protocol needs memory barriers to ensure that the apparent
+     * order of execution is the intended one.  Otherwise, for example, on a
+     * machine with weak memory ordering the CPU might reorder the stores so
+     * that st_changecount is incremented twice before the modification.
+     * Such reordering would let a reader accept a torn copy as valid.
+     */
+    int            st_changecount;
+
+    /* The entry is valid iff st_procpid > 0, unused if st_procpid == 0 */
+    int            st_procpid;
+
+    /* Type of backend */
+    BackendType st_backendType;
+
+    /* Times when current backend, transaction, and activity started */
+    TimestampTz st_proc_start_timestamp;
+    TimestampTz st_xact_start_timestamp;
+    TimestampTz st_activity_start_timestamp;
+    TimestampTz st_state_start_timestamp;
+
+    /* Database OID, owning user's OID, connection client address */
+    Oid            st_databaseid;
+    Oid            st_userid;
+    SockAddr    st_clientaddr;
+    char       *st_clienthostname;    /* MUST be null-terminated */
+
+    /* Information about SSL connection */
+    bool        st_ssl;
+    PgBackendSSLStatus *st_sslstatus;
+
+    /* current state */
+    BackendState st_state;
+
+    /* application name; MUST be null-terminated */
+    char       *st_appname;
+
+    /*
+     * Current command string; MUST be null-terminated. Note that this string
+     * may be truncated in the middle of a multi-byte character. Because
+     * activity strings are stored more often than they are read, this moves
+     * the cost of correct truncation to the display side. Use
+     * pgstat_clip_activity() to truncate correctly.
+     */
+    char       *st_activity_raw;
+
+    /*
+     * Command progress reporting.  Any command which wishes can advertise
+     * that it is running by setting st_progress_command,
+     * st_progress_command_target, and st_progress_param[].
+     * st_progress_command_target should be the OID of the relation which the
+     * command targets (we assume there's just one, as this is meant for
+     * utility commands), but the meaning of each element in the
+     * st_progress_param array is command-specific.
+     */
+    ProgressCommandType st_progress_command;
+    Oid            st_progress_command_target;
+    int64        st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
+} PgBackendStatus;
+
+/*
+ * Macros to load and store st_changecount with the memory barriers.
+ *
+ * pgstat_increment_changecount_before() and
+ * pgstat_increment_changecount_after() need to be called before and after
+ * PgBackendStatus entries are modified, respectively. This makes sure that
+ * st_changecount is incremented around the modification.
+ *
+ * Also pgstat_save_changecount_before() and pgstat_save_changecount_after()
+ * need to be called before and after PgBackendStatus entries are copied into
+ * private memory, respectively.
+ */
+#define pgstat_increment_changecount_before(beentry)    \
+    do {    \
+        beentry->st_changecount++;    \
+        pg_write_barrier(); \
+    } while (0)
+
+#define pgstat_increment_changecount_after(beentry) \
+    do {    \
+        pg_write_barrier(); \
+        beentry->st_changecount++;    \
+        Assert((beentry->st_changecount & 1) == 0); \
+    } while (0)
+
+#define pgstat_save_changecount_before(beentry, save_changecount)    \
+    do {    \
+        save_changecount = beentry->st_changecount; \
+        pg_read_barrier();    \
+    } while (0)
+
+#define pgstat_save_changecount_after(beentry, save_changecount)    \
+    do {    \
+        pg_read_barrier();    \
+        save_changecount = beentry->st_changecount; \
+    } while (0)
+
+/* ----------
+ * LocalPgBackendStatus
+ *
+ * When we build the backend status array, we use LocalPgBackendStatus to be
+ * able to add new values to the struct when needed without adding new fields
+ * to the shared memory. It contains the backend status as a first member.
+ * ----------
+ */
+typedef struct LocalPgBackendStatus
+{
+    /*
+     * Local version of the backend status entry.
+     */
+    PgBackendStatus backendStatus;
+
+    /*
+     * The xid of the current transaction if available, InvalidTransactionId
+     * if not.
+     */
+    TransactionId backend_xid;
+
+    /*
+     * The xmin of the current session if available, InvalidTransactionId if
+     * not.
+     */
+    TransactionId backend_xmin;
+} LocalPgBackendStatus;
+
+/* ----------
+ * GUC parameters
+ * ----------
+ */
+extern bool pgstat_track_activities;
+extern PGDLLIMPORT int pgstat_track_activity_query_size;
+
+/* ----------
+ * Functions called from backends
+ * ----------
+ */
+extern void pgstat_bestatus_clear_snapshot(void);
+extern void pgstat_bearray_initialize(void);
+extern void pgstat_bestart(void);
+
+extern const char *pgstat_get_wait_event(uint32 wait_event_info);
+extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
+extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
+extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
+                                    int buflen);
+extern const char *pgstat_get_backend_desc(BackendType backendType);
+
+extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
+                              Oid relid);
+extern void pgstat_progress_update_param(int index, int64 val);
+extern void pgstat_progress_update_multi_param(int nparam, const int *index,
+                                   const int64 *val);
+extern void pgstat_progress_end_command(void);
+
+extern char *pgstat_clip_activity(const char *raw_activity);
+
+/* ----------
+ * pgstat_report_wait_start() -
+ *
+ *    Called from places where server process needs to wait.  This is called
+ *    to report wait event information.  The wait information is stored
+ *    as 4-bytes where first byte represents the wait event class (type of
+ *    wait, for different types of wait, refer WaitClass) and the next
+ *    3-bytes represent the actual wait event.  Currently 2-bytes are used
+ *    for wait event which is sufficient for current usage, 1-byte is
+ *    reserved for future usage.
+ *
+ * NB: this *must* be able to survive being called before MyProc has been
+ * initialized.
+ * ----------
+ */
+static inline void
+pgstat_report_wait_start(uint32 wait_event_info)
+{
+    volatile PGPROC *proc = MyProc;
+
+    if (!pgstat_track_activities || !proc)
+        return;
+
+    /*
+     * Since this is a four-byte field which is always read and written as
+     * four-bytes, updates are atomic.
+     */
+    proc->wait_event_info = wait_event_info;
+}
+
+/* ----------
+ * pgstat_report_wait_end() -
+ *
+ *    Called to report end of a wait.
+ *
+ * NB: this *must* be able to survive being called before MyProc has been
+ * initialized.
+ * ----------
+ */
+static inline void
+pgstat_report_wait_end(void)
+{
+    volatile PGPROC *proc = MyProc;
+
+    if (!pgstat_track_activities || !proc)
+        return;
+
+    /*
+     * Since this is a four-byte field which is always read and written as
+     * four-bytes, updates are atomic.
+     */
+    proc->wait_event_info = 0;
+}
+extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
+extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
+extern int    pgstat_fetch_stat_numbackends(void);
+
+/* For shared memory allocation/initialize */
+extern Size BackendStatusShmemSize(void);
+extern void CreateSharedBackendStatus(void);
+
+void pgstat_report_xact_timestamp(TimestampTz tstamp);
+void pgstat_bestat_initialize(void);
+
+extern void pgstat_report_activity(BackendState state, const char *cmd_str);
+extern void pgstat_report_appname(const char *appname);
+extern void pgstat_report_xact_timestamp(TimestampTz tstamp);
+extern const char *pgstat_get_wait_event(uint32 wait_event_info);
+extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
+extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
+extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
+                                    int buflen);
+extern const char *pgstat_get_backend_desc(BackendType backendType);
+
+extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
+                              Oid relid);
+extern void pgstat_progress_update_param(int index, int64 val);
+extern void pgstat_progress_update_multi_param(int nparam, const int *index,
+                                   const int64 *val);
+extern void pgstat_progress_end_command(void);
+
+#endif                            /* BESTATUS_H */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 386f8040fe..2d8091e579 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL statistics collector facility.
  *
  *    Copyright (c) 2001-2018, PostgreSQL Global Development Group
  *
@@ -14,11 +14,8 @@
 #include "datatype/timestamp.h"
 #include "fmgr.h"
 #include "lib/dshash.h"
-#include "libpq/pqcomm.h"
-#include "port/atomics.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
-#include "storage/proc.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -100,12 +97,11 @@ typedef enum PgStat_Single_Reset_Type
     RESET_FUNCTION
 } PgStat_Single_Reset_Type;
 
+
 /* ------------------------------------------------------------
  * Structures kept in backend local memory while accumulating counts
  * ------------------------------------------------------------
  */
-
-
 /* ----------
  * PgStat_TableStatus            Per-table status within a backend
  *
@@ -173,10 +169,10 @@ typedef struct PgStat_BgWriter
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to apply.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when applying to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -209,7 +205,7 @@ typedef struct PgStat_FunctionEntry
 } PgStat_FunctionEntry;
 
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Statistic collector data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -313,7 +309,7 @@ typedef struct PgStat_StatFuncEntry
 
 
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -329,7 +325,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -347,422 +343,6 @@ typedef struct PgStat_GlobalStats
     TimestampTz stat_reset_timestamp;
 } PgStat_GlobalStats;
 
-
-/* ----------
- * Backend types
- * ----------
- */
-typedef enum BackendType
-{
-    B_AUTOVAC_LAUNCHER,
-    B_AUTOVAC_WORKER,
-    B_BACKEND,
-    B_BG_WORKER,
-    B_BG_WRITER,
-    B_ARCHIVER,
-    B_CHECKPOINTER,
-    B_STARTUP,
-    B_WAL_RECEIVER,
-    B_WAL_SENDER,
-    B_WAL_WRITER
-} BackendType;
-
-
-/* ----------
- * Backend states
- * ----------
- */
-typedef enum BackendState
-{
-    STATE_UNDEFINED,
-    STATE_IDLE,
-    STATE_RUNNING,
-    STATE_IDLEINTRANSACTION,
-    STATE_FASTPATH,
-    STATE_IDLEINTRANSACTION_ABORTED,
-    STATE_DISABLED
-} BackendState;
-
-
-/* ----------
- * Wait Classes
- * ----------
- */
-#define PG_WAIT_LWLOCK                0x01000000U
-#define PG_WAIT_LOCK                0x03000000U
-#define PG_WAIT_BUFFER_PIN            0x04000000U
-#define PG_WAIT_ACTIVITY            0x05000000U
-#define PG_WAIT_CLIENT                0x06000000U
-#define PG_WAIT_EXTENSION            0x07000000U
-#define PG_WAIT_IPC                    0x08000000U
-#define PG_WAIT_TIMEOUT                0x09000000U
-#define PG_WAIT_IO                    0x0A000000U
-
-/* ----------
- * Wait Events - Activity
- *
- * Use this category when a process is waiting because it has no work to do,
- * unless the "Client" or "Timeout" category describes the situation better.
- * Typically, this should only be used for background processes.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_ARCHIVER_MAIN = PG_WAIT_ACTIVITY,
-    WAIT_EVENT_AUTOVACUUM_MAIN,
-    WAIT_EVENT_BGWRITER_HIBERNATE,
-    WAIT_EVENT_BGWRITER_MAIN,
-    WAIT_EVENT_CHECKPOINTER_MAIN,
-    WAIT_EVENT_LOGICAL_APPLY_MAIN,
-    WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
-    WAIT_EVENT_RECOVERY_WAL_ALL,
-    WAIT_EVENT_RECOVERY_WAL_STREAM,
-    WAIT_EVENT_SYSLOGGER_MAIN,
-    WAIT_EVENT_WAL_RECEIVER_MAIN,
-    WAIT_EVENT_WAL_SENDER_MAIN,
-    WAIT_EVENT_WAL_WRITER_MAIN
-} WaitEventActivity;
-
-/* ----------
- * Wait Events - Client
- *
- * Use this category when a process is waiting to send data to or receive data
- * from the frontend process to which it is connected.  This is never used for
- * a background process, which has no client connection.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_CLIENT_READ = PG_WAIT_CLIENT,
-    WAIT_EVENT_CLIENT_WRITE,
-    WAIT_EVENT_LIBPQWALRECEIVER_CONNECT,
-    WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE,
-    WAIT_EVENT_SSL_OPEN_SERVER,
-    WAIT_EVENT_WAL_RECEIVER_WAIT_START,
-    WAIT_EVENT_WAL_SENDER_WAIT_WAL,
-    WAIT_EVENT_WAL_SENDER_WRITE_DATA
-} WaitEventClient;
-
-/* ----------
- * Wait Events - IPC
- *
- * Use this category when a process cannot complete the work it is doing because
- * it is waiting for a notification from another process.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_BGWORKER_SHUTDOWN = PG_WAIT_IPC,
-    WAIT_EVENT_BGWORKER_STARTUP,
-    WAIT_EVENT_BTREE_PAGE,
-    WAIT_EVENT_CLOG_GROUP_UPDATE,
-    WAIT_EVENT_EXECUTE_GATHER,
-    WAIT_EVENT_HASH_BATCH_ALLOCATING,
-    WAIT_EVENT_HASH_BATCH_ELECTING,
-    WAIT_EVENT_HASH_BATCH_LOADING,
-    WAIT_EVENT_HASH_BUILD_ALLOCATING,
-    WAIT_EVENT_HASH_BUILD_ELECTING,
-    WAIT_EVENT_HASH_BUILD_HASHING_INNER,
-    WAIT_EVENT_HASH_BUILD_HASHING_OUTER,
-    WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING,
-    WAIT_EVENT_HASH_GROW_BATCHES_DECIDING,
-    WAIT_EVENT_HASH_GROW_BATCHES_ELECTING,
-    WAIT_EVENT_HASH_GROW_BATCHES_FINISHING,
-    WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING,
-    WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING,
-    WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING,
-    WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING,
-    WAIT_EVENT_LOGICAL_SYNC_DATA,
-    WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
-    WAIT_EVENT_MQ_INTERNAL,
-    WAIT_EVENT_MQ_PUT_MESSAGE,
-    WAIT_EVENT_MQ_RECEIVE,
-    WAIT_EVENT_MQ_SEND,
-    WAIT_EVENT_PARALLEL_BITMAP_SCAN,
-    WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN,
-    WAIT_EVENT_PARALLEL_FINISH,
-    WAIT_EVENT_PROCARRAY_GROUP_UPDATE,
-    WAIT_EVENT_PROMOTE,
-    WAIT_EVENT_REPLICATION_ORIGIN_DROP,
-    WAIT_EVENT_REPLICATION_SLOT_DROP,
-    WAIT_EVENT_SAFE_SNAPSHOT,
-    WAIT_EVENT_SYNC_REP
-} WaitEventIPC;
-
-/* ----------
- * Wait Events - Timeout
- *
- * Use this category when a process is waiting for a timeout to expire.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
-    WAIT_EVENT_PG_SLEEP,
-    WAIT_EVENT_RECOVERY_APPLY_DELAY
-} WaitEventTimeout;
-
-/* ----------
- * Wait Events - IO
- *
- * Use this category when a process is waiting for a IO.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_BUFFILE_READ = PG_WAIT_IO,
-    WAIT_EVENT_BUFFILE_WRITE,
-    WAIT_EVENT_CONTROL_FILE_READ,
-    WAIT_EVENT_CONTROL_FILE_SYNC,
-    WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
-    WAIT_EVENT_CONTROL_FILE_WRITE,
-    WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE,
-    WAIT_EVENT_COPY_FILE_READ,
-    WAIT_EVENT_COPY_FILE_WRITE,
-    WAIT_EVENT_DATA_FILE_EXTEND,
-    WAIT_EVENT_DATA_FILE_FLUSH,
-    WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC,
-    WAIT_EVENT_DATA_FILE_PREFETCH,
-    WAIT_EVENT_DATA_FILE_READ,
-    WAIT_EVENT_DATA_FILE_SYNC,
-    WAIT_EVENT_DATA_FILE_TRUNCATE,
-    WAIT_EVENT_DATA_FILE_WRITE,
-    WAIT_EVENT_DSM_FILL_ZERO_WRITE,
-    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ,
-    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC,
-    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE,
-    WAIT_EVENT_LOCK_FILE_CREATE_READ,
-    WAIT_EVENT_LOCK_FILE_CREATE_SYNC,
-    WAIT_EVENT_LOCK_FILE_CREATE_WRITE,
-    WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ,
-    WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC,
-    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC,
-    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE,
-    WAIT_EVENT_LOGICAL_REWRITE_SYNC,
-    WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE,
-    WAIT_EVENT_LOGICAL_REWRITE_WRITE,
-    WAIT_EVENT_RELATION_MAP_READ,
-    WAIT_EVENT_RELATION_MAP_SYNC,
-    WAIT_EVENT_RELATION_MAP_WRITE,
-    WAIT_EVENT_REORDER_BUFFER_READ,
-    WAIT_EVENT_REORDER_BUFFER_WRITE,
-    WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ,
-    WAIT_EVENT_REPLICATION_SLOT_READ,
-    WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
-    WAIT_EVENT_REPLICATION_SLOT_SYNC,
-    WAIT_EVENT_REPLICATION_SLOT_WRITE,
-    WAIT_EVENT_SLRU_FLUSH_SYNC,
-    WAIT_EVENT_SLRU_READ,
-    WAIT_EVENT_SLRU_SYNC,
-    WAIT_EVENT_SLRU_WRITE,
-    WAIT_EVENT_SNAPBUILD_READ,
-    WAIT_EVENT_SNAPBUILD_SYNC,
-    WAIT_EVENT_SNAPBUILD_WRITE,
-    WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC,
-    WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE,
-    WAIT_EVENT_TIMELINE_HISTORY_READ,
-    WAIT_EVENT_TIMELINE_HISTORY_SYNC,
-    WAIT_EVENT_TIMELINE_HISTORY_WRITE,
-    WAIT_EVENT_TWOPHASE_FILE_READ,
-    WAIT_EVENT_TWOPHASE_FILE_SYNC,
-    WAIT_EVENT_TWOPHASE_FILE_WRITE,
-    WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ,
-    WAIT_EVENT_WAL_BOOTSTRAP_SYNC,
-    WAIT_EVENT_WAL_BOOTSTRAP_WRITE,
-    WAIT_EVENT_WAL_COPY_READ,
-    WAIT_EVENT_WAL_COPY_SYNC,
-    WAIT_EVENT_WAL_COPY_WRITE,
-    WAIT_EVENT_WAL_INIT_SYNC,
-    WAIT_EVENT_WAL_INIT_WRITE,
-    WAIT_EVENT_WAL_READ,
-    WAIT_EVENT_WAL_SYNC,
-    WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-    WAIT_EVENT_WAL_WRITE
-} WaitEventIO;
-
-/* ----------
- * Command type for progress reporting purposes
- * ----------
- */
-typedef enum ProgressCommandType
-{
-    PROGRESS_COMMAND_INVALID,
-    PROGRESS_COMMAND_VACUUM
-} ProgressCommandType;
-
-#define PGSTAT_NUM_PROGRESS_PARAM    10
-
-/* ----------
- * Shared-memory data structures
- * ----------
- */
-
-
-/*
- * PgBackendSSLStatus
- *
- * For each backend, we keep the SSL status in a separate struct, that
- * is only filled in if SSL is enabled.
- */
-typedef struct PgBackendSSLStatus
-{
-    /* Information about SSL connection */
-    int            ssl_bits;
-    bool        ssl_compression;
-    char        ssl_version[NAMEDATALEN];    /* MUST be null-terminated */
-    char        ssl_cipher[NAMEDATALEN];    /* MUST be null-terminated */
-    char        ssl_clientdn[NAMEDATALEN];    /* MUST be null-terminated */
-} PgBackendSSLStatus;
-
-
-/* ----------
- * PgBackendStatus
- *
- * Each live backend maintains a PgBackendStatus struct in shared memory
- * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
- * has no involvement in, or even access to, these structs.
- *
- * Each auxiliary process also maintains a PgBackendStatus struct in shared
- * memory.
- * ----------
- */
-typedef struct PgBackendStatus
-{
-    /*
-     * To avoid locking overhead, we use the following protocol: a backend
-     * increments st_changecount before modifying its entry, and again after
-     * finishing a modification.  A would-be reader should note the value of
-     * st_changecount, copy the entry into private memory, then check
-     * st_changecount again.  If the value hasn't changed, and if it's even,
-     * the copy is valid; otherwise start over.  This makes updates cheap
-     * while reads are potentially expensive, but that's the tradeoff we want.
-     *
-     * The above protocol needs the memory barriers to ensure that the
-     * apparent order of execution is as it desires. Otherwise, for example,
-     * the CPU might rearrange the code so that st_changecount is incremented
-     * twice before the modification on a machine with weak memory ordering.
-     * This surprising result can lead to bugs.
-     */
-    int            st_changecount;
-
-    /* The entry is valid iff st_procpid > 0, unused if st_procpid == 0 */
-    int            st_procpid;
-
-    /* Type of backends */
-    BackendType st_backendType;
-
-    /* Times when current backend, transaction, and activity started */
-    TimestampTz st_proc_start_timestamp;
-    TimestampTz st_xact_start_timestamp;
-    TimestampTz st_activity_start_timestamp;
-    TimestampTz st_state_start_timestamp;
-
-    /* Database OID, owning user's OID, connection client address */
-    Oid            st_databaseid;
-    Oid            st_userid;
-    SockAddr    st_clientaddr;
-    char       *st_clienthostname;    /* MUST be null-terminated */
-
-    /* Information about SSL connection */
-    bool        st_ssl;
-    PgBackendSSLStatus *st_sslstatus;
-
-    /* current state */
-    BackendState st_state;
-
-    /* application name; MUST be null-terminated */
-    char       *st_appname;
-
-    /*
-     * Current command string; MUST be null-terminated. Note that this string
-     * possibly is truncated in the middle of a multi-byte character. As
-     * activity strings are stored more frequently than read, that allows to
-     * move the cost of correct truncation to the display side. Use
-     * pgstat_clip_activity() to truncate correctly.
-     */
-    char       *st_activity_raw;
-
-    /*
-     * Command progress reporting.  Any command which wishes can advertise
-     * that it is running by setting st_progress_command,
-     * st_progress_command_target, and st_progress_param[].
-     * st_progress_command_target should be the OID of the relation which the
-     * command targets (we assume there's just one, as this is meant for
-     * utility commands), but the meaning of each element in the
-     * st_progress_param array is command-specific.
-     */
-    ProgressCommandType st_progress_command;
-    Oid            st_progress_command_target;
-    int64        st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
-} PgBackendStatus;
-
-/*
- * Macros to load and store st_changecount with the memory barriers.
- *
- * pgstat_increment_changecount_before() and
- * pgstat_increment_changecount_after() need to be called before and after
- * PgBackendStatus entries are modified, respectively. This makes sure that
- * st_changecount is incremented around the modification.
- *
- * Also pgstat_save_changecount_before() and pgstat_save_changecount_after()
- * need to be called before and after PgBackendStatus entries are copied into
- * private memory, respectively.
- */
-#define pgstat_increment_changecount_before(beentry)    \
-    do {    \
-        beentry->st_changecount++;    \
-        pg_write_barrier(); \
-    } while (0)
-
-#define pgstat_increment_changecount_after(beentry) \
-    do {    \
-        pg_write_barrier(); \
-        beentry->st_changecount++;    \
-        Assert((beentry->st_changecount & 1) == 0); \
-    } while (0)
-
-#define pgstat_save_changecount_before(beentry, save_changecount)    \
-    do {    \
-        save_changecount = beentry->st_changecount; \
-        pg_read_barrier();    \
-    } while (0)
-
-#define pgstat_save_changecount_after(beentry, save_changecount)    \
-    do {    \
-        pg_read_barrier();    \
-        save_changecount = beentry->st_changecount; \
-    } while (0)
-
-/* ----------
- * LocalPgBackendStatus
- *
- * When we build the backend status array, we use LocalPgBackendStatus to be
- * able to add new values to the struct when needed without adding new fields
- * to the shared memory. It contains the backend status as a first member.
- * ----------
- */
-typedef struct LocalPgBackendStatus
-{
-    /*
-     * Local version of the backend status entry.
-     */
-    PgBackendStatus backendStatus;
-
-    /*
-     * The xid of the current transaction if available, InvalidTransactionId
-     * if not.
-     */
-    TransactionId backend_xid;
-
-    /*
-     * The xmin of the current session if available, InvalidTransactionId if
-     * not.
-     */
-    TransactionId backend_xmin;
-} LocalPgBackendStatus;
-
 /*
  * Working state needed to accumulate per-function-call timing statistics.
  */
@@ -784,10 +364,8 @@ typedef struct PgStat_FunctionCallUsage
  * GUC parameters
  * ----------
  */
-extern bool pgstat_track_activities;
 extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
-extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
 
 /* No longer used, but will be removed with GUC */
@@ -836,26 +414,9 @@ extern void pgstat_report_deadlock(void);
 extern void pgstat_clear_snapshot(void);
 
 extern void pgstat_initialize(void);
+extern void pgstat_bearray_initialize(void);
 extern void pgstat_bestart(void);
 
-extern void pgstat_report_activity(BackendState state, const char *cmd_str);
-extern void pgstat_report_tempfile(size_t filesize);
-extern void pgstat_report_appname(const char *appname);
-extern void pgstat_report_xact_timestamp(TimestampTz tstamp);
-extern const char *pgstat_get_wait_event(uint32 wait_event_info);
-extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
-extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
-extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
-                                    int buflen);
-extern const char *pgstat_get_backend_desc(BackendType backendType);
-
-extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
-                              Oid relid);
-extern void pgstat_progress_update_param(int index, int64 val);
-extern void pgstat_progress_update_multi_param(int nparam, const int *index,
-                                   const int64 *val);
-extern void pgstat_progress_end_command(void);
-
 extern PgStat_TableStatus *find_tabstat_entry(Oid rel_id);
 extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 
@@ -866,60 +427,6 @@ extern PgStat_StatDBEntry *backend_get_db_entry(Oid dbid, bool oneshot);
 extern HTAB *backend_snapshot_all_db_entries(void);
 extern PgStat_StatTabEntry *backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid relid, bool oneshot);
 
-/* ----------
- * pgstat_report_wait_start() -
- *
- *    Called from places where server process needs to wait.  This is called
- *    to report wait event information.  The wait information is stored
- *    as 4-bytes where first byte represents the wait event class (type of
- *    wait, for different types of wait, refer WaitClass) and the next
- *    3-bytes represent the actual wait event.  Currently 2-bytes are used
- *    for wait event which is sufficient for current usage, 1-byte is
- *    reserved for future usage.
- *
- * NB: this *must* be able to survive being called before MyProc has been
- * initialized.
- * ----------
- */
-static inline void
-pgstat_report_wait_start(uint32 wait_event_info)
-{
-    volatile PGPROC *proc = MyProc;
-
-    if (!pgstat_track_activities || !proc)
-        return;
-
-    /*
-     * Since this is a four-byte field which is always read and written as
-     * four-bytes, updates are atomic.
-     */
-    proc->wait_event_info = wait_event_info;
-}
-
-/* ----------
- * pgstat_report_wait_end() -
- *
- *    Called to report end of a wait.
- *
- * NB: this *must* be able to survive being called before MyProc has been
- * initialized.
- * ----------
- */
-static inline void
-pgstat_report_wait_end(void)
-{
-    volatile PGPROC *proc = MyProc;
-
-    if (!pgstat_track_activities || !proc)
-        return;
-
-    /*
-     * Since this is a four-byte field which is always read and written as
-     * four-bytes, updates are atomic.
-     */
-    proc->wait_event_info = 0;
-}
-
 /* nontransactional event counts are simple enough to inline */
 
 #define pgstat_count_heap_scan(rel)                                    \
@@ -987,6 +494,8 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 extern void pgstat_update_archiver(const char *xlog, bool failed);
 extern void pgstat_update_bgwriter(void);
 
+extern void pgstat_report_tempfile(size_t filesize);
+
 /* ----------
  * Support functions for the SQL-callable functions to
  * generate the pgstat* views.
@@ -994,10 +503,7 @@ extern void pgstat_update_bgwriter(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid relid, bool oneshot);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
-extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
-extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
-extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
 
-- 
2.16.3
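
The st_changecount protocol described in the PgBackendStatus comment above
can be illustrated with a minimal, standalone sketch. It is not the code
from the patch: the patch uses a plain int together with pg_write_barrier()
and pg_read_barrier(), while this sketch swaps in C11 atomics and fences so
that it compiles outside the PostgreSQL tree. DemoEntry, DemoSnapshot and
all demo_* functions are hypothetical names introduced only for
illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for PgBackendStatus: a counter plus two fields. */
typedef struct DemoEntry
{
    atomic_uint changecount;    /* even = stable, odd = write in progress */
    atomic_int  state;
    atomic_long start_time;
} DemoEntry;

/* Private copy taken by a reader. */
typedef struct DemoSnapshot
{
    int  state;
    long start_time;
} DemoSnapshot;

static void demo_init(DemoEntry *e)
{
    atomic_init(&e->changecount, 0);
    atomic_init(&e->state, 0);
    atomic_init(&e->start_time, 0);
}

/* Counterpart of pgstat_increment_changecount_before(): bump, then barrier. */
static void demo_begin_write(DemoEntry *e)
{
    unsigned c = atomic_load_explicit(&e->changecount, memory_order_relaxed) + 1;

    atomic_store_explicit(&e->changecount, c, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
}

/* Counterpart of pgstat_increment_changecount_after(): barrier, then bump. */
static void demo_end_write(DemoEntry *e)
{
    unsigned c = atomic_load_explicit(&e->changecount, memory_order_relaxed) + 1;

    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&e->changecount, c, memory_order_relaxed);
    assert((c & 1) == 0);       /* counter must be even again */
}

/* The owning process updates its entry between the two increments. */
static void demo_write(DemoEntry *e, int state, long start_time)
{
    demo_begin_write(e);
    atomic_store_explicit(&e->state, state, memory_order_relaxed);
    atomic_store_explicit(&e->start_time, start_time, memory_order_relaxed);
    demo_end_write(e);
}

/* One read attempt: the copy is valid only if the counter is unchanged and even. */
static bool demo_try_read(DemoEntry *e, DemoSnapshot *out)
{
    unsigned before = atomic_load_explicit(&e->changecount, memory_order_acquire);
    unsigned after;

    out->state = atomic_load_explicit(&e->state, memory_order_relaxed);
    out->start_time = atomic_load_explicit(&e->start_time, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);
    after = atomic_load_explicit(&e->changecount, memory_order_relaxed);

    return before == after && (before & 1) == 0;
}

/* Readers start over until they get a consistent copy. */
static void demo_read(DemoEntry *e, DemoSnapshot *out)
{
    while (!demo_try_read(e, out))
        ;
}
```

A reader calls demo_read() and only trusts the snapshot once the function
returns; a writer never waits, which is the "cheap updates, potentially
expensive reads" tradeoff the comment refers to.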

From 37c00b49702b5d83758643a0df188f16894d9401 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 4 Jul 2018 11:44:31 +0900
Subject: [PATCH 7/7] Documentation update

Remove all descriptions of the pg_stat_tmp directory from the documentation.
---
 doc/src/sgml/backup.sgml     |  2 --
 doc/src/sgml/config.sgml     | 19 -------------------
 doc/src/sgml/monitoring.sgml |  7 +------
 doc/src/sgml/storage.sgml    |  3 +--
 4 files changed, 2 insertions(+), 29 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index a73fd4d044..95285809c2 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1119,8 +1119,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index db1a2d4e74..2c07014ffb 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6643,25 +6643,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 7aada14417..583e14b6f3 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -197,12 +197,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
+   <productname>PostgreSQL</productname> processes through shared memory.
    When the server shuts down cleanly, a permanent copy of the statistics
    data is stored in the <filename>pg_stat</filename> subdirectory, so that
    statistics can be retained across server restarts.  When recovery is
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 8ef2ac8010..e137e6b494 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -122,8 +122,7 @@ Item
 
 <row>
  <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
+ <entry>Subdirectory containing ephemeral files for extensions</entry>
 </row>
 
 <row>
-- 
2.16.3
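
The wait-event encoding described in the pgstat_report_wait_start() comment
in the bestatus.h patch earlier in this message (class in the first byte,
event in the following bytes, of which two are currently used) can be
decoded roughly as follows. This is a standalone sketch, not code from the
patch: the DEMO_* macros and demo_* functions are hypothetical names, only
three of the wait classes are reproduced, and the class labels returned
here are illustrative.

```c
#include <stdint.h>

/* Class values copied from the PG_WAIT_* definitions in the patch. */
#define DEMO_PG_WAIT_IPC        0x08000000U
#define DEMO_PG_WAIT_TIMEOUT    0x09000000U
#define DEMO_PG_WAIT_IO         0x0A000000U

/* Assumed masks: top byte is the class, low two bytes are the event. */
#define DEMO_WAIT_CLASS_MASK    0xFF000000U
#define DEMO_WAIT_EVENT_MASK    0x0000FFFFU

/* Events are numbered upward from their class value, as in the patch enums. */
enum
{
    DEMO_WAIT_EVENT_BASE_BACKUP_THROTTLE = DEMO_PG_WAIT_TIMEOUT,
    DEMO_WAIT_EVENT_PG_SLEEP,
    DEMO_WAIT_EVENT_RECOVERY_APPLY_DELAY
};

/* Extract the wait class (still shifted into the top byte). */
static uint32_t demo_wait_class(uint32_t wait_event_info)
{
    return wait_event_info & DEMO_WAIT_CLASS_MASK;
}

/* Extract the event number within its class. */
static uint32_t demo_wait_event_id(uint32_t wait_event_info)
{
    return wait_event_info & DEMO_WAIT_EVENT_MASK;
}

/* Map a class to a display label; unknown classes yield "???". */
static const char *demo_wait_class_name(uint32_t wait_event_info)
{
    switch (demo_wait_class(wait_event_info))
    {
        case DEMO_PG_WAIT_IPC:
            return "IPC";
        case DEMO_PG_WAIT_TIMEOUT:
            return "Timeout";
        case DEMO_PG_WAIT_IO:
            return "IO";
        default:
            return "???";
    }
}
```

Because the whole value fits in one 32-bit word, a single aligned store to
wait_event_info publishes class and event together, which is why the inline
functions in the patch need no locking.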


Re: shared-memory based stats collector

From
Tomas Vondra
Date:
On 11/27/18 9:59 AM, Kyotaro HORIGUCHI wrote:
>>
>> ...
>>
>> For the main workload there's pretty much no difference, but for selects
>> from the stats catalogs there's ~20% drop in throughput. In absolute
>> numbers this means drop from ~670tps to ~550tps. I haven't investigated
>> this, but I suppose this is due to dshash seqscan being more expensive
>> than reading the data from file.
> 
> Thanks for finding that. The three seqscan loops in
> pgstat_vacuum_stat cannot take such a long time, I think. I'll
> investigate it.
> 

OK. I'm not sure this is related to pgstat_vacuum_stat - the slowdown 
happens while querying the catalogs, so why would that trigger vacuum of 
the stats? I may be missing something, of course.

FWIW, the "query statistics" test simply does this:

   SELECT * FROM pg_stat_all_tables;
   SELECT * FROM pg_stat_all_indexes;
   SELECT * FROM pg_stat_user_indexes;
   SELECT * FROM pg_stat_user_tables;
   SELECT * FROM pg_stat_sys_tables;
   SELECT * FROM pg_stat_sys_indexes;

and the slowdown happened even when it was running on its own (nothing else 
running on the instance). Which mostly rules out concurrency issues with 
the hash table locking etc.

>> I don't think any of this is an issue in practice, though. The important
>> thing is that there's no measurable impact on the regular workload.
>>
>> Now, a couple of comments regarding individual parts of the patch.
>>
>>
>> 0001-0003
>> ---------
>>
>> I do think 0001 - 0003 are ready, with some minor cosmetic issues:
>>
>> 1) I'd rephrase the last part of dshash_seq_init comment more like this:
>>
>> * If consistent is set for dshash_seq_init, the all hash table
>> * partitions are locked in the requested mode (as determined by the
>> * exclusive flag), and the locks are held until the end of the scan.
>> * Otherwise the partition locks are acquired and released as needed
>> * during the scan (up to two partitions may be locked at the same time).
> 
> Replaced with this.
> 
>> Maybe it should briefly explain what the consistency guarantees are (and
>> aren't), but considering we're not materially changing the existing
>> behavior, it probably is not really necessary.
> 
> Mmm. actually sequential scan is a new thing altogether, but..
> 

Sure, there are new pieces. But does it significantly change consistency 
guarantees when reading the stats? I don't think so - there was no 
strict consistency guaranteed before (due to data interleaved with 
inquiries, UDP throwing away packets under load, etc.). Based on the 
discussion in this thread that seems to be the consensus.

>> 2) I think the dshash_find_extended() signature should be more like
>> dshash_find(), i.e. just appending parameters instead of moving them
>> around unnecessarily. Perhaps we should add
> 
> Sure. It seems to have done by my off-lined finger;p Fixed.
> 

;-)


>> 0004 (+0005 and 0007)
>> ---------------------
>>
>> This seems fine, but I have my doubts about two changes - removing of
>> stats_temp_directory and the IDLE_STATS_UPDATE_TIMEOUT thingy.
>>
>> There's a couple of issues with the stats_temp_directory. Firstly, I
>> don't understand why it's spread over multiple parts of the patch. The
>> GUC is removed in 0004, the underlying variable is removed in 0005 and
>> then the docs are updated in 0007. If we really want to do this, it
>> should happen in a single patch.
> 
> Sure.
> 
>> But the main question is - do we really want to do that? I understand
>> this directory was meant for the stats data we're moving to shared
>> memory, so removing it seems natural. But clearly it's used by
>> pg_stat_statements - 0005 fixes that, of course, but I wonder if there
>> are other extensions using it to store files?
>> It's not just about how intensive I/O to those files is, but this also
>> means the files will now be included in backups / pg_rewind, and maybe
>> that's not really desirable?
>>
>> Maybe it's fine but I'm not quite convinced about it ...
> 
> It was also in my mind. Anyway sorry for the strange separation.
> 
> I was confused about pgstat_stat_directory (the names are
> actually very confusing..). In addition, pg_stat_statements
> does *not* use the variable stats_temp_directory; it uses
> PG_STAT_TMP_DIR. pgstat_stat_directory was used only by
> basebackup.c.
> 
> The GUC base variable pgstat_temp_directory is not extern'ed so
> we can just remove it along with the GUC
> definition. pgstat_stat_directory (it actually stores *temporary*
> stats directory) was extern'ed in pgstat.h and PG_STAT_TMP_DIR is
> defined in pgstat.h. They are not removed in the new version.
> Finally 0005 no longer breaks any other bins, contribs and
> external extensions.
> 

Great. I'll take a look.

>> I'm not sure I understand what IDLE_STATS_UPDATE_TIMEOUT does. You've
>> described it as
>>
>>     This adds a new timeout IDLE_STATS_UPDATE_TIMEOUT. This works
>>     similarly to IDLE_IN_TRANSACTION_SESSION_TIMEOUT. It fires in
>>     at most PGSTAT_STAT_MIN_INTERVAL(500)ms to clean up pending
>>     statistics update.
>>
>> but I'm not sure what pending updates you mean? Aren't we updating
>> the stats at the end of each transaction? At least that's what we've
>> been doing before, so maybe this patch changes that?
> 
> Without the timeout, updates on shared memory happen at the same
> rate as transaction traffic, which easily causes congestion. So
> the update frequency is limited to the timeout in this patch, and
> the local statistics made by transactions committed within the
> timeout interval will be merged into one shared stats update.
> These are the "pending statistics".
> 
> The socket-based stats collector does not update the temporary
> stats file at intervals shorter than the timeout. So, from the
> readers' point of view, the update timeout seemingly behaves the
> same way as the socket-based stats collector.
> 
> If local statistics are not fully processed at the end of the last
> transaction, we don't have a chance to flush them before the next
> transaction ends. So the timeout is armed if any "pending stats"
> remain (around postgres.c:4175). The pending stats are processed
> forcibly in ProcessInterrupts().
> 

OK, thanks for the explanation. So it's essentially a protection against 
stats from short transactions not being reported for a long time, when 
the next transaction is long. For example we might end up with a 
sequence of short transactions

     T1: short, does not trigger IDLE_STATS_UPDATE_TIMEOUT -> local
     T2: short, does not trigger IDLE_STATS_UPDATE_TIMEOUT -> local
     ...
     TN: short, does not trigger IDLE_STATS_UPDATE_TIMEOUT -> local
     T(N+1): long (say, several hours)

in which case stats from short ones are not reported until the end of 
the long one. That makes sense.

That however raises the question - won't that also report some of the 
stats from the last transaction? That would be a change compared to 
current behavior, although I'm not sure it's undesirable - it's often 
quite annoying that we don't receive stats from a transaction until it 
completes. But I wonder - doesn't this affect the pg_stat_xact_* views?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: shared-memory based stats collector

From
Tomas Vondra
Date:
On 11/27/18 9:59 AM, Kyotaro HORIGUCHI wrote:
> ...
> 
> v10-0001-sequential-scan-for-dshash.patch
> v10-0002-Add-conditional-lock-feature-to-dshash.patch
>   fixed.
> v10-0003-Make-archiver-process-an-auxiliary-process.patch
>   fixed.
> v10-0004-Shared-memory-based-stats-collector.patch
>   updated not to touch guc.
> v10-0005-Remove-the-GUC-stats_temp_directory.patch
>   collected all guc-related changes.
>   updated not to break other programs.
> v10-0006-Split-out-backend-status-monitor-part-from-pgstat.patch
>   basebackup.c requires both bestats.h and pgstat.h
> v10-0007-Documentation-update.patch
>   small change related to 0005.
> 

I need to do a more thorough review of part 0006, but these patches
seem quite fine to me. I'd however merge 0007 into the other relevant
parts (it seems like a mix of docs changes for 0004, 0005 and 0006).

Thinking about it a bit more, I'm wondering if we need to keep 0004 and
0005 separate. My understanding is that the stats_temp_directory is used
only from the stats collector, so it probably does not make much sense
to keep it after 0004. We may also keep it separate and then commit both
0004 and 0005 together, of course. What do you think?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: shared-memory based stats collector

From
Alvaro Herrera
Date:
On 2018-Nov-28, Tomas Vondra wrote:

> > v10-0004-Shared-memory-based-stats-collector.patch
> >   updated not to touch guc.
> > v10-0005-Remove-the-GUC-stats_temp_directory.patch
> >   collected all guc-related changes.
> >   updated not to break other programs.
> > v10-0006-Split-out-backend-status-monitor-part-from-pgstat.patch
> >   basebackup.c requires both bestats.h and pgstat.h
> > v10-0007-Documentation-update.patch
> >   small change related to 0005.
> 
> I need to do a more thorough review of part 0006, but these patches
> seem quite fine to me. I'd however merge 0007 into the other relevant
> parts (it seems like a mix of docs changes for 0004, 0005 and 0006).

Looking at 0001 - 0003 it seems OK to keep each as separate commits, but
I suggest to have 0004+0006 be a single commit, mostly because
introducing a bunch of "new" code in 0004 and then moving it over to
bestatus.c in 0006 makes "git blame" doubly painful.  And I think
committing 0005 and not 0007 makes the documentation temporarily buggy,
so I see no reason to think of this as two commits, one being 0004+0006
and the other 0005+0007.  And even those could conceivably be pushed
together instead of as a single patch.  (But be sure to push very early
in your work day, to have plenty of time to deal with any resulting
buildfarm problems.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: shared-memory based stats collector

From
Tomas Vondra
Date:
On 11/29/18 1:18 PM, Alvaro Herrera wrote:
> On 2018-Nov-28, Tomas Vondra wrote:
> 
>>> v10-0004-Shared-memory-based-stats-collector.patch
>>>   updated not to touch guc.
>>> v10-0005-Remove-the-GUC-stats_temp_directory.patch
>>>   collected all guc-related changes.
>>>   updated not to break other programs.
>>> v10-0006-Split-out-backend-status-monitor-part-from-pgstat.patch
>>>   basebackup.c requires both bestats.h and pgstat.h
>>> v10-0007-Documentation-update.patch
>>>   small change related to 0005.
>>
>> I need to do a more thorough review of part 0006, but these patches
>> seem quite fine to me. I'd however merge 0007 into the other relevant
>> parts (it seems like a mix of docs changes for 0004, 0005 and 0006).
> 
> Looking at 0001 - 0003 it seems OK to keep each as separate commits, but
> I suggest to have 0004+0006 be a single commit, mostly because
> introducing a bunch of "new" code in 0004 and then moving it over to
> bestatus.c in 0006 makes "git blame" doubly painful.  And I think
> committing 0005 and not 0007 makes the documentation temporarily buggy,
> so I see no reason to think of this as two commits, one being 0004+0006
> and the other 0005+0007.  And even those could conceivably be pushed
> together instead of as a single patch.  (But be sure to push very early
> in your work day, to have plenty of time to deal with any resulting
> buildfarm problems.)
> 

Kyotaro-san, do you agree with committing the patch the way Alvaro
proposed? That is, 0001-0003 as separate commits, and 0004+0006 and
0005+0007 together. The plan seems reasonable to me.

FWIW I see cputube reports some build failures on Windows:

https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.26736#L3135

If I understand it correctly, it complains about this line in postmaster.c:

extern pgsocket pgStatSock;

which seems to only affect EXEC_BACKEND (including Win32). ISTM we
should get rid of all pgStatSock references, per the attached fix.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Attachments

Re: shared-memory based stats collector

From
Andres Freund
Date:
Hi,

On 2019-01-01 18:39:12 +0100, Tomas Vondra wrote:
> On 11/29/18 1:18 PM, Alvaro Herrera wrote:
> > On 2018-Nov-28, Tomas Vondra wrote:
> >
> >>> v10-0004-Shared-memory-based-stats-collector.patch
> >>>   updated not to touch guc.
> >>> v10-0005-Remove-the-GUC-stats_temp_directory.patch
> >>>   collected all guc-related changes.
> >>>   updated not to break other programs.
> >>> v10-0006-Split-out-backend-status-monitor-part-from-pgstat.patch
> >>>   basebackup.c requires both bestats.h and pgstat.h
> >>> v10-0007-Documentation-update.patch
> >>>   small change related to 0005.
> >>
> >> I need to do a more thorough review of part 0006, but these patches
> >> seem quite fine to me. I'd however merge 0007 into the other relevant
> >> parts (it seems like a mix of docs changes for 0004, 0005 and 0006).
> >
> > Looking at 0001 - 0003 it seems OK to keep each as separate commits, but
> > I suggest to have 0004+0006 be a single commit, mostly because
> > introducing a bunch of "new" code in 0004 and then moving it over to
> > bestatus.c in 0006 makes "git blame" doubly painful.  And I think
> > committing 0005 and not 0007 makes the documentation temporarily buggy,
> > so I see no reason to think of this as two commits, one being 0004+0006
> > and the other 0005+0007.  And even those could conceivably be pushed
> > together instead of as a single patch.  (But be sure to push very early
> > in your work day, to have plenty of time to deal with any resulting
> > buildfarm problems.)
> >
>
> Kyotaro-san, do you agree with committing the patch the way Alvaro
> proposed? That is, 0001-0003 as separate commits, and 0004+0006 and
> 0005+0007 together. The plan seems reasonable to me.

Do you guys think these patches are ready already? I'm a bit doubtful, and
failures here could have quite wide-ranging symptoms.

Greetings,

Andres Freund


Re: shared-memory based stats collector

From
Tomas Vondra
Date:

On 1/1/19 7:03 PM, Andres Freund wrote:
> Hi,
> 
> On 2019-01-01 18:39:12 +0100, Tomas Vondra wrote:
>> On 11/29/18 1:18 PM, Alvaro Herrera wrote:
>>> On 2018-Nov-28, Tomas Vondra wrote:
>>>
>>>>> v10-0004-Shared-memory-based-stats-collector.patch
>>>>>   updated not to touch guc.
>>>>> v10-0005-Remove-the-GUC-stats_temp_directory.patch
>>>>>   collected all guc-related changes.
>>>>>   updated not to break other programs.
>>>>> v10-0006-Split-out-backend-status-monitor-part-from-pgstat.patch
>>>>>   basebackup.c requires both bestats.h and pgstat.h
>>>>> v10-0007-Documentation-update.patch
>>>>>   small change related to 0005.
>>>>
>>>> I need to do a more thorough review of part 0006, but these patches
>>>> seem quite fine to me. I'd however merge 0007 into the other relevant
>>>> parts (it seems like a mix of docs changes for 0004, 0005 and 0006).
>>>
>>> Looking at 0001 - 0003 it seems OK to keep each as separate commits, but
>>> I suggest to have 0004+0006 be a single commit, mostly because
>>> introducing a bunch of "new" code in 0004 and then moving it over to
>>> bestatus.c in 0006 makes "git blame" doubly painful.  And I think
>>> committing 0005 and not 0007 makes the documentation temporarily buggy,
>>> so I see no reason to think of this as two commits, one being 0004+0006
>>> and the other 0005+0007.  And even those could conceivably be pushed
>>> together instead of as a single patch.  (But be sure to push very early
>>> in your work day, to have plenty of time to deal with any resulting
>>> buildfarm problems.)
>>>
>>
>> Kyotaro-san, do you agree with committing the patch the way Alvaro
>> proposed? That is, 0001-0003 as separate commits, and 0004+0006 and
>> 0005+0007 together. The plan seems reasonable to me.
> 
> Do you guys think these patches are ready already? I'm a bit doubtful, and
> failures here could have quite wide-ranging symptoms.
> 

I agree it's a sensitive part of the code, so additional reviews would
be welcome of course. I've done as much review and testing as possible,
and overall it seems in a fairly good shape. Do you have any particular
concerns / ideas what to look for?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: shared-memory based stats collector

From
Alvaro Herrera
Date:
On 2019-Jan-01, Tomas Vondra wrote:

> I agree it's a sensitive part of the code, so additional reviews would
> be welcome of course. I've done as much review and testing as possible,
> and overall it seems in a fairly good shape. Do you have any particular
> concerns / ideas what to look for?

I haven't reviewed this patch thoroughly.

Shall we do a triage run over the complete commitfest to determine the
highest priority items that we should put extra effort into reviewing?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: shared-memory based stats collector

From
Tomas Vondra
Date:
Hi,

The patch needs rebasing, as it got broken by 285d8e1205, and there's
some other minor bitrot.

On 11/27/18 4:40 PM, Tomas Vondra wrote:
> On 11/27/18 9:59 AM, Kyotaro HORIGUCHI wrote:
>>>
>>> ...
>>>
>>> For the main workload there's pretty much no difference, but for
>>> selects from the stats catalogs there's ~20% drop in throughput.
>>> In absolute numbers this means drop from ~670tps to ~550tps. I
>>> haven't investigated this, but I suppose this is due to dshash
>>> seqscan being more expensive than reading the data from file.
>>
>> Thanks for finding that. The three seqscan loops in 
>> pgstat_vacuum_stat cannot take such a long time, I think. I'll 
>> investigate it.
>>
> 
> OK. I'm not sure this is related to pgstat_vacuum_stat - the
> slowdown happens while querying the catalogs, so why would that
> trigger vacuum of the stats? I may be missing something, of course.
> 
> FWIW, the "query statistics" test simply does this:
> 
>   SELECT * FROM pg_stat_all_tables;
>   SELECT * FROM pg_stat_all_indexes;
>   SELECT * FROM pg_stat_user_indexes;
>   SELECT * FROM pg_stat_user_tables;
>   SELECT * FROM pg_stat_sys_tables;
>   SELECT * FROM pg_stat_sys_indexes;
> 
> and the slowdown happened even when it was running on its own (nothing
> else running on the instance). Which mostly rules out concurrency
> issues with the hash table locking etc.
> 

Did you have time to investigate the slowdown?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: shared-memory based stats collector

From
Kyotaro HORIGUCHI
Date:
Thank you very much for reviewing this and sorry for the absence.

At Sun, 20 Jan 2019 18:13:04 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
<b760035b-1941-38bb-5e84-c2fbc63fef6b@2ndquadrant.com>
>
> Hi,
>
> The patch needs rebasing, as it got broken by 285d8e1205, and there's
> some other minor bitrot.


The most affected part was 0006 because of file splitting, but
actually only the following four (actually three) commits
affected it.

42e2a58071 Fix typos in documentation and for one wait event
97c39498e5 Update copyright for 2019
578b229718 Remove WITH OIDS support, change oid catalog column visibility.
(125f551c8b Leave SIGTTIN/SIGTTOU signal handling alone in postmaster child processes.)

The last one is not relevant because stats collector is no longer
a process.

This version includes the fix for the EXEC_BACKEND-related bug pointed out in

https://www.postgresql.org/message-id/854d6d91-f2f3-e391-f0fc-064db51b391e@2ndquadrant.com

> On 11/27/18 4:40 PM, Tomas Vondra wrote:
> > On 11/27/18 9:59 AM, Kyotaro HORIGUCHI wrote:
> >>>
> >>> ...
> >>>
> >>> For the main workload there's pretty much no difference, but for
> >>> selects from the stats catalogs there's ~20% drop in throughput.
> >>> In absolute numbers this means drop from ~670tps to ~550tps. I
> >>> haven't investigated this, but I suppose this is due to dshash
> >>> seqscan being more expensive than reading the data from file.
> >>
> >> Thanks for finding that. The three seqscan loops in
> >> pgstat_vacuum_stat cannot take such a long time, I think. I'll
> >> investigate it.
> >>
> >
> > OK. I'm not sure this is related to pgstat_vacuum_stat - the
> > slowdown happens while querying the catalogs, so why would that
> > trigger vacuum of the stats? I may be missing something, of course.
> >
> > FWIW, the "query statistics" test simply does this:
> >
> >   SELECT * FROM pg_stat_all_tables;
> >   SELECT * FROM pg_stat_all_indexes;
> >   SELECT * FROM pg_stat_user_indexes;
> >   SELECT * FROM pg_stat_user_tables;
> >   SELECT * FROM pg_stat_sys_tables;
> >   SELECT * FROM pg_stat_sys_indexes;
> >
> > and the slowdown happened even when it was running on its own (nothing
> > else running on the instance). Which mostly rules out concurrency
> > issues with the hash table locking etc.
> >
>
> Did you have time to investigate the slowdown?

It seems to me that the slowdown comes from local caching in
snapshot_statentry, in several ways.

It searches the local hash (HTAB) first, then the shared hash
(dshash) if the entry is not found, and copies the found entry
into the local hash (action A). *If* a second reference comes
within the same transaction, HTAB returns the result (action B).
But with frequent short transactions it mostly takes action A.
The frequency of action A could be reduced to once per update
interval of the shared stats, but that interval becomes shorter
when many backends are running.

Another bottleneck was found in pgstat_fetch_stat_tabentry. It calls
pgstat_fetch_stat_dbentry() too often. That can be largely reduced.

A quick (and dirty) fix of the above reduced the slowdown
roughly by half. (59tps(master)->48tps(current)->54tps(the fix))

I'll reconsider the reader side of the stats.

I didn't merge the suggested two pairs of commits. I'll do that
after addressing the slowdown issue.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
From 7149e93d7b41af0c7ce1cddc847a9bb7bc31b1e7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:41:04 +0900
Subject: [PATCH 1/7] sequential scan for dshash

Add sequential scan feature to dshash.
---
 src/backend/lib/dshash.c | 188 ++++++++++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  23 +++++-
 2 files changed, 206 insertions(+), 5 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index f095196fb6..d1908a6137 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -112,6 +112,7 @@ struct dshash_table
     size_t        size_log2;        /* log2(number of buckets) */
     bool        find_locked;    /* Is any partition lock held by 'find'? */
     bool        find_exclusively_locked;    /* ... exclusively? */
+    bool        seqscan_running;/* now under sequential scan */
 };
 
 /* Given a pointer to an item, find the entry (user data) it holds. */
@@ -127,6 +128,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +158,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -228,6 +237,7 @@ dshash_create(dsa_area *area, const dshash_parameters *params, void *arg)
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
 
     /*
      * Set up the initial array of buckets.  Our initial size is the same as
@@ -279,6 +289,7 @@ dshash_attach(dsa_area *area, const dshash_parameters *params,
     hash_table->control = dsa_get_address(area, control);
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
     Assert(hash_table->control->magic == DSHASH_MAGIC);
 
     /*
@@ -324,7 +335,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
                                 LW_EXCLUSIVE));
 
     delete_item(hash_table, item);
-    hash_table->find_locked = false;
-    hash_table->find_exclusively_locked = false;
-    LWLockRelease(PARTITION_LOCK(hash_table, partition));
+
+    /* We need to keep the partition lock during a sequential scan */
+    if (!hash_table->seqscan_running)
+    {
+        hash_table->find_locked = false;
+        hash_table->find_exclusively_locked = false;
+        LWLockRelease(PARTITION_LOCK(hash_table, partition));
+    }
 }
 
 /*
@@ -568,6 +584,8 @@ dshash_release_lock(dshash_table *hash_table, void *entry)
     Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition_index),
                                 hash_table->find_exclusively_locked
                                 ? LW_EXCLUSIVE : LW_SHARED));
+    /* lock is under control of sequential scan */
+    Assert(!hash_table->seqscan_running);
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
@@ -592,6 +610,168 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should be called if and only if the scan is abandoned
+ * before completion; if dshash_seq_next returns NULL then it has already done
+ * the end-of-scan cleanup.
+ *
+ * On returning element, it is locked as is the case with dshash_find.
+ * However, the caller must not release the lock. The lock is released as
+ * necessary in continued scan.
+ *
+ * As opposed to the equivalent for dynahash, the caller is not supposed to
+ * delete the returned element before continuing the scan.
+ *
+ * If consistent is set for dshash_seq_init, the whole hash table is
+ * non-exclusively locked. Otherwise a part of the hash table is locked in the
+ * same mode (partition lock).
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool consistent, bool exclusive)
+{
+    /* allowed at most one scan at once */
+    Assert(!hash_table->seqscan_running);
+
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->consistent = consistent;
+    status->exclusive = exclusive;
+    hash_table->seqscan_running = true;
+
+    /*
+     * Protect all partitions from modification if the caller wants a
+     * consistent result.
+     */
+    if (consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+        {
+            Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i)));
+
+            LWLockAcquire(PARTITION_LOCK(hash_table, i),
+                          exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        }
+        ensure_valid_bucket_pointers(hash_table);
+    }
+}
+
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    Assert(status->hash_table->seqscan_running);
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert (status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        if (!status->consistent)
+        {
+            partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+            LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            status->curpartition = partition;
+
+            /* resize doesn't happen from now until seq scan ends */
+            status->nbuckets =
+                NUM_BUCKETS(status->hash_table->control->size_log2);
+            ensure_valid_bucket_pointers(status->hash_table);
+        }
+        
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            dshash_seq_term(status);
+            return NULL;
+        }
+
+        /* Also move partition lock if needed */
+        if (!status->consistent)
+        {
+            int next_partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+
+            /* Move lock along with partition for the bucket */
+            if (status->curpartition != next_partition)
+            {
+                /*
+                 * Take lock on the next partition then release the current,
+                 * not in the reverse order. This is required to avoid
+                 * resizing from happening during a sequential scan. Locks are
+                 * taken in partition order so no deadlock happens with other
+                 * seq scans or resizing.
+                 */
+                LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                             next_partition),
+                              status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+                LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                             status->curpartition));
+                status->curpartition = next_partition;
+            }
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * This item can be deleted by the caller. Store the next item for the
+     * next iteration for the occasion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    Assert(status->hash_table->seqscan_running);
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+    status->hash_table->seqscan_running = false;
+
+    if (status->consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+            LWLockRelease(PARTITION_LOCK(status->hash_table, i));
+    }
+    else if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index e5dfd57f0a..b80f3af995 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,23 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state of dshash. The definition is exposed because
+ * callers need to know the storage size, but callers should treat it as an
+ * opaque type.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;
+    int                    curbucket;
+    int                    nbuckets;
+    dshash_table_item  *curitem;
+    dsa_pointer            pnextitem;
+    int                    curpartition;
+    bool                consistent;
+    bool                exclusive;
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
               const dshash_parameters *params,
@@ -70,7 +87,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
 extern void dshash_detach(dshash_table *hash_table);
 extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
 extern void dshash_destroy(dshash_table *hash_table);
-
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
             const void *key, bool exclusive);
@@ -80,6 +96,11 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool consistent, bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.16.3
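
For readers following the locking in dshash_seq_next above: the scan takes the next partition's lock before releasing the current one, and always moves in ascending partition order. The following is only a toy, single-threaded model of that hand-over, not code from the patch. The names toy_scan, toy_advance and the rest are hypothetical, NUM_PARTITIONS stands in for DSHASH_NUM_PARTITIONS, and plain flags stand in for LWLocks.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_PARTITIONS 4        /* stand-in for DSHASH_NUM_PARTITIONS */

/* Hypothetical bookkeeping; boolean flags replace real LWLocks. */
typedef struct toy_scan
{
    bool held[NUM_PARTITIONS];  /* which partition "locks" are held */
    int  curpartition;          /* -1 until the scan holds a lock */
    int  max_held;              /* most locks held at the same time */
} toy_scan;

static int
toy_count_held(const toy_scan *s)
{
    int n = 0;

    for (int i = 0; i < NUM_PARTITIONS; i++)
        if (s->held[i])
            n++;
    return n;
}

static void
toy_acquire(toy_scan *s, int p)
{
    int n;

    assert(!s->held[p]);
    s->held[p] = true;
    n = toy_count_held(s);
    if (n > s->max_held)
        s->max_held = n;
}

static void
toy_release(toy_scan *s, int p)
{
    assert(s->held[p]);
    s->held[p] = false;
}

/* Move to a later partition: take the new lock, then drop the old one. */
static void
toy_advance(toy_scan *s, int next)
{
    assert(next > s->curpartition);     /* ascending order only */
    toy_acquire(s, next);
    if (s->curpartition >= 0)
        toy_release(s, s->curpartition);
    s->curpartition = next;
}

/* Counterpart of the non-consistent branch of dshash_seq_term. */
static void
toy_term(toy_scan *s)
{
    if (s->curpartition >= 0)
        toy_release(s, s->curpartition);
    s->curpartition = -1;
}

/* Visit every partition once; return the high-water mark of held locks. */
static int
toy_scan_all(void)
{
    toy_scan s = {{false}, -1, 0};

    for (int p = 0; p < NUM_PARTITIONS; p++)
        toy_advance(&s, p);
    toy_term(&s);
    assert(toy_count_held(&s) == 0);
    return s.max_held;
}
```

Because every scanner acquires locks in the same ascending order, two scans cannot wait on each other in a cycle, and the overlap between the old and the new lock leaves no window in which the table is unlocked under the scan.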

From 8dafcc8293b856f42bc3a68fa792ea139fd8d0cf Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 27 Sep 2018 11:15:19 +0900
Subject: [PATCH 2/7] Add conditional lock feature to dshash

Dshash currently waits for locks unconditionally. This commit adds new
interfaces for dshash_find and dshash_find_or_insert. The new
interfaces have an extra parameter "nowait" that tells them not to
wait for the lock.
---
 src/backend/lib/dshash.c | 69 +++++++++++++++++++++++++++++++++++++++++++-----
 src/include/lib/dshash.h |  6 ++++-
 2 files changed, 67 insertions(+), 8 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index d1908a6137..db8d6899af 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -394,19 +394,48 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  */
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
+{
+    return dshash_find_extended(hash_table, key, exclusive, false, NULL);
+}
+
+/*
+ * Like dshash_find, but returns NULL at once if nowait is true and the lock
+ * is unavailable. If lock_acquired is not NULL, the lock status is stored there.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool *lock_acquired)
 {
     dshash_hash hash;
     size_t        partition;
     dshash_table_item *item;
 
+    /* lock_acquired is meaningful only when nowait is true */
+    Assert(nowait || !lock_acquired);
+
     hash = hash_key(hash_table, key);
     partition = PARTITION_FOR_HASH(hash);
 
     Assert(hash_table->control->magic == DSHASH_MAGIC);
     Assert(!hash_table->find_locked);
 
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partition),
+                                      exclusive ? LW_EXCLUSIVE : LW_SHARED))
+        {
+            if (lock_acquired)
+                *lock_acquired = false;
+            return NULL;
+        }
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition),
+                      exclusive ? LW_EXCLUSIVE : LW_SHARED);
+
+    if (lock_acquired)
+        *lock_acquired = true;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -441,6 +470,22 @@ void *
 dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
+{
+    return dshash_find_or_insert_extended(hash_table, key, found, false);
+}
+
+/*
+ * Like dshash_find_or_insert, but returns NULL if nowait is true and the
+ * lock could not be acquired.
+ *
+ * Notes above dshash_find_extended() regarding locking and error handling
+ * equally apply here.
+ */
+void *
+dshash_find_or_insert_extended(dshash_table *hash_table,
+                               const void *key,
+                               bool *found,
+                               bool nowait)
 {
     dshash_hash hash;
     size_t        partition_index;
@@ -455,8 +500,16 @@ dshash_find_or_insert(dshash_table *hash_table,
     Assert(!hash_table->find_locked);
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(
+                PARTITION_LOCK(hash_table, partition_index),
+                LW_EXCLUSIVE))
+            return NULL;
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
+                      LW_EXCLUSIVE);
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -626,9 +679,11 @@ dshash_memhash(const void *v, size_t size, void *arg)
  * As opposed to the equivalent for dynanash, the caller is not supposed to
  * delete the returned element before continuing the scan.
  *
- * If consistent is set for dshash_seq_init, the whole hash table is
- * non-exclusively locked. Otherwise a part of the hash table is locked in the
- * same mode (partition lock).
+ * If consistent is set for dshash_seq_init, all hash table
+ * partitions are locked in the requested mode (as determined by the
+ * exclusive flag), and the locks are held until the end of the scan.
+ * Otherwise the partition locks are acquired and released as needed
+ * during the scan (up to two partitions may be locked at the same time).
  */
 void
 dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b80f3af995..fe1d4d75c5 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -90,8 +90,12 @@ extern void dshash_destroy(dshash_table *hash_table);
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
             const void *key, bool exclusive);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+            bool exclusive, bool nowait, bool *lock_acquired);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
-                      const void *key, bool *found);
+            const void *key, bool *found);
+extern void *dshash_find_or_insert_extended(dshash_table *hash_table,
+            const void *key, bool *found, bool nowait);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.16.3
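
A note on the nowait interface above: a NULL result from dshash_find_extended can mean either "no such entry" or "lock not available", and the lock_acquired output is what tells the two apart. The sketch below is a rough stand-alone model of that contract, assuming a one-entry table guarded by a pthread mutex in place of a partition LWLock. toy_table, toy_table_init and toy_find_extended are hypothetical names, not part of dshash.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical one-entry "table"; the mutex stands in for a partition lock. */
typedef struct toy_table
{
    pthread_mutex_t lock;
    bool            has_entry;
    int             key;
    int             value;
} toy_table;

static void
toy_table_init(toy_table *t, int key, int value)
{
    pthread_mutex_init(&t->lock, NULL);
    t->has_entry = true;
    t->key = key;
    t->value = value;
}

/*
 * Returns a pointer to the value with the lock still held, or NULL.
 * With nowait, NULL plus *lock_acquired == false means "busy, try later";
 * NULL plus *lock_acquired == true means "no such entry".
 */
static int *
toy_find_extended(toy_table *t, int key, bool nowait, bool *lock_acquired)
{
    /* same rule as the patch: lock_acquired only makes sense with nowait */
    assert(nowait || lock_acquired == NULL);

    if (nowait)
    {
        if (pthread_mutex_trylock(&t->lock) != 0)
        {
            if (lock_acquired)
                *lock_acquired = false;
            return NULL;
        }
    }
    else
        pthread_mutex_lock(&t->lock);

    if (lock_acquired)
        *lock_acquired = true;

    if (t->has_entry && t->key == key)
        return &t->value;       /* caller must unlock when done */

    /* not found: nothing to hand back, so release the lock */
    pthread_mutex_unlock(&t->lock);
    return NULL;
}
```

A caller that gets "busy" back can keep its update in local memory and retry later instead of blocking.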

From 90522c1de96ac84ba2ad7cc1ada47c7bb9f95e10 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 7 Nov 2018 16:53:49 +0900
Subject: [PATCH 3/7] Make archiver process an auxiliary process

This is a preliminary patch for the shared-memory based stats collector.
The archiver process must be an auxiliary process because it uses shared
memory once stats data is moved into shared memory. Make the process
an auxiliary process so that it keeps working.
---
 src/backend/bootstrap/bootstrap.c   |  8 +++
 src/backend/postmaster/pgarch.c     | 98 +++++++++----------------------------
 src/backend/postmaster/pgstat.c     |  6 +++
 src/backend/postmaster/postmaster.c | 35 +++++++++----
 src/include/miscadmin.h             |  2 +
 src/include/pgstat.h                |  1 +
 src/include/postmaster/pgarch.h     |  4 +-
 7 files changed, 67 insertions(+), 87 deletions(-)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 63bb134949..df926d8dea 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -329,6 +329,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case BgWriterProcess:
                 statmsg = pgstat_get_backend_desc(B_BG_WRITER);
                 break;
+            case ArchiverProcess:
+                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
+                break;
             case CheckpointerProcess:
                 statmsg = pgstat_get_backend_desc(B_CHECKPOINTER);
                 break;
@@ -456,6 +459,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
             BackgroundWriterMain();
             proc_exit(1);        /* should never return */
 
+        case ArchiverProcess:
+            /* don't set signals, archiver has its own agenda */
+            PgArchiverMain();
+            proc_exit(1);        /* should never return */
+
         case CheckpointerProcess:
             /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index f84f882c4c..4342ebdab4 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -77,7 +77,6 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
@@ -96,7 +95,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgarch_exit(SIGNAL_ARGS);
 static void ArchSigHupHandler(SIGNAL_ARGS);
 static void ArchSigTermHandler(SIGNAL_ARGS);
@@ -114,75 +112,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -222,8 +151,8 @@ pgarch_forkexec(void)
  *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
  *    since we don't use 'em, it hardly matters...
  */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -255,8 +184,27 @@ PgArchiverMain(int argc, char *argv[])
 static void
 pgarch_exit(SIGNAL_ARGS)
 {
-    /* SIGQUIT means curl up and die ... */
-    exit(1);
+    PG_SETMASK(&BlockSig);
+
+    /*
+     * We DO NOT want to run proc_exit() callbacks -- we're here because
+     * shared memory may be corrupted, so we don't want to try to clean up our
+     * transaction.  Just nail the windows shut and get out of town.  Now that
+     * there's an atexit callback to prevent third-party code from breaking
+     * things by calling exit() directly, we have to reset the callbacks
+     * explicitly to make this work as intended.
+     */
+    on_exit_reset();
+
+    /*
+     * Note we do exit(2) not exit(0).  This is to force the postmaster into a
+     * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+     * backend.  This is necessary precisely because we don't clean up our
+     * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+     * should ensure the postmaster sees this as a crash, too, but no harm in
+     * being doubly sure.)
+     */
+    exit(2);
 }
 
 /* SIGHUP signal handler for archiver process */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 13da412c59..d1fe052abf 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2857,6 +2857,9 @@ pgstat_bestart(void)
             case BgWriterProcess:
                 beentry->st_backendType = B_BG_WRITER;
                 break;
+            case ArchiverProcess:
+                beentry->st_backendType = B_ARCHIVER;
+                break;
             case CheckpointerProcess:
                 beentry->st_backendType = B_CHECKPOINTER;
                 break;
@@ -4119,6 +4122,9 @@ pgstat_get_backend_desc(BackendType backendType)
         case B_BG_WRITER:
             backendDesc = "background writer";
             break;
+        case B_ARCHIVER:
+            backendDesc = "archiver";
+            break;
         case B_CHECKPOINTER:
             backendDesc = "checkpointer";
             break;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 3052bbbc21..65eab02b3e 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -146,7 +146,8 @@
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
 #define BACKEND_TYPE_BGWORKER    0x0008    /* bgworker process */
-#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
+#define BACKEND_TYPE_ARCHIVER    0x0010    /* archiver process */
+#define BACKEND_TYPE_ALL        0x001F    /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -539,6 +540,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
 #define StartWalReceiver()        StartChildProcess(WalReceiverProcess)
@@ -1757,7 +1759,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -2920,7 +2922,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3065,10 +3067,8 @@ reaper(SIGNAL_ARGS)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3314,7 +3314,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3519,6 +3519,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3795,6 +3807,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5064,7 +5077,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5342,6 +5355,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case CheckpointerProcess:
                 ereport(LOG,
                         (errmsg("could not fork checkpointer process: %m")));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index c9e35003a5..63a7653457 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -399,6 +399,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -411,6 +412,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 313ca5f3c3..f299d1d601 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -706,6 +706,7 @@ typedef enum BackendType
     B_BACKEND,
     B_BG_WORKER,
     B_BG_WRITER,
+    B_ARCHIVER,
     B_CHECKPOINTER,
     B_STARTUP,
     B_WAL_RECEIVER,
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index 2474eac26a..88f16863d4 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
-- 
2.16.3

From c9a252e86ef04b9b59ebdb19c7c3dbabf3422e97 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 12 Nov 2018 17:26:33 +0900
Subject: [PATCH 4/7] Shared-memory based stats collector

This replaces the means of sharing server statistics, switching from a
file to dynamic shared memory. Every backend reads and writes the stats
tables directly. The stats collector process is removed.  Shared stats
are updated at intervals no shorter than 500ms and no longer than 1s.
If the shared memory data is busy and a backend cannot obtain the lock
immediately, the differences are usually stashed into "pending stats"
in local memory and merged into the shared numbers at the next
opportunity.
---
 src/backend/access/transam/xlog.c            |    4 +-
 src/backend/postmaster/autovacuum.c          |   59 +-
 src/backend/postmaster/bgwriter.c            |    4 +-
 src/backend/postmaster/checkpointer.c        |   24 +-
 src/backend/postmaster/pgarch.c              |    4 +-
 src/backend/postmaster/pgstat.c              | 4201 +++++++++++---------------
 src/backend/postmaster/postmaster.c          |   85 +-
 src/backend/replication/logical/tablesync.c  |    9 +-
 src/backend/replication/logical/worker.c     |    4 +-
 src/backend/storage/buffer/bufmgr.c          |    8 +-
 src/backend/storage/ipc/dsm.c                |   24 +-
 src/backend/storage/ipc/ipci.c               |    6 +
 src/backend/storage/lmgr/lwlock.c            |    3 +
 src/backend/storage/lmgr/lwlocknames.txt     |    1 +
 src/backend/tcop/postgres.c                  |   27 +-
 src/backend/utils/adt/pgstatfuncs.c          |   50 +-
 src/backend/utils/init/globals.c             |    1 +
 src/backend/utils/init/postinit.c            |   11 +
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |    2 +-
 src/include/miscadmin.h                      |    2 +-
 src/include/pgstat.h                         |  437 +--
 src/include/storage/dsm.h                    |    3 +
 src/include/storage/lwlock.h                 |    3 +
 src/include/utils/timeout.h                  |    1 +
 src/test/modules/worker_spi/worker_spi.c     |    2 +-
 25 files changed, 1932 insertions(+), 3043 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2ab7d804f0..9e45581d89 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8416,9 +8416,9 @@ LogCheckpointEnd(bool restartpoint)
                         &sync_secs, &sync_usecs);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time +=
+    BgWriterStats.checkpoint_write_time +=
         write_secs * 1000 + write_usecs / 1000;
-    BgWriterStats.m_checkpoint_sync_time +=
+    BgWriterStats.checkpoint_sync_time +=
         sync_secs * 1000 + sync_usecs / 1000;
 
     /*
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 4cf67873b1..a69ea230fb 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -969,7 +969,7 @@ rebuild_database_list(Oid newdb)
         PgStat_StatDBEntry *entry;
 
         /* only consider this database if it has a pgstat entry */
-        entry = pgstat_fetch_stat_dbentry(newdb);
+        entry = pgstat_fetch_stat_dbentry(newdb, true);
         if (entry != NULL)
         {
             /* we assume it isn't found because the hash was just created */
@@ -978,6 +978,7 @@ rebuild_database_list(Oid newdb)
             /* hash_search already filled in the key */
             db->adl_score = score++;
             /* next_worker is filled in later */
+            pfree(entry);
         }
     }
 
@@ -993,7 +994,7 @@ rebuild_database_list(Oid newdb)
          * skip databases with no stat entries -- in particular, this gets rid
          * of dropped databases
          */
-        entry = pgstat_fetch_stat_dbentry(avdb->adl_datid);
+        entry = pgstat_fetch_stat_dbentry(avdb->adl_datid, true);
         if (entry == NULL)
             continue;
 
@@ -1005,6 +1006,7 @@ rebuild_database_list(Oid newdb)
             db->adl_score = score++;
             /* next_worker is filled in later */
         }
+        pfree(entry);
     }
 
     /* finally, insert all qualifying databases not previously inserted */
@@ -1017,7 +1019,7 @@ rebuild_database_list(Oid newdb)
         PgStat_StatDBEntry *entry;
 
         /* only consider databases with a pgstat entry */
-        entry = pgstat_fetch_stat_dbentry(avdb->adw_datid);
+        entry = pgstat_fetch_stat_dbentry(avdb->adw_datid, true);
         if (entry == NULL)
             continue;
 
@@ -1029,6 +1031,7 @@ rebuild_database_list(Oid newdb)
             db->adl_score = score++;
             /* next_worker is filled in later */
         }
+        pfree(entry);
     }
     nelems = score;
 
@@ -1227,7 +1230,7 @@ do_start_worker(void)
             continue;            /* ignore not-at-risk DBs */
 
         /* Find pgstat entry if any */
-        tmp->adw_entry = pgstat_fetch_stat_dbentry(tmp->adw_datid);
+        tmp->adw_entry = pgstat_fetch_stat_dbentry(tmp->adw_datid, true);
 
         /*
          * Skip a database with no pgstat entry; it means it hasn't seen any
@@ -1265,16 +1268,22 @@ do_start_worker(void)
                 break;
             }
         }
-        if (skipit)
-            continue;
+        if (!skipit)
+        {
+            /* Remember the db with oldest autovac time. */
+            if (avdb == NULL ||
+                tmp->adw_entry->last_autovac_time <
+                avdb->adw_entry->last_autovac_time)
+            {
+                if (avdb)
+                    pfree(avdb->adw_entry);
+                avdb = tmp;
+            }
+        }
 
-        /*
-         * Remember the db with oldest autovac time.  (If we are here, both
-         * tmp->entry and db->entry must be non-null.)
-         */
-        if (avdb == NULL ||
-            tmp->adw_entry->last_autovac_time < avdb->adw_entry->last_autovac_time)
-            avdb = tmp;
+        /* Immediately free it if not used */
+        if (avdb != tmp)
+            pfree(tmp->adw_entry);
     }
 
     /* Found a database -- process it */
@@ -1963,7 +1972,7 @@ do_autovacuum(void)
      * may be NULL if we couldn't find an entry (only happens if we are
      * forcing a vacuum for anti-wrap purposes).
      */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, true);
 
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
@@ -2013,7 +2022,7 @@ do_autovacuum(void)
     MemoryContextSwitchTo(AutovacMemCxt);
 
     /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
+    shared = pgstat_fetch_stat_dbentry(InvalidOid, true);
 
     classRel = heap_open(RelationRelationId, AccessShareLock);
 
@@ -2099,6 +2108,8 @@ do_autovacuum(void)
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
                                   &dovacuum, &doanalyze, &wraparound);
+        if (tabentry)
+            pfree(tabentry);
 
         /* Relations that need work are added to table_oids */
         if (dovacuum || doanalyze)
@@ -2178,10 +2189,11 @@ do_autovacuum(void)
         /* Fetch the pgstat entry for this table */
         tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
                                              shared, dbentry);
-
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
                                   &dovacuum, &doanalyze, &wraparound);
+        if (tabentry)
+            pfree(tabentry);
 
         /* ignore analyze for toast tables */
         if (dovacuum)
@@ -2750,12 +2762,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
     if (isshared)
     {
         if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
+            tabentry = backend_get_tab_entry(shared, relid, true);
     }
     else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
+        tabentry = backend_get_tab_entry(dbentry, relid, true);
 
     return tabentry;
 }
@@ -2787,8 +2797,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     /* use fresh stats */
     autovac_refresh_stats();
 
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    shared = pgstat_fetch_stat_dbentry(InvalidOid, true);
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, true);
 
     /* fetch the relation's relcache entry */
     classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
@@ -2819,6 +2829,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
                               &dovacuum, &doanalyze, &wraparound);
+    if (tabentry)
+        pfree(tabentry);
 
     /* ignore ANALYZE for toast tables */
     if (classForm->relkind == RELKIND_TOASTVALUE)
@@ -2909,7 +2921,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     }
 
     heap_freetuple(classTup);
-
+    pfree(shared);
+    pfree(dbentry);
     return tab;
 }
 
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index e6b6c549de..3fb6badea8 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -267,9 +267,9 @@ BackgroundWriterMain(void)
         can_hibernate = BgBufferSync(&wb_context);
 
         /*
-         * Send off activity statistics to the stats collector
+         * Update activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_update_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fe96c41359..d58193774e 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -371,7 +371,7 @@ CheckpointerMain(void)
         {
             checkpoint_requested = false;
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            BgWriterStats.requested_checkpoints++;
         }
         if (shutdown_requested)
         {
@@ -397,7 +397,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                BgWriterStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -515,13 +515,13 @@ CheckpointerMain(void)
         CheckArchiveTimeout();
 
         /*
-         * Send off activity statistics to the stats collector.  (The reason
-         * why we re-use bgwriter-related code for this is that the bgwriter
-         * and checkpointer used to be just one process.  It's probably not
-         * worth the trouble to split the stats support into two independent
-         * stats message types.)
+         * Update activity statistics.  (The reason why we re-use
+         * bgwriter-related code for this is that the bgwriter and
+         * checkpointer used to be just one process.  It's probably not worth
+         * the trouble to split the stats support into two independent
+         * functions.)
          */
-        pgstat_send_bgwriter();
+        pgstat_update_bgwriter();
 
         /*
          * Sleep until we are signaled or it's time for another checkpoint or
@@ -682,9 +682,9 @@ CheckpointWriteDelay(int flags, double progress)
         CheckArchiveTimeout();
 
         /*
-         * Report interim activity statistics to the stats collector.
+         * Register interim activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_update_bgwriter();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1284,8 +1284,8 @@ AbsorbFsyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    BgWriterStats.buf_written_backend += CheckpointerShmem->num_backend_writes;
+    BgWriterStats.buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 4342ebdab4..18bd8296b8 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -468,7 +468,7 @@ pgarch_ArchiverCopyLoop(void)
                  * Tell the collector about the WAL file that we successfully
                  * archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_update_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
@@ -478,7 +478,7 @@ pgarch_ArchiverCopyLoop(void)
                  * Tell the collector about the WAL file that we failed to
                  * archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_update_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index d1fe052abf..a97fbae7a8 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,10 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Statistics collector facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
- *
- *            - Add some automatic call for pgstat vacuuming.
- *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *    Statistics data is stored in dynamic shared memory. Every backend
+ *    updates and reads it individually.
  *
  *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
  *
@@ -19,92 +14,59 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "pgstat.h"
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
-#include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "postmaster/autovacuum.h"
-#include "postmaster/fork_process.h"
-#include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
 #include "utils/snapmgr.h"
-#include "utils/timestamp.h"
-#include "utils/tqual.h"
-
 
 /* ----------
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
-
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_STAT_MIN_INTERVAL    500 /* Minimum time between stats data
+                                         * updates; in milliseconds. */
 
+#define PGSTAT_STAT_MAX_INTERVAL   1000 /* Maximum time between stats data
+                                         * updates; in milliseconds. */
 
 /* ----------
  * The initial size hints for the hash tables used in the collector.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
 #define PGSTAT_TAB_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
+/*
+ * Operation mode of pgstat_get_db_entry.
+ */
+#define PGSTAT_FETCH_SHARED    0
+#define PGSTAT_FETCH_EXCLUSIVE    1
+#define PGSTAT_FETCH_NOWAIT    2
+
+typedef enum
+{
+    PGSTAT_ENTRY_NOT_FOUND,
+    PGSTAT_ENTRY_FOUND,
+    PGSTAT_ENTRY_LOCK_FAILED
+} pg_stat_table_result_status;
 
 /* ----------
  * Total number of backends including auxiliary
@@ -132,27 +94,69 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used, but will be removed with GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
+/* Shared stats bootstrap information */
+typedef struct StatsShmemStruct {
+    dsa_handle stats_dsa_handle;
+    dshash_table_handle db_stats_handle;
+    dsa_pointer    global_stats;
+    dsa_pointer    archiver_stats;
+} StatsShmemStruct;
+
+
 /*
  * BgWriter global statistics counters (unused in other processes).
  * Stored directly in a stats message structure so it can be sent
  * without needing to copy things around.  We assume this inits to zeroes.
  */
-PgStat_MsgBgWriter BgWriterStats;
+PgStat_BgWriter BgWriterStats;
 
 /* ----------
  * Local data
  * ----------
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
+static StatsShmemStruct *StatsShmem = NULL;
+static dsa_area *area = NULL;
+static dshash_table *db_stats;
+static HTAB *snapshot_db_stats;
+static MemoryContext stats_cxt;
 
-static struct sockaddr_storage pgStatAddr;
+/*
+ *  Report withholding facility.
+ *
+ *  Some report items are withheld if the required lock cannot be acquired
+ *  immediately.
+ */
+static bool pgstat_pending_recoveryconflict = false;
+static bool pgstat_pending_deadlock = false;
+static bool pgstat_pending_tempfile = false;
 
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
+/* dshash parameters for each type of table */
+static const dshash_parameters dsh_dbparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatDBEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_DB
+};
+static const dshash_parameters dsh_tblparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatTabEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_FUNC_TABLE
+};
+static const dshash_parameters dsh_funcparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatFuncEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_FUNC_TABLE
+};
 
 /*
  * Structures in which backends store per-table info that's waiting to be
@@ -189,18 +193,14 @@ typedef struct TabStatHashEntry
  * Hash table for O(1) t_id -> tsa_entry lookup
  */
 static HTAB *pgStatTabHash = NULL;
+static HTAB *pgStatPendingTabHash = NULL;
 
 /*
  * Backends store per-function info that's waiting to be sent to the collector
  * in this hash table (indexed by function OID).
  */
 static HTAB *pgStatFunctions = NULL;
-
-/*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
- */
-static bool have_function_stats = false;
+static HTAB *pgStatPendingFunctions = NULL;
 
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
@@ -237,6 +237,12 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
+typedef struct
+{
+    dshash_table *tabhash;
+    PgStat_StatDBEntry *dbentry;
+} pgstat_apply_tabstat_context;
+
 /*
  * Info about current "snapshot" of stats file
  */
@@ -250,23 +256,15 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Cluster wide statistics.
+ * Contains statistics that are not collected per database or per table.
+ * shared_* point to the statistics kept in shared memory and snapshot_* are
+ * the snapshots taken by backends.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
-
-/* Signal handler flags */
-static volatile bool need_exit = false;
-static volatile bool got_SIGHUP = false;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats *snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats *snapshot_globalStats;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -280,32 +278,41 @@ static instr_time total_func_time;
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgstat_exit(SIGNAL_ARGS);
+/* functions used in backends */
 static void pgstat_beshutdown_hook(int code, Datum arg);
-static void pgstat_sighup_handler(SIGNAL_ARGS);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                     Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, int op,
+                                    pg_stat_table_result_status *status);
+static PgStat_StatTabEntry *pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create);
+static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_db_statsfile(Oid databaseid, dshash_table *tabhash, dshash_table *funchash);
+
+/* functions used in backends */
+static bool backend_snapshot_global_stats(void);
+static PgStat_StatFuncEntry *backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot);
 static void pgstat_read_current_status(void);
 
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
+static void pgstat_postmaster_shutdown(int code, Datum arg);
+static void pgstat_apply_pending_tabstats(bool shared, bool force,
+                               pgstat_apply_tabstat_context *cxt);
+static bool pgstat_apply_tabstat(pgstat_apply_tabstat_context *cxt,
+                                 PgStat_TableStatus *entry, bool nowait);
+static void pgstat_merge_tabentry(PgStat_TableStatus *deststat,
+                                          PgStat_TableStatus *srcstat,
+                                          bool init);
+static void pgstat_update_funcstats(bool force, PgStat_StatDBEntry *dbentry);
+static void pgstat_reset_all_counters(void);
+static void pgstat_cleanup_recovery_conflict(PgStat_StatDBEntry *dbentry);
+static void pgstat_cleanup_deadlock(PgStat_StatDBEntry *dbentry);
+static void pgstat_cleanup_tempfile(PgStat_StatDBEntry *dbentry);
+
+static inline void pgstat_merge_backendstats_to_funcentry(
+    PgStat_StatFuncEntry *dest, PgStat_BackendFunctionEntry *src, bool init);
+static inline void pgstat_merge_funcentry(
+    PgStat_StatFuncEntry *dest, PgStat_StatFuncEntry *src, bool init);
 
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
 static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
-
+static void reset_dbentry_counters(PgStat_StatDBEntry *dbentry);
 static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
 
 static void pgstat_setup_memcxt(void);
@@ -316,320 +323,16 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+static bool pgstat_update_tabentry(dshash_table *tabhash,
+                                   PgStat_TableStatus *stat, bool nowait);
+static void pgstat_update_dbentry(PgStat_StatDBEntry *dbentry,
+                                  PgStat_TableStatus *stat);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
- */
-void
-pgstat_init(void)
-{
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
-
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
-
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
-    {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
-    }
-
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
-
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
-
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
-
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
-    }
-
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
-
-    /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
-     */
-    if (!pg_set_noblock(pgStatSock))
-    {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
-    }
-
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
-    }
-
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    return;
-
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
-
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
-
-    /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
-     */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
-}
-
 /*
  * subroutine for pgstat_reset_all
  */
@@ -678,119 +381,54 @@ pgstat_reset_remove_files(const char *directory)
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats files and reset the in-memory counters.  This is currently
+ * used only if WAL recovery is needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
     pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    pgstat_reset_all_counters();
 }
 
-#ifdef EXEC_BACKEND
-
-/*
- * pgstat_forkexec() -
+/* ----------
+ * pgstat_create_shared_stats() -
  *
- * Format up the arglist for, then fork and exec, statistics collector process
+ *    Create the shared stats memory.
+ * ----------
  */
-static pid_t
-pgstat_forkexec(void)
+static void
+pgstat_create_shared_stats(void)
 {
-    char       *av[10];
-    int            ac = 0;
+    MemoryContext oldcontext;
 
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
+    Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
 
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
+    /* lives for the lifetime of the process */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    area = dsa_create(LWTRANCHE_STATS_DSA);
+    dsa_pin_mapping(area);
 
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
+    db_stats = dshash_create(area, &dsh_dbparams, 0);
 
+    /* create shared area and write bootstrap information */
+    StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+    StatsShmem->global_stats =
+        dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+    StatsShmem->archiver_stats =
+        dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+    StatsShmem->db_stats_handle =
+        dshash_get_hash_table_handle(db_stats);
 
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
-
-    /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
-     */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
-
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
-
-    /*
-     * Okay, fork off the collector.
-     */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
+    /* connect to the memory */
+    snapshot_db_stats = NULL;
+    shared_globalStats = (PgStat_GlobalStats *)
+        dsa_get_address(area, StatsShmem->global_stats);
+    shared_archiverStats = (PgStat_ArchiverStats *)
+        dsa_get_address(area, StatsShmem->archiver_stats);
+    MemoryContextSwitchTo(oldcontext);
 }
 
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
-}
 
 /* ------------------------------------------------------------
  * Public functions used by backends follow
@@ -802,41 +440,107 @@ allow_immediate_pgstat_restart(void)
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
- *    transaction stop time as an approximation of current time.
- * ----------
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *  This requires taking some locks on the shared statistics hashes, and some
+ *  updates may be withheld on lock failure. Pending updates are retried in
+ *  later calls of this function and are finally cleaned up when it is called
+ *  with force = true or when PGSTAT_STAT_MAX_INTERVAL milliseconds have
+ *  elapsed since the last cleanup. On the other hand, when force = false,
+ *  updates by regular backends happen at intervals no shorter than
+ *  PGSTAT_STAT_MIN_INTERVAL.
+ *
+ *  Returns time in milliseconds until the next update time.
+ *
+ *    Note that this is called only when not within a transaction, so it is fair
+ *    to use transaction stop time as an approximation of current time.
+ *    ----------
  */
-void
-pgstat_report_stat(bool force)
+long
+pgstat_update_stat(bool force)
 {
     /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
     static TimestampTz last_report = 0;
-
+    static TimestampTz oldest_pending = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
     TabStatusArray *tsa;
-    int            i;
+    pgstat_apply_tabstat_context cxt;
+    bool        other_pending_stats = false;
+    long elapsed;
+    long secs;
+    int     usecs;
+
+    if (pgstat_pending_recoveryconflict ||
+        pgstat_pending_deadlock ||
+        pgstat_pending_tempfile ||
+        pgStatPendingFunctions)
+        other_pending_stats = true;
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (!other_pending_stats && !pgStatPendingTabHash &&
+        (pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
+        pgStatXactCommit == 0 && pgStatXactRollback == 0)
+        return 0;
 
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
     now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
+
+    if (!force)
+    {
+        /*
+         * Don't update shared stats unless it's been at least
+         * PGSTAT_STAT_MIN_INTERVAL msec since we last updated one.
+         * Return the time to wait in that case.
+         */
+        TimestampDifference(last_report, now, &secs, &usecs);
+        elapsed = secs * 1000 + usecs / 1000;
+
+        if (elapsed < PGSTAT_STAT_MIN_INTERVAL)
+        {
+            /* we know we have some statistics */
+            if (oldest_pending == 0)
+                oldest_pending = now;
+
+            return PGSTAT_STAT_MIN_INTERVAL - elapsed;
+        }
+
+
+        /*
+         * Don't keep pending stats for longer than PGSTAT_STAT_MAX_INTERVAL.
+         */
+        if (oldest_pending > 0)
+        {
+            TimestampDifference(oldest_pending, now, &secs, &usecs);
+            elapsed = secs * 1000 + usecs / 1000;
+
+            if (elapsed > PGSTAT_STAT_MAX_INTERVAL)
+                force = true;
+        }
+    }
+
     last_report = now;
 
+    /* set up stats update context */
+    cxt.dbentry = NULL;
+    cxt.tabhash = NULL;
+
+    /* Forcibly update other stats if any. */
+    if (other_pending_stats)
+    {
+        cxt.dbentry =
+            pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+        /* clean up pending statistics if any */
+        if (pgStatPendingFunctions)
+            pgstat_update_funcstats(true, cxt.dbentry);
+        if (pgstat_pending_recoveryconflict)
+            pgstat_cleanup_recovery_conflict(cxt.dbentry);
+        if (pgstat_pending_deadlock)
+            pgstat_cleanup_deadlock(cxt.dbentry);
+        if (pgstat_pending_tempfile)
+            pgstat_cleanup_tempfile(cxt.dbentry);
+    }
+
     /*
      * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
      * entries it points to.  (Should we fail partway through the loop below,
@@ -849,23 +553,55 @@ pgstat_report_stat(bool force)
     pgStatTabHash = NULL;
 
     /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
+     * XXX: We cannot lock two dshash entries at once. Since we must keep the
+     * lock while table stats are being updated, we have no choice other than
+     * separating the jobs for shared table stats and regular table stats.
+     * Looping over the array twice is apparently inefficient, and a more
+     * efficient way is desirable.
      */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
+
+    /* The first of the following calls uses the dbentry obtained above, if any. */
+    pgstat_apply_pending_tabstats(false, force, &cxt);
+    pgstat_apply_pending_tabstats(true, force, &cxt);
+
+    /* zero out TableStatus structs after use */
+    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+    {
+        MemSet(tsa->tsa_entries, 0,
+               tsa->tsa_used * sizeof(PgStat_TableStatus));
+        tsa->tsa_used = 0;
+    }
+
+    /* record oldest pending update time */
+    if (pgStatPendingTabHash == NULL)
+        oldest_pending = 0;
+    else if (oldest_pending == 0)
+        oldest_pending = now;
+
+    return 0;
+}
+
+/*
+ * Subroutine for pgstat_update_stat.
+ *
+ * Applies table stats in the table status array, merging them with pending
+ * stats if any. If force is true, waits until the required locks are acquired.
+ * Otherwise the merged stats are kept as pending and processed at the next chance.
+ */
+static void
+pgstat_apply_pending_tabstats(bool shared, bool force,
+                              pgstat_apply_tabstat_context *cxt)
+{
+    static const PgStat_TableCounts all_zeroes;
+    TabStatusArray *tsa;
+    int i;
 
     for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
     {
         for (i = 0; i < tsa->tsa_used; i++)
         {
             PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
+            PgStat_TableStatus *pentry = NULL;
 
             /* Shouldn't have any pending transaction-dependent counts */
             Assert(entry->trans == NULL);
@@ -878,178 +614,440 @@ pgstat_report_stat(bool force)
                        sizeof(PgStat_TableCounts)) == 0)
                 continue;
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* Skip if this entry does not match the request */
+            if (entry->t_shared != shared)
+                continue;
+
+            /* if a pending update exists, apply it along with the new one */
+            if (pgStatPendingTabHash != NULL)
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                pentry = hash_search(pgStatPendingTabHash,
+                                     (void *) entry, HASH_FIND, NULL);
+
+                if (pentry)
+                {
+                    /* merge new update into pending updates */
+                    pgstat_merge_tabentry(pentry, entry, false);
+                    entry = pentry;
+                }
+            }
+
+            /* try to apply the merged stats */
+            if (pgstat_apply_tabstat(cxt, entry, !force))
+            {
+                /* succeeded; remove the entry if it came from pending stats */
+                if (pentry)
+                    hash_search(pgStatPendingTabHash,
+                                (void *) pentry, HASH_REMOVE, NULL);
+            }
+            else if (!pentry)
+            {
+                /* failed and there was no pending entry, create new one. */
+                bool found;
+
+                if (pgStatPendingTabHash == NULL)
+                {
+                    HASHCTL        ctl;
+
+                    memset(&ctl, 0, sizeof(ctl));
+                    ctl.keysize = sizeof(Oid);
+                    ctl.entrysize = sizeof(PgStat_TableStatus);
+                    pgStatPendingTabHash =
+                        hash_create("pgstat pending table stats hash",
+                                    TABSTAT_QUANTUM,
+                                    &ctl,
+                                    HASH_ELEM | HASH_BLOBS);
+                }
+
+                pentry = hash_search(pgStatPendingTabHash,
+                                     (void *) entry, HASH_ENTER, &found);
+                Assert (!found);
+
+                *pentry = *entry;
             }
         }
-        /* zero out TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
+    }
+
+    /* if any pending stats exists, try to clean it up */
+    if (pgStatPendingTabHash != NULL)
+    {
+        HASH_SEQ_STATUS pstat;
+        PgStat_TableStatus *pentry;
+
+        hash_seq_init(&pstat, pgStatPendingTabHash);
+        while ((pentry = (PgStat_TableStatus *) hash_seq_search(&pstat)) != NULL)
+        {
+            /* Skip if this entry does not match the request */
+            if (pentry->t_shared != shared)
+                continue;
+
+            /* apply pending entry and remove on success */
+            if (pgstat_apply_tabstat(cxt, pentry, !force))
+                hash_search(pgStatPendingTabHash,
+                            (void *) pentry, HASH_REMOVE, NULL);
+        }
+
+        /* destroy the hash if no entry is left */
+        if (hash_get_num_entries(pgStatPendingTabHash) == 0)
+        {
+            hash_destroy(pgStatPendingTabHash);
+            pgStatPendingTabHash = NULL;
+        }
+    }
+
+    if (cxt->tabhash)
+        dshash_detach(cxt->tabhash);
+    if (cxt->dbentry)
+        dshash_release_lock(db_stats, cxt->dbentry);
+    cxt->tabhash = NULL;
+    cxt->dbentry = NULL;
+}
+
+
+/*
+ * pgstat_apply_tabstat: update shared stats entry using given entry
+ *
+ * If nowait is true, just returns false on lock failure.  Dshashes for table
+ * and function stats are kept attached and stored in cxt. The caller must
+ * detach them after use.
+ */
+bool
+pgstat_apply_tabstat(pgstat_apply_tabstat_context *cxt,
+                     PgStat_TableStatus *entry, bool nowait)
+{
+    Oid dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
+    int        table_mode = PGSTAT_FETCH_EXCLUSIVE;
+    bool updated = false;
+
+    if (nowait)
+        table_mode |= PGSTAT_FETCH_NOWAIT;
+
+    /*
+     * We need to keep the lock on the dbentry for regular tables to avoid a
+     * race condition with DROP DATABASE, so we hold it in the context
+     * variable. We don't need that for shared tables.
+     */
+    if (!cxt->dbentry)
+        cxt->dbentry = pgstat_get_db_entry(dboid, table_mode, NULL);
+
+    /* could not acquire the lock, just return */
+    if (!cxt->dbentry)
+        return false;
+
+    /* attach shared stats table if not yet attached */
+    if (!cxt->tabhash)
+    {
+        /* apply database stats  */
+        if (!entry->t_shared)
+        {
+            /* Update database-wide stats  */
+            cxt->dbentry->n_xact_commit += pgStatXactCommit;
+            cxt->dbentry->n_xact_rollback += pgStatXactRollback;
+            cxt->dbentry->n_block_read_time += pgStatBlockReadTime;
+            cxt->dbentry->n_block_write_time += pgStatBlockWriteTime;
+            pgStatXactCommit = 0;
+            pgStatXactRollback = 0;
+            pgStatBlockReadTime = 0;
+            pgStatBlockWriteTime = 0;
+        }
+
+        cxt->tabhash =
+            dshash_attach(area, &dsh_tblparams, cxt->dbentry->tables, 0);
     }
 
     /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
+     * If we have access to the required data, try to update table stats first.
+     * Update database stats only if the first step succeeded.
      */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    if (pgstat_update_tabentry(cxt->tabhash, entry, nowait))
+    {
+        pgstat_update_dbentry(cxt->dbentry, entry);
+        updated = true;
+    }
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+    return updated;
 }
 
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * pgstat_merge_tabentry: subroutine for pgstat_update_stat
+ *
+ * Merge srcstat into deststat. Existing value in deststat is cleared if
+ * init is true.
  */
 static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+pgstat_merge_tabentry(PgStat_TableStatus *deststat,
+                      PgStat_TableStatus *srcstat,
+                      bool init)
 {
-    int            n;
-    int            len;
+    Assert (deststat != srcstat);
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
-     */
-    if (OidIsValid(tsmsg->m_databaseid))
-    {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
-        pgStatXactCommit = 0;
-        pgStatXactRollback = 0;
-        pgStatBlockReadTime = 0;
-        pgStatBlockWriteTime = 0;
-    }
+    if (init)
+        deststat->t_counts = srcstat->t_counts;
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        PgStat_TableCounts *dest = &deststat->t_counts;
+        PgStat_TableCounts *src = &srcstat->t_counts;
+
+        dest->t_numscans += src->t_numscans;
+        dest->t_tuples_returned += src->t_tuples_returned;
+        dest->t_tuples_fetched += src->t_tuples_fetched;
+        dest->t_tuples_inserted += src->t_tuples_inserted;
+        dest->t_tuples_updated += src->t_tuples_updated;
+        dest->t_tuples_deleted += src->t_tuples_deleted;
+        dest->t_tuples_hot_updated += src->t_tuples_hot_updated;
+        dest->t_truncated |= src->t_truncated;
+
+        /* If table was truncated, first reset the live/dead counters */
+        if (src->t_truncated)
+        {
+            dest->t_delta_live_tuples = 0;
+            dest->t_delta_dead_tuples = 0;
+        }
+        dest->t_delta_live_tuples += src->t_delta_live_tuples;
+        dest->t_delta_dead_tuples += src->t_delta_dead_tuples;
+        dest->t_changed_tuples += src->t_changed_tuples;
+        dest->t_blocks_fetched += src->t_blocks_fetched;
+        dest->t_blocks_hit += src->t_blocks_hit;
     }
-
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
-
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
 }
-
+
 /*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
+ * pgstat_update_funcstats: subroutine for pgstat_update_stat
+ *
+ *  updates a function stat
  */
 static void
-pgstat_send_funcstats(void)
+pgstat_update_funcstats(bool force, PgStat_StatDBEntry *dbentry)
 {
     /* we assume this inits to all zeroes: */
     static const PgStat_FunctionCounts all_zeroes;
+    pg_stat_table_result_status status = 0;
+    dshash_table *funchash;
+    bool          nowait = !force;
+    bool          release_db = false;
+    int              table_op = PGSTAT_FETCH_EXCLUSIVE;
 
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
-
-    if (pgStatFunctions == NULL)
+    if (pgStatFunctions == NULL && pgStatPendingFunctions == NULL)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
+    if (nowait)
+        table_op |= PGSTAT_FETCH_NOWAIT;
 
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    /* find the shared function stats table */
+    if (!dbentry)
     {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
-
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
-
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
-
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
+        dbentry = pgstat_get_db_entry(MyDatabaseId, table_op, &status);
+        release_db = true;
     }
 
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
+    /* lock failure, return. */
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
 
-    have_function_stats = false;
+    /* create the hash if it does not exist yet */
+    if (dbentry->functions == DSM_HANDLE_INVALID)
+    {
+        funchash = dshash_create(area, &dsh_funcparams, 0);
+        dbentry->functions = dshash_get_hash_table_handle(funchash);
+    }
+    else
+        funchash = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+
+    /*
+     * First, we empty the transaction stats. Just move the numbers to the
+     * pending stats, if any. Otherwise try to update the shared stats directly,
+     * but create a new pending entry on lock failure.
+     */
+    if (pgStatFunctions)
+    {
+        HASH_SEQ_STATUS fstat;
+        PgStat_BackendFunctionEntry *bestat;
+
+        hash_seq_init(&fstat, pgStatFunctions);
+        while ((bestat = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+        {
+            bool found;
+            bool init = false;
+            PgStat_StatFuncEntry *funcent = NULL;
+
+            /* Skip it if no counts accumulated since last time */
+            if (memcmp(&bestat->f_counts, &all_zeroes,
+                       sizeof(PgStat_FunctionCounts)) == 0)
+                continue;
+
+            /* find pending entry */
+            if (pgStatPendingFunctions)
+                funcent = (PgStat_StatFuncEntry *)
+                    hash_search(pgStatPendingFunctions,
+                                (void *) &(bestat->f_id), HASH_FIND, NULL);
+
+            if (!funcent)
+            {
+                /* pending entry not found, find shared stats entry */
+                funcent = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert_extended(funchash,
+                                                   (void *) &(bestat->f_id),
+                                                   &found, nowait);
+                if (funcent)
+                    init = !found;
+                else
+                {
+                    /* lock failure on shared stats; create a new pending entry */
+                    funcent = (PgStat_StatFuncEntry *)
+                        hash_search(pgStatPendingFunctions,
+                                    (void *) &(bestat->f_id), HASH_ENTER, NULL);
+                    init = true;
+                }
+            }
+            Assert (funcent != NULL);
+
+            pgstat_merge_backendstats_to_funcentry(funcent, bestat, init);
+
+            /* reset used counts */
+            MemSet(&bestat->f_counts, 0, sizeof(PgStat_FunctionCounts));
+        }
+    }
+
+    /* Second, apply pending stats numbers to shared table */
+    if (pgStatPendingFunctions)
+    {
+        HASH_SEQ_STATUS fstat;
+        PgStat_StatFuncEntry *pendent;
+
+        hash_seq_init(&fstat, pgStatPendingFunctions);
+        while ((pendent = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+        {
+            PgStat_StatFuncEntry *funcent;
+            bool found;
+
+            funcent = (PgStat_StatFuncEntry *)
+                dshash_find_or_insert_extended(funchash,
+                                               (void *) &(pendent->functionid),
+                                               &found, nowait);
+            if (funcent)
+            {
+                pgstat_merge_funcentry(funcent, pendent, !found);
+                hash_search(pgStatPendingFunctions,
+                            (void *) &(pendent->functionid), HASH_REMOVE, NULL);
+            }
+        }
+
+        /* destroy the hash if no entry remains */
+        if (hash_get_num_entries(pgStatPendingFunctions) == 0)
+        {
+            hash_destroy(pgStatPendingFunctions);
+            pgStatPendingFunctions = NULL;
+        }
+    }
+
+    if (release_db)
+        dshash_release_lock(db_stats, dbentry);
 }
 
+/*
+ * pgstat_merge_backendstats_to_funcentry: subroutine for
+ *                                             pgstat_update_funcstats
+ *
+ * Merges BackendFunctionEntry into StatFuncEntry
+ */
+static inline void
+pgstat_merge_backendstats_to_funcentry(PgStat_StatFuncEntry *dest,
+                                       PgStat_BackendFunctionEntry *src,
+                                       bool init)
+{
+    if (init)
+    {
+        /*
+         * If it's a new function entry, initialize counters to the values
+         * we just got.
+         */
+        dest->f_numcalls = src->f_counts.f_numcalls;
+        dest->f_total_time =
+            INSTR_TIME_GET_MICROSEC(src->f_counts.f_total_time);
+        dest->f_self_time =
+            INSTR_TIME_GET_MICROSEC(src->f_counts.f_self_time);
+    }
+    else
+    {
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        dest->f_numcalls += src->f_counts.f_numcalls;
+        dest->f_total_time +=
+            INSTR_TIME_GET_MICROSEC(src->f_counts.f_total_time);
+        dest->f_self_time +=
+            INSTR_TIME_GET_MICROSEC(src->f_counts.f_self_time);
+    }
+}
+
+/*
+ * pgstat_merge_funcentry: subroutine for pgstat_update_funcstats
+ *
+ * Merges two StatFuncEntrys
+ */
+static inline void
+pgstat_merge_funcentry(PgStat_StatFuncEntry *dest, PgStat_StatFuncEntry *src,
+                       bool init)
+{
+    if (init)
+    {
+        /*
+         * If it's a new function entry, initialize counters to the values
+         * we just got.
+         */
+        dest->f_numcalls = src->f_numcalls;
+        dest->f_total_time = src->f_total_time;
+        dest->f_self_time = src->f_self_time;
+    }
+    else
+    {
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        dest->f_numcalls += src->f_numcalls;
+        dest->f_total_time += src->f_total_time;
+        dest->f_self_time += src->f_self_time;
+    }
+}
+
+
 
 /* ----------
  * pgstat_vacuum_stat() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *    Remove objects we can get rid of.
  * ----------
  */
 void
 pgstat_vacuum_stat(void)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
+    HTAB       *oidtab;
+    dshash_table *dshtable;
+    dshash_seq_status dshstat;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatFuncEntry *funcentry;
-    int            len;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* we don't collect statistics in standalone mode */
+    if (!IsUnderPostmaster)
         return;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* If not done for this transaction, take a snapshot of stats */
+    if (!backend_snapshot_global_stats())
+        return;
 
     /*
      * Read pg_database and make a list of OIDs of all existing databases
      */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    oidtab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
 
     /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
+     * Search the database hash table for dead databases and drop them
+     * from the hash.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+
+    dshash_seq_init(&dshstat, db_stats, false, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            dbid = dbentry->databaseid;
 
@@ -1057,148 +1055,86 @@ pgstat_vacuum_stat(void)
 
         /* the DB entry for shared tables (with InvalidOid) is never dropped */
         if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
+            hash_search(oidtab, (void *) &dbid, HASH_FIND, NULL) == NULL)
             pgstat_drop_database(dbid);
     }
 
     /* Clean up */
-    hash_destroy(htab);
+    hash_destroy(oidtab);
 
     /*
      * Lookup our own database entry; if not found, nothing more to do.
      */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, NULL);
+    if (!dbentry)
         return;
-
+
     /*
      * Similarly to above, make a list of all known relations in this DB.
      */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
+    oidtab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
 
     /*
      * Check for all tables listed in stats hashtable if they still exist.
+     * Stats cache is useless here so directly search the shared hash.
      */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+    dshtable = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&dshstat, dshtable, false, true);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            tabid = tabentry->tableid;
 
         CHECK_FOR_INTERRUPTS();
 
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+        if (hash_search(oidtab, (void *) &tabid, HASH_FIND, NULL) != NULL)
             continue;
 
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
-    }
-
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
+        /* Not there, so purge this table */
+        dshash_delete_entry(dshtable, tabentry);
     }
+    dshash_detach(dshtable);
 
     /* Clean up */
-    hash_destroy(htab);
+    hash_destroy(oidtab);
 
     /*
      * Now repeat the above steps for functions.  However, we needn't bother
      * in the common case where no function stats are being collected.
      */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
+    if (dbentry->functions != DSM_HANDLE_INVALID)
     {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
+        dshtable = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        oidtab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
 
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
+        dshash_seq_init(&dshstat, dshtable, false, true);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&dshstat)) != NULL)
         {
             Oid            funcid = funcentry->functionid;
 
             CHECK_FOR_INTERRUPTS();
 
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
+            if (hash_search(oidtab, (void *) &funcid, HASH_FIND, NULL) != NULL)
                 continue;
 
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
+            /* Not there, so remove this function */
+            dshash_delete_entry(dshtable, funcentry);
         }
 
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
+        hash_destroy(oidtab);
 
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
+        dshash_detach(dshtable);
     }
+    dshash_release_lock(db_stats, dbentry);
 }
 
 
-/* ----------
+/*
  * pgstat_collect_oids() -
  *
  *    Collect the OIDs of all objects listed in the specified system catalog
- *    into a temporary hash table.  Caller should hash_destroy the result
- *    when done with it.  (However, we make the table in CurrentMemoryContext
- *    so that it will be freed properly in event of an error.)
- * ----------
+ *    into a temporary hash table.  Caller should hash_destroy the result after
+ *    use.  (However, we make the table in CurrentMemoryContext so that it will
+ *    be freed properly in event of an error.)
  */
 static HTAB *
 pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
@@ -1245,62 +1181,54 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
+ *    Remove entry for the database that we just dropped.
+ *
+ *    If some stats update happens after this, this entry will be re-created, but
+ *    we will still clean the dead DB eventually via future invocations of
+ *    pgstat_vacuum_stat().
  * ----------
  */
+
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    Assert(OidIsValid(databaseid));
+    Assert(db_stats);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Lookup the database in the hashtable with exclusive lock.
+     */
+    dbentry = pgstat_get_db_entry(databaseid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    /*
+     * If found, remove it (along with the db statfile).
+     */
+    if (dbentry)
+    {
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+        }
+
+        dshash_delete_entry(db_stats, (void *)dbentry);
+    }
 }
 
 
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
-
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
-}
-#endif                            /* NOT_USED */
-
-
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1309,20 +1237,51 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    PgStat_StatDBEntry           *dbentry;
+    pg_stat_table_result_status status;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    Assert(db_stats);
+
+    /*
+     * Lookup the database in the hashtable.  Nothing to do if not there.
+     */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, &status);
+
+    if (!dbentry)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * We simply throw away all the database's table entries by recreating a
+     * new hash table for them.
+     */
+    if (dbentry->tables != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_destroy(t);
+        dbentry->tables = DSM_HANDLE_INVALID;
+    }
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_destroy(t);
+        dbentry->functions = DSM_HANDLE_INVALID;
+    }
+
+    /*
+     * Reset database-level stats, too.  This creates empty hash tables for
+     * tables and functions.
+     */
+    reset_dbentry_counters(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1331,29 +1290,37 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    Assert(db_stats);
 
+    /* Reset the archiver statistics for the cluster. */
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+        memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
+    }
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+        /* Reset the global background writer statistics for the cluster. */
+        memset(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    }
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
+
+    LWLockRelease(StatsLock);
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1362,17 +1329,90 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatDBEntry *dbentry;
+    
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    Assert(db_stats);
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    if (!dbentry)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* Set the reset timestamp for the whole database */
+    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Remove object if it exists, ignore it if not */
+    if (type == RESET_TABLE)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_delete_key(t, (void *) &objoid);
+    }
+
+    if (type == RESET_FUNCTION && dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_delete_key(t, (void *) &objoid);
+    }
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/*
+ * pgstat_reset_all_counters: subroutine for pgstat_reset_all
+ *
+ * Clear all counters in shared memory.
+ */
+static void
+pgstat_reset_all_counters(void)
+{
+    dshash_seq_status dshstat;
+    PgStat_StatDBEntry           *dbentry;
+
+    Assert(db_stats);
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    dshash_seq_init(&dshstat, db_stats, false, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
+    {
+        /*
+         * We simply throw away all the database's table hashes
+         */
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *t =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(t);
+            dbentry->tables = DSM_HANDLE_INVALID;
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *t =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(t);
+            dbentry->functions = DSM_HANDLE_INVALID;
+        }
+
+        /*
+         * Reset database-level stats, too.  This creates empty hash tables
+         * for tables and functions.
+         */
+        reset_dbentry_counters(dbentry);
+        dshash_release_lock(db_stats, dbentry);
+
+    }
+
+    /*
+     * Reset global counters
+     */
+    memset(shared_globalStats, 0, sizeof(*shared_globalStats));
+    memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+    shared_globalStats->stat_reset_timestamp =
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
+
+    LWLockRelease(StatsLock);
 }
 
 /* ----------
@@ -1386,48 +1426,75 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    /*
+     * Store the last autovacuum time in the database's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
 
-    pgstat_send(&msg, sizeof(msg));
+    dbentry->last_autovac_time = GetCurrentTimestamp();
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report about the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, tableoid, true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->vacuum_count++;
+    }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report about the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1438,9 +1505,14 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
     /*
@@ -1469,114 +1541,228 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, RelationGetRelid(rel), true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
+static int pending_conflict_tablespace = 0;
+static int pending_conflict_lock = 0;
+static int pending_conflict_snapshot = 0;
+static int pending_conflict_bufferpin = 0;
+static int pending_conflict_startup_deadlock = 0;
+
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    pgstat_pending_recoveryconflict = true;
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            pending_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            pending_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            pending_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            pending_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            pending_conflict_startup_deadlock++;
+            break;
+    }
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    pgstat_cleanup_recovery_conflict(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/*
+ * clean up function for pending recovery conflicts
+ */
+static void
+pgstat_cleanup_recovery_conflict(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_conflict_tablespace    += pending_conflict_tablespace;
+    dbentry->n_conflict_lock         += pending_conflict_lock;
+    dbentry->n_conflict_snapshot    += pending_conflict_snapshot;
+    dbentry->n_conflict_bufferpin    += pending_conflict_bufferpin;
+    dbentry->n_conflict_startup_deadlock += pending_conflict_startup_deadlock;
+
+    pending_conflict_tablespace = 0;
+    pending_conflict_lock = 0;
+    pending_conflict_snapshot = 0;
+    pending_conflict_bufferpin = 0;
+    pending_conflict_startup_deadlock = 0;
+
+    pgstat_pending_recoveryconflict = false;
 }
 
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
+static int pending_deadlocks = 0;
+
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    pending_deadlocks++;
+    pgstat_pending_deadlock = true;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    pgstat_cleanup_deadlock(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/*
+ * clean up function for pending deadlocks
+ */
+static void
+pgstat_cleanup_deadlock(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_deadlocks += pending_deadlocks;
+    pending_deadlocks = 0;
+    pgstat_pending_deadlock = false;
 }
 
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
+static size_t pending_filesize = 0;
+static size_t pending_files = 0;
+
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
-}
+    if (filesize > 0) /* Isn't there a case where filesize is really 0? */
+    {
+        pgstat_pending_tempfile = true;
+        pending_filesize += filesize; /* needs an overflow check */
+        pending_files++;
+    }
 
-
-/* ----------
- * pgstat_ping() -
- *
- *    Send some junk data to the collector to increase traffic.
- * ----------
- */
-void
-pgstat_ping(void)
-{
-    PgStat_MsgDummy msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!pgstat_pending_tempfile)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    pgstat_cleanup_tempfile(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
-/* ----------
- * pgstat_send_inquiry() -
- *
- *    Notify collector that we need fresh data.
- * ----------
+/*
+ * clean up function for temporary files
  */
 static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
+pgstat_cleanup_tempfile(PgStat_StatDBEntry *dbentry)
 {
-    PgStat_MsgInquiry msg;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
+    dbentry->n_temp_bytes += pending_filesize;
+    dbentry->n_temp_files += pending_files;
+    pending_filesize = 0;
+    pending_files = 0;
+    pgstat_pending_tempfile = false;
+
 }
 
-
 /*
  * Initialize function call usage data.
  * Called by the executor before invoking a function.
@@ -1692,9 +1878,6 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
         fs->f_numcalls++;
     fs->f_total_time = f_total;
     INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
 }
 
 
@@ -1716,6 +1899,15 @@ pgstat_initstats(Relation rel)
     Oid            rel_id = rel->rd_id;
     char        relkind = rel->rd_rel->relkind;
 
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+    {
+        /* We're not counting at all */
+        rel->pgstat_info = NULL;
+        return;
+    }
+
     /* We only count stats for things that have storage */
     if (!(relkind == RELKIND_RELATION ||
           relkind == RELKIND_MATVIEW ||
@@ -1727,13 +1919,6 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-    {
-        /* We're not counting at all */
-        rel->pgstat_info = NULL;
-        return;
-    }
-
     /*
      * If we already set up this relation in the current transaction, nothing
      * to do.
@@ -2377,34 +2562,6 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
         rec->tuples_inserted + rec->tuples_updated;
 }
 
-
-/* ----------
- * pgstat_fetch_stat_dbentry() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
- * ----------
- */
-PgStat_StatDBEntry *
-pgstat_fetch_stat_dbentry(Oid dbid)
-{
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
-}
-
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
@@ -2417,47 +2574,28 @@ pgstat_fetch_stat_dbentry(Oid dbid)
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* Lookup our database, then look in its table hash table. */
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, false);
+    if (dbentry == NULL)
+        return NULL;
 
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = backend_get_tab_entry(dbentry, relid, false);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    dbentry = pgstat_fetch_stat_dbentry(InvalidOid, false);
+    if (dbentry == NULL)
+        return NULL;
+
+    tabentry = backend_get_tab_entry(dbentry, relid, false);
+    if (tabentry != NULL)
+        return tabentry;
 
     return NULL;
 }
@@ -2476,18 +2614,14 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     PgStat_StatDBEntry *dbentry;
     PgStat_StatFuncEntry *funcentry = NULL;
 
-    /* load the stats file if needed */
-    backend_read_statsfile();
+    /* Lookup our database, then find the requested function */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_SHARED, NULL);
+    if (dbentry == NULL)
+        return NULL;
 
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
+    funcentry = backend_get_func_etnry(dbentry, func_id, false);
 
+    dshash_release_lock(db_stats, dbentry);
     return funcentry;
 }
 
@@ -2562,9 +2696,11 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    if (!backend_snapshot_global_stats())
+        return NULL;
 
-    return &archiverStats;
+    return snapshot_archiverStats;
 }
 
 
@@ -2579,9 +2715,11 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    if (!backend_snapshot_global_stats())
+        return NULL;
 
-    return &globalStats;
+    return snapshot_globalStats;
 }
 
 
@@ -2771,7 +2909,7 @@ pgstat_initialize(void)
     }
 
     /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -2963,7 +3101,7 @@ pgstat_beshutdown_hook(int code, Datum arg)
      * during failed backend starts might never get counted.)
      */
     if (OidIsValid(MyDatabaseId))
-        pgstat_report_stat(true);
+        pgstat_update_stat(true);
 
     /*
      * Clear my status entry, following the protocol of bumping st_changecount
@@ -3230,7 +3368,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -4150,96 +4289,68 @@ pgstat_get_backend_desc(BackendType backendType)
  * ------------------------------------------------------------
  */
 
-
 /* ----------
- * pgstat_setheader() -
+ * pgstat_update_archiver() -
  *
- *        Set common header fields in a statistics message
+ *    Update the stats data about the WAL file that we successfully archived or
+ *    failed to archive.
  * ----------
  */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
+void
+pgstat_update_archiver(const char *xlog, bool failed)
 {
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
+    if (failed)
     {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
+        /* Failed archival attempt */
+        ++shared_archiverStats->failed_count;
+        memcpy(shared_archiverStats->last_failed_wal, xlog,
+               sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = GetCurrentTimestamp();
+    }
+    else
+    {
+        /* Successful archival operation */
+        ++shared_archiverStats->archived_count;
+        memcpy(shared_archiverStats->last_archived_wal, xlog,
+               sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = GetCurrentTimestamp();
+    }
 }
 
 /* ----------
- * pgstat_send_archiver() -
+ * pgstat_update_bgwriter() -
  *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
+ *        Update bgwriter statistics
  * ----------
  */
 void
-pgstat_send_archiver(const char *xlog, bool failed)
-{
-    PgStat_MsgArchiver msg;
-
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* ----------
- * pgstat_send_bgwriter() -
- *
- *        Send bgwriter statistics to the collector
- * ----------
- */
-void
-pgstat_send_bgwriter(void)
+pgstat_update_bgwriter(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+
+    PgStat_BgWriter *s = &BgWriterStats;
 
     /*
      * This function can be called even if nothing at all has happened. In
      * this case, avoid sending a completely empty message to the stats
      * collector.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += s->timed_checkpoints;
+    shared_globalStats->requested_checkpoints += s->requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += s->checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += s->checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += s->buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += s->buf_written_clean;
+    shared_globalStats->maxwritten_clean += s->maxwritten_clean;
+    shared_globalStats->buf_written_backend += s->buf_written_backend;
+    shared_globalStats->buf_fsync_backend += s->buf_fsync_backend;
+    shared_globalStats->buf_alloc += s->buf_alloc;
+    LWLockRelease(StatsLock);
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4247,299 +4358,15 @@ pgstat_send_bgwriter(void)
     MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
 }
 
-
-/* ----------
- * PgstatCollectorMain() -
- *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, pgstat_sighup_handler);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, pgstat_exit);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    /*
-     * Identify myself via ps
-     */
-    init_ps_display("stats collector", "", "", "");
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do recognize got_SIGHUP inside
-     * the inner loop, which means that such interrupts will get serviced but
-     * the latch won't get cleared until next time there is a break in the
-     * action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (need_exit)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * need_exit becomes set.
-         */
-        while (!need_exit)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (got_SIGHUP)
-            {
-                got_SIGHUP = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry((PgStat_MsgInquiry *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat((PgStat_MsgTabstat *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge((PgStat_MsgTabpurge *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb((PgStat_MsgDropdb *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter((PgStat_MsgResetcounter *) &msg,
-                                             len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(
-                                                   (PgStat_MsgResetsharedcounter *) &msg,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(
-                                                   (PgStat_MsgResetsinglecounter *) &msg,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac((PgStat_MsgAutovacStart *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum((PgStat_MsgVacuum *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze((PgStat_MsgAnalyze *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver((PgStat_MsgArchiver *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter((PgStat_MsgBgWriter *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat((PgStat_MsgFuncstat *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge((PgStat_MsgFuncpurge *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict((PgStat_MsgRecoveryConflict *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock((PgStat_MsgDeadlock *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile((PgStat_MsgTempFile *) &msg, len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr & WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    exit(0);
-}
-
-
-/* SIGQUIT signal handler for collector process */
-static void
-pgstat_exit(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    need_exit = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
-/* SIGHUP handler for collector process */
-static void
-pgstat_sighup_handler(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    got_SIGHUP = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
 /*
- * Subroutine to clear stats in a database entry
+ * Subroutine to reset stats in a shared database entry
  *
  * Tables and functions hashes are initialized to empty.
  */
 static void
 reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
 {
-    HASHCTL        hash_ctl;
+    dshash_table *tbl;
 
     dbentry->n_xact_commit = 0;
     dbentry->n_xact_rollback = 0;
@@ -4565,20 +4392,17 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
     dbentry->stat_reset_timestamp = GetCurrentTimestamp();
     dbentry->stats_timestamp = 0;
 
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
 
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
+    Assert(dbentry->tables == DSM_HANDLE_INVALID);
+    tbl = dshash_create(area, &dsh_tblparams, 0);
+    dbentry->tables = dshash_get_hash_table_handle(tbl);
+    dshash_detach(tbl);
+
+    Assert(dbentry->functions == DSM_HANDLE_INVALID);
+    /* the function hash is created on demand */
+
+    dbentry->snapshot_tables = NULL;
+    dbentry->snapshot_functions = NULL;
 }
 
 /*
@@ -4587,47 +4411,76 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
  * Else, return NULL.
  */
 static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
+pgstat_get_db_entry(Oid databaseid, int op, pg_stat_table_result_status *status)
 {
     PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
+    bool        nowait = ((op & PGSTAT_FETCH_NOWAIT) != 0);
+    bool        lock_acquired = true;
+    bool        found = true;
 
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
+    if (!IsUnderPostmaster)
         return NULL;
 
-    /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
-     */
-    if (!found)
-        reset_dbentry_counters(result);
+    /* Lookup or create the hash table entry for this database */
+    if (op & PGSTAT_FETCH_EXCLUSIVE)
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_or_insert_extended(db_stats, &databaseid,
+                                           &found, nowait);
+        if (result == NULL)
+            lock_acquired = false;
+        else if (!found)
+        {
+            /*
+             * If not found, initialize the new one.  This creates empty hash
+             * tables for tables and functions, too.
+             */
+            reset_dbentry_counters(result);
+        }
+    }
+    else
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_extended(db_stats, &databaseid, true, nowait,
+                                 &lock_acquired);
+        if (result == NULL)
+            found = false;
+    }
+
+    /* Set return status if requested */
+    if (status)
+    {
+        if (!lock_acquired)
+        {
+            Assert(nowait);
+            *status = PGSTAT_ENTRY_LOCK_FAILED;
+        }
+        else if (!found)
+            *status = PGSTAT_ENTRY_NOT_FOUND;
+        else
+            *status = PGSTAT_ENTRY_FOUND;
+    }
 
     return result;
 }
 
-
 /*
  * Lookup the hash table entry for the specified table. If no hash
  * table entry exists, initialize it, if the create parameter is true.
  * Else, return NULL.
  */
 static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create)
 {
     PgStat_StatTabEntry *result;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
 
     /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
+    if (create)
+        result = (PgStat_StatTabEntry *)
+            dshash_find_or_insert(table, &tableoid, &found);
+    else
+        result = (PgStat_StatTabEntry *) dshash_find(table, &tableoid, false);
 
     if (!create && !found)
         return NULL;
@@ -4663,29 +4516,23 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
 
 /* ----------
  * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ *        Write the global statistics file, as well as DB files.
  * ----------
  */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+void
+pgstat_write_statsfiles(void)
 {
-    HASH_SEQ_STATUS hstat;
+    dshash_seq_status hstat;
     PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
 
+    /* should be called from the postmaster */
+    Assert(!IsUnderPostmaster);
+
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
     /*
@@ -4704,7 +4551,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
 
     /*
      * Write the file header --- currently just a format ID.
@@ -4716,32 +4563,29 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Write global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write archiver stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Walk through the database table.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_init(&hstat, db_stats, false, false);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&hstat)) != NULL)
     {
         /*
          * Write out the table and function stats for this DB into the
          * appropriate per-DB stat file, if required.
          */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
+        pgstat_write_db_statsfile(dbentry);
 
         /*
          * Write out the DB entry. We don't write the tables or functions
@@ -4784,16 +4628,6 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
 }
 
 /*
@@ -4801,15 +4635,14 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
  * of length len.
  */
 static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
+get_dbstat_filename(bool tempname, Oid databaseid,
                     char *filename, int len)
 {
     int            printed;
 
     /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
     printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
                        databaseid,
                        tempname ? "tmp" : "stat");
     if (printed >= len)
@@ -4827,10 +4660,10 @@ get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
  * ----------
  */
 static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
+pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry)
 {
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
+    dshash_seq_status tstat;
+    dshash_seq_status fstat;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatFuncEntry *funcentry;
     FILE       *fpout;
@@ -4839,9 +4672,10 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     int            rc;
     char        tmpfile[MAXPGPATH];
     char        statfile[MAXPGPATH];
+    dshash_table *tbl;
 
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -4868,23 +4702,30 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     /*
      * Walk through the database's access stats per table.
      */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&tstat, tbl, false, false);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&tstat)) != NULL)
     {
         fputc('T', fpout);
         rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_detach(tbl);
 
     /*
      * Walk through the database's function stats table.
      */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+    if (dbentry->functions != DSM_HANDLE_INVALID)
     {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
+        tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_seq_init(&fstat, tbl, false, false);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&fstat)) != NULL)
+        {
+            fputc('F', fpout);
+            rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+            (void) rc;                /* we'll check for error with ferror */
+        }
+        dshash_detach(tbl);
     }
 
     /*
@@ -4919,47 +4760,30 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
 }
 
 /* ----------
  * pgstat_read_statsfiles() -
  *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ *    Reads in some existing statistics collector files into the shared stats
+ *    hash.
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+void
+pgstat_read_statsfiles(void)
 {
     PgStat_StatDBEntry *dbentry;
     PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    dshash_table *tblstats = NULL;
+    dshash_table *funcstats = NULL;
+
+    /* should be called from the postmaster */
+    Assert(!IsUnderPostmaster);
 
     /*
      * The tables will live in pgStatLocalContext.
@@ -4967,28 +4791,18 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     pgstat_setup_memcxt();
 
     /*
-     * Create the DB hashtable
+     * Create the DB hashtable and global stats area
      */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
+    /* Hold the lock so that no other process sees empty stats */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    pgstat_create_shared_stats();
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp = shared_globalStats->stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
@@ -5002,11 +4816,12 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        LWLockRelease(StatsLock);
+        return;
     }
 
     /*
@@ -5015,7 +4830,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
@@ -5023,11 +4838,12 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     /*
      * Read global stats struct
      */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+        sizeof(*shared_globalStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
+        memset(shared_globalStats, 0, sizeof(*shared_globalStats));
         goto done;
     }
 
@@ -5038,17 +4854,17 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
      * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
      * an unusual scenario.
      */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
+    shared_globalStats->stats_timestamp = 0;
 
     /*
      * Read archiver stats struct
      */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+        sizeof(*shared_archiverStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
+        memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
         goto done;
     }
 
@@ -5068,7 +4884,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
                           fpin) != offsetof(PgStat_StatDBEntry, tables))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5077,21 +4893,23 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 /*
                  * Add to the DB hash
                  */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
+                dbentry = (PgStat_StatDBEntry *)
+                    dshash_find_or_insert(db_stats, (void *) &dbbuf.databaseid,
+                                          &found);
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    dshash_release_lock(db_stats, dbentry);
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
                 memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
+                dbentry->tables = DSM_HANDLE_INVALID;
+                dbentry->functions = DSM_HANDLE_INVALID;
+                dbentry->snapshot_tables = NULL;
+                dbentry->snapshot_functions = NULL;
 
                 /*
                  * In the collector, disregard the timestamp we read from the
@@ -5099,54 +4917,26 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                  * stats file immediately upon the first request from any
                  * backend.
                  */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+                dbentry->stats_timestamp = 0;
 
                 /*
                  * If requested, read the data from the database-specific
                  * file.  Otherwise we just leave the hashtables empty.
                  */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
+                tblstats = dshash_create(area, &dsh_tblparams, 0);
+                dbentry->tables = dshash_get_hash_table_handle(tblstats);
+                /* we don't create the function hash at present */
+                dshash_release_lock(db_stats, dbentry);
+                pgstat_read_db_statsfile(dbentry->databaseid,
+                                         tblstats, funcstats);
+                dshash_detach(tblstats);
                 break;
 
             case 'E':
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5154,36 +4944,62 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     }
 
 done:
+    LWLockRelease(StatsLock);
     FreeFile(fpin);
 
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 
-    return dbhash;
+    return;
 }
 
 
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+void
+StatsShmemInit(void)
+{
+    bool    found;
+
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+
+        /* Load saved data if any */
+        pgstat_read_statsfiles();
+
+        /* needs to be called before dsm shutdown */
+        before_shmem_exit(pgstat_postmaster_shutdown, (Datum) 0);
+    }
+}
+
+static void
+pgstat_postmaster_shutdown(int code, Datum arg)
+{
+    /* we trash the stats on crash */
+    if (code == 0)
+        pgstat_write_statsfiles();
+}
+
 /* ----------
  * pgstat_read_db_statsfile() -
  *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
+ *    Reads in the permanent statistics collector file and creates shared
+ *    statistics tables. The file is removed after reading.
  * ----------
  */
 static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
+pgstat_read_db_statsfile(Oid databaseid,
+                         dshash_table *tabhash, dshash_table *funchash)
 {
     PgStat_StatTabEntry *tabentry;
     PgStat_StatTabEntry tabbuf;
@@ -5194,7 +5010,10 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     bool        found;
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
+    /* should be called from postmaster  */
+    Assert(!IsUnderPostmaster);
+
+    get_dbstat_filename(false, databaseid, statfile, MAXPGPATH);
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
@@ -5208,7 +5027,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
@@ -5221,7 +5040,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
@@ -5241,7 +5060,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
                           fpin) != sizeof(PgStat_StatTabEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5253,19 +5072,21 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (tabhash == NULL)
                     break;
 
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
+                tabentry = (PgStat_StatTabEntry *)
+                    dshash_find_or_insert(tabhash,
+                                          (void *) &tabbuf.tableid, &found);
 
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    dshash_release_lock(tabhash, tabentry);
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
                 memcpy(tabentry, &tabbuf, sizeof(tabbuf));
+                dshash_release_lock(tabhash, tabentry);
                 break;
 
                 /*
@@ -5275,7 +5096,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
                           fpin) != sizeof(PgStat_StatFuncEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5287,19 +5108,20 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (funchash == NULL)
                     break;
 
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
+                funcentry = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert(funchash,
+                                          (void *) &funcbuf.functionid, &found);
 
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
                 memcpy(funcentry, &funcbuf, sizeof(funcbuf));
+                dshash_release_lock(funchash, funcentry);
                 break;
 
                 /*
@@ -5309,7 +5131,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5319,276 +5141,290 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
 done:
     FreeFile(fpin);
 
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 }
 
 /* ----------
- * pgstat_read_db_statsfile_timestamp() -
+ * backend_clean_snapshot_callback() -
  *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
+ *    This is usually called with arg = NULL when the memory context where the
+ *    current snapshot was taken is reset or deleted. Don't bother releasing
+ *    memory in that case.
  * ----------
  */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
+static void
+backend_clean_snapshot_callback(void *arg)
 {
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    if (arg != NULL)
     {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
+        /* explicitly called, so explicitly free resources */
+        if (snapshot_globalStats)
+            pfree(snapshot_globalStats);
 
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
+        if (snapshot_archiverStats)
+            pfree(snapshot_archiverStats);
 
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
+        if (snapshot_db_stats)
         {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
+            HASH_SEQ_STATUS seq;
+            PgStat_StatDBEntry *dbent;
 
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
+            hash_seq_init(&seq, snapshot_db_stats);
+            while ((dbent = hash_seq_search(&seq)) != NULL)
+            {
+                if (dbent->snapshot_tables)
+                    hash_destroy(dbent->snapshot_tables);
+                if (dbent->snapshot_functions)
+                    hash_destroy(dbent->snapshot_functions);
+            }
+            hash_destroy(snapshot_db_stats);
         }
     }
 
-done:
-    FreeFile(fpin);
-    return true;
+    /* mark the resources as not allocated */
+    snapshot_globalStats = NULL;
+    snapshot_archiverStats = NULL;
+    snapshot_db_stats = NULL;
 }
 
 /*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
+ * create_local_stats_hash() -
+ *
+ * Creates a dynahash used for table/function stats cache.
  */
-static void
-backend_read_statsfile(void)
+static HTAB *
+create_local_stats_hash(const char *name, size_t keysize, size_t entrysize,
+                        int nentries)
 {
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
+    HTAB *result;
+    HASHCTL ctl;
 
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
+    /* Create the hash in the stats context */
+    ctl.keysize        = keysize;
+    ctl.entrysize    = entrysize;
+    ctl.hcxt        = stats_cxt;
+    result = hash_create(name, nentries, &ctl,
+                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    return result;
 }
 
+/*
+ * snapshot_statentry() - Find an entry in the source dshash.
+ *
+ * Returns the entry for key, or NULL if not found. If dest is not NULL, *dest
+ * is used as a local cache, created with the same key and entry sizes as the
+ * given dshash when *dest is NULL. In both cases the result is cached in the
+ * hash and the same entry is returned to subsequent calls for the same key.
+ *
+ * Otherwise the returned entry is a copy palloc'ed in the current memory
+ * context. Its content may differ for every request.
+ *
+ * If dshash is NULL, temporarily attaches dsh_handle instead.
+ */
+static void *
+snapshot_statentry(HTAB **dest, const char *hashname,
+                   dshash_table *dshash, dshash_table_handle dsh_handle,
+                   const dshash_parameters *dsh_params, Oid key)
+{
+    void *lentry = NULL;
+    size_t keysize = dsh_params->key_size;
+    size_t entrysize = dsh_params->entry_size;
+
+    if (dest)
+    {
+        /* caches the result entry */
+        bool found;
+
+        /*
+         * Create new hash with arbitrary initial entries since we don't know
+         * how this hash will grow.
+         */
+        if (!*dest)
+        {
+            Assert(hashname);
+            *dest = create_local_stats_hash(hashname, keysize, entrysize, 32);
+        }
+
+        lentry = hash_search(*dest, &key, HASH_ENTER, &found);
+        if (!found)
+        {
+            dshash_table *t = dshash;
+            void *sentry;
+
+            if (!t)
+                t = dshash_attach(area, dsh_params, dsh_handle, NULL);
+
+            sentry = dshash_find(t, &key, false);
+
+            /*
+             * We expect that the stats for specified database exists in most
+             * cases.
+             */
+            if (!sentry)
+            {
+                hash_search(*dest, &key, HASH_REMOVE, NULL);
+                if (!dshash)
+                    dshash_detach(t);
+                return NULL;
+            }
+            memcpy(lentry, sentry, entrysize);
+            dshash_release_lock(t, sentry);
+
+            if (!dshash)
+                dshash_detach(t);
+        }
+    }
+    else
+    {
+        /*
+         * The caller doesn't want caching. Just make a copy of the entry and
+         * return it.
+         */
+        dshash_table *t = dshash;
+        void *sentry;
+
+        if (!t)
+            t = dshash_attach(area, dsh_params, dsh_handle, NULL);
+
+        sentry = dshash_find(t, &key, false);
+        if (sentry)
+        {
+            lentry = palloc(entrysize);
+            memcpy(lentry, sentry, entrysize);
+            dshash_release_lock(t, sentry);
+        }
+
+        if (!dshash)
+            dshash_detach(t);
+    }
+
+    return lentry;
+}
+
+/*
+ * backend_snapshot_global_stats() -
+ *
+ * Makes a local copy of global stats if not already done.  They will be kept
+ * until pgstat_clear_snapshot() is called or the end of the current memory
+ * context (typically TopTransactionContext).  Returns false if the shared
+ * stats is not created yet.
+ */
+static bool
+backend_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext = CurrentMemoryContext;
+    MemoryContextCallback *mcxt_cb;
+
+    /* Nothing to do if already done */
+    if (snapshot_globalStats)
+        return true;
+
+    Assert(snapshot_archiverStats == NULL);
+
+    /*
+     * The snapshot lives within the current top transaction if any, or the
+     * current memory context's lifetime otherwise.
+     */
+    if (IsTransactionState())
+        oldcontext = MemoryContextSwitchTo(TopTransactionContext);
+
+    /* Remember for stats memory allocation later */
+    stats_cxt = CurrentMemoryContext;
+
+    /* global stats can be just copied  */
+    snapshot_globalStats = palloc(sizeof(PgStat_GlobalStats));
+    memcpy(snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+
+    snapshot_archiverStats = palloc(sizeof(PgStat_ArchiverStats));
+    memcpy(snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+
+    /* set the timestamp of this snapshot */
+    snapshot_globalStats->stats_timestamp = GetCurrentTimestamp();
+
+    /* register callback to clear snapshot */
+    mcxt_cb = (MemoryContextCallback *)palloc(sizeof(MemoryContextCallback));
+    mcxt_cb->func = backend_clean_snapshot_callback;
+    mcxt_cb->arg = NULL;
+    MemoryContextRegisterResetCallback(CurrentMemoryContext, mcxt_cb);
+    MemoryContextSwitchTo(oldcontext);
+
+    return true;
+}
+
+/* ----------
+ * pgstat_fetch_stat_dbentry() -
+ *
+ *    Find database stats entry on backends. The returned entries are cached
+ *    until transaction end. If oneshot is true, they are not cached and are
+ *    returned in palloc'ed memory.
+ */
+PgStat_StatDBEntry *
+pgstat_fetch_stat_dbentry(Oid dbid, bool oneshot)
+{
+    /* take a local snapshot if we don't have one */
+    char *hashname = "local database stats hash";
+    PgStat_StatDBEntry *dbentry;
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    /* If not done for this transaction, take a snapshot of global stats */
+    if (!backend_snapshot_global_stats())
+        return NULL;
+
+    dbentry = snapshot_statentry(oneshot ? NULL : &snapshot_db_stats,
+                                 hashname, db_stats, 0, &dsh_dbparams,
+                                 dbid);
+    
+    return dbentry;
+}
+
+/* ----------
+ * backend_get_tab_entry() -
+ *
+ *    Find table stats entry on backends. The returned entries are cached until
+ *    transaction end. If oneshot is true, they are not cached and are returned
+ *    in palloc'ed memory.
+ */
+PgStat_StatTabEntry *
+backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid reloid, bool oneshot)
+{
+    /* take a local snapshot if we don't have one */
+    char *hashname = "local table stats hash";
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    return snapshot_statentry(oneshot ? NULL : &dbent->snapshot_tables,
+                              hashname, NULL, dbent->tables, &dsh_tblparams,
+                              reloid);
+}
+
+/* ----------
+ * backend_get_func_entry() -
+ *
+ *    Find function stats entry on backends. The returned entries are cached
+ *    until transaction end. If oneshot is true, they are not cached and are
+ *    returned in palloc'ed memory.
+ */
+static PgStat_StatFuncEntry *
+backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot)
+{
+    char *hashname = "local function stats hash";
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    if (dbent->functions == DSM_HANDLE_INVALID)
+        return NULL;
+
+    return snapshot_statentry(oneshot ? NULL : &dbent->snapshot_functions,
+                              hashname, NULL, dbent->functions, &dsh_funcparams,
+                              funcid);
+}
 
 /* ----------
  * pgstat_setup_memcxt() -
@@ -5619,6 +5455,8 @@ pgstat_setup_memcxt(void)
 void
 pgstat_clear_snapshot(void)
 {
+    int param = 0;    /* only the address is significant */
+
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
         MemoryContextDelete(pgStatLocalContext);
@@ -5628,717 +5466,112 @@ pgstat_clear_snapshot(void)
     pgStatDBHash = NULL;
     localBackendStatusTable = NULL;
     localNumBackends = 0;
+
+    /*
+     * the parameter informs the function that it is not being called from
+     * a MemoryContextCallback
+     */
+    backend_clean_snapshot_callback(&param);
 }
 
 
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
+static bool
+pgstat_update_tabentry(dshash_table *tabhash, PgStat_TableStatus *stat,
+                       bool nowait)
 {
-    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    bool    found;
 
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
+    if (tabhash == NULL)
+        return false;
 
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
+    tabentry = (PgStat_StatTabEntry *)
+        dshash_find_or_insert_extended(tabhash, (void *) &(stat->t_id),
+                                       &found, nowait);
 
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
+    /* failed to acquire lock */
+    if (tabentry == NULL)
+        return false;
+
+    if (!found)
     {
         /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
+         * If it's a new table entry, initialize counters to the values we
+         * just got.
          */
+        tabentry->numscans = stat->t_counts.t_numscans;
+        tabentry->tuples_returned = stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched = stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted = stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated = stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted = stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated = stat->t_counts.t_tuples_hot_updated;
+        tabentry->n_live_tuples = stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples = stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze = stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched = stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit = stat->t_counts.t_blocks_hit;
+
+        tabentry->vacuum_timestamp = 0;
+        tabentry->vacuum_count = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_count = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->analyze_count = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+        tabentry->autovac_analyze_count = 0;
     }
-    else if (msg->clock_time < dbentry->stats_timestamp)
+    else
     {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
         /*
-         * Add per-table stats to the per-database entry, too.
+         * Otherwise add the values to the existing entry.
          */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetshared() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
+        tabentry->numscans += stat->t_counts.t_numscans;
+        tabentry->tuples_returned += stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched += stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted += stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated += stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted += stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated += stat->t_counts.t_tuples_hot_updated;
+        /* If table was truncated, first reset the live/dead counters */
+        if (stat->t_counts.t_truncated)
         {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
+            tabentry->n_live_tuples = 0;
+            tabentry->n_dead_tuples = 0;
         }
+        tabentry->n_live_tuples += stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples += stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze += stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched += stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit += stat->t_counts.t_blocks_hit;
     }
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
+
+    dshash_release_lock(tabhash, tabentry);
+
+    return true;
 }
 
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
 static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
+pgstat_update_dbentry(PgStat_StatDBEntry *dbentry, PgStat_TableStatus *stat)
 {
     /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
+     * Add per-table stats to the per-database entry, too.
      */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
+    dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
+    dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
+    dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
+    dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
+    dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
+    dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
+    dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
 }
 
+
 /*
  * Convert a potentially unsafely truncated activity string (see
  * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 65eab02b3e..dd293a79f0 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -255,7 +255,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -502,7 +501,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1298,12 +1296,6 @@ PostmasterMain(int argc, char *argv[])
 
     whereToSendOutput = DestNone;
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1752,11 +1744,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2591,8 +2578,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -2923,8 +2908,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -2991,13 +2974,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3072,22 +3048,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3546,22 +3506,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3757,8 +3701,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3797,8 +3739,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -3999,8 +3940,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -4973,18 +4912,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5097,12 +5024,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -5972,7 +5893,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6025,8 +5945,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6258,7 +6176,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index d87cf8afe5..9408f87614 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -128,7 +128,7 @@ finish_sync_worker(void)
     if (IsTransactionState())
     {
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(true);
     }
 
     /* And flush all writes. */
@@ -144,6 +144,9 @@ finish_sync_worker(void)
     /* Find the main apply worker and signal it. */
     logicalrep_worker_wakeup(MyLogicalRepWorker->subid, InvalidOid);
 
+    /* clean up retained statistics */
+    pgstat_update_stat(true);
+
     /* Stop gracefully */
     proc_exit(0);
 }
@@ -525,7 +528,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
     if (started_tx)
     {
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(false);
     }
 }
 
@@ -863,7 +866,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
                                            MyLogicalRepWorker->relstate,
                                            MyLogicalRepWorker->relstate_lsn);
                 CommitTransactionCommand();
-                pgstat_report_stat(false);
+                pgstat_update_stat(false);
 
                 /*
                  * We want to do the table data sync in a single transaction.
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index de23ced9af..e4e2ad7b39 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -493,7 +493,7 @@ apply_handle_commit(StringInfo s)
         replorigin_session_origin_timestamp = commit_data.committime;
 
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(false);
 
         store_flush_position(commit_data.end_lsn);
     }
@@ -1327,6 +1327,8 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
             }
 
             send_feedback(last_received, requestReply, requestReply);
+
+            pgstat_update_stat(false);
         }
     }
 }
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 273e2f385f..d8d0ad2487 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1990,7 +1990,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                BgWriterStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2098,7 +2098,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2288,7 +2288,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2296,7 +2296,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%dscanned=%d wrote=%d reusable=%d",
 
diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index cab7ae74ca..c7c248878a 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -197,6 +197,15 @@ dsm_postmaster_startup(PGShmemHeader *shim)
     dsm_control->maxitems = maxitems;
 }
 
+/*
+ * Clear the dsm_init_done state on child start.
+ */
+void
+dsm_child_init(void)
+{
+    dsm_init_done = false;
+}
+
 /*
  * Determine whether the control segment from the previous postmaster
  * invocation still exists.  If so, remove the dynamic shared memory
@@ -423,6 +432,15 @@ dsm_set_control_handle(dsm_handle h)
 }
 #endif
 
+/*
+ * Return whether the dsm feature is available in this process.
+ */
+bool
+dsm_is_available(void)
+{
+    return dsm_control != NULL;
+}
+
 /*
  * Create a new dynamic shared memory segment.
  *
@@ -440,8 +458,7 @@ dsm_create(Size size, int flags)
     uint32        i;
     uint32        nitems;
 
-    /* Unsafe in postmaster (and pointless in a stand-alone backend). */
-    Assert(IsUnderPostmaster);
+    Assert(dsm_is_available());
 
     if (!dsm_init_done)
         dsm_backend_startup();
@@ -537,8 +554,7 @@ dsm_attach(dsm_handle h)
     uint32        i;
     uint32        nitems;
 
-    /* Unsafe in postmaster (and pointless in a stand-alone backend). */
-    Assert(IsUnderPostmaster);
+    Assert(dsm_is_available());
 
     if (!dsm_init_done)
         dsm_backend_startup();
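
Reviewer's sketch for the dsm.c changes above: the Assert(IsUnderPostmaster) checks are replaced by a check that the control segment exists, and dsm_child_init() resets the lazy-init flag so that a forked child runs its own per-process startup. The following is a minimal stand-alone model of that behaviour, not the dsm.c implementation. Every sketch_* name is hypothetical, and the counter exists only to make the re-initialization observable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the DSM control state; not PostgreSQL symbols. */
static int   sketch_control_storage;
static void *sketch_control = NULL;     /* NULL until the facility is set up */
static bool  sketch_init_done = false;  /* per-process lazy-init flag */
static int   sketch_startups = 0;       /* counts per-process initializations */

/* Models dsm_is_available(): usable once the control state exists. */
static bool
sketch_is_available(void)
{
    return sketch_control != NULL;
}

/* Models postmaster-side setup of the control state. */
static void
sketch_postmaster_startup(void)
{
    sketch_control = &sketch_control_storage;
}

/* Models dsm_child_init(): forget the state inherited from the parent. */
static void
sketch_child_init(void)
{
    sketch_init_done = false;
}

/* Models the guard plus lazy per-process startup at the top of dsm_create(). */
static int
sketch_create(void)
{
    assert(sketch_is_available());

    if (!sketch_init_done)
    {
        sketch_startups++;
        sketch_init_done = true;
    }
    return sketch_startups;
}
```

In this model the process that set up the control state may call sketch_create() itself, which is what the relaxed assertion permits, and a child that calls sketch_child_init() first triggers one more per-process startup.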
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2849e47d99..6417559cb0 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -148,6 +148,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -279,8 +280,13 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 
     /* Initialize dynamic shared memory facilities. */
     if (!IsUnderPostmaster)
+    {
         dsm_postmaster_startup(shim);
 
+        /* Stats collector uses dynamic shared memory */
+        StatsShmemInit();
+    }
+
     /*
      * Now give loadable modules a chance to set up their shmem allocations
      */
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 81dac45ae5..979478e2e5 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -521,6 +521,9 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_TBM, "tbm");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
+    LWLockRegisterTranche(LWTRANCHE_STATS_DSA, "stats table dsa");
+    LWLockRegisterTranche(LWTRANCHE_STATS_DB, "db stats");
+    LWLockRegisterTranche(LWTRANCHE_STATS_FUNC_TABLE, "table/func stats");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index db47843229..97eccb35d3 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -49,3 +49,4 @@ MultiXactTruncationLock                41
 OldSnapshotTimeMapLock                42
 LogicalRepWorkerLock                43
 CLogTruncationLock                    44
+StatsLock                            45
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 0c0891b33e..8b9142461a 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3159,6 +3159,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_update_stat(true);
+    }
 }
 
 
@@ -3733,6 +3739,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4173,9 +4180,17 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
-                ProcessCompletedNotifies();
-                pgstat_report_stat(false);
+                long stats_timeout;
 
+                ProcessCompletedNotifies();
+
+                stats_timeout = pgstat_update_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4210,7 +4225,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4218,6 +4233,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 053bb73863..6eac39fb57 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -33,7 +33,7 @@
 #define UINT32_ACCESS_ONCE(var)         ((uint32)(*((volatile uint32 *)&(var))))
 
 /* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
+extern PgStat_BgWriter bgwriterStats;
 
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
@@ -1176,7 +1176,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_xact_commit);
@@ -1192,7 +1192,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_xact_rollback);
@@ -1208,7 +1208,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_blocks_fetched);
@@ -1224,7 +1224,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_blocks_hit);
@@ -1240,7 +1240,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_returned);
@@ -1256,7 +1256,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_fetched);
@@ -1272,7 +1272,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_inserted);
@@ -1288,7 +1288,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_updated);
@@ -1304,7 +1304,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_deleted);
@@ -1319,7 +1319,7 @@ pg_stat_get_db_stat_reset_time(PG_FUNCTION_ARGS)
     TimestampTz result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = dbentry->stat_reset_timestamp;
@@ -1337,7 +1337,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = dbentry->n_temp_files;
@@ -1353,7 +1353,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = dbentry->n_temp_bytes;
@@ -1368,7 +1368,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_tablespace);
@@ -1383,7 +1383,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_lock);
@@ -1398,7 +1398,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_snapshot);
@@ -1413,7 +1413,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_bufferpin);
@@ -1428,7 +1428,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_startup_deadlock);
@@ -1443,7 +1443,7 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (
@@ -1463,7 +1463,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_deadlocks);
@@ -1479,7 +1479,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     PgStat_StatDBEntry *dbentry;
 
     /* convert counter from microsec to millisec for display */
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = ((double) dbentry->n_block_read_time) / 1000.0;
@@ -1495,7 +1495,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     PgStat_StatDBEntry *dbentry;
 
     /* convert counter from microsec to millisec for display */
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = ((double) dbentry->n_block_write_time) / 1000.0;
@@ -1850,6 +1850,9 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     /* Get statistics about the archiver process */
     archiver_stats = pgstat_fetch_stat_archiver();
 
+    if (archiver_stats == NULL)
+        PG_RETURN_NULL();
+
     /* Fill values and NULLs */
     values[0] = Int64GetDatum(archiver_stats->archived_count);
     if (*(archiver_stats->last_archived_wal) == '\0')
@@ -1879,6 +1882,5 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
         values[6] = TimestampTzGetDatum(archiver_stats->stat_reset_timestamp);
 
     /* Returns the record as Datum */
-    PG_RETURN_DATUM(HeapTupleGetDatum(
-                                      heap_form_tuple(tupdesc, values, nulls)));
+    PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
 }
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index fd51934aaf..994351ac2d 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 7415c4faab..626a4326a4 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -629,6 +630,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1240,6 +1243,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
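
Reviewer's sketch for the idle stats-update timeout wired up in postgres.c, globals.c and postinit.c above: on going idle the backend calls pgstat_update_stat(false) and arms IDLE_STATS_UPDATE_TIMEOUT when a positive value comes back; the handler only sets a flag, ProcessInterrupts() then calls pgstat_update_stat(true), and the timer is disabled again when the next command arrives. The stand-alone model below assumes that a positive return value means "some stats are still unflushed, retry after this many milliseconds"; the patch hunks shown here do not spell that out. All sketch_* names and the 500 ms interval are hypothetical.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* All names below are hypothetical stand-ins, not PostgreSQL symbols. */
static volatile sig_atomic_t sketch_stats_pending = 0;
static bool sketch_timer_armed = false;
static long sketch_timer_ms = 0;
static int  sketch_local_counts = 0;    /* stats accumulated by this backend */
static int  sketch_shared_counts = 0;   /* stats visible to other backends */
static bool sketch_shared_busy = false; /* simulates a contended shared table */

/*
 * Try to move local counts into shared storage.  Returns 0 when nothing is
 * left to flush, otherwise the number of milliseconds to wait before
 * retrying (an assumed meaning of the positive return value).
 */
static long
sketch_update_stat(bool force)
{
    assert(sketch_local_counts >= 0);

    if (sketch_local_counts == 0)
        return 0;
    if (sketch_shared_busy && !force)
        return 500;             /* assumed retry interval */
    sketch_shared_counts += sketch_local_counts;
    sketch_local_counts = 0;
    return 0;
}

/* Timer callback: only sets a flag; the real work happens later. */
static void
sketch_timeout_handler(void)
{
    sketch_stats_pending = 1;
}

/* Going idle: flush opportunistically and arm a timer if work remains. */
static void
sketch_enter_idle(void)
{
    long        timeout = sketch_update_stat(false);

    if (timeout > 0)
    {
        sketch_timer_armed = true;
        sketch_timer_ms = timeout;
    }
}

/* Interrupt processing: a pending timeout forces the flush. */
static void
sketch_process_interrupts(void)
{
    if (sketch_stats_pending)
    {
        sketch_stats_pending = 0;
        sketch_update_stat(true);
    }
}

/* Leaving idle: disarm the timer so it cannot fire during a command. */
static void
sketch_leave_idle(void)
{
    if (sketch_timer_armed)
    {
        sketch_timer_armed = false;
        sketch_timer_ms = 0;
    }
}
```

The point of the split is that the handler itself touches nothing but a flag, so the forced flush always runs from ordinary code where taking locks is safe.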
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 3e1c3863c4..25b3b2a079 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 63a7653457..49131a6d5b 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
@@ -403,7 +404,6 @@ typedef enum
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
-
     NUM_AUXPROCTYPES            /* Must be last! */
 } AuxProcType;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f299d1d601..1ad77fb20f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -13,6 +13,7 @@
 
 #include "datatype/timestamp.h"
 #include "fmgr.h"
+#include "lib/dshash.h"
 #include "libpq/pqcomm.h"
 #include "port/atomics.h"
 #include "portability/instr_time.h"
@@ -41,32 +42,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -115,13 +90,6 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_blocks_hit;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -180,271 +148,23 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
 /* ----------
- * PgStat_MsgHdr                The common message header
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgHdr
+typedef struct PgStat_BgWriter
 {
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter buf_alloc;
+    PgStat_Counter checkpoint_write_time; /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+} PgStat_BgWriter;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
@@ -485,79 +205,6 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgAutovacStart msg_autovacuum;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
  * Statistic collector data structures follow
  *
@@ -601,10 +248,13 @@ typedef struct PgStat_StatDBEntry
 
     /*
      * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * the handles and pointers out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables;
+    dshash_table_handle functions;
+    /* for snapshot tables */
+    HTAB *snapshot_tables;
+    HTAB *snapshot_functions;
 } PgStat_StatDBEntry;
 
 
@@ -1136,13 +786,15 @@ extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used, but will be removed with GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1154,34 +806,20 @@ extern PgStat_Counter pgStatBlockWriteTime;
  * Functions called from postmaster
  * ----------
  */
-extern Size BackendStatusShmemSize(void);
-extern void CreateSharedBackendStatus(void);
-
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
-
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_update_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
-extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
-extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
-
+extern void pgstat_reset_shared_counters(const char *target);
+extern void pgstat_reset_single_counter(Oid objectid,
+                                        PgStat_Single_Reset_Type type);
 extern void pgstat_report_autovac(Oid dboid);
 extern void pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples);
@@ -1192,6 +830,8 @@ extern void pgstat_report_analyze(Relation rel,
 extern void pgstat_report_recovery_conflict(int reason);
 extern void pgstat_report_deadlock(void);
 
+extern void pgstat_clear_snapshot(void);
+
 extern void pgstat_initialize(void);
 extern void pgstat_bestart(void);
 
@@ -1219,6 +859,9 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 extern void pgstat_initstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
+extern PgStat_StatDBEntry *backend_get_db_entry(Oid dbid, bool oneshot);
+extern HTAB *backend_snapshot_all_db_entries(void);
+extern PgStat_StatTabEntry *backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid relid, bool oneshot);
 
 /* ----------
  * pgstat_report_wait_start() -
@@ -1338,15 +981,15 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                           void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
+extern void pgstat_update_archiver(const char *xlog, bool failed);
+extern void pgstat_update_bgwriter(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
  * generate the pgstat* views.
  * ----------
  */
-extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
+extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid, bool oneshot);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
@@ -1355,4 +998,14 @@ extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
 
+/* File input/output functions  */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
+
+/* For shared memory allocation/initialization */
+extern Size BackendStatusShmemSize(void);
+extern void CreateSharedBackendStatus(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/dsm.h b/src/include/storage/dsm.h
index 7c44f4a6e7..c37ec33e9b 100644
--- a/src/include/storage/dsm.h
+++ b/src/include/storage/dsm.h
@@ -26,6 +26,7 @@ typedef struct dsm_segment dsm_segment;
 struct PGShmemHeader;            /* avoid including pg_shmem.h */
 extern void dsm_cleanup_using_control_segment(dsm_handle old_control_handle);
 extern void dsm_postmaster_startup(struct PGShmemHeader *);
+extern void dsm_child_init(void);
 extern void dsm_backend_shutdown(void);
 extern void dsm_detach_all(void);
 
@@ -33,6 +34,8 @@ extern void dsm_detach_all(void);
 extern void dsm_set_control_handle(dsm_handle h);
 #endif
 
+extern bool dsm_is_available(void);
+
 /* Functions that create or remove mappings. */
 extern dsm_segment *dsm_create(Size size, int flags);
 extern dsm_segment *dsm_attach(dsm_handle h);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 96c7732006..daa269f816 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -219,6 +219,9 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SHARED_TUPLESTORE,
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
+    LWTRANCHE_STATS_DSA,
+    LWTRANCHE_STATS_DB,
+    LWTRANCHE_STATS_FUNC_TABLE,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 9244a2a7b7..a9b625211b 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
diff --git a/src/test/modules/worker_spi/worker_spi.c b/src/test/modules/worker_spi/worker_spi.c
index c1878dd694..7391e05f37 100644
--- a/src/test/modules/worker_spi/worker_spi.c
+++ b/src/test/modules/worker_spi/worker_spi.c
@@ -290,7 +290,7 @@ worker_spi_main(Datum main_arg)
         SPI_finish();
         PopActiveSnapshot();
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(false);
         pgstat_report_activity(STATE_IDLE, NULL);
     }
 
-- 
2.16.3
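
The pgstat.h hunk above replaces the HTAB pointers in PgStat_StatDBEntry with dshash_table_handle fields and keeps plain HTAB pointers only for backend-local snapshots. A minimal sketch of why a handle, rather than a pointer, is what can be stored in shared memory follows. It assumes a handle behaves like an offset into a segment; Segment, seg_handle, seg_alloc and the counter helpers are hypothetical names for illustration and are not the dshash or DSA API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a handle: an offset from the segment start. */
typedef size_t seg_handle;

/* Hypothetical stand-in for a shared segment mapped by each process. */
typedef struct Segment
{
    unsigned char bytes[256];
    size_t        used;
} Segment;

typedef struct Counter
{
    long        n_calls;
} Counter;

/* Reserve "size" bytes in the segment and return the offset as a handle. */
static seg_handle
seg_alloc(Segment *seg, size_t size)
{
    size_t      aligned = (seg->used + sizeof(long) - 1) / sizeof(long) * sizeof(long);

    assert(aligned + size <= sizeof(seg->bytes));
    seg->used = aligned + size;
    return aligned;
}

/* Read the counter through this process's own mapping of the segment. */
static long
counter_read(const Segment *view, seg_handle h)
{
    Counter     c;

    memcpy(&c, view->bytes + h, sizeof(c));
    return c.n_calls;
}

/* Add to the counter through this process's own mapping of the segment. */
static void
counter_add(Segment *view, seg_handle h, long delta)
{
    Counter     c;

    memcpy(&c, view->bytes + h, sizeof(c));
    c.n_calls += delta;
    memcpy(view->bytes + h, &c, sizeof(c));
}
```

In this sketch a second Segment copied from the first plays the role of another process mapping the same memory at a different address: the handle still resolves there, while a raw pointer taken from the first mapping would not.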

From 60711e8bb25371b237835c0f78a64d139589c8c6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 27 Nov 2018 14:42:12 +0900
Subject: [PATCH 5/7] Remove the GUC stats_temp_directory

The GUC used to specify the directory in which temporary statistics
files are stored. It is no longer needed by the stats collector but is
still used by the programs in bin and contrib, and maybe by other
extensions. Thus this patch removes the GUC, but some backing variables
and macro definitions are left alone for backward compatibility.
---
 src/backend/postmaster/pgstat.c               | 12 +++-----
 src/backend/replication/basebackup.c          | 13 ++-------
 src/backend/utils/misc/guc.c                  | 41 ---------------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/pgstat.h                          |  5 +++-
 5 files changed, 11 insertions(+), 61 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index a97fbae7a8..78f0bbb558 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -89,15 +89,11 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
+/*
+ * This was a GUC parameter and is no longer used in this file. It is left
+ * alone, with its default value, just for backward compatibility with extensions.
  */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+char       *pgstat_stat_directory = PG_STAT_TMP_DIR;
 
 /* Shared stats bootstrap infomation */
 typedef struct StatsShmemStruct {
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index def6c03dd0..58ba33e822 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -230,11 +230,8 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -265,13 +262,9 @@ perform_base_backup(basebackup_options *opt)
          * Calculate the relative path of temporary statistics directory in
          * order to skip the files which are located in that directory later.
          */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
+
+        Assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
+        statrelpath = psprintf("./%s", PG_STAT_TMP_DIR);
 
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c216ed0922..099afd0724 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -189,7 +189,6 @@ static bool check_autovacuum_max_workers(int *newval, void **extra, GucSource so
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static void assign_effective_io_concurrency(int newval, void *extra);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -3973,17 +3972,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -10966,35 +10954,6 @@ assign_effective_io_concurrency(int newval, void *extra)
 #endif                            /* USE_PREFETCH */
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index a21865a77f..a65656a4d2 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -552,7 +552,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1ad77fb20f..d10ea5389b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -31,7 +31,10 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
+/*
+ * This used to be the directory to store temporary statistics data in but is
+ * no longer used. Defined here for backward compatibility.
+ */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
 
 /* Values for track_functions GUC variable --- order is significant! */
-- 
2.16.3
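
The basebackup.c hunk in this patch drops the three-way computation of statrelpath and always uses "./" followed by PG_STAT_TMP_DIR. The following is a standalone sketch of both behaviours, assuming POSIX-style paths. old_stat_rel_path, new_stat_rel_path and dot_slash are hypothetical names; malloc and snprintf stand in for psprintf, and a leading '/' test stands in for is_absolute_path.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define STAT_TMP_DIR "pg_stat_tmp"  /* mirrors PG_STAT_TMP_DIR in pgstat.h */

/* Hypothetical helper: return a newly allocated "./<name>"; caller frees. */
static char *
dot_slash(const char *name)
{
    size_t      len = strlen(name) + 3; /* "./" + name + terminating NUL */
    char       *out = malloc(len);

    if (out == NULL)
        abort();
    snprintf(out, len, "./%s", name);
    return out;
}

/* Sketch of the removed logic: the directory was configurable. */
static char *
old_stat_rel_path(const char *statdir, const char *datadir)
{
    size_t      dlen = strlen(datadir);

    /* absolute path under the data directory: strip the prefix and '/' */
    if (statdir[0] == '/' && strncmp(statdir, datadir, dlen) == 0)
        return dot_slash(statdir + dlen + 1);
    /* relative path without a leading "./": add one */
    if (strncmp(statdir, "./", 2) != 0)
        return dot_slash(statdir);
    /* already starts with "./": return an equivalent copy */
    return dot_slash(statdir + 2);
}

/* Sketch of the new logic: the directory is a fixed, slash-free name. */
static char *
new_stat_rel_path(void)
{
    assert(strchr(STAT_TMP_DIR, '/') == NULL);
    return dot_slash(STAT_TMP_DIR);
}
```

With the default directory both variants yield the same relative path, which is why the fixed form suffices once the GUC is gone.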

From 5be4706bc70dc7eeaa36674d01cf3c409172172a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 9 Nov 2018 15:48:49 +0900
Subject: [PATCH 6/7] Split out backend status monitor part from pgstat

A large file, pgstat.c, contained two major facilities: the backend
status monitor and the database usage monitor. Split the former out of
the file and name the module "bestatus". The names of individual
functions are left alone except for some conflicts.
---
 contrib/pg_prewarm/autoprewarm.c                   |    2 +-
 contrib/pg_stat_statements/pg_stat_statements.c    |    1 +
 contrib/postgres_fdw/connection.c                  |    2 +-
 src/backend/Makefile                               |    2 +-
 src/backend/access/heap/rewriteheap.c              |    4 +-
 src/backend/access/heap/vacuumlazy.c               |    1 +
 src/backend/access/nbtree/nbtree.c                 |    2 +-
 src/backend/access/nbtree/nbtsort.c                |    2 +-
 src/backend/access/transam/clog.c                  |    2 +-
 src/backend/access/transam/parallel.c              |    2 +-
 src/backend/access/transam/slru.c                  |    2 +-
 src/backend/access/transam/timeline.c              |    2 +-
 src/backend/access/transam/twophase.c              |    1 +
 src/backend/access/transam/xact.c                  |    1 +
 src/backend/access/transam/xlog.c                  |    1 +
 src/backend/access/transam/xlogfuncs.c             |    2 +-
 src/backend/access/transam/xlogutils.c             |    2 +-
 src/backend/bootstrap/bootstrap.c                  |    8 +-
 src/backend/executor/execParallel.c                |    2 +-
 src/backend/executor/nodeBitmapHeapscan.c          |    1 +
 src/backend/executor/nodeGather.c                  |    2 +-
 src/backend/executor/nodeHash.c                    |    2 +-
 src/backend/executor/nodeHashjoin.c                |    2 +-
 src/backend/libpq/be-secure-openssl.c              |    2 +-
 src/backend/libpq/be-secure.c                      |    2 +-
 src/backend/libpq/pqmq.c                           |    2 +-
 src/backend/postmaster/Makefile                    |    2 +-
 src/backend/postmaster/autovacuum.c                |    1 +
 src/backend/postmaster/bgworker.c                  |    2 +-
 src/backend/postmaster/bgwriter.c                  |    1 +
 src/backend/postmaster/checkpointer.c              |    1 +
 src/backend/postmaster/pgarch.c                    |    1 +
 src/backend/postmaster/postmaster.c                |    1 +
 src/backend/postmaster/syslogger.c                 |    2 +-
 src/backend/postmaster/walwriter.c                 |    2 +-
 src/backend/replication/basebackup.c               |    1 +
 .../libpqwalreceiver/libpqwalreceiver.c            |    2 +-
 src/backend/replication/logical/launcher.c         |    2 +-
 src/backend/replication/logical/origin.c           |    3 +-
 src/backend/replication/logical/reorderbuffer.c    |    2 +-
 src/backend/replication/logical/snapbuild.c        |    2 +-
 src/backend/replication/logical/tablesync.c        |    6 +-
 src/backend/replication/logical/worker.c           |    7 +-
 src/backend/replication/slot.c                     |    2 +-
 src/backend/replication/syncrep.c                  |    2 +-
 src/backend/replication/walreceiver.c              |    2 +-
 src/backend/replication/walsender.c                |    2 +-
 src/backend/statmon/Makefile                       |   17 +
 src/backend/statmon/bestatus.c                     | 1756 ++++++++++++++++++++
 src/backend/{postmaster => statmon}/pgstat.c       | 1727 +------------------
 src/backend/storage/buffer/bufmgr.c                |    1 +
 src/backend/storage/file/buffile.c                 |    2 +-
 src/backend/storage/file/copydir.c                 |    2 +-
 src/backend/storage/file/fd.c                      |    1 +
 src/backend/storage/ipc/dsm_impl.c                 |    2 +-
 src/backend/storage/ipc/latch.c                    |    2 +-
 src/backend/storage/ipc/procarray.c                |    2 +-
 src/backend/storage/ipc/shm_mq.c                   |    2 +-
 src/backend/storage/ipc/standby.c                  |    2 +-
 src/backend/storage/lmgr/deadlock.c                |    1 +
 src/backend/storage/lmgr/lwlock.c                  |    2 +-
 src/backend/storage/lmgr/predicate.c               |    2 +-
 src/backend/storage/lmgr/proc.c                    |    2 +-
 src/backend/storage/smgr/md.c                      |    2 +-
 src/backend/tcop/postgres.c                        |    1 +
 src/backend/utils/adt/misc.c                       |    2 +-
 src/backend/utils/adt/pgstatfuncs.c                |    1 +
 src/backend/utils/cache/relmapper.c                |    2 +-
 src/backend/utils/init/miscinit.c                  |    2 +-
 src/backend/utils/init/postinit.c                  |    4 +
 src/backend/utils/misc/guc.c                       |    1 +
 src/include/bestatus.h                             |  544 ++++++
 src/include/pgstat.h                               |  514 +-----
 73 files changed, 2441 insertions(+), 2255 deletions(-)
 create mode 100644 src/backend/statmon/Makefile
 create mode 100644 src/backend/statmon/bestatus.c
 rename src/backend/{postmaster => statmon}/pgstat.c (70%)
 create mode 100644 src/include/bestatus.h

diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index 45a5a26337..6296401b25 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -30,10 +30,10 @@
 
 #include "access/heapam.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "catalog/pg_class.h"
 #include "catalog/pg_type.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/bgworker.h"
 #include "storage/buf_internals.h"
 #include "storage/dsm.h"
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index f177ebaa2c..188d034387 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -62,6 +62,7 @@
 #include <unistd.h>
 
 #include "access/hash.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 239d220c24..1ea71245df 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -15,11 +15,11 @@
 #include "postgres_fdw.h"
 
 #include "access/htup_details.h"
+#include "bestatus.h"
 #include "catalog/pg_user_mapping.h"
 #include "access/xact.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/latch.h"
 #include "utils/hsearch.h"
 #include "utils/inval.h"
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 478a96db9b..cc511672c9 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -20,7 +20,7 @@ include $(top_builddir)/src/Makefile.global
 SUBDIRS = access bootstrap catalog parser commands executor foreign lib libpq \
     main nodes optimizer partitioning port postmaster \
     regex replication rewrite \
-    statistics storage tcop tsearch utils $(top_builddir)/src/timezone \
+    statistics statmon storage tcop tsearch utils $(top_builddir)/src/timezone \
     jit
 
 include $(srcdir)/common.mk
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index f6b0f1b093..ef40a2e7a2 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -115,12 +115,12 @@
 #include "access/xact.h"
 #include "access/xloginsert.h"
 
+#include "bestatus.h"
+
 #include "catalog/catalog.h"
 
 #include "lib/ilist.h"
 
-#include "pgstat.h"
-
 #include "replication/logical.h"
 #include "replication/slot.h"
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index c09eb6eff8..189db9b8fd 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -44,6 +44,7 @@
 #include "access/transam.h"
 #include "access/visibilitymap.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/storage.h"
 #include "commands/dbcommands.h"
 #include "commands/progress.h"
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 98917de2ef..69cd211369 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -22,10 +22,10 @@
 #include "access/nbtxlog.h"
 #include "access/relscan.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
-#include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "storage/condition_variable.h"
 #include "storage/indexfsm.h"
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 5cc3cf57e2..a0173c19a8 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -64,9 +64,9 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "bestatus.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/smgr.h"
 #include "tcop/tcopprot.h"        /* pgrminclude ignore */
 #include "utils/rel.h"
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index aa089d83fa..cf034ba333 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -38,8 +38,8 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "pg_trace.h"
 #include "storage/proc.h"
 
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 9c55c20d6b..26d30b8853 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -19,6 +19,7 @@
 #include "access/session.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/pg_enum.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
@@ -29,7 +30,6 @@
 #include "libpq/pqmq.h"
 #include "miscadmin.h"
 #include "optimizer/planmain.h"
-#include "pgstat.h"
 #include "storage/ipc.h"
 #include "storage/sinval.h"
 #include "storage/spin.h"
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 3623352b9c..a28fe474aa 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -54,7 +54,7 @@
 #include "access/slru.h"
 #include "access/transam.h"
 #include "access/xlog.h"
-#include "pgstat.h"
+#include "bestatus.h"
 #include "storage/fd.h"
 #include "storage/shmem.h"
 #include "miscadmin.h"
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index c96c8b60ba..bbe9c0eb5f 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -38,7 +38,7 @@
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogdefs.h"
-#include "pgstat.h"
+#include "bestatus.h"
 #include "storage/fd.h"
 
 /*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 9a8a6bb119..9bbb1952ac 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -87,6 +87,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "access/xlogreader.h"
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "catalog/storage.h"
 #include "funcapi.h"
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 18467d96d2..c40ac790b0 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -30,6 +30,7 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
 #include "catalog/storage.h"
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9e45581d89..4b4e3d07ac 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -36,6 +36,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
 #include "catalog/pg_database.h"
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index b35043bf71..683c41575f 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -23,9 +23,9 @@
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
 #include "catalog/pg_type.h"
+#include "bestatus.h"
 #include "funcapi.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/walreceiver.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 10a663bae6..53fa4890e9 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -23,8 +23,8 @@
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/smgr.h"
 #include "utils/guc.h"
 #include "utils/hsearch.h"
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index df926d8dea..fca62770ac 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -22,6 +22,7 @@
 #include "access/htup_details.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/index.h"
 #include "catalog/pg_collation.h"
@@ -329,9 +330,6 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case BgWriterProcess:
                 statmsg = pgstat_get_backend_desc(B_BG_WRITER);
                 break;
-            case ArchiverProcess:
-                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
-                break;
             case CheckpointerProcess:
                 statmsg = pgstat_get_backend_desc(B_CHECKPOINTER);
                 break;
@@ -341,6 +339,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case WalReceiverProcess:
                 statmsg = pgstat_get_backend_desc(B_WAL_RECEIVER);
                 break;
+            case ArchiverProcess:
+                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
+                break;
             default:
                 statmsg = "??? process";
                 break;
@@ -417,6 +418,7 @@ AuxiliaryProcessMain(int argc, char *argv[])
         CreateAuxProcessResourceOwner();
 
         /* Initialize backend status information */
+        pgstat_bearray_initialize();
         pgstat_initialize();
         pgstat_bestart();
 
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index d6cfd28ddc..a8d29d2d33 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -48,7 +48,7 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
-#include "pgstat.h"
+#include "bestatus.h"
 
 /*
  * Magic numbers for parallel executor communication.  We use constants
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index cd20abc141..3ad7238b5a 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -41,6 +41,7 @@
 #include "access/relscan.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "bestatus.h"
 #include "executor/execdebug.h"
 #include "executor/nodeBitmapHeapscan.h"
 #include "miscadmin.h"
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index 70a4e90a05..02d58c463c 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -32,6 +32,7 @@
 
 #include "access/relscan.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "executor/execdebug.h"
 #include "executor/execParallel.h"
 #include "executor/nodeGather.h"
@@ -39,7 +40,6 @@
 #include "executor/tqueue.h"
 #include "miscadmin.h"
 #include "optimizer/planmain.h"
-#include "pgstat.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 856daf6a7f..5a47eb4601 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -28,6 +28,7 @@
 
 #include "access/htup_details.h"
 #include "access/parallel.h"
+#include "bestatus.h"
 #include "catalog/pg_statistic.h"
 #include "commands/tablespace.h"
 #include "executor/execdebug.h"
@@ -35,7 +36,6 @@
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "port/atomics.h"
 #include "utils/dynahash.h"
 #include "utils/memutils.h"
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 2098708864..898a7916b0 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -108,12 +108,12 @@
 
 #include "access/htup_details.h"
 #include "access/parallel.h"
+#include "bestatus.h"
 #include "executor/executor.h"
 #include "executor/hashjoin.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "utils/memutils.h"
 #include "utils/sharedtuplestore.h"
 
diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c
index 789a975409..de15e0907f 100644
--- a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c
@@ -36,9 +36,9 @@
 #include <openssl/ec.h>
 #endif
 
+#include "bestatus.h"
 #include "libpq/libpq.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/latch.h"
 #include "tcop/tcopprot.h"
diff --git a/src/backend/libpq/be-secure.c b/src/backend/libpq/be-secure.c
index a7def3168d..fa1cf6cffa 100644
--- a/src/backend/libpq/be-secure.c
+++ b/src/backend/libpq/be-secure.c
@@ -29,9 +29,9 @@
 #include <arpa/inet.h>
 #endif
 
+#include "bestatus.h"
 #include "libpq/libpq.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "tcop/tcopprot.h"
 #include "utils/memutils.h"
 #include "storage/ipc.h"
diff --git a/src/backend/libpq/pqmq.c b/src/backend/libpq/pqmq.c
index a9bd47d937..f79a70d6fe 100644
--- a/src/backend/libpq/pqmq.c
+++ b/src/backend/libpq/pqmq.c
@@ -13,11 +13,11 @@
 
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
 #include "libpq/pqmq.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c23211b2..311e63017d 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
-    pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+    pgarch.o postmaster.o startup.o syslogger.o walwriter.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index a69ea230fb..b1c723bf1c 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -71,6 +71,7 @@
 #include "access/reloptions.h"
 #include "access/transam.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "catalog/dependency.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index f5db5a8c4a..7d7d55ef1a 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -16,8 +16,8 @@
 
 #include "libpq/pqsignal.h"
 #include "access/parallel.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "port/atomics.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/postmaster.h"
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 3fb6badea8..c820d35fbc 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -40,6 +40,7 @@
 
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index d58193774e..b592560dd2 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -43,6 +43,7 @@
 
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 18bd8296b8..2a7c4fd1b1 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -35,6 +35,7 @@
 
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index dd293a79f0..d3ec828657 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -95,6 +95,7 @@
 
 #include "access/transam.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/pg_control.h"
 #include "common/file_perm.h"
diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index d1ea46deb8..3accdf7bcf 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -31,11 +31,11 @@
 #include <sys/stat.h>
 #include <sys/time.h>
 
+#include "bestatus.h"
 #include "lib/stringinfo.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "nodes/pg_list.h"
-#include "pgstat.h"
 #include "pgtime.h"
 #include "postmaster/fork_process.h"
 #include "postmaster/postmaster.h"
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index a6fdba3f41..0de04159d5 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -45,9 +45,9 @@
 #include <unistd.h>
 
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/walwriter.h"
 #include "storage/bufmgr.h"
 #include "storage/condition_variable.h"
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 58ba33e822..a567aacf73 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -17,6 +17,7 @@
 #include <time.h>
 
 #include "access/xlog_internal.h"    /* for pg_start/stop_backup */
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "common/file_perm.h"
 #include "lib/stringinfo.h"
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 7027737e67..75a3208f74 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -22,11 +22,11 @@
 #include "libpq-fe.h"
 #include "pqexpbuffer.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "funcapi.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/walreceiver.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 2b0d889c3b..ab967d7d65 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -19,7 +19,7 @@
 
 #include "funcapi.h"
 #include "miscadmin.h"
-#include "pgstat.h"
+#include "bestatus.h"
 
 #include "access/heapam.h"
 #include "access/htup.h"
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index ca51318dbb..a5685b8e7e 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -77,13 +77,12 @@
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/xact.h"
-
+#include "bestatus.h"
 #include "catalog/indexing.h"
 #include "nodes/execnodes.h"
 
 #include "replication/origin.h"
 #include "replication/logical.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b79ce5db95..b90768be86 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -61,10 +61,10 @@
 #include "access/tuptoaster.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "lib/binaryheap.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/logical.h"
 #include "replication/reorderbuffer.h"
 #include "replication/slot.h"
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 4053482420..a30f1e012e 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -126,7 +126,7 @@
 #include "access/transam.h"
 #include "access/xact.h"
 
-#include "pgstat.h"
+#include "bestatus.h"
 
 #include "replication/logical.h"
 #include "replication/reorderbuffer.h"
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 9408f87614..fafef0578a 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -86,26 +86,28 @@
 #include "postgres.h"
 
 #include "miscadmin.h"
-#include "pgstat.h"
 
 #include "access/heapam.h"
 #include "access/xact.h"
 
+#include "bestatus.h"
+
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 
 #include "commands/copy.h"
 
 #include "parser/parse_relation.h"
+#include "pgstat.h"
 
 #include "replication/logicallauncher.h"
 #include "replication/logicalrelation.h"
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 
-#include "utils/snapmgr.h"
 #include "storage/ipc.h"
 
+#include "utils/snapmgr.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index e4e2ad7b39..d8d7b35058 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -31,6 +31,8 @@
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 
+#include "bestatus.h"
+
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_subscription.h"
@@ -42,17 +44,20 @@
 #include "executor/executor.h"
 #include "executor/nodeModifyTable.h"
 
+#include "funcapi.h"
+
 #include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
 
 #include "mb/pg_wchar.h"
+#include "miscadmin.h"
 
 #include "nodes/makefuncs.h"
 
 #include "optimizer/planner.h"
 
 #include "parser/parse_relation.h"
-
+#include "pgstat.h"
 #include "postmaster/bgworker.h"
 #include "postmaster/postmaster.h"
 #include "postmaster/walwriter.h"
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 33b23b6b6d..c60e69302a 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -41,9 +41,9 @@
 
 #include "access/transam.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "common/string.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/slot.h"
 #include "storage/fd.h"
 #include "storage/proc.h"
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 6c160c13c6..02ec91d98e 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -75,8 +75,8 @@
 #include <unistd.h>
 
 #include "access/xact.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/syncrep.h"
 #include "replication/walsender.h"
 #include "replication/walsender_private.h"
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 2e90944ad5..bdca25499d 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -50,6 +50,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
@@ -57,7 +58,6 @@
 #include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "storage/ipc.h"
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 2d2eb23eb7..3de752bd4c 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -56,6 +56,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
 
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "commands/dbcommands.h"
@@ -65,7 +66,6 @@
 #include "libpq/pqformat.h"
 #include "miscadmin.h"
 #include "nodes/replnodes.h"
-#include "pgstat.h"
 #include "replication/basebackup.h"
 #include "replication/decode.h"
 #include "replication/logical.h"
diff --git a/src/backend/statmon/Makefile b/src/backend/statmon/Makefile
new file mode 100644
index 0000000000..64a04878e3
--- /dev/null
+++ b/src/backend/statmon/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for src/backend/statmon
+#
+# IDENTIFICATION
+#    src/backend/statmon/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/statmon
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = pgstat.o bestatus.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/statmon/bestatus.c b/src/backend/statmon/bestatus.c
new file mode 100644
index 0000000000..882a21c89a
--- /dev/null
+++ b/src/backend/statmon/bestatus.c
@@ -0,0 +1,1756 @@
+/* ----------
+ * bestatus.c
+ *
+ *    Backend status monitor
+ *
+ *    Status data is stored in shared memory. Every backend updates and reads
+ *    it individually.
+ *
+ *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
+ *
+ *    src/backend/statmon/bestatus.c
+ * ----------
+ */
+#include "postgres.h"
+
+#include "bestatus.h"
+
+#include "access/xact.h"
+#include "libpq/libpq.h"
+#include "miscadmin.h"
+#include "postmaster/autovacuum.h"
+#include "replication/walsender.h"
+#include "storage/ipc.h"
+#include "storage/lmgr.h"
+#include "storage/sinvaladt.h"
+#include "utils/ascii.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/probes.h"
+
+
+/* Status for backends including auxiliary */
+static LocalPgBackendStatus *localBackendStatusTable = NULL;
+
+/* Total number of backends including auxiliary */
+static int    localNumBackends = 0;
+
+/* ----------
+ * Total number of backends including auxiliary
+ *
+ * We reserve a slot for each possible BackendId, plus one for each
+ * possible auxiliary process type.  (This scheme assumes there is not
+ * more than one of any auxiliary process type at a time.) MaxBackends
+ * includes autovacuum workers and background workers as well.
+ * ----------
+ */
+#define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
+
+
+/* ----------
+ * GUC parameters
+ * ----------
+ */
+bool        pgstat_track_activities = false;
+int            pgstat_track_activity_query_size = 1024;
+
+static MemoryContext pgBeStatLocalContext = NULL;
+
+/* ------------------------------------------------------------
+ * Functions for management of the shared-memory PgBackendStatus array
+ * ------------------------------------------------------------
+ */
+
+static PgBackendStatus *BackendStatusArray = NULL;
+static PgBackendStatus *MyBEEntry = NULL;
+static char *BackendAppnameBuffer = NULL;
+static char *BackendClientHostnameBuffer = NULL;
+static char *BackendActivityBuffer = NULL;
+static Size BackendActivityBufferSize = 0;
+#ifdef USE_SSL
+static PgBackendSSLStatus *BackendSslStatusBuffer = NULL;
+#endif
+
+static const char *pgstat_get_wait_activity(WaitEventActivity w);
+static const char *pgstat_get_wait_client(WaitEventClient w);
+static const char *pgstat_get_wait_ipc(WaitEventIPC w);
+static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
+static const char *pgstat_get_wait_io(WaitEventIO w);
+static void pgstat_setup_memcxt(void);
+static void pgstat_beshutdown_hook(int code, Datum arg);
+/*
+ * Report shared-memory space needed by CreateSharedBackendStatus.
+ */
+Size
+BackendStatusShmemSize(void)
+{
+    Size        size;
+
+    /* BackendStatusArray: */
+    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
+    /* BackendAppnameBuffer: */
+    size = add_size(size,
+                    mul_size(NAMEDATALEN, NumBackendStatSlots));
+    /* BackendClientHostnameBuffer: */
+    size = add_size(size,
+                    mul_size(NAMEDATALEN, NumBackendStatSlots));
+    /* BackendActivityBuffer: */
+    size = add_size(size,
+                    mul_size(pgstat_track_activity_query_size, NumBackendStatSlots));
+#ifdef USE_SSL
+    /* BackendSslStatusBuffer: */
+    size = add_size(size,
+                    mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots));
+#endif
+    return size;
+}
+
+/*
+ * Initialize the shared status array and several string buffers
+ * during postmaster startup.
+ */
+void
+CreateSharedBackendStatus(void)
+{
+    Size        size;
+    bool        found;
+    int            i;
+    char       *buffer;
+
+    /* Create or attach to the shared array */
+    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
+    BackendStatusArray = (PgBackendStatus *)
+        ShmemInitStruct("Backend Status Array", size, &found);
+
+    if (!found)
+    {
+        /*
+         * We're the first - initialize.
+         */
+        MemSet(BackendStatusArray, 0, size);
+    }
+
+    /* Create or attach to the shared appname buffer */
+    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
+    BackendAppnameBuffer = (char *)
+        ShmemInitStruct("Backend Application Name Buffer", size, &found);
+
+    if (!found)
+    {
+        MemSet(BackendAppnameBuffer, 0, size);
+
+        /* Initialize st_appname pointers. */
+        buffer = BackendAppnameBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_appname = buffer;
+            buffer += NAMEDATALEN;
+        }
+    }
+
+    /* Create or attach to the shared client hostname buffer */
+    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
+    BackendClientHostnameBuffer = (char *)
+        ShmemInitStruct("Backend Client Host Name Buffer", size, &found);
+
+    if (!found)
+    {
+        MemSet(BackendClientHostnameBuffer, 0, size);
+
+        /* Initialize st_clienthostname pointers. */
+        buffer = BackendClientHostnameBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_clienthostname = buffer;
+            buffer += NAMEDATALEN;
+        }
+    }
+
+    /* Create or attach to the shared activity buffer */
+    BackendActivityBufferSize = mul_size(pgstat_track_activity_query_size,
+                                         NumBackendStatSlots);
+    BackendActivityBuffer = (char *)
+        ShmemInitStruct("Backend Activity Buffer",
+                        BackendActivityBufferSize,
+                        &found);
+
+    if (!found)
+    {
+        MemSet(BackendActivityBuffer, 0, BackendActivityBufferSize);
+
+        /* Initialize st_activity pointers. */
+        buffer = BackendActivityBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_activity_raw = buffer;
+            buffer += pgstat_track_activity_query_size;
+        }
+    }
+
+#ifdef USE_SSL
+    /* Create or attach to the shared SSL status buffer */
+    size = mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots);
+    BackendSslStatusBuffer = (PgBackendSSLStatus *)
+        ShmemInitStruct("Backend SSL Status Buffer", size, &found);
+
+    if (!found)
+    {
+        PgBackendSSLStatus *ptr;
+
+        MemSet(BackendSslStatusBuffer, 0, size);
+
+        /* Initialize st_sslstatus pointers. */
+        ptr = BackendSslStatusBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_sslstatus = ptr;
+            ptr++;
+        }
+    }
+#endif
+}
+
+/* ----------
+ * pgstat_bearray_initialize() -
+ *
+ *    Initialize backend status state, and set up our on-proc-exit hook.
+ *    Called from InitPostgres and AuxiliaryProcessMain. For an auxiliary
+ *    process, MyBackendId is invalid. Otherwise, MyBackendId must be set,
+ *    but we must not have started any transaction yet (since the
+ *    exit hook must run after the last transaction exit).
+ *    NOTE: MyDatabaseId isn't set yet; so the shutdown hook has to be careful.
+ * ----------
+ */
+void
+pgstat_bearray_initialize(void)
+{
+    /* Initialize MyBEEntry */
+    if (MyBackendId != InvalidBackendId)
+    {
+        Assert(MyBackendId >= 1 && MyBackendId <= MaxBackends);
+        MyBEEntry = &BackendStatusArray[MyBackendId - 1];
+    }
+    else
+    {
+        /* Must be an auxiliary process */
+        Assert(MyAuxProcType != NotAnAuxProcess);
+
+        /*
+         * Assign the MyBEEntry for an auxiliary process.  Since it doesn't
+         * have a BackendId, the slot is statically allocated based on the
+         * auxiliary process type (MyAuxProcType).  Backends use slots with
+         * 1-based indexes from 1 to MaxBackends (inclusive), so an auxiliary
+         * process uses 1-based index MaxBackends + MyAuxProcType + 1, which
+         * is array element MaxBackends + MyAuxProcType.
+         */
+        MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
+    }
+
+    /* Set up a process-exit hook to clean up */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
+}
+
+/*
+ * Shut down a single backend's status reporting at process exit.
+ *
+ * This hook only clears out our entry in the PgBackendStatus array: it
+ * sets st_procpid to 0, which marks the slot invalid so that readers
+ * skip it.
+ *
+ * It does not flush any remaining statistics counts.
+ */
+static void
+pgstat_beshutdown_hook(int code, Datum arg)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    /*
+     * Clear my status entry, following the protocol of bumping st_changecount
+     * before and after.  We use a volatile pointer here to ensure the
+     * compiler doesn't try to get cute.
+     */
+    pgstat_increment_changecount_before(beentry);
+
+    beentry->st_procpid = 0;    /* mark invalid */
+
+    pgstat_increment_changecount_after(beentry);
+}
+
+
+/* ----------
+ * pgstat_bestart() -
+ *
+ *    Initialize this backend's entry in the PgBackendStatus array.
+ *    Called from InitPostgres.
+ *
+ *    Apart from auxiliary processes, MyBackendId, MyDatabaseId,
+ *    session userid, and application_name must be set for a
+ *    backend (hence, this cannot be combined with pgstat_bearray_initialize).
+ * ----------
+ */
+void
+pgstat_bestart(void)
+{
+    SockAddr    clientaddr;
+    volatile PgBackendStatus *beentry;
+
+    /*
+     * To minimize the time spent modifying the PgBackendStatus entry, fetch
+     * all the needed data first.
+     */
+
+    /*
+     * We may not have a MyProcPort (eg, if this is the autovacuum process).
+     * If so, use all-zeroes client address, which is dealt with specially in
+     * pg_stat_get_backend_client_addr and pg_stat_get_backend_client_port.
+     */
+    if (MyProcPort)
+        memcpy(&clientaddr, &MyProcPort->raddr, sizeof(clientaddr));
+    else
+        MemSet(&clientaddr, 0, sizeof(clientaddr));
+
+    /*
+     * Initialize my status entry, following the protocol of bumping
+     * st_changecount before and after; and make sure it's even afterwards. We
+     * use a volatile pointer here to ensure the compiler doesn't try to get
+     * cute.
+     */
+    beentry = MyBEEntry;
+
+    /* MyBEEntry must have been set up by pgstat_bearray_initialize() */
+    Assert(beentry != NULL);
+
+    if (MyBackendId != InvalidBackendId)
+    {
+        if (IsAutoVacuumLauncherProcess())
+        {
+            /* Autovacuum Launcher */
+            beentry->st_backendType = B_AUTOVAC_LAUNCHER;
+        }
+        else if (IsAutoVacuumWorkerProcess())
+        {
+            /* Autovacuum Worker */
+            beentry->st_backendType = B_AUTOVAC_WORKER;
+        }
+        else if (am_walsender)
+        {
+            /* Wal sender */
+            beentry->st_backendType = B_WAL_SENDER;
+        }
+        else if (IsBackgroundWorker)
+        {
+            /* bgworker */
+            beentry->st_backendType = B_BG_WORKER;
+        }
+        else
+        {
+            /* client-backend */
+            beentry->st_backendType = B_BACKEND;
+        }
+    }
+    else
+    {
+        /* Must be an auxiliary process */
+        Assert(MyAuxProcType != NotAnAuxProcess);
+        switch (MyAuxProcType)
+        {
+            case StartupProcess:
+                beentry->st_backendType = B_STARTUP;
+                break;
+            case BgWriterProcess:
+                beentry->st_backendType = B_BG_WRITER;
+                break;
+            case CheckpointerProcess:
+                beentry->st_backendType = B_CHECKPOINTER;
+                break;
+            case WalWriterProcess:
+                beentry->st_backendType = B_WAL_WRITER;
+                break;
+            case WalReceiverProcess:
+                beentry->st_backendType = B_WAL_RECEIVER;
+                break;
+            case ArchiverProcess:
+                beentry->st_backendType = B_ARCHIVER;
+                break;
+            default:
+                elog(FATAL, "unrecognized process type: %d",
+                     (int) MyAuxProcType);
+                proc_exit(1);
+        }
+    }
+
+    do
+    {
+        pgstat_increment_changecount_before(beentry);
+    } while ((beentry->st_changecount & 1) == 0);
+
+    beentry->st_procpid = MyProcPid;
+    beentry->st_proc_start_timestamp = MyStartTimestamp;
+    beentry->st_activity_start_timestamp = 0;
+    beentry->st_state_start_timestamp = 0;
+    beentry->st_xact_start_timestamp = 0;
+    beentry->st_databaseid = MyDatabaseId;
+
+    /* We have userid for client-backends, wal-sender and bgworker processes */
+    if (beentry->st_backendType == B_BACKEND
+        || beentry->st_backendType == B_WAL_SENDER
+        || beentry->st_backendType == B_BG_WORKER)
+        beentry->st_userid = GetSessionUserId();
+    else
+        beentry->st_userid = InvalidOid;
+
+    beentry->st_clientaddr = clientaddr;
+    if (MyProcPort && MyProcPort->remote_hostname)
+        strlcpy(beentry->st_clienthostname, MyProcPort->remote_hostname,
+                NAMEDATALEN);
+    else
+        beentry->st_clienthostname[0] = '\0';
+#ifdef USE_SSL
+    if (MyProcPort && MyProcPort->ssl != NULL)
+    {
+        beentry->st_ssl = true;
+        beentry->st_sslstatus->ssl_bits = be_tls_get_cipher_bits(MyProcPort);
+        beentry->st_sslstatus->ssl_compression = be_tls_get_compression(MyProcPort);
+        strlcpy(beentry->st_sslstatus->ssl_version, be_tls_get_version(MyProcPort), NAMEDATALEN);
+        strlcpy(beentry->st_sslstatus->ssl_cipher, be_tls_get_cipher(MyProcPort), NAMEDATALEN);
+        be_tls_get_peerdn_name(MyProcPort, beentry->st_sslstatus->ssl_clientdn, NAMEDATALEN);
+    }
+    else
+    {
+        beentry->st_ssl = false;
+    }
+#else
+    beentry->st_ssl = false;
+#endif
+    beentry->st_state = STATE_UNDEFINED;
+    beentry->st_appname[0] = '\0';
+    beentry->st_activity_raw[0] = '\0';
+    /* Also make sure the last byte in each string area is always 0 */
+    beentry->st_clienthostname[NAMEDATALEN - 1] = '\0';
+    beentry->st_appname[NAMEDATALEN - 1] = '\0';
+    beentry->st_activity_raw[pgstat_track_activity_query_size - 1] = '\0';
+    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
+    beentry->st_progress_command_target = InvalidOid;
+
+    /*
+     * we don't zero st_progress_param here to save cycles; nobody should
+     * examine it until st_progress_command has been set to something other
+     * than PROGRESS_COMMAND_INVALID
+     */
+
+    pgstat_increment_changecount_after(beentry);
+
+    /* Update app name to current GUC setting */
+    if (application_name)
+        pgstat_report_appname(application_name);
+}
+
+/* ----------
+ * pgstat_read_current_status() -
+ *
+ *    Copy the current contents of the PgBackendStatus array to local memory,
+ *    if not already done in this transaction.
+ * ----------
+ */
+static void
+pgstat_read_current_status(void)
+{
+    volatile PgBackendStatus *beentry;
+    LocalPgBackendStatus *localtable;
+    LocalPgBackendStatus *localentry;
+    char       *localappname,
+               *localclienthostname,
+               *localactivity;
+#ifdef USE_SSL
+    PgBackendSSLStatus *localsslstatus;
+#endif
+    int            i;
+
+    Assert(IsUnderPostmaster);
+
+    if (localBackendStatusTable)
+        return;                    /* already done */
+
+    pgstat_setup_memcxt();
+
+    localtable = (LocalPgBackendStatus *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           sizeof(LocalPgBackendStatus) * NumBackendStatSlots);
+    localappname = (char *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           NAMEDATALEN * NumBackendStatSlots);
+    localclienthostname = (char *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           NAMEDATALEN * NumBackendStatSlots);
+    localactivity = (char *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           pgstat_track_activity_query_size * NumBackendStatSlots);
+#ifdef USE_SSL
+    localsslstatus = (PgBackendSSLStatus *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           sizeof(PgBackendSSLStatus) * NumBackendStatSlots);
+#endif
+
+    localNumBackends = 0;
+
+    beentry = BackendStatusArray;
+    localentry = localtable;
+    for (i = 1; i <= NumBackendStatSlots; i++)
+    {
+        /*
+         * Follow the protocol of retrying if st_changecount changes while we
+         * copy the entry, or if it's odd.  (The check for odd is needed to
+         * cover the case where we are able to completely copy the entry while
+         * the source backend is between increment steps.)    We use a volatile
+         * pointer here to ensure the compiler doesn't try to get cute.
+         */
+        for (;;)
+        {
+            int            before_changecount;
+            int            after_changecount;
+
+            pgstat_save_changecount_before(beentry, before_changecount);
+
+            localentry->backendStatus.st_procpid = beentry->st_procpid;
+            if (localentry->backendStatus.st_procpid > 0)
+            {
+                memcpy(&localentry->backendStatus, (char *) beentry, sizeof(PgBackendStatus));
+
+                /*
+                 * strcpy is safe even if the string is modified concurrently,
+                 * because there's always a \0 at the end of the buffer.
+                 */
+                strcpy(localappname, (char *) beentry->st_appname);
+                localentry->backendStatus.st_appname = localappname;
+                strcpy(localclienthostname, (char *) beentry->st_clienthostname);
+                localentry->backendStatus.st_clienthostname = localclienthostname;
+                strcpy(localactivity, (char *) beentry->st_activity_raw);
+                localentry->backendStatus.st_activity_raw = localactivity;
+                localentry->backendStatus.st_ssl = beentry->st_ssl;
+#ifdef USE_SSL
+                if (beentry->st_ssl)
+                {
+                    memcpy(localsslstatus, beentry->st_sslstatus, sizeof(PgBackendSSLStatus));
+                    localentry->backendStatus.st_sslstatus = localsslstatus;
+                }
+#endif
+            }
+
+            pgstat_save_changecount_after(beentry, after_changecount);
+            if (before_changecount == after_changecount &&
+                (before_changecount & 1) == 0)
+                break;
+
+            /* Make sure we can break out of loop if stuck... */
+            CHECK_FOR_INTERRUPTS();
+        }
+
+        beentry++;
+        /* Only valid entries get included into the local array */
+        if (localentry->backendStatus.st_procpid > 0)
+        {
+            BackendIdGetTransactionIds(i,
+                                       &localentry->backend_xid,
+                                       &localentry->backend_xmin);
+
+            localentry++;
+            localappname += NAMEDATALEN;
+            localclienthostname += NAMEDATALEN;
+            localactivity += pgstat_track_activity_query_size;
+#ifdef USE_SSL
+            localsslstatus++;
+#endif
+            localNumBackends++;
+        }
+    }
+
+    /* Set the pointer only after completion of a valid table */
+    localBackendStatusTable = localtable;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_beentry() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    our local copy of the current-activity entry for one backend.
+ *
+ *    NB: caller is responsible for checking whether the user is permitted
+ *    to see this info (especially the querystring).
+ * ----------
+ */
+PgBackendStatus *
+pgstat_fetch_stat_beentry(int beid)
+{
+    pgstat_read_current_status();
+
+    if (beid < 1 || beid > localNumBackends)
+        return NULL;
+
+    return &localBackendStatusTable[beid - 1].backendStatus;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_local_beentry() -
+ *
+ *    Like pgstat_fetch_stat_beentry() but with locally computed additions (like
+ *    xid and xmin values of the backend)
+ *
+ *    NB: caller is responsible for checking whether the user is permitted
+ *    to see this info (especially the querystring).
+ * ----------
+ */
+LocalPgBackendStatus *
+pgstat_fetch_stat_local_beentry(int beid)
+{
+    pgstat_read_current_status();
+
+    if (beid < 1 || beid > localNumBackends)
+        return NULL;
+
+    return &localBackendStatusTable[beid - 1];
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_numbackends() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    the maximum current backend id.
+ * ----------
+ */
+int
+pgstat_fetch_stat_numbackends(void)
+{
+    pgstat_read_current_status();
+
+    return localNumBackends;
+}
+
+/* ----------
+ * pgstat_get_wait_event_type() -
+ *
+ *    Return a string representing the type of wait event the backend is
+ *    currently waiting on.
+ */
+const char *
+pgstat_get_wait_event_type(uint32 wait_event_info)
+{
+    uint32        classId;
+    const char *event_type;
+
+    /* report process as not waiting. */
+    if (wait_event_info == 0)
+        return NULL;
+
+    classId = wait_event_info & 0xFF000000;
+
+    switch (classId)
+    {
+        case PG_WAIT_LWLOCK:
+            event_type = "LWLock";
+            break;
+        case PG_WAIT_LOCK:
+            event_type = "Lock";
+            break;
+        case PG_WAIT_BUFFER_PIN:
+            event_type = "BufferPin";
+            break;
+        case PG_WAIT_ACTIVITY:
+            event_type = "Activity";
+            break;
+        case PG_WAIT_CLIENT:
+            event_type = "Client";
+            break;
+        case PG_WAIT_EXTENSION:
+            event_type = "Extension";
+            break;
+        case PG_WAIT_IPC:
+            event_type = "IPC";
+            break;
+        case PG_WAIT_TIMEOUT:
+            event_type = "Timeout";
+            break;
+        case PG_WAIT_IO:
+            event_type = "IO";
+            break;
+        default:
+            event_type = "???";
+            break;
+    }
+
+    return event_type;
+}
+
+/* ----------
+ * pgstat_get_wait_event() -
+ *
+ *    Return a string representing the wait event the backend is
+ *    currently waiting on.
+ */
+const char *
+pgstat_get_wait_event(uint32 wait_event_info)
+{
+    uint32        classId;
+    uint16        eventId;
+    const char *event_name;
+
+    /* report process as not waiting. */
+    if (wait_event_info == 0)
+        return NULL;
+
+    classId = wait_event_info & 0xFF000000;
+    eventId = wait_event_info & 0x0000FFFF;
+
+    switch (classId)
+    {
+        case PG_WAIT_LWLOCK:
+            event_name = GetLWLockIdentifier(classId, eventId);
+            break;
+        case PG_WAIT_LOCK:
+            event_name = GetLockNameFromTagType(eventId);
+            break;
+        case PG_WAIT_BUFFER_PIN:
+            event_name = "BufferPin";
+            break;
+        case PG_WAIT_ACTIVITY:
+            {
+                WaitEventActivity w = (WaitEventActivity) wait_event_info;
+
+                event_name = pgstat_get_wait_activity(w);
+                break;
+            }
+        case PG_WAIT_CLIENT:
+            {
+                WaitEventClient w = (WaitEventClient) wait_event_info;
+
+                event_name = pgstat_get_wait_client(w);
+                break;
+            }
+        case PG_WAIT_EXTENSION:
+            event_name = "Extension";
+            break;
+        case PG_WAIT_IPC:
+            {
+                WaitEventIPC w = (WaitEventIPC) wait_event_info;
+
+                event_name = pgstat_get_wait_ipc(w);
+                break;
+            }
+        case PG_WAIT_TIMEOUT:
+            {
+                WaitEventTimeout w = (WaitEventTimeout) wait_event_info;
+
+                event_name = pgstat_get_wait_timeout(w);
+                break;
+            }
+        case PG_WAIT_IO:
+            {
+                WaitEventIO w = (WaitEventIO) wait_event_info;
+
+                event_name = pgstat_get_wait_io(w);
+                break;
+            }
+        default:
+            event_name = "unknown wait event";
+            break;
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_activity() -
+ *
+ * Convert WaitEventActivity to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_activity(WaitEventActivity w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_ARCHIVER_MAIN:
+            event_name = "ArchiverMain";
+            break;
+        case WAIT_EVENT_AUTOVACUUM_MAIN:
+            event_name = "AutoVacuumMain";
+            break;
+        case WAIT_EVENT_BGWRITER_HIBERNATE:
+            event_name = "BgWriterHibernate";
+            break;
+        case WAIT_EVENT_BGWRITER_MAIN:
+            event_name = "BgWriterMain";
+            break;
+        case WAIT_EVENT_CHECKPOINTER_MAIN:
+            event_name = "CheckpointerMain";
+            break;
+        case WAIT_EVENT_LOGICAL_APPLY_MAIN:
+            event_name = "LogicalApplyMain";
+            break;
+        case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
+            event_name = "LogicalLauncherMain";
+            break;
+        case WAIT_EVENT_RECOVERY_WAL_ALL:
+            event_name = "RecoveryWalAll";
+            break;
+        case WAIT_EVENT_RECOVERY_WAL_STREAM:
+            event_name = "RecoveryWalStream";
+            break;
+        case WAIT_EVENT_SYSLOGGER_MAIN:
+            event_name = "SysLoggerMain";
+            break;
+        case WAIT_EVENT_WAL_RECEIVER_MAIN:
+            event_name = "WalReceiverMain";
+            break;
+        case WAIT_EVENT_WAL_SENDER_MAIN:
+            event_name = "WalSenderMain";
+            break;
+        case WAIT_EVENT_WAL_WRITER_MAIN:
+            event_name = "WalWriterMain";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_client() -
+ *
+ * Convert WaitEventClient to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_client(WaitEventClient w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_CLIENT_READ:
+            event_name = "ClientRead";
+            break;
+        case WAIT_EVENT_CLIENT_WRITE:
+            event_name = "ClientWrite";
+            break;
+        case WAIT_EVENT_LIBPQWALRECEIVER_CONNECT:
+            event_name = "LibPQWalReceiverConnect";
+            break;
+        case WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE:
+            event_name = "LibPQWalReceiverReceive";
+            break;
+        case WAIT_EVENT_SSL_OPEN_SERVER:
+            event_name = "SSLOpenServer";
+            break;
+        case WAIT_EVENT_WAL_RECEIVER_WAIT_START:
+            event_name = "WalReceiverWaitStart";
+            break;
+        case WAIT_EVENT_WAL_SENDER_WAIT_WAL:
+            event_name = "WalSenderWaitForWAL";
+            break;
+        case WAIT_EVENT_WAL_SENDER_WRITE_DATA:
+            event_name = "WalSenderWriteData";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_ipc() -
+ *
+ * Convert WaitEventIPC to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_ipc(WaitEventIPC w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_BGWORKER_SHUTDOWN:
+            event_name = "BgWorkerShutdown";
+            break;
+        case WAIT_EVENT_BGWORKER_STARTUP:
+            event_name = "BgWorkerStartup";
+            break;
+        case WAIT_EVENT_BTREE_PAGE:
+            event_name = "BtreePage";
+            break;
+        case WAIT_EVENT_CLOG_GROUP_UPDATE:
+            event_name = "ClogGroupUpdate";
+            break;
+        case WAIT_EVENT_EXECUTE_GATHER:
+            event_name = "ExecuteGather";
+            break;
+        case WAIT_EVENT_HASH_BATCH_ALLOCATING:
+            event_name = "Hash/Batch/Allocating";
+            break;
+        case WAIT_EVENT_HASH_BATCH_ELECTING:
+            event_name = "Hash/Batch/Electing";
+            break;
+        case WAIT_EVENT_HASH_BATCH_LOADING:
+            event_name = "Hash/Batch/Loading";
+            break;
+        case WAIT_EVENT_HASH_BUILD_ALLOCATING:
+            event_name = "Hash/Build/Allocating";
+            break;
+        case WAIT_EVENT_HASH_BUILD_ELECTING:
+            event_name = "Hash/Build/Electing";
+            break;
+        case WAIT_EVENT_HASH_BUILD_HASHING_INNER:
+            event_name = "Hash/Build/HashingInner";
+            break;
+        case WAIT_EVENT_HASH_BUILD_HASHING_OUTER:
+            event_name = "Hash/Build/HashingOuter";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING:
+            event_name = "Hash/GrowBatches/Allocating";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_DECIDING:
+            event_name = "Hash/GrowBatches/Deciding";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_ELECTING:
+            event_name = "Hash/GrowBatches/Electing";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_FINISHING:
+            event_name = "Hash/GrowBatches/Finishing";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING:
+            event_name = "Hash/GrowBatches/Repartitioning";
+            break;
+        case WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING:
+            event_name = "Hash/GrowBuckets/Allocating";
+            break;
+        case WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING:
+            event_name = "Hash/GrowBuckets/Electing";
+            break;
+        case WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING:
+            event_name = "Hash/GrowBuckets/Reinserting";
+            break;
+        case WAIT_EVENT_LOGICAL_SYNC_DATA:
+            event_name = "LogicalSyncData";
+            break;
+        case WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE:
+            event_name = "LogicalSyncStateChange";
+            break;
+        case WAIT_EVENT_MQ_INTERNAL:
+            event_name = "MessageQueueInternal";
+            break;
+        case WAIT_EVENT_MQ_PUT_MESSAGE:
+            event_name = "MessageQueuePutMessage";
+            break;
+        case WAIT_EVENT_MQ_RECEIVE:
+            event_name = "MessageQueueReceive";
+            break;
+        case WAIT_EVENT_MQ_SEND:
+            event_name = "MessageQueueSend";
+            break;
+        case WAIT_EVENT_PARALLEL_BITMAP_SCAN:
+            event_name = "ParallelBitmapScan";
+            break;
+        case WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN:
+            event_name = "ParallelCreateIndexScan";
+            break;
+        case WAIT_EVENT_PARALLEL_FINISH:
+            event_name = "ParallelFinish";
+            break;
+        case WAIT_EVENT_PROCARRAY_GROUP_UPDATE:
+            event_name = "ProcArrayGroupUpdate";
+            break;
+        case WAIT_EVENT_PROMOTE:
+            event_name = "Promote";
+            break;
+        case WAIT_EVENT_REPLICATION_ORIGIN_DROP:
+            event_name = "ReplicationOriginDrop";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_DROP:
+            event_name = "ReplicationSlotDrop";
+            break;
+        case WAIT_EVENT_SAFE_SNAPSHOT:
+            event_name = "SafeSnapshot";
+            break;
+        case WAIT_EVENT_SYNC_REP:
+            event_name = "SyncRep";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_timeout() -
+ *
+ * Convert WaitEventTimeout to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_timeout(WaitEventTimeout w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_BASE_BACKUP_THROTTLE:
+            event_name = "BaseBackupThrottle";
+            break;
+        case WAIT_EVENT_PG_SLEEP:
+            event_name = "PgSleep";
+            break;
+        case WAIT_EVENT_RECOVERY_APPLY_DELAY:
+            event_name = "RecoveryApplyDelay";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_io() -
+ *
+ * Convert WaitEventIO to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_io(WaitEventIO w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_BUFFILE_READ:
+            event_name = "BufFileRead";
+            break;
+        case WAIT_EVENT_BUFFILE_WRITE:
+            event_name = "BufFileWrite";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_READ:
+            event_name = "ControlFileRead";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_SYNC:
+            event_name = "ControlFileSync";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE:
+            event_name = "ControlFileSyncUpdate";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_WRITE:
+            event_name = "ControlFileWrite";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE:
+            event_name = "ControlFileWriteUpdate";
+            break;
+        case WAIT_EVENT_COPY_FILE_READ:
+            event_name = "CopyFileRead";
+            break;
+        case WAIT_EVENT_COPY_FILE_WRITE:
+            event_name = "CopyFileWrite";
+            break;
+        case WAIT_EVENT_DATA_FILE_EXTEND:
+            event_name = "DataFileExtend";
+            break;
+        case WAIT_EVENT_DATA_FILE_FLUSH:
+            event_name = "DataFileFlush";
+            break;
+        case WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC:
+            event_name = "DataFileImmediateSync";
+            break;
+        case WAIT_EVENT_DATA_FILE_PREFETCH:
+            event_name = "DataFilePrefetch";
+            break;
+        case WAIT_EVENT_DATA_FILE_READ:
+            event_name = "DataFileRead";
+            break;
+        case WAIT_EVENT_DATA_FILE_SYNC:
+            event_name = "DataFileSync";
+            break;
+        case WAIT_EVENT_DATA_FILE_TRUNCATE:
+            event_name = "DataFileTruncate";
+            break;
+        case WAIT_EVENT_DATA_FILE_WRITE:
+            event_name = "DataFileWrite";
+            break;
+        case WAIT_EVENT_DSM_FILL_ZERO_WRITE:
+            event_name = "DSMFillZeroWrite";
+            break;
+        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ:
+            event_name = "LockFileAddToDataDirRead";
+            break;
+        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC:
+            event_name = "LockFileAddToDataDirSync";
+            break;
+        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE:
+            event_name = "LockFileAddToDataDirWrite";
+            break;
+        case WAIT_EVENT_LOCK_FILE_CREATE_READ:
+            event_name = "LockFileCreateRead";
+            break;
+        case WAIT_EVENT_LOCK_FILE_CREATE_SYNC:
+            event_name = "LockFileCreateSync";
+            break;
+        case WAIT_EVENT_LOCK_FILE_CREATE_WRITE:
+            event_name = "LockFileCreateWrite";
+            break;
+        case WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ:
+            event_name = "LockFileReCheckDataDirRead";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC:
+            event_name = "LogicalRewriteCheckpointSync";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC:
+            event_name = "LogicalRewriteMappingSync";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE:
+            event_name = "LogicalRewriteMappingWrite";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_SYNC:
+            event_name = "LogicalRewriteSync";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE:
+            event_name = "LogicalRewriteTruncate";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_WRITE:
+            event_name = "LogicalRewriteWrite";
+            break;
+        case WAIT_EVENT_RELATION_MAP_READ:
+            event_name = "RelationMapRead";
+            break;
+        case WAIT_EVENT_RELATION_MAP_SYNC:
+            event_name = "RelationMapSync";
+            break;
+        case WAIT_EVENT_RELATION_MAP_WRITE:
+            event_name = "RelationMapWrite";
+            break;
+        case WAIT_EVENT_REORDER_BUFFER_READ:
+            event_name = "ReorderBufferRead";
+            break;
+        case WAIT_EVENT_REORDER_BUFFER_WRITE:
+            event_name = "ReorderBufferWrite";
+            break;
+        case WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ:
+            event_name = "ReorderLogicalMappingRead";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_READ:
+            event_name = "ReplicationSlotRead";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC:
+            event_name = "ReplicationSlotRestoreSync";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_SYNC:
+            event_name = "ReplicationSlotSync";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_WRITE:
+            event_name = "ReplicationSlotWrite";
+            break;
+        case WAIT_EVENT_SLRU_FLUSH_SYNC:
+            event_name = "SLRUFlushSync";
+            break;
+        case WAIT_EVENT_SLRU_READ:
+            event_name = "SLRURead";
+            break;
+        case WAIT_EVENT_SLRU_SYNC:
+            event_name = "SLRUSync";
+            break;
+        case WAIT_EVENT_SLRU_WRITE:
+            event_name = "SLRUWrite";
+            break;
+        case WAIT_EVENT_SNAPBUILD_READ:
+            event_name = "SnapbuildRead";
+            break;
+        case WAIT_EVENT_SNAPBUILD_SYNC:
+            event_name = "SnapbuildSync";
+            break;
+        case WAIT_EVENT_SNAPBUILD_WRITE:
+            event_name = "SnapbuildWrite";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC:
+            event_name = "TimelineHistoryFileSync";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE:
+            event_name = "TimelineHistoryFileWrite";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_READ:
+            event_name = "TimelineHistoryRead";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_SYNC:
+            event_name = "TimelineHistorySync";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_WRITE:
+            event_name = "TimelineHistoryWrite";
+            break;
+        case WAIT_EVENT_TWOPHASE_FILE_READ:
+            event_name = "TwophaseFileRead";
+            break;
+        case WAIT_EVENT_TWOPHASE_FILE_SYNC:
+            event_name = "TwophaseFileSync";
+            break;
+        case WAIT_EVENT_TWOPHASE_FILE_WRITE:
+            event_name = "TwophaseFileWrite";
+            break;
+        case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ:
+            event_name = "WALSenderTimelineHistoryRead";
+            break;
+        case WAIT_EVENT_WAL_BOOTSTRAP_SYNC:
+            event_name = "WALBootstrapSync";
+            break;
+        case WAIT_EVENT_WAL_BOOTSTRAP_WRITE:
+            event_name = "WALBootstrapWrite";
+            break;
+        case WAIT_EVENT_WAL_COPY_READ:
+            event_name = "WALCopyRead";
+            break;
+        case WAIT_EVENT_WAL_COPY_SYNC:
+            event_name = "WALCopySync";
+            break;
+        case WAIT_EVENT_WAL_COPY_WRITE:
+            event_name = "WALCopyWrite";
+            break;
+        case WAIT_EVENT_WAL_INIT_SYNC:
+            event_name = "WALInitSync";
+            break;
+        case WAIT_EVENT_WAL_INIT_WRITE:
+            event_name = "WALInitWrite";
+            break;
+        case WAIT_EVENT_WAL_READ:
+            event_name = "WALRead";
+            break;
+        case WAIT_EVENT_WAL_SYNC:
+            event_name = "WALSync";
+            break;
+        case WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN:
+            event_name = "WALSyncMethodAssign";
+            break;
+        case WAIT_EVENT_WAL_WRITE:
+            event_name = "WALWrite";
+            break;
+
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+
+/* ----------
+ * pgstat_get_backend_current_activity() -
+ *
+ *    Return a string representing the current activity of the backend with
+ *    the specified PID.  This looks directly at the BackendStatusArray,
+ *    and so will provide current information regardless of the age of our
+ *    transaction's snapshot of the status array.
+ *
+ *    It is the caller's responsibility to invoke this only for backends whose
+ *    state is expected to remain stable while the result is in use.  The
+ *    only current use is in deadlock reporting, where we can expect that
+ *    the target backend is blocked on a lock.  (There are corner cases
+ *    where the target's wait could get aborted while we are looking at it,
+ *    but the very worst consequence is to return a pointer to a string
+ *    that's been changed, so we won't worry too much.)
+ *
+ *    Note: return strings for special cases match pg_stat_get_backend_activity.
+ * ----------
+ */
+const char *
+pgstat_get_backend_current_activity(int pid, bool checkUser)
+{
+    PgBackendStatus *beentry;
+    int            i;
+
+    beentry = BackendStatusArray;
+    for (i = 1; i <= MaxBackends; i++)
+    {
+        /*
+         * Although we expect the target backend's entry to be stable, that
+         * doesn't imply that anyone else's is.  To avoid identifying the
+         * wrong backend, while we check for a match to the desired PID we
+         * must follow the protocol of retrying if st_changecount changes
+         * while we examine the entry, or if it's odd.  (This might be
+         * unnecessary, since fetching or storing an int is almost certainly
+         * atomic, but let's play it safe.)  We use a volatile pointer here to
+         * ensure the compiler doesn't try to get cute.
+         */
+        volatile PgBackendStatus *vbeentry = beentry;
+        bool        found;
+
+        for (;;)
+        {
+            int            before_changecount;
+            int            after_changecount;
+
+            pgstat_save_changecount_before(vbeentry, before_changecount);
+
+            found = (vbeentry->st_procpid == pid);
+
+            pgstat_save_changecount_after(vbeentry, after_changecount);
+
+            if (before_changecount == after_changecount &&
+                (before_changecount & 1) == 0)
+                break;
+
+            /* Make sure we can break out of loop if stuck... */
+            CHECK_FOR_INTERRUPTS();
+        }
+
+        if (found)
+        {
+            /* Now it is safe to use the non-volatile pointer */
+            if (checkUser && !superuser() && beentry->st_userid != GetUserId())
+                return "<insufficient privilege>";
+            else if (*(beentry->st_activity_raw) == '\0')
+                return "<command string not enabled>";
+            else
+            {
+                /* this'll leak a bit of memory, but that seems acceptable */
+                return pgstat_clip_activity(beentry->st_activity_raw);
+            }
+        }
+
+        beentry++;
+    }
+
+    /* If we get here, caller is in error ... */
+    return "<backend information not available>";
+}
+
+/* ----------
+ * pgstat_get_crashed_backend_activity() -
+ *
+ *    Return a string representing the current activity of the backend with
+ *    the specified PID.  Like the function above, but reads shared memory with
+ *    the expectation that it may be corrupt.  On success, copy the string
+ *    into the "buffer" argument and return that pointer.  On failure,
+ *    return NULL.
+ *
+ *    This function is only intended to be used by the postmaster to report the
+ *    query that crashed a backend.  In particular, no attempt is made to
+ *    follow the correct concurrency protocol when accessing the
+ *    BackendStatusArray.  But that's OK; in the worst case we'll return a
+ *    corrupted message.  We also must take care not to trip on ereport(ERROR).
+ * ----------
+ */
+const char *
+pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
+{
+    volatile PgBackendStatus *beentry;
+    int            i;
+
+    beentry = BackendStatusArray;
+
+    /*
+     * We probably shouldn't get here before shared memory has been set up,
+     * but be safe.
+     */
+    if (beentry == NULL || BackendActivityBuffer == NULL)
+        return NULL;
+
+    for (i = 1; i <= MaxBackends; i++)
+    {
+        if (beentry->st_procpid == pid)
+        {
+            /* Read pointer just once, so it can't change after validation */
+            const char *activity = beentry->st_activity_raw;
+            const char *activity_last;
+
+            /*
+             * We mustn't access activity string before we verify that it
+             * falls within the BackendActivityBuffer. To make sure that the
+             * entire string including its ending is contained within the
+             * buffer, subtract one activity length from the buffer size.
+             */
+            activity_last = BackendActivityBuffer + BackendActivityBufferSize
+                - pgstat_track_activity_query_size;
+
+            if (activity < BackendActivityBuffer ||
+                activity > activity_last)
+                return NULL;
+
+            /* If no string available, no point in a report */
+            if (activity[0] == '\0')
+                return NULL;
+
+            /*
+             * Copy only ASCII-safe characters so we don't run into encoding
+             * problems when reporting the message; and be sure not to run off
+             * the end of memory.  As only ASCII characters are reported, it
+             * doesn't seem necessary to perform multibyte aware clipping.
+             */
+            ascii_safe_strlcpy(buffer, activity,
+                               Min(buflen, pgstat_track_activity_query_size));
+
+            return buffer;
+        }
+
+        beentry++;
+    }
+
+    /* PID not found */
+    return NULL;
+}
+
+const char *
+pgstat_get_backend_desc(BackendType backendType)
+{
+    const char *backendDesc = "unknown process type";
+
+    switch (backendType)
+    {
+        case B_AUTOVAC_LAUNCHER:
+            backendDesc = "autovacuum launcher";
+            break;
+        case B_AUTOVAC_WORKER:
+            backendDesc = "autovacuum worker";
+            break;
+        case B_BACKEND:
+            backendDesc = "client backend";
+            break;
+        case B_BG_WORKER:
+            backendDesc = "background worker";
+            break;
+        case B_BG_WRITER:
+            backendDesc = "background writer";
+            break;
+        case B_ARCHIVER:
+            backendDesc = "archiver";
+            break;
+        case B_CHECKPOINTER:
+            backendDesc = "checkpointer";
+            break;
+        case B_STARTUP:
+            backendDesc = "startup";
+            break;
+        case B_WAL_RECEIVER:
+            backendDesc = "walreceiver";
+            break;
+        case B_WAL_SENDER:
+            backendDesc = "walsender";
+            break;
+        case B_WAL_WRITER:
+            backendDesc = "walwriter";
+            break;
+    }
+
+    return backendDesc;
+}
+
+/* ----------
+ * pgstat_report_appname() -
+ *
+ *    Called to update our application name.
+ * ----------
+ */
+void
+pgstat_report_appname(const char *appname)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+    int            len;
+
+    if (!beentry)
+        return;
+
+    /* This should be unnecessary if GUC did its job, but be safe */
+    len = pg_mbcliplen(appname, strlen(appname), NAMEDATALEN - 1);
+
+    /*
+     * Update my status entry, following the protocol of bumping
+     * st_changecount before and after.  We use a volatile pointer here to
+     * ensure the compiler doesn't try to get cute.
+     */
+    pgstat_increment_changecount_before(beentry);
+
+    memcpy((char *) beentry->st_appname, appname, len);
+    beentry->st_appname[len] = '\0';
+
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*
+ * Report current transaction start timestamp as the specified value.
+ * Zero means there is no active transaction.
+ */
+void
+pgstat_report_xact_timestamp(TimestampTz tstamp)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    if (!pgstat_track_activities || !beentry)
+        return;
+
+    /*
+     * Update my status entry, following the protocol of bumping
+     * st_changecount before and after.  We use a volatile pointer here to
+     * ensure the compiler doesn't try to get cute.
+     */
+    pgstat_increment_changecount_before(beentry);
+    beentry->st_xact_start_timestamp = tstamp;
+    pgstat_increment_changecount_after(beentry);
+}
+
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ *    Create pgBeStatLocalContext, if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+    if (!pgBeStatLocalContext)
+        pgBeStatLocalContext = AllocSetContextCreate(TopMemoryContext,
+                                                     "Backend status snapshot",
+                                                     ALLOCSET_SMALL_SIZES);
+}
+
+/* ----------
+ * pgstat_bestatus_clear_snapshot() -
+ *
+ *    Discard any data collected in the current transaction.  Any subsequent
+ *    request will cause new snapshots to be read.
+ *
+ *    This is also invoked during transaction commit or abort to discard
+ *    the no-longer-wanted snapshot.
+ * ----------
+ */
+void
+pgstat_bestatus_clear_snapshot(void)
+{
+    /* Release memory, if any was allocated */
+    if (pgBeStatLocalContext)
+        MemoryContextDelete(pgBeStatLocalContext);
+
+    /* Reset variables */
+    pgBeStatLocalContext = NULL;
+    localBackendStatusTable = NULL;
+    localNumBackends = 0;
+}
+
+
+
+/* ----------
+ * pgstat_report_activity() -
+ *
+ *    Called from tcop/postgres.c to report what the backend is actually doing
+ *    (but note cmd_str can be NULL for certain cases).
+ *
+ * All updates of the status entry follow the protocol of bumping
+ * st_changecount before and after.  We use a volatile pointer here to
+ * ensure the compiler doesn't try to get cute.
+ * ----------
+ */
+void
+pgstat_report_activity(BackendState state, const char *cmd_str)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+    TimestampTz start_timestamp;
+    TimestampTz current_timestamp;
+    int            len = 0;
+
+    TRACE_POSTGRESQL_STATEMENT_STATUS(cmd_str);
+
+    if (!beentry)
+        return;
+
+    if (!pgstat_track_activities)
+    {
+        if (beentry->st_state != STATE_DISABLED)
+        {
+            volatile PGPROC *proc = MyProc;
+
+            /*
+             * track_activities is disabled, but we last reported a
+             * non-disabled state.  As our final update, change the state and
+             * clear fields we will not be updating anymore.
+             */
+            pgstat_increment_changecount_before(beentry);
+            beentry->st_state = STATE_DISABLED;
+            beentry->st_state_start_timestamp = 0;
+            beentry->st_activity_raw[0] = '\0';
+            beentry->st_activity_start_timestamp = 0;
+            /* st_xact_start_timestamp and wait_event_info are also disabled */
+            beentry->st_xact_start_timestamp = 0;
+            proc->wait_event_info = 0;
+            pgstat_increment_changecount_after(beentry);
+        }
+        return;
+    }
+
+    /*
+     * To minimize the time spent modifying the entry, fetch all the needed
+     * data first.
+     */
+    start_timestamp = GetCurrentStatementStartTimestamp();
+    if (cmd_str != NULL)
+    {
+        /*
+         * Compute length of to-be-stored string unaware of multi-byte
+         * characters. For speed reasons that'll get corrected on read, rather
+         * than computed every write.
+         */
+        len = Min(strlen(cmd_str), pgstat_track_activity_query_size - 1);
+    }
+    current_timestamp = GetCurrentTimestamp();
+
+    /*
+     * Now update the status entry
+     */
+    pgstat_increment_changecount_before(beentry);
+
+    beentry->st_state = state;
+    beentry->st_state_start_timestamp = current_timestamp;
+
+    if (cmd_str != NULL)
+    {
+        memcpy((char *) beentry->st_activity_raw, cmd_str, len);
+        beentry->st_activity_raw[len] = '\0';
+        beentry->st_activity_start_timestamp = start_timestamp;
+    }
+
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * pgstat_progress_start_command() -
+ *
+ * Set st_progress_command (and st_progress_command_target) in own backend
+ * entry.  Also, zero-initialize st_progress_param array.
+ *-----------
+ */
+void
+pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    if (!beentry || !pgstat_track_activities)
+        return;
+
+    pgstat_increment_changecount_before(beentry);
+    beentry->st_progress_command = cmdtype;
+    beentry->st_progress_command_target = relid;
+    MemSet(&beentry->st_progress_param, 0, sizeof(beentry->st_progress_param));
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * pgstat_progress_update_param() -
+ *
+ * Update index'th member in st_progress_param[] of own backend entry.
+ *-----------
+ */
+void
+pgstat_progress_update_param(int index, int64 val)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    Assert(index >= 0 && index < PGSTAT_NUM_PROGRESS_PARAM);
+
+    if (!beentry || !pgstat_track_activities)
+        return;
+
+    pgstat_increment_changecount_before(beentry);
+    beentry->st_progress_param[index] = val;
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * pgstat_progress_update_multi_param() -
+ *
+ * Update multiple members in st_progress_param[] of own backend entry.
+ * This is atomic; readers won't see intermediate states.
+ *-----------
+ */
+void
+pgstat_progress_update_multi_param(int nparam, const int *index,
+                                   const int64 *val)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+    int            i;
+
+    if (!beentry || !pgstat_track_activities || nparam == 0)
+        return;
+
+    pgstat_increment_changecount_before(beentry);
+
+    for (i = 0; i < nparam; ++i)
+    {
+        Assert(index[i] >= 0 && index[i] < PGSTAT_NUM_PROGRESS_PARAM);
+
+        beentry->st_progress_param[index[i]] = val[i];
+    }
+
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * pgstat_progress_end_command() -
+ *
+ * Reset st_progress_command (and st_progress_command_target) in own backend
+ * entry.  This signals the end of the command.
+ *-----------
+ */
+void
+pgstat_progress_end_command(void)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    if (!beentry)
+        return;
+    if (!pgstat_track_activities
+        && beentry->st_progress_command == PROGRESS_COMMAND_INVALID)
+        return;
+
+    pgstat_increment_changecount_before(beentry);
+    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
+    beentry->st_progress_command_target = InvalidOid;
+    pgstat_increment_changecount_after(beentry);
+}
+
+
+/*
+ * Convert a potentially unsafely truncated activity string (see
+ * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
+ * one.
+ *
+ * The returned string is allocated in the caller's memory context and may be
+ * freed.
+ */
+char *
+pgstat_clip_activity(const char *raw_activity)
+{
+    char       *activity;
+    int            rawlen;
+    int            cliplen;
+
+    /*
+     * Some callers, like pgstat_get_backend_current_activity(), do not
+     * guarantee that the buffer isn't concurrently modified. We try to take
+     * care that the buffer is always terminated by a NUL byte regardless, but
+     * let's still be paranoid about the string's length. In those cases the
+     * underlying buffer is guaranteed to be pgstat_track_activity_query_size
+     * large.
+     */
+    activity = pnstrdup(raw_activity, pgstat_track_activity_query_size - 1);
+
+    /* now double-guaranteed to be NUL terminated */
+    rawlen = strlen(activity);
+
+    /*
+     * All supported server-encodings make it possible to determine the length
+     * of a multi-byte character from its first byte (this is not the case for
+     * client encodings, see GB18030). As st_activity is always stored using
+     * server encoding, this allows us to perform multi-byte aware truncation,
+     * even if the string earlier was truncated in the middle of a multi-byte
+     * character.
+     */
+    cliplen = pg_mbcliplen(activity, rawlen,
+                           pgstat_track_activity_query_size - 1);
+
+    activity[cliplen] = '\0';
+
+    return activity;
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/statmon/pgstat.c
similarity index 70%
rename from src/backend/postmaster/pgstat.c
rename to src/backend/statmon/pgstat.c
index 78f0bbb558..7ec7a454c7 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/statmon/pgstat.c
@@ -8,7 +8,7 @@
  *
  *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
  *
- *    src/backend/postmaster/pgstat.c
+ *    src/backend/statmon/pgstat.c
  * ----------
  */
 #include "postgres.h"
@@ -21,19 +21,14 @@
 #include "access/htup_details.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "libpq/libpq.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "postmaster/autovacuum.h"
-#include "replication/walsender.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/procsignal.h"
-#include "storage/sinvaladt.h"
-#include "utils/ascii.h"
-#include "utils/guc.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
 
@@ -68,26 +63,12 @@ typedef enum
     PGSTAT_ENTRY_LOCK_FAILED
 } pg_stat_table_result_status;
 
-/* ----------
- * Total number of backends including auxiliary
- *
- * We reserve a slot for each possible BackendId, plus one for each
- * possible auxiliary process type.  (This scheme assumes there is not
- * more than one of any auxiliary process type at a time.) MaxBackends
- * includes autovacuum workers and background workers as well.
- * ----------
- */
-#define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
-
-
 /* ----------
  * GUC parameters
  * ----------
  */
-bool        pgstat_track_activities = false;
 bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
-int            pgstat_track_activity_query_size = 1024;
 
 /*
  * This was a GUC parameter and no longer used in this file. But left alone
@@ -131,6 +112,8 @@ static bool pgstat_pending_recoveryconflict = false;
 static bool pgstat_pending_deadlock = false;
 static bool pgstat_pending_tempfile = false;
 
+static MemoryContext pgStatLocalContext = NULL;
+
 /* dshash parameter for each type of table */
 static const dshash_parameters dsh_dbparams = {
     sizeof(Oid),
@@ -242,15 +225,8 @@ typedef struct
 /*
  * Info about current "snapshot" of stats file
  */
-static MemoryContext pgStatLocalContext = NULL;
 static HTAB *pgStatDBHash = NULL;
 
-/* Status for backends including auxiliary */
-static LocalPgBackendStatus *localBackendStatusTable = NULL;
-
-/* Total number of backends including auxiliary */
-static int    localNumBackends = 0;
-
 /*
  * Cluster wide statistics.
  * Contains statistics that are not collected per database or per table.
@@ -286,7 +262,6 @@ static void pgstat_read_db_statsfile(Oid databaseid, dshash_table *tabhash, dsha
 /* functions used in backends */
 static bool backend_snapshot_global_stats(void);
 static PgStat_StatFuncEntry *backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot);
-static void pgstat_read_current_status(void);
 
 static void pgstat_postmaster_shutdown(int code, Datum arg);
 static void pgstat_apply_pending_tabstats(bool shared, bool force,
@@ -313,12 +288,6 @@ static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
 
 static void pgstat_setup_memcxt(void);
 
-static const char *pgstat_get_wait_activity(WaitEventActivity w);
-static const char *pgstat_get_wait_client(WaitEventClient w);
-static const char *pgstat_get_wait_ipc(WaitEventIPC w);
-static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
-static const char *pgstat_get_wait_io(WaitEventIO w);
-
 static bool pgstat_update_tabentry(dshash_table *tabhash,
                                    PgStat_TableStatus *stat, bool nowait);
 static void pgstat_update_dbentry(PgStat_StatDBEntry *dbentry,
@@ -329,6 +298,14 @@ static void pgstat_update_dbentry(PgStat_StatDBEntry *dbentry,
  * ------------------------------------------------------------
  */
 
+
+void
+pgstat_initialize(void)
+{
+    /* Set up a process-exit hook to clean up */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
+}
+
 /*
  * subroutine for pgstat_reset_all
  */
@@ -490,7 +467,7 @@ pgstat_update_stat(bool force)
          */
         TimestampDifference(last_report, now, &secs, &usecs);
         elapsed = secs * 1000 + usecs /1000;
-        
+
         if(elapsed < PGSTAT_STAT_MIN_INTERVAL)
         {
             /* we know we have some statistics */
@@ -746,7 +723,7 @@ pgstat_apply_tabstat(pgstat_apply_tabstat_context *cxt,
             pgStatBlockReadTime = 0;
             pgStatBlockWriteTime = 0;
         }
-        
+
         cxt->tabhash =
             dshash_attach(area, &dsh_tblparams, cxt->dbentry->tables, 0);
     }
@@ -806,7 +783,7 @@ pgstat_merge_tabentry(PgStat_TableStatus *deststat,
         dest->t_blocks_hit += src->t_blocks_hit;
     }
 }
-        
+
 /*
  * pgstat_update_funcstats: subroutine for pgstat_update_stat
  *
@@ -926,7 +903,7 @@ pgstat_update_funcstats(bool force, PgStat_StatDBEntry *dbentry)
                 hash_search(pgStatPendingFunctions,
                             (void *) &(pendent->functionid), HASH_REMOVE, NULL);
             }
-        }    
+        }
 
         /* destroy the hsah if no entry remains */
         if (hash_get_num_entries(pgStatPendingFunctions) == 0)
@@ -1064,7 +1041,7 @@ pgstat_vacuum_stat(void)
     dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, NULL);
     if (!dbentry)
         return;
-    
+
     /*
      * Similarly to above, make a list of all known relations in this DB.
      */
@@ -2621,66 +2598,6 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     return funcentry;
 }
 
-
-/* ----------
- * pgstat_fetch_stat_beentry() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    our local copy of the current-activity entry for one backend.
- *
- *    NB: caller is responsible for a check if the user is permitted to see
- *    this info (especially the querystring).
- * ----------
- */
-PgBackendStatus *
-pgstat_fetch_stat_beentry(int beid)
-{
-    pgstat_read_current_status();
-
-    if (beid < 1 || beid > localNumBackends)
-        return NULL;
-
-    return &localBackendStatusTable[beid - 1].backendStatus;
-}
-
-
-/* ----------
- * pgstat_fetch_stat_local_beentry() -
- *
- *    Like pgstat_fetch_stat_beentry() but with locally computed additions (like
- *    xid and xmin values of the backend)
- *
- *    NB: caller is responsible for a check if the user is permitted to see
- *    this info (especially the querystring).
- * ----------
- */
-LocalPgBackendStatus *
-pgstat_fetch_stat_local_beentry(int beid)
-{
-    pgstat_read_current_status();
-
-    if (beid < 1 || beid > localNumBackends)
-        return NULL;
-
-    return &localBackendStatusTable[beid - 1];
-}
-
-
-/* ----------
- * pgstat_fetch_stat_numbackends() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the maximum current backend id.
- * ----------
- */
-int
-pgstat_fetch_stat_numbackends(void)
-{
-    pgstat_read_current_status();
-
-    return localNumBackends;
-}
-
 /*
  * ---------
  * pgstat_fetch_stat_archiver() -
@@ -2718,364 +2635,6 @@ pgstat_fetch_global(void)
     return snapshot_globalStats;
 }
 
-
-/* ------------------------------------------------------------
- * Functions for management of the shared-memory PgBackendStatus array
- * ------------------------------------------------------------
- */
-
-static PgBackendStatus *BackendStatusArray = NULL;
-static PgBackendStatus *MyBEEntry = NULL;
-static char *BackendAppnameBuffer = NULL;
-static char *BackendClientHostnameBuffer = NULL;
-static char *BackendActivityBuffer = NULL;
-static Size BackendActivityBufferSize = 0;
-#ifdef USE_SSL
-static PgBackendSSLStatus *BackendSslStatusBuffer = NULL;
-#endif
-
-
-/*
- * Report shared-memory space needed by CreateSharedBackendStatus.
- */
-Size
-BackendStatusShmemSize(void)
-{
-    Size        size;
-
-    /* BackendStatusArray: */
-    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
-    /* BackendAppnameBuffer: */
-    size = add_size(size,
-                    mul_size(NAMEDATALEN, NumBackendStatSlots));
-    /* BackendClientHostnameBuffer: */
-    size = add_size(size,
-                    mul_size(NAMEDATALEN, NumBackendStatSlots));
-    /* BackendActivityBuffer: */
-    size = add_size(size,
-                    mul_size(pgstat_track_activity_query_size, NumBackendStatSlots));
-#ifdef USE_SSL
-    /* BackendSslStatusBuffer: */
-    size = add_size(size,
-                    mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots));
-#endif
-    return size;
-}
-
-/*
- * Initialize the shared status array and several string buffers
- * during postmaster startup.
- */
-void
-CreateSharedBackendStatus(void)
-{
-    Size        size;
-    bool        found;
-    int            i;
-    char       *buffer;
-
-    /* Create or attach to the shared array */
-    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
-    BackendStatusArray = (PgBackendStatus *)
-        ShmemInitStruct("Backend Status Array", size, &found);
-
-    if (!found)
-    {
-        /*
-         * We're the first - initialize.
-         */
-        MemSet(BackendStatusArray, 0, size);
-    }
-
-    /* Create or attach to the shared appname buffer */
-    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
-    BackendAppnameBuffer = (char *)
-        ShmemInitStruct("Backend Application Name Buffer", size, &found);
-
-    if (!found)
-    {
-        MemSet(BackendAppnameBuffer, 0, size);
-
-        /* Initialize st_appname pointers. */
-        buffer = BackendAppnameBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_appname = buffer;
-            buffer += NAMEDATALEN;
-        }
-    }
-
-    /* Create or attach to the shared client hostname buffer */
-    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
-    BackendClientHostnameBuffer = (char *)
-        ShmemInitStruct("Backend Client Host Name Buffer", size, &found);
-
-    if (!found)
-    {
-        MemSet(BackendClientHostnameBuffer, 0, size);
-
-        /* Initialize st_clienthostname pointers. */
-        buffer = BackendClientHostnameBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_clienthostname = buffer;
-            buffer += NAMEDATALEN;
-        }
-    }
-
-    /* Create or attach to the shared activity buffer */
-    BackendActivityBufferSize = mul_size(pgstat_track_activity_query_size,
-                                         NumBackendStatSlots);
-    BackendActivityBuffer = (char *)
-        ShmemInitStruct("Backend Activity Buffer",
-                        BackendActivityBufferSize,
-                        &found);
-
-    if (!found)
-    {
-        MemSet(BackendActivityBuffer, 0, BackendActivityBufferSize);
-
-        /* Initialize st_activity pointers. */
-        buffer = BackendActivityBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_activity_raw = buffer;
-            buffer += pgstat_track_activity_query_size;
-        }
-    }
-
-#ifdef USE_SSL
-    /* Create or attach to the shared SSL status buffer */
-    size = mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots);
-    BackendSslStatusBuffer = (PgBackendSSLStatus *)
-        ShmemInitStruct("Backend SSL Status Buffer", size, &found);
-
-    if (!found)
-    {
-        PgBackendSSLStatus *ptr;
-
-        MemSet(BackendSslStatusBuffer, 0, size);
-
-        /* Initialize st_sslstatus pointers. */
-        ptr = BackendSslStatusBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_sslstatus = ptr;
-            ptr++;
-        }
-    }
-#endif
-}
-
-
-/* ----------
- * pgstat_initialize() -
- *
- *    Initialize pgstats state, and set up our on-proc-exit hook.
- *    Called from InitPostgres and AuxiliaryProcessMain. For auxiliary process,
- *    MyBackendId is invalid. Otherwise, MyBackendId must be set,
- *    but we must not have started any transaction yet (since the
- *    exit hook must run after the last transaction exit).
- *    NOTE: MyDatabaseId isn't set yet; so the shutdown hook has to be careful.
- * ----------
- */
-void
-pgstat_initialize(void)
-{
-    /* Initialize MyBEEntry */
-    if (MyBackendId != InvalidBackendId)
-    {
-        Assert(MyBackendId >= 1 && MyBackendId <= MaxBackends);
-        MyBEEntry = &BackendStatusArray[MyBackendId - 1];
-    }
-    else
-    {
-        /* Must be an auxiliary process */
-        Assert(MyAuxProcType != NotAnAuxProcess);
-
-        /*
-         * Assign the MyBEEntry for an auxiliary process.  Since it doesn't
-         * have a BackendId, the slot is statically allocated based on the
-         * auxiliary process type (MyAuxProcType).  Backends use slots indexed
-         * in the range from 1 to MaxBackends (inclusive), so we use
-         * MaxBackends + AuxBackendType + 1 as the index of the slot for an
-         * auxiliary process.
-         */
-        MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
-    }
-
-    /* Set up a process-exit hook to clean up */
-    before_shmem_exit(pgstat_beshutdown_hook, 0);
-}
-
-/* ----------
- * pgstat_bestart() -
- *
- *    Initialize this backend's entry in the PgBackendStatus array.
- *    Called from InitPostgres.
- *
- *    Apart from auxiliary processes, MyBackendId, MyDatabaseId,
- *    session userid, and application_name must be set for a
- *    backend (hence, this cannot be combined with pgstat_initialize).
- * ----------
- */
-void
-pgstat_bestart(void)
-{
-    SockAddr    clientaddr;
-    volatile PgBackendStatus *beentry;
-
-    /*
-     * To minimize the time spent modifying the PgBackendStatus entry, fetch
-     * all the needed data first.
-     */
-
-    /*
-     * We may not have a MyProcPort (eg, if this is the autovacuum process).
-     * If so, use all-zeroes client address, which is dealt with specially in
-     * pg_stat_get_backend_client_addr and pg_stat_get_backend_client_port.
-     */
-    if (MyProcPort)
-        memcpy(&clientaddr, &MyProcPort->raddr, sizeof(clientaddr));
-    else
-        MemSet(&clientaddr, 0, sizeof(clientaddr));
-
-    /*
-     * Initialize my status entry, following the protocol of bumping
-     * st_changecount before and after; and make sure it's even afterwards. We
-     * use a volatile pointer here to ensure the compiler doesn't try to get
-     * cute.
-     */
-    beentry = MyBEEntry;
-
-    /* pgstats state must be initialized from pgstat_initialize() */
-    Assert(beentry != NULL);
-
-    if (MyBackendId != InvalidBackendId)
-    {
-        if (IsAutoVacuumLauncherProcess())
-        {
-            /* Autovacuum Launcher */
-            beentry->st_backendType = B_AUTOVAC_LAUNCHER;
-        }
-        else if (IsAutoVacuumWorkerProcess())
-        {
-            /* Autovacuum Worker */
-            beentry->st_backendType = B_AUTOVAC_WORKER;
-        }
-        else if (am_walsender)
-        {
-            /* Wal sender */
-            beentry->st_backendType = B_WAL_SENDER;
-        }
-        else if (IsBackgroundWorker)
-        {
-            /* bgworker */
-            beentry->st_backendType = B_BG_WORKER;
-        }
-        else
-        {
-            /* client-backend */
-            beentry->st_backendType = B_BACKEND;
-        }
-    }
-    else
-    {
-        /* Must be an auxiliary process */
-        Assert(MyAuxProcType != NotAnAuxProcess);
-        switch (MyAuxProcType)
-        {
-            case StartupProcess:
-                beentry->st_backendType = B_STARTUP;
-                break;
-            case BgWriterProcess:
-                beentry->st_backendType = B_BG_WRITER;
-                break;
-            case ArchiverProcess:
-                beentry->st_backendType = B_ARCHIVER;
-                break;
-            case CheckpointerProcess:
-                beentry->st_backendType = B_CHECKPOINTER;
-                break;
-            case WalWriterProcess:
-                beentry->st_backendType = B_WAL_WRITER;
-                break;
-            case WalReceiverProcess:
-                beentry->st_backendType = B_WAL_RECEIVER;
-                break;
-            default:
-                elog(FATAL, "unrecognized process type: %d",
-                     (int) MyAuxProcType);
-                proc_exit(1);
-        }
-    }
-
-    do
-    {
-        pgstat_increment_changecount_before(beentry);
-    } while ((beentry->st_changecount & 1) == 0);
-
-    beentry->st_procpid = MyProcPid;
-    beentry->st_proc_start_timestamp = MyStartTimestamp;
-    beentry->st_activity_start_timestamp = 0;
-    beentry->st_state_start_timestamp = 0;
-    beentry->st_xact_start_timestamp = 0;
-    beentry->st_databaseid = MyDatabaseId;
-
-    /* We have userid for client-backends, wal-sender and bgworker processes */
-    if (beentry->st_backendType == B_BACKEND
-        || beentry->st_backendType == B_WAL_SENDER
-        || beentry->st_backendType == B_BG_WORKER)
-        beentry->st_userid = GetSessionUserId();
-    else
-        beentry->st_userid = InvalidOid;
-
-    beentry->st_clientaddr = clientaddr;
-    if (MyProcPort && MyProcPort->remote_hostname)
-        strlcpy(beentry->st_clienthostname, MyProcPort->remote_hostname,
-                NAMEDATALEN);
-    else
-        beentry->st_clienthostname[0] = '\0';
-#ifdef USE_SSL
-    if (MyProcPort && MyProcPort->ssl != NULL)
-    {
-        beentry->st_ssl = true;
-        beentry->st_sslstatus->ssl_bits = be_tls_get_cipher_bits(MyProcPort);
-        beentry->st_sslstatus->ssl_compression = be_tls_get_compression(MyProcPort);
-        strlcpy(beentry->st_sslstatus->ssl_version, be_tls_get_version(MyProcPort), NAMEDATALEN);
-        strlcpy(beentry->st_sslstatus->ssl_cipher, be_tls_get_cipher(MyProcPort), NAMEDATALEN);
-        be_tls_get_peerdn_name(MyProcPort, beentry->st_sslstatus->ssl_clientdn, NAMEDATALEN);
-    }
-    else
-    {
-        beentry->st_ssl = false;
-    }
-#else
-    beentry->st_ssl = false;
-#endif
-    beentry->st_state = STATE_UNDEFINED;
-    beentry->st_appname[0] = '\0';
-    beentry->st_activity_raw[0] = '\0';
-    /* Also make sure the last byte in each string area is always 0 */
-    beentry->st_clienthostname[NAMEDATALEN - 1] = '\0';
-    beentry->st_appname[NAMEDATALEN - 1] = '\0';
-    beentry->st_activity_raw[pgstat_track_activity_query_size - 1] = '\0';
-    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
-    beentry->st_progress_command_target = InvalidOid;
-
-    /*
-     * we don't zero st_progress_param here to save cycles; nobody should
-     * examine it until st_progress_command has been set to something other
-     * than PROGRESS_COMMAND_INVALID
-     */
-
-    pgstat_increment_changecount_after(beentry);
-
-    /* Update app name to current GUC setting */
-    if (application_name)
-        pgstat_report_appname(application_name);
-}
-
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
@@ -3088,8 +2647,6 @@ pgstat_bestart(void)
 static void
 pgstat_beshutdown_hook(int code, Datum arg)
 {
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
     /*
      * If we got as far as discovering our own database ID, we can report what
      * we did to the collector.  Otherwise, we'd be sending an invalid
@@ -3098,1188 +2655,9 @@ pgstat_beshutdown_hook(int code, Datum arg)
      */
     if (OidIsValid(MyDatabaseId))
         pgstat_update_stat(true);
-
-    /*
-     * Clear my status entry, following the protocol of bumping st_changecount
-     * before and after.  We use a volatile pointer here to ensure the
-     * compiler doesn't try to get cute.
-     */
-    pgstat_increment_changecount_before(beentry);
-
-    beentry->st_procpid = 0;    /* mark invalid */
-
-    pgstat_increment_changecount_after(beentry);
 }
 
 
-/* ----------
- * pgstat_report_activity() -
- *
- *    Called from tcop/postgres.c to report what the backend is actually doing
- *    (but note cmd_str can be NULL for certain cases).
- *
- * All updates of the status entry follow the protocol of bumping
- * st_changecount before and after.  We use a volatile pointer here to
- * ensure the compiler doesn't try to get cute.
- * ----------
- */
-void
-pgstat_report_activity(BackendState state, const char *cmd_str)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-    TimestampTz start_timestamp;
-    TimestampTz current_timestamp;
-    int            len = 0;
-
-    TRACE_POSTGRESQL_STATEMENT_STATUS(cmd_str);
-
-    if (!beentry)
-        return;
-
-    if (!pgstat_track_activities)
-    {
-        if (beentry->st_state != STATE_DISABLED)
-        {
-            volatile PGPROC *proc = MyProc;
-
-            /*
-             * track_activities is disabled, but we last reported a
-             * non-disabled state.  As our final update, change the state and
-             * clear fields we will not be updating anymore.
-             */
-            pgstat_increment_changecount_before(beentry);
-            beentry->st_state = STATE_DISABLED;
-            beentry->st_state_start_timestamp = 0;
-            beentry->st_activity_raw[0] = '\0';
-            beentry->st_activity_start_timestamp = 0;
-            /* st_xact_start_timestamp and wait_event_info are also disabled */
-            beentry->st_xact_start_timestamp = 0;
-            proc->wait_event_info = 0;
-            pgstat_increment_changecount_after(beentry);
-        }
-        return;
-    }
-
-    /*
-     * To minimize the time spent modifying the entry, fetch all the needed
-     * data first.
-     */
-    start_timestamp = GetCurrentStatementStartTimestamp();
-    if (cmd_str != NULL)
-    {
-        /*
-         * Compute length of to-be-stored string unaware of multi-byte
-         * characters. For speed reasons that'll get corrected on read, rather
-         * than computed every write.
-         */
-        len = Min(strlen(cmd_str), pgstat_track_activity_query_size - 1);
-    }
-    current_timestamp = GetCurrentTimestamp();
-
-    /*
-     * Now update the status entry
-     */
-    pgstat_increment_changecount_before(beentry);
-
-    beentry->st_state = state;
-    beentry->st_state_start_timestamp = current_timestamp;
-
-    if (cmd_str != NULL)
-    {
-        memcpy((char *) beentry->st_activity_raw, cmd_str, len);
-        beentry->st_activity_raw[len] = '\0';
-        beentry->st_activity_start_timestamp = start_timestamp;
-    }
-
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_start_command() -
- *
- * Set st_progress_command (and st_progress_command_target) in own backend
- * entry.  Also, zero-initialize st_progress_param array.
- *-----------
- */
-void
-pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    if (!beentry || !pgstat_track_activities)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_progress_command = cmdtype;
-    beentry->st_progress_command_target = relid;
-    MemSet(&beentry->st_progress_param, 0, sizeof(beentry->st_progress_param));
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_update_param() -
- *
- * Update index'th member in st_progress_param[] of own backend entry.
- *-----------
- */
-void
-pgstat_progress_update_param(int index, int64 val)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    Assert(index >= 0 && index < PGSTAT_NUM_PROGRESS_PARAM);
-
-    if (!beentry || !pgstat_track_activities)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_progress_param[index] = val;
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_update_multi_param() -
- *
- * Update multiple members in st_progress_param[] of own backend entry.
- * This is atomic; readers won't see intermediate states.
- *-----------
- */
-void
-pgstat_progress_update_multi_param(int nparam, const int *index,
-                                   const int64 *val)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-    int            i;
-
-    if (!beentry || !pgstat_track_activities || nparam == 0)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-
-    for (i = 0; i < nparam; ++i)
-    {
-        Assert(index[i] >= 0 && index[i] < PGSTAT_NUM_PROGRESS_PARAM);
-
-        beentry->st_progress_param[index[i]] = val[i];
-    }
-
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_end_command() -
- *
- * Reset st_progress_command (and st_progress_command_target) in own backend
- * entry.  This signals the end of the command.
- *-----------
- */
-void
-pgstat_progress_end_command(void)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    if (!beentry)
-        return;
-    if (!pgstat_track_activities
-        && beentry->st_progress_command == PROGRESS_COMMAND_INVALID)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
-    beentry->st_progress_command_target = InvalidOid;
-    pgstat_increment_changecount_after(beentry);
-}
-
-/* ----------
- * pgstat_report_appname() -
- *
- *    Called to update our application name.
- * ----------
- */
-void
-pgstat_report_appname(const char *appname)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-    int            len;
-
-    if (!beentry)
-        return;
-
-    /* This should be unnecessary if GUC did its job, but be safe */
-    len = pg_mbcliplen(appname, strlen(appname), NAMEDATALEN - 1);
-
-    /*
-     * Update my status entry, following the protocol of bumping
-     * st_changecount before and after.  We use a volatile pointer here to
-     * ensure the compiler doesn't try to get cute.
-     */
-    pgstat_increment_changecount_before(beentry);
-
-    memcpy((char *) beentry->st_appname, appname, len);
-    beentry->st_appname[len] = '\0';
-
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*
- * Report current transaction start timestamp as the specified value.
- * Zero means there is no active transaction.
- */
-void
-pgstat_report_xact_timestamp(TimestampTz tstamp)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    if (!pgstat_track_activities || !beentry)
-        return;
-
-    /*
-     * Update my status entry, following the protocol of bumping
-     * st_changecount before and after.  We use a volatile pointer here to
-     * ensure the compiler doesn't try to get cute.
-     */
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_xact_start_timestamp = tstamp;
-    pgstat_increment_changecount_after(beentry);
-}
-
-/* ----------
- * pgstat_read_current_status() -
- *
- *    Copy the current contents of the PgBackendStatus array to local memory,
- *    if not already done in this transaction.
- * ----------
- */
-static void
-pgstat_read_current_status(void)
-{
-    volatile PgBackendStatus *beentry;
-    LocalPgBackendStatus *localtable;
-    LocalPgBackendStatus *localentry;
-    char       *localappname,
-               *localclienthostname,
-               *localactivity;
-#ifdef USE_SSL
-    PgBackendSSLStatus *localsslstatus;
-#endif
-    int            i;
-
-    Assert(IsUnderPostmaster);
-
-    if (localBackendStatusTable)
-        return;                    /* already done */
-
-    pgstat_setup_memcxt();
-
-    localtable = (LocalPgBackendStatus *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           sizeof(LocalPgBackendStatus) * NumBackendStatSlots);
-    localappname = (char *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           NAMEDATALEN * NumBackendStatSlots);
-    localclienthostname = (char *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           NAMEDATALEN * NumBackendStatSlots);
-    localactivity = (char *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           pgstat_track_activity_query_size * NumBackendStatSlots);
-#ifdef USE_SSL
-    localsslstatus = (PgBackendSSLStatus *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           sizeof(PgBackendSSLStatus) * NumBackendStatSlots);
-#endif
-
-    localNumBackends = 0;
-
-    beentry = BackendStatusArray;
-    localentry = localtable;
-    for (i = 1; i <= NumBackendStatSlots; i++)
-    {
-        /*
-         * Follow the protocol of retrying if st_changecount changes while we
-         * copy the entry, or if it's odd.  (The check for odd is needed to
-         * cover the case where we are able to completely copy the entry while
-         * the source backend is between increment steps.)    We use a volatile
-         * pointer here to ensure the compiler doesn't try to get cute.
-         */
-        for (;;)
-        {
-            int            before_changecount;
-            int            after_changecount;
-
-            pgstat_save_changecount_before(beentry, before_changecount);
-
-            localentry->backendStatus.st_procpid = beentry->st_procpid;
-            if (localentry->backendStatus.st_procpid > 0)
-            {
-                memcpy(&localentry->backendStatus, (char *) beentry, sizeof(PgBackendStatus));
-
-                /*
-                 * strcpy is safe even if the string is modified concurrently,
-                 * because there's always a \0 at the end of the buffer.
-                 */
-                strcpy(localappname, (char *) beentry->st_appname);
-                localentry->backendStatus.st_appname = localappname;
-                strcpy(localclienthostname, (char *) beentry->st_clienthostname);
-                localentry->backendStatus.st_clienthostname = localclienthostname;
-                strcpy(localactivity, (char *) beentry->st_activity_raw);
-                localentry->backendStatus.st_activity_raw = localactivity;
-                localentry->backendStatus.st_ssl = beentry->st_ssl;
-#ifdef USE_SSL
-                if (beentry->st_ssl)
-                {
-                    memcpy(localsslstatus, beentry->st_sslstatus, sizeof(PgBackendSSLStatus));
-                    localentry->backendStatus.st_sslstatus = localsslstatus;
-                }
-#endif
-            }
-
-            pgstat_save_changecount_after(beentry, after_changecount);
-            if (before_changecount == after_changecount &&
-                (before_changecount & 1) == 0)
-                break;
-
-            /* Make sure we can break out of loop if stuck... */
-            CHECK_FOR_INTERRUPTS();
-        }
-
-        beentry++;
-        /* Only valid entries get included into the local array */
-        if (localentry->backendStatus.st_procpid > 0)
-        {
-            BackendIdGetTransactionIds(i,
-                                       &localentry->backend_xid,
-                                       &localentry->backend_xmin);
-
-            localentry++;
-            localappname += NAMEDATALEN;
-            localclienthostname += NAMEDATALEN;
-            localactivity += pgstat_track_activity_query_size;
-#ifdef USE_SSL
-            localsslstatus++;
-#endif
-            localNumBackends++;
-        }
-    }
-
-    /* Set the pointer only after completion of a valid table */
-    localBackendStatusTable = localtable;
-}
-
-/* ----------
- * pgstat_get_wait_event_type() -
- *
- *    Return a string representing the current wait event type, backend is
- *    waiting on.
- */
-const char *
-pgstat_get_wait_event_type(uint32 wait_event_info)
-{
-    uint32        classId;
-    const char *event_type;
-
-    /* report process as not waiting. */
-    if (wait_event_info == 0)
-        return NULL;
-
-    classId = wait_event_info & 0xFF000000;
-
-    switch (classId)
-    {
-        case PG_WAIT_LWLOCK:
-            event_type = "LWLock";
-            break;
-        case PG_WAIT_LOCK:
-            event_type = "Lock";
-            break;
-        case PG_WAIT_BUFFER_PIN:
-            event_type = "BufferPin";
-            break;
-        case PG_WAIT_ACTIVITY:
-            event_type = "Activity";
-            break;
-        case PG_WAIT_CLIENT:
-            event_type = "Client";
-            break;
-        case PG_WAIT_EXTENSION:
-            event_type = "Extension";
-            break;
-        case PG_WAIT_IPC:
-            event_type = "IPC";
-            break;
-        case PG_WAIT_TIMEOUT:
-            event_type = "Timeout";
-            break;
-        case PG_WAIT_IO:
-            event_type = "IO";
-            break;
-        default:
-            event_type = "???";
-            break;
-    }
-
-    return event_type;
-}
-
-/* ----------
- * pgstat_get_wait_event() -
- *
- *    Return a string representing the current wait event, backend is
- *    waiting on.
- */
-const char *
-pgstat_get_wait_event(uint32 wait_event_info)
-{
-    uint32        classId;
-    uint16        eventId;
-    const char *event_name;
-
-    /* report process as not waiting. */
-    if (wait_event_info == 0)
-        return NULL;
-
-    classId = wait_event_info & 0xFF000000;
-    eventId = wait_event_info & 0x0000FFFF;
-
-    switch (classId)
-    {
-        case PG_WAIT_LWLOCK:
-            event_name = GetLWLockIdentifier(classId, eventId);
-            break;
-        case PG_WAIT_LOCK:
-            event_name = GetLockNameFromTagType(eventId);
-            break;
-        case PG_WAIT_BUFFER_PIN:
-            event_name = "BufferPin";
-            break;
-        case PG_WAIT_ACTIVITY:
-            {
-                WaitEventActivity w = (WaitEventActivity) wait_event_info;
-
-                event_name = pgstat_get_wait_activity(w);
-                break;
-            }
-        case PG_WAIT_CLIENT:
-            {
-                WaitEventClient w = (WaitEventClient) wait_event_info;
-
-                event_name = pgstat_get_wait_client(w);
-                break;
-            }
-        case PG_WAIT_EXTENSION:
-            event_name = "Extension";
-            break;
-        case PG_WAIT_IPC:
-            {
-                WaitEventIPC w = (WaitEventIPC) wait_event_info;
-
-                event_name = pgstat_get_wait_ipc(w);
-                break;
-            }
-        case PG_WAIT_TIMEOUT:
-            {
-                WaitEventTimeout w = (WaitEventTimeout) wait_event_info;
-
-                event_name = pgstat_get_wait_timeout(w);
-                break;
-            }
-        case PG_WAIT_IO:
-            {
-                WaitEventIO w = (WaitEventIO) wait_event_info;
-
-                event_name = pgstat_get_wait_io(w);
-                break;
-            }
-        default:
-            event_name = "unknown wait event";
-            break;
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_activity() -
- *
- * Convert WaitEventActivity to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_activity(WaitEventActivity w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_ARCHIVER_MAIN:
-            event_name = "ArchiverMain";
-            break;
-        case WAIT_EVENT_AUTOVACUUM_MAIN:
-            event_name = "AutoVacuumMain";
-            break;
-        case WAIT_EVENT_BGWRITER_HIBERNATE:
-            event_name = "BgWriterHibernate";
-            break;
-        case WAIT_EVENT_BGWRITER_MAIN:
-            event_name = "BgWriterMain";
-            break;
-        case WAIT_EVENT_CHECKPOINTER_MAIN:
-            event_name = "CheckpointerMain";
-            break;
-        case WAIT_EVENT_LOGICAL_APPLY_MAIN:
-            event_name = "LogicalApplyMain";
-            break;
-        case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
-            event_name = "LogicalLauncherMain";
-            break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
-            break;
-        case WAIT_EVENT_RECOVERY_WAL_ALL:
-            event_name = "RecoveryWalAll";
-            break;
-        case WAIT_EVENT_RECOVERY_WAL_STREAM:
-            event_name = "RecoveryWalStream";
-            break;
-        case WAIT_EVENT_SYSLOGGER_MAIN:
-            event_name = "SysLoggerMain";
-            break;
-        case WAIT_EVENT_WAL_RECEIVER_MAIN:
-            event_name = "WalReceiverMain";
-            break;
-        case WAIT_EVENT_WAL_SENDER_MAIN:
-            event_name = "WalSenderMain";
-            break;
-        case WAIT_EVENT_WAL_WRITER_MAIN:
-            event_name = "WalWriterMain";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_client() -
- *
- * Convert WaitEventClient to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_client(WaitEventClient w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_CLIENT_READ:
-            event_name = "ClientRead";
-            break;
-        case WAIT_EVENT_CLIENT_WRITE:
-            event_name = "ClientWrite";
-            break;
-        case WAIT_EVENT_LIBPQWALRECEIVER_CONNECT:
-            event_name = "LibPQWalReceiverConnect";
-            break;
-        case WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE:
-            event_name = "LibPQWalReceiverReceive";
-            break;
-        case WAIT_EVENT_SSL_OPEN_SERVER:
-            event_name = "SSLOpenServer";
-            break;
-        case WAIT_EVENT_WAL_RECEIVER_WAIT_START:
-            event_name = "WalReceiverWaitStart";
-            break;
-        case WAIT_EVENT_WAL_SENDER_WAIT_WAL:
-            event_name = "WalSenderWaitForWAL";
-            break;
-        case WAIT_EVENT_WAL_SENDER_WRITE_DATA:
-            event_name = "WalSenderWriteData";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_ipc() -
- *
- * Convert WaitEventIPC to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_ipc(WaitEventIPC w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_BGWORKER_SHUTDOWN:
-            event_name = "BgWorkerShutdown";
-            break;
-        case WAIT_EVENT_BGWORKER_STARTUP:
-            event_name = "BgWorkerStartup";
-            break;
-        case WAIT_EVENT_BTREE_PAGE:
-            event_name = "BtreePage";
-            break;
-        case WAIT_EVENT_CLOG_GROUP_UPDATE:
-            event_name = "ClogGroupUpdate";
-            break;
-        case WAIT_EVENT_EXECUTE_GATHER:
-            event_name = "ExecuteGather";
-            break;
-        case WAIT_EVENT_HASH_BATCH_ALLOCATING:
-            event_name = "Hash/Batch/Allocating";
-            break;
-        case WAIT_EVENT_HASH_BATCH_ELECTING:
-            event_name = "Hash/Batch/Electing";
-            break;
-        case WAIT_EVENT_HASH_BATCH_LOADING:
-            event_name = "Hash/Batch/Loading";
-            break;
-        case WAIT_EVENT_HASH_BUILD_ALLOCATING:
-            event_name = "Hash/Build/Allocating";
-            break;
-        case WAIT_EVENT_HASH_BUILD_ELECTING:
-            event_name = "Hash/Build/Electing";
-            break;
-        case WAIT_EVENT_HASH_BUILD_HASHING_INNER:
-            event_name = "Hash/Build/HashingInner";
-            break;
-        case WAIT_EVENT_HASH_BUILD_HASHING_OUTER:
-            event_name = "Hash/Build/HashingOuter";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING:
-            event_name = "Hash/GrowBatches/Allocating";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_DECIDING:
-            event_name = "Hash/GrowBatches/Deciding";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_ELECTING:
-            event_name = "Hash/GrowBatches/Electing";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_FINISHING:
-            event_name = "Hash/GrowBatches/Finishing";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING:
-            event_name = "Hash/GrowBatches/Repartitioning";
-            break;
-        case WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING:
-            event_name = "Hash/GrowBuckets/Allocating";
-            break;
-        case WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING:
-            event_name = "Hash/GrowBuckets/Electing";
-            break;
-        case WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING:
-            event_name = "Hash/GrowBuckets/Reinserting";
-            break;
-        case WAIT_EVENT_LOGICAL_SYNC_DATA:
-            event_name = "LogicalSyncData";
-            break;
-        case WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE:
-            event_name = "LogicalSyncStateChange";
-            break;
-        case WAIT_EVENT_MQ_INTERNAL:
-            event_name = "MessageQueueInternal";
-            break;
-        case WAIT_EVENT_MQ_PUT_MESSAGE:
-            event_name = "MessageQueuePutMessage";
-            break;
-        case WAIT_EVENT_MQ_RECEIVE:
-            event_name = "MessageQueueReceive";
-            break;
-        case WAIT_EVENT_MQ_SEND:
-            event_name = "MessageQueueSend";
-            break;
-        case WAIT_EVENT_PARALLEL_BITMAP_SCAN:
-            event_name = "ParallelBitmapScan";
-            break;
-        case WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN:
-            event_name = "ParallelCreateIndexScan";
-            break;
-        case WAIT_EVENT_PARALLEL_FINISH:
-            event_name = "ParallelFinish";
-            break;
-        case WAIT_EVENT_PROCARRAY_GROUP_UPDATE:
-            event_name = "ProcArrayGroupUpdate";
-            break;
-        case WAIT_EVENT_PROMOTE:
-            event_name = "Promote";
-            break;
-        case WAIT_EVENT_REPLICATION_ORIGIN_DROP:
-            event_name = "ReplicationOriginDrop";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_DROP:
-            event_name = "ReplicationSlotDrop";
-            break;
-        case WAIT_EVENT_SAFE_SNAPSHOT:
-            event_name = "SafeSnapshot";
-            break;
-        case WAIT_EVENT_SYNC_REP:
-            event_name = "SyncRep";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_timeout() -
- *
- * Convert WaitEventTimeout to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_timeout(WaitEventTimeout w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_BASE_BACKUP_THROTTLE:
-            event_name = "BaseBackupThrottle";
-            break;
-        case WAIT_EVENT_PG_SLEEP:
-            event_name = "PgSleep";
-            break;
-        case WAIT_EVENT_RECOVERY_APPLY_DELAY:
-            event_name = "RecoveryApplyDelay";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_io() -
- *
- * Convert WaitEventIO to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_io(WaitEventIO w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_BUFFILE_READ:
-            event_name = "BufFileRead";
-            break;
-        case WAIT_EVENT_BUFFILE_WRITE:
-            event_name = "BufFileWrite";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_READ:
-            event_name = "ControlFileRead";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_SYNC:
-            event_name = "ControlFileSync";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE:
-            event_name = "ControlFileSyncUpdate";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_WRITE:
-            event_name = "ControlFileWrite";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE:
-            event_name = "ControlFileWriteUpdate";
-            break;
-        case WAIT_EVENT_COPY_FILE_READ:
-            event_name = "CopyFileRead";
-            break;
-        case WAIT_EVENT_COPY_FILE_WRITE:
-            event_name = "CopyFileWrite";
-            break;
-        case WAIT_EVENT_DATA_FILE_EXTEND:
-            event_name = "DataFileExtend";
-            break;
-        case WAIT_EVENT_DATA_FILE_FLUSH:
-            event_name = "DataFileFlush";
-            break;
-        case WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC:
-            event_name = "DataFileImmediateSync";
-            break;
-        case WAIT_EVENT_DATA_FILE_PREFETCH:
-            event_name = "DataFilePrefetch";
-            break;
-        case WAIT_EVENT_DATA_FILE_READ:
-            event_name = "DataFileRead";
-            break;
-        case WAIT_EVENT_DATA_FILE_SYNC:
-            event_name = "DataFileSync";
-            break;
-        case WAIT_EVENT_DATA_FILE_TRUNCATE:
-            event_name = "DataFileTruncate";
-            break;
-        case WAIT_EVENT_DATA_FILE_WRITE:
-            event_name = "DataFileWrite";
-            break;
-        case WAIT_EVENT_DSM_FILL_ZERO_WRITE:
-            event_name = "DSMFillZeroWrite";
-            break;
-        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ:
-            event_name = "LockFileAddToDataDirRead";
-            break;
-        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC:
-            event_name = "LockFileAddToDataDirSync";
-            break;
-        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE:
-            event_name = "LockFileAddToDataDirWrite";
-            break;
-        case WAIT_EVENT_LOCK_FILE_CREATE_READ:
-            event_name = "LockFileCreateRead";
-            break;
-        case WAIT_EVENT_LOCK_FILE_CREATE_SYNC:
-            event_name = "LockFileCreateSync";
-            break;
-        case WAIT_EVENT_LOCK_FILE_CREATE_WRITE:
-            event_name = "LockFileCreateWrite";
-            break;
-        case WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ:
-            event_name = "LockFileReCheckDataDirRead";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC:
-            event_name = "LogicalRewriteCheckpointSync";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC:
-            event_name = "LogicalRewriteMappingSync";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE:
-            event_name = "LogicalRewriteMappingWrite";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_SYNC:
-            event_name = "LogicalRewriteSync";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE:
-            event_name = "LogicalRewriteTruncate";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_WRITE:
-            event_name = "LogicalRewriteWrite";
-            break;
-        case WAIT_EVENT_RELATION_MAP_READ:
-            event_name = "RelationMapRead";
-            break;
-        case WAIT_EVENT_RELATION_MAP_SYNC:
-            event_name = "RelationMapSync";
-            break;
-        case WAIT_EVENT_RELATION_MAP_WRITE:
-            event_name = "RelationMapWrite";
-            break;
-        case WAIT_EVENT_REORDER_BUFFER_READ:
-            event_name = "ReorderBufferRead";
-            break;
-        case WAIT_EVENT_REORDER_BUFFER_WRITE:
-            event_name = "ReorderBufferWrite";
-            break;
-        case WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ:
-            event_name = "ReorderLogicalMappingRead";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_READ:
-            event_name = "ReplicationSlotRead";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC:
-            event_name = "ReplicationSlotRestoreSync";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_SYNC:
-            event_name = "ReplicationSlotSync";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_WRITE:
-            event_name = "ReplicationSlotWrite";
-            break;
-        case WAIT_EVENT_SLRU_FLUSH_SYNC:
-            event_name = "SLRUFlushSync";
-            break;
-        case WAIT_EVENT_SLRU_READ:
-            event_name = "SLRURead";
-            break;
-        case WAIT_EVENT_SLRU_SYNC:
-            event_name = "SLRUSync";
-            break;
-        case WAIT_EVENT_SLRU_WRITE:
-            event_name = "SLRUWrite";
-            break;
-        case WAIT_EVENT_SNAPBUILD_READ:
-            event_name = "SnapbuildRead";
-            break;
-        case WAIT_EVENT_SNAPBUILD_SYNC:
-            event_name = "SnapbuildSync";
-            break;
-        case WAIT_EVENT_SNAPBUILD_WRITE:
-            event_name = "SnapbuildWrite";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC:
-            event_name = "TimelineHistoryFileSync";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE:
-            event_name = "TimelineHistoryFileWrite";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_READ:
-            event_name = "TimelineHistoryRead";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_SYNC:
-            event_name = "TimelineHistorySync";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_WRITE:
-            event_name = "TimelineHistoryWrite";
-            break;
-        case WAIT_EVENT_TWOPHASE_FILE_READ:
-            event_name = "TwophaseFileRead";
-            break;
-        case WAIT_EVENT_TWOPHASE_FILE_SYNC:
-            event_name = "TwophaseFileSync";
-            break;
-        case WAIT_EVENT_TWOPHASE_FILE_WRITE:
-            event_name = "TwophaseFileWrite";
-            break;
-        case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ:
-            event_name = "WALSenderTimelineHistoryRead";
-            break;
-        case WAIT_EVENT_WAL_BOOTSTRAP_SYNC:
-            event_name = "WALBootstrapSync";
-            break;
-        case WAIT_EVENT_WAL_BOOTSTRAP_WRITE:
-            event_name = "WALBootstrapWrite";
-            break;
-        case WAIT_EVENT_WAL_COPY_READ:
-            event_name = "WALCopyRead";
-            break;
-        case WAIT_EVENT_WAL_COPY_SYNC:
-            event_name = "WALCopySync";
-            break;
-        case WAIT_EVENT_WAL_COPY_WRITE:
-            event_name = "WALCopyWrite";
-            break;
-        case WAIT_EVENT_WAL_INIT_SYNC:
-            event_name = "WALInitSync";
-            break;
-        case WAIT_EVENT_WAL_INIT_WRITE:
-            event_name = "WALInitWrite";
-            break;
-        case WAIT_EVENT_WAL_READ:
-            event_name = "WALRead";
-            break;
-        case WAIT_EVENT_WAL_SYNC:
-            event_name = "WALSync";
-            break;
-        case WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN:
-            event_name = "WALSyncMethodAssign";
-            break;
-        case WAIT_EVENT_WAL_WRITE:
-            event_name = "WALWrite";
-            break;
-
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-
-/* ----------
- * pgstat_get_backend_current_activity() -
- *
- *    Return a string representing the current activity of the backend with
- *    the specified PID.  This looks directly at the BackendStatusArray,
- *    and so will provide current information regardless of the age of our
- *    transaction's snapshot of the status array.
- *
- *    It is the caller's responsibility to invoke this only for backends whose
- *    state is expected to remain stable while the result is in use.  The
- *    only current use is in deadlock reporting, where we can expect that
- *    the target backend is blocked on a lock.  (There are corner cases
- *    where the target's wait could get aborted while we are looking at it,
- *    but the very worst consequence is to return a pointer to a string
- *    that's been changed, so we won't worry too much.)
- *
- *    Note: return strings for special cases match pg_stat_get_backend_activity.
- * ----------
- */
-const char *
-pgstat_get_backend_current_activity(int pid, bool checkUser)
-{
-    PgBackendStatus *beentry;
-    int            i;
-
-    beentry = BackendStatusArray;
-    for (i = 1; i <= MaxBackends; i++)
-    {
-        /*
-         * Although we expect the target backend's entry to be stable, that
-         * doesn't imply that anyone else's is.  To avoid identifying the
-         * wrong backend, while we check for a match to the desired PID we
-         * must follow the protocol of retrying if st_changecount changes
-         * while we examine the entry, or if it's odd.  (This might be
-         * unnecessary, since fetching or storing an int is almost certainly
-         * atomic, but let's play it safe.)  We use a volatile pointer here to
-         * ensure the compiler doesn't try to get cute.
-         */
-        volatile PgBackendStatus *vbeentry = beentry;
-        bool        found;
-
-        for (;;)
-        {
-            int            before_changecount;
-            int            after_changecount;
-
-            pgstat_save_changecount_before(vbeentry, before_changecount);
-
-            found = (vbeentry->st_procpid == pid);
-
-            pgstat_save_changecount_after(vbeentry, after_changecount);
-
-            if (before_changecount == after_changecount &&
-                (before_changecount & 1) == 0)
-                break;
-
-            /* Make sure we can break out of loop if stuck... */
-            CHECK_FOR_INTERRUPTS();
-        }
-
-        if (found)
-        {
-            /* Now it is safe to use the non-volatile pointer */
-            if (checkUser && !superuser() && beentry->st_userid != GetUserId())
-                return "<insufficient privilege>";
-            else if (*(beentry->st_activity_raw) == '\0')
-                return "<command string not enabled>";
-            else
-            {
-                /* this'll leak a bit of memory, but that seems acceptable */
-                return pgstat_clip_activity(beentry->st_activity_raw);
-            }
-        }
-
-        beentry++;
-    }
-
-    /* If we get here, caller is in error ... */
-    return "<backend information not available>";
-}
-
-/* ----------
- * pgstat_get_crashed_backend_activity() -
- *
- *    Return a string representing the current activity of the backend with
- *    the specified PID.  Like the function above, but reads shared memory with
- *    the expectation that it may be corrupt.  On success, copy the string
- *    into the "buffer" argument and return that pointer.  On failure,
- *    return NULL.
- *
- *    This function is only intended to be used by the postmaster to report the
- *    query that crashed a backend.  In particular, no attempt is made to
- *    follow the correct concurrency protocol when accessing the
- *    BackendStatusArray.  But that's OK, in the worst case we'll return a
- *    corrupted message.  We also must take care not to trip on ereport(ERROR).
- * ----------
- */
-const char *
-pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
-{
-    volatile PgBackendStatus *beentry;
-    int            i;
-
-    beentry = BackendStatusArray;
-
-    /*
-     * We probably shouldn't get here before shared memory has been set up,
-     * but be safe.
-     */
-    if (beentry == NULL || BackendActivityBuffer == NULL)
-        return NULL;
-
-    for (i = 1; i <= MaxBackends; i++)
-    {
-        if (beentry->st_procpid == pid)
-        {
-            /* Read pointer just once, so it can't change after validation */
-            const char *activity = beentry->st_activity_raw;
-            const char *activity_last;
-
-            /*
-             * We mustn't access activity string before we verify that it
-             * falls within the BackendActivityBuffer. To make sure that the
-             * entire string including its ending is contained within the
-             * buffer, subtract one activity length from the buffer size.
-             */
-            activity_last = BackendActivityBuffer + BackendActivityBufferSize
-                - pgstat_track_activity_query_size;
-
-            if (activity < BackendActivityBuffer ||
-                activity > activity_last)
-                return NULL;
-
-            /* If no string available, no point in a report */
-            if (activity[0] == '\0')
-                return NULL;
-
-            /*
-             * Copy only ASCII-safe characters so we don't run into encoding
-             * problems when reporting the message; and be sure not to run off
-             * the end of memory.  As only ASCII characters are reported, it
-             * doesn't seem necessary to perform multibyte aware clipping.
-             */
-            ascii_safe_strlcpy(buffer, activity,
-                               Min(buflen, pgstat_track_activity_query_size));
-
-            return buffer;
-        }
-
-        beentry++;
-    }
-
-    /* PID not found */
-    return NULL;
-}
-
-const char *
-pgstat_get_backend_desc(BackendType backendType)
-{
-    const char *backendDesc = "unknown process type";
-
-    switch (backendType)
-    {
-        case B_AUTOVAC_LAUNCHER:
-            backendDesc = "autovacuum launcher";
-            break;
-        case B_AUTOVAC_WORKER:
-            backendDesc = "autovacuum worker";
-            break;
-        case B_BACKEND:
-            backendDesc = "client backend";
-            break;
-        case B_BG_WORKER:
-            backendDesc = "background worker";
-            break;
-        case B_BG_WRITER:
-            backendDesc = "background writer";
-            break;
-        case B_ARCHIVER:
-            backendDesc = "archiver";
-            break;
-        case B_CHECKPOINTER:
-            backendDesc = "checkpointer";
-            break;
-        case B_STARTUP:
-            backendDesc = "startup";
-            break;
-        case B_WAL_RECEIVER:
-            backendDesc = "walreceiver";
-            break;
-        case B_WAL_SENDER:
-            backendDesc = "walsender";
-            break;
-        case B_WAL_WRITER:
-            backendDesc = "walwriter";
-            break;
-    }
-
-    return backendDesc;
-}
-
 /* ------------------------------------------------------------
  * Local support functions follow
  * ------------------------------------------------------------
@@ -5422,22 +3800,6 @@ backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot)
                               funcid);
 }
 
-/* ----------
- * pgstat_setup_memcxt() -
- *
- *    Create pgStatLocalContext, if not already done.
- * ----------
- */
-static void
-pgstat_setup_memcxt(void)
-{
-    if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
-}
-
-
 /* ----------
  * pgstat_clear_snapshot() -
  *
@@ -5453,6 +3815,8 @@ pgstat_clear_snapshot(void)
 {
     int param = 0;    /* only the address is significant */
 
+    pgstat_bestatus_clear_snapshot();
+
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
         MemoryContextDelete(pgStatLocalContext);
@@ -5460,8 +3824,6 @@ pgstat_clear_snapshot(void)
     /* Reset variables */
     pgStatLocalContext = NULL;
     pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
 
     /*
      * the parameter inform the function that it is not called from
@@ -5567,47 +3929,18 @@ pgstat_update_dbentry(PgStat_StatDBEntry *dbentry, PgStat_TableStatus *stat)
     dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
 }
 
-
-/*
- * Convert a potentially unsafely truncated activity string (see
- * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
- * one.
+/* ----------
+ * pgstat_setup_memcxt() -
  *
- * The returned string is allocated in the caller's memory context and may be
- * freed.
+ *    Create pgStatLocalContext, if not already done.
+ * ----------
  */
-char *
-pgstat_clip_activity(const char *raw_activity)
+static void
+pgstat_setup_memcxt(void)
 {
-    char       *activity;
-    int            rawlen;
-    int            cliplen;
-
-    /*
-     * Some callers, like pgstat_get_backend_current_activity(), do not
-     * guarantee that the buffer isn't concurrently modified. We try to take
-     * care that the buffer is always terminated by a NUL byte regardless, but
-     * let's still be paranoid about the string's length. In those cases the
-     * underlying buffer is guaranteed to be pgstat_track_activity_query_size
-     * large.
-     */
-    activity = pnstrdup(raw_activity, pgstat_track_activity_query_size - 1);
-
-    /* now double-guaranteed to be NUL terminated */
-    rawlen = strlen(activity);
-
-    /*
-     * All supported server-encodings make it possible to determine the length
-     * of a multi-byte character from its first byte (this is not the case for
-     * client encodings, see GB18030). As st_activity is always stored using
-     * server encoding, this allows us to perform multi-byte aware truncation,
-     * even if the string earlier was truncated in the middle of a multi-byte
-     * character.
-     */
-    cliplen = pg_mbcliplen(activity, rawlen,
-                           pgstat_track_activity_query_size - 1);
-
-    activity[cliplen] = '\0';
-
-    return activity;
+    if (!pgStatLocalContext)
+        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
+                                                   "Statistics snapshot",
+                                                   ALLOCSET_SMALL_SIZES);
 }
+
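
Note on the pgstat_clip_activity() code removed from pgstat.c in the hunk above: its comments describe clipping a possibly mis-truncated activity string on a character boundary, relying on the fact that the length of a multi-byte character can be determined from its first byte. A minimal sketch of that idea follows, assuming UTF-8 only. The names demo_utf8_charlen and demo_utf8_cliplen are hypothetical and are not part of the patch; the real code uses pg_mbcliplen(), which handles every supported server encoding.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: byte length of a UTF-8 character, judged from its
 * first byte alone.  Invalid or stray continuation bytes count as one byte.
 */
static size_t
demo_utf8_charlen(unsigned char first)
{
    if (first < 0x80)
        return 1;
    if ((first & 0xE0) == 0xC0)
        return 2;
    if ((first & 0xF0) == 0xE0)
        return 3;
    if ((first & 0xF8) == 0xF0)
        return 4;
    return 1;
}

/*
 * Hypothetical helper: return the largest prefix length, in bytes, that is
 * no longer than "limit" and ends on a character boundary.  A trailing
 * character that was cut off in the middle is dropped as well.
 */
static size_t
demo_utf8_cliplen(const char *s, size_t len, size_t limit)
{
    size_t      clip = 0;

    assert(s != NULL);

    while (clip < len)
    {
        size_t      chlen = demo_utf8_charlen((unsigned char) s[clip]);

        /* Stop before a character that is incomplete or exceeds the limit. */
        if (clip + chlen > len || clip + chlen > limit)
            break;
        clip += chlen;
    }

    return clip;
}
```

A caller would copy the raw string, compute the clip length, and write a NUL byte at that offset, which is the shape of the removed function.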
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index d8d0ad2487..cb11dc6ffb 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -34,6 +34,7 @@
 #include <unistd.h>
 
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/storage.h"
 #include "executor/instrument.h"
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index c2c445dbf4..0bb2132c71 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -41,9 +41,9 @@
 
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "executor/instrument.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/buffile.h"
 #include "storage/buf_internals.h"
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index 1f766d20d1..a0401ee494 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -22,10 +22,10 @@
 #include <unistd.h>
 #include <sys/stat.h>
 
+#include "bestatus.h"
 #include "storage/copydir.h"
 #include "storage/fd.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 
 /*
  * copydir: copy a directory
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 213de7698a..6bc5fd6089 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -82,6 +82,7 @@
 #include "miscadmin.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/pg_tablespace.h"
 #include "common/file_perm.h"
 #include "pgstat.h"
diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
index aeda32c9c5..e84275d4c2 100644
--- a/src/backend/storage/ipc/dsm_impl.c
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -61,8 +61,8 @@
 #ifdef HAVE_SYS_SHM_H
 #include <sys/shm.h>
 #endif
+#include "bestatus.h"
 #include "common/file_perm.h"
-#include "pgstat.h"
 
 #include "portability/mem.h"
 #include "storage/dsm_impl.h"
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 7da337d11f..97526f1c72 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -43,8 +43,8 @@
 #include <poll.h>
 #endif
 
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "port/atomics.h"
 #include "portability/instr_time.h"
 #include "postmaster/postmaster.h"
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 43110e57b6..9d88d8c023 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -51,9 +51,9 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/spin.h"
diff --git a/src/backend/storage/ipc/shm_mq.c b/src/backend/storage/ipc/shm_mq.c
index 6e471c3e43..cfa5c9089f 100644
--- a/src/backend/storage/ipc/shm_mq.c
+++ b/src/backend/storage/ipc/shm_mq.c
@@ -18,8 +18,8 @@
 
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/bgworker.h"
 #include "storage/procsignal.h"
 #include "storage/shm_mq.h"
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 4d10e57a80..243da57c49 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -21,8 +21,8 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index 74eb449060..dd76088a29 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -25,6 +25,7 @@
  */
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "pgstat.h"
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 979478e2e5..2cd4d5531e 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -76,8 +76,8 @@
  */
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "pg_trace.h"
 #include "postmaster/postmaster.h"
 #include "replication/slot.h"
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index a962034753..718232ae18 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -193,8 +193,8 @@
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
 #include "storage/predicate_internals.h"
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 89c80fb687..c8198d7311 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -38,8 +38,8 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "replication/slot.h"
 #include "replication/syncrep.h"
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index c37dd1290b..a09d4f5313 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -28,7 +28,7 @@
 #include "miscadmin.h"
 #include "access/xlogutils.h"
 #include "access/xlog.h"
-#include "pgstat.h"
+#include "bestatus.h"
 #include "portability/instr_time.h"
 #include "postmaster/bgwriter.h"
 #include "storage/fd.h"
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8b9142461a..acbbef36a5 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -39,6 +39,7 @@
 #include "access/parallel.h"
 #include "access/printtup.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index f4d3eab2ea..0e3abeba36 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -21,6 +21,7 @@
 
 #include "access/heapam.h"
 #include "access/sysattr.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/pg_tablespace.h"
 #include "catalog/pg_type.h"
@@ -29,7 +30,6 @@
 #include "common/keywords.h"
 #include "funcapi.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "parser/scansup.h"
 #include "postmaster/syslogger.h"
 #include "rewrite/rewriteHandler.h"
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 6eac39fb57..6054581fe4 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 5e61d908fd..2dd99f935d 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -46,11 +46,11 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/pg_tablespace.h"
 #include "catalog/storage.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/lwlock.h"
 #include "utils/inval.h"
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index bd2e4e89d8..1eabc0f41d 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -31,12 +31,12 @@
 #endif
 
 #include "access/htup_details.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "common/file_perm.h"
 #include "libpq/libpq.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/postmaster.h"
 #include "storage/fd.h"
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 626a4326a4..e07ca89065 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -26,6 +26,7 @@
 #include "access/sysattr.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/indexing.h"
 #include "catalog/namespace.h"
@@ -689,7 +690,10 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 
     /* Initialize stats collection --- must happen before first xact */
     if (!bootstrap)
+    {
+        pgstat_bearray_initialize();
         pgstat_initialize();
+    }
 
     /*
      * Load relcache entries for the shared system catalogs.  This must create
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 099afd0724..0fd4db5cb8 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -33,6 +33,7 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
 #include "commands/async.h"
diff --git a/src/include/bestatus.h b/src/include/bestatus.h
new file mode 100644
index 0000000000..3b47e9c063
--- /dev/null
+++ b/src/include/bestatus.h
@@ -0,0 +1,544 @@
+/* ----------
+ *    bestatus.h
+ *
+ *    Definitions for the PostgreSQL backend status monitor facility
+ *
+ *    Copyright (c) 2001-2018, PostgreSQL Global Development Group
+ *
+ *    src/include/bestatus.h
+ * ----------
+ */
+#ifndef BESTATUS_H
+#define BESTATUS_H
+
+#include "datatype/timestamp.h"
+#include "libpq/pqcomm.h"
+#include "storage/proc.h"
+
+/* ----------
+ * Backend types
+ * ----------
+ */
+typedef enum BackendType
+{
+    B_AUTOVAC_LAUNCHER,
+    B_AUTOVAC_WORKER,
+    B_BACKEND,
+    B_BG_WORKER,
+    B_BG_WRITER,
+    B_CHECKPOINTER,
+    B_STARTUP,
+    B_WAL_RECEIVER,
+    B_WAL_SENDER,
+    B_WAL_WRITER,
+    B_ARCHIVER
+} BackendType;
+
+
+/* ----------
+ * Backend states
+ * ----------
+ */
+typedef enum BackendState
+{
+    STATE_UNDEFINED,
+    STATE_IDLE,
+    STATE_RUNNING,
+    STATE_IDLEINTRANSACTION,
+    STATE_FASTPATH,
+    STATE_IDLEINTRANSACTION_ABORTED,
+    STATE_DISABLED
+} BackendState;
+
+
+/* ----------
+ * Wait Classes
+ * ----------
+ */
+#define PG_WAIT_LWLOCK                0x01000000U
+#define PG_WAIT_LOCK                0x03000000U
+#define PG_WAIT_BUFFER_PIN            0x04000000U
+#define PG_WAIT_ACTIVITY            0x05000000U
+#define PG_WAIT_CLIENT                0x06000000U
+#define PG_WAIT_EXTENSION            0x07000000U
+#define PG_WAIT_IPC                    0x08000000U
+#define PG_WAIT_TIMEOUT                0x09000000U
+#define PG_WAIT_IO                    0x0A000000U
+
+/* ----------
+ * Wait Events - Activity
+ *
+ * Use this category when a process is waiting because it has no work to do,
+ * unless the "Client" or "Timeout" category describes the situation better.
+ * Typically, this should only be used for background processes.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_ARCHIVER_MAIN = PG_WAIT_ACTIVITY,
+    WAIT_EVENT_AUTOVACUUM_MAIN,
+    WAIT_EVENT_BGWRITER_HIBERNATE,
+    WAIT_EVENT_BGWRITER_MAIN,
+    WAIT_EVENT_CHECKPOINTER_MAIN,
+    WAIT_EVENT_LOGICAL_APPLY_MAIN,
+    WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
+    WAIT_EVENT_RECOVERY_WAL_ALL,
+    WAIT_EVENT_RECOVERY_WAL_STREAM,
+    WAIT_EVENT_SYSLOGGER_MAIN,
+    WAIT_EVENT_WAL_RECEIVER_MAIN,
+    WAIT_EVENT_WAL_SENDER_MAIN,
+    WAIT_EVENT_WAL_WRITER_MAIN
+} WaitEventActivity;
+
+/* ----------
+ * Wait Events - Client
+ *
+ * Use this category when a process is waiting to send data to or receive data
+ * from the frontend process to which it is connected.  This is never used for
+ * a background process, which has no client connection.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_CLIENT_READ = PG_WAIT_CLIENT,
+    WAIT_EVENT_CLIENT_WRITE,
+    WAIT_EVENT_LIBPQWALRECEIVER_CONNECT,
+    WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE,
+    WAIT_EVENT_SSL_OPEN_SERVER,
+    WAIT_EVENT_WAL_RECEIVER_WAIT_START,
+    WAIT_EVENT_WAL_SENDER_WAIT_WAL,
+    WAIT_EVENT_WAL_SENDER_WRITE_DATA
+} WaitEventClient;
+
+/* ----------
+ * Wait Events - IPC
+ *
+ * Use this category when a process cannot complete the work it is doing because
+ * it is waiting for a notification from another process.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_BGWORKER_SHUTDOWN = PG_WAIT_IPC,
+    WAIT_EVENT_BGWORKER_STARTUP,
+    WAIT_EVENT_BTREE_PAGE,
+    WAIT_EVENT_CLOG_GROUP_UPDATE,
+    WAIT_EVENT_EXECUTE_GATHER,
+    WAIT_EVENT_HASH_BATCH_ALLOCATING,
+    WAIT_EVENT_HASH_BATCH_ELECTING,
+    WAIT_EVENT_HASH_BATCH_LOADING,
+    WAIT_EVENT_HASH_BUILD_ALLOCATING,
+    WAIT_EVENT_HASH_BUILD_ELECTING,
+    WAIT_EVENT_HASH_BUILD_HASHING_INNER,
+    WAIT_EVENT_HASH_BUILD_HASHING_OUTER,
+    WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING,
+    WAIT_EVENT_HASH_GROW_BATCHES_DECIDING,
+    WAIT_EVENT_HASH_GROW_BATCHES_ELECTING,
+    WAIT_EVENT_HASH_GROW_BATCHES_FINISHING,
+    WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING,
+    WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING,
+    WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING,
+    WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING,
+    WAIT_EVENT_LOGICAL_SYNC_DATA,
+    WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
+    WAIT_EVENT_MQ_INTERNAL,
+    WAIT_EVENT_MQ_PUT_MESSAGE,
+    WAIT_EVENT_MQ_RECEIVE,
+    WAIT_EVENT_MQ_SEND,
+    WAIT_EVENT_PARALLEL_BITMAP_SCAN,
+    WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN,
+    WAIT_EVENT_PARALLEL_FINISH,
+    WAIT_EVENT_PROCARRAY_GROUP_UPDATE,
+    WAIT_EVENT_PROMOTE,
+    WAIT_EVENT_REPLICATION_ORIGIN_DROP,
+    WAIT_EVENT_REPLICATION_SLOT_DROP,
+    WAIT_EVENT_SAFE_SNAPSHOT,
+    WAIT_EVENT_SYNC_REP
+} WaitEventIPC;
+
+/* ----------
+ * Wait Events - Timeout
+ *
+ * Use this category when a process is waiting for a timeout to expire.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
+    WAIT_EVENT_PG_SLEEP,
+    WAIT_EVENT_RECOVERY_APPLY_DELAY
+} WaitEventTimeout;
+
+/* ----------
+ * Wait Events - IO
+ *
+ * Use this category when a process is waiting for an IO.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_BUFFILE_READ = PG_WAIT_IO,
+    WAIT_EVENT_BUFFILE_WRITE,
+    WAIT_EVENT_CONTROL_FILE_READ,
+    WAIT_EVENT_CONTROL_FILE_SYNC,
+    WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
+    WAIT_EVENT_CONTROL_FILE_WRITE,
+    WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE,
+    WAIT_EVENT_COPY_FILE_READ,
+    WAIT_EVENT_COPY_FILE_WRITE,
+    WAIT_EVENT_DATA_FILE_EXTEND,
+    WAIT_EVENT_DATA_FILE_FLUSH,
+    WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC,
+    WAIT_EVENT_DATA_FILE_PREFETCH,
+    WAIT_EVENT_DATA_FILE_READ,
+    WAIT_EVENT_DATA_FILE_SYNC,
+    WAIT_EVENT_DATA_FILE_TRUNCATE,
+    WAIT_EVENT_DATA_FILE_WRITE,
+    WAIT_EVENT_DSM_FILL_ZERO_WRITE,
+    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ,
+    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC,
+    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE,
+    WAIT_EVENT_LOCK_FILE_CREATE_READ,
+    WAIT_EVENT_LOCK_FILE_CREATE_SYNC,
+    WAIT_EVENT_LOCK_FILE_CREATE_WRITE,
+    WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ,
+    WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC,
+    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC,
+    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE,
+    WAIT_EVENT_LOGICAL_REWRITE_SYNC,
+    WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE,
+    WAIT_EVENT_LOGICAL_REWRITE_WRITE,
+    WAIT_EVENT_RELATION_MAP_READ,
+    WAIT_EVENT_RELATION_MAP_SYNC,
+    WAIT_EVENT_RELATION_MAP_WRITE,
+    WAIT_EVENT_REORDER_BUFFER_READ,
+    WAIT_EVENT_REORDER_BUFFER_WRITE,
+    WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ,
+    WAIT_EVENT_REPLICATION_SLOT_READ,
+    WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
+    WAIT_EVENT_REPLICATION_SLOT_SYNC,
+    WAIT_EVENT_REPLICATION_SLOT_WRITE,
+    WAIT_EVENT_SLRU_FLUSH_SYNC,
+    WAIT_EVENT_SLRU_READ,
+    WAIT_EVENT_SLRU_SYNC,
+    WAIT_EVENT_SLRU_WRITE,
+    WAIT_EVENT_SNAPBUILD_READ,
+    WAIT_EVENT_SNAPBUILD_SYNC,
+    WAIT_EVENT_SNAPBUILD_WRITE,
+    WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC,
+    WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE,
+    WAIT_EVENT_TIMELINE_HISTORY_READ,
+    WAIT_EVENT_TIMELINE_HISTORY_SYNC,
+    WAIT_EVENT_TIMELINE_HISTORY_WRITE,
+    WAIT_EVENT_TWOPHASE_FILE_READ,
+    WAIT_EVENT_TWOPHASE_FILE_SYNC,
+    WAIT_EVENT_TWOPHASE_FILE_WRITE,
+    WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ,
+    WAIT_EVENT_WAL_BOOTSTRAP_SYNC,
+    WAIT_EVENT_WAL_BOOTSTRAP_WRITE,
+    WAIT_EVENT_WAL_COPY_READ,
+    WAIT_EVENT_WAL_COPY_SYNC,
+    WAIT_EVENT_WAL_COPY_WRITE,
+    WAIT_EVENT_WAL_INIT_SYNC,
+    WAIT_EVENT_WAL_INIT_WRITE,
+    WAIT_EVENT_WAL_READ,
+    WAIT_EVENT_WAL_SYNC,
+    WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
+    WAIT_EVENT_WAL_WRITE
+} WaitEventIO;
+
+/* ----------
+ * Command type for progress reporting purposes
+ * ----------
+ */
+typedef enum ProgressCommandType
+{
+    PROGRESS_COMMAND_INVALID,
+    PROGRESS_COMMAND_VACUUM
+} ProgressCommandType;
+
+#define PGSTAT_NUM_PROGRESS_PARAM    10
+
+/* ----------
+ * Shared-memory data structures
+ * ----------
+ */
+
+
+/*
+ * PgBackendSSLStatus
+ *
+ * For each backend, we keep the SSL status in a separate struct, that
+ * is only filled in if SSL is enabled.
+ */
+typedef struct PgBackendSSLStatus
+{
+    /* Information about SSL connection */
+    int            ssl_bits;
+    bool        ssl_compression;
+    char        ssl_version[NAMEDATALEN];    /* MUST be null-terminated */
+    char        ssl_cipher[NAMEDATALEN];    /* MUST be null-terminated */
+    char        ssl_clientdn[NAMEDATALEN];    /* MUST be null-terminated */
+} PgBackendSSLStatus;
+
+
+/* ----------
+ * PgBackendStatus
+ *
+ * Each live backend maintains a PgBackendStatus struct in shared memory
+ * showing its current activity.  (The structs are allocated according to
+ * BackendId, but that is not critical.)  Note that the collector process
+ * has no involvement in, or even access to, these structs.
+ *
+ * Each auxiliary process also maintains a PgBackendStatus struct in shared
+ * memory.
+ * ----------
+ */
+typedef struct PgBackendStatus
+{
+    /*
+     * To avoid locking overhead, we use the following protocol: a backend
+     * increments st_changecount before modifying its entry, and again after
+     * finishing a modification.  A would-be reader should note the value of
+     * st_changecount, copy the entry into private memory, then check
+     * st_changecount again.  If the value hasn't changed, and if it's even,
+     * the copy is valid; otherwise start over.  This makes updates cheap
+     * while reads are potentially expensive, but that's the tradeoff we want.
+     *
+     * The above protocol needs memory barriers to ensure that the apparent
+     * order of execution is as intended. Otherwise, for example, the CPU
+     * might rearrange the code so that st_changecount is incremented twice
+     * before the modification on a machine with weak memory ordering.
+     * Such reordering would let readers accept a half-updated entry.
+     */
+    int            st_changecount;
+
+    /* The entry is valid iff st_procpid > 0, unused if st_procpid == 0 */
+    int            st_procpid;
+
+    /* Type of backend */
+    BackendType st_backendType;
+
+    /* Times when current backend, transaction, and activity started */
+    TimestampTz st_proc_start_timestamp;
+    TimestampTz st_xact_start_timestamp;
+    TimestampTz st_activity_start_timestamp;
+    TimestampTz st_state_start_timestamp;
+
+    /* Database OID, owning user's OID, connection client address */
+    Oid            st_databaseid;
+    Oid            st_userid;
+    SockAddr    st_clientaddr;
+    char       *st_clienthostname;    /* MUST be null-terminated */
+
+    /* Information about SSL connection */
+    bool        st_ssl;
+    PgBackendSSLStatus *st_sslstatus;
+
+    /* current state */
+    BackendState st_state;
+
+    /* application name; MUST be null-terminated */
+    char       *st_appname;
+
+    /*
+     * Current command string; MUST be null-terminated. Note that this string
+     * possibly is truncated in the middle of a multi-byte character. As
+     * activity strings are stored more frequently than read, this lets us
+     * move the cost of correct truncation to the display side. Use
+     * pgstat_clip_activity() to truncate correctly.
+     */
+    char       *st_activity_raw;
+
+    /*
+     * Command progress reporting.  Any command which wishes can advertise
+     * that it is running by setting st_progress_command,
+     * st_progress_command_target, and st_progress_param[].
+     * st_progress_command_target should be the OID of the relation which the
+     * command targets (we assume there's just one, as this is meant for
+     * utility commands), but the meaning of each element in the
+     * st_progress_param array is command-specific.
+     */
+    ProgressCommandType st_progress_command;
+    Oid            st_progress_command_target;
+    int64        st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
+} PgBackendStatus;
+
+/*
+ * Macros to load and store st_changecount with the memory barriers.
+ *
+ * pgstat_increment_changecount_before() and
+ * pgstat_increment_changecount_after() need to be called before and after
+ * PgBackendStatus entries are modified, respectively. This makes sure that
+ * st_changecount is incremented around the modification.
+ *
+ * Also pgstat_save_changecount_before() and pgstat_save_changecount_after()
+ * need to be called before and after PgBackendStatus entries are copied into
+ * private memory, respectively.
+ */
+#define pgstat_increment_changecount_before(beentry)    \
+    do {    \
+        beentry->st_changecount++;    \
+        pg_write_barrier(); \
+    } while (0)
+
+#define pgstat_increment_changecount_after(beentry) \
+    do {    \
+        pg_write_barrier(); \
+        beentry->st_changecount++;    \
+        Assert((beentry->st_changecount & 1) == 0); \
+    } while (0)
+
+#define pgstat_save_changecount_before(beentry, save_changecount)    \
+    do {    \
+        save_changecount = beentry->st_changecount; \
+        pg_read_barrier();    \
+    } while (0)
+
+#define pgstat_save_changecount_after(beentry, save_changecount)    \
+    do {    \
+        pg_read_barrier();    \
+        save_changecount = beentry->st_changecount; \
+    } while (0)
+
+/* ----------
+ * LocalPgBackendStatus
+ *
+ * When we build the backend status array, we use LocalPgBackendStatus to be
+ * able to add new values to the struct when needed without adding new fields
+ * to the shared memory. It contains the backend status as a first member.
+ * ----------
+ */
+typedef struct LocalPgBackendStatus
+{
+    /*
+     * Local version of the backend status entry.
+     */
+    PgBackendStatus backendStatus;
+
+    /*
+     * The xid of the current transaction if available, InvalidTransactionId
+     * if not.
+     */
+    TransactionId backend_xid;
+
+    /*
+     * The xmin of the current session if available, InvalidTransactionId if
+     * not.
+     */
+    TransactionId backend_xmin;
+} LocalPgBackendStatus;
+
+/* ----------
+ * GUC parameters
+ * ----------
+ */
+extern bool pgstat_track_activities;
+extern PGDLLIMPORT int pgstat_track_activity_query_size;
+
+/* ----------
+ * Functions called from backends
+ * ----------
+ */
+extern void pgstat_bestatus_clear_snapshot(void);
+extern void pgstat_bearray_initialize(void);
+extern void pgstat_bestart(void);
+
+extern const char *pgstat_get_wait_event(uint32 wait_event_info);
+extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
+extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
+extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
+                                    int buflen);
+extern const char *pgstat_get_backend_desc(BackendType backendType);
+
+extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
+                              Oid relid);
+extern void pgstat_progress_update_param(int index, int64 val);
+extern void pgstat_progress_update_multi_param(int nparam, const int *index,
+                                   const int64 *val);
+extern void pgstat_progress_end_command(void);
+
+extern char *pgstat_clip_activity(const char *raw_activity);
+
+/* ----------
+ * pgstat_report_wait_start() -
+ *
+ *    Called from places where a server process needs to wait, to report
+ *    wait event information.  The wait information is stored as 4 bytes:
+ *    the first byte represents the wait event class (the type of wait;
+ *    see the PG_WAIT_* wait classes above) and the next 3 bytes represent
+ *    the actual wait event.  Currently 2 bytes are used for the wait
+ *    event, which is sufficient for current usage; 1 byte is reserved
+ *    for future usage.
+ *
+ * NB: this *must* be able to survive being called before MyProc has been
+ * initialized.
+ * ----------
+ */
+static inline void
+pgstat_report_wait_start(uint32 wait_event_info)
+{
+    volatile PGPROC *proc = MyProc;
+
+    if (!pgstat_track_activities || !proc)
+        return;
+
+    /*
+     * Since this is a four-byte field which is always read and written as
+     * four-bytes, updates are atomic.
+     */
+    proc->wait_event_info = wait_event_info;
+}
+
+/* ----------
+ * pgstat_report_wait_end() -
+ *
+ *    Called to report end of a wait.
+ *
+ * NB: this *must* be able to survive being called before MyProc has been
+ * initialized.
+ * ----------
+ */
+static inline void
+pgstat_report_wait_end(void)
+{
+    volatile PGPROC *proc = MyProc;
+
+    if (!pgstat_track_activities || !proc)
+        return;
+
+    /*
+     * Since this is a four-byte field which is always read and written as
+     * four-bytes, updates are atomic.
+     */
+    proc->wait_event_info = 0;
+}
+extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
+extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
+extern int    pgstat_fetch_stat_numbackends(void);
+
+/* For shared memory allocation/initialization */
+extern Size BackendStatusShmemSize(void);
+extern void CreateSharedBackendStatus(void);
+
+extern void pgstat_report_xact_timestamp(TimestampTz tstamp);
+extern void pgstat_bestat_initialize(void);
+
+extern void pgstat_report_activity(BackendState state, const char *cmd_str);
+extern void pgstat_report_appname(const char *appname);
+extern void pgstat_report_xact_timestamp(TimestampTz tstamp);
+extern const char *pgstat_get_wait_event(uint32 wait_event_info);
+extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
+extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
+extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
+                                    int buflen);
+extern const char *pgstat_get_backend_desc(BackendType backendType);
+
+extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
+                              Oid relid);
+extern void pgstat_progress_update_param(int index, int64 val);
+extern void pgstat_progress_update_multi_param(int nparam, const int *index,
+                                   const int64 *val);
+extern void pgstat_progress_end_command(void);
+
+#endif                            /* BESTATUS_H */
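
Note on the st_changecount protocol documented in bestatus.h above: the writer makes the counter odd before touching the entry and even again afterwards, and a reader retries until it sees the same even value before and after its copy. The sketch below illustrates that pattern with C11 atomics instead of pg_write_barrier()/pg_read_barrier(). Everything in it (DemoEntry, DemoPayload, demo_init, demo_update, demo_read) is hypothetical and not part of the patch. Like the original macros, it reads and writes the payload with plain accesses, which works in practice but is strictly a data race under the C11 memory model.

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

/* Hypothetical stand-in for the payload part of PgBackendStatus. */
typedef struct DemoPayload
{
    int         procpid;
    char        activity[64];
} DemoPayload;

/* Hypothetical shared entry: counter is odd while a write is in progress. */
typedef struct DemoEntry
{
    atomic_int  changecount;
    DemoPayload data;
} DemoEntry;

static void
demo_init(DemoEntry *e)
{
    atomic_init(&e->changecount, 0);
    memset(&e->data, 0, sizeof(e->data));
}

/* Writer side: increment, barrier, modify, barrier, increment. */
static void
demo_update(DemoEntry *e, int pid, const char *activity)
{
    /* Counter becomes odd; the fence keeps payload stores after it. */
    atomic_fetch_add_explicit(&e->changecount, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);

    e->data.procpid = pid;
    strncpy(e->data.activity, activity, sizeof(e->data.activity) - 1);
    e->data.activity[sizeof(e->data.activity) - 1] = '\0';

    /* Release increment: payload stores become visible before "even". */
    int         after = atomic_fetch_add_explicit(&e->changecount, 1,
                                                  memory_order_release) + 1;

    assert((after & 1) == 0);
}

/* Reader side: retry until the counter is even and unchanged across the copy. */
static void
demo_read(DemoEntry *e, DemoPayload *out)
{
    for (;;)
    {
        int         before = atomic_load_explicit(&e->changecount,
                                                  memory_order_acquire);

        memcpy(out, &e->data, sizeof(*out));

        /* Keep the payload loads ahead of the second counter load. */
        atomic_thread_fence(memory_order_acquire);

        int         after = atomic_load_explicit(&e->changecount,
                                                 memory_order_relaxed);

        if (before == after && (before & 1) == 0)
            return;
    }
}
```

Updates stay cheap (two increments and no lock) while a reader may have to copy the entry more than once, which is the trade-off the header comment describes.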
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index d10ea5389b..6f4e94ab5b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL statistics collector facility.
  *
  *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
  *
@@ -14,11 +14,8 @@
 #include "datatype/timestamp.h"
 #include "fmgr.h"
 #include "lib/dshash.h"
-#include "libpq/pqcomm.h"
-#include "port/atomics.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
-#include "storage/proc.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -100,12 +97,11 @@ typedef enum PgStat_Single_Reset_Type
     RESET_FUNCTION
 } PgStat_Single_Reset_Type;
 
+
 /* ------------------------------------------------------------
  * Structures kept in backend local memory while accumulating counts
  * ------------------------------------------------------------
  */
-
-
 /* ----------
  * PgStat_TableStatus            Per-table status within a backend
  *
@@ -173,10 +169,10 @@ typedef struct PgStat_BgWriter
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to apply.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when applying to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -209,7 +205,7 @@ typedef struct PgStat_FunctionEntry
 } PgStat_FunctionEntry;
 
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Statistic collector data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -313,7 +309,7 @@ typedef struct PgStat_StatFuncEntry
 
 
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -329,7 +325,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -347,422 +343,6 @@ typedef struct PgStat_GlobalStats
     TimestampTz stat_reset_timestamp;
 } PgStat_GlobalStats;
 
-
-/* ----------
- * Backend types
- * ----------
- */
-typedef enum BackendType
-{
-    B_AUTOVAC_LAUNCHER,
-    B_AUTOVAC_WORKER,
-    B_BACKEND,
-    B_BG_WORKER,
-    B_BG_WRITER,
-    B_ARCHIVER,
-    B_CHECKPOINTER,
-    B_STARTUP,
-    B_WAL_RECEIVER,
-    B_WAL_SENDER,
-    B_WAL_WRITER
-} BackendType;
-
-
-/* ----------
- * Backend states
- * ----------
- */
-typedef enum BackendState
-{
-    STATE_UNDEFINED,
-    STATE_IDLE,
-    STATE_RUNNING,
-    STATE_IDLEINTRANSACTION,
-    STATE_FASTPATH,
-    STATE_IDLEINTRANSACTION_ABORTED,
-    STATE_DISABLED
-} BackendState;
-
-
-/* ----------
- * Wait Classes
- * ----------
- */
-#define PG_WAIT_LWLOCK                0x01000000U
-#define PG_WAIT_LOCK                0x03000000U
-#define PG_WAIT_BUFFER_PIN            0x04000000U
-#define PG_WAIT_ACTIVITY            0x05000000U
-#define PG_WAIT_CLIENT                0x06000000U
-#define PG_WAIT_EXTENSION            0x07000000U
-#define PG_WAIT_IPC                    0x08000000U
-#define PG_WAIT_TIMEOUT                0x09000000U
-#define PG_WAIT_IO                    0x0A000000U
-
-/* ----------
- * Wait Events - Activity
- *
- * Use this category when a process is waiting because it has no work to do,
- * unless the "Client" or "Timeout" category describes the situation better.
- * Typically, this should only be used for background processes.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_ARCHIVER_MAIN = PG_WAIT_ACTIVITY,
-    WAIT_EVENT_AUTOVACUUM_MAIN,
-    WAIT_EVENT_BGWRITER_HIBERNATE,
-    WAIT_EVENT_BGWRITER_MAIN,
-    WAIT_EVENT_CHECKPOINTER_MAIN,
-    WAIT_EVENT_LOGICAL_APPLY_MAIN,
-    WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
-    WAIT_EVENT_RECOVERY_WAL_ALL,
-    WAIT_EVENT_RECOVERY_WAL_STREAM,
-    WAIT_EVENT_SYSLOGGER_MAIN,
-    WAIT_EVENT_WAL_RECEIVER_MAIN,
-    WAIT_EVENT_WAL_SENDER_MAIN,
-    WAIT_EVENT_WAL_WRITER_MAIN
-} WaitEventActivity;
-
-/* ----------
- * Wait Events - Client
- *
- * Use this category when a process is waiting to send data to or receive data
- * from the frontend process to which it is connected.  This is never used for
- * a background process, which has no client connection.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_CLIENT_READ = PG_WAIT_CLIENT,
-    WAIT_EVENT_CLIENT_WRITE,
-    WAIT_EVENT_LIBPQWALRECEIVER_CONNECT,
-    WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE,
-    WAIT_EVENT_SSL_OPEN_SERVER,
-    WAIT_EVENT_WAL_RECEIVER_WAIT_START,
-    WAIT_EVENT_WAL_SENDER_WAIT_WAL,
-    WAIT_EVENT_WAL_SENDER_WRITE_DATA
-} WaitEventClient;
-
-/* ----------
- * Wait Events - IPC
- *
- * Use this category when a process cannot complete the work it is doing because
- * it is waiting for a notification from another process.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_BGWORKER_SHUTDOWN = PG_WAIT_IPC,
-    WAIT_EVENT_BGWORKER_STARTUP,
-    WAIT_EVENT_BTREE_PAGE,
-    WAIT_EVENT_CLOG_GROUP_UPDATE,
-    WAIT_EVENT_EXECUTE_GATHER,
-    WAIT_EVENT_HASH_BATCH_ALLOCATING,
-    WAIT_EVENT_HASH_BATCH_ELECTING,
-    WAIT_EVENT_HASH_BATCH_LOADING,
-    WAIT_EVENT_HASH_BUILD_ALLOCATING,
-    WAIT_EVENT_HASH_BUILD_ELECTING,
-    WAIT_EVENT_HASH_BUILD_HASHING_INNER,
-    WAIT_EVENT_HASH_BUILD_HASHING_OUTER,
-    WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING,
-    WAIT_EVENT_HASH_GROW_BATCHES_DECIDING,
-    WAIT_EVENT_HASH_GROW_BATCHES_ELECTING,
-    WAIT_EVENT_HASH_GROW_BATCHES_FINISHING,
-    WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING,
-    WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING,
-    WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING,
-    WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING,
-    WAIT_EVENT_LOGICAL_SYNC_DATA,
-    WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
-    WAIT_EVENT_MQ_INTERNAL,
-    WAIT_EVENT_MQ_PUT_MESSAGE,
-    WAIT_EVENT_MQ_RECEIVE,
-    WAIT_EVENT_MQ_SEND,
-    WAIT_EVENT_PARALLEL_BITMAP_SCAN,
-    WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN,
-    WAIT_EVENT_PARALLEL_FINISH,
-    WAIT_EVENT_PROCARRAY_GROUP_UPDATE,
-    WAIT_EVENT_PROMOTE,
-    WAIT_EVENT_REPLICATION_ORIGIN_DROP,
-    WAIT_EVENT_REPLICATION_SLOT_DROP,
-    WAIT_EVENT_SAFE_SNAPSHOT,
-    WAIT_EVENT_SYNC_REP
-} WaitEventIPC;
-
-/* ----------
- * Wait Events - Timeout
- *
- * Use this category when a process is waiting for a timeout to expire.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
-    WAIT_EVENT_PG_SLEEP,
-    WAIT_EVENT_RECOVERY_APPLY_DELAY
-} WaitEventTimeout;
-
-/* ----------
- * Wait Events - IO
- *
- * Use this category when a process is waiting for a IO.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_BUFFILE_READ = PG_WAIT_IO,
-    WAIT_EVENT_BUFFILE_WRITE,
-    WAIT_EVENT_CONTROL_FILE_READ,
-    WAIT_EVENT_CONTROL_FILE_SYNC,
-    WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
-    WAIT_EVENT_CONTROL_FILE_WRITE,
-    WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE,
-    WAIT_EVENT_COPY_FILE_READ,
-    WAIT_EVENT_COPY_FILE_WRITE,
-    WAIT_EVENT_DATA_FILE_EXTEND,
-    WAIT_EVENT_DATA_FILE_FLUSH,
-    WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC,
-    WAIT_EVENT_DATA_FILE_PREFETCH,
-    WAIT_EVENT_DATA_FILE_READ,
-    WAIT_EVENT_DATA_FILE_SYNC,
-    WAIT_EVENT_DATA_FILE_TRUNCATE,
-    WAIT_EVENT_DATA_FILE_WRITE,
-    WAIT_EVENT_DSM_FILL_ZERO_WRITE,
-    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ,
-    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC,
-    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE,
-    WAIT_EVENT_LOCK_FILE_CREATE_READ,
-    WAIT_EVENT_LOCK_FILE_CREATE_SYNC,
-    WAIT_EVENT_LOCK_FILE_CREATE_WRITE,
-    WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ,
-    WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC,
-    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC,
-    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE,
-    WAIT_EVENT_LOGICAL_REWRITE_SYNC,
-    WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE,
-    WAIT_EVENT_LOGICAL_REWRITE_WRITE,
-    WAIT_EVENT_RELATION_MAP_READ,
-    WAIT_EVENT_RELATION_MAP_SYNC,
-    WAIT_EVENT_RELATION_MAP_WRITE,
-    WAIT_EVENT_REORDER_BUFFER_READ,
-    WAIT_EVENT_REORDER_BUFFER_WRITE,
-    WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ,
-    WAIT_EVENT_REPLICATION_SLOT_READ,
-    WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
-    WAIT_EVENT_REPLICATION_SLOT_SYNC,
-    WAIT_EVENT_REPLICATION_SLOT_WRITE,
-    WAIT_EVENT_SLRU_FLUSH_SYNC,
-    WAIT_EVENT_SLRU_READ,
-    WAIT_EVENT_SLRU_SYNC,
-    WAIT_EVENT_SLRU_WRITE,
-    WAIT_EVENT_SNAPBUILD_READ,
-    WAIT_EVENT_SNAPBUILD_SYNC,
-    WAIT_EVENT_SNAPBUILD_WRITE,
-    WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC,
-    WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE,
-    WAIT_EVENT_TIMELINE_HISTORY_READ,
-    WAIT_EVENT_TIMELINE_HISTORY_SYNC,
-    WAIT_EVENT_TIMELINE_HISTORY_WRITE,
-    WAIT_EVENT_TWOPHASE_FILE_READ,
-    WAIT_EVENT_TWOPHASE_FILE_SYNC,
-    WAIT_EVENT_TWOPHASE_FILE_WRITE,
-    WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ,
-    WAIT_EVENT_WAL_BOOTSTRAP_SYNC,
-    WAIT_EVENT_WAL_BOOTSTRAP_WRITE,
-    WAIT_EVENT_WAL_COPY_READ,
-    WAIT_EVENT_WAL_COPY_SYNC,
-    WAIT_EVENT_WAL_COPY_WRITE,
-    WAIT_EVENT_WAL_INIT_SYNC,
-    WAIT_EVENT_WAL_INIT_WRITE,
-    WAIT_EVENT_WAL_READ,
-    WAIT_EVENT_WAL_SYNC,
-    WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-    WAIT_EVENT_WAL_WRITE
-} WaitEventIO;
-
-/* ----------
- * Command type for progress reporting purposes
- * ----------
- */
-typedef enum ProgressCommandType
-{
-    PROGRESS_COMMAND_INVALID,
-    PROGRESS_COMMAND_VACUUM
-} ProgressCommandType;
-
-#define PGSTAT_NUM_PROGRESS_PARAM    10
-
-/* ----------
- * Shared-memory data structures
- * ----------
- */
-
-
-/*
- * PgBackendSSLStatus
- *
- * For each backend, we keep the SSL status in a separate struct, that
- * is only filled in if SSL is enabled.
- */
-typedef struct PgBackendSSLStatus
-{
-    /* Information about SSL connection */
-    int            ssl_bits;
-    bool        ssl_compression;
-    char        ssl_version[NAMEDATALEN];    /* MUST be null-terminated */
-    char        ssl_cipher[NAMEDATALEN];    /* MUST be null-terminated */
-    char        ssl_clientdn[NAMEDATALEN];    /* MUST be null-terminated */
-} PgBackendSSLStatus;
-
-
-/* ----------
- * PgBackendStatus
- *
- * Each live backend maintains a PgBackendStatus struct in shared memory
- * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
- * has no involvement in, or even access to, these structs.
- *
- * Each auxiliary process also maintains a PgBackendStatus struct in shared
- * memory.
- * ----------
- */
-typedef struct PgBackendStatus
-{
-    /*
-     * To avoid locking overhead, we use the following protocol: a backend
-     * increments st_changecount before modifying its entry, and again after
-     * finishing a modification.  A would-be reader should note the value of
-     * st_changecount, copy the entry into private memory, then check
-     * st_changecount again.  If the value hasn't changed, and if it's even,
-     * the copy is valid; otherwise start over.  This makes updates cheap
-     * while reads are potentially expensive, but that's the tradeoff we want.
-     *
-     * The above protocol needs the memory barriers to ensure that the
-     * apparent order of execution is as it desires. Otherwise, for example,
-     * the CPU might rearrange the code so that st_changecount is incremented
-     * twice before the modification on a machine with weak memory ordering.
-     * This surprising result can lead to bugs.
-     */
-    int            st_changecount;
-
-    /* The entry is valid iff st_procpid > 0, unused if st_procpid == 0 */
-    int            st_procpid;
-
-    /* Type of backends */
-    BackendType st_backendType;
-
-    /* Times when current backend, transaction, and activity started */
-    TimestampTz st_proc_start_timestamp;
-    TimestampTz st_xact_start_timestamp;
-    TimestampTz st_activity_start_timestamp;
-    TimestampTz st_state_start_timestamp;
-
-    /* Database OID, owning user's OID, connection client address */
-    Oid            st_databaseid;
-    Oid            st_userid;
-    SockAddr    st_clientaddr;
-    char       *st_clienthostname;    /* MUST be null-terminated */
-
-    /* Information about SSL connection */
-    bool        st_ssl;
-    PgBackendSSLStatus *st_sslstatus;
-
-    /* current state */
-    BackendState st_state;
-
-    /* application name; MUST be null-terminated */
-    char       *st_appname;
-
-    /*
-     * Current command string; MUST be null-terminated. Note that this string
-     * possibly is truncated in the middle of a multi-byte character. As
-     * activity strings are stored more frequently than read, that allows to
-     * move the cost of correct truncation to the display side. Use
-     * pgstat_clip_activity() to truncate correctly.
-     */
-    char       *st_activity_raw;
-
-    /*
-     * Command progress reporting.  Any command which wishes can advertise
-     * that it is running by setting st_progress_command,
-     * st_progress_command_target, and st_progress_param[].
-     * st_progress_command_target should be the OID of the relation which the
-     * command targets (we assume there's just one, as this is meant for
-     * utility commands), but the meaning of each element in the
-     * st_progress_param array is command-specific.
-     */
-    ProgressCommandType st_progress_command;
-    Oid            st_progress_command_target;
-    int64        st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
-} PgBackendStatus;
-
-/*
- * Macros to load and store st_changecount with the memory barriers.
- *
- * pgstat_increment_changecount_before() and
- * pgstat_increment_changecount_after() need to be called before and after
- * PgBackendStatus entries are modified, respectively. This makes sure that
- * st_changecount is incremented around the modification.
- *
- * Also pgstat_save_changecount_before() and pgstat_save_changecount_after()
- * need to be called before and after PgBackendStatus entries are copied into
- * private memory, respectively.
- */
-#define pgstat_increment_changecount_before(beentry)    \
-    do {    \
-        beentry->st_changecount++;    \
-        pg_write_barrier(); \
-    } while (0)
-
-#define pgstat_increment_changecount_after(beentry) \
-    do {    \
-        pg_write_barrier(); \
-        beentry->st_changecount++;    \
-        Assert((beentry->st_changecount & 1) == 0); \
-    } while (0)
-
-#define pgstat_save_changecount_before(beentry, save_changecount)    \
-    do {    \
-        save_changecount = beentry->st_changecount; \
-        pg_read_barrier();    \
-    } while (0)
-
-#define pgstat_save_changecount_after(beentry, save_changecount)    \
-    do {    \
-        pg_read_barrier();    \
-        save_changecount = beentry->st_changecount; \
-    } while (0)
-
-/* ----------
- * LocalPgBackendStatus
- *
- * When we build the backend status array, we use LocalPgBackendStatus to be
- * able to add new values to the struct when needed without adding new fields
- * to the shared memory. It contains the backend status as a first member.
- * ----------
- */
-typedef struct LocalPgBackendStatus
-{
-    /*
-     * Local version of the backend status entry.
-     */
-    PgBackendStatus backendStatus;
-
-    /*
-     * The xid of the current transaction if available, InvalidTransactionId
-     * if not.
-     */
-    TransactionId backend_xid;
-
-    /*
-     * The xmin of the current session if available, InvalidTransactionId if
-     * not.
-     */
-    TransactionId backend_xmin;
-} LocalPgBackendStatus;
-
 /*
  * Working state needed to accumulate per-function-call timing statistics.
  */
@@ -784,10 +364,8 @@ typedef struct PgStat_FunctionCallUsage
  * GUC parameters
  * ----------
  */
-extern bool pgstat_track_activities;
 extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
-extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
 
 /* No longer used, but will be removed with GUC */
@@ -836,26 +414,9 @@ extern void pgstat_report_deadlock(void);
 extern void pgstat_clear_snapshot(void);
 
 extern void pgstat_initialize(void);
+extern void pgstat_bearray_initialize(void);
 extern void pgstat_bestart(void);
 
-extern void pgstat_report_activity(BackendState state, const char *cmd_str);
-extern void pgstat_report_tempfile(size_t filesize);
-extern void pgstat_report_appname(const char *appname);
-extern void pgstat_report_xact_timestamp(TimestampTz tstamp);
-extern const char *pgstat_get_wait_event(uint32 wait_event_info);
-extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
-extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
-extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
-                                    int buflen);
-extern const char *pgstat_get_backend_desc(BackendType backendType);
-
-extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
-                              Oid relid);
-extern void pgstat_progress_update_param(int index, int64 val);
-extern void pgstat_progress_update_multi_param(int nparam, const int *index,
-                                   const int64 *val);
-extern void pgstat_progress_end_command(void);
-
 extern PgStat_TableStatus *find_tabstat_entry(Oid rel_id);
 extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 
@@ -866,60 +427,6 @@ extern PgStat_StatDBEntry *backend_get_db_entry(Oid dbid, bool oneshot);
 extern HTAB *backend_snapshot_all_db_entries(void);
 extern PgStat_StatTabEntry *backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid relid, bool oneshot);
 
-/* ----------
- * pgstat_report_wait_start() -
- *
- *    Called from places where server process needs to wait.  This is called
- *    to report wait event information.  The wait information is stored
- *    as 4-bytes where first byte represents the wait event class (type of
- *    wait, for different types of wait, refer WaitClass) and the next
- *    3-bytes represent the actual wait event.  Currently 2-bytes are used
- *    for wait event which is sufficient for current usage, 1-byte is
- *    reserved for future usage.
- *
- * NB: this *must* be able to survive being called before MyProc has been
- * initialized.
- * ----------
- */
-static inline void
-pgstat_report_wait_start(uint32 wait_event_info)
-{
-    volatile PGPROC *proc = MyProc;
-
-    if (!pgstat_track_activities || !proc)
-        return;
-
-    /*
-     * Since this is a four-byte field which is always read and written as
-     * four-bytes, updates are atomic.
-     */
-    proc->wait_event_info = wait_event_info;
-}
-
-/* ----------
- * pgstat_report_wait_end() -
- *
- *    Called to report end of a wait.
- *
- * NB: this *must* be able to survive being called before MyProc has been
- * initialized.
- * ----------
- */
-static inline void
-pgstat_report_wait_end(void)
-{
-    volatile PGPROC *proc = MyProc;
-
-    if (!pgstat_track_activities || !proc)
-        return;
-
-    /*
-     * Since this is a four-byte field which is always read and written as
-     * four-bytes, updates are atomic.
-     */
-    proc->wait_event_info = 0;
-}
-
 /* nontransactional event counts are simple enough to inline */
 
 #define pgstat_count_heap_scan(rel)                                    \
@@ -987,6 +494,8 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 extern void pgstat_update_archiver(const char *xlog, bool failed);
 extern void pgstat_update_bgwriter(void);
 
+extern void pgstat_report_tempfile(size_t filesize);
+
 /* ----------
  * Support functions for the SQL-callable functions to
  * generate the pgstat* views.
@@ -994,10 +503,7 @@ extern void pgstat_update_bgwriter(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid relid, bool oneshot);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
-extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
-extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
-extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
 
-- 
2.16.3

From 824320b129e55f4111b033b42225ecdbe1576f7d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 4 Jul 2018 11:44:31 +0900
Subject: [PATCH 7/7] Documentation update

Remove all descriptions of the pg_stat_tmp directory from the documentation.
---
 doc/src/sgml/backup.sgml     |  2 --
 doc/src/sgml/config.sgml     | 19 -------------------
 doc/src/sgml/monitoring.sgml |  7 +------
 doc/src/sgml/storage.sgml    |  3 +--
 4 files changed, 2 insertions(+), 29 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index a73fd4d044..95285809c2 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1119,8 +1119,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b6f5822b84..8a5291a18d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6671,25 +6671,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 60a85a7898..fa483ef0f7 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -197,12 +197,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
+   <productname>PostgreSQL</productname> processes through shared memory.
    When the server shuts down cleanly, a permanent copy of the statistics
    data is stored in the <filename>pg_stat</filename> subdirectory, so that
    statistics can be retained across server restarts.  When recovery is
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 8ef2ac8010..e137e6b494 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -122,8 +122,7 @@ Item
 
 <row>
  <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
+ <entry>Subdirectory containing ephemeral files for extensions</entry>
 </row>
 
 <row>
-- 
2.16.3


Re: shared-memory based stats collector

From
Kyotaro HORIGUCHI
Date:
Hello.

At Mon, 21 Jan 2019 21:19:07 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20190121.211907.59625409.horiguchi.kyotaro@lab.ntt.co.jp>
> I'll reconsider the referer side of the stats.

The most significant cause of the slowdown is the repeated search for
non-existent entries in both the local and the shared hash on every
lookup.

A negative cache, combined with a cache expiration interval,
eliminates the slowdown.

1000 repetitions with an -O2 binary:

 master : 124.99 tps
 patched: 125.48 tps (+0.4%)

> I didn't merge the suggested two pairs of commits. I'll do that
> after addressing the slowdown issue.


I agree to committing 0001-0003 separately. In the attached patch
set, the old 0004+0006 are merged as the new 0004, and the old
0005+0007 are merged as the new 0005.

Changed the caching policy.
  Before: the cache expired at every xact end
    -> Now: it is kept for at least PGSTAT_STAT_MIN_INTERVAL (500ms).

Added a negative cache feature (snapshot_statentry).

Improved separation between pgstat and bestat (separated AtEOXact_* functions).

Fixed dubious memory context usage.
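
For illustration only, the combination of a negative cache and a minimum
retention interval described above can be sketched as below. This is not
the code in the patch: LocalCache, cache_lookup and demo_shared_lookup are
hypothetical names, the shared hash is replaced by a callback, and the
only value taken from this message is the 500ms interval.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS     64      /* hypothetical fixed size */
#define MIN_INTERVAL_MS 500     /* mirrors PGSTAT_STAT_MIN_INTERVAL */

typedef struct CacheSlot
{
    bool        used;       /* slot holds an answer for 'key' */
    bool        negative;   /* true: key is known to be absent */
    uint32_t    key;
    long        value;      /* meaningful only if !negative */
} CacheSlot;

typedef struct LocalCache
{
    bool        valid;          /* does the cache hold a live snapshot? */
    int64_t     filled_at_ms;   /* when the current snapshot was started */
    int         shared_lookups; /* number of trips to the shared store */
    CacheSlot   slots[CACHE_SLOTS];
} LocalCache;

/* Hypothetical stand-in for a lookup in the shared hash. */
typedef bool (*shared_lookup_fn) (uint32_t key, long *value);

/* Demo shared store: only key 42 exists, with value 7. */
bool
demo_shared_lookup(uint32_t key, long *value)
{
    if (key != 42)
        return false;
    *value = 7;
    return true;
}

/*
 * Look up 'key' as of time 'now_ms'.  Returns true and sets *value if the
 * entry exists.  Misses are remembered too, so a repeated search for a
 * non-existent entry does not reach the shared store again until the
 * snapshot expires.
 */
bool
cache_lookup(LocalCache *cache, uint32_t key, int64_t now_ms,
             shared_lookup_fn shared_lookup, long *value)
{
    CacheSlot  *slot;

    assert(cache != NULL && shared_lookup != NULL && value != NULL);

    /* Drop the whole snapshot only after the minimum interval has passed. */
    if (cache->valid && now_ms - cache->filled_at_ms >= MIN_INTERVAL_MS)
    {
        memset(cache->slots, 0, sizeof(cache->slots));
        cache->valid = false;
    }
    if (!cache->valid)
    {
        cache->filled_at_ms = now_ms;
        cache->valid = true;
    }

    /* Direct-mapped slot; a colliding key simply overwrites it. */
    slot = &cache->slots[key % CACHE_SLOTS];
    if (!slot->used || slot->key != key)
    {
        slot->used = true;
        slot->key = key;
        slot->negative = !shared_lookup(key, &slot->value);
        cache->shared_lookups++;
    }

    if (slot->negative)
        return false;
    *value = slot->value;
    return true;
}
```

In this sketch negative entries expire together with positive ones, so an
entry created in the shared store becomes visible to the reader after at
most one interval.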

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



From 7149e93d7b41af0c7ce1cddc847a9bb7bc31b1e7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:41:04 +0900
Subject: [PATCH 1/5] sequential scan for dshash

Add sequential scan feature to dshash.
---
 src/backend/lib/dshash.c | 188 ++++++++++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  23 +++++-
 2 files changed, 206 insertions(+), 5 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index f095196fb6..d1908a6137 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -112,6 +112,7 @@ struct dshash_table
     size_t        size_log2;        /* log2(number of buckets) */
     bool        find_locked;    /* Is any partition lock held by 'find'? */
     bool        find_exclusively_locked;    /* ... exclusively? */
+    bool        seqscan_running;    /* is a sequential scan running? */
 };
 
 /* Given a pointer to an item, find the entry (user data) it holds. */
@@ -127,6 +128,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there at a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +158,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -228,6 +237,7 @@ dshash_create(dsa_area *area, const dshash_parameters *params, void *arg)
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
 
     /*
      * Set up the initial array of buckets.  Our initial size is the same as
@@ -279,6 +289,7 @@ dshash_attach(dsa_area *area, const dshash_parameters *params,
     hash_table->control = dsa_get_address(area, control);
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
     Assert(hash_table->control->magic == DSHASH_MAGIC);
 
     /*
@@ -324,7 +335,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
                                 LW_EXCLUSIVE));
 
     delete_item(hash_table, item);
-    hash_table->find_locked = false;
-    hash_table->find_exclusively_locked = false;
-    LWLockRelease(PARTITION_LOCK(hash_table, partition));
+
+    /* We need to keep the partition lock during a sequential scan */
+    if (!hash_table->seqscan_running)
+    {
+        hash_table->find_locked = false;
+        hash_table->find_exclusively_locked = false;
+        LWLockRelease(PARTITION_LOCK(hash_table, partition));
+    }
 }
 
 /*
@@ -568,6 +584,8 @@ dshash_release_lock(dshash_table *hash_table, void *entry)
     Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition_index),
                                 hash_table->find_exclusively_locked
                                 ? LW_EXCLUSIVE : LW_SHARED));
+    /* lock is under control of sequential scan */
+    Assert(!hash_table->seqscan_running);
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
@@ -592,6 +610,168 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through a dshash table and return all the
+ *           elements one by one; return NULL when there are no more.
+ *
+ * dshash_seq_term should be called if and only if the scan is abandoned
+ * before completion; if dshash_seq_next returns NULL then it has already done
+ * the end-of-scan cleanup.
+ *
+ * A returned element is locked, as is the case with dshash_find.
+ * However, the caller must not release the lock. The lock is released as
+ * necessary as the scan continues.
+ *
+ * As opposed to the equivalent for dynahash, the caller is not supposed to
+ * delete the returned element before continuing the scan.
+ *
+ * If consistent is set for dshash_seq_init, all partitions are locked for the
+ * whole scan (exclusively if exclusive is set, otherwise shared). If not, only
+ * the partition being scanned is locked, in the same mode.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool consistent, bool exclusive)
+{
+    /* at most one scan is allowed at a time */
+    Assert(!hash_table->seqscan_running);
+
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->consistent = consistent;
+    status->exclusive = exclusive;
+    hash_table->seqscan_running = true;
+
+    /*
+     * Protect all partitions from modification if the caller wants a
+     * consistent result.
+     */
+    if (consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+        {
+            Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i)));
+
+            LWLockAcquire(PARTITION_LOCK(hash_table, i),
+                          exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        }
+        ensure_valid_bucket_pointers(hash_table);
+    }
+}
+
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    Assert(status->hash_table->seqscan_running);
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert(status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        if (!status->consistent)
+        {
+            partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+            LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            status->curpartition = partition;
+
+            /* resize doesn't happen from now until seq scan ends */
+            status->nbuckets =
+                NUM_BUCKETS(status->hash_table->control->size_log2);
+            ensure_valid_bucket_pointers(status->hash_table);
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            dshash_seq_term(status);
+            return NULL;
+        }
+
+        /* Also move partition lock if needed */
+        if (!status->consistent)
+        {
+            int next_partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+
+            /* Move lock along with partition for the bucket */
+            if (status->curpartition != next_partition)
+            {
+                /*
+                 * Take lock on the next partition then release the current,
+                 * not in the reverse order. This is required to prevent
+                 * resizing from happening during a sequential scan. Locks are
+                 * taken in partition order so no deadlock happens with other
+                 * seq scans or resizing.
+                 */
+                LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                             next_partition),
+                              status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+                LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                             status->curpartition));
+                status->curpartition = next_partition;
+            }
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * The caller may delete this item. Remember the next item now so that
+     * the following iteration does not depend on the current one.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    Assert(status->hash_table->seqscan_running);
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+    status->hash_table->seqscan_running = false;
+
+    if (status->consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+            LWLockRelease(PARTITION_LOCK(status->hash_table, i));
+    }
+    else if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index e5dfd57f0a..b80f3af995 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,23 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state of dshash. The definition is exposed so that callers
+ * know the storage size, but callers should treat it as an opaque
+ * type.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;
+    int                    curbucket;
+    int                    nbuckets;
+    dshash_table_item  *curitem;
+    dsa_pointer            pnextitem;
+    int                    curpartition;
+    bool                consistent;
+    bool                exclusive;
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
               const dshash_parameters *params,
@@ -70,7 +87,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
 extern void dshash_detach(dshash_table *hash_table);
 extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
 extern void dshash_destroy(dshash_table *hash_table);
-
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
             const void *key, bool exclusive);
@@ -80,6 +96,11 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool consistent, bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.16.3
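
The hand-over-hand partition locking that dshash_seq_next() uses for a non-consistent scan can be illustrated outside PostgreSQL. The following is only a toy sketch, not code from the patch: it assumes plain pthread mutexes in place of LWLocks, a fixed-size table, and a simple contiguous bucket-to-partition mapping. All toy_* names are hypothetical. It shows that the next partition lock is taken before the current one is released, so at most two locks are held at any time.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define TOY_NUM_PARTITIONS 4
#define TOY_BUCKETS_PER_PARTITION 2
#define TOY_NUM_BUCKETS (TOY_NUM_PARTITIONS * TOY_BUCKETS_PER_PARTITION)

/* Hypothetical stand-in for a partitioned hash table. */
typedef struct toy_table
{
    pthread_mutex_t locks[TOY_NUM_PARTITIONS];  /* one lock per partition */
    int             buckets[TOY_NUM_BUCKETS];   /* item count per bucket */
} toy_table;

static void
toy_init(toy_table *t)
{
    for (int i = 0; i < TOY_NUM_PARTITIONS; i++)
        pthread_mutex_init(&t->locks[i], NULL);
    for (int i = 0; i < TOY_NUM_BUCKETS; i++)
        t->buckets[i] = 0;
}

/*
 * Walk all buckets in order and return the total number of items seen.
 * *max_held receives the largest number of locks held simultaneously.
 */
static int
toy_seq_scan(toy_table *t, int *max_held)
{
    int curpartition = -1;
    int held = 0;
    int total = 0;

    *max_held = 0;
    for (int b = 0; b < TOY_NUM_BUCKETS; b++)
    {
        int next = b / TOY_BUCKETS_PER_PARTITION;   /* assumed mapping */

        if (next != curpartition)
        {
            /* Acquire the next lock first, then release the current one. */
            pthread_mutex_lock(&t->locks[next]);
            if (++held > *max_held)
                *max_held = held;
            if (curpartition >= 0)
            {
                pthread_mutex_unlock(&t->locks[curpartition]);
                held--;
            }
            curpartition = next;
        }
        total += t->buckets[b];
    }

    /* End of scan: drop the last partition lock. */
    if (curpartition >= 0)
        pthread_mutex_unlock(&t->locks[curpartition]);
    return total;
}
```

Because locks are always taken in increasing partition order, two such scans cannot wait on each other in a cycle, which is the same argument the patch comment makes.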

From 8dafcc8293b856f42bc3a68fa792ea139fd8d0cf Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 27 Sep 2018 11:15:19 +0900
Subject: [PATCH 2/5] Add conditional lock feature to dshash

Dshash currently waits for locks unconditionally. This commit adds new
interfaces for dshash_find and dshash_find_or_insert. The new
interfaces take an extra parameter "nowait" that tells them not to
wait for the lock.
---
 src/backend/lib/dshash.c | 69 +++++++++++++++++++++++++++++++++++++++++++-----
 src/include/lib/dshash.h |  6 ++++-
 2 files changed, 67 insertions(+), 8 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index d1908a6137..db8d6899af 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -394,19 +394,48 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  */
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
+{
+    return dshash_find_extended(hash_table, key, exclusive, false, NULL);
+}
+
+/*
+ * Like dshash_find, but returns immediately when nowait is true and the lock
+ * was not acquired. Lock status is stored in *lock_acquired if it is not NULL.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool *lock_acquired)
 {
     dshash_hash hash;
     size_t        partition;
     dshash_table_item *item;
 
+    /* asking for the lock status only makes sense together with nowait */
+    Assert(nowait || !lock_acquired);
+
     hash = hash_key(hash_table, key);
     partition = PARTITION_FOR_HASH(hash);
 
     Assert(hash_table->control->magic == DSHASH_MAGIC);
     Assert(!hash_table->find_locked);
 
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partition),
+                                      exclusive ? LW_EXCLUSIVE : LW_SHARED))
+        {
+            if (lock_acquired)
+                *lock_acquired = false;
+            return NULL;
+        }
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition),
+                      exclusive ? LW_EXCLUSIVE : LW_SHARED);
+
+    if (lock_acquired)
+        *lock_acquired = true;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -441,6 +470,22 @@ void *
 dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
+{
+    return dshash_find_or_insert_extended(hash_table, key, found, false);
+}
+
+/*
+ * Like dshash_find_or_insert, but returns NULL if nowait is true and the lock
+ * was not acquired.
+ *
+ * Notes above dshash_find_extended() regarding locking and error handling
+ * equally apply here.
+ */
+void *
+dshash_find_or_insert_extended(dshash_table *hash_table,
+                               const void *key,
+                               bool *found,
+                               bool nowait)
 {
     dshash_hash hash;
     size_t        partition_index;
@@ -455,8 +500,16 @@ dshash_find_or_insert(dshash_table *hash_table,
     Assert(!hash_table->find_locked);
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(
+                PARTITION_LOCK(hash_table, partition_index),
+                LW_EXCLUSIVE))
+            return NULL;
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
+                      LW_EXCLUSIVE);
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -626,9 +679,11 @@ dshash_memhash(const void *v, size_t size, void *arg)
  * As opposed to the equivalent for dynanash, the caller is not supposed to
  * delete the returned element before continuing the scan.
  *
- * If consistent is set for dshash_seq_init, the whole hash table is
- * non-exclusively locked. Otherwise a part of the hash table is locked in the
- * same mode (partition lock).
+ * If consistent is set for dshash_seq_init, all hash table
+ * partitions are locked in the requested mode (as determined by the
+ * exclusive flag), and the locks are held until the end of the scan.
+ * Otherwise the partition locks are acquired and released as needed
+ * during the scan (up to two partitions may be locked at the same time).
  */
 void
 dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b80f3af995..fe1d4d75c5 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -90,8 +90,12 @@ extern void dshash_destroy(dshash_table *hash_table);
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
             const void *key, bool exclusive);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+            bool exclusive, bool nowait, bool *lock_acquired);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
-                      const void *key, bool *found);
+            const void *key, bool *found);
+extern void *dshash_find_or_insert_extended(dshash_table *hash_table,
+            const void *key, bool *found, bool nowait);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.16.3
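
The "nowait" contract added by this patch can be sketched with a toy example. This is not the dshash code: it assumes a single pthread mutex standing in for a partition lock, a tiny linear table, and hypothetical toy_* names. What it illustrates is how a caller can tell "entry not found" apart from "lock was busy" through the lock_acquired output flag, and that a found entry is returned with the lock still held.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NKEYS 8

typedef struct toy_entry
{
    int  key;
    int  value;
    bool used;
} toy_entry;

/* Hypothetical stand-in for one dshash partition. */
typedef struct toy_hash
{
    pthread_mutex_t lock;
    toy_entry       entries[TOY_NKEYS];
} toy_hash;

static void
toy_hash_init(toy_hash *h)
{
    pthread_mutex_init(&h->lock, NULL);
    for (int i = 0; i < TOY_NKEYS; i++)
        h->entries[i].used = false;
}

/*
 * Look up key.  With nowait, give up immediately if the lock is busy.
 * On success the entry is returned with the lock held; the caller must
 * call toy_release().  If the key is absent the lock is released here.
 */
static toy_entry *
toy_find_extended(toy_hash *h, int key, bool nowait, bool *lock_acquired)
{
    if (nowait)
    {
        if (pthread_mutex_trylock(&h->lock) != 0)
        {
            if (lock_acquired)
                *lock_acquired = false;
            return NULL;            /* lock busy, nothing was searched */
        }
    }
    else
        pthread_mutex_lock(&h->lock);

    if (lock_acquired)
        *lock_acquired = true;

    for (int i = 0; i < TOY_NKEYS; i++)
    {
        if (h->entries[i].used && h->entries[i].key == key)
            return &h->entries[i];  /* lock stays held */
    }

    pthread_mutex_unlock(&h->lock);
    return NULL;                    /* searched, but not found */
}

static void
toy_release(toy_hash *h)
{
    pthread_mutex_unlock(&h->lock);
}
```

A caller that receives NULL checks the flag: false means the lookup should be retried later, true means the key really is absent.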

From 90522c1de96ac84ba2ad7cc1ada47c7bb9f95e10 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 7 Nov 2018 16:53:49 +0900
Subject: [PATCH 3/5] Make archiver process an auxiliary process

This is a preliminary patch for shared-memory based stats collector.
Archiver process must be an auxiliary process since it uses shared
memory after stats data was moved into shared memory. Make the process
an auxiliary process in order to make it work.
---
 src/backend/bootstrap/bootstrap.c   |  8 +++
 src/backend/postmaster/pgarch.c     | 98 +++++++++----------------------------
 src/backend/postmaster/pgstat.c     |  6 +++
 src/backend/postmaster/postmaster.c | 35 +++++++++----
 src/include/miscadmin.h             |  2 +
 src/include/pgstat.h                |  1 +
 src/include/postmaster/pgarch.h     |  4 +-
 7 files changed, 67 insertions(+), 87 deletions(-)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 63bb134949..df926d8dea 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -329,6 +329,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case BgWriterProcess:
                 statmsg = pgstat_get_backend_desc(B_BG_WRITER);
                 break;
+            case ArchiverProcess:
+                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
+                break;
             case CheckpointerProcess:
                 statmsg = pgstat_get_backend_desc(B_CHECKPOINTER);
                 break;
@@ -456,6 +459,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
             BackgroundWriterMain();
             proc_exit(1);        /* should never return */
 
+        case ArchiverProcess:
+            /* don't set signals, archiver has its own agenda */
+            PgArchiverMain();
+            proc_exit(1);        /* should never return */
+
         case CheckpointerProcess:
             /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index f84f882c4c..4342ebdab4 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -77,7 +77,6 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
@@ -96,7 +95,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgarch_exit(SIGNAL_ARGS);
 static void ArchSigHupHandler(SIGNAL_ARGS);
 static void ArchSigTermHandler(SIGNAL_ARGS);
@@ -114,75 +112,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -222,8 +151,8 @@ pgarch_forkexec(void)
  *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
  *    since we don't use 'em, it hardly matters...
  */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -255,8 +184,27 @@ PgArchiverMain(int argc, char *argv[])
 static void
 pgarch_exit(SIGNAL_ARGS)
 {
-    /* SIGQUIT means curl up and die ... */
-    exit(1);
+    PG_SETMASK(&BlockSig);
+
+    /*
+     * We DO NOT want to run proc_exit() callbacks -- we're here because
+     * shared memory may be corrupted, so we don't want to try to clean up our
+     * transaction.  Just nail the windows shut and get out of town.  Now that
+     * there's an atexit callback to prevent third-party code from breaking
+     * things by calling exit() directly, we have to reset the callbacks
+     * explicitly to make this work as intended.
+     */
+    on_exit_reset();
+
+    /*
+     * Note we do exit(2) not exit(0).  This is to force the postmaster into a
+     * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+     * backend.  This is necessary precisely because we don't clean up our
+     * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+     * should ensure the postmaster sees this as a crash, too, but no harm in
+     * being doubly sure.)
+     */
+    exit(2);
 }
 
 /* SIGHUP signal handler for archiver process */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 13da412c59..d1fe052abf 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2857,6 +2857,9 @@ pgstat_bestart(void)
             case BgWriterProcess:
                 beentry->st_backendType = B_BG_WRITER;
                 break;
+            case ArchiverProcess:
+                beentry->st_backendType = B_ARCHIVER;
+                break;
             case CheckpointerProcess:
                 beentry->st_backendType = B_CHECKPOINTER;
                 break;
@@ -4119,6 +4122,9 @@ pgstat_get_backend_desc(BackendType backendType)
         case B_BG_WRITER:
             backendDesc = "background writer";
             break;
+        case B_ARCHIVER:
+            backendDesc = "archiver";
+            break;
         case B_CHECKPOINTER:
             backendDesc = "checkpointer";
             break;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 3052bbbc21..65eab02b3e 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -146,7 +146,8 @@
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
 #define BACKEND_TYPE_BGWORKER    0x0008    /* bgworker process */
-#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
+#define BACKEND_TYPE_ARCHIVER    0x0010    /* archiver process */
+#define BACKEND_TYPE_ALL        0x001F    /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -539,6 +540,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
 #define StartWalReceiver()        StartChildProcess(WalReceiverProcess)
@@ -1757,7 +1759,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -2920,7 +2922,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3065,10 +3067,8 @@ reaper(SIGNAL_ARGS)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3314,7 +3314,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3519,6 +3519,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3795,6 +3807,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5064,7 +5077,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5342,6 +5355,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case CheckpointerProcess:
                 ereport(LOG,
                         (errmsg("could not fork checkpointer process: %m")));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index c9e35003a5..63a7653457 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -399,6 +399,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -411,6 +412,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 313ca5f3c3..f299d1d601 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -706,6 +706,7 @@ typedef enum BackendType
     B_BACKEND,
     B_BG_WORKER,
     B_BG_WRITER,
+    B_ARCHIVER,
     B_CHECKPOINTER,
     B_STARTUP,
     B_WAL_RECEIVER,
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index 2474eac26a..88f16863d4 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
-- 
2.16.3

From db229efcd159a0dfcc5d3420b7dcba7f918c8419 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 12 Nov 2018 17:26:33 +0900
Subject: [PATCH 4/5] Shared-memory based stats collector

Previously, activity statistics were shared via files on disk. Every
backend sent its numbers to the stats collector process via a socket.
The collector wrote snapshots as a set of files on disk at a certain
interval, and every backend read them as necessary. This worked fine
for a comparatively small set of statistics, but the set is under
pressure to grow and the file size has reached the order of
megabytes. To deal with a larger statistics set, this patch lets
backends share the statistics directly via shared memory.
---
 contrib/pg_prewarm/autoprewarm.c                   |    2 +-
 contrib/pg_stat_statements/pg_stat_statements.c    |    1 +
 contrib/postgres_fdw/connection.c                  |    2 +-
 src/backend/Makefile                               |    2 +-
 src/backend/access/heap/rewriteheap.c              |    4 +-
 src/backend/access/heap/vacuumlazy.c               |    1 +
 src/backend/access/nbtree/nbtree.c                 |    2 +-
 src/backend/access/nbtree/nbtsort.c                |    2 +-
 src/backend/access/transam/clog.c                  |    2 +-
 src/backend/access/transam/parallel.c              |    2 +-
 src/backend/access/transam/slru.c                  |    2 +-
 src/backend/access/transam/timeline.c              |    2 +-
 src/backend/access/transam/twophase.c              |    2 +
 src/backend/access/transam/xact.c                  |    3 +
 src/backend/access/transam/xlog.c                  |    5 +-
 src/backend/access/transam/xlogfuncs.c             |    2 +-
 src/backend/access/transam/xlogutils.c             |    2 +-
 src/backend/bootstrap/bootstrap.c                  |    8 +-
 src/backend/executor/execParallel.c                |    2 +-
 src/backend/executor/nodeBitmapHeapscan.c          |    1 +
 src/backend/executor/nodeGather.c                  |    2 +-
 src/backend/executor/nodeHash.c                    |    2 +-
 src/backend/executor/nodeHashjoin.c                |    2 +-
 src/backend/libpq/be-secure-openssl.c              |    2 +-
 src/backend/libpq/be-secure.c                      |    2 +-
 src/backend/libpq/pqmq.c                           |    2 +-
 src/backend/postmaster/Makefile                    |    2 +-
 src/backend/postmaster/autovacuum.c                |   60 +-
 src/backend/postmaster/bgworker.c                  |    2 +-
 src/backend/postmaster/bgwriter.c                  |    5 +-
 src/backend/postmaster/checkpointer.c              |   25 +-
 src/backend/postmaster/pgarch.c                    |    5 +-
 src/backend/postmaster/pgstat.c                    | 6384 --------------------
 src/backend/postmaster/postmaster.c                |   86 +-
 src/backend/postmaster/syslogger.c                 |    2 +-
 src/backend/postmaster/walwriter.c                 |    2 +-
 src/backend/replication/basebackup.c               |    1 +
 .../libpqwalreceiver/libpqwalreceiver.c            |    2 +-
 src/backend/replication/logical/launcher.c         |    2 +-
 src/backend/replication/logical/origin.c           |    3 +-
 src/backend/replication/logical/reorderbuffer.c    |    2 +-
 src/backend/replication/logical/snapbuild.c        |    2 +-
 src/backend/replication/logical/tablesync.c        |   15 +-
 src/backend/replication/logical/worker.c           |   11 +-
 src/backend/replication/slot.c                     |    2 +-
 src/backend/replication/syncrep.c                  |    2 +-
 src/backend/replication/walreceiver.c              |    2 +-
 src/backend/replication/walsender.c                |    2 +-
 src/backend/statmon/Makefile                       |   17 +
 src/backend/statmon/bestatus.c                     | 1779 ++++++
 src/backend/statmon/pgstat.c                       | 3935 ++++++++++++
 src/backend/storage/buffer/bufmgr.c                |    9 +-
 src/backend/storage/file/buffile.c                 |    2 +-
 src/backend/storage/file/copydir.c                 |    2 +-
 src/backend/storage/file/fd.c                      |    1 +
 src/backend/storage/ipc/dsm.c                      |   24 +-
 src/backend/storage/ipc/dsm_impl.c                 |    2 +-
 src/backend/storage/ipc/ipci.c                     |    6 +
 src/backend/storage/ipc/latch.c                    |    2 +-
 src/backend/storage/ipc/procarray.c                |    2 +-
 src/backend/storage/ipc/shm_mq.c                   |    2 +-
 src/backend/storage/ipc/standby.c                  |    2 +-
 src/backend/storage/lmgr/deadlock.c                |    1 +
 src/backend/storage/lmgr/lwlock.c                  |    5 +-
 src/backend/storage/lmgr/lwlocknames.txt           |    1 +
 src/backend/storage/lmgr/predicate.c               |    2 +-
 src/backend/storage/lmgr/proc.c                    |    2 +-
 src/backend/storage/smgr/md.c                      |    2 +-
 src/backend/tcop/postgres.c                        |   28 +-
 src/backend/utils/adt/misc.c                       |    2 +-
 src/backend/utils/adt/pgstatfuncs.c                |   51 +-
 src/backend/utils/cache/relmapper.c                |    2 +-
 src/backend/utils/init/globals.c                   |    1 +
 src/backend/utils/init/miscinit.c                  |    2 +-
 src/backend/utils/init/postinit.c                  |   15 +
 src/backend/utils/misc/guc.c                       |    1 +
 src/bin/pg_basebackup/t/010_pg_basebackup.pl       |    2 +-
 src/include/bestatus.h                             |  545 ++
 src/include/miscadmin.h                            |    2 +-
 src/include/pgstat.h                               |  951 +--
 src/include/storage/dsm.h                          |    3 +
 src/include/storage/lwlock.h                       |    3 +
 src/include/utils/timeout.h                        |    1 +
 src/test/modules/worker_spi/worker_spi.c           |    2 +-
 84 files changed, 6588 insertions(+), 7501 deletions(-)
 delete mode 100644 src/backend/postmaster/pgstat.c
 create mode 100644 src/backend/statmon/Makefile
 create mode 100644 src/backend/statmon/bestatus.c
 create mode 100644 src/backend/statmon/pgstat.c
 create mode 100644 src/include/bestatus.h

diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index 45a5a26337..6296401b25 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -30,10 +30,10 @@
 
 #include "access/heapam.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "catalog/pg_class.h"
 #include "catalog/pg_type.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/bgworker.h"
 #include "storage/buf_internals.h"
 #include "storage/dsm.h"
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index f177ebaa2c..188d034387 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -62,6 +62,7 @@
 #include <unistd.h>
 
 #include "access/hash.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 239d220c24..1ea71245df 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -15,11 +15,11 @@
 #include "postgres_fdw.h"
 
 #include "access/htup_details.h"
+#include "bestatus.h"
 #include "catalog/pg_user_mapping.h"
 #include "access/xact.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/latch.h"
 #include "utils/hsearch.h"
 #include "utils/inval.h"
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 478a96db9b..cc511672c9 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -20,7 +20,7 @@ include $(top_builddir)/src/Makefile.global
 SUBDIRS = access bootstrap catalog parser commands executor foreign lib libpq \
     main nodes optimizer partitioning port postmaster \
     regex replication rewrite \
-    statistics storage tcop tsearch utils $(top_builddir)/src/timezone \
+    statistics statmon storage tcop tsearch utils $(top_builddir)/src/timezone \
     jit
 
 include $(srcdir)/common.mk
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index f6b0f1b093..ef40a2e7a2 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -115,12 +115,12 @@
 #include "access/xact.h"
 #include "access/xloginsert.h"
 
+#include "bestatus.h"
+
 #include "catalog/catalog.h"
 
 #include "lib/ilist.h"
 
-#include "pgstat.h"
-
 #include "replication/logical.h"
 #include "replication/slot.h"
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index c09eb6eff8..189db9b8fd 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -44,6 +44,7 @@
 #include "access/transam.h"
 #include "access/visibilitymap.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/storage.h"
 #include "commands/dbcommands.h"
 #include "commands/progress.h"
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 98917de2ef..69cd211369 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -22,10 +22,10 @@
 #include "access/nbtxlog.h"
 #include "access/relscan.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
-#include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "storage/condition_variable.h"
 #include "storage/indexfsm.h"
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 5cc3cf57e2..a0173c19a8 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -64,9 +64,9 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "bestatus.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/smgr.h"
 #include "tcop/tcopprot.h"        /* pgrminclude ignore */
 #include "utils/rel.h"
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index aa089d83fa..cf034ba333 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -38,8 +38,8 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "pg_trace.h"
 #include "storage/proc.h"
 
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 9c55c20d6b..26d30b8853 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -19,6 +19,7 @@
 #include "access/session.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/pg_enum.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
@@ -29,7 +30,6 @@
 #include "libpq/pqmq.h"
 #include "miscadmin.h"
 #include "optimizer/planmain.h"
-#include "pgstat.h"
 #include "storage/ipc.h"
 #include "storage/sinval.h"
 #include "storage/spin.h"
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 3623352b9c..a28fe474aa 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -54,7 +54,7 @@
 #include "access/slru.h"
 #include "access/transam.h"
 #include "access/xlog.h"
-#include "pgstat.h"
+#include "bestatus.h"
 #include "storage/fd.h"
 #include "storage/shmem.h"
 #include "miscadmin.h"
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index c96c8b60ba..bbe9c0eb5f 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -38,7 +38,7 @@
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogdefs.h"
-#include "pgstat.h"
+#include "bestatus.h"
 #include "storage/fd.h"
 
 /*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 9a8a6bb119..0dc9f39424 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -87,6 +87,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "access/xlogreader.h"
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "catalog/storage.h"
 #include "funcapi.h"
@@ -1569,6 +1570,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
     PredicateLockTwoPhaseFinish(xid, isCommit);
 
     /* Count the prepared xact as committed or aborted */
+    AtEOXact_BEStatus(isCommit);
     AtEOXact_PgStat(isCommit);
 
     /*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 18467d96d2..837f7e2be6 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -30,6 +30,7 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
 #include "catalog/storage.h"
@@ -2147,6 +2148,7 @@ CommitTransaction(void)
     AtEOXact_Files(true);
     AtEOXact_ComboCid();
     AtEOXact_HashTables(true);
+    AtEOXact_BEStatus(true);
     AtEOXact_PgStat(true);
     AtEOXact_Snapshot(true, false);
     AtEOXact_ApplyLauncher(true);
@@ -2641,6 +2643,7 @@ AbortTransaction(void)
         AtEOXact_Files(false);
         AtEOXact_ComboCid();
         AtEOXact_HashTables(false);
+        AtEOXact_BEStatus(false);
         AtEOXact_PgStat(false);
         AtEOXact_ApplyLauncher(false);
         pgstat_report_xact_timestamp(0);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2ab7d804f0..4b4e3d07ac 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -36,6 +36,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
 #include "catalog/pg_database.h"
@@ -8416,9 +8417,9 @@ LogCheckpointEnd(bool restartpoint)
                         &sync_secs, &sync_usecs);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time +=
+    BgWriterStats.checkpoint_write_time +=
         write_secs * 1000 + write_usecs / 1000;
-    BgWriterStats.m_checkpoint_sync_time +=
+    BgWriterStats.checkpoint_sync_time +=
         sync_secs * 1000 + sync_usecs / 1000;
 
     /*
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index b35043bf71..683c41575f 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -23,9 +23,9 @@
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "funcapi.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/walreceiver.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 10a663bae6..53fa4890e9 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -23,8 +23,8 @@
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/smgr.h"
 #include "utils/guc.h"
 #include "utils/hsearch.h"
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index df926d8dea..fca62770ac 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -22,6 +22,7 @@
 #include "access/htup_details.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/index.h"
 #include "catalog/pg_collation.h"
@@ -329,9 +330,6 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case BgWriterProcess:
                 statmsg = pgstat_get_backend_desc(B_BG_WRITER);
                 break;
-            case ArchiverProcess:
-                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
-                break;
             case CheckpointerProcess:
                 statmsg = pgstat_get_backend_desc(B_CHECKPOINTER);
                 break;
@@ -341,6 +339,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case WalReceiverProcess:
                 statmsg = pgstat_get_backend_desc(B_WAL_RECEIVER);
                 break;
+            case ArchiverProcess:
+                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
+                break;
             default:
                 statmsg = "??? process";
                 break;
@@ -417,6 +418,7 @@ AuxiliaryProcessMain(int argc, char *argv[])
         CreateAuxProcessResourceOwner();
 
         /* Initialize backend status information */
+        pgstat_bearray_initialize();
         pgstat_initialize();
         pgstat_bestart();
 
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index d6cfd28ddc..a8d29d2d33 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -48,7 +48,7 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
-#include "pgstat.h"
+#include "bestatus.h"
 
 /*
  * Magic numbers for parallel executor communication.  We use constants
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index cd20abc141..3ad7238b5a 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -41,6 +41,7 @@
 #include "access/relscan.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "bestatus.h"
 #include "executor/execdebug.h"
 #include "executor/nodeBitmapHeapscan.h"
 #include "miscadmin.h"
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index 70a4e90a05..02d58c463c 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -32,6 +32,7 @@
 
 #include "access/relscan.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "executor/execdebug.h"
 #include "executor/execParallel.h"
 #include "executor/nodeGather.h"
@@ -39,7 +40,6 @@
 #include "executor/tqueue.h"
 #include "miscadmin.h"
 #include "optimizer/planmain.h"
-#include "pgstat.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 856daf6a7f..5a47eb4601 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -28,6 +28,7 @@
 
 #include "access/htup_details.h"
 #include "access/parallel.h"
+#include "bestatus.h"
 #include "catalog/pg_statistic.h"
 #include "commands/tablespace.h"
 #include "executor/execdebug.h"
@@ -35,7 +36,6 @@
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "port/atomics.h"
 #include "utils/dynahash.h"
 #include "utils/memutils.h"
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 2098708864..898a7916b0 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -108,12 +108,12 @@
 
 #include "access/htup_details.h"
 #include "access/parallel.h"
+#include "bestatus.h"
 #include "executor/executor.h"
 #include "executor/hashjoin.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "utils/memutils.h"
 #include "utils/sharedtuplestore.h"
 
diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c
index 789a975409..de15e0907f 100644
--- a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c
@@ -36,9 +36,9 @@
 #include <openssl/ec.h>
 #endif
 
+#include "bestatus.h"
 #include "libpq/libpq.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/latch.h"
 #include "tcop/tcopprot.h"
diff --git a/src/backend/libpq/be-secure.c b/src/backend/libpq/be-secure.c
index a7def3168d..fa1cf6cffa 100644
--- a/src/backend/libpq/be-secure.c
+++ b/src/backend/libpq/be-secure.c
@@ -29,9 +29,9 @@
 #include <arpa/inet.h>
 #endif
 
+#include "bestatus.h"
 #include "libpq/libpq.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "tcop/tcopprot.h"
 #include "utils/memutils.h"
 #include "storage/ipc.h"
diff --git a/src/backend/libpq/pqmq.c b/src/backend/libpq/pqmq.c
index a9bd47d937..f79a70d6fe 100644
--- a/src/backend/libpq/pqmq.c
+++ b/src/backend/libpq/pqmq.c
@@ -13,11 +13,11 @@
 
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
 #include "libpq/pqmq.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c23211b2..311e63017d 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
-    pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+    pgarch.o postmaster.o startup.o syslogger.o walwriter.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 4cf67873b1..b1c723bf1c 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -71,6 +71,7 @@
 #include "access/reloptions.h"
 #include "access/transam.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "catalog/dependency.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
@@ -969,7 +970,7 @@ rebuild_database_list(Oid newdb)
         PgStat_StatDBEntry *entry;
 
         /* only consider this database if it has a pgstat entry */
-        entry = pgstat_fetch_stat_dbentry(newdb);
+        entry = pgstat_fetch_stat_dbentry(newdb, true);
         if (entry != NULL)
         {
             /* we assume it isn't found because the hash was just created */
@@ -978,6 +979,7 @@ rebuild_database_list(Oid newdb)
             /* hash_search already filled in the key */
             db->adl_score = score++;
             /* next_worker is filled in later */
+            pfree(entry);
         }
     }
 
@@ -993,7 +995,7 @@ rebuild_database_list(Oid newdb)
          * skip databases with no stat entries -- in particular, this gets rid
          * of dropped databases
          */
-        entry = pgstat_fetch_stat_dbentry(avdb->adl_datid);
+        entry = pgstat_fetch_stat_dbentry(avdb->adl_datid, true);
         if (entry == NULL)
             continue;
 
@@ -1005,6 +1007,7 @@ rebuild_database_list(Oid newdb)
             db->adl_score = score++;
             /* next_worker is filled in later */
         }
+        pfree(entry);
     }
 
     /* finally, insert all qualifying databases not previously inserted */
@@ -1017,7 +1020,7 @@ rebuild_database_list(Oid newdb)
         PgStat_StatDBEntry *entry;
 
         /* only consider databases with a pgstat entry */
-        entry = pgstat_fetch_stat_dbentry(avdb->adw_datid);
+        entry = pgstat_fetch_stat_dbentry(avdb->adw_datid, true);
         if (entry == NULL)
             continue;
 
@@ -1029,6 +1032,7 @@ rebuild_database_list(Oid newdb)
             db->adl_score = score++;
             /* next_worker is filled in later */
         }
+        pfree(entry);
     }
     nelems = score;
 
@@ -1227,7 +1231,7 @@ do_start_worker(void)
             continue;            /* ignore not-at-risk DBs */
 
         /* Find pgstat entry if any */
-        tmp->adw_entry = pgstat_fetch_stat_dbentry(tmp->adw_datid);
+        tmp->adw_entry = pgstat_fetch_stat_dbentry(tmp->adw_datid, true);
 
         /*
          * Skip a database with no pgstat entry; it means it hasn't seen any
@@ -1265,16 +1269,22 @@ do_start_worker(void)
                 break;
             }
         }
-        if (skipit)
-            continue;
+        if (!skipit)
+        {
+            /* Remember the db with oldest autovac time. */
+            if (avdb == NULL ||
+                tmp->adw_entry->last_autovac_time <
+                avdb->adw_entry->last_autovac_time)
+            {
+                if (avdb)
+                    pfree(avdb->adw_entry);
+                avdb = tmp;
+            }
+        }
 
-        /*
-         * Remember the db with oldest autovac time.  (If we are here, both
-         * tmp->entry and db->entry must be non-null.)
-         */
-        if (avdb == NULL ||
-            tmp->adw_entry->last_autovac_time < avdb->adw_entry->last_autovac_time)
-            avdb = tmp;
+        /* Immediately free it if not used */
+        if (avdb != tmp)
+            pfree(tmp->adw_entry);
     }
 
     /* Found a database -- process it */
@@ -1963,7 +1973,7 @@ do_autovacuum(void)
      * may be NULL if we couldn't find an entry (only happens if we are
      * forcing a vacuum for anti-wrap purposes).
      */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, true);
 
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
@@ -2013,7 +2023,7 @@ do_autovacuum(void)
     MemoryContextSwitchTo(AutovacMemCxt);
 
     /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
+    shared = pgstat_fetch_stat_dbentry(InvalidOid, true);
 
     classRel = heap_open(RelationRelationId, AccessShareLock);
 
@@ -2099,6 +2109,8 @@ do_autovacuum(void)
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
                                   &dovacuum, &doanalyze, &wraparound);
+        if (tabentry)
+            pfree(tabentry);
 
         /* Relations that need work are added to table_oids */
         if (dovacuum || doanalyze)
@@ -2178,10 +2190,11 @@ do_autovacuum(void)
         /* Fetch the pgstat entry for this table */
         tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
                                              shared, dbentry);
-
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
                                   &dovacuum, &doanalyze, &wraparound);
+        if (tabentry)
+            pfree(tabentry);
 
         /* ignore analyze for toast tables */
         if (dovacuum)
@@ -2750,12 +2763,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
     if (isshared)
     {
         if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
+            tabentry = backend_get_tab_entry(shared, relid, true);
     }
     else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
+        tabentry = backend_get_tab_entry(dbentry, relid, true);
 
     return tabentry;
 }
@@ -2787,8 +2798,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     /* use fresh stats */
     autovac_refresh_stats();
 
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    shared = pgstat_fetch_stat_dbentry(InvalidOid, true);
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, true);
 
     /* fetch the relation's relcache entry */
     classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
@@ -2819,6 +2830,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
                               &dovacuum, &doanalyze, &wraparound);
+    if (tabentry)
+        pfree(tabentry);
 
     /* ignore ANALYZE for toast tables */
     if (classForm->relkind == RELKIND_TOASTVALUE)
@@ -2909,7 +2922,10 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     }
 
     heap_freetuple(classTup);
-
+    if (shared)
+        pfree(shared);
+    if (dbentry)
+        pfree(dbentry);
     return tab;
 }
 
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index f5db5a8c4a..7d7d55ef1a 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -16,8 +16,8 @@
 
 #include "libpq/pqsignal.h"
 #include "access/parallel.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "port/atomics.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/postmaster.h"
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index e6b6c549de..c820d35fbc 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -40,6 +40,7 @@
 
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -267,9 +268,9 @@ BackgroundWriterMain(void)
         can_hibernate = BgBufferSync(&wb_context);
 
         /*
-         * Send off activity statistics to the stats collector
+         * Update activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_update_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fe96c41359..b592560dd2 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -43,6 +43,7 @@
 
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -371,7 +372,7 @@ CheckpointerMain(void)
         {
             checkpoint_requested = false;
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            BgWriterStats.requested_checkpoints++;
         }
         if (shutdown_requested)
         {
@@ -397,7 +398,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                BgWriterStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -515,13 +516,13 @@ CheckpointerMain(void)
         CheckArchiveTimeout();
 
         /*
-         * Send off activity statistics to the stats collector.  (The reason
-         * why we re-use bgwriter-related code for this is that the bgwriter
-         * and checkpointer used to be just one process.  It's probably not
-         * worth the trouble to split the stats support into two independent
-         * stats message types.)
+         * Update activity statistics.  (The reason why we re-use
+         * bgwriter-related code for this is that the bgwriter and
+         * checkpointer used to be just one process.  It's probably not worth
+         * the trouble to split the stats support into two independent
+         * functions.)
          */
-        pgstat_send_bgwriter();
+        pgstat_update_bgwriter();
 
         /*
          * Sleep until we are signaled or it's time for another checkpoint or
@@ -682,9 +683,9 @@ CheckpointWriteDelay(int flags, double progress)
         CheckArchiveTimeout();
 
         /*
-         * Report interim activity statistics to the stats collector.
+         * Register interim activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_update_bgwriter();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1284,8 +1285,8 @@ AbsorbFsyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    BgWriterStats.buf_written_backend += CheckpointerShmem->num_backend_writes;
+    BgWriterStats.buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 4342ebdab4..2a7c4fd1b1 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -35,6 +35,7 @@
 
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -468,7 +469,7 @@ pgarch_ArchiverCopyLoop(void)
                  * Tell the collector about the WAL file that we successfully
                  * archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_update_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
@@ -478,7 +479,7 @@ pgarch_ArchiverCopyLoop(void)
                  * Tell the collector about the WAL file that we failed to
                  * archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_update_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
deleted file mode 100644
index d1fe052abf..0000000000
--- a/src/backend/postmaster/pgstat.c
+++ /dev/null
@@ -1,6384 +0,0 @@
-/* ----------
- * pgstat.c
- *
- *    All the statistics collector stuff hacked up in one big, ugly file.
- *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
- *
- *            - Add some automatic call for pgstat vacuuming.
- *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
- *
- *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
- *
- *    src/backend/postmaster/pgstat.c
- * ----------
- */
-#include "postgres.h"
-
-#include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
-
-#include "pgstat.h"
-
-#include "access/heapam.h"
-#include "access/htup_details.h"
-#include "access/transam.h"
-#include "access/twophase_rmgr.h"
-#include "access/xact.h"
-#include "catalog/pg_database.h"
-#include "catalog/pg_proc.h"
-#include "common/ip.h"
-#include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
-#include "miscadmin.h"
-#include "pg_trace.h"
-#include "postmaster/autovacuum.h"
-#include "postmaster/fork_process.h"
-#include "postmaster/postmaster.h"
-#include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
-#include "storage/ipc.h"
-#include "storage/latch.h"
-#include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
-#include "storage/procsignal.h"
-#include "storage/sinvaladt.h"
-#include "utils/ascii.h"
-#include "utils/guc.h"
-#include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
-#include "utils/snapmgr.h"
-#include "utils/timestamp.h"
-#include "utils/tqual.h"
-
-
-/* ----------
- * Timer definitions.
- * ----------
- */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
-
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
-
-
-/* ----------
- * The initial size hints for the hash tables used in the collector.
- * ----------
- */
-#define PGSTAT_DB_HASH_SIZE        16
-#define PGSTAT_TAB_HASH_SIZE    512
-#define PGSTAT_FUNCTION_HASH_SIZE    512
-
-
-/* ----------
- * Total number of backends including auxiliary
- *
- * We reserve a slot for each possible BackendId, plus one for each
- * possible auxiliary process type.  (This scheme assumes there is not
- * more than one of any auxiliary process type at a time.) MaxBackends
- * includes autovacuum workers and background workers as well.
- * ----------
- */
-#define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
-
-
-/* ----------
- * GUC parameters
- * ----------
- */
-bool        pgstat_track_activities = false;
-bool        pgstat_track_counts = false;
-int            pgstat_track_functions = TRACK_FUNC_OFF;
-int            pgstat_track_activity_query_size = 1024;
-
-/* ----------
- * Built from GUC parameter
- * ----------
- */
-char       *pgstat_stat_directory = NULL;
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
-
-/*
- * BgWriter global statistics counters (unused in other processes).
- * Stored directly in a stats message structure so it can be sent
- * without needing to copy things around.  We assume this inits to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
-
-/* ----------
- * Local data
- * ----------
- */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
-
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
-
-/*
- * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
- *
- * NOTE: once allocated, TabStatusArray structures are never moved or deleted
- * for the life of the backend.  Also, we zero out the t_id fields of the
- * contained PgStat_TableStatus structs whenever they are not actively in use.
- * This allows relcache pgstat_info pointers to be treated as long-lived data,
- * avoiding repeated searches in pgstat_initstats() when a relation is
- * repeatedly opened during a transaction.
- */
-#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
-
-typedef struct TabStatusArray
-{
-    struct TabStatusArray *tsa_next;    /* link to next array, if any */
-    int            tsa_used;        /* # entries currently used */
-    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
-} TabStatusArray;
-
-static TabStatusArray *pgStatTabList = NULL;
-
-/*
- * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
- */
-typedef struct TabStatHashEntry
-{
-    Oid            t_id;
-    PgStat_TableStatus *tsa_entry;
-} TabStatHashEntry;
-
-/*
- * Hash table for O(1) t_id -> tsa_entry lookup
- */
-static HTAB *pgStatTabHash = NULL;
-
-/*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
- */
-static HTAB *pgStatFunctions = NULL;
-
-/*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
- */
-static bool have_function_stats = false;
-
-/*
- * Tuple insertion/deletion counts for an open transaction can't be propagated
- * into PgStat_TableStatus counters until we know if it is going to commit
- * or abort.  Hence, we keep these counts in per-subxact structs that live
- * in TopTransactionContext.  This data structure is designed on the assumption
- * that subxacts won't usually modify very many tables.
- */
-typedef struct PgStat_SubXactStatus
-{
-    int            nest_level;        /* subtransaction nest level */
-    struct PgStat_SubXactStatus *prev;    /* higher-level subxact if any */
-    PgStat_TableXactStatus *first;    /* head of list for this subxact */
-} PgStat_SubXactStatus;
-
-static PgStat_SubXactStatus *pgStatXactStack = NULL;
-
-static int    pgStatXactCommit = 0;
-static int    pgStatXactRollback = 0;
-PgStat_Counter pgStatBlockReadTime = 0;
-PgStat_Counter pgStatBlockWriteTime = 0;
-
-/* Record that's written to 2PC state file when pgstat state is persisted */
-typedef struct TwoPhasePgStatRecord
-{
-    PgStat_Counter tuples_inserted; /* tuples inserted in xact */
-    PgStat_Counter tuples_updated;    /* tuples updated in xact */
-    PgStat_Counter tuples_deleted;    /* tuples deleted in xact */
-    PgStat_Counter inserted_pre_trunc;    /* tuples inserted prior to truncate */
-    PgStat_Counter updated_pre_trunc;    /* tuples updated prior to truncate */
-    PgStat_Counter deleted_pre_trunc;    /* tuples deleted prior to truncate */
-    Oid            t_id;            /* table's OID */
-    bool        t_shared;        /* is it a shared catalog? */
-    bool        t_truncated;    /* was the relation truncated? */
-} TwoPhasePgStatRecord;
-
-/*
- * Info about current "snapshot" of stats file
- */
-static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
-
-/* Status for backends including auxiliary */
-static LocalPgBackendStatus *localBackendStatusTable = NULL;
-
-/* Total number of backends including auxiliary */
-static int    localNumBackends = 0;
-
-/*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
- */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
-
-/* Signal handler flags */
-static volatile bool need_exit = false;
-static volatile bool got_SIGHUP = false;
-
-/*
- * Total time charged to functions so far in the current backend.
- * We use this to help separate "self" and "other" time charges.
- * (We assume this initializes to zero.)
- */
-static instr_time total_func_time;
-
-
-/* ----------
- * Local function forward declarations
- * ----------
- */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgstat_exit(SIGNAL_ARGS);
-static void pgstat_beshutdown_hook(int code, Datum arg);
-static void pgstat_sighup_handler(SIGNAL_ARGS);
-
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                     Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
-static void pgstat_read_current_status(void);
-
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
-
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
-static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
-
-static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
-
-static void pgstat_setup_memcxt(void);
-
-static const char *pgstat_get_wait_activity(WaitEventActivity w);
-static const char *pgstat_get_wait_client(WaitEventClient w);
-static const char *pgstat_get_wait_ipc(WaitEventIPC w);
-static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
-static const char *pgstat_get_wait_io(WaitEventIO w);
-
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
-/* ------------------------------------------------------------
- * Public functions called from postmaster follow
- * ------------------------------------------------------------
- */
-
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
- */
-void
-pgstat_init(void)
-{
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
-
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
-
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
-    {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
-    }
-
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
-
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
-
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
-
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
-    }
-
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
-
-    /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
-     */
-    if (!pg_set_noblock(pgStatSock))
-    {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
-    }
-
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
-    }
-
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    return;
-
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
-
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
-
-    /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
-     */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
-}
-
-/*
- * subroutine for pgstat_reset_all
- */
-static void
-pgstat_reset_remove_files(const char *directory)
-{
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
-
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
-    {
-        int            nchars;
-        Oid            tmp_oid;
-
-        /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
-         */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
-
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
-
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
-    }
-    FreeDir(dir);
-}
-
-/*
- * pgstat_reset_all() -
- *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
- */
-void
-pgstat_reset_all(void)
-{
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
-}
-
-#ifdef EXEC_BACKEND
-
-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
-    char       *av[10];
-    int            ac = 0;
-
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
-
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
-
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
-
-
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
-
-    /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
-     */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
-
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
-
-    /*
-     * Okay, fork off the collector.
-     */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
-}
-
-/* ------------------------------------------------------------
- * Public functions used by backends follow
- *------------------------------------------------------------
- */
-
-
-/* ----------
- * pgstat_report_stat() -
- *
- *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
- *    transaction stop time as an approximation of current time.
- * ----------
- */
-void
-pgstat_report_stat(bool force)
-{
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
-    TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
-    int            i;
-
-    /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
-
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
-    now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
-
-    /*
-     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
-     */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
-    pgStatTabHash = NULL;
-
-    /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
-     */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
-    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
-    {
-        for (i = 0; i < tsa->tsa_used; i++)
-        {
-            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
-
-            /* Shouldn't have any pending transaction-dependent counts */
-            Assert(entry->trans == NULL);
-
-            /*
-             * Ignore entries that didn't accumulate any actual counts, such
-             * as indexes that were opened by the planner but not used.
-             */
-            if (memcmp(&entry->t_counts, &all_zeroes,
-                       sizeof(PgStat_TableCounts)) == 0)
-                continue;
-
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
-            {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
-            }
-        }
-        /* zero out TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
-    }
-
-    /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
-     */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
-
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
-}
-
-/*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
- */
-static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
-{
-    int            n;
-    int            len;
-
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
-     */
-    if (OidIsValid(tsmsg->m_databaseid))
-    {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
-        pgStatXactCommit = 0;
-        pgStatXactRollback = 0;
-        pgStatBlockReadTime = 0;
-        pgStatBlockWriteTime = 0;
-    }
-    else
-    {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
-    }
-
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
-
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
-}
-
-/*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
- */
-static void
-pgstat_send_funcstats(void)
-{
-    /* we assume this inits to all zeroes: */
-    static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
-
-    if (pgStatFunctions == NULL)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
-
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
-
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
-
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
-
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
-
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
-}
-
-
-/* ----------
- * pgstat_vacuum_stat() -
- *
- *    Will tell the collector about objects he can get rid of.
- * ----------
- */
-void
-pgstat_vacuum_stat(void)
-{
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Read pg_database and make a list of OIDs of all existing databases
-     */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
-
-    /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            dbid = dbentry->databaseid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        /* the DB entry for shared tables (with InvalidOid) is never dropped */
-        if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
-            pgstat_drop_database(dbid);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Lookup our own database entry; if not found, nothing more to do.
-     */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
-        return;
-
-    /*
-     * Similarly to above, make a list of all known relations in this DB.
-     */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
-
-    /*
-     * Check for all tables listed in stats hashtable if they still exist.
-     */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            tabid = tabentry->tableid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
-            continue;
-
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
-    }
-
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
-        {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
-                continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
-        }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
-    }
-}
-
-
-/* ----------
- * pgstat_collect_oids() -
- *
- *    Collect the OIDs of all objects listed in the specified system catalog
- *    into a temporary hash table.  Caller should hash_destroy the result
- *    when done with it.  (However, we make the table in CurrentMemoryContext
- *    so that it will be freed properly in event of an error.)
- * ----------
- */
-static HTAB *
-pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
-{
-    HTAB       *htab;
-    HASHCTL        hash_ctl;
-    Relation    rel;
-    HeapScanDesc scan;
-    HeapTuple    tup;
-    Snapshot    snapshot;
-
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(Oid);
-    hash_ctl.hcxt = CurrentMemoryContext;
-    htab = hash_create("Temporary table of OIDs",
-                       PGSTAT_TAB_HASH_SIZE,
-                       &hash_ctl,
-                       HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    rel = heap_open(catalogid, AccessShareLock);
-    snapshot = RegisterSnapshot(GetLatestSnapshot());
-    scan = heap_beginscan(rel, snapshot, 0, NULL);
-    while ((tup = heap_getnext(scan, ForwardScanDirection)) != NULL)
-    {
-        Oid            thisoid;
-        bool        isnull;
-
-        thisoid = heap_getattr(tup, anum_oid, RelationGetDescr(rel), &isnull);
-        Assert(!isnull);
-
-        CHECK_FOR_INTERRUPTS();
-
-        (void) hash_search(htab, (void *) &thisoid, HASH_ENTER, NULL);
-    }
-    heap_endscan(scan);
-    UnregisterSnapshot(snapshot);
-    heap_close(rel, AccessShareLock);
-
-    return htab;
-}
-
-
-/* ----------
- * pgstat_drop_database() -
- *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
- * ----------
- */
-void
-pgstat_drop_database(Oid databaseid)
-{
-    PgStat_MsgDropdb msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
-
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
-}
-#endif                            /* NOT_USED */
-
-
-/* ----------
- * pgstat_reset_counters() -
- *
- *    Tell the statistics collector to reset counters for our database.
- *
- *    Permission checking for this function is managed through the normal
- *    GRANT system.
- * ----------
- */
-void
-pgstat_reset_counters(void)
-{
-    PgStat_MsgResetcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* ----------
- * pgstat_reset_shared_counters() -
- *
- *    Tell the statistics collector to reset cluster-wide shared counters.
- *
- *    Permission checking for this function is managed through the normal
- *    GRANT system.
- * ----------
- */
-void
-pgstat_reset_shared_counters(const char *target)
-{
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
-    else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
-    else
-        ereport(ERROR,
-                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-                 errmsg("unrecognized reset target: \"%s\"", target),
-                 errhint("Target must be \"archiver\" or \"bgwriter\".")));
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* ----------
- * pgstat_reset_single_counter() -
- *
- *    Tell the statistics collector to reset a single counter.
- *
- *    Permission checking for this function is managed through the normal
- *    GRANT system.
- * ----------
- */
-void
-pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
-{
-    PgStat_MsgResetsinglecounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
-
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* ----------
- * pgstat_report_autovac() -
- *
- *    Called from autovacuum.c to report startup of an autovacuum process.
- *    We are called before InitPostgres is done, so can't rely on MyDatabaseId;
- *    the db OID must be passed in, instead.
- * ----------
- */
-void
-pgstat_report_autovac(Oid dboid)
-{
-    PgStat_MsgAutovacStart msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
-
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/* ---------
- * pgstat_report_vacuum() -
- *
- *    Tell the collector about the table we just vacuumed.
- * ---------
- */
-void
-pgstat_report_vacuum(Oid tableoid, bool shared,
-                     PgStat_Counter livetuples, PgStat_Counter deadtuples)
-{
-    PgStat_MsgVacuum msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* --------
- * pgstat_report_analyze() -
- *
- *    Tell the collector about the table we just analyzed.
- *
- * Caller must provide new live- and dead-tuples estimates, as well as a
- * flag indicating whether to reset the changes_since_analyze counter.
- * --------
- */
-void
-pgstat_report_analyze(Relation rel,
-                      PgStat_Counter livetuples, PgStat_Counter deadtuples,
-                      bool resetcounter)
-{
-    PgStat_MsgAnalyze msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    /*
-     * Unlike VACUUM, ANALYZE might be running inside a transaction that has
-     * already inserted and/or deleted rows in the target table. ANALYZE will
-     * have counted such rows as live or dead respectively. Because we will
-     * report our counts of such rows at transaction end, we should subtract
-     * off these counts from what we send to the collector now, else they'll
-     * be double-counted after commit.  (This approach also ensures that the
-     * collector ends up with the right numbers if we abort instead of
-     * committing.)
-     */
-    if (rel->pgstat_info != NULL)
-    {
-        PgStat_TableXactStatus *trans;
-
-        for (trans = rel->pgstat_info->trans; trans; trans = trans->upper)
-        {
-            livetuples -= trans->tuples_inserted - trans->tuples_deleted;
-            deadtuples -= trans->tuples_updated + trans->tuples_deleted;
-        }
-        /* count stuff inserted by already-aborted subxacts, too */
-        deadtuples -= rel->pgstat_info->t_counts.t_delta_dead_tuples;
-        /* Since ANALYZE's counts are estimates, we could have underflowed */
-        livetuples = Max(livetuples, 0);
-        deadtuples = Max(deadtuples, 0);
-    }
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* --------
- * pgstat_report_recovery_conflict() -
- *
- *    Tell the collector about a Hot Standby recovery conflict.
- * --------
- */
-void
-pgstat_report_recovery_conflict(int reason)
-{
-    PgStat_MsgRecoveryConflict msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* --------
- * pgstat_report_deadlock() -
- *
- *    Tell the collector about a deadlock detected.
- * --------
- */
-void
-pgstat_report_deadlock(void)
-{
-    PgStat_MsgDeadlock msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* --------
- * pgstat_report_tempfile() -
- *
- *    Tell the collector about a temporary file.
- * --------
- */
-void
-pgstat_report_tempfile(size_t filesize)
-{
-    PgStat_MsgTempFile msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/* ----------
- * pgstat_ping() -
- *
- *    Send some junk data to the collector to increase traffic.
- * ----------
- */
-void
-pgstat_ping(void)
-{
-    PgStat_MsgDummy msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* ----------
- * pgstat_send_inquiry() -
- *
- *    Notify collector that we need fresh data.
- * ----------
- */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/*
- * Initialize function call usage data.
- * Called by the executor before invoking a function.
- */
-void
-pgstat_init_function_usage(FunctionCallInfoData *fcinfo,
-                           PgStat_FunctionCallUsage *fcu)
-{
-    PgStat_BackendFunctionEntry *htabent;
-    bool        found;
-
-    if (pgstat_track_functions <= fcinfo->flinfo->fn_stats)
-    {
-        /* stats not wanted */
-        fcu->fs = NULL;
-        return;
-    }
-
-    if (!pgStatFunctions)
-    {
-        /* First time through - initialize function stat table */
-        HASHCTL        hash_ctl;
-
-        memset(&hash_ctl, 0, sizeof(hash_ctl));
-        hash_ctl.keysize = sizeof(Oid);
-        hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
-        pgStatFunctions = hash_create("Function stat entries",
-                                      PGSTAT_FUNCTION_HASH_SIZE,
-                                      &hash_ctl,
-                                      HASH_ELEM | HASH_BLOBS);
-    }
-
-    /* Get the stats entry for this function, create if necessary */
-    htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
-                          HASH_ENTER, &found);
-    if (!found)
-        MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
-
-    fcu->fs = &htabent->f_counts;
-
-    /* save stats for this function, later used to compensate for recursion */
-    fcu->save_f_total_time = htabent->f_counts.f_total_time;
-
-    /* save current backend-wide total time */
-    fcu->save_total = total_func_time;
-
-    /* get clock time as of function start */
-    INSTR_TIME_SET_CURRENT(fcu->f_start);
-}
-
-/*
- * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
- *        for specified function
- *
- * If no entry, return NULL, don't create a new one
- */
-PgStat_BackendFunctionEntry *
-find_funcstat_entry(Oid func_id)
-{
-    if (pgStatFunctions == NULL)
-        return NULL;
-
-    return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
-                                                       (void *) &func_id,
-                                                       HASH_FIND, NULL);
-}
-
-/*
- * Calculate function call usage and update stat counters.
- * Called by the executor after invoking a function.
- *
- * In the case of a set-returning function that runs in value-per-call mode,
- * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
- * calls for what the user considers a single call of the function.  The
- * finalize flag should be TRUE on the last call.
- */
-void
-pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
-{
-    PgStat_FunctionCounts *fs = fcu->fs;
-    instr_time    f_total;
-    instr_time    f_others;
-    instr_time    f_self;
-
-    /* stats not wanted? */
-    if (fs == NULL)
-        return;
-
-    /* total elapsed time in this function call */
-    INSTR_TIME_SET_CURRENT(f_total);
-    INSTR_TIME_SUBTRACT(f_total, fcu->f_start);
-
-    /* self usage: elapsed minus anything already charged to other calls */
-    f_others = total_func_time;
-    INSTR_TIME_SUBTRACT(f_others, fcu->save_total);
-    f_self = f_total;
-    INSTR_TIME_SUBTRACT(f_self, f_others);
-
-    /* update backend-wide total time */
-    INSTR_TIME_ADD(total_func_time, f_self);
-
-    /*
-     * Compute the new f_total_time as the total elapsed time added to the
-     * pre-call value of f_total_time.  This is necessary to avoid
-     * double-counting any time taken by recursive calls of myself.  (We do
-     * not need any similar kluge for self time, since that already excludes
-     * any recursive calls.)
-     */
-    INSTR_TIME_ADD(f_total, fcu->save_f_total_time);
-
-    /* update counters in function stats table */
-    if (finalize)
-        fs->f_numcalls++;
-    fs->f_total_time = f_total;
-    INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
-}
-
-
-/* ----------
- * pgstat_initstats() -
- *
- *    Initialize a relcache entry to count access statistics.
- *    Called whenever a relation is opened.
- *
- *    We assume that a relcache entry's pgstat_info field is zeroed by
- *    relcache.c when the relcache entry is made; thereafter it is long-lived
- *    data.  We can avoid repeated searches of the TabStatus arrays when the
- *    same relation is touched repeatedly within a transaction.
- * ----------
- */
-void
-pgstat_initstats(Relation rel)
-{
-    Oid            rel_id = rel->rd_id;
-    char        relkind = rel->rd_rel->relkind;
-
-    /* We only count stats for things that have storage */
-    if (!(relkind == RELKIND_RELATION ||
-          relkind == RELKIND_MATVIEW ||
-          relkind == RELKIND_INDEX ||
-          relkind == RELKIND_TOASTVALUE ||
-          relkind == RELKIND_SEQUENCE))
-    {
-        rel->pgstat_info = NULL;
-        return;
-    }
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-    {
-        /* We're not counting at all */
-        rel->pgstat_info = NULL;
-        return;
-    }
-
-    /*
-     * If we already set up this relation in the current transaction, nothing
-     * to do.
-     */
-    if (rel->pgstat_info != NULL &&
-        rel->pgstat_info->t_id == rel_id)
-        return;
-
-    /* Else find or make the PgStat_TableStatus entry, and update link */
-    rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
-}
-
-/*
- * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
- */
-static PgStat_TableStatus *
-get_tabstat_entry(Oid rel_id, bool isshared)
-{
-    TabStatHashEntry *hash_entry;
-    PgStat_TableStatus *entry;
-    TabStatusArray *tsa;
-    bool        found;
-
-    /*
-     * Create hash table if we don't have it already.
-     */
-    if (pgStatTabHash == NULL)
-    {
-        HASHCTL        ctl;
-
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
-    }
-
-    /*
-     * Find an entry or create a new one.
-     */
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
-    if (!found)
-    {
-        /* initialize new entry with null pointer */
-        hash_entry->tsa_entry = NULL;
-    }
-
-    /*
-     * If entry is already valid, we're done.
-     */
-    if (hash_entry->tsa_entry)
-        return hash_entry->tsa_entry;
-
-    /*
-     * Locate the first pgStatTabList entry with free space, making a new list
-     * entry if needed.  Note that we could get an OOM failure here, but if so
-     * we have left the hashtable and the list in a consistent state.
-     */
-    if (pgStatTabList == NULL)
-    {
-        /* Set up first pgStatTabList entry */
-        pgStatTabList = (TabStatusArray *)
-            MemoryContextAllocZero(TopMemoryContext,
-                                   sizeof(TabStatusArray));
-    }
-
-    tsa = pgStatTabList;
-    while (tsa->tsa_used >= TABSTAT_QUANTUM)
-    {
-        if (tsa->tsa_next == NULL)
-            tsa->tsa_next = (TabStatusArray *)
-                MemoryContextAllocZero(TopMemoryContext,
-                                       sizeof(TabStatusArray));
-        tsa = tsa->tsa_next;
-    }
-
-    /*
-     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
-     * the entry was already zeroed, either at creation or after last use.
-     */
-    entry = &tsa->tsa_entries[tsa->tsa_used++];
-    entry->t_id = rel_id;
-    entry->t_shared = isshared;
-
-    /*
-     * Now we can fill the entry in pgStatTabHash.
-     */
-    hash_entry->tsa_entry = entry;
-
-    return entry;
-}
-
-/*
- * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
- *
- * If no entry, return NULL, don't create a new one
- *
- * Note: if we got an error in the most recent execution of pgstat_report_stat,
- * it's possible that an entry exists but there's no hashtable entry for it.
- * That's okay, we'll treat this case as "doesn't exist".
- */
-PgStat_TableStatus *
-find_tabstat_entry(Oid rel_id)
-{
-    TabStatHashEntry *hash_entry;
-
-    /* If hashtable doesn't exist, there are no entries at all */
-    if (!pgStatTabHash)
-        return NULL;
-
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
-    if (!hash_entry)
-        return NULL;
-
-    /* Note that this step could also return NULL, but that's correct */
-    return hash_entry->tsa_entry;
-}
-
-/*
- * get_tabstat_stack_level - add a new (sub)transaction stack entry if needed
- */
-static PgStat_SubXactStatus *
-get_tabstat_stack_level(int nest_level)
-{
-    PgStat_SubXactStatus *xact_state;
-
-    xact_state = pgStatXactStack;
-    if (xact_state == NULL || xact_state->nest_level != nest_level)
-    {
-        xact_state = (PgStat_SubXactStatus *)
-            MemoryContextAlloc(TopTransactionContext,
-                               sizeof(PgStat_SubXactStatus));
-        xact_state->nest_level = nest_level;
-        xact_state->prev = pgStatXactStack;
-        xact_state->first = NULL;
-        pgStatXactStack = xact_state;
-    }
-    return xact_state;
-}
-
-/*
- * add_tabstat_xact_level - add a new (sub)transaction state record
- */
-static void
-add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level)
-{
-    PgStat_SubXactStatus *xact_state;
-    PgStat_TableXactStatus *trans;
-
-    /*
-     * If this is the first rel to be modified at the current nest level, we
-     * first have to push a transaction stack entry.
-     */
-    xact_state = get_tabstat_stack_level(nest_level);
-
-    /* Now make a per-table stack entry */
-    trans = (PgStat_TableXactStatus *)
-        MemoryContextAllocZero(TopTransactionContext,
-                               sizeof(PgStat_TableXactStatus));
-    trans->nest_level = nest_level;
-    trans->upper = pgstat_info->trans;
-    trans->parent = pgstat_info;
-    trans->next = xact_state->first;
-    xact_state->first = trans;
-    pgstat_info->trans = trans;
-}
-
-/*
- * pgstat_count_heap_insert - count a tuple insertion of n tuples
- */
-void
-pgstat_count_heap_insert(Relation rel, PgStat_Counter n)
-{
-    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
-
-    if (pgstat_info != NULL)
-    {
-        /* We have to log the effect at the proper transactional level */
-        int            nest_level = GetCurrentTransactionNestLevel();
-
-        if (pgstat_info->trans == NULL ||
-            pgstat_info->trans->nest_level != nest_level)
-            add_tabstat_xact_level(pgstat_info, nest_level);
-
-        pgstat_info->trans->tuples_inserted += n;
-    }
-}
-
-/*
- * pgstat_count_heap_update - count a tuple update
- */
-void
-pgstat_count_heap_update(Relation rel, bool hot)
-{
-    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
-
-    if (pgstat_info != NULL)
-    {
-        /* We have to log the effect at the proper transactional level */
-        int            nest_level = GetCurrentTransactionNestLevel();
-
-        if (pgstat_info->trans == NULL ||
-            pgstat_info->trans->nest_level != nest_level)
-            add_tabstat_xact_level(pgstat_info, nest_level);
-
-        pgstat_info->trans->tuples_updated++;
-
-        /* t_tuples_hot_updated is nontransactional, so just advance it */
-        if (hot)
-            pgstat_info->t_counts.t_tuples_hot_updated++;
-    }
-}
-
-/*
- * pgstat_count_heap_delete - count a tuple deletion
- */
-void
-pgstat_count_heap_delete(Relation rel)
-{
-    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
-
-    if (pgstat_info != NULL)
-    {
-        /* We have to log the effect at the proper transactional level */
-        int            nest_level = GetCurrentTransactionNestLevel();
-
-        if (pgstat_info->trans == NULL ||
-            pgstat_info->trans->nest_level != nest_level)
-            add_tabstat_xact_level(pgstat_info, nest_level);
-
-        pgstat_info->trans->tuples_deleted++;
-    }
-}
-
-/*
- * pgstat_truncate_save_counters
- *
- * Whenever a table is truncated, we save its i/u/d counters so that they can
- * be cleared, and if the (sub)xact that executed the truncate later aborts,
- * the counters can be restored to the saved (pre-truncate) values.  Note we do
- * this on the first truncate in any particular subxact level only.
- */
-static void
-pgstat_truncate_save_counters(PgStat_TableXactStatus *trans)
-{
-    if (!trans->truncated)
-    {
-        trans->inserted_pre_trunc = trans->tuples_inserted;
-        trans->updated_pre_trunc = trans->tuples_updated;
-        trans->deleted_pre_trunc = trans->tuples_deleted;
-        trans->truncated = true;
-    }
-}
-
-/*
- * pgstat_truncate_restore_counters - restore counters when a truncate aborts
- */
-static void
-pgstat_truncate_restore_counters(PgStat_TableXactStatus *trans)
-{
-    if (trans->truncated)
-    {
-        trans->tuples_inserted = trans->inserted_pre_trunc;
-        trans->tuples_updated = trans->updated_pre_trunc;
-        trans->tuples_deleted = trans->deleted_pre_trunc;
-    }
-}
-
-/*
- * pgstat_count_truncate - update tuple counters due to truncate
- */
-void
-pgstat_count_truncate(Relation rel)
-{
-    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
-
-    if (pgstat_info != NULL)
-    {
-        /* We have to log the effect at the proper transactional level */
-        int            nest_level = GetCurrentTransactionNestLevel();
-
-        if (pgstat_info->trans == NULL ||
-            pgstat_info->trans->nest_level != nest_level)
-            add_tabstat_xact_level(pgstat_info, nest_level);
-
-        pgstat_truncate_save_counters(pgstat_info->trans);
-        pgstat_info->trans->tuples_inserted = 0;
-        pgstat_info->trans->tuples_updated = 0;
-        pgstat_info->trans->tuples_deleted = 0;
-    }
-}
-
-/*
- * pgstat_update_heap_dead_tuples - update dead-tuples count
- *
- * The semantics of this are that we are reporting the nontransactional
- * recovery of "delta" dead tuples; so t_delta_dead_tuples decreases
- * rather than increasing, and the change goes straight into the per-table
- * counter, not into transactional state.
- */
-void
-pgstat_update_heap_dead_tuples(Relation rel, int delta)
-{
-    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
-
-    if (pgstat_info != NULL)
-        pgstat_info->t_counts.t_delta_dead_tuples -= delta;
-}
-
-
-/* ----------
- * AtEOXact_PgStat
- *
- *    Called from access/transam/xact.c at top-level transaction commit/abort.
- * ----------
- */
-void
-AtEOXact_PgStat(bool isCommit)
-{
-    PgStat_SubXactStatus *xact_state;
-
-    /*
-     * Count transaction commit or abort.  (We use counters, not just bools,
-     * in case the reporting message isn't sent right away.)
-     */
-    if (isCommit)
-        pgStatXactCommit++;
-    else
-        pgStatXactRollback++;
-
-    /*
-     * Transfer transactional insert/update counts into the base tabstat
-     * entries.  We don't bother to free any of the transactional state, since
-     * it's all in TopTransactionContext and will go away anyway.
-     */
-    xact_state = pgStatXactStack;
-    if (xact_state != NULL)
-    {
-        PgStat_TableXactStatus *trans;
-
-        Assert(xact_state->nest_level == 1);
-        Assert(xact_state->prev == NULL);
-        for (trans = xact_state->first; trans != NULL; trans = trans->next)
-        {
-            PgStat_TableStatus *tabstat;
-
-            Assert(trans->nest_level == 1);
-            Assert(trans->upper == NULL);
-            tabstat = trans->parent;
-            Assert(tabstat->trans == trans);
-            /* restore pre-truncate stats (if any) in case of aborted xact */
-            if (!isCommit)
-                pgstat_truncate_restore_counters(trans);
-            /* count attempted actions regardless of commit/abort */
-            tabstat->t_counts.t_tuples_inserted += trans->tuples_inserted;
-            tabstat->t_counts.t_tuples_updated += trans->tuples_updated;
-            tabstat->t_counts.t_tuples_deleted += trans->tuples_deleted;
-            if (isCommit)
-            {
-                tabstat->t_counts.t_truncated = trans->truncated;
-                if (trans->truncated)
-                {
-                    /* forget live/dead stats seen by backend thus far */
-                    tabstat->t_counts.t_delta_live_tuples = 0;
-                    tabstat->t_counts.t_delta_dead_tuples = 0;
-                }
-                /* insert adds a live tuple, delete removes one */
-                tabstat->t_counts.t_delta_live_tuples +=
-                    trans->tuples_inserted - trans->tuples_deleted;
-                /* update and delete each create a dead tuple */
-                tabstat->t_counts.t_delta_dead_tuples +=
-                    trans->tuples_updated + trans->tuples_deleted;
-                /* insert, update, delete each count as one change event */
-                tabstat->t_counts.t_changed_tuples +=
-                    trans->tuples_inserted + trans->tuples_updated +
-                    trans->tuples_deleted;
-            }
-            else
-            {
-                /* inserted tuples are dead, deleted tuples are unaffected */
-                tabstat->t_counts.t_delta_dead_tuples +=
-                    trans->tuples_inserted + trans->tuples_updated;
-                /* an aborted xact generates no changed_tuple events */
-            }
-            tabstat->trans = NULL;
-        }
-    }
-    pgStatXactStack = NULL;
-
-    /* Make sure any stats snapshot is thrown away */
-    pgstat_clear_snapshot();
-}
-
-/* ----------
- * AtEOSubXact_PgStat
- *
- *    Called from access/transam/xact.c at subtransaction commit/abort.
- * ----------
- */
-void
-AtEOSubXact_PgStat(bool isCommit, int nestDepth)
-{
-    PgStat_SubXactStatus *xact_state;
-
-    /*
-     * Transfer transactional insert/update counts into the next higher
-     * subtransaction state.
-     */
-    xact_state = pgStatXactStack;
-    if (xact_state != NULL &&
-        xact_state->nest_level >= nestDepth)
-    {
-        PgStat_TableXactStatus *trans;
-        PgStat_TableXactStatus *next_trans;
-
-        /* delink xact_state from stack immediately to simplify reuse case */
-        pgStatXactStack = xact_state->prev;
-
-        for (trans = xact_state->first; trans != NULL; trans = next_trans)
-        {
-            PgStat_TableStatus *tabstat;
-
-            next_trans = trans->next;
-            Assert(trans->nest_level == nestDepth);
-            tabstat = trans->parent;
-            Assert(tabstat->trans == trans);
-            if (isCommit)
-            {
-                if (trans->upper && trans->upper->nest_level == nestDepth - 1)
-                {
-                    if (trans->truncated)
-                    {
-                        /* propagate the truncate status one level up */
-                        pgstat_truncate_save_counters(trans->upper);
-                        /* replace upper xact stats with ours */
-                        trans->upper->tuples_inserted = trans->tuples_inserted;
-                        trans->upper->tuples_updated = trans->tuples_updated;
-                        trans->upper->tuples_deleted = trans->tuples_deleted;
-                    }
-                    else
-                    {
-                        trans->upper->tuples_inserted += trans->tuples_inserted;
-                        trans->upper->tuples_updated += trans->tuples_updated;
-                        trans->upper->tuples_deleted += trans->tuples_deleted;
-                    }
-                    tabstat->trans = trans->upper;
-                    pfree(trans);
-                }
-                else
-                {
-                    /*
-                     * When there isn't an immediate parent state, we can just
-                     * reuse the record instead of going through a
-                     * palloc/pfree pushup (this works since it's all in
-                     * TopTransactionContext anyway).  We have to re-link it
-                     * into the parent level, though, and that might mean
-                     * pushing a new entry into the pgStatXactStack.
-                     */
-                    PgStat_SubXactStatus *upper_xact_state;
-
-                    upper_xact_state = get_tabstat_stack_level(nestDepth - 1);
-                    trans->next = upper_xact_state->first;
-                    upper_xact_state->first = trans;
-                    trans->nest_level = nestDepth - 1;
-                }
-            }
-            else
-            {
-                /*
-                 * On abort, update top-level tabstat counts, then forget the
-                 * subtransaction
-                 */
-
-                /* first restore values obliterated by truncate */
-                pgstat_truncate_restore_counters(trans);
-                /* count attempted actions regardless of commit/abort */
-                tabstat->t_counts.t_tuples_inserted += trans->tuples_inserted;
-                tabstat->t_counts.t_tuples_updated += trans->tuples_updated;
-                tabstat->t_counts.t_tuples_deleted += trans->tuples_deleted;
-                /* inserted tuples are dead, deleted tuples are unaffected */
-                tabstat->t_counts.t_delta_dead_tuples +=
-                    trans->tuples_inserted + trans->tuples_updated;
-                tabstat->trans = trans->upper;
-                pfree(trans);
-            }
-        }
-        pfree(xact_state);
-    }
-}
-
-
-/*
- * AtPrepare_PgStat
- *        Save the transactional stats state at 2PC transaction prepare.
- *
- * In this phase we just generate 2PC records for all the pending
- * transaction-dependent stats work.
- */
-void
-AtPrepare_PgStat(void)
-{
-    PgStat_SubXactStatus *xact_state;
-
-    xact_state = pgStatXactStack;
-    if (xact_state != NULL)
-    {
-        PgStat_TableXactStatus *trans;
-
-        Assert(xact_state->nest_level == 1);
-        Assert(xact_state->prev == NULL);
-        for (trans = xact_state->first; trans != NULL; trans = trans->next)
-        {
-            PgStat_TableStatus *tabstat;
-            TwoPhasePgStatRecord record;
-
-            Assert(trans->nest_level == 1);
-            Assert(trans->upper == NULL);
-            tabstat = trans->parent;
-            Assert(tabstat->trans == trans);
-
-            record.tuples_inserted = trans->tuples_inserted;
-            record.tuples_updated = trans->tuples_updated;
-            record.tuples_deleted = trans->tuples_deleted;
-            record.inserted_pre_trunc = trans->inserted_pre_trunc;
-            record.updated_pre_trunc = trans->updated_pre_trunc;
-            record.deleted_pre_trunc = trans->deleted_pre_trunc;
-            record.t_id = tabstat->t_id;
-            record.t_shared = tabstat->t_shared;
-            record.t_truncated = trans->truncated;
-
-            RegisterTwoPhaseRecord(TWOPHASE_RM_PGSTAT_ID, 0,
-                                   &record, sizeof(TwoPhasePgStatRecord));
-        }
-    }
-}
-
-/*
- * PostPrepare_PgStat
- *        Clean up after successful PREPARE.
- *
- * All we need do here is unlink the transaction stats state from the
- * nontransactional state.  The nontransactional action counts will be
- * reported to the stats collector immediately, while the effects on live
- * and dead tuple counts are preserved in the 2PC state file.
- *
- * Note: AtEOXact_PgStat is not called during PREPARE.
- */
-void
-PostPrepare_PgStat(void)
-{
-    PgStat_SubXactStatus *xact_state;
-
-    /*
-     * We don't bother to free any of the transactional state, since it's all
-     * in TopTransactionContext and will go away anyway.
-     */
-    xact_state = pgStatXactStack;
-    if (xact_state != NULL)
-    {
-        PgStat_TableXactStatus *trans;
-
-        for (trans = xact_state->first; trans != NULL; trans = trans->next)
-        {
-            PgStat_TableStatus *tabstat;
-
-            tabstat = trans->parent;
-            tabstat->trans = NULL;
-        }
-    }
-    pgStatXactStack = NULL;
-
-    /* Make sure any stats snapshot is thrown away */
-    pgstat_clear_snapshot();
-}
-
-/*
- * 2PC processing routine for COMMIT PREPARED case.
- *
- * Load the saved counts into our local pgstats state.
- */
-void
-pgstat_twophase_postcommit(TransactionId xid, uint16 info,
-                           void *recdata, uint32 len)
-{
-    TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata;
-    PgStat_TableStatus *pgstat_info;
-
-    /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
-
-    /* Same math as in AtEOXact_PgStat, commit case */
-    pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
-    pgstat_info->t_counts.t_tuples_updated += rec->tuples_updated;
-    pgstat_info->t_counts.t_tuples_deleted += rec->tuples_deleted;
-    pgstat_info->t_counts.t_truncated = rec->t_truncated;
-    if (rec->t_truncated)
-    {
-        /* forget live/dead stats seen by backend thus far */
-        pgstat_info->t_counts.t_delta_live_tuples = 0;
-        pgstat_info->t_counts.t_delta_dead_tuples = 0;
-    }
-    pgstat_info->t_counts.t_delta_live_tuples +=
-        rec->tuples_inserted - rec->tuples_deleted;
-    pgstat_info->t_counts.t_delta_dead_tuples +=
-        rec->tuples_updated + rec->tuples_deleted;
-    pgstat_info->t_counts.t_changed_tuples +=
-        rec->tuples_inserted + rec->tuples_updated +
-        rec->tuples_deleted;
-}
-
-/*
- * 2PC processing routine for ROLLBACK PREPARED case.
- *
- * Load the saved counts into our local pgstats state, but treat them
- * as aborted.
- */
-void
-pgstat_twophase_postabort(TransactionId xid, uint16 info,
-                          void *recdata, uint32 len)
-{
-    TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata;
-    PgStat_TableStatus *pgstat_info;
-
-    /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
-
-    /* Same math as in AtEOXact_PgStat, abort case */
-    if (rec->t_truncated)
-    {
-        rec->tuples_inserted = rec->inserted_pre_trunc;
-        rec->tuples_updated = rec->updated_pre_trunc;
-        rec->tuples_deleted = rec->deleted_pre_trunc;
-    }
-    pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
-    pgstat_info->t_counts.t_tuples_updated += rec->tuples_updated;
-    pgstat_info->t_counts.t_tuples_deleted += rec->tuples_deleted;
-    pgstat_info->t_counts.t_delta_dead_tuples +=
-        rec->tuples_inserted + rec->tuples_updated;
-}
-
-
-/* ----------
- * pgstat_fetch_stat_dbentry() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
- * ----------
- */
-PgStat_StatDBEntry *
-pgstat_fetch_stat_dbentry(Oid dbid)
-{
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
-}
-
-
-/* ----------
- * pgstat_fetch_stat_tabentry() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one table or NULL. NULL doesn't mean
- *    that the table doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
- * ----------
- */
-PgStat_StatTabEntry *
-pgstat_fetch_stat_tabentry(Oid relid)
-{
-    Oid            dbid;
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
-
-    /*
-     * If we didn't find it, maybe it's a shared table.
-     */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
-
-    return NULL;
-}
-
-
-/* ----------
- * pgstat_fetch_stat_funcentry() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one function or NULL.
- * ----------
- */
-PgStat_StatFuncEntry *
-pgstat_fetch_stat_funcentry(Oid func_id)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry = NULL;
-
-    /* load the stats file if needed */
-    backend_read_statsfile();
-
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
-
-    return funcentry;
-}
-
-
-/* ----------
- * pgstat_fetch_stat_beentry() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    our local copy of the current-activity entry for one backend.
- *
- *    NB: caller is responsible for a check if the user is permitted to see
- *    this info (especially the querystring).
- * ----------
- */
-PgBackendStatus *
-pgstat_fetch_stat_beentry(int beid)
-{
-    pgstat_read_current_status();
-
-    if (beid < 1 || beid > localNumBackends)
-        return NULL;
-
-    return &localBackendStatusTable[beid - 1].backendStatus;
-}
-
-
-/* ----------
- * pgstat_fetch_stat_local_beentry() -
- *
- *    Like pgstat_fetch_stat_beentry() but with locally computed additions (like
- *    xid and xmin values of the backend)
- *
- *    NB: caller is responsible for a check if the user is permitted to see
- *    this info (especially the querystring).
- * ----------
- */
-LocalPgBackendStatus *
-pgstat_fetch_stat_local_beentry(int beid)
-{
-    pgstat_read_current_status();
-
-    if (beid < 1 || beid > localNumBackends)
-        return NULL;
-
-    return &localBackendStatusTable[beid - 1];
-}
-
-
-/* ----------
- * pgstat_fetch_stat_numbackends() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the maximum current backend id.
- * ----------
- */
-int
-pgstat_fetch_stat_numbackends(void)
-{
-    pgstat_read_current_status();
-
-    return localNumBackends;
-}
-
-/*
- * ---------
- * pgstat_fetch_stat_archiver() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the archiver statistics struct.
- * ---------
- */
-PgStat_ArchiverStats *
-pgstat_fetch_stat_archiver(void)
-{
-    backend_read_statsfile();
-
-    return &archiverStats;
-}
-
-
-/*
- * ---------
- * pgstat_fetch_global() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the global statistics struct.
- * ---------
- */
-PgStat_GlobalStats *
-pgstat_fetch_global(void)
-{
-    backend_read_statsfile();
-
-    return &globalStats;
-}
-
-
-/* ------------------------------------------------------------
- * Functions for management of the shared-memory PgBackendStatus array
- * ------------------------------------------------------------
- */
-
-static PgBackendStatus *BackendStatusArray = NULL;
-static PgBackendStatus *MyBEEntry = NULL;
-static char *BackendAppnameBuffer = NULL;
-static char *BackendClientHostnameBuffer = NULL;
-static char *BackendActivityBuffer = NULL;
-static Size BackendActivityBufferSize = 0;
-#ifdef USE_SSL
-static PgBackendSSLStatus *BackendSslStatusBuffer = NULL;
-#endif
-
-
-/*
- * Report shared-memory space needed by CreateSharedBackendStatus.
- */
-Size
-BackendStatusShmemSize(void)
-{
-    Size        size;
-
-    /* BackendStatusArray: */
-    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
-    /* BackendAppnameBuffer: */
-    size = add_size(size,
-                    mul_size(NAMEDATALEN, NumBackendStatSlots));
-    /* BackendClientHostnameBuffer: */
-    size = add_size(size,
-                    mul_size(NAMEDATALEN, NumBackendStatSlots));
-    /* BackendActivityBuffer: */
-    size = add_size(size,
-                    mul_size(pgstat_track_activity_query_size, NumBackendStatSlots));
-#ifdef USE_SSL
-    /* BackendSslStatusBuffer: */
-    size = add_size(size,
-                    mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots));
-#endif
-    return size;
-}
-
-/*
- * Initialize the shared status array and several string buffers
- * during postmaster startup.
- */
-void
-CreateSharedBackendStatus(void)
-{
-    Size        size;
-    bool        found;
-    int            i;
-    char       *buffer;
-
-    /* Create or attach to the shared array */
-    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
-    BackendStatusArray = (PgBackendStatus *)
-        ShmemInitStruct("Backend Status Array", size, &found);
-
-    if (!found)
-    {
-        /*
-         * We're the first - initialize.
-         */
-        MemSet(BackendStatusArray, 0, size);
-    }
-
-    /* Create or attach to the shared appname buffer */
-    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
-    BackendAppnameBuffer = (char *)
-        ShmemInitStruct("Backend Application Name Buffer", size, &found);
-
-    if (!found)
-    {
-        MemSet(BackendAppnameBuffer, 0, size);
-
-        /* Initialize st_appname pointers. */
-        buffer = BackendAppnameBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_appname = buffer;
-            buffer += NAMEDATALEN;
-        }
-    }
-
-    /* Create or attach to the shared client hostname buffer */
-    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
-    BackendClientHostnameBuffer = (char *)
-        ShmemInitStruct("Backend Client Host Name Buffer", size, &found);
-
-    if (!found)
-    {
-        MemSet(BackendClientHostnameBuffer, 0, size);
-
-        /* Initialize st_clienthostname pointers. */
-        buffer = BackendClientHostnameBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_clienthostname = buffer;
-            buffer += NAMEDATALEN;
-        }
-    }
-
-    /* Create or attach to the shared activity buffer */
-    BackendActivityBufferSize = mul_size(pgstat_track_activity_query_size,
-                                         NumBackendStatSlots);
-    BackendActivityBuffer = (char *)
-        ShmemInitStruct("Backend Activity Buffer",
-                        BackendActivityBufferSize,
-                        &found);
-
-    if (!found)
-    {
-        MemSet(BackendActivityBuffer, 0, BackendActivityBufferSize);
-
-        /* Initialize st_activity pointers. */
-        buffer = BackendActivityBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_activity_raw = buffer;
-            buffer += pgstat_track_activity_query_size;
-        }
-    }
-
-#ifdef USE_SSL
-    /* Create or attach to the shared SSL status buffer */
-    size = mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots);
-    BackendSslStatusBuffer = (PgBackendSSLStatus *)
-        ShmemInitStruct("Backend SSL Status Buffer", size, &found);
-
-    if (!found)
-    {
-        PgBackendSSLStatus *ptr;
-
-        MemSet(BackendSslStatusBuffer, 0, size);
-
-        /* Initialize st_sslstatus pointers. */
-        ptr = BackendSslStatusBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_sslstatus = ptr;
-            ptr++;
-        }
-    }
-#endif
-}
-
-
-/* ----------
- * pgstat_initialize() -
- *
- *    Initialize pgstats state, and set up our on-proc-exit hook.
- *    Called from InitPostgres and AuxiliaryProcessMain. For auxiliary process,
- *    MyBackendId is invalid. Otherwise, MyBackendId must be set,
- *    but we must not have started any transaction yet (since the
- *    exit hook must run after the last transaction exit).
- *    NOTE: MyDatabaseId isn't set yet; so the shutdown hook has to be careful.
- * ----------
- */
-void
-pgstat_initialize(void)
-{
-    /* Initialize MyBEEntry */
-    if (MyBackendId != InvalidBackendId)
-    {
-        Assert(MyBackendId >= 1 && MyBackendId <= MaxBackends);
-        MyBEEntry = &BackendStatusArray[MyBackendId - 1];
-    }
-    else
-    {
-        /* Must be an auxiliary process */
-        Assert(MyAuxProcType != NotAnAuxProcess);
-
-        /*
-         * Assign the MyBEEntry for an auxiliary process.  Since it doesn't
-         * have a BackendId, the slot is statically allocated based on the
-         * auxiliary process type (MyAuxProcType).  Backends use slots indexed
-         * in the range from 1 to MaxBackends (inclusive), so we use
-         * MaxBackends + AuxBackendType + 1 as the index of the slot for an
-         * auxiliary process.
-         */
-        MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
-    }
-
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
-}
-
-/* ----------
- * pgstat_bestart() -
- *
- *    Initialize this backend's entry in the PgBackendStatus array.
- *    Called from InitPostgres.
- *
- *    Apart from auxiliary processes, MyBackendId, MyDatabaseId,
- *    session userid, and application_name must be set for a
- *    backend (hence, this cannot be combined with pgstat_initialize).
- * ----------
- */
-void
-pgstat_bestart(void)
-{
-    SockAddr    clientaddr;
-    volatile PgBackendStatus *beentry;
-
-    /*
-     * To minimize the time spent modifying the PgBackendStatus entry, fetch
-     * all the needed data first.
-     */
-
-    /*
-     * We may not have a MyProcPort (eg, if this is the autovacuum process).
-     * If so, use all-zeroes client address, which is dealt with specially in
-     * pg_stat_get_backend_client_addr and pg_stat_get_backend_client_port.
-     */
-    if (MyProcPort)
-        memcpy(&clientaddr, &MyProcPort->raddr, sizeof(clientaddr));
-    else
-        MemSet(&clientaddr, 0, sizeof(clientaddr));
-
-    /*
-     * Initialize my status entry, following the protocol of bumping
-     * st_changecount before and after; and make sure it's even afterwards. We
-     * use a volatile pointer here to ensure the compiler doesn't try to get
-     * cute.
-     */
-    beentry = MyBEEntry;
-
-    /* pgstats state must be initialized from pgstat_initialize() */
-    Assert(beentry != NULL);
-
-    if (MyBackendId != InvalidBackendId)
-    {
-        if (IsAutoVacuumLauncherProcess())
-        {
-            /* Autovacuum Launcher */
-            beentry->st_backendType = B_AUTOVAC_LAUNCHER;
-        }
-        else if (IsAutoVacuumWorkerProcess())
-        {
-            /* Autovacuum Worker */
-            beentry->st_backendType = B_AUTOVAC_WORKER;
-        }
-        else if (am_walsender)
-        {
-            /* Wal sender */
-            beentry->st_backendType = B_WAL_SENDER;
-        }
-        else if (IsBackgroundWorker)
-        {
-            /* bgworker */
-            beentry->st_backendType = B_BG_WORKER;
-        }
-        else
-        {
-            /* client-backend */
-            beentry->st_backendType = B_BACKEND;
-        }
-    }
-    else
-    {
-        /* Must be an auxiliary process */
-        Assert(MyAuxProcType != NotAnAuxProcess);
-        switch (MyAuxProcType)
-        {
-            case StartupProcess:
-                beentry->st_backendType = B_STARTUP;
-                break;
-            case BgWriterProcess:
-                beentry->st_backendType = B_BG_WRITER;
-                break;
-            case ArchiverProcess:
-                beentry->st_backendType = B_ARCHIVER;
-                break;
-            case CheckpointerProcess:
-                beentry->st_backendType = B_CHECKPOINTER;
-                break;
-            case WalWriterProcess:
-                beentry->st_backendType = B_WAL_WRITER;
-                break;
-            case WalReceiverProcess:
-                beentry->st_backendType = B_WAL_RECEIVER;
-                break;
-            default:
-                elog(FATAL, "unrecognized process type: %d",
-                     (int) MyAuxProcType);
-                proc_exit(1);
-        }
-    }
-
-    do
-    {
-        pgstat_increment_changecount_before(beentry);
-    } while ((beentry->st_changecount & 1) == 0);
-
-    beentry->st_procpid = MyProcPid;
-    beentry->st_proc_start_timestamp = MyStartTimestamp;
-    beentry->st_activity_start_timestamp = 0;
-    beentry->st_state_start_timestamp = 0;
-    beentry->st_xact_start_timestamp = 0;
-    beentry->st_databaseid = MyDatabaseId;
-
-    /* We have userid for client-backends, wal-sender and bgworker processes */
-    if (beentry->st_backendType == B_BACKEND
-        || beentry->st_backendType == B_WAL_SENDER
-        || beentry->st_backendType == B_BG_WORKER)
-        beentry->st_userid = GetSessionUserId();
-    else
-        beentry->st_userid = InvalidOid;
-
-    beentry->st_clientaddr = clientaddr;
-    if (MyProcPort && MyProcPort->remote_hostname)
-        strlcpy(beentry->st_clienthostname, MyProcPort->remote_hostname,
-                NAMEDATALEN);
-    else
-        beentry->st_clienthostname[0] = '\0';
-#ifdef USE_SSL
-    if (MyProcPort && MyProcPort->ssl != NULL)
-    {
-        beentry->st_ssl = true;
-        beentry->st_sslstatus->ssl_bits = be_tls_get_cipher_bits(MyProcPort);
-        beentry->st_sslstatus->ssl_compression = be_tls_get_compression(MyProcPort);
-        strlcpy(beentry->st_sslstatus->ssl_version, be_tls_get_version(MyProcPort), NAMEDATALEN);
-        strlcpy(beentry->st_sslstatus->ssl_cipher, be_tls_get_cipher(MyProcPort), NAMEDATALEN);
-        be_tls_get_peerdn_name(MyProcPort, beentry->st_sslstatus->ssl_clientdn, NAMEDATALEN);
-    }
-    else
-    {
-        beentry->st_ssl = false;
-    }
-#else
-    beentry->st_ssl = false;
-#endif
-    beentry->st_state = STATE_UNDEFINED;
-    beentry->st_appname[0] = '\0';
-    beentry->st_activity_raw[0] = '\0';
-    /* Also make sure the last byte in each string area is always 0 */
-    beentry->st_clienthostname[NAMEDATALEN - 1] = '\0';
-    beentry->st_appname[NAMEDATALEN - 1] = '\0';
-    beentry->st_activity_raw[pgstat_track_activity_query_size - 1] = '\0';
-    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
-    beentry->st_progress_command_target = InvalidOid;
-
-    /*
-     * we don't zero st_progress_param here to save cycles; nobody should
-     * examine it until st_progress_command has been set to something other
-     * than PROGRESS_COMMAND_INVALID
-     */
-
-    pgstat_increment_changecount_after(beentry);
-
-    /* Update app name to current GUC setting */
-    if (application_name)
-        pgstat_report_appname(application_name);
-}
-
-/*
- * Shut down a single backend's statistics reporting at process exit.
- *
- * Flush any remaining statistics counts out to the collector.
- * Without this, operations triggered during backend exit (such as
- * temp table deletions) won't be counted.
- *
- * Lastly, clear out our entry in the PgBackendStatus array.
- */
-static void
-pgstat_beshutdown_hook(int code, Datum arg)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    /*
-     * If we got as far as discovering our own database ID, we can report what
-     * we did to the collector.  Otherwise, we'd be sending an invalid
-     * database ID, so forget it.  (This means that accesses to pg_database
-     * during failed backend starts might never get counted.)
-     */
-    if (OidIsValid(MyDatabaseId))
-        pgstat_report_stat(true);
-
-    /*
-     * Clear my status entry, following the protocol of bumping st_changecount
-     * before and after.  We use a volatile pointer here to ensure the
-     * compiler doesn't try to get cute.
-     */
-    pgstat_increment_changecount_before(beentry);
-
-    beentry->st_procpid = 0;    /* mark invalid */
-
-    pgstat_increment_changecount_after(beentry);
-}
-
-
-/* ----------
- * pgstat_report_activity() -
- *
- *    Called from tcop/postgres.c to report what the backend is actually doing
- *    (but note cmd_str can be NULL for certain cases).
- *
- * All updates of the status entry follow the protocol of bumping
- * st_changecount before and after.  We use a volatile pointer here to
- * ensure the compiler doesn't try to get cute.
- * ----------
- */
-void
-pgstat_report_activity(BackendState state, const char *cmd_str)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-    TimestampTz start_timestamp;
-    TimestampTz current_timestamp;
-    int            len = 0;
-
-    TRACE_POSTGRESQL_STATEMENT_STATUS(cmd_str);
-
-    if (!beentry)
-        return;
-
-    if (!pgstat_track_activities)
-    {
-        if (beentry->st_state != STATE_DISABLED)
-        {
-            volatile PGPROC *proc = MyProc;
-
-            /*
-             * track_activities is disabled, but we last reported a
-             * non-disabled state.  As our final update, change the state and
-             * clear fields we will not be updating anymore.
-             */
-            pgstat_increment_changecount_before(beentry);
-            beentry->st_state = STATE_DISABLED;
-            beentry->st_state_start_timestamp = 0;
-            beentry->st_activity_raw[0] = '\0';
-            beentry->st_activity_start_timestamp = 0;
-            /* st_xact_start_timestamp and wait_event_info are also disabled */
-            beentry->st_xact_start_timestamp = 0;
-            proc->wait_event_info = 0;
-            pgstat_increment_changecount_after(beentry);
-        }
-        return;
-    }
-
-    /*
-     * To minimize the time spent modifying the entry, fetch all the needed
-     * data first.
-     */
-    start_timestamp = GetCurrentStatementStartTimestamp();
-    if (cmd_str != NULL)
-    {
-        /*
-         * Compute length of to-be-stored string unaware of multi-byte
-         * characters. For speed reasons that'll get corrected on read, rather
-         * than computed every write.
-         */
-        len = Min(strlen(cmd_str), pgstat_track_activity_query_size - 1);
-    }
-    current_timestamp = GetCurrentTimestamp();
-
-    /*
-     * Now update the status entry
-     */
-    pgstat_increment_changecount_before(beentry);
-
-    beentry->st_state = state;
-    beentry->st_state_start_timestamp = current_timestamp;
-
-    if (cmd_str != NULL)
-    {
-        memcpy((char *) beentry->st_activity_raw, cmd_str, len);
-        beentry->st_activity_raw[len] = '\0';
-        beentry->st_activity_start_timestamp = start_timestamp;
-    }
-
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_start_command() -
- *
- * Set st_progress_command (and st_progress_command_target) in own backend
- * entry.  Also, zero-initialize st_progress_param array.
- *-----------
- */
-void
-pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    if (!beentry || !pgstat_track_activities)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_progress_command = cmdtype;
-    beentry->st_progress_command_target = relid;
-    MemSet(&beentry->st_progress_param, 0, sizeof(beentry->st_progress_param));
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_update_param() -
- *
- * Update index'th member in st_progress_param[] of own backend entry.
- *-----------
- */
-void
-pgstat_progress_update_param(int index, int64 val)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    Assert(index >= 0 && index < PGSTAT_NUM_PROGRESS_PARAM);
-
-    if (!beentry || !pgstat_track_activities)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_progress_param[index] = val;
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_update_multi_param() -
- *
- * Update multiple members in st_progress_param[] of own backend entry.
- * This is atomic; readers won't see intermediate states.
- *-----------
- */
-void
-pgstat_progress_update_multi_param(int nparam, const int *index,
-                                   const int64 *val)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-    int            i;
-
-    if (!beentry || !pgstat_track_activities || nparam == 0)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-
-    for (i = 0; i < nparam; ++i)
-    {
-        Assert(index[i] >= 0 && index[i] < PGSTAT_NUM_PROGRESS_PARAM);
-
-        beentry->st_progress_param[index[i]] = val[i];
-    }
-
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_end_command() -
- *
- * Reset st_progress_command (and st_progress_command_target) in own backend
- * entry.  This signals the end of the command.
- *-----------
- */
-void
-pgstat_progress_end_command(void)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    if (!beentry)
-        return;
-    if (!pgstat_track_activities
-        && beentry->st_progress_command == PROGRESS_COMMAND_INVALID)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
-    beentry->st_progress_command_target = InvalidOid;
-    pgstat_increment_changecount_after(beentry);
-}
-
-/* ----------
- * pgstat_report_appname() -
- *
- *    Called to update our application name.
- * ----------
- */
-void
-pgstat_report_appname(const char *appname)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-    int            len;
-
-    if (!beentry)
-        return;
-
-    /* This should be unnecessary if GUC did its job, but be safe */
-    len = pg_mbcliplen(appname, strlen(appname), NAMEDATALEN - 1);
-
-    /*
-     * Update my status entry, following the protocol of bumping
-     * st_changecount before and after.  We use a volatile pointer here to
-     * ensure the compiler doesn't try to get cute.
-     */
-    pgstat_increment_changecount_before(beentry);
-
-    memcpy((char *) beentry->st_appname, appname, len);
-    beentry->st_appname[len] = '\0';
-
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*
- * Report current transaction start timestamp as the specified value.
- * Zero means there is no active transaction.
- */
-void
-pgstat_report_xact_timestamp(TimestampTz tstamp)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    if (!pgstat_track_activities || !beentry)
-        return;
-
-    /*
-     * Update my status entry, following the protocol of bumping
-     * st_changecount before and after.  We use a volatile pointer here to
-     * ensure the compiler doesn't try to get cute.
-     */
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_xact_start_timestamp = tstamp;
-    pgstat_increment_changecount_after(beentry);
-}
-
-/* ----------
- * pgstat_read_current_status() -
- *
- *    Copy the current contents of the PgBackendStatus array to local memory,
- *    if not already done in this transaction.
- * ----------
- */
-static void
-pgstat_read_current_status(void)
-{
-    volatile PgBackendStatus *beentry;
-    LocalPgBackendStatus *localtable;
-    LocalPgBackendStatus *localentry;
-    char       *localappname,
-               *localclienthostname,
-               *localactivity;
-#ifdef USE_SSL
-    PgBackendSSLStatus *localsslstatus;
-#endif
-    int            i;
-
-    Assert(!pgStatRunningInCollector);
-    if (localBackendStatusTable)
-        return;                    /* already done */
-
-    pgstat_setup_memcxt();
-
-    localtable = (LocalPgBackendStatus *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           sizeof(LocalPgBackendStatus) * NumBackendStatSlots);
-    localappname = (char *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           NAMEDATALEN * NumBackendStatSlots);
-    localclienthostname = (char *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           NAMEDATALEN * NumBackendStatSlots);
-    localactivity = (char *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           pgstat_track_activity_query_size * NumBackendStatSlots);
-#ifdef USE_SSL
-    localsslstatus = (PgBackendSSLStatus *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           sizeof(PgBackendSSLStatus) * NumBackendStatSlots);
-#endif
-
-    localNumBackends = 0;
-
-    beentry = BackendStatusArray;
-    localentry = localtable;
-    for (i = 1; i <= NumBackendStatSlots; i++)
-    {
-        /*
-         * Follow the protocol of retrying if st_changecount changes while we
-         * copy the entry, or if it's odd.  (The check for odd is needed to
-         * cover the case where we are able to completely copy the entry while
-         * the source backend is between increment steps.)    We use a volatile
-         * pointer here to ensure the compiler doesn't try to get cute.
-         */
-        for (;;)
-        {
-            int            before_changecount;
-            int            after_changecount;
-
-            pgstat_save_changecount_before(beentry, before_changecount);
-
-            localentry->backendStatus.st_procpid = beentry->st_procpid;
-            if (localentry->backendStatus.st_procpid > 0)
-            {
-                memcpy(&localentry->backendStatus, (char *) beentry, sizeof(PgBackendStatus));
-
-                /*
-                 * strcpy is safe even if the string is modified concurrently,
-                 * because there's always a \0 at the end of the buffer.
-                 */
-                strcpy(localappname, (char *) beentry->st_appname);
-                localentry->backendStatus.st_appname = localappname;
-                strcpy(localclienthostname, (char *) beentry->st_clienthostname);
-                localentry->backendStatus.st_clienthostname = localclienthostname;
-                strcpy(localactivity, (char *) beentry->st_activity_raw);
-                localentry->backendStatus.st_activity_raw = localactivity;
-                localentry->backendStatus.st_ssl = beentry->st_ssl;
-#ifdef USE_SSL
-                if (beentry->st_ssl)
-                {
-                    memcpy(localsslstatus, beentry->st_sslstatus, sizeof(PgBackendSSLStatus));
-                    localentry->backendStatus.st_sslstatus = localsslstatus;
-                }
-#endif
-            }
-
-            pgstat_save_changecount_after(beentry, after_changecount);
-            if (before_changecount == after_changecount &&
-                (before_changecount & 1) == 0)
-                break;
-
-            /* Make sure we can break out of loop if stuck... */
-            CHECK_FOR_INTERRUPTS();
-        }
-
-        beentry++;
-        /* Only valid entries get included into the local array */
-        if (localentry->backendStatus.st_procpid > 0)
-        {
-            BackendIdGetTransactionIds(i,
-                                       &localentry->backend_xid,
-                                       &localentry->backend_xmin);
-
-            localentry++;
-            localappname += NAMEDATALEN;
-            localclienthostname += NAMEDATALEN;
-            localactivity += pgstat_track_activity_query_size;
-#ifdef USE_SSL
-            localsslstatus++;
-#endif
-            localNumBackends++;
-        }
-    }
-
-    /* Set the pointer only after completion of a valid table */
-    localBackendStatusTable = localtable;
-}
-
-/* ----------
- * pgstat_get_wait_event_type() -
- *
- *    Return a string representing the current wait event type, backend is
- *    waiting on.
- */
-const char *
-pgstat_get_wait_event_type(uint32 wait_event_info)
-{
-    uint32        classId;
-    const char *event_type;
-
-    /* report process as not waiting. */
-    if (wait_event_info == 0)
-        return NULL;
-
-    classId = wait_event_info & 0xFF000000;
-
-    switch (classId)
-    {
-        case PG_WAIT_LWLOCK:
-            event_type = "LWLock";
-            break;
-        case PG_WAIT_LOCK:
-            event_type = "Lock";
-            break;
-        case PG_WAIT_BUFFER_PIN:
-            event_type = "BufferPin";
-            break;
-        case PG_WAIT_ACTIVITY:
-            event_type = "Activity";
-            break;
-        case PG_WAIT_CLIENT:
-            event_type = "Client";
-            break;
-        case PG_WAIT_EXTENSION:
-            event_type = "Extension";
-            break;
-        case PG_WAIT_IPC:
-            event_type = "IPC";
-            break;
-        case PG_WAIT_TIMEOUT:
-            event_type = "Timeout";
-            break;
-        case PG_WAIT_IO:
-            event_type = "IO";
-            break;
-        default:
-            event_type = "???";
-            break;
-    }
-
-    return event_type;
-}
-
-/* ----------
- * pgstat_get_wait_event() -
- *
- *    Return a string representing the current wait event, backend is
- *    waiting on.
- */
-const char *
-pgstat_get_wait_event(uint32 wait_event_info)
-{
-    uint32        classId;
-    uint16        eventId;
-    const char *event_name;
-
-    /* report process as not waiting. */
-    if (wait_event_info == 0)
-        return NULL;
-
-    classId = wait_event_info & 0xFF000000;
-    eventId = wait_event_info & 0x0000FFFF;
-
-    switch (classId)
-    {
-        case PG_WAIT_LWLOCK:
-            event_name = GetLWLockIdentifier(classId, eventId);
-            break;
-        case PG_WAIT_LOCK:
-            event_name = GetLockNameFromTagType(eventId);
-            break;
-        case PG_WAIT_BUFFER_PIN:
-            event_name = "BufferPin";
-            break;
-        case PG_WAIT_ACTIVITY:
-            {
-                WaitEventActivity w = (WaitEventActivity) wait_event_info;
-
-                event_name = pgstat_get_wait_activity(w);
-                break;
-            }
-        case PG_WAIT_CLIENT:
-            {
-                WaitEventClient w = (WaitEventClient) wait_event_info;
-
-                event_name = pgstat_get_wait_client(w);
-                break;
-            }
-        case PG_WAIT_EXTENSION:
-            event_name = "Extension";
-            break;
-        case PG_WAIT_IPC:
-            {
-                WaitEventIPC w = (WaitEventIPC) wait_event_info;
-
-                event_name = pgstat_get_wait_ipc(w);
-                break;
-            }
-        case PG_WAIT_TIMEOUT:
-            {
-                WaitEventTimeout w = (WaitEventTimeout) wait_event_info;
-
-                event_name = pgstat_get_wait_timeout(w);
-                break;
-            }
-        case PG_WAIT_IO:
-            {
-                WaitEventIO w = (WaitEventIO) wait_event_info;
-
-                event_name = pgstat_get_wait_io(w);
-                break;
-            }
-        default:
-            event_name = "unknown wait event";
-            break;
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_activity() -
- *
- * Convert WaitEventActivity to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_activity(WaitEventActivity w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_ARCHIVER_MAIN:
-            event_name = "ArchiverMain";
-            break;
-        case WAIT_EVENT_AUTOVACUUM_MAIN:
-            event_name = "AutoVacuumMain";
-            break;
-        case WAIT_EVENT_BGWRITER_HIBERNATE:
-            event_name = "BgWriterHibernate";
-            break;
-        case WAIT_EVENT_BGWRITER_MAIN:
-            event_name = "BgWriterMain";
-            break;
-        case WAIT_EVENT_CHECKPOINTER_MAIN:
-            event_name = "CheckpointerMain";
-            break;
-        case WAIT_EVENT_LOGICAL_APPLY_MAIN:
-            event_name = "LogicalApplyMain";
-            break;
-        case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
-            event_name = "LogicalLauncherMain";
-            break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
-            break;
-        case WAIT_EVENT_RECOVERY_WAL_ALL:
-            event_name = "RecoveryWalAll";
-            break;
-        case WAIT_EVENT_RECOVERY_WAL_STREAM:
-            event_name = "RecoveryWalStream";
-            break;
-        case WAIT_EVENT_SYSLOGGER_MAIN:
-            event_name = "SysLoggerMain";
-            break;
-        case WAIT_EVENT_WAL_RECEIVER_MAIN:
-            event_name = "WalReceiverMain";
-            break;
-        case WAIT_EVENT_WAL_SENDER_MAIN:
-            event_name = "WalSenderMain";
-            break;
-        case WAIT_EVENT_WAL_WRITER_MAIN:
-            event_name = "WalWriterMain";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_client() -
- *
- * Convert WaitEventClient to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_client(WaitEventClient w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_CLIENT_READ:
-            event_name = "ClientRead";
-            break;
-        case WAIT_EVENT_CLIENT_WRITE:
-            event_name = "ClientWrite";
-            break;
-        case WAIT_EVENT_LIBPQWALRECEIVER_CONNECT:
-            event_name = "LibPQWalReceiverConnect";
-            break;
-        case WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE:
-            event_name = "LibPQWalReceiverReceive";
-            break;
-        case WAIT_EVENT_SSL_OPEN_SERVER:
-            event_name = "SSLOpenServer";
-            break;
-        case WAIT_EVENT_WAL_RECEIVER_WAIT_START:
-            event_name = "WalReceiverWaitStart";
-            break;
-        case WAIT_EVENT_WAL_SENDER_WAIT_WAL:
-            event_name = "WalSenderWaitForWAL";
-            break;
-        case WAIT_EVENT_WAL_SENDER_WRITE_DATA:
-            event_name = "WalSenderWriteData";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_ipc() -
- *
- * Convert WaitEventIPC to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_ipc(WaitEventIPC w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_BGWORKER_SHUTDOWN:
-            event_name = "BgWorkerShutdown";
-            break;
-        case WAIT_EVENT_BGWORKER_STARTUP:
-            event_name = "BgWorkerStartup";
-            break;
-        case WAIT_EVENT_BTREE_PAGE:
-            event_name = "BtreePage";
-            break;
-        case WAIT_EVENT_CLOG_GROUP_UPDATE:
-            event_name = "ClogGroupUpdate";
-            break;
-        case WAIT_EVENT_EXECUTE_GATHER:
-            event_name = "ExecuteGather";
-            break;
-        case WAIT_EVENT_HASH_BATCH_ALLOCATING:
-            event_name = "Hash/Batch/Allocating";
-            break;
-        case WAIT_EVENT_HASH_BATCH_ELECTING:
-            event_name = "Hash/Batch/Electing";
-            break;
-        case WAIT_EVENT_HASH_BATCH_LOADING:
-            event_name = "Hash/Batch/Loading";
-            break;
-        case WAIT_EVENT_HASH_BUILD_ALLOCATING:
-            event_name = "Hash/Build/Allocating";
-            break;
-        case WAIT_EVENT_HASH_BUILD_ELECTING:
-            event_name = "Hash/Build/Electing";
-            break;
-        case WAIT_EVENT_HASH_BUILD_HASHING_INNER:
-            event_name = "Hash/Build/HashingInner";
-            break;
-        case WAIT_EVENT_HASH_BUILD_HASHING_OUTER:
-            event_name = "Hash/Build/HashingOuter";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING:
-            event_name = "Hash/GrowBatches/Allocating";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_DECIDING:
-            event_name = "Hash/GrowBatches/Deciding";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_ELECTING:
-            event_name = "Hash/GrowBatches/Electing";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_FINISHING:
-            event_name = "Hash/GrowBatches/Finishing";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING:
-            event_name = "Hash/GrowBatches/Repartitioning";
-            break;
-        case WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING:
-            event_name = "Hash/GrowBuckets/Allocating";
-            break;
-        case WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING:
-            event_name = "Hash/GrowBuckets/Electing";
-            break;
-        case WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING:
-            event_name = "Hash/GrowBuckets/Reinserting";
-            break;
-        case WAIT_EVENT_LOGICAL_SYNC_DATA:
-            event_name = "LogicalSyncData";
-            break;
-        case WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE:
-            event_name = "LogicalSyncStateChange";
-            break;
-        case WAIT_EVENT_MQ_INTERNAL:
-            event_name = "MessageQueueInternal";
-            break;
-        case WAIT_EVENT_MQ_PUT_MESSAGE:
-            event_name = "MessageQueuePutMessage";
-            break;
-        case WAIT_EVENT_MQ_RECEIVE:
-            event_name = "MessageQueueReceive";
-            break;
-        case WAIT_EVENT_MQ_SEND:
-            event_name = "MessageQueueSend";
-            break;
-        case WAIT_EVENT_PARALLEL_BITMAP_SCAN:
-            event_name = "ParallelBitmapScan";
-            break;
-        case WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN:
-            event_name = "ParallelCreateIndexScan";
-            break;
-        case WAIT_EVENT_PARALLEL_FINISH:
-            event_name = "ParallelFinish";
-            break;
-        case WAIT_EVENT_PROCARRAY_GROUP_UPDATE:
-            event_name = "ProcArrayGroupUpdate";
-            break;
-        case WAIT_EVENT_PROMOTE:
-            event_name = "Promote";
-            break;
-        case WAIT_EVENT_REPLICATION_ORIGIN_DROP:
-            event_name = "ReplicationOriginDrop";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_DROP:
-            event_name = "ReplicationSlotDrop";
-            break;
-        case WAIT_EVENT_SAFE_SNAPSHOT:
-            event_name = "SafeSnapshot";
-            break;
-        case WAIT_EVENT_SYNC_REP:
-            event_name = "SyncRep";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_timeout() -
- *
- * Convert WaitEventTimeout to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_timeout(WaitEventTimeout w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_BASE_BACKUP_THROTTLE:
-            event_name = "BaseBackupThrottle";
-            break;
-        case WAIT_EVENT_PG_SLEEP:
-            event_name = "PgSleep";
-            break;
-        case WAIT_EVENT_RECOVERY_APPLY_DELAY:
-            event_name = "RecoveryApplyDelay";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_io() -
- *
- * Convert WaitEventIO to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_io(WaitEventIO w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_BUFFILE_READ:
-            event_name = "BufFileRead";
-            break;
-        case WAIT_EVENT_BUFFILE_WRITE:
-            event_name = "BufFileWrite";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_READ:
-            event_name = "ControlFileRead";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_SYNC:
-            event_name = "ControlFileSync";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE:
-            event_name = "ControlFileSyncUpdate";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_WRITE:
-            event_name = "ControlFileWrite";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE:
-            event_name = "ControlFileWriteUpdate";
-            break;
-        case WAIT_EVENT_COPY_FILE_READ:
-            event_name = "CopyFileRead";
-            break;
-        case WAIT_EVENT_COPY_FILE_WRITE:
-            event_name = "CopyFileWrite";
-            break;
-        case WAIT_EVENT_DATA_FILE_EXTEND:
-            event_name = "DataFileExtend";
-            break;
-        case WAIT_EVENT_DATA_FILE_FLUSH:
-            event_name = "DataFileFlush";
-            break;
-        case WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC:
-            event_name = "DataFileImmediateSync";
-            break;
-        case WAIT_EVENT_DATA_FILE_PREFETCH:
-            event_name = "DataFilePrefetch";
-            break;
-        case WAIT_EVENT_DATA_FILE_READ:
-            event_name = "DataFileRead";
-            break;
-        case WAIT_EVENT_DATA_FILE_SYNC:
-            event_name = "DataFileSync";
-            break;
-        case WAIT_EVENT_DATA_FILE_TRUNCATE:
-            event_name = "DataFileTruncate";
-            break;
-        case WAIT_EVENT_DATA_FILE_WRITE:
-            event_name = "DataFileWrite";
-            break;
-        case WAIT_EVENT_DSM_FILL_ZERO_WRITE:
-            event_name = "DSMFillZeroWrite";
-            break;
-        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ:
-            event_name = "LockFileAddToDataDirRead";
-            break;
-        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC:
-            event_name = "LockFileAddToDataDirSync";
-            break;
-        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE:
-            event_name = "LockFileAddToDataDirWrite";
-            break;
-        case WAIT_EVENT_LOCK_FILE_CREATE_READ:
-            event_name = "LockFileCreateRead";
-            break;
-        case WAIT_EVENT_LOCK_FILE_CREATE_SYNC:
-            event_name = "LockFileCreateSync";
-            break;
-        case WAIT_EVENT_LOCK_FILE_CREATE_WRITE:
-            event_name = "LockFileCreateWrite";
-            break;
-        case WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ:
-            event_name = "LockFileReCheckDataDirRead";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC:
-            event_name = "LogicalRewriteCheckpointSync";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC:
-            event_name = "LogicalRewriteMappingSync";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE:
-            event_name = "LogicalRewriteMappingWrite";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_SYNC:
-            event_name = "LogicalRewriteSync";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE:
-            event_name = "LogicalRewriteTruncate";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_WRITE:
-            event_name = "LogicalRewriteWrite";
-            break;
-        case WAIT_EVENT_RELATION_MAP_READ:
-            event_name = "RelationMapRead";
-            break;
-        case WAIT_EVENT_RELATION_MAP_SYNC:
-            event_name = "RelationMapSync";
-            break;
-        case WAIT_EVENT_RELATION_MAP_WRITE:
-            event_name = "RelationMapWrite";
-            break;
-        case WAIT_EVENT_REORDER_BUFFER_READ:
-            event_name = "ReorderBufferRead";
-            break;
-        case WAIT_EVENT_REORDER_BUFFER_WRITE:
-            event_name = "ReorderBufferWrite";
-            break;
-        case WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ:
-            event_name = "ReorderLogicalMappingRead";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_READ:
-            event_name = "ReplicationSlotRead";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC:
-            event_name = "ReplicationSlotRestoreSync";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_SYNC:
-            event_name = "ReplicationSlotSync";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_WRITE:
-            event_name = "ReplicationSlotWrite";
-            break;
-        case WAIT_EVENT_SLRU_FLUSH_SYNC:
-            event_name = "SLRUFlushSync";
-            break;
-        case WAIT_EVENT_SLRU_READ:
-            event_name = "SLRURead";
-            break;
-        case WAIT_EVENT_SLRU_SYNC:
-            event_name = "SLRUSync";
-            break;
-        case WAIT_EVENT_SLRU_WRITE:
-            event_name = "SLRUWrite";
-            break;
-        case WAIT_EVENT_SNAPBUILD_READ:
-            event_name = "SnapbuildRead";
-            break;
-        case WAIT_EVENT_SNAPBUILD_SYNC:
-            event_name = "SnapbuildSync";
-            break;
-        case WAIT_EVENT_SNAPBUILD_WRITE:
-            event_name = "SnapbuildWrite";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC:
-            event_name = "TimelineHistoryFileSync";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE:
-            event_name = "TimelineHistoryFileWrite";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_READ:
-            event_name = "TimelineHistoryRead";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_SYNC:
-            event_name = "TimelineHistorySync";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_WRITE:
-            event_name = "TimelineHistoryWrite";
-            break;
-        case WAIT_EVENT_TWOPHASE_FILE_READ:
-            event_name = "TwophaseFileRead";
-            break;
-        case WAIT_EVENT_TWOPHASE_FILE_SYNC:
-            event_name = "TwophaseFileSync";
-            break;
-        case WAIT_EVENT_TWOPHASE_FILE_WRITE:
-            event_name = "TwophaseFileWrite";
-            break;
-        case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ:
-            event_name = "WALSenderTimelineHistoryRead";
-            break;
-        case WAIT_EVENT_WAL_BOOTSTRAP_SYNC:
-            event_name = "WALBootstrapSync";
-            break;
-        case WAIT_EVENT_WAL_BOOTSTRAP_WRITE:
-            event_name = "WALBootstrapWrite";
-            break;
-        case WAIT_EVENT_WAL_COPY_READ:
-            event_name = "WALCopyRead";
-            break;
-        case WAIT_EVENT_WAL_COPY_SYNC:
-            event_name = "WALCopySync";
-            break;
-        case WAIT_EVENT_WAL_COPY_WRITE:
-            event_name = "WALCopyWrite";
-            break;
-        case WAIT_EVENT_WAL_INIT_SYNC:
-            event_name = "WALInitSync";
-            break;
-        case WAIT_EVENT_WAL_INIT_WRITE:
-            event_name = "WALInitWrite";
-            break;
-        case WAIT_EVENT_WAL_READ:
-            event_name = "WALRead";
-            break;
-        case WAIT_EVENT_WAL_SYNC:
-            event_name = "WALSync";
-            break;
-        case WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN:
-            event_name = "WALSyncMethodAssign";
-            break;
-        case WAIT_EVENT_WAL_WRITE:
-            event_name = "WALWrite";
-            break;
-
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-
-/* ----------
- * pgstat_get_backend_current_activity() -
- *
- *    Return a string representing the current activity of the backend with
- *    the specified PID.  This looks directly at the BackendStatusArray,
- *    and so will provide current information regardless of the age of our
- *    transaction's snapshot of the status array.
- *
- *    It is the caller's responsibility to invoke this only for backends whose
- *    state is expected to remain stable while the result is in use.  The
- *    only current use is in deadlock reporting, where we can expect that
- *    the target backend is blocked on a lock.  (There are corner cases
- *    where the target's wait could get aborted while we are looking at it,
- *    but the very worst consequence is to return a pointer to a string
- *    that's been changed, so we won't worry too much.)
- *
- *    Note: return strings for special cases match pg_stat_get_backend_activity.
- * ----------
- */
-const char *
-pgstat_get_backend_current_activity(int pid, bool checkUser)
-{
-    PgBackendStatus *beentry;
-    int            i;
-
-    beentry = BackendStatusArray;
-    for (i = 1; i <= MaxBackends; i++)
-    {
-        /*
-         * Although we expect the target backend's entry to be stable, that
-         * doesn't imply that anyone else's is.  To avoid identifying the
-         * wrong backend, while we check for a match to the desired PID we
-         * must follow the protocol of retrying if st_changecount changes
-         * while we examine the entry, or if it's odd.  (This might be
-         * unnecessary, since fetching or storing an int is almost certainly
-         * atomic, but let's play it safe.)  We use a volatile pointer here to
-         * ensure the compiler doesn't try to get cute.
-         */
-        volatile PgBackendStatus *vbeentry = beentry;
-        bool        found;
-
-        for (;;)
-        {
-            int            before_changecount;
-            int            after_changecount;
-
-            pgstat_save_changecount_before(vbeentry, before_changecount);
-
-            found = (vbeentry->st_procpid == pid);
-
-            pgstat_save_changecount_after(vbeentry, after_changecount);
-
-            if (before_changecount == after_changecount &&
-                (before_changecount & 1) == 0)
-                break;
-
-            /* Make sure we can break out of loop if stuck... */
-            CHECK_FOR_INTERRUPTS();
-        }
-
-        if (found)
-        {
-            /* Now it is safe to use the non-volatile pointer */
-            if (checkUser && !superuser() && beentry->st_userid != GetUserId())
-                return "<insufficient privilege>";
-            else if (*(beentry->st_activity_raw) == '\0')
-                return "<command string not enabled>";
-            else
-            {
-                /* this'll leak a bit of memory, but that seems acceptable */
-                return pgstat_clip_activity(beentry->st_activity_raw);
-            }
-        }
-
-        beentry++;
-    }
-
-    /* If we get here, caller is in error ... */
-    return "<backend information not available>";
-}
-
-/* ----------
- * pgstat_get_crashed_backend_activity() -
- *
- *    Return a string representing the current activity of the backend with
- *    the specified PID.  Like the function above, but reads shared memory with
- *    the expectation that it may be corrupt.  On success, copy the string
- *    into the "buffer" argument and return that pointer.  On failure,
- *    return NULL.
- *
- *    This function is only intended to be used by the postmaster to report the
- *    query that crashed a backend.  In particular, no attempt is made to
- *    follow the correct concurrency protocol when accessing the
- *    BackendStatusArray.  But that's OK, in the worst case we'll return a
- *    corrupted message.  We also must take care not to trip on ereport(ERROR).
- * ----------
- */
-const char *
-pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
-{
-    volatile PgBackendStatus *beentry;
-    int            i;
-
-    beentry = BackendStatusArray;
-
-    /*
-     * We probably shouldn't get here before shared memory has been set up,
-     * but be safe.
-     */
-    if (beentry == NULL || BackendActivityBuffer == NULL)
-        return NULL;
-
-    for (i = 1; i <= MaxBackends; i++)
-    {
-        if (beentry->st_procpid == pid)
-        {
-            /* Read pointer just once, so it can't change after validation */
-            const char *activity = beentry->st_activity_raw;
-            const char *activity_last;
-
-            /*
-             * We mustn't access activity string before we verify that it
-             * falls within the BackendActivityBuffer. To make sure that the
-             * entire string including its ending is contained within the
-             * buffer, subtract one activity length from the buffer size.
-             */
-            activity_last = BackendActivityBuffer + BackendActivityBufferSize
-                - pgstat_track_activity_query_size;
-
-            if (activity < BackendActivityBuffer ||
-                activity > activity_last)
-                return NULL;
-
-            /* If no string available, no point in a report */
-            if (activity[0] == '\0')
-                return NULL;
-
-            /*
-             * Copy only ASCII-safe characters so we don't run into encoding
-             * problems when reporting the message; and be sure not to run off
-             * the end of memory.  As only ASCII characters are reported, it
-             * doesn't seem necessary to perform multibyte aware clipping.
-             */
-            ascii_safe_strlcpy(buffer, activity,
-                               Min(buflen, pgstat_track_activity_query_size));
-
-            return buffer;
-        }
-
-        beentry++;
-    }
-
-    /* PID not found */
-    return NULL;
-}
-
-const char *
-pgstat_get_backend_desc(BackendType backendType)
-{
-    const char *backendDesc = "unknown process type";
-
-    switch (backendType)
-    {
-        case B_AUTOVAC_LAUNCHER:
-            backendDesc = "autovacuum launcher";
-            break;
-        case B_AUTOVAC_WORKER:
-            backendDesc = "autovacuum worker";
-            break;
-        case B_BACKEND:
-            backendDesc = "client backend";
-            break;
-        case B_BG_WORKER:
-            backendDesc = "background worker";
-            break;
-        case B_BG_WRITER:
-            backendDesc = "background writer";
-            break;
-        case B_ARCHIVER:
-            backendDesc = "archiver";
-            break;
-        case B_CHECKPOINTER:
-            backendDesc = "checkpointer";
-            break;
-        case B_STARTUP:
-            backendDesc = "startup";
-            break;
-        case B_WAL_RECEIVER:
-            backendDesc = "walreceiver";
-            break;
-        case B_WAL_SENDER:
-            backendDesc = "walsender";
-            break;
-        case B_WAL_WRITER:
-            backendDesc = "walwriter";
-            break;
-    }
-
-    return backendDesc;
-}
-
-/* ------------------------------------------------------------
- * Local support functions follow
- * ------------------------------------------------------------
- */
-
-
-/* ----------
- * pgstat_setheader() -
- *
- *        Set common header fields in a statistics message
- * ----------
- */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
-    {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
- *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
- * ----------
- */
-void
-pgstat_send_archiver(const char *xlog, bool failed)
-{
-    PgStat_MsgArchiver msg;
-
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* ----------
- * pgstat_send_bgwriter() -
- *
- *        Send bgwriter statistics to the collector
- * ----------
- */
-void
-pgstat_send_bgwriter(void)
-{
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
-
-    /*
-     * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
-     */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
-        return;
-
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
-
-    /*
-     * Clear out the statistics buffer, so it can be re-used.
-     */
-    MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
-}
-
-
-/* ----------
- * PgstatCollectorMain() -
- *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, pgstat_sighup_handler);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, pgstat_exit);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    /*
-     * Identify myself via ps
-     */
-    init_ps_display("stats collector", "", "", "");
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do recognize got_SIGHUP inside
-     * the inner loop, which means that such interrupts will get serviced but
-     * the latch won't get cleared until next time there is a break in the
-     * action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (need_exit)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * need_exit becomes set.
-         */
-        while (!need_exit)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (got_SIGHUP)
-            {
-                got_SIGHUP = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry((PgStat_MsgInquiry *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat((PgStat_MsgTabstat *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge((PgStat_MsgTabpurge *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb((PgStat_MsgDropdb *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter((PgStat_MsgResetcounter *) &msg,
-                                             len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(
-                                                   (PgStat_MsgResetsharedcounter *) &msg,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(
-                                                   (PgStat_MsgResetsinglecounter *) &msg,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac((PgStat_MsgAutovacStart *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum((PgStat_MsgVacuum *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze((PgStat_MsgAnalyze *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver((PgStat_MsgArchiver *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter((PgStat_MsgBgWriter *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat((PgStat_MsgFuncstat *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge((PgStat_MsgFuncpurge *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict((PgStat_MsgRecoveryConflict *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock((PgStat_MsgDeadlock *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile((PgStat_MsgTempFile *) &msg, len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr & WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    exit(0);
-}
-
-
-/* SIGQUIT signal handler for collector process */
-static void
-pgstat_exit(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    need_exit = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
-/* SIGHUP handler for collector process */
-static void
-pgstat_sighup_handler(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    got_SIGHUP = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
-/*
- * Subroutine to clear stats in a database entry
- *
- * Tables and functions hashes are initialized to empty.
- */
-static void
-reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
-{
-    HASHCTL        hash_ctl;
-
-    dbentry->n_xact_commit = 0;
-    dbentry->n_xact_rollback = 0;
-    dbentry->n_blocks_fetched = 0;
-    dbentry->n_blocks_hit = 0;
-    dbentry->n_tuples_returned = 0;
-    dbentry->n_tuples_fetched = 0;
-    dbentry->n_tuples_inserted = 0;
-    dbentry->n_tuples_updated = 0;
-    dbentry->n_tuples_deleted = 0;
-    dbentry->last_autovac_time = 0;
-    dbentry->n_conflict_tablespace = 0;
-    dbentry->n_conflict_lock = 0;
-    dbentry->n_conflict_snapshot = 0;
-    dbentry->n_conflict_bufferpin = 0;
-    dbentry->n_conflict_startup_deadlock = 0;
-    dbentry->n_temp_files = 0;
-    dbentry->n_temp_bytes = 0;
-    dbentry->n_deadlocks = 0;
-    dbentry->n_block_read_time = 0;
-    dbentry->n_block_write_time = 0;
-
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
-
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
-}
-
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
-{
-    PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
-     */
-    if (!found)
-        reset_dbentry_counters(result);
-
-    return result;
-}
-
-
-/*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
-{
-    PgStat_StatTabEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /* If not found, initialize the new one. */
-    if (!found)
-    {
-        result->numscans = 0;
-        result->tuples_returned = 0;
-        result->tuples_fetched = 0;
-        result->tuples_inserted = 0;
-        result->tuples_updated = 0;
-        result->tuples_deleted = 0;
-        result->tuples_hot_updated = 0;
-        result->n_live_tuples = 0;
-        result->n_dead_tuples = 0;
-        result->changes_since_analyze = 0;
-        result->blocks_fetched = 0;
-        result->blocks_hit = 0;
-        result->vacuum_timestamp = 0;
-        result->vacuum_count = 0;
-        result->autovac_vacuum_timestamp = 0;
-        result->autovac_vacuum_count = 0;
-        result->analyze_timestamp = 0;
-        result->analyze_count = 0;
-        result->autovac_analyze_timestamp = 0;
-        result->autovac_analyze_count = 0;
-    }
-
-    return result;
-}
-
-
-/* ----------
- * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
- * ----------
- */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
-{
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    FILE       *fpout;
-    int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-    int            rc;
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Set the timestamp of the stats file.
-     */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Write global stats struct
-     */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Write archiver stats struct
-     */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database table.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        /*
-         * Write out the table and function stats for this DB into the
-         * appropriate per-DB stat file, if required.
-         */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
-
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
-
-        /*
-         * Write out the DB entry. We don't write the tables or functions
-         * pointers, since they're of no use to any other process.
-         */
-        fputc('D', fpout);
-        rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
-}
-
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
-
-/* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
- * ----------
- */
-static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
-{
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpout;
-    int32        format_id;
-    Oid            dbid = dbentry->databaseid;
-    int            rc;
-    char        tmpfile[MAXPGPATH];
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database's access stats per table.
-     */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
-    {
-        fputc('T', fpout);
-        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * Walk through the database's function stats table.
-     */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_statsfiles() -
- *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
- *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
- * ----------
- */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
-
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-
-    /*
-     * Set the current timestamp (will be kept only in case we can't load an
-     * existing statsfile).
-     */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return dbhash;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
-        goto done;
-    }
-
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Add to the DB hash
-                 */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-
-    return dbhash;
-}
-
-
-/* ----------
- * pgstat_read_db_statsfile() -
- *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
- * ----------
- */
-static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
-{
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatTabEntry tabbuf;
-    PgStat_StatFuncEntry funcbuf;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'T'    A PgStat_StatTabEntry follows.
-                 */
-            case 'T':
-                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
-                          fpin) != sizeof(PgStat_StatTabEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
-                break;
-
-                /*
-                 * 'F'    A PgStat_StatFuncEntry follows.
-                 */
-            case 'F':
-                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
-                          fpin) != sizeof(PgStat_StatFuncEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
-                break;
-
-                /*
-                 * 'E'    The EOF marker of a complete stats file.
-                 */
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
-}
-
-
-/* ----------
- * pgstat_setup_memcxt() -
- *
- *    Create pgStatLocalContext, if not already done.
- * ----------
- */
-static void
-pgstat_setup_memcxt(void)
-{
-    if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
-}
-
-
-/* ----------
- * pgstat_clear_snapshot() -
- *
- *    Discard any data collected in the current transaction.  Any subsequent
- *    request will cause new snapshots to be read.
- *
- *    This is also invoked during transaction commit or abort to discard
- *    the no-longer-wanted snapshot.
- * ----------
- */
-void
-pgstat_clear_snapshot(void)
-{
-    /* Release memory, if any was allocated */
-    if (pgStatLocalContext)
-        MemoryContextDelete(pgStatLocalContext);
-
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetshared() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
-}
-
-/*
- * Convert a potentially unsafely truncated activity string (see
- * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
- * one.
- *
- * The returned string is allocated in the caller's memory context and may be
- * freed.
- */
-char *
-pgstat_clip_activity(const char *raw_activity)
-{
-    char       *activity;
-    int            rawlen;
-    int            cliplen;
-
-    /*
-     * Some callers, like pgstat_get_backend_current_activity(), do not
-     * guarantee that the buffer isn't concurrently modified. We try to take
-     * care that the buffer is always terminated by a NUL byte regardless, but
-     * let's still be paranoid about the string's length. In those cases the
-     * underlying buffer is guaranteed to be pgstat_track_activity_query_size
-     * large.
-     */
-    activity = pnstrdup(raw_activity, pgstat_track_activity_query_size - 1);
-
-    /* now double-guaranteed to be NUL terminated */
-    rawlen = strlen(activity);
-
-    /*
-     * All supported server-encodings make it possible to determine the length
-     * of a multi-byte character from its first byte (this is not the case for
-     * client encodings, see GB18030). As st_activity is always stored using
-     * server encoding, this allows us to perform multi-byte aware truncation,
-     * even if the string earlier was truncated in the middle of a multi-byte
-     * character.
-     */
-    cliplen = pg_mbcliplen(activity, rawlen,
-                           pgstat_track_activity_query_size - 1);
-
-    activity[cliplen] = '\0';
-
-    return activity;
-}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 65eab02b3e..d3ec828657 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -95,6 +95,7 @@
 
 #include "access/transam.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/pg_control.h"
 #include "common/file_perm.h"
@@ -255,7 +256,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -502,7 +502,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1298,12 +1297,6 @@ PostmasterMain(int argc, char *argv[])
 
     whereToSendOutput = DestNone;
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1752,11 +1745,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2591,8 +2579,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -2923,8 +2909,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -2991,13 +2975,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3072,22 +3049,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3546,22 +3507,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3757,8 +3702,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3797,8 +3740,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -3999,8 +3941,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -4973,18 +4913,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5097,12 +5025,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -5972,7 +5894,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6025,8 +5946,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6258,7 +6177,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index d1ea46deb8..3accdf7bcf 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -31,11 +31,11 @@
 #include <sys/stat.h>
 #include <sys/time.h>
 
+#include "bestatus.h"
 #include "lib/stringinfo.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "nodes/pg_list.h"
-#include "pgstat.h"
 #include "pgtime.h"
 #include "postmaster/fork_process.h"
 #include "postmaster/postmaster.h"
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index a6fdba3f41..0de04159d5 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -45,9 +45,9 @@
 #include <unistd.h>
 
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/walwriter.h"
 #include "storage/bufmgr.h"
 #include "storage/condition_variable.h"
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index def6c03dd0..e30b2dbcf0 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -17,6 +17,7 @@
 #include <time.h>
 
 #include "access/xlog_internal.h"    /* for pg_start/stop_backup */
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "common/file_perm.h"
 #include "lib/stringinfo.h"
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 7027737e67..75a3208f74 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -22,11 +22,11 @@
 #include "libpq-fe.h"
 #include "pqexpbuffer.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "funcapi.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/walreceiver.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 2b0d889c3b..ab967d7d65 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -19,7 +19,7 @@
 
 #include "funcapi.h"
 #include "miscadmin.h"
-#include "pgstat.h"
+#include "bestatus.h"
 
 #include "access/heapam.h"
 #include "access/htup.h"
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index ca51318dbb..a5685b8e7e 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -77,13 +77,12 @@
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/xact.h"
-
+#include "bestatus.h"
 #include "catalog/indexing.h"
 #include "nodes/execnodes.h"
 
 #include "replication/origin.h"
 #include "replication/logical.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index b79ce5db95..b90768be86 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -61,10 +61,10 @@
 #include "access/tuptoaster.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "lib/binaryheap.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/logical.h"
 #include "replication/reorderbuffer.h"
 #include "replication/slot.h"
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 4053482420..a30f1e012e 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -126,7 +126,7 @@
 #include "access/transam.h"
 #include "access/xact.h"
 
-#include "pgstat.h"
+#include "bestatus.h"
 
 #include "replication/logical.h"
 #include "replication/reorderbuffer.h"
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index d87cf8afe5..fafef0578a 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -86,26 +86,28 @@
 #include "postgres.h"
 
 #include "miscadmin.h"
-#include "pgstat.h"
 
 #include "access/heapam.h"
 #include "access/xact.h"
 
+#include "bestatus.h"
+
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 
 #include "commands/copy.h"
 
 #include "parser/parse_relation.h"
+#include "pgstat.h"
 
 #include "replication/logicallauncher.h"
 #include "replication/logicalrelation.h"
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 
-#include "utils/snapmgr.h"
 #include "storage/ipc.h"
 
+#include "utils/snapmgr.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -128,7 +130,7 @@ finish_sync_worker(void)
     if (IsTransactionState())
     {
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(true);
     }
 
     /* And flush all writes. */
@@ -144,6 +146,9 @@ finish_sync_worker(void)
     /* Find the main apply worker and signal it. */
     logicalrep_worker_wakeup(MyLogicalRepWorker->subid, InvalidOid);
 
+    /* clean up retained statistics */
+    pgstat_update_stat(true);
+
     /* Stop gracefully */
     proc_exit(0);
 }
@@ -525,7 +530,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
     if (started_tx)
     {
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(false);
     }
 }
 
@@ -863,7 +868,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
                                            MyLogicalRepWorker->relstate,
                                            MyLogicalRepWorker->relstate_lsn);
                 CommitTransactionCommand();
-                pgstat_report_stat(false);
+                pgstat_update_stat(false);
 
                 /*
                  * We want to do the table data sync in a single transaction.
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index de23ced9af..d8d7b35058 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -31,6 +31,8 @@
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 
+#include "bestatus.h"
+
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_subscription.h"
@@ -42,17 +44,20 @@
 #include "executor/executor.h"
 #include "executor/nodeModifyTable.h"
 
+#include "funcapi.h"
+
 #include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
 
 #include "mb/pg_wchar.h"
+#include "miscadmin.h"
 
 #include "nodes/makefuncs.h"
 
 #include "optimizer/planner.h"
 
 #include "parser/parse_relation.h"
-
+#include "pgstat.h"
 #include "postmaster/bgworker.h"
 #include "postmaster/postmaster.h"
 #include "postmaster/walwriter.h"
@@ -493,7 +498,7 @@ apply_handle_commit(StringInfo s)
         replorigin_session_origin_timestamp = commit_data.committime;
 
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(false);
 
         store_flush_position(commit_data.end_lsn);
     }
@@ -1327,6 +1332,8 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
             }
 
             send_feedback(last_received, requestReply, requestReply);
+
+            pgstat_update_stat(false);
         }
     }
 }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 33b23b6b6d..c60e69302a 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -41,9 +41,9 @@
 
 #include "access/transam.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "common/string.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/slot.h"
 #include "storage/fd.h"
 #include "storage/proc.h"
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 6c160c13c6..02ec91d98e 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -75,8 +75,8 @@
 #include <unistd.h>
 
 #include "access/xact.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/syncrep.h"
 #include "replication/walsender.h"
 #include "replication/walsender_private.h"
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 2e90944ad5..bdca25499d 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -50,6 +50,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
@@ -57,7 +58,6 @@
 #include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "storage/ipc.h"
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 2d2eb23eb7..3de752bd4c 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -56,6 +56,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
 
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "commands/dbcommands.h"
@@ -65,7 +66,6 @@
 #include "libpq/pqformat.h"
 #include "miscadmin.h"
 #include "nodes/replnodes.h"
-#include "pgstat.h"
 #include "replication/basebackup.h"
 #include "replication/decode.h"
 #include "replication/logical.h"
diff --git a/src/backend/statmon/Makefile b/src/backend/statmon/Makefile
new file mode 100644
index 0000000000..64a04878e3
--- /dev/null
+++ b/src/backend/statmon/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for src/backend/statmon
+#
+# IDENTIFICATION
+#    src/backend/statmon/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/statmon
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = pgstat.o bestatus.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/statmon/bestatus.c b/src/backend/statmon/bestatus.c
new file mode 100644
index 0000000000..2a251ae1b5
--- /dev/null
+++ b/src/backend/statmon/bestatus.c
@@ -0,0 +1,1779 @@
+/* ----------
+ * bestatus.c
+ *
+ *    Backend status monitor
+ *
+ *    Status data is stored in shared memory. Every backend updates and reads
+ *    it individually.
+ *
+ *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
+ *
+ *    src/backend/statmon/bestatus.c
+ * ----------
+ */
+#include "postgres.h"
+
+#include "bestatus.h"
+
+#include "access/xact.h"
+#include "libpq/libpq.h"
+#include "miscadmin.h"
+#include "postmaster/autovacuum.h"
+#include "replication/walsender.h"
+#include "storage/ipc.h"
+#include "storage/lmgr.h"
+#include "storage/sinvaladt.h"
+#include "utils/ascii.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/probes.h"
+
+
+/* Status for backends including auxiliary */
+static LocalPgBackendStatus *localBackendStatusTable = NULL;
+
+/* Total number of backends including auxiliary */
+static int    localNumBackends = 0;
+
+/* ----------
+ * Total number of backends including auxiliary
+ *
+ * We reserve a slot for each possible BackendId, plus one for each
+ * possible auxiliary process type.  (This scheme assumes there is not
+ * more than one of any auxiliary process type at a time.) MaxBackends
+ * includes autovacuum workers and background workers as well.
+ * ----------
+ */
+#define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
+
+
+/* ----------
+ * GUC parameters
+ * ----------
+ */
+bool        pgstat_track_activities = false;
+int            pgstat_track_activity_query_size = 1024;
+
+static MemoryContext pgBeStatLocalContext = NULL;
+
+/* ------------------------------------------------------------
+ * Functions for management of the shared-memory PgBackendStatus array
+ * ------------------------------------------------------------
+ */
+
+static PgBackendStatus *BackendStatusArray = NULL;
+static PgBackendStatus *MyBEEntry = NULL;
+static char *BackendAppnameBuffer = NULL;
+static char *BackendClientHostnameBuffer = NULL;
+static char *BackendActivityBuffer = NULL;
+static Size BackendActivityBufferSize = 0;
+#ifdef USE_SSL
+static PgBackendSSLStatus *BackendSslStatusBuffer = NULL;
+#endif
+
+static const char *pgstat_get_wait_activity(WaitEventActivity w);
+static const char *pgstat_get_wait_client(WaitEventClient w);
+static const char *pgstat_get_wait_ipc(WaitEventIPC w);
+static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
+static const char *pgstat_get_wait_io(WaitEventIO w);
+static void pgstat_setup_memcxt(void);
+static void bestatus_clear_snapshot(void);
+static void pgstat_beshutdown_hook(int code, Datum arg);
+/*
+ * Report shared-memory space needed by CreateSharedBackendStatus.
+ */
+Size
+BackendStatusShmemSize(void)
+{
+    Size        size;
+
+    /* BackendStatusArray: */
+    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
+    /* BackendAppnameBuffer: */
+    size = add_size(size,
+                    mul_size(NAMEDATALEN, NumBackendStatSlots));
+    /* BackendClientHostnameBuffer: */
+    size = add_size(size,
+                    mul_size(NAMEDATALEN, NumBackendStatSlots));
+    /* BackendActivityBuffer: */
+    size = add_size(size,
+                    mul_size(pgstat_track_activity_query_size, NumBackendStatSlots));
+#ifdef USE_SSL
+    /* BackendSslStatusBuffer: */
+    size = add_size(size,
+                    mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots));
+#endif
+    return size;
+}
+
+/*
+ * Initialize the shared status array and several string buffers
+ * during postmaster startup.
+ */
+void
+CreateSharedBackendStatus(void)
+{
+    Size        size;
+    bool        found;
+    int            i;
+    char       *buffer;
+
+    /* Create or attach to the shared array */
+    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
+    BackendStatusArray = (PgBackendStatus *)
+        ShmemInitStruct("Backend Status Array", size, &found);
+
+    if (!found)
+    {
+        /*
+         * We're the first - initialize.
+         */
+        MemSet(BackendStatusArray, 0, size);
+    }
+
+    /* Create or attach to the shared appname buffer */
+    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
+    BackendAppnameBuffer = (char *)
+        ShmemInitStruct("Backend Application Name Buffer", size, &found);
+
+    if (!found)
+    {
+        MemSet(BackendAppnameBuffer, 0, size);
+
+        /* Initialize st_appname pointers. */
+        buffer = BackendAppnameBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_appname = buffer;
+            buffer += NAMEDATALEN;
+        }
+    }
+
+    /* Create or attach to the shared client hostname buffer */
+    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
+    BackendClientHostnameBuffer = (char *)
+        ShmemInitStruct("Backend Client Host Name Buffer", size, &found);
+
+    if (!found)
+    {
+        MemSet(BackendClientHostnameBuffer, 0, size);
+
+        /* Initialize st_clienthostname pointers. */
+        buffer = BackendClientHostnameBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_clienthostname = buffer;
+            buffer += NAMEDATALEN;
+        }
+    }
+
+    /* Create or attach to the shared activity buffer */
+    BackendActivityBufferSize = mul_size(pgstat_track_activity_query_size,
+                                         NumBackendStatSlots);
+    BackendActivityBuffer = (char *)
+        ShmemInitStruct("Backend Activity Buffer",
+                        BackendActivityBufferSize,
+                        &found);
+
+    if (!found)
+    {
+        MemSet(BackendActivityBuffer, 0, BackendActivityBufferSize);
+
+        /* Initialize st_activity pointers. */
+        buffer = BackendActivityBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_activity_raw = buffer;
+            buffer += pgstat_track_activity_query_size;
+        }
+    }
+
+#ifdef USE_SSL
+    /* Create or attach to the shared SSL status buffer */
+    size = mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots);
+    BackendSslStatusBuffer = (PgBackendSSLStatus *)
+        ShmemInitStruct("Backend SSL Status Buffer", size, &found);
+
+    if (!found)
+    {
+        PgBackendSSLStatus *ptr;
+
+        MemSet(BackendSslStatusBuffer, 0, size);
+
+        /* Initialize st_sslstatus pointers. */
+        ptr = BackendSslStatusBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_sslstatus = ptr;
+            ptr++;
+        }
+    }
+#endif
+}
+
+/* ----------
+ * pgstat_bearray_initialize() -
+ *
+ *    Initialize pgstats state, and set up our on-proc-exit hook.
+ *    Called from InitPostgres and AuxiliaryProcessMain. For an auxiliary
+ *    process, MyBackendId is invalid. Otherwise, MyBackendId must be set,
+ *    but we must not have started any transaction yet (since the
+ *    exit hook must run after the last transaction exit).
+ *    NOTE: MyDatabaseId isn't set yet; so the shutdown hook has to be careful.
+ * ----------
+ */
+void
+pgstat_bearray_initialize(void)
+{
+    /* Initialize MyBEEntry */
+    if (MyBackendId != InvalidBackendId)
+    {
+        Assert(MyBackendId >= 1 && MyBackendId <= MaxBackends);
+        MyBEEntry = &BackendStatusArray[MyBackendId - 1];
+    }
+    else
+    {
+        /* Must be an auxiliary process */
+        Assert(MyAuxProcType != NotAnAuxProcess);
+
+        /*
+         * Assign the MyBEEntry for an auxiliary process.  Since it doesn't
+         * have a BackendId, the slot is statically allocated based on the
+         * auxiliary process type (MyAuxProcType).  Backends with BackendId
+         * 1 to MaxBackends (inclusive) occupy array elements 0 to
+         * MaxBackends - 1, so we use MaxBackends + MyAuxProcType as the
+         * zero-based array index of the slot for an auxiliary process.
+         */
+        MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
+    }
+
+    /* Set up a process-exit hook to clean up */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
+}
+
+/*
+ * Shut down a single backend's status reporting at process exit.
+ *
+ * This hook only touches the PgBackendStatus array; it does not flush
+ * any pending statistics counts itself.
+ *
+ * Clear out our entry in the PgBackendStatus array by marking it
+ * invalid (st_procpid = 0).
+ */
+static void
+pgstat_beshutdown_hook(int code, Datum arg)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    /*
+     * Clear my status entry, following the protocol of bumping st_changecount
+     * before and after.  We use a volatile pointer here to ensure the
+     * compiler doesn't try to get cute.
+     */
+    pgstat_increment_changecount_before(beentry);
+
+    beentry->st_procpid = 0;    /* mark invalid */
+
+    pgstat_increment_changecount_after(beentry);
+}
+
+
+/* ----------
+ * pgstat_bestart() -
+ *
+ *    Initialize this backend's entry in the PgBackendStatus array.
+ *    Called from InitPostgres.
+ *
+ *    Apart from auxiliary processes, MyBackendId, MyDatabaseId,
+ *    session userid, and application_name must be set for a
+ *    backend (hence, this cannot be combined with pgstat_bearray_initialize).
+ * ----------
+ */
+void
+pgstat_bestart(void)
+{
+    SockAddr    clientaddr;
+    volatile PgBackendStatus *beentry;
+
+    /*
+     * To minimize the time spent modifying the PgBackendStatus entry, fetch
+     * all the needed data first.
+     */
+
+    /*
+     * We may not have a MyProcPort (e.g., if this is an autovacuum process).
+     * If so, use all-zeroes client address, which is dealt with specially in
+     * pg_stat_get_backend_client_addr and pg_stat_get_backend_client_port.
+     */
+    if (MyProcPort)
+        memcpy(&clientaddr, &MyProcPort->raddr, sizeof(clientaddr));
+    else
+        MemSet(&clientaddr, 0, sizeof(clientaddr));
+
+    /*
+     * Initialize my status entry, following the protocol of bumping
+     * st_changecount before and after; and make sure it's even afterwards. We
+     * use a volatile pointer here to ensure the compiler doesn't try to get
+     * cute.
+     */
+    beentry = MyBEEntry;
+
+    /* MyBEEntry must have been set up by pgstat_bearray_initialize() */
+    Assert(beentry != NULL);
+
+    if (MyBackendId != InvalidBackendId)
+    {
+        if (IsAutoVacuumLauncherProcess())
+        {
+            /* Autovacuum Launcher */
+            beentry->st_backendType = B_AUTOVAC_LAUNCHER;
+        }
+        else if (IsAutoVacuumWorkerProcess())
+        {
+            /* Autovacuum Worker */
+            beentry->st_backendType = B_AUTOVAC_WORKER;
+        }
+        else if (am_walsender)
+        {
+            /* Wal sender */
+            beentry->st_backendType = B_WAL_SENDER;
+        }
+        else if (IsBackgroundWorker)
+        {
+            /* bgworker */
+            beentry->st_backendType = B_BG_WORKER;
+        }
+        else
+        {
+            /* client-backend */
+            beentry->st_backendType = B_BACKEND;
+        }
+    }
+    else
+    {
+        /* Must be an auxiliary process */
+        Assert(MyAuxProcType != NotAnAuxProcess);
+        switch (MyAuxProcType)
+        {
+            case StartupProcess:
+                beentry->st_backendType = B_STARTUP;
+                break;
+            case BgWriterProcess:
+                beentry->st_backendType = B_BG_WRITER;
+                break;
+            case CheckpointerProcess:
+                beentry->st_backendType = B_CHECKPOINTER;
+                break;
+            case WalWriterProcess:
+                beentry->st_backendType = B_WAL_WRITER;
+                break;
+            case WalReceiverProcess:
+                beentry->st_backendType = B_WAL_RECEIVER;
+                break;
+            case ArchiverProcess:
+                beentry->st_backendType = B_ARCHIVER;
+                break;
+            default:
+                elog(FATAL, "unrecognized process type: %d",
+                     (int) MyAuxProcType);
+                proc_exit(1);
+        }
+    }
+
+    do
+    {
+        pgstat_increment_changecount_before(beentry);
+    } while ((beentry->st_changecount & 1) == 0);
+
+    beentry->st_procpid = MyProcPid;
+    beentry->st_proc_start_timestamp = MyStartTimestamp;
+    beentry->st_activity_start_timestamp = 0;
+    beentry->st_state_start_timestamp = 0;
+    beentry->st_xact_start_timestamp = 0;
+    beentry->st_databaseid = MyDatabaseId;
+
+    /* We have userid for client-backends, wal-sender and bgworker processes */
+    if (beentry->st_backendType == B_BACKEND
+        || beentry->st_backendType == B_WAL_SENDER
+        || beentry->st_backendType == B_BG_WORKER)
+        beentry->st_userid = GetSessionUserId();
+    else
+        beentry->st_userid = InvalidOid;
+
+    beentry->st_clientaddr = clientaddr;
+    if (MyProcPort && MyProcPort->remote_hostname)
+        strlcpy(beentry->st_clienthostname, MyProcPort->remote_hostname,
+                NAMEDATALEN);
+    else
+        beentry->st_clienthostname[0] = '\0';
+#ifdef USE_SSL
+    if (MyProcPort && MyProcPort->ssl != NULL)
+    {
+        beentry->st_ssl = true;
+        beentry->st_sslstatus->ssl_bits = be_tls_get_cipher_bits(MyProcPort);
+        beentry->st_sslstatus->ssl_compression = be_tls_get_compression(MyProcPort);
+        strlcpy(beentry->st_sslstatus->ssl_version, be_tls_get_version(MyProcPort), NAMEDATALEN);
+        strlcpy(beentry->st_sslstatus->ssl_cipher, be_tls_get_cipher(MyProcPort), NAMEDATALEN);
+        be_tls_get_peerdn_name(MyProcPort, beentry->st_sslstatus->ssl_clientdn, NAMEDATALEN);
+    }
+    else
+    {
+        beentry->st_ssl = false;
+    }
+#else
+    beentry->st_ssl = false;
+#endif
+    beentry->st_state = STATE_UNDEFINED;
+    beentry->st_appname[0] = '\0';
+    beentry->st_activity_raw[0] = '\0';
+    /* Also make sure the last byte in each string area is always 0 */
+    beentry->st_clienthostname[NAMEDATALEN - 1] = '\0';
+    beentry->st_appname[NAMEDATALEN - 1] = '\0';
+    beentry->st_activity_raw[pgstat_track_activity_query_size - 1] = '\0';
+    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
+    beentry->st_progress_command_target = InvalidOid;
+
+    /*
+     * we don't zero st_progress_param here to save cycles; nobody should
+     * examine it until st_progress_command has been set to something other
+     * than PROGRESS_COMMAND_INVALID
+     */
+
+    pgstat_increment_changecount_after(beentry);
+
+    /* Update app name to current GUC setting */
+    if (application_name)
+        pgstat_report_appname(application_name);
+}
+
+/* ----------
+ * pgstat_read_current_status() -
+ *
+ *    Copy the current contents of the PgBackendStatus array to local memory,
+ *    if not already done in this transaction.
+ * ----------
+ */
+static void
+pgstat_read_current_status(void)
+{
+    volatile PgBackendStatus *beentry;
+    LocalPgBackendStatus *localtable;
+    LocalPgBackendStatus *localentry;
+    char       *localappname,
+               *localclienthostname,
+               *localactivity;
+#ifdef USE_SSL
+    PgBackendSSLStatus *localsslstatus;
+#endif
+    int            i;
+
+    Assert(IsUnderPostmaster);
+
+    if (localBackendStatusTable)
+        return;                    /* already done */
+
+    pgstat_setup_memcxt();
+
+    localtable = (LocalPgBackendStatus *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           sizeof(LocalPgBackendStatus) * NumBackendStatSlots);
+    localappname = (char *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           NAMEDATALEN * NumBackendStatSlots);
+    localclienthostname = (char *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           NAMEDATALEN * NumBackendStatSlots);
+    localactivity = (char *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           pgstat_track_activity_query_size * NumBackendStatSlots);
+#ifdef USE_SSL
+    localsslstatus = (PgBackendSSLStatus *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           sizeof(PgBackendSSLStatus) * NumBackendStatSlots);
+#endif
+
+    localNumBackends = 0;
+
+    beentry = BackendStatusArray;
+    localentry = localtable;
+    for (i = 1; i <= NumBackendStatSlots; i++)
+    {
+        /*
+         * Follow the protocol of retrying if st_changecount changes while we
+         * copy the entry, or if it's odd.  (The check for odd is needed to
+         * cover the case where we are able to completely copy the entry while
+         * the source backend is between increment steps.)    We use a volatile
+         * pointer here to ensure the compiler doesn't try to get cute.
+         */
+        for (;;)
+        {
+            int            before_changecount;
+            int            after_changecount;
+
+            pgstat_save_changecount_before(beentry, before_changecount);
+
+            localentry->backendStatus.st_procpid = beentry->st_procpid;
+            if (localentry->backendStatus.st_procpid > 0)
+            {
+                memcpy(&localentry->backendStatus, (char *) beentry, sizeof(PgBackendStatus));
+
+                /*
+                 * strcpy is safe even if the string is modified concurrently,
+                 * because there's always a \0 at the end of the buffer.
+                 */
+                strcpy(localappname, (char *) beentry->st_appname);
+                localentry->backendStatus.st_appname = localappname;
+                strcpy(localclienthostname, (char *) beentry->st_clienthostname);
+                localentry->backendStatus.st_clienthostname = localclienthostname;
+                strcpy(localactivity, (char *) beentry->st_activity_raw);
+                localentry->backendStatus.st_activity_raw = localactivity;
+                localentry->backendStatus.st_ssl = beentry->st_ssl;
+#ifdef USE_SSL
+                if (beentry->st_ssl)
+                {
+                    memcpy(localsslstatus, beentry->st_sslstatus, sizeof(PgBackendSSLStatus));
+                    localentry->backendStatus.st_sslstatus = localsslstatus;
+                }
+#endif
+            }
+
+            pgstat_save_changecount_after(beentry, after_changecount);
+            if (before_changecount == after_changecount &&
+                (before_changecount & 1) == 0)
+                break;
+
+            /* Make sure we can break out of loop if stuck... */
+            CHECK_FOR_INTERRUPTS();
+        }
+
+        beentry++;
+        /* Only valid entries get included into the local array */
+        if (localentry->backendStatus.st_procpid > 0)
+        {
+            BackendIdGetTransactionIds(i,
+                                       &localentry->backend_xid,
+                                       &localentry->backend_xmin);
+
+            localentry++;
+            localappname += NAMEDATALEN;
+            localclienthostname += NAMEDATALEN;
+            localactivity += pgstat_track_activity_query_size;
+#ifdef USE_SSL
+            localsslstatus++;
+#endif
+            localNumBackends++;
+        }
+    }
+
+    /* Set the pointer only after completion of a valid table */
+    localBackendStatusTable = localtable;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_beentry() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    our local copy of the current-activity entry for one backend.
+ *
+ *    NB: caller is responsible for a check if the user is permitted to see
+ *    this info (especially the querystring).
+ * ----------
+ */
+PgBackendStatus *
+pgstat_fetch_stat_beentry(int beid)
+{
+    pgstat_read_current_status();
+
+    if (beid < 1 || beid > localNumBackends)
+        return NULL;
+
+    return &localBackendStatusTable[beid - 1].backendStatus;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_local_beentry() -
+ *
+ *    Like pgstat_fetch_stat_beentry() but with locally computed additions (like
+ *    xid and xmin values of the backend)
+ *
+ *    NB: caller is responsible for a check if the user is permitted to see
+ *    this info (especially the querystring).
+ * ----------
+ */
+LocalPgBackendStatus *
+pgstat_fetch_stat_local_beentry(int beid)
+{
+    pgstat_read_current_status();
+
+    if (beid < 1 || beid > localNumBackends)
+        return NULL;
+
+    return &localBackendStatusTable[beid - 1];
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_numbackends() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    the number of valid entries in the local snapshot (the largest beid).
+ * ----------
+ */
+int
+pgstat_fetch_stat_numbackends(void)
+{
+    pgstat_read_current_status();
+
+    return localNumBackends;
+}
+
+/* ----------
+ * pgstat_get_wait_event_type() -
+ *
+ *    Return a string representing the type of the wait event that the
+ *    backend is currently waiting on.
+ */
+const char *
+pgstat_get_wait_event_type(uint32 wait_event_info)
+{
+    uint32        classId;
+    const char *event_type;
+
+    /* report process as not waiting. */
+    if (wait_event_info == 0)
+        return NULL;
+
+    classId = wait_event_info & 0xFF000000;
+
+    switch (classId)
+    {
+        case PG_WAIT_LWLOCK:
+            event_type = "LWLock";
+            break;
+        case PG_WAIT_LOCK:
+            event_type = "Lock";
+            break;
+        case PG_WAIT_BUFFER_PIN:
+            event_type = "BufferPin";
+            break;
+        case PG_WAIT_ACTIVITY:
+            event_type = "Activity";
+            break;
+        case PG_WAIT_CLIENT:
+            event_type = "Client";
+            break;
+        case PG_WAIT_EXTENSION:
+            event_type = "Extension";
+            break;
+        case PG_WAIT_IPC:
+            event_type = "IPC";
+            break;
+        case PG_WAIT_TIMEOUT:
+            event_type = "Timeout";
+            break;
+        case PG_WAIT_IO:
+            event_type = "IO";
+            break;
+        default:
+            event_type = "???";
+            break;
+    }
+
+    return event_type;
+}
+
+/* ----------
+ * pgstat_get_wait_event() -
+ *
+ *    Return a string representing the wait event that the backend is
+ *    currently waiting on.
+ */
+const char *
+pgstat_get_wait_event(uint32 wait_event_info)
+{
+    uint32        classId;
+    uint16        eventId;
+    const char *event_name;
+
+    /* report process as not waiting. */
+    if (wait_event_info == 0)
+        return NULL;
+
+    classId = wait_event_info & 0xFF000000;
+    eventId = wait_event_info & 0x0000FFFF;
+
+    switch (classId)
+    {
+        case PG_WAIT_LWLOCK:
+            event_name = GetLWLockIdentifier(classId, eventId);
+            break;
+        case PG_WAIT_LOCK:
+            event_name = GetLockNameFromTagType(eventId);
+            break;
+        case PG_WAIT_BUFFER_PIN:
+            event_name = "BufferPin";
+            break;
+        case PG_WAIT_ACTIVITY:
+            {
+                WaitEventActivity w = (WaitEventActivity) wait_event_info;
+
+                event_name = pgstat_get_wait_activity(w);
+                break;
+            }
+        case PG_WAIT_CLIENT:
+            {
+                WaitEventClient w = (WaitEventClient) wait_event_info;
+
+                event_name = pgstat_get_wait_client(w);
+                break;
+            }
+        case PG_WAIT_EXTENSION:
+            event_name = "Extension";
+            break;
+        case PG_WAIT_IPC:
+            {
+                WaitEventIPC w = (WaitEventIPC) wait_event_info;
+
+                event_name = pgstat_get_wait_ipc(w);
+                break;
+            }
+        case PG_WAIT_TIMEOUT:
+            {
+                WaitEventTimeout w = (WaitEventTimeout) wait_event_info;
+
+                event_name = pgstat_get_wait_timeout(w);
+                break;
+            }
+        case PG_WAIT_IO:
+            {
+                WaitEventIO w = (WaitEventIO) wait_event_info;
+
+                event_name = pgstat_get_wait_io(w);
+                break;
+            }
+        default:
+            event_name = "unknown wait event";
+            break;
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_activity() -
+ *
+ * Convert WaitEventActivity to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_activity(WaitEventActivity w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_ARCHIVER_MAIN:
+            event_name = "ArchiverMain";
+            break;
+        case WAIT_EVENT_AUTOVACUUM_MAIN:
+            event_name = "AutoVacuumMain";
+            break;
+        case WAIT_EVENT_BGWRITER_HIBERNATE:
+            event_name = "BgWriterHibernate";
+            break;
+        case WAIT_EVENT_BGWRITER_MAIN:
+            event_name = "BgWriterMain";
+            break;
+        case WAIT_EVENT_CHECKPOINTER_MAIN:
+            event_name = "CheckpointerMain";
+            break;
+        case WAIT_EVENT_LOGICAL_APPLY_MAIN:
+            event_name = "LogicalApplyMain";
+            break;
+        case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
+            event_name = "LogicalLauncherMain";
+            break;
+        case WAIT_EVENT_RECOVERY_WAL_ALL:
+            event_name = "RecoveryWalAll";
+            break;
+        case WAIT_EVENT_RECOVERY_WAL_STREAM:
+            event_name = "RecoveryWalStream";
+            break;
+        case WAIT_EVENT_SYSLOGGER_MAIN:
+            event_name = "SysLoggerMain";
+            break;
+        case WAIT_EVENT_WAL_RECEIVER_MAIN:
+            event_name = "WalReceiverMain";
+            break;
+        case WAIT_EVENT_WAL_SENDER_MAIN:
+            event_name = "WalSenderMain";
+            break;
+        case WAIT_EVENT_WAL_WRITER_MAIN:
+            event_name = "WalWriterMain";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_client() -
+ *
+ * Convert WaitEventClient to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_client(WaitEventClient w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_CLIENT_READ:
+            event_name = "ClientRead";
+            break;
+        case WAIT_EVENT_CLIENT_WRITE:
+            event_name = "ClientWrite";
+            break;
+        case WAIT_EVENT_LIBPQWALRECEIVER_CONNECT:
+            event_name = "LibPQWalReceiverConnect";
+            break;
+        case WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE:
+            event_name = "LibPQWalReceiverReceive";
+            break;
+        case WAIT_EVENT_SSL_OPEN_SERVER:
+            event_name = "SSLOpenServer";
+            break;
+        case WAIT_EVENT_WAL_RECEIVER_WAIT_START:
+            event_name = "WalReceiverWaitStart";
+            break;
+        case WAIT_EVENT_WAL_SENDER_WAIT_WAL:
+            event_name = "WalSenderWaitForWAL";
+            break;
+        case WAIT_EVENT_WAL_SENDER_WRITE_DATA:
+            event_name = "WalSenderWriteData";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_ipc() -
+ *
+ * Convert WaitEventIPC to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_ipc(WaitEventIPC w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_BGWORKER_SHUTDOWN:
+            event_name = "BgWorkerShutdown";
+            break;
+        case WAIT_EVENT_BGWORKER_STARTUP:
+            event_name = "BgWorkerStartup";
+            break;
+        case WAIT_EVENT_BTREE_PAGE:
+            event_name = "BtreePage";
+            break;
+        case WAIT_EVENT_CLOG_GROUP_UPDATE:
+            event_name = "ClogGroupUpdate";
+            break;
+        case WAIT_EVENT_EXECUTE_GATHER:
+            event_name = "ExecuteGather";
+            break;
+        case WAIT_EVENT_HASH_BATCH_ALLOCATING:
+            event_name = "Hash/Batch/Allocating";
+            break;
+        case WAIT_EVENT_HASH_BATCH_ELECTING:
+            event_name = "Hash/Batch/Electing";
+            break;
+        case WAIT_EVENT_HASH_BATCH_LOADING:
+            event_name = "Hash/Batch/Loading";
+            break;
+        case WAIT_EVENT_HASH_BUILD_ALLOCATING:
+            event_name = "Hash/Build/Allocating";
+            break;
+        case WAIT_EVENT_HASH_BUILD_ELECTING:
+            event_name = "Hash/Build/Electing";
+            break;
+        case WAIT_EVENT_HASH_BUILD_HASHING_INNER:
+            event_name = "Hash/Build/HashingInner";
+            break;
+        case WAIT_EVENT_HASH_BUILD_HASHING_OUTER:
+            event_name = "Hash/Build/HashingOuter";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING:
+            event_name = "Hash/GrowBatches/Allocating";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_DECIDING:
+            event_name = "Hash/GrowBatches/Deciding";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_ELECTING:
+            event_name = "Hash/GrowBatches/Electing";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_FINISHING:
+            event_name = "Hash/GrowBatches/Finishing";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING:
+            event_name = "Hash/GrowBatches/Repartitioning";
+            break;
+        case WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING:
+            event_name = "Hash/GrowBuckets/Allocating";
+            break;
+        case WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING:
+            event_name = "Hash/GrowBuckets/Electing";
+            break;
+        case WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING:
+            event_name = "Hash/GrowBuckets/Reinserting";
+            break;
+        case WAIT_EVENT_LOGICAL_SYNC_DATA:
+            event_name = "LogicalSyncData";
+            break;
+        case WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE:
+            event_name = "LogicalSyncStateChange";
+            break;
+        case WAIT_EVENT_MQ_INTERNAL:
+            event_name = "MessageQueueInternal";
+            break;
+        case WAIT_EVENT_MQ_PUT_MESSAGE:
+            event_name = "MessageQueuePutMessage";
+            break;
+        case WAIT_EVENT_MQ_RECEIVE:
+            event_name = "MessageQueueReceive";
+            break;
+        case WAIT_EVENT_MQ_SEND:
+            event_name = "MessageQueueSend";
+            break;
+        case WAIT_EVENT_PARALLEL_BITMAP_SCAN:
+            event_name = "ParallelBitmapScan";
+            break;
+        case WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN:
+            event_name = "ParallelCreateIndexScan";
+            break;
+        case WAIT_EVENT_PARALLEL_FINISH:
+            event_name = "ParallelFinish";
+            break;
+        case WAIT_EVENT_PROCARRAY_GROUP_UPDATE:
+            event_name = "ProcArrayGroupUpdate";
+            break;
+        case WAIT_EVENT_PROMOTE:
+            event_name = "Promote";
+            break;
+        case WAIT_EVENT_REPLICATION_ORIGIN_DROP:
+            event_name = "ReplicationOriginDrop";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_DROP:
+            event_name = "ReplicationSlotDrop";
+            break;
+        case WAIT_EVENT_SAFE_SNAPSHOT:
+            event_name = "SafeSnapshot";
+            break;
+        case WAIT_EVENT_SYNC_REP:
+            event_name = "SyncRep";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_timeout() -
+ *
+ * Convert WaitEventTimeout to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_timeout(WaitEventTimeout w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_BASE_BACKUP_THROTTLE:
+            event_name = "BaseBackupThrottle";
+            break;
+        case WAIT_EVENT_PG_SLEEP:
+            event_name = "PgSleep";
+            break;
+        case WAIT_EVENT_RECOVERY_APPLY_DELAY:
+            event_name = "RecoveryApplyDelay";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_io() -
+ *
+ * Convert WaitEventIO to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_io(WaitEventIO w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_BUFFILE_READ:
+            event_name = "BufFileRead";
+            break;
+        case WAIT_EVENT_BUFFILE_WRITE:
+            event_name = "BufFileWrite";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_READ:
+            event_name = "ControlFileRead";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_SYNC:
+            event_name = "ControlFileSync";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE:
+            event_name = "ControlFileSyncUpdate";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_WRITE:
+            event_name = "ControlFileWrite";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE:
+            event_name = "ControlFileWriteUpdate";
+            break;
+        case WAIT_EVENT_COPY_FILE_READ:
+            event_name = "CopyFileRead";
+            break;
+        case WAIT_EVENT_COPY_FILE_WRITE:
+            event_name = "CopyFileWrite";
+            break;
+        case WAIT_EVENT_DATA_FILE_EXTEND:
+            event_name = "DataFileExtend";
+            break;
+        case WAIT_EVENT_DATA_FILE_FLUSH:
+            event_name = "DataFileFlush";
+            break;
+        case WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC:
+            event_name = "DataFileImmediateSync";
+            break;
+        case WAIT_EVENT_DATA_FILE_PREFETCH:
+            event_name = "DataFilePrefetch";
+            break;
+        case WAIT_EVENT_DATA_FILE_READ:
+            event_name = "DataFileRead";
+            break;
+        case WAIT_EVENT_DATA_FILE_SYNC:
+            event_name = "DataFileSync";
+            break;
+        case WAIT_EVENT_DATA_FILE_TRUNCATE:
+            event_name = "DataFileTruncate";
+            break;
+        case WAIT_EVENT_DATA_FILE_WRITE:
+            event_name = "DataFileWrite";
+            break;
+        case WAIT_EVENT_DSM_FILL_ZERO_WRITE:
+            event_name = "DSMFillZeroWrite";
+            break;
+        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ:
+            event_name = "LockFileAddToDataDirRead";
+            break;
+        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC:
+            event_name = "LockFileAddToDataDirSync";
+            break;
+        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE:
+            event_name = "LockFileAddToDataDirWrite";
+            break;
+        case WAIT_EVENT_LOCK_FILE_CREATE_READ:
+            event_name = "LockFileCreateRead";
+            break;
+        case WAIT_EVENT_LOCK_FILE_CREATE_SYNC:
+            event_name = "LockFileCreateSync";
+            break;
+        case WAIT_EVENT_LOCK_FILE_CREATE_WRITE:
+            event_name = "LockFileCreateWrite";
+            break;
+        case WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ:
+            event_name = "LockFileReCheckDataDirRead";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC:
+            event_name = "LogicalRewriteCheckpointSync";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC:
+            event_name = "LogicalRewriteMappingSync";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE:
+            event_name = "LogicalRewriteMappingWrite";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_SYNC:
+            event_name = "LogicalRewriteSync";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE:
+            event_name = "LogicalRewriteTruncate";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_WRITE:
+            event_name = "LogicalRewriteWrite";
+            break;
+        case WAIT_EVENT_RELATION_MAP_READ:
+            event_name = "RelationMapRead";
+            break;
+        case WAIT_EVENT_RELATION_MAP_SYNC:
+            event_name = "RelationMapSync";
+            break;
+        case WAIT_EVENT_RELATION_MAP_WRITE:
+            event_name = "RelationMapWrite";
+            break;
+        case WAIT_EVENT_REORDER_BUFFER_READ:
+            event_name = "ReorderBufferRead";
+            break;
+        case WAIT_EVENT_REORDER_BUFFER_WRITE:
+            event_name = "ReorderBufferWrite";
+            break;
+        case WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ:
+            event_name = "ReorderLogicalMappingRead";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_READ:
+            event_name = "ReplicationSlotRead";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC:
+            event_name = "ReplicationSlotRestoreSync";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_SYNC:
+            event_name = "ReplicationSlotSync";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_WRITE:
+            event_name = "ReplicationSlotWrite";
+            break;
+        case WAIT_EVENT_SLRU_FLUSH_SYNC:
+            event_name = "SLRUFlushSync";
+            break;
+        case WAIT_EVENT_SLRU_READ:
+            event_name = "SLRURead";
+            break;
+        case WAIT_EVENT_SLRU_SYNC:
+            event_name = "SLRUSync";
+            break;
+        case WAIT_EVENT_SLRU_WRITE:
+            event_name = "SLRUWrite";
+            break;
+        case WAIT_EVENT_SNAPBUILD_READ:
+            event_name = "SnapbuildRead";
+            break;
+        case WAIT_EVENT_SNAPBUILD_SYNC:
+            event_name = "SnapbuildSync";
+            break;
+        case WAIT_EVENT_SNAPBUILD_WRITE:
+            event_name = "SnapbuildWrite";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC:
+            event_name = "TimelineHistoryFileSync";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE:
+            event_name = "TimelineHistoryFileWrite";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_READ:
+            event_name = "TimelineHistoryRead";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_SYNC:
+            event_name = "TimelineHistorySync";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_WRITE:
+            event_name = "TimelineHistoryWrite";
+            break;
+        case WAIT_EVENT_TWOPHASE_FILE_READ:
+            event_name = "TwophaseFileRead";
+            break;
+        case WAIT_EVENT_TWOPHASE_FILE_SYNC:
+            event_name = "TwophaseFileSync";
+            break;
+        case WAIT_EVENT_TWOPHASE_FILE_WRITE:
+            event_name = "TwophaseFileWrite";
+            break;
+        case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ:
+            event_name = "WALSenderTimelineHistoryRead";
+            break;
+        case WAIT_EVENT_WAL_BOOTSTRAP_SYNC:
+            event_name = "WALBootstrapSync";
+            break;
+        case WAIT_EVENT_WAL_BOOTSTRAP_WRITE:
+            event_name = "WALBootstrapWrite";
+            break;
+        case WAIT_EVENT_WAL_COPY_READ:
+            event_name = "WALCopyRead";
+            break;
+        case WAIT_EVENT_WAL_COPY_SYNC:
+            event_name = "WALCopySync";
+            break;
+        case WAIT_EVENT_WAL_COPY_WRITE:
+            event_name = "WALCopyWrite";
+            break;
+        case WAIT_EVENT_WAL_INIT_SYNC:
+            event_name = "WALInitSync";
+            break;
+        case WAIT_EVENT_WAL_INIT_WRITE:
+            event_name = "WALInitWrite";
+            break;
+        case WAIT_EVENT_WAL_READ:
+            event_name = "WALRead";
+            break;
+        case WAIT_EVENT_WAL_SYNC:
+            event_name = "WALSync";
+            break;
+        case WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN:
+            event_name = "WALSyncMethodAssign";
+            break;
+        case WAIT_EVENT_WAL_WRITE:
+            event_name = "WALWrite";
+            break;
+
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+
+/* ----------
+ * pgstat_get_backend_current_activity() -
+ *
+ *    Return a string representing the current activity of the backend with
+ *    the specified PID.  This looks directly at the BackendStatusArray,
+ *    and so will provide current information regardless of the age of our
+ *    transaction's snapshot of the status array.
+ *
+ *    It is the caller's responsibility to invoke this only for backends whose
+ *    state is expected to remain stable while the result is in use.  The
+ *    only current use is in deadlock reporting, where we can expect that
+ *    the target backend is blocked on a lock.  (There are corner cases
+ *    where the target's wait could get aborted while we are looking at it,
+ *    but the very worst consequence is to return a pointer to a string
+ *    that's been changed, so we won't worry too much.)
+ *
+ *    Note: return strings for special cases match pg_stat_get_backend_activity.
+ * ----------
+ */
+const char *
+pgstat_get_backend_current_activity(int pid, bool checkUser)
+{
+    PgBackendStatus *beentry;
+    int            i;
+
+    beentry = BackendStatusArray;
+    for (i = 1; i <= MaxBackends; i++)
+    {
+        /*
+         * Although we expect the target backend's entry to be stable, that
+         * doesn't imply that anyone else's is.  To avoid identifying the
+         * wrong backend, while we check for a match to the desired PID we
+         * must follow the protocol of retrying if st_changecount changes
+         * while we examine the entry, or if it's odd.  (This might be
+         * unnecessary, since fetching or storing an int is almost certainly
+         * atomic, but let's play it safe.)  We use a volatile pointer here to
+         * ensure the compiler doesn't try to get cute.
+         */
+        volatile PgBackendStatus *vbeentry = beentry;
+        bool        found;
+
+        for (;;)
+        {
+            int            before_changecount;
+            int            after_changecount;
+
+            pgstat_save_changecount_before(vbeentry, before_changecount);
+
+            found = (vbeentry->st_procpid == pid);
+
+            pgstat_save_changecount_after(vbeentry, after_changecount);
+
+            if (before_changecount == after_changecount &&
+                (before_changecount & 1) == 0)
+                break;
+
+            /* Make sure we can break out of loop if stuck... */
+            CHECK_FOR_INTERRUPTS();
+        }
+
+        if (found)
+        {
+            /* Now it is safe to use the non-volatile pointer */
+            if (checkUser && !superuser() && beentry->st_userid != GetUserId())
+                return "<insufficient privilege>";
+            else if (*(beentry->st_activity_raw) == '\0')
+                return "<command string not enabled>";
+            else
+            {
+                /* this'll leak a bit of memory, but that seems acceptable */
+                return pgstat_clip_activity(beentry->st_activity_raw);
+            }
+        }
+
+        beentry++;
+    }
+
+    /* If we get here, caller is in error ... */
+    return "<backend information not available>";
+}
+
+/* ----------
+ * pgstat_get_crashed_backend_activity() -
+ *
+ *    Return a string representing the current activity of the backend with
+ *    the specified PID.  Like the function above, but reads shared memory with
+ *    the expectation that it may be corrupt.  On success, copy the string
+ *    into the "buffer" argument and return that pointer.  On failure,
+ *    return NULL.
+ *
+ *    This function is only intended to be used by the postmaster to report the
+ *    query that crashed a backend.  In particular, no attempt is made to
+ *    follow the correct concurrency protocol when accessing the
+ *    BackendStatusArray.  But that's OK, in the worst case we'll return a
+ *    corrupted message.  We also must take care not to trip on ereport(ERROR).
+ * ----------
+ */
+const char *
+pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
+{
+    volatile PgBackendStatus *beentry;
+    int            i;
+
+    beentry = BackendStatusArray;
+
+    /*
+     * We probably shouldn't get here before shared memory has been set up,
+     * but be safe.
+     */
+    if (beentry == NULL || BackendActivityBuffer == NULL)
+        return NULL;
+
+    for (i = 1; i <= MaxBackends; i++)
+    {
+        if (beentry->st_procpid == pid)
+        {
+            /* Read pointer just once, so it can't change after validation */
+            const char *activity = beentry->st_activity_raw;
+            const char *activity_last;
+
+            /*
+             * We mustn't access activity string before we verify that it
+             * falls within the BackendActivityBuffer. To make sure that the
+             * entire string including its ending is contained within the
+             * buffer, subtract one activity length from the buffer size.
+             */
+            activity_last = BackendActivityBuffer + BackendActivityBufferSize
+                - pgstat_track_activity_query_size;
+
+            if (activity < BackendActivityBuffer ||
+                activity > activity_last)
+                return NULL;
+
+            /* If no string available, no point in a report */
+            if (activity[0] == '\0')
+                return NULL;
+
+            /*
+             * Copy only ASCII-safe characters so we don't run into encoding
+             * problems when reporting the message; and be sure not to run off
+             * the end of memory.  As only ASCII characters are reported, it
+             * doesn't seem necessary to perform multibyte aware clipping.
+             */
+            ascii_safe_strlcpy(buffer, activity,
+                               Min(buflen, pgstat_track_activity_query_size));
+
+            return buffer;
+        }
+
+        beentry++;
+    }
+
+    /* PID not found */
+    return NULL;
+}
+
+const char *
+pgstat_get_backend_desc(BackendType backendType)
+{
+    const char *backendDesc = "unknown process type";
+
+    switch (backendType)
+    {
+        case B_AUTOVAC_LAUNCHER:
+            backendDesc = "autovacuum launcher";
+            break;
+        case B_AUTOVAC_WORKER:
+            backendDesc = "autovacuum worker";
+            break;
+        case B_BACKEND:
+            backendDesc = "client backend";
+            break;
+        case B_BG_WORKER:
+            backendDesc = "background worker";
+            break;
+        case B_BG_WRITER:
+            backendDesc = "background writer";
+            break;
+        case B_ARCHIVER:
+            backendDesc = "archiver";
+            break;
+        case B_CHECKPOINTER:
+            backendDesc = "checkpointer";
+            break;
+        case B_STARTUP:
+            backendDesc = "startup";
+            break;
+        case B_WAL_RECEIVER:
+            backendDesc = "walreceiver";
+            break;
+        case B_WAL_SENDER:
+            backendDesc = "walsender";
+            break;
+        case B_WAL_WRITER:
+            backendDesc = "walwriter";
+            break;
+    }
+
+    return backendDesc;
+}
+
+/* ----------
+ * pgstat_report_appname() -
+ *
+ *    Called to update our application name.
+ * ----------
+ */
+void
+pgstat_report_appname(const char *appname)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+    int            len;
+
+    if (!beentry)
+        return;
+
+    /* This should be unnecessary if GUC did its job, but be safe */
+    len = pg_mbcliplen(appname, strlen(appname), NAMEDATALEN - 1);
+
+    /*
+     * Update my status entry, following the protocol of bumping
+     * st_changecount before and after.  We use a volatile pointer here to
+     * ensure the compiler doesn't try to get cute.
+     */
+    pgstat_increment_changecount_before(beentry);
+
+    memcpy((char *) beentry->st_appname, appname, len);
+    beentry->st_appname[len] = '\0';
+
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*
+ * Report current transaction start timestamp as the specified value.
+ * Zero means there is no active transaction.
+ */
+void
+pgstat_report_xact_timestamp(TimestampTz tstamp)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    if (!pgstat_track_activities || !beentry)
+        return;
+
+    /*
+     * Update my status entry, following the protocol of bumping
+     * st_changecount before and after.  We use a volatile pointer here to
+     * ensure the compiler doesn't try to get cute.
+     */
+    pgstat_increment_changecount_before(beentry);
+    beentry->st_xact_start_timestamp = tstamp;
+    pgstat_increment_changecount_after(beentry);
+}
+
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ *    Create pgBeStatLocalContext, if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+    if (!pgBeStatLocalContext)
+        pgBeStatLocalContext = AllocSetContextCreate(TopMemoryContext,
+                                                     "Backend status snapshot",
+                                                     ALLOCSET_SMALL_SIZES);
+}
+
+/* ----------
+ * AtEOXact_BEStatus
+ *
+ *    Called from access/transam/xact.c at top-level transaction commit/abort.
+ * ----------
+ */
+void
+AtEOXact_BEStatus(bool isCommit)
+{
+    bestatus_clear_snapshot();
+}
+
+/*
+ * AtPrepare_BEStatus
+ *        Clear existing snapshot at 2PC transaction prepare.
+ */
+void
+AtPrepare_BEStatus(void)
+{
+    bestatus_clear_snapshot();
+}
+
+/* ----------
+ * bestatus_clear_snapshot() -
+ *
+ *    Discard any data collected in the current transaction.  Any subsequent
+ *    request will cause new snapshots to be read.
+ *
+ *    This is also invoked during transaction commit or abort to discard
+ *    the no-longer-wanted snapshot.
+ * ----------
+ */
+static void
+bestatus_clear_snapshot(void)
+{
+    /* Release memory, if any was allocated */
+    if (pgBeStatLocalContext)
+        MemoryContextDelete(pgBeStatLocalContext);
+
+    /* Reset variables */
+    pgBeStatLocalContext = NULL;
+    localBackendStatusTable = NULL;
+    localNumBackends = 0;
+}
+
+
+
+/* ----------
+ * pgstat_report_activity() -
+ *
+ *    Called from tcop/postgres.c to report what the backend is actually doing
+ *    (but note cmd_str can be NULL for certain cases).
+ *
+ * All updates of the status entry follow the protocol of bumping
+ * st_changecount before and after.  We use a volatile pointer here to
+ * ensure the compiler doesn't try to get cute.
+ * ----------
+ */
+void
+pgstat_report_activity(BackendState state, const char *cmd_str)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+    TimestampTz start_timestamp;
+    TimestampTz current_timestamp;
+    int            len = 0;
+
+    TRACE_POSTGRESQL_STATEMENT_STATUS(cmd_str);
+
+    if (!beentry)
+        return;
+
+    if (!pgstat_track_activities)
+    {
+        if (beentry->st_state != STATE_DISABLED)
+        {
+            volatile PGPROC *proc = MyProc;
+
+            /*
+             * track_activities is disabled, but we last reported a
+             * non-disabled state.  As our final update, change the state and
+             * clear fields we will not be updating anymore.
+             */
+            pgstat_increment_changecount_before(beentry);
+            beentry->st_state = STATE_DISABLED;
+            beentry->st_state_start_timestamp = 0;
+            beentry->st_activity_raw[0] = '\0';
+            beentry->st_activity_start_timestamp = 0;
+            /* st_xact_start_timestamp and wait_event_info are also disabled */
+            beentry->st_xact_start_timestamp = 0;
+            proc->wait_event_info = 0;
+            pgstat_increment_changecount_after(beentry);
+        }
+        return;
+    }
+
+    /*
+     * To minimize the time spent modifying the entry, fetch all the needed
+     * data first.
+     */
+    start_timestamp = GetCurrentStatementStartTimestamp();
+    if (cmd_str != NULL)
+    {
+        /*
+         * Compute length of to-be-stored string unaware of multi-byte
+         * characters. For speed reasons that'll get corrected on read, rather
+         * than computed every write.
+         */
+        len = Min(strlen(cmd_str), pgstat_track_activity_query_size - 1);
+    }
+    current_timestamp = GetCurrentTimestamp();
+
+    /*
+     * Now update the status entry
+     */
+    pgstat_increment_changecount_before(beentry);
+
+    beentry->st_state = state;
+    beentry->st_state_start_timestamp = current_timestamp;
+
+    if (cmd_str != NULL)
+    {
+        memcpy((char *) beentry->st_activity_raw, cmd_str, len);
+        beentry->st_activity_raw[len] = '\0';
+        beentry->st_activity_start_timestamp = start_timestamp;
+    }
+
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * pgstat_progress_start_command() -
+ *
+ * Set st_progress_command (and st_progress_command_target) in own backend
+ * entry.  Also, zero-initialize st_progress_param array.
+ *-----------
+ */
+void
+pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    if (!beentry || !pgstat_track_activities)
+        return;
+
+    pgstat_increment_changecount_before(beentry);
+    beentry->st_progress_command = cmdtype;
+    beentry->st_progress_command_target = relid;
+    MemSet(&beentry->st_progress_param, 0, sizeof(beentry->st_progress_param));
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * pgstat_progress_update_param() -
+ *
+ * Update index'th member in st_progress_param[] of own backend entry.
+ *-----------
+ */
+void
+pgstat_progress_update_param(int index, int64 val)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    Assert(index >= 0 && index < PGSTAT_NUM_PROGRESS_PARAM);
+
+    if (!beentry || !pgstat_track_activities)
+        return;
+
+    pgstat_increment_changecount_before(beentry);
+    beentry->st_progress_param[index] = val;
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * pgstat_progress_update_multi_param() -
+ *
+ * Update multiple members in st_progress_param[] of own backend entry.
+ * This is atomic; readers won't see intermediate states.
+ *-----------
+ */
+void
+pgstat_progress_update_multi_param(int nparam, const int *index,
+                                   const int64 *val)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+    int            i;
+
+    if (!beentry || !pgstat_track_activities || nparam == 0)
+        return;
+
+    pgstat_increment_changecount_before(beentry);
+
+    for (i = 0; i < nparam; ++i)
+    {
+        Assert(index[i] >= 0 && index[i] < PGSTAT_NUM_PROGRESS_PARAM);
+
+        beentry->st_progress_param[index[i]] = val[i];
+    }
+
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * pgstat_progress_end_command() -
+ *
+ * Reset st_progress_command (and st_progress_command_target) in own backend
+ * entry.  This signals the end of the command.
+ *-----------
+ */
+void
+pgstat_progress_end_command(void)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    if (!beentry)
+        return;
+    if (!pgstat_track_activities
+        && beentry->st_progress_command == PROGRESS_COMMAND_INVALID)
+        return;
+
+    pgstat_increment_changecount_before(beentry);
+    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
+    beentry->st_progress_command_target = InvalidOid;
+    pgstat_increment_changecount_after(beentry);
+}
+
+
+/*
+ * Convert a potentially unsafely truncated activity string (see
+ * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
+ * one.
+ *
+ * The returned string is allocated in the caller's memory context and may be
+ * freed.
+ */
+char *
+pgstat_clip_activity(const char *raw_activity)
+{
+    char       *activity;
+    int            rawlen;
+    int            cliplen;
+
+    /*
+     * Some callers, like pgstat_get_backend_current_activity(), do not
+     * guarantee that the buffer isn't concurrently modified. We try to take
+     * care that the buffer is always terminated by a NUL byte regardless, but
+     * let's still be paranoid about the string's length. In those cases the
+     * underlying buffer is guaranteed to be pgstat_track_activity_query_size
+     * large.
+     */
+    activity = pnstrdup(raw_activity, pgstat_track_activity_query_size - 1);
+
+    /* now double-guaranteed to be NUL terminated */
+    rawlen = strlen(activity);
+
+    /*
+     * All supported server-encodings make it possible to determine the length
+     * of a multi-byte character from its first byte (this is not the case for
+     * client encodings, see GB18030). As st_activity is always stored using
+     * server encoding, this allows us to perform multi-byte aware truncation,
+     * even if the string earlier was truncated in the middle of a multi-byte
+     * character.
+     */
+    cliplen = pg_mbcliplen(activity, rawlen,
+                           pgstat_track_activity_query_size - 1);
+
+    activity[cliplen] = '\0';
+
+    return activity;
+}
diff --git a/src/backend/statmon/pgstat.c b/src/backend/statmon/pgstat.c
new file mode 100644
index 0000000000..c8513186db
--- /dev/null
+++ b/src/backend/statmon/pgstat.c
@@ -0,0 +1,3935 @@
+/* ----------
+ * pgstat.c
+ *
+ *    Statistics collector facility.
+ *
+ *    Statistics data is stored in dynamic shared memory. Every backend
+ *    updates and reads it individually.
+ *
+ *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
+ *
+ *    src/backend/statmon/pgstat.c
+ * ----------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "pgstat.h"
+
+#include "access/heapam.h"
+#include "access/htup_details.h"
+#include "access/twophase_rmgr.h"
+#include "access/xact.h"
+#include "bestatus.h"
+#include "catalog/pg_database.h"
+#include "catalog/pg_proc.h"
+#include "miscadmin.h"
+#include "postmaster/autovacuum.h"
+#include "storage/ipc.h"
+#include "storage/lmgr.h"
+#include "storage/procsignal.h"
+#include "utils/memutils.h"
+#include "utils/snapmgr.h"
+
+/* ----------
+ * Timer definitions.
+ * ----------
+ */
+#define PGSTAT_STAT_MIN_INTERVAL    500 /* Minimum time between stats data
+                                         * updates; in milliseconds. */
+
+#define PGSTAT_STAT_MAX_INTERVAL   1000 /* Maximum time between stats data
+                                         * updates; in milliseconds. */
+
+/* ----------
+ * The initial size hints for the hash tables used in the collector.
+ * ----------
+ */
+#define PGSTAT_TAB_HASH_SIZE    512
+#define PGSTAT_FUNCTION_HASH_SIZE    512
+
+/*
+ * Operation mode of pgstat_get_db_entry.
+ */
+#define PGSTAT_FETCH_SHARED    0
+#define PGSTAT_FETCH_EXCLUSIVE    1
+#define PGSTAT_FETCH_NOWAIT    2
+
+typedef enum
+{
+    PGSTAT_ENTRY_NOT_FOUND,
+    PGSTAT_ENTRY_FOUND,
+    PGSTAT_ENTRY_LOCK_FAILED
+} pg_stat_table_result_status;
+
+/* ----------
+ * GUC parameters
+ * ----------
+ */
+bool        pgstat_track_counts = false;
+int            pgstat_track_functions = TRACK_FUNC_OFF;
+
+/* ----------
+ * Built from GUC parameter
+ * ----------
+ */
+char       *pgstat_stat_directory = NULL;
+
+/* No longer used, but will be removed with GUC */
+char       *pgstat_stat_filename = NULL;
+char       *pgstat_stat_tmpname = NULL;
+
+/* Shared stats bootstrap information */
+typedef struct StatsShmemStruct {
+    dsa_handle stats_dsa_handle;
+    dshash_table_handle db_stats_handle;
+    dsa_pointer    global_stats;
+    dsa_pointer    archiver_stats;
+    TimestampTz last_update;
+} StatsShmemStruct;
+
+
+/*
+ * BgWriter global statistics counters (unused in other processes).
+ * Stored directly in a stats message structure so it can be sent
+ * without needing to copy things around.  We assume this inits to zeroes.
+ */
+PgStat_BgWriter BgWriterStats;
+
+/* ----------
+ * Local data
+ * ----------
+ */
+/* Variables that live for the backend lifetime */
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
+static dshash_table *db_stats;
+
+/* memory context for snapshots */
+static MemoryContext pgStatLocalContext = NULL;
+static HTAB *snapshot_db_stats;
+
+/* dshash parameters for each type of table */
+static const dshash_parameters dsh_dbparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatDBEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_DB
+};
+static const dshash_parameters dsh_tblparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatTabEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_FUNC_TABLE
+};
+static const dshash_parameters dsh_funcparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatFuncEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_FUNC_TABLE
+};
+
+/*
+ * Variables in which backends store cluster-wide stats that are waiting to
+ * be written to shared stats.
+ */
+static bool pgstat_pending_recoveryconflict = false;
+static bool pgstat_pending_deadlock = false;
+static bool pgstat_pending_tempfile = false;
+
+/*
+ * Structures in which backends store per-table info that's waiting to be
+ * written to shared stats.
+ *
+ * NOTE: once allocated, TabStatusArray structures are never moved or deleted
+ * for the life of the backend.  Also, we zero out the t_id fields of the
+ * contained PgStat_TableStatus structs whenever they are not actively in use.
+ * This allows relcache pgstat_info pointers to be treated as long-lived data,
+ * avoiding repeated searches in pgstat_initstats() when a relation is
+ * repeatedly opened during a transaction.
+ */
+#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
+
+typedef struct TabStatusArray
+{
+    struct TabStatusArray *tsa_next;    /* link to next array, if any */
+    int            tsa_used;        /* # entries currently used */
+    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
+} TabStatusArray;
+
+static TabStatusArray *pgStatTabList = NULL;
+
+/*
+ * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
+ */
+typedef struct TabStatHashEntry
+{
+    Oid            t_id;
+    PgStat_TableStatus *tsa_entry;
+} TabStatHashEntry;
+
+/*
+ * Hash table for O(1) t_id -> tsa_entry lookup
+ */
+static HTAB *pgStatTabHash = NULL;
+static HTAB *pgStatPendingTabHash = NULL;
+
+/*
+ * Backends store per-function info that's waiting to be written to shared
+ * stats in this hash table (indexed by function OID).
+ */
+static HTAB *pgStatFunctions = NULL;
+static HTAB *pgStatPendingFunctions = NULL;
+
+/*
+ * Tuple insertion/deletion counts for an open transaction can't be propagated
+ * into PgStat_TableStatus counters until we know if it is going to commit
+ * or abort.  Hence, we keep these counts in per-subxact structs that live
+ * in TopTransactionContext.  This data structure is designed on the assumption
+ * that subxacts won't usually modify very many tables.
+ */
+typedef struct PgStat_SubXactStatus
+{
+    int            nest_level;        /* subtransaction nest level */
+    struct PgStat_SubXactStatus *prev;    /* higher-level subxact if any */
+    PgStat_TableXactStatus *first;    /* head of list for this subxact */
+} PgStat_SubXactStatus;
+
+static PgStat_SubXactStatus *pgStatXactStack = NULL;
+
+static int    pgStatXactCommit = 0;
+static int    pgStatXactRollback = 0;
+PgStat_Counter pgStatBlockReadTime = 0;
+PgStat_Counter pgStatBlockWriteTime = 0;
+
+/* Record that's written to 2PC state file when pgstat state is persisted */
+typedef struct TwoPhasePgStatRecord
+{
+    PgStat_Counter tuples_inserted; /* tuples inserted in xact */
+    PgStat_Counter tuples_updated;    /* tuples updated in xact */
+    PgStat_Counter tuples_deleted;    /* tuples deleted in xact */
+    PgStat_Counter inserted_pre_trunc;    /* tuples inserted prior to truncate */
+    PgStat_Counter updated_pre_trunc;    /* tuples updated prior to truncate */
+    PgStat_Counter deleted_pre_trunc;    /* tuples deleted prior to truncate */
+    Oid            t_id;            /* table's OID */
+    bool        t_shared;        /* is it a shared catalog? */
+    bool        t_truncated;    /* was the relation truncated? */
+} TwoPhasePgStatRecord;
+
+typedef struct
+{
+    dshash_table *tabhash;
+    PgStat_StatDBEntry *dbentry;
+} pgstat_apply_tabstat_context;
+
+/*
+ * Info about current snapshot of stats
+ */
+TimestampTz backend_cache_expire = 0; /* local cache expiration time */
+bool        first_in_xact = true;      /* first fetch after last transaction end */
+
+/*
+ * Cluster-wide statistics.
+ *
+ * Contains statistics that are not collected per database or per table.
+ * shared_* are the statistics maintained by statistics collector code and
+ * snapshot_* are cached stats for the reader code.
+ */
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats *snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats *snapshot_globalStats;
+
+/*
+ * Total time charged to functions so far in the current backend.
+ * We use this to help separate "self" and "other" time charges.
+ * (We assume this initializes to zero.)
+ */
+static instr_time total_func_time;
+
+
+/* ----------
+ * Local function forward declarations
+ * ----------
+ */
+/* functions used in backends */
+static void pgstat_beshutdown_hook(int code, Datum arg);
+
+static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, int op,
+                                    pg_stat_table_result_status *status);
+static PgStat_StatTabEntry *pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create);
+static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_db_statsfile(Oid databaseid, dshash_table *tabhash, dshash_table *funchash);
+
+/* functions used in backends */
+static bool backend_snapshot_global_stats(void);
+static PgStat_StatFuncEntry *backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot);
+
+static void pgstat_postmaster_shutdown(int code, Datum arg);
+static void pgstat_apply_pending_tabstats(bool shared, bool force,
+                               pgstat_apply_tabstat_context *cxt);
+static bool pgstat_apply_tabstat(pgstat_apply_tabstat_context *cxt,
+                                 PgStat_TableStatus *entry, bool nowait);
+static void pgstat_merge_tabentry(PgStat_TableStatus *deststat,
+                                          PgStat_TableStatus *srcstat,
+                                          bool init);
+static void pgstat_update_funcstats(bool force, PgStat_StatDBEntry *dbentry);
+static void pgstat_reset_all_counters(void);
+static void pgstat_cleanup_recovery_conflict(PgStat_StatDBEntry *dbentry);
+static void pgstat_cleanup_deadlock(PgStat_StatDBEntry *dbentry);
+static void pgstat_cleanup_tempfile(PgStat_StatDBEntry *dbentry);
+
+static inline void pgstat_merge_backendstats_to_funcentry(
+    PgStat_StatFuncEntry *dest, PgStat_BackendFunctionEntry *src, bool init);
+static inline void pgstat_merge_funcentry(
+    PgStat_StatFuncEntry *dest, PgStat_StatFuncEntry *src, bool init);
+
+static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
+static void reset_dbentry_counters(PgStat_StatDBEntry *dbentry);
+static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
+
+static void pgstat_setup_memcxt(void);
+
+static bool pgstat_update_tabentry(dshash_table *tabhash,
+                                   PgStat_TableStatus *stat, bool nowait);
+static void pgstat_update_dbentry(PgStat_StatDBEntry *dbentry,
+                                  PgStat_TableStatus *stat);
+
+/* ------------------------------------------------------------
+ * Public functions called from postmaster follow
+ * ------------------------------------------------------------
+ */
+
+
+void
+pgstat_initialize(void)
+{
+    /* Set up a process-exit hook to clean up */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
+}
+
+/*
+ * subroutine for pgstat_reset_all
+ */
+static void
+pgstat_reset_remove_files(const char *directory)
+{
+    DIR           *dir;
+    struct dirent *entry;
+    char        fname[MAXPGPATH * 2];
+
+    dir = AllocateDir(directory);
+    while ((entry = ReadDir(dir, directory)) != NULL)
+    {
+        int            nchars;
+        Oid            tmp_oid;
+
+        /*
+         * Skip directory entries that don't match the file names we write.
+         * See get_dbstat_filename for the database-specific pattern.
+         */
+        if (strncmp(entry->d_name, "global.", 7) == 0)
+            nchars = 7;
+        else
+        {
+            nchars = 0;
+            (void) sscanf(entry->d_name, "db_%u.%n",
+                          &tmp_oid, &nchars);
+            if (nchars <= 0)
+                continue;
+            /* %u allows leading whitespace, so reject that */
+            if (strchr("0123456789", entry->d_name[3]) == NULL)
+                continue;
+        }
+
+        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
+            strcmp(entry->d_name + nchars, "stat") != 0)
+            continue;
+
+        snprintf(fname, sizeof(fname), "%s/%s", directory,
+                 entry->d_name);
+        unlink(fname);
+    }
+    FreeDir(dir);
+}
+
+/*
+ * pgstat_reset_all() -
+ *
+ * Remove the stats files and reset the in-memory counters.  This is currently
+ * used only if WAL recovery is needed after a crash.
+ */
+void
+pgstat_reset_all(void)
+{
+    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    pgstat_reset_all_counters();
+}
+
+/* ----------
+ * pgstat_create_shared_stats() -
+ *
+ *    create shared stats memory
+ * ----------
+ */
+static void
+pgstat_create_shared_stats(void)
+{
+    MemoryContext oldcontext;
+
+    Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
+
+    /* lives for the lifetime of the process */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    area = dsa_create(LWTRANCHE_STATS_DSA);
+    dsa_pin_mapping(area);
+
+    db_stats = dshash_create(area, &dsh_dbparams, 0);
+
+    /* create shared area and write bootstrap information */
+    StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+    StatsShmem->global_stats =
+        dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+    StatsShmem->archiver_stats =
+        dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+    StatsShmem->db_stats_handle =
+        dshash_get_hash_table_handle(db_stats);
+    StatsShmem->last_update = 0;
+
+    /* initial attachment to the shared memory */
+    MemoryContextSwitchTo(pgStatLocalContext);
+    snapshot_db_stats = NULL;
+    shared_globalStats = (PgStat_GlobalStats *)
+        dsa_get_address(area, StatsShmem->global_stats);
+    shared_archiverStats = (PgStat_ArchiverStats *)
+        dsa_get_address(area, StatsShmem->archiver_stats);
+    MemoryContextSwitchTo(oldcontext);
+}
+
+
+/* ------------------------------------------------------------
+ * Public functions used by backends follow
+ *------------------------------------------------------------
+ */
+
+
+/* ----------
+ * pgstat_update_stat() -
+ *
+ *    Must be called by processes that perform DML (tcop/postgres.c, logical
+ *    receiver processes, SPI workers, etc.) to apply the per-table and
+ *    function usage statistics collected so far to the shared statistics hashes.
+ *
+ *    This requires taking locks on the shared statistics hashes, and some
+ *    updates may be withheld on lock failure.  Pending updates are retried in
+ *    later calls of this function and are finally flushed when it is called
+ *    with force = true, or once PGSTAT_STAT_MAX_INTERVAL milliseconds have
+ *    elapsed since the oldest pending update.  When force = false, regular
+ *    backends update shared stats at intervals no shorter than
+ *    PGSTAT_STAT_MIN_INTERVAL.
+ *
+ *    Returns the time in milliseconds to wait before the next update attempt,
+ *    or 0 if no wait is needed.
+ *    Note that this is called only when not within a transaction, so it is fair
+ *    to use transaction stop time as an approximation of current time.
+ * ----------
+ */
+long
+pgstat_update_stat(bool force)
+{
+    /* we assume this inits to all zeroes: */
+    static TimestampTz last_report = 0;
+    static TimestampTz oldest_pending = 0;
+    TimestampTz now;
+    TabStatusArray *tsa;
+    pgstat_apply_tabstat_context cxt;
+    bool        other_pending_stats = false;
+    long elapsed;
+    long secs;
+    int     usecs;
+
+    if (pgstat_pending_recoveryconflict ||
+        pgstat_pending_deadlock ||
+        pgstat_pending_tempfile ||
+        pgStatPendingFunctions)
+        other_pending_stats = true;
+
+    /* Don't expend a clock check if nothing to do */
+    if (!other_pending_stats && !pgStatPendingTabHash &&
+        (pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
+        pgStatXactCommit == 0 && pgStatXactRollback == 0)
+        return 0;
+
+    now = GetCurrentTransactionStopTimestamp();
+
+    if (!force)
+    {
+        /*
+         * Don't update shared stats unless it's been at least
+         * PGSTAT_STAT_MIN_INTERVAL msec since we last updated them.
+         * Return the time to wait in that case.
+         */
+        TimestampDifference(last_report, now, &secs, &usecs);
+        elapsed = secs * 1000 + usecs / 1000;
+
+        if (elapsed < PGSTAT_STAT_MIN_INTERVAL)
+        {
+            /* we know we have some statistics */
+            if (oldest_pending == 0)
+                oldest_pending = now;
+
+            return PGSTAT_STAT_MIN_INTERVAL - elapsed;
+        }
+
+
+        /*
+         * Don't keep pending stats for longer than PGSTAT_STAT_MAX_INTERVAL.
+         */
+        if (oldest_pending > 0)
+        {
+            TimestampDifference(oldest_pending, now, &secs, &usecs);
+            elapsed = secs * 1000 + usecs / 1000;
+
+            if (elapsed > PGSTAT_STAT_MAX_INTERVAL)
+                force = true;
+        }
+    }
+
+    last_report = now;
+
+    /* Publish report time */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (StatsShmem->last_update < last_report)
+        StatsShmem->last_update = last_report;
+    LWLockRelease(StatsLock);
+
+
+    /* set up the stats update context */
+    cxt.dbentry = NULL;
+    cxt.tabhash = NULL;
+
+    /* Forcibly update other stats if any. */
+    if (other_pending_stats)
+    {
+        cxt.dbentry =
+            pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+        /* clean up pending statistics if any */
+        if (pgStatPendingFunctions)
+            pgstat_update_funcstats(true, cxt.dbentry);
+        if (pgstat_pending_recoveryconflict)
+            pgstat_cleanup_recovery_conflict(cxt.dbentry);
+        if (pgstat_pending_deadlock)
+            pgstat_cleanup_deadlock(cxt.dbentry);
+        if (pgstat_pending_tempfile)
+            pgstat_cleanup_tempfile(cxt.dbentry);
+    }
+
+    /*
+     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
+     * entries it points to.  (Should we fail partway through the loop below,
+     * it's okay to have removed the hashtable already --- the only
+     * consequence is we'd get multiple entries for the same table in the
+     * pgStatTabList, and that's safe.)
+     */
+    if (pgStatTabHash)
+        hash_destroy(pgStatTabHash);
+    pgStatTabHash = NULL;
+
+    /*
+     * XXX: We cannot lock two dshash entries at once. Since we must keep the
+     * lock while table stats are being updated, we have no choice other than
+     * processing shared table stats and regular table stats separately.
+     * Looping over the array twice is apparently inefficient; a more
+     * efficient way is desirable.
+     */
+
+    /* The first of the following calls uses the dbentry obtained above, if any. */
+    pgstat_apply_pending_tabstats(false, force, &cxt);
+    pgstat_apply_pending_tabstats(true, force, &cxt);
+
+    /* zero out TableStatus structs after use */
+    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+    {
+        MemSet(tsa->tsa_entries, 0,
+               tsa->tsa_used * sizeof(PgStat_TableStatus));
+        tsa->tsa_used = 0;
+    }
+
+    /* record oldest pending update time */
+    if (pgStatPendingTabHash == NULL)
+        oldest_pending = 0;
+    else if (oldest_pending == 0)
+        oldest_pending = now;
+
+    return 0;
+}
+
+/*
+ * Subroutine for pgstat_update_stat.
+ *
+ * Applies table stats in the table status array, merging pending stats if any.
+ * If force is true, waits until the required locks are acquired.  Otherwise
+ * stats that cannot be applied are kept as pending and retried next time.
+ */
+static void
+pgstat_apply_pending_tabstats(bool shared, bool force,
+                              pgstat_apply_tabstat_context *cxt)
+{
+    static const PgStat_TableCounts all_zeroes;
+    TabStatusArray *tsa;
+    int i;
+
+    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+    {
+        for (i = 0; i < tsa->tsa_used; i++)
+        {
+            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
+            PgStat_TableStatus *pentry = NULL;
+
+            /* Shouldn't have any pending transaction-dependent counts */
+            Assert(entry->trans == NULL);
+
+            /*
+             * Ignore entries that didn't accumulate any actual counts, such
+             * as indexes that were opened by the planner but not used.
+             */
+            if (memcmp(&entry->t_counts, &all_zeroes,
+                       sizeof(PgStat_TableCounts)) == 0)
+                continue;
+
+            /* Skip entries that do not match the requested kind */
+            if (entry->t_shared != shared)
+                continue;
+
+            /* if a pending update exists, apply it together with this one */
+            if (pgStatPendingTabHash != NULL)
+            {
+                pentry = hash_search(pgStatPendingTabHash,
+                                     (void *) entry, HASH_FIND, NULL);
+
+                if (pentry)
+                {
+                    /* merge new update into pending updates */
+                    pgstat_merge_tabentry(pentry, entry, false);
+                    entry = pentry;
+                }
+            }
+
+            /* try to apply the merged stats */
+            if (pgstat_apply_tabstat(cxt, entry, !force))
+            {
+                /* succeeded; remove the entry if it came from pending stats */
+                if (pentry)
+                    hash_search(pgStatPendingTabHash,
+                                (void *) pentry, HASH_REMOVE, NULL);
+            }
+            else if (!pentry)
+            {
+                /* failed and there was no pending entry, create new one. */
+                bool found;
+
+                if (pgStatPendingTabHash == NULL)
+                {
+                    HASHCTL        ctl;
+
+                    memset(&ctl, 0, sizeof(ctl));
+                    ctl.keysize = sizeof(Oid);
+                    ctl.entrysize = sizeof(PgStat_TableStatus);
+                    pgStatPendingTabHash =
+                        hash_create("pgstat pending table stats hash",
+                                    TABSTAT_QUANTUM,
+                                    &ctl,
+                                    HASH_ELEM | HASH_BLOBS);
+                }
+
+                pentry = hash_search(pgStatPendingTabHash,
+                                     (void *) entry, HASH_ENTER, &found);
+                Assert (!found);
+
+                *pentry = *entry;
+            }
+        }
+    }
+
+    /* if any pending stats exists, try to clean it up */
+    if (pgStatPendingTabHash != NULL)
+    {
+        HASH_SEQ_STATUS pstat;
+        PgStat_TableStatus *pentry;
+
+        hash_seq_init(&pstat, pgStatPendingTabHash);
+        while ((pentry = (PgStat_TableStatus *) hash_seq_search(&pstat)) != NULL)
+        {
+            /* Skip entries that do not match the requested kind */
+            if (pentry->t_shared != shared)
+                continue;
+
+            /* apply pending entry and remove on success */
+            if (pgstat_apply_tabstat(cxt, pentry, !force))
+                hash_search(pgStatPendingTabHash,
+                            (void *) pentry, HASH_REMOVE, NULL);
+        }
+
+        /* destroy the hash if no entry is left */
+        if (hash_get_num_entries(pgStatPendingTabHash) == 0)
+        {
+            hash_destroy(pgStatPendingTabHash);
+            pgStatPendingTabHash = NULL;
+        }
+    }
+
+    if (cxt->tabhash)
+        dshash_detach(cxt->tabhash);
+    if (cxt->dbentry)
+        dshash_release_lock(db_stats, cxt->dbentry);
+    cxt->tabhash = NULL;
+    cxt->dbentry = NULL;
+}
+
+
+/*
+ * pgstat_apply_tabstat: update shared stats entry using given entry
+ *
+ * If nowait is true, just returns false on lock failure.  The database entry
+ * lock and the attached table-stats dshash are kept in cxt.  The caller must
+ * release and detach them after use.
+ */
+static bool
+pgstat_apply_tabstat(pgstat_apply_tabstat_context *cxt,
+                     PgStat_TableStatus *entry, bool nowait)
+{
+    Oid dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
+    int        table_mode = PGSTAT_FETCH_EXCLUSIVE;
+    bool updated = false;
+
+    if (nowait)
+        table_mode |= PGSTAT_FETCH_NOWAIT;
+
+    /*
+     * We need to keep lock on dbentries for regular tables to avoid race
+     * condition with drop database. So we hold it in the context variable. We
+     * don't need that for shared tables.
+     */
+    if (!cxt->dbentry)
+        cxt->dbentry = pgstat_get_db_entry(dboid, table_mode, NULL);
+
+    /* could not acquire the lock; just return */
+    if (!cxt->dbentry)
+        return false;
+
+    /* attach shared stats table if not yet */
+    if (!cxt->tabhash)
+    {
+        /* apply database stats  */
+        if (!entry->t_shared)
+        {
+            /* Update database-wide stats  */
+            cxt->dbentry->n_xact_commit += pgStatXactCommit;
+            cxt->dbentry->n_xact_rollback += pgStatXactRollback;
+            cxt->dbentry->n_block_read_time += pgStatBlockReadTime;
+            cxt->dbentry->n_block_write_time += pgStatBlockWriteTime;
+            pgStatXactCommit = 0;
+            pgStatXactRollback = 0;
+            pgStatBlockReadTime = 0;
+            pgStatBlockWriteTime = 0;
+        }
+
+        cxt->tabhash =
+            dshash_attach(area, &dsh_tblparams, cxt->dbentry->tables, 0);
+    }
+
+    /*
+     * If we have access to the required data, try to update table stats first.
+     * Update database stats only if the first step succeeded.
+     */
+    if (pgstat_update_tabentry(cxt->tabhash, entry, nowait))
+    {
+        pgstat_update_dbentry(cxt->dbentry, entry);
+        updated = true;
+    }
+
+    return updated;
+}
+
+/*
+ * pgstat_merge_tabentry: subroutine for pgstat_update_stat
+ *
+ * Merge srcstat into deststat. Existing values in deststat are cleared if
+ * init is true.
+ */
+static void
+pgstat_merge_tabentry(PgStat_TableStatus *deststat,
+                      PgStat_TableStatus *srcstat,
+                      bool init)
+{
+    Assert (deststat != srcstat);
+
+    if (init)
+        deststat->t_counts = srcstat->t_counts;
+    else
+    {
+        PgStat_TableCounts *dest = &deststat->t_counts;
+        PgStat_TableCounts *src = &srcstat->t_counts;
+
+        dest->t_numscans += src->t_numscans;
+        dest->t_tuples_returned += src->t_tuples_returned;
+        dest->t_tuples_fetched += src->t_tuples_fetched;
+        dest->t_tuples_inserted += src->t_tuples_inserted;
+        dest->t_tuples_updated += src->t_tuples_updated;
+        dest->t_tuples_deleted += src->t_tuples_deleted;
+        dest->t_tuples_hot_updated += src->t_tuples_hot_updated;
+        dest->t_truncated |= src->t_truncated;
+
+        /* If table was truncated, first reset the live/dead counters */
+        if (src->t_truncated)
+        {
+            dest->t_delta_live_tuples = 0;
+            dest->t_delta_dead_tuples = 0;
+        }
+        dest->t_delta_live_tuples += src->t_delta_live_tuples;
+        dest->t_delta_dead_tuples += src->t_delta_dead_tuples;
+        dest->t_changed_tuples += src->t_changed_tuples;
+        dest->t_blocks_fetched += src->t_blocks_fetched;
+        dest->t_blocks_hit += src->t_blocks_hit;
+    }
+}
+
+/*
+ * pgstat_update_funcstats: subroutine for pgstat_update_stat
+ *
+ *  updates a function stat
+ */
+static void
+pgstat_update_funcstats(bool force, PgStat_StatDBEntry *dbentry)
+{
+    /* we assume this inits to all zeroes: */
+    static const PgStat_FunctionCounts all_zeroes;
+    pg_stat_table_result_status status = 0;
+    dshash_table *funchash;
+    bool          nowait = !force;
+    bool          release_db = false;
+    int              table_op = PGSTAT_FETCH_EXCLUSIVE;
+
+    if (pgStatFunctions == NULL && pgStatPendingFunctions == NULL)
+        return;
+
+    if (nowait)
+        table_op |= PGSTAT_FETCH_NOWAIT;
+
+    /* find the shared function stats table */
+    if (!dbentry)
+    {
+        dbentry = pgstat_get_db_entry(MyDatabaseId, table_op, &status);
+        release_db = true;
+    }
+
+    /* lock failure, return. */
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    /* create hash if not yet */
+    if (dbentry->functions == DSM_HANDLE_INVALID)
+    {
+        funchash = dshash_create(area, &dsh_funcparams, 0);
+        dbentry->functions = dshash_get_hash_table_handle(funchash);
+    }
+    else
+        funchash = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+
+    /*
+     * First, we empty the transaction stats. Just move numbers to pending
+     * stats if any. Otherwise try to directly update the shared stats but
+     * create a new pending entry on lock failure.
+     */
+    if (pgStatFunctions)
+    {
+        HASH_SEQ_STATUS fstat;
+        PgStat_BackendFunctionEntry *bestat;
+
+        hash_seq_init(&fstat, pgStatFunctions);
+        while ((bestat = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+        {
+            bool found;
+            bool init = false;
+            PgStat_StatFuncEntry *funcent = NULL;
+
+            /* Skip it if no counts accumulated since last time */
+            if (memcmp(&bestat->f_counts, &all_zeroes,
+                       sizeof(PgStat_FunctionCounts)) == 0)
+                continue;
+
+            /* find pending entry */
+            if (pgStatPendingFunctions)
+                funcent = (PgStat_StatFuncEntry *)
+                    hash_search(pgStatPendingFunctions,
+                                (void *) &(bestat->f_id), HASH_FIND, NULL);
+
+            if (!funcent)
+            {
+                /* pending entry not found, find shared stats entry */
+                funcent = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert_extended(funchash,
+                                                   (void *) &(bestat->f_id),
+                                                   &found, nowait);
+                if (funcent)
+                    init = !found;
+                else
+                {
+                    /* could not lock the shared entry; create a pending one */
+                    funcent = (PgStat_StatFuncEntry *)
+                        hash_search(pgStatPendingFunctions,
+                                    (void *) &(bestat->f_id), HASH_ENTER, NULL);
+                    init = true;
+                }
+            }
+            Assert (funcent != NULL);
+
+            pgstat_merge_backendstats_to_funcentry(funcent, bestat, init);
+
+            /* reset used counts */
+            MemSet(&bestat->f_counts, 0, sizeof(PgStat_FunctionCounts));
+        }
+    }
+
+    /* Second, apply pending stats numbers to shared table */
+    if (pgStatPendingFunctions)
+    {
+        HASH_SEQ_STATUS fstat;
+        PgStat_StatFuncEntry *pendent;
+
+        hash_seq_init(&fstat, pgStatPendingFunctions);
+        while ((pendent = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+        {
+            PgStat_StatFuncEntry *funcent;
+            bool found;
+
+            funcent = (PgStat_StatFuncEntry *)
+                dshash_find_or_insert_extended(funchash,
+                                               (void *) &(pendent->functionid),
+                                               &found, nowait);
+            if (funcent)
+            {
+                pgstat_merge_funcentry(funcent, pendent, !found);
+                hash_search(pgStatPendingFunctions,
+                            (void *) &(pendent->functionid), HASH_REMOVE, NULL);
+            }
+        }
+
+        /* destroy the hash if no entry remains */
+        if (hash_get_num_entries(pgStatPendingFunctions) == 0)
+        {
+            hash_destroy(pgStatPendingFunctions);
+            pgStatPendingFunctions = NULL;
+        }
+    }
+
+    if (release_db)
+        dshash_release_lock(db_stats, dbentry);
+}
+
+/*
+ * pgstat_merge_backendstats_to_funcentry: subroutine for
+ *                                             pgstat_update_funcstats
+ *
+ * Merges BackendFunctionEntry into StatFuncEntry
+ */
+static inline void
+pgstat_merge_backendstats_to_funcentry(PgStat_StatFuncEntry *dest,
+                                       PgStat_BackendFunctionEntry *src,
+                                       bool init)
+{
+    if (init)
+    {
+        /*
+         * If it's a new function entry, initialize counters to the values
+         * we just got.
+         */
+        dest->f_numcalls = src->f_counts.f_numcalls;
+        dest->f_total_time =
+            INSTR_TIME_GET_MICROSEC(src->f_counts.f_total_time);
+        dest->f_self_time =
+            INSTR_TIME_GET_MICROSEC(src->f_counts.f_self_time);
+    }
+    else
+    {
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        dest->f_numcalls += src->f_counts.f_numcalls;
+        dest->f_total_time +=
+            INSTR_TIME_GET_MICROSEC(src->f_counts.f_total_time);
+        dest->f_self_time +=
+            INSTR_TIME_GET_MICROSEC(src->f_counts.f_self_time);
+    }
+}
+
+/*
+ * pgstat_merge_funcentry: subroutine for pgstat_update_funcstats
+ *
+ * Merges two StatFuncEntrys
+ */
+static inline void
+pgstat_merge_funcentry(PgStat_StatFuncEntry *dest, PgStat_StatFuncEntry *src,
+                       bool init)
+{
+    if (init)
+    {
+        /*
+         * If it's a new function entry, initialize counters to the values
+         * we just got.
+         */
+        dest->f_numcalls = src->f_numcalls;
+        dest->f_total_time = src->f_total_time;
+        dest->f_self_time = src->f_self_time;
+    }
+    else
+    {
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        dest->f_numcalls += src->f_numcalls;
+        dest->f_total_time += src->f_total_time;
+        dest->f_self_time += src->f_self_time;
+    }
+}
+
+
+
+/* ----------
+ * pgstat_vacuum_stat() -
+ *
+ *    Remove stats entries for objects that no longer exist.
+ * ----------
+ */
+void
+pgstat_vacuum_stat(void)
+{
+    HTAB       *oidtab;
+    dshash_table *dshtable;
+    dshash_seq_status dshstat;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    PgStat_StatFuncEntry *funcentry;
+
+    /* we don't collect statistics under standalone mode */
+    if (!IsUnderPostmaster)
+        return;
+
+    /* If not done for this transaction, take a snapshot of stats */
+    if (!backend_snapshot_global_stats())
+        return;
+
+    /*
+     * Read pg_database and make a list of OIDs of all existing databases
+     */
+    oidtab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+
+    /*
+     * Search the database hash table for dead databases and drop them
+     * from the hash.
+     */
+
+    dshash_seq_init(&dshstat, db_stats, false, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
+    {
+        Oid            dbid = dbentry->databaseid;
+
+        CHECK_FOR_INTERRUPTS();
+
+        /* the DB entry for shared tables (with InvalidOid) is never dropped */
+        if (OidIsValid(dbid) &&
+            hash_search(oidtab, (void *) &dbid, HASH_FIND, NULL) == NULL)
+            pgstat_drop_database(dbid);
+    }
+
+    /* Clean up */
+    hash_destroy(oidtab);
+
+    /*
+     * Lookup our own database entry; if not found, nothing more to do.
+     */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, NULL);
+    if (!dbentry)
+        return;
+
+    /*
+     * Similarly to above, make a list of all known relations in this DB.
+     */
+    oidtab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
+
+    /*
+     * Check for all tables listed in stats hashtable if they still exist.
+     * Stats cache is useless here so directly search the shared hash.
+     */
+    dshtable = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&dshstat, dshtable, false, true);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&dshstat)) != NULL)
+    {
+        Oid            tabid = tabentry->tableid;
+
+        CHECK_FOR_INTERRUPTS();
+
+        if (hash_search(oidtab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+            continue;
+
+        /* Not there, so purge this table */
+        dshash_delete_entry(dshtable, tabentry);
+    }
+    dshash_detach(dshtable);
+
+    /* Clean up */
+    hash_destroy(oidtab);
+
+    /*
+     * Now repeat the above steps for functions.  However, we needn't bother
+     * in the common case where no function stats are being collected.
+     */
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        dshtable = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        oidtab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
+
+        dshash_seq_init(&dshstat, dshtable, false, true);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&dshstat)) != NULL)
+        {
+            Oid            funcid = funcentry->functionid;
+
+            CHECK_FOR_INTERRUPTS();
+
+            if (hash_search(oidtab, (void *) &funcid, HASH_FIND, NULL) != NULL)
+                continue;
+
+            /* Not there, so remove this function */
+            dshash_delete_entry(dshtable, funcentry);
+        }
+
+        hash_destroy(oidtab);
+
+        dshash_detach(dshtable);
+    }
+    dshash_release_lock(db_stats, dbentry);
+}
+
+
+/*
+ * pgstat_collect_oids() -
+ *
+ *    Collect the OIDs of all objects listed in the specified system catalog
+ *    into a temporary hash table.  Caller should hash_destroy the result after
+ *    use.  (However, we make the table in CurrentMemoryContext so that it will
+ *    be freed properly in event of an error.)
+ */
+static HTAB *
+pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
+{
+    HTAB       *htab;
+    HASHCTL        hash_ctl;
+    Relation    rel;
+    HeapScanDesc scan;
+    HeapTuple    tup;
+    Snapshot    snapshot;
+
+    memset(&hash_ctl, 0, sizeof(hash_ctl));
+    hash_ctl.keysize = sizeof(Oid);
+    hash_ctl.entrysize = sizeof(Oid);
+    hash_ctl.hcxt = CurrentMemoryContext;
+    htab = hash_create("Temporary table of OIDs",
+                       PGSTAT_TAB_HASH_SIZE,
+                       &hash_ctl,
+                       HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+    rel = heap_open(catalogid, AccessShareLock);
+    snapshot = RegisterSnapshot(GetLatestSnapshot());
+    scan = heap_beginscan(rel, snapshot, 0, NULL);
+    while ((tup = heap_getnext(scan, ForwardScanDirection)) != NULL)
+    {
+        Oid            thisoid;
+        bool        isnull;
+
+        thisoid = heap_getattr(tup, anum_oid, RelationGetDescr(rel), &isnull);
+        Assert(!isnull);
+
+        CHECK_FOR_INTERRUPTS();
+
+        (void) hash_search(htab, (void *) &thisoid, HASH_ENTER, NULL);
+    }
+    heap_endscan(scan);
+    UnregisterSnapshot(snapshot);
+    heap_close(rel, AccessShareLock);
+
+    return htab;
+}
+
+
+/* ----------
+ * pgstat_drop_database() -
+ *
+ *    Remove entry for the database that we just dropped.
+ *
+ *    If some stats update happens after this, this entry will be re-created,
+ *    but we will still clean the dead DB eventually via future invocations of
+ *    pgstat_vacuum_stat().
+ * ----------
+ */
+
+void
+pgstat_drop_database(Oid databaseid)
+{
+    PgStat_StatDBEntry *dbentry;
+
+    Assert (OidIsValid(databaseid));
+    Assert(db_stats);
+
+    /*
+     * Lookup the database in the hashtable with exclusive lock.
+     */
+    dbentry = pgstat_get_db_entry(databaseid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    /*
+     * If found, remove it (along with the db statfile).
+     */
+    if (dbentry)
+    {
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+        }
+
+        dshash_delete_entry(db_stats, (void *)dbentry);
+    }
+}
+
+
+/* ----------
+ * pgstat_reset_counters() -
+ *
+ *    Reset counters for our database.
+ *
+ *    Permission checking for this function is managed through the normal
+ *    GRANT system.
+ * ----------
+ */
+void
+pgstat_reset_counters(void)
+{
+    PgStat_StatDBEntry           *dbentry;
+    pg_stat_table_result_status status;
+
+    Assert(db_stats);
+
+    /*
+     * Lookup the database in the hashtable.  Nothing to do if not there.
+     */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, &status);
+
+    if (!dbentry)
+        return;
+
+    /*
+     * We simply throw away all the database's table entries by recreating a
+     * new hash table for them.
+     */
+    if (dbentry->tables != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_destroy(t);
+        dbentry->tables = DSM_HANDLE_INVALID;
+    }
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_destroy(t);
+        dbentry->functions = DSM_HANDLE_INVALID;
+    }
+
+    /*
+     * Reset database-level stats, too.  This creates empty hash tables for
+     * tables and functions.
+     */
+    reset_dbentry_counters(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/* ----------
+ * pgstat_reset_shared_counters() -
+ *
+ *    Reset cluster-wide shared counters.
+ *
+ *    Permission checking for this function is managed through the normal
+ *    GRANT system.
+ * ----------
+ */
+void
+pgstat_reset_shared_counters(const char *target)
+{
+    Assert(db_stats);
+
+    /* Reset the archiver statistics for the cluster. */
+    if (strcmp(target, "archiver") == 0)
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+        memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
+    }
+    else if (strcmp(target, "bgwriter") == 0)
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+        /* Reset the global background writer statistics for the cluster. */
+        memset(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    }
+    else
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                 errmsg("unrecognized reset target: \"%s\"", target),
+                 errhint("Target must be \"archiver\" or \"bgwriter\".")));
+
+    LWLockRelease(StatsLock);
+}
+
+/* ----------
+ * pgstat_reset_single_counter() -
+ *
+ *    Reset a single counter.
+ *
+ *    Permission checking for this function is managed through the normal
+ *    GRANT system.
+ * ----------
+ */
+void
+pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
+{
+    PgStat_StatDBEntry *dbentry;
+
+    Assert(db_stats);
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    if (!dbentry)
+        return;
+
+    /* Set the reset timestamp for the whole database */
+    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
+
+    /* Remove object if it exists, ignore it if not */
+    if (type == RESET_TABLE && dbentry->tables != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_delete_key(t, (void *) &objoid);
+        dshash_detach(t);
+    }
+
+    if (type == RESET_FUNCTION && dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_delete_key(t, (void *) &objoid);
+        dshash_detach(t);
+    }
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/*
+ * pgstat_reset_all_counters: subroutine for pgstat_reset_all
+ *
+ * Clears all counters in shared memory.
+ */
+static void
+pgstat_reset_all_counters(void)
+{
+    dshash_seq_status dshstat;
+    PgStat_StatDBEntry *dbentry;
+
+    Assert(db_stats);
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    dshash_seq_init(&dshstat, db_stats, false, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
+    {
+        /*
+         * We simply throw away all the database's table hashes
+         */
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *t =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(t);
+            dbentry->tables = DSM_HANDLE_INVALID;
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *t =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(t);
+            dbentry->functions = DSM_HANDLE_INVALID;
+        }
+
+        /*
+         * Reset database-level stats, too.  This creates empty hash tables
+         * for tables and functions.
+         */
+        reset_dbentry_counters(dbentry);
+        dshash_release_lock(db_stats, dbentry);
+    }
+
+    /*
+     * Reset global counters
+     */
+    memset(shared_globalStats, 0, sizeof(*shared_globalStats));
+    memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+    shared_globalStats->stat_reset_timestamp =
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
+
+    LWLockRelease(StatsLock);
+}
+
+/* ----------
+ * pgstat_report_autovac() -
+ *
+ *    Called from autovacuum.c to report startup of an autovacuum process.
+ *    We are called before InitPostgres is done, so can't rely on MyDatabaseId;
+ *    the db OID must be passed in, instead.
+ * ----------
+ */
+void
+pgstat_report_autovac(Oid dboid)
+{
+    PgStat_StatDBEntry *dbentry;
+
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    /*
+     * Store the last autovacuum time in the database's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    dbentry->last_autovac_time = GetCurrentTimestamp();
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+
+/* ---------
+ * pgstat_report_vacuum() -
+ *
+ *    Report about the table we just vacuumed.
+ * ---------
+ */
+void
+pgstat_report_vacuum(Oid tableoid, bool shared,
+                     PgStat_Counter livetuples, PgStat_Counter deadtuples)
+{
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
+
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    dboid = shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, tableoid, true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->vacuum_count++;
+    }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/* --------
+ * pgstat_report_analyze() -
+ *
+ *    Report about the table we just analyzed.
+ *
+ * Caller must provide new live- and dead-tuples estimates, as well as a
+ * flag indicating whether to reset the changes_since_analyze counter.
+ * --------
+ */
+void
+pgstat_report_analyze(Relation rel,
+                      PgStat_Counter livetuples, PgStat_Counter deadtuples,
+                      bool resetcounter)
+{
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
+
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    /*
+     * Unlike VACUUM, ANALYZE might be running inside a transaction that has
+     * already inserted and/or deleted rows in the target table. ANALYZE will
+     * have counted such rows as live or dead respectively. Because we will
+     * report our counts of such rows at transaction end, we should subtract
+     * off these counts from what we send to the collector now, else they'll
+     * be double-counted after commit.  (This approach also ensures that the
+     * collector ends up with the right numbers if we abort instead of
+     * committing.)
+     */
+    if (rel->pgstat_info != NULL)
+    {
+        PgStat_TableXactStatus *trans;
+
+        for (trans = rel->pgstat_info->trans; trans; trans = trans->upper)
+        {
+            livetuples -= trans->tuples_inserted - trans->tuples_deleted;
+            deadtuples -= trans->tuples_updated + trans->tuples_deleted;
+        }
+        /* count stuff inserted by already-aborted subxacts, too */
+        deadtuples -= rel->pgstat_info->t_counts.t_delta_dead_tuples;
+        /* Since ANALYZE's counts are estimates, we could have underflowed */
+        livetuples = Max(livetuples, 0);
+        deadtuples = Max(deadtuples, 0);
+    }
+
+    dboid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, RelationGetRelid(rel), true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/* --------
+ * pgstat_report_recovery_conflict() -
+ *
+ *    Report a Hot Standby recovery conflict.
+ * --------
+ */
+static int pending_conflict_tablespace = 0;
+static int pending_conflict_lock = 0;
+static int pending_conflict_snapshot = 0;
+static int pending_conflict_bufferpin = 0;
+static int pending_conflict_startup_deadlock = 0;
+
+void
+pgstat_report_recovery_conflict(int reason)
+{
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
+
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    pgstat_pending_recoveryconflict = true;
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            pending_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            pending_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            pending_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            pending_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            pending_conflict_startup_deadlock++;
+            break;
+    }
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    pgstat_cleanup_recovery_conflict(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/*
+ * clean up function for pending recovery conflicts
+ */
+static void
+pgstat_cleanup_recovery_conflict(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_conflict_tablespace    += pending_conflict_tablespace;
+    dbentry->n_conflict_lock         += pending_conflict_lock;
+    dbentry->n_conflict_snapshot    += pending_conflict_snapshot;
+    dbentry->n_conflict_bufferpin    += pending_conflict_bufferpin;
+    dbentry->n_conflict_startup_deadlock += pending_conflict_startup_deadlock;
+
+    pending_conflict_tablespace = 0;
+    pending_conflict_lock = 0;
+    pending_conflict_snapshot = 0;
+    pending_conflict_bufferpin = 0;
+    pending_conflict_startup_deadlock = 0;
+
+    pgstat_pending_recoveryconflict = false;
+}
+
+/* --------
+ * pgstat_report_deadlock() -
+ *
+ *    Report a deadlock detected.
+ * --------
+ */
+static int pending_deadlocks = 0;
+
+void
+pgstat_report_deadlock(void)
+{
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
+
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    pending_deadlocks++;
+    pgstat_pending_deadlock = true;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    pgstat_cleanup_deadlock(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/*
+ * clean up function for pending deadlocks
+ */
+static void
+pgstat_cleanup_deadlock(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_deadlocks += pending_deadlocks;
+    pending_deadlocks = 0;
+    pgstat_pending_deadlock = false;
+}
+
+/* --------
+ * pgstat_report_tempfile() -
+ *
+ *    Report a temporary file.
+ * --------
+ */
+static size_t pending_filesize = 0;
+static size_t pending_files = 0;
+
+void
+pgstat_report_tempfile(size_t filesize)
+{
+    PgStat_StatDBEntry *dbentry;
+    pg_stat_table_result_status status;
+
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    if (filesize > 0) /* Isn't there a case where filesize is really 0? */
+    {
+        pgstat_pending_tempfile = true;
+        pending_filesize += filesize; /* needs an overflow check */
+        pending_files++;
+    }
+
+    if (!pgstat_pending_tempfile)
+        return;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    pgstat_cleanup_tempfile(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/*
+ * clean up function for temporary files
+ */
+static void
+pgstat_cleanup_tempfile(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_temp_bytes += pending_filesize;
+    dbentry->n_temp_files += pending_files;
+    pending_filesize = 0;
+    pending_files = 0;
+    pgstat_pending_tempfile = false;
+}
+
+/*
+ * Initialize function call usage data.
+ * Called by the executor before invoking a function.
+ */
+void
+pgstat_init_function_usage(FunctionCallInfoData *fcinfo,
+                           PgStat_FunctionCallUsage *fcu)
+{
+    PgStat_BackendFunctionEntry *htabent;
+    bool        found;
+
+    if (pgstat_track_functions <= fcinfo->flinfo->fn_stats)
+    {
+        /* stats not wanted */
+        fcu->fs = NULL;
+        return;
+    }
+
+    if (!pgStatFunctions)
+    {
+        /* First time through - initialize function stat table */
+        HASHCTL        hash_ctl;
+
+        memset(&hash_ctl, 0, sizeof(hash_ctl));
+        hash_ctl.keysize = sizeof(Oid);
+        hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
+        pgStatFunctions = hash_create("Function stat entries",
+                                      PGSTAT_FUNCTION_HASH_SIZE,
+                                      &hash_ctl,
+                                      HASH_ELEM | HASH_BLOBS);
+    }
+
+    /* Get the stats entry for this function, create if necessary */
+    htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
+                          HASH_ENTER, &found);
+    if (!found)
+        MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
+
+    fcu->fs = &htabent->f_counts;
+
+    /* save stats for this function, later used to compensate for recursion */
+    fcu->save_f_total_time = htabent->f_counts.f_total_time;
+
+    /* save current backend-wide total time */
+    fcu->save_total = total_func_time;
+
+    /* get clock time as of function start */
+    INSTR_TIME_SET_CURRENT(fcu->f_start);
+}
+
+/*
+ * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
+ *        for specified function
+ *
+ * If no entry, return NULL, don't create a new one
+ */
+PgStat_BackendFunctionEntry *
+find_funcstat_entry(Oid func_id)
+{
+    if (pgStatFunctions == NULL)
+        return NULL;
+
+    return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
+                                                       (void *) &func_id,
+                                                       HASH_FIND, NULL);
+}
+
+/*
+ * Calculate function call usage and update stat counters.
+ * Called by the executor after invoking a function.
+ *
+ * In the case of a set-returning function that runs in value-per-call mode,
+ * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
+ * calls for what the user considers a single call of the function.  The
+ * finalize flag should be TRUE on the last call.
+ */
+void
+pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
+{
+    PgStat_FunctionCounts *fs = fcu->fs;
+    instr_time    f_total;
+    instr_time    f_others;
+    instr_time    f_self;
+
+    /* stats not wanted? */
+    if (fs == NULL)
+        return;
+
+    /* total elapsed time in this function call */
+    INSTR_TIME_SET_CURRENT(f_total);
+    INSTR_TIME_SUBTRACT(f_total, fcu->f_start);
+
+    /* self usage: elapsed minus anything already charged to other calls */
+    f_others = total_func_time;
+    INSTR_TIME_SUBTRACT(f_others, fcu->save_total);
+    f_self = f_total;
+    INSTR_TIME_SUBTRACT(f_self, f_others);
+
+    /* update backend-wide total time */
+    INSTR_TIME_ADD(total_func_time, f_self);
+
+    /*
+     * Compute the new f_total_time as the total elapsed time added to the
+     * pre-call value of f_total_time.  This is necessary to avoid
+     * double-counting any time taken by recursive calls of myself.  (We do
+     * not need any similar kluge for self time, since that already excludes
+     * any recursive calls.)
+     */
+    INSTR_TIME_ADD(f_total, fcu->save_f_total_time);
+
+    /* update counters in function stats table */
+    if (finalize)
+        fs->f_numcalls++;
+    fs->f_total_time = f_total;
+    INSTR_TIME_ADD(fs->f_self_time, f_self);
+}
+
+
+/* ----------
+ * pgstat_initstats() -
+ *
+ *    Initialize a relcache entry to count access statistics.
+ *    Called whenever a relation is opened.
+ *
+ *    We assume that a relcache entry's pgstat_info field is zeroed by
+ *    relcache.c when the relcache entry is made; thereafter it is long-lived
+ *    data.  We can avoid repeated searches of the TabStatus arrays when the
+ *    same relation is touched repeatedly within a transaction.
+ * ----------
+ */
+void
+pgstat_initstats(Relation rel)
+{
+    Oid            rel_id = rel->rd_id;
+    char        relkind = rel->rd_rel->relkind;
+
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+    {
+        /* We're not counting at all */
+        rel->pgstat_info = NULL;
+        return;
+    }
+
+    /* We only count stats for things that have storage */
+    if (!(relkind == RELKIND_RELATION ||
+          relkind == RELKIND_MATVIEW ||
+          relkind == RELKIND_INDEX ||
+          relkind == RELKIND_TOASTVALUE ||
+          relkind == RELKIND_SEQUENCE))
+    {
+        rel->pgstat_info = NULL;
+        return;
+    }
+
+    /*
+     * If we already set up this relation in the current transaction, nothing
+     * to do.
+     */
+    if (rel->pgstat_info != NULL &&
+        rel->pgstat_info->t_id == rel_id)
+        return;
+
+    /* Else find or make the PgStat_TableStatus entry, and update link */
+    rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+}
+
+/*
+ * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
+ */
+static PgStat_TableStatus *
+get_tabstat_entry(Oid rel_id, bool isshared)
+{
+    TabStatHashEntry *hash_entry;
+    PgStat_TableStatus *entry;
+    TabStatusArray *tsa;
+    bool        found;
+
+    /*
+     * Create hash table if we don't have it already.
+     */
+    if (pgStatTabHash == NULL)
+    {
+        HASHCTL        ctl;
+
+        memset(&ctl, 0, sizeof(ctl));
+        ctl.keysize = sizeof(Oid);
+        ctl.entrysize = sizeof(TabStatHashEntry);
+
+        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
+                                    TABSTAT_QUANTUM,
+                                    &ctl,
+                                    HASH_ELEM | HASH_BLOBS);
+    }
+
+    /*
+     * Find an entry or create a new one.
+     */
+    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
+    if (!found)
+    {
+        /* initialize new entry with null pointer */
+        hash_entry->tsa_entry = NULL;
+    }
+
+    /*
+     * If entry is already valid, we're done.
+     */
+    if (hash_entry->tsa_entry)
+        return hash_entry->tsa_entry;
+
+    /*
+     * Locate the first pgStatTabList entry with free space, making a new list
+     * entry if needed.  Note that we could get an OOM failure here, but if so
+     * we have left the hashtable and the list in a consistent state.
+     */
+    if (pgStatTabList == NULL)
+    {
+        /* Set up first pgStatTabList entry */
+        pgStatTabList = (TabStatusArray *)
+            MemoryContextAllocZero(TopMemoryContext,
+                                   sizeof(TabStatusArray));
+    }
+
+    tsa = pgStatTabList;
+    while (tsa->tsa_used >= TABSTAT_QUANTUM)
+    {
+        if (tsa->tsa_next == NULL)
+            tsa->tsa_next = (TabStatusArray *)
+                MemoryContextAllocZero(TopMemoryContext,
+                                       sizeof(TabStatusArray));
+        tsa = tsa->tsa_next;
+    }
+
+    /*
+     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
+     * the entry was already zeroed, either at creation or after last use.
+     */
+    entry = &tsa->tsa_entries[tsa->tsa_used++];
+    entry->t_id = rel_id;
+    entry->t_shared = isshared;
+
+    /*
+     * Now we can fill the entry in pgStatTabHash.
+     */
+    hash_entry->tsa_entry = entry;
+
+    return entry;
+}
+
+/*
+ * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
+ *
+ * If no entry, return NULL, don't create a new one
+ *
+ * Note: if we got an error in the most recent execution of pgstat_report_stat,
+ * it's possible that an entry exists but there's no hashtable entry for it.
+ * That's okay, we'll treat this case as "doesn't exist".
+ */
+PgStat_TableStatus *
+find_tabstat_entry(Oid rel_id)
+{
+    TabStatHashEntry *hash_entry;
+
+    /* If hashtable doesn't exist, there are no entries at all */
+    if (!pgStatTabHash)
+        return NULL;
+
+    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
+    if (!hash_entry)
+        return NULL;
+
+    /* Note that this step could also return NULL, but that's correct */
+    return hash_entry->tsa_entry;
+}
+
+/*
+ * get_tabstat_stack_level - add a new (sub)transaction stack entry if needed
+ */
+static PgStat_SubXactStatus *
+get_tabstat_stack_level(int nest_level)
+{
+    PgStat_SubXactStatus *xact_state;
+
+    xact_state = pgStatXactStack;
+    if (xact_state == NULL || xact_state->nest_level != nest_level)
+    {
+        xact_state = (PgStat_SubXactStatus *)
+            MemoryContextAlloc(TopTransactionContext,
+                               sizeof(PgStat_SubXactStatus));
+        xact_state->nest_level = nest_level;
+        xact_state->prev = pgStatXactStack;
+        xact_state->first = NULL;
+        pgStatXactStack = xact_state;
+    }
+    return xact_state;
+}
+
+/*
+ * add_tabstat_xact_level - add a new (sub)transaction state record
+ */
+static void
+add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level)
+{
+    PgStat_SubXactStatus *xact_state;
+    PgStat_TableXactStatus *trans;
+
+    /*
+     * If this is the first rel to be modified at the current nest level, we
+     * first have to push a transaction stack entry.
+     */
+    xact_state = get_tabstat_stack_level(nest_level);
+
+    /* Now make a per-table stack entry */
+    trans = (PgStat_TableXactStatus *)
+        MemoryContextAllocZero(TopTransactionContext,
+                               sizeof(PgStat_TableXactStatus));
+    trans->nest_level = nest_level;
+    trans->upper = pgstat_info->trans;
+    trans->parent = pgstat_info;
+    trans->next = xact_state->first;
+    xact_state->first = trans;
+    pgstat_info->trans = trans;
+}
+
+/*
+ * pgstat_count_heap_insert - count a tuple insertion of n tuples
+ */
+void
+pgstat_count_heap_insert(Relation rel, PgStat_Counter n)
+{
+    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
+
+    if (pgstat_info != NULL)
+    {
+        /* We have to log the effect at the proper transactional level */
+        int            nest_level = GetCurrentTransactionNestLevel();
+
+        if (pgstat_info->trans == NULL ||
+            pgstat_info->trans->nest_level != nest_level)
+            add_tabstat_xact_level(pgstat_info, nest_level);
+
+        pgstat_info->trans->tuples_inserted += n;
+    }
+}
+
+/*
+ * pgstat_count_heap_update - count a tuple update
+ */
+void
+pgstat_count_heap_update(Relation rel, bool hot)
+{
+    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
+
+    if (pgstat_info != NULL)
+    {
+        /* We have to log the effect at the proper transactional level */
+        int            nest_level = GetCurrentTransactionNestLevel();
+
+        if (pgstat_info->trans == NULL ||
+            pgstat_info->trans->nest_level != nest_level)
+            add_tabstat_xact_level(pgstat_info, nest_level);
+
+        pgstat_info->trans->tuples_updated++;
+
+        /* t_tuples_hot_updated is nontransactional, so just advance it */
+        if (hot)
+            pgstat_info->t_counts.t_tuples_hot_updated++;
+    }
+}
+
+/*
+ * pgstat_count_heap_delete - count a tuple deletion
+ */
+void
+pgstat_count_heap_delete(Relation rel)
+{
+    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
+
+    if (pgstat_info != NULL)
+    {
+        /* We have to log the effect at the proper transactional level */
+        int            nest_level = GetCurrentTransactionNestLevel();
+
+        if (pgstat_info->trans == NULL ||
+            pgstat_info->trans->nest_level != nest_level)
+            add_tabstat_xact_level(pgstat_info, nest_level);
+
+        pgstat_info->trans->tuples_deleted++;
+    }
+}
+
+/*
+ * pgstat_truncate_save_counters
+ *
+ * Whenever a table is truncated, we save its i/u/d counters so that they can
+ * be cleared, and if the (sub)xact that executed the truncate later aborts,
+ * the counters can be restored to the saved (pre-truncate) values.  Note we do
+ * this on the first truncate in any particular subxact level only.
+ */
+static void
+pgstat_truncate_save_counters(PgStat_TableXactStatus *trans)
+{
+    if (!trans->truncated)
+    {
+        trans->inserted_pre_trunc = trans->tuples_inserted;
+        trans->updated_pre_trunc = trans->tuples_updated;
+        trans->deleted_pre_trunc = trans->tuples_deleted;
+        trans->truncated = true;
+    }
+}
+
+/*
+ * pgstat_truncate_restore_counters - restore counters when a truncate aborts
+ */
+static void
+pgstat_truncate_restore_counters(PgStat_TableXactStatus *trans)
+{
+    if (trans->truncated)
+    {
+        trans->tuples_inserted = trans->inserted_pre_trunc;
+        trans->tuples_updated = trans->updated_pre_trunc;
+        trans->tuples_deleted = trans->deleted_pre_trunc;
+    }
+}
+
+/*
+ * pgstat_count_truncate - update tuple counters due to truncate
+ */
+void
+pgstat_count_truncate(Relation rel)
+{
+    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
+
+    if (pgstat_info != NULL)
+    {
+        /* We have to log the effect at the proper transactional level */
+        int            nest_level = GetCurrentTransactionNestLevel();
+
+        if (pgstat_info->trans == NULL ||
+            pgstat_info->trans->nest_level != nest_level)
+            add_tabstat_xact_level(pgstat_info, nest_level);
+
+        pgstat_truncate_save_counters(pgstat_info->trans);
+        pgstat_info->trans->tuples_inserted = 0;
+        pgstat_info->trans->tuples_updated = 0;
+        pgstat_info->trans->tuples_deleted = 0;
+    }
+}
+
+/*
+ * pgstat_update_heap_dead_tuples - update dead-tuples count
+ *
+ * The semantics of this are that we are reporting the nontransactional
+ * recovery of "delta" dead tuples; so t_delta_dead_tuples decreases
+ * rather than increasing, and the change goes straight into the per-table
+ * counter, not into transactional state.
+ */
+void
+pgstat_update_heap_dead_tuples(Relation rel, int delta)
+{
+    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
+
+    if (pgstat_info != NULL)
+        pgstat_info->t_counts.t_delta_dead_tuples -= delta;
+}
+
+
+/* ----------
+ * AtEOXact_PgStat
+ *
+ *    Called from access/transam/xact.c at top-level transaction commit/abort.
+ * ----------
+ */
+void
+AtEOXact_PgStat(bool isCommit)
+{
+    PgStat_SubXactStatus *xact_state;
+
+    /*
+     * Count transaction commit or abort.  (We use counters, not just bools,
+     * in case the reporting message isn't sent right away.)
+     */
+    if (isCommit)
+        pgStatXactCommit++;
+    else
+        pgStatXactRollback++;
+
+    /*
+     * Transfer transactional insert/update counts into the base tabstat
+     * entries.  We don't bother to free any of the transactional state, since
+     * it's all in TopTransactionContext and will go away anyway.
+     */
+    xact_state = pgStatXactStack;
+    if (xact_state != NULL)
+    {
+        PgStat_TableXactStatus *trans;
+
+        Assert(xact_state->nest_level == 1);
+        Assert(xact_state->prev == NULL);
+        for (trans = xact_state->first; trans != NULL; trans = trans->next)
+        {
+            PgStat_TableStatus *tabstat;
+
+            Assert(trans->nest_level == 1);
+            Assert(trans->upper == NULL);
+            tabstat = trans->parent;
+            Assert(tabstat->trans == trans);
+            /* restore pre-truncate stats (if any) in case of aborted xact */
+            if (!isCommit)
+                pgstat_truncate_restore_counters(trans);
+            /* count attempted actions regardless of commit/abort */
+            tabstat->t_counts.t_tuples_inserted += trans->tuples_inserted;
+            tabstat->t_counts.t_tuples_updated += trans->tuples_updated;
+            tabstat->t_counts.t_tuples_deleted += trans->tuples_deleted;
+            if (isCommit)
+            {
+                tabstat->t_counts.t_truncated = trans->truncated;
+                if (trans->truncated)
+                {
+                    /* forget live/dead stats seen by backend thus far */
+                    tabstat->t_counts.t_delta_live_tuples = 0;
+                    tabstat->t_counts.t_delta_dead_tuples = 0;
+                }
+                /* insert adds a live tuple, delete removes one */
+                tabstat->t_counts.t_delta_live_tuples +=
+                    trans->tuples_inserted - trans->tuples_deleted;
+                /* update and delete each create a dead tuple */
+                tabstat->t_counts.t_delta_dead_tuples +=
+                    trans->tuples_updated + trans->tuples_deleted;
+                /* insert, update, delete each count as one change event */
+                tabstat->t_counts.t_changed_tuples +=
+                    trans->tuples_inserted + trans->tuples_updated +
+                    trans->tuples_deleted;
+            }
+            else
+            {
+                /* inserted tuples are dead, deleted tuples are unaffected */
+                tabstat->t_counts.t_delta_dead_tuples +=
+                    trans->tuples_inserted + trans->tuples_updated;
+                /* an aborted xact generates no changed_tuple events */
+            }
+            tabstat->trans = NULL;
+        }
+    }
+    pgStatXactStack = NULL;
+
+    /* mark that the next reference is the first in a transaction */
+    first_in_xact = true;
+}
+
+/* ----------
+ * AtEOSubXact_PgStat
+ *
+ *    Called from access/transam/xact.c at subtransaction commit/abort.
+ * ----------
+ */
+void
+AtEOSubXact_PgStat(bool isCommit, int nestDepth)
+{
+    PgStat_SubXactStatus *xact_state;
+
+    /*
+     * Transfer transactional insert/update counts into the next higher
+     * subtransaction state.
+     */
+    xact_state = pgStatXactStack;
+    if (xact_state != NULL &&
+        xact_state->nest_level >= nestDepth)
+    {
+        PgStat_TableXactStatus *trans;
+        PgStat_TableXactStatus *next_trans;
+
+        /* delink xact_state from stack immediately to simplify reuse case */
+        pgStatXactStack = xact_state->prev;
+
+        for (trans = xact_state->first; trans != NULL; trans = next_trans)
+        {
+            PgStat_TableStatus *tabstat;
+
+            next_trans = trans->next;
+            Assert(trans->nest_level == nestDepth);
+            tabstat = trans->parent;
+            Assert(tabstat->trans == trans);
+            if (isCommit)
+            {
+                if (trans->upper && trans->upper->nest_level == nestDepth - 1)
+                {
+                    if (trans->truncated)
+                    {
+                        /* propagate the truncate status one level up */
+                        pgstat_truncate_save_counters(trans->upper);
+                        /* replace upper xact stats with ours */
+                        trans->upper->tuples_inserted = trans->tuples_inserted;
+                        trans->upper->tuples_updated = trans->tuples_updated;
+                        trans->upper->tuples_deleted = trans->tuples_deleted;
+                    }
+                    else
+                    {
+                        trans->upper->tuples_inserted += trans->tuples_inserted;
+                        trans->upper->tuples_updated += trans->tuples_updated;
+                        trans->upper->tuples_deleted += trans->tuples_deleted;
+                    }
+                    tabstat->trans = trans->upper;
+                    pfree(trans);
+                }
+                else
+                {
+                    /*
+                     * When there isn't an immediate parent state, we can just
+                     * reuse the record instead of going through a
+                     * palloc/pfree pushup (this works since it's all in
+                     * TopTransactionContext anyway).  We have to re-link it
+                     * into the parent level, though, and that might mean
+                     * pushing a new entry into the pgStatXactStack.
+                     */
+                    PgStat_SubXactStatus *upper_xact_state;
+
+                    upper_xact_state = get_tabstat_stack_level(nestDepth - 1);
+                    trans->next = upper_xact_state->first;
+                    upper_xact_state->first = trans;
+                    trans->nest_level = nestDepth - 1;
+                }
+            }
+            else
+            {
+                /*
+                 * On abort, update top-level tabstat counts, then forget the
+                 * subtransaction
+                 */
+
+                /* first restore values obliterated by truncate */
+                pgstat_truncate_restore_counters(trans);
+                /* count attempted actions regardless of commit/abort */
+                tabstat->t_counts.t_tuples_inserted += trans->tuples_inserted;
+                tabstat->t_counts.t_tuples_updated += trans->tuples_updated;
+                tabstat->t_counts.t_tuples_deleted += trans->tuples_deleted;
+                /* inserted tuples are dead, deleted tuples are unaffected */
+                tabstat->t_counts.t_delta_dead_tuples +=
+                    trans->tuples_inserted + trans->tuples_updated;
+                tabstat->trans = trans->upper;
+                pfree(trans);
+            }
+        }
+        pfree(xact_state);
+    }
+}
+
+
+/*
+ * AtPrepare_PgStat
+ *        Save the transactional stats state at 2PC transaction prepare.
+ *
+ * In this phase we just generate 2PC records for all the pending
+ * transaction-dependent stats work.
+ */
+void
+AtPrepare_PgStat(void)
+{
+    PgStat_SubXactStatus *xact_state;
+
+    xact_state = pgStatXactStack;
+    if (xact_state != NULL)
+    {
+        PgStat_TableXactStatus *trans;
+
+        Assert(xact_state->nest_level == 1);
+        Assert(xact_state->prev == NULL);
+        for (trans = xact_state->first; trans != NULL; trans = trans->next)
+        {
+            PgStat_TableStatus *tabstat;
+            TwoPhasePgStatRecord record;
+
+            Assert(trans->nest_level == 1);
+            Assert(trans->upper == NULL);
+            tabstat = trans->parent;
+            Assert(tabstat->trans == trans);
+
+            record.tuples_inserted = trans->tuples_inserted;
+            record.tuples_updated = trans->tuples_updated;
+            record.tuples_deleted = trans->tuples_deleted;
+            record.inserted_pre_trunc = trans->inserted_pre_trunc;
+            record.updated_pre_trunc = trans->updated_pre_trunc;
+            record.deleted_pre_trunc = trans->deleted_pre_trunc;
+            record.t_id = tabstat->t_id;
+            record.t_shared = tabstat->t_shared;
+            record.t_truncated = trans->truncated;
+
+            RegisterTwoPhaseRecord(TWOPHASE_RM_PGSTAT_ID, 0,
+                                   &record, sizeof(TwoPhasePgStatRecord));
+        }
+    }
+}
+
+/*
+ * PostPrepare_PgStat
+ *        Clean up after successful PREPARE.
+ *
+ * All we need do here is unlink the transaction stats state from the
+ * nontransactional state.  The nontransactional action counts will be
+ * reported to the stats collector immediately, while the effects on live
+ * and dead tuple counts are preserved in the 2PC state file.
+ *
+ * Note: AtEOXact_PgStat is not called during PREPARE.
+ */
+void
+PostPrepare_PgStat(void)
+{
+    PgStat_SubXactStatus *xact_state;
+
+    /*
+     * We don't bother to free any of the transactional state, since it's all
+     * in TopTransactionContext and will go away anyway.
+     */
+    xact_state = pgStatXactStack;
+    if (xact_state != NULL)
+    {
+        PgStat_TableXactStatus *trans;
+
+        for (trans = xact_state->first; trans != NULL; trans = trans->next)
+        {
+            PgStat_TableStatus *tabstat;
+
+            tabstat = trans->parent;
+            tabstat->trans = NULL;
+        }
+    }
+    pgStatXactStack = NULL;
+}
+
+/*
+ * 2PC processing routine for COMMIT PREPARED case.
+ *
+ * Load the saved counts into our local pgstats state.
+ */
+void
+pgstat_twophase_postcommit(TransactionId xid, uint16 info,
+                           void *recdata, uint32 len)
+{
+    TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata;
+    PgStat_TableStatus *pgstat_info;
+
+    /* Find or create a tabstat entry for the rel */
+    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+
+    /* Same math as in AtEOXact_PgStat, commit case */
+    pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
+    pgstat_info->t_counts.t_tuples_updated += rec->tuples_updated;
+    pgstat_info->t_counts.t_tuples_deleted += rec->tuples_deleted;
+    pgstat_info->t_counts.t_truncated = rec->t_truncated;
+    if (rec->t_truncated)
+    {
+        /* forget live/dead stats seen by backend thus far */
+        pgstat_info->t_counts.t_delta_live_tuples = 0;
+        pgstat_info->t_counts.t_delta_dead_tuples = 0;
+    }
+    pgstat_info->t_counts.t_delta_live_tuples +=
+        rec->tuples_inserted - rec->tuples_deleted;
+    pgstat_info->t_counts.t_delta_dead_tuples +=
+        rec->tuples_updated + rec->tuples_deleted;
+    pgstat_info->t_counts.t_changed_tuples +=
+        rec->tuples_inserted + rec->tuples_updated +
+        rec->tuples_deleted;
+}
+
+/*
+ * 2PC processing routine for ROLLBACK PREPARED case.
+ *
+ * Load the saved counts into our local pgstats state, but treat them
+ * as aborted.
+ */
+void
+pgstat_twophase_postabort(TransactionId xid, uint16 info,
+                          void *recdata, uint32 len)
+{
+    TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata;
+    PgStat_TableStatus *pgstat_info;
+
+    /* Find or create a tabstat entry for the rel */
+    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+
+    /* Same math as in AtEOXact_PgStat, abort case */
+    if (rec->t_truncated)
+    {
+        rec->tuples_inserted = rec->inserted_pre_trunc;
+        rec->tuples_updated = rec->updated_pre_trunc;
+        rec->tuples_deleted = rec->deleted_pre_trunc;
+    }
+    pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
+    pgstat_info->t_counts.t_tuples_updated += rec->tuples_updated;
+    pgstat_info->t_counts.t_tuples_deleted += rec->tuples_deleted;
+    pgstat_info->t_counts.t_delta_dead_tuples +=
+        rec->tuples_inserted + rec->tuples_updated;
+}
+
+/* ----------
+ * pgstat_fetch_stat_tabentry() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    the collected statistics for one table or NULL. NULL doesn't mean
+ *    that the table doesn't exist, it is just not yet known by the
+ *    collector, so the caller is better off to report ZERO instead.
+ * ----------
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry(Oid relid)
+{
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+
+    /* Lookup our database, then look in its table hash table. */
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, false);
+    if (dbentry == NULL)
+        return NULL;
+
+    tabentry = backend_get_tab_entry(dbentry, relid, false);
+    if (tabentry != NULL)
+        return tabentry;
+
+    /*
+     * If we didn't find it, maybe it's a shared table.
+     */
+    dbentry = pgstat_fetch_stat_dbentry(InvalidOid, false);
+    if (dbentry == NULL)
+        return NULL;
+
+    tabentry = backend_get_tab_entry(dbentry, relid, false);
+    if (tabentry != NULL)
+        return tabentry;
+
+    return NULL;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_funcentry() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    the collected statistics for one function or NULL.
+ * ----------
+ */
+PgStat_StatFuncEntry *
+pgstat_fetch_stat_funcentry(Oid func_id)
+{
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatFuncEntry *funcentry = NULL;
+
+    /* Lookup our database, then find the requested function */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_SHARED, NULL);
+    if (dbentry == NULL)
+        return NULL;
+
+    funcentry = backend_get_func_etnry(dbentry, func_id, false);
+
+    dshash_release_lock(db_stats, dbentry);
+    return funcentry;
+}
+
+/*
+ * ---------
+ * pgstat_fetch_stat_archiver() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    a pointer to the archiver statistics struct.
+ * ---------
+ */
+PgStat_ArchiverStats *
+pgstat_fetch_stat_archiver(void)
+{
+    /* If not done for this transaction, take a stats snapshot */
+    if (!backend_snapshot_global_stats())
+        return NULL;
+
+    return snapshot_archiverStats;
+}
+
+
+/*
+ * ---------
+ * pgstat_fetch_global() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    a pointer to the global statistics struct.
+ * ---------
+ */
+PgStat_GlobalStats *
+pgstat_fetch_global(void)
+{
+    /* If not done for this transaction, take a stats snapshot */
+    if (!backend_snapshot_global_stats())
+        return NULL;
+
+    return snapshot_globalStats;
+}
+
+/*
+ * Shut down a single backend's statistics reporting at process exit.
+ *
+ * Flush any remaining statistics counts out to the collector.
+ * Without this, operations triggered during backend exit (such as
+ * temp table deletions) won't be counted.
+ *
+ * Lastly, clear out our entry in the PgBackendStatus array.
+ */
+static void
+pgstat_beshutdown_hook(int code, Datum arg)
+{
+    /*
+     * If we got as far as discovering our own database ID, we can report what
+     * we did to the collector.  Otherwise, we'd be sending an invalid
+     * database ID, so forget it.  (This means that accesses to pg_database
+     * during failed backend starts might never get counted.)
+     */
+    if (OidIsValid(MyDatabaseId))
+        pgstat_update_stat(true);
+}
+
+
+/* ------------------------------------------------------------
+ * Local support functions follow
+ * ------------------------------------------------------------
+ */
+
+/* ----------
+ * pgstat_update_archiver() -
+ *
+ *    Update the stats data about the WAL file that we successfully archived or
+ *    failed to archive.
+ * ----------
+ */
+void
+pgstat_update_archiver(const char *xlog, bool failed)
+{
+    if (failed)
+    {
+        /* Failed archival attempt */
+        ++shared_archiverStats->failed_count;
+        strlcpy(shared_archiverStats->last_failed_wal, xlog,
+                sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = GetCurrentTimestamp();
+    }
+    else
+    {
+        /* Successful archival operation */
+        ++shared_archiverStats->archived_count;
+        strlcpy(shared_archiverStats->last_archived_wal, xlog,
+                sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = GetCurrentTimestamp();
+    }
+}
+
+/* ----------
+ * pgstat_update_bgwriter() -
+ *
+ *        Update bgwriter statistics
+ * ----------
+ */
+void
+pgstat_update_bgwriter(void)
+{
+    /* We assume this initializes to zeroes */
+    static const PgStat_BgWriter all_zeroes;
+
+    PgStat_BgWriter *s = &BgWriterStats;
+
+    /*
+     * This function can be called even if nothing at all has happened. In
+     * this case, skip the update so that we don't take StatsLock just to add
+     * zeroes to the shared counters.
+     */
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
+        return;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += s->timed_checkpoints;
+    shared_globalStats->requested_checkpoints += s->requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += s->checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += s->checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += s->buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += s->buf_written_clean;
+    shared_globalStats->maxwritten_clean += s->maxwritten_clean;
+    shared_globalStats->buf_written_backend += s->buf_written_backend;
+    shared_globalStats->buf_fsync_backend += s->buf_fsync_backend;
+    shared_globalStats->buf_alloc += s->buf_alloc;
+    LWLockRelease(StatsLock);
+
+    /*
+     * Clear out the statistics buffer, so it can be re-used.
+     */
+    MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
+}
+
+/*
+ * Subroutine to reset stats in a shared database entry
+ *
+ * The tables hash is created empty; the functions hash is created on demand.
+ */
+static void
+reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
+{
+    dshash_table *tbl;
+
+    dbentry->n_xact_commit = 0;
+    dbentry->n_xact_rollback = 0;
+    dbentry->n_blocks_fetched = 0;
+    dbentry->n_blocks_hit = 0;
+    dbentry->n_tuples_returned = 0;
+    dbentry->n_tuples_fetched = 0;
+    dbentry->n_tuples_inserted = 0;
+    dbentry->n_tuples_updated = 0;
+    dbentry->n_tuples_deleted = 0;
+    dbentry->last_autovac_time = 0;
+    dbentry->n_conflict_tablespace = 0;
+    dbentry->n_conflict_lock = 0;
+    dbentry->n_conflict_snapshot = 0;
+    dbentry->n_conflict_bufferpin = 0;
+    dbentry->n_conflict_startup_deadlock = 0;
+    dbentry->n_temp_files = 0;
+    dbentry->n_temp_bytes = 0;
+    dbentry->n_deadlocks = 0;
+    dbentry->n_block_read_time = 0;
+    dbentry->n_block_write_time = 0;
+
+    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
+    dbentry->stats_timestamp = 0;
+
+
+    Assert(dbentry->tables == DSM_HANDLE_INVALID);
+    tbl = dshash_create(area, &dsh_tblparams, 0);
+    dbentry->tables = dshash_get_hash_table_handle(tbl);
+    dshash_detach(tbl);
+
+    Assert(dbentry->functions == DSM_HANDLE_INVALID);
+    /* we create function hash as needed */
+
+    dbentry->snapshot_tables = NULL;
+    dbentry->snapshot_functions = NULL;
+}
+
+/*
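+ * Lookup the hash table entry for the specified database. If no entry
+ * exists, create and initialize it when op includes PGSTAT_FETCH_EXCLUSIVE;
+ * else return NULL. With PGSTAT_FETCH_NOWAIT, NULL is also returned when the
+ * lock is not available. The outcome is stored in *status if it is not NULL.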
+ */
+static PgStat_StatDBEntry *
+pgstat_get_db_entry(Oid databaseid, int op,
+                    pg_stat_table_result_status *status)
+{
+    PgStat_StatDBEntry *result;
+    bool        nowait = ((op & PGSTAT_FETCH_NOWAIT) != 0);
+    bool        lock_acquired = true;
+    bool        found = true;
+
+    if (!IsUnderPostmaster)
+        return NULL;
+
+    /* Lookup or create the hash table entry for this database */
+    if (op & PGSTAT_FETCH_EXCLUSIVE)
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_or_insert_extended(db_stats, &databaseid,
+                                           &found, nowait);
+        if (result == NULL)
+            lock_acquired = false;
+        else if (!found)
+        {
+            /*
+             * If not found, initialize the new one.  This creates empty hash
+             * tables for tables and functions, too.
+             */
+            reset_dbentry_counters(result);
+        }
+    }
+    else
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_extended(db_stats, &databaseid, true, nowait,
+                                 &lock_acquired);
+        if (result == NULL)
+            found = false;
+    }
+
+    /* Set return status if requested */
+    if (status)
+    {
+        if (!lock_acquired)
+        {
+            Assert(nowait);
+            *status = PGSTAT_ENTRY_LOCK_FAILED;
+        }
+        else if (!found)
+            *status = PGSTAT_ENTRY_NOT_FOUND;
+        else
+            *status = PGSTAT_ENTRY_FOUND;
+    }
+
+    return result;
+}
+
+/*
+ * Lookup the hash table entry for the specified table. If no hash
+ * table entry exists, initialize it, if the create parameter is true.
+ * Else, return NULL.
+ */
+static PgStat_StatTabEntry *
+pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create)
+{
+    PgStat_StatTabEntry *result;
+    bool        found = true;
+
+    /* Lookup or create the hash table entry for this table */
+    if (create)
+        result = (PgStat_StatTabEntry *)
+            dshash_find_or_insert(table, &tableoid, &found);
+    else
+        result = (PgStat_StatTabEntry *) dshash_find(table, &tableoid, false);
+
+    /* dshash_find() doesn't set "found", so test the result instead */
+    if (result == NULL)
+        return NULL;
+
+    /* If not found, initialize the new one. */
+    if (!found)
+    {
+        result->numscans = 0;
+        result->tuples_returned = 0;
+        result->tuples_fetched = 0;
+        result->tuples_inserted = 0;
+        result->tuples_updated = 0;
+        result->tuples_deleted = 0;
+        result->tuples_hot_updated = 0;
+        result->n_live_tuples = 0;
+        result->n_dead_tuples = 0;
+        result->changes_since_analyze = 0;
+        result->blocks_fetched = 0;
+        result->blocks_hit = 0;
+        result->vacuum_timestamp = 0;
+        result->vacuum_count = 0;
+        result->autovac_vacuum_timestamp = 0;
+        result->autovac_vacuum_count = 0;
+        result->analyze_timestamp = 0;
+        result->analyze_count = 0;
+        result->autovac_analyze_timestamp = 0;
+        result->autovac_analyze_count = 0;
+    }
+
+    return result;
+}
+
+
+/* ----------
+ * pgstat_write_statsfiles() -
+ *        Write the global statistics file, as well as DB files.
+ * ----------
+ */
+void
+pgstat_write_statsfiles(void)
+{
+    dshash_seq_status hstat;
+    PgStat_StatDBEntry *dbentry;
+    FILE       *fpout;
+    int32        format_id;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    int            rc;
+
+    /* should be called from postmaster */
+    Assert(!IsUnderPostmaster);
+
+    elog(DEBUG2, "writing stats file \"%s\"", statfile);
+
+    /*
+     * Open the statistics temp file to write out the current values.
+     */
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        return;
+    }
+
+    /*
+     * Set the timestamp of the stats file.
+     */
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
+
+    /*
+     * Write the file header --- currently just a format ID.
+     */
+    format_id = PGSTAT_FILE_FORMAT_ID;
+    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write global stats struct
+     */
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write archiver stats struct
+     */
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Walk through the database table.
+     */
+    dshash_seq_init(&hstat, db_stats, false, false);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&hstat)) != NULL)
+    {
+        /*
+         * Write out the table and function stats for this DB into the
+         * appropriate per-DB stat file, if required.
+         */
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
+
+        pgstat_write_db_statsfile(dbentry);
+
+        /*
+         * Write out the DB entry. We don't write the tables or functions
+         * pointers, since they're of no use to any other process.
+         */
+        fputc('D', fpout);
+        rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
+        (void) rc;                /* we'll check for error with ferror */
+    }
+
+    /*
+     * No more output to be done. Close the temp file and replace the old
+     * pgstat.stat with it.  The ferror() check replaces testing for error
+     * after each individual fputc or fwrite above.
+     */
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+}
+
+/*
+ * return the filename for a DB stat file; filename is the output buffer,
+ * of length len.
+ */
+static void
+get_dbstat_filename(bool tempname, Oid databaseid,
+                    char *filename, int len)
+{
+    int            printed;
+
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/db_%u.%s",
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
+                       databaseid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
+}
+
+/* ----------
+ * pgstat_write_db_statsfile() -
+ *        Write the stat file for a single database.
+ *
+ *    The data is first written to a temporary file in the permanent stats
+ *    directory and then renamed into place, so that a reader never sees a
+ *    partially written file.
+ * ----------
+ */
+static void
+pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry)
+{
+    dshash_seq_status tstat;
+    dshash_seq_status fstat;
+    PgStat_StatTabEntry *tabentry;
+    PgStat_StatFuncEntry *funcentry;
+    FILE       *fpout;
+    int32        format_id;
+    Oid            dbid = dbentry->databaseid;
+    int            rc;
+    char        tmpfile[MAXPGPATH];
+    char        statfile[MAXPGPATH];
+    dshash_table *tbl;
+
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
+
+    elog(DEBUG2, "writing stats file \"%s\"", statfile);
+
+    /*
+     * Open the statistics temp file to write out the current values.
+     */
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        return;
+    }
+
+    /*
+     * Write the file header --- currently just a format ID.
+     */
+    format_id = PGSTAT_FILE_FORMAT_ID;
+    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Walk through the database's access stats per table.
+     */
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&tstat, tbl, false, false);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&tstat)) != NULL)
+    {
+        fputc('T', fpout);
+        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
+        (void) rc;                /* we'll check for error with ferror */
+    }
+    dshash_detach(tbl);
+
+    /*
+     * Walk through the database's function stats table.
+     */
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_seq_init(&fstat, tbl, false, false);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&fstat)) != NULL)
+        {
+            fputc('F', fpout);
+            rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+            (void) rc;                /* we'll check for error with ferror */
+        }
+        dshash_detach(tbl);
+    }
+
+    /*
+     * No more output to be done. Close the temp file and replace the old
+     * pgstat.stat with it.  The ferror() check replaces testing for error
+     * after each individual fputc or fwrite above.
+     */
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+}
+
+/* ----------
+ * pgstat_read_statsfiles() -
+ *
+ *    Reads the existing statistics collector files into the shared stats
+ *    hash.
+ *
+ * ----------
+ */
+void
+pgstat_read_statsfiles(void)
+{
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatDBEntry dbbuf;
+    FILE       *fpin;
+    int32        format_id;
+    bool        found;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    dshash_table *tblstats = NULL;
+    dshash_table *funcstats = NULL;
+
+    /* should be called from postmaster */
+    Assert(!IsUnderPostmaster);
+
+    /*
+     * local cache lives in pgStatLocalContext.
+     */
+    pgstat_setup_memcxt();
+
+    /*
+     * Create the DB hashtable and global stats area
+     */
+    /* Hold the lock so that no other process sees empty stats */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    pgstat_create_shared_stats();
+
+    /*
+     * Set the current timestamp (will be kept only in case we can't load an
+     * existing statsfile).
+     */
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp = shared_globalStats->stat_reset_timestamp;
+
+    /*
+     * Try to open the stats file. If it doesn't exist, the backends simply
+     * return zero for anything and the collector simply starts from scratch
+     * with empty counters.
+     *
+     * ENOENT is a possibility if the stats collector is not running or has
+     * not yet written the stats file the first time.  Any other failure
+     * condition is suspicious.
+     */
+    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(LOG,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            statfile)));
+        LWLockRelease(StatsLock);
+        return;
+    }
+
+    /*
+     * Verify it's of the expected format.
+     */
+    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
+        format_id != PGSTAT_FILE_FORMAT_ID)
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        goto done;
+    }
+
+    /*
+     * Read global stats struct
+     */
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+        sizeof(*shared_globalStats))
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        memset(shared_globalStats, 0, sizeof(*shared_globalStats));
+        goto done;
+    }
+
+    /*
+     * In the collector, disregard the timestamp we read from the permanent
+     * stats file; we should be willing to write a temp stats file immediately
+     * upon the first request from any backend.  This only matters if the old
+     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
+     * an unusual scenario.
+     */
+    shared_globalStats->stats_timestamp = 0;
+
+    /*
+     * Read archiver stats struct
+     */
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+        sizeof(*shared_archiverStats))
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        goto done;
+    }
+
+    /*
+     * We found an existing collector stats file. Read it and put all the
+     * hashtable entries into place.
+     */
+    for (;;)
+    {
+        switch (fgetc(fpin))
+        {
+                /*
+                 * 'D'    A PgStat_StatDBEntry struct describing a database
+                 * follows.
+                 */
+            case 'D':
+                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
+                          fpin) != offsetof(PgStat_StatDBEntry, tables))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                /*
+                 * Add to the DB hash
+                 */
+                dbentry = (PgStat_StatDBEntry *)
+                    dshash_find_or_insert(db_stats, (void *) &dbbuf.databaseid,
+                                          &found);
+                if (found)
+                {
+                    dshash_release_lock(db_stats, dbentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
+                dbentry->tables = DSM_HANDLE_INVALID;
+                dbentry->functions = DSM_HANDLE_INVALID;
+                dbentry->snapshot_tables = NULL;
+                dbentry->snapshot_functions = NULL;
+
+                /*
+                 * In the collector, disregard the timestamp we read from the
+                 * permanent stats file; we should be willing to write a temp
+                 * stats file immediately upon the first request from any
+                 * backend.
+                 */
+                dbentry->stats_timestamp = 0;
+
+                /*
+                 * If requested, read the data from the database-specific
+                 * file.  Otherwise we just leave the hashtables empty.
+                 */
+                tblstats = dshash_create(area, &dsh_tblparams, 0);
+                dbentry->tables = dshash_get_hash_table_handle(tblstats);
+                /* we don't create the function hash at present */
+                dshash_release_lock(db_stats, dbentry);
+                pgstat_read_db_statsfile(dbbuf.databaseid,
+                                         tblstats, funcstats);
+                dshash_detach(tblstats);
+                break;
+
+            case 'E':
+                goto done;
+
+            default:
+                ereport(LOG,
+                        (errmsg("corrupted statistics file \"%s\"",
+                                statfile)));
+                goto done;
+        }
+    }
+
+done:
+    LWLockRelease(StatsLock);
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+
+    return;
+}
+
+
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+void
+StatsShmemInit(void)
+{
+    bool    found;
+
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+
+        /* Load saved data if any */
+        pgstat_read_statsfiles();
+
+        /* needs to be called before dsm shutdown */
+        before_shmem_exit(pgstat_postmaster_shutdown, (Datum) 0);
+    }
+}
+
+static void
+pgstat_postmaster_shutdown(int code, Datum arg)
+{
+    /* we trash the stats on crash */
+    if (code == 0)
+        pgstat_write_statsfiles();
+}
+
+/* ----------
+ * pgstat_read_db_statsfile() -
+ *
+ *    Reads in the permanent statistics collector file and creates shared
+ *    statistics tables. The file is removed after reading.
+ * ----------
+ */
+static void
+pgstat_read_db_statsfile(Oid databaseid,
+                         dshash_table *tabhash, dshash_table *funchash)
+{
+    PgStat_StatTabEntry *tabentry;
+    PgStat_StatTabEntry tabbuf;
+    PgStat_StatFuncEntry funcbuf;
+    PgStat_StatFuncEntry *funcentry;
+    FILE       *fpin;
+    int32        format_id;
+    bool        found;
+    char        statfile[MAXPGPATH];
+
+    /* should be called from postmaster  */
+    Assert(!IsUnderPostmaster);
+
+    get_dbstat_filename(false, databaseid, statfile, MAXPGPATH);
+
+    /*
+     * Try to open the stats file. If it doesn't exist, the backends simply
+     * return zero for anything and the collector simply starts from scratch
+     * with empty counters.
+     *
+     * ENOENT is a possibility if the stats collector is not running or has
+     * not yet written the stats file the first time.  Any other failure
+     * condition is suspicious.
+     */
+    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(LOG,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            statfile)));
+        return;
+    }
+
+    /*
+     * Verify it's of the expected format.
+     */
+    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
+        format_id != PGSTAT_FILE_FORMAT_ID)
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        goto done;
+    }
+
+    /*
+     * We found an existing collector stats file. Read it and put all the
+     * hashtable entries into place.
+     */
+    for (;;)
+    {
+        switch (fgetc(fpin))
+        {
+                /*
+                 * 'T'    A PgStat_StatTabEntry follows.
+                 */
+            case 'T':
+                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
+                          fpin) != sizeof(PgStat_StatTabEntry))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                /*
+                 * Skip if table data not wanted.
+                 */
+                if (tabhash == NULL)
+                    break;
+
+                tabentry = (PgStat_StatTabEntry *)
+                    dshash_find_or_insert(tabhash,
+                                          (void *) &tabbuf.tableid, &found);
+
+                if (found)
+                {
+                    dshash_release_lock(tabhash, tabentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
+                dshash_release_lock(tabhash, tabentry);
+                break;
+
+                /*
+                 * 'F'    A PgStat_StatFuncEntry follows.
+                 */
+            case 'F':
+                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
+                          fpin) != sizeof(PgStat_StatFuncEntry))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                /*
+                 * Skip if function data not wanted.
+                 */
+                if (funchash == NULL)
+                    break;
+
+                funcentry = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert(funchash,
+                                          (void *) &funcbuf.functionid, &found);
+
+                if (found)
+                {
+                    dshash_release_lock(funchash, funcentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
+                dshash_release_lock(funchash, funcentry);
+                break;
+
+                /*
+                 * 'E'    The EOF marker of a complete stats file.
+                 */
+            case 'E':
+                goto done;
+
+            default:
+                ereport(LOG,
+                        (errmsg("corrupted statistics file \"%s\"",
+                                statfile)));
+                goto done;
+        }
+    }
+
+done:
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+}
+
+/* ----------
+ * pgstat_clear_snapshot: clear the local cache so that new snapshots will be
+ * read.
+ * ----------
+ */
+void
+pgstat_clear_snapshot(void)
+{
+    Assert(pgStatLocalContext);
+    MemoryContextReset(pgStatLocalContext);
+
+    /* mark the resources as not allocated */
+    snapshot_globalStats = NULL;
+    snapshot_archiverStats = NULL;
+    snapshot_db_stats = NULL;
+}
+
+/*
+ * create_local_stats_hash() -
+ *
+ * Creates a dynahash used for table/function stats cache.
+ */
+static HTAB *
+create_local_stats_hash(const char *name, size_t keysize, size_t entrysize,
+                        int nentries)
+{
+    HTAB *result;
+    HASHCTL ctl;
+
+    /* Create the hash in the stats context */
+    ctl.keysize        = keysize;
+    ctl.entrysize    = entrysize;
+    ctl.hcxt        = pgStatLocalContext;
+    result = hash_create(name, nentries, &ctl,
+                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    return result;
+}
+
+/*
+ * snapshot_statentry() - Find an entry in the source dshash.
+ *
+ * Returns the entry for key, or NULL if not found. If dest is not NULL, uses
+ * *dest as a local cache, creating it with the same key and entry sizes as
+ * the given dshash when *dest is NULL. The result, found or not, is cached
+ * there and the same answer is returned to later calls for the same key.
+ *
+ * Otherwise the returned entry is a copy palloc'ed in the caller's memory
+ * context. Its content may differ for every request.
+ *
+ * If dshash is NULL, temporarily attaches to dsh_handle instead.
+ */
+static void *
+snapshot_statentry(HTAB **dest, const char *hashname,
+                   dshash_table *dshash, dshash_table_handle dsh_handle,
+                   const dshash_parameters *dsh_params, Oid key)
+{
+    char *lentry = NULL;
+    size_t keysize = dsh_params->key_size;
+    size_t entrysize = dsh_params->entry_size;
+
+    if (dest)
+    {
+        /* caches the result entry */
+        bool found;
+        bool *negative;
+
+        /*
+         * Create new hash with arbitrary initial entries since we don't know
+         * how this hash will grow.
+         */
+        if (!*dest)
+        {
+            /* make room for negative flag at the end of entry */
+            *dest = create_local_stats_hash(hashname, keysize,
+                                            entrysize + sizeof(bool), 32);
+        }
+
+        lentry = hash_search(*dest, &key, HASH_ENTER, &found);
+
+        /* negative flag is placed at the end of the entry */
+        negative = (bool *) (lentry + entrysize);
+
+        if (!found)
+        {
+            /* not found in local cache, search shared hash */
+
+            dshash_table *t = dshash;
+            void *sentry;
+
+            /* attach shared hash if not given */
+            if (!t)
+                t = dshash_attach(area, dsh_params, dsh_handle, NULL);
+
+            sentry = dshash_find(t, &key, false);
+
+            /*
+             * We expect that the stats for the specified database exist in
+             * most cases.
+             */
+
+            if (sentry)
+            {
+                memcpy(lentry, sentry, entrysize);
+                dshash_release_lock(t, sentry);
+            }
+
+            *negative = !sentry;
+
+            /* Release it if we attached it here */
+            if (!dshash)
+                dshash_detach(t);
+
+            if (!sentry)
+                return NULL;
+        }
+
+        if (*negative)
+            lentry = NULL;
+    }
+    else
+    {
+        /*
+         * The caller doesn't want caching. Just make a copy of the entry and
+         * return it.
+         */
+        dshash_table *t = dshash;
+        void *sentry;
+
+        if (!t)
+            t = dshash_attach(area, dsh_params, dsh_handle, NULL);
+
+        sentry = dshash_find(t, &key, false);
+        if (sentry)
+        {
+            lentry = palloc(entrysize);
+            memcpy(lentry, sentry, entrysize);
+            dshash_release_lock(t, sentry);
+        }
+
+        if (!dshash)
+            dshash_detach(t);
+    }
+
+    return (void *) lentry;
+}
+
+/*
+ * backend_snapshot_global_stats() -
+ *
+ * Makes a local copy of global stats if not already done.  They will be kept
+ * until pgstat_clear_snapshot() is called or the end of the current memory
+ * context (typically TopTransactionContext).  Returns false if the shared
+ * stats have not been created yet.
+ */
+static bool
+backend_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext = CurrentMemoryContext;
+    TimestampTz update_time = 0;
+
+
+    /*
+     * This is the first call in a transaction. If we find the shared stats
+     * updated, throw away the cache.
+     */
+    if (IsTransactionState() && first_in_xact)
+    {
+        first_in_xact = false;
+        LWLockAcquire(StatsLock, LW_SHARED);
+        update_time = StatsShmem->last_update;
+        LWLockRelease(StatsLock);
+
+        if (backend_cache_expire < update_time)
+        {
+            pgstat_clear_snapshot();
+
+            /*
+             * Shared stats are updated frequently when many backends are
+             * running, but we don't want the cached stats to expire so
+             * frequently. Keep them for at least the minimum stats update
+             * interval of a backend.
+             */
+            backend_cache_expire =
+                update_time + PGSTAT_STAT_MIN_INTERVAL * USECS_PER_SEC / 1000;
+        }
+    }
+
+    /* Nothing to do if already done */
+    if (snapshot_globalStats)
+        return true;
+
+    Assert(snapshot_archiverStats == NULL);
+
+    /*
+     * The snapshot lives within the current top transaction if any, or the
+     * current memory context lifetime otherwise.
+     */
+    if (IsTransactionState())
+        oldcontext = MemoryContextSwitchTo(pgStatLocalContext);
+
+    /* global stats can be just copied  */
+    LWLockAcquire(StatsLock, LW_SHARED);
+    snapshot_globalStats = palloc(sizeof(PgStat_GlobalStats));
+    memcpy(snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+
+    snapshot_archiverStats = palloc(sizeof(PgStat_ArchiverStats));
+    memcpy(snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+    LWLockRelease(StatsLock);
+
+    /* set the timestamp of this snapshot */
+    snapshot_globalStats->stats_timestamp = update_time;
+
+    MemoryContextSwitchTo(oldcontext);
+
+    return true;
+}
+
+/* ----------
+ * pgstat_fetch_stat_dbentry() -
+ *
+ *    Find database stats entry on backends. The returned entries are cached
+ *    until transaction end. If oneshot is true, they are not cached but returned
+ *    in a palloc'ed memory in caller's context.
+ */
+PgStat_StatDBEntry *
+pgstat_fetch_stat_dbentry(Oid dbid, bool oneshot)
+{
+    /* take a local snapshot if we don't have one */
+    char *hashname = "local database stats hash";
+    PgStat_StatDBEntry *dbentry;
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    /* If not done for this transaction, take a snapshot of global stats */
+    if (!backend_snapshot_global_stats())
+        return NULL;
+
+    dbentry = snapshot_statentry(oneshot ? NULL : &snapshot_db_stats,
+                                 hashname, db_stats, 0, &dsh_dbparams,
+                                 dbid);
+
+    return dbentry;
+}
+
+/* ----------
+ * backend_get_tab_entry() -
+ *
+ *    Find table stats entry on backends. The returned entries are cached until
+ *    transaction end. If oneshot is true, they are not cached but returned in a
+ *    palloc'ed memory in caller's context.
+ */
+PgStat_StatTabEntry *
+backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid reloid, bool oneshot)
+{
+    /* take a local snapshot if we don't have one */
+    char *hashname = "local table stats hash";
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    return snapshot_statentry(oneshot ? NULL : &dbent->snapshot_tables,
+                              hashname, NULL, dbent->tables, &dsh_tblparams,
+                              reloid);
+}
+
+/* ----------
+ * backend_get_func_entry() -
+ *
+ *    Find function stats entry on backends. The returned entries are cached
+ *    until transaction end. If oneshot is true, they are not cached but returned
+ *    in a palloc'ed memory in caller's context.
+ */
+static PgStat_StatFuncEntry *
+backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot)
+{
+    char *hashname = "local function stats hash";
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    if (dbent->functions == DSM_HANDLE_INVALID)
+        return NULL;
+
+    return snapshot_statentry(oneshot ? NULL : &dbent->snapshot_functions,
+                              hashname, NULL, dbent->functions, &dsh_funcparams,
+                              funcid);
+}
+
+static bool
+pgstat_update_tabentry(dshash_table *tabhash, PgStat_TableStatus *stat,
+                       bool nowait)
+{
+    PgStat_StatTabEntry *tabentry;
+    bool    found;
+
+    if (tabhash == NULL)
+        return false;
+
+    tabentry = (PgStat_StatTabEntry *)
+        dshash_find_or_insert_extended(tabhash, (void *) &(stat->t_id),
+                                       &found, nowait);
+
+    /* failed to acquire lock */
+    if (tabentry == NULL)
+        return false;
+
+    if (!found)
+    {
+        /*
+         * If it's a new table entry, initialize counters to the values we
+         * just got.
+         */
+        tabentry->numscans = stat->t_counts.t_numscans;
+        tabentry->tuples_returned = stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched = stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted = stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated = stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted = stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated = stat->t_counts.t_tuples_hot_updated;
+        tabentry->n_live_tuples = stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples = stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze = stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched = stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit = stat->t_counts.t_blocks_hit;
+
+        tabentry->vacuum_timestamp = 0;
+        tabentry->vacuum_count = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_count = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->analyze_count = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+        tabentry->autovac_analyze_count = 0;
+    }
+    else
+    {
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        tabentry->numscans += stat->t_counts.t_numscans;
+        tabentry->tuples_returned += stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched += stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted += stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated += stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted += stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated += stat->t_counts.t_tuples_hot_updated;
+        /* If table was truncated, first reset the live/dead counters */
+        if (stat->t_counts.t_truncated)
+        {
+            tabentry->n_live_tuples = 0;
+            tabentry->n_dead_tuples = 0;
+        }
+        tabentry->n_live_tuples += stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples += stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze += stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched += stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit += stat->t_counts.t_blocks_hit;
+    }
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
+
+    dshash_release_lock(tabhash, tabentry);
+
+    return true;
+}
+
+static void
+pgstat_update_dbentry(PgStat_StatDBEntry *dbentry, PgStat_TableStatus *stat)
+{
+    /*
+     * Add per-table stats to the per-database entry, too.
+     */
+    dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
+    dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
+    dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
+    dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
+    dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
+    dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
+    dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
+}
+
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ *    Create pgStatLocalContext, if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+    if (!pgStatLocalContext)
+        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
+                                                   "Activity statistics snapshot",
+                                                   ALLOCSET_SMALL_SIZES);
+}
+
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 273e2f385f..cb11dc6ffb 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -34,6 +34,7 @@
 #include <unistd.h>
 
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/storage.h"
 #include "executor/instrument.h"
@@ -1990,7 +1991,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                BgWriterStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2098,7 +2099,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2288,7 +2289,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2296,7 +2297,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
 
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index c2c445dbf4..0bb2132c71 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -41,9 +41,9 @@
 
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "executor/instrument.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/buffile.h"
 #include "storage/buf_internals.h"
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index 1f766d20d1..a0401ee494 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -22,10 +22,10 @@
 #include <unistd.h>
 #include <sys/stat.h>
 
+#include "bestatus.h"
 #include "storage/copydir.h"
 #include "storage/fd.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 
 /*
  * copydir: copy a directory
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 213de7698a..6bc5fd6089 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -82,6 +82,7 @@
 #include "miscadmin.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/pg_tablespace.h"
 #include "common/file_perm.h"
 #include "pgstat.h"
diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index cab7ae74ca..c7c248878a 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -197,6 +197,15 @@ dsm_postmaster_startup(PGShmemHeader *shim)
     dsm_control->maxitems = maxitems;
 }
 
+/*
+ * clear dsm_init state on child start.
+ */
+void
+dsm_child_init(void)
+{
+    dsm_init_done = false;
+}
+
 /*
  * Determine whether the control segment from the previous postmaster
  * invocation still exists.  If so, remove the dynamic shared memory
@@ -423,6 +432,15 @@ dsm_set_control_handle(dsm_handle h)
 }
 #endif
 
+/*
+ * Return whether the dsm feature is available in this process.
+ */
+bool
+dsm_is_available(void)
+{
+    return dsm_control != NULL;
+}
+
 /*
  * Create a new dynamic shared memory segment.
  *
@@ -440,8 +458,7 @@ dsm_create(Size size, int flags)
     uint32        i;
     uint32        nitems;
 
-    /* Unsafe in postmaster (and pointless in a stand-alone backend). */
-    Assert(IsUnderPostmaster);
+    Assert(dsm_is_available());
 
     if (!dsm_init_done)
         dsm_backend_startup();
@@ -537,8 +554,7 @@ dsm_attach(dsm_handle h)
     uint32        i;
     uint32        nitems;
 
-    /* Unsafe in postmaster (and pointless in a stand-alone backend). */
-    Assert(IsUnderPostmaster);
+    Assert(dsm_is_available());
 
     if (!dsm_init_done)
         dsm_backend_startup();
diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
index aeda32c9c5..e84275d4c2 100644
--- a/src/backend/storage/ipc/dsm_impl.c
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -61,8 +61,8 @@
 #ifdef HAVE_SYS_SHM_H
 #include <sys/shm.h>
 #endif
+#include "bestatus.h"
 #include "common/file_perm.h"
-#include "pgstat.h"
 
 #include "portability/mem.h"
 #include "storage/dsm_impl.h"
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2849e47d99..6417559cb0 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -148,6 +148,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -279,8 +280,13 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 
     /* Initialize dynamic shared memory facilities. */
     if (!IsUnderPostmaster)
+    {
         dsm_postmaster_startup(shim);
 
+        /* Stats collector uses dynamic shared memory  */
+        StatsShmemInit();
+    }
+
     /*
      * Now give loadable modules a chance to set up their shmem allocations
      */
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 7da337d11f..97526f1c72 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -43,8 +43,8 @@
 #include <poll.h>
 #endif
 
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "port/atomics.h"
 #include "portability/instr_time.h"
 #include "postmaster/postmaster.h"
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 43110e57b6..9d88d8c023 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -51,9 +51,9 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/spin.h"
diff --git a/src/backend/storage/ipc/shm_mq.c b/src/backend/storage/ipc/shm_mq.c
index 6e471c3e43..cfa5c9089f 100644
--- a/src/backend/storage/ipc/shm_mq.c
+++ b/src/backend/storage/ipc/shm_mq.c
@@ -18,8 +18,8 @@
 
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/bgworker.h"
 #include "storage/procsignal.h"
 #include "storage/shm_mq.h"
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 4d10e57a80..243da57c49 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -21,8 +21,8 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index 74eb449060..dd76088a29 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -25,6 +25,7 @@
  */
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "pgstat.h"
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 81dac45ae5..2cd4d5531e 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -76,8 +76,8 @@
  */
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "pg_trace.h"
 #include "postmaster/postmaster.h"
 #include "replication/slot.h"
@@ -521,6 +521,9 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_TBM, "tbm");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
+    LWLockRegisterTranche(LWTRANCHE_STATS_DSA, "stats table dsa");
+    LWLockRegisterTranche(LWTRANCHE_STATS_DB, "db stats");
+    LWLockRegisterTranche(LWTRANCHE_STATS_FUNC_TABLE, "table/func stats");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index db47843229..97eccb35d3 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -49,3 +49,4 @@ MultiXactTruncationLock                41
 OldSnapshotTimeMapLock                42
 LogicalRepWorkerLock                43
 CLogTruncationLock                    44
+StatsLock                            45
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index a962034753..718232ae18 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -193,8 +193,8 @@
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
 #include "storage/predicate_internals.h"
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 89c80fb687..c8198d7311 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -38,8 +38,8 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "replication/slot.h"
 #include "replication/syncrep.h"
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index c37dd1290b..a09d4f5313 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -28,7 +28,7 @@
 #include "miscadmin.h"
 #include "access/xlogutils.h"
 #include "access/xlog.h"
-#include "pgstat.h"
+#include "bestatus.h"
 #include "portability/instr_time.h"
 #include "postmaster/bgwriter.h"
 #include "storage/fd.h"
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 0c0891b33e..acbbef36a5 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -39,6 +39,7 @@
 #include "access/parallel.h"
 #include "access/printtup.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
@@ -3159,6 +3160,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_update_stat(true);
+    }
 }
 
 
@@ -3733,6 +3740,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4173,9 +4181,17 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
-                ProcessCompletedNotifies();
-                pgstat_report_stat(false);
+                long stats_timeout;
 
+                ProcessCompletedNotifies();
+
+                stats_timeout = pgstat_update_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4210,7 +4226,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4218,6 +4234,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index f4d3eab2ea..0e3abeba36 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -21,6 +21,7 @@
 
 #include "access/heapam.h"
 #include "access/sysattr.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/pg_tablespace.h"
 #include "catalog/pg_type.h"
@@ -29,7 +30,6 @@
 #include "common/keywords.h"
 #include "funcapi.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "parser/scansup.h"
 #include "postmaster/syslogger.h"
 #include "rewrite/rewriteHandler.h"
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 053bb73863..6054581fe4 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
@@ -33,7 +34,7 @@
 #define UINT32_ACCESS_ONCE(var)         ((uint32)(*((volatile uint32 *)&(var))))
 
 /* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
+extern PgStat_BgWriter bgwriterStats;
 
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
@@ -1176,7 +1177,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_xact_commit);
@@ -1192,7 +1193,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_xact_rollback);
@@ -1208,7 +1209,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_blocks_fetched);
@@ -1224,7 +1225,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_blocks_hit);
@@ -1240,7 +1241,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_returned);
@@ -1256,7 +1257,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_fetched);
@@ -1272,7 +1273,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_inserted);
@@ -1288,7 +1289,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_updated);
@@ -1304,7 +1305,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_deleted);
@@ -1319,7 +1320,7 @@ pg_stat_get_db_stat_reset_time(PG_FUNCTION_ARGS)
     TimestampTz result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = dbentry->stat_reset_timestamp;
@@ -1337,7 +1338,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = dbentry->n_temp_files;
@@ -1353,7 +1354,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = dbentry->n_temp_bytes;
@@ -1368,7 +1369,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_tablespace);
@@ -1383,7 +1384,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_lock);
@@ -1398,7 +1399,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_snapshot);
@@ -1413,7 +1414,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_bufferpin);
@@ -1428,7 +1429,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_startup_deadlock);
@@ -1443,7 +1444,7 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (
@@ -1463,7 +1464,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_deadlocks);
@@ -1479,7 +1480,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     PgStat_StatDBEntry *dbentry;
 
     /* convert counter from microsec to millisec for display */
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = ((double) dbentry->n_block_read_time) / 1000.0;
@@ -1495,7 +1496,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     PgStat_StatDBEntry *dbentry;
 
     /* convert counter from microsec to millisec for display */
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = ((double) dbentry->n_block_write_time) / 1000.0;
@@ -1850,6 +1851,9 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     /* Get statistics about the archiver process */
     archiver_stats = pgstat_fetch_stat_archiver();
 
+    if (archiver_stats == NULL)
+        PG_RETURN_NULL();
+
     /* Fill values and NULLs */
     values[0] = Int64GetDatum(archiver_stats->archived_count);
     if (*(archiver_stats->last_archived_wal) == '\0')
@@ -1879,6 +1883,5 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
         values[6] = TimestampTzGetDatum(archiver_stats->stat_reset_timestamp);
 
     /* Returns the record as Datum */
-    PG_RETURN_DATUM(HeapTupleGetDatum(
-                                      heap_form_tuple(tupdesc, values, nulls)));
+    PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
 }
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 5e61d908fd..2dd99f935d 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -46,11 +46,11 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/pg_tablespace.h"
 #include "catalog/storage.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/lwlock.h"
 #include "utils/inval.h"
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index fd51934aaf..994351ac2d 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index bd2e4e89d8..1eabc0f41d 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -31,12 +31,12 @@
 #endif
 
 #include "access/htup_details.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "common/file_perm.h"
 #include "libpq/libpq.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/postmaster.h"
 #include "storage/fd.h"
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 7415c4faab..e07ca89065 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -26,6 +26,7 @@
 #include "access/sysattr.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/indexing.h"
 #include "catalog/namespace.h"
@@ -73,6 +74,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -629,6 +631,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -686,7 +690,10 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 
     /* Initialize stats collection --- must happen before first xact */
     if (!bootstrap)
+    {
+        pgstat_bearray_initialize();
         pgstat_initialize();
+    }
 
     /*
      * Load relcache entries for the shared system catalogs.  This must create
@@ -1240,6 +1247,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
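
Note on the hunks above: the idle stats update timeout follows the usual deferred-interrupt pattern. The timeout handler only sets flags, and the actual work (pgstat_update_stat(true)) runs later from ProcessInterrupts() at a safe point. A minimal standalone sketch of that pattern follows, assuming a POSIX system. Every name in it (idle_stats_update_pending, flush_stats, process_interrupts, install_idle_stats_timeout) is a hypothetical stand-in for illustration and not PostgreSQL code, and SIGALRM is used directly instead of the timeout.c machinery.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <signal.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for IdleStatsUpdateTimeoutPending / InterruptPending. */
static volatile sig_atomic_t idle_stats_update_pending = 0;
static volatile sig_atomic_t interrupt_pending = 0;

/* Counts how often the (pretend) stats flush has run. */
static int stats_flush_count = 0;

/*
 * Signal handler: set flags only.  Nothing that takes locks or allocates
 * memory may run here, which is why the real work is deferred.
 */
static void
idle_stats_timeout_handler(int signo)
{
    (void) signo;
    idle_stats_update_pending = 1;
    interrupt_pending = 1;
}

/* Stand-in for pgstat_update_stat(true): pretend to flush pending stats. */
static void
flush_stats(void)
{
    stats_flush_count++;
}

/* Called from a safe point in the main loop, in the role of ProcessInterrupts(). */
static void
process_interrupts(void)
{
    if (!interrupt_pending)
        return;
    interrupt_pending = 0;

    if (idle_stats_update_pending)
    {
        idle_stats_update_pending = 0;
        flush_stats();
    }
}

/* Install the handler for SIGALRM; returns true on success. */
static bool
install_idle_stats_timeout(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = idle_stats_timeout_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGALRM, &sa, NULL) == 0;
}
```

In this sketch, delivering SIGALRM (for example with raise(SIGALRM) or alarm()) changes nothing visible until the next call to process_interrupts(), which then flushes exactly once. This mirrors how the patch clears IdleStatsUpdateTimeoutPending before calling pgstat_update_stat(true).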
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c216ed0922..7fe54b0669 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -33,6 +33,7 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
 #include "commands/async.h"
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 3e1c3863c4..25b3b2a079 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/bestatus.h b/src/include/bestatus.h
new file mode 100644
index 0000000000..3f44595198
--- /dev/null
+++ b/src/include/bestatus.h
@@ -0,0 +1,545 @@
+/* ----------
+ *    bestatus.h
+ *
+ *    Definitions for the PostgreSQL backend status monitor facility
+ *
+ *    Copyright (c) 2001-2018, PostgreSQL Global Development Group
+ *
+ *    src/include/bestatus.h
+ * ----------
+ */
+#ifndef BESTATUS_H
+#define BESTATUS_H
+
+#include "datatype/timestamp.h"
+#include "libpq/pqcomm.h"
+#include "storage/proc.h"
+
+/* ----------
+ * Backend types
+ * ----------
+ */
+typedef enum BackendType
+{
+    B_AUTOVAC_LAUNCHER,
+    B_AUTOVAC_WORKER,
+    B_BACKEND,
+    B_BG_WORKER,
+    B_BG_WRITER,
+    B_CHECKPOINTER,
+    B_STARTUP,
+    B_WAL_RECEIVER,
+    B_WAL_SENDER,
+    B_WAL_WRITER,
+    B_ARCHIVER
+} BackendType;
+
+
+/* ----------
+ * Backend states
+ * ----------
+ */
+typedef enum BackendState
+{
+    STATE_UNDEFINED,
+    STATE_IDLE,
+    STATE_RUNNING,
+    STATE_IDLEINTRANSACTION,
+    STATE_FASTPATH,
+    STATE_IDLEINTRANSACTION_ABORTED,
+    STATE_DISABLED
+} BackendState;
+
+
+/* ----------
+ * Wait Classes
+ * ----------
+ */
+#define PG_WAIT_LWLOCK                0x01000000U
+#define PG_WAIT_LOCK                0x03000000U
+#define PG_WAIT_BUFFER_PIN            0x04000000U
+#define PG_WAIT_ACTIVITY            0x05000000U
+#define PG_WAIT_CLIENT                0x06000000U
+#define PG_WAIT_EXTENSION            0x07000000U
+#define PG_WAIT_IPC                    0x08000000U
+#define PG_WAIT_TIMEOUT                0x09000000U
+#define PG_WAIT_IO                    0x0A000000U
+
+/* ----------
+ * Wait Events - Activity
+ *
+ * Use this category when a process is waiting because it has no work to do,
+ * unless the "Client" or "Timeout" category describes the situation better.
+ * Typically, this should only be used for background processes.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_ARCHIVER_MAIN = PG_WAIT_ACTIVITY,
+    WAIT_EVENT_AUTOVACUUM_MAIN,
+    WAIT_EVENT_BGWRITER_HIBERNATE,
+    WAIT_EVENT_BGWRITER_MAIN,
+    WAIT_EVENT_CHECKPOINTER_MAIN,
+    WAIT_EVENT_LOGICAL_APPLY_MAIN,
+    WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
+    WAIT_EVENT_RECOVERY_WAL_ALL,
+    WAIT_EVENT_RECOVERY_WAL_STREAM,
+    WAIT_EVENT_SYSLOGGER_MAIN,
+    WAIT_EVENT_WAL_RECEIVER_MAIN,
+    WAIT_EVENT_WAL_SENDER_MAIN,
+    WAIT_EVENT_WAL_WRITER_MAIN
+} WaitEventActivity;
+
+/* ----------
+ * Wait Events - Client
+ *
+ * Use this category when a process is waiting to send data to or receive data
+ * from the frontend process to which it is connected.  This is never used for
+ * a background process, which has no client connection.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_CLIENT_READ = PG_WAIT_CLIENT,
+    WAIT_EVENT_CLIENT_WRITE,
+    WAIT_EVENT_LIBPQWALRECEIVER_CONNECT,
+    WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE,
+    WAIT_EVENT_SSL_OPEN_SERVER,
+    WAIT_EVENT_WAL_RECEIVER_WAIT_START,
+    WAIT_EVENT_WAL_SENDER_WAIT_WAL,
+    WAIT_EVENT_WAL_SENDER_WRITE_DATA
+} WaitEventClient;
+
+/* ----------
+ * Wait Events - IPC
+ *
+ * Use this category when a process cannot complete the work it is doing because
+ * it is waiting for a notification from another process.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_BGWORKER_SHUTDOWN = PG_WAIT_IPC,
+    WAIT_EVENT_BGWORKER_STARTUP,
+    WAIT_EVENT_BTREE_PAGE,
+    WAIT_EVENT_CLOG_GROUP_UPDATE,
+    WAIT_EVENT_EXECUTE_GATHER,
+    WAIT_EVENT_HASH_BATCH_ALLOCATING,
+    WAIT_EVENT_HASH_BATCH_ELECTING,
+    WAIT_EVENT_HASH_BATCH_LOADING,
+    WAIT_EVENT_HASH_BUILD_ALLOCATING,
+    WAIT_EVENT_HASH_BUILD_ELECTING,
+    WAIT_EVENT_HASH_BUILD_HASHING_INNER,
+    WAIT_EVENT_HASH_BUILD_HASHING_OUTER,
+    WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING,
+    WAIT_EVENT_HASH_GROW_BATCHES_DECIDING,
+    WAIT_EVENT_HASH_GROW_BATCHES_ELECTING,
+    WAIT_EVENT_HASH_GROW_BATCHES_FINISHING,
+    WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING,
+    WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING,
+    WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING,
+    WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING,
+    WAIT_EVENT_LOGICAL_SYNC_DATA,
+    WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
+    WAIT_EVENT_MQ_INTERNAL,
+    WAIT_EVENT_MQ_PUT_MESSAGE,
+    WAIT_EVENT_MQ_RECEIVE,
+    WAIT_EVENT_MQ_SEND,
+    WAIT_EVENT_PARALLEL_BITMAP_SCAN,
+    WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN,
+    WAIT_EVENT_PARALLEL_FINISH,
+    WAIT_EVENT_PROCARRAY_GROUP_UPDATE,
+    WAIT_EVENT_PROMOTE,
+    WAIT_EVENT_REPLICATION_ORIGIN_DROP,
+    WAIT_EVENT_REPLICATION_SLOT_DROP,
+    WAIT_EVENT_SAFE_SNAPSHOT,
+    WAIT_EVENT_SYNC_REP
+} WaitEventIPC;
+
+/* ----------
+ * Wait Events - Timeout
+ *
+ * Use this category when a process is waiting for a timeout to expire.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
+    WAIT_EVENT_PG_SLEEP,
+    WAIT_EVENT_RECOVERY_APPLY_DELAY
+} WaitEventTimeout;
+
+/* ----------
+ * Wait Events - IO
+ *
+ * Use this category when a process is waiting for I/O.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_BUFFILE_READ = PG_WAIT_IO,
+    WAIT_EVENT_BUFFILE_WRITE,
+    WAIT_EVENT_CONTROL_FILE_READ,
+    WAIT_EVENT_CONTROL_FILE_SYNC,
+    WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
+    WAIT_EVENT_CONTROL_FILE_WRITE,
+    WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE,
+    WAIT_EVENT_COPY_FILE_READ,
+    WAIT_EVENT_COPY_FILE_WRITE,
+    WAIT_EVENT_DATA_FILE_EXTEND,
+    WAIT_EVENT_DATA_FILE_FLUSH,
+    WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC,
+    WAIT_EVENT_DATA_FILE_PREFETCH,
+    WAIT_EVENT_DATA_FILE_READ,
+    WAIT_EVENT_DATA_FILE_SYNC,
+    WAIT_EVENT_DATA_FILE_TRUNCATE,
+    WAIT_EVENT_DATA_FILE_WRITE,
+    WAIT_EVENT_DSM_FILL_ZERO_WRITE,
+    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ,
+    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC,
+    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE,
+    WAIT_EVENT_LOCK_FILE_CREATE_READ,
+    WAIT_EVENT_LOCK_FILE_CREATE_SYNC,
+    WAIT_EVENT_LOCK_FILE_CREATE_WRITE,
+    WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ,
+    WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC,
+    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC,
+    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE,
+    WAIT_EVENT_LOGICAL_REWRITE_SYNC,
+    WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE,
+    WAIT_EVENT_LOGICAL_REWRITE_WRITE,
+    WAIT_EVENT_RELATION_MAP_READ,
+    WAIT_EVENT_RELATION_MAP_SYNC,
+    WAIT_EVENT_RELATION_MAP_WRITE,
+    WAIT_EVENT_REORDER_BUFFER_READ,
+    WAIT_EVENT_REORDER_BUFFER_WRITE,
+    WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ,
+    WAIT_EVENT_REPLICATION_SLOT_READ,
+    WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
+    WAIT_EVENT_REPLICATION_SLOT_SYNC,
+    WAIT_EVENT_REPLICATION_SLOT_WRITE,
+    WAIT_EVENT_SLRU_FLUSH_SYNC,
+    WAIT_EVENT_SLRU_READ,
+    WAIT_EVENT_SLRU_SYNC,
+    WAIT_EVENT_SLRU_WRITE,
+    WAIT_EVENT_SNAPBUILD_READ,
+    WAIT_EVENT_SNAPBUILD_SYNC,
+    WAIT_EVENT_SNAPBUILD_WRITE,
+    WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC,
+    WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE,
+    WAIT_EVENT_TIMELINE_HISTORY_READ,
+    WAIT_EVENT_TIMELINE_HISTORY_SYNC,
+    WAIT_EVENT_TIMELINE_HISTORY_WRITE,
+    WAIT_EVENT_TWOPHASE_FILE_READ,
+    WAIT_EVENT_TWOPHASE_FILE_SYNC,
+    WAIT_EVENT_TWOPHASE_FILE_WRITE,
+    WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ,
+    WAIT_EVENT_WAL_BOOTSTRAP_SYNC,
+    WAIT_EVENT_WAL_BOOTSTRAP_WRITE,
+    WAIT_EVENT_WAL_COPY_READ,
+    WAIT_EVENT_WAL_COPY_SYNC,
+    WAIT_EVENT_WAL_COPY_WRITE,
+    WAIT_EVENT_WAL_INIT_SYNC,
+    WAIT_EVENT_WAL_INIT_WRITE,
+    WAIT_EVENT_WAL_READ,
+    WAIT_EVENT_WAL_SYNC,
+    WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
+    WAIT_EVENT_WAL_WRITE
+} WaitEventIO;
+
+/* ----------
+ * Command type for progress reporting purposes
+ * ----------
+ */
+typedef enum ProgressCommandType
+{
+    PROGRESS_COMMAND_INVALID,
+    PROGRESS_COMMAND_VACUUM
+} ProgressCommandType;
+
+#define PGSTAT_NUM_PROGRESS_PARAM    10
+
+/* ----------
+ * Shared-memory data structures
+ * ----------
+ */
+
+
+/*
+ * PgBackendSSLStatus
+ *
+ * For each backend, we keep the SSL status in a separate struct, that
+ * is only filled in if SSL is enabled.
+ */
+typedef struct PgBackendSSLStatus
+{
+    /* Information about SSL connection */
+    int            ssl_bits;
+    bool        ssl_compression;
+    char        ssl_version[NAMEDATALEN];    /* MUST be null-terminated */
+    char        ssl_cipher[NAMEDATALEN];    /* MUST be null-terminated */
+    char        ssl_clientdn[NAMEDATALEN];    /* MUST be null-terminated */
+} PgBackendSSLStatus;
+
+
+/* ----------
+ * PgBackendStatus
+ *
+ * Each live backend maintains a PgBackendStatus struct in shared memory
+ * showing its current activity.  (The structs are allocated according to
+ * BackendId, but that is not critical.)  Note that the collector process
+ * has no involvement in, or even access to, these structs.
+ *
+ * Each auxiliary process also maintains a PgBackendStatus struct in shared
+ * memory.
+ * ----------
+ */
+typedef struct PgBackendStatus
+{
+    /*
+     * To avoid locking overhead, we use the following protocol: a backend
+     * increments st_changecount before modifying its entry, and again after
+     * finishing a modification.  A would-be reader should note the value of
+     * st_changecount, copy the entry into private memory, then check
+     * st_changecount again.  If the value hasn't changed, and if it's even,
+     * the copy is valid; otherwise start over.  This makes updates cheap
+     * while reads are potentially expensive, but that's the tradeoff we want.
+     *
+     * The above protocol needs memory barriers to ensure that the apparent
+     * order of execution is the intended one. Otherwise, for example, on a
+     * machine with weak memory ordering the CPU might reorder the code so
+     * that st_changecount is incremented twice before the modification.
+     * This surprising result can lead to bugs.
+     */
+    int            st_changecount;
+
+    /* The entry is valid iff st_procpid > 0, unused if st_procpid == 0 */
+    int            st_procpid;
+
+    /* Type of backend */
+    BackendType st_backendType;
+
+    /* Times when current backend, transaction, and activity started */
+    TimestampTz st_proc_start_timestamp;
+    TimestampTz st_xact_start_timestamp;
+    TimestampTz st_activity_start_timestamp;
+    TimestampTz st_state_start_timestamp;
+
+    /* Database OID, owning user's OID, connection client address */
+    Oid            st_databaseid;
+    Oid            st_userid;
+    SockAddr    st_clientaddr;
+    char       *st_clienthostname;    /* MUST be null-terminated */
+
+    /* Information about SSL connection */
+    bool        st_ssl;
+    PgBackendSSLStatus *st_sslstatus;
+
+    /* current state */
+    BackendState st_state;
+
+    /* application name; MUST be null-terminated */
+    char       *st_appname;
+
+    /*
+     * Current command string; MUST be null-terminated. Note that this string
+     * may be truncated in the middle of a multi-byte character. As
+     * activity strings are stored more frequently than read, this moves
+     * the cost of correct truncation to the display side. Use
+     * pgstat_clip_activity() to truncate correctly.
+     */
+    char       *st_activity_raw;
+
+    /*
+     * Command progress reporting.  Any command which wishes can advertise
+     * that it is running by setting st_progress_command,
+     * st_progress_command_target, and st_progress_param[].
+     * st_progress_command_target should be the OID of the relation which the
+     * command targets (we assume there's just one, as this is meant for
+     * utility commands), but the meaning of each element in the
+     * st_progress_param array is command-specific.
+     */
+    ProgressCommandType st_progress_command;
+    Oid            st_progress_command_target;
+    int64        st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
+} PgBackendStatus;
+
+/*
+ * Macros to load and store st_changecount with the memory barriers.
+ *
+ * pgstat_increment_changecount_before() and
+ * pgstat_increment_changecount_after() need to be called before and after
+ * PgBackendStatus entries are modified, respectively. This makes sure that
+ * st_changecount is incremented around the modification.
+ *
+ * Also pgstat_save_changecount_before() and pgstat_save_changecount_after()
+ * need to be called before and after PgBackendStatus entries are copied into
+ * private memory, respectively.
+ */
+#define pgstat_increment_changecount_before(beentry)    \
+    do {    \
+        beentry->st_changecount++;    \
+        pg_write_barrier(); \
+    } while (0)
+
+#define pgstat_increment_changecount_after(beentry) \
+    do {    \
+        pg_write_barrier(); \
+        beentry->st_changecount++;    \
+        Assert((beentry->st_changecount & 1) == 0); \
+    } while (0)
+
+#define pgstat_save_changecount_before(beentry, save_changecount)    \
+    do {    \
+        save_changecount = beentry->st_changecount; \
+        pg_read_barrier();    \
+    } while (0)
+
+#define pgstat_save_changecount_after(beentry, save_changecount)    \
+    do {    \
+        pg_read_barrier();    \
+        save_changecount = beentry->st_changecount; \
+    } while (0)
+
+/* ----------
+ * LocalPgBackendStatus
+ *
+ * When we build the backend status array, we use LocalPgBackendStatus to be
+ * able to add new values to the struct when needed without adding new fields
+ * to the shared memory. It contains the backend status as a first member.
+ * ----------
+ */
+typedef struct LocalPgBackendStatus
+{
+    /*
+     * Local version of the backend status entry.
+     */
+    PgBackendStatus backendStatus;
+
+    /*
+     * The xid of the current transaction if available, InvalidTransactionId
+     * if not.
+     */
+    TransactionId backend_xid;
+
+    /*
+     * The xmin of the current session if available, InvalidTransactionId if
+     * not.
+     */
+    TransactionId backend_xmin;
+} LocalPgBackendStatus;
+
+/* ----------
+ * GUC parameters
+ * ----------
+ */
+extern bool pgstat_track_activities;
+extern PGDLLIMPORT int pgstat_track_activity_query_size;
+
+/* ----------
+ * Functions called from backends
+ * ----------
+ */
+extern void pgstat_bearray_initialize(void);
+extern void pgstat_bestart(void);
+
+extern const char *pgstat_get_wait_event(uint32 wait_event_info);
+extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
+extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
+extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
+                                    int buflen);
+extern const char *pgstat_get_backend_desc(BackendType backendType);
+
+extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
+                              Oid relid);
+extern void pgstat_progress_update_param(int index, int64 val);
+extern void pgstat_progress_update_multi_param(int nparam, const int *index,
+                                   const int64 *val);
+extern void pgstat_progress_end_command(void);
+
+extern char *pgstat_clip_activity(const char *raw_activity);
+
+extern void AtEOXact_BEStatus(bool isCommit);
+extern void AtPrepare_BEStatus(void);
+/* ----------
+ * pgstat_report_wait_start() -
+ *
+ *    Called from places where server process needs to wait.  This is called
+ *    to report wait event information.  The wait information is stored
+ *    as 4 bytes, where the first byte represents the wait event class (type
+ *    of wait; for the different types, refer to WaitClass) and the next
+ *    3 bytes represent the actual wait event.  Currently 2 bytes are used
+ *    for the wait event, which is sufficient for current usage; 1 byte is
+ *    reserved for future use.
+ *
+ * NB: this *must* be able to survive being called before MyProc has been
+ * initialized.
+ * ----------
+ */
+static inline void
+pgstat_report_wait_start(uint32 wait_event_info)
+{
+    volatile PGPROC *proc = MyProc;
+
+    if (!pgstat_track_activities || !proc)
+        return;
+
+    /*
+     * Since this is a four-byte field which is always read and written as
+     * four-bytes, updates are atomic.
+     */
+    proc->wait_event_info = wait_event_info;
+}
+
+/* ----------
+ * pgstat_report_wait_end() -
+ *
+ *    Called to report end of a wait.
+ *
+ * NB: this *must* be able to survive being called before MyProc has been
+ * initialized.
+ * ----------
+ */
+static inline void
+pgstat_report_wait_end(void)
+{
+    volatile PGPROC *proc = MyProc;
+
+    if (!pgstat_track_activities || !proc)
+        return;
+
+    /*
+     * Since this is a four-byte field which is always read and written as
+     * four-bytes, updates are atomic.
+     */
+    proc->wait_event_info = 0;
+}
+extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
+extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
+extern int    pgstat_fetch_stat_numbackends(void);
+
+/* For shared memory allocation/initialization */
+extern Size BackendStatusShmemSize(void);
+extern void CreateSharedBackendStatus(void);
+
+extern void pgstat_report_xact_timestamp(TimestampTz tstamp);
+extern void pgstat_bestat_initialize(void);
+
+extern void pgstat_report_activity(BackendState state, const char *cmd_str);
+extern void pgstat_report_appname(const char *appname);
+extern void pgstat_report_xact_timestamp(TimestampTz tstamp);
+extern const char *pgstat_get_wait_event(uint32 wait_event_info);
+extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
+extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
+extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
+                                    int buflen);
+extern const char *pgstat_get_backend_desc(BackendType backendType);
+
+extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
+                              Oid relid);
+extern void pgstat_progress_update_param(int index, int64 val);
+extern void pgstat_progress_update_multi_param(int nparam, const int *index,
+                                   const int64 *val);
+extern void pgstat_progress_end_command(void);
+
+#endif                            /* BESTATUS_H */
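
Note on the bestatus.h hunk above: the comment on pgstat_report_wait_start() describes wait_event_info as a 4-byte value with the wait class in the first byte and the event in the lower bytes. The following is a minimal sketch of that layout, not code from the patch. The SKETCH_* names and the helper functions are hypothetical; the two masks are assumptions derived from the comment (one class byte, two event bytes currently in use), and 0x0A000000U is the value the header gives for PG_WAIT_IO.

```c
#include <assert.h>
#include <stdint.h>

/* Same value as PG_WAIT_IO in the header; renamed to mark it as a sketch. */
#define SKETCH_WAIT_IO     0x0A000000U

/* Assumed masks: top byte is the wait class, low two bytes the event. */
#define SKETCH_CLASS_MASK  0xFF000000U
#define SKETCH_EVENT_MASK  0x0000FFFFU

/* Combine a wait class and an event number into one 32-bit value. */
static inline uint32_t
sketch_make_wait_event(uint32_t wait_class, uint32_t event_id)
{
    assert((wait_class & ~SKETCH_CLASS_MASK) == 0);
    assert((event_id & ~SKETCH_EVENT_MASK) == 0);
    return wait_class | event_id;
}

/* Extract the wait class (still shifted into the top byte). */
static inline uint32_t
sketch_wait_class(uint32_t wait_event_info)
{
    return wait_event_info & SKETCH_CLASS_MASK;
}

/* Extract the event number within its class. */
static inline uint32_t
sketch_wait_event_id(uint32_t wait_event_info)
{
    return wait_event_info & SKETCH_EVENT_MASK;
}
```

Because the whole value fits in one aligned 32-bit word, a single store to proc->wait_event_info publishes class and event together, which is what the "updates are atomic" comments in the inline functions rely on.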
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 63a7653457..49131a6d5b 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
@@ -403,7 +404,6 @@ typedef enum
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
-
     NUM_AUXPROCTYPES            /* Must be last! */
 } AuxProcType;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f299d1d601..746d1d0986 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL statistics collector facility.
  *
  *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
  *
@@ -13,11 +13,9 @@
 
 #include "datatype/timestamp.h"
 #include "fmgr.h"
-#include "libpq/pqcomm.h"
-#include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
-#include "storage/proc.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -41,32 +39,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -115,13 +87,6 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_blocks_hit;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -129,12 +94,11 @@ typedef enum PgStat_Single_Reset_Type
     RESET_FUNCTION
 } PgStat_Single_Reset_Type;
 
+
 /* ------------------------------------------------------------
  * Structures kept in backend local memory while accumulating counts
  * ------------------------------------------------------------
  */
-
-
 /* ----------
  * PgStat_TableStatus            Per-table status within a backend
  *
@@ -180,280 +144,32 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
 /* ----------
- * PgStat_MsgHdr                The common message header
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgHdr
+typedef struct PgStat_BgWriter
 {
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter buf_alloc;
+    PgStat_Counter checkpoint_write_time; /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+} PgStat_BgWriter;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to apply.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when applying to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -485,81 +201,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgAutovacStart msg_autovacuum;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Statistics collector data structures on file and in shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -601,10 +244,13 @@ typedef struct PgStat_StatDBEntry
 
     /*
      * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * the handles and pointers out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables;
+    dshash_table_handle functions;
+    /* for snapshot tables */
+    HTAB *snapshot_tables;
+    HTAB *snapshot_functions;
 } PgStat_StatDBEntry;
 
 
@@ -660,7 +306,7 @@ typedef struct PgStat_StatFuncEntry
 
 
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -676,7 +322,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -694,422 +340,6 @@ typedef struct PgStat_GlobalStats
     TimestampTz stat_reset_timestamp;
 } PgStat_GlobalStats;
 
-
-/* ----------
- * Backend types
- * ----------
- */
-typedef enum BackendType
-{
-    B_AUTOVAC_LAUNCHER,
-    B_AUTOVAC_WORKER,
-    B_BACKEND,
-    B_BG_WORKER,
-    B_BG_WRITER,
-    B_ARCHIVER,
-    B_CHECKPOINTER,
-    B_STARTUP,
-    B_WAL_RECEIVER,
-    B_WAL_SENDER,
-    B_WAL_WRITER
-} BackendType;
-
-
-/* ----------
- * Backend states
- * ----------
- */
-typedef enum BackendState
-{
-    STATE_UNDEFINED,
-    STATE_IDLE,
-    STATE_RUNNING,
-    STATE_IDLEINTRANSACTION,
-    STATE_FASTPATH,
-    STATE_IDLEINTRANSACTION_ABORTED,
-    STATE_DISABLED
-} BackendState;
-
-
-/* ----------
- * Wait Classes
- * ----------
- */
-#define PG_WAIT_LWLOCK                0x01000000U
-#define PG_WAIT_LOCK                0x03000000U
-#define PG_WAIT_BUFFER_PIN            0x04000000U
-#define PG_WAIT_ACTIVITY            0x05000000U
-#define PG_WAIT_CLIENT                0x06000000U
-#define PG_WAIT_EXTENSION            0x07000000U
-#define PG_WAIT_IPC                    0x08000000U
-#define PG_WAIT_TIMEOUT                0x09000000U
-#define PG_WAIT_IO                    0x0A000000U
-
-/* ----------
- * Wait Events - Activity
- *
- * Use this category when a process is waiting because it has no work to do,
- * unless the "Client" or "Timeout" category describes the situation better.
- * Typically, this should only be used for background processes.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_ARCHIVER_MAIN = PG_WAIT_ACTIVITY,
-    WAIT_EVENT_AUTOVACUUM_MAIN,
-    WAIT_EVENT_BGWRITER_HIBERNATE,
-    WAIT_EVENT_BGWRITER_MAIN,
-    WAIT_EVENT_CHECKPOINTER_MAIN,
-    WAIT_EVENT_LOGICAL_APPLY_MAIN,
-    WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
-    WAIT_EVENT_RECOVERY_WAL_ALL,
-    WAIT_EVENT_RECOVERY_WAL_STREAM,
-    WAIT_EVENT_SYSLOGGER_MAIN,
-    WAIT_EVENT_WAL_RECEIVER_MAIN,
-    WAIT_EVENT_WAL_SENDER_MAIN,
-    WAIT_EVENT_WAL_WRITER_MAIN
-} WaitEventActivity;
-
-/* ----------
- * Wait Events - Client
- *
- * Use this category when a process is waiting to send data to or receive data
- * from the frontend process to which it is connected.  This is never used for
- * a background process, which has no client connection.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_CLIENT_READ = PG_WAIT_CLIENT,
-    WAIT_EVENT_CLIENT_WRITE,
-    WAIT_EVENT_LIBPQWALRECEIVER_CONNECT,
-    WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE,
-    WAIT_EVENT_SSL_OPEN_SERVER,
-    WAIT_EVENT_WAL_RECEIVER_WAIT_START,
-    WAIT_EVENT_WAL_SENDER_WAIT_WAL,
-    WAIT_EVENT_WAL_SENDER_WRITE_DATA
-} WaitEventClient;
-
-/* ----------
- * Wait Events - IPC
- *
- * Use this category when a process cannot complete the work it is doing because
- * it is waiting for a notification from another process.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_BGWORKER_SHUTDOWN = PG_WAIT_IPC,
-    WAIT_EVENT_BGWORKER_STARTUP,
-    WAIT_EVENT_BTREE_PAGE,
-    WAIT_EVENT_CLOG_GROUP_UPDATE,
-    WAIT_EVENT_EXECUTE_GATHER,
-    WAIT_EVENT_HASH_BATCH_ALLOCATING,
-    WAIT_EVENT_HASH_BATCH_ELECTING,
-    WAIT_EVENT_HASH_BATCH_LOADING,
-    WAIT_EVENT_HASH_BUILD_ALLOCATING,
-    WAIT_EVENT_HASH_BUILD_ELECTING,
-    WAIT_EVENT_HASH_BUILD_HASHING_INNER,
-    WAIT_EVENT_HASH_BUILD_HASHING_OUTER,
-    WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING,
-    WAIT_EVENT_HASH_GROW_BATCHES_DECIDING,
-    WAIT_EVENT_HASH_GROW_BATCHES_ELECTING,
-    WAIT_EVENT_HASH_GROW_BATCHES_FINISHING,
-    WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING,
-    WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING,
-    WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING,
-    WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING,
-    WAIT_EVENT_LOGICAL_SYNC_DATA,
-    WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
-    WAIT_EVENT_MQ_INTERNAL,
-    WAIT_EVENT_MQ_PUT_MESSAGE,
-    WAIT_EVENT_MQ_RECEIVE,
-    WAIT_EVENT_MQ_SEND,
-    WAIT_EVENT_PARALLEL_BITMAP_SCAN,
-    WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN,
-    WAIT_EVENT_PARALLEL_FINISH,
-    WAIT_EVENT_PROCARRAY_GROUP_UPDATE,
-    WAIT_EVENT_PROMOTE,
-    WAIT_EVENT_REPLICATION_ORIGIN_DROP,
-    WAIT_EVENT_REPLICATION_SLOT_DROP,
-    WAIT_EVENT_SAFE_SNAPSHOT,
-    WAIT_EVENT_SYNC_REP
-} WaitEventIPC;
-
-/* ----------
- * Wait Events - Timeout
- *
- * Use this category when a process is waiting for a timeout to expire.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
-    WAIT_EVENT_PG_SLEEP,
-    WAIT_EVENT_RECOVERY_APPLY_DELAY
-} WaitEventTimeout;
-
-/* ----------
- * Wait Events - IO
- *
- * Use this category when a process is waiting for a IO.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_BUFFILE_READ = PG_WAIT_IO,
-    WAIT_EVENT_BUFFILE_WRITE,
-    WAIT_EVENT_CONTROL_FILE_READ,
-    WAIT_EVENT_CONTROL_FILE_SYNC,
-    WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
-    WAIT_EVENT_CONTROL_FILE_WRITE,
-    WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE,
-    WAIT_EVENT_COPY_FILE_READ,
-    WAIT_EVENT_COPY_FILE_WRITE,
-    WAIT_EVENT_DATA_FILE_EXTEND,
-    WAIT_EVENT_DATA_FILE_FLUSH,
-    WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC,
-    WAIT_EVENT_DATA_FILE_PREFETCH,
-    WAIT_EVENT_DATA_FILE_READ,
-    WAIT_EVENT_DATA_FILE_SYNC,
-    WAIT_EVENT_DATA_FILE_TRUNCATE,
-    WAIT_EVENT_DATA_FILE_WRITE,
-    WAIT_EVENT_DSM_FILL_ZERO_WRITE,
-    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ,
-    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC,
-    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE,
-    WAIT_EVENT_LOCK_FILE_CREATE_READ,
-    WAIT_EVENT_LOCK_FILE_CREATE_SYNC,
-    WAIT_EVENT_LOCK_FILE_CREATE_WRITE,
-    WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ,
-    WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC,
-    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC,
-    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE,
-    WAIT_EVENT_LOGICAL_REWRITE_SYNC,
-    WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE,
-    WAIT_EVENT_LOGICAL_REWRITE_WRITE,
-    WAIT_EVENT_RELATION_MAP_READ,
-    WAIT_EVENT_RELATION_MAP_SYNC,
-    WAIT_EVENT_RELATION_MAP_WRITE,
-    WAIT_EVENT_REORDER_BUFFER_READ,
-    WAIT_EVENT_REORDER_BUFFER_WRITE,
-    WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ,
-    WAIT_EVENT_REPLICATION_SLOT_READ,
-    WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
-    WAIT_EVENT_REPLICATION_SLOT_SYNC,
-    WAIT_EVENT_REPLICATION_SLOT_WRITE,
-    WAIT_EVENT_SLRU_FLUSH_SYNC,
-    WAIT_EVENT_SLRU_READ,
-    WAIT_EVENT_SLRU_SYNC,
-    WAIT_EVENT_SLRU_WRITE,
-    WAIT_EVENT_SNAPBUILD_READ,
-    WAIT_EVENT_SNAPBUILD_SYNC,
-    WAIT_EVENT_SNAPBUILD_WRITE,
-    WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC,
-    WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE,
-    WAIT_EVENT_TIMELINE_HISTORY_READ,
-    WAIT_EVENT_TIMELINE_HISTORY_SYNC,
-    WAIT_EVENT_TIMELINE_HISTORY_WRITE,
-    WAIT_EVENT_TWOPHASE_FILE_READ,
-    WAIT_EVENT_TWOPHASE_FILE_SYNC,
-    WAIT_EVENT_TWOPHASE_FILE_WRITE,
-    WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ,
-    WAIT_EVENT_WAL_BOOTSTRAP_SYNC,
-    WAIT_EVENT_WAL_BOOTSTRAP_WRITE,
-    WAIT_EVENT_WAL_COPY_READ,
-    WAIT_EVENT_WAL_COPY_SYNC,
-    WAIT_EVENT_WAL_COPY_WRITE,
-    WAIT_EVENT_WAL_INIT_SYNC,
-    WAIT_EVENT_WAL_INIT_WRITE,
-    WAIT_EVENT_WAL_READ,
-    WAIT_EVENT_WAL_SYNC,
-    WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-    WAIT_EVENT_WAL_WRITE
-} WaitEventIO;
-
-/* ----------
- * Command type for progress reporting purposes
- * ----------
- */
-typedef enum ProgressCommandType
-{
-    PROGRESS_COMMAND_INVALID,
-    PROGRESS_COMMAND_VACUUM
-} ProgressCommandType;
-
-#define PGSTAT_NUM_PROGRESS_PARAM    10
-
-/* ----------
- * Shared-memory data structures
- * ----------
- */
-
-
-/*
- * PgBackendSSLStatus
- *
- * For each backend, we keep the SSL status in a separate struct, that
- * is only filled in if SSL is enabled.
- */
-typedef struct PgBackendSSLStatus
-{
-    /* Information about SSL connection */
-    int            ssl_bits;
-    bool        ssl_compression;
-    char        ssl_version[NAMEDATALEN];    /* MUST be null-terminated */
-    char        ssl_cipher[NAMEDATALEN];    /* MUST be null-terminated */
-    char        ssl_clientdn[NAMEDATALEN];    /* MUST be null-terminated */
-} PgBackendSSLStatus;
-
-
-/* ----------
- * PgBackendStatus
- *
- * Each live backend maintains a PgBackendStatus struct in shared memory
- * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
- * has no involvement in, or even access to, these structs.
- *
- * Each auxiliary process also maintains a PgBackendStatus struct in shared
- * memory.
- * ----------
- */
-typedef struct PgBackendStatus
-{
-    /*
-     * To avoid locking overhead, we use the following protocol: a backend
-     * increments st_changecount before modifying its entry, and again after
-     * finishing a modification.  A would-be reader should note the value of
-     * st_changecount, copy the entry into private memory, then check
-     * st_changecount again.  If the value hasn't changed, and if it's even,
-     * the copy is valid; otherwise start over.  This makes updates cheap
-     * while reads are potentially expensive, but that's the tradeoff we want.
-     *
-     * The above protocol needs the memory barriers to ensure that the
-     * apparent order of execution is as it desires. Otherwise, for example,
-     * the CPU might rearrange the code so that st_changecount is incremented
-     * twice before the modification on a machine with weak memory ordering.
-     * This surprising result can lead to bugs.
-     */
-    int            st_changecount;
-
-    /* The entry is valid iff st_procpid > 0, unused if st_procpid == 0 */
-    int            st_procpid;
-
-    /* Type of backends */
-    BackendType st_backendType;
-
-    /* Times when current backend, transaction, and activity started */
-    TimestampTz st_proc_start_timestamp;
-    TimestampTz st_xact_start_timestamp;
-    TimestampTz st_activity_start_timestamp;
-    TimestampTz st_state_start_timestamp;
-
-    /* Database OID, owning user's OID, connection client address */
-    Oid            st_databaseid;
-    Oid            st_userid;
-    SockAddr    st_clientaddr;
-    char       *st_clienthostname;    /* MUST be null-terminated */
-
-    /* Information about SSL connection */
-    bool        st_ssl;
-    PgBackendSSLStatus *st_sslstatus;
-
-    /* current state */
-    BackendState st_state;
-
-    /* application name; MUST be null-terminated */
-    char       *st_appname;
-
-    /*
-     * Current command string; MUST be null-terminated. Note that this string
-     * possibly is truncated in the middle of a multi-byte character. As
-     * activity strings are stored more frequently than read, that allows to
-     * move the cost of correct truncation to the display side. Use
-     * pgstat_clip_activity() to truncate correctly.
-     */
-    char       *st_activity_raw;
-
-    /*
-     * Command progress reporting.  Any command which wishes can advertise
-     * that it is running by setting st_progress_command,
-     * st_progress_command_target, and st_progress_param[].
-     * st_progress_command_target should be the OID of the relation which the
-     * command targets (we assume there's just one, as this is meant for
-     * utility commands), but the meaning of each element in the
-     * st_progress_param array is command-specific.
-     */
-    ProgressCommandType st_progress_command;
-    Oid            st_progress_command_target;
-    int64        st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
-} PgBackendStatus;
-
-/*
- * Macros to load and store st_changecount with the memory barriers.
- *
- * pgstat_increment_changecount_before() and
- * pgstat_increment_changecount_after() need to be called before and after
- * PgBackendStatus entries are modified, respectively. This makes sure that
- * st_changecount is incremented around the modification.
- *
- * Also pgstat_save_changecount_before() and pgstat_save_changecount_after()
- * need to be called before and after PgBackendStatus entries are copied into
- * private memory, respectively.
- */
-#define pgstat_increment_changecount_before(beentry)    \
-    do {    \
-        beentry->st_changecount++;    \
-        pg_write_barrier(); \
-    } while (0)
-
-#define pgstat_increment_changecount_after(beentry) \
-    do {    \
-        pg_write_barrier(); \
-        beentry->st_changecount++;    \
-        Assert((beentry->st_changecount & 1) == 0); \
-    } while (0)
-
-#define pgstat_save_changecount_before(beentry, save_changecount)    \
-    do {    \
-        save_changecount = beentry->st_changecount; \
-        pg_read_barrier();    \
-    } while (0)
-
-#define pgstat_save_changecount_after(beentry, save_changecount)    \
-    do {    \
-        pg_read_barrier();    \
-        save_changecount = beentry->st_changecount; \
-    } while (0)
-
-/* ----------
- * LocalPgBackendStatus
- *
- * When we build the backend status array, we use LocalPgBackendStatus to be
- * able to add new values to the struct when needed without adding new fields
- * to the shared memory. It contains the backend status as a first member.
- * ----------
- */
-typedef struct LocalPgBackendStatus
-{
-    /*
-     * Local version of the backend status entry.
-     */
-    PgBackendStatus backendStatus;
-
-    /*
-     * The xid of the current transaction if available, InvalidTransactionId
-     * if not.
-     */
-    TransactionId backend_xid;
-
-    /*
-     * The xmin of the current session if available, InvalidTransactionId if
-     * not.
-     */
-    TransactionId backend_xmin;
-} LocalPgBackendStatus;
-
 /*
  * Working state needed to accumulate per-function-call timing statistics.
  */
@@ -1131,18 +361,18 @@ typedef struct PgStat_FunctionCallUsage
  * GUC parameters
  * ----------
  */
-extern bool pgstat_track_activities;
 extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
-extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used, but will be removed with GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1154,34 +384,20 @@ extern PgStat_Counter pgStatBlockWriteTime;
  * Functions called from postmaster
  * ----------
  */
-extern Size BackendStatusShmemSize(void);
-extern void CreateSharedBackendStatus(void);
-
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
-
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_update_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
-extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
-extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
-
+extern void pgstat_reset_shared_counters(const char *target);
+extern void pgstat_reset_single_counter(Oid objectid,
+                                        PgStat_Single_Reset_Type type);
 extern void pgstat_report_autovac(Oid dboid);
 extern void pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples);
@@ -1192,87 +408,21 @@ extern void pgstat_report_analyze(Relation rel,
 extern void pgstat_report_recovery_conflict(int reason);
 extern void pgstat_report_deadlock(void);
 
+extern void pgstat_clear_snapshot(void);
+
 extern void pgstat_initialize(void);
+extern void pgstat_bearray_initialize(void);
 extern void pgstat_bestart(void);
 
-extern void pgstat_report_activity(BackendState state, const char *cmd_str);
-extern void pgstat_report_tempfile(size_t filesize);
-extern void pgstat_report_appname(const char *appname);
-extern void pgstat_report_xact_timestamp(TimestampTz tstamp);
-extern const char *pgstat_get_wait_event(uint32 wait_event_info);
-extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
-extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
-extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
-                                    int buflen);
-extern const char *pgstat_get_backend_desc(BackendType backendType);
-
-extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
-                              Oid relid);
-extern void pgstat_progress_update_param(int index, int64 val);
-extern void pgstat_progress_update_multi_param(int nparam, const int *index,
-                                   const int64 *val);
-extern void pgstat_progress_end_command(void);
-
 extern PgStat_TableStatus *find_tabstat_entry(Oid rel_id);
 extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 
 extern void pgstat_initstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
-
-/* ----------
- * pgstat_report_wait_start() -
- *
- *    Called from places where server process needs to wait.  This is called
- *    to report wait event information.  The wait information is stored
- *    as 4-bytes where first byte represents the wait event class (type of
- *    wait, for different types of wait, refer WaitClass) and the next
- *    3-bytes represent the actual wait event.  Currently 2-bytes are used
- *    for wait event which is sufficient for current usage, 1-byte is
- *    reserved for future usage.
- *
- * NB: this *must* be able to survive being called before MyProc has been
- * initialized.
- * ----------
- */
-static inline void
-pgstat_report_wait_start(uint32 wait_event_info)
-{
-    volatile PGPROC *proc = MyProc;
-
-    if (!pgstat_track_activities || !proc)
-        return;
-
-    /*
-     * Since this is a four-byte field which is always read and written as
-     * four-bytes, updates are atomic.
-     */
-    proc->wait_event_info = wait_event_info;
-}
-
-/* ----------
- * pgstat_report_wait_end() -
- *
- *    Called to report end of a wait.
- *
- * NB: this *must* be able to survive being called before MyProc has been
- * initialized.
- * ----------
- */
-static inline void
-pgstat_report_wait_end(void)
-{
-    volatile PGPROC *proc = MyProc;
-
-    if (!pgstat_track_activities || !proc)
-        return;
-
-    /*
-     * Since this is a four-byte field which is always read and written as
-     * four-bytes, updates are atomic.
-     */
-    proc->wait_event_info = 0;
-}
+extern PgStat_StatDBEntry *backend_get_db_entry(Oid dbid, bool oneshot);
+extern HTAB *backend_snapshot_all_db_entries(void);
+extern PgStat_StatTabEntry *backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid relid, bool oneshot);
 
 /* nontransactional event counts are simple enough to inline */
 
@@ -1338,21 +488,30 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                           void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
+extern void pgstat_update_archiver(const char *xlog, bool failed);
+extern void pgstat_update_bgwriter(void);
+
+extern void pgstat_report_tempfile(size_t filesize);
 
 /* ----------
  * Support functions for the SQL-callable functions to
  * generate the pgstat* views.
  * ----------
  */
-extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
+extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid relid, bool oneshot);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
-extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
-extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
-extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
 
+/* File input/output functions  */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
+
+/* For shared memory allocation/initialize */
+extern Size BackendStatusShmemSize(void);
+extern void CreateSharedBackendStatus(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/dsm.h b/src/include/storage/dsm.h
index 7c44f4a6e7..c37ec33e9b 100644
--- a/src/include/storage/dsm.h
+++ b/src/include/storage/dsm.h
@@ -26,6 +26,7 @@ typedef struct dsm_segment dsm_segment;
 struct PGShmemHeader;            /* avoid including pg_shmem.h */
 extern void dsm_cleanup_using_control_segment(dsm_handle old_control_handle);
 extern void dsm_postmaster_startup(struct PGShmemHeader *);
+extern void dsm_child_init(void);
 extern void dsm_backend_shutdown(void);
 extern void dsm_detach_all(void);
 
@@ -33,6 +34,8 @@ extern void dsm_detach_all(void);
 extern void dsm_set_control_handle(dsm_handle h);
 #endif
 
+extern bool dsm_is_available(void);
+
 /* Functions that create or remove mappings. */
 extern dsm_segment *dsm_create(Size size, int flags);
 extern dsm_segment *dsm_attach(dsm_handle h);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 96c7732006..daa269f816 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -219,6 +219,9 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SHARED_TUPLESTORE,
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
+    LWTRANCHE_STATS_DSA,
+    LWTRANCHE_STATS_DB,
+    LWTRANCHE_STATS_FUNC_TABLE,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 9244a2a7b7..a9b625211b 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
diff --git a/src/test/modules/worker_spi/worker_spi.c b/src/test/modules/worker_spi/worker_spi.c
index c1878dd694..7391e05f37 100644
--- a/src/test/modules/worker_spi/worker_spi.c
+++ b/src/test/modules/worker_spi/worker_spi.c
@@ -290,7 +290,7 @@ worker_spi_main(Datum main_arg)
         SPI_finish();
         PopActiveSnapshot();
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(false);
         pgstat_report_activity(STATE_IDLE, NULL);
     }
 
-- 
2.16.3

From bfefcad9f70e10a118f7722d1dcf47b285c37b1c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 27 Nov 2018 14:42:12 +0900
Subject: [PATCH 5/5] Remove the GUC stats_temp_directory

The GUC used to specify the directory to store temporary statistics
files. It is no longer needed by the stats collector but is still used by
the programs in bin and contrib, and maybe by other extensions. Thus this
patch removes the GUC, but some backing variables and macro definitions
are left alone for backward compatibility.
---
 doc/src/sgml/backup.sgml                      |  2 --
 doc/src/sgml/config.sgml                      | 19 -------------
 doc/src/sgml/monitoring.sgml                  |  7 +----
 doc/src/sgml/storage.sgml                     |  3 +-
 src/backend/replication/basebackup.c          | 13 ++-------
 src/backend/statmon/pgstat.c                  | 13 ++++-----
 src/backend/utils/misc/guc.c                  | 41 ---------------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/pgstat.h                          |  5 +++-
 9 files changed, 14 insertions(+), 90 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index a73fd4d044..95285809c2 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1119,8 +1119,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b6f5822b84..8a5291a18d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6671,25 +6671,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 60a85a7898..fa483ef0f7 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -197,12 +197,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
+   <productname>PostgreSQL</productname> processes through shared memory.
    When the server shuts down cleanly, a permanent copy of the statistics
    data is stored in the <filename>pg_stat</filename> subdirectory, so that
    statistics can be retained across server restarts.  When recovery is
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 8ef2ac8010..e137e6b494 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -122,8 +122,7 @@ Item
 
 <row>
  <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
+ <entry>Subdirectory containing ephemeral files for extensions</entry>
 </row>
 
 <row>
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index e30b2dbcf0..a567aacf73 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -231,11 +231,8 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -266,13 +263,9 @@ perform_base_backup(basebackup_options *opt)
          * Calculate the relative path of temporary statistics directory in
          * order to skip the files which are located in that directory later.
          */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
+
+        Assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
+        statrelpath = psprintf("./%s", PG_STAT_TMP_DIR);
 
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
diff --git a/src/backend/statmon/pgstat.c b/src/backend/statmon/pgstat.c
index c8513186db..d7fd4c8fa5 100644
--- a/src/backend/statmon/pgstat.c
+++ b/src/backend/statmon/pgstat.c
@@ -70,15 +70,12 @@ typedef enum
 bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 
-/* ----------
- * Built from GUC parameter
- * ----------
+/*
+ * This used to be a GUC variable and is no longer used in this file, but left
+ * alone just for backward compatibility for extensions, having the default
+ * value.
  */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+char       *pgstat_stat_directory = PG_STAT_TMP_DIR;
 
 /* Shared stats bootstrap infomation */
 typedef struct StatsShmemStruct {
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 7fe54b0669..0fd4db5cb8 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -190,7 +190,6 @@ static bool check_autovacuum_max_workers(int *newval, void **extra, GucSource so
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static void assign_effective_io_concurrency(int newval, void *extra);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -3974,17 +3973,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -10967,35 +10955,6 @@ assign_effective_io_concurrency(int newval, void *extra)
 #endif                            /* USE_PREFETCH */
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index a21865a77f..a65656a4d2 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -552,7 +552,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 746d1d0986..6f4e94ab5b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -28,7 +28,10 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
+/*
+ * This used to be the directory to store temporary statistics data in but is
+ * no longer used. Defined here for backward compatibility.
+ */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
 
 /* Values for track_functions GUC variable --- order is significant! */
-- 
2.16.3


Re: shared-memory based stats collector

From
Michael Paquier
Date:
On Tue, Jan 22, 2019 at 03:48:02PM +0900, Kyotaro HORIGUCHI wrote:
> Fixed doubious memory context usage.

That's quite something that we have here for 0005:
84 files changed, 6588 insertions(+), 7501 deletions(-)

Moved to next CF for now.
--
Michael

Attachments

Re: shared-memory based stats collector

From
Andres Freund
Date:
Hi,

On 2018-11-12 20:10:42 +0900, Kyotaro HORIGUCHI wrote:
> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index 7eed5866d2..e52ae54821 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -8587,9 +8587,9 @@ LogCheckpointEnd(bool restartpoint)
>                          &sync_secs, &sync_usecs);
>  
>      /* Accumulate checkpoint timing summary data, in milliseconds. */
> -    BgWriterStats.m_checkpoint_write_time +=
> +    BgWriterStats.checkpoint_write_time +=
>          write_secs * 1000 + write_usecs / 1000;
> -    BgWriterStats.m_checkpoint_sync_time +=
> +    BgWriterStats.checkpoint_sync_time +=
>          sync_secs * 1000 + sync_usecs / 1000;

Why does this patch do renames like this in the same entry as actual
functional changes?


> @@ -1273,16 +1276,22 @@ do_start_worker(void)
>                  break;
>              }
>          }
> -        if (skipit)
> -            continue;
> +        if (!skipit)
> +        {
> +            /* Remember the db with oldest autovac time. */
> +            if (avdb == NULL ||
> +                tmp->adw_entry->last_autovac_time <
> +                avdb->adw_entry->last_autovac_time)
> +            {
> +                if (avdb)
> +                    pfree(avdb->adw_entry);
> +                avdb = tmp;
> +            }
> +        }
>  
> -        /*
> -         * Remember the db with oldest autovac time.  (If we are here, both
> -         * tmp->entry and db->entry must be non-null.)
> -         */
> -        if (avdb == NULL ||
> -            tmp->adw_entry->last_autovac_time < avdb->adw_entry->last_autovac_time)
> -            avdb = tmp;
> +        /* Immediately free it if not used */
> +        if(avdb != tmp)
> +            pfree(tmp->adw_entry);
>      }

This looks like another unrelated refactoring. Don't do this.


>      /* Transfer stats counts into pending pgstats message */
> -    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
> -    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
> +    BgWriterStats.buf_written_backend += CheckpointerShmem->num_backend_writes;
> +    BgWriterStats.buf_fsync_backend += CheckpointerShmem->num_backend_fsync;

More unrelated renaming.  I'll stop mentioning this in the rest of this
patch, but really, don't do this; it makes such a large patch
unnecessarily harder to review.


pgstat.c needs a header comment explaining the architecture of the
approach.


> +/*
> + * Operation mode of pgstat_get_db_entry.
> + */
> +#define PGSTAT_FETCH_SHARED    0
> +#define PGSTAT_FETCH_EXCLUSIVE    1
> +#define    PGSTAT_FETCH_NOWAIT 2
> +
> +typedef enum
> +{

Please don't create anonymous enums that are then typedef'd to a
name. The underlying enum tag and the typedef name should be
the same.
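
As a minimal sketch of the convention being asked for (the names
ExampleFetchMode and example_fetch_mode_name are hypothetical and not
taken from the patch), the enum tag and the typedef name match, so the
type can be spelled either way:

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical enum, for illustration only.  The tag (ExampleFetchMode)
 * matches the typedef name, so "enum ExampleFetchMode" and
 * "ExampleFetchMode" name the same type.
 */
typedef enum ExampleFetchMode
{
    EXAMPLE_FETCH_SHARED,
    EXAMPLE_FETCH_EXCLUSIVE
} ExampleFetchMode;

/* The tagged spelling works in a declaration... */
static const char *example_fetch_mode_name(enum ExampleFetchMode mode);

/* ...and the typedef spelling works in the definition. */
static const char *
example_fetch_mode_name(ExampleFetchMode mode)
{
    switch (mode)
    {
        case EXAMPLE_FETCH_SHARED:
            return "shared";
        case EXAMPLE_FETCH_EXCLUSIVE:
            return "exclusive";
    }
    return "unknown";
}
```

With an anonymous enum only the typedef spelling exists, so a forward
reference such as "enum ExampleFetchMode" would not name the type.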

> +/*
> + *  report withholding facility.
> + *
> + *  some report items are withholded if required lock is not acquired
> + *  immediately.
> + */

This comment needs polishing. The variables are named _pending_, but the
comment talks about withholding - which doesn't seem like an apt name.

>  /*
>   * Structures in which backends store per-table info that's waiting to be
> @@ -189,18 +189,14 @@ typedef struct TabStatHashEntry
>   * Hash table for O(1) t_id -> tsa_entry lookup
>   */
>  static HTAB *pgStatTabHash = NULL;
> +static HTAB *pgStatPendingTabHash = NULL;
>  
>  /*
>   * Backends store per-function info that's waiting to be sent to the collector
>   * in this hash table (indexed by function OID).
>   */
>  static HTAB *pgStatFunctions = NULL;
> -
> -/*
> - * Indicates if backend has some function stats that it hasn't yet
> - * sent to the collector.
> - */
> -static bool have_function_stats = false;
> +static HTAB *pgStatPendingFunctions = NULL;

So this patch leaves us with a pgStatFunctions that has a comment
explaining it's about "waiting to be sent" stats, and then additionally
a pgStatPendingFunctions?



>  /* ------------------------------------------------------------
>   * Public functions used by backends follow
> @@ -802,41 +436,107 @@ allow_immediate_pgstat_restart(void)
>   * pgstat_report_stat() -
>   *
>   *    Must be called by processes that performs DML: tcop/postgres.c, logical
> - *    receiver processes, SPI worker, etc. to send the so far collected
> - *    per-table and function usage statistics to the collector.  Note that this
> - *    is called only when not within a transaction, so it is fair to use
> - *    transaction stop time as an approximation of current time.
> - * ----------
> + *    receiver processes, SPI worker, etc. to apply the so far collected
> + *    per-table and function usage statistics to the shared statistics hashes.
> + *
> + *  This requires taking some locks on the shared statistics hashes and some

Weird mix of different indentation.


> + *  of updates may be withholded on lock failure. Pending updates are
> + *  retried in later call of this function and finally cleaned up by calling
> + *  this function with force = true or PGSTAT_STAT_MAX_INTERVAL milliseconds
> + *  was elapsed since last cleanup. On the other hand updates by regular

s/was/has/



> +
> +    /* Forecibly update other stats if any. */

s/Forecibly/Forcibly/

Typo aside, what does forcibly mean here?



>      /*
> -     * Scan through the TabStatusArray struct(s) to find tables that actually
> -     * have counts, and build messages to send.  We have to separate shared
> -     * relations from regular ones because the databaseid field in the message
> -     * header has to depend on that.
> +     * XX: We cannot lock two dshash entries at once. Since we must keep lock

Typically we use three X's, i.e. XXX (or alternatively NB:).


> +     * while tables stats are being updated we have no choice other than
> +     * separating jobs for shared table stats and that of egular tables.

s/egular/regular/

> +     * Looping over the array twice isapparently ineffcient and more efficient
> +     * way is expected.
>       */

s/isapparently/is apparently/
s/ineffcient/inefficient/

But I don't know what this sentence is trying to say precisely. Are you
saying this cannot be committed unless this is fixed?


Nor do I understand why it's actually relevant that we cannot lock two
dshash entries at once. The same table is never shared and unshared,
no?

> +/*
> + * Subroutine for pgstat_update_stat.
> + *
> + * Appies table stats in table status array merging with pending stats if any.

s/Appies/Applies/


> + * If force is true waits until required locks to be acquired. Elsewise stats

s/Elsewise/Otherwise/

> + * merged stats as pending sats and it will be processed in the next chance.

s/sats/stats/
s/in the next chance/at the next chance/


> +            /* if pending update exists, it should be applied along with */
> +            if (pgStatPendingTabHash != NULL)

Why is any of this done if there's no pending data?


>              {
> -                pgstat_send_tabstat(this_msg);
> -                this_msg->m_nentries = 0;
> +                pentry = hash_search(pgStatPendingTabHash,
> +                                     (void *) entry, HASH_FIND, NULL);
> +
> +                if (pentry)
> +                {
> +                    /* merge new update into pending updates */
> +                    pgstat_merge_tabentry(pentry, entry, false);
> +                    entry = pentry;
> +                }
> +            }
> +
> +            /* try to apply the merged stats */
> +            if (pgstat_apply_tabstat(cxt, entry, !force))
> +            {
> +                /* succeeded. remove it if it was pending stats */
> +                if (pentry && entry != pentry)
> +                    hash_search(pgStatPendingTabHash,
> +                                (void *) pentry, HASH_REMOVE, NULL);

Huh, how can entry != pentry in the case of pending stats? They're
literally set to the same value above?


> +            else if (!pentry)
> +            {
> +                /* failed and there was no pending entry, create new one. */
> +                bool found;
> +
> +                if (pgStatPendingTabHash == NULL)
> +                {
> +                    HASHCTL        ctl;
> +
> +                    memset(&ctl, 0, sizeof(ctl));
> +                    ctl.keysize = sizeof(Oid);
> +                    ctl.entrysize = sizeof(PgStat_TableStatus);
> +                    pgStatPendingTabHash =
> +                        hash_create("pgstat pending table stats hash",
> +                                    TABSTAT_QUANTUM,
> +                                    &ctl,
> +                                    HASH_ELEM | HASH_BLOBS);
> +                }
> +
> +                pentry = hash_search(pgStatPendingTabHash,
> +                                     (void *) entry, HASH_ENTER, &found);
> +                Assert (!found);
> +
> +                *pentry = *entry;
>              }
>          }
> -        /* zero out TableStatus structs after use */
> -        MemSet(tsa->tsa_entries, 0,
> -               tsa->tsa_used * sizeof(PgStat_TableStatus));
> -        tsa->tsa_used = 0;
> +    }

I don't understand why we do this at all.


> +    if (cxt->tabhash)
> +        dshash_detach(cxt->tabhash);

Huh, why do we detach here?


> +/*
> + * pgstat_apply_tabstat: update shared stats entry using given entry
> + *
> + * If nowait is true, just returns false on lock failure.  Dshashes for table
> + * and function stats are kept attached and stored in ctx. The caller must
> + * detach them after use.
> + */
> +bool
> +pgstat_apply_tabstat(pgstat_apply_tabstat_context *cxt,
> +                     PgStat_TableStatus *entry, bool nowait)
> +{
> +    Oid dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
> +    int        table_mode = PGSTAT_FETCH_EXCLUSIVE;
> +    bool updated = false;
> +
> +    if (nowait)
> +        table_mode |= PGSTAT_FETCH_NOWAIT;
> +
> +    /*
> +     * We need to keep lock on dbentries for regular tables to avoid race
> +     * condition with drop database. So we hold it in the context variable. We
> +     * don't need that for shared tables.
> +     */
> +    if (!cxt->dbentry)
> +        cxt->dbentry = pgstat_get_db_entry(dboid, table_mode, NULL);

Oh, wait, what? *That's* the reason why we need to hold a lock on a
second entry?

Uhm, how can this actually be an issue? If we apply pending stats, we're
connected to the database, it therefore cannot be dropped while we're
applying stats, no?


> +    /* attach shared stats table if not yet */
> +    if (!cxt->tabhash)
> +    {
> +        /* apply database stats  */
> +        if (!entry->t_shared)
> +        {
> +            /* Update database-wide stats  */
> +            cxt->dbentry->n_xact_commit += pgStatXactCommit;
> +            cxt->dbentry->n_xact_rollback += pgStatXactRollback;
> +            cxt->dbentry->n_block_read_time += pgStatBlockReadTime;
> +            cxt->dbentry->n_block_write_time += pgStatBlockWriteTime;
> +            pgStatXactCommit = 0;
> +            pgStatXactRollback = 0;
> +            pgStatBlockReadTime = 0;
> +            pgStatBlockWriteTime = 0;
> +        }

Uh, this seems to have nothing to do with "attach shared stats table if
not yet".



> +    /* create hash if not yet */
> +    if (dbentry->functions == DSM_HANDLE_INVALID)
> +    {
> +        funchash = dshash_create(area, &dsh_funcparams, 0);
> +        dbentry->functions = dshash_get_hash_table_handle(funchash);
> +    }
> +    else
> +        funchash = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);

Why is this created on-demand?


> +    /*
> +     * First, we empty the transaction stats. Just move numbers to pending
> +     * stats if any. Elsewise try to directly update the shared stats but
> +     * create a new pending entry on lock failure.
> +     */
> +    if (pgStatFunctions)

I don't understand why we have both pgStatFunctions and
pgStatPendingFunctions (and the same for other such pairs). That
seems to make no sense to me. The comment for the former literally is:
 /*
  * Backends store per-function info that's waiting to be sent to the collector
  * in this hash table (indexed by function OID).
  */



>  static HTAB *
>  pgstat_collect_oids(Oid catalogid)
> @@ -1241,62 +1173,54 @@ pgstat_collect_oids(Oid catalogid)
>  /* ----------
>   * pgstat_drop_database() -
>   *
> - *    Tell the collector that we just dropped a database.
> - *    (If the message gets lost, we will still clean the dead DB eventually
> - *    via future invocations of pgstat_vacuum_stat().)
> + *    Remove entry for the database that we just dropped.
> + *
> + *  If some stats update happens after this, this entry will re-created but
> + *    we will still clean the dead DB eventually via future invocations of
> + *    pgstat_vacuum_stat().
>   * ----------
>   */
> +
>  void
>  pgstat_drop_database(Oid databaseid)
>  {

Mixed indentation, added newline.


> +/*
> + * snapshot_statentry() - Find an entriy from source dshash.
> + *

s/entriy/entry/


Ok, getting too tired now. Two AM in an airport lounge is not the
easiest place and time to concentrate...

I don't think this is all that close to being committable :(

Greetings,

Andres Freund


Re: shared-memory based stats collector

From
Andres Freund
Date:
Hi Kyotaro,

On 2019-02-07 13:10:08 -0800, Andres Freund wrote:
> I don't think this is all that close to being committable :(

Are you planning to update this soon? I think this needs to be improved
pretty quickly to have any shot at getting into v12. I'm willing to put
in some resources towards that, but I definitely don't have the
resources to entirely polish it from my end.

Greetings,

Andres Freund


Re: shared-memory based stats collector

From
Kyotaro HORIGUCHI
Date:
Hello. Thank you for the comment.

At Thu, 7 Feb 2019 13:10:08 -0800, Andres Freund <andres@anarazel.de> wrote in
<20190207211008.nc3axviivmcoaluq@alap3.anarazel.de>
> Hi,
> 
> On 2018-11-12 20:10:42 +0900, Kyotaro HORIGUCHI wrote:
> > diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> > index 7eed5866d2..e52ae54821 100644
> > --- a/src/backend/access/transam/xlog.c
> > +++ b/src/backend/access/transam/xlog.c
> > @@ -8587,9 +8587,9 @@ LogCheckpointEnd(bool restartpoint)
> >                          &sync_secs, &sync_usecs);
> >  
> >      /* Accumulate checkpoint timing summary data, in milliseconds. */
> > -    BgWriterStats.m_checkpoint_write_time +=
> > +    BgWriterStats.checkpoint_write_time +=
> >          write_secs * 1000 + write_usecs / 1000;
> > -    BgWriterStats.m_checkpoint_sync_time +=
> > +    BgWriterStats.checkpoint_sync_time +=
> >          sync_secs * 1000 + sync_usecs / 1000;
> 
> Why does this patch do renames like this in the same entry as actual
> functional changes?

Just because they are no longer "messages". I'm OK with preserving them
as historical names. Reverted.

> > @@ -1273,16 +1276,22 @@ do_start_worker(void)
> >                  break;
> >              }
> >          }
> > -        if (skipit)
> > -            continue;
> > +        if (!skipit)
> > +        {
> > +            /* Remember the db with oldest autovac time. */
> > +            if (avdb == NULL ||
> > +                tmp->adw_entry->last_autovac_time <
> > +                avdb->adw_entry->last_autovac_time)
> > +            {
> > +                if (avdb)
> > +                    pfree(avdb->adw_entry);
> > +                avdb = tmp;
> > +            }
> > +        }
> >  
> > -        /*
> > -         * Remember the db with oldest autovac time.  (If we are here, both
> > -         * tmp->entry and db->entry must be non-null.)
> > -         */
> > -        if (avdb == NULL ||
> > -            tmp->adw_entry->last_autovac_time < avdb->adw_entry->last_autovac_time)
> > -            avdb = tmp;
> > +        /* Immediately free it if not used */
> > +        if(avdb != tmp)
> > +            pfree(tmp->adw_entry);
> >      }
> 
> This looks like another unrelated refactoring. Don't do this.

Rewrote it in a less invasive way.

> >      /* Transfer stats counts into pending pgstats message */
> > -    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
> > -    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
> > +    BgWriterStats.buf_written_backend += CheckpointerShmem->num_backend_writes;
> > +    BgWriterStats.buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
> 
> > More unrelated renaming.  I'll stop mentioning this in the rest of this
> > patch, but really don't do this, it makes such a large patch unnecessarily
> > harder to review.

AFAICS it is done only to the struct. I reverted all of them.

> pgstat.c needs a header comment explaining the architecture of the
> approach.

I wrote one. Please check it.

> > +/*
> > + * Operation mode of pgstat_get_db_entry.
> > + */
> > +#define PGSTAT_FETCH_SHARED    0
> > +#define PGSTAT_FETCH_EXCLUSIVE    1
> > +#define    PGSTAT_FETCH_NOWAIT 2
> > +
> > +typedef enum
> > +{
> 
> Please don't create anonymous enums that are then typedef'd to a
> > name. The underlying name and the one not scoped to typedefs should be
> the same.

Sorry. I named the enum PgStat_TableLookupState and typedef'ed it
with the same name.
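
For illustration only, the convention being asked for looks like the sketch below. The type and member names are hypothetical (the patch's actual members are not shown in this message); the point is just that the enum tag and the typedef name match.

```c
#include <assert.h>
#include <string.h>

/*
 * The enum tag and the typedef name are identical, so both
 * "enum Toy_LookupState" and "Toy_LookupState" refer to the same type.
 * All names here are hypothetical, not taken from the patch.
 */
typedef enum Toy_LookupState
{
    TOY_LOOKUP_NOT_FOUND,
    TOY_LOOKUP_FOUND,
    TOY_LOOKUP_LOCK_FAILED
} Toy_LookupState;

/* Accepts the type by its tag; callers may pass the typedef'd name. */
const char *
toy_lookup_state_name(enum Toy_LookupState state)
{
    switch (state)
    {
        case TOY_LOOKUP_NOT_FOUND:
            return "not found";
        case TOY_LOOKUP_FOUND:
            return "found";
        case TOY_LOOKUP_LOCK_FAILED:
            return "lock failed";
    }
    return "unknown";
}
```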

> > +/*
> > + *  report withholding facility.
> > + *
> > + *  some report items are withholded if required lock is not acquired
> > + *  immediately.
> > + */
> 
> This comment needs polishing. The variables are named _pending_, but the
> comment talks about withholding - which doesn't seem like an apt name.

The comment was updated in v12, but it still isn't
good. I rewrote it as follows:

| *  variables signals that the backend has some numbers that are waiting to be
| *  written to shared stats.

> >  /*
> >   * Structures in which backends store per-table info that's waiting to be
> > @@ -189,18 +189,14 @@ typedef struct TabStatHashEntry
> >   * Hash table for O(1) t_id -> tsa_entry lookup
> >   */
> >  static HTAB *pgStatTabHash = NULL;
> > +static HTAB *pgStatPendingTabHash = NULL;
> >  
> >  /*
> >   * Backends store per-function info that's waiting to be sent to the collector
> >   * in this hash table (indexed by function OID).
> >   */
> >  static HTAB *pgStatFunctions = NULL;
> > -
> > -/*
> > - * Indicates if backend has some function stats that it hasn't yet
> > - * sent to the collector.
> > - */
> > -static bool have_function_stats = false;
> > +static HTAB *pgStatPendingFunctions = NULL;
> 
> So this patch leaves us with a pgStatFunctions that has a comment
> explaining it's about "waiting to be sent" stats, and then additionally
> a pgStatPendingFunctions?

Mmm. Thanks. I changed the comment, separated the pgStatPending*
stuff from there, and merged it with pgstat_pending_*. I also unified
the naming.


> >  /* ------------------------------------------------------------
> >   * Public functions used by backends follow
> > @@ -802,41 +436,107 @@ allow_immediate_pgstat_restart(void)
> >   * pgstat_report_stat() -
> >   *
> >   *    Must be called by processes that performs DML: tcop/postgres.c, logical
> > - *    receiver processes, SPI worker, etc. to send the so far collected
> > - *    per-table and function usage statistics to the collector.  Note that this
> > - *    is called only when not within a transaction, so it is fair to use
> > - *    transaction stop time as an approximation of current time.
> > - * ----------
> > + *    receiver processes, SPI worker, etc. to apply the so far collected
> > + *    per-table and function usage statistics to the shared statistics hashes.
> > + *
> > + *  This requires taking some locks on the shared statistics hashes and some
> 
> Weird mix of different indentation.

Fixed. Unified to tabs.

> > + *  of updates may be withholded on lock failure. Pending updates are
> > + *  retried in later call of this function and finally cleaned up by calling
> > + *  this function with force = true or PGSTAT_STAT_MAX_INTERVAL milliseconds
> > + *  was elapsed since last cleanup. On the other hand updates by regular
> 
> s/was/has/

Ugg. Fixed.

> > +
> > +    /* Forecibly update other stats if any. */
> 
> s/Forecibly/Forcibly/
> 
> Typo aside, what does forcibly mean here?

I meant that it should wait for the lock to be acquired, but I don't
recall why. Changed it to follow the "force" flag.


> >      /*
> > -     * Scan through the TabStatusArray struct(s) to find tables that actually
> > -     * have counts, and build messages to send.  We have to separate shared
> > -     * relations from regular ones because the databaseid field in the message
> > -     * header has to depend on that.
> > +     * XX: We cannot lock two dshash entries at once. Since we must keep lock
> 
> Typically we use three XXX (or alternatively NB:).

Fixed. I thought that the number of 'X's represented how bad it is.
(Just kidding).

> > +     * while tables stats are being updated we have no choice other than
> > +     * separating jobs for shared table stats and that of egular tables.
> 
> s/egular/regular/

Fixed

> > +     * Looping over the array twice isapparently ineffcient and more efficient
> > +     * way is expected.
> >       */
> 
> s/isapparently/is apparently/
> s/ineffcient/inefficient/
> 
> But I don't know what this sentence is trying to say precisely. Are you
> saying this cannot be committed unless this is fixed?

I just wanted to say I'd be happy if we could do it all in one
loop. So I rewrote it as follows.

| * Flush pending stats separately for regular tables and shared tables
| * since we cannot hold locks on two dshash entries at once.

> Nor do I understand why it's actually relevant that we cannot lock two
> dshash entries at once. The same table is never shared and unshared,
> no?

Ah, I understand now. A backend accumulates statistics on shared
tables and on tables of the database it is connected to. Shared table
entries are stored under the database entry with id = 0, so we need to
use the database entries with id = 0 and id = MyDatabaseId, both stored
in the db_stats dshash.

Ah, I now see that the database entry for shared tables doesn't
necessarily have to be in the db_stats dshash. Okay. I'll separate the
shared dbentry from the db_stats hash in the next version.
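
A minimal sketch of the two-pass flush described here, not the patch's code. It assumes toy stand-ins (ToyDbEntry, ToyTabStat, toy_flush_all are all hypothetical names) and omits locking; it only shows that each pass touches exactly one database entry, so two dshash entries never need to be held at once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int ToyOid;

/* Toy stand-in for a per-database stats entry (hypothetical). */
typedef struct ToyDbEntry
{
    ToyOid  databaseid;
    long    tuples_inserted;
} ToyDbEntry;

/* Toy stand-in for one backend-local table counter (hypothetical). */
typedef struct ToyTabStat
{
    bool    t_shared;           /* true for shared catalogs */
    long    tuples_inserted;
} ToyTabStat;

/* Apply only entries whose t_shared matches; one db entry is in use per pass. */
static void
toy_flush_pass(ToyTabStat *tabs, size_t ntabs, bool shared, ToyDbEntry *db)
{
    for (size_t i = 0; i < ntabs; i++)
    {
        if (tabs[i].t_shared != shared)
            continue;
        db->tuples_inserted += tabs[i].tuples_inserted;
        tabs[i].tuples_inserted = 0;
    }
}

/* Flush regular tables first, then shared tables, against separate entries. */
void
toy_flush_all(ToyTabStat *tabs, size_t ntabs,
              ToyDbEntry *my_db, ToyDbEntry *shared_db)
{
    assert(my_db != shared_db);
    toy_flush_pass(tabs, ntabs, false, my_db);
    toy_flush_pass(tabs, ntabs, true, shared_db);
}
```

If the shared entry lives outside the db_stats hash, as proposed, the second pass no longer competes for a dshash entry lock at all.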

> > +/*
> > + * Subroutine for pgstat_update_stat.
> > + *
> > + * Appies table stats in table status array merging with pending stats if any.
> 
> s/Appies/Applies/
> 
> 
> > + * If force is true waits until required locks to be acquired. Elsewise stats
> 
> > s/elsewise/otherwise/
> 
> > + * merged stats as pending sats and it will be processed in the next chance.
> 
> s/sats/stats/
> > s/in the next chance/at the next chance/

Sorry for the many silly mistakes (though they are hard for me to
avoid). All fixed. I'll check everything with ispell later.


> > +            /* if pending update exists, it should be applied along with */
> > +            if (pgStatPendingTabHash != NULL)
> 
> Why is any of this done if there's no pending data?

Sorry, but I don't follow it. We cannot do anything to what doesn't exist.

> >              {
> > -                pgstat_send_tabstat(this_msg);
> > -                this_msg->m_nentries = 0;
> > +                pentry = hash_search(pgStatPendingTabHash,
> > +                                     (void *) entry, HASH_FIND, NULL);
> > +
> > +                if (pentry)
> > +                {
> > +                    /* merge new update into pending updates */
> > +                    pgstat_merge_tabentry(pentry, entry, false);
> > +                    entry = pentry;
> > +                }
> > +            }
> > +
> > +            /* try to apply the merged stats */
> > +            if (pgstat_apply_tabstat(cxt, entry, !force))
> > +            {
> > +                /* succeeded. remove it if it was pending stats */
> > +                if (pentry && entry != pentry)
> > +                    hash_search(pgStatPendingTabHash,
> > +                                (void *) pentry, HASH_REMOVE, NULL);
> 
> Huh, how can entry != pentry in the case of pending stats? They're
> literally set to the same value above?

Seems right. Removed. I might have done something complex before..


> > +            else if (!pentry)
> > +            {
> > +                /* failed and there was no pending entry, create new one. */
> > +                bool found;
> > +
> > +                if (pgStatPendingTabHash == NULL)
> > +                {
> > +                    HASHCTL        ctl;
> > +
> > +                    memset(&ctl, 0, sizeof(ctl));
> > +                    ctl.keysize = sizeof(Oid);
> > +                    ctl.entrysize = sizeof(PgStat_TableStatus);
> > +                    pgStatPendingTabHash =
> > +                        hash_create("pgstat pending table stats hash",
> > +                                    TABSTAT_QUANTUM,
> > +                                    &ctl,
> > +                                    HASH_ELEM | HASH_BLOBS);
> > +                }
> > +
> > +                pentry = hash_search(pgStatPendingTabHash,
> > +                                     (void *) entry, HASH_ENTER, &found);
> > +                Assert (!found);
> > +
> > +                *pentry = *entry;
> >              }
> >          }
> > -        /* zero out TableStatus structs after use */
> > -        MemSet(tsa->tsa_entries, 0,
> > -               tsa->tsa_used * sizeof(PgStat_TableStatus));
> > -        tsa->tsa_used = 0;
> > +    }
> 
> I don't understand why we do this at all.

If we didn't have pending stats of the table, and failed to apply
the stats in TSA, we should move it into pending stats hash.

Though we could merge numbers in TSA into pending hash, then
flush pending hash, I prefer to avoid useless relocation of stats
numbers from TSA to pending stats hash. Does it make sense?
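
The flow described above (merge with pending, try to apply, fall back to pending on lock failure) can be sketched as follows. This is a toy model under stated assumptions: the lock is simulated by a boolean, "waiting" is simulated by simply clearing it, and all names (ToySharedEntry, toy_flush_entry, and so on) are hypothetical rather than taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mode bits, mirroring the PGSTAT_FETCH_* defines quoted earlier. */
#define TOY_FETCH_EXCLUSIVE 1
#define TOY_FETCH_NOWAIT    2

#define TOY_MAX_TABLES 8

/* Hypothetical stand-in for one shared stats entry guarded by a lock. */
typedef struct ToySharedEntry
{
    bool    locked_by_other;    /* simulates another backend holding the lock */
    long    tuples_inserted;
} ToySharedEntry;

/* Hypothetical stand-in for a backend-local counter (TSA or pending). */
typedef struct ToyLocalEntry
{
    bool    used;
    long    tuples_inserted;
} ToyLocalEntry;

static ToySharedEntry toy_shared[TOY_MAX_TABLES];
static ToyLocalEntry  toy_pending[TOY_MAX_TABLES];

/* Add delta to the shared entry; fail fast if NOWAIT is set and it is locked. */
static bool
toy_apply_to_shared(int table, long delta, int mode)
{
    ToySharedEntry *s = &toy_shared[table];

    if (s->locked_by_other)
    {
        if (mode & TOY_FETCH_NOWAIT)
            return false;
        /* A real implementation would block here; the toy just proceeds. */
        s->locked_by_other = false;
    }
    s->tuples_inserted += delta;
    return true;
}

/* Flush one local entry. Returns true if the numbers reached shared stats. */
bool
toy_flush_entry(int table, ToyLocalEntry *local, bool force)
{
    ToyLocalEntry *pending;
    int     mode = TOY_FETCH_EXCLUSIVE | (force ? 0 : TOY_FETCH_NOWAIT);
    long    delta;

    assert(table >= 0 && table < TOY_MAX_TABLES);
    pending = &toy_pending[table];

    /* Merge the new numbers with whatever was held back earlier. */
    delta = local->tuples_inserted;
    if (pending->used)
        delta += pending->tuples_inserted;
    local->tuples_inserted = 0;

    if (toy_apply_to_shared(table, delta, mode))
    {
        pending->used = false;
        pending->tuples_inserted = 0;
        return true;
    }

    /* Lock not available: keep the merged numbers as pending. */
    pending->used = true;
    pending->tuples_inserted = delta;
    return false;
}
```

Numbers move into the pending area only when the direct update fails, which is the "avoid useless relocation" preference stated above.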

> 
> > +    if (cxt->tabhash)
> > +        dshash_detach(cxt->tabhash);
> 
> Huh, why do we detach here?

To release the lock on cxt->dbentry. It may be destroyed.

> > +/*
> > + * pgstat_apply_tabstat: update shared stats entry using given entry
> > + *
> > + * If nowait is true, just returns false on lock failure.  Dshashes for table
> > + * and function stats are kept attached and stored in ctx. The caller must
> > + * detach them after use.
> > + */
> > +bool
> > +pgstat_apply_tabstat(pgstat_apply_tabstat_context *cxt,
> > +                     PgStat_TableStatus *entry, bool nowait)
> > +{
> > +    Oid dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
> > +    int        table_mode = PGSTAT_FETCH_EXCLUSIVE;
> > +    bool updated = false;
> > +
> > +    if (nowait)
> > +        table_mode |= PGSTAT_FETCH_NOWAIT;
> > +
> > +    /*
> > +     * We need to keep lock on dbentries for regular tables to avoid race
> > +     * condition with drop database. So we hold it in the context variable. We
> > +     * don't need that for shared tables.
> > +     */
> > +    if (!cxt->dbentry)
> > +        cxt->dbentry = pgstat_get_db_entry(dboid, table_mode, NULL);
> 
> Oh, wait, what? *That's* the reason why we need to hold a lock on a
> second entry?

Yeah, one of the reasons.

> Uhm, how can this actually be an issue? If we apply pending stats, we're
> connected to the database, it therefore cannot be dropped while we're
> applying stats, no?

Ah!!!!uch.  You're right. I'll consider that in the next version
soon. Thanks for the insight.

> > +    /* attach shared stats table if not yet */
> > +    if (!cxt->tabhash)
> > +    {
> > +        /* apply database stats  */
> > +        if (!entry->t_shared)
> > +        {
> > +            /* Update database-wide stats  */
> > +            cxt->dbentry->n_xact_commit += pgStatXactCommit;
> > +            cxt->dbentry->n_xact_rollback += pgStatXactRollback;
> > +            cxt->dbentry->n_block_read_time += pgStatBlockReadTime;
> > +            cxt->dbentry->n_block_write_time += pgStatBlockWriteTime;
> > +            pgStatXactCommit = 0;
> > +            pgStatXactRollback = 0;
> > +            pgStatBlockReadTime = 0;
> > +            pgStatBlockWriteTime = 0;
> > +        }
> 
> Uh, this seems to have nothing to do with "attach shared stats table if
> not yet".

It's because the database stats need to be applied once, and
attaching tabhash happens once per database. But it actually
looks somewhat strange. In other words, I used cxt->tabhash as
the flag that indicates whether applying database stats is
required. I rewrote the comment there as follows. Is it
acceptable?

|  * If we haven't attached the tabhash, we didn't apply database stats
|  * yet. So apply it now..


> > +    /* create hash if not yet */
> > +    if (dbentry->functions == DSM_HANDLE_INVALID)
> > +    {
> > +        funchash = dshash_create(area, &dsh_funcparams, 0);
> > +        dbentry->functions = dshash_get_hash_table_handle(funchash);
> > +    }
> > +    else
> > +        funchash = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
> 
> Why is this created on-demand?

The reason is that function stats are optional, but one dshash
takes 1MB memory at creation time.
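
A small sketch of this create-on-first-use pattern, assuming toy stand-ins for the dshash calls (toy_hash_create, toy_hash_attach and the handle type are hypothetical, not the dshash API). The invalid-handle sentinel plays the role of DSM_HANDLE_INVALID in the quoted code.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int toy_handle;            /* stands in for a dshash handle */
#define TOY_HANDLE_INVALID ((toy_handle) 0) /* stands in for DSM_HANDLE_INVALID */
#define TOY_MAX_HASHES 4

/* Toy hash object; a real one would own shared memory. */
typedef struct ToyFuncHash
{
    toy_handle  handle;
    int         nattached;
} ToyFuncHash;

typedef struct ToyDbEntry
{
    toy_handle  functions;      /* invalid until the first function stat arrives */
} ToyDbEntry;

static ToyFuncHash toy_registry[TOY_MAX_HASHES];   /* slot i has handle i + 1 */
static int  toy_ncreated = 0;

/* Allocate a new hash; this is the expensive step we want to defer. */
static ToyFuncHash *
toy_hash_create(void)
{
    ToyFuncHash *h;

    assert(toy_ncreated < TOY_MAX_HASHES);
    h = &toy_registry[toy_ncreated];
    toy_ncreated++;
    h->handle = (toy_handle) toy_ncreated;
    h->nattached = 1;
    return h;
}

/* Attach to an existing hash by handle. */
static ToyFuncHash *
toy_hash_attach(toy_handle handle)
{
    ToyFuncHash *h;

    assert(handle != TOY_HANDLE_INVALID && (int) handle <= toy_ncreated);
    h = &toy_registry[handle - 1];
    h->nattached++;
    return h;
}

/* Create the function hash on first use, otherwise attach to it. */
ToyFuncHash *
toy_get_function_hash(ToyDbEntry *db)
{
    if (db->functions == TOY_HANDLE_INVALID)
    {
        ToyFuncHash *h = toy_hash_create();

        db->functions = h->handle;
        return h;
    }
    return toy_hash_attach(db->functions);
}
```

A database that never records function stats never pays the creation cost in this scheme.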

> > +    /*
> > +     * First, we empty the transaction stats. Just move numbers to pending
> > +     * stats if any. Elsewise try to directly update the shared stats but
> > +     * create a new pending entry on lock failure.
> > +     */
> > +    if (pgStatFunctions)
> 
> > I don't understand why we have both pgStatFunctions and
> > pgStatPendingFunctions (and the same for other such pairs). That
> > seems to make no sense to me. The comment for the former literally is:
>  /*
>   * Backends store per-function info that's waiting to be sent to the collector
>   * in this hash table (indexed by function OID).
>   */

I must have done that by naively following pgStatTabHash. I'll
remove pgStatPendingFunctions in the next version.

> >  static HTAB *
> >  pgstat_collect_oids(Oid catalogid)
> > @@ -1241,62 +1173,54 @@ pgstat_collect_oids(Oid catalogid)
> >  /* ----------
> >   * pgstat_drop_database() -
> >   *
> > - *    Tell the collector that we just dropped a database.
> > - *    (If the message gets lost, we will still clean the dead DB eventually
> > - *    via future invocations of pgstat_vacuum_stat().)
> > + *    Remove entry for the database that we just dropped.
> > + *
> > + *  If some stats update happens after this, this entry will re-created but
> > + *    we will still clean the dead DB eventually via future invocations of
> > + *    pgstat_vacuum_stat().
> >   * ----------
> >   */
> > +
> >  void
> >  pgstat_drop_database(Oid databaseid)
> >  {
> 
> Mixed indentation, added newline.

Fixed.

> > +/*
> > + * snapshot_statentry() - Find an entriy from source dshash.
> > + *
> 
> s/entriy/entry/
> 
> 
> Ok, getting too tired now. Two AM in an airport lounge is not the
> easiest place and time to concentrate...

Thank you very much for reviewing and sorry for the slow
response.

> I don't think this is all that close to being committable :(

I'm going to work harder on this. The remaining tasks for now are
as follows:

- Separate shared database stats from db_stats hash.

- Consider relaxing dbentry locking.

- Try removing pgStatPendingFunctions

- Run ispell on it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From e303d4de0604790aabf7a57266457e0822c9e3af Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:41:04 +0900
Subject: [PATCH 1/5] sequential scan for dshash

Add sequential scan feature to dshash.
---
 src/backend/lib/dshash.c | 188 ++++++++++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  23 +++++-
 2 files changed, 206 insertions(+), 5 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index f095196fb6..d1908a6137 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -112,6 +112,7 @@ struct dshash_table
     size_t        size_log2;        /* log2(number of buckets) */
     bool        find_locked;    /* Is any partition lock held by 'find'? */
     bool        find_exclusively_locked;    /* ... exclusively? */
+    bool        seqscan_running;/* now under sequential scan */
 };
 
 /* Given a pointer to an item, find the entry (user data) it holds. */
@@ -127,6 +128,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +158,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -228,6 +237,7 @@ dshash_create(dsa_area *area, const dshash_parameters *params, void *arg)
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
 
     /*
      * Set up the initial array of buckets.  Our initial size is the same as
@@ -279,6 +289,7 @@ dshash_attach(dsa_area *area, const dshash_parameters *params,
     hash_table->control = dsa_get_address(area, control);
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
     Assert(hash_table->control->magic == DSHASH_MAGIC);
 
     /*
@@ -324,7 +335,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
                                 LW_EXCLUSIVE));
 
     delete_item(hash_table, item);
-    hash_table->find_locked = false;
-    hash_table->find_exclusively_locked = false;
-    LWLockRelease(PARTITION_LOCK(hash_table, partition));
+
+    /* We need to keep the partition lock during a sequential scan */
+    if (!hash_table->seqscan_running)
+    {
+        hash_table->find_locked = false;
+        hash_table->find_exclusively_locked = false;
+        LWLockRelease(PARTITION_LOCK(hash_table, partition));
+    }
 }
 
 /*
@@ -568,6 +584,8 @@ dshash_release_lock(dshash_table *hash_table, void *entry)
     Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition_index),
                                 hash_table->find_exclusively_locked
                                 ? LW_EXCLUSIVE : LW_SHARED));
+    /* lock is under control of sequential scan */
+    Assert(!hash_table->seqscan_running);
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
@@ -592,6 +610,168 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should be called if and only if the scan is abandoned
+ * before completion; if dshash_seq_next returns NULL then it has already done
+ * the end-of-scan cleanup.
+ *
+ * On returning element, it is locked as is the case with dshash_find.
+ * However, the caller must not release the lock. The lock is released as
+ * necessary in continued scan.
+ *
+ * As opposed to the equivalent for dynahash, the caller is not supposed to
+ * delete the returned element before continuing the scan.
+ *
+ * If consistent is set for dshash_seq_init, the whole hash table is
+ * non-exclusively locked. Otherwise a part of the hash table is locked in the
+ * same mode (partition lock).
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool consistent, bool exclusive)
+{
+    /* allowed at most one scan at once */
+    Assert(!hash_table->seqscan_running);
+
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->consistent = consistent;
+    status->exclusive = exclusive;
+    hash_table->seqscan_running = true;
+
+    /*
+     * Protect all partitions from modification if the caller wants a
+     * consistent result.
+     */
+    if (consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+        {
+            Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i)));
+
+            LWLockAcquire(PARTITION_LOCK(hash_table, i),
+                          exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        }
+        ensure_valid_bucket_pointers(hash_table);
+    }
+}
+
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    Assert(status->hash_table->seqscan_running);
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert (status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        if (!status->consistent)
+        {
+            partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+            LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            status->curpartition = partition;
+
+            /* resize doesn't happen from now until seq scan ends */
+            status->nbuckets =
+                NUM_BUCKETS(status->hash_table->control->size_log2);
+            ensure_valid_bucket_pointers(status->hash_table);
+        }
+        
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            dshash_seq_term(status);
+            return NULL;
+        }
+
+        /* Also move partition lock if needed */
+        if (!status->consistent)
+        {
+            int next_partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+
+            /* Move lock along with partition for the bucket */
+            if (status->curpartition != next_partition)
+            {
+                /*
+                 * Take lock on the next partition then release the current,
+                 * not in the reverse order. This is required to avoid
+                 * resizing from happening during a sequential scan. Locks are
+                 * taken in partition order so no deadlock happens with other
+                 * seq scans or resizing.
+                 */
+                LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                             next_partition),
+                              status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+                LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                             status->curpartition));
+                status->curpartition = next_partition;
+            }
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * This item can be deleted by the caller. Store the next item for the
+     * next iteration for the occasion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    Assert(status->hash_table->seqscan_running);
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+    status->hash_table->seqscan_running = false;
+
+    if (status->consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+            LWLockRelease(PARTITION_LOCK(status->hash_table, i));
+    }
+    else if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index e5dfd57f0a..b80f3af995 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,23 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state of dshash. The detail is exposed since the storage
+ * size should be known to users but it should be considered as an opaque
+ * type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;
+    int                    curbucket;
+    int                    nbuckets;
+    dshash_table_item  *curitem;
+    dsa_pointer            pnextitem;
+    int                    curpartition;
+    bool                consistent;
+    bool                exclusive;
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
               const dshash_parameters *params,
@@ -70,7 +87,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
 extern void dshash_detach(dshash_table *hash_table);
 extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
 extern void dshash_destroy(dshash_table *hash_table);
-
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
             const void *key, bool exclusive);
@@ -80,6 +96,11 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool consistent, bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.16.3
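
The partition lock handover used by dshash_seq_next in the non-consistent mode can be modelled in isolation. The sketch below is a toy, not the patch: locks are counters, the partition count is 4 (PARTITIONS_LOG2 = 2 is an assumption chosen for brevity; it is not the dshash value), and the ToyScanTrace type and helper names are hypothetical. The bucket/partition macros follow the shape of those added in the patch above.

```c
#include <assert.h>
#include <stddef.h>

#define PARTITIONS_LOG2 2       /* toy value chosen for brevity */
#define NUM_PARTITIONS (1 << PARTITIONS_LOG2)

/* Same shape as the macros added by the patch. */
#define NUM_SPLITS(size_log2) ((size_log2) - PARTITIONS_LOG2)
#define NUM_BUCKETS(size_log2) (((size_t) 1) << (size_log2))
#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2) \
    ((bucket_idx) >> NUM_SPLITS(size_log2))

/* Records which toy locks are held and how the scan behaved. */
typedef struct ToyScanTrace
{
    int     held[NUM_PARTITIONS];
    int     nheld;
    int     max_held;
    int     handovers;
} ToyScanTrace;

static void
toy_lock_partition(ToyScanTrace *t, int p)
{
    assert(!t->held[p]);
    t->held[p] = 1;
    t->nheld++;
    if (t->nheld > t->max_held)
        t->max_held = t->nheld;
}

static void
toy_unlock_partition(ToyScanTrace *t, int p)
{
    assert(t->held[p]);
    t->held[p] = 0;
    t->nheld--;
}

/* Walk all buckets in order, moving the partition lock along with the scan. */
size_t
toy_scan_buckets(ToyScanTrace *t, size_t size_log2)
{
    size_t  nbuckets = NUM_BUCKETS(size_log2);
    size_t  visited = 0;
    int     cur;

    assert(size_log2 >= PARTITIONS_LOG2);
    cur = (int) PARTITION_FOR_BUCKET_INDEX((size_t) 0, size_log2);
    toy_lock_partition(t, cur);

    for (size_t b = 0; b < nbuckets; b++)
    {
        int     next = (int) PARTITION_FOR_BUCKET_INDEX(b, size_log2);

        if (next != cur)
        {
            /* Take the next lock first, then release the current one. */
            toy_lock_partition(t, next);
            toy_unlock_partition(t, cur);
            assert(t->nheld >= 1);  /* never lock-free in mid-scan */
            cur = next;
            t->handovers++;
        }
        visited++;
    }

    toy_unlock_partition(t, cur);
    return visited;
}
```

Because at least one partition lock is held from the first bucket to the last, a resize (which needs every partition lock) cannot slip in between two partitions of the scan.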

From 9fe0a7aab8d482e741360bd3d9b011fcd03deffa Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 27 Sep 2018 11:15:19 +0900
Subject: [PATCH 2/5] Add conditional lock feature to dshash

Dshash currently waits for a lock unconditionally. This commit adds
extended variants of dshash_find and dshash_find_or_insert. The new
interfaces take an extra parameter "nowait" that tells them not to wait
for the lock.
---
 src/backend/lib/dshash.c | 69 +++++++++++++++++++++++++++++++++++++++++++-----
 src/include/lib/dshash.h |  6 ++++-
 2 files changed, 67 insertions(+), 8 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index d1908a6137..db8d6899af 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -394,19 +394,48 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  */
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
+{
+    return dshash_find_extended(hash_table, key, exclusive, false, NULL);
+}
+
+/*
+ * Variant of dshash_find that returns immediately when nowait is true and
+ * the lock was not acquired. If lock_acquired is not NULL, *lock_acquired is
+ * set to whether the lock was acquired.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool *lock_acquired)
 {
     dshash_hash hash;
     size_t        partition;
     dshash_table_item *item;
 
+    /* reporting the lock status is sensible only when nowait is set */
+    Assert(nowait || !lock_acquired);
+
     hash = hash_key(hash_table, key);
     partition = PARTITION_FOR_HASH(hash);
 
     Assert(hash_table->control->magic == DSHASH_MAGIC);
     Assert(!hash_table->find_locked);
 
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partition),
+                                      exclusive ? LW_EXCLUSIVE : LW_SHARED))
+        {
+            if (lock_acquired)
+                *lock_acquired = false;
+            return NULL;
+        }
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition),
+                      exclusive ? LW_EXCLUSIVE : LW_SHARED);
+
+    if (lock_acquired)
+        *lock_acquired = true;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -441,6 +470,22 @@ void *
 dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
+{
+    return dshash_find_or_insert_extended(hash_table, key, found, false);
+}
+
+/*
+ * Variant of dshash_find_or_insert that returns NULL if nowait is true and
+ * the lock was not acquired.
+ *
+ * Notes above dshash_find_extended() regarding locking and error handling
+ * equally apply here.
+ */
+void *
+dshash_find_or_insert_extended(dshash_table *hash_table,
+                               const void *key,
+                               bool *found,
+                               bool nowait)
 {
     dshash_hash hash;
     size_t        partition_index;
@@ -455,8 +500,16 @@ dshash_find_or_insert(dshash_table *hash_table,
     Assert(!hash_table->find_locked);
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(
+                PARTITION_LOCK(hash_table, partition_index),
+                LW_EXCLUSIVE))
+            return NULL;
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
+                      LW_EXCLUSIVE);
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -626,9 +679,11 @@ dshash_memhash(const void *v, size_t size, void *arg)
  * As opposed to the equivalent for dynahash, the caller is not supposed to
  * delete the returned element before continuing the scan.
  *
- * If consistent is set for dshash_seq_init, the whole hash table is
- * non-exclusively locked. Otherwise a part of the hash table is locked in the
- * same mode (partition lock).
+ * If consistent is set for dshash_seq_init, all hash table partitions
+ * are locked in the requested mode (as determined by the exclusive
+ * flag), and the locks are held until the end of the scan. Otherwise
+ * the partition locks are acquired and released as needed during the
+ * scan (up to two partitions may be locked at the same time).
  */
 void
 dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b80f3af995..fe1d4d75c5 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -90,8 +90,12 @@ extern void dshash_destroy(dshash_table *hash_table);
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
             const void *key, bool exclusive);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+            bool exclusive, bool nowait, bool *lock_acquired);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
-                      const void *key, bool *found);
+            const void *key, bool *found);
+extern void *dshash_find_or_insert_extended(dshash_table *hash_table,
+            const void *key, bool *found, bool nowait);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.16.3

From bcba035d42ce49ff0d5f2a2fb442a887c365d2f7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 7 Nov 2018 16:53:49 +0900
Subject: [PATCH 3/5] Make archiver process an auxiliary process

This is a preliminary patch for the shared-memory based stats collector.
The archiver process must be an auxiliary process because it needs to
access shared memory once the stats data is moved into shared memory.
Make the process an auxiliary process so that it keeps working.
---
 src/backend/bootstrap/bootstrap.c   |  8 +++
 src/backend/postmaster/pgarch.c     | 98 +++++++++----------------------------
 src/backend/postmaster/pgstat.c     |  6 +++
 src/backend/postmaster/postmaster.c | 35 +++++++++----
 src/include/miscadmin.h             |  2 +
 src/include/pgstat.h                |  1 +
 src/include/postmaster/pgarch.h     |  4 +-
 7 files changed, 67 insertions(+), 87 deletions(-)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 4d7ed8ad1a..a6c3338d40 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -328,6 +328,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case BgWriterProcess:
                 statmsg = pgstat_get_backend_desc(B_BG_WRITER);
                 break;
+            case ArchiverProcess:
+                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
+                break;
             case CheckpointerProcess:
                 statmsg = pgstat_get_backend_desc(B_CHECKPOINTER);
                 break;
@@ -455,6 +458,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
             BackgroundWriterMain();
             proc_exit(1);        /* should never return */
 
+        case ArchiverProcess:
+            /* don't set signals, archiver has its own agenda */
+            PgArchiverMain();
+            proc_exit(1);        /* should never return */
+
         case CheckpointerProcess:
             /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index f84f882c4c..4342ebdab4 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -77,7 +77,6 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
@@ -96,7 +95,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgarch_exit(SIGNAL_ARGS);
 static void ArchSigHupHandler(SIGNAL_ARGS);
 static void ArchSigTermHandler(SIGNAL_ARGS);
@@ -114,75 +112,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -222,8 +151,8 @@ pgarch_forkexec(void)
  *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
  *    since we don't use 'em, it hardly matters...
  */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -255,8 +184,27 @@ PgArchiverMain(int argc, char *argv[])
 static void
 pgarch_exit(SIGNAL_ARGS)
 {
-    /* SIGQUIT means curl up and die ... */
-    exit(1);
+    PG_SETMASK(&BlockSig);
+
+    /*
+     * We DO NOT want to run proc_exit() callbacks -- we're here because
+     * shared memory may be corrupted, so we don't want to try to clean up our
+     * transaction.  Just nail the windows shut and get out of town.  Now that
+     * there's an atexit callback to prevent third-party code from breaking
+     * things by calling exit() directly, we have to reset the callbacks
+     * explicitly to make this work as intended.
+     */
+    on_exit_reset();
+
+    /*
+     * Note we do exit(2) not exit(0).  This is to force the postmaster into a
+     * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+     * backend.  This is necessary precisely because we don't clean up our
+     * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+     * should ensure the postmaster sees this as a crash, too, but no harm in
+     * being doubly sure.)
+     */
+    exit(2);
 }
 
 /* SIGHUP signal handler for archiver process */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 81c6499251..9e6bce8f6a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2856,6 +2856,9 @@ pgstat_bestart(void)
             case BgWriterProcess:
                 beentry->st_backendType = B_BG_WRITER;
                 break;
+            case ArchiverProcess:
+                beentry->st_backendType = B_ARCHIVER;
+                break;
             case CheckpointerProcess:
                 beentry->st_backendType = B_CHECKPOINTER;
                 break;
@@ -4120,6 +4123,9 @@ pgstat_get_backend_desc(BackendType backendType)
         case B_BG_WRITER:
             backendDesc = "background writer";
             break;
+        case B_ARCHIVER:
+            backendDesc = "archiver";
+            break;
         case B_CHECKPOINTER:
             backendDesc = "checkpointer";
             break;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index ccea231e98..a663a62fd5 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -146,7 +146,8 @@
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
 #define BACKEND_TYPE_BGWORKER    0x0008    /* bgworker process */
-#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
+#define BACKEND_TYPE_ARCHIVER    0x0010    /* archiver process */
+#define BACKEND_TYPE_ALL        0x001F    /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -539,6 +540,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
 #define StartWalReceiver()        StartChildProcess(WalReceiverProcess)
@@ -1761,7 +1763,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -2924,7 +2926,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3069,10 +3071,8 @@ reaper(SIGNAL_ARGS)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3318,7 +3318,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3523,6 +3523,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3799,6 +3811,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5068,7 +5081,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5346,6 +5359,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case CheckpointerProcess:
                 ereport(LOG,
                         (errmsg("could not fork checkpointer process: %m")));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index c9e35003a5..63a7653457 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -399,6 +399,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -411,6 +412,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 88a75fb798..471877d2df 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -706,6 +706,7 @@ typedef enum BackendType
     B_BACKEND,
     B_BG_WORKER,
     B_BG_WRITER,
+    B_ARCHIVER,
     B_CHECKPOINTER,
     B_STARTUP,
     B_WAL_RECEIVER,
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index 2474eac26a..88f16863d4 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
-- 
2.16.3
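
The postmaster.c hunk above adds a new backend type bit and widens
BACKEND_TYPE_ALL from 0x000F to 0x001F. The small sketch below shows why
both changes have to go together. The DEMO_ names and demo_type_matches
are hypothetical and only mirror the values shown in the diff, plus 0x0001
for a normal backend as in postmaster.c.

```c
#include <stdbool.h>

/* Hypothetical copies of the bit flags shown in the diff above. */
#define DEMO_TYPE_NORMAL    0x0001  /* normal backend */
#define DEMO_TYPE_AUTOVAC   0x0002  /* autovacuum worker process */
#define DEMO_TYPE_WALSND    0x0004  /* walsender process */
#define DEMO_TYPE_BGWORKER  0x0008  /* bgworker process */
#define DEMO_TYPE_ARCHIVER  0x0010  /* archiver process (new) */
#define DEMO_TYPE_ALL       0x001F  /* OR of all the above */
#define DEMO_TYPE_ALL_OLD   0x000F  /* mask before the archiver bit */

/* True if a process of backend_type is selected by target_mask. */
static bool
demo_type_matches(int backend_type, int target_mask)
{
    return (backend_type & target_mask) != 0;
}
```

With the old mask, an "all backends" selection would silently skip the
archiver, which is why the patch updates the mask along with the new bit.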

From 4097295b28ffcb308c5706c896d908c58f1b9ae7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 12 Nov 2018 17:26:33 +0900
Subject: [PATCH 4/5] Shared-memory based stats collector

Previously, activity statistics were shared via files on disk. Every
backend sends its numbers to the stats collector process via a socket.
The collector writes snapshots as a set of files on disk at a certain
interval, and every backend reads them as necessary. This worked fine
for a comparatively small set of statistics, but the set is under
pressure to grow and the file size has reached the order of megabytes.
To deal with a larger set of statistics, this patch lets backends share
the statistics directly via shared memory.
---
 contrib/pg_prewarm/autoprewarm.c                   |    2 +-
 contrib/pg_stat_statements/pg_stat_statements.c    |    1 +
 contrib/postgres_fdw/connection.c                  |    2 +-
 src/backend/Makefile                               |    2 +-
 src/backend/access/heap/rewriteheap.c              |    4 +-
 src/backend/access/heap/vacuumlazy.c               |    1 +
 src/backend/access/nbtree/nbtree.c                 |    2 +-
 src/backend/access/nbtree/nbtsort.c                |    2 +-
 src/backend/access/transam/clog.c                  |    2 +-
 src/backend/access/transam/parallel.c              |    1 +
 src/backend/access/transam/slru.c                  |    2 +-
 src/backend/access/transam/timeline.c              |    2 +-
 src/backend/access/transam/twophase.c              |    2 +
 src/backend/access/transam/xact.c                  |    3 +
 src/backend/access/transam/xlog.c                  |    5 +-
 src/backend/access/transam/xlogfuncs.c             |    2 +-
 src/backend/access/transam/xlogutils.c             |    2 +-
 src/backend/bootstrap/bootstrap.c                  |    8 +-
 src/backend/executor/execParallel.c                |    2 +-
 src/backend/executor/nodeBitmapHeapscan.c          |    1 +
 src/backend/executor/nodeGather.c                  |    1 +
 src/backend/executor/nodeHash.c                    |    2 +-
 src/backend/executor/nodeHashjoin.c                |    2 +-
 src/backend/libpq/be-secure-openssl.c              |    2 +-
 src/backend/libpq/be-secure.c                      |    2 +-
 src/backend/libpq/pqmq.c                           |    2 +-
 src/backend/postmaster/Makefile                    |    2 +-
 src/backend/postmaster/autovacuum.c                |   46 +-
 src/backend/postmaster/bgworker.c                  |    2 +-
 src/backend/postmaster/bgwriter.c                  |    5 +-
 src/backend/postmaster/checkpointer.c              |   17 +-
 src/backend/postmaster/pgarch.c                    |    5 +-
 src/backend/postmaster/pgstat.c                    | 6385 --------------------
 src/backend/postmaster/postmaster.c                |   86 +-
 src/backend/postmaster/syslogger.c                 |    2 +-
 src/backend/postmaster/walwriter.c                 |    2 +-
 src/backend/replication/basebackup.c               |    1 +
 .../libpqwalreceiver/libpqwalreceiver.c            |    2 +-
 src/backend/replication/logical/launcher.c         |    2 +-
 src/backend/replication/logical/origin.c           |    3 +-
 src/backend/replication/logical/reorderbuffer.c    |    2 +-
 src/backend/replication/logical/snapbuild.c        |    2 +-
 src/backend/replication/logical/tablesync.c        |   15 +-
 src/backend/replication/logical/worker.c           |    5 +-
 src/backend/replication/slot.c                     |    2 +-
 src/backend/replication/syncrep.c                  |    2 +-
 src/backend/replication/walreceiver.c              |    2 +-
 src/backend/replication/walsender.c                |    2 +-
 src/backend/statmon/Makefile                       |   17 +
 src/backend/statmon/bestatus.c                     | 1781 ++++++
 src/backend/statmon/pgstat.c                       | 3962 ++++++++++++
 src/backend/storage/buffer/bufmgr.c                |    1 +
 src/backend/storage/file/buffile.c                 |    2 +-
 src/backend/storage/file/copydir.c                 |    2 +-
 src/backend/storage/file/fd.c                      |    1 +
 src/backend/storage/ipc/dsm.c                      |   24 +-
 src/backend/storage/ipc/dsm_impl.c                 |    2 +-
 src/backend/storage/ipc/ipci.c                     |    6 +
 src/backend/storage/ipc/latch.c                    |    2 +-
 src/backend/storage/ipc/procarray.c                |    2 +-
 src/backend/storage/ipc/shm_mq.c                   |    2 +-
 src/backend/storage/ipc/standby.c                  |    2 +-
 src/backend/storage/lmgr/deadlock.c                |    1 +
 src/backend/storage/lmgr/lwlock.c                  |    5 +-
 src/backend/storage/lmgr/lwlocknames.txt           |    1 +
 src/backend/storage/lmgr/predicate.c               |    2 +-
 src/backend/storage/lmgr/proc.c                    |    2 +-
 src/backend/storage/smgr/md.c                      |    2 +-
 src/backend/tcop/postgres.c                        |   28 +-
 src/backend/utils/adt/misc.c                       |    2 +-
 src/backend/utils/adt/pgstatfuncs.c                |   51 +-
 src/backend/utils/cache/relmapper.c                |    2 +-
 src/backend/utils/init/globals.c                   |    1 +
 src/backend/utils/init/miscinit.c                  |    2 +-
 src/backend/utils/init/postinit.c                  |   15 +
 src/backend/utils/misc/guc.c                       |    1 +
 src/bin/pg_basebackup/t/010_pg_basebackup.pl       |    2 +-
 src/include/bestatus.h                             |  555 ++
 src/include/miscadmin.h                            |    2 +-
 src/include/pgstat.h                               |  937 +--
 src/include/storage/dsm.h                          |    3 +
 src/include/storage/lwlock.h                       |    3 +
 src/include/utils/timeout.h                        |    1 +
 src/test/modules/worker_spi/worker_spi.c           |    2 +-
 84 files changed, 6597 insertions(+), 7480 deletions(-)
 delete mode 100644 src/backend/postmaster/pgstat.c
 create mode 100644 src/backend/statmon/Makefile
 create mode 100644 src/backend/statmon/bestatus.c
 create mode 100644 src/backend/statmon/pgstat.c
 create mode 100644 src/include/bestatus.h

diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index 9cc4b2dc83..406efbd49b 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -30,10 +30,10 @@
 
 #include "access/relation.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "catalog/pg_class.h"
 #include "catalog/pg_type.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/bgworker.h"
 #include "storage/buf_internals.h"
 #include "storage/dsm.h"
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index ea2e4bc242..7084960742 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -62,6 +62,7 @@
 #include <unistd.h>
 
 #include "access/hash.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 239d220c24..1ea71245df 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -15,11 +15,11 @@
 #include "postgres_fdw.h"
 
 #include "access/htup_details.h"
+#include "bestatus.h"
 #include "catalog/pg_user_mapping.h"
 #include "access/xact.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/latch.h"
 #include "utils/hsearch.h"
 #include "utils/inval.h"
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 478a96db9b..cc511672c9 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -20,7 +20,7 @@ include $(top_builddir)/src/Makefile.global
 SUBDIRS = access bootstrap catalog parser commands executor foreign lib libpq \
     main nodes optimizer partitioning port postmaster \
     regex replication rewrite \
-    statistics storage tcop tsearch utils $(top_builddir)/src/timezone \
+    statistics statmon storage tcop tsearch utils $(top_builddir)/src/timezone \
     jit
 
 include $(srcdir)/common.mk
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index f5cf9ffc9c..adfd5f40fd 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -115,12 +115,12 @@
 #include "access/xact.h"
 #include "access/xloginsert.h"
 
+#include "bestatus.h"
+
 #include "catalog/catalog.h"
 
 #include "lib/ilist.h"
 
-#include "pgstat.h"
-
 #include "replication/logical.h"
 #include "replication/slot.h"
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 9416c31889..928d53a68c 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -44,6 +44,7 @@
 #include "access/transam.h"
 #include "access/visibilitymap.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/storage.h"
 #include "commands/dbcommands.h"
 #include "commands/progress.h"
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 98917de2ef..69cd211369 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -22,10 +22,10 @@
 #include "access/nbtxlog.h"
 #include "access/relscan.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
-#include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "storage/condition_variable.h"
 #include "storage/indexfsm.h"
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index dc398e1186..c2a3ed0209 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -64,9 +64,9 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "bestatus.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/smgr.h"
 #include "tcop/tcopprot.h"        /* pgrminclude ignore */
 #include "utils/rel.h"
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index aa089d83fa..cf034ba333 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -38,8 +38,8 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "pg_trace.h"
 #include "storage/proc.h"
 
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index ce2b61631d..8d5cbfa41d 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -19,6 +19,7 @@
 #include "access/session.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/pg_enum.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 3623352b9c..a28fe474aa 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -54,7 +54,7 @@
 #include "access/slru.h"
 #include "access/transam.h"
 #include "access/xlog.h"
-#include "pgstat.h"
+#include "bestatus.h"
 #include "storage/fd.h"
 #include "storage/shmem.h"
 #include "miscadmin.h"
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index c96c8b60ba..bbe9c0eb5f 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -38,7 +38,7 @@
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogdefs.h"
-#include "pgstat.h"
+#include "bestatus.h"
 #include "storage/fd.h"
 
 /*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 9a8a6bb119..0dc9f39424 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -87,6 +87,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "access/xlogreader.h"
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "catalog/storage.h"
 #include "funcapi.h"
@@ -1569,6 +1570,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
     PredicateLockTwoPhaseFinish(xid, isCommit);
 
     /* Count the prepared xact as committed or aborted */
+    AtEOXact_BEStatus(isCommit);
     AtEOXact_PgStat(isCommit);
 
     /*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 92bda87804..ac32c44e05 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -30,6 +30,7 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
 #include "catalog/storage.h"
@@ -2148,6 +2149,7 @@ CommitTransaction(void)
     AtEOXact_Files(true);
     AtEOXact_ComboCid();
     AtEOXact_HashTables(true);
+    AtEOXact_BEStatus(true);
     AtEOXact_PgStat(true);
     AtEOXact_Snapshot(true, false);
     AtEOXact_ApplyLauncher(true);
@@ -2641,6 +2643,7 @@ AbortTransaction(void)
         AtEOXact_Files(false);
         AtEOXact_ComboCid();
         AtEOXact_HashTables(false);
+        AtEOXact_BEStatus(false);
         AtEOXact_PgStat(false);
         AtEOXact_ApplyLauncher(false);
         pgstat_report_xact_timestamp(0);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ecd12fc53a..c0f0a7195c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -36,6 +36,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
 #include "catalog/pg_database.h"
@@ -8420,9 +8421,9 @@ LogCheckpointEnd(bool restartpoint)
                         &sync_secs, &sync_usecs);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time +=
+    BgWriterStats.checkpoint_write_time +=
         write_secs * 1000 + write_usecs / 1000;
-    BgWriterStats.m_checkpoint_sync_time +=
+    BgWriterStats.checkpoint_sync_time +=
         sync_secs * 1000 + sync_usecs / 1000;
 
     /*
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index b35043bf71..683c41575f 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -23,9 +23,9 @@
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "funcapi.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/walreceiver.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 10a663bae6..53fa4890e9 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -23,8 +23,8 @@
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/smgr.h"
 #include "utils/guc.h"
 #include "utils/hsearch.h"
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index a6c3338d40..79f624f0e0 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -22,6 +22,7 @@
 #include "access/htup_details.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/index.h"
 #include "catalog/pg_collation.h"
@@ -328,9 +329,6 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case BgWriterProcess:
                 statmsg = pgstat_get_backend_desc(B_BG_WRITER);
                 break;
-            case ArchiverProcess:
-                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
-                break;
             case CheckpointerProcess:
                 statmsg = pgstat_get_backend_desc(B_CHECKPOINTER);
                 break;
@@ -340,6 +338,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case WalReceiverProcess:
                 statmsg = pgstat_get_backend_desc(B_WAL_RECEIVER);
                 break;
+            case ArchiverProcess:
+                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
+                break;
             default:
                 statmsg = "??? process";
                 break;
@@ -416,6 +417,7 @@ AuxiliaryProcessMain(int argc, char *argv[])
         CreateAuxProcessResourceOwner();
 
         /* Initialize backend status information */
+        pgstat_bearray_initialize();
         pgstat_initialize();
         pgstat_bestart();
 
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index b79be91655..e53c0fb808 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -46,7 +46,7 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
-#include "pgstat.h"
+#include "bestatus.h"
 
 /*
  * Magic numbers for parallel executor communication.  We use constants
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 5e74585d5e..03a703075e 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -41,6 +41,7 @@
 #include "access/relscan.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "bestatus.h"
 #include "executor/execdebug.h"
 #include "executor/nodeBitmapHeapscan.h"
 #include "miscadmin.h"
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index 69d5a1f239..36859360b6 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -32,6 +32,7 @@
 
 #include "access/relscan.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "executor/execdebug.h"
 #include "executor/execParallel.h"
 #include "executor/nodeGather.h"
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 856daf6a7f..5a47eb4601 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -28,6 +28,7 @@
 
 #include "access/htup_details.h"
 #include "access/parallel.h"
+#include "bestatus.h"
 #include "catalog/pg_statistic.h"
 #include "commands/tablespace.h"
 #include "executor/execdebug.h"
@@ -35,7 +36,6 @@
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "port/atomics.h"
 #include "utils/dynahash.h"
 #include "utils/memutils.h"
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 2098708864..898a7916b0 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -108,12 +108,12 @@
 
 #include "access/htup_details.h"
 #include "access/parallel.h"
+#include "bestatus.h"
 #include "executor/executor.h"
 #include "executor/hashjoin.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "utils/memutils.h"
 #include "utils/sharedtuplestore.h"
 
diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c
index d1417454f2..4eead99293 100644
--- a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c
@@ -36,9 +36,9 @@
 #include <openssl/ec.h>
 #endif
 
+#include "bestatus.h"
 #include "libpq/libpq.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/latch.h"
 #include "tcop/tcopprot.h"
diff --git a/src/backend/libpq/be-secure.c b/src/backend/libpq/be-secure.c
index a7def3168d..fa1cf6cffa 100644
--- a/src/backend/libpq/be-secure.c
+++ b/src/backend/libpq/be-secure.c
@@ -29,9 +29,9 @@
 #include <arpa/inet.h>
 #endif
 
+#include "bestatus.h"
 #include "libpq/libpq.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "tcop/tcopprot.h"
 #include "utils/memutils.h"
 #include "storage/ipc.h"
diff --git a/src/backend/libpq/pqmq.c b/src/backend/libpq/pqmq.c
index a9bd47d937..f79a70d6fe 100644
--- a/src/backend/libpq/pqmq.c
+++ b/src/backend/libpq/pqmq.c
@@ -13,11 +13,11 @@
 
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
 #include "libpq/pqmq.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c23211b2..311e63017d 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
-    pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+    pgarch.o postmaster.o startup.o syslogger.o walwriter.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index d1177b3855..b1328d34f5 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -71,6 +71,7 @@
 #include "access/reloptions.h"
 #include "access/transam.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "catalog/dependency.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
@@ -968,7 +969,7 @@ rebuild_database_list(Oid newdb)
         PgStat_StatDBEntry *entry;
 
         /* only consider this database if it has a pgstat entry */
-        entry = pgstat_fetch_stat_dbentry(newdb);
+        entry = pgstat_fetch_stat_dbentry(newdb, true);
         if (entry != NULL)
         {
             /* we assume it isn't found because the hash was just created */
@@ -977,6 +978,7 @@ rebuild_database_list(Oid newdb)
             /* hash_search already filled in the key */
             db->adl_score = score++;
             /* next_worker is filled in later */
+            pfree(entry);
         }
     }
 
@@ -992,7 +994,7 @@ rebuild_database_list(Oid newdb)
          * skip databases with no stat entries -- in particular, this gets rid
          * of dropped databases
          */
-        entry = pgstat_fetch_stat_dbentry(avdb->adl_datid);
+        entry = pgstat_fetch_stat_dbentry(avdb->adl_datid, true);
         if (entry == NULL)
             continue;
 
@@ -1004,6 +1006,7 @@ rebuild_database_list(Oid newdb)
             db->adl_score = score++;
             /* next_worker is filled in later */
         }
+        pfree(entry);
     }
 
     /* finally, insert all qualifying databases not previously inserted */
@@ -1016,7 +1019,7 @@ rebuild_database_list(Oid newdb)
         PgStat_StatDBEntry *entry;
 
         /* only consider databases with a pgstat entry */
-        entry = pgstat_fetch_stat_dbentry(avdb->adw_datid);
+        entry = pgstat_fetch_stat_dbentry(avdb->adw_datid, true);
         if (entry == NULL)
             continue;
 
@@ -1028,6 +1031,7 @@ rebuild_database_list(Oid newdb)
             db->adl_score = score++;
             /* next_worker is filled in later */
         }
+        pfree(entry);
     }
     nelems = score;
 
@@ -1226,7 +1230,7 @@ do_start_worker(void)
             continue;            /* ignore not-at-risk DBs */
 
         /* Find pgstat entry if any */
-        tmp->adw_entry = pgstat_fetch_stat_dbentry(tmp->adw_datid);
+        tmp->adw_entry = pgstat_fetch_stat_dbentry(tmp->adw_datid, true);
 
         /*
          * Skip a database with no pgstat entry; it means it hasn't seen any
@@ -1265,7 +1269,12 @@ do_start_worker(void)
             }
         }
         if (skipit)
+        {
+            /* Immediately free it if not used */
+            if (avdb != tmp)
+                pfree(tmp->adw_entry);
             continue;
+        }
 
         /*
          * Remember the db with oldest autovac time.  (If we are here, both
@@ -1273,7 +1282,12 @@ do_start_worker(void)
          */
         if (avdb == NULL ||
             tmp->adw_entry->last_autovac_time < avdb->adw_entry->last_autovac_time)
+        {
+            if (avdb)
+                pfree(avdb->adw_entry);
+
             avdb = tmp;
+        }
     }
 
     /* Found a database -- process it */
@@ -1962,7 +1976,7 @@ do_autovacuum(void)
      * may be NULL if we couldn't find an entry (only happens if we are
      * forcing a vacuum for anti-wrap purposes).
      */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, true);
 
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
@@ -2012,7 +2026,7 @@ do_autovacuum(void)
     MemoryContextSwitchTo(AutovacMemCxt);
 
     /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
+    shared = pgstat_fetch_stat_dbentry(InvalidOid, true);
 
     classRel = table_open(RelationRelationId, AccessShareLock);
 
@@ -2098,6 +2112,8 @@ do_autovacuum(void)
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
                                   &dovacuum, &doanalyze, &wraparound);
+        if (tabentry)
+            pfree(tabentry);
 
         /* Relations that need work are added to table_oids */
         if (dovacuum || doanalyze)
@@ -2177,10 +2193,11 @@ do_autovacuum(void)
         /* Fetch the pgstat entry for this table */
         tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
                                              shared, dbentry);
-
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
                                   &dovacuum, &doanalyze, &wraparound);
+        if (tabentry)
+            pfree(tabentry);
 
         /* ignore analyze for toast tables */
         if (dovacuum)
@@ -2749,12 +2766,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
     if (isshared)
     {
         if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
+            tabentry = backend_get_tab_entry(shared, relid, true);
     }
     else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
+        tabentry = backend_get_tab_entry(dbentry, relid, true);
 
     return tabentry;
 }
@@ -2786,8 +2801,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     /* use fresh stats */
     autovac_refresh_stats();
 
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    shared = pgstat_fetch_stat_dbentry(InvalidOid, true);
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, true);
 
     /* fetch the relation's relcache entry */
     classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
@@ -2818,6 +2833,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
                               &dovacuum, &doanalyze, &wraparound);
+    if (tabentry)
+        pfree(tabentry);
 
     /* ignore ANALYZE for toast tables */
     if (classForm->relkind == RELKIND_TOASTVALUE)
@@ -2908,7 +2925,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     }
 
     heap_freetuple(classTup);
-
+    pfree(shared);
+    pfree(dbentry);
     return tab;
 }
 
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index f5db5a8c4a..7d7d55ef1a 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -16,8 +16,8 @@
 
 #include "libpq/pqsignal.h"
 #include "access/parallel.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "port/atomics.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/postmaster.h"
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index e6b6c549de..c820d35fbc 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -40,6 +40,7 @@
 
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -267,9 +268,9 @@ BackgroundWriterMain(void)
         can_hibernate = BgBufferSync(&wb_context);
 
         /*
-         * Send off activity statistics to the stats collector
+         * Update activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_update_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fe96c41359..9f70cd0e52 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -43,6 +43,7 @@
 
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -515,13 +516,13 @@ CheckpointerMain(void)
         CheckArchiveTimeout();
 
         /*
-         * Send off activity statistics to the stats collector.  (The reason
-         * why we re-use bgwriter-related code for this is that the bgwriter
-         * and checkpointer used to be just one process.  It's probably not
-         * worth the trouble to split the stats support into two independent
-         * stats message types.)
+         * Update activity statistics.  (The reason why we re-use
+         * bgwriter-related code for this is that the bgwriter and
+         * checkpointer used to be just one process.  It's probably not worth
+         * the trouble to split the stats support into two independent
+         * functions.)
          */
-        pgstat_send_bgwriter();
+        pgstat_update_bgwriter();
 
         /*
          * Sleep until we are signaled or it's time for another checkpoint or
@@ -682,9 +683,9 @@ CheckpointWriteDelay(int flags, double progress)
         CheckArchiveTimeout();
 
         /*
-         * Report interim activity statistics to the stats collector.
+         * Register interim activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_update_bgwriter();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 4342ebdab4..2a7c4fd1b1 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -35,6 +35,7 @@
 
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -468,7 +469,7 @@ pgarch_ArchiverCopyLoop(void)
-                 * Tell the collector about the WAL file that we successfully
-                 * archived
+                 * Record the WAL file that we successfully archived in the
+                 * activity statistics
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_update_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
@@ -478,7 +479,7 @@ pgarch_ArchiverCopyLoop(void)
-                 * Tell the collector about the WAL file that we failed to
-                 * archive
+                 * Record the WAL file that we failed to archive in the
+                 * activity statistics
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_update_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
deleted file mode 100644
index 9e6bce8f6a..0000000000
--- a/src/backend/postmaster/pgstat.c
+++ /dev/null
@@ -1,6385 +0,0 @@
-/* ----------
- * pgstat.c
- *
- *    All the statistics collector stuff hacked up in one big, ugly file.
- *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
- *
- *            - Add some automatic call for pgstat vacuuming.
- *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
- *
- *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
- *
- *    src/backend/postmaster/pgstat.c
- * ----------
- */
-#include "postgres.h"
-
-#include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
-
-#include "pgstat.h"
-
-#include "access/heapam.h"
-#include "access/htup_details.h"
-#include "access/transam.h"
-#include "access/twophase_rmgr.h"
-#include "access/xact.h"
-#include "catalog/pg_database.h"
-#include "catalog/pg_proc.h"
-#include "common/ip.h"
-#include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
-#include "miscadmin.h"
-#include "pg_trace.h"
-#include "postmaster/autovacuum.h"
-#include "postmaster/fork_process.h"
-#include "postmaster/postmaster.h"
-#include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
-#include "storage/ipc.h"
-#include "storage/latch.h"
-#include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
-#include "storage/procsignal.h"
-#include "storage/sinvaladt.h"
-#include "utils/ascii.h"
-#include "utils/guc.h"
-#include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
-#include "utils/snapmgr.h"
-#include "utils/timestamp.h"
-
-
-/* ----------
- * Timer definitions.
- * ----------
- */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
-
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
-
-
-/* ----------
- * The initial size hints for the hash tables used in the collector.
- * ----------
- */
-#define PGSTAT_DB_HASH_SIZE        16
-#define PGSTAT_TAB_HASH_SIZE    512
-#define PGSTAT_FUNCTION_HASH_SIZE    512
-
-
-/* ----------
- * Total number of backends including auxiliary
- *
- * We reserve a slot for each possible BackendId, plus one for each
- * possible auxiliary process type.  (This scheme assumes there is not
- * more than one of any auxiliary process type at a time.) MaxBackends
- * includes autovacuum workers and background workers as well.
- * ----------
- */
-#define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
-
-
-/* ----------
- * GUC parameters
- * ----------
- */
-bool        pgstat_track_activities = false;
-bool        pgstat_track_counts = false;
-int            pgstat_track_functions = TRACK_FUNC_OFF;
-int            pgstat_track_activity_query_size = 1024;
-
-/* ----------
- * Built from GUC parameter
- * ----------
- */
-char       *pgstat_stat_directory = NULL;
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
-
-/*
- * BgWriter global statistics counters (unused in other processes).
- * Stored directly in a stats message structure so it can be sent
- * without needing to copy things around.  We assume this inits to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
-
-/* ----------
- * Local data
- * ----------
- */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
-
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
-
-/*
- * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
- *
- * NOTE: once allocated, TabStatusArray structures are never moved or deleted
- * for the life of the backend.  Also, we zero out the t_id fields of the
- * contained PgStat_TableStatus structs whenever they are not actively in use.
- * This allows relcache pgstat_info pointers to be treated as long-lived data,
- * avoiding repeated searches in pgstat_initstats() when a relation is
- * repeatedly opened during a transaction.
- */
-#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
-
-typedef struct TabStatusArray
-{
-    struct TabStatusArray *tsa_next;    /* link to next array, if any */
-    int            tsa_used;        /* # entries currently used */
-    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
-} TabStatusArray;
-
-static TabStatusArray *pgStatTabList = NULL;
-
-/*
- * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
- */
-typedef struct TabStatHashEntry
-{
-    Oid            t_id;
-    PgStat_TableStatus *tsa_entry;
-} TabStatHashEntry;
-
-/*
- * Hash table for O(1) t_id -> tsa_entry lookup
- */
-static HTAB *pgStatTabHash = NULL;
-
-/*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
- */
-static HTAB *pgStatFunctions = NULL;
-
-/*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
- */
-static bool have_function_stats = false;
-
-/*
- * Tuple insertion/deletion counts for an open transaction can't be propagated
- * into PgStat_TableStatus counters until we know if it is going to commit
- * or abort.  Hence, we keep these counts in per-subxact structs that live
- * in TopTransactionContext.  This data structure is designed on the assumption
- * that subxacts won't usually modify very many tables.
- */
-typedef struct PgStat_SubXactStatus
-{
-    int            nest_level;        /* subtransaction nest level */
-    struct PgStat_SubXactStatus *prev;    /* higher-level subxact if any */
-    PgStat_TableXactStatus *first;    /* head of list for this subxact */
-} PgStat_SubXactStatus;
-
-static PgStat_SubXactStatus *pgStatXactStack = NULL;
-
-static int    pgStatXactCommit = 0;
-static int    pgStatXactRollback = 0;
-PgStat_Counter pgStatBlockReadTime = 0;
-PgStat_Counter pgStatBlockWriteTime = 0;
-
-/* Record that's written to 2PC state file when pgstat state is persisted */
-typedef struct TwoPhasePgStatRecord
-{
-    PgStat_Counter tuples_inserted; /* tuples inserted in xact */
-    PgStat_Counter tuples_updated;    /* tuples updated in xact */
-    PgStat_Counter tuples_deleted;    /* tuples deleted in xact */
-    PgStat_Counter inserted_pre_trunc;    /* tuples inserted prior to truncate */
-    PgStat_Counter updated_pre_trunc;    /* tuples updated prior to truncate */
-    PgStat_Counter deleted_pre_trunc;    /* tuples deleted prior to truncate */
-    Oid            t_id;            /* table's OID */
-    bool        t_shared;        /* is it a shared catalog? */
-    bool        t_truncated;    /* was the relation truncated? */
-} TwoPhasePgStatRecord;
-
-/*
- * Info about current "snapshot" of stats file
- */
-static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
-
-/* Status for backends including auxiliary */
-static LocalPgBackendStatus *localBackendStatusTable = NULL;
-
-/* Total number of backends including auxiliary */
-static int    localNumBackends = 0;
-
-/*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
- */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
-
-/* Signal handler flags */
-static volatile bool need_exit = false;
-static volatile bool got_SIGHUP = false;
-
-/*
- * Total time charged to functions so far in the current backend.
- * We use this to help separate "self" and "other" time charges.
- * (We assume this initializes to zero.)
- */
-static instr_time total_func_time;
-
-
-/* ----------
- * Local function forward declarations
- * ----------
- */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgstat_exit(SIGNAL_ARGS);
-static void pgstat_beshutdown_hook(int code, Datum arg);
-static void pgstat_sighup_handler(SIGNAL_ARGS);
-
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                     Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
-static void pgstat_read_current_status(void);
-
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
-
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
-static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
-
-static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
-
-static void pgstat_setup_memcxt(void);
-
-static const char *pgstat_get_wait_activity(WaitEventActivity w);
-static const char *pgstat_get_wait_client(WaitEventClient w);
-static const char *pgstat_get_wait_ipc(WaitEventIPC w);
-static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
-static const char *pgstat_get_wait_io(WaitEventIO w);
-
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
-/* ------------------------------------------------------------
- * Public functions called from postmaster follow
- * ------------------------------------------------------------
- */
-
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
- */
-void
-pgstat_init(void)
-{
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
-
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
-
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
-    {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
-    }
-
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
-
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
-
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
-
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
-    }
-
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
-
-    /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
-     */
-    if (!pg_set_noblock(pgStatSock))
-    {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
-    }
-
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
-    }
-
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    return;
-
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
-
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
-
-    /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
-     */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
-}
-
-/*
- * subroutine for pgstat_reset_all
- */
-static void
-pgstat_reset_remove_files(const char *directory)
-{
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
-
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
-    {
-        int            nchars;
-        Oid            tmp_oid;
-
-        /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
-         */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
-
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
-
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
-    }
-    FreeDir(dir);
-}
-
-/*
- * pgstat_reset_all() -
- *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
- */
-void
-pgstat_reset_all(void)
-{
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
-}
-
-#ifdef EXEC_BACKEND
-
-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
-    char       *av[10];
-    int            ac = 0;
-
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
-
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
-
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
-
-
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
-
-    /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
-     */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
-
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
-
-    /*
-     * Okay, fork off the collector.
-     */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
-}
-
-/* ------------------------------------------------------------
- * Public functions used by backends follow
- *------------------------------------------------------------
- */
-
-
-/* ----------
- * pgstat_report_stat() -
- *
- *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
- *    transaction stop time as an approximation of current time.
- * ----------
- */
-void
-pgstat_report_stat(bool force)
-{
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
-    TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
-    int            i;
-
-    /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
-
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
-    now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
-
-    /*
-     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
-     */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
-    pgStatTabHash = NULL;
-
-    /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
-     */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
-    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
-    {
-        for (i = 0; i < tsa->tsa_used; i++)
-        {
-            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
-
-            /* Shouldn't have any pending transaction-dependent counts */
-            Assert(entry->trans == NULL);
-
-            /*
-             * Ignore entries that didn't accumulate any actual counts, such
-             * as indexes that were opened by the planner but not used.
-             */
-            if (memcmp(&entry->t_counts, &all_zeroes,
-                       sizeof(PgStat_TableCounts)) == 0)
-                continue;
-
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
-            {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
-            }
-        }
-        /* zero out TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
-    }
-
-    /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
-     */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
-
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
-}
-
-/*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
- */
-static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
-{
-    int            n;
-    int            len;
-
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
-     */
-    if (OidIsValid(tsmsg->m_databaseid))
-    {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
-        pgStatXactCommit = 0;
-        pgStatXactRollback = 0;
-        pgStatBlockReadTime = 0;
-        pgStatBlockWriteTime = 0;
-    }
-    else
-    {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
-    }
-
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
-
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
-}
-
-/*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
- */
-static void
-pgstat_send_funcstats(void)
-{
-    /* we assume this inits to all zeroes: */
-    static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
-
-    if (pgStatFunctions == NULL)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
-
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
-
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
-
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
-
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
-
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
-}
-
-
-/* ----------
- * pgstat_vacuum_stat() -
- *
- *    Will tell the collector about objects he can get rid of.
- * ----------
- */
-void
-pgstat_vacuum_stat(void)
-{
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Read pg_database and make a list of OIDs of all existing databases
-     */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
-
-    /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            dbid = dbentry->databaseid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        /* the DB entry for shared tables (with InvalidOid) is never dropped */
-        if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
-            pgstat_drop_database(dbid);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Lookup our own database entry; if not found, nothing more to do.
-     */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
-        return;
-
-    /*
-     * Similarly to above, make a list of all known relations in this DB.
-     */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
-
-    /*
-     * Check for all tables listed in stats hashtable if they still exist.
-     */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            tabid = tabentry->tableid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
-            continue;
-
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
-    }
-
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
-        {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
-                continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
-        }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
-    }
-}
-
-
-/* ----------
- * pgstat_collect_oids() -
- *
- *    Collect the OIDs of all objects listed in the specified system catalog
- *    into a temporary hash table.  Caller should hash_destroy the result
- *    when done with it.  (However, we make the table in CurrentMemoryContext
- *    so that it will be freed properly in event of an error.)
- * ----------
- */
-static HTAB *
-pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
-{
-    HTAB       *htab;
-    HASHCTL        hash_ctl;
-    Relation    rel;
-    HeapScanDesc scan;
-    HeapTuple    tup;
-    Snapshot    snapshot;
-
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(Oid);
-    hash_ctl.hcxt = CurrentMemoryContext;
-    htab = hash_create("Temporary table of OIDs",
-                       PGSTAT_TAB_HASH_SIZE,
-                       &hash_ctl,
-                       HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    rel = table_open(catalogid, AccessShareLock);
-    snapshot = RegisterSnapshot(GetLatestSnapshot());
-    scan = heap_beginscan(rel, snapshot, 0, NULL);
-    while ((tup = heap_getnext(scan, ForwardScanDirection)) != NULL)
-    {
-        Oid            thisoid;
-        bool        isnull;
-
-        thisoid = heap_getattr(tup, anum_oid, RelationGetDescr(rel), &isnull);
-        Assert(!isnull);
-
-        CHECK_FOR_INTERRUPTS();
-
-        (void) hash_search(htab, (void *) &thisoid, HASH_ENTER, NULL);
-    }
-    heap_endscan(scan);
-    UnregisterSnapshot(snapshot);
-    table_close(rel, AccessShareLock);
-
-    return htab;
-}
-
-
-/* ----------
- * pgstat_drop_database() -
- *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
- * ----------
- */
-void
-pgstat_drop_database(Oid databaseid)
-{
-    PgStat_MsgDropdb msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
-
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
-}
-#endif                            /* NOT_USED */
-
-
-/* ----------
- * pgstat_reset_counters() -
- *
- *    Tell the statistics collector to reset counters for our database.
- *
- *    Permission checking for this function is managed through the normal
- *    GRANT system.
- * ----------
- */
-void
-pgstat_reset_counters(void)
-{
-    PgStat_MsgResetcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* ----------
- * pgstat_reset_shared_counters() -
- *
- *    Tell the statistics collector to reset cluster-wide shared counters.
- *
- *    Permission checking for this function is managed through the normal
- *    GRANT system.
- * ----------
- */
-void
-pgstat_reset_shared_counters(const char *target)
-{
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
-    else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
-    else
-        ereport(ERROR,
-                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-                 errmsg("unrecognized reset target: \"%s\"", target),
-                 errhint("Target must be \"archiver\" or \"bgwriter\".")));
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* ----------
- * pgstat_reset_single_counter() -
- *
- *    Tell the statistics collector to reset a single counter.
- *
- *    Permission checking for this function is managed through the normal
- *    GRANT system.
- * ----------
- */
-void
-pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
-{
-    PgStat_MsgResetsinglecounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
-
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* ----------
- * pgstat_report_autovac() -
- *
- *    Called from autovacuum.c to report startup of an autovacuum process.
- *    We are called before InitPostgres is done, so can't rely on MyDatabaseId;
- *    the db OID must be passed in, instead.
- * ----------
- */
-void
-pgstat_report_autovac(Oid dboid)
-{
-    PgStat_MsgAutovacStart msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
-
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/* ---------
- * pgstat_report_vacuum() -
- *
- *    Tell the collector about the table we just vacuumed.
- * ---------
- */
-void
-pgstat_report_vacuum(Oid tableoid, bool shared,
-                     PgStat_Counter livetuples, PgStat_Counter deadtuples)
-{
-    PgStat_MsgVacuum msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* --------
- * pgstat_report_analyze() -
- *
- *    Tell the collector about the table we just analyzed.
- *
- * Caller must provide new live- and dead-tuples estimates, as well as a
- * flag indicating whether to reset the changes_since_analyze counter.
- * --------
- */
-void
-pgstat_report_analyze(Relation rel,
-                      PgStat_Counter livetuples, PgStat_Counter deadtuples,
-                      bool resetcounter)
-{
-    PgStat_MsgAnalyze msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    /*
-     * Unlike VACUUM, ANALYZE might be running inside a transaction that has
-     * already inserted and/or deleted rows in the target table. ANALYZE will
-     * have counted such rows as live or dead respectively. Because we will
-     * report our counts of such rows at transaction end, we should subtract
-     * off these counts from what we send to the collector now, else they'll
-     * be double-counted after commit.  (This approach also ensures that the
-     * collector ends up with the right numbers if we abort instead of
-     * committing.)
-     */
-    if (rel->pgstat_info != NULL)
-    {
-        PgStat_TableXactStatus *trans;
-
-        for (trans = rel->pgstat_info->trans; trans; trans = trans->upper)
-        {
-            livetuples -= trans->tuples_inserted - trans->tuples_deleted;
-            deadtuples -= trans->tuples_updated + trans->tuples_deleted;
-        }
-        /* count stuff inserted by already-aborted subxacts, too */
-        deadtuples -= rel->pgstat_info->t_counts.t_delta_dead_tuples;
-        /* Since ANALYZE's counts are estimates, we could have underflowed */
-        livetuples = Max(livetuples, 0);
-        deadtuples = Max(deadtuples, 0);
-    }
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* --------
- * pgstat_report_recovery_conflict() -
- *
- *    Tell the collector about a Hot Standby recovery conflict.
- * --------
- */
-void
-pgstat_report_recovery_conflict(int reason)
-{
-    PgStat_MsgRecoveryConflict msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* --------
- * pgstat_report_deadlock() -
- *
- *    Tell the collector about a deadlock detected.
- * --------
- */
-void
-pgstat_report_deadlock(void)
-{
-    PgStat_MsgDeadlock msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* --------
- * pgstat_report_tempfile() -
- *
- *    Tell the collector about a temporary file.
- * --------
- */
-void
-pgstat_report_tempfile(size_t filesize)
-{
-    PgStat_MsgTempFile msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/* ----------
- * pgstat_ping() -
- *
- *    Send some junk data to the collector to increase traffic.
- * ----------
- */
-void
-pgstat_ping(void)
-{
-    PgStat_MsgDummy msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* ----------
- * pgstat_send_inquiry() -
- *
- *    Notify collector that we need fresh data.
- * ----------
- */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/*
- * Initialize function call usage data.
- * Called by the executor before invoking a function.
- */
-void
-pgstat_init_function_usage(FunctionCallInfo fcinfo,
-                           PgStat_FunctionCallUsage *fcu)
-{
-    PgStat_BackendFunctionEntry *htabent;
-    bool        found;
-
-    if (pgstat_track_functions <= fcinfo->flinfo->fn_stats)
-    {
-        /* stats not wanted */
-        fcu->fs = NULL;
-        return;
-    }
-
-    if (!pgStatFunctions)
-    {
-        /* First time through - initialize function stat table */
-        HASHCTL        hash_ctl;
-
-        memset(&hash_ctl, 0, sizeof(hash_ctl));
-        hash_ctl.keysize = sizeof(Oid);
-        hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
-        pgStatFunctions = hash_create("Function stat entries",
-                                      PGSTAT_FUNCTION_HASH_SIZE,
-                                      &hash_ctl,
-                                      HASH_ELEM | HASH_BLOBS);
-    }
-
-    /* Get the stats entry for this function, create if necessary */
-    htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
-                          HASH_ENTER, &found);
-    if (!found)
-        MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
-
-    fcu->fs = &htabent->f_counts;
-
-    /* save stats for this function, later used to compensate for recursion */
-    fcu->save_f_total_time = htabent->f_counts.f_total_time;
-
-    /* save current backend-wide total time */
-    fcu->save_total = total_func_time;
-
-    /* get clock time as of function start */
-    INSTR_TIME_SET_CURRENT(fcu->f_start);
-}
-
-/*
- * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
- *        for specified function
- *
- * If no entry, return NULL, don't create a new one
- */
-PgStat_BackendFunctionEntry *
-find_funcstat_entry(Oid func_id)
-{
-    if (pgStatFunctions == NULL)
-        return NULL;
-
-    return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
-                                                       (void *) &func_id,
-                                                       HASH_FIND, NULL);
-}
-
-/*
- * Calculate function call usage and update stat counters.
- * Called by the executor after invoking a function.
- *
- * In the case of a set-returning function that runs in value-per-call mode,
- * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
- * calls for what the user considers a single call of the function.  The
- * finalize flag should be TRUE on the last call.
- */
-void
-pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
-{
-    PgStat_FunctionCounts *fs = fcu->fs;
-    instr_time    f_total;
-    instr_time    f_others;
-    instr_time    f_self;
-
-    /* stats not wanted? */
-    if (fs == NULL)
-        return;
-
-    /* total elapsed time in this function call */
-    INSTR_TIME_SET_CURRENT(f_total);
-    INSTR_TIME_SUBTRACT(f_total, fcu->f_start);
-
-    /* self usage: elapsed minus anything already charged to other calls */
-    f_others = total_func_time;
-    INSTR_TIME_SUBTRACT(f_others, fcu->save_total);
-    f_self = f_total;
-    INSTR_TIME_SUBTRACT(f_self, f_others);
-
-    /* update backend-wide total time */
-    INSTR_TIME_ADD(total_func_time, f_self);
-
-    /*
-     * Compute the new f_total_time as the total elapsed time added to the
-     * pre-call value of f_total_time.  This is necessary to avoid
-     * double-counting any time taken by recursive calls of myself.  (We do
-     * not need any similar kluge for self time, since that already excludes
-     * any recursive calls.)
-     */
-    INSTR_TIME_ADD(f_total, fcu->save_f_total_time);
-
-    /* update counters in function stats table */
-    if (finalize)
-        fs->f_numcalls++;
-    fs->f_total_time = f_total;
-    INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
-}
-
-
-/* ----------
- * pgstat_initstats() -
- *
- *    Initialize a relcache entry to count access statistics.
- *    Called whenever a relation is opened.
- *
- *    We assume that a relcache entry's pgstat_info field is zeroed by
- *    relcache.c when the relcache entry is made; thereafter it is long-lived
- *    data.  We can avoid repeated searches of the TabStatus arrays when the
- *    same relation is touched repeatedly within a transaction.
- * ----------
- */
-void
-pgstat_initstats(Relation rel)
-{
-    Oid            rel_id = rel->rd_id;
-    char        relkind = rel->rd_rel->relkind;
-
-    /* We only count stats for things that have storage */
-    if (!(relkind == RELKIND_RELATION ||
-          relkind == RELKIND_MATVIEW ||
-          relkind == RELKIND_INDEX ||
-          relkind == RELKIND_TOASTVALUE ||
-          relkind == RELKIND_SEQUENCE))
-    {
-        rel->pgstat_info = NULL;
-        return;
-    }
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-    {
-        /* We're not counting at all */
-        rel->pgstat_info = NULL;
-        return;
-    }
-
-    /*
-     * If we already set up this relation in the current transaction, nothing
-     * to do.
-     */
-    if (rel->pgstat_info != NULL &&
-        rel->pgstat_info->t_id == rel_id)
-        return;
-
-    /* Else find or make the PgStat_TableStatus entry, and update link */
-    rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
-}
-
-/*
- * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
- */
-static PgStat_TableStatus *
-get_tabstat_entry(Oid rel_id, bool isshared)
-{
-    TabStatHashEntry *hash_entry;
-    PgStat_TableStatus *entry;
-    TabStatusArray *tsa;
-    bool        found;
-
-    /*
-     * Create hash table if we don't have it already.
-     */
-    if (pgStatTabHash == NULL)
-    {
-        HASHCTL        ctl;
-
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
-    }
-
-    /*
-     * Find an entry or create a new one.
-     */
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
-    if (!found)
-    {
-        /* initialize new entry with null pointer */
-        hash_entry->tsa_entry = NULL;
-    }
-
-    /*
-     * If entry is already valid, we're done.
-     */
-    if (hash_entry->tsa_entry)
-        return hash_entry->tsa_entry;
-
-    /*
-     * Locate the first pgStatTabList entry with free space, making a new list
-     * entry if needed.  Note that we could get an OOM failure here, but if so
-     * we have left the hashtable and the list in a consistent state.
-     */
-    if (pgStatTabList == NULL)
-    {
-        /* Set up first pgStatTabList entry */
-        pgStatTabList = (TabStatusArray *)
-            MemoryContextAllocZero(TopMemoryContext,
-                                   sizeof(TabStatusArray));
-    }
-
-    tsa = pgStatTabList;
-    while (tsa->tsa_used >= TABSTAT_QUANTUM)
-    {
-        if (tsa->tsa_next == NULL)
-            tsa->tsa_next = (TabStatusArray *)
-                MemoryContextAllocZero(TopMemoryContext,
-                                       sizeof(TabStatusArray));
-        tsa = tsa->tsa_next;
-    }
-
-    /*
-     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
-     * the entry was already zeroed, either at creation or after last use.
-     */
-    entry = &tsa->tsa_entries[tsa->tsa_used++];
-    entry->t_id = rel_id;
-    entry->t_shared = isshared;
-
-    /*
-     * Now we can fill the entry in pgStatTabHash.
-     */
-    hash_entry->tsa_entry = entry;
-
-    return entry;
-}
-
-/*
- * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
- *
- * If no entry, return NULL, don't create a new one
- *
- * Note: if we got an error in the most recent execution of pgstat_report_stat,
- * it's possible that an entry exists but there's no hashtable entry for it.
- * That's okay, we'll treat this case as "doesn't exist".
- */
-PgStat_TableStatus *
-find_tabstat_entry(Oid rel_id)
-{
-    TabStatHashEntry *hash_entry;
-
-    /* If hashtable doesn't exist, there are no entries at all */
-    if (!pgStatTabHash)
-        return NULL;
-
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
-    if (!hash_entry)
-        return NULL;
-
-    /* Note that this step could also return NULL, but that's correct */
-    return hash_entry->tsa_entry;
-}
-
-/*
- * get_tabstat_stack_level - add a new (sub)transaction stack entry if needed
- */
-static PgStat_SubXactStatus *
-get_tabstat_stack_level(int nest_level)
-{
-    PgStat_SubXactStatus *xact_state;
-
-    xact_state = pgStatXactStack;
-    if (xact_state == NULL || xact_state->nest_level != nest_level)
-    {
-        xact_state = (PgStat_SubXactStatus *)
-            MemoryContextAlloc(TopTransactionContext,
-                               sizeof(PgStat_SubXactStatus));
-        xact_state->nest_level = nest_level;
-        xact_state->prev = pgStatXactStack;
-        xact_state->first = NULL;
-        pgStatXactStack = xact_state;
-    }
-    return xact_state;
-}
-
-/*
- * add_tabstat_xact_level - add a new (sub)transaction state record
- */
-static void
-add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level)
-{
-    PgStat_SubXactStatus *xact_state;
-    PgStat_TableXactStatus *trans;
-
-    /*
-     * If this is the first rel to be modified at the current nest level, we
-     * first have to push a transaction stack entry.
-     */
-    xact_state = get_tabstat_stack_level(nest_level);
-
-    /* Now make a per-table stack entry */
-    trans = (PgStat_TableXactStatus *)
-        MemoryContextAllocZero(TopTransactionContext,
-                               sizeof(PgStat_TableXactStatus));
-    trans->nest_level = nest_level;
-    trans->upper = pgstat_info->trans;
-    trans->parent = pgstat_info;
-    trans->next = xact_state->first;
-    xact_state->first = trans;
-    pgstat_info->trans = trans;
-}
-
-/*
- * pgstat_count_heap_insert - count a tuple insertion of n tuples
- */
-void
-pgstat_count_heap_insert(Relation rel, PgStat_Counter n)
-{
-    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
-
-    if (pgstat_info != NULL)
-    {
-        /* We have to log the effect at the proper transactional level */
-        int            nest_level = GetCurrentTransactionNestLevel();
-
-        if (pgstat_info->trans == NULL ||
-            pgstat_info->trans->nest_level != nest_level)
-            add_tabstat_xact_level(pgstat_info, nest_level);
-
-        pgstat_info->trans->tuples_inserted += n;
-    }
-}
-
-/*
- * pgstat_count_heap_update - count a tuple update
- */
-void
-pgstat_count_heap_update(Relation rel, bool hot)
-{
-    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
-
-    if (pgstat_info != NULL)
-    {
-        /* We have to log the effect at the proper transactional level */
-        int            nest_level = GetCurrentTransactionNestLevel();
-
-        if (pgstat_info->trans == NULL ||
-            pgstat_info->trans->nest_level != nest_level)
-            add_tabstat_xact_level(pgstat_info, nest_level);
-
-        pgstat_info->trans->tuples_updated++;
-
-        /* t_tuples_hot_updated is nontransactional, so just advance it */
-        if (hot)
-            pgstat_info->t_counts.t_tuples_hot_updated++;
-    }
-}
-
-/*
- * pgstat_count_heap_delete - count a tuple deletion
- */
-void
-pgstat_count_heap_delete(Relation rel)
-{
-    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
-
-    if (pgstat_info != NULL)
-    {
-        /* We have to log the effect at the proper transactional level */
-        int            nest_level = GetCurrentTransactionNestLevel();
-
-        if (pgstat_info->trans == NULL ||
-            pgstat_info->trans->nest_level != nest_level)
-            add_tabstat_xact_level(pgstat_info, nest_level);
-
-        pgstat_info->trans->tuples_deleted++;
-    }
-}
-
-/*
- * pgstat_truncate_save_counters
- *
- * Whenever a table is truncated, we save its i/u/d counters so that they can
- * be cleared, and if the (sub)xact that executed the truncate later aborts,
- * the counters can be restored to the saved (pre-truncate) values.  Note we do
- * this on the first truncate in any particular subxact level only.
- */
-static void
-pgstat_truncate_save_counters(PgStat_TableXactStatus *trans)
-{
-    if (!trans->truncated)
-    {
-        trans->inserted_pre_trunc = trans->tuples_inserted;
-        trans->updated_pre_trunc = trans->tuples_updated;
-        trans->deleted_pre_trunc = trans->tuples_deleted;
-        trans->truncated = true;
-    }
-}
-
-/*
- * pgstat_truncate_restore_counters - restore counters when a truncate aborts
- */
-static void
-pgstat_truncate_restore_counters(PgStat_TableXactStatus *trans)
-{
-    if (trans->truncated)
-    {
-        trans->tuples_inserted = trans->inserted_pre_trunc;
-        trans->tuples_updated = trans->updated_pre_trunc;
-        trans->tuples_deleted = trans->deleted_pre_trunc;
-    }
-}
-
-/*
- * pgstat_count_truncate - update tuple counters due to truncate
- */
-void
-pgstat_count_truncate(Relation rel)
-{
-    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
-
-    if (pgstat_info != NULL)
-    {
-        /* We have to log the effect at the proper transactional level */
-        int            nest_level = GetCurrentTransactionNestLevel();
-
-        if (pgstat_info->trans == NULL ||
-            pgstat_info->trans->nest_level != nest_level)
-            add_tabstat_xact_level(pgstat_info, nest_level);
-
-        pgstat_truncate_save_counters(pgstat_info->trans);
-        pgstat_info->trans->tuples_inserted = 0;
-        pgstat_info->trans->tuples_updated = 0;
-        pgstat_info->trans->tuples_deleted = 0;
-    }
-}
-
-/*
- * pgstat_update_heap_dead_tuples - update dead-tuples count
- *
- * The semantics of this are that we are reporting the nontransactional
- * recovery of "delta" dead tuples; so t_delta_dead_tuples decreases
- * rather than increasing, and the change goes straight into the per-table
- * counter, not into transactional state.
- */
-void
-pgstat_update_heap_dead_tuples(Relation rel, int delta)
-{
-    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
-
-    if (pgstat_info != NULL)
-        pgstat_info->t_counts.t_delta_dead_tuples -= delta;
-}
-
-
-/* ----------
- * AtEOXact_PgStat
- *
- *    Called from access/transam/xact.c at top-level transaction commit/abort.
- * ----------
- */
-void
-AtEOXact_PgStat(bool isCommit)
-{
-    PgStat_SubXactStatus *xact_state;
-
-    /*
-     * Count transaction commit or abort.  (We use counters, not just bools,
-     * in case the reporting message isn't sent right away.)
-     */
-    if (isCommit)
-        pgStatXactCommit++;
-    else
-        pgStatXactRollback++;
-
-    /*
-     * Transfer transactional insert/update counts into the base tabstat
-     * entries.  We don't bother to free any of the transactional state, since
-     * it's all in TopTransactionContext and will go away anyway.
-     */
-    xact_state = pgStatXactStack;
-    if (xact_state != NULL)
-    {
-        PgStat_TableXactStatus *trans;
-
-        Assert(xact_state->nest_level == 1);
-        Assert(xact_state->prev == NULL);
-        for (trans = xact_state->first; trans != NULL; trans = trans->next)
-        {
-            PgStat_TableStatus *tabstat;
-
-            Assert(trans->nest_level == 1);
-            Assert(trans->upper == NULL);
-            tabstat = trans->parent;
-            Assert(tabstat->trans == trans);
-            /* restore pre-truncate stats (if any) in case of aborted xact */
-            if (!isCommit)
-                pgstat_truncate_restore_counters(trans);
-            /* count attempted actions regardless of commit/abort */
-            tabstat->t_counts.t_tuples_inserted += trans->tuples_inserted;
-            tabstat->t_counts.t_tuples_updated += trans->tuples_updated;
-            tabstat->t_counts.t_tuples_deleted += trans->tuples_deleted;
-            if (isCommit)
-            {
-                tabstat->t_counts.t_truncated = trans->truncated;
-                if (trans->truncated)
-                {
-                    /* forget live/dead stats seen by backend thus far */
-                    tabstat->t_counts.t_delta_live_tuples = 0;
-                    tabstat->t_counts.t_delta_dead_tuples = 0;
-                }
-                /* insert adds a live tuple, delete removes one */
-                tabstat->t_counts.t_delta_live_tuples +=
-                    trans->tuples_inserted - trans->tuples_deleted;
-                /* update and delete each create a dead tuple */
-                tabstat->t_counts.t_delta_dead_tuples +=
-                    trans->tuples_updated + trans->tuples_deleted;
-                /* insert, update, delete each count as one change event */
-                tabstat->t_counts.t_changed_tuples +=
-                    trans->tuples_inserted + trans->tuples_updated +
-                    trans->tuples_deleted;
-            }
-            else
-            {
-                /* inserted tuples are dead, deleted tuples are unaffected */
-                tabstat->t_counts.t_delta_dead_tuples +=
-                    trans->tuples_inserted + trans->tuples_updated;
-                /* an aborted xact generates no changed_tuple events */
-            }
-            tabstat->trans = NULL;
-        }
-    }
-    pgStatXactStack = NULL;
-
-    /* Make sure any stats snapshot is thrown away */
-    pgstat_clear_snapshot();
-}
-
-/* ----------
- * AtEOSubXact_PgStat
- *
- *    Called from access/transam/xact.c at subtransaction commit/abort.
- * ----------
- */
-void
-AtEOSubXact_PgStat(bool isCommit, int nestDepth)
-{
-    PgStat_SubXactStatus *xact_state;
-
-    /*
-     * Transfer transactional insert/update counts into the next higher
-     * subtransaction state.
-     */
-    xact_state = pgStatXactStack;
-    if (xact_state != NULL &&
-        xact_state->nest_level >= nestDepth)
-    {
-        PgStat_TableXactStatus *trans;
-        PgStat_TableXactStatus *next_trans;
-
-        /* delink xact_state from stack immediately to simplify reuse case */
-        pgStatXactStack = xact_state->prev;
-
-        for (trans = xact_state->first; trans != NULL; trans = next_trans)
-        {
-            PgStat_TableStatus *tabstat;
-
-            next_trans = trans->next;
-            Assert(trans->nest_level == nestDepth);
-            tabstat = trans->parent;
-            Assert(tabstat->trans == trans);
-            if (isCommit)
-            {
-                if (trans->upper && trans->upper->nest_level == nestDepth - 1)
-                {
-                    if (trans->truncated)
-                    {
-                        /* propagate the truncate status one level up */
-                        pgstat_truncate_save_counters(trans->upper);
-                        /* replace upper xact stats with ours */
-                        trans->upper->tuples_inserted = trans->tuples_inserted;
-                        trans->upper->tuples_updated = trans->tuples_updated;
-                        trans->upper->tuples_deleted = trans->tuples_deleted;
-                    }
-                    else
-                    {
-                        trans->upper->tuples_inserted += trans->tuples_inserted;
-                        trans->upper->tuples_updated += trans->tuples_updated;
-                        trans->upper->tuples_deleted += trans->tuples_deleted;
-                    }
-                    tabstat->trans = trans->upper;
-                    pfree(trans);
-                }
-                else
-                {
-                    /*
-                     * When there isn't an immediate parent state, we can just
-                     * reuse the record instead of going through a
-                     * palloc/pfree pushup (this works since it's all in
-                     * TopTransactionContext anyway).  We have to re-link it
-                     * into the parent level, though, and that might mean
-                     * pushing a new entry into the pgStatXactStack.
-                     */
-                    PgStat_SubXactStatus *upper_xact_state;
-
-                    upper_xact_state = get_tabstat_stack_level(nestDepth - 1);
-                    trans->next = upper_xact_state->first;
-                    upper_xact_state->first = trans;
-                    trans->nest_level = nestDepth - 1;
-                }
-            }
-            else
-            {
-                /*
-                 * On abort, update top-level tabstat counts, then forget the
-                 * subtransaction
-                 */
-
-                /* first restore values obliterated by truncate */
-                pgstat_truncate_restore_counters(trans);
-                /* count attempted actions regardless of commit/abort */
-                tabstat->t_counts.t_tuples_inserted += trans->tuples_inserted;
-                tabstat->t_counts.t_tuples_updated += trans->tuples_updated;
-                tabstat->t_counts.t_tuples_deleted += trans->tuples_deleted;
-                /* inserted tuples are dead, deleted tuples are unaffected */
-                tabstat->t_counts.t_delta_dead_tuples +=
-                    trans->tuples_inserted + trans->tuples_updated;
-                tabstat->trans = trans->upper;
-                pfree(trans);
-            }
-        }
-        pfree(xact_state);
-    }
-}
-
-
-/*
- * AtPrepare_PgStat
- *        Save the transactional stats state at 2PC transaction prepare.
- *
- * In this phase we just generate 2PC records for all the pending
- * transaction-dependent stats work.
- */
-void
-AtPrepare_PgStat(void)
-{
-    PgStat_SubXactStatus *xact_state;
-
-    xact_state = pgStatXactStack;
-    if (xact_state != NULL)
-    {
-        PgStat_TableXactStatus *trans;
-
-        Assert(xact_state->nest_level == 1);
-        Assert(xact_state->prev == NULL);
-        for (trans = xact_state->first; trans != NULL; trans = trans->next)
-        {
-            PgStat_TableStatus *tabstat;
-            TwoPhasePgStatRecord record;
-
-            Assert(trans->nest_level == 1);
-            Assert(trans->upper == NULL);
-            tabstat = trans->parent;
-            Assert(tabstat->trans == trans);
-
-            record.tuples_inserted = trans->tuples_inserted;
-            record.tuples_updated = trans->tuples_updated;
-            record.tuples_deleted = trans->tuples_deleted;
-            record.inserted_pre_trunc = trans->inserted_pre_trunc;
-            record.updated_pre_trunc = trans->updated_pre_trunc;
-            record.deleted_pre_trunc = trans->deleted_pre_trunc;
-            record.t_id = tabstat->t_id;
-            record.t_shared = tabstat->t_shared;
-            record.t_truncated = trans->truncated;
-
-            RegisterTwoPhaseRecord(TWOPHASE_RM_PGSTAT_ID, 0,
-                                   &record, sizeof(TwoPhasePgStatRecord));
-        }
-    }
-}
-
-/*
- * PostPrepare_PgStat
- *        Clean up after successful PREPARE.
- *
- * All we need do here is unlink the transaction stats state from the
- * nontransactional state.  The nontransactional action counts will be
- * reported to the stats collector immediately, while the effects on live
- * and dead tuple counts are preserved in the 2PC state file.
- *
- * Note: AtEOXact_PgStat is not called during PREPARE.
- */
-void
-PostPrepare_PgStat(void)
-{
-    PgStat_SubXactStatus *xact_state;
-
-    /*
-     * We don't bother to free any of the transactional state, since it's all
-     * in TopTransactionContext and will go away anyway.
-     */
-    xact_state = pgStatXactStack;
-    if (xact_state != NULL)
-    {
-        PgStat_TableXactStatus *trans;
-
-        for (trans = xact_state->first; trans != NULL; trans = trans->next)
-        {
-            PgStat_TableStatus *tabstat;
-
-            tabstat = trans->parent;
-            tabstat->trans = NULL;
-        }
-    }
-    pgStatXactStack = NULL;
-
-    /* Make sure any stats snapshot is thrown away */
-    pgstat_clear_snapshot();
-}
-
-/*
- * 2PC processing routine for COMMIT PREPARED case.
- *
- * Load the saved counts into our local pgstats state.
- */
-void
-pgstat_twophase_postcommit(TransactionId xid, uint16 info,
-                           void *recdata, uint32 len)
-{
-    TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata;
-    PgStat_TableStatus *pgstat_info;
-
-    /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
-
-    /* Same math as in AtEOXact_PgStat, commit case */
-    pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
-    pgstat_info->t_counts.t_tuples_updated += rec->tuples_updated;
-    pgstat_info->t_counts.t_tuples_deleted += rec->tuples_deleted;
-    pgstat_info->t_counts.t_truncated = rec->t_truncated;
-    if (rec->t_truncated)
-    {
-        /* forget live/dead stats seen by backend thus far */
-        pgstat_info->t_counts.t_delta_live_tuples = 0;
-        pgstat_info->t_counts.t_delta_dead_tuples = 0;
-    }
-    pgstat_info->t_counts.t_delta_live_tuples +=
-        rec->tuples_inserted - rec->tuples_deleted;
-    pgstat_info->t_counts.t_delta_dead_tuples +=
-        rec->tuples_updated + rec->tuples_deleted;
-    pgstat_info->t_counts.t_changed_tuples +=
-        rec->tuples_inserted + rec->tuples_updated +
-        rec->tuples_deleted;
-}
-
-/*
- * 2PC processing routine for ROLLBACK PREPARED case.
- *
- * Load the saved counts into our local pgstats state, but treat them
- * as aborted.
- */
-void
-pgstat_twophase_postabort(TransactionId xid, uint16 info,
-                          void *recdata, uint32 len)
-{
-    TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata;
-    PgStat_TableStatus *pgstat_info;
-
-    /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
-
-    /* Same math as in AtEOXact_PgStat, abort case */
-    if (rec->t_truncated)
-    {
-        rec->tuples_inserted = rec->inserted_pre_trunc;
-        rec->tuples_updated = rec->updated_pre_trunc;
-        rec->tuples_deleted = rec->deleted_pre_trunc;
-    }
-    pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
-    pgstat_info->t_counts.t_tuples_updated += rec->tuples_updated;
-    pgstat_info->t_counts.t_tuples_deleted += rec->tuples_deleted;
-    pgstat_info->t_counts.t_delta_dead_tuples +=
-        rec->tuples_inserted + rec->tuples_updated;
-}
-
-
-/* ----------
- * pgstat_fetch_stat_dbentry() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
- * ----------
- */
-PgStat_StatDBEntry *
-pgstat_fetch_stat_dbentry(Oid dbid)
-{
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
-}
-
-
-/* ----------
- * pgstat_fetch_stat_tabentry() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one table or NULL. NULL doesn't mean
- *    that the table doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
- * ----------
- */
-PgStat_StatTabEntry *
-pgstat_fetch_stat_tabentry(Oid relid)
-{
-    Oid            dbid;
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
-
-    /*
-     * If we didn't find it, maybe it's a shared table.
-     */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
-
-    return NULL;
-}
-
-
-/* ----------
- * pgstat_fetch_stat_funcentry() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one function or NULL.
- * ----------
- */
-PgStat_StatFuncEntry *
-pgstat_fetch_stat_funcentry(Oid func_id)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry = NULL;
-
-    /* load the stats file if needed */
-    backend_read_statsfile();
-
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
-
-    return funcentry;
-}
-
-
-/* ----------
- * pgstat_fetch_stat_beentry() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    our local copy of the current-activity entry for one backend.
- *
- *    NB: caller is responsible for a check if the user is permitted to see
- *    this info (especially the querystring).
- * ----------
- */
-PgBackendStatus *
-pgstat_fetch_stat_beentry(int beid)
-{
-    pgstat_read_current_status();
-
-    if (beid < 1 || beid > localNumBackends)
-        return NULL;
-
-    return &localBackendStatusTable[beid - 1].backendStatus;
-}
-
-
-/* ----------
- * pgstat_fetch_stat_local_beentry() -
- *
- *    Like pgstat_fetch_stat_beentry() but with locally computed additions (like
- *    xid and xmin values of the backend)
- *
- *    NB: caller is responsible for a check if the user is permitted to see
- *    this info (especially the querystring).
- * ----------
- */
-LocalPgBackendStatus *
-pgstat_fetch_stat_local_beentry(int beid)
-{
-    pgstat_read_current_status();
-
-    if (beid < 1 || beid > localNumBackends)
-        return NULL;
-
-    return &localBackendStatusTable[beid - 1];
-}
-
-
-/* ----------
- * pgstat_fetch_stat_numbackends() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the maximum current backend id.
- * ----------
- */
-int
-pgstat_fetch_stat_numbackends(void)
-{
-    pgstat_read_current_status();
-
-    return localNumBackends;
-}
-
-/*
- * ---------
- * pgstat_fetch_stat_archiver() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the archiver statistics struct.
- * ---------
- */
-PgStat_ArchiverStats *
-pgstat_fetch_stat_archiver(void)
-{
-    backend_read_statsfile();
-
-    return &archiverStats;
-}
-
-
-/*
- * ---------
- * pgstat_fetch_global() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the global statistics struct.
- * ---------
- */
-PgStat_GlobalStats *
-pgstat_fetch_global(void)
-{
-    backend_read_statsfile();
-
-    return &globalStats;
-}
-
-
-/* ------------------------------------------------------------
- * Functions for management of the shared-memory PgBackendStatus array
- * ------------------------------------------------------------
- */
-
-static PgBackendStatus *BackendStatusArray = NULL;
-static PgBackendStatus *MyBEEntry = NULL;
-static char *BackendAppnameBuffer = NULL;
-static char *BackendClientHostnameBuffer = NULL;
-static char *BackendActivityBuffer = NULL;
-static Size BackendActivityBufferSize = 0;
-#ifdef USE_SSL
-static PgBackendSSLStatus *BackendSslStatusBuffer = NULL;
-#endif
-
-
-/*
- * Report shared-memory space needed by CreateSharedBackendStatus.
- */
-Size
-BackendStatusShmemSize(void)
-{
-    Size        size;
-
-    /* BackendStatusArray: */
-    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
-    /* BackendAppnameBuffer: */
-    size = add_size(size,
-                    mul_size(NAMEDATALEN, NumBackendStatSlots));
-    /* BackendClientHostnameBuffer: */
-    size = add_size(size,
-                    mul_size(NAMEDATALEN, NumBackendStatSlots));
-    /* BackendActivityBuffer: */
-    size = add_size(size,
-                    mul_size(pgstat_track_activity_query_size, NumBackendStatSlots));
-#ifdef USE_SSL
-    /* BackendSslStatusBuffer: */
-    size = add_size(size,
-                    mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots));
-#endif
-    return size;
-}
-
-/*
- * Initialize the shared status array and several string buffers
- * during postmaster startup.
- */
-void
-CreateSharedBackendStatus(void)
-{
-    Size        size;
-    bool        found;
-    int            i;
-    char       *buffer;
-
-    /* Create or attach to the shared array */
-    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
-    BackendStatusArray = (PgBackendStatus *)
-        ShmemInitStruct("Backend Status Array", size, &found);
-
-    if (!found)
-    {
-        /*
-         * We're the first - initialize.
-         */
-        MemSet(BackendStatusArray, 0, size);
-    }
-
-    /* Create or attach to the shared appname buffer */
-    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
-    BackendAppnameBuffer = (char *)
-        ShmemInitStruct("Backend Application Name Buffer", size, &found);
-
-    if (!found)
-    {
-        MemSet(BackendAppnameBuffer, 0, size);
-
-        /* Initialize st_appname pointers. */
-        buffer = BackendAppnameBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_appname = buffer;
-            buffer += NAMEDATALEN;
-        }
-    }
-
-    /* Create or attach to the shared client hostname buffer */
-    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
-    BackendClientHostnameBuffer = (char *)
-        ShmemInitStruct("Backend Client Host Name Buffer", size, &found);
-
-    if (!found)
-    {
-        MemSet(BackendClientHostnameBuffer, 0, size);
-
-        /* Initialize st_clienthostname pointers. */
-        buffer = BackendClientHostnameBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_clienthostname = buffer;
-            buffer += NAMEDATALEN;
-        }
-    }
-
-    /* Create or attach to the shared activity buffer */
-    BackendActivityBufferSize = mul_size(pgstat_track_activity_query_size,
-                                         NumBackendStatSlots);
-    BackendActivityBuffer = (char *)
-        ShmemInitStruct("Backend Activity Buffer",
-                        BackendActivityBufferSize,
-                        &found);
-
-    if (!found)
-    {
-        MemSet(BackendActivityBuffer, 0, BackendActivityBufferSize);
-
-        /* Initialize st_activity pointers. */
-        buffer = BackendActivityBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_activity_raw = buffer;
-            buffer += pgstat_track_activity_query_size;
-        }
-    }
-
-#ifdef USE_SSL
-    /* Create or attach to the shared SSL status buffer */
-    size = mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots);
-    BackendSslStatusBuffer = (PgBackendSSLStatus *)
-        ShmemInitStruct("Backend SSL Status Buffer", size, &found);
-
-    if (!found)
-    {
-        PgBackendSSLStatus *ptr;
-
-        MemSet(BackendSslStatusBuffer, 0, size);
-
-        /* Initialize st_sslstatus pointers. */
-        ptr = BackendSslStatusBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_sslstatus = ptr;
-            ptr++;
-        }
-    }
-#endif
-}
-
-
-/* ----------
- * pgstat_initialize() -
- *
- *    Initialize pgstats state, and set up our on-proc-exit hook.
- *    Called from InitPostgres and AuxiliaryProcessMain. For auxiliary process,
- *    MyBackendId is invalid. Otherwise, MyBackendId must be set,
- *    but we must not have started any transaction yet (since the
- *    exit hook must run after the last transaction exit).
- *    NOTE: MyDatabaseId isn't set yet; so the shutdown hook has to be careful.
- * ----------
- */
-void
-pgstat_initialize(void)
-{
-    /* Initialize MyBEEntry */
-    if (MyBackendId != InvalidBackendId)
-    {
-        Assert(MyBackendId >= 1 && MyBackendId <= MaxBackends);
-        MyBEEntry = &BackendStatusArray[MyBackendId - 1];
-    }
-    else
-    {
-        /* Must be an auxiliary process */
-        Assert(MyAuxProcType != NotAnAuxProcess);
-
-        /*
-         * Assign the MyBEEntry for an auxiliary process.  Since it doesn't
-         * have a BackendId, the slot is statically allocated based on the
-         * auxiliary process type (MyAuxProcType).  Backends use slots indexed
-         * in the range from 1 to MaxBackends (inclusive), so we use
-         * MaxBackends + AuxBackendType + 1 as the index of the slot for an
-         * auxiliary process.
-         */
-        MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
-    }
-
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
-}
-
-/* ----------
- * pgstat_bestart() -
- *
- *    Initialize this backend's entry in the PgBackendStatus array.
- *    Called from InitPostgres.
- *
- *    Apart from auxiliary processes, MyBackendId, MyDatabaseId,
- *    session userid, and application_name must be set for a
- *    backend (hence, this cannot be combined with pgstat_initialize).
- * ----------
- */
-void
-pgstat_bestart(void)
-{
-    SockAddr    clientaddr;
-    volatile PgBackendStatus *beentry;
-
-    /*
-     * To minimize the time spent modifying the PgBackendStatus entry, fetch
-     * all the needed data first.
-     */
-
-    /*
-     * We may not have a MyProcPort (eg, if this is the autovacuum process).
-     * If so, use all-zeroes client address, which is dealt with specially in
-     * pg_stat_get_backend_client_addr and pg_stat_get_backend_client_port.
-     */
-    if (MyProcPort)
-        memcpy(&clientaddr, &MyProcPort->raddr, sizeof(clientaddr));
-    else
-        MemSet(&clientaddr, 0, sizeof(clientaddr));
-
-    /*
-     * Initialize my status entry, following the protocol of bumping
-     * st_changecount before and after; and make sure it's even afterwards. We
-     * use a volatile pointer here to ensure the compiler doesn't try to get
-     * cute.
-     */
-    beentry = MyBEEntry;
-
-    /* pgstats state must be initialized from pgstat_initialize() */
-    Assert(beentry != NULL);
-
-    if (MyBackendId != InvalidBackendId)
-    {
-        if (IsAutoVacuumLauncherProcess())
-        {
-            /* Autovacuum Launcher */
-            beentry->st_backendType = B_AUTOVAC_LAUNCHER;
-        }
-        else if (IsAutoVacuumWorkerProcess())
-        {
-            /* Autovacuum Worker */
-            beentry->st_backendType = B_AUTOVAC_WORKER;
-        }
-        else if (am_walsender)
-        {
-            /* Wal sender */
-            beentry->st_backendType = B_WAL_SENDER;
-        }
-        else if (IsBackgroundWorker)
-        {
-            /* bgworker */
-            beentry->st_backendType = B_BG_WORKER;
-        }
-        else
-        {
-            /* client-backend */
-            beentry->st_backendType = B_BACKEND;
-        }
-    }
-    else
-    {
-        /* Must be an auxiliary process */
-        Assert(MyAuxProcType != NotAnAuxProcess);
-        switch (MyAuxProcType)
-        {
-            case StartupProcess:
-                beentry->st_backendType = B_STARTUP;
-                break;
-            case BgWriterProcess:
-                beentry->st_backendType = B_BG_WRITER;
-                break;
-            case ArchiverProcess:
-                beentry->st_backendType = B_ARCHIVER;
-                break;
-            case CheckpointerProcess:
-                beentry->st_backendType = B_CHECKPOINTER;
-                break;
-            case WalWriterProcess:
-                beentry->st_backendType = B_WAL_WRITER;
-                break;
-            case WalReceiverProcess:
-                beentry->st_backendType = B_WAL_RECEIVER;
-                break;
-            default:
-                elog(FATAL, "unrecognized process type: %d",
-                     (int) MyAuxProcType);
-                proc_exit(1);
-        }
-    }
-
-    do
-    {
-        pgstat_increment_changecount_before(beentry);
-    } while ((beentry->st_changecount & 1) == 0);
-
-    beentry->st_procpid = MyProcPid;
-    beentry->st_proc_start_timestamp = MyStartTimestamp;
-    beentry->st_activity_start_timestamp = 0;
-    beentry->st_state_start_timestamp = 0;
-    beentry->st_xact_start_timestamp = 0;
-    beentry->st_databaseid = MyDatabaseId;
-
-    /* We have userid for client-backends, wal-sender and bgworker processes */
-    if (beentry->st_backendType == B_BACKEND
-        || beentry->st_backendType == B_WAL_SENDER
-        || beentry->st_backendType == B_BG_WORKER)
-        beentry->st_userid = GetSessionUserId();
-    else
-        beentry->st_userid = InvalidOid;
-
-    beentry->st_clientaddr = clientaddr;
-    if (MyProcPort && MyProcPort->remote_hostname)
-        strlcpy(beentry->st_clienthostname, MyProcPort->remote_hostname,
-                NAMEDATALEN);
-    else
-        beentry->st_clienthostname[0] = '\0';
-#ifdef USE_SSL
-    if (MyProcPort && MyProcPort->ssl != NULL)
-    {
-        beentry->st_ssl = true;
-        beentry->st_sslstatus->ssl_bits = be_tls_get_cipher_bits(MyProcPort);
-        beentry->st_sslstatus->ssl_compression = be_tls_get_compression(MyProcPort);
-        strlcpy(beentry->st_sslstatus->ssl_version, be_tls_get_version(MyProcPort), NAMEDATALEN);
-        strlcpy(beentry->st_sslstatus->ssl_cipher, be_tls_get_cipher(MyProcPort), NAMEDATALEN);
-        be_tls_get_peer_subject_name(MyProcPort, beentry->st_sslstatus->ssl_client_dn, NAMEDATALEN);
-        be_tls_get_peer_serial(MyProcPort, beentry->st_sslstatus->ssl_client_serial, NAMEDATALEN);
-        be_tls_get_peer_issuer_name(MyProcPort, beentry->st_sslstatus->ssl_issuer_dn, NAMEDATALEN);
-    }
-    else
-    {
-        beentry->st_ssl = false;
-    }
-#else
-    beentry->st_ssl = false;
-#endif
-    beentry->st_state = STATE_UNDEFINED;
-    beentry->st_appname[0] = '\0';
-    beentry->st_activity_raw[0] = '\0';
-    /* Also make sure the last byte in each string area is always 0 */
-    beentry->st_clienthostname[NAMEDATALEN - 1] = '\0';
-    beentry->st_appname[NAMEDATALEN - 1] = '\0';
-    beentry->st_activity_raw[pgstat_track_activity_query_size - 1] = '\0';
-    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
-    beentry->st_progress_command_target = InvalidOid;
-
-    /*
-     * we don't zero st_progress_param here to save cycles; nobody should
-     * examine it until st_progress_command has been set to something other
-     * than PROGRESS_COMMAND_INVALID
-     */
-
-    pgstat_increment_changecount_after(beentry);
-
-    /* Update app name to current GUC setting */
-    if (application_name)
-        pgstat_report_appname(application_name);
-}
-
-/*
- * Shut down a single backend's statistics reporting at process exit.
- *
- * Flush any remaining statistics counts out to the collector.
- * Without this, operations triggered during backend exit (such as
- * temp table deletions) won't be counted.
- *
- * Lastly, clear out our entry in the PgBackendStatus array.
- */
-static void
-pgstat_beshutdown_hook(int code, Datum arg)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    /*
-     * If we got as far as discovering our own database ID, we can report what
-     * we did to the collector.  Otherwise, we'd be sending an invalid
-     * database ID, so forget it.  (This means that accesses to pg_database
-     * during failed backend starts might never get counted.)
-     */
-    if (OidIsValid(MyDatabaseId))
-        pgstat_report_stat(true);
-
-    /*
-     * Clear my status entry, following the protocol of bumping st_changecount
-     * before and after.  We use a volatile pointer here to ensure the
-     * compiler doesn't try to get cute.
-     */
-    pgstat_increment_changecount_before(beentry);
-
-    beentry->st_procpid = 0;    /* mark invalid */
-
-    pgstat_increment_changecount_after(beentry);
-}
-
-
-/* ----------
- * pgstat_report_activity() -
- *
- *    Called from tcop/postgres.c to report what the backend is actually doing
- *    (but note cmd_str can be NULL for certain cases).
- *
- * All updates of the status entry follow the protocol of bumping
- * st_changecount before and after.  We use a volatile pointer here to
- * ensure the compiler doesn't try to get cute.
- * ----------
- */
-void
-pgstat_report_activity(BackendState state, const char *cmd_str)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-    TimestampTz start_timestamp;
-    TimestampTz current_timestamp;
-    int            len = 0;
-
-    TRACE_POSTGRESQL_STATEMENT_STATUS(cmd_str);
-
-    if (!beentry)
-        return;
-
-    if (!pgstat_track_activities)
-    {
-        if (beentry->st_state != STATE_DISABLED)
-        {
-            volatile PGPROC *proc = MyProc;
-
-            /*
-             * track_activities is disabled, but we last reported a
-             * non-disabled state.  As our final update, change the state and
-             * clear fields we will not be updating anymore.
-             */
-            pgstat_increment_changecount_before(beentry);
-            beentry->st_state = STATE_DISABLED;
-            beentry->st_state_start_timestamp = 0;
-            beentry->st_activity_raw[0] = '\0';
-            beentry->st_activity_start_timestamp = 0;
-            /* st_xact_start_timestamp and wait_event_info are also disabled */
-            beentry->st_xact_start_timestamp = 0;
-            proc->wait_event_info = 0;
-            pgstat_increment_changecount_after(beentry);
-        }
-        return;
-    }
-
-    /*
-     * To minimize the time spent modifying the entry, fetch all the needed
-     * data first.
-     */
-    start_timestamp = GetCurrentStatementStartTimestamp();
-    if (cmd_str != NULL)
-    {
-        /*
-         * Compute length of to-be-stored string unaware of multi-byte
-         * characters. For speed reasons that'll get corrected on read, rather
-         * than computed every write.
-         */
-        len = Min(strlen(cmd_str), pgstat_track_activity_query_size - 1);
-    }
-    current_timestamp = GetCurrentTimestamp();
-
-    /*
-     * Now update the status entry
-     */
-    pgstat_increment_changecount_before(beentry);
-
-    beentry->st_state = state;
-    beentry->st_state_start_timestamp = current_timestamp;
-
-    if (cmd_str != NULL)
-    {
-        memcpy((char *) beentry->st_activity_raw, cmd_str, len);
-        beentry->st_activity_raw[len] = '\0';
-        beentry->st_activity_start_timestamp = start_timestamp;
-    }
-
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_start_command() -
- *
- * Set st_progress_command (and st_progress_command_target) in own backend
- * entry.  Also, zero-initialize st_progress_param array.
- *-----------
- */
-void
-pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    if (!beentry || !pgstat_track_activities)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_progress_command = cmdtype;
-    beentry->st_progress_command_target = relid;
-    MemSet(&beentry->st_progress_param, 0, sizeof(beentry->st_progress_param));
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_update_param() -
- *
- * Update index'th member in st_progress_param[] of own backend entry.
- *-----------
- */
-void
-pgstat_progress_update_param(int index, int64 val)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    Assert(index >= 0 && index < PGSTAT_NUM_PROGRESS_PARAM);
-
-    if (!beentry || !pgstat_track_activities)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_progress_param[index] = val;
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_update_multi_param() -
- *
- * Update multiple members in st_progress_param[] of own backend entry.
- * This is atomic; readers won't see intermediate states.
- *-----------
- */
-void
-pgstat_progress_update_multi_param(int nparam, const int *index,
-                                   const int64 *val)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-    int            i;
-
-    if (!beentry || !pgstat_track_activities || nparam == 0)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-
-    for (i = 0; i < nparam; ++i)
-    {
-        Assert(index[i] >= 0 && index[i] < PGSTAT_NUM_PROGRESS_PARAM);
-
-        beentry->st_progress_param[index[i]] = val[i];
-    }
-
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_end_command() -
- *
- * Reset st_progress_command (and st_progress_command_target) in own backend
- * entry.  This signals the end of the command.
- *-----------
- */
-void
-pgstat_progress_end_command(void)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    if (!beentry)
-        return;
-    if (!pgstat_track_activities
-        && beentry->st_progress_command == PROGRESS_COMMAND_INVALID)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
-    beentry->st_progress_command_target = InvalidOid;
-    pgstat_increment_changecount_after(beentry);
-}
-
-/* ----------
- * pgstat_report_appname() -
- *
- *    Called to update our application name.
- * ----------
- */
-void
-pgstat_report_appname(const char *appname)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-    int            len;
-
-    if (!beentry)
-        return;
-
-    /* This should be unnecessary if GUC did its job, but be safe */
-    len = pg_mbcliplen(appname, strlen(appname), NAMEDATALEN - 1);
-
-    /*
-     * Update my status entry, following the protocol of bumping
-     * st_changecount before and after.  We use a volatile pointer here to
-     * ensure the compiler doesn't try to get cute.
-     */
-    pgstat_increment_changecount_before(beentry);
-
-    memcpy((char *) beentry->st_appname, appname, len);
-    beentry->st_appname[len] = '\0';
-
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*
- * Report current transaction start timestamp as the specified value.
- * Zero means there is no active transaction.
- */
-void
-pgstat_report_xact_timestamp(TimestampTz tstamp)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    if (!pgstat_track_activities || !beentry)
-        return;
-
-    /*
-     * Update my status entry, following the protocol of bumping
-     * st_changecount before and after.  We use a volatile pointer here to
-     * ensure the compiler doesn't try to get cute.
-     */
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_xact_start_timestamp = tstamp;
-    pgstat_increment_changecount_after(beentry);
-}
-
-/* ----------
- * pgstat_read_current_status() -
- *
- *    Copy the current contents of the PgBackendStatus array to local memory,
- *    if not already done in this transaction.
- * ----------
- */
-static void
-pgstat_read_current_status(void)
-{
-    volatile PgBackendStatus *beentry;
-    LocalPgBackendStatus *localtable;
-    LocalPgBackendStatus *localentry;
-    char       *localappname,
-               *localclienthostname,
-               *localactivity;
-#ifdef USE_SSL
-    PgBackendSSLStatus *localsslstatus;
-#endif
-    int            i;
-
-    Assert(!pgStatRunningInCollector);
-    if (localBackendStatusTable)
-        return;                    /* already done */
-
-    pgstat_setup_memcxt();
-
-    localtable = (LocalPgBackendStatus *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           sizeof(LocalPgBackendStatus) * NumBackendStatSlots);
-    localappname = (char *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           NAMEDATALEN * NumBackendStatSlots);
-    localclienthostname = (char *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           NAMEDATALEN * NumBackendStatSlots);
-    localactivity = (char *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           pgstat_track_activity_query_size * NumBackendStatSlots);
-#ifdef USE_SSL
-    localsslstatus = (PgBackendSSLStatus *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           sizeof(PgBackendSSLStatus) * NumBackendStatSlots);
-#endif
-
-    localNumBackends = 0;
-
-    beentry = BackendStatusArray;
-    localentry = localtable;
-    for (i = 1; i <= NumBackendStatSlots; i++)
-    {
-        /*
-         * Follow the protocol of retrying if st_changecount changes while we
-         * copy the entry, or if it's odd.  (The check for odd is needed to
-         * cover the case where we are able to completely copy the entry while
-         * the source backend is between increment steps.)    We use a volatile
-         * pointer here to ensure the compiler doesn't try to get cute.
-         */
-        for (;;)
-        {
-            int            before_changecount;
-            int            after_changecount;
-
-            pgstat_save_changecount_before(beentry, before_changecount);
-
-            localentry->backendStatus.st_procpid = beentry->st_procpid;
-            if (localentry->backendStatus.st_procpid > 0)
-            {
-                memcpy(&localentry->backendStatus, (char *) beentry, sizeof(PgBackendStatus));
-
-                /*
-                 * strcpy is safe even if the string is modified concurrently,
-                 * because there's always a \0 at the end of the buffer.
-                 */
-                strcpy(localappname, (char *) beentry->st_appname);
-                localentry->backendStatus.st_appname = localappname;
-                strcpy(localclienthostname, (char *) beentry->st_clienthostname);
-                localentry->backendStatus.st_clienthostname = localclienthostname;
-                strcpy(localactivity, (char *) beentry->st_activity_raw);
-                localentry->backendStatus.st_activity_raw = localactivity;
-                localentry->backendStatus.st_ssl = beentry->st_ssl;
-#ifdef USE_SSL
-                if (beentry->st_ssl)
-                {
-                    memcpy(localsslstatus, beentry->st_sslstatus, sizeof(PgBackendSSLStatus));
-                    localentry->backendStatus.st_sslstatus = localsslstatus;
-                }
-#endif
-            }
-
-            pgstat_save_changecount_after(beentry, after_changecount);
-            if (before_changecount == after_changecount &&
-                (before_changecount & 1) == 0)
-                break;
-
-            /* Make sure we can break out of loop if stuck... */
-            CHECK_FOR_INTERRUPTS();
-        }
-
-        beentry++;
-        /* Only valid entries get included into the local array */
-        if (localentry->backendStatus.st_procpid > 0)
-        {
-            BackendIdGetTransactionIds(i,
-                                       &localentry->backend_xid,
-                                       &localentry->backend_xmin);
-
-            localentry++;
-            localappname += NAMEDATALEN;
-            localclienthostname += NAMEDATALEN;
-            localactivity += pgstat_track_activity_query_size;
-#ifdef USE_SSL
-            localsslstatus++;
-#endif
-            localNumBackends++;
-        }
-    }
-
-    /* Set the pointer only after completion of a valid table */
-    localBackendStatusTable = localtable;
-}
-
-/* ----------
- * pgstat_get_wait_event_type() -
- *
- *    Return a string representing the current wait event type, backend is
- *    waiting on.
- */
-const char *
-pgstat_get_wait_event_type(uint32 wait_event_info)
-{
-    uint32        classId;
-    const char *event_type;
-
-    /* report process as not waiting. */
-    if (wait_event_info == 0)
-        return NULL;
-
-    classId = wait_event_info & 0xFF000000;
-
-    switch (classId)
-    {
-        case PG_WAIT_LWLOCK:
-            event_type = "LWLock";
-            break;
-        case PG_WAIT_LOCK:
-            event_type = "Lock";
-            break;
-        case PG_WAIT_BUFFER_PIN:
-            event_type = "BufferPin";
-            break;
-        case PG_WAIT_ACTIVITY:
-            event_type = "Activity";
-            break;
-        case PG_WAIT_CLIENT:
-            event_type = "Client";
-            break;
-        case PG_WAIT_EXTENSION:
-            event_type = "Extension";
-            break;
-        case PG_WAIT_IPC:
-            event_type = "IPC";
-            break;
-        case PG_WAIT_TIMEOUT:
-            event_type = "Timeout";
-            break;
-        case PG_WAIT_IO:
-            event_type = "IO";
-            break;
-        default:
-            event_type = "???";
-            break;
-    }
-
-    return event_type;
-}
-
-/* ----------
- * pgstat_get_wait_event() -
- *
- *    Return a string representing the current wait event, backend is
- *    waiting on.
- */
-const char *
-pgstat_get_wait_event(uint32 wait_event_info)
-{
-    uint32        classId;
-    uint16        eventId;
-    const char *event_name;
-
-    /* report process as not waiting. */
-    if (wait_event_info == 0)
-        return NULL;
-
-    classId = wait_event_info & 0xFF000000;
-    eventId = wait_event_info & 0x0000FFFF;
-
-    switch (classId)
-    {
-        case PG_WAIT_LWLOCK:
-            event_name = GetLWLockIdentifier(classId, eventId);
-            break;
-        case PG_WAIT_LOCK:
-            event_name = GetLockNameFromTagType(eventId);
-            break;
-        case PG_WAIT_BUFFER_PIN:
-            event_name = "BufferPin";
-            break;
-        case PG_WAIT_ACTIVITY:
-            {
-                WaitEventActivity w = (WaitEventActivity) wait_event_info;
-
-                event_name = pgstat_get_wait_activity(w);
-                break;
-            }
-        case PG_WAIT_CLIENT:
-            {
-                WaitEventClient w = (WaitEventClient) wait_event_info;
-
-                event_name = pgstat_get_wait_client(w);
-                break;
-            }
-        case PG_WAIT_EXTENSION:
-            event_name = "Extension";
-            break;
-        case PG_WAIT_IPC:
-            {
-                WaitEventIPC w = (WaitEventIPC) wait_event_info;
-
-                event_name = pgstat_get_wait_ipc(w);
-                break;
-            }
-        case PG_WAIT_TIMEOUT:
-            {
-                WaitEventTimeout w = (WaitEventTimeout) wait_event_info;
-
-                event_name = pgstat_get_wait_timeout(w);
-                break;
-            }
-        case PG_WAIT_IO:
-            {
-                WaitEventIO w = (WaitEventIO) wait_event_info;
-
-                event_name = pgstat_get_wait_io(w);
-                break;
-            }
-        default:
-            event_name = "unknown wait event";
-            break;
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_activity() -
- *
- * Convert WaitEventActivity to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_activity(WaitEventActivity w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_ARCHIVER_MAIN:
-            event_name = "ArchiverMain";
-            break;
-        case WAIT_EVENT_AUTOVACUUM_MAIN:
-            event_name = "AutoVacuumMain";
-            break;
-        case WAIT_EVENT_BGWRITER_HIBERNATE:
-            event_name = "BgWriterHibernate";
-            break;
-        case WAIT_EVENT_BGWRITER_MAIN:
-            event_name = "BgWriterMain";
-            break;
-        case WAIT_EVENT_CHECKPOINTER_MAIN:
-            event_name = "CheckpointerMain";
-            break;
-        case WAIT_EVENT_LOGICAL_APPLY_MAIN:
-            event_name = "LogicalApplyMain";
-            break;
-        case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
-            event_name = "LogicalLauncherMain";
-            break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
-            break;
-        case WAIT_EVENT_RECOVERY_WAL_ALL:
-            event_name = "RecoveryWalAll";
-            break;
-        case WAIT_EVENT_RECOVERY_WAL_STREAM:
-            event_name = "RecoveryWalStream";
-            break;
-        case WAIT_EVENT_SYSLOGGER_MAIN:
-            event_name = "SysLoggerMain";
-            break;
-        case WAIT_EVENT_WAL_RECEIVER_MAIN:
-            event_name = "WalReceiverMain";
-            break;
-        case WAIT_EVENT_WAL_SENDER_MAIN:
-            event_name = "WalSenderMain";
-            break;
-        case WAIT_EVENT_WAL_WRITER_MAIN:
-            event_name = "WalWriterMain";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_client() -
- *
- * Convert WaitEventClient to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_client(WaitEventClient w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_CLIENT_READ:
-            event_name = "ClientRead";
-            break;
-        case WAIT_EVENT_CLIENT_WRITE:
-            event_name = "ClientWrite";
-            break;
-        case WAIT_EVENT_LIBPQWALRECEIVER_CONNECT:
-            event_name = "LibPQWalReceiverConnect";
-            break;
-        case WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE:
-            event_name = "LibPQWalReceiverReceive";
-            break;
-        case WAIT_EVENT_SSL_OPEN_SERVER:
-            event_name = "SSLOpenServer";
-            break;
-        case WAIT_EVENT_WAL_RECEIVER_WAIT_START:
-            event_name = "WalReceiverWaitStart";
-            break;
-        case WAIT_EVENT_WAL_SENDER_WAIT_WAL:
-            event_name = "WalSenderWaitForWAL";
-            break;
-        case WAIT_EVENT_WAL_SENDER_WRITE_DATA:
-            event_name = "WalSenderWriteData";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_ipc() -
- *
- * Convert WaitEventIPC to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_ipc(WaitEventIPC w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_BGWORKER_SHUTDOWN:
-            event_name = "BgWorkerShutdown";
-            break;
-        case WAIT_EVENT_BGWORKER_STARTUP:
-            event_name = "BgWorkerStartup";
-            break;
-        case WAIT_EVENT_BTREE_PAGE:
-            event_name = "BtreePage";
-            break;
-        case WAIT_EVENT_CLOG_GROUP_UPDATE:
-            event_name = "ClogGroupUpdate";
-            break;
-        case WAIT_EVENT_EXECUTE_GATHER:
-            event_name = "ExecuteGather";
-            break;
-        case WAIT_EVENT_HASH_BATCH_ALLOCATING:
-            event_name = "Hash/Batch/Allocating";
-            break;
-        case WAIT_EVENT_HASH_BATCH_ELECTING:
-            event_name = "Hash/Batch/Electing";
-            break;
-        case WAIT_EVENT_HASH_BATCH_LOADING:
-            event_name = "Hash/Batch/Loading";
-            break;
-        case WAIT_EVENT_HASH_BUILD_ALLOCATING:
-            event_name = "Hash/Build/Allocating";
-            break;
-        case WAIT_EVENT_HASH_BUILD_ELECTING:
-            event_name = "Hash/Build/Electing";
-            break;
-        case WAIT_EVENT_HASH_BUILD_HASHING_INNER:
-            event_name = "Hash/Build/HashingInner";
-            break;
-        case WAIT_EVENT_HASH_BUILD_HASHING_OUTER:
-            event_name = "Hash/Build/HashingOuter";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING:
-            event_name = "Hash/GrowBatches/Allocating";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_DECIDING:
-            event_name = "Hash/GrowBatches/Deciding";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_ELECTING:
-            event_name = "Hash/GrowBatches/Electing";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_FINISHING:
-            event_name = "Hash/GrowBatches/Finishing";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING:
-            event_name = "Hash/GrowBatches/Repartitioning";
-            break;
-        case WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING:
-            event_name = "Hash/GrowBuckets/Allocating";
-            break;
-        case WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING:
-            event_name = "Hash/GrowBuckets/Electing";
-            break;
-        case WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING:
-            event_name = "Hash/GrowBuckets/Reinserting";
-            break;
-        case WAIT_EVENT_LOGICAL_SYNC_DATA:
-            event_name = "LogicalSyncData";
-            break;
-        case WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE:
-            event_name = "LogicalSyncStateChange";
-            break;
-        case WAIT_EVENT_MQ_INTERNAL:
-            event_name = "MessageQueueInternal";
-            break;
-        case WAIT_EVENT_MQ_PUT_MESSAGE:
-            event_name = "MessageQueuePutMessage";
-            break;
-        case WAIT_EVENT_MQ_RECEIVE:
-            event_name = "MessageQueueReceive";
-            break;
-        case WAIT_EVENT_MQ_SEND:
-            event_name = "MessageQueueSend";
-            break;
-        case WAIT_EVENT_PARALLEL_BITMAP_SCAN:
-            event_name = "ParallelBitmapScan";
-            break;
-        case WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN:
-            event_name = "ParallelCreateIndexScan";
-            break;
-        case WAIT_EVENT_PARALLEL_FINISH:
-            event_name = "ParallelFinish";
-            break;
-        case WAIT_EVENT_PROCARRAY_GROUP_UPDATE:
-            event_name = "ProcArrayGroupUpdate";
-            break;
-        case WAIT_EVENT_PROMOTE:
-            event_name = "Promote";
-            break;
-        case WAIT_EVENT_REPLICATION_ORIGIN_DROP:
-            event_name = "ReplicationOriginDrop";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_DROP:
-            event_name = "ReplicationSlotDrop";
-            break;
-        case WAIT_EVENT_SAFE_SNAPSHOT:
-            event_name = "SafeSnapshot";
-            break;
-        case WAIT_EVENT_SYNC_REP:
-            event_name = "SyncRep";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_timeout() -
- *
- * Convert WaitEventTimeout to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_timeout(WaitEventTimeout w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_BASE_BACKUP_THROTTLE:
-            event_name = "BaseBackupThrottle";
-            break;
-        case WAIT_EVENT_PG_SLEEP:
-            event_name = "PgSleep";
-            break;
-        case WAIT_EVENT_RECOVERY_APPLY_DELAY:
-            event_name = "RecoveryApplyDelay";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_io() -
- *
- * Convert WaitEventIO to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_io(WaitEventIO w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_BUFFILE_READ:
-            event_name = "BufFileRead";
-            break;
-        case WAIT_EVENT_BUFFILE_WRITE:
-            event_name = "BufFileWrite";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_READ:
-            event_name = "ControlFileRead";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_SYNC:
-            event_name = "ControlFileSync";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE:
-            event_name = "ControlFileSyncUpdate";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_WRITE:
-            event_name = "ControlFileWrite";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE:
-            event_name = "ControlFileWriteUpdate";
-            break;
-        case WAIT_EVENT_COPY_FILE_READ:
-            event_name = "CopyFileRead";
-            break;
-        case WAIT_EVENT_COPY_FILE_WRITE:
-            event_name = "CopyFileWrite";
-            break;
-        case WAIT_EVENT_DATA_FILE_EXTEND:
-            event_name = "DataFileExtend";
-            break;
-        case WAIT_EVENT_DATA_FILE_FLUSH:
-            event_name = "DataFileFlush";
-            break;
-        case WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC:
-            event_name = "DataFileImmediateSync";
-            break;
-        case WAIT_EVENT_DATA_FILE_PREFETCH:
-            event_name = "DataFilePrefetch";
-            break;
-        case WAIT_EVENT_DATA_FILE_READ:
-            event_name = "DataFileRead";
-            break;
-        case WAIT_EVENT_DATA_FILE_SYNC:
-            event_name = "DataFileSync";
-            break;
-        case WAIT_EVENT_DATA_FILE_TRUNCATE:
-            event_name = "DataFileTruncate";
-            break;
-        case WAIT_EVENT_DATA_FILE_WRITE:
-            event_name = "DataFileWrite";
-            break;
-        case WAIT_EVENT_DSM_FILL_ZERO_WRITE:
-            event_name = "DSMFillZeroWrite";
-            break;
-        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ:
-            event_name = "LockFileAddToDataDirRead";
-            break;
-        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC:
-            event_name = "LockFileAddToDataDirSync";
-            break;
-        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE:
-            event_name = "LockFileAddToDataDirWrite";
-            break;
-        case WAIT_EVENT_LOCK_FILE_CREATE_READ:
-            event_name = "LockFileCreateRead";
-            break;
-        case WAIT_EVENT_LOCK_FILE_CREATE_SYNC:
-            event_name = "LockFileCreateSync";
-            break;
-        case WAIT_EVENT_LOCK_FILE_CREATE_WRITE:
-            event_name = "LockFileCreateWrite";
-            break;
-        case WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ:
-            event_name = "LockFileReCheckDataDirRead";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC:
-            event_name = "LogicalRewriteCheckpointSync";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC:
-            event_name = "LogicalRewriteMappingSync";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE:
-            event_name = "LogicalRewriteMappingWrite";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_SYNC:
-            event_name = "LogicalRewriteSync";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE:
-            event_name = "LogicalRewriteTruncate";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_WRITE:
-            event_name = "LogicalRewriteWrite";
-            break;
-        case WAIT_EVENT_RELATION_MAP_READ:
-            event_name = "RelationMapRead";
-            break;
-        case WAIT_EVENT_RELATION_MAP_SYNC:
-            event_name = "RelationMapSync";
-            break;
-        case WAIT_EVENT_RELATION_MAP_WRITE:
-            event_name = "RelationMapWrite";
-            break;
-        case WAIT_EVENT_REORDER_BUFFER_READ:
-            event_name = "ReorderBufferRead";
-            break;
-        case WAIT_EVENT_REORDER_BUFFER_WRITE:
-            event_name = "ReorderBufferWrite";
-            break;
-        case WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ:
-            event_name = "ReorderLogicalMappingRead";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_READ:
-            event_name = "ReplicationSlotRead";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC:
-            event_name = "ReplicationSlotRestoreSync";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_SYNC:
-            event_name = "ReplicationSlotSync";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_WRITE:
-            event_name = "ReplicationSlotWrite";
-            break;
-        case WAIT_EVENT_SLRU_FLUSH_SYNC:
-            event_name = "SLRUFlushSync";
-            break;
-        case WAIT_EVENT_SLRU_READ:
-            event_name = "SLRURead";
-            break;
-        case WAIT_EVENT_SLRU_SYNC:
-            event_name = "SLRUSync";
-            break;
-        case WAIT_EVENT_SLRU_WRITE:
-            event_name = "SLRUWrite";
-            break;
-        case WAIT_EVENT_SNAPBUILD_READ:
-            event_name = "SnapbuildRead";
-            break;
-        case WAIT_EVENT_SNAPBUILD_SYNC:
-            event_name = "SnapbuildSync";
-            break;
-        case WAIT_EVENT_SNAPBUILD_WRITE:
-            event_name = "SnapbuildWrite";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC:
-            event_name = "TimelineHistoryFileSync";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE:
-            event_name = "TimelineHistoryFileWrite";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_READ:
-            event_name = "TimelineHistoryRead";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_SYNC:
-            event_name = "TimelineHistorySync";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_WRITE:
-            event_name = "TimelineHistoryWrite";
-            break;
-        case WAIT_EVENT_TWOPHASE_FILE_READ:
-            event_name = "TwophaseFileRead";
-            break;
-        case WAIT_EVENT_TWOPHASE_FILE_SYNC:
-            event_name = "TwophaseFileSync";
-            break;
-        case WAIT_EVENT_TWOPHASE_FILE_WRITE:
-            event_name = "TwophaseFileWrite";
-            break;
-        case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ:
-            event_name = "WALSenderTimelineHistoryRead";
-            break;
-        case WAIT_EVENT_WAL_BOOTSTRAP_SYNC:
-            event_name = "WALBootstrapSync";
-            break;
-        case WAIT_EVENT_WAL_BOOTSTRAP_WRITE:
-            event_name = "WALBootstrapWrite";
-            break;
-        case WAIT_EVENT_WAL_COPY_READ:
-            event_name = "WALCopyRead";
-            break;
-        case WAIT_EVENT_WAL_COPY_SYNC:
-            event_name = "WALCopySync";
-            break;
-        case WAIT_EVENT_WAL_COPY_WRITE:
-            event_name = "WALCopyWrite";
-            break;
-        case WAIT_EVENT_WAL_INIT_SYNC:
-            event_name = "WALInitSync";
-            break;
-        case WAIT_EVENT_WAL_INIT_WRITE:
-            event_name = "WALInitWrite";
-            break;
-        case WAIT_EVENT_WAL_READ:
-            event_name = "WALRead";
-            break;
-        case WAIT_EVENT_WAL_SYNC:
-            event_name = "WALSync";
-            break;
-        case WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN:
-            event_name = "WALSyncMethodAssign";
-            break;
-        case WAIT_EVENT_WAL_WRITE:
-            event_name = "WALWrite";
-            break;
-
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-
-/* ----------
- * pgstat_get_backend_current_activity() -
- *
- *    Return a string representing the current activity of the backend with
- *    the specified PID.  This looks directly at the BackendStatusArray,
- *    and so will provide current information regardless of the age of our
- *    transaction's snapshot of the status array.
- *
- *    It is the caller's responsibility to invoke this only for backends whose
- *    state is expected to remain stable while the result is in use.  The
- *    only current use is in deadlock reporting, where we can expect that
- *    the target backend is blocked on a lock.  (There are corner cases
- *    where the target's wait could get aborted while we are looking at it,
- *    but the very worst consequence is to return a pointer to a string
- *    that's been changed, so we won't worry too much.)
- *
- *    Note: return strings for special cases match pg_stat_get_backend_activity.
- * ----------
- */
-const char *
-pgstat_get_backend_current_activity(int pid, bool checkUser)
-{
-    PgBackendStatus *beentry;
-    int            i;
-
-    beentry = BackendStatusArray;
-    for (i = 1; i <= MaxBackends; i++)
-    {
-        /*
-         * Although we expect the target backend's entry to be stable, that
-         * doesn't imply that anyone else's is.  To avoid identifying the
-         * wrong backend, while we check for a match to the desired PID we
-         * must follow the protocol of retrying if st_changecount changes
-         * while we examine the entry, or if it's odd.  (This might be
-         * unnecessary, since fetching or storing an int is almost certainly
-         * atomic, but let's play it safe.)  We use a volatile pointer here to
-         * ensure the compiler doesn't try to get cute.
-         */
-        volatile PgBackendStatus *vbeentry = beentry;
-        bool        found;
-
-        for (;;)
-        {
-            int            before_changecount;
-            int            after_changecount;
-
-            pgstat_save_changecount_before(vbeentry, before_changecount);
-
-            found = (vbeentry->st_procpid == pid);
-
-            pgstat_save_changecount_after(vbeentry, after_changecount);
-
-            if (before_changecount == after_changecount &&
-                (before_changecount & 1) == 0)
-                break;
-
-            /* Make sure we can break out of loop if stuck... */
-            CHECK_FOR_INTERRUPTS();
-        }
-
-        if (found)
-        {
-            /* Now it is safe to use the non-volatile pointer */
-            if (checkUser && !superuser() && beentry->st_userid != GetUserId())
-                return "<insufficient privilege>";
-            else if (*(beentry->st_activity_raw) == '\0')
-                return "<command string not enabled>";
-            else
-            {
-                /* this'll leak a bit of memory, but that seems acceptable */
-                return pgstat_clip_activity(beentry->st_activity_raw);
-            }
-        }
-
-        beentry++;
-    }
-
-    /* If we get here, caller is in error ... */
-    return "<backend information not available>";
-}
-
-/* ----------
- * pgstat_get_crashed_backend_activity() -
- *
- *    Return a string representing the current activity of the backend with
- *    the specified PID.  Like the function above, but reads shared memory with
- *    the expectation that it may be corrupt.  On success, copy the string
- *    into the "buffer" argument and return that pointer.  On failure,
- *    return NULL.
- *
- *    This function is only intended to be used by the postmaster to report the
- *    query that crashed a backend.  In particular, no attempt is made to
- *    follow the correct concurrency protocol when accessing the
- *    BackendStatusArray.  But that's OK, in the worst case we'll return a
- *    corrupted message.  We also must take care not to trip on ereport(ERROR).
- * ----------
- */
-const char *
-pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
-{
-    volatile PgBackendStatus *beentry;
-    int            i;
-
-    beentry = BackendStatusArray;
-
-    /*
-     * We probably shouldn't get here before shared memory has been set up,
-     * but be safe.
-     */
-    if (beentry == NULL || BackendActivityBuffer == NULL)
-        return NULL;
-
-    for (i = 1; i <= MaxBackends; i++)
-    {
-        if (beentry->st_procpid == pid)
-        {
-            /* Read pointer just once, so it can't change after validation */
-            const char *activity = beentry->st_activity_raw;
-            const char *activity_last;
-
-            /*
-             * We mustn't access activity string before we verify that it
-             * falls within the BackendActivityBuffer. To make sure that the
-             * entire string including its ending is contained within the
-             * buffer, subtract one activity length from the buffer size.
-             */
-            activity_last = BackendActivityBuffer + BackendActivityBufferSize
-                - pgstat_track_activity_query_size;
-
-            if (activity < BackendActivityBuffer ||
-                activity > activity_last)
-                return NULL;
-
-            /* If no string available, no point in a report */
-            if (activity[0] == '\0')
-                return NULL;
-
-            /*
-             * Copy only ASCII-safe characters so we don't run into encoding
-             * problems when reporting the message; and be sure not to run off
-             * the end of memory.  As only ASCII characters are reported, it
-             * doesn't seem necessary to perform multibyte aware clipping.
-             */
-            ascii_safe_strlcpy(buffer, activity,
-                               Min(buflen, pgstat_track_activity_query_size));
-
-            return buffer;
-        }
-
-        beentry++;
-    }
-
-    /* PID not found */
-    return NULL;
-}
-
-const char *
-pgstat_get_backend_desc(BackendType backendType)
-{
-    const char *backendDesc = "unknown process type";
-
-    switch (backendType)
-    {
-        case B_AUTOVAC_LAUNCHER:
-            backendDesc = "autovacuum launcher";
-            break;
-        case B_AUTOVAC_WORKER:
-            backendDesc = "autovacuum worker";
-            break;
-        case B_BACKEND:
-            backendDesc = "client backend";
-            break;
-        case B_BG_WORKER:
-            backendDesc = "background worker";
-            break;
-        case B_BG_WRITER:
-            backendDesc = "background writer";
-            break;
-        case B_ARCHIVER:
-            backendDesc = "archiver";
-            break;
-        case B_CHECKPOINTER:
-            backendDesc = "checkpointer";
-            break;
-        case B_STARTUP:
-            backendDesc = "startup";
-            break;
-        case B_WAL_RECEIVER:
-            backendDesc = "walreceiver";
-            break;
-        case B_WAL_SENDER:
-            backendDesc = "walsender";
-            break;
-        case B_WAL_WRITER:
-            backendDesc = "walwriter";
-            break;
-    }
-
-    return backendDesc;
-}
-
-/* ------------------------------------------------------------
- * Local support functions follow
- * ------------------------------------------------------------
- */
-
-
-/* ----------
- * pgstat_setheader() -
- *
- *        Set common header fields in a statistics message
- * ----------
- */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
-    {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
- *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
- * ----------
- */
-void
-pgstat_send_archiver(const char *xlog, bool failed)
-{
-    PgStat_MsgArchiver msg;
-
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* ----------
- * pgstat_send_bgwriter() -
- *
- *        Send bgwriter statistics to the collector
- * ----------
- */
-void
-pgstat_send_bgwriter(void)
-{
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
-
-    /*
-     * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
-     */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
-        return;
-
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
-
-    /*
-     * Clear out the statistics buffer, so it can be re-used.
-     */
-    MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
-}
-
-
-/* ----------
- * PgstatCollectorMain() -
- *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, pgstat_sighup_handler);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, pgstat_exit);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    /*
-     * Identify myself via ps
-     */
-    init_ps_display("stats collector", "", "", "");
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do recognize got_SIGHUP inside
-     * the inner loop, which means that such interrupts will get serviced but
-     * the latch won't get cleared until next time there is a break in the
-     * action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (need_exit)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * need_exit becomes set.
-         */
-        while (!need_exit)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (got_SIGHUP)
-            {
-                got_SIGHUP = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry((PgStat_MsgInquiry *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat((PgStat_MsgTabstat *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge((PgStat_MsgTabpurge *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb((PgStat_MsgDropdb *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter((PgStat_MsgResetcounter *) &msg,
-                                             len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(
-                                                   (PgStat_MsgResetsharedcounter *) &msg,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(
-                                                   (PgStat_MsgResetsinglecounter *) &msg,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac((PgStat_MsgAutovacStart *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum((PgStat_MsgVacuum *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze((PgStat_MsgAnalyze *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver((PgStat_MsgArchiver *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter((PgStat_MsgBgWriter *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat((PgStat_MsgFuncstat *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge((PgStat_MsgFuncpurge *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict((PgStat_MsgRecoveryConflict *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock((PgStat_MsgDeadlock *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile((PgStat_MsgTempFile *) &msg, len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr & WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    exit(0);
-}
-
-
-/* SIGQUIT signal handler for collector process */
-static void
-pgstat_exit(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    need_exit = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
-/* SIGHUP handler for collector process */
-static void
-pgstat_sighup_handler(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    got_SIGHUP = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
-/*
- * Subroutine to clear stats in a database entry
- *
- * Tables and functions hashes are initialized to empty.
- */
-static void
-reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
-{
-    HASHCTL        hash_ctl;
-
-    dbentry->n_xact_commit = 0;
-    dbentry->n_xact_rollback = 0;
-    dbentry->n_blocks_fetched = 0;
-    dbentry->n_blocks_hit = 0;
-    dbentry->n_tuples_returned = 0;
-    dbentry->n_tuples_fetched = 0;
-    dbentry->n_tuples_inserted = 0;
-    dbentry->n_tuples_updated = 0;
-    dbentry->n_tuples_deleted = 0;
-    dbentry->last_autovac_time = 0;
-    dbentry->n_conflict_tablespace = 0;
-    dbentry->n_conflict_lock = 0;
-    dbentry->n_conflict_snapshot = 0;
-    dbentry->n_conflict_bufferpin = 0;
-    dbentry->n_conflict_startup_deadlock = 0;
-    dbentry->n_temp_files = 0;
-    dbentry->n_temp_bytes = 0;
-    dbentry->n_deadlocks = 0;
-    dbentry->n_block_read_time = 0;
-    dbentry->n_block_write_time = 0;
-
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
-
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
-}
-
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
-{
-    PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
-     */
-    if (!found)
-        reset_dbentry_counters(result);
-
-    return result;
-}
-
-
-/*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
-{
-    PgStat_StatTabEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /* If not found, initialize the new one. */
-    if (!found)
-    {
-        result->numscans = 0;
-        result->tuples_returned = 0;
-        result->tuples_fetched = 0;
-        result->tuples_inserted = 0;
-        result->tuples_updated = 0;
-        result->tuples_deleted = 0;
-        result->tuples_hot_updated = 0;
-        result->n_live_tuples = 0;
-        result->n_dead_tuples = 0;
-        result->changes_since_analyze = 0;
-        result->blocks_fetched = 0;
-        result->blocks_hit = 0;
-        result->vacuum_timestamp = 0;
-        result->vacuum_count = 0;
-        result->autovac_vacuum_timestamp = 0;
-        result->autovac_vacuum_count = 0;
-        result->analyze_timestamp = 0;
-        result->analyze_count = 0;
-        result->autovac_analyze_timestamp = 0;
-        result->autovac_analyze_count = 0;
-    }
-
-    return result;
-}
-
-
-/* ----------
- * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
- * ----------
- */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
-{
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    FILE       *fpout;
-    int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-    int            rc;
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Set the timestamp of the stats file.
-     */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Write global stats struct
-     */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Write archiver stats struct
-     */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database table.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        /*
-         * Write out the table and function stats for this DB into the
-         * appropriate per-DB stat file, if required.
-         */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
-
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
-
-        /*
-         * Write out the DB entry. We don't write the tables or functions
-         * pointers, since they're of no use to any other process.
-         */
-        fputc('D', fpout);
-        rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
-}
-
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
-
-/* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
- * ----------
- */
-static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
-{
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpout;
-    int32        format_id;
-    Oid            dbid = dbentry->databaseid;
-    int            rc;
-    char        tmpfile[MAXPGPATH];
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database's access stats per table.
-     */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
-    {
-        fputc('T', fpout);
-        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * Walk through the database's function stats table.
-     */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_statsfiles() -
- *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
- *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
- * ----------
- */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
-
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-
-    /*
-     * Set the current timestamp (will be kept only in case we can't load an
-     * existing statsfile).
-     */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return dbhash;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
-        goto done;
-    }
-
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Add to the DB hash
-                 */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-
-    return dbhash;
-}
-
-
-/* ----------
- * pgstat_read_db_statsfile() -
- *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
- * ----------
- */
-static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
-{
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatTabEntry tabbuf;
-    PgStat_StatFuncEntry funcbuf;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'T'    A PgStat_StatTabEntry follows.
-                 */
-            case 'T':
-                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
-                          fpin) != sizeof(PgStat_StatTabEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
-                break;
-
-                /*
-                 * 'F'    A PgStat_StatFuncEntry follows.
-                 */
-            case 'F':
-                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
-                          fpin) != sizeof(PgStat_StatFuncEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
-                break;
-
-                /*
-                 * 'E'    The EOF marker of a complete stats file.
-                 */
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
-}
-
-
-/* ----------
- * pgstat_setup_memcxt() -
- *
- *    Create pgStatLocalContext, if not already done.
- * ----------
- */
-static void
-pgstat_setup_memcxt(void)
-{
-    if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
-}
-
-
-/* ----------
- * pgstat_clear_snapshot() -
- *
- *    Discard any data collected in the current transaction.  Any subsequent
- *    request will cause new snapshots to be read.
- *
- *    This is also invoked during transaction commit or abort to discard
- *    the no-longer-wanted snapshot.
- * ----------
- */
-void
-pgstat_clear_snapshot(void)
-{
-    /* Release memory, if any was allocated */
-    if (pgStatLocalContext)
-        MemoryContextDelete(pgStatLocalContext);
-
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset the statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process an ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
-}
-
-/*
- * Convert a potentially unsafely truncated activity string (see
- * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
- * one.
- *
- * The returned string is allocated in the caller's memory context and may be
- * freed.
- */
-char *
-pgstat_clip_activity(const char *raw_activity)
-{
-    char       *activity;
-    int            rawlen;
-    int            cliplen;
-
-    /*
-     * Some callers, like pgstat_get_backend_current_activity(), do not
-     * guarantee that the buffer isn't concurrently modified. We try to take
-     * care that the buffer is always terminated by a NUL byte regardless, but
-     * let's still be paranoid about the string's length. In those cases the
-     * underlying buffer is guaranteed to be pgstat_track_activity_query_size
-     * large.
-     */
-    activity = pnstrdup(raw_activity, pgstat_track_activity_query_size - 1);
-
-    /* now double-guaranteed to be NUL terminated */
-    rawlen = strlen(activity);
-
-    /*
-     * All supported server-encodings make it possible to determine the length
-     * of a multi-byte character from its first byte (this is not the case for
-     * client encodings, see GB18030). As st_activity is always stored using
-     * server encoding, this allows us to perform multi-byte aware truncation,
-     * even if the string earlier was truncated in the middle of a multi-byte
-     * character.
-     */
-    cliplen = pg_mbcliplen(activity, rawlen,
-                           pgstat_track_activity_query_size - 1);
-
-    activity[cliplen] = '\0';
-
-    return activity;
-}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a663a62fd5..a01b81a594 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -95,6 +95,7 @@
 
 #include "access/transam.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/pg_control.h"
 #include "common/file_perm.h"
@@ -255,7 +256,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -502,7 +502,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1302,12 +1301,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1756,11 +1749,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2595,8 +2583,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -2927,8 +2913,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -2995,13 +2979,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3076,22 +3053,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3550,22 +3511,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3761,8 +3706,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3801,8 +3744,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4003,8 +3945,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -4977,18 +4917,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5101,12 +5029,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -5976,7 +5898,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6029,8 +5950,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6262,7 +6181,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index d1ea46deb8..3accdf7bcf 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -31,11 +31,11 @@
 #include <sys/stat.h>
 #include <sys/time.h>
 
+#include "bestatus.h"
 #include "lib/stringinfo.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "nodes/pg_list.h"
-#include "pgstat.h"
 #include "pgtime.h"
 #include "postmaster/fork_process.h"
 #include "postmaster/postmaster.h"
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index a6fdba3f41..0de04159d5 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -45,9 +45,9 @@
 #include <unistd.h>
 
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/walwriter.h"
 #include "storage/bufmgr.h"
 #include "storage/condition_variable.h"
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index def6c03dd0..e30b2dbcf0 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -17,6 +17,7 @@
 #include <time.h>
 
 #include "access/xlog_internal.h"    /* for pg_start/stop_backup */
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "common/file_perm.h"
 #include "lib/stringinfo.h"
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 7027737e67..75a3208f74 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -22,11 +22,11 @@
 #include "libpq-fe.h"
 #include "pqexpbuffer.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "funcapi.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/walreceiver.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 55b91b5e12..ea1c7e643e 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -19,7 +19,7 @@
 
 #include "funcapi.h"
 #include "miscadmin.h"
-#include "pgstat.h"
+#include "bestatus.h"
 
 #include "access/heapam.h"
 #include "access/htup.h"
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index dad2b3d065..dbb7c57ebc 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -77,13 +77,12 @@
 #include "access/htup_details.h"
 #include "access/table.h"
 #include "access/xact.h"
-
+#include "bestatus.h"
 #include "catalog/indexing.h"
 #include "nodes/execnodes.h"
 
 #include "replication/origin.h"
 #include "replication/logical.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index a49e226967..607c7ebc24 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -62,10 +62,10 @@
 #include "access/tuptoaster.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "lib/binaryheap.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/logical.h"
 #include "replication/reorderbuffer.h"
 #include "replication/slot.h"
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index ad44b2bf43..1b792f6626 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -126,7 +126,7 @@
 #include "access/transam.h"
 #include "access/xact.h"
 
-#include "pgstat.h"
+#include "bestatus.h"
 
 #include "replication/logical.h"
 #include "replication/reorderbuffer.h"
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 28f5fc23aa..58c94794bc 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -86,26 +86,28 @@
 #include "postgres.h"
 
 #include "miscadmin.h"
-#include "pgstat.h"
 
 #include "access/table.h"
 #include "access/xact.h"
 
+#include "bestatus.h"
+
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 
 #include "commands/copy.h"
 
 #include "parser/parse_relation.h"
+#include "pgstat.h"
 
 #include "replication/logicallauncher.h"
 #include "replication/logicalrelation.h"
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 
-#include "utils/snapmgr.h"
 #include "storage/ipc.h"
 
+#include "utils/snapmgr.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -128,7 +130,7 @@ finish_sync_worker(void)
     if (IsTransactionState())
     {
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(true);
     }
 
     /* And flush all writes. */
@@ -144,6 +146,9 @@ finish_sync_worker(void)
     /* Find the main apply worker and signal it. */
     logicalrep_worker_wakeup(MyLogicalRepWorker->subid, InvalidOid);
 
+    /* clean up retained statistics */
+    pgstat_update_stat(true);
+
     /* Stop gracefully */
     proc_exit(0);
 }
@@ -525,7 +530,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
     if (started_tx)
     {
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(false);
     }
 }
 
@@ -863,7 +868,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
                                            MyLogicalRepWorker->relstate,
                                            MyLogicalRepWorker->relstate_lsn);
                 CommitTransactionCommand();
-                pgstat_report_stat(false);
+                pgstat_update_stat(false);
 
                 /*
                  * We want to do the table data sync in a single transaction.
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index f9516515bc..dc675778e3 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -26,6 +26,7 @@
 #include "access/table.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_subscription.h"
@@ -477,7 +478,7 @@ apply_handle_commit(StringInfo s)
         replorigin_session_origin_timestamp = commit_data.committime;
 
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(false);
 
         store_flush_position(commit_data.end_lsn);
     }
@@ -1311,6 +1312,8 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
             }
 
             send_feedback(last_received, requestReply, requestReply);
+
+            pgstat_update_stat(false);
         }
     }
 }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 33b23b6b6d..c60e69302a 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -41,9 +41,9 @@
 
 #include "access/transam.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "common/string.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/slot.h"
 #include "storage/fd.h"
 #include "storage/proc.h"
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 6c160c13c6..02ec91d98e 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -75,8 +75,8 @@
 #include <unistd.h>
 
 #include "access/xact.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/syncrep.h"
 #include "replication/walsender.h"
 #include "replication/walsender_private.h"
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 2e90944ad5..bdca25499d 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -50,6 +50,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
@@ -57,7 +58,6 @@
 #include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "storage/ipc.h"
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 9b143f361b..2b38c0c4f5 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -56,6 +56,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
 
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "commands/dbcommands.h"
@@ -65,7 +66,6 @@
 #include "libpq/pqformat.h"
 #include "miscadmin.h"
 #include "nodes/replnodes.h"
-#include "pgstat.h"
 #include "replication/basebackup.h"
 #include "replication/decode.h"
 #include "replication/logical.h"
diff --git a/src/backend/statmon/Makefile b/src/backend/statmon/Makefile
new file mode 100644
index 0000000000..64a04878e3
--- /dev/null
+++ b/src/backend/statmon/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for src/backend/statmon
+#
+# IDENTIFICATION
+#    src/backend/statmon/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/statmon
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = pgstat.o bestatus.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/statmon/bestatus.c b/src/backend/statmon/bestatus.c
new file mode 100644
index 0000000000..292312d05c
--- /dev/null
+++ b/src/backend/statmon/bestatus.c
@@ -0,0 +1,1781 @@
+/* ----------
+ * bestatus.c
+ *
+ *    Backend status monitor
+ *
+ *    Status data is stored in shared memory. Every backend updates and reads
+ *    it individually.
+ *
+ *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
+ *
+ *    src/backend/statmon/bestatus.c
+ * ----------
+ */
+#include "postgres.h"
+
+#include "bestatus.h"
+
+#include "access/xact.h"
+#include "libpq/libpq.h"
+#include "miscadmin.h"
+#include "postmaster/autovacuum.h"
+#include "replication/walsender.h"
+#include "storage/ipc.h"
+#include "storage/lmgr.h"
+#include "storage/sinvaladt.h"
+#include "utils/ascii.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/probes.h"
+
+
+/* Status for backends including auxiliary */
+static LocalPgBackendStatus *localBackendStatusTable = NULL;
+
+/* Total number of backends including auxiliary */
+static int    localNumBackends = 0;
+
+/* ----------
+ * Total number of backends including auxiliary
+ *
+ * We reserve a slot for each possible BackendId, plus one for each
+ * possible auxiliary process type.  (This scheme assumes there is not
+ * more than one of any auxiliary process type at a time.) MaxBackends
+ * includes autovacuum workers and background workers as well.
+ * ----------
+ */
+#define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
+
+
+/* ----------
+ * GUC parameters
+ * ----------
+ */
+bool        pgstat_track_activities = false;
+int            pgstat_track_activity_query_size = 1024;
+
+static MemoryContext pgBeStatLocalContext = NULL;
+
+/* ------------------------------------------------------------
+ * Functions for management of the shared-memory PgBackendStatus array
+ * ------------------------------------------------------------
+ */
+
+static PgBackendStatus *BackendStatusArray = NULL;
+static PgBackendStatus *MyBEEntry = NULL;
+static char *BackendAppnameBuffer = NULL;
+static char *BackendClientHostnameBuffer = NULL;
+static char *BackendActivityBuffer = NULL;
+static Size BackendActivityBufferSize = 0;
+#ifdef USE_SSL
+static PgBackendSSLStatus *BackendSslStatusBuffer = NULL;
+#endif
+
+static const char *pgstat_get_wait_activity(WaitEventActivity w);
+static const char *pgstat_get_wait_client(WaitEventClient w);
+static const char *pgstat_get_wait_ipc(WaitEventIPC w);
+static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
+static const char *pgstat_get_wait_io(WaitEventIO w);
+static void pgstat_setup_memcxt(void);
+static void bestatus_clear_snapshot(void);
+static void pgstat_beshutdown_hook(int code, Datum arg);
+/*
+ * Report shared-memory space needed by CreateSharedBackendStatus.
+ */
+Size
+BackendStatusShmemSize(void)
+{
+    Size        size;
+
+    /* BackendStatusArray: */
+    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
+    /* BackendAppnameBuffer: */
+    size = add_size(size,
+                    mul_size(NAMEDATALEN, NumBackendStatSlots));
+    /* BackendClientHostnameBuffer: */
+    size = add_size(size,
+                    mul_size(NAMEDATALEN, NumBackendStatSlots));
+    /* BackendActivityBuffer: */
+    size = add_size(size,
+                    mul_size(pgstat_track_activity_query_size, NumBackendStatSlots));
+#ifdef USE_SSL
+    /* BackendSslStatusBuffer: */
+    size = add_size(size,
+                    mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots));
+#endif
+    return size;
+}
+
+/*
+ * Initialize the shared status array and several string buffers
+ * during postmaster startup.
+ */
+void
+CreateSharedBackendStatus(void)
+{
+    Size        size;
+    bool        found;
+    int            i;
+    char       *buffer;
+
+    /* Create or attach to the shared array */
+    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
+    BackendStatusArray = (PgBackendStatus *)
+        ShmemInitStruct("Backend Status Array", size, &found);
+
+    if (!found)
+    {
+        /*
+         * We're the first - initialize.
+         */
+        MemSet(BackendStatusArray, 0, size);
+    }
+
+    /* Create or attach to the shared appname buffer */
+    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
+    BackendAppnameBuffer = (char *)
+        ShmemInitStruct("Backend Application Name Buffer", size, &found);
+
+    if (!found)
+    {
+        MemSet(BackendAppnameBuffer, 0, size);
+
+        /* Initialize st_appname pointers. */
+        buffer = BackendAppnameBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_appname = buffer;
+            buffer += NAMEDATALEN;
+        }
+    }
+
+    /* Create or attach to the shared client hostname buffer */
+    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
+    BackendClientHostnameBuffer = (char *)
+        ShmemInitStruct("Backend Client Host Name Buffer", size, &found);
+
+    if (!found)
+    {
+        MemSet(BackendClientHostnameBuffer, 0, size);
+
+        /* Initialize st_clienthostname pointers. */
+        buffer = BackendClientHostnameBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_clienthostname = buffer;
+            buffer += NAMEDATALEN;
+        }
+    }
+
+    /* Create or attach to the shared activity buffer */
+    BackendActivityBufferSize = mul_size(pgstat_track_activity_query_size,
+                                         NumBackendStatSlots);
+    BackendActivityBuffer = (char *)
+        ShmemInitStruct("Backend Activity Buffer",
+                        BackendActivityBufferSize,
+                        &found);
+
+    if (!found)
+    {
+        MemSet(BackendActivityBuffer, 0, BackendActivityBufferSize);
+
+        /* Initialize st_activity pointers. */
+        buffer = BackendActivityBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_activity_raw = buffer;
+            buffer += pgstat_track_activity_query_size;
+        }
+    }
+
+#ifdef USE_SSL
+    /* Create or attach to the shared SSL status buffer */
+    size = mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots);
+    BackendSslStatusBuffer = (PgBackendSSLStatus *)
+        ShmemInitStruct("Backend SSL Status Buffer", size, &found);
+
+    if (!found)
+    {
+        PgBackendSSLStatus *ptr;
+
+        MemSet(BackendSslStatusBuffer, 0, size);
+
+        /* Initialize st_sslstatus pointers. */
+        ptr = BackendSslStatusBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_sslstatus = ptr;
+            ptr++;
+        }
+    }
+#endif
+}
+
+/* ----------
+ * pgstat_bearray_initialize() -
+ *
+ *    Initialize pgstats state, and set up our on-proc-exit hook.
+ *    Called from InitPostgres and AuxiliaryProcessMain. For auxiliary processes,
+ *    MyBackendId is invalid. Otherwise, MyBackendId must be set,
+ *    but we must not have started any transaction yet (since the
+ *    exit hook must run after the last transaction exit).
+ *    NOTE: MyDatabaseId isn't set yet; so the shutdown hook has to be careful.
+ * ----------
+ */
+void
+pgstat_bearray_initialize(void)
+{
+    /* Initialize MyBEEntry */
+    if (MyBackendId != InvalidBackendId)
+    {
+        Assert(MyBackendId >= 1 && MyBackendId <= MaxBackends);
+        MyBEEntry = &BackendStatusArray[MyBackendId - 1];
+    }
+    else
+    {
+        /* Must be an auxiliary process */
+        Assert(MyAuxProcType != NotAnAuxProcess);
+
+        /*
+         * Assign the MyBEEntry for an auxiliary process.  Since it doesn't
+         * have a BackendId, the slot is statically allocated based on the
+         * auxiliary process type (MyAuxProcType).  Backends use slots indexed
+         * in the range from 1 to MaxBackends (inclusive), so we use
+         * MaxBackends + AuxBackendType + 1 as the index of the slot for an
+         * auxiliary process.
+         */
+        MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
+    }
+
+    /* Set up a process-exit hook to clean up */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
+}
+
+/*
+ * Shut down a single backend's statistics reporting at process exit.
+ *
+ * Flush any remaining statistics counts out to the collector.
+ * Without this, operations triggered during backend exit (such as
+ * temp table deletions) won't be counted.
+ *
+ * Lastly, clear out our entry in the PgBackendStatus array.
+ */
+static void
+pgstat_beshutdown_hook(int code, Datum arg)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    /*
+     * Clear my status entry, following the protocol of bumping st_changecount
+     * before and after.  We use a volatile pointer here to ensure the
+     * compiler doesn't try to get cute.
+     */
+    pgstat_increment_changecount_before(beentry);
+
+    beentry->st_procpid = 0;    /* mark invalid */
+
+    pgstat_increment_changecount_after(beentry);
+}
+
+
+/* ----------
+ * pgstat_bestart() -
+ *
+ *    Initialize this backend's entry in the PgBackendStatus array.
+ *    Called from InitPostgres.
+ *
+ *    Apart from auxiliary processes, MyBackendId, MyDatabaseId,
+ *    session userid, and application_name must be set for a backend
+ *    (hence, this cannot be combined with pgstat_bearray_initialize).
+ * ----------
+ */
+void
+pgstat_bestart(void)
+{
+    SockAddr    clientaddr;
+    volatile PgBackendStatus *beentry;
+
+    /*
+     * To minimize the time spent modifying the PgBackendStatus entry, fetch
+     * all the needed data first.
+     */
+
+    /*
+     * We may not have a MyProcPort (eg, if this is the autovacuum process).
+     * If so, use all-zeroes client address, which is dealt with specially in
+     * pg_stat_get_backend_client_addr and pg_stat_get_backend_client_port.
+     */
+    if (MyProcPort)
+        memcpy(&clientaddr, &MyProcPort->raddr, sizeof(clientaddr));
+    else
+        MemSet(&clientaddr, 0, sizeof(clientaddr));
+
+    /*
+     * Initialize my status entry, following the protocol of bumping
+     * st_changecount before and after; and make sure it's even afterwards. We
+     * use a volatile pointer here to ensure the compiler doesn't try to get
+     * cute.
+     */
+    beentry = MyBEEntry;
+
+    /* pgstats state must be initialized from pgstat_initialize() */
+    Assert(beentry != NULL);
+
+    if (MyBackendId != InvalidBackendId)
+    {
+        if (IsAutoVacuumLauncherProcess())
+        {
+            /* Autovacuum Launcher */
+            beentry->st_backendType = B_AUTOVAC_LAUNCHER;
+        }
+        else if (IsAutoVacuumWorkerProcess())
+        {
+            /* Autovacuum Worker */
+            beentry->st_backendType = B_AUTOVAC_WORKER;
+        }
+        else if (am_walsender)
+        {
+            /* WAL sender */
+            beentry->st_backendType = B_WAL_SENDER;
+        }
+        else if (IsBackgroundWorker)
+        {
+            /* bgworker */
+            beentry->st_backendType = B_BG_WORKER;
+        }
+        else
+        {
+            /* client-backend */
+            beentry->st_backendType = B_BACKEND;
+        }
+    }
+    else
+    {
+        /* Must be an auxiliary process */
+        Assert(MyAuxProcType != NotAnAuxProcess);
+        switch (MyAuxProcType)
+        {
+            case StartupProcess:
+                beentry->st_backendType = B_STARTUP;
+                break;
+            case BgWriterProcess:
+                beentry->st_backendType = B_BG_WRITER;
+                break;
+            case CheckpointerProcess:
+                beentry->st_backendType = B_CHECKPOINTER;
+                break;
+            case WalWriterProcess:
+                beentry->st_backendType = B_WAL_WRITER;
+                break;
+            case WalReceiverProcess:
+                beentry->st_backendType = B_WAL_RECEIVER;
+                break;
+            case ArchiverProcess:
+                beentry->st_backendType = B_ARCHIVER;
+                break;
+            default:
+                elog(FATAL, "unrecognized process type: %d",
+                     (int) MyAuxProcType);
+                proc_exit(1);
+        }
+    }
+
+    do
+    {
+        pgstat_increment_changecount_before(beentry);
+    } while ((beentry->st_changecount & 1) == 0);
+
+    beentry->st_procpid = MyProcPid;
+    beentry->st_proc_start_timestamp = MyStartTimestamp;
+    beentry->st_activity_start_timestamp = 0;
+    beentry->st_state_start_timestamp = 0;
+    beentry->st_xact_start_timestamp = 0;
+    beentry->st_databaseid = MyDatabaseId;
+
+    /* We have userid for client-backends, wal-sender and bgworker processes */
+    if (beentry->st_backendType == B_BACKEND
+        || beentry->st_backendType == B_WAL_SENDER
+        || beentry->st_backendType == B_BG_WORKER)
+        beentry->st_userid = GetSessionUserId();
+    else
+        beentry->st_userid = InvalidOid;
+
+    beentry->st_clientaddr = clientaddr;
+    if (MyProcPort && MyProcPort->remote_hostname)
+        strlcpy(beentry->st_clienthostname, MyProcPort->remote_hostname,
+                NAMEDATALEN);
+    else
+        beentry->st_clienthostname[0] = '\0';
+#ifdef USE_SSL
+    if (MyProcPort && MyProcPort->ssl != NULL)
+    {
+        beentry->st_ssl = true;
+        beentry->st_sslstatus->ssl_bits = be_tls_get_cipher_bits(MyProcPort);
+        beentry->st_sslstatus->ssl_compression = be_tls_get_compression(MyProcPort);
+        strlcpy(beentry->st_sslstatus->ssl_version, be_tls_get_version(MyProcPort), NAMEDATALEN);
+        strlcpy(beentry->st_sslstatus->ssl_cipher, be_tls_get_cipher(MyProcPort), NAMEDATALEN);
+        be_tls_get_peer_subject_name(MyProcPort, beentry->st_sslstatus->ssl_client_dn, NAMEDATALEN);
+        be_tls_get_peer_serial(MyProcPort, beentry->st_sslstatus->ssl_client_serial, NAMEDATALEN);
+        be_tls_get_peer_issuer_name(MyProcPort, beentry->st_sslstatus->ssl_issuer_dn, NAMEDATALEN);
+    }
+    else
+    {
+        beentry->st_ssl = false;
+    }
+#else
+    beentry->st_ssl = false;
+#endif
+    beentry->st_state = STATE_UNDEFINED;
+    beentry->st_appname[0] = '\0';
+    beentry->st_activity_raw[0] = '\0';
+    /* Also make sure the last byte in each string area is always 0 */
+    beentry->st_clienthostname[NAMEDATALEN - 1] = '\0';
+    beentry->st_appname[NAMEDATALEN - 1] = '\0';
+    beentry->st_activity_raw[pgstat_track_activity_query_size - 1] = '\0';
+    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
+    beentry->st_progress_command_target = InvalidOid;
+
+    /*
+     * we don't zero st_progress_param here to save cycles; nobody should
+     * examine it until st_progress_command has been set to something other
+     * than PROGRESS_COMMAND_INVALID
+     */
+
+    pgstat_increment_changecount_after(beentry);
+
+    /* Update app name to current GUC setting */
+    if (application_name)
+        pgstat_report_appname(application_name);
+}
+
+/* ----------
+ * pgstat_read_current_status() -
+ *
+ *    Copy the current contents of the PgBackendStatus array to local memory,
+ *    if not already done in this transaction.
+ * ----------
+ */
+static void
+pgstat_read_current_status(void)
+{
+    volatile PgBackendStatus *beentry;
+    LocalPgBackendStatus *localtable;
+    LocalPgBackendStatus *localentry;
+    char       *localappname,
+               *localclienthostname,
+               *localactivity;
+#ifdef USE_SSL
+    PgBackendSSLStatus *localsslstatus;
+#endif
+    int            i;
+
+    Assert(IsUnderPostmaster);
+
+    if (localBackendStatusTable)
+        return;                    /* already done */
+
+    pgstat_setup_memcxt();
+
+    localtable = (LocalPgBackendStatus *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           sizeof(LocalPgBackendStatus) * NumBackendStatSlots);
+    localappname = (char *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           NAMEDATALEN * NumBackendStatSlots);
+    localclienthostname = (char *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           NAMEDATALEN * NumBackendStatSlots);
+    localactivity = (char *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           pgstat_track_activity_query_size * NumBackendStatSlots);
+#ifdef USE_SSL
+    localsslstatus = (PgBackendSSLStatus *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           sizeof(PgBackendSSLStatus) * NumBackendStatSlots);
+#endif
+
+    localNumBackends = 0;
+
+    beentry = BackendStatusArray;
+    localentry = localtable;
+    for (i = 1; i <= NumBackendStatSlots; i++)
+    {
+        /*
+         * Follow the protocol of retrying if st_changecount changes while we
+         * copy the entry, or if it's odd.  (The check for odd is needed to
+         * cover the case where we are able to completely copy the entry while
+         * the source backend is between increment steps.)  We use a volatile
+         * pointer here to ensure the compiler doesn't try to get cute.
+         */
+        for (;;)
+        {
+            int            before_changecount;
+            int            after_changecount;
+
+            pgstat_save_changecount_before(beentry, before_changecount);
+
+            localentry->backendStatus.st_procpid = beentry->st_procpid;
+            if (localentry->backendStatus.st_procpid > 0)
+            {
+                memcpy(&localentry->backendStatus, (char *) beentry, sizeof(PgBackendStatus));
+
+                /*
+                 * strcpy is safe even if the string is modified concurrently,
+                 * because there's always a \0 at the end of the buffer.
+                 */
+                strcpy(localappname, (char *) beentry->st_appname);
+                localentry->backendStatus.st_appname = localappname;
+                strcpy(localclienthostname, (char *) beentry->st_clienthostname);
+                localentry->backendStatus.st_clienthostname = localclienthostname;
+                strcpy(localactivity, (char *) beentry->st_activity_raw);
+                localentry->backendStatus.st_activity_raw = localactivity;
+                localentry->backendStatus.st_ssl = beentry->st_ssl;
+#ifdef USE_SSL
+                if (beentry->st_ssl)
+                {
+                    memcpy(localsslstatus, beentry->st_sslstatus, sizeof(PgBackendSSLStatus));
+                    localentry->backendStatus.st_sslstatus = localsslstatus;
+                }
+#endif
+            }
+
+            pgstat_save_changecount_after(beentry, after_changecount);
+            if (before_changecount == after_changecount &&
+                (before_changecount & 1) == 0)
+                break;
+
+            /* Make sure we can break out of loop if stuck... */
+            CHECK_FOR_INTERRUPTS();
+        }
+
+        beentry++;
+        /* Only valid entries get included into the local array */
+        if (localentry->backendStatus.st_procpid > 0)
+        {
+            BackendIdGetTransactionIds(i,
+                                       &localentry->backend_xid,
+                                       &localentry->backend_xmin);
+
+            localentry++;
+            localappname += NAMEDATALEN;
+            localclienthostname += NAMEDATALEN;
+            localactivity += pgstat_track_activity_query_size;
+#ifdef USE_SSL
+            localsslstatus++;
+#endif
+            localNumBackends++;
+        }
+    }
+
+    /* Set the pointer only after completion of a valid table */
+    localBackendStatusTable = localtable;
+}
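
The retry loop above is a sequence-lock style read: the writer makes st_changecount odd before touching the entry and even again afterwards, and the reader accepts a copy only if the counter was even and unchanged across the copy. A minimal single-threaded sketch of that rule follows. DemoEntry and the demo_* functions are hypothetical names for illustration, not part of the patch, and the sketch leaves out the memory barriers that a real multi-process implementation would need.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a shared status entry (illustration only). */
typedef struct DemoEntry
{
    int         changecount;    /* even = stable, odd = write in progress */
    int         pid;
    int         state;
} DemoEntry;

/* Writer side: bump the counter before and after modifying the entry. */
static void
demo_begin_write(DemoEntry *e)
{
    e->changecount++;           /* becomes odd */
}

static void
demo_end_write(DemoEntry *e)
{
    e->changecount++;           /* becomes even again */
}

/*
 * Reader side: one copy attempt.  Returns true only if the counter was
 * even and did not change while the fields were copied; a caller would
 * loop until this succeeds.
 */
static bool
demo_try_read(const DemoEntry *src, DemoEntry *dst)
{
    int         before = src->changecount;
    int         after;

    dst->pid = src->pid;
    dst->state = src->state;

    after = src->changecount;

    return before == after && (before & 1) == 0;
}
```

A read attempted between demo_begin_write() and demo_end_write() is rejected because the counter is odd, which is the case the "check for odd" remark in the comment covers.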
+
+
+/* ----------
+ * pgstat_fetch_stat_beentry() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    our local copy of the current-activity entry for one backend.
+ *
+ *    NB: caller is responsible for checking whether the user is permitted
+ *    to see this info (especially the querystring).
+ * ----------
+ */
+PgBackendStatus *
+pgstat_fetch_stat_beentry(int beid)
+{
+    pgstat_read_current_status();
+
+    if (beid < 1 || beid > localNumBackends)
+        return NULL;
+
+    return &localBackendStatusTable[beid - 1].backendStatus;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_local_beentry() -
+ *
+ *    Like pgstat_fetch_stat_beentry() but with locally computed additions (like
+ *    xid and xmin values of the backend)
+ *
+ *    NB: caller is responsible for checking whether the user is permitted
+ *    to see this info (especially the querystring).
+ * ----------
+ */
+LocalPgBackendStatus *
+pgstat_fetch_stat_local_beentry(int beid)
+{
+    pgstat_read_current_status();
+
+    if (beid < 1 || beid > localNumBackends)
+        return NULL;
+
+    return &localBackendStatusTable[beid - 1];
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_numbackends() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    the number of entries in our local copy, which is also the maximum
+ *    backend id accepted by the fetch functions above.
+ * ----------
+ */
+int
+pgstat_fetch_stat_numbackends(void)
+{
+    pgstat_read_current_status();
+
+    return localNumBackends;
+}
+
+/* ----------
+ * pgstat_get_wait_event_type() -
+ *
+ *    Return a string representing the type of wait event the backend is
+ *    currently waiting on, or NULL if it is not waiting.
+ * ----------
+ */
+const char *
+pgstat_get_wait_event_type(uint32 wait_event_info)
+{
+    uint32        classId;
+    const char *event_type;
+
+    /* report process as not waiting. */
+    if (wait_event_info == 0)
+        return NULL;
+
+    classId = wait_event_info & 0xFF000000;
+
+    switch (classId)
+    {
+        case PG_WAIT_LWLOCK:
+            event_type = "LWLock";
+            break;
+        case PG_WAIT_LOCK:
+            event_type = "Lock";
+            break;
+        case PG_WAIT_BUFFER_PIN:
+            event_type = "BufferPin";
+            break;
+        case PG_WAIT_ACTIVITY:
+            event_type = "Activity";
+            break;
+        case PG_WAIT_CLIENT:
+            event_type = "Client";
+            break;
+        case PG_WAIT_EXTENSION:
+            event_type = "Extension";
+            break;
+        case PG_WAIT_IPC:
+            event_type = "IPC";
+            break;
+        case PG_WAIT_TIMEOUT:
+            event_type = "Timeout";
+            break;
+        case PG_WAIT_IO:
+            event_type = "IO";
+            break;
+        default:
+            event_type = "???";
+            break;
+    }
+
+    return event_type;
+}
+
+/* ----------
+ * pgstat_get_wait_event() -
+ *
+ *    Return a string representing the wait event the backend is
+ *    currently waiting on, or NULL if it is not waiting.
+ * ----------
+ */
+const char *
+pgstat_get_wait_event(uint32 wait_event_info)
+{
+    uint32        classId;
+    uint16        eventId;
+    const char *event_name;
+
+    /* report process as not waiting. */
+    if (wait_event_info == 0)
+        return NULL;
+
+    classId = wait_event_info & 0xFF000000;
+    eventId = wait_event_info & 0x0000FFFF;
+
+    switch (classId)
+    {
+        case PG_WAIT_LWLOCK:
+            event_name = GetLWLockIdentifier(classId, eventId);
+            break;
+        case PG_WAIT_LOCK:
+            event_name = GetLockNameFromTagType(eventId);
+            break;
+        case PG_WAIT_BUFFER_PIN:
+            event_name = "BufferPin";
+            break;
+        case PG_WAIT_ACTIVITY:
+            {
+                WaitEventActivity w = (WaitEventActivity) wait_event_info;
+
+                event_name = pgstat_get_wait_activity(w);
+                break;
+            }
+        case PG_WAIT_CLIENT:
+            {
+                WaitEventClient w = (WaitEventClient) wait_event_info;
+
+                event_name = pgstat_get_wait_client(w);
+                break;
+            }
+        case PG_WAIT_EXTENSION:
+            event_name = "Extension";
+            break;
+        case PG_WAIT_IPC:
+            {
+                WaitEventIPC w = (WaitEventIPC) wait_event_info;
+
+                event_name = pgstat_get_wait_ipc(w);
+                break;
+            }
+        case PG_WAIT_TIMEOUT:
+            {
+                WaitEventTimeout w = (WaitEventTimeout) wait_event_info;
+
+                event_name = pgstat_get_wait_timeout(w);
+                break;
+            }
+        case PG_WAIT_IO:
+            {
+                WaitEventIO w = (WaitEventIO) wait_event_info;
+
+                event_name = pgstat_get_wait_io(w);
+                break;
+            }
+        default:
+            event_name = "unknown wait event";
+            break;
+    }
+
+    return event_name;
+}
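
Both lookup functions above split the 32-bit wait_event_info the same way: the top byte (mask 0xFF000000) selects the class and the low 16 bits (mask 0x0000FFFF) carry the event id within that class. The sketch below shows only that decoding step. The DEMO_* class tags and demo_* functions are hypothetical placeholders; the real PG_WAIT_* values are defined in the PostgreSQL headers and are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical class tags, for illustration only. */
#define DEMO_WAIT_CLASS_A   0x01000000U
#define DEMO_WAIT_CLASS_B   0x03000000U

/* The class lives in the top byte of the packed value. */
static uint32_t
demo_wait_class(uint32_t wait_event_info)
{
    return wait_event_info & 0xFF000000U;
}

/* The event id lives in the low 16 bits. */
static uint16_t
demo_wait_event_id(uint32_t wait_event_info)
{
    return (uint16_t) (wait_event_info & 0x0000FFFFU);
}

/* Map the class to a display name; zero means "not waiting". */
static const char *
demo_wait_event_type(uint32_t wait_event_info)
{
    if (wait_event_info == 0)
        return NULL;

    switch (demo_wait_class(wait_event_info))
    {
        case DEMO_WAIT_CLASS_A:
            return "ClassA";
        case DEMO_WAIT_CLASS_B:
            return "ClassB";
        default:
            return "???";
    }
}
```

Packing both parts into one word lets a backend publish its wait state with a single 4-byte store, and readers decode it without taking a lock.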
+
+/* ----------
+ * pgstat_get_wait_activity() -
+ *
+ * Convert WaitEventActivity to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_activity(WaitEventActivity w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_ARCHIVER_MAIN:
+            event_name = "ArchiverMain";
+            break;
+        case WAIT_EVENT_AUTOVACUUM_MAIN:
+            event_name = "AutoVacuumMain";
+            break;
+        case WAIT_EVENT_BGWRITER_HIBERNATE:
+            event_name = "BgWriterHibernate";
+            break;
+        case WAIT_EVENT_BGWRITER_MAIN:
+            event_name = "BgWriterMain";
+            break;
+        case WAIT_EVENT_CHECKPOINTER_MAIN:
+            event_name = "CheckpointerMain";
+            break;
+        case WAIT_EVENT_LOGICAL_APPLY_MAIN:
+            event_name = "LogicalApplyMain";
+            break;
+        case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
+            event_name = "LogicalLauncherMain";
+            break;
+        case WAIT_EVENT_RECOVERY_WAL_ALL:
+            event_name = "RecoveryWalAll";
+            break;
+        case WAIT_EVENT_RECOVERY_WAL_STREAM:
+            event_name = "RecoveryWalStream";
+            break;
+        case WAIT_EVENT_SYSLOGGER_MAIN:
+            event_name = "SysLoggerMain";
+            break;
+        case WAIT_EVENT_WAL_RECEIVER_MAIN:
+            event_name = "WalReceiverMain";
+            break;
+        case WAIT_EVENT_WAL_SENDER_MAIN:
+            event_name = "WalSenderMain";
+            break;
+        case WAIT_EVENT_WAL_WRITER_MAIN:
+            event_name = "WalWriterMain";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
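
The "no default case" remark refers to a common C idiom: preset the result to a fallback string, switch over the enum with no default label, and let the compiler's switch-coverage warning (for example -Wswitch in GCC and Clang) flag any enum member added later without a matching case. A minimal sketch with a hypothetical DemoEvent enum:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical enum for illustration; not part of the patch. */
typedef enum DemoEvent
{
    DEMO_EVENT_READ,
    DEMO_EVENT_WRITE
} DemoEvent;

static const char *
demo_event_name(DemoEvent ev)
{
    /* Fallback used when the value matches no case label below. */
    const char *name = "unknown wait event";

    switch (ev)
    {
        case DEMO_EVENT_READ:
            name = "DemoRead";
            break;
        case DEMO_EVENT_WRITE:
            name = "DemoWrite";
            break;
            /* no default case, so that the compiler can warn */
    }

    return name;
}
```

The preset fallback keeps the function well defined at run time even for a value outside the enum, while the missing default keeps the compile-time check active.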
+
+/* ----------
+ * pgstat_get_wait_client() -
+ *
+ * Convert WaitEventClient to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_client(WaitEventClient w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_CLIENT_READ:
+            event_name = "ClientRead";
+            break;
+        case WAIT_EVENT_CLIENT_WRITE:
+            event_name = "ClientWrite";
+            break;
+        case WAIT_EVENT_LIBPQWALRECEIVER_CONNECT:
+            event_name = "LibPQWalReceiverConnect";
+            break;
+        case WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE:
+            event_name = "LibPQWalReceiverReceive";
+            break;
+        case WAIT_EVENT_SSL_OPEN_SERVER:
+            event_name = "SSLOpenServer";
+            break;
+        case WAIT_EVENT_WAL_RECEIVER_WAIT_START:
+            event_name = "WalReceiverWaitStart";
+            break;
+        case WAIT_EVENT_WAL_SENDER_WAIT_WAL:
+            event_name = "WalSenderWaitForWAL";
+            break;
+        case WAIT_EVENT_WAL_SENDER_WRITE_DATA:
+            event_name = "WalSenderWriteData";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_ipc() -
+ *
+ * Convert WaitEventIPC to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_ipc(WaitEventIPC w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_BGWORKER_SHUTDOWN:
+            event_name = "BgWorkerShutdown";
+            break;
+        case WAIT_EVENT_BGWORKER_STARTUP:
+            event_name = "BgWorkerStartup";
+            break;
+        case WAIT_EVENT_BTREE_PAGE:
+            event_name = "BtreePage";
+            break;
+        case WAIT_EVENT_CLOG_GROUP_UPDATE:
+            event_name = "ClogGroupUpdate";
+            break;
+        case WAIT_EVENT_EXECUTE_GATHER:
+            event_name = "ExecuteGather";
+            break;
+        case WAIT_EVENT_HASH_BATCH_ALLOCATING:
+            event_name = "Hash/Batch/Allocating";
+            break;
+        case WAIT_EVENT_HASH_BATCH_ELECTING:
+            event_name = "Hash/Batch/Electing";
+            break;
+        case WAIT_EVENT_HASH_BATCH_LOADING:
+            event_name = "Hash/Batch/Loading";
+            break;
+        case WAIT_EVENT_HASH_BUILD_ALLOCATING:
+            event_name = "Hash/Build/Allocating";
+            break;
+        case WAIT_EVENT_HASH_BUILD_ELECTING:
+            event_name = "Hash/Build/Electing";
+            break;
+        case WAIT_EVENT_HASH_BUILD_HASHING_INNER:
+            event_name = "Hash/Build/HashingInner";
+            break;
+        case WAIT_EVENT_HASH_BUILD_HASHING_OUTER:
+            event_name = "Hash/Build/HashingOuter";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING:
+            event_name = "Hash/GrowBatches/Allocating";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_DECIDING:
+            event_name = "Hash/GrowBatches/Deciding";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_ELECTING:
+            event_name = "Hash/GrowBatches/Electing";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_FINISHING:
+            event_name = "Hash/GrowBatches/Finishing";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING:
+            event_name = "Hash/GrowBatches/Repartitioning";
+            break;
+        case WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING:
+            event_name = "Hash/GrowBuckets/Allocating";
+            break;
+        case WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING:
+            event_name = "Hash/GrowBuckets/Electing";
+            break;
+        case WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING:
+            event_name = "Hash/GrowBuckets/Reinserting";
+            break;
+        case WAIT_EVENT_LOGICAL_SYNC_DATA:
+            event_name = "LogicalSyncData";
+            break;
+        case WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE:
+            event_name = "LogicalSyncStateChange";
+            break;
+        case WAIT_EVENT_MQ_INTERNAL:
+            event_name = "MessageQueueInternal";
+            break;
+        case WAIT_EVENT_MQ_PUT_MESSAGE:
+            event_name = "MessageQueuePutMessage";
+            break;
+        case WAIT_EVENT_MQ_RECEIVE:
+            event_name = "MessageQueueReceive";
+            break;
+        case WAIT_EVENT_MQ_SEND:
+            event_name = "MessageQueueSend";
+            break;
+        case WAIT_EVENT_PARALLEL_BITMAP_SCAN:
+            event_name = "ParallelBitmapScan";
+            break;
+        case WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN:
+            event_name = "ParallelCreateIndexScan";
+            break;
+        case WAIT_EVENT_PARALLEL_FINISH:
+            event_name = "ParallelFinish";
+            break;
+        case WAIT_EVENT_PROCARRAY_GROUP_UPDATE:
+            event_name = "ProcArrayGroupUpdate";
+            break;
+        case WAIT_EVENT_PROMOTE:
+            event_name = "Promote";
+            break;
+        case WAIT_EVENT_REPLICATION_ORIGIN_DROP:
+            event_name = "ReplicationOriginDrop";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_DROP:
+            event_name = "ReplicationSlotDrop";
+            break;
+        case WAIT_EVENT_SAFE_SNAPSHOT:
+            event_name = "SafeSnapshot";
+            break;
+        case WAIT_EVENT_SYNC_REP:
+            event_name = "SyncRep";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_timeout() -
+ *
+ * Convert WaitEventTimeout to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_timeout(WaitEventTimeout w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_BASE_BACKUP_THROTTLE:
+            event_name = "BaseBackupThrottle";
+            break;
+        case WAIT_EVENT_PG_SLEEP:
+            event_name = "PgSleep";
+            break;
+        case WAIT_EVENT_RECOVERY_APPLY_DELAY:
+            event_name = "RecoveryApplyDelay";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_io() -
+ *
+ * Convert WaitEventIO to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_io(WaitEventIO w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_BUFFILE_READ:
+            event_name = "BufFileRead";
+            break;
+        case WAIT_EVENT_BUFFILE_WRITE:
+            event_name = "BufFileWrite";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_READ:
+            event_name = "ControlFileRead";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_SYNC:
+            event_name = "ControlFileSync";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE:
+            event_name = "ControlFileSyncUpdate";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_WRITE:
+            event_name = "ControlFileWrite";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE:
+            event_name = "ControlFileWriteUpdate";
+            break;
+        case WAIT_EVENT_COPY_FILE_READ:
+            event_name = "CopyFileRead";
+            break;
+        case WAIT_EVENT_COPY_FILE_WRITE:
+            event_name = "CopyFileWrite";
+            break;
+        case WAIT_EVENT_DATA_FILE_EXTEND:
+            event_name = "DataFileExtend";
+            break;
+        case WAIT_EVENT_DATA_FILE_FLUSH:
+            event_name = "DataFileFlush";
+            break;
+        case WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC:
+            event_name = "DataFileImmediateSync";
+            break;
+        case WAIT_EVENT_DATA_FILE_PREFETCH:
+            event_name = "DataFilePrefetch";
+            break;
+        case WAIT_EVENT_DATA_FILE_READ:
+            event_name = "DataFileRead";
+            break;
+        case WAIT_EVENT_DATA_FILE_SYNC:
+            event_name = "DataFileSync";
+            break;
+        case WAIT_EVENT_DATA_FILE_TRUNCATE:
+            event_name = "DataFileTruncate";
+            break;
+        case WAIT_EVENT_DATA_FILE_WRITE:
+            event_name = "DataFileWrite";
+            break;
+        case WAIT_EVENT_DSM_FILL_ZERO_WRITE:
+            event_name = "DSMFillZeroWrite";
+            break;
+        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ:
+            event_name = "LockFileAddToDataDirRead";
+            break;
+        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC:
+            event_name = "LockFileAddToDataDirSync";
+            break;
+        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE:
+            event_name = "LockFileAddToDataDirWrite";
+            break;
+        case WAIT_EVENT_LOCK_FILE_CREATE_READ:
+            event_name = "LockFileCreateRead";
+            break;
+        case WAIT_EVENT_LOCK_FILE_CREATE_SYNC:
+            event_name = "LockFileCreateSync";
+            break;
+        case WAIT_EVENT_LOCK_FILE_CREATE_WRITE:
+            event_name = "LockFileCreateWrite";
+            break;
+        case WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ:
+            event_name = "LockFileReCheckDataDirRead";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC:
+            event_name = "LogicalRewriteCheckpointSync";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC:
+            event_name = "LogicalRewriteMappingSync";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE:
+            event_name = "LogicalRewriteMappingWrite";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_SYNC:
+            event_name = "LogicalRewriteSync";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE:
+            event_name = "LogicalRewriteTruncate";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_WRITE:
+            event_name = "LogicalRewriteWrite";
+            break;
+        case WAIT_EVENT_RELATION_MAP_READ:
+            event_name = "RelationMapRead";
+            break;
+        case WAIT_EVENT_RELATION_MAP_SYNC:
+            event_name = "RelationMapSync";
+            break;
+        case WAIT_EVENT_RELATION_MAP_WRITE:
+            event_name = "RelationMapWrite";
+            break;
+        case WAIT_EVENT_REORDER_BUFFER_READ:
+            event_name = "ReorderBufferRead";
+            break;
+        case WAIT_EVENT_REORDER_BUFFER_WRITE:
+            event_name = "ReorderBufferWrite";
+            break;
+        case WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ:
+            event_name = "ReorderLogicalMappingRead";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_READ:
+            event_name = "ReplicationSlotRead";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC:
+            event_name = "ReplicationSlotRestoreSync";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_SYNC:
+            event_name = "ReplicationSlotSync";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_WRITE:
+            event_name = "ReplicationSlotWrite";
+            break;
+        case WAIT_EVENT_SLRU_FLUSH_SYNC:
+            event_name = "SLRUFlushSync";
+            break;
+        case WAIT_EVENT_SLRU_READ:
+            event_name = "SLRURead";
+            break;
+        case WAIT_EVENT_SLRU_SYNC:
+            event_name = "SLRUSync";
+            break;
+        case WAIT_EVENT_SLRU_WRITE:
+            event_name = "SLRUWrite";
+            break;
+        case WAIT_EVENT_SNAPBUILD_READ:
+            event_name = "SnapbuildRead";
+            break;
+        case WAIT_EVENT_SNAPBUILD_SYNC:
+            event_name = "SnapbuildSync";
+            break;
+        case WAIT_EVENT_SNAPBUILD_WRITE:
+            event_name = "SnapbuildWrite";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC:
+            event_name = "TimelineHistoryFileSync";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE:
+            event_name = "TimelineHistoryFileWrite";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_READ:
+            event_name = "TimelineHistoryRead";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_SYNC:
+            event_name = "TimelineHistorySync";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_WRITE:
+            event_name = "TimelineHistoryWrite";
+            break;
+        case WAIT_EVENT_TWOPHASE_FILE_READ:
+            event_name = "TwophaseFileRead";
+            break;
+        case WAIT_EVENT_TWOPHASE_FILE_SYNC:
+            event_name = "TwophaseFileSync";
+            break;
+        case WAIT_EVENT_TWOPHASE_FILE_WRITE:
+            event_name = "TwophaseFileWrite";
+            break;
+        case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ:
+            event_name = "WALSenderTimelineHistoryRead";
+            break;
+        case WAIT_EVENT_WAL_BOOTSTRAP_SYNC:
+            event_name = "WALBootstrapSync";
+            break;
+        case WAIT_EVENT_WAL_BOOTSTRAP_WRITE:
+            event_name = "WALBootstrapWrite";
+            break;
+        case WAIT_EVENT_WAL_COPY_READ:
+            event_name = "WALCopyRead";
+            break;
+        case WAIT_EVENT_WAL_COPY_SYNC:
+            event_name = "WALCopySync";
+            break;
+        case WAIT_EVENT_WAL_COPY_WRITE:
+            event_name = "WALCopyWrite";
+            break;
+        case WAIT_EVENT_WAL_INIT_SYNC:
+            event_name = "WALInitSync";
+            break;
+        case WAIT_EVENT_WAL_INIT_WRITE:
+            event_name = "WALInitWrite";
+            break;
+        case WAIT_EVENT_WAL_READ:
+            event_name = "WALRead";
+            break;
+        case WAIT_EVENT_WAL_SYNC:
+            event_name = "WALSync";
+            break;
+        case WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN:
+            event_name = "WALSyncMethodAssign";
+            break;
+        case WAIT_EVENT_WAL_WRITE:
+            event_name = "WALWrite";
+            break;
+
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+
+/* ----------
+ * pgstat_get_backend_current_activity() -
+ *
+ *    Return a string representing the current activity of the backend with
+ *    the specified PID.  This looks directly at the BackendStatusArray,
+ *    and so will provide current information regardless of the age of our
+ *    transaction's snapshot of the status array.
+ *
+ *    It is the caller's responsibility to invoke this only for backends whose
+ *    state is expected to remain stable while the result is in use.  The
+ *    only current use is in deadlock reporting, where we can expect that
+ *    the target backend is blocked on a lock.  (There are corner cases
+ *    where the target's wait could get aborted while we are looking at it,
+ *    but the very worst consequence is to return a pointer to a string
+ *    that's been changed, so we won't worry too much.)
+ *
+ *    Note: return strings for special cases match pg_stat_get_backend_activity.
+ * ----------
+ */
+const char *
+pgstat_get_backend_current_activity(int pid, bool checkUser)
+{
+    PgBackendStatus *beentry;
+    int            i;
+
+    beentry = BackendStatusArray;
+    for (i = 1; i <= MaxBackends; i++)
+    {
+        /*
+         * Although we expect the target backend's entry to be stable, that
+         * doesn't imply that anyone else's is.  To avoid identifying the
+         * wrong backend, while we check for a match to the desired PID we
+         * must follow the protocol of retrying if st_changecount changes
+         * while we examine the entry, or if it's odd.  (This might be
+         * unnecessary, since fetching or storing an int is almost certainly
+         * atomic, but let's play it safe.)  We use a volatile pointer here to
+         * ensure the compiler doesn't try to get cute.
+         */
+        volatile PgBackendStatus *vbeentry = beentry;
+        bool        found;
+
+        for (;;)
+        {
+            int            before_changecount;
+            int            after_changecount;
+
+            pgstat_save_changecount_before(vbeentry, before_changecount);
+
+            found = (vbeentry->st_procpid == pid);
+
+            pgstat_save_changecount_after(vbeentry, after_changecount);
+
+            if (before_changecount == after_changecount &&
+                (before_changecount & 1) == 0)
+                break;
+
+            /* Make sure we can break out of loop if stuck... */
+            CHECK_FOR_INTERRUPTS();
+        }
+
+        if (found)
+        {
+            /* Now it is safe to use the non-volatile pointer */
+            if (checkUser && !superuser() && beentry->st_userid != GetUserId())
+                return "<insufficient privilege>";
+            else if (*(beentry->st_activity_raw) == '\0')
+                return "<command string not enabled>";
+            else
+            {
+                /* this'll leak a bit of memory, but that seems acceptable */
+                return pgstat_clip_activity(beentry->st_activity_raw);
+            }
+        }
+
+        beentry++;
+    }
+
+    /* If we get here, caller is in error ... */
+    return "<backend information not available>";
+}
+
+/* ----------
+ * pgstat_get_crashed_backend_activity() -
+ *
+ *    Return a string representing the current activity of the backend with
+ *    the specified PID.  Like the function above, but reads shared memory with
+ *    the expectation that it may be corrupt.  On success, copy the string
+ *    into the "buffer" argument and return that pointer.  On failure,
+ *    return NULL.
+ *
+ *    This function is only intended to be used by the postmaster to report the
+ *    query that crashed a backend.  In particular, no attempt is made to
+ *    follow the correct concurrency protocol when accessing the
+ *    BackendStatusArray.  But that's OK, in the worst case we'll return a
+ *    corrupted message.  We also must take care not to trip on ereport(ERROR).
+ * ----------
+ */
+const char *
+pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
+{
+    volatile PgBackendStatus *beentry;
+    int            i;
+
+    beentry = BackendStatusArray;
+
+    /*
+     * We probably shouldn't get here before shared memory has been set up,
+     * but be safe.
+     */
+    if (beentry == NULL || BackendActivityBuffer == NULL)
+        return NULL;
+
+    for (i = 1; i <= MaxBackends; i++)
+    {
+        if (beentry->st_procpid == pid)
+        {
+            /* Read pointer just once, so it can't change after validation */
+            const char *activity = beentry->st_activity_raw;
+            const char *activity_last;
+
+            /*
+             * We mustn't access activity string before we verify that it
+             * falls within the BackendActivityBuffer. To make sure that the
+             * entire string including its ending is contained within the
+             * buffer, subtract one activity length from the buffer size.
+             */
+            activity_last = BackendActivityBuffer + BackendActivityBufferSize
+                - pgstat_track_activity_query_size;
+
+            if (activity < BackendActivityBuffer ||
+                activity > activity_last)
+                return NULL;
+
+            /* If no string available, no point in a report */
+            if (activity[0] == '\0')
+                return NULL;
+
+            /*
+             * Copy only ASCII-safe characters so we don't run into encoding
+             * problems when reporting the message; and be sure not to run off
+             * the end of memory.  As only ASCII characters are reported, it
+             * doesn't seem necessary to perform multibyte aware clipping.
+             */
+            ascii_safe_strlcpy(buffer, activity,
+                               Min(buflen, pgstat_track_activity_query_size));
+
+            return buffer;
+        }
+
+        beentry++;
+    }
+
+    /* PID not found */
+    return NULL;
+}
+
+const char *
+pgstat_get_backend_desc(BackendType backendType)
+{
+    const char *backendDesc = "unknown process type";
+
+    switch (backendType)
+    {
+        case B_AUTOVAC_LAUNCHER:
+            backendDesc = "autovacuum launcher";
+            break;
+        case B_AUTOVAC_WORKER:
+            backendDesc = "autovacuum worker";
+            break;
+        case B_BACKEND:
+            backendDesc = "client backend";
+            break;
+        case B_BG_WORKER:
+            backendDesc = "background worker";
+            break;
+        case B_BG_WRITER:
+            backendDesc = "background writer";
+            break;
+        case B_ARCHIVER:
+            backendDesc = "archiver";
+            break;
+        case B_CHECKPOINTER:
+            backendDesc = "checkpointer";
+            break;
+        case B_STARTUP:
+            backendDesc = "startup";
+            break;
+        case B_WAL_RECEIVER:
+            backendDesc = "walreceiver";
+            break;
+        case B_WAL_SENDER:
+            backendDesc = "walsender";
+            break;
+        case B_WAL_WRITER:
+            backendDesc = "walwriter";
+            break;
+    }
+
+    return backendDesc;
+}
+
+/* ----------
+ * pgstat_report_appname() -
+ *
+ *    Called to update our application name.
+ * ----------
+ */
+void
+pgstat_report_appname(const char *appname)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+    int            len;
+
+    if (!beentry)
+        return;
+
+    /* This should be unnecessary if GUC did its job, but be safe */
+    len = pg_mbcliplen(appname, strlen(appname), NAMEDATALEN - 1);
+
+    /*
+     * Update my status entry, following the protocol of bumping
+     * st_changecount before and after.  We use a volatile pointer here to
+     * ensure the compiler doesn't try to get cute.
+     */
+    pgstat_increment_changecount_before(beentry);
+
+    memcpy((char *) beentry->st_appname, appname, len);
+    beentry->st_appname[len] = '\0';
+
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*
+ * Report current transaction start timestamp as the specified value.
+ * Zero means there is no active transaction.
+ */
+void
+pgstat_report_xact_timestamp(TimestampTz tstamp)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    if (!pgstat_track_activities || !beentry)
+        return;
+
+    /*
+     * Update my status entry, following the protocol of bumping
+     * st_changecount before and after.  We use a volatile pointer here to
+     * ensure the compiler doesn't try to get cute.
+     */
+    pgstat_increment_changecount_before(beentry);
+    beentry->st_xact_start_timestamp = tstamp;
+    pgstat_increment_changecount_after(beentry);
+}
+
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ *    Create pgBeStatLocalContext, if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+    if (!pgBeStatLocalContext)
+        pgBeStatLocalContext = AllocSetContextCreate(TopMemoryContext,
+                                                     "Backend status snapshot",
+                                                     ALLOCSET_SMALL_SIZES);
+}
+
+/* ----------
+ * AtEOXact_BEStatus
+ *
+ *    Called from access/transam/xact.c at top-level transaction commit/abort.
+ * ----------
+ */
+void
+AtEOXact_BEStatus(bool isCommit)
+{
+    bestatus_clear_snapshot();
+}
+
+/*
+ * AtPrepare_BEStatus
+ *        Clear existing snapshot at 2PC transaction prepare.
+ */
+void
+AtPrepare_BEStatus(void)
+{
+    bestatus_clear_snapshot();
+}
+
+/* ----------
+ * bestatus_clear_snapshot() -
+ *
+ *    Discard any data collected in the current transaction.  Any subsequent
+ *    request will cause new snapshots to be read.
+ *
+ *    This is also invoked during transaction commit or abort to discard
+ *    the no-longer-wanted snapshot.
+ * ----------
+ */
+static void
+bestatus_clear_snapshot(void)
+{
+    /* Release memory, if any was allocated */
+    if (pgBeStatLocalContext)
+        MemoryContextDelete(pgBeStatLocalContext);
+
+    /* Reset variables */
+    pgBeStatLocalContext = NULL;
+    localBackendStatusTable = NULL;
+    localNumBackends = 0;
+}
+
+
+
+/* ----------
+ * pgstat_report_activity() -
+ *
+ *    Called from tcop/postgres.c to report what the backend is actually doing
+ *    (but note cmd_str can be NULL for certain cases).
+ *
+ * All updates of the status entry follow the protocol of bumping
+ * st_changecount before and after.  We use a volatile pointer here to
+ * ensure the compiler doesn't try to get cute.
+ * ----------
+ */
+void
+pgstat_report_activity(BackendState state, const char *cmd_str)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+    TimestampTz start_timestamp;
+    TimestampTz current_timestamp;
+    int            len = 0;
+
+    TRACE_POSTGRESQL_STATEMENT_STATUS(cmd_str);
+
+    if (!beentry)
+        return;
+
+    if (!pgstat_track_activities)
+    {
+        if (beentry->st_state != STATE_DISABLED)
+        {
+            volatile PGPROC *proc = MyProc;
+
+            /*
+             * track_activities is disabled, but we last reported a
+             * non-disabled state.  As our final update, change the state and
+             * clear fields we will not be updating anymore.
+             */
+            pgstat_increment_changecount_before(beentry);
+            beentry->st_state = STATE_DISABLED;
+            beentry->st_state_start_timestamp = 0;
+            beentry->st_activity_raw[0] = '\0';
+            beentry->st_activity_start_timestamp = 0;
+            /* st_xact_start_timestamp and wait_event_info are also disabled */
+            beentry->st_xact_start_timestamp = 0;
+            proc->wait_event_info = 0;
+            pgstat_increment_changecount_after(beentry);
+        }
+        return;
+    }
+
+    /*
+     * To minimize the time spent modifying the entry, fetch all the needed
+     * data first.
+     */
+    start_timestamp = GetCurrentStatementStartTimestamp();
+    if (cmd_str != NULL)
+    {
+        /*
+         * Compute length of to-be-stored string unaware of multi-byte
+         * characters. For speed reasons that'll get corrected on read, rather
+         * than computed every write.
+         */
+        len = Min(strlen(cmd_str), pgstat_track_activity_query_size - 1);
+    }
+    current_timestamp = GetCurrentTimestamp();
+
+    /*
+     * Now update the status entry
+     */
+    pgstat_increment_changecount_before(beentry);
+
+    beentry->st_state = state;
+    beentry->st_state_start_timestamp = current_timestamp;
+
+    if (cmd_str != NULL)
+    {
+        memcpy((char *) beentry->st_activity_raw, cmd_str, len);
+        beentry->st_activity_raw[len] = '\0';
+        beentry->st_activity_start_timestamp = start_timestamp;
+    }
+
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * pgstat_progress_start_command() -
+ *
+ * Set st_progress_command (and st_progress_command_target) in own backend
+ * entry.  Also, zero-initialize st_progress_param array.
+ *-----------
+ */
+void
+pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    if (!beentry || !pgstat_track_activities)
+        return;
+
+    pgstat_increment_changecount_before(beentry);
+    beentry->st_progress_command = cmdtype;
+    beentry->st_progress_command_target = relid;
+    MemSet(&beentry->st_progress_param, 0, sizeof(beentry->st_progress_param));
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * pgstat_progress_update_param() -
+ *
+ * Update index'th member in st_progress_param[] of own backend entry.
+ *-----------
+ */
+void
+pgstat_progress_update_param(int index, int64 val)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    Assert(index >= 0 && index < PGSTAT_NUM_PROGRESS_PARAM);
+
+    if (!beentry || !pgstat_track_activities)
+        return;
+
+    pgstat_increment_changecount_before(beentry);
+    beentry->st_progress_param[index] = val;
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * pgstat_progress_update_multi_param() -
+ *
+ * Update multiple members in st_progress_param[] of own backend entry.
+ * This is atomic; readers won't see intermediate states.
+ *-----------
+ */
+void
+pgstat_progress_update_multi_param(int nparam, const int *index,
+                                   const int64 *val)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+    int            i;
+
+    if (!beentry || !pgstat_track_activities || nparam == 0)
+        return;
+
+    pgstat_increment_changecount_before(beentry);
+
+    for (i = 0; i < nparam; ++i)
+    {
+        Assert(index[i] >= 0 && index[i] < PGSTAT_NUM_PROGRESS_PARAM);
+
+        beentry->st_progress_param[index[i]] = val[i];
+    }
+
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * pgstat_progress_end_command() -
+ *
+ * Reset st_progress_command (and st_progress_command_target) in own backend
+ * entry.  This signals the end of the command.
+ *-----------
+ */
+void
+pgstat_progress_end_command(void)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    if (!beentry)
+        return;
+    if (!pgstat_track_activities
+        && beentry->st_progress_command == PROGRESS_COMMAND_INVALID)
+        return;
+
+    pgstat_increment_changecount_before(beentry);
+    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
+    beentry->st_progress_command_target = InvalidOid;
+    pgstat_increment_changecount_after(beentry);
+}
+
+
+/*
+ * Convert a potentially unsafely truncated activity string (see
+ * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
+ * one.
+ *
+ * The returned string is allocated in the caller's memory context and may be
+ * freed.
+ */
+char *
+pgstat_clip_activity(const char *raw_activity)
+{
+    char       *activity;
+    int            rawlen;
+    int            cliplen;
+
+    /*
+     * Some callers, like pgstat_get_backend_current_activity(), do not
+     * guarantee that the buffer isn't concurrently modified. We try to take
+     * care that the buffer is always terminated by a NUL byte regardless, but
+     * let's still be paranoid about the string's length. In those cases the
+     * underlying buffer is guaranteed to be pgstat_track_activity_query_size
+     * large.
+     */
+    activity = pnstrdup(raw_activity, pgstat_track_activity_query_size - 1);
+
+    /* now double-guaranteed to be NUL terminated */
+    rawlen = strlen(activity);
+
+    /*
+     * All supported server-encodings make it possible to determine the length
+     * of a multi-byte character from its first byte (this is not the case for
+     * client encodings, see GB18030). As st_activity is always stored using
+     * server encoding, this allows us to perform multi-byte aware truncation,
+     * even if the string earlier was truncated in the middle of a multi-byte
+     * character.
+     */
+    cliplen = pg_mbcliplen(activity, rawlen,
+                           pgstat_track_activity_query_size - 1);
+
+    activity[cliplen] = '\0';
+
+    return activity;
+}
diff --git a/src/backend/statmon/pgstat.c b/src/backend/statmon/pgstat.c
new file mode 100644
index 0000000000..66fb1e9341
--- /dev/null
+++ b/src/backend/statmon/pgstat.c
@@ -0,0 +1,3962 @@
+/* ----------
+ * pgstat.c
+ *
+ *    Statistics collector facility.
+ *
+ *  Collects per-table and per-function usage statistics of backends and shares
+ *  them among all backends via shared memory. Every backend records
+ *  individual activity in local memory using pg_count_*() and friends
+ *  interfaces during a transaction. Then pgstat_report_stat() is called at
+ *  the end of a transaction to flush out the local numbers to shared
+ *  memory. To avoid congestion on the shared memory, we do that no more often
+ *  than PGSTAT_STAT_MIN_INTERVAL (500ms). Still, a backend may be unable to
+ *  flush all or part of its local numbers immediately; such numbers are
+ *  postponed to the next chance but they are not kept longer than
+ *  PGSTAT_STAT_MAX_INTERVAL (1000ms).
+ *
+ *  pgstat_fetch_stat_*() are used to read the statistics numbers. There are
+ *  two ways of reading the shared statistics: transactional and
+ *  one-shot. In the former, retrieved numbers are stored in a local hash that
+ *  persists until transaction end. On the other hand, autovacuum, which
+ *  doesn't need such characteristics, uses one-shot mode, which just copies
+ *  the data into palloc'ed memory.
+ *
+ *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
+ *
+ *    src/backend/statmon/pgstat.c
+ * ----------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "pgstat.h"
+
+#include "access/heapam.h"
+#include "access/htup_details.h"
+#include "access/twophase_rmgr.h"
+#include "access/xact.h"
+#include "bestatus.h"
+#include "catalog/pg_database.h"
+#include "catalog/pg_proc.h"
+#include "miscadmin.h"
+#include "postmaster/autovacuum.h"
+#include "storage/ipc.h"
+#include "storage/lmgr.h"
+#include "storage/procsignal.h"
+#include "utils/memutils.h"
+#include "utils/snapmgr.h"
+
+/* ----------
+ * Timer definitions.
+ * ----------
+ */
+#define PGSTAT_STAT_MIN_INTERVAL    500 /* Minimum time between stats data
+                                         * updates; in milliseconds. */
+
+#define PGSTAT_STAT_MAX_INTERVAL   1000 /* Maximum time between stats data
+                                         * updates; in milliseconds. */
+
+/* ----------
+ * The initial size hints for the hash tables used in the collector.
+ * ----------
+ */
+#define PGSTAT_TAB_HASH_SIZE    512
+#define PGSTAT_FUNCTION_HASH_SIZE    512
+
+/*
+ * Operation mode of pgstat_get_db_entry.
+ */
+#define PGSTAT_FETCH_SHARED    0
+#define PGSTAT_FETCH_EXCLUSIVE    1
+#define PGSTAT_FETCH_NOWAIT    2
+
+typedef enum PgStat_TableLookupState
+{
+    PGSTAT_ENTRY_NOT_FOUND,
+    PGSTAT_ENTRY_FOUND,
+    PGSTAT_ENTRY_LOCK_FAILED
+} PgStat_TableLookupState;
+
+/* ----------
+ * GUC parameters
+ * ----------
+ */
+bool        pgstat_track_counts = false;
+int            pgstat_track_functions = TRACK_FUNC_OFF;
+
+/* ----------
+ * Built from GUC parameter
+ * ----------
+ */
+char       *pgstat_stat_directory = NULL;
+
+/* No longer used, but will be removed with GUC */
+char       *pgstat_stat_filename = NULL;
+char       *pgstat_stat_tmpname = NULL;
+
+/* Shared stats bootstrap information */
+typedef struct StatsShmemStruct {
+    dsa_handle stats_dsa_handle;
+    dshash_table_handle db_stats_handle;
+    dsa_pointer    global_stats;
+    dsa_pointer    archiver_stats;
+    TimestampTz last_update;
+} StatsShmemStruct;
+
+
+/*
+ * BgWriter global statistics counters (unused in other processes).
+ * Stored directly in a stats message structure so it can be sent
+ * without needing to copy things around.  We assume this inits to zeroes.
+ */
+PgStat_BgWriter BgWriterStats;
+
+/* ----------
+ * Local data
+ * ----------
+ */
+/* Variables that live for the backend lifetime */
+static StatsShmemStruct *StatsShmem = NULL;
+static dsa_area *area = NULL;
+static dshash_table *db_stats;
+
+/* memory context for snapshots */
+static MemoryContext pgStatLocalContext = NULL;
+static HTAB *snapshot_db_stats;
+
+/* dshash parameter for each type of table */
+static const dshash_parameters dsh_dbparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatDBEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_DB
+};
+static const dshash_parameters dsh_tblparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatTabEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_FUNC_TABLE
+};
+static const dshash_parameters dsh_funcparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatFuncEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_FUNC_TABLE
+};
+
+/*
+ * Structures in which backends store per-table info that's waiting to be
+ * written to shared stats.
+ *
+ * NOTE: once allocated, TabStatusArray structures are never moved or deleted
+ * for the life of the backend.  Also, we zero out the t_id fields of the
+ * contained PgStat_TableStatus structs whenever they are not actively in use.
+ * This allows relcache pgstat_info pointers to be treated as long-lived data,
+ * avoiding repeated searches in pgstat_initstats() when a relation is
+ * repeatedly opened during a transaction.
+ */
+#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
+
+typedef struct TabStatusArray
+{
+    struct TabStatusArray *tsa_next;    /* link to next array, if any */
+    int            tsa_used;        /* # entries currently used */
+    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
+} TabStatusArray;
+
+static TabStatusArray *pgStatTabList = NULL;
+
+/*
+ * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
+ */
+typedef struct TabStatHashEntry
+{
+    Oid            t_id;
+    PgStat_TableStatus *tsa_entry;
+} TabStatHashEntry;
+
+/*
+ * Hash table for O(1) t_id -> tsa_entry lookup
+ */
+static HTAB *pgStatTabHash = NULL;
+
+/*
+ * Backends store per-function info that's waiting to be flushed out to shared
+ * memory in this hash table (indexed by function OID).
+ */
+static HTAB *pgStatFunctions = NULL;
+
+/*
+ * Variables signaling that the backend has numbers that could not be
+ * flushed out to shared memory in past attempts.
+ */
+static bool  pgStatPendingRecoveryConflicts = false;
+static bool  pgStatPendingDeadlocks = false;
+static bool  pgStatPendingTempfiles = false;
+static HTAB *pgStatPendingTabHash = NULL;
+static HTAB *pgStatPendingFunctions = NULL;
+
+/*
+ * Tuple insertion/deletion counts for an open transaction can't be propagated
+ * into PgStat_TableStatus counters until we know if it is going to commit
+ * or abort.  Hence, we keep these counts in per-subxact structs that live
+ * in TopTransactionContext.  This data structure is designed on the assumption
+ * that subxacts won't usually modify very many tables.
+ */
+typedef struct PgStat_SubXactStatus
+{
+    int            nest_level;        /* subtransaction nest level */
+    struct PgStat_SubXactStatus *prev;    /* higher-level subxact if any */
+    PgStat_TableXactStatus *first;    /* head of list for this subxact */
+} PgStat_SubXactStatus;
+
+static PgStat_SubXactStatus *pgStatXactStack = NULL;
+
+static int    pgStatXactCommit = 0;
+static int    pgStatXactRollback = 0;
+PgStat_Counter pgStatBlockReadTime = 0;
+PgStat_Counter pgStatBlockWriteTime = 0;
+
+/* Record that's written to 2PC state file when pgstat state is persisted */
+typedef struct TwoPhasePgStatRecord
+{
+    PgStat_Counter tuples_inserted; /* tuples inserted in xact */
+    PgStat_Counter tuples_updated;    /* tuples updated in xact */
+    PgStat_Counter tuples_deleted;    /* tuples deleted in xact */
+    PgStat_Counter inserted_pre_trunc;    /* tuples inserted prior to truncate */
+    PgStat_Counter updated_pre_trunc;    /* tuples updated prior to truncate */
+    PgStat_Counter deleted_pre_trunc;    /* tuples deleted prior to truncate */
+    Oid            t_id;            /* table's OID */
+    bool        t_shared;        /* is it a shared catalog? */
+    bool        t_truncated;    /* was the relation truncated? */
+} TwoPhasePgStatRecord;
+
+typedef struct
+{
+    dshash_table *tabhash;
+    PgStat_StatDBEntry *dbentry;
+} pgstat_apply_tabstat_context;
+
+/*
+ * Info about current snapshot of stats
+ */
+TimestampTz backend_cache_expire = 0; /* local cache expiration time */
+bool        first_in_xact = true;      /* first fetch after last xact end */
+
+/*
+ * Cluster-wide statistics.
+ *
+ * Contains statistics that are not collected per database or per table.
+ * shared_* are the statistics maintained by statistics collector code and
+ * snapshot_* are cached stats for the reader code.
+ */
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats *snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats *snapshot_globalStats;
+
+/*
+ * Total time charged to functions so far in the current backend.
+ * We use this to help separate "self" and "other" time charges.
+ * (We assume this initializes to zero.)
+ */
+static instr_time total_func_time;
+
+
+/* ----------
+ * Local function forward declarations
+ * ----------
+ */
+/* functions used in backends */
+static void pgstat_beshutdown_hook(int code, Datum arg);
+
+static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, int op,
+                                    PgStat_TableLookupState *status);
+static PgStat_StatTabEntry *pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create);
+static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_db_statsfile(Oid databaseid, dshash_table *tabhash, dshash_table *funchash);
+
+/* functions used in backends */
+static bool backend_snapshot_global_stats(void);
+static PgStat_StatFuncEntry *backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot);
+
+static void pgstat_postmaster_shutdown(int code, Datum arg);
+static void pgstat_apply_pending_tabstats(bool shared, bool force,
+                               pgstat_apply_tabstat_context *cxt);
+static bool pgstat_apply_tabstat(pgstat_apply_tabstat_context *cxt,
+                                 PgStat_TableStatus *entry, bool nowait);
+static void pgstat_merge_tabentry(PgStat_TableStatus *deststat,
+                                          PgStat_TableStatus *srcstat,
+                                          bool init);
+static void pgstat_update_funcstats(bool force, PgStat_StatDBEntry *dbentry);
+static void pgstat_reset_all_counters(void);
+static void pgstat_cleanup_recovery_conflict(PgStat_StatDBEntry *dbentry);
+static void pgstat_cleanup_deadlock(PgStat_StatDBEntry *dbentry);
+static void pgstat_cleanup_tempfile(PgStat_StatDBEntry *dbentry);
+
+static inline void pgstat_merge_backendstats_to_funcentry(
+    PgStat_StatFuncEntry *dest, PgStat_BackendFunctionEntry *src, bool init);
+static inline void pgstat_merge_funcentry(
+    PgStat_StatFuncEntry *dest, PgStat_StatFuncEntry *src, bool init);
+
+static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
+static void reset_dbentry_counters(PgStat_StatDBEntry *dbentry);
+static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
+
+static void pgstat_setup_memcxt(void);
+
+static bool pgstat_update_tabentry(dshash_table *tabhash,
+                                   PgStat_TableStatus *stat, bool nowait);
+static void pgstat_update_dbentry(PgStat_StatDBEntry *dbentry,
+                                  PgStat_TableStatus *stat);
+
+/* ------------------------------------------------------------
+ * Public functions called from postmaster follow
+ * ------------------------------------------------------------
+ */
+
+
+void
+pgstat_initialize(void)
+{
+    /* Set up a process-exit hook to clean up */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
+}
+
+/*
+ * subroutine for pgstat_reset_all
+ */
+static void
+pgstat_reset_remove_files(const char *directory)
+{
+    DIR           *dir;
+    struct dirent *entry;
+    char        fname[MAXPGPATH * 2];
+
+    dir = AllocateDir(directory);
+    while ((entry = ReadDir(dir, directory)) != NULL)
+    {
+        int            nchars;
+        Oid            tmp_oid;
+
+        /*
+         * Skip directory entries that don't match the file names we write.
+         * See get_dbstat_filename for the database-specific pattern.
+         */
+        if (strncmp(entry->d_name, "global.", 7) == 0)
+            nchars = 7;
+        else
+        {
+            nchars = 0;
+            (void) sscanf(entry->d_name, "db_%u.%n",
+                          &tmp_oid, &nchars);
+            if (nchars <= 0)
+                continue;
+            /* %u allows leading whitespace, so reject that */
+            if (strchr("0123456789", entry->d_name[3]) == NULL)
+                continue;
+        }
+
+        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
+            strcmp(entry->d_name + nchars, "stat") != 0)
+            continue;
+
+        snprintf(fname, sizeof(fname), "%s/%s", directory,
+                 entry->d_name);
+        unlink(fname);
+    }
+    FreeDir(dir);
+}
+
+/*
+ * pgstat_reset_all() -
+ *
+ * Remove the stats files and reset the in-memory counters.  This is currently
+ * used only if WAL recovery is needed after a crash.
+ */
+void
+pgstat_reset_all(void)
+{
+    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    pgstat_reset_all_counters();
+}
+
+/* ----------
+ * pgstat_create_shared_stats() -
+ *
+ *    create shared stats memory
+ * ----------
+ */
+static void
+pgstat_create_shared_stats(void)
+{
+    MemoryContext oldcontext;
+
+    Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
+
+    /* lives for the lifetime of the process */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    area = dsa_create(LWTRANCHE_STATS_DSA);
+    dsa_pin_mapping(area);
+
+    db_stats = dshash_create(area, &dsh_dbparams, 0);
+
+    /* create shared area and write bootstrap information */
+    StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+    StatsShmem->global_stats =
+        dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+    StatsShmem->archiver_stats =
+        dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+    StatsShmem->db_stats_handle =
+        dshash_get_hash_table_handle(db_stats);
+    StatsShmem->last_update = 0;
+
+    /* initial connect to the memory */
+    MemoryContextSwitchTo(pgStatLocalContext);
+    snapshot_db_stats = NULL;
+    shared_globalStats = (PgStat_GlobalStats *)
+        dsa_get_address(area, StatsShmem->global_stats);
+    shared_archiverStats = (PgStat_ArchiverStats *)
+        dsa_get_address(area, StatsShmem->archiver_stats);
+    MemoryContextSwitchTo(oldcontext);
+}
+
+
+/* ------------------------------------------------------------
+ * Public functions used by backends follow
+ *------------------------------------------------------------
+ */
+
+
+/* ----------
+ * pgstat_update_stat() -
+ *
+ *    Must be called by processes that perform DML: tcop/postgres.c, logical
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *    This requires taking some locks on the shared statistics hashes, and some
+ *    updates may be postponed on lock failure. Such postponed updates are
+ *    retried in later calls of this function and are finally cleaned up when
+ *    this function is called with force = true or when
+ *    PGSTAT_STAT_MAX_INTERVAL milliseconds have elapsed since the last
+ *    cleanup. On the other hand, updates by regular backends happen at
+ *    intervals not shorter than PGSTAT_STAT_MIN_INTERVAL when force = false.
+ *
+ *    Returns time in milliseconds until the next update time.
+ *
+ *    Note that this is called only outside a transaction, so it is fair to
+ *    use transaction stop time as an approximation of current time.
+ * ----------
+ */
+long
+pgstat_update_stat(bool force)
+{
+    static TimestampTz last_report = 0;
+    static TimestampTz oldest_pending = 0;
+    TimestampTz now;
+    TabStatusArray *tsa;
+    pgstat_apply_tabstat_context cxt;
+    bool        other_pending_stats = false;
+    long        elapsed;
+    long        secs;
+    int            usecs;
+
+    if (pgStatPendingRecoveryConflicts ||
+        pgStatPendingDeadlocks ||
+        pgStatPendingTempfiles ||
+        pgStatPendingFunctions)
+        other_pending_stats = true;
+
+    /* Don't expend a clock check if nothing to do */
+    if (!other_pending_stats && !pgStatPendingTabHash &&
+        (pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
+        pgStatXactCommit == 0 && pgStatXactRollback == 0)
+        return 0;
+
+    now = GetCurrentTransactionStopTimestamp();
+
+    if (!force)
+    {
+        /*
+         * Don't update shared stats unless it's been at least
+         * PGSTAT_STAT_MIN_INTERVAL msec since we last updated them.
+         * In that case, return the time left to wait.
+         */
+        TimestampDifference(last_report, now, &secs, &usecs);
+        elapsed = secs * 1000 + usecs / 1000;
+
+        if (elapsed < PGSTAT_STAT_MIN_INTERVAL)
+        {
+            /* we know we have some statistics */
+            if (oldest_pending == 0)
+                oldest_pending = now;
+
+            return PGSTAT_STAT_MIN_INTERVAL - elapsed;
+        }
+
+
+        /*
+         * Don't keep pending stats for longer than PGSTAT_STAT_MAX_INTERVAL.
+         */
+        if (oldest_pending > 0)
+        {
+            TimestampDifference(oldest_pending, now, &secs, &usecs);
+            elapsed = secs * 1000 + usecs / 1000;
+
+            if (elapsed > PGSTAT_STAT_MAX_INTERVAL)
+                force = true;
+        }
+    }
+
+    last_report = now;
+
+    /* Publish report time */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (StatsShmem->last_update < last_report)
+        StatsShmem->last_update = last_report;
+    LWLockRelease(StatsLock);
+    
+
+    /* set up the stats update context */
+    cxt.dbentry = NULL;
+    cxt.tabhash = NULL;
+
+    /*
+     * Try other pending stats first. Although this could be done after
+     * flushing table stats, we do it here to reduce the number of database
+     * entry lookups.
+     */
+    if (other_pending_stats)
+    {
+        int op = PGSTAT_FETCH_EXCLUSIVE;
+
+        if (!force)
+            op |= PGSTAT_FETCH_NOWAIT;
+
+        cxt.dbentry = pgstat_get_db_entry(MyDatabaseId, op, NULL);
+
+        if (cxt.dbentry)
+        {
+            /* clean up pending statistics if any */
+            if (pgStatPendingFunctions)
+                pgstat_update_funcstats(true, cxt.dbentry);
+            if (pgStatPendingRecoveryConflicts)
+                pgstat_cleanup_recovery_conflict(cxt.dbentry);
+            if (pgStatPendingDeadlocks)
+                pgstat_cleanup_deadlock(cxt.dbentry);
+            if (pgStatPendingTempfiles)
+                pgstat_cleanup_tempfile(cxt.dbentry);
+        }
+    }
+
+    /*
+     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
+     * entries it points to.  (Should we fail partway through the loop below,
+     * it's okay to have removed the hashtable already --- the only
+     * consequence is we'd get multiple entries for the same table in the
+     * pgStatTabList, and that's safe.)
+     */
+    if (pgStatTabHash)
+        hash_destroy(pgStatTabHash);
+    pgStatTabHash = NULL;
+
+    /*
+     * Flush pending stats separately for regular tables and shared tables
+     * since we cannot hold locks on two dshash entries at once.
+     */
+
+    /* The first of the following calls uses the dbentry obtained above, if any. */
+    pgstat_apply_pending_tabstats(false, force, &cxt);
+    pgstat_apply_pending_tabstats(true, force, &cxt);
+
+    /* zero out TableStatus structs after use */
+    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+    {
+        MemSet(tsa->tsa_entries, 0,
+               tsa->tsa_used * sizeof(PgStat_TableStatus));
+        tsa->tsa_used = 0;
+    }
+
+    /* record oldest pending update time */
+    if (pgStatPendingTabHash == NULL)
+        oldest_pending = 0;
+    else if (oldest_pending == 0)
+        oldest_pending = now;
+
+    return 0;
+}
+
+/*
+ * Subroutine for pgstat_update_stat.
+ *
+ * Applies table stats in the table status array, merging them with pending
+ * stats if any.  If force is true, waits until the required locks are
+ * acquired. Otherwise, stores the merged stats as pending stats so that they
+ * are processed at the next opportunity.
+ */
+static void
+pgstat_apply_pending_tabstats(bool shared, bool force,
+                              pgstat_apply_tabstat_context *cxt)
+{
+    static const PgStat_TableCounts all_zeroes;
+    TabStatusArray *tsa;
+    int i;
+
+    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+    {
+        for (i = 0; i < tsa->tsa_used; i++)
+        {
+            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
+            PgStat_TableStatus *pentry = NULL;
+
+            /* Shouldn't have any pending transaction-dependent counts */
+            Assert(entry->trans == NULL);
+
+            /*
+             * Ignore entries that didn't accumulate any actual counts, such
+             * as indexes that were opened by the planner but not used.
+             */
+            if (memcmp(&entry->t_counts, &all_zeroes,
+                       sizeof(PgStat_TableCounts)) == 0)
+                continue;
+
+            /* Skip if this entry does not match the request */
+            if (entry->t_shared != shared)
+                continue;
+
+            /* if a pending update exists, apply it along with the new one */
+            if (pgStatPendingTabHash != NULL)
+            {
+                pentry = hash_search(pgStatPendingTabHash,
+                                     (void *) entry, HASH_FIND, NULL);
+
+                if (pentry)
+                {
+                    /* merge new update into pending updates */
+                    pgstat_merge_tabentry(pentry, entry, false);
+                    entry = pentry;
+                }
+            }
+
+            /* try to apply the merged stats */
+            if (pgstat_apply_tabstat(cxt, entry, !force))
+            {
+                /* succeeded. remove it if it was pending stats */
+                if (pentry)
+                    hash_search(pgStatPendingTabHash,
+                                (void *) pentry, HASH_REMOVE, NULL);
+            }
+            else if (!pentry)
+            {
+                /* failed and there was no pending entry, create new one. */
+                bool found;
+
+                if (pgStatPendingTabHash == NULL)
+                {
+                    HASHCTL        ctl;
+
+                    memset(&ctl, 0, sizeof(ctl));
+                    ctl.keysize = sizeof(Oid);
+                    ctl.entrysize = sizeof(PgStat_TableStatus);
+                    pgStatPendingTabHash =
+                        hash_create("pgstat pending table stats hash",
+                                    TABSTAT_QUANTUM,
+                                    &ctl,
+                                    HASH_ELEM | HASH_BLOBS);
+                }
+
+                pentry = hash_search(pgStatPendingTabHash,
+                                     (void *) entry, HASH_ENTER, &found);
+                Assert (!found);
+
+                *pentry = *entry;
+            }
+        }
+    }
+
+    /* if any pending stats exist, try to clean them up */
+    if (pgStatPendingTabHash != NULL)
+    {
+        HASH_SEQ_STATUS pstat;
+        PgStat_TableStatus *pentry;
+
+        hash_seq_init(&pstat, pgStatPendingTabHash);
+        while ((pentry = (PgStat_TableStatus *) hash_seq_search(&pstat)) != NULL)
+        {
+            /* Skip if this entry does not match the request */
+            if (pentry->t_shared != shared)
+                continue;
+
+            /* apply pending entry and remove on success */
+            if (pgstat_apply_tabstat(cxt, pentry, !force))
+                hash_search(pgStatPendingTabHash,
+                            (void *) pentry, HASH_REMOVE, NULL);
+        }
+
+        /* destroy the hash if no entry is left */
+        if (hash_get_num_entries(pgStatPendingTabHash) == 0)
+        {
+            hash_destroy(pgStatPendingTabHash);
+            pgStatPendingTabHash = NULL;
+        }
+    }
+
+    if (cxt->tabhash)
+        dshash_detach(cxt->tabhash);
+    if (cxt->dbentry)
+        dshash_release_lock(db_stats, cxt->dbentry);
+    cxt->tabhash = NULL;
+    cxt->dbentry = NULL;
+}
+
+
+/*
+ * pgstat_apply_tabstat: update shared stats entry using given entry
+ *
+ * If nowait is true, just returns false on lock failure.  The database entry
+ * and the dshash for table stats are kept locked/attached and stored in cxt.
+ * The caller must release and detach them after use.
+ */
+bool
+pgstat_apply_tabstat(pgstat_apply_tabstat_context *cxt,
+                     PgStat_TableStatus *entry, bool nowait)
+{
+    Oid dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
+    int        table_mode = PGSTAT_FETCH_EXCLUSIVE;
+    bool updated = false;
+
+    if (nowait)
+        table_mode |= PGSTAT_FETCH_NOWAIT;
+
+    /*
+     * We need to keep the lock on the dbentry for regular tables to avoid a
+     * race condition with DROP DATABASE, so we hold it in the context
+     * variable. We don't need that for shared tables.
+     */
+    if (!cxt->dbentry)
+        cxt->dbentry = pgstat_get_db_entry(dboid, table_mode, NULL);
+
+    /* we could not acquire the lock; just return */
+    if (!cxt->dbentry)
+        return false;
+
+    /* attach shared stats table if not yet */
+    if (!cxt->tabhash)
+    {
+        /*
+         * If we haven't attached the tabhash, we haven't applied the database
+         * stats yet, so apply them now.
+         */
+        if (!entry->t_shared)
+        {
+            /* Update database-wide stats  */
+            cxt->dbentry->n_xact_commit += pgStatXactCommit;
+            cxt->dbentry->n_xact_rollback += pgStatXactRollback;
+            cxt->dbentry->n_block_read_time += pgStatBlockReadTime;
+            cxt->dbentry->n_block_write_time += pgStatBlockWriteTime;
+            pgStatXactCommit = 0;
+            pgStatXactRollback = 0;
+            pgStatBlockReadTime = 0;
+            pgStatBlockWriteTime = 0;
+        }
+
+        cxt->tabhash =
+            dshash_attach(area, &dsh_tblparams, cxt->dbentry->tables, 0);
+    }
+
+    /*
+     * If we have access to the required data, try to update table stats first.
+     * Update database stats only if the first step succeeded.
+     */
+    if (pgstat_update_tabentry(cxt->tabhash, entry, nowait))
+    {
+        pgstat_update_dbentry(cxt->dbentry, entry);
+        updated = true;
+    }
+
+    return updated;
+}
+
+/*
+ * pgstat_merge_tabentry: subroutine for pgstat_update_stat
+ *
+ * Merge srcstat into deststat. Existing counts in deststat are overwritten
+ * if init is true.
+ */
+static void
+pgstat_merge_tabentry(PgStat_TableStatus *deststat,
+                      PgStat_TableStatus *srcstat,
+                      bool init)
+{
+    Assert (deststat != srcstat);
+
+    if (init)
+        deststat->t_counts = srcstat->t_counts;
+    else
+    {
+        PgStat_TableCounts *dest = &deststat->t_counts;
+        PgStat_TableCounts *src = &srcstat->t_counts;
+
+        dest->t_numscans += src->t_numscans;
+        dest->t_tuples_returned += src->t_tuples_returned;
+        dest->t_tuples_fetched += src->t_tuples_fetched;
+        dest->t_tuples_inserted += src->t_tuples_inserted;
+        dest->t_tuples_updated += src->t_tuples_updated;
+        dest->t_tuples_deleted += src->t_tuples_deleted;
+        dest->t_tuples_hot_updated += src->t_tuples_hot_updated;
+        dest->t_truncated |= src->t_truncated;
+
+        /* If table was truncated, first reset the live/dead counters */
+        if (src->t_truncated)
+        {
+            dest->t_delta_live_tuples = 0;
+            dest->t_delta_dead_tuples = 0;
+        }
+        dest->t_delta_live_tuples += src->t_delta_live_tuples;
+        dest->t_delta_dead_tuples += src->t_delta_dead_tuples;
+        dest->t_changed_tuples += src->t_changed_tuples;
+        dest->t_blocks_fetched += src->t_blocks_fetched;
+        dest->t_blocks_hit += src->t_blocks_hit;
+    }
+}
+
+/*
+ * pgstat_update_funcstats: subroutine for pgstat_update_stat
+ *
+ *  Applies function usage stats to the shared function stats hash.
+ */
+static void
+pgstat_update_funcstats(bool force, PgStat_StatDBEntry *dbentry)
+{
+    /* we assume this inits to all zeroes: */
+    static const PgStat_FunctionCounts all_zeroes;
+    PgStat_TableLookupState status = 0;
+    dshash_table *funchash;
+    bool          nowait = !force;
+    bool          release_db = false;
+    int              table_op = PGSTAT_FETCH_EXCLUSIVE;
+
+    if (pgStatFunctions == NULL && pgStatPendingFunctions == NULL)
+        return;
+
+    if (nowait)
+        table_op |= PGSTAT_FETCH_NOWAIT;
+
+    /* find the shared function stats table */
+    if (!dbentry)
+    {
+        dbentry = pgstat_get_db_entry(MyDatabaseId, table_op, &status);
+        release_db = true;
+    }
+
+    /* no entry available or lock failure, return. */
+    if (dbentry == NULL || status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    /*
+     *  Create the hash if it does not exist yet. We don't keep shared memory
+     *  allocated while function stats are not being collected.
+     */
+    if (dbentry->functions == DSM_HANDLE_INVALID)
+    {
+        funchash = dshash_create(area, &dsh_funcparams, 0);
+        dbentry->functions = dshash_get_hash_table_handle(funchash);
+    }
+    else
+        funchash = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+
+    /*
+     * First, we empty the transaction stats. Just move the numbers to the
+     * pending stats if a pending entry exists. Otherwise try to update the
+     * shared stats directly, but create a new pending entry on lock failure.
+     */
+    if (pgStatFunctions)
+    {
+        HASH_SEQ_STATUS fstat;
+        PgStat_BackendFunctionEntry *bestat;
+
+        hash_seq_init(&fstat, pgStatFunctions);
+        while ((bestat = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+        {
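+            bool found;
+            bool init = false;
+            bool locked = false;
+            PgStat_StatFuncEntry *funcent = NULL;
+
+            /* Skip it if no counts accumulated since last time */
+            if (memcmp(&bestat->f_counts, &all_zeroes,
+                       sizeof(PgStat_FunctionCounts)) == 0)
+                continue;
+
+            /* find pending entry */
+            if (pgStatPendingFunctions)
+                funcent = (PgStat_StatFuncEntry *)
+                    hash_search(pgStatPendingFunctions,
+                                (void *) &(bestat->f_id), HASH_FIND, NULL);
+
+            if (!funcent)
+            {
+                /* pending entry not found, find shared stats entry */
+                funcent = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert_extended(funchash,
+                                                   (void *) &(bestat->f_id),
+                                                   &found, nowait);
+                if (funcent)
+                {
+                    init = !found;
+                    locked = true;
+                }
+                else
+                {
+                    /*
+                     * Could not lock the shared entry. Create a new pending
+                     * entry, creating the pending hash first if needed.
+                     */
+                    if (pgStatPendingFunctions == NULL)
+                    {
+                        HASHCTL        ctl;
+
+                        memset(&ctl, 0, sizeof(ctl));
+                        ctl.keysize = sizeof(Oid);
+                        ctl.entrysize = sizeof(PgStat_StatFuncEntry);
+                        pgStatPendingFunctions =
+                            hash_create("pgstat pending function stats hash",
+                                        PGSTAT_FUNCTION_HASH_SIZE,
+                                        &ctl,
+                                        HASH_ELEM | HASH_BLOBS);
+                    }
+
+                    funcent = (PgStat_StatFuncEntry *)
+                        hash_search(pgStatPendingFunctions,
+                                    (void *) &(bestat->f_id), HASH_ENTER, NULL);
+                    init = true;
+                }
+            }
+            Assert (funcent != NULL);
+
+            pgstat_merge_backendstats_to_funcentry(funcent, bestat, init);
+
+            /* release the lock taken on the shared stats entry, if any */
+            if (locked)
+                dshash_release_lock(funchash, funcent);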
+
+            /* reset used counts */
+            MemSet(&bestat->f_counts, 0, sizeof(PgStat_FunctionCounts));
+        }
+    }
+
+    /* Second, apply pending stats numbers to shared table */
+    if (pgStatPendingFunctions)
+    {
+        HASH_SEQ_STATUS fstat;
+        PgStat_StatFuncEntry *pendent;
+
+        hash_seq_init(&fstat, pgStatPendingFunctions);
+        while ((pendent = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+        {
+            PgStat_StatFuncEntry *funcent;
+            bool found;
+
+            funcent = (PgStat_StatFuncEntry *)
+                dshash_find_or_insert_extended(funchash,
+                                               (void *) &(pendent->functionid),
+                                               &found, nowait);
+            if (funcent)
+            {
+                /* merge the pending entry into the shared entry */
+                pgstat_merge_funcentry(funcent, pendent, !found);
+                dshash_release_lock(funchash, funcent);
+                hash_search(pgStatPendingFunctions,
+                            (void *) &(pendent->functionid), HASH_REMOVE, NULL);
+            }
+        }
+
+        /* destroy the hash if no entry remains */
+        if (hash_get_num_entries(pgStatPendingFunctions) == 0)
+        {
+            hash_destroy(pgStatPendingFunctions);
+            pgStatPendingFunctions = NULL;
+        }
+    }
+
+    /* detach the function stats hash attached or created above */
+    dshash_detach(funchash);
+
+    if (release_db)
+        dshash_release_lock(db_stats, dbentry);
+}
+
+/*
+ * pgstat_merge_backendstats_to_funcentry: subroutine for
+ *                                             pgstat_update_funcstats
+ *
+ * Merges BackendFunctionEntry into StatFuncEntry
+ */
+static inline void
+pgstat_merge_backendstats_to_funcentry(PgStat_StatFuncEntry *dest,
+                                       PgStat_BackendFunctionEntry *src,
+                                       bool init)
+{
+    if (init)
+    {
+        /*
+         * If it's a new function entry, initialize counters to the values
+         * we just got.
+         */
+        dest->f_numcalls = src->f_counts.f_numcalls;
+        dest->f_total_time =
+            INSTR_TIME_GET_MICROSEC(src->f_counts.f_total_time);
+        dest->f_self_time =
+            INSTR_TIME_GET_MICROSEC(src->f_counts.f_self_time);
+    }
+    else
+    {
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        dest->f_numcalls += src->f_counts.f_numcalls;
+        dest->f_total_time +=
+            INSTR_TIME_GET_MICROSEC(src->f_counts.f_total_time);
+        dest->f_self_time +=
+            INSTR_TIME_GET_MICROSEC(src->f_counts.f_self_time);
+    }
+}
+
+/*
+ * pgstat_merge_funcentry: subroutine for pgstat_update_funcstats
+ *
+ * Merges two StatFuncEntrys
+ */
+static inline void
+pgstat_merge_funcentry(PgStat_StatFuncEntry *dest, PgStat_StatFuncEntry *src,
+                       bool init)
+{
+    if (init)
+    {
+        /*
+         * If it's a new function entry, initialize counters to the values
+         * we just got.
+         */
+        dest->f_numcalls = src->f_numcalls;
+        dest->f_total_time = src->f_total_time;
+        dest->f_self_time = src->f_self_time;
+    }
+    else
+    {
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        dest->f_numcalls += src->f_numcalls;
+        dest->f_total_time += src->f_total_time;
+        dest->f_self_time += src->f_self_time;
+    }
+}
+
+
+
+/* ----------
+ * pgstat_vacuum_stat() -
+ *
+ *    Remove stats entries for objects that no longer exist.
+ * ----------
+ */
+void
+pgstat_vacuum_stat(void)
+{
+    HTAB       *oidtab;
+    dshash_table *dshtable;
+    dshash_seq_status dshstat;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    PgStat_StatFuncEntry *funcentry;
+
+    /* we don't collect statistics under standalone mode */
+    if (!IsUnderPostmaster)
+        return;
+
+    /* If not done for this transaction, take a snapshot of stats */
+    if (!backend_snapshot_global_stats())
+        return;
+
+    /*
+     * Read pg_database and make a list of OIDs of all existing databases
+     */
+    oidtab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+
+    /*
+     * Search the database hash table for dead databases and drop them
+     * from the hash.
+     */
+
+    dshash_seq_init(&dshstat, db_stats, false, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
+    {
+        Oid            dbid = dbentry->databaseid;
+
+        CHECK_FOR_INTERRUPTS();
+
+        /* the DB entry for shared tables (with InvalidOid) is never dropped */
+        if (OidIsValid(dbid) &&
+            hash_search(oidtab, (void *) &dbid, HASH_FIND, NULL) == NULL)
+            pgstat_drop_database(dbid);
+    }
+
+    /* Clean up */
+    hash_destroy(oidtab);
+
+    /*
+     * Lookup our own database entry; if not found, nothing more to do.
+     */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, NULL);
+    if (!dbentry)
+        return;
+
+    /*
+     * Similarly to above, make a list of all known relations in this DB.
+     */
+    oidtab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
+
+    /*
+     * Check for all tables listed in stats hashtable if they still exist.
+     * Stats cache is useless here so directly search the shared hash.
+     */
+    dshtable = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&dshstat, dshtable, false, true);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&dshstat)) != NULL)
+    {
+        Oid            tabid = tabentry->tableid;
+
+        CHECK_FOR_INTERRUPTS();
+
+        if (hash_search(oidtab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+            continue;
+
+        /* Not there, so purge this table */
+        dshash_delete_entry(dshtable, tabentry);
+    }
+    dshash_detach(dshtable);
+
+    /* Clean up */
+    hash_destroy(oidtab);
+
+    /*
+     * Now repeat the above steps for functions.  However, we needn't bother
+     * in the common case where no function stats are being collected.
+     */
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        dshtable = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        oidtab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
+
+        dshash_seq_init(&dshstat, dshtable, false, true);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&dshstat)) != NULL)
+        {
+            Oid            funcid = funcentry->functionid;
+
+            CHECK_FOR_INTERRUPTS();
+
+            if (hash_search(oidtab, (void *) &funcid, HASH_FIND, NULL) != NULL)
+                continue;
+
+            /* Not there, so remove this function */
+            dshash_delete_entry(dshtable, funcentry);
+        }
+
+        hash_destroy(oidtab);
+
+        dshash_detach(dshtable);
+    }
+    dshash_release_lock(db_stats, dbentry);
+}
+
+
+/*
+ * pgstat_collect_oids() -
+ *
+ *    Collect the OIDs of all objects listed in the specified system catalog
+ *    into a temporary hash table.  Caller should hash_destroy the result after
+ *    use.  (However, we make the table in CurrentMemoryContext so that it will
+ *    be freed properly in event of an error.)
+ */
+static HTAB *
+pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
+{
+    HTAB       *htab;
+    HASHCTL        hash_ctl;
+    Relation    rel;
+    HeapScanDesc scan;
+    HeapTuple    tup;
+    Snapshot    snapshot;
+
+    memset(&hash_ctl, 0, sizeof(hash_ctl));
+    hash_ctl.keysize = sizeof(Oid);
+    hash_ctl.entrysize = sizeof(Oid);
+    hash_ctl.hcxt = CurrentMemoryContext;
+    htab = hash_create("Temporary table of OIDs",
+                       PGSTAT_TAB_HASH_SIZE,
+                       &hash_ctl,
+                       HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+    rel = heap_open(catalogid, AccessShareLock);
+    snapshot = RegisterSnapshot(GetLatestSnapshot());
+    scan = heap_beginscan(rel, snapshot, 0, NULL);
+    while ((tup = heap_getnext(scan, ForwardScanDirection)) != NULL)
+    {
+        Oid            thisoid;
+        bool        isnull;
+
+        thisoid = heap_getattr(tup, anum_oid, RelationGetDescr(rel), &isnull);
+        Assert(!isnull);
+
+        CHECK_FOR_INTERRUPTS();
+
+        (void) hash_search(htab, (void *) &thisoid, HASH_ENTER, NULL);
+    }
+    heap_endscan(scan);
+    UnregisterSnapshot(snapshot);
+    heap_close(rel, AccessShareLock);
+
+    return htab;
+}
+
+
+/* ----------
+ * pgstat_drop_database() -
+ *
+ *    Remove entry for the database that we just dropped.
+ *
+ *    If some stats update happens after this, this entry will be re-created, but
+ *    we will still clean the dead DB eventually via future invocations of
+ *    pgstat_vacuum_stat().
+ * ----------
+ */
+void
+pgstat_drop_database(Oid databaseid)
+{
+    PgStat_StatDBEntry *dbentry;
+
+    Assert (OidIsValid(databaseid));
+    Assert(db_stats);
+
+    /*
+     * Lookup the database in the hashtable with exclusive lock.
+     */
+    dbentry = pgstat_get_db_entry(databaseid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    /*
+     * If found, remove it (along with its table and function stats hashes).
+     */
+    if (dbentry)
+    {
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+        }
+
+        dshash_delete_entry(db_stats, (void *)dbentry);
+    }
+}
+
+
+/* ----------
+ * pgstat_reset_counters() -
+ *
+ *    Reset counters for our database.
+ *
+ *    Permission checking for this function is managed through the normal
+ *    GRANT system.
+ * ----------
+ */
+void
+pgstat_reset_counters(void)
+{
+    PgStat_StatDBEntry           *dbentry;
+    PgStat_TableLookupState status;
+
+    Assert(db_stats);
+
+    /*
+     * Lookup the database in the hashtable.  Nothing to do if not there.
+     */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, &status);
+
+    if (!dbentry)
+        return;
+
+    /*
+     * We simply throw away all the database's table entries by recreating a
+     * new hash table for them.
+     */
+    if (dbentry->tables != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_destroy(t);
+        dbentry->tables = DSM_HANDLE_INVALID;
+    }
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_destroy(t);
+        dbentry->functions = DSM_HANDLE_INVALID;
+    }
+
+    /*
+     * Reset database-level stats, too.  This creates empty hash tables for
+     * tables and functions.
+     */
+    reset_dbentry_counters(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/* ----------
+ * pgstat_reset_shared_counters() -
+ *
+ *    Reset cluster-wide shared counters.
+ *
+ *    Permission checking for this function is managed through the normal
+ *    GRANT system.
+ * ----------
+ */
+void
+pgstat_reset_shared_counters(const char *target)
+{
+    Assert(db_stats);
+
+    /* Reset the archiver statistics for the cluster. */
+    if (strcmp(target, "archiver") == 0)
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+        memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
+    }
+    else if (strcmp(target, "bgwriter") == 0)
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+        /* Reset the global background writer statistics for the cluster. */
+        memset(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    }
+    else
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                 errmsg("unrecognized reset target: \"%s\"", target),
+                 errhint("Target must be \"archiver\" or \"bgwriter\".")));
+    
+    LWLockRelease(StatsLock);
+}
+
+/* ----------
+ * pgstat_reset_single_counter() -
+ *
+ *    Reset a single counter.
+ *
+ *    Permission checking for this function is managed through the normal
+ *    GRANT system.
+ * ----------
+ */
+void
+pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
+{
+    PgStat_StatDBEntry *dbentry;
+    
+
+    Assert(db_stats);
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    if (!dbentry)
+        return;
+
+    /* Set the reset timestamp for the whole database */
+    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
+
+    /* Remove object if it exists, ignore it if not */
+    if (type == RESET_TABLE)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_delete_key(t, (void *) &objoid);
+        dshash_detach(t);
+    }
+
+    if (type == RESET_FUNCTION && dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_delete_key(t, (void *) &objoid);
+        dshash_detach(t);
+    }
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/*
+ * pgstat_reset_all_counters: subroutine for pgstat_reset_all
+ *
+ * clear all counters on shared memory
+ */
+static void
+pgstat_reset_all_counters(void)
+{
+    dshash_seq_status dshstat;
+    PgStat_StatDBEntry           *dbentry;
+
+    Assert (db_stats);
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    dshash_seq_init(&dshstat, db_stats, false, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
+    {
+        /*
+         * We simply throw away all the database's table hashes
+         */
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *t =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(t);
+            dbentry->tables = DSM_HANDLE_INVALID;
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *t =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(t);
+            dbentry->functions = DSM_HANDLE_INVALID;
+        }
+
+        /*
+         * Reset database-level stats, too.  This creates empty hash tables
+         * for tables and functions.
+         */
+        reset_dbentry_counters(dbentry);
+        dshash_release_lock(db_stats, dbentry);
+
+    }
+
+    /*
+     * Reset global counters
+     */
+    memset(shared_globalStats, 0, sizeof(*shared_globalStats));
+    memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+    shared_globalStats->stat_reset_timestamp =
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
+    
+    LWLockRelease(StatsLock);
+}
+
+/* ----------
+ * pgstat_report_autovac() -
+ *
+ *    Called from autovacuum.c to report startup of an autovacuum process.
+ *    We are called before InitPostgres is done, so can't rely on MyDatabaseId;
+ *    the db OID must be passed in, instead.
+ * ----------
+ */
+void
+pgstat_report_autovac(Oid dboid)
+{
+    PgStat_StatDBEntry *dbentry;
+
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    /*
+     * Store the last autovacuum time in the database's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    dbentry->last_autovac_time = GetCurrentTimestamp();
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+
+/* ---------
+ * pgstat_report_vacuum() -
+ *
+ *    Report about the table we just vacuumed.
+ * ---------
+ */
+void
+pgstat_report_vacuum(Oid tableoid, bool shared,
+                     PgStat_Counter livetuples, PgStat_Counter deadtuples)
+{
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
+
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    dboid = shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, tableoid, true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->vacuum_count++;
+    }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/* --------
+ * pgstat_report_analyze() -
+ *
+ *    Report about the table we just analyzed.
+ *
+ * Caller must provide new live- and dead-tuples estimates, as well as a
+ * flag indicating whether to reset the changes_since_analyze counter.
+ * --------
+ */
+void
+pgstat_report_analyze(Relation rel,
+                      PgStat_Counter livetuples, PgStat_Counter deadtuples,
+                      bool resetcounter)
+{
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
+
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    /*
+     * Unlike VACUUM, ANALYZE might be running inside a transaction that has
+     * already inserted and/or deleted rows in the target table. ANALYZE will
+     * have counted such rows as live or dead respectively. Because we will
+     * report our counts of such rows at transaction end, we should subtract
+     * off these counts from what we send to the collector now, else they'll
+     * be double-counted after commit.  (This approach also ensures that the
+     * collector ends up with the right numbers if we abort instead of
+     * committing.)
+     */
+    if (rel->pgstat_info != NULL)
+    {
+        PgStat_TableXactStatus *trans;
+
+        for (trans = rel->pgstat_info->trans; trans; trans = trans->upper)
+        {
+            livetuples -= trans->tuples_inserted - trans->tuples_deleted;
+            deadtuples -= trans->tuples_updated + trans->tuples_deleted;
+        }
+        /* count stuff inserted by already-aborted subxacts, too */
+        deadtuples -= rel->pgstat_info->t_counts.t_delta_dead_tuples;
+        /* Since ANALYZE's counts are estimates, we could have underflowed */
+        livetuples = Max(livetuples, 0);
+        deadtuples = Max(deadtuples, 0);
+    }
+
+    dboid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, RelationGetRelid(rel), true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/* --------
+ * pgstat_report_recovery_conflict() -
+ *
+ *    Report a Hot Standby recovery conflict.
+ * --------
+ */
+static int pending_conflict_tablespace = 0;
+static int pending_conflict_lock = 0;
+static int pending_conflict_snapshot = 0;
+static int pending_conflict_bufferpin = 0;
+static int pending_conflict_startup_deadlock = 0;
+
+void
+pgstat_report_recovery_conflict(int reason)
+{
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupState status;
+
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    pgStatPendingRecoveryConflicts = true;
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            pending_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            pending_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            pending_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            pending_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            pending_conflict_startup_deadlock++;
+            break;
+    }
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    pgstat_cleanup_recovery_conflict(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/*
+ * clean up function for pending recovery conflicts
+ */
+static void
+pgstat_cleanup_recovery_conflict(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_conflict_tablespace    += pending_conflict_tablespace;
+    dbentry->n_conflict_lock         += pending_conflict_lock;
+    dbentry->n_conflict_snapshot    += pending_conflict_snapshot;
+    dbentry->n_conflict_bufferpin    += pending_conflict_bufferpin;
+    dbentry->n_conflict_startup_deadlock += pending_conflict_startup_deadlock;
+
+    pending_conflict_tablespace = 0;
+    pending_conflict_lock = 0;
+    pending_conflict_snapshot = 0;
+    pending_conflict_bufferpin = 0;
+    pending_conflict_startup_deadlock = 0;
+
+    pgStatPendingRecoveryConflicts = false;
+}
+
+/* --------
+ * pgstat_report_deadlock() -
+ *
+ *    Report a deadlock detected.
+ * --------
+ */
+static int pending_deadlocks = 0;
+
+void
+pgstat_report_deadlock(void)
+{
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupState status;
+
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    pending_deadlocks++;
+    pgStatPendingDeadlocks = true;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    pgstat_cleanup_deadlock(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/*
+ * clean up function for pending deadlocks
+ */
+static void
+pgstat_cleanup_deadlock(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_deadlocks += pending_deadlocks;
+    pending_deadlocks = 0;
+    pgStatPendingDeadlocks = false;
+}
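
The report functions above share one pattern: bump a backend-local pending counter, try to take the shared entry's lock without waiting, and fold the pending value into shared memory only if the lock was obtained. The following is a minimal standalone sketch of that pattern, not the patch's code. `SharedEntry`, its `locked` flag and `report_deadlock_sketch` are hypothetical stand-ins for the dshash entry, `PGSTAT_FETCH_NOWAIT` and `pgstat_report_deadlock`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a shared per-database stats entry. */
typedef struct SharedEntry
{
    bool locked;        /* true while another process holds the entry */
    long n_deadlocks;   /* counter living in shared memory */
} SharedEntry;

/* Backend-local count of events not yet written to shared memory. */
static long pending_deadlocks = 0;

/*
 * Record one event.  Returns true if the pending count was flushed into
 * the shared entry, false if the lock was unavailable and the count stays
 * pending until a later call succeeds.
 */
static bool
report_deadlock_sketch(SharedEntry *entry)
{
    pending_deadlocks++;

    if (entry->locked)
        return false;           /* no-wait lock attempt failed */

    entry->n_deadlocks += pending_deadlocks;
    pending_deadlocks = 0;
    return true;
}
```

With the entry locked, two calls leave `pending_deadlocks` at 2 and the shared counter untouched. The next call made after the lock is free publishes all three events at once.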
+
+/* --------
+ * pgstat_report_tempfile() -
+ *
+ *    Report a temporary file.
+ * --------
+ */
+static size_t pending_filesize = 0;
+static size_t pending_files = 0;
+
+void
+pgstat_report_tempfile(size_t filesize)
+{
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupState status;
+
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    if (filesize > 0) /* Isn't there a case where filesize is really 0? */
+    {
+        pgStatPendingTempfiles = true;
+        pending_filesize += filesize; /* needs an overflow check */
+        pending_files++;
+    }
+
+    if (!pgStatPendingTempfiles)
+        return;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    pgstat_cleanup_tempfile(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/*
+ * clean up function for temporary files
+ */
+static void
+pgstat_cleanup_tempfile(PgStat_StatDBEntry *dbentry)
+{
+
+    dbentry->n_temp_bytes += pending_filesize;
+    dbentry->n_temp_files += pending_files;
+    pending_filesize = 0;
+    pending_files = 0;
+    pgStatPendingTempfiles = false;
+
+}
+
+/*
+ * Initialize function call usage data.
+ * Called by the executor before invoking a function.
+ */
+void
+pgstat_init_function_usage(FunctionCallInfo fcinfo,
+                           PgStat_FunctionCallUsage *fcu)
+{
+    PgStat_BackendFunctionEntry *htabent;
+    bool        found;
+
+    if (pgstat_track_functions <= fcinfo->flinfo->fn_stats)
+    {
+        /* stats not wanted */
+        fcu->fs = NULL;
+        return;
+    }
+
+    if (!pgStatFunctions)
+    {
+        /* First time through - initialize function stat table */
+        HASHCTL        hash_ctl;
+
+        memset(&hash_ctl, 0, sizeof(hash_ctl));
+        hash_ctl.keysize = sizeof(Oid);
+        hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
+        pgStatFunctions = hash_create("Function stat entries",
+                                      PGSTAT_FUNCTION_HASH_SIZE,
+                                      &hash_ctl,
+                                      HASH_ELEM | HASH_BLOBS);
+    }
+
+    /* Get the stats entry for this function, create if necessary */
+    htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
+                          HASH_ENTER, &found);
+    if (!found)
+        MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
+
+    fcu->fs = &htabent->f_counts;
+
+    /* save stats for this function, later used to compensate for recursion */
+    fcu->save_f_total_time = htabent->f_counts.f_total_time;
+
+    /* save current backend-wide total time */
+    fcu->save_total = total_func_time;
+
+    /* get clock time as of function start */
+    INSTR_TIME_SET_CURRENT(fcu->f_start);
+}
+
+/*
+ * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
+ *        for specified function
+ *
+ * If no entry, return NULL, don't create a new one
+ */
+PgStat_BackendFunctionEntry *
+find_funcstat_entry(Oid func_id)
+{
+    if (pgStatFunctions == NULL)
+        return NULL;
+
+    return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
+                                                       (void *) &func_id,
+                                                       HASH_FIND, NULL);
+}
+
+/*
+ * Calculate function call usage and update stat counters.
+ * Called by the executor after invoking a function.
+ *
+ * In the case of a set-returning function that runs in value-per-call mode,
+ * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
+ * calls for what the user considers a single call of the function.  The
+ * finalize flag should be TRUE on the last call.
+ */
+void
+pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
+{
+    PgStat_FunctionCounts *fs = fcu->fs;
+    instr_time    f_total;
+    instr_time    f_others;
+    instr_time    f_self;
+
+    /* stats not wanted? */
+    if (fs == NULL)
+        return;
+
+    /* total elapsed time in this function call */
+    INSTR_TIME_SET_CURRENT(f_total);
+    INSTR_TIME_SUBTRACT(f_total, fcu->f_start);
+
+    /* self usage: elapsed minus anything already charged to other calls */
+    f_others = total_func_time;
+    INSTR_TIME_SUBTRACT(f_others, fcu->save_total);
+    f_self = f_total;
+    INSTR_TIME_SUBTRACT(f_self, f_others);
+
+    /* update backend-wide total time */
+    INSTR_TIME_ADD(total_func_time, f_self);
+
+    /*
+     * Compute the new f_total_time as the total elapsed time added to the
+     * pre-call value of f_total_time.  This is necessary to avoid
+     * double-counting any time taken by recursive calls of myself.  (We do
+     * not need any similar kluge for self time, since that already excludes
+     * any recursive calls.)
+     */
+    INSTR_TIME_ADD(f_total, fcu->save_f_total_time);
+
+    /* update counters in function stats table */
+    if (finalize)
+        fs->f_numcalls++;
+    fs->f_total_time = f_total;
+    INSTR_TIME_ADD(fs->f_self_time, f_self);
+}
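
The self-time and total-time arithmetic in pgstat_init_function_usage and pgstat_end_function_usage can be checked with a small standalone sketch. It assumes plain `long` tick counts in place of `instr_time` and passes the clock value in explicitly. The struct and function names are hypothetical simplifications of the ones in the patch.

```c
#include <assert.h>

/* Simplified per-function counters (times in arbitrary ticks). */
typedef struct FuncCounts
{
    long numcalls;
    long total_time;
    long self_time;
} FuncCounts;

/* Per-call bookkeeping saved at function entry. */
typedef struct CallUsage
{
    FuncCounts *fs;
    long save_f_total_time; /* fs->total_time at entry */
    long save_total;        /* backend-wide total at entry */
    long f_start;           /* clock at entry */
} CallUsage;

/* Self time accumulated by all tracked functions in this backend. */
static long total_func_time = 0;

static void
init_usage(FuncCounts *fs, CallUsage *fcu, long now)
{
    fcu->fs = fs;
    fcu->save_f_total_time = fs->total_time;
    fcu->save_total = total_func_time;
    fcu->f_start = now;
}

static void
end_usage(CallUsage *fcu, long now)
{
    long f_total = now - fcu->f_start;                    /* elapsed in this call */
    long f_others = total_func_time - fcu->save_total;    /* charged to callees */
    long f_self = f_total - f_others;

    total_func_time += f_self;

    /* Pre-call total plus elapsed: recursion is not double-counted. */
    f_total += fcu->save_f_total_time;

    fcu->fs->numcalls++;
    fcu->fs->total_time = f_total;
    fcu->fs->self_time += f_self;
}
```

For a function entered at tick 0 that calls itself from tick 10 to tick 30 and returns at tick 50, both total and self time come out as 50 rather than 70, with two calls counted.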
+
+
+/* ----------
+ * pgstat_initstats() -
+ *
+ *    Initialize a relcache entry to count access statistics.
+ *    Called whenever a relation is opened.
+ *
+ *    We assume that a relcache entry's pgstat_info field is zeroed by
+ *    relcache.c when the relcache entry is made; thereafter it is long-lived
+ *    data.  We can avoid repeated searches of the TabStatus arrays when the
+ *    same relation is touched repeatedly within a transaction.
+ * ----------
+ */
+void
+pgstat_initstats(Relation rel)
+{
+    Oid            rel_id = rel->rd_id;
+    char        relkind = rel->rd_rel->relkind;
+
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+    {
+        /* We're not counting at all */
+        rel->pgstat_info = NULL;
+        return;
+    }
+
+    /* We only count stats for things that have storage */
+    if (!(relkind == RELKIND_RELATION ||
+          relkind == RELKIND_MATVIEW ||
+          relkind == RELKIND_INDEX ||
+          relkind == RELKIND_TOASTVALUE ||
+          relkind == RELKIND_SEQUENCE))
+    {
+        rel->pgstat_info = NULL;
+        return;
+    }
+
+    /*
+     * If we already set up this relation in the current transaction, nothing
+     * to do.
+     */
+    if (rel->pgstat_info != NULL &&
+        rel->pgstat_info->t_id == rel_id)
+        return;
+
+    /* Else find or make the PgStat_TableStatus entry, and update link */
+    rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+}
+
+/*
+ * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
+ */
+static PgStat_TableStatus *
+get_tabstat_entry(Oid rel_id, bool isshared)
+{
+    TabStatHashEntry *hash_entry;
+    PgStat_TableStatus *entry;
+    TabStatusArray *tsa;
+    bool        found;
+
+    /*
+     * Create hash table if we don't have it already.
+     */
+    if (pgStatTabHash == NULL)
+    {
+        HASHCTL        ctl;
+
+        memset(&ctl, 0, sizeof(ctl));
+        ctl.keysize = sizeof(Oid);
+        ctl.entrysize = sizeof(TabStatHashEntry);
+
+        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
+                                    TABSTAT_QUANTUM,
+                                    &ctl,
+                                    HASH_ELEM | HASH_BLOBS);
+    }
+
+    /*
+     * Find an entry or create a new one.
+     */
+    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
+    if (!found)
+    {
+        /* initialize new entry with null pointer */
+        hash_entry->tsa_entry = NULL;
+    }
+
+    /*
+     * If entry is already valid, we're done.
+     */
+    if (hash_entry->tsa_entry)
+        return hash_entry->tsa_entry;
+
+    /*
+     * Locate the first pgStatTabList entry with free space, making a new list
+     * entry if needed.  Note that we could get an OOM failure here, but if so
+     * we have left the hashtable and the list in a consistent state.
+     */
+    if (pgStatTabList == NULL)
+    {
+        /* Set up first pgStatTabList entry */
+        pgStatTabList = (TabStatusArray *)
+            MemoryContextAllocZero(TopMemoryContext,
+                                   sizeof(TabStatusArray));
+    }
+
+    tsa = pgStatTabList;
+    while (tsa->tsa_used >= TABSTAT_QUANTUM)
+    {
+        if (tsa->tsa_next == NULL)
+            tsa->tsa_next = (TabStatusArray *)
+                MemoryContextAllocZero(TopMemoryContext,
+                                       sizeof(TabStatusArray));
+        tsa = tsa->tsa_next;
+    }
+
+    /*
+     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
+     * the entry was already zeroed, either at creation or after last use.
+     */
+    entry = &tsa->tsa_entries[tsa->tsa_used++];
+    entry->t_id = rel_id;
+    entry->t_shared = isshared;
+
+    /*
+     * Now we can fill the entry in pgStatTabHash.
+     */
+    hash_entry->tsa_entry = entry;
+
+    return entry;
+}
+
+/*
+ * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
+ *
+ * If no entry, return NULL, don't create a new one
+ *
+ * Note: if we got an error in the most recent execution of pgstat_report_stat,
+ * it's possible that an entry exists but there's no hashtable entry for it.
+ * That's okay, we'll treat this case as "doesn't exist".
+ */
+PgStat_TableStatus *
+find_tabstat_entry(Oid rel_id)
+{
+    TabStatHashEntry *hash_entry;
+
+    /* If hashtable doesn't exist, there are no entries at all */
+    if (!pgStatTabHash)
+        return NULL;
+
+    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
+    if (!hash_entry)
+        return NULL;
+
+    /* Note that this step could also return NULL, but that's correct */
+    return hash_entry->tsa_entry;
+}
+
+/*
+ * get_tabstat_stack_level - add a new (sub)transaction stack entry if needed
+ */
+static PgStat_SubXactStatus *
+get_tabstat_stack_level(int nest_level)
+{
+    PgStat_SubXactStatus *xact_state;
+
+    xact_state = pgStatXactStack;
+    if (xact_state == NULL || xact_state->nest_level != nest_level)
+    {
+        xact_state = (PgStat_SubXactStatus *)
+            MemoryContextAlloc(TopTransactionContext,
+                               sizeof(PgStat_SubXactStatus));
+        xact_state->nest_level = nest_level;
+        xact_state->prev = pgStatXactStack;
+        xact_state->first = NULL;
+        pgStatXactStack = xact_state;
+    }
+    return xact_state;
+}
+
+/*
+ * add_tabstat_xact_level - add a new (sub)transaction state record
+ */
+static void
+add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level)
+{
+    PgStat_SubXactStatus *xact_state;
+    PgStat_TableXactStatus *trans;
+
+    /*
+     * If this is the first rel to be modified at the current nest level, we
+     * first have to push a transaction stack entry.
+     */
+    xact_state = get_tabstat_stack_level(nest_level);
+
+    /* Now make a per-table stack entry */
+    trans = (PgStat_TableXactStatus *)
+        MemoryContextAllocZero(TopTransactionContext,
+                               sizeof(PgStat_TableXactStatus));
+    trans->nest_level = nest_level;
+    trans->upper = pgstat_info->trans;
+    trans->parent = pgstat_info;
+    trans->next = xact_state->first;
+    xact_state->first = trans;
+    pgstat_info->trans = trans;
+}
+
+/*
+ * pgstat_count_heap_insert - count a tuple insertion of n tuples
+ */
+void
+pgstat_count_heap_insert(Relation rel, PgStat_Counter n)
+{
+    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
+
+    if (pgstat_info != NULL)
+    {
+        /* We have to log the effect at the proper transactional level */
+        int            nest_level = GetCurrentTransactionNestLevel();
+
+        if (pgstat_info->trans == NULL ||
+            pgstat_info->trans->nest_level != nest_level)
+            add_tabstat_xact_level(pgstat_info, nest_level);
+
+        pgstat_info->trans->tuples_inserted += n;
+    }
+}
+
+/*
+ * pgstat_count_heap_update - count a tuple update
+ */
+void
+pgstat_count_heap_update(Relation rel, bool hot)
+{
+    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
+
+    if (pgstat_info != NULL)
+    {
+        /* We have to log the effect at the proper transactional level */
+        int            nest_level = GetCurrentTransactionNestLevel();
+
+        if (pgstat_info->trans == NULL ||
+            pgstat_info->trans->nest_level != nest_level)
+            add_tabstat_xact_level(pgstat_info, nest_level);
+
+        pgstat_info->trans->tuples_updated++;
+
+        /* t_tuples_hot_updated is nontransactional, so just advance it */
+        if (hot)
+            pgstat_info->t_counts.t_tuples_hot_updated++;
+    }
+}
+
+/*
+ * pgstat_count_heap_delete - count a tuple deletion
+ */
+void
+pgstat_count_heap_delete(Relation rel)
+{
+    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
+
+    if (pgstat_info != NULL)
+    {
+        /* We have to log the effect at the proper transactional level */
+        int            nest_level = GetCurrentTransactionNestLevel();
+
+        if (pgstat_info->trans == NULL ||
+            pgstat_info->trans->nest_level != nest_level)
+            add_tabstat_xact_level(pgstat_info, nest_level);
+
+        pgstat_info->trans->tuples_deleted++;
+    }
+}
+
+/*
+ * pgstat_truncate_save_counters
+ *
+ * Whenever a table is truncated, we save its i/u/d counters so that they can
+ * be cleared, and if the (sub)xact that executed the truncate later aborts,
+ * the counters can be restored to the saved (pre-truncate) values.  Note we do
+ * this on the first truncate in any particular subxact level only.
+ */
+static void
+pgstat_truncate_save_counters(PgStat_TableXactStatus *trans)
+{
+    if (!trans->truncated)
+    {
+        trans->inserted_pre_trunc = trans->tuples_inserted;
+        trans->updated_pre_trunc = trans->tuples_updated;
+        trans->deleted_pre_trunc = trans->tuples_deleted;
+        trans->truncated = true;
+    }
+}
+
+/*
+ * pgstat_truncate_restore_counters - restore counters when a truncate aborts
+ */
+static void
+pgstat_truncate_restore_counters(PgStat_TableXactStatus *trans)
+{
+    if (trans->truncated)
+    {
+        trans->tuples_inserted = trans->inserted_pre_trunc;
+        trans->tuples_updated = trans->updated_pre_trunc;
+        trans->tuples_deleted = trans->deleted_pre_trunc;
+    }
+}
+
+/*
+ * pgstat_count_truncate - update tuple counters due to truncate
+ */
+void
+pgstat_count_truncate(Relation rel)
+{
+    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
+
+    if (pgstat_info != NULL)
+    {
+        /* We have to log the effect at the proper transactional level */
+        int            nest_level = GetCurrentTransactionNestLevel();
+
+        if (pgstat_info->trans == NULL ||
+            pgstat_info->trans->nest_level != nest_level)
+            add_tabstat_xact_level(pgstat_info, nest_level);
+
+        pgstat_truncate_save_counters(pgstat_info->trans);
+        pgstat_info->trans->tuples_inserted = 0;
+        pgstat_info->trans->tuples_updated = 0;
+        pgstat_info->trans->tuples_deleted = 0;
+    }
+}
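
The save/clear/restore handling of counters around TRUNCATE can be illustrated in isolation. This is a sketch under simplifying assumptions: one transaction level only, and a hypothetical `XactCounters` struct standing in for `PgStat_TableXactStatus`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-level version of the per-table xact counters. */
typedef struct XactCounters
{
    long tuples_inserted;
    long tuples_updated;
    long tuples_deleted;
    long inserted_pre_trunc;
    long updated_pre_trunc;
    long deleted_pre_trunc;
    bool truncated;
} XactCounters;

/* Save the counters, but only on the first truncate at this level. */
static void
truncate_save(XactCounters *t)
{
    if (!t->truncated)
    {
        t->inserted_pre_trunc = t->tuples_inserted;
        t->updated_pre_trunc = t->tuples_updated;
        t->deleted_pre_trunc = t->tuples_deleted;
        t->truncated = true;
    }
}

/* Count a truncate: remember the old values, then start from zero. */
static void
count_truncate(XactCounters *t)
{
    truncate_save(t);
    t->tuples_inserted = 0;
    t->tuples_updated = 0;
    t->tuples_deleted = 0;
}

/* On abort, bring back the values that the truncate cleared. */
static void
truncate_restore(XactCounters *t)
{
    if (t->truncated)
    {
        t->tuples_inserted = t->inserted_pre_trunc;
        t->tuples_updated = t->updated_pre_trunc;
        t->tuples_deleted = t->deleted_pre_trunc;
    }
}
```

Because only the first truncate saves, a second truncate in the same level does not overwrite the saved pre-truncate values with the post-truncate ones.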
+
+/*
+ * pgstat_update_heap_dead_tuples - update dead-tuples count
+ *
+ * The semantics of this are that we are reporting the nontransactional
+ * recovery of "delta" dead tuples; so t_delta_dead_tuples decreases
+ * rather than increasing, and the change goes straight into the per-table
+ * counter, not into transactional state.
+ */
+void
+pgstat_update_heap_dead_tuples(Relation rel, int delta)
+{
+    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
+
+    if (pgstat_info != NULL)
+        pgstat_info->t_counts.t_delta_dead_tuples -= delta;
+}
+
+
+/* ----------
+ * AtEOXact_PgStat
+ *
+ *    Called from access/transam/xact.c at top-level transaction commit/abort.
+ * ----------
+ */
+void
+AtEOXact_PgStat(bool isCommit)
+{
+    PgStat_SubXactStatus *xact_state;
+
+    /*
+     * Count transaction commit or abort.  (We use counters, not just bools,
+     * in case the reporting message isn't sent right away.)
+     */
+    if (isCommit)
+        pgStatXactCommit++;
+    else
+        pgStatXactRollback++;
+
+    /*
+     * Transfer transactional insert/update counts into the base tabstat
+     * entries.  We don't bother to free any of the transactional state, since
+     * it's all in TopTransactionContext and will go away anyway.
+     */
+    xact_state = pgStatXactStack;
+    if (xact_state != NULL)
+    {
+        PgStat_TableXactStatus *trans;
+
+        Assert(xact_state->nest_level == 1);
+        Assert(xact_state->prev == NULL);
+        for (trans = xact_state->first; trans != NULL; trans = trans->next)
+        {
+            PgStat_TableStatus *tabstat;
+
+            Assert(trans->nest_level == 1);
+            Assert(trans->upper == NULL);
+            tabstat = trans->parent;
+            Assert(tabstat->trans == trans);
+            /* restore pre-truncate stats (if any) in case of aborted xact */
+            if (!isCommit)
+                pgstat_truncate_restore_counters(trans);
+            /* count attempted actions regardless of commit/abort */
+            tabstat->t_counts.t_tuples_inserted += trans->tuples_inserted;
+            tabstat->t_counts.t_tuples_updated += trans->tuples_updated;
+            tabstat->t_counts.t_tuples_deleted += trans->tuples_deleted;
+            if (isCommit)
+            {
+                tabstat->t_counts.t_truncated = trans->truncated;
+                if (trans->truncated)
+                {
+                    /* forget live/dead stats seen by backend thus far */
+                    tabstat->t_counts.t_delta_live_tuples = 0;
+                    tabstat->t_counts.t_delta_dead_tuples = 0;
+                }
+                /* insert adds a live tuple, delete removes one */
+                tabstat->t_counts.t_delta_live_tuples +=
+                    trans->tuples_inserted - trans->tuples_deleted;
+                /* update and delete each create a dead tuple */
+                tabstat->t_counts.t_delta_dead_tuples +=
+                    trans->tuples_updated + trans->tuples_deleted;
+                /* insert, update, delete each count as one change event */
+                tabstat->t_counts.t_changed_tuples +=
+                    trans->tuples_inserted + trans->tuples_updated +
+                    trans->tuples_deleted;
+            }
+            else
+            {
+                /* inserted tuples are dead, deleted tuples are unaffected */
+                tabstat->t_counts.t_delta_dead_tuples +=
+                    trans->tuples_inserted + trans->tuples_updated;
+                /* an aborted xact generates no changed_tuple events */
+            }
+            tabstat->trans = NULL;
+        }
+    }
+    pgStatXactStack = NULL;
+
+    /* mark that the next reference will be the first in a transaction */
+    first_in_xact = true;
+}
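
The commit and abort arithmetic in AtEOXact_PgStat can be worked through with a reduced sketch. It leaves out the truncate handling and the transaction stack, and `TableCounts`, `XactDelta` and `fold_xact` are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Reduced form of the nontransactional per-table counters. */
typedef struct TableCounts
{
    long tuples_inserted;
    long tuples_updated;
    long tuples_deleted;
    long delta_live_tuples;
    long delta_dead_tuples;
    long changed_tuples;
} TableCounts;

/* Counts accumulated inside one top-level transaction. */
typedef struct XactDelta
{
    long inserted;
    long updated;
    long deleted;
} XactDelta;

static void
fold_xact(TableCounts *t, const XactDelta *x, bool is_commit)
{
    /* Attempted actions are counted regardless of the outcome. */
    t->tuples_inserted += x->inserted;
    t->tuples_updated += x->updated;
    t->tuples_deleted += x->deleted;

    if (is_commit)
    {
        /* insert adds a live tuple, delete removes one */
        t->delta_live_tuples += x->inserted - x->deleted;
        /* update and delete each leave a dead tuple behind */
        t->delta_dead_tuples += x->updated + x->deleted;
        t->changed_tuples += x->inserted + x->updated + x->deleted;
    }
    else
    {
        /* new tuple versions are dead; deleted tuples are unaffected */
        t->delta_dead_tuples += x->inserted + x->updated;
    }
}
```

Committing 5 inserts, 2 updates and 1 delete yields +4 live, +3 dead and 8 changed tuples. Aborting the same work adds 7 dead tuples and leaves the live and changed counts alone.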
+
+/* ----------
+ * AtEOSubXact_PgStat
+ *
+ *    Called from access/transam/xact.c at subtransaction commit/abort.
+ * ----------
+ */
+void
+AtEOSubXact_PgStat(bool isCommit, int nestDepth)
+{
+    PgStat_SubXactStatus *xact_state;
+
+    /*
+     * Transfer transactional insert/update counts into the next higher
+     * subtransaction state.
+     */
+    xact_state = pgStatXactStack;
+    if (xact_state != NULL &&
+        xact_state->nest_level >= nestDepth)
+    {
+        PgStat_TableXactStatus *trans;
+        PgStat_TableXactStatus *next_trans;
+
+        /* delink xact_state from stack immediately to simplify reuse case */
+        pgStatXactStack = xact_state->prev;
+
+        for (trans = xact_state->first; trans != NULL; trans = next_trans)
+        {
+            PgStat_TableStatus *tabstat;
+
+            next_trans = trans->next;
+            Assert(trans->nest_level == nestDepth);
+            tabstat = trans->parent;
+            Assert(tabstat->trans == trans);
+            if (isCommit)
+            {
+                if (trans->upper && trans->upper->nest_level == nestDepth - 1)
+                {
+                    if (trans->truncated)
+                    {
+                        /* propagate the truncate status one level up */
+                        pgstat_truncate_save_counters(trans->upper);
+                        /* replace upper xact stats with ours */
+                        trans->upper->tuples_inserted = trans->tuples_inserted;
+                        trans->upper->tuples_updated = trans->tuples_updated;
+                        trans->upper->tuples_deleted = trans->tuples_deleted;
+                    }
+                    else
+                    {
+                        trans->upper->tuples_inserted += trans->tuples_inserted;
+                        trans->upper->tuples_updated += trans->tuples_updated;
+                        trans->upper->tuples_deleted += trans->tuples_deleted;
+                    }
+                    tabstat->trans = trans->upper;
+                    pfree(trans);
+                }
+                else
+                {
+                    /*
+                     * When there isn't an immediate parent state, we can just
+                     * reuse the record instead of going through a
+                     * palloc/pfree pushup (this works since it's all in
+                     * TopTransactionContext anyway).  We have to re-link it
+                     * into the parent level, though, and that might mean
+                     * pushing a new entry into the pgStatXactStack.
+                     */
+                    PgStat_SubXactStatus *upper_xact_state;
+
+                    upper_xact_state = get_tabstat_stack_level(nestDepth - 1);
+                    trans->next = upper_xact_state->first;
+                    upper_xact_state->first = trans;
+                    trans->nest_level = nestDepth - 1;
+                }
+            }
+            else
+            {
+                /*
+                 * On abort, update top-level tabstat counts, then forget the
+                 * subtransaction
+                 */
+
+                /* first restore values obliterated by truncate */
+                pgstat_truncate_restore_counters(trans);
+                /* count attempted actions regardless of commit/abort */
+                tabstat->t_counts.t_tuples_inserted += trans->tuples_inserted;
+                tabstat->t_counts.t_tuples_updated += trans->tuples_updated;
+                tabstat->t_counts.t_tuples_deleted += trans->tuples_deleted;
+                /* inserted tuples are dead, deleted tuples are unaffected */
+                tabstat->t_counts.t_delta_dead_tuples +=
+                    trans->tuples_inserted + trans->tuples_updated;
+                tabstat->trans = trans->upper;
+                pfree(trans);
+            }
+        }
+        pfree(xact_state);
+    }
+}
+
+
+/*
+ * AtPrepare_PgStat
+ *        Save the transactional stats state at 2PC transaction prepare.
+ *
+ * In this phase we just generate 2PC records for all the pending
+ * transaction-dependent stats work.
+ */
+void
+AtPrepare_PgStat(void)
+{
+    PgStat_SubXactStatus *xact_state;
+
+    xact_state = pgStatXactStack;
+    if (xact_state != NULL)
+    {
+        PgStat_TableXactStatus *trans;
+
+        Assert(xact_state->nest_level == 1);
+        Assert(xact_state->prev == NULL);
+        for (trans = xact_state->first; trans != NULL; trans = trans->next)
+        {
+            PgStat_TableStatus *tabstat;
+            TwoPhasePgStatRecord record;
+
+            Assert(trans->nest_level == 1);
+            Assert(trans->upper == NULL);
+            tabstat = trans->parent;
+            Assert(tabstat->trans == trans);
+
+            record.tuples_inserted = trans->tuples_inserted;
+            record.tuples_updated = trans->tuples_updated;
+            record.tuples_deleted = trans->tuples_deleted;
+            record.inserted_pre_trunc = trans->inserted_pre_trunc;
+            record.updated_pre_trunc = trans->updated_pre_trunc;
+            record.deleted_pre_trunc = trans->deleted_pre_trunc;
+            record.t_id = tabstat->t_id;
+            record.t_shared = tabstat->t_shared;
+            record.t_truncated = trans->truncated;
+
+            RegisterTwoPhaseRecord(TWOPHASE_RM_PGSTAT_ID, 0,
+                                   &record, sizeof(TwoPhasePgStatRecord));
+        }
+    }
+}
+
+/*
+ * PostPrepare_PgStat
+ *        Clean up after successful PREPARE.
+ *
+ * All we need do here is unlink the transaction stats state from the
+ * nontransactional state.  The nontransactional action counts will be
+ * reported to the stats collector immediately, while the effects on live
+ * and dead tuple counts are preserved in the 2PC state file.
+ *
+ * Note: AtEOXact_PgStat is not called during PREPARE.
+ */
+void
+PostPrepare_PgStat(void)
+{
+    PgStat_SubXactStatus *xact_state;
+
+    /*
+     * We don't bother to free any of the transactional state, since it's all
+     * in TopTransactionContext and will go away anyway.
+     */
+    xact_state = pgStatXactStack;
+    if (xact_state != NULL)
+    {
+        PgStat_TableXactStatus *trans;
+
+        for (trans = xact_state->first; trans != NULL; trans = trans->next)
+        {
+            PgStat_TableStatus *tabstat;
+
+            tabstat = trans->parent;
+            tabstat->trans = NULL;
+        }
+    }
+    pgStatXactStack = NULL;
+}
+
+/*
+ * 2PC processing routine for COMMIT PREPARED case.
+ *
+ * Load the saved counts into our local pgstats state.
+ */
+void
+pgstat_twophase_postcommit(TransactionId xid, uint16 info,
+                           void *recdata, uint32 len)
+{
+    TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata;
+    PgStat_TableStatus *pgstat_info;
+
+    /* Find or create a tabstat entry for the rel */
+    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+
+    /* Same math as in AtEOXact_PgStat, commit case */
+    pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
+    pgstat_info->t_counts.t_tuples_updated += rec->tuples_updated;
+    pgstat_info->t_counts.t_tuples_deleted += rec->tuples_deleted;
+    pgstat_info->t_counts.t_truncated = rec->t_truncated;
+    if (rec->t_truncated)
+    {
+        /* forget live/dead stats seen by backend thus far */
+        pgstat_info->t_counts.t_delta_live_tuples = 0;
+        pgstat_info->t_counts.t_delta_dead_tuples = 0;
+    }
+    pgstat_info->t_counts.t_delta_live_tuples +=
+        rec->tuples_inserted - rec->tuples_deleted;
+    pgstat_info->t_counts.t_delta_dead_tuples +=
+        rec->tuples_updated + rec->tuples_deleted;
+    pgstat_info->t_counts.t_changed_tuples +=
+        rec->tuples_inserted + rec->tuples_updated +
+        rec->tuples_deleted;
+}
+
+/*
+ * 2PC processing routine for ROLLBACK PREPARED case.
+ *
+ * Load the saved counts into our local pgstats state, but treat them
+ * as aborted.
+ */
+void
+pgstat_twophase_postabort(TransactionId xid, uint16 info,
+                          void *recdata, uint32 len)
+{
+    TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata;
+    PgStat_TableStatus *pgstat_info;
+
+    /* Find or create a tabstat entry for the rel */
+    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+
+    /* Same math as in AtEOXact_PgStat, abort case */
+    if (rec->t_truncated)
+    {
+        rec->tuples_inserted = rec->inserted_pre_trunc;
+        rec->tuples_updated = rec->updated_pre_trunc;
+        rec->tuples_deleted = rec->deleted_pre_trunc;
+    }
+    pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
+    pgstat_info->t_counts.t_tuples_updated += rec->tuples_updated;
+    pgstat_info->t_counts.t_tuples_deleted += rec->tuples_deleted;
+    pgstat_info->t_counts.t_delta_dead_tuples +=
+        rec->tuples_inserted + rec->tuples_updated;
+}
+
+/* ----------
+ * pgstat_fetch_stat_tabentry() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    the collected statistics for one table or NULL. NULL doesn't mean
+ *    that the table doesn't exist, it is just not yet known by the
+ *    collector, so the caller is better off to report ZERO instead.
+ * ----------
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry(Oid relid)
+{
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+
+    /* Lookup our database, then look in its table hash table. */
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, false);
+    if (dbentry == NULL)
+        return NULL;
+
+    tabentry = backend_get_tab_entry(dbentry, relid, false);
+    if (tabentry != NULL)
+        return tabentry;
+
+    /*
+     * If we didn't find it, maybe it's a shared table.
+     */
+    dbentry = pgstat_fetch_stat_dbentry(InvalidOid, false);
+    if (dbentry == NULL)
+        return NULL;
+
+    tabentry = backend_get_tab_entry(dbentry, relid, false);
+    if (tabentry != NULL)
+        return tabentry;
+
+    return NULL;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_funcentry() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    the collected statistics for one function or NULL.
+ * ----------
+ */
+PgStat_StatFuncEntry *
+pgstat_fetch_stat_funcentry(Oid func_id)
+{
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatFuncEntry *funcentry = NULL;
+
+    /* Lookup our database, then find the requested function */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_SHARED, NULL);
+    if (dbentry == NULL)
+        return NULL;
+
+    funcentry = backend_get_func_etnry(dbentry, func_id, false);
+
+    dshash_release_lock(db_stats, dbentry);
+    return funcentry;
+}
+
+/*
+ * ---------
+ * pgstat_fetch_stat_archiver() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    a pointer to the archiver statistics struct.
+ * ---------
+ */
+PgStat_ArchiverStats *
+pgstat_fetch_stat_archiver(void)
+{
+    /* If not done for this transaction, take a stats snapshot */
+    if (!backend_snapshot_global_stats())
+        return NULL;
+
+    return snapshot_archiverStats;
+}
+
+
+/*
+ * ---------
+ * pgstat_fetch_global() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    a pointer to the global statistics struct.
+ * ---------
+ */
+PgStat_GlobalStats *
+pgstat_fetch_global(void)
+{
+    /* If not done for this transaction, take a stats snapshot */
+    if (!backend_snapshot_global_stats())
+        return NULL;
+
+    return snapshot_globalStats;
+}
+
+/*
+ * Shut down a single backend's statistics reporting at process exit.
+ *
+ * Flush any remaining statistics counts out to the collector.
+ * Without this, operations triggered during backend exit (such as
+ * temp table deletions) won't be counted.
+ *
+ * Lastly, clear out our entry in the PgBackendStatus array.
+ */
+static void
+pgstat_beshutdown_hook(int code, Datum arg)
+{
+    /*
+     * If we got as far as discovering our own database ID, we can report what
+     * we did to the collector.  Otherwise, we'd be sending an invalid
+     * database ID, so forget it.  (This means that accesses to pg_database
+     * during failed backend starts might never get counted.)
+     */
+    if (OidIsValid(MyDatabaseId))
+        pgstat_update_stat(true);
+}
+
+
+/* ------------------------------------------------------------
+ * Local support functions follow
+ * ------------------------------------------------------------
+ */
+
+/* ----------
+ * pgstat_update_archiver() -
+ *
+ *    Update the stats data about the WAL file that we successfully archived or
+ *    failed to archive.
+ * ----------
+ */
+void
+pgstat_update_archiver(const char *xlog, bool failed)
+{
+    if (failed)
+    {
+        /* Failed archival attempt */
+        ++shared_archiverStats->failed_count;
+        strlcpy(shared_archiverStats->last_failed_wal, xlog,
+                sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = GetCurrentTimestamp();
+    }
+    else
+    {
+        /* Successful archival operation */
+        ++shared_archiverStats->archived_count;
+        strlcpy(shared_archiverStats->last_archived_wal, xlog,
+                sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = GetCurrentTimestamp();
+    }
+}
+
+/* ----------
+ * pgstat_update_bgwriter() -
+ *
+ *        Update bgwriter statistics
+ * ----------
+ */
+void
+pgstat_update_bgwriter(void)
+{
+    /* We assume this initializes to zeroes */
+    static const PgStat_BgWriter all_zeroes;
+
+    PgStat_BgWriter *s = &BgWriterStats;
+
+    /*
+     * This function can be called even if nothing at all has happened. In
+     * this case, avoid taking StatsLock and touching shared memory for
+     * nothing.
+     */
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
+        return;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += s->m_timed_checkpoints;
+    shared_globalStats->requested_checkpoints += s->m_requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += s->m_checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += s->m_checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += s->m_buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += s->m_buf_written_clean;
+    shared_globalStats->maxwritten_clean += s->m_maxwritten_clean;
+    shared_globalStats->buf_written_backend += s->m_buf_written_backend;
+    shared_globalStats->buf_fsync_backend += s->m_buf_fsync_backend;
+    shared_globalStats->buf_alloc += s->m_buf_alloc;
+    LWLockRelease(StatsLock);
+
+    /*
+     * Clear out the statistics buffer, so it can be re-used.
+     */
+    MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
+}
+
+/*
+ * Subroutine to reset stats in a shared database entry
+ *
+ * The table hash is created empty; the function hash is created as needed.
+ */
+static void
+reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
+{
+    dshash_table *tbl;
+
+    dbentry->n_xact_commit = 0;
+    dbentry->n_xact_rollback = 0;
+    dbentry->n_blocks_fetched = 0;
+    dbentry->n_blocks_hit = 0;
+    dbentry->n_tuples_returned = 0;
+    dbentry->n_tuples_fetched = 0;
+    dbentry->n_tuples_inserted = 0;
+    dbentry->n_tuples_updated = 0;
+    dbentry->n_tuples_deleted = 0;
+    dbentry->last_autovac_time = 0;
+    dbentry->n_conflict_tablespace = 0;
+    dbentry->n_conflict_lock = 0;
+    dbentry->n_conflict_snapshot = 0;
+    dbentry->n_conflict_bufferpin = 0;
+    dbentry->n_conflict_startup_deadlock = 0;
+    dbentry->n_temp_files = 0;
+    dbentry->n_temp_bytes = 0;
+    dbentry->n_deadlocks = 0;
+    dbentry->n_block_read_time = 0;
+    dbentry->n_block_write_time = 0;
+
+    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
+    dbentry->stats_timestamp = 0;
+
+
+    Assert(dbentry->tables == DSM_HANDLE_INVALID);
+    tbl = dshash_create(area, &dsh_tblparams, 0);
+    dbentry->tables = dshash_get_hash_table_handle(tbl);
+    dshash_detach(tbl);
+
+    Assert(dbentry->functions == DSM_HANDLE_INVALID);
+    /* we create function hash as needed */
+
+    dbentry->snapshot_tables = NULL;
+    dbentry->snapshot_functions = NULL;
+}
+
+/*
+ * Lookup the hash table entry for the specified database. If no entry
+ * exists, create and initialize it when op includes PGSTAT_FETCH_EXCLUSIVE.
+ * Else, return NULL.
+ */
+static PgStat_StatDBEntry *
+pgstat_get_db_entry(Oid databaseid, int op, PgStat_TableLookupState *status)
+{
+    PgStat_StatDBEntry *result;
+    bool        nowait = ((op & PGSTAT_FETCH_NOWAIT) != 0);
+    bool        lock_acquired = true;
+    bool        found = true;
+
+    if (!IsUnderPostmaster)
+        return NULL;
+
+    /* Lookup or create the hash table entry for this database */
+    if (op & PGSTAT_FETCH_EXCLUSIVE)
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_or_insert_extended(db_stats, &databaseid,
+                                           &found, nowait);
+        if (result == NULL)
+            lock_acquired = false;
+        else if (!found)
+        {
+            /*
+             * If not found, initialize the new one.  This creates an empty
+             * hash table for tables, too.
+             */
+            reset_dbentry_counters(result);
+        }
+    }
+    else
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_extended(db_stats, &databaseid, true, nowait,
+                                 &lock_acquired);
+        if (result == NULL)
+            found = false;
+    }
+
+    /* Set return status if requested */
+    if (status)
+    {
+        if (!lock_acquired)
+        {
+            Assert(nowait);
+            *status = PGSTAT_ENTRY_LOCK_FAILED;
+        }
+        else if (!found)
+            *status = PGSTAT_ENTRY_NOT_FOUND;
+        else
+            *status = PGSTAT_ENTRY_FOUND;
+    }
+
+    return result;
+}
+
+/*
+ * Lookup the hash table entry for the specified table. If no hash
+ * table entry exists, initialize it, if the create parameter is true.
+ * Else, return NULL.
+ */
+static PgStat_StatTabEntry *
+pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create)
+{
+    PgStat_StatTabEntry *result;
+    bool        found = true;
+
+    /* Lookup or create the hash table entry for this table */
+    if (create)
+        result = (PgStat_StatTabEntry *)
+            dshash_find_or_insert(table, &tableoid, &found);
+    else
+        result = (PgStat_StatTabEntry *) dshash_find(table, &tableoid, false);
+
+    /* dshash_find() does not set found, so test the result directly */
+    if (result == NULL)
+        return NULL;
+
+    /* If not found, initialize the new one. */
+    if (!found)
+    {
+        result->numscans = 0;
+        result->tuples_returned = 0;
+        result->tuples_fetched = 0;
+        result->tuples_inserted = 0;
+        result->tuples_updated = 0;
+        result->tuples_deleted = 0;
+        result->tuples_hot_updated = 0;
+        result->n_live_tuples = 0;
+        result->n_dead_tuples = 0;
+        result->changes_since_analyze = 0;
+        result->blocks_fetched = 0;
+        result->blocks_hit = 0;
+        result->vacuum_timestamp = 0;
+        result->vacuum_count = 0;
+        result->autovac_vacuum_timestamp = 0;
+        result->autovac_vacuum_count = 0;
+        result->analyze_timestamp = 0;
+        result->analyze_count = 0;
+        result->autovac_analyze_timestamp = 0;
+        result->autovac_analyze_count = 0;
+    }
+
+    return result;
+}
+
+
+/* ----------
+ * pgstat_write_statsfiles() -
+ *        Write the global statistics file, as well as DB files.
+ * ----------
+ */
+void
+pgstat_write_statsfiles(void)
+{
+    dshash_seq_status hstat;
+    PgStat_StatDBEntry *dbentry;
+    FILE       *fpout;
+    int32        format_id;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    int            rc;
+
+    /* should be called from postmaster  */
+    Assert(!IsUnderPostmaster);
+
+    elog(DEBUG2, "writing stats file \"%s\"", statfile);
+
+    /*
+     * Open the statistics temp file to write out the current values.
+     */
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        return;
+    }
+
+    /*
+     * Set the timestamp of the stats file.
+     */
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
+
+    /*
+     * Write the file header --- currently just a format ID.
+     */
+    format_id = PGSTAT_FILE_FORMAT_ID;
+    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write global stats struct
+     */
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write archiver stats struct
+     */
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Walk through the database table.
+     */
+    dshash_seq_init(&hstat, db_stats, false, false);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&hstat)) != NULL)
+    {
+        /*
+         * Write out the table and function stats for this DB into the
+         * appropriate per-DB stat file, if required.
+         */
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
+
+        pgstat_write_db_statsfile(dbentry);
+
+        /*
+         * Write out the DB entry. We don't write the tables or functions
+         * pointers, since they're of no use to any other process.
+         */
+        fputc('D', fpout);
+        rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
+        (void) rc;                /* we'll check for error with ferror */
+    }
+
+    /*
+     * No more output to be done. Close the temp file and replace the old
+     * pgstat.stat with it.  The ferror() check replaces testing for error
+     * after each individual fputc or fwrite above.
+     */
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+}
+
+/*
+ * return the filename for a DB stat file; filename is the output buffer,
+ * of length len.
+ */
+static void
+get_dbstat_filename(bool tempname, Oid databaseid,
+                    char *filename, int len)
+{
+    int            printed;
+
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/db_%u.%s",
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
+                       databaseid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
+}
+
+/* ----------
+ * pgstat_write_db_statsfile() -
+ *        Write the stat file for a single database.
+ *
+ *    Writes the table stats and, if present, the function stats of the
+ *    database to its permanent per-database file via a temporary file.
+ * ----------
+ */
+static void
+pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry)
+{
+    dshash_seq_status tstat;
+    dshash_seq_status fstat;
+    PgStat_StatTabEntry *tabentry;
+    PgStat_StatFuncEntry *funcentry;
+    FILE       *fpout;
+    int32        format_id;
+    Oid            dbid = dbentry->databaseid;
+    int            rc;
+    char        tmpfile[MAXPGPATH];
+    char        statfile[MAXPGPATH];
+    dshash_table *tbl;
+
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
+
+    elog(DEBUG2, "writing stats file \"%s\"", statfile);
+
+    /*
+     * Open the statistics temp file to write out the current values.
+     */
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        return;
+    }
+
+    /*
+     * Write the file header --- currently just a format ID.
+     */
+    format_id = PGSTAT_FILE_FORMAT_ID;
+    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Walk through the database's access stats per table.
+     */
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&tstat, tbl, false, false);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&tstat)) != NULL)
+    {
+        fputc('T', fpout);
+        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
+        (void) rc;                /* we'll check for error with ferror */
+    }
+    dshash_detach(tbl);
+
+    /*
+     * Walk through the database's function stats table.
+     */
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_seq_init(&fstat, tbl, false, false);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&fstat)) != NULL)
+        {
+            fputc('F', fpout);
+            rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+            (void) rc;                /* we'll check for error with ferror */
+        }
+        dshash_detach(tbl);
+    }
+
+    /*
+     * No more output to be done. Close the temp file and replace the old
+     * per-database stats file with it.  The ferror() check replaces testing
+     * for error after each individual fputc or fwrite above.
+     */
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+}
+
+/* ----------
+ * pgstat_read_statsfiles() -
+ *
+ *    Reads in some existing statistics collector files into the shared stats
+ *    hash.
+ *
+ * ----------
+ */
+void
+pgstat_read_statsfiles(void)
+{
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatDBEntry dbbuf;
+    FILE       *fpin;
+    int32        format_id;
+    bool        found;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    dshash_table *tblstats = NULL;
+    dshash_table *funcstats = NULL;
+
+    /* should be called from postmaster  */
+    Assert(!IsUnderPostmaster);
+
+    /*
+     * local cache lives in pgStatLocalContext.
+     */
+    pgstat_setup_memcxt();
+
+    /*
+     * Create the DB hashtable and global stats area
+     */
+    /* Hold lock so that no other process sees empty stats */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    pgstat_create_shared_stats();
+
+    /*
+     * Set the current timestamp (will be kept only in case we can't load an
+     * existing statsfile).
+     */
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp = shared_globalStats->stat_reset_timestamp;
+
+    /*
+     * Try to open the stats file. If it doesn't exist, the backends simply
+     * return zero for anything and the collector simply starts from scratch
+     * with empty counters.
+     *
+     * ENOENT is a possibility if the stats collector is not running or has
+     * not yet written the stats file the first time.  Any other failure
+     * condition is suspicious.
+     */
+    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(LOG,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            statfile)));
+        LWLockRelease(StatsLock);
+        return;
+    }
+
+    /*
+     * Verify it's of the expected format.
+     */
+    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
+        format_id != PGSTAT_FILE_FORMAT_ID)
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        goto done;
+    }
+
+    /*
+     * Read global stats struct
+     */
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+        sizeof(*shared_globalStats))
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        memset(shared_globalStats, 0, sizeof(*shared_globalStats));
+        goto done;
+    }
+
+    /*
+     * In the collector, disregard the timestamp we read from the permanent
+     * stats file; we should be willing to write a temp stats file immediately
+     * upon the first request from any backend.  This only matters if the old
+     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
+     * an unusual scenario.
+     */
+    shared_globalStats->stats_timestamp = 0;
+
+    /*
+     * Read archiver stats struct
+     */
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+        sizeof(*shared_archiverStats))
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        goto done;
+    }
+
+    /*
+     * We found an existing collector stats file. Read it and put all the
+     * hashtable entries into place.
+     */
+    for (;;)
+    {
+        switch (fgetc(fpin))
+        {
+                /*
+                 * 'D'    A PgStat_StatDBEntry struct describing a database
+                 * follows.
+                 */
+            case 'D':
+                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
+                          fpin) != offsetof(PgStat_StatDBEntry, tables))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                /*
+                 * Add to the DB hash
+                 */
+                dbentry = (PgStat_StatDBEntry *)
+                    dshash_find_or_insert(db_stats, (void *) &dbbuf.databaseid,
+                                          &found);
+                if (found)
+                {
+                    dshash_release_lock(db_stats, dbentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
+                dbentry->tables = DSM_HANDLE_INVALID;
+                dbentry->functions = DSM_HANDLE_INVALID;
+                dbentry->snapshot_tables = NULL;
+                dbentry->snapshot_functions = NULL;
+
+                /*
+                 * In the collector, disregard the timestamp we read from the
+                 * permanent stats file; we should be willing to write a temp
+                 * stats file immediately upon the first request from any
+                 * backend.
+                 */
+                dbentry->stats_timestamp = 0;
+
+                /*
+                 * Create the table hash and read the data from the
+                 * database-specific file.
+                 */
+                tblstats = dshash_create(area, &dsh_tblparams, 0);
+                dbentry->tables = dshash_get_hash_table_handle(tblstats);
+                /* we don't create function hash at the present */
+                dshash_release_lock(db_stats, dbentry);
+                /* dbentry must not be touched after releasing its lock */
+                pgstat_read_db_statsfile(dbbuf.databaseid,
+                                         tblstats, funcstats);
+                dshash_detach(tblstats);
+                break;
+
+            case 'E':
+                goto done;
+
+            default:
+                ereport(LOG,
+                        (errmsg("corrupted statistics file \"%s\"",
+                                statfile)));
+                goto done;
+        }
+    }
+
+done:
+    LWLockRelease(StatsLock);
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+
+    return;
+}
+
+
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+void
+StatsShmemInit(void)
+{
+    bool    found;
+
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+
+        /* Load saved data if any */
+        pgstat_read_statsfiles();
+
+        /* needs to be called before dsm shutdown */
+        before_shmem_exit(pgstat_postmaster_shutdown, (Datum) 0);
+    }
+}
+
+static void
+pgstat_postmaster_shutdown(int code, Datum arg)
+{
+    /* we trash the stats on crash */
+    if (code == 0)
+        pgstat_write_statsfiles();
+}
+
+/* ----------
+ * pgstat_read_db_statsfile() -
+ *
+ *    Reads in the permanent statistics collector file and creates shared
+ *    statistics tables. The file is removed after reading.
+ * ----------
+ */
+static void
+pgstat_read_db_statsfile(Oid databaseid,
+                         dshash_table *tabhash, dshash_table *funchash)
+{
+    PgStat_StatTabEntry *tabentry;
+    PgStat_StatTabEntry tabbuf;
+    PgStat_StatFuncEntry funcbuf;
+    PgStat_StatFuncEntry *funcentry;
+    FILE       *fpin;
+    int32        format_id;
+    bool        found;
+    char        statfile[MAXPGPATH];
+
+    /* should be called from postmaster  */
+    Assert(!IsUnderPostmaster);
+
+    get_dbstat_filename(false, databaseid, statfile, MAXPGPATH);
+
+    /*
+     * Try to open the stats file. If it doesn't exist, the backends simply
+     * return zero for anything and the collector simply starts from scratch
+     * with empty counters.
+     *
+     * ENOENT is a possibility if the stats collector is not running or has
+     * not yet written the stats file the first time.  Any other failure
+     * condition is suspicious.
+     */
+    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(LOG,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            statfile)));
+        return;
+    }
+
+    /*
+     * Verify it's of the expected format.
+     */
+    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
+        format_id != PGSTAT_FILE_FORMAT_ID)
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        goto done;
+    }
+
+    /*
+     * We found an existing collector stats file. Read it and put all the
+     * hashtable entries into place.
+     */
+    for (;;)
+    {
+        switch (fgetc(fpin))
+        {
+                /*
+                 * 'T'    A PgStat_StatTabEntry follows.
+                 */
+            case 'T':
+                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
+                          fpin) != sizeof(PgStat_StatTabEntry))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                /*
+                 * Skip if table data not wanted.
+                 */
+                if (tabhash == NULL)
+                    break;
+
+                tabentry = (PgStat_StatTabEntry *)
+                    dshash_find_or_insert(tabhash,
+                                          (void *) &tabbuf.tableid, &found);
+
+                if (found)
+                {
+                    dshash_release_lock(tabhash, tabentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
+                dshash_release_lock(tabhash, tabentry);
+                break;
+
+                /*
+                 * 'F'    A PgStat_StatFuncEntry follows.
+                 */
+            case 'F':
+                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
+                          fpin) != sizeof(PgStat_StatFuncEntry))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                /*
+                 * Skip if function data not wanted.
+                 */
+                if (funchash == NULL)
+                    break;
+
+                funcentry = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert(funchash,
+                                          (void *) &funcbuf.functionid, &found);
+
+                if (found)
+                {
+                    dshash_release_lock(funchash, funcentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"", statfile)));
+                    goto done;
+                }
+
+                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
+                dshash_release_lock(funchash, funcentry);
+                break;
+
+                /*
+                 * 'E'    The EOF marker of a complete stats file.
+                 */
+            case 'E':
+                goto done;
+
+            default:
+                ereport(LOG,
+                        (errmsg("corrupted statistics file \"%s\"",
+                                statfile)));
+                goto done;
+        }
+    }
+
+done:
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+}
+
+/* ----------
+ * pgstat_clear_snapshot: clear the local cache so that new snapshots
+ * will be read.
+ * ----------
+ */
+void
+pgstat_clear_snapshot(void)
+{
+    Assert(pgStatLocalContext);
+    MemoryContextReset(pgStatLocalContext);
+
+    /* mark the resources as not allocated */
+    snapshot_globalStats = NULL;
+    snapshot_archiverStats = NULL;
+    snapshot_db_stats = NULL;
+}
+
+/*
+ * create_local_stats_hash() -
+ *
+ * Creates a dynahash used for table/function stats cache.
+ */
+static HTAB *
+create_local_stats_hash(const char *name, size_t keysize, size_t entrysize,
+                        int nentries)
+{
+    HTAB *result;
+    HASHCTL ctl;
+
+    /* Create the hash in the stats context */
+    ctl.keysize        = keysize;
+    ctl.entrysize    = entrysize;
+    ctl.hcxt        = pgStatLocalContext;
+    result = hash_create(name, nentries, &ctl,
+                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    return result;
+}
+
+/*
+ * snapshot_statentry() - Find an entry in the source dshash.
+ *
+ * Returns the entry for key, or NULL if not found. If dest is not NULL, *dest
+ * is used as a local cache, which is created with the same shape as the given
+ * dshash when *dest is NULL. In both cases the result is cached in the hash
+ * and the same entry is returned to subsequent calls for the same key.
+ *
+ * Otherwise the returned entry is a copy that is palloc'ed in the caller's
+ * memory context. Its content may differ for every request.
+ *
+ * If dshash is NULL, temporarily attaches dsh_handle instead.
+ */
+static void *
+snapshot_statentry(HTAB **dest, const char *hashname,
+                   dshash_table *dshash, dshash_table_handle dsh_handle,
+                   const dshash_parameters *dsh_params, Oid key)
+{
+    char *lentry = NULL;
+    size_t keysize = dsh_params->key_size;
+    size_t entrysize = dsh_params->entry_size;
+
+    if (dest)
+    {
+        /* caches the result entry */
+        bool found;
+        bool *negative;
+
+        /*
+         * Create new hash with arbitrary initial entries since we don't know
+         * how this hash will grow.
+         */
+        if (!*dest)
+        {
+            /* make room for negative flag at the end of entry */
+            *dest = create_local_stats_hash(hashname, keysize,
+                                            entrysize + sizeof(bool), 32);
+        }
+
+        lentry = hash_search(*dest, &key, HASH_ENTER, &found);
+
+        /* negative flag is placed at the end of the entry */
+        negative = (bool *) (lentry + entrysize);
+
+        if (!found)
+        {
+            /* not found in local cache, search shared hash */
+
+            dshash_table *t = dshash;
+            void *sentry;
+
+            /* attach shared hash if not given */
+            if (!t)
+                t = dshash_attach(area, dsh_params, dsh_handle, NULL);
+
+            sentry = dshash_find(t, &key, false);
+
+            /*
+             * We expect that the stats for the specified key exist in most
+             * cases.
+             */
+
+            if (sentry)
+            {
+                memcpy(lentry, sentry, entrysize);
+                dshash_release_lock(t, sentry);
+            }
+
+            *negative = !sentry;
+
+            /* Release it if we attached it here */
+            if (!dshash)
+                dshash_detach(t);
+
+            if (!sentry)
+                return NULL;
+        }
+
+        if (*negative)
+            lentry = NULL;
+    }
+    else
+    {
+        /*
+         * The caller doesn't want caching. Just make a copy of the entry and
+         * return it.
+         */
+        dshash_table *t = dshash;
+        void *sentry;
+
+        if (!t)
+            t = dshash_attach(area, dsh_params, dsh_handle, NULL);
+
+        sentry = dshash_find(t, &key, false);
+        if (sentry)
+        {
+            lentry = palloc(entrysize);
+            memcpy(lentry, sentry, entrysize);
+            dshash_release_lock(t, sentry);
+        }
+
+        if (!dshash)
+            dshash_detach(t);
+    }
+
+    return (void *) lentry;
+}
+
+/*
+ * backend_snapshot_global_stats() -
+ *
+ * Makes a local copy of global stats if not already done.  They will be kept
+ * until pgstat_clear_snapshot() is called or the end of the current memory
+ * context (typically TopTransactionContext).  Returns false if the shared
+ * stats are not created yet.
+ */
+static bool
+backend_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext = CurrentMemoryContext;
+    TimestampTz update_time = 0;
+
+
+    /*
+     * This is the first call in a transaction. If we find the shared stats
+     * updated, throw away the cache.
+     */
+    if (IsTransactionState() && first_in_xact)
+    {
+        first_in_xact = false;
+        LWLockAcquire(StatsLock, LW_SHARED);
+        update_time = StatsShmem->last_update;
+        LWLockRelease(StatsLock);
+
+        if (backend_cache_expire < update_time)
+        {
+            pgstat_clear_snapshot();
+
+            /*
+             * Shared stats are updated frequently when many backends are
+             * running, but we don't want the cached stats to expire so
+             * frequently. Keep them for at least as long as the minimum
+             * stats update interval of a backend.
+             */
+            backend_cache_expire =
+                update_time + PGSTAT_STAT_MIN_INTERVAL * USECS_PER_SEC / 1000;
+        }
+    }
+
+    /* Nothing to do if already done */
+    if (snapshot_globalStats)
+        return true;
+
+    Assert(snapshot_archiverStats == NULL);
+
+    /*
+     * The snapshot lives within the current top transaction if any, or the
+     * current memory context lifetime otherwise.
+     */
+    if (IsTransactionState())
+        oldcontext = MemoryContextSwitchTo(pgStatLocalContext);
+
+    /* global stats can be just copied  */
+    LWLockAcquire(StatsLock, LW_SHARED);
+    snapshot_globalStats = palloc(sizeof(PgStat_GlobalStats));
+    memcpy(snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+
+    snapshot_archiverStats = palloc(sizeof(PgStat_ArchiverStats));
+    memcpy(snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+    LWLockRelease(StatsLock);
+
+    /* set the timestamp of this snapshot */
+    snapshot_globalStats->stats_timestamp = update_time;
+
+    MemoryContextSwitchTo(oldcontext);
+
+    return true;
+}
+
+/* ----------
+ * pgstat_fetch_stat_dbentry() -
+ *
+ *    Find database stats entry on backends. The returned entries are cached
+ *    until transaction end. If oneshot is true, they are not cached and are
+ *    returned in palloc'ed memory in the caller's context.
+ */
+PgStat_StatDBEntry *
+pgstat_fetch_stat_dbentry(Oid dbid, bool oneshot)
+{
+    /* take a local snapshot if we don't have one */
+    char *hashname = "local database stats hash";
+    PgStat_StatDBEntry *dbentry;
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    /* If not done for this transaction, take a snapshot of global stats */
+    if (!backend_snapshot_global_stats())
+        return NULL;
+
+    dbentry = snapshot_statentry(oneshot ? NULL : &snapshot_db_stats,
+                                 hashname, db_stats, 0, &dsh_dbparams,
+                                 dbid);
+
+    return dbentry;
+}
+
+/* ----------
+ * backend_get_tab_entry() -
+ *
+ *    Find table stats entry on backends. The returned entries are cached until
+ *    transaction end. If oneshot is true, they are not cached and are returned
+ *    in palloc'ed memory in the caller's context.
+ */
+PgStat_StatTabEntry *
+backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid reloid, bool oneshot)
+{
+    /* take a local snapshot if we don't have one */
+    char *hashname = "local table stats hash";
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    return snapshot_statentry(oneshot ? NULL : &dbent->snapshot_tables,
+                              hashname, NULL, dbent->tables, &dsh_tblparams,
+                              reloid);
+}
+
+/* ----------
+ * backend_get_func_entry() -
+ *
+ *    Find function stats entry on backends. The returned entries are cached
+ *    until transaction end. If oneshot is true, they are not cached and are
+ *    returned in palloc'ed memory in the caller's context.
+ */
+static PgStat_StatFuncEntry *
+backend_get_func_etnry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot)
+{
+    char *hashname = "local function stats hash";
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    if (dbent->functions == DSM_HANDLE_INVALID)
+        return NULL;
+
+    return snapshot_statentry(oneshot ? NULL : &dbent->snapshot_tables,
+                              hashname, NULL, dbent->functions, &dsh_funcparams,
+                              funcid);
+}
+
+static bool
+pgstat_update_tabentry(dshash_table *tabhash, PgStat_TableStatus *stat,
+                       bool nowait)
+{
+    PgStat_StatTabEntry *tabentry;
+    bool    found;
+
+    if (tabhash == NULL)
+        return false;
+
+    tabentry = (PgStat_StatTabEntry *)
+        dshash_find_or_insert_extended(tabhash, (void *) &(stat->t_id),
+                                       &found, nowait);
+
+    /* failed to acquire lock */
+    if (tabentry == NULL)
+        return false;
+
+    if (!found)
+    {
+        /*
+         * If it's a new table entry, initialize counters to the values we
+         * just got.
+         */
+        tabentry->numscans = stat->t_counts.t_numscans;
+        tabentry->tuples_returned = stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched = stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted = stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated = stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted = stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated = stat->t_counts.t_tuples_hot_updated;
+        tabentry->n_live_tuples = stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples = stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze = stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched = stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit = stat->t_counts.t_blocks_hit;
+
+        tabentry->vacuum_timestamp = 0;
+        tabentry->vacuum_count = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_count = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->analyze_count = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+        tabentry->autovac_analyze_count = 0;
+    }
+    else
+    {
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        tabentry->numscans += stat->t_counts.t_numscans;
+        tabentry->tuples_returned += stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched += stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted += stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated += stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted += stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated += stat->t_counts.t_tuples_hot_updated;
+        /* If table was truncated, first reset the live/dead counters */
+        if (stat->t_counts.t_truncated)
+        {
+            tabentry->n_live_tuples = 0;
+            tabentry->n_dead_tuples = 0;
+        }
+        tabentry->n_live_tuples += stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples += stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze += stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched += stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit += stat->t_counts.t_blocks_hit;
+    }
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
+
+    dshash_release_lock(tabhash, tabentry);
+
+    return true;
+}
+
+static void
+pgstat_update_dbentry(PgStat_StatDBEntry *dbentry, PgStat_TableStatus *stat)
+{
+    /*
+     * Add per-table stats to the per-database entry, too.
+     */
+    dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
+    dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
+    dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
+    dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
+    dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
+    dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
+    dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
+}
+
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ *    Create pgStatLocalContext, if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+    if (!pgStatLocalContext)
+        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
+                                                   "Activity statistics snapshot",
+                                                   ALLOCSET_SMALL_SIZES);
+}
+
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 273e2f385f..4edd980ffc 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -34,6 +34,7 @@
 #include <unistd.h>
 
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/storage.h"
 #include "executor/instrument.h"
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index c2c445dbf4..0bb2132c71 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -41,9 +41,9 @@
 
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "executor/instrument.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/buffile.h"
 #include "storage/buf_internals.h"
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index 1f766d20d1..a0401ee494 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -22,10 +22,10 @@
 #include <unistd.h>
 #include <sys/stat.h>
 
+#include "bestatus.h"
 #include "storage/copydir.h"
 #include "storage/fd.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 
 /*
  * copydir: copy a directory
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 213de7698a..6bc5fd6089 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -82,6 +82,7 @@
 #include "miscadmin.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/pg_tablespace.h"
 #include "common/file_perm.h"
 #include "pgstat.h"
diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index cab7ae74ca..c7c248878a 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -197,6 +197,15 @@ dsm_postmaster_startup(PGShmemHeader *shim)
     dsm_control->maxitems = maxitems;
 }
 
+/*
+ * clear dsm_init state on child start.
+ */
+void
+dsm_child_init(void)
+{
+    dsm_init_done = false;
+}
+
 /*
  * Determine whether the control segment from the previous postmaster
  * invocation still exists.  If so, remove the dynamic shared memory
@@ -423,6 +432,15 @@ dsm_set_control_handle(dsm_handle h)
 }
 #endif
 
+/*
+ * Return whether the dsm feature is available in this process.
+ */
+bool
+dsm_is_available(void)
+{
+    return dsm_control != NULL;
+}
+
 /*
  * Create a new dynamic shared memory segment.
  *
@@ -440,8 +458,7 @@ dsm_create(Size size, int flags)
     uint32        i;
     uint32        nitems;
 
-    /* Unsafe in postmaster (and pointless in a stand-alone backend). */
-    Assert(IsUnderPostmaster);
+    Assert(dsm_is_available());
 
     if (!dsm_init_done)
         dsm_backend_startup();
@@ -537,8 +554,7 @@ dsm_attach(dsm_handle h)
     uint32        i;
     uint32        nitems;
 
-    /* Unsafe in postmaster (and pointless in a stand-alone backend). */
-    Assert(IsUnderPostmaster);
+    Assert(dsm_is_available());
 
     if (!dsm_init_done)
         dsm_backend_startup();
diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
index aeda32c9c5..e84275d4c2 100644
--- a/src/backend/storage/ipc/dsm_impl.c
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -61,8 +61,8 @@
 #ifdef HAVE_SYS_SHM_H
 #include <sys/shm.h>
 #endif
+#include "bestatus.h"
 #include "common/file_perm.h"
-#include "pgstat.h"
 
 #include "portability/mem.h"
 #include "storage/dsm_impl.h"
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 5965d3620f..97bca9be24 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -150,6 +150,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -281,8 +282,13 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 
     /* Initialize dynamic shared memory facilities. */
     if (!IsUnderPostmaster)
+    {
         dsm_postmaster_startup(shim);
 
+        /* Stats collector uses dynamic shared memory */
+        StatsShmemInit();
+    }
+
     /*
      * Now give loadable modules a chance to set up their shmem allocations
      */
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 7da337d11f..97526f1c72 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -43,8 +43,8 @@
 #include <poll.h>
 #endif
 
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "port/atomics.h"
 #include "portability/instr_time.h"
 #include "postmaster/postmaster.h"
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index cf93357997..e893984383 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -51,9 +51,9 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/spin.h"
diff --git a/src/backend/storage/ipc/shm_mq.c b/src/backend/storage/ipc/shm_mq.c
index 6e471c3e43..cfa5c9089f 100644
--- a/src/backend/storage/ipc/shm_mq.c
+++ b/src/backend/storage/ipc/shm_mq.c
@@ -18,8 +18,8 @@
 
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/bgworker.h"
 #include "storage/procsignal.h"
 #include "storage/shm_mq.h"
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 4d10e57a80..243da57c49 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -21,8 +21,8 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index 74eb449060..dd76088a29 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -25,6 +25,7 @@
  */
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "pgstat.h"
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 81dac45ae5..2cd4d5531e 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -76,8 +76,8 @@
  */
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "pg_trace.h"
 #include "postmaster/postmaster.h"
 #include "replication/slot.h"
@@ -521,6 +521,9 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_TBM, "tbm");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
+    LWLockRegisterTranche(LWTRANCHE_STATS_DSA, "stats table dsa");
+    LWLockRegisterTranche(LWTRANCHE_STATS_DB, "db stats");
+    LWLockRegisterTranche(LWTRANCHE_STATS_FUNC_TABLE, "table/func stats");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index db47843229..97eccb35d3 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -49,3 +49,4 @@ MultiXactTruncationLock                41
 OldSnapshotTimeMapLock                42
 LogicalRepWorkerLock                43
 CLogTruncationLock                    44
+StatsLock                            45
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 6fc11f26f0..a8efa7cc5f 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -194,8 +194,8 @@
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
 #include "storage/predicate_internals.h"
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 0da5b19719..a60fd02894 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -38,8 +38,8 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "replication/slot.h"
 #include "replication/syncrep.h"
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 2aba2dfe91..9e9995ae50 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -28,7 +28,7 @@
 #include "miscadmin.h"
 #include "access/xlogutils.h"
 #include "access/xlog.h"
-#include "pgstat.h"
+#include "bestatus.h"
 #include "portability/instr_time.h"
 #include "postmaster/bgwriter.h"
 #include "storage/fd.h"
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 36cfd507b2..eeda0c04f5 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -39,6 +39,7 @@
 #include "access/parallel.h"
 #include "access/printtup.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
@@ -3159,6 +3160,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_update_stat(true);
+    }
 }
 
 
@@ -3733,6 +3740,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4173,9 +4181,17 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
-                ProcessCompletedNotifies();
-                pgstat_report_stat(false);
+                long stats_timeout;
 
+                ProcessCompletedNotifies();
+
+                stats_timeout = pgstat_update_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4210,7 +4226,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4218,6 +4234,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index d330a88e3c..c0975a8259 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -21,6 +21,7 @@
 
 #include "access/sysattr.h"
 #include "access/table.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/pg_tablespace.h"
 #include "catalog/pg_type.h"
@@ -29,7 +30,6 @@
 #include "common/keywords.h"
 #include "funcapi.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "parser/scansup.h"
 #include "postmaster/syslogger.h"
 #include "rewrite/rewriteHandler.h"
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index b6ba856ebe..667e8e5560 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
@@ -33,7 +34,7 @@
 #define UINT32_ACCESS_ONCE(var)         ((uint32)(*((volatile uint32 *)&(var))))
 
 /* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
+extern PgStat_BgWriter bgwriterStats;
 
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
@@ -1193,7 +1194,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_xact_commit);
@@ -1209,7 +1210,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_xact_rollback);
@@ -1225,7 +1226,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_blocks_fetched);
@@ -1241,7 +1242,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_blocks_hit);
@@ -1257,7 +1258,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_returned);
@@ -1273,7 +1274,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_fetched);
@@ -1289,7 +1290,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_inserted);
@@ -1305,7 +1306,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_updated);
@@ -1321,7 +1322,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_deleted);
@@ -1336,7 +1337,7 @@ pg_stat_get_db_stat_reset_time(PG_FUNCTION_ARGS)
     TimestampTz result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = dbentry->stat_reset_timestamp;
@@ -1354,7 +1355,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = dbentry->n_temp_files;
@@ -1370,7 +1371,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = dbentry->n_temp_bytes;
@@ -1385,7 +1386,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_tablespace);
@@ -1400,7 +1401,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_lock);
@@ -1415,7 +1416,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_snapshot);
@@ -1430,7 +1431,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_bufferpin);
@@ -1445,7 +1446,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_startup_deadlock);
@@ -1460,7 +1461,7 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (
@@ -1480,7 +1481,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_deadlocks);
@@ -1496,7 +1497,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     PgStat_StatDBEntry *dbentry;
 
     /* convert counter from microsec to millisec for display */
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = ((double) dbentry->n_block_read_time) / 1000.0;
@@ -1512,7 +1513,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     PgStat_StatDBEntry *dbentry;
 
     /* convert counter from microsec to millisec for display */
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = ((double) dbentry->n_block_write_time) / 1000.0;
@@ -1867,6 +1868,9 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     /* Get statistics about the archiver process */
     archiver_stats = pgstat_fetch_stat_archiver();
 
+    if (archiver_stats == NULL)
+        PG_RETURN_NULL();
+
     /* Fill values and NULLs */
     values[0] = Int64GetDatum(archiver_stats->archived_count);
     if (*(archiver_stats->last_archived_wal) == '\0')
@@ -1896,6 +1900,5 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
         values[6] = TimestampTzGetDatum(archiver_stats->stat_reset_timestamp);
 
     /* Returns the record as Datum */
-    PG_RETURN_DATUM(HeapTupleGetDatum(
-                                      heap_form_tuple(tupdesc, values, nulls)));
+    PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
 }
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 5e61d908fd..2dd99f935d 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -46,11 +46,11 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/pg_tablespace.h"
 #include "catalog/storage.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/lwlock.h"
 #include "utils/inval.h"
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index fd51934aaf..994351ac2d 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index bd2e4e89d8..1eabc0f41d 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -31,12 +31,12 @@
 #endif
 
 #include "access/htup_details.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "common/file_perm.h"
 #include "libpq/libpq.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/postmaster.h"
 #include "storage/fd.h"
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index a5ee209f91..e5dca7fe03 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -26,6 +26,7 @@
 #include "access/sysattr.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/indexing.h"
 #include "catalog/namespace.h"
@@ -72,6 +73,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -628,6 +630,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -685,7 +689,10 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 
     /* Initialize stats collection --- must happen before first xact */
     if (!bootstrap)
+    {
+        pgstat_bearray_initialize();
         pgstat_initialize();
+    }
 
     /*
      * Load relcache entries for the shared system catalogs.  This must create
@@ -1238,6 +1245,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 41d477165c..fb7856517e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -33,6 +33,7 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
 #include "commands/async.h"
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 3e1c3863c4..25b3b2a079 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/bestatus.h b/src/include/bestatus.h
new file mode 100644
index 0000000000..b7f6a93130
--- /dev/null
+++ b/src/include/bestatus.h
@@ -0,0 +1,555 @@
+/* ----------
+ *    bestatus.h
+ *
+ *    Definitions for the PostgreSQL backend status monitor facility
+ *
+ *    Copyright (c) 2001-2018, PostgreSQL Global Development Group
+ *
+ *    src/include/bestatus.h
+ * ----------
+ */
+#ifndef BESTATUS_H
+#define BESTATUS_H
+
+#include "datatype/timestamp.h"
+#include "libpq/pqcomm.h"
+#include "storage/proc.h"
+
+/* ----------
+ * Backend types
+ * ----------
+ */
+typedef enum BackendType
+{
+    B_AUTOVAC_LAUNCHER,
+    B_AUTOVAC_WORKER,
+    B_BACKEND,
+    B_BG_WORKER,
+    B_BG_WRITER,
+    B_CHECKPOINTER,
+    B_STARTUP,
+    B_WAL_RECEIVER,
+    B_WAL_SENDER,
+    B_WAL_WRITER,
+    B_ARCHIVER
+} BackendType;
+
+
+/* ----------
+ * Backend states
+ * ----------
+ */
+typedef enum BackendState
+{
+    STATE_UNDEFINED,
+    STATE_IDLE,
+    STATE_RUNNING,
+    STATE_IDLEINTRANSACTION,
+    STATE_FASTPATH,
+    STATE_IDLEINTRANSACTION_ABORTED,
+    STATE_DISABLED
+} BackendState;
+
+
+/* ----------
+ * Wait Classes
+ * ----------
+ */
+#define PG_WAIT_LWLOCK                0x01000000U
+#define PG_WAIT_LOCK                0x03000000U
+#define PG_WAIT_BUFFER_PIN            0x04000000U
+#define PG_WAIT_ACTIVITY            0x05000000U
+#define PG_WAIT_CLIENT                0x06000000U
+#define PG_WAIT_EXTENSION            0x07000000U
+#define PG_WAIT_IPC                    0x08000000U
+#define PG_WAIT_TIMEOUT                0x09000000U
+#define PG_WAIT_IO                    0x0A000000U
+
+/* ----------
+ * Wait Events - Activity
+ *
+ * Use this category when a process is waiting because it has no work to do,
+ * unless the "Client" or "Timeout" category describes the situation better.
+ * Typically, this should only be used for background processes.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_ARCHIVER_MAIN = PG_WAIT_ACTIVITY,
+    WAIT_EVENT_AUTOVACUUM_MAIN,
+    WAIT_EVENT_BGWRITER_HIBERNATE,
+    WAIT_EVENT_BGWRITER_MAIN,
+    WAIT_EVENT_CHECKPOINTER_MAIN,
+    WAIT_EVENT_LOGICAL_APPLY_MAIN,
+    WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
+    WAIT_EVENT_RECOVERY_WAL_ALL,
+    WAIT_EVENT_RECOVERY_WAL_STREAM,
+    WAIT_EVENT_SYSLOGGER_MAIN,
+    WAIT_EVENT_WAL_RECEIVER_MAIN,
+    WAIT_EVENT_WAL_SENDER_MAIN,
+    WAIT_EVENT_WAL_WRITER_MAIN
+} WaitEventActivity;
+
+/* ----------
+ * Wait Events - Client
+ *
+ * Use this category when a process is waiting to send data to or receive data
+ * from the frontend process to which it is connected.  This is never used for
+ * a background process, which has no client connection.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_CLIENT_READ = PG_WAIT_CLIENT,
+    WAIT_EVENT_CLIENT_WRITE,
+    WAIT_EVENT_LIBPQWALRECEIVER_CONNECT,
+    WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE,
+    WAIT_EVENT_SSL_OPEN_SERVER,
+    WAIT_EVENT_WAL_RECEIVER_WAIT_START,
+    WAIT_EVENT_WAL_SENDER_WAIT_WAL,
+    WAIT_EVENT_WAL_SENDER_WRITE_DATA
+} WaitEventClient;
+
+/* ----------
+ * Wait Events - IPC
+ *
+ * Use this category when a process cannot complete the work it is doing because
+ * it is waiting for a notification from another process.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_BGWORKER_SHUTDOWN = PG_WAIT_IPC,
+    WAIT_EVENT_BGWORKER_STARTUP,
+    WAIT_EVENT_BTREE_PAGE,
+    WAIT_EVENT_CLOG_GROUP_UPDATE,
+    WAIT_EVENT_EXECUTE_GATHER,
+    WAIT_EVENT_HASH_BATCH_ALLOCATING,
+    WAIT_EVENT_HASH_BATCH_ELECTING,
+    WAIT_EVENT_HASH_BATCH_LOADING,
+    WAIT_EVENT_HASH_BUILD_ALLOCATING,
+    WAIT_EVENT_HASH_BUILD_ELECTING,
+    WAIT_EVENT_HASH_BUILD_HASHING_INNER,
+    WAIT_EVENT_HASH_BUILD_HASHING_OUTER,
+    WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING,
+    WAIT_EVENT_HASH_GROW_BATCHES_DECIDING,
+    WAIT_EVENT_HASH_GROW_BATCHES_ELECTING,
+    WAIT_EVENT_HASH_GROW_BATCHES_FINISHING,
+    WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING,
+    WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING,
+    WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING,
+    WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING,
+    WAIT_EVENT_LOGICAL_SYNC_DATA,
+    WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
+    WAIT_EVENT_MQ_INTERNAL,
+    WAIT_EVENT_MQ_PUT_MESSAGE,
+    WAIT_EVENT_MQ_RECEIVE,
+    WAIT_EVENT_MQ_SEND,
+    WAIT_EVENT_PARALLEL_BITMAP_SCAN,
+    WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN,
+    WAIT_EVENT_PARALLEL_FINISH,
+    WAIT_EVENT_PROCARRAY_GROUP_UPDATE,
+    WAIT_EVENT_PROMOTE,
+    WAIT_EVENT_REPLICATION_ORIGIN_DROP,
+    WAIT_EVENT_REPLICATION_SLOT_DROP,
+    WAIT_EVENT_SAFE_SNAPSHOT,
+    WAIT_EVENT_SYNC_REP
+} WaitEventIPC;
+
+/* ----------
+ * Wait Events - Timeout
+ *
+ * Use this category when a process is waiting for a timeout to expire.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
+    WAIT_EVENT_PG_SLEEP,
+    WAIT_EVENT_RECOVERY_APPLY_DELAY
+} WaitEventTimeout;
+
+/* ----------
+ * Wait Events - IO
+ *
+ * Use this category when a process is waiting for an I/O operation.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_BUFFILE_READ = PG_WAIT_IO,
+    WAIT_EVENT_BUFFILE_WRITE,
+    WAIT_EVENT_CONTROL_FILE_READ,
+    WAIT_EVENT_CONTROL_FILE_SYNC,
+    WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
+    WAIT_EVENT_CONTROL_FILE_WRITE,
+    WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE,
+    WAIT_EVENT_COPY_FILE_READ,
+    WAIT_EVENT_COPY_FILE_WRITE,
+    WAIT_EVENT_DATA_FILE_EXTEND,
+    WAIT_EVENT_DATA_FILE_FLUSH,
+    WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC,
+    WAIT_EVENT_DATA_FILE_PREFETCH,
+    WAIT_EVENT_DATA_FILE_READ,
+    WAIT_EVENT_DATA_FILE_SYNC,
+    WAIT_EVENT_DATA_FILE_TRUNCATE,
+    WAIT_EVENT_DATA_FILE_WRITE,
+    WAIT_EVENT_DSM_FILL_ZERO_WRITE,
+    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ,
+    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC,
+    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE,
+    WAIT_EVENT_LOCK_FILE_CREATE_READ,
+    WAIT_EVENT_LOCK_FILE_CREATE_SYNC,
+    WAIT_EVENT_LOCK_FILE_CREATE_WRITE,
+    WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ,
+    WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC,
+    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC,
+    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE,
+    WAIT_EVENT_LOGICAL_REWRITE_SYNC,
+    WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE,
+    WAIT_EVENT_LOGICAL_REWRITE_WRITE,
+    WAIT_EVENT_RELATION_MAP_READ,
+    WAIT_EVENT_RELATION_MAP_SYNC,
+    WAIT_EVENT_RELATION_MAP_WRITE,
+    WAIT_EVENT_REORDER_BUFFER_READ,
+    WAIT_EVENT_REORDER_BUFFER_WRITE,
+    WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ,
+    WAIT_EVENT_REPLICATION_SLOT_READ,
+    WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
+    WAIT_EVENT_REPLICATION_SLOT_SYNC,
+    WAIT_EVENT_REPLICATION_SLOT_WRITE,
+    WAIT_EVENT_SLRU_FLUSH_SYNC,
+    WAIT_EVENT_SLRU_READ,
+    WAIT_EVENT_SLRU_SYNC,
+    WAIT_EVENT_SLRU_WRITE,
+    WAIT_EVENT_SNAPBUILD_READ,
+    WAIT_EVENT_SNAPBUILD_SYNC,
+    WAIT_EVENT_SNAPBUILD_WRITE,
+    WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC,
+    WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE,
+    WAIT_EVENT_TIMELINE_HISTORY_READ,
+    WAIT_EVENT_TIMELINE_HISTORY_SYNC,
+    WAIT_EVENT_TIMELINE_HISTORY_WRITE,
+    WAIT_EVENT_TWOPHASE_FILE_READ,
+    WAIT_EVENT_TWOPHASE_FILE_SYNC,
+    WAIT_EVENT_TWOPHASE_FILE_WRITE,
+    WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ,
+    WAIT_EVENT_WAL_BOOTSTRAP_SYNC,
+    WAIT_EVENT_WAL_BOOTSTRAP_WRITE,
+    WAIT_EVENT_WAL_COPY_READ,
+    WAIT_EVENT_WAL_COPY_SYNC,
+    WAIT_EVENT_WAL_COPY_WRITE,
+    WAIT_EVENT_WAL_INIT_SYNC,
+    WAIT_EVENT_WAL_INIT_WRITE,
+    WAIT_EVENT_WAL_READ,
+    WAIT_EVENT_WAL_SYNC,
+    WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
+    WAIT_EVENT_WAL_WRITE
+} WaitEventIO;
+
+/* ----------
+ * Command type for progress reporting purposes
+ * ----------
+ */
+typedef enum ProgressCommandType
+{
+    PROGRESS_COMMAND_INVALID,
+    PROGRESS_COMMAND_VACUUM
+} ProgressCommandType;
+
+#define PGSTAT_NUM_PROGRESS_PARAM    10
+
+/* ----------
+ * Shared-memory data structures
+ * ----------
+ */
+
+
+/*
+ * PgBackendSSLStatus
+ *
+ * For each backend, we keep the SSL status in a separate struct, that
+ * is only filled in if SSL is enabled.
+ *
+ * All char arrays must be null-terminated.
+ */
+typedef struct PgBackendSSLStatus
+{
+    /* Information about SSL connection */
+    int            ssl_bits;
+    bool        ssl_compression;
+    char        ssl_version[NAMEDATALEN];
+    char        ssl_cipher[NAMEDATALEN];
+    char        ssl_client_dn[NAMEDATALEN];
+
+    /*
+     * serial number is max "20 octets" per RFC 5280, so this size should be
+     * fine
+     */
+    char        ssl_client_serial[NAMEDATALEN];
+
+    char        ssl_issuer_dn[NAMEDATALEN];
+} PgBackendSSLStatus;
+
+
+/* ----------
+ * PgBackendStatus
+ *
+ * Each live backend maintains a PgBackendStatus struct in shared memory
+ * showing its current activity.  (The structs are allocated according to
+ * BackendId, but that is not critical.)  Note that the collector process
+ * has no involvement in, or even access to, these structs.
+ *
+ * Each auxiliary process also maintains a PgBackendStatus struct in shared
+ * memory.
+ * ----------
+ */
+typedef struct PgBackendStatus
+{
+    /*
+     * To avoid locking overhead, we use the following protocol: a backend
+     * increments st_changecount before modifying its entry, and again after
+     * finishing a modification.  A would-be reader should note the value of
+     * st_changecount, copy the entry into private memory, then check
+     * st_changecount again.  If the value hasn't changed, and if it's even,
+     * the copy is valid; otherwise start over.  This makes updates cheap
+     * while reads are potentially expensive, but that's the tradeoff we want.
+     *
+     * The above protocol needs memory barriers to ensure that the apparent
+     * order of execution is the intended one. Otherwise, for example, on a
+     * machine with weak memory ordering the CPU might reorder the code so
+     * that st_changecount is incremented twice before the modification.
+     * Such reordering would let a reader accept a torn copy as valid.
+     */
+    int            st_changecount;
+
+    /* The entry is valid iff st_procpid > 0, unused if st_procpid == 0 */
+    int            st_procpid;
+
+    /* Type of backend */
+    BackendType st_backendType;
+
+    /* Times when current backend, transaction, and activity started */
+    TimestampTz st_proc_start_timestamp;
+    TimestampTz st_xact_start_timestamp;
+    TimestampTz st_activity_start_timestamp;
+    TimestampTz st_state_start_timestamp;
+
+    /* Database OID, owning user's OID, connection client address */
+    Oid            st_databaseid;
+    Oid            st_userid;
+    SockAddr    st_clientaddr;
+    char       *st_clienthostname;    /* MUST be null-terminated */
+
+    /* Information about SSL connection */
+    bool        st_ssl;
+    PgBackendSSLStatus *st_sslstatus;
+
+    /* current state */
+    BackendState st_state;
+
+    /* application name; MUST be null-terminated */
+    char       *st_appname;
+
+    /*
+     * Current command string; MUST be null-terminated. Note that this string
+     * may be truncated in the middle of a multi-byte character. Because
+     * activity strings are stored far more often than they are read, the
+     * cost of correct truncation is deferred to the display side. Use
+     * pgstat_clip_activity() to truncate correctly.
+     */
+    char       *st_activity_raw;
+
+    /*
+     * Command progress reporting.  Any command which wishes can advertise
+     * that it is running by setting st_progress_command,
+     * st_progress_command_target, and st_progress_param[].
+     * st_progress_command_target should be the OID of the relation which the
+     * command targets (we assume there's just one, as this is meant for
+     * utility commands), but the meaning of each element in the
+     * st_progress_param array is command-specific.
+     */
+    ProgressCommandType st_progress_command;
+    Oid            st_progress_command_target;
+    int64        st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
+} PgBackendStatus;
+
+/*
+ * Macros to load and store st_changecount with memory barriers.
+ *
+ * pgstat_increment_changecount_before() and
+ * pgstat_increment_changecount_after() need to be called before and after
+ * PgBackendStatus entries are modified, respectively. This makes sure that
+ * st_changecount is incremented around the modification.
+ *
+ * Also pgstat_save_changecount_before() and pgstat_save_changecount_after()
+ * need to be called before and after PgBackendStatus entries are copied into
+ * private memory, respectively.
+ */
+#define pgstat_increment_changecount_before(beentry)    \
+    do {    \
+        beentry->st_changecount++;    \
+        pg_write_barrier(); \
+    } while (0)
+
+#define pgstat_increment_changecount_after(beentry) \
+    do {    \
+        pg_write_barrier(); \
+        beentry->st_changecount++;    \
+        Assert((beentry->st_changecount & 1) == 0); \
+    } while (0)
+
+#define pgstat_save_changecount_before(beentry, save_changecount)    \
+    do {    \
+        save_changecount = beentry->st_changecount; \
+        pg_read_barrier();    \
+    } while (0)
+
+#define pgstat_save_changecount_after(beentry, save_changecount)    \
+    do {    \
+        pg_read_barrier();    \
+        save_changecount = beentry->st_changecount; \
+    } while (0)
+
+/* ----------
+ * LocalPgBackendStatus
+ *
+ * When we build the backend status array, we use LocalPgBackendStatus to be
+ * able to add new values to the struct when needed without adding new fields
+ * to the shared memory. It contains the backend status as a first member.
+ * ----------
+ */
+typedef struct LocalPgBackendStatus
+{
+    /*
+     * Local version of the backend status entry.
+     */
+    PgBackendStatus backendStatus;
+
+    /*
+     * The xid of the current transaction if available, InvalidTransactionId
+     * if not.
+     */
+    TransactionId backend_xid;
+
+    /*
+     * The xmin of the current session if available, InvalidTransactionId if
+     * not.
+     */
+    TransactionId backend_xmin;
+} LocalPgBackendStatus;
+
+/* ----------
+ * GUC parameters
+ * ----------
+ */
+extern bool pgstat_track_activities;
+extern PGDLLIMPORT int pgstat_track_activity_query_size;
+
+/* ----------
+ * Functions called from backends
+ * ----------
+ */
+extern void pgstat_bearray_initialize(void);
+extern void pgstat_bestart(void);
+
+extern const char *pgstat_get_wait_event(uint32 wait_event_info);
+extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
+extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
+extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
+                                    int buflen);
+extern const char *pgstat_get_backend_desc(BackendType backendType);
+
+extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
+                              Oid relid);
+extern void pgstat_progress_update_param(int index, int64 val);
+extern void pgstat_progress_update_multi_param(int nparam, const int *index,
+                                   const int64 *val);
+extern void pgstat_progress_end_command(void);
+
+extern char *pgstat_clip_activity(const char *raw_activity);
+
+extern void AtEOXact_BEStatus(bool isCommit);
+extern void AtPrepare_BEStatus(void);
+/* ----------
+ * pgstat_report_wait_start() -
+ *
+ *    Called from places where a server process needs to wait.  This is
+ *    called to report wait event information.  The wait information is
+ *    stored as 4 bytes, where the first byte represents the wait event
+ *    class (the type of wait; see the PG_WAIT_* wait classes) and the next
+ *    3 bytes represent the actual wait event.  Currently 2 bytes are used
+ *    for the wait event, which is sufficient for current usage; 1 byte is
+ *    reserved for future use.
+ *
+ * NB: this *must* be able to survive being called before MyProc has been
+ * initialized.
+ * ----------
+ */
+static inline void
+pgstat_report_wait_start(uint32 wait_event_info)
+{
+    volatile PGPROC *proc = MyProc;
+
+    if (!pgstat_track_activities || !proc)
+        return;
+
+    /*
+     * Since this is a four-byte field that is always read and written as
+     * four bytes, updates are atomic.
+     */
+    proc->wait_event_info = wait_event_info;
+}
+
+/* ----------
+ * pgstat_report_wait_end() -
+ *
+ *    Called to report end of a wait.
+ *
+ * NB: this *must* be able to survive being called before MyProc has been
+ * initialized.
+ * ----------
+ */
+static inline void
+pgstat_report_wait_end(void)
+{
+    volatile PGPROC *proc = MyProc;
+
+    if (!pgstat_track_activities || !proc)
+        return;
+
+    /*
+     * Since this is a four-byte field that is always read and written as
+     * four bytes, updates are atomic.
+     */
+    proc->wait_event_info = 0;
+}
+extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
+extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
+extern int    pgstat_fetch_stat_numbackends(void);
+
+/* For shared memory allocation/initialization */
+extern Size BackendStatusShmemSize(void);
+extern void CreateSharedBackendStatus(void);
+
+extern void pgstat_report_xact_timestamp(TimestampTz tstamp);
+extern void pgstat_bestat_initialize(void);
+
+extern void pgstat_report_activity(BackendState state, const char *cmd_str);
+extern void pgstat_report_appname(const char *appname);
+extern void pgstat_report_xact_timestamp(TimestampTz tstamp);
+extern const char *pgstat_get_wait_event(uint32 wait_event_info);
+extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
+extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
+extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
+                                    int buflen);
+extern const char *pgstat_get_backend_desc(BackendType backendType);
+
+extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
+                              Oid relid);
+extern void pgstat_progress_update_param(int index, int64 val);
+extern void pgstat_progress_update_multi_param(int nparam, const int *index,
+                                   const int64 *val);
+extern void pgstat_progress_end_command(void);
+
+#endif                            /* BESTATUS_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 63a7653457..49131a6d5b 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
@@ -403,7 +404,6 @@ typedef enum
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
-
     NUM_AUXPROCTYPES            /* Must be last! */
 } AuxProcType;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 471877d2df..92c9adf48e 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL statistics collector facility.
  *
  *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
  *
@@ -13,11 +13,9 @@
 
 #include "datatype/timestamp.h"
 #include "fmgr.h"
-#include "libpq/pqcomm.h"
-#include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
-#include "storage/proc.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -41,32 +39,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -115,13 +87,6 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_blocks_hit;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -180,236 +145,12 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
 /* ----------
- * PgStat_MsgHdr                The common message header
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgHdr
+typedef struct PgStat_BgWriter
 {
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
     PgStat_Counter m_timed_checkpoints;
     PgStat_Counter m_requested_checkpoints;
     PgStat_Counter m_buf_written_checkpoints;
@@ -420,40 +161,16 @@ typedef struct PgStat_MsgBgWriter
     PgStat_Counter m_buf_alloc;
     PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
     PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+} PgStat_BgWriter;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to apply.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when applying to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -485,81 +202,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgAutovacStart msg_autovacuum;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Statistic collector data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -601,10 +245,13 @@ typedef struct PgStat_StatDBEntry
 
     /*
      * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * the handles and pointers out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables;
+    dshash_table_handle functions;
+    /* for snapshot tables */
+    HTAB *snapshot_tables;
+    HTAB *snapshot_functions;
 } PgStat_StatDBEntry;
 
 
@@ -660,7 +307,7 @@ typedef struct PgStat_StatFuncEntry
 
 
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -676,7 +323,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -694,432 +341,6 @@ typedef struct PgStat_GlobalStats
     TimestampTz stat_reset_timestamp;
 } PgStat_GlobalStats;
 
-
-/* ----------
- * Backend types
- * ----------
- */
-typedef enum BackendType
-{
-    B_AUTOVAC_LAUNCHER,
-    B_AUTOVAC_WORKER,
-    B_BACKEND,
-    B_BG_WORKER,
-    B_BG_WRITER,
-    B_ARCHIVER,
-    B_CHECKPOINTER,
-    B_STARTUP,
-    B_WAL_RECEIVER,
-    B_WAL_SENDER,
-    B_WAL_WRITER
-} BackendType;
-
-
-/* ----------
- * Backend states
- * ----------
- */
-typedef enum BackendState
-{
-    STATE_UNDEFINED,
-    STATE_IDLE,
-    STATE_RUNNING,
-    STATE_IDLEINTRANSACTION,
-    STATE_FASTPATH,
-    STATE_IDLEINTRANSACTION_ABORTED,
-    STATE_DISABLED
-} BackendState;
-
-
-/* ----------
- * Wait Classes
- * ----------
- */
-#define PG_WAIT_LWLOCK                0x01000000U
-#define PG_WAIT_LOCK                0x03000000U
-#define PG_WAIT_BUFFER_PIN            0x04000000U
-#define PG_WAIT_ACTIVITY            0x05000000U
-#define PG_WAIT_CLIENT                0x06000000U
-#define PG_WAIT_EXTENSION            0x07000000U
-#define PG_WAIT_IPC                    0x08000000U
-#define PG_WAIT_TIMEOUT                0x09000000U
-#define PG_WAIT_IO                    0x0A000000U
-
-/* ----------
- * Wait Events - Activity
- *
- * Use this category when a process is waiting because it has no work to do,
- * unless the "Client" or "Timeout" category describes the situation better.
- * Typically, this should only be used for background processes.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_ARCHIVER_MAIN = PG_WAIT_ACTIVITY,
-    WAIT_EVENT_AUTOVACUUM_MAIN,
-    WAIT_EVENT_BGWRITER_HIBERNATE,
-    WAIT_EVENT_BGWRITER_MAIN,
-    WAIT_EVENT_CHECKPOINTER_MAIN,
-    WAIT_EVENT_LOGICAL_APPLY_MAIN,
-    WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
-    WAIT_EVENT_RECOVERY_WAL_ALL,
-    WAIT_EVENT_RECOVERY_WAL_STREAM,
-    WAIT_EVENT_SYSLOGGER_MAIN,
-    WAIT_EVENT_WAL_RECEIVER_MAIN,
-    WAIT_EVENT_WAL_SENDER_MAIN,
-    WAIT_EVENT_WAL_WRITER_MAIN
-} WaitEventActivity;
-
-/* ----------
- * Wait Events - Client
- *
- * Use this category when a process is waiting to send data to or receive data
- * from the frontend process to which it is connected.  This is never used for
- * a background process, which has no client connection.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_CLIENT_READ = PG_WAIT_CLIENT,
-    WAIT_EVENT_CLIENT_WRITE,
-    WAIT_EVENT_LIBPQWALRECEIVER_CONNECT,
-    WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE,
-    WAIT_EVENT_SSL_OPEN_SERVER,
-    WAIT_EVENT_WAL_RECEIVER_WAIT_START,
-    WAIT_EVENT_WAL_SENDER_WAIT_WAL,
-    WAIT_EVENT_WAL_SENDER_WRITE_DATA
-} WaitEventClient;
-
-/* ----------
- * Wait Events - IPC
- *
- * Use this category when a process cannot complete the work it is doing because
- * it is waiting for a notification from another process.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_BGWORKER_SHUTDOWN = PG_WAIT_IPC,
-    WAIT_EVENT_BGWORKER_STARTUP,
-    WAIT_EVENT_BTREE_PAGE,
-    WAIT_EVENT_CLOG_GROUP_UPDATE,
-    WAIT_EVENT_EXECUTE_GATHER,
-    WAIT_EVENT_HASH_BATCH_ALLOCATING,
-    WAIT_EVENT_HASH_BATCH_ELECTING,
-    WAIT_EVENT_HASH_BATCH_LOADING,
-    WAIT_EVENT_HASH_BUILD_ALLOCATING,
-    WAIT_EVENT_HASH_BUILD_ELECTING,
-    WAIT_EVENT_HASH_BUILD_HASHING_INNER,
-    WAIT_EVENT_HASH_BUILD_HASHING_OUTER,
-    WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING,
-    WAIT_EVENT_HASH_GROW_BATCHES_DECIDING,
-    WAIT_EVENT_HASH_GROW_BATCHES_ELECTING,
-    WAIT_EVENT_HASH_GROW_BATCHES_FINISHING,
-    WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING,
-    WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING,
-    WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING,
-    WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING,
-    WAIT_EVENT_LOGICAL_SYNC_DATA,
-    WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
-    WAIT_EVENT_MQ_INTERNAL,
-    WAIT_EVENT_MQ_PUT_MESSAGE,
-    WAIT_EVENT_MQ_RECEIVE,
-    WAIT_EVENT_MQ_SEND,
-    WAIT_EVENT_PARALLEL_BITMAP_SCAN,
-    WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN,
-    WAIT_EVENT_PARALLEL_FINISH,
-    WAIT_EVENT_PROCARRAY_GROUP_UPDATE,
-    WAIT_EVENT_PROMOTE,
-    WAIT_EVENT_REPLICATION_ORIGIN_DROP,
-    WAIT_EVENT_REPLICATION_SLOT_DROP,
-    WAIT_EVENT_SAFE_SNAPSHOT,
-    WAIT_EVENT_SYNC_REP
-} WaitEventIPC;
-
-/* ----------
- * Wait Events - Timeout
- *
- * Use this category when a process is waiting for a timeout to expire.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
-    WAIT_EVENT_PG_SLEEP,
-    WAIT_EVENT_RECOVERY_APPLY_DELAY
-} WaitEventTimeout;
-
-/* ----------
- * Wait Events - IO
- *
- * Use this category when a process is waiting for a IO.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_BUFFILE_READ = PG_WAIT_IO,
-    WAIT_EVENT_BUFFILE_WRITE,
-    WAIT_EVENT_CONTROL_FILE_READ,
-    WAIT_EVENT_CONTROL_FILE_SYNC,
-    WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
-    WAIT_EVENT_CONTROL_FILE_WRITE,
-    WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE,
-    WAIT_EVENT_COPY_FILE_READ,
-    WAIT_EVENT_COPY_FILE_WRITE,
-    WAIT_EVENT_DATA_FILE_EXTEND,
-    WAIT_EVENT_DATA_FILE_FLUSH,
-    WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC,
-    WAIT_EVENT_DATA_FILE_PREFETCH,
-    WAIT_EVENT_DATA_FILE_READ,
-    WAIT_EVENT_DATA_FILE_SYNC,
-    WAIT_EVENT_DATA_FILE_TRUNCATE,
-    WAIT_EVENT_DATA_FILE_WRITE,
-    WAIT_EVENT_DSM_FILL_ZERO_WRITE,
-    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ,
-    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC,
-    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE,
-    WAIT_EVENT_LOCK_FILE_CREATE_READ,
-    WAIT_EVENT_LOCK_FILE_CREATE_SYNC,
-    WAIT_EVENT_LOCK_FILE_CREATE_WRITE,
-    WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ,
-    WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC,
-    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC,
-    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE,
-    WAIT_EVENT_LOGICAL_REWRITE_SYNC,
-    WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE,
-    WAIT_EVENT_LOGICAL_REWRITE_WRITE,
-    WAIT_EVENT_RELATION_MAP_READ,
-    WAIT_EVENT_RELATION_MAP_SYNC,
-    WAIT_EVENT_RELATION_MAP_WRITE,
-    WAIT_EVENT_REORDER_BUFFER_READ,
-    WAIT_EVENT_REORDER_BUFFER_WRITE,
-    WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ,
-    WAIT_EVENT_REPLICATION_SLOT_READ,
-    WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
-    WAIT_EVENT_REPLICATION_SLOT_SYNC,
-    WAIT_EVENT_REPLICATION_SLOT_WRITE,
-    WAIT_EVENT_SLRU_FLUSH_SYNC,
-    WAIT_EVENT_SLRU_READ,
-    WAIT_EVENT_SLRU_SYNC,
-    WAIT_EVENT_SLRU_WRITE,
-    WAIT_EVENT_SNAPBUILD_READ,
-    WAIT_EVENT_SNAPBUILD_SYNC,
-    WAIT_EVENT_SNAPBUILD_WRITE,
-    WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC,
-    WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE,
-    WAIT_EVENT_TIMELINE_HISTORY_READ,
-    WAIT_EVENT_TIMELINE_HISTORY_SYNC,
-    WAIT_EVENT_TIMELINE_HISTORY_WRITE,
-    WAIT_EVENT_TWOPHASE_FILE_READ,
-    WAIT_EVENT_TWOPHASE_FILE_SYNC,
-    WAIT_EVENT_TWOPHASE_FILE_WRITE,
-    WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ,
-    WAIT_EVENT_WAL_BOOTSTRAP_SYNC,
-    WAIT_EVENT_WAL_BOOTSTRAP_WRITE,
-    WAIT_EVENT_WAL_COPY_READ,
-    WAIT_EVENT_WAL_COPY_SYNC,
-    WAIT_EVENT_WAL_COPY_WRITE,
-    WAIT_EVENT_WAL_INIT_SYNC,
-    WAIT_EVENT_WAL_INIT_WRITE,
-    WAIT_EVENT_WAL_READ,
-    WAIT_EVENT_WAL_SYNC,
-    WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-    WAIT_EVENT_WAL_WRITE
-} WaitEventIO;
-
-/* ----------
- * Command type for progress reporting purposes
- * ----------
- */
-typedef enum ProgressCommandType
-{
-    PROGRESS_COMMAND_INVALID,
-    PROGRESS_COMMAND_VACUUM
-} ProgressCommandType;
-
-#define PGSTAT_NUM_PROGRESS_PARAM    10
-
-/* ----------
- * Shared-memory data structures
- * ----------
- */
-
-
-/*
- * PgBackendSSLStatus
- *
- * For each backend, we keep the SSL status in a separate struct, that
- * is only filled in if SSL is enabled.
- *
- * All char arrays must be null-terminated.
- */
-typedef struct PgBackendSSLStatus
-{
-    /* Information about SSL connection */
-    int            ssl_bits;
-    bool        ssl_compression;
-    char        ssl_version[NAMEDATALEN];
-    char        ssl_cipher[NAMEDATALEN];
-    char        ssl_client_dn[NAMEDATALEN];
-
-    /*
-     * serial number is max "20 octets" per RFC 5280, so this size should be
-     * fine
-     */
-    char        ssl_client_serial[NAMEDATALEN];
-
-    char        ssl_issuer_dn[NAMEDATALEN];
-} PgBackendSSLStatus;
-
-
-/* ----------
- * PgBackendStatus
- *
- * Each live backend maintains a PgBackendStatus struct in shared memory
- * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
- * has no involvement in, or even access to, these structs.
- *
- * Each auxiliary process also maintains a PgBackendStatus struct in shared
- * memory.
- * ----------
- */
-typedef struct PgBackendStatus
-{
-    /*
-     * To avoid locking overhead, we use the following protocol: a backend
-     * increments st_changecount before modifying its entry, and again after
-     * finishing a modification.  A would-be reader should note the value of
-     * st_changecount, copy the entry into private memory, then check
-     * st_changecount again.  If the value hasn't changed, and if it's even,
-     * the copy is valid; otherwise start over.  This makes updates cheap
-     * while reads are potentially expensive, but that's the tradeoff we want.
-     *
-     * The above protocol needs the memory barriers to ensure that the
-     * apparent order of execution is as it desires. Otherwise, for example,
-     * the CPU might rearrange the code so that st_changecount is incremented
-     * twice before the modification on a machine with weak memory ordering.
-     * This surprising result can lead to bugs.
-     */
-    int            st_changecount;
-
-    /* The entry is valid iff st_procpid > 0, unused if st_procpid == 0 */
-    int            st_procpid;
-
-    /* Type of backends */
-    BackendType st_backendType;
-
-    /* Times when current backend, transaction, and activity started */
-    TimestampTz st_proc_start_timestamp;
-    TimestampTz st_xact_start_timestamp;
-    TimestampTz st_activity_start_timestamp;
-    TimestampTz st_state_start_timestamp;
-
-    /* Database OID, owning user's OID, connection client address */
-    Oid            st_databaseid;
-    Oid            st_userid;
-    SockAddr    st_clientaddr;
-    char       *st_clienthostname;    /* MUST be null-terminated */
-
-    /* Information about SSL connection */
-    bool        st_ssl;
-    PgBackendSSLStatus *st_sslstatus;
-
-    /* current state */
-    BackendState st_state;
-
-    /* application name; MUST be null-terminated */
-    char       *st_appname;
-
-    /*
-     * Current command string; MUST be null-terminated. Note that this string
-     * possibly is truncated in the middle of a multi-byte character. As
-     * activity strings are stored more frequently than read, that allows to
-     * move the cost of correct truncation to the display side. Use
-     * pgstat_clip_activity() to truncate correctly.
-     */
-    char       *st_activity_raw;
-
-    /*
-     * Command progress reporting.  Any command which wishes can advertise
-     * that it is running by setting st_progress_command,
-     * st_progress_command_target, and st_progress_param[].
-     * st_progress_command_target should be the OID of the relation which the
-     * command targets (we assume there's just one, as this is meant for
-     * utility commands), but the meaning of each element in the
-     * st_progress_param array is command-specific.
-     */
-    ProgressCommandType st_progress_command;
-    Oid            st_progress_command_target;
-    int64        st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
-} PgBackendStatus;
-
-/*
- * Macros to load and store st_changecount with the memory barriers.
- *
- * pgstat_increment_changecount_before() and
- * pgstat_increment_changecount_after() need to be called before and after
- * PgBackendStatus entries are modified, respectively. This makes sure that
- * st_changecount is incremented around the modification.
- *
- * Also pgstat_save_changecount_before() and pgstat_save_changecount_after()
- * need to be called before and after PgBackendStatus entries are copied into
- * private memory, respectively.
- */
-#define pgstat_increment_changecount_before(beentry)    \
-    do {    \
-        beentry->st_changecount++;    \
-        pg_write_barrier(); \
-    } while (0)
-
-#define pgstat_increment_changecount_after(beentry) \
-    do {    \
-        pg_write_barrier(); \
-        beentry->st_changecount++;    \
-        Assert((beentry->st_changecount & 1) == 0); \
-    } while (0)
-
-#define pgstat_save_changecount_before(beentry, save_changecount)    \
-    do {    \
-        save_changecount = beentry->st_changecount; \
-        pg_read_barrier();    \
-    } while (0)
-
-#define pgstat_save_changecount_after(beentry, save_changecount)    \
-    do {    \
-        pg_read_barrier();    \
-        save_changecount = beentry->st_changecount; \
-    } while (0)
-
-/* ----------
- * LocalPgBackendStatus
- *
- * When we build the backend status array, we use LocalPgBackendStatus to be
- * able to add new values to the struct when needed without adding new fields
- * to the shared memory. It contains the backend status as a first member.
- * ----------
- */
-typedef struct LocalPgBackendStatus
-{
-    /*
-     * Local version of the backend status entry.
-     */
-    PgBackendStatus backendStatus;
-
-    /*
-     * The xid of the current transaction if available, InvalidTransactionId
-     * if not.
-     */
-    TransactionId backend_xid;
-
-    /*
-     * The xmin of the current session if available, InvalidTransactionId if
-     * not.
-     */
-    TransactionId backend_xmin;
-} LocalPgBackendStatus;
-
 /*
  * Working state needed to accumulate per-function-call timing statistics.
  */
@@ -1141,18 +362,18 @@ typedef struct PgStat_FunctionCallUsage
  * GUC parameters
  * ----------
  */
-extern bool pgstat_track_activities;
 extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
-extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used; to be removed together with the GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1164,34 +385,20 @@ extern PgStat_Counter pgStatBlockWriteTime;
  * Functions called from postmaster
  * ----------
  */
-extern Size BackendStatusShmemSize(void);
-extern void CreateSharedBackendStatus(void);
-
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
-
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_update_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
-extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
-extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
-
+extern void pgstat_reset_shared_counters(const char *target);
+extern void pgstat_reset_single_counter(Oid objectid,
+                                        PgStat_Single_Reset_Type type);
 extern void pgstat_report_autovac(Oid dboid);
 extern void pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples);
@@ -1202,87 +409,20 @@ extern void pgstat_report_analyze(Relation rel,
 extern void pgstat_report_recovery_conflict(int reason);
 extern void pgstat_report_deadlock(void);
 
+extern void pgstat_clear_snapshot(void);
+
 extern void pgstat_initialize(void);
+extern void pgstat_bearray_initialize(void);
 extern void pgstat_bestart(void);
 
-extern void pgstat_report_activity(BackendState state, const char *cmd_str);
-extern void pgstat_report_tempfile(size_t filesize);
-extern void pgstat_report_appname(const char *appname);
-extern void pgstat_report_xact_timestamp(TimestampTz tstamp);
-extern const char *pgstat_get_wait_event(uint32 wait_event_info);
-extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
-extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
-extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
-                                    int buflen);
-extern const char *pgstat_get_backend_desc(BackendType backendType);
-
-extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
-                              Oid relid);
-extern void pgstat_progress_update_param(int index, int64 val);
-extern void pgstat_progress_update_multi_param(int nparam, const int *index,
-                                   const int64 *val);
-extern void pgstat_progress_end_command(void);
-
 extern PgStat_TableStatus *find_tabstat_entry(Oid rel_id);
 extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 
 extern void pgstat_initstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
-
-/* ----------
- * pgstat_report_wait_start() -
- *
- *    Called from places where server process needs to wait.  This is called
- *    to report wait event information.  The wait information is stored
- *    as 4-bytes where first byte represents the wait event class (type of
- *    wait, for different types of wait, refer WaitClass) and the next
- *    3-bytes represent the actual wait event.  Currently 2-bytes are used
- *    for wait event which is sufficient for current usage, 1-byte is
- *    reserved for future usage.
- *
- * NB: this *must* be able to survive being called before MyProc has been
- * initialized.
- * ----------
- */
-static inline void
-pgstat_report_wait_start(uint32 wait_event_info)
-{
-    volatile PGPROC *proc = MyProc;
-
-    if (!pgstat_track_activities || !proc)
-        return;
-
-    /*
-     * Since this is a four-byte field which is always read and written as
-     * four-bytes, updates are atomic.
-     */
-    proc->wait_event_info = wait_event_info;
-}
-
-/* ----------
- * pgstat_report_wait_end() -
- *
- *    Called to report end of a wait.
- *
- * NB: this *must* be able to survive being called before MyProc has been
- * initialized.
- * ----------
- */
-static inline void
-pgstat_report_wait_end(void)
-{
-    volatile PGPROC *proc = MyProc;
-
-    if (!pgstat_track_activities || !proc)
-        return;
-
-    /*
-     * Since this is a four-byte field which is always read and written as
-     * four-bytes, updates are atomic.
-     */
-    proc->wait_event_info = 0;
-}
+extern PgStat_StatDBEntry *backend_get_db_entry(Oid dbid, bool oneshot);
+extern PgStat_StatTabEntry *backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid relid, bool oneshot);
 
 /* nontransactional event counts are simple enough to inline */
 
@@ -1348,21 +488,30 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                           void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
+extern void pgstat_update_archiver(const char *xlog, bool failed);
+extern void pgstat_update_bgwriter(void);
+
+extern void pgstat_report_tempfile(size_t filesize);
 
 /* ----------
  * Support functions for the SQL-callable functions to
  * generate the pgstat* views.
  * ----------
  */
-extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
+extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid, bool oneshot);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
-extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
-extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
-extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
 
+/* File input/output functions  */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
+
+/* For shared memory allocation/initialization */
+extern Size BackendStatusShmemSize(void);
+extern void CreateSharedBackendStatus(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/dsm.h b/src/include/storage/dsm.h
index 7c44f4a6e7..c37ec33e9b 100644
--- a/src/include/storage/dsm.h
+++ b/src/include/storage/dsm.h
@@ -26,6 +26,7 @@ typedef struct dsm_segment dsm_segment;
 struct PGShmemHeader;            /* avoid including pg_shmem.h */
 extern void dsm_cleanup_using_control_segment(dsm_handle old_control_handle);
 extern void dsm_postmaster_startup(struct PGShmemHeader *);
+extern void dsm_child_init(void);
 extern void dsm_backend_shutdown(void);
 extern void dsm_detach_all(void);
 
@@ -33,6 +34,8 @@ extern void dsm_detach_all(void);
 extern void dsm_set_control_handle(dsm_handle h);
 #endif
 
+extern bool dsm_is_available(void);
+
 /* Functions that create or remove mappings. */
 extern dsm_segment *dsm_create(Size size, int flags);
 extern dsm_segment *dsm_attach(dsm_handle h);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 96c7732006..daa269f816 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -219,6 +219,9 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SHARED_TUPLESTORE,
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
+    LWTRANCHE_STATS_DSA,
+    LWTRANCHE_STATS_DB,
+    LWTRANCHE_STATS_FUNC_TABLE,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 9244a2a7b7..a9b625211b 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
diff --git a/src/test/modules/worker_spi/worker_spi.c b/src/test/modules/worker_spi/worker_spi.c
index c1878dd694..7391e05f37 100644
--- a/src/test/modules/worker_spi/worker_spi.c
+++ b/src/test/modules/worker_spi/worker_spi.c
@@ -290,7 +290,7 @@ worker_spi_main(Datum main_arg)
         SPI_finish();
         PopActiveSnapshot();
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(false);
         pgstat_report_activity(STATE_IDLE, NULL);
     }
 
-- 
2.16.3

From 86935bf26a4aebbf03715de6a8b98104a2094b2f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 27 Nov 2018 14:42:12 +0900
Subject: [PATCH 5/5] Remove the GUC stats_temp_directory

This GUC used to specify the directory in which temporary statistics
files were stored. It is no longer needed by the stats collector but is
still used by the programs in bin and contrib, and maybe by other
extensions. Thus this patch removes the GUC, but some backing variables
and macro definitions are left alone for backward compatibility.
---
 doc/src/sgml/backup.sgml                      |  2 --
 doc/src/sgml/config.sgml                      | 19 -------------
 doc/src/sgml/monitoring.sgml                  |  7 +----
 doc/src/sgml/storage.sgml                     |  3 +-
 src/backend/replication/basebackup.c          | 13 ++-------
 src/backend/statmon/pgstat.c                  | 13 ++++-----
 src/backend/utils/misc/guc.c                  | 41 ---------------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/pgstat.h                          |  5 +++-
 9 files changed, 14 insertions(+), 90 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index a73fd4d044..95285809c2 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1119,8 +1119,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 07b847a8e9..ae226b9e3d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6709,25 +6709,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 7a84f51340..b37a2fd165 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -197,12 +197,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
+   <productname>PostgreSQL</productname> processes through shared memory.
    When the server shuts down cleanly, a permanent copy of the statistics
    data is stored in the <filename>pg_stat</filename> subdirectory, so that
    statistics can be retained across server restarts.  When recovery is
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index cbdad0c3fb..133eb3ff19 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -122,8 +122,7 @@ Item
 
 <row>
  <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
+ <entry>Subdirectory containing ephemeral files for extensions</entry>
 </row>
 
 <row>
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index e30b2dbcf0..a567aacf73 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -231,11 +231,8 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -266,13 +263,9 @@ perform_base_backup(basebackup_options *opt)
          * Calculate the relative path of temporary statistics directory in
          * order to skip the files which are located in that directory later.
          */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
+
+        Assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
+        statrelpath = psprintf("./%s", PG_STAT_TMP_DIR);
 
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
diff --git a/src/backend/statmon/pgstat.c b/src/backend/statmon/pgstat.c
index 66fb1e9341..1bc2557388 100644
--- a/src/backend/statmon/pgstat.c
+++ b/src/backend/statmon/pgstat.c
@@ -85,15 +85,12 @@ typedef enum PgStat_TableLookupState;
 bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 
-/* ----------
- * Built from GUC parameter
- * ----------
+/*
+ * This used to be a GUC variable and is no longer used in this file, but is
+ * left alone with its default value just for backward compatibility with
+ * extensions.
  */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+char       *pgstat_stat_directory = PG_STAT_TMP_DIR;
 
 /* Shared stats bootstrap infomation */
 typedef struct StatsShmemStruct {
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index fb7856517e..ed4a14587c 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -192,7 +192,6 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static void assign_effective_io_concurrency(int newval, void *extra);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -3988,17 +3987,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11002,35 +10990,6 @@ assign_effective_io_concurrency(int newval, void *extra)
 #endif                            /* USE_PREFETCH */
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ad6c436f93..c7648dcb47 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -554,7 +554,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 92c9adf48e..9ab6df5fae 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -28,7 +28,10 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
+/*
+ * This used to be the directory to store temporary statistics data in but is
+ * no longer used. Defined here for backward compatibility.
+ */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
 
 /* Values for track_functions GUC variable --- order is significant! */
-- 
2.16.3


Re: shared-memory based stats collector

From
Andres Freund
Date:
Hi,

On 2019-02-15 17:29:00 +0900, Kyotaro HORIGUCHI wrote:
> At Thu, 7 Feb 2019 13:10:08 -0800, Andres Freund <andres@anarazel.de> wrote in
> <20190207211008.nc3axviivmcoaluq@alap3.anarazel.de>
> > Hi,
> > 
> > On 2018-11-12 20:10:42 +0900, Kyotaro HORIGUCHI wrote:
> > > diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> > > index 7eed5866d2..e52ae54821 100644
> > > --- a/src/backend/access/transam/xlog.c
> > > +++ b/src/backend/access/transam/xlog.c
> > > @@ -8587,9 +8587,9 @@ LogCheckpointEnd(bool restartpoint)
> > >                          &sync_secs, &sync_usecs);
> > >  
> > >      /* Accumulate checkpoint timing summary data, in milliseconds. */
> > > -    BgWriterStats.m_checkpoint_write_time +=
> > > +    BgWriterStats.checkpoint_write_time +=
> > >          write_secs * 1000 + write_usecs / 1000;
> > > -    BgWriterStats.m_checkpoint_sync_time +=
> > > +    BgWriterStats.checkpoint_sync_time +=
> > >          sync_secs * 1000 + sync_usecs / 1000;
> > 
> > Why does this patch do renames like this in the same entry as actual
> > functional changes?
> 
> Just because it is no longer "messages". I'm ok to preserve them
> as historical names. Reverted.

It's fine to do such renames, just do them as separate patches. It's
hard enough to review changes this big...


> > >  /*
> > >   * Structures in which backends store per-table info that's waiting to be
> > > @@ -189,18 +189,14 @@ typedef struct TabStatHashEntry
> > >   * Hash table for O(1) t_id -> tsa_entry lookup
> > >   */
> > >  static HTAB *pgStatTabHash = NULL;
> > > +static HTAB *pgStatPendingTabHash = NULL;
> > >  
> > >  /*
> > >   * Backends store per-function info that's waiting to be sent to the collector
> > >   * in this hash table (indexed by function OID).
> > >   */
> > >  static HTAB *pgStatFunctions = NULL;
> > > -
> > > -/*
> > > - * Indicates if backend has some function stats that it hasn't yet
> > > - * sent to the collector.
> > > - */
> > > -static bool have_function_stats = false;
> > > +static HTAB *pgStatPendingFunctions = NULL;
> > 
> > So this patch leaves us with a pgStatFunctions that has a comment
> > explaining it's about "waiting to be sent" stats, and then additionally
> > a pgStatPendingFunctions?
> 
> Mmm. Thanks. I changed the comment and separated pgStatPending*
> stuff from there and merged with pgstat_pending_*. And unified
> the naming.

I think my point is larger than that - I don't see why the pending
hashtables are needed at all. They seem purely superfluous.


> > 
> > > +    if (cxt->tabhash)
> > > +        dshash_detach(cxt->tabhash);
> > 
> > Huh, why do we detach here?
> 
> To release the lock on cxt->dbentry. It may be destroyed.

Uh, how?



> - Separate shared database stats from db_stats hash.
> 
> - Consider relaxing dbentry locking.
> 
> - Try removing pgStatPendingFunctions
> 
> - ispell on it.

Additionally:
- consider getting rid of all the pending stuff, not just for functions,
  - as far as I can tell it's unnecessary

Thanks,

Andres


Re: shared-memory based stats collector

From
Alvaro Herrera
Date:
On 2019-Feb-15, Andres Freund wrote:

> It's fine to do such renames, just do them as separate patches. It's
> hard enough to review changes this big...

Talk about moving the whole file to another subdir ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: shared-memory based stats collector

From
Kyotaro HORIGUCHI
Date:
At Fri, 15 Feb 2019 17:29:00 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20190215.172900.84235698.horiguchi.kyotaro@lab.ntt.co.jp>
> > I don't think this is all that close to being committable :(
> 
> I'm going to work harder on this. The remaining tasks just now are
> as follows:
> 
> - Separate shared database stats from db_stats hash.

In the end I didn't do that. It led to more complexity.

> - Consider relaxing dbentry locking.

The dshash lock on a dbentry was useless for protecting it from DROP
DATABASE, so I relaxed the locking on dbentry so that the dshash lock
is released immediately after fetching it.  On the other hand, the
table and function counter hashes are simply destroyed at the time of
a counter reset, and that requires some kind of arbitration. I could
introduce dshash_reset(), but it would require many LWLocks, which
would be too much. Instead, I introduced two sets of hash handles
and a reference counter in PgStat_StatDBEntry to stash away a hash
that is to be removed but is still being accessed. pin_hashes(),
unpin_hashes(), and reset_dbentry_counters() implement that.

As a result, dbentries are no longer protected by the dshash partition
lock on updates, so every dbentry instead has its own LWLock for
that. (tabentries/funcentries are still protected by dshash.)

pgstat_apply_tabstats() now runs in a single pass. Previously it ran
in two passes, one for the shared database and one for my database.
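
The stash-and-pin idea described above can be sketched as a toy model. This is only an illustration under assumptions, not the code in the patch: the ToyDBEntry struct and the toy_* functions are hypothetical, plain integers stand in for dshash handles, and the per-dbentry LWLock that would have to be held around each call is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a dshash table handle; 0 means "none". */
typedef int toy_handle;
#define TOY_INVALID_HANDLE 0

typedef struct ToyDBEntry
{
    int         generation;     /* bumped on every counter reset */
    int         refcnt;         /* pins on the current hash */
    toy_handle  tables;         /* current hash */
    int         prev_refcnt;    /* pins still held on the stashed hash */
    toy_handle  prev_tables;    /* stashed hash waiting to be destroyed */
    toy_handle  next_handle;    /* next handle to hand out (toy allocator) */
    int         destroyed;      /* hashes destroyed so far, for illustration */
} ToyDBEntry;

static void
toy_init(ToyDBEntry *db)
{
    db->generation = 0;
    db->refcnt = 0;
    db->tables = 1;
    db->prev_refcnt = 0;
    db->prev_tables = TOY_INVALID_HANDLE;
    db->next_handle = 2;
    db->destroyed = 0;
}

/* Pin the current hash; the returned generation must be passed to unpin. */
static int
toy_pin_hashes(ToyDBEntry *db, toy_handle *tables)
{
    db->refcnt++;
    *tables = db->tables;
    return db->generation;
}

static void
toy_unpin_hashes(ToyDBEntry *db, int generation)
{
    if (generation == db->generation)
    {
        db->refcnt--;
        return;
    }

    /* The pin belongs to the stashed hash; the last unpin destroys it. */
    assert(generation == db->generation - 1);
    if (--db->prev_refcnt == 0)
    {
        db->prev_tables = TOY_INVALID_HANDLE;
        db->destroyed++;
    }
}

/* Returns false if an older stashed hash is still pinned. */
static bool
toy_reset_dbentry_counters(ToyDBEntry *db)
{
    if (db->prev_refcnt > 0)
        return false;

    if (db->refcnt > 0)
    {
        /* Someone is still reading the current hash: stash it. */
        db->prev_tables = db->tables;
        db->prev_refcnt = db->refcnt;
    }
    else
        db->destroyed++;        /* nobody is looking: destroy right away */

    db->tables = db->next_handle++;
    db->refcnt = 0;
    db->generation++;
    return true;
}
```

A reader pins before touching the table/function hash and unpins afterwards; a reset that arrives in between swaps in a fresh hash and leaves the old one alive until the last reader unpins it. Refusing a second reset while a stashed hash is still pinned is one possible policy chosen for the sketch.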

We could eliminate pgStatPendingTabHash, but manipulating TSA

I'm trying to remove pgStatPendingTabHash, but it doesn't work yet. I'll
include it in the next version.

> - Try removing pgStatPendingFunctions

Done. pgStatPendingDeadLocks and pgStatPendingTempfiles are also
removed.

> - ispell on it.

I fixed many misspellings.

- Fixed several silly mistakes in the previous version.

I'll post the next version soon.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From b202e8a43c13514925769b8dd125dc702ff3be8e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:41:04 +0900
Subject: [PATCH 1/5] sequential scan for dshash

Add sequential scan feature to dshash.
---
 src/backend/lib/dshash.c | 188 ++++++++++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  23 +++++-
 2 files changed, 206 insertions(+), 5 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index f095196fb6..d1908a6137 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -112,6 +112,7 @@ struct dshash_table
     size_t        size_log2;        /* log2(number of buckets) */
     bool        find_locked;    /* Is any partition lock held by 'find'? */
     bool        find_exclusively_locked;    /* ... exclusively? */
+    bool        seqscan_running;/* now under sequential scan */
 };
 
 /* Given a pointer to an item, find the entry (user data) it holds. */
@@ -127,6 +128,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +158,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -228,6 +237,7 @@ dshash_create(dsa_area *area, const dshash_parameters *params, void *arg)
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
 
     /*
      * Set up the initial array of buckets.  Our initial size is the same as
@@ -279,6 +289,7 @@ dshash_attach(dsa_area *area, const dshash_parameters *params,
     hash_table->control = dsa_get_address(area, control);
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
     Assert(hash_table->control->magic == DSHASH_MAGIC);
 
     /*
@@ -324,7 +335,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
                                 LW_EXCLUSIVE));
 
     delete_item(hash_table, item);
-    hash_table->find_locked = false;
-    hash_table->find_exclusively_locked = false;
-    LWLockRelease(PARTITION_LOCK(hash_table, partition));
+
+    /* We need to keep the partition lock during a sequential scan */
+    if (!hash_table->seqscan_running)
+    {
+        hash_table->find_locked = false;
+        hash_table->find_exclusively_locked = false;
+        LWLockRelease(PARTITION_LOCK(hash_table, partition));
+    }
 }
 
 /*
@@ -568,6 +584,8 @@ dshash_release_lock(dshash_table *hash_table, void *entry)
     Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition_index),
                                 hash_table->find_exclusively_locked
                                 ? LW_EXCLUSIVE : LW_SHARED));
+    /* lock is under control of sequential scan */
+    Assert(!hash_table->seqscan_running);
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
@@ -592,6 +610,168 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should be called if and only if the scan is abandoned
+ * before completion; if dshash_seq_next returns NULL then it has already done
+ * the end-of-scan cleanup.
+ *
+ * On returning element, it is locked as is the case with dshash_find.
+ * However, the caller must not release the lock. The lock is released as
+ * necessary in continued scan.
+ *
+ * As opposed to the equivalent for dynanash, the caller is not supposed to
+ * delete the returned element before continuing the scan.
+ *
+ * If consistent is set for dshash_seq_init, the whole hash table is
+ * non-exclusively locked. Otherwise a part of the hash table is locked in the
+ * same mode (partition lock).
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool consistent, bool exclusive)
+{
+    /* allowed at most one scan at once */
+    Assert(!hash_table->seqscan_running);
+
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->consistent = consistent;
+    status->exclusive = exclusive;
+    hash_table->seqscan_running = true;
+
+    /*
+     * Protect all partitions from modification if the caller wants a
+     * consistent result.
+     */
+    if (consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+        {
+            Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i)));
+
+            LWLockAcquire(PARTITION_LOCK(hash_table, i),
+                          exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        }
+        ensure_valid_bucket_pointers(hash_table);
+    }
+}
+
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    Assert(status->hash_table->seqscan_running);
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert (status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        if (!status->consistent)
+        {
+            partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+            LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            status->curpartition = partition;
+
+            /* resize doesn't happen from now until seq scan ends */
+            status->nbuckets =
+                NUM_BUCKETS(status->hash_table->control->size_log2);
+            ensure_valid_bucket_pointers(status->hash_table);
+        }
+        
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            dshash_seq_term(status);
+            return NULL;
+        }
+
+        /* Also move partition lock if needed */
+        if (!status->consistent)
+        {
+            int next_partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+
+            /* Move lock along with partition for the bucket */
+            if (status->curpartition != next_partition)
+            {
+                /*
+                 * Take lock on the next partition then release the current,
+                 * not in the reverse order. This is required to avoid
+                 * resizing from happening during a sequential scan. Locks are
+                 * taken in partition order so no dead lock happen with other
+                 * seq scans or resizing.
+                 */
+                LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                             next_partition),
+                              status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+                LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                             status->curpartition));
+                status->curpartition = next_partition;
+            }
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * This item can be deleted by the caller. Store the next item for the
+     * next iteration for the occasion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    Assert(status->hash_table->seqscan_running);
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+    status->hash_table->seqscan_running = false;
+
+    if (status->consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+            LWLockRelease(PARTITION_LOCK(status->hash_table, i));
+    }
+    else if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index e5dfd57f0a..b80f3af995 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,23 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state of dshash. The detail is exposed since the storage
+ * size should be known to users but it should be considered as an opaque
+ * type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;
+    int                    curbucket;
+    int                    nbuckets;
+    dshash_table_item  *curitem;
+    dsa_pointer            pnextitem;
+    int                    curpartition;
+    bool                consistent;
+    bool                exclusive;
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
               const dshash_parameters *params,
@@ -70,7 +87,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
 extern void dshash_detach(dshash_table *hash_table);
 extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
 extern void dshash_destroy(dshash_table *hash_table);
-
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
             const void *key, bool exclusive);
@@ -80,6 +96,11 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool consistent, bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.16.3
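
The partition hand-over that dshash_seq_next() performs in the non-consistent case (take the lock on the next partition, then release the current one) can be illustrated with a small self-contained model. This is a sketch under assumptions: the TOY_* macros mirror the arithmetic of NUM_SPLITS, NUM_BUCKETS and PARTITION_FOR_BUCKET_INDEX from the patch but use a made-up partition count of 4, and the "locks" are just flags in a hypothetical ToyScanTrace struct rather than LWLocks.

```c
#include <assert.h>
#include <stddef.h>

/* Toy partition count; not the value dshash itself uses. */
#define TOY_NUM_PARTITIONS_LOG2 2
#define TOY_NUM_PARTITIONS (1 << TOY_NUM_PARTITIONS_LOG2)

/* Same arithmetic as the macros in the patch. */
#define TOY_NUM_SPLITS(size_log2) ((size_log2) - TOY_NUM_PARTITIONS_LOG2)
#define TOY_NUM_BUCKETS(size_log2) (((size_t) 1) << (size_log2))
#define TOY_PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2) \
    ((bucket_idx) >> TOY_NUM_SPLITS(size_log2))

typedef struct ToyScanTrace
{
    int         held[TOY_NUM_PARTITIONS];   /* 1 while the "lock" is held */
    int         max_held;       /* peak number of locks held at once */
    int         acquisitions;   /* total number of acquisitions */
} ToyScanTrace;

static void
toy_acquire(ToyScanTrace *t, int partition)
{
    int         i;
    int         n = 0;

    assert(!t->held[partition]);
    t->held[partition] = 1;
    t->acquisitions++;

    for (i = 0; i < TOY_NUM_PARTITIONS; i++)
        n += t->held[i];
    if (n > t->max_held)
        t->max_held = n;
}

static void
toy_release(ToyScanTrace *t, int partition)
{
    assert(t->held[partition]);
    t->held[partition] = 0;
}

/* Walk all buckets in index order, moving the lock along with the bucket. */
static void
toy_scan(ToyScanTrace *t, int size_log2)
{
    size_t      nbuckets = TOY_NUM_BUCKETS(size_log2);
    size_t      bucket;
    int         cur = -1;

    assert(size_log2 >= TOY_NUM_PARTITIONS_LOG2);

    for (bucket = 0; bucket < nbuckets; bucket++)
    {
        int         next = (int) TOY_PARTITION_FOR_BUCKET_INDEX(bucket, size_log2);

        if (next != cur)
        {
            /* Acquire the next partition first, then drop the current one. */
            toy_acquire(t, next);
            if (cur >= 0)
                toy_release(t, cur);
            cur = next;
        }

        /* ... the items chained off this bucket would be visited here ... */
    }

    if (cur >= 0)
        toy_release(t, cur);
}
```

Because buckets of one partition are contiguous in index order, the scan takes each partition lock exactly once, in ascending order, and never holds more than two at a time.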

From 951a6afc196796b37da46b2d933b1c220379311d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 27 Sep 2018 11:15:19 +0900
Subject: [PATCH 2/5] Add conditional lock feature to dshash

Dshash currently waits for a lock unconditionally. This commit adds new
interfaces for dshash_find and dshash_find_or_insert. The new
interfaces take an extra parameter "nowait" that tells them not to wait
for the lock.
---
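The calling convention of the nowait variant can be illustrated with a self-contained model. This is a sketch under assumptions and not the dshash code: ToyTable and the toy_* functions are hypothetical, a single C11 atomic_flag stands in for a partition LWLock (so there is no shared/exclusive distinction), and the table holds exactly one key.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyTable
{
    atomic_flag lock;           /* stand-in for one partition lock */
    int         key;
    int         value;
} ToyTable;

static void
toy_table_init(ToyTable *tab, int key, int value)
{
    atomic_flag_clear(&tab->lock);
    tab->key = key;
    tab->value = value;
}

/*
 * Returns a pointer to the value with the lock still held, or NULL.  With
 * nowait, NULL can mean either "key absent" or "lock busy"; *lock_acquired
 * tells the two cases apart.
 */
static int *
toy_find_extended(ToyTable *tab, int key, bool nowait, bool *lock_acquired)
{
    /* Asking for the lock status only makes sense together with nowait. */
    assert(nowait || lock_acquired == NULL);

    if (nowait)
    {
        if (atomic_flag_test_and_set(&tab->lock))
        {
            /* Someone else holds the lock: give up immediately. */
            if (lock_acquired)
                *lock_acquired = false;
            return NULL;
        }
    }
    else
    {
        /* Unconditional wait, modelled as a spin. */
        while (atomic_flag_test_and_set(&tab->lock))
            ;
    }

    if (lock_acquired)
        *lock_acquired = true;

    if (tab->key == key)
        return &tab->value;     /* found: the caller releases the lock */

    atomic_flag_clear(&tab->lock);  /* not found: release before returning */
    return NULL;
}

static void
toy_release_lock(ToyTable *tab)
{
    atomic_flag_clear(&tab->lock);
}
```

A caller that must not block passes nowait = true and checks the flag when it gets NULL back: false means "try again later", true means the key really is not there.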
 src/backend/lib/dshash.c | 69 +++++++++++++++++++++++++++++++++++++++++++-----
 src/include/lib/dshash.h |  6 ++++-
 2 files changed, 67 insertions(+), 8 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index d1908a6137..db8d6899af 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -394,19 +394,48 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  */
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
+{
+    return dshash_find_extended(hash_table, key, exclusive, false, NULL);
+}
+
+/*
+ * Like dshash_find, but returns NULL immediately when nowait is true and the
+ * lock was not acquired. Lock status is stored in *lock_acquired if given.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool *lock_acquired)
 {
     dshash_hash hash;
     size_t        partition;
     dshash_table_item *item;
 
+    /* allowing !nowait returning the result is just not sensible */
+    Assert(nowait || !lock_acquired);
+
     hash = hash_key(hash_table, key);
     partition = PARTITION_FOR_HASH(hash);
 
     Assert(hash_table->control->magic == DSHASH_MAGIC);
     Assert(!hash_table->find_locked);
 
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partition),
+                                      exclusive ? LW_EXCLUSIVE : LW_SHARED))
+        {
+            if (lock_acquired)
+                *lock_acquired = false;
+            return NULL;
+        }
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition),
+                      exclusive ? LW_EXCLUSIVE : LW_SHARED);
+
+    if (lock_acquired)
+        *lock_acquired = true;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -441,6 +470,22 @@ void *
 dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
+{
+    return dshash_find_or_insert_extended(hash_table, key, found, false);
+}
+
+/*
+ * Like dshash_find_or_insert, but returns NULL if nowait is true and the lock
+ * could not be acquired.
+ *
+ * Notes above dshash_find_extended() regarding locking and error handling
+ * equally apply here.
+ */
+void *
+dshash_find_or_insert_extended(dshash_table *hash_table,
+                               const void *key,
+                               bool *found,
+                               bool nowait)
 {
     dshash_hash hash;
     size_t        partition_index;
@@ -455,8 +500,16 @@ dshash_find_or_insert(dshash_table *hash_table,
     Assert(!hash_table->find_locked);
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(
+                PARTITION_LOCK(hash_table, partition_index),
+                LW_EXCLUSIVE))
+            return NULL;
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
+                      LW_EXCLUSIVE);
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -626,9 +679,11 @@ dshash_memhash(const void *v, size_t size, void *arg)
  * As opposed to the equivalent for dynanash, the caller is not supposed to
  * delete the returned element before continuing the scan.
  *
- * If consistent is set for dshash_seq_init, the whole hash table is
- * non-exclusively locked. Otherwise a part of the hash table is locked in the
- * same mode (partition lock).
+ * If consistent is set for dshash_seq_init, all hash table
+ * partitions are locked in the requested mode (as determined by the
+ * exclusive flag), and the locks are held until the end of the scan.
+ * Otherwise the partition locks are acquired and released as needed
+ * during the scan (up to two partitions may be locked at the same time).
  */
 void
 dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b80f3af995..fe1d4d75c5 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -90,8 +90,12 @@ extern void dshash_destroy(dshash_table *hash_table);
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
             const void *key, bool exclusive);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+            bool exclusive, bool nowait, bool *lock_acquired);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
-                      const void *key, bool *found);
+            const void *key, bool *found);
+extern void *dshash_find_or_insert_extended(dshash_table *hash_table,
+            const void *key, bool *found, bool nowait);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.16.3

From 5e3c97bab2effa02e8434474f7acde7e7fc8d373 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 7 Nov 2018 16:53:49 +0900
Subject: [PATCH 3/5] Make archiver process an auxiliary process

This is a preliminary patch for the shared-memory based stats collector.
The archiver process must be an auxiliary process because it uses shared
memory once stats data is moved into shared memory. Make the process
an auxiliary process so that this works.
---
 src/backend/bootstrap/bootstrap.c   |  8 +++
 src/backend/postmaster/pgarch.c     | 98 +++++++++----------------------------
 src/backend/postmaster/pgstat.c     |  6 +++
 src/backend/postmaster/postmaster.c | 35 +++++++++----
 src/include/miscadmin.h             |  2 +
 src/include/pgstat.h                |  1 +
 src/include/postmaster/pgarch.h     |  4 +-
 7 files changed, 67 insertions(+), 87 deletions(-)
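
Note on the postmaster.c hunk: adding a new BACKEND_TYPE_* flag bit also requires widening BACKEND_TYPE_ALL, otherwise a mask-based selection of "all" children would skip the archiver. The following is a minimal standalone sketch of that bookkeeping. The flag values are copied from the hunk, except BACKEND_TYPE_NORMAL, which is not shown in the diff and is assumed here to be 0x0001. toy_type_selected is a hypothetical helper, not a postmaster function.

```c
#include <assert.h>
#include <stdbool.h>

#define BACKEND_TYPE_NORMAL     0x0001  /* assumed; not visible in the diff */
#define BACKEND_TYPE_AUTOVAC    0x0002
#define BACKEND_TYPE_WALSND     0x0004
#define BACKEND_TYPE_BGWORKER   0x0008
#define BACKEND_TYPE_ARCHIVER   0x0010  /* new in this patch */
#define BACKEND_TYPE_ALL        0x001F  /* widened from 0x000F */

/* Compile-time check that ALL really is the OR of every flag above. */
_Static_assert(BACKEND_TYPE_ALL ==
               (BACKEND_TYPE_NORMAL | BACKEND_TYPE_AUTOVAC |
                BACKEND_TYPE_WALSND | BACKEND_TYPE_BGWORKER |
                BACKEND_TYPE_ARCHIVER),
               "BACKEND_TYPE_ALL must cover every backend type flag");

/* Hypothetical helper: is a child of this type selected by the mask? */
static bool
toy_type_selected(int child_type, int target_mask)
{
    return (child_type & target_mask) != 0;
}
```

With the old mask value 0x000F the archiver bit would not be selected, which is why both defines change together in the hunk.
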

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 4d7ed8ad1a..a6c3338d40 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -328,6 +328,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case BgWriterProcess:
                 statmsg = pgstat_get_backend_desc(B_BG_WRITER);
                 break;
+            case ArchiverProcess:
+                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
+                break;
             case CheckpointerProcess:
                 statmsg = pgstat_get_backend_desc(B_CHECKPOINTER);
                 break;
@@ -455,6 +458,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
             BackgroundWriterMain();
             proc_exit(1);        /* should never return */
 
+        case ArchiverProcess:
+            /* don't set signals, archiver has its own agenda */
+            PgArchiverMain();
+            proc_exit(1);        /* should never return */
+
         case CheckpointerProcess:
             /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index f84f882c4c..4342ebdab4 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -77,7 +77,6 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
@@ -96,7 +95,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgarch_exit(SIGNAL_ARGS);
 static void ArchSigHupHandler(SIGNAL_ARGS);
 static void ArchSigTermHandler(SIGNAL_ARGS);
@@ -114,75 +112,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -222,8 +151,8 @@ pgarch_forkexec(void)
  *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
  *    since we don't use 'em, it hardly matters...
  */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -255,8 +184,27 @@ PgArchiverMain(int argc, char *argv[])
 static void
 pgarch_exit(SIGNAL_ARGS)
 {
-    /* SIGQUIT means curl up and die ... */
-    exit(1);
+    PG_SETMASK(&BlockSig);
+
+    /*
+     * We DO NOT want to run proc_exit() callbacks -- we're here because
+     * shared memory may be corrupted, so we don't want to try to clean up our
+     * transaction.  Just nail the windows shut and get out of town.  Now that
+     * there's an atexit callback to prevent third-party code from breaking
+     * things by calling exit() directly, we have to reset the callbacks
+     * explicitly to make this work as intended.
+     */
+    on_exit_reset();
+
+    /*
+     * Note we do exit(2) not exit(0).  This is to force the postmaster into a
+     * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+     * backend.  This is necessary precisely because we don't clean up our
+     * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+     * should ensure the postmaster sees this as a crash, too, but no harm in
+     * being doubly sure.)
+     */
+    exit(2);
 }
 
 /* SIGHUP signal handler for archiver process */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 81c6499251..9e6bce8f6a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2856,6 +2856,9 @@ pgstat_bestart(void)
             case BgWriterProcess:
                 beentry->st_backendType = B_BG_WRITER;
                 break;
+            case ArchiverProcess:
+                beentry->st_backendType = B_ARCHIVER;
+                break;
             case CheckpointerProcess:
                 beentry->st_backendType = B_CHECKPOINTER;
                 break;
@@ -4120,6 +4123,9 @@ pgstat_get_backend_desc(BackendType backendType)
         case B_BG_WRITER:
             backendDesc = "background writer";
             break;
+        case B_ARCHIVER:
+            backendDesc = "archiver";
+            break;
         case B_CHECKPOINTER:
             backendDesc = "checkpointer";
             break;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index ccea231e98..a663a62fd5 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -146,7 +146,8 @@
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
 #define BACKEND_TYPE_BGWORKER    0x0008    /* bgworker process */
-#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
+#define BACKEND_TYPE_ARCHIVER    0x0010    /* archiver process */
+#define BACKEND_TYPE_ALL        0x001F    /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -539,6 +540,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
 #define StartWalReceiver()        StartChildProcess(WalReceiverProcess)
@@ -1761,7 +1763,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -2924,7 +2926,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3069,10 +3071,8 @@ reaper(SIGNAL_ARGS)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3318,7 +3318,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3523,6 +3523,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3799,6 +3811,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5068,7 +5081,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5346,6 +5359,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case CheckpointerProcess:
                 ereport(LOG,
                         (errmsg("could not fork checkpointer process: %m")));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index c9e35003a5..63a7653457 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -399,6 +399,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -411,6 +412,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 88a75fb798..471877d2df 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -706,6 +706,7 @@ typedef enum BackendType
     B_BACKEND,
     B_BG_WORKER,
     B_BG_WRITER,
+    B_ARCHIVER,
     B_CHECKPOINTER,
     B_STARTUP,
     B_WAL_RECEIVER,
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index 2474eac26a..88f16863d4 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
-- 
2.16.3

From 697d6b001a008baae14db2b0a521a1e412788dcb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 12 Nov 2018 17:26:33 +0900
Subject: [PATCH 4/5] Shared-memory based stats collector

Previously, activity statistics were shared via files on disk. Every
backend sent its numbers to the stats collector process via a socket.
The collector wrote snapshots as a set of files on disk at a certain
interval, and every backend read them as necessary. This worked fine
for a comparatively small set of statistics, but the set is under
pressure to grow and the file size has reached the order of
megabytes. To deal with a larger statistics set, this patch lets
backends share the statistics directly via shared memory.
---
 contrib/pg_prewarm/autoprewarm.c                   |    2 +-
 contrib/pg_stat_statements/pg_stat_statements.c    |    1 +
 contrib/postgres_fdw/connection.c                  |    2 +-
 src/backend/Makefile                               |    2 +-
 src/backend/access/heap/rewriteheap.c              |    4 +-
 src/backend/access/heap/vacuumlazy.c               |    1 +
 src/backend/access/nbtree/nbtree.c                 |    2 +-
 src/backend/access/nbtree/nbtsort.c                |    2 +-
 src/backend/access/transam/clog.c                  |    2 +-
 src/backend/access/transam/parallel.c              |    1 +
 src/backend/access/transam/slru.c                  |    2 +-
 src/backend/access/transam/timeline.c              |    2 +-
 src/backend/access/transam/twophase.c              |    2 +
 src/backend/access/transam/xact.c                  |    3 +
 src/backend/access/transam/xlog.c                  |    1 +
 src/backend/access/transam/xlogfuncs.c             |    2 +-
 src/backend/access/transam/xlogutils.c             |    2 +-
 src/backend/bootstrap/bootstrap.c                  |    8 +-
 src/backend/executor/execParallel.c                |    2 +-
 src/backend/executor/nodeBitmapHeapscan.c          |    1 +
 src/backend/executor/nodeGather.c                  |    1 +
 src/backend/executor/nodeHash.c                    |    2 +-
 src/backend/executor/nodeHashjoin.c                |    2 +-
 src/backend/libpq/be-secure-openssl.c              |    2 +-
 src/backend/libpq/be-secure.c                      |    2 +-
 src/backend/libpq/pqmq.c                           |    2 +-
 src/backend/postmaster/Makefile                    |    2 +-
 src/backend/postmaster/autovacuum.c                |   46 +-
 src/backend/postmaster/bgworker.c                  |    2 +-
 src/backend/postmaster/bgwriter.c                  |    5 +-
 src/backend/postmaster/checkpointer.c              |   17 +-
 src/backend/postmaster/pgarch.c                    |    5 +-
 src/backend/postmaster/pgstat.c                    | 6385 --------------------
 src/backend/postmaster/postmaster.c                |   86 +-
 src/backend/postmaster/syslogger.c                 |    2 +-
 src/backend/postmaster/walwriter.c                 |    2 +-
 src/backend/replication/basebackup.c               |    1 +
 .../libpqwalreceiver/libpqwalreceiver.c            |    2 +-
 src/backend/replication/logical/launcher.c         |    2 +-
 src/backend/replication/logical/origin.c           |    3 +-
 src/backend/replication/logical/reorderbuffer.c    |    2 +-
 src/backend/replication/logical/snapbuild.c        |    2 +-
 src/backend/replication/logical/tablesync.c        |   15 +-
 src/backend/replication/logical/worker.c           |    5 +-
 src/backend/replication/slot.c                     |    2 +-
 src/backend/replication/syncrep.c                  |    2 +-
 src/backend/replication/walreceiver.c              |    2 +-
 src/backend/replication/walsender.c                |    2 +-
 src/backend/statmon/Makefile                       |   17 +
 src/backend/statmon/bestatus.c                     | 1781 ++++++
 src/backend/statmon/pgstat.c                       | 4009 ++++++++++++
 src/backend/storage/buffer/bufmgr.c                |    1 +
 src/backend/storage/file/buffile.c                 |    2 +-
 src/backend/storage/file/copydir.c                 |    2 +-
 src/backend/storage/file/fd.c                      |    1 +
 src/backend/storage/ipc/dsm.c                      |   24 +-
 src/backend/storage/ipc/dsm_impl.c                 |    2 +-
 src/backend/storage/ipc/ipci.c                     |    6 +
 src/backend/storage/ipc/latch.c                    |    2 +-
 src/backend/storage/ipc/procarray.c                |    2 +-
 src/backend/storage/ipc/shm_mq.c                   |    2 +-
 src/backend/storage/ipc/standby.c                  |    2 +-
 src/backend/storage/lmgr/deadlock.c                |    1 +
 src/backend/storage/lmgr/lwlock.c                  |    5 +-
 src/backend/storage/lmgr/lwlocknames.txt           |    1 +
 src/backend/storage/lmgr/predicate.c               |    2 +-
 src/backend/storage/lmgr/proc.c                    |    2 +-
 src/backend/storage/smgr/md.c                      |    2 +-
 src/backend/tcop/postgres.c                        |   28 +-
 src/backend/utils/adt/misc.c                       |    2 +-
 src/backend/utils/adt/pgstatfuncs.c                |   51 +-
 src/backend/utils/cache/relmapper.c                |    2 +-
 src/backend/utils/init/globals.c                   |    1 +
 src/backend/utils/init/miscinit.c                  |    2 +-
 src/backend/utils/init/postinit.c                  |   15 +
 src/backend/utils/misc/guc.c                       |    1 +
 src/bin/pg_basebackup/t/010_pg_basebackup.pl       |    2 +-
 src/include/bestatus.h                             |  555 ++
 src/include/miscadmin.h                            |    2 +-
 src/include/pgstat.h                               |  954 +--
 src/include/storage/dsm.h                          |    3 +
 src/include/storage/lwlock.h                       |    3 +
 src/include/utils/timeout.h                        |    1 +
 src/test/modules/worker_spi/worker_spi.c           |    2 +-
 84 files changed, 6657 insertions(+), 7480 deletions(-)
 delete mode 100644 src/backend/postmaster/pgstat.c
 create mode 100644 src/backend/statmon/Makefile
 create mode 100644 src/backend/statmon/bestatus.c
 create mode 100644 src/backend/statmon/pgstat.c
 create mode 100644 src/include/bestatus.h
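
Note on the general idea: the commit message describes moving from file snapshots to counters that backends share directly in memory. The following is a minimal standalone sketch of that idea only, under loud assumptions: it uses C11 atomics on one fixed struct, whereas the patch series keeps statistics in dshash tables under partition locks. The names toy_shared_stats, toy_pending_stats, toy_shared_init, toy_flush and toy_read_commits are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Counters living in memory visible to every process (assumed layout). */
typedef struct
{
    atomic_ulong xact_commit;
    atomic_ulong tuples_inserted;
} toy_shared_stats;

/* Counters a single backend accumulates privately between flushes. */
typedef struct
{
    unsigned long xact_commit;
    unsigned long tuples_inserted;
} toy_pending_stats;

static void
toy_shared_init(toy_shared_stats *shared)
{
    atomic_init(&shared->xact_commit, 0);
    atomic_init(&shared->tuples_inserted, 0);
}

/* Add one backend's pending counts to the shared totals, then reset them. */
static void
toy_flush(toy_pending_stats *pending, toy_shared_stats *shared)
{
    atomic_fetch_add(&shared->xact_commit, pending->xact_commit);
    atomic_fetch_add(&shared->tuples_inserted, pending->tuples_inserted);
    pending->xact_commit = 0;
    pending->tuples_inserted = 0;
}

/* Readers look at the shared totals directly; no file is written or read. */
static unsigned long
toy_read_commits(toy_shared_stats *shared)
{
    return atomic_load(&shared->xact_commit);
}
```

In this toy a reader sees new totals as soon as a flush completes, without waiting for a snapshot file to be rewritten.
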

diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index 9cc4b2dc83..406efbd49b 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -30,10 +30,10 @@
 
 #include "access/relation.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "catalog/pg_class.h"
 #include "catalog/pg_type.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/bgworker.h"
 #include "storage/buf_internals.h"
 #include "storage/dsm.h"
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 9905593661..8523bc5300 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -62,6 +62,7 @@
 #include <unistd.h>
 
 #include "access/hash.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 239d220c24..1ea71245df 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -15,11 +15,11 @@
 #include "postgres_fdw.h"
 
 #include "access/htup_details.h"
+#include "bestatus.h"
 #include "catalog/pg_user_mapping.h"
 #include "access/xact.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/latch.h"
 #include "utils/hsearch.h"
 #include "utils/inval.h"
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 478a96db9b..cc511672c9 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -20,7 +20,7 @@ include $(top_builddir)/src/Makefile.global
 SUBDIRS = access bootstrap catalog parser commands executor foreign lib libpq \
     main nodes optimizer partitioning port postmaster \
     regex replication rewrite \
-    statistics storage tcop tsearch utils $(top_builddir)/src/timezone \
+    statistics statmon storage tcop tsearch utils $(top_builddir)/src/timezone \
     jit
 
 include $(srcdir)/common.mk
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index f5cf9ffc9c..adfd5f40fd 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -115,12 +115,12 @@
 #include "access/xact.h"
 #include "access/xloginsert.h"
 
+#include "bestatus.h"
+
 #include "catalog/catalog.h"
 
 #include "lib/ilist.h"
 
-#include "pgstat.h"
-
 #include "replication/logical.h"
 #include "replication/slot.h"
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 9416c31889..928d53a68c 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -44,6 +44,7 @@
 #include "access/transam.h"
 #include "access/visibilitymap.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/storage.h"
 #include "commands/dbcommands.h"
 #include "commands/progress.h"
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 98917de2ef..69cd211369 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -22,10 +22,10 @@
 #include "access/nbtxlog.h"
 #include "access/relscan.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
-#include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "storage/condition_variable.h"
 #include "storage/indexfsm.h"
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index dc398e1186..c2a3ed0209 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -64,9 +64,9 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "bestatus.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/smgr.h"
 #include "tcop/tcopprot.h"        /* pgrminclude ignore */
 #include "utils/rel.h"
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index aa089d83fa..cf034ba333 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -38,8 +38,8 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "pg_trace.h"
 #include "storage/proc.h"
 
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index ce2b61631d..8d5cbfa41d 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -19,6 +19,7 @@
 #include "access/session.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/pg_enum.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 3623352b9c..a28fe474aa 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -54,7 +54,7 @@
 #include "access/slru.h"
 #include "access/transam.h"
 #include "access/xlog.h"
-#include "pgstat.h"
+#include "bestatus.h"
 #include "storage/fd.h"
 #include "storage/shmem.h"
 #include "miscadmin.h"
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index c96c8b60ba..bbe9c0eb5f 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -38,7 +38,7 @@
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogdefs.h"
-#include "pgstat.h"
+#include "bestatus.h"
 #include "storage/fd.h"
 
 /*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 9a8a6bb119..0dc9f39424 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -87,6 +87,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "access/xlogreader.h"
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "catalog/storage.h"
 #include "funcapi.h"
@@ -1569,6 +1570,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
     PredicateLockTwoPhaseFinish(xid, isCommit);
 
     /* Count the prepared xact as committed or aborted */
+    AtEOXact_BEStatus(isCommit);
     AtEOXact_PgStat(isCommit);
 
     /*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e93262975d..e5026bd261 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -30,6 +30,7 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
 #include "catalog/storage.h"
@@ -2148,6 +2149,7 @@ CommitTransaction(void)
     AtEOXact_Files(true);
     AtEOXact_ComboCid();
     AtEOXact_HashTables(true);
+    AtEOXact_BEStatus(true);
     AtEOXact_PgStat(true);
     AtEOXact_Snapshot(true, false);
     AtEOXact_ApplyLauncher(true);
@@ -2641,6 +2643,7 @@ AbortTransaction(void)
         AtEOXact_Files(false);
         AtEOXact_ComboCid();
         AtEOXact_HashTables(false);
+        AtEOXact_BEStatus(false);
         AtEOXact_PgStat(false);
         AtEOXact_ApplyLauncher(false);
         pgstat_report_xact_timestamp(0);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ecd12fc53a..61a90a2811 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -36,6 +36,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
 #include "catalog/pg_database.h"
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index b35043bf71..683c41575f 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -23,9 +23,9 @@
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
 #include "catalog/pg_type.h"
+#include "bestatus.h"
 #include "funcapi.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/walreceiver.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 10a663bae6..53fa4890e9 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -23,8 +23,8 @@
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/smgr.h"
 #include "utils/guc.h"
 #include "utils/hsearch.h"
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index a6c3338d40..79f624f0e0 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -22,6 +22,7 @@
 #include "access/htup_details.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/index.h"
 #include "catalog/pg_collation.h"
@@ -328,9 +329,6 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case BgWriterProcess:
                 statmsg = pgstat_get_backend_desc(B_BG_WRITER);
                 break;
-            case ArchiverProcess:
-                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
-                break;
             case CheckpointerProcess:
                 statmsg = pgstat_get_backend_desc(B_CHECKPOINTER);
                 break;
@@ -340,6 +338,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case WalReceiverProcess:
                 statmsg = pgstat_get_backend_desc(B_WAL_RECEIVER);
                 break;
+            case ArchiverProcess:
+                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
+                break;
             default:
                 statmsg = "??? process";
                 break;
@@ -416,6 +417,7 @@ AuxiliaryProcessMain(int argc, char *argv[])
         CreateAuxProcessResourceOwner();
 
         /* Initialize backend status information */
+        pgstat_bearray_initialize();
         pgstat_initialize();
         pgstat_bestart();
 
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index b79be91655..e53c0fb808 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -46,7 +46,7 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
-#include "pgstat.h"
+#include "bestatus.h"
 
 /*
  * Magic numbers for parallel executor communication.  We use constants
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 5e74585d5e..03a703075e 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -41,6 +41,7 @@
 #include "access/relscan.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "bestatus.h"
 #include "executor/execdebug.h"
 #include "executor/nodeBitmapHeapscan.h"
 #include "miscadmin.h"
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index 69d5a1f239..36859360b6 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -32,6 +32,7 @@
 
 #include "access/relscan.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "executor/execdebug.h"
 #include "executor/execParallel.h"
 #include "executor/nodeGather.h"
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 856daf6a7f..5a47eb4601 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -28,6 +28,7 @@
 
 #include "access/htup_details.h"
 #include "access/parallel.h"
+#include "bestatus.h"
 #include "catalog/pg_statistic.h"
 #include "commands/tablespace.h"
 #include "executor/execdebug.h"
@@ -35,7 +36,6 @@
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "port/atomics.h"
 #include "utils/dynahash.h"
 #include "utils/memutils.h"
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 2098708864..898a7916b0 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -108,12 +108,12 @@
 
 #include "access/htup_details.h"
 #include "access/parallel.h"
+#include "bestatus.h"
 #include "executor/executor.h"
 #include "executor/hashjoin.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "utils/memutils.h"
 #include "utils/sharedtuplestore.h"
 
diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c
index 4490516b9e..711d929999 100644
--- a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c
@@ -36,9 +36,9 @@
 #include <openssl/ec.h>
 #endif
 
+#include "bestatus.h"
 #include "libpq/libpq.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/latch.h"
 #include "tcop/tcopprot.h"
diff --git a/src/backend/libpq/be-secure.c b/src/backend/libpq/be-secure.c
index a7def3168d..fa1cf6cffa 100644
--- a/src/backend/libpq/be-secure.c
+++ b/src/backend/libpq/be-secure.c
@@ -29,9 +29,9 @@
 #include <arpa/inet.h>
 #endif
 
+#include "bestatus.h"
 #include "libpq/libpq.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "tcop/tcopprot.h"
 #include "utils/memutils.h"
 #include "storage/ipc.h"
diff --git a/src/backend/libpq/pqmq.c b/src/backend/libpq/pqmq.c
index a9bd47d937..f79a70d6fe 100644
--- a/src/backend/libpq/pqmq.c
+++ b/src/backend/libpq/pqmq.c
@@ -13,11 +13,11 @@
 
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
 #include "libpq/pqmq.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c23211b2..311e63017d 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
-    pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+    pgarch.o postmaster.o startup.o syslogger.o walwriter.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index d1177b3855..b1328d34f5 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -71,6 +71,7 @@
 #include "access/reloptions.h"
 #include "access/transam.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "catalog/dependency.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
@@ -968,7 +969,7 @@ rebuild_database_list(Oid newdb)
         PgStat_StatDBEntry *entry;
 
         /* only consider this database if it has a pgstat entry */
-        entry = pgstat_fetch_stat_dbentry(newdb);
+        entry = pgstat_fetch_stat_dbentry(newdb, true);
         if (entry != NULL)
         {
             /* we assume it isn't found because the hash was just created */
@@ -977,6 +978,7 @@ rebuild_database_list(Oid newdb)
             /* hash_search already filled in the key */
             db->adl_score = score++;
             /* next_worker is filled in later */
+            pfree(entry);
         }
     }
 
@@ -992,7 +994,7 @@ rebuild_database_list(Oid newdb)
          * skip databases with no stat entries -- in particular, this gets rid
          * of dropped databases
          */
-        entry = pgstat_fetch_stat_dbentry(avdb->adl_datid);
+        entry = pgstat_fetch_stat_dbentry(avdb->adl_datid, true);
         if (entry == NULL)
             continue;
 
@@ -1004,6 +1006,7 @@ rebuild_database_list(Oid newdb)
             db->adl_score = score++;
             /* next_worker is filled in later */
         }
+        pfree(entry);
     }
 
     /* finally, insert all qualifying databases not previously inserted */
@@ -1016,7 +1019,7 @@ rebuild_database_list(Oid newdb)
         PgStat_StatDBEntry *entry;
 
         /* only consider databases with a pgstat entry */
-        entry = pgstat_fetch_stat_dbentry(avdb->adw_datid);
+        entry = pgstat_fetch_stat_dbentry(avdb->adw_datid, true);
         if (entry == NULL)
             continue;
 
@@ -1028,6 +1031,7 @@ rebuild_database_list(Oid newdb)
             db->adl_score = score++;
             /* next_worker is filled in later */
         }
+        pfree(entry);
     }
     nelems = score;
 
@@ -1226,7 +1230,7 @@ do_start_worker(void)
             continue;            /* ignore not-at-risk DBs */
 
         /* Find pgstat entry if any */
-        tmp->adw_entry = pgstat_fetch_stat_dbentry(tmp->adw_datid);
+        tmp->adw_entry = pgstat_fetch_stat_dbentry(tmp->adw_datid, true);
 
         /*
          * Skip a database with no pgstat entry; it means it hasn't seen any
@@ -1265,7 +1269,12 @@ do_start_worker(void)
             }
         }
         if (skipit)
+        {
+            /* Immediately free it if not used */
+            if (avdb != tmp)
+                pfree(tmp->adw_entry);
             continue;
+        }
 
         /*
          * Remember the db with oldest autovac time.  (If we are here, both
@@ -1273,7 +1282,12 @@ do_start_worker(void)
          */
         if (avdb == NULL ||
             tmp->adw_entry->last_autovac_time < avdb->adw_entry->last_autovac_time)
+        {
+            if (avdb)
+                pfree(avdb->adw_entry);
+
             avdb = tmp;
+        }
     }
 
     /* Found a database -- process it */
@@ -1962,7 +1976,7 @@ do_autovacuum(void)
      * may be NULL if we couldn't find an entry (only happens if we are
      * forcing a vacuum for anti-wrap purposes).
      */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, true);
 
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
@@ -2012,7 +2026,7 @@ do_autovacuum(void)
     MemoryContextSwitchTo(AutovacMemCxt);
 
     /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
+    shared = pgstat_fetch_stat_dbentry(InvalidOid, true);
 
     classRel = table_open(RelationRelationId, AccessShareLock);
 
@@ -2098,6 +2112,8 @@ do_autovacuum(void)
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
                                   &dovacuum, &doanalyze, &wraparound);
+        if (tabentry)
+            pfree(tabentry);
 
         /* Relations that need work are added to table_oids */
         if (dovacuum || doanalyze)
@@ -2177,10 +2193,11 @@ do_autovacuum(void)
         /* Fetch the pgstat entry for this table */
         tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
                                              shared, dbentry);
-
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
                                   &dovacuum, &doanalyze, &wraparound);
+        if (tabentry)
+            pfree(tabentry);
 
         /* ignore analyze for toast tables */
         if (dovacuum)
@@ -2749,12 +2766,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
     if (isshared)
     {
         if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
+            tabentry = backend_get_tab_entry(shared, relid, true);
     }
     else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
+        tabentry = backend_get_tab_entry(dbentry, relid, true);
 
     return tabentry;
 }
@@ -2786,8 +2801,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     /* use fresh stats */
     autovac_refresh_stats();
 
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    shared = pgstat_fetch_stat_dbentry(InvalidOid, true);
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, true);
 
     /* fetch the relation's relcache entry */
     classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
@@ -2818,6 +2833,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
                               &dovacuum, &doanalyze, &wraparound);
+    if (tabentry)
+        pfree(tabentry);
 
     /* ignore ANALYZE for toast tables */
     if (classForm->relkind == RELKIND_TOASTVALUE)
@@ -2908,7 +2925,10 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     }
 
     heap_freetuple(classTup);
-
+    if (shared)
+        pfree(shared);
+    if (dbentry)
+        pfree(dbentry);
     return tab;
     return tab;
 }
 
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index f5db5a8c4a..7d7d55ef1a 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -16,8 +16,8 @@
 
 #include "libpq/pqsignal.h"
 #include "access/parallel.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "port/atomics.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/postmaster.h"
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index e6b6c549de..c820d35fbc 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -40,6 +40,7 @@
 
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -267,9 +268,9 @@ BackgroundWriterMain(void)
         can_hibernate = BgBufferSync(&wb_context);
 
         /*
-         * Send off activity statistics to the stats collector
+         * Update activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_update_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fe96c41359..9f70cd0e52 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -43,6 +43,7 @@
 
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -515,13 +516,13 @@ CheckpointerMain(void)
         CheckArchiveTimeout();
 
         /*
-         * Send off activity statistics to the stats collector.  (The reason
-         * why we re-use bgwriter-related code for this is that the bgwriter
-         * and checkpointer used to be just one process.  It's probably not
-         * worth the trouble to split the stats support into two independent
-         * stats message types.)
+         * Update activity statistics.  (The reason why we re-use
+         * bgwriter-related code for this is that the bgwriter and
+         * checkpointer used to be just one process.  It's probably not worth
+         * the trouble to split the stats support into two independent
+         * functions.)
          */
-        pgstat_send_bgwriter();
+        pgstat_update_bgwriter();
 
         /*
          * Sleep until we are signaled or it's time for another checkpoint or
@@ -682,9 +683,9 @@ CheckpointWriteDelay(int flags, double progress)
         CheckArchiveTimeout();
 
         /*
-         * Report interim activity statistics to the stats collector.
+         * Register interim activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_update_bgwriter();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 4342ebdab4..2a7c4fd1b1 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -35,6 +35,7 @@
 
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -468,7 +469,7 @@ pgarch_ArchiverCopyLoop(void)
-                 * Tell the collector about the WAL file that we successfully
+                 * Update statistics for the WAL file that we successfully
                  * archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_update_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
@@ -478,7 +479,7 @@ pgarch_ArchiverCopyLoop(void)
-                 * Tell the collector about the WAL file that we failed to
+                 * Update statistics for the WAL file that we failed to
                  * archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_update_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
deleted file mode 100644
index 9e6bce8f6a..0000000000
--- a/src/backend/postmaster/pgstat.c
+++ /dev/null
@@ -1,6385 +0,0 @@
-/* ----------
- * pgstat.c
- *
- *    All the statistics collector stuff hacked up in one big, ugly file.
- *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
- *
- *            - Add some automatic call for pgstat vacuuming.
- *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
- *
- *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
- *
- *    src/backend/postmaster/pgstat.c
- * ----------
- */
-#include "postgres.h"
-
-#include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
-
-#include "pgstat.h"
-
-#include "access/heapam.h"
-#include "access/htup_details.h"
-#include "access/transam.h"
-#include "access/twophase_rmgr.h"
-#include "access/xact.h"
-#include "catalog/pg_database.h"
-#include "catalog/pg_proc.h"
-#include "common/ip.h"
-#include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
-#include "miscadmin.h"
-#include "pg_trace.h"
-#include "postmaster/autovacuum.h"
-#include "postmaster/fork_process.h"
-#include "postmaster/postmaster.h"
-#include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
-#include "storage/ipc.h"
-#include "storage/latch.h"
-#include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
-#include "storage/procsignal.h"
-#include "storage/sinvaladt.h"
-#include "utils/ascii.h"
-#include "utils/guc.h"
-#include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
-#include "utils/snapmgr.h"
-#include "utils/timestamp.h"
-
-
-/* ----------
- * Timer definitions.
- * ----------
- */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
-
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
-
-
-/* ----------
- * The initial size hints for the hash tables used in the collector.
- * ----------
- */
-#define PGSTAT_DB_HASH_SIZE        16
-#define PGSTAT_TAB_HASH_SIZE    512
-#define PGSTAT_FUNCTION_HASH_SIZE    512
-
-
-/* ----------
- * Total number of backends including auxiliary
- *
- * We reserve a slot for each possible BackendId, plus one for each
- * possible auxiliary process type.  (This scheme assumes there is not
- * more than one of any auxiliary process type at a time.) MaxBackends
- * includes autovacuum workers and background workers as well.
- * ----------
- */
-#define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
-
-
-/* ----------
- * GUC parameters
- * ----------
- */
-bool        pgstat_track_activities = false;
-bool        pgstat_track_counts = false;
-int            pgstat_track_functions = TRACK_FUNC_OFF;
-int            pgstat_track_activity_query_size = 1024;
-
-/* ----------
- * Built from GUC parameter
- * ----------
- */
-char       *pgstat_stat_directory = NULL;
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
-
-/*
- * BgWriter global statistics counters (unused in other processes).
- * Stored directly in a stats message structure so it can be sent
- * without needing to copy things around.  We assume this inits to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
-
-/* ----------
- * Local data
- * ----------
- */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
-
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
-
-/*
- * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
- *
- * NOTE: once allocated, TabStatusArray structures are never moved or deleted
- * for the life of the backend.  Also, we zero out the t_id fields of the
- * contained PgStat_TableStatus structs whenever they are not actively in use.
- * This allows relcache pgstat_info pointers to be treated as long-lived data,
- * avoiding repeated searches in pgstat_initstats() when a relation is
- * repeatedly opened during a transaction.
- */
-#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
-
-typedef struct TabStatusArray
-{
-    struct TabStatusArray *tsa_next;    /* link to next array, if any */
-    int            tsa_used;        /* # entries currently used */
-    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
-} TabStatusArray;
-
-static TabStatusArray *pgStatTabList = NULL;
-
-/*
- * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
- */
-typedef struct TabStatHashEntry
-{
-    Oid            t_id;
-    PgStat_TableStatus *tsa_entry;
-} TabStatHashEntry;
-
-/*
- * Hash table for O(1) t_id -> tsa_entry lookup
- */
-static HTAB *pgStatTabHash = NULL;
-
-/*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
- */
-static HTAB *pgStatFunctions = NULL;
-
-/*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
- */
-static bool have_function_stats = false;
-
-/*
- * Tuple insertion/deletion counts for an open transaction can't be propagated
- * into PgStat_TableStatus counters until we know if it is going to commit
- * or abort.  Hence, we keep these counts in per-subxact structs that live
- * in TopTransactionContext.  This data structure is designed on the assumption
- * that subxacts won't usually modify very many tables.
- */
-typedef struct PgStat_SubXactStatus
-{
-    int            nest_level;        /* subtransaction nest level */
-    struct PgStat_SubXactStatus *prev;    /* higher-level subxact if any */
-    PgStat_TableXactStatus *first;    /* head of list for this subxact */
-} PgStat_SubXactStatus;
-
-static PgStat_SubXactStatus *pgStatXactStack = NULL;
-
-static int    pgStatXactCommit = 0;
-static int    pgStatXactRollback = 0;
-PgStat_Counter pgStatBlockReadTime = 0;
-PgStat_Counter pgStatBlockWriteTime = 0;
-
-/* Record that's written to 2PC state file when pgstat state is persisted */
-typedef struct TwoPhasePgStatRecord
-{
-    PgStat_Counter tuples_inserted; /* tuples inserted in xact */
-    PgStat_Counter tuples_updated;    /* tuples updated in xact */
-    PgStat_Counter tuples_deleted;    /* tuples deleted in xact */
-    PgStat_Counter inserted_pre_trunc;    /* tuples inserted prior to truncate */
-    PgStat_Counter updated_pre_trunc;    /* tuples updated prior to truncate */
-    PgStat_Counter deleted_pre_trunc;    /* tuples deleted prior to truncate */
-    Oid            t_id;            /* table's OID */
-    bool        t_shared;        /* is it a shared catalog? */
-    bool        t_truncated;    /* was the relation truncated? */
-} TwoPhasePgStatRecord;
-
-/*
- * Info about current "snapshot" of stats file
- */
-static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
-
-/* Status for backends including auxiliary */
-static LocalPgBackendStatus *localBackendStatusTable = NULL;
-
-/* Total number of backends including auxiliary */
-static int    localNumBackends = 0;
-
-/*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
- */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
-
-/* Signal handler flags */
-static volatile bool need_exit = false;
-static volatile bool got_SIGHUP = false;
-
-/*
- * Total time charged to functions so far in the current backend.
- * We use this to help separate "self" and "other" time charges.
- * (We assume this initializes to zero.)
- */
-static instr_time total_func_time;
-
-
-/* ----------
- * Local function forward declarations
- * ----------
- */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgstat_exit(SIGNAL_ARGS);
-static void pgstat_beshutdown_hook(int code, Datum arg);
-static void pgstat_sighup_handler(SIGNAL_ARGS);
-
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                     Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
-static void pgstat_read_current_status(void);
-
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
-
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
-static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
-
-static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
-
-static void pgstat_setup_memcxt(void);
-
-static const char *pgstat_get_wait_activity(WaitEventActivity w);
-static const char *pgstat_get_wait_client(WaitEventClient w);
-static const char *pgstat_get_wait_ipc(WaitEventIPC w);
-static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
-static const char *pgstat_get_wait_io(WaitEventIO w);
-
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
-/* ------------------------------------------------------------
- * Public functions called from postmaster follow
- * ------------------------------------------------------------
- */
-
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
- */
-void
-pgstat_init(void)
-{
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
-
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
-
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
-    {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
-    }
-
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
-
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
-
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
-
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
-    }
-
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
-
-    /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
-     */
-    if (!pg_set_noblock(pgStatSock))
-    {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
-    }
-
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
-    }
-
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    return;
-
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
-
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
-
-    /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
-     */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
-}
-
-/*
- * subroutine for pgstat_reset_all
- */
-static void
-pgstat_reset_remove_files(const char *directory)
-{
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
-
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
-    {
-        int            nchars;
-        Oid            tmp_oid;
-
-        /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
-         */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
-
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
-
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
-    }
-    FreeDir(dir);
-}
-
-/*
- * pgstat_reset_all() -
- *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
- */
-void
-pgstat_reset_all(void)
-{
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
-}
-
-#ifdef EXEC_BACKEND
-
-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
-    char       *av[10];
-    int            ac = 0;
-
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
-
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
-
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
-
-
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
-
-    /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
-     */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
-
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
-
-    /*
-     * Okay, fork off the collector.
-     */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
-}
-
-/* ------------------------------------------------------------
- * Public functions used by backends follow
- *------------------------------------------------------------
- */
-
-
-/* ----------
- * pgstat_report_stat() -
- *
- *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
- *    transaction stop time as an approximation of current time.
- * ----------
- */
-void
-pgstat_report_stat(bool force)
-{
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
-    TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
-    int            i;
-
-    /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
-
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
-    now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
-
-    /*
-     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
-     */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
-    pgStatTabHash = NULL;
-
-    /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
-     */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
-    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
-    {
-        for (i = 0; i < tsa->tsa_used; i++)
-        {
-            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
-
-            /* Shouldn't have any pending transaction-dependent counts */
-            Assert(entry->trans == NULL);
-
-            /*
-             * Ignore entries that didn't accumulate any actual counts, such
-             * as indexes that were opened by the planner but not used.
-             */
-            if (memcmp(&entry->t_counts, &all_zeroes,
-                       sizeof(PgStat_TableCounts)) == 0)
-                continue;
-
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
-            {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
-            }
-        }
-        /* zero out TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
-    }
-
-    /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
-     */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
-
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
-}
-
-/*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
- */
-static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
-{
-    int            n;
-    int            len;
-
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
-     */
-    if (OidIsValid(tsmsg->m_databaseid))
-    {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
-        pgStatXactCommit = 0;
-        pgStatXactRollback = 0;
-        pgStatBlockReadTime = 0;
-        pgStatBlockWriteTime = 0;
-    }
-    else
-    {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
-    }
-
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
-
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
-}
-
-/*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
- */
-static void
-pgstat_send_funcstats(void)
-{
-    /* we assume this inits to all zeroes: */
-    static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
-
-    if (pgStatFunctions == NULL)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
-
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
-
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
-
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
-
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
-
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
-}
-
-
-/* ----------
- * pgstat_vacuum_stat() -
- *
- *    Will tell the collector about objects he can get rid of.
- * ----------
- */
-void
-pgstat_vacuum_stat(void)
-{
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Read pg_database and make a list of OIDs of all existing databases
-     */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
-
-    /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            dbid = dbentry->databaseid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        /* the DB entry for shared tables (with InvalidOid) is never dropped */
-        if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
-            pgstat_drop_database(dbid);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Lookup our own database entry; if not found, nothing more to do.
-     */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
-        return;
-
-    /*
-     * Similarly to above, make a list of all known relations in this DB.
-     */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
-
-    /*
-     * Check for all tables listed in stats hashtable if they still exist.
-     */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            tabid = tabentry->tableid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
-            continue;
-
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
-    }
-
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
-        {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
-                continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
-        }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
-    }
-}
-
-
-/* ----------
- * pgstat_collect_oids() -
- *
- *    Collect the OIDs of all objects listed in the specified system catalog
- *    into a temporary hash table.  Caller should hash_destroy the result
- *    when done with it.  (However, we make the table in CurrentMemoryContext
- *    so that it will be freed properly in event of an error.)
- * ----------
- */
-static HTAB *
-pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
-{
-    HTAB       *htab;
-    HASHCTL        hash_ctl;
-    Relation    rel;
-    HeapScanDesc scan;
-    HeapTuple    tup;
-    Snapshot    snapshot;
-
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(Oid);
-    hash_ctl.hcxt = CurrentMemoryContext;
-    htab = hash_create("Temporary table of OIDs",
-                       PGSTAT_TAB_HASH_SIZE,
-                       &hash_ctl,
-                       HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    rel = table_open(catalogid, AccessShareLock);
-    snapshot = RegisterSnapshot(GetLatestSnapshot());
-    scan = heap_beginscan(rel, snapshot, 0, NULL);
-    while ((tup = heap_getnext(scan, ForwardScanDirection)) != NULL)
-    {
-        Oid            thisoid;
-        bool        isnull;
-
-        thisoid = heap_getattr(tup, anum_oid, RelationGetDescr(rel), &isnull);
-        Assert(!isnull);
-
-        CHECK_FOR_INTERRUPTS();
-
-        (void) hash_search(htab, (void *) &thisoid, HASH_ENTER, NULL);
-    }
-    heap_endscan(scan);
-    UnregisterSnapshot(snapshot);
-    table_close(rel, AccessShareLock);
-
-    return htab;
-}
-
-
-/* ----------
- * pgstat_drop_database() -
- *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
- * ----------
- */
-void
-pgstat_drop_database(Oid databaseid)
-{
-    PgStat_MsgDropdb msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
-
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
-}
-#endif                            /* NOT_USED */
-
-
-/* ----------
- * pgstat_reset_counters() -
- *
- *    Tell the statistics collector to reset counters for our database.
- *
- *    Permission checking for this function is managed through the normal
- *    GRANT system.
- * ----------
- */
-void
-pgstat_reset_counters(void)
-{
-    PgStat_MsgResetcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* ----------
- * pgstat_reset_shared_counters() -
- *
- *    Tell the statistics collector to reset cluster-wide shared counters.
- *
- *    Permission checking for this function is managed through the normal
- *    GRANT system.
- * ----------
- */
-void
-pgstat_reset_shared_counters(const char *target)
-{
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
-    else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
-    else
-        ereport(ERROR,
-                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-                 errmsg("unrecognized reset target: \"%s\"", target),
-                 errhint("Target must be \"archiver\" or \"bgwriter\".")));
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* ----------
- * pgstat_reset_single_counter() -
- *
- *    Tell the statistics collector to reset a single counter.
- *
- *    Permission checking for this function is managed through the normal
- *    GRANT system.
- * ----------
- */
-void
-pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
-{
-    PgStat_MsgResetsinglecounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
-
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* ----------
- * pgstat_report_autovac() -
- *
- *    Called from autovacuum.c to report startup of an autovacuum process.
- *    We are called before InitPostgres is done, so can't rely on MyDatabaseId;
- *    the db OID must be passed in, instead.
- * ----------
- */
-void
-pgstat_report_autovac(Oid dboid)
-{
-    PgStat_MsgAutovacStart msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
-
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/* ---------
- * pgstat_report_vacuum() -
- *
- *    Tell the collector about the table we just vacuumed.
- * ---------
- */
-void
-pgstat_report_vacuum(Oid tableoid, bool shared,
-                     PgStat_Counter livetuples, PgStat_Counter deadtuples)
-{
-    PgStat_MsgVacuum msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* --------
- * pgstat_report_analyze() -
- *
- *    Tell the collector about the table we just analyzed.
- *
- * Caller must provide new live- and dead-tuples estimates, as well as a
- * flag indicating whether to reset the changes_since_analyze counter.
- * --------
- */
-void
-pgstat_report_analyze(Relation rel,
-                      PgStat_Counter livetuples, PgStat_Counter deadtuples,
-                      bool resetcounter)
-{
-    PgStat_MsgAnalyze msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    /*
-     * Unlike VACUUM, ANALYZE might be running inside a transaction that has
-     * already inserted and/or deleted rows in the target table. ANALYZE will
-     * have counted such rows as live or dead respectively. Because we will
-     * report our counts of such rows at transaction end, we should subtract
-     * off these counts from what we send to the collector now, else they'll
-     * be double-counted after commit.  (This approach also ensures that the
-     * collector ends up with the right numbers if we abort instead of
-     * committing.)
-     */
-    if (rel->pgstat_info != NULL)
-    {
-        PgStat_TableXactStatus *trans;
-
-        for (trans = rel->pgstat_info->trans; trans; trans = trans->upper)
-        {
-            livetuples -= trans->tuples_inserted - trans->tuples_deleted;
-            deadtuples -= trans->tuples_updated + trans->tuples_deleted;
-        }
-        /* count stuff inserted by already-aborted subxacts, too */
-        deadtuples -= rel->pgstat_info->t_counts.t_delta_dead_tuples;
-        /* Since ANALYZE's counts are estimates, we could have underflowed */
-        livetuples = Max(livetuples, 0);
-        deadtuples = Max(deadtuples, 0);
-    }
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* --------
- * pgstat_report_recovery_conflict() -
- *
- *    Tell the collector about a Hot Standby recovery conflict.
- * --------
- */
-void
-pgstat_report_recovery_conflict(int reason)
-{
-    PgStat_MsgRecoveryConflict msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* --------
- * pgstat_report_deadlock() -
- *
- *    Tell the collector about a deadlock detected.
- * --------
- */
-void
-pgstat_report_deadlock(void)
-{
-    PgStat_MsgDeadlock msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* --------
- * pgstat_report_tempfile() -
- *
- *    Tell the collector about a temporary file.
- * --------
- */
-void
-pgstat_report_tempfile(size_t filesize)
-{
-    PgStat_MsgTempFile msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/* ----------
- * pgstat_ping() -
- *
- *    Send some junk data to the collector to increase traffic.
- * ----------
- */
-void
-pgstat_ping(void)
-{
-    PgStat_MsgDummy msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* ----------
- * pgstat_send_inquiry() -
- *
- *    Notify collector that we need fresh data.
- * ----------
- */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/*
- * Initialize function call usage data.
- * Called by the executor before invoking a function.
- */
-void
-pgstat_init_function_usage(FunctionCallInfo fcinfo,
-                           PgStat_FunctionCallUsage *fcu)
-{
-    PgStat_BackendFunctionEntry *htabent;
-    bool        found;
-
-    if (pgstat_track_functions <= fcinfo->flinfo->fn_stats)
-    {
-        /* stats not wanted */
-        fcu->fs = NULL;
-        return;
-    }
-
-    if (!pgStatFunctions)
-    {
-        /* First time through - initialize function stat table */
-        HASHCTL        hash_ctl;
-
-        memset(&hash_ctl, 0, sizeof(hash_ctl));
-        hash_ctl.keysize = sizeof(Oid);
-        hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
-        pgStatFunctions = hash_create("Function stat entries",
-                                      PGSTAT_FUNCTION_HASH_SIZE,
-                                      &hash_ctl,
-                                      HASH_ELEM | HASH_BLOBS);
-    }
-
-    /* Get the stats entry for this function, create if necessary */
-    htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
-                          HASH_ENTER, &found);
-    if (!found)
-        MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
-
-    fcu->fs = &htabent->f_counts;
-
-    /* save stats for this function, later used to compensate for recursion */
-    fcu->save_f_total_time = htabent->f_counts.f_total_time;
-
-    /* save current backend-wide total time */
-    fcu->save_total = total_func_time;
-
-    /* get clock time as of function start */
-    INSTR_TIME_SET_CURRENT(fcu->f_start);
-}
-
-/*
- * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
- *        for specified function
- *
- * If no entry, return NULL, don't create a new one
- */
-PgStat_BackendFunctionEntry *
-find_funcstat_entry(Oid func_id)
-{
-    if (pgStatFunctions == NULL)
-        return NULL;
-
-    return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
-                                                       (void *) &func_id,
-                                                       HASH_FIND, NULL);
-}
-
-/*
- * Calculate function call usage and update stat counters.
- * Called by the executor after invoking a function.
- *
- * In the case of a set-returning function that runs in value-per-call mode,
- * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
- * calls for what the user considers a single call of the function.  The
- * finalize flag should be TRUE on the last call.
- */
-void
-pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
-{
-    PgStat_FunctionCounts *fs = fcu->fs;
-    instr_time    f_total;
-    instr_time    f_others;
-    instr_time    f_self;
-
-    /* stats not wanted? */
-    if (fs == NULL)
-        return;
-
-    /* total elapsed time in this function call */
-    INSTR_TIME_SET_CURRENT(f_total);
-    INSTR_TIME_SUBTRACT(f_total, fcu->f_start);
-
-    /* self usage: elapsed minus anything already charged to other calls */
-    f_others = total_func_time;
-    INSTR_TIME_SUBTRACT(f_others, fcu->save_total);
-    f_self = f_total;
-    INSTR_TIME_SUBTRACT(f_self, f_others);
-
-    /* update backend-wide total time */
-    INSTR_TIME_ADD(total_func_time, f_self);
-
-    /*
-     * Compute the new f_total_time as the total elapsed time added to the
-     * pre-call value of f_total_time.  This is necessary to avoid
-     * double-counting any time taken by recursive calls of myself.  (We do
-     * not need any similar kluge for self time, since that already excludes
-     * any recursive calls.)
-     */
-    INSTR_TIME_ADD(f_total, fcu->save_f_total_time);
-
-    /* update counters in function stats table */
-    if (finalize)
-        fs->f_numcalls++;
-    fs->f_total_time = f_total;
-    INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
-}
-
-
-/* ----------
- * pgstat_initstats() -
- *
- *    Initialize a relcache entry to count access statistics.
- *    Called whenever a relation is opened.
- *
- *    We assume that a relcache entry's pgstat_info field is zeroed by
- *    relcache.c when the relcache entry is made; thereafter it is long-lived
- *    data.  We can avoid repeated searches of the TabStatus arrays when the
- *    same relation is touched repeatedly within a transaction.
- * ----------
- */
-void
-pgstat_initstats(Relation rel)
-{
-    Oid            rel_id = rel->rd_id;
-    char        relkind = rel->rd_rel->relkind;
-
-    /* We only count stats for things that have storage */
-    if (!(relkind == RELKIND_RELATION ||
-          relkind == RELKIND_MATVIEW ||
-          relkind == RELKIND_INDEX ||
-          relkind == RELKIND_TOASTVALUE ||
-          relkind == RELKIND_SEQUENCE))
-    {
-        rel->pgstat_info = NULL;
-        return;
-    }
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-    {
-        /* We're not counting at all */
-        rel->pgstat_info = NULL;
-        return;
-    }
-
-    /*
-     * If we already set up this relation in the current transaction, nothing
-     * to do.
-     */
-    if (rel->pgstat_info != NULL &&
-        rel->pgstat_info->t_id == rel_id)
-        return;
-
-    /* Else find or make the PgStat_TableStatus entry, and update link */
-    rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
-}
-
-/*
- * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
- */
-static PgStat_TableStatus *
-get_tabstat_entry(Oid rel_id, bool isshared)
-{
-    TabStatHashEntry *hash_entry;
-    PgStat_TableStatus *entry;
-    TabStatusArray *tsa;
-    bool        found;
-
-    /*
-     * Create hash table if we don't have it already.
-     */
-    if (pgStatTabHash == NULL)
-    {
-        HASHCTL        ctl;
-
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
-    }
-
-    /*
-     * Find an entry or create a new one.
-     */
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
-    if (!found)
-    {
-        /* initialize new entry with null pointer */
-        hash_entry->tsa_entry = NULL;
-    }
-
-    /*
-     * If entry is already valid, we're done.
-     */
-    if (hash_entry->tsa_entry)
-        return hash_entry->tsa_entry;
-
-    /*
-     * Locate the first pgStatTabList entry with free space, making a new list
-     * entry if needed.  Note that we could get an OOM failure here, but if so
-     * we have left the hashtable and the list in a consistent state.
-     */
-    if (pgStatTabList == NULL)
-    {
-        /* Set up first pgStatTabList entry */
-        pgStatTabList = (TabStatusArray *)
-            MemoryContextAllocZero(TopMemoryContext,
-                                   sizeof(TabStatusArray));
-    }
-
-    tsa = pgStatTabList;
-    while (tsa->tsa_used >= TABSTAT_QUANTUM)
-    {
-        if (tsa->tsa_next == NULL)
-            tsa->tsa_next = (TabStatusArray *)
-                MemoryContextAllocZero(TopMemoryContext,
-                                       sizeof(TabStatusArray));
-        tsa = tsa->tsa_next;
-    }
-
-    /*
-     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
-     * the entry was already zeroed, either at creation or after last use.
-     */
-    entry = &tsa->tsa_entries[tsa->tsa_used++];
-    entry->t_id = rel_id;
-    entry->t_shared = isshared;
-
-    /*
-     * Now we can fill the entry in pgStatTabHash.
-     */
-    hash_entry->tsa_entry = entry;
-
-    return entry;
-}
-
-/*
- * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
- *
- * If no entry, return NULL, don't create a new one
- *
- * Note: if we got an error in the most recent execution of pgstat_report_stat,
- * it's possible that an entry exists but there's no hashtable entry for it.
- * That's okay, we'll treat this case as "doesn't exist".
- */
-PgStat_TableStatus *
-find_tabstat_entry(Oid rel_id)
-{
-    TabStatHashEntry *hash_entry;
-
-    /* If hashtable doesn't exist, there are no entries at all */
-    if (!pgStatTabHash)
-        return NULL;
-
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
-    if (!hash_entry)
-        return NULL;
-
-    /* Note that this step could also return NULL, but that's correct */
-    return hash_entry->tsa_entry;
-}
-
-/*
- * get_tabstat_stack_level - add a new (sub)transaction stack entry if needed
- */
-static PgStat_SubXactStatus *
-get_tabstat_stack_level(int nest_level)
-{
-    PgStat_SubXactStatus *xact_state;
-
-    xact_state = pgStatXactStack;
-    if (xact_state == NULL || xact_state->nest_level != nest_level)
-    {
-        xact_state = (PgStat_SubXactStatus *)
-            MemoryContextAlloc(TopTransactionContext,
-                               sizeof(PgStat_SubXactStatus));
-        xact_state->nest_level = nest_level;
-        xact_state->prev = pgStatXactStack;
-        xact_state->first = NULL;
-        pgStatXactStack = xact_state;
-    }
-    return xact_state;
-}
-
-/*
- * add_tabstat_xact_level - add a new (sub)transaction state record
- */
-static void
-add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level)
-{
-    PgStat_SubXactStatus *xact_state;
-    PgStat_TableXactStatus *trans;
-
-    /*
-     * If this is the first rel to be modified at the current nest level, we
-     * first have to push a transaction stack entry.
-     */
-    xact_state = get_tabstat_stack_level(nest_level);
-
-    /* Now make a per-table stack entry */
-    trans = (PgStat_TableXactStatus *)
-        MemoryContextAllocZero(TopTransactionContext,
-                               sizeof(PgStat_TableXactStatus));
-    trans->nest_level = nest_level;
-    trans->upper = pgstat_info->trans;
-    trans->parent = pgstat_info;
-    trans->next = xact_state->first;
-    xact_state->first = trans;
-    pgstat_info->trans = trans;
-}
-
-/*
- * pgstat_count_heap_insert - count a tuple insertion of n tuples
- */
-void
-pgstat_count_heap_insert(Relation rel, PgStat_Counter n)
-{
-    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
-
-    if (pgstat_info != NULL)
-    {
-        /* We have to log the effect at the proper transactional level */
-        int            nest_level = GetCurrentTransactionNestLevel();
-
-        if (pgstat_info->trans == NULL ||
-            pgstat_info->trans->nest_level != nest_level)
-            add_tabstat_xact_level(pgstat_info, nest_level);
-
-        pgstat_info->trans->tuples_inserted += n;
-    }
-}
-
-/*
- * pgstat_count_heap_update - count a tuple update
- */
-void
-pgstat_count_heap_update(Relation rel, bool hot)
-{
-    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
-
-    if (pgstat_info != NULL)
-    {
-        /* We have to log the effect at the proper transactional level */
-        int            nest_level = GetCurrentTransactionNestLevel();
-
-        if (pgstat_info->trans == NULL ||
-            pgstat_info->trans->nest_level != nest_level)
-            add_tabstat_xact_level(pgstat_info, nest_level);
-
-        pgstat_info->trans->tuples_updated++;
-
-        /* t_tuples_hot_updated is nontransactional, so just advance it */
-        if (hot)
-            pgstat_info->t_counts.t_tuples_hot_updated++;
-    }
-}
-
-/*
- * pgstat_count_heap_delete - count a tuple deletion
- */
-void
-pgstat_count_heap_delete(Relation rel)
-{
-    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
-
-    if (pgstat_info != NULL)
-    {
-        /* We have to log the effect at the proper transactional level */
-        int            nest_level = GetCurrentTransactionNestLevel();
-
-        if (pgstat_info->trans == NULL ||
-            pgstat_info->trans->nest_level != nest_level)
-            add_tabstat_xact_level(pgstat_info, nest_level);
-
-        pgstat_info->trans->tuples_deleted++;
-    }
-}
-
-/*
- * pgstat_truncate_save_counters
- *
- * Whenever a table is truncated, we save its i/u/d counters so that they can
- * be cleared, and if the (sub)xact that executed the truncate later aborts,
- * the counters can be restored to the saved (pre-truncate) values.  Note we do
- * this on the first truncate in any particular subxact level only.
- */
-static void
-pgstat_truncate_save_counters(PgStat_TableXactStatus *trans)
-{
-    if (!trans->truncated)
-    {
-        trans->inserted_pre_trunc = trans->tuples_inserted;
-        trans->updated_pre_trunc = trans->tuples_updated;
-        trans->deleted_pre_trunc = trans->tuples_deleted;
-        trans->truncated = true;
-    }
-}
-
-/*
- * pgstat_truncate_restore_counters - restore counters when a truncate aborts
- */
-static void
-pgstat_truncate_restore_counters(PgStat_TableXactStatus *trans)
-{
-    if (trans->truncated)
-    {
-        trans->tuples_inserted = trans->inserted_pre_trunc;
-        trans->tuples_updated = trans->updated_pre_trunc;
-        trans->tuples_deleted = trans->deleted_pre_trunc;
-    }
-}
-
-/*
- * pgstat_count_truncate - update tuple counters due to truncate
- */
-void
-pgstat_count_truncate(Relation rel)
-{
-    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
-
-    if (pgstat_info != NULL)
-    {
-        /* We have to log the effect at the proper transactional level */
-        int            nest_level = GetCurrentTransactionNestLevel();
-
-        if (pgstat_info->trans == NULL ||
-            pgstat_info->trans->nest_level != nest_level)
-            add_tabstat_xact_level(pgstat_info, nest_level);
-
-        pgstat_truncate_save_counters(pgstat_info->trans);
-        pgstat_info->trans->tuples_inserted = 0;
-        pgstat_info->trans->tuples_updated = 0;
-        pgstat_info->trans->tuples_deleted = 0;
-    }
-}
-
-/*
- * pgstat_update_heap_dead_tuples - update dead-tuples count
- *
- * The semantics of this are that we are reporting the nontransactional
- * recovery of "delta" dead tuples; so t_delta_dead_tuples decreases
- * rather than increasing, and the change goes straight into the per-table
- * counter, not into transactional state.
- */
-void
-pgstat_update_heap_dead_tuples(Relation rel, int delta)
-{
-    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
-
-    if (pgstat_info != NULL)
-        pgstat_info->t_counts.t_delta_dead_tuples -= delta;
-}
-
-
-/* ----------
- * AtEOXact_PgStat
- *
- *    Called from access/transam/xact.c at top-level transaction commit/abort.
- * ----------
- */
-void
-AtEOXact_PgStat(bool isCommit)
-{
-    PgStat_SubXactStatus *xact_state;
-
-    /*
-     * Count transaction commit or abort.  (We use counters, not just bools,
-     * in case the reporting message isn't sent right away.)
-     */
-    if (isCommit)
-        pgStatXactCommit++;
-    else
-        pgStatXactRollback++;
-
-    /*
-     * Transfer transactional insert/update counts into the base tabstat
-     * entries.  We don't bother to free any of the transactional state, since
-     * it's all in TopTransactionContext and will go away anyway.
-     */
-    xact_state = pgStatXactStack;
-    if (xact_state != NULL)
-    {
-        PgStat_TableXactStatus *trans;
-
-        Assert(xact_state->nest_level == 1);
-        Assert(xact_state->prev == NULL);
-        for (trans = xact_state->first; trans != NULL; trans = trans->next)
-        {
-            PgStat_TableStatus *tabstat;
-
-            Assert(trans->nest_level == 1);
-            Assert(trans->upper == NULL);
-            tabstat = trans->parent;
-            Assert(tabstat->trans == trans);
-            /* restore pre-truncate stats (if any) in case of aborted xact */
-            if (!isCommit)
-                pgstat_truncate_restore_counters(trans);
-            /* count attempted actions regardless of commit/abort */
-            tabstat->t_counts.t_tuples_inserted += trans->tuples_inserted;
-            tabstat->t_counts.t_tuples_updated += trans->tuples_updated;
-            tabstat->t_counts.t_tuples_deleted += trans->tuples_deleted;
-            if (isCommit)
-            {
-                tabstat->t_counts.t_truncated = trans->truncated;
-                if (trans->truncated)
-                {
-                    /* forget live/dead stats seen by backend thus far */
-                    tabstat->t_counts.t_delta_live_tuples = 0;
-                    tabstat->t_counts.t_delta_dead_tuples = 0;
-                }
-                /* insert adds a live tuple, delete removes one */
-                tabstat->t_counts.t_delta_live_tuples +=
-                    trans->tuples_inserted - trans->tuples_deleted;
-                /* update and delete each create a dead tuple */
-                tabstat->t_counts.t_delta_dead_tuples +=
-                    trans->tuples_updated + trans->tuples_deleted;
-                /* insert, update, delete each count as one change event */
-                tabstat->t_counts.t_changed_tuples +=
-                    trans->tuples_inserted + trans->tuples_updated +
-                    trans->tuples_deleted;
-            }
-            else
-            {
-                /* inserted tuples are dead, deleted tuples are unaffected */
-                tabstat->t_counts.t_delta_dead_tuples +=
-                    trans->tuples_inserted + trans->tuples_updated;
-                /* an aborted xact generates no changed_tuple events */
-            }
-            tabstat->trans = NULL;
-        }
-    }
-    pgStatXactStack = NULL;
-
-    /* Make sure any stats snapshot is thrown away */
-    pgstat_clear_snapshot();
-}
-
-/* ----------
- * AtEOSubXact_PgStat
- *
- *    Called from access/transam/xact.c at subtransaction commit/abort.
- * ----------
- */
-void
-AtEOSubXact_PgStat(bool isCommit, int nestDepth)
-{
-    PgStat_SubXactStatus *xact_state;
-
-    /*
-     * Transfer transactional insert/update counts into the next higher
-     * subtransaction state.
-     */
-    xact_state = pgStatXactStack;
-    if (xact_state != NULL &&
-        xact_state->nest_level >= nestDepth)
-    {
-        PgStat_TableXactStatus *trans;
-        PgStat_TableXactStatus *next_trans;
-
-        /* delink xact_state from stack immediately to simplify reuse case */
-        pgStatXactStack = xact_state->prev;
-
-        for (trans = xact_state->first; trans != NULL; trans = next_trans)
-        {
-            PgStat_TableStatus *tabstat;
-
-            next_trans = trans->next;
-            Assert(trans->nest_level == nestDepth);
-            tabstat = trans->parent;
-            Assert(tabstat->trans == trans);
-            if (isCommit)
-            {
-                if (trans->upper && trans->upper->nest_level == nestDepth - 1)
-                {
-                    if (trans->truncated)
-                    {
-                        /* propagate the truncate status one level up */
-                        pgstat_truncate_save_counters(trans->upper);
-                        /* replace upper xact stats with ours */
-                        trans->upper->tuples_inserted = trans->tuples_inserted;
-                        trans->upper->tuples_updated = trans->tuples_updated;
-                        trans->upper->tuples_deleted = trans->tuples_deleted;
-                    }
-                    else
-                    {
-                        trans->upper->tuples_inserted += trans->tuples_inserted;
-                        trans->upper->tuples_updated += trans->tuples_updated;
-                        trans->upper->tuples_deleted += trans->tuples_deleted;
-                    }
-                    tabstat->trans = trans->upper;
-                    pfree(trans);
-                }
-                else
-                {
-                    /*
-                     * When there isn't an immediate parent state, we can just
-                     * reuse the record instead of going through a
-                     * palloc/pfree pushup (this works since it's all in
-                     * TopTransactionContext anyway).  We have to re-link it
-                     * into the parent level, though, and that might mean
-                     * pushing a new entry into the pgStatXactStack.
-                     */
-                    PgStat_SubXactStatus *upper_xact_state;
-
-                    upper_xact_state = get_tabstat_stack_level(nestDepth - 1);
-                    trans->next = upper_xact_state->first;
-                    upper_xact_state->first = trans;
-                    trans->nest_level = nestDepth - 1;
-                }
-            }
-            else
-            {
-                /*
-                 * On abort, update top-level tabstat counts, then forget the
-                 * subtransaction
-                 */
-
-                /* first restore values obliterated by truncate */
-                pgstat_truncate_restore_counters(trans);
-                /* count attempted actions regardless of commit/abort */
-                tabstat->t_counts.t_tuples_inserted += trans->tuples_inserted;
-                tabstat->t_counts.t_tuples_updated += trans->tuples_updated;
-                tabstat->t_counts.t_tuples_deleted += trans->tuples_deleted;
-                /* inserted tuples are dead, deleted tuples are unaffected */
-                tabstat->t_counts.t_delta_dead_tuples +=
-                    trans->tuples_inserted + trans->tuples_updated;
-                tabstat->trans = trans->upper;
-                pfree(trans);
-            }
-        }
-        pfree(xact_state);
-    }
-}
-
-
-/*
- * AtPrepare_PgStat
- *        Save the transactional stats state at 2PC transaction prepare.
- *
- * In this phase we just generate 2PC records for all the pending
- * transaction-dependent stats work.
- */
-void
-AtPrepare_PgStat(void)
-{
-    PgStat_SubXactStatus *xact_state;
-
-    xact_state = pgStatXactStack;
-    if (xact_state != NULL)
-    {
-        PgStat_TableXactStatus *trans;
-
-        Assert(xact_state->nest_level == 1);
-        Assert(xact_state->prev == NULL);
-        for (trans = xact_state->first; trans != NULL; trans = trans->next)
-        {
-            PgStat_TableStatus *tabstat;
-            TwoPhasePgStatRecord record;
-
-            Assert(trans->nest_level == 1);
-            Assert(trans->upper == NULL);
-            tabstat = trans->parent;
-            Assert(tabstat->trans == trans);
-
-            record.tuples_inserted = trans->tuples_inserted;
-            record.tuples_updated = trans->tuples_updated;
-            record.tuples_deleted = trans->tuples_deleted;
-            record.inserted_pre_trunc = trans->inserted_pre_trunc;
-            record.updated_pre_trunc = trans->updated_pre_trunc;
-            record.deleted_pre_trunc = trans->deleted_pre_trunc;
-            record.t_id = tabstat->t_id;
-            record.t_shared = tabstat->t_shared;
-            record.t_truncated = trans->truncated;
-
-            RegisterTwoPhaseRecord(TWOPHASE_RM_PGSTAT_ID, 0,
-                                   &record, sizeof(TwoPhasePgStatRecord));
-        }
-    }
-}
-
-/*
- * PostPrepare_PgStat
- *        Clean up after successful PREPARE.
- *
- * All we need do here is unlink the transaction stats state from the
- * nontransactional state.  The nontransactional action counts will be
- * reported to the stats collector immediately, while the effects on live
- * and dead tuple counts are preserved in the 2PC state file.
- *
- * Note: AtEOXact_PgStat is not called during PREPARE.
- */
-void
-PostPrepare_PgStat(void)
-{
-    PgStat_SubXactStatus *xact_state;
-
-    /*
-     * We don't bother to free any of the transactional state, since it's all
-     * in TopTransactionContext and will go away anyway.
-     */
-    xact_state = pgStatXactStack;
-    if (xact_state != NULL)
-    {
-        PgStat_TableXactStatus *trans;
-
-        for (trans = xact_state->first; trans != NULL; trans = trans->next)
-        {
-            PgStat_TableStatus *tabstat;
-
-            tabstat = trans->parent;
-            tabstat->trans = NULL;
-        }
-    }
-    pgStatXactStack = NULL;
-
-    /* Make sure any stats snapshot is thrown away */
-    pgstat_clear_snapshot();
-}
-
-/*
- * 2PC processing routine for COMMIT PREPARED case.
- *
- * Load the saved counts into our local pgstats state.
- */
-void
-pgstat_twophase_postcommit(TransactionId xid, uint16 info,
-                           void *recdata, uint32 len)
-{
-    TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata;
-    PgStat_TableStatus *pgstat_info;
-
-    /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
-
-    /* Same math as in AtEOXact_PgStat, commit case */
-    pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
-    pgstat_info->t_counts.t_tuples_updated += rec->tuples_updated;
-    pgstat_info->t_counts.t_tuples_deleted += rec->tuples_deleted;
-    pgstat_info->t_counts.t_truncated = rec->t_truncated;
-    if (rec->t_truncated)
-    {
-        /* forget live/dead stats seen by backend thus far */
-        pgstat_info->t_counts.t_delta_live_tuples = 0;
-        pgstat_info->t_counts.t_delta_dead_tuples = 0;
-    }
-    pgstat_info->t_counts.t_delta_live_tuples +=
-        rec->tuples_inserted - rec->tuples_deleted;
-    pgstat_info->t_counts.t_delta_dead_tuples +=
-        rec->tuples_updated + rec->tuples_deleted;
-    pgstat_info->t_counts.t_changed_tuples +=
-        rec->tuples_inserted + rec->tuples_updated +
-        rec->tuples_deleted;
-}
-
-/*
- * 2PC processing routine for ROLLBACK PREPARED case.
- *
- * Load the saved counts into our local pgstats state, but treat them
- * as aborted.
- */
-void
-pgstat_twophase_postabort(TransactionId xid, uint16 info,
-                          void *recdata, uint32 len)
-{
-    TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata;
-    PgStat_TableStatus *pgstat_info;
-
-    /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
-
-    /* Same math as in AtEOXact_PgStat, abort case */
-    if (rec->t_truncated)
-    {
-        rec->tuples_inserted = rec->inserted_pre_trunc;
-        rec->tuples_updated = rec->updated_pre_trunc;
-        rec->tuples_deleted = rec->deleted_pre_trunc;
-    }
-    pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
-    pgstat_info->t_counts.t_tuples_updated += rec->tuples_updated;
-    pgstat_info->t_counts.t_tuples_deleted += rec->tuples_deleted;
-    pgstat_info->t_counts.t_delta_dead_tuples +=
-        rec->tuples_inserted + rec->tuples_updated;
-}
-
-
-/* ----------
- * pgstat_fetch_stat_dbentry() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
- * ----------
- */
-PgStat_StatDBEntry *
-pgstat_fetch_stat_dbentry(Oid dbid)
-{
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
-}
-
-
-/* ----------
- * pgstat_fetch_stat_tabentry() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one table or NULL. NULL doesn't mean
- *    that the table doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
- * ----------
- */
-PgStat_StatTabEntry *
-pgstat_fetch_stat_tabentry(Oid relid)
-{
-    Oid            dbid;
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
-
-    /*
-     * If we didn't find it, maybe it's a shared table.
-     */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
-
-    return NULL;
-}
-
-
-/* ----------
- * pgstat_fetch_stat_funcentry() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one function or NULL.
- * ----------
- */
-PgStat_StatFuncEntry *
-pgstat_fetch_stat_funcentry(Oid func_id)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry = NULL;
-
-    /* load the stats file if needed */
-    backend_read_statsfile();
-
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
-
-    return funcentry;
-}
-
-
-/* ----------
- * pgstat_fetch_stat_beentry() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    our local copy of the current-activity entry for one backend.
- *
- *    NB: caller is responsible for a check if the user is permitted to see
- *    this info (especially the querystring).
- * ----------
- */
-PgBackendStatus *
-pgstat_fetch_stat_beentry(int beid)
-{
-    pgstat_read_current_status();
-
-    if (beid < 1 || beid > localNumBackends)
-        return NULL;
-
-    return &localBackendStatusTable[beid - 1].backendStatus;
-}
-
-
-/* ----------
- * pgstat_fetch_stat_local_beentry() -
- *
- *    Like pgstat_fetch_stat_beentry() but with locally computed additions (like
- *    xid and xmin values of the backend)
- *
- *    NB: caller is responsible for a check if the user is permitted to see
- *    this info (especially the querystring).
- * ----------
- */
-LocalPgBackendStatus *
-pgstat_fetch_stat_local_beentry(int beid)
-{
-    pgstat_read_current_status();
-
-    if (beid < 1 || beid > localNumBackends)
-        return NULL;
-
-    return &localBackendStatusTable[beid - 1];
-}
-
-
-/* ----------
- * pgstat_fetch_stat_numbackends() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the maximum current backend id.
- * ----------
- */
-int
-pgstat_fetch_stat_numbackends(void)
-{
-    pgstat_read_current_status();
-
-    return localNumBackends;
-}
-
-/*
- * ---------
- * pgstat_fetch_stat_archiver() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the archiver statistics struct.
- * ---------
- */
-PgStat_ArchiverStats *
-pgstat_fetch_stat_archiver(void)
-{
-    backend_read_statsfile();
-
-    return &archiverStats;
-}
-
-
-/*
- * ---------
- * pgstat_fetch_global() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the global statistics struct.
- * ---------
- */
-PgStat_GlobalStats *
-pgstat_fetch_global(void)
-{
-    backend_read_statsfile();
-
-    return &globalStats;
-}
-
-
-/* ------------------------------------------------------------
- * Functions for management of the shared-memory PgBackendStatus array
- * ------------------------------------------------------------
- */
-
-static PgBackendStatus *BackendStatusArray = NULL;
-static PgBackendStatus *MyBEEntry = NULL;
-static char *BackendAppnameBuffer = NULL;
-static char *BackendClientHostnameBuffer = NULL;
-static char *BackendActivityBuffer = NULL;
-static Size BackendActivityBufferSize = 0;
-#ifdef USE_SSL
-static PgBackendSSLStatus *BackendSslStatusBuffer = NULL;
-#endif
-
-
-/*
- * Report shared-memory space needed by CreateSharedBackendStatus.
- */
-Size
-BackendStatusShmemSize(void)
-{
-    Size        size;
-
-    /* BackendStatusArray: */
-    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
-    /* BackendAppnameBuffer: */
-    size = add_size(size,
-                    mul_size(NAMEDATALEN, NumBackendStatSlots));
-    /* BackendClientHostnameBuffer: */
-    size = add_size(size,
-                    mul_size(NAMEDATALEN, NumBackendStatSlots));
-    /* BackendActivityBuffer: */
-    size = add_size(size,
-                    mul_size(pgstat_track_activity_query_size, NumBackendStatSlots));
-#ifdef USE_SSL
-    /* BackendSslStatusBuffer: */
-    size = add_size(size,
-                    mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots));
-#endif
-    return size;
-}
-
-/*
- * Initialize the shared status array and several string buffers
- * during postmaster startup.
- */
-void
-CreateSharedBackendStatus(void)
-{
-    Size        size;
-    bool        found;
-    int            i;
-    char       *buffer;
-
-    /* Create or attach to the shared array */
-    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
-    BackendStatusArray = (PgBackendStatus *)
-        ShmemInitStruct("Backend Status Array", size, &found);
-
-    if (!found)
-    {
-        /*
-         * We're the first - initialize.
-         */
-        MemSet(BackendStatusArray, 0, size);
-    }
-
-    /* Create or attach to the shared appname buffer */
-    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
-    BackendAppnameBuffer = (char *)
-        ShmemInitStruct("Backend Application Name Buffer", size, &found);
-
-    if (!found)
-    {
-        MemSet(BackendAppnameBuffer, 0, size);
-
-        /* Initialize st_appname pointers. */
-        buffer = BackendAppnameBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_appname = buffer;
-            buffer += NAMEDATALEN;
-        }
-    }
-
-    /* Create or attach to the shared client hostname buffer */
-    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
-    BackendClientHostnameBuffer = (char *)
-        ShmemInitStruct("Backend Client Host Name Buffer", size, &found);
-
-    if (!found)
-    {
-        MemSet(BackendClientHostnameBuffer, 0, size);
-
-        /* Initialize st_clienthostname pointers. */
-        buffer = BackendClientHostnameBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_clienthostname = buffer;
-            buffer += NAMEDATALEN;
-        }
-    }
-
-    /* Create or attach to the shared activity buffer */
-    BackendActivityBufferSize = mul_size(pgstat_track_activity_query_size,
-                                         NumBackendStatSlots);
-    BackendActivityBuffer = (char *)
-        ShmemInitStruct("Backend Activity Buffer",
-                        BackendActivityBufferSize,
-                        &found);
-
-    if (!found)
-    {
-        MemSet(BackendActivityBuffer, 0, BackendActivityBufferSize);
-
-        /* Initialize st_activity pointers. */
-        buffer = BackendActivityBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_activity_raw = buffer;
-            buffer += pgstat_track_activity_query_size;
-        }
-    }
-
-#ifdef USE_SSL
-    /* Create or attach to the shared SSL status buffer */
-    size = mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots);
-    BackendSslStatusBuffer = (PgBackendSSLStatus *)
-        ShmemInitStruct("Backend SSL Status Buffer", size, &found);
-
-    if (!found)
-    {
-        PgBackendSSLStatus *ptr;
-
-        MemSet(BackendSslStatusBuffer, 0, size);
-
-        /* Initialize st_sslstatus pointers. */
-        ptr = BackendSslStatusBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_sslstatus = ptr;
-            ptr++;
-        }
-    }
-#endif
-}
-
-
-/* ----------
- * pgstat_initialize() -
- *
- *    Initialize pgstats state, and set up our on-proc-exit hook.
- *    Called from InitPostgres and AuxiliaryProcessMain. For auxiliary process,
- *    MyBackendId is invalid. Otherwise, MyBackendId must be set,
- *    but we must not have started any transaction yet (since the
- *    exit hook must run after the last transaction exit).
- *    NOTE: MyDatabaseId isn't set yet; so the shutdown hook has to be careful.
- * ----------
- */
-void
-pgstat_initialize(void)
-{
-    /* Initialize MyBEEntry */
-    if (MyBackendId != InvalidBackendId)
-    {
-        Assert(MyBackendId >= 1 && MyBackendId <= MaxBackends);
-        MyBEEntry = &BackendStatusArray[MyBackendId - 1];
-    }
-    else
-    {
-        /* Must be an auxiliary process */
-        Assert(MyAuxProcType != NotAnAuxProcess);
-
-        /*
-         * Assign the MyBEEntry for an auxiliary process.  Since it doesn't
-         * have a BackendId, the slot is statically allocated based on the
-         * auxiliary process type (MyAuxProcType).  Backends use slots indexed
-         * in the range from 1 to MaxBackends (inclusive), so we use
-         * MaxBackends + AuxBackendType + 1 as the index of the slot for an
-         * auxiliary process.
-         */
-        MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
-    }
-
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
-}
-
-/* ----------
- * pgstat_bestart() -
- *
- *    Initialize this backend's entry in the PgBackendStatus array.
- *    Called from InitPostgres.
- *
- *    Apart from auxiliary processes, MyBackendId, MyDatabaseId,
- *    session userid, and application_name must be set for a
- *    backend (hence, this cannot be combined with pgstat_initialize).
- * ----------
- */
-void
-pgstat_bestart(void)
-{
-    SockAddr    clientaddr;
-    volatile PgBackendStatus *beentry;
-
-    /*
-     * To minimize the time spent modifying the PgBackendStatus entry, fetch
-     * all the needed data first.
-     */
-
-    /*
-     * We may not have a MyProcPort (eg, if this is the autovacuum process).
-     * If so, use all-zeroes client address, which is dealt with specially in
-     * pg_stat_get_backend_client_addr and pg_stat_get_backend_client_port.
-     */
-    if (MyProcPort)
-        memcpy(&clientaddr, &MyProcPort->raddr, sizeof(clientaddr));
-    else
-        MemSet(&clientaddr, 0, sizeof(clientaddr));
-
-    /*
-     * Initialize my status entry, following the protocol of bumping
-     * st_changecount before and after; and make sure it's even afterwards. We
-     * use a volatile pointer here to ensure the compiler doesn't try to get
-     * cute.
-     */
-    beentry = MyBEEntry;
-
-    /* pgstats state must be initialized from pgstat_initialize() */
-    Assert(beentry != NULL);
-
-    if (MyBackendId != InvalidBackendId)
-    {
-        if (IsAutoVacuumLauncherProcess())
-        {
-            /* Autovacuum Launcher */
-            beentry->st_backendType = B_AUTOVAC_LAUNCHER;
-        }
-        else if (IsAutoVacuumWorkerProcess())
-        {
-            /* Autovacuum Worker */
-            beentry->st_backendType = B_AUTOVAC_WORKER;
-        }
-        else if (am_walsender)
-        {
-            /* Wal sender */
-            beentry->st_backendType = B_WAL_SENDER;
-        }
-        else if (IsBackgroundWorker)
-        {
-            /* bgworker */
-            beentry->st_backendType = B_BG_WORKER;
-        }
-        else
-        {
-            /* client-backend */
-            beentry->st_backendType = B_BACKEND;
-        }
-    }
-    else
-    {
-        /* Must be an auxiliary process */
-        Assert(MyAuxProcType != NotAnAuxProcess);
-        switch (MyAuxProcType)
-        {
-            case StartupProcess:
-                beentry->st_backendType = B_STARTUP;
-                break;
-            case BgWriterProcess:
-                beentry->st_backendType = B_BG_WRITER;
-                break;
-            case ArchiverProcess:
-                beentry->st_backendType = B_ARCHIVER;
-                break;
-            case CheckpointerProcess:
-                beentry->st_backendType = B_CHECKPOINTER;
-                break;
-            case WalWriterProcess:
-                beentry->st_backendType = B_WAL_WRITER;
-                break;
-            case WalReceiverProcess:
-                beentry->st_backendType = B_WAL_RECEIVER;
-                break;
-            default:
-                elog(FATAL, "unrecognized process type: %d",
-                     (int) MyAuxProcType);
-                proc_exit(1);
-        }
-    }
-
-    do
-    {
-        pgstat_increment_changecount_before(beentry);
-    } while ((beentry->st_changecount & 1) == 0);
-
-    beentry->st_procpid = MyProcPid;
-    beentry->st_proc_start_timestamp = MyStartTimestamp;
-    beentry->st_activity_start_timestamp = 0;
-    beentry->st_state_start_timestamp = 0;
-    beentry->st_xact_start_timestamp = 0;
-    beentry->st_databaseid = MyDatabaseId;
-
-    /* We have userid for client-backends, wal-sender and bgworker processes */
-    if (beentry->st_backendType == B_BACKEND
-        || beentry->st_backendType == B_WAL_SENDER
-        || beentry->st_backendType == B_BG_WORKER)
-        beentry->st_userid = GetSessionUserId();
-    else
-        beentry->st_userid = InvalidOid;
-
-    beentry->st_clientaddr = clientaddr;
-    if (MyProcPort && MyProcPort->remote_hostname)
-        strlcpy(beentry->st_clienthostname, MyProcPort->remote_hostname,
-                NAMEDATALEN);
-    else
-        beentry->st_clienthostname[0] = '\0';
-#ifdef USE_SSL
-    if (MyProcPort && MyProcPort->ssl != NULL)
-    {
-        beentry->st_ssl = true;
-        beentry->st_sslstatus->ssl_bits = be_tls_get_cipher_bits(MyProcPort);
-        beentry->st_sslstatus->ssl_compression = be_tls_get_compression(MyProcPort);
-        strlcpy(beentry->st_sslstatus->ssl_version, be_tls_get_version(MyProcPort), NAMEDATALEN);
-        strlcpy(beentry->st_sslstatus->ssl_cipher, be_tls_get_cipher(MyProcPort), NAMEDATALEN);
-        be_tls_get_peer_subject_name(MyProcPort, beentry->st_sslstatus->ssl_client_dn, NAMEDATALEN);
-        be_tls_get_peer_serial(MyProcPort, beentry->st_sslstatus->ssl_client_serial, NAMEDATALEN);
-        be_tls_get_peer_issuer_name(MyProcPort, beentry->st_sslstatus->ssl_issuer_dn, NAMEDATALEN);
-    }
-    else
-    {
-        beentry->st_ssl = false;
-    }
-#else
-    beentry->st_ssl = false;
-#endif
-    beentry->st_state = STATE_UNDEFINED;
-    beentry->st_appname[0] = '\0';
-    beentry->st_activity_raw[0] = '\0';
-    /* Also make sure the last byte in each string area is always 0 */
-    beentry->st_clienthostname[NAMEDATALEN - 1] = '\0';
-    beentry->st_appname[NAMEDATALEN - 1] = '\0';
-    beentry->st_activity_raw[pgstat_track_activity_query_size - 1] = '\0';
-    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
-    beentry->st_progress_command_target = InvalidOid;
-
-    /*
-     * we don't zero st_progress_param here to save cycles; nobody should
-     * examine it until st_progress_command has been set to something other
-     * than PROGRESS_COMMAND_INVALID
-     */
-
-    pgstat_increment_changecount_after(beentry);
-
-    /* Update app name to current GUC setting */
-    if (application_name)
-        pgstat_report_appname(application_name);
-}
-
-/*
- * Shut down a single backend's statistics reporting at process exit.
- *
- * Flush any remaining statistics counts out to the collector.
- * Without this, operations triggered during backend exit (such as
- * temp table deletions) won't be counted.
- *
- * Lastly, clear out our entry in the PgBackendStatus array.
- */
-static void
-pgstat_beshutdown_hook(int code, Datum arg)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    /*
-     * If we got as far as discovering our own database ID, we can report what
-     * we did to the collector.  Otherwise, we'd be sending an invalid
-     * database ID, so forget it.  (This means that accesses to pg_database
-     * during failed backend starts might never get counted.)
-     */
-    if (OidIsValid(MyDatabaseId))
-        pgstat_report_stat(true);
-
-    /*
-     * Clear my status entry, following the protocol of bumping st_changecount
-     * before and after.  We use a volatile pointer here to ensure the
-     * compiler doesn't try to get cute.
-     */
-    pgstat_increment_changecount_before(beentry);
-
-    beentry->st_procpid = 0;    /* mark invalid */
-
-    pgstat_increment_changecount_after(beentry);
-}
-
-
-/* ----------
- * pgstat_report_activity() -
- *
- *    Called from tcop/postgres.c to report what the backend is actually doing
- *    (but note cmd_str can be NULL for certain cases).
- *
- * All updates of the status entry follow the protocol of bumping
- * st_changecount before and after.  We use a volatile pointer here to
- * ensure the compiler doesn't try to get cute.
- * ----------
- */
-void
-pgstat_report_activity(BackendState state, const char *cmd_str)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-    TimestampTz start_timestamp;
-    TimestampTz current_timestamp;
-    int            len = 0;
-
-    TRACE_POSTGRESQL_STATEMENT_STATUS(cmd_str);
-
-    if (!beentry)
-        return;
-
-    if (!pgstat_track_activities)
-    {
-        if (beentry->st_state != STATE_DISABLED)
-        {
-            volatile PGPROC *proc = MyProc;
-
-            /*
-             * track_activities is disabled, but we last reported a
-             * non-disabled state.  As our final update, change the state and
-             * clear fields we will not be updating anymore.
-             */
-            pgstat_increment_changecount_before(beentry);
-            beentry->st_state = STATE_DISABLED;
-            beentry->st_state_start_timestamp = 0;
-            beentry->st_activity_raw[0] = '\0';
-            beentry->st_activity_start_timestamp = 0;
-            /* st_xact_start_timestamp and wait_event_info are also disabled */
-            beentry->st_xact_start_timestamp = 0;
-            proc->wait_event_info = 0;
-            pgstat_increment_changecount_after(beentry);
-        }
-        return;
-    }
-
-    /*
-     * To minimize the time spent modifying the entry, fetch all the needed
-     * data first.
-     */
-    start_timestamp = GetCurrentStatementStartTimestamp();
-    if (cmd_str != NULL)
-    {
-        /*
-         * Compute length of to-be-stored string unaware of multi-byte
-         * characters. For speed reasons that'll get corrected on read, rather
-         * than computed every write.
-         */
-        len = Min(strlen(cmd_str), pgstat_track_activity_query_size - 1);
-    }
-    current_timestamp = GetCurrentTimestamp();
-
-    /*
-     * Now update the status entry
-     */
-    pgstat_increment_changecount_before(beentry);
-
-    beentry->st_state = state;
-    beentry->st_state_start_timestamp = current_timestamp;
-
-    if (cmd_str != NULL)
-    {
-        memcpy((char *) beentry->st_activity_raw, cmd_str, len);
-        beentry->st_activity_raw[len] = '\0';
-        beentry->st_activity_start_timestamp = start_timestamp;
-    }
-
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_start_command() -
- *
- * Set st_progress_command (and st_progress_command_target) in own backend
- * entry.  Also, zero-initialize st_progress_param array.
- *-----------
- */
-void
-pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    if (!beentry || !pgstat_track_activities)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_progress_command = cmdtype;
-    beentry->st_progress_command_target = relid;
-    MemSet(&beentry->st_progress_param, 0, sizeof(beentry->st_progress_param));
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_update_param() -
- *
- * Update index'th member in st_progress_param[] of own backend entry.
- *-----------
- */
-void
-pgstat_progress_update_param(int index, int64 val)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    Assert(index >= 0 && index < PGSTAT_NUM_PROGRESS_PARAM);
-
-    if (!beentry || !pgstat_track_activities)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_progress_param[index] = val;
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_update_multi_param() -
- *
- * Update multiple members in st_progress_param[] of own backend entry.
- * This is atomic; readers won't see intermediate states.
- *-----------
- */
-void
-pgstat_progress_update_multi_param(int nparam, const int *index,
-                                   const int64 *val)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-    int            i;
-
-    if (!beentry || !pgstat_track_activities || nparam == 0)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-
-    for (i = 0; i < nparam; ++i)
-    {
-        Assert(index[i] >= 0 && index[i] < PGSTAT_NUM_PROGRESS_PARAM);
-
-        beentry->st_progress_param[index[i]] = val[i];
-    }
-
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_end_command() -
- *
- * Reset st_progress_command (and st_progress_command_target) in own backend
- * entry.  This signals the end of the command.
- *-----------
- */
-void
-pgstat_progress_end_command(void)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    if (!beentry)
-        return;
-    if (!pgstat_track_activities
-        && beentry->st_progress_command == PROGRESS_COMMAND_INVALID)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
-    beentry->st_progress_command_target = InvalidOid;
-    pgstat_increment_changecount_after(beentry);
-}
-
-/* ----------
- * pgstat_report_appname() -
- *
- *    Called to update our application name.
- * ----------
- */
-void
-pgstat_report_appname(const char *appname)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-    int            len;
-
-    if (!beentry)
-        return;
-
-    /* This should be unnecessary if GUC did its job, but be safe */
-    len = pg_mbcliplen(appname, strlen(appname), NAMEDATALEN - 1);
-
-    /*
-     * Update my status entry, following the protocol of bumping
-     * st_changecount before and after.  We use a volatile pointer here to
-     * ensure the compiler doesn't try to get cute.
-     */
-    pgstat_increment_changecount_before(beentry);
-
-    memcpy((char *) beentry->st_appname, appname, len);
-    beentry->st_appname[len] = '\0';
-
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*
- * Report current transaction start timestamp as the specified value.
- * Zero means there is no active transaction.
- */
-void
-pgstat_report_xact_timestamp(TimestampTz tstamp)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    if (!pgstat_track_activities || !beentry)
-        return;
-
-    /*
-     * Update my status entry, following the protocol of bumping
-     * st_changecount before and after.  We use a volatile pointer here to
-     * ensure the compiler doesn't try to get cute.
-     */
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_xact_start_timestamp = tstamp;
-    pgstat_increment_changecount_after(beentry);
-}
-
-/* ----------
- * pgstat_read_current_status() -
- *
- *    Copy the current contents of the PgBackendStatus array to local memory,
- *    if not already done in this transaction.
- * ----------
- */
-static void
-pgstat_read_current_status(void)
-{
-    volatile PgBackendStatus *beentry;
-    LocalPgBackendStatus *localtable;
-    LocalPgBackendStatus *localentry;
-    char       *localappname,
-               *localclienthostname,
-               *localactivity;
-#ifdef USE_SSL
-    PgBackendSSLStatus *localsslstatus;
-#endif
-    int            i;
-
-    Assert(!pgStatRunningInCollector);
-    if (localBackendStatusTable)
-        return;                    /* already done */
-
-    pgstat_setup_memcxt();
-
-    localtable = (LocalPgBackendStatus *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           sizeof(LocalPgBackendStatus) * NumBackendStatSlots);
-    localappname = (char *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           NAMEDATALEN * NumBackendStatSlots);
-    localclienthostname = (char *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           NAMEDATALEN * NumBackendStatSlots);
-    localactivity = (char *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           pgstat_track_activity_query_size * NumBackendStatSlots);
-#ifdef USE_SSL
-    localsslstatus = (PgBackendSSLStatus *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           sizeof(PgBackendSSLStatus) * NumBackendStatSlots);
-#endif
-
-    localNumBackends = 0;
-
-    beentry = BackendStatusArray;
-    localentry = localtable;
-    for (i = 1; i <= NumBackendStatSlots; i++)
-    {
-        /*
-         * Follow the protocol of retrying if st_changecount changes while we
-         * copy the entry, or if it's odd.  (The check for odd is needed to
-         * cover the case where we are able to completely copy the entry while
-         * the source backend is between increment steps.)    We use a volatile
-         * pointer here to ensure the compiler doesn't try to get cute.
-         */
-        for (;;)
-        {
-            int            before_changecount;
-            int            after_changecount;
-
-            pgstat_save_changecount_before(beentry, before_changecount);
-
-            localentry->backendStatus.st_procpid = beentry->st_procpid;
-            if (localentry->backendStatus.st_procpid > 0)
-            {
-                memcpy(&localentry->backendStatus, (char *) beentry, sizeof(PgBackendStatus));
-
-                /*
-                 * strcpy is safe even if the string is modified concurrently,
-                 * because there's always a \0 at the end of the buffer.
-                 */
-                strcpy(localappname, (char *) beentry->st_appname);
-                localentry->backendStatus.st_appname = localappname;
-                strcpy(localclienthostname, (char *) beentry->st_clienthostname);
-                localentry->backendStatus.st_clienthostname = localclienthostname;
-                strcpy(localactivity, (char *) beentry->st_activity_raw);
-                localentry->backendStatus.st_activity_raw = localactivity;
-                localentry->backendStatus.st_ssl = beentry->st_ssl;
-#ifdef USE_SSL
-                if (beentry->st_ssl)
-                {
-                    memcpy(localsslstatus, beentry->st_sslstatus, sizeof(PgBackendSSLStatus));
-                    localentry->backendStatus.st_sslstatus = localsslstatus;
-                }
-#endif
-            }
-
-            pgstat_save_changecount_after(beentry, after_changecount);
-            if (before_changecount == after_changecount &&
-                (before_changecount & 1) == 0)
-                break;
-
-            /* Make sure we can break out of loop if stuck... */
-            CHECK_FOR_INTERRUPTS();
-        }
-
-        beentry++;
-        /* Only valid entries get included into the local array */
-        if (localentry->backendStatus.st_procpid > 0)
-        {
-            BackendIdGetTransactionIds(i,
-                                       &localentry->backend_xid,
-                                       &localentry->backend_xmin);
-
-            localentry++;
-            localappname += NAMEDATALEN;
-            localclienthostname += NAMEDATALEN;
-            localactivity += pgstat_track_activity_query_size;
-#ifdef USE_SSL
-            localsslstatus++;
-#endif
-            localNumBackends++;
-        }
-    }
-
-    /* Set the pointer only after completion of a valid table */
-    localBackendStatusTable = localtable;
-}
-
-/* ----------
- * pgstat_get_wait_event_type() -
- *
- *    Return a string representing the current wait event type, backend is
- *    waiting on.
- */
-const char *
-pgstat_get_wait_event_type(uint32 wait_event_info)
-{
-    uint32        classId;
-    const char *event_type;
-
-    /* report process as not waiting. */
-    if (wait_event_info == 0)
-        return NULL;
-
-    classId = wait_event_info & 0xFF000000;
-
-    switch (classId)
-    {
-        case PG_WAIT_LWLOCK:
-            event_type = "LWLock";
-            break;
-        case PG_WAIT_LOCK:
-            event_type = "Lock";
-            break;
-        case PG_WAIT_BUFFER_PIN:
-            event_type = "BufferPin";
-            break;
-        case PG_WAIT_ACTIVITY:
-            event_type = "Activity";
-            break;
-        case PG_WAIT_CLIENT:
-            event_type = "Client";
-            break;
-        case PG_WAIT_EXTENSION:
-            event_type = "Extension";
-            break;
-        case PG_WAIT_IPC:
-            event_type = "IPC";
-            break;
-        case PG_WAIT_TIMEOUT:
-            event_type = "Timeout";
-            break;
-        case PG_WAIT_IO:
-            event_type = "IO";
-            break;
-        default:
-            event_type = "???";
-            break;
-    }
-
-    return event_type;
-}
-
-/* ----------
- * pgstat_get_wait_event() -
- *
- *    Return a string representing the current wait event, backend is
- *    waiting on.
- */
-const char *
-pgstat_get_wait_event(uint32 wait_event_info)
-{
-    uint32        classId;
-    uint16        eventId;
-    const char *event_name;
-
-    /* report process as not waiting. */
-    if (wait_event_info == 0)
-        return NULL;
-
-    classId = wait_event_info & 0xFF000000;
-    eventId = wait_event_info & 0x0000FFFF;
-
-    switch (classId)
-    {
-        case PG_WAIT_LWLOCK:
-            event_name = GetLWLockIdentifier(classId, eventId);
-            break;
-        case PG_WAIT_LOCK:
-            event_name = GetLockNameFromTagType(eventId);
-            break;
-        case PG_WAIT_BUFFER_PIN:
-            event_name = "BufferPin";
-            break;
-        case PG_WAIT_ACTIVITY:
-            {
-                WaitEventActivity w = (WaitEventActivity) wait_event_info;
-
-                event_name = pgstat_get_wait_activity(w);
-                break;
-            }
-        case PG_WAIT_CLIENT:
-            {
-                WaitEventClient w = (WaitEventClient) wait_event_info;
-
-                event_name = pgstat_get_wait_client(w);
-                break;
-            }
-        case PG_WAIT_EXTENSION:
-            event_name = "Extension";
-            break;
-        case PG_WAIT_IPC:
-            {
-                WaitEventIPC w = (WaitEventIPC) wait_event_info;
-
-                event_name = pgstat_get_wait_ipc(w);
-                break;
-            }
-        case PG_WAIT_TIMEOUT:
-            {
-                WaitEventTimeout w = (WaitEventTimeout) wait_event_info;
-
-                event_name = pgstat_get_wait_timeout(w);
-                break;
-            }
-        case PG_WAIT_IO:
-            {
-                WaitEventIO w = (WaitEventIO) wait_event_info;
-
-                event_name = pgstat_get_wait_io(w);
-                break;
-            }
-        default:
-            event_name = "unknown wait event";
-            break;
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_activity() -
- *
- * Convert WaitEventActivity to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_activity(WaitEventActivity w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_ARCHIVER_MAIN:
-            event_name = "ArchiverMain";
-            break;
-        case WAIT_EVENT_AUTOVACUUM_MAIN:
-            event_name = "AutoVacuumMain";
-            break;
-        case WAIT_EVENT_BGWRITER_HIBERNATE:
-            event_name = "BgWriterHibernate";
-            break;
-        case WAIT_EVENT_BGWRITER_MAIN:
-            event_name = "BgWriterMain";
-            break;
-        case WAIT_EVENT_CHECKPOINTER_MAIN:
-            event_name = "CheckpointerMain";
-            break;
-        case WAIT_EVENT_LOGICAL_APPLY_MAIN:
-            event_name = "LogicalApplyMain";
-            break;
-        case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
-            event_name = "LogicalLauncherMain";
-            break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
-            break;
-        case WAIT_EVENT_RECOVERY_WAL_ALL:
-            event_name = "RecoveryWalAll";
-            break;
-        case WAIT_EVENT_RECOVERY_WAL_STREAM:
-            event_name = "RecoveryWalStream";
-            break;
-        case WAIT_EVENT_SYSLOGGER_MAIN:
-            event_name = "SysLoggerMain";
-            break;
-        case WAIT_EVENT_WAL_RECEIVER_MAIN:
-            event_name = "WalReceiverMain";
-            break;
-        case WAIT_EVENT_WAL_SENDER_MAIN:
-            event_name = "WalSenderMain";
-            break;
-        case WAIT_EVENT_WAL_WRITER_MAIN:
-            event_name = "WalWriterMain";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_client() -
- *
- * Convert WaitEventClient to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_client(WaitEventClient w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_CLIENT_READ:
-            event_name = "ClientRead";
-            break;
-        case WAIT_EVENT_CLIENT_WRITE:
-            event_name = "ClientWrite";
-            break;
-        case WAIT_EVENT_LIBPQWALRECEIVER_CONNECT:
-            event_name = "LibPQWalReceiverConnect";
-            break;
-        case WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE:
-            event_name = "LibPQWalReceiverReceive";
-            break;
-        case WAIT_EVENT_SSL_OPEN_SERVER:
-            event_name = "SSLOpenServer";
-            break;
-        case WAIT_EVENT_WAL_RECEIVER_WAIT_START:
-            event_name = "WalReceiverWaitStart";
-            break;
-        case WAIT_EVENT_WAL_SENDER_WAIT_WAL:
-            event_name = "WalSenderWaitForWAL";
-            break;
-        case WAIT_EVENT_WAL_SENDER_WRITE_DATA:
-            event_name = "WalSenderWriteData";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_ipc() -
- *
- * Convert WaitEventIPC to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_ipc(WaitEventIPC w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_BGWORKER_SHUTDOWN:
-            event_name = "BgWorkerShutdown";
-            break;
-        case WAIT_EVENT_BGWORKER_STARTUP:
-            event_name = "BgWorkerStartup";
-            break;
-        case WAIT_EVENT_BTREE_PAGE:
-            event_name = "BtreePage";
-            break;
-        case WAIT_EVENT_CLOG_GROUP_UPDATE:
-            event_name = "ClogGroupUpdate";
-            break;
-        case WAIT_EVENT_EXECUTE_GATHER:
-            event_name = "ExecuteGather";
-            break;
-        case WAIT_EVENT_HASH_BATCH_ALLOCATING:
-            event_name = "Hash/Batch/Allocating";
-            break;
-        case WAIT_EVENT_HASH_BATCH_ELECTING:
-            event_name = "Hash/Batch/Electing";
-            break;
-        case WAIT_EVENT_HASH_BATCH_LOADING:
-            event_name = "Hash/Batch/Loading";
-            break;
-        case WAIT_EVENT_HASH_BUILD_ALLOCATING:
-            event_name = "Hash/Build/Allocating";
-            break;
-        case WAIT_EVENT_HASH_BUILD_ELECTING:
-            event_name = "Hash/Build/Electing";
-            break;
-        case WAIT_EVENT_HASH_BUILD_HASHING_INNER:
-            event_name = "Hash/Build/HashingInner";
-            break;
-        case WAIT_EVENT_HASH_BUILD_HASHING_OUTER:
-            event_name = "Hash/Build/HashingOuter";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING:
-            event_name = "Hash/GrowBatches/Allocating";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_DECIDING:
-            event_name = "Hash/GrowBatches/Deciding";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_ELECTING:
-            event_name = "Hash/GrowBatches/Electing";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_FINISHING:
-            event_name = "Hash/GrowBatches/Finishing";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING:
-            event_name = "Hash/GrowBatches/Repartitioning";
-            break;
-        case WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING:
-            event_name = "Hash/GrowBuckets/Allocating";
-            break;
-        case WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING:
-            event_name = "Hash/GrowBuckets/Electing";
-            break;
-        case WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING:
-            event_name = "Hash/GrowBuckets/Reinserting";
-            break;
-        case WAIT_EVENT_LOGICAL_SYNC_DATA:
-            event_name = "LogicalSyncData";
-            break;
-        case WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE:
-            event_name = "LogicalSyncStateChange";
-            break;
-        case WAIT_EVENT_MQ_INTERNAL:
-            event_name = "MessageQueueInternal";
-            break;
-        case WAIT_EVENT_MQ_PUT_MESSAGE:
-            event_name = "MessageQueuePutMessage";
-            break;
-        case WAIT_EVENT_MQ_RECEIVE:
-            event_name = "MessageQueueReceive";
-            break;
-        case WAIT_EVENT_MQ_SEND:
-            event_name = "MessageQueueSend";
-            break;
-        case WAIT_EVENT_PARALLEL_BITMAP_SCAN:
-            event_name = "ParallelBitmapScan";
-            break;
-        case WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN:
-            event_name = "ParallelCreateIndexScan";
-            break;
-        case WAIT_EVENT_PARALLEL_FINISH:
-            event_name = "ParallelFinish";
-            break;
-        case WAIT_EVENT_PROCARRAY_GROUP_UPDATE:
-            event_name = "ProcArrayGroupUpdate";
-            break;
-        case WAIT_EVENT_PROMOTE:
-            event_name = "Promote";
-            break;
-        case WAIT_EVENT_REPLICATION_ORIGIN_DROP:
-            event_name = "ReplicationOriginDrop";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_DROP:
-            event_name = "ReplicationSlotDrop";
-            break;
-        case WAIT_EVENT_SAFE_SNAPSHOT:
-            event_name = "SafeSnapshot";
-            break;
-        case WAIT_EVENT_SYNC_REP:
-            event_name = "SyncRep";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_timeout() -
- *
- * Convert WaitEventTimeout to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_timeout(WaitEventTimeout w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_BASE_BACKUP_THROTTLE:
-            event_name = "BaseBackupThrottle";
-            break;
-        case WAIT_EVENT_PG_SLEEP:
-            event_name = "PgSleep";
-            break;
-        case WAIT_EVENT_RECOVERY_APPLY_DELAY:
-            event_name = "RecoveryApplyDelay";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_io() -
- *
- * Convert WaitEventIO to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_io(WaitEventIO w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_BUFFILE_READ:
-            event_name = "BufFileRead";
-            break;
-        case WAIT_EVENT_BUFFILE_WRITE:
-            event_name = "BufFileWrite";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_READ:
-            event_name = "ControlFileRead";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_SYNC:
-            event_name = "ControlFileSync";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE:
-            event_name = "ControlFileSyncUpdate";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_WRITE:
-            event_name = "ControlFileWrite";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE:
-            event_name = "ControlFileWriteUpdate";
-            break;
-        case WAIT_EVENT_COPY_FILE_READ:
-            event_name = "CopyFileRead";
-            break;
-        case WAIT_EVENT_COPY_FILE_WRITE:
-            event_name = "CopyFileWrite";
-            break;
-        case WAIT_EVENT_DATA_FILE_EXTEND:
-            event_name = "DataFileExtend";
-            break;
-        case WAIT_EVENT_DATA_FILE_FLUSH:
-            event_name = "DataFileFlush";
-            break;
-        case WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC:
-            event_name = "DataFileImmediateSync";
-            break;
-        case WAIT_EVENT_DATA_FILE_PREFETCH:
-            event_name = "DataFilePrefetch";
-            break;
-        case WAIT_EVENT_DATA_FILE_READ:
-            event_name = "DataFileRead";
-            break;
-        case WAIT_EVENT_DATA_FILE_SYNC:
-            event_name = "DataFileSync";
-            break;
-        case WAIT_EVENT_DATA_FILE_TRUNCATE:
-            event_name = "DataFileTruncate";
-            break;
-        case WAIT_EVENT_DATA_FILE_WRITE:
-            event_name = "DataFileWrite";
-            break;
-        case WAIT_EVENT_DSM_FILL_ZERO_WRITE:
-            event_name = "DSMFillZeroWrite";
-            break;
-        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ:
-            event_name = "LockFileAddToDataDirRead";
-            break;
-        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC:
-            event_name = "LockFileAddToDataDirSync";
-            break;
-        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE:
-            event_name = "LockFileAddToDataDirWrite";
-            break;
-        case WAIT_EVENT_LOCK_FILE_CREATE_READ:
-            event_name = "LockFileCreateRead";
-            break;
-        case WAIT_EVENT_LOCK_FILE_CREATE_SYNC:
-            event_name = "LockFileCreateSync";
-            break;
-        case WAIT_EVENT_LOCK_FILE_CREATE_WRITE:
-            event_name = "LockFileCreateWrite";
-            break;
-        case WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ:
-            event_name = "LockFileReCheckDataDirRead";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC:
-            event_name = "LogicalRewriteCheckpointSync";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC:
-            event_name = "LogicalRewriteMappingSync";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE:
-            event_name = "LogicalRewriteMappingWrite";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_SYNC:
-            event_name = "LogicalRewriteSync";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE:
-            event_name = "LogicalRewriteTruncate";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_WRITE:
-            event_name = "LogicalRewriteWrite";
-            break;
-        case WAIT_EVENT_RELATION_MAP_READ:
-            event_name = "RelationMapRead";
-            break;
-        case WAIT_EVENT_RELATION_MAP_SYNC:
-            event_name = "RelationMapSync";
-            break;
-        case WAIT_EVENT_RELATION_MAP_WRITE:
-            event_name = "RelationMapWrite";
-            break;
-        case WAIT_EVENT_REORDER_BUFFER_READ:
-            event_name = "ReorderBufferRead";
-            break;
-        case WAIT_EVENT_REORDER_BUFFER_WRITE:
-            event_name = "ReorderBufferWrite";
-            break;
-        case WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ:
-            event_name = "ReorderLogicalMappingRead";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_READ:
-            event_name = "ReplicationSlotRead";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC:
-            event_name = "ReplicationSlotRestoreSync";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_SYNC:
-            event_name = "ReplicationSlotSync";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_WRITE:
-            event_name = "ReplicationSlotWrite";
-            break;
-        case WAIT_EVENT_SLRU_FLUSH_SYNC:
-            event_name = "SLRUFlushSync";
-            break;
-        case WAIT_EVENT_SLRU_READ:
-            event_name = "SLRURead";
-            break;
-        case WAIT_EVENT_SLRU_SYNC:
-            event_name = "SLRUSync";
-            break;
-        case WAIT_EVENT_SLRU_WRITE:
-            event_name = "SLRUWrite";
-            break;
-        case WAIT_EVENT_SNAPBUILD_READ:
-            event_name = "SnapbuildRead";
-            break;
-        case WAIT_EVENT_SNAPBUILD_SYNC:
-            event_name = "SnapbuildSync";
-            break;
-        case WAIT_EVENT_SNAPBUILD_WRITE:
-            event_name = "SnapbuildWrite";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC:
-            event_name = "TimelineHistoryFileSync";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE:
-            event_name = "TimelineHistoryFileWrite";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_READ:
-            event_name = "TimelineHistoryRead";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_SYNC:
-            event_name = "TimelineHistorySync";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_WRITE:
-            event_name = "TimelineHistoryWrite";
-            break;
-        case WAIT_EVENT_TWOPHASE_FILE_READ:
-            event_name = "TwophaseFileRead";
-            break;
-        case WAIT_EVENT_TWOPHASE_FILE_SYNC:
-            event_name = "TwophaseFileSync";
-            break;
-        case WAIT_EVENT_TWOPHASE_FILE_WRITE:
-            event_name = "TwophaseFileWrite";
-            break;
-        case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ:
-            event_name = "WALSenderTimelineHistoryRead";
-            break;
-        case WAIT_EVENT_WAL_BOOTSTRAP_SYNC:
-            event_name = "WALBootstrapSync";
-            break;
-        case WAIT_EVENT_WAL_BOOTSTRAP_WRITE:
-            event_name = "WALBootstrapWrite";
-            break;
-        case WAIT_EVENT_WAL_COPY_READ:
-            event_name = "WALCopyRead";
-            break;
-        case WAIT_EVENT_WAL_COPY_SYNC:
-            event_name = "WALCopySync";
-            break;
-        case WAIT_EVENT_WAL_COPY_WRITE:
-            event_name = "WALCopyWrite";
-            break;
-        case WAIT_EVENT_WAL_INIT_SYNC:
-            event_name = "WALInitSync";
-            break;
-        case WAIT_EVENT_WAL_INIT_WRITE:
-            event_name = "WALInitWrite";
-            break;
-        case WAIT_EVENT_WAL_READ:
-            event_name = "WALRead";
-            break;
-        case WAIT_EVENT_WAL_SYNC:
-            event_name = "WALSync";
-            break;
-        case WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN:
-            event_name = "WALSyncMethodAssign";
-            break;
-        case WAIT_EVENT_WAL_WRITE:
-            event_name = "WALWrite";
-            break;
-
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-
-/* ----------
- * pgstat_get_backend_current_activity() -
- *
- *    Return a string representing the current activity of the backend with
- *    the specified PID.  This looks directly at the BackendStatusArray,
- *    and so will provide current information regardless of the age of our
- *    transaction's snapshot of the status array.
- *
- *    It is the caller's responsibility to invoke this only for backends whose
- *    state is expected to remain stable while the result is in use.  The
- *    only current use is in deadlock reporting, where we can expect that
- *    the target backend is blocked on a lock.  (There are corner cases
- *    where the target's wait could get aborted while we are looking at it,
- *    but the very worst consequence is to return a pointer to a string
- *    that's been changed, so we won't worry too much.)
- *
- *    Note: return strings for special cases match pg_stat_get_backend_activity.
- * ----------
- */
-const char *
-pgstat_get_backend_current_activity(int pid, bool checkUser)
-{
-    PgBackendStatus *beentry;
-    int            i;
-
-    beentry = BackendStatusArray;
-    for (i = 1; i <= MaxBackends; i++)
-    {
-        /*
-         * Although we expect the target backend's entry to be stable, that
-         * doesn't imply that anyone else's is.  To avoid identifying the
-         * wrong backend, while we check for a match to the desired PID we
-         * must follow the protocol of retrying if st_changecount changes
-         * while we examine the entry, or if it's odd.  (This might be
-         * unnecessary, since fetching or storing an int is almost certainly
-         * atomic, but let's play it safe.)  We use a volatile pointer here to
-         * ensure the compiler doesn't try to get cute.
-         */
-        volatile PgBackendStatus *vbeentry = beentry;
-        bool        found;
-
-        for (;;)
-        {
-            int            before_changecount;
-            int            after_changecount;
-
-            pgstat_save_changecount_before(vbeentry, before_changecount);
-
-            found = (vbeentry->st_procpid == pid);
-
-            pgstat_save_changecount_after(vbeentry, after_changecount);
-
-            if (before_changecount == after_changecount &&
-                (before_changecount & 1) == 0)
-                break;
-
-            /* Make sure we can break out of loop if stuck... */
-            CHECK_FOR_INTERRUPTS();
-        }
-
-        if (found)
-        {
-            /* Now it is safe to use the non-volatile pointer */
-            if (checkUser && !superuser() && beentry->st_userid != GetUserId())
-                return "<insufficient privilege>";
-            else if (*(beentry->st_activity_raw) == '\0')
-                return "<command string not enabled>";
-            else
-            {
-                /* this'll leak a bit of memory, but that seems acceptable */
-                return pgstat_clip_activity(beentry->st_activity_raw);
-            }
-        }
-
-        beentry++;
-    }
-
-    /* If we get here, caller is in error ... */
-    return "<backend information not available>";
-}
-
-/* ----------
- * pgstat_get_crashed_backend_activity() -
- *
- *    Return a string representing the current activity of the backend with
- *    the specified PID.  Like the function above, but reads shared memory with
- *    the expectation that it may be corrupt.  On success, copy the string
- *    into the "buffer" argument and return that pointer.  On failure,
- *    return NULL.
- *
- *    This function is only intended to be used by the postmaster to report the
- *    query that crashed a backend.  In particular, no attempt is made to
- *    follow the correct concurrency protocol when accessing the
- *    BackendStatusArray.  But that's OK, in the worst case we'll return a
- *    corrupted message.  We also must take care not to trip on ereport(ERROR).
- * ----------
- */
-const char *
-pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
-{
-    volatile PgBackendStatus *beentry;
-    int            i;
-
-    beentry = BackendStatusArray;
-
-    /*
-     * We probably shouldn't get here before shared memory has been set up,
-     * but be safe.
-     */
-    if (beentry == NULL || BackendActivityBuffer == NULL)
-        return NULL;
-
-    for (i = 1; i <= MaxBackends; i++)
-    {
-        if (beentry->st_procpid == pid)
-        {
-            /* Read pointer just once, so it can't change after validation */
-            const char *activity = beentry->st_activity_raw;
-            const char *activity_last;
-
-            /*
-             * We mustn't access activity string before we verify that it
-             * falls within the BackendActivityBuffer. To make sure that the
-             * entire string including its ending is contained within the
-             * buffer, subtract one activity length from the buffer size.
-             */
-            activity_last = BackendActivityBuffer + BackendActivityBufferSize
-                - pgstat_track_activity_query_size;
-
-            if (activity < BackendActivityBuffer ||
-                activity > activity_last)
-                return NULL;
-
-            /* If no string available, no point in a report */
-            if (activity[0] == '\0')
-                return NULL;
-
-            /*
-             * Copy only ASCII-safe characters so we don't run into encoding
-             * problems when reporting the message; and be sure not to run off
-             * the end of memory.  As only ASCII characters are reported, it
-             * doesn't seem necessary to perform multibyte aware clipping.
-             */
-            ascii_safe_strlcpy(buffer, activity,
-                               Min(buflen, pgstat_track_activity_query_size));
-
-            return buffer;
-        }
-
-        beentry++;
-    }
-
-    /* PID not found */
-    return NULL;
-}
-
-const char *
-pgstat_get_backend_desc(BackendType backendType)
-{
-    const char *backendDesc = "unknown process type";
-
-    switch (backendType)
-    {
-        case B_AUTOVAC_LAUNCHER:
-            backendDesc = "autovacuum launcher";
-            break;
-        case B_AUTOVAC_WORKER:
-            backendDesc = "autovacuum worker";
-            break;
-        case B_BACKEND:
-            backendDesc = "client backend";
-            break;
-        case B_BG_WORKER:
-            backendDesc = "background worker";
-            break;
-        case B_BG_WRITER:
-            backendDesc = "background writer";
-            break;
-        case B_ARCHIVER:
-            backendDesc = "archiver";
-            break;
-        case B_CHECKPOINTER:
-            backendDesc = "checkpointer";
-            break;
-        case B_STARTUP:
-            backendDesc = "startup";
-            break;
-        case B_WAL_RECEIVER:
-            backendDesc = "walreceiver";
-            break;
-        case B_WAL_SENDER:
-            backendDesc = "walsender";
-            break;
-        case B_WAL_WRITER:
-            backendDesc = "walwriter";
-            break;
-    }
-
-    return backendDesc;
-}
-
-/* ------------------------------------------------------------
- * Local support functions follow
- * ------------------------------------------------------------
- */
-
-
-/* ----------
- * pgstat_setheader() -
- *
- *        Set common header fields in a statistics message
- * ----------
- */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
-    {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
- *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
- * ----------
- */
-void
-pgstat_send_archiver(const char *xlog, bool failed)
-{
-    PgStat_MsgArchiver msg;
-
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* ----------
- * pgstat_send_bgwriter() -
- *
- *        Send bgwriter statistics to the collector
- * ----------
- */
-void
-pgstat_send_bgwriter(void)
-{
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
-
-    /*
-     * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
-     */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
-        return;
-
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
-
-    /*
-     * Clear out the statistics buffer, so it can be re-used.
-     */
-    MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
-}
-
-
-/* ----------
- * PgstatCollectorMain() -
- *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, pgstat_sighup_handler);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, pgstat_exit);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    /*
-     * Identify myself via ps
-     */
-    init_ps_display("stats collector", "", "", "");
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do recognize got_SIGHUP inside
-     * the inner loop, which means that such interrupts will get serviced but
-     * the latch won't get cleared until next time there is a break in the
-     * action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (need_exit)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * need_exit becomes set.
-         */
-        while (!need_exit)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (got_SIGHUP)
-            {
-                got_SIGHUP = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry((PgStat_MsgInquiry *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat((PgStat_MsgTabstat *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge((PgStat_MsgTabpurge *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb((PgStat_MsgDropdb *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter((PgStat_MsgResetcounter *) &msg,
-                                             len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(
-                                                   (PgStat_MsgResetsharedcounter *) &msg,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(
-                                                   (PgStat_MsgResetsinglecounter *) &msg,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac((PgStat_MsgAutovacStart *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum((PgStat_MsgVacuum *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze((PgStat_MsgAnalyze *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver((PgStat_MsgArchiver *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter((PgStat_MsgBgWriter *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat((PgStat_MsgFuncstat *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge((PgStat_MsgFuncpurge *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict((PgStat_MsgRecoveryConflict *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock((PgStat_MsgDeadlock *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile((PgStat_MsgTempFile *) &msg, len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr & WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    exit(0);
-}
-
-
-/* SIGQUIT signal handler for collector process */
-static void
-pgstat_exit(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    need_exit = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
-/* SIGHUP handler for collector process */
-static void
-pgstat_sighup_handler(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    got_SIGHUP = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
-/*
- * Subroutine to clear stats in a database entry
- *
- * Tables and functions hashes are initialized to empty.
- */
-static void
-reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
-{
-    HASHCTL        hash_ctl;
-
-    dbentry->n_xact_commit = 0;
-    dbentry->n_xact_rollback = 0;
-    dbentry->n_blocks_fetched = 0;
-    dbentry->n_blocks_hit = 0;
-    dbentry->n_tuples_returned = 0;
-    dbentry->n_tuples_fetched = 0;
-    dbentry->n_tuples_inserted = 0;
-    dbentry->n_tuples_updated = 0;
-    dbentry->n_tuples_deleted = 0;
-    dbentry->last_autovac_time = 0;
-    dbentry->n_conflict_tablespace = 0;
-    dbentry->n_conflict_lock = 0;
-    dbentry->n_conflict_snapshot = 0;
-    dbentry->n_conflict_bufferpin = 0;
-    dbentry->n_conflict_startup_deadlock = 0;
-    dbentry->n_temp_files = 0;
-    dbentry->n_temp_bytes = 0;
-    dbentry->n_deadlocks = 0;
-    dbentry->n_block_read_time = 0;
-    dbentry->n_block_write_time = 0;
-
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
-
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
-}
-
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
-{
-    PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
-     */
-    if (!found)
-        reset_dbentry_counters(result);
-
-    return result;
-}
-
-
-/*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
-{
-    PgStat_StatTabEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /* If not found, initialize the new one. */
-    if (!found)
-    {
-        result->numscans = 0;
-        result->tuples_returned = 0;
-        result->tuples_fetched = 0;
-        result->tuples_inserted = 0;
-        result->tuples_updated = 0;
-        result->tuples_deleted = 0;
-        result->tuples_hot_updated = 0;
-        result->n_live_tuples = 0;
-        result->n_dead_tuples = 0;
-        result->changes_since_analyze = 0;
-        result->blocks_fetched = 0;
-        result->blocks_hit = 0;
-        result->vacuum_timestamp = 0;
-        result->vacuum_count = 0;
-        result->autovac_vacuum_timestamp = 0;
-        result->autovac_vacuum_count = 0;
-        result->analyze_timestamp = 0;
-        result->analyze_count = 0;
-        result->autovac_analyze_timestamp = 0;
-        result->autovac_analyze_count = 0;
-    }
-
-    return result;
-}
-
-
-/* ----------
- * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
- * ----------
- */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
-{
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    FILE       *fpout;
-    int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-    int            rc;
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Set the timestamp of the stats file.
-     */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Write global stats struct
-     */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Write archiver stats struct
-     */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database table.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        /*
-         * Write out the table and function stats for this DB into the
-         * appropriate per-DB stat file, if required.
-         */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
-
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
-
-        /*
-         * Write out the DB entry. We don't write the tables or functions
-         * pointers, since they're of no use to any other process.
-         */
-        fputc('D', fpout);
-        rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
-}
-
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
-
-/* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
- * ----------
- */
-static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
-{
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpout;
-    int32        format_id;
-    Oid            dbid = dbentry->databaseid;
-    int            rc;
-    char        tmpfile[MAXPGPATH];
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database's access stats per table.
-     */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
-    {
-        fputc('T', fpout);
-        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * Walk through the database's function stats table.
-     */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_statsfiles() -
- *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
- *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
- * ----------
- */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
-
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-
-    /*
-     * Set the current timestamp (will be kept only in case we can't load an
-     * existing statsfile).
-     */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return dbhash;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
-        goto done;
-    }
-
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Add to the DB hash
-                 */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-
-    return dbhash;
-}
-
-
-/* ----------
- * pgstat_read_db_statsfile() -
- *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
- * ----------
- */
-static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
-{
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatTabEntry tabbuf;
-    PgStat_StatFuncEntry funcbuf;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'T'    A PgStat_StatTabEntry follows.
-                 */
-            case 'T':
-                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
-                          fpin) != sizeof(PgStat_StatTabEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
-                break;
-
-                /*
-                 * 'F'    A PgStat_StatFuncEntry follows.
-                 */
-            case 'F':
-                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
-                          fpin) != sizeof(PgStat_StatFuncEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
-                break;
-
-                /*
-                 * 'E'    The EOF marker of a complete stats file.
-                 */
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
-}
-
-
-/* ----------
- * pgstat_setup_memcxt() -
- *
- *    Create pgStatLocalContext, if not already done.
- * ----------
- */
-static void
-pgstat_setup_memcxt(void)
-{
-    if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
-}
-
-
-/* ----------
- * pgstat_clear_snapshot() -
- *
- *    Discard any data collected in the current transaction.  Any subsequent
- *    request will cause new snapshots to be read.
- *
- *    This is also invoked during transaction commit or abort to discard
- *    the no-longer-wanted snapshot.
- * ----------
- */
-void
-pgstat_clear_snapshot(void)
-{
-    /* Release memory, if any was allocated */
-    if (pgStatLocalContext)
-        MemoryContextDelete(pgStatLocalContext);
-
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetshared() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
-}
-
-/*
- * Convert a potentially unsafely truncated activity string (see
- * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
- * one.
- *
- * The returned string is allocated in the caller's memory context and may be
- * freed.
- */
-char *
-pgstat_clip_activity(const char *raw_activity)
-{
-    char       *activity;
-    int            rawlen;
-    int            cliplen;
-
-    /*
-     * Some callers, like pgstat_get_backend_current_activity(), do not
-     * guarantee that the buffer isn't concurrently modified. We try to take
-     * care that the buffer is always terminated by a NUL byte regardless, but
-     * let's still be paranoid about the string's length. In those cases the
-     * underlying buffer is guaranteed to be pgstat_track_activity_query_size
-     * large.
-     */
-    activity = pnstrdup(raw_activity, pgstat_track_activity_query_size - 1);
-
-    /* now double-guaranteed to be NUL terminated */
-    rawlen = strlen(activity);
-
-    /*
-     * All supported server-encodings make it possible to determine the length
-     * of a multi-byte character from its first byte (this is not the case for
-     * client encodings, see GB18030). As st_activity is always stored using
-     * server encoding, this allows us to perform multi-byte aware truncation,
-     * even if the string earlier was truncated in the middle of a multi-byte
-     * character.
-     */
-    cliplen = pg_mbcliplen(activity, rawlen,
-                           pgstat_track_activity_query_size - 1);
-
-    activity[cliplen] = '\0';
-
-    return activity;
-}
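Note on the pgstat_clip_activity() code removed above: its comment relies on the fact that every supported server encoding lets you find a character's byte length from its first byte, so a string cut in the middle of a multi-byte character can be trimmed back to the last complete character. Below is a minimal standalone sketch of that idea, assuming UTF-8 only. The names utf8_char_len, utf8_clip_len and clip_activity_sketch are hypothetical and are not PostgreSQL functions; the real code uses pnstrdup() and pg_mbcliplen() and handles every server encoding.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Byte length of a UTF-8 character, judged from its first byte only. */
static size_t
utf8_char_len(unsigned char first)
{
    if (first < 0x80)
        return 1;
    if ((first & 0xE0) == 0xC0)
        return 2;
    if ((first & 0xF0) == 0xE0)
        return 3;
    if ((first & 0xF8) == 0xF0)
        return 4;
    return 1;                   /* stray or invalid byte: count it as one */
}

/* Largest prefix length (in bytes) of s[0..len) ending on a character boundary. */
static size_t
utf8_clip_len(const char *s, size_t len)
{
    size_t      clip = 0;

    while (clip < len)
    {
        size_t      l = utf8_char_len((unsigned char) s[clip]);

        if (clip + l > len)
            break;              /* last character is incomplete, drop it */
        clip += l;
    }
    return clip;
}

/*
 * Copy at most bufsize - 1 bytes of raw (hypothetical stand-in for the
 * activity buffer), then trim a partial trailing character.
 * Returns a malloc'd string, or NULL on allocation failure.
 */
static char *
clip_activity_sketch(const char *raw, size_t bufsize)
{
    size_t      n = 0;
    char       *copy;

    /* bounded length scan, so a missing terminator cannot overrun */
    while (n + 1 < bufsize && raw[n] != '\0')
        n++;

    copy = malloc(n + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, raw, n);
    copy[n] = '\0';

    /* now NUL-terminated; cut back to the last whole character */
    copy[utf8_clip_len(copy, n)] = '\0';
    return copy;
}
```

With a 3-byte buffer, the input "h\xC3\xA9llo" is first cut to "h\xC3" and then clipped to "h", because the two-byte character no longer fits. The caller frees the result.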
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a663a62fd5..a01b81a594 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -95,6 +95,7 @@
 
 #include "access/transam.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/pg_control.h"
 #include "common/file_perm.h"
@@ -255,7 +256,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -502,7 +502,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1302,12 +1301,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1756,11 +1749,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2595,8 +2583,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -2927,8 +2913,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -2995,13 +2979,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3076,22 +3053,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3550,22 +3511,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3761,8 +3706,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3801,8 +3744,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4003,8 +3945,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -4977,18 +4917,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5101,12 +5029,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -5976,7 +5898,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6029,8 +5950,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6262,7 +6181,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index d1ea46deb8..3accdf7bcf 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -31,11 +31,11 @@
 #include <sys/stat.h>
 #include <sys/time.h>
 
+#include "bestatus.h"
 #include "lib/stringinfo.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "nodes/pg_list.h"
-#include "pgstat.h"
 #include "pgtime.h"
 #include "postmaster/fork_process.h"
 #include "postmaster/postmaster.h"
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index a6fdba3f41..0de04159d5 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -45,9 +45,9 @@
 #include <unistd.h>
 
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/walwriter.h"
 #include "storage/bufmgr.h"
 #include "storage/condition_variable.h"
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index def6c03dd0..e30b2dbcf0 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -17,6 +17,7 @@
 #include <time.h>
 
 #include "access/xlog_internal.h"    /* for pg_start/stop_backup */
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "common/file_perm.h"
 #include "lib/stringinfo.h"
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 7027737e67..75a3208f74 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -22,11 +22,11 @@
 #include "libpq-fe.h"
 #include "pqexpbuffer.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "funcapi.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/walreceiver.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 55b91b5e12..ea1c7e643e 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -19,7 +19,7 @@
 
 #include "funcapi.h"
 #include "miscadmin.h"
-#include "pgstat.h"
+#include "bestatus.h"
 
 #include "access/heapam.h"
 #include "access/htup.h"
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index dad2b3d065..dbb7c57ebc 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -77,13 +77,12 @@
 #include "access/htup_details.h"
 #include "access/table.h"
 #include "access/xact.h"
-
+#include "bestatus.h"
 #include "catalog/indexing.h"
 #include "nodes/execnodes.h"
 
 #include "replication/origin.h"
 #include "replication/logical.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2b486b5e9f..b6d6013dd0 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -62,10 +62,10 @@
 #include "access/tuptoaster.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "lib/binaryheap.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/logical.h"
 #include "replication/reorderbuffer.h"
 #include "replication/slot.h"
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index ad44b2bf43..1b792f6626 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -126,7 +126,7 @@
 #include "access/transam.h"
 #include "access/xact.h"
 
-#include "pgstat.h"
+#include "bestatus.h"
 
 #include "replication/logical.h"
 #include "replication/reorderbuffer.h"
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 28f5fc23aa..58c94794bc 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -86,26 +86,28 @@
 #include "postgres.h"
 
 #include "miscadmin.h"
-#include "pgstat.h"
 
 #include "access/table.h"
 #include "access/xact.h"
 
+#include "bestatus.h"
+
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 
 #include "commands/copy.h"
 
 #include "parser/parse_relation.h"
+#include "pgstat.h"
 
 #include "replication/logicallauncher.h"
 #include "replication/logicalrelation.h"
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 
-#include "utils/snapmgr.h"
 #include "storage/ipc.h"
 
+#include "utils/snapmgr.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -128,7 +130,7 @@ finish_sync_worker(void)
     if (IsTransactionState())
     {
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(true);
     }
 
     /* And flush all writes. */
@@ -144,6 +146,9 @@ finish_sync_worker(void)
     /* Find the main apply worker and signal it. */
     logicalrep_worker_wakeup(MyLogicalRepWorker->subid, InvalidOid);
 
+    /* clean up retained statistics */
+    pgstat_update_stat(true);
+
     /* Stop gracefully */
     proc_exit(0);
 }
@@ -525,7 +530,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
     if (started_tx)
     {
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(false);
     }
 }
 
@@ -863,7 +868,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
                                            MyLogicalRepWorker->relstate,
                                            MyLogicalRepWorker->relstate_lsn);
                 CommitTransactionCommand();
-                pgstat_report_stat(false);
+                pgstat_update_stat(false);
 
                 /*
                  * We want to do the table data sync in a single transaction.
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index f9516515bc..dc675778e3 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -26,6 +26,7 @@
 #include "access/table.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_subscription.h"
@@ -477,7 +478,7 @@ apply_handle_commit(StringInfo s)
         replorigin_session_origin_timestamp = commit_data.committime;
 
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(false);
 
         store_flush_position(commit_data.end_lsn);
     }
@@ -1311,6 +1312,8 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
             }
 
             send_feedback(last_received, requestReply, requestReply);
+
+            pgstat_update_stat(false);
         }
     }
 }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 33b23b6b6d..c60e69302a 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -41,9 +41,9 @@
 
 #include "access/transam.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "common/string.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/slot.h"
 #include "storage/fd.h"
 #include "storage/proc.h"
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 6c160c13c6..02ec91d98e 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -75,8 +75,8 @@
 #include <unistd.h>
 
 #include "access/xact.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/syncrep.h"
 #include "replication/walsender.h"
 #include "replication/walsender_private.h"
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 2e90944ad5..bdca25499d 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -50,6 +50,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
@@ -57,7 +58,6 @@
 #include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "storage/ipc.h"
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 9b143f361b..2b38c0c4f5 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -56,6 +56,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
 
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "commands/dbcommands.h"
@@ -65,7 +66,6 @@
 #include "libpq/pqformat.h"
 #include "miscadmin.h"
 #include "nodes/replnodes.h"
-#include "pgstat.h"
 #include "replication/basebackup.h"
 #include "replication/decode.h"
 #include "replication/logical.h"
diff --git a/src/backend/statmon/Makefile b/src/backend/statmon/Makefile
new file mode 100644
index 0000000000..64a04878e3
--- /dev/null
+++ b/src/backend/statmon/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for src/backend/statmon
+#
+# IDENTIFICATION
+#    src/backend/statmon/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/statmon
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = pgstat.o bestatus.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/statmon/bestatus.c b/src/backend/statmon/bestatus.c
new file mode 100644
index 0000000000..292312d05c
--- /dev/null
+++ b/src/backend/statmon/bestatus.c
@@ -0,0 +1,1781 @@
+/* ----------
+ * bestatus.c
+ *
+ *    Backend status monitor
+ *
+ *    Status data is stored in shared memory. Every backend updates and reads
+ *    it individually.
+ *
+ *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
+ *
+ *    src/backend/statmon/bestatus.c
+ * ----------
+ */
+#include "postgres.h"
+
+#include "bestatus.h"
+
+#include "access/xact.h"
+#include "libpq/libpq.h"
+#include "miscadmin.h"
+#include "postmaster/autovacuum.h"
+#include "replication/walsender.h"
+#include "storage/ipc.h"
+#include "storage/lmgr.h"
+#include "storage/sinvaladt.h"
+#include "utils/ascii.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/probes.h"
+
+
+/* Status for backends including auxiliary */
+static LocalPgBackendStatus *localBackendStatusTable = NULL;
+
+/* Total number of backends including auxiliary */
+static int    localNumBackends = 0;
+
+/* ----------
+ * Total number of backends including auxiliary
+ *
+ * We reserve a slot for each possible BackendId, plus one for each
+ * possible auxiliary process type.  (This scheme assumes there is not
+ * more than one of any auxiliary process type at a time.) MaxBackends
+ * includes autovacuum workers and background workers as well.
+ * ----------
+ */
+#define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
+
+
+/* ----------
+ * GUC parameters
+ * ----------
+ */
+bool        pgstat_track_activities = false;
+int            pgstat_track_activity_query_size = 1024;
+
+static MemoryContext pgBeStatLocalContext = NULL;
+
+/* ------------------------------------------------------------
+ * Functions for management of the shared-memory PgBackendStatus array
+ * ------------------------------------------------------------
+ */
+
+static PgBackendStatus *BackendStatusArray = NULL;
+static PgBackendStatus *MyBEEntry = NULL;
+static char *BackendAppnameBuffer = NULL;
+static char *BackendClientHostnameBuffer = NULL;
+static char *BackendActivityBuffer = NULL;
+static Size BackendActivityBufferSize = 0;
+#ifdef USE_SSL
+static PgBackendSSLStatus *BackendSslStatusBuffer = NULL;
+#endif
+
+static const char *pgstat_get_wait_activity(WaitEventActivity w);
+static const char *pgstat_get_wait_client(WaitEventClient w);
+static const char *pgstat_get_wait_ipc(WaitEventIPC w);
+static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
+static const char *pgstat_get_wait_io(WaitEventIO w);
+static void pgstat_setup_memcxt(void);
+static void bestatus_clear_snapshot(void);
+static void pgstat_beshutdown_hook(int code, Datum arg);
+/*
+ * Report shared-memory space needed by CreateSharedBackendStatus.
+ */
+Size
+BackendStatusShmemSize(void)
+{
+    Size        size;
+
+    /* BackendStatusArray: */
+    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
+    /* BackendAppnameBuffer: */
+    size = add_size(size,
+                    mul_size(NAMEDATALEN, NumBackendStatSlots));
+    /* BackendClientHostnameBuffer: */
+    size = add_size(size,
+                    mul_size(NAMEDATALEN, NumBackendStatSlots));
+    /* BackendActivityBuffer: */
+    size = add_size(size,
+                    mul_size(pgstat_track_activity_query_size, NumBackendStatSlots));
+#ifdef USE_SSL
+    /* BackendSslStatusBuffer: */
+    size = add_size(size,
+                    mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots));
+#endif
+    return size;
+}
+
+/*
+ * Initialize the shared status array and several string buffers
+ * during postmaster startup.
+ */
+void
+CreateSharedBackendStatus(void)
+{
+    Size        size;
+    bool        found;
+    int            i;
+    char       *buffer;
+
+    /* Create or attach to the shared array */
+    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
+    BackendStatusArray = (PgBackendStatus *)
+        ShmemInitStruct("Backend Status Array", size, &found);
+
+    if (!found)
+    {
+        /*
+         * We're the first - initialize.
+         */
+        MemSet(BackendStatusArray, 0, size);
+    }
+
+    /* Create or attach to the shared appname buffer */
+    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
+    BackendAppnameBuffer = (char *)
+        ShmemInitStruct("Backend Application Name Buffer", size, &found);
+
+    if (!found)
+    {
+        MemSet(BackendAppnameBuffer, 0, size);
+
+        /* Initialize st_appname pointers. */
+        buffer = BackendAppnameBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_appname = buffer;
+            buffer += NAMEDATALEN;
+        }
+    }
+
+    /* Create or attach to the shared client hostname buffer */
+    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
+    BackendClientHostnameBuffer = (char *)
+        ShmemInitStruct("Backend Client Host Name Buffer", size, &found);
+
+    if (!found)
+    {
+        MemSet(BackendClientHostnameBuffer, 0, size);
+
+        /* Initialize st_clienthostname pointers. */
+        buffer = BackendClientHostnameBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_clienthostname = buffer;
+            buffer += NAMEDATALEN;
+        }
+    }
+
+    /* Create or attach to the shared activity buffer */
+    BackendActivityBufferSize = mul_size(pgstat_track_activity_query_size,
+                                         NumBackendStatSlots);
+    BackendActivityBuffer = (char *)
+        ShmemInitStruct("Backend Activity Buffer",
+                        BackendActivityBufferSize,
+                        &found);
+
+    if (!found)
+    {
+        MemSet(BackendActivityBuffer, 0, BackendActivityBufferSize);
+
+        /* Initialize st_activity pointers. */
+        buffer = BackendActivityBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_activity_raw = buffer;
+            buffer += pgstat_track_activity_query_size;
+        }
+    }
+
+#ifdef USE_SSL
+    /* Create or attach to the shared SSL status buffer */
+    size = mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots);
+    BackendSslStatusBuffer = (PgBackendSSLStatus *)
+        ShmemInitStruct("Backend SSL Status Buffer", size, &found);
+
+    if (!found)
+    {
+        PgBackendSSLStatus *ptr;
+
+        MemSet(BackendSslStatusBuffer, 0, size);
+
+        /* Initialize st_sslstatus pointers. */
+        ptr = BackendSslStatusBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_sslstatus = ptr;
+            ptr++;
+        }
+    }
+#endif
+}
+
+/* ----------
+ * pgstat_bearray_initialize() -
+ *
+ *    Initialize pgstats state, and set up our on-proc-exit hook.
+ *    Called from InitPostgres and AuxiliaryProcessMain. For auxiliary process,
+ *    MyBackendId is invalid. Otherwise, MyBackendId must be set,
+ *    but we must not have started any transaction yet (since the
+ *    exit hook must run after the last transaction exit).
+ *    NOTE: MyDatabaseId isn't set yet; so the shutdown hook has to be careful.
+ * ----------
+ */
+void
+pgstat_bearray_initialize(void)
+{
+    /* Initialize MyBEEntry */
+    if (MyBackendId != InvalidBackendId)
+    {
+        Assert(MyBackendId >= 1 && MyBackendId <= MaxBackends);
+        MyBEEntry = &BackendStatusArray[MyBackendId - 1];
+    }
+    else
+    {
+        /* Must be an auxiliary process */
+        Assert(MyAuxProcType != NotAnAuxProcess);
+
+        /*
+         * Assign the MyBEEntry for an auxiliary process.  Since it doesn't
+         * have a BackendId, the slot is statically allocated based on the
+         * auxiliary process type (MyAuxProcType).  Backends use slots indexed
+         * in the range from 0 to MaxBackends (exclusive), so we use
+         * MaxBackends + MyAuxProcType as the index of the slot for an
+         * auxiliary process.
+         */
+        MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
+    }
+
+    /* Set up a process-exit hook to clean up */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
+}
+
+/*
+ * Shut down a single backend's statistics reporting at process exit.
+ *
+ * Nothing is flushed to a collector here; this hook only deals with the
+ * backend status entry, so statistics counts accumulated during backend
+ * exit (such as temp table deletions) must be handled elsewhere.
+ *
+ * Clear out our entry in the PgBackendStatus array.
+ */
+static void
+pgstat_beshutdown_hook(int code, Datum arg)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    /*
+     * Clear my status entry, following the protocol of bumping st_changecount
+     * before and after.  We use a volatile pointer here to ensure the
+     * compiler doesn't try to get cute.
+     */
+    pgstat_increment_changecount_before(beentry);
+
+    beentry->st_procpid = 0;    /* mark invalid */
+
+    pgstat_increment_changecount_after(beentry);
+}
+
+
+/* ----------
+ * pgstat_bestart() -
+ *
+ *    Initialize this backend's entry in the PgBackendStatus array.
+ *    Called from InitPostgres.
+ *
+ *    Apart from auxiliary processes, MyBackendId, MyDatabaseId,
+ *    session userid, and application_name must be set for a
+ *    backend (hence, this cannot be combined with pgstat_bearray_initialize).
+ * ----------
+ */
+void
+pgstat_bestart(void)
+{
+    SockAddr    clientaddr;
+    volatile PgBackendStatus *beentry;
+
+    /*
+     * To minimize the time spent modifying the PgBackendStatus entry, fetch
+     * all the needed data first.
+     */
+
+    /*
+     * We may not have a MyProcPort (eg, if this is the autovacuum process).
+     * If so, use all-zeroes client address, which is dealt with specially in
+     * pg_stat_get_backend_client_addr and pg_stat_get_backend_client_port.
+     */
+    if (MyProcPort)
+        memcpy(&clientaddr, &MyProcPort->raddr, sizeof(clientaddr));
+    else
+        MemSet(&clientaddr, 0, sizeof(clientaddr));
+
+    /*
+     * Initialize my status entry, following the protocol of bumping
+     * st_changecount before and after; and make sure it's even afterwards. We
+     * use a volatile pointer here to ensure the compiler doesn't try to get
+     * cute.
+     */
+    beentry = MyBEEntry;
+
+    /* MyBEEntry must have been set up by pgstat_bearray_initialize() */
+    Assert(beentry != NULL);
+
+    if (MyBackendId != InvalidBackendId)
+    {
+        if (IsAutoVacuumLauncherProcess())
+        {
+            /* Autovacuum Launcher */
+            beentry->st_backendType = B_AUTOVAC_LAUNCHER;
+        }
+        else if (IsAutoVacuumWorkerProcess())
+        {
+            /* Autovacuum Worker */
+            beentry->st_backendType = B_AUTOVAC_WORKER;
+        }
+        else if (am_walsender)
+        {
+            /* Wal sender */
+            beentry->st_backendType = B_WAL_SENDER;
+        }
+        else if (IsBackgroundWorker)
+        {
+            /* bgworker */
+            beentry->st_backendType = B_BG_WORKER;
+        }
+        else
+        {
+            /* client-backend */
+            beentry->st_backendType = B_BACKEND;
+        }
+    }
+    else
+    {
+        /* Must be an auxiliary process */
+        Assert(MyAuxProcType != NotAnAuxProcess);
+        switch (MyAuxProcType)
+        {
+            case StartupProcess:
+                beentry->st_backendType = B_STARTUP;
+                break;
+            case BgWriterProcess:
+                beentry->st_backendType = B_BG_WRITER;
+                break;
+            case CheckpointerProcess:
+                beentry->st_backendType = B_CHECKPOINTER;
+                break;
+            case WalWriterProcess:
+                beentry->st_backendType = B_WAL_WRITER;
+                break;
+            case WalReceiverProcess:
+                beentry->st_backendType = B_WAL_RECEIVER;
+                break;
+            case ArchiverProcess:
+                beentry->st_backendType = B_ARCHIVER;
+                break;
+            default:
+                elog(FATAL, "unrecognized process type: %d",
+                     (int) MyAuxProcType);
+                proc_exit(1);
+        }
+    }
+
+    do
+    {
+        pgstat_increment_changecount_before(beentry);
+    } while ((beentry->st_changecount & 1) == 0);
+
+    beentry->st_procpid = MyProcPid;
+    beentry->st_proc_start_timestamp = MyStartTimestamp;
+    beentry->st_activity_start_timestamp = 0;
+    beentry->st_state_start_timestamp = 0;
+    beentry->st_xact_start_timestamp = 0;
+    beentry->st_databaseid = MyDatabaseId;
+
+    /* We have userid for client-backends, wal-sender and bgworker processes */
+    if (beentry->st_backendType == B_BACKEND
+        || beentry->st_backendType == B_WAL_SENDER
+        || beentry->st_backendType == B_BG_WORKER)
+        beentry->st_userid = GetSessionUserId();
+    else
+        beentry->st_userid = InvalidOid;
+
+    beentry->st_clientaddr = clientaddr;
+    if (MyProcPort && MyProcPort->remote_hostname)
+        strlcpy(beentry->st_clienthostname, MyProcPort->remote_hostname,
+                NAMEDATALEN);
+    else
+        beentry->st_clienthostname[0] = '\0';
+#ifdef USE_SSL
+    if (MyProcPort && MyProcPort->ssl != NULL)
+    {
+        beentry->st_ssl = true;
+        beentry->st_sslstatus->ssl_bits = be_tls_get_cipher_bits(MyProcPort);
+        beentry->st_sslstatus->ssl_compression = be_tls_get_compression(MyProcPort);
+        strlcpy(beentry->st_sslstatus->ssl_version, be_tls_get_version(MyProcPort), NAMEDATALEN);
+        strlcpy(beentry->st_sslstatus->ssl_cipher, be_tls_get_cipher(MyProcPort), NAMEDATALEN);
+        be_tls_get_peer_subject_name(MyProcPort, beentry->st_sslstatus->ssl_client_dn, NAMEDATALEN);
+        be_tls_get_peer_serial(MyProcPort, beentry->st_sslstatus->ssl_client_serial, NAMEDATALEN);
+        be_tls_get_peer_issuer_name(MyProcPort, beentry->st_sslstatus->ssl_issuer_dn, NAMEDATALEN);
+    }
+    else
+    {
+        beentry->st_ssl = false;
+    }
+#else
+    beentry->st_ssl = false;
+#endif
+    beentry->st_state = STATE_UNDEFINED;
+    beentry->st_appname[0] = '\0';
+    beentry->st_activity_raw[0] = '\0';
+    /* Also make sure the last byte in each string area is always 0 */
+    beentry->st_clienthostname[NAMEDATALEN - 1] = '\0';
+    beentry->st_appname[NAMEDATALEN - 1] = '\0';
+    beentry->st_activity_raw[pgstat_track_activity_query_size - 1] = '\0';
+    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
+    beentry->st_progress_command_target = InvalidOid;
+
+    /*
+     * we don't zero st_progress_param here to save cycles; nobody should
+     * examine it until st_progress_command has been set to something other
+     * than PROGRESS_COMMAND_INVALID
+     */
+
+    pgstat_increment_changecount_after(beentry);
+
+    /* Update app name to current GUC setting */
+    if (application_name)
+        pgstat_report_appname(application_name);
+}
+
+/* ----------
+ * pgstat_read_current_status() -
+ *
+ *    Copy the current contents of the PgBackendStatus array to local memory,
+ *    if not already done in this transaction.
+ * ----------
+ */
+static void
+pgstat_read_current_status(void)
+{
+    volatile PgBackendStatus *beentry;
+    LocalPgBackendStatus *localtable;
+    LocalPgBackendStatus *localentry;
+    char       *localappname,
+               *localclienthostname,
+               *localactivity;
+#ifdef USE_SSL
+    PgBackendSSLStatus *localsslstatus;
+#endif
+    int            i;
+
+    Assert(IsUnderPostmaster);
+
+    if (localBackendStatusTable)
+        return;                    /* already done */
+
+    pgstat_setup_memcxt();
+
+    localtable = (LocalPgBackendStatus *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           sizeof(LocalPgBackendStatus) * NumBackendStatSlots);
+    localappname = (char *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           NAMEDATALEN * NumBackendStatSlots);
+    localclienthostname = (char *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           NAMEDATALEN * NumBackendStatSlots);
+    localactivity = (char *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           pgstat_track_activity_query_size * NumBackendStatSlots);
+#ifdef USE_SSL
+    localsslstatus = (PgBackendSSLStatus *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           sizeof(PgBackendSSLStatus) * NumBackendStatSlots);
+#endif
+
+    localNumBackends = 0;
+
+    beentry = BackendStatusArray;
+    localentry = localtable;
+    for (i = 1; i <= NumBackendStatSlots; i++)
+    {
+        /*
+         * Follow the protocol of retrying if st_changecount changes while we
+         * copy the entry, or if it's odd.  (The check for odd is needed to
+         * cover the case where we are able to completely copy the entry while
+         * the source backend is between increment steps.)  We use a volatile
+         * pointer here to ensure the compiler doesn't try to get cute.
+         */
+        for (;;)
+        {
+            int            before_changecount;
+            int            after_changecount;
+
+            pgstat_save_changecount_before(beentry, before_changecount);
+
+            localentry->backendStatus.st_procpid = beentry->st_procpid;
+            if (localentry->backendStatus.st_procpid > 0)
+            {
+                memcpy(&localentry->backendStatus, (char *) beentry, sizeof(PgBackendStatus));
+
+                /*
+                 * strcpy is safe even if the string is modified concurrently,
+                 * because there's always a \0 at the end of the buffer.
+                 */
+                strcpy(localappname, (char *) beentry->st_appname);
+                localentry->backendStatus.st_appname = localappname;
+                strcpy(localclienthostname, (char *) beentry->st_clienthostname);
+                localentry->backendStatus.st_clienthostname = localclienthostname;
+                strcpy(localactivity, (char *) beentry->st_activity_raw);
+                localentry->backendStatus.st_activity_raw = localactivity;
+                localentry->backendStatus.st_ssl = beentry->st_ssl;
+#ifdef USE_SSL
+                if (beentry->st_ssl)
+                {
+                    memcpy(localsslstatus, beentry->st_sslstatus, sizeof(PgBackendSSLStatus));
+                    localentry->backendStatus.st_sslstatus = localsslstatus;
+                }
+#endif
+            }
+
+            pgstat_save_changecount_after(beentry, after_changecount);
+            if (before_changecount == after_changecount &&
+                (before_changecount & 1) == 0)
+                break;
+
+            /* Make sure we can break out of loop if stuck... */
+            CHECK_FOR_INTERRUPTS();
+        }
+
+        beentry++;
+        /* Only valid entries get included into the local array */
+        if (localentry->backendStatus.st_procpid > 0)
+        {
+            BackendIdGetTransactionIds(i,
+                                       &localentry->backend_xid,
+                                       &localentry->backend_xmin);
+
+            localentry++;
+            localappname += NAMEDATALEN;
+            localclienthostname += NAMEDATALEN;
+            localactivity += pgstat_track_activity_query_size;
+#ifdef USE_SSL
+            localsslstatus++;
+#endif
+            localNumBackends++;
+        }
+    }
+
+    /* Set the pointer only after completion of a valid table */
+    localBackendStatusTable = localtable;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_beentry() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    our local copy of the current-activity entry for one backend.
+ *
+ *    NB: caller is responsible for checking whether the user is permitted to
+ *    see this info (especially the querystring).
+ * ----------
+ */
+PgBackendStatus *
+pgstat_fetch_stat_beentry(int beid)
+{
+    pgstat_read_current_status();
+
+    if (beid < 1 || beid > localNumBackends)
+        return NULL;
+
+    return &localBackendStatusTable[beid - 1].backendStatus;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_local_beentry() -
+ *
+ *    Like pgstat_fetch_stat_beentry() but with locally computed additions (like
+ *    xid and xmin values of the backend)
+ *
+ *    NB: caller is responsible for checking whether the user is permitted to
+ *    see this info (especially the querystring).
+ * ----------
+ */
+LocalPgBackendStatus *
+pgstat_fetch_stat_local_beentry(int beid)
+{
+    pgstat_read_current_status();
+
+    if (beid < 1 || beid > localNumBackends)
+        return NULL;
+
+    return &localBackendStatusTable[beid - 1];
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_numbackends() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    the maximum current backend id.
+ * ----------
+ */
+int
+pgstat_fetch_stat_numbackends(void)
+{
+    pgstat_read_current_status();
+
+    return localNumBackends;
+}
+
+/* ----------
+ * pgstat_get_wait_event_type() -
+ *
+ *    Return a string representing the type of wait event the backend is
+ *    currently waiting on.
+ */
+const char *
+pgstat_get_wait_event_type(uint32 wait_event_info)
+{
+    uint32        classId;
+    const char *event_type;
+
+    /* report process as not waiting. */
+    if (wait_event_info == 0)
+        return NULL;
+
+    classId = wait_event_info & 0xFF000000;
+
+    switch (classId)
+    {
+        case PG_WAIT_LWLOCK:
+            event_type = "LWLock";
+            break;
+        case PG_WAIT_LOCK:
+            event_type = "Lock";
+            break;
+        case PG_WAIT_BUFFER_PIN:
+            event_type = "BufferPin";
+            break;
+        case PG_WAIT_ACTIVITY:
+            event_type = "Activity";
+            break;
+        case PG_WAIT_CLIENT:
+            event_type = "Client";
+            break;
+        case PG_WAIT_EXTENSION:
+            event_type = "Extension";
+            break;
+        case PG_WAIT_IPC:
+            event_type = "IPC";
+            break;
+        case PG_WAIT_TIMEOUT:
+            event_type = "Timeout";
+            break;
+        case PG_WAIT_IO:
+            event_type = "IO";
+            break;
+        default:
+            event_type = "???";
+            break;
+    }
+
+    return event_type;
+}
+
+/* ----------
+ * pgstat_get_wait_event() -
+ *
+ *    Return a string representing the wait event the backend is currently
+ *    waiting on.
+ */
+const char *
+pgstat_get_wait_event(uint32 wait_event_info)
+{
+    uint32        classId;
+    uint16        eventId;
+    const char *event_name;
+
+    /* report process as not waiting. */
+    if (wait_event_info == 0)
+        return NULL;
+
+    classId = wait_event_info & 0xFF000000;
+    eventId = wait_event_info & 0x0000FFFF;
+
+    switch (classId)
+    {
+        case PG_WAIT_LWLOCK:
+            event_name = GetLWLockIdentifier(classId, eventId);
+            break;
+        case PG_WAIT_LOCK:
+            event_name = GetLockNameFromTagType(eventId);
+            break;
+        case PG_WAIT_BUFFER_PIN:
+            event_name = "BufferPin";
+            break;
+        case PG_WAIT_ACTIVITY:
+            {
+                WaitEventActivity w = (WaitEventActivity) wait_event_info;
+
+                event_name = pgstat_get_wait_activity(w);
+                break;
+            }
+        case PG_WAIT_CLIENT:
+            {
+                WaitEventClient w = (WaitEventClient) wait_event_info;
+
+                event_name = pgstat_get_wait_client(w);
+                break;
+            }
+        case PG_WAIT_EXTENSION:
+            event_name = "Extension";
+            break;
+        case PG_WAIT_IPC:
+            {
+                WaitEventIPC w = (WaitEventIPC) wait_event_info;
+
+                event_name = pgstat_get_wait_ipc(w);
+                break;
+            }
+        case PG_WAIT_TIMEOUT:
+            {
+                WaitEventTimeout w = (WaitEventTimeout) wait_event_info;
+
+                event_name = pgstat_get_wait_timeout(w);
+                break;
+            }
+        case PG_WAIT_IO:
+            {
+                WaitEventIO w = (WaitEventIO) wait_event_info;
+
+                event_name = pgstat_get_wait_io(w);
+                break;
+            }
+        default:
+            event_name = "unknown wait event";
+            break;
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_activity() -
+ *
+ * Convert WaitEventActivity to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_activity(WaitEventActivity w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_ARCHIVER_MAIN:
+            event_name = "ArchiverMain";
+            break;
+        case WAIT_EVENT_AUTOVACUUM_MAIN:
+            event_name = "AutoVacuumMain";
+            break;
+        case WAIT_EVENT_BGWRITER_HIBERNATE:
+            event_name = "BgWriterHibernate";
+            break;
+        case WAIT_EVENT_BGWRITER_MAIN:
+            event_name = "BgWriterMain";
+            break;
+        case WAIT_EVENT_CHECKPOINTER_MAIN:
+            event_name = "CheckpointerMain";
+            break;
+        case WAIT_EVENT_LOGICAL_APPLY_MAIN:
+            event_name = "LogicalApplyMain";
+            break;
+        case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
+            event_name = "LogicalLauncherMain";
+            break;
+        case WAIT_EVENT_RECOVERY_WAL_ALL:
+            event_name = "RecoveryWalAll";
+            break;
+        case WAIT_EVENT_RECOVERY_WAL_STREAM:
+            event_name = "RecoveryWalStream";
+            break;
+        case WAIT_EVENT_SYSLOGGER_MAIN:
+            event_name = "SysLoggerMain";
+            break;
+        case WAIT_EVENT_WAL_RECEIVER_MAIN:
+            event_name = "WalReceiverMain";
+            break;
+        case WAIT_EVENT_WAL_SENDER_MAIN:
+            event_name = "WalSenderMain";
+            break;
+        case WAIT_EVENT_WAL_WRITER_MAIN:
+            event_name = "WalWriterMain";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_client() -
+ *
+ * Convert WaitEventClient to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_client(WaitEventClient w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_CLIENT_READ:
+            event_name = "ClientRead";
+            break;
+        case WAIT_EVENT_CLIENT_WRITE:
+            event_name = "ClientWrite";
+            break;
+        case WAIT_EVENT_LIBPQWALRECEIVER_CONNECT:
+            event_name = "LibPQWalReceiverConnect";
+            break;
+        case WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE:
+            event_name = "LibPQWalReceiverReceive";
+            break;
+        case WAIT_EVENT_SSL_OPEN_SERVER:
+            event_name = "SSLOpenServer";
+            break;
+        case WAIT_EVENT_WAL_RECEIVER_WAIT_START:
+            event_name = "WalReceiverWaitStart";
+            break;
+        case WAIT_EVENT_WAL_SENDER_WAIT_WAL:
+            event_name = "WalSenderWaitForWAL";
+            break;
+        case WAIT_EVENT_WAL_SENDER_WRITE_DATA:
+            event_name = "WalSenderWriteData";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_ipc() -
+ *
+ * Convert WaitEventIPC to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_ipc(WaitEventIPC w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_BGWORKER_SHUTDOWN:
+            event_name = "BgWorkerShutdown";
+            break;
+        case WAIT_EVENT_BGWORKER_STARTUP:
+            event_name = "BgWorkerStartup";
+            break;
+        case WAIT_EVENT_BTREE_PAGE:
+            event_name = "BtreePage";
+            break;
+        case WAIT_EVENT_CLOG_GROUP_UPDATE:
+            event_name = "ClogGroupUpdate";
+            break;
+        case WAIT_EVENT_EXECUTE_GATHER:
+            event_name = "ExecuteGather";
+            break;
+        case WAIT_EVENT_HASH_BATCH_ALLOCATING:
+            event_name = "Hash/Batch/Allocating";
+            break;
+        case WAIT_EVENT_HASH_BATCH_ELECTING:
+            event_name = "Hash/Batch/Electing";
+            break;
+        case WAIT_EVENT_HASH_BATCH_LOADING:
+            event_name = "Hash/Batch/Loading";
+            break;
+        case WAIT_EVENT_HASH_BUILD_ALLOCATING:
+            event_name = "Hash/Build/Allocating";
+            break;
+        case WAIT_EVENT_HASH_BUILD_ELECTING:
+            event_name = "Hash/Build/Electing";
+            break;
+        case WAIT_EVENT_HASH_BUILD_HASHING_INNER:
+            event_name = "Hash/Build/HashingInner";
+            break;
+        case WAIT_EVENT_HASH_BUILD_HASHING_OUTER:
+            event_name = "Hash/Build/HashingOuter";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING:
+            event_name = "Hash/GrowBatches/Allocating";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_DECIDING:
+            event_name = "Hash/GrowBatches/Deciding";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_ELECTING:
+            event_name = "Hash/GrowBatches/Electing";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_FINISHING:
+            event_name = "Hash/GrowBatches/Finishing";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING:
+            event_name = "Hash/GrowBatches/Repartitioning";
+            break;
+        case WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING:
+            event_name = "Hash/GrowBuckets/Allocating";
+            break;
+        case WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING:
+            event_name = "Hash/GrowBuckets/Electing";
+            break;
+        case WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING:
+            event_name = "Hash/GrowBuckets/Reinserting";
+            break;
+        case WAIT_EVENT_LOGICAL_SYNC_DATA:
+            event_name = "LogicalSyncData";
+            break;
+        case WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE:
+            event_name = "LogicalSyncStateChange";
+            break;
+        case WAIT_EVENT_MQ_INTERNAL:
+            event_name = "MessageQueueInternal";
+            break;
+        case WAIT_EVENT_MQ_PUT_MESSAGE:
+            event_name = "MessageQueuePutMessage";
+            break;
+        case WAIT_EVENT_MQ_RECEIVE:
+            event_name = "MessageQueueReceive";
+            break;
+        case WAIT_EVENT_MQ_SEND:
+            event_name = "MessageQueueSend";
+            break;
+        case WAIT_EVENT_PARALLEL_BITMAP_SCAN:
+            event_name = "ParallelBitmapScan";
+            break;
+        case WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN:
+            event_name = "ParallelCreateIndexScan";
+            break;
+        case WAIT_EVENT_PARALLEL_FINISH:
+            event_name = "ParallelFinish";
+            break;
+        case WAIT_EVENT_PROCARRAY_GROUP_UPDATE:
+            event_name = "ProcArrayGroupUpdate";
+            break;
+        case WAIT_EVENT_PROMOTE:
+            event_name = "Promote";
+            break;
+        case WAIT_EVENT_REPLICATION_ORIGIN_DROP:
+            event_name = "ReplicationOriginDrop";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_DROP:
+            event_name = "ReplicationSlotDrop";
+            break;
+        case WAIT_EVENT_SAFE_SNAPSHOT:
+            event_name = "SafeSnapshot";
+            break;
+        case WAIT_EVENT_SYNC_REP:
+            event_name = "SyncRep";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_timeout() -
+ *
+ * Convert WaitEventTimeout to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_timeout(WaitEventTimeout w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_BASE_BACKUP_THROTTLE:
+            event_name = "BaseBackupThrottle";
+            break;
+        case WAIT_EVENT_PG_SLEEP:
+            event_name = "PgSleep";
+            break;
+        case WAIT_EVENT_RECOVERY_APPLY_DELAY:
+            event_name = "RecoveryApplyDelay";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_io() -
+ *
+ * Convert WaitEventIO to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_io(WaitEventIO w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_BUFFILE_READ:
+            event_name = "BufFileRead";
+            break;
+        case WAIT_EVENT_BUFFILE_WRITE:
+            event_name = "BufFileWrite";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_READ:
+            event_name = "ControlFileRead";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_SYNC:
+            event_name = "ControlFileSync";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE:
+            event_name = "ControlFileSyncUpdate";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_WRITE:
+            event_name = "ControlFileWrite";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE:
+            event_name = "ControlFileWriteUpdate";
+            break;
+        case WAIT_EVENT_COPY_FILE_READ:
+            event_name = "CopyFileRead";
+            break;
+        case WAIT_EVENT_COPY_FILE_WRITE:
+            event_name = "CopyFileWrite";
+            break;
+        case WAIT_EVENT_DATA_FILE_EXTEND:
+            event_name = "DataFileExtend";
+            break;
+        case WAIT_EVENT_DATA_FILE_FLUSH:
+            event_name = "DataFileFlush";
+            break;
+        case WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC:
+            event_name = "DataFileImmediateSync";
+            break;
+        case WAIT_EVENT_DATA_FILE_PREFETCH:
+            event_name = "DataFilePrefetch";
+            break;
+        case WAIT_EVENT_DATA_FILE_READ:
+            event_name = "DataFileRead";
+            break;
+        case WAIT_EVENT_DATA_FILE_SYNC:
+            event_name = "DataFileSync";
+            break;
+        case WAIT_EVENT_DATA_FILE_TRUNCATE:
+            event_name = "DataFileTruncate";
+            break;
+        case WAIT_EVENT_DATA_FILE_WRITE:
+            event_name = "DataFileWrite";
+            break;
+        case WAIT_EVENT_DSM_FILL_ZERO_WRITE:
+            event_name = "DSMFillZeroWrite";
+            break;
+        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ:
+            event_name = "LockFileAddToDataDirRead";
+            break;
+        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC:
+            event_name = "LockFileAddToDataDirSync";
+            break;
+        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE:
+            event_name = "LockFileAddToDataDirWrite";
+            break;
+        case WAIT_EVENT_LOCK_FILE_CREATE_READ:
+            event_name = "LockFileCreateRead";
+            break;
+        case WAIT_EVENT_LOCK_FILE_CREATE_SYNC:
+            event_name = "LockFileCreateSync";
+            break;
+        case WAIT_EVENT_LOCK_FILE_CREATE_WRITE:
+            event_name = "LockFileCreateWrite";
+            break;
+        case WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ:
+            event_name = "LockFileReCheckDataDirRead";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC:
+            event_name = "LogicalRewriteCheckpointSync";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC:
+            event_name = "LogicalRewriteMappingSync";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE:
+            event_name = "LogicalRewriteMappingWrite";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_SYNC:
+            event_name = "LogicalRewriteSync";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE:
+            event_name = "LogicalRewriteTruncate";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_WRITE:
+            event_name = "LogicalRewriteWrite";
+            break;
+        case WAIT_EVENT_RELATION_MAP_READ:
+            event_name = "RelationMapRead";
+            break;
+        case WAIT_EVENT_RELATION_MAP_SYNC:
+            event_name = "RelationMapSync";
+            break;
+        case WAIT_EVENT_RELATION_MAP_WRITE:
+            event_name = "RelationMapWrite";
+            break;
+        case WAIT_EVENT_REORDER_BUFFER_READ:
+            event_name = "ReorderBufferRead";
+            break;
+        case WAIT_EVENT_REORDER_BUFFER_WRITE:
+            event_name = "ReorderBufferWrite";
+            break;
+        case WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ:
+            event_name = "ReorderLogicalMappingRead";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_READ:
+            event_name = "ReplicationSlotRead";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC:
+            event_name = "ReplicationSlotRestoreSync";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_SYNC:
+            event_name = "ReplicationSlotSync";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_WRITE:
+            event_name = "ReplicationSlotWrite";
+            break;
+        case WAIT_EVENT_SLRU_FLUSH_SYNC:
+            event_name = "SLRUFlushSync";
+            break;
+        case WAIT_EVENT_SLRU_READ:
+            event_name = "SLRURead";
+            break;
+        case WAIT_EVENT_SLRU_SYNC:
+            event_name = "SLRUSync";
+            break;
+        case WAIT_EVENT_SLRU_WRITE:
+            event_name = "SLRUWrite";
+            break;
+        case WAIT_EVENT_SNAPBUILD_READ:
+            event_name = "SnapbuildRead";
+            break;
+        case WAIT_EVENT_SNAPBUILD_SYNC:
+            event_name = "SnapbuildSync";
+            break;
+        case WAIT_EVENT_SNAPBUILD_WRITE:
+            event_name = "SnapbuildWrite";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC:
+            event_name = "TimelineHistoryFileSync";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE:
+            event_name = "TimelineHistoryFileWrite";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_READ:
+            event_name = "TimelineHistoryRead";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_SYNC:
+            event_name = "TimelineHistorySync";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_WRITE:
+            event_name = "TimelineHistoryWrite";
+            break;
+        case WAIT_EVENT_TWOPHASE_FILE_READ:
+            event_name = "TwophaseFileRead";
+            break;
+        case WAIT_EVENT_TWOPHASE_FILE_SYNC:
+            event_name = "TwophaseFileSync";
+            break;
+        case WAIT_EVENT_TWOPHASE_FILE_WRITE:
+            event_name = "TwophaseFileWrite";
+            break;
+        case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ:
+            event_name = "WALSenderTimelineHistoryRead";
+            break;
+        case WAIT_EVENT_WAL_BOOTSTRAP_SYNC:
+            event_name = "WALBootstrapSync";
+            break;
+        case WAIT_EVENT_WAL_BOOTSTRAP_WRITE:
+            event_name = "WALBootstrapWrite";
+            break;
+        case WAIT_EVENT_WAL_COPY_READ:
+            event_name = "WALCopyRead";
+            break;
+        case WAIT_EVENT_WAL_COPY_SYNC:
+            event_name = "WALCopySync";
+            break;
+        case WAIT_EVENT_WAL_COPY_WRITE:
+            event_name = "WALCopyWrite";
+            break;
+        case WAIT_EVENT_WAL_INIT_SYNC:
+            event_name = "WALInitSync";
+            break;
+        case WAIT_EVENT_WAL_INIT_WRITE:
+            event_name = "WALInitWrite";
+            break;
+        case WAIT_EVENT_WAL_READ:
+            event_name = "WALRead";
+            break;
+        case WAIT_EVENT_WAL_SYNC:
+            event_name = "WALSync";
+            break;
+        case WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN:
+            event_name = "WALSyncMethodAssign";
+            break;
+        case WAIT_EVENT_WAL_WRITE:
+            event_name = "WALWrite";
+            break;
+
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+
+/* ----------
+ * pgstat_get_backend_current_activity() -
+ *
+ *    Return a string representing the current activity of the backend with
+ *    the specified PID.  This looks directly at the BackendStatusArray,
+ *    and so will provide current information regardless of the age of our
+ *    transaction's snapshot of the status array.
+ *
+ *    It is the caller's responsibility to invoke this only for backends whose
+ *    state is expected to remain stable while the result is in use.  The
+ *    only current use is in deadlock reporting, where we can expect that
+ *    the target backend is blocked on a lock.  (There are corner cases
+ *    where the target's wait could get aborted while we are looking at it,
+ *    but the very worst consequence is to return a pointer to a string
+ *    that's been changed, so we won't worry too much.)
+ *
+ *    Note: return strings for special cases match pg_stat_get_backend_activity.
+ * ----------
+ */
+const char *
+pgstat_get_backend_current_activity(int pid, bool checkUser)
+{
+    PgBackendStatus *beentry;
+    int            i;
+
+    beentry = BackendStatusArray;
+    for (i = 1; i <= MaxBackends; i++)
+    {
+        /*
+         * Although we expect the target backend's entry to be stable, that
+         * doesn't imply that anyone else's is.  To avoid identifying the
+         * wrong backend, while we check for a match to the desired PID we
+         * must follow the protocol of retrying if st_changecount changes
+         * while we examine the entry, or if it's odd.  (This might be
+         * unnecessary, since fetching or storing an int is almost certainly
+         * atomic, but let's play it safe.)  We use a volatile pointer here to
+         * ensure the compiler doesn't try to get cute.
+         */
+        volatile PgBackendStatus *vbeentry = beentry;
+        bool        found;
+
+        for (;;)
+        {
+            int            before_changecount;
+            int            after_changecount;
+
+            pgstat_save_changecount_before(vbeentry, before_changecount);
+
+            found = (vbeentry->st_procpid == pid);
+
+            pgstat_save_changecount_after(vbeentry, after_changecount);
+
+            if (before_changecount == after_changecount &&
+                (before_changecount & 1) == 0)
+                break;
+
+            /* Make sure we can break out of loop if stuck... */
+            CHECK_FOR_INTERRUPTS();
+        }
+
+        if (found)
+        {
+            /* Now it is safe to use the non-volatile pointer */
+            if (checkUser && !superuser() && beentry->st_userid != GetUserId())
+                return "<insufficient privilege>";
+            else if (*(beentry->st_activity_raw) == '\0')
+                return "<command string not enabled>";
+            else
+            {
+                /* this'll leak a bit of memory, but that seems acceptable */
+                return pgstat_clip_activity(beentry->st_activity_raw);
+            }
+        }
+
+        beentry++;
+    }
+
+    /* If we get here, caller is in error ... */
+    return "<backend information not available>";
+}
+
+/* ----------
+ * pgstat_get_crashed_backend_activity() -
+ *
+ *    Return a string representing the current activity of the backend with
+ *    the specified PID.  Like the function above, but reads shared memory with
+ *    the expectation that it may be corrupt.  On success, copy the string
+ *    into the "buffer" argument and return that pointer.  On failure,
+ *    return NULL.
+ *
+ *    This function is only intended to be used by the postmaster to report the
+ *    query that crashed a backend.  In particular, no attempt is made to
+ *    follow the correct concurrency protocol when accessing the
+ *    BackendStatusArray.  But that's OK, in the worst case we'll return a
+ *    corrupted message.  We also must take care not to trip on ereport(ERROR).
+ * ----------
+ */
+const char *
+pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
+{
+    volatile PgBackendStatus *beentry;
+    int            i;
+
+    beentry = BackendStatusArray;
+
+    /*
+     * We probably shouldn't get here before shared memory has been set up,
+     * but be safe.
+     */
+    if (beentry == NULL || BackendActivityBuffer == NULL)
+        return NULL;
+
+    for (i = 1; i <= MaxBackends; i++)
+    {
+        if (beentry->st_procpid == pid)
+        {
+            /* Read pointer just once, so it can't change after validation */
+            const char *activity = beentry->st_activity_raw;
+            const char *activity_last;
+
+            /*
+             * We mustn't access activity string before we verify that it
+             * falls within the BackendActivityBuffer. To make sure that the
+             * entire string including its ending is contained within the
+             * buffer, subtract one activity length from the buffer size.
+             */
+            activity_last = BackendActivityBuffer + BackendActivityBufferSize
+                - pgstat_track_activity_query_size;
+
+            if (activity < BackendActivityBuffer ||
+                activity > activity_last)
+                return NULL;
+
+            /* If no string available, no point in a report */
+            if (activity[0] == '\0')
+                return NULL;
+
+            /*
+             * Copy only ASCII-safe characters so we don't run into encoding
+             * problems when reporting the message; and be sure not to run off
+             * the end of memory.  As only ASCII characters are reported, it
+             * doesn't seem necessary to perform multibyte aware clipping.
+             */
+            ascii_safe_strlcpy(buffer, activity,
+                               Min(buflen, pgstat_track_activity_query_size));
+
+            return buffer;
+        }
+
+        beentry++;
+    }
+
+    /* PID not found */
+    return NULL;
+}
+
+const char *
+pgstat_get_backend_desc(BackendType backendType)
+{
+    const char *backendDesc = "unknown process type";
+
+    switch (backendType)
+    {
+        case B_AUTOVAC_LAUNCHER:
+            backendDesc = "autovacuum launcher";
+            break;
+        case B_AUTOVAC_WORKER:
+            backendDesc = "autovacuum worker";
+            break;
+        case B_BACKEND:
+            backendDesc = "client backend";
+            break;
+        case B_BG_WORKER:
+            backendDesc = "background worker";
+            break;
+        case B_BG_WRITER:
+            backendDesc = "background writer";
+            break;
+        case B_ARCHIVER:
+            backendDesc = "archiver";
+            break;
+        case B_CHECKPOINTER:
+            backendDesc = "checkpointer";
+            break;
+        case B_STARTUP:
+            backendDesc = "startup";
+            break;
+        case B_WAL_RECEIVER:
+            backendDesc = "walreceiver";
+            break;
+        case B_WAL_SENDER:
+            backendDesc = "walsender";
+            break;
+        case B_WAL_WRITER:
+            backendDesc = "walwriter";
+            break;
+    }
+
+    return backendDesc;
+}
+
+/* ----------
+ * pgstat_report_appname() -
+ *
+ *    Called to update our application name.
+ * ----------
+ */
+void
+pgstat_report_appname(const char *appname)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+    int            len;
+
+    if (!beentry)
+        return;
+
+    /* This should be unnecessary if GUC did its job, but be safe */
+    len = pg_mbcliplen(appname, strlen(appname), NAMEDATALEN - 1);
+
+    /*
+     * Update my status entry, following the protocol of bumping
+     * st_changecount before and after.  We use a volatile pointer here to
+     * ensure the compiler doesn't try to get cute.
+     */
+    pgstat_increment_changecount_before(beentry);
+
+    memcpy((char *) beentry->st_appname, appname, len);
+    beentry->st_appname[len] = '\0';
+
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*
+ * Report current transaction start timestamp as the specified value.
+ * Zero means there is no active transaction.
+ */
+void
+pgstat_report_xact_timestamp(TimestampTz tstamp)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    if (!pgstat_track_activities || !beentry)
+        return;
+
+    /*
+     * Update my status entry, following the protocol of bumping
+     * st_changecount before and after.  We use a volatile pointer here to
+     * ensure the compiler doesn't try to get cute.
+     */
+    pgstat_increment_changecount_before(beentry);
+    beentry->st_xact_start_timestamp = tstamp;
+    pgstat_increment_changecount_after(beentry);
+}
+
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ *    Create pgBeStatLocalContext, if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+    if (!pgBeStatLocalContext)
+        pgBeStatLocalContext = AllocSetContextCreate(TopMemoryContext,
+                                                     "Backend status snapshot",
+                                                     ALLOCSET_SMALL_SIZES);
+}
+
+/* ----------
+ * AtEOXact_BEStatus
+ *
+ *    Called from access/transam/xact.c at top-level transaction commit/abort.
+ * ----------
+ */
+void
+AtEOXact_BEStatus(bool isCommit)
+{
+    bestatus_clear_snapshot();
+}
+
+/*
+ * AtPrepare_BEStatus
+ *        Clear existing snapshot at 2PC transaction prepare.
+ */
+void
+AtPrepare_BEStatus(void)
+{
+    bestatus_clear_snapshot();
+}
+
+/* ----------
+ * bestatus_clear_snapshot() -
+ *
+ *    Discard any data collected in the current transaction.  Any subsequent
+ *    request will cause new snapshots to be read.
+ *
+ *    This is also invoked during transaction commit or abort to discard
+ *    the no-longer-wanted snapshot.
+ * ----------
+ */
+static void
+bestatus_clear_snapshot(void)
+{
+    /* Release memory, if any was allocated */
+    if (pgBeStatLocalContext)
+        MemoryContextDelete(pgBeStatLocalContext);
+
+    /* Reset variables */
+    pgBeStatLocalContext = NULL;
+    localBackendStatusTable = NULL;
+    localNumBackends = 0;
+}
+
+
+
+/* ----------
+ * pgstat_report_activity() -
+ *
+ *    Called from tcop/postgres.c to report what the backend is actually doing
+ *    (but note cmd_str can be NULL for certain cases).
+ *
+ * All updates of the status entry follow the protocol of bumping
+ * st_changecount before and after.  We use a volatile pointer here to
+ * ensure the compiler doesn't try to get cute.
+ * ----------
+ */
+void
+pgstat_report_activity(BackendState state, const char *cmd_str)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+    TimestampTz start_timestamp;
+    TimestampTz current_timestamp;
+    int            len = 0;
+
+    TRACE_POSTGRESQL_STATEMENT_STATUS(cmd_str);
+
+    if (!beentry)
+        return;
+
+    if (!pgstat_track_activities)
+    {
+        if (beentry->st_state != STATE_DISABLED)
+        {
+            volatile PGPROC *proc = MyProc;
+
+            /*
+             * track_activities is disabled, but we last reported a
+             * non-disabled state.  As our final update, change the state and
+             * clear fields we will not be updating anymore.
+             */
+            pgstat_increment_changecount_before(beentry);
+            beentry->st_state = STATE_DISABLED;
+            beentry->st_state_start_timestamp = 0;
+            beentry->st_activity_raw[0] = '\0';
+            beentry->st_activity_start_timestamp = 0;
+            /* st_xact_start_timestamp and wait_event_info are also disabled */
+            beentry->st_xact_start_timestamp = 0;
+            proc->wait_event_info = 0;
+            pgstat_increment_changecount_after(beentry);
+        }
+        return;
+    }
+
+    /*
+     * To minimize the time spent modifying the entry, fetch all the needed
+     * data first.
+     */
+    start_timestamp = GetCurrentStatementStartTimestamp();
+    if (cmd_str != NULL)
+    {
+        /*
+         * Compute length of to-be-stored string unaware of multi-byte
+         * characters. For speed reasons that'll get corrected on read, rather
+         * than computed every write.
+         */
+        len = Min(strlen(cmd_str), pgstat_track_activity_query_size - 1);
+    }
+    current_timestamp = GetCurrentTimestamp();
+
+    /*
+     * Now update the status entry
+     */
+    pgstat_increment_changecount_before(beentry);
+
+    beentry->st_state = state;
+    beentry->st_state_start_timestamp = current_timestamp;
+
+    if (cmd_str != NULL)
+    {
+        memcpy((char *) beentry->st_activity_raw, cmd_str, len);
+        beentry->st_activity_raw[len] = '\0';
+        beentry->st_activity_start_timestamp = start_timestamp;
+    }
+
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * pgstat_progress_start_command() -
+ *
+ * Set st_progress_command (and st_progress_command_target) in own backend
+ * entry.  Also, zero-initialize st_progress_param array.
+ *-----------
+ */
+void
+pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    if (!beentry || !pgstat_track_activities)
+        return;
+
+    pgstat_increment_changecount_before(beentry);
+    beentry->st_progress_command = cmdtype;
+    beentry->st_progress_command_target = relid;
+    MemSet(&beentry->st_progress_param, 0, sizeof(beentry->st_progress_param));
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * pgstat_progress_update_param() -
+ *
+ * Update index'th member in st_progress_param[] of own backend entry.
+ *-----------
+ */
+void
+pgstat_progress_update_param(int index, int64 val)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    Assert(index >= 0 && index < PGSTAT_NUM_PROGRESS_PARAM);
+
+    if (!beentry || !pgstat_track_activities)
+        return;
+
+    pgstat_increment_changecount_before(beentry);
+    beentry->st_progress_param[index] = val;
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * pgstat_progress_update_multi_param() -
+ *
+ * Update multiple members in st_progress_param[] of own backend entry.
+ * This is atomic; readers won't see intermediate states.
+ *-----------
+ */
+void
+pgstat_progress_update_multi_param(int nparam, const int *index,
+                                   const int64 *val)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+    int            i;
+
+    if (!beentry || !pgstat_track_activities || nparam == 0)
+        return;
+
+    pgstat_increment_changecount_before(beentry);
+
+    for (i = 0; i < nparam; ++i)
+    {
+        Assert(index[i] >= 0 && index[i] < PGSTAT_NUM_PROGRESS_PARAM);
+
+        beentry->st_progress_param[index[i]] = val[i];
+    }
+
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * pgstat_progress_end_command() -
+ *
+ * Reset st_progress_command (and st_progress_command_target) in own backend
+ * entry.  This signals the end of the command.
+ *-----------
+ */
+void
+pgstat_progress_end_command(void)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    if (!beentry)
+        return;
+    if (!pgstat_track_activities
+        && beentry->st_progress_command == PROGRESS_COMMAND_INVALID)
+        return;
+
+    pgstat_increment_changecount_before(beentry);
+    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
+    beentry->st_progress_command_target = InvalidOid;
+    pgstat_increment_changecount_after(beentry);
+}
+
+
+/*
+ * Convert a potentially unsafely truncated activity string (see
+ * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
+ * one.
+ *
+ * The returned string is allocated in the caller's memory context and may be
+ * freed.
+ */
+char *
+pgstat_clip_activity(const char *raw_activity)
+{
+    char       *activity;
+    int            rawlen;
+    int            cliplen;
+
+    /*
+     * Some callers, like pgstat_get_backend_current_activity(), do not
+     * guarantee that the buffer isn't concurrently modified. We try to take
+     * care that the buffer is always terminated by a NUL byte regardless, but
+     * let's still be paranoid about the string's length. In those cases the
+     * underlying buffer is guaranteed to be pgstat_track_activity_query_size
+     * large.
+     */
+    activity = pnstrdup(raw_activity, pgstat_track_activity_query_size - 1);
+
+    /* now double-guaranteed to be NUL terminated */
+    rawlen = strlen(activity);
+
+    /*
+     * All supported server-encodings make it possible to determine the length
+     * of a multi-byte character from its first byte (this is not the case for
+     * client encodings, see GB18030). As st_activity is always stored using
+     * server encoding, this allows us to perform multi-byte aware truncation,
+     * even if the string earlier was truncated in the middle of a multi-byte
+     * character.
+     */
+    cliplen = pg_mbcliplen(activity, rawlen,
+                           pgstat_track_activity_query_size - 1);
+
+    activity[cliplen] = '\0';
+
+    return activity;
+}
diff --git a/src/backend/statmon/pgstat.c b/src/backend/statmon/pgstat.c
new file mode 100644
index 0000000000..9dcd7ab540
--- /dev/null
+++ b/src/backend/statmon/pgstat.c
@@ -0,0 +1,4009 @@
+/* ----------
+ * pgstat.c
+ *
+ *    Statistics collector facility.
+ *
+ *  Collects per-table and per-function usage statistics of backends and shares
+ *  them among all backends via shared memory. Every backend records
+ *  individual activity in local memory using pg_count_*() and friends
+ *  interfaces during a transaction. Then pgstat_update_stat() is called at
+ *  the end of a transaction to flush out the local numbers to shared
+ *  memory. To avoid congestion on the shared memory, we do that no more often
+ *  than every PGSTAT_STAT_MIN_INTERVAL (500ms). A backend may still be unable
+ *  to flush all or part of its local numbers immediately; such numbers are
+ *  postponed and retried at intervals of PGSTAT_STAT_RETRY_INTERVAL (100ms),
+ *  but they are not kept pending longer than PGSTAT_STAT_MAX_INTERVAL
+ *  (1000ms).
+ *
+ *  pgstat_fetch_stat_*() are used to read the statistics numbers. There are
+ *  two ways of reading the shared statistics: transactional and one-shot.
+ *  In transactional mode, retrieved numbers are stored in a local hash that
+ *  persists until transaction end. On the other hand autovacuum, which
+ *  doesn't need such characteristics, uses one-shot mode, which just copies
+ *  the data into palloc'ed memory.
+ *
+ *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
+ *
+ *    src/backend/statmon/pgstat.c
+ * ----------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "pgstat.h"
+
+#include "access/heapam.h"
+#include "access/htup_details.h"
+#include "access/twophase_rmgr.h"
+#include "access/xact.h"
+#include "bestatus.h"
+#include "catalog/pg_database.h"
+#include "catalog/pg_proc.h"
+#include "miscadmin.h"
+#include "postmaster/autovacuum.h"
+#include "storage/ipc.h"
+#include "storage/lmgr.h"
+#include "storage/procsignal.h"
+#include "utils/memutils.h"
+#include "utils/snapmgr.h"
+
+/* ----------
+ * Timer definitions.
+ * ----------
+ */
+#define PGSTAT_STAT_MIN_INTERVAL    500 /* Minimum time between stats data
+                                         * updates; in milliseconds. */
+
+#define PGSTAT_STAT_RETRY_INTERVAL    100 /* Retry interval for updates postponed
+                                         * past PGSTAT_STAT_MIN_INTERVAL; in ms. */
+
+#define PGSTAT_STAT_MAX_INTERVAL   1000 /* Maximum time between stats data
+                                         * updates; in milliseconds. */
+
+/* ----------
+ * The initial size hints for the hash tables used in the collector.
+ * ----------
+ */
+#define PGSTAT_TAB_HASH_SIZE    512
+#define PGSTAT_FUNCTION_HASH_SIZE    512
+
+/*
+ * Operation mode of pgstat_get_db_entry.
+ */
+#define PGSTAT_FETCH_SHARED    0
+#define PGSTAT_FETCH_EXCLUSIVE    1
+#define PGSTAT_FETCH_NOWAIT    2
+
+typedef enum PgStat_TableLookupState
+{
+    PGSTAT_ENTRY_NOT_FOUND,
+    PGSTAT_ENTRY_FOUND,
+    PGSTAT_ENTRY_LOCK_FAILED
+} PgStat_TableLookupState;
+
+/* ----------
+ * GUC parameters
+ * ----------
+ */
+bool        pgstat_track_counts = false;
+int            pgstat_track_functions = TRACK_FUNC_OFF;
+
+/* ----------
+ * Built from GUC parameter
+ * ----------
+ */
+char       *pgstat_stat_directory = NULL;
+
+/* No longer used, but will be removed with GUC */
+char       *pgstat_stat_filename = NULL;
+char       *pgstat_stat_tmpname = NULL;
+
+/* Shared stats bootstrap information */
+typedef struct StatsShmemStruct {
+    dsa_handle stats_dsa_handle;
+    dshash_table_handle db_stats_handle;
+    dsa_pointer    global_stats;
+    dsa_pointer    archiver_stats;
+    TimestampTz last_update;
+} StatsShmemStruct;
+
+/*
+ * BgWriter global statistics counters (unused in other processes).
+ * Stored directly in a stats message structure so it can be sent
+ * without needing to copy things around.  We assume this inits to zeroes.
+ */
+PgStat_BgWriter BgWriterStats;
+
+/* ----------
+ * Local data
+ * ----------
+ */
+/* Variables that live for the backend lifetime */
+static StatsShmemStruct *StatsShmem = NULL;
+static dsa_area *area = NULL;
+static dshash_table *db_stats;
+
+/* memory context for snapshots */
+static MemoryContext pgStatLocalContext = NULL;
+static HTAB *snapshot_db_stats;
+
+/* dshash parameter for each type of table */
+static const dshash_parameters dsh_dbparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatDBEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_DB
+};
+static const dshash_parameters dsh_tblparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatTabEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_FUNC_TABLE
+};
+static const dshash_parameters dsh_funcparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatFuncEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_FUNC_TABLE
+};
+
+/*
+ * Structures in which backends store per-table info that's waiting to be
+ * written to shared stats.
+ *
+ * NOTE: once allocated, TabStatusArray structures are never moved or deleted
+ * for the life of the backend.  Also, we zero out the t_id fields of the
+ * contained PgStat_TableStatus structs whenever they are not actively in use.
+ * This allows relcache pgstat_info pointers to be treated as long-lived data,
+ * avoiding repeated searches in pgstat_initstats() when a relation is
+ * repeatedly opened during a transaction.
+ */
+#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
+
+typedef struct TabStatusArray
+{
+    struct TabStatusArray *tsa_next;    /* link to next array, if any */
+    int            tsa_used;        /* # entries currently used */
+    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
+} TabStatusArray;
+
+static TabStatusArray *pgStatTabList = NULL;
+
+/*
+ * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
+ */
+typedef struct TabStatHashEntry
+{
+    Oid            t_id;
+    PgStat_TableStatus *tsa_entry;
+} TabStatHashEntry;
+
+/*
+ * Hash table for O(1) t_id -> tsa_entry lookup
+ */
+static HTAB *pgStatTabHash = NULL;
+
+/*
+ * Backends store per-function info that's waiting to be flushed out to shared
+ * memory in this hash table (indexed by function OID).
+ */
+static HTAB *pgStatFunctions = NULL;
+
+/*
+ * Variables signaling that the backend has pending information waiting to
+ * be flushed out to shared memory.
+ */
+static bool  pgStatPendingRecoveryConflicts = false;
+static HTAB *pgStatPendingTabHash = NULL;
+static int pending_deadlocks = 0;
+static size_t pending_filesize = 0;
+static size_t pending_files = 0;
+
+/*
+ * Tuple insertion/deletion counts for an open transaction can't be propagated
+ * into PgStat_TableStatus counters until we know if it is going to commit
+ * or abort.  Hence, we keep these counts in per-subxact structs that live
+ * in TopTransactionContext.  This data structure is designed on the assumption
+ * that subxacts won't usually modify very many tables.
+ */
+typedef struct PgStat_SubXactStatus
+{
+    int            nest_level;        /* subtransaction nest level */
+    struct PgStat_SubXactStatus *prev;    /* higher-level subxact if any */
+    PgStat_TableXactStatus *first;    /* head of list for this subxact */
+} PgStat_SubXactStatus;
+
+static PgStat_SubXactStatus *pgStatXactStack = NULL;
+
+static int    pgStatXactCommit = 0;
+static int    pgStatXactRollback = 0;
+PgStat_Counter pgStatBlockReadTime = 0;
+PgStat_Counter pgStatBlockWriteTime = 0;
+
+/* Record that's written to 2PC state file when pgstat state is persisted */
+typedef struct TwoPhasePgStatRecord
+{
+    PgStat_Counter tuples_inserted; /* tuples inserted in xact */
+    PgStat_Counter tuples_updated;    /* tuples updated in xact */
+    PgStat_Counter tuples_deleted;    /* tuples deleted in xact */
+    PgStat_Counter inserted_pre_trunc;    /* tuples inserted prior to truncate */
+    PgStat_Counter updated_pre_trunc;    /* tuples updated prior to truncate */
+    PgStat_Counter deleted_pre_trunc;    /* tuples deleted prior to truncate */
+    Oid            t_id;            /* table's OID */
+    bool        t_shared;        /* is it a shared catalog? */
+    bool        t_truncated;    /* was the relation truncated? */
+} TwoPhasePgStatRecord;
+
+typedef struct
+{
+    int    shgeneration;
+    PgStat_StatDBEntry *shdbentry;
+    dshash_table *shdb_tabhash;
+
+    int    mygeneration;
+    PgStat_StatDBEntry *mydbentry;
+    dshash_table *mydb_tabhash;
+} pgstat_apply_tabstat_context;
+
+/*
+ * Info about current snapshot of stats
+ */
+TimestampTz backend_cache_expire = 0; /* local cache expiration time */
+bool        first_in_xact = true;      /* first fetch after the last xact end */
+
+/*
+ * Cluster-wide statistics.
+ *
+ * Contains statistics that are not collected per database or per table.
+ * shared_* are the statistics maintained by statistics collector code and
+ * snapshot_* are cached stats for the reader code.
+ */
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats *snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats *snapshot_globalStats;
+
+/*
+ * Total time charged to functions so far in the current backend.
+ * We use this to help separate "self" and "other" time charges.
+ * (We assume this initializes to zero.)
+ */
+static instr_time total_func_time;
+
+
+/* ----------
+ * Local function forward declarations
+ * ----------
+ */
+/* functions used in backends */
+static void pgstat_beshutdown_hook(int code, Datum arg);
+
+static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, int op,
+                                    PgStat_TableLookupState *status);
+static PgStat_StatTabEntry *pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create);
+static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_db_statsfile(Oid databaseid, dshash_table *tabhash, dshash_table *funchash);
+
+/* functions used in backends */
+static bool backend_snapshot_global_stats(void);
+static PgStat_StatFuncEntry *backend_get_func_entry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot);
+
+static void pgstat_postmaster_shutdown(int code, Datum arg);
+static void pgstat_apply_tabstats(bool force,
+                                          pgstat_apply_tabstat_context *cxt);
+static bool pgstat_apply_tabstat(pgstat_apply_tabstat_context *cxt,
+                                 PgStat_TableStatus *entry, bool nowait);
+static void pgstat_merge_tabentry(PgStat_TableStatus *deststat,
+                                          PgStat_TableStatus *srcstat,
+                                          bool init);
+static void pgstat_update_funcstats(bool force,
+                                    pgstat_apply_tabstat_context *cxt);
+static void pgstat_reset_all_counters(void);
+static void pgstat_cleanup_recovery_conflict(PgStat_StatDBEntry *dbentry);
+static void pgstat_cleanup_deadlock(PgStat_StatDBEntry *dbentry);
+static void pgstat_cleanup_tempfile(PgStat_StatDBEntry *dbentry);
+
+static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
+static int pin_hashes(PgStat_StatDBEntry *dbentry);
+static void unpin_hashes(PgStat_StatDBEntry *dbentry, int generation);
+static dshash_table *attach_table_hash(PgStat_StatDBEntry *dbent, int gen);
+static dshash_table *attach_function_hash(PgStat_StatDBEntry *dbent, int gen);
+static void reset_dbentry_counters(PgStat_StatDBEntry *dbentry,
+                                   bool initialize);
+static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
+
+static void pgstat_setup_memcxt(void);
+
+static bool pgstat_update_tabentry(dshash_table *tabhash,
+                                   PgStat_TableStatus *stat, bool nowait);
+static void pgstat_update_dbentry(PgStat_StatDBEntry *dbentry,
+                                  PgStat_TableStatus *stat);
+
+/* ------------------------------------------------------------
+ * Public functions called from postmaster follow
+ * ------------------------------------------------------------
+ */
+
+
+void
+pgstat_initialize(void)
+{
+    /* Set up a process-exit hook to clean up */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
+}
+
+/*
+ * subroutine for pgstat_reset_all
+ */
+static void
+pgstat_reset_remove_files(const char *directory)
+{
+    DIR           *dir;
+    struct dirent *entry;
+    char        fname[MAXPGPATH * 2];
+
+    dir = AllocateDir(directory);
+    while ((entry = ReadDir(dir, directory)) != NULL)
+    {
+        int            nchars;
+        Oid            tmp_oid;
+
+        /*
+         * Skip directory entries that don't match the file names we write.
+         * See get_dbstat_filename for the database-specific pattern.
+         */
+        if (strncmp(entry->d_name, "global.", 7) == 0)
+            nchars = 7;
+        else
+        {
+            nchars = 0;
+            (void) sscanf(entry->d_name, "db_%u.%n",
+                          &tmp_oid, &nchars);
+            if (nchars <= 0)
+                continue;
+            /* %u allows leading whitespace, so reject that */
+            if (strchr("0123456789", entry->d_name[3]) == NULL)
+                continue;
+        }
+
+        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
+            strcmp(entry->d_name + nchars, "stat") != 0)
+            continue;
+
+        snprintf(fname, sizeof(fname), "%s/%s", directory,
+                 entry->d_name);
+        unlink(fname);
+    }
+    FreeDir(dir);
+}
+
+/*
+ * pgstat_reset_all() -
+ *
+ * Remove the stats files and reset the in-memory counters.  This is
+ * currently used only if WAL recovery is needed after a crash.
+ */
+void
+pgstat_reset_all(void)
+{
+    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    pgstat_reset_all_counters();
+}
+
+/* ----------
+ * pgstat_create_shared_stats() -
+ *
+ *    create shared stats memory
+ * ----------
+ */
+static void
+pgstat_create_shared_stats(void)
+{
+    MemoryContext oldcontext;
+
+    Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
+
+    /* lives for the lifetime of the process */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    area = dsa_create(LWTRANCHE_STATS_DSA);
+    dsa_pin_mapping(area);
+
+    /* create the database hash */
+    db_stats = dshash_create(area, &dsh_dbparams, 0);
+
+    /* create shared area and write bootstrap information */
+    StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+    StatsShmem->global_stats =
+        dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+    StatsShmem->archiver_stats =
+        dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+    StatsShmem->db_stats_handle =
+        dshash_get_hash_table_handle(db_stats);
+    StatsShmem->last_update = 0;
+
+    /* initial connect to the memory */
+    MemoryContextSwitchTo(pgStatLocalContext);
+    snapshot_db_stats = NULL;
+    shared_globalStats = (PgStat_GlobalStats *)
+        dsa_get_address(area, StatsShmem->global_stats);
+    shared_archiverStats = (PgStat_ArchiverStats *)
+        dsa_get_address(area, StatsShmem->archiver_stats);
+    MemoryContextSwitchTo(oldcontext);
+}
+
+
+/* ------------------------------------------------------------
+ * Public functions used by backends follow
+ *------------------------------------------------------------
+ */
+
+
+/* ----------
+ * pgstat_update_stat() -
+ *
+ *    Must be called by processes that perform DML: tcop/postgres.c, logical
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *    This requires taking some locks on the shared statistics hashes, and
+ *    some updates may be postponed on lock failure. Such postponed updates
+ *    are retried in later calls of this function and are finally flushed
+ *    when this function is called with force = true or when more than
+ *    PGSTAT_STAT_MAX_INTERVAL milliseconds have elapsed since the oldest
+ *    pending update. On the other hand, when force = false, updates by
+ *    regular backends happen no more often than PGSTAT_STAT_MIN_INTERVAL.
+ *
+ *    Returns time in milliseconds until the next update time.
+ *
+ *    Note that this is called only out of a transaction, so it is fair to use
+ *    transaction stop time as an approximation of current time.
+ * ----------
+ */
+long
+pgstat_update_stat(bool force)
+{
+    static TimestampTz last_report = 0;
+    static TimestampTz oldest_pending = 0;
+    TimestampTz now;
+    TabStatusArray *tsa;
+    pgstat_apply_tabstat_context cxt = {0};
+    bool        other_pending_stats = false;
+    long        elapsed;
+    long        secs;
+    int            usecs;
+
+    /*
+     * Check for local data waiting to be flushed out to shared memory.
+     */
+    if (pgStatPendingRecoveryConflicts ||
+        pending_deadlocks != 0 || pending_files != 0 || pgStatFunctions ||
+        pgStatPendingTabHash)
+        other_pending_stats = true;
+
+    /* Don't expend a clock check if nothing to do */
+    if (!other_pending_stats &&
+        (pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
+        pgStatXactCommit == 0 && pgStatXactRollback == 0)
+        return 0;
+
+    now = GetCurrentTransactionStopTimestamp();
+
+    if (!force)
+    {
+        /*
+         * Don't update shared stats unless it's been at least
+         * PGSTAT_STAT_MIN_INTERVAL msec since the last update.
+         * Return the time to wait in that case.
+         */
+        TimestampDifference(last_report, now, &secs, &usecs);
+        elapsed = secs * 1000 + usecs / 1000;
+
+        if (elapsed < PGSTAT_STAT_MIN_INTERVAL)
+        {
+            /* we know we have some statistics */
+            if (oldest_pending == 0)
+                oldest_pending = now;
+
+            return PGSTAT_STAT_MIN_INTERVAL - elapsed;
+        }
+
+
+        /*
+         * Don't keep pending stats for longer than PGSTAT_STAT_MAX_INTERVAL.
+         */
+        if (oldest_pending > 0)
+        {
+            TimestampDifference(oldest_pending, now, &secs, &usecs);
+            elapsed = secs * 1000 + usecs / 1000;
+
+            if (elapsed > PGSTAT_STAT_MAX_INTERVAL)
+                force = true;
+        }
+    }
+
+    last_report = now;
+
+    /* Publish report time */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (StatsShmem->last_update < last_report)
+        StatsShmem->last_update = last_report;
+    LWLockRelease(StatsLock);
+
+    /*
+     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
+     * entries it points to.  (Should we fail partway through the loop below,
+     * it's okay to have removed the hashtable already --- the only
+     * consequence is we'd get multiple entries for the same table in the
+     * pgStatTabList, and that's safe.)
+     */
+    if (pgStatTabHash)
+        hash_destroy(pgStatTabHash);
+    pgStatTabHash = NULL;
+
+    /* Flush out table stats including pending stats */
+    pgstat_apply_tabstats(force, &cxt);
+
+    /* zero out TableStatus structs after use */
+    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+    {
+        MemSet(tsa->tsa_entries, 0,
+               tsa->tsa_used * sizeof(PgStat_TableStatus));
+        tsa->tsa_used = 0;
+    }
+
+    /* Try to flush out other pending stats, if we have any. */
+    if (other_pending_stats)
+    {
+        int op = PGSTAT_FETCH_EXCLUSIVE;
+
+        /* get dbentry if we don't have it yet */
+        if (cxt.mydbentry == NULL)
+        {
+            if (!force)
+                op |= PGSTAT_FETCH_NOWAIT;
+
+            cxt.mydbentry = pgstat_get_db_entry(MyDatabaseId, op, NULL);
+
+            /* retry after the interval on lock failure */
+            if (cxt.mydbentry == NULL)
+                return PGSTAT_STAT_RETRY_INTERVAL;
+
+            cxt.mygeneration = pin_hashes(cxt.mydbentry);
+        }
+
+        if (cxt.mydbentry)
+        {
+            /* clean up pending statistics if any */
+            if (pgStatFunctions)
+                pgstat_update_funcstats(force, &cxt);
+
+            LWLockAcquire(&cxt.mydbentry->lock, LW_EXCLUSIVE);
+            if (pgStatPendingRecoveryConflicts)
+                pgstat_cleanup_recovery_conflict(cxt.mydbentry);
+            if (pending_deadlocks != 0)
+                pgstat_cleanup_deadlock(cxt.mydbentry);
+            if (pending_files != 0)
+                pgstat_cleanup_tempfile(cxt.mydbentry);
+            LWLockRelease(&cxt.mydbentry->lock);
+        }
+    }
+
+    if (cxt.mydbentry)
+        unpin_hashes(cxt.mydbentry, cxt.mygeneration);
+
+    /* record oldest pending update time */
+    if (pgStatPendingTabHash == NULL)
+        oldest_pending = 0;
+    else if (oldest_pending == 0)
+        oldest_pending = now;
+
+    /* retry after PGSTAT_STAT_RETRY_INTERVAL if any data is still pending */
+    return oldest_pending > 0 ? PGSTAT_STAT_RETRY_INTERVAL : 0;
+}
+
+/*
+ * Subroutine for pgstat_update_stat.
+ *
+ * Applies table stats in the table status array, merging them with pending
+ * stats if any.  If force is true, waits until the required locks are
+ * acquired.  Otherwise, stores the merged stats as pending stats, to be
+ * processed at the next opportunity.
+ */
+static void
+pgstat_apply_tabstats(bool force, pgstat_apply_tabstat_context *cxt)
+{
+    static const PgStat_TableCounts all_zeroes;
+    TabStatusArray *tsa;
+    int i;
+
+    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+    {
+        for (i = 0; i < tsa->tsa_used; i++)
+        {
+            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
+            PgStat_TableStatus *pentry = NULL;
+
+            /* Shouldn't have any pending transaction-dependent counts */
+            Assert(entry->trans == NULL);
+
+            /*
+             * Ignore entries that didn't accumulate any actual counts, such
+             * as indexes that were opened by the planner but not used.
+             */
+            if (memcmp(&entry->t_counts, &all_zeroes,
+                       sizeof(PgStat_TableCounts)) == 0)
+                continue;
+
+            /* if a pending update exists, apply it together with this one */
+            if (pgStatPendingTabHash != NULL)
+            {
+                pentry = hash_search(pgStatPendingTabHash,
+                                     (void *) entry, HASH_FIND, NULL);
+
+                if (pentry)
+                {
+                    /* merge new update into pending updates */
+                    pgstat_merge_tabentry(pentry, entry, false);
+                    entry = pentry;
+                }
+            }
+
+            /* try to apply the merged stats */
+            if (pgstat_apply_tabstat(cxt, entry, !force))
+            {
+                /* succeeded. remove it if it was pending stats */
+                if (pentry)
+                    hash_search(pgStatPendingTabHash,
+                                (void *) pentry, HASH_REMOVE, NULL);
+            }
+            else if (!pentry)
+            {
+                /* failed and there was no pending entry, create new one. */
+                bool found;
+
+                if (pgStatPendingTabHash == NULL)
+                {
+                    HASHCTL        ctl;
+
+                    memset(&ctl, 0, sizeof(ctl));
+                    ctl.keysize = sizeof(Oid);
+                    ctl.entrysize = sizeof(PgStat_TableStatus);
+                    pgStatPendingTabHash =
+                        hash_create("pgstat pending table stats hash",
+                                    TABSTAT_QUANTUM,
+                                    &ctl,
+                                    HASH_ELEM | HASH_BLOBS);
+                }
+
+                pentry = hash_search(pgStatPendingTabHash,
+                                     (void *) entry, HASH_ENTER, &found);
+                Assert(!found);
+
+                *pentry = *entry;
+            }
+        }
+    }
+
+    /* if any pending stats exist, try to clean them up */
+    if (pgStatPendingTabHash != NULL)
+    {
+        HASH_SEQ_STATUS pstat;
+        PgStat_TableStatus *pentry;
+
+        hash_seq_init(&pstat, pgStatPendingTabHash);
+        while ((pentry = (PgStat_TableStatus *) hash_seq_search(&pstat)) != NULL)
+        {
+            /* apply pending entry and remove on success */
+            if (pgstat_apply_tabstat(cxt, pentry, !force))
+                hash_search(pgStatPendingTabHash,
+                            (void *) pentry, HASH_REMOVE, NULL);
+        }
+
+        /* destroy the hash if no entry is left */
+        if (hash_get_num_entries(pgStatPendingTabHash) == 0)
+        {
+            hash_destroy(pgStatPendingTabHash);
+            pgStatPendingTabHash = NULL;
+        }
+    }
+
+    if (cxt->shdb_tabhash)
+    {
+        dshash_detach(cxt->shdb_tabhash);
+        unpin_hashes(cxt->shdbentry, cxt->shgeneration);
+        cxt->shdb_tabhash = NULL;
+        cxt->shdbentry = NULL;
+    }
+
+    /* Don't release cxt->mydb_tabhash. It may be used later */
+}
+
+
+/*
+ * pgstat_apply_tabstat: update shared stats entry using given entry
+ *
+ * If nowait is true, just returns false on lock failure.  Dshashes for table
+ * and function stats are kept attached and stored in cxt. The caller must
+ * detach them after use.
+ */
+bool
+pgstat_apply_tabstat(pgstat_apply_tabstat_context *cxt,
+                     PgStat_TableStatus *entry, bool nowait)
+{
+    Oid        dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
+    int        table_mode = PGSTAT_FETCH_EXCLUSIVE;
+    bool    updated = false;
+    dshash_table *tabhash;
+    PgStat_StatDBEntry *dbent;
+    int        generation;
+
+    if (nowait)
+        table_mode |= PGSTAT_FETCH_NOWAIT;
+
+    /* Attach the required table hash if not yet attached. */
+    if ((entry->t_shared ? cxt->shdb_tabhash : cxt->mydb_tabhash) == NULL)
+    {
+        dbent = pgstat_get_db_entry(dboid, table_mode, NULL);
+        if (!dbent)
+            return false;
+
+
+        /*
+         * We don't keep lwlock on dbentries, since neither the shared dbentry
+         * nor my own dbentry can be dropped meanwhile. We use the generation
+         * to isolate table/function hashes that have been reset.
+         */
+        generation = pin_hashes(dbent);
+        tabhash = attach_table_hash(dbent, generation);
+
+        if (entry->t_shared)
+        {
+            cxt->shgeneration = generation;
+            cxt->shdbentry = dbent;
+            cxt->shdb_tabhash = tabhash;
+        }
+        else
+        {
+            /*
+             * Store the dshash entry for my database for later use. This
+             * might seem dangerous, but the database entry cannot be removed
+             * as long as this session is alive. Counters are updated safely
+             * since they are atomics. Table dshashes can be removed, but the
+             * only consequence is losing the updates this time.
+             */
+            cxt->mygeneration = generation;
+            cxt->mydbentry = dbent;
+            cxt->mydb_tabhash = tabhash;
+
+            /* Update database-wide stats  */
+            LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
+            dbent->n_xact_commit += pgStatXactCommit;
+            dbent->n_xact_rollback += pgStatXactRollback;
+            dbent->n_block_read_time += pgStatBlockReadTime;
+            dbent->n_block_write_time += pgStatBlockWriteTime;
+            LWLockRelease(&dbent->lock);
+            pgStatXactCommit = 0;
+            pgStatXactRollback = 0;
+            pgStatBlockReadTime = 0;
+            pgStatBlockWriteTime = 0;
+        }
+    }
+    else if (entry->t_shared)
+    {
+        dbent = cxt->shdbentry;
+        tabhash = cxt->shdb_tabhash;
+    }
+    else
+    {
+        dbent = cxt->mydbentry;
+        tabhash = cxt->mydb_tabhash;
+    }
+
+
+    /*
+     * If we have access to the required data, try to update table stats first.
+     * Update database stats only if the first step succeeded.
+     */
+    if (pgstat_update_tabentry(tabhash, entry, nowait))
+    {
+        pgstat_update_dbentry(dbent, entry);
+        updated = true;
+    }
+
+    return updated;
+}
+
+/*
+ * pgstat_merge_tabentry: subroutine for pgstat_update_stat
+ *
+ * Merge srcstat into deststat. Existing value in deststat is cleared if
+ * init is true.
+ */
+static void
+pgstat_merge_tabentry(PgStat_TableStatus *deststat,
+                      PgStat_TableStatus *srcstat,
+                      bool init)
+{
+    Assert(deststat != srcstat);
+
+    if (init)
+        deststat->t_counts = srcstat->t_counts;
+    else
+    {
+        PgStat_TableCounts *dest = &deststat->t_counts;
+        PgStat_TableCounts *src = &srcstat->t_counts;
+
+        dest->t_numscans += src->t_numscans;
+        dest->t_tuples_returned += src->t_tuples_returned;
+        dest->t_tuples_fetched += src->t_tuples_fetched;
+        dest->t_tuples_inserted += src->t_tuples_inserted;
+        dest->t_tuples_updated += src->t_tuples_updated;
+        dest->t_tuples_deleted += src->t_tuples_deleted;
+        dest->t_tuples_hot_updated += src->t_tuples_hot_updated;
+        dest->t_truncated |= src->t_truncated;
+
+        /* If table was truncated, first reset the live/dead counters */
+        if (src->t_truncated)
+        {
+            dest->t_delta_live_tuples = 0;
+            dest->t_delta_dead_tuples = 0;
+        }
+        dest->t_delta_live_tuples += src->t_delta_live_tuples;
+        dest->t_delta_dead_tuples += src->t_delta_dead_tuples;
+        dest->t_changed_tuples += src->t_changed_tuples;
+        dest->t_blocks_fetched += src->t_blocks_fetched;
+        dest->t_blocks_hit += src->t_blocks_hit;
+    }
+}
+
+/*
+ * pgstat_update_funcstats: subroutine for pgstat_update_stat
+ *
+ *  Flushes pending function stats to the shared function hash.
+ */
+static void
+pgstat_update_funcstats(bool force, pgstat_apply_tabstat_context *cxt)
+{
+    /* we assume this inits to all zeroes: */
+    static const PgStat_FunctionCounts all_zeroes;
+    dshash_table *funchash;
+    bool          nowait = !force;
+    HASH_SEQ_STATUS fstat;
+    PgStat_BackendFunctionEntry *bestat;
+
+    if (pgStatFunctions == NULL)
+        return;
+
+    funchash = attach_function_hash(cxt->mydbentry, cxt->mygeneration);
+
+    /* no longer need to apply this, discard it */
+    if (funchash == NULL)
+    {
+        hash_destroy(pgStatFunctions);
+        pgStatFunctions = NULL;
+        return;
+    }
+
+    /*
+     * First, we empty the transaction stats. Just move numbers to pending
+     * stats if any. Otherwise try to directly update the shared stats but
+     * create a new pending entry on lock failure.
+     */
+
+    hash_seq_init(&fstat, pgStatFunctions);
+    while ((bestat = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    {
+        bool found;
+        PgStat_StatFuncEntry *funcent = NULL;
+
+        /* Skip it if no counts accumulated since last time */
+        if (memcmp(&bestat->f_counts, &all_zeroes,
+                   sizeof(PgStat_FunctionCounts)) == 0)
+            continue;
+
+        funcent = (PgStat_StatFuncEntry *)
+            dshash_find_or_insert_extended(funchash, (void *) &(bestat->f_id),
+                                           &found, nowait);
+        if (!found)
+        {
+            funcent->functionid = bestat->f_id;
+            funcent->f_numcalls = bestat->f_counts.f_numcalls;
+            funcent->f_total_time =
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_total_time);
+            funcent->f_self_time =
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_self_time);
+        }
+        else
+        {
+            funcent->f_numcalls += bestat->f_counts.f_numcalls;
+            funcent->f_total_time +=
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_total_time);
+            funcent->f_self_time +=
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_self_time);
+        }
+        dshash_release_lock(funchash, funcent);
+
+        /* reset used counts */
+        MemSet(&bestat->f_counts, 0, sizeof(PgStat_FunctionCounts));
+    }
+}
+
+/* ----------
+ * pgstat_vacuum_stat() -
+ *
+ *    Remove stats entries for objects that no longer exist.
+ * ----------
+ */
+void
+pgstat_vacuum_stat(void)
+{
+    HTAB       *oidtab;
+    dshash_table *dshtable;
+    dshash_seq_status dshstat;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    PgStat_StatFuncEntry *funcentry;
+
+    /* we don't collect statistics under standalone mode */
+    if (!IsUnderPostmaster)
+        return;
+
+    /* If not done for this transaction, take a snapshot of stats */
+    if (!backend_snapshot_global_stats())
+        return;
+
+    /*
+     * Read pg_database and make a list of OIDs of all existing databases
+     */
+    oidtab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+
+    /*
+     * Search the database hash table for dead databases and drop them
+     * from the hash.
+     */
+
+    dshash_seq_init(&dshstat, db_stats, false, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
+    {
+        Oid            dbid = dbentry->databaseid;
+
+        CHECK_FOR_INTERRUPTS();
+
+        /* the DB entry for shared tables (with InvalidOid) is never dropped */
+        if (OidIsValid(dbid) &&
+            hash_search(oidtab, (void *) &dbid, HASH_FIND, NULL) == NULL)
+            pgstat_drop_database(dbid);
+    }
+
+    /* Clean up */
+    hash_destroy(oidtab);
+
+    /*
+     * Lookup our own database entry; if not found, nothing more to do.
+     */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, NULL);
+    if (!dbentry)
+        return;
+
+    /*
+     * Similarly to above, make a list of all known relations in this DB.
+     */
+    oidtab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
+
+    /*
+     * Check whether each table listed in the stats hashtable still exists.
+     * The stats cache is useless here, so search the shared hash directly.
+     */
+    dshtable = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&dshstat, dshtable, false, true);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&dshstat)) != NULL)
+    {
+        Oid            tabid = tabentry->tableid;
+
+        CHECK_FOR_INTERRUPTS();
+
+        if (hash_search(oidtab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+            continue;
+
+        /* Not there, so purge this table */
+        dshash_delete_entry(dshtable, tabentry);
+    }
+    dshash_detach(dshtable);
+
+    /* Clean up */
+    hash_destroy(oidtab);
+
+    /*
+     * Now repeat the above steps for functions.  However, we needn't bother
+     * in the common case where no function stats are being collected.
+     */
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        dshtable = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        oidtab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
+
+        dshash_seq_init(&dshstat, dshtable, false, true);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&dshstat)) != NULL)
+        {
+            Oid            funcid = funcentry->functionid;
+
+            CHECK_FOR_INTERRUPTS();
+
+            if (hash_search(oidtab, (void *) &funcid, HASH_FIND, NULL) != NULL)
+                continue;
+
+            /* Not there, so remove this function */
+            dshash_delete_entry(dshtable, funcentry);
+        }
+
+        hash_destroy(oidtab);
+
+        dshash_detach(dshtable);
+    }
+    dshash_release_lock(db_stats, dbentry);
+}
+
+
+/*
+ * pgstat_collect_oids() -
+ *
+ *    Collect the OIDs of all objects listed in the specified system catalog
+ *    into a temporary hash table.  Caller should hash_destroy the result after
+ *    use.  (However, we make the table in CurrentMemoryContext so that it will
+ *    be freed properly in event of an error.)
+ */
+static HTAB *
+pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
+{
+    HTAB       *htab;
+    HASHCTL        hash_ctl;
+    Relation    rel;
+    HeapScanDesc scan;
+    HeapTuple    tup;
+    Snapshot    snapshot;
+
+    memset(&hash_ctl, 0, sizeof(hash_ctl));
+    hash_ctl.keysize = sizeof(Oid);
+    hash_ctl.entrysize = sizeof(Oid);
+    hash_ctl.hcxt = CurrentMemoryContext;
+    htab = hash_create("Temporary table of OIDs",
+                       PGSTAT_TAB_HASH_SIZE,
+                       &hash_ctl,
+                       HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+    rel = heap_open(catalogid, AccessShareLock);
+    snapshot = RegisterSnapshot(GetLatestSnapshot());
+    scan = heap_beginscan(rel, snapshot, 0, NULL);
+    while ((tup = heap_getnext(scan, ForwardScanDirection)) != NULL)
+    {
+        Oid            thisoid;
+        bool        isnull;
+
+        thisoid = heap_getattr(tup, anum_oid, RelationGetDescr(rel), &isnull);
+        Assert(!isnull);
+
+        CHECK_FOR_INTERRUPTS();
+
+        (void) hash_search(htab, (void *) &thisoid, HASH_ENTER, NULL);
+    }
+    heap_endscan(scan);
+    UnregisterSnapshot(snapshot);
+    heap_close(rel, AccessShareLock);
+
+    return htab;
+}
+
+
+/* ----------
+ * pgstat_drop_database() -
+ *
+ *    Remove entry for the database that we just dropped.
+ *
+ *    If some stats update happens after this, this entry will be re-created,
+ *    but we will still clean up the dead DB eventually via future invocations
+ *    of pgstat_vacuum_stat().
+ * ----------
+ */
+void
+pgstat_drop_database(Oid databaseid)
+{
+    PgStat_StatDBEntry *dbentry;
+
+    Assert(OidIsValid(databaseid));
+    Assert(db_stats);
+
+    /*
+     * Lookup the database in the hashtable with exclusive lock.
+     */
+    dbentry = pgstat_get_db_entry(databaseid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    /*
+     * If found, remove it along with its table and function stats hashes.
+     */
+    if (dbentry)
+    {
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+        }
+
+        dshash_delete_entry(db_stats, (void *) dbentry);
+    }
+}
+
+
+/* ----------
+ * pgstat_reset_counters() -
+ *
+ *    Reset counters for our database.
+ *
+ *    Permission checking for this function is managed through the normal
+ *    GRANT system.
+ * ----------
+ */
+void
+pgstat_reset_counters(void)
+{
+    PgStat_StatDBEntry           *dbentry;
+    PgStat_TableLookupState status;
+
+    Assert(db_stats);
+
+    /*
+     * Lookup the database in the hashtable.  Nothing to do if not there.
+     */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, &status);
+
+    if (!dbentry)
+        return;
+
+    /*
+     * Reset database-level stats, too.  This creates empty hash tables for
+     * tables and functions.
+     */
+    reset_dbentry_counters(dbentry, false);
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/* ----------
+ * pgstat_reset_shared_counters() -
+ *
+ *    Reset cluster-wide shared counters.
+ *
+ *    Permission checking for this function is managed through the normal
+ *    GRANT system.
+ * ----------
+ */
+void
+pgstat_reset_shared_counters(const char *target)
+{
+    Assert(db_stats);
+
+    /* Reset the archiver statistics for the cluster. */
+    if (strcmp(target, "archiver") == 0)
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+        memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
+    }
+    else if (strcmp(target, "bgwriter") == 0)
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+        /* Reset the global background writer statistics for the cluster. */
+        memset(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    }
+    else
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                 errmsg("unrecognized reset target: \"%s\"", target),
+                 errhint("Target must be \"archiver\" or \"bgwriter\".")));
+    
+    LWLockRelease(StatsLock);
+}
+
+/* ----------
+ * pgstat_reset_single_counter() -
+ *
+ *    Reset a single counter.
+ *
+ *    Permission checking for this function is managed through the normal
+ *    GRANT system.
+ * ----------
+ */
+void
+pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
+{
+    PgStat_StatDBEntry *dbentry;
+    
+
+    Assert(db_stats);
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    if (!dbentry)
+        return;
+
+    /* Set the reset timestamp for the whole database */
+    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
+
+    /* Remove object if it exists, ignore it if not */
+    if (type == RESET_TABLE)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_delete_key(t, (void *) &objoid);
+        dshash_detach(t);
+    }
+
+    if (type == RESET_FUNCTION && dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_delete_key(t, (void *) &objoid);
+        dshash_detach(t);
+    }
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/*
+ * pgstat_reset_all_counters: subroutine for pgstat_reset_all
+ *
+ * Clears all counters in shared memory.
+ */
+static void
+pgstat_reset_all_counters(void)
+{
+    dshash_seq_status dshstat;
+    PgStat_StatDBEntry           *dbentry;
+
+    Assert(db_stats);
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    dshash_seq_init(&dshstat, db_stats, false, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
+    {
+        /*
+         * Reset database-level stats, too.  This creates empty hash tables
+         * for tables and functions.
+         */
+        reset_dbentry_counters(dbentry, true);
+        dshash_release_lock(db_stats, dbentry);
+    }
+
+    /*
+     * Reset global counters
+     */
+    memset(shared_globalStats, 0, sizeof(*shared_globalStats));
+    memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+    shared_globalStats->stat_reset_timestamp =
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
+    
+    LWLockRelease(StatsLock);
+}
+
+/* ----------
+ * pgstat_report_autovac() -
+ *
+ *    Called from autovacuum.c to report startup of an autovacuum process.
+ *    We are called before InitPostgres is done, so can't rely on MyDatabaseId;
+ *    the db OID must be passed in, instead.
+ * ----------
+ */
+void
+pgstat_report_autovac(Oid dboid)
+{
+    PgStat_StatDBEntry *dbentry;
+
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    /*
+     * Store the last autovacuum time in the database's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    dbentry->last_autovac_time = GetCurrentTimestamp();
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+
+/* ---------
+ * pgstat_report_vacuum() -
+ *
+ *    Report about the table we just vacuumed.
+ * ---------
+ */
+void
+pgstat_report_vacuum(Oid tableoid, bool shared,
+                     PgStat_Counter livetuples, PgStat_Counter deadtuples)
+{
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
+
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    dboid = shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, tableoid, true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->vacuum_count++;
+    }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/* --------
+ * pgstat_report_analyze() -
+ *
+ *    Report about the table we just analyzed.
+ *
+ * Caller must provide new live- and dead-tuples estimates, as well as a
+ * flag indicating whether to reset the changes_since_analyze counter.
+ * --------
+ */
+void
+pgstat_report_analyze(Relation rel,
+                      PgStat_Counter livetuples, PgStat_Counter deadtuples,
+                      bool resetcounter)
+{
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
+
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    /*
+     * Unlike VACUUM, ANALYZE might be running inside a transaction that has
+     * already inserted and/or deleted rows in the target table. ANALYZE will
+     * have counted such rows as live or dead respectively. Because we will
+     * report our counts of such rows at transaction end, we should subtract
+     * off these counts from what we send to the collector now, else they'll
+     * be double-counted after commit.  (This approach also ensures that the
+     * collector ends up with the right numbers if we abort instead of
+     * committing.)
+     */
+    if (rel->pgstat_info != NULL)
+    {
+        PgStat_TableXactStatus *trans;
+
+        for (trans = rel->pgstat_info->trans; trans; trans = trans->upper)
+        {
+            livetuples -= trans->tuples_inserted - trans->tuples_deleted;
+            deadtuples -= trans->tuples_updated + trans->tuples_deleted;
+        }
+        /* count stuff inserted by already-aborted subxacts, too */
+        deadtuples -= rel->pgstat_info->t_counts.t_delta_dead_tuples;
+        /* Since ANALYZE's counts are estimates, we could have underflowed */
+        livetuples = Max(livetuples, 0);
+        deadtuples = Max(deadtuples, 0);
+    }
+
+    dboid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, RelationGetRelid(rel), true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/* --------
+ * pgstat_report_recovery_conflict() -
+ *
+ *    Report a Hot Standby recovery conflict.
+ * --------
+ */
+static int pending_conflict_tablespace = 0;
+static int pending_conflict_lock = 0;
+static int pending_conflict_snapshot = 0;
+static int pending_conflict_bufferpin = 0;
+static int pending_conflict_startup_deadlock = 0;
+
+void
+pgstat_report_recovery_conflict(int reason)
+{
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupState status;
+
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    pgStatPendingRecoveryConflicts = true;
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            pending_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            pending_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            pending_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            pending_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            pending_conflict_startup_deadlock++;
+            break;
+    }
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    pgstat_cleanup_recovery_conflict(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/*
+ * clean up function for pending recovery conflicts
+ */
+static void
+pgstat_cleanup_recovery_conflict(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_conflict_tablespace += pending_conflict_tablespace;
+    dbentry->n_conflict_lock += pending_conflict_lock;
+    dbentry->n_conflict_snapshot += pending_conflict_snapshot;
+    dbentry->n_conflict_bufferpin += pending_conflict_bufferpin;
+    dbentry->n_conflict_startup_deadlock += pending_conflict_startup_deadlock;
+
+    pending_conflict_tablespace = 0;
+    pending_conflict_lock = 0;
+    pending_conflict_snapshot = 0;
+    pending_conflict_bufferpin = 0;
+    pending_conflict_startup_deadlock = 0;
+    
+    pgStatPendingRecoveryConflicts = false;
+}
+
+/* --------
+ * pgstat_report_deadlock() -
+ *
+ *    Report a detected deadlock.
+ * --------
+ */
+void
+pgstat_report_deadlock(void)
+{
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupState status;
+
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    pending_deadlocks++;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    pgstat_cleanup_deadlock(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/*
+ * clean up function for pending deadlocks
+ */
+static void
+pgstat_cleanup_deadlock(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_deadlocks += pending_deadlocks;
+    pending_deadlocks = 0;
+}
+
+/* --------
+ * pgstat_report_tempfile() -
+ *
+ *    Report a temporary file.
+ * --------
+ */
+void
+pgstat_report_tempfile(size_t filesize)
+{
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupState status;
+
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    if (filesize > 0) /* Is there a case where filesize is really 0? */
+    {
+        pending_filesize += filesize; /* needs an overflow check */
+        pending_files++;
+    }
+
+    if (pending_files == 0)
+        return;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    pgstat_cleanup_tempfile(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/*
+ * clean up function for temporary files
+ */
+static void
+pgstat_cleanup_tempfile(PgStat_StatDBEntry *dbentry)
+{
+
+    dbentry->n_temp_bytes += pending_filesize;
+    dbentry->n_temp_files += pending_files;
+    pending_filesize = 0;
+    pending_files = 0;
+
+}
+
+/*
+ * Initialize function call usage data.
+ * Called by the executor before invoking a function.
+ */
+void
+pgstat_init_function_usage(FunctionCallInfo fcinfo,
+                           PgStat_FunctionCallUsage *fcu)
+{
+    PgStat_BackendFunctionEntry *htabent;
+    bool        found;
+
+    if (pgstat_track_functions <= fcinfo->flinfo->fn_stats)
+    {
+        /* stats not wanted */
+        fcu->fs = NULL;
+        return;
+    }
+
+    if (!pgStatFunctions)
+    {
+        /* First time through - initialize function stat table */
+        HASHCTL        hash_ctl;
+
+        memset(&hash_ctl, 0, sizeof(hash_ctl));
+        hash_ctl.keysize = sizeof(Oid);
+        hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
+        pgStatFunctions = hash_create("Function stat entries",
+                                      PGSTAT_FUNCTION_HASH_SIZE,
+                                      &hash_ctl,
+                                      HASH_ELEM | HASH_BLOBS);
+    }
+
+    /* Get the stats entry for this function, create if necessary */
+    htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
+                          HASH_ENTER, &found);
+    if (!found)
+        MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
+
+    fcu->fs = &htabent->f_counts;
+
+    /* save stats for this function, later used to compensate for recursion */
+    fcu->save_f_total_time = htabent->f_counts.f_total_time;
+
+    /* save current backend-wide total time */
+    fcu->save_total = total_func_time;
+
+    /* get clock time as of function start */
+    INSTR_TIME_SET_CURRENT(fcu->f_start);
+}
+
+/*
+ * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
+ *        for specified function
+ *
+ * If no entry, return NULL, don't create a new one
+ */
+PgStat_BackendFunctionEntry *
+find_funcstat_entry(Oid func_id)
+{
+    if (pgStatFunctions == NULL)
+        return NULL;
+
+    return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
+                                                       (void *) &func_id,
+                                                       HASH_FIND, NULL);
+}
+
+/*
+ * Calculate function call usage and update stat counters.
+ * Called by the executor after invoking a function.
+ *
+ * In the case of a set-returning function that runs in value-per-call mode,
+ * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
+ * calls for what the user considers a single call of the function.  The
+ * finalize flag should be TRUE on the last call.
+ */
+void
+pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
+{
+    PgStat_FunctionCounts *fs = fcu->fs;
+    instr_time    f_total;
+    instr_time    f_others;
+    instr_time    f_self;
+
+    /* stats not wanted? */
+    if (fs == NULL)
+        return;
+
+    /* total elapsed time in this function call */
+    INSTR_TIME_SET_CURRENT(f_total);
+    INSTR_TIME_SUBTRACT(f_total, fcu->f_start);
+
+    /* self usage: elapsed minus anything already charged to other calls */
+    f_others = total_func_time;
+    INSTR_TIME_SUBTRACT(f_others, fcu->save_total);
+    f_self = f_total;
+    INSTR_TIME_SUBTRACT(f_self, f_others);
+
+    /* update backend-wide total time */
+    INSTR_TIME_ADD(total_func_time, f_self);
+
+    /*
+     * Compute the new f_total_time as the total elapsed time added to the
+     * pre-call value of f_total_time.  This is necessary to avoid
+     * double-counting any time taken by recursive calls of myself.  (We do
+     * not need any similar kluge for self time, since that already excludes
+     * any recursive calls.)
+     */
+    INSTR_TIME_ADD(f_total, fcu->save_f_total_time);
+
+    /* update counters in function stats table */
+    if (finalize)
+        fs->f_numcalls++;
+    fs->f_total_time = f_total;
+    INSTR_TIME_ADD(fs->f_self_time, f_self);
+}
+
+
+/* ----------
+ * pgstat_initstats() -
+ *
+ *    Initialize a relcache entry to count access statistics.
+ *    Called whenever a relation is opened.
+ *
+ *    We assume that a relcache entry's pgstat_info field is zeroed by
+ *    relcache.c when the relcache entry is made; thereafter it is long-lived
+ *    data.  We can avoid repeated searches of the TabStatus arrays when the
+ *    same relation is touched repeatedly within a transaction.
+ * ----------
+ */
+void
+pgstat_initstats(Relation rel)
+{
+    Oid            rel_id = rel->rd_id;
+    char        relkind = rel->rd_rel->relkind;
+
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+    {
+        /* We're not counting at all */
+        rel->pgstat_info = NULL;
+        return;
+    }
+
+    /* We only count stats for things that have storage */
+    if (!(relkind == RELKIND_RELATION ||
+          relkind == RELKIND_MATVIEW ||
+          relkind == RELKIND_INDEX ||
+          relkind == RELKIND_TOASTVALUE ||
+          relkind == RELKIND_SEQUENCE))
+    {
+        rel->pgstat_info = NULL;
+        return;
+    }
+
+    /*
+     * If we already set up this relation in the current transaction, nothing
+     * to do.
+     */
+    if (rel->pgstat_info != NULL &&
+        rel->pgstat_info->t_id == rel_id)
+        return;
+
+    /* Else find or make the PgStat_TableStatus entry, and update link */
+    rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+}
+
+/*
+ * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
+ */
+static PgStat_TableStatus *
+get_tabstat_entry(Oid rel_id, bool isshared)
+{
+    TabStatHashEntry *hash_entry;
+    PgStat_TableStatus *entry;
+    TabStatusArray *tsa;
+    bool        found;
+
+    /*
+     * Create hash table if we don't have it already.
+     */
+    if (pgStatTabHash == NULL)
+    {
+        HASHCTL        ctl;
+
+        memset(&ctl, 0, sizeof(ctl));
+        ctl.keysize = sizeof(Oid);
+        ctl.entrysize = sizeof(TabStatHashEntry);
+
+        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
+                                    TABSTAT_QUANTUM,
+                                    &ctl,
+                                    HASH_ELEM | HASH_BLOBS);
+    }
+
+    /*
+     * Find an entry or create a new one.
+     */
+    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
+    if (!found)
+    {
+        /* initialize new entry with null pointer */
+        hash_entry->tsa_entry = NULL;
+    }
+
+    /*
+     * If entry is already valid, we're done.
+     */
+    if (hash_entry->tsa_entry)
+        return hash_entry->tsa_entry;
+
+    /*
+     * Locate the first pgStatTabList entry with free space, making a new list
+     * entry if needed.  Note that we could get an OOM failure here, but if so
+     * we have left the hashtable and the list in a consistent state.
+     */
+    if (pgStatTabList == NULL)
+    {
+        /* Set up first pgStatTabList entry */
+        pgStatTabList = (TabStatusArray *)
+            MemoryContextAllocZero(TopMemoryContext,
+                                   sizeof(TabStatusArray));
+    }
+
+    tsa = pgStatTabList;
+    while (tsa->tsa_used >= TABSTAT_QUANTUM)
+    {
+        if (tsa->tsa_next == NULL)
+            tsa->tsa_next = (TabStatusArray *)
+                MemoryContextAllocZero(TopMemoryContext,
+                                       sizeof(TabStatusArray));
+        tsa = tsa->tsa_next;
+    }
+
+    /*
+     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
+     * the entry was already zeroed, either at creation or after last use.
+     */
+    entry = &tsa->tsa_entries[tsa->tsa_used++];
+    entry->t_id = rel_id;
+    entry->t_shared = isshared;
+
+    /*
+     * Now we can fill the entry in pgStatTabHash.
+     */
+    hash_entry->tsa_entry = entry;
+
+    return entry;
+}
+
+/*
+ * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
+ *
+ * If no entry, return NULL, don't create a new one
+ *
+ * Note: if we got an error in the most recent execution of pgstat_report_stat,
+ * it's possible that an entry exists but there's no hashtable entry for it.
+ * That's okay, we'll treat this case as "doesn't exist".
+ */
+PgStat_TableStatus *
+find_tabstat_entry(Oid rel_id)
+{
+    TabStatHashEntry *hash_entry;
+
+    /* If hashtable doesn't exist, there are no entries at all */
+    if (!pgStatTabHash)
+        return NULL;
+
+    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
+    if (!hash_entry)
+        return NULL;
+
+    /* Note that this step could also return NULL, but that's correct */
+    return hash_entry->tsa_entry;
+}
+
+/*
+ * get_tabstat_stack_level - add a new (sub)transaction stack entry if needed
+ */
+static PgStat_SubXactStatus *
+get_tabstat_stack_level(int nest_level)
+{
+    PgStat_SubXactStatus *xact_state;
+
+    xact_state = pgStatXactStack;
+    if (xact_state == NULL || xact_state->nest_level != nest_level)
+    {
+        xact_state = (PgStat_SubXactStatus *)
+            MemoryContextAlloc(TopTransactionContext,
+                               sizeof(PgStat_SubXactStatus));
+        xact_state->nest_level = nest_level;
+        xact_state->prev = pgStatXactStack;
+        xact_state->first = NULL;
+        pgStatXactStack = xact_state;
+    }
+    return xact_state;
+}
+
+/*
+ * add_tabstat_xact_level - add a new (sub)transaction state record
+ */
+static void
+add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level)
+{
+    PgStat_SubXactStatus *xact_state;
+    PgStat_TableXactStatus *trans;
+
+    /*
+     * If this is the first rel to be modified at the current nest level, we
+     * first have to push a transaction stack entry.
+     */
+    xact_state = get_tabstat_stack_level(nest_level);
+
+    /* Now make a per-table stack entry */
+    trans = (PgStat_TableXactStatus *)
+        MemoryContextAllocZero(TopTransactionContext,
+                               sizeof(PgStat_TableXactStatus));
+    trans->nest_level = nest_level;
+    trans->upper = pgstat_info->trans;
+    trans->parent = pgstat_info;
+    trans->next = xact_state->first;
+    xact_state->first = trans;
+    pgstat_info->trans = trans;
+}
+
+/*
+ * pgstat_count_heap_insert - count a tuple insertion of n tuples
+ */
+void
+pgstat_count_heap_insert(Relation rel, PgStat_Counter n)
+{
+    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
+
+    if (pgstat_info != NULL)
+    {
+        /* We have to log the effect at the proper transactional level */
+        int            nest_level = GetCurrentTransactionNestLevel();
+
+        if (pgstat_info->trans == NULL ||
+            pgstat_info->trans->nest_level != nest_level)
+            add_tabstat_xact_level(pgstat_info, nest_level);
+
+        pgstat_info->trans->tuples_inserted += n;
+    }
+}
+
+/*
+ * pgstat_count_heap_update - count a tuple update
+ */
+void
+pgstat_count_heap_update(Relation rel, bool hot)
+{
+    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
+
+    if (pgstat_info != NULL)
+    {
+        /* We have to log the effect at the proper transactional level */
+        int            nest_level = GetCurrentTransactionNestLevel();
+
+        if (pgstat_info->trans == NULL ||
+            pgstat_info->trans->nest_level != nest_level)
+            add_tabstat_xact_level(pgstat_info, nest_level);
+
+        pgstat_info->trans->tuples_updated++;
+
+        /* t_tuples_hot_updated is nontransactional, so just advance it */
+        if (hot)
+            pgstat_info->t_counts.t_tuples_hot_updated++;
+    }
+}
+
+/*
+ * pgstat_count_heap_delete - count a tuple deletion
+ */
+void
+pgstat_count_heap_delete(Relation rel)
+{
+    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
+
+    if (pgstat_info != NULL)
+    {
+        /* We have to log the effect at the proper transactional level */
+        int            nest_level = GetCurrentTransactionNestLevel();
+
+        if (pgstat_info->trans == NULL ||
+            pgstat_info->trans->nest_level != nest_level)
+            add_tabstat_xact_level(pgstat_info, nest_level);
+
+        pgstat_info->trans->tuples_deleted++;
+    }
+}
+
+/*
+ * pgstat_truncate_save_counters
+ *
+ * Whenever a table is truncated, we save its i/u/d counters so that they can
+ * be cleared, and if the (sub)xact that executed the truncate later aborts,
+ * the counters can be restored to the saved (pre-truncate) values.  Note we do
+ * this on the first truncate in any particular subxact level only.
+ */
+static void
+pgstat_truncate_save_counters(PgStat_TableXactStatus *trans)
+{
+    if (!trans->truncated)
+    {
+        trans->inserted_pre_trunc = trans->tuples_inserted;
+        trans->updated_pre_trunc = trans->tuples_updated;
+        trans->deleted_pre_trunc = trans->tuples_deleted;
+        trans->truncated = true;
+    }
+}
+
+/*
+ * pgstat_truncate_restore_counters - restore counters when a truncate aborts
+ */
+static void
+pgstat_truncate_restore_counters(PgStat_TableXactStatus *trans)
+{
+    if (trans->truncated)
+    {
+        trans->tuples_inserted = trans->inserted_pre_trunc;
+        trans->tuples_updated = trans->updated_pre_trunc;
+        trans->tuples_deleted = trans->deleted_pre_trunc;
+    }
+}
+
+/*
+ * pgstat_count_truncate - update tuple counters due to truncate
+ */
+void
+pgstat_count_truncate(Relation rel)
+{
+    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
+
+    if (pgstat_info != NULL)
+    {
+        /* We have to log the effect at the proper transactional level */
+        int            nest_level = GetCurrentTransactionNestLevel();
+
+        if (pgstat_info->trans == NULL ||
+            pgstat_info->trans->nest_level != nest_level)
+            add_tabstat_xact_level(pgstat_info, nest_level);
+
+        pgstat_truncate_save_counters(pgstat_info->trans);
+        pgstat_info->trans->tuples_inserted = 0;
+        pgstat_info->trans->tuples_updated = 0;
+        pgstat_info->trans->tuples_deleted = 0;
+    }
+}
+
+/*
+ * pgstat_update_heap_dead_tuples - update dead-tuples count
+ *
+ * The semantics of this are that we are reporting the nontransactional
+ * recovery of "delta" dead tuples; so t_delta_dead_tuples decreases
+ * rather than increasing, and the change goes straight into the per-table
+ * counter, not into transactional state.
+ */
+void
+pgstat_update_heap_dead_tuples(Relation rel, int delta)
+{
+    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
+
+    if (pgstat_info != NULL)
+        pgstat_info->t_counts.t_delta_dead_tuples -= delta;
+}
+
+
+/* ----------
+ * AtEOXact_PgStat
+ *
+ *    Called from access/transam/xact.c at top-level transaction commit/abort.
+ * ----------
+ */
+void
+AtEOXact_PgStat(bool isCommit)
+{
+    PgStat_SubXactStatus *xact_state;
+
+    /*
+     * Count transaction commit or abort.  (We use counters, not just bools,
+     * in case the reporting message isn't sent right away.)
+     */
+    if (isCommit)
+        pgStatXactCommit++;
+    else
+        pgStatXactRollback++;
+
+    /*
+     * Transfer transactional insert/update counts into the base tabstat
+     * entries.  We don't bother to free any of the transactional state, since
+     * it's all in TopTransactionContext and will go away anyway.
+     */
+    xact_state = pgStatXactStack;
+    if (xact_state != NULL)
+    {
+        PgStat_TableXactStatus *trans;
+
+        Assert(xact_state->nest_level == 1);
+        Assert(xact_state->prev == NULL);
+        for (trans = xact_state->first; trans != NULL; trans = trans->next)
+        {
+            PgStat_TableStatus *tabstat;
+
+            Assert(trans->nest_level == 1);
+            Assert(trans->upper == NULL);
+            tabstat = trans->parent;
+            Assert(tabstat->trans == trans);
+            /* restore pre-truncate stats (if any) in case of aborted xact */
+            if (!isCommit)
+                pgstat_truncate_restore_counters(trans);
+            /* count attempted actions regardless of commit/abort */
+            tabstat->t_counts.t_tuples_inserted += trans->tuples_inserted;
+            tabstat->t_counts.t_tuples_updated += trans->tuples_updated;
+            tabstat->t_counts.t_tuples_deleted += trans->tuples_deleted;
+            if (isCommit)
+            {
+                tabstat->t_counts.t_truncated = trans->truncated;
+                if (trans->truncated)
+                {
+                    /* forget live/dead stats seen by backend thus far */
+                    tabstat->t_counts.t_delta_live_tuples = 0;
+                    tabstat->t_counts.t_delta_dead_tuples = 0;
+                }
+                /* insert adds a live tuple, delete removes one */
+                tabstat->t_counts.t_delta_live_tuples +=
+                    trans->tuples_inserted - trans->tuples_deleted;
+                /* update and delete each create a dead tuple */
+                tabstat->t_counts.t_delta_dead_tuples +=
+                    trans->tuples_updated + trans->tuples_deleted;
+                /* insert, update, delete each count as one change event */
+                tabstat->t_counts.t_changed_tuples +=
+                    trans->tuples_inserted + trans->tuples_updated +
+                    trans->tuples_deleted;
+            }
+            else
+            {
+                /* inserted tuples are dead, deleted tuples are unaffected */
+                tabstat->t_counts.t_delta_dead_tuples +=
+                    trans->tuples_inserted + trans->tuples_updated;
+                /* an aborted xact generates no changed_tuple events */
+            }
+            tabstat->trans = NULL;
+        }
+    }
+    pgStatXactStack = NULL;
+
+    /* mark that the next reference will be the first in a transaction */
+    first_in_xact = true;
+}
+
+/* ----------
+ * AtEOSubXact_PgStat
+ *
+ *    Called from access/transam/xact.c at subtransaction commit/abort.
+ * ----------
+ */
+void
+AtEOSubXact_PgStat(bool isCommit, int nestDepth)
+{
+    PgStat_SubXactStatus *xact_state;
+
+    /*
+     * Transfer transactional insert/update counts into the next higher
+     * subtransaction state.
+     */
+    xact_state = pgStatXactStack;
+    if (xact_state != NULL &&
+        xact_state->nest_level >= nestDepth)
+    {
+        PgStat_TableXactStatus *trans;
+        PgStat_TableXactStatus *next_trans;
+
+        /* delink xact_state from stack immediately to simplify reuse case */
+        pgStatXactStack = xact_state->prev;
+
+        for (trans = xact_state->first; trans != NULL; trans = next_trans)
+        {
+            PgStat_TableStatus *tabstat;
+
+            next_trans = trans->next;
+            Assert(trans->nest_level == nestDepth);
+            tabstat = trans->parent;
+            Assert(tabstat->trans == trans);
+            if (isCommit)
+            {
+                if (trans->upper && trans->upper->nest_level == nestDepth - 1)
+                {
+                    if (trans->truncated)
+                    {
+                        /* propagate the truncate status one level up */
+                        pgstat_truncate_save_counters(trans->upper);
+                        /* replace upper xact stats with ours */
+                        trans->upper->tuples_inserted = trans->tuples_inserted;
+                        trans->upper->tuples_updated = trans->tuples_updated;
+                        trans->upper->tuples_deleted = trans->tuples_deleted;
+                    }
+                    else
+                    {
+                        trans->upper->tuples_inserted += trans->tuples_inserted;
+                        trans->upper->tuples_updated += trans->tuples_updated;
+                        trans->upper->tuples_deleted += trans->tuples_deleted;
+                    }
+                    tabstat->trans = trans->upper;
+                    pfree(trans);
+                }
+                else
+                {
+                    /*
+                     * When there isn't an immediate parent state, we can just
+                     * reuse the record instead of going through a
+                     * palloc/pfree pushup (this works since it's all in
+                     * TopTransactionContext anyway).  We have to re-link it
+                     * into the parent level, though, and that might mean
+                     * pushing a new entry into the pgStatXactStack.
+                     */
+                    PgStat_SubXactStatus *upper_xact_state;
+
+                    upper_xact_state = get_tabstat_stack_level(nestDepth - 1);
+                    trans->next = upper_xact_state->first;
+                    upper_xact_state->first = trans;
+                    trans->nest_level = nestDepth - 1;
+                }
+            }
+            else
+            {
+                /*
+                 * On abort, update top-level tabstat counts, then forget the
+                 * subtransaction
+                 */
+
+                /* first restore values obliterated by truncate */
+                pgstat_truncate_restore_counters(trans);
+                /* count attempted actions regardless of commit/abort */
+                tabstat->t_counts.t_tuples_inserted += trans->tuples_inserted;
+                tabstat->t_counts.t_tuples_updated += trans->tuples_updated;
+                tabstat->t_counts.t_tuples_deleted += trans->tuples_deleted;
+                /* inserted tuples are dead, deleted tuples are unaffected */
+                tabstat->t_counts.t_delta_dead_tuples +=
+                    trans->tuples_inserted + trans->tuples_updated;
+                tabstat->trans = trans->upper;
+                pfree(trans);
+            }
+        }
+        pfree(xact_state);
+    }
+}
+
+
+/*
+ * AtPrepare_PgStat
+ *        Save the transactional stats state at 2PC transaction prepare.
+ *
+ * In this phase we just generate 2PC records for all the pending
+ * transaction-dependent stats work.
+ */
+void
+AtPrepare_PgStat(void)
+{
+    PgStat_SubXactStatus *xact_state;
+
+    xact_state = pgStatXactStack;
+    if (xact_state != NULL)
+    {
+        PgStat_TableXactStatus *trans;
+
+        Assert(xact_state->nest_level == 1);
+        Assert(xact_state->prev == NULL);
+        for (trans = xact_state->first; trans != NULL; trans = trans->next)
+        {
+            PgStat_TableStatus *tabstat;
+            TwoPhasePgStatRecord record;
+
+            Assert(trans->nest_level == 1);
+            Assert(trans->upper == NULL);
+            tabstat = trans->parent;
+            Assert(tabstat->trans == trans);
+
+            record.tuples_inserted = trans->tuples_inserted;
+            record.tuples_updated = trans->tuples_updated;
+            record.tuples_deleted = trans->tuples_deleted;
+            record.inserted_pre_trunc = trans->inserted_pre_trunc;
+            record.updated_pre_trunc = trans->updated_pre_trunc;
+            record.deleted_pre_trunc = trans->deleted_pre_trunc;
+            record.t_id = tabstat->t_id;
+            record.t_shared = tabstat->t_shared;
+            record.t_truncated = trans->truncated;
+
+            RegisterTwoPhaseRecord(TWOPHASE_RM_PGSTAT_ID, 0,
+                                   &record, sizeof(TwoPhasePgStatRecord));
+        }
+    }
+}
+
+/*
+ * PostPrepare_PgStat
+ *        Clean up after successful PREPARE.
+ *
+ * All we need do here is unlink the transaction stats state from the
+ * nontransactional state.  The nontransactional action counts will be
+ * reported to the stats collector immediately, while the effects on live
+ * and dead tuple counts are preserved in the 2PC state file.
+ *
+ * Note: AtEOXact_PgStat is not called during PREPARE.
+ */
+void
+PostPrepare_PgStat(void)
+{
+    PgStat_SubXactStatus *xact_state;
+
+    /*
+     * We don't bother to free any of the transactional state, since it's all
+     * in TopTransactionContext and will go away anyway.
+     */
+    xact_state = pgStatXactStack;
+    if (xact_state != NULL)
+    {
+        PgStat_TableXactStatus *trans;
+
+        for (trans = xact_state->first; trans != NULL; trans = trans->next)
+        {
+            PgStat_TableStatus *tabstat;
+
+            tabstat = trans->parent;
+            tabstat->trans = NULL;
+        }
+    }
+    pgStatXactStack = NULL;
+}
+
+/*
+ * 2PC processing routine for COMMIT PREPARED case.
+ *
+ * Load the saved counts into our local pgstats state.
+ */
+void
+pgstat_twophase_postcommit(TransactionId xid, uint16 info,
+                           void *recdata, uint32 len)
+{
+    TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata;
+    PgStat_TableStatus *pgstat_info;
+
+    /* Find or create a tabstat entry for the rel */
+    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+
+    /* Same math as in AtEOXact_PgStat, commit case */
+    pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
+    pgstat_info->t_counts.t_tuples_updated += rec->tuples_updated;
+    pgstat_info->t_counts.t_tuples_deleted += rec->tuples_deleted;
+    pgstat_info->t_counts.t_truncated = rec->t_truncated;
+    if (rec->t_truncated)
+    {
+        /* forget live/dead stats seen by backend thus far */
+        pgstat_info->t_counts.t_delta_live_tuples = 0;
+        pgstat_info->t_counts.t_delta_dead_tuples = 0;
+    }
+    pgstat_info->t_counts.t_delta_live_tuples +=
+        rec->tuples_inserted - rec->tuples_deleted;
+    pgstat_info->t_counts.t_delta_dead_tuples +=
+        rec->tuples_updated + rec->tuples_deleted;
+    pgstat_info->t_counts.t_changed_tuples +=
+        rec->tuples_inserted + rec->tuples_updated +
+        rec->tuples_deleted;
+}
+
+/*
+ * 2PC processing routine for ROLLBACK PREPARED case.
+ *
+ * Load the saved counts into our local pgstats state, but treat them
+ * as aborted.
+ */
+void
+pgstat_twophase_postabort(TransactionId xid, uint16 info,
+                          void *recdata, uint32 len)
+{
+    TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata;
+    PgStat_TableStatus *pgstat_info;
+
+    /* Find or create a tabstat entry for the rel */
+    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+
+    /* Same math as in AtEOXact_PgStat, abort case */
+    if (rec->t_truncated)
+    {
+        rec->tuples_inserted = rec->inserted_pre_trunc;
+        rec->tuples_updated = rec->updated_pre_trunc;
+        rec->tuples_deleted = rec->deleted_pre_trunc;
+    }
+    pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
+    pgstat_info->t_counts.t_tuples_updated += rec->tuples_updated;
+    pgstat_info->t_counts.t_tuples_deleted += rec->tuples_deleted;
+    pgstat_info->t_counts.t_delta_dead_tuples +=
+        rec->tuples_inserted + rec->tuples_updated;
+}
+
+/* ----------
+ * pgstat_fetch_stat_tabentry() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    the collected statistics for one table or NULL. NULL doesn't mean
+ *    that the table doesn't exist, it is just not yet known by the
+ *    collector, so the caller is better off to report ZERO instead.
+ * ----------
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry(Oid relid)
+{
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+
+    /* Lookup our database, then look in its table hash table. */
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, false);
+    if (dbentry == NULL)
+        return NULL;
+
+    tabentry = backend_get_tab_entry(dbentry, relid, false);
+    if (tabentry != NULL)
+        return tabentry;
+
+    /*
+     * If we didn't find it, maybe it's a shared table.
+     */
+    dbentry = pgstat_fetch_stat_dbentry(InvalidOid, false);
+    if (dbentry == NULL)
+        return NULL;
+
+    tabentry = backend_get_tab_entry(dbentry, relid, false);
+    if (tabentry != NULL)
+        return tabentry;
+
+    return NULL;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_funcentry() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    the collected statistics for one function or NULL.
+ * ----------
+ */
+PgStat_StatFuncEntry *
+pgstat_fetch_stat_funcentry(Oid func_id)
+{
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatFuncEntry *funcentry = NULL;
+
+    /* Lookup our database, then find the requested function */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_SHARED, NULL);
+    if (dbentry == NULL)
+        return NULL;
+
+    funcentry = backend_get_func_entry(dbentry, func_id, false);
+
+    dshash_release_lock(db_stats, dbentry);
+    return funcentry;
+}
+
+/*
+ * ---------
+ * pgstat_fetch_stat_archiver() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    a pointer to the archiver statistics struct.
+ * ---------
+ */
+PgStat_ArchiverStats *
+pgstat_fetch_stat_archiver(void)
+{
+    /* If not done for this transaction, take a stats snapshot */
+    if (!backend_snapshot_global_stats())
+        return NULL;
+
+    return snapshot_archiverStats;
+}
+
+
+/*
+ * ---------
+ * pgstat_fetch_global() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    a pointer to the global statistics struct.
+ * ---------
+ */
+PgStat_GlobalStats *
+pgstat_fetch_global(void)
+{
+    /* If not done for this transaction, take a stats snapshot */
+    if (!backend_snapshot_global_stats())
+        return NULL;
+
+    return snapshot_globalStats;
+}
+
+/*
+ * Shut down a single backend's statistics reporting at process exit.
+ *
+ * Flush any remaining statistics counts out to the collector.
+ * Without this, operations triggered during backend exit (such as
+ * temp table deletions) won't be counted.
+ *
+ * Lastly, clear out our entry in the PgBackendStatus array.
+ */
+static void
+pgstat_beshutdown_hook(int code, Datum arg)
+{
+    /*
+     * If we got as far as discovering our own database ID, we can report what
+     * we did to the collector.  Otherwise, we'd be sending an invalid
+     * database ID, so forget it.  (This means that accesses to pg_database
+     * during failed backend starts might never get counted.)
+     */
+    if (OidIsValid(MyDatabaseId))
+        pgstat_update_stat(true);
+}
+
+
+/* ------------------------------------------------------------
+ * Local support functions follow
+ * ------------------------------------------------------------
+ */
+
+/* ----------
+ * pgstat_update_archiver() -
+ *
+ *    Update the stats data about the WAL file that we successfully archived or
+ *    failed to archive.
+ * ----------
+ */
+void
+pgstat_update_archiver(const char *xlog, bool failed)
+{
+    if (failed)
+    {
+        /* Failed archival attempt */
+        ++shared_archiverStats->failed_count;
+        memcpy(shared_archiverStats->last_failed_wal, xlog,
+               sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = GetCurrentTimestamp();
+    }
+    else
+    {
+        /* Successful archival operation */
+        ++shared_archiverStats->archived_count;
+        strlcpy(shared_archiverStats->last_archived_wal, xlog,
+                sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = GetCurrentTimestamp();
+    }
+}
+
+/* ----------
+ * pgstat_update_bgwriter() -
+ *
+ *        Update bgwriter statistics
+ * ----------
+ */
+void
+pgstat_update_bgwriter(void)
+{
+    /* We assume this initializes to zeroes */
+    static const PgStat_BgWriter all_zeroes;
+
+    PgStat_BgWriter *s = &BgWriterStats;
+
+    /*
+     * This function can be called even if nothing at all has happened. In
+     * this case, avoid taking the lock and updating the shared stats
+     * needlessly.
+     */
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
+        return;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += s->m_timed_checkpoints;
+    shared_globalStats->requested_checkpoints += s->m_requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += s->m_checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += s->m_checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += s->m_buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += s->m_buf_written_clean;
+    shared_globalStats->maxwritten_clean += s->m_maxwritten_clean;
+    shared_globalStats->buf_written_backend += s->m_buf_written_backend;
+    shared_globalStats->buf_fsync_backend += s->m_buf_fsync_backend;
+    shared_globalStats->buf_alloc += s->m_buf_alloc;
+    LWLockRelease(StatsLock);
+
+    /*
+     * Clear out the statistics buffer, so it can be re-used.
+     */
+    MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
+}
+
+/*
+ * Pin and unpin the hashes of a dbentry.
+ *
+ * To keep memory usage low, counters are reset by recreating the dshash
+ * rather than by removing individual entries under a whole-dshash lock. On
+ * the other hand, a dshash cannot be destroyed until all referrers are gone,
+ * so a counter reset might have to wait for someone writing table counters.
+ * To avoid that wait, we prepare another generation of table/function hashes
+ * and set aside the hashes that are to be destroyed but are still being
+ * accessed. pin_hashes() returns the "generation" of the current hashes and
+ * releases the dshash lock on dbentry. unpin_hashes() removes the older
+ * generation's hashes once all referrers are gone.
+ */
+static int
+pin_hashes(PgStat_StatDBEntry *dbentry)
+{
+    int    counter;
+
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->refcnt++;
+    counter = dbentry->generation;
+    LWLockRelease(&dbentry->lock);
+
+    dshash_release_lock(db_stats, dbentry);
+
+    return counter;
+}
+
+/*
+ * Release the hashes in dbentry. If the given generation has been set aside
+ * as the previous one, destroy its hashes once all referrers are gone.
+ * Otherwise just decrement the reference count and return.
+ */
+static void
+unpin_hashes(PgStat_StatDBEntry *dbentry, int generation)
+{
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+
+    /* using current generation, just decrease refcount */
+    if (dbentry->generation == generation)
+    {
+        dbentry->refcnt--;
+        LWLockRelease(&dbentry->lock);
+        return;
+    }
+
+    /*
+     * The caller pinned the previous generation. Its hashes are destroyed
+     * when the last referrer goes away.
+     */
+    Assert(dbentry->generation - 1 == generation); /* allow wrap around */
+
+    if (--dbentry->prev_refcnt == 0)
+    {
+        /* no referer remains, remove the hashes */
+        dshash_table *tables = dshash_attach(area, &dsh_tblparams,
+                                             dbentry->prev_tables, 0);
+        dshash_destroy(tables);
+
+        if (dbentry->prev_functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *funcs =
+                dshash_attach(area, &dsh_funcparams,
+                              dbentry->prev_functions, 0);
+            dshash_destroy(funcs);
+        }
+        dbentry->prev_tables = DSM_HANDLE_INVALID;
+        dbentry->prev_functions = DSM_HANDLE_INVALID;
+    }
+
+    LWLockRelease(&dbentry->lock);
+    return;
+}
+
+/* attach and return the specified generation of table hash */
+static dshash_table *
+attach_table_hash(PgStat_StatDBEntry *dbent, int gen)
+{
+    dshash_table *ret;
+
+    LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
+    if (dbent->generation == gen)
+        ret = dshash_attach(area, &dsh_tblparams, dbent->tables, 0);
+    else
+    {
+        Assert (dbent->generation == gen + 1);
+        Assert (dbent->prev_tables != DSM_HANDLE_INVALID);
+        ret = dshash_attach(area, &dsh_tblparams, dbent->prev_tables, 0);
+    }
+    LWLockRelease(&dbent->lock);
+
+    return ret;
+}
+
+/* attach and return the specified generation of function hash */
+static dshash_table *
+attach_function_hash(PgStat_StatDBEntry *dbent, int gen)
+{
+    dshash_table *ret = NULL;
+
+    LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
+    if (dbent->generation == gen)
+    {
+        if (dbent->functions == DSM_HANDLE_INVALID)
+        {
+            dshash_table *funchash =
+                dshash_create(area, &dsh_funcparams, 0);
+            dbent->functions = dshash_get_hash_table_handle(funchash);
+
+            ret = funchash;
+        }
+        else
+            ret = dshash_attach(area, &dsh_funcparams, dbent->functions, 0);
+    }
+    /*
+     * For an older generation, don't bother creating or attaching a function
+     * hash; ret stays NULL.
+     */
+
+    LWLockRelease(&dbent->lock);
+
+    return ret;
+}
+
+static void
+initialize_dbentry_nonpersistent_members(PgStat_StatDBEntry *dbentry)
+{
+    LWLockInitialize(&dbentry->lock, LWTRANCHE_STATS_DB);
+    dbentry->generation = 0;
+    dbentry->refcnt = 0;
+    dbentry->prev_refcnt = 0;
+    dbentry->tables = DSM_HANDLE_INVALID;
+    dbentry->prev_tables = DSM_HANDLE_INVALID;
+    dbentry->functions = DSM_HANDLE_INVALID;
+    dbentry->prev_functions = DSM_HANDLE_INVALID;
+}
+
+/*
+ * Subroutine to reset stats in a shared database entry
+ *
+ * Tables and functions hashes are initialized to empty.  dbentry keeps the
+ * previous dshash tables while they are still attached by someone. If
+ * initialize is true, the members that track the previous tables are
+ * initialized as well.
+ */
+static void
+reset_dbentry_counters(PgStat_StatDBEntry *dbentry, bool initialize)
+{
+    if (initialize)
+        initialize_dbentry_nonpersistent_members(dbentry);
+
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+
+    dbentry->n_xact_commit = 0;
+    dbentry->n_xact_rollback = 0;
+    dbentry->n_blocks_fetched = 0;
+    dbentry->n_blocks_hit = 0;
+    dbentry->n_tuples_returned = 0;
+    dbentry->n_tuples_fetched = 0;
+    dbentry->n_tuples_inserted = 0;
+    dbentry->n_tuples_updated = 0;
+    dbentry->n_tuples_deleted = 0;
+    dbentry->last_autovac_time = 0;
+    dbentry->n_conflict_tablespace = 0;
+    dbentry->n_conflict_lock = 0;
+    dbentry->n_conflict_snapshot = 0;
+    dbentry->n_conflict_bufferpin = 0;
+    dbentry->n_conflict_startup_deadlock = 0;
+    dbentry->n_temp_files = 0;
+    dbentry->n_temp_bytes = 0;
+    dbentry->n_deadlocks = 0;
+    dbentry->n_block_read_time = 0;
+    dbentry->n_block_write_time = 0;
+
+    if (dbentry->refcnt == 0)
+    {
+        /*
+         * No one is referring to the current hash. Just destroy it then
+         * create new one.
+         */
+        dshash_table *tbl;
+
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+            dbentry->tables = DSM_HANDLE_INVALID;
+        }
+
+        /* functions table is created on-demand */
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+            dbentry->functions = DSM_HANDLE_INVALID;
+        }
+    }
+    else if (dbentry->prev_refcnt == 0)
+    {
+        /*
+         * Someone is still referring to the current hash and previous slot is
+         * vacant. Stash out the current hash to the previous slot.
+         */
+        dbentry->prev_refcnt = dbentry->refcnt;
+        dbentry->prev_tables = dbentry->tables;
+        dbentry->prev_functions = dbentry->functions;
+
+        dbentry->refcnt = 0;
+        dbentry->tables = DSM_HANDLE_INVALID;
+        dbentry->functions = DSM_HANDLE_INVALID;
+        dbentry->generation++;
+    }
+    else
+    {
+        Assert(dbentry->prev_refcnt > 0 && dbentry->refcnt > 0);
+        /*
+         * If we get here, another reset request has arrived while the old
+         * hashes are still waiting for all referrers to release them. That
+         * should last only a short time, so we can just ignore this request.
+         */
+    }
+
+    /* Create a new table hash if none exists */
+    if (dbentry->tables == DSM_HANDLE_INVALID)
+    {
+        dshash_table *tbl;
+
+        tbl = dshash_create(area, &dsh_tblparams, 0);
+        dbentry->tables = dshash_get_hash_table_handle(tbl);
+    }
+
+    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
+    dbentry->stats_timestamp = 0;
+
+    LWLockRelease(&dbentry->lock);
+
+    dbentry->snapshot_tables = NULL;
+    dbentry->snapshot_functions = NULL;
+}
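
The generation scheme implemented by pin_hashes(), unpin_hashes() and reset_dbentry_counters() can be hard to follow from the locking code alone. The following is a minimal single-threaded sketch of the same bookkeeping, assuming plain integers in place of dshash handles and no locking at all. GenEntry, gen_init(), gen_pin(), gen_unpin() and gen_reset() are hypothetical names used only for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a dbentry; handles are plain ints, 0 = none. */
typedef struct GenEntry
{
    int generation;     /* bumped when the current hash is set aside */
    int refcnt;         /* pins on the current generation */
    int prev_refcnt;    /* pins on the set-aside generation */
    int current;        /* handle of the current hash */
    int previous;       /* handle of the set-aside hash */
    int next_handle;    /* source of fresh handles */
    int destroyed;      /* number of hashes destroyed so far */
} GenEntry;

static void
gen_init(GenEntry *e)
{
    e->generation = 0;
    e->refcnt = 0;
    e->prev_refcnt = 0;
    e->current = 1;
    e->previous = 0;
    e->next_handle = 2;
    e->destroyed = 0;
}

/* Pin the current hash and remember which generation was pinned. */
static int
gen_pin(GenEntry *e)
{
    e->refcnt++;
    return e->generation;
}

/* Unpin; the last referrer of a set-aside generation destroys it. */
static void
gen_unpin(GenEntry *e, int generation)
{
    if (e->generation == generation)
    {
        e->refcnt--;
        return;
    }
    if (--e->prev_refcnt == 0)
    {
        e->previous = 0;
        e->destroyed++;
    }
}

/* Replace the current hash; returns false if both slots are still in use. */
static bool
gen_reset(GenEntry *e)
{
    if (e->refcnt == 0)
        e->destroyed++;             /* nobody looks at it: destroy directly */
    else if (e->prev_refcnt == 0)
    {
        e->prev_refcnt = e->refcnt; /* set the pinned hash aside */
        e->previous = e->current;
        e->refcnt = 0;
        e->generation++;
    }
    else
        return false;               /* an earlier reset is still draining */

    e->current = e->next_handle++;
    return true;
}
```

A writer that pinned generation 0 before a reset keeps using the set-aside hash, and its gen_unpin() call is what finally destroys it. A second reset that arrives before that happens is skipped, which mirrors the last branch of reset_dbentry_counters().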
+
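+/*
+ * Look up the hash table entry for the specified database. If no entry
+ * exists and op includes PGSTAT_FETCH_EXCLUSIVE, create and initialize it;
+ * otherwise return NULL. With PGSTAT_FETCH_NOWAIT, also return NULL when the
+ * lock cannot be acquired immediately. If status is not NULL, the outcome is
+ * stored in *status.
+ */
+static PgStat_StatDBEntry *
+pgstat_get_db_entry(Oid databaseid, int op, PgStat_TableLookupState *status)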
+{
+    PgStat_StatDBEntry *result;
+    bool        nowait = ((op & PGSTAT_FETCH_NOWAIT) != 0);
+    bool        lock_acquired = true;
+    bool        found = true;
+
+    if (!IsUnderPostmaster)
+        return NULL;
+
+    /* Lookup or create the hash table entry for this database */
+    if (op & PGSTAT_FETCH_EXCLUSIVE)
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_or_insert_extended(db_stats, &databaseid,
+                                           &found, nowait);
+        if (result == NULL)
+            lock_acquired = false;
+        else if (!found)
+        {
+            /*
+             * If not found, initialize the new one.  This creates empty hash
+             * tables for tables and functions, too.
+             */
+            reset_dbentry_counters(result, true);
+        }
+    }
+    else
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_extended(db_stats, &databaseid, true, nowait,
+                                 nowait ? &lock_acquired : NULL);
+        if (result == NULL)
+            found = false;
+    }
+
+    /* Set return status if requested */
+    if (status)
+    {
+        if (!lock_acquired)
+        {
+            Assert(nowait);
+            *status = PGSTAT_ENTRY_LOCK_FAILED;
+        }
+        else if (!found)
+            *status = PGSTAT_ENTRY_NOT_FOUND;
+        else
+            *status = PGSTAT_ENTRY_FOUND;
+    }
+
+    return result;
+}
+
+/*
+ * Look up the hash table entry for the specified table. If no entry exists
+ * and create is true, insert and initialize a new one; otherwise return
+ * NULL.
+ */
+static PgStat_StatTabEntry *
+pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create)
+{
+    PgStat_StatTabEntry *result;
+    bool        found;
+
+    /* Lookup or create the hash table entry for this table */
+    if (create)
+        result = (PgStat_StatTabEntry *)
+            dshash_find_or_insert(table, &tableoid, &found);
+    else
+    {
+        result = (PgStat_StatTabEntry *) dshash_find(table, &tableoid, false);
+        /* dshash_find() does not report "found", so derive it here */
+        found = (result != NULL);
+    }
+
+    if (!create && !found)
+        return NULL;
+
+    /* If not found, initialize the new one. */
+    if (!found)
+    {
+        result->numscans = 0;
+        result->tuples_returned = 0;
+        result->tuples_fetched = 0;
+        result->tuples_inserted = 0;
+        result->tuples_updated = 0;
+        result->tuples_deleted = 0;
+        result->tuples_hot_updated = 0;
+        result->n_live_tuples = 0;
+        result->n_dead_tuples = 0;
+        result->changes_since_analyze = 0;
+        result->blocks_fetched = 0;
+        result->blocks_hit = 0;
+        result->vacuum_timestamp = 0;
+        result->vacuum_count = 0;
+        result->autovac_vacuum_timestamp = 0;
+        result->autovac_vacuum_count = 0;
+        result->analyze_timestamp = 0;
+        result->analyze_count = 0;
+        result->autovac_analyze_timestamp = 0;
+        result->autovac_analyze_count = 0;
+    }
+
+    return result;
+}
+
+
+/* ----------
+ * pgstat_write_statsfiles() -
+ *        Write the global statistics file, as well as DB files.
+ * ----------
+ */
+void
+pgstat_write_statsfiles(void)
+{
+    dshash_seq_status hstat;
+    PgStat_StatDBEntry *dbentry;
+    FILE       *fpout;
+    int32        format_id;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    int            rc;
+
+    /* should be called from postmaster  */
+    Assert(!IsUnderPostmaster);
+
+    elog(DEBUG2, "writing stats file \"%s\"", statfile);
+
+    /*
+     * Open the statistics temp file to write out the current values.
+     */
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        return;
+    }
+
+    /*
+     * Set the timestamp of the stats file.
+     */
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
+
+    /*
+     * Write the file header --- currently just a format ID.
+     */
+    format_id = PGSTAT_FILE_FORMAT_ID;
+    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write global stats struct
+     */
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write archiver stats struct
+     */
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Walk through the database table.
+     */
+    dshash_seq_init(&hstat, db_stats, false, false);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&hstat)) != NULL)
+    {
+        /*
+         * Write out the table and function stats for this DB into the
+         * appropriate per-DB stat file, if required.
+         */
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
+
+        pgstat_write_db_statsfile(dbentry);
+
+        /*
+         * Write out the DB entry. We don't write the tables or functions
+         * pointers, since they're of no use to any other process.
+         */
+        fputc('D', fpout);
+        rc = fwrite(dbentry,
+                    offsetof(PgStat_StatDBEntry, generation), 1, fpout);
+        (void) rc;                /* we'll check for error with ferror */
+    }
+
+    /*
+     * No more output to be done. Close the temp file and replace the old
+     * pgstat.stat with it.  The ferror() check replaces testing for error
+     * after each individual fputc or fwrite above.
+     */
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+}
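
The tail of pgstat_write_statsfiles() follows a write-to-temporary-then-rename pattern with a single ferror() check at the end. Below is a minimal sketch of that pattern in standard C, assuming plain fopen()/fclose() instead of AllocateFile()/FreeFile() and a return code instead of ereport(). The function name write_file_atomically() and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write len bytes of data to tmppath, then rename it to path.
 * Returns 0 on success and -1 on failure; on failure the temporary
 * file is removed and any existing file at path is left untouched.
 */
static int
write_file_atomically(const char *tmppath, const char *path,
                      const void *data, size_t len)
{
    FILE   *fp = fopen(tmppath, "wb");
    size_t  rc;

    if (fp == NULL)
        return -1;

    rc = fwrite(data, 1, len, fp);
    (void) rc;                  /* errors are checked with ferror() below */

    if (ferror(fp))
    {
        fclose(fp);
        remove(tmppath);
        return -1;
    }
    if (fclose(fp) != 0)
    {
        remove(tmppath);
        return -1;
    }
    if (rename(tmppath, path) != 0)
    {
        remove(tmppath);
        return -1;
    }
    return 0;
}
```

Because the rename happens only after a clean close, a reader never sees a half-written file under the final name.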
+
+/*
+ * return the filename for a DB stat file; filename is the output buffer,
+ * of length len.
+ */
+static void
+get_dbstat_filename(bool tempname, Oid databaseid,
+                    char *filename, int len)
+{
+    int            printed;
+
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/db_%u.%s",
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
+                       databaseid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
+}
+
+/* ----------
+ * pgstat_write_db_statsfile() -
+ *        Write the stat file for a single database.
+ *
+ *    If writing to the permanent file (happens when the collector is
+ *    shutting down only), remove the temporary file so that backends
+ *    starting up under a new postmaster can't read the old data before
+ *    the new collector is ready.
+ * ----------
+ */
+static void
+pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry)
+{
+    dshash_seq_status tstat;
+    dshash_seq_status fstat;
+    PgStat_StatTabEntry *tabentry;
+    PgStat_StatFuncEntry *funcentry;
+    FILE       *fpout;
+    int32        format_id;
+    Oid            dbid = dbentry->databaseid;
+    int            rc;
+    char        tmpfile[MAXPGPATH];
+    char        statfile[MAXPGPATH];
+    dshash_table *tbl;
+
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
+
+    elog(DEBUG2, "writing stats file \"%s\"", statfile);
+
+    /*
+     * Open the statistics temp file to write out the current values.
+     */
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        return;
+    }
+
+    /*
+     * Write the file header --- currently just a format ID.
+     */
+    format_id = PGSTAT_FILE_FORMAT_ID;
+    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Walk through the database's access stats per table.
+     */
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&tstat, tbl, false, false);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&tstat)) != NULL)
+    {
+        fputc('T', fpout);
+        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
+        (void) rc;                /* we'll check for error with ferror */
+    }
+    dshash_detach(tbl);
+
+    /*
+     * Walk through the database's function stats table.
+     */
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_seq_init(&fstat, tbl, false, false);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&fstat)) != NULL)
+        {
+            fputc('F', fpout);
+            rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+            (void) rc;                /* we'll check for error with ferror */
+        }
+        dshash_detach(tbl);
+    }
+
+    /*
+     * No more output to be done. Close the temp file and replace the old
+     * pgstat.stat with it.  The ferror() check replaces testing for error
+     * after each individual fputc or fwrite above.
+     */
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+}
+
+/* ----------
+ * pgstat_read_statsfiles() -
+ *
+ *    Reads in some existing statistics collector files into the shared stats
+ *    hash.
+ *
+ * ----------
+ */
+void
+pgstat_read_statsfiles(void)
+{
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatDBEntry dbbuf;
+    FILE       *fpin;
+    int32        format_id;
+    bool        found;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    dshash_table *tblstats = NULL;
+    dshash_table *funcstats = NULL;
+
+    /* should be called from postmaster  */
+    Assert(!IsUnderPostmaster);
+
+    /*
+     * local cache lives in pgStatLocalContext.
+     */
+    pgstat_setup_memcxt();
+
+    /*
+     * Create the DB hashtable and global stats area
+     */
+    /* Hold the lock so that no other process sees empty stats */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    pgstat_create_shared_stats();
+
+    /*
+     * Set the current timestamp (will be kept only in case we can't load an
+     * existing statsfile).
+     */
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp = shared_globalStats->stat_reset_timestamp;
+
+    /*
+     * Try to open the stats file. If it doesn't exist, the backends simply
+     * return zero for anything and the collector simply starts from scratch
+     * with empty counters.
+     *
+     * ENOENT is a possibility if the stats collector is not running or has
+     * not yet written the stats file the first time.  Any other failure
+     * condition is suspicious.
+     */
+    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(LOG,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            statfile)));
+        LWLockRelease(StatsLock);
+        return;
+    }
+
+    /*
+     * Verify it's of the expected format.
+     */
+    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
+        format_id != PGSTAT_FILE_FORMAT_ID)
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        goto done;
+    }
+
+    /*
+     * Read global stats struct
+     */
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+        sizeof(*shared_globalStats))
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        memset(shared_globalStats, 0, sizeof(*shared_globalStats));
+        goto done;
+    }
+
+    /*
+     * In the collector, disregard the timestamp we read from the permanent
+     * stats file; we should be willing to write a temp stats file immediately
+     * upon the first request from any backend.  This only matters if the old
+     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
+     * an unusual scenario.
+     */
+    shared_globalStats->stats_timestamp = 0;
+
+    /*
+     * Read archiver stats struct
+     */
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+        sizeof(*shared_archiverStats))
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        goto done;
+    }
+
+    /*
+     * We found an existing collector stats file. Read it and put all the
+     * hashtable entries into place.
+     */
+    for (;;)
+    {
+        switch (fgetc(fpin))
+        {
+                /*
+                 * 'D'    A PgStat_StatDBEntry struct describing a database
+                 * follows.
+                 */
+            case 'D':
+                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, generation),
+                          fpin) != offsetof(PgStat_StatDBEntry, generation))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                /*
+                 * Add to the DB hash
+                 */
+                dbentry = (PgStat_StatDBEntry *)
+                    dshash_find_or_insert(db_stats, (void *) &dbbuf.databaseid,
+                                          &found);
+                if (found)
+                {
+                    dshash_release_lock(db_stats, dbentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
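+                /*
+                 * Only the persistent part of the entry was read from the
+                 * file, so copy just that part and then initialize the rest.
+                 */
+                memcpy(dbentry, &dbbuf,
+                       offsetof(PgStat_StatDBEntry, generation));
+                initialize_dbentry_nonpersistent_members(dbentry);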
+
+                dbentry->snapshot_tables = NULL;
+                dbentry->snapshot_functions = NULL;
+
+                /*
+                 * In the collector, disregard the timestamp we read from the
+                 * permanent stats file; we should be willing to write a temp
+                 * stats file immediately upon the first request from any
+                 * backend.
+                 */
+                dbentry->stats_timestamp = 0;
+
+                /*
+                 * Read the data from the database-specific file into a newly
+                 * created table hash.
+                 */
+                tblstats = dshash_create(area, &dsh_tblparams, 0);
+                dbentry->tables = dshash_get_hash_table_handle(tblstats);
+                /* we don't create the function hash for now */
+                dshash_release_lock(db_stats, dbentry);
+                pgstat_read_db_statsfile(dbbuf.databaseid,
+                                         tblstats, funcstats);
+                dshash_detach(tblstats);
+                break;
+
+            case 'E':
+                goto done;
+
+            default:
+                ereport(LOG,
+                        (errmsg("corrupted statistics file \"%s\"",
+                                statfile)));
+                goto done;
+        }
+    }
+
+done:
+    LWLockRelease(StatsLock);
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+
+    return;
+}
+
+
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+void
+StatsShmemInit(void)
+{
+    bool    found;
+
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+
+        /* Load saved data if any */
+        pgstat_read_statsfiles();
+
+        /* needs to be called before dsm shutdown */
+        before_shmem_exit(pgstat_postmaster_shutdown, (Datum) 0);
+    }
+}
+
+static void
+pgstat_postmaster_shutdown(int code, Datum arg)
+{
+    /* we trash the stats on crash */
+    if (code == 0)
+        pgstat_write_statsfiles();
+}
+
+/* ----------
+ * pgstat_read_db_statsfile() -
+ *
+ *    Reads in the permanent statistics file of a database and fills the
+ *    given shared statistics tables. The file is removed after reading.
+ * ----------
+ */
+static void
+pgstat_read_db_statsfile(Oid databaseid,
+                         dshash_table *tabhash, dshash_table *funchash)
+{
+    PgStat_StatTabEntry *tabentry;
+    PgStat_StatTabEntry tabbuf;
+    PgStat_StatFuncEntry funcbuf;
+    PgStat_StatFuncEntry *funcentry;
+    FILE       *fpin;
+    int32        format_id;
+    bool        found;
+    char        statfile[MAXPGPATH];
+
+    /* should be called from postmaster  */
+    Assert(!IsUnderPostmaster);
+
+    get_dbstat_filename(false, databaseid, statfile, MAXPGPATH);
+
+    /*
+     * Try to open the stats file. If it doesn't exist, the backends simply
+     * return zero for anything and the collector simply starts from scratch
+     * with empty counters.
+     *
+     * ENOENT is a possibility if the stats collector is not running or has
+     * not yet written the stats file the first time.  Any other failure
+     * condition is suspicious.
+     */
+    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(LOG,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            statfile)));
+        return;
+    }
+
+    /*
+     * Verify it's of the expected format.
+     */
+    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
+        format_id != PGSTAT_FILE_FORMAT_ID)
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        goto done;
+    }
+
+    /*
+     * We found an existing collector stats file. Read it and put all the
+     * hashtable entries into place.
+     */
+    for (;;)
+    {
+        switch (fgetc(fpin))
+        {
+                /*
+                 * 'T'    A PgStat_StatTabEntry follows.
+                 */
+            case 'T':
+                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
+                          fpin) != sizeof(PgStat_StatTabEntry))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                /*
+                 * Skip if table data not wanted.
+                 */
+                if (tabhash == NULL)
+                    break;
+
+                tabentry = (PgStat_StatTabEntry *)
+                    dshash_find_or_insert(tabhash,
+                                          (void *) &tabbuf.tableid, &found);
+
+                if (found)
+                {
+                    dshash_release_lock(tabhash, tabentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
+                dshash_release_lock(tabhash, tabentry);
+                break;
+
+                /*
+                 * 'F'    A PgStat_StatFuncEntry follows.
+                 */
+            case 'F':
+                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
+                          fpin) != sizeof(PgStat_StatFuncEntry))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                /*
+                 * Skip if function data not wanted.
+                 */
+                if (funchash == NULL)
+                    break;
+
+                funcentry = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert(funchash,
+                                          (void *) &funcbuf.functionid, &found);
+
+                if (found)
+                {
+                    dshash_release_lock(funchash, funcentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
+                dshash_release_lock(funchash, funcentry);
+                break;
+
+                /*
+                 * 'E'    The EOF marker of a complete stats file.
+                 */
+            case 'E':
+                goto done;
+
+            default:
+                ereport(LOG,
+                        (errmsg("corrupted statistics file \"%s\"",
+                                statfile)));
+                goto done;
+        }
+    }
+
+done:
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+}
+
+/* ----------
+ * pgstat_clear_snapshot: clear the local cache so that new snapshots
+ * will be read.
+ * ----------
+ */
+void
+pgstat_clear_snapshot(void)
+{
+    Assert(pgStatLocalContext);
+    MemoryContextReset(pgStatLocalContext);
+
+    /* mark the resources as not allocated */
+    snapshot_globalStats = NULL;
+    snapshot_archiverStats = NULL;
+    snapshot_db_stats = NULL;
+}
+
+/*
+ * create_local_stats_hash() -
+ *
+ * Creates a dynahash used for table/function stats cache.
+ */
+static HTAB *
+create_local_stats_hash(const char *name, size_t keysize, size_t entrysize,
+                        int nentries)
+{
+    HTAB *result;
+    HASHCTL ctl;
+
+    /* Create the hash in the stats context */
+    ctl.keysize        = keysize;
+    ctl.entrysize    = entrysize;
+    ctl.hcxt        = pgStatLocalContext;
+    result = hash_create(name, nentries, &ctl,
+                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    return result;
+}
+
+/*
+ * snapshot_statentry() - Find an entry in the source dshash.
+ *
+ * Returns the entry for key, or NULL if not found. If dest is not NULL, *dest
+ * is used as a local cache; it is created with the same shape as the given
+ * dshash if *dest is NULL. Either way the result is cached in that hash and
+ * the same entry is returned to subsequent calls for the same key.
+ *
+ * Otherwise the returned entry is a copy palloc'ed in the caller's memory
+ * context. Its content may differ for every request.
+ *
+ * If dshash is NULL, temporarily attaches dsh_handle instead.
+ */
+static void *
+snapshot_statentry(HTAB **dest, const char *hashname,
+                   dshash_table *dshash, dshash_table_handle dsh_handle,
+                   const dshash_parameters *dsh_params, Oid key)
+{
+    char *lentry = NULL;
+    size_t keysize = dsh_params->key_size;
+    size_t entrysize = dsh_params->entry_size;
+
+    if (dest)
+    {
+        /* caches the result entry */
+        bool found;
+        bool *negative;
+
+        /*
+         * Create new hash with arbitrary initial entries since we don't know
+         * how this hash will grow.
+         */
+        if (!*dest)
+        {
+            /* make room for negative flag at the end of entry */
+            *dest = create_local_stats_hash(hashname, keysize,
+                                            entrysize + sizeof(bool), 32);
+        }
+
+        lentry = hash_search(*dest, &key, HASH_ENTER, &found);
+
+        /* negative flag is placed at the end of the entry */
+        negative = (bool *) (lentry + entrysize);
+
+        if (!found)
+        {
+            /* not found in local cache, search shared hash */
+
+            dshash_table *t = dshash;
+            void *sentry;
+
+            /* attach shared hash if not given */
+            if (!t)
+                t = dshash_attach(area, dsh_params, dsh_handle, NULL);
+
+            sentry = dshash_find(t, &key, false);
+
+            /*
+             * We expect that the stats for the specified key exist in most
+             * cases.
+             */
+
+            if (sentry)
+            {
+                memcpy(lentry, sentry, entrysize);
+                dshash_release_lock(t, sentry);
+            }
+
+            *negative = !sentry;
+
+            /* Release it if we attached it here */
+            if (!dshash)
+                dshash_detach(t);
+
+            if (!sentry)
+                return NULL;
+        }
+
+        if (*negative)
+            lentry = NULL;
+    }
+    else
+    {
+        /*
+         * The caller doesn't want caching. Just make a copy of the entry and
+         * return it.
+         */
+        dshash_table *t = dshash;
+        void *sentry;
+
+        if (!t)
+            t = dshash_attach(area, dsh_params, dsh_handle, NULL);
+
+        sentry = dshash_find(t, &key, false);
+        if (sentry)
+        {
+            lentry = palloc(entrysize);
+            memcpy(lentry, sentry, entrysize);
+            dshash_release_lock(t, sentry);
+        }
+
+        if (!dshash)
+            dshash_detach(t);
+    }
+
+    return (void *) lentry;
+}
+
+/*
+ * backend_snapshot_global_stats() -
+ *
+ * Makes a local copy of global stats if not already done.  The copy is kept
+ * until pgstat_clear_snapshot() is called or the current memory context
+ * (typically TopTransactionContext) ends.  Returns false if the shared
+ * stats have not been created yet.
+ */
+static bool
+backend_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext = CurrentMemoryContext;
+    TimestampTz update_time = 0;
+
+
+    /*
+     * This is the first call in a transaction. If we find the shared stats
+     * updated, throw away the cache.
+     */
+    if (IsTransactionState() && first_in_xact)
+    {
+        first_in_xact = false;
+        LWLockAcquire(StatsLock, LW_SHARED);
+        update_time = StatsShmem->last_update;
+        LWLockRelease(StatsLock);
+
+        if (backend_cache_expire < update_time)
+        {
+            pgstat_clear_snapshot();
+
+            /*
+             * Shared stats are updated frequently when many backends are
+             * running, but we don't want the cached stats to expire that
+             * often. Keep them for at least the minimum stats update
+             * interval of a backend.
+             */
+            backend_cache_expire =
+                update_time + PGSTAT_STAT_MIN_INTERVAL * USECS_PER_SEC / 1000;
+        }
+    }
+
+    /* Nothing to do if already done */
+    if (snapshot_globalStats)
+        return true;
+
+    Assert(snapshot_archiverStats == NULL);
+
+    /*
+     * The snapshot lives within the current top transaction if any, or the
+     * current memory context lifetime otherwise.
+     */
+    if (IsTransactionState())
+        oldcontext = MemoryContextSwitchTo(pgStatLocalContext);
+
+    /* global stats can simply be copied */
+    LWLockAcquire(StatsLock, LW_SHARED);
+    snapshot_globalStats = palloc(sizeof(PgStat_GlobalStats));
+    memcpy(snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+
+    snapshot_archiverStats = palloc(sizeof(PgStat_ArchiverStats));
+    memcpy(snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+    LWLockRelease(StatsLock);
+
+    /* set the timestamp of this snapshot */
+    snapshot_globalStats->stats_timestamp = update_time;
+
+    MemoryContextSwitchTo(oldcontext);
+
+    return true;
+}
+
+/* ----------
+ * pgstat_fetch_stat_dbentry() -
+ *
+ *    Find the database stats entry on backends. Returned entries are cached
+ *    until transaction end. If oneshot is true, they are not cached and are
+ *    returned in palloc'ed memory in the caller's context.
+ */
+PgStat_StatDBEntry *
+pgstat_fetch_stat_dbentry(Oid dbid, bool oneshot)
+{
+    /* take a local snapshot if we don't have one */
+    char *hashname = "local database stats hash";
+    PgStat_StatDBEntry *dbentry;
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    /* If not done for this transaction, take a snapshot of global stats */
+    if (!backend_snapshot_global_stats())
+        return NULL;
+
+    dbentry = snapshot_statentry(oneshot ? NULL : &snapshot_db_stats,
+                                 hashname, db_stats, 0, &dsh_dbparams,
+                                 dbid);
+
+    return dbentry;
+}
+
+/* ----------
+ * backend_get_tab_entry() -
+ *
+ *    Find the table stats entry on backends. Returned entries are cached until
+ *    transaction end. If oneshot is true, they are not cached and are returned
+ *    in palloc'ed memory in the caller's context.
+ */
+PgStat_StatTabEntry *
+backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid reloid, bool oneshot)
+{
+    /* take a local snapshot if we don't have one */
+    char *hashname = "local table stats hash";
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    return snapshot_statentry(oneshot ? NULL : &dbent->snapshot_tables,
+                              hashname, NULL, dbent->tables, &dsh_tblparams,
+                              reloid);
+}
+
+/* ----------
+ * backend_get_func_entry() -
+ *
+ *    Find the function stats entry on backends. Returned entries are cached
+ *    until transaction end. If oneshot is true, they are not cached and are
+ *    returned in palloc'ed memory in the caller's context.
+ */
+static PgStat_StatFuncEntry *
+backend_get_func_entry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot)
+{
+    char *hashname = "local function stats hash";
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    if (dbent->functions == DSM_HANDLE_INVALID)
+        return NULL;
+
+    return snapshot_statentry(oneshot ? NULL : &dbent->snapshot_tables,
+                              hashname, NULL, dbent->functions, &dsh_funcparams,
+                              funcid);
+}
+
+static bool
+pgstat_update_tabentry(dshash_table *tabhash, PgStat_TableStatus *stat,
+                       bool nowait)
+{
+    PgStat_StatTabEntry *tabentry;
+    bool    found;
+
+    if (tabhash == NULL)
+        return false;
+
+    tabentry = (PgStat_StatTabEntry *)
+        dshash_find_or_insert_extended(tabhash, (void *) &(stat->t_id),
+                                       &found, nowait);
+
+    /* failed to acquire lock */
+    if (tabentry == NULL)
+        return false;
+
+    if (!found)
+    {
+        /*
+         * If it's a new table entry, initialize counters to the values we
+         * just got.
+         */
+        tabentry->numscans = stat->t_counts.t_numscans;
+        tabentry->tuples_returned = stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched = stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted = stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated = stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted = stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated = stat->t_counts.t_tuples_hot_updated;
+        tabentry->n_live_tuples = stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples = stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze = stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched = stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit = stat->t_counts.t_blocks_hit;
+
+        tabentry->vacuum_timestamp = 0;
+        tabentry->vacuum_count = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_count = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->analyze_count = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+        tabentry->autovac_analyze_count = 0;
+    }
+    else
+    {
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        tabentry->numscans += stat->t_counts.t_numscans;
+        tabentry->tuples_returned += stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched += stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted += stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated += stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted += stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated += stat->t_counts.t_tuples_hot_updated;
+        /* If table was truncated, first reset the live/dead counters */
+        if (stat->t_counts.t_truncated)
+        {
+            tabentry->n_live_tuples = 0;
+            tabentry->n_dead_tuples = 0;
+        }
+        tabentry->n_live_tuples += stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples += stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze += stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched += stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit += stat->t_counts.t_blocks_hit;
+    }
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
+
+    dshash_release_lock(tabhash, tabentry);
+
+    return true;
+}
+
+static void
+pgstat_update_dbentry(PgStat_StatDBEntry *dbentry, PgStat_TableStatus *stat)
+{
+    /*
+     * Add per-table stats to the per-database entry, too.
+     */
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
+    dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
+    dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
+    dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
+    dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
+    dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
+    dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
+    LWLockRelease(&dbentry->lock);
+}
+
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ *    Create pgStatLocalContext, if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+    if (!pgStatLocalContext)
+        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
+                                                   "Activity statistics snapshot",
+                                                   ALLOCSET_SMALL_SIZES);
+}
+
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 273e2f385f..4edd980ffc 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -34,6 +34,7 @@
 #include <unistd.h>
 
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/storage.h"
 #include "executor/instrument.h"
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index c2c445dbf4..0bb2132c71 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -41,9 +41,9 @@
 
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "executor/instrument.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/buffile.h"
 #include "storage/buf_internals.h"
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index 1f766d20d1..a0401ee494 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -22,10 +22,10 @@
 #include <unistd.h>
 #include <sys/stat.h>
 
+#include "bestatus.h"
 #include "storage/copydir.h"
 #include "storage/fd.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 
 /*
  * copydir: copy a directory
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 213de7698a..6bc5fd6089 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -82,6 +82,7 @@
 #include "miscadmin.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/pg_tablespace.h"
 #include "common/file_perm.h"
 #include "pgstat.h"
diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index 23ccc59f13..ceb4775b9f 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -197,6 +197,15 @@ dsm_postmaster_startup(PGShmemHeader *shim)
     dsm_control->maxitems = maxitems;
 }
 
+/*
+ * clear dsm_init state on child start.
+ */
+void
+dsm_child_init(void)
+{
+    dsm_init_done = false;
+}
+
 /*
  * Determine whether the control segment from the previous postmaster
  * invocation still exists.  If so, remove the dynamic shared memory
@@ -423,6 +432,15 @@ dsm_set_control_handle(dsm_handle h)
 }
 #endif
 
+/*
+ * Return whether the DSM feature is available in this process.
+ */
+bool
+dsm_is_available(void)
+{
+    return dsm_control != NULL;
+}
+
 /*
  * Create a new dynamic shared memory segment.
  *
@@ -440,8 +458,7 @@ dsm_create(Size size, int flags)
     uint32        i;
     uint32        nitems;
 
-    /* Unsafe in postmaster (and pointless in a stand-alone backend). */
-    Assert(IsUnderPostmaster);
+    Assert(dsm_is_available());
 
     if (!dsm_init_done)
         dsm_backend_startup();
@@ -537,8 +554,7 @@ dsm_attach(dsm_handle h)
     uint32        i;
     uint32        nitems;
 
-    /* Unsafe in postmaster (and pointless in a stand-alone backend). */
-    Assert(IsUnderPostmaster);
+    Assert(dsm_is_available());
 
     if (!dsm_init_done)
         dsm_backend_startup();
diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
index aeda32c9c5..e84275d4c2 100644
--- a/src/backend/storage/ipc/dsm_impl.c
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -61,8 +61,8 @@
 #ifdef HAVE_SYS_SHM_H
 #include <sys/shm.h>
 #endif
+#include "bestatus.h"
 #include "common/file_perm.h"
-#include "pgstat.h"
 
 #include "portability/mem.h"
 #include "storage/dsm_impl.h"
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 5965d3620f..97bca9be24 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -150,6 +150,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -281,8 +282,13 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 
     /* Initialize dynamic shared memory facilities. */
     if (!IsUnderPostmaster)
+    {
         dsm_postmaster_startup(shim);
 
+        /* Stats collector uses dynamic shared memory */
+        StatsShmemInit();
+    }
+
     /*
      * Now give loadable modules a chance to set up their shmem allocations
      */
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 7da337d11f..97526f1c72 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -43,8 +43,8 @@
 #include <poll.h>
 #endif
 
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "port/atomics.h"
 #include "portability/instr_time.h"
 #include "postmaster/postmaster.h"
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index cf93357997..e893984383 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -51,9 +51,9 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/spin.h"
diff --git a/src/backend/storage/ipc/shm_mq.c b/src/backend/storage/ipc/shm_mq.c
index 6e471c3e43..cfa5c9089f 100644
--- a/src/backend/storage/ipc/shm_mq.c
+++ b/src/backend/storage/ipc/shm_mq.c
@@ -18,8 +18,8 @@
 
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/bgworker.h"
 #include "storage/procsignal.h"
 #include "storage/shm_mq.h"
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 4d10e57a80..243da57c49 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -21,8 +21,8 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index 74eb449060..dd76088a29 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -25,6 +25,7 @@
  */
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "pgstat.h"
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 81dac45ae5..2cd4d5531e 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -76,8 +76,8 @@
  */
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "pg_trace.h"
 #include "postmaster/postmaster.h"
 #include "replication/slot.h"
@@ -521,6 +521,9 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_TBM, "tbm");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
+    LWLockRegisterTranche(LWTRANCHE_STATS_DSA, "stats table dsa");
+    LWLockRegisterTranche(LWTRANCHE_STATS_DB, "db stats");
+    LWLockRegisterTranche(LWTRANCHE_STATS_FUNC_TABLE, "table/func stats");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index db47843229..97eccb35d3 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -49,3 +49,4 @@ MultiXactTruncationLock                41
 OldSnapshotTimeMapLock                42
 LogicalRepWorkerLock                43
 CLogTruncationLock                    44
+StatsLock                            45
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 6fc11f26f0..a8efa7cc5f 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -194,8 +194,8 @@
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
 #include "storage/predicate_internals.h"
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 0da5b19719..a60fd02894 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -38,8 +38,8 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "replication/slot.h"
 #include "replication/syncrep.h"
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 2aba2dfe91..9e9995ae50 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -28,7 +28,7 @@
 #include "miscadmin.h"
 #include "access/xlogutils.h"
 #include "access/xlog.h"
-#include "pgstat.h"
+#include "bestatus.h"
 #include "portability/instr_time.h"
 #include "postmaster/bgwriter.h"
 #include "storage/fd.h"
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8b4d94c9a1..8fb81d2fc1 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -39,6 +39,7 @@
 #include "access/parallel.h"
 #include "access/printtup.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
@@ -3159,6 +3160,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_update_stat(true);
+    }
 }
 
 
@@ -3733,6 +3740,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4173,9 +4181,17 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
-                ProcessCompletedNotifies();
-                pgstat_report_stat(false);
+                long stats_timeout;
 
+                ProcessCompletedNotifies();
+
+                stats_timeout = pgstat_update_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4210,7 +4226,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4218,6 +4234,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index d330a88e3c..c0975a8259 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -21,6 +21,7 @@
 
 #include "access/sysattr.h"
 #include "access/table.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/pg_tablespace.h"
 #include "catalog/pg_type.h"
@@ -29,7 +30,6 @@
 #include "common/keywords.h"
 #include "funcapi.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "parser/scansup.h"
 #include "postmaster/syslogger.h"
 #include "rewrite/rewriteHandler.h"
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index b6ba856ebe..667e8e5560 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
@@ -33,7 +34,7 @@
 #define UINT32_ACCESS_ONCE(var)         ((uint32)(*((volatile uint32 *)&(var))))
 
 /* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
+extern PgStat_BgWriter bgwriterStats;
 
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
@@ -1193,7 +1194,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_xact_commit);
@@ -1209,7 +1210,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_xact_rollback);
@@ -1225,7 +1226,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_blocks_fetched);
@@ -1241,7 +1242,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_blocks_hit);
@@ -1257,7 +1258,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_returned);
@@ -1273,7 +1274,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_fetched);
@@ -1289,7 +1290,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_inserted);
@@ -1305,7 +1306,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_updated);
@@ -1321,7 +1322,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_deleted);
@@ -1336,7 +1337,7 @@ pg_stat_get_db_stat_reset_time(PG_FUNCTION_ARGS)
     TimestampTz result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = dbentry->stat_reset_timestamp;
@@ -1354,7 +1355,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = dbentry->n_temp_files;
@@ -1370,7 +1371,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = dbentry->n_temp_bytes;
@@ -1385,7 +1386,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_tablespace);
@@ -1400,7 +1401,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_lock);
@@ -1415,7 +1416,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_snapshot);
@@ -1430,7 +1431,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_bufferpin);
@@ -1445,7 +1446,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_startup_deadlock);
@@ -1460,7 +1461,7 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (
@@ -1480,7 +1481,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_deadlocks);
@@ -1496,7 +1497,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     PgStat_StatDBEntry *dbentry;
 
     /* convert counter from microsec to millisec for display */
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = ((double) dbentry->n_block_read_time) / 1000.0;
@@ -1512,7 +1513,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     PgStat_StatDBEntry *dbentry;
 
     /* convert counter from microsec to millisec for display */
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = ((double) dbentry->n_block_write_time) / 1000.0;
@@ -1867,6 +1868,9 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     /* Get statistics about the archiver process */
     archiver_stats = pgstat_fetch_stat_archiver();
 
+    if (archiver_stats == NULL)
+        PG_RETURN_NULL();
+
     /* Fill values and NULLs */
     values[0] = Int64GetDatum(archiver_stats->archived_count);
     if (*(archiver_stats->last_archived_wal) == '\0')
@@ -1896,6 +1900,5 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
         values[6] = TimestampTzGetDatum(archiver_stats->stat_reset_timestamp);
 
     /* Returns the record as Datum */
-    PG_RETURN_DATUM(HeapTupleGetDatum(
-                                      heap_form_tuple(tupdesc, values, nulls)));
+    PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
 }
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 5e61d908fd..2dd99f935d 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -46,11 +46,11 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/pg_tablespace.h"
 #include "catalog/storage.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/lwlock.h"
 #include "utils/inval.h"
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index fd51934aaf..994351ac2d 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index bd2e4e89d8..1eabc0f41d 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -31,12 +31,12 @@
 #endif
 
 #include "access/htup_details.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "common/file_perm.h"
 #include "libpq/libpq.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/postmaster.h"
 #include "storage/fd.h"
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index a5ee209f91..e5dca7fe03 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -26,6 +26,7 @@
 #include "access/sysattr.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/indexing.h"
 #include "catalog/namespace.h"
@@ -72,6 +73,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -628,6 +630,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -685,7 +689,10 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 
     /* Initialize stats collection --- must happen before first xact */
     if (!bootstrap)
+    {
+        pgstat_bearray_initialize();
         pgstat_initialize();
+    }
 
     /*
      * Load relcache entries for the shared system catalogs.  This must create
@@ -1238,6 +1245,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
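IdleStatsUpdateTimeoutHandler above does no work itself: it sets a pending flag, sets InterruptPending, and sets the latch, leaving the real work to a safe point in the main loop. The following is a minimal, self-contained sketch of that flag-and-defer pattern, not code from the patch. All demo_* names are hypothetical, a plain SIGALRM handler stands in for the timeout machinery, and the SetLatch() step is only noted in a comment.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Hypothetical stand-ins for IdleStatsUpdateTimeoutPending / InterruptPending. */
static volatile sig_atomic_t demo_stats_timeout_pending = 0;
static volatile sig_atomic_t demo_interrupt_pending = 0;

/* Counts how often the (pretend) stats flush ran; never touched by the handler. */
static int demo_flush_count = 0;

/* Handler: only sets flags, so it stays async-signal-safe. */
static void
demo_stats_timeout_handler(int signo)
{
    (void) signo;
    demo_stats_timeout_pending = 1;
    demo_interrupt_pending = 1;
    /* The patch additionally calls SetLatch(MyLatch) to wake a sleeping backend. */
}

/* Called from the main loop at a safe point; returns true if a flush ran. */
static bool
demo_process_interrupts(void)
{
    if (!demo_interrupt_pending)
        return false;
    demo_interrupt_pending = 0;

    if (demo_stats_timeout_pending)
    {
        demo_stats_timeout_pending = 0;
        demo_flush_count++;         /* stand-in for flushing pending stats */
        return true;
    }
    return false;
}

/* Fires the "timeout" once and returns how many flushes resulted. */
static int
demo_run_once(void)
{
    int         before = demo_flush_count;

    signal(SIGALRM, demo_stats_timeout_handler);
    assert(!demo_process_interrupts());     /* nothing pending yet */
    raise(SIGALRM);                         /* handler runs before raise() returns */
    demo_process_interrupts();              /* deferred work happens here */
    demo_process_interrupts();              /* flags already cleared: no-op */
    return demo_flush_count - before;
}
```

Calling demo_run_once() performs exactly one deferred flush per simulated timeout, however many times the interrupt check runs afterwards.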
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 156d147c85..62a07727d0 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -33,6 +33,7 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
 #include "commands/async.h"
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 33869fecc9..8939758c59 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/bestatus.h b/src/include/bestatus.h
new file mode 100644
index 0000000000..b7f6a93130
--- /dev/null
+++ b/src/include/bestatus.h
@@ -0,0 +1,555 @@
+/* ----------
+ *    bestatus.h
+ *
+ *    Definitions for the PostgreSQL backend status monitor facility
+ *
+ *    Copyright (c) 2001-2018, PostgreSQL Global Development Group
+ *
+ *    src/include/bestatus.h
+ * ----------
+ */
+#ifndef BESTATUS_H
+#define BESTATUS_H
+
+#include "datatype/timestamp.h"
+#include "libpq/pqcomm.h"
+#include "storage/proc.h"
+
+/* ----------
+ * Backend types
+ * ----------
+ */
+typedef enum BackendType
+{
+    B_AUTOVAC_LAUNCHER,
+    B_AUTOVAC_WORKER,
+    B_BACKEND,
+    B_BG_WORKER,
+    B_BG_WRITER,
+    B_CHECKPOINTER,
+    B_STARTUP,
+    B_WAL_RECEIVER,
+    B_WAL_SENDER,
+    B_WAL_WRITER,
+    B_ARCHIVER
+} BackendType;
+
+
+/* ----------
+ * Backend states
+ * ----------
+ */
+typedef enum BackendState
+{
+    STATE_UNDEFINED,
+    STATE_IDLE,
+    STATE_RUNNING,
+    STATE_IDLEINTRANSACTION,
+    STATE_FASTPATH,
+    STATE_IDLEINTRANSACTION_ABORTED,
+    STATE_DISABLED
+} BackendState;
+
+
+/* ----------
+ * Wait Classes
+ * ----------
+ */
+#define PG_WAIT_LWLOCK                0x01000000U
+#define PG_WAIT_LOCK                0x03000000U
+#define PG_WAIT_BUFFER_PIN            0x04000000U
+#define PG_WAIT_ACTIVITY            0x05000000U
+#define PG_WAIT_CLIENT                0x06000000U
+#define PG_WAIT_EXTENSION            0x07000000U
+#define PG_WAIT_IPC                    0x08000000U
+#define PG_WAIT_TIMEOUT                0x09000000U
+#define PG_WAIT_IO                    0x0A000000U
+
+/* ----------
+ * Wait Events - Activity
+ *
+ * Use this category when a process is waiting because it has no work to do,
+ * unless the "Client" or "Timeout" category describes the situation better.
+ * Typically, this should only be used for background processes.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_ARCHIVER_MAIN = PG_WAIT_ACTIVITY,
+    WAIT_EVENT_AUTOVACUUM_MAIN,
+    WAIT_EVENT_BGWRITER_HIBERNATE,
+    WAIT_EVENT_BGWRITER_MAIN,
+    WAIT_EVENT_CHECKPOINTER_MAIN,
+    WAIT_EVENT_LOGICAL_APPLY_MAIN,
+    WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
+    WAIT_EVENT_RECOVERY_WAL_ALL,
+    WAIT_EVENT_RECOVERY_WAL_STREAM,
+    WAIT_EVENT_SYSLOGGER_MAIN,
+    WAIT_EVENT_WAL_RECEIVER_MAIN,
+    WAIT_EVENT_WAL_SENDER_MAIN,
+    WAIT_EVENT_WAL_WRITER_MAIN
+} WaitEventActivity;
+
+/* ----------
+ * Wait Events - Client
+ *
+ * Use this category when a process is waiting to send data to or receive data
+ * from the frontend process to which it is connected.  This is never used for
+ * a background process, which has no client connection.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_CLIENT_READ = PG_WAIT_CLIENT,
+    WAIT_EVENT_CLIENT_WRITE,
+    WAIT_EVENT_LIBPQWALRECEIVER_CONNECT,
+    WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE,
+    WAIT_EVENT_SSL_OPEN_SERVER,
+    WAIT_EVENT_WAL_RECEIVER_WAIT_START,
+    WAIT_EVENT_WAL_SENDER_WAIT_WAL,
+    WAIT_EVENT_WAL_SENDER_WRITE_DATA
+} WaitEventClient;
+
+/* ----------
+ * Wait Events - IPC
+ *
+ * Use this category when a process cannot complete the work it is doing because
+ * it is waiting for a notification from another process.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_BGWORKER_SHUTDOWN = PG_WAIT_IPC,
+    WAIT_EVENT_BGWORKER_STARTUP,
+    WAIT_EVENT_BTREE_PAGE,
+    WAIT_EVENT_CLOG_GROUP_UPDATE,
+    WAIT_EVENT_EXECUTE_GATHER,
+    WAIT_EVENT_HASH_BATCH_ALLOCATING,
+    WAIT_EVENT_HASH_BATCH_ELECTING,
+    WAIT_EVENT_HASH_BATCH_LOADING,
+    WAIT_EVENT_HASH_BUILD_ALLOCATING,
+    WAIT_EVENT_HASH_BUILD_ELECTING,
+    WAIT_EVENT_HASH_BUILD_HASHING_INNER,
+    WAIT_EVENT_HASH_BUILD_HASHING_OUTER,
+    WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING,
+    WAIT_EVENT_HASH_GROW_BATCHES_DECIDING,
+    WAIT_EVENT_HASH_GROW_BATCHES_ELECTING,
+    WAIT_EVENT_HASH_GROW_BATCHES_FINISHING,
+    WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING,
+    WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING,
+    WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING,
+    WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING,
+    WAIT_EVENT_LOGICAL_SYNC_DATA,
+    WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
+    WAIT_EVENT_MQ_INTERNAL,
+    WAIT_EVENT_MQ_PUT_MESSAGE,
+    WAIT_EVENT_MQ_RECEIVE,
+    WAIT_EVENT_MQ_SEND,
+    WAIT_EVENT_PARALLEL_BITMAP_SCAN,
+    WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN,
+    WAIT_EVENT_PARALLEL_FINISH,
+    WAIT_EVENT_PROCARRAY_GROUP_UPDATE,
+    WAIT_EVENT_PROMOTE,
+    WAIT_EVENT_REPLICATION_ORIGIN_DROP,
+    WAIT_EVENT_REPLICATION_SLOT_DROP,
+    WAIT_EVENT_SAFE_SNAPSHOT,
+    WAIT_EVENT_SYNC_REP
+} WaitEventIPC;
+
+/* ----------
+ * Wait Events - Timeout
+ *
+ * Use this category when a process is waiting for a timeout to expire.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
+    WAIT_EVENT_PG_SLEEP,
+    WAIT_EVENT_RECOVERY_APPLY_DELAY
+} WaitEventTimeout;
+
+/* ----------
+ * Wait Events - IO
+ *
+ * Use this category when a process is waiting for I/O to complete.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_BUFFILE_READ = PG_WAIT_IO,
+    WAIT_EVENT_BUFFILE_WRITE,
+    WAIT_EVENT_CONTROL_FILE_READ,
+    WAIT_EVENT_CONTROL_FILE_SYNC,
+    WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
+    WAIT_EVENT_CONTROL_FILE_WRITE,
+    WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE,
+    WAIT_EVENT_COPY_FILE_READ,
+    WAIT_EVENT_COPY_FILE_WRITE,
+    WAIT_EVENT_DATA_FILE_EXTEND,
+    WAIT_EVENT_DATA_FILE_FLUSH,
+    WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC,
+    WAIT_EVENT_DATA_FILE_PREFETCH,
+    WAIT_EVENT_DATA_FILE_READ,
+    WAIT_EVENT_DATA_FILE_SYNC,
+    WAIT_EVENT_DATA_FILE_TRUNCATE,
+    WAIT_EVENT_DATA_FILE_WRITE,
+    WAIT_EVENT_DSM_FILL_ZERO_WRITE,
+    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ,
+    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC,
+    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE,
+    WAIT_EVENT_LOCK_FILE_CREATE_READ,
+    WAIT_EVENT_LOCK_FILE_CREATE_SYNC,
+    WAIT_EVENT_LOCK_FILE_CREATE_WRITE,
+    WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ,
+    WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC,
+    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC,
+    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE,
+    WAIT_EVENT_LOGICAL_REWRITE_SYNC,
+    WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE,
+    WAIT_EVENT_LOGICAL_REWRITE_WRITE,
+    WAIT_EVENT_RELATION_MAP_READ,
+    WAIT_EVENT_RELATION_MAP_SYNC,
+    WAIT_EVENT_RELATION_MAP_WRITE,
+    WAIT_EVENT_REORDER_BUFFER_READ,
+    WAIT_EVENT_REORDER_BUFFER_WRITE,
+    WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ,
+    WAIT_EVENT_REPLICATION_SLOT_READ,
+    WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
+    WAIT_EVENT_REPLICATION_SLOT_SYNC,
+    WAIT_EVENT_REPLICATION_SLOT_WRITE,
+    WAIT_EVENT_SLRU_FLUSH_SYNC,
+    WAIT_EVENT_SLRU_READ,
+    WAIT_EVENT_SLRU_SYNC,
+    WAIT_EVENT_SLRU_WRITE,
+    WAIT_EVENT_SNAPBUILD_READ,
+    WAIT_EVENT_SNAPBUILD_SYNC,
+    WAIT_EVENT_SNAPBUILD_WRITE,
+    WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC,
+    WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE,
+    WAIT_EVENT_TIMELINE_HISTORY_READ,
+    WAIT_EVENT_TIMELINE_HISTORY_SYNC,
+    WAIT_EVENT_TIMELINE_HISTORY_WRITE,
+    WAIT_EVENT_TWOPHASE_FILE_READ,
+    WAIT_EVENT_TWOPHASE_FILE_SYNC,
+    WAIT_EVENT_TWOPHASE_FILE_WRITE,
+    WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ,
+    WAIT_EVENT_WAL_BOOTSTRAP_SYNC,
+    WAIT_EVENT_WAL_BOOTSTRAP_WRITE,
+    WAIT_EVENT_WAL_COPY_READ,
+    WAIT_EVENT_WAL_COPY_SYNC,
+    WAIT_EVENT_WAL_COPY_WRITE,
+    WAIT_EVENT_WAL_INIT_SYNC,
+    WAIT_EVENT_WAL_INIT_WRITE,
+    WAIT_EVENT_WAL_READ,
+    WAIT_EVENT_WAL_SYNC,
+    WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
+    WAIT_EVENT_WAL_WRITE
+} WaitEventIO;
+
+/* ----------
+ * Command type for progress reporting purposes
+ * ----------
+ */
+typedef enum ProgressCommandType
+{
+    PROGRESS_COMMAND_INVALID,
+    PROGRESS_COMMAND_VACUUM
+} ProgressCommandType;
+
+#define PGSTAT_NUM_PROGRESS_PARAM    10
+
+/* ----------
+ * Shared-memory data structures
+ * ----------
+ */
+
+
+/*
+ * PgBackendSSLStatus
+ *
+ * For each backend, we keep the SSL status in a separate struct, that
+ * is only filled in if SSL is enabled.
+ *
+ * All char arrays must be null-terminated.
+ */
+typedef struct PgBackendSSLStatus
+{
+    /* Information about SSL connection */
+    int            ssl_bits;
+    bool        ssl_compression;
+    char        ssl_version[NAMEDATALEN];
+    char        ssl_cipher[NAMEDATALEN];
+    char        ssl_client_dn[NAMEDATALEN];
+
+    /*
+     * serial number is max "20 octets" per RFC 5280, so this size should be
+     * fine
+     */
+    char        ssl_client_serial[NAMEDATALEN];
+
+    char        ssl_issuer_dn[NAMEDATALEN];
+} PgBackendSSLStatus;
+
+
+/* ----------
+ * PgBackendStatus
+ *
+ * Each live backend maintains a PgBackendStatus struct in shared memory
+ * showing its current activity.  (The structs are allocated according to
+ * BackendId, but that is not critical.)  Note that the collector process
+ * has no involvement in, or even access to, these structs.
+ *
+ * Each auxiliary process also maintains a PgBackendStatus struct in shared
+ * memory.
+ * ----------
+ */
+typedef struct PgBackendStatus
+{
+    /*
+     * To avoid locking overhead, we use the following protocol: a backend
+     * increments st_changecount before modifying its entry, and again after
+     * finishing a modification.  A would-be reader should note the value of
+     * st_changecount, copy the entry into private memory, then check
+     * st_changecount again.  If the value hasn't changed, and if it's even,
+     * the copy is valid; otherwise start over.  This makes updates cheap
+     * while reads are potentially expensive, but that's the tradeoff we want.
+     *
+     * The above protocol needs memory barriers to ensure that the apparent
+     * order of execution matches the program order. Otherwise, on a machine
+     * with weak memory ordering, the CPU might reorder the stores so that
+     * st_changecount appears to be incremented twice before the
+     * modification, and a reader could accept a half-updated entry.
+     */
+    int            st_changecount;
+
+    /* The entry is valid iff st_procpid > 0, unused if st_procpid == 0 */
+    int            st_procpid;
+
+    /* Type of backend */
+    BackendType st_backendType;
+
+    /* Times when current backend, transaction, and activity started */
+    TimestampTz st_proc_start_timestamp;
+    TimestampTz st_xact_start_timestamp;
+    TimestampTz st_activity_start_timestamp;
+    TimestampTz st_state_start_timestamp;
+
+    /* Database OID, owning user's OID, connection client address */
+    Oid            st_databaseid;
+    Oid            st_userid;
+    SockAddr    st_clientaddr;
+    char       *st_clienthostname;    /* MUST be null-terminated */
+
+    /* Information about SSL connection */
+    bool        st_ssl;
+    PgBackendSSLStatus *st_sslstatus;
+
+    /* current state */
+    BackendState st_state;
+
+    /* application name; MUST be null-terminated */
+    char       *st_appname;
+
+    /*
+     * Current command string; MUST be null-terminated. Note that this string
+     * may be truncated in the middle of a multi-byte character. As activity
+     * strings are stored more frequently than they are read, this moves the
+     * cost of correct truncation to the display side. Use
+     * pgstat_clip_activity() to truncate correctly.
+     */
+    char       *st_activity_raw;
+
+    /*
+     * Command progress reporting.  Any command which wishes can advertise
+     * that it is running by setting st_progress_command,
+     * st_progress_command_target, and st_progress_param[].
+     * st_progress_command_target should be the OID of the relation which the
+     * command targets (we assume there's just one, as this is meant for
+     * utility commands), but the meaning of each element in the
+     * st_progress_param array is command-specific.
+     */
+    ProgressCommandType st_progress_command;
+    Oid            st_progress_command_target;
+    int64        st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
+} PgBackendStatus;
+
+/*
+ * Macros to load and store st_changecount with the memory barriers.
+ *
+ * pgstat_increment_changecount_before() and
+ * pgstat_increment_changecount_after() need to be called before and after
+ * PgBackendStatus entries are modified, respectively. This makes sure that
+ * st_changecount is incremented around the modification.
+ *
+ * Also pgstat_save_changecount_before() and pgstat_save_changecount_after()
+ * need to be called before and after PgBackendStatus entries are copied into
+ * private memory, respectively.
+ */
+#define pgstat_increment_changecount_before(beentry)    \
+    do {    \
+        beentry->st_changecount++;    \
+        pg_write_barrier(); \
+    } while (0)
+
+#define pgstat_increment_changecount_after(beentry) \
+    do {    \
+        pg_write_barrier(); \
+        beentry->st_changecount++;    \
+        Assert((beentry->st_changecount & 1) == 0); \
+    } while (0)
+
+#define pgstat_save_changecount_before(beentry, save_changecount)    \
+    do {    \
+        save_changecount = beentry->st_changecount; \
+        pg_read_barrier();    \
+    } while (0)
+
+#define pgstat_save_changecount_after(beentry, save_changecount)    \
+    do {    \
+        pg_read_barrier();    \
+        save_changecount = beentry->st_changecount; \
+    } while (0)
+
+/* ----------
+ * LocalPgBackendStatus
+ *
+ * When we build the backend status array, we use LocalPgBackendStatus to be
+ * able to add new values to the struct when needed without adding new fields
+ * to the shared memory. It contains the backend status as a first member.
+ * ----------
+ */
+typedef struct LocalPgBackendStatus
+{
+    /*
+     * Local version of the backend status entry.
+     */
+    PgBackendStatus backendStatus;
+
+    /*
+     * The xid of the current transaction if available, InvalidTransactionId
+     * if not.
+     */
+    TransactionId backend_xid;
+
+    /*
+     * The xmin of the current session if available, InvalidTransactionId if
+     * not.
+     */
+    TransactionId backend_xmin;
+} LocalPgBackendStatus;
+
+/* ----------
+ * GUC parameters
+ * ----------
+ */
+extern bool pgstat_track_activities;
+extern PGDLLIMPORT int pgstat_track_activity_query_size;
+
+/* ----------
+ * Functions called from backends
+ * ----------
+ */
+extern void pgstat_bearray_initialize(void);
+extern void pgstat_bestart(void);
+
+extern const char *pgstat_get_wait_event(uint32 wait_event_info);
+extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
+extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
+extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
+                                    int buflen);
+extern const char *pgstat_get_backend_desc(BackendType backendType);
+
+extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
+                              Oid relid);
+extern void pgstat_progress_update_param(int index, int64 val);
+extern void pgstat_progress_update_multi_param(int nparam, const int *index,
+                                   const int64 *val);
+extern void pgstat_progress_end_command(void);
+
+extern char *pgstat_clip_activity(const char *raw_activity);
+
+extern void AtEOXact_BEStatus(bool isCommit);
+extern void AtPrepare_BEStatus(void);
+/* ----------
+ * pgstat_report_wait_start() -
+ *
+ *    Called from places where a server process needs to wait, to report
+ *    wait event information.  The wait information is stored as 4 bytes:
+ *    the first byte represents the wait event class (the type of wait; see
+ *    the PG_WAIT_* wait classes above) and the next 3 bytes represent the
+ *    actual wait event.  Currently 2 bytes are used for the wait event,
+ *    which is sufficient for current usage; 1 byte is reserved for future
+ *    usage.
+ *
+ * NB: this *must* be able to survive being called before MyProc has been
+ * initialized.
+ * ----------
+ */
+static inline void
+pgstat_report_wait_start(uint32 wait_event_info)
+{
+    volatile PGPROC *proc = MyProc;
+
+    if (!pgstat_track_activities || !proc)
+        return;
+
+    /*
+     * Since this is a four-byte field which is always read and written as
+     * four-bytes, updates are atomic.
+     */
+    proc->wait_event_info = wait_event_info;
+}
+
+/* ----------
+ * pgstat_report_wait_end() -
+ *
+ *    Called to report end of a wait.
+ *
+ * NB: this *must* be able to survive being called before MyProc has been
+ * initialized.
+ * ----------
+ */
+static inline void
+pgstat_report_wait_end(void)
+{
+    volatile PGPROC *proc = MyProc;
+
+    if (!pgstat_track_activities || !proc)
+        return;
+
+    /*
+     * Since this is a four-byte field which is always read and written as
+     * four-bytes, updates are atomic.
+     */
+    proc->wait_event_info = 0;
+}
+extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
+extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
+extern int    pgstat_fetch_stat_numbackends(void);
+
+/* For shared memory allocation/initialize */
+extern Size BackendStatusShmemSize(void);
+extern void CreateSharedBackendStatus(void);
+
+void pgstat_report_xact_timestamp(TimestampTz tstamp);
+void pgstat_bestat_initialize(void);
+
+extern void pgstat_report_activity(BackendState state, const char *cmd_str);
+extern void pgstat_report_appname(const char *appname);
+extern void pgstat_report_xact_timestamp(TimestampTz tstamp);
+extern const char *pgstat_get_wait_event(uint32 wait_event_info);
+extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
+extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
+extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
+                                    int buflen);
+extern const char *pgstat_get_backend_desc(BackendType backendType);
+
+extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
+                              Oid relid);
+extern void pgstat_progress_update_param(int index, int64 val);
+extern void pgstat_progress_update_multi_param(int nparam, const int *index,
+                                   const int64 *val);
+extern void pgstat_progress_end_command(void);
+
+#endif                            /* BESTATUS_H */
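The st_changecount protocol described in the PgBackendStatus comment above (increment, barrier, modify, barrier, increment on the writer side; save, barrier, copy, barrier, re-check on the reader side) can be sketched in portable C11 as follows. This is an illustration under assumptions, not the patch's code: the DemoEntry and DemoSnapshot types and all demo_* functions are hypothetical, C11 fences stand in for pg_write_barrier() and pg_read_barrier(), and the payload fields are made atomic so that the sketch stays free of data races under the C11 memory model.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, cut-down stand-in for PgBackendStatus. */
typedef struct DemoEntry
{
    atomic_int  changecount;    /* odd while a write is in progress */
    atomic_int  procpid;
    atomic_int  state;
} DemoEntry;

typedef struct DemoSnapshot
{
    int         procpid;
    int         state;
} DemoSnapshot;

/* Analogue of pgstat_increment_changecount_before(). */
static void
demo_begin_write(DemoEntry *e)
{
    int         c = atomic_load_explicit(&e->changecount, memory_order_relaxed);

    atomic_store_explicit(&e->changecount, c + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);  /* write barrier */
}

/* Analogue of pgstat_increment_changecount_after(). */
static void
demo_end_write(DemoEntry *e)
{
    int         c = atomic_load_explicit(&e->changecount, memory_order_relaxed);

    /* The release store orders the payload stores before the increment. */
    atomic_store_explicit(&e->changecount, c + 1, memory_order_release);
    assert(((c + 1) & 1) == 0);
}

static void
demo_write(DemoEntry *e, int pid, int state)
{
    demo_begin_write(e);
    atomic_store_explicit(&e->procpid, pid, memory_order_relaxed);
    atomic_store_explicit(&e->state, state, memory_order_relaxed);
    demo_end_write(e);
}

/* One read attempt; returns true only if the copy is consistent. */
static bool
demo_try_read(DemoEntry *e, DemoSnapshot *out)
{
    int         before;
    int         after;

    before = atomic_load_explicit(&e->changecount, memory_order_acquire);
    out->procpid = atomic_load_explicit(&e->procpid, memory_order_relaxed);
    out->state = atomic_load_explicit(&e->state, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);  /* read barrier */
    after = atomic_load_explicit(&e->changecount, memory_order_relaxed);

    /* Valid iff the counter did not move and no write was in progress. */
    return before == after && (before & 1) == 0;
}

/* Reader loop: start over until a consistent copy is obtained. */
static DemoSnapshot
demo_read(DemoEntry *e)
{
    DemoSnapshot snap;

    while (!demo_try_read(e, &snap))
        ;
    return snap;
}
```

As the original comment notes, this makes updates cheap and reads potentially expensive: the writer never blocks, while a reader that overlaps a write simply retries.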
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 63a7653457..49131a6d5b 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
@@ -403,7 +404,6 @@ typedef enum
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
-
     NUM_AUXPROCTYPES            /* Must be last! */
 } AuxProcType;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 471877d2df..eaa9d11c2a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL statistics collector facility.
  *
  *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
  *
@@ -13,11 +13,10 @@
 
 #include "datatype/timestamp.h"
 #include "fmgr.h"
-#include "libpq/pqcomm.h"
-#include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
-#include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -41,32 +40,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -115,13 +88,6 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_blocks_hit;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -180,236 +146,12 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
 /* ----------
- * PgStat_MsgHdr                The common message header
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgHdr
+typedef struct PgStat_BgWriter
 {
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
     PgStat_Counter m_timed_checkpoints;
     PgStat_Counter m_requested_checkpoints;
     PgStat_Counter m_buf_written_checkpoints;
@@ -420,40 +162,16 @@ typedef struct PgStat_MsgBgWriter
     PgStat_Counter m_buf_alloc;
     PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
     PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+} PgStat_BgWriter;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to apply.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when applying to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -485,81 +203,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgAutovacStart msg_autovacuum;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Statistic collector data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -568,6 +213,12 @@ typedef union PgStat_Msg
 
 #define PGSTAT_FILE_FORMAT_ID    0x01A5BC9D
 
+typedef struct PgStat_DSHash
+{
+    int        refcnt;
+    dshash_table_handle handle;
+} PgStat_DSHash;
+
 /* ----------
  * PgStat_StatDBEntry            The collector's data per database
  * ----------
@@ -600,11 +251,21 @@ typedef struct PgStat_StatDBEntry
     TimestampTz stats_timestamp;    /* time of db stats file update */
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following fields must be last in the struct, because we don't
+     * write them out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    int generation;
+    int    refcnt;
+    dshash_table_handle tables;
+    dshash_table_handle functions;
+    int    prev_refcnt;
+    dshash_table_handle prev_tables;
+    dshash_table_handle prev_functions;
+    LWLock    lock;
+
+    /* not for shared struct */
+    HTAB *snapshot_tables;
+    HTAB *snapshot_functions;
 } PgStat_StatDBEntry;
 
 
@@ -660,7 +321,7 @@ typedef struct PgStat_StatFuncEntry
 
 
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -676,7 +337,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -694,432 +355,6 @@ typedef struct PgStat_GlobalStats
     TimestampTz stat_reset_timestamp;
 } PgStat_GlobalStats;
 
-
-/* ----------
- * Backend types
- * ----------
- */
-typedef enum BackendType
-{
-    B_AUTOVAC_LAUNCHER,
-    B_AUTOVAC_WORKER,
-    B_BACKEND,
-    B_BG_WORKER,
-    B_BG_WRITER,
-    B_ARCHIVER,
-    B_CHECKPOINTER,
-    B_STARTUP,
-    B_WAL_RECEIVER,
-    B_WAL_SENDER,
-    B_WAL_WRITER
-} BackendType;
-
-
-/* ----------
- * Backend states
- * ----------
- */
-typedef enum BackendState
-{
-    STATE_UNDEFINED,
-    STATE_IDLE,
-    STATE_RUNNING,
-    STATE_IDLEINTRANSACTION,
-    STATE_FASTPATH,
-    STATE_IDLEINTRANSACTION_ABORTED,
-    STATE_DISABLED
-} BackendState;
-
-
-/* ----------
- * Wait Classes
- * ----------
- */
-#define PG_WAIT_LWLOCK                0x01000000U
-#define PG_WAIT_LOCK                0x03000000U
-#define PG_WAIT_BUFFER_PIN            0x04000000U
-#define PG_WAIT_ACTIVITY            0x05000000U
-#define PG_WAIT_CLIENT                0x06000000U
-#define PG_WAIT_EXTENSION            0x07000000U
-#define PG_WAIT_IPC                    0x08000000U
-#define PG_WAIT_TIMEOUT                0x09000000U
-#define PG_WAIT_IO                    0x0A000000U
-
-/* ----------
- * Wait Events - Activity
- *
- * Use this category when a process is waiting because it has no work to do,
- * unless the "Client" or "Timeout" category describes the situation better.
- * Typically, this should only be used for background processes.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_ARCHIVER_MAIN = PG_WAIT_ACTIVITY,
-    WAIT_EVENT_AUTOVACUUM_MAIN,
-    WAIT_EVENT_BGWRITER_HIBERNATE,
-    WAIT_EVENT_BGWRITER_MAIN,
-    WAIT_EVENT_CHECKPOINTER_MAIN,
-    WAIT_EVENT_LOGICAL_APPLY_MAIN,
-    WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
-    WAIT_EVENT_RECOVERY_WAL_ALL,
-    WAIT_EVENT_RECOVERY_WAL_STREAM,
-    WAIT_EVENT_SYSLOGGER_MAIN,
-    WAIT_EVENT_WAL_RECEIVER_MAIN,
-    WAIT_EVENT_WAL_SENDER_MAIN,
-    WAIT_EVENT_WAL_WRITER_MAIN
-} WaitEventActivity;
-
-/* ----------
- * Wait Events - Client
- *
- * Use this category when a process is waiting to send data to or receive data
- * from the frontend process to which it is connected.  This is never used for
- * a background process, which has no client connection.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_CLIENT_READ = PG_WAIT_CLIENT,
-    WAIT_EVENT_CLIENT_WRITE,
-    WAIT_EVENT_LIBPQWALRECEIVER_CONNECT,
-    WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE,
-    WAIT_EVENT_SSL_OPEN_SERVER,
-    WAIT_EVENT_WAL_RECEIVER_WAIT_START,
-    WAIT_EVENT_WAL_SENDER_WAIT_WAL,
-    WAIT_EVENT_WAL_SENDER_WRITE_DATA
-} WaitEventClient;
-
-/* ----------
- * Wait Events - IPC
- *
- * Use this category when a process cannot complete the work it is doing because
- * it is waiting for a notification from another process.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_BGWORKER_SHUTDOWN = PG_WAIT_IPC,
-    WAIT_EVENT_BGWORKER_STARTUP,
-    WAIT_EVENT_BTREE_PAGE,
-    WAIT_EVENT_CLOG_GROUP_UPDATE,
-    WAIT_EVENT_EXECUTE_GATHER,
-    WAIT_EVENT_HASH_BATCH_ALLOCATING,
-    WAIT_EVENT_HASH_BATCH_ELECTING,
-    WAIT_EVENT_HASH_BATCH_LOADING,
-    WAIT_EVENT_HASH_BUILD_ALLOCATING,
-    WAIT_EVENT_HASH_BUILD_ELECTING,
-    WAIT_EVENT_HASH_BUILD_HASHING_INNER,
-    WAIT_EVENT_HASH_BUILD_HASHING_OUTER,
-    WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING,
-    WAIT_EVENT_HASH_GROW_BATCHES_DECIDING,
-    WAIT_EVENT_HASH_GROW_BATCHES_ELECTING,
-    WAIT_EVENT_HASH_GROW_BATCHES_FINISHING,
-    WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING,
-    WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING,
-    WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING,
-    WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING,
-    WAIT_EVENT_LOGICAL_SYNC_DATA,
-    WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
-    WAIT_EVENT_MQ_INTERNAL,
-    WAIT_EVENT_MQ_PUT_MESSAGE,
-    WAIT_EVENT_MQ_RECEIVE,
-    WAIT_EVENT_MQ_SEND,
-    WAIT_EVENT_PARALLEL_BITMAP_SCAN,
-    WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN,
-    WAIT_EVENT_PARALLEL_FINISH,
-    WAIT_EVENT_PROCARRAY_GROUP_UPDATE,
-    WAIT_EVENT_PROMOTE,
-    WAIT_EVENT_REPLICATION_ORIGIN_DROP,
-    WAIT_EVENT_REPLICATION_SLOT_DROP,
-    WAIT_EVENT_SAFE_SNAPSHOT,
-    WAIT_EVENT_SYNC_REP
-} WaitEventIPC;
-
-/* ----------
- * Wait Events - Timeout
- *
- * Use this category when a process is waiting for a timeout to expire.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
-    WAIT_EVENT_PG_SLEEP,
-    WAIT_EVENT_RECOVERY_APPLY_DELAY
-} WaitEventTimeout;
-
-/* ----------
- * Wait Events - IO
- *
- * Use this category when a process is waiting for a IO.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_BUFFILE_READ = PG_WAIT_IO,
-    WAIT_EVENT_BUFFILE_WRITE,
-    WAIT_EVENT_CONTROL_FILE_READ,
-    WAIT_EVENT_CONTROL_FILE_SYNC,
-    WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
-    WAIT_EVENT_CONTROL_FILE_WRITE,
-    WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE,
-    WAIT_EVENT_COPY_FILE_READ,
-    WAIT_EVENT_COPY_FILE_WRITE,
-    WAIT_EVENT_DATA_FILE_EXTEND,
-    WAIT_EVENT_DATA_FILE_FLUSH,
-    WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC,
-    WAIT_EVENT_DATA_FILE_PREFETCH,
-    WAIT_EVENT_DATA_FILE_READ,
-    WAIT_EVENT_DATA_FILE_SYNC,
-    WAIT_EVENT_DATA_FILE_TRUNCATE,
-    WAIT_EVENT_DATA_FILE_WRITE,
-    WAIT_EVENT_DSM_FILL_ZERO_WRITE,
-    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ,
-    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC,
-    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE,
-    WAIT_EVENT_LOCK_FILE_CREATE_READ,
-    WAIT_EVENT_LOCK_FILE_CREATE_SYNC,
-    WAIT_EVENT_LOCK_FILE_CREATE_WRITE,
-    WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ,
-    WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC,
-    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC,
-    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE,
-    WAIT_EVENT_LOGICAL_REWRITE_SYNC,
-    WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE,
-    WAIT_EVENT_LOGICAL_REWRITE_WRITE,
-    WAIT_EVENT_RELATION_MAP_READ,
-    WAIT_EVENT_RELATION_MAP_SYNC,
-    WAIT_EVENT_RELATION_MAP_WRITE,
-    WAIT_EVENT_REORDER_BUFFER_READ,
-    WAIT_EVENT_REORDER_BUFFER_WRITE,
-    WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ,
-    WAIT_EVENT_REPLICATION_SLOT_READ,
-    WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
-    WAIT_EVENT_REPLICATION_SLOT_SYNC,
-    WAIT_EVENT_REPLICATION_SLOT_WRITE,
-    WAIT_EVENT_SLRU_FLUSH_SYNC,
-    WAIT_EVENT_SLRU_READ,
-    WAIT_EVENT_SLRU_SYNC,
-    WAIT_EVENT_SLRU_WRITE,
-    WAIT_EVENT_SNAPBUILD_READ,
-    WAIT_EVENT_SNAPBUILD_SYNC,
-    WAIT_EVENT_SNAPBUILD_WRITE,
-    WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC,
-    WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE,
-    WAIT_EVENT_TIMELINE_HISTORY_READ,
-    WAIT_EVENT_TIMELINE_HISTORY_SYNC,
-    WAIT_EVENT_TIMELINE_HISTORY_WRITE,
-    WAIT_EVENT_TWOPHASE_FILE_READ,
-    WAIT_EVENT_TWOPHASE_FILE_SYNC,
-    WAIT_EVENT_TWOPHASE_FILE_WRITE,
-    WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ,
-    WAIT_EVENT_WAL_BOOTSTRAP_SYNC,
-    WAIT_EVENT_WAL_BOOTSTRAP_WRITE,
-    WAIT_EVENT_WAL_COPY_READ,
-    WAIT_EVENT_WAL_COPY_SYNC,
-    WAIT_EVENT_WAL_COPY_WRITE,
-    WAIT_EVENT_WAL_INIT_SYNC,
-    WAIT_EVENT_WAL_INIT_WRITE,
-    WAIT_EVENT_WAL_READ,
-    WAIT_EVENT_WAL_SYNC,
-    WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-    WAIT_EVENT_WAL_WRITE
-} WaitEventIO;
-
-/* ----------
- * Command type for progress reporting purposes
- * ----------
- */
-typedef enum ProgressCommandType
-{
-    PROGRESS_COMMAND_INVALID,
-    PROGRESS_COMMAND_VACUUM
-} ProgressCommandType;
-
-#define PGSTAT_NUM_PROGRESS_PARAM    10
-
-/* ----------
- * Shared-memory data structures
- * ----------
- */
-
-
-/*
- * PgBackendSSLStatus
- *
- * For each backend, we keep the SSL status in a separate struct, that
- * is only filled in if SSL is enabled.
- *
- * All char arrays must be null-terminated.
- */
-typedef struct PgBackendSSLStatus
-{
-    /* Information about SSL connection */
-    int            ssl_bits;
-    bool        ssl_compression;
-    char        ssl_version[NAMEDATALEN];
-    char        ssl_cipher[NAMEDATALEN];
-    char        ssl_client_dn[NAMEDATALEN];
-
-    /*
-     * serial number is max "20 octets" per RFC 5280, so this size should be
-     * fine
-     */
-    char        ssl_client_serial[NAMEDATALEN];
-
-    char        ssl_issuer_dn[NAMEDATALEN];
-} PgBackendSSLStatus;
-
-
-/* ----------
- * PgBackendStatus
- *
- * Each live backend maintains a PgBackendStatus struct in shared memory
- * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
- * has no involvement in, or even access to, these structs.
- *
- * Each auxiliary process also maintains a PgBackendStatus struct in shared
- * memory.
- * ----------
- */
-typedef struct PgBackendStatus
-{
-    /*
-     * To avoid locking overhead, we use the following protocol: a backend
-     * increments st_changecount before modifying its entry, and again after
-     * finishing a modification.  A would-be reader should note the value of
-     * st_changecount, copy the entry into private memory, then check
-     * st_changecount again.  If the value hasn't changed, and if it's even,
-     * the copy is valid; otherwise start over.  This makes updates cheap
-     * while reads are potentially expensive, but that's the tradeoff we want.
-     *
-     * The above protocol needs the memory barriers to ensure that the
-     * apparent order of execution is as it desires. Otherwise, for example,
-     * the CPU might rearrange the code so that st_changecount is incremented
-     * twice before the modification on a machine with weak memory ordering.
-     * This surprising result can lead to bugs.
-     */
-    int            st_changecount;
-
-    /* The entry is valid iff st_procpid > 0, unused if st_procpid == 0 */
-    int            st_procpid;
-
-    /* Type of backends */
-    BackendType st_backendType;
-
-    /* Times when current backend, transaction, and activity started */
-    TimestampTz st_proc_start_timestamp;
-    TimestampTz st_xact_start_timestamp;
-    TimestampTz st_activity_start_timestamp;
-    TimestampTz st_state_start_timestamp;
-
-    /* Database OID, owning user's OID, connection client address */
-    Oid            st_databaseid;
-    Oid            st_userid;
-    SockAddr    st_clientaddr;
-    char       *st_clienthostname;    /* MUST be null-terminated */
-
-    /* Information about SSL connection */
-    bool        st_ssl;
-    PgBackendSSLStatus *st_sslstatus;
-
-    /* current state */
-    BackendState st_state;
-
-    /* application name; MUST be null-terminated */
-    char       *st_appname;
-
-    /*
-     * Current command string; MUST be null-terminated. Note that this string
-     * possibly is truncated in the middle of a multi-byte character. As
-     * activity strings are stored more frequently than read, that allows to
-     * move the cost of correct truncation to the display side. Use
-     * pgstat_clip_activity() to truncate correctly.
-     */
-    char       *st_activity_raw;
-
-    /*
-     * Command progress reporting.  Any command which wishes can advertise
-     * that it is running by setting st_progress_command,
-     * st_progress_command_target, and st_progress_param[].
-     * st_progress_command_target should be the OID of the relation which the
-     * command targets (we assume there's just one, as this is meant for
-     * utility commands), but the meaning of each element in the
-     * st_progress_param array is command-specific.
-     */
-    ProgressCommandType st_progress_command;
-    Oid            st_progress_command_target;
-    int64        st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
-} PgBackendStatus;
-
-/*
- * Macros to load and store st_changecount with the memory barriers.
- *
- * pgstat_increment_changecount_before() and
- * pgstat_increment_changecount_after() need to be called before and after
- * PgBackendStatus entries are modified, respectively. This makes sure that
- * st_changecount is incremented around the modification.
- *
- * Also pgstat_save_changecount_before() and pgstat_save_changecount_after()
- * need to be called before and after PgBackendStatus entries are copied into
- * private memory, respectively.
- */
-#define pgstat_increment_changecount_before(beentry)    \
-    do {    \
-        beentry->st_changecount++;    \
-        pg_write_barrier(); \
-    } while (0)
-
-#define pgstat_increment_changecount_after(beentry) \
-    do {    \
-        pg_write_barrier(); \
-        beentry->st_changecount++;    \
-        Assert((beentry->st_changecount & 1) == 0); \
-    } while (0)
-
-#define pgstat_save_changecount_before(beentry, save_changecount)    \
-    do {    \
-        save_changecount = beentry->st_changecount; \
-        pg_read_barrier();    \
-    } while (0)
-
-#define pgstat_save_changecount_after(beentry, save_changecount)    \
-    do {    \
-        pg_read_barrier();    \
-        save_changecount = beentry->st_changecount; \
-    } while (0)
-
-/* ----------
- * LocalPgBackendStatus
- *
- * When we build the backend status array, we use LocalPgBackendStatus to be
- * able to add new values to the struct when needed without adding new fields
- * to the shared memory. It contains the backend status as a first member.
- * ----------
- */
-typedef struct LocalPgBackendStatus
-{
-    /*
-     * Local version of the backend status entry.
-     */
-    PgBackendStatus backendStatus;
-
-    /*
-     * The xid of the current transaction if available, InvalidTransactionId
-     * if not.
-     */
-    TransactionId backend_xid;
-
-    /*
-     * The xmin of the current session if available, InvalidTransactionId if
-     * not.
-     */
-    TransactionId backend_xmin;
-} LocalPgBackendStatus;
-
 /*
  * Working state needed to accumulate per-function-call timing statistics.
  */
@@ -1141,18 +376,18 @@ typedef struct PgStat_FunctionCallUsage
  * GUC parameters
  * ----------
  */
-extern bool pgstat_track_activities;
 extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
-extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used, but will be removed with GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1164,34 +399,20 @@ extern PgStat_Counter pgStatBlockWriteTime;
  * Functions called from postmaster
  * ----------
  */
-extern Size BackendStatusShmemSize(void);
-extern void CreateSharedBackendStatus(void);
-
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
-
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_update_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
-extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
-extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
-
+extern void pgstat_reset_shared_counters(const char *target);
+extern void pgstat_reset_single_counter(Oid objectid,
+                                        PgStat_Single_Reset_Type type);
 extern void pgstat_report_autovac(Oid dboid);
 extern void pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples);
@@ -1202,88 +423,20 @@ extern void pgstat_report_analyze(Relation rel,
 extern void pgstat_report_recovery_conflict(int reason);
 extern void pgstat_report_deadlock(void);
 
+extern void pgstat_clear_snapshot(void);
+
 extern void pgstat_initialize(void);
+extern void pgstat_bearray_initialize(void);
 extern void pgstat_bestart(void);
 
-extern void pgstat_report_activity(BackendState state, const char *cmd_str);
-extern void pgstat_report_tempfile(size_t filesize);
-extern void pgstat_report_appname(const char *appname);
-extern void pgstat_report_xact_timestamp(TimestampTz tstamp);
-extern const char *pgstat_get_wait_event(uint32 wait_event_info);
-extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
-extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
-extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
-                                    int buflen);
-extern const char *pgstat_get_backend_desc(BackendType backendType);
-
-extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
-                              Oid relid);
-extern void pgstat_progress_update_param(int index, int64 val);
-extern void pgstat_progress_update_multi_param(int nparam, const int *index,
-                                   const int64 *val);
-extern void pgstat_progress_end_command(void);
-
 extern PgStat_TableStatus *find_tabstat_entry(Oid rel_id);
 extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 
 extern void pgstat_initstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
-
-/* ----------
- * pgstat_report_wait_start() -
- *
- *    Called from places where server process needs to wait.  This is called
- *    to report wait event information.  The wait information is stored
- *    as 4-bytes where first byte represents the wait event class (type of
- *    wait, for different types of wait, refer WaitClass) and the next
- *    3-bytes represent the actual wait event.  Currently 2-bytes are used
- *    for wait event which is sufficient for current usage, 1-byte is
- *    reserved for future usage.
- *
- * NB: this *must* be able to survive being called before MyProc has been
- * initialized.
- * ----------
- */
-static inline void
-pgstat_report_wait_start(uint32 wait_event_info)
-{
-    volatile PGPROC *proc = MyProc;
-
-    if (!pgstat_track_activities || !proc)
-        return;
-
-    /*
-     * Since this is a four-byte field which is always read and written as
-     * four-bytes, updates are atomic.
-     */
-    proc->wait_event_info = wait_event_info;
-}
-
-/* ----------
- * pgstat_report_wait_end() -
- *
- *    Called to report end of a wait.
- *
- * NB: this *must* be able to survive being called before MyProc has been
- * initialized.
- * ----------
- */
-static inline void
-pgstat_report_wait_end(void)
-{
-    volatile PGPROC *proc = MyProc;
-
-    if (!pgstat_track_activities || !proc)
-        return;
-
-    /*
-     * Since this is a four-byte field which is always read and written as
-     * four-bytes, updates are atomic.
-     */
-    proc->wait_event_info = 0;
-}
-
+extern PgStat_StatDBEntry *backend_get_db_entry(Oid dbid, bool oneshot);
+extern PgStat_StatTabEntry *backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid relid, bool oneshot);
 /* nontransactional event counts are simple enough to inline */
 
 #define pgstat_count_heap_scan(rel)                                    \
@@ -1348,21 +501,30 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                           void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
+extern void pgstat_update_archiver(const char *xlog, bool failed);
+extern void pgstat_update_bgwriter(void);
+
+extern void pgstat_report_tempfile(size_t filesize);
 
 /* ----------
  * Support functions for the SQL-callable functions to
  * generate the pgstat* views.
  * ----------
  */
-extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
+extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid, bool oneshot);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
-extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
-extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
-extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
 
+/* File input/output functions */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
+
+/* For shared memory allocation/initialization */
+extern Size BackendStatusShmemSize(void);
+extern void CreateSharedBackendStatus(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/dsm.h b/src/include/storage/dsm.h
index 7c44f4a6e7..c37ec33e9b 100644
--- a/src/include/storage/dsm.h
+++ b/src/include/storage/dsm.h
@@ -26,6 +26,7 @@ typedef struct dsm_segment dsm_segment;
 struct PGShmemHeader;            /* avoid including pg_shmem.h */
 extern void dsm_cleanup_using_control_segment(dsm_handle old_control_handle);
 extern void dsm_postmaster_startup(struct PGShmemHeader *);
+extern void dsm_child_init(void);
 extern void dsm_backend_shutdown(void);
 extern void dsm_detach_all(void);
 
@@ -33,6 +34,8 @@ extern void dsm_detach_all(void);
 extern void dsm_set_control_handle(dsm_handle h);
 #endif
 
+extern bool dsm_is_available(void);
+
 /* Functions that create or remove mappings. */
 extern dsm_segment *dsm_create(Size size, int flags);
 extern dsm_segment *dsm_attach(dsm_handle h);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 96c7732006..daa269f816 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -219,6 +219,9 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SHARED_TUPLESTORE,
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
+    LWTRANCHE_STATS_DSA,
+    LWTRANCHE_STATS_DB,
+    LWTRANCHE_STATS_FUNC_TABLE,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 9244a2a7b7..a9b625211b 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
diff --git a/src/test/modules/worker_spi/worker_spi.c b/src/test/modules/worker_spi/worker_spi.c
index c1878dd694..7391e05f37 100644
--- a/src/test/modules/worker_spi/worker_spi.c
+++ b/src/test/modules/worker_spi/worker_spi.c
@@ -290,7 +290,7 @@ worker_spi_main(Datum main_arg)
         SPI_finish();
         PopActiveSnapshot();
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(false);
         pgstat_report_activity(STATE_IDLE, NULL);
     }
 
-- 
2.16.3

From cb12169f048a59ea172a6c3bd24c13af362cf6fb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 27 Nov 2018 14:42:12 +0900
Subject: [PATCH 5/5] Remove the GUC stats_temp_directory

The GUC used to specify the directory to store temporary statistics
files. It is no longer needed by the stats collector but is still used by
the programs in bin and contrib, and maybe by other extensions. Thus this
patch removes the GUC, but some backing variables and macro definitions
are left alone for backward compatibility.
---
 doc/src/sgml/backup.sgml                      |  2 --
 doc/src/sgml/config.sgml                      | 19 -------------
 doc/src/sgml/monitoring.sgml                  |  7 +----
 doc/src/sgml/storage.sgml                     |  3 +-
 src/backend/replication/basebackup.c          | 13 ++-------
 src/backend/statmon/pgstat.c                  | 13 ++++-----
 src/backend/utils/misc/guc.c                  | 41 ---------------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/pgstat.h                          |  5 +++-
 9 files changed, 14 insertions(+), 90 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index a73fd4d044..95285809c2 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1119,8 +1119,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8bd57f376b..79f704cc99 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6709,25 +6709,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 5c1408bdf5..b538799ff6 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -197,12 +197,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
+   <productname>PostgreSQL</productname> processes through shared memory.
    When the server shuts down cleanly, a permanent copy of the statistics
    data is stored in the <filename>pg_stat</filename> subdirectory, so that
    statistics can be retained across server restarts.  When recovery is
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index cbdad0c3fb..133eb3ff19 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -122,8 +122,7 @@ Item
 
 <row>
  <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
+ <entry>Subdirectory containing ephemeral files for extensions</entry>
 </row>
 
 <row>
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index e30b2dbcf0..a567aacf73 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -231,11 +231,8 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -266,13 +263,9 @@ perform_base_backup(basebackup_options *opt)
          * Calculate the relative path of temporary statistics directory in
          * order to skip the files which are located in that directory later.
          */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
+
+        Assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
+        statrelpath = psprintf("./%s", PG_STAT_TMP_DIR);
 
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
diff --git a/src/backend/statmon/pgstat.c b/src/backend/statmon/pgstat.c
index 9dcd7ab540..805b64e758 100644
--- a/src/backend/statmon/pgstat.c
+++ b/src/backend/statmon/pgstat.c
@@ -89,15 +89,12 @@ typedef enum PgStat_TableLookupState
 bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 
-/* ----------
- * Built from GUC parameter
- * ----------
+/*
+ * This used to be a GUC variable and is no longer used in this file, but is
+ * left alone, holding the default value, just for backward compatibility with
+ * extensions.
  */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+char       *pgstat_stat_directory = PG_STAT_TMP_DIR;
 
 /* Shared stats bootstrap information */
 typedef struct StatsShmemStruct {
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 62a07727d0..49123204c5 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -192,7 +192,6 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static void assign_effective_io_concurrency(int newval, void *extra);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -3989,17 +3988,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11003,35 +10991,6 @@ assign_effective_io_concurrency(int newval, void *extra)
 #endif                            /* USE_PREFETCH */
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 194f312096..fdb088dbfd 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -554,7 +554,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index eaa9d11c2a..28e039f4bc 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -29,7 +29,10 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
+/*
+ * This used to be the directory to store temporary statistics data in but is
+ * no longer used. Defined here for backward compatibility.
+ */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
 
 /* Values for track_functions GUC variable --- order is significant! */
-- 
2.16.3


Re: shared-memory based stats collector

From
Kyotaro HORIGUCHI
Date:
At Mon, 18 Feb 2019 21:35:31 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20190218.213531.89078771.horiguchi.kyotaro@lab.ntt.co.jp>
> I'm trying to remove pgStatPendingTabHash, but it doesn't work yet. I'll
> include it in the next version.

Done. It passed a test for the case of intermittent dshash lock
failure, which leaves local stats info "pending".

- Removed pgStatPendingTabHash. "Pending" table entries are left
  in pgStatTabList after pgstat_update_stat(). There's no longer
  a "pending" stats store.

- Fixed several bugs in reading/writing the at-rest file.

- Transactional snapshot behaved wrongly. Fixed it.

- In this project the SQL helper functions are renamed from
  pgstat_fetch_stat_* to backend_fetch_stat_* because functions
  with similar functionality are implemented for the writer side
  and they take the pgstat_fetch_stat_* names.

  But some of the functions had very confusing names that don't
  follow the convention.

- Split pgStatLocalContext into pgSharedStatsContext and
  pgStatSnapshotContext. The former is for shared statistics and
  the latter is for transactional snapshot.

- Cleaned up pgstat.[ch], removing stale lines and fixing bogus comments.
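
As a rough illustration of the "pending" behaviour in the first item
(this is not code from the patch): a backend tries to take the lock
without waiting and, on failure, simply leaves its local counts in
place for the next attempt. The sketch below is standalone C.
SharedCounter, LocalCounter and flush_counter are hypothetical names,
and a C11 atomic_flag stands in for the dshash partition lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for one shared stats entry and its lock. */
typedef struct SharedCounter
{
    atomic_flag lock;               /* stand-in for a partition lock */
    long        tuples_inserted;
} SharedCounter;

/* Hypothetical backend-local entry; a nonzero count is still "pending". */
typedef struct LocalCounter
{
    long        pending_inserted;
} LocalCounter;

/*
 * Try to fold the local count into the shared entry without waiting.
 * If the lock is busy, return false and leave the local entry untouched
 * so that the count stays pending until the next call.
 */
static bool
flush_counter(SharedCounter *shared, LocalCounter *local)
{
    /* test_and_set returns the previous value: true means already held */
    if (atomic_flag_test_and_set(&shared->lock))
        return false;

    shared->tuples_inserted += local->pending_inserted;
    local->pending_inserted = 0;
    atomic_flag_clear(&shared->lock);
    return true;
}
```

A caller would keep accumulating into the local entry and call
flush_counter again later; nothing is lost when an attempt fails.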

This version is heavily improved from the previous version.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From b202e8a43c13514925769b8dd125dc702ff3be8e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:41:04 +0900
Subject: [PATCH 1/5] sequential scan for dshash

Add sequential scan feature to dshash.
---
 src/backend/lib/dshash.c | 188 ++++++++++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  23 +++++-
 2 files changed, 206 insertions(+), 5 deletions(-)
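
Note for readers: the non-consistent scan mode in this patch moves a
single partition lock along with the scan, taking the lock on the next
partition before releasing the current one. The following is a minimal
standalone sketch of that hand-over only, under simplifying
assumptions: ToyTable, ToyScan and the toy_* functions are hypothetical
names, each bucket holds at most one value (0 means empty), and plain
booleans stand in for the partition LWLocks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPARTITIONS            4
#define BUCKETS_PER_PARTITION  2
#define NBUCKETS               (NPARTITIONS * BUCKETS_PER_PARTITION)

/* Hypothetical toy table: one optional value per bucket. */
typedef struct ToyTable
{
    int         buckets[NBUCKETS];
    bool        locked[NPARTITIONS];    /* stand-in for partition locks */
} ToyTable;

typedef struct ToyScan
{
    ToyTable   *table;
    int         curbucket;      /* next bucket to examine */
    int         curpartition;   /* partition currently locked, or -1 */
} ToyScan;

static void
toy_seq_init(ToyScan *scan, ToyTable *table)
{
    scan->table = table;
    scan->curbucket = 0;
    scan->curpartition = -1;
}

/* Release whatever lock the scan still holds. */
static void
toy_seq_term(ToyScan *scan)
{
    if (scan->curpartition >= 0)
        scan->table->locked[scan->curpartition] = false;
    scan->curpartition = -1;
}

/* Return the next non-empty bucket, or NULL after end-of-scan cleanup. */
static int *
toy_seq_next(ToyScan *scan)
{
    while (scan->curbucket < NBUCKETS)
    {
        int         bucket = scan->curbucket++;
        int         partition = bucket / BUCKETS_PER_PARTITION;

        if (partition != scan->curpartition)
        {
            /* Lock the next partition first, then release the old one. */
            scan->table->locked[partition] = true;
            if (scan->curpartition >= 0)
                scan->table->locked[scan->curpartition] = false;
            scan->curpartition = partition;
        }

        if (scan->table->buckets[bucket] != 0)
            return &scan->table->buckets[bucket];
    }

    toy_seq_term(scan);
    return NULL;
}

/* Count the partition locks currently held. */
static int
toy_locks_held(const ToyTable *table)
{
    int         n = 0;

    for (int i = 0; i < NPARTITIONS; i++)
        if (table->locked[i])
            n++;
    return n;
}
```

Between calls the scan holds exactly one partition lock, and a scan
that runs to the end releases it on its own, which mirrors the rule
that the term function is needed only when a scan is abandoned early.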

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index f095196fb6..d1908a6137 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -112,6 +112,7 @@ struct dshash_table
     size_t        size_log2;        /* log2(number of buckets) */
     bool        find_locked;    /* Is any partition lock held by 'find'? */
     bool        find_exclusively_locked;    /* ... exclusively? */
+    bool        seqscan_running;/* now under sequential scan */
 };
 
 /* Given a pointer to an item, find the entry (user data) it holds. */
@@ -127,6 +128,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +158,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -228,6 +237,7 @@ dshash_create(dsa_area *area, const dshash_parameters *params, void *arg)
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
 
     /*
      * Set up the initial array of buckets.  Our initial size is the same as
@@ -279,6 +289,7 @@ dshash_attach(dsa_area *area, const dshash_parameters *params,
     hash_table->control = dsa_get_address(area, control);
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
     Assert(hash_table->control->magic == DSHASH_MAGIC);
 
     /*
@@ -324,7 +335,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
                                 LW_EXCLUSIVE));
 
     delete_item(hash_table, item);
-    hash_table->find_locked = false;
-    hash_table->find_exclusively_locked = false;
-    LWLockRelease(PARTITION_LOCK(hash_table, partition));
+
+    /* We need to keep the partition lock during a sequential scan */
+    if (!hash_table->seqscan_running)
+    {
+        hash_table->find_locked = false;
+        hash_table->find_exclusively_locked = false;
+        LWLockRelease(PARTITION_LOCK(hash_table, partition));
+    }
 }
 
 /*
@@ -568,6 +584,8 @@ dshash_release_lock(dshash_table *hash_table, void *entry)
     Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition_index),
                                 hash_table->find_exclusively_locked
                                 ? LW_EXCLUSIVE : LW_SHARED));
+    /* lock is under control of sequential scan */
+    Assert(!hash_table->seqscan_running);
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
@@ -592,6 +610,168 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should be called if and only if the scan is abandoned
+ * before completion; if dshash_seq_next returns NULL then it has already done
+ * the end-of-scan cleanup.
+ *
+ * On returning element, it is locked as is the case with dshash_find.
+ * However, the caller must not release the lock. The lock is released as
+ * necessary in continued scan.
+ *
+ * As opposed to the equivalent for dynanash, the caller is not supposed to
+ * delete the returned element before continuing the scan.
+ *
+ * If consistent is set for dshash_seq_init, the whole hash table is
+ * non-exclusively locked. Otherwise a part of the hash table is locked in the
+ * same mode (partition lock).
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool consistent, bool exclusive)
+{
+    /* allowed at most one scan at once */
+    Assert(!hash_table->seqscan_running);
+
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->consistent = consistent;
+    status->exclusive = exclusive;
+    hash_table->seqscan_running = true;
+
+    /*
+     * Protect all partitions from modification if the caller wants a
+     * consistent result.
+     */
+    if (consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+        {
+            Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i)));
+
+            LWLockAcquire(PARTITION_LOCK(hash_table, i),
+                          exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        }
+        ensure_valid_bucket_pointers(hash_table);
+    }
+}
+
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    Assert(status->hash_table->seqscan_running);
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert (status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        if (!status->consistent)
+        {
+            partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+            LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            status->curpartition = partition;
+
+            /* resize doesn't happen from now until seq scan ends */
+            status->nbuckets =
+                NUM_BUCKETS(status->hash_table->control->size_log2);
+            ensure_valid_bucket_pointers(status->hash_table);
+        }
+        
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            dshash_seq_term(status);
+            return NULL;
+        }
+
+        /* Also move partition lock if needed */
+        if (!status->consistent)
+        {
+            int next_partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+
+            /* Move lock along with partition for the bucket */
+            if (status->curpartition != next_partition)
+            {
+                /*
+                 * Take lock on the next partition then release the current,
+                 * not in the reverse order. This is required to avoid
+                 * resizing from happening during a sequential scan. Locks are
+                 * taken in partition order so no deadlock happens with other
+                 * seq scans or resizing.
+                 */
+                LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                             next_partition),
+                              status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+                LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                             status->curpartition));
+                status->curpartition = next_partition;
+            }
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * This item can be deleted by the caller. Store the next item for the
+     * next iteration for the occasion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    Assert(status->hash_table->seqscan_running);
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+    status->hash_table->seqscan_running = false;
+
+    if (status->consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+            LWLockRelease(PARTITION_LOCK(status->hash_table, i));
+    }
+    else if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index e5dfd57f0a..b80f3af995 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,23 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state of dshash. The detail is exposed since the storage
+ * size should be known to users but it should be considered as an opaque
+ * type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;
+    int                    curbucket;
+    int                    nbuckets;
+    dshash_table_item  *curitem;
+    dsa_pointer            pnextitem;
+    int                    curpartition;
+    bool                consistent;
+    bool                exclusive;
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
               const dshash_parameters *params,
@@ -70,7 +87,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
 extern void dshash_detach(dshash_table *hash_table);
 extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
 extern void dshash_destroy(dshash_table *hash_table);
-
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
             const void *key, bool exclusive);
@@ -80,6 +96,11 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool consistent, bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.16.3

From 951a6afc196796b37da46b2d933b1c220379311d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 27 Sep 2018 11:15:19 +0900
Subject: [PATCH 2/5] Add conditional lock feature to dshash

Dshash currently waits for a lock unconditionally. This commit adds new
interfaces for dshash_find and dshash_find_or_insert. The new
interfaces have an extra parameter "nowait" that tells them not to wait
for the lock.
---
 src/backend/lib/dshash.c | 69 +++++++++++++++++++++++++++++++++++++++++++-----
 src/include/lib/dshash.h |  6 ++++-
 2 files changed, 67 insertions(+), 8 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index d1908a6137..db8d6899af 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -394,19 +394,48 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  */
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
+{
+    return dshash_find_extended(hash_table, key, exclusive, false, NULL);
+}
+
+/*
+ * Like dshash_find, but returns immediately when nowait is true and the lock
+ * was not acquired. Lock status is stored in *lock_acquired if it is non-NULL.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool *lock_acquired)
 {
     dshash_hash hash;
     size_t        partition;
     dshash_table_item *item;
 
+    /* reporting the lock status only makes sense when nowait is set */
+    Assert(nowait || !lock_acquired);
+
     hash = hash_key(hash_table, key);
     partition = PARTITION_FOR_HASH(hash);
 
     Assert(hash_table->control->magic == DSHASH_MAGIC);
     Assert(!hash_table->find_locked);
 
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partition),
+                                      exclusive ? LW_EXCLUSIVE : LW_SHARED))
+        {
+            if (lock_acquired)
+                *lock_acquired = false;
+            return NULL;
+        }
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition),
+                      exclusive ? LW_EXCLUSIVE : LW_SHARED);
+
+    if (lock_acquired)
+        *lock_acquired = true;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -441,6 +470,22 @@ void *
 dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
+{
+    return dshash_find_or_insert_extended(hash_table, key, found, false);
+}
+
+/*
+ * Addition to dshash_find_or_insert, returns NULL if nowait is true and lock
+ * was not acquired.
+ *
+ * Notes above dshash_find_extended() regarding locking and error handling
+ * equally apply here.
+ */
+void *
+dshash_find_or_insert_extended(dshash_table *hash_table,
+                               const void *key,
+                               bool *found,
+                               bool nowait)
 {
     dshash_hash hash;
     size_t        partition_index;
@@ -455,8 +500,16 @@ dshash_find_or_insert(dshash_table *hash_table,
     Assert(!hash_table->find_locked);
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(
+                PARTITION_LOCK(hash_table, partition_index),
+                LW_EXCLUSIVE))
+            return NULL;
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
+                      LW_EXCLUSIVE);
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -626,9 +679,11 @@ dshash_memhash(const void *v, size_t size, void *arg)
  * As opposed to the equivalent for dynanash, the caller is not supposed to
  * delete the returned element before continuing the scan.
  *
- * If consistent is set for dshash_seq_init, the whole hash table is
- * non-exclusively locked. Otherwise a part of the hash table is locked in the
- * same mode (partition lock).
+ * If consistent is set for dshash_seq_init, all the hash table
+ * partitions are locked in the requested mode (as determined by the
+ * exclusive flag), and the locks are held until the end of the scan.
+ * Otherwise the partition locks are acquired and released as needed
+ * during the scan (up to two partitions may be locked at the same time).
  */
 void
 dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b80f3af995..fe1d4d75c5 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -90,8 +90,12 @@ extern void dshash_destroy(dshash_table *hash_table);
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
             const void *key, bool exclusive);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+            bool exclusive, bool nowait, bool *lock_acquired);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
-                      const void *key, bool *found);
+            const void *key, bool *found);
+extern void *dshash_find_or_insert_extended(dshash_table *hash_table,
+            const void *key, bool *found, bool nowait);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.16.3

From 5e3c97bab2effa02e8434474f7acde7e7fc8d373 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 7 Nov 2018 16:53:49 +0900
Subject: [PATCH 3/5] Make archiver process an auxiliary process

This is a preliminary patch for the shared-memory based stats collector.
The archiver process must be an auxiliary process since it uses shared
memory once stats data is moved into shared memory. Make the process
an auxiliary process in order to make it work.
---
 src/backend/bootstrap/bootstrap.c   |  8 +++
 src/backend/postmaster/pgarch.c     | 98 +++++++++----------------------------
 src/backend/postmaster/pgstat.c     |  6 +++
 src/backend/postmaster/postmaster.c | 35 +++++++++----
 src/include/miscadmin.h             |  2 +
 src/include/pgstat.h                |  1 +
 src/include/postmaster/pgarch.h     |  4 +-
 7 files changed, 67 insertions(+), 87 deletions(-)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 4d7ed8ad1a..a6c3338d40 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -328,6 +328,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case BgWriterProcess:
                 statmsg = pgstat_get_backend_desc(B_BG_WRITER);
                 break;
+            case ArchiverProcess:
+                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
+                break;
             case CheckpointerProcess:
                 statmsg = pgstat_get_backend_desc(B_CHECKPOINTER);
                 break;
@@ -455,6 +458,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
             BackgroundWriterMain();
             proc_exit(1);        /* should never return */
 
+        case ArchiverProcess:
+            /* don't set signals, archiver has its own agenda */
+            PgArchiverMain();
+            proc_exit(1);        /* should never return */
+
         case CheckpointerProcess:
             /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index f84f882c4c..4342ebdab4 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -77,7 +77,6 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
@@ -96,7 +95,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgarch_exit(SIGNAL_ARGS);
 static void ArchSigHupHandler(SIGNAL_ARGS);
 static void ArchSigTermHandler(SIGNAL_ARGS);
@@ -114,75 +112,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -222,8 +151,8 @@ pgarch_forkexec(void)
  *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
  *    since we don't use 'em, it hardly matters...
  */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -255,8 +184,27 @@ PgArchiverMain(int argc, char *argv[])
 static void
 pgarch_exit(SIGNAL_ARGS)
 {
-    /* SIGQUIT means curl up and die ... */
-    exit(1);
+    PG_SETMASK(&BlockSig);
+
+    /*
+     * We DO NOT want to run proc_exit() callbacks -- we're here because
+     * shared memory may be corrupted, so we don't want to try to clean up our
+     * transaction.  Just nail the windows shut and get out of town.  Now that
+     * there's an atexit callback to prevent third-party code from breaking
+     * things by calling exit() directly, we have to reset the callbacks
+     * explicitly to make this work as intended.
+     */
+    on_exit_reset();
+
+    /*
+     * Note we do exit(2) not exit(0).  This is to force the postmaster into a
+     * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+     * backend.  This is necessary precisely because we don't clean up our
+     * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+     * should ensure the postmaster sees this as a crash, too, but no harm in
+     * being doubly sure.)
+     */
+    exit(2);
 }
 
 /* SIGHUP signal handler for archiver process */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 81c6499251..9e6bce8f6a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2856,6 +2856,9 @@ pgstat_bestart(void)
             case BgWriterProcess:
                 beentry->st_backendType = B_BG_WRITER;
                 break;
+            case ArchiverProcess:
+                beentry->st_backendType = B_ARCHIVER;
+                break;
             case CheckpointerProcess:
                 beentry->st_backendType = B_CHECKPOINTER;
                 break;
@@ -4120,6 +4123,9 @@ pgstat_get_backend_desc(BackendType backendType)
         case B_BG_WRITER:
             backendDesc = "background writer";
             break;
+        case B_ARCHIVER:
+            backendDesc = "archiver";
+            break;
         case B_CHECKPOINTER:
             backendDesc = "checkpointer";
             break;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index ccea231e98..a663a62fd5 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -146,7 +146,8 @@
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
 #define BACKEND_TYPE_BGWORKER    0x0008    /* bgworker process */
-#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
+#define BACKEND_TYPE_ARCHIVER    0x0010    /* archiver process */
+#define BACKEND_TYPE_ALL        0x001F    /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -539,6 +540,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
 #define StartWalReceiver()        StartChildProcess(WalReceiverProcess)
@@ -1761,7 +1763,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -2924,7 +2926,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3069,10 +3071,8 @@ reaper(SIGNAL_ARGS)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3318,7 +3318,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3523,6 +3523,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3799,6 +3811,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5068,7 +5081,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5346,6 +5359,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case CheckpointerProcess:
                 ereport(LOG,
                         (errmsg("could not fork checkpointer process: %m")));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index c9e35003a5..63a7653457 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -399,6 +399,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -411,6 +412,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 88a75fb798..471877d2df 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -706,6 +706,7 @@ typedef enum BackendType
     B_BACKEND,
     B_BG_WORKER,
     B_BG_WRITER,
+    B_ARCHIVER,
     B_CHECKPOINTER,
     B_STARTUP,
     B_WAL_RECEIVER,
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index 2474eac26a..88f16863d4 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
-- 
2.16.3

From 1f993c1acf7756af5ea129c5be44ff2064032bd5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 12 Nov 2018 17:26:33 +0900
Subject: [PATCH 4/5] Shared-memory based stats collector

Previously, activity statistics were shared via files on disk. Every
backend sent its numbers to the stats collector process via a socket.
The collector wrote snapshots as a set of files on disk at a certain
interval, and every backend read them as necessary. That worked fine for
a comparatively small set of statistics, but the set is under pressure
to grow and the file size has reached the order of megabytes. To deal
with a larger statistics set, this patch lets backends share the
statistics directly via shared memory.
---
 contrib/pg_prewarm/autoprewarm.c                   |    2 +-
 contrib/pg_stat_statements/pg_stat_statements.c    |    1 +
 contrib/postgres_fdw/connection.c                  |    2 +-
 src/backend/Makefile                               |    2 +-
 src/backend/access/heap/rewriteheap.c              |    4 +-
 src/backend/access/heap/vacuumlazy.c               |    1 +
 src/backend/access/nbtree/nbtree.c                 |    2 +-
 src/backend/access/nbtree/nbtsort.c                |    2 +-
 src/backend/access/transam/clog.c                  |    2 +-
 src/backend/access/transam/parallel.c              |    1 +
 src/backend/access/transam/slru.c                  |    2 +-
 src/backend/access/transam/timeline.c              |    2 +-
 src/backend/access/transam/twophase.c              |    2 +
 src/backend/access/transam/xact.c                  |    3 +
 src/backend/access/transam/xlog.c                  |    1 +
 src/backend/access/transam/xlogfuncs.c             |    2 +-
 src/backend/access/transam/xlogutils.c             |    2 +-
 src/backend/bootstrap/bootstrap.c                  |    8 +-
 src/backend/executor/execParallel.c                |    2 +-
 src/backend/executor/nodeBitmapHeapscan.c          |    1 +
 src/backend/executor/nodeGather.c                  |    1 +
 src/backend/executor/nodeHash.c                    |    2 +-
 src/backend/executor/nodeHashjoin.c                |    2 +-
 src/backend/libpq/be-secure-openssl.c              |    2 +-
 src/backend/libpq/be-secure.c                      |    2 +-
 src/backend/libpq/pqmq.c                           |    2 +-
 src/backend/postmaster/Makefile                    |    2 +-
 src/backend/postmaster/autovacuum.c                |   50 +-
 src/backend/postmaster/bgworker.c                  |    2 +-
 src/backend/postmaster/bgwriter.c                  |    5 +-
 src/backend/postmaster/checkpointer.c              |   17 +-
 src/backend/postmaster/pgarch.c                    |    5 +-
 src/backend/postmaster/pgstat.c                    | 6385 --------------------
 src/backend/postmaster/postmaster.c                |   86 +-
 src/backend/postmaster/syslogger.c                 |    2 +-
 src/backend/postmaster/walwriter.c                 |    2 +-
 src/backend/replication/basebackup.c               |    1 +
 .../libpqwalreceiver/libpqwalreceiver.c            |    2 +-
 src/backend/replication/logical/launcher.c         |    2 +-
 src/backend/replication/logical/origin.c           |    3 +-
 src/backend/replication/logical/reorderbuffer.c    |    2 +-
 src/backend/replication/logical/snapbuild.c        |    2 +-
 src/backend/replication/logical/tablesync.c        |   15 +-
 src/backend/replication/logical/worker.c           |    5 +-
 src/backend/replication/slot.c                     |    2 +-
 src/backend/replication/syncrep.c                  |    2 +-
 src/backend/replication/walreceiver.c              |    2 +-
 src/backend/replication/walsender.c                |    2 +-
 src/backend/statmon/Makefile                       |   17 +
 src/backend/statmon/bestatus.c                     | 1781 ++++++
 src/backend/statmon/pgstat.c                       | 4072 +++++++++++++
 src/backend/storage/buffer/bufmgr.c                |    1 +
 src/backend/storage/file/buffile.c                 |    2 +-
 src/backend/storage/file/copydir.c                 |    2 +-
 src/backend/storage/file/fd.c                      |    1 +
 src/backend/storage/ipc/dsm.c                      |   24 +-
 src/backend/storage/ipc/dsm_impl.c                 |    2 +-
 src/backend/storage/ipc/ipci.c                     |    7 +
 src/backend/storage/ipc/latch.c                    |    2 +-
 src/backend/storage/ipc/procarray.c                |    2 +-
 src/backend/storage/ipc/shm_mq.c                   |    2 +-
 src/backend/storage/ipc/standby.c                  |    2 +-
 src/backend/storage/lmgr/deadlock.c                |    1 +
 src/backend/storage/lmgr/lwlock.c                  |    5 +-
 src/backend/storage/lmgr/lwlocknames.txt           |    1 +
 src/backend/storage/lmgr/predicate.c               |    2 +-
 src/backend/storage/lmgr/proc.c                    |    2 +-
 src/backend/storage/smgr/md.c                      |    2 +-
 src/backend/tcop/postgres.c                        |   28 +-
 src/backend/utils/adt/misc.c                       |    2 +-
 src/backend/utils/adt/pgstatfuncs.c                |  125 +-
 src/backend/utils/cache/relmapper.c                |    2 +-
 src/backend/utils/init/globals.c                   |    1 +
 src/backend/utils/init/miscinit.c                  |    2 +-
 src/backend/utils/init/postinit.c                  |   15 +
 src/backend/utils/misc/guc.c                       |    1 +
 src/bin/pg_basebackup/t/010_pg_basebackup.pl       |    2 +-
 src/include/bestatus.h                             |  555 ++
 src/include/miscadmin.h                            |    2 +-
 src/include/pgstat.h                               |  960 +--
 src/include/storage/dsm.h                          |    3 +
 src/include/storage/lwlock.h                       |    3 +
 src/include/utils/timeout.h                        |    1 +
 src/test/modules/worker_spi/worker_spi.c           |    2 +-
 84 files changed, 6757 insertions(+), 7528 deletions(-)
 delete mode 100644 src/backend/postmaster/pgstat.c
 create mode 100644 src/backend/statmon/Makefile
 create mode 100644 src/backend/statmon/bestatus.c
 create mode 100644 src/backend/statmon/pgstat.c
 create mode 100644 src/include/bestatus.h
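
As a rough illustration of the idea in the commit message, the sketch below
shows counters that writers add to in place and readers load directly, with
no socket message and no snapshot file in between. It is not the patch's
mechanism: the series stores entries in dshash tables in dynamic shared
memory, whereas this sketch uses a plain struct with C11 atomics as a
stand-in, and every name in it (SketchTableStats, SketchPending,
sketch_init, sketch_flush, sketch_read_inserted) is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared entry; stands in for one stats entry in shared memory. */
typedef struct SketchTableStats
{
    atomic_uint_fast64_t tuples_inserted;
    atomic_uint_fast64_t tuples_deleted;
} SketchTableStats;

/* Hypothetical backend-local counts that have not been published yet. */
typedef struct SketchPending
{
    uint64_t    inserted;
    uint64_t    deleted;
} SketchPending;

/* Zero the shared counters before first use. */
static void
sketch_init(SketchTableStats *shared)
{
    atomic_init(&shared->tuples_inserted, 0);
    atomic_init(&shared->tuples_deleted, 0);
}

/* Add the local counts to the shared entry, then reset the local counts. */
static void
sketch_flush(SketchTableStats *shared, SketchPending *pending)
{
    atomic_fetch_add(&shared->tuples_inserted, pending->inserted);
    atomic_fetch_add(&shared->tuples_deleted, pending->deleted);
    pending->inserted = 0;
    pending->deleted = 0;
}

/* A reader sees the current value directly; no file has to be re-read. */
static uint64_t
sketch_read_inserted(SketchTableStats *shared)
{
    return atomic_load(&shared->tuples_inserted);
}
```

In this model the cost of a read no longer depends on the total size of the
statistics set, which is the latency problem the commit message describes
for the file-based approach.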

diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index 9cc4b2dc83..406efbd49b 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -30,10 +30,10 @@
 
 #include "access/relation.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "catalog/pg_class.h"
 #include "catalog/pg_type.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/bgworker.h"
 #include "storage/buf_internals.h"
 #include "storage/dsm.h"
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 9905593661..8523bc5300 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -62,6 +62,7 @@
 #include <unistd.h>
 
 #include "access/hash.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 239d220c24..1ea71245df 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -15,11 +15,11 @@
 #include "postgres_fdw.h"
 
 #include "access/htup_details.h"
+#include "bestatus.h"
 #include "catalog/pg_user_mapping.h"
 #include "access/xact.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/latch.h"
 #include "utils/hsearch.h"
 #include "utils/inval.h"
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 478a96db9b..cc511672c9 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -20,7 +20,7 @@ include $(top_builddir)/src/Makefile.global
 SUBDIRS = access bootstrap catalog parser commands executor foreign lib libpq \
     main nodes optimizer partitioning port postmaster \
     regex replication rewrite \
-    statistics storage tcop tsearch utils $(top_builddir)/src/timezone \
+    statistics statmon storage tcop tsearch utils $(top_builddir)/src/timezone \
     jit
 
 include $(srcdir)/common.mk
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index f5cf9ffc9c..adfd5f40fd 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -115,12 +115,12 @@
 #include "access/xact.h"
 #include "access/xloginsert.h"
 
+#include "bestatus.h"
+
 #include "catalog/catalog.h"
 
 #include "lib/ilist.h"
 
-#include "pgstat.h"
-
 #include "replication/logical.h"
 #include "replication/slot.h"
 
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 9416c31889..928d53a68c 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -44,6 +44,7 @@
 #include "access/transam.h"
 #include "access/visibilitymap.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/storage.h"
 #include "commands/dbcommands.h"
 #include "commands/progress.h"
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 98917de2ef..69cd211369 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -22,10 +22,10 @@
 #include "access/nbtxlog.h"
 #include "access/relscan.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
-#include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "storage/condition_variable.h"
 #include "storage/indexfsm.h"
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index dc398e1186..c2a3ed0209 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -64,9 +64,9 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "bestatus.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/smgr.h"
 #include "tcop/tcopprot.h"        /* pgrminclude ignore */
 #include "utils/rel.h"
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index aa089d83fa..cf034ba333 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -38,8 +38,8 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "pg_trace.h"
 #include "storage/proc.h"
 
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index ce2b61631d..8d5cbfa41d 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -19,6 +19,7 @@
 #include "access/session.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/pg_enum.h"
 #include "catalog/index.h"
 #include "catalog/namespace.h"
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 3623352b9c..a28fe474aa 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -54,7 +54,7 @@
 #include "access/slru.h"
 #include "access/transam.h"
 #include "access/xlog.h"
-#include "pgstat.h"
+#include "bestatus.h"
 #include "storage/fd.h"
 #include "storage/shmem.h"
 #include "miscadmin.h"
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index c96c8b60ba..bbe9c0eb5f 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -38,7 +38,7 @@
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogdefs.h"
-#include "pgstat.h"
+#include "bestatus.h"
 #include "storage/fd.h"
 
 /*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 9a8a6bb119..0dc9f39424 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -87,6 +87,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "access/xlogreader.h"
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "catalog/storage.h"
 #include "funcapi.h"
@@ -1569,6 +1570,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
     PredicateLockTwoPhaseFinish(xid, isCommit);
 
     /* Count the prepared xact as committed or aborted */
+    AtEOXact_BEStatus(isCommit);
     AtEOXact_PgStat(isCommit);
 
     /*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e93262975d..e5026bd261 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -30,6 +30,7 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
 #include "catalog/storage.h"
@@ -2148,6 +2149,7 @@ CommitTransaction(void)
     AtEOXact_Files(true);
     AtEOXact_ComboCid();
     AtEOXact_HashTables(true);
+    AtEOXact_BEStatus(true);
     AtEOXact_PgStat(true);
     AtEOXact_Snapshot(true, false);
     AtEOXact_ApplyLauncher(true);
@@ -2641,6 +2643,7 @@ AbortTransaction(void)
         AtEOXact_Files(false);
         AtEOXact_ComboCid();
         AtEOXact_HashTables(false);
+        AtEOXact_BEStatus(false);
         AtEOXact_PgStat(false);
         AtEOXact_ApplyLauncher(false);
         pgstat_report_xact_timestamp(0);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ecd12fc53a..61a90a2811 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -36,6 +36,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
 #include "catalog/pg_database.h"
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index b35043bf71..683c41575f 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -23,9 +23,9 @@
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
 #include "catalog/pg_type.h"
+#include "bestatus.h"
 #include "funcapi.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/walreceiver.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 10a663bae6..53fa4890e9 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -23,8 +23,8 @@
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/smgr.h"
 #include "utils/guc.h"
 #include "utils/hsearch.h"
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index a6c3338d40..79f624f0e0 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -22,6 +22,7 @@
 #include "access/htup_details.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/index.h"
 #include "catalog/pg_collation.h"
@@ -328,9 +329,6 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case BgWriterProcess:
                 statmsg = pgstat_get_backend_desc(B_BG_WRITER);
                 break;
-            case ArchiverProcess:
-                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
-                break;
             case CheckpointerProcess:
                 statmsg = pgstat_get_backend_desc(B_CHECKPOINTER);
                 break;
@@ -340,6 +338,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case WalReceiverProcess:
                 statmsg = pgstat_get_backend_desc(B_WAL_RECEIVER);
                 break;
+            case ArchiverProcess:
+                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
+                break;
             default:
                 statmsg = "??? process";
                 break;
@@ -416,6 +417,7 @@ AuxiliaryProcessMain(int argc, char *argv[])
         CreateAuxProcessResourceOwner();
 
         /* Initialize backend status information */
+        pgstat_bearray_initialize();
         pgstat_initialize();
         pgstat_bestart();
 
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index b79be91655..e53c0fb808 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -46,7 +46,7 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
-#include "pgstat.h"
+#include "bestatus.h"
 
 /*
  * Magic numbers for parallel executor communication.  We use constants
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 5e74585d5e..03a703075e 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -41,6 +41,7 @@
 #include "access/relscan.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
+#include "bestatus.h"
 #include "executor/execdebug.h"
 #include "executor/nodeBitmapHeapscan.h"
 #include "miscadmin.h"
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index 69d5a1f239..36859360b6 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -32,6 +32,7 @@
 
 #include "access/relscan.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "executor/execdebug.h"
 #include "executor/execParallel.h"
 #include "executor/nodeGather.h"
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 856daf6a7f..5a47eb4601 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -28,6 +28,7 @@
 
 #include "access/htup_details.h"
 #include "access/parallel.h"
+#include "bestatus.h"
 #include "catalog/pg_statistic.h"
 #include "commands/tablespace.h"
 #include "executor/execdebug.h"
@@ -35,7 +36,6 @@
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "port/atomics.h"
 #include "utils/dynahash.h"
 #include "utils/memutils.h"
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 2098708864..898a7916b0 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -108,12 +108,12 @@
 
 #include "access/htup_details.h"
 #include "access/parallel.h"
+#include "bestatus.h"
 #include "executor/executor.h"
 #include "executor/hashjoin.h"
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "utils/memutils.h"
 #include "utils/sharedtuplestore.h"
 
diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c
index 4490516b9e..711d929999 100644
--- a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c
@@ -36,9 +36,9 @@
 #include <openssl/ec.h>
 #endif
 
+#include "bestatus.h"
 #include "libpq/libpq.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/latch.h"
 #include "tcop/tcopprot.h"
diff --git a/src/backend/libpq/be-secure.c b/src/backend/libpq/be-secure.c
index a7def3168d..fa1cf6cffa 100644
--- a/src/backend/libpq/be-secure.c
+++ b/src/backend/libpq/be-secure.c
@@ -29,9 +29,9 @@
 #include <arpa/inet.h>
 #endif
 
+#include "bestatus.h"
 #include "libpq/libpq.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "tcop/tcopprot.h"
 #include "utils/memutils.h"
 #include "storage/ipc.h"
diff --git a/src/backend/libpq/pqmq.c b/src/backend/libpq/pqmq.c
index a9bd47d937..f79a70d6fe 100644
--- a/src/backend/libpq/pqmq.c
+++ b/src/backend/libpq/pqmq.c
@@ -13,11 +13,11 @@
 
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "libpq/libpq.h"
 #include "libpq/pqformat.h"
 #include "libpq/pqmq.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 71c23211b2..311e63017d 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = autovacuum.o bgworker.o bgwriter.o checkpointer.o fork_process.o \
-    pgarch.o pgstat.o postmaster.o startup.o syslogger.o walwriter.o
+    pgarch.o postmaster.o startup.o syslogger.o walwriter.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index d1177b3855..bd419f9777 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -71,6 +71,7 @@
 #include "access/reloptions.h"
 #include "access/transam.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "catalog/dependency.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_database.h"
@@ -547,7 +548,7 @@ AutoVacLauncherMain(int argc, char *argv[])
          * Make sure pgstat also considers our stat data as gone.  Note: we
          * mustn't use autovac_refresh_stats here.
          */
-        pgstat_clear_snapshot();
+        backend_clear_stats_snapshot();
 
         /* Now we can allow interrupts again */
         RESUME_INTERRUPTS();
@@ -968,7 +969,7 @@ rebuild_database_list(Oid newdb)
         PgStat_StatDBEntry *entry;
 
         /* only consider this database if it has a pgstat entry */
-        entry = pgstat_fetch_stat_dbentry(newdb);
+        entry = pgstat_fetch_stat_dbentry(newdb, true);
         if (entry != NULL)
         {
             /* we assume it isn't found because the hash was just created */
@@ -977,6 +978,7 @@ rebuild_database_list(Oid newdb)
             /* hash_search already filled in the key */
             db->adl_score = score++;
             /* next_worker is filled in later */
+            pfree(entry);
         }
     }
 
@@ -992,7 +994,7 @@ rebuild_database_list(Oid newdb)
          * skip databases with no stat entries -- in particular, this gets rid
          * of dropped databases
          */
-        entry = pgstat_fetch_stat_dbentry(avdb->adl_datid);
+        entry = pgstat_fetch_stat_dbentry(avdb->adl_datid, true);
         if (entry == NULL)
             continue;
 
@@ -1004,6 +1006,7 @@ rebuild_database_list(Oid newdb)
             db->adl_score = score++;
             /* next_worker is filled in later */
         }
+        pfree(entry);
     }
 
     /* finally, insert all qualifying databases not previously inserted */
@@ -1016,7 +1019,7 @@ rebuild_database_list(Oid newdb)
         PgStat_StatDBEntry *entry;
 
         /* only consider databases with a pgstat entry */
-        entry = pgstat_fetch_stat_dbentry(avdb->adw_datid);
+        entry = pgstat_fetch_stat_dbentry(avdb->adw_datid, true);
         if (entry == NULL)
             continue;
 
@@ -1028,6 +1031,7 @@ rebuild_database_list(Oid newdb)
             db->adl_score = score++;
             /* next_worker is filled in later */
         }
+        pfree(entry);
     }
     nelems = score;
 
@@ -1226,7 +1230,7 @@ do_start_worker(void)
             continue;            /* ignore not-at-risk DBs */
 
         /* Find pgstat entry if any */
-        tmp->adw_entry = pgstat_fetch_stat_dbentry(tmp->adw_datid);
+        tmp->adw_entry = pgstat_fetch_stat_dbentry(tmp->adw_datid, true);
 
         /*
          * Skip a database with no pgstat entry; it means it hasn't seen any
@@ -1265,7 +1269,12 @@ do_start_worker(void)
             }
         }
         if (skipit)
+        {
+            /* Immediately free it if not used */
+            if (avdb != tmp)
+                pfree(tmp->adw_entry);
             continue;
+        }
 
         /*
          * Remember the db with oldest autovac time.  (If we are here, both
@@ -1273,7 +1282,12 @@ do_start_worker(void)
          */
         if (avdb == NULL ||
             tmp->adw_entry->last_autovac_time < avdb->adw_entry->last_autovac_time)
+        {
+            if (avdb)
+                pfree(avdb->adw_entry);
+
             avdb = tmp;
+        }
     }
 
     /* Found a database -- process it */
@@ -1962,7 +1976,7 @@ do_autovacuum(void)
      * may be NULL if we couldn't find an entry (only happens if we are
      * forcing a vacuum for anti-wrap purposes).
      */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, true);
 
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
@@ -2012,7 +2026,7 @@ do_autovacuum(void)
     MemoryContextSwitchTo(AutovacMemCxt);
 
     /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
+    shared = pgstat_fetch_stat_dbentry(InvalidOid, true);
 
     classRel = table_open(RelationRelationId, AccessShareLock);
 
@@ -2098,6 +2112,8 @@ do_autovacuum(void)
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
                                   &dovacuum, &doanalyze, &wraparound);
+        if (tabentry)
+            pfree(tabentry);
 
         /* Relations that need work are added to table_oids */
         if (dovacuum || doanalyze)
@@ -2177,10 +2193,11 @@ do_autovacuum(void)
         /* Fetch the pgstat entry for this table */
         tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
                                              shared, dbentry);
-
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
                                   &dovacuum, &doanalyze, &wraparound);
+        if (tabentry)
+            pfree(tabentry);
 
         /* ignore analyze for toast tables */
         if (dovacuum)
@@ -2749,12 +2766,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
     if (isshared)
     {
         if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
+            tabentry = pgstat_fetch_stat_tabentry(shared, relid, true);
     }
     else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
+        tabentry = pgstat_fetch_stat_tabentry(dbentry, relid, true);
 
     return tabentry;
 }
@@ -2786,8 +2801,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     /* use fresh stats */
     autovac_refresh_stats();
 
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    shared = pgstat_fetch_stat_dbentry(InvalidOid, true);
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, true);
 
     /* fetch the relation's relcache entry */
     classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
@@ -2818,6 +2833,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
                               &dovacuum, &doanalyze, &wraparound);
+    if (tabentry)
+        pfree(tabentry);
 
     /* ignore ANALYZE for toast tables */
     if (classForm->relkind == RELKIND_TOASTVALUE)
@@ -2908,7 +2925,10 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     }
 
     heap_freetuple(classTup);
-
+    if (shared)
+        pfree(shared);
+    if (dbentry)
+        pfree(dbentry);
     return tab;
 }
 
@@ -3367,5 +3385,5 @@ autovac_refresh_stats(void)
         last_read = current_time;
     }
 
-    pgstat_clear_snapshot();
+    backend_clear_stats_snapshot();
 }
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index f5db5a8c4a..7d7d55ef1a 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -16,8 +16,8 @@
 
 #include "libpq/pqsignal.h"
 #include "access/parallel.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "port/atomics.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/postmaster.h"
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index e6b6c549de..c820d35fbc 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -40,6 +40,7 @@
 
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -267,9 +268,9 @@ BackgroundWriterMain(void)
         can_hibernate = BgBufferSync(&wb_context);
 
         /*
-         * Send off activity statistics to the stats collector
+         * Update activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_update_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fe96c41359..9f70cd0e52 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -43,6 +43,7 @@
 
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -515,13 +516,13 @@ CheckpointerMain(void)
         CheckArchiveTimeout();
 
         /*
-         * Send off activity statistics to the stats collector.  (The reason
-         * why we re-use bgwriter-related code for this is that the bgwriter
-         * and checkpointer used to be just one process.  It's probably not
-         * worth the trouble to split the stats support into two independent
-         * stats message types.)
+         * Update activity statistics.  (The reason why we re-use
+         * bgwriter-related code for this is that the bgwriter and
+         * checkpointer used to be just one process.  It's probably not worth
+         * the trouble to split the stats support into two independent
+         * functions.)
          */
-        pgstat_send_bgwriter();
+        pgstat_update_bgwriter();
 
         /*
          * Sleep until we are signaled or it's time for another checkpoint or
@@ -682,9 +683,9 @@ CheckpointWriteDelay(int flags, double progress)
         CheckArchiveTimeout();
 
         /*
-         * Report interim activity statistics to the stats collector.
+         * Register interim activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_update_bgwriter();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 4342ebdab4..2a7c4fd1b1 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -35,6 +35,7 @@
 
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -468,7 +469,7 @@ pgarch_ArchiverCopyLoop(void)
                  * Tell the collector about the WAL file that we successfully
                  * archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_update_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
@@ -478,7 +479,7 @@ pgarch_ArchiverCopyLoop(void)
                  * Tell the collector about the WAL file that we failed to
                  * archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_update_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
deleted file mode 100644
index 9e6bce8f6a..0000000000
--- a/src/backend/postmaster/pgstat.c
+++ /dev/null
@@ -1,6385 +0,0 @@
-/* ----------
- * pgstat.c
- *
- *    All the statistics collector stuff hacked up in one big, ugly file.
- *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
- *
- *            - Add some automatic call for pgstat vacuuming.
- *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
- *
- *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
- *
- *    src/backend/postmaster/pgstat.c
- * ----------
- */
-#include "postgres.h"
-
-#include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
-
-#include "pgstat.h"
-
-#include "access/heapam.h"
-#include "access/htup_details.h"
-#include "access/transam.h"
-#include "access/twophase_rmgr.h"
-#include "access/xact.h"
-#include "catalog/pg_database.h"
-#include "catalog/pg_proc.h"
-#include "common/ip.h"
-#include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
-#include "miscadmin.h"
-#include "pg_trace.h"
-#include "postmaster/autovacuum.h"
-#include "postmaster/fork_process.h"
-#include "postmaster/postmaster.h"
-#include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
-#include "storage/ipc.h"
-#include "storage/latch.h"
-#include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
-#include "storage/procsignal.h"
-#include "storage/sinvaladt.h"
-#include "utils/ascii.h"
-#include "utils/guc.h"
-#include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
-#include "utils/snapmgr.h"
-#include "utils/timestamp.h"
-
-
-/* ----------
- * Timer definitions.
- * ----------
- */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
-
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
-
-
-/* ----------
- * The initial size hints for the hash tables used in the collector.
- * ----------
- */
-#define PGSTAT_DB_HASH_SIZE        16
-#define PGSTAT_TAB_HASH_SIZE    512
-#define PGSTAT_FUNCTION_HASH_SIZE    512
-
-
-/* ----------
- * Total number of backends including auxiliary
- *
- * We reserve a slot for each possible BackendId, plus one for each
- * possible auxiliary process type.  (This scheme assumes there is not
- * more than one of any auxiliary process type at a time.) MaxBackends
- * includes autovacuum workers and background workers as well.
- * ----------
- */
-#define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
-
-
-/* ----------
- * GUC parameters
- * ----------
- */
-bool        pgstat_track_activities = false;
-bool        pgstat_track_counts = false;
-int            pgstat_track_functions = TRACK_FUNC_OFF;
-int            pgstat_track_activity_query_size = 1024;
-
-/* ----------
- * Built from GUC parameter
- * ----------
- */
-char       *pgstat_stat_directory = NULL;
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
-
-/*
- * BgWriter global statistics counters (unused in other processes).
- * Stored directly in a stats message structure so it can be sent
- * without needing to copy things around.  We assume this inits to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
-
-/* ----------
- * Local data
- * ----------
- */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
-
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
-
-/*
- * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
- *
- * NOTE: once allocated, TabStatusArray structures are never moved or deleted
- * for the life of the backend.  Also, we zero out the t_id fields of the
- * contained PgStat_TableStatus structs whenever they are not actively in use.
- * This allows relcache pgstat_info pointers to be treated as long-lived data,
- * avoiding repeated searches in pgstat_initstats() when a relation is
- * repeatedly opened during a transaction.
- */
-#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
-
-typedef struct TabStatusArray
-{
-    struct TabStatusArray *tsa_next;    /* link to next array, if any */
-    int            tsa_used;        /* # entries currently used */
-    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
-} TabStatusArray;
-
-static TabStatusArray *pgStatTabList = NULL;
-
-/*
- * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
- */
-typedef struct TabStatHashEntry
-{
-    Oid            t_id;
-    PgStat_TableStatus *tsa_entry;
-} TabStatHashEntry;
-
-/*
- * Hash table for O(1) t_id -> tsa_entry lookup
- */
-static HTAB *pgStatTabHash = NULL;
-
-/*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
- */
-static HTAB *pgStatFunctions = NULL;
-
-/*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
- */
-static bool have_function_stats = false;
-
-/*
- * Tuple insertion/deletion counts for an open transaction can't be propagated
- * into PgStat_TableStatus counters until we know if it is going to commit
- * or abort.  Hence, we keep these counts in per-subxact structs that live
- * in TopTransactionContext.  This data structure is designed on the assumption
- * that subxacts won't usually modify very many tables.
- */
-typedef struct PgStat_SubXactStatus
-{
-    int            nest_level;        /* subtransaction nest level */
-    struct PgStat_SubXactStatus *prev;    /* higher-level subxact if any */
-    PgStat_TableXactStatus *first;    /* head of list for this subxact */
-} PgStat_SubXactStatus;
-
-static PgStat_SubXactStatus *pgStatXactStack = NULL;
-
-static int    pgStatXactCommit = 0;
-static int    pgStatXactRollback = 0;
-PgStat_Counter pgStatBlockReadTime = 0;
-PgStat_Counter pgStatBlockWriteTime = 0;
-
-/* Record that's written to 2PC state file when pgstat state is persisted */
-typedef struct TwoPhasePgStatRecord
-{
-    PgStat_Counter tuples_inserted; /* tuples inserted in xact */
-    PgStat_Counter tuples_updated;    /* tuples updated in xact */
-    PgStat_Counter tuples_deleted;    /* tuples deleted in xact */
-    PgStat_Counter inserted_pre_trunc;    /* tuples inserted prior to truncate */
-    PgStat_Counter updated_pre_trunc;    /* tuples updated prior to truncate */
-    PgStat_Counter deleted_pre_trunc;    /* tuples deleted prior to truncate */
-    Oid            t_id;            /* table's OID */
-    bool        t_shared;        /* is it a shared catalog? */
-    bool        t_truncated;    /* was the relation truncated? */
-} TwoPhasePgStatRecord;
-
-/*
- * Info about current "snapshot" of stats file
- */
-static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
-
-/* Status for backends including auxiliary */
-static LocalPgBackendStatus *localBackendStatusTable = NULL;
-
-/* Total number of backends including auxiliary */
-static int    localNumBackends = 0;
-
-/*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
- */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
-
-/* Signal handler flags */
-static volatile bool need_exit = false;
-static volatile bool got_SIGHUP = false;
-
-/*
- * Total time charged to functions so far in the current backend.
- * We use this to help separate "self" and "other" time charges.
- * (We assume this initializes to zero.)
- */
-static instr_time total_func_time;
-
-
-/* ----------
- * Local function forward declarations
- * ----------
- */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgstat_exit(SIGNAL_ARGS);
-static void pgstat_beshutdown_hook(int code, Datum arg);
-static void pgstat_sighup_handler(SIGNAL_ARGS);
-
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                     Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
-static void pgstat_read_current_status(void);
-
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
-
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
-static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
-
-static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
-
-static void pgstat_setup_memcxt(void);
-
-static const char *pgstat_get_wait_activity(WaitEventActivity w);
-static const char *pgstat_get_wait_client(WaitEventClient w);
-static const char *pgstat_get_wait_ipc(WaitEventIPC w);
-static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
-static const char *pgstat_get_wait_io(WaitEventIO w);
-
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
-/* ------------------------------------------------------------
- * Public functions called from postmaster follow
- * ------------------------------------------------------------
- */
-
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
- */
-void
-pgstat_init(void)
-{
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
-
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
-
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
-    {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
-    }
-
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
-
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
-
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
-
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
-    }
-
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
-
-    /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
-     */
-    if (!pg_set_noblock(pgStatSock))
-    {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
-    }
-
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
-    }
-
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    return;
-
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
-
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
-
-    /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
-     */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
-}
-
-/*
- * subroutine for pgstat_reset_all
- */
-static void
-pgstat_reset_remove_files(const char *directory)
-{
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
-
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
-    {
-        int            nchars;
-        Oid            tmp_oid;
-
-        /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
-         */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
-
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
-
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
-    }
-    FreeDir(dir);
-}
-
-/*
- * pgstat_reset_all() -
- *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
- */
-void
-pgstat_reset_all(void)
-{
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
-}
-
-#ifdef EXEC_BACKEND
-
-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
-    char       *av[10];
-    int            ac = 0;
-
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
-
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
-
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
-
-
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
-
-    /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
-     */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
-
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
-
-    /*
-     * Okay, fork off the collector.
-     */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
-}
-
-/* ------------------------------------------------------------
- * Public functions used by backends follow
- *------------------------------------------------------------
- */
-
-
-/* ----------
- * pgstat_report_stat() -
- *
- *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
- *    transaction stop time as an approximation of current time.
- * ----------
- */
-void
-pgstat_report_stat(bool force)
-{
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
-    TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
-    int            i;
-
-    /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
-
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
-    now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
-
-    /*
-     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
-     */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
-    pgStatTabHash = NULL;
-
-    /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
-     */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
-    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
-    {
-        for (i = 0; i < tsa->tsa_used; i++)
-        {
-            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
-
-            /* Shouldn't have any pending transaction-dependent counts */
-            Assert(entry->trans == NULL);
-
-            /*
-             * Ignore entries that didn't accumulate any actual counts, such
-             * as indexes that were opened by the planner but not used.
-             */
-            if (memcmp(&entry->t_counts, &all_zeroes,
-                       sizeof(PgStat_TableCounts)) == 0)
-                continue;
-
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
-            {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
-            }
-        }
-        /* zero out TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
-    }
-
-    /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
-     */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
-
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
-}
-
-/*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
- */
-static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
-{
-    int            n;
-    int            len;
-
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
-     */
-    if (OidIsValid(tsmsg->m_databaseid))
-    {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
-        pgStatXactCommit = 0;
-        pgStatXactRollback = 0;
-        pgStatBlockReadTime = 0;
-        pgStatBlockWriteTime = 0;
-    }
-    else
-    {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
-    }
-
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
-
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
-}
-
-/*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
- */
-static void
-pgstat_send_funcstats(void)
-{
-    /* we assume this inits to all zeroes: */
-    static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
-
-    if (pgStatFunctions == NULL)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
-
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
-
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
-
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
-
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
-
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
-}
-
-
-/* ----------
- * pgstat_vacuum_stat() -
- *
- *    Will tell the collector about objects he can get rid of.
- * ----------
- */
-void
-pgstat_vacuum_stat(void)
-{
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Read pg_database and make a list of OIDs of all existing databases
-     */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
-
-    /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            dbid = dbentry->databaseid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        /* the DB entry for shared tables (with InvalidOid) is never dropped */
-        if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
-            pgstat_drop_database(dbid);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Lookup our own database entry; if not found, nothing more to do.
-     */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
-        return;
-
-    /*
-     * Similarly to above, make a list of all known relations in this DB.
-     */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
-
-    /*
-     * Check for all tables listed in stats hashtable if they still exist.
-     */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            tabid = tabentry->tableid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
-            continue;
-
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
-    }
-
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
-        {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
-                continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
-        }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
-    }
-}
-
-
-/* ----------
- * pgstat_collect_oids() -
- *
- *    Collect the OIDs of all objects listed in the specified system catalog
- *    into a temporary hash table.  Caller should hash_destroy the result
- *    when done with it.  (However, we make the table in CurrentMemoryContext
- *    so that it will be freed properly in event of an error.)
- * ----------
- */
-static HTAB *
-pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
-{
-    HTAB       *htab;
-    HASHCTL        hash_ctl;
-    Relation    rel;
-    HeapScanDesc scan;
-    HeapTuple    tup;
-    Snapshot    snapshot;
-
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(Oid);
-    hash_ctl.hcxt = CurrentMemoryContext;
-    htab = hash_create("Temporary table of OIDs",
-                       PGSTAT_TAB_HASH_SIZE,
-                       &hash_ctl,
-                       HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    rel = table_open(catalogid, AccessShareLock);
-    snapshot = RegisterSnapshot(GetLatestSnapshot());
-    scan = heap_beginscan(rel, snapshot, 0, NULL);
-    while ((tup = heap_getnext(scan, ForwardScanDirection)) != NULL)
-    {
-        Oid            thisoid;
-        bool        isnull;
-
-        thisoid = heap_getattr(tup, anum_oid, RelationGetDescr(rel), &isnull);
-        Assert(!isnull);
-
-        CHECK_FOR_INTERRUPTS();
-
-        (void) hash_search(htab, (void *) &thisoid, HASH_ENTER, NULL);
-    }
-    heap_endscan(scan);
-    UnregisterSnapshot(snapshot);
-    table_close(rel, AccessShareLock);
-
-    return htab;
-}
-
-
-/* ----------
- * pgstat_drop_database() -
- *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
- * ----------
- */
-void
-pgstat_drop_database(Oid databaseid)
-{
-    PgStat_MsgDropdb msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
-
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
-}
-#endif                            /* NOT_USED */
-
-
-/* ----------
- * pgstat_reset_counters() -
- *
- *    Tell the statistics collector to reset counters for our database.
- *
- *    Permission checking for this function is managed through the normal
- *    GRANT system.
- * ----------
- */
-void
-pgstat_reset_counters(void)
-{
-    PgStat_MsgResetcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* ----------
- * pgstat_reset_shared_counters() -
- *
- *    Tell the statistics collector to reset cluster-wide shared counters.
- *
- *    Permission checking for this function is managed through the normal
- *    GRANT system.
- * ----------
- */
-void
-pgstat_reset_shared_counters(const char *target)
-{
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
-    else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
-    else
-        ereport(ERROR,
-                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-                 errmsg("unrecognized reset target: \"%s\"", target),
-                 errhint("Target must be \"archiver\" or \"bgwriter\".")));
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* ----------
- * pgstat_reset_single_counter() -
- *
- *    Tell the statistics collector to reset a single counter.
- *
- *    Permission checking for this function is managed through the normal
- *    GRANT system.
- * ----------
- */
-void
-pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
-{
-    PgStat_MsgResetsinglecounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
-
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* ----------
- * pgstat_report_autovac() -
- *
- *    Called from autovacuum.c to report startup of an autovacuum process.
- *    We are called before InitPostgres is done, so can't rely on MyDatabaseId;
- *    the db OID must be passed in, instead.
- * ----------
- */
-void
-pgstat_report_autovac(Oid dboid)
-{
-    PgStat_MsgAutovacStart msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
-
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/* ---------
- * pgstat_report_vacuum() -
- *
- *    Tell the collector about the table we just vacuumed.
- * ---------
- */
-void
-pgstat_report_vacuum(Oid tableoid, bool shared,
-                     PgStat_Counter livetuples, PgStat_Counter deadtuples)
-{
-    PgStat_MsgVacuum msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* --------
- * pgstat_report_analyze() -
- *
- *    Tell the collector about the table we just analyzed.
- *
- * Caller must provide new live- and dead-tuples estimates, as well as a
- * flag indicating whether to reset the changes_since_analyze counter.
- * --------
- */
-void
-pgstat_report_analyze(Relation rel,
-                      PgStat_Counter livetuples, PgStat_Counter deadtuples,
-                      bool resetcounter)
-{
-    PgStat_MsgAnalyze msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    /*
-     * Unlike VACUUM, ANALYZE might be running inside a transaction that has
-     * already inserted and/or deleted rows in the target table. ANALYZE will
-     * have counted such rows as live or dead respectively. Because we will
-     * report our counts of such rows at transaction end, we should subtract
-     * off these counts from what we send to the collector now, else they'll
-     * be double-counted after commit.  (This approach also ensures that the
-     * collector ends up with the right numbers if we abort instead of
-     * committing.)
-     */
-    if (rel->pgstat_info != NULL)
-    {
-        PgStat_TableXactStatus *trans;
-
-        for (trans = rel->pgstat_info->trans; trans; trans = trans->upper)
-        {
-            livetuples -= trans->tuples_inserted - trans->tuples_deleted;
-            deadtuples -= trans->tuples_updated + trans->tuples_deleted;
-        }
-        /* count stuff inserted by already-aborted subxacts, too */
-        deadtuples -= rel->pgstat_info->t_counts.t_delta_dead_tuples;
-        /* Since ANALYZE's counts are estimates, we could have underflowed */
-        livetuples = Max(livetuples, 0);
-        deadtuples = Max(deadtuples, 0);
-    }
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* --------
- * pgstat_report_recovery_conflict() -
- *
- *    Tell the collector about a Hot Standby recovery conflict.
- * --------
- */
-void
-pgstat_report_recovery_conflict(int reason)
-{
-    PgStat_MsgRecoveryConflict msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* --------
- * pgstat_report_deadlock() -
- *
- *    Tell the collector about a deadlock detected.
- * --------
- */
-void
-pgstat_report_deadlock(void)
-{
-    PgStat_MsgDeadlock msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* --------
- * pgstat_report_tempfile() -
- *
- *    Tell the collector about a temporary file.
- * --------
- */
-void
-pgstat_report_tempfile(size_t filesize)
-{
-    PgStat_MsgTempFile msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/* ----------
- * pgstat_ping() -
- *
- *    Send some junk data to the collector to increase traffic.
- * ----------
- */
-void
-pgstat_ping(void)
-{
-    PgStat_MsgDummy msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* ----------
- * pgstat_send_inquiry() -
- *
- *    Notify collector that we need fresh data.
- * ----------
- */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/*
- * Initialize function call usage data.
- * Called by the executor before invoking a function.
- */
-void
-pgstat_init_function_usage(FunctionCallInfo fcinfo,
-                           PgStat_FunctionCallUsage *fcu)
-{
-    PgStat_BackendFunctionEntry *htabent;
-    bool        found;
-
-    if (pgstat_track_functions <= fcinfo->flinfo->fn_stats)
-    {
-        /* stats not wanted */
-        fcu->fs = NULL;
-        return;
-    }
-
-    if (!pgStatFunctions)
-    {
-        /* First time through - initialize function stat table */
-        HASHCTL        hash_ctl;
-
-        memset(&hash_ctl, 0, sizeof(hash_ctl));
-        hash_ctl.keysize = sizeof(Oid);
-        hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
-        pgStatFunctions = hash_create("Function stat entries",
-                                      PGSTAT_FUNCTION_HASH_SIZE,
-                                      &hash_ctl,
-                                      HASH_ELEM | HASH_BLOBS);
-    }
-
-    /* Get the stats entry for this function, create if necessary */
-    htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
-                          HASH_ENTER, &found);
-    if (!found)
-        MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
-
-    fcu->fs = &htabent->f_counts;
-
-    /* save stats for this function, later used to compensate for recursion */
-    fcu->save_f_total_time = htabent->f_counts.f_total_time;
-
-    /* save current backend-wide total time */
-    fcu->save_total = total_func_time;
-
-    /* get clock time as of function start */
-    INSTR_TIME_SET_CURRENT(fcu->f_start);
-}
-
-/*
- * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
- *        for specified function
- *
- * If no entry, return NULL, don't create a new one
- */
-PgStat_BackendFunctionEntry *
-find_funcstat_entry(Oid func_id)
-{
-    if (pgStatFunctions == NULL)
-        return NULL;
-
-    return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
-                                                       (void *) &func_id,
-                                                       HASH_FIND, NULL);
-}
-
-/*
- * Calculate function call usage and update stat counters.
- * Called by the executor after invoking a function.
- *
- * In the case of a set-returning function that runs in value-per-call mode,
- * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
- * calls for what the user considers a single call of the function.  The
- * finalize flag should be TRUE on the last call.
- */
-void
-pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
-{
-    PgStat_FunctionCounts *fs = fcu->fs;
-    instr_time    f_total;
-    instr_time    f_others;
-    instr_time    f_self;
-
-    /* stats not wanted? */
-    if (fs == NULL)
-        return;
-
-    /* total elapsed time in this function call */
-    INSTR_TIME_SET_CURRENT(f_total);
-    INSTR_TIME_SUBTRACT(f_total, fcu->f_start);
-
-    /* self usage: elapsed minus anything already charged to other calls */
-    f_others = total_func_time;
-    INSTR_TIME_SUBTRACT(f_others, fcu->save_total);
-    f_self = f_total;
-    INSTR_TIME_SUBTRACT(f_self, f_others);
-
-    /* update backend-wide total time */
-    INSTR_TIME_ADD(total_func_time, f_self);
-
-    /*
-     * Compute the new f_total_time as the total elapsed time added to the
-     * pre-call value of f_total_time.  This is necessary to avoid
-     * double-counting any time taken by recursive calls of myself.  (We do
-     * not need any similar kluge for self time, since that already excludes
-     * any recursive calls.)
-     */
-    INSTR_TIME_ADD(f_total, fcu->save_f_total_time);
-
-    /* update counters in function stats table */
-    if (finalize)
-        fs->f_numcalls++;
-    fs->f_total_time = f_total;
-    INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
-}
-
-
-/* ----------
- * pgstat_initstats() -
- *
- *    Initialize a relcache entry to count access statistics.
- *    Called whenever a relation is opened.
- *
- *    We assume that a relcache entry's pgstat_info field is zeroed by
- *    relcache.c when the relcache entry is made; thereafter it is long-lived
- *    data.  We can avoid repeated searches of the TabStatus arrays when the
- *    same relation is touched repeatedly within a transaction.
- * ----------
- */
-void
-pgstat_initstats(Relation rel)
-{
-    Oid            rel_id = rel->rd_id;
-    char        relkind = rel->rd_rel->relkind;
-
-    /* We only count stats for things that have storage */
-    if (!(relkind == RELKIND_RELATION ||
-          relkind == RELKIND_MATVIEW ||
-          relkind == RELKIND_INDEX ||
-          relkind == RELKIND_TOASTVALUE ||
-          relkind == RELKIND_SEQUENCE))
-    {
-        rel->pgstat_info = NULL;
-        return;
-    }
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-    {
-        /* We're not counting at all */
-        rel->pgstat_info = NULL;
-        return;
-    }
-
-    /*
-     * If we already set up this relation in the current transaction, nothing
-     * to do.
-     */
-    if (rel->pgstat_info != NULL &&
-        rel->pgstat_info->t_id == rel_id)
-        return;
-
-    /* Else find or make the PgStat_TableStatus entry, and update link */
-    rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
-}
-
-/*
- * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
- */
-static PgStat_TableStatus *
-get_tabstat_entry(Oid rel_id, bool isshared)
-{
-    TabStatHashEntry *hash_entry;
-    PgStat_TableStatus *entry;
-    TabStatusArray *tsa;
-    bool        found;
-
-    /*
-     * Create hash table if we don't have it already.
-     */
-    if (pgStatTabHash == NULL)
-    {
-        HASHCTL        ctl;
-
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
-    }
-
-    /*
-     * Find an entry or create a new one.
-     */
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
-    if (!found)
-    {
-        /* initialize new entry with null pointer */
-        hash_entry->tsa_entry = NULL;
-    }
-
-    /*
-     * If entry is already valid, we're done.
-     */
-    if (hash_entry->tsa_entry)
-        return hash_entry->tsa_entry;
-
-    /*
-     * Locate the first pgStatTabList entry with free space, making a new list
-     * entry if needed.  Note that we could get an OOM failure here, but if so
-     * we have left the hashtable and the list in a consistent state.
-     */
-    if (pgStatTabList == NULL)
-    {
-        /* Set up first pgStatTabList entry */
-        pgStatTabList = (TabStatusArray *)
-            MemoryContextAllocZero(TopMemoryContext,
-                                   sizeof(TabStatusArray));
-    }
-
-    tsa = pgStatTabList;
-    while (tsa->tsa_used >= TABSTAT_QUANTUM)
-    {
-        if (tsa->tsa_next == NULL)
-            tsa->tsa_next = (TabStatusArray *)
-                MemoryContextAllocZero(TopMemoryContext,
-                                       sizeof(TabStatusArray));
-        tsa = tsa->tsa_next;
-    }
-
-    /*
-     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
-     * the entry was already zeroed, either at creation or after last use.
-     */
-    entry = &tsa->tsa_entries[tsa->tsa_used++];
-    entry->t_id = rel_id;
-    entry->t_shared = isshared;
-
-    /*
-     * Now we can fill the entry in pgStatTabHash.
-     */
-    hash_entry->tsa_entry = entry;
-
-    return entry;
-}
-
-/*
- * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
- *
- * If no entry, return NULL, don't create a new one
- *
- * Note: if we got an error in the most recent execution of pgstat_report_stat,
- * it's possible that an entry exists but there's no hashtable entry for it.
- * That's okay, we'll treat this case as "doesn't exist".
- */
-PgStat_TableStatus *
-find_tabstat_entry(Oid rel_id)
-{
-    TabStatHashEntry *hash_entry;
-
-    /* If hashtable doesn't exist, there are no entries at all */
-    if (!pgStatTabHash)
-        return NULL;
-
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
-    if (!hash_entry)
-        return NULL;
-
-    /* Note that this step could also return NULL, but that's correct */
-    return hash_entry->tsa_entry;
-}
-
-/*
- * get_tabstat_stack_level - add a new (sub)transaction stack entry if needed
- */
-static PgStat_SubXactStatus *
-get_tabstat_stack_level(int nest_level)
-{
-    PgStat_SubXactStatus *xact_state;
-
-    xact_state = pgStatXactStack;
-    if (xact_state == NULL || xact_state->nest_level != nest_level)
-    {
-        xact_state = (PgStat_SubXactStatus *)
-            MemoryContextAlloc(TopTransactionContext,
-                               sizeof(PgStat_SubXactStatus));
-        xact_state->nest_level = nest_level;
-        xact_state->prev = pgStatXactStack;
-        xact_state->first = NULL;
-        pgStatXactStack = xact_state;
-    }
-    return xact_state;
-}
-
-/*
- * add_tabstat_xact_level - add a new (sub)transaction state record
- */
-static void
-add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level)
-{
-    PgStat_SubXactStatus *xact_state;
-    PgStat_TableXactStatus *trans;
-
-    /*
-     * If this is the first rel to be modified at the current nest level, we
-     * first have to push a transaction stack entry.
-     */
-    xact_state = get_tabstat_stack_level(nest_level);
-
-    /* Now make a per-table stack entry */
-    trans = (PgStat_TableXactStatus *)
-        MemoryContextAllocZero(TopTransactionContext,
-                               sizeof(PgStat_TableXactStatus));
-    trans->nest_level = nest_level;
-    trans->upper = pgstat_info->trans;
-    trans->parent = pgstat_info;
-    trans->next = xact_state->first;
-    xact_state->first = trans;
-    pgstat_info->trans = trans;
-}
-
-/*
- * pgstat_count_heap_insert - count a tuple insertion of n tuples
- */
-void
-pgstat_count_heap_insert(Relation rel, PgStat_Counter n)
-{
-    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
-
-    if (pgstat_info != NULL)
-    {
-        /* We have to log the effect at the proper transactional level */
-        int            nest_level = GetCurrentTransactionNestLevel();
-
-        if (pgstat_info->trans == NULL ||
-            pgstat_info->trans->nest_level != nest_level)
-            add_tabstat_xact_level(pgstat_info, nest_level);
-
-        pgstat_info->trans->tuples_inserted += n;
-    }
-}
-
-/*
- * pgstat_count_heap_update - count a tuple update
- */
-void
-pgstat_count_heap_update(Relation rel, bool hot)
-{
-    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
-
-    if (pgstat_info != NULL)
-    {
-        /* We have to log the effect at the proper transactional level */
-        int            nest_level = GetCurrentTransactionNestLevel();
-
-        if (pgstat_info->trans == NULL ||
-            pgstat_info->trans->nest_level != nest_level)
-            add_tabstat_xact_level(pgstat_info, nest_level);
-
-        pgstat_info->trans->tuples_updated++;
-
-        /* t_tuples_hot_updated is nontransactional, so just advance it */
-        if (hot)
-            pgstat_info->t_counts.t_tuples_hot_updated++;
-    }
-}
-
-/*
- * pgstat_count_heap_delete - count a tuple deletion
- */
-void
-pgstat_count_heap_delete(Relation rel)
-{
-    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
-
-    if (pgstat_info != NULL)
-    {
-        /* We have to log the effect at the proper transactional level */
-        int            nest_level = GetCurrentTransactionNestLevel();
-
-        if (pgstat_info->trans == NULL ||
-            pgstat_info->trans->nest_level != nest_level)
-            add_tabstat_xact_level(pgstat_info, nest_level);
-
-        pgstat_info->trans->tuples_deleted++;
-    }
-}
-
-/*
- * pgstat_truncate_save_counters
- *
- * Whenever a table is truncated, we save its i/u/d counters so that they can
- * be cleared, and if the (sub)xact that executed the truncate later aborts,
- * the counters can be restored to the saved (pre-truncate) values.  Note we do
- * this on the first truncate in any particular subxact level only.
- */
-static void
-pgstat_truncate_save_counters(PgStat_TableXactStatus *trans)
-{
-    if (!trans->truncated)
-    {
-        trans->inserted_pre_trunc = trans->tuples_inserted;
-        trans->updated_pre_trunc = trans->tuples_updated;
-        trans->deleted_pre_trunc = trans->tuples_deleted;
-        trans->truncated = true;
-    }
-}
-
-/*
- * pgstat_truncate_restore_counters - restore counters when a truncate aborts
- */
-static void
-pgstat_truncate_restore_counters(PgStat_TableXactStatus *trans)
-{
-    if (trans->truncated)
-    {
-        trans->tuples_inserted = trans->inserted_pre_trunc;
-        trans->tuples_updated = trans->updated_pre_trunc;
-        trans->tuples_deleted = trans->deleted_pre_trunc;
-    }
-}
-
-/*
- * pgstat_count_truncate - update tuple counters due to truncate
- */
-void
-pgstat_count_truncate(Relation rel)
-{
-    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
-
-    if (pgstat_info != NULL)
-    {
-        /* We have to log the effect at the proper transactional level */
-        int            nest_level = GetCurrentTransactionNestLevel();
-
-        if (pgstat_info->trans == NULL ||
-            pgstat_info->trans->nest_level != nest_level)
-            add_tabstat_xact_level(pgstat_info, nest_level);
-
-        pgstat_truncate_save_counters(pgstat_info->trans);
-        pgstat_info->trans->tuples_inserted = 0;
-        pgstat_info->trans->tuples_updated = 0;
-        pgstat_info->trans->tuples_deleted = 0;
-    }
-}
-
-/*
- * pgstat_update_heap_dead_tuples - update dead-tuples count
- *
- * The semantics of this are that we are reporting the nontransactional
- * recovery of "delta" dead tuples; so t_delta_dead_tuples decreases
- * rather than increasing, and the change goes straight into the per-table
- * counter, not into transactional state.
- */
-void
-pgstat_update_heap_dead_tuples(Relation rel, int delta)
-{
-    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
-
-    if (pgstat_info != NULL)
-        pgstat_info->t_counts.t_delta_dead_tuples -= delta;
-}
-
-
-/* ----------
- * AtEOXact_PgStat
- *
- *    Called from access/transam/xact.c at top-level transaction commit/abort.
- * ----------
- */
-void
-AtEOXact_PgStat(bool isCommit)
-{
-    PgStat_SubXactStatus *xact_state;
-
-    /*
-     * Count transaction commit or abort.  (We use counters, not just bools,
-     * in case the reporting message isn't sent right away.)
-     */
-    if (isCommit)
-        pgStatXactCommit++;
-    else
-        pgStatXactRollback++;
-
-    /*
-     * Transfer transactional insert/update counts into the base tabstat
-     * entries.  We don't bother to free any of the transactional state, since
-     * it's all in TopTransactionContext and will go away anyway.
-     */
-    xact_state = pgStatXactStack;
-    if (xact_state != NULL)
-    {
-        PgStat_TableXactStatus *trans;
-
-        Assert(xact_state->nest_level == 1);
-        Assert(xact_state->prev == NULL);
-        for (trans = xact_state->first; trans != NULL; trans = trans->next)
-        {
-            PgStat_TableStatus *tabstat;
-
-            Assert(trans->nest_level == 1);
-            Assert(trans->upper == NULL);
-            tabstat = trans->parent;
-            Assert(tabstat->trans == trans);
-            /* restore pre-truncate stats (if any) in case of aborted xact */
-            if (!isCommit)
-                pgstat_truncate_restore_counters(trans);
-            /* count attempted actions regardless of commit/abort */
-            tabstat->t_counts.t_tuples_inserted += trans->tuples_inserted;
-            tabstat->t_counts.t_tuples_updated += trans->tuples_updated;
-            tabstat->t_counts.t_tuples_deleted += trans->tuples_deleted;
-            if (isCommit)
-            {
-                tabstat->t_counts.t_truncated = trans->truncated;
-                if (trans->truncated)
-                {
-                    /* forget live/dead stats seen by backend thus far */
-                    tabstat->t_counts.t_delta_live_tuples = 0;
-                    tabstat->t_counts.t_delta_dead_tuples = 0;
-                }
-                /* insert adds a live tuple, delete removes one */
-                tabstat->t_counts.t_delta_live_tuples +=
-                    trans->tuples_inserted - trans->tuples_deleted;
-                /* update and delete each create a dead tuple */
-                tabstat->t_counts.t_delta_dead_tuples +=
-                    trans->tuples_updated + trans->tuples_deleted;
-                /* insert, update, delete each count as one change event */
-                tabstat->t_counts.t_changed_tuples +=
-                    trans->tuples_inserted + trans->tuples_updated +
-                    trans->tuples_deleted;
-            }
-            else
-            {
-                /* inserted tuples are dead, deleted tuples are unaffected */
-                tabstat->t_counts.t_delta_dead_tuples +=
-                    trans->tuples_inserted + trans->tuples_updated;
-                /* an aborted xact generates no changed_tuple events */
-            }
-            tabstat->trans = NULL;
-        }
-    }
-    pgStatXactStack = NULL;
-
-    /* Make sure any stats snapshot is thrown away */
-    pgstat_clear_snapshot();
-}
-
-/* ----------
- * AtEOSubXact_PgStat
- *
- *    Called from access/transam/xact.c at subtransaction commit/abort.
- * ----------
- */
-void
-AtEOSubXact_PgStat(bool isCommit, int nestDepth)
-{
-    PgStat_SubXactStatus *xact_state;
-
-    /*
-     * Transfer transactional insert/update counts into the next higher
-     * subtransaction state.
-     */
-    xact_state = pgStatXactStack;
-    if (xact_state != NULL &&
-        xact_state->nest_level >= nestDepth)
-    {
-        PgStat_TableXactStatus *trans;
-        PgStat_TableXactStatus *next_trans;
-
-        /* delink xact_state from stack immediately to simplify reuse case */
-        pgStatXactStack = xact_state->prev;
-
-        for (trans = xact_state->first; trans != NULL; trans = next_trans)
-        {
-            PgStat_TableStatus *tabstat;
-
-            next_trans = trans->next;
-            Assert(trans->nest_level == nestDepth);
-            tabstat = trans->parent;
-            Assert(tabstat->trans == trans);
-            if (isCommit)
-            {
-                if (trans->upper && trans->upper->nest_level == nestDepth - 1)
-                {
-                    if (trans->truncated)
-                    {
-                        /* propagate the truncate status one level up */
-                        pgstat_truncate_save_counters(trans->upper);
-                        /* replace upper xact stats with ours */
-                        trans->upper->tuples_inserted = trans->tuples_inserted;
-                        trans->upper->tuples_updated = trans->tuples_updated;
-                        trans->upper->tuples_deleted = trans->tuples_deleted;
-                    }
-                    else
-                    {
-                        trans->upper->tuples_inserted += trans->tuples_inserted;
-                        trans->upper->tuples_updated += trans->tuples_updated;
-                        trans->upper->tuples_deleted += trans->tuples_deleted;
-                    }
-                    tabstat->trans = trans->upper;
-                    pfree(trans);
-                }
-                else
-                {
-                    /*
-                     * When there isn't an immediate parent state, we can just
-                     * reuse the record instead of going through a
-                     * palloc/pfree pushup (this works since it's all in
-                     * TopTransactionContext anyway).  We have to re-link it
-                     * into the parent level, though, and that might mean
-                     * pushing a new entry into the pgStatXactStack.
-                     */
-                    PgStat_SubXactStatus *upper_xact_state;
-
-                    upper_xact_state = get_tabstat_stack_level(nestDepth - 1);
-                    trans->next = upper_xact_state->first;
-                    upper_xact_state->first = trans;
-                    trans->nest_level = nestDepth - 1;
-                }
-            }
-            else
-            {
-                /*
-                 * On abort, update top-level tabstat counts, then forget the
-                 * subtransaction
-                 */
-
-                /* first restore values obliterated by truncate */
-                pgstat_truncate_restore_counters(trans);
-                /* count attempted actions regardless of commit/abort */
-                tabstat->t_counts.t_tuples_inserted += trans->tuples_inserted;
-                tabstat->t_counts.t_tuples_updated += trans->tuples_updated;
-                tabstat->t_counts.t_tuples_deleted += trans->tuples_deleted;
-                /* inserted tuples are dead, deleted tuples are unaffected */
-                tabstat->t_counts.t_delta_dead_tuples +=
-                    trans->tuples_inserted + trans->tuples_updated;
-                tabstat->trans = trans->upper;
-                pfree(trans);
-            }
-        }
-        pfree(xact_state);
-    }
-}
-
-
-/*
- * AtPrepare_PgStat
- *        Save the transactional stats state at 2PC transaction prepare.
- *
- * In this phase we just generate 2PC records for all the pending
- * transaction-dependent stats work.
- */
-void
-AtPrepare_PgStat(void)
-{
-    PgStat_SubXactStatus *xact_state;
-
-    xact_state = pgStatXactStack;
-    if (xact_state != NULL)
-    {
-        PgStat_TableXactStatus *trans;
-
-        Assert(xact_state->nest_level == 1);
-        Assert(xact_state->prev == NULL);
-        for (trans = xact_state->first; trans != NULL; trans = trans->next)
-        {
-            PgStat_TableStatus *tabstat;
-            TwoPhasePgStatRecord record;
-
-            Assert(trans->nest_level == 1);
-            Assert(trans->upper == NULL);
-            tabstat = trans->parent;
-            Assert(tabstat->trans == trans);
-
-            record.tuples_inserted = trans->tuples_inserted;
-            record.tuples_updated = trans->tuples_updated;
-            record.tuples_deleted = trans->tuples_deleted;
-            record.inserted_pre_trunc = trans->inserted_pre_trunc;
-            record.updated_pre_trunc = trans->updated_pre_trunc;
-            record.deleted_pre_trunc = trans->deleted_pre_trunc;
-            record.t_id = tabstat->t_id;
-            record.t_shared = tabstat->t_shared;
-            record.t_truncated = trans->truncated;
-
-            RegisterTwoPhaseRecord(TWOPHASE_RM_PGSTAT_ID, 0,
-                                   &record, sizeof(TwoPhasePgStatRecord));
-        }
-    }
-}
-
-/*
- * PostPrepare_PgStat
- *        Clean up after successful PREPARE.
- *
- * All we need do here is unlink the transaction stats state from the
- * nontransactional state.  The nontransactional action counts will be
- * reported to the stats collector immediately, while the effects on live
- * and dead tuple counts are preserved in the 2PC state file.
- *
- * Note: AtEOXact_PgStat is not called during PREPARE.
- */
-void
-PostPrepare_PgStat(void)
-{
-    PgStat_SubXactStatus *xact_state;
-
-    /*
-     * We don't bother to free any of the transactional state, since it's all
-     * in TopTransactionContext and will go away anyway.
-     */
-    xact_state = pgStatXactStack;
-    if (xact_state != NULL)
-    {
-        PgStat_TableXactStatus *trans;
-
-        for (trans = xact_state->first; trans != NULL; trans = trans->next)
-        {
-            PgStat_TableStatus *tabstat;
-
-            tabstat = trans->parent;
-            tabstat->trans = NULL;
-        }
-    }
-    pgStatXactStack = NULL;
-
-    /* Make sure any stats snapshot is thrown away */
-    pgstat_clear_snapshot();
-}
-
-/*
- * 2PC processing routine for COMMIT PREPARED case.
- *
- * Load the saved counts into our local pgstats state.
- */
-void
-pgstat_twophase_postcommit(TransactionId xid, uint16 info,
-                           void *recdata, uint32 len)
-{
-    TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata;
-    PgStat_TableStatus *pgstat_info;
-
-    /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
-
-    /* Same math as in AtEOXact_PgStat, commit case */
-    pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
-    pgstat_info->t_counts.t_tuples_updated += rec->tuples_updated;
-    pgstat_info->t_counts.t_tuples_deleted += rec->tuples_deleted;
-    pgstat_info->t_counts.t_truncated = rec->t_truncated;
-    if (rec->t_truncated)
-    {
-        /* forget live/dead stats seen by backend thus far */
-        pgstat_info->t_counts.t_delta_live_tuples = 0;
-        pgstat_info->t_counts.t_delta_dead_tuples = 0;
-    }
-    pgstat_info->t_counts.t_delta_live_tuples +=
-        rec->tuples_inserted - rec->tuples_deleted;
-    pgstat_info->t_counts.t_delta_dead_tuples +=
-        rec->tuples_updated + rec->tuples_deleted;
-    pgstat_info->t_counts.t_changed_tuples +=
-        rec->tuples_inserted + rec->tuples_updated +
-        rec->tuples_deleted;
-}
-
-/*
- * 2PC processing routine for ROLLBACK PREPARED case.
- *
- * Load the saved counts into our local pgstats state, but treat them
- * as aborted.
- */
-void
-pgstat_twophase_postabort(TransactionId xid, uint16 info,
-                          void *recdata, uint32 len)
-{
-    TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata;
-    PgStat_TableStatus *pgstat_info;
-
-    /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
-
-    /* Same math as in AtEOXact_PgStat, abort case */
-    if (rec->t_truncated)
-    {
-        rec->tuples_inserted = rec->inserted_pre_trunc;
-        rec->tuples_updated = rec->updated_pre_trunc;
-        rec->tuples_deleted = rec->deleted_pre_trunc;
-    }
-    pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
-    pgstat_info->t_counts.t_tuples_updated += rec->tuples_updated;
-    pgstat_info->t_counts.t_tuples_deleted += rec->tuples_deleted;
-    pgstat_info->t_counts.t_delta_dead_tuples +=
-        rec->tuples_inserted + rec->tuples_updated;
-}
-
-
-/* ----------
- * pgstat_fetch_stat_dbentry() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
- * ----------
- */
-PgStat_StatDBEntry *
-pgstat_fetch_stat_dbentry(Oid dbid)
-{
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
-}
-
-
-/* ----------
- * pgstat_fetch_stat_tabentry() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one table or NULL. NULL doesn't mean
- *    that the table doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
- * ----------
- */
-PgStat_StatTabEntry *
-pgstat_fetch_stat_tabentry(Oid relid)
-{
-    Oid            dbid;
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
-
-    /*
-     * If we didn't find it, maybe it's a shared table.
-     */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
-
-    return NULL;
-}
-
-
-/* ----------
- * pgstat_fetch_stat_funcentry() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one function or NULL.
- * ----------
- */
-PgStat_StatFuncEntry *
-pgstat_fetch_stat_funcentry(Oid func_id)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry = NULL;
-
-    /* load the stats file if needed */
-    backend_read_statsfile();
-
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
-
-    return funcentry;
-}
-
-
-/* ----------
- * pgstat_fetch_stat_beentry() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    our local copy of the current-activity entry for one backend.
- *
- *    NB: caller is responsible for a check if the user is permitted to see
- *    this info (especially the querystring).
- * ----------
- */
-PgBackendStatus *
-pgstat_fetch_stat_beentry(int beid)
-{
-    pgstat_read_current_status();
-
-    if (beid < 1 || beid > localNumBackends)
-        return NULL;
-
-    return &localBackendStatusTable[beid - 1].backendStatus;
-}
-
-
-/* ----------
- * pgstat_fetch_stat_local_beentry() -
- *
- *    Like pgstat_fetch_stat_beentry() but with locally computed additions (like
- *    xid and xmin values of the backend)
- *
- *    NB: caller is responsible for a check if the user is permitted to see
- *    this info (especially the querystring).
- * ----------
- */
-LocalPgBackendStatus *
-pgstat_fetch_stat_local_beentry(int beid)
-{
-    pgstat_read_current_status();
-
-    if (beid < 1 || beid > localNumBackends)
-        return NULL;
-
-    return &localBackendStatusTable[beid - 1];
-}
-
-
-/* ----------
- * pgstat_fetch_stat_numbackends() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the maximum current backend id.
- * ----------
- */
-int
-pgstat_fetch_stat_numbackends(void)
-{
-    pgstat_read_current_status();
-
-    return localNumBackends;
-}
-
-/*
- * ---------
- * pgstat_fetch_stat_archiver() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the archiver statistics struct.
- * ---------
- */
-PgStat_ArchiverStats *
-pgstat_fetch_stat_archiver(void)
-{
-    backend_read_statsfile();
-
-    return &archiverStats;
-}
-
-
-/*
- * ---------
- * pgstat_fetch_global() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the global statistics struct.
- * ---------
- */
-PgStat_GlobalStats *
-pgstat_fetch_global(void)
-{
-    backend_read_statsfile();
-
-    return &globalStats;
-}
-
-
-/* ------------------------------------------------------------
- * Functions for management of the shared-memory PgBackendStatus array
- * ------------------------------------------------------------
- */
-
-static PgBackendStatus *BackendStatusArray = NULL;
-static PgBackendStatus *MyBEEntry = NULL;
-static char *BackendAppnameBuffer = NULL;
-static char *BackendClientHostnameBuffer = NULL;
-static char *BackendActivityBuffer = NULL;
-static Size BackendActivityBufferSize = 0;
-#ifdef USE_SSL
-static PgBackendSSLStatus *BackendSslStatusBuffer = NULL;
-#endif
-
-
-/*
- * Report shared-memory space needed by CreateSharedBackendStatus.
- */
-Size
-BackendStatusShmemSize(void)
-{
-    Size        size;
-
-    /* BackendStatusArray: */
-    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
-    /* BackendAppnameBuffer: */
-    size = add_size(size,
-                    mul_size(NAMEDATALEN, NumBackendStatSlots));
-    /* BackendClientHostnameBuffer: */
-    size = add_size(size,
-                    mul_size(NAMEDATALEN, NumBackendStatSlots));
-    /* BackendActivityBuffer: */
-    size = add_size(size,
-                    mul_size(pgstat_track_activity_query_size, NumBackendStatSlots));
-#ifdef USE_SSL
-    /* BackendSslStatusBuffer: */
-    size = add_size(size,
-                    mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots));
-#endif
-    return size;
-}
-
-/*
- * Initialize the shared status array and several string buffers
- * during postmaster startup.
- */
-void
-CreateSharedBackendStatus(void)
-{
-    Size        size;
-    bool        found;
-    int            i;
-    char       *buffer;
-
-    /* Create or attach to the shared array */
-    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
-    BackendStatusArray = (PgBackendStatus *)
-        ShmemInitStruct("Backend Status Array", size, &found);
-
-    if (!found)
-    {
-        /*
-         * We're the first - initialize.
-         */
-        MemSet(BackendStatusArray, 0, size);
-    }
-
-    /* Create or attach to the shared appname buffer */
-    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
-    BackendAppnameBuffer = (char *)
-        ShmemInitStruct("Backend Application Name Buffer", size, &found);
-
-    if (!found)
-    {
-        MemSet(BackendAppnameBuffer, 0, size);
-
-        /* Initialize st_appname pointers. */
-        buffer = BackendAppnameBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_appname = buffer;
-            buffer += NAMEDATALEN;
-        }
-    }
-
-    /* Create or attach to the shared client hostname buffer */
-    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
-    BackendClientHostnameBuffer = (char *)
-        ShmemInitStruct("Backend Client Host Name Buffer", size, &found);
-
-    if (!found)
-    {
-        MemSet(BackendClientHostnameBuffer, 0, size);
-
-        /* Initialize st_clienthostname pointers. */
-        buffer = BackendClientHostnameBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_clienthostname = buffer;
-            buffer += NAMEDATALEN;
-        }
-    }
-
-    /* Create or attach to the shared activity buffer */
-    BackendActivityBufferSize = mul_size(pgstat_track_activity_query_size,
-                                         NumBackendStatSlots);
-    BackendActivityBuffer = (char *)
-        ShmemInitStruct("Backend Activity Buffer",
-                        BackendActivityBufferSize,
-                        &found);
-
-    if (!found)
-    {
-        MemSet(BackendActivityBuffer, 0, BackendActivityBufferSize);
-
-        /* Initialize st_activity pointers. */
-        buffer = BackendActivityBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_activity_raw = buffer;
-            buffer += pgstat_track_activity_query_size;
-        }
-    }
-
-#ifdef USE_SSL
-    /* Create or attach to the shared SSL status buffer */
-    size = mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots);
-    BackendSslStatusBuffer = (PgBackendSSLStatus *)
-        ShmemInitStruct("Backend SSL Status Buffer", size, &found);
-
-    if (!found)
-    {
-        PgBackendSSLStatus *ptr;
-
-        MemSet(BackendSslStatusBuffer, 0, size);
-
-        /* Initialize st_sslstatus pointers. */
-        ptr = BackendSslStatusBuffer;
-        for (i = 0; i < NumBackendStatSlots; i++)
-        {
-            BackendStatusArray[i].st_sslstatus = ptr;
-            ptr++;
-        }
-    }
-#endif
-}
-
-
-/* ----------
- * pgstat_initialize() -
- *
- *    Initialize pgstats state, and set up our on-proc-exit hook.
- *    Called from InitPostgres and AuxiliaryProcessMain. For auxiliary process,
- *    MyBackendId is invalid. Otherwise, MyBackendId must be set,
- *    but we must not have started any transaction yet (since the
- *    exit hook must run after the last transaction exit).
- *    NOTE: MyDatabaseId isn't set yet; so the shutdown hook has to be careful.
- * ----------
- */
-void
-pgstat_initialize(void)
-{
-    /* Initialize MyBEEntry */
-    if (MyBackendId != InvalidBackendId)
-    {
-        Assert(MyBackendId >= 1 && MyBackendId <= MaxBackends);
-        MyBEEntry = &BackendStatusArray[MyBackendId - 1];
-    }
-    else
-    {
-        /* Must be an auxiliary process */
-        Assert(MyAuxProcType != NotAnAuxProcess);
-
-        /*
-         * Assign the MyBEEntry for an auxiliary process.  Since it doesn't
-         * have a BackendId, the slot is statically allocated based on the
-         * auxiliary process type (MyAuxProcType).  Backends use slots indexed
-         * in the range from 1 to MaxBackends (inclusive), so we use
-         * MaxBackends + AuxBackendType + 1 as the index of the slot for an
-         * auxiliary process.
-         */
-        MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
-    }
-
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
-}
-
-/* ----------
- * pgstat_bestart() -
- *
- *    Initialize this backend's entry in the PgBackendStatus array.
- *    Called from InitPostgres.
- *
- *    Apart from auxiliary processes, MyBackendId, MyDatabaseId,
- *    session userid, and application_name must be set for a
- *    backend (hence, this cannot be combined with pgstat_initialize).
- * ----------
- */
-void
-pgstat_bestart(void)
-{
-    SockAddr    clientaddr;
-    volatile PgBackendStatus *beentry;
-
-    /*
-     * To minimize the time spent modifying the PgBackendStatus entry, fetch
-     * all the needed data first.
-     */
-
-    /*
-     * We may not have a MyProcPort (eg, if this is the autovacuum process).
-     * If so, use all-zeroes client address, which is dealt with specially in
-     * pg_stat_get_backend_client_addr and pg_stat_get_backend_client_port.
-     */
-    if (MyProcPort)
-        memcpy(&clientaddr, &MyProcPort->raddr, sizeof(clientaddr));
-    else
-        MemSet(&clientaddr, 0, sizeof(clientaddr));
-
-    /*
-     * Initialize my status entry, following the protocol of bumping
-     * st_changecount before and after; and make sure it's even afterwards. We
-     * use a volatile pointer here to ensure the compiler doesn't try to get
-     * cute.
-     */
-    beentry = MyBEEntry;
-
-    /* pgstats state must be initialized from pgstat_initialize() */
-    Assert(beentry != NULL);
-
-    if (MyBackendId != InvalidBackendId)
-    {
-        if (IsAutoVacuumLauncherProcess())
-        {
-            /* Autovacuum Launcher */
-            beentry->st_backendType = B_AUTOVAC_LAUNCHER;
-        }
-        else if (IsAutoVacuumWorkerProcess())
-        {
-            /* Autovacuum Worker */
-            beentry->st_backendType = B_AUTOVAC_WORKER;
-        }
-        else if (am_walsender)
-        {
-            /* Wal sender */
-            beentry->st_backendType = B_WAL_SENDER;
-        }
-        else if (IsBackgroundWorker)
-        {
-            /* bgworker */
-            beentry->st_backendType = B_BG_WORKER;
-        }
-        else
-        {
-            /* client-backend */
-            beentry->st_backendType = B_BACKEND;
-        }
-    }
-    else
-    {
-        /* Must be an auxiliary process */
-        Assert(MyAuxProcType != NotAnAuxProcess);
-        switch (MyAuxProcType)
-        {
-            case StartupProcess:
-                beentry->st_backendType = B_STARTUP;
-                break;
-            case BgWriterProcess:
-                beentry->st_backendType = B_BG_WRITER;
-                break;
-            case ArchiverProcess:
-                beentry->st_backendType = B_ARCHIVER;
-                break;
-            case CheckpointerProcess:
-                beentry->st_backendType = B_CHECKPOINTER;
-                break;
-            case WalWriterProcess:
-                beentry->st_backendType = B_WAL_WRITER;
-                break;
-            case WalReceiverProcess:
-                beentry->st_backendType = B_WAL_RECEIVER;
-                break;
-            default:
-                elog(FATAL, "unrecognized process type: %d",
-                     (int) MyAuxProcType);
-                proc_exit(1);
-        }
-    }
-
-    do
-    {
-        pgstat_increment_changecount_before(beentry);
-    } while ((beentry->st_changecount & 1) == 0);
-
-    beentry->st_procpid = MyProcPid;
-    beentry->st_proc_start_timestamp = MyStartTimestamp;
-    beentry->st_activity_start_timestamp = 0;
-    beentry->st_state_start_timestamp = 0;
-    beentry->st_xact_start_timestamp = 0;
-    beentry->st_databaseid = MyDatabaseId;
-
-    /* We have userid for client-backends, wal-sender and bgworker processes */
-    if (beentry->st_backendType == B_BACKEND
-        || beentry->st_backendType == B_WAL_SENDER
-        || beentry->st_backendType == B_BG_WORKER)
-        beentry->st_userid = GetSessionUserId();
-    else
-        beentry->st_userid = InvalidOid;
-
-    beentry->st_clientaddr = clientaddr;
-    if (MyProcPort && MyProcPort->remote_hostname)
-        strlcpy(beentry->st_clienthostname, MyProcPort->remote_hostname,
-                NAMEDATALEN);
-    else
-        beentry->st_clienthostname[0] = '\0';
-#ifdef USE_SSL
-    if (MyProcPort && MyProcPort->ssl != NULL)
-    {
-        beentry->st_ssl = true;
-        beentry->st_sslstatus->ssl_bits = be_tls_get_cipher_bits(MyProcPort);
-        beentry->st_sslstatus->ssl_compression = be_tls_get_compression(MyProcPort);
-        strlcpy(beentry->st_sslstatus->ssl_version, be_tls_get_version(MyProcPort), NAMEDATALEN);
-        strlcpy(beentry->st_sslstatus->ssl_cipher, be_tls_get_cipher(MyProcPort), NAMEDATALEN);
-        be_tls_get_peer_subject_name(MyProcPort, beentry->st_sslstatus->ssl_client_dn, NAMEDATALEN);
-        be_tls_get_peer_serial(MyProcPort, beentry->st_sslstatus->ssl_client_serial, NAMEDATALEN);
-        be_tls_get_peer_issuer_name(MyProcPort, beentry->st_sslstatus->ssl_issuer_dn, NAMEDATALEN);
-    }
-    else
-    {
-        beentry->st_ssl = false;
-    }
-#else
-    beentry->st_ssl = false;
-#endif
-    beentry->st_state = STATE_UNDEFINED;
-    beentry->st_appname[0] = '\0';
-    beentry->st_activity_raw[0] = '\0';
-    /* Also make sure the last byte in each string area is always 0 */
-    beentry->st_clienthostname[NAMEDATALEN - 1] = '\0';
-    beentry->st_appname[NAMEDATALEN - 1] = '\0';
-    beentry->st_activity_raw[pgstat_track_activity_query_size - 1] = '\0';
-    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
-    beentry->st_progress_command_target = InvalidOid;
-
-    /*
-     * we don't zero st_progress_param here to save cycles; nobody should
-     * examine it until st_progress_command has been set to something other
-     * than PROGRESS_COMMAND_INVALID
-     */
-
-    pgstat_increment_changecount_after(beentry);
-
-    /* Update app name to current GUC setting */
-    if (application_name)
-        pgstat_report_appname(application_name);
-}
-
-/*
- * Shut down a single backend's statistics reporting at process exit.
- *
- * Flush any remaining statistics counts out to the collector.
- * Without this, operations triggered during backend exit (such as
- * temp table deletions) won't be counted.
- *
- * Lastly, clear out our entry in the PgBackendStatus array.
- */
-static void
-pgstat_beshutdown_hook(int code, Datum arg)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    /*
-     * If we got as far as discovering our own database ID, we can report what
-     * we did to the collector.  Otherwise, we'd be sending an invalid
-     * database ID, so forget it.  (This means that accesses to pg_database
-     * during failed backend starts might never get counted.)
-     */
-    if (OidIsValid(MyDatabaseId))
-        pgstat_report_stat(true);
-
-    /*
-     * Clear my status entry, following the protocol of bumping st_changecount
-     * before and after.  We use a volatile pointer here to ensure the
-     * compiler doesn't try to get cute.
-     */
-    pgstat_increment_changecount_before(beentry);
-
-    beentry->st_procpid = 0;    /* mark invalid */
-
-    pgstat_increment_changecount_after(beentry);
-}
-
-
-/* ----------
- * pgstat_report_activity() -
- *
- *    Called from tcop/postgres.c to report what the backend is actually doing
- *    (but note cmd_str can be NULL for certain cases).
- *
- * All updates of the status entry follow the protocol of bumping
- * st_changecount before and after.  We use a volatile pointer here to
- * ensure the compiler doesn't try to get cute.
- * ----------
- */
-void
-pgstat_report_activity(BackendState state, const char *cmd_str)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-    TimestampTz start_timestamp;
-    TimestampTz current_timestamp;
-    int            len = 0;
-
-    TRACE_POSTGRESQL_STATEMENT_STATUS(cmd_str);
-
-    if (!beentry)
-        return;
-
-    if (!pgstat_track_activities)
-    {
-        if (beentry->st_state != STATE_DISABLED)
-        {
-            volatile PGPROC *proc = MyProc;
-
-            /*
-             * track_activities is disabled, but we last reported a
-             * non-disabled state.  As our final update, change the state and
-             * clear fields we will not be updating anymore.
-             */
-            pgstat_increment_changecount_before(beentry);
-            beentry->st_state = STATE_DISABLED;
-            beentry->st_state_start_timestamp = 0;
-            beentry->st_activity_raw[0] = '\0';
-            beentry->st_activity_start_timestamp = 0;
-            /* st_xact_start_timestamp and wait_event_info are also disabled */
-            beentry->st_xact_start_timestamp = 0;
-            proc->wait_event_info = 0;
-            pgstat_increment_changecount_after(beentry);
-        }
-        return;
-    }
-
-    /*
-     * To minimize the time spent modifying the entry, fetch all the needed
-     * data first.
-     */
-    start_timestamp = GetCurrentStatementStartTimestamp();
-    if (cmd_str != NULL)
-    {
-        /*
-         * Compute length of to-be-stored string unaware of multi-byte
-         * characters. For speed reasons that'll get corrected on read, rather
-         * than computed every write.
-         */
-        len = Min(strlen(cmd_str), pgstat_track_activity_query_size - 1);
-    }
-    current_timestamp = GetCurrentTimestamp();
-
-    /*
-     * Now update the status entry
-     */
-    pgstat_increment_changecount_before(beentry);
-
-    beentry->st_state = state;
-    beentry->st_state_start_timestamp = current_timestamp;
-
-    if (cmd_str != NULL)
-    {
-        memcpy((char *) beentry->st_activity_raw, cmd_str, len);
-        beentry->st_activity_raw[len] = '\0';
-        beentry->st_activity_start_timestamp = start_timestamp;
-    }
-
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_start_command() -
- *
- * Set st_progress_command (and st_progress_command_target) in own backend
- * entry.  Also, zero-initialize st_progress_param array.
- *-----------
- */
-void
-pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    if (!beentry || !pgstat_track_activities)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_progress_command = cmdtype;
-    beentry->st_progress_command_target = relid;
-    MemSet(&beentry->st_progress_param, 0, sizeof(beentry->st_progress_param));
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_update_param() -
- *
- * Update index'th member in st_progress_param[] of own backend entry.
- *-----------
- */
-void
-pgstat_progress_update_param(int index, int64 val)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    Assert(index >= 0 && index < PGSTAT_NUM_PROGRESS_PARAM);
-
-    if (!beentry || !pgstat_track_activities)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_progress_param[index] = val;
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_update_multi_param() -
- *
- * Update multiple members in st_progress_param[] of own backend entry.
- * This is atomic; readers won't see intermediate states.
- *-----------
- */
-void
-pgstat_progress_update_multi_param(int nparam, const int *index,
-                                   const int64 *val)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-    int            i;
-
-    if (!beentry || !pgstat_track_activities || nparam == 0)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-
-    for (i = 0; i < nparam; ++i)
-    {
-        Assert(index[i] >= 0 && index[i] < PGSTAT_NUM_PROGRESS_PARAM);
-
-        beentry->st_progress_param[index[i]] = val[i];
-    }
-
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*-----------
- * pgstat_progress_end_command() -
- *
- * Reset st_progress_command (and st_progress_command_target) in own backend
- * entry.  This signals the end of the command.
- *-----------
- */
-void
-pgstat_progress_end_command(void)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    if (!beentry)
-        return;
-    if (!pgstat_track_activities
-        && beentry->st_progress_command == PROGRESS_COMMAND_INVALID)
-        return;
-
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
-    beentry->st_progress_command_target = InvalidOid;
-    pgstat_increment_changecount_after(beentry);
-}
-
-/* ----------
- * pgstat_report_appname() -
- *
- *    Called to update our application name.
- * ----------
- */
-void
-pgstat_report_appname(const char *appname)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-    int            len;
-
-    if (!beentry)
-        return;
-
-    /* This should be unnecessary if GUC did its job, but be safe */
-    len = pg_mbcliplen(appname, strlen(appname), NAMEDATALEN - 1);
-
-    /*
-     * Update my status entry, following the protocol of bumping
-     * st_changecount before and after.  We use a volatile pointer here to
-     * ensure the compiler doesn't try to get cute.
-     */
-    pgstat_increment_changecount_before(beentry);
-
-    memcpy((char *) beentry->st_appname, appname, len);
-    beentry->st_appname[len] = '\0';
-
-    pgstat_increment_changecount_after(beentry);
-}
-
-/*
- * Report current transaction start timestamp as the specified value.
- * Zero means there is no active transaction.
- */
-void
-pgstat_report_xact_timestamp(TimestampTz tstamp)
-{
-    volatile PgBackendStatus *beentry = MyBEEntry;
-
-    if (!pgstat_track_activities || !beentry)
-        return;
-
-    /*
-     * Update my status entry, following the protocol of bumping
-     * st_changecount before and after.  We use a volatile pointer here to
-     * ensure the compiler doesn't try to get cute.
-     */
-    pgstat_increment_changecount_before(beentry);
-    beentry->st_xact_start_timestamp = tstamp;
-    pgstat_increment_changecount_after(beentry);
-}
-
-/* ----------
- * pgstat_read_current_status() -
- *
- *    Copy the current contents of the PgBackendStatus array to local memory,
- *    if not already done in this transaction.
- * ----------
- */
-static void
-pgstat_read_current_status(void)
-{
-    volatile PgBackendStatus *beentry;
-    LocalPgBackendStatus *localtable;
-    LocalPgBackendStatus *localentry;
-    char       *localappname,
-               *localclienthostname,
-               *localactivity;
-#ifdef USE_SSL
-    PgBackendSSLStatus *localsslstatus;
-#endif
-    int            i;
-
-    Assert(!pgStatRunningInCollector);
-    if (localBackendStatusTable)
-        return;                    /* already done */
-
-    pgstat_setup_memcxt();
-
-    localtable = (LocalPgBackendStatus *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           sizeof(LocalPgBackendStatus) * NumBackendStatSlots);
-    localappname = (char *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           NAMEDATALEN * NumBackendStatSlots);
-    localclienthostname = (char *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           NAMEDATALEN * NumBackendStatSlots);
-    localactivity = (char *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           pgstat_track_activity_query_size * NumBackendStatSlots);
-#ifdef USE_SSL
-    localsslstatus = (PgBackendSSLStatus *)
-        MemoryContextAlloc(pgStatLocalContext,
-                           sizeof(PgBackendSSLStatus) * NumBackendStatSlots);
-#endif
-
-    localNumBackends = 0;
-
-    beentry = BackendStatusArray;
-    localentry = localtable;
-    for (i = 1; i <= NumBackendStatSlots; i++)
-    {
-        /*
-         * Follow the protocol of retrying if st_changecount changes while we
-         * copy the entry, or if it's odd.  (The check for odd is needed to
-         * cover the case where we are able to completely copy the entry while
-         * the source backend is between increment steps.)    We use a volatile
-         * pointer here to ensure the compiler doesn't try to get cute.
-         */
-        for (;;)
-        {
-            int            before_changecount;
-            int            after_changecount;
-
-            pgstat_save_changecount_before(beentry, before_changecount);
-
-            localentry->backendStatus.st_procpid = beentry->st_procpid;
-            if (localentry->backendStatus.st_procpid > 0)
-            {
-                memcpy(&localentry->backendStatus, (char *) beentry, sizeof(PgBackendStatus));
-
-                /*
-                 * strcpy is safe even if the string is modified concurrently,
-                 * because there's always a \0 at the end of the buffer.
-                 */
-                strcpy(localappname, (char *) beentry->st_appname);
-                localentry->backendStatus.st_appname = localappname;
-                strcpy(localclienthostname, (char *) beentry->st_clienthostname);
-                localentry->backendStatus.st_clienthostname = localclienthostname;
-                strcpy(localactivity, (char *) beentry->st_activity_raw);
-                localentry->backendStatus.st_activity_raw = localactivity;
-                localentry->backendStatus.st_ssl = beentry->st_ssl;
-#ifdef USE_SSL
-                if (beentry->st_ssl)
-                {
-                    memcpy(localsslstatus, beentry->st_sslstatus, sizeof(PgBackendSSLStatus));
-                    localentry->backendStatus.st_sslstatus = localsslstatus;
-                }
-#endif
-            }
-
-            pgstat_save_changecount_after(beentry, after_changecount);
-            if (before_changecount == after_changecount &&
-                (before_changecount & 1) == 0)
-                break;
-
-            /* Make sure we can break out of loop if stuck... */
-            CHECK_FOR_INTERRUPTS();
-        }
-
-        beentry++;
-        /* Only valid entries get included into the local array */
-        if (localentry->backendStatus.st_procpid > 0)
-        {
-            BackendIdGetTransactionIds(i,
-                                       &localentry->backend_xid,
-                                       &localentry->backend_xmin);
-
-            localentry++;
-            localappname += NAMEDATALEN;
-            localclienthostname += NAMEDATALEN;
-            localactivity += pgstat_track_activity_query_size;
-#ifdef USE_SSL
-            localsslstatus++;
-#endif
-            localNumBackends++;
-        }
-    }
-
-    /* Set the pointer only after completion of a valid table */
-    localBackendStatusTable = localtable;
-}
-
-/* ----------
- * pgstat_get_wait_event_type() -
- *
- *    Return a string representing the current wait event type, backend is
- *    waiting on.
- */
-const char *
-pgstat_get_wait_event_type(uint32 wait_event_info)
-{
-    uint32        classId;
-    const char *event_type;
-
-    /* report process as not waiting. */
-    if (wait_event_info == 0)
-        return NULL;
-
-    classId = wait_event_info & 0xFF000000;
-
-    switch (classId)
-    {
-        case PG_WAIT_LWLOCK:
-            event_type = "LWLock";
-            break;
-        case PG_WAIT_LOCK:
-            event_type = "Lock";
-            break;
-        case PG_WAIT_BUFFER_PIN:
-            event_type = "BufferPin";
-            break;
-        case PG_WAIT_ACTIVITY:
-            event_type = "Activity";
-            break;
-        case PG_WAIT_CLIENT:
-            event_type = "Client";
-            break;
-        case PG_WAIT_EXTENSION:
-            event_type = "Extension";
-            break;
-        case PG_WAIT_IPC:
-            event_type = "IPC";
-            break;
-        case PG_WAIT_TIMEOUT:
-            event_type = "Timeout";
-            break;
-        case PG_WAIT_IO:
-            event_type = "IO";
-            break;
-        default:
-            event_type = "???";
-            break;
-    }
-
-    return event_type;
-}
-
-/* ----------
- * pgstat_get_wait_event() -
- *
- *    Return a string representing the current wait event, backend is
- *    waiting on.
- */
-const char *
-pgstat_get_wait_event(uint32 wait_event_info)
-{
-    uint32        classId;
-    uint16        eventId;
-    const char *event_name;
-
-    /* report process as not waiting. */
-    if (wait_event_info == 0)
-        return NULL;
-
-    classId = wait_event_info & 0xFF000000;
-    eventId = wait_event_info & 0x0000FFFF;
-
-    switch (classId)
-    {
-        case PG_WAIT_LWLOCK:
-            event_name = GetLWLockIdentifier(classId, eventId);
-            break;
-        case PG_WAIT_LOCK:
-            event_name = GetLockNameFromTagType(eventId);
-            break;
-        case PG_WAIT_BUFFER_PIN:
-            event_name = "BufferPin";
-            break;
-        case PG_WAIT_ACTIVITY:
-            {
-                WaitEventActivity w = (WaitEventActivity) wait_event_info;
-
-                event_name = pgstat_get_wait_activity(w);
-                break;
-            }
-        case PG_WAIT_CLIENT:
-            {
-                WaitEventClient w = (WaitEventClient) wait_event_info;
-
-                event_name = pgstat_get_wait_client(w);
-                break;
-            }
-        case PG_WAIT_EXTENSION:
-            event_name = "Extension";
-            break;
-        case PG_WAIT_IPC:
-            {
-                WaitEventIPC w = (WaitEventIPC) wait_event_info;
-
-                event_name = pgstat_get_wait_ipc(w);
-                break;
-            }
-        case PG_WAIT_TIMEOUT:
-            {
-                WaitEventTimeout w = (WaitEventTimeout) wait_event_info;
-
-                event_name = pgstat_get_wait_timeout(w);
-                break;
-            }
-        case PG_WAIT_IO:
-            {
-                WaitEventIO w = (WaitEventIO) wait_event_info;
-
-                event_name = pgstat_get_wait_io(w);
-                break;
-            }
-        default:
-            event_name = "unknown wait event";
-            break;
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_activity() -
- *
- * Convert WaitEventActivity to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_activity(WaitEventActivity w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_ARCHIVER_MAIN:
-            event_name = "ArchiverMain";
-            break;
-        case WAIT_EVENT_AUTOVACUUM_MAIN:
-            event_name = "AutoVacuumMain";
-            break;
-        case WAIT_EVENT_BGWRITER_HIBERNATE:
-            event_name = "BgWriterHibernate";
-            break;
-        case WAIT_EVENT_BGWRITER_MAIN:
-            event_name = "BgWriterMain";
-            break;
-        case WAIT_EVENT_CHECKPOINTER_MAIN:
-            event_name = "CheckpointerMain";
-            break;
-        case WAIT_EVENT_LOGICAL_APPLY_MAIN:
-            event_name = "LogicalApplyMain";
-            break;
-        case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
-            event_name = "LogicalLauncherMain";
-            break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
-            break;
-        case WAIT_EVENT_RECOVERY_WAL_ALL:
-            event_name = "RecoveryWalAll";
-            break;
-        case WAIT_EVENT_RECOVERY_WAL_STREAM:
-            event_name = "RecoveryWalStream";
-            break;
-        case WAIT_EVENT_SYSLOGGER_MAIN:
-            event_name = "SysLoggerMain";
-            break;
-        case WAIT_EVENT_WAL_RECEIVER_MAIN:
-            event_name = "WalReceiverMain";
-            break;
-        case WAIT_EVENT_WAL_SENDER_MAIN:
-            event_name = "WalSenderMain";
-            break;
-        case WAIT_EVENT_WAL_WRITER_MAIN:
-            event_name = "WalWriterMain";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_client() -
- *
- * Convert WaitEventClient to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_client(WaitEventClient w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_CLIENT_READ:
-            event_name = "ClientRead";
-            break;
-        case WAIT_EVENT_CLIENT_WRITE:
-            event_name = "ClientWrite";
-            break;
-        case WAIT_EVENT_LIBPQWALRECEIVER_CONNECT:
-            event_name = "LibPQWalReceiverConnect";
-            break;
-        case WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE:
-            event_name = "LibPQWalReceiverReceive";
-            break;
-        case WAIT_EVENT_SSL_OPEN_SERVER:
-            event_name = "SSLOpenServer";
-            break;
-        case WAIT_EVENT_WAL_RECEIVER_WAIT_START:
-            event_name = "WalReceiverWaitStart";
-            break;
-        case WAIT_EVENT_WAL_SENDER_WAIT_WAL:
-            event_name = "WalSenderWaitForWAL";
-            break;
-        case WAIT_EVENT_WAL_SENDER_WRITE_DATA:
-            event_name = "WalSenderWriteData";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_ipc() -
- *
- * Convert WaitEventIPC to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_ipc(WaitEventIPC w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_BGWORKER_SHUTDOWN:
-            event_name = "BgWorkerShutdown";
-            break;
-        case WAIT_EVENT_BGWORKER_STARTUP:
-            event_name = "BgWorkerStartup";
-            break;
-        case WAIT_EVENT_BTREE_PAGE:
-            event_name = "BtreePage";
-            break;
-        case WAIT_EVENT_CLOG_GROUP_UPDATE:
-            event_name = "ClogGroupUpdate";
-            break;
-        case WAIT_EVENT_EXECUTE_GATHER:
-            event_name = "ExecuteGather";
-            break;
-        case WAIT_EVENT_HASH_BATCH_ALLOCATING:
-            event_name = "Hash/Batch/Allocating";
-            break;
-        case WAIT_EVENT_HASH_BATCH_ELECTING:
-            event_name = "Hash/Batch/Electing";
-            break;
-        case WAIT_EVENT_HASH_BATCH_LOADING:
-            event_name = "Hash/Batch/Loading";
-            break;
-        case WAIT_EVENT_HASH_BUILD_ALLOCATING:
-            event_name = "Hash/Build/Allocating";
-            break;
-        case WAIT_EVENT_HASH_BUILD_ELECTING:
-            event_name = "Hash/Build/Electing";
-            break;
-        case WAIT_EVENT_HASH_BUILD_HASHING_INNER:
-            event_name = "Hash/Build/HashingInner";
-            break;
-        case WAIT_EVENT_HASH_BUILD_HASHING_OUTER:
-            event_name = "Hash/Build/HashingOuter";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING:
-            event_name = "Hash/GrowBatches/Allocating";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_DECIDING:
-            event_name = "Hash/GrowBatches/Deciding";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_ELECTING:
-            event_name = "Hash/GrowBatches/Electing";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_FINISHING:
-            event_name = "Hash/GrowBatches/Finishing";
-            break;
-        case WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING:
-            event_name = "Hash/GrowBatches/Repartitioning";
-            break;
-        case WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING:
-            event_name = "Hash/GrowBuckets/Allocating";
-            break;
-        case WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING:
-            event_name = "Hash/GrowBuckets/Electing";
-            break;
-        case WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING:
-            event_name = "Hash/GrowBuckets/Reinserting";
-            break;
-        case WAIT_EVENT_LOGICAL_SYNC_DATA:
-            event_name = "LogicalSyncData";
-            break;
-        case WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE:
-            event_name = "LogicalSyncStateChange";
-            break;
-        case WAIT_EVENT_MQ_INTERNAL:
-            event_name = "MessageQueueInternal";
-            break;
-        case WAIT_EVENT_MQ_PUT_MESSAGE:
-            event_name = "MessageQueuePutMessage";
-            break;
-        case WAIT_EVENT_MQ_RECEIVE:
-            event_name = "MessageQueueReceive";
-            break;
-        case WAIT_EVENT_MQ_SEND:
-            event_name = "MessageQueueSend";
-            break;
-        case WAIT_EVENT_PARALLEL_BITMAP_SCAN:
-            event_name = "ParallelBitmapScan";
-            break;
-        case WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN:
-            event_name = "ParallelCreateIndexScan";
-            break;
-        case WAIT_EVENT_PARALLEL_FINISH:
-            event_name = "ParallelFinish";
-            break;
-        case WAIT_EVENT_PROCARRAY_GROUP_UPDATE:
-            event_name = "ProcArrayGroupUpdate";
-            break;
-        case WAIT_EVENT_PROMOTE:
-            event_name = "Promote";
-            break;
-        case WAIT_EVENT_REPLICATION_ORIGIN_DROP:
-            event_name = "ReplicationOriginDrop";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_DROP:
-            event_name = "ReplicationSlotDrop";
-            break;
-        case WAIT_EVENT_SAFE_SNAPSHOT:
-            event_name = "SafeSnapshot";
-            break;
-        case WAIT_EVENT_SYNC_REP:
-            event_name = "SyncRep";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_timeout() -
- *
- * Convert WaitEventTimeout to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_timeout(WaitEventTimeout w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_BASE_BACKUP_THROTTLE:
-            event_name = "BaseBackupThrottle";
-            break;
-        case WAIT_EVENT_PG_SLEEP:
-            event_name = "PgSleep";
-            break;
-        case WAIT_EVENT_RECOVERY_APPLY_DELAY:
-            event_name = "RecoveryApplyDelay";
-            break;
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-/* ----------
- * pgstat_get_wait_io() -
- *
- * Convert WaitEventIO to string.
- * ----------
- */
-static const char *
-pgstat_get_wait_io(WaitEventIO w)
-{
-    const char *event_name = "unknown wait event";
-
-    switch (w)
-    {
-        case WAIT_EVENT_BUFFILE_READ:
-            event_name = "BufFileRead";
-            break;
-        case WAIT_EVENT_BUFFILE_WRITE:
-            event_name = "BufFileWrite";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_READ:
-            event_name = "ControlFileRead";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_SYNC:
-            event_name = "ControlFileSync";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE:
-            event_name = "ControlFileSyncUpdate";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_WRITE:
-            event_name = "ControlFileWrite";
-            break;
-        case WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE:
-            event_name = "ControlFileWriteUpdate";
-            break;
-        case WAIT_EVENT_COPY_FILE_READ:
-            event_name = "CopyFileRead";
-            break;
-        case WAIT_EVENT_COPY_FILE_WRITE:
-            event_name = "CopyFileWrite";
-            break;
-        case WAIT_EVENT_DATA_FILE_EXTEND:
-            event_name = "DataFileExtend";
-            break;
-        case WAIT_EVENT_DATA_FILE_FLUSH:
-            event_name = "DataFileFlush";
-            break;
-        case WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC:
-            event_name = "DataFileImmediateSync";
-            break;
-        case WAIT_EVENT_DATA_FILE_PREFETCH:
-            event_name = "DataFilePrefetch";
-            break;
-        case WAIT_EVENT_DATA_FILE_READ:
-            event_name = "DataFileRead";
-            break;
-        case WAIT_EVENT_DATA_FILE_SYNC:
-            event_name = "DataFileSync";
-            break;
-        case WAIT_EVENT_DATA_FILE_TRUNCATE:
-            event_name = "DataFileTruncate";
-            break;
-        case WAIT_EVENT_DATA_FILE_WRITE:
-            event_name = "DataFileWrite";
-            break;
-        case WAIT_EVENT_DSM_FILL_ZERO_WRITE:
-            event_name = "DSMFillZeroWrite";
-            break;
-        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ:
-            event_name = "LockFileAddToDataDirRead";
-            break;
-        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC:
-            event_name = "LockFileAddToDataDirSync";
-            break;
-        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE:
-            event_name = "LockFileAddToDataDirWrite";
-            break;
-        case WAIT_EVENT_LOCK_FILE_CREATE_READ:
-            event_name = "LockFileCreateRead";
-            break;
-        case WAIT_EVENT_LOCK_FILE_CREATE_SYNC:
-            event_name = "LockFileCreateSync";
-            break;
-        case WAIT_EVENT_LOCK_FILE_CREATE_WRITE:
-            event_name = "LockFileCreateWrite";
-            break;
-        case WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ:
-            event_name = "LockFileReCheckDataDirRead";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC:
-            event_name = "LogicalRewriteCheckpointSync";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC:
-            event_name = "LogicalRewriteMappingSync";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE:
-            event_name = "LogicalRewriteMappingWrite";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_SYNC:
-            event_name = "LogicalRewriteSync";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE:
-            event_name = "LogicalRewriteTruncate";
-            break;
-        case WAIT_EVENT_LOGICAL_REWRITE_WRITE:
-            event_name = "LogicalRewriteWrite";
-            break;
-        case WAIT_EVENT_RELATION_MAP_READ:
-            event_name = "RelationMapRead";
-            break;
-        case WAIT_EVENT_RELATION_MAP_SYNC:
-            event_name = "RelationMapSync";
-            break;
-        case WAIT_EVENT_RELATION_MAP_WRITE:
-            event_name = "RelationMapWrite";
-            break;
-        case WAIT_EVENT_REORDER_BUFFER_READ:
-            event_name = "ReorderBufferRead";
-            break;
-        case WAIT_EVENT_REORDER_BUFFER_WRITE:
-            event_name = "ReorderBufferWrite";
-            break;
-        case WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ:
-            event_name = "ReorderLogicalMappingRead";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_READ:
-            event_name = "ReplicationSlotRead";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC:
-            event_name = "ReplicationSlotRestoreSync";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_SYNC:
-            event_name = "ReplicationSlotSync";
-            break;
-        case WAIT_EVENT_REPLICATION_SLOT_WRITE:
-            event_name = "ReplicationSlotWrite";
-            break;
-        case WAIT_EVENT_SLRU_FLUSH_SYNC:
-            event_name = "SLRUFlushSync";
-            break;
-        case WAIT_EVENT_SLRU_READ:
-            event_name = "SLRURead";
-            break;
-        case WAIT_EVENT_SLRU_SYNC:
-            event_name = "SLRUSync";
-            break;
-        case WAIT_EVENT_SLRU_WRITE:
-            event_name = "SLRUWrite";
-            break;
-        case WAIT_EVENT_SNAPBUILD_READ:
-            event_name = "SnapbuildRead";
-            break;
-        case WAIT_EVENT_SNAPBUILD_SYNC:
-            event_name = "SnapbuildSync";
-            break;
-        case WAIT_EVENT_SNAPBUILD_WRITE:
-            event_name = "SnapbuildWrite";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC:
-            event_name = "TimelineHistoryFileSync";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE:
-            event_name = "TimelineHistoryFileWrite";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_READ:
-            event_name = "TimelineHistoryRead";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_SYNC:
-            event_name = "TimelineHistorySync";
-            break;
-        case WAIT_EVENT_TIMELINE_HISTORY_WRITE:
-            event_name = "TimelineHistoryWrite";
-            break;
-        case WAIT_EVENT_TWOPHASE_FILE_READ:
-            event_name = "TwophaseFileRead";
-            break;
-        case WAIT_EVENT_TWOPHASE_FILE_SYNC:
-            event_name = "TwophaseFileSync";
-            break;
-        case WAIT_EVENT_TWOPHASE_FILE_WRITE:
-            event_name = "TwophaseFileWrite";
-            break;
-        case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ:
-            event_name = "WALSenderTimelineHistoryRead";
-            break;
-        case WAIT_EVENT_WAL_BOOTSTRAP_SYNC:
-            event_name = "WALBootstrapSync";
-            break;
-        case WAIT_EVENT_WAL_BOOTSTRAP_WRITE:
-            event_name = "WALBootstrapWrite";
-            break;
-        case WAIT_EVENT_WAL_COPY_READ:
-            event_name = "WALCopyRead";
-            break;
-        case WAIT_EVENT_WAL_COPY_SYNC:
-            event_name = "WALCopySync";
-            break;
-        case WAIT_EVENT_WAL_COPY_WRITE:
-            event_name = "WALCopyWrite";
-            break;
-        case WAIT_EVENT_WAL_INIT_SYNC:
-            event_name = "WALInitSync";
-            break;
-        case WAIT_EVENT_WAL_INIT_WRITE:
-            event_name = "WALInitWrite";
-            break;
-        case WAIT_EVENT_WAL_READ:
-            event_name = "WALRead";
-            break;
-        case WAIT_EVENT_WAL_SYNC:
-            event_name = "WALSync";
-            break;
-        case WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN:
-            event_name = "WALSyncMethodAssign";
-            break;
-        case WAIT_EVENT_WAL_WRITE:
-            event_name = "WALWrite";
-            break;
-
-            /* no default case, so that compiler will warn */
-    }
-
-    return event_name;
-}
-
-
-/* ----------
- * pgstat_get_backend_current_activity() -
- *
- *    Return a string representing the current activity of the backend with
- *    the specified PID.  This looks directly at the BackendStatusArray,
- *    and so will provide current information regardless of the age of our
- *    transaction's snapshot of the status array.
- *
- *    It is the caller's responsibility to invoke this only for backends whose
- *    state is expected to remain stable while the result is in use.  The
- *    only current use is in deadlock reporting, where we can expect that
- *    the target backend is blocked on a lock.  (There are corner cases
- *    where the target's wait could get aborted while we are looking at it,
- *    but the very worst consequence is to return a pointer to a string
- *    that's been changed, so we won't worry too much.)
- *
- *    Note: return strings for special cases match pg_stat_get_backend_activity.
- * ----------
- */
-const char *
-pgstat_get_backend_current_activity(int pid, bool checkUser)
-{
-    PgBackendStatus *beentry;
-    int            i;
-
-    beentry = BackendStatusArray;
-    for (i = 1; i <= MaxBackends; i++)
-    {
-        /*
-         * Although we expect the target backend's entry to be stable, that
-         * doesn't imply that anyone else's is.  To avoid identifying the
-         * wrong backend, while we check for a match to the desired PID we
-         * must follow the protocol of retrying if st_changecount changes
-         * while we examine the entry, or if it's odd.  (This might be
-         * unnecessary, since fetching or storing an int is almost certainly
-         * atomic, but let's play it safe.)  We use a volatile pointer here to
-         * ensure the compiler doesn't try to get cute.
-         */
-        volatile PgBackendStatus *vbeentry = beentry;
-        bool        found;
-
-        for (;;)
-        {
-            int            before_changecount;
-            int            after_changecount;
-
-            pgstat_save_changecount_before(vbeentry, before_changecount);
-
-            found = (vbeentry->st_procpid == pid);
-
-            pgstat_save_changecount_after(vbeentry, after_changecount);
-
-            if (before_changecount == after_changecount &&
-                (before_changecount & 1) == 0)
-                break;
-
-            /* Make sure we can break out of loop if stuck... */
-            CHECK_FOR_INTERRUPTS();
-        }
-
-        if (found)
-        {
-            /* Now it is safe to use the non-volatile pointer */
-            if (checkUser && !superuser() && beentry->st_userid != GetUserId())
-                return "<insufficient privilege>";
-            else if (*(beentry->st_activity_raw) == '\0')
-                return "<command string not enabled>";
-            else
-            {
-                /* this'll leak a bit of memory, but that seems acceptable */
-                return pgstat_clip_activity(beentry->st_activity_raw);
-            }
-        }
-
-        beentry++;
-    }
-
-    /* If we get here, caller is in error ... */
-    return "<backend information not available>";
-}
-
-/* ----------
- * pgstat_get_crashed_backend_activity() -
- *
- *    Return a string representing the current activity of the backend with
- *    the specified PID.  Like the function above, but reads shared memory with
- *    the expectation that it may be corrupt.  On success, copy the string
- *    into the "buffer" argument and return that pointer.  On failure,
- *    return NULL.
- *
- *    This function is only intended to be used by the postmaster to report the
- *    query that crashed a backend.  In particular, no attempt is made to
- *    follow the correct concurrency protocol when accessing the
- *    BackendStatusArray.  But that's OK, in the worst case we'll return a
- *    corrupted message.  We also must take care not to trip on ereport(ERROR).
- * ----------
- */
-const char *
-pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
-{
-    volatile PgBackendStatus *beentry;
-    int            i;
-
-    beentry = BackendStatusArray;
-
-    /*
-     * We probably shouldn't get here before shared memory has been set up,
-     * but be safe.
-     */
-    if (beentry == NULL || BackendActivityBuffer == NULL)
-        return NULL;
-
-    for (i = 1; i <= MaxBackends; i++)
-    {
-        if (beentry->st_procpid == pid)
-        {
-            /* Read pointer just once, so it can't change after validation */
-            const char *activity = beentry->st_activity_raw;
-            const char *activity_last;
-
-            /*
-             * We mustn't access activity string before we verify that it
-             * falls within the BackendActivityBuffer. To make sure that the
-             * entire string including its ending is contained within the
-             * buffer, subtract one activity length from the buffer size.
-             */
-            activity_last = BackendActivityBuffer + BackendActivityBufferSize
-                - pgstat_track_activity_query_size;
-
-            if (activity < BackendActivityBuffer ||
-                activity > activity_last)
-                return NULL;
-
-            /* If no string available, no point in a report */
-            if (activity[0] == '\0')
-                return NULL;
-
-            /*
-             * Copy only ASCII-safe characters so we don't run into encoding
-             * problems when reporting the message; and be sure not to run off
-             * the end of memory.  As only ASCII characters are reported, it
-             * doesn't seem necessary to perform multibyte aware clipping.
-             */
-            ascii_safe_strlcpy(buffer, activity,
-                               Min(buflen, pgstat_track_activity_query_size));
-
-            return buffer;
-        }
-
-        beentry++;
-    }
-
-    /* PID not found */
-    return NULL;
-}
-
-const char *
-pgstat_get_backend_desc(BackendType backendType)
-{
-    const char *backendDesc = "unknown process type";
-
-    switch (backendType)
-    {
-        case B_AUTOVAC_LAUNCHER:
-            backendDesc = "autovacuum launcher";
-            break;
-        case B_AUTOVAC_WORKER:
-            backendDesc = "autovacuum worker";
-            break;
-        case B_BACKEND:
-            backendDesc = "client backend";
-            break;
-        case B_BG_WORKER:
-            backendDesc = "background worker";
-            break;
-        case B_BG_WRITER:
-            backendDesc = "background writer";
-            break;
-        case B_ARCHIVER:
-            backendDesc = "archiver";
-            break;
-        case B_CHECKPOINTER:
-            backendDesc = "checkpointer";
-            break;
-        case B_STARTUP:
-            backendDesc = "startup";
-            break;
-        case B_WAL_RECEIVER:
-            backendDesc = "walreceiver";
-            break;
-        case B_WAL_SENDER:
-            backendDesc = "walsender";
-            break;
-        case B_WAL_WRITER:
-            backendDesc = "walwriter";
-            break;
-    }
-
-    return backendDesc;
-}
-
-/* ------------------------------------------------------------
- * Local support functions follow
- * ------------------------------------------------------------
- */
-
-
-/* ----------
- * pgstat_setheader() -
- *
- *        Set common header fields in a statistics message
- * ----------
- */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
-    {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
- *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
- * ----------
- */
-void
-pgstat_send_archiver(const char *xlog, bool failed)
-{
-    PgStat_MsgArchiver msg;
-
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* ----------
- * pgstat_send_bgwriter() -
- *
- *        Send bgwriter statistics to the collector
- * ----------
- */
-void
-pgstat_send_bgwriter(void)
-{
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
-
-    /*
-     * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
-     */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
-        return;
-
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
-
-    /*
-     * Clear out the statistics buffer, so it can be re-used.
-     */
-    MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
-}
-
-
-/* ----------
- * PgstatCollectorMain() -
- *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, pgstat_sighup_handler);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, pgstat_exit);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    /*
-     * Identify myself via ps
-     */
-    init_ps_display("stats collector", "", "", "");
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do recognize got_SIGHUP inside
-     * the inner loop, which means that such interrupts will get serviced but
-     * the latch won't get cleared until next time there is a break in the
-     * action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (need_exit)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * need_exit becomes set.
-         */
-        while (!need_exit)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (got_SIGHUP)
-            {
-                got_SIGHUP = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry((PgStat_MsgInquiry *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat((PgStat_MsgTabstat *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge((PgStat_MsgTabpurge *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb((PgStat_MsgDropdb *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter((PgStat_MsgResetcounter *) &msg,
-                                             len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(
-                                                   (PgStat_MsgResetsharedcounter *) &msg,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(
-                                                   (PgStat_MsgResetsinglecounter *) &msg,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac((PgStat_MsgAutovacStart *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum((PgStat_MsgVacuum *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze((PgStat_MsgAnalyze *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver((PgStat_MsgArchiver *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter((PgStat_MsgBgWriter *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat((PgStat_MsgFuncstat *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge((PgStat_MsgFuncpurge *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict((PgStat_MsgRecoveryConflict *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock((PgStat_MsgDeadlock *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile((PgStat_MsgTempFile *) &msg, len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr & WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    exit(0);
-}
-
-
-/* SIGQUIT signal handler for collector process */
-static void
-pgstat_exit(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    need_exit = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
-/* SIGHUP handler for collector process */
-static void
-pgstat_sighup_handler(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    got_SIGHUP = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
-/*
- * Subroutine to clear stats in a database entry
- *
- * Tables and functions hashes are initialized to empty.
- */
-static void
-reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
-{
-    HASHCTL        hash_ctl;
-
-    dbentry->n_xact_commit = 0;
-    dbentry->n_xact_rollback = 0;
-    dbentry->n_blocks_fetched = 0;
-    dbentry->n_blocks_hit = 0;
-    dbentry->n_tuples_returned = 0;
-    dbentry->n_tuples_fetched = 0;
-    dbentry->n_tuples_inserted = 0;
-    dbentry->n_tuples_updated = 0;
-    dbentry->n_tuples_deleted = 0;
-    dbentry->last_autovac_time = 0;
-    dbentry->n_conflict_tablespace = 0;
-    dbentry->n_conflict_lock = 0;
-    dbentry->n_conflict_snapshot = 0;
-    dbentry->n_conflict_bufferpin = 0;
-    dbentry->n_conflict_startup_deadlock = 0;
-    dbentry->n_temp_files = 0;
-    dbentry->n_temp_bytes = 0;
-    dbentry->n_deadlocks = 0;
-    dbentry->n_block_read_time = 0;
-    dbentry->n_block_write_time = 0;
-
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
-
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
-}
-
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
-{
-    PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
-     */
-    if (!found)
-        reset_dbentry_counters(result);
-
-    return result;
-}
-
-
-/*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
-{
-    PgStat_StatTabEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /* If not found, initialize the new one. */
-    if (!found)
-    {
-        result->numscans = 0;
-        result->tuples_returned = 0;
-        result->tuples_fetched = 0;
-        result->tuples_inserted = 0;
-        result->tuples_updated = 0;
-        result->tuples_deleted = 0;
-        result->tuples_hot_updated = 0;
-        result->n_live_tuples = 0;
-        result->n_dead_tuples = 0;
-        result->changes_since_analyze = 0;
-        result->blocks_fetched = 0;
-        result->blocks_hit = 0;
-        result->vacuum_timestamp = 0;
-        result->vacuum_count = 0;
-        result->autovac_vacuum_timestamp = 0;
-        result->autovac_vacuum_count = 0;
-        result->analyze_timestamp = 0;
-        result->analyze_count = 0;
-        result->autovac_analyze_timestamp = 0;
-        result->autovac_analyze_count = 0;
-    }
-
-    return result;
-}
-
-
-/* ----------
- * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
- * ----------
- */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
-{
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    FILE       *fpout;
-    int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-    int            rc;
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Set the timestamp of the stats file.
-     */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Write global stats struct
-     */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Write archiver stats struct
-     */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database table.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        /*
-         * Write out the table and function stats for this DB into the
-         * appropriate per-DB stat file, if required.
-         */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
-
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
-
-        /*
-         * Write out the DB entry. We don't write the tables or functions
-         * pointers, since they're of no use to any other process.
-         */
-        fputc('D', fpout);
-        rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
-}
-
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
-
-/* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
- * ----------
- */
-static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
-{
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpout;
-    int32        format_id;
-    Oid            dbid = dbentry->databaseid;
-    int            rc;
-    char        tmpfile[MAXPGPATH];
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database's access stats per table.
-     */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
-    {
-        fputc('T', fpout);
-        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * Walk through the database's function stats table.
-     */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_statsfiles() -
- *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
- *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
- * ----------
- */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
-
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-
-    /*
-     * Set the current timestamp (will be kept only in case we can't load an
-     * existing statsfile).
-     */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return dbhash;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
-        goto done;
-    }
-
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Add to the DB hash
-                 */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-
-    return dbhash;
-}
-
-
-/* ----------
- * pgstat_read_db_statsfile() -
- *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
- * ----------
- */
-static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
-{
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatTabEntry tabbuf;
-    PgStat_StatFuncEntry funcbuf;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'T'    A PgStat_StatTabEntry follows.
-                 */
-            case 'T':
-                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
-                          fpin) != sizeof(PgStat_StatTabEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
-                break;
-
-                /*
-                 * 'F'    A PgStat_StatFuncEntry follows.
-                 */
-            case 'F':
-                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
-                          fpin) != sizeof(PgStat_StatFuncEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
-                break;
-
-                /*
-                 * 'E'    The EOF marker of a complete stats file.
-                 */
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
-}
-
-
-/* ----------
- * pgstat_setup_memcxt() -
- *
- *    Create pgStatLocalContext, if not already done.
- * ----------
- */
-static void
-pgstat_setup_memcxt(void)
-{
-    if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
-}
-
-
-/* ----------
- * pgstat_clear_snapshot() -
- *
- *    Discard any data collected in the current transaction.  Any subsequent
- *    request will cause new snapshots to be read.
- *
- *    This is also invoked during transaction commit or abort to discard
- *    the no-longer-wanted snapshot.
- * ----------
- */
-void
-pgstat_clear_snapshot(void)
-{
-    /* Release memory, if any was allocated */
-    if (pgStatLocalContext)
-        MemoryContextDelete(pgStatLocalContext);
-
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetshared() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
-}
-
-/*
- * Convert a potentially unsafely truncated activity string (see
- * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
- * one.
- *
- * The returned string is allocated in the caller's memory context and may be
- * freed.
- */
-char *
-pgstat_clip_activity(const char *raw_activity)
-{
-    char       *activity;
-    int            rawlen;
-    int            cliplen;
-
-    /*
-     * Some callers, like pgstat_get_backend_current_activity(), do not
-     * guarantee that the buffer isn't concurrently modified. We try to take
-     * care that the buffer is always terminated by a NUL byte regardless, but
-     * let's still be paranoid about the string's length. In those cases the
-     * underlying buffer is guaranteed to be pgstat_track_activity_query_size
-     * large.
-     */
-    activity = pnstrdup(raw_activity, pgstat_track_activity_query_size - 1);
-
-    /* now double-guaranteed to be NUL terminated */
-    rawlen = strlen(activity);
-
-    /*
-     * All supported server-encodings make it possible to determine the length
-     * of a multi-byte character from its first byte (this is not the case for
-     * client encodings, see GB18030). As st_activity is always stored using
-     * server encoding, this allows us to perform multi-byte aware truncation,
-     * even if the string earlier was truncated in the middle of a multi-byte
-     * character.
-     */
-    cliplen = pg_mbcliplen(activity, rawlen,
-                           pgstat_track_activity_query_size - 1);
-
-    activity[cliplen] = '\0';
-
-    return activity;
-}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a663a62fd5..a01b81a594 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -95,6 +95,7 @@
 
 #include "access/transam.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/pg_control.h"
 #include "common/file_perm.h"
@@ -255,7 +256,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -502,7 +502,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1302,12 +1301,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1756,11 +1749,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2595,8 +2583,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -2927,8 +2913,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -2995,13 +2979,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3076,22 +3053,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3550,22 +3511,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3761,8 +3706,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3801,8 +3744,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4003,8 +3945,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -4977,18 +4917,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5101,12 +5029,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -5976,7 +5898,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6029,8 +5950,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6262,7 +6181,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index d1ea46deb8..3accdf7bcf 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -31,11 +31,11 @@
 #include <sys/stat.h>
 #include <sys/time.h>
 
+#include "bestatus.h"
 #include "lib/stringinfo.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "nodes/pg_list.h"
-#include "pgstat.h"
 #include "pgtime.h"
 #include "postmaster/fork_process.h"
 #include "postmaster/postmaster.h"
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index a6fdba3f41..0de04159d5 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -45,9 +45,9 @@
 #include <unistd.h>
 
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/walwriter.h"
 #include "storage/bufmgr.h"
 #include "storage/condition_variable.h"
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index def6c03dd0..e30b2dbcf0 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -17,6 +17,7 @@
 #include <time.h>
 
 #include "access/xlog_internal.h"    /* for pg_start/stop_backup */
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "common/file_perm.h"
 #include "lib/stringinfo.h"
diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
index 7027737e67..75a3208f74 100644
--- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
+++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
@@ -22,11 +22,11 @@
 #include "libpq-fe.h"
 #include "pqexpbuffer.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "funcapi.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/walreceiver.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 55b91b5e12..ea1c7e643e 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -19,7 +19,7 @@
 
 #include "funcapi.h"
 #include "miscadmin.h"
-#include "pgstat.h"
+#include "bestatus.h"
 
 #include "access/heapam.h"
 #include "access/htup.h"
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index dad2b3d065..dbb7c57ebc 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -77,13 +77,12 @@
 #include "access/htup_details.h"
 #include "access/table.h"
 #include "access/xact.h"
-
+#include "bestatus.h"
 #include "catalog/indexing.h"
 #include "nodes/execnodes.h"
 
 #include "replication/origin.h"
 #include "replication/logical.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2b486b5e9f..b6d6013dd0 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -62,10 +62,10 @@
 #include "access/tuptoaster.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "lib/binaryheap.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/logical.h"
 #include "replication/reorderbuffer.h"
 #include "replication/slot.h"
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index ad44b2bf43..1b792f6626 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -126,7 +126,7 @@
 #include "access/transam.h"
 #include "access/xact.h"
 
-#include "pgstat.h"
+#include "bestatus.h"
 
 #include "replication/logical.h"
 #include "replication/reorderbuffer.h"
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 28f5fc23aa..475ce9def4 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -86,26 +86,28 @@
 #include "postgres.h"
 
 #include "miscadmin.h"
-#include "pgstat.h"
 
 #include "access/table.h"
 #include "access/xact.h"
 
+#include "bestatus.h"
+
 #include "catalog/pg_subscription_rel.h"
 #include "catalog/pg_type.h"
 
 #include "commands/copy.h"
 
 #include "parser/parse_relation.h"
+#include "pgstat.h"
 
 #include "replication/logicallauncher.h"
 #include "replication/logicalrelation.h"
 #include "replication/walreceiver.h"
 #include "replication/worker_internal.h"
 
-#include "utils/snapmgr.h"
 #include "storage/ipc.h"
 
+#include "utils/snapmgr.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -128,7 +130,7 @@ finish_sync_worker(void)
     if (IsTransactionState())
     {
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_flush_stat(true);
     }
 
     /* And flush all writes. */
@@ -144,6 +146,9 @@ finish_sync_worker(void)
     /* Find the main apply worker and signal it. */
     logicalrep_worker_wakeup(MyLogicalRepWorker->subid, InvalidOid);
 
+    /* clean up retained statistics */
+    pgstat_flush_stat(true);
+
     /* Stop gracefully */
     proc_exit(0);
 }
@@ -525,7 +530,7 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
     if (started_tx)
     {
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_flush_stat(false);
     }
 }
 
@@ -863,7 +868,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
                                            MyLogicalRepWorker->relstate,
                                            MyLogicalRepWorker->relstate_lsn);
                 CommitTransactionCommand();
-                pgstat_report_stat(false);
+                pgstat_flush_stat(false);
 
                 /*
                  * We want to do the table data sync in a single transaction.
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index f9516515bc..9e542803e1 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -26,6 +26,7 @@
 #include "access/table.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_subscription.h"
@@ -477,7 +478,7 @@ apply_handle_commit(StringInfo s)
         replorigin_session_origin_timestamp = commit_data.committime;
 
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_flush_stat(false);
 
         store_flush_position(commit_data.end_lsn);
     }
@@ -1311,6 +1312,8 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
             }
 
             send_feedback(last_received, requestReply, requestReply);
+
+            pgstat_flush_stat(false);
         }
     }
 }
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 33b23b6b6d..c60e69302a 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -41,9 +41,9 @@
 
 #include "access/transam.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "common/string.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/slot.h"
 #include "storage/fd.h"
 #include "storage/proc.h"
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 6c160c13c6..02ec91d98e 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -75,8 +75,8 @@
 #include <unistd.h>
 
 #include "access/xact.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/syncrep.h"
 #include "replication/walsender.h"
 #include "replication/walsender_private.h"
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 2e90944ad5..bdca25499d 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -50,6 +50,7 @@
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
@@ -57,7 +58,6 @@
 #include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "storage/ipc.h"
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 9b143f361b..2b38c0c4f5 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -56,6 +56,7 @@
 #include "access/xlog_internal.h"
 #include "access/xlogutils.h"
 
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "commands/dbcommands.h"
@@ -65,7 +66,6 @@
 #include "libpq/pqformat.h"
 #include "miscadmin.h"
 #include "nodes/replnodes.h"
-#include "pgstat.h"
 #include "replication/basebackup.h"
 #include "replication/decode.h"
 #include "replication/logical.h"
diff --git a/src/backend/statmon/Makefile b/src/backend/statmon/Makefile
new file mode 100644
index 0000000000..64a04878e3
--- /dev/null
+++ b/src/backend/statmon/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for src/backend/statmon
+#
+# IDENTIFICATION
+#    src/backend/statmon/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/statmon
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = pgstat.o bestatus.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/statmon/bestatus.c b/src/backend/statmon/bestatus.c
new file mode 100644
index 0000000000..292312d05c
--- /dev/null
+++ b/src/backend/statmon/bestatus.c
@@ -0,0 +1,1781 @@
+/* ----------
+ * bestatus.c
+ *
+ *    Backend status monitor
+ *
+ *    Status data is stored in shared memory. Every backend updates and reads it
+ *    individually.
+ *
+ *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
+ *
+ *    src/backend/statmon/bestatus.c
+ * ----------
+ */
+#include "postgres.h"
+
+#include "bestatus.h"
+
+#include "access/xact.h"
+#include "libpq/libpq.h"
+#include "miscadmin.h"
+#include "postmaster/autovacuum.h"
+#include "replication/walsender.h"
+#include "storage/ipc.h"
+#include "storage/lmgr.h"
+#include "storage/sinvaladt.h"
+#include "utils/ascii.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/probes.h"
+
+
+/* Status for backends including auxiliary */
+static LocalPgBackendStatus *localBackendStatusTable = NULL;
+
+/* Total number of backends including auxiliary */
+static int    localNumBackends = 0;
+
+/* ----------
+ * Total number of backends including auxiliary
+ *
+ * We reserve a slot for each possible BackendId, plus one for each
+ * possible auxiliary process type.  (This scheme assumes there is not
+ * more than one of any auxiliary process type at a time.) MaxBackends
+ * includes autovacuum workers and background workers as well.
+ * ----------
+ */
+#define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
+
+
+/* ----------
+ * GUC parameters
+ * ----------
+ */
+bool        pgstat_track_activities = false;
+int            pgstat_track_activity_query_size = 1024;
+
+static MemoryContext pgBeStatLocalContext = NULL;
+
+/* ------------------------------------------------------------
+ * Functions for management of the shared-memory PgBackendStatus array
+ * ------------------------------------------------------------
+ */
+
+static PgBackendStatus *BackendStatusArray = NULL;
+static PgBackendStatus *MyBEEntry = NULL;
+static char *BackendAppnameBuffer = NULL;
+static char *BackendClientHostnameBuffer = NULL;
+static char *BackendActivityBuffer = NULL;
+static Size BackendActivityBufferSize = 0;
+#ifdef USE_SSL
+static PgBackendSSLStatus *BackendSslStatusBuffer = NULL;
+#endif
+
+static const char *pgstat_get_wait_activity(WaitEventActivity w);
+static const char *pgstat_get_wait_client(WaitEventClient w);
+static const char *pgstat_get_wait_ipc(WaitEventIPC w);
+static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
+static const char *pgstat_get_wait_io(WaitEventIO w);
+static void pgstat_setup_memcxt(void);
+static void bestatus_clear_snapshot(void);
+static void pgstat_beshutdown_hook(int code, Datum arg);
+/*
+ * Report shared-memory space needed by CreateSharedBackendStatus.
+ */
+Size
+BackendStatusShmemSize(void)
+{
+    Size        size;
+
+    /* BackendStatusArray: */
+    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
+    /* BackendAppnameBuffer: */
+    size = add_size(size,
+                    mul_size(NAMEDATALEN, NumBackendStatSlots));
+    /* BackendClientHostnameBuffer: */
+    size = add_size(size,
+                    mul_size(NAMEDATALEN, NumBackendStatSlots));
+    /* BackendActivityBuffer: */
+    size = add_size(size,
+                    mul_size(pgstat_track_activity_query_size, NumBackendStatSlots));
+#ifdef USE_SSL
+    /* BackendSslStatusBuffer: */
+    size = add_size(size,
+                    mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots));
+#endif
+    return size;
+}
+
+/*
+ * Initialize the shared status array and several string buffers
+ * during postmaster startup.
+ */
+void
+CreateSharedBackendStatus(void)
+{
+    Size        size;
+    bool        found;
+    int            i;
+    char       *buffer;
+
+    /* Create or attach to the shared array */
+    size = mul_size(sizeof(PgBackendStatus), NumBackendStatSlots);
+    BackendStatusArray = (PgBackendStatus *)
+        ShmemInitStruct("Backend Status Array", size, &found);
+
+    if (!found)
+    {
+        /*
+         * We're the first - initialize.
+         */
+        MemSet(BackendStatusArray, 0, size);
+    }
+
+    /* Create or attach to the shared appname buffer */
+    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
+    BackendAppnameBuffer = (char *)
+        ShmemInitStruct("Backend Application Name Buffer", size, &found);
+
+    if (!found)
+    {
+        MemSet(BackendAppnameBuffer, 0, size);
+
+        /* Initialize st_appname pointers. */
+        buffer = BackendAppnameBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_appname = buffer;
+            buffer += NAMEDATALEN;
+        }
+    }
+
+    /* Create or attach to the shared client hostname buffer */
+    size = mul_size(NAMEDATALEN, NumBackendStatSlots);
+    BackendClientHostnameBuffer = (char *)
+        ShmemInitStruct("Backend Client Host Name Buffer", size, &found);
+
+    if (!found)
+    {
+        MemSet(BackendClientHostnameBuffer, 0, size);
+
+        /* Initialize st_clienthostname pointers. */
+        buffer = BackendClientHostnameBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_clienthostname = buffer;
+            buffer += NAMEDATALEN;
+        }
+    }
+
+    /* Create or attach to the shared activity buffer */
+    BackendActivityBufferSize = mul_size(pgstat_track_activity_query_size,
+                                         NumBackendStatSlots);
+    BackendActivityBuffer = (char *)
+        ShmemInitStruct("Backend Activity Buffer",
+                        BackendActivityBufferSize,
+                        &found);
+
+    if (!found)
+    {
+        MemSet(BackendActivityBuffer, 0, BackendActivityBufferSize);
+
+        /* Initialize st_activity pointers. */
+        buffer = BackendActivityBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_activity_raw = buffer;
+            buffer += pgstat_track_activity_query_size;
+        }
+    }
+
+#ifdef USE_SSL
+    /* Create or attach to the shared SSL status buffer */
+    size = mul_size(sizeof(PgBackendSSLStatus), NumBackendStatSlots);
+    BackendSslStatusBuffer = (PgBackendSSLStatus *)
+        ShmemInitStruct("Backend SSL Status Buffer", size, &found);
+
+    if (!found)
+    {
+        PgBackendSSLStatus *ptr;
+
+        MemSet(BackendSslStatusBuffer, 0, size);
+
+        /* Initialize st_sslstatus pointers. */
+        ptr = BackendSslStatusBuffer;
+        for (i = 0; i < NumBackendStatSlots; i++)
+        {
+            BackendStatusArray[i].st_sslstatus = ptr;
+            ptr++;
+        }
+    }
+#endif
+}
+
+/* ----------
+ * pgstat_bearray_initialize() -
+ *
+ *    Initialize pgstats state, and set up our on-proc-exit hook.
+ *    Called from InitPostgres and AuxiliaryProcessMain. For auxiliary process,
+ *    MyBackendId is invalid. Otherwise, MyBackendId must be set,
+ *    but we must not have started any transaction yet (since the
+ *    exit hook must run after the last transaction exit).
+ *    NOTE: MyDatabaseId isn't set yet; so the shutdown hook has to be careful.
+ * ----------
+ */
+void
+pgstat_bearray_initialize(void)
+{
+    /* Initialize MyBEEntry */
+    if (MyBackendId != InvalidBackendId)
+    {
+        Assert(MyBackendId >= 1 && MyBackendId <= MaxBackends);
+        MyBEEntry = &BackendStatusArray[MyBackendId - 1];
+    }
+    else
+    {
+        /* Must be an auxiliary process */
+        Assert(MyAuxProcType != NotAnAuxProcess);
+
+        /*
+         * Assign the MyBEEntry for an auxiliary process.  Since it doesn't
+         * have a BackendId, the slot is statically allocated based on the
+         * auxiliary process type (MyAuxProcType).  Backends occupy array
+         * elements 0 to MaxBackends - 1 (that is, MyBackendId - 1), so
+         * element MaxBackends + MyAuxProcType is the slot for an
+         * auxiliary process.
+         */
+        MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
+    }
+
+    /* Set up a process-exit hook to clean up */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
+}
+
+/*
+ * Shut down a single backend's status reporting at process exit.
+ *
+ * This function itself does not flush remaining statistics counts, so
+ * operations triggered during backend exit (such as temp table
+ * deletions) are not counted by anything done here.
+ *
+ * It only clears out our entry in the PgBackendStatus array.
+ */
+static void
+pgstat_beshutdown_hook(int code, Datum arg)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    /*
+     * Clear my status entry, following the protocol of bumping st_changecount
+     * before and after.  We use a volatile pointer here to ensure the
+     * compiler doesn't try to get cute.
+     */
+    pgstat_increment_changecount_before(beentry);
+
+    beentry->st_procpid = 0;    /* mark invalid */
+
+    pgstat_increment_changecount_after(beentry);
+}
+
+
+/* ----------
+ * pgstat_bestart() -
+ *
+ *    Initialize this backend's entry in the PgBackendStatus array.
+ *    Called from InitPostgres.
+ *
+ *    Apart from auxiliary processes, MyBackendId, MyDatabaseId,
+ *    session userid, and application_name must be set for a
+ *    backend (hence, this cannot be combined with pgstat_bearray_initialize).
+ * ----------
+ */
+void
+pgstat_bestart(void)
+{
+    SockAddr    clientaddr;
+    volatile PgBackendStatus *beentry;
+
+    /*
+     * To minimize the time spent modifying the PgBackendStatus entry, fetch
+     * all the needed data first.
+     */
+
+    /*
+     * We may not have a MyProcPort (eg, if this is the autovacuum process).
+     * If so, use all-zeroes client address, which is dealt with specially in
+     * pg_stat_get_backend_client_addr and pg_stat_get_backend_client_port.
+     */
+    if (MyProcPort)
+        memcpy(&clientaddr, &MyProcPort->raddr, sizeof(clientaddr));
+    else
+        MemSet(&clientaddr, 0, sizeof(clientaddr));
+
+    /*
+     * Initialize my status entry, following the protocol of bumping
+     * st_changecount before and after; and make sure it's even afterwards. We
+     * use a volatile pointer here to ensure the compiler doesn't try to get
+     * cute.
+     */
+    beentry = MyBEEntry;
+
+    /* MyBEEntry must have been set up by pgstat_bearray_initialize() */
+    Assert(beentry != NULL);
+
+    if (MyBackendId != InvalidBackendId)
+    {
+        if (IsAutoVacuumLauncherProcess())
+        {
+            /* Autovacuum Launcher */
+            beentry->st_backendType = B_AUTOVAC_LAUNCHER;
+        }
+        else if (IsAutoVacuumWorkerProcess())
+        {
+            /* Autovacuum Worker */
+            beentry->st_backendType = B_AUTOVAC_WORKER;
+        }
+        else if (am_walsender)
+        {
+            /* Wal sender */
+            beentry->st_backendType = B_WAL_SENDER;
+        }
+        else if (IsBackgroundWorker)
+        {
+            /* bgworker */
+            beentry->st_backendType = B_BG_WORKER;
+        }
+        else
+        {
+            /* client-backend */
+            beentry->st_backendType = B_BACKEND;
+        }
+    }
+    else
+    {
+        /* Must be an auxiliary process */
+        Assert(MyAuxProcType != NotAnAuxProcess);
+        switch (MyAuxProcType)
+        {
+            case StartupProcess:
+                beentry->st_backendType = B_STARTUP;
+                break;
+            case BgWriterProcess:
+                beentry->st_backendType = B_BG_WRITER;
+                break;
+            case CheckpointerProcess:
+                beentry->st_backendType = B_CHECKPOINTER;
+                break;
+            case WalWriterProcess:
+                beentry->st_backendType = B_WAL_WRITER;
+                break;
+            case WalReceiverProcess:
+                beentry->st_backendType = B_WAL_RECEIVER;
+                break;
+            case ArchiverProcess:
+                beentry->st_backendType = B_ARCHIVER;
+                break;
+            default:
+                elog(FATAL, "unrecognized process type: %d",
+                     (int) MyAuxProcType);
+                proc_exit(1);
+        }
+    }
+
+    do
+    {
+        pgstat_increment_changecount_before(beentry);
+    } while ((beentry->st_changecount & 1) == 0);
+
+    beentry->st_procpid = MyProcPid;
+    beentry->st_proc_start_timestamp = MyStartTimestamp;
+    beentry->st_activity_start_timestamp = 0;
+    beentry->st_state_start_timestamp = 0;
+    beentry->st_xact_start_timestamp = 0;
+    beentry->st_databaseid = MyDatabaseId;
+
+    /* We have userid for client-backends, wal-sender and bgworker processes */
+    if (beentry->st_backendType == B_BACKEND
+        || beentry->st_backendType == B_WAL_SENDER
+        || beentry->st_backendType == B_BG_WORKER)
+        beentry->st_userid = GetSessionUserId();
+    else
+        beentry->st_userid = InvalidOid;
+
+    beentry->st_clientaddr = clientaddr;
+    if (MyProcPort && MyProcPort->remote_hostname)
+        strlcpy(beentry->st_clienthostname, MyProcPort->remote_hostname,
+                NAMEDATALEN);
+    else
+        beentry->st_clienthostname[0] = '\0';
+#ifdef USE_SSL
+    if (MyProcPort && MyProcPort->ssl != NULL)
+    {
+        beentry->st_ssl = true;
+        beentry->st_sslstatus->ssl_bits = be_tls_get_cipher_bits(MyProcPort);
+        beentry->st_sslstatus->ssl_compression = be_tls_get_compression(MyProcPort);
+        strlcpy(beentry->st_sslstatus->ssl_version, be_tls_get_version(MyProcPort), NAMEDATALEN);
+        strlcpy(beentry->st_sslstatus->ssl_cipher, be_tls_get_cipher(MyProcPort), NAMEDATALEN);
+        be_tls_get_peer_subject_name(MyProcPort, beentry->st_sslstatus->ssl_client_dn, NAMEDATALEN);
+        be_tls_get_peer_serial(MyProcPort, beentry->st_sslstatus->ssl_client_serial, NAMEDATALEN);
+        be_tls_get_peer_issuer_name(MyProcPort, beentry->st_sslstatus->ssl_issuer_dn, NAMEDATALEN);
+    }
+    else
+    {
+        beentry->st_ssl = false;
+    }
+#else
+    beentry->st_ssl = false;
+#endif
+    beentry->st_state = STATE_UNDEFINED;
+    beentry->st_appname[0] = '\0';
+    beentry->st_activity_raw[0] = '\0';
+    /* Also make sure the last byte in each string area is always 0 */
+    beentry->st_clienthostname[NAMEDATALEN - 1] = '\0';
+    beentry->st_appname[NAMEDATALEN - 1] = '\0';
+    beentry->st_activity_raw[pgstat_track_activity_query_size - 1] = '\0';
+    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
+    beentry->st_progress_command_target = InvalidOid;
+
+    /*
+     * we don't zero st_progress_param here to save cycles; nobody should
+     * examine it until st_progress_command has been set to something other
+     * than PROGRESS_COMMAND_INVALID
+     */
+
+    pgstat_increment_changecount_after(beentry);
+
+    /* Update app name to current GUC setting */
+    if (application_name)
+        pgstat_report_appname(application_name);
+}
+
+/* ----------
+ * pgstat_read_current_status() -
+ *
+ *    Copy the current contents of the PgBackendStatus array to local memory,
+ *    if not already done in this transaction.
+ * ----------
+ */
+static void
+pgstat_read_current_status(void)
+{
+    volatile PgBackendStatus *beentry;
+    LocalPgBackendStatus *localtable;
+    LocalPgBackendStatus *localentry;
+    char       *localappname,
+               *localclienthostname,
+               *localactivity;
+#ifdef USE_SSL
+    PgBackendSSLStatus *localsslstatus;
+#endif
+    int            i;
+
+    Assert(IsUnderPostmaster);
+
+    if (localBackendStatusTable)
+        return;                    /* already done */
+
+    pgstat_setup_memcxt();
+
+    localtable = (LocalPgBackendStatus *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           sizeof(LocalPgBackendStatus) * NumBackendStatSlots);
+    localappname = (char *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           NAMEDATALEN * NumBackendStatSlots);
+    localclienthostname = (char *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           NAMEDATALEN * NumBackendStatSlots);
+    localactivity = (char *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           pgstat_track_activity_query_size * NumBackendStatSlots);
+#ifdef USE_SSL
+    localsslstatus = (PgBackendSSLStatus *)
+        MemoryContextAlloc(pgBeStatLocalContext,
+                           sizeof(PgBackendSSLStatus) * NumBackendStatSlots);
+#endif
+
+    localNumBackends = 0;
+
+    beentry = BackendStatusArray;
+    localentry = localtable;
+    for (i = 1; i <= NumBackendStatSlots; i++)
+    {
+        /*
+         * Follow the protocol of retrying if st_changecount changes while we
+         * copy the entry, or if it's odd.  (The check for odd is needed to
+         * cover the case where we are able to completely copy the entry while
+         * the source backend is between increment steps.)    We use a volatile
+         * pointer here to ensure the compiler doesn't try to get cute.
+         */
+        for (;;)
+        {
+            int            before_changecount;
+            int            after_changecount;
+
+            pgstat_save_changecount_before(beentry, before_changecount);
+
+            localentry->backendStatus.st_procpid = beentry->st_procpid;
+            if (localentry->backendStatus.st_procpid > 0)
+            {
+                memcpy(&localentry->backendStatus, (char *) beentry, sizeof(PgBackendStatus));
+
+                /*
+                 * strcpy is safe even if the string is modified concurrently,
+                 * because there's always a \0 at the end of the buffer.
+                 */
+                strcpy(localappname, (char *) beentry->st_appname);
+                localentry->backendStatus.st_appname = localappname;
+                strcpy(localclienthostname, (char *) beentry->st_clienthostname);
+                localentry->backendStatus.st_clienthostname = localclienthostname;
+                strcpy(localactivity, (char *) beentry->st_activity_raw);
+                localentry->backendStatus.st_activity_raw = localactivity;
+                localentry->backendStatus.st_ssl = beentry->st_ssl;
+#ifdef USE_SSL
+                if (beentry->st_ssl)
+                {
+                    memcpy(localsslstatus, beentry->st_sslstatus, sizeof(PgBackendSSLStatus));
+                    localentry->backendStatus.st_sslstatus = localsslstatus;
+                }
+#endif
+            }
+
+            pgstat_save_changecount_after(beentry, after_changecount);
+            if (before_changecount == after_changecount &&
+                (before_changecount & 1) == 0)
+                break;
+
+            /* Make sure we can break out of loop if stuck... */
+            CHECK_FOR_INTERRUPTS();
+        }
+
+        beentry++;
+        /* Only valid entries are included in the local array */
+        if (localentry->backendStatus.st_procpid > 0)
+        {
+            BackendIdGetTransactionIds(i,
+                                       &localentry->backend_xid,
+                                       &localentry->backend_xmin);
+
+            localentry++;
+            localappname += NAMEDATALEN;
+            localclienthostname += NAMEDATALEN;
+            localactivity += pgstat_track_activity_query_size;
+#ifdef USE_SSL
+            localsslstatus++;
+#endif
+            localNumBackends++;
+        }
+    }
+
+    /* Set the pointer only after completion of a valid table */
+    localBackendStatusTable = localtable;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_beentry() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    our local copy of the current-activity entry for one backend.
+ *
+ *    NB: caller is responsible for checking whether the user is permitted to
+ *    see this info (especially the querystring).
+ * ----------
+ */
+PgBackendStatus *
+pgstat_fetch_stat_beentry(int beid)
+{
+    pgstat_read_current_status();
+
+    if (beid < 1 || beid > localNumBackends)
+        return NULL;
+
+    return &localBackendStatusTable[beid - 1].backendStatus;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_local_beentry() -
+ *
+ *    Like pgstat_fetch_stat_beentry() but with locally computed additions (like
+ *    xid and xmin values of the backend)
+ *
+ *    NB: caller is responsible for checking whether the user is permitted to
+ *    see this info (especially the querystring).
+ * ----------
+ */
+LocalPgBackendStatus *
+pgstat_fetch_stat_local_beentry(int beid)
+{
+    pgstat_read_current_status();
+
+    if (beid < 1 || beid > localNumBackends)
+        return NULL;
+
+    return &localBackendStatusTable[beid - 1];
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_numbackends() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    the maximum current backend id.
+ * ----------
+ */
+int
+pgstat_fetch_stat_numbackends(void)
+{
+    pgstat_read_current_status();
+
+    return localNumBackends;
+}
+
+/* ----------
+ * pgstat_get_wait_event_type() -
+ *
+ *    Return a string representing the type of wait event the backend is
+ *    currently waiting on.
+ */
+const char *
+pgstat_get_wait_event_type(uint32 wait_event_info)
+{
+    uint32        classId;
+    const char *event_type;
+
+    /* report process as not waiting. */
+    if (wait_event_info == 0)
+        return NULL;
+
+    classId = wait_event_info & 0xFF000000;
+
+    switch (classId)
+    {
+        case PG_WAIT_LWLOCK:
+            event_type = "LWLock";
+            break;
+        case PG_WAIT_LOCK:
+            event_type = "Lock";
+            break;
+        case PG_WAIT_BUFFER_PIN:
+            event_type = "BufferPin";
+            break;
+        case PG_WAIT_ACTIVITY:
+            event_type = "Activity";
+            break;
+        case PG_WAIT_CLIENT:
+            event_type = "Client";
+            break;
+        case PG_WAIT_EXTENSION:
+            event_type = "Extension";
+            break;
+        case PG_WAIT_IPC:
+            event_type = "IPC";
+            break;
+        case PG_WAIT_TIMEOUT:
+            event_type = "Timeout";
+            break;
+        case PG_WAIT_IO:
+            event_type = "IO";
+            break;
+        default:
+            event_type = "???";
+            break;
+    }
+
+    return event_type;
+}
+
+/* ----------
+ * pgstat_get_wait_event() -
+ *
+ *    Return a string representing the wait event the backend is currently
+ *    waiting on.
+ */
+const char *
+pgstat_get_wait_event(uint32 wait_event_info)
+{
+    uint32        classId;
+    uint16        eventId;
+    const char *event_name;
+
+    /* report process as not waiting. */
+    if (wait_event_info == 0)
+        return NULL;
+
+    classId = wait_event_info & 0xFF000000;
+    eventId = wait_event_info & 0x0000FFFF;
+
+    switch (classId)
+    {
+        case PG_WAIT_LWLOCK:
+            event_name = GetLWLockIdentifier(classId, eventId);
+            break;
+        case PG_WAIT_LOCK:
+            event_name = GetLockNameFromTagType(eventId);
+            break;
+        case PG_WAIT_BUFFER_PIN:
+            event_name = "BufferPin";
+            break;
+        case PG_WAIT_ACTIVITY:
+            {
+                WaitEventActivity w = (WaitEventActivity) wait_event_info;
+
+                event_name = pgstat_get_wait_activity(w);
+                break;
+            }
+        case PG_WAIT_CLIENT:
+            {
+                WaitEventClient w = (WaitEventClient) wait_event_info;
+
+                event_name = pgstat_get_wait_client(w);
+                break;
+            }
+        case PG_WAIT_EXTENSION:
+            event_name = "Extension";
+            break;
+        case PG_WAIT_IPC:
+            {
+                WaitEventIPC w = (WaitEventIPC) wait_event_info;
+
+                event_name = pgstat_get_wait_ipc(w);
+                break;
+            }
+        case PG_WAIT_TIMEOUT:
+            {
+                WaitEventTimeout w = (WaitEventTimeout) wait_event_info;
+
+                event_name = pgstat_get_wait_timeout(w);
+                break;
+            }
+        case PG_WAIT_IO:
+            {
+                WaitEventIO w = (WaitEventIO) wait_event_info;
+
+                event_name = pgstat_get_wait_io(w);
+                break;
+            }
+        default:
+            event_name = "unknown wait event";
+            break;
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_activity() -
+ *
+ * Convert WaitEventActivity to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_activity(WaitEventActivity w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_ARCHIVER_MAIN:
+            event_name = "ArchiverMain";
+            break;
+        case WAIT_EVENT_AUTOVACUUM_MAIN:
+            event_name = "AutoVacuumMain";
+            break;
+        case WAIT_EVENT_BGWRITER_HIBERNATE:
+            event_name = "BgWriterHibernate";
+            break;
+        case WAIT_EVENT_BGWRITER_MAIN:
+            event_name = "BgWriterMain";
+            break;
+        case WAIT_EVENT_CHECKPOINTER_MAIN:
+            event_name = "CheckpointerMain";
+            break;
+        case WAIT_EVENT_LOGICAL_APPLY_MAIN:
+            event_name = "LogicalApplyMain";
+            break;
+        case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
+            event_name = "LogicalLauncherMain";
+            break;
+        case WAIT_EVENT_RECOVERY_WAL_ALL:
+            event_name = "RecoveryWalAll";
+            break;
+        case WAIT_EVENT_RECOVERY_WAL_STREAM:
+            event_name = "RecoveryWalStream";
+            break;
+        case WAIT_EVENT_SYSLOGGER_MAIN:
+            event_name = "SysLoggerMain";
+            break;
+        case WAIT_EVENT_WAL_RECEIVER_MAIN:
+            event_name = "WalReceiverMain";
+            break;
+        case WAIT_EVENT_WAL_SENDER_MAIN:
+            event_name = "WalSenderMain";
+            break;
+        case WAIT_EVENT_WAL_WRITER_MAIN:
+            event_name = "WalWriterMain";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_client() -
+ *
+ * Convert WaitEventClient to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_client(WaitEventClient w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_CLIENT_READ:
+            event_name = "ClientRead";
+            break;
+        case WAIT_EVENT_CLIENT_WRITE:
+            event_name = "ClientWrite";
+            break;
+        case WAIT_EVENT_LIBPQWALRECEIVER_CONNECT:
+            event_name = "LibPQWalReceiverConnect";
+            break;
+        case WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE:
+            event_name = "LibPQWalReceiverReceive";
+            break;
+        case WAIT_EVENT_SSL_OPEN_SERVER:
+            event_name = "SSLOpenServer";
+            break;
+        case WAIT_EVENT_WAL_RECEIVER_WAIT_START:
+            event_name = "WalReceiverWaitStart";
+            break;
+        case WAIT_EVENT_WAL_SENDER_WAIT_WAL:
+            event_name = "WalSenderWaitForWAL";
+            break;
+        case WAIT_EVENT_WAL_SENDER_WRITE_DATA:
+            event_name = "WalSenderWriteData";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_ipc() -
+ *
+ * Convert WaitEventIPC to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_ipc(WaitEventIPC w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_BGWORKER_SHUTDOWN:
+            event_name = "BgWorkerShutdown";
+            break;
+        case WAIT_EVENT_BGWORKER_STARTUP:
+            event_name = "BgWorkerStartup";
+            break;
+        case WAIT_EVENT_BTREE_PAGE:
+            event_name = "BtreePage";
+            break;
+        case WAIT_EVENT_CLOG_GROUP_UPDATE:
+            event_name = "ClogGroupUpdate";
+            break;
+        case WAIT_EVENT_EXECUTE_GATHER:
+            event_name = "ExecuteGather";
+            break;
+        case WAIT_EVENT_HASH_BATCH_ALLOCATING:
+            event_name = "Hash/Batch/Allocating";
+            break;
+        case WAIT_EVENT_HASH_BATCH_ELECTING:
+            event_name = "Hash/Batch/Electing";
+            break;
+        case WAIT_EVENT_HASH_BATCH_LOADING:
+            event_name = "Hash/Batch/Loading";
+            break;
+        case WAIT_EVENT_HASH_BUILD_ALLOCATING:
+            event_name = "Hash/Build/Allocating";
+            break;
+        case WAIT_EVENT_HASH_BUILD_ELECTING:
+            event_name = "Hash/Build/Electing";
+            break;
+        case WAIT_EVENT_HASH_BUILD_HASHING_INNER:
+            event_name = "Hash/Build/HashingInner";
+            break;
+        case WAIT_EVENT_HASH_BUILD_HASHING_OUTER:
+            event_name = "Hash/Build/HashingOuter";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING:
+            event_name = "Hash/GrowBatches/Allocating";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_DECIDING:
+            event_name = "Hash/GrowBatches/Deciding";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_ELECTING:
+            event_name = "Hash/GrowBatches/Electing";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_FINISHING:
+            event_name = "Hash/GrowBatches/Finishing";
+            break;
+        case WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING:
+            event_name = "Hash/GrowBatches/Repartitioning";
+            break;
+        case WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING:
+            event_name = "Hash/GrowBuckets/Allocating";
+            break;
+        case WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING:
+            event_name = "Hash/GrowBuckets/Electing";
+            break;
+        case WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING:
+            event_name = "Hash/GrowBuckets/Reinserting";
+            break;
+        case WAIT_EVENT_LOGICAL_SYNC_DATA:
+            event_name = "LogicalSyncData";
+            break;
+        case WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE:
+            event_name = "LogicalSyncStateChange";
+            break;
+        case WAIT_EVENT_MQ_INTERNAL:
+            event_name = "MessageQueueInternal";
+            break;
+        case WAIT_EVENT_MQ_PUT_MESSAGE:
+            event_name = "MessageQueuePutMessage";
+            break;
+        case WAIT_EVENT_MQ_RECEIVE:
+            event_name = "MessageQueueReceive";
+            break;
+        case WAIT_EVENT_MQ_SEND:
+            event_name = "MessageQueueSend";
+            break;
+        case WAIT_EVENT_PARALLEL_BITMAP_SCAN:
+            event_name = "ParallelBitmapScan";
+            break;
+        case WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN:
+            event_name = "ParallelCreateIndexScan";
+            break;
+        case WAIT_EVENT_PARALLEL_FINISH:
+            event_name = "ParallelFinish";
+            break;
+        case WAIT_EVENT_PROCARRAY_GROUP_UPDATE:
+            event_name = "ProcArrayGroupUpdate";
+            break;
+        case WAIT_EVENT_PROMOTE:
+            event_name = "Promote";
+            break;
+        case WAIT_EVENT_REPLICATION_ORIGIN_DROP:
+            event_name = "ReplicationOriginDrop";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_DROP:
+            event_name = "ReplicationSlotDrop";
+            break;
+        case WAIT_EVENT_SAFE_SNAPSHOT:
+            event_name = "SafeSnapshot";
+            break;
+        case WAIT_EVENT_SYNC_REP:
+            event_name = "SyncRep";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_timeout() -
+ *
+ * Convert WaitEventTimeout to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_timeout(WaitEventTimeout w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_BASE_BACKUP_THROTTLE:
+            event_name = "BaseBackupThrottle";
+            break;
+        case WAIT_EVENT_PG_SLEEP:
+            event_name = "PgSleep";
+            break;
+        case WAIT_EVENT_RECOVERY_APPLY_DELAY:
+            event_name = "RecoveryApplyDelay";
+            break;
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+/* ----------
+ * pgstat_get_wait_io() -
+ *
+ * Convert WaitEventIO to string.
+ * ----------
+ */
+static const char *
+pgstat_get_wait_io(WaitEventIO w)
+{
+    const char *event_name = "unknown wait event";
+
+    switch (w)
+    {
+        case WAIT_EVENT_BUFFILE_READ:
+            event_name = "BufFileRead";
+            break;
+        case WAIT_EVENT_BUFFILE_WRITE:
+            event_name = "BufFileWrite";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_READ:
+            event_name = "ControlFileRead";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_SYNC:
+            event_name = "ControlFileSync";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE:
+            event_name = "ControlFileSyncUpdate";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_WRITE:
+            event_name = "ControlFileWrite";
+            break;
+        case WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE:
+            event_name = "ControlFileWriteUpdate";
+            break;
+        case WAIT_EVENT_COPY_FILE_READ:
+            event_name = "CopyFileRead";
+            break;
+        case WAIT_EVENT_COPY_FILE_WRITE:
+            event_name = "CopyFileWrite";
+            break;
+        case WAIT_EVENT_DATA_FILE_EXTEND:
+            event_name = "DataFileExtend";
+            break;
+        case WAIT_EVENT_DATA_FILE_FLUSH:
+            event_name = "DataFileFlush";
+            break;
+        case WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC:
+            event_name = "DataFileImmediateSync";
+            break;
+        case WAIT_EVENT_DATA_FILE_PREFETCH:
+            event_name = "DataFilePrefetch";
+            break;
+        case WAIT_EVENT_DATA_FILE_READ:
+            event_name = "DataFileRead";
+            break;
+        case WAIT_EVENT_DATA_FILE_SYNC:
+            event_name = "DataFileSync";
+            break;
+        case WAIT_EVENT_DATA_FILE_TRUNCATE:
+            event_name = "DataFileTruncate";
+            break;
+        case WAIT_EVENT_DATA_FILE_WRITE:
+            event_name = "DataFileWrite";
+            break;
+        case WAIT_EVENT_DSM_FILL_ZERO_WRITE:
+            event_name = "DSMFillZeroWrite";
+            break;
+        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ:
+            event_name = "LockFileAddToDataDirRead";
+            break;
+        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC:
+            event_name = "LockFileAddToDataDirSync";
+            break;
+        case WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE:
+            event_name = "LockFileAddToDataDirWrite";
+            break;
+        case WAIT_EVENT_LOCK_FILE_CREATE_READ:
+            event_name = "LockFileCreateRead";
+            break;
+        case WAIT_EVENT_LOCK_FILE_CREATE_SYNC:
+            event_name = "LockFileCreateSync";
+            break;
+        case WAIT_EVENT_LOCK_FILE_CREATE_WRITE:
+            event_name = "LockFileCreateWrite";
+            break;
+        case WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ:
+            event_name = "LockFileReCheckDataDirRead";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC:
+            event_name = "LogicalRewriteCheckpointSync";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC:
+            event_name = "LogicalRewriteMappingSync";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE:
+            event_name = "LogicalRewriteMappingWrite";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_SYNC:
+            event_name = "LogicalRewriteSync";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE:
+            event_name = "LogicalRewriteTruncate";
+            break;
+        case WAIT_EVENT_LOGICAL_REWRITE_WRITE:
+            event_name = "LogicalRewriteWrite";
+            break;
+        case WAIT_EVENT_RELATION_MAP_READ:
+            event_name = "RelationMapRead";
+            break;
+        case WAIT_EVENT_RELATION_MAP_SYNC:
+            event_name = "RelationMapSync";
+            break;
+        case WAIT_EVENT_RELATION_MAP_WRITE:
+            event_name = "RelationMapWrite";
+            break;
+        case WAIT_EVENT_REORDER_BUFFER_READ:
+            event_name = "ReorderBufferRead";
+            break;
+        case WAIT_EVENT_REORDER_BUFFER_WRITE:
+            event_name = "ReorderBufferWrite";
+            break;
+        case WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ:
+            event_name = "ReorderLogicalMappingRead";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_READ:
+            event_name = "ReplicationSlotRead";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC:
+            event_name = "ReplicationSlotRestoreSync";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_SYNC:
+            event_name = "ReplicationSlotSync";
+            break;
+        case WAIT_EVENT_REPLICATION_SLOT_WRITE:
+            event_name = "ReplicationSlotWrite";
+            break;
+        case WAIT_EVENT_SLRU_FLUSH_SYNC:
+            event_name = "SLRUFlushSync";
+            break;
+        case WAIT_EVENT_SLRU_READ:
+            event_name = "SLRURead";
+            break;
+        case WAIT_EVENT_SLRU_SYNC:
+            event_name = "SLRUSync";
+            break;
+        case WAIT_EVENT_SLRU_WRITE:
+            event_name = "SLRUWrite";
+            break;
+        case WAIT_EVENT_SNAPBUILD_READ:
+            event_name = "SnapbuildRead";
+            break;
+        case WAIT_EVENT_SNAPBUILD_SYNC:
+            event_name = "SnapbuildSync";
+            break;
+        case WAIT_EVENT_SNAPBUILD_WRITE:
+            event_name = "SnapbuildWrite";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC:
+            event_name = "TimelineHistoryFileSync";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE:
+            event_name = "TimelineHistoryFileWrite";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_READ:
+            event_name = "TimelineHistoryRead";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_SYNC:
+            event_name = "TimelineHistorySync";
+            break;
+        case WAIT_EVENT_TIMELINE_HISTORY_WRITE:
+            event_name = "TimelineHistoryWrite";
+            break;
+        case WAIT_EVENT_TWOPHASE_FILE_READ:
+            event_name = "TwophaseFileRead";
+            break;
+        case WAIT_EVENT_TWOPHASE_FILE_SYNC:
+            event_name = "TwophaseFileSync";
+            break;
+        case WAIT_EVENT_TWOPHASE_FILE_WRITE:
+            event_name = "TwophaseFileWrite";
+            break;
+        case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ:
+            event_name = "WALSenderTimelineHistoryRead";
+            break;
+        case WAIT_EVENT_WAL_BOOTSTRAP_SYNC:
+            event_name = "WALBootstrapSync";
+            break;
+        case WAIT_EVENT_WAL_BOOTSTRAP_WRITE:
+            event_name = "WALBootstrapWrite";
+            break;
+        case WAIT_EVENT_WAL_COPY_READ:
+            event_name = "WALCopyRead";
+            break;
+        case WAIT_EVENT_WAL_COPY_SYNC:
+            event_name = "WALCopySync";
+            break;
+        case WAIT_EVENT_WAL_COPY_WRITE:
+            event_name = "WALCopyWrite";
+            break;
+        case WAIT_EVENT_WAL_INIT_SYNC:
+            event_name = "WALInitSync";
+            break;
+        case WAIT_EVENT_WAL_INIT_WRITE:
+            event_name = "WALInitWrite";
+            break;
+        case WAIT_EVENT_WAL_READ:
+            event_name = "WALRead";
+            break;
+        case WAIT_EVENT_WAL_SYNC:
+            event_name = "WALSync";
+            break;
+        case WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN:
+            event_name = "WALSyncMethodAssign";
+            break;
+        case WAIT_EVENT_WAL_WRITE:
+            event_name = "WALWrite";
+            break;
+
+            /* no default case, so that compiler will warn */
+    }
+
+    return event_name;
+}
+
+
+/* ----------
+ * pgstat_get_backend_current_activity() -
+ *
+ *    Return a string representing the current activity of the backend with
+ *    the specified PID.  This looks directly at the BackendStatusArray,
+ *    and so will provide current information regardless of the age of our
+ *    transaction's snapshot of the status array.
+ *
+ *    It is the caller's responsibility to invoke this only for backends whose
+ *    state is expected to remain stable while the result is in use.  The
+ *    only current use is in deadlock reporting, where we can expect that
+ *    the target backend is blocked on a lock.  (There are corner cases
+ *    where the target's wait could get aborted while we are looking at it,
+ *    but the very worst consequence is to return a pointer to a string
+ *    that's been changed, so we won't worry too much.)
+ *
+ *    Note: return strings for special cases match pg_stat_get_backend_activity.
+ * ----------
+ */
+const char *
+pgstat_get_backend_current_activity(int pid, bool checkUser)
+{
+    PgBackendStatus *beentry;
+    int            i;
+
+    beentry = BackendStatusArray;
+    for (i = 1; i <= MaxBackends; i++)
+    {
+        /*
+         * Although we expect the target backend's entry to be stable, that
+         * doesn't imply that anyone else's is.  To avoid identifying the
+         * wrong backend, while we check for a match to the desired PID we
+         * must follow the protocol of retrying if st_changecount changes
+         * while we examine the entry, or if it's odd.  (This might be
+         * unnecessary, since fetching or storing an int is almost certainly
+         * atomic, but let's play it safe.)  We use a volatile pointer here to
+         * ensure the compiler doesn't try to get cute.
+         */
+        volatile PgBackendStatus *vbeentry = beentry;
+        bool        found;
+
+        for (;;)
+        {
+            int            before_changecount;
+            int            after_changecount;
+
+            pgstat_save_changecount_before(vbeentry, before_changecount);
+
+            found = (vbeentry->st_procpid == pid);
+
+            pgstat_save_changecount_after(vbeentry, after_changecount);
+
+            if (before_changecount == after_changecount &&
+                (before_changecount & 1) == 0)
+                break;
+
+            /* Make sure we can break out of loop if stuck... */
+            CHECK_FOR_INTERRUPTS();
+        }
+
+        if (found)
+        {
+            /* Now it is safe to use the non-volatile pointer */
+            if (checkUser && !superuser() && beentry->st_userid != GetUserId())
+                return "<insufficient privilege>";
+            else if (*(beentry->st_activity_raw) == '\0')
+                return "<command string not enabled>";
+            else
+            {
+                /* this'll leak a bit of memory, but that seems acceptable */
+                return pgstat_clip_activity(beentry->st_activity_raw);
+            }
+        }
+
+        beentry++;
+    }
+
+    /* If we get here, caller is in error ... */
+    return "<backend information not available>";
+}
+
+/* ----------
+ * pgstat_get_crashed_backend_activity() -
+ *
+ *    Return a string representing the current activity of the backend with
+ *    the specified PID.  Like the function above, but reads shared memory with
+ *    the expectation that it may be corrupt.  On success, copy the string
+ *    into the "buffer" argument and return that pointer.  On failure,
+ *    return NULL.
+ *
+ *    This function is only intended to be used by the postmaster to report the
+ *    query that crashed a backend.  In particular, no attempt is made to
+ *    follow the correct concurrency protocol when accessing the
+ *    BackendStatusArray.  But that's OK; in the worst case we'll return a
+ *    corrupted message.  We also must take care not to trip on ereport(ERROR).
+ * ----------
+ */
+const char *
+pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
+{
+    volatile PgBackendStatus *beentry;
+    int            i;
+
+    beentry = BackendStatusArray;
+
+    /*
+     * We probably shouldn't get here before shared memory has been set up,
+     * but be safe.
+     */
+    if (beentry == NULL || BackendActivityBuffer == NULL)
+        return NULL;
+
+    for (i = 1; i <= MaxBackends; i++)
+    {
+        if (beentry->st_procpid == pid)
+        {
+            /* Read pointer just once, so it can't change after validation */
+            const char *activity = beentry->st_activity_raw;
+            const char *activity_last;
+
+            /*
+             * We mustn't access activity string before we verify that it
+             * falls within the BackendActivityBuffer. To make sure that the
+             * entire string including its ending is contained within the
+             * buffer, subtract one activity length from the buffer size.
+             */
+            activity_last = BackendActivityBuffer + BackendActivityBufferSize
+                - pgstat_track_activity_query_size;
+
+            if (activity < BackendActivityBuffer ||
+                activity > activity_last)
+                return NULL;
+
+            /* If no string available, no point in a report */
+            if (activity[0] == '\0')
+                return NULL;
+
+            /*
+             * Copy only ASCII-safe characters so we don't run into encoding
+             * problems when reporting the message; and be sure not to run off
+             * the end of memory.  As only ASCII characters are reported, it
+             * doesn't seem necessary to perform multibyte aware clipping.
+             */
+            ascii_safe_strlcpy(buffer, activity,
+                               Min(buflen, pgstat_track_activity_query_size));
+
+            return buffer;
+        }
+
+        beentry++;
+    }
+
+    /* PID not found */
+    return NULL;
+}
+
+const char *
+pgstat_get_backend_desc(BackendType backendType)
+{
+    const char *backendDesc = "unknown process type";
+
+    switch (backendType)
+    {
+        case B_AUTOVAC_LAUNCHER:
+            backendDesc = "autovacuum launcher";
+            break;
+        case B_AUTOVAC_WORKER:
+            backendDesc = "autovacuum worker";
+            break;
+        case B_BACKEND:
+            backendDesc = "client backend";
+            break;
+        case B_BG_WORKER:
+            backendDesc = "background worker";
+            break;
+        case B_BG_WRITER:
+            backendDesc = "background writer";
+            break;
+        case B_ARCHIVER:
+            backendDesc = "archiver";
+            break;
+        case B_CHECKPOINTER:
+            backendDesc = "checkpointer";
+            break;
+        case B_STARTUP:
+            backendDesc = "startup";
+            break;
+        case B_WAL_RECEIVER:
+            backendDesc = "walreceiver";
+            break;
+        case B_WAL_SENDER:
+            backendDesc = "walsender";
+            break;
+        case B_WAL_WRITER:
+            backendDesc = "walwriter";
+            break;
+    }
+
+    return backendDesc;
+}
+
+/* ----------
+ * pgstat_report_appname() -
+ *
+ *    Called to update our application name.
+ * ----------
+ */
+void
+pgstat_report_appname(const char *appname)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+    int            len;
+
+    if (!beentry)
+        return;
+
+    /* This should be unnecessary if GUC did its job, but be safe */
+    len = pg_mbcliplen(appname, strlen(appname), NAMEDATALEN - 1);
+
+    /*
+     * Update my status entry, following the protocol of bumping
+     * st_changecount before and after.  We use a volatile pointer here to
+     * ensure the compiler doesn't try to get cute.
+     */
+    pgstat_increment_changecount_before(beentry);
+
+    memcpy((char *) beentry->st_appname, appname, len);
+    beentry->st_appname[len] = '\0';
+
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*
+ * Report current transaction start timestamp as the specified value.
+ * Zero means there is no active transaction.
+ */
+void
+pgstat_report_xact_timestamp(TimestampTz tstamp)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    if (!pgstat_track_activities || !beentry)
+        return;
+
+    /*
+     * Update my status entry, following the protocol of bumping
+     * st_changecount before and after.  We use a volatile pointer here to
+     * ensure the compiler doesn't try to get cute.
+     */
+    pgstat_increment_changecount_before(beentry);
+    beentry->st_xact_start_timestamp = tstamp;
+    pgstat_increment_changecount_after(beentry);
+}
+
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ *    Create pgBeStatLocalContext, if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+    if (!pgBeStatLocalContext)
+        pgBeStatLocalContext = AllocSetContextCreate(TopMemoryContext,
+                                                     "Backend status snapshot",
+                                                     ALLOCSET_SMALL_SIZES);
+}
+
+/* ----------
+ * AtEOXact_BEStatus
+ *
+ *    Called from access/transam/xact.c at top-level transaction commit/abort.
+ * ----------
+ */
+void
+AtEOXact_BEStatus(bool isCommit)
+{
+    bestatus_clear_snapshot();
+}
+
+/*
+ * AtPrepare_BEStatus
+ *        Clear existing snapshot at 2PC transaction prepare.
+ */
+void
+AtPrepare_BEStatus(void)
+{
+    bestatus_clear_snapshot();
+}
+
+/* ----------
+ * bestatus_clear_snapshot() -
+ *
+ *    Discard any data collected in the current transaction.  Any subsequent
+ *    request will cause new snapshots to be read.
+ *
+ *    This is also invoked during transaction commit or abort to discard
+ *    the no-longer-wanted snapshot.
+ * ----------
+ */
+static void
+bestatus_clear_snapshot(void)
+{
+    /* Release memory, if any was allocated */
+    if (pgBeStatLocalContext)
+        MemoryContextDelete(pgBeStatLocalContext);
+
+    /* Reset variables */
+    pgBeStatLocalContext = NULL;
+    localBackendStatusTable = NULL;
+    localNumBackends = 0;
+}
+
+
+
+/* ----------
+ * pgstat_report_activity() -
+ *
+ *    Called from tcop/postgres.c to report what the backend is actually doing
+ *    (but note cmd_str can be NULL for certain cases).
+ *
+ * All updates of the status entry follow the protocol of bumping
+ * st_changecount before and after.  We use a volatile pointer here to
+ * ensure the compiler doesn't try to get cute.
+ * ----------
+ */
+void
+pgstat_report_activity(BackendState state, const char *cmd_str)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+    TimestampTz start_timestamp;
+    TimestampTz current_timestamp;
+    int            len = 0;
+
+    TRACE_POSTGRESQL_STATEMENT_STATUS(cmd_str);
+
+    if (!beentry)
+        return;
+
+    if (!pgstat_track_activities)
+    {
+        if (beentry->st_state != STATE_DISABLED)
+        {
+            volatile PGPROC *proc = MyProc;
+
+            /*
+             * track_activities is disabled, but we last reported a
+             * non-disabled state.  As our final update, change the state and
+             * clear fields we will not be updating anymore.
+             */
+            pgstat_increment_changecount_before(beentry);
+            beentry->st_state = STATE_DISABLED;
+            beentry->st_state_start_timestamp = 0;
+            beentry->st_activity_raw[0] = '\0';
+            beentry->st_activity_start_timestamp = 0;
+            /* st_xact_start_timestamp and wait_event_info are also disabled */
+            beentry->st_xact_start_timestamp = 0;
+            proc->wait_event_info = 0;
+            pgstat_increment_changecount_after(beentry);
+        }
+        return;
+    }
+
+    /*
+     * To minimize the time spent modifying the entry, fetch all the needed
+     * data first.
+     */
+    start_timestamp = GetCurrentStatementStartTimestamp();
+    if (cmd_str != NULL)
+    {
+        /*
+         * Compute length of to-be-stored string unaware of multi-byte
+         * characters. For speed reasons that'll get corrected on read, rather
+         * than computed every write.
+         */
+        len = Min(strlen(cmd_str), pgstat_track_activity_query_size - 1);
+    }
+    current_timestamp = GetCurrentTimestamp();
+
+    /*
+     * Now update the status entry
+     */
+    pgstat_increment_changecount_before(beentry);
+
+    beentry->st_state = state;
+    beentry->st_state_start_timestamp = current_timestamp;
+
+    if (cmd_str != NULL)
+    {
+        memcpy((char *) beentry->st_activity_raw, cmd_str, len);
+        beentry->st_activity_raw[len] = '\0';
+        beentry->st_activity_start_timestamp = start_timestamp;
+    }
+
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * pgstat_progress_start_command() -
+ *
+ * Set st_progress_command (and st_progress_command_target) in own backend
+ * entry.  Also, zero-initialize st_progress_param array.
+ *-----------
+ */
+void
+pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    if (!beentry || !pgstat_track_activities)
+        return;
+
+    pgstat_increment_changecount_before(beentry);
+    beentry->st_progress_command = cmdtype;
+    beentry->st_progress_command_target = relid;
+    MemSet(&beentry->st_progress_param, 0, sizeof(beentry->st_progress_param));
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * pgstat_progress_update_param() -
+ *
+ * Update index'th member in st_progress_param[] of own backend entry.
+ *-----------
+ */
+void
+pgstat_progress_update_param(int index, int64 val)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    Assert(index >= 0 && index < PGSTAT_NUM_PROGRESS_PARAM);
+
+    if (!beentry || !pgstat_track_activities)
+        return;
+
+    pgstat_increment_changecount_before(beentry);
+    beentry->st_progress_param[index] = val;
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * pgstat_progress_update_multi_param() -
+ *
+ * Update multiple members in st_progress_param[] of own backend entry.
+ * This is atomic; readers won't see intermediate states.
+ *-----------
+ */
+void
+pgstat_progress_update_multi_param(int nparam, const int *index,
+                                   const int64 *val)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+    int            i;
+
+    if (!beentry || !pgstat_track_activities || nparam == 0)
+        return;
+
+    pgstat_increment_changecount_before(beentry);
+
+    for (i = 0; i < nparam; ++i)
+    {
+        Assert(index[i] >= 0 && index[i] < PGSTAT_NUM_PROGRESS_PARAM);
+
+        beentry->st_progress_param[index[i]] = val[i];
+    }
+
+    pgstat_increment_changecount_after(beentry);
+}
+
+/*-----------
+ * pgstat_progress_end_command() -
+ *
+ * Reset st_progress_command (and st_progress_command_target) in own backend
+ * entry.  This signals the end of the command.
+ *-----------
+ */
+void
+pgstat_progress_end_command(void)
+{
+    volatile PgBackendStatus *beentry = MyBEEntry;
+
+    if (!beentry)
+        return;
+    if (!pgstat_track_activities
+        && beentry->st_progress_command == PROGRESS_COMMAND_INVALID)
+        return;
+
+    pgstat_increment_changecount_before(beentry);
+    beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
+    beentry->st_progress_command_target = InvalidOid;
+    pgstat_increment_changecount_after(beentry);
+}
+
+
+/*
+ * Convert a potentially unsafely truncated activity string (see
+ * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
+ * one.
+ *
+ * The returned string is allocated in the caller's memory context and may be
+ * freed.
+ */
+char *
+pgstat_clip_activity(const char *raw_activity)
+{
+    char       *activity;
+    int            rawlen;
+    int            cliplen;
+
+    /*
+     * Some callers, like pgstat_get_backend_current_activity(), do not
+     * guarantee that the buffer isn't concurrently modified. We try to take
+     * care that the buffer is always terminated by a NUL byte regardless, but
+     * let's still be paranoid about the string's length. In those cases the
+     * underlying buffer is guaranteed to be pgstat_track_activity_query_size
+     * large.
+     */
+    activity = pnstrdup(raw_activity, pgstat_track_activity_query_size - 1);
+
+    /* now double-guaranteed to be NUL terminated */
+    rawlen = strlen(activity);
+
+    /*
+     * All supported server-encodings make it possible to determine the length
+     * of a multi-byte character from its first byte (this is not the case for
+     * client encodings, see GB18030). As st_activity is always stored using
+     * server encoding, this allows us to perform multi-byte aware truncation,
+     * even if the string earlier was truncated in the middle of a multi-byte
+     * character.
+     */
+    cliplen = pg_mbcliplen(activity, rawlen,
+                           pgstat_track_activity_query_size - 1);
+
+    activity[cliplen] = '\0';
+
+    return activity;
+}
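
Note on the "bumping st_changecount before and after" protocol that the functions above rely on: the following is a minimal standalone sketch, not code from this patch. DemoStatus, demo_report_appname and demo_read_appname are hypothetical names, and C11 atomics stand in for the memory barriers the real pgstat_increment_changecount_* macros are assumed to provide. The writer makes the counter odd while it modifies the entry; a reader retries until it sees the same even value before and after its copy.

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

#define DEMO_APPNAME_LEN 64

/* Hypothetical stand-in for PgBackendStatus; not the real struct. */
typedef struct DemoStatus
{
    atomic_int  changecount;    /* odd while an update is in progress */
    char        appname[DEMO_APPNAME_LEN];
} DemoStatus;

/* Writer side: bump the counter before and after touching the entry. */
void
demo_report_appname(DemoStatus *st, const char *appname)
{
    size_t      len = strlen(appname);

    if (len > DEMO_APPNAME_LEN - 1)
        len = DEMO_APPNAME_LEN - 1;

    atomic_fetch_add(&st->changecount, 1);  /* odd: update in progress */
    memcpy(st->appname, appname, len);
    st->appname[len] = '\0';
    atomic_fetch_add(&st->changecount, 1);  /* even: update complete */
}

/* Reader side: retry until the copy was taken between two equal, even counts. */
void
demo_read_appname(DemoStatus *st, char *dst)
{
    for (;;)
    {
        int         before = atomic_load(&st->changecount);
        int         after;

        memcpy(dst, st->appname, DEMO_APPNAME_LEN);
        after = atomic_load(&st->changecount);

        if (before == after && (before & 1) == 0)
            break;              /* consistent snapshot */
    }
}
```

A caller would zero-initialize a DemoStatus, call demo_report_appname(&st, "psql"), and then read it back with demo_read_appname() into a buffer of DEMO_APPNAME_LEN bytes. The point of the design is that the writer never blocks; only readers pay for a retry when they race with an update.
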
diff --git a/src/backend/statmon/pgstat.c b/src/backend/statmon/pgstat.c
new file mode 100644
index 0000000000..3849c6ec05
--- /dev/null
+++ b/src/backend/statmon/pgstat.c
@@ -0,0 +1,4072 @@
+/* ----------
+ * pgstat.c
+ *
+ *    Statistics collector facility.
+ *
+ *  Collects per-table and per-function usage statistics of backends and shares
+ *  them among all backends via shared memory. Every backend records
+ *  individual activity in local memory using pg_count_*() and friends
+ *  interfaces during a transaction. Then pgstat_report_stat() is called at
+ *  the end of a transaction to flush out the local numbers to shared
+ *  memory. To avoid congestion on the shared memory, we do that no more
+ *  often than once per PGSTAT_STAT_MIN_INTERVAL (500ms). A backend may still
+ *  be unable to flush all or part of its local numbers immediately; such
+ *  numbers are postponed and retried at intervals of
+ *  PGSTAT_STAT_RETRY_INTERVAL (100ms), but they are never held back longer
+ *  than PGSTAT_STAT_MAX_INTERVAL (1000ms).
+ *
+ *  pgstat_fetch_stat_*() are used to read the statistics numbers. There are
+ *  two ways of reading the shared statistics: transactional and one-shot.
+ *  In transactional mode, retrieved numbers are stored in a local hash that
+ *  persists until the end of the transaction. Autovacuum, on the other hand,
+ *  doesn't need that persistence and uses one-shot mode, which just copies
+ *  the data into palloc'ed memory.
+ *
+ *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
+ *
+ *    src/backend/statmon/pgstat.c
+ * ----------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "pgstat.h"
+
+#include "access/heapam.h"
+#include "access/htup_details.h"
+#include "access/twophase_rmgr.h"
+#include "access/xact.h"
+#include "bestatus.h"
+#include "catalog/pg_database.h"
+#include "catalog/pg_proc.h"
+#include "miscadmin.h"
+#include "postmaster/autovacuum.h"
+#include "storage/ipc.h"
+#include "storage/lmgr.h"
+#include "storage/procsignal.h"
+#include "utils/memutils.h"
+#include "utils/snapmgr.h"
+
+/* ----------
+ * Timer definitions.
+ * ----------
+ */
+#define PGSTAT_STAT_MIN_INTERVAL    500 /* Minimum time between stats data
+                                         * updates; in milliseconds. */
+
+#define PGSTAT_STAT_RETRY_INTERVAL    100 /* Retry interval once the minimum
+                                         * interval has elapsed; in milliseconds. */
+
+#define PGSTAT_STAT_MAX_INTERVAL   1000 /* Maximum time between stats data
+                                         * updates; in milliseconds. */
+
+/* ----------
+ * The initial size hints for the hash tables used in the collector.
+ * ----------
+ */
+#define PGSTAT_TAB_HASH_SIZE    512
+#define PGSTAT_FUNCTION_HASH_SIZE    512
+
+/*
+ * Operation mode of pgstat_get_db_entry.
+ */
+#define PGSTAT_FETCH_SHARED    0
+#define PGSTAT_FETCH_EXCLUSIVE    1
+#define PGSTAT_FETCH_NOWAIT    2
+
+typedef enum PgStat_TableLookupState
+{
+    PGSTAT_ENTRY_NOT_FOUND,
+    PGSTAT_ENTRY_FOUND,
+    PGSTAT_ENTRY_LOCK_FAILED
+} PgStat_TableLookupState;
+
+/* ----------
+ * GUC parameters
+ * ----------
+ */
+bool        pgstat_track_counts = false;
+int            pgstat_track_functions = TRACK_FUNC_OFF;
+
+/* ----------
+ * Built from GUC parameter
+ * ----------
+ */
+char       *pgstat_stat_directory = NULL;
+
+/* No longer used; to be removed together with the GUC */
+char       *pgstat_stat_filename = NULL;
+char       *pgstat_stat_tmpname = NULL;
+
+/* Shared stats bootstrap information */
+typedef struct StatsShmemStruct {
+    dsa_handle stats_dsa_handle;
+    dshash_table_handle db_stats_handle;
+    dsa_pointer    global_stats;
+    dsa_pointer    archiver_stats;
+    TimestampTz last_update;
+} StatsShmemStruct;
+
+/*
+ * BgWriter global statistics counters (unused in other processes).
+ * Stored directly in a stats message structure so it can be sent
+ * without needing to copy things around.  We assume this inits to zeroes.
+ */
+PgStat_BgWriter BgWriterStats;
+
+/* ----------
+ * Local data
+ * ----------
+ */
+/* Variables that live for the backend lifetime */
+static StatsShmemStruct *StatsShmem = NULL;
+static dsa_area *area = NULL;
+static dshash_table *db_stats;
+static MemoryContext pgSharedStatsContext = NULL;
+
+/* memory context for snapshots */
+static MemoryContext pgStatSnapshotContext = NULL;
+static HTAB *snapshot_db_stats;
+
+/* dshash parameter for each type of table */
+static const dshash_parameters dsh_dbparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatDBEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_DB
+};
+static const dshash_parameters dsh_tblparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatTabEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_FUNC_TABLE
+};
+static const dshash_parameters dsh_funcparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatFuncEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_FUNC_TABLE
+};
+
+/*
+ * Structures in which backends store per-table info that's waiting to be
+ * written to shared stats.
+ *
+ * NOTE: once allocated, TabStatusArray structures are never moved or deleted
+ * for the life of the backend.  Also, we zero out the t_id fields of the
+ * contained PgStat_TableStatus structs whenever they are not actively in use.
+ * This allows relcache pgstat_info pointers to be treated as long-lived data,
+ * avoiding repeated searches in pgstat_initstats() when a relation is
+ * repeatedly opened during a transaction.
+ */
+#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
+
+typedef struct TabStatusArray
+{
+    struct TabStatusArray *tsa_next;    /* link to next array, if any */
+    int            tsa_used;        /* # entries currently used */
+    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
+} TabStatusArray;
+
+static TabStatusArray *pgStatTabList = NULL;
+
+/*
+ * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
+ */
+typedef struct TabStatHashEntry
+{
+    Oid            t_id;
+    PgStat_TableStatus *tsa_entry;
+} TabStatHashEntry;
+
+/*
+ * Hash table for O(1) t_id -> tsa_entry lookup
+ */
+static HTAB *pgStatTabHash = NULL;
+
+/*
+ * Backends store per-function info that's waiting to be flushed out to shared
+ * memory in this hash table (indexed by function OID).
+ */
+static HTAB *pgStatFunctions = NULL;
+
+/* dbentry struct for snapshot */
+typedef struct PgStat_StatDBEntry_snapshot
+{
+    PgStat_StatDBEntry shared_part;
+
+    HTAB *snapshot_tables;                /* table entry snapshot */
+    HTAB *snapshot_functions;            /* function entry snapshot */
+    dshash_table    *dshash_tables;        /* attached tables dshash */
+    dshash_table    *dshash_functions;    /* attached functions dshash */
+} PgStat_StatDBEntry_snapshot;
+
+/* context struct for snapshot_statentry */
+typedef struct PgStat_SnapshotContext
+{
+    char           *hashname;            /* name of the snapshot hash */
+    HTAB          **hash;                /* place to store hash */
+    int                hash_entsize;        /* element size of hash entry */
+    dshash_table  **dshash;                /* Use this dshash if any */
+    dshash_table_handle    dsh_handle;        /* Use this if dshash = NULL */
+    const dshash_parameters *dsh_params;/* dshash params */
+} PgStat_SnapshotContext;
+
+/*
+ *  Backends store various database-wide info that's waiting to be flushed out
+ *  to shared memory in these variables. pending_recovery_conflicts represents
+ *  several kinds of conflict info.
+ */
+static bool        pending_recovery_conflicts = false;
+static int        pending_deadlocks = 0;
+static size_t    pending_files = 0;
+static size_t    pending_filesize = 0;
+
+/*
+ * Tuple insertion/deletion counts for an open transaction can't be propagated
+ * into PgStat_TableStatus counters until we know if it is going to commit
+ * or abort.  Hence, we keep these counts in per-subxact structs that live
+ * in TopTransactionContext.  This data structure is designed on the assumption
+ * that subxacts won't usually modify very many tables.
+ */
+typedef struct PgStat_SubXactStatus
+{
+    int            nest_level;        /* subtransaction nest level */
+    struct PgStat_SubXactStatus *prev;    /* higher-level subxact if any */
+    PgStat_TableXactStatus *first;    /* head of list for this subxact */
+} PgStat_SubXactStatus;
+
+static PgStat_SubXactStatus *pgStatXactStack = NULL;
+
+static int    pgStatXactCommit = 0;
+static int    pgStatXactRollback = 0;
+PgStat_Counter pgStatBlockReadTime = 0;
+PgStat_Counter pgStatBlockWriteTime = 0;
+
+/* Record that's written to 2PC state file when pgstat state is persisted */
+typedef struct TwoPhasePgStatRecord
+{
+    PgStat_Counter tuples_inserted; /* tuples inserted in xact */
+    PgStat_Counter tuples_updated;    /* tuples updated in xact */
+    PgStat_Counter tuples_deleted;    /* tuples deleted in xact */
+    PgStat_Counter inserted_pre_trunc;    /* tuples inserted prior to truncate */
+    PgStat_Counter updated_pre_trunc;    /* tuples updated prior to truncate */
+    PgStat_Counter deleted_pre_trunc;    /* tuples deleted prior to truncate */
+    Oid            t_id;            /* table's OID */
+    bool        t_shared;        /* is it a shared catalog? */
+    bool        t_truncated;    /* was the relation truncated? */
+} TwoPhasePgStatRecord;
+
+/*
+ * Context struct shared among the pgstat_flush_* functions.
+ */
+typedef struct
+{
+    int    shgeneration;
+    PgStat_StatDBEntry *shdbentry;
+    dshash_table *shdb_tabhash;
+
+    int    mygeneration;
+    PgStat_StatDBEntry *mydbentry;
+    dshash_table *mydb_tabhash;
+} pgstat_flush_tabstat_context;
+
+/*
+ * Info about current snapshot of stats
+ */
+TimestampTz backend_cache_expire = 0; /* local cache expiration time */
+bool        first_in_xact = true;      /* first fetch since last xact end */
+
+/*
+ * Cluster-wide statistics.
+ *
+ * Contains statistics that are not collected per database or per table.
+ * shared_* are the statistics maintained by statistics collector code and
+ * snapshot_* are cached stats for the reader code.
+ */
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats *snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats *snapshot_globalStats;
+
+/*
+ * Total time charged to functions so far in the current backend.
+ * We use this to help separate "self" and "other" time charges.
+ * (We assume this initializes to zero.)
+ */
+static instr_time total_func_time;
+
+
+/* ----------
+ * Local function forward declarations
+ * ----------
+ */
+static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_db_statsfile(PgStat_StatDBEntry *dbentry);
+static bool pgstat_flush_tabstats(pgstat_flush_tabstat_context *cxt,
+                                  bool nowait);
+static bool pgstat_flush_tabstat(pgstat_flush_tabstat_context *cxt,
+                                 bool nowait, PgStat_TableStatus *entry);
+static bool pgstat_flush_funcstats(pgstat_flush_tabstat_context *cxt,
+                                   bool nowait);
+static bool pgstat_flush_miscstats(pgstat_flush_tabstat_context *cxt,
+                                   bool force);
+static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, int op,
+                                    PgStat_TableLookupState *status);
+static PgStat_StatTabEntry *pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create);
+static HTAB *create_local_stats_hash(const char *name, size_t keysize,
+                                     size_t entrysize, int nentries);
+static void *snapshot_statentry(PgStat_SnapshotContext *cxt, bool oneshot,
+                                Oid key);
+static bool pgstat_update_tabentry(dshash_table *tabhash,
+                                   PgStat_TableStatus *stat, bool nowait);
+static void pgstat_update_dbentry(PgStat_StatDBEntry *dbentry,
+                                  PgStat_TableStatus *stat);
+
+static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
+static void pgstat_cleanup_recovery_conflict(PgStat_StatDBEntry *dbentry);
+static void pgstat_cleanup_deadlock(PgStat_StatDBEntry *dbentry);
+static void pgstat_cleanup_tempfile(PgStat_StatDBEntry *dbentry);
+static HTAB *create_tabstat_hash(void);
+static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
+static PgStat_SubXactStatus *get_tabstat_stack_level(int nest_level);
+static void add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level);
+static PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot);
+static void backend_snapshot_global_stats(void);
+static void pgstat_beshutdown_hook(int code, Datum arg);
+/* ------------------------------------------------------------
+ * Local support functions follow
+ * ------------------------------------------------------------
+ */
+//pgstat_update_archiver
+//pgstat_update_bgwriter
+
+static int pin_hashes(PgStat_StatDBEntry *dbentry);
+static void unpin_hashes(PgStat_StatDBEntry *dbentry, int generation);
+static dshash_table *attach_table_hash(PgStat_StatDBEntry *dbent, int gen);
+static dshash_table *attach_function_hash(PgStat_StatDBEntry *dbent, int gen);
+
+static void initialize_dbentry_nonpersistent_members(PgStat_StatDBEntry *dbentry);
+static void reset_dbentry_counters(PgStat_StatDBEntry *dbentry, bool initialize);
+
+/* ------------------------------------------------------------
+ * Public functions called from postmaster follow
+ * ------------------------------------------------------------
+ */
+
+static void
+pgstat_postmaster_shutdown(int code, Datum arg)
+{
+    /* write stats only on clean shutdown; discard them on crash */
+    if (code == 0)
+        pgstat_write_statsfiles();
+}
+
+void
+pgstat_initialize(void)
+{
+    /* Set up a process-exit hook to clean up */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
+}
+
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+void
+StatsShmemInit(void)
+{
+    bool    found;
+
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+
+        /* Load saved data if any */
+        pgstat_read_statsfiles();
+
+        /* needs to be called before dsm shutdown */
+        before_shmem_exit(pgstat_postmaster_shutdown, (Datum) 0);
+    }
+}
+
+/* ----------
+ * pgstat_create_shared_stats() -
+ *
+ *    create shared stats memory
+ * ----------
+ */
+static void
+pgstat_create_shared_stats(void)
+{
+    MemoryContext oldcontext;
+
+    Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
+
+    /* lives for the lifetime of the process */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    area = dsa_create(LWTRANCHE_STATS_DSA);
+    dsa_pin_mapping(area);
+
+    /* create the database hash */
+    db_stats = dshash_create(area, &dsh_dbparams, 0);
+
+    /* create shared area and write bootstrap information */
+    StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+    StatsShmem->global_stats =
+        dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+    StatsShmem->archiver_stats =
+        dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+    StatsShmem->db_stats_handle =
+        dshash_get_hash_table_handle(db_stats);
+    StatsShmem->last_update = 0;
+
+    /* initial connect to the memory */
+    MemoryContextSwitchTo(pgSharedStatsContext);
+    snapshot_db_stats = NULL;
+    shared_globalStats = (PgStat_GlobalStats *)
+        dsa_get_address(area, StatsShmem->global_stats);
+    shared_archiverStats = (PgStat_ArchiverStats *)
+        dsa_get_address(area, StatsShmem->archiver_stats);
+    MemoryContextSwitchTo(oldcontext);
+}
+
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ *    Create pgSharedStatsContext, if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+    if (!pgSharedStatsContext)
+        pgSharedStatsContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Shared activity statistics",
+                                  ALLOCSET_SMALL_SIZES);
+}
+
+/*
+ * subroutine for pgstat_reset_all
+ */
+static void
+pgstat_reset_remove_files(const char *directory)
+{
+    DIR           *dir;
+    struct dirent *entry;
+    char        fname[MAXPGPATH * 2];
+
+    dir = AllocateDir(directory);
+    while ((entry = ReadDir(dir, directory)) != NULL)
+    {
+        int            nchars;
+        Oid            tmp_oid;
+
+        /*
+         * Skip directory entries that don't match the file names we write.
+         * See get_dbstat_filename for the database-specific pattern.
+         */
+        if (strncmp(entry->d_name, "global.", 7) == 0)
+            nchars = 7;
+        else
+        {
+            nchars = 0;
+            (void) sscanf(entry->d_name, "db_%u.%n",
+                          &tmp_oid, &nchars);
+            if (nchars <= 0)
+                continue;
+            /* %u allows leading whitespace, so reject that */
+            if (strchr("0123456789", entry->d_name[3]) == NULL)
+                continue;
+        }
+
+        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
+            strcmp(entry->d_name + nchars, "stat") != 0)
+            continue;
+
+        snprintf(fname, sizeof(fname), "%s/%s", directory,
+                 entry->d_name);
+        unlink(fname);
+    }
+    FreeDir(dir);
+}
+
+/*
+ * pgstat_reset_all_counters: subroutine for pgstat_reset_all
+ *
+ * Clear all counters in shared memory.
+ */
+static void
+pgstat_reset_all_counters(void)
+{
+    dshash_seq_status dshstat;
+    PgStat_StatDBEntry           *dbentry;
+
+    Assert(db_stats);
+
+    dshash_seq_init(&dshstat, db_stats, false, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
+    {
+        /*
+         * Reset database-level stats, too.  This creates empty hash tables
+         * for tables and functions.
+         */
+        reset_dbentry_counters(dbentry, true);
+        dshash_release_lock(db_stats, dbentry);
+    }
+
+    /*
+     * Reset global counters
+     */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    memset(shared_globalStats, 0, sizeof(*shared_globalStats));
+    memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+    shared_globalStats->stat_reset_timestamp =
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
+    LWLockRelease(StatsLock);
+}
+
+/*
+ * pgstat_reset_all() -
+ *
+ * Remove the stats files and reset the in-memory counters.  This is currently used only
+ * if WAL recovery is needed after a crash.
+ */
+void
+pgstat_reset_all(void)
+{
+    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    pgstat_reset_all_counters();
+}
+
+/*
+ * Create the filename for a DB stat file; filename is the output buffer, of
+ * length len.
+ */
+static void
+get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
+{
+    int            printed;
+
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/db_%u.%s",
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
+                       databaseid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
+}
+
+/* ----------
+ * pgstat_write_statsfiles() -
+ *        Write the global statistics file, as well as DB files.
+ * ----------
+ */
+void
+pgstat_write_statsfiles(void)
+{
+    dshash_seq_status hstat;
+    PgStat_StatDBEntry *dbentry;
+    FILE       *fpout;
+    int32        format_id;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    int            rc;
+
+    /* should be called from postmaster */
+    Assert(!IsUnderPostmaster);
+
+    elog(DEBUG2, "writing stats file \"%s\"", statfile);
+
+    /*
+     * Open the statistics temp file to write out the current values.
+     */
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        return;
+    }
+
+    /*
+     * Set the timestamp of the stats file.
+     */
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
+
+    /*
+     * Write the file header --- currently just a format ID.
+     */
+    format_id = PGSTAT_FILE_FORMAT_ID;
+    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write global stats struct
+     */
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write archiver stats struct
+     */
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Walk through the database table.
+     */
+    dshash_seq_init(&hstat, db_stats, false, false);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&hstat)) != NULL)
+    {
+        /*
+         * Write out the table and function stats for this DB into the
+         * appropriate per-DB stat file, if required.
+         */
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
+
+        pgstat_write_db_statsfile(dbentry);
+
+        /*
+         * Write out the DB entry. We don't write the tables or functions
+         * pointers, since they're of no use to any other process.
+         */
+        fputc('D', fpout);
+        rc = fwrite(dbentry,
+                    offsetof(PgStat_StatDBEntry, generation), 1, fpout);
+        (void) rc;                /* we'll check for error with ferror */
+    }
+
+    /*
+     * No more output to be done. Close the temp file and replace the old
+     * pgstat.stat with it.  The ferror() check replaces testing for error
+     * after each individual fputc or fwrite above.
+     */
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+}
+
+/* ----------
+ * pgstat_write_db_statsfile() -
+ *        Write the stat file for a single database.
+ * ----------
+ */
+static void
+pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry)
+{
+    dshash_seq_status tstat;
+    dshash_seq_status fstat;
+    PgStat_StatTabEntry *tabentry;
+    PgStat_StatFuncEntry *funcentry;
+    FILE       *fpout;
+    int32        format_id;
+    Oid            dbid = dbentry->databaseid;
+    int            rc;
+    char        tmpfile[MAXPGPATH];
+    char        statfile[MAXPGPATH];
+    dshash_table *tbl;
+
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
+
+    elog(DEBUG2, "writing stats file \"%s\"", statfile);
+
+    /*
+     * Open the statistics temp file to write out the current values.
+     */
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        return;
+    }
+
+    /*
+     * Write the file header --- currently just a format ID.
+     */
+    format_id = PGSTAT_FILE_FORMAT_ID;
+    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Walk through the database's access stats per table.
+     */
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&tstat, tbl, false, false);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&tstat)) != NULL)
+    {
+        fputc('T', fpout);
+        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
+        (void) rc;                /* we'll check for error with ferror */
+    }
+    dshash_detach(tbl);
+
+    /*
+     * Walk through the database's function stats table.
+     */
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_seq_init(&fstat, tbl, false, false);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&fstat)) != NULL)
+        {
+            fputc('F', fpout);
+            rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+            (void) rc;                /* we'll check for error with ferror */
+        }
+        dshash_detach(tbl);
+    }
+
+    /*
+     * No more output to be done. Close the temp file and replace the old
+     * pgstat.stat with it.  The ferror() check replaces testing for error
+     * after each individual fputc or fwrite above.
+     */
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+}
+
+/* ----------
+ * pgstat_read_statsfiles() -
+ *
+ *    Reads in existing statistics collector files into the shared stats hash.
+ *
+ * ----------
+ */
+void
+pgstat_read_statsfiles(void)
+{
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatDBEntry dbbuf;
+    FILE       *fpin;
+    int32        format_id;
+    bool        found;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+
+    /* should be called from postmaster  */
+    Assert(!IsUnderPostmaster);
+
+    /*
+     * local cache lives in pgSharedStatsContext.
+     */
+    pgstat_setup_memcxt();
+
+    /*
+     * Create the DB hashtable and global stats area. No lock is needed since
+     * we're alone now.
+     */
+    pgstat_create_shared_stats();
+
+    /*
+     * Set the current timestamp (will be kept only in case we can't load an
+     * existing statsfile).
+     */
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp =
+        shared_globalStats->stat_reset_timestamp;
+
+    /*
+     * Try to open the stats file. If it doesn't exist, the backends simply
+     * return zero for anything and the collector simply starts from scratch
+     * with empty counters.
+     *
+     * ENOENT is a possibility if the stats collector is not running or has
+     * not yet written the stats file the first time.  Any other failure
+     * condition is suspicious.
+     */
+    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(LOG,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            statfile)));
+        return;
+    }
+
+    /*
+     * Verify it's of the expected format.
+     */
+    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
+        format_id != PGSTAT_FILE_FORMAT_ID)
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        goto done;
+    }
+
+    /*
+     * Read global stats struct
+     */
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+        sizeof(*shared_globalStats))
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        memset(shared_globalStats, 0, sizeof(*shared_globalStats));
+        goto done;
+    }
+
+    /*
+     * In the collector, disregard the timestamp we read from the permanent
+     * stats file; we should be willing to write a temp stats file immediately
+     * upon the first request from any backend.  This only matters if the old
+     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
+     * an unusual scenario.
+     */
+    shared_globalStats->stats_timestamp = 0;
+
+    /*
+     * Read archiver stats struct
+     */
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+        sizeof(*shared_archiverStats))
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        goto done;
+    }
+
+    /*
+     * We found an existing collector stats file. Read it and put all the
+     * hashtable entries into place.
+     */
+    for (;;)
+    {
+        switch (fgetc(fpin))
+        {
+                /*
+                 * 'D'    A PgStat_StatDBEntry struct describing a database
+                 * follows.
+                 */
+            case 'D':
+                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, generation),
+                          fpin) != offsetof(PgStat_StatDBEntry, generation))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                /*
+                 * Add to the DB hash
+                 */
+                dbentry = (PgStat_StatDBEntry *)
+                    dshash_find_or_insert(db_stats, (void *) &dbbuf.databaseid,
+                                          &found);
+
+                /* don't allow duplicate dbentries */
+                if (found)
+                {
+                    dshash_release_lock(db_stats, dbentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                memcpy(dbentry, &dbbuf,
+                       offsetof(PgStat_StatDBEntry, generation));
+
+                /* initialize the non-persistent portion */
+                initialize_dbentry_nonpersistent_members(dbentry);
+
+                /* Read the data from the database-specific file. */
+                pgstat_read_db_statsfile(dbentry);
+                dshash_release_lock(db_stats, dbentry);
+                break;
+
+            case 'E':
+                goto done;
+
+            default:
+                ereport(LOG,
+                        (errmsg("corrupted statistics file \"%s\"",
+                                statfile)));
+                goto done;
+        }
+    }
+
+done:
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+
+    return;
+}
+
+
+/* ----------
+ * pgstat_read_db_statsfile() -
+ *
+ *    Reads in the at-rest statistics file and creates shared statistics
+ *    tables. The file is removed after reading.
+ * ----------
+ */
+static void
+pgstat_read_db_statsfile(PgStat_StatDBEntry *dbentry)
+{
+    PgStat_StatTabEntry   *tabentry;
+    PgStat_StatTabEntry        tabbuf;
+    PgStat_StatFuncEntry    funcbuf;
+    PgStat_StatFuncEntry   *funcentry;
+    dshash_table           *tabhash = NULL;
+    dshash_table           *funchash = NULL;
+    FILE       *fpin;
+    int32        format_id;
+    bool        found;
+    char        statfile[MAXPGPATH];
+
+    /* should be called from postmaster  */
+    Assert(!IsUnderPostmaster);
+
+    get_dbstat_filename(false, dbentry->databaseid, statfile, MAXPGPATH);
+
+    /*
+     * Try to open the stats file. If it doesn't exist, the backends simply
+     * return zero for anything and the collector simply starts from scratch
+     * with empty counters.
+     *
+     * ENOENT is a possibility if the stats collector is not running or has
+     * not yet written the stats file the first time.  Any other failure
+     * condition is suspicious.
+     */
+    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(LOG,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            statfile)));
+        return;
+    }
+
+    /*
+     * Verify it's of the expected format.
+     */
+    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
+        format_id != PGSTAT_FILE_FORMAT_ID)
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        goto done;
+    }
+
+    /*
+     * We found an existing statistics file. Read it and put all the hashtable
+     * entries into place.
+     */
+    for (;;)
+    {
+        switch (fgetc(fpin))
+        {
+                /*
+                 * 'T'    A PgStat_StatTabEntry follows.
+                 */
+            case 'T':
+                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
+                          fpin) != sizeof(PgStat_StatTabEntry))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                if (tabhash == NULL)
+                {
+                    tabhash = dshash_create(area, &dsh_tblparams, 0);
+                    dbentry->tables =
+                        dshash_get_hash_table_handle(tabhash);
+                }
+                
+                tabentry = (PgStat_StatTabEntry *)
+                    dshash_find_or_insert(tabhash,
+                                          (void *) &tabbuf.tableid, &found);
+
+                /* don't allow duplicate entries */
+                if (found)
+                {
+                    dshash_release_lock(tabhash, tabentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
+                dshash_release_lock(tabhash, tabentry);
+                break;
+
+                /*
+                 * 'F'    A PgStat_StatFuncEntry follows.
+                 */
+            case 'F':
+                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
+                          fpin) != sizeof(PgStat_StatFuncEntry))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                if (funchash == NULL)
+                {
+                    funchash = dshash_create(area, &dsh_funcparams, 0);
+                    dbentry->functions =
+                        dshash_get_hash_table_handle(funchash);
+                }
+                
+
+                funcentry = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert(funchash,
+                                          (void *) &funcbuf.functionid, &found);
+
+                if (found)
+                {
+                    dshash_release_lock(funchash, funcentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
+                dshash_release_lock(funchash, funcentry);
+                break;
+
+                /*
+                 * 'E'    The EOF marker of a complete stats file.
+                 */
+            case 'E':
+                goto done;
+
+            default:
+                ereport(LOG,
+                        (errmsg("corrupted statistics file \"%s\"",
+                                statfile)));
+                goto done;
+        }
+    }
+
+done:
+    if (tabhash)
+        dshash_detach(tabhash);
+    if (funchash)
+        dshash_detach(funchash);
+
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+}
+
+
+/* ------------------------------------------------------------
+ * Public functions used by backends follow
+ *------------------------------------------------------------
+ */
+
+/* ----------
+ * pgstat_flush_stat() -
+ *
+ *    Must be called by processes that perform DML (tcop/postgres.c, logical
+ *    receiver processes, SPI workers, etc.) to apply the per-table and
+ *    function usage statistics collected so far to the shared statistics hashes.
+ *
+ *    This requires taking locks on the shared statistics hashes, and some
+ *    updates may be postponed on lock failure. Postponed updates are retried
+ *    in later calls of this function and are finally cleaned up when it is
+ *    called with force = true or once PGSTAT_STAT_MAX_INTERVAL milliseconds
+ *    have elapsed since the last cleanup. When force = false, updates by
+ *    regular backends happen at intervals no shorter than
+ *    PGSTAT_STAT_MIN_INTERVAL.
+ *
+ *    Returns the time until the next update time in milliseconds.
+ *
+ *    Note that this is called only outside a transaction, so it is fair to
+ *    use the transaction stop time as an approximation of the current time.
+ * ----------
+ */
+long
+pgstat_flush_stat(bool force)
+{
+    static TimestampTz last_report = 0;
+    static TimestampTz oldest_pending = 0;
+    TimestampTz now;
+    pgstat_flush_tabstat_context cxt = {0};
+    bool        other_pending_stats = false;
+    long        elapsed;
+    long        secs;
+    int            usecs;
+    bool        pending_stats = false;
+
+    /*
+     * We try to flush any local data waiting to be flushed out to shared memory.
+     */
+    if (pending_recovery_conflicts || pending_deadlocks != 0 ||
+        pending_files != 0)
+        other_pending_stats = true;
+
+    /* Don't expend a clock check if nothing to do */
+    if (!other_pending_stats && pgStatFunctions == NULL && 
+        (pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
+        pgStatXactCommit == 0 && pgStatXactRollback == 0)
+        return 0;
+
+    now = GetCurrentTransactionStopTimestamp();
+
+    if (!force)
+    {
+        /*
+         * Don't update shared stats unless it's been at least
+         * PGSTAT_STAT_MIN_INTERVAL msec since the last update.
+         * Return the time to wait in that case.
+         */
+        TimestampDifference(last_report, now, &secs, &usecs);
+        elapsed = secs * 1000 + usecs / 1000;
+
+        if (elapsed < PGSTAT_STAT_MIN_INTERVAL)
+        {
+            /* we know we have some statistics */
+            if (oldest_pending == 0)
+                oldest_pending = now;
+
+            return PGSTAT_STAT_MIN_INTERVAL - elapsed;
+        }
+
+
+        /*
+         * Don't keep pending stats for longer than PGSTAT_STAT_MAX_INTERVAL.
+         */
+        if (oldest_pending > 0)
+        {
+            TimestampDifference(oldest_pending, now, &secs, &usecs);
+            elapsed = secs * 1000 + usecs / 1000;
+
+            if (elapsed > PGSTAT_STAT_MAX_INTERVAL)
+                force = true;
+        }
+    }
+
+    last_report = now;
+
+    /* Publish report time */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (StatsShmem->last_update < last_report)
+        StatsShmem->last_update = last_report;
+    LWLockRelease(StatsLock);
+    
+    /* Flush out table stats */
+    if (pgStatTabList != NULL && !pgstat_flush_tabstats(&cxt, !force))
+        pending_stats = true;
+
+    /* Flush out function stats */
+    if (pgStatFunctions != NULL && !pgstat_flush_funcstats(&cxt, !force))
+        pending_stats = true;
+
+    /* Flush out miscellaneous stats */
+    if (other_pending_stats && !pgstat_flush_miscstats(&cxt, !force))
+        pending_stats = true;
+
+    /*  Unpin dbentry if pinned */
+    if (cxt.mydbentry)
+        unpin_hashes(cxt.mydbentry, cxt.mygeneration);
+
+    /* record oldest pending update time */
+    if (pending_stats)
+    {
+        if (oldest_pending == 0)
+            oldest_pending = now;
+        return PGSTAT_STAT_RETRY_INTERVAL;
+    }
+
+    oldest_pending = 0;
+
+    return 0;
+}
+
+/* -------
+ * Subroutines for pgstat_flush_stat.
+ * -------
+ */
+
+/*
+ * pgstat_flush_tabstats: Flushes table stats out to shared statistics.
+ *
+ *  If nowait is true, returns false if a required lock was not acquired
+ *  immediately. In that case, stats of some tables may be left in the TSA to
+ *  wait for the next chance. cxt holds some dshash-related values that we
+ *  want to keep during the shared stats update. Caller must detach the
+ *  dshashes stored in cxt after use.
+ *
+ *  Returns true if all entries are flushed.
+ */
+static bool
+pgstat_flush_tabstats(pgstat_flush_tabstat_context *cxt, bool nowait)
+{
+    static const PgStat_TableCounts all_zeroes;
+    TabStatusArray *tsa;
+    HTAB           *new_tsa_hash = NULL;
+    TabStatusArray *dest_tsa = pgStatTabList;
+    int                dest_elem = 0;
+    int                i;
+
+    /* nothing to do, just return  */
+    if (pgStatTabHash == NULL)
+        return true;
+
+    /*
+     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
+     * entries it points to. We recreate it if it is needed.
+     */
+    hash_destroy(pgStatTabHash);
+    pgStatTabHash = NULL;
+
+    /*
+     * Scan through the TabStatusArray struct(s) to find tables that actually
+     * have counts, and try flushing it out to shared statistics.
+     */
+    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+    {
+        for (i = 0; i < tsa->tsa_used; i++)
+        {
+            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
+
+            /* Shouldn't have any pending transaction-dependent counts */
+            Assert(entry->trans == NULL);
+
+            /*
+             * Ignore entries that didn't accumulate any actual counts, such
+             * as indexes that were opened by the planner but not used.
+             */
+            if (memcmp(&entry->t_counts, &all_zeroes,
+                       sizeof(PgStat_TableCounts)) == 0)
+                continue;
+
+            /* try to apply the tab stats */
+            if (!pgstat_flush_tabstat(cxt, nowait, entry))
+            {
+                /*
+                 * Failed. Keep the entry, packing it toward the start of the TSA.
+                 */
+                TabStatHashEntry *hash_entry;
+                bool found;
+
+                if (new_tsa_hash == NULL)
+                    new_tsa_hash = create_tabstat_hash();
+
+                /* Create hash entry for this entry */
+                hash_entry = hash_search(new_tsa_hash, &entry->t_id,
+                                         HASH_ENTER, &found);
+                Assert(!found);
+
+                /*
+                 * Move insertion pointer to the next segment. There must be
+                 * enough segments, since we are just keeping some of the
+                 * current elements.
+                 */
+                if (dest_elem >= TABSTAT_QUANTUM)
+                {
+                    Assert(dest_tsa->tsa_next != NULL);
+                    dest_tsa = dest_tsa->tsa_next;
+                    dest_elem = 0;
+                }
+
+                /* Move the entry if needed */
+                if (tsa != dest_tsa || i != dest_elem)
+                {
+                    PgStat_TableStatus *new_entry;
+                    new_entry = &dest_tsa->tsa_entries[dest_elem];
+                    *new_entry = *entry;
+                    entry = new_entry;
+                }
+
+                hash_entry->tsa_entry = entry;
+                dest_elem++;
+            }
+        }
+    }
+
+    /* zero out unused area of TableStatus */
+    dest_tsa->tsa_used = dest_elem;
+    MemSet(&dest_tsa->tsa_entries[dest_elem], 0,
+           (TABSTAT_QUANTUM - dest_elem) * sizeof(PgStat_TableStatus));
+    while (dest_tsa->tsa_next)
+    {
+        dest_tsa = dest_tsa->tsa_next;
+        MemSet(dest_tsa->tsa_entries, 0,
+               dest_tsa->tsa_used * sizeof(PgStat_TableStatus));
+        dest_tsa->tsa_used = 0;
+    }
+
+    /* and set the new TSA hash if any */
+    pgStatTabHash = new_tsa_hash;
+
+    /*
+     * We no longer need the shared database and table entries, but may
+     * still use those for my database.
+     */
+    if (cxt->shdb_tabhash)
+    {
+        dshash_detach(cxt->shdb_tabhash);
+        unpin_hashes(cxt->shdbentry, cxt->shgeneration);
+        cxt->shdb_tabhash = NULL;
+        cxt->shdbentry = NULL;
+    }
+
+    return pgStatTabHash == NULL;
+}
+
+
+/*
+ * pgstat_flush_tabstat: Flushes a table stats entry.
+ *
+ *  If nowait is true, returns false on lock failure.  Dshashes for table and
+ *  function stats are kept attached in cxt. The caller must detach them after
+ *  use.
+ *
+ *  Returns true if the entry is flushed.
+ */
+bool
+pgstat_flush_tabstat(pgstat_flush_tabstat_context *cxt, bool nowait,
+                     PgStat_TableStatus *entry)
+{
+    Oid        dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
+    int        table_mode = PGSTAT_FETCH_EXCLUSIVE;
+    bool    updated = false;
+    dshash_table *tabhash;
+    PgStat_StatDBEntry *dbent;
+    int        generation;
+
+    if (nowait)
+        table_mode |= PGSTAT_FETCH_NOWAIT;
+
+    /* Attach the required table hash if not yet. */
+    if ((entry->t_shared ? cxt->shdb_tabhash : cxt->mydb_tabhash) == NULL)
+    {
+        /* We don't have corresponding dbentry here */
+        dbent = pgstat_get_db_entry(dboid, table_mode, NULL);
+        if (!dbent)
+            return false;
+        
+        /*
+         * We don't keep dshash-lock on dbentries, since the dbentries cannot
+         * be dropped meanwhile. We will use generation to isolate reset
+         * table/function hashes.
+         */
+        generation = pin_hashes(dbent);
+        tabhash = attach_table_hash(dbent, generation);
+
+        if (entry->t_shared)
+        {
+            cxt->shgeneration = generation;
+            cxt->shdbentry = dbent;
+            cxt->shdb_tabhash = tabhash;
+        }
+        else
+        {
+            cxt->mygeneration = generation;
+            cxt->mydbentry = dbent;
+            cxt->mydb_tabhash = tabhash;
+
+            /*
+             * We attach mydb tabhash once per flush. This is the chance to
+             * update database-wide stats.
+             */
+            LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
+            dbent->n_xact_commit += pgStatXactCommit;
+            dbent->n_xact_rollback += pgStatXactRollback;
+            dbent->n_block_read_time += pgStatBlockReadTime;
+            dbent->n_block_write_time += pgStatBlockWriteTime;
+            LWLockRelease(&dbent->lock);
+            pgStatXactCommit = 0;
+            pgStatXactRollback = 0;
+            pgStatBlockReadTime = 0;
+            pgStatBlockWriteTime = 0;
+        }
+    }
+    else if (entry->t_shared)
+    {
+        dbent = cxt->shdbentry;
+        tabhash = cxt->shdb_tabhash;
+    }
+    else
+    {
+        dbent = cxt->mydbentry;
+        tabhash = cxt->mydb_tabhash;
+    }        
+
+
+    /*
+     * dbentry is always available here, so try flush table stats first, then
+     * database stats.
+     */
+    if (pgstat_update_tabentry(tabhash, entry, nowait))
+    {
+        pgstat_update_dbentry(dbent, entry);
+        updated = true;
+    }
+
+    return updated;
+}
+
+/*
+ * pgstat_flush_funcstats: Flushes function stats.
+ *
+ *  If nowait is true, returns false on lock failure and leaves some of the
+ *  entries alone in the local hash.
+ *
+ *  Returns true if all entries are flushed.
+ */
+static bool
+pgstat_flush_funcstats(pgstat_flush_tabstat_context *cxt, bool nowait)
+{
+    /* we assume this inits to all zeroes: */
+    static const PgStat_FunctionCounts all_zeroes;
+    dshash_table   *funchash;
+    HASH_SEQ_STATUS fstat;
+    PgStat_BackendFunctionEntry *bestat;
+    bool            finished = true;
+
+    /* nothing to do, just return  */
+    if (pgStatFunctions == NULL)
+        return true;
+
+    /* get dbentry into cxt if not yet.  */
+    if (cxt->mydbentry == NULL)
+    {
+        int op = PGSTAT_FETCH_EXCLUSIVE;
+
+        if (nowait)
+            op |= PGSTAT_FETCH_NOWAIT;
+
+        cxt->mydbentry = pgstat_get_db_entry(MyDatabaseId, op, NULL);
+
+        if (cxt->mydbentry == NULL)
+            return false;
+
+        cxt->mygeneration = pin_hashes(cxt->mydbentry);
+    }
+
+    funchash = attach_function_hash(cxt->mydbentry, cxt->mygeneration);
+
+    /*
+     * Scan through the pgStatFunctions to find functions that actually have
+     * counts, and try flushing it out to shared statistics.
+     */
+    hash_seq_init(&fstat, pgStatFunctions);
+    while ((bestat = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    {
+        bool found;
+        PgStat_StatFuncEntry *funcent = NULL;
+
+        /* Skip it if no counts accumulated for it so far */
+        if (memcmp(&bestat->f_counts, &all_zeroes,
+                   sizeof(PgStat_FunctionCounts)) == 0)
+            continue;
+
+        funcent = (PgStat_StatFuncEntry *)
+            dshash_find_or_insert_extended(funchash, (void *) &(bestat->f_id),
+                                           &found, nowait);
+
+        /*
+         * We couldn't acquire the lock on the required entry. Leave the local
+         * entry alone.
+         */
+        if (!funcent)
+        {
+            finished = false;
+            continue;
+        }
+
+        /* Initialize if it's new, or add to it. */
+        if (!found)
+        {
+            funcent->functionid = bestat->f_id;
+            funcent->f_numcalls = bestat->f_counts.f_numcalls;
+            funcent->f_total_time =
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_total_time);
+            funcent->f_self_time =
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_self_time);
+        }
+        else
+        {
+            funcent->f_numcalls += bestat->f_counts.f_numcalls;
+            funcent->f_total_time +=
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_total_time);
+            funcent->f_self_time +=
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_self_time);
+        }
+        dshash_release_lock(funchash, funcent);
+
+        /* reset used counts */
+        MemSet(&bestat->f_counts, 0, sizeof(PgStat_FunctionCounts));
+    }
+
+    return finished;
+}
+
+/*
+ * pgstat_flush_miscstats: Flushes out miscellaneous stats.
+ *
+ *  If nowait is true, returns false on lock failure on dbentry.
+ *
+ *  Returns true if all the miscellaneous stats are flushed out.
+ */
+static bool
+pgstat_flush_miscstats(pgstat_flush_tabstat_context *cxt, bool nowait)
+{
+    /* get dbentry if not yet.  */
+    if (cxt->mydbentry == NULL)
+    {
+        int op = PGSTAT_FETCH_EXCLUSIVE;
+        if (nowait)
+            op |= PGSTAT_FETCH_NOWAIT;
+
+        cxt->mydbentry = pgstat_get_db_entry(MyDatabaseId, op, NULL);
+
+        /* Lock failure, return. */
+        if (cxt->mydbentry == NULL)
+            return false;
+            
+        cxt->mygeneration = pin_hashes(cxt->mydbentry);
+    }        
+
+    LWLockAcquire(&cxt->mydbentry->lock, LW_EXCLUSIVE);
+    if (pending_recovery_conflicts)
+        pgstat_cleanup_recovery_conflict(cxt->mydbentry);
+    if (pending_deadlocks != 0)
+        pgstat_cleanup_deadlock(cxt->mydbentry);
+    if (pending_files != 0)
+        pgstat_cleanup_tempfile(cxt->mydbentry);
+    LWLockRelease(&cxt->mydbentry->lock);
+
+    return true;
+}
+
+/*
+ * Look up the hash table entry for the specified database. With
+ * PGSTAT_FETCH_EXCLUSIVE in op, a missing entry is created and initialized.
+ * Returns NULL if no entry exists or, with PGSTAT_FETCH_NOWAIT, on lock failure.
+ */
+static PgStat_StatDBEntry *
+pgstat_get_db_entry(Oid databaseid, int op, PgStat_TableLookupState *status)
+{
+    PgStat_StatDBEntry *result;
+    bool        nowait = ((op & PGSTAT_FETCH_NOWAIT) != 0);
+    bool        lock_acquired = true;
+    bool        found = true;
+
+    if (!IsUnderPostmaster)
+        return NULL;
+
+    /* Lookup or create the hash table entry for this database */
+    if (op & PGSTAT_FETCH_EXCLUSIVE)
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_or_insert_extended(db_stats, &databaseid,
+                                           &found, nowait);
+        if (result == NULL)
+            lock_acquired = false;
+        else if (!found)
+        {
+            /*
+             * If not found, initialize the new one.  This creates empty hash
+             * tables for tables and functions, too.
+             */
+            reset_dbentry_counters(result, true);
+        }
+    }
+    else
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_extended(db_stats, &databaseid, true, nowait,
+                                 nowait ? &lock_acquired : NULL);
+        if (result == NULL)
+            found = false;
+    }
+
+    /* Set return status if requested */
+    if (status)
+    {
+        if (!lock_acquired)
+        {
+            Assert(nowait);
+            *status = PGSTAT_ENTRY_LOCK_FAILED;
+        }
+        else if (!found)
+            *status = PGSTAT_ENTRY_NOT_FOUND;
+        else
+            *status = PGSTAT_ENTRY_FOUND;
+    }
+
+    return result;
+}
+
+/*
+ * Lookup the hash table entry for the specified table. If no hash
+ * table entry exists, initialize it, if the create parameter is true.
+ * Else, return NULL.
+ */
+static PgStat_StatTabEntry *
+pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create)
+{
+    PgStat_StatTabEntry *result;
+    bool        found = true;
+
+    /* Lookup or create the hash table entry for this table */
+    if (create)
+        result = (PgStat_StatTabEntry *)
+            dshash_find_or_insert(table, &tableoid, &found);
+    else
+        result = (PgStat_StatTabEntry *) dshash_find(table, &tableoid, false);
+
+    if (result == NULL)
+        return NULL;
+
+    /* If not found, initialize the new one. */
+    if (!found)
+    {
+        result->numscans = 0;
+        result->tuples_returned = 0;
+        result->tuples_fetched = 0;
+        result->tuples_inserted = 0;
+        result->tuples_updated = 0;
+        result->tuples_deleted = 0;
+        result->tuples_hot_updated = 0;
+        result->n_live_tuples = 0;
+        result->n_dead_tuples = 0;
+        result->changes_since_analyze = 0;
+        result->blocks_fetched = 0;
+        result->blocks_hit = 0;
+        result->vacuum_timestamp = 0;
+        result->vacuum_count = 0;
+        result->autovac_vacuum_timestamp = 0;
+        result->autovac_vacuum_count = 0;
+        result->analyze_timestamp = 0;
+        result->analyze_count = 0;
+        result->autovac_analyze_timestamp = 0;
+        result->autovac_analyze_count = 0;
+    }
+
+    return result;
+}
+
+/* ----------
+ * pgstat_vacuum_stat() -
+ *
+ *    Remove objects we can get rid of.
+ * ----------
+ */
+void
+pgstat_vacuum_stat(void)
+{
+    HTAB       *oidtab;
+    dshash_table *dshtable;
+    dshash_seq_status dshstat;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    PgStat_StatFuncEntry *funcentry;
+
+    /* we don't collect statistics in standalone mode */
+    if (!IsUnderPostmaster)
+        return;
+
+    /* If not done for this transaction, take a snapshot of stats */
+    backend_snapshot_global_stats();
+
+    /*
+     * Read pg_database and make a list of OIDs of all existing databases
+     */
+    oidtab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+
+    /*
+     * Search the database hash table for dead databases and drop them
+     * from the hash.
+     */
+
+    dshash_seq_init(&dshstat, db_stats, false, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
+    {
+        Oid            dbid = dbentry->databaseid;
+
+        CHECK_FOR_INTERRUPTS();
+
+        /* the DB entry for shared tables (with InvalidOid) is never dropped */
+        if (OidIsValid(dbid) &&
+            hash_search(oidtab, (void *) &dbid, HASH_FIND, NULL) == NULL)
+            pgstat_drop_database(dbid);
+    }
+
+    /* Clean up */
+    hash_destroy(oidtab);
+
+    /*
+     * Lookup our own database entry; if not found, nothing more to do.
+     */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, NULL);
+    if (!dbentry)
+        return;
+
+    /*
+     * Similarly to above, make a list of all known relations in this DB.
+     */
+    oidtab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
+
+    /*
+     * Check whether each table listed in the stats hash table still exists.
+     * The stats cache is useless here, so search the shared hash directly.
+     */
+    dshtable = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&dshstat, dshtable, false, true);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&dshstat)) != NULL)
+    {
+        Oid            tabid = tabentry->tableid;
+
+        CHECK_FOR_INTERRUPTS();
+
+        if (hash_search(oidtab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+            continue;
+
+        /* Not there, so purge this table */
+        dshash_delete_entry(dshtable, tabentry);
+    }
+    dshash_detach(dshtable);
+
+    /* Clean up */
+    hash_destroy(oidtab);
+
+    /*
+     * Now repeat the above steps for functions.  However, we needn't bother
+     * in the common case where no function stats are being collected.
+     */
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        dshtable =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        oidtab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
+
+        dshash_seq_init(&dshstat, dshtable, false, true);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&dshstat)) != NULL)
+        {
+            Oid            funcid = funcentry->functionid;
+
+            CHECK_FOR_INTERRUPTS();
+
+            if (hash_search(oidtab, (void *) &funcid, HASH_FIND, NULL) != NULL)
+                continue;
+
+            /* Not there, so remove this function */
+            dshash_delete_entry(dshtable, funcentry);
+        }
+
+        hash_destroy(oidtab);
+
+        dshash_detach(dshtable);
+    }
+    dshash_release_lock(db_stats, dbentry);
+}
+
+
+/*
+ * pgstat_collect_oids() -
+ *
+ *    Collect the OIDs of all objects listed in the specified system catalog
+ *    into a temporary hash table.  Caller should hash_destroy the result after
+ *    use.  (However, we make the table in CurrentMemoryContext so that it will
+ *    be freed properly in event of an error.)
+ */
+static HTAB *
+pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
+{
+    HTAB       *htab;
+    HASHCTL        hash_ctl;
+    Relation    rel;
+    HeapScanDesc scan;
+    HeapTuple    tup;
+    Snapshot    snapshot;
+
+    memset(&hash_ctl, 0, sizeof(hash_ctl));
+    hash_ctl.keysize = sizeof(Oid);
+    hash_ctl.entrysize = sizeof(Oid);
+    hash_ctl.hcxt = CurrentMemoryContext;
+    htab = hash_create("Temporary table of OIDs",
+                       PGSTAT_TAB_HASH_SIZE,
+                       &hash_ctl,
+                       HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+    rel = heap_open(catalogid, AccessShareLock);
+    snapshot = RegisterSnapshot(GetLatestSnapshot());
+    scan = heap_beginscan(rel, snapshot, 0, NULL);
+    while ((tup = heap_getnext(scan, ForwardScanDirection)) != NULL)
+    {
+        Oid            thisoid;
+        bool        isnull;
+
+        thisoid = heap_getattr(tup, anum_oid, RelationGetDescr(rel), &isnull);
+        Assert(!isnull);
+
+        CHECK_FOR_INTERRUPTS();
+
+        (void) hash_search(htab, (void *) &thisoid, HASH_ENTER, NULL);
+    }
+    heap_endscan(scan);
+    UnregisterSnapshot(snapshot);
+    heap_close(rel, AccessShareLock);
+
+    return htab;
+}
+
+
+/* ----------
+ * pgstat_drop_database() -
+ *
+ *    Remove entry for the database that we just dropped.
+ *
+ *    If some stats update happens after this, this entry will be re-created, but
+ *    we will still clean the dead DB eventually via future invocations of
+ *    pgstat_vacuum_stat().
+ * ----------
+ */
+void
+pgstat_drop_database(Oid databaseid)
+{
+    PgStat_StatDBEntry *dbentry;
+
+    Assert(OidIsValid(databaseid));
+    Assert(db_stats);
+
+    /*
+     * Lookup the database in the hashtable with exclusive lock.
+     */
+    dbentry = pgstat_get_db_entry(databaseid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    /*
+     * If found, remove it (along with its table and function hashes).
+     */
+    if (dbentry)
+    {
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+        }
+
+        dshash_delete_entry(db_stats, (void *) dbentry);
+    }
+}
+
+
+/* ----------
+ * pgstat_reset_counters() -
+ *
+ *    Reset counters for our database.
+ *
+ *    Permission checking for this function is managed through the normal
+ *    GRANT system.
+ * ----------
+ */
+void
+pgstat_reset_counters(void)
+{
+    PgStat_StatDBEntry       *dbentry;
+    PgStat_TableLookupState status;
+
+    Assert(db_stats);
+
+    /*
+     * Lookup the database in the hashtable.  Nothing to do if not there.
+     */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, &status);
+
+    if (!dbentry)
+        return;
+
+    /* This database is active, so we can release the lock immediately. */
+    dshash_release_lock(db_stats, dbentry);
+
+    /* Reset database-level stats. */
+    reset_dbentry_counters(dbentry, false);
+
+}
+
+/* ----------
+ * pgstat_reset_shared_counters() -
+ *
+ *    Reset cluster-wide shared counters.
+ *
+ *    Permission checking for this function is managed through the normal
+ *    GRANT system.
+ * ----------
+ */
+void
+pgstat_reset_shared_counters(const char *target)
+{
+    Assert(db_stats);
+
+    /* Reset the archiver statistics for the cluster. */
+    if (strcmp(target, "archiver") == 0)
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+        memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
+    }
+    else if (strcmp(target, "bgwriter") == 0)
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+        /* Reset the global background writer statistics for the cluster. */
+        memset(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    }
+    else
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                 errmsg("unrecognized reset target: \"%s\"", target),
+                 errhint("Target must be \"archiver\" or \"bgwriter\".")));
+
+    LWLockRelease(StatsLock);
+}
+
+/* ----------
+ * pgstat_reset_single_counter() -
+ *
+ *    Reset a single counter.
+ *
+ *    Permission checking for this function is managed through the normal
+ *    GRANT system.
+ * ----------
+ */
+void
+pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
+{
+    PgStat_StatDBEntry *dbentry;
+    int generation;
+
+    Assert(db_stats);
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    if (!dbentry)
+        return;
+
+    /* This database is active, so we can release the lock immediately. */
+    generation = pin_hashes(dbentry);
+
+    /* Set the reset timestamp for the whole database */
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
+    LWLockRelease(&dbentry->lock);
+
+    /* Remove object if it exists, ignore if not */
+    if (type == RESET_TABLE)
+    {
+        dshash_table *t = attach_table_hash(dbentry, generation);
+        dshash_delete_key(t, (void *) &objoid);
+        dshash_detach(t);
+    }
+
+    if (type == RESET_FUNCTION)
+    {
+        dshash_table *t = attach_function_hash(dbentry, generation);
+        if (t)
+        {
+            dshash_delete_key(t, (void *) &objoid);
+            dshash_detach(t);
+        }
+    }
+    unpin_hashes(dbentry, generation);
+}
+
+/* ----------
+ * pgstat_report_autovac() -
+ *
+ *    Called from autovacuum.c to report startup of an autovacuum process.
+ *    We are called before InitPostgres is done, so can't rely on MyDatabaseId;
+ *    the db OID must be passed in, instead.
+ * ----------
+ */
+void
+pgstat_report_autovac(Oid dboid)
+{
+    PgStat_StatDBEntry *dbentry;
+
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    /*
+     * Store the last autovacuum time in the database's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+    dshash_release_lock(db_stats, dbentry);
+
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->last_autovac_time = GetCurrentTimestamp();
+    LWLockRelease(&dbentry->lock);
+}
+
+
+/* ---------
+ * pgstat_report_vacuum() -
+ *
+ *    Report about the table we just vacuumed.
+ * ---------
+ */
+void
+pgstat_report_vacuum(Oid tableoid, bool shared,
+                     PgStat_Counter livetuples, PgStat_Counter deadtuples)
+{
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
+    int                    generation;
+
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    dboid = shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hash table entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+    generation = pin_hashes(dbentry);
+    table = attach_table_hash(dbentry, generation);
+
+    tabentry = pgstat_get_tab_entry(table, tableoid, true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->vacuum_count++;
+    }
+    dshash_release_lock(table, tabentry);
+
+    dshash_detach(table);
+    unpin_hashes(dbentry, generation);
+}
+
+/* --------
+ * pgstat_report_analyze() -
+ *
+ *    Report about the table we just analyzed.
+ *
+ * Caller must provide new live- and dead-tuples estimates, as well as a
+ * flag indicating whether to reset the changes_since_analyze counter.
+ * --------
+ */
+void
+pgstat_report_analyze(Relation rel,
+                      PgStat_Counter livetuples, PgStat_Counter deadtuples,
+                      bool resetcounter)
+{
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table       *table;
+    int                    generation;
+
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    /*
+     * Unlike VACUUM, ANALYZE might be running inside a transaction that has
+     * already inserted and/or deleted rows in the target table. ANALYZE will
+     * have counted such rows as live or dead respectively. Because we will
+     * report our counts of such rows at transaction end, we should subtract
+     * off these counts from what we send to the collector now, else they'll
+     * be double-counted after commit.  (This approach also ensures that the
+     * collector ends up with the right numbers if we abort instead of
+     * committing.)
+     */
+    if (rel->pgstat_info != NULL)
+    {
+        PgStat_TableXactStatus *trans;
+
+        for (trans = rel->pgstat_info->trans; trans; trans = trans->upper)
+        {
+            livetuples -= trans->tuples_inserted - trans->tuples_deleted;
+            deadtuples -= trans->tuples_updated + trans->tuples_deleted;
+        }
+        /* count stuff inserted by already-aborted subxacts, too */
+        deadtuples -= rel->pgstat_info->t_counts.t_delta_dead_tuples;
+        /* Since ANALYZE's counts are estimates, we could have underflowed */
+        livetuples = Max(livetuples, 0);
+        deadtuples = Max(deadtuples, 0);
+    }
+
+    dboid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+    generation = pin_hashes(dbentry);
+    table = attach_table_hash(dbentry, generation);
+    tabentry = pgstat_get_tab_entry(table, RelationGetRelid(rel), true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    dshash_release_lock(table, tabentry);
+
+    dshash_detach(table);
+    unpin_hashes(dbentry, generation);
+}
+
+/* --------
+ * pgstat_report_recovery_conflict() -
+ *
+ *    Report a Hot Standby recovery conflict.
+ * --------
+ */
+static int pending_conflict_tablespace = 0;
+static int pending_conflict_lock = 0;
+static int pending_conflict_snapshot = 0;
+static int pending_conflict_bufferpin = 0;
+static int pending_conflict_startup_deadlock = 0;
+
+void
+pgstat_report_recovery_conflict(int reason)
+{
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupState status;
+
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    pending_recovery_conflicts = true;
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            pending_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            pending_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            pending_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            pending_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            pending_conflict_startup_deadlock++;
+            break;
+    }
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    pgstat_cleanup_recovery_conflict(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/*
+ * clean up function for pending recovery conflicts
+ */
+static void
+pgstat_cleanup_recovery_conflict(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_conflict_tablespace    += pending_conflict_tablespace;
+    dbentry->n_conflict_lock         += pending_conflict_lock;
+    dbentry->n_conflict_snapshot    += pending_conflict_snapshot;
+    dbentry->n_conflict_bufferpin    += pending_conflict_bufferpin;
+    dbentry->n_conflict_startup_deadlock += pending_conflict_startup_deadlock;
+
+    pending_conflict_tablespace = 0;
+    pending_conflict_lock = 0;
+    pending_conflict_snapshot = 0;
+    pending_conflict_bufferpin = 0;
+    pending_conflict_startup_deadlock = 0;
+
+    pending_recovery_conflicts = false;
+}
+
+/* --------
+ * pgstat_report_deadlock() -
+ *
+ *    Report a detected deadlock.
+ * --------
+ */
+void
+pgstat_report_deadlock(void)
+{
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupState status;
+
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    pending_deadlocks++;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    pgstat_cleanup_deadlock(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/*
+ * clean up function for pending deadlocks
+ */
+static void
+pgstat_cleanup_deadlock(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_deadlocks += pending_deadlocks;
+    pending_deadlocks = 0;
+}
+
+/* --------
+ * pgstat_report_tempfile() -
+ *
+ *    Report a temporary file.
+ * --------
+ */
+void
+pgstat_report_tempfile(size_t filesize)
+{
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupState status;
+
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    if (filesize > 0) /* Is there a case where filesize is really 0? */
+    {
+        pending_filesize += filesize; /* needs an overflow check */
+        pending_files++;
+    }
+
+    if (pending_files == 0)
+        return;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    pgstat_cleanup_tempfile(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
+}
+
+/*
+ * clean up function for temporary files
+ */
+static void
+pgstat_cleanup_tempfile(PgStat_StatDBEntry *dbentry)
+{
+
+    dbentry->n_temp_bytes += pending_filesize;
+    dbentry->n_temp_files += pending_files;
+    pending_filesize = 0;
+    pending_files = 0;
+
+}
+
+/*
+ * Initialize function call usage data.
+ * Called by the executor before invoking a function.
+ */
+void
+pgstat_init_function_usage(FunctionCallInfo fcinfo,
+                           PgStat_FunctionCallUsage *fcu)
+{
+    PgStat_BackendFunctionEntry *htabent;
+    bool        found;
+
+    if (pgstat_track_functions <= fcinfo->flinfo->fn_stats)
+    {
+        /* stats not wanted */
+        fcu->fs = NULL;
+        return;
+    }
+
+    if (!pgStatFunctions)
+    {
+        /* First time through - initialize function stat table */
+        HASHCTL        hash_ctl;
+
+        memset(&hash_ctl, 0, sizeof(hash_ctl));
+        hash_ctl.keysize = sizeof(Oid);
+        hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
+        pgStatFunctions = hash_create("Function stat entries",
+                                      PGSTAT_FUNCTION_HASH_SIZE,
+                                      &hash_ctl,
+                                      HASH_ELEM | HASH_BLOBS);
+    }
+
+    /* Get the stats entry for this function, create if necessary */
+    htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
+                          HASH_ENTER, &found);
+    if (!found)
+        MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
+
+    fcu->fs = &htabent->f_counts;
+
+    /* save stats for this function, later used to compensate for recursion */
+    fcu->save_f_total_time = htabent->f_counts.f_total_time;
+
+    /* save current backend-wide total time */
+    fcu->save_total = total_func_time;
+
+    /* get clock time as of function start */
+    INSTR_TIME_SET_CURRENT(fcu->f_start);
+}
+
+/*
+ * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
+ *        for specified function
+ *
+ * If no entry, return NULL, don't create a new one
+ */
+PgStat_BackendFunctionEntry *
+find_funcstat_entry(Oid func_id)
+{
+    if (pgStatFunctions == NULL)
+        return NULL;
+
+    return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
+                                                       (void *) &func_id,
+                                                       HASH_FIND, NULL);
+}
+
+/*
+ * Calculate function call usage and update stat counters.
+ * Called by the executor after invoking a function.
+ *
+ * In the case of a set-returning function that runs in value-per-call mode,
+ * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
+ * calls for what the user considers a single call of the function.  The
+ * finalize flag should be TRUE on the last call.
+ */
+void
+pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
+{
+    PgStat_FunctionCounts *fs = fcu->fs;
+    instr_time    f_total;
+    instr_time    f_others;
+    instr_time    f_self;
+
+    /* stats not wanted? */
+    if (fs == NULL)
+        return;
+
+    /* total elapsed time in this function call */
+    INSTR_TIME_SET_CURRENT(f_total);
+    INSTR_TIME_SUBTRACT(f_total, fcu->f_start);
+
+    /* self usage: elapsed minus anything already charged to other calls */
+    f_others = total_func_time;
+    INSTR_TIME_SUBTRACT(f_others, fcu->save_total);
+    f_self = f_total;
+    INSTR_TIME_SUBTRACT(f_self, f_others);
+
+    /* update backend-wide total time */
+    INSTR_TIME_ADD(total_func_time, f_self);
+
+    /*
+     * Compute the new f_total_time as the total elapsed time added to the
+     * pre-call value of f_total_time.  This is necessary to avoid
+     * double-counting any time taken by recursive calls of myself.  (We do
+     * not need any similar kluge for self time, since that already excludes
+     * any recursive calls.)
+     */
+    INSTR_TIME_ADD(f_total, fcu->save_f_total_time);
+
+    /* update counters in function stats table */
+    if (finalize)
+        fs->f_numcalls++;
+    fs->f_total_time = f_total;
+    INSTR_TIME_ADD(fs->f_self_time, f_self);
+}
+
+
+/* ----------
+ * pgstat_initstats() -
+ *
+ *    Initialize a relcache entry to count access statistics.
+ *    Called whenever a relation is opened.
+ *
+ *    We assume that a relcache entry's pgstat_info field is zeroed by
+ *    relcache.c when the relcache entry is made; thereafter it is long-lived
+ *    data.  We can avoid repeated searches of the TabStatus arrays when the
+ *    same relation is touched repeatedly within a transaction.
+ * ----------
+ */
+void
+pgstat_initstats(Relation rel)
+{
+    Oid            rel_id = rel->rd_id;
+    char        relkind = rel->rd_rel->relkind;
+
+    Assert(db_stats);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+    {
+        /* We're not counting at all */
+        rel->pgstat_info = NULL;
+        return;
+    }
+
+    /* We only count stats for things that have storage */
+    if (!(relkind == RELKIND_RELATION ||
+          relkind == RELKIND_MATVIEW ||
+          relkind == RELKIND_INDEX ||
+          relkind == RELKIND_TOASTVALUE ||
+          relkind == RELKIND_SEQUENCE))
+    {
+        rel->pgstat_info = NULL;
+        return;
+    }
+
+    /*
+     * If we already set up this relation in the current transaction, nothing
+     * to do.
+     */
+    if (rel->pgstat_info != NULL &&
+        rel->pgstat_info->t_id == rel_id)
+        return;
+
+    /* Else find or make the PgStat_TableStatus entry, and update link */
+    rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+}
+
+/*
+ * create_tabstat_hash - create local hash as transactional storage
+ */
+static HTAB *
+create_tabstat_hash(void)
+{
+    HASHCTL        ctl;
+
+    memset(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(Oid);
+    ctl.entrysize = sizeof(TabStatHashEntry);
+
+    return hash_create("pgstat TabStatusArray lookup hash table",
+                       TABSTAT_QUANTUM,
+                       &ctl,
+                       HASH_ELEM | HASH_BLOBS);
+}
+
+/*
+ * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
+ */
+static PgStat_TableStatus *
+get_tabstat_entry(Oid rel_id, bool isshared)
+{
+    TabStatHashEntry *hash_entry;
+    PgStat_TableStatus *entry;
+    TabStatusArray *tsa;
+    bool        found;
+
+    /*
+     * Create hash table if we don't have it already.
+     */
+    if (pgStatTabHash == NULL)
+        pgStatTabHash = create_tabstat_hash();
+
+    /*
+     * Find an entry or create a new one.
+     */
+    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
+    if (!found)
+    {
+        /* initialize new entry with null pointer */
+        hash_entry->tsa_entry = NULL;
+    }
+
+    /*
+     * If entry is already valid, we're done.
+     */
+    if (hash_entry->tsa_entry)
+        return hash_entry->tsa_entry;
+
+    /*
+     * Locate the first pgStatTabList entry with free space, making a new list
+     * entry if needed.  Note that we could get an OOM failure here, but if so
+     * we have left the hashtable and the list in a consistent state.
+     */
+    if (pgStatTabList == NULL)
+    {
+        /* Set up first pgStatTabList entry */
+        pgStatTabList = (TabStatusArray *)
+            MemoryContextAllocZero(TopMemoryContext,
+                                   sizeof(TabStatusArray));
+    }
+
+    tsa = pgStatTabList;
+    while (tsa->tsa_used >= TABSTAT_QUANTUM)
+    {
+        if (tsa->tsa_next == NULL)
+            tsa->tsa_next = (TabStatusArray *)
+                MemoryContextAllocZero(TopMemoryContext,
+                                       sizeof(TabStatusArray));
+        tsa = tsa->tsa_next;
+    }
+
+    /*
+     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
+     * the entry was already zeroed, either at creation or after last use.
+     */
+    entry = &tsa->tsa_entries[tsa->tsa_used++];
+    entry->t_id = rel_id;
+    entry->t_shared = isshared;
+
+    /*
+     * Now we can fill the entry in pgStatTabHash.
+     */
+    hash_entry->tsa_entry = entry;
+
+    return entry;
+}
+
+/*
+ * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
+ *
+ * If no entry, return NULL, don't create a new one
+ *
+ * Note: if we got an error in the most recent execution of pgstat_report_stat,
+ * it's possible that an entry exists but there's no hashtable entry for it.
+ * That's okay, we'll treat this case as "doesn't exist".
+ */
+PgStat_TableStatus *
+find_tabstat_entry(Oid rel_id)
+{
+    TabStatHashEntry *hash_entry;
+
+    /* If hashtable doesn't exist, there are no entries at all */
+    if (!pgStatTabHash)
+        return NULL;
+
+    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
+    if (!hash_entry)
+        return NULL;
+
+    /* Note that this step could also return NULL, but that's correct */
+    return hash_entry->tsa_entry;
+}
+
+/*
+ * get_tabstat_stack_level - add a new (sub)transaction stack entry if needed
+ */
+static PgStat_SubXactStatus *
+get_tabstat_stack_level(int nest_level)
+{
+    PgStat_SubXactStatus *xact_state;
+
+    xact_state = pgStatXactStack;
+    if (xact_state == NULL || xact_state->nest_level != nest_level)
+    {
+        xact_state = (PgStat_SubXactStatus *)
+            MemoryContextAlloc(TopTransactionContext,
+                               sizeof(PgStat_SubXactStatus));
+        xact_state->nest_level = nest_level;
+        xact_state->prev = pgStatXactStack;
+        xact_state->first = NULL;
+        pgStatXactStack = xact_state;
+    }
+    return xact_state;
+}
+
+/*
+ * add_tabstat_xact_level - add a new (sub)transaction state record
+ */
+static void
+add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level)
+{
+    PgStat_SubXactStatus *xact_state;
+    PgStat_TableXactStatus *trans;
+
+    /*
+     * If this is the first rel to be modified at the current nest level, we
+     * first have to push a transaction stack entry.
+     */
+    xact_state = get_tabstat_stack_level(nest_level);
+
+    /* Now make a per-table stack entry */
+    trans = (PgStat_TableXactStatus *)
+        MemoryContextAllocZero(TopTransactionContext,
+                               sizeof(PgStat_TableXactStatus));
+    trans->nest_level = nest_level;
+    trans->upper = pgstat_info->trans;
+    trans->parent = pgstat_info;
+    trans->next = xact_state->first;
+    xact_state->first = trans;
+    pgstat_info->trans = trans;
+}
+
+/*
+ * pgstat_count_heap_insert - count a tuple insertion of n tuples
+ */
+void
+pgstat_count_heap_insert(Relation rel, PgStat_Counter n)
+{
+    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
+
+    if (pgstat_info != NULL)
+    {
+        /* We have to log the effect at the proper transactional level */
+        int            nest_level = GetCurrentTransactionNestLevel();
+
+        if (pgstat_info->trans == NULL ||
+            pgstat_info->trans->nest_level != nest_level)
+            add_tabstat_xact_level(pgstat_info, nest_level);
+
+        pgstat_info->trans->tuples_inserted += n;
+    }
+}
+
+/*
+ * pgstat_count_heap_update - count a tuple update
+ */
+void
+pgstat_count_heap_update(Relation rel, bool hot)
+{
+    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
+
+    if (pgstat_info != NULL)
+    {
+        /* We have to log the effect at the proper transactional level */
+        int            nest_level = GetCurrentTransactionNestLevel();
+
+        if (pgstat_info->trans == NULL ||
+            pgstat_info->trans->nest_level != nest_level)
+            add_tabstat_xact_level(pgstat_info, nest_level);
+
+        pgstat_info->trans->tuples_updated++;
+
+        /* t_tuples_hot_updated is nontransactional, so just advance it */
+        if (hot)
+            pgstat_info->t_counts.t_tuples_hot_updated++;
+    }
+}
+
+/*
+ * pgstat_count_heap_delete - count a tuple deletion
+ */
+void
+pgstat_count_heap_delete(Relation rel)
+{
+    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
+
+    if (pgstat_info != NULL)
+    {
+        /* We have to log the effect at the proper transactional level */
+        int            nest_level = GetCurrentTransactionNestLevel();
+
+        if (pgstat_info->trans == NULL ||
+            pgstat_info->trans->nest_level != nest_level)
+            add_tabstat_xact_level(pgstat_info, nest_level);
+
+        pgstat_info->trans->tuples_deleted++;
+    }
+}
+
+/*
+ * pgstat_truncate_save_counters
+ *
+ * Whenever a table is truncated, we save its i/u/d counters so that they can
+ * be cleared, and if the (sub)xact that executed the truncate later aborts,
+ * the counters can be restored to the saved (pre-truncate) values.  Note we do
+ * this on the first truncate in any particular subxact level only.
+ */
+static void
+pgstat_truncate_save_counters(PgStat_TableXactStatus *trans)
+{
+    if (!trans->truncated)
+    {
+        trans->inserted_pre_trunc = trans->tuples_inserted;
+        trans->updated_pre_trunc = trans->tuples_updated;
+        trans->deleted_pre_trunc = trans->tuples_deleted;
+        trans->truncated = true;
+    }
+}
+
+/*
+ * pgstat_truncate_restore_counters - restore counters when a truncate aborts
+ */
+static void
+pgstat_truncate_restore_counters(PgStat_TableXactStatus *trans)
+{
+    if (trans->truncated)
+    {
+        trans->tuples_inserted = trans->inserted_pre_trunc;
+        trans->tuples_updated = trans->updated_pre_trunc;
+        trans->tuples_deleted = trans->deleted_pre_trunc;
+    }
+}
+
+/*
+ * pgstat_count_truncate - update tuple counters due to truncate
+ */
+void
+pgstat_count_truncate(Relation rel)
+{
+    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
+
+    if (pgstat_info != NULL)
+    {
+        /* We have to log the effect at the proper transactional level */
+        int            nest_level = GetCurrentTransactionNestLevel();
+
+        if (pgstat_info->trans == NULL ||
+            pgstat_info->trans->nest_level != nest_level)
+            add_tabstat_xact_level(pgstat_info, nest_level);
+
+        pgstat_truncate_save_counters(pgstat_info->trans);
+        pgstat_info->trans->tuples_inserted = 0;
+        pgstat_info->trans->tuples_updated = 0;
+        pgstat_info->trans->tuples_deleted = 0;
+    }
+}
+
+/*
+ * pgstat_update_heap_dead_tuples - update dead-tuples count
+ *
+ * The semantics of this are that we are reporting the nontransactional
+ * recovery of "delta" dead tuples; so t_delta_dead_tuples decreases
+ * rather than increasing, and the change goes straight into the per-table
+ * counter, not into transactional state.
+ */
+void
+pgstat_update_heap_dead_tuples(Relation rel, int delta)
+{
+    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
+
+    if (pgstat_info != NULL)
+        pgstat_info->t_counts.t_delta_dead_tuples -= delta;
+}
+
+
+/* ----------
+ * AtEOXact_PgStat
+ *
+ *    Called from access/transam/xact.c at top-level transaction commit/abort.
+ * ----------
+ */
+void
+AtEOXact_PgStat(bool isCommit)
+{
+    PgStat_SubXactStatus *xact_state;
+
+    /*
+     * Count transaction commit or abort.  (We use counters, not just bools,
+     * in case the reporting message isn't sent right away.)
+     */
+    if (isCommit)
+        pgStatXactCommit++;
+    else
+        pgStatXactRollback++;
+
+    /*
+     * Transfer transactional insert/update counts into the base tabstat
+     * entries.  We don't bother to free any of the transactional state, since
+     * it's all in TopTransactionContext and will go away anyway.
+     */
+    xact_state = pgStatXactStack;
+    if (xact_state != NULL)
+    {
+        PgStat_TableXactStatus *trans;
+
+        Assert(xact_state->nest_level == 1);
+        Assert(xact_state->prev == NULL);
+        for (trans = xact_state->first; trans != NULL; trans = trans->next)
+        {
+            PgStat_TableStatus *tabstat;
+
+            Assert(trans->nest_level == 1);
+            Assert(trans->upper == NULL);
+            tabstat = trans->parent;
+            Assert(tabstat->trans == trans);
+            /* restore pre-truncate stats (if any) in case of aborted xact */
+            if (!isCommit)
+                pgstat_truncate_restore_counters(trans);
+            /* count attempted actions regardless of commit/abort */
+            tabstat->t_counts.t_tuples_inserted += trans->tuples_inserted;
+            tabstat->t_counts.t_tuples_updated += trans->tuples_updated;
+            tabstat->t_counts.t_tuples_deleted += trans->tuples_deleted;
+            if (isCommit)
+            {
+                tabstat->t_counts.t_truncated = trans->truncated;
+                if (trans->truncated)
+                {
+                    /* forget live/dead stats seen by backend thus far */
+                    tabstat->t_counts.t_delta_live_tuples = 0;
+                    tabstat->t_counts.t_delta_dead_tuples = 0;
+                }
+                /* insert adds a live tuple, delete removes one */
+                tabstat->t_counts.t_delta_live_tuples +=
+                    trans->tuples_inserted - trans->tuples_deleted;
+                /* update and delete each create a dead tuple */
+                tabstat->t_counts.t_delta_dead_tuples +=
+                    trans->tuples_updated + trans->tuples_deleted;
+                /* insert, update, delete each count as one change event */
+                tabstat->t_counts.t_changed_tuples +=
+                    trans->tuples_inserted + trans->tuples_updated +
+                    trans->tuples_deleted;
+            }
+            else
+            {
+                /* inserted tuples are dead, deleted tuples are unaffected */
+                tabstat->t_counts.t_delta_dead_tuples +=
+                    trans->tuples_inserted + trans->tuples_updated;
+                /* an aborted xact generates no changed_tuple events */
+            }
+            tabstat->trans = NULL;
+        }
+    }
+    pgStatXactStack = NULL;
+
+    /* mark that the next reference will be the first in a transaction */
+    first_in_xact = true;
+}
+
+/* ----------
+ * AtEOSubXact_PgStat
+ *
+ *    Called from access/transam/xact.c at subtransaction commit/abort.
+ * ----------
+ */
+void
+AtEOSubXact_PgStat(bool isCommit, int nestDepth)
+{
+    PgStat_SubXactStatus *xact_state;
+
+    /*
+     * Transfer transactional insert/update counts into the next higher
+     * subtransaction state.
+     */
+    xact_state = pgStatXactStack;
+    if (xact_state != NULL &&
+        xact_state->nest_level >= nestDepth)
+    {
+        PgStat_TableXactStatus *trans;
+        PgStat_TableXactStatus *next_trans;
+
+        /* delink xact_state from stack immediately to simplify reuse case */
+        pgStatXactStack = xact_state->prev;
+
+        for (trans = xact_state->first; trans != NULL; trans = next_trans)
+        {
+            PgStat_TableStatus *tabstat;
+
+            next_trans = trans->next;
+            Assert(trans->nest_level == nestDepth);
+            tabstat = trans->parent;
+            Assert(tabstat->trans == trans);
+            if (isCommit)
+            {
+                if (trans->upper && trans->upper->nest_level == nestDepth - 1)
+                {
+                    if (trans->truncated)
+                    {
+                        /* propagate the truncate status one level up */
+                        pgstat_truncate_save_counters(trans->upper);
+                        /* replace upper xact stats with ours */
+                        trans->upper->tuples_inserted = trans->tuples_inserted;
+                        trans->upper->tuples_updated = trans->tuples_updated;
+                        trans->upper->tuples_deleted = trans->tuples_deleted;
+                    }
+                    else
+                    {
+                        trans->upper->tuples_inserted += trans->tuples_inserted;
+                        trans->upper->tuples_updated += trans->tuples_updated;
+                        trans->upper->tuples_deleted += trans->tuples_deleted;
+                    }
+                    tabstat->trans = trans->upper;
+                    pfree(trans);
+                }
+                else
+                {
+                    /*
+                     * When there isn't an immediate parent state, we can just
+                     * reuse the record instead of going through a
+                     * palloc/pfree pushup (this works since it's all in
+                     * TopTransactionContext anyway).  We have to re-link it
+                     * into the parent level, though, and that might mean
+                     * pushing a new entry into the pgStatXactStack.
+                     */
+                    PgStat_SubXactStatus *upper_xact_state;
+
+                    upper_xact_state = get_tabstat_stack_level(nestDepth - 1);
+                    trans->next = upper_xact_state->first;
+                    upper_xact_state->first = trans;
+                    trans->nest_level = nestDepth - 1;
+                }
+            }
+            else
+            {
+                /*
+                 * On abort, update top-level tabstat counts, then forget the
+                 * subtransaction
+                 */
+
+                /* first restore values obliterated by truncate */
+                pgstat_truncate_restore_counters(trans);
+                /* count attempted actions regardless of commit/abort */
+                tabstat->t_counts.t_tuples_inserted += trans->tuples_inserted;
+                tabstat->t_counts.t_tuples_updated += trans->tuples_updated;
+                tabstat->t_counts.t_tuples_deleted += trans->tuples_deleted;
+                /* inserted tuples are dead, deleted tuples are unaffected */
+                tabstat->t_counts.t_delta_dead_tuples +=
+                    trans->tuples_inserted + trans->tuples_updated;
+                tabstat->trans = trans->upper;
+                pfree(trans);
+            }
+        }
+        pfree(xact_state);
+    }
+}
+
+
+/*
+ * AtPrepare_PgStat
+ *        Save the transactional stats state at 2PC transaction prepare.
+ *
+ * In this phase we just generate 2PC records for all the pending
+ * transaction-dependent stats work.
+ */
+void
+AtPrepare_PgStat(void)
+{
+    PgStat_SubXactStatus *xact_state;
+
+    xact_state = pgStatXactStack;
+    if (xact_state != NULL)
+    {
+        PgStat_TableXactStatus *trans;
+
+        Assert(xact_state->nest_level == 1);
+        Assert(xact_state->prev == NULL);
+        for (trans = xact_state->first; trans != NULL; trans = trans->next)
+        {
+            PgStat_TableStatus *tabstat;
+            TwoPhasePgStatRecord record;
+
+            Assert(trans->nest_level == 1);
+            Assert(trans->upper == NULL);
+            tabstat = trans->parent;
+            Assert(tabstat->trans == trans);
+
+            record.tuples_inserted = trans->tuples_inserted;
+            record.tuples_updated = trans->tuples_updated;
+            record.tuples_deleted = trans->tuples_deleted;
+            record.inserted_pre_trunc = trans->inserted_pre_trunc;
+            record.updated_pre_trunc = trans->updated_pre_trunc;
+            record.deleted_pre_trunc = trans->deleted_pre_trunc;
+            record.t_id = tabstat->t_id;
+            record.t_shared = tabstat->t_shared;
+            record.t_truncated = trans->truncated;
+
+            RegisterTwoPhaseRecord(TWOPHASE_RM_PGSTAT_ID, 0,
+                                   &record, sizeof(TwoPhasePgStatRecord));
+        }
+    }
+}
+
+/*
+ * PostPrepare_PgStat
+ *        Clean up after successful PREPARE.
+ *
+ * All we need do here is unlink the transaction stats state from the
+ * nontransactional state.  The nontransactional action counts will be
+ * reported to the stats collector immediately, while the effects on live
+ * and dead tuple counts are preserved in the 2PC state file.
+ *
+ * Note: AtEOXact_PgStat is not called during PREPARE.
+ */
+void
+PostPrepare_PgStat(void)
+{
+    PgStat_SubXactStatus *xact_state;
+
+    /*
+     * We don't bother to free any of the transactional state, since it's all
+     * in TopTransactionContext and will go away anyway.
+     */
+    xact_state = pgStatXactStack;
+    if (xact_state != NULL)
+    {
+        PgStat_TableXactStatus *trans;
+
+        for (trans = xact_state->first; trans != NULL; trans = trans->next)
+        {
+            PgStat_TableStatus *tabstat;
+
+            tabstat = trans->parent;
+            tabstat->trans = NULL;
+        }
+    }
+    pgStatXactStack = NULL;
+}
+
+/*
+ * 2PC processing routine for COMMIT PREPARED case.
+ *
+ * Load the saved counts into our local pgstats state.
+ */
+void
+pgstat_twophase_postcommit(TransactionId xid, uint16 info,
+                           void *recdata, uint32 len)
+{
+    TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata;
+    PgStat_TableStatus *pgstat_info;
+
+    /* Find or create a tabstat entry for the rel */
+    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+
+    /* Same math as in AtEOXact_PgStat, commit case */
+    pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
+    pgstat_info->t_counts.t_tuples_updated += rec->tuples_updated;
+    pgstat_info->t_counts.t_tuples_deleted += rec->tuples_deleted;
+    pgstat_info->t_counts.t_truncated = rec->t_truncated;
+    if (rec->t_truncated)
+    {
+        /* forget live/dead stats seen by backend thus far */
+        pgstat_info->t_counts.t_delta_live_tuples = 0;
+        pgstat_info->t_counts.t_delta_dead_tuples = 0;
+    }
+    pgstat_info->t_counts.t_delta_live_tuples +=
+        rec->tuples_inserted - rec->tuples_deleted;
+    pgstat_info->t_counts.t_delta_dead_tuples +=
+        rec->tuples_updated + rec->tuples_deleted;
+    pgstat_info->t_counts.t_changed_tuples +=
+        rec->tuples_inserted + rec->tuples_updated +
+        rec->tuples_deleted;
+}
+
+/*
+ * 2PC processing routine for ROLLBACK PREPARED case.
+ *
+ * Load the saved counts into our local pgstats state, but treat them
+ * as aborted.
+ */
+void
+pgstat_twophase_postabort(TransactionId xid, uint16 info,
+                          void *recdata, uint32 len)
+{
+    TwoPhasePgStatRecord *rec = (TwoPhasePgStatRecord *) recdata;
+    PgStat_TableStatus *pgstat_info;
+
+    /* Find or create a tabstat entry for the rel */
+    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+
+    /* Same math as in AtEOXact_PgStat, abort case */
+    if (rec->t_truncated)
+    {
+        rec->tuples_inserted = rec->inserted_pre_trunc;
+        rec->tuples_updated = rec->updated_pre_trunc;
+        rec->tuples_deleted = rec->deleted_pre_trunc;
+    }
+    pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
+    pgstat_info->t_counts.t_tuples_updated += rec->tuples_updated;
+    pgstat_info->t_counts.t_tuples_deleted += rec->tuples_deleted;
+    pgstat_info->t_counts.t_delta_dead_tuples +=
+        rec->tuples_inserted + rec->tuples_updated;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_dbentry() -
+ *
+ *    Find the database stats entry on a backend. Returned entries are cached
+ *    until transaction end. If oneshot is true, the entry is not cached and is
+ *    returned in palloc'ed memory in the caller's context.
+ */
+PgStat_StatDBEntry *
+pgstat_fetch_stat_dbentry(Oid dbid, bool oneshot)
+{
+    /* context for snapshot_statentry */
+    static PgStat_SnapshotContext cxt =
+    {
+        .hashname = "local database stats hash",
+        .hash = NULL,
+        .hash_entsize = sizeof(PgStat_StatDBEntry_snapshot),
+        .dshash = NULL,
+        .dsh_handle = DSM_HANDLE_INVALID,
+        .dsh_params = &dsh_dbparams
+    };
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    /* If not done for this transaction, take a snapshot of global stats */
+    backend_snapshot_global_stats();
+
+    cxt.dshash = &db_stats;
+    cxt.hash = &snapshot_db_stats;
+
+    /* the caller has no business with the snapshot-local members */
+    return (PgStat_StatDBEntry *)
+        snapshot_statentry(&cxt, oneshot, dbid);
+}
+
+/* ----------
+ * pgstat_fetch_stat_tabentry() -
+ *
+ *    Find the table stats entry on a backend. Returned entries are cached
+ *    until transaction end. If oneshot is true, the entry is not cached and is
+ *    returned in palloc'ed memory in the caller's context.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry(PgStat_StatDBEntry *dbent, Oid reloid, bool oneshot)
+{
+    /* context for snapshot_statentry */
+    static PgStat_SnapshotContext cxt =
+    {
+        .hashname = "table stats snapshot hash",
+        .hash = NULL,
+        .hash_entsize = sizeof(PgStat_StatTabEntry),
+        .dshash = NULL,
+        .dsh_handle = DSM_HANDLE_INVALID,
+        .dsh_params = &dsh_tblparams
+    };
+    PgStat_StatDBEntry_snapshot *local_dbent;
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    /* the dbent passed in is really a PgStat_StatDBEntry_snapshot */
+    local_dbent = (PgStat_StatDBEntry_snapshot *) dbent;
+    cxt.hash = &local_dbent->snapshot_tables;
+    cxt.dshash = &local_dbent->dshash_tables;
+    cxt.dsh_handle = dbent->tables;
+
+    return (PgStat_StatTabEntry *)
+        snapshot_statentry(&cxt, oneshot, reloid);
+}
+
+/* ----------
+ * pgstat_fetch_stat_funcentry() -
+ *
+ *    Find the function stats entry on a backend. Returned entries are cached
+ *    until transaction end. If oneshot is true, the entry is not cached and is
+ *    returned in palloc'ed memory in the caller's context.
+ */
+static PgStat_StatFuncEntry *
+pgstat_fetch_stat_funcentry(PgStat_StatDBEntry *dbent, Oid funcid, bool oneshot)
+{
+    /* context for snapshot_statentry */
+    static PgStat_SnapshotContext cxt =
+    {
+        .hashname = "function stats snapshot hash",
+        .hash = NULL,
+        .hash_entsize = sizeof(PgStat_StatFuncEntry),
+        .dshash = NULL,
+        .dsh_handle = DSM_HANDLE_INVALID,
+        .dsh_params = &dsh_funcparams
+    };
+    PgStat_StatDBEntry_snapshot *local_dbent;
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    if (dbent->functions == DSM_HANDLE_INVALID)
+        return NULL;
+
+    /* the dbent passed in is really a PgStat_StatDBEntry_snapshot */
+    local_dbent = (PgStat_StatDBEntry_snapshot *) dbent;
+    cxt.hash = &local_dbent->snapshot_functions;
+    cxt.dshash = &local_dbent->dshash_functions;
+    cxt.dsh_handle = dbent->functions;
+
+    return (PgStat_StatFuncEntry *)
+        snapshot_statentry(&cxt, oneshot, funcid);
+}
+
+/*
+ * backend_snapshot_global_stats() -
+ *
+ * Takes a snapshot of the global stats if not done yet.  The snapshot is kept
+ * until a subsequent call of backend_clear_stats_snapshot() or the end of the
+ * current memory context (typically TopTransactionContext).
+ */
+static void
+backend_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext;
+    TimestampTz update_time = 0;
+
+    /* The snapshot lives within CacheMemoryContext */
+    if (pgStatSnapshotContext == NULL)
+    {
+        pgStatSnapshotContext =
+            AllocSetContextCreate(CacheMemoryContext,
+                                  "Stats snapshot context",
+                                  ALLOCSET_DEFAULT_SIZES);
+    }
+
+    /*
+     * On the first call in a transaction, check whether the shared stats were
+     * updated after our cached snapshot expired; if so, throw away the cache.
+     */
+    if (IsTransactionState() && first_in_xact)
+    {
+        first_in_xact = false;
+        LWLockAcquire(StatsLock, LW_SHARED);
+        update_time = StatsShmem->last_update;
+        LWLockRelease(StatsLock);
+
+        if (backend_cache_expire < update_time)
+        {
+            backend_clear_stats_snapshot();
+
+            /*
+             * Shared stats are updated frequently when many backends are
+             * running, but we don't want the cached stats to expire that
+             * often. Keep them for at least the minimum stats update
+             * interval of a backend. As a result, snapshots may live
+             * across multiple transactions.
+             */
+            backend_cache_expire =
+                update_time + PGSTAT_STAT_MIN_INTERVAL * USECS_PER_SEC / 1000;
+        }
+    }
+
+    /* Nothing to do if already done */
+    if (snapshot_globalStats)
+        return;
+
+    Assert(snapshot_archiverStats == NULL);
+
+    oldcontext = MemoryContextSwitchTo(pgStatSnapshotContext);
+
+    /* global stats can be just copied  */
+    LWLockAcquire(StatsLock, LW_SHARED);
+    snapshot_globalStats = palloc(sizeof(PgStat_GlobalStats));
+    memcpy(snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+
+    snapshot_archiverStats = palloc(sizeof(PgStat_ArchiverStats));
+    memcpy(snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+    LWLockRelease(StatsLock);
+
+    /* set the timestamp of this snapshot */
+    snapshot_globalStats->stats_timestamp = update_time;
+
+    MemoryContextSwitchTo(oldcontext);
+
+    return;
+}
+
+/* ----------
+ * backend_clear_stats_snapshot: discard the local cache so that new
+ * snapshots will be read.
+ * ----------
+ */
+void
+backend_clear_stats_snapshot(void)
+{
+    if (pgStatSnapshotContext == NULL)
+        return;
+
+    MemoryContextReset(pgStatSnapshotContext);
+
+    /* mark the resources as not allocated */
+    snapshot_globalStats = NULL;
+    snapshot_archiverStats = NULL;
+    snapshot_db_stats = NULL;
+}
+
+/* ----------
+ * backend_fetch_stat_tabentry() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    the collected statistics for one table or NULL. NULL doesn't mean
+ *    that the table doesn't exist, it is just not yet known by the
+ *    collector, so the caller is better off to report ZERO instead.
+ * ----------
+ */
+PgStat_StatTabEntry *
+backend_fetch_stat_tabentry(Oid relid)
+{
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+
+    /* Lookup our database, then look in its table hash table. */
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, false);
+    if (dbentry == NULL)
+        return NULL;
+
+    tabentry = pgstat_fetch_stat_tabentry(dbentry, relid, false);
+    if (tabentry != NULL)
+        return tabentry;
+
+    /*
+     * If we didn't find it, maybe it's a shared table.
+     */
+    dbentry = pgstat_fetch_stat_dbentry(InvalidOid, false);
+    if (dbentry == NULL)
+        return NULL;
+
+    tabentry = pgstat_fetch_stat_tabentry(dbentry, relid, false);
+    if (tabentry != NULL)
+        return tabentry;
+
+    return NULL;
+}
+
+
+/* ----------
+ * backend_fetch_stat_funcentry() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    the collected statistics for one function or NULL.
+ * ----------
+ */
+PgStat_StatFuncEntry *
+backend_fetch_stat_funcentry(Oid func_id)
+{
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatFuncEntry *funcentry = NULL;
+
+    /* Lookup our database, then find the requested function */
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId, false);
+    if (dbentry == NULL)
+        return NULL;
+
+    funcentry = pgstat_fetch_stat_funcentry(dbentry, func_id, false);
+
+    return funcentry;
+}
+
+/*
+ * ---------
+ * backend_fetch_stat_archiver() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    a pointer to the archiver statistics struct.
+ * ---------
+ */
+PgStat_ArchiverStats *
+backend_fetch_stat_archiver(void)
+{
+    /* If not done for this transaction, take a stats snapshot */
+    backend_snapshot_global_stats();
+
+    return snapshot_archiverStats;
+}
+
+
+/*
+ * ---------
+ * backend_fetch_global() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    a pointer to the global statistics struct.
+ * ---------
+ */
+PgStat_GlobalStats *
+backend_fetch_global(void)
+{
+    /* If not done for this transaction, take a stats snapshot */
+    backend_snapshot_global_stats();
+
+    return snapshot_globalStats;
+}
+
+/*
+ * Shut down a single backend's statistics reporting at process exit.
+ *
+ * Flush any remaining statistics counts out to the shared statistics.
+ * Without this, operations triggered during backend exit (such as
+ * temp table deletions) won't be counted.
+ */
+static void
+pgstat_beshutdown_hook(int code, Datum arg)
+{
+    /*
+     * If we got as far as discovering our own database ID, we can flush what
+     * we did to the shared statistics.  Otherwise, we can do nothing, so
+     * forget it.  (This means that accesses to pg_database during failed
+     * backend starts might never get counted.)
+     */
+    if (OidIsValid(MyDatabaseId))
+        pgstat_flush_stat(true);
+}
+
+
+/* ------------------------------------------------------------
+ * Local support functions follow
+ * ------------------------------------------------------------
+ */
+
+/* ----------
+ * pgstat_update_archiver() -
+ *
+ *    Update the stats data about the WAL file that we successfully archived or
+ *    failed to archive.
+ * ----------
+ */
+void
+pgstat_update_archiver(const char *xlog, bool failed)
+{
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (failed)
+    {
+        /* Failed archival attempt */
+        ++shared_archiverStats->failed_count;
+        memcpy(shared_archiverStats->last_failed_wal, xlog,
+               sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = GetCurrentTimestamp();
+    }
+    else
+    {
+        /* Successful archival operation */
+        ++shared_archiverStats->archived_count;
+        memcpy(shared_archiverStats->last_archived_wal, xlog,
+               sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = GetCurrentTimestamp();
+    }
+    LWLockRelease(StatsLock);
+}
+
+/* ----------
+ * pgstat_update_bgwriter() -
+ *
+ *        Update bgwriter statistics
+ * ----------
+ */
+void
+pgstat_update_bgwriter(void)
+{
+    /* We assume this initializes to zeroes */
+    static const PgStat_BgWriter all_zeroes;
+
+    PgStat_BgWriter *s = &BgWriterStats;
+
+    /*
+     * This function can be called even if nothing at all has happened. In
+     * this case, avoid taking the lock just to add zeroes to the shared
+     * statistics.
+     */
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
+        return;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += s->m_timed_checkpoints;
+    shared_globalStats->requested_checkpoints += s->m_requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += s->m_checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += s->m_checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += s->m_buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += s->m_buf_written_clean;
+    shared_globalStats->maxwritten_clean += s->m_maxwritten_clean;
+    shared_globalStats->buf_written_backend += s->m_buf_written_backend;
+    shared_globalStats->buf_fsync_backend += s->m_buf_fsync_backend;
+    shared_globalStats->buf_alloc += s->m_buf_alloc;
+    LWLockRelease(StatsLock);
+
+    /*
+     * Clear out the statistics buffer, so it can be re-used.
+     */
+    MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
+}
+
+/*
+ * Pin and unpin the hashes of a dbentry.
+ *
+ * To keep memory usage low, a counter reset recreates the dshash instead of
+ * removing individual entries under a whole-dshash lock. On the other hand,
+ * a dshash cannot be destroyed until all of its referrers have gone, so a
+ * counter reset could end up waiting for someone who is writing table
+ * counters. To avoid such waits, we prepare another generation of the
+ * table/function hashes and isolate the hashes that are to be destroyed but
+ * are still being accessed. pin_hashes() returns the "generation" of the
+ * current hashes. unpin_hashes() removes the older generation's hashes once
+ * all referrers have gone.
+ */
+static int
+pin_hashes(PgStat_StatDBEntry *dbentry)
+{
+    int    counter;
+
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->refcnt++;
+    counter = dbentry->generation;
+    LWLockRelease(&dbentry->lock);
+
+    dshash_release_lock(db_stats, dbentry);
+
+    return counter;
+}
+
+/*
+ * Releases the hashes in dbentry. If the given generation is isolated,
+ * destroy it after all referrers have gone. Otherwise just decrease the
+ * reference count and return.
+ */
+static void
+unpin_hashes(PgStat_StatDBEntry *dbentry, int generation)
+{
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+
+    /* using current generation, just decrease refcount */
+    if (dbentry->generation == generation)
+    {
+        dbentry->refcnt--;
+        LWLockRelease(&dbentry->lock);
+        return;
+    }
+
+    /*
+     * We are using the previous generation, which is waiting for all of its
+     * referrers to finish.
+     */
+    Assert(dbentry->generation - 1 == generation); /* allow wrap around */
+
+    if (--dbentry->prev_refcnt == 0)
+    {
+        /* no referrer remains, remove the hashes */
+        dshash_table *tables = dshash_attach(area, &dsh_tblparams,
+                                             dbentry->prev_tables, 0);
+        dshash_destroy(tables);
+
+        if (dbentry->prev_functions)
+        {
+            dshash_table *funcs =
+                dshash_attach(area, &dsh_funcparams,
+                              dbentry->prev_functions, 0);
+            dshash_destroy(funcs);
+        }
+        dbentry->prev_tables = DSM_HANDLE_INVALID;
+        dbentry->prev_functions = DSM_HANDLE_INVALID;
+    }
+
+    LWLockRelease(&dbentry->lock);
+    return;
+}
+
+/* attach and return the specified generation of table hash */
+static dshash_table *
+attach_table_hash(PgStat_StatDBEntry *dbent, int gen)
+{
+    dshash_table *ret;
+
+    LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
+    if (dbent->generation == gen)
+        ret = dshash_attach(area, &dsh_tblparams, dbent->tables, 0);
+    else
+    {
+        Assert(dbent->generation == gen + 1);
+        Assert(dbent->prev_tables != DSM_HANDLE_INVALID);
+        ret = dshash_attach(area, &dsh_tblparams, dbent->prev_tables, 0);
+    }
+    LWLockRelease(&dbent->lock);
+
+    return ret;
+}
+
+/* attach and return the specified generation of function hash */
+static dshash_table *
+attach_function_hash(PgStat_StatDBEntry *dbent, int gen)
+{
+    dshash_table *ret = NULL;
+
+    LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
+    if (dbent->generation == gen)
+    {
+        if (dbent->functions == DSM_HANDLE_INVALID)
+        {
+            dshash_table *funchash =
+                dshash_create(area, &dsh_funcparams, 0);
+            dbent->functions = dshash_get_hash_table_handle(funchash);
+
+            ret = funchash;
+        }
+        else
+            ret = dshash_attach(area, &dsh_funcparams, dbent->functions, 0);
+    }
+    /* don't bother creating useless hash */
+
+    LWLockRelease(&dbent->lock);
+
+    return ret;
+}
+
+static void
+initialize_dbentry_nonpersistent_members(PgStat_StatDBEntry *dbentry)
+{
+    LWLockInitialize(&dbentry->lock, LWTRANCHE_STATS_DB);
+    dbentry->generation = 0;
+    dbentry->refcnt = 0;
+    dbentry->prev_refcnt = 0;
+    dbentry->tables = DSM_HANDLE_INVALID;
+    dbentry->prev_tables = DSM_HANDLE_INVALID;
+    dbentry->functions = DSM_HANDLE_INVALID;
+    dbentry->prev_functions = DSM_HANDLE_INVALID;
+}
+
+/*
+ * Subroutine to reset stats in a shared database entry
+ *
+ * The table and function hashes are initialized to empty.  dbentry keeps the
+ * previous dshash tables while other processes are still attached to them.
+ * If initialize is true, the previous tables are also cleared.
+ */
+static void
+reset_dbentry_counters(PgStat_StatDBEntry *dbentry, bool initialize)
+{
+    if (initialize)
+    {
+        /* no other process can access this entry */
+        initialize_dbentry_nonpersistent_members(dbentry);
+    }
+
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+
+    dbentry->n_xact_commit = 0;
+    dbentry->n_xact_rollback = 0;
+    dbentry->n_blocks_fetched = 0;
+    dbentry->n_blocks_hit = 0;
+    dbentry->n_tuples_returned = 0;
+    dbentry->n_tuples_fetched = 0;
+    dbentry->n_tuples_inserted = 0;
+    dbentry->n_tuples_updated = 0;
+    dbentry->n_tuples_deleted = 0;
+    dbentry->last_autovac_time = 0;
+    dbentry->n_conflict_tablespace = 0;
+    dbentry->n_conflict_lock = 0;
+    dbentry->n_conflict_snapshot = 0;
+    dbentry->n_conflict_bufferpin = 0;
+    dbentry->n_conflict_startup_deadlock = 0;
+    dbentry->n_temp_files = 0;
+    dbentry->n_temp_bytes = 0;
+    dbentry->n_deadlocks = 0;
+    dbentry->n_block_read_time = 0;
+    dbentry->n_block_write_time = 0;
+
+    if (dbentry->refcnt == 0)
+    {
+        /*
+         * No one is referring to the current hash. Just destroy it.
+         */
+        dshash_table *tbl;
+
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+            dbentry->tables = DSM_HANDLE_INVALID;
+        }
+    }
+    else if (dbentry->prev_refcnt == 0)
+    {
+        /*
+         * Someone is still referring to the current hash and previous slot is
+         * vacant. Stash out the current hash to the previous slot.
+         */
+        dbentry->prev_refcnt = dbentry->refcnt;
+        dbentry->prev_tables = dbentry->tables;
+        dbentry->prev_functions = dbentry->functions;
+        dbentry->refcnt = 0;
+        dbentry->tables = DSM_HANDLE_INVALID;
+        dbentry->functions = DSM_HANDLE_INVALID;
+        dbentry->generation++;
+    }
+    else
+    {
+        Assert(dbentry->prev_refcnt > 0 && dbentry->refcnt > 0);
+        /*
+         * If we get here, we have just received another reset request while
+         * the old hashes are still waiting for all referrers to release them.
+         * That should take only a short time, so we can ignore this request.
+         */
+    }
+
+
+    /* Create a new table hash if none exists */
+    if (dbentry->tables == DSM_HANDLE_INVALID)
+    {
+        dshash_table *tbl;
+
+        tbl = dshash_create(area, &dsh_tblparams, 0);
+        dbentry->tables = dshash_get_hash_table_handle(tbl);
+    }
+
+    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
+
+    LWLockRelease(&dbentry->lock);
+}
+
+/*
+ * create_local_stats_hash() -
+ *
+ * Creates a dynahash used for table/function stats cache.
+ */
+static HTAB *
+create_local_stats_hash(const char *name, size_t keysize, size_t entrysize,
+                        int nentries)
+{
+    HTAB *result;
+    HASHCTL ctl;
+
+    /* Create the hash in the stats context */
+    ctl.keysize        = keysize;
+    ctl.entrysize    = entrysize;
+    ctl.hcxt        = pgStatSnapshotContext;
+    result = hash_create(name, nentries, &ctl,
+                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    return result;
+}
+
+/*
+ * snapshot_statentry() - Find an entry in the source dshash, using a cache.
+ *
+ * Returns the entry for key or NULL if not found.
+ *
+ * When !oneshot, returned entries are consistent during the current
+ * transaction. Otherwise the returned entry is a palloc'ed copy in the
+ * caller's memory context.
+ *
+ * cxt->hash points to an HTAB * variable that stores the hash for the local
+ * cache. A new one is created if it does not exist yet.
+ *
+ * cxt->dshash points to a dshash_table * variable that stores the attached
+ * dshash. cxt->dsh_handle is attached if it is not attached yet.
+ */
+static void *
+snapshot_statentry(PgStat_SnapshotContext *cxt, bool oneshot, Oid key)
+{
+    char *lentry = NULL;
+    size_t keysize = cxt->dsh_params->key_size;
+    size_t dsh_entrysize = cxt->dsh_params->entry_size;
+
+    if (!oneshot)
+    {
+        /* caches the result entry */
+        bool found;
+        bool *negative;
+
+        /*
+         * Create new hash with arbitrary initial entries since we don't know
+         * how this hash will grow.
+         */
+        /* make room for negative flag at the end of entry */
+        if (!*cxt->hash)
+            *cxt->hash =
+                create_local_stats_hash(cxt->hashname, keysize,
+                                        cxt->hash_entsize + sizeof(bool), 32);
+
+        lentry = hash_search(*cxt->hash, &key, HASH_ENTER, &found);
+
+        /* negative flag is placed at the end of the entry */
+        negative = (bool *) (lentry + cxt->hash_entsize);
+
+        if (!found)
+        {
+            /* not found in local cache, search shared hash */
+
+            void *sentry;
+
+            /* attach shared hash if not given, leave it for later use */
+            if (!*cxt->dshash)
+            {
+                MemoryContext oldcxt;
+
+                if (cxt->dsh_handle == DSM_HANDLE_INVALID)
+                    return NULL;
+
+                oldcxt = MemoryContextSwitchTo(pgStatSnapshotContext);
+                *cxt->dshash =
+                    dshash_attach(area, cxt->dsh_params, cxt->dsh_handle, NULL);
+                MemoryContextSwitchTo(oldcxt);
+            }
+
+            sentry = dshash_find(*cxt->dshash, &key, false);
+
+            /*
+             * We expect that the stats for the specified database exist in
+             * most cases.
+             */
+
+            if (sentry)
+            {
+                memcpy(lentry, sentry, dsh_entrysize);
+                dshash_release_lock(*cxt->dshash, sentry);
+                if (dsh_entrysize < cxt->hash_entsize)
+                    MemSet(lentry + dsh_entrysize, 0,
+                           cxt->hash_entsize - dsh_entrysize);
+            }
+
+            *negative = !sentry;
+
+            if (!sentry)
+                return NULL;
+        }
+
+        if (*negative)
+            lentry = NULL;
+    }
+    else
+    {
+        /*
+         * The caller doesn't want caching. Just make a copy of the entry and
+         * return it.
+         */
+        void *sentry;
+
+        /* attach shared hash if not given, leave it for later use */
+        if (!*cxt->dshash)
+        {
+            if (cxt->dsh_handle == DSM_HANDLE_INVALID)
+                return NULL;
+
+            *cxt->dshash =
+                dshash_attach(area, cxt->dsh_params, cxt->dsh_handle, NULL);
+        }
+
+        sentry = dshash_find(*cxt->dshash, &key, false);
+        if (sentry)
+        {
+            lentry = palloc(cxt->hash_entsize);
+            memcpy(lentry, sentry, dsh_entrysize);
+            dshash_release_lock(*cxt->dshash, sentry);
+            if (dsh_entrysize < cxt->hash_entsize)
+                MemSet(lentry + dsh_entrysize, 0,
+                       cxt->hash_entsize - dsh_entrysize);
+        }
+    }
+
+    return (void *) lentry;
+}
+
+static bool
+pgstat_update_tabentry(dshash_table *tabhash, PgStat_TableStatus *stat,
+                       bool nowait)
+{
+    PgStat_StatTabEntry *tabentry;
+    bool    found;
+
+    if (tabhash == NULL)
+        return false;
+
+    tabentry = (PgStat_StatTabEntry *)
+        dshash_find_or_insert_extended(tabhash, (void *) &(stat->t_id),
+                                       &found, nowait);
+
+    /* failed to acquire lock */
+    if (tabentry == NULL)
+        return false;
+
+    if (!found)
+    {
+        /*
+         * If it's a new table entry, initialize counters to the values we
+         * just got.
+         */
+        tabentry->numscans = stat->t_counts.t_numscans;
+        tabentry->tuples_returned = stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched = stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted = stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated = stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted = stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated = stat->t_counts.t_tuples_hot_updated;
+        tabentry->n_live_tuples = stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples = stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze = stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched = stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit = stat->t_counts.t_blocks_hit;
+
+        tabentry->vacuum_timestamp = 0;
+        tabentry->vacuum_count = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_count = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->analyze_count = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+        tabentry->autovac_analyze_count = 0;
+    }
+    else
+    {
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        tabentry->numscans += stat->t_counts.t_numscans;
+        tabentry->tuples_returned += stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched += stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted += stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated += stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted += stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated += stat->t_counts.t_tuples_hot_updated;
+        /* If table was truncated, first reset the live/dead counters */
+        if (stat->t_counts.t_truncated)
+        {
+            tabentry->n_live_tuples = 0;
+            tabentry->n_dead_tuples = 0;
+        }
+        tabentry->n_live_tuples += stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples += stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze += stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched += stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit += stat->t_counts.t_blocks_hit;
+    }
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
+
+    dshash_release_lock(tabhash, tabentry);
+
+    return true;
+}
+
+static void
+pgstat_update_dbentry(PgStat_StatDBEntry *dbentry, PgStat_TableStatus *stat)
+{
+    /*
+     * Add per-table stats to the per-database entry, too.
+     */
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
+    dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
+    dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
+    dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
+    dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
+    dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
+    dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
+    LWLockRelease(&dbentry->lock);
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 273e2f385f..4edd980ffc 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -34,6 +34,7 @@
 #include <unistd.h>
 
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/storage.h"
 #include "executor/instrument.h"
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index c2c445dbf4..0bb2132c71 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -41,9 +41,9 @@
 
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "executor/instrument.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/buffile.h"
 #include "storage/buf_internals.h"
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index 1f766d20d1..a0401ee494 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -22,10 +22,10 @@
 #include <unistd.h>
 #include <sys/stat.h>
 
+#include "bestatus.h"
 #include "storage/copydir.h"
 #include "storage/fd.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 
 /*
  * copydir: copy a directory
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 213de7698a..6bc5fd6089 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -82,6 +82,7 @@
 #include "miscadmin.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/pg_tablespace.h"
 #include "common/file_perm.h"
 #include "pgstat.h"
diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index 23ccc59f13..ceb4775b9f 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -197,6 +197,15 @@ dsm_postmaster_startup(PGShmemHeader *shim)
     dsm_control->maxitems = maxitems;
 }
 
+/*
+ * Reset the dsm_init_done state on child start.
+ */
+void
+dsm_child_init(void)
+{
+    dsm_init_done = false;
+}
+
 /*
  * Determine whether the control segment from the previous postmaster
  * invocation still exists.  If so, remove the dynamic shared memory
@@ -423,6 +432,15 @@ dsm_set_control_handle(dsm_handle h)
 }
 #endif
 
+/*
+ * Return true if the dsm feature is available in this process.
+ */
+bool
+dsm_is_available(void)
+{
+    return dsm_control != NULL;
+}
+
 /*
  * Create a new dynamic shared memory segment.
  *
@@ -440,8 +458,7 @@ dsm_create(Size size, int flags)
     uint32        i;
     uint32        nitems;
 
-    /* Unsafe in postmaster (and pointless in a stand-alone backend). */
-    Assert(IsUnderPostmaster);
+    Assert(dsm_is_available());
 
     if (!dsm_init_done)
         dsm_backend_startup();
@@ -537,8 +554,7 @@ dsm_attach(dsm_handle h)
     uint32        i;
     uint32        nitems;
 
-    /* Unsafe in postmaster (and pointless in a stand-alone backend). */
-    Assert(IsUnderPostmaster);
+    Assert(dsm_is_available());
 
     if (!dsm_init_done)
         dsm_backend_startup();
diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
index aeda32c9c5..e84275d4c2 100644
--- a/src/backend/storage/ipc/dsm_impl.c
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -61,8 +61,8 @@
 #ifdef HAVE_SYS_SHM_H
 #include <sys/shm.h>
 #endif
+#include "bestatus.h"
 #include "common/file_perm.h"
-#include "pgstat.h"
 
 #include "portability/mem.h"
 #include "storage/dsm_impl.h"
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 5965d3620f..8ec3b6fb8f 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -21,6 +21,7 @@
 #include "access/nbtree.h"
 #include "access/subtrans.h"
 #include "access/twophase.h"
+#include "bestatus.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -150,6 +151,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -281,8 +283,13 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 
     /* Initialize dynamic shared memory facilities. */
     if (!IsUnderPostmaster)
+    {
         dsm_postmaster_startup(shim);
 
+        /* Stats collector uses dynamic shared memory */
+        StatsShmemInit();
+    }
+
     /*
      * Now give loadable modules a chance to set up their shmem allocations
      */
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 7da337d11f..97526f1c72 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -43,8 +43,8 @@
 #include <poll.h>
 #endif
 
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "port/atomics.h"
 #include "portability/instr_time.h"
 #include "postmaster/postmaster.h"
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index cf93357997..e893984383 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -51,9 +51,9 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/spin.h"
diff --git a/src/backend/storage/ipc/shm_mq.c b/src/backend/storage/ipc/shm_mq.c
index 6e471c3e43..cfa5c9089f 100644
--- a/src/backend/storage/ipc/shm_mq.c
+++ b/src/backend/storage/ipc/shm_mq.c
@@ -18,8 +18,8 @@
 
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/bgworker.h"
 #include "storage/procsignal.h"
 #include "storage/shm_mq.h"
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 4d10e57a80..243da57c49 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -21,8 +21,8 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index 74eb449060..dd76088a29 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -25,6 +25,7 @@
  */
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "pgstat.h"
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 81dac45ae5..2cd4d5531e 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -76,8 +76,8 @@
  */
 #include "postgres.h"
 
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "pg_trace.h"
 #include "postmaster/postmaster.h"
 #include "replication/slot.h"
@@ -521,6 +521,9 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_TBM, "tbm");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
+    LWLockRegisterTranche(LWTRANCHE_STATS_DSA, "stats table dsa");
+    LWLockRegisterTranche(LWTRANCHE_STATS_DB, "db stats");
+    LWLockRegisterTranche(LWTRANCHE_STATS_FUNC_TABLE, "table/func stats");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index db47843229..97eccb35d3 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -49,3 +49,4 @@ MultiXactTruncationLock                41
 OldSnapshotTimeMapLock                42
 LogicalRepWorkerLock                43
 CLogTruncationLock                    44
+StatsLock                            45
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 6fc11f26f0..a8efa7cc5f 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -194,8 +194,8 @@
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
 #include "storage/predicate_internals.h"
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 0da5b19719..a60fd02894 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -38,8 +38,8 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "replication/slot.h"
 #include "replication/syncrep.h"
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 2aba2dfe91..9e9995ae50 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -28,7 +28,7 @@
 #include "miscadmin.h"
 #include "access/xlogutils.h"
 #include "access/xlog.h"
-#include "pgstat.h"
+#include "bestatus.h"
 #include "portability/instr_time.h"
 #include "postmaster/bgwriter.h"
 #include "storage/fd.h"
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8b4d94c9a1..884edc706a 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -39,6 +39,7 @@
 #include "access/parallel.h"
 #include "access/printtup.h"
 #include "access/xact.h"
+#include "bestatus.h"
 #include "catalog/pg_type.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
@@ -3159,6 +3160,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_flush_stat(true);
+    }
 }
 
 
@@ -3733,6 +3740,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4173,9 +4181,17 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
-                ProcessCompletedNotifies();
-                pgstat_report_stat(false);
+                long stats_timeout;
 
+                ProcessCompletedNotifies();
+
+                stats_timeout = pgstat_flush_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4210,7 +4226,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4218,6 +4234,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index d330a88e3c..c0975a8259 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -21,6 +21,7 @@
 
 #include "access/sysattr.h"
 #include "access/table.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/pg_tablespace.h"
 #include "catalog/pg_type.h"
@@ -29,7 +30,6 @@
 #include "common/keywords.h"
 #include "funcapi.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "parser/scansup.h"
 #include "postmaster/syslogger.h"
 #include "rewrite/rewriteHandler.h"
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index b6ba856ebe..ada7d9e973 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
@@ -33,7 +34,7 @@
 #define UINT32_ACCESS_ONCE(var)         ((uint32)(*((volatile uint32 *)&(var))))
 
 /* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
+extern PgStat_BgWriter bgwriterStats;
 
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
@@ -42,7 +43,7 @@ pg_stat_get_numscans(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatTabEntry *tabentry;
 
-    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+    if ((tabentry = backend_fetch_stat_tabentry(relid)) == NULL)
         result = 0;
     else
         result = (int64) (tabentry->numscans);
@@ -58,7 +59,7 @@ pg_stat_get_tuples_returned(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatTabEntry *tabentry;
 
-    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+    if ((tabentry = backend_fetch_stat_tabentry(relid)) == NULL)
         result = 0;
     else
         result = (int64) (tabentry->tuples_returned);
@@ -74,7 +75,7 @@ pg_stat_get_tuples_fetched(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatTabEntry *tabentry;
 
-    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+    if ((tabentry = backend_fetch_stat_tabentry(relid)) == NULL)
         result = 0;
     else
         result = (int64) (tabentry->tuples_fetched);
@@ -90,7 +91,7 @@ pg_stat_get_tuples_inserted(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatTabEntry *tabentry;
 
-    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+    if ((tabentry = backend_fetch_stat_tabentry(relid)) == NULL)
         result = 0;
     else
         result = (int64) (tabentry->tuples_inserted);
@@ -106,7 +107,7 @@ pg_stat_get_tuples_updated(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatTabEntry *tabentry;
 
-    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+    if ((tabentry = backend_fetch_stat_tabentry(relid)) == NULL)
         result = 0;
     else
         result = (int64) (tabentry->tuples_updated);
@@ -122,7 +123,7 @@ pg_stat_get_tuples_deleted(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatTabEntry *tabentry;
 
-    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+    if ((tabentry = backend_fetch_stat_tabentry(relid)) == NULL)
         result = 0;
     else
         result = (int64) (tabentry->tuples_deleted);
@@ -138,7 +139,7 @@ pg_stat_get_tuples_hot_updated(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatTabEntry *tabentry;
 
-    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+    if ((tabentry = backend_fetch_stat_tabentry(relid)) == NULL)
         result = 0;
     else
         result = (int64) (tabentry->tuples_hot_updated);
@@ -154,7 +155,7 @@ pg_stat_get_live_tuples(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatTabEntry *tabentry;
 
-    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+    if ((tabentry = backend_fetch_stat_tabentry(relid)) == NULL)
         result = 0;
     else
         result = (int64) (tabentry->n_live_tuples);
@@ -170,7 +171,7 @@ pg_stat_get_dead_tuples(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatTabEntry *tabentry;
 
-    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+    if ((tabentry = backend_fetch_stat_tabentry(relid)) == NULL)
         result = 0;
     else
         result = (int64) (tabentry->n_dead_tuples);
@@ -186,7 +187,7 @@ pg_stat_get_mod_since_analyze(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatTabEntry *tabentry;
 
-    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+    if ((tabentry = backend_fetch_stat_tabentry(relid)) == NULL)
         result = 0;
     else
         result = (int64) (tabentry->changes_since_analyze);
@@ -202,7 +203,7 @@ pg_stat_get_blocks_fetched(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatTabEntry *tabentry;
 
-    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+    if ((tabentry = backend_fetch_stat_tabentry(relid)) == NULL)
         result = 0;
     else
         result = (int64) (tabentry->blocks_fetched);
@@ -218,7 +219,7 @@ pg_stat_get_blocks_hit(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatTabEntry *tabentry;
 
-    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+    if ((tabentry = backend_fetch_stat_tabentry(relid)) == NULL)
         result = 0;
     else
         result = (int64) (tabentry->blocks_hit);
@@ -233,7 +234,7 @@ pg_stat_get_last_vacuum_time(PG_FUNCTION_ARGS)
     TimestampTz result;
     PgStat_StatTabEntry *tabentry;
 
-    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+    if ((tabentry = backend_fetch_stat_tabentry(relid)) == NULL)
         result = 0;
     else
         result = tabentry->vacuum_timestamp;
@@ -251,7 +252,7 @@ pg_stat_get_last_autovacuum_time(PG_FUNCTION_ARGS)
     TimestampTz result;
     PgStat_StatTabEntry *tabentry;
 
-    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+    if ((tabentry = backend_fetch_stat_tabentry(relid)) == NULL)
         result = 0;
     else
         result = tabentry->autovac_vacuum_timestamp;
@@ -269,7 +270,7 @@ pg_stat_get_last_analyze_time(PG_FUNCTION_ARGS)
     TimestampTz result;
     PgStat_StatTabEntry *tabentry;
 
-    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+    if ((tabentry = backend_fetch_stat_tabentry(relid)) == NULL)
         result = 0;
     else
         result = tabentry->analyze_timestamp;
@@ -287,7 +288,7 @@ pg_stat_get_last_autoanalyze_time(PG_FUNCTION_ARGS)
     TimestampTz result;
     PgStat_StatTabEntry *tabentry;
 
-    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+    if ((tabentry = backend_fetch_stat_tabentry(relid)) == NULL)
         result = 0;
     else
         result = tabentry->autovac_analyze_timestamp;
@@ -305,7 +306,7 @@ pg_stat_get_vacuum_count(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatTabEntry *tabentry;
 
-    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+    if ((tabentry = backend_fetch_stat_tabentry(relid)) == NULL)
         result = 0;
     else
         result = (int64) (tabentry->vacuum_count);
@@ -320,7 +321,7 @@ pg_stat_get_autovacuum_count(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatTabEntry *tabentry;
 
-    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+    if ((tabentry = backend_fetch_stat_tabentry(relid)) == NULL)
         result = 0;
     else
         result = (int64) (tabentry->autovac_vacuum_count);
@@ -335,7 +336,7 @@ pg_stat_get_analyze_count(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatTabEntry *tabentry;
 
-    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+    if ((tabentry = backend_fetch_stat_tabentry(relid)) == NULL)
         result = 0;
     else
         result = (int64) (tabentry->analyze_count);
@@ -350,7 +351,7 @@ pg_stat_get_autoanalyze_count(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatTabEntry *tabentry;
 
-    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+    if ((tabentry = backend_fetch_stat_tabentry(relid)) == NULL)
         result = 0;
     else
         result = (int64) (tabentry->autovac_analyze_count);
@@ -364,7 +365,7 @@ pg_stat_get_function_calls(PG_FUNCTION_ARGS)
     Oid            funcid = PG_GETARG_OID(0);
     PgStat_StatFuncEntry *funcentry;
 
-    if ((funcentry = pgstat_fetch_stat_funcentry(funcid)) == NULL)
+    if ((funcentry = backend_fetch_stat_funcentry(funcid)) == NULL)
         PG_RETURN_NULL();
     PG_RETURN_INT64(funcentry->f_numcalls);
 }
@@ -375,7 +376,7 @@ pg_stat_get_function_total_time(PG_FUNCTION_ARGS)
     Oid            funcid = PG_GETARG_OID(0);
     PgStat_StatFuncEntry *funcentry;
 
-    if ((funcentry = pgstat_fetch_stat_funcentry(funcid)) == NULL)
+    if ((funcentry = backend_fetch_stat_funcentry(funcid)) == NULL)
         PG_RETURN_NULL();
     /* convert counter from microsec to millisec for display */
     PG_RETURN_FLOAT8(((double) funcentry->f_total_time) / 1000.0);
@@ -387,7 +388,7 @@ pg_stat_get_function_self_time(PG_FUNCTION_ARGS)
     Oid            funcid = PG_GETARG_OID(0);
     PgStat_StatFuncEntry *funcentry;
 
-    if ((funcentry = pgstat_fetch_stat_funcentry(funcid)) == NULL)
+    if ((funcentry = backend_fetch_stat_funcentry(funcid)) == NULL)
         PG_RETURN_NULL();
     /* convert counter from microsec to millisec for display */
     PG_RETURN_FLOAT8(((double) funcentry->f_self_time) / 1000.0);
@@ -1193,7 +1194,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_xact_commit);
@@ -1209,7 +1210,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_xact_rollback);
@@ -1225,7 +1226,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_blocks_fetched);
@@ -1241,7 +1242,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_blocks_hit);
@@ -1257,7 +1258,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_returned);
@@ -1273,7 +1274,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_fetched);
@@ -1289,7 +1290,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_inserted);
@@ -1305,7 +1306,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_updated);
@@ -1321,7 +1322,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_tuples_deleted);
@@ -1336,7 +1337,7 @@ pg_stat_get_db_stat_reset_time(PG_FUNCTION_ARGS)
     TimestampTz result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = dbentry->stat_reset_timestamp;
@@ -1354,7 +1355,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = dbentry->n_temp_files;
@@ -1370,7 +1371,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = dbentry->n_temp_bytes;
@@ -1385,7 +1386,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_tablespace);
@@ -1400,7 +1401,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_lock);
@@ -1415,7 +1416,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_snapshot);
@@ -1430,7 +1431,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_bufferpin);
@@ -1445,7 +1446,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_conflict_startup_deadlock);
@@ -1460,7 +1461,7 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (
@@ -1480,7 +1481,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     int64        result;
     PgStat_StatDBEntry *dbentry;
 
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = (int64) (dbentry->n_deadlocks);
@@ -1496,7 +1497,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     PgStat_StatDBEntry *dbentry;
 
     /* convert counter from microsec to millisec for display */
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = ((double) dbentry->n_block_read_time) / 1000.0;
@@ -1512,7 +1513,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     PgStat_StatDBEntry *dbentry;
 
     /* convert counter from microsec to millisec for display */
-    if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
+    if ((dbentry = pgstat_fetch_stat_dbentry(dbid, false)) == NULL)
         result = 0;
     else
         result = ((double) dbentry->n_block_write_time) / 1000.0;
@@ -1523,69 +1524,69 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_bgwriter_timed_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->timed_checkpoints);
+    PG_RETURN_INT64(backend_fetch_global()->timed_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_requested_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->requested_checkpoints);
+    PG_RETURN_INT64(backend_fetch_global()->requested_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_checkpoints);
+    PG_RETURN_INT64(backend_fetch_global()->buf_written_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_clean);
+    PG_RETURN_INT64(backend_fetch_global()->buf_written_clean);
 }
 
 Datum
 pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->maxwritten_clean);
+    PG_RETURN_INT64(backend_fetch_global()->maxwritten_clean);
 }
 
 Datum
 pg_stat_get_checkpoint_write_time(PG_FUNCTION_ARGS)
 {
     /* time is already in msec, just convert to double for presentation */
-    PG_RETURN_FLOAT8((double) pgstat_fetch_global()->checkpoint_write_time);
+    PG_RETURN_FLOAT8((double) backend_fetch_global()->checkpoint_write_time);
 }
 
 Datum
 pg_stat_get_checkpoint_sync_time(PG_FUNCTION_ARGS)
 {
     /* time is already in msec, just convert to double for presentation */
-    PG_RETURN_FLOAT8((double) pgstat_fetch_global()->checkpoint_sync_time);
+    PG_RETURN_FLOAT8((double) backend_fetch_global()->checkpoint_sync_time);
 }
 
 Datum
 pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_global()->stat_reset_timestamp);
+    PG_RETURN_TIMESTAMPTZ(backend_fetch_global()->stat_reset_timestamp);
 }
 
 Datum
 pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_backend);
+    PG_RETURN_INT64(backend_fetch_global()->buf_written_backend);
 }
 
 Datum
 pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_fsync_backend);
+    PG_RETURN_INT64(backend_fetch_global()->buf_fsync_backend);
 }
 
 Datum
 pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_alloc);
+    PG_RETURN_INT64(backend_fetch_global()->buf_alloc);
 }
 
 Datum
@@ -1779,14 +1780,14 @@ pg_stat_get_xact_function_self_time(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_snapshot_timestamp(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_global()->stats_timestamp);
+    PG_RETURN_TIMESTAMPTZ(backend_fetch_global()->stats_timestamp);
 }
 
 /* Discard the active statistics snapshot */
 Datum
 pg_stat_clear_snapshot(PG_FUNCTION_ARGS)
 {
-    pgstat_clear_snapshot();
+    backend_clear_stats_snapshot();
 
     PG_RETURN_VOID();
 }
@@ -1865,7 +1866,10 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     BlessTupleDesc(tupdesc);
 
     /* Get statistics about the archiver process */
-    archiver_stats = pgstat_fetch_stat_archiver();
+    archiver_stats = backend_fetch_stat_archiver();
+
+    if (archiver_stats == NULL)
+        PG_RETURN_NULL();
 
     /* Fill values and NULLs */
     values[0] = Int64GetDatum(archiver_stats->archived_count);
@@ -1896,6 +1900,5 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
         values[6] = TimestampTzGetDatum(archiver_stats->stat_reset_timestamp);
 
     /* Returns the record as Datum */
-    PG_RETURN_DATUM(HeapTupleGetDatum(
-                                      heap_form_tuple(tupdesc, values, nulls)));
+    PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
 }
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 5e61d908fd..2dd99f935d 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -46,11 +46,11 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/pg_tablespace.h"
 #include "catalog/storage.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "storage/fd.h"
 #include "storage/lwlock.h"
 #include "utils/inval.h"
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index fd51934aaf..994351ac2d 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index bd2e4e89d8..1eabc0f41d 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -31,12 +31,12 @@
 #endif
 
 #include "access/htup_details.h"
+#include "bestatus.h"
 #include "catalog/pg_authid.h"
 #include "common/file_perm.h"
 #include "libpq/libpq.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/postmaster.h"
 #include "storage/fd.h"
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index a5ee209f91..e5dca7fe03 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -26,6 +26,7 @@
 #include "access/sysattr.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "bestatus.h"
 #include "catalog/catalog.h"
 #include "catalog/indexing.h"
 #include "catalog/namespace.h"
@@ -72,6 +73,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -628,6 +630,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -685,7 +689,10 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 
     /* Initialize stats collection --- must happen before first xact */
     if (!bootstrap)
+    {
+        pgstat_bearray_initialize();
         pgstat_initialize();
+    }
 
     /*
      * Load relcache entries for the shared system catalogs.  This must create
@@ -1238,6 +1245,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 156d147c85..62a07727d0 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -33,6 +33,7 @@
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "bestatus.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
 #include "commands/async.h"
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 33869fecc9..8939758c59 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/bestatus.h b/src/include/bestatus.h
new file mode 100644
index 0000000000..b7f6a93130
--- /dev/null
+++ b/src/include/bestatus.h
@@ -0,0 +1,555 @@
+/* ----------
+ *    bestatus.h
+ *
+ *    Definitions for the PostgreSQL backend status monitor facility
+ *
+ *    Copyright (c) 2001-2018, PostgreSQL Global Development Group
+ *
+ *    src/include/bestatus.h
+ * ----------
+ */
+#ifndef BESTATUS_H
+#define BESTATUS_H
+
+#include "datatype/timestamp.h"
+#include "libpq/pqcomm.h"
+#include "storage/proc.h"
+
+/* ----------
+ * Backend types
+ * ----------
+ */
+typedef enum BackendType
+{
+    B_AUTOVAC_LAUNCHER,
+    B_AUTOVAC_WORKER,
+    B_BACKEND,
+    B_BG_WORKER,
+    B_BG_WRITER,
+    B_CHECKPOINTER,
+    B_STARTUP,
+    B_WAL_RECEIVER,
+    B_WAL_SENDER,
+    B_WAL_WRITER,
+    B_ARCHIVER
+} BackendType;
+
+
+/* ----------
+ * Backend states
+ * ----------
+ */
+typedef enum BackendState
+{
+    STATE_UNDEFINED,
+    STATE_IDLE,
+    STATE_RUNNING,
+    STATE_IDLEINTRANSACTION,
+    STATE_FASTPATH,
+    STATE_IDLEINTRANSACTION_ABORTED,
+    STATE_DISABLED
+} BackendState;
+
+
+/* ----------
+ * Wait Classes
+ * ----------
+ */
+#define PG_WAIT_LWLOCK                0x01000000U
+#define PG_WAIT_LOCK                0x03000000U
+#define PG_WAIT_BUFFER_PIN            0x04000000U
+#define PG_WAIT_ACTIVITY            0x05000000U
+#define PG_WAIT_CLIENT                0x06000000U
+#define PG_WAIT_EXTENSION            0x07000000U
+#define PG_WAIT_IPC                    0x08000000U
+#define PG_WAIT_TIMEOUT                0x09000000U
+#define PG_WAIT_IO                    0x0A000000U
+
+/* ----------
+ * Wait Events - Activity
+ *
+ * Use this category when a process is waiting because it has no work to do,
+ * unless the "Client" or "Timeout" category describes the situation better.
+ * Typically, this should only be used for background processes.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_ARCHIVER_MAIN = PG_WAIT_ACTIVITY,
+    WAIT_EVENT_AUTOVACUUM_MAIN,
+    WAIT_EVENT_BGWRITER_HIBERNATE,
+    WAIT_EVENT_BGWRITER_MAIN,
+    WAIT_EVENT_CHECKPOINTER_MAIN,
+    WAIT_EVENT_LOGICAL_APPLY_MAIN,
+    WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
+    WAIT_EVENT_RECOVERY_WAL_ALL,
+    WAIT_EVENT_RECOVERY_WAL_STREAM,
+    WAIT_EVENT_SYSLOGGER_MAIN,
+    WAIT_EVENT_WAL_RECEIVER_MAIN,
+    WAIT_EVENT_WAL_SENDER_MAIN,
+    WAIT_EVENT_WAL_WRITER_MAIN
+} WaitEventActivity;
+
+/* ----------
+ * Wait Events - Client
+ *
+ * Use this category when a process is waiting to send data to or receive data
+ * from the frontend process to which it is connected.  This is never used for
+ * a background process, which has no client connection.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_CLIENT_READ = PG_WAIT_CLIENT,
+    WAIT_EVENT_CLIENT_WRITE,
+    WAIT_EVENT_LIBPQWALRECEIVER_CONNECT,
+    WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE,
+    WAIT_EVENT_SSL_OPEN_SERVER,
+    WAIT_EVENT_WAL_RECEIVER_WAIT_START,
+    WAIT_EVENT_WAL_SENDER_WAIT_WAL,
+    WAIT_EVENT_WAL_SENDER_WRITE_DATA
+} WaitEventClient;
+
+/* ----------
+ * Wait Events - IPC
+ *
+ * Use this category when a process cannot complete the work it is doing because
+ * it is waiting for a notification from another process.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_BGWORKER_SHUTDOWN = PG_WAIT_IPC,
+    WAIT_EVENT_BGWORKER_STARTUP,
+    WAIT_EVENT_BTREE_PAGE,
+    WAIT_EVENT_CLOG_GROUP_UPDATE,
+    WAIT_EVENT_EXECUTE_GATHER,
+    WAIT_EVENT_HASH_BATCH_ALLOCATING,
+    WAIT_EVENT_HASH_BATCH_ELECTING,
+    WAIT_EVENT_HASH_BATCH_LOADING,
+    WAIT_EVENT_HASH_BUILD_ALLOCATING,
+    WAIT_EVENT_HASH_BUILD_ELECTING,
+    WAIT_EVENT_HASH_BUILD_HASHING_INNER,
+    WAIT_EVENT_HASH_BUILD_HASHING_OUTER,
+    WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING,
+    WAIT_EVENT_HASH_GROW_BATCHES_DECIDING,
+    WAIT_EVENT_HASH_GROW_BATCHES_ELECTING,
+    WAIT_EVENT_HASH_GROW_BATCHES_FINISHING,
+    WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING,
+    WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING,
+    WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING,
+    WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING,
+    WAIT_EVENT_LOGICAL_SYNC_DATA,
+    WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
+    WAIT_EVENT_MQ_INTERNAL,
+    WAIT_EVENT_MQ_PUT_MESSAGE,
+    WAIT_EVENT_MQ_RECEIVE,
+    WAIT_EVENT_MQ_SEND,
+    WAIT_EVENT_PARALLEL_BITMAP_SCAN,
+    WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN,
+    WAIT_EVENT_PARALLEL_FINISH,
+    WAIT_EVENT_PROCARRAY_GROUP_UPDATE,
+    WAIT_EVENT_PROMOTE,
+    WAIT_EVENT_REPLICATION_ORIGIN_DROP,
+    WAIT_EVENT_REPLICATION_SLOT_DROP,
+    WAIT_EVENT_SAFE_SNAPSHOT,
+    WAIT_EVENT_SYNC_REP
+} WaitEventIPC;
+
+/* ----------
+ * Wait Events - Timeout
+ *
+ * Use this category when a process is waiting for a timeout to expire.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
+    WAIT_EVENT_PG_SLEEP,
+    WAIT_EVENT_RECOVERY_APPLY_DELAY
+} WaitEventTimeout;
+
+/* ----------
+ * Wait Events - IO
+ *
+ * Use this category when a process is waiting for an I/O operation to complete.
+ * ----------
+ */
+typedef enum
+{
+    WAIT_EVENT_BUFFILE_READ = PG_WAIT_IO,
+    WAIT_EVENT_BUFFILE_WRITE,
+    WAIT_EVENT_CONTROL_FILE_READ,
+    WAIT_EVENT_CONTROL_FILE_SYNC,
+    WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
+    WAIT_EVENT_CONTROL_FILE_WRITE,
+    WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE,
+    WAIT_EVENT_COPY_FILE_READ,
+    WAIT_EVENT_COPY_FILE_WRITE,
+    WAIT_EVENT_DATA_FILE_EXTEND,
+    WAIT_EVENT_DATA_FILE_FLUSH,
+    WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC,
+    WAIT_EVENT_DATA_FILE_PREFETCH,
+    WAIT_EVENT_DATA_FILE_READ,
+    WAIT_EVENT_DATA_FILE_SYNC,
+    WAIT_EVENT_DATA_FILE_TRUNCATE,
+    WAIT_EVENT_DATA_FILE_WRITE,
+    WAIT_EVENT_DSM_FILL_ZERO_WRITE,
+    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ,
+    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC,
+    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE,
+    WAIT_EVENT_LOCK_FILE_CREATE_READ,
+    WAIT_EVENT_LOCK_FILE_CREATE_SYNC,
+    WAIT_EVENT_LOCK_FILE_CREATE_WRITE,
+    WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ,
+    WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC,
+    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC,
+    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE,
+    WAIT_EVENT_LOGICAL_REWRITE_SYNC,
+    WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE,
+    WAIT_EVENT_LOGICAL_REWRITE_WRITE,
+    WAIT_EVENT_RELATION_MAP_READ,
+    WAIT_EVENT_RELATION_MAP_SYNC,
+    WAIT_EVENT_RELATION_MAP_WRITE,
+    WAIT_EVENT_REORDER_BUFFER_READ,
+    WAIT_EVENT_REORDER_BUFFER_WRITE,
+    WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ,
+    WAIT_EVENT_REPLICATION_SLOT_READ,
+    WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
+    WAIT_EVENT_REPLICATION_SLOT_SYNC,
+    WAIT_EVENT_REPLICATION_SLOT_WRITE,
+    WAIT_EVENT_SLRU_FLUSH_SYNC,
+    WAIT_EVENT_SLRU_READ,
+    WAIT_EVENT_SLRU_SYNC,
+    WAIT_EVENT_SLRU_WRITE,
+    WAIT_EVENT_SNAPBUILD_READ,
+    WAIT_EVENT_SNAPBUILD_SYNC,
+    WAIT_EVENT_SNAPBUILD_WRITE,
+    WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC,
+    WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE,
+    WAIT_EVENT_TIMELINE_HISTORY_READ,
+    WAIT_EVENT_TIMELINE_HISTORY_SYNC,
+    WAIT_EVENT_TIMELINE_HISTORY_WRITE,
+    WAIT_EVENT_TWOPHASE_FILE_READ,
+    WAIT_EVENT_TWOPHASE_FILE_SYNC,
+    WAIT_EVENT_TWOPHASE_FILE_WRITE,
+    WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ,
+    WAIT_EVENT_WAL_BOOTSTRAP_SYNC,
+    WAIT_EVENT_WAL_BOOTSTRAP_WRITE,
+    WAIT_EVENT_WAL_COPY_READ,
+    WAIT_EVENT_WAL_COPY_SYNC,
+    WAIT_EVENT_WAL_COPY_WRITE,
+    WAIT_EVENT_WAL_INIT_SYNC,
+    WAIT_EVENT_WAL_INIT_WRITE,
+    WAIT_EVENT_WAL_READ,
+    WAIT_EVENT_WAL_SYNC,
+    WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
+    WAIT_EVENT_WAL_WRITE
+} WaitEventIO;
+
+/* ----------
+ * Command type for progress reporting purposes
+ * ----------
+ */
+typedef enum ProgressCommandType
+{
+    PROGRESS_COMMAND_INVALID,
+    PROGRESS_COMMAND_VACUUM
+} ProgressCommandType;
+
+#define PGSTAT_NUM_PROGRESS_PARAM    10
+
+/* ----------
+ * Shared-memory data structures
+ * ----------
+ */
+
+
+/*
+ * PgBackendSSLStatus
+ *
+ * For each backend, we keep the SSL status in a separate struct that is
+ * only filled in if SSL is enabled.
+ *
+ * All char arrays must be null-terminated.
+ */
+typedef struct PgBackendSSLStatus
+{
+    /* Information about SSL connection */
+    int            ssl_bits;
+    bool        ssl_compression;
+    char        ssl_version[NAMEDATALEN];
+    char        ssl_cipher[NAMEDATALEN];
+    char        ssl_client_dn[NAMEDATALEN];
+
+    /*
+     * serial number is max "20 octets" per RFC 5280, so this size should be
+     * fine
+     */
+    char        ssl_client_serial[NAMEDATALEN];
+
+    char        ssl_issuer_dn[NAMEDATALEN];
+} PgBackendSSLStatus;
+
+
+/* ----------
+ * PgBackendStatus
+ *
+ * Each live backend maintains a PgBackendStatus struct in shared memory
+ * showing its current activity.  (The structs are allocated according to
+ * BackendId, but that is not critical.)  Note that the collector process
+ * has no involvement in, or even access to, these structs.
+ *
+ * Each auxiliary process also maintains a PgBackendStatus struct in shared
+ * memory.
+ * ----------
+ */
+typedef struct PgBackendStatus
+{
+    /*
+     * To avoid locking overhead, we use the following protocol: a backend
+     * increments st_changecount before modifying its entry, and again after
+     * finishing a modification.  A would-be reader should note the value of
+     * st_changecount, copy the entry into private memory, then check
+     * st_changecount again.  If the value hasn't changed, and if it's even,
+     * the copy is valid; otherwise start over.  This makes updates cheap
+     * while reads are potentially expensive, but that's the tradeoff we want.
+     *
+     * The above protocol needs memory barriers to ensure that the apparent
+     * order of execution matches the order written here. Otherwise, for
+     * example, on a machine with weak memory ordering the CPU might reorder
+     * the stores so that st_changecount is incremented twice before the
+     * modification, and a reader could then accept a half-updated entry.
+     */
+    int            st_changecount;
+
+    /* The entry is valid iff st_procpid > 0, unused if st_procpid == 0 */
+    int            st_procpid;
+
+    /* Type of this backend */
+    BackendType st_backendType;
+
+    /* Times when current backend, transaction, and activity started */
+    TimestampTz st_proc_start_timestamp;
+    TimestampTz st_xact_start_timestamp;
+    TimestampTz st_activity_start_timestamp;
+    TimestampTz st_state_start_timestamp;
+
+    /* Database OID, owning user's OID, connection client address */
+    Oid            st_databaseid;
+    Oid            st_userid;
+    SockAddr    st_clientaddr;
+    char       *st_clienthostname;    /* MUST be null-terminated */
+
+    /* Information about SSL connection */
+    bool        st_ssl;
+    PgBackendSSLStatus *st_sslstatus;
+
+    /* current state */
+    BackendState st_state;
+
+    /* application name; MUST be null-terminated */
+    char       *st_appname;
+
+    /*
+     * Current command string; MUST be null-terminated. Note that this string
+     * may be truncated in the middle of a multi-byte character. Because
+     * activity strings are stored more frequently than they are read, this
+     * moves the cost of correct truncation to the display side. Use
+     * pgstat_clip_activity() to truncate correctly.
+     */
+    char       *st_activity_raw;
+
+    /*
+     * Command progress reporting.  Any command which wishes can advertise
+     * that it is running by setting st_progress_command,
+     * st_progress_command_target, and st_progress_param[].
+     * st_progress_command_target should be the OID of the relation which the
+     * command targets (we assume there's just one, as this is meant for
+     * utility commands), but the meaning of each element in the
+     * st_progress_param array is command-specific.
+     */
+    ProgressCommandType st_progress_command;
+    Oid            st_progress_command_target;
+    int64        st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
+} PgBackendStatus;
+
+/*
+ * Macros to load and store st_changecount with the memory barriers.
+ *
+ * pgstat_increment_changecount_before() and
+ * pgstat_increment_changecount_after() need to be called before and after
+ * PgBackendStatus entries are modified, respectively. This makes sure that
+ * st_changecount is incremented around the modification.
+ *
+ * Also pgstat_save_changecount_before() and pgstat_save_changecount_after()
+ * need to be called before and after PgBackendStatus entries are copied into
+ * private memory, respectively.
+ */
+#define pgstat_increment_changecount_before(beentry)    \
+    do {    \
+        beentry->st_changecount++;    \
+        pg_write_barrier(); \
+    } while (0)
+
+#define pgstat_increment_changecount_after(beentry) \
+    do {    \
+        pg_write_barrier(); \
+        beentry->st_changecount++;    \
+        Assert((beentry->st_changecount & 1) == 0); \
+    } while (0)
+
+#define pgstat_save_changecount_before(beentry, save_changecount)    \
+    do {    \
+        save_changecount = beentry->st_changecount; \
+        pg_read_barrier();    \
+    } while (0)
+
+#define pgstat_save_changecount_after(beentry, save_changecount)    \
+    do {    \
+        pg_read_barrier();    \
+        save_changecount = beentry->st_changecount; \
+    } while (0)
+
+/* ----------
+ * LocalPgBackendStatus
+ *
+ * When we build the backend status array, we use LocalPgBackendStatus to be
+ * able to add new values to the struct when needed without adding new fields
+ * to the shared memory. It contains the backend status as a first member.
+ * ----------
+ */
+typedef struct LocalPgBackendStatus
+{
+    /*
+     * Local version of the backend status entry.
+     */
+    PgBackendStatus backendStatus;
+
+    /*
+     * The xid of the current transaction if available, InvalidTransactionId
+     * if not.
+     */
+    TransactionId backend_xid;
+
+    /*
+     * The xmin of the current session if available, InvalidTransactionId if
+     * not.
+     */
+    TransactionId backend_xmin;
+} LocalPgBackendStatus;
+
+/* ----------
+ * GUC parameters
+ * ----------
+ */
+extern bool pgstat_track_activities;
+extern PGDLLIMPORT int pgstat_track_activity_query_size;
+
+/* ----------
+ * Functions called from backends
+ * ----------
+ */
+extern void pgstat_bearray_initialize(void);
+extern void pgstat_bestart(void);
+
+extern const char *pgstat_get_wait_event(uint32 wait_event_info);
+extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
+extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
+extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
+                                    int buflen);
+extern const char *pgstat_get_backend_desc(BackendType backendType);
+
+extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
+                              Oid relid);
+extern void pgstat_progress_update_param(int index, int64 val);
+extern void pgstat_progress_update_multi_param(int nparam, const int *index,
+                                   const int64 *val);
+extern void pgstat_progress_end_command(void);
+
+extern char *pgstat_clip_activity(const char *raw_activity);
+
+extern void AtEOXact_BEStatus(bool isCommit);
+extern void AtPrepare_BEStatus(void);
+/* ----------
+ * pgstat_report_wait_start() -
+ *
+ *    Called from places where server process needs to wait.  This is called
+ *    to report wait event information.  The wait information is stored
+ *    as 4 bytes, where the first byte represents the wait event class (the
+ *    type of wait; see "Wait Classes") and the next 3 bytes represent the
+ *    actual wait event.  Currently 2 bytes are used for the wait event,
+ *    which is sufficient for current usage; 1 byte is reserved for future
+ *    usage.
+ *
+ * NB: this *must* be able to survive being called before MyProc has been
+ * initialized.
+ * ----------
+ */
+static inline void
+pgstat_report_wait_start(uint32 wait_event_info)
+{
+    volatile PGPROC *proc = MyProc;
+
+    if (!pgstat_track_activities || !proc)
+        return;
+
+    /*
+     * Since this is a four-byte field which is always read and written as
+     * four-bytes, updates are atomic.
+     */
+    proc->wait_event_info = wait_event_info;
+}
+
+/* ----------
+ * pgstat_report_wait_end() -
+ *
+ *    Called to report end of a wait.
+ *
+ * NB: this *must* be able to survive being called before MyProc has been
+ * initialized.
+ * ----------
+ */
+static inline void
+pgstat_report_wait_end(void)
+{
+    volatile PGPROC *proc = MyProc;
+
+    if (!pgstat_track_activities || !proc)
+        return;
+
+    /*
+     * Since this is a four-byte field which is always read and written as
+     * four-bytes, updates are atomic.
+     */
+    proc->wait_event_info = 0;
+}
+extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
+extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
+extern int    pgstat_fetch_stat_numbackends(void);
+
+/* For shared memory allocation/initialization */
+extern Size BackendStatusShmemSize(void);
+extern void CreateSharedBackendStatus(void);
+
+extern void pgstat_report_xact_timestamp(TimestampTz tstamp);
+extern void pgstat_bestat_initialize(void);
+
+extern void pgstat_report_activity(BackendState state, const char *cmd_str);
+extern void pgstat_report_appname(const char *appname);
+extern void pgstat_report_xact_timestamp(TimestampTz tstamp);
+extern const char *pgstat_get_wait_event(uint32 wait_event_info);
+extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
+extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
+extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
+                                    int buflen);
+extern const char *pgstat_get_backend_desc(BackendType backendType);
+
+extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
+                              Oid relid);
+extern void pgstat_progress_update_param(int index, int64 val);
+extern void pgstat_progress_update_multi_param(int nparam, const int *index,
+                                   const int64 *val);
+extern void pgstat_progress_end_command(void);
+
+#endif                            /* BESTATUS_H */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 63a7653457..49131a6d5b 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
@@ -403,7 +404,6 @@ typedef enum
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
-
     NUM_AUXPROCTYPES            /* Must be last! */
 } AuxProcType;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 471877d2df..90464769db 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL statistics collector facility.
  *
  *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
  *
@@ -13,11 +13,10 @@
 
 #include "datatype/timestamp.h"
 #include "fmgr.h"
-#include "libpq/pqcomm.h"
-#include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
-#include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -41,32 +40,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -115,13 +88,6 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_blocks_hit;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -180,236 +146,12 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
 /* ----------
- * PgStat_MsgHdr                The common message header
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgHdr
+typedef struct PgStat_BgWriter
 {
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
     PgStat_Counter m_timed_checkpoints;
     PgStat_Counter m_requested_checkpoints;
     PgStat_Counter m_buf_written_checkpoints;
@@ -420,40 +162,16 @@ typedef struct PgStat_MsgBgWriter
     PgStat_Counter m_buf_alloc;
     PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
     PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+} PgStat_BgWriter;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to apply.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when applying to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -485,81 +203,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgAutovacStart msg_autovacuum;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Statistics collector data structures in files and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -568,6 +213,12 @@ typedef union PgStat_Msg
 
 #define PGSTAT_FILE_FORMAT_ID    0x01A5BC9D
 
+typedef struct PgStat_DSHash
+{
+    int        refcnt;
+    dshash_table_handle handle;
+} PgStat_DSHash;
+
 /* ----------
  * PgStat_StatDBEntry            The collector's data per database
  * ----------
@@ -597,17 +248,22 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_block_write_time;
 
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;        /* time of db stats update */
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following members must be last in the struct, because we don't
+     * write them out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    int            generation;            /* current generation of the members below */
+    int            refcnt;                /* current gen reference count */
+    dshash_table_handle tables;        /* current gen tables hash */
+    dshash_table_handle functions;    /* current gen functions hash */
+    int            prev_refcnt;        /* prev gen reference count */
+    dshash_table_handle prev_tables;    /* prev gen tables hash */
+    dshash_table_handle prev_functions;    /* prev gen functions hash */
+    LWLock        lock;                /* lock for the above members */
 } PgStat_StatDBEntry;
 
-
 /* ----------
  * PgStat_StatTabEntry            The collector's data per table (or index)
  * ----------
@@ -645,7 +301,7 @@ typedef struct PgStat_StatTabEntry
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            Per-function stats data
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
@@ -659,8 +315,9 @@ typedef struct PgStat_StatFuncEntry
 } PgStat_StatFuncEntry;
 
 
+
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -676,7 +333,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -694,432 +351,6 @@ typedef struct PgStat_GlobalStats
     TimestampTz stat_reset_timestamp;
 } PgStat_GlobalStats;
 
-
-/* ----------
- * Backend types
- * ----------
- */
-typedef enum BackendType
-{
-    B_AUTOVAC_LAUNCHER,
-    B_AUTOVAC_WORKER,
-    B_BACKEND,
-    B_BG_WORKER,
-    B_BG_WRITER,
-    B_ARCHIVER,
-    B_CHECKPOINTER,
-    B_STARTUP,
-    B_WAL_RECEIVER,
-    B_WAL_SENDER,
-    B_WAL_WRITER
-} BackendType;
-
-
-/* ----------
- * Backend states
- * ----------
- */
-typedef enum BackendState
-{
-    STATE_UNDEFINED,
-    STATE_IDLE,
-    STATE_RUNNING,
-    STATE_IDLEINTRANSACTION,
-    STATE_FASTPATH,
-    STATE_IDLEINTRANSACTION_ABORTED,
-    STATE_DISABLED
-} BackendState;
-
-
-/* ----------
- * Wait Classes
- * ----------
- */
-#define PG_WAIT_LWLOCK                0x01000000U
-#define PG_WAIT_LOCK                0x03000000U
-#define PG_WAIT_BUFFER_PIN            0x04000000U
-#define PG_WAIT_ACTIVITY            0x05000000U
-#define PG_WAIT_CLIENT                0x06000000U
-#define PG_WAIT_EXTENSION            0x07000000U
-#define PG_WAIT_IPC                    0x08000000U
-#define PG_WAIT_TIMEOUT                0x09000000U
-#define PG_WAIT_IO                    0x0A000000U
-
-/* ----------
- * Wait Events - Activity
- *
- * Use this category when a process is waiting because it has no work to do,
- * unless the "Client" or "Timeout" category describes the situation better.
- * Typically, this should only be used for background processes.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_ARCHIVER_MAIN = PG_WAIT_ACTIVITY,
-    WAIT_EVENT_AUTOVACUUM_MAIN,
-    WAIT_EVENT_BGWRITER_HIBERNATE,
-    WAIT_EVENT_BGWRITER_MAIN,
-    WAIT_EVENT_CHECKPOINTER_MAIN,
-    WAIT_EVENT_LOGICAL_APPLY_MAIN,
-    WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
-    WAIT_EVENT_RECOVERY_WAL_ALL,
-    WAIT_EVENT_RECOVERY_WAL_STREAM,
-    WAIT_EVENT_SYSLOGGER_MAIN,
-    WAIT_EVENT_WAL_RECEIVER_MAIN,
-    WAIT_EVENT_WAL_SENDER_MAIN,
-    WAIT_EVENT_WAL_WRITER_MAIN
-} WaitEventActivity;
-
-/* ----------
- * Wait Events - Client
- *
- * Use this category when a process is waiting to send data to or receive data
- * from the frontend process to which it is connected.  This is never used for
- * a background process, which has no client connection.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_CLIENT_READ = PG_WAIT_CLIENT,
-    WAIT_EVENT_CLIENT_WRITE,
-    WAIT_EVENT_LIBPQWALRECEIVER_CONNECT,
-    WAIT_EVENT_LIBPQWALRECEIVER_RECEIVE,
-    WAIT_EVENT_SSL_OPEN_SERVER,
-    WAIT_EVENT_WAL_RECEIVER_WAIT_START,
-    WAIT_EVENT_WAL_SENDER_WAIT_WAL,
-    WAIT_EVENT_WAL_SENDER_WRITE_DATA
-} WaitEventClient;
-
-/* ----------
- * Wait Events - IPC
- *
- * Use this category when a process cannot complete the work it is doing because
- * it is waiting for a notification from another process.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_BGWORKER_SHUTDOWN = PG_WAIT_IPC,
-    WAIT_EVENT_BGWORKER_STARTUP,
-    WAIT_EVENT_BTREE_PAGE,
-    WAIT_EVENT_CLOG_GROUP_UPDATE,
-    WAIT_EVENT_EXECUTE_GATHER,
-    WAIT_EVENT_HASH_BATCH_ALLOCATING,
-    WAIT_EVENT_HASH_BATCH_ELECTING,
-    WAIT_EVENT_HASH_BATCH_LOADING,
-    WAIT_EVENT_HASH_BUILD_ALLOCATING,
-    WAIT_EVENT_HASH_BUILD_ELECTING,
-    WAIT_EVENT_HASH_BUILD_HASHING_INNER,
-    WAIT_EVENT_HASH_BUILD_HASHING_OUTER,
-    WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING,
-    WAIT_EVENT_HASH_GROW_BATCHES_DECIDING,
-    WAIT_EVENT_HASH_GROW_BATCHES_ELECTING,
-    WAIT_EVENT_HASH_GROW_BATCHES_FINISHING,
-    WAIT_EVENT_HASH_GROW_BATCHES_REPARTITIONING,
-    WAIT_EVENT_HASH_GROW_BUCKETS_ALLOCATING,
-    WAIT_EVENT_HASH_GROW_BUCKETS_ELECTING,
-    WAIT_EVENT_HASH_GROW_BUCKETS_REINSERTING,
-    WAIT_EVENT_LOGICAL_SYNC_DATA,
-    WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
-    WAIT_EVENT_MQ_INTERNAL,
-    WAIT_EVENT_MQ_PUT_MESSAGE,
-    WAIT_EVENT_MQ_RECEIVE,
-    WAIT_EVENT_MQ_SEND,
-    WAIT_EVENT_PARALLEL_BITMAP_SCAN,
-    WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN,
-    WAIT_EVENT_PARALLEL_FINISH,
-    WAIT_EVENT_PROCARRAY_GROUP_UPDATE,
-    WAIT_EVENT_PROMOTE,
-    WAIT_EVENT_REPLICATION_ORIGIN_DROP,
-    WAIT_EVENT_REPLICATION_SLOT_DROP,
-    WAIT_EVENT_SAFE_SNAPSHOT,
-    WAIT_EVENT_SYNC_REP
-} WaitEventIPC;
-
-/* ----------
- * Wait Events - Timeout
- *
- * Use this category when a process is waiting for a timeout to expire.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_BASE_BACKUP_THROTTLE = PG_WAIT_TIMEOUT,
-    WAIT_EVENT_PG_SLEEP,
-    WAIT_EVENT_RECOVERY_APPLY_DELAY
-} WaitEventTimeout;
-
-/* ----------
- * Wait Events - IO
- *
- * Use this category when a process is waiting for a IO.
- * ----------
- */
-typedef enum
-{
-    WAIT_EVENT_BUFFILE_READ = PG_WAIT_IO,
-    WAIT_EVENT_BUFFILE_WRITE,
-    WAIT_EVENT_CONTROL_FILE_READ,
-    WAIT_EVENT_CONTROL_FILE_SYNC,
-    WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
-    WAIT_EVENT_CONTROL_FILE_WRITE,
-    WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE,
-    WAIT_EVENT_COPY_FILE_READ,
-    WAIT_EVENT_COPY_FILE_WRITE,
-    WAIT_EVENT_DATA_FILE_EXTEND,
-    WAIT_EVENT_DATA_FILE_FLUSH,
-    WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC,
-    WAIT_EVENT_DATA_FILE_PREFETCH,
-    WAIT_EVENT_DATA_FILE_READ,
-    WAIT_EVENT_DATA_FILE_SYNC,
-    WAIT_EVENT_DATA_FILE_TRUNCATE,
-    WAIT_EVENT_DATA_FILE_WRITE,
-    WAIT_EVENT_DSM_FILL_ZERO_WRITE,
-    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_READ,
-    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_SYNC,
-    WAIT_EVENT_LOCK_FILE_ADDTODATADIR_WRITE,
-    WAIT_EVENT_LOCK_FILE_CREATE_READ,
-    WAIT_EVENT_LOCK_FILE_CREATE_SYNC,
-    WAIT_EVENT_LOCK_FILE_CREATE_WRITE,
-    WAIT_EVENT_LOCK_FILE_RECHECKDATADIR_READ,
-    WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC,
-    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC,
-    WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE,
-    WAIT_EVENT_LOGICAL_REWRITE_SYNC,
-    WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE,
-    WAIT_EVENT_LOGICAL_REWRITE_WRITE,
-    WAIT_EVENT_RELATION_MAP_READ,
-    WAIT_EVENT_RELATION_MAP_SYNC,
-    WAIT_EVENT_RELATION_MAP_WRITE,
-    WAIT_EVENT_REORDER_BUFFER_READ,
-    WAIT_EVENT_REORDER_BUFFER_WRITE,
-    WAIT_EVENT_REORDER_LOGICAL_MAPPING_READ,
-    WAIT_EVENT_REPLICATION_SLOT_READ,
-    WAIT_EVENT_REPLICATION_SLOT_RESTORE_SYNC,
-    WAIT_EVENT_REPLICATION_SLOT_SYNC,
-    WAIT_EVENT_REPLICATION_SLOT_WRITE,
-    WAIT_EVENT_SLRU_FLUSH_SYNC,
-    WAIT_EVENT_SLRU_READ,
-    WAIT_EVENT_SLRU_SYNC,
-    WAIT_EVENT_SLRU_WRITE,
-    WAIT_EVENT_SNAPBUILD_READ,
-    WAIT_EVENT_SNAPBUILD_SYNC,
-    WAIT_EVENT_SNAPBUILD_WRITE,
-    WAIT_EVENT_TIMELINE_HISTORY_FILE_SYNC,
-    WAIT_EVENT_TIMELINE_HISTORY_FILE_WRITE,
-    WAIT_EVENT_TIMELINE_HISTORY_READ,
-    WAIT_EVENT_TIMELINE_HISTORY_SYNC,
-    WAIT_EVENT_TIMELINE_HISTORY_WRITE,
-    WAIT_EVENT_TWOPHASE_FILE_READ,
-    WAIT_EVENT_TWOPHASE_FILE_SYNC,
-    WAIT_EVENT_TWOPHASE_FILE_WRITE,
-    WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ,
-    WAIT_EVENT_WAL_BOOTSTRAP_SYNC,
-    WAIT_EVENT_WAL_BOOTSTRAP_WRITE,
-    WAIT_EVENT_WAL_COPY_READ,
-    WAIT_EVENT_WAL_COPY_SYNC,
-    WAIT_EVENT_WAL_COPY_WRITE,
-    WAIT_EVENT_WAL_INIT_SYNC,
-    WAIT_EVENT_WAL_INIT_WRITE,
-    WAIT_EVENT_WAL_READ,
-    WAIT_EVENT_WAL_SYNC,
-    WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN,
-    WAIT_EVENT_WAL_WRITE
-} WaitEventIO;
-
-/* ----------
- * Command type for progress reporting purposes
- * ----------
- */
-typedef enum ProgressCommandType
-{
-    PROGRESS_COMMAND_INVALID,
-    PROGRESS_COMMAND_VACUUM
-} ProgressCommandType;
-
-#define PGSTAT_NUM_PROGRESS_PARAM    10
-
-/* ----------
- * Shared-memory data structures
- * ----------
- */
-
-
-/*
- * PgBackendSSLStatus
- *
- * For each backend, we keep the SSL status in a separate struct, that
- * is only filled in if SSL is enabled.
- *
- * All char arrays must be null-terminated.
- */
-typedef struct PgBackendSSLStatus
-{
-    /* Information about SSL connection */
-    int            ssl_bits;
-    bool        ssl_compression;
-    char        ssl_version[NAMEDATALEN];
-    char        ssl_cipher[NAMEDATALEN];
-    char        ssl_client_dn[NAMEDATALEN];
-
-    /*
-     * serial number is max "20 octets" per RFC 5280, so this size should be
-     * fine
-     */
-    char        ssl_client_serial[NAMEDATALEN];
-
-    char        ssl_issuer_dn[NAMEDATALEN];
-} PgBackendSSLStatus;
-
-
-/* ----------
- * PgBackendStatus
- *
- * Each live backend maintains a PgBackendStatus struct in shared memory
- * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
- * has no involvement in, or even access to, these structs.
- *
- * Each auxiliary process also maintains a PgBackendStatus struct in shared
- * memory.
- * ----------
- */
-typedef struct PgBackendStatus
-{
-    /*
-     * To avoid locking overhead, we use the following protocol: a backend
-     * increments st_changecount before modifying its entry, and again after
-     * finishing a modification.  A would-be reader should note the value of
-     * st_changecount, copy the entry into private memory, then check
-     * st_changecount again.  If the value hasn't changed, and if it's even,
-     * the copy is valid; otherwise start over.  This makes updates cheap
-     * while reads are potentially expensive, but that's the tradeoff we want.
-     *
-     * The above protocol needs the memory barriers to ensure that the
-     * apparent order of execution is as it desires. Otherwise, for example,
-     * the CPU might rearrange the code so that st_changecount is incremented
-     * twice before the modification on a machine with weak memory ordering.
-     * This surprising result can lead to bugs.
-     */
-    int            st_changecount;
-
-    /* The entry is valid iff st_procpid > 0, unused if st_procpid == 0 */
-    int            st_procpid;
-
-    /* Type of backends */
-    BackendType st_backendType;
-
-    /* Times when current backend, transaction, and activity started */
-    TimestampTz st_proc_start_timestamp;
-    TimestampTz st_xact_start_timestamp;
-    TimestampTz st_activity_start_timestamp;
-    TimestampTz st_state_start_timestamp;
-
-    /* Database OID, owning user's OID, connection client address */
-    Oid            st_databaseid;
-    Oid            st_userid;
-    SockAddr    st_clientaddr;
-    char       *st_clienthostname;    /* MUST be null-terminated */
-
-    /* Information about SSL connection */
-    bool        st_ssl;
-    PgBackendSSLStatus *st_sslstatus;
-
-    /* current state */
-    BackendState st_state;
-
-    /* application name; MUST be null-terminated */
-    char       *st_appname;
-
-    /*
-     * Current command string; MUST be null-terminated. Note that this string
-     * possibly is truncated in the middle of a multi-byte character. As
-     * activity strings are stored more frequently than read, that allows to
-     * move the cost of correct truncation to the display side. Use
-     * pgstat_clip_activity() to truncate correctly.
-     */
-    char       *st_activity_raw;
-
-    /*
-     * Command progress reporting.  Any command which wishes can advertise
-     * that it is running by setting st_progress_command,
-     * st_progress_command_target, and st_progress_param[].
-     * st_progress_command_target should be the OID of the relation which the
-     * command targets (we assume there's just one, as this is meant for
-     * utility commands), but the meaning of each element in the
-     * st_progress_param array is command-specific.
-     */
-    ProgressCommandType st_progress_command;
-    Oid            st_progress_command_target;
-    int64        st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];
-} PgBackendStatus;
-
-/*
- * Macros to load and store st_changecount with the memory barriers.
- *
- * pgstat_increment_changecount_before() and
- * pgstat_increment_changecount_after() need to be called before and after
- * PgBackendStatus entries are modified, respectively. This makes sure that
- * st_changecount is incremented around the modification.
- *
- * Also pgstat_save_changecount_before() and pgstat_save_changecount_after()
- * need to be called before and after PgBackendStatus entries are copied into
- * private memory, respectively.
- */
-#define pgstat_increment_changecount_before(beentry)    \
-    do {    \
-        beentry->st_changecount++;    \
-        pg_write_barrier(); \
-    } while (0)
-
-#define pgstat_increment_changecount_after(beentry) \
-    do {    \
-        pg_write_barrier(); \
-        beentry->st_changecount++;    \
-        Assert((beentry->st_changecount & 1) == 0); \
-    } while (0)
-
-#define pgstat_save_changecount_before(beentry, save_changecount)    \
-    do {    \
-        save_changecount = beentry->st_changecount; \
-        pg_read_barrier();    \
-    } while (0)
-
-#define pgstat_save_changecount_after(beentry, save_changecount)    \
-    do {    \
-        pg_read_barrier();    \
-        save_changecount = beentry->st_changecount; \
-    } while (0)
-
-/* ----------
- * LocalPgBackendStatus
- *
- * When we build the backend status array, we use LocalPgBackendStatus to be
- * able to add new values to the struct when needed without adding new fields
- * to the shared memory. It contains the backend status as a first member.
- * ----------
- */
-typedef struct LocalPgBackendStatus
-{
-    /*
-     * Local version of the backend status entry.
-     */
-    PgBackendStatus backendStatus;
-
-    /*
-     * The xid of the current transaction if available, InvalidTransactionId
-     * if not.
-     */
-    TransactionId backend_xid;
-
-    /*
-     * The xmin of the current session if available, InvalidTransactionId if
-     * not.
-     */
-    TransactionId backend_xmin;
-} LocalPgBackendStatus;
-
 /*
  * Working state needed to accumulate per-function-call timing statistics.
  */
@@ -1141,18 +372,18 @@ typedef struct PgStat_FunctionCallUsage
  * GUC parameters
  * ----------
  */
-extern bool pgstat_track_activities;
 extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
-extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used, but will be removed with GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1164,34 +395,28 @@ extern PgStat_Counter pgStatBlockWriteTime;
  * Functions called from postmaster
  * ----------
  */
-extern Size BackendStatusShmemSize(void);
-extern void CreateSharedBackendStatus(void);
+extern void pgstat_initialize(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
 
+/* File input/output functions  */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_flush_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
-extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
-extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
-
+extern void pgstat_reset_shared_counters(const char *target);
+extern void pgstat_reset_single_counter(Oid objectid,
+                                        PgStat_Single_Reset_Type type);
 extern void pgstat_report_autovac(Oid dboid);
 extern void pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples);
@@ -1201,89 +426,13 @@ extern void pgstat_report_analyze(Relation rel,
 
 extern void pgstat_report_recovery_conflict(int reason);
 extern void pgstat_report_deadlock(void);
-
-extern void pgstat_initialize(void);
-extern void pgstat_bestart(void);
-
-extern void pgstat_report_activity(BackendState state, const char *cmd_str);
 extern void pgstat_report_tempfile(size_t filesize);
-extern void pgstat_report_appname(const char *appname);
-extern void pgstat_report_xact_timestamp(TimestampTz tstamp);
-extern const char *pgstat_get_wait_event(uint32 wait_event_info);
-extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
-extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
-extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
-                                    int buflen);
-extern const char *pgstat_get_backend_desc(BackendType backendType);
-
-extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
-                              Oid relid);
-extern void pgstat_progress_update_param(int index, int64 val);
-extern void pgstat_progress_update_multi_param(int nparam, const int *index,
-                                   const int64 *val);
-extern void pgstat_progress_end_command(void);
 
 extern PgStat_TableStatus *find_tabstat_entry(Oid rel_id);
 extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 
 extern void pgstat_initstats(Relation rel);
 
-extern char *pgstat_clip_activity(const char *raw_activity);
-
-/* ----------
- * pgstat_report_wait_start() -
- *
- *    Called from places where server process needs to wait.  This is called
- *    to report wait event information.  The wait information is stored
- *    as 4-bytes where first byte represents the wait event class (type of
- *    wait, for different types of wait, refer WaitClass) and the next
- *    3-bytes represent the actual wait event.  Currently 2-bytes are used
- *    for wait event which is sufficient for current usage, 1-byte is
- *    reserved for future usage.
- *
- * NB: this *must* be able to survive being called before MyProc has been
- * initialized.
- * ----------
- */
-static inline void
-pgstat_report_wait_start(uint32 wait_event_info)
-{
-    volatile PGPROC *proc = MyProc;
-
-    if (!pgstat_track_activities || !proc)
-        return;
-
-    /*
-     * Since this is a four-byte field which is always read and written as
-     * four-bytes, updates are atomic.
-     */
-    proc->wait_event_info = wait_event_info;
-}
-
-/* ----------
- * pgstat_report_wait_end() -
- *
- *    Called to report end of a wait.
- *
- * NB: this *must* be able to survive being called before MyProc has been
- * initialized.
- * ----------
- */
-static inline void
-pgstat_report_wait_end(void)
-{
-    volatile PGPROC *proc = MyProc;
-
-    if (!pgstat_track_activities || !proc)
-        return;
-
-    /*
-     * Since this is a four-byte field which is always read and written as
-     * four-bytes, updates are atomic.
-     */
-    proc->wait_event_info = 0;
-}
-
 /* nontransactional event counts are simple enough to inline */
 
 #define pgstat_count_heap_scan(rel)                                    \
@@ -1348,21 +497,22 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                           void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
+extern void pgstat_update_archiver(const char *xlog, bool failed);
+extern void pgstat_update_bgwriter(void);
 
+
+
+extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid relid, bool oneshot);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(
+    PgStat_StatDBEntry *dbent, Oid reloid, bool oneshot);
 /* ----------
  * Support functions for the SQL-callable functions to
  * generate the pgstat* views.
  * ----------
  */
-extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
-extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
-extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
-extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
-extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
-extern int    pgstat_fetch_stat_numbackends(void);
-extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
-extern PgStat_GlobalStats *pgstat_fetch_global(void);
-
+extern void backend_clear_stats_snapshot(void);
+extern PgStat_StatTabEntry *backend_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatFuncEntry *backend_fetch_stat_funcentry(Oid funcid);
+extern PgStat_ArchiverStats *backend_fetch_stat_archiver(void);
+extern PgStat_GlobalStats *backend_fetch_global(void);
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/dsm.h b/src/include/storage/dsm.h
index 7c44f4a6e7..c37ec33e9b 100644
--- a/src/include/storage/dsm.h
+++ b/src/include/storage/dsm.h
@@ -26,6 +26,7 @@ typedef struct dsm_segment dsm_segment;
 struct PGShmemHeader;            /* avoid including pg_shmem.h */
 extern void dsm_cleanup_using_control_segment(dsm_handle old_control_handle);
 extern void dsm_postmaster_startup(struct PGShmemHeader *);
+extern void dsm_child_init(void);
 extern void dsm_backend_shutdown(void);
 extern void dsm_detach_all(void);
 
@@ -33,6 +34,8 @@ extern void dsm_detach_all(void);
 extern void dsm_set_control_handle(dsm_handle h);
 #endif
 
+extern bool dsm_is_available(void);
+
 /* Functions that create or remove mappings. */
 extern dsm_segment *dsm_create(Size size, int flags);
 extern dsm_segment *dsm_attach(dsm_handle h);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 96c7732006..daa269f816 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -219,6 +219,9 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SHARED_TUPLESTORE,
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
+    LWTRANCHE_STATS_DSA,
+    LWTRANCHE_STATS_DB,
+    LWTRANCHE_STATS_FUNC_TABLE,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 9244a2a7b7..a9b625211b 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
diff --git a/src/test/modules/worker_spi/worker_spi.c b/src/test/modules/worker_spi/worker_spi.c
index c1878dd694..7391e05f37 100644
--- a/src/test/modules/worker_spi/worker_spi.c
+++ b/src/test/modules/worker_spi/worker_spi.c
@@ -290,7 +290,7 @@ worker_spi_main(Datum main_arg)
         SPI_finish();
         PopActiveSnapshot();
         CommitTransactionCommand();
-        pgstat_report_stat(false);
+        pgstat_update_stat(false);
         pgstat_report_activity(STATE_IDLE, NULL);
     }
 
-- 
2.16.3

From 309306301de996b86f8a60d484aa8836e709a61d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 27 Nov 2018 14:42:12 +0900
Subject: [PATCH 5/5] Remove the GUC stats_temp_directory

This GUC used to specify the directory in which temporary statistics
files are stored. It is no longer needed by the stats collector but is
still used by the programs in bin and contrib, and maybe by other
extensions. Thus this patch removes the GUC, but some backing variables
and macro definitions are left alone for backward compatibility.
---
 doc/src/sgml/backup.sgml                      |  2 --
 doc/src/sgml/config.sgml                      | 19 -------------
 doc/src/sgml/monitoring.sgml                  |  7 +----
 doc/src/sgml/storage.sgml                     |  3 +-
 src/backend/replication/basebackup.c          | 13 ++-------
 src/backend/statmon/pgstat.c                  | 13 ++++-----
 src/backend/utils/misc/guc.c                  | 41 ---------------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/pgstat.h                          |  5 +++-
 9 files changed, 14 insertions(+), 90 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index a73fd4d044..95285809c2 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1119,8 +1119,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8bd57f376b..79f704cc99 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6709,25 +6709,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 5c1408bdf5..b538799ff6 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -197,12 +197,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
+   <productname>PostgreSQL</productname> processes through shared memory.
    When the server shuts down cleanly, a permanent copy of the statistics
    data is stored in the <filename>pg_stat</filename> subdirectory, so that
    statistics can be retained across server restarts.  When recovery is
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index cbdad0c3fb..133eb3ff19 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -122,8 +122,7 @@ Item
 
 <row>
  <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
+ <entry>Subdirectory containing ephemeral files for extensions</entry>
 </row>
 
 <row>
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index e30b2dbcf0..a567aacf73 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -231,11 +231,8 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -266,13 +263,9 @@ perform_base_backup(basebackup_options *opt)
          * Calculate the relative path of temporary statistics directory in
          * order to skip the files which are located in that directory later.
          */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
+
+        Assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
+        statrelpath = psprintf("./%s", PG_STAT_TMP_DIR);
 
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
diff --git a/src/backend/statmon/pgstat.c b/src/backend/statmon/pgstat.c
index 3849c6ec05..3108cf9c5e 100644
--- a/src/backend/statmon/pgstat.c
+++ b/src/backend/statmon/pgstat.c
@@ -89,15 +89,12 @@ typedef enum PgStat_TableLookupState
 bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 
-/* ----------
- * Built from GUC parameter
- * ----------
+/*
+ * This used to be a GUC variable and is no longer used in this file. It is
+ * kept, fixed at the default value, only for backward compatibility with
+ * extensions.
  */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+char       *pgstat_stat_directory = PG_STAT_TMP_DIR;
 
 /* Shared stats bootstrap information */
 typedef struct StatsShmemStruct {
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 62a07727d0..49123204c5 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -192,7 +192,6 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static void assign_effective_io_concurrency(int newval, void *extra);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -3989,17 +3988,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11003,35 +10991,6 @@ assign_effective_io_concurrency(int newval, void *extra)
 #endif                            /* USE_PREFETCH */
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 194f312096..fdb088dbfd 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -554,7 +554,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 90464769db..f0804013db 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -29,7 +29,10 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
+/*
+ * This used to be the directory to store temporary statistics data in but is
+ * no longer used. Defined here for backward compatibility.
+ */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
 
 /* Values for track_functions GUC variable --- order is significant! */
-- 
2.16.3


Re: shared-memory based stats collector

From
Kyotaro HORIGUCHI
Date:
At Fri, 15 Feb 2019 15:53:28 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
<20190215185328.GA29663@alvherre.pgsql>
> On 2019-Feb-15, Andres Freund wrote:
> 
> > It's fine to do such renames, just do them as separate patches. It's
> > hard enough to review changes this big...
> 
> Talk about moving the whole file to another subdir ...

Sounds reasonable. It was a separate patch at first, but it has
since been merged in and now bloats the patch. I'm going to revert
the separation/moving.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: shared-memory based stats collector

From
Kyotaro HORIGUCHI
Date:
Hello.

At Wed, 20 Feb 2019 15:45:17 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20190220.154517.24528798.horiguchi.kyotaro@lab.ntt.co.jp>
> At Fri, 15 Feb 2019 15:53:28 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
<20190215185328.GA29663@alvherre.pgsql>
> > Talk about moving the whole file to another subdir ...
> 
> Sounds reasonable. It was a separate patch at first, but it has
> since been merged in and now bloats the patch. I'm going to revert
> the separation/moving.

Done. In this version 16 it is as if the moving and splitting had
never happened. Major changes are:

 - Restored old pgstats_* names. This shrinks the patch to less
   than half the lines of v15.  Beyond that, the differences are
   easier to examine. (checkpointer.c and bgwriter.c have slightly
   stale comments, but that is an issue for later.)

 - Removed the "oneshot" feature entirely. This simplifies the
   pgstat API and makes this patch far simpler.

 - Moved StatsLock to LWTRANCHE_STATS; it does not need to be in
   the main tranche.

 - Fixed several bugs revealed by the reduced size of the patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 5e52ba57f3708c3411fff836473e8c1a572d063e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:41:04 +0900
Subject: [PATCH 1/6] sequential scan for dshash

Add sequential scan feature to dshash.
---
 src/backend/lib/dshash.c | 188 ++++++++++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  23 +++++-
 2 files changed, 206 insertions(+), 5 deletions(-)
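
A note on the locking in the non-consistent scan added below: the scan
walks buckets in index order, and when the next bucket belongs to a
different partition it takes the lock on the next partition before
releasing the current one. The following is only a standalone sketch of
that lock traffic, not code from the patch. The macros repeat the
arithmetic used in dshash.c; the value 7 for DSHASH_NUM_PARTITIONS_LOG2
is an assumption here, and scan_trace and simulate_seq_scan() are
hypothetical names made up for the illustration. No real LWLocks are
involved, only a counter of how many partition locks would be held.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: dshash.c uses 2^7 partitions. */
#define DSHASH_NUM_PARTITIONS_LOG2 7

/* Same arithmetic as the macros in the patch. */
#define NUM_SPLITS(size_log2) ((size_log2) - DSHASH_NUM_PARTITIONS_LOG2)
#define NUM_BUCKETS(size_log2) (((size_t) 1) << (size_log2))
#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2) \
    ((bucket_idx) >> NUM_SPLITS(size_log2))

/* Hypothetical record of what the simulated scan observed. */
typedef struct scan_trace
{
    size_t      handovers;      /* times the lock moved to a new partition */
    int         max_held;       /* most partition locks held at once */
    int         ordered;        /* 1 if partitions were locked in ascending order */
    int         held_at_end;    /* locks still held after the scan ended */
} scan_trace;

/*
 * Simulate the lock traffic of a non-consistent sequential scan over a
 * table with 2^size_log2 buckets (size_log2 must be >= 7).  Only the
 * number of held locks is tracked.
 */
static scan_trace
simulate_seq_scan(size_t size_log2)
{
    scan_trace  t = {0, 0, 1, 0};
    size_t      nbuckets = NUM_BUCKETS(size_log2);
    size_t      cur = PARTITION_FOR_BUCKET_INDEX((size_t) 0, size_log2);
    size_t      bucket;
    int         held = 1;       /* lock on the partition of bucket 0 */

    t.max_held = held;
    for (bucket = 1; bucket < nbuckets; bucket++)
    {
        size_t      next = PARTITION_FOR_BUCKET_INDEX(bucket, size_log2);

        if (next != cur)
        {
            if (next < cur)
                t.ordered = 0;
            held++;             /* acquire the next partition first ... */
            if (held > t.max_held)
                t.max_held = held;
            held--;             /* ... then release the current one */
            cur = next;
            t.handovers++;
        }
    }
    held--;                     /* end of scan releases the last partition */
    t.held_at_end = held;
    return t;
}
```

Under these assumptions the scan never holds more than two partition
locks, always holds at least one until it ends, and takes them in
ascending partition order, which is what the deadlock argument in the
comment in dshash_seq_next relies on.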

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index f095196fb6..1e8c22f94f 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -112,6 +112,7 @@ struct dshash_table
     size_t        size_log2;        /* log2(number of buckets) */
     bool        find_locked;    /* Is any partition lock held by 'find'? */
     bool        find_exclusively_locked;    /* ... exclusively? */
+    bool        seqscan_running;/* now under sequential scan */
 };
 
 /* Given a pointer to an item, find the entry (user data) it holds. */
@@ -127,6 +128,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +158,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -228,6 +237,7 @@ dshash_create(dsa_area *area, const dshash_parameters *params, void *arg)
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
 
     /*
      * Set up the initial array of buckets.  Our initial size is the same as
@@ -279,6 +289,7 @@ dshash_attach(dsa_area *area, const dshash_parameters *params,
     hash_table->control = dsa_get_address(area, control);
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
     Assert(hash_table->control->magic == DSHASH_MAGIC);
 
     /*
@@ -324,7 +335,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
                                 LW_EXCLUSIVE));
 
     delete_item(hash_table, item);
-    hash_table->find_locked = false;
-    hash_table->find_exclusively_locked = false;
-    LWLockRelease(PARTITION_LOCK(hash_table, partition));
+
+    /* Keep the partition lock during a sequential scan */
+    if (!hash_table->seqscan_running)
+    {
+        hash_table->find_locked = false;
+        hash_table->find_exclusively_locked = false;
+        LWLockRelease(PARTITION_LOCK(hash_table, partition));
+    }
 }
 
 /*
@@ -568,6 +584,8 @@ dshash_release_lock(dshash_table *hash_table, void *entry)
     Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition_index),
                                 hash_table->find_exclusively_locked
                                 ? LW_EXCLUSIVE : LW_SHARED));
+    /* lock is under control of sequential scan */
+    Assert(!hash_table->seqscan_running);
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
@@ -592,6 +610,168 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should be called if and only if the scan is abandoned
+ * before completion; if dshash_seq_next returns NULL then it has already done
+ * the end-of-scan cleanup.
+ *
+ * On returning element, it is locked as is the case with dshash_find.
+ * However, the caller must not release the lock. The lock is released as
+ * necessary in continued scan.
+ *
+ * As opposed to the equivalent for dynahash, the caller is not supposed to
+ * delete the returned element before continuing the scan.
+ *
+ * If consistent is set for dshash_seq_init, the whole hash table is
+ * non-exclusively locked. Otherwise a part of the hash table is locked in the
+ * same mode (partition lock).
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool consistent, bool exclusive)
+{
+    /* allowed at most one scan at once */
+    Assert(!hash_table->seqscan_running);
+
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->consistent = consistent;
+    status->exclusive = exclusive;
+    hash_table->seqscan_running = true;
+
+    /*
+     * Protect all partitions from modification if the caller wants a
+     * consistent result.
+     */
+    if (consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+        {
+            Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i)));
+
+            LWLockAcquire(PARTITION_LOCK(hash_table, i),
+                          exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        }
+        ensure_valid_bucket_pointers(hash_table);
+    }
+}
+
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    Assert(status->hash_table->seqscan_running);
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert (status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        if (!status->consistent)
+        {
+            partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+            LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            status->curpartition = partition;
+
+            /* resize doesn't happen from now until seq scan ends */
+            status->nbuckets =
+                NUM_BUCKETS(status->hash_table->control->size_log2);
+            ensure_valid_bucket_pointers(status->hash_table);
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finsih. */
+            dshash_seq_term(status);
+            return NULL;
+        }
+
+        /* Also move partition lock if needed */
+        if (!status->consistent)
+        {
+            int next_partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+
+            /* Move lock along with partition for the bucket */
+            if (status->curpartition != next_partition)
+            {
+                /*
+                 * Take lock on the next partition then release the current,
+                 * not in the reverse order. This is required to avoid
+                 * resizing from happening during a sequential scan. Locks are
+                 * taken in partition order so no deadlock can happen with other
+                 * seq scans or resizing.
+                 */
+                LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                             next_partition),
+                              status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+                LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                             status->curpartition));
+                status->curpartition = next_partition;
+            }
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * The caller may delete this item, so remember its successor now for
+     * use in the next iteration.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    Assert(status->hash_table->seqscan_running);
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+    status->hash_table->seqscan_running = false;
+
+    if (status->consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+            LWLockRelease(PARTITION_LOCK(status->hash_table, i));
+    }
+    else if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index e5dfd57f0a..b80f3af995 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,23 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state of dshash. The detail is exposed since the storage
+ * size should be known to users but it should be considered as an opaque
+ * type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;
+    int                    curbucket;
+    int                    nbuckets;
+    dshash_table_item  *curitem;
+    dsa_pointer            pnextitem;
+    int                    curpartition;
+    bool                consistent;
+    bool                exclusive;
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
               const dshash_parameters *params,
@@ -70,7 +87,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
 extern void dshash_detach(dshash_table *hash_table);
 extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
 extern void dshash_destroy(dshash_table *hash_table);
-
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
             const void *key, bool exclusive);
@@ -80,6 +96,11 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool consistent, bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.16.3
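
The lock hand-over described in the dshash_seq_next() comment above can be pictured with a small stand-alone model. This is only a sketch under stated assumptions: every toy_* name is hypothetical, the "locks" are plain booleans rather than LWLocks, and the bucket-to-partition mapping is a simplified division rather than PARTITION_FOR_BUCKET_INDEX. It shows only the ordering (acquire the next partition, then release the current one), so that the scan never holds zero locks and never more than two.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NUM_PARTITIONS 4
#define TOY_NUM_BUCKETS    16	/* must be a multiple of TOY_NUM_PARTITIONS */

/* Toy scan state; "held" stands in for LWLocks and is not a real lock. */
typedef struct toy_scan
{
	bool		held[TOY_NUM_PARTITIONS];
	int			curbucket;
	int			curpartition;
	int			nheld;			/* locks held right now */
	int			max_held;		/* high-water mark of nheld */
	int			min_held;		/* low-water mark of nheld since init */
} toy_scan;

/* Simplified mapping: consecutive buckets share a partition. */
static int
toy_partition_for_bucket(int bucket)
{
	return bucket / (TOY_NUM_BUCKETS / TOY_NUM_PARTITIONS);
}

static void
toy_acquire(toy_scan *scan, int partition)
{
	assert(!scan->held[partition]);
	scan->held[partition] = true;
	if (++scan->nheld > scan->max_held)
		scan->max_held = scan->nheld;
}

static void
toy_release(toy_scan *scan, int partition)
{
	assert(scan->held[partition]);
	scan->held[partition] = false;
	if (--scan->nheld < scan->min_held)
		scan->min_held = scan->nheld;
}

/* Start the scan at bucket 0 with its partition locked. */
static void
toy_scan_init(toy_scan *scan)
{
	for (int i = 0; i < TOY_NUM_PARTITIONS; i++)
		scan->held[i] = false;
	scan->curbucket = 0;
	scan->curpartition = 0;
	scan->nheld = 0;
	scan->max_held = 0;
	toy_acquire(scan, 0);
	scan->min_held = scan->nheld;
}

/* Advance to the next bucket; returns false once every bucket was visited. */
static bool
toy_scan_next_bucket(toy_scan *scan)
{
	int			next_partition;

	if (++scan->curbucket >= TOY_NUM_BUCKETS)
		return false;

	next_partition = toy_partition_for_bucket(scan->curbucket);
	if (next_partition != scan->curpartition)
	{
		/* Acquire first, release second: never a moment with no lock. */
		toy_acquire(scan, next_partition);
		toy_release(scan, scan->curpartition);
		scan->curpartition = next_partition;
	}
	return true;
}

/* Release whatever is still held at the end of the scan. */
static void
toy_scan_term(toy_scan *scan)
{
	toy_release(scan, scan->curpartition);
}
```

Swapping the two calls inside toy_scan_next_bucket() would let min_held drop to zero in the middle of the scan, which is the window the patch comment says a resize could slip into.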

From cbd96a824d077b8ea5b77dc8a8c3fc1f4d4e74a9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 27 Sep 2018 11:15:19 +0900
Subject: [PATCH 2/6] Add conditional lock feature to dshash

Dshash currently waits for locks unconditionally. This commit adds new
interfaces for dshash_find and dshash_find_or_insert. The new
interfaces take an extra parameter "nowait" that tells them not to wait
for the lock.
---
 src/backend/lib/dshash.c | 69 +++++++++++++++++++++++++++++++++++++++++++-----
 src/include/lib/dshash.h |  4 +++
 2 files changed, 66 insertions(+), 7 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 1e8c22f94f..303210e326 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -394,19 +394,48 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  */
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
+{
+    return dshash_find_extended(hash_table, key, exclusive, false, NULL);
+}
+
+/*
+ * In addition to dshash_find, returns NULL immediately when nowait is true
+ * and the lock could not be acquired. If lock_acquired is not NULL,
+ * *lock_acquired is set to whether the lock was acquired.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool *lock_acquired)
 {
     dshash_hash hash;
     size_t        partition;
     dshash_table_item *item;
 
+    /* lock_acquired is meaningful only when nowait is true */
+    Assert(nowait || !lock_acquired);
+
     hash = hash_key(hash_table, key);
     partition = PARTITION_FOR_HASH(hash);
 
     Assert(hash_table->control->magic == DSHASH_MAGIC);
     Assert(!hash_table->find_locked);
 
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partition),
+                                      exclusive ? LW_EXCLUSIVE : LW_SHARED))
+        {
+            if (lock_acquired)
+                *lock_acquired = false;
+            return NULL;
+        }
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition),
+                      exclusive ? LW_EXCLUSIVE : LW_SHARED);
+
+    if (lock_acquired)
+        *lock_acquired = true;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -441,6 +470,22 @@ void *
 dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
+{
+    return dshash_find_or_insert_extended(hash_table, key, found, false);
+}
+
+/*
+ * Addition to dshash_find_or_insert, returns NULL if nowait is true and lock
+ * was not acquired.
+ *
+ * Notes above dshash_find_extended() regarding locking and error handling
+ * equally apply here.
+ */
+void *
+dshash_find_or_insert_extended(dshash_table *hash_table,
+                               const void *key,
+                               bool *found,
+                               bool nowait)
 {
     dshash_hash hash;
     size_t        partition_index;
@@ -455,8 +500,16 @@ dshash_find_or_insert(dshash_table *hash_table,
     Assert(!hash_table->find_locked);
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(
+                PARTITION_LOCK(hash_table, partition_index),
+                LW_EXCLUSIVE))
+            return NULL;
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
+                      LW_EXCLUSIVE);
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -626,9 +679,11 @@ dshash_memhash(const void *v, size_t size, void *arg)
  * As opposed to the equivalent for dynanash, the caller is not supposed to
  * delete the returned element before continuing the scan.
  *
- * If consistent is set for dshash_seq_init, the whole hash table is
- * non-exclusively locked. Otherwise a part of the hash table is locked in the
- * same mode (partition lock).
+ * If consistent is set for dshash_seq_init, all hash table
+ * partitions are locked in the requested mode (as determined by the
+ * exclusive flag), and the locks are held until the end of the scan.
+ * Otherwise the partition locks are acquired and released as needed
+ * during the scan (up to two partitions may be locked at the same time).
  */
 void
 dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b80f3af995..21587c07ce 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -90,8 +90,12 @@ extern void dshash_destroy(dshash_table *hash_table);
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
             const void *key, bool exclusive);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+            bool exclusive, bool nowait, bool *lock_acquired);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                       const void *key, bool *found);
+extern void *dshash_find_or_insert_extended(dshash_table *hash_table,
+            const void *key, bool *found, bool nowait);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.16.3
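
A caller of the nowait interface has to tell apart two reasons for a NULL result: the key is absent, or the lock was not available. The following is a minimal sketch of that caller-side logic, not code from the patch: toy_table, toy_find_extended and toy_lookup are hypothetical stand-ins that only mirror the contract described for dshash_find_extended(), and the "lock" is a single boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical three-way result, in the spirit of a lookup-state enum. */
typedef enum toy_lookup_state
{
	TOY_ENTRY_NOT_FOUND,
	TOY_ENTRY_FOUND,
	TOY_ENTRY_LOCK_FAILED
} toy_lookup_state;

/* One-entry toy table; locked_by_other pretends someone holds the lock. */
typedef struct toy_table
{
	bool		locked_by_other;
	int			key;
	int			value;
} toy_table;

/* Stand-in that mirrors the contract described for dshash_find_extended(). */
static int *
toy_find_extended(toy_table *table, int key, bool nowait, bool *lock_acquired)
{
	/* As in the patch, lock_acquired may be passed only with nowait. */
	assert(nowait || lock_acquired == NULL);

	if (table->locked_by_other)
	{
		/* A blocking caller would wait here; the toy cannot model that. */
		assert(nowait);
		if (lock_acquired)
			*lock_acquired = false;
		return NULL;
	}

	if (lock_acquired)
		*lock_acquired = true;
	return (key == table->key) ? &table->value : NULL;
}

/* Caller side: turn (pointer, lock_acquired) into one of three states. */
static toy_lookup_state
toy_lookup(toy_table *table, int key, int **value)
{
	bool		lock_acquired;

	*value = toy_find_extended(table, key, true, &lock_acquired);
	if (*value != NULL)
		return TOY_ENTRY_FOUND;
	return lock_acquired ? TOY_ENTRY_NOT_FOUND : TOY_ENTRY_LOCK_FAILED;
}
```

On TOY_ENTRY_LOCK_FAILED a caller would typically keep its pending data and retry later instead of treating the entry as missing.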

From df17a84d5e79f844f1ceafc1ad6c673779800844 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 7 Nov 2018 16:53:49 +0900
Subject: [PATCH 3/6] Make archiver process an auxiliary process

This is a preliminary patch for the shared-memory based stats collector.
The archiver process must be an auxiliary process, since it uses shared
memory once stats data has been moved into shared memory. Make the
process an auxiliary process so that it keeps working.
---
 src/backend/bootstrap/bootstrap.c   |  8 +++
 src/backend/postmaster/pgarch.c     | 98 +++++++++----------------------------
 src/backend/postmaster/pgstat.c     |  6 +++
 src/backend/postmaster/postmaster.c | 35 +++++++++----
 src/include/miscadmin.h             |  2 +
 src/include/pgstat.h                |  1 +
 src/include/postmaster/pgarch.h     |  4 +-
 7 files changed, 67 insertions(+), 87 deletions(-)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 4d7ed8ad1a..b0878a3dd9 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -328,6 +328,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case BgWriterProcess:
                 statmsg = pgstat_get_backend_desc(B_BG_WRITER);
                 break;
+            case ArchiverProcess:
+                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
+                break;
             case CheckpointerProcess:
                 statmsg = pgstat_get_backend_desc(B_CHECKPOINTER);
                 break;
@@ -455,6 +458,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
             BackgroundWriterMain();
             proc_exit(1);        /* should never return */
 
+        case ArchiverProcess:
+            /* don't set signals, archiver has its own agenda */
+            PgArchiverMain();
+            proc_exit(1);        /* should never return */
+
         case CheckpointerProcess:
             /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index f84f882c4c..4342ebdab4 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -77,7 +77,6 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
@@ -96,7 +95,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgarch_exit(SIGNAL_ARGS);
 static void ArchSigHupHandler(SIGNAL_ARGS);
 static void ArchSigTermHandler(SIGNAL_ARGS);
@@ -114,75 +112,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -222,8 +151,8 @@ pgarch_forkexec(void)
  *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
  *    since we don't use 'em, it hardly matters...
  */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -255,8 +184,27 @@ PgArchiverMain(int argc, char *argv[])
 static void
 pgarch_exit(SIGNAL_ARGS)
 {
-    /* SIGQUIT means curl up and die ... */
-    exit(1);
+    PG_SETMASK(&BlockSig);
+
+    /*
+     * We DO NOT want to run proc_exit() callbacks -- we're here because
+     * shared memory may be corrupted, so we don't want to try to clean up our
+     * transaction.  Just nail the windows shut and get out of town.  Now that
+     * there's an atexit callback to prevent third-party code from breaking
+     * things by calling exit() directly, we have to reset the callbacks
+     * explicitly to make this work as intended.
+     */
+    on_exit_reset();
+
+    /*
+     * Note we do exit(2) not exit(0).  This is to force the postmaster into a
+     * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+     * backend.  This is necessary precisely because we don't clean up our
+     * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+     * should ensure the postmaster sees this as a crash, too, but no harm in
+     * being doubly sure.)
+     */
+    exit(2);
 }
 
 /* SIGHUP signal handler for archiver process */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 81c6499251..8d3c45dd4e 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2856,6 +2856,9 @@ pgstat_bestart(void)
             case BgWriterProcess:
                 beentry->st_backendType = B_BG_WRITER;
                 break;
+            case ArchiverProcess:
+                beentry->st_backendType = B_ARCHIVER;
+                break;
             case CheckpointerProcess:
                 beentry->st_backendType = B_CHECKPOINTER;
                 break;
@@ -4105,6 +4108,9 @@ pgstat_get_backend_desc(BackendType backendType)
 
     switch (backendType)
     {
+        case B_ARCHIVER:
+            backendDesc = "archiver";
+            break;
         case B_AUTOVAC_LAUNCHER:
             backendDesc = "autovacuum launcher";
             break;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index ccea231e98..820f356038 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -146,7 +146,8 @@
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
 #define BACKEND_TYPE_BGWORKER    0x0008    /* bgworker process */
-#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
+#define BACKEND_TYPE_ARCHIVER    0x0010    /* archiver process */
+#define BACKEND_TYPE_ALL        0x001F    /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -538,6 +539,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1761,7 +1763,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -2924,7 +2926,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3069,10 +3071,8 @@ reaper(SIGNAL_ARGS)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3318,7 +3318,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3523,6 +3523,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3799,6 +3811,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5068,7 +5081,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5342,6 +5355,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index c9e35003a5..63a7653457 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -399,6 +399,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -411,6 +412,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 88a75fb798..3324be8a81 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -701,6 +701,7 @@ typedef struct PgStat_GlobalStats
  */
 typedef enum BackendType
 {
+    B_ARCHIVER,
     B_AUTOVAC_LAUNCHER,
     B_AUTOVAC_WORKER,
     B_BACKEND,
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index 2474eac26a..88f16863d4 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
-- 
2.16.3

From a0c8ff34a3e247d13869ce4507fe44b34fa47507 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 21 Feb 2019 12:42:07 +0900
Subject: [PATCH 4/6] Allow dsm to use on postmaster.

DSM is currently prohibited from being used in the postmaster. The
shared-memory based stats collector needs it to work in the postmaster,
and no problem was found in doing that. Just allow it.
---
 src/backend/storage/ipc/dsm.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index 23ccc59f13..d30a876bb0 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -440,8 +440,7 @@ dsm_create(Size size, int flags)
     uint32        i;
     uint32        nitems;
 
-    /* Unsafe in postmaster (and pointless in a stand-alone backend). */
-    Assert(IsUnderPostmaster);
+    Assert(dsm_control != NULL);
 
     if (!dsm_init_done)
         dsm_backend_startup();
@@ -537,8 +536,7 @@ dsm_attach(dsm_handle h)
     uint32        i;
     uint32        nitems;
 
-    /* Unsafe in postmaster (and pointless in a stand-alone backend). */
-    Assert(IsUnderPostmaster);
+    Assert(dsm_control != NULL);
 
     if (!dsm_init_done)
         dsm_backend_startup();
-- 
2.16.3

From 5259e756167e8cd355d8c53826f64178fc8d14b8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 21 Feb 2019 12:44:56 +0900
Subject: [PATCH 5/6] Shared-memory based stats collector

Previously, activity statistics were shared via files on disk. Every
backend sent its numbers to the stats collector process via a socket.
The collector wrote snapshots as a set of files on disk at a certain
interval, and every backend read them as necessary. That worked fine for
a comparatively small set of statistics, but the set is under pressure
to grow and the file size has reached the order of megabytes. To deal
with a larger set of statistics, this patch lets backends share the
statistics directly via shared memory.
---
 doc/src/sgml/monitoring.sgml                 |    6 +-
 src/backend/postmaster/autovacuum.c          |    6 +-
 src/backend/postmaster/pgstat.c              | 5435 +++++++++++---------------
 src/backend/postmaster/postmaster.c          |   85 +-
 src/backend/storage/ipc/ipci.c               |    6 +
 src/backend/storage/lmgr/lwlock.c            |    1 +
 src/backend/tcop/postgres.c                  |   27 +-
 src/backend/utils/init/globals.c             |    1 +
 src/backend/utils/init/postinit.c            |   11 +
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |    2 +-
 src/include/miscadmin.h                      |    1 +
 src/include/pgstat.h                         |  469 +--
 src/include/storage/lwlock.h                 |    1 +
 src/include/utils/timeout.h                  |    1 +
 14 files changed, 2478 insertions(+), 3574 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 5c1408bdf5..83a0be1965 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    master server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   master process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   master process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index d1177b3855..1b7429217a 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2749,12 +2749,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
     if (isshared)
     {
         if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
+            tabentry = pgstat_fetch_stat_tabentry_extended(shared, relid);
     }
     else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
+        tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
 
     return tabentry;
 }
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8d3c45dd4e..ba4c274f9a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,19 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Statistics collector facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
- *
- *            - Add some automatic call for pgstat vacuuming.
- *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  Collects per-table and per-function usage statistics of backends and
+ *  shares them among all backends via shared memory. Every backend records
+ *  its own activity in local memory during a transaction, using the
+ *  pg_count_*() interfaces and friends. pgstat_report_stat() is then called
+ *  at the end of a transaction to flush the local numbers out to shared
+ *  memory. To avoid congestion on the shared memory, this is done no more
+ *  often than once per PGSTAT_STAT_MIN_INTERVAL (500ms). A backend may still
+ *  be unable to flush all or part of its local numbers immediately; such
+ *  numbers are retried at intervals of PGSTAT_STAT_RETRY_INTERVAL (100ms),
+ *  but they are not kept pending longer than PGSTAT_STAT_MAX_INTERVAL
+ *  (1000ms).
  *
  *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
  *
@@ -19,88 +23,47 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "pgstat.h"
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
-#include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "postmaster/autovacuum.h"
-#include "postmaster/fork_process.h"
-#include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
-#include "utils/timestamp.h"
-
 
 /* ----------
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
+#define PGSTAT_STAT_MIN_INTERVAL    500 /* Minimum time between stats data
+                                         * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_STAT_RETRY_INTERVAL    100 /* Retry interval after
+                                         * PGSTAT_STAT_MIN_INTERVAL has elapsed */
 
+#define PGSTAT_STAT_MAX_INTERVAL   1000 /* Maximum time between stats data
+                                         * updates; in milliseconds. */
 
 /* ----------
  * The initial size hints for the hash tables used in the collector.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
 #define PGSTAT_TAB_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
@@ -116,6 +79,19 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
+/*
+ * Operation mode of pgstat_get_db_entry.
+ */
+#define    PGSTAT_FETCH_SHARED        0
+#define    PGSTAT_FETCH_EXCLUSIVE    1
+#define    PGSTAT_FETCH_NOWAIT        2
+
+typedef enum PgStat_TableLookupState
+{
+    PGSTAT_ENTRY_NOT_FOUND,
+    PGSTAT_ENTRY_FOUND,
+    PGSTAT_ENTRY_LOCK_FAILED
+} PgStat_TableLookupState;
 
 /* ----------
  * GUC parameters
@@ -131,9 +107,23 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used; to be removed together with the GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
+LWLock        StatsMainLock;
+#define        StatsLock (&StatsMainLock)
+
+/* Shared stats bootstrap information */
+typedef struct StatsShmemStruct {
+    dsa_handle stats_dsa_handle;
+    dshash_table_handle db_hash_handle;
+    dsa_pointer    global_stats;
+    dsa_pointer    archiver_stats;
+    TimestampTz last_update;
+} StatsShmemStruct;
+
 /*
  * BgWriter global statistics counters (unused in other processes).
  * Stored directly in a stats message structure so it can be sent
@@ -145,17 +135,38 @@ PgStat_MsgBgWriter BgWriterStats;
  * Local data
  * ----------
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
+/* Variables that live for the lifetime of the backend */
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
+static dshash_table *pgStatDBHash;
+static MemoryContext pgSharedStatsContext = NULL;
 
-static struct sockaddr_storage pgStatAddr;
-
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
+/* dshash parameter for each type of table */
+static const dshash_parameters dsh_dbparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatDBEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+static const dshash_parameters dsh_tblparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatTabEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+static const dshash_parameters dsh_funcparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatFuncEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
 
 /*
  * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
+ * written to shared memory.
  *
  * NOTE: once allocated, TabStatusArray structures are never moved or deleted
  * for the life of the backend.  Also, we zero out the t_id fields of the
@@ -190,8 +201,8 @@ typedef struct TabStatHashEntry
 static HTAB *pgStatTabHash = NULL;
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * Backends store per-function info that's waiting to be flushed out to shared
+ * memory in this hash table (indexed by function OID).
  */
 static HTAB *pgStatFunctions = NULL;
 
@@ -201,6 +212,46 @@ static HTAB *pgStatFunctions = NULL;
  */
 static bool have_function_stats = false;
 
+/* dbentry has some additional data in snapshot */
+typedef struct PgStat_StatDBEntry_snapshot
+{
+    PgStat_StatDBEntry shared_part;
+
+    HTAB *snapshot_tables;                /* table entry snapshot */
+    HTAB *snapshot_functions;            /* function entry snapshot */
+    dshash_table    *dshash_tables;        /* attached tables dshash */
+    dshash_table    *dshash_functions;    /* attached functions dshash */
+} PgStat_StatDBEntry_snapshot;
+
+/* context struct for snapshot_statentry */
+typedef struct pgstat_snapshot_cxt
+{
+    char           *hash_name;            /* name of the snapshot hash */
+    HTAB          **hash;                /* placeholder for the hash */
+    int                hash_entsize;        /* element size of hash entry */
+    dshash_table  **dshash;                /* placeholder for attached dshash */
+    dshash_table_handle    dsh_handle;        /* dsh handle to attach */
+    const dshash_parameters *dsh_params;/* dshash params */
+} pgstat_snapshot_cxt;
+
+/*
+ *  Backends store various database-wide info that's waiting to be flushed out
+ *  to shared memory in these variables.
+ */
+static int        n_deadlocks = 0;
+static size_t    n_tmpfiles = 0;
+static size_t    n_tmpfilesize = 0;
+
+/*
+ * have_recovery_conflicts represents the existence of any kind of conflict
+ */
+static bool        have_recovery_conflicts = false;
+static int        n_conflict_tablespace = 0;
+static int        n_conflict_lock = 0;
+static int        n_conflict_snapshot = 0;
+static int        n_conflict_bufferpin = 0;
+static int        n_conflict_startup_deadlock = 0;
+
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
  * into PgStat_TableStatus counters until we know if it is going to commit
@@ -236,36 +287,41 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot, which includes auxiliary processes. */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
-
-/* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
-
-/* Total number of backends including auxiliary */
 static int    localNumBackends = 0;
 
-/*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
- */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
+/* Variables for activity statistics snapshot. */
+static MemoryContext pgStatSnapshotContext = NULL;
+static HTAB *pgStatDBEntrySnapshot;
+static TimestampTz snapshot_expires_at = 0; /* local cache expiration time */
+static bool        first_in_xact = true;      /* is the first time in this xact? */
+
+
+/* Context struct for flushing to shared memory */
+typedef struct pgstat_flush_stat_context
+{
+    int    shgeneration;
+    PgStat_StatDBEntry *shdbentry;
+    dshash_table *shdb_tabhash;
+
+    int    mygeneration;
+    PgStat_StatDBEntry *mydbentry;
+    dshash_table *mydb_tabhash;
+} pgstat_flush_stat_context;
 
 /*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
+ * Cluster wide statistics.
+ *
+ * Contains statistics that are not collected per database or per table.
+ * shared_* are the statistics maintained by shared statistics code and
+ * snapshot_* are backend snapshots.
  */
-static List *pending_write_requests = NIL;
-
-/* Signal handler flags */
-static volatile bool need_exit = false;
-static volatile bool got_SIGHUP = false;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats *snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats *snapshot_globalStats;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -279,35 +335,36 @@ static instr_time total_func_time;
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
 
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgstat_exit(SIGNAL_ARGS);
 static void pgstat_beshutdown_hook(int code, Datum arg);
-static void pgstat_sighup_handler(SIGNAL_ARGS);
-
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                     Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, int op,
+                                    PgStat_TableLookupState *status);
+static PgStat_StatTabEntry *pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create);
+static void pgstat_write_pgStatDBHashfile(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_pgStatDBHashfile(PgStat_StatDBEntry *dbentry);
 static void pgstat_read_current_status(void);
-
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
-
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
+static bool pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
+                                 PgStat_TableStatus *entry);
+static bool pgstat_flush_funcstats(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_flush_miscstats(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_update_tabentry(dshash_table *tabhash,
+                                   PgStat_TableStatus *stat, bool nowait);
+static void pgstat_update_dbentry(PgStat_StatDBEntry *dbentry,
+                                  PgStat_TableStatus *stat);
 static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
 
 static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
 
 static void pgstat_setup_memcxt(void);
+static void pgstat_flush_recovery_conflict(PgStat_StatDBEntry *dbentry);
+static void pgstat_flush_deadlock(PgStat_StatDBEntry *dbentry);
+static void pgstat_flush_tempfile(PgStat_StatDBEntry *dbentry);
+static HTAB *create_tabstat_hash(void);
+static PgStat_SubXactStatus *get_tabstat_stack_level(int nest_level);
+static void add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level);
+static PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry_extended(PgStat_StatDBEntry *dbent, Oid funcid);
+static void pgstat_snapshot_global_stats(void);
 
 static const char *pgstat_get_wait_activity(WaitEventActivity w);
 static const char *pgstat_get_wait_client(WaitEventClient w);
@@ -315,480 +372,134 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+/* ------------------------------------------------------------
+ * Local support functions follow
+ * ------------------------------------------------------------
+ */
+static int pin_hashes(PgStat_StatDBEntry *dbentry);
+static void unpin_hashes(PgStat_StatDBEntry *dbentry, int generation);
+static dshash_table *attach_table_hash(PgStat_StatDBEntry *dbent, int gen);
+static dshash_table *attach_function_hash(PgStat_StatDBEntry *dbent, int gen);
+static void reset_dbentry_counters(PgStat_StatDBEntry *dbentry);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
- */
-void
-pgstat_init(void)
+static void
+pgstat_postmaster_shutdown(int code, Datum arg)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
-
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
-
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
-    {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
-    }
-
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
-
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
-
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
-
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
-    }
-
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
-
-    /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
-     */
-    if (!pg_set_noblock(pgStatSock))
-    {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
-    }
-
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
-    }
-
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    return;
-
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
-
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
-
-    /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
-     */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    /* write out the stats on clean exit; trash them on crash */
+    if (code == 0)
+        pgstat_write_statsfiles();
 }
 
-/*
- * subroutine for pgstat_reset_all
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+void
+StatsShmemInit(void)
+{
+    bool    found;
+
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+
+        /* Load saved data if any */
+        pgstat_read_statsfiles();
+
+        /* needs to be called before dsm shutdown */
+        before_shmem_exit(pgstat_postmaster_shutdown, (Datum) 0);
+    }
+
+    LWLockInitialize(StatsLock, LWTRANCHE_STATS);
+}
+
+/* ----------
+ * pgstat_create_shared_stats() -
+ *
+ *    create shared stats memory
+ * ----------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+pgstat_create_shared_stats(void)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    MemoryContext oldcontext;
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
-    {
-        int            nchars;
-        Oid            tmp_oid;
+    Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
 
-        /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
-         */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
+    /* lives for the lifetime of the process */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    area = dsa_create(LWTRANCHE_STATS);
+    dsa_pin_mapping(area);
 
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
+    /* create the database hash */
+    pgStatDBHash = dshash_create(area, &dsh_dbparams, 0);
 
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
-    }
-    FreeDir(dir);
+    /* create shared area and write bootstrap information */
+    StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+    StatsShmem->global_stats =
+        dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+    StatsShmem->archiver_stats =
+        dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+    StatsShmem->db_hash_handle =
+        dshash_get_hash_table_handle(pgStatDBHash);
+    StatsShmem->last_update = 0;
+
+    /* initial connect to the memory */
+    MemoryContextSwitchTo(pgSharedStatsContext);
+    pgStatDBEntrySnapshot = NULL;
+    shared_globalStats = (PgStat_GlobalStats *)
+        dsa_get_address(area, StatsShmem->global_stats);
+    shared_archiverStats = (PgStat_ArchiverStats *)
+        dsa_get_address(area, StatsShmem->archiver_stats);
+    MemoryContextSwitchTo(oldcontext);
 }
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Clear in-memory counters.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
-}
+    dshash_seq_status dshstat;
+    PgStat_StatDBEntry           *dbentry;
 
-#ifdef EXEC_BACKEND
+    Assert (pgStatDBHash);
 
-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
-    char       *av[10];
-    int            ac = 0;
-
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
-
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
-
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
-
-
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
-
-    /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
-     */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
-
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
-
-    /*
-     * Okay, fork off the collector.
-     */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
+    dshash_seq_init(&dshstat, pgStatDBHash, false, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
+        /*
+         * Reset database-level stats, too.  This creates empty hash tables
+         * for tables and functions.
+         */
+        reset_dbentry_counters(dbentry);
+        dshash_release_lock(pgStatDBHash, dbentry);
     }
 
-    /* shouldn't get here */
-    return 0;
-}
-
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
+    /*
+     * Reset global counters
+     */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+    MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+    shared_globalStats->stat_reset_timestamp =
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
+    LWLockRelease(StatsLock);
 }
 
 /* ------------------------------------------------------------
@@ -796,75 +507,257 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
- * pgstat_report_stat() -
+ * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the statistics collected
+ *    so far (per-table and function usage) to the shared statistics hashes.
+ *
+ *    This requires taking locks on the shared statistics hashes, so some
+ *    updates may be postponed when a lock cannot be acquired. Postponed
+ *    updates are retried in later calls of this function, and are finally
+ *    cleaned up when this function is called with force = true or when
+ *    PGSTAT_STAT_MAX_INTERVAL milliseconds have elapsed since the last
+ *    cleanup. When force = false, updates by regular backends happen at
+ *    intervals no shorter than PGSTAT_STAT_MIN_INTERVAL milliseconds.
+ *
+ *    Returns the time until the next update in milliseconds.
+ *
+ *    Note that this is called only outside a transaction, so it is fair to use
  *    transaction stop time as an approximation of current time.
- * ----------
+ *    ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz last_flush = 0;
+    static TimestampTz pending_since = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
-    int            i;
+    pgstat_flush_stat_context cxt = {0};
+    bool        have_other_stats = false;
+    bool        pending_stats = false;
+    long        elapsed;
+    long        secs;
+    int            usecs;
+
+    /* Do we have anything to flush? */
+    if (have_recovery_conflicts || n_deadlocks != 0 || n_tmpfiles != 0)
+        have_other_stats = true;
 
     /* Don't expend a clock check if nothing to do */
     if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
         pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+        !have_other_stats && !have_function_stats)
+        return 0;
+
+    now = GetCurrentTransactionStopTimestamp();
+
+    if (!force)
+    {
+        /*
+         * Don't flush stats unless it's been at least
+         * PGSTAT_STAT_MIN_INTERVAL msec since the last flush.  Return the
+         * time to wait in that case.
+         */
+        TimestampDifference(last_flush, now, &secs, &usecs);
+        elapsed = secs * 1000 + usecs / 1000;
+
+        if (elapsed < PGSTAT_STAT_MIN_INTERVAL)
+        {
+            if (pending_since == 0)
+                pending_since = now;
+
+            return PGSTAT_STAT_MIN_INTERVAL - elapsed;
+        }
+
+
+        /*
+         * Don't keep pending stats for longer than PGSTAT_STAT_MAX_INTERVAL.
+         */
+        if (pending_since > 0)
+        {
+            TimestampDifference(pending_since, now, &secs, &usecs);
+            elapsed = secs * 1000 + usecs / 1000;
+
+            if (elapsed > PGSTAT_STAT_MAX_INTERVAL)
+                force = true;
+        }
+    }
+
+    /* It's time to flush */
+    last_flush = now;
+
+    /* Flush out table stats */
+    if (pgStatTabList != NULL && !pgstat_flush_stat(&cxt, !force))
+        pending_stats = true;
+
+    /* Flush out function stats */
+    if (pgStatFunctions != NULL && !pgstat_flush_funcstats(&cxt, !force))
+        pending_stats = true;
+
+    /* Flush out miscellaneous stats */
+    if (have_other_stats && !pgstat_flush_miscstats(&cxt, !force))
+        pending_stats = true;
+
+    /*  Unpin dbentry if pinned */
+    if (cxt.mydbentry)
+        unpin_hashes(cxt.mydbentry, cxt.mygeneration);
+
+    /* Publish the last flush time */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (StatsShmem->last_update < last_flush)
+        StatsShmem->last_update = last_flush;
+    LWLockRelease(StatsLock);
+
+    /* Remember since when we have had pending stats */
+    if (pending_stats)
+    {
+        if (pending_since == 0)
+            pending_since = now;
+        return PGSTAT_STAT_RETRY_INTERVAL;
+    }
+
+    pending_since = 0;
+
+    return 0;
+}
+
+/* -------
+ * Subroutines for pgstat_report_stat.
+ * -------
+ */
+
+/*
+ * snapshot_statentry() - Find an entry from source dshash with cache.
+ *
+ * Returns the entry for key or NULL if not found.
+ *
+ * Returned entries stay consistent until the end of the current transaction
+ * or until pgstat_clear_snapshot() is called.
+ *
+ * cxt->hash points to an HTAB * variable that stores the hash used as the
+ * local cache.  A new hash is created if it does not exist yet.
+ *
+ * cxt->dshash points to a dshash_table * variable that stores the attached
+ * dshash.  The dshash for cxt->dsh_handle is attached if not yet attached.
+ */
+static void *
+snapshot_statentry(pgstat_snapshot_cxt *cxt, Oid key)
+{
+    char *lentry = NULL;
+    size_t keysize = cxt->dsh_params->key_size;
+    size_t dsh_entrysize = cxt->dsh_params->entry_size;
+    bool found;
+    bool *negative;
+
+    /* caches the result entry */
 
     /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
+     * Create a new hash with an arbitrary initial size since we don't know
+     * how this hash will grow.  The boolean placed at the end of each entry
+     * is the negative-entry flag.
      */
-    now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
+    if (!*cxt->hash)
+    {
+        HASHCTL ctl;
+
+        /* Create the hash in the stats context */
+        ctl.keysize        = keysize;
+        ctl.entrysize    = cxt->hash_entsize + sizeof(bool);
+        ctl.hcxt        = pgStatSnapshotContext;
+        *cxt->hash = hash_create(cxt->hash_name, 32, &ctl,
+                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    }
+
+    lentry = hash_search(*cxt->hash, &key, HASH_ENTER, &found);
+
+    negative = (bool *) (lentry + cxt->hash_entsize);
+
+    if (!found)
+    {
+        /* not found in local cache, search shared hash */
+
+        void *sentry;
+
+        /* attach shared hash if not given, leave it alone for later use */
+        if (!*cxt->dshash)
+        {
+            MemoryContext oldcxt;
+
+            if (cxt->dsh_handle == DSM_HANDLE_INVALID)
+                return NULL;
+
+            oldcxt = MemoryContextSwitchTo(pgStatSnapshotContext);
+            *cxt->dshash =
+                dshash_attach(area, cxt->dsh_params, cxt->dsh_handle, NULL);
+            MemoryContextSwitchTo(oldcxt);
+        }
+
+        sentry = dshash_find(*cxt->dshash, &key, false);
+
+        if (sentry)
+        {
+            /* found, copy it */
+            memcpy(lentry, sentry, dsh_entrysize);
+            dshash_release_lock(*cxt->dshash, sentry);
+
+            /* then zero out the additional space */
+            if (dsh_entrysize < cxt->hash_entsize)
+                MemSet(lentry + dsh_entrysize, 0,
+                       cxt->hash_entsize - dsh_entrysize);
+        }
+
+        *negative = !sentry;
+    }
+
+    if (*negative)
+        return NULL;
+
+    return (void *) lentry;
+}
+
+/*
+ * pgstat_flush_stat: Flushes table stats out to shared statistics.
+ *
+ *  If nowait is true, entries whose required lock cannot be acquired
+ *  immediately are not flushed and false is returned.  In that case, stats
+ *  of some tables may be left in the TSA to wait for the next chance.  cxt
+ *  holds some dshash-related values that we want to keep during the shared
+ *  stats update.  The caller must detach the dshashes stored in cxt after use.
+ *
+ *  Returns true if all entries are flushed.
+ */
+static bool
+pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait)
+{
+    static const PgStat_TableCounts all_zeroes;
+    TabStatusArray *tsa;
+    HTAB           *new_tsa_hash = NULL;
+    TabStatusArray *dest_tsa = pgStatTabList;
+    int                dest_elem = 0;
+    int                i;
+
+    /* nothing to do, just return  */
+    if (pgStatTabHash == NULL)
+        return true;
 
     /*
      * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
+     * entries it points to. We recreate it if needed.
      */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
+    hash_destroy(pgStatTabHash);
     pgStatTabHash = NULL;
 
     /*
      * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
+     * have counts, and try flushing them out to shared statistics.
      */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
     for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
     {
         for (i = 0; i < tsa->tsa_used; i++)
         {
             PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
 
             /* Shouldn't have any pending transaction-dependent counts */
             Assert(entry->trans == NULL);
@@ -877,178 +770,344 @@ pgstat_report_stat(bool force)
                        sizeof(PgStat_TableCounts)) == 0)
                 continue;
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* try to apply the tab stats */
+            if (!pgstat_flush_tabstat(cxt, nowait, entry))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                /*
+                 * Failed. Keep the entry, packing it toward the beginning of the TSA.
+                 */
+                TabStatHashEntry *hash_entry;
+                bool found;
+
+                if (new_tsa_hash == NULL)
+                    new_tsa_hash = create_tabstat_hash();
+
+                /* Create hash entry for this entry */
+                hash_entry = hash_search(new_tsa_hash, &entry->t_id,
+                                         HASH_ENTER, &found);
+                Assert(!found);
+
+                /*
+                 * Move the insertion pointer to the next segment.  There must
+                 * be enough segments since we are only keeping some of the
+                 * current elements.
+                 */
+                if (dest_elem >= TABSTAT_QUANTUM)
+                {
+                    Assert(dest_tsa->tsa_next != NULL);
+                    dest_tsa = dest_tsa->tsa_next;
+                    dest_elem = 0;
+                }
+
+                /* Move the entry if needed */
+                if (tsa != dest_tsa || i != dest_elem)
+                {
+                    PgStat_TableStatus *new_entry;
+                    new_entry = &dest_tsa->tsa_entries[dest_elem];
+                    *new_entry = *entry;
+                    entry = new_entry;
+                }
+
+                hash_entry->tsa_entry = entry;
+                dest_elem++;
             }
         }
-        /* zero out TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
     }
 
-    /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
-     */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    /* zero out unused area of TableStatus */
+    dest_tsa->tsa_used = dest_elem;
+    MemSet(&dest_tsa->tsa_entries[dest_elem], 0,
+           (TABSTAT_QUANTUM - dest_elem) * sizeof(PgStat_TableStatus));
+    while (dest_tsa->tsa_next)
+    {
+        dest_tsa = dest_tsa->tsa_next;
+        MemSet(dest_tsa->tsa_entries, 0,
+               dest_tsa->tsa_used * sizeof(PgStat_TableStatus));
+        dest_tsa->tsa_used = 0;
+    }
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+    /* and set the new TSA hash if any */
+    pgStatTabHash = new_tsa_hash;
+
+    /*
+     * We no longer need the shared database and table entries, but may
+     * still use those for my database.
+     */
+    if (cxt->shdb_tabhash)
+    {
+        dshash_detach(cxt->shdb_tabhash);
+        unpin_hashes(cxt->shdbentry, cxt->shgeneration);
+        cxt->shdb_tabhash = NULL;
+        cxt->shdbentry = NULL;
+    }
+
+    return pgStatTabHash == NULL;
 }
 
+
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * pgstat_flush_tabstat: Flushes a table stats entry.
+ *
+ *  If nowait is true, returns false on lock failure.  Dshashes for table and
+ *  function stats are kept attached in cxt. The caller must detach them after
+ *  use.
+ *
+ *  Returns true if the entry is flushed.
  */
-static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+bool
+pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
+                     PgStat_TableStatus *entry)
 {
-    int            n;
-    int            len;
+    Oid        dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
+    int        table_mode = PGSTAT_FETCH_EXCLUSIVE;
+    bool    updated = false;
+    dshash_table *tabhash;
+    PgStat_StatDBEntry *dbent;
+    int        generation;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    if (nowait)
+        table_mode |= PGSTAT_FETCH_NOWAIT;
 
-    /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
-     */
-    if (OidIsValid(tsmsg->m_databaseid))
+    /* Attach the required table hash if not yet attached. */
+    if ((entry->t_shared ? cxt->shdb_tabhash : cxt->mydb_tabhash) == NULL)
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
-        pgStatXactCommit = 0;
-        pgStatXactRollback = 0;
-        pgStatBlockReadTime = 0;
-        pgStatBlockWriteTime = 0;
+        /* We don't have the corresponding dbentry here */
+        dbent = pgstat_get_db_entry(dboid, table_mode, NULL);
+        if (!dbent)
+            return false;
+
+        /*
+         * We don't hold dshash-lock on dbentries, since the dbentries cannot
+         * be dropped meanwhile.
+         */
+        generation = pin_hashes(dbent);
+        tabhash = attach_table_hash(dbent, generation);
+
+        if (entry->t_shared)
+        {
+            cxt->shgeneration = generation;
+            cxt->shdbentry = dbent;
+            cxt->shdb_tabhash = tabhash;
+        }
+        else
+        {
+            cxt->mygeneration = generation;
+            cxt->mydbentry = dbent;
+            cxt->mydb_tabhash = tabhash;
+
+            /*
+             * We attach mydb tabhash once per flush.  This is our chance to
+             * update database-wide stats.
+             */
+            LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
+            dbent->n_xact_commit += pgStatXactCommit;
+            dbent->n_xact_rollback += pgStatXactRollback;
+            dbent->n_block_read_time += pgStatBlockReadTime;
+            dbent->n_block_write_time += pgStatBlockWriteTime;
+            LWLockRelease(&dbent->lock);
+            pgStatXactCommit = 0;
+            pgStatXactRollback = 0;
+            pgStatBlockReadTime = 0;
+            pgStatBlockWriteTime = 0;
+        }
+    }
+    else if (entry->t_shared)
+    {
+        dbent = cxt->shdbentry;
+        tabhash = cxt->shdb_tabhash;
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        dbent = cxt->mydbentry;
+        tabhash = cxt->mydb_tabhash;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    /*
+     * dbentry is always available here, so try flushing table stats first,
+     * then database stats.
+     */
+    if (pgstat_update_tabentry(tabhash, entry, nowait))
+    {
+        pgstat_update_dbentry(dbent, entry);
+        updated = true;
+    }
+
+    return updated;
 }
 
 /*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
+ * pgstat_flush_funcstats: Flushes function stats.
+ *
+ *  If nowait is true, returns false on lock failure and leaves some of the
+ *  entries alone in the local hash.
+ *
+ *  Returns true if all entries are flushed.
  */
-static void
-pgstat_send_funcstats(void)
+static bool
+pgstat_flush_funcstats(pgstat_flush_stat_context *cxt, bool nowait)
 {
     /* we assume this inits to all zeroes: */
     static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
+    dshash_table   *funchash;
     HASH_SEQ_STATUS fstat;
+    PgStat_BackendFunctionEntry *bestat;
 
+    /* nothing to do, just return  */
     if (pgStatFunctions == NULL)
-        return;
+        return true;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
-
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    /* get dbentry into cxt if not yet done */
+    if (cxt->mydbentry == NULL)
     {
-        PgStat_FunctionEntry *m_ent;
+        int op = PGSTAT_FETCH_EXCLUSIVE;
 
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
+        if (nowait)
+            op |= PGSTAT_FETCH_NOWAIT;
+
+        cxt->mydbentry = pgstat_get_db_entry(MyDatabaseId, op, NULL);
+
+        if (cxt->mydbentry == NULL)
+            return false;
+
+        cxt->mygeneration = pin_hashes(cxt->mydbentry);
+    }
+
+    funchash = attach_function_hash(cxt->mydbentry, cxt->mygeneration);
+    if (funchash == NULL)
+        return false;
+
+    have_function_stats = false;
+
+    /*
+     * Scan through the pgStatFunctions to find functions that actually have
+     * counts, and try flushing them out to shared statistics.
+     */
+    hash_seq_init(&fstat, pgStatFunctions);
+    while ((bestat = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    {
+        bool found;
+        PgStat_StatFuncEntry *funcent = NULL;
+
+        /* Skip it if no counts accumulated for it so far */
+        if (memcmp(&bestat->f_counts, &all_zeroes,
                    sizeof(PgStat_FunctionCounts)) == 0)
             continue;
 
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
+        funcent = (PgStat_StatFuncEntry *)
+            dshash_find_or_insert_extended(funchash, (void *) &(bestat->f_id),
+                                           &found, nowait);
 
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
+        /*
+         * We couldn't acquire lock on the required entry. Leave the local
+         * entry alone.
+         */
+        if (!funcent)
         {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
+            have_function_stats = true;
+            continue;
         }
 
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
+        /* Initialize if it's new, or add to it. */
+        if (!found)
+        {
+            funcent->functionid = bestat->f_id;
+            funcent->f_numcalls = bestat->f_counts.f_numcalls;
+            funcent->f_total_time =
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_total_time);
+            funcent->f_self_time =
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_self_time);
+        }
+        else
+        {
+            funcent->f_numcalls += bestat->f_counts.f_numcalls;
+            funcent->f_total_time +=
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_total_time);
+            funcent->f_self_time +=
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_self_time);
+        }
+        dshash_release_lock(funchash, funcent);
+
+        /* reset used counts */
+        MemSet(&bestat->f_counts, 0, sizeof(PgStat_FunctionCounts));
     }
 
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
+    return !have_function_stats;
 }
 
+/*
+ * pgstat_flush_miscstats: Flushes out miscellaneous stats.
+ *
+ *  If nowait is true, returns false on lock failure on dbentry.
+ *
+ *  Returns true if all the miscellaneous stats are flushed out.
+ */
+static bool
+pgstat_flush_miscstats(pgstat_flush_stat_context *cxt, bool nowait)
+{
+    /* get dbentry if not yet done */
+    if (cxt->mydbentry == NULL)
+    {
+        int op = PGSTAT_FETCH_EXCLUSIVE;
+        if (nowait)
+            op |= PGSTAT_FETCH_NOWAIT;
+
+        cxt->mydbentry = pgstat_get_db_entry(MyDatabaseId, op, NULL);
+
+        /* Lock failure, return. */
+        if (cxt->mydbentry == NULL)
+            return false;
+
+        cxt->mygeneration = pin_hashes(cxt->mydbentry);
+    }
+
+    LWLockAcquire(&cxt->mydbentry->lock, LW_EXCLUSIVE);
+    if (have_recovery_conflicts)
+        pgstat_flush_recovery_conflict(cxt->mydbentry);
+    if (n_deadlocks != 0)
+        pgstat_flush_deadlock(cxt->mydbentry);
+    if (n_tmpfiles != 0)
+        pgstat_flush_tempfile(cxt->mydbentry);
+    LWLockRelease(&cxt->mydbentry->lock);
+
+    return true;
+}
 
 /* ----------
  * pgstat_vacuum_stat() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *    Remove stats entries for objects that no longer exist.
  * ----------
  */
 void
 pgstat_vacuum_stat(void)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
+    HTAB       *oidtab;
+    dshash_table *dshtable;
+    dshash_seq_status dshstat;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatFuncEntry *funcentry;
-    int            len;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* we don't collect statistics in standalone mode */
+    if (!IsUnderPostmaster)
         return;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* If not done for this transaction, take a snapshot of stats */
+    pgstat_snapshot_global_stats();
 
     /*
      * Read pg_database and make a list of OIDs of all existing databases
      */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    oidtab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
 
     /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
+     * Search the database hash table for dead databases and drop them
+     * from the hash.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+
+    dshash_seq_init(&dshstat, pgStatDBHash, false, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            dbid = dbentry->databaseid;
 
@@ -1056,137 +1115,77 @@ pgstat_vacuum_stat(void)
 
         /* the DB entry for shared tables (with InvalidOid) is never dropped */
         if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
+            hash_search(oidtab, (void *) &dbid, HASH_FIND, NULL) == NULL)
             pgstat_drop_database(dbid);
     }
 
     /* Clean up */
-    hash_destroy(htab);
+    hash_destroy(oidtab);
 
     /*
      * Lookup our own database entry; if not found, nothing more to do.
      */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, NULL);
+    if (!dbentry)
         return;
 
     /*
      * Similarly to above, make a list of all known relations in this DB.
      */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
+    oidtab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
 
     /*
      * Check for all tables listed in stats hashtable if they still exist.
+     * Stats cache is useless here so directly search the shared hash.
      */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+    dshtable = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&dshstat, dshtable, false, true);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            tabid = tabentry->tableid;
 
         CHECK_FOR_INTERRUPTS();
 
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+        if (hash_search(oidtab, (void *) &tabid, HASH_FIND, NULL) != NULL)
             continue;
 
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
-    }
-
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
+        /* Not there, so purge this table */
+        dshash_delete_entry(dshtable, tabentry);
     }
+    dshash_detach(dshtable);
 
     /* Clean up */
-    hash_destroy(htab);
+    hash_destroy(oidtab);
 
     /*
      * Now repeat the above steps for functions.  However, we needn't bother
      * in the common case where no function stats are being collected.
      */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
+    if (dbentry->functions != DSM_HANDLE_INVALID)
     {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
+        dshtable =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        oidtab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
 
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
+        dshash_seq_init(&dshstat, dshtable, false, true);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&dshstat)) != NULL)
         {
             Oid            funcid = funcentry->functionid;
 
             CHECK_FOR_INTERRUPTS();
 
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
+            if (hash_search(oidtab, (void *) &funcid, HASH_FIND, NULL) != NULL)
                 continue;
 
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
+            /* Not there, so remove this function */
+            dshash_delete_entry(dshtable, funcentry);
         }
 
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
+        hash_destroy(oidtab);
 
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
+        dshash_detach(dshtable);
     }
+    dshash_release_lock(pgStatDBHash, dbentry);
 }
 
 
@@ -1244,62 +1243,57 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
+ *    Remove entry for the database that we just dropped.
+ *
+ *    If some stats are flushed after this, this entry will be re-created but we
+ *    will still clean the dead DB eventually via future invocations of
+ *    pgstat_vacuum_stat().
  * ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    Assert(OidIsValid(databaseid));
+    Assert(pgStatDBHash);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Lookup the database in the hashtable with exclusive lock.
+     */
+    dbentry = pgstat_get_db_entry(databaseid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    /*
+     * If found, remove it (along with its table and function hashes).
+     */
+    if (dbentry)
+    {
+        LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+        Assert(dbentry->refcnt == 0);
+
+        /* No one should be using this database, so it's safe to drop all. */
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+        }
+        LWLockRelease(&dbentry->lock);
+
+        dshash_delete_entry(pgStatDBHash, (void *) dbentry);
+    }
 }
 
-
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
-
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
-}
-#endif                            /* NOT_USED */
-
-
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1308,20 +1302,31 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    PgStat_StatDBEntry       *dbentry;
+    PgStat_TableLookupState status;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    Assert(pgStatDBHash);
+
+    /*
+     * Lookup the database in the hashtable.  Nothing to do if not there.
+     */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, &status);
+
+    if (!dbentry)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    /* This database is active, safe to release the lock immediately. */
+    dshash_release_lock(pgStatDBHash, dbentry);
+
+    /* Reset database-level stats. */
+    reset_dbentry_counters(dbentry);
+
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1330,29 +1335,35 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
+    /* Reset the archiver statistics for the cluster. */
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
+    }
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+        /* Reset the global background writer statistics for the cluster. */
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    }
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
+    LWLockRelease(StatsLock);
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1361,17 +1372,42 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
+    int generation;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    if (!dbentry)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* This database is active, safe to release the lock immediately. */
+    generation = pin_hashes(dbentry);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Set the reset timestamp for the whole database */
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&dbentry->lock);
+
+    /* Remove object if it exists, ignore if not */
+    if (type == RESET_TABLE)
+    {
+        dshash_table *t = attach_table_hash(dbentry, generation);
+        dshash_delete_key(t, (void *) &objoid);
+        dshash_detach(t);
+    }
+
+    if (type == RESET_FUNCTION)
+    {
+        dshash_table *t = attach_function_hash(dbentry, generation);
+        if (t)
+        {
+            dshash_delete_key(t, (void *) &objoid);
+            dshash_detach(t);
+        }
+    }
+    unpin_hashes(dbentry, generation);
 }
 
 /* ----------
@@ -1385,48 +1421,83 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    Assert(pgStatDBHash);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    /*
+     * Store the last autovacuum time in the database's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+    dshash_release_lock(pgStatDBHash, dbentry);
 
-    pgstat_send(&msg, sizeof(msg));
+    ts = GetCurrentTimestamp();
+
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->last_autovac_time = ts;
+    LWLockRelease(&dbentry->lock);
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report about the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
+    int                    generation;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(pgStatDBHash);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hash table entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+    generation = pin_hashes(dbentry);
+    table = attach_table_hash(dbentry, generation);
+
+    tabentry = pgstat_get_tab_entry(table, tableoid, true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->vacuum_count++;
+    }
+    dshash_release_lock(table, tabentry);
+
+    dshash_detach(table);
+    unpin_hashes(dbentry, generation);
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report about the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1437,9 +1508,15 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table       *table;
+    int                    generation;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(pgStatDBHash);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
     /*
@@ -1468,114 +1545,217 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+    generation = pin_hashes(dbentry);
+    table = attach_table_hash(dbentry, generation);
+    tabentry = pgstat_get_tab_entry(table, RelationGetRelid(rel), true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    dshash_release_lock(table, tabentry);
+
+    dshash_detach(table);
+    unpin_hashes(dbentry, generation);
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupState status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(pgStatDBHash);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    have_recovery_conflicts = true;
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            n_conflict_startup_deadlock++;
+            break;
+    }
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    /* We got the lock, so flush the pending counts immediately */
+    pgstat_flush_recovery_conflict(dbentry);
+
+    dshash_release_lock(pgStatDBHash, dbentry);
+}
+
+/*
+ * flush recovery conflict stats
+ */
+static void
+pgstat_flush_recovery_conflict(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_conflict_tablespace    += n_conflict_tablespace;
+    dbentry->n_conflict_lock         += n_conflict_lock;
+    dbentry->n_conflict_snapshot    += n_conflict_snapshot;
+    dbentry->n_conflict_bufferpin    += n_conflict_bufferpin;
+    dbentry->n_conflict_startup_deadlock += n_conflict_startup_deadlock;
+
+    n_conflict_tablespace = 0;
+    n_conflict_lock = 0;
+    n_conflict_snapshot = 0;
+    n_conflict_bufferpin = 0;
+    n_conflict_startup_deadlock = 0;
+
+    have_recovery_conflicts = false;
 }
 
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupState status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(pgStatDBHash);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    n_deadlocks++;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    /* We got the lock, so flush the pending counts immediately */
+    pgstat_flush_deadlock(dbentry);
+
+    dshash_release_lock(pgStatDBHash, dbentry);
+}
+
+/*
+ * flush deadlock stats
+ */
+static void
+pgstat_flush_deadlock(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_deadlocks += n_deadlocks;
+    n_deadlocks = 0;
 }
 
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupState status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(pgStatDBHash);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
-}
+    if (filesize > 0) /* Is there a case where filesize is really 0? */
+    {
+        n_tmpfilesize += filesize; /* needs an overflow check */
+        n_tmpfiles++;
+    }
 
-
-/* ----------
- * pgstat_ping() -
- *
- *    Send some junk data to the collector to increase traffic.
- * ----------
- */
-void
-pgstat_ping(void)
-{
-    PgStat_MsgDummy msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (n_tmpfiles == 0)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    /* We got the lock, so flush the pending counts immediately */
+    pgstat_flush_tempfile(dbentry);
+
+    dshash_release_lock(pgStatDBHash, dbentry);
 }
 
-/* ----------
- * pgstat_send_inquiry() -
- *
- *    Notify collector that we need fresh data.
- * ----------
+/*
+ * flush temporary file stats
  */
 static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
+pgstat_flush_tempfile(PgStat_StatDBEntry *dbentry)
 {
-    PgStat_MsgInquiry msg;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
+    dbentry->n_temp_bytes += n_tmpfilesize;
+    dbentry->n_temp_files += n_tmpfiles;
+    n_tmpfilesize = 0;
+    n_tmpfiles = 0;
+
 }
 
-
 /*
  * Initialize function call usage data.
  * Called by the executor before invoking a function.
@@ -1726,7 +1906,7 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    if (!pgstat_track_counts || !IsUnderPostmaster)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1745,6 +1925,24 @@ pgstat_initstats(Relation rel)
     rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
 }
 
+/*
+ * create_tabstat_hash - create local hash as transactional storage
+ */
+static HTAB *
+create_tabstat_hash(void)
+{
+    HASHCTL        ctl;
+
+    MemSet(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(Oid);
+    ctl.entrysize = sizeof(TabStatHashEntry);
+
+    return hash_create("pgstat TabStatusArray lookup hash table",
+                       TABSTAT_QUANTUM,
+                       &ctl,
+                       HASH_ELEM | HASH_BLOBS);
+}
+
 /*
  * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
  */
@@ -1760,18 +1958,7 @@ get_tabstat_entry(Oid rel_id, bool isshared)
      * Create hash table if we don't have it already.
      */
     if (pgStatTabHash == NULL)
-    {
-        HASHCTL        ctl;
-
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
-    }
+        pgStatTabHash = create_tabstat_hash();
 
     /*
      * Find an entry or create a new one.
@@ -2125,8 +2312,8 @@ AtEOXact_PgStat(bool isCommit)
     }
     pgStatXactStack = NULL;
 
-    /* Make sure any stats snapshot is thrown away */
-    pgstat_clear_snapshot();
+    /* mark that the next stats reference is the first in a transaction */
+    first_in_xact = true;
 }
 
 /* ----------
@@ -2307,8 +2494,8 @@ PostPrepare_PgStat(void)
     }
     pgStatXactStack = NULL;
 
-    /* Make sure any stats snapshot is thrown away */
-    pgstat_clear_snapshot();
+    /* mark that the next stats reference is the first in a transaction */
+    first_in_xact = true;
 }
 
 /*
@@ -2380,30 +2567,37 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
- * ----------
+ *    Find database stats entry on backends. The returned entries are cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_cxt cxt =
+    {
+        .hash_name = "local database stats hash",
+        .hash = NULL,
+        .hash_entsize = sizeof(PgStat_StatDBEntry_snapshot),
+        .dshash = NULL,
+        .dsh_handle = DSM_HANDLE_INVALID,
+        .dsh_params = &dsh_dbparams
+    };
 
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    /* If not done for this transaction, take a snapshot of global stats */
+    pgstat_snapshot_global_stats();
+
+    cxt.dshash = &pgStatDBHash;
+    cxt.hash = &pgStatDBEntrySnapshot;
+
+    /* the caller has no business with snapshot-local members */
+    return (PgStat_StatDBEntry *)
+        snapshot_statentry(&cxt, dbid);
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
@@ -2416,51 +2610,66 @@ pgstat_fetch_stat_dbentry(Oid dbid)
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* Lookup our database, then look in its table hash table. */
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    if (dbentry == NULL)
+        return NULL;
 
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    dbentry = pgstat_fetch_stat_dbentry(InvalidOid);
+    if (dbentry == NULL)
+        return NULL;
+
+    tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     return NULL;
 }
 
+/* ----------
+ * pgstat_fetch_stat_tabentry_extended() -
+ *
+ *    Find table stats entry on backends. The returned entries are cached until
+ *    transaction end or pgstat_clear_snapshot() is called.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_extended(PgStat_StatDBEntry *dbent, Oid reloid)
+{
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_cxt cxt =
+    {
+        .hash_name = "table stats snapshot hash",
+        .hash = NULL,
+        .hash_entsize = sizeof(PgStat_StatTabEntry),
+        .dshash = NULL,
+        .dsh_handle = DSM_HANDLE_INVALID,
+        .dsh_params = &dsh_tblparams
+    };
+    PgStat_StatDBEntry_snapshot *local_dbent;
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    /* dbent given to this function is an alias of PgStat_StatDBEntry_snapshot */
+    local_dbent = (PgStat_StatDBEntry_snapshot *)dbent;
+    cxt.hash = &local_dbent->snapshot_tables;
+    cxt.dshash = &local_dbent->dshash_tables;
+    cxt.dsh_handle = dbent->tables;
+
+    return (PgStat_StatTabEntry *)
+        snapshot_statentry(&cxt, reloid);
+}
+
 
 /* ----------
  * pgstat_fetch_stat_funcentry() -
@@ -2475,21 +2684,125 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     PgStat_StatDBEntry *dbentry;
     PgStat_StatFuncEntry *funcentry = NULL;
 
-    /* load the stats file if needed */
-    backend_read_statsfile();
-
-    /* Lookup our database, then find the requested function.  */
+    /* Lookup our database, then find the requested function */
     dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
+    if (dbentry == NULL)
+        return NULL;
+
+    funcentry = pgstat_fetch_stat_funcentry_extended(dbentry, func_id);
 
     return funcentry;
 }
 
+/* ----------
+ * pgstat_fetch_stat_funcentry_extended() -
+ *
+ *    Find function stats entry on backends. The returned entries are cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
+ */
+static PgStat_StatFuncEntry *
+pgstat_fetch_stat_funcentry_extended(PgStat_StatDBEntry *dbent, Oid funcid)
+{
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_cxt cxt =
+    {
+        .hash_name = "function stats snapshot hash",
+        .hash = NULL,
+        .hash_entsize = sizeof(PgStat_StatFuncEntry),
+        .dshash = NULL,
+        .dsh_handle = DSM_HANDLE_INVALID,
+        .dsh_params = &dsh_funcparams
+    };
+    PgStat_StatDBEntry_snapshot *local_dbent;
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    if (dbent->functions == DSM_HANDLE_INVALID)
+        return NULL;
+
+    /* dbent given to this function is an alias of PgStat_StatDBEntry_snapshot */
+    local_dbent = (PgStat_StatDBEntry_snapshot *)dbent;
+    cxt.hash = &local_dbent->snapshot_functions;
+    cxt.dshash = &local_dbent->dshash_functions;
+    cxt.dsh_handle = dbent->functions;
+
+    return (PgStat_StatFuncEntry *)
+        snapshot_statentry(&cxt, funcid);
+}
+
+/*
+ * pgstat_snapshot_global_stats() -
+ *
+ * Makes a snapshot of global stats if not done yet.  It is kept until a
+ * subsequent call of pgstat_clear_snapshot() or the end of the current
+ * memory context (typically TopTransactionContext).
+ */
+static void
+pgstat_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext;
+    TimestampTz update_time = 0;
+
+    /* The snapshot lives within CacheMemoryContext */
+    if (pgStatSnapshotContext == NULL)
+    {
+        pgStatSnapshotContext =
+            AllocSetContextCreate(CacheMemoryContext,
+                                  "Stats snapshot context",
+                                  ALLOCSET_DEFAULT_SIZES);
+    }
+
+    /*
+     * Shared stats are updated frequently, especially when many backends are
+     * running, but for performance reasons we don't want to rebuild the
+     * snapshot that often. Keep a snapshot for at least the minimum stats
+     * update interval of a backend. As a result, a snapshot may live across
+     * multiple transactions.
+     */
+    if (first_in_xact && IsTransactionState())
+    {
+        first_in_xact = false;
+        LWLockAcquire(StatsLock, LW_SHARED);
+        update_time = StatsShmem->last_update;
+        LWLockRelease(StatsLock);
+
+        if (snapshot_expires_at < update_time)
+        {
+            /* It is fine that this also expires the backend status snapshot */
+            pgstat_clear_snapshot();
+
+            snapshot_expires_at =
+                update_time + PGSTAT_STAT_MIN_INTERVAL * USECS_PER_SEC / 1000;
+        }
+    }
+
+    /* Nothing to do if already done */
+    if (snapshot_globalStats)
+        return;
+
+    Assert(snapshot_archiverStats == NULL);
+
+    oldcontext = MemoryContextSwitchTo(pgStatSnapshotContext);
+
+    /* global stats can be just copied  */
+    LWLockAcquire(StatsLock, LW_SHARED);
+    snapshot_globalStats = palloc(sizeof(PgStat_GlobalStats));
+    memcpy(snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+
+    snapshot_archiverStats = palloc(sizeof(PgStat_ArchiverStats));
+    memcpy(snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+    LWLockRelease(StatsLock);
+
+    /* set the timestamp of this snapshot */
+    snapshot_globalStats->stats_timestamp = update_time;
+
+    MemoryContextSwitchTo(oldcontext);
+
+    return;
+}
 
 /* ----------
  * pgstat_fetch_stat_beentry() -
@@ -2561,9 +2874,10 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &archiverStats;
+    return snapshot_archiverStats;
 }
 
 
@@ -2578,9 +2892,10 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &globalStats;
+    return snapshot_globalStats;
 }
 
 
@@ -2769,8 +3084,8 @@ pgstat_initialize(void)
         MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
     }
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* needs to be called before dsm shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -2856,9 +3171,6 @@ pgstat_bestart(void)
             case BgWriterProcess:
                 beentry->st_backendType = B_BG_WRITER;
                 break;
-            case ArchiverProcess:
-                beentry->st_backendType = B_ARCHIVER;
-                break;
             case CheckpointerProcess:
                 beentry->st_backendType = B_CHECKPOINTER;
                 break;
@@ -3231,7 +3543,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3498,9 +3811,6 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
-            break;
         case WAIT_EVENT_RECOVERY_WAL_ALL:
             event_name = "RecoveryWalAll";
             break;
@@ -4151,75 +4461,39 @@ pgstat_get_backend_desc(BackendType backendType)
  * ------------------------------------------------------------
  */
 
-
-/* ----------
- * pgstat_setheader() -
- *
- *        Set common header fields in a statistics message
- * ----------
- */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
-    {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
 /* ----------
  * pgstat_send_archiver() -
  *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
+ *        Report archiver statistics
  * ----------
  */
 void
 pgstat_send_archiver(const char *xlog, bool failed)
 {
-    PgStat_MsgArchiver msg;
-
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (failed)
+    {
+        /* Failed archival attempt */
+        ++shared_archiverStats->failed_count;
+        StrNCpy(shared_archiverStats->last_failed_wal, xlog,
+                sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = GetCurrentTimestamp();
+    }
+    else
+    {
+        /* Successful archival operation */
+        ++shared_archiverStats->archived_count;
+        StrNCpy(shared_archiverStats->last_archived_wal, xlog,
+                sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = GetCurrentTimestamp();
+    }
+    LWLockRelease(StatsLock);
 }
 
 /* ----------
  * pgstat_send_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
@@ -4228,6 +4502,8 @@ pgstat_send_bgwriter(void)
     /* We assume this initializes to zeroes */
     static const PgStat_MsgBgWriter all_zeroes;
 
+    PgStat_MsgBgWriter *s = &BgWriterStats;
+
     /*
      * This function can be called even if nothing at all has happened. In
      * this case, avoid sending a completely empty message to the stats
@@ -4236,11 +4512,18 @@ pgstat_send_bgwriter(void)
     if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += s->m_timed_checkpoints;
+    shared_globalStats->requested_checkpoints += s->m_requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += s->m_checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += s->m_checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += s->m_buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += s->m_buf_written_clean;
+    shared_globalStats->maxwritten_clean += s->m_maxwritten_clean;
+    shared_globalStats->buf_written_backend += s->m_buf_written_backend;
+    shared_globalStats->buf_fsync_backend += s->m_buf_fsync_backend;
+    shared_globalStats->buf_alloc += s->m_buf_alloc;
+    LWLockRelease(StatsLock);
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4249,298 +4532,164 @@ pgstat_send_bgwriter(void)
 }
 
 
-/* ----------
- * PgstatCollectorMain() -
+/*
+ * Pin and Unpin dbentry.
  *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
+ * To save memory and time, counters are reset by recreating the dshash
+ * instead of removing entries one by one while holding a whole-dshash lock.
+ * However, a dshash cannot be destroyed until all referrers have gone, so
+ * other backends could be kept waiting on a counter reset for a long time.
+ * To avoid that, we isolate the hashes under destruction as a previous
+ * "generation": no longer in use, but not yet removable.
+ *
+ * Before accessing the hashes of a dbentry, call pin_hashes() to acquire
+ * the current generation. unpin_hashes() removes the older generation's
+ * hashes once all referrers have gone.
  */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
+static int
+pin_hashes(PgStat_StatDBEntry *dbentry)
 {
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
+    int    generation;
 
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, pgstat_sighup_handler);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, pgstat_exit);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->refcnt++;
+    generation = dbentry->generation;
+    LWLockRelease(&dbentry->lock);
 
-    /*
-     * Identify myself via ps
-     */
-    init_ps_display("stats collector", "", "", "");
+    dshash_release_lock(pgStatDBHash, dbentry);
 
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do recognize got_SIGHUP inside
-     * the inner loop, which means that such interrupts will get serviced but
-     * the latch won't get cleared until next time there is a break in the
-     * action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (need_exit)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * need_exit becomes set.
-         */
-        while (!need_exit)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (got_SIGHUP)
-            {
-                got_SIGHUP = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry((PgStat_MsgInquiry *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat((PgStat_MsgTabstat *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge((PgStat_MsgTabpurge *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb((PgStat_MsgDropdb *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter((PgStat_MsgResetcounter *) &msg,
-                                             len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(
-                                                   (PgStat_MsgResetsharedcounter *) &msg,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(
-                                                   (PgStat_MsgResetsinglecounter *) &msg,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac((PgStat_MsgAutovacStart *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum((PgStat_MsgVacuum *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze((PgStat_MsgAnalyze *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver((PgStat_MsgArchiver *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter((PgStat_MsgBgWriter *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat((PgStat_MsgFuncstat *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge((PgStat_MsgFuncpurge *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict((PgStat_MsgRecoveryConflict *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock((PgStat_MsgDeadlock *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile((PgStat_MsgTempFile *) &msg, len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr & WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    exit(0);
-}
-
-
-/* SIGQUIT signal handler for collector process */
-static void
-pgstat_exit(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    need_exit = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
-/* SIGHUP handler for collector process */
-static void
-pgstat_sighup_handler(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    got_SIGHUP = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
+    return generation;
 }
 
 /*
- * Subroutine to clear stats in a database entry
+ * Unpin hashes in dbentry. If the given generation is isolated, destroy it
+ * once all referrers have gone. Otherwise just decrement the reference count.
+ */
+static void
+unpin_hashes(PgStat_StatDBEntry *dbentry, int generation)
+{
+    dshash_table *tables;
+    dshash_table *funcs = NULL;
+
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+
+    /* using current generation, just decrease refcount */
+    if (dbentry->generation == generation)
+    {
+        dbentry->refcnt--;
+        LWLockRelease(&dbentry->lock);
+        return;
+    }
+
+    /*
+     * It is isolated, waiting for all referrers to end.
+     */
+    Assert(dbentry->generation == generation + 1);
+
+    if (--dbentry->prev_refcnt > 0)
+    {
+        LWLockRelease(&dbentry->lock);
+        return;
+    }
+
+    /* no referrer remains, remove the hashes */
+    tables = dshash_attach(area, &dsh_tblparams, dbentry->prev_tables, 0);
+    if (dbentry->prev_functions != DSM_HANDLE_INVALID)
+        funcs = dshash_attach(area, &dsh_funcparams,
+                              dbentry->prev_functions, 0);
+
+    dbentry->prev_tables = DSM_HANDLE_INVALID;
+    dbentry->prev_functions = DSM_HANDLE_INVALID;
+
+    /* release the entry immediately */
+    LWLockRelease(&dbentry->lock);
+
+    dshash_destroy(tables);
+    if (funcs)
+        dshash_destroy(funcs);
+
+    return;
+}
+
+/*
+ * Attach to and return the specified generation of the table hash.
+ * The generation must be the current one or the one just before it.
+ */
+static dshash_table *
+attach_table_hash(PgStat_StatDBEntry *dbent, int gen)
+{
+    dshash_table *ret;
+
+    LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
+
+    if (dbent->generation == gen)
+        ret = dshash_attach(area, &dsh_tblparams, dbent->tables, 0);
+    else
+    {
+        Assert(dbent->generation == gen + 1);
+        Assert(dbent->prev_tables != DSM_HANDLE_INVALID);
+        ret = dshash_attach(area, &dsh_tblparams, dbent->prev_tables, 0);
+    }
+    LWLockRelease(&dbent->lock);
+
+    return ret;
+}
+
+/* attach the function hash of the given generation; NULL if not current */
+static dshash_table *
+attach_function_hash(PgStat_StatDBEntry *dbent, int gen)
+{
+    dshash_table *ret = NULL;
+
+
+    LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
+
+    if (dbent->generation == gen)
+    {
+        if (dbent->functions == DSM_HANDLE_INVALID)
+        {
+            dshash_table *funchash =
+                dshash_create(area, &dsh_funcparams, 0);
+            dbent->functions = dshash_get_hash_table_handle(funchash);
+
+            ret = funchash;
+        }
+        else
+            ret = dshash_attach(area, &dsh_funcparams, dbent->functions, 0);
+    }
+    /* don't bother creating useless hash */
+
+    LWLockRelease(&dbent->lock);
+
+    return ret;
+}
+
+static void
+init_dbentry(PgStat_StatDBEntry *dbentry)
+{
+    LWLockInitialize(&dbentry->lock, LWTRANCHE_STATS);
+    dbentry->generation = 0;
+    dbentry->refcnt = 0;
+    dbentry->prev_refcnt = 0;
+    dbentry->tables = DSM_HANDLE_INVALID;
+    dbentry->prev_tables = DSM_HANDLE_INVALID;
+    dbentry->functions = DSM_HANDLE_INVALID;
+    dbentry->prev_functions = DSM_HANDLE_INVALID;
+}
+
+/*
+ * Subroutine to reset stats in a shared database entry
  *
- * Tables and functions hashes are initialized to empty.
+ * All counters are reset. Table and function dshashes are destroyed.  If
+ * any backend is pinning this dbentry, the current dshashes are stashed in
+ * the previous "generation" until all accessors have gone. If the previous
+ * generation is already occupied, the current dshashes are fresh enough that
+ * they don't need to be cleared.
  */
 static void
 reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
 {
-    HASHCTL        hash_ctl;
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
 
     dbentry->n_xact_commit = 0;
     dbentry->n_xact_rollback = 0;
@@ -4563,23 +4712,808 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
     dbentry->n_block_read_time = 0;
     dbentry->n_block_write_time = 0;
 
+    if (dbentry->refcnt == 0)
+    {
+        /*
+         * No one is referring to the current hash. Removing individual
+         * entries from a dshash is very costly, so just destroy it.  If
+         * someone pins this entry just afterwards, pin_hashes returns the
+         * current generation and the attach waits for the LWLock held here.
+         */
+        dshash_table *tbl;
+
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+            dbentry->tables = DSM_HANDLE_INVALID;
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+            dbentry->functions = DSM_HANDLE_INVALID;
+        }
+    }
+    else if (dbentry->prev_refcnt == 0)
+    {
+        /*
+         * Someone is still referring to the current hash and previous slot is
+         * vacant. Stash out the current hash to the previous slot.
+         */
+        dbentry->prev_refcnt = dbentry->refcnt;
+        dbentry->prev_tables = dbentry->tables;
+        dbentry->prev_functions = dbentry->functions;
+        dbentry->refcnt = 0;
+        dbentry->tables = DSM_HANDLE_INVALID;
+        dbentry->functions = DSM_HANDLE_INVALID;
+        dbentry->generation++;
+    }
+    else
+    {
+        Assert(dbentry->prev_refcnt > 0 && dbentry->refcnt > 0);
+        /*
+         * If we get here, another reset request has arrived while the old
+         * hashes are still waiting for all referrers to release them. The
+         * previous reset must be very recent, so just ignore this request.
+         */
+    }
+
+    /* Create a new table hash if none exists */
+    if (dbentry->tables == DSM_HANDLE_INVALID)
+    {
+        dshash_table *tbl = dshash_create(area, &dsh_tblparams, 0);
+        dbentry->tables = dshash_get_hash_table_handle(tbl);
+        dshash_detach(tbl);
+    }
+
+    /* Likewise recreate the function hash if function tracking is on */
+    if (dbentry->functions == DSM_HANDLE_INVALID &&
+        pgstat_track_functions != TRACK_FUNC_OFF)
+    {
+        dshash_table *tbl = dshash_create(area, &dsh_funcparams, 0);
+        dbentry->functions = dshash_get_hash_table_handle(tbl);
+        dshash_detach(tbl);
+    }
+
     dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
 
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
+    LWLockRelease(&dbentry->lock);
+}
 
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
+/*
+ * Create the filename for a DB stat file; filename is the output buffer, of
+ * length len.
+ */
+static void
+get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
+{
+    int            printed;
+
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/db_%u.%s",
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
+                       databaseid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
+}
+
+/* ----------
+ * pgstat_write_statsfiles() -
+ *        Write the global statistics file, as well as DB files.
+ * ----------
+ */
+void
+pgstat_write_statsfiles(void)
+{
+    dshash_seq_status hstat;
+    PgStat_StatDBEntry *dbentry;
+    FILE       *fpout;
+    int32        format_id;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    int            rc;
+
+    /* should be called from postmaster  */
+    Assert(!IsUnderPostmaster);
+
+    elog(DEBUG2, "writing stats file \"%s\"", statfile);
+
+    /*
+     * Open the statistics temp file to write out the current values.
+     */
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        return;
+    }
+
+    /*
+     * Set the timestamp of the stats file.
+     */
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
+
+    /*
+     * Write the file header --- currently just a format ID.
+     */
+    format_id = PGSTAT_FILE_FORMAT_ID;
+    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write global stats struct
+     */
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write archiver stats struct
+     */
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Walk through the database table.
+     */
+    dshash_seq_init(&hstat, pgStatDBHash, false, false);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&hstat)) != NULL)
+    {
+        /*
+         * Write out the table and function stats for this DB into the
+         * appropriate per-DB stat file, if required.
+         */
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
+
+        pgstat_write_pgStatDBHashfile(dbentry);
+
+        /*
+         * Write out the DB entry. We don't write the tables or functions
+         * pointers, since they're of no use to any other process.
+         */
+        fputc('D', fpout);
+        rc = fwrite(dbentry,
+                    offsetof(PgStat_StatDBEntry, generation), 1, fpout);
+        (void) rc;                /* we'll check for error with ferror */
+    }
+
+    /*
+     * No more output to be done. Close the temp file and replace the old
+     * pgstat.stat with it.  The ferror() check replaces testing for error
+     * after each individual fputc or fwrite above.
+     */
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+}
+
+/* ----------
+ * pgstat_write_pgStatDBHashfile() -
+ *        Write the stat file for a single database.
+ * ----------
+ */
+static void
+pgstat_write_pgStatDBHashfile(PgStat_StatDBEntry *dbentry)
+{
+    dshash_seq_status tstat;
+    dshash_seq_status fstat;
+    PgStat_StatTabEntry *tabentry;
+    PgStat_StatFuncEntry *funcentry;
+    FILE       *fpout;
+    int32        format_id;
+    Oid            dbid = dbentry->databaseid;
+    int            rc;
+    char        tmpfile[MAXPGPATH];
+    char        statfile[MAXPGPATH];
+    dshash_table *tbl;
+
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
+
+    elog(DEBUG2, "writing stats file \"%s\"", statfile);
+
+    /*
+     * Open the statistics temp file to write out the current values.
+     */
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        return;
+    }
+
+    /*
+     * Write the file header --- currently just a format ID.
+     */
+    format_id = PGSTAT_FILE_FORMAT_ID;
+    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Walk through the database's access stats per table.
+     */
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&tstat, tbl, false, false);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&tstat)) != NULL)
+    {
+        fputc('T', fpout);
+        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
+        (void) rc;                /* we'll check for error with ferror */
+    }
+    dshash_detach(tbl);
+
+    /*
+     * Walk through the database's function stats table.
+     */
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_seq_init(&fstat, tbl, false, false);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&fstat)) != NULL)
+        {
+            fputc('F', fpout);
+            rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+            (void) rc;                /* we'll check for error with ferror */
+        }
+        dshash_detach(tbl);
+    }
+
+    /*
+     * No more output to be done. Close the temp file and replace the old
+     * per-database stats file with it.  The ferror() check replaces testing
+     * for error after each individual fputc or fwrite above.
+     */
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+}
+
+/* ----------
+ * pgstat_read_statsfiles() -
+ *
+ *    Reads in existing statistics collector files into the shared stats hash.
+ *
+ * ----------
+ */
+void
+pgstat_read_statsfiles(void)
+{
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatDBEntry dbbuf;
+    FILE       *fpin;
+    int32        format_id;
+    bool        found;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+
+    /* should be called from postmaster  */
+    Assert(!IsUnderPostmaster);
+
+    /*
+     * local cache lives in pgSharedStatsContext.
+     */
+    pgstat_setup_memcxt();
+
+    /*
+     * Create the DB hashtable and global stats area. No lock is needed since
+     * we're alone now.
+     */
+    pgstat_create_shared_stats();
+
+    /*
+     * Set the current timestamp (will be kept only in case we can't load an
+     * existing statsfile).
+     */
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp =
+        shared_globalStats->stat_reset_timestamp;
+
+    /*
+     * Try to open the stats file. If it doesn't exist, the backends simply
+     * return zero for anything and the collector simply starts from scratch
+     * with empty counters.
+     *
+     * ENOENT is a possibility if the stats collector is not running or has
+     * not yet written the stats file the first time.  Any other failure
+     * condition is suspicious.
+     */
+    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(LOG,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            statfile)));
+        return;
+    }
+
+    /*
+     * Verify it's of the expected format.
+     */
+    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
+        format_id != PGSTAT_FILE_FORMAT_ID)
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        goto done;
+    }
+
+    /*
+     * Read global stats struct
+     */
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+        sizeof(*shared_globalStats))
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+        goto done;
+    }
+
+    /*
+     * In the collector, disregard the timestamp we read from the permanent
+     * stats file; we should be willing to write a temp stats file immediately
+     * upon the first request from any backend.  This only matters if the old
+     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
+     * an unusual scenario.
+     */
+    shared_globalStats->stats_timestamp = 0;
+
+    /*
+     * Read archiver stats struct
+     */
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+        sizeof(*shared_archiverStats))
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        goto done;
+    }
+
+    /*
+     * We found an existing collector stats file. Read it and put all the
+     * hashtable entries into place.
+     */
+    for (;;)
+    {
+        switch (fgetc(fpin))
+        {
+                /*
+                 * 'D'    A PgStat_StatDBEntry struct describing a database
+                 * follows.
+                 */
+            case 'D':
+                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, generation),
+                          fpin) != offsetof(PgStat_StatDBEntry, generation))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                /*
+                 * Add to the DB hash
+                 */
+                dbentry = (PgStat_StatDBEntry *)
+                    dshash_find_or_insert(pgStatDBHash, (void *) &dbbuf.databaseid,
+                                          &found);
+
+                /* don't allow duplicate dbentries */
+                if (found)
+                {
+                    dshash_release_lock(pgStatDBHash, dbentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                /* initialize the new shared entry */
+                init_dbentry(dbentry);
+
+                memcpy(dbentry, &dbbuf,
+                       offsetof(PgStat_StatDBEntry, generation));
+
+                /* Read the data from the database-specific file. */
+                pgstat_read_pgStatDBHashfile(dbentry);
+                dshash_release_lock(pgStatDBHash, dbentry);
+                break;
+
+            case 'E':
+                goto done;
+
+            default:
+                ereport(LOG,
+                        (errmsg("corrupted statistics file \"%s\"",
+                                statfile)));
+                goto done;
+        }
+    }
+
+done:
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+
+    return;
+}
+
+
+/* ----------
+ * pgstat_read_pgStatDBHashfile() -
+ *
+ *    Reads in the at-rest per-database statistics file and creates the
+ *    shared table and function hashes. The file is removed after reading.
+ * ----------
+ */
+static void
+pgstat_read_pgStatDBHashfile(PgStat_StatDBEntry *dbentry)
+{
+    PgStat_StatTabEntry   *tabentry;
+    PgStat_StatTabEntry        tabbuf;
+    PgStat_StatFuncEntry    funcbuf;
+    PgStat_StatFuncEntry   *funcentry;
+    dshash_table           *tabhash = NULL;
+    dshash_table           *funchash = NULL;
+    FILE       *fpin;
+    int32        format_id;
+    bool        found;
+    char        statfile[MAXPGPATH];
+
+    /* should be called from the postmaster */
+    Assert(!IsUnderPostmaster);
+
+    get_dbstat_filename(false, dbentry->databaseid, statfile, MAXPGPATH);
+
+    /*
+     * Try to open the stats file. If it doesn't exist, the backends simply
+     * return zero for anything and the collector simply starts from scratch
+     * with empty counters.
+     *
+     * ENOENT is a possibility if the stats collector is not running or has
+     * not yet written the stats file the first time.  Any other failure
+     * condition is suspicious.
+     */
+    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(LOG,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            statfile)));
+        return;
+    }
+
+    /*
+     * Verify it's of the expected format.
+     */
+    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
+        format_id != PGSTAT_FILE_FORMAT_ID)
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        goto done;
+    }
+
+    /*
+     * We found an existing statistics file. Read it and put all the hashtable
+     * entries into place.
+     */
+    for (;;)
+    {
+        switch (fgetc(fpin))
+        {
+                /*
+                 * 'T'    A PgStat_StatTabEntry follows.
+                 */
+            case 'T':
+                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
+                          fpin) != sizeof(PgStat_StatTabEntry))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                if (tabhash == NULL)
+                {
+                    tabhash = dshash_create(area, &dsh_tblparams, 0);
+                    dbentry->tables =
+                        dshash_get_hash_table_handle(tabhash);
+                }
+
+                tabentry = (PgStat_StatTabEntry *)
+                    dshash_find_or_insert(tabhash,
+                                          (void *) &tabbuf.tableid, &found);
+
+                /* don't allow duplicate entries */
+                if (found)
+                {
+                    dshash_release_lock(tabhash, tabentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
+                dshash_release_lock(tabhash, tabentry);
+                break;
+
+                /*
+                 * 'F'    A PgStat_StatFuncEntry follows.
+                 */
+            case 'F':
+                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
+                          fpin) != sizeof(PgStat_StatFuncEntry))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                if (funchash == NULL)
+                {
+                    funchash = dshash_create(area, &dsh_tblparams, 0);
+                    dbentry->functions =
+                        dshash_get_hash_table_handle(funchash);
+                }
+
+                funcentry = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert(funchash,
+                                          (void *) &funcbuf.functionid, &found);
+
+                if (found)
+                {
+                    dshash_release_lock(funchash, funcentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
+                dshash_release_lock(funchash, funcentry);
+                break;
+
+                /*
+                 * 'E'    The EOF marker of a complete stats file.
+                 */
+            case 'E':
+                goto done;
+
+            default:
+                ereport(LOG,
+                        (errmsg("corrupted statistics file \"%s\"",
+                                statfile)));
+                goto done;
+        }
+    }
+
+done:
+    if (tabhash)
+        dshash_detach(tabhash);
+    if (funchash)
+        dshash_detach(funchash);
+
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+}
+
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ *    Create pgStatLocalContext and pgSharedStatsContext, if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+    if (!pgStatLocalContext)
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+    if (!pgSharedStatsContext)
+        pgSharedStatsContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Shared activity statistics",
+                                  ALLOCSET_SMALL_SIZES);
+}
+
+/* ----------
+ * pgstat_clear_snapshot() -
+ *
+ *    Discard any data collected in the current transaction.  Any subsequent
+ *    request will cause new snapshots to be read.
+ *
+ *    This is also invoked during transaction commit or abort to discard
+ *    the no-longer-wanted snapshot.
+ * ----------
+ */
+void
+pgstat_clear_snapshot(void)
+{
+    /* Release memory, if any was allocated */
+    if (pgStatLocalContext)
+    {
+        MemoryContextDelete(pgStatLocalContext);
+
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
+    }
+
+    if (pgStatSnapshotContext)
+    {
+        MemoryContextReset(pgStatSnapshotContext);
+
+        /* mark the resources as not allocated */
+        snapshot_globalStats = NULL;
+        snapshot_archiverStats = NULL;
+        pgStatDBEntrySnapshot = NULL;
+    }
+}
+
+static bool
+pgstat_update_tabentry(dshash_table *tabhash, PgStat_TableStatus *stat,
+                       bool nowait)
+{
+    PgStat_StatTabEntry *tabentry;
+    bool    found;
+
+    if (tabhash == NULL)
+        return false;
+
+    tabentry = (PgStat_StatTabEntry *)
+        dshash_find_or_insert_extended(tabhash, (void *) &(stat->t_id),
+                                       &found, nowait);
+
+    /* failed to acquire lock */
+    if (tabentry == NULL)
+        return false;
+
+    if (!found)
+    {
+        /*
+         * If it's a new table entry, initialize counters to the values we
+         * just got.
+         */
+        tabentry->numscans = stat->t_counts.t_numscans;
+        tabentry->tuples_returned = stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched = stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted = stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated = stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted = stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated = stat->t_counts.t_tuples_hot_updated;
+        tabentry->n_live_tuples = stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples = stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze = stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched = stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit = stat->t_counts.t_blocks_hit;
+
+        tabentry->vacuum_timestamp = 0;
+        tabentry->vacuum_count = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_count = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->analyze_count = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+        tabentry->autovac_analyze_count = 0;
+    }
+    else
+    {
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        tabentry->numscans += stat->t_counts.t_numscans;
+        tabentry->tuples_returned += stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched += stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted += stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated += stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted += stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated += stat->t_counts.t_tuples_hot_updated;
+        /* If table was truncated, first reset the live/dead counters */
+        if (stat->t_counts.t_truncated)
+        {
+            tabentry->n_live_tuples = 0;
+            tabentry->n_dead_tuples = 0;
+        }
+        tabentry->n_live_tuples += stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples += stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze += stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched += stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit += stat->t_counts.t_blocks_hit;
+    }
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
+
+    dshash_release_lock(tabhash, tabentry);
+
+    return true;
+}
+
+static void
+pgstat_update_dbentry(PgStat_StatDBEntry *dbentry, PgStat_TableStatus *stat)
+{
+    /*
+     * Add per-table stats to the per-database entry, too.
+     */
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
+    dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
+    dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
+    dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
+    dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
+    dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
+    dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
+    LWLockRelease(&dbentry->lock);
 }
 
 /*
@@ -4588,47 +5522,79 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
  * Else, return NULL.
  */
 static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
+pgstat_get_db_entry(Oid databaseid, int op, PgStat_TableLookupState *status)
 {
     PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
+    bool        nowait = ((op & PGSTAT_FETCH_NOWAIT) != 0);
+    bool        lock_acquired = true;
+    bool        found = true;
 
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
+    if (!IsUnderPostmaster)
         return NULL;
 
-    /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
-     */
-    if (!found)
-        reset_dbentry_counters(result);
+    Assert(pgStatDBHash);
+
+    /* Lookup or create the hash table entry for this database */
+    if (op & PGSTAT_FETCH_EXCLUSIVE)
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_or_insert_extended(pgStatDBHash, &databaseid,
+                                           &found, nowait);
+        if (result == NULL)
+            lock_acquired = false;
+        else if (!found)
+        {
+            /*
+             * If not found, initialize the new one.  This creates empty hash
+             * tables, too.
+             */
+            init_dbentry(result);
+            reset_dbentry_counters(result);
+        }
+    }
+    else
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_extended(pgStatDBHash, &databaseid, true, nowait,
+                                 nowait ? &lock_acquired : NULL);
+        if (result == NULL)
+            found = false;
+    }
+
+    /* Set return status if requested */
+    if (status)
+    {
+        if (!lock_acquired)
+        {
+            Assert(nowait);
+            *status = PGSTAT_ENTRY_LOCK_FAILED;
+        }
+        else if (!found)
+            *status = PGSTAT_ENTRY_NOT_FOUND;
+        else
+            *status = PGSTAT_ENTRY_FOUND;
+    }
 
     return result;
 }
 
-
 /*
  * Lookup the hash table entry for the specified table. If no hash
  * table entry exists, initialize it, if the create parameter is true.
  * Else, return NULL.
  */
 static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create)
 {
     PgStat_StatTabEntry *result;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
 
     /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
+    if (create)
+        result = (PgStat_StatTabEntry *)
+            dshash_find_or_insert(table, &tableoid, &found);
+    else
+        result = (PgStat_StatTabEntry *) dshash_find(table, &tableoid, false);
 
     if (!create && !found)
         return NULL;
@@ -4661,1685 +5627,6 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
     return result;
 }
 
-
-/* ----------
- * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
- * ----------
- */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
-{
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    FILE       *fpout;
-    int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-    int            rc;
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Set the timestamp of the stats file.
-     */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Write global stats struct
-     */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Write archiver stats struct
-     */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database table.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        /*
-         * Write out the table and function stats for this DB into the
-         * appropriate per-DB stat file, if required.
-         */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
-
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
-
-        /*
-         * Write out the DB entry. We don't write the tables or functions
-         * pointers, since they're of no use to any other process.
-         */
-        fputc('D', fpout);
-        rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
-}
-
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
-
-/* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
- * ----------
- */
-static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
-{
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpout;
-    int32        format_id;
-    Oid            dbid = dbentry->databaseid;
-    int            rc;
-    char        tmpfile[MAXPGPATH];
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database's access stats per table.
-     */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
-    {
-        fputc('T', fpout);
-        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * Walk through the database's function stats table.
-     */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_statsfiles() -
- *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
- *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
- * ----------
- */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
-
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-
-    /*
-     * Set the current timestamp (will be kept only in case we can't load an
-     * existing statsfile).
-     */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return dbhash;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
-        goto done;
-    }
-
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Add to the DB hash
-                 */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-
-    return dbhash;
-}
-
-
-/* ----------
- * pgstat_read_db_statsfile() -
- *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
- * ----------
- */
-static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
-{
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatTabEntry tabbuf;
-    PgStat_StatFuncEntry funcbuf;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'T'    A PgStat_StatTabEntry follows.
-                 */
-            case 'T':
-                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
-                          fpin) != sizeof(PgStat_StatTabEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
-                break;
-
-                /*
-                 * 'F'    A PgStat_StatFuncEntry follows.
-                 */
-            case 'F':
-                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
-                          fpin) != sizeof(PgStat_StatFuncEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
-                break;
-
-                /*
-                 * 'E'    The EOF marker of a complete stats file.
-                 */
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
-}
-
-
-/* ----------
- * pgstat_setup_memcxt() -
- *
- *    Create pgStatLocalContext, if not already done.
- * ----------
- */
-static void
-pgstat_setup_memcxt(void)
-{
-    if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
-}
-
-
-/* ----------
- * pgstat_clear_snapshot() -
- *
- *    Discard any data collected in the current transaction.  Any subsequent
- *    request will cause new snapshots to be read.
- *
- *    This is also invoked during transaction commit or abort to discard
- *    the no-longer-wanted snapshot.
- * ----------
- */
-void
-pgstat_clear_snapshot(void)
-{
-    /* Release memory, if any was allocated */
-    if (pgStatLocalContext)
-        MemoryContextDelete(pgStatLocalContext);
-
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetshared() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
-}
-
 /*
  * Convert a potentially unsafely truncated activity string (see
  * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 820f356038..369d6dde63 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -255,7 +255,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -502,7 +501,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1302,12 +1300,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1756,11 +1748,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2595,8 +2582,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -2927,8 +2912,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -2995,13 +2978,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3076,22 +3052,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3550,22 +3510,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3761,8 +3705,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3801,8 +3743,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4003,8 +3944,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -4977,18 +4916,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5101,12 +5028,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -5976,7 +5897,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6029,8 +5949,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6262,7 +6180,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 5965d3620f..97bca9be24 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -150,6 +150,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -281,8 +282,13 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 
     /* Initialize dynamic shared memory facilities. */
     if (!IsUnderPostmaster)
+    {
         dsm_postmaster_startup(shim);
 
+        /* Stats collector uses dynamic shared memory  */
+        StatsShmemInit();
+    }
+
     /*
      * Now give loadable modules a chance to set up their shmem allocations
      */
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 81dac45ae5..004cb26f63 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -521,6 +521,7 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_TBM, "tbm");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
+    LWLockRegisterTranche(LWTRANCHE_STATS, "activity stats");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8b4d94c9a1..fc265a507e 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3159,6 +3159,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3733,6 +3739,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4173,9 +4180,17 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
-                ProcessCompletedNotifies();
-                pgstat_report_stat(false);
+                long stats_timeout;
 
+                ProcessCompletedNotifies();
+
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4210,7 +4225,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4218,6 +4233,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index fd51934aaf..994351ac2d 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index a5ee209f91..5e67b25e18 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -72,6 +72,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -628,6 +629,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1238,6 +1241,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
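Note on the timeout changes above (postgres.c, globals.c, postinit.c): the handler only sets flags, and the actual stats flush happens later, when ProcessInterrupts() runs at a safe point. The following is a minimal standalone sketch of that deferred-work pattern, not the patch's code. The lower-case names and the stats_flush_count counter are hypothetical stand-ins for IdleStatsUpdateTimeoutPending, InterruptPending and the call to pgstat_report_stat(true); latch handling is omitted.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Flags written by the handler; hypothetical stand-ins for the patch's globals. */
static volatile sig_atomic_t idle_stats_update_timeout_pending = false;
static volatile sig_atomic_t interrupt_pending = false;

/* Hypothetical counter standing in for the real call to pgstat_report_stat(true). */
static int stats_flush_count = 0;

/*
 * Timeout handler: it only records that work is due.  Nothing expensive or
 * non-reentrant is done here, because it may run in signal context.
 */
void
idle_stats_update_timeout_handler(void)
{
    idle_stats_update_timeout_pending = true;
    interrupt_pending = true;
}

/*
 * Called by the main loop at a point where it is safe to do real work.
 * Each pending flag is cleared before the deferred action runs, so a
 * timeout that fires during the action is not lost.
 */
void
process_interrupts(void)
{
    if (!interrupt_pending)
        return;
    interrupt_pending = false;

    if (idle_stats_update_timeout_pending)
    {
        idle_stats_update_timeout_pending = false;
        stats_flush_count++;    /* the patch calls pgstat_report_stat(true) here */
    }
}

/* Helper for callers that want to check the invariant after processing. */
void
check_no_pending_work(void)
{
    assert(!interrupt_pending);
    assert(!idle_stats_update_timeout_pending);
}
```

In this sketch, calling idle_stats_update_timeout_handler() and then process_interrupts() performs exactly one flush; a second process_interrupts() call with no new timeout does nothing.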
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 33869fecc9..8939758c59 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 63a7653457..c67331138b 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 3324be8a81..cb70d00b8f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL statistics collector facility.
  *
  *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
  *
@@ -14,10 +14,11 @@
 #include "datatype/timestamp.h"
 #include "fmgr.h"
 #include "libpq/pqcomm.h"
-#include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -41,32 +42,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -115,13 +90,6 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_blocks_hit;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -180,236 +148,12 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
 /* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
- */
-typedef struct PgStat_MsgHdr
-{
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
+ * PgStat_MsgBgWriter            bgwriter statistics
  * ----------
  */
 typedef struct PgStat_MsgBgWriter
 {
-    PgStat_MsgHdr m_hdr;
-
     PgStat_Counter m_timed_checkpoints;
     PgStat_Counter m_requested_checkpoints;
     PgStat_Counter m_buf_written_checkpoints;
@@ -422,38 +166,14 @@ typedef struct PgStat_MsgBgWriter
     PgStat_Counter m_checkpoint_sync_time;
 } PgStat_MsgBgWriter;
 
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
-
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to apply.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when applying to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -485,81 +205,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgAutovacStart msg_autovacuum;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Statistic collector data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -568,6 +215,12 @@ typedef union PgStat_Msg
 
 #define PGSTAT_FILE_FORMAT_ID    0x01A5BC9D
 
+typedef struct PgStat_DSHash
+{
+    int        refcnt;
+    dshash_table_handle handle;
+} PgStat_DSHash;
+
 /* ----------
  * PgStat_StatDBEntry            The collector's data per database
  * ----------
@@ -597,17 +250,22 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_block_write_time;
 
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;        /* time of db stats  update */
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following members must be last in the struct, because we don't
+     * write them out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    int generation;                        /* current generation of the below */
+    int    refcnt;                            /* current gen reference count */
+    dshash_table_handle tables;            /* current gen tables hash */
+    dshash_table_handle functions;        /* current gen functions hash */
+    int    prev_refcnt;                    /* prev gen reference count */
+    dshash_table_handle prev_tables;    /* prev gen tables hash */
+    dshash_table_handle prev_functions;    /* prev gen functions hash */
+    LWLock    lock;                        /* Lock for the above members */
 } PgStat_StatDBEntry;
 
-
 /* ----------
  * PgStat_StatTabEntry            The collector's data per table (or index)
  * ----------
@@ -645,7 +303,7 @@ typedef struct PgStat_StatTabEntry
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            per function stats data
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
@@ -660,7 +318,7 @@ typedef struct PgStat_StatFuncEntry
 
 
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -676,7 +334,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -762,7 +420,6 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
     WAIT_EVENT_RECOVERY_WAL_ALL,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
@@ -1146,6 +803,8 @@ extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used, but will be removed with GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
@@ -1167,31 +826,29 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern void pgstat_initialize(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
 
+/* File input/output functions  */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
-extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
-
+extern void pgstat_reset_shared_counters(const char *target);
+extern void pgstat_reset_single_counter(Oid objectid,
+                                        PgStat_Single_Reset_Type type);
 extern void pgstat_report_autovac(Oid dboid);
 extern void pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples);
@@ -1202,26 +859,10 @@ extern void pgstat_report_analyze(Relation rel,
 extern void pgstat_report_recovery_conflict(int reason);
 extern void pgstat_report_deadlock(void);
 
-extern void pgstat_initialize(void);
 extern void pgstat_bestart(void);
-
 extern void pgstat_report_activity(BackendState state, const char *cmd_str);
-extern void pgstat_report_tempfile(size_t filesize);
-extern void pgstat_report_appname(const char *appname);
-extern void pgstat_report_xact_timestamp(TimestampTz tstamp);
-extern const char *pgstat_get_wait_event(uint32 wait_event_info);
-extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
-extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
-extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
-                                    int buflen);
-extern const char *pgstat_get_backend_desc(BackendType backendType);
 
-extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
-                              Oid relid);
-extern void pgstat_progress_update_param(int index, int64 val);
-extern void pgstat_progress_update_multi_param(int nparam, const int *index,
-                                   const int64 *val);
-extern void pgstat_progress_end_command(void);
+extern void pgstat_report_tempfile(size_t filesize);
 
 extern PgStat_TableStatus *find_tabstat_entry(Oid rel_id);
 extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
@@ -1351,18 +992,38 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 extern void pgstat_send_archiver(const char *xlog, bool failed);
 extern void pgstat_send_bgwriter(void);
 
+
+
+extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid relid);
+
+extern void pgstat_report_appname(const char *appname);
+extern void pgstat_report_xact_timestamp(TimestampTz tstamp);
+extern const char *pgstat_get_wait_event(uint32 wait_event_info);
+extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
+extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
+extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
+                                    int buflen);
+extern const char *pgstat_get_backend_desc(BackendType backendType);
+
+extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
+                              Oid relid);
+extern void pgstat_progress_update_param(int index, int64 val);
+extern void pgstat_progress_update_multi_param(int nparam, const int *index,
+                                   const int64 *val);
+extern void pgstat_progress_end_command(void);
 /* ----------
  * Support functions for the SQL-callable functions to
  * generate the pgstat* views.
  * ----------
  */
-extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
+extern void pgstat_clear_snapshot(void);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
-extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
-extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_extended(PgStat_StatDBEntry *dbent, Oid relid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
-extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
 
+extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
+extern int    pgstat_fetch_stat_numbackends(void);
+extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 96c7732006..e2277d67a3 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -219,6 +219,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SHARED_TUPLESTORE,
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 9244a2a7b7..a9b625211b 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.16.3

From 969528ea377452681fe38af9ce797eeb68742af6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 27 Nov 2018 14:42:12 +0900
Subject: [PATCH 6/6] Remove the GUC stats_temp_directory

The GUC used to specify the directory to store temporary statistics
files. It is no longer needed by the stats collector but is still used
by programs in bin and contrib, and maybe by other extensions. Thus this
patch removes the GUC, but some backing variables and macro definitions
are left alone for backward compatibility.
---
 doc/src/sgml/backup.sgml                      |  2 --
 doc/src/sgml/config.sgml                      | 19 -------------
 doc/src/sgml/monitoring.sgml                  |  7 +----
 doc/src/sgml/storage.sgml                     |  3 +-
 src/backend/postmaster/pgstat.c               | 13 ++++-----
 src/backend/replication/basebackup.c          | 13 ++-------
 src/backend/utils/misc/guc.c                  | 41 ---------------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/pgstat.h                          |  5 +++-
 9 files changed, 14 insertions(+), 90 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index a73fd4d044..95285809c2 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1119,8 +1119,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8bd57f376b..79f704cc99 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6709,25 +6709,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 83a0be1965..bf94bd87d2 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -195,12 +195,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
+   <productname>PostgreSQL</productname> processes through shared memory.
    When the server shuts down cleanly, a permanent copy of the statistics
    data is stored in the <filename>pg_stat</filename> subdirectory, so that
    statistics can be retained across server restarts.  When recovery is
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index cbdad0c3fb..133eb3ff19 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -122,8 +122,7 @@ Item
 
 <row>
  <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
+ <entry>Subdirectory containing ephemeral files for extensions</entry>
 </row>
 
 <row>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ba4c274f9a..8f8ac0b356 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -102,15 +102,12 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
+/*
+ * This used to be a GUC variable and is no longer used in this file, but is
+ * left alone, holding the default value, just for backward compatibility
+ * with extensions.
  */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+char       *pgstat_stat_directory = PG_STAT_TMP_DIR;
 
 LWLock        StatsMainLock;
 #define        StatsLock (&StatsMainLock)
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index def6c03dd0..58ba33e822 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -230,11 +230,8 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -265,13 +262,9 @@ perform_base_backup(basebackup_options *opt)
          * Calculate the relative path of temporary statistics directory in
          * order to skip the files which are located in that directory later.
          */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
+
+        Assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
+        statrelpath = psprintf("./%s", PG_STAT_TMP_DIR);
 
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 156d147c85..058d97075f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -191,7 +191,6 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static void assign_effective_io_concurrency(int newval, void *extra);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -3988,17 +3987,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11002,35 +10990,6 @@ assign_effective_io_concurrency(int newval, void *extra)
 #endif                            /* USE_PREFETCH */
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 194f312096..fdb088dbfd 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -554,7 +554,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index cb70d00b8f..265ed4d118 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -31,7 +31,10 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
+/*
+ * This used to be the directory to store temporary statistics data in but is
+ * no longer used. Defined here for backward compatibility.
+ */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
 
 /* Values for track_functions GUC variable --- order is significant! */
-- 
2.16.3


Re: shared-memory based stats collector

From
Arthur Zakirov
Date:
Hello,

On 21.02.2019 10:05, Kyotaro HORIGUCHI wrote:
> Done. This verison 16 looks as if the moving and splitting were
> not happen. Major changes are:
> 
>   - Restored old pgstats_* names. This largily shrinks the patch
>     size to less than a half lines of v15.  More than that, it
>     gets easier to examine differences. (checkpointer.c and
>     bgwriter.c have a bit stale comments but it is an issue for
>     later.)
> 
>   - Removed "oneshot" feature at all. This simplifies pgstat API
>     and let this patch far less simple.
> 
>   - Moved StatsLock to LWTRANCHE_STATS, which is not necessary to
>     be in the main tranche.
> 
>   - Fixed several bugs revealed by the shrinked size of the patch.

I ran the regression tests. Unfortunately they didn't pass; the failing
test is 'rangetypes':

rangetypes ... FAILED (test process exited with exit code 2)

It seems to me that an autovacuum process terminates because of a segfault.

The segfault occurs within get_pgstat_tabentry_relid(). If I'm not mistaken,
'dbentry' somehow no longer holds a valid pointer.

'dbentry' is obtained by this line in do_autovacuum():

dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);

'dbentry' becomes invalid after calling pgstat_vacuum_stat().

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: shared-memory based stats collector

From
Kyotaro HORIGUCHI
Date:
Hello, Arthur.

At Thu, 21 Feb 2019 17:30:50 +0300, Arthur Zakirov <a.zakirov@postgrespro.ru> wrote in
<db346d14-4130-57a5-5f46-9a57e9982bec@postgrespro.ru>
> Hello,
> 
> On 21.02.2019 10:05, Kyotaro HORIGUCHI wrote:
> > Done. This verison 16 looks as if the moving and splitting were
> > not happen. Major changes are:
> >   - Restored old pgstats_* names. This largily shrinks the patch
> >     size to less than a half lines of v15.  More than that, it
> >     gets easier to examine differences. (checkpointer.c and
> >     bgwriter.c have a bit stale comments but it is an issue for
> >     later.)
> >   - Removed "oneshot" feature at all. This simplifies pgstat API
> >     and let this patch far less simple.
> >   - Moved StatsLock to LWTRANCHE_STATS, which is not necessary to
> >     be in the main tranche.
> >   - Fixed several bugs revealed by the shrinked size of the patch.
> 
> I run regression tests. Unfortunately tests didn't pass, failed test
> is 'rangetypes':
> 
> rangetypes ... FAILED (test process exited with exit code 2)
>
> It seems to me that an autovacuum process terminates because of
> segfault.
> 
> Segfault occurs within get_pgstat_tabentry_relid(). If I'm not
> mistaken, somehow 'dbentry' hasn't valid pointer anymore.
> 
> 'dbentry' is get in the line in do_autovacuum():
> 
> dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
> 
> 'dbentry' becomes invalid after calling pgstat_vacuum_stat().

Thank you very much for the report. I haven't seen the error, but
I think you gave me enough information about the issue. I'll try to
reproduce it.

I found another problem: the commit_ts test reliably fails because of
dshash corruption in the startup process. I haven't found the cause yet
and will investigate it, too.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: shared-memory based stats collector

From
Kyotaro HORIGUCHI
Date:
Hello.

At Fri, 22 Feb 2019 17:19:56 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20190222.171956.98584931.horiguchi.kyotaro@lab.ntt.co.jp>
> > It seems to me that an autovacuum process terminates because of
> > segfault.
> > 
> > Segfault occurs within get_pgstat_tabentry_relid(). If I'm not
> > mistaken, somehow 'dbentry' hasn't valid pointer anymore.

do_autovacuum does the following:

  dbentry = pgstat_fetch_stat_dbentry()  -- create cached dbentry
  StartTransactionCommand()    -- starts transaction
  pgstat_vacuum_stat()         -- blows away the cached dbentry.
  shared = pgstat_fetch_stat_dbentry()

This was harmless previously, but now the pgstat_* functions blow away
the local cache at the first call after a transaction starts. As a
result, dbentry becomes invalid. The reason I didn't see the same
crash is that the second pgstat_fetch_stat_dbentry() accidentally
zeroes out the already-invalidated dbentry.

It is fixed by moving StartTransactionCommand() to before the first
pgstat_fetch_stat_dbentry(), which looks better anyway because the
problem cannot arise in that order.
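
The ordering problem described above can be sketched with a toy model.
This is only an illustration under stated assumptions, not the patch's
code: every toy_* name, the DbEntry struct and the generation counter
are hypothetical stand-ins for the backend-local snapshot that is
discarded at the first pgstat call after a transaction starts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a cached per-database stats entry. */
typedef struct DbEntry
{
    unsigned int dbid;
    long         n_commits;
} DbEntry;

static DbEntry *snapshot = NULL;        /* backend-local cache */
static bool first_call_in_xact = false; /* set at transaction start */
static int  snapshot_generation = 0;    /* bumped whenever the cache is freed */

/* Toy counterpart of StartTransactionCommand(). */
void
toy_start_transaction(void)
{
    first_call_in_xact = true;
}

/* Discard the cache at the first stats call after a transaction start. */
static void
toy_clear_snapshot_if_needed(void)
{
    if (first_call_in_xact)
    {
        free(snapshot);
        snapshot = NULL;
        snapshot_generation++;
        first_call_in_xact = false;
    }
}

/* Toy counterpart of pgstat_fetch_stat_dbentry(): returns a cached entry. */
DbEntry *
toy_fetch_dbentry(unsigned int dbid)
{
    toy_clear_snapshot_if_needed();
    if (snapshot == NULL)
    {
        snapshot = calloc(1, sizeof(DbEntry));
        assert(snapshot != NULL);
        snapshot->dbid = dbid;
    }
    return snapshot;
}

/* Toy counterpart of pgstat_vacuum_stat(): also counts as a stats call. */
void
toy_vacuum_stat(void)
{
    toy_clear_snapshot_if_needed();
}

/* Old order: fetch, start transaction, vacuum. The pointer goes stale. */
bool
toy_broken_order_is_safe(void)
{
    DbEntry *dbentry = toy_fetch_dbentry(1);
    int      gen = snapshot_generation;

    toy_start_transaction();
    toy_vacuum_stat();          /* frees the memory dbentry points to */
    (void) dbentry;             /* must not be dereferenced any more */
    return gen == snapshot_generation;
}

/* Fixed order: start transaction first, then fetch. The pointer survives. */
bool
toy_fixed_order_is_safe(void)
{
    DbEntry *dbentry;
    int      gen;

    toy_start_transaction();
    dbentry = toy_fetch_dbentry(1);
    gen = snapshot_generation;
    toy_vacuum_stat();          /* no longer the first call: cache is kept */
    return gen == snapshot_generation && dbentry->dbid == 1;
}
```

In this model toy_broken_order_is_safe() reports false because the
cache is freed between the fetch and the later use, while
toy_fixed_order_is_safe() reports true.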

me> I found another problem commit_ts test reliably fails by dshash
me> corruption in startup process. I've not found why and will
me> investigate it, too.

It is a rather stupid mistake: pgstat_reset_all() releases an entry
within the sequential scan loop, which violates the protocol of
dshash_seq_next().
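
The protocol violation can be illustrated with a toy scan that owns
the partition lock while it iterates. This is a hedged sketch, not the
dshash implementation: ToyTable, ToyScan and all toy_* functions are
hypothetical, and the "lock" is just a counter that records whether a
lock was released while not held.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NPARTITIONS 4
#define TOY_NENTRIES    8

typedef struct ToyTable
{
    int  lock_depth[TOY_NPARTITIONS];   /* 1 while held, 0 when free */
    int  values[TOY_NENTRIES];
    bool protocol_violated;             /* set on release of an unheld lock */
} ToyTable;

typedef struct ToyScan
{
    ToyTable *table;
    int       next;             /* index of the next entry to return */
    int       curpartition;     /* partition locked by the scan, or -1 */
} ToyScan;

static int
toy_partition_of(int idx)
{
    return idx / (TOY_NENTRIES / TOY_NPARTITIONS);
}

static void
toy_lock(ToyTable *t, int p)
{
    t->lock_depth[p]++;
}

static void
toy_unlock(ToyTable *t, int p)
{
    if (t->lock_depth[p] == 0)
        t->protocol_violated = true;    /* lock was already released */
    else
        t->lock_depth[p]--;
}

void
toy_seq_init(ToyScan *scan, ToyTable *table)
{
    scan->table = table;
    scan->next = 0;
    scan->curpartition = -1;
}

/* Returns the next entry, or NULL after doing the end-of-scan cleanup. */
int *
toy_seq_next(ToyScan *scan)
{
    int p;

    if (scan->next >= TOY_NENTRIES)
    {
        if (scan->curpartition >= 0)
            toy_unlock(scan->table, scan->curpartition);
        scan->curpartition = -1;
        return NULL;
    }

    /* The scan itself moves the lock when it crosses a partition. */
    p = toy_partition_of(scan->next);
    if (p != scan->curpartition)
    {
        if (scan->curpartition >= 0)
            toy_unlock(scan->table, scan->curpartition);
        toy_lock(scan->table, p);
        scan->curpartition = p;
    }
    return &scan->table->values[scan->next++];
}

/* What the caller must NOT do inside the scan loop. */
void
toy_release_lock(ToyTable *t, int *entry)
{
    toy_unlock(t, toy_partition_of((int) (entry - t->values)));
}

/* Sums all values; returns true if the locking protocol was respected. */
bool
toy_scan(bool release_inside_loop, long *sum)
{
    ToyTable table = {0};
    ToyScan  scan;
    int     *entry;
    int      i;

    for (i = 0; i < TOY_NENTRIES; i++)
        table.values[i] = i + 1;

    *sum = 0;
    toy_seq_init(&scan, &table);
    while ((entry = toy_seq_next(&scan)) != NULL)
    {
        *sum += *entry;
        if (release_inside_loop)
            toy_release_lock(&table, entry);    /* mimics the bug */
    }
    return !table.protocol_violated;
}
```

With release_inside_loop set, the second entry of a partition is
returned while the lock is no longer held, which the model flags as a
violation; leaving the release to the scan keeps the bookkeeping
consistent.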

Both of the above are fixed in the attached v17.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From ff1eda551a637e4431804bcfd8119f8c7dc74636 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:41:04 +0900
Subject: [PATCH 1/6] sequential scan for dshash

Add sequential scan feature to dshash.
---
 src/backend/lib/dshash.c | 188 ++++++++++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  23 +++++-
 2 files changed, 206 insertions(+), 5 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index f095196fb6..1e8c22f94f 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -112,6 +112,7 @@ struct dshash_table
     size_t        size_log2;        /* log2(number of buckets) */
     bool        find_locked;    /* Is any partition lock held by 'find'? */
     bool        find_exclusively_locked;    /* ... exclusively? */
+    bool        seqscan_running;/* now under sequential scan */
 };
 
 /* Given a pointer to an item, find the entry (user data) it holds. */
@@ -127,6 +128,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +158,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -228,6 +237,7 @@ dshash_create(dsa_area *area, const dshash_parameters *params, void *arg)
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
 
     /*
      * Set up the initial array of buckets.  Our initial size is the same as
@@ -279,6 +289,7 @@ dshash_attach(dsa_area *area, const dshash_parameters *params,
     hash_table->control = dsa_get_address(area, control);
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
     Assert(hash_table->control->magic == DSHASH_MAGIC);
 
     /*
@@ -324,7 +335,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
                                 LW_EXCLUSIVE));
 
     delete_item(hash_table, item);
-    hash_table->find_locked = false;
-    hash_table->find_exclusively_locked = false;
-    LWLockRelease(PARTITION_LOCK(hash_table, partition));
+
+    /* We need to keep the partition lock during a sequential scan */
+    if (!hash_table->seqscan_running)
+    {
+        hash_table->find_locked = false;
+        hash_table->find_exclusively_locked = false;
+        LWLockRelease(PARTITION_LOCK(hash_table, partition));
+    }
 }
 
 /*
@@ -568,6 +584,8 @@ dshash_release_lock(dshash_table *hash_table, void *entry)
     Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition_index),
                                 hash_table->find_exclusively_locked
                                 ? LW_EXCLUSIVE : LW_SHARED));
+    /* lock is under control of sequential scan */
+    Assert(!hash_table->seqscan_running);
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
@@ -592,6 +610,168 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should be called if and only if the scan is abandoned
+ * before completion; if dshash_seq_next returns NULL then it has already done
+ * the end-of-scan cleanup.
+ *
+ * On returning element, it is locked as is the case with dshash_find.
+ * However, the caller must not release the lock. The lock is released as
+ * necessary in continued scan.
+ *
+ * As opposed to the equivalent for dynanash, the caller is not supposed to
+ * delete the returned element before continuing the scan.
+ *
+ * If consistent is set for dshash_seq_init, the whole hash table is
+ * non-exclusively locked. Otherwise a part of the hash table is locked in the
+ * same mode (partition lock).
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool consistent, bool exclusive)
+{
+    /* at most one scan is allowed at a time */
+    Assert(!hash_table->seqscan_running);
+
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->consistent = consistent;
+    status->exclusive = exclusive;
+    hash_table->seqscan_running = true;
+
+    /*
+     * Protect all partitions from modification if the caller wants a
+     * consistent result.
+     */
+    if (consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+        {
+            Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i)));
+
+            LWLockAcquire(PARTITION_LOCK(hash_table, i),
+                          exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        }
+        ensure_valid_bucket_pointers(hash_table);
+    }
+}
+
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    Assert(status->hash_table->seqscan_running);
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert(status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        if (!status->consistent)
+        {
+            partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+            LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            status->curpartition = partition;
+
+            /* resize doesn't happen from now until seq scan ends */
+            status->nbuckets =
+                NUM_BUCKETS(status->hash_table->control->size_log2);
+            ensure_valid_bucket_pointers(status->hash_table);
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            dshash_seq_term(status);
+            return NULL;
+        }
+
+        /* Also move partition lock if needed */
+        if (!status->consistent)
+        {
+            int next_partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+
+            /* Move lock along with partition for the bucket */
+            if (status->curpartition != next_partition)
+            {
+                /*
+                 * Take lock on the next partition then release the current,
+                 * not in the reverse order. This is required to prevent
+                 * resizing from happening during a sequential scan. Locks are
+                 * taken in partition order so no deadlock happens with other
+                 * seq scans or resizing.
+                 */
+                LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                             next_partition),
+                              status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+                LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                             status->curpartition));
+                status->curpartition = next_partition;
+            }
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * The caller may delete this item. Remember the next item now so that
+     * the scan can continue in that case.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    Assert(status->hash_table->seqscan_running);
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+    status->hash_table->seqscan_running = false;
+
+    if (status->consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+            LWLockRelease(PARTITION_LOCK(status->hash_table, i));
+    }
+    else if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index e5dfd57f0a..b80f3af995 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,23 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state of dshash. The details are exposed because callers
+ * need to know the storage size, but callers should treat this as an opaque
+ * type.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;
+    int                    curbucket;
+    int                    nbuckets;
+    dshash_table_item  *curitem;
+    dsa_pointer            pnextitem;
+    int                    curpartition;
+    bool                consistent;
+    bool                exclusive;
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
               const dshash_parameters *params,
@@ -70,7 +87,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
 extern void dshash_detach(dshash_table *hash_table);
 extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
 extern void dshash_destroy(dshash_table *hash_table);
-
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
             const void *key, bool exclusive);
@@ -80,6 +96,11 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool consistent, bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.16.3
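
The non-consistent scan in the patch above moves a single partition lock along with the current bucket, taking the next partition's lock before releasing the current one. The following is only a standalone toy model of that walk, not PostgreSQL code: the `toy_` names, the table size, and the boolean "locks" are assumptions made for illustration, and only the shape of PARTITION_FOR_BUCKET_INDEX is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy sizes chosen for the example; they are not the real dshash values. */
#define TOY_NUM_PARTITIONS_LOG2 2
#define TOY_NUM_PARTITIONS (1 << TOY_NUM_PARTITIONS_LOG2)
#define TOY_SIZE_LOG2 4                 /* 16 buckets, 4 per partition */
#define TOY_NUM_BUCKETS (1 << TOY_SIZE_LOG2)
#define TOY_NUM_SPLITS (TOY_SIZE_LOG2 - TOY_NUM_PARTITIONS_LOG2)

/* Same shape as PARTITION_FOR_BUCKET_INDEX in the patch. */
#define TOY_PARTITION_FOR_BUCKET(idx) ((idx) >> TOY_NUM_SPLITS)

typedef struct toy_scan
{
    bool locked[TOY_NUM_PARTITIONS];    /* stand-in for partition LWLocks */
    int  curpartition;                  /* -1 until the first lock is taken */
    int  max_held;                      /* high-water mark of locks held */
    int  visited;                       /* number of buckets visited */
} toy_scan;

/* Count how many toy locks are currently held. */
static int
toy_count_held(const toy_scan *s)
{
    int i, n = 0;

    for (i = 0; i < TOY_NUM_PARTITIONS; i++)
        if (s->locked[i])
            n++;
    return n;
}

static void
toy_lock(toy_scan *s, int partition)
{
    assert(!s->locked[partition]);
    s->locked[partition] = true;
    if (toy_count_held(s) > s->max_held)
        s->max_held = toy_count_held(s);
}

static void
toy_unlock(toy_scan *s, int partition)
{
    assert(s->locked[partition]);
    s->locked[partition] = false;
}

/* Visit every bucket, moving the lock partition by partition. */
static void
toy_seq_scan(toy_scan *s)
{
    int i, bucket;

    for (i = 0; i < TOY_NUM_PARTITIONS; i++)
        s->locked[i] = false;
    s->curpartition = -1;
    s->max_held = 0;
    s->visited = 0;

    for (bucket = 0; bucket < TOY_NUM_BUCKETS; bucket++)
    {
        int next = TOY_PARTITION_FOR_BUCKET(bucket);

        if (s->curpartition != next)
        {
            /* Acquire the next lock first, then release the current one. */
            toy_lock(s, next);
            if (s->curpartition >= 0)
                toy_unlock(s, s->curpartition);
            s->curpartition = next;
        }
        s->visited++;
    }

    /* End-of-scan cleanup, as dshash_seq_term does in the patch. */
    if (s->curpartition >= 0)
        toy_unlock(s, s->curpartition);
    s->curpartition = -1;
}
```

In this model at most two partitions are locked at any moment, which matches the "up to two partitions" wording that the second patch adds to the header comment.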

From 05128b68e6d5e80fc8ce132e741f8a3a0743d47f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 27 Sep 2018 11:15:19 +0900
Subject: [PATCH 2/6] Add conditional lock feature to dshash

Dshash currently waits for lock unconditionally. This commit adds new
interfaces for dshash_find and dshash_find_or_insert. The new
interfaces have an extra parameter "nowait" that tells them not to wait
for the lock.
---
 src/backend/lib/dshash.c | 69 +++++++++++++++++++++++++++++++++++++++++++-----
 src/include/lib/dshash.h |  4 +++
 2 files changed, 66 insertions(+), 7 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 1e8c22f94f..303210e326 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -394,19 +394,48 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  */
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
+{
+    return dshash_find_extended(hash_table, key, exclusive, false, NULL);
+}
+
+/*
+ * Like dshash_find, but returns immediately if nowait is true and the lock
+ * cannot be acquired. The lock status is stored in *lock_acquired, if given.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool *lock_acquired)
 {
     dshash_hash hash;
     size_t        partition;
     dshash_table_item *item;
 
+    /* reporting lock status makes sense only when nowait is true */
+    Assert(nowait || !lock_acquired);
+
     hash = hash_key(hash_table, key);
     partition = PARTITION_FOR_HASH(hash);
 
     Assert(hash_table->control->magic == DSHASH_MAGIC);
     Assert(!hash_table->find_locked);
 
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partition),
+                                      exclusive ? LW_EXCLUSIVE : LW_SHARED))
+        {
+            if (lock_acquired)
+                *lock_acquired = false;
+            return NULL;
+        }
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition),
+                      exclusive ? LW_EXCLUSIVE : LW_SHARED);
+
+    if (lock_acquired)
+        *lock_acquired = true;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -441,6 +470,22 @@ void *
 dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
+{
+    return dshash_find_or_insert_extended(hash_table, key, found, false);
+}
+
+/*
+ * Like dshash_find_or_insert, but returns NULL if nowait is true and the lock
+ * could not be acquired.
+ *
+ * Notes above dshash_find_extended() regarding locking and error handling
+ * equally apply here.
+ */
+void *
+dshash_find_or_insert_extended(dshash_table *hash_table,
+                               const void *key,
+                               bool *found,
+                               bool nowait)
 {
     dshash_hash hash;
     size_t        partition_index;
@@ -455,8 +500,16 @@ dshash_find_or_insert(dshash_table *hash_table,
     Assert(!hash_table->find_locked);
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(
+                PARTITION_LOCK(hash_table, partition_index),
+                LW_EXCLUSIVE))
+            return NULL;
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
+                      LW_EXCLUSIVE);
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -626,9 +679,11 @@ dshash_memhash(const void *v, size_t size, void *arg)
  * As opposed to the equivalent for dynanash, the caller is not supposed to
  * delete the returned element before continuing the scan.
  *
- * If consistent is set for dshash_seq_init, the whole hash table is
- * non-exclusively locked. Otherwise a part of the hash table is locked in the
- * same mode (partition lock).
+ * If consistent is set for dshash_seq_init, all hash table
+ * partitions are locked in the requested mode (as determined by the
+ * exclusive flag), and the locks are held until the end of the scan.
+ * Otherwise the partition locks are acquired and released as needed
+ * during the scan (up to two partitions may be locked at the same time).
  */
 void
 dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b80f3af995..21587c07ce 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -90,8 +90,12 @@ extern void dshash_destroy(dshash_table *hash_table);
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
             const void *key, bool exclusive);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+            bool exclusive, bool nowait, bool *lock_acquired);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                       const void *key, bool *found);
+extern void *dshash_find_or_insert_extended(dshash_table *hash_table,
+            const void *key, bool *found, bool nowait);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.16.3
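
With nowait, a NULL result from dshash_find_extended is ambiguous: either the entry does not exist or the lock was busy. The lock_acquired output is what tells the two apart. The sketch below is a standalone toy model of that contract, not the patch's code: `toy_lock`, `toy_table` and `toy_find_nowait` are hypothetical names, and a boolean flag stands in for LWLockConditionalAcquire.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct toy_lock
{
    bool held;
} toy_lock;

/* Stand-in for a conditional lock acquire: never blocks. */
static bool
toy_try_acquire(toy_lock *l)
{
    if (l->held)
        return false;
    l->held = true;
    return true;
}

static void
toy_release(toy_lock *l)
{
    l->held = false;
}

/* A one-entry "table" protected by a single lock. */
typedef struct toy_table
{
    toy_lock lock;
    int      key;
    int      value;
    bool     has_entry;
} toy_table;

/*
 * Model of a nowait lookup.  Returns NULL both when the lock is busy and
 * when the key is absent; *lock_acquired (if given) distinguishes the two.
 * On success the lock stays held, as with dshash_find.
 */
static int *
toy_find_nowait(toy_table *t, int key, bool *lock_acquired)
{
    if (!toy_try_acquire(&t->lock))
    {
        if (lock_acquired)
            *lock_acquired = false;
        return NULL;
    }

    if (lock_acquired)
        *lock_acquired = true;

    if (t->has_entry && t->key == key)
        return &t->value;

    /* Not found: nothing to hand back, so drop the lock. */
    toy_release(&t->lock);
    return NULL;
}
```

A caller that gets NULL with lock_acquired set to false can retry later, while NULL with lock_acquired set to true means the key really is absent.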

From f21049aed514d5464d527d7dc51ace1fdac4e542 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 7 Nov 2018 16:53:49 +0900
Subject: [PATCH 3/6] Make archiver process an auxiliary process

This is a preliminary patch for shared-memory based stats collector.
Archiver process must be an auxiliary process since it uses shared
memory after stats data was moved into shared memory. Make the process
an auxiliary process in order to make it work.
---
 src/backend/bootstrap/bootstrap.c   |  8 +++
 src/backend/postmaster/pgarch.c     | 98 +++++++++----------------------------
 src/backend/postmaster/pgstat.c     |  6 +++
 src/backend/postmaster/postmaster.c | 35 +++++++++----
 src/include/miscadmin.h             |  2 +
 src/include/pgstat.h                |  1 +
 src/include/postmaster/pgarch.h     |  4 +-
 7 files changed, 67 insertions(+), 87 deletions(-)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 4d7ed8ad1a..b0878a3dd9 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -328,6 +328,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case BgWriterProcess:
                 statmsg = pgstat_get_backend_desc(B_BG_WRITER);
                 break;
+            case ArchiverProcess:
+                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
+                break;
             case CheckpointerProcess:
                 statmsg = pgstat_get_backend_desc(B_CHECKPOINTER);
                 break;
@@ -455,6 +458,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
             BackgroundWriterMain();
             proc_exit(1);        /* should never return */
 
+        case ArchiverProcess:
+            /* don't set signals, archiver has its own agenda */
+            PgArchiverMain();
+            proc_exit(1);        /* should never return */
+
         case CheckpointerProcess:
             /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index f84f882c4c..4342ebdab4 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -77,7 +77,6 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
@@ -96,7 +95,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgarch_exit(SIGNAL_ARGS);
 static void ArchSigHupHandler(SIGNAL_ARGS);
 static void ArchSigTermHandler(SIGNAL_ARGS);
@@ -114,75 +112,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -222,8 +151,8 @@ pgarch_forkexec(void)
  *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
  *    since we don't use 'em, it hardly matters...
  */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -255,8 +184,27 @@ PgArchiverMain(int argc, char *argv[])
 static void
 pgarch_exit(SIGNAL_ARGS)
 {
-    /* SIGQUIT means curl up and die ... */
-    exit(1);
+    PG_SETMASK(&BlockSig);
+
+    /*
+     * We DO NOT want to run proc_exit() callbacks -- we're here because
+     * shared memory may be corrupted, so we don't want to try to clean up our
+     * transaction.  Just nail the windows shut and get out of town.  Now that
+     * there's an atexit callback to prevent third-party code from breaking
+     * things by calling exit() directly, we have to reset the callbacks
+     * explicitly to make this work as intended.
+     */
+    on_exit_reset();
+
+    /*
+     * Note we do exit(2) not exit(0).  This is to force the postmaster into a
+     * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+     * backend.  This is necessary precisely because we don't clean up our
+     * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+     * should ensure the postmaster sees this as a crash, too, but no harm in
+     * being doubly sure.)
+     */
+    exit(2);
 }
 
 /* SIGHUP signal handler for archiver process */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 81c6499251..8d3c45dd4e 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2856,6 +2856,9 @@ pgstat_bestart(void)
             case BgWriterProcess:
                 beentry->st_backendType = B_BG_WRITER;
                 break;
+            case ArchiverProcess:
+                beentry->st_backendType = B_ARCHIVER;
+                break;
             case CheckpointerProcess:
                 beentry->st_backendType = B_CHECKPOINTER;
                 break;
@@ -4105,6 +4108,9 @@ pgstat_get_backend_desc(BackendType backendType)
 
     switch (backendType)
     {
+        case B_ARCHIVER:
+            backendDesc = "archiver";
+            break;
         case B_AUTOVAC_LAUNCHER:
             backendDesc = "autovacuum launcher";
             break;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index ccea231e98..820f356038 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -146,7 +146,8 @@
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
 #define BACKEND_TYPE_BGWORKER    0x0008    /* bgworker process */
-#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
+#define BACKEND_TYPE_ARCHIVER    0x0010    /* archiver process */
+#define BACKEND_TYPE_ALL        0x001F    /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -538,6 +539,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1761,7 +1763,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -2924,7 +2926,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3069,10 +3071,8 @@ reaper(SIGNAL_ARGS)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3318,7 +3318,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3523,6 +3523,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3799,6 +3811,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5068,7 +5081,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5342,6 +5355,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index c9e35003a5..63a7653457 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -399,6 +399,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -411,6 +412,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 88a75fb798..3324be8a81 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -701,6 +701,7 @@ typedef struct PgStat_GlobalStats
  */
 typedef enum BackendType
 {
+    B_ARCHIVER,
     B_AUTOVAC_LAUNCHER,
     B_AUTOVAC_WORKER,
     B_BACKEND,
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index 2474eac26a..88f16863d4 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
-- 
2.16.3
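
One behavioral consequence of the patch above is in reaper(): a non-zero archiver exit used to be logged and followed by a restart, whereas now it goes through HandleChildCrash like other processes attached to shared memory. The following is a rough standalone model of that decision only. The enum and function names are hypothetical, and the status test is simplified to a comparison with zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for what the postmaster does after reaping. */
typedef enum reap_action
{
    REAP_RESTART,           /* clean exit: start a new archiver when allowed */
    REAP_LOG_AND_RESTART,   /* old behavior: log the exit, then restart */
    REAP_CRASH_RECOVERY     /* new behavior: treat as a crash */
} reap_action;

/*
 * Decide how an archiver exit is handled.  is_auxiliary selects between the
 * pre-patch (false) and post-patch (true) behavior.
 */
static reap_action
archiver_reap_action(int exitstatus, bool is_auxiliary)
{
    if (exitstatus == 0)
        return REAP_RESTART;

    return is_auxiliary ? REAP_CRASH_RECOVERY : REAP_LOG_AND_RESTART;
}
```

This is also why pgarch_exit in the patch calls exit(2) after on_exit_reset(): the non-zero status is meant to push the postmaster into a reset cycle.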

From 88740269660d00d548910c2f3aa631878c7cf0d4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 21 Feb 2019 12:42:07 +0900
Subject: [PATCH 4/6] Allow DSM to be used in postmaster.

DSM is currently prohibited in the postmaster. The shared-memory based
stats collector needs it to work in the postmaster, and no problem was
found in doing so. Just allow it.
---
 src/backend/storage/ipc/dsm.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/src/backend/storage/ipc/dsm.c b/src/backend/storage/ipc/dsm.c
index 23ccc59f13..d30a876bb0 100644
--- a/src/backend/storage/ipc/dsm.c
+++ b/src/backend/storage/ipc/dsm.c
@@ -440,8 +440,7 @@ dsm_create(Size size, int flags)
     uint32        i;
     uint32        nitems;
 
-    /* Unsafe in postmaster (and pointless in a stand-alone backend). */
-    Assert(IsUnderPostmaster);
+    Assert(dsm_control != NULL);
 
     if (!dsm_init_done)
         dsm_backend_startup();
@@ -537,8 +536,7 @@ dsm_attach(dsm_handle h)
     uint32        i;
     uint32        nitems;
 
-    /* Unsafe in postmaster (and pointless in a stand-alone backend). */
-    Assert(IsUnderPostmaster);
+    Assert(dsm_control != NULL);
 
     if (!dsm_init_done)
         dsm_backend_startup();
-- 
2.16.3

From 774b1495136db1ad6d174ab261487fdf6cb6a5ed Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 21 Feb 2019 12:44:56 +0900
Subject: [PATCH 5/6] Shared-memory based stats collector

Previously, activity statistics were shared via files on disk. Every
backend sent its numbers to the stats collector process via a socket.
The collector wrote snapshots as a set of files on disk at a certain
interval, and every backend read them as necessary. This worked fine for
a comparatively small set of statistics, but the set is under
pressure to grow and the file size has reached the order of
megabytes. To deal with a larger statistics set, this patch lets backends
share the statistics directly via shared memory.
---
 doc/src/sgml/monitoring.sgml                 |    6 +-
 src/backend/postmaster/autovacuum.c          |   12 +-
 src/backend/postmaster/pgstat.c              | 5439 +++++++++++---------------
 src/backend/postmaster/postmaster.c          |   85 +-
 src/backend/storage/ipc/ipci.c               |    6 +
 src/backend/storage/lmgr/lwlock.c            |    1 +
 src/backend/tcop/postgres.c                  |   27 +-
 src/backend/utils/init/globals.c             |    1 +
 src/backend/utils/init/postinit.c            |   11 +
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |    2 +-
 src/include/miscadmin.h                      |    1 +
 src/include/pgstat.h                         |  469 +--
 src/include/storage/lwlock.h                 |    1 +
 src/include/utils/timeout.h                  |    1 +
 14 files changed, 2485 insertions(+), 3577 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 0e73cdcdda..339afb2de9 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    master server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   master process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   master process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 347f91e937..92b23ccda0 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1958,15 +1958,15 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
+    /* Start a transaction so our commands have one to play into. */
+    StartTransactionCommand();
+
     /*
      * may be NULL if we couldn't find an entry (only happens if we are
      * forcing a vacuum for anti-wrap purposes).
      */
     dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
 
-    /* Start a transaction so our commands have one to play into. */
-    StartTransactionCommand();
-
     /*
      * Clean up any dead statistics collector entries for this DB. We always
      * want to do this exactly once per DB-processing cycle, even if we find
@@ -2749,12 +2749,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
     if (isshared)
     {
         if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
+            tabentry = pgstat_fetch_stat_tabentry_extended(shared, relid);
     }
     else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
+        tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
 
     return tabentry;
 }
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8d3c45dd4e..61b0bd161d 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,19 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Statistics collector facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
- *
- *            - Add some automatic call for pgstat vacuuming.
- *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  Collects per-table and per-function usage statistics of backends and shares
+ *  them among all backends via shared memory. Every backend records
+ *  individual activity in local memory using pg_count_*() and friends
+ *  interfaces during a transaction. Then pgstat_report_stat() is called at
+ *  the end of a transaction to flush out the local numbers to shared
+ *  memory. To avoid congestion on the shared memory, we do that no more
+ *  often than PGSTAT_STAT_MIN_INTERVAL (500ms). A backend may still be
+ *  unable to flush all or part of its local numbers immediately; such
+ *  numbers are retried at intervals of PGSTAT_STAT_RETRY_INTERVAL (100ms),
+ *  but they are never held back longer than PGSTAT_STAT_MAX_INTERVAL
+ *  (1000ms).
  *
  *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
  *
@@ -19,88 +23,47 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "pgstat.h"
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
-#include "access/transam.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "postmaster/autovacuum.h"
-#include "postmaster/fork_process.h"
-#include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
-#include "utils/timestamp.h"
-
 
 /* ----------
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
+#define PGSTAT_STAT_MIN_INTERVAL    500 /* Minimum time between stats data
+                                         * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_STAT_RETRY_INTERVAL    100 /* Retry interval once the minimum
+                                         * interval has elapsed; in ms. */
 
+#define PGSTAT_STAT_MAX_INTERVAL   1000 /* Maximum time between stats data
+                                         * updates; in milliseconds. */
 
 /* ----------
  * The initial size hints for the hash tables used in the collector.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
 #define PGSTAT_TAB_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
@@ -116,6 +79,19 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
+/*
+ * Operation mode of pgstat_get_db_entry.
+ */
+#define    PGSTAT_FETCH_SHARED        0
+#define    PGSTAT_FETCH_EXCLUSIVE    1
+#define    PGSTAT_FETCH_NOWAIT        2
+
+typedef enum PgStat_TableLookupState
+{
+    PGSTAT_ENTRY_NOT_FOUND,
+    PGSTAT_ENTRY_FOUND,
+    PGSTAT_ENTRY_LOCK_FAILED
+} PgStat_TableLookupState;
 
 /* ----------
  * GUC parameters
@@ -131,9 +107,23 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used; to be removed together with the GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
+LWLock        StatsMainLock;
+#define        StatsLock (&StatsMainLock)
+
+/* Shared stats bootstrap information */
+typedef struct StatsShmemStruct {
+    dsa_handle stats_dsa_handle;
+    dshash_table_handle db_hash_handle;
+    dsa_pointer    global_stats;
+    dsa_pointer    archiver_stats;
+    TimestampTz last_update;
+} StatsShmemStruct;
+
 /*
  * BgWriter global statistics counters (unused in other processes).
  * Stored directly in a stats message structure so it can be sent
@@ -145,17 +135,38 @@ PgStat_MsgBgWriter BgWriterStats;
  * Local data
  * ----------
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
+/* Variables that live for the backend lifetime */
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
+static dshash_table *pgStatDBHash;
+static MemoryContext pgSharedStatsContext = NULL;
 
-static struct sockaddr_storage pgStatAddr;
-
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
+/* dshash parameter for each type of table */
+static const dshash_parameters dsh_dbparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatDBEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+static const dshash_parameters dsh_tblparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatTabEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+static const dshash_parameters dsh_funcparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatFuncEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
 
 /*
  * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
+ * written to shared memory.
  *
  * NOTE: once allocated, TabStatusArray structures are never moved or deleted
  * for the life of the backend.  Also, we zero out the t_id fields of the
@@ -190,8 +201,8 @@ typedef struct TabStatHashEntry
 static HTAB *pgStatTabHash = NULL;
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * Backends store per-function info that's waiting to be flushed out to shared
+ * memory in this hash table (indexed by function OID).
  */
 static HTAB *pgStatFunctions = NULL;
 
@@ -201,6 +212,46 @@ static HTAB *pgStatFunctions = NULL;
  */
 static bool have_function_stats = false;
 
+/* dbentry has some additional data in snapshot */
+typedef struct PgStat_StatDBEntry_snapshot
+{
+    PgStat_StatDBEntry shared_part;
+
+    HTAB *snapshot_tables;                /* table entry snapshot */
+    HTAB *snapshot_functions;            /* function entry snapshot */
+    dshash_table    *dshash_tables;        /* attached tables dshash */
+    dshash_table    *dshash_functions;    /* attached functions dshash */
+} PgStat_StatDBEntry_snapshot;
+
+/* context struct for snapshot_statentry */
+typedef struct pgstat_snapshot_cxt
+{
+    char           *hash_name;            /* name of the snapshot hash */
+    HTAB          **hash;                /* placeholder for the hash */
+    int                hash_entsize;        /* element size of hash entry */
+    dshash_table  **dshash;                /* placeholder for attached dshash */
+    dshash_table_handle    dsh_handle;        /* dsh handle to attach */
+    const dshash_parameters *dsh_params;/* dshash params */
+} pgstat_snapshot_cxt;
+
+/*
+ *  Backends store various database-wide info that's waiting to be flushed out
+ *  to shared memory in these variables.
+ */
+static int        n_deadlocks = 0;
+static size_t    n_tmpfiles = 0;
+static size_t    n_tmpfilesize = 0;
+
+/*
+ * have_recovery_conflicts represents the existence of any kind of conflict
+ */
+static bool        have_recovery_conflicts = false;
+static int        n_conflict_tablespace = 0;
+static int        n_conflict_lock = 0;
+static int        n_conflict_snapshot = 0;
+static int        n_conflict_bufferpin = 0;
+static int        n_conflict_startup_deadlock = 0;
+
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
  * into PgStat_TableStatus counters until we know if it is going to commit
@@ -236,36 +287,41 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot, including auxiliary processes. */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
-
-/* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
-
-/* Total number of backends including auxiliary */
 static int    localNumBackends = 0;
 
-/*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
- */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
+/* Variables for activity statistics snapshot. */
+static MemoryContext pgStatSnapshotContext = NULL;
+static HTAB *pgStatDBEntrySnapshot;
+static TimestampTz snapshot_expires_at = 0; /* local cache expiration time */
+static bool        first_in_xact = true;      /* is the first time in this xact? */
+
+
+/* Context struct for flushing to shared memory */
+typedef struct pgstat_flush_stat_context
+{
+    int    shgeneration;
+    PgStat_StatDBEntry *shdbentry;
+    dshash_table *shdb_tabhash;
+
+    int    mygeneration;
+    PgStat_StatDBEntry *mydbentry;
+    dshash_table *mydb_tabhash;
+} pgstat_flush_stat_context;
 
 /*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
+ * Cluster wide statistics.
+ *
+ * Contains statistics that are not collected per database or per table.
+ * shared_* are the statistics maintained by shared statistics code and
+ * snapshot_* are backend snapshots.
  */
-static List *pending_write_requests = NIL;
-
-/* Signal handler flags */
-static volatile bool need_exit = false;
-static volatile bool got_SIGHUP = false;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats *snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats *snapshot_globalStats;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -279,35 +335,36 @@ static instr_time total_func_time;
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
 
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgstat_exit(SIGNAL_ARGS);
 static void pgstat_beshutdown_hook(int code, Datum arg);
-static void pgstat_sighup_handler(SIGNAL_ARGS);
-
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                     Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, int op,
+                                    PgStat_TableLookupState *status);
+static PgStat_StatTabEntry *pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create);
+static void pgstat_write_pgStatDBHashfile(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_pgStatDBHashfile(PgStat_StatDBEntry *dbentry);
 static void pgstat_read_current_status(void);
-
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
-
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
+static bool pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
+                                 PgStat_TableStatus *entry);
+static bool pgstat_flush_funcstats(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_flush_miscstats(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_update_tabentry(dshash_table *tabhash,
+                                   PgStat_TableStatus *stat, bool nowait);
+static void pgstat_update_dbentry(PgStat_StatDBEntry *dbentry,
+                                  PgStat_TableStatus *stat);
 static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
 
 static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
 
 static void pgstat_setup_memcxt(void);
+static void pgstat_flush_recovery_conflict(PgStat_StatDBEntry *dbentry);
+static void pgstat_flush_deadlock(PgStat_StatDBEntry *dbentry);
+static void pgstat_flush_tempfile(PgStat_StatDBEntry *dbentry);
+static HTAB *create_tabstat_hash(void);
+static PgStat_SubXactStatus *get_tabstat_stack_level(int nest_level);
+static void add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level);
+static PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry_extended(PgStat_StatDBEntry *dbent, Oid funcid);
+static void pgstat_snapshot_global_stats(void);
 
 static const char *pgstat_get_wait_activity(WaitEventActivity w);
 static const char *pgstat_get_wait_client(WaitEventClient w);
@@ -315,480 +372,133 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+/* ------------------------------------------------------------
+ * Local support functions follow
+ * ------------------------------------------------------------
+ */
+static int pin_hashes(PgStat_StatDBEntry *dbentry);
+static void unpin_hashes(PgStat_StatDBEntry *dbentry, int generation);
+static dshash_table *attach_table_hash(PgStat_StatDBEntry *dbent, int gen);
+static dshash_table *attach_function_hash(PgStat_StatDBEntry *dbent, int gen);
+static void reset_dbentry_counters(PgStat_StatDBEntry *dbentry);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
- */
-void
-pgstat_init(void)
+static void
+pgstat_postmaster_shutdown(int code, Datum arg)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
-
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
-
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
-    {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
-    }
-
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
-
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
-
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
-
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
-    }
-
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
-
-    /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
-     */
-    if (!pg_set_noblock(pgStatSock))
-    {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
-    }
-
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
-    }
-
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    return;
-
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
-
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
-
-    /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
-     */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    /* write out the stats on clean exit only; discard them on crash */
+    if (code == 0)
+        pgstat_write_statsfiles();
 }
 
-/*
- * subroutine for pgstat_reset_all
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+void
+StatsShmemInit(void)
+{
+    bool    found;
+
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+
+        /* Load saved data if any */
+        pgstat_read_statsfiles();
+
+        /* needs to be called before dsm shutdown */
+        before_shmem_exit(pgstat_postmaster_shutdown, (Datum) 0);
+    }
+
+    LWLockInitialize(StatsLock, LWTRANCHE_STATS);
+}
+
+/* ----------
+ * pgstat_create_shared_stats() -
+ *
+ *    create shared stats memory
+ * ----------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+pgstat_create_shared_stats(void)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    MemoryContext oldcontext;
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
-    {
-        int            nchars;
-        Oid            tmp_oid;
+    Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
 
-        /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
-         */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
+    /* lives for the lifetime of the process */
+    oldcontext = MemoryContextSwitchTo(pgSharedStatsContext);
+
+    area = dsa_create(LWTRANCHE_STATS);
+    dsa_pin_mapping(area);
 
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
+    /* create the database hash */
+    pgStatDBHash = dshash_create(area, &dsh_dbparams, 0);
 
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
-    }
-    FreeDir(dir);
+    /* create shared area and write bootstrap information */
+    StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+    StatsShmem->global_stats =
+        dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+    StatsShmem->archiver_stats =
+        dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+    StatsShmem->db_hash_handle =
+        dshash_get_hash_table_handle(pgStatDBHash);
+    StatsShmem->last_update = 0;
+
+    /* initial connect to the memory */
+    pgStatDBEntrySnapshot = NULL;
+    shared_globalStats = (PgStat_GlobalStats *)
+        dsa_get_address(area, StatsShmem->global_stats);
+    shared_archiverStats = (PgStat_ArchiverStats *)
+        dsa_get_address(area, StatsShmem->archiver_stats);
+    MemoryContextSwitchTo(oldcontext);
 }
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Clear in-memory counters.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
-}
+    dshash_seq_status dshstat;
+    PgStat_StatDBEntry           *dbentry;
 
-#ifdef EXEC_BACKEND
+    Assert (pgStatDBHash);
 
-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
-    char       *av[10];
-    int            ac = 0;
-
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
-
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
-
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
-
-
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
-
-    /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
-     */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
-
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
-
-    /*
-     * Okay, fork off the collector.
-     */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
+    dshash_seq_init(&dshstat, pgStatDBHash, false, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
+        /*
+         * Reset database-level stats, too.  This creates empty hash tables
+         * for tables and functions.
+         */
+        reset_dbentry_counters(dbentry);
     }
 
-    /* shouldn't get here */
-    return 0;
-}
-
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
+    /*
+     * Reset global counters
+     */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+    MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+    shared_globalStats->stat_reset_timestamp =
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
+    LWLockRelease(StatsLock);
 }
 
 /* ------------------------------------------------------------
@@ -796,75 +506,262 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
- * pgstat_report_stat() -
+ * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *    This requires taking locks on the shared statistics hashes, so some
+ *    updates may be postponed on lock failure. Postponed updates are retried
+ *    in later calls of this function and are finally flushed when this
+ *    function is called with force = true or when PGSTAT_STAT_MAX_INTERVAL
+ *    milliseconds have elapsed since they became pending. When force = false,
+ *    regular backends flush at intervals no shorter than
+ *    PGSTAT_STAT_MIN_INTERVAL.
+ *
+ *    Returns the time in milliseconds until the next update.
+ *
+ *    Note that this is called only outside a transaction, so it is fair to use
  *    transaction stop time as an approximation of current time.
- * ----------
+ *    ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz last_flush = 0;
+    static TimestampTz pending_since = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
-    int            i;
+    pgstat_flush_stat_context cxt = {0};
+    bool        have_other_stats = false;
+    bool        pending_stats = false;
+    long        elapsed;
+    long        secs;
+    int            usecs;
+
+    /* Do we have anything to flush? */
+    if (have_recovery_conflicts || n_deadlocks != 0 || n_tmpfiles != 0)
+        have_other_stats = true;
 
     /* Don't expend a clock check if nothing to do */
     if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
         pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+        !have_other_stats && !have_function_stats)
+        return 0;
+
+    now = GetCurrentTransactionStopTimestamp();
+
+    if (!force)
+    {
+        /*
+         * Don't flush stats unless it's been at least
+         * PGSTAT_STAT_MIN_INTERVAL msec since the last flush.  Return the
+         * time to wait in that case.
+         */
+        TimestampDifference(last_flush, now, &secs, &usecs);
+        elapsed = secs * 1000 + usecs / 1000;
+
+        if (elapsed < PGSTAT_STAT_MIN_INTERVAL)
+        {
+            if (pending_since == 0)
+                pending_since = now;
+
+            return PGSTAT_STAT_MIN_INTERVAL - elapsed;
+        }
+
+
+        /*
+         * Don't keep pending stats for longer than PGSTAT_STAT_MAX_INTERVAL.
+         */
+        if (pending_since > 0)
+        {
+            TimestampDifference(pending_since, now, &secs, &usecs);
+            elapsed = secs * 1000 + usecs / 1000;
+
+            if (elapsed > PGSTAT_STAT_MAX_INTERVAL)
+                force = true;
+        }
+    }
+
+    /* It's time to flush */
+    last_flush = now;
+
+    /* Flush out table stats */
+    if (pgStatTabList != NULL && !pgstat_flush_stat(&cxt, !force))
+        pending_stats = true;
+
+    /* Flush out function stats */
+    if (pgStatFunctions != NULL && !pgstat_flush_funcstats(&cxt, !force))
+        pending_stats = true;
+
+    /* Flush out miscellaneous stats */
+    if (have_other_stats && !pgstat_flush_miscstats(&cxt, !force))
+        pending_stats = true;
+
+    /*  Unpin dbentry if pinned */
+    if (cxt.mydb_tabhash)
+    {
+        dshash_detach(cxt.mydb_tabhash);
+        unpin_hashes(cxt.mydbentry, cxt.mygeneration);
+        cxt.mydb_tabhash = NULL;
+        cxt.mydbentry = NULL;
+    }
+
+    /* Publish the last flush time */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (StatsShmem->last_update < last_flush)
+        StatsShmem->last_update = last_flush;
+    LWLockRelease(StatsLock);
+
+    /* remember since when stats have been pending */
+    if (pending_stats)
+    {
+        if (pending_since == 0)
+            pending_since = now;
+        return PGSTAT_STAT_RETRY_INTERVAL;
+    }
+
+    pending_since = 0;
+
+    return 0;
+}
+
+/* -------
+ * Subroutines for pgstat_flush_stat.
+ * -------
+ */
+
+/*
+ * snapshot_statentry() - Find an entry in the source dshash, with caching.
+ *
+ * Returns the entry for key or NULL if not found.
+ *
+ * Returned entries stay consistent during the current transaction or until
+ * pgstat_clear_snapshot() is called.
+ *
+ * cxt->hash points to an HTAB * variable that stores the hash for the local
+ * cache. A new hash is created if it does not exist yet.
+ *
+ * cxt->dshash points to a dshash_table * variable that stores the attached
+ * dshash. cxt->dsh_handle is attached if not yet attached.
+ */
+static void *
+snapshot_statentry(pgstat_snapshot_cxt *cxt, Oid key)
+{
+    char *lentry = NULL;
+    size_t keysize = cxt->dsh_params->key_size;
+    size_t dsh_entrysize = cxt->dsh_params->entry_size;
+    bool found;
+    bool *negative;
+
+    /* caches the result entry */
 
     /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
+     * Create a new hash with an arbitrary initial size since we don't know
+     * how large this hash will grow. The boolean appended to the end of each
+     * entry is the negative-cache flag.
      */
-    now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
+    if (!*cxt->hash)
+    {
+        HASHCTL ctl;
+
+        /* Create the hash in the stats context */
+        ctl.keysize        = keysize;
+        ctl.entrysize    = cxt->hash_entsize + sizeof(bool);
+        ctl.hcxt        = pgStatSnapshotContext;
+        *cxt->hash = hash_create(cxt->hash_name, 32, &ctl,
+                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    }
+
+    lentry = hash_search(*cxt->hash, &key, HASH_ENTER, &found);
+
+    negative = (bool *) (lentry + cxt->hash_entsize);
+
+    if (!found)
+    {
+        /* not found in local cache, search shared hash */
+
+        void *sentry;
+
+        /* attach shared hash if not given, leave it alone for later use */
+        if (!*cxt->dshash)
+        {
+            MemoryContext oldcxt;
+
+            if (cxt->dsh_handle == DSM_HANDLE_INVALID)
+                return NULL;
+
+            oldcxt = MemoryContextSwitchTo(pgStatSnapshotContext);
+            *cxt->dshash =
+                dshash_attach(area, cxt->dsh_params, cxt->dsh_handle, NULL);
+            MemoryContextSwitchTo(oldcxt);
+        }
+
+        sentry = dshash_find(*cxt->dshash, &key, false);
+
+        if (sentry)
+        {
+            /* found, copy it */
+            memcpy(lentry, sentry, dsh_entrysize);
+            dshash_release_lock(*cxt->dshash, sentry);
+
+            /* then zero out the additional space */
+            if (dsh_entrysize < cxt->hash_entsize)
+                MemSet(lentry + dsh_entrysize, 0,
+                       cxt->hash_entsize - dsh_entrysize);
+        }
+
+        *negative = !sentry;
+    }
+
+    if (*negative)
+        return NULL;
+
+    return (void *) lentry;
+}
+
+/*
+ * pgstat_flush_stat: Flushes table stats out to shared statistics.
+ *
+ *  If nowait is true, returns false if a required lock could not be acquired
+ *  immediately. In that case, stats for some tables may be left in the TSA
+ *  to wait for the next chance. cxt holds dshash-related values that we want
+ *  to keep during the shared stats update. The caller must detach the
+ *  dshashes stored in cxt after use.
+ *
+ *  Returns true if all entries are flushed.
+ */
+static bool
+pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait)
+{
+    static const PgStat_TableCounts all_zeroes;
+    TabStatusArray *tsa;
+    HTAB           *new_tsa_hash = NULL;
+    TabStatusArray *dest_tsa = pgStatTabList;
+    int                dest_elem = 0;
+    int                i;
+
+    /* nothing to do, just return  */
+    if (pgStatTabHash == NULL)
+        return true;
 
     /*
      * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
+     * entries it points to. We recreate it if needed.
      */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
+    hash_destroy(pgStatTabHash);
     pgStatTabHash = NULL;
 
     /*
      * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
+     * have counts, and try flushing them out to shared statistics.
      */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
     for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
     {
         for (i = 0; i < tsa->tsa_used; i++)
         {
             PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
 
             /* Shouldn't have any pending transaction-dependent counts */
             Assert(entry->trans == NULL);
@@ -877,178 +774,344 @@ pgstat_report_stat(bool force)
                        sizeof(PgStat_TableCounts)) == 0)
                 continue;
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* try to apply the tab stats */
+            if (!pgstat_flush_tabstat(cxt, nowait, entry))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                /*
+                 * Failed. Keep the entry, packing it toward the start of TSA.
+                 */
+                TabStatHashEntry *hash_entry;
+                bool found;
+
+                if (new_tsa_hash == NULL)
+                    new_tsa_hash = create_tabstat_hash();
+
+                /* Create hash entry for this entry */
+                hash_entry = hash_search(new_tsa_hash, &entry->t_id,
+                                         HASH_ENTER, &found);
+                Assert(!found);
+
+                /*
+                 * Move the insertion pointer to the next segment. There must
+                 * be enough segments, since we are only keeping some of the
+                 * current elements.
+                 */
+                if (dest_elem >= TABSTAT_QUANTUM)
+                {
+                    Assert(dest_tsa->tsa_next != NULL);
+                    dest_tsa = dest_tsa->tsa_next;
+                    dest_elem = 0;
+                }
+
+                /* Move the entry if needed */
+                if (tsa != dest_tsa || i != dest_elem)
+                {
+                    PgStat_TableStatus *new_entry;
+                    new_entry = &dest_tsa->tsa_entries[dest_elem];
+                    *new_entry = *entry;
+                    entry = new_entry;
+                }
+
+                hash_entry->tsa_entry = entry;
+                dest_elem++;
             }
         }
-        /* zero out TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
     }
 
-    /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
-     */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    /* zero out unused area of TableStatus */
+    dest_tsa->tsa_used = dest_elem;
+    MemSet(&dest_tsa->tsa_entries[dest_elem], 0,
+           (TABSTAT_QUANTUM - dest_elem) * sizeof(PgStat_TableStatus));
+    while (dest_tsa->tsa_next)
+    {
+        dest_tsa = dest_tsa->tsa_next;
+        MemSet(dest_tsa->tsa_entries, 0,
+               dest_tsa->tsa_used * sizeof(PgStat_TableStatus));
+        dest_tsa->tsa_used = 0;
+    }
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+    /* and set the new TSA hash if any */
+    pgStatTabHash = new_tsa_hash;
+
+    /*
+     * We no longer need the shared database and table entries, but may still
+     * use those for my database.
+     */
+    if (cxt->shdb_tabhash)
+    {
+        dshash_detach(cxt->shdb_tabhash);
+        unpin_hashes(cxt->shdbentry, cxt->shgeneration);
+        cxt->shdb_tabhash = NULL;
+        cxt->shdbentry = NULL;
+    }
+
+    return pgStatTabHash == NULL;
 }
 
+
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * pgstat_flush_tabstat: Flushes a table stats entry.
+ *
+ *  If nowait is true, returns false on lock failure.  Dshashes for table and
+ *  function stats are kept attached in cxt. The caller must detach them after
+ *  use.
+ *
+ *  Returns true if the entry is flushed.
  */
-static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+bool
+pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
+                     PgStat_TableStatus *entry)
 {
-    int            n;
-    int            len;
+    Oid        dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
+    int        table_mode = PGSTAT_FETCH_EXCLUSIVE;
+    bool    updated = false;
+    dshash_table *tabhash;
+    PgStat_StatDBEntry *dbent;
+    int        generation;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    if (nowait)
+        table_mode |= PGSTAT_FETCH_NOWAIT;
 
-    /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
-     */
-    if (OidIsValid(tsmsg->m_databaseid))
+    /* Attach the required table hash if not yet attached. */
+    if ((entry->t_shared ? cxt->shdb_tabhash : cxt->mydb_tabhash) == NULL)
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
-        pgStatXactCommit = 0;
-        pgStatXactRollback = 0;
-        pgStatBlockReadTime = 0;
-        pgStatBlockWriteTime = 0;
+        /* We don't have corresponding dbentry here */
+        dbent = pgstat_get_db_entry(dboid, table_mode, NULL);
+        if (!dbent)
+            return false;
+
+        /*
+         * We don't hold dshash-lock on dbentries, since the dbentries cannot
+         * be dropped meanwhile.
+         */
+        generation = pin_hashes(dbent);
+        tabhash = attach_table_hash(dbent, generation);
+
+        if (entry->t_shared)
+        {
+            cxt->shgeneration = generation;
+            cxt->shdbentry = dbent;
+            cxt->shdb_tabhash = tabhash;
+        }
+        else
+        {
+            cxt->mygeneration = generation;
+            cxt->mydbentry = dbent;
+            cxt->mydb_tabhash = tabhash;
+
+            /*
+             * We attach mydb tabhash once per flush. This is our chance to
+             * update database-wide stats.
+             */
+            LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
+            dbent->n_xact_commit += pgStatXactCommit;
+            dbent->n_xact_rollback += pgStatXactRollback;
+            dbent->n_block_read_time += pgStatBlockReadTime;
+            dbent->n_block_write_time += pgStatBlockWriteTime;
+            LWLockRelease(&dbent->lock);
+            pgStatXactCommit = 0;
+            pgStatXactRollback = 0;
+            pgStatBlockReadTime = 0;
+            pgStatBlockWriteTime = 0;
+        }
+    }
+    else if (entry->t_shared)
+    {
+        dbent = cxt->shdbentry;
+        tabhash = cxt->shdb_tabhash;
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        dbent = cxt->mydbentry;
+        tabhash = cxt->mydb_tabhash;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    /*
+     * dbentry is always available here, so try flushing table stats first, then
+     * database stats.
+     */
+    if (pgstat_update_tabentry(tabhash, entry, nowait))
+    {
+        pgstat_update_dbentry(dbent, entry);
+        updated = true;
+    }
+
+    return updated;
 }
 
 /*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
+ * pgstat_flush_funcstats: Flushes function stats.
+ *
+ *  If nowait is true, returns false on lock failure and leaves some of the
+ *  entries alone in the local hash.
+ *
+ *  Returns true if all entries are flushed.
  */
-static void
-pgstat_send_funcstats(void)
+static bool
+pgstat_flush_funcstats(pgstat_flush_stat_context *cxt, bool nowait)
 {
     /* we assume this inits to all zeroes: */
     static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
+    dshash_table   *funchash;
     HASH_SEQ_STATUS fstat;
+    PgStat_BackendFunctionEntry *bestat;
 
+    /* nothing to do, just return  */
     if (pgStatFunctions == NULL)
-        return;
+        return true;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
-
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    /* get dbentry into cxt if not yet done */
+    if (cxt->mydbentry == NULL)
     {
-        PgStat_FunctionEntry *m_ent;
+        int op = PGSTAT_FETCH_EXCLUSIVE;
 
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
+        if (nowait)
+            op |= PGSTAT_FETCH_NOWAIT;
+
+        cxt->mydbentry = pgstat_get_db_entry(MyDatabaseId, op, NULL);
+
+        if (cxt->mydbentry == NULL)
+            return false;
+
+        cxt->mygeneration = pin_hashes(cxt->mydbentry);
+    }
+
+    funchash = attach_function_hash(cxt->mydbentry, cxt->mygeneration);
+    if (funchash == NULL)
+        return false;
+
+    have_function_stats = false;
+
+    /*
+     * Scan through pgStatFunctions to find functions that actually have
+     * counts, and try flushing them out to shared statistics.
+     */
+    hash_seq_init(&fstat, pgStatFunctions);
+    while ((bestat = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    {
+        bool found;
+        PgStat_StatFuncEntry *funcent = NULL;
+
+        /* Skip it if no counts accumulated for it so far */
+        if (memcmp(&bestat->f_counts, &all_zeroes,
                    sizeof(PgStat_FunctionCounts)) == 0)
             continue;
 
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
+        funcent = (PgStat_StatFuncEntry *)
+            dshash_find_or_insert_extended(funchash, (void *) &(bestat->f_id),
+                                           &found, nowait);
 
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
+        /*
+         * We couldn't acquire lock on the required entry. Leave the local
+         * entry alone.
+         */
+        if (!funcent)
         {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
+            have_function_stats = true;
+            continue;
         }
 
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
+        /* Initialize if it's new, or add to it. */
+        if (!found)
+        {
+            funcent->functionid = bestat->f_id;
+            funcent->f_numcalls = bestat->f_counts.f_numcalls;
+            funcent->f_total_time =
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_total_time);
+            funcent->f_self_time =
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_self_time);
+        }
+        else
+        {
+            funcent->f_numcalls += bestat->f_counts.f_numcalls;
+            funcent->f_total_time +=
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_total_time);
+            funcent->f_self_time +=
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_self_time);
+        }
+        dshash_release_lock(funchash, funcent);
+
+        /* reset used counts */
+        MemSet(&bestat->f_counts, 0, sizeof(PgStat_FunctionCounts));
     }
 
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
+    return !have_function_stats;
 }
 
+/*
+ * pgstat_flush_miscstats: Flushes out miscellaneous stats.
+ *
+ *  If nowait is true, returns false on lock failure on dbentry.
+ *
+ *  Returns true if all the miscellaneous stats are flushed out.
+ */
+static bool
+pgstat_flush_miscstats(pgstat_flush_stat_context *cxt, bool nowait)
+{
+    /* get dbentry if not yet done */
+    if (cxt->mydbentry == NULL)
+    {
+        int op = PGSTAT_FETCH_EXCLUSIVE;
+        if (nowait)
+            op |= PGSTAT_FETCH_NOWAIT;
+
+        cxt->mydbentry = pgstat_get_db_entry(MyDatabaseId, op, NULL);
+
+        /* Lock failure, return. */
+        if (cxt->mydbentry == NULL)
+            return false;
+
+        cxt->mygeneration = pin_hashes(cxt->mydbentry);
+    }
+
+    LWLockAcquire(&cxt->mydbentry->lock, LW_EXCLUSIVE);
+    if (have_recovery_conflicts)
+        pgstat_flush_recovery_conflict(cxt->mydbentry);
+    if (n_deadlocks != 0)
+        pgstat_flush_deadlock(cxt->mydbentry);
+    if (n_tmpfiles != 0)
+        pgstat_flush_tempfile(cxt->mydbentry);
+    LWLockRelease(&cxt->mydbentry->lock);
+
+    return true;
+}
 
 /* ----------
  * pgstat_vacuum_stat() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *    Remove stats entries for objects that no longer exist.
  * ----------
  */
 void
 pgstat_vacuum_stat(void)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
+    HTAB       *oidtab;
+    dshash_table *dshtable;
+    dshash_seq_status dshstat;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatFuncEntry *funcentry;
-    int            len;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* we don't collect statistics in standalone mode */
+    if (!IsUnderPostmaster)
         return;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* If not done for this transaction, take a snapshot of stats */
+    pgstat_snapshot_global_stats();
 
     /*
      * Read pg_database and make a list of OIDs of all existing databases
      */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    oidtab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
 
     /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
+     * Search the database hash table for dead databases and drop them
+     * from the hash.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+
+    dshash_seq_init(&dshstat, pgStatDBHash, false, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            dbid = dbentry->databaseid;
 
@@ -1056,137 +1119,77 @@ pgstat_vacuum_stat(void)
 
         /* the DB entry for shared tables (with InvalidOid) is never dropped */
         if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
+            hash_search(oidtab, (void *) &dbid, HASH_FIND, NULL) == NULL)
             pgstat_drop_database(dbid);
     }
 
     /* Clean up */
-    hash_destroy(htab);
+    hash_destroy(oidtab);
 
     /*
      * Lookup our own database entry; if not found, nothing more to do.
      */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, NULL);
+    if (!dbentry)
         return;
 
     /*
      * Similarly to above, make a list of all known relations in this DB.
      */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
+    oidtab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
 
     /*
      * Check for all tables listed in stats hashtable if they still exist.
+     * Stats cache is useless here so directly search the shared hash.
      */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+    dshtable = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&dshstat, dshtable, false, true);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            tabid = tabentry->tableid;
 
         CHECK_FOR_INTERRUPTS();
 
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+        if (hash_search(oidtab, (void *) &tabid, HASH_FIND, NULL) != NULL)
             continue;
 
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
-    }
-
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
+        /* Not there, so purge this table */
+        dshash_delete_entry(dshtable, tabentry);
     }
+    dshash_detach(dshtable);
 
     /* Clean up */
-    hash_destroy(htab);
+    hash_destroy(oidtab);
 
     /*
      * Now repeat the above steps for functions.  However, we needn't bother
      * in the common case where no function stats are being collected.
      */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
+    if (dbentry->functions != DSM_HANDLE_INVALID)
     {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
+        dshtable =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        oidtab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
 
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
+        dshash_seq_init(&dshstat, dshtable, false, true);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&dshstat)) != NULL)
         {
             Oid            funcid = funcentry->functionid;
 
             CHECK_FOR_INTERRUPTS();
 
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
+            if (hash_search(oidtab, (void *) &funcid, HASH_FIND, NULL) != NULL)
                 continue;
 
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
+            /* Not there, so remove this function */
+            dshash_delete_entry(dshtable, funcentry);
         }
 
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
+        hash_destroy(oidtab);
 
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
+        dshash_detach(dshtable);
     }
+    dshash_release_lock(pgStatDBHash, dbentry);
 }
 
 
@@ -1244,62 +1247,57 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
+ *    Remove entry for the database that we just dropped.
+ *
+ *    If some stats are flushed after this, this entry will be re-created, but
+ *    we will still clean the dead DB eventually via future invocations of
+ *    pgstat_vacuum_stat().
  * ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    Assert(OidIsValid(databaseid));
+    Assert(pgStatDBHash);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Lookup the database in the hashtable with exclusive lock.
+     */
+    dbentry = pgstat_get_db_entry(databaseid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    /*
+     * If found, remove it (along with the db statfile).
+     */
+    if (dbentry)
+    {
+        LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+        Assert(dbentry->refcnt == 0);
+
+        /* No one is using this database, so it's safe to drop everything. */
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+        }
+        LWLockRelease(&dbentry->lock);
+
+        dshash_delete_entry(pgStatDBHash, (void *)dbentry);
+    }
 }
 
-
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
-
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
-}
-#endif                            /* NOT_USED */
-
-
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1308,20 +1306,31 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    PgStat_StatDBEntry       *dbentry;
+    PgStat_TableLookupState status;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    Assert(pgStatDBHash);
+
+    /*
+     * Lookup the database in the hashtable.  Nothing to do if not there.
+     */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, &status);
+
+    if (!dbentry)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    /* This database is active, safe to release the lock immediately. */
+    dshash_release_lock(pgStatDBHash, dbentry);
+
+    /* Reset database-level stats. */
+    reset_dbentry_counters(dbentry);
+
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1330,29 +1339,35 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
+    /* Reset the archiver statistics for the cluster. */
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
+    }
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+        /* Reset the global background writer statistics for the cluster. */
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    }
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
+    LWLockRelease(StatsLock);
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1361,17 +1376,42 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
+    int generation;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_FETCH_EXCLUSIVE, NULL);
+
+    if (!dbentry)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* This database is active, safe to release the lock immediately. */
+    generation = pin_hashes(dbentry);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Set the reset timestamp for the whole database */
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&dbentry->lock);
+
+    /* Remove object if it exists, ignore if not */
+    if (type == RESET_TABLE)
+    {
+        dshash_table *t = attach_table_hash(dbentry, generation);
+        dshash_delete_key(t, (void *) &objoid);
+        dshash_detach(t);
+    }
+
+    if (type == RESET_FUNCTION)
+    {
+        dshash_table *t = attach_function_hash(dbentry, generation);
+        if (t)
+        {
+            dshash_delete_key(t, (void *) &objoid);
+            dshash_detach(t);
+        }
+    }
+    unpin_hashes(dbentry, generation);
 }
 
 /* ----------
@@ -1385,48 +1425,83 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    Assert(pgStatDBHash);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    /*
+     * Store the last autovacuum time in the database's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+    dshash_release_lock(pgStatDBHash, dbentry);
 
-    pgstat_send(&msg, sizeof(msg));
+    ts = GetCurrentTimestamp();
+
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->last_autovac_time = ts;
+    LWLockRelease(&dbentry->lock);
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report about the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
+    int                    generation;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(pgStatDBHash);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hash table entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+    generation = pin_hashes(dbentry);
+    table = attach_table_hash(dbentry, generation);
+
+    tabentry = pgstat_get_tab_entry(table, tableoid, true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->vacuum_count++;
+    }
+    dshash_release_lock(table, tabentry);
+
+    dshash_detach(table);
+    unpin_hashes(dbentry, generation);
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report about the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1437,9 +1512,15 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table       *table;
+    int                    generation;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(pgStatDBHash);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
     /*
@@ -1468,114 +1549,217 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_FETCH_EXCLUSIVE, NULL);
+    generation = pin_hashes(dbentry);
+    table = attach_table_hash(dbentry, generation);
+    tabentry = pgstat_get_tab_entry(table, RelationGetRelid(rel), true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    dshash_release_lock(table, tabentry);
+
+    dshash_detach(table);
+    unpin_hashes(dbentry, generation);
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupState status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(pgStatDBHash);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    have_recovery_conflicts = true;
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            n_conflict_startup_deadlock++;
+            break;
+    }
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    /* We had a chance to flush immediately */
+    pgstat_flush_recovery_conflict(dbentry);
+
+    dshash_release_lock(pgStatDBHash, dbentry);
+}
+
+/*
+ * flush recovery conflict stats
+ */
+static void
+pgstat_flush_recovery_conflict(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_conflict_tablespace    += n_conflict_tablespace;
+    dbentry->n_conflict_lock         += n_conflict_lock;
+    dbentry->n_conflict_snapshot    += n_conflict_snapshot;
+    dbentry->n_conflict_bufferpin    += n_conflict_bufferpin;
+    dbentry->n_conflict_startup_deadlock += n_conflict_startup_deadlock;
+
+    n_conflict_tablespace = 0;
+    n_conflict_lock = 0;
+    n_conflict_snapshot = 0;
+    n_conflict_bufferpin = 0;
+    n_conflict_startup_deadlock = 0;
+
+    have_recovery_conflicts = false;
 }
 
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupState status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(pgStatDBHash);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    n_deadlocks++;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    /* We had a chance to flush immediately */
+    pgstat_flush_deadlock(dbentry);
+
+    dshash_release_lock(pgStatDBHash, dbentry);
+}
+
+/*
+ * flush deadlock stats
+ */
+static void
+pgstat_flush_deadlock(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_deadlocks += n_deadlocks;
+    n_deadlocks = 0;
 }
 
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupState status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    Assert(pgStatDBHash);
+
+    if (!pgstat_track_counts || !IsUnderPostmaster)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
-}
+    if (filesize > 0) /* Is there a case where filesize is really 0? */
+    {
+        n_tmpfilesize += filesize; /* needs overflow check */
+        n_tmpfiles++;
+    }
 
-
-/* ----------
- * pgstat_ping() -
- *
- *    Send some junk data to the collector to increase traffic.
- * ----------
- */
-void
-pgstat_ping(void)
-{
-    PgStat_MsgDummy msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (n_tmpfiles == 0)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_FETCH_EXCLUSIVE | PGSTAT_FETCH_NOWAIT,
+                                  &status);
+
+    if (status == PGSTAT_ENTRY_LOCK_FAILED)
+        return;
+
+    /* We had a chance to flush immediately */
+    pgstat_flush_tempfile(dbentry);
+
+    dshash_release_lock(pgStatDBHash, dbentry);
 }
 
-/* ----------
- * pgstat_send_inquiry() -
- *
- *    Notify collector that we need fresh data.
- * ----------
+/*
+ * flush temporary file stats
  */
 static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
+pgstat_flush_tempfile(PgStat_StatDBEntry *dbentry)
 {
-    PgStat_MsgInquiry msg;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
+    dbentry->n_temp_bytes += n_tmpfilesize;
+    dbentry->n_temp_files += n_tmpfiles;
+    n_tmpfilesize = 0;
+    n_tmpfiles = 0;
+
 }
 
-
 /*
  * Initialize function call usage data.
  * Called by the executor before invoking a function.
@@ -1726,7 +1910,7 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    if (!pgstat_track_counts || !IsUnderPostmaster)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1745,6 +1929,24 @@ pgstat_initstats(Relation rel)
     rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
 }
 
+/*
+ * create_tabstat_hash - create local hash as transactional storage
+ */
+static HTAB *
+create_tabstat_hash(void)
+{
+    HASHCTL        ctl;
+
+    MemSet(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(Oid);
+    ctl.entrysize = sizeof(TabStatHashEntry);
+
+    return hash_create("pgstat TabStatusArray lookup hash table",
+                       TABSTAT_QUANTUM,
+                       &ctl,
+                       HASH_ELEM | HASH_BLOBS);
+}
+
 /*
  * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
  */
@@ -1760,18 +1962,7 @@ get_tabstat_entry(Oid rel_id, bool isshared)
      * Create hash table if we don't have it already.
      */
     if (pgStatTabHash == NULL)
-    {
-        HASHCTL        ctl;
-
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
-    }
+        pgStatTabHash = create_tabstat_hash();
 
     /*
      * Find an entry or create a new one.
@@ -2125,8 +2316,8 @@ AtEOXact_PgStat(bool isCommit)
     }
     pgStatXactStack = NULL;
 
-    /* Make sure any stats snapshot is thrown away */
-    pgstat_clear_snapshot();
+    /* mark that the next reference will be the first in a transaction */
+    first_in_xact = true;
 }
 
 /* ----------
@@ -2307,8 +2498,8 @@ PostPrepare_PgStat(void)
     }
     pgStatXactStack = NULL;
 
-    /* Make sure any stats snapshot is thrown away */
-    pgstat_clear_snapshot();
+    /* mark that the next reference will be the first in a transaction */
+    first_in_xact = true;
 }
 
 /*
@@ -2380,30 +2571,37 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
- * ----------
+ *    Find database stats entry on backends. The returned entries are cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_cxt cxt =
+    {
+        .hash_name = "local database stats hash",
+        .hash = NULL,
+        .hash_entsize = sizeof(PgStat_StatDBEntry_snapshot),
+        .dshash = NULL,
+        .dsh_handle = DSM_HANDLE_INVALID,
+        .dsh_params = &dsh_dbparams
+    };
 
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    /* If not done for this transaction, take a snapshot of global stats */
+    pgstat_snapshot_global_stats();
+
+    cxt.dshash = &pgStatDBHash;
+    cxt.hash = &pgStatDBEntrySnapshot;
+
+    /* the caller has no business with snapshot-local members */
+    return (PgStat_StatDBEntry *)
+        snapshot_statentry(&cxt, dbid);
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
@@ -2416,51 +2614,66 @@ pgstat_fetch_stat_dbentry(Oid dbid)
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* Lookup our database, then look in its table hash table. */
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    if (dbentry == NULL)
+        return NULL;
 
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    dbentry = pgstat_fetch_stat_dbentry(InvalidOid);
+    if (dbentry == NULL)
+        return NULL;
+
+    tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     return NULL;
 }
 
+/* ----------
+ * pgstat_fetch_stat_tabentry_extended() -
+ *
+ *    Find table stats entry on backends. The returned entries are cached until
+ *    transaction end or pgstat_clear_snapshot() is called.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_extended(PgStat_StatDBEntry *dbent, Oid reloid)
+{
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_cxt cxt =
+    {
+        .hash_name = "table stats snapshot hash",
+        .hash = NULL,
+        .hash_entsize = sizeof(PgStat_StatTabEntry),
+        .dshash = NULL,
+        .dsh_handle = DSM_HANDLE_INVALID,
+        .dsh_params = &dsh_tblparams
+    };
+    PgStat_StatDBEntry_snapshot *local_dbent;
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    /* dbent given to this function is an alias of PgStat_StatDBEntry_snapshot */
+    local_dbent = (PgStat_StatDBEntry_snapshot *)dbent;
+    cxt.hash = &local_dbent->snapshot_tables;
+    cxt.dshash = &local_dbent->dshash_tables;
+    cxt.dsh_handle = dbent->tables;
+
+    return (PgStat_StatTabEntry *)
+        snapshot_statentry(&cxt, reloid);
+}
+
 
 /* ----------
  * pgstat_fetch_stat_funcentry() -
@@ -2475,21 +2688,125 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     PgStat_StatDBEntry *dbentry;
     PgStat_StatFuncEntry *funcentry = NULL;
 
-    /* load the stats file if needed */
-    backend_read_statsfile();
-
-    /* Lookup our database, then find the requested function.  */
+    /* Lookup our database, then find the requested function */
     dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
+    if (dbentry == NULL)
+        return NULL;
+
+    funcentry = pgstat_fetch_stat_funcentry_extended(dbentry, func_id);
 
     return funcentry;
 }
 
+/* ----------
+ * pgstat_fetch_stat_funcentry_extended() -
+ *
+ *    Find function stats entry on backends. The returned entries are cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
+ */
+static PgStat_StatFuncEntry *
+pgstat_fetch_stat_funcentry_extended(PgStat_StatDBEntry *dbent, Oid funcid)
+{
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_cxt cxt =
+    {
+        .hash_name = "function stats snapshot hash",
+        .hash = NULL,
+        .hash_entsize = sizeof(PgStat_StatFuncEntry),
+        .dshash = NULL,
+        .dsh_handle = DSM_HANDLE_INVALID,
+        .dsh_params = &dsh_funcparams
+    };
+    PgStat_StatDBEntry_snapshot *local_dbent;
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    if (dbent->functions == DSM_HANDLE_INVALID)
+        return NULL;
+
+    /* dbent given to this function is an alias of PgStat_StatDBEntry_snapshot */
+    local_dbent = (PgStat_StatDBEntry_snapshot *)dbent;
+    cxt.hash = &local_dbent->snapshot_functions;
+    cxt.dshash = &local_dbent->dshash_functions;
+    cxt.dsh_handle = dbent->functions;
+
+    return (PgStat_StatFuncEntry *)
+        snapshot_statentry(&cxt, funcid);
+}
+
+/*
+ * pgstat_snapshot_global_stats() -
+ *
+ * Makes a snapshot of global stats if not done yet.  It is kept until a
+ * subsequent call of pgstat_clear_snapshot() or the end of the current
+ * memory context (typically TopTransactionContext).
+ */
+static void
+pgstat_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext;
+    TimestampTz update_time = 0;
+
+    /* The snapshot lives within CacheMemoryContext */
+    if (pgStatSnapshotContext == NULL)
+    {
+        pgStatSnapshotContext =
+            AllocSetContextCreate(CacheMemoryContext,
+                                  "Stats snapshot context",
+                                  ALLOCSET_DEFAULT_SIZES);
+    }
+
+    /*
+     * Shared stats are updated frequently, especially when many backends are
+     * running, but for performance reasons we don't want to rebuild the
+     * snapshot that often. Keep it for at least the minimum stats update
+     * interval of a backend. As a result, a snapshot may live across
+     * multiple transactions.
+     */
+    if (first_in_xact && IsTransactionState())
+    {
+        first_in_xact = false;
+        LWLockAcquire(StatsLock, LW_SHARED);
+        update_time = StatsShmem->last_update;
+        LWLockRelease(StatsLock);
+
+        if (snapshot_expires_at < update_time)
+        {
+            /* It is fine to expire the backend status snapshot as well */
+            pgstat_clear_snapshot();
+
+            snapshot_expires_at =
+                update_time + PGSTAT_STAT_MIN_INTERVAL * USECS_PER_SEC / 1000;
+        }
+    }
+
+    /* Nothing to do if already done */
+    if (snapshot_globalStats)
+        return;
+
+    Assert(snapshot_archiverStats == NULL);
+
+    oldcontext = MemoryContextSwitchTo(pgStatSnapshotContext);
+
+    /* global stats can be just copied  */
+    LWLockAcquire(StatsLock, LW_SHARED);
+    snapshot_globalStats = palloc(sizeof(PgStat_GlobalStats));
+    memcpy(snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+
+    snapshot_archiverStats = palloc(sizeof(PgStat_ArchiverStats));
+    memcpy(snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+    LWLockRelease(StatsLock);
+
+    /* set the timestamp of this snapshot */
+    snapshot_globalStats->stats_timestamp = update_time;
+
+    MemoryContextSwitchTo(oldcontext);
+
+    return;
+}
 
 /* ----------
  * pgstat_fetch_stat_beentry() -
@@ -2561,9 +2878,10 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &archiverStats;
+    return snapshot_archiverStats;
 }
 
 
@@ -2578,9 +2896,10 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &globalStats;
+    return snapshot_globalStats;
 }
 
 
@@ -2769,8 +3088,8 @@ pgstat_initialize(void)
         MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
     }
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* needs to be called before dsm shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -2856,9 +3175,6 @@ pgstat_bestart(void)
             case BgWriterProcess:
                 beentry->st_backendType = B_BG_WRITER;
                 break;
-            case ArchiverProcess:
-                beentry->st_backendType = B_ARCHIVER;
-                break;
             case CheckpointerProcess:
                 beentry->st_backendType = B_CHECKPOINTER;
                 break;
@@ -3231,7 +3547,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3498,9 +3815,6 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
-            break;
         case WAIT_EVENT_RECOVERY_WAL_ALL:
             event_name = "RecoveryWalAll";
             break;
@@ -4151,75 +4465,39 @@ pgstat_get_backend_desc(BackendType backendType)
  * ------------------------------------------------------------
  */
 
-
-/* ----------
- * pgstat_setheader() -
- *
- *        Set common header fields in a statistics message
- * ----------
- */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
-    {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
 /* ----------
  * pgstat_send_archiver() -
  *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
+ *        Report archiver statistics
  * ----------
  */
 void
 pgstat_send_archiver(const char *xlog, bool failed)
 {
-    PgStat_MsgArchiver msg;
-
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (failed)
+    {
+        /* Failed archival attempt */
+        ++shared_archiverStats->failed_count;
+        memcpy(shared_archiverStats->last_failed_wal, xlog,
+               sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = GetCurrentTimestamp();
+    }
+    else
+    {
+        /* Successful archival operation */
+        ++shared_archiverStats->archived_count;
+        memcpy(shared_archiverStats->last_archived_wal, xlog,
+               sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = GetCurrentTimestamp();
+    }
+    LWLockRelease(StatsLock);
 }
 
 /* ----------
  * pgstat_send_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
@@ -4228,6 +4506,8 @@ pgstat_send_bgwriter(void)
     /* We assume this initializes to zeroes */
     static const PgStat_MsgBgWriter all_zeroes;
 
+    PgStat_MsgBgWriter *s = &BgWriterStats;
+
     /*
      * This function can be called even if nothing at all has happened. In
      * this case, avoid sending a completely empty message to the stats
@@ -4236,11 +4516,18 @@ pgstat_send_bgwriter(void)
     if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += s->m_timed_checkpoints;
+    shared_globalStats->requested_checkpoints += s->m_requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += s->m_checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += s->m_checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += s->m_buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += s->m_buf_written_clean;
+    shared_globalStats->maxwritten_clean += s->m_maxwritten_clean;
+    shared_globalStats->buf_written_backend += s->m_buf_written_backend;
+    shared_globalStats->buf_fsync_backend += s->m_buf_fsync_backend;
+    shared_globalStats->buf_alloc += s->m_buf_alloc;
+    LWLockRelease(StatsLock);
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4249,298 +4536,164 @@ pgstat_send_bgwriter(void)
 }
 
 
-/* ----------
- * PgstatCollectorMain() -
+/*
+ * Pin and Unpin dbentry.
  *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
+ * To reduce memory usage and for speed, counters are reset by recreating
+ * the dshash instead of removing entries one by one while holding a lock on
+ * the whole dshash. On the other hand, a dshash cannot be destroyed until
+ * all referrers have gone, so other backends might have to wait a long time
+ * for a counter reset. To avoid that, we isolate the hashes under
+ * destruction as another "generation": no longer used, but not yet removable.
+ *
+ * When we start accessing hashes on a dbentry, call pin_hashes() to acquire
+ * the current "generation". unpin_hashes() removes the older generation's
+ * hashes when all referrers have gone.
  */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
+static int
+pin_hashes(PgStat_StatDBEntry *dbentry)
 {
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
+    int    generation;
 
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, pgstat_sighup_handler);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, pgstat_exit);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->refcnt++;
+    generation = dbentry->generation;
+    LWLockRelease(&dbentry->lock);
 
-    /*
-     * Identify myself via ps
-     */
-    init_ps_display("stats collector", "", "", "");
+    dshash_release_lock(pgStatDBHash, dbentry);
 
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do recognize got_SIGHUP inside
-     * the inner loop, which means that such interrupts will get serviced but
-     * the latch won't get cleared until next time there is a break in the
-     * action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (need_exit)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * need_exit becomes set.
-         */
-        while (!need_exit)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (got_SIGHUP)
-            {
-                got_SIGHUP = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry((PgStat_MsgInquiry *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat((PgStat_MsgTabstat *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge((PgStat_MsgTabpurge *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb((PgStat_MsgDropdb *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter((PgStat_MsgResetcounter *) &msg,
-                                             len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(
-                                                   (PgStat_MsgResetsharedcounter *) &msg,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(
-                                                   (PgStat_MsgResetsinglecounter *) &msg,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac((PgStat_MsgAutovacStart *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum((PgStat_MsgVacuum *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze((PgStat_MsgAnalyze *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver((PgStat_MsgArchiver *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter((PgStat_MsgBgWriter *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat((PgStat_MsgFuncstat *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge((PgStat_MsgFuncpurge *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict((PgStat_MsgRecoveryConflict *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock((PgStat_MsgDeadlock *) &msg, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile((PgStat_MsgTempFile *) &msg, len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr & WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    exit(0);
-}
-
-
-/* SIGQUIT signal handler for collector process */
-static void
-pgstat_exit(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    need_exit = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
-/* SIGHUP handler for collector process */
-static void
-pgstat_sighup_handler(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    got_SIGHUP = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
+    return generation;
 }
 
 /*
- * Subroutine to clear stats in a database entry
+ * Unpin hashes in dbentry. If the given generation is isolated, destroy it
+ * after all referrers have gone. Otherwise just decrease the reference count.
+ */
+static void
+unpin_hashes(PgStat_StatDBEntry *dbentry, int generation)
+{
+    dshash_table *tables;
+    dshash_table *funcs = NULL;
+
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+
+    /* using current generation, just decrease refcount */
+    if (dbentry->generation == generation)
+    {
+        dbentry->refcnt--;
+        LWLockRelease(&dbentry->lock);
+        return;
+    }
+
+    /*
+     * It is isolated, waiting for all referrers to end.
+     */
+    Assert(dbentry->generation == generation + 1);
+
+    if (--dbentry->prev_refcnt > 0)
+    {
+        LWLockRelease(&dbentry->lock);
+        return;
+    }
+
+    /* no referrer remains, remove the hashes */
+    tables = dshash_attach(area, &dsh_tblparams, dbentry->prev_tables, 0);
+    if (dbentry->prev_functions != DSM_HANDLE_INVALID)
+        funcs = dshash_attach(area, &dsh_funcparams,
+                              dbentry->prev_functions, 0);
+
+    dbentry->prev_tables = DSM_HANDLE_INVALID;
+    dbentry->prev_functions = DSM_HANDLE_INVALID;
+
+    /* release the entry immediately */
+    LWLockRelease(&dbentry->lock);
+
+    dshash_destroy(tables);
+    if (funcs)
+        dshash_destroy(funcs);
+
+    return;
+}
+
+/*
+ * attach and return the specified generation of table hash
+ * Returns NULL on lock failure.
+ */
+static dshash_table *
+attach_table_hash(PgStat_StatDBEntry *dbent, int gen)
+{
+    dshash_table *ret;
+
+    LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
+
+    if (dbent->generation == gen)
+        ret = dshash_attach(area, &dsh_tblparams, dbent->tables, 0);
+    else
+    {
+        Assert(dbent->generation == gen + 1);
+        Assert(dbent->prev_tables != DSM_HANDLE_INVALID);
+        ret = dshash_attach(area, &dsh_tblparams, dbent->prev_tables, 0);
+    }
+    LWLockRelease(&dbent->lock);
+
+    return ret;
+}
+
+/* attach and return the specified generation of function hash */
+static dshash_table *
+attach_function_hash(PgStat_StatDBEntry *dbent, int gen)
+{
+    dshash_table *ret = NULL;
+
+
+    LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
+
+    if (dbent->generation == gen)
+    {
+        if (dbent->functions == DSM_HANDLE_INVALID)
+        {
+            dshash_table *funchash =
+                dshash_create(area, &dsh_funcparams, 0);
+            dbent->functions = dshash_get_hash_table_handle(funchash);
+
+            ret = funchash;
+        }
+        else
+            ret = dshash_attach(area, &dsh_funcparams, dbent->functions, 0);
+    }
+    /* don't bother creating useless hash */
+
+    LWLockRelease(&dbent->lock);
+
+    return ret;
+}
+
+static void
+init_dbentry(PgStat_StatDBEntry *dbentry)
+{
+    LWLockInitialize(&dbentry->lock, LWTRANCHE_STATS);
+    dbentry->generation = 0;
+    dbentry->refcnt = 0;
+    dbentry->prev_refcnt = 0;
+    dbentry->tables = DSM_HANDLE_INVALID;
+    dbentry->prev_tables = DSM_HANDLE_INVALID;
+    dbentry->functions = DSM_HANDLE_INVALID;
+    dbentry->prev_functions = DSM_HANDLE_INVALID;
+}
+
+/*
+ * Subroutine to reset stats in a shared database entry
  *
- * Tables and functions hashes are initialized to empty.
+ * All counters are reset. Tables and functions dshashes are destroyed.  If
+ * any backend is pinning this dbentry, the current dshashes are stashed into
+ * the previous "generation" until all accessors have gone. If the previous
+ * generation is already occupied, the current dshashes are so fresh that they
+ * don't need to be cleared.
  */
 static void
 reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
 {
-    HASHCTL        hash_ctl;
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
 
     dbentry->n_xact_commit = 0;
     dbentry->n_xact_rollback = 0;
@@ -4563,23 +4716,808 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
     dbentry->n_block_read_time = 0;
     dbentry->n_block_write_time = 0;
 
+    if (dbentry->refcnt == 0)
+    {
+        /*
+         * No one is referring to the current hash. Removing individual
+         * entries in dshash is very costly so just destroy it.  If someone
+         * pins this entry just after, pin_hashes() returns the current
+         * generation and the attach waits for the LWLock held here.
+         */
+        dshash_table *tbl;
+
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+            dbentry->tables = DSM_HANDLE_INVALID;
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+            dbentry->functions = DSM_HANDLE_INVALID;
+        }
+    }
+    else if (dbentry->prev_refcnt == 0)
+    {
+        /*
+         * Someone is still referring to the current hash and the previous slot
+         * is vacant. Stash the current hash into the previous slot.
+         */
+        dbentry->prev_refcnt = dbentry->refcnt;
+        dbentry->prev_tables = dbentry->tables;
+        dbentry->prev_functions = dbentry->functions;
+        dbentry->refcnt = 0;
+        dbentry->tables = DSM_HANDLE_INVALID;
+        dbentry->functions = DSM_HANDLE_INVALID;
+        dbentry->generation++;
+    }
+    else
+    {
+        Assert(dbentry->prev_refcnt > 0 && dbentry->refcnt > 0);
+        /*
+         * If we get here, another reset request arrived while the old hashes
+         * are still waiting for all referrers to release them. That can only
+         * be a very short time, so we can just ignore this request.
+         */
+    }
+
+    /* Create a new table hash if none exists */
+    if (dbentry->tables == DSM_HANDLE_INVALID)
+    {
+        dshash_table *tbl = dshash_create(area, &dsh_tblparams, 0);
+        dbentry->tables = dshash_get_hash_table_handle(tbl);
+        dshash_detach(tbl);
+    }
+
+    /* Recreate now if needed. */
+    if (dbentry->functions == DSM_HANDLE_INVALID &&
+        pgstat_track_functions != TRACK_FUNC_OFF)
+    {
+        dshash_table *tbl = dshash_create(area, &dsh_funcparams, 0);
+        dbentry->functions = dshash_get_hash_table_handle(tbl);
+        dshash_detach(tbl);
+    }
+
     dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
 
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
+    LWLockRelease(&dbentry->lock);
+}
 
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
+/*
+ * Create the filename for a DB stat file; filename is the output buffer, of
+ * length len.
+ */
+static void
+get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
+{
+    int            printed;
+
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/db_%u.%s",
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
+                       databaseid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
+}
+
+/* ----------
+ * pgstat_write_statsfiles() -
+ *        Write the global statistics file, as well as DB files.
+ * ----------
+ */
+void
+pgstat_write_statsfiles(void)
+{
+    dshash_seq_status hstat;
+    PgStat_StatDBEntry *dbentry;
+    FILE       *fpout;
+    int32        format_id;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    int            rc;
+
+    /* should be called from postmaster  */
+    Assert(!IsUnderPostmaster);
+
+    elog(DEBUG2, "writing stats file \"%s\"", statfile);
+
+    /*
+     * Open the statistics temp file to write out the current values.
+     */
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        return;
+    }
+
+    /*
+     * Set the timestamp of the stats file.
+     */
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
+
+    /*
+     * Write the file header --- currently just a format ID.
+     */
+    format_id = PGSTAT_FILE_FORMAT_ID;
+    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write global stats struct
+     */
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write archiver stats struct
+     */
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Walk through the database table.
+     */
+    dshash_seq_init(&hstat, pgStatDBHash, false, false);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&hstat)) != NULL)
+    {
+        /*
+         * Write out the table and function stats for this DB into the
+         * appropriate per-DB stat file, if required.
+         */
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
+
+        pgstat_write_pgStatDBHashfile(dbentry);
+
+        /*
+         * Write out the DB entry. We don't write the tables or functions
+         * pointers, since they're of no use to any other process.
+         */
+        fputc('D', fpout);
+        rc = fwrite(dbentry,
+                    offsetof(PgStat_StatDBEntry, generation), 1, fpout);
+        (void) rc;                /* we'll check for error with ferror */
+    }
+
+    /*
+     * No more output to be done. Close the temp file and replace the old
+     * pgstat.stat with it.  The ferror() check replaces testing for error
+     * after each individual fputc or fwrite above.
+     */
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+}
+
+/* ----------
+ * pgstat_write_pgStatDBHashfile() -
+ *        Write the stat file for a single database.
+ * ----------
+ */
+static void
+pgstat_write_pgStatDBHashfile(PgStat_StatDBEntry *dbentry)
+{
+    dshash_seq_status tstat;
+    dshash_seq_status fstat;
+    PgStat_StatTabEntry *tabentry;
+    PgStat_StatFuncEntry *funcentry;
+    FILE       *fpout;
+    int32        format_id;
+    Oid            dbid = dbentry->databaseid;
+    int            rc;
+    char        tmpfile[MAXPGPATH];
+    char        statfile[MAXPGPATH];
+    dshash_table *tbl;
+
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
+
+    elog(DEBUG2, "writing stats file \"%s\"", statfile);
+
+    /*
+     * Open the statistics temp file to write out the current values.
+     */
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        return;
+    }
+
+    /*
+     * Write the file header --- currently just a format ID.
+     */
+    format_id = PGSTAT_FILE_FORMAT_ID;
+    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Walk through the database's access stats per table.
+     */
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&tstat, tbl, false, false);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&tstat)) != NULL)
+    {
+        fputc('T', fpout);
+        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
+        (void) rc;                /* we'll check for error with ferror */
+    }
+    dshash_detach(tbl);
+
+    /*
+     * Walk through the database's function stats table.
+     */
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_seq_init(&fstat, tbl, false, false);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&fstat)) != NULL)
+        {
+            fputc('F', fpout);
+            rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+            (void) rc;                /* we'll check for error with ferror */
+        }
+        dshash_detach(tbl);
+    }
+
+    /*
+     * No more output to be done. Close the temp file and replace the old
+     * per-database stats file with it.  The ferror() check replaces testing
+     * for error after each individual fputc or fwrite above.
+     */
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+}
+
+/* ----------
+ * pgstat_read_statsfiles() -
+ *
+ *    Reads existing statistics collector files into the shared stats hash.
+ *
+ * ----------
+ */
+void
+pgstat_read_statsfiles(void)
+{
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatDBEntry dbbuf;
+    FILE       *fpin;
+    int32        format_id;
+    bool        found;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+
+    /* should be called from the postmaster */
+    Assert(!IsUnderPostmaster);
+
+    /*
+     * local cache lives in pgSharedStatsContext.
+     */
+    pgstat_setup_memcxt();
+
+    /*
+     * Create the DB hashtable and global stats area. No lock is needed since
+     * we're alone now.
+     */
+    pgstat_create_shared_stats();
+
+    /*
+     * Set the current timestamp (will be kept only in case we can't load an
+     * existing statsfile).
+     */
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp =
+        shared_globalStats->stat_reset_timestamp;
+
+    /*
+     * Try to open the stats file. If it doesn't exist, the backends simply
+     * return zero for anything and the collector simply starts from scratch
+     * with empty counters.
+     *
+     * ENOENT is a possibility if the stats collector is not running or has
+     * not yet written the stats file the first time.  Any other failure
+     * condition is suspicious.
+     */
+    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(LOG,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            statfile)));
+        return;
+    }
+
+    /*
+     * Verify it's of the expected format.
+     */
+    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
+        format_id != PGSTAT_FILE_FORMAT_ID)
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        goto done;
+    }
+
+    /*
+     * Read global stats struct
+     */
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+        sizeof(*shared_globalStats))
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+        goto done;
+    }
+
+    /*
+     * In the collector, disregard the timestamp we read from the permanent
+     * stats file; we should be willing to write a temp stats file immediately
+     * upon the first request from any backend.  This only matters if the old
+     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
+     * an unusual scenario.
+     */
+    shared_globalStats->stats_timestamp = 0;
+
+    /*
+     * Read archiver stats struct
+     */
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+        sizeof(*shared_archiverStats))
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        goto done;
+    }
+
+    /*
+     * We found an existing collector stats file. Read it and put all the
+     * hashtable entries into place.
+     */
+    for (;;)
+    {
+        switch (fgetc(fpin))
+        {
+                /*
+                 * 'D'    A PgStat_StatDBEntry struct describing a database
+                 * follows.
+                 */
+            case 'D':
+                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, generation),
+                          fpin) != offsetof(PgStat_StatDBEntry, generation))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                /*
+                 * Add to the DB hash
+                 */
+                dbentry = (PgStat_StatDBEntry *)
+                    dshash_find_or_insert(pgStatDBHash, (void *) &dbbuf.databaseid,
+                                          &found);
+
+                /* don't allow duplicate dbentries */
+                if (found)
+                {
+                    dshash_release_lock(pgStatDBHash, dbentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                /* initialize the new shared entry */
+                init_dbentry(dbentry);
+
+                memcpy(dbentry, &dbbuf,
+                       offsetof(PgStat_StatDBEntry, generation));
+
+                /* Read the data from the database-specific file. */
+                pgstat_read_pgStatDBHashfile(dbentry);
+                dshash_release_lock(pgStatDBHash, dbentry);
+                break;
+
+            case 'E':
+                goto done;
+
+            default:
+                ereport(LOG,
+                        (errmsg("corrupted statistics file \"%s\"",
+                                statfile)));
+                goto done;
+        }
+    }
+
+done:
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+
+    return;
+}
+
+
+/* ----------
+ * pgstat_read_pgStatDBHashfile() -
+ *
+ *    Reads in the at-rest statistics file and creates the shared statistics
+ *    tables. The file is removed after reading.
+ * ----------
+ */
+static void
+pgstat_read_pgStatDBHashfile(PgStat_StatDBEntry *dbentry)
+{
+    PgStat_StatTabEntry   *tabentry;
+    PgStat_StatTabEntry        tabbuf;
+    PgStat_StatFuncEntry    funcbuf;
+    PgStat_StatFuncEntry   *funcentry;
+    dshash_table           *tabhash = NULL;
+    dshash_table           *funchash = NULL;
+    FILE       *fpin;
+    int32        format_id;
+    bool        found;
+    char        statfile[MAXPGPATH];
+
+    /* should be called from the postmaster */
+    Assert(!IsUnderPostmaster);
+
+    get_dbstat_filename(false, dbentry->databaseid, statfile, MAXPGPATH);
+
+    /*
+     * Try to open the stats file. If it doesn't exist, the backends simply
+     * return zero for anything and the collector simply starts from scratch
+     * with empty counters.
+     *
+     * ENOENT is a possibility if the stats collector is not running or has
+     * not yet written the stats file the first time.  Any other failure
+     * condition is suspicious.
+     */
+    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(LOG,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            statfile)));
+        return;
+    }
+
+    /*
+     * Verify it's of the expected format.
+     */
+    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
+        format_id != PGSTAT_FILE_FORMAT_ID)
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        goto done;
+    }
+
+    /*
+     * We found an existing statistics file. Read it and put all the hashtable
+     * entries into place.
+     */
+    for (;;)
+    {
+        switch (fgetc(fpin))
+        {
+                /*
+                 * 'T'    A PgStat_StatTabEntry follows.
+                 */
+            case 'T':
+                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
+                          fpin) != sizeof(PgStat_StatTabEntry))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                if (tabhash == NULL)
+                {
+                    tabhash = dshash_create(area, &dsh_tblparams, 0);
+                    dbentry->tables =
+                        dshash_get_hash_table_handle(tabhash);
+                }
+
+                tabentry = (PgStat_StatTabEntry *)
+                    dshash_find_or_insert(tabhash,
+                                          (void *) &tabbuf.tableid, &found);
+
+                /* don't allow duplicate entries */
+                if (found)
+                {
+                    dshash_release_lock(tabhash, tabentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
+                dshash_release_lock(tabhash, tabentry);
+                break;
+
+                /*
+                 * 'F'    A PgStat_StatFuncEntry follows.
+                 */
+            case 'F':
+                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
+                          fpin) != sizeof(PgStat_StatFuncEntry))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                if (funchash == NULL)
+                {
+                    funchash = dshash_create(area, &dsh_funcparams, 0);
+                    dbentry->functions =
+                        dshash_get_hash_table_handle(funchash);
+                }
+
+                funcentry = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert(funchash,
+                                          (void *) &funcbuf.functionid, &found);
+
+                if (found)
+                {
+                    dshash_release_lock(funchash, funcentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
+                dshash_release_lock(funchash, funcentry);
+                break;
+
+                /*
+                 * 'E'    The EOF marker of a complete stats file.
+                 */
+            case 'E':
+                goto done;
+
+            default:
+                ereport(LOG,
+                        (errmsg("corrupted statistics file \"%s\"",
+                                statfile)));
+                goto done;
+        }
+    }
+
+done:
+    if (tabhash)
+        dshash_detach(tabhash);
+    if (funchash)
+        dshash_detach(funchash);
+
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+}
+
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ *    Create pgStatLocalContext and pgSharedStatsContext, if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+    if (!pgStatLocalContext)
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+    if (!pgSharedStatsContext)
+        pgSharedStatsContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Shared activity statistics",
+                                  ALLOCSET_SMALL_SIZES);
+}
+
+/* ----------
+ * pgstat_clear_snapshot() -
+ *
+ *    Discard any data collected in the current transaction.  Any subsequent
+ *    request will cause new snapshots to be read.
+ *
+ *    This is also invoked during transaction commit or abort to discard
+ *    the no-longer-wanted snapshot.
+ * ----------
+ */
+void
+pgstat_clear_snapshot(void)
+{
+    /* Release memory, if any was allocated */
+    if (pgStatLocalContext)
+    {
+        MemoryContextDelete(pgStatLocalContext);
+
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
+    }
+
+    if (pgStatSnapshotContext)
+    {
+        MemoryContextReset(pgStatSnapshotContext);
+
+        /* mark the resources as not allocated */
+        snapshot_globalStats = NULL;
+        snapshot_archiverStats = NULL;
+        pgStatDBEntrySnapshot = NULL;
+    }
+}
+
+static bool
+pgstat_update_tabentry(dshash_table *tabhash, PgStat_TableStatus *stat,
+                       bool nowait)
+{
+    PgStat_StatTabEntry *tabentry;
+    bool    found;
+
+    if (tabhash == NULL)
+        return false;
+
+    tabentry = (PgStat_StatTabEntry *)
+        dshash_find_or_insert_extended(tabhash, (void *) &(stat->t_id),
+                                       &found, nowait);
+
+    /* failed to acquire lock */
+    if (tabentry == NULL)
+        return false;
+
+    if (!found)
+    {
+        /*
+         * If it's a new table entry, initialize counters to the values we
+         * just got.
+         */
+        tabentry->numscans = stat->t_counts.t_numscans;
+        tabentry->tuples_returned = stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched = stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted = stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated = stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted = stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated = stat->t_counts.t_tuples_hot_updated;
+        tabentry->n_live_tuples = stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples = stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze = stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched = stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit = stat->t_counts.t_blocks_hit;
+
+        tabentry->vacuum_timestamp = 0;
+        tabentry->vacuum_count = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_count = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->analyze_count = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+        tabentry->autovac_analyze_count = 0;
+    }
+    else
+    {
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        tabentry->numscans += stat->t_counts.t_numscans;
+        tabentry->tuples_returned += stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched += stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted += stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated += stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted += stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated += stat->t_counts.t_tuples_hot_updated;
+        /* If table was truncated, first reset the live/dead counters */
+        if (stat->t_counts.t_truncated)
+        {
+            tabentry->n_live_tuples = 0;
+            tabentry->n_dead_tuples = 0;
+        }
+        tabentry->n_live_tuples += stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples += stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze += stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched += stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit += stat->t_counts.t_blocks_hit;
+    }
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
+
+    dshash_release_lock(tabhash, tabentry);
+
+    return true;
+}
+
+static void
+pgstat_update_dbentry(PgStat_StatDBEntry *dbentry, PgStat_TableStatus *stat)
+{
+    /*
+     * Add per-table stats to the per-database entry, too.
+     */
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
+    dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
+    dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
+    dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
+    dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
+    dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
+    dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
+    LWLockRelease(&dbentry->lock);
 }
 
 /*
@@ -4588,47 +5526,79 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
  * Else, return NULL.
  */
 static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
+pgstat_get_db_entry(Oid databaseid, int op, PgStat_TableLookupState *status)
 {
     PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
+    bool        nowait = ((op & PGSTAT_FETCH_NOWAIT) != 0);
+    bool        lock_acquired = true;
+    bool        found = true;
 
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
+    if (!IsUnderPostmaster)
         return NULL;
 
-    /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
-     */
-    if (!found)
-        reset_dbentry_counters(result);
+    Assert(pgStatDBHash);
+
+    /* Lookup or create the hash table entry for this database */
+    if (op & PGSTAT_FETCH_EXCLUSIVE)
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_or_insert_extended(pgStatDBHash, &databaseid,
+                                           &found, nowait);
+        if (result == NULL)
+            lock_acquired = false;
+        else if (!found)
+        {
+            /*
+             * If not found, initialize the new one.  This creates empty hash
+             * tables, too.
+             */
+            init_dbentry(result);
+            reset_dbentry_counters(result);
+        }
+    }
+    else
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_extended(pgStatDBHash, &databaseid, true, nowait,
+                                 nowait ? &lock_acquired : NULL);
+        if (result == NULL)
+            found = false;
+    }
+
+    /* Set return status if requested */
+    if (status)
+    {
+        if (!lock_acquired)
+        {
+            Assert(nowait);
+            *status = PGSTAT_ENTRY_LOCK_FAILED;
+        }
+        else if (!found)
+            *status = PGSTAT_ENTRY_NOT_FOUND;
+        else
+            *status = PGSTAT_ENTRY_FOUND;
+    }
 
     return result;
 }
 
-
 /*
  * Lookup the hash table entry for the specified table. If no hash
  * table entry exists, initialize it, if the create parameter is true.
  * Else, return NULL.
  */
 static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create)
 {
     PgStat_StatTabEntry *result;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
 
     /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
+    if (!create)
+        return (PgStat_StatTabEntry *) dshash_find(table, &tableoid, false);
+
+    result = (PgStat_StatTabEntry *)
+        dshash_find_or_insert(table, &tableoid, &found);
 
     if (!create && !found)
         return NULL;
@@ -4661,1685 +5631,6 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
     return result;
 }
 
-
-/* ----------
- * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
- * ----------
- */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
-{
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    FILE       *fpout;
-    int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-    int            rc;
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Set the timestamp of the stats file.
-     */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Write global stats struct
-     */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Write archiver stats struct
-     */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database table.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        /*
-         * Write out the table and function stats for this DB into the
-         * appropriate per-DB stat file, if required.
-         */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
-
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
-
-        /*
-         * Write out the DB entry. We don't write the tables or functions
-         * pointers, since they're of no use to any other process.
-         */
-        fputc('D', fpout);
-        rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
-}
-
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
-
-/* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
- * ----------
- */
-static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
-{
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpout;
-    int32        format_id;
-    Oid            dbid = dbentry->databaseid;
-    int            rc;
-    char        tmpfile[MAXPGPATH];
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database's access stats per table.
-     */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
-    {
-        fputc('T', fpout);
-        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * Walk through the database's function stats table.
-     */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_statsfiles() -
- *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
- *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
- * ----------
- */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
-
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-
-    /*
-     * Set the current timestamp (will be kept only in case we can't load an
-     * existing statsfile).
-     */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return dbhash;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
-        goto done;
-    }
-
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Add to the DB hash
-                 */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-
-    return dbhash;
-}
-
-
-/* ----------
- * pgstat_read_db_statsfile() -
- *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
- * ----------
- */
-static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
-{
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatTabEntry tabbuf;
-    PgStat_StatFuncEntry funcbuf;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'T'    A PgStat_StatTabEntry follows.
-                 */
-            case 'T':
-                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
-                          fpin) != sizeof(PgStat_StatTabEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
-                break;
-
-                /*
-                 * 'F'    A PgStat_StatFuncEntry follows.
-                 */
-            case 'F':
-                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
-                          fpin) != sizeof(PgStat_StatFuncEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
-                break;
-
-                /*
-                 * 'E'    The EOF marker of a complete stats file.
-                 */
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
-}
-
-
-/* ----------
- * pgstat_setup_memcxt() -
- *
- *    Create pgStatLocalContext, if not already done.
- * ----------
- */
-static void
-pgstat_setup_memcxt(void)
-{
-    if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
-}
-
-
-/* ----------
- * pgstat_clear_snapshot() -
- *
- *    Discard any data collected in the current transaction.  Any subsequent
- *    request will cause new snapshots to be read.
- *
- *    This is also invoked during transaction commit or abort to discard
- *    the no-longer-wanted snapshot.
- * ----------
- */
-void
-pgstat_clear_snapshot(void)
-{
-    /* Release memory, if any was allocated */
-    if (pgStatLocalContext)
-        MemoryContextDelete(pgStatLocalContext);
-
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetshared() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
-}
-
 /*
  * Convert a potentially unsafely truncated activity string (see
  * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 820f356038..369d6dde63 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -255,7 +255,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -502,7 +501,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1302,12 +1300,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1756,11 +1748,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2595,8 +2582,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -2927,8 +2912,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -2995,13 +2978,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3076,22 +3052,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3550,22 +3510,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3761,8 +3705,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3801,8 +3743,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4003,8 +3944,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -4977,18 +4916,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5101,12 +5028,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -5976,7 +5897,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6029,8 +5949,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6262,7 +6180,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 5965d3620f..97bca9be24 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -150,6 +150,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -281,8 +282,13 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 
     /* Initialize dynamic shared memory facilities. */
     if (!IsUnderPostmaster)
+    {
         dsm_postmaster_startup(shim);
 
+        /* Stats collector uses dynamic shared memory */
+        StatsShmemInit();
+    }
+
     /*
      * Now give loadable modules a chance to set up their shmem allocations
      */
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 81dac45ae5..004cb26f63 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -521,6 +521,7 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_TBM, "tbm");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
+    LWLockRegisterTranche(LWTRANCHE_STATS, "activity stats");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8b4d94c9a1..fc265a507e 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3159,6 +3159,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3733,6 +3739,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4173,9 +4180,17 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
-                ProcessCompletedNotifies();
-                pgstat_report_stat(false);
+                long stats_timeout;
 
+                ProcessCompletedNotifies();
+
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4210,7 +4225,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4218,6 +4233,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index fd51934aaf..994351ac2d 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index a5ee209f91..5e67b25e18 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -72,6 +72,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -628,6 +629,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1238,6 +1241,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 33869fecc9..8939758c59 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 63a7653457..c67331138b 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 3324be8a81..cb70d00b8f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL statistics collector facility.
  *
  *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
  *
@@ -14,10 +14,11 @@
 #include "datatype/timestamp.h"
 #include "fmgr.h"
 #include "libpq/pqcomm.h"
-#include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -41,32 +42,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -115,13 +90,6 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_blocks_hit;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -180,236 +148,12 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
 /* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
- */
-typedef struct PgStat_MsgHdr
-{
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
+ * PgStat_MsgBgWriter            bgwriter statistics
  * ----------
  */
 typedef struct PgStat_MsgBgWriter
 {
-    PgStat_MsgHdr m_hdr;
-
     PgStat_Counter m_timed_checkpoints;
     PgStat_Counter m_requested_checkpoints;
     PgStat_Counter m_buf_written_checkpoints;
@@ -422,38 +166,14 @@ typedef struct PgStat_MsgBgWriter
     PgStat_Counter m_checkpoint_sync_time;
 } PgStat_MsgBgWriter;
 
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
-
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to apply.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when applying to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -485,81 +205,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgAutovacStart msg_autovacuum;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Statistic collector data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -568,6 +215,12 @@ typedef union PgStat_Msg
 
 #define PGSTAT_FILE_FORMAT_ID    0x01A5BC9D
 
+typedef struct PgStat_DSHash
+{
+    int        refcnt;
+    dshash_table_handle handle;
+} PgStat_DSHash;
+
 /* ----------
  * PgStat_StatDBEntry            The collector's data per database
  * ----------
@@ -597,17 +250,22 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_block_write_time;
 
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;        /* time of db stats update */
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following members must be last in the struct, because we don't write
+     * them out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    int generation;                        /* current generation of the below */
+    int    refcnt;                            /* current gen reference count */
+    dshash_table_handle tables;            /* current gen tables hash */
+    dshash_table_handle functions;        /* current gen functions hash */
+    int    prev_refcnt;                    /* prev gen reference count */
+    dshash_table_handle prev_tables;    /* prev gen tables hash */
+    dshash_table_handle prev_functions;    /* prev gen functions hash */
+    LWLock    lock;                        /* Lock for the above members */
 } PgStat_StatDBEntry;
 
-
 /* ----------
  * PgStat_StatTabEntry            The collector's data per table (or index)
  * ----------
@@ -645,7 +303,7 @@ typedef struct PgStat_StatTabEntry
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            per function stats data
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
@@ -660,7 +318,7 @@ typedef struct PgStat_StatFuncEntry
 
 
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -676,7 +334,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -762,7 +420,6 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
     WAIT_EVENT_RECOVERY_WAL_ALL,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
@@ -1146,6 +803,8 @@ extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used, but will be removed with GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
@@ -1167,31 +826,29 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern void pgstat_initialize(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
 
+/* File input/output functions  */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
-extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
-
+extern void pgstat_reset_shared_counters(const char *target);
+extern void pgstat_reset_single_counter(Oid objectid,
+                                        PgStat_Single_Reset_Type type);
 extern void pgstat_report_autovac(Oid dboid);
 extern void pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples);
@@ -1202,26 +859,10 @@ extern void pgstat_report_analyze(Relation rel,
 extern void pgstat_report_recovery_conflict(int reason);
 extern void pgstat_report_deadlock(void);
 
-extern void pgstat_initialize(void);
 extern void pgstat_bestart(void);
-
 extern void pgstat_report_activity(BackendState state, const char *cmd_str);
-extern void pgstat_report_tempfile(size_t filesize);
-extern void pgstat_report_appname(const char *appname);
-extern void pgstat_report_xact_timestamp(TimestampTz tstamp);
-extern const char *pgstat_get_wait_event(uint32 wait_event_info);
-extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
-extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
-extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
-                                    int buflen);
-extern const char *pgstat_get_backend_desc(BackendType backendType);
 
-extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
-                              Oid relid);
-extern void pgstat_progress_update_param(int index, int64 val);
-extern void pgstat_progress_update_multi_param(int nparam, const int *index,
-                                   const int64 *val);
-extern void pgstat_progress_end_command(void);
+extern void pgstat_report_tempfile(size_t filesize);
 
 extern PgStat_TableStatus *find_tabstat_entry(Oid rel_id);
 extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
@@ -1351,18 +992,38 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 extern void pgstat_send_archiver(const char *xlog, bool failed);
 extern void pgstat_send_bgwriter(void);
 
+
+
+extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
+
+extern void pgstat_report_appname(const char *appname);
+extern void pgstat_report_xact_timestamp(TimestampTz tstamp);
+extern const char *pgstat_get_wait_event(uint32 wait_event_info);
+extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
+extern const char *pgstat_get_backend_current_activity(int pid, bool checkUser);
+extern const char *pgstat_get_crashed_backend_activity(int pid, char *buffer,
+                                    int buflen);
+extern const char *pgstat_get_backend_desc(BackendType backendType);
+
+extern void pgstat_progress_start_command(ProgressCommandType cmdtype,
+                              Oid relid);
+extern void pgstat_progress_update_param(int index, int64 val);
+extern void pgstat_progress_update_multi_param(int nparam, const int *index,
+                                   const int64 *val);
+extern void pgstat_progress_end_command(void);
 /* ----------
  * Support functions for the SQL-callable functions to
  * generate the pgstat* views.
  * ----------
  */
-extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
+extern void pgstat_clear_snapshot(void);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
-extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
-extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_extended(PgStat_StatDBEntry *dbent, Oid relid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
-extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
 
+extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
+extern int    pgstat_fetch_stat_numbackends(void);
+extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 96c7732006..e2277d67a3 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -219,6 +219,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SHARED_TUPLESTORE,
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 9244a2a7b7..a9b625211b 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.16.3

From f8474ee814ebe29d4b8ea879d5300d88fe24ce98 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 27 Nov 2018 14:42:12 +0900
Subject: [PATCH 6/6] Remove the GUC stats_temp_directory

The GUC used to specify the directory to store temporary statistics
files. It is no longer needed by the stats collector but is still used
by the programs in bin and contrib, and maybe by other extensions. Thus
this patch removes the GUC, but some backing variables and macro
definitions are left alone for backward compatibility.
---
 doc/src/sgml/backup.sgml                      |  2 --
 doc/src/sgml/config.sgml                      | 19 -------------
 doc/src/sgml/monitoring.sgml                  |  7 +----
 doc/src/sgml/storage.sgml                     |  3 +-
 src/backend/postmaster/pgstat.c               | 13 ++++-----
 src/backend/replication/basebackup.c          | 13 ++-------
 src/backend/utils/misc/guc.c                  | 41 ---------------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/pgstat.h                          |  5 +++-
 9 files changed, 14 insertions(+), 90 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index a73fd4d044..95285809c2 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1119,8 +1119,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8bd57f376b..79f704cc99 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6709,25 +6709,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 339afb2de9..eb6e0eecdd 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -195,12 +195,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
+   <productname>PostgreSQL</productname> processes through shared memory.
    When the server shuts down cleanly, a permanent copy of the statistics
    data is stored in the <filename>pg_stat</filename> subdirectory, so that
    statistics can be retained across server restarts.  When recovery is
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index cbdad0c3fb..133eb3ff19 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -122,8 +122,7 @@ Item
 
 <row>
  <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
+ <entry>Subdirectory containing ephemeral files for extensions</entry>
 </row>
 
 <row>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 61b0bd161d..6f65d707ec 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -102,15 +102,12 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
+/*
+ * This used to be a GUC variable and is no longer used in this file, but is
+ * left alone, holding the default value, just for backward compatibility
+ * with extensions.
  */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+char       *pgstat_stat_directory = PG_STAT_TMP_DIR;
 
 LWLock        StatsMainLock;
 #define        StatsLock (&StatsMainLock)
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index def6c03dd0..58ba33e822 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -230,11 +230,8 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -265,13 +262,9 @@ perform_base_backup(basebackup_options *opt)
          * Calculate the relative path of temporary statistics directory in
          * order to skip the files which are located in that directory later.
          */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
+
+        Assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
+        statrelpath = psprintf("./%s", PG_STAT_TMP_DIR);
 
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 156d147c85..058d97075f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -191,7 +191,6 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static void assign_effective_io_concurrency(int newval, void *extra);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -3988,17 +3987,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11002,35 +10990,6 @@ assign_effective_io_concurrency(int newval, void *extra)
 #endif                            /* USE_PREFETCH */
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 194f312096..fdb088dbfd 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -554,7 +554,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index cb70d00b8f..265ed4d118 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -31,7 +31,10 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
+/*
+ * This used to be the directory to store temporary statistics data in but is
+ * no longer used. Defined here for backward compatibility.
+ */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
 
 /* Values for track_functions GUC variable --- order is significant! */
-- 
2.16.3


Re: shared-memory based stats collector

From
Arthur Zakirov
Date:
On 25.02.2019 07:52, Kyotaro HORIGUCHI wrote:
> It is fixed by moving StartTransactionCommand to before the first
> pgstat_f_s_dbentry(), which looks better not having this problem.

Thank you. There are still a couple of TAP tests that don't pass:
002_archiving.pl and 010_pg_basebackup.pl. I think the following simple
patch solves the issue:


diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f9b22a4d71..d500f9d090 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3169,6 +3169,9 @@ pgstat_bestart(void)
             case StartupProcess:
                 beentry->st_backendType = B_STARTUP;
                 break;
+           case ArchiverProcess:
+               beentry->st_backendType = B_ARCHIVER;
+               break;
             case BgWriterProcess:
                 beentry->st_backendType = B_BG_WRITER;
                 break;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 8939758c59..4f656c98a3 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -6,7 +6,7 @@ use File::Basename qw(basename dirname);
  use File::Path qw(rmtree);
  use PostgresNode;
  use TestLib;
-use Test::More tests => 106;
+use Test::More tests => 105;

  program_help_ok('pg_basebackup');
  program_version_ok('pg_basebackup');


010_pg_basebackup.pl now has 105 tests because the pg_stat_tmp directory
was removed from the `foreach my $dirname` loop.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: shared-memory based stats collector

From
Robert Haas
Date:
On Sun, Feb 24, 2019 at 11:53 PM Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> The two aboves are fixed in the attached v17.

Andres just drew my attention to patch 0004 in this series, which is
definitely not OK.  That patch allows the postmaster to use dynamic
shared memory, claiming: "Shared memory based stats collector needs it
to work on postmaster and no problem found to do that. Just allow it."

But if you just look a little further down in the code from where that
assertion is located, you'll find this:

    /* Lock the control segment so we can register the new segment. */
    LWLockAcquire(DynamicSharedMemoryControlLock, LW_EXCLUSIVE);

It is a well-established principle that the postmaster must not
acquire any locks, because if it did, a corrupted shared memory
segment could take down not only individual backends but the
postmaster as well.  So this is entirely not OK in the postmaster.  I
think there might be other reasons as well why this is not OK that
aren't occurring to me at the moment, but that one is enough by
itself.

But even if for some reason that were OK, I'm pretty sure that any
design that involves the postmaster interacting with the data stored
in shared memory by the stats collector is an extremely bad idea.
Again, the postmaster is supposed to have as little interaction with
shared memory as possible, precisely so that it doesn't crash and
burn when some other process corrupts shared memory.  Dynamic shared
memory is included in that.  So, really, the LWLock here is just the
tip of the iceberg: the postmaster not only CAN'T safely run this
code, but it shouldn't WANT to do so.

And I'm kind of baffled that it does.  I haven't looked at the other
patches, but it seems to me that, while a shared-memory stats
collector is a good idea in general to avoid the I/O and CPU costs of
reading and writing temporary files, I don't see why the
postmaster would need to be involved in any of that.  Whatever the
reason, though, I'm pretty sure that's GOT to be changed for this
patch set to have any chance of being accepted.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: shared-memory based stats collector

From
Andres Freund
Date:
Hi,

> From 88740269660d00d548910c2f3aa631878c7cf0d4 Mon Sep 17 00:00:00 2001
> From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
> Date: Thu, 21 Feb 2019 12:42:07 +0900
> Subject: [PATCH 4/6] Allow dsm to use on postmaster.
> 
> DSM is inhibited to be used on postmaster. Shared memory baesd stats
> collector needs it to work on postmaster and no problem found to do
> that. Just allow it.

Maybe I'm missing something, but why? postmaster doesn't actually need
to process stats messages in any way?


> From 774b1495136db1ad6d174ab261487fdf6cb6a5ed Mon Sep 17 00:00:00 2001
> From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
> Date: Thu, 21 Feb 2019 12:44:56 +0900
> Subject: [PATCH 5/6] Shared-memory based stats collector
> 
> Previously activity statistics is shared via files on disk. Every
> backend sends the numbers to the stats collector process via a socket.
> It makes snapshots as a set of files on disk with a certain interval
> then every backend reads them as necessary. It worked fine for
> comparatively small set of statistics but the set is under the
> pressure to growing up and the file size has reached the order of
> megabytes. To deal with larger statistics set, this patch let backends
> directly share the statistics via shared memory.

Btw, you can make the life of a committer easier by collecting the
reviewers and co-authors of a patch yourself...


This desperately needs an introductory comment in pgstat.c or such
explaining how the new scheme works.



> +LWLock        StatsMainLock;
> +#define        StatsLock (&StatsMainLock)

Wait, what? You can't just define a lock this way. That's process local
memory, locking that doesn't do anything useful.
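
To illustrate the point about process-local memory, here is a minimal
standalone sketch (plain POSIX, not PostgreSQL code). It assumes a fork()-based
process model like the one backends use: a file-scope variable is copied into
each process, so a write in the child is invisible to the parent, whereas a
MAP_SHARED mapping is seen by both. The names local_counter and
observe_after_fork are hypothetical.

```c
#define _DEFAULT_SOURCE         /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* File-scope variable: every process gets its own copy after fork(). */
static int local_counter = 0;

/*
 * Fork a child that writes 1 to both the process-local variable and a
 * shared mapping, then report what the parent sees. Returns 0 on success.
 */
int
observe_after_fork(int *local_seen, int *shared_seen)
{
    int    *shared_counter;
    pid_t   pid;

    shared_counter = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared_counter == MAP_FAILED)
        return -1;

    *shared_counter = 0;
    local_counter = 0;

    pid = fork();
    if (pid < 0)
    {
        munmap(shared_counter, sizeof(int));
        return -1;
    }
    if (pid == 0)
    {
        /* child: only the shared write reaches the parent */
        local_counter = 1;
        *shared_counter = 1;
        _exit(0);
    }

    if (waitpid(pid, NULL, 0) < 0)
    {
        munmap(shared_counter, sizeof(int));
        return -1;
    }

    *local_seen = local_counter;
    *shared_seen = *shared_counter;
    munmap(shared_counter, sizeof(int));
    return 0;
}
```

A lock stored like local_counter would behave the same way: each process
would acquire and release only its own private copy.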


> +/* Shared stats bootstrap information */
> +typedef struct StatsShmemStruct {

Please note that in PG's coding style the { goes on the next line.


> +/*
> + *  Backends store various database-wide info that's waiting to be flushed out
> + *  to shared memory in these variables.
> + */
> +static int        n_deadlocks = 0;
> +static size_t    n_tmpfiles = 0;
> +static size_t    n_tmpfilesize = 0;
> +
> +/*
> + * have_recovery_conflicts represents the existence of any kind if conflict
> + */
> +static bool        have_recovery_conflicts = false;
> +static int        n_conflict_tablespace = 0;
> +static int        n_conflict_lock = 0;
> +static int        n_conflict_snapshot = 0;
> +static int        n_conflict_bufferpin = 0;
> +static int        n_conflict_startup_deadlock = 0;

Probably worthwhile to group those into a struct, even just to make
debugging easier.
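
A rough sketch of what such grouping could look like, as standalone C rather
than a patch hunk. The struct name PendingDBStats and the two helper functions
are hypothetical; the fields mirror the file-scope variables quoted above, and
the "anything to flush" check mirrors the condition the patch uses in
pgstat_report_stat().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical grouping of the per-backend pending database-wide counters. */
typedef struct PendingDBStats
{
    int         n_deadlocks;
    size_t      n_tmpfiles;
    size_t      n_tmpfilesize;

    /* recovery conflict counters */
    int         n_conflict_tablespace;
    int         n_conflict_lock;
    int         n_conflict_snapshot;
    int         n_conflict_bufferpin;
    int         n_conflict_startup_deadlock;
} PendingDBStats;

/* True if any recovery conflict counter is nonzero. */
bool
pending_db_stats_have_conflicts(const PendingDBStats *s)
{
    assert(s != NULL);
    return s->n_conflict_tablespace != 0 ||
        s->n_conflict_lock != 0 ||
        s->n_conflict_snapshot != 0 ||
        s->n_conflict_bufferpin != 0 ||
        s->n_conflict_startup_deadlock != 0;
}

/* True if there is any database-wide counter waiting to be flushed. */
bool
pending_db_stats_any(const PendingDBStats *s)
{
    return pending_db_stats_have_conflicts(s) ||
        s->n_deadlocks != 0 ||
        s->n_tmpfiles != 0;
}
```

With a single struct, the whole pending state can be printed in a debugger in
one step and reset with one assignment after a flush.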



>  
> -/* ----------
> - * pgstat_init() -
> - *
> - *    Called from postmaster at startup. Create the resources required
> - *    by the statistics collector process.  If unable to do so, do not
> - *    fail --- better to let the postmaster start with stats collection
> - *    disabled.
> - * ----------
> - */
> -void
> -pgstat_init(void)
> +static void
> +pgstat_postmaster_shutdown(int code, Datum arg)

You can't have a function like that without explaining why it's there.

> +    /* trash the stats on crash */
> +    if (code == 0)
> +        pgstat_write_statsfiles();
>  }

And especially not without documenting what that code is supposed to
mean.



>  pgstat_report_stat(bool force)
>  {
> -    /* we assume this inits to all zeroes: */
> -    static const PgStat_TableCounts all_zeroes;
> -    static TimestampTz last_report = 0;
> -
> +    static TimestampTz last_flush = 0;
> +    static TimestampTz pending_since = 0;
>      TimestampTz now;
> -    PgStat_MsgTabstat regular_msg;
> -    PgStat_MsgTabstat shared_msg;
> -    TabStatusArray *tsa;
> -    int            i;
> +    pgstat_flush_stat_context cxt = {0};
> +    bool        have_other_stats = false;
> +    bool        pending_stats = false;
> +    long        elapsed;
> +    long        secs;
> +    int            usecs;
> +
> +    /* Do we have anything to flush? */
> +    if (have_recovery_conflicts || n_deadlocks != 0 || n_tmpfiles != 0)
> +        have_other_stats = true;
>  
>      /* Don't expend a clock check if nothing to do */
>      if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
>          pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
> -        !have_function_stats)
> -        return;
> +        !have_other_stats && !have_function_stats)
> +        return 0;

"other" seems like a pretty mysterious category. Seems better to either
name precisely, or just use the underlying variables for checks.




> +/* -------
> + * Subroutines for pgstat_flush_stat.
> + * -------
> + */
> +
> +/*
> + * snapshot_statentry() - Find an entry from source dshash with cache.
> + *

Is snapshot_statentry() really a subroutine for pgstat_flush_stat()?

> +static void *
> +snapshot_statentry(pgstat_snapshot_cxt *cxt, Oid key)
> +{
> +    char *lentry = NULL;
> +    size_t keysize = cxt->dsh_params->key_size;
> +    size_t dsh_entrysize = cxt->dsh_params->entry_size;
> +    bool found;
> +    bool *negative;
> +
> +    /* caches the result entry */
>  
>      /*
> -     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
> -     * msec since we last sent one, or the caller wants to force stats out.
> +     * Create new hash with arbitrary initial entries since we don't know how
> +     * this hash will grow. The boolean put at the end of the entry is
> +     * negative flag.
>       */

That, uh, seems pretty ugly and hard to understand.


> +/*
> + * pgstat_flush_stat: Flushes table stats out to shared statistics.
> + *
> + *  If nowait is true, returns with false if required lock was not acquired

s/with false/false/

> + *  immediately. In the case, infos of some tables may be left alone in TSA to

TSA? I assume TabStatusArray, but I don't think that's a common or
useful abbreviation. It'd be ok to just refer to the variable name.

> +static bool
> +pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait)
> +{

> +            /* try to apply the tab stats */
> +            if (!pgstat_flush_tabstat(cxt, nowait, entry))
>              {
> -                pgstat_send_tabstat(this_msg);
> -                this_msg->m_nentries = 0;
> +                /*
> +                 * Failed. Leave it alone filling at the beginning in TSA.
> +                 */
> +                TabStatHashEntry *hash_entry;
> +                bool found;
> +
> +                if (new_tsa_hash == NULL)
> +                    new_tsa_hash = create_tabstat_hash();
> +
> +                /* Create hash entry for this entry */
> +                hash_entry = hash_search(new_tsa_hash, &entry->t_id,
> +                                         HASH_ENTER, &found);
> +                Assert(!found);
> +
> +                /*
> +                 * Move insertion pointer to the next segment. There must be
> +                 * enough space segments since we are just leaving some of the
> +                 * current elements.
> +                 */
> +                if (dest_elem >= TABSTAT_QUANTUM)
> +                {
> +                    Assert(dest_tsa->tsa_next != NULL);
> +                    dest_tsa = dest_tsa->tsa_next;
> +                    dest_elem = 0;
> +                }
> +
> +                /* Move the entry if needed */
> +                if (tsa != dest_tsa || i != dest_elem)
> +                {
> +                    PgStat_TableStatus *new_entry;
> +                    new_entry = &dest_tsa->tsa_entries[dest_elem];
> +                    *new_entry = *entry;
> +                    entry = new_entry;
> +                }
> +
> +                hash_entry->tsa_entry = entry;
> +                dest_elem++;

This seems like a lot of work for just leaving an entry around to be
processed later. Shouldn't code for that already exist elsewhere?

>  void
>  pgstat_vacuum_stat(void)
>  {
> -    HTAB       *htab;
> -    PgStat_MsgTabpurge msg;
> -    PgStat_MsgFuncpurge f_msg;
> -    HASH_SEQ_STATUS hstat;
> +    HTAB       *oidtab;
> +    dshash_table *dshtable;
> +    dshash_seq_status dshstat;
>      PgStat_StatDBEntry *dbentry;
>      PgStat_StatTabEntry *tabentry;
>      PgStat_StatFuncEntry *funcentry;
> -    int            len;
>  
> -    if (pgStatSock == PGINVALID_SOCKET)
> +    /* we don't collect statistics under standalone mode */
> +    if (!IsUnderPostmaster)
>          return;
>  
> -    /*
> -     * If not done for this transaction, read the statistics collector stats
> -     * file into some hash tables.
> -     */
> -    backend_read_statsfile();
> +    /* If not done for this transaction, take a snapshot of stats */
> +    pgstat_snapshot_global_stats();

Hm, why do we need a snapshot here?



>      /*
>       * Now repeat the above steps for functions.  However, we needn't bother
>       * in the common case where no function stats are being collected.
>       */

Can't we move the act of iterating through these hashes and probing
against another hash into a helper function and reuse? These
duplications aren't pretty.
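
As a rough illustration of the kind of helper meant here, the following
standalone sketch walks a set of stats keys, probes each against a "does this
object still exist" callback, and drops the ones that are gone. It is a toy
model only: the real code would iterate a dshash and probe an HTAB, and every
name below (purge_missing, OidList, DropLog and the callbacks) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int Oid;       /* stand-in for PostgreSQL's Oid */

typedef bool (*oid_exists_fn) (Oid key, void *arg);
typedef void (*oid_drop_fn) (Oid key, void *arg);

/*
 * Visit every stats key, probe it against the "live objects" set, and drop
 * the entries whose object no longer exists. Returns the number dropped.
 */
size_t
purge_missing(const Oid *stat_keys, size_t nkeys,
              oid_exists_fn exists, void *exists_arg,
              oid_drop_fn drop, void *drop_arg)
{
    size_t      ndropped = 0;

    for (size_t i = 0; i < nkeys; i++)
    {
        if (exists(stat_keys[i], exists_arg))
            continue;
        drop(stat_keys[i], drop_arg);
        ndropped++;
    }
    return ndropped;
}

/* Toy "live objects" set: a plain array searched linearly. */
typedef struct OidList
{
    const Oid  *oids;
    size_t      n;
} OidList;

bool
oid_in_list(Oid key, void *arg)
{
    const OidList *list = arg;

    for (size_t i = 0; i < list->n; i++)
        if (list->oids[i] == key)
            return true;
    return false;
}

/* Toy drop action: remember which keys were dropped. */
typedef struct DropLog
{
    Oid         dropped[8];
    size_t      n;
} DropLog;

void
log_drop(Oid key, void *arg)
{
    DropLog    *drops = arg;

    assert(drops->n < 8);
    drops->dropped[drops->n++] = key;
}
```

The same purge_missing() call could then serve databases, tables and
functions, with only the two callbacks differing.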


Greetings,

Andres Freund


Re: shared-memory based stats collector

From
Andres Freund
Date:
Ping?  Unless there's a new version pretty soon, we're going to have to
move this to the next CF, I think.


Re: Re: shared-memory based stats collector

From
David Steele
Date:
On 3/22/19 10:33 PM, Andres Freund wrote:
> Ping?  Unless there's a new version pretty soon, we're going to have to
> move this to the next CF, I think.

Agreed.  I've marked it Waiting on Author for now and will move it to 
PG13 on the 28th if no new patch appears.

Regards,
-- 
-David
david@pgmasters.net


Re: shared-memory based stats collector

From
Andres Freund
Date:
Hi,

Unfortunately I don't think it's realistic to target this to v12. I
think it was unlikely to make it at the beginning of the CF, but since then
development just wasn't quick enough to warrant aiming for it.  It's a
large and somewhat complex patch, and has some significant risks
associated. Therefore I think we should mark this as targeting v13, and
move to the next CF?

Greetings,

Andres Freund



Re: shared-memory based stats collector

From
Kyotaro HORIGUCHI
Date:
Hello.

At Wed, 3 Apr 2019 12:56:59 -0700, Andres Freund <andres@anarazel.de> wrote in
<20190403195659.fcmk2i7ruxhtyqjl@alap3.anarazel.de>
> Unfortunately I don't think it's realistic to target this to v12. I
> think it was unlikely to make it at the beginning of the CF, but since then
> development just wasn't quick enough to warrant aiming for it.  It's a
> large and somewhat complex patch, and has some significant risks
> associated. Therefore I think we should mark this as targeting v13, and
> move to the next CF?

I'd like to get this into 12, but the deadline is indeed approaching. If
everyone thinks that is not realistic, I'll do that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: shared-memory based stats collector

From
Tomas Vondra
Date:
On Thu, Apr 04, 2019 at 09:25:12AM +0900, Kyotaro HORIGUCHI wrote:
>Hello.
>
>At Wed, 3 Apr 2019 12:56:59 -0700, Andres Freund <andres@anarazel.de> wrote in
<20190403195659.fcmk2i7ruxhtyqjl@alap3.anarazel.de>
>> Unfortunately I don't think it's realistic to target this to v12. I
>> think it was unlikely to make it at the beginning of the CF, but since then
>> development just wasn't quick enough to warrant aiming for it.  It's a
>> large and somewhat complex patch, and has some significant risks
>> associated. Therefore I think we should mark this as targeting v13, and
>> move to the next CF?
>
>I'd like to get this into 12, but the deadline is indeed approaching. If
>everyone thinks that is not realistic, I'll do that.
>

Unfortunately, now that we're past code freeze it's clear this is a PG13
matter now :-(

I personally consider this to be a very worthwhile & beneficial improvement,
but I agree with Andres that the patch did not quite get to a committable state
in the last CF. Considering how sensitive a part it touches, I suggest we try
to get it committed early in the PG13 cycle. I'm willing to spend some
time on doing test/benchmarks and reviewing the code, if needed.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: shared-memory based stats collector

From
Kyotaro HORIGUCHI
Date:
At Tue, 9 Apr 2019 17:03:33 +0200, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
<20190409150333.5iashyjxm5jmraml@development>
> Unfortunately, now that we're past code freeze it's clear this is a PG13
> matter now :-(
> 
> I personally consider this to be a very worthwhile & beneficial improvement,
> but I agree with Andres that the patch did not quite get to a committable state
> in the last CF. Considering how sensitive a part it touches, I suggest we try
> to get it committed early in the PG13 cycle. I'm willing to spend some
> time on doing test/benchmarks and reviewing the code, if needed.

I'm very happy to hear that. Actually the code was rushed work
(mainly for reverting refactoring) and some stupid mistakes were
left in. I'm going through the patch again to polish the code.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: shared-memory based stats collector

From
Tomas Vondra
Date:
On Wed, Apr 10, 2019 at 09:39:29AM +0900, Kyotaro HORIGUCHI wrote:
>At Tue, 9 Apr 2019 17:03:33 +0200, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
<20190409150333.5iashyjxm5jmraml@development>
>> Unfortunately, now that we're past code freeze it's clear this is a PG13
>> matter now :-(
>>
>> I personally consider this to be a very worthwhile & beneficial improvement,
>> but I agree with Andres that the patch did not quite get to a committable state
>> in the last CF. Considering how sensitive a part it touches, I suggest we try
>> to get it committed early in the PG13 cycle. I'm willing to spend some
>> time on doing test/benchmarks and reviewing the code, if needed.
>
>I'm very happy to hear that. Actually the code was rushed work
>(mainly for reverting refactoring) and some stupid mistakes were
>left in. I'm going through the patch again to polish the code.
>

While reviewing the patch I've always had an issue with evaluating how it
behaves for various scenarios / workloads. The reviews generally did one
specific benchmark, but I find that unsatisfactory. I wonder whether
we could develop a small set of more comprehensive workloads for this
patch (i.e. different numbers of objects, access patterns, ...).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: shared-memory based stats collector

From
Kyotaro HORIGUCHI
Date:
Hello.

At Wed, 10 Apr 2019 11:13:27 +0200, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
<20190410091327.fpnvjbuu74dzxizl@development>
> While reviewing the patch I've always had an issue with evaluating how it
> behaves for various scenarios / workloads. The reviews generally did one
> specific benchmark, but I find that unsatisfactory. I wonder whether
> we could develop a small set of more comprehensive workloads for this
> patch (i.e. different numbers of objects, access patterns, ...).

Indeed. I'm also having difficulty with the catcache pruning
stuff. But I might have found a clue to that.

I took performance numbers after some amendments and polishing of
the patch.

I expected operf might work, but it doesn't show meaningful
information with an -O2 binary. gprof slows the binary down to about
one third. But just running pgbench gave me rather stable numbers
(unlike the catcache stuff..).

The numbers are tps for a 300-minute run and the ratio between master
and patched.

[A-D]1 just run stats-updator clients.

      master-O2     patched       patched/master-O2 (%)
A1:  13431.603208  13457.968950   100.1963
B1:  72284.737474  72535.169437   100.3465
C1:     19.963315     20.037411   100.3712
D1:    193.027074    196.651603   101.8777


[A-D]2 tests add a stats-reader client.

          master-O2                patched               patched/master-O2 (%)
        updator /  reader       updator /  reader      updator /  reader
A2: 12929.418503/512.784200  13066.150297/584.686889    101.0575 / 114.0220
B2: 71673.804812/ 20.102687  71916.823242/ 22.109251    100.3391 / 109.9816
C2:    16.066719/485.788495     16.487942/577.930340    102.6217 / 118.9675
D2:   189.563306/ 36.252532    193.817075/ 44.661707    102.2440 / 123.1961


Case A1 is the simplest case: 1 client repeatedly updates the stats
   of pgbench_accounts (of scale 1, but that doesn't matter).

Case B1 is A1 from 100 concurrent clients.

Case C1 is a massive(?) number of stats updates: concretely, select
  sum() on a partitioned table with 1000 children, from 1 client.

Case D1 does C1 from 97 concurrent clients.

A2-D2 run a single stats-referencing client while A1-D1
are running, respectively. (select sum(seq_scan) from pg_stat_user_tables)

Perhaps the numbers will get worse with many referencing clients,
but I think that's not realistic.


I'll run tests with many databases (-100?) and expanded tabstat
entry cases.

The attached files are:

v19-0001-sequential-scan-for-dshash.patch:
v19-0002-Add-conditional-lock-feature-to-dshash.patch:
v19-0003-Make-archiver-process-an-auxiliary-process.patch:
v19-0005-Remove-the-GUC-stats_temp_directory.patch:

 not changed since v18 except rebasing.

v19-0004-Shared-memory-based-stats-collector.patch:

 Rebased. Fixed several bugs. Improved performance in some
 cases. Made structs/code tidier. Added/rewrote comments.

run.sh   : main test script
gencr.pl : generator of the script that creates the partitioned table
           (perl gencr.pl | psql postgres to run)
tr.sql   : stats-updator client script used by run.sh
ref.sql  : stats-reader client script used by run.sh
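
The non-consistent sequential scan in v19-0001 moves its partition lock
hand-over-hand: it takes the lock on the next partition before releasing the
current one. The following standalone toy model sketches only that ordering.
It uses C11 atomic_flag spinlocks in place of LWLocks and a fixed-size table,
and every name in it is hypothetical rather than taken from dshash.c.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NUM_PARTITIONS      4   /* stand-in for DSHASH_NUM_PARTITIONS */
#define BUCKETS_PER_PART    2   /* hypothetical fixed table shape */
#define NUM_BUCKETS         (NUM_PARTITIONS * BUCKETS_PER_PART)

typedef struct toy_table
{
    atomic_flag part_lock[NUM_PARTITIONS];  /* one spinlock per partition */
    int         bucket[NUM_BUCKETS];        /* one value per bucket */
} toy_table;

static void
toy_lock(atomic_flag *lock)
{
    while (atomic_flag_test_and_set(lock))
        ;                       /* spin until the flag was clear */
}

static void
toy_unlock(atomic_flag *lock)
{
    atomic_flag_clear(lock);
}

void
toy_init(toy_table *t)
{
    for (int p = 0; p < NUM_PARTITIONS; p++)
        atomic_flag_clear(&t->part_lock[p]);
    for (int b = 0; b < NUM_BUCKETS; b++)
        t->bucket[b] = b + 1;
}

/*
 * Sum all buckets while always holding at least one partition lock.
 * *max_held reports the largest number of locks held at the same time.
 */
long
toy_seq_scan_sum(toy_table *t, int *max_held)
{
    long        sum = 0;
    int         cur = 0;
    int         held = 0;

    toy_lock(&t->part_lock[cur]);
    held++;
    *max_held = held;

    for (int b = 0; b < NUM_BUCKETS; b++)
    {
        int         next = b / BUCKETS_PER_PART;

        if (next != cur)
        {
            /* take the next partition first, then release the current */
            toy_lock(&t->part_lock[next]);
            held++;
            if (held > *max_held)
                *max_held = held;
            toy_unlock(&t->part_lock[cur]);
            held--;
            cur = next;
        }
        assert(held >= 1);
        sum += t->bucket[b];
    }

    toy_unlock(&t->part_lock[cur]);
    return sum;
}
```

Because the scanner never drops to zero held locks mid-scan, anything that
needs all partition locks (such as a resize) cannot slip in between two
partitions; taking the locks in ascending partition order is what keeps
concurrent scanners from deadlocking against each other.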


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 1a97564fb9e392d0dfb067cb4056041d013f4f0e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:41:04 +0900
Subject: [PATCH 1/5] sequential scan for dshash

Add sequential scan feature to dshash.
---
 src/backend/lib/dshash.c | 188 ++++++++++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  23 +++++-
 2 files changed, 206 insertions(+), 5 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index f095196fb6..1e8c22f94f 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -112,6 +112,7 @@ struct dshash_table
     size_t        size_log2;        /* log2(number of buckets) */
     bool        find_locked;    /* Is any partition lock held by 'find'? */
     bool        find_exclusively_locked;    /* ... exclusively? */
+    bool        seqscan_running;/* now under sequential scan */
 };
 
 /* Given a pointer to an item, find the entry (user data) it holds. */
@@ -127,6 +128,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +158,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -228,6 +237,7 @@ dshash_create(dsa_area *area, const dshash_parameters *params, void *arg)
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
 
     /*
      * Set up the initial array of buckets.  Our initial size is the same as
@@ -279,6 +289,7 @@ dshash_attach(dsa_area *area, const dshash_parameters *params,
     hash_table->control = dsa_get_address(area, control);
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
     Assert(hash_table->control->magic == DSHASH_MAGIC);
 
     /*
@@ -324,7 +335,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
                                 LW_EXCLUSIVE));
 
     delete_item(hash_table, item);
-    hash_table->find_locked = false;
-    hash_table->find_exclusively_locked = false;
-    LWLockRelease(PARTITION_LOCK(hash_table, partition));
+
+    /* We need to keep the partition lock during a sequential scan */
+    if (!hash_table->seqscan_running)
+    {
+        hash_table->find_locked = false;
+        hash_table->find_exclusively_locked = false;
+        LWLockRelease(PARTITION_LOCK(hash_table, partition));
+    }
 }
 
 /*
@@ -568,6 +584,8 @@ dshash_release_lock(dshash_table *hash_table, void *entry)
     Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition_index),
                                 hash_table->find_exclusively_locked
                                 ? LW_EXCLUSIVE : LW_SHARED));
+    /* lock is under control of sequential scan */
+    Assert(!hash_table->seqscan_running);
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
@@ -592,6 +610,168 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should be called if and only if the scan is abandoned
+ * before completion; if dshash_seq_next returns NULL then it has already done
+ * the end-of-scan cleanup.
+ *
+ * On returning element, it is locked as is the case with dshash_find.
+ * However, the caller must not release the lock. The lock is released as
+ * necessary in continued scan.
+ *
+ * As opposed to the equivalent for dynahash, the caller is not supposed to
+ * delete the returned element before continuing the scan.
+ *
+ * If consistent is set for dshash_seq_init, the whole hash table is
+ * non-exclusively locked. Otherwise a part of the hash table is locked in the
+ * same mode (partition lock).
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool consistent, bool exclusive)
+{
+    /* allowed at most one scan at once */
+    Assert(!hash_table->seqscan_running);
+
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->consistent = consistent;
+    status->exclusive = exclusive;
+    hash_table->seqscan_running = true;
+
+    /*
+     * Protect all partitions from modification if the caller wants a
+     * consistent result.
+     */
+    if (consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+        {
+            Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i)));
+
+            LWLockAcquire(PARTITION_LOCK(hash_table, i),
+                          exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        }
+        ensure_valid_bucket_pointers(hash_table);
+    }
+}
+
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    Assert(status->hash_table->seqscan_running);
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert (status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        if (!status->consistent)
+        {
+            partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+            LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            status->curpartition = partition;
+
+            /* resize doesn't happen from now until seq scan ends */
+            status->nbuckets =
+                NUM_BUCKETS(status->hash_table->control->size_log2);
+            ensure_valid_bucket_pointers(status->hash_table);
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            dshash_seq_term(status);
+            return NULL;
+        }
+
+        /* Also move partition lock if needed */
+        if (!status->consistent)
+        {
+            int next_partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+
+            /* Move lock along with partition for the bucket */
+            if (status->curpartition != next_partition)
+            {
+                /*
+                 * Take lock on the next partition then release the current,
+                 * not in the reverse order. This is required to avoid
+                 * resizing from happening during a sequential scan. Locks are
+                 * taken in partition order so no deadlock happens with other
+                 * seq scans or resizing.
+                 */
+                LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                             next_partition),
+                              status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+                LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                             status->curpartition));
+                status->curpartition = next_partition;
+            }
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * This item can be deleted by the caller. Store the next item for the
+     * next iteration for the occasion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    Assert(status->hash_table->seqscan_running);
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+    status->hash_table->seqscan_running = false;
+
+    if (status->consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+            LWLockRelease(PARTITION_LOCK(status->hash_table, i));
+    }
+    else if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index e5dfd57f0a..b80f3af995 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,23 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state of dshash. The detail is exposed since the storage
+ * size should be known to users but it should be considered as an opaque
+ * type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;
+    int                    curbucket;
+    int                    nbuckets;
+    dshash_table_item  *curitem;
+    dsa_pointer            pnextitem;
+    int                    curpartition;
+    bool                consistent;
+    bool                exclusive;
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
               const dshash_parameters *params,
@@ -70,7 +87,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
 extern void dshash_detach(dshash_table *hash_table);
 extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
 extern void dshash_destroy(dshash_table *hash_table);
-
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
             const void *key, bool exclusive);
@@ -80,6 +96,11 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool consistent, bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.16.3

From 9a43133d64d763e860c027c32758074db6b7e3ab Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 27 Sep 2018 11:15:19 +0900
Subject: [PATCH 2/5] Add conditional lock feature to dshash

Dshash currently waits for locks unconditionally. This commit adds new
interfaces for dshash_find and dshash_find_or_insert. The new
interfaces take an extra parameter "nowait" that tells them not to wait
for the lock.
---
 src/backend/lib/dshash.c | 69 +++++++++++++++++++++++++++++++++++++++++++-----
 src/include/lib/dshash.h |  4 +++
 2 files changed, 66 insertions(+), 7 deletions(-)
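
The nowait behavior described in the commit message can be sketched outside PostgreSQL as follows. This is a toy under stated assumptions, not the patch's code: the toy_* names are hypothetical, a C11 atomic_flag stands in for the partition LWLock (test-and-set plays the role of LWLockConditionalAcquire), and each partition holds a single slot. What it keeps from the patch is the contract: with nowait set, a busy lock makes the lookup return NULL at once, and *lock_acquired lets the caller tell "lock busy" apart from "key not found".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NPARTITIONS 4		/* hypothetical partition count */

/* One slot per partition; the flag is a stand-in for a partition lock. */
typedef struct toy_slot
{
	atomic_flag lock;
	bool		used;
	int			key;
	int			value;
} toy_slot;

typedef struct toy_table
{
	toy_slot	slots[TOY_NPARTITIONS];
} toy_table;

static void
toy_init(toy_table *t)
{
	for (int i = 0; i < TOY_NPARTITIONS; i++)
	{
		atomic_flag_clear(&t->slots[i].lock);
		t->slots[i].used = false;
	}
}

static toy_slot *
toy_partition_for(toy_table *t, int key)
{
	return &t->slots[(unsigned) key % TOY_NPARTITIONS];
}

static void
toy_release(toy_slot *slot)
{
	atomic_flag_clear(&slot->lock);
}

/* Blocking insert; a colliding key simply overwrites the slot in this toy. */
static void
toy_insert(toy_table *t, int key, int value)
{
	toy_slot   *slot = toy_partition_for(t, key);

	while (atomic_flag_test_and_set(&slot->lock))
		;						/* spin until the lock is free */
	slot->used = true;
	slot->key = key;
	slot->value = value;
	toy_release(slot);
}

/*
 * Returns the slot with its lock still held when the key is found.  Returns
 * NULL with the lock released when the key is absent, and NULL without ever
 * holding the lock when nowait is true and the lock is busy.
 */
static toy_slot *
toy_find_extended(toy_table *t, int key, bool nowait, bool *lock_acquired)
{
	toy_slot   *slot = toy_partition_for(t, key);

	/* lock_acquired is only meaningful for nowait callers */
	assert(nowait || lock_acquired == NULL);

	if (nowait)
	{
		if (atomic_flag_test_and_set(&slot->lock))
		{
			if (lock_acquired)
				*lock_acquired = false;
			return NULL;
		}
	}
	else
	{
		while (atomic_flag_test_and_set(&slot->lock))
			;
	}

	if (lock_acquired)
		*lock_acquired = true;

	if (slot->used && slot->key == key)
		return slot;			/* caller must call toy_release() */

	toy_release(slot);
	return NULL;
}
```

A caller that gets NULL back checks *lock_acquired before concluding that the key is missing; if the lock was busy, it can keep its pending data and try again later.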

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 1e8c22f94f..303210e326 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -394,19 +394,48 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  */
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
+{
+    return dshash_find_extended(hash_table, key, exclusive, false, NULL);
+}
+
+/*
+ * Like dshash_find, but returns immediately when nowait is true and the lock
+ * could not be acquired. The lock status is stored in *lock_acquired, if given.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool *lock_acquired)
 {
     dshash_hash hash;
     size_t        partition;
     dshash_table_item *item;
 
+    /* lock_acquired is meaningful only when nowait is true */
+    Assert(nowait || !lock_acquired);
+
     hash = hash_key(hash_table, key);
     partition = PARTITION_FOR_HASH(hash);
 
     Assert(hash_table->control->magic == DSHASH_MAGIC);
     Assert(!hash_table->find_locked);
 
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partition),
+                                      exclusive ? LW_EXCLUSIVE : LW_SHARED))
+        {
+            if (lock_acquired)
+                *lock_acquired = false;
+            return NULL;
+        }
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition),
+                      exclusive ? LW_EXCLUSIVE : LW_SHARED);
+
+    if (lock_acquired)
+        *lock_acquired = true;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -441,6 +470,22 @@ void *
 dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
+{
+    return dshash_find_or_insert_extended(hash_table, key, found, false);
+}
+
+/*
+ * Like dshash_find_or_insert, but returns NULL if nowait is true and the lock
+ * could not be acquired.
+ *
+ * Notes above dshash_find_extended() regarding locking and error handling
+ * equally apply here.
+ */
+void *
+dshash_find_or_insert_extended(dshash_table *hash_table,
+                               const void *key,
+                               bool *found,
+                               bool nowait)
 {
     dshash_hash hash;
     size_t        partition_index;
@@ -455,8 +500,16 @@ dshash_find_or_insert(dshash_table *hash_table,
     Assert(!hash_table->find_locked);
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(
+                PARTITION_LOCK(hash_table, partition_index),
+                LW_EXCLUSIVE))
+            return NULL;
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
+                      LW_EXCLUSIVE);
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -626,9 +679,11 @@ dshash_memhash(const void *v, size_t size, void *arg)
  * As opposed to the equivalent for dynanash, the caller is not supposed to
  * delete the returned element before continuing the scan.
  *
- * If consistent is set for dshash_seq_init, the whole hash table is
- * non-exclusively locked. Otherwise a part of the hash table is locked in the
- * same mode (partition lock).
+ * If consistent is set for dshash_seq_init, all hash table
+ * partitions are locked in the requested mode (as determined by the
+ * exclusive flag), and the locks are held until the end of the scan.
+ * Otherwise the partition locks are acquired and released as needed
+ * during the scan (up to two partitions may be locked at the same time).
  */
 void
 dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b80f3af995..21587c07ce 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -90,8 +90,12 @@ extern void dshash_destroy(dshash_table *hash_table);
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
             const void *key, bool exclusive);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+            bool exclusive, bool nowait, bool *lock_acquired);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                       const void *key, bool *found);
+extern void *dshash_find_or_insert_extended(dshash_table *hash_table,
+            const void *key, bool *found, bool nowait);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.16.3

From 75c00b47b152c5aac3921f4e0d80707bf7a236ee Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 7 Nov 2018 16:53:49 +0900
Subject: [PATCH 3/5] Make archiver process an auxiliary process

This is a preliminary patch for the shared-memory based stats collector.
The archiver process must be an auxiliary process, since it uses shared
memory once stats data is moved into shared memory. Make the process
an auxiliary process so that it keeps working.
---
 src/backend/bootstrap/bootstrap.c   |  8 +++
 src/backend/postmaster/pgarch.c     | 98 +++++++++----------------------------
 src/backend/postmaster/pgstat.c     |  6 +++
 src/backend/postmaster/postmaster.c | 35 +++++++++----
 src/include/miscadmin.h             |  2 +
 src/include/pgstat.h                |  1 +
 src/include/postmaster/pgarch.h     |  4 +-
 7 files changed, 67 insertions(+), 87 deletions(-)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 43627ab8f4..7872a2d9d7 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -329,6 +329,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case BgWriterProcess:
                 statmsg = pgstat_get_backend_desc(B_BG_WRITER);
                 break;
+            case ArchiverProcess:
+                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
+                break;
             case CheckpointerProcess:
                 statmsg = pgstat_get_backend_desc(B_CHECKPOINTER);
                 break;
@@ -456,6 +459,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
             BackgroundWriterMain();
             proc_exit(1);        /* should never return */
 
+        case ArchiverProcess:
+            /* don't set signals, archiver has its own agenda */
+            PgArchiverMain();
+            proc_exit(1);        /* should never return */
+
         case CheckpointerProcess:
             /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index f84f882c4c..4342ebdab4 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -77,7 +77,6 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
@@ -96,7 +95,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgarch_exit(SIGNAL_ARGS);
 static void ArchSigHupHandler(SIGNAL_ARGS);
 static void ArchSigTermHandler(SIGNAL_ARGS);
@@ -114,75 +112,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -222,8 +151,8 @@ pgarch_forkexec(void)
  *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
  *    since we don't use 'em, it hardly matters...
  */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -255,8 +184,27 @@ PgArchiverMain(int argc, char *argv[])
 static void
 pgarch_exit(SIGNAL_ARGS)
 {
-    /* SIGQUIT means curl up and die ... */
-    exit(1);
+    PG_SETMASK(&BlockSig);
+
+    /*
+     * We DO NOT want to run proc_exit() callbacks -- we're here because
+     * shared memory may be corrupted, so we don't want to try to clean up our
+     * transaction.  Just nail the windows shut and get out of town.  Now that
+     * there's an atexit callback to prevent third-party code from breaking
+     * things by calling exit() directly, we have to reset the callbacks
+     * explicitly to make this work as intended.
+     */
+    on_exit_reset();
+
+    /*
+     * Note we do exit(2) not exit(0).  This is to force the postmaster into a
+     * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+     * backend.  This is necessary precisely because we don't clean up our
+     * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+     * should ensure the postmaster sees this as a crash, too, but no harm in
+     * being doubly sure.)
+     */
+    exit(2);
 }
 
 /* SIGHUP signal handler for archiver process */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 285def556b..ccb1d0b62e 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2934,6 +2934,9 @@ pgstat_bestart(void)
             case StartupProcess:
                 lbeentry.st_backendType = B_STARTUP;
                 break;
+            case ArchiverProcess:
+                lbeentry.st_backendType = B_ARCHIVER;
+                break;
             case BgWriterProcess:
                 lbeentry.st_backendType = B_BG_WRITER;
                 break;
@@ -4277,6 +4280,9 @@ pgstat_get_backend_desc(BackendType backendType)
 
     switch (backendType)
     {
+        case B_ARCHIVER:
+            backendDesc = "archiver";
+            break;
         case B_AUTOVAC_LAUNCHER:
             backendDesc = "autovacuum launcher";
             break;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 34315b8d1a..ec9a7ca311 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -146,7 +146,8 @@
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
 #define BACKEND_TYPE_BGWORKER    0x0008    /* bgworker process */
-#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
+#define BACKEND_TYPE_ARCHIVER    0x0010    /* archiver process */
+#define BACKEND_TYPE_ALL        0x001F    /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -539,6 +540,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1762,7 +1764,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -2977,7 +2979,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3122,10 +3124,8 @@ reaper(SIGNAL_ARGS)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3371,7 +3371,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3576,6 +3576,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3848,6 +3860,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5117,7 +5130,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5400,6 +5413,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index b677c7e821..c33abee7aa 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -399,6 +399,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -411,6 +412,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 929987190c..f3d4cb5637 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -719,6 +719,7 @@ typedef struct PgStat_GlobalStats
  */
 typedef enum BackendType
 {
+    B_ARCHIVER,
     B_AUTOVAC_LAUNCHER,
     B_AUTOVAC_WORKER,
     B_BACKEND,
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index 2474eac26a..88f16863d4 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
-- 
2.16.3

From a109c36aa6d2f02b8026322b995ffb60643e01aa Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 21 Feb 2019 12:44:56 +0900
Subject: [PATCH 4/5] Shared-memory based stats collector

Previously, activity statistics were shared via files on disk. Every
backend sends its numbers to the stats collector process via a socket.
The collector writes snapshots as a set of files on disk at a certain
interval, and every backend reads them as necessary. This worked fine for
a comparatively small set of statistics, but the set is under pressure to
grow and the file size has reached the order of megabytes. To deal with
a larger set of statistics, this patch lets backends share the statistics
directly via shared memory.
---
 doc/src/sgml/monitoring.sgml                 |    6 +-
 src/backend/postmaster/autovacuum.c          |   12 +-
 src/backend/postmaster/pgstat.c              | 5613 ++++++++++++--------------
 src/backend/postmaster/postmaster.c          |   85 +-
 src/backend/storage/ipc/ipci.c               |    6 +
 src/backend/storage/lmgr/lwlock.c            |    1 +
 src/backend/tcop/postgres.c                  |   27 +-
 src/backend/utils/init/globals.c             |    1 +
 src/backend/utils/init/postinit.c            |   11 +
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |    4 +-
 src/include/miscadmin.h                      |    1 +
 src/include/pgstat.h                         |  441 +-
 src/include/storage/lwlock.h                 |    1 +
 src/include/utils/timeout.h                  |    1 +
 14 files changed, 2591 insertions(+), 3619 deletions(-)
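
The flush timing that this patch documents in the pgstat.c header comment (PGSTAT_STAT_MIN_INTERVAL, PGSTAT_STAT_RETRY_INTERVAL, PGSTAT_STAT_MAX_INTERVAL) can be pictured with the standalone sketch below. It is one possible reading, not the patch's implementation: the toy_* names and the state struct are hypothetical, and the sketch assumes that once leftovers have been pending for the maximum interval, the backend stops using conditional locks and flushes with a blocking wait. The patch text only states that numbers are not kept longer than that interval.

```c
#include <assert.h>
#include <stdbool.h>

/* Values mirror the patch's constants; the TOY_ names are illustrative. */
#define TOY_MIN_INTERVAL_MS		500
#define TOY_RETRY_INTERVAL_MS	100
#define TOY_MAX_INTERVAL_MS		1000

typedef enum
{
	TOY_SKIP,					/* too early, keep counting locally */
	TOY_TRY_NOWAIT,				/* attempt a flush with conditional locks */
	TOY_FORCE					/* assumed: flush with blocking locks */
} toy_flush_action;

typedef struct toy_flush_state
{
	long		next_flush_ms;	/* earliest time of the next attempt */
	long		pending_since_ms;	/* first failed flush, -1 if none */
} toy_flush_state;

static void
toy_flush_state_init(toy_flush_state *st)
{
	st->next_flush_ms = 0;
	st->pending_since_ms = -1;
}

/* Decide what to do at time now_ms, given the bookkeeping in *st. */
static toy_flush_action
toy_flush_decide(const toy_flush_state *st, long now_ms)
{
	if (st->pending_since_ms >= 0 &&
		now_ms - st->pending_since_ms >= TOY_MAX_INTERVAL_MS)
		return TOY_FORCE;
	if (now_ms < st->next_flush_ms)
		return TOY_SKIP;
	return TOY_TRY_NOWAIT;
}

/* Record the outcome of an attempt made at now_ms. */
static void
toy_flush_done(toy_flush_state *st, long now_ms, bool all_flushed)
{
	if (all_flushed)
	{
		st->pending_since_ms = -1;
		st->next_flush_ms = now_ms + TOY_MIN_INTERVAL_MS;
	}
	else
	{
		if (st->pending_since_ms < 0)
			st->pending_since_ms = now_ms;
		st->next_flush_ms = now_ms + TOY_RETRY_INTERVAL_MS;
	}
}
```

For example, a backend that flushes fully at t=0 skips until t=500; if the attempt at t=500 leaves something behind, it retries every 100ms and, under the assumption above, forces the flush once t reaches 1500.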

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index a179d6111e..0fe888e4db 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    master server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   master process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   master process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index acd8a9280b..6d5c91dd45 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1958,15 +1958,15 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
+    /* Start a transaction so our commands have one to play into. */
+    StartTransactionCommand();
+
     /*
      * may be NULL if we couldn't find an entry (only happens if we are
      * forcing a vacuum for anti-wrap purposes).
      */
     dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
 
-    /* Start a transaction so our commands have one to play into. */
-    StartTransactionCommand();
-
     /*
      * Clean up any dead statistics collector entries for this DB. We always
      * want to do this exactly once per DB-processing cycle, even if we find
@@ -2749,12 +2749,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
     if (isshared)
     {
         if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
+            tabentry = pgstat_fetch_stat_tabentry_extended(shared, relid);
     }
     else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
+        tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
 
     return tabentry;
 }
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ccb1d0b62e..072f3ec62e 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,24 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Statistics collector facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects per-table and per-function usage statistics of backends and
+ *  shares them among all backends using shared memory. Every backend records
+ *  its own activity in local memory through pg_count_*() and related
+ *  interfaces during a transaction. Then pgstat_flush_stat() is called at the
+ *  end of a transaction to flush out the local numbers to shared memory.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ *  To avoid congestion on the shared memory, flushes happen no more often
+ *  than every PGSTAT_STAT_MIN_INTERVAL (500ms). If a backend cannot flush
+ *  all or part of its local numbers immediately, it preserves them and
+ *  tries again after an interval of PGSTAT_STAT_RETRY_INTERVAL (100ms).
+ *  However, the numbers are not kept locally for longer than
+ *  PGSTAT_STAT_MAX_INTERVAL (1000ms).
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the stats collector creates the area and
+ *  loads the stored stats file, if any. The last process writes the shared
+ *  stats to a file and then destroys the area before exiting.
  *
  *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
  *
@@ -19,18 +28,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "pgstat.h"
 
@@ -42,66 +39,38 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "postmaster/autovacuum.h"
-#include "postmaster/fork_process.h"
-#include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
-#include "utils/timestamp.h"
-
 
 /* ----------
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
+#define PGSTAT_STAT_MIN_INTERVAL    500 /* Minimum time between stats data
+                                         * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_STAT_RETRY_INTERVAL    100 /* Retry interval, in milliseconds,
+                                         * once PGSTAT_STAT_MIN_INTERVAL has elapsed. */
 
+#define PGSTAT_STAT_MAX_INTERVAL   1000 /* Maximum time between stats data
+                                         * updates; in milliseconds. */
 
 /* ----------
  * The initial size hints for the hash tables used in the collector.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
 #define PGSTAT_TAB_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
@@ -117,6 +86,19 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
+/*
+ * Operation mode and return code of pgstat_get_db_entry.
+ */
+#define    PGSTAT_SHARED        0
+#define    PGSTAT_EXCLUSIVE    1
+#define    PGSTAT_NOWAIT        2
+
+typedef enum PgStat_TableLookupResult
+{
+    NOT_FOUND,
+    FOUND,
+    LOCK_FAILED
+} PgStat_TableLookupResult;
 
 /* ----------
  * GUC parameters
@@ -132,31 +114,63 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used; to be removed together with the GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
+#define        StatsLock (&StatsShmem->StatsMainLock)
+
+/* Shared stats bootstrap information */
+typedef struct StatsShmemStruct
+{
+    LWLock                StatsMainLock;        /* lock protecting this struct */
+    dsa_handle             stats_dsa_handle;    /* DSA handle for stats collector */
+    dshash_table_handle db_hash_handle;
+    dsa_pointer            global_stats;
+    dsa_pointer            archiver_stats;
+    int                    refcount;
+} StatsShmemStruct;
+
 /*
  * BgWriter global statistics counters (unused in other processes).
  * Stored directly in a stats message structure so it can be sent
- * without needing to copy things around.  We assume this inits to zeroes.
+ * without needing to copy things around.
  */
-PgStat_MsgBgWriter BgWriterStats;
+PgStat_MsgBgWriter BgWriterStats = {0};
 
-/* ----------
- * Local data
- * ----------
- */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
+/* Variables that live for the backend lifetime */
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
+static dshash_table *pgStatDBHash = NULL;
 
-static struct sockaddr_storage pgStatAddr;
 
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
+/* parameter for each type of shared hash */
+static const dshash_parameters dsh_dbparams = {
+    sizeof(Oid),
+    SHARED_DBENT_SIZE,
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+static const dshash_parameters dsh_tblparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatTabEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+static const dshash_parameters dsh_funcparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatFuncEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
 
 /*
  * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
+ * written to shared memory.
  *
  * NOTE: once allocated, TabStatusArray structures are never moved or deleted
  * for the life of the backend.  Also, we zero out the t_id fields of the
@@ -191,8 +205,8 @@ typedef struct TabStatHashEntry
 static HTAB *pgStatTabHash = NULL;
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * Backends store per-function info that's waiting to be flushed out to shared
+ * memory in this hash table (indexed by function OID).
  */
 static HTAB *pgStatFunctions = NULL;
 
@@ -202,6 +216,68 @@ static HTAB *pgStatFunctions = NULL;
  */
 static bool have_function_stats = false;
 
+/* common header of stats entry in backend snapshot hash */
+typedef struct PgStat_snapshot
+{
+    Oid        key;
+    bool    negative;
+    void   *body;                /* end of header part: to keep alignment */
+} PgStat_snapshot;
+
+/* context struct for snapshot_statentry */
+typedef struct pgstat_snapshot_param
+{
+    char           *hash_name;            /* name of the snapshot hash */
+    int                hash_entsize;        /* element size of hash entry */
+    dshash_table_handle    dsh_handle;        /* dsh handle to attach */
+    const dshash_parameters *dsh_params;/* dshash params */
+    HTAB          **hash;                /* points to variable to hold hash */
+    dshash_table  **dshash;                /* ditto for dshash */
+} pgstat_snapshot_param;
+
+/*
+ * Backends store various database-wide info that's waiting to be flushed out
+ * to shared memory in these variables.
+ *
+ * checksum_failures is the exception: it stores data for all databases.
+ */
+typedef struct BackendDBStats
+{
+    int        n_conflict_tablespace;
+    int        n_conflict_lock;
+    int        n_conflict_snapshot;
+    int        n_conflict_bufferpin;
+    int        n_conflict_startup_deadlock;
+    int        n_deadlocks;
+    size_t    n_tmpfiles;
+    size_t    n_tmpfilesize;
+    HTAB    *checksum_failures;
+} BackendDBStats;
+
+/* Hash entry struct for checksum_failures above */
+typedef struct ChecksumFailureEnt
+{
+    Oid    dboid;
+    int    count;
+} ChecksumFailureEnt;
+
+static BackendDBStats BeDBStats = {0};
+
+/* macros to check BeDBStats at once */
+#define HAVE_PENDING_CONFLICTS() \
+    (BeDBStats.n_conflict_tablespace > 0 ||        \
+     BeDBStats.n_conflict_lock > 0 ||            \
+     BeDBStats.n_conflict_bufferpin > 0 ||        \
+     BeDBStats.n_conflict_startup_deadlock > 0)
+
+#define HAVE_PENDING_DBSTATS()                \
+    (HAVE_PENDING_CONFLICTS() ||        \
+     BeDBStats.n_deadlocks > 0 ||                \
+     BeDBStats.n_tmpfiles > 0 ||                \
+     /* no need to check n_tmpfilesize */        \
+     BeDBStats.checksum_failures != NULL)
+
+
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
  * into PgStat_TableStatus counters until we know if it is going to commit
@@ -237,11 +313,11 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
+static MemoryContext pgStatSnapshotContext = NULL;
+static HTAB *pgStatLocalHash = NULL;
+static bool    snapshot_cleard = false;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -249,24 +325,28 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 /* Total number of backends including auxiliary */
 static int    localNumBackends = 0;
 
-/*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
- */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
+/* Context struct for pgstat_flush_* functions */
+typedef struct pgstat_flush_stat_context
+{
+    int    shgeneration;
+    PgStat_StatDBEntry *shdbentry;
+    dshash_table *shdb_tabhash;
+
+    int    mygeneration;
+    PgStat_StatDBEntry *mydbentry;
+    dshash_table *mydb_tabhash;
+} pgstat_flush_stat_context;
 
 /*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
+ * Cluster wide statistics.
+ *
+ * Contains statistics that are not collected per database or per table.
+ * shared_* point to shared memory and snapshot_* are backend snapshots.
  */
-static List *pending_write_requests = NIL;
-
-/* Signal handler flags */
-static volatile bool need_exit = false;
-static volatile bool got_SIGHUP = false;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats *snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats *snapshot_globalStats;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -280,35 +360,40 @@ static instr_time total_func_time;
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
 
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgstat_exit(SIGNAL_ARGS);
 static void pgstat_beshutdown_hook(int code, Datum arg);
-static void pgstat_sighup_handler(SIGNAL_ARGS);
-
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                     Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, int op,
+                                    PgStat_TableLookupResult *status);
+static PgStat_StatTabEntry *pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create);
+static void pgstat_write_pgStatDBHashfile(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_pgStatDBHashfile(PgStat_StatDBEntry *dbentry);
 static void pgstat_read_current_status(void);
-
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
-
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
+static bool pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
+                                 PgStat_TableStatus *entry);
+static bool pgstat_flush_funcstats(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_flush_dbstats(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_update_tabentry(dshash_table *tabhash,
+                                   PgStat_TableStatus *stat, bool nowait);
+static void pgstat_update_dbentry(PgStat_StatDBEntry *dbentry,
+                                  PgStat_TableStatus *stat);
 static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
 
+static void pgstat_remove_useless_entries(const dshash_table_handle dshhandle,
+                              const dshash_parameters *dshparams,
+                              HTAB *oidtab);
 static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
 
 static void pgstat_setup_memcxt(void);
+static void pgstat_flush_recovery_conflict(PgStat_StatDBEntry *dbentry);
+static void pgstat_flush_deadlock(PgStat_StatDBEntry *dbentry);
+static void pgstat_flush_checksum_failure(PgStat_StatDBEntry *dbentry);
+static void pgstat_flush_tempfile(PgStat_StatDBEntry *dbentry);
+static HTAB *create_tabstat_hash(void);
+static PgStat_SubXactStatus *get_tabstat_stack_level(int nest_level);
+static void add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level);
+static PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry_extended(PgStat_StatDBEntry *dbent, Oid funcid);
+static void pgstat_snapshot_global_stats(void);
 
 static const char *pgstat_get_wait_activity(WaitEventActivity w);
 static const char *pgstat_get_wait_client(WaitEventClient w);
@@ -316,364 +401,160 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+/* ------------------------------------------------------------
+ * Local support functions follow
+ * ------------------------------------------------------------
+ */
+static int pin_hashes(PgStat_StatDBEntry *dbentry);
+static void unpin_hashes(PgStat_StatDBEntry *dbentry, int generation);
+static dshash_table *attach_table_hash(PgStat_StatDBEntry *dbent, int gen);
+static dshash_table *attach_function_hash(PgStat_StatDBEntry *dbent, int gen);
+static void reset_dbentry_counters(PgStat_StatDBEntry *dbentry);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute space needed for stats collector's shared memory
  */
-void
-pgstat_init(void)
+Size
+StatsShmemSize(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
-
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
-
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
-    {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
-    }
-
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
-
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
-
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
-
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
-    }
-
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
-
-    /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
-     */
-    if (!pg_set_noblock(pgStatSock))
-    {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
-    }
-
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
-    }
-
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    return;
-
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
-
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
-
-    /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
-     */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    return sizeof(StatsShmemStruct);
 }
 
 /*
- * subroutine for pgstat_reset_all
+ * StatsShmemInit - initialize during shared-memory creation
+ */
+void
+StatsShmemInit(void)
+{
+    bool    found;
+
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+    }
+
+    LWLockInitialize(StatsLock, LWTRANCHE_STATS);
+
+
+}
+
+/* ----------
+ * pgstat_attach_shared_stats() -
+ *
+ *    Attach to the shared stats memory, creating it if it does not exist yet.
+ * ----------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+pgstat_attach_shared_stats(void)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    MemoryContext oldcontext;
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
+    /*
+     * Don't use dsm when not running under postmaster or when not tracking
+     * counts.  The already-attached case is checked below.
+     */
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    pgstat_setup_memcxt();
+
+    if (area)
+        return;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    if (StatsShmem->refcount > 0)
     {
-        int            nchars;
-        Oid            tmp_oid;
-
-        /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
-         */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
-
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
-
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+        /* Statistics already exist in shared memory; just attach to them */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatDBHash = dshash_attach(area, &dsh_dbparams,
+                                     StatsShmem->db_hash_handle, 0);
     }
-    FreeDir(dir);
+    else
+    {
+        /* Need to create shared stats */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
+
+        /* Setup shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatDBHash = dshash_create(area, &dsh_dbparams, 0);
+
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->global_stats =
+            dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+        StatsShmem->archiver_stats =
+            dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+        StatsShmem->db_hash_handle = dshash_get_hash_table_handle(pgStatDBHash);
+        StatsShmem->refcount = 0;
+    }
+
+    /* Setup local variables */
+    pgStatLocalHash = NULL;
+    shared_globalStats = (PgStat_GlobalStats *)
+        dsa_get_address(area, StatsShmem->global_stats);
+    shared_archiverStats = (PgStat_ArchiverStats *)
+        dsa_get_address(area, StatsShmem->archiver_stats);
+
+    dsa_pin_mapping(area);
+
+    /* Load data if we've just created the shared area. */
+    if (StatsShmem->refcount == 0)
+        pgstat_read_statsfiles();
+
+    StatsShmem->refcount++;
+
+    MemoryContextSwitchTo(oldcontext);
+    LWLockRelease(StatsLock);
+}
+
+/* ----------
+ * pgstat_detach_shared_stats() -
+ *
+ *    Detach shared stats. Write them out to file if we're the last process
+ *    and write_stats is true.
+ * ----------
+ */
+static void
+pgstat_detach_shared_stats(bool write_stats)
+{
+    if (!area || !IsUnderPostmaster)
+        return;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+    /* write out the shared stats to file if needed */
+    if (--StatsShmem->refcount < 1 && write_stats)
+        pgstat_write_statsfiles();
+
+    /*
+     * Detach the area. It is automatically destroyed when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);
+
+    /* We're the last process. Invalidate the dsa area handle. */
+    if (StatsShmem->refcount < 1)
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+
+    area = NULL;
+    pgStatDBHash = NULL;
+    shared_globalStats = NULL;
+    shared_archiverStats = NULL;
+    pgStatLocalHash = NULL;
+    LWLockRelease(StatsLock);
 }
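
The attach and detach functions above follow a reference-counted "first one in creates, last one out saves" pattern. A minimal standalone sketch of that pattern follows, assuming a plain heap allocation in place of the DSA area and omitting the StatsLock locking that the patch holds around both operations. DemoControl, demo_attach and demo_detach are hypothetical names used only for illustration; they are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for StatsShmemStruct: a control block with a refcount. */
typedef struct DemoControl
{
    int     refcount;   /* number of currently attached users */
    void   *area;       /* NULL plays the role of DSM_HANDLE_INVALID */
    int     loads;      /* times the stats were "read from file" */
    int     saves;      /* times the stats were "written to file" */
} DemoControl;

/* Attach to the shared area; the first attacher creates and loads it. */
static void *
demo_attach(DemoControl *ctl)
{
    if (ctl->refcount == 0)
    {
        assert(ctl->area == NULL);
        ctl->area = calloc(1, 64);
        if (ctl->area == NULL)
            return NULL;
        ctl->loads++;           /* corresponds to reading the stats files */
    }
    ctl->refcount++;
    return ctl->area;
}

/* Detach; the last detacher optionally saves, then invalidates the area. */
static void
demo_detach(DemoControl *ctl, bool write_stats)
{
    assert(ctl->refcount > 0);
    if (--ctl->refcount == 0)
    {
        if (write_stats)
            ctl->saves++;       /* corresponds to writing the stats files */
        free(ctl->area);
        ctl->area = NULL;
    }
}
```

In this sketch, two calls to demo_attach return the same area and load it once, and only the second demo_detach(ctl, true) counts as a save. A reset in the style of pgstat_reset_all would be demo_detach(ctl, false) followed by demo_attach(ctl) while refcount is 1.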
 
 /*
@@ -685,112 +566,18 @@ pgstat_reset_remove_files(const char *directory)
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
-}
+    /* we must have shared stats */
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-#ifdef EXEC_BACKEND
-
-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
-    char       *av[10];
-    int            ac = 0;
-
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
-
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
-
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
-
-
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
+    /* Startup must be the only user of shared stats */
+    Assert(StatsShmem->refcount == 1);
 
     /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
+     * We could directly remove files and recreate the shared memory area, but
+     * we detach and then attach for simplicity.
      */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
-
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
-
-    /*
-     * Okay, fork off the collector.
-     */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
+    pgstat_detach_shared_stats(false);    /* Don't write */
+    pgstat_attach_shared_stats();
 }
 
 /* ------------------------------------------------------------
@@ -798,75 +585,285 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *    This requires taking some locks on the shared statistics hashes, and
+ *    some updates may be postponed when a lock cannot be acquired. Postponed
+ *    updates are retried in later calls of this function and are finally
+ *    flushed with force = true once they have been pending for longer than
+ *    PGSTAT_STAT_MAX_INTERVAL milliseconds. Otherwise, regular backends
+ *    update at intervals no shorter than PGSTAT_STAT_MIN_INTERVAL.
+ *
+ *    Returns the time until the next update in milliseconds.
+ *
+ *    Note that this is called only outside of a transaction, so it is fair to use
  *    transaction stop time as an approximation of current time.
- * ----------
+ *    ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz next_flush = 0;
+    static TimestampTz pending_since = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
-    int            i;
+    pgstat_flush_stat_context cxt = {0};
+    bool        pending_stats = false;
+    long        elapsed;
+    long        secs;
+    int            usecs;
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (area == NULL ||
+        ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
+         pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
+         !HAVE_PENDING_DBSTATS() && !have_function_stats))
+        return 0;
+
+    now = GetCurrentTransactionStopTimestamp();
+
+    if (!force)
+    {
+        /*
+         * Don't flush stats if it's not the time yet.  Returns time to
+         * wait in milliseconds.
+         */
+        if (now < next_flush)
+        {
+            if (pending_since == 0)
+                pending_since = now;
+
+            /* now < next_flush here; TimestampTz is in microseconds */
+            return (next_flush - now) / 1000;
+        }
+
+        /*
+         * Don't keep pending stats for longer than PGSTAT_STAT_MAX_INTERVAL.
+         */
+        if (pending_since > 0)
+        {
+            TimestampDifference(pending_since, now, &secs, &usecs);
+            elapsed = secs * 1000 + usecs / 1000;
+
+            if (elapsed > PGSTAT_STAT_MAX_INTERVAL)
+                force = true;
+        }
+    }
+
+    /* Flush out table stats */
+    if (pgStatTabList != NULL && !pgstat_flush_stat(&cxt, !force))
+        pending_stats = true;
+
+    /* Flush out function stats */
+    if (pgStatFunctions != NULL && !pgstat_flush_funcstats(&cxt, !force))
+        pending_stats = true;
+
+    /* Flush out database-wide stats */
+    if (HAVE_PENDING_DBSTATS())
+    {
+        if (!pgstat_flush_dbstats(&cxt, !force))
+            pending_stats = true;
+    }
+
+    /*  Unpin dbentry if pinned */
+    if (cxt.mydb_tabhash)
+    {
+        dshash_detach(cxt.mydb_tabhash);
+        unpin_hashes(cxt.mydbentry, cxt.mygeneration);
+        cxt.mydb_tabhash = NULL;
+        cxt.mydbentry = NULL;
+    }
+
+    /* Publish the last flush time */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (shared_globalStats->stats_timestamp < now)
+        shared_globalStats->stats_timestamp = now;
+    LWLockRelease(StatsLock);
+
+    /* record how long we keep pending stats */
+    if (pending_stats)
+    {
+        /* Preserve the first value */
+        if (pending_since == 0)
+            pending_since = now;
+
+        /*
+         * The retry scheduled after "now" may land slightly past the limit
+         * set by PGSTAT_STAT_MAX_INTERVAL, but the overshoot is small, so we
+         * don't bother correcting for it.
+         */
+        return PGSTAT_STAT_RETRY_INTERVAL;
+    }
+
+    /* Set next time to update stats */
+    next_flush = now + PGSTAT_STAT_MIN_INTERVAL * 1000;
+    pending_since = 0;
+
+    return 0;
+}
+
+/*
+ * snapshot_statentry() - Common routine for functions
+ *                             pgstat_fetch_stat_*entry()
+ *
+ *  Returns the pointer to the snapshot entry for the key or NULL if not
+ *  found.
+ *
+ *  Returned entries are stable during the current transaction or until
+ *  pgstat_clear_snapshot() is called.
+ *
+ *  cxt->hash points to a variable that points to the HTAB in which this
+ *  function stores snapshot entries. The HTAB is created on demand using
+ *  hash_name and hash_entsize in cxt.
+ *
+ *  cxt->dshash is a working variable that points to a dshash_table storing
+ *  shared entries. cxt->dsh_handle specifies the dshash to be attached.
+ */
+static void *
+snapshot_statentry(pgstat_snapshot_param *cxt, Oid key)
+{
+    PgStat_snapshot *lentry = NULL;
+    size_t keysize = cxt->dsh_params->key_size;
+    size_t dsh_entrysize = cxt->dsh_params->entry_size;
+    bool found;
 
     /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
+     * We don't want to update the stats snapshot too frequently. Keep it for
+     * at least PGSTAT_STAT_MIN_INTERVAL ms before clearing it.
      */
-    now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
+    if (snapshot_cleard)
+    {
+        snapshot_cleard = false;
+        if (pgStatSnapshotContext &&
+            snapshot_globalStats->stats_timestamp <
+            GetCurrentStatementStartTimestamp() -
+            PGSTAT_STAT_MIN_INTERVAL * 1000)
+        {
+            MemoryContextReset(pgStatSnapshotContext);
+ 
+            /* Reset variables */
+            pgStatSnapshotContext = NULL;
+            snapshot_globalStats = NULL;
+            snapshot_archiverStats = NULL;
+            pgStatLocalHash = NULL;
+
+            pgstat_setup_memcxt();
+        }
+    }
+    
+    /*
+     * Create a new hash with an arbitrary initial number of entries since we
+     * don't know how this hash will grow.
+     */
+    if (!*cxt->hash)
+    {
+        HASHCTL ctl;
+
+        /*
+         * Create the hash in the stats context
+         *
+         * The real size of each hash entry is the given struct size plus the
+         * common header part of PgStat_snapshot.
+         */
+
+        ctl.keysize        = keysize;
+        ctl.entrysize    = cxt->hash_entsize + offsetof(PgStat_snapshot, body);
+        ctl.hcxt        = pgStatSnapshotContext;
+        *cxt->hash = hash_create(cxt->hash_name, 32, &ctl,
+                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    }
+
+    lentry = hash_search(*cxt->hash, &key, HASH_ENTER, &found);
+
+    if (!found)
+    {
+        /* not found in local cache, search shared hash */
+
+        void *sentry;
+
+        /* attach shared hash if not given, leave it alone for later use */
+        if (!*cxt->dshash)
+        {
+            MemoryContext oldcxt;
+
+            Assert (cxt->dsh_handle != DSM_HANDLE_INVALID);
+            oldcxt = MemoryContextSwitchTo(pgStatLocalContext);
+            *cxt->dshash =
+                dshash_attach(area, cxt->dsh_params, cxt->dsh_handle, NULL);
+            MemoryContextSwitchTo(oldcxt);
+        }
+
+        sentry = dshash_find(*cxt->dshash, &key, false);
+
+        if (sentry)
+        {
+            /* Found. Copy it */
+            memcpy(&lentry->body, sentry, dsh_entrysize);
+            dshash_release_lock(*cxt->dshash, sentry);
+
+            /* then zero out the local additional space if any */
+            if (dsh_entrysize < cxt->hash_entsize)
+                MemSet((char *)&lentry->body + dsh_entrysize, 0,
+                       cxt->hash_entsize - dsh_entrysize);
+        }
+
+        lentry->negative = !sentry;
+    }
+
+    if (lentry->negative)
+        return NULL;
+
+    return &lentry->body;
+}
+
+/*
+ * pgstat_flush_stat: Flushes table stats out to shared statistics.
+ *
+ *  If nowait is true, returns false if a required lock could not be acquired
+ *  immediately. In that case, stats for some tables may be left in
+ *  TabStatusArray to wait for the next chance.
+ *
+ *  cxt holds some dshash-related values that we want to keep during the
+ *  shared stats update. Caller must detach dshashes stored in cxt after use.
+ *
+ *  Returns true if all entries are flushed.
+ */
+static bool
+pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait)
+{
+    static const PgStat_TableCounts all_zeroes;
+    TabStatusArray *tsa;
+    HTAB           *new_tsa_hash = NULL;
+    TabStatusArray *dest_tsa = pgStatTabList;
+    int                dest_elem = 0;
+    int                i;
+
+    /* nothing to do, just return  */
+    if (pgStatTabHash == NULL)
+        return true;
 
     /*
      * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
+     * entries it points to.
      */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
+    hash_destroy(pgStatTabHash);
     pgStatTabHash = NULL;
 
     /*
      * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
+     * have counts, and try flushing them out to shared statistics. We may
+     * fail to flush some entries in the array; those entries are left packed
+     * at the beginning of the array.
      */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
     for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
     {
         for (i = 0; i < tsa->tsa_used; i++)
         {
             PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
 
             /* Shouldn't have any pending transaction-dependent counts */
             Assert(entry->trans == NULL);
@@ -879,178 +876,340 @@ pgstat_report_stat(bool force)
                        sizeof(PgStat_TableCounts)) == 0)
                 continue;
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* try to apply the tab stats */
+            if (!pgstat_flush_tabstat(cxt, nowait, entry))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                /*
+                 * Failed. Keep this entry, packing it toward the beginning
+                 * of TabStatusArray.
+                 */
+                TabStatHashEntry *hash_entry;
+                bool found;
+
+                if (new_tsa_hash == NULL)
+                    new_tsa_hash = create_tabstat_hash();
+
+                /* Create hash entry for this entry */
+                hash_entry = hash_search(new_tsa_hash, &entry->t_id,
+                                         HASH_ENTER, &found);
+                Assert(!found);
+
+                /* Move insertion pointer to the next segment. */
+                if (dest_elem >= TABSTAT_QUANTUM)
+                {
+                    Assert(dest_tsa->tsa_next != NULL);
+                    dest_tsa = dest_tsa->tsa_next;
+                    dest_elem = 0;
+                }
+
+                /* Move the entry if needed */
+                if (tsa != dest_tsa || i != dest_elem)
+                {
+                    PgStat_TableStatus *new_entry;
+                    new_entry = &dest_tsa->tsa_entries[dest_elem];
+                    *new_entry = *entry;
+                    entry = new_entry;
+                }
+
+                hash_entry->tsa_entry = entry;
+                dest_elem++;
             }
         }
-        /* zero out TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
     }
 
-    /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
-     */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    /* zero out unused area of TableStatus */
+    dest_tsa->tsa_used = dest_elem;
+    MemSet(&dest_tsa->tsa_entries[dest_elem], 0,
+           (TABSTAT_QUANTUM - dest_elem) * sizeof(PgStat_TableStatus));
+    while (dest_tsa->tsa_next)
+    {
+        dest_tsa = dest_tsa->tsa_next;
+        MemSet(dest_tsa->tsa_entries, 0,
+               dest_tsa->tsa_used * sizeof(PgStat_TableStatus));
+        dest_tsa->tsa_used = 0;
+    }
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+    /* and set the new TabStatusArray hash if any */
+    pgStatTabHash = new_tsa_hash;
+
+    /*
+     * We no longer need the shared database and table entries, but may
+     * still use those for my database.
+     */
+    if (cxt->shdb_tabhash)
+    {
+        dshash_detach(cxt->shdb_tabhash);
+        unpin_hashes(cxt->shdbentry, cxt->shgeneration);
+        cxt->shdb_tabhash = NULL;
+        cxt->shdbentry = NULL;
+    }
+
+    return pgStatTabHash == NULL;
 }
 
-/*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+/* -------
+ * Subroutines for pgstat_flush_stat.
+ * -------
  */
-static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+/*
+ * pgstat_flush_tabstat: Flushes a table stats entry.
+ *
+ *  If nowait is true, returns false on lock failure.  Dshashes for table and
+ *  function stats are kept attached in cxt. The caller must detach them after
+ *  use.
+ *
+ *  Returns true if the entry is flushed.
+ */
+bool
+pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
+                     PgStat_TableStatus *entry)
 {
-    int            n;
-    int            len;
+    Oid        dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
+    int        table_mode = PGSTAT_EXCLUSIVE;
+    bool    updated = false;
+    dshash_table *tabhash;
+    PgStat_StatDBEntry *dbent;
+    int        generation;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    if (nowait)
+        table_mode |= PGSTAT_NOWAIT;
 
-    /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
-     */
-    if (OidIsValid(tsmsg->m_databaseid))
+    /* Attach the required table hash if not yet attached. */
+    if ((entry->t_shared ? cxt->shdb_tabhash : cxt->mydb_tabhash) == NULL)
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
-        pgStatXactCommit = 0;
-        pgStatXactRollback = 0;
-        pgStatBlockReadTime = 0;
-        pgStatBlockWriteTime = 0;
+        /* We don't have corresponding dbentry here */
+        dbent = pgstat_get_db_entry(dboid, table_mode, NULL);
+        if (!dbent)
+            return false;
+
+        /*
+         * We don't hold dshash-lock on dbentries, since the dbentries cannot
+         * be dropped meanwhile.
+         */
+        generation = pin_hashes(dbent);
+        tabhash = attach_table_hash(dbent, generation);
+
+        if (entry->t_shared)
+        {
+            cxt->shgeneration = generation;
+            cxt->shdbentry = dbent;
+            cxt->shdb_tabhash = tabhash;
+        }
+        else
+        {
+            cxt->mygeneration = generation;
+            cxt->mydbentry = dbent;
+            cxt->mydb_tabhash = tabhash;
+
+            /*
+             * We attach mydb tabhash once per flush. This is our chance to
+             * update database-wide stats.
+             */
+            LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
+            dbent->n_xact_commit += pgStatXactCommit;
+            dbent->n_xact_rollback += pgStatXactRollback;
+            dbent->n_block_read_time += pgStatBlockReadTime;
+            dbent->n_block_write_time += pgStatBlockWriteTime;
+            LWLockRelease(&dbent->lock);
+            pgStatXactCommit = 0;
+            pgStatXactRollback = 0;
+            pgStatBlockReadTime = 0;
+            pgStatBlockWriteTime = 0;
+        }
+    }
+    else if (entry->t_shared)
+    {
+        dbent = cxt->shdbentry;
+        tabhash = cxt->shdb_tabhash;
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        dbent = cxt->mydbentry;
+        tabhash = cxt->mydb_tabhash;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    /*
+     * dbentry is always available here, so try flushing table stats first,
+     * then database stats.
+     */
+    if (pgstat_update_tabentry(tabhash, entry, nowait))
+    {
+        pgstat_update_dbentry(dbent, entry);
+        updated = true;
+    }
+
+    return updated;
 }
 
 /*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
+ * pgstat_flush_funcstats: Flushes function stats.
+ *
+ *  If nowait is true, returns false on lock failure and leaves some of the
+ *  entries in the local hash.
+ *
+ *  Returns true if all entries are flushed.
  */
-static void
-pgstat_send_funcstats(void)
+static bool
+pgstat_flush_funcstats(pgstat_flush_stat_context *cxt, bool nowait)
 {
     /* we assume this inits to all zeroes: */
     static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
+    dshash_table   *funchash;
     HASH_SEQ_STATUS fstat;
+    PgStat_BackendFunctionEntry *bestat;
 
+    /* nothing to do, just return  */
     if (pgStatFunctions == NULL)
-        return;
+        return true;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
-
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    /* get dbentry into cxt if not done yet */
+    if (cxt->mydbentry == NULL)
     {
-        PgStat_FunctionEntry *m_ent;
+        int op = PGSTAT_EXCLUSIVE;
 
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
+        if (nowait)
+            op |= PGSTAT_NOWAIT;
+
+        cxt->mydbentry = pgstat_get_db_entry(MyDatabaseId, op, NULL);
+
+        if (cxt->mydbentry == NULL)
+            return false;
+
+        cxt->mygeneration = pin_hashes(cxt->mydbentry);
+    }
+
+    funchash = attach_function_hash(cxt->mydbentry, cxt->mygeneration);
+    if (funchash == NULL)
+        return false;
+
+    have_function_stats = false;
+
+    /*
+     * Scan through pgStatFunctions to find functions that actually have
+     * counts, and try flushing them out to shared statistics.
+     */
+    hash_seq_init(&fstat, pgStatFunctions);
+    while ((bestat = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    {
+        bool found;
+        PgStat_StatFuncEntry *funcent = NULL;
+
+        /* Skip it if no counts accumulated for it so far */
+        if (memcmp(&bestat->f_counts, &all_zeroes,
                    sizeof(PgStat_FunctionCounts)) == 0)
             continue;
 
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
+        funcent = (PgStat_StatFuncEntry *)
+            dshash_find_or_insert_extended(funchash, (void *) &(bestat->f_id),
+                                           &found, nowait);
 
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
+        /*
+         * We couldn't acquire lock on the required entry. Leave the local
+         * entry alone.
+         */
+        if (!funcent)
         {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
+            have_function_stats = true;
+            continue;
         }
 
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
+        /* Initialize if it's new, or add to it. */
+        if (!found)
+        {
+            funcent->functionid = bestat->f_id;
+            funcent->f_numcalls = bestat->f_counts.f_numcalls;
+            funcent->f_total_time =
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_total_time);
+            funcent->f_self_time =
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_self_time);
+        }
+        else
+        {
+            funcent->f_numcalls += bestat->f_counts.f_numcalls;
+            funcent->f_total_time +=
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_total_time);
+            funcent->f_self_time +=
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_self_time);
+        }
+        dshash_release_lock(funchash, funcent);
+
+        /* reset used counts */
+        MemSet(&bestat->f_counts, 0, sizeof(PgStat_FunctionCounts));
     }
 
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
+    return !have_function_stats;
 }
 
+/*
+ * pgstat_flush_dbstats: Flushes out miscellaneous database stats.
+ *
+ *  If nowait is true, returns false on lock failure on dbentry.
+ *
+ *  Returns true if all the miscellaneous stats are flushed out.
+ */
+static bool
+pgstat_flush_dbstats(pgstat_flush_stat_context *cxt, bool nowait)
+{
+    /* get dbentry if not done yet */
+    if (cxt->mydbentry == NULL)
+    {
+        int op = PGSTAT_EXCLUSIVE;
+        if (nowait)
+            op |= PGSTAT_NOWAIT;
+
+        cxt->mydbentry = pgstat_get_db_entry(MyDatabaseId, op, NULL);
+
+        /* Lock failure, return. */
+        if (cxt->mydbentry == NULL)
+            return false;
+
+        cxt->mygeneration = pin_hashes(cxt->mydbentry);
+    }
+
+    LWLockAcquire(&cxt->mydbentry->lock, LW_EXCLUSIVE);
+    if (HAVE_PENDING_CONFLICTS())
+        pgstat_flush_recovery_conflict(cxt->mydbentry);
+    if (BeDBStats.n_deadlocks != 0)
+        pgstat_flush_deadlock(cxt->mydbentry);
+    if (BeDBStats.n_tmpfiles != 0)
+        pgstat_flush_tempfile(cxt->mydbentry);
+    if (BeDBStats.checksum_failures != NULL)
+        pgstat_flush_checksum_failure(cxt->mydbentry);
+    LWLockRelease(&cxt->mydbentry->lock);
+
+    return true;
+}
 
 /* ----------
  * pgstat_vacuum_stat() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *    Remove stats entries for objects that no longer exist.
  * ----------
  */
 void
 pgstat_vacuum_stat(void)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
+    HTAB       *oidtab;
+    dshash_seq_status dshstat;
     PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* we don't collect statistics under standalone mode */
+    if (!IsUnderPostmaster)
         return;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
     /*
      * Read pg_database and make a list of OIDs of all existing databases
      */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    oidtab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
 
     /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
+     * Search the database hash table for dead databases and drop them
+     * from the hash.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+
+    dshash_seq_init(&dshstat, pgStatDBHash, false, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            dbid = dbentry->databaseid;
 
@@ -1058,137 +1217,43 @@ pgstat_vacuum_stat(void)
 
         /* the DB entry for shared tables (with InvalidOid) is never dropped */
         if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
+            hash_search(oidtab, (void *) &dbid, HASH_FIND, NULL) == NULL)
             pgstat_drop_database(dbid);
     }
 
     /* Clean up */
-    hash_destroy(htab);
+    hash_destroy(oidtab);
 
     /*
      * Lookup our own database entry; if not found, nothing more to do.
      */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_EXCLUSIVE, NULL);
+    if (!dbentry)
         return;
 
     /*
      * Similarly to above, make a list of all known relations in this DB.
      */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
+    oidtab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
 
     /*
      * Check for all tables listed in stats hashtable if they still exist.
+     * Stats cache is useless here so directly search the shared hash.
      */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            tabid = tabentry->tableid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
-            continue;
-
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
-    }
+    pgstat_remove_useless_entries(dbentry->tables, &dsh_tblparams, oidtab);
 
     /*
-     * Send the rest
+     * Repeat the above for functions, but we needn't bother in the common
+     * case where no function stats are being collected.
      */
-    if (msg.m_nentries > 0)
+    if (dbentry->functions != DSM_HANDLE_INVALID)
     {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
+        oidtab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
 
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
-        {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
-                continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
-        }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
+        pgstat_remove_useless_entries(dbentry->functions, &dsh_funcparams,
+                                      oidtab);
     }
+    dshash_release_lock(pgStatDBHash, dbentry);
 }
 
 
@@ -1242,66 +1307,96 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     return htab;
 }
 
+/*
+ * pgstat_remove_useless_entries - Remove useless entries from per
+ * table/function dshashes.
+ *
+ *  Scan the dshash specified by dshhandle removing entries that are not in
+ *  oidtab. oidtab is destroyed before returning.
+ */
+void
+pgstat_remove_useless_entries(const dshash_table_handle dshhandle,
+                              const dshash_parameters *dshparams,
+                              HTAB *oidtab)
+{
+    dshash_table *dshtable;
+    dshash_seq_status dshstat;
+    void         *ent;
+
+    dshtable = dshash_attach(area, dshparams, dshhandle, 0);
+    dshash_seq_init(&dshstat, dshtable, false, true);
+
+    /* The first member of the entries must be Oid */
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
+    {
+        CHECK_FOR_INTERRUPTS();
+
+        if (hash_search(oidtab, ent, HASH_FIND, NULL) != NULL)
+            continue;
+
+        /* Not there, so purge this entry */
+        dshash_delete_entry(dshtable, ent);
+    }
+    dshash_detach(dshtable);
+    hash_destroy(oidtab);
+}
 
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
+ *    Remove entry for the database that we just dropped.
+ *
+ *    If some stats are flushed after this, this entry will be re-created but we
+ *    will still clean the dead DB eventually via future invocations of
+ *    pgstat_vacuum_stat().
  * ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    Assert (OidIsValid(databaseid));
+
+    if (!IsUnderPostmaster || !pgStatDBHash)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Lookup the database in the hashtable with exclusive lock.
+     */
+    dbentry = pgstat_get_db_entry(databaseid, PGSTAT_EXCLUSIVE, NULL);
+
+    /*
+     * If found, remove it (along with the db statfile).
+     */
+    if (dbentry)
+    {
+        LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+        Assert(dbentry->refcnt == 0);
+
+        /* No one can be using this database. It's safe to drop everything. */
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+        }
+        LWLockRelease(&dbentry->lock);
+
+        dshash_delete_entry(pgStatDBHash, (void *)dbentry);
+    }
 }
 
-
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
-
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
-}
-#endif                            /* NOT_USED */
-
-
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1310,20 +1405,32 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    PgStat_StatDBEntry       *dbentry;
+    PgStat_TableLookupResult status;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!pgStatDBHash)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Lookup the database in the hashtable.  Nothing to do if not there.
+     */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_EXCLUSIVE, &status);
+
+    if (!dbentry)
+        return;
+
+    /* This database is active, safe to release the lock immediately. */
+    dshash_release_lock(pgStatDBHash, dbentry);
+
+    /* Reset database-level stats. */
+    reset_dbentry_counters(dbentry);
+
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1332,29 +1439,35 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
+    /* Reset the archiver statistics for the cluster. */
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
+    }
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+        /* Reset the global background writer statistics for the cluster. */
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    }
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
+    LWLockRelease(StatsLock);
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1363,17 +1476,42 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
+    int generation;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_EXCLUSIVE, NULL);
+
+    if (!dbentry)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* This database is active, safe to release the lock immediately. */
+    generation = pin_hashes(dbentry);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Set the reset timestamp for the whole database */
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&dbentry->lock);
+
+    /* Remove object if it exists, ignore if not */
+    if (type == RESET_TABLE)
+    {
+        dshash_table *t = attach_table_hash(dbentry, generation);
+        dshash_delete_key(t, (void *) &objoid);
+        dshash_detach(t);
+    }
+
+    if (type == RESET_FUNCTION)
+    {
+        dshash_table *t = attach_function_hash(dbentry, generation);
+        if (t)
+        {
+            dshash_delete_key(t, (void *) &objoid);
+            dshash_detach(t);
+        }
+    }
+    unpin_hashes(dbentry, generation);
 }
 
 /* ----------
@@ -1387,48 +1525,81 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    /*
+     * Store the last autovacuum time in the database's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_EXCLUSIVE, NULL);
+    dshash_release_lock(pgStatDBHash, dbentry);
 
-    pgstat_send(&msg, sizeof(msg));
+    ts = GetCurrentTimestamp();
+
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->last_autovac_time = ts;
+    LWLockRelease(&dbentry->lock);
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report about the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
+    int                    generation;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hash table entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_EXCLUSIVE, NULL);
+    generation = pin_hashes(dbentry);
+    table = attach_table_hash(dbentry, generation);
+
+    tabentry = pgstat_get_tab_entry(table, tableoid, true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->vacuum_count++;
+    }
+    dshash_release_lock(table, tabentry);
+
+    dshash_detach(table);
+    unpin_hashes(dbentry, generation);
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report about the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1439,9 +1610,14 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table       *table;
+    int                    generation;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1470,78 +1646,153 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_EXCLUSIVE, NULL);
+    generation = pin_hashes(dbentry);
+    table = attach_table_hash(dbentry, generation);
+    tabentry = pgstat_get_tab_entry(table, RelationGetRelid(rel), true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    dshash_release_lock(table, tabentry);
+
+    dshash_detach(table);
+    unpin_hashes(dbentry, generation);
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupResult status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            BeDBStats.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            BeDBStats.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            BeDBStats.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            BeDBStats.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            BeDBStats.n_conflict_startup_deadlock++;
+            break;
+    }
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
+
+    if (status == LOCK_FAILED)
+        return;
+
+    /* We had a chance to flush immediately */
+    pgstat_flush_recovery_conflict(dbentry);
+
+    dshash_release_lock(pgStatDBHash, dbentry);
+}
+
+/*
+ * flush recovery conflict stats
+ */
+static void
+pgstat_flush_recovery_conflict(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_conflict_tablespace    += BeDBStats.n_conflict_tablespace;
+    dbentry->n_conflict_lock         += BeDBStats.n_conflict_lock;
+    dbentry->n_conflict_snapshot    += BeDBStats.n_conflict_snapshot;
+    dbentry->n_conflict_bufferpin    += BeDBStats.n_conflict_bufferpin;
+    dbentry->n_conflict_startup_deadlock += BeDBStats.n_conflict_startup_deadlock;
+
+    BeDBStats.n_conflict_tablespace = 0;
+    BeDBStats.n_conflict_lock = 0;
+    BeDBStats.n_conflict_snapshot = 0;
+    BeDBStats.n_conflict_bufferpin = 0;
+    BeDBStats.n_conflict_startup_deadlock = 0;
 }
 
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupResult status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    BeDBStats.n_deadlocks++;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
+
+    if (status == LOCK_FAILED)
+        return;
+
+    dshash_release_lock(pgStatDBHash, dbentry);
 }
 
-
-
-/* --------
- * pgstat_report_checksum_failures_in_db() -
- *
- *    Tell the collector about one or more checksum failures.
- * --------
+/*
+ * flush deadlock stats
  */
-void
-pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
+static void
+pgstat_flush_deadlock(PgStat_StatDBEntry *dbentry)
 {
-    PgStat_MsgChecksumFailure msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
-
-    pgstat_send(&msg, sizeof(msg));
+    dbentry->n_deadlocks += BeDBStats.n_deadlocks;
+    BeDBStats.n_deadlocks = 0;
 }
 
 /* --------
@@ -1559,60 +1810,153 @@ pgstat_report_checksum_failure(void)
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupResult status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
-}
+    if (filesize > 0) /* Is there a case where filesize is really 0? */
+    {
+        BeDBStats.n_tmpfilesize += filesize; /* needs overflow check */
+        BeDBStats.n_tmpfiles++;
+    }
 
-
-/* ----------
- * pgstat_ping() -
- *
- *    Send some junk data to the collector to increase traffic.
- * ----------
- */
-void
-pgstat_ping(void)
-{
-    PgStat_MsgDummy msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (BeDBStats.n_tmpfiles == 0)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
+
+    if (status == LOCK_FAILED)
+        return;
+
+    /* We had a chance to flush immediately */
+    pgstat_flush_tempfile(dbentry);
+
+    dshash_release_lock(pgStatDBHash, dbentry);
 }
 
-/* ----------
- * pgstat_send_inquiry() -
- *
- *    Notify collector that we need fresh data.
- * ----------
+/*
+ * flush temporary file stats
  */
 static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
+pgstat_flush_tempfile(PgStat_StatDBEntry *dbentry)
 {
-    PgStat_MsgInquiry msg;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
+    dbentry->n_temp_bytes += BeDBStats.n_tmpfilesize;
+    dbentry->n_temp_files += BeDBStats.n_tmpfiles;
+    BeDBStats.n_tmpfilesize = 0;
+    BeDBStats.n_tmpfiles = 0;
 }
 
+/* --------
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
+ *
+ *    Report one or more checksum failures.
+ * --------
+ */
+void
+pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
+{
+    PgStat_StatDBEntry       *dbentry;
+    PgStat_TableLookupResult status;
+    ChecksumFailureEnt       *failent = NULL;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    if (BeDBStats.checksum_failures != NULL)
+    {
+        failent = hash_search(BeDBStats.checksum_failures, &dboid,
+                              HASH_FIND, NULL);
+        if (failent)
+            failurecount += failent->count;
+    }
+
+    if (failurecount == 0)
+        return;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
+
+    if (status == LOCK_FAILED)
+    {
+        if (!failent)
+        {
+            if (!BeDBStats.checksum_failures)
+            {
+                HASHCTL    ctl;
+
+                ctl.keysize = sizeof(Oid);
+                ctl.entrysize = sizeof(ChecksumFailureEnt);
+                BeDBStats.checksum_failures =
+                    hash_create("pgstat checksum failure count hash",
+                                32, &ctl, HASH_ELEM | HASH_BLOBS);
+            }
+
+            failent = hash_search(BeDBStats.checksum_failures,
+                                  &dboid, HASH_ENTER, NULL);
+        }
+
+        failent->count = failurecount;
+        return;
+    }
+
+    /* We have a chance to flush immediately */
+    dbentry->n_checksum_failures += failurecount;
+    BeDBStats.checksum_failures = NULL;
+
+    dshash_release_lock(pgStatDBHash, dbentry);
+}
+
+/*
+ * flush checksum failure counts for all databases
+ */
+static void
+pgstat_flush_checksum_failure(PgStat_StatDBEntry *dbentry)
+{
+    HASH_SEQ_STATUS     stat;
+    ChecksumFailureEnt *ent;
+    bool                release_dbent;
+
+    if (BeDBStats.checksum_failures == NULL)
+        return;
+
+    hash_seq_init(&stat, BeDBStats.checksum_failures);
+    while ((ent = (ChecksumFailureEnt *) hash_seq_search(&stat)) != NULL)
+    {
+        release_dbent = false;
+
+        if (dbentry->databaseid != ent->dboid)
+        {
+            dbentry = pgstat_get_db_entry(ent->dboid,
+                                          PGSTAT_EXCLUSIVE, NULL);
+            if (!dbentry)
+                continue;
+
+            release_dbent = true;
+        }
+
+        dbentry->n_checksum_failures += ent->count;
+
+        if (release_dbent)
+            dshash_release_lock(pgStatDBHash, dbentry);
+    }
+
+    hash_destroy(BeDBStats.checksum_failures);
+    BeDBStats.checksum_failures = NULL;
+}
 
 /*
  * Initialize function call usage data.
@@ -1764,7 +2108,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1783,6 +2128,24 @@ pgstat_initstats(Relation rel)
     rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
 }
 
+/*
+ * create_tabstat_hash - create local hash as transactional storage
+ */
+static HTAB *
+create_tabstat_hash(void)
+{
+    HASHCTL        ctl;
+
+    MemSet(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(Oid);
+    ctl.entrysize = sizeof(TabStatHashEntry);
+
+    return hash_create("pgstat TabStatusArray lookup hash table",
+                       TABSTAT_QUANTUM,
+                       &ctl,
+                       HASH_ELEM | HASH_BLOBS);
+}
+
 /*
  * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
  */
@@ -1798,18 +2161,7 @@ get_tabstat_entry(Oid rel_id, bool isshared)
      * Create hash table if we don't have it already.
      */
     if (pgStatTabHash == NULL)
-    {
-        HASHCTL        ctl;
-
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
-    }
+        pgStatTabHash = create_tabstat_hash();
 
     /*
      * Find an entry or create a new one.
@@ -2422,30 +2774,34 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
- * ----------
+ *    Find database stats entry on backends. The returned entries are cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_param param =
+    {
+        .hash_name = "local database stats hash",
+        .hash_entsize = sizeof(PgStat_StatDBEntry),
+        .dsh_handle = DSM_HANDLE_INVALID,   /* already attached */
+        .dsh_params = &dsh_dbparams,
+        .hash = &pgStatLocalHash,
+        .dshash = &pgStatDBHash
+    };
 
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    /* If not done for this transaction, take a snapshot of global stats */
+    pgstat_snapshot_global_stats();
+
+    /* caller has no business with snapshot-local members */
+    return (PgStat_StatDBEntry *)
+        snapshot_statentry(&param, dbid);
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
@@ -2458,51 +2814,66 @@ pgstat_fetch_stat_dbentry(Oid dbid)
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* Lookup our database, then look in its table hash table. */
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    if (dbentry == NULL)
+        return NULL;
 
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    dbentry = pgstat_fetch_stat_dbentry(InvalidOid);
+    if (dbentry == NULL)
+        return NULL;
+
+    tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     return NULL;
 }
 
+/* ----------
+ * pgstat_fetch_stat_tabentry_extended() -
+ *
+ *    Find table stats entry on backends. The returned entries are cached until
+ *    transaction end or pgstat_clear_snapshot() is called.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_extended(PgStat_StatDBEntry *dbent, Oid reloid)
+{
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_param param =
+    {
+        .hash_name = "table stats snapshot hash",
+        .hash_entsize = sizeof(PgStat_StatTabEntry),
+        .dsh_handle = DSM_HANDLE_INVALID,
+        .dsh_params = &dsh_tblparams,
+        .hash = NULL,
+        .dshash = NULL
+    };
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    /* set target shared hash */
+    param.dsh_handle = dbent->tables;
+
+    /* tell snapshot_statentry what variables to use */
+    param.hash = &dbent->snapshot_tables;
+    param.dshash = &dbent->dshash_tables;
+
+    return (PgStat_StatTabEntry *)
+        snapshot_statentry(&param, reloid);
+}
+
 
 /* ----------
  * pgstat_fetch_stat_funcentry() -
@@ -2517,21 +2888,93 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     PgStat_StatDBEntry *dbentry;
     PgStat_StatFuncEntry *funcentry = NULL;
 
-    /* load the stats file if needed */
-    backend_read_statsfile();
-
-    /* Lookup our database, then find the requested function.  */
+    /* Lookup our database, then find the requested function */
     dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
+    if (dbentry == NULL)
+        return NULL;
+
+    funcentry = pgstat_fetch_stat_funcentry_extended(dbentry, func_id);
 
     return funcentry;
 }
 
+/* ----------
+ * pgstat_fetch_stat_funcentry_extended() -
+ *
+ *    Find function stats entry on backends. The returned entries are cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
+ *
+ *  dbent is of type (PgStat_StatDBEntry *), but its body must be a
+ *  PgStat_StatDBEntry returned from pgstat_fetch_stat_dbentry().
+ */
+static PgStat_StatFuncEntry *
+pgstat_fetch_stat_funcentry_extended(PgStat_StatDBEntry *dbent, Oid funcid)
+{
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_param param =
+    {
+        .hash_name = "function stats snapshot hash",
+        .hash_entsize = sizeof(PgStat_StatFuncEntry),
+        .dsh_handle = DSM_HANDLE_INVALID,
+        .dsh_params = &dsh_funcparams,
+        .hash = NULL,
+        .dshash = NULL
+    };
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    if (dbent->functions == DSM_HANDLE_INVALID)
+        return NULL;
+
+    /* set target shared hash */
+    param.dsh_handle = dbent->functions;
+
+    /* tell snapshot_statentry what variables to use */
+    param.hash = &dbent->snapshot_functions;
+    param.dshash = &dbent->dshash_functions;
+
+    return (PgStat_StatFuncEntry *)
+        snapshot_statentry(&param, funcid);
+}
+
+/*
+ * pgstat_snapshot_global_stats() -
+ *
+ * Makes a snapshot of global stats if not done yet.  They will be kept until
+ * subsequent call of pgstat_clear_snapshot() or the end of the current
+ * memory context (typically TopTransactionContext).
+ */
+static void
+pgstat_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext;
+
+    pgstat_attach_shared_stats();
+
+    /* Nothing to do if already done */
+    if (snapshot_globalStats)
+        return;
+
+    Assert(snapshot_archiverStats == NULL);
+
+    oldcontext = MemoryContextSwitchTo(pgStatLocalContext);
+
+    /* global stats can be just copied  */
+    LWLockAcquire(StatsLock, LW_SHARED);
+    snapshot_globalStats = palloc(sizeof(PgStat_GlobalStats));
+    memcpy(snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+
+    snapshot_archiverStats = palloc(sizeof(PgStat_ArchiverStats));
+    memcpy(snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+    LWLockRelease(StatsLock);
+
+    MemoryContextSwitchTo(oldcontext);
+
+    return;
+}
 
 /* ----------
  * pgstat_fetch_stat_beentry() -
@@ -2603,9 +3046,10 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &archiverStats;
+    return snapshot_archiverStats;
 }
 
 
@@ -2620,9 +3064,10 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &globalStats;
+    return snapshot_globalStats;
 }
 
 
@@ -2836,8 +3281,8 @@ pgstat_initialize(void)
         MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
     }
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* needs to be called before dsm shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -2935,7 +3380,7 @@ pgstat_bestart(void)
                 lbeentry.st_backendType = B_STARTUP;
                 break;
             case ArchiverProcess:
-                beentry->st_backendType = B_ARCHIVER;
+                lbeentry.st_backendType = B_ARCHIVER;
                 break;
             case BgWriterProcess:
                 lbeentry.st_backendType = B_BG_WRITER;
@@ -3071,6 +3516,10 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+
+    /* attach shared database stats area */
+    pgstat_attach_shared_stats();
 }
 
 /*
@@ -3106,6 +3555,8 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    pgstat_detach_shared_stats(true);
 }
 
 
@@ -3366,7 +3817,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3661,9 +4113,6 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
-            break;
         case WAIT_EVENT_RECOVERY_WAL_ALL:
             event_name = "RecoveryWalAll";
             break;
@@ -4323,75 +4772,39 @@ pgstat_get_backend_desc(BackendType backendType)
  * ------------------------------------------------------------
  */
 
-
-/* ----------
- * pgstat_setheader() -
- *
- *        Set common header fields in a statistics message
- * ----------
- */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
-    {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
 /* ----------
  * pgstat_send_archiver() -
  *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
+ *        Report archiver statistics
  * ----------
  */
 void
 pgstat_send_archiver(const char *xlog, bool failed)
 {
-    PgStat_MsgArchiver msg;
-
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (failed)
+    {
+        /* Failed archival attempt */
+        ++shared_archiverStats->failed_count;
+        memcpy(shared_archiverStats->last_failed_wal, xlog,
+               sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = GetCurrentTimestamp();
+    }
+    else
+    {
+        /* Successful archival operation */
+        ++shared_archiverStats->archived_count;
+        memcpy(shared_archiverStats->last_archived_wal, xlog,
+               sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = GetCurrentTimestamp();
+    }
+    LWLockRelease(StatsLock);
 }
 
 /* ----------
  * pgstat_send_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
@@ -4400,6 +4813,8 @@ pgstat_send_bgwriter(void)
     /* We assume this initializes to zeroes */
     static const PgStat_MsgBgWriter all_zeroes;
 
+    PgStat_MsgBgWriter *s = &BgWriterStats;
+
     /*
      * This function can be called even if nothing at all has happened. In
      * this case, avoid sending a completely empty message to the stats
@@ -4408,11 +4823,18 @@ pgstat_send_bgwriter(void)
     if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += s->m_timed_checkpoints;
+    shared_globalStats->requested_checkpoints += s->m_requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += s->m_checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += s->m_checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += s->m_buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += s->m_buf_written_clean;
+    shared_globalStats->maxwritten_clean += s->m_maxwritten_clean;
+    shared_globalStats->buf_written_backend += s->m_buf_written_backend;
+    shared_globalStats->buf_fsync_backend += s->m_buf_fsync_backend;
+    shared_globalStats->buf_alloc += s->m_buf_alloc;
+    LWLockRelease(StatsLock);
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4421,305 +4843,164 @@ pgstat_send_bgwriter(void)
 }
 
 
-/* ----------
- * PgstatCollectorMain() -
+/*
+ * Pin and Unpin dbentry.
  *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
+ * To keep memory usage low, and for speed, counters are reset by recreating
+ * the dshash instead of removing entries one by one while holding a lock on
+ * the whole dshash. On the other hand, a dshash cannot be destroyed until all
+ * referrers have gone. As a result, other backends could be kept waiting for
+ * a counter reset for a long time. We isolate the hashes under destruction as
+ * another generation, which is no longer used but cannot be removed yet.
+ *
+ * When we start accessing hashes on a dbentry, call pin_hashes() and acquire
+ * the current "generation". unpin_hashes() removes the older generation's
+ * hashes when all referrers have gone.
  */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
+static int
+pin_hashes(PgStat_StatDBEntry *dbentry)
 {
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
+    int    generation;
 
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, pgstat_sighup_handler);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, pgstat_exit);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->refcnt++;
+    generation = dbentry->generation;
+    LWLockRelease(&dbentry->lock);
 
-    /*
-     * Identify myself via ps
-     */
-    init_ps_display("stats collector", "", "", "");
+    dshash_release_lock(pgStatDBHash, dbentry);
 
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
+    return generation;
+}
 
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do recognize got_SIGHUP inside
-     * the inner loop, which means that such interrupts will get serviced but
-     * the latch won't get cleared until next time there is a break in the
-     * action.
-     */
-    for (;;)
+/*
+ * Unpin hashes in dbentry. If the given generation is isolated, destroy it
+ * after all referrers have gone. Otherwise just decrease the reference count.
+ */
+static void
+unpin_hashes(PgStat_StatDBEntry *dbentry, int generation)
+{
+    dshash_table *tables;
+    dshash_table *funcs = NULL;
+
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+
+    /* using current generation, just decrease refcount */
+    if (dbentry->generation == generation)
     {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (need_exit)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * need_exit becomes set.
-         */
-        while (!need_exit)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (got_SIGHUP)
-            {
-                got_SIGHUP = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(
-                                                   &msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(
-                                                   &msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(
-                                                 &msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(
-                                                 &msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr & WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
+        dbentry->refcnt--;
+        LWLockRelease(&dbentry->lock);
+        return;
+    }
 
     /*
-     * Save the final stats to reuse at next startup.
+     * It is isolated, waiting for all referrers to end.
      */
-    pgstat_write_statsfiles(true, true);
+    Assert(dbentry->generation == generation + 1);
 
-    exit(0);
+    if (--dbentry->prev_refcnt > 0)
+    {
+        LWLockRelease(&dbentry->lock);
+        return;
+    }
+
+    /* no referrer remains, remove the hashes */
+    tables = dshash_attach(area, &dsh_tblparams, dbentry->prev_tables, 0);
+    if (dbentry->prev_functions != DSM_HANDLE_INVALID)
+        funcs = dshash_attach(area, &dsh_funcparams,
+                              dbentry->prev_functions, 0);
+
+    dbentry->prev_tables = DSM_HANDLE_INVALID;
+    dbentry->prev_functions = DSM_HANDLE_INVALID;
+
+    /* release the entry immediately */
+    LWLockRelease(&dbentry->lock);
+
+    dshash_destroy(tables);
+    if (funcs)
+        dshash_destroy(funcs);
+
+    return;
 }
 
-
-/* SIGQUIT signal handler for collector process */
-static void
-pgstat_exit(SIGNAL_ARGS)
+/*
+ * attach and return the specified generation of table hash
+ * Returns NULL on lock failure.
+ */
+static dshash_table *
+attach_table_hash(PgStat_StatDBEntry *dbent, int gen)
 {
-    int            save_errno = errno;
+    dshash_table *ret;
 
-    need_exit = true;
-    SetLatch(MyLatch);
+    LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
 
-    errno = save_errno;
+    if (dbent->generation == gen)
+        ret = dshash_attach(area, &dsh_tblparams, dbent->tables, 0);
+    else
+    {
+        Assert(dbent->generation == gen + 1);
+        Assert(dbent->prev_tables != DSM_HANDLE_INVALID);
+        ret = dshash_attach(area, &dsh_tblparams, dbent->prev_tables, 0);
+    }
+    LWLockRelease(&dbent->lock);
+
+    return ret;
 }
 
-/* SIGHUP handler for collector process */
-static void
-pgstat_sighup_handler(SIGNAL_ARGS)
+/* attach and return the specified generation of function hash */
+static dshash_table *
+attach_function_hash(PgStat_StatDBEntry *dbent, int gen)
 {
-    int            save_errno = errno;
+    dshash_table *ret = NULL;
 
-    got_SIGHUP = true;
-    SetLatch(MyLatch);
 
-    errno = save_errno;
+    LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
+
+    if (dbent->generation == gen)
+    {
+        if (dbent->functions == DSM_HANDLE_INVALID)
+        {
+            dshash_table *funchash =
+                dshash_create(area, &dsh_funcparams, 0);
+            dbent->functions = dshash_get_hash_table_handle(funchash);
+
+            ret = funchash;
+        }
+        else
+            ret = dshash_attach(area, &dsh_funcparams, dbent->functions, 0);
+    }
+    /* don't bother creating useless hash */
+
+    LWLockRelease(&dbent->lock);
+
+    return ret;
+}
+
+static void
+init_dbentry(PgStat_StatDBEntry *dbentry)
+{
+    LWLockInitialize(&dbentry->lock, LWTRANCHE_STATS);
+    dbentry->generation = 0;
+    dbentry->refcnt = 0;
+    dbentry->prev_refcnt = 0;
+    dbentry->tables = DSM_HANDLE_INVALID;
+    dbentry->prev_tables = DSM_HANDLE_INVALID;
+    dbentry->functions = DSM_HANDLE_INVALID;
+    dbentry->prev_functions = DSM_HANDLE_INVALID;
 }
 
 /*
  * Subroutine to clear stats in a database entry
  *
- * Tables and functions hashes are initialized to empty.
+ * All counters are reset. Tables and functions dshashes are destroyed.  If
+ * any backend is pinning this dbentry, the current dshashes are stashed out to
+ * the previous "generation" to wait until all accessors are gone. If the
+ * previous generation is already occupied, the current dshashes are so fresh
+ * that they don't need to be cleared.
  */
 static void
 reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
 {
-    HASHCTL        hash_ctl;
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
 
     dbentry->n_xact_commit = 0;
     dbentry->n_xact_rollback = 0;
@@ -4744,72 +5025,860 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
     dbentry->n_block_read_time = 0;
     dbentry->n_block_write_time = 0;
 
+    if (dbentry->refcnt == 0)
+    {
+        /*
+         * No one is referring to the current hash. Removing individual
+         * entries in dshash is very costly so just destroy it.  If someone
+         * pins this entry just afterwards, pin_hashes returns the current
+         * generation and attach waits for the following LWLock.
+         */
+        dshash_table *tbl;
+
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+            dbentry->tables = DSM_HANDLE_INVALID;
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+            dbentry->functions = DSM_HANDLE_INVALID;
+        }
+    }
+    else if (dbentry->prev_refcnt == 0)
+    {
+        /*
+         * Someone is still referring to the current hash and previous slot is
+         * vacant. Stash out the current hash to the previous slot.
+         */
+        dbentry->prev_refcnt = dbentry->refcnt;
+        dbentry->prev_tables = dbentry->tables;
+        dbentry->prev_functions = dbentry->functions;
+        dbentry->refcnt = 0;
+        dbentry->tables = DSM_HANDLE_INVALID;
+        dbentry->functions = DSM_HANDLE_INVALID;
+        dbentry->generation++;
+    }
+    else
+    {
+        Assert(dbentry->prev_refcnt > 0 && dbentry->refcnt > 0);
+        /*
+         * If we get here, we have just received another reset request while
+         * the old hashes are still waiting for all referrers to release them.
+         * That should take only a short time, so we can ignore this request.
+         */
+    }
+
+    /* Create a new table hash if it does not exist */
+    if (dbentry->tables == DSM_HANDLE_INVALID)
+    {
+        dshash_table *tbl = dshash_create(area, &dsh_tblparams, 0);
+        dbentry->tables = dshash_get_hash_table_handle(tbl);
+        dshash_detach(tbl);
+    }
+
+    /* Recreate now if needed. */
+    if (dbentry->functions == DSM_HANDLE_INVALID &&
+        pgstat_track_functions != TRACK_FUNC_OFF)
+    {
+        dshash_table *tbl = dshash_create(area, &dsh_funcparams, 0);
+        dbentry->functions = dshash_get_hash_table_handle(tbl);
+        dshash_detach(tbl);
+    }
+
     dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
 
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
+    LWLockRelease(&dbentry->lock);
 }
 
 /*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+ * Create the filename for a DB stat file; filename is the output buffer, of
+ * length len.
  */
-static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
+static void
+get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
 {
-    PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
+    int            printed;
 
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/db_%u.%s",
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
+                       databaseid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
+}
 
-    if (!create && !found)
-        return NULL;
+/* ----------
+ * pgstat_write_statsfiles() -
+ *        Write the global statistics file, as well as DB files.
+ * ----------
+ */
+void
+pgstat_write_statsfiles(void)
+{
+    dshash_seq_status hstat;
+    PgStat_StatDBEntry *dbentry;
+    FILE       *fpout;
+    int32        format_id;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    int            rc;
+
+    /* stats are not initialized yet; just return */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
+
+    elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
     /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
+     * Open the statistics temp file to write out the current values.
      */
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        return;
+    }
+
+    /*
+     * Set the timestamp of the stats file.
+     */
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
+
+    /*
+     * Write the file header --- currently just a format ID.
+     */
+    format_id = PGSTAT_FILE_FORMAT_ID;
+    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write global stats struct
+     */
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write archiver stats struct
+     */
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Walk through the database table.
+     */
+    dshash_seq_init(&hstat, pgStatDBHash, false, false);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&hstat)) != NULL)
+    {
+        /*
+         * Write out the table and function stats for this DB into the
+         * appropriate per-DB stat file, if required.
+         */
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
+
+        pgstat_write_pgStatDBHashfile(dbentry);
+
+        /*
+         * Write out the DB entry. We don't write the tables or functions
+         * pointers, since they're of no use to any other process.
+         */
+        fputc('D', fpout);
+        rc = fwrite(dbentry,
+                    offsetof(PgStat_StatDBEntry, generation), 1, fpout);
+        (void) rc;                /* we'll check for error with ferror */
+    }
+
+    /*
+     * No more output to be done. Close the temp file and replace the old
+     * pgstat.stat with it.  The ferror() check replaces testing for error
+     * after each individual fputc or fwrite above.
+     */
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+}
+
+/* ----------
+ * pgstat_write_pgStatDBHashfile() -
+ *        Write the stat file for a single database.
+ * ----------
+ */
+static void
+pgstat_write_pgStatDBHashfile(PgStat_StatDBEntry *dbentry)
+{
+    dshash_seq_status tstat;
+    dshash_seq_status fstat;
+    PgStat_StatTabEntry *tabentry;
+    PgStat_StatFuncEntry *funcentry;
+    FILE       *fpout;
+    int32        format_id;
+    Oid            dbid = dbentry->databaseid;
+    int            rc;
+    char        tmpfile[MAXPGPATH];
+    char        statfile[MAXPGPATH];
+    dshash_table *tbl;
+
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
+
+    elog(DEBUG2, "writing stats file \"%s\"", statfile);
+
+    /*
+     * Open the statistics temp file to write out the current values.
+     */
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        return;
+    }
+
+    /*
+     * Write the file header --- currently just a format ID.
+     */
+    format_id = PGSTAT_FILE_FORMAT_ID;
+    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Walk through the database's access stats per table.
+     */
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&tstat, tbl, false, false);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&tstat)) != NULL)
+    {
+        fputc('T', fpout);
+        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
+        (void) rc;                /* we'll check for error with ferror */
+    }
+    dshash_detach(tbl);
+
+    /*
+     * Walk through the database's function stats table.
+     */
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_seq_init(&fstat, tbl, false, false);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&fstat)) != NULL)
+        {
+            fputc('F', fpout);
+            rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+            (void) rc;                /* we'll check for error with ferror */
+        }
+        dshash_detach(tbl);
+    }
+
+    /*
+     * No more output to be done. Close the temp file and replace the old
+     * pgstat.stat with it.  The ferror() check replaces testing for error
+     * after each individual fputc or fwrite above.
+     */
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+}
+
+/* ----------
+ * pgstat_read_statsfiles() -
+ *
+ *    Reads in existing statistics collector files into the shared stats hash.
+ *
+ * ----------
+ */
+void
+pgstat_read_statsfiles(void)
+{
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatDBEntry dbbuf;
+    FILE       *fpin;
+    int32        format_id;
+    bool        found;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+
+    /* shouldn't be called from postmaster */
+    Assert(IsUnderPostmaster);
+
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
+
+    /*
+     * Set the current timestamp (will be kept only in case we can't load an
+     * existing statsfile).
+     */
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp =
+        shared_globalStats->stat_reset_timestamp;
+
+    /*
+     * Try to open the stats file. If it doesn't exist, the backends simply
+     * return zero for anything and the collector simply starts from scratch
+     * with empty counters.
+     *
+     * ENOENT is a possibility if the stats collector is not running or has
+     * not yet written the stats file the first time.  Any other failure
+     * condition is suspicious.
+     */
+    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(LOG,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            statfile)));
+        return;
+    }
+
+    /*
+     * Verify it's of the expected format.
+     */
+    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
+        format_id != PGSTAT_FILE_FORMAT_ID)
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        goto done;
+    }
+
+    /*
+     * Read global stats struct
+     */
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+        sizeof(*shared_globalStats))
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+        goto done;
+    }
+
+    /*
+     * Read archiver stats struct
+     */
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+        sizeof(*shared_archiverStats))
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        goto done;
+    }
+
+    /*
+     * We found an existing collector stats file. Read it and put all the
+     * hashtable entries into place.
+     */
+    for (;;)
+    {
+        switch (fgetc(fpin))
+        {
+                /*
+                 * 'D'    A PgStat_StatDBEntry struct describing a database
+                 * follows.
+                 */
+            case 'D':
+                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, generation),
+                          fpin) != offsetof(PgStat_StatDBEntry, generation))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                /*
+                 * Add to the DB hash
+                 */
+                dbentry = (PgStat_StatDBEntry *)
+                    dshash_find_or_insert(pgStatDBHash, (void *) &dbbuf.databaseid,
+                                          &found);
+
+                /* don't allow duplicate dbentries */
+                if (found)
+                {
+                    dshash_release_lock(pgStatDBHash, dbentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                /* initialize the new shared entry */
+                init_dbentry(dbentry);
+
+                memcpy(dbentry, &dbbuf,
+                       offsetof(PgStat_StatDBEntry, generation));
+
+                /* Read the data from the database-specific file. */
+                pgstat_read_pgStatDBHashfile(dbentry);
+                dshash_release_lock(pgStatDBHash, dbentry);
+                break;
+
+            case 'E':
+                goto done;
+
+            default:
+                ereport(LOG,
+                        (errmsg("corrupted statistics file \"%s\"",
+                                statfile)));
+                goto done;
+        }
+    }
+
+done:
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+
+    return;
+}
+
+
+/* ----------
+ * pgstat_read_pgStatDBHashfile() -
+ *
+ *    Reads in the at-rest statistics file and creates shared statistics
+ *    tables. The file is removed after reading.
+ * ----------
+ */
+static void
+pgstat_read_pgStatDBHashfile(PgStat_StatDBEntry *dbentry)
+{
+    PgStat_StatTabEntry *tabentry;
+    PgStat_StatTabEntry tabbuf;
+    PgStat_StatFuncEntry funcbuf;
+    PgStat_StatFuncEntry *funcentry;
+    dshash_table         *tabhash = NULL;
+    dshash_table         *funchash = NULL;
+    FILE       *fpin;
+    int32        format_id;
+    bool        found;
+    char        statfile[MAXPGPATH];
+
+    get_dbstat_filename(false, dbentry->databaseid, statfile, MAXPGPATH);
+
+    /*
+     * Try to open the stats file. If it doesn't exist, the backends simply
+     * return zero for anything and the collector simply starts from scratch
+     * with empty counters.
+     *
+     * ENOENT is a possibility if the stats collector is not running or has
+     * not yet written the stats file the first time.  Any other failure
+     * condition is suspicious.
+     */
+    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(LOG,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            statfile)));
+        return;
+    }
+
+    /*
+     * Verify it's of the expected format.
+     */
+    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
+        format_id != PGSTAT_FILE_FORMAT_ID)
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        goto done;
+    }
+
+    /*
+     * We found an existing statistics file. Read it and put all the hashtable
+     * entries into place.
+     */
+    for (;;)
+    {
+        switch (fgetc(fpin))
+        {
+                /*
+                 * 'T'    A PgStat_StatTabEntry follows.
+                 */
+            case 'T':
+                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
+                          fpin) != sizeof(PgStat_StatTabEntry))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                if (tabhash == NULL)
+                {
+                    tabhash = dshash_create(area, &dsh_tblparams, 0);
+                    dbentry->tables =
+                        dshash_get_hash_table_handle(tabhash);
+                }
+
+                tabentry = (PgStat_StatTabEntry *)
+                    dshash_find_or_insert(tabhash,
+                                          (void *) &tabbuf.tableid, &found);
+
+                /* don't allow duplicate entries */
+                if (found)
+                {
+                    dshash_release_lock(tabhash, tabentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
+                dshash_release_lock(tabhash, tabentry);
+                break;
+
+                /*
+                 * 'F'    A PgStat_StatFuncEntry follows.
+                 */
+            case 'F':
+                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
+                          fpin) != sizeof(PgStat_StatFuncEntry))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                if (funchash == NULL)
+                {
+                    funchash = dshash_create(area, &dsh_tblparams, 0);
+                    dbentry->functions =
+                        dshash_get_hash_table_handle(funchash);
+                }
+
+                funcentry = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert(funchash,
+                                          (void *) &funcbuf.functionid, &found);
+
+                if (found)
+                {
+                    dshash_release_lock(funchash, funcentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
+                dshash_release_lock(funchash, funcentry);
+                break;
+
+                /*
+                 * 'E'    The EOF marker of a complete stats file.
+                 */
+            case 'E':
+                goto done;
+
+            default:
+                ereport(LOG,
+                        (errmsg("corrupted statistics file \"%s\"",
+                                statfile)));
+                goto done;
+        }
+    }
+
+done:
+    if (tabhash)
+        dshash_detach(tabhash);
+    if (funchash)
+        dshash_detach(funchash);
+
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+}
+
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ *    Create pgStatLocalContext, if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+    if (!pgStatLocalContext)
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+
+    if (!pgStatSnapshotContext)
+        pgStatSnapshotContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Database statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+}
+
+/* ----------
+ * pgstat_clear_snapshot() -
+ *
+ *    Discard any data collected in the current transaction.  Any subsequent
+ *    request will cause new snapshots to be read.
+ *
+ *    This is also invoked during transaction commit or abort to discard
+ *    the no-longer-wanted snapshot.
+ * ----------
+ */
+void
+pgstat_clear_snapshot(void)
+{
+    /* Release memory, if any was allocated */
+    if (pgStatLocalContext)
+    {
+        MemoryContextDelete(pgStatLocalContext);
+
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
+    }
+
+    if (pgStatSnapshotContext)
+        snapshot_cleard = true;
+}
+
+static bool
+pgstat_update_tabentry(dshash_table *tabhash, PgStat_TableStatus *stat,
+                       bool nowait)
+{
+    PgStat_StatTabEntry *tabentry;
+    bool    found;
+
+    if (tabhash == NULL)
+        return false;
+
+    tabentry = (PgStat_StatTabEntry *)
+        dshash_find_or_insert_extended(tabhash, (void *) &(stat->t_id),
+                                       &found, nowait);
+
+    /* failed to acquire lock */
+    if (tabentry == NULL)
+        return false;
+
     if (!found)
-        reset_dbentry_counters(result);
+    {
+        /*
+         * If it's a new table entry, initialize counters to the values we
+         * just got.
+         */
+        tabentry->numscans = stat->t_counts.t_numscans;
+        tabentry->tuples_returned = stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched = stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted = stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated = stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted = stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated = stat->t_counts.t_tuples_hot_updated;
+        tabentry->n_live_tuples = stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples = stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze = stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched = stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit = stat->t_counts.t_blocks_hit;
+
+        tabentry->vacuum_timestamp = 0;
+        tabentry->vacuum_count = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_count = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->analyze_count = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+        tabentry->autovac_analyze_count = 0;
+    }
+    else
+    {
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        tabentry->numscans += stat->t_counts.t_numscans;
+        tabentry->tuples_returned += stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched += stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted += stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated += stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted += stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated += stat->t_counts.t_tuples_hot_updated;
+        /* If table was truncated, first reset the live/dead counters */
+        if (stat->t_counts.t_truncated)
+        {
+            tabentry->n_live_tuples = 0;
+            tabentry->n_dead_tuples = 0;
+        }
+        tabentry->n_live_tuples += stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples += stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze += stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched += stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit += stat->t_counts.t_blocks_hit;
+    }
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
+
+    dshash_release_lock(tabhash, tabentry);
+
+    return true;
+}
+
+static void
+pgstat_update_dbentry(PgStat_StatDBEntry *dbentry, PgStat_TableStatus *stat)
+{
+    /*
+     * Add per-table stats to the per-database entry, too.
+     */
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
+    dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
+    dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
+    dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
+    dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
+    dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
+    dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
+    LWLockRelease(&dbentry->lock);
+}
+
+/*
+ * Look up the shared stats hash entry for the specified database. Returns
+ * NULL if PGSTAT_NOWAIT is set and the required lock cannot be acquired.
+ */
+static PgStat_StatDBEntry *
+pgstat_get_db_entry(Oid databaseid, int op, PgStat_TableLookupResult *status)
+{
+    PgStat_StatDBEntry *result;
+    bool        nowait = ((op & PGSTAT_NOWAIT) != 0);
+    bool        lock_acquired = true;
+    bool        found = true;
+
+    if (!IsUnderPostmaster || !pgStatDBHash)
+        return NULL;
+
+    /* Lookup or create the hash table entry for this database */
+    if (op & PGSTAT_EXCLUSIVE)
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_or_insert_extended(pgStatDBHash, &databaseid,
+                                           &found, nowait);
+        if (result == NULL)
+            lock_acquired = false;
+        else if (!found)
+        {
+            /*
+             * If not found, initialize the new one.  This creates empty hash
+             * tables, too.
+             */
+            init_dbentry(result);
+            reset_dbentry_counters(result);
+        }
+    }
+    else
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_extended(pgStatDBHash, &databaseid, true, nowait,
+                                 nowait ? &lock_acquired : NULL);
+        if (result == NULL)
+            found = false;
+    }
+
+    /* Set return status if requested */
+    if (status)
+    {
+        if (!lock_acquired)
+        {
+            Assert(nowait);
+            *status = LOCK_FAILED;
+        }
+        else if (!found)
+            *status = NOT_FOUND;
+        else
+            *status = FOUND;
+    }
 
     return result;
 }
 
-
 /*
  * Lookup the hash table entry for the specified table. If no hash
  * table entry exists, initialize it, if the create parameter is true.
  * Else, return NULL.
  */
 static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create)
 {
     PgStat_StatTabEntry *result;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
 
     /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
+    if (create)
+        result = (PgStat_StatTabEntry *)
+            dshash_find_or_insert(table, &tableoid, &found);
+    else
+        result = (PgStat_StatTabEntry *) dshash_find(table, &tableoid, false);
 
     if (!create && !found)
         return NULL;
@@ -4842,1702 +5911,6 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
     return result;
 }
 
-
-/* ----------
- * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
- * ----------
- */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
-{
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    FILE       *fpout;
-    int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-    int            rc;
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Set the timestamp of the stats file.
-     */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Write global stats struct
-     */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Write archiver stats struct
-     */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database table.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        /*
-         * Write out the table and function stats for this DB into the
-         * appropriate per-DB stat file, if required.
-         */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
-
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
-
-        /*
-         * Write out the DB entry. We don't write the tables or functions
-         * pointers, since they're of no use to any other process.
-         */
-        fputc('D', fpout);
-        rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
-}
-
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
-
-/* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
- * ----------
- */
-static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
-{
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpout;
-    int32        format_id;
-    Oid            dbid = dbentry->databaseid;
-    int            rc;
-    char        tmpfile[MAXPGPATH];
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database's access stats per table.
-     */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
-    {
-        fputc('T', fpout);
-        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * Walk through the database's function stats table.
-     */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_statsfiles() -
- *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
- *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
- * ----------
- */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
-
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-
-    /*
-     * Set the current timestamp (will be kept only in case we can't load an
-     * existing statsfile).
-     */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return dbhash;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
-        goto done;
-    }
-
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Add to the DB hash
-                 */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-
-    return dbhash;
-}
-
-
-/* ----------
- * pgstat_read_db_statsfile() -
- *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
- * ----------
- */
-static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
-{
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatTabEntry tabbuf;
-    PgStat_StatFuncEntry funcbuf;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'T'    A PgStat_StatTabEntry follows.
-                 */
-            case 'T':
-                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
-                          fpin) != sizeof(PgStat_StatTabEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
-                break;
-
-                /*
-                 * 'F'    A PgStat_StatFuncEntry follows.
-                 */
-            case 'F':
-                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
-                          fpin) != sizeof(PgStat_StatFuncEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
-                break;
-
-                /*
-                 * 'E'    The EOF marker of a complete stats file.
-                 */
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
-}
-
-
-/* ----------
- * pgstat_setup_memcxt() -
- *
- *    Create pgStatLocalContext, if not already done.
- * ----------
- */
-static void
-pgstat_setup_memcxt(void)
-{
-    if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
-}
-
-
-/* ----------
- * pgstat_clear_snapshot() -
- *
- *    Discard any data collected in the current transaction.  Any subsequent
- *    request will cause new snapshots to be read.
- *
- *    This is also invoked during transaction commit or abort to discard
- *    the no-longer-wanted snapshot.
- * ----------
- */
-void
-pgstat_clear_snapshot(void)
-{
-    /* Release memory, if any was allocated */
-    if (pgStatLocalContext)
-        MemoryContextDelete(pgStatLocalContext);
-
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetshared() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
-}
-
 /*
  * Convert a potentially unsafely truncated activity string (see
  * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index ec9a7ca311..02ce657daf 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -255,7 +255,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -503,7 +502,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1303,12 +1301,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1757,11 +1749,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2647,8 +2634,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -2980,8 +2965,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3048,13 +3031,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3129,22 +3105,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3603,22 +3563,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3814,8 +3758,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3850,8 +3792,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4052,8 +3993,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5026,18 +4965,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5150,12 +5077,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6044,7 +5965,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6100,8 +6020,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6336,7 +6254,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index d7d733530f..fdc0959624 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -147,6 +147,7 @@ CreateSharedMemoryAndSemaphores(int port)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -275,8 +276,13 @@ CreateSharedMemoryAndSemaphores(int port)
 
     /* Initialize dynamic shared memory facilities. */
     if (!IsUnderPostmaster)
+    {
         dsm_postmaster_startup(shim);
 
+        /* Stats collector uses dynamic shared memory */
+        StatsShmemInit();
+    }
+
     /*
      * Now give loadable modules a chance to set up their shmem allocations
      */
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index bc1aa88322..b9c33d6044 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -522,6 +522,7 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
     LWLockRegisterTranche(LWTRANCHE_SXACT, "serializable_xact");
+    LWLockRegisterTranche(LWTRANCHE_STATS, "activity stats");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 44a59e1d4f..2f9dd19ab6 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3152,6 +3152,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3726,6 +3732,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4166,9 +4173,17 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
-                ProcessCompletedNotifies();
-                pgstat_report_stat(false);
+                long stats_timeout;
 
+                ProcessCompletedNotifies();
+
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4203,7 +4218,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4211,6 +4226,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index a5950c1e8c..b24631b7b1 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index e9f72b5069..731ef0e27c 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -74,6 +74,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -633,6 +634,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1243,6 +1246,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 33869fecc9..4f656c98a3 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -6,7 +6,7 @@ use File::Basename qw(basename dirname);
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 106;
+use Test::More tests => 105;
 
 program_help_ok('pg_basebackup');
 program_version_ok('pg_basebackup');
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index c33abee7aa..6aa9e3c121 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f3d4cb5637..2196b0bd38 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL statistics collector facility.
  *
  *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
  *
@@ -14,10 +14,11 @@
 #include "datatype/timestamp.h"
 #include "fmgr.h"
 #include "libpq/pqcomm.h"
-#include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -41,33 +42,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -78,9 +52,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -116,13 +89,6 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_blocks_hit;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -181,236 +147,12 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
 /* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
- */
-typedef struct PgStat_MsgHdr
-{
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
+ * PgStat_MsgBgWriter            bgwriter statistics
  * ----------
  */
 typedef struct PgStat_MsgBgWriter
 {
-    PgStat_MsgHdr m_hdr;
-
     PgStat_Counter m_timed_checkpoints;
     PgStat_Counter m_requested_checkpoints;
     PgStat_Counter m_buf_written_checkpoints;
@@ -423,38 +165,14 @@ typedef struct PgStat_MsgBgWriter
     PgStat_Counter m_checkpoint_sync_time;
 } PgStat_MsgBgWriter;
 
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
-
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -486,96 +204,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz    m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Statistic collector data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -615,16 +245,29 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_block_write_time;
 
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;        /* time of db stats update */
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following members must be last in the struct, because we don't
+     * write them out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    int generation;                        /* current generation of the below */
+    int    refcnt;                            /* current gen reference count */
+    dshash_table_handle tables;            /* current gen tables hash */
+    dshash_table_handle functions;        /* current gen functions hash */
+    int    prev_refcnt;                    /* prev gen reference count */
+    dshash_table_handle prev_tables;    /* prev gen tables hash */
+    dshash_table_handle prev_functions;    /* prev gen functions hash */
+    LWLock    lock;                        /* Lock for the above members */
+
+    /* non-shared members */
+    HTAB *snapshot_tables;                 /* table entry snapshot */
+    HTAB *snapshot_functions;             /* function entry snapshot */
+    dshash_table    *dshash_tables;         /* attached tables dshash */
+    dshash_table    *dshash_functions;     /* attached functions dshash */
 } PgStat_StatDBEntry;
 
+#define SHARED_DBENT_SIZE offsetof(PgStat_StatDBEntry, snapshot_tables)
 
 /* ----------
  * PgStat_StatTabEntry            The collector's data per table (or index)
@@ -663,7 +306,7 @@ typedef struct PgStat_StatTabEntry
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            per function stats data
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
@@ -678,7 +321,7 @@ typedef struct PgStat_StatFuncEntry
 
 
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -694,7 +337,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -780,7 +423,6 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
     WAIT_EVENT_RECOVERY_WAL_ALL,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
@@ -1215,6 +857,8 @@ extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used, but will be removed with GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
@@ -1236,29 +880,26 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
 
+/* File input/output functions  */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 
 extern void pgstat_report_autovac(Oid dboid);
@@ -1429,11 +1070,13 @@ extern void pgstat_send_bgwriter(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_extended(PgStat_StatDBEntry *dbent, Oid relid);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 08e0dc8144..30d5fb63c5 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -220,6 +220,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_SXACT,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 9244a2a7b7..a9b625211b 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.16.3
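
An editorial sketch, not part of the patch: the PgStat_StatDBEntry change above places the backend-local members after the shared ones and defines SHARED_DBENT_SIZE with offsetof(), which suggests that only the leading part of the struct is meant to be shared or written out. The minimal C example below illustrates that layout trick under that assumption. DemoDBEntry, DEMO_SHARED_SIZE, demo_copy_shared and demo_usage are hypothetical names, and the memcpy-based use is an assumption rather than what pgstat.c necessarily does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for PgStat_StatDBEntry: members meant to be shared
 * come first, backend-local pointers come last.
 */
typedef struct DemoDBEntry
{
    unsigned int databaseid;        /* shared */
    long         n_xact_commit;     /* shared */
    int          generation;        /* shared */

    /* non-shared members (assumed to be valid only in one process) */
    void        *snapshot_tables;
    void        *dshash_tables;
} DemoDBEntry;

/* Size of the shared prefix, in the style of SHARED_DBENT_SIZE. */
#define DEMO_SHARED_SIZE offsetof(DemoDBEntry, snapshot_tables)

/* Copy only the shared prefix; dst's local pointers are left untouched. */
void
demo_copy_shared(DemoDBEntry *dst, const DemoDBEntry *src)
{
    memcpy(dst, src, DEMO_SHARED_SIZE);
}

/* Usage example: returns 1 when the copy behaved as described. */
int
demo_usage(void)
{
    int         local_marker = 0;
    DemoDBEntry shared = {16384, 42, 3, NULL, NULL};
    DemoDBEntry local = {0, 0, 0, &local_marker, NULL};

    demo_copy_shared(&local, &shared);

    return local.databaseid == 16384 &&
           local.n_xact_commit == 42 &&
           local.generation == 3 &&
           local.snapshot_tables == &local_marker;
}
```

Keeping the local members at the tail means a single offsetof() describes the shared portion, at the cost of requiring that any new shared member be added before the first local one.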

From 66ce074a20a0feae71333555c037d553e6e57929 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 27 Nov 2018 14:42:12 +0900
Subject: [PATCH 5/5] Remove the GUC stats_temp_directory

The GUC used to specify the directory to store temporary statistics
files. It is no longer needed by the stats collector but is still used by
the programs in bin and contrib, and maybe by other extensions. Thus this
patch removes the GUC, but some backing variables and macro definitions
are left alone for backward compatibility.
---
 doc/src/sgml/backup.sgml                      |  2 --
 doc/src/sgml/config.sgml                      | 19 -------------
 doc/src/sgml/monitoring.sgml                  |  7 +----
 doc/src/sgml/storage.sgml                     |  3 +-
 src/backend/postmaster/pgstat.c               | 13 ++++-----
 src/backend/replication/basebackup.c          | 13 ++-------
 src/backend/utils/misc/guc.c                  | 41 ---------------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/pgstat.h                          |  5 +++-
 9 files changed, 14 insertions(+), 90 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index b67da8916a..f0daffbc93 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1145,8 +1145,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 84341a30e5..6ad619de47 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6827,25 +6827,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 0fe888e4db..31fc3604ea 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -195,12 +195,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
+   <productname>PostgreSQL</productname> processes through shared memory.
    When the server shuts down cleanly, a permanent copy of the statistics
    data is stored in the <filename>pg_stat</filename> subdirectory, so that
    statistics can be retained across server restarts.  When recovery is
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index c4bac87e80..c264f839e7 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -122,8 +122,7 @@ Item
 
 <row>
  <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
+ <entry>Subdirectory containing ephemeral files for extensions</entry>
 </row>
 
 <row>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 072f3ec62e..cec1ece065 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -109,15 +109,12 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
+/*
+ * This used to be a GUC variable and is no longer used in this file, but left
+ * alone just for backward compatibility for extensions, having the default
+ * value.
  */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+char       *pgstat_stat_directory = PG_STAT_TMP_DIR;
 
 #define        StatsLock (&StatsShmem->StatsMainLock)
 
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 36dcb28754..5b6de31ff6 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -230,11 +230,8 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -265,13 +262,9 @@ perform_base_backup(basebackup_options *opt)
          * Calculate the relative path of temporary statistics directory in
          * order to skip the files which are located in that directory later.
          */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
+
+        Assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
+        statrelpath = psprintf("./%s", PG_STAT_TMP_DIR);
 
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ed51da4234..5c7e83ffdb 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -194,7 +194,6 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static void assign_effective_io_concurrency(int newval, void *extra);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -4085,17 +4084,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11363,35 +11351,6 @@ assign_effective_io_concurrency(int newval, void *extra)
 #endif                            /* USE_PREFETCH */
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5ee5e09ddf..0ba984b074 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -562,7 +562,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 2196b0bd38..4e0a05b2ce 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -31,7 +31,10 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
+/*
+ * This used to be the directory to store temporary statistics data in but is
+ * no longer used. Defined here for backward compatibility.
+ */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
 
 /* Values for track_functions GUC variable --- order is significant! */
-- 
2.16.3

#! /usr/bin/bash

DURATION=300
DURATION2=1500
ROOT=/home/horiguti
BIN1PATH=$ROOT/bin/pgsql_master_o2/bin
BIN2PATH=$ROOT/bin/pgsql_shmemstat/bin
DATAPATH=$ROOT/data/data_shmemstat

run() {
    local BINARY=$1

    echo "## $BINARY-A1"
    pgbench -j 1 -c 1 -T $DURATION -b select-only postgres
    echo "## $BINARY-B1"
    pgbench -j 100 -c 100 -T $DURATION -b select-only postgres
    echo "## $BINARY-C1"
    pgbench -j 1 -c 1 -T $DURATION -f tr.sql postgres
    echo "## $BINARY-D1"
    pgbench -j 100 -c 100 -T $DURATION -f tr.sql postgres

    echo "## $BINARY-A2"
    ((pgbench -j 1 -c 1 -T $DURATION -f ref.sql postgres | sed -e 's/^/1:/') & (pgbench -j 1 -c 1 -T $DURATION -b select-only postgres | sed -e 's/^/2:/')); sleep 1

    echo "## $BINARY-B2"
    ((pgbench -j 1 -c 1 -T $DURATION -f ref.sql postgres | sed -e 's/^/1:/') & (pgbench -j 97 -c 97 -T $DURATION -b select-only postgres | sed -e 's/^/2:/')); sleep 1

    echo "## $BINARY-C2"
    ((pgbench -j 1 -c 1 -T $DURATION -f ref.sql postgres | sed -e 's/^/1:/') & (pgbench -j 1 -c 1 -T $DURATION -f tr.sql postgres | sed -e 's/^/2:/')); sleep 1

    echo "## $BINARY-D2"
    ((pgbench -j 1 -c 1 -T $DURATION -f ref.sql postgres | sed -e 's/^/1:/') & (pgbench -j 97 -c 97 -T $DURATION -f tr.sql postgres | sed -e 's/^/2:/')); sleep 1
 
}


$BIN1PATH/pg_ctl -D $DATAPATH start
run "o2"
$BIN1PATH/pg_ctl -D $DATAPATH stop -m s

$BIN2PATH/pg_ctl -D $DATAPATH start
run "shmem"
$BIN2PATH/pg_ctl -D $DATAPATH stop -m s

#! /usr/bin/perl

print "drop table if exists p;\n";
print "create table p (a int) partition by list (a);\n";

for ($i = 0 ; $i < 1000 ; $i++)
{
    printf("create table c%03d partition of p for values in (%d);\n", $i, $i);
    printf("insert into p values(%d);\n", $i);
}


select sum(a) from p;
select sum(seq_scan) from pg_stat_user_tables;

Re: shared-memory based stats collector

From
Kyotaro HORIGUCHI
Date:
Hi.

At Fri, 17 May 2019 14:27:22 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20190517.142722.139901807.horiguchi.kyotaro@lab.ntt.co.jp>
> The attached files are:

It's broken perhaps by recent core change?

I'll fix it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: shared-memory based stats collector

From
Kyotaro HORIGUCHI
Date:
me> It's broken perhaps by recent core change?
me> 
me> I'll fix it.

It was not caused by a core change, but by my silly mistake in memory context usage.

Fixed. Version 20.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
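
An editorial sketch, not part of the attached patches: the first patch below adds a sequential scan to dshash which, in its non-consistent mode, moves a partition lock along with the bucket being scanned, taking the lock on the next partition before releasing the current one. The self-contained C example below illustrates only that ordering. It uses simulated locks (plain flags) instead of LWLocks, and every DEMO_/demo_ name is hypothetical; the partition arithmetic mirrors the PARTITION_FOR_BUCKET_INDEX and NUM_BUCKETS macros from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NUM_PARTITIONS_LOG2 2
#define DEMO_NUM_PARTITIONS (1 << DEMO_NUM_PARTITIONS_LOG2)
#define DEMO_SIZE_LOG2 3        /* log2(number of buckets) */
#define DEMO_NUM_SPLITS (DEMO_SIZE_LOG2 - DEMO_NUM_PARTITIONS_LOG2)
#define DEMO_NUM_BUCKETS (((size_t) 1) << DEMO_SIZE_LOG2)
#define DEMO_PARTITION_FOR_BUCKET(idx) ((idx) >> DEMO_NUM_SPLITS)

typedef struct DemoTable
{
    int held[DEMO_NUM_PARTITIONS];  /* 1 while the simulated lock is held */
    int nheld;                      /* number of simulated locks held */
    int buckets[DEMO_NUM_BUCKETS];  /* one value per bucket, for brevity */
} DemoTable;

static void
demo_acquire(DemoTable *t, int partition)
{
    assert(!t->held[partition]);
    t->held[partition] = 1;
    t->nheld++;
}

static void
demo_release(DemoTable *t, int partition)
{
    assert(t->held[partition]);
    t->held[partition] = 0;
    t->nheld--;
}

/* Visit every bucket in order, moving the partition lock hand over hand. */
long
demo_scan_sum(DemoTable *t)
{
    size_t bucket;
    int    curpartition = 0;
    long   sum = 0;

    demo_acquire(t, curpartition);
    for (bucket = 0; bucket < DEMO_NUM_BUCKETS; bucket++)
    {
        int next = (int) DEMO_PARTITION_FOR_BUCKET(bucket);

        if (next != curpartition)
        {
            /* Take the next lock first, then release the current one. */
            demo_acquire(t, next);
            demo_release(t, curpartition);
            curpartition = next;
        }
        assert(t->nheld >= 1);  /* never unprotected during the scan */
        sum += t->buckets[bucket];
    }
    demo_release(t, curpartition);
    return sum;
}

/* Usage example: fill buckets with 0..7 and scan them. */
long
demo_run(void)
{
    DemoTable t = {{0}, 0, {0}};
    size_t    i;
    long      sum;

    for (i = 0; i < DEMO_NUM_BUCKETS; i++)
        t.buckets[i] = (int) i;
    sum = demo_scan_sum(&t);
    assert(t.nheld == 0);       /* all locks released at end of scan */
    return sum;
}
```

Because at least one partition lock is held at every step, an operation that needs all partition locks (such as a resize, per the patch comment) cannot start in the middle of the scan, and acquiring locks in ascending partition order avoids deadlocks between concurrent scanners.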
From 1a97564fb9e392d0dfb067cb4056041d013f4f0e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:41:04 +0900
Subject: [PATCH 1/5] sequential scan for dshash

Add sequential scan feature to dshash.
---
 src/backend/lib/dshash.c | 188 ++++++++++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  23 +++++-
 2 files changed, 206 insertions(+), 5 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index f095196fb6..1e8c22f94f 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -112,6 +112,7 @@ struct dshash_table
     size_t        size_log2;        /* log2(number of buckets) */
     bool        find_locked;    /* Is any partition lock held by 'find'? */
     bool        find_exclusively_locked;    /* ... exclusively? */
+    bool        seqscan_running;/* now under sequential scan */
 };
 
 /* Given a pointer to an item, find the entry (user data) it holds. */
@@ -127,6 +128,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +158,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -228,6 +237,7 @@ dshash_create(dsa_area *area, const dshash_parameters *params, void *arg)
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
 
     /*
      * Set up the initial array of buckets.  Our initial size is the same as
@@ -279,6 +289,7 @@ dshash_attach(dsa_area *area, const dshash_parameters *params,
     hash_table->control = dsa_get_address(area, control);
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
     Assert(hash_table->control->magic == DSHASH_MAGIC);
 
     /*
@@ -324,7 +335,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
                                 LW_EXCLUSIVE));
 
     delete_item(hash_table, item);
-    hash_table->find_locked = false;
-    hash_table->find_exclusively_locked = false;
-    LWLockRelease(PARTITION_LOCK(hash_table, partition));
+
+    /* We need to keep the partition lock during a sequential scan */
+    if (!hash_table->seqscan_running)
+    {
+        hash_table->find_locked = false;
+        hash_table->find_exclusively_locked = false;
+        LWLockRelease(PARTITION_LOCK(hash_table, partition));
+    }
 }
 
 /*
@@ -568,6 +584,8 @@ dshash_release_lock(dshash_table *hash_table, void *entry)
     Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition_index),
                                 hash_table->find_exclusively_locked
                                 ? LW_EXCLUSIVE : LW_SHARED));
+    /* lock is under control of sequential scan */
+    Assert(!hash_table->seqscan_running);
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
@@ -592,6 +610,168 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should be called if and only if the scan is abandoned
+ * before completion; if dshash_seq_next returns NULL then it has already done
+ * the end-of-scan cleanup.
+ *
+ * On returning element, it is locked as is the case with dshash_find.
+ * However, the caller must not release the lock. The lock is released as
+ * necessary in continued scan.
+ *
+ * As opposed to the equivalent for dynahash, the caller is not supposed to
+ * delete the returned element before continuing the scan.
+ *
+ * If consistent is set for dshash_seq_init, the whole hash table is
+ * non-exclusively locked. Otherwise a part of the hash table is locked in the
+ * same mode (partition lock).
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool consistent, bool exclusive)
+{
+    /* allowed at most one scan at once */
+    Assert(!hash_table->seqscan_running);
+
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->consistent = consistent;
+    status->exclusive = exclusive;
+    hash_table->seqscan_running = true;
+
+    /*
+     * Protect all partitions from modification if the caller wants a
+     * consistent result.
+     */
+    if (consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+        {
+            Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i)));
+
+            LWLockAcquire(PARTITION_LOCK(hash_table, i),
+                          exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        }
+        ensure_valid_bucket_pointers(hash_table);
+    }
+}
+
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    Assert(status->hash_table->seqscan_running);
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert (status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        if (!status->consistent)
+        {
+            partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+            LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            status->curpartition = partition;
+        }
+
+        /* resize doesn't happen from now until seq scan ends */
+        status->nbuckets =
+            NUM_BUCKETS(status->hash_table->control->size_log2);
+        ensure_valid_bucket_pointers(status->hash_table);
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            dshash_seq_term(status);
+            return NULL;
+        }
+
+        /* Also move partition lock if needed */
+        if (!status->consistent)
+        {
+            int next_partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+
+            /* Move lock along with partition for the bucket */
+            if (status->curpartition != next_partition)
+            {
+                /*
+                 * Take lock on the next partition then release the current,
+                 * not in the reverse order. This is required to prevent
+                 * resizing from happening during a sequential scan. Locks are
+                 * taken in partition order so no deadlock happens with other
+                 * seq scans or resizing.
+                 */
+                LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                             next_partition),
+                              status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+                LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                             status->curpartition));
+                status->curpartition = next_partition;
+            }
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * This item may be deleted by the caller. Remember the next item now so
+     * that the next iteration does not depend on it.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    Assert(status->hash_table->seqscan_running);
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+    status->hash_table->seqscan_running = false;
+
+    if (status->consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+            LWLockRelease(PARTITION_LOCK(status->hash_table, i));
+    }
+    else if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index e5dfd57f0a..b80f3af995 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,23 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state of dshash. The details are exposed because callers
+ * need to know the storage size, but the struct should be treated as an
+ * opaque type.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;
+    int                    curbucket;
+    int                    nbuckets;
+    dshash_table_item  *curitem;
+    dsa_pointer            pnextitem;
+    int                    curpartition;
+    bool                consistent;
+    bool                exclusive;
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
               const dshash_parameters *params,
@@ -70,7 +87,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
 extern void dshash_detach(dshash_table *hash_table);
 extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
 extern void dshash_destroy(dshash_table *hash_table);
-
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
             const void *key, bool exclusive);
@@ -80,6 +96,11 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool consistent, bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.16.3
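
For readers who want to see the lock hand-over used by the non-consistent
scan in isolation, here is a minimal single-process sketch. It is not the
dshash code: every toy_* name, the bucket-to-partition mapping and the
boolean "locks" are assumptions made only for illustration. It follows the
same idea as dshash_seq_next above, taking the lock on the next partition
before releasing the current one, and records how many locks were held at
once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NUM_PARTITIONS 4
#define TOY_BUCKETS_PER_PARTITION 2
#define TOY_NUM_BUCKETS (TOY_NUM_PARTITIONS * TOY_BUCKETS_PER_PARTITION)

typedef struct toy_item
{
    int         value;
    struct toy_item *next;
} toy_item;

typedef struct toy_table
{
    toy_item   *buckets[TOY_NUM_BUCKETS];
    bool        locked[TOY_NUM_PARTITIONS]; /* stand-in for partition locks */
    int         nlocked;        /* locks currently held */
    int         max_locked;     /* high-water mark of nlocked */
} toy_table;

typedef struct toy_seq_status
{
    toy_table  *table;
    int         curbucket;
    int         curpartition;   /* -1 while no partition lock is held */
    toy_item   *nextitem;
} toy_seq_status;

static void
toy_lock(toy_table *t, int partition)
{
    assert(!t->locked[partition]);
    t->locked[partition] = true;
    if (++t->nlocked > t->max_locked)
        t->max_locked = t->nlocked;
}

static void
toy_unlock(toy_table *t, int partition)
{
    assert(t->locked[partition]);
    t->locked[partition] = false;
    t->nlocked--;
}

/* Link a caller-provided item at the head of the given bucket. */
void
toy_table_push(toy_table *t, int bucket, toy_item *item, int value)
{
    item->value = value;
    item->next = t->buckets[bucket];
    t->buckets[bucket] = item;
}

void
toy_seq_init(toy_seq_status *st, toy_table *t)
{
    st->table = t;
    st->curbucket = 0;
    st->curpartition = -1;
    st->nextitem = NULL;
}

/* Release whatever partition lock the scan still holds. */
void
toy_seq_term(toy_seq_status *st)
{
    if (st->curpartition >= 0)
        toy_unlock(st->table, st->curpartition);
    st->curpartition = -1;
}

toy_item *
toy_seq_next(toy_seq_status *st)
{
    toy_item   *item;

    if (st->curpartition < 0)
    {
        /* first call: lock the partition of bucket 0 */
        st->curpartition = 0;
        toy_lock(st->table, 0);
        item = st->table->buckets[0];
    }
    else
        item = st->nextitem;

    while (item == NULL)
    {
        int         next_partition;

        if (++st->curbucket >= TOY_NUM_BUCKETS)
        {
            toy_seq_term(st);
            return NULL;
        }
        next_partition = st->curbucket / TOY_BUCKETS_PER_PARTITION;
        if (next_partition != st->curpartition)
        {
            /* take the next lock first, then drop the current one */
            toy_lock(st->table, next_partition);
            toy_unlock(st->table, st->curpartition);
            st->curpartition = next_partition;
        }
        item = st->table->buckets[st->curbucket];
    }
    st->nextitem = item->next;
    return item;
}

/* Scan the whole table and add up the values. */
int
toy_scan_sum(toy_table *t)
{
    toy_seq_status st;
    toy_item   *item;
    int         sum = 0;

    toy_seq_init(&st, t);
    while ((item = toy_seq_next(&st)) != NULL)
        sum += item->value;
    return sum;
}
```

In this model the scan never holds more than two locks, which matches the
"up to two partitions may be locked at the same time" wording used in the
follow-up patch, and it holds none once toy_seq_next has returned NULL.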

From 9a43133d64d763e860c027c32758074db6b7e3ab Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 27 Sep 2018 11:15:19 +0900
Subject: [PATCH 2/5] Add conditional lock feature to dshash

Dshash currently waits for locks unconditionally. This commit adds new
interfaces for dshash_find and dshash_find_or_insert. The new
interfaces have an extra parameter "nowait" that tells them not to wait
for the lock.
---
 src/backend/lib/dshash.c | 69 +++++++++++++++++++++++++++++++++++++++++++-----
 src/include/lib/dshash.h |  4 +++
 2 files changed, 66 insertions(+), 7 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 1e8c22f94f..303210e326 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -394,19 +394,48 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  */
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
+{
+    return dshash_find_extended(hash_table, key, exclusive, false, NULL);
+}
+
+/*
+ * Like dshash_find, but returns NULL immediately if nowait is true and the
+ * lock was not acquired. Lock status is stored in *lock_acquired if given.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool *lock_acquired)
 {
     dshash_hash hash;
     size_t        partition;
     dshash_table_item *item;
 
+    /* lock_acquired is meaningful only when nowait is requested */
+    Assert(nowait || !lock_acquired);
+
     hash = hash_key(hash_table, key);
     partition = PARTITION_FOR_HASH(hash);
 
     Assert(hash_table->control->magic == DSHASH_MAGIC);
     Assert(!hash_table->find_locked);
 
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partition),
+                                      exclusive ? LW_EXCLUSIVE : LW_SHARED))
+        {
+            if (lock_acquired)
+                *lock_acquired = false;
+            return NULL;
+        }
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition),
+                      exclusive ? LW_EXCLUSIVE : LW_SHARED);
+
+    if (lock_acquired)
+        *lock_acquired = true;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -441,6 +470,22 @@ void *
 dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
+{
+    return dshash_find_or_insert_extended(hash_table, key, found, false);
+}
+
+/*
+ * Like dshash_find_or_insert, but returns NULL if nowait is true and the lock
+ * was not acquired.
+ *
+ * Notes above dshash_find_extended() regarding locking and error handling
+ * equally apply here.
+ */
+void *
+dshash_find_or_insert_extended(dshash_table *hash_table,
+                               const void *key,
+                               bool *found,
+                               bool nowait)
 {
     dshash_hash hash;
     size_t        partition_index;
@@ -455,8 +500,16 @@ dshash_find_or_insert(dshash_table *hash_table,
     Assert(!hash_table->find_locked);
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(
+                PARTITION_LOCK(hash_table, partition_index),
+                LW_EXCLUSIVE))
+            return NULL;
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
+                      LW_EXCLUSIVE);
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -626,9 +679,11 @@ dshash_memhash(const void *v, size_t size, void *arg)
  * As opposed to the equivalent for dynahash, the caller is not supposed to
  * delete the returned element before continuing the scan.
  *
- * If consistent is set for dshash_seq_init, the whole hash table is
- * non-exclusively locked. Otherwise a part of the hash table is locked in the
- * same mode (partition lock).
+ * If consistent is set for dshash_seq_init, all hash table
+ * partitions are locked in the requested mode (as determined by the
+ * exclusive flag), and the locks are held until the end of the scan.
+ * Otherwise the partition locks are acquired and released as needed
+ * during the scan (up to two partitions may be locked at the same time).
  */
 void
 dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b80f3af995..21587c07ce 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -90,8 +90,12 @@ extern void dshash_destroy(dshash_table *hash_table);
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
             const void *key, bool exclusive);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+            bool exclusive, bool nowait, bool *lock_acquired);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                       const void *key, bool *found);
+extern void *dshash_find_or_insert_extended(dshash_table *hash_table,
+            const void *key, bool *found, bool nowait);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.16.3
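
The "nowait" contract of dshash_find_extended can be tried out without a
PostgreSQL build using the sketch below. It is only a model: the toy_*
names, the fixed-size slot array and the use of a C11 atomic_flag in place
of an LWLock are all assumptions, not part of the patch. What it keeps from
the patch is the shape of the interface: with nowait the function returns
NULL without blocking when the lock is busy, and *lock_acquired tells the
caller whether NULL means "lock not taken" or "key not found".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SLOTS 4

typedef struct toy_entry
{
    bool        used;
    int         key;
    int         value;
} toy_entry;

typedef struct toy_partition
{
    atomic_flag lock;           /* stand-in for the partition LWLock */
    toy_entry   slots[TOY_SLOTS];
} toy_partition;

void
toy_partition_init(toy_partition *p)
{
    int         i;

    atomic_flag_clear(&p->lock);
    for (i = 0; i < TOY_SLOTS; i++)
        p->slots[i].used = false;
}

/* Conditional acquire: returns false instead of waiting. */
static bool
toy_try_lock(toy_partition *p)
{
    return !atomic_flag_test_and_set(&p->lock);
}

/* Unconditional acquire: spins until the flag is free. */
static void
toy_lock(toy_partition *p)
{
    while (atomic_flag_test_and_set(&p->lock))
        ;
}

/* Counterpart of dshash_release_lock in this model. */
void
toy_release(toy_partition *p)
{
    atomic_flag_clear(&p->lock);
}

/*
 * Look up key.  On success the entry is returned with the lock still held;
 * the caller must call toy_release().  If the key is absent the lock is
 * released before returning NULL.
 */
toy_entry *
toy_find_extended(toy_partition *p, int key, bool nowait, bool *lock_acquired)
{
    int         i;

    /* reporting lock status only makes sense together with nowait */
    assert(nowait || lock_acquired == NULL);

    if (nowait)
    {
        if (!toy_try_lock(p))
        {
            if (lock_acquired)
                *lock_acquired = false;
            return NULL;
        }
    }
    else
        toy_lock(p);

    if (lock_acquired)
        *lock_acquired = true;

    for (i = 0; i < TOY_SLOTS; i++)
    {
        if (p->slots[i].used && p->slots[i].key == key)
            return &p->slots[i];
    }

    toy_release(p);
    return NULL;
}
```

A caller that gets NULL with *lock_acquired set to false can keep its
pending data and retry later instead of stalling, which appears to be what
the stats-flushing code in the later patch of this series relies on.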

From 75c00b47b152c5aac3921f4e0d80707bf7a236ee Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 7 Nov 2018 16:53:49 +0900
Subject: [PATCH 3/5] Make archiver process an auxiliary process

This is a preliminary patch for the shared-memory based stats collector.
The archiver process must be an auxiliary process since it uses shared
memory after stats data was moved into shared memory. Make the process
an auxiliary process in order to make it work.
---
 src/backend/bootstrap/bootstrap.c   |  8 +++
 src/backend/postmaster/pgarch.c     | 98 +++++++++----------------------------
 src/backend/postmaster/pgstat.c     |  6 +++
 src/backend/postmaster/postmaster.c | 35 +++++++++----
 src/include/miscadmin.h             |  2 +
 src/include/pgstat.h                |  1 +
 src/include/postmaster/pgarch.h     |  4 +-
 7 files changed, 67 insertions(+), 87 deletions(-)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 43627ab8f4..7872a2d9d7 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -329,6 +329,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case BgWriterProcess:
                 statmsg = pgstat_get_backend_desc(B_BG_WRITER);
                 break;
+            case ArchiverProcess:
+                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
+                break;
             case CheckpointerProcess:
                 statmsg = pgstat_get_backend_desc(B_CHECKPOINTER);
                 break;
@@ -456,6 +459,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
             BackgroundWriterMain();
             proc_exit(1);        /* should never return */
 
+        case ArchiverProcess:
+            /* don't set signals, archiver has its own agenda */
+            PgArchiverMain();
+            proc_exit(1);        /* should never return */
+
         case CheckpointerProcess:
             /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index f84f882c4c..4342ebdab4 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -77,7 +77,6 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
@@ -96,7 +95,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgarch_exit(SIGNAL_ARGS);
 static void ArchSigHupHandler(SIGNAL_ARGS);
 static void ArchSigTermHandler(SIGNAL_ARGS);
@@ -114,75 +112,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -222,8 +151,8 @@ pgarch_forkexec(void)
  *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
  *    since we don't use 'em, it hardly matters...
  */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -255,8 +184,27 @@ PgArchiverMain(int argc, char *argv[])
 static void
 pgarch_exit(SIGNAL_ARGS)
 {
-    /* SIGQUIT means curl up and die ... */
-    exit(1);
+    PG_SETMASK(&BlockSig);
+
+    /*
+     * We DO NOT want to run proc_exit() callbacks -- we're here because
+     * shared memory may be corrupted, so we don't want to try to clean up our
+     * transaction.  Just nail the windows shut and get out of town.  Now that
+     * there's an atexit callback to prevent third-party code from breaking
+     * things by calling exit() directly, we have to reset the callbacks
+     * explicitly to make this work as intended.
+     */
+    on_exit_reset();
+
+    /*
+     * Note we do exit(2) not exit(0).  This is to force the postmaster into a
+     * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+     * backend.  This is necessary precisely because we don't clean up our
+     * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+     * should ensure the postmaster sees this as a crash, too, but no harm in
+     * being doubly sure.)
+     */
+    exit(2);
 }
 
 /* SIGHUP signal handler for archiver process */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 285def556b..ccb1d0b62e 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2934,6 +2934,9 @@ pgstat_bestart(void)
             case StartupProcess:
                 lbeentry.st_backendType = B_STARTUP;
                 break;
+            case ArchiverProcess:
+                lbeentry.st_backendType = B_ARCHIVER;
+                break;
             case BgWriterProcess:
                 lbeentry.st_backendType = B_BG_WRITER;
                 break;
@@ -4277,6 +4280,9 @@ pgstat_get_backend_desc(BackendType backendType)
 
     switch (backendType)
     {
+        case B_ARCHIVER:
+            backendDesc = "archiver";
+            break;
         case B_AUTOVAC_LAUNCHER:
             backendDesc = "autovacuum launcher";
             break;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 34315b8d1a..ec9a7ca311 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -146,7 +146,8 @@
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
 #define BACKEND_TYPE_BGWORKER    0x0008    /* bgworker process */
-#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
+#define BACKEND_TYPE_ARCHIVER    0x0010    /* archiver process */
+#define BACKEND_TYPE_ALL        0x001F    /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -539,6 +540,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1762,7 +1764,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -2977,7 +2979,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3122,10 +3124,8 @@ reaper(SIGNAL_ARGS)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3371,7 +3371,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3576,6 +3576,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3848,6 +3860,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5117,7 +5130,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5400,6 +5413,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index b677c7e821..c33abee7aa 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -399,6 +399,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -411,6 +412,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 929987190c..f3d4cb5637 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -719,6 +719,7 @@ typedef struct PgStat_GlobalStats
  */
 typedef enum BackendType
 {
+    B_ARCHIVER,
     B_AUTOVAC_LAUNCHER,
     B_AUTOVAC_WORKER,
     B_BACKEND,
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index 2474eac26a..88f16863d4 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
-- 
2.16.3

From d0d1f4fbdcd43a32fddd810d63a3798304ec3575 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 21 Feb 2019 12:44:56 +0900
Subject: [PATCH 4/5] Shared-memory based stats collector

Previously activity statistics were shared via files on disk. Every
backend sent its numbers to the stats collector process via a socket.
The collector wrote snapshots as a set of files on disk at a certain
interval, and every backend read them as necessary. That worked fine for
a comparatively small set of statistics, but the set is under pressure
to grow and the file size has reached the order of megabytes. To deal
with a larger set of statistics, this patch lets backends share the
statistics directly via shared memory.
---
 doc/src/sgml/monitoring.sgml                 |    6 +-
 src/backend/postmaster/autovacuum.c          |   12 +-
 src/backend/postmaster/pgstat.c              | 5613 ++++++++++++--------------
 src/backend/postmaster/postmaster.c          |   85 +-
 src/backend/storage/ipc/ipci.c               |    6 +
 src/backend/storage/lmgr/lwlock.c            |    1 +
 src/backend/tcop/postgres.c                  |   27 +-
 src/backend/utils/init/globals.c             |    1 +
 src/backend/utils/init/postinit.c            |   11 +
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |    4 +-
 src/include/miscadmin.h                      |    1 +
 src/include/pgstat.h                         |  441 +-
 src/include/storage/lwlock.h                 |    1 +
 src/include/utils/timeout.h                  |    1 +
 14 files changed, 2591 insertions(+), 3619 deletions(-)
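
The new header comment of pgstat.c in this patch names three intervals for
flushing local numbers to shared memory: a minimum of 500ms between
flushes, a 100ms retry interval when a flush could not be completed, and a
1000ms limit on how long numbers are kept locally. The sketch below shows
one way such a decision could be written. It is an assumption-laden model,
not the patch's code: the toy_* names and the TOY_* constants are made up
here, and treating the 1000ms limit as "stop using nowait and wait for the
lock" is my reading of the comment rather than something this excerpt
confirms.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MIN_INTERVAL_MS     500     /* minimum time between flushes */
#define TOY_RETRY_INTERVAL_MS   100     /* delay before retrying a failed flush */
#define TOY_MAX_INTERVAL_MS     1000    /* longest time numbers stay local */

typedef enum
{
    TOY_FLUSH_SKIP,             /* too soon, keep the numbers local */
    TOY_FLUSH_TRY_NOWAIT,       /* flush, but give up if a lock is busy */
    TOY_FLUSH_FORCE             /* flush and wait for locks if necessary */
} toy_flush_action;

/*
 * Decide what to do at now_ms.  last_flush_ms is the time of the last
 * successful flush; pending_since_ms is the time since which unflushed
 * numbers have been waiting, or a negative value if nothing is left over
 * from a failed attempt.
 */
toy_flush_action
toy_flush_decide(long now_ms, long last_flush_ms, long pending_since_ms)
{
    if (pending_since_ms >= 0 &&
        now_ms - pending_since_ms >= TOY_MAX_INTERVAL_MS)
        return TOY_FLUSH_FORCE;

    if (now_ms - last_flush_ms < TOY_MIN_INTERVAL_MS)
        return TOY_FLUSH_SKIP;

    return TOY_FLUSH_TRY_NOWAIT;
}

/* When to look again after an attempt made at now_ms. */
long
toy_next_attempt_ms(long now_ms, bool flushed)
{
    return now_ms + (flushed ? TOY_MIN_INTERVAL_MS : TOY_RETRY_INTERVAL_MS);
}
```

The conditional-lock interfaces added earlier in the series fit the
TOY_FLUSH_TRY_NOWAIT case: a backend that cannot get a partition lock keeps
its numbers and comes back after the retry interval.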

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index a179d6111e..0fe888e4db 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    master server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   master process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   master process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index acd8a9280b..6d5c91dd45 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1958,15 +1958,15 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
+    /* Start a transaction so our commands have one to play into. */
+    StartTransactionCommand();
+
     /*
      * may be NULL if we couldn't find an entry (only happens if we are
      * forcing a vacuum for anti-wrap purposes).
      */
     dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
 
-    /* Start a transaction so our commands have one to play into. */
-    StartTransactionCommand();
-
     /*
      * Clean up any dead statistics collector entries for this DB. We always
      * want to do this exactly once per DB-processing cycle, even if we find
@@ -2749,12 +2749,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
     if (isshared)
     {
         if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
+            tabentry = pgstat_fetch_stat_tabentry_extended(shared, relid);
     }
     else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
+        tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
 
     return tabentry;
 }
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ccb1d0b62e..24dab5aa54 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,24 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Statistics collector facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects per-table and per-function usage statistics of backends and
+ *  shares them among all backends using shared memory. Every backend writes
+ *  individual activity in local memory using pg_count_*() and friends
+ *  interfaces during a transaction. Then pgstat_flush_stat() is called at the
+ *  end of a transaction to flush out the local numbers to shared memory.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ *  To avoid congestion on the shared memory, we do that no more often than
+ *  PGSTAT_STAT_MIN_INTERVAL (500ms). It is still possible that a backend
+ *  cannot flush all or part of its local numbers immediately. In that case
+ *  we preserve them and retry after an interval of
+ *  PGSTAT_STAT_RETRY_INTERVAL (100ms), but they are not kept longer than
+ *  PGSTAT_STAT_MAX_INTERVAL (1000ms).
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the stats collector creates the area and
+ *  loads the stored stats file, if any. The last process writes the shared
+ *  stats to a file and then destroys the area before exiting.
  *
  *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
  *
@@ -19,18 +28,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "pgstat.h"
 
@@ -42,66 +39,38 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "postmaster/autovacuum.h"
-#include "postmaster/fork_process.h"
-#include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
-#include "utils/timestamp.h"
-
 
 /* ----------
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
+#define PGSTAT_STAT_MIN_INTERVAL    500 /* Minimum time between stats data
+                                         * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_STAT_RETRY_INTERVAL    100 /* Retry interval, in milliseconds,
+                                         * after PGSTAT_STAT_MIN_INTERVAL. */
 
+#define PGSTAT_STAT_MAX_INTERVAL   1000 /* Maximum time between stats data
+                                         * updates; in milliseconds. */
 
 /* ----------
  * The initial size hints for the hash tables used in the collector.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
 #define PGSTAT_TAB_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
@@ -117,6 +86,19 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
+/*
+ * Operation mode and return code of pgstat_get_db_entry.
+ */
+#define    PGSTAT_SHARED        0
+#define    PGSTAT_EXCLUSIVE    1
+#define    PGSTAT_NOWAIT        2
+
+typedef enum PgStat_TableLookupResult
+{
+    NOT_FOUND,
+    FOUND,
+    LOCK_FAILED
+} PgStat_TableLookupResult;
 
 /* ----------
  * GUC parameters
@@ -132,31 +114,63 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used; to be removed together with the GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
+#define        StatsLock (&StatsShmem->StatsMainLock)
+
+/* Shared stats bootstrap information */
+typedef struct StatsShmemStruct
+{
+    LWLock                StatsMainLock;        /* lock protecting this struct */
+    dsa_handle             stats_dsa_handle;    /* DSA handle for stats collector */
+    dshash_table_handle db_hash_handle;
+    dsa_pointer            global_stats;
+    dsa_pointer            archiver_stats;
+    int                    refcount;
+} StatsShmemStruct;
+
 /*
  * BgWriter global statistics counters (unused in other processes).
  * Stored directly in a stats message structure so it can be sent
- * without needing to copy things around.  We assume this inits to zeroes.
+ * without needing to copy things around.
  */
-PgStat_MsgBgWriter BgWriterStats;
+PgStat_MsgBgWriter BgWriterStats = {0};
 
-/* ----------
- * Local data
- * ----------
- */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
+/* Variables that live for the lifetime of the backend */
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
+static dshash_table *pgStatDBHash = NULL;
 
-static struct sockaddr_storage pgStatAddr;
 
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
+/* Parameters for each type of shared hash */
+static const dshash_parameters dsh_dbparams = {
+    sizeof(Oid),
+    SHARED_DBENT_SIZE,
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+static const dshash_parameters dsh_tblparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatTabEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+static const dshash_parameters dsh_funcparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatFuncEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
 
 /*
  * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
+ * written to shared memory.
  *
  * NOTE: once allocated, TabStatusArray structures are never moved or deleted
  * for the life of the backend.  Also, we zero out the t_id fields of the
@@ -191,8 +205,8 @@ typedef struct TabStatHashEntry
 static HTAB *pgStatTabHash = NULL;
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * Backends store per-function info that's waiting to be flushed out to shared
+ * memory in this hash table (indexed by function OID).
  */
 static HTAB *pgStatFunctions = NULL;
 
@@ -202,6 +216,68 @@ static HTAB *pgStatFunctions = NULL;
  */
 static bool have_function_stats = false;
 
+/* common header of stats entry in backend snapshot hash */
+typedef struct PgStat_snapshot
+{
+    Oid        key;
+    bool    negative;
+    void   *body;                /* end of header part: to keep alignment */
+} PgStat_snapshot;
+
+/* context struct for snapshot_statentry */
+typedef struct pgstat_snapshot_param
+{
+    char           *hash_name;            /* name of the snapshot hash */
+    int                hash_entsize;        /* element size of hash entry */
+    dshash_table_handle    dsh_handle;        /* dsh handle to attach */
+    const dshash_parameters *dsh_params;/* dshash params */
+    HTAB          **hash;                /* points to variable to hold hash */
+    dshash_table  **dshash;                /* ditto for dshash */
+} pgstat_snapshot_param;
+
+/*
+ * Backends store various database-wide info that's waiting to be flushed out
+ * to shared memory in these variables.
+ *
+ * checksum_failures is the exception: it stores data for all databases.
+ */
+typedef struct BackendDBStats
+{
+    int        n_conflict_tablespace;
+    int        n_conflict_lock;
+    int        n_conflict_snapshot;
+    int        n_conflict_bufferpin;
+    int        n_conflict_startup_deadlock;
+    int        n_deadlocks;
+    size_t    n_tmpfiles;
+    size_t    n_tmpfilesize;
+    HTAB    *checksum_failures;
+} BackendDBStats;
+
+/* Hash entry struct for checksum_failures above */
+typedef struct ChecksumFailureEnt
+{
+    Oid    dboid;
+    int    count;
+} ChecksumFailureEnt;
+
+static BackendDBStats BeDBStats = {0};
+
+/* macros to check BeDBStats at once */
+#define HAVE_PENDING_CONFLICTS() \
+    (BeDBStats.n_conflict_tablespace > 0 ||        \
+     BeDBStats.n_conflict_lock > 0 ||            \
+     BeDBStats.n_conflict_bufferpin > 0 ||        \
+     BeDBStats.n_conflict_startup_deadlock > 0)
+
+#define HAVE_PENDING_DBSTATS()                \
+    (HAVE_PENDING_CONFLICTS() ||        \
+     BeDBStats.n_deadlocks > 0 ||                \
+     BeDBStats.n_tmpfiles > 0 ||                \
+     /* no need to check n_tmpfilesize */        \
+     BeDBStats.checksum_failures != NULL)
+
+
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
  * into PgStat_TableStatus counters until we know if it is going to commit
@@ -237,11 +313,11 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
+static MemoryContext pgStatSnapshotContext = NULL;
+static HTAB *pgStatLocalHash = NULL;
+static bool    snapshot_cleard = false;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -249,24 +325,28 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 /* Total number of backends including auxiliary */
 static int    localNumBackends = 0;
 
-/*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
- */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
+/* Context struct for pgstat_flush_* functions */
+typedef struct pgstat_flush_stat_context
+{
+    int    shgeneration;
+    PgStat_StatDBEntry *shdbentry;
+    dshash_table *shdb_tabhash;
+
+    int    mygeneration;
+    PgStat_StatDBEntry *mydbentry;
+    dshash_table *mydb_tabhash;
+} pgstat_flush_stat_context;
 
 /*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
+ * Cluster wide statistics.
+ *
+ * Contains statistics that are not collected per database or per table.
+ * shared_* point to shared memory and snapshot_* are backend snapshots.
  */
-static List *pending_write_requests = NIL;
-
-/* Signal handler flags */
-static volatile bool need_exit = false;
-static volatile bool got_SIGHUP = false;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats *snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats *snapshot_globalStats;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -280,35 +360,40 @@ static instr_time total_func_time;
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
 
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgstat_exit(SIGNAL_ARGS);
 static void pgstat_beshutdown_hook(int code, Datum arg);
-static void pgstat_sighup_handler(SIGNAL_ARGS);
-
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                     Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, int op,
+                                    PgStat_TableLookupResult *status);
+static PgStat_StatTabEntry *pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create);
+static void pgstat_write_pgStatDBHashfile(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_pgStatDBHashfile(PgStat_StatDBEntry *dbentry);
 static void pgstat_read_current_status(void);
-
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
-
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
+static bool pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
+                                 PgStat_TableStatus *entry);
+static bool pgstat_flush_funcstats(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_flush_dbstats(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_update_tabentry(dshash_table *tabhash,
+                                   PgStat_TableStatus *stat, bool nowait);
+static void pgstat_update_dbentry(PgStat_StatDBEntry *dbentry,
+                                  PgStat_TableStatus *stat);
 static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
 
+static void pgstat_remove_useless_entries(const dshash_table_handle dshhandle,
+                              const dshash_parameters *dshparams,
+                              HTAB *oidtab);
 static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
 
 static void pgstat_setup_memcxt(void);
+static void pgstat_flush_recovery_conflict(PgStat_StatDBEntry *dbentry);
+static void pgstat_flush_deadlock(PgStat_StatDBEntry *dbentry);
+static void pgstat_flush_checksum_failure(PgStat_StatDBEntry *dbentry);
+static void pgstat_flush_tempfile(PgStat_StatDBEntry *dbentry);
+static HTAB *create_tabstat_hash(void);
+static PgStat_SubXactStatus *get_tabstat_stack_level(int nest_level);
+static void add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level);
+static PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry_extended(PgStat_StatDBEntry *dbent, Oid funcid);
+static void pgstat_snapshot_global_stats(void);
 
 static const char *pgstat_get_wait_activity(WaitEventActivity w);
 static const char *pgstat_get_wait_client(WaitEventClient w);
@@ -316,364 +401,160 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+/* ------------------------------------------------------------
+ * Local support functions follow
+ * ------------------------------------------------------------
+ */
+static int pin_hashes(PgStat_StatDBEntry *dbentry);
+static void unpin_hashes(PgStat_StatDBEntry *dbentry, int generation);
+static dshash_table *attach_table_hash(PgStat_StatDBEntry *dbent, int gen);
+static dshash_table *attach_function_hash(PgStat_StatDBEntry *dbent, int gen);
+static void reset_dbentry_counters(PgStat_StatDBEntry *dbentry);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute space needed for stats collector's shared memory
  */
-void
-pgstat_init(void)
+Size
+StatsShmemSize(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
-
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
-
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
-    {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
-    }
-
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
-
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
-
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
-
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
-    }
-
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
-
-    /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
-     */
-    if (!pg_set_noblock(pgStatSock))
-    {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
-    }
-
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
-    }
-
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    return;
-
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
-
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
-
-    /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
-     */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    return sizeof(StatsShmemStruct);
 }
 
 /*
- * subroutine for pgstat_reset_all
+ * StatsShmemInit - initialize during shared-memory creation
+ */
+void
+StatsShmemInit(void)
+{
+    bool    found;
+
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+    }
+
+    LWLockInitialize(StatsLock, LWTRANCHE_STATS);
+
+
+}
+
+/* ----------
+ * pgstat_attach_shared_stats() -
+ *
+ *    Attach to the shared stats memory, creating it if necessary.
+ * ---------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+pgstat_attach_shared_stats(void)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    MemoryContext oldcontext;
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
+    /*
+     * Don't use dsm in the postmaster (or standalone), when not tracking
+     * counts, or when already attached.
+     */
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    pgstat_setup_memcxt();
+
+    if (area)
+        return;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    if (StatsShmem->refcount > 0)
     {
-        int            nchars;
-        Oid            tmp_oid;
-
-        /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
-         */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
-
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
-
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+        /* Statistics exist in shared memory. Just attach to them. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatDBHash = dshash_attach(area, &dsh_dbparams,
+                                     StatsShmem->db_hash_handle, 0);
     }
-    FreeDir(dir);
+    else
+    {
+        /* Need to create shared stats */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
+
+        /* Setup shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatDBHash = dshash_create(area, &dsh_dbparams, 0);
+
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->global_stats =
+            dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+        StatsShmem->archiver_stats =
+            dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+        StatsShmem->db_hash_handle = dshash_get_hash_table_handle(pgStatDBHash);
+        StatsShmem->refcount = 0;
+    }
+
+    /* Setup local variables */
+    pgStatLocalHash = NULL;
+    shared_globalStats = (PgStat_GlobalStats *)
+        dsa_get_address(area, StatsShmem->global_stats);
+    shared_archiverStats = (PgStat_ArchiverStats *)
+        dsa_get_address(area, StatsShmem->archiver_stats);
+
+    dsa_pin_mapping(area);
+
+    /* Load data if we've just created the shared area. */
+    if (StatsShmem->refcount == 0)
+        pgstat_read_statsfiles();
+
+    StatsShmem->refcount++;
+
+    MemoryContextSwitchTo(oldcontext);
+    LWLockRelease(StatsLock);
+}
+
+/* ----------
+ * pgstat_detach_shared_stats() -
+ *
+ *    Detach shared stats. Write them out to file if we're the last process
+ *    and write_stats is true.
+ * ----------
+ */
+static void
+pgstat_detach_shared_stats(bool write_stats)
+{
+    if (!area || !IsUnderPostmaster)
+        return;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+    /* write out the shared stats to file if needed */
+    if (--StatsShmem->refcount < 1 && write_stats)
+        pgstat_write_statsfiles();
+
+    /*
+     * Detach the area. It is destroyed automatically when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);
+
+    /* We're the last process. Invalidate the dsa area handle. */
+    if (StatsShmem->refcount < 1)
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+
+    area = NULL;
+    pgStatDBHash = NULL;
+    shared_globalStats = NULL;
+    shared_archiverStats = NULL;
+    pgStatLocalHash = NULL;
+    LWLockRelease(StatsLock);
 }
 
 /*
@@ -685,112 +566,18 @@ pgstat_reset_remove_files(const char *directory)
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
-}
+    /* we must have shared stats */
+    Assert (StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-#ifdef EXEC_BACKEND
-
-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
-    char       *av[10];
-    int            ac = 0;
-
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
-
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
-
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
-
-
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
+    /* Startup must be the only user of shared stats */
+    Assert (StatsShmem->refcount == 1);
 
     /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
+     * We could directly remove files and recreate the shared memory area, but
+     * we detach and then reattach instead, for simplicity.
      */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
-
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
-
-    /*
-     * Okay, fork off the collector.
-     */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
+    pgstat_detach_shared_stats(false);    /* Don't write */
+    pgstat_attach_shared_stats();
 }
 
 /* ------------------------------------------------------------
@@ -798,75 +585,285 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *    This requires taking some locks on the shared statistics hashes, and
+ *    some updates may be postponed on lock failure. Such postponed updates
+ *    are retried in later calls of this function and are finally flushed
+ *    after PGSTAT_STAT_MAX_INTERVAL milliseconds by a forced flush, as if
+ *    force = true. On the other hand, updates by regular backends happen at
+ *    intervals no shorter than PGSTAT_STAT_MIN_INTERVAL when force = false.
+ *
+ *    Returns the time until the next update time in milliseconds.
+ *
+ *    Note that this is called only out of a transaction, so it is fair to use
  *    transaction stop time as an approximation of current time.
- * ----------
+ *    ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz next_flush = 0;
+    static TimestampTz pending_since = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
-    int            i;
+    pgstat_flush_stat_context cxt = {0};
+    bool        pending_stats = false;
+    long        elapsed;
+    long        secs;
+    int            usecs;
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (area == NULL ||
+        ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
+         pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
+         !HAVE_PENDING_DBSTATS()  && !have_function_stats))
+        return 0;
+
+    now = GetCurrentTransactionStopTimestamp();
+
+    if (!force)
+    {
+        /*
+         * Don't flush stats if it's not the time yet.  Returns the time to
+         * wait in milliseconds.
+         */
+        if (now < next_flush)
+        {
+            if (pending_since == 0)
+                pending_since = now;
+
+            /* now < next_flush here; TimestampTz is in microseconds */
+            return (next_flush - now) / 1000;
+        }
+
+        /*
+         * Don't keep pending stats for longer than PGSTAT_STAT_MAX_INTERVAL.
+         */
+        if (pending_since > 0)
+        {
+            TimestampDifference(pending_since, now, &secs, &usecs);
+            elapsed = secs * 1000 + usecs / 1000;
+
+            if (elapsed > PGSTAT_STAT_MAX_INTERVAL)
+                force = true;
+        }
+    }
+
+    /* Flush out table stats */
+    if (pgStatTabList != NULL && !pgstat_flush_stat(&cxt, !force))
+        pending_stats = true;
+
+    /* Flush out function stats */
+    if (pgStatFunctions != NULL && !pgstat_flush_funcstats(&cxt, !force))
+        pending_stats = true;
+
+    /* Flush out database-wide stats */
+    if (HAVE_PENDING_DBSTATS())
+    {
+        if (!pgstat_flush_dbstats(&cxt, !force))
+            pending_stats = true;
+    }
+
+    /*  Unpin dbentry if pinned */
+    if (cxt.mydb_tabhash)
+    {
+        dshash_detach(cxt.mydb_tabhash);
+        unpin_hashes(cxt.mydbentry, cxt.mygeneration);
+        cxt.mydb_tabhash = NULL;
+        cxt.mydbentry = NULL;
+    }
+
+    /* Publish the last flush time */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (shared_globalStats->stats_timestamp < now)
+        shared_globalStats->stats_timestamp = now;
+    LWLockRelease(StatsLock);
+
+    /* Remember since when we have been holding pending stats */
+    if (pending_stats)
+    {
+        /* Preserve the first value */
+        if (pending_since == 0)
+            pending_since = now;
+
+        /*
+         * The retry may happen slightly later than the limit set by
+         * PGSTAT_STAT_MAX_INTERVAL, but the overshoot is small, so we don't
+         * bother about it.
+         */
+        return PGSTAT_STAT_RETRY_INTERVAL;
+    }
+
+    /* Set next time to update stats */
+    next_flush = now + PGSTAT_STAT_MIN_INTERVAL * 1000;
+    pending_since = 0;
+
+    return 0;
+}
+
+/*
+ * snapshot_statentry() - Common routine for functions
+ *                             pgstat_fetch_stat_*entry()
+ *
+ *  Returns the pointer to the snapshot entry for the key or NULL if not
+ *  found.
+ *
+ *  Returned entries are stable during the current transaction or until
+ *  pgstat_clear_snapshot() is called.
+ *
+ *  cxt->hash points to a variable that points to an HTAB storing snapshot
+ *  entries; this function creates the HTAB on first use, using hash_name and
+ *  hash_entsize from cxt.
+ *
+ *  cxt->dshash is a working variable that points to a dshash_table storing
+ *  shared entries. cxt->dsh_handle specifies the dshash to be attached.
+ */
+static void *
+snapshot_statentry(pgstat_snapshot_param *cxt, Oid key)
+{
+    PgStat_snapshot *lentry = NULL;
+    size_t keysize = cxt->dsh_params->key_size;
+    size_t dsh_entrysize = cxt->dsh_params->entry_size;
+    bool found;
 
     /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
+     * We don't want to update the stats snapshot too frequently. Keep it for
+     * at least PGSTAT_STAT_MIN_INTERVAL ms.
      */
-    now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
+    if (snapshot_cleard)
+    {
+        snapshot_cleard = false;
+        if (pgStatSnapshotContext &&
+            snapshot_globalStats->stats_timestamp <
+            GetCurrentStatementStartTimestamp() -
+            PGSTAT_STAT_MIN_INTERVAL * 1000)
+        {
+            MemoryContextReset(pgStatSnapshotContext);
+
+            /* Reset variables */
+            pgStatSnapshotContext = NULL;
+            snapshot_globalStats = NULL;
+            snapshot_archiverStats = NULL;
+            pgStatLocalHash = NULL;
+
+            pgstat_setup_memcxt();
+        }
+    }
+
+    /*
+     * Create a new hash with an arbitrary initial number of entries since we
+     * don't know how this hash will grow.
+     */
+    if (!*cxt->hash)
+    {
+        HASHCTL ctl;
+
+        /*
+         * Create the hash in the stats context.
+         *
+         * The real size of a hash entry is the given struct size plus the
+         * common header part of PgStat_snapshot.
+         */
+
+        ctl.keysize        = keysize;
+        ctl.entrysize    = cxt->hash_entsize + offsetof(PgStat_snapshot, body);
+        ctl.hcxt        = pgStatSnapshotContext;
+        *cxt->hash = hash_create(cxt->hash_name, 32, &ctl,
+                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    }
+
+    lentry = hash_search(*cxt->hash, &key, HASH_ENTER, &found);
+
+    if (!found)
+    {
+        /* not found in local cache, search shared hash */
+
+        void *sentry;
+
+        /* attach the shared hash if not yet attached; keep it for later use */
+        if (!*cxt->dshash)
+        {
+            MemoryContext oldcxt;
+
+            Assert (cxt->dsh_handle != DSM_HANDLE_INVALID);
+            oldcxt = MemoryContextSwitchTo(pgStatSnapshotContext);
+            *cxt->dshash =
+                dshash_attach(area, cxt->dsh_params, cxt->dsh_handle, NULL);
+            MemoryContextSwitchTo(oldcxt);
+        }
+
+        sentry = dshash_find(*cxt->dshash, &key, false);
+
+        if (sentry)
+        {
+            /* Found. Copy it */
+            memcpy(&lentry->body, sentry, dsh_entrysize);
+            dshash_release_lock(*cxt->dshash, sentry);
+
+            /* then zero out the local additional space if any */
+            if (dsh_entrysize < cxt->hash_entsize)
+                MemSet((char *)&lentry->body + dsh_entrysize, 0,
+                       cxt->hash_entsize - dsh_entrysize);
+        }
+
+        lentry->negative = !sentry;
+    }
+
+    if (lentry->negative)
+        return NULL;
+
+    return &lentry->body;
+}
+
+/*
+ * pgstat_flush_stat: Flushes table stats out to shared statistics.
+ *
+ *  If nowait is true, returns false if a required lock could not be acquired
+ *  immediately. In that case, the stats of some tables may be left in
+ *  TabStatusArray to wait for the next chance.
+ *
+ *  cxt holds dshash-related values that we want to keep during the shared
+ *  stats update. The caller must detach the dshashes stored in cxt after use.
+ *
+ *  Returns true if all entries are flushed.
+ */
+static bool
+pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait)
+{
+    static const PgStat_TableCounts all_zeroes;
+    TabStatusArray *tsa;
+    HTAB           *new_tsa_hash = NULL;
+    TabStatusArray *dest_tsa = pgStatTabList;
+    int                dest_elem = 0;
+    int                i;
+
+    /* nothing to do, just return  */
+    if (pgStatTabHash == NULL)
+        return true;
 
     /*
      * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
+     * entries it points to.
      */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
+    hash_destroy(pgStatTabHash);
     pgStatTabHash = NULL;
 
     /*
      * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
+     * have counts, and try flushing them out to shared statistics. We may
+     * fail to flush some entries; those are left packed at the beginning of
+     * the array.
      */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
     for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
     {
         for (i = 0; i < tsa->tsa_used; i++)
         {
             PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
 
             /* Shouldn't have any pending transaction-dependent counts */
             Assert(entry->trans == NULL);
@@ -879,178 +876,340 @@ pgstat_report_stat(bool force)
                        sizeof(PgStat_TableCounts)) == 0)
                 continue;
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* try to apply the tab stats */
+            if (!pgstat_flush_tabstat(cxt, nowait, entry))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                /*
+                 * Failed. Keep the entry, packing it toward the beginning of
+                 * TabStatusArray.
+                 */
+                TabStatHashEntry *hash_entry;
+                bool found;
+
+                if (new_tsa_hash == NULL)
+                    new_tsa_hash = create_tabstat_hash();
+
+                /* Create hash entry for this entry */
+                hash_entry = hash_search(new_tsa_hash, &entry->t_id,
+                                         HASH_ENTER, &found);
+                Assert(!found);
+
+                /* Move insertion pointer to the next segment. */
+                if (dest_elem >= TABSTAT_QUANTUM)
+                {
+                    Assert(dest_tsa->tsa_next != NULL);
+                    dest_tsa = dest_tsa->tsa_next;
+                    dest_elem = 0;
+                }
+
+                /* Move the entry if needed */
+                if (tsa != dest_tsa || i != dest_elem)
+                {
+                    PgStat_TableStatus *new_entry;
+                    new_entry = &dest_tsa->tsa_entries[dest_elem];
+                    *new_entry = *entry;
+                    entry = new_entry;
+                }
+
+                hash_entry->tsa_entry = entry;
+                dest_elem++;
             }
         }
-        /* zero out TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
     }
 
-    /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
-     */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    /* zero out unused area of TableStatus */
+    dest_tsa->tsa_used = dest_elem;
+    MemSet(&dest_tsa->tsa_entries[dest_elem], 0,
+           (TABSTAT_QUANTUM - dest_elem) * sizeof(PgStat_TableStatus));
+    while (dest_tsa->tsa_next)
+    {
+        dest_tsa = dest_tsa->tsa_next;
+        MemSet(dest_tsa->tsa_entries, 0,
+               dest_tsa->tsa_used * sizeof(PgStat_TableStatus));
+        dest_tsa->tsa_used = 0;
+    }
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+    /* and set the new TabStatusArray hash if any */
+    pgStatTabHash = new_tsa_hash;
+
+    /*
+     * We no longer need the shared database and table entries, but we may
+     * still use those for my database.
+     */
+    if (cxt->shdb_tabhash)
+    {
+        dshash_detach(cxt->shdb_tabhash);
+        unpin_hashes(cxt->shdbentry, cxt->shgeneration);
+        cxt->shdb_tabhash = NULL;
+        cxt->shdbentry = NULL;
+    }
+
+    return pgStatTabHash == NULL;
 }
 
-/*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+/* -------
+ * Subroutines for pgstat_flush_stat.
+ * -------
  */
-static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+/*
+ * pgstat_flush_tabstat: Flushes a table stats entry.
+ *
+ *  If nowait is true, returns false on lock failure.  Dshashes for table and
+ *  function stats are kept attached in cxt. The caller must detach them after
+ *  use.
+ *
+ *  Returns true if the entry is flushed.
+ */
+bool
+pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
+                     PgStat_TableStatus *entry)
 {
-    int            n;
-    int            len;
+    Oid        dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
+    int        table_mode = PGSTAT_EXCLUSIVE;
+    bool    updated = false;
+    dshash_table *tabhash;
+    PgStat_StatDBEntry *dbent;
+    int        generation;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    if (nowait)
+        table_mode |= PGSTAT_NOWAIT;
 
-    /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
-     */
-    if (OidIsValid(tsmsg->m_databaseid))
+    /* Attach the required table hash if not yet. */
+    if ((entry->t_shared ? cxt->shdb_tabhash : cxt->mydb_tabhash) == NULL)
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
-        pgStatXactCommit = 0;
-        pgStatXactRollback = 0;
-        pgStatBlockReadTime = 0;
-        pgStatBlockWriteTime = 0;
+        /* We don't have corresponding dbentry here */
+        dbent = pgstat_get_db_entry(dboid, table_mode, NULL);
+        if (!dbent)
+            return false;
+
+        /*
+         * We don't hold dshash-lock on dbentries, since the dbentries cannot
+         * be dropped meanwhile.
+         */
+        generation = pin_hashes(dbent);
+        tabhash = attach_table_hash(dbent, generation);
+
+        if (entry->t_shared)
+        {
+            cxt->shgeneration = generation;
+            cxt->shdbentry = dbent;
+            cxt->shdb_tabhash = tabhash;
+        }
+        else
+        {
+            cxt->mygeneration = generation;
+            cxt->mydbentry = dbent;
+            cxt->mydb_tabhash = tabhash;
+
+            /*
+             * We attach mydb tabhash once per flush. This is our chance to
+             * update database-wide stats.
+             */
+            LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
+            dbent->n_xact_commit += pgStatXactCommit;
+            dbent->n_xact_rollback += pgStatXactRollback;
+            dbent->n_block_read_time += pgStatBlockReadTime;
+            dbent->n_block_write_time += pgStatBlockWriteTime;
+            LWLockRelease(&dbent->lock);
+            pgStatXactCommit = 0;
+            pgStatXactRollback = 0;
+            pgStatBlockReadTime = 0;
+            pgStatBlockWriteTime = 0;
+        }
+    }
+    else if (entry->t_shared)
+    {
+        dbent = cxt->shdbentry;
+        tabhash = cxt->shdb_tabhash;
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        dbent = cxt->mydbentry;
+        tabhash = cxt->mydb_tabhash;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    /*
+     * dbentry is always available here, so try flushing table stats first, then
+     * database stats.
+     */
+    if (pgstat_update_tabentry(tabhash, entry, nowait))
+    {
+        pgstat_update_dbentry(dbent, entry);
+        updated = true;
+    }
+
+    return updated;
 }
 
 /*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
+ * pgstat_flush_funcstats: Flushes function stats.
+ *
+ *  If nowait is true, returns false on lock failure and leaves some of the
+ *  entries alone in the local hash.
+ *
+ *  Returns true if all entries are flushed.
  */
-static void
-pgstat_send_funcstats(void)
+static bool
+pgstat_flush_funcstats(pgstat_flush_stat_context *cxt, bool nowait)
 {
     /* we assume this inits to all zeroes: */
     static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
+    dshash_table   *funchash;
     HASH_SEQ_STATUS fstat;
+    PgStat_BackendFunctionEntry *bestat;
 
+    /* nothing to do, just return  */
     if (pgStatFunctions == NULL)
-        return;
+        return true;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
-
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    /* get dbentry into cxt if not yet.  */
+    if (cxt->mydbentry == NULL)
     {
-        PgStat_FunctionEntry *m_ent;
+        int op = PGSTAT_EXCLUSIVE;
 
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
+        if (nowait)
+            op |= PGSTAT_NOWAIT;
+
+        cxt->mydbentry = pgstat_get_db_entry(MyDatabaseId, op, NULL);
+
+        if (cxt->mydbentry == NULL)
+            return false;
+
+        cxt->mygeneration = pin_hashes(cxt->mydbentry);
+    }
+
+    funchash = attach_function_hash(cxt->mydbentry, cxt->mygeneration);
+    if (funchash == NULL)
+        return false;
+
+    have_function_stats = false;
+
+    /*
+     * Scan through the pgStatFunctions to find functions that actually have
+     * counts, and try flushing them out to shared statistics.
+     */
+    hash_seq_init(&fstat, pgStatFunctions);
+    while ((bestat = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    {
+        bool found;
+        PgStat_StatFuncEntry *funcent = NULL;
+
+        /* Skip it if no counts accumulated for it so far */
+        if (memcmp(&bestat->f_counts, &all_zeroes,
                    sizeof(PgStat_FunctionCounts)) == 0)
             continue;
 
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
+        funcent = (PgStat_StatFuncEntry *)
+            dshash_find_or_insert_extended(funchash, (void *) &(bestat->f_id),
+                                           &found, nowait);
 
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
+        /*
+         * We couldn't acquire lock on the required entry. Leave the local
+         * entry alone.
+         */
+        if (!funcent)
         {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
+            have_function_stats = true;
+            continue;
         }
 
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
+        /* Initialize if it's new, or add to it. */
+        if (!found)
+        {
+            funcent->functionid = bestat->f_id;
+            funcent->f_numcalls = bestat->f_counts.f_numcalls;
+            funcent->f_total_time =
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_total_time);
+            funcent->f_self_time =
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_self_time);
+        }
+        else
+        {
+            funcent->f_numcalls += bestat->f_counts.f_numcalls;
+            funcent->f_total_time +=
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_total_time);
+            funcent->f_self_time +=
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_self_time);
+        }
+        dshash_release_lock(funchash, funcent);
+
+        /* reset used counts */
+        MemSet(&bestat->f_counts, 0, sizeof(PgStat_FunctionCounts));
     }
 
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
+    return !have_function_stats;
 }
 
+/*
+ * pgstat_flush_dbstats: Flushes out miscellaneous database stats.
+ *
+ *  If nowait is true, returns false on lock failure on dbentry.
+ *
+ *  Returns true if all the miscellaneous stats are flushed out.
+ */
+static bool
+pgstat_flush_dbstats(pgstat_flush_stat_context *cxt, bool nowait)
+{
+    /* get dbentry if not yet.  */
+    if (cxt->mydbentry == NULL)
+    {
+        int op = PGSTAT_EXCLUSIVE;
+        if (nowait)
+            op |= PGSTAT_NOWAIT;
+
+        cxt->mydbentry = pgstat_get_db_entry(MyDatabaseId, op, NULL);
+
+        /* Lock failure, return. */
+        if (cxt->mydbentry == NULL)
+            return false;
+
+        cxt->mygeneration = pin_hashes(cxt->mydbentry);
+    }
+
+    LWLockAcquire(&cxt->mydbentry->lock, LW_EXCLUSIVE);
+    if (HAVE_PENDING_CONFLICTS())
+        pgstat_flush_recovery_conflict(cxt->mydbentry);
+    if (BeDBStats.n_deadlocks != 0)
+        pgstat_flush_deadlock(cxt->mydbentry);
+    if (BeDBStats.n_tmpfiles != 0)
+        pgstat_flush_tempfile(cxt->mydbentry);
+    if (BeDBStats.checksum_failures != NULL)
+        pgstat_flush_checksum_failure(cxt->mydbentry);
+    LWLockRelease(&cxt->mydbentry->lock);
+
+    return true;
+}
 
 /* ----------
  * pgstat_vacuum_stat() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *    Remove stats entries for objects that no longer exist.
  * ----------
  */
 void
 pgstat_vacuum_stat(void)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
+    HTAB       *oidtab;
+    dshash_seq_status dshstat;
     PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* we don't collect statistics under standalone mode */
+    if (!IsUnderPostmaster)
         return;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
     /*
      * Read pg_database and make a list of OIDs of all existing databases
      */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    oidtab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
 
     /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
+     * Search the database hash table for dead databases and drop them
+     * from the hash.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+
+    dshash_seq_init(&dshstat, pgStatDBHash, false, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            dbid = dbentry->databaseid;
 
@@ -1058,137 +1217,43 @@ pgstat_vacuum_stat(void)
 
         /* the DB entry for shared tables (with InvalidOid) is never dropped */
         if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
+            hash_search(oidtab, (void *) &dbid, HASH_FIND, NULL) == NULL)
             pgstat_drop_database(dbid);
     }
 
     /* Clean up */
-    hash_destroy(htab);
+    hash_destroy(oidtab);
 
     /*
      * Lookup our own database entry; if not found, nothing more to do.
      */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_EXCLUSIVE, NULL);
+    if (!dbentry)
         return;
 
     /*
      * Similarly to above, make a list of all known relations in this DB.
      */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
+    oidtab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
 
     /*
      * Check for all tables listed in stats hashtable if they still exist.
+     * Stats cache is useless here so directly search the shared hash.
      */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            tabid = tabentry->tableid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
-            continue;
-
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
-    }
+    pgstat_remove_useless_entries(dbentry->tables, &dsh_tblparams, oidtab);
 
     /*
-     * Send the rest
+     * Repeat the above for functions, but we needn't bother in the common
+     * case where no function stats are being collected.
      */
-    if (msg.m_nentries > 0)
+    if (dbentry->functions != DSM_HANDLE_INVALID)
     {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
+        oidtab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
 
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
-        {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
-                continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
-        }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
+        pgstat_remove_useless_entries(dbentry->functions, &dsh_funcparams,
+                                      oidtab);
     }
+    dshash_release_lock(pgStatDBHash, dbentry);
 }
 
 
@@ -1242,66 +1307,96 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     return htab;
 }
 
+/*
+ * pgstat_remove_useless_entries - Remove useless entries from per
+ * table/function dshashes.
+ *
+ *  Scan the dshash specified by dshhandle removing entries that are not in
+ *  oidtab. oidtab is destroyed before returning.
+ */
+void
+pgstat_remove_useless_entries(const dshash_table_handle dshhandle,
+                              const dshash_parameters *dshparams,
+                              HTAB *oidtab)
+{
+    dshash_table *dshtable;
+    dshash_seq_status dshstat;
+    void         *ent;
+
+    dshtable = dshash_attach(area, dshparams, dshhandle, 0);
+    dshash_seq_init(&dshstat, dshtable, false, true);
+
+    /* The first member of the entries must be Oid */
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
+    {
+        CHECK_FOR_INTERRUPTS();
+
+        if (hash_search(oidtab, ent, HASH_FIND, NULL) != NULL)
+            continue;
+
+        /* Not there, so purge this entry */
+        dshash_delete_entry(dshtable, ent);
+    }
+    dshash_detach(dshtable);
+    hash_destroy(oidtab);
+}
 
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
+ *    Remove entry for the database that we just dropped.
+ *
+ *    If some stats are flushed after this, this entry will be re-created, but
+ *    we will still clean the dead DB eventually via future invocations of
+ *    pgstat_vacuum_stat().
  * ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    Assert (OidIsValid(databaseid));
+
+    if (!IsUnderPostmaster || !pgStatDBHash)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Lookup the database in the hashtable with exclusive lock.
+     */
+    dbentry = pgstat_get_db_entry(databaseid, PGSTAT_EXCLUSIVE, NULL);
+
+    /*
+     * If found, remove it (along with the db statfile).
+     */
+    if (dbentry)
+    {
+        LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+        Assert(dbentry->refcnt == 0);
+
+        /* No one must live on this database. It's safe to drop all. */
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+        }
+        LWLockRelease(&dbentry->lock);
+
+        dshash_delete_entry(pgStatDBHash, (void *)dbentry);
+    }
 }
 
-
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
-
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
-}
-#endif                            /* NOT_USED */
-
-
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1310,20 +1405,32 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    PgStat_StatDBEntry       *dbentry;
+    PgStat_TableLookupResult status;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!pgStatDBHash)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Lookup the database in the hashtable.  Nothing to do if not there.
+     */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_EXCLUSIVE, &status);
+
+    if (!dbentry)
+        return;
+
+    /* This database is active, safe to release the lock immediately. */
+    dshash_release_lock(pgStatDBHash, dbentry);
+
+    /* Reset database-level stats. */
+    reset_dbentry_counters(dbentry);
+
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1332,29 +1439,35 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
+    /* Reset the archiver statistics for the cluster. */
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
+    }
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+        /* Reset the global background writer statistics for the cluster. */
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    }
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
+    LWLockRelease(StatsLock);
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1363,17 +1476,42 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
+    int generation;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_EXCLUSIVE, NULL);
+
+    if (!dbentry)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* This database is active, safe to release the lock immediately. */
+    generation = pin_hashes(dbentry);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Set the reset timestamp for the whole database */
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&dbentry->lock);
+
+    /* Remove object if it exists, ignore if not */
+    if (type == RESET_TABLE)
+    {
+        dshash_table *t = attach_table_hash(dbentry, generation);
+        dshash_delete_key(t, (void *) &objoid);
+        dshash_detach(t);
+    }
+
+    if (type == RESET_FUNCTION)
+    {
+        dshash_table *t = attach_function_hash(dbentry, generation);
+        if (t)
+        {
+            dshash_delete_key(t, (void *) &objoid);
+            dshash_detach(t);
+        }
+    }
+    unpin_hashes(dbentry, generation);
 }
 
 /* ----------
@@ -1387,48 +1525,81 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    /*
+     * Store the last autovacuum time in the database's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_EXCLUSIVE, NULL);
+    dshash_release_lock(pgStatDBHash, dbentry);
 
-    pgstat_send(&msg, sizeof(msg));
+    ts = GetCurrentTimestamp();
+
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->last_autovac_time = ts;
+    LWLockRelease(&dbentry->lock);
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report about the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
+    int                    generation;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hash table entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_EXCLUSIVE, NULL);
+    generation = pin_hashes(dbentry);
+    table = attach_table_hash(dbentry, generation);
+
+    tabentry = pgstat_get_tab_entry(table, tableoid, true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->vacuum_count++;
+    }
+    dshash_release_lock(table, tabentry);
+
+    dshash_detach(table);
+    unpin_hashes(dbentry, generation);
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report about the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1439,9 +1610,14 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table       *table;
+    int                    generation;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1470,78 +1646,153 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_EXCLUSIVE, NULL);
+    generation = pin_hashes(dbentry);
+    table = attach_table_hash(dbentry, generation);
+    tabentry = pgstat_get_tab_entry(table, RelationGetRelid(rel), true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    dshash_release_lock(table, tabentry);
+
+    dshash_detach(table);
+    unpin_hashes(dbentry, generation);
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupResult status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            BeDBStats.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            BeDBStats.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            BeDBStats.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            BeDBStats.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            BeDBStats.n_conflict_startup_deadlock++;
+            break;
+    }
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
+
+    if (status == LOCK_FAILED)
+        return;
+
+    /* We had a chance to flush immediately */
+    pgstat_flush_recovery_conflict(dbentry);
+
+    dshash_release_lock(pgStatDBHash, dbentry);
+}
+
+/*
+ * flush recovery conflict stats
+ */
+static void
+pgstat_flush_recovery_conflict(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_conflict_tablespace    += BeDBStats.n_conflict_tablespace;
+    dbentry->n_conflict_lock         += BeDBStats.n_conflict_lock;
+    dbentry->n_conflict_snapshot    += BeDBStats.n_conflict_snapshot;
+    dbentry->n_conflict_bufferpin    += BeDBStats.n_conflict_bufferpin;
+    dbentry->n_conflict_startup_deadlock += BeDBStats.n_conflict_startup_deadlock;
+
+    BeDBStats.n_conflict_tablespace = 0;
+    BeDBStats.n_conflict_lock = 0;
+    BeDBStats.n_conflict_snapshot = 0;
+    BeDBStats.n_conflict_bufferpin = 0;
+    BeDBStats.n_conflict_startup_deadlock = 0;
 }
 
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupResult status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    BeDBStats.n_deadlocks++;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
+
+    if (status == LOCK_FAILED)
+        return;
+
+    dshash_release_lock(pgStatDBHash, dbentry);
 }
 
-
-
-/* --------
- * pgstat_report_checksum_failures_in_db() -
- *
- *    Tell the collector about one or more checksum failures.
- * --------
+/*
+ * flush deadlock stats
  */
-void
-pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
+static void
+pgstat_flush_deadlock(PgStat_StatDBEntry *dbentry)
 {
-    PgStat_MsgChecksumFailure msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
-
-    pgstat_send(&msg, sizeof(msg));
+    dbentry->n_deadlocks += BeDBStats.n_deadlocks;
+    BeDBStats.n_deadlocks = 0;
 }
 
 /* --------
@@ -1559,60 +1810,153 @@ pgstat_report_checksum_failure(void)
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupResult status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
-}
+    if (filesize > 0) /* Is there a case where filesize is really 0? */
+    {
+        BeDBStats.n_tmpfilesize += filesize; /* needs overflow check */
+        BeDBStats.n_tmpfiles++;
+    }
 
-
-/* ----------
- * pgstat_ping() -
- *
- *    Send some junk data to the collector to increase traffic.
- * ----------
- */
-void
-pgstat_ping(void)
-{
-    PgStat_MsgDummy msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (BeDBStats.n_tmpfiles == 0)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
+
+    if (status == LOCK_FAILED)
+        return;
+
+    /* We had a chance to flush immediately */
+    pgstat_flush_tempfile(dbentry);
+
+    dshash_release_lock(pgStatDBHash, dbentry);
 }
 
-/* ----------
- * pgstat_send_inquiry() -
- *
- *    Notify collector that we need fresh data.
- * ----------
+/*
+ * flush temporary file stats
  */
 static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
+pgstat_flush_tempfile(PgStat_StatDBEntry *dbentry)
 {
-    PgStat_MsgInquiry msg;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
+    dbentry->n_temp_bytes += BeDBStats.n_tmpfilesize;
+    dbentry->n_temp_files += BeDBStats.n_tmpfiles;
+    BeDBStats.n_tmpfilesize = 0;
+    BeDBStats.n_tmpfiles = 0;
 }
 
+/* --------
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
+ *
+ *    Tell the collector about one or more checksum failures.
+ * --------
+ */
+void
+pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
+{
+    PgStat_StatDBEntry       *dbentry;
+    PgStat_TableLookupResult status;
+    ChecksumFailureEnt       *failent = NULL;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    if (BeDBStats.checksum_failures != NULL)
+    {
+        failent = hash_search(BeDBStats.checksum_failures, &dboid,
+                              HASH_FIND, NULL);
+        if (failent)
+            failurecount += failent->count;
+    }
+
+    if (failurecount == 0)
+        return;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
+
+    if (status == LOCK_FAILED)
+    {
+        if (!failent)
+        {
+            if (!BeDBStats.checksum_failures)
+            {
+                HASHCTL    ctl;
+
+                ctl.keysize = sizeof(Oid);
+                ctl.entrysize = sizeof(ChecksumFailureEnt);
+                BeDBStats.checksum_failures =
+                    hash_create("pgstat checksum failure count hash",
+                                32, &ctl, HASH_ELEM | HASH_BLOBS);
+            }
+
+            failent = hash_search(BeDBStats.checksum_failures,
+                                  &dboid, HASH_ENTER, NULL);
+        }
+
+        failent->count = failurecount;
+        return;
+    }
+
+    /* We have a chance to flush immediately */
+    dbentry->n_checksum_failures += failurecount;
+    BeDBStats.checksum_failures = NULL;
+
+    dshash_release_lock(pgStatDBHash, dbentry);
+}
+
+/*
+ * flush checksum failure count for all databases
+ */
+static void
+pgstat_flush_checksum_failure(PgStat_StatDBEntry *dbentry)
+{
+    HASH_SEQ_STATUS     stat;
+    ChecksumFailureEnt *ent;
+    bool                release_dbent;
+
+    if (BeDBStats.checksum_failures == NULL)
+        return;
+
+    hash_seq_init(&stat, BeDBStats.checksum_failures);
+    while ((ent = (ChecksumFailureEnt *) hash_seq_search(&stat)) != NULL)
+    {
+        release_dbent = false;
+
+        if (dbentry->databaseid != ent->dboid)
+        {
+            dbentry = pgstat_get_db_entry(ent->dboid,
+                                          PGSTAT_EXCLUSIVE, NULL);
+            if (!dbentry)
+                continue;
+
+            release_dbent = true;
+        }
+
+        dbentry->n_checksum_failures += ent->count;
+
+        if (release_dbent)
+            dshash_release_lock(pgStatDBHash, dbentry);
+    }
+
+    hash_destroy(BeDBStats.checksum_failures);
+    BeDBStats.checksum_failures = NULL;
+}
 
 /*
  * Initialize function call usage data.
@@ -1764,7 +2108,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1783,6 +2128,24 @@ pgstat_initstats(Relation rel)
     rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
 }
 
+/*
+ * create_tabstat_hash - create local hash as transactional storage
+ */
+static HTAB *
+create_tabstat_hash(void)
+{
+    HASHCTL        ctl;
+
+    MemSet(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(Oid);
+    ctl.entrysize = sizeof(TabStatHashEntry);
+
+    return hash_create("pgstat TabStatusArray lookup hash table",
+                       TABSTAT_QUANTUM,
+                       &ctl,
+                       HASH_ELEM | HASH_BLOBS);
+}
+
 /*
  * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
  */
@@ -1798,18 +2161,7 @@ get_tabstat_entry(Oid rel_id, bool isshared)
      * Create hash table if we don't have it already.
      */
     if (pgStatTabHash == NULL)
-    {
-        HASHCTL        ctl;
-
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
-    }
+        pgStatTabHash = create_tabstat_hash();
 
     /*
      * Find an entry or create a new one.
@@ -2422,30 +2774,34 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
- * ----------
+ *    Find database stats entry on backends. The returned entries are cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_param param =
+    {
+        .hash_name = "local database stats hash",
+        .hash_entsize = sizeof(PgStat_StatDBEntry),
+        .dsh_handle = DSM_HANDLE_INVALID,   /* already attached */
+        .dsh_params = &dsh_dbparams,
+        .hash = &pgStatLocalHash,
+        .dshash = &pgStatDBHash
+    };
 
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    /* If not done for this transaction, take a snapshot of global stats */
+    pgstat_snapshot_global_stats();
+
+    /* the caller has no business with snapshot-local members */
+    return (PgStat_StatDBEntry *)
+        snapshot_statentry(&param, dbid);
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
@@ -2458,51 +2814,66 @@ pgstat_fetch_stat_dbentry(Oid dbid)
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* Lookup our database, then look in its table hash table. */
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    if (dbentry == NULL)
+        return NULL;
 
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    dbentry = pgstat_fetch_stat_dbentry(InvalidOid);
+    if (dbentry == NULL)
+        return NULL;
+
+    tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     return NULL;
 }
 
+/* ----------
+ * pgstat_fetch_stat_tabentry_extended() -
+ *
+ *    Find table stats entry on backends. The returned entries are cached until
+ *    transaction end or pgstat_clear_snapshot() is called.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_extended(PgStat_StatDBEntry *dbent, Oid reloid)
+{
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_param param =
+    {
+        .hash_name = "table stats snapshot hash",
+        .hash_entsize = sizeof(PgStat_StatTabEntry),
+        .dsh_handle = DSM_HANDLE_INVALID,
+        .dsh_params = &dsh_tblparams,
+        .hash = NULL,
+        .dshash = NULL
+    };
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    /* set target shared hash */
+    param.dsh_handle = dbent->tables;
+
+    /* tell snapshot_statentry what variables to use */
+    param.hash = &dbent->snapshot_tables;
+    param.dshash = &dbent->dshash_tables;
+
+    return (PgStat_StatTabEntry *)
+        snapshot_statentry(&param, reloid);
+}
+
 
 /* ----------
  * pgstat_fetch_stat_funcentry() -
@@ -2517,21 +2888,93 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     PgStat_StatDBEntry *dbentry;
     PgStat_StatFuncEntry *funcentry = NULL;
 
-    /* load the stats file if needed */
-    backend_read_statsfile();
-
-    /* Lookup our database, then find the requested function.  */
+    /* Lookup our database, then find the requested function */
     dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
+    if (dbentry == NULL)
+        return NULL;
+
+    funcentry = pgstat_fetch_stat_funcentry_extended(dbentry, func_id);
 
     return funcentry;
 }
 
+/* ----------
+ * pgstat_fetch_stat_funcentry_extended() -
+ *
+ *    Find function stats entry on backends. The returned entries are cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
+ *
+ *    dbent has type (PgStat_StatDBEntry *), but it must point to a
+ *    PgStat_StatDBEntry returned from pgstat_fetch_stat_dbentry().
+ */
+static PgStat_StatFuncEntry *
+pgstat_fetch_stat_funcentry_extended(PgStat_StatDBEntry *dbent, Oid funcid)
+{
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_param param =
+    {
+        .hash_name = "function stats snapshot hash",
+        .hash_entsize = sizeof(PgStat_StatFuncEntry),
+        .dsh_handle = DSM_HANDLE_INVALID,
+        .dsh_params = &dsh_funcparams,
+        .hash = NULL,
+        .dshash = NULL
+    };
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    if (dbent->functions == DSM_HANDLE_INVALID)
+        return NULL;
+
+    /* set target shared hash */
+    param.dsh_handle = dbent->functions;
+
+    /* tell snapshot_statentry what variables to use */
+    param.hash = &dbent->snapshot_functions;
+    param.dshash = &dbent->dshash_functions;
+
+    return (PgStat_StatFuncEntry *)
+        snapshot_statentry(&param, funcid);
+}
+
+/*
+ * pgstat_snapshot_global_stats() -
+ *
+ * Makes a snapshot of global stats if not done yet.  It is kept until the
+ * next call to pgstat_clear_snapshot() or the end of the current
+ * memory context (typically TopTransactionContext).
+ */
+static void
+pgstat_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext;
+
+    pgstat_attach_shared_stats();
+
+    /* Nothing to do if already done */
+    if (snapshot_globalStats)
+        return;
+
+    Assert(snapshot_archiverStats == NULL);
+
+    oldcontext = MemoryContextSwitchTo(pgStatSnapshotContext);
+
+    /* global stats can be just copied  */
+    LWLockAcquire(StatsLock, LW_SHARED);
+    snapshot_globalStats = palloc(sizeof(PgStat_GlobalStats));
+    memcpy(snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+
+    snapshot_archiverStats = palloc(sizeof(PgStat_ArchiverStats));
+    memcpy(snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+    LWLockRelease(StatsLock);
+
+    MemoryContextSwitchTo(oldcontext);
+
+    return;
+}
 
 /* ----------
  * pgstat_fetch_stat_beentry() -
@@ -2603,9 +3046,10 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &archiverStats;
+    return snapshot_archiverStats;
 }
 
 
@@ -2620,9 +3064,10 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &globalStats;
+    return snapshot_globalStats;
 }
 
 
@@ -2836,8 +3281,8 @@ pgstat_initialize(void)
         MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
     }
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* must be called before DSM shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -2935,7 +3380,7 @@ pgstat_bestart(void)
                 lbeentry.st_backendType = B_STARTUP;
                 break;
             case ArchiverProcess:
-                beentry->st_backendType = B_ARCHIVER;
+                lbeentry.st_backendType = B_ARCHIVER;
                 break;
             case BgWriterProcess:
                 lbeentry.st_backendType = B_BG_WRITER;
@@ -3071,6 +3516,10 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+
+    /* attach shared database stats area */
+    pgstat_attach_shared_stats();
 }
 
 /*
@@ -3106,6 +3555,8 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    pgstat_detach_shared_stats(true);
 }
 
 
@@ -3366,7 +3817,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3661,9 +4113,6 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
-            break;
         case WAIT_EVENT_RECOVERY_WAL_ALL:
             event_name = "RecoveryWalAll";
             break;
@@ -4323,75 +4772,39 @@ pgstat_get_backend_desc(BackendType backendType)
  * ------------------------------------------------------------
  */
 
-
-/* ----------
- * pgstat_setheader() -
- *
- *        Set common header fields in a statistics message
- * ----------
- */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
-    {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
 /* ----------
  * pgstat_send_archiver() -
  *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
+ *        Report archiver statistics
  * ----------
  */
 void
 pgstat_send_archiver(const char *xlog, bool failed)
 {
-    PgStat_MsgArchiver msg;
-
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (failed)
+    {
+        /* Failed archival attempt */
+        ++shared_archiverStats->failed_count;
+        memcpy(shared_archiverStats->last_failed_wal, xlog,
+               sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = GetCurrentTimestamp();
+    }
+    else
+    {
+        /* Successful archival operation */
+        ++shared_archiverStats->archived_count;
+        memcpy(shared_archiverStats->last_archived_wal, xlog,
+               sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = GetCurrentTimestamp();
+    }
+    LWLockRelease(StatsLock);
 }
 
 /* ----------
  * pgstat_send_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
@@ -4400,6 +4813,8 @@ pgstat_send_bgwriter(void)
     /* We assume this initializes to zeroes */
     static const PgStat_MsgBgWriter all_zeroes;
 
+    PgStat_MsgBgWriter *s = &BgWriterStats;
+
     /*
      * This function can be called even if nothing at all has happened. In
      * this case, avoid sending a completely empty message to the stats
@@ -4408,11 +4823,18 @@ pgstat_send_bgwriter(void)
     if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += s->m_timed_checkpoints;
+    shared_globalStats->requested_checkpoints += s->m_requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += s->m_checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += s->m_checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += s->m_buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += s->m_buf_written_clean;
+    shared_globalStats->maxwritten_clean += s->m_maxwritten_clean;
+    shared_globalStats->buf_written_backend += s->m_buf_written_backend;
+    shared_globalStats->buf_fsync_backend += s->m_buf_fsync_backend;
+    shared_globalStats->buf_alloc += s->m_buf_alloc;
+    LWLockRelease(StatsLock);
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4421,305 +4843,164 @@ pgstat_send_bgwriter(void)
 }
 
 
-/* ----------
- * PgstatCollectorMain() -
+/*
+ * Pin and Unpin dbentry.
  *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
+ * To save memory and time, counters are reset by recreating the dshash
+ * instead of removing entries one by one while holding a whole-dshash lock.
+ * On the other hand, a dshash cannot be destroyed until all referrers have
+ * gone, so other backends could be kept waiting on a counter reset for a
+ * long time. To avoid that, we isolate the hashes being destroyed as a
+ * previous "generation": no longer in use, but not yet removable.
+ *
+ * Before accessing the hashes of a dbentry, call pin_hashes() to acquire the
+ * current generation. unpin_hashes() removes the older generation's hashes
+ * once all referrers have gone.
  */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
+static int
+pin_hashes(PgStat_StatDBEntry *dbentry)
 {
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
+    int    generation;
 
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, pgstat_sighup_handler);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, pgstat_exit);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->refcnt++;
+    generation = dbentry->generation;
+    LWLockRelease(&dbentry->lock);
 
-    /*
-     * Identify myself via ps
-     */
-    init_ps_display("stats collector", "", "", "");
+    dshash_release_lock(pgStatDBHash, dbentry);
 
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
+    return generation;
+}
 
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do recognize got_SIGHUP inside
-     * the inner loop, which means that such interrupts will get serviced but
-     * the latch won't get cleared until next time there is a break in the
-     * action.
-     */
-    for (;;)
+/*
+ * Unpin hashes in dbentry. If the given generation is isolated, destroy it
+ * once all referrers have gone. Otherwise just decrement the reference count.
+ */
+static void
+unpin_hashes(PgStat_StatDBEntry *dbentry, int generation)
+{
+    dshash_table *tables;
+    dshash_table *funcs = NULL;
+
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+
+    /* using current generation, just decrease refcount */
+    if (dbentry->generation == generation)
     {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (need_exit)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * need_exit becomes set.
-         */
-        while (!need_exit)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (got_SIGHUP)
-            {
-                got_SIGHUP = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(
-                                                   &msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(
-                                                   &msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(
-                                                 &msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(
-                                                 &msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr & WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
+        dbentry->refcnt--;
+        LWLockRelease(&dbentry->lock);
+        return;
+    }
 
     /*
-     * Save the final stats to reuse at next startup.
+     * It is isolated, waiting for all referrers to end.
      */
-    pgstat_write_statsfiles(true, true);
+    Assert(dbentry->generation == generation + 1);
 
-    exit(0);
+    if (--dbentry->prev_refcnt > 0)
+    {
+        LWLockRelease(&dbentry->lock);
+        return;
+    }
+
+    /* no referrer remains, remove the hashes */
+    tables = dshash_attach(area, &dsh_tblparams, dbentry->prev_tables, 0);
+    if (dbentry->prev_functions != DSM_HANDLE_INVALID)
+        funcs = dshash_attach(area, &dsh_funcparams,
+                              dbentry->prev_functions, 0);
+
+    dbentry->prev_tables = DSM_HANDLE_INVALID;
+    dbentry->prev_functions = DSM_HANDLE_INVALID;
+
+    /* release the entry immediately */
+    LWLockRelease(&dbentry->lock);
+
+    dshash_destroy(tables);
+    if (funcs)
+        dshash_destroy(funcs);
+
+    return;
 }
 
-
-/* SIGQUIT signal handler for collector process */
-static void
-pgstat_exit(SIGNAL_ARGS)
+/*
+ * Attach to and return the specified generation of the table hash.
+ * The dbentry lock is held while attaching.
+ */
+static dshash_table *
+attach_table_hash(PgStat_StatDBEntry *dbent, int gen)
 {
-    int            save_errno = errno;
+    dshash_table *ret;
 
-    need_exit = true;
-    SetLatch(MyLatch);
+    LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
 
-    errno = save_errno;
+    if (dbent->generation == gen)
+        ret = dshash_attach(area, &dsh_tblparams, dbent->tables, 0);
+    else
+    {
+        Assert(dbent->generation == gen + 1);
+        Assert(dbent->prev_tables != DSM_HANDLE_INVALID);
+        ret = dshash_attach(area, &dsh_tblparams, dbent->prev_tables, 0);
+    }
+    LWLockRelease(&dbent->lock);
+
+    return ret;
 }
 
-/* SIGHUP handler for collector process */
-static void
-pgstat_sighup_handler(SIGNAL_ARGS)
+/* attach and return the specified generation of function hash */
+static dshash_table *
+attach_function_hash(PgStat_StatDBEntry *dbent, int gen)
 {
-    int            save_errno = errno;
+    dshash_table *ret = NULL;
 
-    got_SIGHUP = true;
-    SetLatch(MyLatch);
 
-    errno = save_errno;
+    LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
+
+    if (dbent->generation == gen)
+    {
+        if (dbent->functions == DSM_HANDLE_INVALID)
+        {
+            dshash_table *funchash =
+                dshash_create(area, &dsh_funcparams, 0);
+            dbent->functions = dshash_get_hash_table_handle(funchash);
+
+            ret = funchash;
+        }
+        else
+            ret = dshash_attach(area, &dsh_funcparams, dbent->functions, 0);
+    }
+    /* older generation requested; don't bother creating a useless hash */
+
+    LWLockRelease(&dbent->lock);
+
+    return ret;
+}
+
+static void
+init_dbentry(PgStat_StatDBEntry *dbentry)
+{
+    LWLockInitialize(&dbentry->lock, LWTRANCHE_STATS);
+    dbentry->generation = 0;
+    dbentry->refcnt = 0;
+    dbentry->prev_refcnt = 0;
+    dbentry->tables = DSM_HANDLE_INVALID;
+    dbentry->prev_tables = DSM_HANDLE_INVALID;
+    dbentry->functions = DSM_HANDLE_INVALID;
+    dbentry->prev_functions = DSM_HANDLE_INVALID;
 }
 
 /*
  * Subroutine to clear stats in a database entry
  *
- * Tables and functions hashes are initialized to empty.
+ * All counters are reset. Table and function dshashes are destroyed.  If
+ * any backend is pinning this dbentry, the current dshashes are stashed in
+ * the previous "generation" until all accessors have gone. If the previous
+ * generation is already occupied, the current dshashes are so fresh that they
+ * don't need to be cleared.
  */
 static void
 reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
 {
-    HASHCTL        hash_ctl;
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
 
     dbentry->n_xact_commit = 0;
     dbentry->n_xact_rollback = 0;
@@ -4744,72 +5025,860 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
     dbentry->n_block_read_time = 0;
     dbentry->n_block_write_time = 0;
 
+    if (dbentry->refcnt == 0)
+    {
+        /*
+         * No one is referring to the current hash. Removing individual
+         * entries in dshash is very costly so just destroy it.  If someone
+         * pins this entry just afterwards, pin_hashes() returns the current
+         * generation and the attach waits on the dbentry's LWLock.
+         */
+        dshash_table *tbl;
+
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+            dbentry->tables = DSM_HANDLE_INVALID;
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+            dbentry->functions = DSM_HANDLE_INVALID;
+        }
+    }
+    else if (dbentry->prev_refcnt == 0)
+    {
+        /*
+         * Someone is still referring to the current hash and previous slot is
+         * vacant. Stash out the current hash to the previous slot.
+         */
+        dbentry->prev_refcnt = dbentry->refcnt;
+        dbentry->prev_tables = dbentry->tables;
+        dbentry->prev_functions = dbentry->functions;
+        dbentry->refcnt = 0;
+        dbentry->tables = DSM_HANDLE_INVALID;
+        dbentry->functions = DSM_HANDLE_INVALID;
+        dbentry->generation++;
+    }
+    else
+    {
+        Assert(dbentry->prev_refcnt > 0 && dbentry->refcnt > 0);
+        /*
+         * If we get here, another reset request arrived while the old hashes
+         * are still waiting for all referrers to release them. That must have
+         * been quite a short time ago, so we can just ignore this request.
+         */
+    }
+
+    /* Create a new table hash if none exists */
+    if (dbentry->tables == DSM_HANDLE_INVALID)
+    {
+        dshash_table *tbl = dshash_create(area, &dsh_tblparams, 0);
+        dbentry->tables = dshash_get_hash_table_handle(tbl);
+        dshash_detach(tbl);
+    }
+
+    /* Recreate now if needed. */
+    if (dbentry->functions == DSM_HANDLE_INVALID &&
+        pgstat_track_functions != TRACK_FUNC_OFF)
+    {
+        dshash_table *tbl = dshash_create(area, &dsh_funcparams, 0);
+        dbentry->functions = dshash_get_hash_table_handle(tbl);
+        dshash_detach(tbl);
+    }
+
     dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
 
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
+    LWLockRelease(&dbentry->lock);
 }
 
 /*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+ * Create the filename for a DB stat file; filename is the output buffer, of
+ * length len.
  */
-static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
+static void
+get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
 {
-    PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
+    int            printed;
 
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/db_%u.%s",
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
+                       databaseid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
+}
 
-    if (!create && !found)
-        return NULL;
+/* ----------
+ * pgstat_write_statsfiles() -
+ *        Write the global statistics file, as well as DB files.
+ * ----------
+ */
+void
+pgstat_write_statsfiles(void)
+{
+    dshash_seq_status hstat;
+    PgStat_StatDBEntry *dbentry;
+    FILE       *fpout;
+    int32        format_id;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    int            rc;
+
+    /* Stats are not initialized yet; just return. */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
+
+    elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
     /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
+     * Open the statistics temp file to write out the current values.
      */
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        return;
+    }
+
+    /*
+     * Set the timestamp of the stats file.
+     */
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
+
+    /*
+     * Write the file header --- currently just a format ID.
+     */
+    format_id = PGSTAT_FILE_FORMAT_ID;
+    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write global stats struct
+     */
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write archiver stats struct
+     */
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Walk through the database table.
+     */
+    dshash_seq_init(&hstat, pgStatDBHash, false, false);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&hstat)) != NULL)
+    {
+        /*
+         * Write out the table and function stats for this DB into the
+         * appropriate per-DB stat file, if required.
+         */
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
+
+        pgstat_write_pgStatDBHashfile(dbentry);
+
+        /*
+         * Write out the DB entry. We don't write the tables or functions
+         * pointers, since they're of no use to any other process.
+         */
+        fputc('D', fpout);
+        rc = fwrite(dbentry,
+                    offsetof(PgStat_StatDBEntry, generation), 1, fpout);
+        (void) rc;                /* we'll check for error with ferror */
+    }
+
+    /*
+     * No more output to be done. Close the temp file and replace the old
+     * pgstat.stat with it.  The ferror() check replaces testing for error
+     * after each individual fputc or fwrite above.
+     */
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+}
+
+/* ----------
+ * pgstat_write_pgStatDBHashfile() -
+ *        Write the stat file for a single database.
+ * ----------
+ */
+static void
+pgstat_write_pgStatDBHashfile(PgStat_StatDBEntry *dbentry)
+{
+    dshash_seq_status tstat;
+    dshash_seq_status fstat;
+    PgStat_StatTabEntry *tabentry;
+    PgStat_StatFuncEntry *funcentry;
+    FILE       *fpout;
+    int32        format_id;
+    Oid            dbid = dbentry->databaseid;
+    int            rc;
+    char        tmpfile[MAXPGPATH];
+    char        statfile[MAXPGPATH];
+    dshash_table *tbl;
+
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
+
+    elog(DEBUG2, "writing stats file \"%s\"", statfile);
+
+    /*
+     * Open the statistics temp file to write out the current values.
+     */
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        return;
+    }
+
+    /*
+     * Write the file header --- currently just a format ID.
+     */
+    format_id = PGSTAT_FILE_FORMAT_ID;
+    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Walk through the database's access stats per table.
+     */
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&tstat, tbl, false, false);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&tstat)) != NULL)
+    {
+        fputc('T', fpout);
+        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
+        (void) rc;                /* we'll check for error with ferror */
+    }
+    dshash_detach(tbl);
+
+    /*
+     * Walk through the database's function stats table.
+     */
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_seq_init(&fstat, tbl, false, false);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&fstat)) != NULL)
+        {
+            fputc('F', fpout);
+            rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+            (void) rc;                /* we'll check for error with ferror */
+        }
+        dshash_detach(tbl);
+    }
+
+    /*
+     * No more output to be done. Close the temp file and replace the old
+     * pgstat.stat with it.  The ferror() check replaces testing for error
+     * after each individual fputc or fwrite above.
+     */
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+}
+
+/* ----------
+ * pgstat_read_statsfiles() -
+ *
+ *    Reads in existing statistics collector files into the shared stats hash.
+ *
+ * ----------
+ */
+void
+pgstat_read_statsfiles(void)
+{
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatDBEntry dbbuf;
+    FILE       *fpin;
+    int32        format_id;
+    bool        found;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+
+    /* shouldn't be called from postmaster  */
+    Assert(IsUnderPostmaster);
+
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
+
+    /*
+     * Set the current timestamp (will be kept only in case we can't load an
+     * existing statsfile).
+     */
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp =
+        shared_globalStats->stat_reset_timestamp;
+
+    /*
+     * Try to open the stats file. If it doesn't exist, the backends simply
+     * return zero for anything and the collector simply starts from scratch
+     * with empty counters.
+     *
+     * ENOENT is a possibility if the stats collector is not running or has
+     * not yet written the stats file the first time.  Any other failure
+     * condition is suspicious.
+     */
+    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(LOG,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            statfile)));
+        return;
+    }
+
+    /*
+     * Verify it's of the expected format.
+     */
+    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
+        format_id != PGSTAT_FILE_FORMAT_ID)
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        goto done;
+    }
+
+    /*
+     * Read global stats struct
+     */
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+        sizeof(*shared_globalStats))
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+        goto done;
+    }
+
+    /*
+     * Read archiver stats struct
+     */
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+        sizeof(*shared_archiverStats))
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        goto done;
+    }
+
+    /*
+     * We found an existing collector stats file. Read it and put all the
+     * hashtable entries into place.
+     */
+    for (;;)
+    {
+        switch (fgetc(fpin))
+        {
+                /*
+                 * 'D'    A PgStat_StatDBEntry struct describing a database
+                 * follows.
+                 */
+            case 'D':
+                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, generation),
+                          fpin) != offsetof(PgStat_StatDBEntry, generation))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                /*
+                 * Add to the DB hash
+                 */
+                dbentry = (PgStat_StatDBEntry *)
+                    dshash_find_or_insert(pgStatDBHash, (void *) &dbbuf.databaseid,
+                                          &found);
+
+                /* don't allow duplicate dbentries */
+                if (found)
+                {
+                    dshash_release_lock(pgStatDBHash, dbentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                /* initialize the new shared entry */
+                init_dbentry(dbentry);
+
+                memcpy(dbentry, &dbbuf,
+                       offsetof(PgStat_StatDBEntry, generation));
+
+                /* Read the data from the database-specific file. */
+                pgstat_read_pgStatDBHashfile(dbentry);
+                dshash_release_lock(pgStatDBHash, dbentry);
+                break;
+
+            case 'E':
+                goto done;
+
+            default:
+                ereport(LOG,
+                        (errmsg("corrupted statistics file \"%s\"",
+                                statfile)));
+                goto done;
+        }
+    }
+
+done:
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+
+    return;
+}
+
+
+/* ----------
+ * pgstat_read_pgStatDBHashfile() -
+ *
+ *    Reads in the at-rest statistics file and creates shared statistics
+ *    tables. The file is removed after reading.
+ * ----------
+ */
+static void
+pgstat_read_pgStatDBHashfile(PgStat_StatDBEntry *dbentry)
+{
+    PgStat_StatTabEntry *tabentry;
+    PgStat_StatTabEntry tabbuf;
+    PgStat_StatFuncEntry funcbuf;
+    PgStat_StatFuncEntry *funcentry;
+    dshash_table         *tabhash = NULL;
+    dshash_table         *funchash = NULL;
+    FILE       *fpin;
+    int32        format_id;
+    bool        found;
+    char        statfile[MAXPGPATH];
+
+    get_dbstat_filename(false, dbentry->databaseid, statfile, MAXPGPATH);
+
+    /*
+     * Try to open the stats file. If it doesn't exist, the backends simply
+     * return zero for anything and the collector simply starts from scratch
+     * with empty counters.
+     *
+     * ENOENT is a possibility if the stats collector is not running or has
+     * not yet written the stats file the first time.  Any other failure
+     * condition is suspicious.
+     */
+    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(LOG,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            statfile)));
+        return;
+    }
+
+    /*
+     * Verify it's of the expected format.
+     */
+    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
+        format_id != PGSTAT_FILE_FORMAT_ID)
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        goto done;
+    }
+
+    /*
+     * We found an existing statistics file. Read it and put all the hashtable
+     * entries into place.
+     */
+    for (;;)
+    {
+        switch (fgetc(fpin))
+        {
+                /*
+                 * 'T'    A PgStat_StatTabEntry follows.
+                 */
+            case 'T':
+                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
+                          fpin) != sizeof(PgStat_StatTabEntry))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                if (tabhash == NULL)
+                {
+                    tabhash = dshash_create(area, &dsh_tblparams, 0);
+                    dbentry->tables =
+                        dshash_get_hash_table_handle(tabhash);
+                }
+
+                tabentry = (PgStat_StatTabEntry *)
+                    dshash_find_or_insert(tabhash,
+                                          (void *) &tabbuf.tableid, &found);
+
+                /* don't allow duplicate entries */
+                if (found)
+                {
+                    dshash_release_lock(tabhash, tabentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
+                dshash_release_lock(tabhash, tabentry);
+                break;
+
+                /*
+                 * 'F'    A PgStat_StatFuncEntry follows.
+                 */
+            case 'F':
+                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
+                          fpin) != sizeof(PgStat_StatFuncEntry))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                if (funchash == NULL)
+                {
+                    funchash = dshash_create(area, &dsh_funcparams, 0);
+                    dbentry->functions =
+                        dshash_get_hash_table_handle(funchash);
+                }
+
+                funcentry = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert(funchash,
+                                          (void *) &funcbuf.functionid, &found);
+
+                if (found)
+                {
+                    dshash_release_lock(funchash, funcentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
+                dshash_release_lock(funchash, funcentry);
+                break;
+
+                /*
+                 * 'E'    The EOF marker of a complete stats file.
+                 */
+            case 'E':
+                goto done;
+
+            default:
+                ereport(LOG,
+                        (errmsg("corrupted statistics file \"%s\"",
+                                statfile)));
+                goto done;
+        }
+    }
+
+done:
+    if (tabhash)
+        dshash_detach(tabhash);
+    if (funchash)
+        dshash_detach(funchash);
+
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+}
+
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ *    Create pgStatLocalContext and pgStatSnapshotContext, if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+    if (!pgStatLocalContext)
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+
+    if (!pgStatSnapshotContext)
+        pgStatSnapshotContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Database statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+}
+
+/* ----------
+ * pgstat_clear_snapshot() -
+ *
+ *    Discard any data collected in the current transaction.  Any subsequent
+ *    request will cause new snapshots to be read.
+ *
+ *    This is also invoked during transaction commit or abort to discard
+ *    the no-longer-wanted snapshot.
+ * ----------
+ */
+void
+pgstat_clear_snapshot(void)
+{
+    /* Release memory, if any was allocated */
+    if (pgStatLocalContext)
+    {
+        MemoryContextDelete(pgStatLocalContext);
+
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
+    }
+
+    if (pgStatSnapshotContext)
+        snapshot_cleard = true;
+}
+
+static bool
+pgstat_update_tabentry(dshash_table *tabhash, PgStat_TableStatus *stat,
+                       bool nowait)
+{
+    PgStat_StatTabEntry *tabentry;
+    bool    found;
+
+    if (tabhash == NULL)
+        return false;
+
+    tabentry = (PgStat_StatTabEntry *)
+        dshash_find_or_insert_extended(tabhash, (void *) &(stat->t_id),
+                                       &found, nowait);
+
+    /* failed to acquire lock */
+    if (tabentry == NULL)
+        return false;
+
     if (!found)
-        reset_dbentry_counters(result);
+    {
+        /*
+         * If it's a new table entry, initialize counters to the values we
+         * just got.
+         */
+        tabentry->numscans = stat->t_counts.t_numscans;
+        tabentry->tuples_returned = stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched = stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted = stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated = stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted = stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated = stat->t_counts.t_tuples_hot_updated;
+        tabentry->n_live_tuples = stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples = stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze = stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched = stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit = stat->t_counts.t_blocks_hit;
+
+        tabentry->vacuum_timestamp = 0;
+        tabentry->vacuum_count = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_count = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->analyze_count = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+        tabentry->autovac_analyze_count = 0;
+    }
+    else
+    {
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        tabentry->numscans += stat->t_counts.t_numscans;
+        tabentry->tuples_returned += stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched += stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted += stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated += stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted += stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated += stat->t_counts.t_tuples_hot_updated;
+        /* If table was truncated, first reset the live/dead counters */
+        if (stat->t_counts.t_truncated)
+        {
+            tabentry->n_live_tuples = 0;
+            tabentry->n_dead_tuples = 0;
+        }
+        tabentry->n_live_tuples += stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples += stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze += stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched += stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit += stat->t_counts.t_blocks_hit;
+    }
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
+
+    dshash_release_lock(tabhash, tabentry);
+
+    return true;
+}
+
+static void
+pgstat_update_dbentry(PgStat_StatDBEntry *dbentry, PgStat_TableStatus *stat)
+{
+    /*
+     * Add per-table stats to the per-database entry, too.
+     */
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
+    dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
+    dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
+    dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
+    dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
+    dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
+    dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
+    LWLockRelease(&dbentry->lock);
+}
+
+/*
+ * Look up the shared stats hash entry for the specified database. Returns NULL
+ * if PGSTAT_NOWAIT is given and the required lock cannot be acquired.
+ */
+static PgStat_StatDBEntry *
+pgstat_get_db_entry(Oid databaseid, int op, PgStat_TableLookupResult *status)
+{
+    PgStat_StatDBEntry *result;
+    bool        nowait = ((op & PGSTAT_NOWAIT) != 0);
+    bool        lock_acquired = true;
+    bool        found = true;
+
+    if (!IsUnderPostmaster || !pgStatDBHash)
+        return NULL;
+
+    /* Lookup or create the hash table entry for this database */
+    if (op & PGSTAT_EXCLUSIVE)
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_or_insert_extended(pgStatDBHash, &databaseid,
+                                           &found, nowait);
+        if (result == NULL)
+            lock_acquired = false;
+        else if (!found)
+        {
+            /*
+             * If not found, initialize the new one.  This creates empty hash
+             * tables, too.
+             */
+            init_dbentry(result);
+            reset_dbentry_counters(result);
+        }
+    }
+    else
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_extended(pgStatDBHash, &databaseid, true, nowait,
+                                 nowait ? &lock_acquired : NULL);
+        if (result == NULL)
+            found = false;
+    }
+
+    /* Set return status if requested */
+    if (status)
+    {
+        if (!lock_acquired)
+        {
+            Assert(nowait);
+            *status = LOCK_FAILED;
+        }
+        else if (!found)
+            *status = NOT_FOUND;
+        else
+            *status = FOUND;
+    }
 
     return result;
 }
 
-
 /*
  * Lookup the hash table entry for the specified table. If no hash
  * table entry exists, initialize it, if the create parameter is true.
  * Else, return NULL.
  */
 static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create)
 {
     PgStat_StatTabEntry *result;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
 
     /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
+    if (create)
+        result = (PgStat_StatTabEntry *)
+            dshash_find_or_insert(table, &tableoid, &found);
+    else
+        result = (PgStat_StatTabEntry *) dshash_find(table, &tableoid, false);
 
     if (!create && !found)
         return NULL;
@@ -4842,1702 +5911,6 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
     return result;
 }
 
-
-/* ----------
- * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
- * ----------
- */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
-{
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    FILE       *fpout;
-    int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-    int            rc;
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Set the timestamp of the stats file.
-     */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Write global stats struct
-     */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Write archiver stats struct
-     */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database table.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        /*
-         * Write out the table and function stats for this DB into the
-         * appropriate per-DB stat file, if required.
-         */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
-
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
-
-        /*
-         * Write out the DB entry. We don't write the tables or functions
-         * pointers, since they're of no use to any other process.
-         */
-        fputc('D', fpout);
-        rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
-}
-
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
-
-/* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
- * ----------
- */
-static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
-{
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpout;
-    int32        format_id;
-    Oid            dbid = dbentry->databaseid;
-    int            rc;
-    char        tmpfile[MAXPGPATH];
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database's access stats per table.
-     */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
-    {
-        fputc('T', fpout);
-        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * Walk through the database's function stats table.
-     */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_statsfiles() -
- *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
- *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
- * ----------
- */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
-
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-
-    /*
-     * Set the current timestamp (will be kept only in case we can't load an
-     * existing statsfile).
-     */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return dbhash;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
-        goto done;
-    }
-
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Add to the DB hash
-                 */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-
-    return dbhash;
-}
-
-
-/* ----------
- * pgstat_read_db_statsfile() -
- *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
- * ----------
- */
-static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
-{
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatTabEntry tabbuf;
-    PgStat_StatFuncEntry funcbuf;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'T'    A PgStat_StatTabEntry follows.
-                 */
-            case 'T':
-                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
-                          fpin) != sizeof(PgStat_StatTabEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
-                break;
-
-                /*
-                 * 'F'    A PgStat_StatFuncEntry follows.
-                 */
-            case 'F':
-                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
-                          fpin) != sizeof(PgStat_StatFuncEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
-                break;
-
-                /*
-                 * 'E'    The EOF marker of a complete stats file.
-                 */
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
-}
-
-
-/* ----------
- * pgstat_setup_memcxt() -
- *
- *    Create pgStatLocalContext, if not already done.
- * ----------
- */
-static void
-pgstat_setup_memcxt(void)
-{
-    if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
-}
-
-
-/* ----------
- * pgstat_clear_snapshot() -
- *
- *    Discard any data collected in the current transaction.  Any subsequent
- *    request will cause new snapshots to be read.
- *
- *    This is also invoked during transaction commit or abort to discard
- *    the no-longer-wanted snapshot.
- * ----------
- */
-void
-pgstat_clear_snapshot(void)
-{
-    /* Release memory, if any was allocated */
-    if (pgStatLocalContext)
-        MemoryContextDelete(pgStatLocalContext);
-
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetshared() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
-}
-
 /*
  * Convert a potentially unsafely truncated activity string (see
  * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index ec9a7ca311..02ce657daf 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -255,7 +255,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -503,7 +502,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1303,12 +1301,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1757,11 +1749,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2647,8 +2634,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -2980,8 +2965,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3048,13 +3031,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3129,22 +3105,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3603,22 +3563,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3814,8 +3758,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3850,8 +3792,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4052,8 +3993,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5026,18 +4965,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5150,12 +5077,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6044,7 +5965,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6100,8 +6020,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6336,7 +6254,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index d7d733530f..fdc0959624 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -147,6 +147,7 @@ CreateSharedMemoryAndSemaphores(int port)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -275,8 +276,13 @@ CreateSharedMemoryAndSemaphores(int port)
 
     /* Initialize dynamic shared memory facilities. */
     if (!IsUnderPostmaster)
+    {
         dsm_postmaster_startup(shim);
 
+        /* Stats collector uses dynamic shared memory  */
+        StatsShmemInit();
+    }
+
     /*
      * Now give loadable modules a chance to set up their shmem allocations
      */
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index bc1aa88322..b9c33d6044 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -522,6 +522,7 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
     LWLockRegisterTranche(LWTRANCHE_SXACT, "serializable_xact");
+    LWLockRegisterTranche(LWTRANCHE_STATS, "activity stats");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 44a59e1d4f..2f9dd19ab6 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3152,6 +3152,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3726,6 +3732,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4166,9 +4173,17 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
-                ProcessCompletedNotifies();
-                pgstat_report_stat(false);
+                long stats_timeout;
 
+                ProcessCompletedNotifies();
+
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4203,7 +4218,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4211,6 +4226,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index a5950c1e8c..b24631b7b1 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index e9f72b5069..731ef0e27c 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -74,6 +74,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -633,6 +634,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1243,6 +1246,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 33869fecc9..4f656c98a3 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -6,7 +6,7 @@ use File::Basename qw(basename dirname);
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 106;
+use Test::More tests => 105;
 
 program_help_ok('pg_basebackup');
 program_version_ok('pg_basebackup');
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index c33abee7aa..6aa9e3c121 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f3d4cb5637..2196b0bd38 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL statistics collector facility.
  *
  *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
  *
@@ -14,10 +14,11 @@
 #include "datatype/timestamp.h"
 #include "fmgr.h"
 #include "libpq/pqcomm.h"
-#include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -41,33 +42,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -78,9 +52,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -116,13 +89,6 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_blocks_hit;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -181,236 +147,12 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
 /* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
- */
-typedef struct PgStat_MsgHdr
-{
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
+ * PgStat_MsgBgWriter            bgwriter statistics
  * ----------
  */
 typedef struct PgStat_MsgBgWriter
 {
-    PgStat_MsgHdr m_hdr;
-
     PgStat_Counter m_timed_checkpoints;
     PgStat_Counter m_requested_checkpoints;
     PgStat_Counter m_buf_written_checkpoints;
@@ -423,38 +165,14 @@ typedef struct PgStat_MsgBgWriter
     PgStat_Counter m_checkpoint_sync_time;
 } PgStat_MsgBgWriter;
 
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
-
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -486,96 +204,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz    m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Statistic collector data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -615,16 +245,29 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_block_write_time;
 
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;        /* time of db stats update */
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The followings must be last in the struct, because we don't write them
+     * out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    int generation;                        /* current generation of the below */
+    int    refcnt;                            /* current gen reference count */
+    dshash_table_handle tables;            /* current gen tables hash */
+    dshash_table_handle functions;        /* current gen functions hash */
+    int    prev_refcnt;                    /* prev gen reference count */
+    dshash_table_handle prev_tables;    /* prev gen tables hash */
+    dshash_table_handle prev_functions;    /* prev gen functions hash */
+    LWLock    lock;                        /* Lock for the above members */
+
+    /* non-shared members */
+    HTAB *snapshot_tables;                 /* table entry snapshot */
+    HTAB *snapshot_functions;             /* function entry snapshot */
+    dshash_table    *dshash_tables;         /* attached tables dshash */
+    dshash_table    *dshash_functions;     /* attached functions dshash */
 } PgStat_StatDBEntry;
 
+#define SHARED_DBENT_SIZE offsetof(PgStat_StatDBEntry, snapshot_tables)
 
 /* ----------
  * PgStat_StatTabEntry            The collector's data per table (or index)
@@ -663,7 +306,7 @@ typedef struct PgStat_StatTabEntry
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            per function stats data
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
@@ -678,7 +321,7 @@ typedef struct PgStat_StatFuncEntry
 
 
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -694,7 +337,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -780,7 +423,6 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
     WAIT_EVENT_RECOVERY_WAL_ALL,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
@@ -1215,6 +857,8 @@ extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used, but will be removed with GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
@@ -1236,29 +880,26 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
 
+/* File input/output functions  */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 
 extern void pgstat_report_autovac(Oid dboid);
@@ -1429,11 +1070,13 @@ extern void pgstat_send_bgwriter(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_extended(PgStat_StatDBEntry *dbent, Oid relid);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 08e0dc8144..30d5fb63c5 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -220,6 +220,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_SXACT,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 9244a2a7b7..a9b625211b 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.16.3

From 4009517f9dc159e2601b0e9f788db645a296a0d6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 27 Nov 2018 14:42:12 +0900
Subject: [PATCH 5/5] Remove the GUC stats_temp_directory

The GUC was used to specify the directory in which to store temporary
statistics files. It is no longer needed by the stats collector but is
still used by the programs in bin and contrib, and maybe by other
extensions. Thus this patch removes the GUC, but some backing variables
and macro definitions are left alone for backward compatibility.
---
 doc/src/sgml/backup.sgml                      |  2 --
 doc/src/sgml/config.sgml                      | 19 -------------
 doc/src/sgml/monitoring.sgml                  |  7 +----
 doc/src/sgml/storage.sgml                     |  3 +-
 src/backend/postmaster/pgstat.c               | 13 ++++-----
 src/backend/replication/basebackup.c          | 13 ++-------
 src/backend/utils/misc/guc.c                  | 41 ---------------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/pgstat.h                          |  5 +++-
 9 files changed, 14 insertions(+), 90 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index b67da8916a..f0daffbc93 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1145,8 +1145,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 84341a30e5..6ad619de47 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6827,25 +6827,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 0fe888e4db..31fc3604ea 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -195,12 +195,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
+   <productname>PostgreSQL</productname> processes through shared memory.
    When the server shuts down cleanly, a permanent copy of the statistics
    data is stored in the <filename>pg_stat</filename> subdirectory, so that
    statistics can be retained across server restarts.  When recovery is
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index c4bac87e80..c264f839e7 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -122,8 +122,7 @@ Item
 
 <row>
  <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
+ <entry>Subdirectory containing ephemeral files for extensions</entry>
 </row>
 
 <row>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 24dab5aa54..94df252597 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -109,15 +109,12 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
+/*
+ * This used to be a GUC variable and is no longer used in this file, but is
+ * left alone with its default value just for backward compatibility with
+ * extensions.
  */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+char       *pgstat_stat_directory = PG_STAT_TMP_DIR;
 
 #define        StatsLock (&StatsShmem->StatsMainLock)
 
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 36dcb28754..5b6de31ff6 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -230,11 +230,8 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -265,13 +262,9 @@ perform_base_backup(basebackup_options *opt)
          * Calculate the relative path of temporary statistics directory in
          * order to skip the files which are located in that directory later.
          */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
+
+        Assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
+        statrelpath = psprintf("./%s", PG_STAT_TMP_DIR);
 
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ed51da4234..5c7e83ffdb 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -194,7 +194,6 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static void assign_effective_io_concurrency(int newval, void *extra);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -4085,17 +4084,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11363,35 +11351,6 @@ assign_effective_io_concurrency(int newval, void *extra)
 #endif                            /* USE_PREFETCH */
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5ee5e09ddf..0ba984b074 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -562,7 +562,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 2196b0bd38..4e0a05b2ce 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -31,7 +31,10 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
+/*
+ * This used to be the directory to store temporary statistics data in but is
+ * no longer used. Defined here for backward compatibility.
+ */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
 
 /* Values for track_functions GUC variable --- order is significant! */
-- 
2.16.3


Re: shared-memory based stats collector

From
Thomas Munro
Date:
On Fri, May 17, 2019 at 6:48 PM Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Fixed. Version 20.

Hello Horiguchi-san,

A new Commitfest is here.  This doesn't apply (maybe just because of
the new improved pgindent).  Could we please have a fresh rebase?

Thanks,

-- 
Thomas Munro
https://enterprisedb.com



Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
Hello.

At Mon, 1 Jul 2019 23:19:31 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in
<CA+hUKGK5WNCEe9g4ie=-6Oym-WNqYBXX9A1qPgKv89KGkzW72g@mail.gmail.com>
> > Fixed. Version 20.
> 
> Hello Horiguchi-san,
> 
> A new Commitfest is here.  This doesn't apply (maybe just because of
> the new improved pgindent).  Could we please have a fresh rebase?

Thank you for noticing, Thomas. I rebased the patch and made some
improvements. In addition, I wrote a new test script.

- Rebased.

- Reworded almost all of the comments, since many of them were
  found to be broken. Added some comments.

- Shortened some LWLocked code paths.

- Got rid of a useless palloc for the snapshot area of globalStats.

The attached files are:

gendb.pl:
  script to generate databases.

statbehch.pl:
  benchmarking script.

0001-sequential-scan-for-dshash.patch:
  Adds sequential scan feature to dshash

0002-Add-conditional-lock-feature-to-dshash.patch:
  Adds conditional lock feature to dshash

0003-Make-archiver-process-an-auxiliary-process.patch:
  Changes the archiver process to an auxiliary process. This is
  needed to let the archiver access shared memory.

0004-Shared-memory-based-stats-collector.patch:
  The main part of this patchset. Changes pgstat from a process
  connected by a socket to shared memory shared among backends.

0005-Remove-the-GUC-stats_temp_directory.patch:
  Removes the GUC stats_temp_directory. Separated from 0004 to
  keep that patch smaller.

====
Tomas said upthread:
> > While reviewing the patch I've always had issue with evaluating how it
> > behaves for various scenarios / workloads. The reviews generally did one
> > specific benchmark, but I find that unsatisfactory. I wonder whether if
> > we could develop a small set of more comprehensive workloads for this
> > patch (i.e. different numbers of objects, access patterns, ...).

The structure of the shared stats area is as follows:

  dshash for database stats
   + dshash entry for db1
      + dshash for table stats
          + dshash entry for tbl1
          + dshash entry for tbl2
          + dshash entry for tbl3
           ...
   + db2
     ...
   + db3
   ...

Dshash restricts an entry to being accessed by only a single
process at a time. This is quite inconvenient, since a database
hash entry becomes a bottleneck. On the other hand, a dbstat
dshash entry in use by a backend is never removed, since removal
happens only after all accessors of the database have gone. So
this patch releases the dbstats dshash entry immediately so that
it doesn't become a bottleneck.

Another bottleneck would be lock conflicts on a
database/table/function stats entry. This is avoided, as in the
existing stats collector, by enforcing intervals of no less than
500 ms (PGSTAT_STAT_MIN_INTERVAL) between two successive updates
from one process, and by skipping the update if the lock would
conflict.

Yet another bottleneck was the conflict between reset and
update. Since all processes work on the same data in shared
memory, counters cannot be reset until all referrers are
gone. So I let a dbentry have two sets of table/function stats
dshashes in order to separate accessors that come after a reset
from existing accessors. A process "pins" the current
table/function dshashes before accessing them
(pin_hashes()/unpin_hashes()). All updates in the round are
performed on the "pinned" generation of dshashes. If two or more
successive reset requests come in a very short time, the requests
other than the first one are simply ignored
(reset_dbentry_counters()). So a client can see some non-zero
numbers just after a reset if many processes reset stats at the
same time, but I don't think that is worth amending.
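
The two-generation idea can be sketched outside PostgreSQL roughly
as follows. This is only a simplified, single-process illustration
and not the code in the patch: DbEntry, pin_generation(),
unpin_generation() and reset_counters() are hypothetical names, the
counters are a small fixed array instead of dshashes, and the
locking that protects the entry in shared memory is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NTABLES 4               /* tiny fixed table set, illustration only */

typedef struct DbEntry
{
    int         generation;     /* index (0 or 1) of the current counter set */
    int         refcount[2];    /* accessors pinning each counter set */
    int64_t     counters[2][NTABLES];
} DbEntry;

/* Pin the current counter set; all updates of this round go to it. */
int
pin_generation(DbEntry *db)
{
    int         gen = db->generation;

    db->refcount[gen]++;
    return gen;
}

/* Release a counter set pinned earlier by pin_generation(). */
void
unpin_generation(DbEntry *db, int gen)
{
    assert(db->refcount[gen] > 0);
    db->refcount[gen]--;
}

/*
 * Reset: direct new accessors to the other counter set.  If that set is
 * still pinned by accessors left over from before the previous reset,
 * the request is ignored and false is returned.
 */
bool
reset_counters(DbEntry *db)
{
    int         next = 1 - db->generation;

    if (db->refcount[next] > 0)
        return false;           /* previous reset has not drained yet */

    memset(db->counters[next], 0, sizeof(db->counters[next]));
    db->generation = next;
    return true;
}
```

An accessor that pinned the old set keeps updating it after a reset,
while accessors arriving later see zeroed counters. A second reset
issued before the old set is unpinned returns false, which mirrors
the "requests other than the first one are ignored" behavior
described above.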

All in all, almost all potential bottlenecks are eliminated in
this patch. If n clients are running, the mean interval between
updates is 500/n ms (0.5 ms for 1000 clients), so 1000 or more
clients could hit the bottleneck, but I think the current stats
collector also suffers with that many clients. (Whatever would
happen with such a massive number of processes, I don't have an
environment that can keep that many clients/backends alive...)

That being said, this patch must still update stats within a
reasonable period, 1000 ms in this patch
(PGSTAT_STAT_MAX_INTERVAL). Once that duration has elapsed, the
patch waits until all required locks are acquired. So the
remaining possible bottleneck is on the database and
table/function shared hash entries. The conflicts are triggered
most effectively when many processes on the same database update
the same table.
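
The interplay of the two intervals can be illustrated with a small
standalone sketch. It is not the patch's code: SharedCounter,
LocalPending and flush_pending() are hypothetical names, a pthread
mutex stands in for the dshash partition lock, and only the 500 ms
and 1000 ms values are taken from the description above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

#define MIN_INTERVAL_MS 500     /* never flush more often than this */
#define MAX_INTERVAL_MS 1000    /* after this long, wait for the lock */

typedef struct SharedCounter
{
    pthread_mutex_t lock;       /* stands in for a dshash partition lock */
    int64_t     total;          /* value visible to every process */
} SharedCounter;

typedef struct LocalPending
{
    int64_t     pending;        /* counts not yet flushed */
    int64_t     last_flush_ms;  /* time of the last successful flush */
} LocalPending;

/*
 * Try to flush locally accumulated counts.  Returns true if they were
 * written to the shared counter, false if the flush was postponed.
 */
bool
flush_pending(SharedCounter *shared, LocalPending *local, int64_t now_ms)
{
    int64_t     elapsed = now_ms - local->last_flush_ms;

    assert(elapsed >= 0);

    if (elapsed < MIN_INTERVAL_MS)
        return false;           /* too soon since the last update */

    if (elapsed < MAX_INTERVAL_MS)
    {
        /* Not overdue yet: skip rather than wait on a busy lock. */
        if (pthread_mutex_trylock(&shared->lock) != 0)
            return false;
    }
    else
        pthread_mutex_lock(&shared->lock);  /* overdue: wait for the lock */

    shared->total += local->pending;
    pthread_mutex_unlock(&shared->lock);

    local->pending = 0;
    local->last_flush_ms = now_ms;
    return true;
}
```

A postponed flush keeps its pending counts, so nothing is lost; the
counts simply reach shared memory later. Only the blocking branch
can contend, which is why the remaining conflicts are concentrated
on entries that many processes update at once.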

I remade the benchmark script so that many parameters can be
changed easily. I took numbers for the following access
patterns. Every number is the mean of 6 runs. I chose the
configurations so that no disk access happens while running the
benchmark.

#db      : number of databases accessed
#tbl     : number of tables per database
#clients : number of stats-updator clients
#iter    : number of query iterations
#xactlen : number of queries in a transaction
#referers: number of stats-referencing clients

    #db  #tbl  #clients  #iter  #xactlen #referers
A1:   1     1         1  20000        10        0
A2:   1     1         1  20000        10        1
B1:   1     1        90   2000        10        0
B2:   1     1        90   2000        10        1
C1:   1    50        90   2000        10        0
C2:   1    50        90   2000        10        1
D1:  50     1        90   2000        10        0
D2:  50     1        90   2000        10        1
E1:  50     1        90   2000        10       10
F1:  50     1        10   2000        10       90



                master                               patched
        updator       referrer               updator       referrer       
       time / stdev  count / stdev          time / stdev   count / stdev    
A1: 1769.13 / 71.87                      1729.97 / 61.58
A2: 1903.94 / 75.65  2906.67 /  78.28    1849.41 / 43.00  2855.33 /  62.95
B1: 2967.84 /  9.88                      2984.20 /  6.10
B2: 3005.38 /  5.32   253.00 /  33.09    3007.26 /  5.70   253.33 /  60.63
C1: 3066.14 / 13.80                      3069.34 / 11.65
C2: 3353.66 /  8.14   282.92 /  20.65    3341.36 / 12.44   251.65 /  21.13
D1: 2977.12 /  5.12                      2991.60 /  6.68
D2: 3005.80 /  6.44   252.50 /  38.34    3010.58 /  7.34   282.33 /  57.07
E1: 3255.47 /  8.91   244.02 /  17.03    3293.88 / 18.05   249.13 /  14.58
F1: 2620.85 /  9.17   202.46 /   3.35    2668.60 / 41.04   208.19 /   6.79


ratio (100: same, smaller value means patched version is faster)

          updator          referrer
     patched/master(%)    master/patched (%)
A1:          97.79            -
A2:          97.14          101.80
B1:         100.55
B2:         100.06           99.87
C1:         100.10
C2:          99.63          112.43
D1:         100.49
D2:         100.16           89.43
E1:         101.18           97.95
F1:         101.82           97.25
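
For reference, the two ratio columns are computed in opposite
directions so that a value below 100 means "patched is faster" in
both. A minimal sketch of the arithmetic, with hypothetical function
names:

```c
#include <assert.h>

/*
 * Updator column: elapsed time, lower is better, so the ratio is
 * patched / master.
 */
double
updator_ratio_percent(double master_ms, double patched_ms)
{
    assert(master_ms > 0.0);
    return 100.0 * patched_ms / master_ms;
}

/*
 * Referrer column: completed query count, higher is better, so the
 * ratio is inverted to master / patched.
 */
double
referrer_ratio_percent(double master_count, double patched_count)
{
    assert(patched_count > 0.0);
    return 100.0 * master_count / patched_count;
}
```

Feeding the A2 row into these functions (1903.94 and 1849.41 ms,
2906.67 and 2855.33 queries) gives values that round to the 97.14
and 101.80 listed in the table.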


Mmm... I don't see any distinctive tendency. The referrer side
shows larger fluctuation, but I'm not sure that suggests anything
meaningful.

I'll rerun the benchmarks with a longer period (more iterations).

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
#! /usr/bin/perl

use strict;

my $ndbs = 100;        # number of databases
my $ntbls = 100;    # number of tables per database

for (my $i = 0 ; $i < $ndbs ; $i++)
{
    system(sprintf("dropdb db%03d;\n", $i));
    system(sprintf("createdb db%03d;\n", $i));
  
    for (my $j = 0 ; $j < $ntbls ; $j++)
    {
        system(sprintf("psql db%03d -c \"drop table if exists t%03d; create table t%03d as select v as a, v * 2 as b, v * 3 as c, v * 4 as d from generate_series(0, 9) v;\"", $i, $j, $j));
 
    }
}
#! /usr/bin/perl

use strict;
use IPC::Open2;
use Time::HiRes  qw( usleep ualarm gettimeofday tv_interval );
use Errno qw/ECHILD/;

my $ndbs = 50;
my $ntbls = 100;

my $loops = 6;
my $chiter = 2000;
my $chtrlen = 10;
my $nprocs = 90;

my $trg_file = '/tmp/dbrun_pl.trg';
my $refresult_file = '/tmp/refresult.txt';

#       title  loops,ndbs,clnts,ntbls,childiter   ,  xactlen,refprocs
testrun("A1", $loops,   1,    1,    1,  $chiter*10, $chtrlen,     0);
testrun("A2", $loops,   1,    1,    1,  $chiter*10, $chtrlen,     1);
testrun("B1", $loops,   1,   90,    1,  $chiter   , $chtrlen,     0);
testrun("B2", $loops,   1,   90,    1,  $chiter   , $chtrlen,     1);
testrun("C1", $loops,   1,   90,   50,  $chiter   , $chtrlen,     0);
testrun("C2", $loops,   1,   90,   50,  $chiter   , $chtrlen,    10);
testrun("D1", $loops,  50,   90,    1,  $chiter   , $chtrlen,     0);
testrun("D2", $loops,  50,   90,    1,  $chiter   , $chtrlen,     1);
testrun("E1", $loops,  50,   90,    1,  $chiter   , $chtrlen,    10);
testrun("F1", $loops ,  1,   10,    1,  $chiter   , $chtrlen,    90);
exit;

sub testrun
{
    my ($test_name, $loops, $ndbs, $nprocs, $ntbls,
        $childiter, $childtrlen, $nreferers)
        = @_;

    my @results = ();
    my @refresults = ();

    # run each iteration
    for (my $l = 0 ; $l < $loops ; $l++)
    {
        my %starttime = ();
        my %endtime = ();

        # This file is used for stopping free-running processes.
        open(OUT, '>', $trg_file) || die "failed to open file:$!\n";
        print OUT "$$\n";
        close(OUT);

        # It's very mysterious that one dummy subprocess makes things
        # stable. Perl runs slowly if there's only one psql process
        # without this...
        {
            my $pid = fork;
            if ($pid == 0)
            {
                while (-f $trg_file)
                {
                    usleep (500000);
                }
                exit;
            }
        }

        # start referer processes, collecting how many times the query ran.
        pipe(my $refresrd, my $refreswr);
        for(my $i = 0 ; $i < $nreferers ; $i++)
        {
            my $pid = fork;

            # fork returns undef (not a negative value) on failure
            if (!defined $pid) { die "fork referer process failed : $!\n"; }
            
            if ($pid == 0)
            {
                my $pid = open2(my $psqlrd, my $psqlwr, "psql postgres");
                my $count = 0;

                close($refresrd);

                if ($pid < 0) { die "fork psql failed : $!\n"; }

                while (-f $trg_file)
                {
                    print $psqlwr "select * from pg_stat_user_tables;\n";
                    while (<$psqlrd>)
                    {
                        last if ($_ =~ /rows\)/);
                    }
                    $count++;
                }
                print $refreswr "$count\n";
                exit;
            }
        }
        close($refreswr);

        # launch updator processes
        for (my $i = 0 ; $i < $nprocs ; $i++)
        {
            my $dbn = rand($ndbs);  # equal dist.
            my $dbname = sprintf("db%03d", $dbn);

            my $pid = fork;

            if (!defined $pid)  # fork returns undef on failure
            {
                die "fork failed: $!\n";
            }
            elsif ($pid == 0)
            {
                my $pid = open2(my $rd, my $wr, "psql $dbname > /dev/null");
                if ($pid < 0) { die "sub fork failed: $!\n"; }

                my $ncmd = $childtrlen;
                #print $wr "set log_min_duration_statement to 0;\n";
                print $wr "begin;\n";
                for (my $i = 0 ; $i < $childiter ; $i++)
                {
                    my $tbn = rand($ntbls);  # equal dist.

                    printf $wr "select /* $dbname\[$i\]*/ count(*) from t%03d;\n", $tbn;
                    if (--$ncmd == 0)
                    {
                        print $wr "commit;begin;\n";
                        $ncmd = $childtrlen;
                    }
                }

                print $wr "commit;\n";
                print $wr "\\q\n";
                my $res = <$rd>;
                exit;
            }

            (my $sec, my $usec) = &gettimeofday();
            $starttime{$pid} = $sec * 1000 + $usec / 1000;
            # print "start\[$pid\] = $starttime{$pid}\n";
        }

        # wait for updateors to finish
        for (my $i = 0 ; $i < $nprocs ; $i++)
        {
            my $pid = wait();

            if ($pid < 0) { if ($! != ECHILD) {die "???: $!\n"; }}
            redo if (!defined $starttime{$pid});

            (my $sec, my $usec) = &gettimeofday();
            $endtime{$pid} = $sec * 1000 + $usec / 1000;
            # printf "%d[%d]: %d - %d  = %d ms\n", $i, $pid, $endtime{$pid}, $starttime{$pid}, $endtime{$pid} - $starttime{$pid};
        }

        my $sum = 0;
        foreach my $pid (keys %starttime)
        {
            $sum += $endtime{$pid} - $starttime{$pid};
            # printf "[%d]: %d ms (%d, %d)\n", $pid, $endtime{$pid} - $starttime{$pid}, $starttime{$pid}, $endtime{$pid};
        }

        push(@results, $sum / $nprocs);

        # kill referers if any (and dummy process)
        unlink($trg_file);
        # wait() returns -1 once no child processes remain
        while (wait() != -1) {}
        my $nrefs = 0;
        my $refcount = 0;
        while(<$refresrd>)
        {
            chomp;
            $refcount += $_;
            $nrefs++;
        }
        close($refresrd);
        
        push (@refresults, $refcount / $nrefs) if ($nrefs > 0);
    }

    # calculate stdev
    (my $updmean, my $updstdev) = stdev(@results);
    (my $refmean, my $refstdev) = stdev(@refresults);

    printf "$test_name (l:%d, d:%d, t:%d, i:%d, tr:%d): %.2f ms (stdev %.2f) / %d updprocs, %.2f refs (stdev %.2f) / %d refprocs\n",
        $loops, $ndbs, $ntbls, $childiter, $childtrlen, $updmean, $updstdev, $nprocs, $refmean, $refstdev, $nreferers;
}

sub stdev
{
    my $sum = 0;
    my $sqsum = 0;
    my $count = $#_ + 1;

    return (0, 0) if $count == 0;

    foreach my $el (@_)
    {
        $sum += 1.0 * $el;
        $sqsum += 1.0 * $el * $el;
    }
    my $mean = $sum / $count;
    my $stdev = sqrt($sqsum / $count - $mean * $mean);

    return ($mean, $stdev);
}


From cc611bcfe8dd253fa3ac4e6b4feaf86105ca7e57 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:41:04 +0900
Subject: [PATCH 1/5] sequential scan for dshash

Add sequential scan feature to dshash.
---
 src/backend/lib/dshash.c | 188 ++++++++++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  23 +++++-
 2 files changed, 206 insertions(+), 5 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 24dd372252..a0ccd7ee7c 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -112,6 +112,7 @@ struct dshash_table
     size_t        size_log2;        /* log2(number of buckets) */
     bool        find_locked;    /* Is any partition lock held by 'find'? */
     bool        find_exclusively_locked;    /* ... exclusively? */
+    bool        seqscan_running;/* now under sequential scan */
 };
 
 /* Given a pointer to an item, find the entry (user data) it holds. */
@@ -127,6 +128,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +158,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -228,6 +237,7 @@ dshash_create(dsa_area *area, const dshash_parameters *params, void *arg)
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
 
     /*
      * Set up the initial array of buckets.  Our initial size is the same as
@@ -279,6 +289,7 @@ dshash_attach(dsa_area *area, const dshash_parameters *params,
     hash_table->control = dsa_get_address(area, control);
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
     Assert(hash_table->control->magic == DSHASH_MAGIC);
 
     /*
@@ -324,7 +335,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
                                 LW_EXCLUSIVE));
 
     delete_item(hash_table, item);
-    hash_table->find_locked = false;
-    hash_table->find_exclusively_locked = false;
-    LWLockRelease(PARTITION_LOCK(hash_table, partition));
+
+    /* We need to keep partition lock while sequential scan */
+    if (!hash_table->seqscan_running)
+    {
+        hash_table->find_locked = false;
+        hash_table->find_exclusively_locked = false;
+        LWLockRelease(PARTITION_LOCK(hash_table, partition));
+    }
 }
 
 /*
@@ -568,6 +584,8 @@ dshash_release_lock(dshash_table *hash_table, void *entry)
     Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition_index),
                                 hash_table->find_exclusively_locked
                                 ? LW_EXCLUSIVE : LW_SHARED));
+    /* lock is under control of sequential scan */
+    Assert(!hash_table->seqscan_running);
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
@@ -592,6 +610,168 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should be called if and only if the scan is abandoned
+ * before completion; if dshash_seq_next returns NULL then it has already done
+ * the end-of-scan cleanup.
+ *
+ * On returning element, it is locked as is the case with dshash_find.
+ * However, the caller must not release the lock. The lock is released as
+ * necessary in continued scan.
+ *
+ * As opposed to the equivalent for dynanash, the caller is not supposed to
+ * delete the returned element before continuing the scan.
+ *
+ * If consistent is set for dshash_seq_init, the whole hash table is
+ * non-exclusively locked. Otherwise a part of the hash table is locked in the
+ * same mode (partition lock).
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool consistent, bool exclusive)
+{
+    /* allowed at most one scan at once */
+    Assert(!hash_table->seqscan_running);
+
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->consistent = consistent;
+    status->exclusive = exclusive;
+    hash_table->seqscan_running = true;
+
+    /*
+     * Protect all partitions from modification if the caller wants a
+     * consistent result.
+     */
+    if (consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+        {
+            Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i)));
+
+            LWLockAcquire(PARTITION_LOCK(hash_table, i),
+                          exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        }
+        ensure_valid_bucket_pointers(hash_table);
+    }
+}
+
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    Assert(status->hash_table->seqscan_running);
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert (status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        if (!status->consistent)
+        {
+            partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+            LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            status->curpartition = partition;
+
+            /* resize doesn't happen from now until seq scan ends */
+            status->nbuckets =
+                NUM_BUCKETS(status->hash_table->control->size_log2);
+            ensure_valid_bucket_pointers(status->hash_table);
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            dshash_seq_term(status);
+            return NULL;
+        }
+
+        /* Also move partition lock if needed */
+        if (!status->consistent)
+        {
+            int next_partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+
+            /* Move lock along with partition for the bucket */
+            if (status->curpartition != next_partition)
+            {
+                /*
+                 * Take lock on the next partition then release the current,
+                 * not in the reverse order. This is required to avoid
+                 * resizing from happening during a sequential scan. Locks are
+                 * taken in partition order so no dead lock happen with other
+                 * seq scans or resizing.
+                 */
+                LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                             next_partition),
+                              status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+                LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                             status->curpartition));
+                status->curpartition = next_partition;
+            }
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * This item can be deleted by the caller. Store the next item for the
+     * next iteration for the occasion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    Assert(status->hash_table->seqscan_running);
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+    status->hash_table->seqscan_running = false;
+
+    if (status->consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+            LWLockRelease(PARTITION_LOCK(status->hash_table, i));
+    }
+    else if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index fa2e28ff3e..79698a6ad6 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,23 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state of dshash. The detail is exposed since the storage
+ * size should be known to users but it should be considered as an opaque
+ * type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;
+    int                    curbucket;
+    int                    nbuckets;
+    dshash_table_item  *curitem;
+    dsa_pointer            pnextitem;
+    int                    curpartition;
+    bool                consistent;
+    bool                exclusive;
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
                                    const dshash_parameters *params,
@@ -70,7 +87,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
 extern void dshash_detach(dshash_table *hash_table);
 extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
 extern void dshash_destroy(dshash_table *hash_table);
-
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
@@ -80,6 +96,11 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool consistent, bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.16.3

From 6ac733988c6bf767b2cd514d493c5712031cae9e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 27 Sep 2018 11:15:19 +0900
Subject: [PATCH 2/5] Add conditional lock feature to dshash

Dshash currently waits for lock unconditionally. This commit adds new
interfaces for dshash_find and dshash_find_or_insert. The new
interfaces have an extra parameter "nowait" that tells them not to wait
for the lock.
---
 src/backend/lib/dshash.c | 69 +++++++++++++++++++++++++++++++++++++++++++-----
 src/include/lib/dshash.h |  6 +++++
 2 files changed, 68 insertions(+), 7 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index a0ccd7ee7c..4191e37717 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -394,19 +394,48 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  */
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
+{
+    return dshash_find_extended(hash_table, key, exclusive, false, NULL);
+}
+
+/*
+ * In addition to dshash_find, returns immediately when nowait is true and the
+ * lock was not acquired. The lock status is stored in *lock_acquired if given.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool *lock_acquired)
 {
     dshash_hash hash;
     size_t        partition;
     dshash_table_item *item;
 
+    /* allowing !nowait returning the result is just not sensible */
+    Assert(nowait || !lock_acquired);
+
     hash = hash_key(hash_table, key);
     partition = PARTITION_FOR_HASH(hash);
 
     Assert(hash_table->control->magic == DSHASH_MAGIC);
     Assert(!hash_table->find_locked);
 
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partition),
+                                      exclusive ? LW_EXCLUSIVE : LW_SHARED))
+        {
+            if (lock_acquired)
+                *lock_acquired = false;
+            return NULL;
+        }
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition),
+                      exclusive ? LW_EXCLUSIVE : LW_SHARED);
+
+    if (lock_acquired)
+        *lock_acquired = true;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -441,6 +470,22 @@ void *
 dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
+{
+    return dshash_find_or_insert_extended(hash_table, key, found, false);
+}
+
+/*
+ * Addition to dshash_find_or_insert, returns NULL if nowait is true and lock
+ * was not acquired.
+ *
+ * Notes above dshash_find_extended() regarding locking and error handling
+ * equally apply here.
+ */
+void *
+dshash_find_or_insert_extended(dshash_table *hash_table,
+                               const void *key,
+                               bool *found,
+                               bool nowait)
 {
     dshash_hash hash;
     size_t        partition_index;
@@ -455,8 +500,16 @@ dshash_find_or_insert(dshash_table *hash_table,
     Assert(!hash_table->find_locked);
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(
+                PARTITION_LOCK(hash_table, partition_index),
+                LW_EXCLUSIVE))
+            return NULL;
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
+                      LW_EXCLUSIVE);
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -626,9 +679,11 @@ dshash_memhash(const void *v, size_t size, void *arg)
  * As opposed to the equivalent for dynanash, the caller is not supposed to
  * delete the returned element before continuing the scan.
  *
- * If consistent is set for dshash_seq_init, the whole hash table is
- * non-exclusively locked. Otherwise a part of the hash table is locked in the
- * same mode (partition lock).
+ * If consistent is set for dshash_seq_init, the all hash table
+ * partitions are locked in the requested mode (as determined by the
+ * exclusive flag), and the locks are held until the end of the scan.
+ * Otherwise the partition locks are acquired and released as needed
+ * during the scan (up to two partitions may be locked at the same time).
  */
 void
 dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 79698a6ad6..67f7d77f71 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -90,8 +90,14 @@ extern void dshash_destroy(dshash_table *hash_table);
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+                                  bool exclusive, bool nowait,
+                                  bool *lock_acquired);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                                    const void *key, bool *found);
+extern void *dshash_find_or_insert_extended(dshash_table *hash_table,
+                                            const void *key, bool *found,
+                                            bool nowait);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.16.3

From 4b23066adcc24b5c703ba0145dec9ef1ed60479f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 7 Nov 2018 16:53:49 +0900
Subject: [PATCH 3/5] Make archiver process an auxiliary process

This is a preliminary patch for the shared-memory based stats collector.
The archiver process must be an auxiliary process since it uses shared
memory after stats data was moved onto shared memory. Make the process
an auxiliary process in order to make it work.
---
 src/backend/bootstrap/bootstrap.c   |  8 +++
 src/backend/postmaster/pgarch.c     | 98 +++++++++----------------------------
 src/backend/postmaster/pgstat.c     |  6 +++
 src/backend/postmaster/postmaster.c | 35 +++++++++----
 src/include/miscadmin.h             |  2 +
 src/include/pgstat.h                |  1 +
 src/include/postmaster/pgarch.h     |  4 +-
 7 files changed, 67 insertions(+), 87 deletions(-)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 43627ab8f4..7872a2d9d7 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -329,6 +329,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case BgWriterProcess:
                 statmsg = pgstat_get_backend_desc(B_BG_WRITER);
                 break;
+            case ArchiverProcess:
+                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
+                break;
             case CheckpointerProcess:
                 statmsg = pgstat_get_backend_desc(B_CHECKPOINTER);
                 break;
@@ -456,6 +459,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
             BackgroundWriterMain();
             proc_exit(1);        /* should never return */
 
+        case ArchiverProcess:
+            /* don't set signals, archiver has its own agenda */
+            PgArchiverMain();
+            proc_exit(1);        /* should never return */
+
         case CheckpointerProcess:
             /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index f84f882c4c..4342ebdab4 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -77,7 +77,6 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
@@ -96,7 +95,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgarch_exit(SIGNAL_ARGS);
 static void ArchSigHupHandler(SIGNAL_ARGS);
 static void ArchSigTermHandler(SIGNAL_ARGS);
@@ -114,75 +112,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -222,8 +151,8 @@ pgarch_forkexec(void)
  *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
  *    since we don't use 'em, it hardly matters...
  */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -255,8 +184,27 @@ PgArchiverMain(int argc, char *argv[])
 static void
 pgarch_exit(SIGNAL_ARGS)
 {
-    /* SIGQUIT means curl up and die ... */
-    exit(1);
+    PG_SETMASK(&BlockSig);
+
+    /*
+     * We DO NOT want to run proc_exit() callbacks -- we're here because
+     * shared memory may be corrupted, so we don't want to try to clean up our
+     * transaction.  Just nail the windows shut and get out of town.  Now that
+     * there's an atexit callback to prevent third-party code from breaking
+     * things by calling exit() directly, we have to reset the callbacks
+     * explicitly to make this work as intended.
+     */
+    on_exit_reset();
+
+    /*
+     * Note we do exit(2) not exit(0).  This is to force the postmaster into a
+     * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+     * backend.  This is necessary precisely because we don't clean up our
+     * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+     * should ensure the postmaster sees this as a crash, too, but no harm in
+     * being doubly sure.)
+     */
+    exit(2);
 }
 
 /* SIGHUP signal handler for archiver process */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b4f2b28b51..f4ec142cab 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2934,6 +2934,9 @@ pgstat_bestart(void)
             case StartupProcess:
                 lbeentry.st_backendType = B_STARTUP;
                 break;
+            case ArchiverProcess:
+                lbeentry.st_backendType = B_ARCHIVER;
+                break;
             case BgWriterProcess:
                 lbeentry.st_backendType = B_BG_WRITER;
                 break;
@@ -4277,6 +4280,9 @@ pgstat_get_backend_desc(BackendType backendType)
 
     switch (backendType)
     {
+        case B_ARCHIVER:
+            backendDesc = "archiver";
+            break;
         case B_AUTOVAC_LAUNCHER:
             backendDesc = "autovacuum launcher";
             break;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 688ad439ed..4574ebf2de 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -146,7 +146,8 @@
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
 #define BACKEND_TYPE_BGWORKER    0x0008    /* bgworker process */
-#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
+#define BACKEND_TYPE_ARCHIVER    0x0010    /* archiver process */
+#define BACKEND_TYPE_ALL        0x001F    /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -539,6 +540,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1762,7 +1764,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -2977,7 +2979,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3122,10 +3124,8 @@ reaper(SIGNAL_ARGS)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3371,7 +3371,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3576,6 +3576,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3848,6 +3860,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5117,7 +5130,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5400,6 +5413,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 61a24c2e3c..0b49b63327 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -399,6 +399,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -411,6 +412,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0a3ad3a188..b3f00e1943 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -719,6 +719,7 @@ typedef struct PgStat_GlobalStats
  */
 typedef enum BackendType
 {
+    B_ARCHIVER,
     B_AUTOVAC_LAUNCHER,
     B_AUTOVAC_WORKER,
     B_BACKEND,
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index 2474eac26a..88f16863d4 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
-- 
2.16.3
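
A note on the BACKEND_TYPE_* change in postmaster.c in the patch above: the
masks are single-bit flags, so adding BACKEND_TYPE_ARCHIVER as 0x0010 means
BACKEND_TYPE_ALL has to widen from 0x000F to 0x001F. The standalone sketch
below only illustrates that arithmetic. BACKEND_TYPE_NORMAL (0x0001) is not
visible in the hunk and is assumed here, and backend_type_matches() is a
hypothetical helper, not a postmaster function.

```c
#include <stdbool.h>

#define BACKEND_TYPE_NORMAL    0x0001  /* assumed: not shown in the hunk */
#define BACKEND_TYPE_AUTOVAC   0x0002  /* autovacuum worker process */
#define BACKEND_TYPE_WALSND    0x0004  /* walsender process */
#define BACKEND_TYPE_BGWORKER  0x0008  /* bgworker process */
#define BACKEND_TYPE_ARCHIVER  0x0010  /* archiver process (new) */
#define BACKEND_TYPE_ALL       0x001F  /* OR of all the above */

/* Compile-time check that ALL really is the OR of every single-bit flag. */
_Static_assert(BACKEND_TYPE_ALL ==
               (BACKEND_TYPE_NORMAL | BACKEND_TYPE_AUTOVAC |
                BACKEND_TYPE_WALSND | BACKEND_TYPE_BGWORKER |
                BACKEND_TYPE_ARCHIVER),
               "BACKEND_TYPE_ALL must cover every backend type");

/* Hypothetical helper: does a child of the given type fall in the target mask? */
static bool
backend_type_matches(int type, int target)
{
    return (type & target) != 0;
}
```

With the old mask of 0x000F, a caller asking for "all" children would have
skipped the archiver, which is why both defines change together.
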

From 02d5d0a85b6a3db5809402f673e46ac4ce8c2599 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 21 Feb 2019 12:44:56 +0900
Subject: [PATCH 4/5] Shared-memory based stats collector

Previously, activity statistics were shared via files on disk. Every
backend sent its numbers to the stats collector process via a socket.
The collector wrote snapshots as a set of files on disk at a certain
interval, and every backend read them as necessary. That worked fine
for a comparatively small set of statistics, but the set is under
pressure to grow and the file size has reached the order of
megabytes. To deal with a larger statistics set, this patch lets
backends share the statistics directly via shared memory.
---
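The header comment added to pgstat.c in this patch says that the first
process to use the stats facility creates the shared area and loads the saved
stats file, and that the last process at shutdown writes the file and destroys
the area. StatsShmemStruct carries a refcount field, which suggests a
reference-counted lifecycle. The following is a minimal sketch of that idea
under my own assumptions: StatsShmemSketch, stats_attach() and stats_detach()
are hypothetical names, the counters stand in for the real file I/O, and the
locking the real code would need (presumably via StatsLock) is left out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the bootstrap struct; only refcount is modeled. */
typedef struct StatsShmemSketch
{
    int  refcount;      /* number of attached processes */
    bool area_exists;   /* stands in for the DSA area and its hashes */
    int  loads;         /* how many times the saved stats file was loaded */
    int  writes;        /* how many times the stats file was written */
} StatsShmemSketch;

/* Attach: the first attacher creates the area and loads the saved file. */
static void
stats_attach(StatsShmemSketch *shm)
{
    /* real code would take an exclusive lock around this block */
    if (shm->refcount++ == 0)
    {
        shm->area_exists = true;
        shm->loads++;
    }
}

/* Detach: the last detacher writes the file and destroys the area. */
static void
stats_detach(StatsShmemSketch *shm)
{
    /* real code would take an exclusive lock around this block */
    if (--shm->refcount == 0)
    {
        shm->writes++;
        shm->area_exists = false;
    }
}
```

In this model the file is read once and written once per server lifetime,
however many backends attach in between.
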
 doc/src/sgml/monitoring.sgml                 |    6 +-
 src/backend/postmaster/autovacuum.c          |   12 +-
 src/backend/postmaster/pgstat.c              | 5661 ++++++++++++--------------
 src/backend/postmaster/postmaster.c          |  139 +-
 src/backend/storage/ipc/ipci.c               |    6 +
 src/backend/storage/lmgr/lwlock.c            |    1 +
 src/backend/tcop/postgres.c                  |   27 +-
 src/backend/utils/init/globals.c             |    1 +
 src/backend/utils/init/postinit.c            |   11 +
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |    4 +-
 src/include/miscadmin.h                      |    1 +
 src/include/pgstat.h                         |  441 +-
 src/include/storage/lwlock.h                 |    1 +
 src/include/utils/timeout.h                  |    1 +
 14 files changed, 2641 insertions(+), 3671 deletions(-)
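
The three timer constants introduced in pgstat.c by this patch
(PGSTAT_STAT_MIN_INTERVAL, PGSTAT_STAT_RETRY_INTERVAL and
PGSTAT_STAT_MAX_INTERVAL) describe a flush schedule: flush at most every
500ms, retry after 100ms if some numbers could not be flushed, and never hold
numbers locally for more than 1000ms. The sketch below is one possible reading
of that description, not the code in the patch. FlushDecision, decide_flush()
and next_attempt_after() are hypothetical names introduced for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define PGSTAT_STAT_MIN_INTERVAL    500   /* ms, value taken from the patch */
#define PGSTAT_STAT_RETRY_INTERVAL  100   /* ms, value taken from the patch */
#define PGSTAT_STAT_MAX_INTERVAL   1000   /* ms, value taken from the patch */

/* Hypothetical result type, not part of the patch. */
typedef enum FlushDecision
{
    FLUSH_SKIP,     /* too soon since the last attempt */
    FLUSH_NOWAIT,   /* try to flush, but give up on lock contention */
    FLUSH_FORCE     /* pending numbers are too old: wait for the locks */
} FlushDecision;

/*
 * now_ms:           current time
 * next_attempt_ms:  earliest time the next flush attempt is allowed
 * pending_since_ms: when the oldest unflushed number was recorded
 */
static FlushDecision
decide_flush(int64_t now_ms, int64_t next_attempt_ms, int64_t pending_since_ms)
{
    if (now_ms - pending_since_ms >= PGSTAT_STAT_MAX_INTERVAL)
        return FLUSH_FORCE;
    if (now_ms < next_attempt_ms)
        return FLUSH_SKIP;
    return FLUSH_NOWAIT;
}

/* After an attempt, compute when the next attempt is allowed. */
static int64_t
next_attempt_after(int64_t now_ms, bool flushed_everything)
{
    return now_ms + (flushed_everything ? PGSTAT_STAT_MIN_INTERVAL
                                        : PGSTAT_STAT_RETRY_INTERVAL);
}
```

For example, numbers pending since t=0 are skipped at t=100 when the next
attempt is due at t=500, tried without waiting at t=500, and forced once
t reaches 1000.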

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index bf72d0c303..990995c17b 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    master server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   master process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   master process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index fd85b9c8f4..7cb5670a47 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1958,15 +1958,15 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
+    /* Start a transaction so our commands have one to play into. */
+    StartTransactionCommand();
+
     /*
      * may be NULL if we couldn't find an entry (only happens if we are
      * forcing a vacuum for anti-wrap purposes).
      */
     dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
 
-    /* Start a transaction so our commands have one to play into. */
-    StartTransactionCommand();
-
     /*
      * Clean up any dead statistics collector entries for this DB. We always
      * want to do this exactly once per DB-processing cycle, even if we find
@@ -2749,12 +2749,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
     if (isshared)
     {
         if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
+            tabentry = pgstat_fetch_stat_tabentry_extended(shared, relid);
     }
     else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
+        tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
 
     return tabentry;
 }
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f4ec142cab..514ea78a68 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,23 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Statistics collector facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects per-table and per-function usage statistics of all backends in
+ *  shared memory. The pg_count_*() and friends interfaces store the activity
+ *  of every backend during a transaction. Then pgstat_flush_stat() is called
+ *  at the end of a transaction to flush the local numbers to shared memory.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ *  To avoid congestion on the shared memory, we update shared stats no more
+ *  often than once per PGSTAT_STAT_MIN_INTERVAL (500ms). If a backend cannot
+ *  flush all or part of its local numbers immediately, we postpone the
+ *  update and retry after PGSTAT_STAT_RETRY_INTERVAL (100ms), but the
+ *  numbers are not kept locally for longer than PGSTAT_STAT_MAX_INTERVAL
+ *  (1000ms).
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the stats collector creates the area, then
+ *  loads the stored stats file if any; the last process at shutdown writes
+ *  the shared stats to the file, then destroys the area before exit.
  *
  *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
  *
@@ -19,18 +27,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "pgstat.h"
 
@@ -42,66 +38,38 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "postmaster/autovacuum.h"
-#include "postmaster/fork_process.h"
-#include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
-#include "utils/timestamp.h"
-
 
 /* ----------
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
+#define PGSTAT_STAT_MIN_INTERVAL    500 /* Minimum time between stats data
+                                         * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_STAT_RETRY_INTERVAL    100 /* Retry interval once
+                                         * PGSTAT_STAT_MIN_INTERVAL has
 
+#define PGSTAT_STAT_MAX_INTERVAL   1000 /* Maximum time between stats data
+                                         * updates; in milliseconds. */
 
 /* ----------
  * The initial size hints for the hash tables used in the collector.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
 #define PGSTAT_TAB_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
@@ -117,6 +85,19 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
+/*
+ * Operation mode and return code of pgstat_get_db_entry.
+ */
+#define    PGSTAT_SHARED        0
+#define    PGSTAT_EXCLUSIVE    1
+#define    PGSTAT_NOWAIT        2
+
+typedef enum PgStat_TableLookupResult
+{
+    NOT_FOUND,
+    FOUND,
+    LOCK_FAILED
+} PgStat_TableLookupResult;
 
 /* ----------
  * GUC parameters
@@ -132,31 +113,63 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used; kept until the corresponding GUC is removed */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
+#define        StatsLock (&StatsShmem->StatsMainLock)
+
+/* Shared stats bootstrap information */
+typedef struct StatsShmemStruct
+{
+    LWLock                StatsMainLock;        /* lock protecting this struct */
+    dsa_handle             stats_dsa_handle;    /* DSA handle for stats collector */
+    dshash_table_handle db_hash_handle;
+    dsa_pointer            global_stats;
+    dsa_pointer            archiver_stats;
+    int                    refcount;
+} StatsShmemStruct;
+
 /*
- * BgWriter global statistics counters (unused in other processes).
- * Stored directly in a stats message structure so it can be sent
- * without needing to copy things around.  We assume this inits to zeroes.
+ * BgWriter global statistics counters. The name is a remnant from the time
+ * when the stats collector was a dedicated process and this struct was sent
+ * to it over a socket.
  */
-PgStat_MsgBgWriter BgWriterStats;
+PgStat_MsgBgWriter BgWriterStats = {0};
 
-/* ----------
- * Local data
- * ----------
- */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
+/* Variables that live for the backend lifetime */
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
+static dshash_table *pgStatDBHash = NULL;
 
-static struct sockaddr_storage pgStatAddr;
 
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
+/* parameter for each type of shared hash */
+static const dshash_parameters dsh_dbparams = {
+    sizeof(Oid),
+    SHARED_DBENT_SIZE,
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+static const dshash_parameters dsh_tblparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatTabEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+static const dshash_parameters dsh_funcparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatFuncEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
 
 /*
  * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
+ * written to shared memory.
  *
  * NOTE: once allocated, TabStatusArray structures are never moved or deleted
  * for the life of the backend.  Also, we zero out the t_id fields of the
@@ -191,8 +204,8 @@ typedef struct TabStatHashEntry
 static HTAB *pgStatTabHash = NULL;
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * Backends store per-function info that's waiting to be flushed out to shared
+ * memory in this hash table (indexed by function OID).
  */
 static HTAB *pgStatFunctions = NULL;
 
@@ -202,6 +215,68 @@ static HTAB *pgStatFunctions = NULL;
  */
 static bool have_function_stats = false;
 
+/* common header of snapshot entry in backend snapshot hash */
+typedef struct PgStat_snapshot
+{
+    Oid        key;
+    bool    negative;
+    void   *body;                /* end of header part: to keep alignment */
+} PgStat_snapshot;
+
+/* context struct for snapshot_statentry */
+typedef struct pgstat_snapshot_param
+{
+    char           *hash_name;            /* name of the snapshot hash */
+    int                hash_entsize;        /* element size of hash entry */
+    dshash_table_handle    dsh_handle;        /* dsh handle to attach */
+    const dshash_parameters *dsh_params;/* dshash params */
+    HTAB          **hash;                /* points to variable to hold hash */
+    dshash_table  **dshash;                /* ditto for dshash */
+} pgstat_snapshot_param;
+
+/*
+ * Backends store various database-wide info that's waiting to be flushed out
+ * to shared memory in these variables.
+ *
+ * checksum_failures is the exception in that it is cluster-wide.
+ */
+typedef struct BackendDBStats
+{
+    int        n_conflict_tablespace;
+    int        n_conflict_lock;
+    int        n_conflict_snapshot;
+    int        n_conflict_bufferpin;
+    int        n_conflict_startup_deadlock;
+    int        n_deadlocks;
+    size_t    n_tmpfiles;
+    size_t    tmpfilesize;
+    HTAB    *checksum_failures;
+} BackendDBStats;
+
+/* Hash entry struct for checksum_failures above */
+typedef struct ChecksumFailureEnt
+{
+    Oid    dboid;
+    int    count;
+} ChecksumFailureEnt;
+
+static BackendDBStats BeDBStats = {0};
+
+/* macros to check BeDBStats at once */
+#define HAVE_PENDING_CONFLICTS() \
+    (BeDBStats.n_conflict_tablespace > 0 ||        \
+     BeDBStats.n_conflict_lock > 0 ||            \
+     BeDBStats.n_conflict_bufferpin > 0 ||        \
+     BeDBStats.n_conflict_startup_deadlock > 0)
+
+#define HAVE_PENDING_DBSTATS()                \
+    (HAVE_PENDING_CONFLICTS() ||        \
+     BeDBStats.n_deadlocks > 0 ||                \
+     BeDBStats.n_tmpfiles > 0 ||                \
+     /* no need to check tmpfilesize */        \
+     BeDBStats.checksum_failures != NULL)
+
+
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
  * into PgStat_TableStatus counters until we know if it is going to commit
@@ -237,11 +312,11 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
+static MemoryContext pgStatSnapshotContext = NULL;
+static HTAB *pgStatLocalHash = NULL;
+static bool    clear_snapshot = false;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -250,23 +325,35 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Struct for context for pgstat_flush_* functions
+ *
+ * To avoid repeated attach/detach of the same dshash, dshashes once attached
+ * are stored in this structure and carried across multiple calls and multiple
+ * functions. generation here means the value returned by pin_hashes().
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
+typedef struct pgstat_flush_stat_context
+{
+    int    shgeneration;                /* "generation" of shdb_tabhash below */
+    PgStat_StatDBEntry *shdbentry;    /* dbentry for shared tables (oid = 0) */
+    dshash_table *shdb_tabhash;        /* tabentry dshash of shared tables */
+
+    int    mygeneration;                /* "generation" of mydb_tabhash below */
+    PgStat_StatDBEntry *mydbentry;    /* dbentry for my database */
+    dshash_table *mydb_tabhash;        /* tabentry dshash of my database */
+} pgstat_flush_stat_context;
 
 /*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
+ * Cluster wide statistics.
+ *
+ * Contains statistics that are collected neither per database nor per table.
+ * shared_* point to shared memory and snapshot_* are backend
+ * snapshots. Their validity is indicated by global_snapshot_is_valid.
  */
-static List *pending_write_requests = NIL;
-
-/* Signal handler flags */
-static volatile bool need_exit = false;
-static volatile bool got_SIGHUP = false;
+static bool global_snapshot_is_valid = false;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats snapshot_globalStats;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -280,35 +367,41 @@ static instr_time total_func_time;
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
 
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgstat_exit(SIGNAL_ARGS);
 static void pgstat_beshutdown_hook(int code, Datum arg);
-static void pgstat_sighup_handler(SIGNAL_ARGS);
-
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
+static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, int op,
+                                    PgStat_TableLookupResult *status);
+static PgStat_StatTabEntry *pgstat_get_tab_entry(dshash_table *table,
                                                  Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static void pgstat_write_pgStatDBHashfile(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_pgStatDBHashfile(PgStat_StatDBEntry *dbentry);
 static void pgstat_read_current_status(void);
-
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
-
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
+static bool pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
+                                 PgStat_TableStatus *entry);
+static bool pgstat_flush_funcstats(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_flush_dbstats(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_update_tabentry(dshash_table *tabhash,
+                                   PgStat_TableStatus *stat, bool nowait);
+static void pgstat_update_dbentry(PgStat_StatDBEntry *dbentry,
+                                  PgStat_TableStatus *stat);
 static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
 
+static void pgstat_remove_useless_entries(const dshash_table_handle dshhandle,
+                              const dshash_parameters *dshparams,
+                              HTAB *oidtab);
 static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
 
 static void pgstat_setup_memcxt(void);
+static void pgstat_flush_recovery_conflict(PgStat_StatDBEntry *dbentry);
+static void pgstat_flush_deadlock(PgStat_StatDBEntry *dbentry);
+static void pgstat_flush_checksum_failure(PgStat_StatDBEntry *dbentry);
+static void pgstat_flush_tempfile(PgStat_StatDBEntry *dbentry);
+static HTAB *create_tabstat_hash(void);
+static PgStat_SubXactStatus *get_tabstat_stack_level(int nest_level);
+static void add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level);
+static PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry_extended(PgStat_StatDBEntry *dbent, Oid funcid);
+static void pgstat_snapshot_global_stats(void);
 
 static const char *pgstat_get_wait_activity(WaitEventActivity w);
 static const char *pgstat_get_wait_client(WaitEventClient w);
@@ -316,481 +409,197 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+/* ------------------------------------------------------------
+ * Local support functions follow
+ * ------------------------------------------------------------
+ */
+static int pin_hashes(PgStat_StatDBEntry *dbentry);
+static void unpin_hashes(PgStat_StatDBEntry *dbentry, int generation);
+static dshash_table *attach_table_hash(PgStat_StatDBEntry *dbent, int gen);
+static dshash_table *attach_function_hash(PgStat_StatDBEntry *dbent, int gen);
+static void reset_dbentry_counters(PgStat_StatDBEntry *dbentry);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute space needed for stats collector's shared memory
  */
-void
-pgstat_init(void)
+Size
+StatsShmemSize(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
-
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
-
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
-    {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
-    }
-
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
-
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
-
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
-
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
-    }
-
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
-
-    /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
-     */
-    if (!pg_set_noblock(pgStatSock))
-    {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
-    }
-
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
-    }
-
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    return;
-
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
-
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
-
-    /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
-     */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    return sizeof(StatsShmemStruct);
 }
 
 /*
- * subroutine for pgstat_reset_all
+ * StatsShmemInit - initialize during shared-memory creation
+ */
+void
+StatsShmemInit(void)
+{
+    bool    found;
+
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
+
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+    }
+
+    LWLockInitialize(StatsLock, LWTRANCHE_STATS);
+}
+
+/* ----------
+ * pgstat_attach_shared_stats() -
+ *
+ *    Attach to the shared stats memory, creating it if it does not exist.
+ * ---------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+pgstat_attach_shared_stats(void)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    MemoryContext oldcontext;
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
+    /*
+     * Don't use DSM when not running under postmaster or not tracking counts.
+     */
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    pgstat_setup_memcxt();
+
+    if (area)
+        return;
+
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+    if (StatsShmem->refcount > 0)
+        StatsShmem->refcount++;
+    else
     {
-        int            nchars;
-        Oid            tmp_oid;
+        /* Need to create shared memory area and load saved stats if any. */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
 
-        /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
-         */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
+        /* Initialize shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatDBHash = dshash_create(area, &dsh_dbparams, 0);
 
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->global_stats =
+            dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+        StatsShmem->archiver_stats =
+            dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+        StatsShmem->db_hash_handle = dshash_get_hash_table_handle(pgStatDBHash);
 
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
+
+        /* Load saved data if any. */
+        pgstat_read_statsfiles();
+
+        StatsShmem->refcount = 1;
     }
-    FreeDir(dir);
+
+    LWLockRelease(StatsLock);
+
+    /*
+     * If we're not the first process, attach existing shared stats area
+     * outside StatsLock.
+     */
+    if (!area)
+    {
+        /* Shared area already exists. Just attach it. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatDBHash = dshash_attach(area, &dsh_dbparams,
+                                     StatsShmem->db_hash_handle, 0);
+
+        /* Setup local variables */
+        pgStatLocalHash = NULL;
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
+    }
+
+    MemoryContextSwitchTo(oldcontext);
+
+    dsa_pin_mapping(area);
+    global_snapshot_is_valid = false;
+}
+
+/* ----------
+ * pgstat_detach_shared_stats() -
+ *
+ *    Detach shared stats. Write out to file if we're the last process and
+ *    instructed to write file.
+ * ----------
+ */
+static void
+pgstat_detach_shared_stats(bool write_stats)
+{
+    if (!area || !IsUnderPostmaster)
+        return;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+    /* write out the shared stats to file if needed */
+    if (--StatsShmem->refcount < 1)
+    {
+        if (write_stats)
+            pgstat_write_statsfiles();
+
+        /* We're the last process. Invalidate the dsa area handle. */
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+    }
+
+    LWLockRelease(StatsLock);
+
+    /*
+     * Detach the area. It is automatically destroyed when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);
+
+    area = NULL;
+    pgStatDBHash = NULL;
+    shared_globalStats = NULL;
+    shared_archiverStats = NULL;
+    pgStatLocalHash = NULL;
+    global_snapshot_is_valid = false;
 }
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
-}
+    /* we must have shared stats attached */
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-#ifdef EXEC_BACKEND
-
-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
-    char       *av[10];
-    int            ac = 0;
-
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
-
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
-
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
-
-
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
+    /* Startup must be the only user of shared stats */
+    Assert(StatsShmem->refcount == 1);
 
     /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
+     * We could directly remove files and recreate the shared memory area,
+     * but for simplicity we just detach and then attach again.
      */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
-
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
-
-    /*
-     * Okay, fork off the collector.
-     */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
+    pgstat_detach_shared_stats(false);    /* Don't write */
+    pgstat_attach_shared_stats();
 }
 
 /* ------------------------------------------------------------
@@ -798,75 +607,293 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *    Updates are applied no more often than once every
+ *    PGSTAT_STAT_MIN_INTERVAL milliseconds. If force is false, they are also
+ *    postponed on lock failure unless some update has been pending for longer
+ *    than PGSTAT_STAT_MAX_INTERVAL milliseconds. Postponed updates are retried
+ *    in succeeding calls of this function.
+ *
+ *    Returns the time in milliseconds until updates should be attempted
+ *    again, or 0 if nothing is left pending (all updates were applied or
+ *    there was nothing to do).
+ *
+ *    Note that this is called only outside a transaction, so it is fine to use
  *    transaction stop time as an approximation of current time.
- * ----------
+ *    ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz next_flush = 0;
+    static TimestampTz pending_since = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
-    int            i;
+    pgstat_flush_stat_context cxt = {0};
+    bool        pending_stats = false;
+    long        elapsed;
+    long        secs;
+    int            usecs;
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (area == NULL ||
+        ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
+         pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
+         !HAVE_PENDING_DBSTATS()  && !have_function_stats))
+        return 0;
+
+    now = GetCurrentTransactionStopTimestamp();
+
+    if (!force)
+    {
+        /*
+         * Don't flush stats until it is time.  Return the time to wait in
+         * milliseconds.
+         */
+        if (now < next_flush)
+        {
+            /* Remember when the oldest pending update began, if not yet set. */
+            if (pending_since == 0)
+                pending_since = now;
+
+            /* now < next_flush here */
+            return (next_flush - now) / 1000;
+        }
+
+        /*
+         * Don't keep pending updates longer than PGSTAT_STAT_MAX_INTERVAL.
+         */
+        if (pending_since > 0)
+        {
+            TimestampDifference(pending_since, now, &secs, &usecs);
+            elapsed = secs * 1000 + usecs / 1000;
+
+            if (elapsed > PGSTAT_STAT_MAX_INTERVAL)
+                force = true;
+        }
+    }
+
+    /* Flush out table stats */
+    if (pgStatTabList != NULL && !pgstat_flush_stat(&cxt, !force))
+        pending_stats = true;
+
+    /* Flush out function stats */
+    if (pgStatFunctions != NULL && !pgstat_flush_funcstats(&cxt, !force))
+        pending_stats = true;
+
+    /* Flush out database-wide stats */
+    if (HAVE_PENDING_DBSTATS())
+    {
+        if (!pgstat_flush_dbstats(&cxt, !force))
+            pending_stats = true;
+    }
+
+    /* Unpin dbentry if pinned */
+    if (cxt.mydb_tabhash)
+    {
+        dshash_detach(cxt.mydb_tabhash);
+        unpin_hashes(cxt.mydbentry, cxt.mygeneration);
+        cxt.mydb_tabhash = NULL;
+        cxt.mydbentry = NULL;
+    }
+
+    /* Publish the last flush time */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (shared_globalStats->stats_timestamp < now)
+        shared_globalStats->stats_timestamp = now;
+    LWLockRelease(StatsLock);
+
+    /* Record how long we have been keeping pending updates. */
+    if (pending_stats)
+    {
+        /* Preserve the first value */
+        if (pending_since == 0)
+            pending_since = now;
+
+        /*
+         * The retry interval may push the total delay past
+         * PGSTAT_STAT_MAX_INTERVAL. We don't bother about that since the
+         * excess is small.
+         */
+        return PGSTAT_STAT_RETRY_INTERVAL;
+    }
+
+    /* Set the next time to update stats */
+    next_flush = now + PGSTAT_STAT_MIN_INTERVAL * 1000;
+    pending_since = 0;
+
+    return 0;
+}
+
+/*
+ * snapshot_statentry() - Common routine for functions
+ *                             pgstat_fetch_stat_*entry()
+ *
+ *  Returns the pointer to a snapshot of a shared entry for the key or NULL if
+ *  not found. Returned snapshots are stable during the current transaction or
+ *  until pgstat_clear_snapshot() is called.
+ *
+ *  The snapshots are stored in a hash, a pointer to which is stored in the
+ *  HTAB * variable pointed to by cxt->hash. If not created yet, it is created
+ *  using hash_name and hash_entsize in cxt.
+ *
+ *  cxt->dshash points to dshash_table for dbstat entries. If not yet
+ *  attached, it is attached using cxt->dsh_handle.
+ */
+static void *
+snapshot_statentry(pgstat_snapshot_param *cxt, Oid key)
+{
+    PgStat_snapshot *lentry = NULL;
+    size_t keysize = cxt->dsh_params->key_size;
+    size_t dsh_entrysize = cxt->dsh_params->entry_size;
+    bool found;
 
     /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
+     * Don't refresh the stats snapshot too often: keep it for at least
+     * PGSTAT_STAT_MIN_INTERVAL ms. Earlier requests are ignored, not postponed.
      */
-    now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
+    if (clear_snapshot)
+    {
+        clear_snapshot = false;
+
+        if (pgStatSnapshotContext &&
+            snapshot_globalStats.stats_timestamp <
+            GetCurrentStatementStartTimestamp() -
+            PGSTAT_STAT_MIN_INTERVAL * 1000)
+        {
+            MemoryContextReset(pgStatSnapshotContext);
+
+            /* Reset variables */
+            global_snapshot_is_valid = false;
+            pgStatSnapshotContext = NULL;
+            pgStatLocalHash = NULL;
+
+            pgstat_setup_memcxt();
+        }
+    }
+
+    /*
+     * Create new hash, with rather arbitrary initial number of entries since
+     * we don't know how this hash will grow.
+     */
+    if (!*cxt->hash)
+    {
+        HASHCTL ctl;
+
+        /*
+         * Create the hash in the stats context
+         *
+         * Each entry is prefixed with a common header part represented by
+         * PgStat_snapshot.
+         */
+
+        ctl.keysize        = keysize;
+        ctl.entrysize    = offsetof(PgStat_snapshot, body) + cxt->hash_entsize;
+        ctl.hcxt        = pgStatSnapshotContext;
+        *cxt->hash = hash_create(cxt->hash_name, 32, &ctl,
+                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    }
+
+    lentry = hash_search(*cxt->hash, &key, HASH_ENTER, &found);
+
+    /*
+     * Look up the shared hash if the entry is not found in the local hash.
+     * Outside a transaction we return up-to-date entries, so do the lookup
+     * even if a snapshot entry is found.
+     */
+    if (!found || !IsTransactionState())
+    {
+        void *sentry;
+
+        /* attach shared hash if not given, leave it alone for later use */
+        if (!*cxt->dshash)
+        {
+            MemoryContext oldcxt;
+
+            Assert(cxt->dsh_handle != DSM_HANDLE_INVALID);
+            oldcxt = MemoryContextSwitchTo(pgStatSnapshotContext);
+            *cxt->dshash =
+                dshash_attach(area, cxt->dsh_params, cxt->dsh_handle, NULL);
+            MemoryContextSwitchTo(oldcxt);
+        }
+
+        sentry = dshash_find(*cxt->dshash, &key, false);
+
+        if (sentry)
+        {
+            /*
+             * In transaction state, it is obvious that we should create local
+             * cache entries for consistency. If we are not, we return an
+             * up-to-date entry. Having said that, we need a local copy since
+             * dshash entry must be released immediately. We share the same
+             * local hash entry for the purpose.
+             */
+            memcpy(&lentry->body, sentry, dsh_entrysize);
+            dshash_release_lock(*cxt->dshash, sentry);
+
+            /* then zero out the local additional space if any */
+            if (dsh_entrysize < cxt->hash_entsize)
+                MemSet((char *)&lentry->body + dsh_entrysize, 0,
+                       cxt->hash_entsize - dsh_entrysize);
+        }
+
+        lentry->negative = !sentry;
+    }
+
+    if (lentry->negative)
+        return NULL;
+
+    return &lentry->body;
+}
+
+/*
+ * pgstat_flush_stat: Flushes table stats out to shared statistics.
+ *
+ *  If nowait is true, returns false if required lock was not acquired
+ *  immediately. In that case, unapplied table stats updates are left alone in
+ *  TabStatusArray to wait for the next chance. cxt holds some dshash related
+ *  values that we want to carry around while updating shared stats.
+ *
+ *  Returns true if all stats are flushed. Caller must detach dshashes
+ *  stored in cxt after use.
+ */
+static bool
+pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait)
+{
+    static const PgStat_TableCounts all_zeroes;
+    TabStatusArray *tsa;
+    HTAB           *new_tsa_hash = NULL;
+    TabStatusArray *dest_tsa = pgStatTabList;
+    int                dest_elem = 0;
+    int                i;
+
+    /* nothing to do, just return  */
+    if (pgStatTabHash == NULL)
+        return true;
 
     /*
      * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
+     * entries it points to.
      */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
+    hash_destroy(pgStatTabHash);
     pgStatTabHash = NULL;
 
     /*
      * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
+     * have counts, and try flushing them out to shared stats. We may fail on
+     * some entries in the array; those entries are left packed at the
+     * beginning of the array.
      */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
     for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
     {
         for (i = 0; i < tsa->tsa_used; i++)
         {
             PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
 
             /* Shouldn't have any pending transaction-dependent counts */
             Assert(entry->trans == NULL);
@@ -879,178 +906,352 @@ pgstat_report_stat(bool force)
                        sizeof(PgStat_TableCounts)) == 0)
                 continue;
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* try to apply the tab stats */
+            if (!pgstat_flush_tabstat(cxt, nowait, entry))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                /*
+                 * Failed. Move it to the beginning of TabStatusArray and
+                 * leave it there.
+                 */
+                TabStatHashEntry *hash_entry;
+                bool found;
+
+                if (new_tsa_hash == NULL)
+                    new_tsa_hash = create_tabstat_hash();
+
+                /* Create hash entry for this entry */
+                hash_entry = hash_search(new_tsa_hash, &entry->t_id,
+                                         HASH_ENTER, &found);
+                Assert(!found);
+
+                /*
+                 * Move insertion pointer to the next segment if the segment
+                 * is filled up.
+                 */
+                if (dest_elem >= TABSTAT_QUANTUM)
+                {
+                    Assert(dest_tsa->tsa_next != NULL);
+                    dest_tsa = dest_tsa->tsa_next;
+                    dest_elem = 0;
+                }
+
+                /*
+                 * Pack the entry at the beginning of the array. Do nothing if
+                 * it doesn't need to be moved.
+                 */
+                if (tsa != dest_tsa || i != dest_elem)
+                {
+                    PgStat_TableStatus *new_entry;
+                    new_entry = &dest_tsa->tsa_entries[dest_elem];
+                    *new_entry = *entry;
+
+                    /* use new_entry as entry hereafter */
+                    entry = new_entry;
+                }
+
+                hash_entry->tsa_entry = entry;
+                dest_elem++;
             }
         }
-        /* zero out TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
     }
 
-    /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
-     */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    /* zero out unused area of TableStatus */
+    dest_tsa->tsa_used = dest_elem;
+    MemSet(&dest_tsa->tsa_entries[dest_elem], 0,
+           (TABSTAT_QUANTUM - dest_elem) * sizeof(PgStat_TableStatus));
+    while (dest_tsa->tsa_next)
+    {
+        dest_tsa = dest_tsa->tsa_next;
+        MemSet(dest_tsa->tsa_entries, 0,
+               dest_tsa->tsa_used * sizeof(PgStat_TableStatus));
+        dest_tsa->tsa_used = 0;
+    }
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+    /* and set the new TabStatusArray hash if any */
+    pgStatTabHash = new_tsa_hash;
+
+    /*
+     * We no longer need shared database and table entries, but those for my
+     * database may be used later.
+     */
+    if (cxt->shdb_tabhash)
+    {
+        dshash_detach(cxt->shdb_tabhash);
+        unpin_hashes(cxt->shdbentry, cxt->shgeneration);
+        cxt->shdb_tabhash = NULL;
+        cxt->shdbentry = NULL;
+    }
+
+    return pgStatTabHash == NULL;
 }
 
-/*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+/* -------
+ * Subroutines for pgstat_flush_stat.
+ * -------
  */
-static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+/*
+ * pgstat_flush_tabstat: Flushes a table stats entry.
+ *
+ *  If nowait is true, returns false on lock failure.  Dshashes for table and
+ *  function stats are kept attached in cxt. The caller must detach them after
+ *  use.
+ *
+ *  Returns true if the entry is flushed out.
+ */
+bool
+pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
+                     PgStat_TableStatus *entry)
 {
-    int            n;
-    int            len;
+    Oid        dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
+    int        table_mode = PGSTAT_EXCLUSIVE;
+    bool    updated = false;
+    dshash_table *tabhash;
+    PgStat_StatDBEntry *dbent;
+    int        generation;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    if (nowait)
+        table_mode |= PGSTAT_NOWAIT;
 
-    /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
-     */
-    if (OidIsValid(tsmsg->m_databaseid))
+    /* Attach required table hash if not yet attached. */
+    if ((entry->t_shared ? cxt->shdb_tabhash : cxt->mydb_tabhash) == NULL)
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
-        pgStatXactCommit = 0;
-        pgStatXactRollback = 0;
-        pgStatBlockReadTime = 0;
-        pgStatBlockWriteTime = 0;
+        /*
+         * Return if we don't have the corresponding dbentry. It must have
+         * been removed.
+         */
+        dbent = pgstat_get_db_entry(dboid, table_mode, NULL);
+        if (!dbent)
+            return false;
+
+        /*
+         * We don't hold lock on the dbentry since it cannot be dropped while
+         * we are working on it.
+         */
+        generation = pin_hashes(dbent);
+        tabhash = attach_table_hash(dbent, generation);
+
+        if (entry->t_shared)
+        {
+            cxt->shgeneration = generation;
+            cxt->shdbentry = dbent;
+            cxt->shdb_tabhash = tabhash;
+        }
+        else
+        {
+            cxt->mygeneration = generation;
+            cxt->mydbentry = dbent;
+            cxt->mydb_tabhash = tabhash;
+
+            /*
+             * We come here once per database. Take the chance to update
+             * database-wide stats.
+             */
+            LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
+            dbent->n_xact_commit += pgStatXactCommit;
+            dbent->n_xact_rollback += pgStatXactRollback;
+            dbent->n_block_read_time += pgStatBlockReadTime;
+            dbent->n_block_write_time += pgStatBlockWriteTime;
+            LWLockRelease(&dbent->lock);
+            pgStatXactCommit = 0;
+            pgStatXactRollback = 0;
+            pgStatBlockReadTime = 0;
+            pgStatBlockWriteTime = 0;
+        }
+    }
+    else if (entry->t_shared)
+    {
+        dbent = cxt->shdbentry;
+        tabhash = cxt->shdb_tabhash;
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        dbent = cxt->mydbentry;
+        tabhash = cxt->mydb_tabhash;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    /*
+     * Local table stats should be applied to both dbentry and tabentry at
+     * once. Update dbentry only if we could update tabentry.
+     */
+    if (pgstat_update_tabentry(tabhash, entry, nowait))
+    {
+        pgstat_update_dbentry(dbent, entry);
+        updated = true;
+    }
+
+    return updated;
 }
 
 /*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
+ * pgstat_flush_funcstats: Flushes function stats.
+ *
+ *  If nowait is true, returns false on lock failure. Unapplied local hash
+ *  entries are left alone.
+ *
+ *  Returns true if all entries are flushed out.
  */
-static void
-pgstat_send_funcstats(void)
+static bool
+pgstat_flush_funcstats(pgstat_flush_stat_context *cxt, bool nowait)
 {
     /* we assume this inits to all zeroes: */
     static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
+    dshash_table   *funchash;
     HASH_SEQ_STATUS fstat;
+    PgStat_BackendFunctionEntry *bestat;
 
+    /* nothing to do, just return  */
     if (pgStatFunctions == NULL)
-        return;
+        return true;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
-
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    /* get dbentry into cxt if not done yet */
+    if (cxt->mydbentry == NULL)
     {
-        PgStat_FunctionEntry *m_ent;
+        int op = PGSTAT_EXCLUSIVE;
 
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
+        if (nowait)
+            op |= PGSTAT_NOWAIT;
+
+        cxt->mydbentry = pgstat_get_db_entry(MyDatabaseId, op, NULL);
+
+        if (cxt->mydbentry == NULL)
+            return false;
+
+        cxt->mygeneration = pin_hashes(cxt->mydbentry);
+    }
+
+    funchash = attach_function_hash(cxt->mydbentry, cxt->mygeneration);
+    if (funchash == NULL)
+        return false;
+
+    have_function_stats = false;
+
+    /*
+     * Scan through the pgStatFunctions to find functions that actually have
+     * counts, and try flushing them out to shared stats.
+     */
+    hash_seq_init(&fstat, pgStatFunctions);
+    while ((bestat = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    {
+        bool found;
+        PgStat_StatFuncEntry *funcent = NULL;
+
+        /* Skip it if no counts accumulated for it so far */
+        if (memcmp(&bestat->f_counts, &all_zeroes,
                    sizeof(PgStat_FunctionCounts)) == 0)
             continue;
 
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
+        funcent = (PgStat_StatFuncEntry *)
+            dshash_find_or_insert_extended(funchash, (void *) &(bestat->f_id),
+                                           &found, nowait);
 
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
+        /*
+         * We couldn't acquire lock on the required entry. Leave the local
+         * entry alone.
+         */
+        if (!funcent)
         {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
+            have_function_stats = true;
+            continue;
         }
 
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
+        /* Initialize if it's new, or add to it. */
+        if (!found)
+        {
+            funcent->functionid = bestat->f_id;
+            funcent->f_numcalls = bestat->f_counts.f_numcalls;
+            funcent->f_total_time =
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_total_time);
+            funcent->f_self_time =
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_self_time);
+        }
+        else
+        {
+            funcent->f_numcalls += bestat->f_counts.f_numcalls;
+            funcent->f_total_time +=
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_total_time);
+            funcent->f_self_time +=
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_self_time);
+        }
+        dshash_release_lock(funchash, funcent);
+
+        /* reset used counts */
+        MemSet(&bestat->f_counts, 0, sizeof(PgStat_FunctionCounts));
     }
 
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
+    return !have_function_stats;
 }
 
+/*
+ * pgstat_flush_dbstats: Flushes out miscellaneous database stats.
+ *
+ *  If nowait is true, returns false if the lock on dbentry cannot be acquired.
+ *
+ *  Returns true if all stats are flushed out.
+ */
+static bool
+pgstat_flush_dbstats(pgstat_flush_stat_context *cxt, bool nowait)
+{
+    /* get dbentry if not done yet */
+    if (cxt->mydbentry == NULL)
+    {
+        int op = PGSTAT_EXCLUSIVE;
+        if (nowait)
+            op |= PGSTAT_NOWAIT;
+
+        cxt->mydbentry = pgstat_get_db_entry(MyDatabaseId, op, NULL);
+
+        /* return if lock failed. */
+        if (cxt->mydbentry == NULL)
+            return false;
+
+        /* we use this generation of table/function stats in this turn */
+        cxt->mygeneration = pin_hashes(cxt->mydbentry);
+    }
+
+    LWLockAcquire(&cxt->mydbentry->lock, LW_EXCLUSIVE);
+    if (HAVE_PENDING_CONFLICTS())
+        pgstat_flush_recovery_conflict(cxt->mydbentry);
+    if (BeDBStats.n_deadlocks != 0)
+        pgstat_flush_deadlock(cxt->mydbentry);
+    if (BeDBStats.n_tmpfiles != 0)
+        pgstat_flush_tempfile(cxt->mydbentry);
+    if (BeDBStats.checksum_failures != NULL)
+        pgstat_flush_checksum_failure(cxt->mydbentry);
+    LWLockRelease(&cxt->mydbentry->lock);
+
+    return true;
+}
 
 /* ----------
  * pgstat_vacuum_stat() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *    Remove objects we can get rid of.
  * ----------
  */
 void
 pgstat_vacuum_stat(void)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
+    HTAB       *oidtab;
+    dshash_seq_status dshstat;
     PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* we don't collect stats under standalone mode */
+    if (!IsUnderPostmaster)
         return;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
     /*
      * Read pg_database and make a list of OIDs of all existing databases
      */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    oidtab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
 
     /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
+     * Search the database hash table for dead databases and drop them
+     * from the hash.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+
+    dshash_seq_init(&dshstat, pgStatDBHash, false, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            dbid = dbentry->databaseid;
 
@@ -1058,137 +1259,43 @@ pgstat_vacuum_stat(void)
 
         /* the DB entry for shared tables (with InvalidOid) is never dropped */
         if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
+            hash_search(oidtab, (void *) &dbid, HASH_FIND, NULL) == NULL)
             pgstat_drop_database(dbid);
     }
 
     /* Clean up */
-    hash_destroy(htab);
+    hash_destroy(oidtab);
 
     /*
      * Lookup our own database entry; if not found, nothing more to do.
      */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_EXCLUSIVE, NULL);
+    if (!dbentry)
         return;
 
     /*
      * Similarly to above, make a list of all known relations in this DB.
      */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
+    oidtab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
 
     /*
      * Check for all tables listed in stats hashtable if they still exist.
+     * Stats cache is useless here so directly search the shared hash.
      */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            tabid = tabentry->tableid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
-            continue;
-
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
-    }
+    pgstat_remove_useless_entries(dbentry->tables, &dsh_tblparams, oidtab);
 
     /*
-     * Send the rest
+     * Repeat the above for functions, but we needn't bother in the common
+     * case where no function stats are being collected.
      */
-    if (msg.m_nentries > 0)
+    if (dbentry->functions != DSM_HANDLE_INVALID)
     {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
+        oidtab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
 
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
-        {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
-                continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
-        }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
+        pgstat_remove_useless_entries(dbentry->functions, &dsh_funcparams,
+                                      oidtab);
     }
+    dshash_release_lock(pgStatDBHash, dbentry);
 }
 
 
@@ -1242,66 +1349,99 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     return htab;
 }
 
+/*
+ * pgstat_remove_useless_entries - Remove useless entries from per
+ * table/function dshashes.
+ *
+ *  Scan the dshash specified by dshhandle removing entries that are not in
+ *  oidtab. oidtab is destroyed before returning.
+ */
+void
+pgstat_remove_useless_entries(const dshash_table_handle dshhandle,
+                              const dshash_parameters *dshparams,
+                              HTAB *oidtab)
+{
+    dshash_table *dshtable;
+    dshash_seq_status dshstat;
+    void         *ent;
+
+    dshtable = dshash_attach(area, dshparams, dshhandle, 0);
+    dshash_seq_init(&dshstat, dshtable, false, true);
+
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
+    {
+        CHECK_FOR_INTERRUPTS();
+
+        /* The first member of the entries must be Oid */
+        if (hash_search(oidtab, ent, HASH_FIND, NULL) != NULL)
+            continue;
+
+        /* Not there, so purge this entry */
+        dshash_delete_entry(dshtable, ent);
+    }
+    dshash_detach(dshtable);
+    hash_destroy(oidtab);
+}
 
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
+ *    Remove entry for the database that we just dropped.
+ *
+ *    If some stats are flushed after this, this entry will be re-created but we
+ *    will still clean the dead DB eventually via future invocations of
+ *    pgstat_vacuum_stat().
  * ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    Assert(OidIsValid(databaseid));
+
+    if (!IsUnderPostmaster || !pgStatDBHash)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Lookup the database in the hashtable with exclusive lock.
+     */
+    dbentry = pgstat_get_db_entry(databaseid, PGSTAT_EXCLUSIVE, NULL);
+
+    /*
+     * If found, remove it.
+     */
+    if (dbentry)
+    {
+        /* LWLock is needed to modify the entry */
+        LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+
+        /* No one is using tables/functions in this dbentry */
+        Assert(dbentry->refcnt == 0);
+
+        /* Remove table/function stats dshash first. */
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+        }
+        LWLockRelease(&dbentry->lock);
+
+        dshash_delete_entry(pgStatDBHash, (void *)dbentry);
+    }
 }
 
-
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
-
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
-}
-#endif                            /* NOT_USED */
-
-
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1310,20 +1450,32 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    PgStat_StatDBEntry       *dbentry;
+    PgStat_TableLookupResult status;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!pgStatDBHash)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Lookup the database in the hashtable.  Nothing to do if not there.
+     */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_EXCLUSIVE, &status);
+
+    if (!dbentry)
+        return;
+
+    /* This database is active, safe to release the lock immediately. */
+    dshash_release_lock(pgStatDBHash, dbentry);
+
+    /* Reset database-level stats. */
+    reset_dbentry_counters(dbentry);
+
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1332,29 +1484,37 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
+    /* Reset the archiver statistics for the cluster. */
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    /* Reset the bgwriter statistics for the cluster. */
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1363,17 +1523,42 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
+    int generation;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_EXCLUSIVE, NULL);
+
+    if (!dbentry)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* This database is active, safe to release the lock immediately. */
+    generation = pin_hashes(dbentry);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Set the reset timestamp for the whole database */
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&dbentry->lock);
+
+    /* Remove object if it exists, ignore if not */
+    if (type == RESET_TABLE)
+    {
+        dshash_table *t = attach_table_hash(dbentry, generation);
+        dshash_delete_key(t, (void *) &objoid);
+        dshash_detach(t);
+    }
+
+    if (type == RESET_FUNCTION)
+    {
+        dshash_table *t = attach_function_hash(dbentry, generation);
+        if (t)
+        {
+            dshash_delete_key(t, (void *) &objoid);
+            dshash_detach(t);
+        }
+    }
+    unpin_hashes(dbentry, generation);
 }
 
 /* ----------
@@ -1387,48 +1572,81 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    /*
+     * Store the last autovacuum time in the database's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_EXCLUSIVE, NULL);
+    dshash_release_lock(pgStatDBHash, dbentry);
 
-    pgstat_send(&msg, sizeof(msg));
+    ts = GetCurrentTimestamp();
+
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->last_autovac_time = ts;
+    LWLockRelease(&dbentry->lock);
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report about the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
+    int                    generation;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hash table entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_EXCLUSIVE, NULL);
+    generation = pin_hashes(dbentry);
+    table = attach_table_hash(dbentry, generation);
+
+    tabentry = pgstat_get_tab_entry(table, tableoid, true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->vacuum_count++;
+    }
+    dshash_release_lock(table, tabentry);
+
+    dshash_detach(table);
+    unpin_hashes(dbentry, generation);
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report about the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1439,9 +1657,14 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table       *table;
+    int                    generation;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1470,78 +1693,153 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_EXCLUSIVE, NULL);
+    generation = pin_hashes(dbentry);
+    table = attach_table_hash(dbentry, generation);
+    tabentry = pgstat_get_tab_entry(table, RelationGetRelid(rel), true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    dshash_release_lock(table, tabentry);
+
+    dshash_detach(table);
+    unpin_hashes(dbentry, generation);
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupResult status;

-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            BeDBStats.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            BeDBStats.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            BeDBStats.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            BeDBStats.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            BeDBStats.n_conflict_startup_deadlock++;
+            break;
+    }
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
+
+    if (status == LOCK_FAILED)
+        return;
+
+    /* We had a chance to flush immediately */
+    pgstat_flush_recovery_conflict(dbentry);
+
+    dshash_release_lock(pgStatDBHash, dbentry);
+}
+
+/*
+ * flush recovery conflict stats
+ */
+static void
+pgstat_flush_recovery_conflict(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_conflict_tablespace    += BeDBStats.n_conflict_tablespace;
+    dbentry->n_conflict_lock         += BeDBStats.n_conflict_lock;
+    dbentry->n_conflict_snapshot    += BeDBStats.n_conflict_snapshot;
+    dbentry->n_conflict_bufferpin    += BeDBStats.n_conflict_bufferpin;
+    dbentry->n_conflict_startup_deadlock += BeDBStats.n_conflict_startup_deadlock;
+
+    BeDBStats.n_conflict_tablespace = 0;
+    BeDBStats.n_conflict_lock = 0;
+    BeDBStats.n_conflict_snapshot = 0;
+    BeDBStats.n_conflict_bufferpin = 0;
+    BeDBStats.n_conflict_startup_deadlock = 0;
 }
 
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupResult status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    BeDBStats.n_deadlocks++;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
+
+    if (status == LOCK_FAILED)
+        return;
+    pgstat_flush_deadlock(dbentry);
+    dshash_release_lock(pgStatDBHash, dbentry);
 }
 
-
-
-/* --------
- * pgstat_report_checksum_failures_in_db() -
- *
- *    Tell the collector about one or more checksum failures.
- * --------
+/*
+ * flush deadlock stats
  */
-void
-pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
+static void
+pgstat_flush_deadlock(PgStat_StatDBEntry *dbentry)
 {
-    PgStat_MsgChecksumFailure msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
-
-    pgstat_send(&msg, sizeof(msg));
+    dbentry->n_deadlocks += BeDBStats.n_deadlocks;
+    BeDBStats.n_deadlocks = 0;
 }
 
 /* --------
@@ -1559,60 +1857,153 @@ pgstat_report_checksum_failure(void)
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupResult status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
-}
+    if (filesize > 0) /* Is there a case where filesize is really 0? */
+    {
+        BeDBStats.tmpfilesize += filesize; /* needs an overflow check */
+        BeDBStats.n_tmpfiles++;
+    }
 
-
-/* ----------
- * pgstat_ping() -
- *
- *    Send some junk data to the collector to increase traffic.
- * ----------
- */
-void
-pgstat_ping(void)
-{
-    PgStat_MsgDummy msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (BeDBStats.n_tmpfiles == 0)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
+
+    if (status == LOCK_FAILED)
+        return;
+
+    /* We had a chance to flush immediately */
+    pgstat_flush_tempfile(dbentry);
+
+    dshash_release_lock(pgStatDBHash, dbentry);
 }
 
-/* ----------
- * pgstat_send_inquiry() -
- *
- *    Notify collector that we need fresh data.
- * ----------
+/*
+ * flush temporary file stats
  */
 static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
+pgstat_flush_tempfile(PgStat_StatDBEntry *dbentry)
 {
-    PgStat_MsgInquiry msg;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
+    dbentry->n_temp_bytes += BeDBStats.tmpfilesize;
+    dbentry->n_temp_files += BeDBStats.n_tmpfiles;
+    BeDBStats.tmpfilesize = 0;
+    BeDBStats.n_tmpfiles = 0;
 }
 
+/* --------
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
+ *
+ *    Report one or more checksum failures.
+ * --------
+ */
+void
+pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
+{
+    PgStat_StatDBEntry       *dbentry;
+    PgStat_TableLookupResult status;
+    ChecksumFailureEnt       *failent = NULL;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    if (BeDBStats.checksum_failures != NULL)
+    {
+        failent = hash_search(BeDBStats.checksum_failures, &dboid,
+                              HASH_FIND, NULL);
+        if (failent)
+            failurecount += failent->count;
+    }
+
+    if (failurecount == 0)
+        return;
+
+    dbentry = pgstat_get_db_entry(dboid,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
+
+    if (status == LOCK_FAILED)
+    {
+        if (!failent)
+        {
+            if (!BeDBStats.checksum_failures)
+            {
+                HASHCTL    ctl;
+
+                ctl.keysize = sizeof(Oid);
+                ctl.entrysize = sizeof(ChecksumFailureEnt);
+                BeDBStats.checksum_failures =
+                    hash_create("pgstat checksum failure count hash",
+                                32, &ctl, HASH_ELEM | HASH_BLOBS);
+            }
+
+            failent = hash_search(BeDBStats.checksum_failures,
+                                  &dboid, HASH_ENTER, NULL);
+        }
+
+        failent->count = failurecount;
+        return;
+    }
+
+    /* We have a chance to flush immediately */
+    dbentry->n_checksum_failures += failurecount;
+    BeDBStats.checksum_failures = NULL;
+
+    dshash_release_lock(pgStatDBHash, dbentry);
+}
+
+/*
+ * flush checksum failure counts for all databases
+ */
+static void
+pgstat_flush_checksum_failure(PgStat_StatDBEntry *dbentry)
+{
+    HASH_SEQ_STATUS     stat;
+    ChecksumFailureEnt *ent;
+    bool                release_dbent;
+
+    if (BeDBStats.checksum_failures == NULL)
+        return;
+
+    hash_seq_init(&stat, BeDBStats.checksum_failures);
+    while ((ent = (ChecksumFailureEnt *) hash_seq_search(&stat)) != NULL)
+    {
+        release_dbent = false;
+
+        if (dbentry->databaseid != ent->dboid)
+        {
+            dbentry = pgstat_get_db_entry(ent->dboid,
+                                          PGSTAT_EXCLUSIVE, NULL);
+            if (!dbentry)
+                continue;
+
+            release_dbent = true;
+        }
+
+        dbentry->n_checksum_failures += ent->count;
+
+        if (release_dbent)
+            dshash_release_lock(pgStatDBHash, dbentry);
+    }
+
+    hash_destroy(BeDBStats.checksum_failures);
+    BeDBStats.checksum_failures = NULL;
+}
 
 /*
  * Initialize function call usage data.
@@ -1764,7 +2155,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1783,6 +2175,24 @@ pgstat_initstats(Relation rel)
     rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
 }
 
+/*
+ * create_tabstat_hash - create local hash as transactional storage
+ */
+static HTAB *
+create_tabstat_hash(void)
+{
+    HASHCTL        ctl;
+
+    MemSet(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(Oid);
+    ctl.entrysize = sizeof(TabStatHashEntry);
+
+    return hash_create("pgstat TabStatusArray lookup hash table",
+                       TABSTAT_QUANTUM,
+                       &ctl,
+                       HASH_ELEM | HASH_BLOBS);
+}
+
 /*
  * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
  */
@@ -1798,18 +2208,7 @@ get_tabstat_entry(Oid rel_id, bool isshared)
      * Create hash table if we don't have it already.
      */
     if (pgStatTabHash == NULL)
-    {
-        HASHCTL        ctl;
-
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
-    }
+        pgStatTabHash = create_tabstat_hash();
 
     /*
      * Find an entry or create a new one.
@@ -2422,30 +2821,33 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
- * ----------
+ *    Find database stats entry on backends. The returned entries are cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_param param =
+    {
+        .hash_name = "local database stats hash",
+        .hash_entsize = sizeof(PgStat_StatDBEntry),
+        .dsh_handle = DSM_HANDLE_INVALID,   /* already attached */
+        .dsh_params = &dsh_dbparams,
+        .hash = &pgStatLocalHash,
+        .dshash = &pgStatDBHash
+    };
 
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    /* If not done for this transaction, take a snapshot of global stats */
+    pgstat_snapshot_global_stats();
+
+    /* the caller has no business with snapshot-local members */
+    return (PgStat_StatDBEntry *) snapshot_statentry(&param, dbid);
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
@@ -2458,51 +2860,66 @@ pgstat_fetch_stat_dbentry(Oid dbid)
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* Lookup our database, then look in its table hash table. */
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    if (dbentry == NULL)
+        return NULL;
 
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    dbentry = pgstat_fetch_stat_dbentry(InvalidOid);
+    if (dbentry == NULL)
+        return NULL;
+
+    tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     return NULL;
 }
 
+/* ----------
+ * pgstat_fetch_stat_tabentry_extended() -
+ *
+ *    Find table stats entry on backends. The returned entries are cached until
+ *    transaction end or pgstat_clear_snapshot() is called.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_extended(PgStat_StatDBEntry *dbent, Oid reloid)
+{
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_param param =
+    {
+        .hash_name = "table stats snapshot hash",
+        .hash_entsize = sizeof(PgStat_StatTabEntry),
+        .dsh_handle = DSM_HANDLE_INVALID,
+        .dsh_params = &dsh_tblparams,
+        .hash = NULL,
+        .dshash = NULL
+    };
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    /* set target shared hash */
+    param.dsh_handle = dbent->tables;
+
+    /* tell snapshot_statentry what variables to use */
+    param.hash = &dbent->snapshot_tables;
+    param.dshash = &dbent->dshash_tables;
+
+    return (PgStat_StatTabEntry *)
+        snapshot_statentry(&param, reloid);
+}
+
 
 /* ----------
  * pgstat_fetch_stat_funcentry() -
@@ -2517,21 +2934,90 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     PgStat_StatDBEntry *dbentry;
     PgStat_StatFuncEntry *funcentry = NULL;
 
-    /* load the stats file if needed */
-    backend_read_statsfile();
-
-    /* Lookup our database, then find the requested function.  */
+    /* Lookup our database, then find the requested function */
     dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
+    if (dbentry == NULL)
+        return NULL;
+
+    funcentry = pgstat_fetch_stat_funcentry_extended(dbentry, func_id);
 
     return funcentry;
 }
 
+/* ----------
+ * pgstat_fetch_stat_funcentry_extended() -
+ *
+ *    Find function stats entry on backends. The returned entries are cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
+ *
+ *  dbent has type (PgStat_StatDBEntry *), but it must point to a
+ *  PgStat_StatDBEntry returned from pgstat_fetch_stat_dbentry().
+ */
+static PgStat_StatFuncEntry *
+pgstat_fetch_stat_funcentry_extended(PgStat_StatDBEntry *dbent, Oid funcid)
+{
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_param param =
+    {
+        .hash_name = "function stats snapshot hash",
+        .hash_entsize = sizeof(PgStat_StatFuncEntry),
+        .dsh_handle = DSM_HANDLE_INVALID,
+        .dsh_params = &dsh_funcparams,
+        .hash = NULL,
+        .dshash = NULL
+    };
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    if (dbent->functions == DSM_HANDLE_INVALID)
+        return NULL;
+
+    /* set target shared hash */
+    param.dsh_handle = dbent->functions;
+
+    /* tell snapshot_statentry what variables to use */
+    param.hash = &dbent->snapshot_functions;
+    param.dshash = &dbent->dshash_functions;
+
+    return (PgStat_StatFuncEntry *)
+        snapshot_statentry(&param, funcid);
+}
+
+/*
+ * pgstat_snapshot_global_stats() -
+ *
+ * Makes a snapshot of global stats if not done yet.  They will be kept until
+ * subsequent call of pgstat_clear_snapshot() or the end of the current
+ * memory context (typically TopTransactionContext).
+ */
+static void
+pgstat_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext;
+
+    pgstat_attach_shared_stats();
+
+    /* Nothing to do if already done */
+    if (global_snapshot_is_valid)
+        return;
+
+    oldcontext = MemoryContextSwitchTo(pgStatSnapshotContext);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+
+    memcpy(&snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+    LWLockRelease(StatsLock);
+
+    global_snapshot_is_valid = true;
+
+    MemoryContextSwitchTo(oldcontext);
+
+    return;
+}
 
 /* ----------
  * pgstat_fetch_stat_beentry() -
@@ -2603,9 +3089,10 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &archiverStats;
+    return &snapshot_archiverStats;
 }
 
 
@@ -2620,9 +3107,10 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &globalStats;
+    return &snapshot_globalStats;
 }
 
 
@@ -2836,8 +3324,8 @@ pgstat_initialize(void)
         MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
     }
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* needs to be called before DSM shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -2935,7 +3423,7 @@ pgstat_bestart(void)
                 lbeentry.st_backendType = B_STARTUP;
                 break;
             case ArchiverProcess:
-                beentry->st_backendType = B_ARCHIVER;
+                lbeentry.st_backendType = B_ARCHIVER;
                 break;
             case BgWriterProcess:
                 lbeentry.st_backendType = B_BG_WRITER;
@@ -3071,6 +3559,10 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+
+    /* attach shared database stats area */
+    pgstat_attach_shared_stats();
 }
 
 /*
@@ -3106,6 +3598,8 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    pgstat_detach_shared_stats(true);
 }
 
 
@@ -3366,7 +3860,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3661,9 +4156,6 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
-            break;
         case WAIT_EVENT_RECOVERY_WAL_ALL:
             event_name = "RecoveryWalAll";
             break;
@@ -4323,75 +4815,43 @@ pgstat_get_backend_desc(BackendType backendType)
  * ------------------------------------------------------------
  */
 
-
-/* ----------
- * pgstat_setheader() -
- *
- *        Set common header fields in a statistics message
- * ----------
- */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
-    {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
 /* ----------
  * pgstat_send_archiver() -
  *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
+ *        Report archiver statistics
  * ----------
  */
 void
 pgstat_send_archiver(const char *xlog, bool failed)
 {
-    PgStat_MsgArchiver msg;
+    TimestampTz now = GetCurrentTimestamp();
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+    if (failed)
+    {
+        /* Failed archival attempt */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->failed_count;
+        memcpy(shared_archiverStats->last_failed_wal, xlog,
+               sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    else
+    {
+        /* Successful archival operation */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->archived_count;
+        memcpy(shared_archiverStats->last_archived_wal, xlog,
+               sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
 }
 
 /* ----------
  * pgstat_send_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
@@ -4400,6 +4860,8 @@ pgstat_send_bgwriter(void)
     /* We assume this initializes to zeroes */
     static const PgStat_MsgBgWriter all_zeroes;
 
+    PgStat_MsgBgWriter *s = &BgWriterStats;
+
     /*
      * This function can be called even if nothing at all has happened. In
      * this case, avoid sending a completely empty message to the stats
@@ -4408,11 +4870,18 @@ pgstat_send_bgwriter(void)
     if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += s->m_timed_checkpoints;
+    shared_globalStats->requested_checkpoints += s->m_requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += s->m_checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += s->m_checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += s->m_buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += s->m_buf_written_clean;
+    shared_globalStats->maxwritten_clean += s->m_maxwritten_clean;
+    shared_globalStats->buf_written_backend += s->m_buf_written_backend;
+    shared_globalStats->buf_fsync_backend += s->m_buf_fsync_backend;
+    shared_globalStats->buf_alloc += s->m_buf_alloc;
+    LWLockRelease(StatsLock);
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4421,305 +4890,164 @@ pgstat_send_bgwriter(void)
 }
 
 
-/* ----------
- * PgstatCollectorMain() -
+/*
+ * Pin and Unpin dbentry.
  *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
+ * To reduce memory usage, and for speed, counters are reset by recreating the
+ * dshash instead of removing entries one by one while holding a lock on the
+ * whole dshash. On the other hand, a dshash cannot be destroyed until all
+ * referrers have gone, so other backends could be kept waiting for a counter
+ * reset for a long time. We therefore isolate the hashes under destruction as
+ * another generation, which is no longer used but cannot be removed yet.
+ *
+ * When we start accessing hashes on a dbentry, call pin_hashes() and acquire
+ * the current "generation". unpin_hashes() removes the older generation's
+ * hashes when all referrers have gone.
  */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
+static int
+pin_hashes(PgStat_StatDBEntry *dbentry)
 {
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
+    int    generation;
 
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, pgstat_sighup_handler);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, pgstat_exit);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->refcnt++;
+    generation = dbentry->generation;
+    LWLockRelease(&dbentry->lock);
 
-    /*
-     * Identify myself via ps
-     */
-    init_ps_display("stats collector", "", "", "");
+    dshash_release_lock(pgStatDBHash, dbentry);
 
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
+    return generation;
+}
 
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do recognize got_SIGHUP inside
-     * the inner loop, which means that such interrupts will get serviced but
-     * the latch won't get cleared until next time there is a break in the
-     * action.
-     */
-    for (;;)
+/*
+ * Unpin hashes in dbentry. If the given generation is isolated, destroy it
+ * after all referrers have gone. Otherwise just decrease the reference count.
+ */
+static void
+unpin_hashes(PgStat_StatDBEntry *dbentry, int generation)
+{
+    dshash_table *tables;
+    dshash_table *funcs = NULL;
+
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+
+    /* using current generation, just decrease refcount */
+    if (dbentry->generation == generation)
     {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (need_exit)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * need_exit becomes set.
-         */
-        while (!need_exit)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (got_SIGHUP)
-            {
-                got_SIGHUP = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(
-                                                   &msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(
-                                                   &msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(
-                                                 &msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(
-                                                 &msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr & WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
+        dbentry->refcnt--;
+        LWLockRelease(&dbentry->lock);
+        return;
+    }
 
     /*
-     * Save the final stats to reuse at next startup.
+     * It is isolated, waiting for all referrers to end.
      */
-    pgstat_write_statsfiles(true, true);
+    Assert(dbentry->generation == generation + 1);
 
-    exit(0);
+    if (--dbentry->prev_refcnt > 0)
+    {
+        LWLockRelease(&dbentry->lock);
+        return;
+    }
+
+    /* no referrer remains, remove the hashes */
+    tables = dshash_attach(area, &dsh_tblparams, dbentry->prev_tables, 0);
+    if (dbentry->prev_functions != DSM_HANDLE_INVALID)
+        funcs = dshash_attach(area, &dsh_funcparams,
+                              dbentry->prev_functions, 0);
+
+    dbentry->prev_tables = DSM_HANDLE_INVALID;
+    dbentry->prev_functions = DSM_HANDLE_INVALID;
+
+    /* release the entry immediately */
+    LWLockRelease(&dbentry->lock);
+
+    dshash_destroy(tables);
+    if (funcs)
+        dshash_destroy(funcs);
+
+    return;
 }
 
-
-/* SIGQUIT signal handler for collector process */
-static void
-pgstat_exit(SIGNAL_ARGS)
+/*
+ * Attach to and return the specified generation of the table hash. The
+ * dbentry lock is taken unconditionally, so this never returns NULL.
+ */
+static dshash_table *
+attach_table_hash(PgStat_StatDBEntry *dbent, int gen)
 {
-    int            save_errno = errno;
+    dshash_table *ret;
 
-    need_exit = true;
-    SetLatch(MyLatch);
+    LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
 
-    errno = save_errno;
+    if (dbent->generation == gen)
+        ret = dshash_attach(area, &dsh_tblparams, dbent->tables, 0);
+    else
+    {
+        Assert(dbent->generation == gen + 1);
+        Assert(dbent->prev_tables != DSM_HANDLE_INVALID);
+        ret = dshash_attach(area, &dsh_tblparams, dbent->prev_tables, 0);
+    }
+    LWLockRelease(&dbent->lock);
+
+    return ret;
 }
 
-/* SIGHUP handler for collector process */
-static void
-pgstat_sighup_handler(SIGNAL_ARGS)
+/* attach and return the specified generation of function hash */
+static dshash_table *
+attach_function_hash(PgStat_StatDBEntry *dbent, int gen)
 {
-    int            save_errno = errno;
+    dshash_table *ret = NULL;
 
-    got_SIGHUP = true;
-    SetLatch(MyLatch);
 
-    errno = save_errno;
+    LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
+
+    if (dbent->generation == gen)
+    {
+        if (dbent->functions == DSM_HANDLE_INVALID)
+        {
+            dshash_table *funchash =
+                dshash_create(area, &dsh_funcparams, 0);
+            dbent->functions = dshash_get_hash_table_handle(funchash);
+
+            ret = funchash;
+        }
+        else
+            ret = dshash_attach(area, &dsh_funcparams, dbent->functions, 0);
+    }
+    /* don't bother creating useless hash */
+
+    LWLockRelease(&dbent->lock);
+
+    return ret;
+}
+
+static void
+init_dbentry(PgStat_StatDBEntry *dbentry)
+{
+    LWLockInitialize(&dbentry->lock, LWTRANCHE_STATS);
+    dbentry->generation = 0;
+    dbentry->refcnt = 0;
+    dbentry->prev_refcnt = 0;
+    dbentry->tables = DSM_HANDLE_INVALID;
+    dbentry->prev_tables = DSM_HANDLE_INVALID;
+    dbentry->functions = DSM_HANDLE_INVALID;
+    dbentry->prev_functions = DSM_HANDLE_INVALID;
 }
 
 /*
  * Subroutine to clear stats in a database entry
  *
- * Tables and functions hashes are initialized to empty.
+ * Reset all counters in the dbentry. Tables and functions dshashes are
+ * destroyed.  If any backend is pinning this dbentry, the current dshashes
+ * are stashed out to the previous "generation" to wait until all accessors
+ * are gone. If the previous generation is already occupied, the current
+ * dshashes are so fresh that they don't need to be cleared.
  */
 static void
 reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
 {
-    HASHCTL        hash_ctl;
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
 
     dbentry->n_xact_commit = 0;
     dbentry->n_xact_rollback = 0;
@@ -4744,72 +5072,865 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
     dbentry->n_block_read_time = 0;
     dbentry->n_block_write_time = 0;
 
+    if (dbentry->refcnt == 0)
+    {
+        /*
+         * No one is referring to the current hash. It's very costly to remove
+         * entries in dshash individually so just destroy the whole.  If
+         * someone pins this entry just after, pin_hashes() returns the
+         * current generation and attach will happen after the following
+         * LWLock is released.
+         */
+        dshash_table *tbl;
+
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+            dbentry->tables = DSM_HANDLE_INVALID;
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+            dbentry->functions = DSM_HANDLE_INVALID;
+        }
+    }
+    else if (dbentry->prev_refcnt == 0)
+    {
+        /*
+         * Someone is still referring to the current hash and previous slot is
+         * vacant. Stash out the current hash to the previous slot.
+         */
+        dbentry->prev_refcnt = dbentry->refcnt;
+        dbentry->prev_tables = dbentry->tables;
+        dbentry->prev_functions = dbentry->functions;
+        dbentry->refcnt = 0;
+        dbentry->tables = DSM_HANDLE_INVALID;
+        dbentry->functions = DSM_HANDLE_INVALID;
+        dbentry->generation++;
+    }
+    else
+    {
+        Assert(dbentry->prev_refcnt > 0 && dbentry->refcnt > 0);
+        /*
+         * If we get here, we have just got another reset request while the
+         * old hashes are still waiting for all referrers to be released. That
+         * should be quite a short time, so we can just ignore this request.
+         *
+         * As a side effect, the resetter can see non-zero values before
+         * anyone updates them, but that is indistinguishable from someone
+         * having updated them before the read.
+         */
+    }
+
+    /* Create a new table hash if it does not exist */
+    if (dbentry->tables == DSM_HANDLE_INVALID)
+    {
+        dshash_table *tbl = dshash_create(area, &dsh_tblparams, 0);
+        dbentry->tables = dshash_get_hash_table_handle(tbl);
+        dshash_detach(tbl);
+    }
+
+    /* Create a new function hash if it does not exist and is needed. */
+    if (dbentry->functions == DSM_HANDLE_INVALID &&
+        pgstat_track_functions != TRACK_FUNC_OFF)
+    {
+        dshash_table *tbl = dshash_create(area, &dsh_funcparams, 0);
+        dbentry->functions = dshash_get_hash_table_handle(tbl);
+        dshash_detach(tbl);
+    }
+
     dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
 
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
+    LWLockRelease(&dbentry->lock);
 }
 
 /*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+ * Create the filename for a DB stat file; filename is an output parameter
+ * that points to a character buffer of length len.
  */
-static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
+static void
+get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
 {
-    PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
+    int            printed;
 
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/db_%u.%s",
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
+                       databaseid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
+}
 
-    if (!create && !found)
-        return NULL;
+/* ----------
+ * pgstat_write_statsfiles() -
+ *        Write the global statistics file, as well as DB files.
+ * ----------
+ */
+void
+pgstat_write_statsfiles(void)
+{
+    dshash_seq_status hstat;
+    PgStat_StatDBEntry *dbentry;
+    FILE       *fpout;
+    int32        format_id;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    int            rc;
+
+    /* Stats are not initialized yet; just return. */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
+
+    elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
     /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
+     * Open the statistics temp file to write out the current values.
      */
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        return;
+    }
+
+    /*
+     * Set the timestamp of the stats file.
+     */
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
+
+    /*
+     * Write the file header --- currently just a format ID.
+     */
+    format_id = PGSTAT_FILE_FORMAT_ID;
+    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write global stats struct
+     */
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write archiver stats struct
+     */
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Walk through the database table.
+     */
+    dshash_seq_init(&hstat, pgStatDBHash, false, false);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&hstat)) != NULL)
+    {
+        /*
+         * Write out the table and function stats for this DB into the
+         * appropriate per-DB stat file, if required.
+         */
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
+
+        pgstat_write_pgStatDBHashfile(dbentry);
+
+        /*
+         * Write out the DB entry. We don't write the tables or functions
+         * pointers, since they're of no use to any other process.
+         */
+        fputc('D', fpout);
+        rc = fwrite(dbentry,
+                    offsetof(PgStat_StatDBEntry, generation), 1, fpout);
+        (void) rc;                /* we'll check for error with ferror */
+    }
+
+    /*
+     * No more output to be done. Close the temp file and replace the old
+     * pgstat.stat with it.  The ferror() check replaces testing for error
+     * after each individual fputc or fwrite above.
+     */
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+}
+
+/* ----------
+ * pgstat_write_pgStatDBHashfile() -
+ *        Write the stat file for a single database.
+ * ----------
+ */
+static void
+pgstat_write_pgStatDBHashfile(PgStat_StatDBEntry *dbentry)
+{
+    dshash_seq_status tstat;
+    dshash_seq_status fstat;
+    PgStat_StatTabEntry *tabentry;
+    PgStat_StatFuncEntry *funcentry;
+    FILE       *fpout;
+    int32        format_id;
+    Oid            dbid = dbentry->databaseid;
+    int            rc;
+    char        tmpfile[MAXPGPATH];
+    char        statfile[MAXPGPATH];
+    dshash_table *tbl;
+
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
+
+    elog(DEBUG2, "writing stats file \"%s\"", statfile);
+
+    /*
+     * Open the statistics temp file to write out the current values.
+     */
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        return;
+    }
+
+    /*
+     * Write the file header --- currently just a format ID.
+     */
+    format_id = PGSTAT_FILE_FORMAT_ID;
+    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Walk through the database's access stats per table.
+     */
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&tstat, tbl, false, false);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&tstat)) != NULL)
+    {
+        fputc('T', fpout);
+        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
+        (void) rc;                /* we'll check for error with ferror */
+    }
+    dshash_detach(tbl);
+
+    /*
+     * Walk through the database's function stats table.
+     */
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_seq_init(&fstat, tbl, false, false);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&fstat)) != NULL)
+        {
+            fputc('F', fpout);
+            rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+            (void) rc;                /* we'll check for error with ferror */
+        }
+        dshash_detach(tbl);
+    }
+
+    /*
+     * No more output to be done. Close the temp file and replace the old
+     * pgstat.stat with it.  The ferror() check replaces testing for error
+     * after each individual fputc or fwrite above.
+     */
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+}
+
+/* ----------
+ * pgstat_read_statsfiles() -
+ *
+ *    Reads in existing statistics collector files into the shared stats hash.
+ *
+ * ----------
+ */
+void
+pgstat_read_statsfiles(void)
+{
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatDBEntry dbbuf;
+    FILE       *fpin;
+    int32        format_id;
+    bool        found;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+
+    /* shouldn't be called from postmaster  */
+    Assert(IsUnderPostmaster);
+
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
+
+    /*
+     * Set the current timestamp (will be kept only in case we can't load an
+     * existing statsfile).
+     */
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp =
+        shared_globalStats->stat_reset_timestamp;
+
+    /*
+     * Try to open the stats file. If it doesn't exist, the backends simply
+     * return zero for anything and the collector simply starts from scratch
+     * with empty counters.
+     *
+     * ENOENT is a possibility if the stats collector is not running or has
+     * not yet written the stats file the first time.  Any other failure
+     * condition is suspicious.
+     */
+    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(LOG,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            statfile)));
+        return;
+    }
+
+    /*
+     * Verify it's of the expected format.
+     */
+    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
+        format_id != PGSTAT_FILE_FORMAT_ID)
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        goto done;
+    }
+
+    /*
+     * Read global stats struct
+     */
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+        sizeof(*shared_globalStats))
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+        goto done;
+    }
+
+    /*
+     * Read archiver stats struct
+     */
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+        sizeof(*shared_archiverStats))
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        goto done;
+    }
+
+    /*
+     * We found an existing collector stats file. Read it and put all the
+     * hashtable entries into place.
+     */
+    for (;;)
+    {
+        switch (fgetc(fpin))
+        {
+                /*
+                 * 'D'    A PgStat_StatDBEntry struct describing a database
+                 * follows.
+                 */
+            case 'D':
+                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, generation),
+                          fpin) != offsetof(PgStat_StatDBEntry, generation))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                /*
+                 * Add to the DB hash
+                 */
+                dbentry = (PgStat_StatDBEntry *)
+                    dshash_find_or_insert(pgStatDBHash, (void *) &dbbuf.databaseid,
+                                          &found);
+
+                /* don't allow duplicate dbentries */
+                if (found)
+                {
+                    dshash_release_lock(pgStatDBHash, dbentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                /* initialize the new shared entry */
+                init_dbentry(dbentry);
+
+                memcpy(dbentry, &dbbuf,
+                       offsetof(PgStat_StatDBEntry, generation));
+
+                /* Read the data from the database-specific file. */
+                pgstat_read_pgStatDBHashfile(dbentry);
+                dshash_release_lock(pgStatDBHash, dbentry);
+                break;
+
+            case 'E':
+                goto done;
+
+            default:
+                ereport(LOG,
+                        (errmsg("corrupted statistics file \"%s\"",
+                                statfile)));
+                goto done;
+        }
+    }
+
+done:
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+
+    return;
+}
+
+
+/* ----------
+ * pgstat_read_pgStatDBHashfile() -
+ *
+ *    Reads in the at-rest statistics file and creates shared statistics
+ *    tables. The file is removed after reading.
+ * ----------
+ */
+static void
+pgstat_read_pgStatDBHashfile(PgStat_StatDBEntry *dbentry)
+{
+    PgStat_StatTabEntry *tabentry;
+    PgStat_StatTabEntry tabbuf;
+    PgStat_StatFuncEntry funcbuf;
+    PgStat_StatFuncEntry *funcentry;
+    dshash_table         *tabhash = NULL;
+    dshash_table         *funchash = NULL;
+    FILE       *fpin;
+    int32        format_id;
+    bool        found;
+    char        statfile[MAXPGPATH];
+
+    get_dbstat_filename(false, dbentry->databaseid, statfile, MAXPGPATH);
+
+    /*
+     * Try to open the stats file. If it doesn't exist, the backends simply
+     * return zero for anything and the collector simply starts from scratch
+     * with empty counters.
+     *
+     * ENOENT is a possibility if the stats collector is not running or has
+     * not yet written the stats file the first time.  Any other failure
+     * condition is suspicious.
+     */
+    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(LOG,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            statfile)));
+        return;
+    }
+
+    /*
+     * Verify it's of the expected format.
+     */
+    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
+        format_id != PGSTAT_FILE_FORMAT_ID)
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        goto done;
+    }
+
+    /*
+     * We found an existing statistics file. Read it and put all the hashtable
+     * entries into place.
+     */
+    for (;;)
+    {
+        switch (fgetc(fpin))
+        {
+                /*
+                 * 'T'    A PgStat_StatTabEntry follows.
+                 */
+            case 'T':
+                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
+                          fpin) != sizeof(PgStat_StatTabEntry))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                if (tabhash == NULL)
+                {
+                    tabhash = dshash_create(area, &dsh_tblparams, 0);
+                    dbentry->tables =
+                        dshash_get_hash_table_handle(tabhash);
+                }
+
+                tabentry = (PgStat_StatTabEntry *)
+                    dshash_find_or_insert(tabhash,
+                                          (void *) &tabbuf.tableid, &found);
+
+                /* don't allow duplicate entries */
+                if (found)
+                {
+                    dshash_release_lock(tabhash, tabentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
+                dshash_release_lock(tabhash, tabentry);
+                break;
+
+                /*
+                 * 'F'    A PgStat_StatFuncEntry follows.
+                 */
+            case 'F':
+                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
+                          fpin) != sizeof(PgStat_StatFuncEntry))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                if (funchash == NULL)
+                {
+                    funchash = dshash_create(area, &dsh_tblparams, 0);
+                    dbentry->functions =
+                        dshash_get_hash_table_handle(funchash);
+                }
+
+                funcentry = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert(funchash,
+                                          (void *) &funcbuf.functionid, &found);
+
+                if (found)
+                {
+                    dshash_release_lock(funchash, funcentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
+                dshash_release_lock(funchash, funcentry);
+                break;
+
+                /*
+                 * 'E'    The EOF marker of a complete stats file.
+                 */
+            case 'E':
+                goto done;
+
+            default:
+                ereport(LOG,
+                        (errmsg("corrupted statistics file \"%s\"",
+                                statfile)));
+                goto done;
+        }
+    }
+
+done:
+    if (tabhash)
+        dshash_detach(tabhash);
+    if (funchash)
+        dshash_detach(funchash);
+
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+}
+
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ *    Create pgStatLocalContext and pgStatSnapshotContext, if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+    if (!pgStatLocalContext)
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+
+    if (!pgStatSnapshotContext)
+        pgStatSnapshotContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Database statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+}
+
+/* ----------
+ * pgstat_clear_snapshot() -
+ *
+ *    Discard any data collected in the current transaction.  Any subsequent
+ *    request will cause new snapshots to be read.
+ *
+ *    This is also invoked during transaction commit or abort to discard
+ *    the no-longer-wanted snapshot.
+ * ----------
+ */
+void
+pgstat_clear_snapshot(void)
+{
+    /* Release memory, if any was allocated */
+    if (pgStatLocalContext)
+    {
+        MemoryContextDelete(pgStatLocalContext);
+
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
+    }
+
+    if (pgStatSnapshotContext)
+        clear_snapshot = true;
+}
+
+static bool
+pgstat_update_tabentry(dshash_table *tabhash, PgStat_TableStatus *stat,
+                       bool nowait)
+{
+    PgStat_StatTabEntry *tabentry;
+    bool    found;
+
+    if (tabhash == NULL)
+        return false;
+
+    tabentry = (PgStat_StatTabEntry *)
+        dshash_find_or_insert_extended(tabhash, (void *) &(stat->t_id),
+                                       &found, nowait);
+
+    /* failed to acquire lock */
+    if (tabentry == NULL)
+        return false;
+
     if (!found)
-        reset_dbentry_counters(result);
+    {
+        /*
+         * If it's a new table entry, initialize counters to the values we
+         * just got.
+         */
+        tabentry->numscans = stat->t_counts.t_numscans;
+        tabentry->tuples_returned = stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched = stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted = stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated = stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted = stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated = stat->t_counts.t_tuples_hot_updated;
+        tabentry->n_live_tuples = stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples = stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze = stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched = stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit = stat->t_counts.t_blocks_hit;
+
+        tabentry->vacuum_timestamp = 0;
+        tabentry->vacuum_count = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_count = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->analyze_count = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+        tabentry->autovac_analyze_count = 0;
+    }
+    else
+    {
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        tabentry->numscans += stat->t_counts.t_numscans;
+        tabentry->tuples_returned += stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched += stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted += stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated += stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted += stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated += stat->t_counts.t_tuples_hot_updated;
+        /* If table was truncated, first reset the live/dead counters */
+        if (stat->t_counts.t_truncated)
+        {
+            tabentry->n_live_tuples = 0;
+            tabentry->n_dead_tuples = 0;
+        }
+        tabentry->n_live_tuples += stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples += stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze += stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched += stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit += stat->t_counts.t_blocks_hit;
+    }
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
+
+    dshash_release_lock(tabhash, tabentry);
+
+    return true;
+}
+
+static void
+pgstat_update_dbentry(PgStat_StatDBEntry *dbentry, PgStat_TableStatus *stat)
+{
+    /*
+     * Add per-table stats to the per-database entry, too.
+     */
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
+    dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
+    dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
+    dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
+    dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
+    dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
+    dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
+    LWLockRelease(&dbentry->lock);
+}
+
+/*
+ * Look up the shared stats hash entry for the specified database. Returns
+ * NULL if PGSTAT_NOWAIT is set and the required lock cannot be acquired.
+ */
+static PgStat_StatDBEntry *
+pgstat_get_db_entry(Oid databaseid, int op, PgStat_TableLookupResult *status)
+{
+    PgStat_StatDBEntry *result;
+    bool        nowait = ((op & PGSTAT_NOWAIT) != 0);
+    bool        lock_acquired = true;
+    bool        found = true;
+
+    if (!IsUnderPostmaster || !pgStatDBHash)
+        return NULL;
+
+    /* Lookup or create the hash table entry for this database */
+    if (op & PGSTAT_EXCLUSIVE)
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_or_insert_extended(pgStatDBHash, &databaseid,
+                                           &found, nowait);
+        if (result == NULL)
+            lock_acquired = false;
+        else if (!found)
+        {
+            /*
+             * If not found, initialize the new entry.  This creates empty
+             * hash tables, too.
+             */
+            init_dbentry(result);
+            reset_dbentry_counters(result);
+        }
+    }
+    else
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_extended(pgStatDBHash, &databaseid, true, nowait,
+                                 nowait ? &lock_acquired : NULL);
+        if (result == NULL)
+            found = false;
+    }
+
+    /* Set return status if requested */
+    if (status)
+    {
+        if (!lock_acquired)
+        {
+            Assert(nowait);
+            *status = LOCK_FAILED;
+        }
+        else if (!found)
+            *status = NOT_FOUND;
+        else
+            *status = FOUND;
+    }
 
     return result;
 }
 
-
 /*
  * Lookup the hash table entry for the specified table. If no hash
  * table entry exists, initialize it, if the create parameter is true.
  * Else, return NULL.
  */
 static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create)
 {
     PgStat_StatTabEntry *result;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
 
     /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
+    if (create)
+        result = (PgStat_StatTabEntry *)
+            dshash_find_or_insert(table, &tableoid, &found);
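+    else
+    {
+        result = (PgStat_StatTabEntry *) dshash_find(table, &tableoid, false);
+        /* dshash_find() does not report "found"; derive it from the result */
+        found = (result != NULL);
+    }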
 
     if (!create && !found)
         return NULL;
@@ -4842,1702 +5963,6 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
     return result;
 }
 
-
-/* ----------
- * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
- * ----------
- */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
-{
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    FILE       *fpout;
-    int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-    int            rc;
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Set the timestamp of the stats file.
-     */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Write global stats struct
-     */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Write archiver stats struct
-     */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database table.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        /*
-         * Write out the table and function stats for this DB into the
-         * appropriate per-DB stat file, if required.
-         */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
-
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
-
-        /*
-         * Write out the DB entry. We don't write the tables or functions
-         * pointers, since they're of no use to any other process.
-         */
-        fputc('D', fpout);
-        rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
-}
-
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
-
-/* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
- * ----------
- */
-static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
-{
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpout;
-    int32        format_id;
-    Oid            dbid = dbentry->databaseid;
-    int            rc;
-    char        tmpfile[MAXPGPATH];
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database's access stats per table.
-     */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
-    {
-        fputc('T', fpout);
-        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * Walk through the database's function stats table.
-     */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_statsfiles() -
- *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
- *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
- * ----------
- */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
-
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-
-    /*
-     * Set the current timestamp (will be kept only in case we can't load an
-     * existing statsfile).
-     */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return dbhash;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
-        goto done;
-    }
-
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Add to the DB hash
-                 */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-
-    return dbhash;
-}
-
-
-/* ----------
- * pgstat_read_db_statsfile() -
- *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
- * ----------
- */
-static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
-{
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatTabEntry tabbuf;
-    PgStat_StatFuncEntry funcbuf;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'T'    A PgStat_StatTabEntry follows.
-                 */
-            case 'T':
-                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
-                          fpin) != sizeof(PgStat_StatTabEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
-                break;
-
-                /*
-                 * 'F'    A PgStat_StatFuncEntry follows.
-                 */
-            case 'F':
-                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
-                          fpin) != sizeof(PgStat_StatFuncEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
-                break;
-
-                /*
-                 * 'E'    The EOF marker of a complete stats file.
-                 */
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
-}
-
-
-/* ----------
- * pgstat_setup_memcxt() -
- *
- *    Create pgStatLocalContext, if not already done.
- * ----------
- */
-static void
-pgstat_setup_memcxt(void)
-{
-    if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
-}
-
-
-/* ----------
- * pgstat_clear_snapshot() -
- *
- *    Discard any data collected in the current transaction.  Any subsequent
- *    request will cause new snapshots to be read.
- *
- *    This is also invoked during transaction commit or abort to discard
- *    the no-longer-wanted snapshot.
- * ----------
- */
-void
-pgstat_clear_snapshot(void)
-{
-    /* Release memory, if any was allocated */
-    if (pgStatLocalContext)
-        MemoryContextDelete(pgStatLocalContext);
-
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetshared() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
-}
-
 /*
  * Convert a potentially unsafely truncated activity string (see
  * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 4574ebf2de..05d143b624 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -255,7 +255,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -503,7 +502,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1249,66 +1247,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Forcibly remove the files signaling a standby promotion request.
-     * Otherwise, the existence of those files triggers a promotion too early,
-     * whether a user wants that or not.
-     *
-     * This removal of files is usually unnecessary because they can exist
-     * only during a few moments during a standby promotion. However there is
-     * a race condition: if pg_ctl promote is executed and creates the files
-     * during a promotion, the files can stay around even after the server is
-     * brought up to new master. Then, if new standby starts by using the
-     * backup taken from that master, the files can exist at the server
-     * startup and should be removed in order to avoid an unexpected
-     * promotion.
-     *
-     * Note that promotion signal files need to be removed before the startup
-     * process is invoked. Because, after that, they can be used by
-     * postmaster's SIGUSR1 signal handler.
-     */
-    RemovePromoteSignalFiles();
-
-    /* Do the same for logrotate signal file */
-    RemoveLogrotateSignalFiles();
-
-    /* Remove any outdated file holding the current log filenames. */
-    if (unlink(LOG_METAINFO_DATAFILE) < 0 && errno != ENOENT)
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not remove file \"%s\": %m",
-                        LOG_METAINFO_DATAFILE)));
-
-    /*
-     * If enabled, start up syslogger collection subprocess
-     */
-    SysLoggerPID = SysLogger_Start();
-
-    /*
-     * Reset whereToSendOutput from DestDebug (its starting state) to
-     * DestNone. This stops ereport from sending log messages to stderr unless
-     * Log_destination permits.  We don't do this until the postmaster is
-     * fully launched, since startup failures may as well be reported to
-     * stderr.
-     *
-     * If we are in fact disabling logging to stderr, first emit a log message
-     * saying so, to provide a breadcrumb trail for users who may not remember
-     * that their logging is configured to go somewhere else.
-     */
-    if (!(Log_destination & LOG_DESTINATION_STDERR))
-        ereport(LOG,
-                (errmsg("ending log output to stderr"),
-                 errhint("Future log output will go to log destination \"%s\".",
-                         Log_destination_string)));
-
-    whereToSendOutput = DestNone;
-
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1757,11 +1695,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2647,8 +2580,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -2980,8 +2911,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3048,13 +2977,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3129,22 +3051,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3603,22 +3509,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3814,8 +3704,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3850,8 +3738,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4052,8 +3939,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5026,18 +4911,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5150,12 +5023,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6044,7 +5911,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6100,8 +5966,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6336,7 +6200,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index d7d733530f..fdc0959624 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -147,6 +147,7 @@ CreateSharedMemoryAndSemaphores(int port)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -275,8 +276,13 @@ CreateSharedMemoryAndSemaphores(int port)
 
     /* Initialize dynamic shared memory facilities. */
     if (!IsUnderPostmaster)
+    {
         dsm_postmaster_startup(shim);
 
+        /* Stats collector uses dynamic shared memory */
+        StatsShmemInit();
+    }
+
     /*
      * Now give loadable modules a chance to set up their shmem allocations
      */
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index bc1aa88322..b9c33d6044 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -522,6 +522,7 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
     LWLockRegisterTranche(LWTRANCHE_SXACT, "serializable_xact");
+    LWLockRegisterTranche(LWTRANCHE_STATS, "activity stats");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 44a59e1d4f..2f9dd19ab6 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3152,6 +3152,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3726,6 +3732,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4166,9 +4173,17 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
-                ProcessCompletedNotifies();
-                pgstat_report_stat(false);
+                long stats_timeout;
 
+                ProcessCompletedNotifies();
+
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4203,7 +4218,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4211,6 +4226,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 3bf96de256..9c694f20c9 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index e9f72b5069..731ef0e27c 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -74,6 +74,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -633,6 +634,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1243,6 +1246,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b7d36b65dd..13be46c172 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -6,7 +6,7 @@ use File::Basename qw(basename dirname);
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 106;
+use Test::More tests => 105;
 
 program_help_ok('pg_basebackup');
 program_version_ok('pg_basebackup');
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 0b49b63327..00d939991d 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index b3f00e1943..c0bbf8a7d5 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL statistics collector facility.
  *
  *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
  *
@@ -14,10 +14,11 @@
 #include "datatype/timestamp.h"
 #include "fmgr.h"
 #include "libpq/pqcomm.h"
-#include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -41,33 +42,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -78,9 +52,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -116,13 +89,6 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_blocks_hit;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -181,236 +147,12 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
 /* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
- */
-typedef struct PgStat_MsgHdr
-{
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
+ * PgStat_MsgBgWriter            bgwriter statistics
  * ----------
  */
 typedef struct PgStat_MsgBgWriter
 {
-    PgStat_MsgHdr m_hdr;
-
     PgStat_Counter m_timed_checkpoints;
     PgStat_Counter m_requested_checkpoints;
     PgStat_Counter m_buf_written_checkpoints;
@@ -423,38 +165,14 @@ typedef struct PgStat_MsgBgWriter
     PgStat_Counter m_checkpoint_sync_time;
 } PgStat_MsgBgWriter;
 
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
-
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -486,96 +204,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Statistic collector data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -615,16 +245,29 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_block_write_time;
 
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;        /* time of db stats update */
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following members must be last in the struct, because we don't
+     * write them out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    int generation;                        /* current generation of the below */
+    int    refcnt;                            /* current gen reference count */
+    dshash_table_handle tables;            /* current gen tables hash */
+    dshash_table_handle functions;        /* current gen functions hash */
+    int    prev_refcnt;                    /* prev gen reference count */
+    dshash_table_handle prev_tables;    /* prev gen tables hash */
+    dshash_table_handle prev_functions;    /* prev gen functions hash */
+    LWLock    lock;                        /* Lock for the above members */
+
+    /* non-shared members */
+    HTAB *snapshot_tables;                 /* table entry snapshot */
+    HTAB *snapshot_functions;             /* function entry snapshot */
+    dshash_table    *dshash_tables;         /* attached tables dshash */
+    dshash_table    *dshash_functions;     /* attached functions dshash */
 } PgStat_StatDBEntry;
 
+#define SHARED_DBENT_SIZE offsetof(PgStat_StatDBEntry, snapshot_tables)
 
 /* ----------
  * PgStat_StatTabEntry            The collector's data per table (or index)
@@ -663,7 +306,7 @@ typedef struct PgStat_StatTabEntry
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            per function stats data
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
@@ -678,7 +321,7 @@ typedef struct PgStat_StatFuncEntry
 
 
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -694,7 +337,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -780,7 +423,6 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
     WAIT_EVENT_RECOVERY_WAL_ALL,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
@@ -1215,6 +857,8 @@ extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used, but will be removed with GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
@@ -1236,29 +880,26 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
 
+/* File input/output functions  */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 
 extern void pgstat_report_autovac(Oid dboid);
@@ -1429,11 +1070,13 @@ extern void pgstat_send_bgwriter(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_extended(PgStat_StatDBEntry *dbent, Oid relid);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 08e0dc8144..30d5fb63c5 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -220,6 +220,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_SXACT,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 9244a2a7b7..a9b625211b 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.16.3

From 76659a7965d3eec68e3d6b4b961dc8cd5b72063e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 27 Nov 2018 14:42:12 +0900
Subject: [PATCH 5/5] Remove the GUC stats_temp_directory

The GUC used to specify the directory to store temporary statistics
files. It is no longer needed by the stats collector but is still used by
the programs in bin and contrib, and maybe other extensions. Thus this
patch removes the GUC, but some backing variables and macro definitions
are left alone for backward compatibility.
---
 doc/src/sgml/backup.sgml                      |  2 --
 doc/src/sgml/config.sgml                      | 19 -------------
 doc/src/sgml/monitoring.sgml                  |  7 +----
 doc/src/sgml/storage.sgml                     |  3 +-
 src/backend/postmaster/pgstat.c               | 13 ++++-----
 src/backend/replication/basebackup.c          | 13 ++-------
 src/backend/utils/misc/guc.c                  | 41 ---------------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/pgstat.h                          |  5 +++-
 src/test/perl/PostgresNode.pm                 |  4 ---
 10 files changed, 14 insertions(+), 94 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 9d4c000df0..bbae70221e 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1145,8 +1145,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 84341a30e5..6ad619de47 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6827,25 +6827,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 990995c17b..2a2adaa0f7 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -195,12 +195,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
+   <productname>PostgreSQL</productname> processes through shared memory.
    When the server shuts down cleanly, a permanent copy of the statistics
    data is stored in the <filename>pg_stat</filename> subdirectory, so that
    statistics can be retained across server restarts.  When recovery is
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 1047c77a63..b47f8d084e 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -122,8 +122,7 @@ Item
 
 <row>
  <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
+ <entry>Subdirectory containing ephemeral files for extensions</entry>
 </row>
 
 <row>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 514ea78a68..64e1374288 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -108,15 +108,12 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
+/*
+ * This used to be a GUC variable and is no longer used in this file, but left
+ * alone just for backward compatibility for extensions, having the default
+ * value.
  */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+char       *pgstat_stat_directory = PG_STAT_TMP_DIR;
 
 #define        StatsLock (&StatsShmem->StatsMainLock)
 
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index c2978a949a..e784088747 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -230,11 +230,8 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -265,13 +262,9 @@ perform_base_backup(basebackup_options *opt)
          * Calculate the relative path of temporary statistics directory in
          * order to skip the files which are located in that directory later.
          */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
+
+        Assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
+        statrelpath = psprintf("./%s", PG_STAT_TMP_DIR);
 
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 92c4fee8f8..7feb44c39d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -194,7 +194,6 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static void assign_effective_io_concurrency(int newval, void *extra);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -4085,17 +4084,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11368,35 +11356,6 @@ assign_effective_io_concurrency(int newval, void *extra)
 #endif                            /* USE_PREFETCH */
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5ee5e09ddf..0ba984b074 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -562,7 +562,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c0bbf8a7d5..2c7c799ca9 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -31,7 +31,10 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
+/*
+ * This used to be the directory to store temporary statistics data in but is
+ * no longer used. Defined here for backward compatibility.
+ */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
 
 /* Values for track_functions GUC variable --- order is significant! */
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 6019f37f91..a107683c0d 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -451,10 +451,6 @@ sub init
     print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
       if defined $ENV{TEMP_CONFIG};
 
-    # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
-    # concurrently must not share a stats_temp_directory.
-    print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
     if ($params{allows_streaming})
     {
         if ($params{allows_streaming} eq "logical")
-- 
2.16.3


Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
At Thu, 04 Jul 2019 19:27:54 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
<20190704.192754.27063464.horikyota.ntt@gmail.com>
>     #db  #tbl  #clients  #iter  #xactlen #referers
> A1:   1     1         1  20000        10        0
> A2:   1     1         1  20000        10        1
> B1:   1     1        90   2000        10        0
> B2:   1     1        90   2000        10        1
> C1:   1    50        90   2000        10        0
> C2:   1    50        90   2000        10        1
> D1:  50     1        90   2000        10        0
> D2:  50     1        90   2000        10        1
> E1:  50     1        90   2000        10       10
> F1:  50     1        10   2000        10       90
> 
> 
> 
>                 master                               patched
>         updater       referrer               updater       referrer
>        time / stdev  count / stdev          time / stdev   count / stdev    
> A1: 1769.13 / 71.87                      1729.97 / 61.58
> A2: 1903.94 / 75.65  2906.67 /  78.28    1849.41 / 43.00  2855.33 /  62.95
> B1: 2967.84 /  9.88                      2984.20 /  6.10
> B2: 3005.38 /  5.32   253.00 /  33.09    3007.26 /  5.70   253.33 /  60.63
> C1: 3066.14 / 13.80                      3069.34 / 11.65
> C2: 3353.66 /  8.14   282.92 /  20.65    3341.36 / 12.44   251.65 /  21.13
> D1: 2977.12 /  5.12                      2991.60 /  6.68
> D2: 3005.80 /  6.44   252.50 /  38.34    3010.58 /  7.34   282.33 /  57.07
> E1: 3255.47 /  8.91   244.02 /  17.03    3293.88 / 18.05   249.13 /  14.58
> F1: 2620.85 /  9.17   202.46 /   3.35    2668.60 / 41.04   208.19 /   6.79
> 
> 
> ratio (100: same, smaller value means patched version is faster)
> 
>           updater          referrer
>      patched/master(%)    master/patched (%)
> A1:          97.79            -
> A2:          97.14          101.80
> B1:         100.55
> B2:         100.06           99.87
> C1:         100.10
> C2:          99.63          112.43
> D1:         100.49
> D2:         100.16           89.43
> E1:         101.18           97.95
> F1:         101.82           97.25
> 
> 
> Hmm... I don't see a distinct tendency. The referrer side shows
> larger fluctuation, but I'm not sure that suggests anything
> meaningful.
> 
> I'll rerun the benchmarks over a longer period (more iterations).

I put more pressure on the system.

G1: 1 db, 400 clients, 1000 tables, 20000 loops/client, 1000 queries/transaction, 0 readers
G2: 1 db, 400 clients, 1000 tables, 20000 loops/client, 1000 queries/transaction, 1 reader


Result:
                   master                                  patched
         updater             referrer            updater             referrer
       time / stdev       count / stdev        time / stdev       count / stdev
G1: 125946.22 / 796.83                      125227.24 /  89.82
G2: 126463.47 /  81.87   1985.70 /  33.96   125427.95 /  82.35   1985.60 /  55.24

Ratio: (100: same, smaller value means patched version is faster)

          updater          referrer
     patched/master(%)    master/patched (%)
G1:          99.40            -
G2:          99.18          100.0
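
The ratios in the table above follow directly from the raw figures. For the updater the elapsed times are divided as patched/master; for the referrer the counts are divided as master/patched, so that (assuming a higher count is better) a value below 100 favors the patched version in both columns. A minimal sketch of that arithmetic in C is shown here; the function names are assumptions made for illustration and are not taken from the actual benchmark scripts.

```c
#include <assert.h>
#include <stdio.h>

/* Percentage ratio; 100 means both sides are equal. */
double
ratio_percent(double numerator, double denominator)
{
    assert(denominator > 0.0);
    return numerator / denominator * 100.0;
}

/* Updater: elapsed time, lower is better, hence patched/master. */
double
updater_ratio(double master_time, double patched_time)
{
    return ratio_percent(patched_time, master_time);
}

/* Referrer: a count, assumed higher is better, hence master/patched. */
double
referrer_ratio(double master_count, double patched_count)
{
    return ratio_percent(master_count, patched_count);
}

/* Print one row in roughly the layout of the ratio table (hypothetical helper). */
void
print_ratio_row(const char *label,
                double master_time, double patched_time,
                double master_count, double patched_count)
{
    printf("%s: %14.2f %14.2f\n", label,
           updater_ratio(master_time, patched_time),
           referrer_ratio(master_count, patched_count));
}
```

For example, `print_ratio_row("G2", 126463.47, 125427.95, 1985.70, 1985.60)` prints the G2 row from the figures in the result table; a run without referrers, such as G1, needs only `updater_ratio`.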

Slightly faster, and perhaps significantly so considering the
stdev. A more crucial difference shows up outside the
numbers. The non-patched version complained (incorrectly) many times
that the stats collector was not responding and that stale stats were
used, which was not seen with the patched version. That means that the
reader may read numbers as old as 1 second. (It might be good for the
writer to complain when an update is held off for more than, say, 0.75 s.)
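
The parenthetical suggestion above could look roughly like the following. This is only a sketch of one possible reading: the names `holdoff_exceeded` and `HOLDOFF_WARN_USEC`, and the use of plain microsecond timestamps, are assumptions for illustration and are not part of the posted patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold: 0.75 s, expressed in microseconds. */
#define HOLDOFF_WARN_USEC INT64_C(750000)

/*
 * Return true when stats that became pending at 'pending_since_usec' are
 * still unwritten at 'now_usec' and the delay exceeds the threshold.
 * Both timestamps are microseconds taken from the same clock.
 */
bool
holdoff_exceeded(int64_t pending_since_usec, int64_t now_usec)
{
    assert(now_usec >= pending_since_usec);
    return (now_usec - pending_since_usec) > HOLDOFF_WARN_USEC;
}
```

A writer could call such a check each time it fails to flush its pending counters and emit a log message when it returns true.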


The CF-bot warned that it doesn't work on Windows. I'm having a
very painful time waiting for TortoiseGit, which walks as slowly
as its name suggests. It will be fixed in the next version.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
Hello. This is v21 of the patch.

> The CF-bot warned that it doesn't work on Windows. I'm having a
> very painful time waiting for TortoiseGit, which walks as slowly
> as its name suggests. It will be fixed in the next version.

I found a bug in initialization. StatsShmemInit() was called in the
wrong place, so the stats code in child processes accessed an
uninitialized pointer. It is a leftover from the previous design,
where the DSM was activated in the postmaster.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From d1bf3bfb1f539c28dbb2dd77a74d66919efd48bd Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:41:04 +0900
Subject: [PATCH 1/5] sequential scan for dshash

Add sequential scan feature to dshash.
---
 src/backend/lib/dshash.c | 188 ++++++++++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  23 +++++-
 2 files changed, 206 insertions(+), 5 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 350f8c0a66..4f0c7ec840 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -112,6 +112,7 @@ struct dshash_table
     size_t        size_log2;        /* log2(number of buckets) */
     bool        find_locked;    /* Is any partition lock held by 'find'? */
     bool        find_exclusively_locked;    /* ... exclusively? */
+    bool        seqscan_running;/* now under sequential scan */
 };
 
 /* Given a pointer to an item, find the entry (user data) it holds. */
@@ -127,6 +128,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +158,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -228,6 +237,7 @@ dshash_create(dsa_area *area, const dshash_parameters *params, void *arg)
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
 
     /*
      * Set up the initial array of buckets.  Our initial size is the same as
@@ -279,6 +289,7 @@ dshash_attach(dsa_area *area, const dshash_parameters *params,
     hash_table->control = dsa_get_address(area, control);
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
     Assert(hash_table->control->magic == DSHASH_MAGIC);
 
     /*
@@ -324,7 +335,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
                                 LW_EXCLUSIVE));
 
     delete_item(hash_table, item);
-    hash_table->find_locked = false;
-    hash_table->find_exclusively_locked = false;
-    LWLockRelease(PARTITION_LOCK(hash_table, partition));
+
+    /* We need to keep the partition lock during a sequential scan */
+    if (!hash_table->seqscan_running)
+    {
+        hash_table->find_locked = false;
+        hash_table->find_exclusively_locked = false;
+        LWLockRelease(PARTITION_LOCK(hash_table, partition));
+    }
 }
 
 /*
@@ -568,6 +584,8 @@ dshash_release_lock(dshash_table *hash_table, void *entry)
     Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition_index),
                                 hash_table->find_exclusively_locked
                                 ? LW_EXCLUSIVE : LW_SHARED));
+    /* lock is under control of sequential scan */
+    Assert(!hash_table->seqscan_running);
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
@@ -592,6 +610,168 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should be called if and only if the scan is abandoned
+ * before completion; if dshash_seq_next returns NULL then it has already done
+ * the end-of-scan cleanup.
+ *
+ * On returning element, it is locked as is the case with dshash_find.
+ * However, the caller must not release the lock. The lock is released as
+ * necessary in continued scan.
+ *
+ * As opposed to the equivalent for dynahash, the caller is not supposed to
+ * delete the returned element before continuing the scan.
+ *
+ * If consistent is set for dshash_seq_init, the whole hash table is
+ * non-exclusively locked. Otherwise a part of the hash table is locked in the
+ * same mode (partition lock).
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool consistent, bool exclusive)
+{
+    /* allowed at most one scan at once */
+    Assert(!hash_table->seqscan_running);
+
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->consistent = consistent;
+    status->exclusive = exclusive;
+    hash_table->seqscan_running = true;
+
+    /*
+     * Protect all partitions from modification if the caller wants a
+     * consistent result.
+     */
+    if (consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+        {
+            Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i)));
+
+            LWLockAcquire(PARTITION_LOCK(hash_table, i),
+                          exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        }
+        ensure_valid_bucket_pointers(hash_table);
+    }
+}
+
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    Assert(status->hash_table->seqscan_running);
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert (status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        if (!status->consistent)
+        {
+            partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+            LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            status->curpartition = partition;
+
+            /* resize doesn't happen from now until seq scan ends */
+            status->nbuckets =
+                NUM_BUCKETS(status->hash_table->control->size_log2);
+            ensure_valid_bucket_pointers(status->hash_table);
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            dshash_seq_term(status);
+            return NULL;
+        }
+
+        /* Also move partition lock if needed */
+        if (!status->consistent)
+        {
+            int next_partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+
+            /* Move lock along with partition for the bucket */
+            if (status->curpartition != next_partition)
+            {
+                /*
+                 * Take the lock on the next partition, then release the
+                 * current one, not in the reverse order. This is required to
+                 * prevent resizing from happening during a sequential scan.
+                 * Locks are taken in partition order, so no deadlock can
+                 * happen with other seq scans or resizing.
+                 */
+                LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                             next_partition),
+                              status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+                LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                             status->curpartition));
+                status->curpartition = next_partition;
+            }
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * This item can be deleted by the caller. Store the next item for the
+     * next iteration for the occasion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    Assert(status->hash_table->seqscan_running);
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+    status->hash_table->seqscan_running = false;
+
+    if (status->consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+            LWLockRelease(PARTITION_LOCK(status->hash_table, i));
+    }
+    else if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index fa2e28ff3e..79698a6ad6 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,23 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state of dshash. The detail is exposed since the storage
+ * size should be known to users but it should be considered as an opaque
+ * type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;
+    int                    curbucket;
+    int                    nbuckets;
+    dshash_table_item  *curitem;
+    dsa_pointer            pnextitem;
+    int                    curpartition;
+    bool                consistent;
+    bool                exclusive;
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
                                    const dshash_parameters *params,
@@ -70,7 +87,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
 extern void dshash_detach(dshash_table *hash_table);
 extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
 extern void dshash_destroy(dshash_table *hash_table);
-
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
@@ -80,6 +96,11 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool consistent, bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.16.3
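
The non-consistent scan mode in the patch above moves one partition lock along with the scan and takes the next lock before releasing the current one. A minimal standalone sketch of that lock hand-over follows. It does not use the dshash API: ToyItem, ToyTable, ToyScan and the toy_* functions are hypothetical stand-ins, and the "locks" are plain flags rather than LWLocks, so it only illustrates the ordering, not real concurrency.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_PARTITIONS        4
#define BUCKETS_PER_PARTITION 2
#define NUM_BUCKETS           (NUM_PARTITIONS * BUCKETS_PER_PARTITION)

/* Hypothetical chained item; stands in for dshash_table_item. */
typedef struct ToyItem
{
    int             value;
    struct ToyItem *next;
} ToyItem;

/* Hypothetical table; lock_held[] is a toy stand-in for partition LWLocks. */
typedef struct ToyTable
{
    ToyItem    *buckets[NUM_BUCKETS];
    bool        lock_held[NUM_PARTITIONS];
    int         nheld;          /* locks currently held */
    int         max_held;       /* high-water mark, for illustration only */
} ToyTable;

typedef struct ToyScan
{
    ToyTable   *table;
    int         curbucket;
    int         curpartition;   /* -1 while no partition lock is held */
    ToyItem    *nextitem;
    bool        done;
} ToyScan;

static void
toy_part_acquire(ToyTable *t, int partition)
{
    assert(!t->lock_held[partition]);
    t->lock_held[partition] = true;
    if (++t->nheld > t->max_held)
        t->max_held = t->nheld;
}

static void
toy_part_release(ToyTable *t, int partition)
{
    assert(t->lock_held[partition]);
    t->lock_held[partition] = false;
    t->nheld--;
}

void
toy_scan_init(ToyScan *scan, ToyTable *t)
{
    scan->table = t;
    scan->curbucket = 0;
    scan->curpartition = -1;
    scan->nextitem = NULL;
    scan->done = false;
}

void
toy_scan_term(ToyScan *scan)
{
    if (scan->curpartition >= 0)
        toy_part_release(scan->table, scan->curpartition);
    scan->curpartition = -1;
    scan->done = true;
}

ToyItem *
toy_scan_next(ToyScan *scan)
{
    ToyTable   *t = scan->table;
    ToyItem    *cur;

    if (scan->done)
        return NULL;
    if (scan->curpartition < 0)
    {
        /* First call: lock the partition that owns bucket 0. */
        scan->curpartition = 0;
        toy_part_acquire(t, scan->curpartition);
        scan->nextitem = t->buckets[0];
    }
    while (scan->nextitem == NULL)
    {
        int         next_partition;

        if (++scan->curbucket >= NUM_BUCKETS)
        {
            toy_scan_term(scan);    /* all buckets visited */
            return NULL;
        }
        next_partition = scan->curbucket / BUCKETS_PER_PARTITION;
        if (next_partition != scan->curpartition)
        {
            /* Acquire the next lock before releasing the current one. */
            toy_part_acquire(t, next_partition);
            toy_part_release(t, scan->curpartition);
            scan->curpartition = next_partition;
        }
        scan->nextitem = t->buckets[scan->curbucket];
    }
    cur = scan->nextitem;
    scan->nextitem = cur->next; /* remember the successor up front */
    return cur;
}
```

While an item is being returned exactly one partition flag is set, and during the hand-over two are set briefly, which matches the "up to two partitions may be locked at the same time" wording used later in this thread.
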

From 05e7fbe33c8d03ae386dfb08c5b503d5510c173b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 27 Sep 2018 11:15:19 +0900
Subject: [PATCH 2/5] Add conditional lock feature to dshash

Dshash currently waits for lock unconditionally. This commit adds new
interfaces for dshash_find and dshash_find_or_insert. The new
interfaces have an extra parameter "nowait" that tells them not to wait
for the lock.
---
 src/backend/lib/dshash.c | 69 +++++++++++++++++++++++++++++++++++++++++++-----
 src/include/lib/dshash.h |  6 +++++
 2 files changed, 68 insertions(+), 7 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 4f0c7ec840..60a6e3c0bc 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -394,19 +394,48 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  */
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
+{
+    return dshash_find_extended(hash_table, key, exclusive, false, NULL);
+}
+
+/*
+ * Like dshash_find, but returns NULL immediately when nowait is true and the
+ * lock was not acquired. Lock status is stored in *lock_acquired if not NULL.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool *lock_acquired)
 {
     dshash_hash hash;
     size_t        partition;
     dshash_table_item *item;
 
+    /* lock_acquired is meaningful only when nowait is true */
+    Assert(nowait || !lock_acquired);
+
     hash = hash_key(hash_table, key);
     partition = PARTITION_FOR_HASH(hash);
 
     Assert(hash_table->control->magic == DSHASH_MAGIC);
     Assert(!hash_table->find_locked);
 
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partition),
+                                      exclusive ? LW_EXCLUSIVE : LW_SHARED))
+        {
+            if (lock_acquired)
+                *lock_acquired = false;
+            return NULL;
+        }
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition),
+                      exclusive ? LW_EXCLUSIVE : LW_SHARED);
+
+    if (lock_acquired)
+        *lock_acquired = true;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -441,6 +470,22 @@ void *
 dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
+{
+    return dshash_find_or_insert_extended(hash_table, key, found, false);
+}
+
+/*
+ * Like dshash_find_or_insert, but returns NULL if nowait is true and the lock
+ * was not acquired.
+ *
+ * Notes above dshash_find_extended() regarding locking and error handling
+ * equally apply here.
+ */
+void *
+dshash_find_or_insert_extended(dshash_table *hash_table,
+                               const void *key,
+                               bool *found,
+                               bool nowait)
 {
     dshash_hash hash;
     size_t        partition_index;
@@ -455,8 +500,16 @@ dshash_find_or_insert(dshash_table *hash_table,
     Assert(!hash_table->find_locked);
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(
+                PARTITION_LOCK(hash_table, partition_index),
+                LW_EXCLUSIVE))
+            return NULL;
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
+                      LW_EXCLUSIVE);
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -626,9 +679,11 @@ dshash_memhash(const void *v, size_t size, void *arg)
  * As opposed to the equivalent for dynanash, the caller is not supposed to
  * delete the returned element before continuing the scan.
  *
- * If consistent is set for dshash_seq_init, the whole hash table is
- * non-exclusively locked. Otherwise a part of the hash table is locked in the
- * same mode (partition lock).
+ * If consistent is set for dshash_seq_init, all the hash table
+ * partitions are locked in the requested mode (as determined by the
+ * exclusive flag), and the locks are held until the end of the scan.
+ * Otherwise the partition locks are acquired and released as needed
+ * during the scan (up to two partitions may be locked at the same time).
  */
 void
 dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 79698a6ad6..67f7d77f71 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -90,8 +90,14 @@ extern void dshash_destroy(dshash_table *hash_table);
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+                                  bool exclusive, bool nowait,
+                                  bool *lock_acquired);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                                    const void *key, bool *found);
+extern void *dshash_find_or_insert_extended(dshash_table *hash_table,
+                                            const void *key, bool *found,
+                                            bool nowait);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.16.3
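
With nowait, a NULL result from dshash_find_extended() can mean either "no such entry" or "lock not available", and *lock_acquired is what tells the two apart. The sketch below shows that contract in isolation. It is not the dshash code: ToyLock, ToyMap, toy_lock_try, toy_lock_release and toy_map_find_nowait are hypothetical names, and the lock is a plain flag that only mimics the non-blocking behaviour of LWLockConditionalAcquire().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NKEYS 4

/* Toy stand-in for one partition LWLock. */
typedef struct ToyLock
{
    bool        held;
} ToyLock;

/* Hypothetical single-partition table with a fixed number of entries. */
typedef struct ToyMap
{
    ToyLock     lock;
    int         nused;
    int         keys[TOY_NKEYS];
    int         values[TOY_NKEYS];
} ToyMap;

/* Non-blocking acquire: fails instead of waiting when the lock is taken. */
static bool
toy_lock_try(ToyLock *lock)
{
    if (lock->held)
        return false;
    lock->held = true;
    return true;
}

void
toy_lock_release(ToyLock *lock)
{
    assert(lock->held);
    lock->held = false;
}

/*
 * Look up key without waiting.  On a hit, returns the value with the lock
 * still held (the caller releases it).  On NULL, *lock_acquired tells whether
 * the key was absent (true) or the lock was busy (false).
 */
int *
toy_map_find_nowait(ToyMap *map, int key, bool *lock_acquired)
{
    if (!toy_lock_try(&map->lock))
    {
        if (lock_acquired)
            *lock_acquired = false;
        return NULL;
    }
    if (lock_acquired)
        *lock_acquired = true;

    for (int i = 0; i < map->nused; i++)
    {
        if (map->keys[i] == key)
            return &map->values[i];
    }

    /* Not found: nothing to protect, so drop the lock before returning. */
    toy_lock_release(&map->lock);
    return NULL;
}
```

A caller that gets NULL with the flag set to false would typically keep its numbers locally and retry later, rather than treat the entry as missing.
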

From b511a3aba3abc98f079a232613eca923114e3ced Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 7 Nov 2018 16:53:49 +0900
Subject: [PATCH 3/5] Make archiver process an auxiliary process

This is a preliminary patch for the shared-memory based stats collector.
The archiver process must be an auxiliary process since it uses shared
memory once the stats data is moved into shared memory. Make the process
an auxiliary process in order to make it work.
---
 src/backend/bootstrap/bootstrap.c   |  8 +++
 src/backend/postmaster/pgarch.c     | 98 +++++++++----------------------------
 src/backend/postmaster/pgstat.c     |  6 +++
 src/backend/postmaster/postmaster.c | 35 +++++++++----
 src/include/miscadmin.h             |  2 +
 src/include/pgstat.h                |  1 +
 src/include/postmaster/pgarch.h     |  4 +-
 7 files changed, 67 insertions(+), 87 deletions(-)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 43627ab8f4..7872a2d9d7 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -329,6 +329,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case BgWriterProcess:
                 statmsg = pgstat_get_backend_desc(B_BG_WRITER);
                 break;
+            case ArchiverProcess:
+                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
+                break;
             case CheckpointerProcess:
                 statmsg = pgstat_get_backend_desc(B_CHECKPOINTER);
                 break;
@@ -456,6 +459,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
             BackgroundWriterMain();
             proc_exit(1);        /* should never return */
 
+        case ArchiverProcess:
+            /* don't set signals, archiver has its own agenda */
+            PgArchiverMain();
+            proc_exit(1);        /* should never return */
+
         case CheckpointerProcess:
             /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index f84f882c4c..4342ebdab4 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -77,7 +77,6 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
@@ -96,7 +95,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgarch_exit(SIGNAL_ARGS);
 static void ArchSigHupHandler(SIGNAL_ARGS);
 static void ArchSigTermHandler(SIGNAL_ARGS);
@@ -114,75 +112,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -222,8 +151,8 @@ pgarch_forkexec(void)
  *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
  *    since we don't use 'em, it hardly matters...
  */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -255,8 +184,27 @@ PgArchiverMain(int argc, char *argv[])
 static void
 pgarch_exit(SIGNAL_ARGS)
 {
-    /* SIGQUIT means curl up and die ... */
-    exit(1);
+    PG_SETMASK(&BlockSig);
+
+    /*
+     * We DO NOT want to run proc_exit() callbacks -- we're here because
+     * shared memory may be corrupted, so we don't want to try to clean up our
+     * transaction.  Just nail the windows shut and get out of town.  Now that
+     * there's an atexit callback to prevent third-party code from breaking
+     * things by calling exit() directly, we have to reset the callbacks
+     * explicitly to make this work as intended.
+     */
+    on_exit_reset();
+
+    /*
+     * Note we do exit(2) not exit(0).  This is to force the postmaster into a
+     * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+     * backend.  This is necessary precisely because we don't clean up our
+     * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+     * should ensure the postmaster sees this as a crash, too, but no harm in
+     * being doubly sure.)
+     */
+    exit(2);
 }
 
 /* SIGHUP signal handler for archiver process */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b4f2b28b51..f4ec142cab 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2934,6 +2934,9 @@ pgstat_bestart(void)
             case StartupProcess:
                 lbeentry.st_backendType = B_STARTUP;
                 break;
+            case ArchiverProcess:
+                lbeentry.st_backendType = B_ARCHIVER;
+                break;
             case BgWriterProcess:
                 lbeentry.st_backendType = B_BG_WRITER;
                 break;
@@ -4277,6 +4280,9 @@ pgstat_get_backend_desc(BackendType backendType)
 
     switch (backendType)
     {
+        case B_ARCHIVER:
+            backendDesc = "archiver";
+            break;
         case B_AUTOVAC_LAUNCHER:
             backendDesc = "autovacuum launcher";
             break;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 3339804be9..0ca0b3024b 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -146,7 +146,8 @@
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
 #define BACKEND_TYPE_BGWORKER    0x0008    /* bgworker process */
-#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
+#define BACKEND_TYPE_ARCHIVER    0x0010    /* archiver process */
+#define BACKEND_TYPE_ALL        0x001F    /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -539,6 +540,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1762,7 +1764,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -2977,7 +2979,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3122,10 +3124,8 @@ reaper(SIGNAL_ARGS)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3371,7 +3371,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3576,6 +3576,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3848,6 +3860,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5117,7 +5130,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5400,6 +5413,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 61a24c2e3c..0b49b63327 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -399,6 +399,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -411,6 +412,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0a3ad3a188..b3f00e1943 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -719,6 +719,7 @@ typedef struct PgStat_GlobalStats
  */
 typedef enum BackendType
 {
+    B_ARCHIVER,
     B_AUTOVAC_LAUNCHER,
     B_AUTOVAC_WORKER,
     B_BACKEND,
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index 2474eac26a..88f16863d4 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
-- 
2.16.3

From de14aa03df79a0fa8bdffa259e999ea5791abf5b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 21 Feb 2019 12:44:56 +0900
Subject: [PATCH 4/5] Shared-memory based stats collector

Previously, activity statistics were shared via files on disk. Every
backend sends its numbers to the stats collector process via a socket.
The collector writes snapshots as a set of files on disk at a certain
interval, and every backend reads them as necessary. This worked fine
for a comparatively small set of statistics, but the set is under
pressure to grow and the file size has reached the order of
megabytes. To deal with a larger statistics set, this patch lets
backends share the statistics directly via shared memory.
---
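
The pgstat.c header comment in this patch names three intervals: a minimum time between flushes (500ms), a retry interval after a flush that could not complete (100ms), and an upper bound on how long numbers stay local (1000ms). The sketch below is one possible reading of that policy, not the code in the patch. FlushState, FlushAction, flush_decide and flush_record are hypothetical names, and measuring the upper bound from the first postponed flush is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Values quoted in the pgstat.c header comment of this patch. */
#define MIN_INTERVAL_MS    500
#define RETRY_INTERVAL_MS  100
#define MAX_INTERVAL_MS   1000

typedef enum FlushAction
{
    FLUSH_SKIP,                 /* too early, keep the numbers local */
    FLUSH_TRY_NOWAIT,           /* attempt a flush without waiting for locks */
    FLUSH_FORCE                 /* pending too long, wait for the locks */
} FlushAction;

/* Hypothetical per-backend bookkeeping; all times are in milliseconds. */
typedef struct FlushState
{
    int64_t     next_attempt_ms;    /* earliest time of the next attempt */
    int64_t     pending_since_ms;   /* first postponed flush, or -1 if none */
} FlushState;

/* Decide what to do at time now_ms. */
FlushAction
flush_decide(const FlushState *st, int64_t now_ms)
{
    /* Assumption: the upper bound counts from the first postponed flush. */
    if (st->pending_since_ms >= 0 &&
        now_ms - st->pending_since_ms >= MAX_INTERVAL_MS)
        return FLUSH_FORCE;
    if (now_ms < st->next_attempt_ms)
        return FLUSH_SKIP;
    return FLUSH_TRY_NOWAIT;
}

/* Record the outcome of an attempt; complete != 0 means all was flushed. */
void
flush_record(FlushState *st, int64_t now_ms, int complete)
{
    assert(now_ms >= 0);
    if (complete)
    {
        st->pending_since_ms = -1;
        st->next_attempt_ms = now_ms + MIN_INTERVAL_MS;
    }
    else
    {
        /* Keep the time of the first failure so the bound can be enforced. */
        if (st->pending_since_ms < 0)
            st->pending_since_ms = now_ms;
        st->next_attempt_ms = now_ms + RETRY_INTERVAL_MS;
    }
}
```

A backend would call flush_decide() at the end of a transaction and pass the result of the attempt to flush_record(). The FLUSH_TRY_NOWAIT case is where the conditional-lock interfaces from patch 2/5 would fit.
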
 doc/src/sgml/monitoring.sgml                 |    6 +-
 src/backend/postmaster/autovacuum.c          |   12 +-
 src/backend/postmaster/pgstat.c              | 5661 ++++++++++++--------------
 src/backend/postmaster/postmaster.c          |  139 +-
 src/backend/storage/ipc/ipci.c               |    2 +
 src/backend/storage/lmgr/lwlock.c            |    1 +
 src/backend/tcop/postgres.c                  |   27 +-
 src/backend/utils/init/globals.c             |    1 +
 src/backend/utils/init/postinit.c            |   11 +
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |    4 +-
 src/include/miscadmin.h                      |    1 +
 src/include/pgstat.h                         |  441 +-
 src/include/storage/lwlock.h                 |    1 +
 src/include/utils/timeout.h                  |    1 +
 14 files changed, 2637 insertions(+), 3671 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index bf72d0c303..990995c17b 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    master server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   master process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   master process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 073f313337..a222817f55 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1958,15 +1958,15 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
+    /* Start a transaction so our commands have one to play into. */
+    StartTransactionCommand();
+
     /*
      * may be NULL if we couldn't find an entry (only happens if we are
      * forcing a vacuum for anti-wrap purposes).
      */
     dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
 
-    /* Start a transaction so our commands have one to play into. */
-    StartTransactionCommand();
-
     /*
      * Clean up any dead statistics collector entries for this DB. We always
      * want to do this exactly once per DB-processing cycle, even if we find
@@ -2749,12 +2749,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
     if (isshared)
     {
         if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
+            tabentry = pgstat_fetch_stat_tabentry_extended(shared, relid);
     }
     else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
+        tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
 
     return tabentry;
 }
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f4ec142cab..514ea78a68 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,23 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Statistics collector facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects per-table and per-function usage statistics of all backends in
+ *  shared memory. The pg_count_*() and related interfaces store the activity
+ *  of each backend during a transaction. Then pgstat_flush_stat() is called
+ *  at the end of a transaction to flush the local numbers to shared memory.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ *  To avoid congestion on the shared memory, we update shared stats no more
+ *  often than once per PGSTAT_STAT_MIN_INTERVAL (500ms). A backend may still
+ *  be unable to flush all or part of its local numbers immediately. In that
+ *  case we postpone the update and retry after PGSTAT_STAT_RETRY_INTERVAL
+ *  (100ms), but pending numbers are not kept longer than
+ *  PGSTAT_STAT_MAX_INTERVAL (1000ms).
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the stats collector creates the area and loads
+ *  the stored stats file, if any. The last process at shutdown writes the
+ *  shared stats to the file and destroys the area before exit.
  *
  *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
  *
@@ -19,18 +27,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "pgstat.h"
 
@@ -42,66 +38,38 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "postmaster/autovacuum.h"
-#include "postmaster/fork_process.h"
-#include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
-#include "utils/timestamp.h"
-
 
 /* ----------
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
+#define PGSTAT_STAT_MIN_INTERVAL    500 /* Minimum time between stats data
+                                         * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_STAT_RETRY_INTERVAL    100 /* Retry interval after
+                                         * PGSTAT_STAT_MIN_INTERVAL has elapsed */
 
+#define PGSTAT_STAT_MAX_INTERVAL   1000 /* Maximum time between stats data
+                                         * updates; in milliseconds. */
 
 /* ----------
  * The initial size hints for the hash tables used in the collector.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
 #define PGSTAT_TAB_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
@@ -117,6 +85,19 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
+/*
+ * Operation mode and return code of pgstat_get_db_entry.
+ */
+#define    PGSTAT_SHARED        0
+#define    PGSTAT_EXCLUSIVE    1
+#define    PGSTAT_NOWAIT        2
+
+typedef enum PgStat_TableLookupResult
+{
+    NOT_FOUND,
+    FOUND,
+    LOCK_FAILED
+} PgStat_TableLookupResult;
 
 /* ----------
  * GUC parameters
@@ -132,31 +113,63 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used, but will be removed with GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
+#define        StatsLock (&StatsShmem->StatsMainLock)
+
+/* Shared stats bootstrap information */
+typedef struct StatsShmemStruct
+{
+    LWLock                StatsMainLock;        /* lock protecting this struct */
+    dsa_handle             stats_dsa_handle;    /* DSA handle for stats collector */
+    dshash_table_handle db_hash_handle;
+    dsa_pointer            global_stats;
+    dsa_pointer            archiver_stats;
+    int                    refcount;
+} StatsShmemStruct;
+
 /*
- * BgWriter global statistics counters (unused in other processes).
- * Stored directly in a stats message structure so it can be sent
- * without needing to copy things around.  We assume this inits to zeroes.
+ * BgWriter global statistics counters. The name is a remnant from the time
+ * when the stats collector was a dedicated process to which these counters
+ * were sent over a socket.
  */
-PgStat_MsgBgWriter BgWriterStats;
+PgStat_MsgBgWriter BgWriterStats = {0};
 
-/* ----------
- * Local data
- * ----------
- */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
+/* Variables that live for the backend lifetime */
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
+static dshash_table *pgStatDBHash = NULL;
 
-static struct sockaddr_storage pgStatAddr;
 
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
+/* parameters for each type of shared hash */
+static const dshash_parameters dsh_dbparams = {
+    sizeof(Oid),
+    SHARED_DBENT_SIZE,
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+static const dshash_parameters dsh_tblparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatTabEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+static const dshash_parameters dsh_funcparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatFuncEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
 
 /*
  * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
+ * written to shared memory.
  *
  * NOTE: once allocated, TabStatusArray structures are never moved or deleted
  * for the life of the backend.  Also, we zero out the t_id fields of the
@@ -191,8 +204,8 @@ typedef struct TabStatHashEntry
 static HTAB *pgStatTabHash = NULL;
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * Backends store per-function info that's waiting to be flushed out to shared
+ * memory in this hash table (indexed by function OID).
  */
 static HTAB *pgStatFunctions = NULL;
 
@@ -202,6 +215,68 @@ static HTAB *pgStatFunctions = NULL;
  */
 static bool have_function_stats = false;
 
+/* common header of snapshot entry in backend snapshot hash */
+typedef struct PgStat_snapshot
+{
+    Oid        key;
+    bool    negative;
+    void   *body;                /* end of header part: to keep alignment */
+} PgStat_snapshot;
+
+/* context struct for snapshot_statentry */
+typedef struct pgstat_snapshot_param
+{
+    char           *hash_name;            /* name of the snapshot hash */
+    int                hash_entsize;        /* element size of hash entry */
+    dshash_table_handle    dsh_handle;        /* dsh handle to attach */
+    const dshash_parameters *dsh_params;/* dshash params */
+    HTAB          **hash;                /* points to variable to hold hash */
+    dshash_table  **dshash;                /* ditto for dshash */
+} pgstat_snapshot_param;
+
+/*
+ * Backends store various database-wide info that's waiting to be flushed out
+ * to shared memory in these variables.
+ *
+ * checksum_failures is the exception in that it is cluster-wide.
+ */
+typedef struct BackendDBStats
+{
+    int        n_conflict_tablespace;
+    int        n_conflict_lock;
+    int        n_conflict_snapshot;
+    int        n_conflict_bufferpin;
+    int        n_conflict_startup_deadlock;
+    int        n_deadlocks;
+    size_t    n_tmpfiles;
+    size_t    tmpfilesize;
+    HTAB    *checksum_failures;
+} BackendDBStats;
+
+/* Hash entry struct for checksum_failures above */
+typedef struct ChecksumFailureEnt
+{
+    Oid    dboid;
+    int    count;
+} ChecksumFailureEnt;
+
+static BackendDBStats BeDBStats = {0};
+
+/* macros to check BeDBStats at once */
+#define HAVE_PENDING_CONFLICTS() \
+    (BeDBStats.n_conflict_tablespace > 0 ||        \
+     BeDBStats.n_conflict_lock > 0 ||            \
+     BeDBStats.n_conflict_bufferpin > 0 ||        \
+     BeDBStats.n_conflict_startup_deadlock > 0)
+
+#define HAVE_PENDING_DBSTATS()                \
+    (HAVE_PENDING_CONFLICTS() ||        \
+     BeDBStats.n_deadlocks > 0 ||                \
+     BeDBStats.n_tmpfiles > 0 ||                \
+     /* no need to check tmpfilesize */        \
+     BeDBStats.checksum_failures != NULL)
+
+
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
  * into PgStat_TableStatus counters until we know if it is going to commit
@@ -237,11 +312,11 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
+static MemoryContext pgStatSnapshotContext = NULL;
+static HTAB *pgStatLocalHash = NULL;
+static bool    clear_snapshot = false;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -250,23 +325,35 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Struct for context for pgstat_flush_* functions
+ *
+ * To avoid repeated attach/detach of the same dshash, dshashes once attached
+ * are stored in this structure and passed across multiple calls and multiple
+ * functions. generation here means the value returned by pin_hashes().
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
+typedef struct pgstat_flush_stat_context
+{
+    int    shgeneration;                /* "generation" of shdb_tabhash below */
+    PgStat_StatDBEntry *shdbentry;    /* dbentry for shared tables (oid = 0) */
+    dshash_table *shdb_tabhash;        /* tabentry dshash of shared tables */
+
+    int    mygeneration;                /* "generation" of mydb_tabhash below */
+    PgStat_StatDBEntry *mydbentry;    /* dbentry for my database */
+    dshash_table *mydb_tabhash;        /* tabentry dshash of my database */
+} pgstat_flush_stat_context;
 
 /*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
+ * Cluster wide statistics.
+ *
+ * Contains statistics that are collected neither per database nor per
+ * table.  shared_* point to shared memory and snapshot_* are backend
+ * snapshots. Their validity is indicated by global_snapshot_is_valid.
  */
-static List *pending_write_requests = NIL;
-
-/* Signal handler flags */
-static volatile bool need_exit = false;
-static volatile bool got_SIGHUP = false;
+static bool global_snapshot_is_valid = false;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats snapshot_globalStats;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -280,35 +367,41 @@ static instr_time total_func_time;
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
 
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgstat_exit(SIGNAL_ARGS);
 static void pgstat_beshutdown_hook(int code, Datum arg);
-static void pgstat_sighup_handler(SIGNAL_ARGS);
-
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
+static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, int op,
+                                    PgStat_TableLookupResult *status);
+static PgStat_StatTabEntry *pgstat_get_tab_entry(dshash_table *table,
                                                  Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static void pgstat_write_pgStatDBHashfile(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_pgStatDBHashfile(PgStat_StatDBEntry *dbentry);
 static void pgstat_read_current_status(void);
-
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
-
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
+static bool pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
+                                 PgStat_TableStatus *entry);
+static bool pgstat_flush_funcstats(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_flush_dbstats(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_update_tabentry(dshash_table *tabhash,
+                                   PgStat_TableStatus *stat, bool nowait);
+static void pgstat_update_dbentry(PgStat_StatDBEntry *dbentry,
+                                  PgStat_TableStatus *stat);
 static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
 
+static void pgstat_remove_useless_entries(const dshash_table_handle dshhandle,
+                              const dshash_parameters *dshparams,
+                              HTAB *oidtab);
 static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
 
 static void pgstat_setup_memcxt(void);
+static void pgstat_flush_recovery_conflict(PgStat_StatDBEntry *dbentry);
+static void pgstat_flush_deadlock(PgStat_StatDBEntry *dbentry);
+static void pgstat_flush_checksum_failure(PgStat_StatDBEntry *dbentry);
+static void pgstat_flush_tempfile(PgStat_StatDBEntry *dbentry);
+static HTAB *create_tabstat_hash(void);
+static PgStat_SubXactStatus *get_tabstat_stack_level(int nest_level);
+static void add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level);
+static PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry_extended(PgStat_StatDBEntry *dbent, Oid funcid);
+static void pgstat_snapshot_global_stats(void);
 
 static const char *pgstat_get_wait_activity(WaitEventActivity w);
 static const char *pgstat_get_wait_client(WaitEventClient w);
@@ -316,481 +409,197 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+/* ------------------------------------------------------------
+ * Local support functions follow
+ * ------------------------------------------------------------
+ */
+static int pin_hashes(PgStat_StatDBEntry *dbentry);
+static void unpin_hashes(PgStat_StatDBEntry *dbentry, int generation);
+static dshash_table *attach_table_hash(PgStat_StatDBEntry *dbent, int gen);
+static dshash_table *attach_function_hash(PgStat_StatDBEntry *dbent, int gen);
+static void reset_dbentry_counters(PgStat_StatDBEntry *dbentry);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute space needed for stats collector's shared memory
  */
-void
-pgstat_init(void)
+Size
+StatsShmemSize(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
-
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
-
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
-    {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
-    }
-
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
-
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
-
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
-
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
-    }
-
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
-
-    /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
-     */
-    if (!pg_set_noblock(pgStatSock))
-    {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
-    }
-
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
-    }
-
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    return;
-
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
-
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
-
-    /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
-     */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    return sizeof(StatsShmemStruct);
 }
 
 /*
- * subroutine for pgstat_reset_all
+ * StatsShmemInit - initialize during shared-memory creation
+ */
+void
+StatsShmemInit(void)
+{
+    bool    found;
+
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
+
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+    }
+
+    LWLockInitialize(StatsLock, LWTRANCHE_STATS);
+}
+
+/* ----------
+ * pgstat_attach_shared_stats() -
+ *
+ *    Attach to the shared stats memory, creating it if necessary.
+ * ---------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+pgstat_attach_shared_stats(void)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    MemoryContext oldcontext;
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
+    /*
+     * Don't use DSM when not running under postmaster or not tracking counts.
+     */
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    pgstat_setup_memcxt();
+
+    if (area)
+        return;
+
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+    if (StatsShmem->refcount > 0)
+        StatsShmem->refcount++;
+    else
     {
-        int            nchars;
-        Oid            tmp_oid;
+        /* Need to create shared memory area and load saved stats if any. */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
 
-        /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
-         */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
+        /* Initialize shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatDBHash = dshash_create(area, &dsh_dbparams, 0);
 
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->global_stats =
+            dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+        StatsShmem->archiver_stats =
+            dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+        StatsShmem->db_hash_handle = dshash_get_hash_table_handle(pgStatDBHash);
 
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
+
+        /* Load saved data if any. */
+        pgstat_read_statsfiles();
+
+        StatsShmem->refcount = 1;
     }
-    FreeDir(dir);
+
+    LWLockRelease(StatsLock);
+
+    /*
+     * If we're not the first process, attach existing shared stats area
+     * outside StatsLock.
+     */
+    if (!area)
+    {
+        /* Shared area already exists. Just attach it. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatDBHash = dshash_attach(area, &dsh_dbparams,
+                                     StatsShmem->db_hash_handle, 0);
+
+        /* Setup local variables */
+        pgStatLocalHash = NULL;
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
+    }
+
+    MemoryContextSwitchTo(oldcontext);
+
+    dsa_pin_mapping(area);
+    global_snapshot_is_valid = false;
+}
+
+/* ----------
+ * pgstat_detach_shared_stats() -
+ *
+ *    Detach shared stats. Write out to file if we're the last process and
+ *    instructed to write file.
+ * ----------
+ */
+static void
+pgstat_detach_shared_stats(bool write_stats)
+{
+    if (!area || !IsUnderPostmaster)
+        return;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+    /* write out the shared stats to file if needed */
+    if (--StatsShmem->refcount < 1)
+    {
+        if (write_stats)
+            pgstat_write_statsfiles();
+
+        /* We're the last process. Invalidate the dsa area handle. */
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+    }
+
+    LWLockRelease(StatsLock);
+
+    /*
+     * Detach the area. It is automatically destroyed when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);
+
+    area = NULL;
+    pgStatDBHash = NULL;
+    shared_globalStats = NULL;
+    shared_archiverStats = NULL;
+    pgStatLocalHash = NULL;
+    global_snapshot_is_valid = false;
 }
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
-}
+    /* we must have shared stats attached */
+    Assert (StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-#ifdef EXEC_BACKEND
-
-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
-    char       *av[10];
-    int            ac = 0;
-
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
-
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
-
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
-
-
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
+    /* Startup must be the only user of shared stats */
+    Assert (StatsShmem->refcount == 1);
 
     /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
+     * We could remove the files and recreate the shared memory area directly,
+     * but we detach and then re-attach for simplicity.
      */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
-
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
-
-    /*
-     * Okay, fork off the collector.
-     */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
+    pgstat_detach_shared_stats(false);    /* Don't write */
+    pgstat_attach_shared_stats();
 }
 
 /* ------------------------------------------------------------
@@ -798,75 +607,293 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *  Updates are applied no more frequently than once every
+ *  PGSTAT_STAT_MIN_INTERVAL milliseconds. They are also postponed on lock
+ *  failure if force is false and no update has been pending for longer than
+ *  PGSTAT_STAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
+ *  succeeding calls of this function.
+ *
+ *    Returns the time in milliseconds to wait before the next attempt to
+ *    apply updates, or 0 if nothing was pending or all updates have been
+ *    applied.
+ *
+ *    Note that this is called only outside a transaction, so it is fine to use
  *    transaction stop time as an approximation of current time.
- * ----------
+ *    ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz next_flush = 0;
+    static TimestampTz pending_since = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
-    int            i;
+    pgstat_flush_stat_context cxt = {0};
+    bool        pending_stats = false;
+    long        elapsed;
+    long        secs;
+    int            usecs;
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (area == NULL ||
+        ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
+         pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
+         !HAVE_PENDING_DBSTATS()  && !have_function_stats))
+        return 0;
+
+    now = GetCurrentTransactionStopTimestamp();
+
+    if (!force)
+    {
+        /*
+         * Don't flush stats until it is time to.  Return the time to wait in
+         * milliseconds.
+         */
+        if (now < next_flush)
+        {
+            /* Record the time of the oldest pending update if not yet set. */
+            if (pending_since == 0)
+                pending_since = now;
+
+            /* now < next_flush here */
+            return (next_flush - now) / 1000;
+        }
+
+        /*
+         * Don't keep pending updates longer than PGSTAT_STAT_MAX_INTERVAL.
+         */
+        if (pending_since > 0)
+        {
+            TimestampDifference(pending_since, now, &secs, &usecs);
+            elapsed = secs * 1000 + usecs / 1000;
+
+            if (elapsed > PGSTAT_STAT_MAX_INTERVAL)
+                force = true;
+        }
+    }
+
+    /* Flush out table stats */
+    if (pgStatTabList != NULL && !pgstat_flush_stat(&cxt, !force))
+        pending_stats = true;
+
+    /* Flush out function stats */
+    if (pgStatFunctions != NULL && !pgstat_flush_funcstats(&cxt, !force))
+        pending_stats = true;
+
+    /* Flush out database-wide stats */
+    if (HAVE_PENDING_DBSTATS())
+    {
+        if (!pgstat_flush_dbstats(&cxt, !force))
+            pending_stats = true;
+    }
+
+    /* Unpin dbentry if pinned */
+    if (cxt.mydb_tabhash)
+    {
+        dshash_detach(cxt.mydb_tabhash);
+        unpin_hashes(cxt.mydbentry, cxt.mygeneration);
+        cxt.mydb_tabhash = NULL;
+        cxt.mydbentry = NULL;
+    }
+
+    /* Publish the last flush time */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (shared_globalStats->stats_timestamp < now)
+        shared_globalStats->stats_timestamp = now;
+    LWLockRelease(StatsLock);
+
+    /* Record how long we have been keeping pending updates. */
+    if (pending_stats)
+    {
+        /* Preserve the first value */
+        if (pending_since == 0)
+            pending_since = now;
+
+        /*
+         * The retry interval may push the wait past the limit set by
+         * PGSTAT_STAT_MAX_INTERVAL. We don't bother with that since the
+         * excess is small.
+         */
+        return PGSTAT_STAT_RETRY_INTERVAL;
+    }
+
+    /* Set the next time to update stats */
+    next_flush = now + PGSTAT_STAT_MIN_INTERVAL * 1000;
+    pending_since = 0;
+
+    return 0;
+}
+
+/*
+ * snapshot_statentry() - Common routine for functions
+ *                             pgstat_fetch_stat_*entry()
+ *
+ *  Returns the pointer to a snapshot of a shared entry for the key or NULL if
+ *  not found. Returned snapshots are stable during the current transaction or
+ *  until pgstat_clear_snapshot() is called.
+ *
+ *  The snapshots are stored in a hash, a pointer to which is stored in the
+ *  HTAB * variable pointed to by cxt->hash. If not created yet, it is created
+ *  using hash_name and hash_entsize in cxt.
+ *
+ *  cxt->dshash points to dshash_table for dbstat entries. If not yet
+ *  attached, it is attached using cxt->dsh_handle.
+ */
+static void *
+snapshot_statentry(pgstat_snapshot_param *cxt, Oid key)
+{
+    PgStat_snapshot *lentry = NULL;
+    size_t keysize = cxt->dsh_params->key_size;
+    size_t dsh_entrysize = cxt->dsh_params->entry_size;
+    bool found;
 
     /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
+     * We don't want to update the stats snapshot too frequently. Keep it for
+     * at least PGSTAT_STAT_MIN_INTERVAL ms. Don't postpone; just ignore the cue.
      */
-    now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
+    if (clear_snapshot)
+    {
+        clear_snapshot = false;
+
+        if (pgStatSnapshotContext &&
+            snapshot_globalStats.stats_timestamp <
+            GetCurrentStatementStartTimestamp() -
+            PGSTAT_STAT_MIN_INTERVAL * 1000)
+        {
+            MemoryContextReset(pgStatSnapshotContext);
+
+            /* Reset variables */
+            global_snapshot_is_valid = false;
+            pgStatSnapshotContext = NULL;
+            pgStatLocalHash = NULL;
+
+            pgstat_setup_memcxt();
+        }
+    }
+
+    /*
+     * Create a new hash with a rather arbitrary initial number of entries,
+     * since we don't know how this hash will grow.
+     */
+    if (!*cxt->hash)
+    {
+        HASHCTL ctl;
+
+        /*
+         * Create the hash in the stats context
+         *
+         * Each entry is preceded by a common header part represented by
+         * PgStat_snapshot.
+         */
+
+        ctl.keysize        = keysize;
+        ctl.entrysize    = offsetof(PgStat_snapshot, body) + cxt->hash_entsize;
+        ctl.hcxt        = pgStatSnapshotContext;
+        *cxt->hash = hash_create(cxt->hash_name, 32, &ctl,
+                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    }
+
+    lentry = hash_search(*cxt->hash, &key, HASH_ENTER, &found);
+
+    /*
+     * Refer to the shared hash if not found in the local hash. We return
+     * up-to-date entries outside a transaction, so do the same even if the
+     * snapshot is found.
+     */
+    if (!found || !IsTransactionState())
+    {
+        void *sentry;
+
+        /* attach shared hash if not given, leave it alone for later use */
+        if (!*cxt->dshash)
+        {
+            MemoryContext oldcxt;
+
+            Assert (cxt->dsh_handle != DSM_HANDLE_INVALID);
+            oldcxt = MemoryContextSwitchTo(pgStatSnapshotContext);
+            *cxt->dshash =
+                dshash_attach(area, cxt->dsh_params, cxt->dsh_handle, NULL);
+            MemoryContextSwitchTo(oldcxt);
+        }
+
+        sentry = dshash_find(*cxt->dshash, &key, false);
+
+        if (sentry)
+        {
+            /*
+             * In transaction state, we obviously need local cache entries
+             * for consistency. Outside a transaction we return an up-to-date
+             * entry, but we still need a local copy since the dshash entry
+             * must be released immediately. We share the same local hash
+             * entry for that purpose.
+             */
+            memcpy(&lentry->body, sentry, dsh_entrysize);
+            dshash_release_lock(*cxt->dshash, sentry);
+
+            /* then zero out the local additional space if any */
+            if (dsh_entrysize < cxt->hash_entsize)
+                MemSet((char *)&lentry->body + dsh_entrysize, 0,
+                       cxt->hash_entsize - dsh_entrysize);
+        }
+
+        lentry->negative = !sentry;
+    }
+
+    if (lentry->negative)
+        return NULL;
+
+    return &lentry->body;
+}
+
+/*
+ * pgstat_flush_stat: Flushes table stats out to shared statistics.
+ *
+ *  If nowait is true, returns false if required lock was not acquired
+ *  immediately. In that case, unapplied table stats updates are left alone in
+ *  TabStatusArray to wait for the next chance. cxt holds some dshash related
+ *  values that we want to carry around while updating shared stats.
+ *
+ *  Returns true if all stats info is flushed. Caller must detach dshashes
+ *  stored in cxt after use.
+ */
+static bool
+pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait)
+{
+    static const PgStat_TableCounts all_zeroes;
+    TabStatusArray *tsa;
+    HTAB           *new_tsa_hash = NULL;
+    TabStatusArray *dest_tsa = pgStatTabList;
+    int                dest_elem = 0;
+    int                i;
+
+    /* nothing to do, just return  */
+    if (pgStatTabHash == NULL)
+        return true;
 
     /*
      * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
+     * entries it points to.
      */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
+    hash_destroy(pgStatTabHash);
     pgStatTabHash = NULL;
 
     /*
      * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
+     * have counts, and try flushing them out to shared stats. We may fail on
+     * some entries in the array; those entries are left packed at the
+     * beginning of the array.
      */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
     for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
     {
         for (i = 0; i < tsa->tsa_used; i++)
         {
             PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
 
             /* Shouldn't have any pending transaction-dependent counts */
             Assert(entry->trans == NULL);
@@ -879,178 +906,352 @@ pgstat_report_stat(bool force)
                        sizeof(PgStat_TableCounts)) == 0)
                 continue;
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* try to apply the tab stats */
+            if (!pgstat_flush_tabstat(cxt, nowait, entry))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                /*
+                 * Failed. Move it to the beginning of TabStatusArray and
+                 * leave it.
+                 */
+                TabStatHashEntry *hash_entry;
+                bool found;
+
+                if (new_tsa_hash == NULL)
+                    new_tsa_hash = create_tabstat_hash();
+
+                /* Create hash entry for this entry */
+                hash_entry = hash_search(new_tsa_hash, &entry->t_id,
+                                         HASH_ENTER, &found);
+                Assert(!found);
+
+                /*
+                 * Move insertion pointer to the next segment if the segment
+                 * is filled up.
+                 */
+                if (dest_elem >= TABSTAT_QUANTUM)
+                {
+                    Assert(dest_tsa->tsa_next != NULL);
+                    dest_tsa = dest_tsa->tsa_next;
+                    dest_elem = 0;
+                }
+
+                /*
+                 * Pack the entry at the beginning of the array. Do nothing if
+                 * it does not need to be moved.
+                 */
+                if (tsa != dest_tsa || i != dest_elem)
+                {
+                    PgStat_TableStatus *new_entry;
+                    new_entry = &dest_tsa->tsa_entries[dest_elem];
+                    *new_entry = *entry;
+
+                    /* use new_entry as entry hereafter */
+                    entry = new_entry;
+                }
+
+                hash_entry->tsa_entry = entry;
+                dest_elem++;
             }
         }
-        /* zero out TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
     }
 
-    /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
-     */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    /* zero out unused area of TableStatus */
+    dest_tsa->tsa_used = dest_elem;
+    MemSet(&dest_tsa->tsa_entries[dest_elem], 0,
+           (TABSTAT_QUANTUM - dest_elem) * sizeof(PgStat_TableStatus));
+    while (dest_tsa->tsa_next)
+    {
+        dest_tsa = dest_tsa->tsa_next;
+        MemSet(dest_tsa->tsa_entries, 0,
+               dest_tsa->tsa_used * sizeof(PgStat_TableStatus));
+        dest_tsa->tsa_used = 0;
+    }
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+    /* and set the new TabStatusArray hash if any */
+    pgStatTabHash = new_tsa_hash;
+
+    /*
+     * We no longer need the shared database and table entries, but those for
+     * my database may be used later.
+     */
+    if (cxt->shdb_tabhash)
+    {
+        dshash_detach(cxt->shdb_tabhash);
+        unpin_hashes(cxt->shdbentry, cxt->shgeneration);
+        cxt->shdb_tabhash = NULL;
+        cxt->shdbentry = NULL;
+    }
+
+    return pgStatTabHash == NULL;
 }
 
-/*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+/* -------
+ * Subroutines for pgstat_flush_stat.
+ * -------
  */
-static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+/*
+ * pgstat_flush_tabstat: Flushes a table stats entry.
+ *
+ *  If nowait is true, returns false on lock failure.  Dshashes for table and
+ *  function stats are kept attached in cxt. The caller must detach them after
+ *  use.
+ *
+ *  Returns true if the entry is flushed out.
+ */
+bool
+pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
+                     PgStat_TableStatus *entry)
 {
-    int            n;
-    int            len;
+    Oid        dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
+    int        table_mode = PGSTAT_EXCLUSIVE;
+    bool    updated = false;
+    dshash_table *tabhash;
+    PgStat_StatDBEntry *dbent;
+    int        generation;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    if (nowait)
+        table_mode |= PGSTAT_NOWAIT;
 
-    /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
-     */
-    if (OidIsValid(tsmsg->m_databaseid))
+    /* Attach required table hash if not yet. */
+    if ((entry->t_shared ? cxt->shdb_tabhash : cxt->mydb_tabhash) == NULL)
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
-        pgStatXactCommit = 0;
-        pgStatXactRollback = 0;
-        pgStatBlockReadTime = 0;
-        pgStatBlockWriteTime = 0;
+        /*
+         *  Return if we don't have the corresponding dbentry. It would have
+         *  been removed.
+         */
+        dbent = pgstat_get_db_entry(dboid, table_mode, NULL);
+        if (!dbent)
+            return false;
+
+        /*
+         * We don't hold lock on the dbentry since it cannot be dropped while
+         * we are working on it.
+         */
+        generation = pin_hashes(dbent);
+        tabhash = attach_table_hash(dbent, generation);
+
+        if (entry->t_shared)
+        {
+            cxt->shgeneration = generation;
+            cxt->shdbentry = dbent;
+            cxt->shdb_tabhash = tabhash;
+        }
+        else
+        {
+            cxt->mygeneration = generation;
+            cxt->mydbentry = dbent;
+            cxt->mydb_tabhash = tabhash;
+
+            /*
+             * We come here once per database. Take the chance to update
+             * database-wide stats
+             */
+            LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
+            dbent->n_xact_commit += pgStatXactCommit;
+            dbent->n_xact_rollback += pgStatXactRollback;
+            dbent->n_block_read_time += pgStatBlockReadTime;
+            dbent->n_block_write_time += pgStatBlockWriteTime;
+            LWLockRelease(&dbent->lock);
+            pgStatXactCommit = 0;
+            pgStatXactRollback = 0;
+            pgStatBlockReadTime = 0;
+            pgStatBlockWriteTime = 0;
+        }
+    }
+    else if (entry->t_shared)
+    {
+        dbent = cxt->shdbentry;
+        tabhash = cxt->shdb_tabhash;
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        dbent = cxt->mydbentry;
+        tabhash = cxt->mydb_tabhash;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    /*
+     * Local table stats should be applied to both dbentry and tabentry at
+     * once. Update dbentry only if we could update tabentry.
+     */
+    if (pgstat_update_tabentry(tabhash, entry, nowait))
+    {
+        pgstat_update_dbentry(dbent, entry);
+        updated = true;
+    }
+
+    return updated;
 }
 
 /*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
+ * pgstat_flush_funcstats: Flushes function stats.
+ *
+ *  If nowait is true, returns false on lock failure. Unapplied local hash
+ *  entries are left alone.
+ *
+ *  Returns true if all entries are flushed out.
  */
-static void
-pgstat_send_funcstats(void)
+static bool
+pgstat_flush_funcstats(pgstat_flush_stat_context *cxt, bool nowait)
 {
     /* we assume this inits to all zeroes: */
     static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
+    dshash_table   *funchash;
     HASH_SEQ_STATUS fstat;
+    PgStat_BackendFunctionEntry *bestat;
 
+    /* nothing to do, just return  */
     if (pgStatFunctions == NULL)
-        return;
+        return true;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
-
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    /* get dbentry into cxt if not yet.  */
+    if (cxt->mydbentry == NULL)
     {
-        PgStat_FunctionEntry *m_ent;
+        int op = PGSTAT_EXCLUSIVE;
 
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
+        if (nowait)
+            op |= PGSTAT_NOWAIT;
+
+        cxt->mydbentry = pgstat_get_db_entry(MyDatabaseId, op, NULL);
+
+        if (cxt->mydbentry == NULL)
+            return false;
+
+        cxt->mygeneration = pin_hashes(cxt->mydbentry);
+    }
+
+    funchash = attach_function_hash(cxt->mydbentry, cxt->mygeneration);
+    if (funchash == NULL)
+        return false;
+
+    have_function_stats = false;
+
+    /*
+     * Scan through pgStatFunctions to find functions that actually have
+     * counts, and try flushing them out to shared stats.
+     */
+    hash_seq_init(&fstat, pgStatFunctions);
+    while ((bestat = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    {
+        bool found;
+        PgStat_StatFuncEntry *funcent = NULL;
+
+        /* Skip it if no counts accumulated for it so far */
+        if (memcmp(&bestat->f_counts, &all_zeroes,
                    sizeof(PgStat_FunctionCounts)) == 0)
             continue;
 
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
+        funcent = (PgStat_StatFuncEntry *)
+            dshash_find_or_insert_extended(funchash, (void *) &(bestat->f_id),
+                                           &found, nowait);
 
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
+        /*
+         * We couldn't acquire lock on the required entry. Leave the local
+         * entry alone.
+         */
+        if (!funcent)
         {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
+            have_function_stats = true;
+            continue;
         }
 
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
+        /* Initialize if it's new, or add to it. */
+        if (!found)
+        {
+            funcent->functionid = bestat->f_id;
+            funcent->f_numcalls = bestat->f_counts.f_numcalls;
+            funcent->f_total_time =
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_total_time);
+            funcent->f_self_time =
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_self_time);
+        }
+        else
+        {
+            funcent->f_numcalls += bestat->f_counts.f_numcalls;
+            funcent->f_total_time +=
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_total_time);
+            funcent->f_self_time +=
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_self_time);
+        }
+        dshash_release_lock(funchash, funcent);
+
+        /* reset used counts */
+        MemSet(&bestat->f_counts, 0, sizeof(PgStat_FunctionCounts));
     }
 
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
+    return !have_function_stats;
 }
 
+/*
+ * pgstat_flush_dbstats: Flushes out miscellaneous database stats.
+ *
+ *  If nowait is true, returns false on lock failure on dbentry.
+ *
+ *  Returns true if all stats are flushed out.
+ */
+static bool
+pgstat_flush_dbstats(pgstat_flush_stat_context *cxt, bool nowait)
+{
+    /* get dbentry if not yet.  */
+    if (cxt->mydbentry == NULL)
+    {
+        int op = PGSTAT_EXCLUSIVE;
+        if (nowait)
+            op |= PGSTAT_NOWAIT;
+
+        cxt->mydbentry = pgstat_get_db_entry(MyDatabaseId, op, NULL);
+
+        /* return if lock failed. */
+        if (cxt->mydbentry == NULL)
+            return false;
+
+        /* we use this generation of table/function stats in this turn */
+        cxt->mygeneration = pin_hashes(cxt->mydbentry);
+    }
+
+    LWLockAcquire(&cxt->mydbentry->lock, LW_EXCLUSIVE);
+    if (HAVE_PENDING_CONFLICTS())
+        pgstat_flush_recovery_conflict(cxt->mydbentry);
+    if (BeDBStats.n_deadlocks != 0)
+        pgstat_flush_deadlock(cxt->mydbentry);
+    if (BeDBStats.n_tmpfiles != 0)
+        pgstat_flush_tempfile(cxt->mydbentry);
+    if (BeDBStats.checksum_failures != NULL)
+        pgstat_flush_checksum_failure(cxt->mydbentry);
+    LWLockRelease(&cxt->mydbentry->lock);
+
+    return true;
+}
 
 /* ----------
  * pgstat_vacuum_stat() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *    Remove objects we can get rid of.
  * ----------
  */
 void
 pgstat_vacuum_stat(void)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
+    HTAB       *oidtab;
+    dshash_seq_status dshstat;
     PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* we don't collect stats in standalone mode */
+    if (!IsUnderPostmaster)
         return;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
     /*
      * Read pg_database and make a list of OIDs of all existing databases
      */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    oidtab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
 
     /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
+     * Search the database hash table for dead databases and drop them
+     * from the hash.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+
+    dshash_seq_init(&dshstat, pgStatDBHash, false, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            dbid = dbentry->databaseid;
 
@@ -1058,137 +1259,43 @@ pgstat_vacuum_stat(void)
 
         /* the DB entry for shared tables (with InvalidOid) is never dropped */
         if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
+            hash_search(oidtab, (void *) &dbid, HASH_FIND, NULL) == NULL)
             pgstat_drop_database(dbid);
     }
 
     /* Clean up */
-    hash_destroy(htab);
+    hash_destroy(oidtab);
 
     /*
      * Lookup our own database entry; if not found, nothing more to do.
      */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_EXCLUSIVE, NULL);
+    if (!dbentry)
         return;
 
     /*
      * Similarly to above, make a list of all known relations in this DB.
      */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
+    oidtab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
 
     /*
      * Check for all tables listed in stats hashtable if they still exist.
+     * Stats cache is useless here so directly search the shared hash.
      */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            tabid = tabentry->tableid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
-            continue;
-
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
-    }
+    pgstat_remove_useless_entries(dbentry->tables, &dsh_tblparams, oidtab);
 
     /*
-     * Send the rest
+     * Repeat the above for functions, but we needn't bother in the common
+     * case where no function stats are being collected.
      */
-    if (msg.m_nentries > 0)
+    if (dbentry->functions != DSM_HANDLE_INVALID)
     {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
+        oidtab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
 
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
-        {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
-                continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
-        }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
+        pgstat_remove_useless_entries(dbentry->functions, &dsh_funcparams,
+                                      oidtab);
     }
+    dshash_release_lock(pgStatDBHash, dbentry);
 }
 
 
@@ -1242,66 +1349,99 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     return htab;
 }
 
+/*
+ * pgstat_remove_useless_entries - Remove useless entries from per
+ * table/function dshashes.
+ *
+ *  Scan the dshash specified by dshhandle removing entries that are not in
+ *  oidtab. oidtab is destroyed before returning.
+ */
+void
+pgstat_remove_useless_entries(const dshash_table_handle dshhandle,
+                              const dshash_parameters *dshparams,
+                              HTAB *oidtab)
+{
+    dshash_table *dshtable;
+    dshash_seq_status dshstat;
+    void         *ent;
+
+    dshtable = dshash_attach(area, dshparams, dshhandle, 0);
+    dshash_seq_init(&dshstat, dshtable, false, true);
+
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
+    {
+        CHECK_FOR_INTERRUPTS();
+
+        /* The first member of the entries must be Oid */
+        if (hash_search(oidtab, ent, HASH_FIND, NULL) != NULL)
+            continue;
+
+        /* Not there, so purge this entry */
+        dshash_delete_entry(dshtable, ent);
+    }
+    dshash_detach(dshtable);
+    hash_destroy(oidtab);
+}
 
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
+ *    Remove entry for the database that we just dropped.
+ *
+ *    If some stats are flushed after this, this entry will be re-created but we
+ *    will still clean the dead DB eventually via future invocations of
+ *    pgstat_vacuum_stat().
  * ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    Assert(OidIsValid(databaseid));
+
+    if (!IsUnderPostmaster || !pgStatDBHash)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Lookup the database in the hashtable with exclusive lock.
+     */
+    dbentry = pgstat_get_db_entry(databaseid, PGSTAT_EXCLUSIVE, NULL);
+
+    /*
+     * If found, remove it.
+     */
+    if (dbentry)
+    {
+        /* LWLock is needed to rewrite */
+        LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+
+        /* No one is using tables/functions in this dbentry */
+        Assert(dbentry->refcnt == 0);
+
+        /* Remove table/function stats dshash first. */
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+        }
+        LWLockRelease(&dbentry->lock);
+
+        dshash_delete_entry(pgStatDBHash, (void *)dbentry);
+    }
 }
 
-
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
-
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
-}
-#endif                            /* NOT_USED */
-
-
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1310,20 +1450,32 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    PgStat_StatDBEntry       *dbentry;
+    PgStat_TableLookupResult status;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!pgStatDBHash)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Lookup the database in the hashtable.  Nothing to do if not there.
+     */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_EXCLUSIVE, &status);
+
+    if (!dbentry)
+        return;
+
+    /* This database is active, safe to release the lock immediately. */
+    dshash_release_lock(pgStatDBHash, dbentry);
+
+    /* Reset database-level stats. */
+    reset_dbentry_counters(dbentry);
+
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1332,29 +1484,37 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
+    /* Reset the archiver statistics for the cluster. */
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    /* Reset the bgwriter statistics for the cluster. */
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1363,17 +1523,42 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
+    int generation;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_EXCLUSIVE, NULL);
+
+    if (!dbentry)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* This database is active, safe to release the lock immediately. */
+    generation = pin_hashes(dbentry);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Set the reset timestamp for the whole database */
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&dbentry->lock);
+
+    /* Remove object if it exists, ignore if not */
+    if (type == RESET_TABLE)
+    {
+        dshash_table *t = attach_table_hash(dbentry, generation);
+        dshash_delete_key(t, (void *) &objoid);
+        dshash_detach(t);
+    }
+
+    if (type == RESET_FUNCTION)
+    {
+        dshash_table *t = attach_function_hash(dbentry, generation);
+        if (t)
+        {
+            dshash_delete_key(t, (void *) &objoid);
+            dshash_detach(t);
+        }
+    }
+    unpin_hashes(dbentry, generation);
 }
 
 /* ----------
@@ -1387,48 +1572,81 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    /*
+     * Store the last autovacuum time in the database's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_EXCLUSIVE, NULL);
+    dshash_release_lock(pgStatDBHash, dbentry);
 
-    pgstat_send(&msg, sizeof(msg));
+    ts = GetCurrentTimestamp();
+
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->last_autovac_time = ts;
+    LWLockRelease(&dbentry->lock);
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
+    int                    generation;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hash table entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_EXCLUSIVE, NULL);
+    generation = pin_hashes(dbentry);
+    table = attach_table_hash(dbentry, generation);
+
+    tabentry = pgstat_get_tab_entry(table, tableoid, true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->vacuum_count++;
+    }
+    dshash_release_lock(table, tabentry);
+
+    dshash_detach(table);
+    unpin_hashes(dbentry, generation);
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1439,9 +1657,14 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table       *table;
+    int                    generation;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1470,78 +1693,153 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_EXCLUSIVE, NULL);
+    generation = pin_hashes(dbentry);
+    table = attach_table_hash(dbentry, generation);
+    tabentry = pgstat_get_tab_entry(table, RelationGetRelid(rel), true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    dshash_release_lock(table, tabentry);
+
+    dshash_detach(table);
+    unpin_hashes(dbentry, generation);
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupResult status;

-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            BeDBStats.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            BeDBStats.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            BeDBStats.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            BeDBStats.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            BeDBStats.n_conflict_startup_deadlock++;
+            break;
+    }
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
+
+    if (status == LOCK_FAILED)
+        return;
+
+    /* We had a chance to flush immediately */
+    pgstat_flush_recovery_conflict(dbentry);
+
+    dshash_release_lock(pgStatDBHash, dbentry);
+}
+
+/*
+ * flush recovery conflict stats
+ */
+static void
+pgstat_flush_recovery_conflict(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_conflict_tablespace    += BeDBStats.n_conflict_tablespace;
+    dbentry->n_conflict_lock         += BeDBStats.n_conflict_lock;
+    dbentry->n_conflict_snapshot    += BeDBStats.n_conflict_snapshot;
+    dbentry->n_conflict_bufferpin    += BeDBStats.n_conflict_bufferpin;
+    dbentry->n_conflict_startup_deadlock += BeDBStats.n_conflict_startup_deadlock;
+
+    BeDBStats.n_conflict_tablespace = 0;
+    BeDBStats.n_conflict_lock = 0;
+    BeDBStats.n_conflict_snapshot = 0;
+    BeDBStats.n_conflict_bufferpin = 0;
+    BeDBStats.n_conflict_startup_deadlock = 0;
 }
 
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupResult status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    BeDBStats.n_deadlocks++;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
+
+    if (status == LOCK_FAILED)
+        return;
+
+    dshash_release_lock(pgStatDBHash, dbentry);
 }
 
-
-
-/* --------
- * pgstat_report_checksum_failures_in_db() -
- *
- *    Tell the collector about one or more checksum failures.
- * --------
+/*
+ * flush deadlock stats
  */
-void
-pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
+static void
+pgstat_flush_deadlock(PgStat_StatDBEntry *dbentry)
 {
-    PgStat_MsgChecksumFailure msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
-
-    pgstat_send(&msg, sizeof(msg));
+    dbentry->n_deadlocks += BeDBStats.n_deadlocks;
+    BeDBStats.n_deadlocks = 0;
 }
 
 /* --------
@@ -1559,60 +1857,153 @@ pgstat_report_checksum_failure(void)
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupResult status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
-}
+    if (filesize > 0) /* Is there a case where filesize is really 0? */
+    {
+        BeDBStats.tmpfilesize += filesize; /* needs an overflow check */
+        BeDBStats.n_tmpfiles++;
+    }
 
-
-/* ----------
- * pgstat_ping() -
- *
- *    Send some junk data to the collector to increase traffic.
- * ----------
- */
-void
-pgstat_ping(void)
-{
-    PgStat_MsgDummy msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (BeDBStats.n_tmpfiles == 0)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
+
+    if (status == LOCK_FAILED)
+        return;
+
+    /* We had a chance to flush immediately */
+    pgstat_flush_tempfile(dbentry);
+
+    dshash_release_lock(pgStatDBHash, dbentry);
 }
 
-/* ----------
- * pgstat_send_inquiry() -
- *
- *    Notify collector that we need fresh data.
- * ----------
+/*
+ * flush temporary file stats
  */
 static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
+pgstat_flush_tempfile(PgStat_StatDBEntry *dbentry)
 {
-    PgStat_MsgInquiry msg;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
+    dbentry->n_temp_bytes += BeDBStats.tmpfilesize;
+    dbentry->n_temp_files += BeDBStats.n_tmpfiles;
+    BeDBStats.tmpfilesize = 0;
+    BeDBStats.n_tmpfiles = 0;
 }
 
+/* --------
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
+ *
+ *    Report one or more checksum failures.
+ * --------
+ */
+void
+pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
+{
+    PgStat_StatDBEntry       *dbentry;
+    PgStat_TableLookupResult status;
+    ChecksumFailureEnt       *failent = NULL;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    if (BeDBStats.checksum_failures != NULL)
+    {
+        failent = hash_search(BeDBStats.checksum_failures, &dboid,
+                              HASH_FIND, NULL);
+        if (failent)
+            failurecount += failent->count;
+    }
+
+    if (failurecount == 0)
+        return;
+
+    dbentry = pgstat_get_db_entry(dboid,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
+
+    if (status == LOCK_FAILED)
+    {
+        if (!failent)
+        {
+            if (!BeDBStats.checksum_failures)
+            {
+                HASHCTL    ctl;
+
+                ctl.keysize = sizeof(Oid);
+                ctl.entrysize = sizeof(ChecksumFailureEnt);
+                BeDBStats.checksum_failures =
+                    hash_create("pgstat checksum failure count hash",
+                                32, &ctl, HASH_ELEM | HASH_BLOBS);
+            }
+
+            failent = hash_search(BeDBStats.checksum_failures,
+                                  &dboid, HASH_ENTER, NULL);
+        }
+
+        failent->count = failurecount;
+        return;
+    }
+
+    /* We have a chance to flush immediately */
+    dbentry->n_checksum_failures += failurecount;
+    BeDBStats.checksum_failures = NULL;
+
+    dshash_release_lock(pgStatDBHash, dbentry);
+}
+
+/*
+ * flush checksum failure count for all databases
+ */
+static void
+pgstat_flush_checksum_failure(PgStat_StatDBEntry *dbentry)
+{
+    HASH_SEQ_STATUS     stat;
+    ChecksumFailureEnt *ent;
+    bool                release_dbent;
+
+    if (BeDBStats.checksum_failures == NULL)
+        return;
+
+    hash_seq_init(&stat, BeDBStats.checksum_failures);
+    while ((ent = (ChecksumFailureEnt *) hash_seq_search(&stat)) != NULL)
+    {
+        release_dbent = false;
+
+        if (dbentry->databaseid != ent->dboid)
+        {
+            dbentry = pgstat_get_db_entry(ent->dboid,
+                                          PGSTAT_EXCLUSIVE, NULL);
+            if (!dbentry)
+                continue;
+
+            release_dbent = true;
+        }
+
+        dbentry->n_checksum_failures += ent->count;
+
+        if (release_dbent)
+            dshash_release_lock(pgStatDBHash, dbentry);
+    }
+
+    hash_destroy(BeDBStats.checksum_failures);
+    BeDBStats.checksum_failures = NULL;
+}
 
 /*
  * Initialize function call usage data.
@@ -1764,7 +2155,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1783,6 +2175,24 @@ pgstat_initstats(Relation rel)
     rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
 }
 
+/*
+ * create_tabstat_hash - create local hash as transactional storage
+ */
+static HTAB *
+create_tabstat_hash(void)
+{
+    HASHCTL        ctl;
+
+    MemSet(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(Oid);
+    ctl.entrysize = sizeof(TabStatHashEntry);
+
+    return hash_create("pgstat TabStatusArray lookup hash table",
+                       TABSTAT_QUANTUM,
+                       &ctl,
+                       HASH_ELEM | HASH_BLOBS);
+}
+
 /*
  * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
  */
@@ -1798,18 +2208,7 @@ get_tabstat_entry(Oid rel_id, bool isshared)
      * Create hash table if we don't have it already.
      */
     if (pgStatTabHash == NULL)
-    {
-        HASHCTL        ctl;
-
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
-    }
+        pgStatTabHash = create_tabstat_hash();
 
     /*
      * Find an entry or create a new one.
@@ -2422,30 +2821,33 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
- * ----------
+ *    Find database stats entry on backends. The returned entries are cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_param param =
+    {
+        .hash_name = "local database stats hash",
+        .hash_entsize = sizeof(PgStat_StatDBEntry),
+        .dsh_handle = DSM_HANDLE_INVALID,   /* already attached */
+        .dsh_params = &dsh_dbparams,
+        .hash = &pgStatLocalHash,
+        .dshash = &pgStatDBHash
+    };
 
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    /* If not done for this transaction, take a snapshot of global stats */
+    pgstat_snapshot_global_stats();
+
+    /* callers have no business with snapshot-local members */
+    return (PgStat_StatDBEntry *) snapshot_statentry(&param, dbid);
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
@@ -2458,51 +2860,66 @@ pgstat_fetch_stat_dbentry(Oid dbid)
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* Lookup our database, then look in its table hash table. */
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    if (dbentry == NULL)
+        return NULL;
 
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    dbentry = pgstat_fetch_stat_dbentry(InvalidOid);
+    if (dbentry == NULL)
+        return NULL;
+
+    tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     return NULL;
 }
 
+/* ----------
+ * pgstat_fetch_stat_tabentry_extended() -
+ *
+ *    Find table stats entry on backends. The returned entries are cached until
+ *    transaction end or pgstat_clear_snapshot() is called.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_extended(PgStat_StatDBEntry *dbent, Oid reloid)
+{
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_param param =
+    {
+        .hash_name = "table stats snapshot hash",
+        .hash_entsize = sizeof(PgStat_StatTabEntry),
+        .dsh_handle = DSM_HANDLE_INVALID,
+        .dsh_params = &dsh_tblparams,
+        .hash = NULL,
+        .dshash = NULL
+    };
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    /* set target shared hash */
+    param.dsh_handle = dbent->tables;
+
+    /* tell snapshot_statentry what variables to use */
+    param.hash = &dbent->snapshot_tables;
+    param.dshash = &dbent->dshash_tables;
+
+    return (PgStat_StatTabEntry *)
+        snapshot_statentry(&param, reloid);
+}
+
 
 /* ----------
  * pgstat_fetch_stat_funcentry() -
@@ -2517,21 +2934,90 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     PgStat_StatDBEntry *dbentry;
     PgStat_StatFuncEntry *funcentry = NULL;
 
-    /* load the stats file if needed */
-    backend_read_statsfile();
-
-    /* Lookup our database, then find the requested function.  */
+    /* Lookup our database, then find the requested function */
     dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
+    if (dbentry == NULL)
+        return NULL;
+
+    funcentry = pgstat_fetch_stat_funcentry_extended(dbentry, func_id);
 
     return funcentry;
 }
 
+/* ----------
+ * pgstat_fetch_stat_funcentry_extended() -
+ *
+ *    Find function stats entry on backends. The returned entries are cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
+ *
+ *  dbent is of type (PgStat_StatDBEntry *), but its body must be a
+ *  PgStat_StatDBEntry returned from pgstat_fetch_stat_dbentry().
+ */
+static PgStat_StatFuncEntry *
+pgstat_fetch_stat_funcentry_extended(PgStat_StatDBEntry *dbent, Oid funcid)
+{
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_param param =
+    {
+        .hash_name = "function stats snapshot hash",
+        .hash_entsize = sizeof(PgStat_StatFuncEntry),
+        .dsh_handle = DSM_HANDLE_INVALID,
+        .dsh_params = &dsh_funcparams,
+        .hash = NULL,
+        .dshash = NULL
+    };
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    if (dbent->functions == DSM_HANDLE_INVALID)
+        return NULL;
+
+    /* set target shared hash */
+    param.dsh_handle = dbent->functions;
+
+    /* tell snapshot_statentry what variables to use */
+    param.hash = &dbent->snapshot_functions;
+    param.dshash = &dbent->dshash_functions;
+
+    return (PgStat_StatFuncEntry *)
+        snapshot_statentry(&param, funcid);
+}
+
+/*
+ * pgstat_snapshot_global_stats() -
+ *
+ * Makes a snapshot of global stats if not done yet.  It is kept until a
+ * subsequent call of pgstat_clear_snapshot() or the end of the current
+ * memory context (typically TopTransactionContext).
+ */
+static void
+pgstat_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext;
+
+    pgstat_attach_shared_stats();
+
+    /* Nothing to do if already done */
+    if (global_snapshot_is_valid)
+        return;
+
+    oldcontext = MemoryContextSwitchTo(pgStatSnapshotContext);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+
+    memcpy(&snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+    LWLockRelease(StatsLock);
+
+    global_snapshot_is_valid = true;
+
+    MemoryContextSwitchTo(oldcontext);
+
+    return;
+}
 
 /* ----------
  * pgstat_fetch_stat_beentry() -
@@ -2603,9 +3089,10 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &archiverStats;
+    return &snapshot_archiverStats;
 }
 
 
@@ -2620,9 +3107,10 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &globalStats;
+    return &snapshot_globalStats;
 }
 
 
@@ -2836,8 +3324,8 @@ pgstat_initialize(void)
         MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
     }
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* needs to be called before DSM shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -2935,7 +3423,7 @@ pgstat_bestart(void)
                 lbeentry.st_backendType = B_STARTUP;
                 break;
             case ArchiverProcess:
-                beentry->st_backendType = B_ARCHIVER;
+                lbeentry.st_backendType = B_ARCHIVER;
                 break;
             case BgWriterProcess:
                 lbeentry.st_backendType = B_BG_WRITER;
@@ -3071,6 +3559,10 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+
+    /* attach shared database stats area */
+    pgstat_attach_shared_stats();
 }
 
 /*
@@ -3106,6 +3598,8 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    pgstat_detach_shared_stats(true);
 }
 
 
@@ -3366,7 +3860,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3661,9 +4156,6 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
-            break;
         case WAIT_EVENT_RECOVERY_WAL_ALL:
             event_name = "RecoveryWalAll";
             break;
@@ -4323,75 +4815,43 @@ pgstat_get_backend_desc(BackendType backendType)
  * ------------------------------------------------------------
  */
 
-
-/* ----------
- * pgstat_setheader() -
- *
- *        Set common header fields in a statistics message
- * ----------
- */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
-    {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
 /* ----------
  * pgstat_send_archiver() -
  *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
+ *        Report archiver statistics
  * ----------
  */
 void
 pgstat_send_archiver(const char *xlog, bool failed)
 {
-    PgStat_MsgArchiver msg;
+    TimestampTz now = GetCurrentTimestamp();
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+    if (failed)
+    {
+        /* Failed archival attempt */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->failed_count;
+        memcpy(shared_archiverStats->last_failed_wal, xlog,
+               sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    else
+    {
+        /* Successful archival operation */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->archived_count;
+        memcpy(shared_archiverStats->last_archived_wal, xlog,
+               sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
 }
 
 /* ----------
  * pgstat_send_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
@@ -4400,6 +4860,8 @@ pgstat_send_bgwriter(void)
     /* We assume this initializes to zeroes */
     static const PgStat_MsgBgWriter all_zeroes;
 
+    PgStat_MsgBgWriter *s = &BgWriterStats;
+
     /*
      * This function can be called even if nothing at all has happened. In
      * this case, avoid sending a completely empty message to the stats
@@ -4408,11 +4870,18 @@ pgstat_send_bgwriter(void)
     if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += s->m_timed_checkpoints;
+    shared_globalStats->requested_checkpoints += s->m_requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += s->m_checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += s->m_checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += s->m_buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += s->m_buf_written_clean;
+    shared_globalStats->maxwritten_clean += s->m_maxwritten_clean;
+    shared_globalStats->buf_written_backend += s->m_buf_written_backend;
+    shared_globalStats->buf_fsync_backend += s->m_buf_fsync_backend;
+    shared_globalStats->buf_alloc += s->m_buf_alloc;
+    LWLockRelease(StatsLock);
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4421,305 +4890,164 @@ pgstat_send_bgwriter(void)
 }
 
 
-/* ----------
- * PgstatCollectorMain() -
+/*
+ * Pin and Unpin dbentry.
  *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
+ * To keep memory usage low and for speed, counters are reset by recreating
+ * the dshash instead of removing entries one by one while holding a lock on
+ * the whole dshash. On the other hand, a dshash cannot be destroyed until all
+ * referrers have gone, so other backends could be kept waiting for the
+ * counter reset for a long time. We therefore isolate the hashes under
+ * destruction as another generation: no longer used, but not yet removable.
+ *
+ * When we start accessing hashes on a dbentry, call pin_hashes() and acquire
+ * the current "generation". unpin_hashes() removes the older generation's
+ * hashes once all referrers have gone.
  */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
+static int
+pin_hashes(PgStat_StatDBEntry *dbentry)
 {
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
+    int    generation;
 
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, pgstat_sighup_handler);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, pgstat_exit);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->refcnt++;
+    generation = dbentry->generation;
+    LWLockRelease(&dbentry->lock);
 
-    /*
-     * Identify myself via ps
-     */
-    init_ps_display("stats collector", "", "", "");
+    dshash_release_lock(pgStatDBHash, dbentry);
 
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
+    return generation;
+}
 
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do recognize got_SIGHUP inside
-     * the inner loop, which means that such interrupts will get serviced but
-     * the latch won't get cleared until next time there is a break in the
-     * action.
-     */
-    for (;;)
+/*
+ * Unpin hashes in dbentry. If the given generation is isolated, destroy it
+ * once all referrers have gone. Otherwise just decrease the reference count.
+ */
+static void
+unpin_hashes(PgStat_StatDBEntry *dbentry, int generation)
+{
+    dshash_table *tables;
+    dshash_table *funcs = NULL;
+
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+
+    /* using current generation, just decrease refcount */
+    if (dbentry->generation == generation)
     {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (need_exit)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * need_exit becomes set.
-         */
-        while (!need_exit)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (got_SIGHUP)
-            {
-                got_SIGHUP = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(
-                                                   &msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(
-                                                   &msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(
-                                                 &msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(
-                                                 &msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr & WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
+        dbentry->refcnt--;
+        LWLockRelease(&dbentry->lock);
+        return;
+    }
 
     /*
-     * Save the final stats to reuse at next startup.
+     * It is isolated, waiting for all referrers to end.
      */
-    pgstat_write_statsfiles(true, true);
+    Assert(dbentry->generation == generation + 1);
 
-    exit(0);
+    if (--dbentry->prev_refcnt > 0)
+    {
+        LWLockRelease(&dbentry->lock);
+        return;
+    }
+
+    /* no referrer remains, remove the hashes */
+    tables = dshash_attach(area, &dsh_tblparams, dbentry->prev_tables, 0);
+    if (dbentry->prev_functions != DSM_HANDLE_INVALID)
+        funcs = dshash_attach(area, &dsh_funcparams,
+                              dbentry->prev_functions, 0);
+
+    dbentry->prev_tables = DSM_HANDLE_INVALID;
+    dbentry->prev_functions = DSM_HANDLE_INVALID;
+
+    /* release the entry immediately */
+    LWLockRelease(&dbentry->lock);
+
+    dshash_destroy(tables);
+    if (funcs)
+        dshash_destroy(funcs);
+
+    return;
 }
 
-
-/* SIGQUIT signal handler for collector process */
-static void
-pgstat_exit(SIGNAL_ARGS)
+/*
+ * attach and return the specified generation of table hash
+ * Returns NULL on lock failure.
+ */
+static dshash_table *
+attach_table_hash(PgStat_StatDBEntry *dbent, int gen)
 {
-    int            save_errno = errno;
+    dshash_table *ret;
 
-    need_exit = true;
-    SetLatch(MyLatch);
+    LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
 
-    errno = save_errno;
+    if (dbent->generation == gen)
+        ret = dshash_attach(area, &dsh_tblparams, dbent->tables, 0);
+    else
+    {
+        Assert (dbent->generation == gen + 1);
+        Assert (dbent->prev_tables != DSM_HANDLE_INVALID);
+        ret = dshash_attach(area, &dsh_tblparams, dbent->prev_tables, 0);
+    }
+    LWLockRelease(&dbent->lock);
+
+    return ret;
 }
 
-/* SIGHUP handler for collector process */
-static void
-pgstat_sighup_handler(SIGNAL_ARGS)
+/* attach and return the specified generation of function hash */
+static dshash_table *
+attach_function_hash(PgStat_StatDBEntry *dbent, int gen)
 {
-    int            save_errno = errno;
+    dshash_table *ret = NULL;
 
-    got_SIGHUP = true;
-    SetLatch(MyLatch);
 
-    errno = save_errno;
+    LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
+
+    if (dbent->generation == gen)
+    {
+        if (dbent->functions == DSM_HANDLE_INVALID)
+        {
+            dshash_table *funchash =
+                dshash_create(area, &dsh_funcparams, 0);
+            dbent->functions = dshash_get_hash_table_handle(funchash);
+
+            ret = funchash;
+        }
+        else
+            ret = dshash_attach(area, &dsh_funcparams, dbent->functions, 0);
+    }
+    /* don't bother creating useless hash */
+
+    LWLockRelease(&dbent->lock);
+
+    return ret;
+}
+
+static void
+init_dbentry(PgStat_StatDBEntry *dbentry)
+{
+    LWLockInitialize(&dbentry->lock, LWTRANCHE_STATS);
+    dbentry->generation = 0;
+    dbentry->refcnt = 0;
+    dbentry->prev_refcnt = 0;
+    dbentry->tables = DSM_HANDLE_INVALID;
+    dbentry->prev_tables = DSM_HANDLE_INVALID;
+    dbentry->functions = DSM_HANDLE_INVALID;
+    dbentry->prev_functions = DSM_HANDLE_INVALID;
 }
 
 /*
  * Subroutine to clear stats in a database entry
  *
- * Tables and functions hashes are initialized to empty.
+ * Reset all counters in the dbentry. Tables and functions dshashes are
+ * destroyed.  If any backend is pinning this dbentry, the current dshashes
+ * are stashed into the previous "generation" until all accessors are gone.
+ * If the previous generation is already occupied, the current dshashes are
+ * so fresh that they don't need to be cleared.
  */
 static void
 reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
 {
-    HASHCTL        hash_ctl;
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
 
     dbentry->n_xact_commit = 0;
     dbentry->n_xact_rollback = 0;
@@ -4744,72 +5072,865 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
     dbentry->n_block_read_time = 0;
     dbentry->n_block_write_time = 0;
 
+    if (dbentry->refcnt == 0)
+    {
+        /*
+         * No one is referring to the current hash. It's very costly to remove
+         * entries in dshash individually, so just destroy the whole hash.  If
+         * someone pins this entry just afterwards, pin_hashes() returns the
+         * current generation and the attach happens after the LWLock below
+         * is released.
+         */
+        dshash_table *tbl;
+
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+            dbentry->tables = DSM_HANDLE_INVALID;
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+            dbentry->functions = DSM_HANDLE_INVALID;
+        }
+    }
+    else if (dbentry->prev_refcnt == 0)
+    {
+        /*
+         * Someone is still referring to the current hash and previous slot is
+         * vacant. Stash out the current hash to the previous slot.
+         */
+        dbentry->prev_refcnt = dbentry->refcnt;
+        dbentry->prev_tables = dbentry->tables;
+        dbentry->prev_functions = dbentry->functions;
+        dbentry->refcnt = 0;
+        dbentry->tables = DSM_HANDLE_INVALID;
+        dbentry->functions = DSM_HANDLE_INVALID;
+        dbentry->generation++;
+    }
+    else
+    {
+        Assert(dbentry->prev_refcnt > 0 && dbentry->refcnt > 0);
+        /*
+         * If we get here, we have just received another reset request while
+         * the old hashes are still waiting for all referrers to release them.
+         * That wait should be quite short, so we can just ignore this request.
+         *
+         * As a side effect, the resetter can see non-zero values before
+         * anyone updates them, but that is indistinguishable from someone
+         * having updated them before the read.
+         */
+    }
+
+    /* Create a new table hash if none exists */
+    if (dbentry->tables == DSM_HANDLE_INVALID)
+    {
+        dshash_table *tbl = dshash_create(area, &dsh_tblparams, 0);
+        dbentry->tables = dshash_get_hash_table_handle(tbl);
+        dshash_detach(tbl);
+    }
+
+    /* Create a new function hash if none exists and one is needed. */
+    if (dbentry->functions == DSM_HANDLE_INVALID &&
+        pgstat_track_functions != TRACK_FUNC_OFF)
+    {
+        dshash_table *tbl = dshash_create(area, &dsh_funcparams, 0);
+        dbentry->functions = dshash_get_hash_table_handle(tbl);
+        dshash_detach(tbl);
+    }
+
     dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
 
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
+    LWLockRelease(&dbentry->lock);
 }
 
 /*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+ * Create the filename for a DB stat file; filename is an output parameter
+ * that points to a character buffer of length len.
  */
-static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
+static void
+get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
 {
-    PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
+    int            printed;
 
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/db_%u.%s",
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
+                       databaseid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
+}
 
-    if (!create && !found)
-        return NULL;
+/* ----------
+ * pgstat_write_statsfiles() -
+ *        Write the global statistics file, as well as DB files.
+ * ----------
+ */
+void
+pgstat_write_statsfiles(void)
+{
+    dshash_seq_status hstat;
+    PgStat_StatDBEntry *dbentry;
+    FILE       *fpout;
+    int32        format_id;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    int            rc;
+
+    /* Stats are not initialized yet; just return. */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
+
+    elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
     /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
+     * Open the statistics temp file to write out the current values.
      */
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        return;
+    }
+
+    /*
+     * Set the timestamp of the stats file.
+     */
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
+
+    /*
+     * Write the file header --- currently just a format ID.
+     */
+    format_id = PGSTAT_FILE_FORMAT_ID;
+    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write global stats struct
+     */
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write archiver stats struct
+     */
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Walk through the database table.
+     */
+    dshash_seq_init(&hstat, pgStatDBHash, false, false);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&hstat)) != NULL)
+    {
+        /*
+         * Write out the table and function stats for this DB into the
+         * appropriate per-DB stat file, if required.
+         */
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
+
+        pgstat_write_pgStatDBHashfile(dbentry);
+
+        /*
+         * Write out the DB entry. We don't write the tables or functions
+         * pointers, since they're of no use to any other process.
+         */
+        fputc('D', fpout);
+        rc = fwrite(dbentry,
+                    offsetof(PgStat_StatDBEntry, generation), 1, fpout);
+        (void) rc;                /* we'll check for error with ferror */
+    }
+
+    /*
+     * No more output to be done. Close the temp file and replace the old
+     * pgstat.stat with it.  The ferror() check replaces testing for error
+     * after each individual fputc or fwrite above.
+     */
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+}
+
+/* ----------
+ * pgstat_write_pgStatDBHashfile() -
+ *        Write the stat file for a single database.
+ * ----------
+ */
+static void
+pgstat_write_pgStatDBHashfile(PgStat_StatDBEntry *dbentry)
+{
+    dshash_seq_status tstat;
+    dshash_seq_status fstat;
+    PgStat_StatTabEntry *tabentry;
+    PgStat_StatFuncEntry *funcentry;
+    FILE       *fpout;
+    int32        format_id;
+    Oid            dbid = dbentry->databaseid;
+    int            rc;
+    char        tmpfile[MAXPGPATH];
+    char        statfile[MAXPGPATH];
+    dshash_table *tbl;
+
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
+
+    elog(DEBUG2, "writing stats file \"%s\"", statfile);
+
+    /*
+     * Open the statistics temp file to write out the current values.
+     */
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        return;
+    }
+
+    /*
+     * Write the file header --- currently just a format ID.
+     */
+    format_id = PGSTAT_FILE_FORMAT_ID;
+    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Walk through the database's access stats per table.
+     */
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&tstat, tbl, false, false);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&tstat)) != NULL)
+    {
+        fputc('T', fpout);
+        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
+        (void) rc;                /* we'll check for error with ferror */
+    }
+    dshash_detach(tbl);
+
+    /*
+     * Walk through the database's function stats table.
+     */
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_seq_init(&fstat, tbl, false, false);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&fstat)) != NULL)
+        {
+            fputc('F', fpout);
+            rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+            (void) rc;                /* we'll check for error with ferror */
+        }
+        dshash_detach(tbl);
+    }
+
+    /*
+     * No more output to be done. Close the temp file and replace the old
+     * per-database stats file with it.  The ferror() check replaces testing
+     * for error after each individual fputc or fwrite above.
+     */
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+}
+
+/* ----------
+ * pgstat_read_statsfiles() -
+ *
+ *    Reads the existing statistics collector files into the shared stats hash.
+ *
+ * ----------
+ */
+void
+pgstat_read_statsfiles(void)
+{
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatDBEntry dbbuf;
+    FILE       *fpin;
+    int32        format_id;
+    bool        found;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+
+    /* shouldn't be called from postmaster */
+    Assert(IsUnderPostmaster);
+
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
+
+    /*
+     * Set the current timestamp (will be kept only in case we can't load an
+     * existing statsfile).
+     */
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp =
+        shared_globalStats->stat_reset_timestamp;
+
+    /*
+     * Try to open the stats file. If it doesn't exist, the backends simply
+     * return zero for anything and the collector simply starts from scratch
+     * with empty counters.
+     *
+     * ENOENT is a possibility if the stats collector is not running or has
+     * not yet written the stats file the first time.  Any other failure
+     * condition is suspicious.
+     */
+    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(LOG,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            statfile)));
+        return;
+    }
+
+    /*
+     * Verify it's of the expected format.
+     */
+    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
+        format_id != PGSTAT_FILE_FORMAT_ID)
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        goto done;
+    }
+
+    /*
+     * Read global stats struct
+     */
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+        sizeof(*shared_globalStats))
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+        goto done;
+    }
+
+    /*
+     * Read archiver stats struct
+     */
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+        sizeof(*shared_archiverStats))
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        goto done;
+    }
+
+    /*
+     * We found an existing collector stats file. Read it and put all the
+     * hashtable entries into place.
+     */
+    for (;;)
+    {
+        switch (fgetc(fpin))
+        {
+                /*
+                 * 'D'    A PgStat_StatDBEntry struct describing a database
+                 * follows.
+                 */
+            case 'D':
+                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, generation),
+                          fpin) != offsetof(PgStat_StatDBEntry, generation))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                /*
+                 * Add to the DB hash
+                 */
+                dbentry = (PgStat_StatDBEntry *)
+                    dshash_find_or_insert(pgStatDBHash, (void *) &dbbuf.databaseid,
+                                          &found);
+
+                /* don't allow duplicate dbentries */
+                if (found)
+                {
+                    dshash_release_lock(pgStatDBHash, dbentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                /* initialize the new shared entry */
+                init_dbentry(dbentry);
+
+                memcpy(dbentry, &dbbuf,
+                       offsetof(PgStat_StatDBEntry, generation));
+
+                /* Read the data from the database-specific file. */
+                pgstat_read_pgStatDBHashfile(dbentry);
+                dshash_release_lock(pgStatDBHash, dbentry);
+                break;
+
+            case 'E':
+                goto done;
+
+            default:
+                ereport(LOG,
+                        (errmsg("corrupted statistics file \"%s\"",
+                                statfile)));
+                goto done;
+        }
+    }
+
+done:
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+
+    return;
+}
+
+
+/* ----------
+ * pgstat_read_pgStatDBHashfile() -
+ *
+ *    Reads in the at-rest statistics file and creates shared statistics
+ *    tables. The file is removed after reading.
+ * ----------
+ */
+static void
+pgstat_read_pgStatDBHashfile(PgStat_StatDBEntry *dbentry)
+{
+    PgStat_StatTabEntry *tabentry;
+    PgStat_StatTabEntry tabbuf;
+    PgStat_StatFuncEntry funcbuf;
+    PgStat_StatFuncEntry *funcentry;
+    dshash_table         *tabhash = NULL;
+    dshash_table         *funchash = NULL;
+    FILE       *fpin;
+    int32        format_id;
+    bool        found;
+    char        statfile[MAXPGPATH];
+
+    get_dbstat_filename(false, dbentry->databaseid, statfile, MAXPGPATH);
+
+    /*
+     * Try to open the stats file. If it doesn't exist, the backends simply
+     * return zero for anything and the collector simply starts from scratch
+     * with empty counters.
+     *
+     * ENOENT is a possibility if the stats collector is not running or has
+     * not yet written the stats file the first time.  Any other failure
+     * condition is suspicious.
+     */
+    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(LOG,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            statfile)));
+        return;
+    }
+
+    /*
+     * Verify it's of the expected format.
+     */
+    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
+        format_id != PGSTAT_FILE_FORMAT_ID)
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        goto done;
+    }
+
+    /*
+     * We found an existing statistics file. Read it and put all the hashtable
+     * entries into place.
+     */
+    for (;;)
+    {
+        switch (fgetc(fpin))
+        {
+                /*
+                 * 'T'    A PgStat_StatTabEntry follows.
+                 */
+            case 'T':
+                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
+                          fpin) != sizeof(PgStat_StatTabEntry))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                if (tabhash == NULL)
+                {
+                    tabhash = dshash_create(area, &dsh_tblparams, 0);
+                    dbentry->tables =
+                        dshash_get_hash_table_handle(tabhash);
+                }
+
+                tabentry = (PgStat_StatTabEntry *)
+                    dshash_find_or_insert(tabhash,
+                                          (void *) &tabbuf.tableid, &found);
+
+                /* don't allow duplicate entries */
+                if (found)
+                {
+                    dshash_release_lock(tabhash, tabentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
+                dshash_release_lock(tabhash, tabentry);
+                break;
+
+                /*
+                 * 'F'    A PgStat_StatFuncEntry follows.
+                 */
+            case 'F':
+                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
+                          fpin) != sizeof(PgStat_StatFuncEntry))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                if (funchash == NULL)
+                {
+                    funchash = dshash_create(area, &dsh_funcparams, 0);
+                    dbentry->functions =
+                        dshash_get_hash_table_handle(funchash);
+                }
+
+                funcentry = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert(funchash,
+                                          (void *) &funcbuf.functionid, &found);
+
+                if (found)
+                {
+                    dshash_release_lock(funchash, funcentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
+                dshash_release_lock(funchash, funcentry);
+                break;
+
+                /*
+                 * 'E'    The EOF marker of a complete stats file.
+                 */
+            case 'E':
+                goto done;
+
+            default:
+                ereport(LOG,
+                        (errmsg("corrupted statistics file \"%s\"",
+                                statfile)));
+                goto done;
+        }
+    }
+
+done:
+    if (tabhash)
+        dshash_detach(tabhash);
+    if (funchash)
+        dshash_detach(funchash);
+
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+}
+
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ *    Create pgStatLocalContext and pgStatSnapshotContext, if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+    if (!pgStatLocalContext)
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+
+    if (!pgStatSnapshotContext)
+        pgStatSnapshotContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Database statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+}
+
+/* ----------
+ * pgstat_clear_snapshot() -
+ *
+ *    Discard any data collected in the current transaction.  Any subsequent
+ *    request will cause new snapshots to be read.
+ *
+ *    This is also invoked during transaction commit or abort to discard
+ *    the no-longer-wanted snapshot.
+ * ----------
+ */
+void
+pgstat_clear_snapshot(void)
+{
+    /* Release memory, if any was allocated */
+    if (pgStatLocalContext)
+    {
+        MemoryContextDelete(pgStatLocalContext);
+
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
+    }
+
+    if (pgStatSnapshotContext)
+        clear_snapshot = true;
+}
+
+static bool
+pgstat_update_tabentry(dshash_table *tabhash, PgStat_TableStatus *stat,
+                       bool nowait)
+{
+    PgStat_StatTabEntry *tabentry;
+    bool    found;
+
+    if (tabhash == NULL)
+        return false;
+
+    tabentry = (PgStat_StatTabEntry *)
+        dshash_find_or_insert_extended(tabhash, (void *) &(stat->t_id),
+                                       &found, nowait);
+
+    /* failed to acquire lock */
+    if (tabentry == NULL)
+        return false;
+
     if (!found)
-        reset_dbentry_counters(result);
+    {
+        /*
+         * If it's a new table entry, initialize counters to the values we
+         * just got.
+         */
+        tabentry->numscans = stat->t_counts.t_numscans;
+        tabentry->tuples_returned = stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched = stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted = stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated = stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted = stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated = stat->t_counts.t_tuples_hot_updated;
+        tabentry->n_live_tuples = stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples = stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze = stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched = stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit = stat->t_counts.t_blocks_hit;
+
+        tabentry->vacuum_timestamp = 0;
+        tabentry->vacuum_count = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_count = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->analyze_count = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+        tabentry->autovac_analyze_count = 0;
+    }
+    else
+    {
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        tabentry->numscans += stat->t_counts.t_numscans;
+        tabentry->tuples_returned += stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched += stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted += stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated += stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted += stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated += stat->t_counts.t_tuples_hot_updated;
+        /* If table was truncated, first reset the live/dead counters */
+        if (stat->t_counts.t_truncated)
+        {
+            tabentry->n_live_tuples = 0;
+            tabentry->n_dead_tuples = 0;
+        }
+        tabentry->n_live_tuples += stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples += stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze += stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched += stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit += stat->t_counts.t_blocks_hit;
+    }
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
+
+    dshash_release_lock(tabhash, tabentry);
+
+    return true;
+}
+
+static void
+pgstat_update_dbentry(PgStat_StatDBEntry *dbentry, PgStat_TableStatus *stat)
+{
+    /*
+     * Add per-table stats to the per-database entry, too.
+     */
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
+    dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
+    dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
+    dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
+    dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
+    dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
+    dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
+    LWLockRelease(&dbentry->lock);
+}
+
+/*
+ * Look up the shared stats hash table entry for the specified database.
+ * Returns NULL when PGSTAT_NOWAIT is set and the required lock cannot be
+ * acquired.
+ */
+static PgStat_StatDBEntry *
+pgstat_get_db_entry(Oid databaseid, int op, PgStat_TableLookupResult *status)
+{
+    PgStat_StatDBEntry *result;
+    bool        nowait = ((op & PGSTAT_NOWAIT) != 0);
+    bool        lock_acquired = true;
+    bool        found = true;
+
+    if (!IsUnderPostmaster || !pgStatDBHash)
+        return NULL;
+
+    /* Lookup or create the hash table entry for this database */
+    if (op & PGSTAT_EXCLUSIVE)
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_or_insert_extended(pgStatDBHash, &databaseid,
+                                           &found, nowait);
+        if (result == NULL)
+            lock_acquired = false;
+        else if (!found)
+        {
+            /*
+             * If not found, initialize the new one.  This creates empty hash
+             * tables, too.
+             */
+            init_dbentry(result);
+            reset_dbentry_counters(result);
+        }
+    }
+    else
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_extended(pgStatDBHash, &databaseid, true, nowait,
+                                 nowait ? &lock_acquired : NULL);
+        if (result == NULL)
+            found = false;
+    }
+
+    /* Set return status if requested */
+    if (status)
+    {
+        if (!lock_acquired)
+        {
+            Assert(nowait);
+            *status = LOCK_FAILED;
+        }
+        else if (!found)
+            *status = NOT_FOUND;
+        else
+            *status = FOUND;
+    }
 
     return result;
 }
 
-
 /*
  * Lookup the hash table entry for the specified table. If no hash
  * table entry exists, initialize it, if the create parameter is true.
  * Else, return NULL.
  */
 static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create)
 {
     PgStat_StatTabEntry *result;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
 
     /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
+    if (create)
+        result = (PgStat_StatTabEntry *)
+            dshash_find_or_insert(table, &tableoid, &found);
+    else
+        result = (PgStat_StatTabEntry *) dshash_find(table, &tableoid, false);
 
     if (!create && !found)
         return NULL;
@@ -4842,1702 +5963,6 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
     return result;
 }
 
-
-/* ----------
- * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
- * ----------
- */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
-{
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    FILE       *fpout;
-    int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-    int            rc;
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Set the timestamp of the stats file.
-     */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Write global stats struct
-     */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Write archiver stats struct
-     */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database table.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        /*
-         * Write out the table and function stats for this DB into the
-         * appropriate per-DB stat file, if required.
-         */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
-
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
-
-        /*
-         * Write out the DB entry. We don't write the tables or functions
-         * pointers, since they're of no use to any other process.
-         */
-        fputc('D', fpout);
-        rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
-}
-
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
-
-/* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
- * ----------
- */
-static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
-{
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpout;
-    int32        format_id;
-    Oid            dbid = dbentry->databaseid;
-    int            rc;
-    char        tmpfile[MAXPGPATH];
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database's access stats per table.
-     */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
-    {
-        fputc('T', fpout);
-        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * Walk through the database's function stats table.
-     */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_statsfiles() -
- *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
- *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
- * ----------
- */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
-
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-
-    /*
-     * Set the current timestamp (will be kept only in case we can't load an
-     * existing statsfile).
-     */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return dbhash;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
-        goto done;
-    }
-
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Add to the DB hash
-                 */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-
-    return dbhash;
-}
-
-
-/* ----------
- * pgstat_read_db_statsfile() -
- *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
- * ----------
- */
-static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
-{
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatTabEntry tabbuf;
-    PgStat_StatFuncEntry funcbuf;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'T'    A PgStat_StatTabEntry follows.
-                 */
-            case 'T':
-                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
-                          fpin) != sizeof(PgStat_StatTabEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
-                break;
-
-                /*
-                 * 'F'    A PgStat_StatFuncEntry follows.
-                 */
-            case 'F':
-                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
-                          fpin) != sizeof(PgStat_StatFuncEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
-                break;
-
-                /*
-                 * 'E'    The EOF marker of a complete stats file.
-                 */
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
-}
-
-
-/* ----------
- * pgstat_setup_memcxt() -
- *
- *    Create pgStatLocalContext, if not already done.
- * ----------
- */
-static void
-pgstat_setup_memcxt(void)
-{
-    if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
-}
-
-
-/* ----------
- * pgstat_clear_snapshot() -
- *
- *    Discard any data collected in the current transaction.  Any subsequent
- *    request will cause new snapshots to be read.
- *
- *    This is also invoked during transaction commit or abort to discard
- *    the no-longer-wanted snapshot.
- * ----------
- */
-void
-pgstat_clear_snapshot(void)
-{
-    /* Release memory, if any was allocated */
-    if (pgStatLocalContext)
-        MemoryContextDelete(pgStatLocalContext);
-
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetshared() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
-}
-
 /*
  * Convert a potentially unsafely truncated activity string (see
  * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 0ca0b3024b..ef35bef218 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -255,7 +255,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -503,7 +502,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1249,66 +1247,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Forcibly remove the files signaling a standby promotion request.
-     * Otherwise, the existence of those files triggers a promotion too early,
-     * whether a user wants that or not.
-     *
-     * This removal of files is usually unnecessary because they can exist
-     * only during a few moments during a standby promotion. However there is
-     * a race condition: if pg_ctl promote is executed and creates the files
-     * during a promotion, the files can stay around even after the server is
-     * brought up to new master. Then, if new standby starts by using the
-     * backup taken from that master, the files can exist at the server
-     * startup and should be removed in order to avoid an unexpected
-     * promotion.
-     *
-     * Note that promotion signal files need to be removed before the startup
-     * process is invoked. Because, after that, they can be used by
-     * postmaster's SIGUSR1 signal handler.
-     */
-    RemovePromoteSignalFiles();
-
-    /* Do the same for logrotate signal file */
-    RemoveLogrotateSignalFiles();
-
-    /* Remove any outdated file holding the current log filenames. */
-    if (unlink(LOG_METAINFO_DATAFILE) < 0 && errno != ENOENT)
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not remove file \"%s\": %m",
-                        LOG_METAINFO_DATAFILE)));
-
-    /*
-     * If enabled, start up syslogger collection subprocess
-     */
-    SysLoggerPID = SysLogger_Start();
-
-    /*
-     * Reset whereToSendOutput from DestDebug (its starting state) to
-     * DestNone. This stops ereport from sending log messages to stderr unless
-     * Log_destination permits.  We don't do this until the postmaster is
-     * fully launched, since startup failures may as well be reported to
-     * stderr.
-     *
-     * If we are in fact disabling logging to stderr, first emit a log message
-     * saying so, to provide a breadcrumb trail for users who may not remember
-     * that their logging is configured to go somewhere else.
-     */
-    if (!(Log_destination & LOG_DESTINATION_STDERR))
-        ereport(LOG,
-                (errmsg("ending log output to stderr"),
-                 errhint("Future log output will go to log destination \"%s\".",
-                         Log_destination_string)));
-
-    whereToSendOutput = DestNone;
-
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1757,11 +1695,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2647,8 +2580,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -2980,8 +2911,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3048,13 +2977,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3129,22 +3051,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3603,22 +3509,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3814,8 +3704,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3850,8 +3738,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4052,8 +3939,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5026,18 +4911,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5150,12 +5023,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6044,7 +5911,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6100,8 +5966,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6336,7 +6200,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index d7d733530f..c8c1ed34d0 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -147,6 +147,7 @@ CreateSharedMemoryAndSemaphores(int port)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -263,6 +264,7 @@ CreateSharedMemoryAndSemaphores(int port)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index bc1aa88322..b9c33d6044 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -522,6 +522,7 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
     LWLockRegisterTranche(LWTRANCHE_SXACT, "serializable_xact");
+    LWLockRegisterTranche(LWTRANCHE_STATS, "activity stats");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index ffd84d877c..cc4dab9182 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3172,6 +3172,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3746,6 +3752,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4186,9 +4193,17 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
-                ProcessCompletedNotifies();
-                pgstat_report_stat(false);
+                long stats_timeout;
 
+                ProcessCompletedNotifies();
+
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4223,7 +4238,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4231,6 +4246,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 3bf96de256..9c694f20c9 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index e9f72b5069..731ef0e27c 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -74,6 +74,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -633,6 +634,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1243,6 +1246,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b7d36b65dd..13be46c172 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -6,7 +6,7 @@ use File::Basename qw(basename dirname);
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 106;
+use Test::More tests => 105;
 
 program_help_ok('pg_basebackup');
 program_version_ok('pg_basebackup');
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 0b49b63327..00d939991d 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index b3f00e1943..c0bbf8a7d5 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL statistics collector facility.
  *
  *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
  *
@@ -14,10 +14,11 @@
 #include "datatype/timestamp.h"
 #include "fmgr.h"
 #include "libpq/pqcomm.h"
-#include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -41,33 +42,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -78,9 +52,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -116,13 +89,6 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_blocks_hit;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -181,236 +147,12 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
 /* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
- */
-typedef struct PgStat_MsgHdr
-{
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
+ * PgStat_MsgBgWriter            bgwriter statistics
  * ----------
  */
 typedef struct PgStat_MsgBgWriter
 {
-    PgStat_MsgHdr m_hdr;
-
     PgStat_Counter m_timed_checkpoints;
     PgStat_Counter m_requested_checkpoints;
     PgStat_Counter m_buf_written_checkpoints;
@@ -423,38 +165,14 @@ typedef struct PgStat_MsgBgWriter
     PgStat_Counter m_checkpoint_sync_time;
 } PgStat_MsgBgWriter;
 
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
-
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -486,96 +204,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Statistic collector data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -615,16 +245,29 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_block_write_time;
 
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;        /* time of db stats update */
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following members must be last in the struct, because we don't write
+     * them out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    int generation;                        /* current generation of the below */
+    int    refcnt;                            /* current gen reference count */
+    dshash_table_handle tables;            /* current gen tables hash */
+    dshash_table_handle functions;        /* current gen functions hash */
+    int    prev_refcnt;                    /* prev gen reference count */
+    dshash_table_handle prev_tables;    /* prev gen tables hash */
+    dshash_table_handle prev_functions;    /* prev gen functions hash */
+    LWLock    lock;                        /* Lock for the above members */
+
+    /* non-shared members */
+    HTAB *snapshot_tables;                 /* table entry snapshot */
+    HTAB *snapshot_functions;             /* function entry snapshot */
+    dshash_table    *dshash_tables;         /* attached tables dshash */
+    dshash_table    *dshash_functions;     /* attached functions dshash */
 } PgStat_StatDBEntry;
 
+#define SHARED_DBENT_SIZE offsetof(PgStat_StatDBEntry, snapshot_tables)
 
 /* ----------
  * PgStat_StatTabEntry            The collector's data per table (or index)
@@ -663,7 +306,7 @@ typedef struct PgStat_StatTabEntry
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            per-function stats data
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
@@ -678,7 +321,7 @@ typedef struct PgStat_StatFuncEntry
 
 
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -694,7 +337,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -780,7 +423,6 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
     WAIT_EVENT_RECOVERY_WAL_ALL,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
@@ -1215,6 +857,8 @@ extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used, but will be removed with GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
@@ -1236,29 +880,26 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
 
+/* File input/output functions  */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 
 extern void pgstat_report_autovac(Oid dboid);
@@ -1429,11 +1070,13 @@ extern void pgstat_send_bgwriter(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_extended(PgStat_StatDBEntry *dbent, Oid relid);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 08e0dc8144..30d5fb63c5 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -220,6 +220,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_SXACT,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 9244a2a7b7..a9b625211b 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.16.3
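
Note on the idle stats update timeout in the patch above: PostgresMain now
keeps the value returned by pgstat_report_stat() and, when it is positive,
arms IDLE_STATS_UPDATE_TIMEOUT, then disarms it once the next command has
been read. The following is only a minimal sketch of that arm/disarm
pattern. IdleTimer, report_stat_sketch(), on_enter_idle() and
on_command_read() are hypothetical stand-ins, not PostgreSQL functions, and
the meaning assumed for the return value (0 when nothing is left to flush,
otherwise a retry delay in milliseconds) is an assumption drawn from how the
value is passed to enable_timeout_after().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a registered timeout; not the timeout.c API. */
typedef struct IdleTimer
{
    bool        armed;          /* is the timer currently scheduled? */
    long        delay_ms;       /* delay it was armed with */
} IdleTimer;

/* Plays the role of disable_idle_stats_update_timeout in PostgresMain. */
static bool need_disable = false;

/*
 * Stand-in for pgstat_report_stat(false).  Assumption: returns 0 when all
 * pending entries were flushed, otherwise the delay (in ms) before a retry.
 */
static long
report_stat_sketch(long pending_entries, long retry_ms)
{
    return (pending_entries > 0) ? retry_ms : 0;
}

/* Called when the backend is about to wait for the next command. */
static void
on_enter_idle(IdleTimer *t, long pending_entries, long retry_ms)
{
    long        timeout;

    assert(t != NULL);
    timeout = report_stat_sketch(pending_entries, retry_ms);
    if (timeout > 0)
    {
        /* Something is still unflushed: schedule another attempt. */
        t->armed = true;
        t->delay_ms = timeout;
        need_disable = true;
    }
}

/* Called once the next command has been read. */
static void
on_command_read(IdleTimer *t)
{
    assert(t != NULL);
    if (need_disable)
    {
        /* The backend is busy again, so the idle timer is no longer needed. */
        t->armed = false;
        need_disable = false;
    }
}
```

Tracking the armed state in a separate flag mirrors the existing handling of
the idle-in-transaction timeout, so the disarm step is skipped when no timer
was set.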

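Note on SHARED_DBENT_SIZE in the patch above: PgStat_StatDBEntry places its
process-local members (the snapshot and attached-dshash pointers) after all
shared members, and the macro takes offsetof() of the first local member.
How the patch uses the macro is not shown in this part of the diff; the
sketch below assumes it serves to size or copy only the shared prefix of the
struct. DbEntrySketch, SHARED_PREFIX_SIZE and copy_shared_part() are
hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, much reduced analogue of PgStat_StatDBEntry. */
typedef struct DbEntrySketch
{
    /* shared members: meaningful to every process */
    unsigned int databaseid;
    long        n_xact_commit;
    int         generation;

    /* non-shared members: per-process pointers, must come last */
    void       *snapshot_tables;
    void       *snapshot_functions;
} DbEntrySketch;

/* Number of bytes that precede the first process-local member. */
#define SHARED_PREFIX_SIZE offsetof(DbEntrySketch, snapshot_tables)

/* Copy only the shared prefix; the local pointers in dst are left alone. */
static void
copy_shared_part(DbEntrySketch *dst, const DbEntrySketch *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, SHARED_PREFIX_SIZE);
}
```

The layout only works while every process-local member stays behind the
member named in offsetof(), which is why the struct comment insists on the
ordering.
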
From dbe5c795e6c75f4a9ee8900c98c6c3108099ce4e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 27 Nov 2018 14:42:12 +0900
Subject: [PATCH 5/5] Remove the GUC stats_temp_directory

This GUC used to specify the directory in which temporary statistics
files are stored. The stats collector no longer needs it, but programs
in bin and contrib, and possibly other extensions, still use it. Thus
this patch removes the GUC but leaves some backing variables and macro
definitions in place for backward compatibility.
---
 doc/src/sgml/backup.sgml                      |  2 --
 doc/src/sgml/config.sgml                      | 19 -------------
 doc/src/sgml/monitoring.sgml                  |  7 +----
 doc/src/sgml/storage.sgml                     |  3 +-
 src/backend/postmaster/pgstat.c               | 13 ++++-----
 src/backend/replication/basebackup.c          | 13 ++-------
 src/backend/utils/misc/guc.c                  | 41 ---------------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/pgstat.h                          |  5 +++-
 src/test/perl/PostgresNode.pm                 |  4 ---
 10 files changed, 14 insertions(+), 94 deletions(-)
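
Note on the basebackup.c change in this patch: with the GUC gone, the
directory to skip is always the compile-time constant PG_STAT_TMP_DIR, so
the relative path reduces to "./" plus that name, guarded by an assertion
that the name contains no slash. The sketch below shows the same
computation in plain C. build_stat_relpath() and STAT_TMP_DIR_SKETCH are
hypothetical names, and snprintf() stands in for PostgreSQL's psprintf().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for PG_STAT_TMP_DIR; same value as in pgstat.h. */
#define STAT_TMP_DIR_SKETCH "pg_stat_tmp"

/*
 * Write "./<dir>" into buf.  Returns 0 on success, or -1 if the result did
 * not fit.  As in the patch, dir is expected to be a bare directory name.
 */
static int
build_stat_relpath(char *buf, size_t buflen, const char *dir)
{
    int         n;

    /* A bare name needs no stripping of a data directory prefix. */
    assert(strchr(dir, '/') == NULL);

    n = snprintf(buf, buflen, "./%s", dir);
    return (n < 0 || (size_t) n >= buflen) ? -1 : 0;
}
```

Compared with the removed code, no absolute-path or "./" prefix handling is
needed, because the input can no longer come from user configuration.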

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 9d4c000df0..bbae70221e 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1145,8 +1145,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c91e3e1550..b85f6f421a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6827,25 +6827,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 990995c17b..2a2adaa0f7 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -195,12 +195,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
+   <productname>PostgreSQL</productname> processes through shared memory.
    When the server shuts down cleanly, a permanent copy of the statistics
    data is stored in the <filename>pg_stat</filename> subdirectory, so that
    statistics can be retained across server restarts.  When recovery is
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 1047c77a63..b47f8d084e 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -122,8 +122,7 @@ Item
 
 <row>
  <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
+ <entry>Subdirectory containing ephemeral files for extensions</entry>
 </row>
 
 <row>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 514ea78a68..64e1374288 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -108,15 +108,12 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
+/*
+ * This used to be a GUC variable and is no longer used in this file. It is
+ * kept, with its default value, only for backward compatibility with
+ * extensions.
  */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+char       *pgstat_stat_directory = PG_STAT_TMP_DIR;
 
 #define        StatsLock (&StatsShmem->StatsMainLock)
 
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 57f17e1418..c854d7e193 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -230,11 +230,8 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -265,13 +262,9 @@ perform_base_backup(basebackup_options *opt)
          * Calculate the relative path of temporary statistics directory in
          * order to skip the files which are located in that directory later.
          */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
+
+        Assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
+        statrelpath = psprintf("./%s", PG_STAT_TMP_DIR);
 
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index fc463601ff..05af81ac47 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -194,7 +194,6 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static void assign_effective_io_concurrency(int newval, void *extra);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -4085,17 +4084,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11368,35 +11356,6 @@ assign_effective_io_concurrency(int newval, void *extra)
 #endif                            /* USE_PREFETCH */
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index cfad86c02a..29b7ebca46 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -562,7 +562,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c0bbf8a7d5..2c7c799ca9 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -31,7 +31,10 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
+/*
+ * This used to be the directory to store temporary statistics data in but is
+ * no longer used. Defined here for backward compatibility.
+ */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
 
 /* Values for track_functions GUC variable --- order is significant! */
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 6019f37f91..a107683c0d 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -451,10 +451,6 @@ sub init
     print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
       if defined $ENV{TEMP_CONFIG};
 
-    # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
-    # concurrently must not share a stats_temp_directory.
-    print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
     if ($params{allows_streaming})
     {
         if ($params{allows_streaming} eq "logical")
-- 
2.16.3

#! /usr/bin/perl

use strict;
use IPC::Open2;
use Time::HiRes  qw( usleep ualarm gettimeofday tv_interval );
use Errno qw/ECHILD/;

my $totaldbs = 100;
my $totaltbls = 100;
my $ndbs = 50;
my $ntbls = 100;

my $loops = 20;
my $chiter = 20000;
my $chtrlen = 1000;
my $nprocs = 90;

my $trg_file = '/tmp/dbrun_pl.trg';
my $refresult_file = '/tmp/refresult.txt';

#       title  loops,ndbs,clnts,ntbls,childiter   ,  xactlen,refprocs
# testrun("A1", $loops,   1,    1,    1,  $chiter*10, $chtrlen,     0);
# testrun("A2", $loops,   1,    1,    1,  $chiter*10, $chtrlen,     1);
# testrun("B1", $loops,   1,   90,    1,  $chiter   , $chtrlen,     0);
# testrun("B2", $loops,   1,   90,    1,  $chiter   , $chtrlen,     1);
# testrun("C1", $loops,   1,   90,   50,  $chiter   , $chtrlen,     0);
# testrun("C2", $loops,   1,   90,   50,  $chiter   , $chtrlen,    10);
# testrun("D1", $loops,  50,   90,    1,  $chiter   , $chtrlen,     0);
# testrun("D2", $loops,  50,   90,    1,  $chiter   , $chtrlen,     1);
# testrun("E1", $loops,  50,   90,    1,  $chiter   , $chtrlen,    10);
# testrun("F1", $loops ,  1,   10,    1,  $chiter   , $chtrlen,    90);
testrun("G1", 10,   1,  400, 1000,  20000, 1000,     0);
testrun("G2", 10,   1,  400, 1000,  20000, 1000,     1);
exit;

sub testrun
{
    my ($test_name, $loops, $ndbs, $nprocs, $ntbls,
    $childiter, $childtrlen, $nreferers)
    = @_;

    my @results = ();
    my @refresults = ();

    # run each iteration
    for (my $l = 0 ; $l < $loops ; $l++)
    {
    my %starttime = ();
    my %endtime = ();

    # This file is used for stopping free-running processes.
    open(OUT, '>', $trg_file) || die "failed to open file:$!\n";
    print OUT "$$\n";
    close(OUT);

    # It's very mysterious that one dummy subprocess makes things
    # stable. Perl runs slowly if there's only one psql process
    # without this...
    {
        my $pid = fork;
        if ($pid == 0)
        {
        while (-f $trg_file)
        {
            usleep (500000);
        }
        exit;
        }
    }

    # start referer processes, collecting how many times the query ran.
    pipe(my $refresrd, my $refreswr);
    for(my $i = 0 ; $i < $nreferers ; $i++)
    {
        my $pid = fork;

        if ($pid < 0) { die "fork referer process failed : $!\n"; }
        
        if ($pid == 0)
        {
        my $pid = open2(my $psqlrd, my $psqlwr, "psql postgres");
        my $count = 0;

        close($refresrd);

        if ($pid < 0) { die "fork psql failed : $!\n"; }

        while (-f $trg_file)
        {
            print $psqlwr "select * from pg_stat_user_tables;\n";
            while (<$psqlrd>)
            {
            last if ($_ =~ /rows\)/);
            }
            $count++;
        }
        print $refreswr "$count\n";
        exit;
        }
    }
    close($refreswr);

    # launch updater processes
    for (my $i = 0 ; $i < $nprocs ; $i++)
    {
        #my $dbn = rand($ndbs);  # equal dist.
        my $dbn = $i % $totaldbs;   # round robin
        my $dbname = sprintf("db%03d", $dbn);

        my $pid = fork;

        if ($pid < 0)
        {
        die "fork failed: $!\n";
        }
        elsif ($pid == 0)
        {
        my $pid = open2(my $rd, my $wr, "psql $dbname > /dev/null");
        if ($pid < 0) { die "sub fork failed: $!\n"; }

        my $ncmd = $childtrlen;
        #print $wr "set log_min_duration_statement to 0;\n";
        print $wr "begin;\n";
        for (my $i = 0 ; $i < $childiter ; $i++)
        {
            #my $tbn = rand($ntbls);  # equal dist.
            my $tbn = $i % $totaltbls;    # round robin

            printf $wr "select /* $dbname\[$i\]*/ count(*) from t%03d;\n", $tbn;
            if (--$ncmd == 0)
            {
            print $wr "commit;begin;\n";
            $ncmd = $childtrlen;
            }
        }

        print $wr "commit;\n";
        print $wr "\\q\n";
        my $res = <$rd>;
        exit;
        }

        (my $sec, my $usec) = &gettimeofday();
        $starttime{$pid} = $sec * 1000 + $usec / 1000;
        # print "start\[$pid\] = $starttime{$pid}\n";
    }

    # wait for updaters to finish
    for (my $i = 0 ; $i < $nprocs ; $i++)
    {
        my $pid = wait();

        if ($pid < 0) { if ($! != ECHILD) {die "???: $!\n"; }}
        redo if (!defined $starttime{$pid});

        (my $sec, my $usec) = &gettimeofday();
        $endtime{$pid} = $sec * 1000 + $usec / 1000;
        # printf "%d[%d]: %d - %d  = %d ms\n", $i, $pid, $endtime{$pid}, $starttime{$pid}, $endtime{$pid} - $starttime{$pid};
    }

    my $sum = 0;
    foreach my $pid (keys %starttime)
    {
        $sum += $endtime{$pid} - $starttime{$pid};
        # printf "[%d]: %d ms (%d, %d)\n", $pid, $endtime{$pid} - $starttime{$pid}, $starttime{$pid}, $endtime{$pid};
    }

    push(@results, $sum / $nprocs);

    # kill referers if any (and dummy process)
    unlink($trg_file);
    # wait() returns -1 once no child processes remain
    while (wait() != -1) {}
    my $nrefs = 0;
    my $refcount = 0;
    while(<$refresrd>)
    {
        chomp;
        $refcount += $_;
        $nrefs++;
    }
    close($refresrd);
    
    push (@refresults, $refcount / $nrefs) if ($nrefs > 0);
    }

    # calculate stdev
    (my $updmean, my $updstdev) = stdev(@results);
    (my $refmean, my $refstdev) = stdev(@refresults);

    printf "$test_name (l:%d, d:%d, t:%d, i:%d, tr:%d): %.2f ms (stdev %.2f) / %d updprocs, %.2f refs (stdev %.2f) / %d refprocs\n",
        $loops, $ndbs, $ntbls, $childiter, $childtrlen, $updmean, $updstdev, $nprocs, $refmean, $refstdev, $nreferers;
}

sub stdev
{
    my $sum = 0;
    my $sqsum = 0;
    my $count = $#_ + 1;

    return (0, 0) if $count == 0;

    foreach my $el (@_)
    {
    $sum += 1.0 * $el;
    $sqsum += 1.0 * $el * $el;
    }
    my $mean = $sum / $count;
    my $stdev = sqrt($sqsum / $count - $mean * $mean);

    return ($mean, $stdev);
}



Re: shared-memory based stats collector

From
Alvaro Herrera
Date:
On 2019-Jul-11, Kyotaro Horiguchi wrote:

> Hello. This is v21 of the patch.
> 
> > CF-bot warned that it doesn't work on Windows. I'm having a
> > very painful time waiting for TortoiseGit, which walks as slowly
> > as its name suggests. It will be fixed in the next version.
> 
> Found a bug in initialization. StatsShememInit() was placed at a
> wrong place and stats code on child processes accessed
> uninitialized pointer. It is a leftover from the previous shape
> where dsm was activated on postmaster.

This doesn't apply anymore.  Can you please rebase?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
At Tue, 3 Sep 2019 18:28:05 -0400, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
<20190903222805.GA13932@alvherre.pgsql>
> > Found a bug in initialization. StatsShememInit() was placed at a
> > wrong place and stats code on child processes accessed
> > uninitialized pointer. It is a leftover from the previous shape
> > where dsm was activated on postmaster.
> 
> This doesn't apply anymore.  Can you please rebase?

Thanks! I forgot to post the rebased version after rebasing. Here it is.

- (Re)Rebased to the current master.
- Passed all tests for me.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 99b86de7e647c74a01fb694c2a868fa24fdf6424 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:41:04 +0900
Subject: [PATCH v22 1/5] sequential scan for dshash

Add sequential scan feature to dshash.
---
 src/backend/lib/dshash.c | 188 ++++++++++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  23 +++++-
 2 files changed, 206 insertions(+), 5 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 350f8c0a66..4f0c7ec840 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -112,6 +112,7 @@ struct dshash_table
     size_t        size_log2;        /* log2(number of buckets) */
     bool        find_locked;    /* Is any partition lock held by 'find'? */
     bool        find_exclusively_locked;    /* ... exclusively? */
+    bool        seqscan_running;/* now under sequential scan */
 };
 
 /* Given a pointer to an item, find the entry (user data) it holds. */
@@ -127,6 +128,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +158,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -228,6 +237,7 @@ dshash_create(dsa_area *area, const dshash_parameters *params, void *arg)
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
 
     /*
      * Set up the initial array of buckets.  Our initial size is the same as
@@ -279,6 +289,7 @@ dshash_attach(dsa_area *area, const dshash_parameters *params,
     hash_table->control = dsa_get_address(area, control);
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
     Assert(hash_table->control->magic == DSHASH_MAGIC);
 
     /*
@@ -324,7 +335,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
                                 LW_EXCLUSIVE));
 
     delete_item(hash_table, item);
-    hash_table->find_locked = false;
-    hash_table->find_exclusively_locked = false;
-    LWLockRelease(PARTITION_LOCK(hash_table, partition));
+
+    /* We need to keep the partition lock during a sequential scan */
+    if (!hash_table->seqscan_running)
+    {
+        hash_table->find_locked = false;
+        hash_table->find_exclusively_locked = false;
+        LWLockRelease(PARTITION_LOCK(hash_table, partition));
+    }
 }
 
 /*
@@ -568,6 +584,8 @@ dshash_release_lock(dshash_table *hash_table, void *entry)
     Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition_index),
                                 hash_table->find_exclusively_locked
                                 ? LW_EXCLUSIVE : LW_SHARED));
+    /* lock is under control of sequential scan */
+    Assert(!hash_table->seqscan_running);
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
@@ -592,6 +610,168 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one; return NULL when there are no more.
+ *
+ * dshash_seq_term should be called if and only if the scan is abandoned
+ * before completion; if dshash_seq_next returns NULL then it has already done
+ * the end-of-scan cleanup.
+ *
+ * A returned element is locked, as is the case with dshash_find.
+ * However, the caller must not release the lock; it is released as
+ * necessary as the scan continues.
+ *
+ * As opposed to the equivalent for dynanash, the caller is not supposed to
+ * delete the returned element before continuing the scan.
+ *
+ * If consistent is set for dshash_seq_init, the whole hash table is
+ * non-exclusively locked. Otherwise a part of the hash table is locked in the
+ * same mode (partition lock).
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool consistent, bool exclusive)
+{
+    /* at most one scan is allowed at a time */
+    Assert(!hash_table->seqscan_running);
+
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->consistent = consistent;
+    status->exclusive = exclusive;
+    hash_table->seqscan_running = true;
+
+    /*
+     * Protect all partitions from modification if the caller wants a
+     * consistent result.
+     */
+    if (consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+        {
+            Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i)));
+
+            LWLockAcquire(PARTITION_LOCK(hash_table, i),
+                          exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        }
+        ensure_valid_bucket_pointers(hash_table);
+    }
+}
+
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    Assert(status->hash_table->seqscan_running);
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert (status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        if (!status->consistent)
+        {
+            partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+            LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            status->curpartition = partition;
+
+            /* resize doesn't happen from now until seq scan ends */
+            status->nbuckets =
+                NUM_BUCKETS(status->hash_table->control->size_log2);
+            ensure_valid_bucket_pointers(status->hash_table);
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            dshash_seq_term(status);
+            return NULL;
+        }
+
+        /* Also move partition lock if needed */
+        if (!status->consistent)
+        {
+            int next_partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+
+            /* Move lock along with partition for the bucket */
+            if (status->curpartition != next_partition)
+            {
+                /*
+                 * Take lock on the next partition then release the current,
+                 * not in the reverse order. This is required to avoid
+                 * resizing from happening during a sequential scan. Locks are
+                 * taken in partition order so no deadlock happens with other
+                 * seq scans or resizing.
+                 */
+                LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                             next_partition),
+                              status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+                LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                             status->curpartition));
+                status->curpartition = next_partition;
+            }
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * This item can be deleted by the caller. Store the next item for the
+     * next iteration for the occasion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    Assert(status->hash_table->seqscan_running);
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+    status->hash_table->seqscan_running = false;
+
+    if (status->consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+            LWLockRelease(PARTITION_LOCK(status->hash_table, i));
+    }
+    else if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index fa2e28ff3e..79698a6ad6 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,23 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state of dshash. The detail is exposed since the storage
+ * size should be known to users but it should be considered as an opaque
+ * type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;
+    int                    curbucket;
+    int                    nbuckets;
+    dshash_table_item  *curitem;
+    dsa_pointer            pnextitem;
+    int                    curpartition;
+    bool                consistent;
+    bool                exclusive;
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
                                    const dshash_parameters *params,
@@ -70,7 +87,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
 extern void dshash_detach(dshash_table *hash_table);
 extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
 extern void dshash_destroy(dshash_table *hash_table);
-
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
@@ -80,6 +96,11 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool consistent, bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.16.3
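
As an aside for readers of this archive: the bucket-to-partition arithmetic and the lock handoff in dshash_seq_next can be modelled without PostgreSQL at all. The following is a minimal standalone sketch, not code from the patch. All toy_* and TOY_* names and the choice of 4 partitions are made up for illustration (dshash.c uses its own DSHASH_NUM_PARTITIONS_LOG2), and plain flags stand in for LWLocks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed toy value; dshash.c has its own DSHASH_NUM_PARTITIONS_LOG2. */
#define TOY_NUM_PARTITIONS_LOG2 2
#define TOY_NUM_PARTITIONS (1 << TOY_NUM_PARTITIONS_LOG2)

/* Same arithmetic as the macros in the patch, under hypothetical names. */
#define TOY_NUM_SPLITS(size_log2) ((size_log2) - TOY_NUM_PARTITIONS_LOG2)
#define TOY_NUM_BUCKETS(size_log2) (((size_t) 1) << (size_log2))
#define TOY_PARTITION_FOR_BUCKET(idx, size_log2) \
    ((idx) >> TOY_NUM_SPLITS(size_log2))

/* Toy stand-in for the partition LWLocks: one flag per partition. */
typedef struct toy_locks
{
    bool held[TOY_NUM_PARTITIONS];
    int  max_held;      /* most locks ever held at the same time */
} toy_locks;

static int
toy_count_held(const toy_locks *l)
{
    int n = 0;

    for (int i = 0; i < TOY_NUM_PARTITIONS; i++)
        n += l->held[i] ? 1 : 0;
    return n;
}

static void
toy_acquire(toy_locks *l, size_t p)
{
    assert(!l->held[p]);
    l->held[p] = true;
    if (toy_count_held(l) > l->max_held)
        l->max_held = toy_count_held(l);
}

static void
toy_release(toy_locks *l, size_t p)
{
    assert(l->held[p]);
    l->held[p] = false;
}

/*
 * Visit every bucket in index order, as a non-consistent scan would.
 * Returns the number of buckets visited.
 */
size_t
toy_seq_scan(toy_locks *l, size_t size_log2)
{
    size_t nbuckets = TOY_NUM_BUCKETS(size_log2);
    size_t cur = TOY_PARTITION_FOR_BUCKET((size_t) 0, size_log2);
    size_t visited = 0;

    assert(size_log2 >= TOY_NUM_PARTITIONS_LOG2);

    toy_acquire(l, cur);
    for (size_t b = 0; b < nbuckets; b++)
    {
        size_t next = TOY_PARTITION_FOR_BUCKET(b, size_log2);

        if (next != cur)
        {
            toy_acquire(l, next);   /* take the next partition first ... */
            toy_release(l, cur);    /* ... then let go of the previous one */
            cur = next;
        }
        visited++;
    }
    toy_release(l, cur);
    return visited;
}
```

Because the next partition is taken before the current one is released, the scanner always holds at least one partition lock; this is the property the comment in dshash_seq_next relies on to keep a resize from starting mid-scan. The cost is that two partitions are briefly locked at each boundary.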

From 261e9ed8d118e7b3bce0c5a69a58eacff5b3c7d3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 27 Sep 2018 11:15:19 +0900
Subject: [PATCH v22 2/5] Add conditional lock feature to dshash

dshash currently waits for a lock unconditionally. This commit adds new
interfaces for dshash_find and dshash_find_or_insert. The new
interfaces have an extra parameter "nowait" that tells them not to wait
for the lock.
---
 src/backend/lib/dshash.c | 69 +++++++++++++++++++++++++++++++++++++++++++-----
 src/include/lib/dshash.h |  6 +++++
 2 files changed, 68 insertions(+), 7 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 4f0c7ec840..60a6e3c0bc 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -394,19 +394,48 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  */
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
+{
+    return dshash_find_extended(hash_table, key, exclusive, false, NULL);
+}
+
+/*
+ * Like dshash_find, but returns NULL immediately when nowait is true and the
+ * lock could not be acquired. Lock status is set to *lock_acquired if given.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool *lock_acquired)
 {
     dshash_hash hash;
     size_t        partition;
     dshash_table_item *item;
 
+    /* lock_acquired is meaningful only when nowait is requested */
+    Assert(nowait || !lock_acquired);
+
     hash = hash_key(hash_table, key);
     partition = PARTITION_FOR_HASH(hash);
 
     Assert(hash_table->control->magic == DSHASH_MAGIC);
     Assert(!hash_table->find_locked);
 
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partition),
+                                      exclusive ? LW_EXCLUSIVE : LW_SHARED))
+        {
+            if (lock_acquired)
+                *lock_acquired = false;
+            return NULL;
+        }
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition),
+                      exclusive ? LW_EXCLUSIVE : LW_SHARED);
+
+    if (lock_acquired)
+        *lock_acquired = true;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -441,6 +470,22 @@ void *
 dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
+{
+    return dshash_find_or_insert_extended(hash_table, key, found, false);
+}
+
+/*
+ * Like dshash_find_or_insert, but returns NULL if nowait is true and the
+ * lock could not be acquired.
+ *
+ * Notes above dshash_find_extended() regarding locking and error handling
+ * equally apply here.
+ */
+void *
+dshash_find_or_insert_extended(dshash_table *hash_table,
+                               const void *key,
+                               bool *found,
+                               bool nowait)
 {
     dshash_hash hash;
     size_t        partition_index;
@@ -455,8 +500,16 @@ dshash_find_or_insert(dshash_table *hash_table,
     Assert(!hash_table->find_locked);
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(
+                PARTITION_LOCK(hash_table, partition_index),
+                LW_EXCLUSIVE))
+            return NULL;
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
+                      LW_EXCLUSIVE);
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -626,9 +679,11 @@ dshash_memhash(const void *v, size_t size, void *arg)
  * As opposed to the equivalent for dynanash, the caller is not supposed to
  * delete the returned element before continuing the scan.
  *
- * If consistent is set for dshash_seq_init, the whole hash table is
- * non-exclusively locked. Otherwise a part of the hash table is locked in the
- * same mode (partition lock).
+ * If consistent is set for dshash_seq_init, all hash table
+ * partitions are locked in the requested mode (as determined by the
+ * exclusive flag), and the locks are held until the end of the scan.
+ * Otherwise the partition locks are acquired and released as needed
+ * during the scan (up to two partitions may be locked at the same time).
  */
 void
 dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 79698a6ad6..67f7d77f71 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -90,8 +90,14 @@ extern void dshash_destroy(dshash_table *hash_table);
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+                                  bool exclusive, bool nowait,
+                                  bool *lock_acquired);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                                    const void *key, bool *found);
+extern void *dshash_find_or_insert_extended(dshash_table *hash_table,
+                                            const void *key, bool *found,
+                                            bool nowait);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.16.3
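
For readers of this archive, the calling convention of the nowait variant can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not code from the patch: the toy_* names are hypothetical, a single flag stands in for a partition LWLock, and the "table" holds exactly one entry. It shows why the extra output flag is useful: a NULL result alone cannot distinguish "key not found" from "lock was busy".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for one partition LWLock. */
typedef struct toy_lock
{
    bool held;
} toy_lock;

/* Non-blocking acquire: true on success, false if the lock is busy. */
static bool
toy_try_acquire(toy_lock *lock)
{
    if (lock->held)
        return false;
    lock->held = true;
    return true;
}

static void
toy_unlock(toy_lock *lock)
{
    assert(lock->held);
    lock->held = false;
}

/* A hypothetical one-entry "hash table" guarded by a single lock. */
typedef struct toy_table
{
    toy_lock lock;
    int      key;
    int      value;
} toy_table;

/*
 * Look up key without waiting for the lock.
 *
 * Returns a pointer to the value with the lock still held (the caller
 * must call toy_unlock), or NULL.  When NULL is returned, *lock_acquired
 * tells the two cases apart: false means the lock was busy, true means
 * the key is not in the table (the lock has already been released).
 */
int *
toy_find_nowait(toy_table *table, int key, bool *lock_acquired)
{
    if (!toy_try_acquire(&table->lock))
    {
        if (lock_acquired)
            *lock_acquired = false;
        return NULL;
    }

    if (lock_acquired)
        *lock_acquired = true;

    if (table->key != key)
    {
        toy_unlock(&table->lock);
        return NULL;
    }

    return &table->value;
}
```

A caller that only wants to flush pending counters opportunistically can treat a busy lock as "try again later" and keep its local data, while still treating a genuine miss as a reason to create the entry.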

From 4647fcbf1ef032bec3090f6e2702e8cc9997ea6b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 7 Nov 2018 16:53:49 +0900
Subject: [PATCH v22 3/5] Make archiver process an auxiliary process

This is a preliminary patch for the shared-memory based stats collector.
The archiver process must be an auxiliary process since it uses shared
memory once stats data has been moved into shared memory. Make the process
an auxiliary process in order to make it work.
---
 src/backend/bootstrap/bootstrap.c   |  8 +++
 src/backend/postmaster/pgarch.c     | 98 +++++++++----------------------------
 src/backend/postmaster/pgstat.c     |  6 +++
 src/backend/postmaster/postmaster.c | 35 +++++++++----
 src/include/miscadmin.h             |  2 +
 src/include/pgstat.h                |  1 +
 src/include/postmaster/pgarch.h     |  4 +-
 7 files changed, 67 insertions(+), 87 deletions(-)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 9238fbe98d..dde2485b14 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -329,6 +329,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case BgWriterProcess:
                 statmsg = pgstat_get_backend_desc(B_BG_WRITER);
                 break;
+            case ArchiverProcess:
+                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
+                break;
             case CheckpointerProcess:
                 statmsg = pgstat_get_backend_desc(B_CHECKPOINTER);
                 break;
@@ -456,6 +459,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
             BackgroundWriterMain();
             proc_exit(1);        /* should never return */
 
+        case ArchiverProcess:
+            /* don't set signals, archiver has its own agenda */
+            PgArchiverMain();
+            proc_exit(1);        /* should never return */
+
         case CheckpointerProcess:
             /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index f84f882c4c..4342ebdab4 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -77,7 +77,6 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
@@ -96,7 +95,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgarch_exit(SIGNAL_ARGS);
 static void ArchSigHupHandler(SIGNAL_ARGS);
 static void ArchSigTermHandler(SIGNAL_ARGS);
@@ -114,75 +112,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -222,8 +151,8 @@ pgarch_forkexec(void)
  *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
  *    since we don't use 'em, it hardly matters...
  */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -255,8 +184,27 @@ PgArchiverMain(int argc, char *argv[])
 static void
 pgarch_exit(SIGNAL_ARGS)
 {
-    /* SIGQUIT means curl up and die ... */
-    exit(1);
+    PG_SETMASK(&BlockSig);
+
+    /*
+     * We DO NOT want to run proc_exit() callbacks -- we're here because
+     * shared memory may be corrupted, so we don't want to try to clean up our
+     * transaction.  Just nail the windows shut and get out of town.  Now that
+     * there's an atexit callback to prevent third-party code from breaking
+     * things by calling exit() directly, we have to reset the callbacks
+     * explicitly to make this work as intended.
+     */
+    on_exit_reset();
+
+    /*
+     * Note we do exit(2) not exit(0).  This is to force the postmaster into a
+     * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+     * backend.  This is necessary precisely because we don't clean up our
+     * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+     * should ensure the postmaster sees this as a crash, too, but no harm in
+     * being doubly sure.)
+     */
+    exit(2);
 }
 
 /* SIGHUP signal handler for archiver process */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 011076c3e3..043e3ff9d2 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2934,6 +2934,9 @@ pgstat_bestart(void)
             case StartupProcess:
                 lbeentry.st_backendType = B_STARTUP;
                 break;
+            case ArchiverProcess:
+                lbeentry.st_backendType = B_ARCHIVER;
+                break;
             case BgWriterProcess:
                 lbeentry.st_backendType = B_BG_WRITER;
                 break;
@@ -4277,6 +4280,9 @@ pgstat_get_backend_desc(BackendType backendType)
 
     switch (backendType)
     {
+        case B_ARCHIVER:
+            backendDesc = "archiver";
+            break;
         case B_AUTOVAC_LAUNCHER:
             backendDesc = "autovacuum launcher";
             break;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a5446d54bb..582434252f 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -146,7 +146,8 @@
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
 #define BACKEND_TYPE_BGWORKER    0x0008    /* bgworker process */
-#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
+#define BACKEND_TYPE_ARCHIVER    0x0010    /* archiver process */
+#define BACKEND_TYPE_ALL        0x001F    /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -539,6 +540,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1762,7 +1764,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -2991,7 +2993,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3136,10 +3138,8 @@ reaper(SIGNAL_ARGS)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3385,7 +3385,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3590,6 +3590,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3862,6 +3874,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5131,7 +5144,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5414,6 +5427,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index bc6e03fbc7..1f4db67f3f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -399,6 +399,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -411,6 +412,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index fe076d823d..65713abc2b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -718,6 +718,7 @@ typedef struct PgStat_GlobalStats
  */
 typedef enum BackendType
 {
+    B_ARCHIVER,
     B_AUTOVAC_LAUNCHER,
     B_AUTOVAC_WORKER,
     B_BACKEND,
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index 2474eac26a..88f16863d4 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
-- 
2.16.3
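
A note on the BACKEND_TYPE_* change in the postmaster.c hunk above: the flags
form a bitmask, so adding BACKEND_TYPE_ARCHIVER (0x0010) also requires widening
BACKEND_TYPE_ALL from 0x000F to 0x001F, otherwise a mask of "all children"
would silently skip the archiver. The following is only a minimal standalone
sketch of that invariant. BACKEND_TYPE_NORMAL is taken from postmaster.c and is
not visible in the hunk, and the two helper functions are hypothetical names
made up for illustration, not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values as in postmaster.c; BACKEND_TYPE_NORMAL is not in the hunk. */
#define BACKEND_TYPE_NORMAL     0x0001  /* normal backend */
#define BACKEND_TYPE_AUTOVAC    0x0002  /* autovacuum worker process */
#define BACKEND_TYPE_WALSND     0x0004  /* walsender process */
#define BACKEND_TYPE_BGWORKER   0x0008  /* bgworker process */
#define BACKEND_TYPE_ARCHIVER   0x0010  /* archiver process */
#define BACKEND_TYPE_ALL        0x001F  /* OR of all the above */

/* Hypothetical helper: true if a child of bkend_type is selected by target. */
static bool
backend_type_matches(int bkend_type, int target)
{
    return (bkend_type & target) != 0;
}

/* Hypothetical helper: OR of every individual flag, to compare with _ALL. */
static int
backend_type_union(void)
{
    return BACKEND_TYPE_NORMAL | BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_WALSND |
           BACKEND_TYPE_BGWORKER | BACKEND_TYPE_ARCHIVER;
}
```

With the old value 0x000F, backend_type_matches(BACKEND_TYPE_ARCHIVER, 0x000F)
is false, which is why the mask and the new flag have to change together.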

From 1a5a0e2bd49d2ec1c4a73c8be7c7d7d390c61a37 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 21 Feb 2019 12:44:56 +0900
Subject: [PATCH v22 4/5] Shared-memory based stats collector

Previously, activity statistics were shared via files on disk. Every
backend sent its numbers to the stats collector process via a socket.
The collector wrote snapshots as a set of files on disk at a certain
interval, and every backend read them as necessary. This worked fine for
a comparatively small set of statistics, but the set is under pressure
to grow and the file size has reached the order of megabytes. To deal
with a larger set of statistics, this patch lets backends share the
statistics directly via shared memory.
---
 doc/src/sgml/monitoring.sgml                 |    6 +-
 src/backend/postmaster/autovacuum.c          |   12 +-
 src/backend/postmaster/pgstat.c              | 5661 ++++++++++++--------------
 src/backend/postmaster/postmaster.c          |  139 +-
 src/backend/storage/ipc/ipci.c               |    2 +
 src/backend/storage/lmgr/lwlock.c            |    1 +
 src/backend/tcop/postgres.c                  |   27 +-
 src/backend/utils/init/globals.c             |    1 +
 src/backend/utils/init/postinit.c            |   11 +
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |    4 +-
 src/include/miscadmin.h                      |    1 +
 src/include/pgstat.h                         |  441 +-
 src/include/storage/lwlock.h                 |    1 +
 src/include/utils/timeout.h                  |    1 +
 14 files changed, 2637 insertions(+), 3671 deletions(-)
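
The pgstat.c header comment in this patch describes three intervals: shared
stats are updated at most once per PGSTAT_STAT_MIN_INTERVAL (500ms), a flush
that could not complete is retried after PGSTAT_STAT_RETRY_INTERVAL (100ms),
and pending numbers are not held longer than PGSTAT_STAT_MAX_INTERVAL (1000ms).
The sketch below is one possible reading of that policy as a pure decision
function. It is not the code in the patch: FlushAction, choose_flush_action()
and its parameters are hypothetical names, and the assumption that "forced"
means waiting for locks instead of using nowait is mine.

```c
#include <assert.h>

#define PGSTAT_STAT_MIN_INTERVAL     500   /* ms between regular flushes */
#define PGSTAT_STAT_RETRY_INTERVAL   100   /* ms between retries */
#define PGSTAT_STAT_MAX_INTERVAL    1000   /* ms pending data may be held */

/* Hypothetical result type: what the backend should do right now. */
typedef enum FlushAction
{
    FLUSH_SKIP,         /* too early, do nothing */
    FLUSH_TRY_NOWAIT,   /* attempt a flush, give up on lock conflicts */
    FLUSH_FORCE         /* flush even if that means waiting for locks */
} FlushAction;

/*
 * now_ms:           current time in milliseconds
 * last_attempt_ms:  time of the last flush attempt
 * pending_since_ms: time the oldest postponed data was first held back,
 *                   or -1 if nothing has been postponed
 */
static FlushAction
choose_flush_action(long now_ms, long last_attempt_ms, long pending_since_ms)
{
    if (pending_since_ms >= 0)
    {
        /* Postponed data must not be held longer than the maximum. */
        if (now_ms - pending_since_ms >= PGSTAT_STAT_MAX_INTERVAL)
            return FLUSH_FORCE;

        /* Otherwise retry at the shorter retry interval. */
        if (now_ms - last_attempt_ms >= PGSTAT_STAT_RETRY_INTERVAL)
            return FLUSH_TRY_NOWAIT;

        return FLUSH_SKIP;
    }

    /* Nothing postponed: honor the minimum interval between flushes. */
    if (now_ms - last_attempt_ms >= PGSTAT_STAT_MIN_INTERVAL)
        return FLUSH_TRY_NOWAIT;

    return FLUSH_SKIP;
}
```

For example, with nothing postponed and the last attempt at 0ms, a call at
400ms is skipped and a call at 500ms attempts a nowait flush. If that attempt
leaves data pending since 500ms, retries happen every 100ms and the flush is
forced once 1500ms is reached.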

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 828e9084dd..ea6aad4d1e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    master server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   master process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   master process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 073f313337..a222817f55 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1958,15 +1958,15 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
+    /* Start a transaction so our commands have one to play into. */
+    StartTransactionCommand();
+
     /*
      * may be NULL if we couldn't find an entry (only happens if we are
      * forcing a vacuum for anti-wrap purposes).
      */
     dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
 
-    /* Start a transaction so our commands have one to play into. */
-    StartTransactionCommand();
-
     /*
      * Clean up any dead statistics collector entries for this DB. We always
      * want to do this exactly once per DB-processing cycle, even if we find
@@ -2749,12 +2749,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
     if (isshared)
     {
         if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
+            tabentry = pgstat_fetch_stat_tabentry_extended(shared, relid);
     }
     else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
+        tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
 
     return tabentry;
 }
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 043e3ff9d2..c0b20763b0 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,23 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Statistics collector facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects per-table and per-function usage statistics of all backends in
+ *  shared memory. The pg_count_*() family of interfaces accumulates each
+ *  backend's activity during a transaction. pgstat_flush_stat() is then
+ *  called at the end of the transaction to flush the local numbers out.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ *  To avoid congestion on the shared memory, we update shared stats no more
+ *  often than once per PGSTAT_STAT_MIN_INTERVAL (500ms). If a backend cannot
+ *  flush all or part of its local numbers immediately, we postpone the
+ *  update and retry after PGSTAT_STAT_RETRY_INTERVAL (100ms), but pending
+ *  numbers are never held back longer than PGSTAT_STAT_MAX_INTERVAL
+ *  (1000ms).
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the stats collector creates the area and
+ *  loads the stored stats file, if any. At shutdown, the last process writes
+ *  the shared stats to the file and then destroys the area before exiting.
  *
  *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
  *
@@ -19,18 +27,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "pgstat.h"
 
@@ -42,66 +38,38 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "postmaster/autovacuum.h"
-#include "postmaster/fork_process.h"
-#include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
-#include "utils/timestamp.h"
-
 
 /* ----------
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
+#define PGSTAT_STAT_MIN_INTERVAL    500 /* Minimum time between stats data
+                                         * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_STAT_RETRY_INTERVAL    100 /* Retry interval after
+                                         * PGSTAT_STAT_MIN_INTERVAL elapses; in ms. */
 
+#define PGSTAT_STAT_MAX_INTERVAL   1000 /* Maximum time between stats data
+                                         * updates; in milliseconds. */
 
 /* ----------
  * The initial size hints for the hash tables used in the collector.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
 #define PGSTAT_TAB_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
@@ -117,6 +85,19 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
+/*
+ * Operation mode and return code of pgstat_get_db_entry.
+ */
+#define    PGSTAT_SHARED        0
+#define    PGSTAT_EXCLUSIVE    1
+#define    PGSTAT_NOWAIT        2
+
+typedef enum PgStat_TableLookupResult
+{
+    NOT_FOUND,
+    FOUND,
+    LOCK_FAILED
+} PgStat_TableLookupResult;
 
 /* ----------
  * GUC parameters
@@ -132,31 +113,63 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used, but will be removed with GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
+#define        StatsLock (&StatsShmem->StatsMainLock)
+
+/* Shared stats bootstrap information */
+typedef struct StatsShmemStruct
+{
+    LWLock                StatsMainLock;        /* lock protecting this struct */
+    dsa_handle             stats_dsa_handle;    /* DSA handle for stats collector */
+    dshash_table_handle db_hash_handle;
+    dsa_pointer            global_stats;
+    dsa_pointer            archiver_stats;
+    int                    refcount;
+} StatsShmemStruct;
+
 /*
- * BgWriter global statistics counters (unused in other processes).
- * Stored directly in a stats message structure so it can be sent
- * without needing to copy things around.  We assume this inits to zeroes.
+ * BgWriter global statistics counters. The name is a remnant from the time
+ * when the stats collector was a dedicated process and the counters were
+ * sent to it over a socket.
  */
-PgStat_MsgBgWriter BgWriterStats;
+PgStat_MsgBgWriter BgWriterStats = {0};
 
-/* ----------
- * Local data
- * ----------
- */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
+/* Variables that live for the backend lifetime */
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
+static dshash_table *pgStatDBHash = NULL;
 
-static struct sockaddr_storage pgStatAddr;
 
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
+/* parameters for each type of shared hash */
+static const dshash_parameters dsh_dbparams = {
+    sizeof(Oid),
+    SHARED_DBENT_SIZE,
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+static const dshash_parameters dsh_tblparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatTabEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+static const dshash_parameters dsh_funcparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatFuncEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
 
 /*
  * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
+ * written to shared memory.
  *
  * NOTE: once allocated, TabStatusArray structures are never moved or deleted
  * for the life of the backend.  Also, we zero out the t_id fields of the
@@ -191,8 +204,8 @@ typedef struct TabStatHashEntry
 static HTAB *pgStatTabHash = NULL;
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * Backends store per-function info that's waiting to be flushed out to shared
+ * memory in this hash table (indexed by function OID).
  */
 static HTAB *pgStatFunctions = NULL;
 
@@ -202,6 +215,68 @@ static HTAB *pgStatFunctions = NULL;
  */
 static bool have_function_stats = false;
 
+/* common header of snapshot entry in backend snapshot hash */
+typedef struct PgStat_snapshot
+{
+    Oid        key;
+    bool    negative;
+    void   *body;                /* end of header part: to keep alignment */
+} PgStat_snapshot;
+
+/* context struct for snapshot_statentry */
+typedef struct pgstat_snapshot_param
+{
+    char           *hash_name;            /* name of the snapshot hash */
+    int                hash_entsize;        /* element size of hash entry */
+    dshash_table_handle    dsh_handle;        /* dsh handle to attach */
+    const dshash_parameters *dsh_params;/* dshash params */
+    HTAB          **hash;                /* points to variable to hold hash */
+    dshash_table  **dshash;                /* ditto for dshash */
+} pgstat_snapshot_param;
+
+/*
+ * Backends store various database-wide info that's waiting to be flushed out
+ * to shared memory in these variables.
+ *
+ * checksum_failures is the exception in that it is cluster-wide.
+ */
+typedef struct BackendDBStats
+{
+    int        n_conflict_tablespace;
+    int        n_conflict_lock;
+    int        n_conflict_snapshot;
+    int        n_conflict_bufferpin;
+    int        n_conflict_startup_deadlock;
+    int        n_deadlocks;
+    size_t    n_tmpfiles;
+    size_t    tmpfilesize;
+    HTAB    *checksum_failures;
+} BackendDBStats;
+
+/* Hash entry struct for checksum_failures above */
+typedef struct ChecksumFailureEnt
+{
+    Oid    dboid;
+    int    count;
+} ChecksumFailureEnt;
+
+static BackendDBStats BeDBStats = {0};
+
+/* macros to check BeDBStats at once */
+#define HAVE_PENDING_CONFLICTS() \
+    (BeDBStats.n_conflict_tablespace > 0 ||        \
+     BeDBStats.n_conflict_lock > 0 ||            \
+     BeDBStats.n_conflict_bufferpin > 0 ||        \
+     BeDBStats.n_conflict_startup_deadlock > 0)
+
+#define HAVE_PENDING_DBSTATS()                \
+    (HAVE_PENDING_CONFLICTS() ||        \
+     BeDBStats.n_deadlocks > 0 ||                \
+     BeDBStats.n_tmpfiles > 0 ||                \
+     /* no need to check tmpfilesize */        \
+     BeDBStats.checksum_failures != NULL)
+
+
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
  * into PgStat_TableStatus counters until we know if it is going to commit
@@ -237,11 +312,11 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
+static MemoryContext pgStatSnapshotContext = NULL;
+static HTAB *pgStatLocalHash = NULL;
+static bool    clear_snapshot = false;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -250,23 +325,35 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Struct for context for pgstat_flush_* functions
+ *
+ * To avoid repeated attach/detach of the same dshash, a dshash once attached
+ * is stored in this structure and carried across multiple calls and
+ * functions. "generation" here means the value returned by pin_hashes().
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
+typedef struct pgstat_flush_stat_context
+{
+    int    shgeneration;                /* "generation" of shdb_tabhash below */
+    PgStat_StatDBEntry *shdbentry;    /* dbentry for shared tables (oid = 0) */
+    dshash_table *shdb_tabhash;        /* tabentry dshash of shared tables */
+
+    int    mygeneration;                /* "generation" of mydb_tabhash below */
+    PgStat_StatDBEntry *mydbentry;    /* dbentry for my database */
+    dshash_table *mydb_tabhash;        /* tabentry dshash of my database */
+} pgstat_flush_stat_context;
 
 /*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
+ * Cluster wide statistics.
+ *
+ * Contains statistics that are collected neither per database nor per
+ * table.  shared_* point to shared memory and snapshot_* are backend
+ * snapshots. Their validity is indicated by global_snapshot_is_valid.
  */
-static List *pending_write_requests = NIL;
-
-/* Signal handler flags */
-static volatile bool need_exit = false;
-static volatile bool got_SIGHUP = false;
+static bool global_snapshot_is_valid = false;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats snapshot_globalStats;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -280,35 +367,41 @@ static instr_time total_func_time;
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
 
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgstat_exit(SIGNAL_ARGS);
 static void pgstat_beshutdown_hook(int code, Datum arg);
-static void pgstat_sighup_handler(SIGNAL_ARGS);
-
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
+static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, int op,
+                                    PgStat_TableLookupResult *status);
+static PgStat_StatTabEntry *pgstat_get_tab_entry(dshash_table *table,
                                                  Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static void pgstat_write_pgStatDBHashfile(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_pgStatDBHashfile(PgStat_StatDBEntry *dbentry);
 static void pgstat_read_current_status(void);
-
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
-
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
+static bool pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
+                                 PgStat_TableStatus *entry);
+static bool pgstat_flush_funcstats(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_flush_dbstats(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_update_tabentry(dshash_table *tabhash,
+                                   PgStat_TableStatus *stat, bool nowait);
+static void pgstat_update_dbentry(PgStat_StatDBEntry *dbentry,
+                                  PgStat_TableStatus *stat);
 static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
 
+static void pgstat_remove_useless_entries(const dshash_table_handle dshhandle,
+                              const dshash_parameters *dshparams,
+                              HTAB *oidtab);
 static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
 
 static void pgstat_setup_memcxt(void);
+static void pgstat_flush_recovery_conflict(PgStat_StatDBEntry *dbentry);
+static void pgstat_flush_deadlock(PgStat_StatDBEntry *dbentry);
+static void pgstat_flush_checksum_failure(PgStat_StatDBEntry *dbentry);
+static void pgstat_flush_tempfile(PgStat_StatDBEntry *dbentry);
+static HTAB *create_tabstat_hash(void);
+static PgStat_SubXactStatus *get_tabstat_stack_level(int nest_level);
+static void add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level);
+static PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry_extended(PgStat_StatDBEntry *dbent, Oid funcid);
+static void pgstat_snapshot_global_stats(void);
 
 static const char *pgstat_get_wait_activity(WaitEventActivity w);
 static const char *pgstat_get_wait_client(WaitEventClient w);
@@ -316,481 +409,197 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+/* ------------------------------------------------------------
+ * Local support functions follow
+ * ------------------------------------------------------------
+ */
+static int pin_hashes(PgStat_StatDBEntry *dbentry);
+static void unpin_hashes(PgStat_StatDBEntry *dbentry, int generation);
+static dshash_table *attach_table_hash(PgStat_StatDBEntry *dbent, int gen);
+static dshash_table *attach_function_hash(PgStat_StatDBEntry *dbent, int gen);
+static void reset_dbentry_counters(PgStat_StatDBEntry *dbentry);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute space needed for stats collector's shared memory
  */
-void
-pgstat_init(void)
+Size
+StatsShmemSize(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
-
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
-
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
-    {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
-    }
-
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
-
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
-
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
-
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
-    }
-
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
-
-    /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
-     */
-    if (!pg_set_noblock(pgStatSock))
-    {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
-    }
-
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
-    }
-
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    return;
-
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
-
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
-
-    /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
-     */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    return sizeof(StatsShmemStruct);
 }
 
 /*
- * subroutine for pgstat_reset_all
+ * StatsShmemInit - initialize during shared-memory creation
+ */
+void
+StatsShmemInit(void)
+{
+    bool    found;
+
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
+
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+    }
+
+    LWLockInitialize(StatsLock, LWTRANCHE_STATS);
+}
+
+/* ----------
+ * pgstat_attach_shared_stats() -
+ *
+ *    Attach to the shared stats memory, creating it if it doesn't exist yet.
+ * ---------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+pgstat_attach_shared_stats(void)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    MemoryContext oldcontext;
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
+    /*
+     * Don't use DSM outside a postmaster child or when not tracking counts.
+     */
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    pgstat_setup_memcxt();
+
+    if (area)
+        return;
+
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+    if (StatsShmem->refcount > 0)
+        StatsShmem->refcount++;
+    else
     {
-        int            nchars;
-        Oid            tmp_oid;
+        /* Need to create shared memory area and load saved stats if any. */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
 
-        /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
-         */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
+        /* Initialize shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatDBHash = dshash_create(area, &dsh_dbparams, 0);
 
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->global_stats =
+            dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+        StatsShmem->archiver_stats =
+            dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+        StatsShmem->db_hash_handle = dshash_get_hash_table_handle(pgStatDBHash);
 
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
+
+        /* Load saved data if any. */
+        pgstat_read_statsfiles();
+
+        StatsShmem->refcount = 1;
     }
-    FreeDir(dir);
+
+    LWLockRelease(StatsLock);
+
+    /*
+     * If we're not the first process, attach existing shared stats area
+     * outside StatsLock.
+     */
+    if (!area)
+    {
+        /* Shared area already exists. Just attach it. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatDBHash = dshash_attach(area, &dsh_dbparams,
+                                     StatsShmem->db_hash_handle, 0);
+
+        /* Setup local variables */
+        pgStatLocalHash = NULL;
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
+    }
+
+    MemoryContextSwitchTo(oldcontext);
+
+    dsa_pin_mapping(area);
+    global_snapshot_is_valid = false;
+}
+
+/* ----------
+ * pgstat_detach_shared_stats() -
+ *
+ *    Detach shared stats. Write out to file if we're the last process and
+ *    instructed to write file.
+ * ----------
+ */
+static void
+pgstat_detach_shared_stats(bool write_stats)
+{
+    if (!area || !IsUnderPostmaster)
+        return;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+    /* write out the shared stats to file if needed */
+    if (--StatsShmem->refcount < 1)
+    {
+        if (write_stats)
+            pgstat_write_statsfiles();
+
+        /* We're the last process. Invalidate the dsa area handle. */
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+    }
+
+    LWLockRelease(StatsLock);
+
+    /*
+     * Detach the area. It is automatically destroyed when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);
+
+    area = NULL;
+    pgStatDBHash = NULL;
+    shared_globalStats = NULL;
+    shared_archiverStats = NULL;
+    pgStatLocalHash = NULL;
+    global_snapshot_is_valid = false;
 }
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
-}
+    /* we must have shared stats attached */
+    Assert (StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-#ifdef EXEC_BACKEND
-
-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
-    char       *av[10];
-    int            ac = 0;
-
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
-
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
-
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
-
-
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
+    /* Startup must be the only user of shared stats */
+    Assert (StatsShmem->refcount == 1);
 
     /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
+     * We could directly remove the files and recreate the shared memory area,
+     * but we detach and then re-attach for simplicity.
      */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
-
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
-
-    /*
-     * Okay, fork off the collector.
-     */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
+    pgstat_detach_shared_stats(false);    /* Don't write */
+    pgstat_attach_shared_stats();
 }
 
 /* ------------------------------------------------------------
@@ -798,75 +607,293 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *  Updates are applied no more frequently than once every
+ *  PGSTAT_STAT_MIN_INTERVAL milliseconds. They are also postponed on lock
+ *  failure if force is false and no update has been pending longer than
+ *  PGSTAT_STAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
+ *  succeeding calls of this function.
+ *
+ *    Returns the number of milliseconds to wait before the next attempt to
+ *    apply updates, or 0 if no updates remain pending after this call.
+ *    Updates are never held back longer than PGSTAT_STAT_MAX_INTERVAL.
+ *
+ *    Note that this is called only outside a transaction, so it is fine to use
  *    transaction stop time as an approximation of current time.
- * ----------
+ *    ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz next_flush = 0;
+    static TimestampTz pending_since = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
-    int            i;
+    pgstat_flush_stat_context cxt = {0};
+    bool        pending_stats = false;
+    long        elapsed;
+    long        secs;
+    int            usecs;
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (area == NULL ||
+        ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
+         pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
+         !HAVE_PENDING_DBSTATS()  && !have_function_stats))
+        return 0;
+
+    now = GetCurrentTransactionStopTimestamp();
+
+    if (!force)
+    {
+        /*
+         * Don't flush stats until it's time.  Return the time to wait in
+         * milliseconds.
+         */
+        if (now < next_flush)
+        {
+            /* Record the time of the oldest pending update if not yet set. */
+            if (pending_since == 0)
+                pending_since = now;
+
+            /* now < next_flush here */
+            return (next_flush - now) / 1000;
+        }
+
+        /*
+         * Don't keep pending updates longer than PGSTAT_STAT_MAX_INTERVAL.
+         */
+        if (pending_since > 0)
+        {
+            TimestampDifference(pending_since, now, &secs, &usecs);
+            elapsed = secs * 1000 + usecs / 1000;
+
+            if (elapsed > PGSTAT_STAT_MAX_INTERVAL)
+                force = true;
+        }
+    }
+
+    /* Flush out table stats */
+    if (pgStatTabList != NULL && !pgstat_flush_stat(&cxt, !force))
+        pending_stats = true;
+
+    /* Flush out function stats */
+    if (pgStatFunctions != NULL && !pgstat_flush_funcstats(&cxt, !force))
+        pending_stats = true;
+
+    /* Flush out database-wide stats */
+    if (HAVE_PENDING_DBSTATS())
+    {
+        if (!pgstat_flush_dbstats(&cxt, !force))
+            pending_stats = true;
+    }
+
+    /* Unpin dbentry if pinned */
+    if (cxt.mydb_tabhash)
+    {
+        dshash_detach(cxt.mydb_tabhash);
+        unpin_hashes(cxt.mydbentry, cxt.mygeneration);
+        cxt.mydb_tabhash = NULL;
+        cxt.mydbentry = NULL;
+    }
+
+    /* Publish the last flush time */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (shared_globalStats->stats_timestamp < now)
+        shared_globalStats->stats_timestamp = now;
+    LWLockRelease(StatsLock);
+
+    /* Record how long we have been keeping pending updates. */
+    if (pending_stats)
+    {
+        /* Preserve the first value */
+        if (pending_since == 0)
+            pending_since = now;
+
+        /*
+         * The retry interval may push the total delay past the limit set by
+         * PGSTAT_STAT_MAX_INTERVAL. We don't worry about that since the
+         * excess is small.
+         */
+        return PGSTAT_STAT_RETRY_INTERVAL;
+    }
+
+    /* Set the next time to update stats */
+    next_flush = now + PGSTAT_STAT_MIN_INTERVAL * 1000;
+    pending_since = 0;
+
+    return 0;
+}
+
+/*
+ * snapshot_statentry() - Common routine for functions
+ *                             pgstat_fetch_stat_*entry()
+ *
+ *  Returns the pointer to a snapshot of a shared entry for the key or NULL if
+ *  not found. Returned snapshots are stable during the current transaction or
+ *  until pgstat_clear_snapshot() is called.
+ *
+ *  The snapshots are stored in a hash, a pointer to which is stored in the
+ *  *HTAB variable pointed to by cxt->hash. If not created yet, it is created
+ *  using hash_name and hash_entsize in cxt.
+ *
+ *  cxt->dshash points to dshash_table for dbstat entries. If not yet
+ *  attached, it is attached using cxt->dsh_handle.
+ */
+static void *
+snapshot_statentry(pgstat_snapshot_param *cxt, Oid key)
+{
+    PgStat_snapshot *lentry = NULL;
+    size_t keysize = cxt->dsh_params->key_size;
+    size_t dsh_entrysize = cxt->dsh_params->entry_size;
+    bool found;
 
     /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
+     * We don't want to refresh the stats snapshot too frequently. Keep it for
+     * at least PGSTAT_STAT_MIN_INTERVAL ms. An early cue is ignored, not postponed.
      */
-    now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
+    if (clear_snapshot)
+    {
+        clear_snapshot = false;
+
+        if (pgStatSnapshotContext &&
+            snapshot_globalStats.stats_timestamp <
+            GetCurrentStatementStartTimestamp() -
+            PGSTAT_STAT_MIN_INTERVAL * 1000)
+        {
+            MemoryContextReset(pgStatSnapshotContext);
+
+            /* Reset variables */
+            global_snapshot_is_valid = false;
+            pgStatSnapshotContext = NULL;
+            pgStatLocalHash = NULL;
+
+            pgstat_setup_memcxt();
+        }
+    }
+
+    /*
+     * Create new hash, with rather arbitrary initial number of entries since
+     * we don't know how this hash will grow.
+     */
+    if (!*cxt->hash)
+    {
+        HASHCTL ctl;
+
+        /*
+         * Create the hash in the stats context
+         *
+         * Each entry is prefixed with a common header represented by
+         * PgStat_snapshot.
+         */
+
+        ctl.keysize        = keysize;
+        ctl.entrysize    = offsetof(PgStat_snapshot, body) + cxt->hash_entsize;
+        ctl.hcxt        = pgStatSnapshotContext;
+        *cxt->hash = hash_create(cxt->hash_name, 32, &ctl,
+                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    }
+
+    lentry = hash_search(*cxt->hash, &key, HASH_ENTER, &found);
+
+    /*
+     * Consult the shared hash if the entry is not found in the local hash.
+     * Outside a transaction we return up-to-date entries, so consult the
+     * shared hash in that case even if a snapshot entry is found.
+     */
+    if (!found || !IsTransactionState())
+    {
+        void *sentry;
+
+        /* attach shared hash if not yet attached; keep it for later use */
+        if (!*cxt->dshash)
+        {
+            MemoryContext oldcxt;
+
+            Assert (cxt->dsh_handle != DSM_HANDLE_INVALID);
+            oldcxt = MemoryContextSwitchTo(pgStatSnapshotContext);
+            *cxt->dshash =
+                dshash_attach(area, cxt->dsh_params, cxt->dsh_handle, NULL);
+            MemoryContextSwitchTo(oldcxt);
+        }
+
+        sentry = dshash_find(*cxt->dshash, &key, false);
+
+        if (sentry)
+        {
+            /*
+             * In transaction state, it is obvious that we should create local
+             * cache entries for consistency. If we are not, we return an
+             * up-to-date entry. Having said that, we need a local copy since
+             * dshash entry must be released immediately. We share the same
+             * local hash entry for the purpose.
+             */
+            memcpy(&lentry->body, sentry, dsh_entrysize);
+            dshash_release_lock(*cxt->dshash, sentry);
+
+            /* then zero out the local additional space if any */
+            if (dsh_entrysize < cxt->hash_entsize)
+                MemSet((char *)&lentry->body + dsh_entrysize, 0,
+                       cxt->hash_entsize - dsh_entrysize);
+        }
+
+        lentry->negative = !sentry;
+    }
+
+    if (lentry->negative)
+        return NULL;
+
+    return &lentry->body;
+}
+
+/*
+ * pgstat_flush_stat: Flushes table stats out to shared statistics.
+ *
+ *  If nowait is true, returns false if required lock was not acquired
+ *  immediately. In that case, unapplied table stats updates are left alone in
+ *  TabStatusArray to wait for the next chance. cxt holds some dshash related
+ *  values that we want to carry around while updating shared stats.
+ *
+ *  Returns true if all stats info is flushed. Caller must detach dshashes
+ *  stored in cxt after use.
+ */
+static bool
+pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait)
+{
+    static const PgStat_TableCounts all_zeroes;
+    TabStatusArray *tsa;
+    HTAB           *new_tsa_hash = NULL;
+    TabStatusArray *dest_tsa = pgStatTabList;
+    int                dest_elem = 0;
+    int                i;
+
+    /* nothing to do, just return  */
+    if (pgStatTabHash == NULL)
+        return true;
 
     /*
      * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
+     * entries it points to.
      */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
+    hash_destroy(pgStatTabHash);
     pgStatTabHash = NULL;
 
     /*
      * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
+     * have counts, and try flushing them out to shared stats. We may fail on
+     * some entries in the array; those entries are left packed at the
+     * beginning of the array.
      */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
     for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
     {
         for (i = 0; i < tsa->tsa_used; i++)
         {
             PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
 
             /* Shouldn't have any pending transaction-dependent counts */
             Assert(entry->trans == NULL);
@@ -879,178 +906,352 @@ pgstat_report_stat(bool force)
                        sizeof(PgStat_TableCounts)) == 0)
                 continue;
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* try to apply the tab stats */
+            if (!pgstat_flush_tabstat(cxt, nowait, entry))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                /*
+                 * Failed.  Move it to the beginning of TabStatusArray and
+                 * leave it there.
+                 */
+                TabStatHashEntry *hash_entry;
+                bool found;
+
+                if (new_tsa_hash == NULL)
+                    new_tsa_hash = create_tabstat_hash();
+
+                /* Create hash entry for this entry */
+                hash_entry = hash_search(new_tsa_hash, &entry->t_id,
+                                         HASH_ENTER, &found);
+                Assert(!found);
+
+                /*
+                 * Move insertion pointer to the next segment if the segment
+                 * is filled up.
+                 */
+                if (dest_elem >= TABSTAT_QUANTUM)
+                {
+                    Assert(dest_tsa->tsa_next != NULL);
+                    dest_tsa = dest_tsa->tsa_next;
+                    dest_elem = 0;
+                }
+
+                /*
+                 * Pack the entry at the beginning of the array.  Do nothing
+                 * if it doesn't need to be moved.
+                 */
+                if (tsa != dest_tsa || i != dest_elem)
+                {
+                    PgStat_TableStatus *new_entry;
+                    new_entry = &dest_tsa->tsa_entries[dest_elem];
+                    *new_entry = *entry;
+
+                    /* use new_entry as entry hereafter */
+                    entry = new_entry;
+                }
+
+                hash_entry->tsa_entry = entry;
+                dest_elem++;
             }
         }
-        /* zero out PgStat_TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
     }
 
-    /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
-     */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    /* zero out unused area of TableStatus */
+    dest_tsa->tsa_used = dest_elem;
+    MemSet(&dest_tsa->tsa_entries[dest_elem], 0,
+           (TABSTAT_QUANTUM - dest_elem) * sizeof(PgStat_TableStatus));
+    while (dest_tsa->tsa_next)
+    {
+        dest_tsa = dest_tsa->tsa_next;
+        MemSet(dest_tsa->tsa_entries, 0,
+               dest_tsa->tsa_used * sizeof(PgStat_TableStatus));
+        dest_tsa->tsa_used = 0;
+    }
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+    /* and set the new TabStatusArray hash if any */
+    pgStatTabHash = new_tsa_hash;
+
+    /*
+     * We no longer need the entries for the shared database and its tables,
+     * but those for my database may be used later.
+     */
+    if (cxt->shdb_tabhash)
+    {
+        dshash_detach(cxt->shdb_tabhash);
+        unpin_hashes(cxt->shdbentry, cxt->shgeneration);
+        cxt->shdb_tabhash = NULL;
+        cxt->shdbentry = NULL;
+    }
+
+    return pgStatTabHash == NULL;
 }
 
-/*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+/* -------
+ * Subroutines for pgstat_flush_stat.
+ * -------
  */
-static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+/*
+ * pgstat_flush_tabstat: Flushes a table stats entry.
+ *
+ *  If nowait is true, returns false on lock failure.  Dshashes for table and
+ *  function stats are kept attached in cxt.  The caller must detach them
+ *  after use.
+ *
+ *  Returns true if the entry is flushed out.
+ */
+bool
+pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
+                     PgStat_TableStatus *entry)
 {
-    int            n;
-    int            len;
+    Oid        dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
+    int        table_mode = PGSTAT_EXCLUSIVE;
+    bool    updated = false;
+    dshash_table *tabhash;
+    PgStat_StatDBEntry *dbent;
+    int        generation;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    if (nowait)
+        table_mode |= PGSTAT_NOWAIT;
 
-    /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
-     */
-    if (OidIsValid(tsmsg->m_databaseid))
+    /* Attach the required table hash if not yet attached. */
+    if ((entry->t_shared ? cxt->shdb_tabhash : cxt->mydb_tabhash) == NULL)
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
-        pgStatXactCommit = 0;
-        pgStatXactRollback = 0;
-        pgStatBlockReadTime = 0;
-        pgStatBlockWriteTime = 0;
+        /*
+         *  Return if we can't get the corresponding dbentry.  It may have
+         *  been removed.
+         */
+        dbent = pgstat_get_db_entry(dboid, table_mode, NULL);
+        if (!dbent)
+            return false;
+
+        /*
+         * We don't hold lock on the dbentry since it cannot be dropped while
+         * we are working on it.
+         */
+        generation = pin_hashes(dbent);
+        tabhash = attach_table_hash(dbent, generation);
+
+        if (entry->t_shared)
+        {
+            cxt->shgeneration = generation;
+            cxt->shdbentry = dbent;
+            cxt->shdb_tabhash = tabhash;
+        }
+        else
+        {
+            cxt->mygeneration = generation;
+            cxt->mydbentry = dbent;
+            cxt->mydb_tabhash = tabhash;
+
+            /*
+             * We come here once per database.  Take the chance to update
+             * database-wide stats.
+             */
+            LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
+            dbent->n_xact_commit += pgStatXactCommit;
+            dbent->n_xact_rollback += pgStatXactRollback;
+            dbent->n_block_read_time += pgStatBlockReadTime;
+            dbent->n_block_write_time += pgStatBlockWriteTime;
+            LWLockRelease(&dbent->lock);
+            pgStatXactCommit = 0;
+            pgStatXactRollback = 0;
+            pgStatBlockReadTime = 0;
+            pgStatBlockWriteTime = 0;
+        }
+    }
+    else if (entry->t_shared)
+    {
+        dbent = cxt->shdbentry;
+        tabhash = cxt->shdb_tabhash;
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        dbent = cxt->mydbentry;
+        tabhash = cxt->mydb_tabhash;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    /*
+     * Local table stats should be applied to both dbentry and tabentry at
+     * once. Update dbentry only if we could update tabentry.
+     */
+    if (pgstat_update_tabentry(tabhash, entry, nowait))
+    {
+        pgstat_update_dbentry(dbent, entry);
+        updated = true;
+    }
+
+    return updated;
 }
 
 /*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
+ * pgstat_flush_funcstats: Flushes function stats.
+ *
+ *  If nowait is true, returns false on lock failure.  Unapplied local hash
+ *  entries are left alone.
+ *
+ *  Returns true if all entries are flushed out.
  */
-static void
-pgstat_send_funcstats(void)
+static bool
+pgstat_flush_funcstats(pgstat_flush_stat_context *cxt, bool nowait)
 {
     /* we assume this inits to all zeroes: */
     static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
+    dshash_table   *funchash;
     HASH_SEQ_STATUS fstat;
+    PgStat_BackendFunctionEntry *bestat;
 
+    /* nothing to do, just return  */
     if (pgStatFunctions == NULL)
-        return;
+        return true;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
-
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    /* get dbentry into cxt if not yet.  */
+    if (cxt->mydbentry == NULL)
     {
-        PgStat_FunctionEntry *m_ent;
+        int op = PGSTAT_EXCLUSIVE;
 
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
+        if (nowait)
+            op |= PGSTAT_NOWAIT;
+
+        cxt->mydbentry = pgstat_get_db_entry(MyDatabaseId, op, NULL);
+
+        if (cxt->mydbentry == NULL)
+            return false;
+
+        cxt->mygeneration = pin_hashes(cxt->mydbentry);
+    }
+
+    funchash = attach_function_hash(cxt->mydbentry, cxt->mygeneration);
+    if (funchash == NULL)
+        return false;
+
+    have_function_stats = false;
+
+    /*
+     * Scan through pgStatFunctions to find functions that actually have
+     * counts, and try flushing them out to shared stats.
+     */
+    hash_seq_init(&fstat, pgStatFunctions);
+    while ((bestat = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    {
+        bool found;
+        PgStat_StatFuncEntry *funcent = NULL;
+
+        /* Skip it if no counts accumulated for it so far */
+        if (memcmp(&bestat->f_counts, &all_zeroes,
                    sizeof(PgStat_FunctionCounts)) == 0)
             continue;
 
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
+        funcent = (PgStat_StatFuncEntry *)
+            dshash_find_or_insert_extended(funchash, (void *) &(bestat->f_id),
+                                           &found, nowait);
 
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
+        /*
+         * We couldn't acquire lock on the required entry. Leave the local
+         * entry alone.
+         */
+        if (!funcent)
         {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
+            have_function_stats = true;
+            continue;
         }
 
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
+        /* Initialize if it's new, or add to it. */
+        if (!found)
+        {
+            funcent->functionid = bestat->f_id;
+            funcent->f_numcalls = bestat->f_counts.f_numcalls;
+            funcent->f_total_time =
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_total_time);
+            funcent->f_self_time =
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_self_time);
+        }
+        else
+        {
+            funcent->f_numcalls += bestat->f_counts.f_numcalls;
+            funcent->f_total_time +=
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_total_time);
+            funcent->f_self_time +=
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_self_time);
+        }
+        dshash_release_lock(funchash, funcent);
+
+        /* reset used counts */
+        MemSet(&bestat->f_counts, 0, sizeof(PgStat_FunctionCounts));
     }
 
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
+    return !have_function_stats;
 }
 
+/*
+ * pgstat_flush_dbstats: Flushes out miscellaneous database stats.
+ *
+ *  If nowait is true, returns false on failure to lock the dbentry.
+ *
+ *  Returns true if all stats are flushed out.
+ */
+static bool
+pgstat_flush_dbstats(pgstat_flush_stat_context *cxt, bool nowait)
+{
+    /* get dbentry if not yet.  */
+    if (cxt->mydbentry == NULL)
+    {
+        int op = PGSTAT_EXCLUSIVE;
+        if (nowait)
+            op |= PGSTAT_NOWAIT;
+
+        cxt->mydbentry = pgstat_get_db_entry(MyDatabaseId, op, NULL);
+
+        /* return if lock failed. */
+        if (cxt->mydbentry == NULL)
+            return false;
+
+        /* we use this generation of table/function stats in this turn */
+        cxt->mygeneration = pin_hashes(cxt->mydbentry);
+    }
+
+    LWLockAcquire(&cxt->mydbentry->lock, LW_EXCLUSIVE);
+    if (HAVE_PENDING_CONFLICTS())
+        pgstat_flush_recovery_conflict(cxt->mydbentry);
+    if (BeDBStats.n_deadlocks != 0)
+        pgstat_flush_deadlock(cxt->mydbentry);
+    if (BeDBStats.n_tmpfiles != 0)
+        pgstat_flush_tempfile(cxt->mydbentry);
+    if (BeDBStats.checksum_failures != NULL)
+        pgstat_flush_checksum_failure(cxt->mydbentry);
+    LWLockRelease(&cxt->mydbentry->lock);
+
+    return true;
+}
 
 /* ----------
  * pgstat_vacuum_stat() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *    Remove objects we can get rid of.
  * ----------
  */
 void
 pgstat_vacuum_stat(void)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
+    HTAB       *oidtab;
+    dshash_seq_status dshstat;
     PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* we don't collect stats under standalone mode */
+    if (!IsUnderPostmaster)
         return;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
     /*
      * Read pg_database and make a list of OIDs of all existing databases
      */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    oidtab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
 
     /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
+     * Search the database hash table for dead databases and drop them
+     * from the hash.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+
+    dshash_seq_init(&dshstat, pgStatDBHash, false, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            dbid = dbentry->databaseid;
 
@@ -1058,137 +1259,43 @@ pgstat_vacuum_stat(void)
 
         /* the DB entry for shared tables (with InvalidOid) is never dropped */
         if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
+            hash_search(oidtab, (void *) &dbid, HASH_FIND, NULL) == NULL)
             pgstat_drop_database(dbid);
     }
 
     /* Clean up */
-    hash_destroy(htab);
+    hash_destroy(oidtab);
 
     /*
      * Lookup our own database entry; if not found, nothing more to do.
      */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_EXCLUSIVE, NULL);
+    if (!dbentry)
         return;
 
     /*
      * Similarly to above, make a list of all known relations in this DB.
      */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
+    oidtab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
 
     /*
      * Check for all tables listed in stats hashtable if they still exist.
+     * The stats cache is of no use here, so search the shared hash directly.
      */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            tabid = tabentry->tableid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
-            continue;
-
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
-    }
+    pgstat_remove_useless_entries(dbentry->tables, &dsh_tblparams, oidtab);
 
     /*
-     * Send the rest
+     * Repeat the above for functions, but we needn't bother in the common
+     * case where no function stats are being collected.
      */
-    if (msg.m_nentries > 0)
+    if (dbentry->functions != DSM_HANDLE_INVALID)
     {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
+        oidtab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
 
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
-        {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
-                continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
-        }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
+        pgstat_remove_useless_entries(dbentry->functions, &dsh_funcparams,
+                                      oidtab);
     }
+    dshash_release_lock(pgStatDBHash, dbentry);
 }
 
 
@@ -1242,66 +1349,99 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     return htab;
 }
 
+/*
+ * pgstat_remove_useless_entries - Remove useless entries from per
+ * table/function dshashes.
+ *
+ *  Scan the dshash specified by dshhandle removing entries that are not in
+ *  oidtab. oidtab is destroyed before returning.
+ */
+void
+pgstat_remove_useless_entries(const dshash_table_handle dshhandle,
+                              const dshash_parameters *dshparams,
+                              HTAB *oidtab)
+{
+    dshash_table *dshtable;
+    dshash_seq_status dshstat;
+    void         *ent;
+
+    dshtable = dshash_attach(area, dshparams, dshhandle, 0);
+    dshash_seq_init(&dshstat, dshtable, false, true);
+
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
+    {
+        CHECK_FOR_INTERRUPTS();
+
+        /* The first member of the entries must be Oid */
+        if (hash_search(oidtab, ent, HASH_FIND, NULL) != NULL)
+            continue;
+
+        /* Not there, so purge this entry */
+        dshash_delete_entry(dshtable, ent);
+    }
+    dshash_detach(dshtable);
+    hash_destroy(oidtab);
+}
 
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
+ *    Remove entry for the database that we just dropped.
+ *
+ *    If some stats are flushed after this, this entry will be re-created but we
+ *    will still clean the dead DB eventually via future invocations of
+ *    pgstat_vacuum_stat().
  * ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    Assert(OidIsValid(databaseid));
+
+    if (!IsUnderPostmaster || !pgStatDBHash)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Lookup the database in the hashtable with exclusive lock.
+     */
+    dbentry = pgstat_get_db_entry(databaseid, PGSTAT_EXCLUSIVE, NULL);
+
+    /*
+     * If found, remove it.
+     */
+    if (dbentry)
+    {
+        /* LWLock is needed to rewrite */
+        LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+
+        /* No one is using tables/functions in this dbentry */
+        Assert(dbentry->refcnt == 0);
+
+        /* Remove table/function stats dshash first. */
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+        }
+        LWLockRelease(&dbentry->lock);
+
+        dshash_delete_entry(pgStatDBHash, (void *) dbentry);
+    }
 }
 
-
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
-
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
-}
-#endif                            /* NOT_USED */
-
-
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1310,20 +1450,32 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    PgStat_StatDBEntry       *dbentry;
+    PgStat_TableLookupResult status;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!pgStatDBHash)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Lookup the database in the hashtable.  Nothing to do if not there.
+     */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_EXCLUSIVE, &status);
+
+    if (!dbentry)
+        return;
+
+    /* This database is active, safe to release the lock immediately. */
+    dshash_release_lock(pgStatDBHash, dbentry);
+
+    /* Reset database-level stats. */
+    reset_dbentry_counters(dbentry);
+
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1332,29 +1484,37 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
+    /* Reset the archiver statistics for the cluster. */
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    /* Reset the bgwriter statistics for the cluster. */
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1363,17 +1523,42 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
+    int generation;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_EXCLUSIVE, NULL);
+
+    if (!dbentry)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* Pin the current generation of table/function hashes before use. */
+    generation = pin_hashes(dbentry);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Set the reset timestamp for the whole database */
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&dbentry->lock);
+
+    /* Remove object if it exists, ignore if not */
+    if (type == RESET_TABLE)
+    {
+        dshash_table *t = attach_table_hash(dbentry, generation);
+        dshash_delete_key(t, (void *) &objoid);
+        dshash_detach(t);
+    }
+
+    if (type == RESET_FUNCTION)
+    {
+        dshash_table *t = attach_function_hash(dbentry, generation);
+        if (t)
+        {
+            dshash_delete_key(t, (void *) &objoid);
+            dshash_detach(t);
+        }
+    }
+    unpin_hashes(dbentry, generation);
 }
 
 /* ----------
@@ -1387,48 +1572,81 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    /*
+     * Store the last autovacuum time in the database's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_EXCLUSIVE, NULL);
+    dshash_release_lock(pgStatDBHash, dbentry);
 
-    pgstat_send(&msg, sizeof(msg));
+    ts = GetCurrentTimestamp();
+
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->last_autovac_time = ts;
+    LWLockRelease(&dbentry->lock);
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report about the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
+    int                    generation;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hash table entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_EXCLUSIVE, NULL);
+    generation = pin_hashes(dbentry);
+    table = attach_table_hash(dbentry, generation);
+
+    tabentry = pgstat_get_tab_entry(table, tableoid, true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->vacuum_count++;
+    }
+    dshash_release_lock(table, tabentry);
+
+    dshash_detach(table);
+    unpin_hashes(dbentry, generation);
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report about the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1439,9 +1657,14 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table       *table;
+    int                    generation;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1470,78 +1693,153 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_EXCLUSIVE, NULL);
+    generation = pin_hashes(dbentry);
+    table = attach_table_hash(dbentry, generation);
+    tabentry = pgstat_get_tab_entry(table, RelationGetRelid(rel), true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    dshash_release_lock(table, tabentry);
+
+    dshash_detach(table);
+    unpin_hashes(dbentry, generation);
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupResult status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            BeDBStats.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            BeDBStats.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            BeDBStats.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            BeDBStats.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            BeDBStats.n_conflict_startup_deadlock++;
+            break;
+    }
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
+
+    if (status == LOCK_FAILED)
+        return;
+
+    /* We had a chance to flush immediately */
+    pgstat_flush_recovery_conflict(dbentry);
+
+    dshash_release_lock(pgStatDBHash, dbentry);
+}
+
+/*
+ * flush recovery conflict stats
+ */
+static void
+pgstat_flush_recovery_conflict(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_conflict_tablespace    += BeDBStats.n_conflict_tablespace;
+    dbentry->n_conflict_lock         += BeDBStats.n_conflict_lock;
+    dbentry->n_conflict_snapshot    += BeDBStats.n_conflict_snapshot;
+    dbentry->n_conflict_bufferpin    += BeDBStats.n_conflict_bufferpin;
+    dbentry->n_conflict_startup_deadlock += BeDBStats.n_conflict_startup_deadlock;
+
+    BeDBStats.n_conflict_tablespace = 0;
+    BeDBStats.n_conflict_lock = 0;
+    BeDBStats.n_conflict_snapshot = 0;
+    BeDBStats.n_conflict_bufferpin = 0;
+    BeDBStats.n_conflict_startup_deadlock = 0;
 }
 
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupResult status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    BeDBStats.n_deadlocks++;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
+
+    if (status == LOCK_FAILED)
+        return;
+    pgstat_flush_deadlock(dbentry);
+    dshash_release_lock(pgStatDBHash, dbentry);
 }
 
-
-
-/* --------
- * pgstat_report_checksum_failures_in_db() -
- *
- *    Tell the collector about one or more checksum failures.
- * --------
+/*
+ * flush deadlock stats
  */
-void
-pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
+static void
+pgstat_flush_deadlock(PgStat_StatDBEntry *dbentry)
 {
-    PgStat_MsgChecksumFailure msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
-
-    pgstat_send(&msg, sizeof(msg));
+    dbentry->n_deadlocks += BeDBStats.n_deadlocks;
+    BeDBStats.n_deadlocks = 0;
 }
 
 /* --------
@@ -1559,60 +1857,153 @@ pgstat_report_checksum_failure(void)
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupResult status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
-}
+    if (filesize > 0) /* Is there a case where filesize is really 0? */
+    {
+        BeDBStats.tmpfilesize += filesize; /* needs overflow check */
+        BeDBStats.n_tmpfiles++;
+    }
 
-
-/* ----------
- * pgstat_ping() -
- *
- *    Send some junk data to the collector to increase traffic.
- * ----------
- */
-void
-pgstat_ping(void)
-{
-    PgStat_MsgDummy msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (BeDBStats.n_tmpfiles == 0)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
+
+    if (status == LOCK_FAILED)
+        return;
+
+    /* We had a chance to flush immediately */
+    pgstat_flush_tempfile(dbentry);
+
+    dshash_release_lock(pgStatDBHash, dbentry);
 }
 
-/* ----------
- * pgstat_send_inquiry() -
- *
- *    Notify collector that we need fresh data.
- * ----------
+/*
+ * flush temporary file stats
  */
 static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
+pgstat_flush_tempfile(PgStat_StatDBEntry *dbentry)
 {
-    PgStat_MsgInquiry msg;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
+    dbentry->n_temp_bytes += BeDBStats.tmpfilesize;
+    dbentry->n_temp_files += BeDBStats.n_tmpfiles;
+    BeDBStats.tmpfilesize = 0;
+    BeDBStats.n_tmpfiles = 0;
 }
 
+/* --------
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
+ *
+ *    Report one or more checksum failures.
+ * --------
+ */
+void
+pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
+{
+    PgStat_StatDBEntry       *dbentry;
+    PgStat_TableLookupResult status;
+    ChecksumFailureEnt       *failent = NULL;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    if (BeDBStats.checksum_failures != NULL)
+    {
+        failent = hash_search(BeDBStats.checksum_failures, &dboid,
+                              HASH_FIND, NULL);
+        if (failent)
+            failurecount += failent->count;
+    }
+
+    if (failurecount == 0)
+        return;
+
+    dbentry = pgstat_get_db_entry(dboid,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
+
+    if (status == LOCK_FAILED)
+    {
+        if (!failent)
+        {
+            if (!BeDBStats.checksum_failures)
+            {
+                HASHCTL    ctl;
+
+                ctl.keysize = sizeof(Oid);
+                ctl.entrysize = sizeof(ChecksumFailureEnt);
+                BeDBStats.checksum_failures =
+                    hash_create("pgstat checksum failure count hash",
+                                32, &ctl, HASH_ELEM | HASH_BLOBS);
+            }
+
+            failent = hash_search(BeDBStats.checksum_failures,
+                                  &dboid, HASH_ENTER, NULL);
+        }
+
+        failent->count = failurecount;
+        return;
+    }
+
+    /* We have a chance to flush immediately */
+    dbentry->n_checksum_failures += failurecount;
+    BeDBStats.checksum_failures = NULL;
+
+    dshash_release_lock(pgStatDBHash, dbentry);
+}
+
+/*
+ * flush checksum failure count for all databases
+ */
+static void
+pgstat_flush_checksum_failure(PgStat_StatDBEntry *dbentry)
+{
+    HASH_SEQ_STATUS     stat;
+    ChecksumFailureEnt *ent;
+    bool                release_dbent;
+
+    if (BeDBStats.checksum_failures == NULL)
+        return;
+
+    hash_seq_init(&stat, BeDBStats.checksum_failures);
+    while ((ent = (ChecksumFailureEnt *) hash_seq_search(&stat)) != NULL)
+    {
+        release_dbent = false;
+
+        if (dbentry->databaseid != ent->dboid)
+        {
+            dbentry = pgstat_get_db_entry(ent->dboid,
+                                          PGSTAT_EXCLUSIVE, NULL);
+            if (!dbentry)
+                continue;
+
+            release_dbent = true;
+        }
+
+        dbentry->n_checksum_failures += ent->count;
+
+        if (release_dbent)
+            dshash_release_lock(pgStatDBHash, dbentry);
+    }
+
+    hash_destroy(BeDBStats.checksum_failures);
+    BeDBStats.checksum_failures = NULL;
+}
 
 /*
  * Initialize function call usage data.
@@ -1764,7 +2155,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1783,6 +2175,24 @@ pgstat_initstats(Relation rel)
     rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
 }
 
+/*
+ * create_tabstat_hash - create local hash as transactional storage
+ */
+static HTAB *
+create_tabstat_hash(void)
+{
+    HASHCTL        ctl;
+
+    MemSet(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(Oid);
+    ctl.entrysize = sizeof(TabStatHashEntry);
+
+    return hash_create("pgstat TabStatusArray lookup hash table",
+                       TABSTAT_QUANTUM,
+                       &ctl,
+                       HASH_ELEM | HASH_BLOBS);
+}
+
 /*
  * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
  */
@@ -1798,18 +2208,7 @@ get_tabstat_entry(Oid rel_id, bool isshared)
      * Create hash table if we don't have it already.
      */
     if (pgStatTabHash == NULL)
-    {
-        HASHCTL        ctl;
-
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
-    }
+        pgStatTabHash = create_tabstat_hash();
 
     /*
      * Find an entry or create a new one.
@@ -2422,30 +2821,33 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
- * ----------
+ *    Find database stats entry on backends. The returned entries are cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_param param =
+    {
+        .hash_name = "local database stats hash",
+        .hash_entsize = sizeof(PgStat_StatDBEntry),
+        .dsh_handle = DSM_HANDLE_INVALID,   /* already attached */
+        .dsh_params = &dsh_dbparams,
+        .hash = &pgStatLocalHash,
+        .dshash = &pgStatDBHash
+    };
 
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    /* If not done for this transaction, take a snapshot of global stats */
+    pgstat_snapshot_global_stats();
+
+    /* caller has no business with snapshot-local members */
+    return (PgStat_StatDBEntry *) snapshot_statentry(&param, dbid);
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
@@ -2458,51 +2860,66 @@ pgstat_fetch_stat_dbentry(Oid dbid)
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* Lookup our database, then look in its table hash table. */
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    if (dbentry == NULL)
+        return NULL;
 
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    dbentry = pgstat_fetch_stat_dbentry(InvalidOid);
+    if (dbentry == NULL)
+        return NULL;
+
+    tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     return NULL;
 }
 
+/* ----------
+ * pgstat_fetch_stat_tabentry_extended() -
+ *
+ *    Find table stats entry on backends. The returned entries are cached until
+ *    transaction end or pgstat_clear_snapshot() is called.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_extended(PgStat_StatDBEntry *dbent, Oid reloid)
+{
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_param param =
+    {
+        .hash_name = "table stats snapshot hash",
+        .hash_entsize = sizeof(PgStat_StatTabEntry),
+        .dsh_handle = DSM_HANDLE_INVALID,
+        .dsh_params = &dsh_tblparams,
+        .hash = NULL,
+        .dshash = NULL
+    };
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    /* set target shared hash */
+    param.dsh_handle = dbent->tables;
+
+    /* tell snapshot_statentry what variables to use */
+    param.hash = &dbent->snapshot_tables;
+    param.dshash = &dbent->dshash_tables;
+
+    return (PgStat_StatTabEntry *)
+        snapshot_statentry(&param, reloid);
+}
+
 
 /* ----------
  * pgstat_fetch_stat_funcentry() -
@@ -2517,21 +2934,90 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     PgStat_StatDBEntry *dbentry;
     PgStat_StatFuncEntry *funcentry = NULL;
 
-    /* load the stats file if needed */
-    backend_read_statsfile();
-
-    /* Lookup our database, then find the requested function.  */
+    /* Lookup our database, then find the requested function */
     dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
+    if (dbentry == NULL)
+        return NULL;
+
+    funcentry = pgstat_fetch_stat_funcentry_extended(dbentry, func_id);
 
     return funcentry;
 }
 
+/* ----------
+ * pgstat_fetch_stat_funcentry_extended() -
+ *
+ *    Find function stats entry on backends. The returned entries are cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
+ *
+ *  dbent is declared as (PgStat_StatDBEntry *), but it must point to a
+ *  PgStat_StatDBEntry returned from pgstat_fetch_stat_dbentry().
+ */
+static PgStat_StatFuncEntry *
+pgstat_fetch_stat_funcentry_extended(PgStat_StatDBEntry *dbent, Oid funcid)
+{
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_param param =
+    {
+        .hash_name = "function stats snapshot hash",
+        .hash_entsize = sizeof(PgStat_StatFuncEntry),
+        .dsh_handle = DSM_HANDLE_INVALID,
+        .dsh_params = &dsh_funcparams,
+        .hash = NULL,
+        .dshash = NULL
+    };
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    if (dbent->functions == DSM_HANDLE_INVALID)
+        return NULL;
+
+    /* set target shared hash */
+    param.dsh_handle = dbent->functions;
+
+    /* tell snapshot_statentry what variables to use */
+    param.hash = &dbent->snapshot_functions;
+    param.dshash = &dbent->dshash_functions;
+
+    return (PgStat_StatFuncEntry *)
+        snapshot_statentry(&param, funcid);
+}
+
+/*
+ * pgstat_snapshot_global_stats() -
+ *
+ * Makes a snapshot of global stats if not done yet.  They will be kept until
+ * subsequent call of pgstat_clear_snapshot() or the end of the current
+ * memory context (typically TopTransactionContext).
+ */
+static void
+pgstat_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext;
+
+    pgstat_attach_shared_stats();
+
+    /* Nothing to do if already done */
+    if (global_snapshot_is_valid)
+        return;
+
+    oldcontext = MemoryContextSwitchTo(pgStatSnapshotContext);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+
+    memcpy(&snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+    LWLockRelease(StatsLock);
+
+    global_snapshot_is_valid = true;
+
+    MemoryContextSwitchTo(oldcontext);
+
+    return;
+}
 
 /* ----------
  * pgstat_fetch_stat_beentry() -
@@ -2603,9 +3089,10 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &archiverStats;
+    return &snapshot_archiverStats;
 }
 
 
@@ -2620,9 +3107,10 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &globalStats;
+    return &snapshot_globalStats;
 }
 
 
@@ -2836,8 +3324,8 @@ pgstat_initialize(void)
         MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
     }
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* needs to be called before dsm shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -2935,7 +3423,7 @@ pgstat_bestart(void)
                 lbeentry.st_backendType = B_STARTUP;
                 break;
             case ArchiverProcess:
-                beentry->st_backendType = B_ARCHIVER;
+                lbeentry.st_backendType = B_ARCHIVER;
                 break;
             case BgWriterProcess:
                 lbeentry.st_backendType = B_BG_WRITER;
@@ -3071,6 +3559,10 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+
+    /* attach shared database stats area */
+    pgstat_attach_shared_stats();
 }
 
 /*
@@ -3106,6 +3598,8 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    pgstat_detach_shared_stats(true);
 }
 
 
@@ -3366,7 +3860,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3661,9 +4156,6 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
-            break;
         case WAIT_EVENT_RECOVERY_WAL_ALL:
             event_name = "RecoveryWalAll";
             break;
@@ -4323,75 +4815,43 @@ pgstat_get_backend_desc(BackendType backendType)
  * ------------------------------------------------------------
  */
 
-
-/* ----------
- * pgstat_setheader() -
- *
- *        Set common header fields in a statistics message
- * ----------
- */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
-    {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
 /* ----------
  * pgstat_send_archiver() -
  *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
+ *        Report archiver statistics
  * ----------
  */
 void
 pgstat_send_archiver(const char *xlog, bool failed)
 {
-    PgStat_MsgArchiver msg;
+    TimestampTz now = GetCurrentTimestamp();
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+    if (failed)
+    {
+        /* Failed archival attempt */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->failed_count;
+        memcpy(shared_archiverStats->last_failed_wal, xlog,
+               sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    else
+    {
+        /* Successful archival operation */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->archived_count;
+        memcpy(shared_archiverStats->last_archived_wal, xlog,
+               sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
 }
 
 /* ----------
  * pgstat_send_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
@@ -4400,6 +4860,8 @@ pgstat_send_bgwriter(void)
     /* We assume this initializes to zeroes */
     static const PgStat_MsgBgWriter all_zeroes;
 
+    PgStat_MsgBgWriter *s = &BgWriterStats;
+
     /*
      * This function can be called even if nothing at all has happened. In
      * this case, avoid sending a completely empty message to the stats
@@ -4408,11 +4870,18 @@ pgstat_send_bgwriter(void)
     if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += s->m_timed_checkpoints;
+    shared_globalStats->requested_checkpoints += s->m_requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += s->m_checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += s->m_checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += s->m_buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += s->m_buf_written_clean;
+    shared_globalStats->maxwritten_clean += s->m_maxwritten_clean;
+    shared_globalStats->buf_written_backend += s->m_buf_written_backend;
+    shared_globalStats->buf_fsync_backend += s->m_buf_fsync_backend;
+    shared_globalStats->buf_alloc += s->m_buf_alloc;
+    LWLockRelease(StatsLock);
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4421,305 +4890,164 @@ pgstat_send_bgwriter(void)
 }
 
 
-/* ----------
- * PgstatCollectorMain() -
+/*
+ * Pin and Unpin dbentry.
  *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
+ * To save memory and time, counters are reset by recreating the dshash
+ * instead of removing entries one by one while holding a whole-dshash lock.
+ * On the other hand, a dshash cannot be destroyed until all referrers have
+ * gone, so other backends could be kept waiting on a counter reset for a
+ * long time. To avoid that, we isolate the hashes under destruction as
+ * another generation, which is no longer used but cannot be removed yet.
+ *
+ * When we start accessing hashes on a dbentry, call pin_hashes() to acquire
+ * the current "generation". unpin_hashes() removes the older generation's
+ * hashes once all referrers have gone.
  */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
+static int
+pin_hashes(PgStat_StatDBEntry *dbentry)
 {
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
+    int    generation;
 
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, pgstat_sighup_handler);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, pgstat_exit);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->refcnt++;
+    generation = dbentry->generation;
+    LWLockRelease(&dbentry->lock);
 
-    /*
-     * Identify myself via ps
-     */
-    init_ps_display("stats collector", "", "", "");
+    dshash_release_lock(pgStatDBHash, dbentry);
 
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
+    return generation;
+}
 
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do recognize got_SIGHUP inside
-     * the inner loop, which means that such interrupts will get serviced but
-     * the latch won't get cleared until next time there is a break in the
-     * action.
-     */
-    for (;;)
+/*
+ * Unpin hashes in dbentry. If the given generation is isolated, destroy it
+ * after all referrers have gone. Otherwise just decrease the reference count.
+ */
+static void
+unpin_hashes(PgStat_StatDBEntry *dbentry, int generation)
+{
+    dshash_table *tables;
+    dshash_table *funcs = NULL;
+
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+
+    /* using current generation, just decrease refcount */
+    if (dbentry->generation == generation)
     {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (need_exit)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * need_exit becomes set.
-         */
-        while (!need_exit)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (got_SIGHUP)
-            {
-                got_SIGHUP = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(
-                                                   &msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(
-                                                   &msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(
-                                                 &msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(
-                                                 &msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr & WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
+        dbentry->refcnt--;
+        LWLockRelease(&dbentry->lock);
+        return;
+    }
 
     /*
-     * Save the final stats to reuse at next startup.
+     * This generation is isolated and is waiting for all referrers to go away.
      */
-    pgstat_write_statsfiles(true, true);
+    Assert(dbentry->generation == generation + 1);
 
-    exit(0);
+    if (--dbentry->prev_refcnt > 0)
+    {
+        LWLockRelease(&dbentry->lock);
+        return;
+    }
+
+    /* no referrer remains, remove the hashes */
+    tables = dshash_attach(area, &dsh_tblparams, dbentry->prev_tables, 0);
+    if (dbentry->prev_functions != DSM_HANDLE_INVALID)
+        funcs = dshash_attach(area, &dsh_funcparams,
+                              dbentry->prev_functions, 0);
+
+    dbentry->prev_tables = DSM_HANDLE_INVALID;
+    dbentry->prev_functions = DSM_HANDLE_INVALID;
+
+    /* release the entry immediately */
+    LWLockRelease(&dbentry->lock);
+
+    dshash_destroy(tables);
+    if (funcs)
+        dshash_destroy(funcs);
+
+    return;
 }
 
-
-/* SIGQUIT signal handler for collector process */
-static void
-pgstat_exit(SIGNAL_ARGS)
+/*
+ * Attach and return the specified generation of the table hash.
+ * The generation must be either the current one or the previous one.
+ */
+static dshash_table *
+attach_table_hash(PgStat_StatDBEntry *dbent, int gen)
 {
-    int            save_errno = errno;
+    dshash_table *ret;
 
-    need_exit = true;
-    SetLatch(MyLatch);
+    LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
 
-    errno = save_errno;
+    if (dbent->generation == gen)
+        ret = dshash_attach(area, &dsh_tblparams, dbent->tables, 0);
+    else
+    {
+        Assert(dbent->generation == gen + 1);
+        Assert(dbent->prev_tables != DSM_HANDLE_INVALID);
+        ret = dshash_attach(area, &dsh_tblparams, dbent->prev_tables, 0);
+    }
+    LWLockRelease(&dbent->lock);
+
+    return ret;
 }
 
-/* SIGHUP handler for collector process */
-static void
-pgstat_sighup_handler(SIGNAL_ARGS)
+/* attach and return the specified generation of function hash */
+static dshash_table *
+attach_function_hash(PgStat_StatDBEntry *dbent, int gen)
 {
-    int            save_errno = errno;
+    dshash_table *ret = NULL;
 
-    got_SIGHUP = true;
-    SetLatch(MyLatch);
 
-    errno = save_errno;
+    LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
+
+    if (dbent->generation == gen)
+    {
+        if (dbent->functions == DSM_HANDLE_INVALID)
+        {
+            dshash_table *funchash =
+                dshash_create(area, &dsh_funcparams, 0);
+            dbent->functions = dshash_get_hash_table_handle(funchash);
+
+            ret = funchash;
+        }
+        else
+            ret = dshash_attach(area, &dsh_funcparams, dbent->functions, 0);
+    }
+    /* don't bother creating useless hash */
+
+    LWLockRelease(&dbent->lock);
+
+    return ret;
+}
+
+static void
+init_dbentry(PgStat_StatDBEntry *dbentry)
+{
+    LWLockInitialize(&dbentry->lock, LWTRANCHE_STATS);
+    dbentry->generation = 0;
+    dbentry->refcnt = 0;
+    dbentry->prev_refcnt = 0;
+    dbentry->tables = DSM_HANDLE_INVALID;
+    dbentry->prev_tables = DSM_HANDLE_INVALID;
+    dbentry->functions = DSM_HANDLE_INVALID;
+    dbentry->prev_functions = DSM_HANDLE_INVALID;
 }
 
 /*
  * Subroutine to clear stats in a database entry
  *
- * Tables and functions hashes are initialized to empty.
+ * Reset all counters in the dbentry. The table and function dshashes are
+ * destroyed.  If any backend is pinning this dbentry, the current dshashes
+ * are stashed in the previous "generation" slot until all accessors are
+ * gone. If the previous generation is already occupied, the current dshashes
+ * are so fresh that they don't need to be cleared.
  */
 static void
 reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
 {
-    HASHCTL        hash_ctl;
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
 
     dbentry->n_xact_commit = 0;
     dbentry->n_xact_rollback = 0;
@@ -4744,72 +5072,865 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
     dbentry->n_block_read_time = 0;
     dbentry->n_block_write_time = 0;
 
+    if (dbentry->refcnt == 0)
+    {
+        /*
+         * No one is referring to the current hash. It's very costly to remove
+         * entries in dshash individually so just destroy the whole.  If
+         * someone pins this entry just afterwards, pin_hashes() returns the
+         * current generation and the attach happens after the LWLock below is
+         * released.
+         */
+        dshash_table *tbl;
+
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+            dbentry->tables = DSM_HANDLE_INVALID;
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+            dbentry->functions = DSM_HANDLE_INVALID;
+        }
+    }
+    else if (dbentry->prev_refcnt == 0)
+    {
+        /*
+         * Someone is still referring to the current hash and previous slot is
+         * vacant. Stash out the current hash to the previous slot.
+         */
+        dbentry->prev_refcnt = dbentry->refcnt;
+        dbentry->prev_tables = dbentry->tables;
+        dbentry->prev_functions = dbentry->functions;
+        dbentry->refcnt = 0;
+        dbentry->tables = DSM_HANDLE_INVALID;
+        dbentry->functions = DSM_HANDLE_INVALID;
+        dbentry->generation++;
+    }
+    else
+    {
+        Assert(dbentry->prev_refcnt > 0 && dbentry->refcnt > 0);
+        /*
+         * If we get here, we have just got another reset request while the
+         * old hashes are still waiting for all referrers to release them. That
+         * should take only a short time, so we can just ignore this request.
+         *
+         * As a side effect, the resetter can see non-zero values before
+         * anyone updates them, but that is indistinguishable from someone
+         * having updated them before the read.
+         */
+    }
+
+    /* Create a new table hash if none exists */
+    if (dbentry->tables == DSM_HANDLE_INVALID)
+    {
+        dshash_table *tbl = dshash_create(area, &dsh_tblparams, 0);
+        dbentry->tables = dshash_get_hash_table_handle(tbl);
+        dshash_detach(tbl);
+    }
+
+    /* Create a new function hash if none exists and one is needed. */
+    if (dbentry->functions == DSM_HANDLE_INVALID &&
+        pgstat_track_functions != TRACK_FUNC_OFF)
+    {
+        dshash_table *tbl = dshash_create(area, &dsh_funcparams, 0);
+        dbentry->functions = dshash_get_hash_table_handle(tbl);
+        dshash_detach(tbl);
+    }
+
     dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
 
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
+    LWLockRelease(&dbentry->lock);
 }
 
 /*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+ * Create the filename for a DB stat file; filename is an output parameter
+ * that points to a character buffer of length len.
  */
-static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
+static void
+get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
 {
-    PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
+    int            printed;
 
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/db_%u.%s",
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
+                       databaseid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
+}
 
-    if (!create && !found)
-        return NULL;
+/* ----------
+ * pgstat_write_statsfiles() -
+ *        Write the global statistics file, as well as DB files.
+ * ----------
+ */
+void
+pgstat_write_statsfiles(void)
+{
+    dshash_seq_status hstat;
+    PgStat_StatDBEntry *dbentry;
+    FILE       *fpout;
+    int32        format_id;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    int            rc;
+
+    /* Stats are not initialized yet; just return. */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
+
+    elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
     /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
+     * Open the statistics temp file to write out the current values.
      */
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        return;
+    }
+
+    /*
+     * Set the timestamp of the stats file.
+     */
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
+
+    /*
+     * Write the file header --- currently just a format ID.
+     */
+    format_id = PGSTAT_FILE_FORMAT_ID;
+    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write global stats struct
+     */
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write archiver stats struct
+     */
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Walk through the database table.
+     */
+    dshash_seq_init(&hstat, pgStatDBHash, false, false);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&hstat)) != NULL)
+    {
+        /*
+         * Write out the table and function stats for this DB into the
+         * appropriate per-DB stat file, if required.
+         */
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
+
+        pgstat_write_pgStatDBHashfile(dbentry);
+
+        /*
+         * Write out the DB entry. We don't write the tables or functions
+         * pointers, since they're of no use to any other process.
+         */
+        fputc('D', fpout);
+        rc = fwrite(dbentry,
+                    offsetof(PgStat_StatDBEntry, generation), 1, fpout);
+        (void) rc;                /* we'll check for error with ferror */
+    }
+
+    /*
+     * No more output to be done. Close the temp file and replace the old
+     * pgstat.stat with it.  The ferror() check replaces testing for error
+     * after each individual fputc or fwrite above.
+     */
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+}
+
+/* ----------
+ * pgstat_write_pgStatDBHashfile() -
+ *        Write the stat file for a single database.
+ * ----------
+ */
+static void
+pgstat_write_pgStatDBHashfile(PgStat_StatDBEntry *dbentry)
+{
+    dshash_seq_status tstat;
+    dshash_seq_status fstat;
+    PgStat_StatTabEntry *tabentry;
+    PgStat_StatFuncEntry *funcentry;
+    FILE       *fpout;
+    int32        format_id;
+    Oid            dbid = dbentry->databaseid;
+    int            rc;
+    char        tmpfile[MAXPGPATH];
+    char        statfile[MAXPGPATH];
+    dshash_table *tbl;
+
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
+
+    elog(DEBUG2, "writing stats file \"%s\"", statfile);
+
+    /*
+     * Open the statistics temp file to write out the current values.
+     */
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        return;
+    }
+
+    /*
+     * Write the file header --- currently just a format ID.
+     */
+    format_id = PGSTAT_FILE_FORMAT_ID;
+    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Walk through the database's access stats per table.
+     */
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&tstat, tbl, false, false);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&tstat)) != NULL)
+    {
+        fputc('T', fpout);
+        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
+        (void) rc;                /* we'll check for error with ferror */
+    }
+    dshash_detach(tbl);
+
+    /*
+     * Walk through the database's function stats table.
+     */
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_seq_init(&fstat, tbl, false, false);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&fstat)) != NULL)
+        {
+            fputc('F', fpout);
+            rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+            (void) rc;                /* we'll check for error with ferror */
+        }
+        dshash_detach(tbl);
+    }
+
+    /*
+     * No more output to be done. Close the temp file and replace the old
+     * per-database stat file with it.  The ferror() check replaces testing for
+     * error after each individual fputc or fwrite above.
+     */
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+}
+
+/* ----------
+ * pgstat_read_statsfiles() -
+ *
+ *    Reads in existing statistics collector files into the shared stats hash.
+ *
+ * ----------
+ */
+void
+pgstat_read_statsfiles(void)
+{
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatDBEntry dbbuf;
+    FILE       *fpin;
+    int32        format_id;
+    bool        found;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+
+    /* shouldn't be called from postmaster */
+    Assert(IsUnderPostmaster);
+
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
+
+    /*
+     * Set the current timestamp (will be kept only in case we can't load an
+     * existing statsfile).
+     */
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp =
+        shared_globalStats->stat_reset_timestamp;
+
+    /*
+     * Try to open the stats file. If it doesn't exist, the backends simply
+     * return zero for anything and the collector simply starts from scratch
+     * with empty counters.
+     *
+     * ENOENT is a possibility if the stats collector is not running or has
+     * not yet written the stats file the first time.  Any other failure
+     * condition is suspicious.
+     */
+    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(LOG,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            statfile)));
+        return;
+    }
+
+    /*
+     * Verify it's of the expected format.
+     */
+    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
+        format_id != PGSTAT_FILE_FORMAT_ID)
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        goto done;
+    }
+
+    /*
+     * Read global stats struct
+     */
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+        sizeof(*shared_globalStats))
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+        goto done;
+    }
+
+    /*
+     * Read archiver stats struct
+     */
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+        sizeof(*shared_archiverStats))
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        goto done;
+    }
+
+    /*
+     * We found an existing collector stats file. Read it and put all the
+     * hashtable entries into place.
+     */
+    for (;;)
+    {
+        switch (fgetc(fpin))
+        {
+                /*
+                 * 'D'    A PgStat_StatDBEntry struct describing a database
+                 * follows.
+                 */
+            case 'D':
+                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, generation),
+                          fpin) != offsetof(PgStat_StatDBEntry, generation))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                /*
+                 * Add to the DB hash
+                 */
+                dbentry = (PgStat_StatDBEntry *)
+                    dshash_find_or_insert(pgStatDBHash, (void *) &dbbuf.databaseid,
+                                          &found);
+
+                /* don't allow duplicate dbentries */
+                if (found)
+                {
+                    dshash_release_lock(pgStatDBHash, dbentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                /* initialize the new shared entry */
+                init_dbentry(dbentry);
+
+                memcpy(dbentry, &dbbuf,
+                       offsetof(PgStat_StatDBEntry, generation));
+
+                /* Read the data from the database-specific file. */
+                pgstat_read_pgStatDBHashfile(dbentry);
+                dshash_release_lock(pgStatDBHash, dbentry);
+                break;
+
+            case 'E':
+                goto done;
+
+            default:
+                ereport(LOG,
+                        (errmsg("corrupted statistics file \"%s\"",
+                                statfile)));
+                goto done;
+        }
+    }
+
+done:
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+
+    return;
+}
+
+
+/* ----------
+ * pgstat_read_pgStatDBHashfile() -
+ *
+ *    Reads in the at-rest statistics file and creates shared statistics
+ *    tables. The file is removed after reading.
+ * ----------
+ */
+static void
+pgstat_read_pgStatDBHashfile(PgStat_StatDBEntry *dbentry)
+{
+    PgStat_StatTabEntry *tabentry;
+    PgStat_StatTabEntry tabbuf;
+    PgStat_StatFuncEntry funcbuf;
+    PgStat_StatFuncEntry *funcentry;
+    dshash_table         *tabhash = NULL;
+    dshash_table         *funchash = NULL;
+    FILE       *fpin;
+    int32        format_id;
+    bool        found;
+    char        statfile[MAXPGPATH];
+
+    get_dbstat_filename(false, dbentry->databaseid, statfile, MAXPGPATH);
+
+    /*
+     * Try to open the stats file. If it doesn't exist, the backends simply
+     * return zero for anything and the collector simply starts from scratch
+     * with empty counters.
+     *
+     * ENOENT is a possibility if the stats collector is not running or has
+     * not yet written the stats file the first time.  Any other failure
+     * condition is suspicious.
+     */
+    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(LOG,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            statfile)));
+        return;
+    }
+
+    /*
+     * Verify it's of the expected format.
+     */
+    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
+        format_id != PGSTAT_FILE_FORMAT_ID)
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        goto done;
+    }
+
+    /*
+     * We found an existing statistics file. Read it and put all the hashtable
+     * entries into place.
+     */
+    for (;;)
+    {
+        switch (fgetc(fpin))
+        {
+                /*
+                 * 'T'    A PgStat_StatTabEntry follows.
+                 */
+            case 'T':
+                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
+                          fpin) != sizeof(PgStat_StatTabEntry))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                if (tabhash == NULL)
+                {
+                    tabhash = dshash_create(area, &dsh_tblparams, 0);
+                    dbentry->tables =
+                        dshash_get_hash_table_handle(tabhash);
+                }
+
+                tabentry = (PgStat_StatTabEntry *)
+                    dshash_find_or_insert(tabhash,
+                                          (void *) &tabbuf.tableid, &found);
+
+                /* don't allow duplicate entries */
+                if (found)
+                {
+                    dshash_release_lock(tabhash, tabentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
+                dshash_release_lock(tabhash, tabentry);
+                break;
+
+                /*
+                 * 'F'    A PgStat_StatFuncEntry follows.
+                 */
+            case 'F':
+                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
+                          fpin) != sizeof(PgStat_StatFuncEntry))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                if (funchash == NULL)
+                {
+                    funchash = dshash_create(area, &dsh_funcparams, 0);
+                    dbentry->functions =
+                        dshash_get_hash_table_handle(funchash);
+                }
+
+                funcentry = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert(funchash,
+                                          (void *) &funcbuf.functionid, &found);
+
+                if (found)
+                {
+                    dshash_release_lock(funchash, funcentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
+                dshash_release_lock(funchash, funcentry);
+                break;
+
+                /*
+                 * 'E'    The EOF marker of a complete stats file.
+                 */
+            case 'E':
+                goto done;
+
+            default:
+                ereport(LOG,
+                        (errmsg("corrupted statistics file \"%s\"",
+                                statfile)));
+                goto done;
+        }
+    }
+
+done:
+    if (tabhash)
+        dshash_detach(tabhash);
+    if (funchash)
+        dshash_detach(funchash);
+
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+}
+
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ *    Create pgStatLocalContext and pgStatSnapshotContext, if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+    if (!pgStatLocalContext)
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+
+    if (!pgStatSnapshotContext)
+        pgStatSnapshotContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Database statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+}
+
+/* ----------
+ * pgstat_clear_snapshot() -
+ *
+ *    Discard any data collected in the current transaction.  Any subsequent
+ *    request will cause new snapshots to be read.
+ *
+ *    This is also invoked during transaction commit or abort to discard
+ *    the no-longer-wanted snapshot.
+ * ----------
+ */
+void
+pgstat_clear_snapshot(void)
+{
+    /* Release memory, if any was allocated */
+    if (pgStatLocalContext)
+    {
+        MemoryContextDelete(pgStatLocalContext);
+
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
+    }
+
+    if (pgStatSnapshotContext)
+        clear_snapshot = true;
+}
+
+static bool
+pgstat_update_tabentry(dshash_table *tabhash, PgStat_TableStatus *stat,
+                       bool nowait)
+{
+    PgStat_StatTabEntry *tabentry;
+    bool    found;
+
+    if (tabhash == NULL)
+        return false;
+
+    tabentry = (PgStat_StatTabEntry *)
+        dshash_find_or_insert_extended(tabhash, (void *) &(stat->t_id),
+                                       &found, nowait);
+
+    /* failed to acquire lock */
+    if (tabentry == NULL)
+        return false;
+
     if (!found)
-        reset_dbentry_counters(result);
+    {
+        /*
+         * If it's a new table entry, initialize counters to the values we
+         * just got.
+         */
+        tabentry->numscans = stat->t_counts.t_numscans;
+        tabentry->tuples_returned = stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched = stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted = stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated = stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted = stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated = stat->t_counts.t_tuples_hot_updated;
+        tabentry->n_live_tuples = stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples = stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze = stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched = stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit = stat->t_counts.t_blocks_hit;
+
+        tabentry->vacuum_timestamp = 0;
+        tabentry->vacuum_count = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_count = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->analyze_count = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+        tabentry->autovac_analyze_count = 0;
+    }
+    else
+    {
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        tabentry->numscans += stat->t_counts.t_numscans;
+        tabentry->tuples_returned += stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched += stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted += stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated += stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted += stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated += stat->t_counts.t_tuples_hot_updated;
+        /* If table was truncated, first reset the live/dead counters */
+        if (stat->t_counts.t_truncated)
+        {
+            tabentry->n_live_tuples = 0;
+            tabentry->n_dead_tuples = 0;
+        }
+        tabentry->n_live_tuples += stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples += stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze += stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched += stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit += stat->t_counts.t_blocks_hit;
+    }
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
+
+    dshash_release_lock(tabhash, tabentry);
+
+    return true;
+}
+
+static void
+pgstat_update_dbentry(PgStat_StatDBEntry *dbentry, PgStat_TableStatus *stat)
+{
+    /*
+     * Add per-table stats to the per-database entry, too.
+     */
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
+    dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
+    dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
+    dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
+    dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
+    dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
+    dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
+    LWLockRelease(&dbentry->lock);
+}
+
+/*
+ * Look up the entry for the specified database in the shared stats hash.
+ * Returns NULL if PGSTAT_NOWAIT is set and the required lock cannot be acquired.
+ */
+static PgStat_StatDBEntry *
+pgstat_get_db_entry(Oid databaseid, int op, PgStat_TableLookupResult *status)
+{
+    PgStat_StatDBEntry *result;
+    bool        nowait = ((op & PGSTAT_NOWAIT) != 0);
+    bool        lock_acquired = true;
+    bool        found = true;
+
+    if (!IsUnderPostmaster || !pgStatDBHash)
+        return NULL;
+
+    /* Lookup or create the hash table entry for this database */
+    if (op & PGSTAT_EXCLUSIVE)
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_or_insert_extended(pgStatDBHash, &databaseid,
+                                           &found, nowait);
+        if (result == NULL)
+            lock_acquired = false;
+        else if (!found)
+        {
+            /*
+             * If not found, initialize the new one.  This creates empty hash
+             * tables, too.
+             */
+            init_dbentry(result);
+            reset_dbentry_counters(result);
+        }
+    }
+    else
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_extended(pgStatDBHash, &databaseid, true, nowait,
+                                 nowait ? &lock_acquired : NULL);
+        if (result == NULL)
+            found = false;
+    }
+
+    /* Set return status if requested */
+    if (status)
+    {
+        if (!lock_acquired)
+        {
+            Assert(nowait);
+            *status = LOCK_FAILED;
+        }
+        else if (!found)
+            *status = NOT_FOUND;
+        else
+            *status = FOUND;
+    }
 
     return result;
 }
 
-
 /*
  * Lookup the hash table entry for the specified table. If no hash
  * table entry exists, initialize it, if the create parameter is true.
  * Else, return NULL.
  */
 static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create)
 {
     PgStat_StatTabEntry *result;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
 
     /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
+    if (create)
+        result = (PgStat_StatTabEntry *)
+            dshash_find_or_insert(table, &tableoid, &found);
+    else                        /* dshash_find() does not set "found" */
+        found = ((result = dshash_find(table, &tableoid, false)) != NULL);
 
     if (!create && !found)
         return NULL;
@@ -4842,1702 +5963,6 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
     return result;
 }
 
-
-/* ----------
- * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
- * ----------
- */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
-{
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    FILE       *fpout;
-    int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-    int            rc;
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Set the timestamp of the stats file.
-     */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Write global stats struct
-     */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Write archiver stats struct
-     */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database table.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        /*
-         * Write out the table and function stats for this DB into the
-         * appropriate per-DB stat file, if required.
-         */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
-
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
-
-        /*
-         * Write out the DB entry. We don't write the tables or functions
-         * pointers, since they're of no use to any other process.
-         */
-        fputc('D', fpout);
-        rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
-}
-
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
-
-/* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
- * ----------
- */
-static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
-{
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpout;
-    int32        format_id;
-    Oid            dbid = dbentry->databaseid;
-    int            rc;
-    char        tmpfile[MAXPGPATH];
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database's access stats per table.
-     */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
-    {
-        fputc('T', fpout);
-        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * Walk through the database's function stats table.
-     */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_statsfiles() -
- *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
- *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
- * ----------
- */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
-
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-
-    /*
-     * Set the current timestamp (will be kept only in case we can't load an
-     * existing statsfile).
-     */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return dbhash;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
-        goto done;
-    }
-
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Add to the DB hash
-                 */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-
-    return dbhash;
-}
-
-
-/* ----------
- * pgstat_read_db_statsfile() -
- *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
- * ----------
- */
-static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
-{
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatTabEntry tabbuf;
-    PgStat_StatFuncEntry funcbuf;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'T'    A PgStat_StatTabEntry follows.
-                 */
-            case 'T':
-                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
-                          fpin) != sizeof(PgStat_StatTabEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
-                break;
-
-                /*
-                 * 'F'    A PgStat_StatFuncEntry follows.
-                 */
-            case 'F':
-                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
-                          fpin) != sizeof(PgStat_StatFuncEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
-                break;
-
-                /*
-                 * 'E'    The EOF marker of a complete stats file.
-                 */
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
-}
-
-
-/* ----------
- * pgstat_setup_memcxt() -
- *
- *    Create pgStatLocalContext, if not already done.
- * ----------
- */
-static void
-pgstat_setup_memcxt(void)
-{
-    if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
-}
-
-
-/* ----------
- * pgstat_clear_snapshot() -
- *
- *    Discard any data collected in the current transaction.  Any subsequent
- *    request will cause new snapshots to be read.
- *
- *    This is also invoked during transaction commit or abort to discard
- *    the no-longer-wanted snapshot.
- * ----------
- */
-void
-pgstat_clear_snapshot(void)
-{
-    /* Release memory, if any was allocated */
-    if (pgStatLocalContext)
-        MemoryContextDelete(pgStatLocalContext);
-
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
-}
-
 /*
  * Convert a potentially unsafely truncated activity string (see
  * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 582434252f..bb438df2fc 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -255,7 +255,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -503,7 +502,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1249,66 +1247,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Forcibly remove the files signaling a standby promotion request.
-     * Otherwise, the existence of those files triggers a promotion too early,
-     * whether a user wants that or not.
-     *
-     * This removal of files is usually unnecessary because they can exist
-     * only during a few moments during a standby promotion. However there is
-     * a race condition: if pg_ctl promote is executed and creates the files
-     * during a promotion, the files can stay around even after the server is
-     * brought up to new master. Then, if new standby starts by using the
-     * backup taken from that master, the files can exist at the server
-     * startup and should be removed in order to avoid an unexpected
-     * promotion.
-     *
-     * Note that promotion signal files need to be removed before the startup
-     * process is invoked. Because, after that, they can be used by
-     * postmaster's SIGUSR1 signal handler.
-     */
-    RemovePromoteSignalFiles();
-
-    /* Do the same for logrotate signal file */
-    RemoveLogrotateSignalFiles();
-
-    /* Remove any outdated file holding the current log filenames. */
-    if (unlink(LOG_METAINFO_DATAFILE) < 0 && errno != ENOENT)
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not remove file \"%s\": %m",
-                        LOG_METAINFO_DATAFILE)));
-
-    /*
-     * If enabled, start up syslogger collection subprocess
-     */
-    SysLoggerPID = SysLogger_Start();
-
-    /*
-     * Reset whereToSendOutput from DestDebug (its starting state) to
-     * DestNone. This stops ereport from sending log messages to stderr unless
-     * Log_destination permits.  We don't do this until the postmaster is
-     * fully launched, since startup failures may as well be reported to
-     * stderr.
-     *
-     * If we are in fact disabling logging to stderr, first emit a log message
-     * saying so, to provide a breadcrumb trail for users who may not remember
-     * that their logging is configured to go somewhere else.
-     */
-    if (!(Log_destination & LOG_DESTINATION_STDERR))
-        ereport(LOG,
-                (errmsg("ending log output to stderr"),
-                 errhint("Future log output will go to log destination \"%s\".",
-                         Log_destination_string)));
-
-    whereToSendOutput = DestNone;
-
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1757,11 +1695,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2646,8 +2579,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -2994,8 +2925,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3062,13 +2991,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3143,22 +3065,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3617,22 +3523,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3828,8 +3718,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3864,8 +3752,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4066,8 +3953,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5040,18 +4925,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5164,12 +5037,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6058,7 +5925,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6114,8 +5980,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6350,7 +6214,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 885370698f..cfb3b91b11 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -147,6 +147,7 @@ CreateSharedMemoryAndSemaphores(void)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -263,6 +264,7 @@ CreateSharedMemoryAndSemaphores(void)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index fb0bf44264..b423aaaf02 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -522,6 +522,7 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
     LWLockRegisterTranche(LWTRANCHE_SXACT, "serializable_xact");
+    LWLockRegisterTranche(LWTRANCHE_STATS, "activity stats");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index e8d8e6f828..bec27c3034 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3159,6 +3159,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3733,6 +3739,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4173,9 +4180,17 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
-                ProcessCompletedNotifies();
-                pgstat_report_stat(false);
+                long stats_timeout;
 
+                ProcessCompletedNotifies();
+
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4210,7 +4225,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4218,6 +4233,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 3bf96de256..9c694f20c9 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 29c5ec7b58..66c6a2b1e8 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -74,6 +74,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -631,6 +632,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1241,6 +1244,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b7d36b65dd..13be46c172 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -6,7 +6,7 @@ use File::Basename qw(basename dirname);
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 106;
+use Test::More tests => 105;
 
 program_help_ok('pg_basebackup');
 program_version_ok('pg_basebackup');
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1f4db67f3f..43250c3885 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 65713abc2b..c9fbcead3f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL statistics collector facility.
  *
  *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
  *
@@ -13,10 +13,11 @@
 
 #include "datatype/timestamp.h"
 #include "libpq/pqcomm.h"
-#include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -40,33 +41,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -77,9 +51,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -115,13 +88,6 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_blocks_hit;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -180,236 +146,12 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
 /* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
- */
-typedef struct PgStat_MsgHdr
-{
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
+ * PgStat_MsgBgWriter            bgwriter statistics
  * ----------
  */
 typedef struct PgStat_MsgBgWriter
 {
-    PgStat_MsgHdr m_hdr;
-
     PgStat_Counter m_timed_checkpoints;
     PgStat_Counter m_requested_checkpoints;
     PgStat_Counter m_buf_written_checkpoints;
@@ -422,38 +164,14 @@ typedef struct PgStat_MsgBgWriter
     PgStat_Counter m_checkpoint_sync_time;
 } PgStat_MsgBgWriter;
 
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
-
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -485,96 +203,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Statistic collector data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -614,16 +244,29 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_block_write_time;
 
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;        /* time of db stats update */
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following members must be last in the struct, because we don't
+     * write them out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    int generation;                        /* current generation of the below */
+    int    refcnt;                            /* current gen reference count */
+    dshash_table_handle tables;            /* current gen tables hash */
+    dshash_table_handle functions;        /* current gen functions hash */
+    int    prev_refcnt;                    /* prev gen reference count */
+    dshash_table_handle prev_tables;    /* prev gen tables hash */
+    dshash_table_handle prev_functions;    /* prev gen functions hash */
+    LWLock    lock;                        /* Lock for the above members */
+
+    /* non-shared members */
+    HTAB *snapshot_tables;                 /* table entry snapshot */
+    HTAB *snapshot_functions;             /* function entry snapshot */
+    dshash_table    *dshash_tables;         /* attached tables dshash */
+    dshash_table    *dshash_functions;     /* attached functions dshash */
 } PgStat_StatDBEntry;
 
+#define SHARED_DBENT_SIZE offsetof(PgStat_StatDBEntry, snapshot_tables)
 
 /* ----------
  * PgStat_StatTabEntry            The collector's data per table (or index)
@@ -662,7 +305,7 @@ typedef struct PgStat_StatTabEntry
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            per function stats data
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
@@ -677,7 +320,7 @@ typedef struct PgStat_StatFuncEntry
 
 
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -693,7 +336,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -779,7 +422,6 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
     WAIT_EVENT_RECOVERY_WAL_ALL,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
@@ -1214,6 +856,8 @@ extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used, but will be removed with GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
@@ -1235,29 +879,26 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
 
+/* File input/output functions  */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 
 extern void pgstat_report_autovac(Oid dboid);
@@ -1429,11 +1070,13 @@ extern void pgstat_send_bgwriter(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_extended(PgStat_StatDBEntry *dbent, Oid relid);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index f627dfedc5..97801f4791 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -220,6 +220,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_SXACT,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 9244a2a7b7..a9b625211b 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.16.3
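
The SHARED_DBENT_SIZE macro in the pgstat.h hunk above relies on a common C layout idiom: keep the shareable members at the front of the struct and take offsetof() of the first process-local member to get the size of the shareable prefix. The following is a minimal sketch of that idiom only. DemoDBEntry, DEMO_SHARED_SIZE and demo_copy_shared are hypothetical names invented for illustration, and how the patch actually uses SHARED_DBENT_SIZE is not shown in this part of the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for PgStat_StatDBEntry. */
typedef struct DemoDBEntry
{
    /* shared members: meaningful to every process */
    int         generation;
    int         refcnt;
    long        n_xact_commit;

    /* non-shared members: pointers valid only in the owning process */
    void       *snapshot_tables;
    void       *snapshot_functions;
} DemoDBEntry;

/* Size of the leading, shareable part, in the style of SHARED_DBENT_SIZE. */
#define DEMO_SHARED_SIZE offsetof(DemoDBEntry, snapshot_tables)

/* Copy only the shared prefix; the local members of dst are left untouched. */
static void
demo_copy_shared(DemoDBEntry *dst, const DemoDBEntry *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, DEMO_SHARED_SIZE);
}
```

The idiom only works while every process-local member stays after the last shared one, which is why the struct comment insists on member order.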

From 1b7207c2069debf4888a3d526554b2354ccca855 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 27 Nov 2018 14:42:12 +0900
Subject: [PATCH v22 5/5] Remove the GUC stats_temp_directory

The GUC used to specify the directory to store temporary statistics
files. It is no longer needed by the stats collector but is still used by
the programs in bin and contrib, and maybe other extensions. Thus this
patch removes the GUC, but some backing variables and macro definitions
are left alone for backward compatibility.
---
 doc/src/sgml/backup.sgml                      |  2 --
 doc/src/sgml/config.sgml                      | 19 -------------
 doc/src/sgml/monitoring.sgml                  |  7 +----
 doc/src/sgml/storage.sgml                     |  3 +-
 src/backend/postmaster/pgstat.c               | 13 ++++-----
 src/backend/replication/basebackup.c          | 13 ++-------
 src/backend/utils/misc/guc.c                  | 41 ---------------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/pgstat.h                          |  5 +++-
 src/test/perl/PostgresNode.pm                 |  4 ---
 10 files changed, 14 insertions(+), 94 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index bdc9026c62..2885540362 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1146,8 +1146,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 7f9ce8fcba..a8bed31232 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6818,25 +6818,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index ea6aad4d1e..33ad2b8be8 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -195,12 +195,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
+   <productname>PostgreSQL</productname> processes through shared memory.
    When the server shuts down cleanly, a permanent copy of the statistics
    data is stored in the <filename>pg_stat</filename> subdirectory, so that
    statistics can be retained across server restarts.  When recovery is
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 1c19e863d2..2f04bb68bb 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -122,8 +122,7 @@ Item
 
 <row>
  <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
+ <entry>Subdirectory containing ephemeral files for extensions</entry>
 </row>
 
 <row>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index c0b20763b0..6b8025ad13 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -108,15 +108,12 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
+/*
+ * This used to be a GUC variable and is no longer used in this file, but is
+ * left alone, holding the default value, just for backward compatibility
+ * with extensions.
  */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+char       *pgstat_stat_directory = PG_STAT_TMP_DIR;
 
 #define        StatsLock (&StatsShmem->StatsMainLock)
 
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 6aab8d7b5f..2eb49924b9 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -242,11 +242,8 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -277,13 +274,9 @@ perform_base_backup(basebackup_options *opt)
          * Calculate the relative path of temporary statistics directory in
          * order to skip the files which are located in that directory later.
          */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
+
+        Assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
+        statrelpath = psprintf("./%s", PG_STAT_TMP_DIR);
 
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 90ffd89339..753e30ebb7 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -194,7 +194,6 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static void assign_effective_io_concurrency(int newval, void *extra);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -4072,17 +4071,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11352,35 +11340,6 @@ assign_effective_io_concurrency(int newval, void *extra)
 #endif                            /* USE_PREFETCH */
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 0fc23e3a61..66f539c4bb 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -558,7 +558,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c9fbcead3f..e9e18ed27a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -30,7 +30,10 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
+/*
+ * This used to be the directory to store temporary statistics data in but is
+ * no longer used. Defined here for backward compatibility.
+ */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
 
 /* Values for track_functions GUC variable --- order is significant! */
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 270bd6c856..c604c5e90b 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -455,10 +455,6 @@ sub init
     print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
       if defined $ENV{TEMP_CONFIG};
 
-    # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
-    # concurrently must not share a stats_temp_directory.
-    print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
     if ($params{allows_streaming})
     {
         if ($params{allows_streaming} eq "logical")
-- 
2.16.3


Re: shared-memory based stats collector

From
Alvaro Herrera
Date:
On 2019-Sep-10, Kyotaro Horiguchi wrote:

> At Tue, 3 Sep 2019 18:28:05 -0400, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
<20190903222805.GA13932@alvherre.pgsql>
> > > Found a bug in initialization. StatsShmemInit() was placed at a
> > > wrong place and stats code on child processes accessed
> > > uninitialized pointer. It is a leftover from the previous shape
> > > where dsm was activated on postmaster.
> > 
> > This doesn't apply anymore.  Can you please rebase?
> 
> Thanks! I forgot to post rebased version after doing. Here it is.
> 
> - (Re)Rebased to the current master.
> - Passed all tests for me.

This seems to have very trivial conflicts -- please rebase again?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
At Wed, 25 Sep 2019 18:01:02 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
<20190925210102.GA26396@alvherre.pgsql>
> On 2019-Sep-10, Kyotaro Horiguchi wrote:
> 
> > At Tue, 3 Sep 2019 18:28:05 -0400, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
<20190903222805.GA13932@alvherre.pgsql>
> > > > Found a bug in initialization. StatsShmemInit() was placed at a
> > > > wrong place and stats code on child processes accessed
> > > > uninitialized pointer. It is a leftover from the previous shape
> > > > where dsm was activated on postmaster.
> > > 
> > > This doesn't apply anymore.  Can you please rebase?
> > 
> > Thanks! I forgot to post rebased version after doing. Here it is.
> > 
> > - (Re)Rebased to the current master.
> > - Passed all tests for me.
> 
> This seems to have very trivial conflicts -- please rebase again?

The patch was affected by the code movement in 9a86f03b4e. I have
just rebased it. Thanks.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 70dfe750e365fa9ba15312662b72a13326fb22e3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:41:04 +0900
Subject: [PATCH v23 1/5] sequential scan for dshash

Add sequential scan feature to dshash.
---
 src/backend/lib/dshash.c | 188 ++++++++++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  23 +++++-
 2 files changed, 206 insertions(+), 5 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 350f8c0a66..4f0c7ec840 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -112,6 +112,7 @@ struct dshash_table
     size_t        size_log2;        /* log2(number of buckets) */
     bool        find_locked;    /* Is any partition lock held by 'find'? */
     bool        find_exclusively_locked;    /* ... exclusively? */
+    bool        seqscan_running;/* now under sequential scan */
 };
 
 /* Given a pointer to an item, find the entry (user data) it holds. */
@@ -127,6 +128,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +158,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -228,6 +237,7 @@ dshash_create(dsa_area *area, const dshash_parameters *params, void *arg)
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
 
     /*
      * Set up the initial array of buckets.  Our initial size is the same as
@@ -279,6 +289,7 @@ dshash_attach(dsa_area *area, const dshash_parameters *params,
     hash_table->control = dsa_get_address(area, control);
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
     Assert(hash_table->control->magic == DSHASH_MAGIC);
 
     /*
@@ -324,7 +335,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
                                 LW_EXCLUSIVE));
 
     delete_item(hash_table, item);
-    hash_table->find_locked = false;
-    hash_table->find_exclusively_locked = false;
-    LWLockRelease(PARTITION_LOCK(hash_table, partition));
+
+    /* We need to keep the partition lock during a sequential scan */
+    if (!hash_table->seqscan_running)
+    {
+        hash_table->find_locked = false;
+        hash_table->find_exclusively_locked = false;
+        LWLockRelease(PARTITION_LOCK(hash_table, partition));
+    }
 }
 
 /*
@@ -568,6 +584,8 @@ dshash_release_lock(dshash_table *hash_table, void *entry)
     Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition_index),
                                 hash_table->find_exclusively_locked
                                 ? LW_EXCLUSIVE : LW_SHARED));
+    /* lock is under control of sequential scan */
+    Assert(!hash_table->seqscan_running);
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
@@ -592,6 +610,168 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should be called if and only if the scan is abandoned
+ * before completion; if dshash_seq_next returns NULL then it has already done
+ * the end-of-scan cleanup.
+ *
+ * On returning element, it is locked as is the case with dshash_find.
+ * However, the caller must not release the lock. The lock is released as
+ * necessary in continued scan.
+ *
+ * As opposed to the equivalent for dynahash, the caller is not supposed to
+ * delete the returned element before continuing the scan.
+ *
+ * If consistent is set for dshash_seq_init, the whole hash table is
+ * non-exclusively locked. Otherwise a part of the hash table is locked in the
+ * same mode (partition lock).
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool consistent, bool exclusive)
+{
+    /* allowed at most one scan at once */
+    Assert(!hash_table->seqscan_running);
+
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->consistent = consistent;
+    status->exclusive = exclusive;
+    hash_table->seqscan_running = true;
+
+    /*
+     * Protect all partitions from modification if the caller wants a
+     * consistent result.
+     */
+    if (consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+        {
+            Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i)));
+
+            LWLockAcquire(PARTITION_LOCK(hash_table, i),
+                          exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        }
+        ensure_valid_bucket_pointers(hash_table);
+    }
+}
+
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    Assert(status->hash_table->seqscan_running);
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert (status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        if (!status->consistent)
+        {
+            partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+            LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            status->curpartition = partition;
+
+            /* resize doesn't happen from now until seq scan ends */
+            status->nbuckets =
+                NUM_BUCKETS(status->hash_table->control->size_log2);
+            ensure_valid_bucket_pointers(status->hash_table);
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            dshash_seq_term(status);
+            return NULL;
+        }
+
+        /* Also move partition lock if needed */
+        if (!status->consistent)
+        {
+            int next_partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+
+            /* Move lock along with partition for the bucket */
+            if (status->curpartition != next_partition)
+            {
+                /*
+                 * Take lock on the next partition then release the current,
+                 * not in the reverse order. This is required to avoid
+                 * resizing from happening during a sequential scan. Locks are
+                 * taken in partition order so no deadlock happens with other
+                 * seq scans or resizing.
+                 */
+                LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                             next_partition),
+                              status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+                LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                             status->curpartition));
+                status->curpartition = next_partition;
+            }
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * This item can be deleted by the caller. Store the next item for the
+     * next iteration for the occasion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    Assert(status->hash_table->seqscan_running);
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+    status->hash_table->seqscan_running = false;
+
+    if (status->consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+            LWLockRelease(PARTITION_LOCK(status->hash_table, i));
+    }
+    else if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index fa2e28ff3e..79698a6ad6 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,23 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state of dshash. The detail is exposed since the storage
+ * size should be known to users but it should be considered as an opaque
+ * type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;
+    int                    curbucket;
+    int                    nbuckets;
+    dshash_table_item  *curitem;
+    dsa_pointer            pnextitem;
+    int                    curpartition;
+    bool                consistent;
+    bool                exclusive;
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
                                    const dshash_parameters *params,
@@ -70,7 +87,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
 extern void dshash_detach(dshash_table *hash_table);
 extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
 extern void dshash_destroy(dshash_table *hash_table);
-
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
@@ -80,6 +96,11 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool consistent, bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.16.3
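
Note on the sequential scan interface declared above: the following is a
minimal standalone sketch, not code from the patch. It models the locking
discipline of an init/next/term scan over a partitioned table. All toy_*
names, the fixed-size table and the counter used as a "lock" are assumptions
made for illustration; the real dshash uses LWLocks and DSA pointers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NPARTITIONS 4
#define TOY_PER_PART    2

/* Hypothetical miniature table; lock_depth > 0 means "partition locked". */
typedef struct toy_table
{
    int values[TOY_NPARTITIONS][TOY_PER_PART];
    int nitems[TOY_NPARTITIONS];
    int lock_depth[TOY_NPARTITIONS];
} toy_table;

typedef struct toy_seq_status
{
    toy_table *table;
    int        curpartition;    /* -1 until the scan has started */
    int        curitem;         /* next slot to return in curpartition */
    bool       consistent;
} toy_seq_status;

static int
toy_locked_count(const toy_table *t)
{
    int n = 0;

    for (int i = 0; i < TOY_NPARTITIONS; i++)
        n += (t->lock_depth[i] > 0);
    return n;
}

static void
toy_seq_init(toy_seq_status *st, toy_table *t, bool consistent)
{
    st->table = t;
    st->curpartition = -1;
    st->curitem = 0;
    st->consistent = consistent;
    if (consistent)             /* lock everything for the whole scan */
        for (int i = 0; i < TOY_NPARTITIONS; i++)
            t->lock_depth[i]++;
}

static int *
toy_seq_next(toy_seq_status *st)
{
    toy_table *t = st->table;

    for (;;)
    {
        int cur = st->curpartition;

        if (cur >= 0 && st->curitem < t->nitems[cur])
            return &t->values[cur][st->curitem++];
        if (cur + 1 >= TOY_NPARTITIONS)
            return NULL;        /* finished; toy_seq_term() unlocks */
        if (!st->consistent)
        {
            t->lock_depth[cur + 1]++;   /* take the next lock first ... */
            if (cur >= 0)
                t->lock_depth[cur]--;   /* ... then drop the previous one */
        }
        st->curpartition = cur + 1;
        st->curitem = 0;
    }
}

static void
toy_seq_term(toy_seq_status *st)
{
    toy_table *t = st->table;

    if (st->consistent)
        for (int i = 0; i < TOY_NPARTITIONS; i++)
            t->lock_depth[i]--;
    else if (st->curpartition >= 0)
        t->lock_depth[st->curpartition]--;
}

/* Typical caller: init, loop on next until NULL, then term. */
static int
toy_scan_sum(toy_table *t, bool consistent)
{
    toy_seq_status st;
    int           *v;
    int            sum = 0;

    toy_seq_init(&st, t, consistent);
    while ((v = toy_seq_next(&st)) != NULL)
        sum += *v;
    toy_seq_term(&st);
    return sum;
}
```

In this model a consistent scan holds every partition lock from init to
term, while a non-consistent scan holds at most two locks at a time, and
only briefly during the hand-over from one partition to the next.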

From ce97ee23026c410b199358cbe472dff06177dc40 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 27 Sep 2018 11:15:19 +0900
Subject: [PATCH v23 2/5] Add conditional lock feature to dshash

Dshash currently waits for lock unconditionally. This commit adds new
interfaces for dshash_find and dshash_find_or_insert. The new
interfaces have an extra parameter "nowait" that tells them not to wait
for the lock.
---
 src/backend/lib/dshash.c | 69 +++++++++++++++++++++++++++++++++++++++++++-----
 src/include/lib/dshash.h |  6 +++++
 2 files changed, 68 insertions(+), 7 deletions(-)
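
Note on the nowait contract: the sketch below is a standalone model, not
code from the patch. It mirrors the control flow of dshash_find_extended()
on a single partition. The toy_* names, the one-slot partition and the C11
atomic_flag spinlock are assumptions made for illustration; the patch
itself uses LWLockAcquire() and LWLockConditionalAcquire().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one hash partition: a lock and a single slot. */
typedef struct toy_partition
{
    atomic_flag lock;
    int         key;
    int         value;
    bool        used;
} toy_partition;

/* Try once; true means we now hold the lock. */
static bool
toy_lock_conditional(toy_partition *part)
{
    return !atomic_flag_test_and_set(&part->lock);
}

/* Spin until the lock is ours (stands in for an unconditional wait). */
static void
toy_lock(toy_partition *part)
{
    while (atomic_flag_test_and_set(&part->lock))
        ;
}

static void
toy_release_lock(toy_partition *part)
{
    atomic_flag_clear(&part->lock);
}

/*
 * Returns a pointer to the value with the lock still held when the key is
 * found.  Returns NULL with no lock held when the key is missing or when
 * nowait is true and the lock was busy; *lock_acquired tells the two apart.
 */
static int *
toy_find_extended(toy_partition *part, int key, bool nowait,
                  bool *lock_acquired)
{
    /* lock_acquired is meaningful only together with nowait */
    assert(nowait || lock_acquired == NULL);

    if (nowait)
    {
        if (!toy_lock_conditional(part))
        {
            if (lock_acquired)
                *lock_acquired = false;
            return NULL;
        }
    }
    else
        toy_lock(part);

    if (lock_acquired)
        *lock_acquired = true;

    if (part->used && part->key == key)
        return &part->value;    /* caller must call toy_release_lock() */

    toy_release_lock(part);
    return NULL;
}
```

The point of the extra output flag is that a NULL result alone is
ambiguous: a caller that wants to retry later needs to know whether the
entry is really absent or the partition was merely busy.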

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 4f0c7ec840..60a6e3c0bc 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -394,19 +394,48 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  */
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
+{
+    return dshash_find_extended(hash_table, key, exclusive, false, NULL);
+}
+
+/*
+ * Like dshash_find, but returns NULL immediately when nowait is true and the
+ * lock was not acquired. Lock status is set to *lock_acquired if not NULL.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool *lock_acquired)
 {
     dshash_hash hash;
     size_t        partition;
     dshash_table_item *item;
 
+    /* lock_acquired may be passed only together with nowait */
+    Assert(nowait || !lock_acquired);
+
     hash = hash_key(hash_table, key);
     partition = PARTITION_FOR_HASH(hash);
 
     Assert(hash_table->control->magic == DSHASH_MAGIC);
     Assert(!hash_table->find_locked);
 
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partition),
+                                      exclusive ? LW_EXCLUSIVE : LW_SHARED))
+        {
+            if (lock_acquired)
+                *lock_acquired = false;
+            return NULL;
+        }
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition),
+                      exclusive ? LW_EXCLUSIVE : LW_SHARED);
+
+    if (lock_acquired)
+        *lock_acquired = true;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -441,6 +470,22 @@ void *
 dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
+{
+    return dshash_find_or_insert_extended(hash_table, key, found, false);
+}
+
+/*
+ * Like dshash_find_or_insert, but returns NULL if nowait is true and the lock
+ * was not acquired.
+ *
+ * Notes above dshash_find_extended() regarding locking and error handling
+ * equally apply here.
+ */
+void *
+dshash_find_or_insert_extended(dshash_table *hash_table,
+                               const void *key,
+                               bool *found,
+                               bool nowait)
 {
     dshash_hash hash;
     size_t        partition_index;
@@ -455,8 +500,16 @@ dshash_find_or_insert(dshash_table *hash_table,
     Assert(!hash_table->find_locked);
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(
+                PARTITION_LOCK(hash_table, partition_index),
+                LW_EXCLUSIVE))
+            return NULL;
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
+                      LW_EXCLUSIVE);
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -626,9 +679,11 @@ dshash_memhash(const void *v, size_t size, void *arg)
  * As opposed to the equivalent for dynanash, the caller is not supposed to
  * delete the returned element before continuing the scan.
  *
- * If consistent is set for dshash_seq_init, the whole hash table is
- * non-exclusively locked. Otherwise a part of the hash table is locked in the
- * same mode (partition lock).
+ * If consistent is set for dshash_seq_init, all the hash table
+ * partitions are locked in the requested mode (as determined by the
+ * exclusive flag), and the locks are held until the end of the scan.
+ * Otherwise the partition locks are acquired and released as needed
+ * during the scan (up to two partitions may be locked at the same time).
  */
 void
 dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 79698a6ad6..67f7d77f71 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -90,8 +90,14 @@ extern void dshash_destroy(dshash_table *hash_table);
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+                                  bool exclusive, bool nowait,
+                                  bool *lock_acquired);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                                    const void *key, bool *found);
+extern void *dshash_find_or_insert_extended(dshash_table *hash_table,
+                                            const void *key, bool *found,
+                                            bool nowait);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.16.3

From 1db30db706a405b03048194a7c25216149e1abaf Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 7 Nov 2018 16:53:49 +0900
Subject: [PATCH v23 3/5] Make archiver process an auxiliary process

This is a preliminary patch for the shared-memory based stats collector.
The archiver process must be an auxiliary process since it uses shared
memory once stats data is moved into shared memory. Make the process
an auxiliary process in order to make it work.
---
 src/backend/bootstrap/bootstrap.c   |  8 +++
 src/backend/postmaster/pgarch.c     | 98 +++++++++----------------------------
 src/backend/postmaster/pgstat.c     |  6 +++
 src/backend/postmaster/postmaster.c | 35 +++++++++----
 src/include/miscadmin.h             |  2 +
 src/include/pgstat.h                |  1 +
 src/include/postmaster/pgarch.h     |  4 +-
 7 files changed, 67 insertions(+), 87 deletions(-)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 9238fbe98d..dde2485b14 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -329,6 +329,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case BgWriterProcess:
                 statmsg = pgstat_get_backend_desc(B_BG_WRITER);
                 break;
+            case ArchiverProcess:
+                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
+                break;
             case CheckpointerProcess:
                 statmsg = pgstat_get_backend_desc(B_CHECKPOINTER);
                 break;
@@ -456,6 +459,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
             BackgroundWriterMain();
             proc_exit(1);        /* should never return */
 
+        case ArchiverProcess:
+            /* don't set signals, archiver has its own agenda */
+            PgArchiverMain();
+            proc_exit(1);        /* should never return */
+
         case CheckpointerProcess:
             /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index f84f882c4c..4342ebdab4 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -77,7 +77,6 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
@@ -96,7 +95,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgarch_exit(SIGNAL_ARGS);
 static void ArchSigHupHandler(SIGNAL_ARGS);
 static void ArchSigTermHandler(SIGNAL_ARGS);
@@ -114,75 +112,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -222,8 +151,8 @@ pgarch_forkexec(void)
  *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
  *    since we don't use 'em, it hardly matters...
  */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -255,8 +184,27 @@ PgArchiverMain(int argc, char *argv[])
 static void
 pgarch_exit(SIGNAL_ARGS)
 {
-    /* SIGQUIT means curl up and die ... */
-    exit(1);
+    PG_SETMASK(&BlockSig);
+
+    /*
+     * We DO NOT want to run proc_exit() callbacks -- we're here because
+     * shared memory may be corrupted, so we don't want to try to clean up our
+     * transaction.  Just nail the windows shut and get out of town.  Now that
+     * there's an atexit callback to prevent third-party code from breaking
+     * things by calling exit() directly, we have to reset the callbacks
+     * explicitly to make this work as intended.
+     */
+    on_exit_reset();
+
+    /*
+     * Note we do exit(2) not exit(0).  This is to force the postmaster into a
+     * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+     * backend.  This is necessary precisely because we don't clean up our
+     * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+     * should ensure the postmaster sees this as a crash, too, but no harm in
+     * being doubly sure.)
+     */
+    exit(2);
 }
 
 /* SIGHUP signal handler for archiver process */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 011076c3e3..043e3ff9d2 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2934,6 +2934,9 @@ pgstat_bestart(void)
             case StartupProcess:
                 lbeentry.st_backendType = B_STARTUP;
                 break;
+            case ArchiverProcess:
+                lbeentry.st_backendType = B_ARCHIVER;
+                break;
             case BgWriterProcess:
                 lbeentry.st_backendType = B_BG_WRITER;
                 break;
@@ -4277,6 +4280,9 @@ pgstat_get_backend_desc(BackendType backendType)
 
     switch (backendType)
     {
+        case B_ARCHIVER:
+            backendDesc = "archiver";
+            break;
         case B_AUTOVAC_LAUNCHER:
             backendDesc = "autovacuum launcher";
             break;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index eb9e0221f8..27a9e45074 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -146,7 +146,8 @@
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
 #define BACKEND_TYPE_BGWORKER    0x0008    /* bgworker process */
-#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
+#define BACKEND_TYPE_ARCHIVER    0x0010    /* archiver process */
+#define BACKEND_TYPE_ALL        0x001F    /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -539,6 +540,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1776,7 +1778,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -3005,7 +3007,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3150,10 +3152,8 @@ reaper(SIGNAL_ARGS)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3399,7 +3399,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3604,6 +3604,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3876,6 +3888,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5145,7 +5158,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5428,6 +5441,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index bc6e03fbc7..1f4db67f3f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -399,6 +399,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -411,6 +412,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index fe076d823d..65713abc2b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -718,6 +718,7 @@ typedef struct PgStat_GlobalStats
  */
 typedef enum BackendType
 {
+    B_ARCHIVER,
     B_AUTOVAC_LAUNCHER,
     B_AUTOVAC_WORKER,
     B_BACKEND,
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index 2474eac26a..88f16863d4 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
-- 
2.16.3

From c0d5b123c79995507fb9b340723e7152980d158b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 21 Feb 2019 12:44:56 +0900
Subject: [PATCH v23 4/5] Shared-memory based stats collector

Previously, activity statistics were shared via files on disk. Every
backend sent its numbers to the stats collector process via a socket.
The collector wrote snapshots as a set of files on disk at a certain
interval, and every backend read them as necessary. That worked fine
for a comparatively small set of statistics, but the set is under
pressure to grow and the file size has reached the order of
megabytes. To deal with a larger statistics set, this patch lets
backends share the statistics directly via shared memory.
---
 doc/src/sgml/monitoring.sgml                 |    6 +-
 src/backend/postmaster/autovacuum.c          |   12 +-
 src/backend/postmaster/pgstat.c              | 5661 ++++++++++++--------------
 src/backend/postmaster/postmaster.c          |   85 +-
 src/backend/storage/ipc/ipci.c               |    2 +
 src/backend/storage/lmgr/lwlock.c            |    1 +
 src/backend/tcop/postgres.c                  |   27 +-
 src/backend/utils/init/globals.c             |    1 +
 src/backend/utils/init/postinit.c            |   11 +
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |    4 +-
 src/include/miscadmin.h                      |    1 +
 src/include/pgstat.h                         |  441 +-
 src/include/storage/lwlock.h                 |    1 +
 src/include/utils/timeout.h                  |    1 +
 14 files changed, 2637 insertions(+), 3617 deletions(-)
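
Note on the flush timing: the sketch below is one possible reading of the
three intervals described in the new pgstat.c header comment, not the
logic the patch actually implements. The function names, the enum and the
idea of forcing a waiting flush once the maximum interval has passed are
assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the patch; the shortened names are illustrative. */
#define MIN_INTERVAL_MS    500  /* PGSTAT_STAT_MIN_INTERVAL */
#define RETRY_INTERVAL_MS  100  /* PGSTAT_STAT_RETRY_INTERVAL */
#define MAX_INTERVAL_MS   1000  /* PGSTAT_STAT_MAX_INTERVAL */

typedef enum flush_action
{
    FLUSH_SKIP,                 /* too early, or nothing to do */
    FLUSH_NOWAIT,               /* try, but give up on busy locks */
    FLUSH_FORCE                 /* pending too long: wait for locks */
} flush_action;

/*
 * last_flush_ms:    time of the last successful flush
 * pending_since_ms: time the oldest unflushed number was recorded,
 *                   or -1 if nothing is pending
 */
static flush_action
decide_flush(int64_t now_ms, int64_t last_flush_ms, int64_t pending_since_ms)
{
    if (pending_since_ms < 0)
        return FLUSH_SKIP;
    if (now_ms - pending_since_ms >= MAX_INTERVAL_MS)
        return FLUSH_FORCE;
    if (now_ms - last_flush_ms < MIN_INTERVAL_MS)
        return FLUSH_SKIP;
    return FLUSH_NOWAIT;
}

/* When to look again: soon after a failed attempt, else after MIN. */
static int64_t
next_check_ms(int64_t now_ms, int64_t last_flush_ms, bool last_attempt_failed)
{
    if (last_attempt_failed)
        return now_ms + RETRY_INTERVAL_MS;
    return last_flush_ms + MIN_INTERVAL_MS;
}
```

With these numbers, a backend whose no-wait attempts keep failing would
retry every 100ms and fall back to a waiting flush once its oldest pending
number is 1000ms old.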

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 828e9084dd..ea6aad4d1e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    master server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   master process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   master process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 073f313337..a222817f55 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1958,15 +1958,15 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
+    /* Start a transaction so our commands have one to play into. */
+    StartTransactionCommand();
+
     /*
      * may be NULL if we couldn't find an entry (only happens if we are
      * forcing a vacuum for anti-wrap purposes).
      */
     dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
 
-    /* Start a transaction so our commands have one to play into. */
-    StartTransactionCommand();
-
     /*
      * Clean up any dead statistics collector entries for this DB. We always
      * want to do this exactly once per DB-processing cycle, even if we find
@@ -2749,12 +2749,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
     if (isshared)
     {
         if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
+            tabentry = pgstat_fetch_stat_tabentry_extended(shared, relid);
     }
     else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
+        tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
 
     return tabentry;
 }
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 043e3ff9d2..c0b20763b0 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,23 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Statistics collector facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects per-table and per-function usage statistics of all backends in
+ *  shared memory. pg_count_*() and friends record the activity of each
+ *  backend locally during a transaction. Then pgstat_flush_stat() is called
+ *  at the end of a transaction to flush the local numbers to shared memory.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ *  To avoid congestion on the shared memory, we update shared stats no more
+ *  often than once per PGSTAT_STAT_MIN_INTERVAL (500ms). If a backend cannot
+ *  flush all or part of its local numbers immediately, it postpones the
+ *  update and retries after PGSTAT_STAT_RETRY_INTERVAL (100ms), but the
+ *  numbers are not kept locally for longer than PGSTAT_STAT_MAX_INTERVAL
+ *  (1000ms).
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the stats collector creates the area and
+ *  loads the stored stats file if any. The last process at shutdown writes
+ *  the shared stats to the file and then destroys the area before exiting.
  *
  *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
  *
@@ -19,18 +27,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "pgstat.h"
 
@@ -42,66 +38,38 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "postmaster/autovacuum.h"
-#include "postmaster/fork_process.h"
-#include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
-#include "utils/timestamp.h"
-
 
 /* ----------
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
+#define PGSTAT_STAT_MIN_INTERVAL    500 /* Minimum time between stats data
+                                         * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_STAT_RETRY_INTERVAL    100 /* Retry interval after a failed
+                                         * flush; in milliseconds. */
 
+#define PGSTAT_STAT_MAX_INTERVAL   1000 /* Maximum time between stats data
+                                         * updates; in milliseconds. */
 
 /* ----------
  * The initial size hints for the hash tables used in the collector.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
 #define PGSTAT_TAB_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
@@ -117,6 +85,19 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
+/*
+ * Operation mode and return code of pgstat_get_db_entry.
+ */
+#define    PGSTAT_SHARED        0
+#define    PGSTAT_EXCLUSIVE    1
+#define    PGSTAT_NOWAIT        2
+
+typedef enum PgStat_TableLookupResult
+{
+    NOT_FOUND,
+    FOUND,
+    LOCK_FAILED
+} PgStat_TableLookupResult;
 
 /* ----------
  * GUC parameters
@@ -132,31 +113,63 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used; to be removed together with the GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
+#define        StatsLock (&StatsShmem->StatsMainLock)
+
+/* Shared stats bootstrap information */
+typedef struct StatsShmemStruct
+{
+    LWLock                StatsMainLock;        /* lock protecting this struct */
+    dsa_handle             stats_dsa_handle;    /* DSA handle for stats collector */
+    dshash_table_handle db_hash_handle;
+    dsa_pointer            global_stats;
+    dsa_pointer            archiver_stats;
+    int                    refcount;
+} StatsShmemStruct;
+
 /*
- * BgWriter global statistics counters (unused in other processes).
- * Stored directly in a stats message structure so it can be sent
- * without needing to copy things around.  We assume this inits to zeroes.
+ * BgWriter global statistics counters. The name is a remnant from the time
+ * when the stats collector was a dedicated process and these counters were
+ * sent to it over a socket.
  */
-PgStat_MsgBgWriter BgWriterStats;
+PgStat_MsgBgWriter BgWriterStats = {0};
 
-/* ----------
- * Local data
- * ----------
- */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
+/* Variables that live for the backend lifetime */
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
+static dshash_table *pgStatDBHash = NULL;
 
-static struct sockaddr_storage pgStatAddr;
 
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
+/* parameters for each type of shared hash */
+static const dshash_parameters dsh_dbparams = {
+    sizeof(Oid),
+    SHARED_DBENT_SIZE,
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+static const dshash_parameters dsh_tblparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatTabEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+static const dshash_parameters dsh_funcparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatFuncEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
 
 /*
  * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
+ * written to shared memory.
  *
  * NOTE: once allocated, TabStatusArray structures are never moved or deleted
  * for the life of the backend.  Also, we zero out the t_id fields of the
@@ -191,8 +204,8 @@ typedef struct TabStatHashEntry
 static HTAB *pgStatTabHash = NULL;
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * Backends store per-function info that's waiting to be flushed out to shared
+ * memory in this hash table (indexed by function OID).
  */
 static HTAB *pgStatFunctions = NULL;
 
@@ -202,6 +215,68 @@ static HTAB *pgStatFunctions = NULL;
  */
 static bool have_function_stats = false;
 
+/* common header of snapshot entry in backend snapshot hash */
+typedef struct PgStat_snapshot
+{
+    Oid        key;
+    bool    negative;
+    void   *body;                /* end of header part: to keep alignment */
+} PgStat_snapshot;
+
+/* context struct for snapshot_statentry */
+typedef struct pgstat_snapshot_param
+{
+    char           *hash_name;            /* name of the snapshot hash */
+    int                hash_entsize;        /* element size of hash entry */
+    dshash_table_handle    dsh_handle;        /* dsh handle to attach */
+    const dshash_parameters *dsh_params;/* dshash params */
+    HTAB          **hash;                /* points to variable to hold hash */
+    dshash_table  **dshash;                /* ditto for dshash */
+} pgstat_snapshot_param;
+
+/*
+ * Backends store various database-wide info that's waiting to be flushed out
+ * to shared memory in these variables.
+ *
+ * checksum_failures is the exception in that it is cluster-wide.
+ */
+typedef struct BackendDBStats
+{
+    int        n_conflict_tablespace;
+    int        n_conflict_lock;
+    int        n_conflict_snapshot;
+    int        n_conflict_bufferpin;
+    int        n_conflict_startup_deadlock;
+    int        n_deadlocks;
+    size_t    n_tmpfiles;
+    size_t    tmpfilesize;
+    HTAB    *checksum_failures;
+} BackendDBStats;
+
+/* Hash entry struct for checksum_failures above */
+typedef struct ChecksumFailureEnt
+{
+    Oid    dboid;
+    int    count;
+} ChecksumFailureEnt;
+
+static BackendDBStats BeDBStats = {0};
+
+/* macros to check BeDBStats at once */
+#define HAVE_PENDING_CONFLICTS() \
+    (BeDBStats.n_conflict_tablespace > 0 ||        \
+     BeDBStats.n_conflict_lock > 0 ||            \
+     BeDBStats.n_conflict_bufferpin > 0 ||        \
+     BeDBStats.n_conflict_startup_deadlock > 0)
+
+#define HAVE_PENDING_DBSTATS()                \
+    (HAVE_PENDING_CONFLICTS() ||        \
+     BeDBStats.n_deadlocks > 0 ||                \
+     BeDBStats.n_tmpfiles > 0 ||                \
+     /* no need to check tmpfilesize */        \
+     BeDBStats.checksum_failures != NULL)
+
+
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
  * into PgStat_TableStatus counters until we know if it is going to commit
@@ -237,11 +312,11 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
+static MemoryContext pgStatSnapshotContext = NULL;
+static HTAB *pgStatLocalHash = NULL;
+static bool    clear_snapshot = false;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -250,23 +325,35 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Struct for context for pgstat_flush_* functions
+ *
+ * To avoid repeated attach/detach of the same dshash, dshashes once attached
+ * are stored in this structure and passed across multiple calls and multiple
+ * functions. "generation" here means the value returned by pin_hashes().
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
+typedef struct pgstat_flush_stat_context
+{
+    int    shgeneration;                /* "generation" of shdb_tabhash below */
+    PgStat_StatDBEntry *shdbentry;    /* dbentry for shared tables (oid = 0) */
+    dshash_table *shdb_tabhash;        /* tabentry dshash of shared tables */
+
+    int    mygeneration;                /* "generation" of mydb_tabhash below */
+    PgStat_StatDBEntry *mydbentry;    /* dbentry for my database */
+    dshash_table *mydb_tabhash;        /* tabentry dshash of my database */
+} pgstat_flush_stat_context;
 
 /*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
+ * Cluster wide statistics.
+ *
+ * Contains statistics that are collected neither per database nor per table.
+ * shared_* point to shared memory and snapshot_* are backend snapshots.
+ * The validity of the snapshots is indicated by global_snapshot_is_valid.
  */
-static List *pending_write_requests = NIL;
-
-/* Signal handler flags */
-static volatile bool need_exit = false;
-static volatile bool got_SIGHUP = false;
+static bool global_snapshot_is_valid = false;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats snapshot_globalStats;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -280,35 +367,41 @@ static instr_time total_func_time;
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
 
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgstat_exit(SIGNAL_ARGS);
 static void pgstat_beshutdown_hook(int code, Datum arg);
-static void pgstat_sighup_handler(SIGNAL_ARGS);
-
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
+static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, int op,
+                                    PgStat_TableLookupResult *status);
+static PgStat_StatTabEntry *pgstat_get_tab_entry(dshash_table *table,
                                                  Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static void pgstat_write_pgStatDBHashfile(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_pgStatDBHashfile(PgStat_StatDBEntry *dbentry);
 static void pgstat_read_current_status(void);
-
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
-
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
+static bool pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
+                                 PgStat_TableStatus *entry);
+static bool pgstat_flush_funcstats(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_flush_dbstats(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_update_tabentry(dshash_table *tabhash,
+                                   PgStat_TableStatus *stat, bool nowait);
+static void pgstat_update_dbentry(PgStat_StatDBEntry *dbentry,
+                                  PgStat_TableStatus *stat);
 static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
 
+static void pgstat_remove_useless_entries(const dshash_table_handle dshhandle,
+                              const dshash_parameters *dshparams,
+                              HTAB *oidtab);
 static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
 
 static void pgstat_setup_memcxt(void);
+static void pgstat_flush_recovery_conflict(PgStat_StatDBEntry *dbentry);
+static void pgstat_flush_deadlock(PgStat_StatDBEntry *dbentry);
+static void pgstat_flush_checksum_failure(PgStat_StatDBEntry *dbentry);
+static void pgstat_flush_tempfile(PgStat_StatDBEntry *dbentry);
+static HTAB *create_tabstat_hash(void);
+static PgStat_SubXactStatus *get_tabstat_stack_level(int nest_level);
+static void add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level);
+static PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry_extended(PgStat_StatDBEntry *dbent, Oid funcid);
+static void pgstat_snapshot_global_stats(void);
 
 static const char *pgstat_get_wait_activity(WaitEventActivity w);
 static const char *pgstat_get_wait_client(WaitEventClient w);
@@ -316,481 +409,197 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+/* ------------------------------------------------------------
+ * Local support functions follow
+ * ------------------------------------------------------------
+ */
+static int pin_hashes(PgStat_StatDBEntry *dbentry);
+static void unpin_hashes(PgStat_StatDBEntry *dbentry, int generation);
+static dshash_table *attach_table_hash(PgStat_StatDBEntry *dbent, int gen);
+static dshash_table *attach_function_hash(PgStat_StatDBEntry *dbent, int gen);
+static void reset_dbentry_counters(PgStat_StatDBEntry *dbentry);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute space needed for stats collector's shared memory
  */
-void
-pgstat_init(void)
+Size
+StatsShmemSize(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
-
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
-
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
-    {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
-    }
-
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
-
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
-
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
-
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
-    }
-
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
-
-    /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
-     */
-    if (!pg_set_noblock(pgStatSock))
-    {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
-    }
-
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
-    }
-
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    return;
-
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
-
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
-
-    /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
-     */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    return sizeof(StatsShmemStruct);
 }
 
 /*
- * subroutine for pgstat_reset_all
+ * StatsShmemInit - initialize during shared-memory creation
+ */
+void
+StatsShmemInit(void)
+{
+    bool    found;
+
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
+
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+    }
+
+    LWLockInitialize(StatsLock, LWTRANCHE_STATS);
+}
+
+/* ----------
+ * pgstat_attach_shared_stats() -
+ *
+ *    Attach to the shared stats memory, creating it if needed.
+ * ---------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+pgstat_attach_shared_stats(void)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    MemoryContext oldcontext;
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
+    /*
+     * Don't use DSM when not tracking counts or not under postmaster.
+     */
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    pgstat_setup_memcxt();
+
+    if (area)
+        return;
+
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+    if (StatsShmem->refcount > 0)
+        StatsShmem->refcount++;
+    else
     {
-        int            nchars;
-        Oid            tmp_oid;
+        /* Need to create shared memory area and load saved stats if any. */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
 
-        /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
-         */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
+        /* Initialize shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatDBHash = dshash_create(area, &dsh_dbparams, 0);
 
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->global_stats =
+            dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+        StatsShmem->archiver_stats =
+            dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+        StatsShmem->db_hash_handle = dshash_get_hash_table_handle(pgStatDBHash);
 
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
+
+        /* Load saved data if any. */
+        pgstat_read_statsfiles();
+
+        StatsShmem->refcount = 1;
     }
-    FreeDir(dir);
+
+    LWLockRelease(StatsLock);
+
+    /*
+     * If we're not the first process, attach existing shared stats area
+     * outside StatsLock.
+     */
+    if (!area)
+    {
+        /* Shared area already exists. Just attach it. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatDBHash = dshash_attach(area, &dsh_dbparams,
+                                     StatsShmem->db_hash_handle, 0);
+
+        /* Setup local variables */
+        pgStatLocalHash = NULL;
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
+    }
+
+    MemoryContextSwitchTo(oldcontext);
+
+    dsa_pin_mapping(area);
+    global_snapshot_is_valid = false;
+}
+
+/* ----------
+ * pgstat_detach_shared_stats() -
+ *
+ *    Detach shared stats. Write out to file if we're the last process and
+ *    instructed to write file.
+ * ----------
+ */
+static void
+pgstat_detach_shared_stats(bool write_stats)
+{
+    if (!area || !IsUnderPostmaster)
+        return;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+    /* write out the shared stats to file if needed */
+    if (--StatsShmem->refcount < 1)
+    {
+        if (write_stats)
+            pgstat_write_statsfiles();
+
+        /* We're the last process. Invalidate the dsa area handle. */
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+    }
+
+    LWLockRelease(StatsLock);
+
+    /*
+     * Detach the area. It is automatically destroyed when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);
+
+    area = NULL;
+    pgStatDBHash = NULL;
+    shared_globalStats = NULL;
+    shared_archiverStats = NULL;
+    pgStatLocalHash = NULL;
+    global_snapshot_is_valid = false;
 }
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
-}
+    /* we must have shared stats attached */
+    Assert (StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-#ifdef EXEC_BACKEND
-
-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
-    char       *av[10];
-    int            ac = 0;
-
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
-
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
-
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
-
-
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
+    /* Startup must be the only user of shared stats */
+    Assert (StatsShmem->refcount == 1);
 
     /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
+     * We could remove the files directly and recreate the shared memory area,
+     * but we detach and then re-attach for simplicity.
      */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
-
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
-
-    /*
-     * Okay, fork off the collector.
-     */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
+    pgstat_detach_shared_stats(false);    /* Don't write */
+    pgstat_attach_shared_stats();
 }
 
 /* ------------------------------------------------------------
@@ -798,75 +607,293 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *  Updates are applied no more frequently than once every
+ *  PGSTAT_STAT_MIN_INTERVAL milliseconds. They are also postponed on lock
+ *  failure if force is false and no update has been pending for longer than
+ *  PGSTAT_STAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
+ *  subsequent calls of this function.
+ *
+ *    Returns the time in milliseconds until updates are next applied, if no
+ *    updates have been held for more than PGSTAT_STAT_MIN_INTERVAL
+ *    milliseconds.
+ *
+ *    Note that this is called only out of a transaction, so it is fine to use
  *    transaction stop time as an approximation of current time.
- * ----------
+ *    ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz next_flush = 0;
+    static TimestampTz pending_since = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
-    int            i;
+    pgstat_flush_stat_context cxt = {0};
+    bool        pending_stats = false;
+    long        elapsed;
+    long        secs;
+    int            usecs;
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (area == NULL ||
+        ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
+         pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
+         !HAVE_PENDING_DBSTATS() && !have_function_stats))
+        return 0;
+
+    now = GetCurrentTransactionStopTimestamp();
+
+    if (!force)
+    {
+        /*
+         * Don't flush stats until it's time.  Return the time to wait in
+         * milliseconds.
+         */
+        if (now < next_flush)
+        {
+            /* Record when the oldest pending update was seen, if not yet set. */
+            if (pending_since == 0)
+                pending_since = now;
+
+            /* now < next_flush here */
+            return (next_flush - now) / 1000;
+        }
+
+        /*
+         * Don't keep pending updates longer than PGSTAT_STAT_MAX_INTERVAL.
+         */
+        if (pending_since > 0)
+        {
+            TimestampDifference(pending_since, now, &secs, &usecs);
+            elapsed = secs * 1000 + usecs / 1000;
+
+            if (elapsed > PGSTAT_STAT_MAX_INTERVAL)
+                force = true;
+        }
+    }
+
+    /* Flush out table stats */
+    if (pgStatTabList != NULL && !pgstat_flush_stat(&cxt, !force))
+        pending_stats = true;
+
+    /* Flush out function stats */
+    if (pgStatFunctions != NULL && !pgstat_flush_funcstats(&cxt, !force))
+        pending_stats = true;
+
+    /* Flush out database-wide stats */
+    if (HAVE_PENDING_DBSTATS())
+    {
+        if (!pgstat_flush_dbstats(&cxt, !force))
+            pending_stats = true;
+    }
+
+    /* Unpin dbentry if pinned */
+    if (cxt.mydb_tabhash)
+    {
+        dshash_detach(cxt.mydb_tabhash);
+        unpin_hashes(cxt.mydbentry, cxt.mygeneration);
+        cxt.mydb_tabhash = NULL;
+        cxt.mydbentry = NULL;
+    }
+
+    /* Publish the last flush time */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (shared_globalStats->stats_timestamp < now)
+        shared_globalStats->stats_timestamp = now;
+    LWLockRelease(StatsLock);
+
+    /* Record how long we have been keeping pending updates. */
+    if (pending_stats)
+    {
+        /* Preserve the first value */
+        if (pending_since == 0)
+            pending_since = now;
+
+        /*
+         * The retry interval may push the total delay past the limit set by
+         * PGSTAT_STAT_MAX_INTERVAL. We don't worry about that since the
+         * overshoot is small.
+         */
+        return PGSTAT_STAT_RETRY_INTERVAL;
+    }
+
+    /* Set the next time to update stats */
+    next_flush = now + PGSTAT_STAT_MIN_INTERVAL * 1000;
+    pending_since = 0;
+
+    return 0;
+}
+
+/*
+ * snapshot_statentry() - Common routine for functions
+ *                             pgstat_fetch_stat_*entry()
+ *
+ *  Returns the pointer to a snapshot of a shared entry for the key or NULL if
+ *  not found. Returned snapshots are stable during the current transaction or
+ *  until pgstat_clear_snapshot() is called.
+ *
+ *  The snapshots are stored in a hash, a pointer to which is stored in the
+ *  HTAB * variable pointed to by cxt->hash. If not created yet, it is created
+ *  using hash_name and hash_entsize in cxt.
+ *
+ *  cxt->dshash points to dshash_table for dbstat entries. If not yet
+ *  attached, it is attached using cxt->dsh_handle.
+ */
+static void *
+snapshot_statentry(pgstat_snapshot_param *cxt, Oid key)
+{
+    PgStat_snapshot *lentry = NULL;
+    size_t keysize = cxt->dsh_params->key_size;
+    size_t dsh_entrysize = cxt->dsh_params->entry_size;
+    bool found;
 
     /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
+     * We don't want the stats snapshot to be updated too frequently. Keep it
+     * for at least PGSTAT_STAT_MIN_INTERVAL ms; ignore the cue, don't postpone.
      */
-    now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
+    if (clear_snapshot)
+    {
+        clear_snapshot = false;
+
+        if (pgStatSnapshotContext &&
+            snapshot_globalStats.stats_timestamp <
+            GetCurrentStatementStartTimestamp() -
+            PGSTAT_STAT_MIN_INTERVAL * 1000)
+        {
+            MemoryContextReset(pgStatSnapshotContext);
+
+            /* Reset variables */
+            global_snapshot_is_valid = false;
+            pgStatSnapshotContext = NULL;
+            pgStatLocalHash = NULL;
+
+            pgstat_setup_memcxt();
+        }
+    }
+
+    /*
+     * Create new hash, with rather arbitrary initial number of entries since
+     * we don't know how this hash will grow.
+     */
+    if (!*cxt->hash)
+    {
+        HASHCTL ctl;
+
+        /*
+         * Create the hash in the stats context
+         *
+         * Each entry is preceded by a common header represented by
+         * PgStat_snapshot.
+         */
+
+        ctl.keysize        = keysize;
+        ctl.entrysize    = offsetof(PgStat_snapshot, body) + cxt->hash_entsize;
+        ctl.hcxt        = pgStatSnapshotContext;
+        *cxt->hash = hash_create(cxt->hash_name, 32, &ctl,
+                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    }
+
+    lentry = hash_search(*cxt->hash, &key, HASH_ENTER, &found);
+
+    /*
+     * Refer to the shared hash if the entry is not found in the local hash.
+     * Outside a transaction we return up-to-date entries, so do the same even
+     * if the snapshot is found.
+     */
+    if (!found || !IsTransactionState())
+    {
+        void *sentry;
+
+        /* attach the shared hash if not given; leave it attached for later use */
+        if (!*cxt->dshash)
+        {
+            MemoryContext oldcxt;
+
+            Assert (cxt->dsh_handle != DSM_HANDLE_INVALID);
+            oldcxt = MemoryContextSwitchTo(pgStatSnapshotContext);
+            *cxt->dshash =
+                dshash_attach(area, cxt->dsh_params, cxt->dsh_handle, NULL);
+            MemoryContextSwitchTo(oldcxt);
+        }
+
+        sentry = dshash_find(*cxt->dshash, &key, false);
+
+        if (sentry)
+        {
+            /*
+             * Inside a transaction we must create local cache entries for
+             * consistency. Outside a transaction we return an up-to-date
+             * entry, but we still need a local copy because the dshash entry
+             * must be released immediately. We reuse the same local hash
+             * entry for that purpose.
+             */
+            memcpy(&lentry->body, sentry, dsh_entrysize);
+            dshash_release_lock(*cxt->dshash, sentry);
+
+            /* then zero out the local additional space if any */
+            if (dsh_entrysize < cxt->hash_entsize)
+                MemSet((char *)&lentry->body + dsh_entrysize, 0,
+                       cxt->hash_entsize - dsh_entrysize);
+        }
+
+        lentry->negative = !sentry;
+    }
+
+    if (lentry->negative)
+        return NULL;
+
+    return &lentry->body;
+}
+
+/*
+ * pgstat_flush_stat: Flushes table stats out to shared statistics.
+ *
+ *  If nowait is true, returns false if required lock was not acquired
+ *  immediately. In that case, unapplied table stats updates are left alone in
+ *  TabStatusArray to wait for the next chance. cxt holds some dshash related
+ *  values that we want to carry around while updating shared stats.
+ *
+ *  Returns true if all stats have been flushed. The caller must detach dshashes
+ *  stored in cxt after use.
+ */
+static bool
+pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait)
+{
+    static const PgStat_TableCounts all_zeroes;
+    TabStatusArray *tsa;
+    HTAB           *new_tsa_hash = NULL;
+    TabStatusArray *dest_tsa = pgStatTabList;
+    int                dest_elem = 0;
+    int                i;
+
+    /* nothing to do, just return  */
+    if (pgStatTabHash == NULL)
+        return true;
 
     /*
      * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
+     * entries it points to.
      */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
+    hash_destroy(pgStatTabHash);
     pgStatTabHash = NULL;
 
     /*
      * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
+     * have counts, and try flushing them out to shared stats. We may fail on
+     * some entries in the array; those entries are left packed at the
+     * beginning of the array.
      */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
     for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
     {
         for (i = 0; i < tsa->tsa_used; i++)
         {
             PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
 
             /* Shouldn't have any pending transaction-dependent counts */
             Assert(entry->trans == NULL);
@@ -879,178 +906,352 @@ pgstat_report_stat(bool force)
                        sizeof(PgStat_TableCounts)) == 0)
                 continue;
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* try to apply the tab stats */
+            if (!pgstat_flush_tabstat(cxt, nowait, entry))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                /*
+                 * Failed. Move it to the beginning in TabStatusArray and
+                 * leave it.
+                 */
+                TabStatHashEntry *hash_entry;
+                bool found;
+
+                if (new_tsa_hash == NULL)
+                    new_tsa_hash = create_tabstat_hash();
+
+                /* Create hash entry for this entry */
+                hash_entry = hash_search(new_tsa_hash, &entry->t_id,
+                                         HASH_ENTER, &found);
+                Assert(!found);
+
+                /*
+                 * Move insertion pointer to the next segment if the segment
+                 * is filled up.
+                 */
+                if (dest_elem >= TABSTAT_QUANTUM)
+                {
+                    Assert(dest_tsa->tsa_next != NULL);
+                    dest_tsa = dest_tsa->tsa_next;
+                    dest_elem = 0;
+                }
+
+                /*
+                 * Pack the entry at the beginning of the array. Do nothing if
+                 * it does not need to be moved.
+                 */
+                if (tsa != dest_tsa || i != dest_elem)
+                {
+                    PgStat_TableStatus *new_entry;
+                    new_entry = &dest_tsa->tsa_entries[dest_elem];
+                    *new_entry = *entry;
+
+                    /* use new_entry as entry hereafter */
+                    entry = new_entry;
+                }
+
+                hash_entry->tsa_entry = entry;
+                dest_elem++;
             }
         }
-        /* zero out PgStat_TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
     }
 
-    /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
-     */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    /* zero out unused area of TableStatus */
+    dest_tsa->tsa_used = dest_elem;
+    MemSet(&dest_tsa->tsa_entries[dest_elem], 0,
+           (TABSTAT_QUANTUM - dest_elem) * sizeof(PgStat_TableStatus));
+    while (dest_tsa->tsa_next)
+    {
+        dest_tsa = dest_tsa->tsa_next;
+        MemSet(dest_tsa->tsa_entries, 0,
+               dest_tsa->tsa_used * sizeof(PgStat_TableStatus));
+        dest_tsa->tsa_used = 0;
+    }
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+    /* and set the new TabStatusArray hash if any */
+    pgStatTabHash = new_tsa_hash;
+
+    /*
+     * We no longer need the shared database and table entries, but those for
+     * my database may be used later.
+     */
+    if (cxt->shdb_tabhash)
+    {
+        dshash_detach(cxt->shdb_tabhash);
+        unpin_hashes(cxt->shdbentry, cxt->shgeneration);
+        cxt->shdb_tabhash = NULL;
+        cxt->shdbentry = NULL;
+    }
+
+    return pgStatTabHash == NULL;
 }
 
-/*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+/* -------
+ * Subroutines for pgstat_flush_stat.
+ * -------
  */
-static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+/*
+ * pgstat_flush_tabstat: Flushes a table stats entry.
+ *
+ *  If nowait is true, returns false on lock failure.  Dshashes for table and
+ *  function stats are kept attached in cxt. The caller must detach them after
+ *  use.
+ *
+ *  Returns true if the entry is flushed out.
+ */
+bool
+pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
+                     PgStat_TableStatus *entry)
 {
-    int            n;
-    int            len;
+    Oid        dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
+    int        table_mode = PGSTAT_EXCLUSIVE;
+    bool    updated = false;
+    dshash_table *tabhash;
+    PgStat_StatDBEntry *dbent;
+    int        generation;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    if (nowait)
+        table_mode |= PGSTAT_NOWAIT;
 
-    /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
-     */
-    if (OidIsValid(tsmsg->m_databaseid))
+    /* Attach required table hash if not yet. */
+    if ((entry->t_shared ? cxt->shdb_tabhash : cxt->mydb_tabhash) == NULL)
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
-        pgStatXactCommit = 0;
-        pgStatXactRollback = 0;
-        pgStatBlockReadTime = 0;
-        pgStatBlockWriteTime = 0;
+        /*
+         * Return if we can't get the corresponding dbentry. It may have
+         * been removed.
+         */
+        dbent = pgstat_get_db_entry(dboid, table_mode, NULL);
+        if (!dbent)
+            return false;
+
+        /*
+         * We don't hold lock on the dbentry since it cannot be dropped while
+         * we are working on it.
+         */
+        generation = pin_hashes(dbent);
+        tabhash = attach_table_hash(dbent, generation);
+
+        if (entry->t_shared)
+        {
+            cxt->shgeneration = generation;
+            cxt->shdbentry = dbent;
+            cxt->shdb_tabhash = tabhash;
+        }
+        else
+        {
+            cxt->mygeneration = generation;
+            cxt->mydbentry = dbent;
+            cxt->mydb_tabhash = tabhash;
+
+            /*
+             * We come here once per database. Take the chance to update
+             * database-wide stats
+             */
+            LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
+            dbent->n_xact_commit += pgStatXactCommit;
+            dbent->n_xact_rollback += pgStatXactRollback;
+            dbent->n_block_read_time += pgStatBlockReadTime;
+            dbent->n_block_write_time += pgStatBlockWriteTime;
+            LWLockRelease(&dbent->lock);
+            pgStatXactCommit = 0;
+            pgStatXactRollback = 0;
+            pgStatBlockReadTime = 0;
+            pgStatBlockWriteTime = 0;
+        }
+    }
+    else if (entry->t_shared)
+    {
+        dbent = cxt->shdbentry;
+        tabhash = cxt->shdb_tabhash;
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        dbent = cxt->mydbentry;
+        tabhash = cxt->mydb_tabhash;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    /*
+     * Local table stats should be applied to both dbentry and tabentry at
+     * once. Update dbentry only if we could update tabentry.
+     */
+    if (pgstat_update_tabentry(tabhash, entry, nowait))
+    {
+        pgstat_update_dbentry(dbent, entry);
+        updated = true;
+    }
+
+    return updated;
 }
 
 /*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
+ * pgstat_flush_funcstats: Flushes function stats.
+ *
+ *  If nowait is true, returns false on lock failure. Unapplied local hash
+ *  entries are left alone.
+ *
+ *  Returns true if all entries are flushed out.
  */
-static void
-pgstat_send_funcstats(void)
+static bool
+pgstat_flush_funcstats(pgstat_flush_stat_context *cxt, bool nowait)
 {
     /* we assume this inits to all zeroes: */
     static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
+    dshash_table   *funchash;
     HASH_SEQ_STATUS fstat;
+    PgStat_BackendFunctionEntry *bestat;
 
+    /* nothing to do, just return  */
     if (pgStatFunctions == NULL)
-        return;
+        return true;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
-
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    /* get dbentry into cxt if not yet.  */
+    if (cxt->mydbentry == NULL)
     {
-        PgStat_FunctionEntry *m_ent;
+        int op = PGSTAT_EXCLUSIVE;
 
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
+        if (nowait)
+            op |= PGSTAT_NOWAIT;
+
+        cxt->mydbentry = pgstat_get_db_entry(MyDatabaseId, op, NULL);
+
+        if (cxt->mydbentry == NULL)
+            return false;
+
+        cxt->mygeneration = pin_hashes(cxt->mydbentry);
+    }
+
+    funchash = attach_function_hash(cxt->mydbentry, cxt->mygeneration);
+    if (funchash == NULL)
+        return false;
+
+    have_function_stats = false;
+
+    /*
+     * Scan through pgStatFunctions to find functions that actually have
+     * counts, and try flushing them out to shared stats.
+     */
+    hash_seq_init(&fstat, pgStatFunctions);
+    while ((bestat = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    {
+        bool found;
+        PgStat_StatFuncEntry *funcent = NULL;
+
+        /* Skip it if no counts accumulated for it so far */
+        if (memcmp(&bestat->f_counts, &all_zeroes,
                    sizeof(PgStat_FunctionCounts)) == 0)
             continue;
 
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
+        funcent = (PgStat_StatFuncEntry *)
+            dshash_find_or_insert_extended(funchash, (void *) &(bestat->f_id),
+                                           &found, nowait);
 
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
+        /*
+         * We couldn't acquire lock on the required entry. Leave the local
+         * entry alone.
+         */
+        if (!funcent)
         {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
+            have_function_stats = true;
+            continue;
         }
 
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
+        /* Initialize if it's new, or add to it. */
+        if (!found)
+        {
+            funcent->functionid = bestat->f_id;
+            funcent->f_numcalls = bestat->f_counts.f_numcalls;
+            funcent->f_total_time =
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_total_time);
+            funcent->f_self_time =
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_self_time);
+        }
+        else
+        {
+            funcent->f_numcalls += bestat->f_counts.f_numcalls;
+            funcent->f_total_time +=
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_total_time);
+            funcent->f_self_time +=
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_self_time);
+        }
+        dshash_release_lock(funchash, funcent);
+
+        /* reset used counts */
+        MemSet(&bestat->f_counts, 0, sizeof(PgStat_FunctionCounts));
     }
 
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
+    return !have_function_stats;
 }
 
+/*
+ * pgstat_flush_dbstats: Flushes out miscellaneous database stats.
+ *
+ *  If nowait is true, returns false on lock failure on dbentry.
+ *
+ *  Returns true if all stats are flushed out.
+ */
+static bool
+pgstat_flush_dbstats(pgstat_flush_stat_context *cxt, bool nowait)
+{
+    /* get dbentry if not yet.  */
+    if (cxt->mydbentry == NULL)
+    {
+        int op = PGSTAT_EXCLUSIVE;
+        if (nowait)
+            op |= PGSTAT_NOWAIT;
+
+        cxt->mydbentry = pgstat_get_db_entry(MyDatabaseId, op, NULL);
+
+        /* return if lock failed. */
+        if (cxt->mydbentry == NULL)
+            return false;
+
+        /* we use this generation of table/function stats in this turn */
+        cxt->mygeneration = pin_hashes(cxt->mydbentry);
+    }
+
+    LWLockAcquire(&cxt->mydbentry->lock, LW_EXCLUSIVE);
+    if (HAVE_PENDING_CONFLICTS())
+        pgstat_flush_recovery_conflict(cxt->mydbentry);
+    if (BeDBStats.n_deadlocks != 0)
+        pgstat_flush_deadlock(cxt->mydbentry);
+    if (BeDBStats.n_tmpfiles != 0)
+        pgstat_flush_tempfile(cxt->mydbentry);
+    if (BeDBStats.checksum_failures != NULL)
+        pgstat_flush_checksum_failure(cxt->mydbentry);
+    LWLockRelease(&cxt->mydbentry->lock);
+
+    return true;
+}
 
 /* ----------
  * pgstat_vacuum_stat() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *    Remove objects we can get rid of.
  * ----------
  */
 void
 pgstat_vacuum_stat(void)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
+    HTAB       *oidtab;
+    dshash_seq_status dshstat;
     PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* we don't collect stats under standalone mode */
+    if (!IsUnderPostmaster)
         return;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
     /*
      * Read pg_database and make a list of OIDs of all existing databases
      */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    oidtab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
 
     /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
+     * Search the database hash table for dead databases and drop them
+     * from the hash.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+
+    dshash_seq_init(&dshstat, pgStatDBHash, false, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            dbid = dbentry->databaseid;
 
@@ -1058,137 +1259,43 @@ pgstat_vacuum_stat(void)
 
         /* the DB entry for shared tables (with InvalidOid) is never dropped */
         if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
+            hash_search(oidtab, (void *) &dbid, HASH_FIND, NULL) == NULL)
             pgstat_drop_database(dbid);
     }
 
     /* Clean up */
-    hash_destroy(htab);
+    hash_destroy(oidtab);
 
     /*
      * Lookup our own database entry; if not found, nothing more to do.
      */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_EXCLUSIVE, NULL);
+    if (!dbentry)
         return;
 
     /*
      * Similarly to above, make a list of all known relations in this DB.
      */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
+    oidtab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
 
     /*
      * Check for all tables listed in stats hashtable if they still exist.
+     * Stats cache is useless here so directly search the shared hash.
      */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            tabid = tabentry->tableid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
-            continue;
-
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
-    }
+    pgstat_remove_useless_entries(dbentry->tables, &dsh_tblparams, oidtab);
 
     /*
-     * Send the rest
+     * Repeat the above but we needn't bother in the common case where no
+     * function stats are being collected.
      */
-    if (msg.m_nentries > 0)
+    if (dbentry->functions != DSM_HANDLE_INVALID)
     {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
+        oidtab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
 
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
-        {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
-                continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
-        }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
+        pgstat_remove_useless_entries(dbentry->functions, &dsh_funcparams,
+                                      oidtab);
     }
+    dshash_release_lock(pgStatDBHash, dbentry);
 }
 
 
@@ -1242,66 +1349,99 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     return htab;
 }
 
+/*
+ * pgstat_remove_useless_entries - Remove useless entries from per
+ * table/function dshashes.
+ *
+ *  Scan the dshash specified by dshhandle removing entries that are not in
+ *  oidtab. oidtab is destroyed before returning.
+ */
+void
+pgstat_remove_useless_entries(const dshash_table_handle dshhandle,
+                              const dshash_parameters *dshparams,
+                              HTAB *oidtab)
+{
+    dshash_table *dshtable;
+    dshash_seq_status dshstat;
+    void         *ent;
+
+    dshtable = dshash_attach(area, dshparams, dshhandle, 0);
+    dshash_seq_init(&dshstat, dshtable, false, true);
+
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
+    {
+        CHECK_FOR_INTERRUPTS();
+
+        /* The first member of the entries must be Oid */
+        if (hash_search(oidtab, ent, HASH_FIND, NULL) != NULL)
+            continue;
+
+        /* Not there, so purge this entry */
+        dshash_delete_entry(dshtable, ent);
+    }
+    dshash_detach(dshtable);
+    hash_destroy(oidtab);
+}
 
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
+ *    Remove entry for the database that we just dropped.
+ *
+ *    If some stats are flushed after this, this entry will be re-created but we
+ *    will still clean the dead DB eventually via future invocations of
+ *    pgstat_vacuum_stat().
  * ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    Assert(OidIsValid(databaseid));
+
+    if (!IsUnderPostmaster || !pgStatDBHash)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Lookup the database in the hashtable with exclusive lock.
+     */
+    dbentry = pgstat_get_db_entry(databaseid, PGSTAT_EXCLUSIVE, NULL);
+
+    /*
+     * If found, remove it.
+     */
+    if (dbentry)
+    {
+        /* LWLock is needed to rewrite */
+        LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+
+        /* No one is using tables/functions in this dbentry */
+        Assert(dbentry->refcnt == 0);
+
+        /* Remove table/function stats dshash first. */
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+        }
+        LWLockRelease(&dbentry->lock);
+
+        dshash_delete_entry(pgStatDBHash, (void *)dbentry);
+    }
 }
 
-
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
-
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
-}
-#endif                            /* NOT_USED */
-
-
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1310,20 +1450,32 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    PgStat_StatDBEntry       *dbentry;
+    PgStat_TableLookupResult status;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!pgStatDBHash)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Lookup the database in the hashtable.  Nothing to do if not there.
+     */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_EXCLUSIVE, &status);
+
+    if (!dbentry)
+        return;
+
+    /* This database is active, safe to release the lock immediately. */
+    dshash_release_lock(pgStatDBHash, dbentry);
+
+    /* Reset database-level stats. */
+    reset_dbentry_counters(dbentry);
+
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1332,29 +1484,37 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
+    /* Reset the archiver statistics for the cluster. */
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    /* Reset the bgwriter statistics for the cluster. */
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1363,17 +1523,42 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
+    int generation;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_EXCLUSIVE, NULL);
+
+    if (!dbentry)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* This database is active, safe to release the lock immediately. */
+    generation = pin_hashes(dbentry);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Set the reset timestamp for the whole database */
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&dbentry->lock);
+
+    /* Remove object if it exists, ignore if not */
+    if (type == RESET_TABLE)
+    {
+        dshash_table *t = attach_table_hash(dbentry, generation);
+        dshash_delete_key(t, (void *) &objoid);
+        dshash_detach(t);
+    }
+
+    if (type == RESET_FUNCTION)
+    {
+        dshash_table *t = attach_function_hash(dbentry, generation);
+        if (t)
+        {
+            dshash_delete_key(t, (void *) &objoid);
+            dshash_detach(t);
+        }
+    }
+    unpin_hashes(dbentry, generation);
 }
 
 /* ----------
@@ -1387,48 +1572,81 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    /*
+     * Store the last autovacuum time in the database's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_EXCLUSIVE, NULL);
+    dshash_release_lock(pgStatDBHash, dbentry);
 
-    pgstat_send(&msg, sizeof(msg));
+    ts = GetCurrentTimestamp();
+
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->last_autovac_time = ts;
+    LWLockRelease(&dbentry->lock);
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
+    int                    generation;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hash table entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_EXCLUSIVE, NULL);
+    generation = pin_hashes(dbentry);
+    table = attach_table_hash(dbentry, generation);
+
+    tabentry = pgstat_get_tab_entry(table, tableoid, true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->vacuum_count++;
+    }
+    dshash_release_lock(table, tabentry);
+
+    dshash_detach(table);
+    unpin_hashes(dbentry, generation);
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1439,9 +1657,14 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table       *table;
+    int                    generation;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1470,78 +1693,153 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_EXCLUSIVE, NULL);
+    generation = pin_hashes(dbentry);
+    table = attach_table_hash(dbentry, generation);
+    tabentry = pgstat_get_tab_entry(table, RelationGetRelid(rel), true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    dshash_release_lock(table, tabentry);
+
+    dshash_detach(table);
+    unpin_hashes(dbentry, generation);
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupResult status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            BeDBStats.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            BeDBStats.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            BeDBStats.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            BeDBStats.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            BeDBStats.n_conflict_startup_deadlock++;
+            break;
+    }
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
+
+    if (status == LOCK_FAILED)
+        return;
+
+    /* We had a chance to flush immediately */
+    pgstat_flush_recovery_conflict(dbentry);
+
+    dshash_release_lock(pgStatDBHash, dbentry);
+}
+
+/*
+ * flush recovery conflict stats
+ */
+static void
+pgstat_flush_recovery_conflict(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_conflict_tablespace    += BeDBStats.n_conflict_tablespace;
+    dbentry->n_conflict_lock         += BeDBStats.n_conflict_lock;
+    dbentry->n_conflict_snapshot    += BeDBStats.n_conflict_snapshot;
+    dbentry->n_conflict_bufferpin    += BeDBStats.n_conflict_bufferpin;
+    dbentry->n_conflict_startup_deadlock += BeDBStats.n_conflict_startup_deadlock;
+
+    BeDBStats.n_conflict_tablespace = 0;
+    BeDBStats.n_conflict_lock = 0;
+    BeDBStats.n_conflict_snapshot = 0;
+    BeDBStats.n_conflict_bufferpin = 0;
+    BeDBStats.n_conflict_startup_deadlock = 0;
 }
 
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupResult status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    BeDBStats.n_deadlocks++;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
+
+    if (status == LOCK_FAILED)
+        return;
+
+    dshash_release_lock(pgStatDBHash, dbentry);
 }
 
-
-
-/* --------
- * pgstat_report_checksum_failures_in_db() -
- *
- *    Tell the collector about one or more checksum failures.
- * --------
+/*
+ * flush deadlock stats
  */
-void
-pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
+static void
+pgstat_flush_deadlock(PgStat_StatDBEntry *dbentry)
 {
-    PgStat_MsgChecksumFailure msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
-
-    pgstat_send(&msg, sizeof(msg));
+    dbentry->n_deadlocks += BeDBStats.n_deadlocks;
+    BeDBStats.n_deadlocks = 0;
 }
 
 /* --------
@@ -1559,60 +1857,153 @@ pgstat_report_checksum_failure(void)
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupResult status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
-}
+    if (filesize > 0) /* Is there a case where filesize is really 0? */
+    {
+        BeDBStats.tmpfilesize += filesize; /* needs overflow check */
+        BeDBStats.n_tmpfiles++;
+    }
 
-
-/* ----------
- * pgstat_ping() -
- *
- *    Send some junk data to the collector to increase traffic.
- * ----------
- */
-void
-pgstat_ping(void)
-{
-    PgStat_MsgDummy msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (BeDBStats.n_tmpfiles == 0)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
+
+    if (status == LOCK_FAILED)
+        return;
+
+    /* We had a chance to flush immediately */
+    pgstat_flush_tempfile(dbentry);
+
+    dshash_release_lock(pgStatDBHash, dbentry);
 }
 
-/* ----------
- * pgstat_send_inquiry() -
- *
- *    Notify collector that we need fresh data.
- * ----------
+/*
+ * flush temporary file stats
  */
 static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
+pgstat_flush_tempfile(PgStat_StatDBEntry *dbentry)
 {
-    PgStat_MsgInquiry msg;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
+    dbentry->n_temp_bytes += BeDBStats.tmpfilesize;
+    dbentry->n_temp_files += BeDBStats.n_tmpfiles;
+    BeDBStats.tmpfilesize = 0;
+    BeDBStats.n_tmpfiles = 0;
 }
 
+/* --------
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
+ *
+ *    Report one or more checksum failures.
+ * --------
+ */
+void
+pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
+{
+    PgStat_StatDBEntry       *dbentry;
+    PgStat_TableLookupResult status;
+    ChecksumFailureEnt       *failent = NULL;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    if (BeDBStats.checksum_failures != NULL)
+    {
+        failent = hash_search(BeDBStats.checksum_failures, &dboid,
+                              HASH_FIND, NULL);
+        if (failent)
+            failurecount += failent->count;
+    }
+
+    if (failurecount == 0)
+        return;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
+
+    if (status == LOCK_FAILED)
+    {
+        if (!failent)
+        {
+            if (!BeDBStats.checksum_failures)
+            {
+                HASHCTL    ctl;
+
+                ctl.keysize = sizeof(Oid);
+                ctl.entrysize = sizeof(ChecksumFailureEnt);
+                BeDBStats.checksum_failures =
+                    hash_create("pgstat checksum failure count hash",
+                                32, &ctl, HASH_ELEM | HASH_BLOBS);
+            }
+
+            failent = hash_search(BeDBStats.checksum_failures,
+                                  &dboid, HASH_ENTER, NULL);
+        }
+
+        failent->count = failurecount;
+        return;
+    }
+
+    /* We have a chance to flush immediately */
+    dbentry->n_checksum_failures += failurecount;
+    BeDBStats.checksum_failures = NULL;
+
+    dshash_release_lock(pgStatDBHash, dbentry);
+}
+
+/*
+ * flush checksum failure count for all databases
+ */
+static void
+pgstat_flush_checksum_failure(PgStat_StatDBEntry *dbentry)
+{
+    HASH_SEQ_STATUS     stat;
+    ChecksumFailureEnt *ent;
+    bool                release_dbent;
+
+    if (BeDBStats.checksum_failures == NULL)
+        return;
+
+    hash_seq_init(&stat, BeDBStats.checksum_failures);
+    while ((ent = (ChecksumFailureEnt *) hash_seq_search(&stat)) != NULL)
+    {
+        release_dbent = false;
+
+        if (dbentry->databaseid != ent->dboid)
+        {
+            dbentry = pgstat_get_db_entry(ent->dboid,
+                                          PGSTAT_EXCLUSIVE, NULL);
+            if (!dbentry)
+                continue;
+
+            release_dbent = true;
+        }
+
+        dbentry->n_checksum_failures += ent->count;
+
+        if (release_dbent)
+            dshash_release_lock(pgStatDBHash, dbentry);
+    }
+
+    hash_destroy(BeDBStats.checksum_failures);
+    BeDBStats.checksum_failures = NULL;
+}
 
 /*
  * Initialize function call usage data.
@@ -1764,7 +2155,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1783,6 +2175,24 @@ pgstat_initstats(Relation rel)
     rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
 }
 
+/*
+ * create_tabstat_hash - create local hash as transactional storage
+ */
+static HTAB *
+create_tabstat_hash(void)
+{
+    HASHCTL        ctl;
+
+    MemSet(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(Oid);
+    ctl.entrysize = sizeof(TabStatHashEntry);
+
+    return hash_create("pgstat TabStatusArray lookup hash table",
+                       TABSTAT_QUANTUM,
+                       &ctl,
+                       HASH_ELEM | HASH_BLOBS);
+}
+
 /*
  * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
  */
@@ -1798,18 +2208,7 @@ get_tabstat_entry(Oid rel_id, bool isshared)
      * Create hash table if we don't have it already.
      */
     if (pgStatTabHash == NULL)
-    {
-        HASHCTL        ctl;
-
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
-    }
+        pgStatTabHash = create_tabstat_hash();
 
     /*
      * Find an entry or create a new one.
@@ -2422,30 +2821,33 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
- * ----------
+ *    Find database stats entry on backends. The returned entries are cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_param param =
+    {
+        .hash_name = "local database stats hash",
+        .hash_entsize = sizeof(PgStat_StatDBEntry),
+        .dsh_handle = DSM_HANDLE_INVALID,   /* already attached */
+        .dsh_params = &dsh_dbparams,
+        .hash = &pgStatLocalHash,
+        .dshash = &pgStatDBHash
+    };
 
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    /* If not done for this transaction, take a snapshot of global stats */
+    pgstat_snapshot_global_stats();
+
+    /* the caller has no business with snapshot-local members */
+    return (PgStat_StatDBEntry *) snapshot_statentry(&param, dbid);
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
@@ -2458,51 +2860,66 @@ pgstat_fetch_stat_dbentry(Oid dbid)
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* Lookup our database, then look in its table hash table. */
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    if (dbentry == NULL)
+        return NULL;
 
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    dbentry = pgstat_fetch_stat_dbentry(InvalidOid);
+    if (dbentry == NULL)
+        return NULL;
+
+    tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     return NULL;
 }
 
+/* ----------
+ * pgstat_fetch_stat_tabentry_extended() -
+ *
+ *    Find table stats entry on backends. The returned entries are cached until
+ *    transaction end or pgstat_clear_snapshot() is called.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_extended(PgStat_StatDBEntry *dbent, Oid reloid)
+{
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_param param =
+    {
+        .hash_name = "table stats snapshot hash",
+        .hash_entsize = sizeof(PgStat_StatTabEntry),
+        .dsh_handle = DSM_HANDLE_INVALID,
+        .dsh_params = &dsh_tblparams,
+        .hash = NULL,
+        .dshash = NULL
+    };
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    /* set target shared hash */
+    param.dsh_handle = dbent->tables;
+
+    /* tell snapshot_statentry what variables to use */
+    param.hash = &dbent->snapshot_tables;
+    param.dshash = &dbent->dshash_tables;
+
+    return (PgStat_StatTabEntry *)
+        snapshot_statentry(&param, reloid);
+}
+
 
 /* ----------
  * pgstat_fetch_stat_funcentry() -
@@ -2517,21 +2934,90 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     PgStat_StatDBEntry *dbentry;
     PgStat_StatFuncEntry *funcentry = NULL;
 
-    /* load the stats file if needed */
-    backend_read_statsfile();
-
-    /* Lookup our database, then find the requested function.  */
+    /* Lookup our database, then find the requested function */
     dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
+    if (dbentry == NULL)
+        return NULL;
+
+    funcentry = pgstat_fetch_stat_funcentry_extended(dbentry, func_id);
 
     return funcentry;
 }
 
+/* ----------
+ * pgstat_fetch_stat_funcentry_extended() -
+ *
+ *    Find function stats entry on backends. The returned entries are cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
+ *
+ *  dbent is of type (PgStat_StatDBEntry *), but it must point to a
+ *  PgStat_StatDBEntry returned from pgstat_fetch_stat_dbentry().
+ */
+static PgStat_StatFuncEntry *
+pgstat_fetch_stat_funcentry_extended(PgStat_StatDBEntry *dbent, Oid funcid)
+{
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_param param =
+    {
+        .hash_name = "function stats snapshot hash",
+        .hash_entsize = sizeof(PgStat_StatFuncEntry),
+        .dsh_handle = DSM_HANDLE_INVALID,
+        .dsh_params = &dsh_funcparams,
+        .hash = NULL,
+        .dshash = NULL
+    };
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    if (dbent->functions == DSM_HANDLE_INVALID)
+        return NULL;
+
+    /* set target shared hash */
+    param.dsh_handle = dbent->functions;
+
+    /* tell snapshot_statentry what variables to use */
+    param.hash = &dbent->snapshot_functions;
+    param.dshash = &dbent->dshash_functions;
+
+    return (PgStat_StatFuncEntry *)
+        snapshot_statentry(&param, funcid);
+}
+
+/*
+ * pgstat_snapshot_global_stats() -
+ *
+ * Makes a snapshot of global stats if not done yet.  They are kept until a
+ * subsequent call to pgstat_clear_snapshot() or the end of the current
+ * memory context (typically TopTransactionContext).
+ */
+static void
+pgstat_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext;
+
+    pgstat_attach_shared_stats();
+
+    /* Nothing to do if already done */
+    if (global_snapshot_is_valid)
+        return;
+
+    oldcontext = MemoryContextSwitchTo(pgStatSnapshotContext);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+
+    memcpy(&snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+    LWLockRelease(StatsLock);
+
+    global_snapshot_is_valid = true;
+
+    MemoryContextSwitchTo(oldcontext);
+
+    return;
+}
 
 /* ----------
  * pgstat_fetch_stat_beentry() -
@@ -2603,9 +3089,10 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &archiverStats;
+    return &snapshot_archiverStats;
 }
 
 
@@ -2620,9 +3107,10 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &globalStats;
+    return &snapshot_globalStats;
 }
 
 
@@ -2836,8 +3324,8 @@ pgstat_initialize(void)
         MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
     }
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* needs to be called before dsm shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -2935,7 +3423,7 @@ pgstat_bestart(void)
                 lbeentry.st_backendType = B_STARTUP;
                 break;
             case ArchiverProcess:
-                beentry->st_backendType = B_ARCHIVER;
+                lbeentry.st_backendType = B_ARCHIVER;
                 break;
             case BgWriterProcess:
                 lbeentry.st_backendType = B_BG_WRITER;
@@ -3071,6 +3559,10 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+
+    /* attach shared database stats area */
+    pgstat_attach_shared_stats();
 }
 
 /*
@@ -3106,6 +3598,8 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    pgstat_detach_shared_stats(true);
 }
 
 
@@ -3366,7 +3860,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3661,9 +4156,6 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
-            break;
         case WAIT_EVENT_RECOVERY_WAL_ALL:
             event_name = "RecoveryWalAll";
             break;
@@ -4323,75 +4815,43 @@ pgstat_get_backend_desc(BackendType backendType)
  * ------------------------------------------------------------
  */
 
-
-/* ----------
- * pgstat_setheader() -
- *
- *        Set common header fields in a statistics message
- * ----------
- */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
-    {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
 /* ----------
  * pgstat_send_archiver() -
  *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
+ *        Report archiver statistics
  * ----------
  */
 void
 pgstat_send_archiver(const char *xlog, bool failed)
 {
-    PgStat_MsgArchiver msg;
+    TimestampTz now = GetCurrentTimestamp();
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+    if (failed)
+    {
+        /* Failed archival attempt */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->failed_count;
+        strlcpy(shared_archiverStats->last_failed_wal, xlog,
+                sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    else
+    {
+        /* Successful archival operation */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->archived_count;
+        strlcpy(shared_archiverStats->last_archived_wal, xlog,
+                sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
 }
 
 /* ----------
  * pgstat_send_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
@@ -4400,6 +4860,8 @@ pgstat_send_bgwriter(void)
     /* We assume this initializes to zeroes */
     static const PgStat_MsgBgWriter all_zeroes;
 
+    PgStat_MsgBgWriter *s = &BgWriterStats;
+
     /*
      * This function can be called even if nothing at all has happened. In
      * this case, avoid sending a completely empty message to the stats
@@ -4408,11 +4870,18 @@ pgstat_send_bgwriter(void)
     if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += s->m_timed_checkpoints;
+    shared_globalStats->requested_checkpoints += s->m_requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += s->m_checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += s->m_checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += s->m_buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += s->m_buf_written_clean;
+    shared_globalStats->maxwritten_clean += s->m_maxwritten_clean;
+    shared_globalStats->buf_written_backend += s->m_buf_written_backend;
+    shared_globalStats->buf_fsync_backend += s->m_buf_fsync_backend;
+    shared_globalStats->buf_alloc += s->m_buf_alloc;
+    LWLockRelease(StatsLock);
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4421,305 +4890,164 @@ pgstat_send_bgwriter(void)
 }
 
 
-/* ----------
- * PgstatCollectorMain() -
+/*
+ * Pin and Unpin dbentry.
  *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
+ * To save memory and for speed, counters are reset by recreating the dshash
+ * instead of removing entries one by one while holding a whole-dshash lock.
+ * On the other hand, a dshash cannot be destroyed until all referrers have
+ * gone. As a result, other backends might be kept waiting for the counter
+ * reset for a long time. We isolate the hashes under destruction as another
+ * generation, which means they are no longer used but cannot be removed yet.
+ *
+ * When we start accessing hashes on a dbentry, call pin_hashes() and acquire
+ * the current "generation". unpin_hashes() removes the older generation's
+ * hashes when all referrers have gone.
  */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
+static int
+pin_hashes(PgStat_StatDBEntry *dbentry)
 {
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
+    int    generation;
 
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, pgstat_sighup_handler);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, pgstat_exit);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->refcnt++;
+    generation = dbentry->generation;
+    LWLockRelease(&dbentry->lock);
 
-    /*
-     * Identify myself via ps
-     */
-    init_ps_display("stats collector", "", "", "");
+    dshash_release_lock(pgStatDBHash, dbentry);
 
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
+    return generation;
+}
 
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do recognize got_SIGHUP inside
-     * the inner loop, which means that such interrupts will get serviced but
-     * the latch won't get cleared until next time there is a break in the
-     * action.
-     */
-    for (;;)
+/*
+ * Unpin hashes in dbentry. If the given generation is isolated, destroy it
+ * after all referrers have gone. Otherwise just decrease the reference count.
+ */
+static void
+unpin_hashes(PgStat_StatDBEntry *dbentry, int generation)
+{
+    dshash_table *tables;
+    dshash_table *funcs = NULL;
+
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+
+    /* using current generation, just decrease refcount */
+    if (dbentry->generation == generation)
     {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (need_exit)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * need_exit becomes set.
-         */
-        while (!need_exit)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (got_SIGHUP)
-            {
-                got_SIGHUP = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(
-                                                   &msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(
-                                                   &msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(
-                                                 &msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(
-                                                 &msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr & WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
+        dbentry->refcnt--;
+        LWLockRelease(&dbentry->lock);
+        return;
+    }
 
     /*
-     * Save the final stats to reuse at next startup.
+     * It is isolated, waiting for all referrers to end.
      */
-    pgstat_write_statsfiles(true, true);
+    Assert(dbentry->generation == generation + 1);
 
-    exit(0);
+    if (--dbentry->prev_refcnt > 0)
+    {
+        LWLockRelease(&dbentry->lock);
+        return;
+    }
+
+    /* no referrer remains, remove the hashes */
+    tables = dshash_attach(area, &dsh_tblparams, dbentry->prev_tables, 0);
+    if (dbentry->prev_functions != DSM_HANDLE_INVALID)
+        funcs = dshash_attach(area, &dsh_funcparams,
+                              dbentry->prev_functions, 0);
+
+    dbentry->prev_tables = DSM_HANDLE_INVALID;
+    dbentry->prev_functions = DSM_HANDLE_INVALID;
+
+    /* release the entry immediately */
+    LWLockRelease(&dbentry->lock);
+
+    dshash_destroy(tables);
+    if (funcs)
+        dshash_destroy(funcs);
+
+    return;
 }
 
-
-/* SIGQUIT signal handler for collector process */
-static void
-pgstat_exit(SIGNAL_ARGS)
+/*
+ * attach and return the specified generation of table hash
+ * Returns NULL on lock failure.
+ */
+static dshash_table *
+attach_table_hash(PgStat_StatDBEntry *dbent, int gen)
 {
-    int            save_errno = errno;
+    dshash_table *ret;
 
-    need_exit = true;
-    SetLatch(MyLatch);
+    LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
 
-    errno = save_errno;
+    if (dbent->generation == gen)
+        ret = dshash_attach(area, &dsh_tblparams, dbent->tables, 0);
+    else
+    {
+        Assert (dbent->generation == gen + 1);
+        Assert (dbent->prev_tables != DSM_HANDLE_INVALID);
+        ret = dshash_attach(area, &dsh_tblparams, dbent->prev_tables, 0);
+    }
+    LWLockRelease(&dbent->lock);
+
+    return ret;
 }
 
-/* SIGHUP handler for collector process */
-static void
-pgstat_sighup_handler(SIGNAL_ARGS)
+/* attach and return the specified generation of function hash */
+static dshash_table *
+attach_function_hash(PgStat_StatDBEntry *dbent, int gen)
 {
-    int            save_errno = errno;
+    dshash_table *ret = NULL;
 
-    got_SIGHUP = true;
-    SetLatch(MyLatch);
 
-    errno = save_errno;
+    LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
+
+    if (dbent->generation == gen)
+    {
+        if (dbent->functions == DSM_HANDLE_INVALID)
+        {
+            dshash_table *funchash =
+                dshash_create(area, &dsh_funcparams, 0);
+            dbent->functions = dshash_get_hash_table_handle(funchash);
+
+            ret = funchash;
+        }
+        else
+            ret = dshash_attach(area, &dsh_funcparams, dbent->functions, 0);
+    }
+    /* don't bother creating useless hash */
+
+    LWLockRelease(&dbent->lock);
+
+    return ret;
+}
+
+static void
+init_dbentry(PgStat_StatDBEntry *dbentry)
+{
+    LWLockInitialize(&dbentry->lock, LWTRANCHE_STATS);
+    dbentry->generation = 0;
+    dbentry->refcnt = 0;
+    dbentry->prev_refcnt = 0;
+    dbentry->tables = DSM_HANDLE_INVALID;
+    dbentry->prev_tables = DSM_HANDLE_INVALID;
+    dbentry->functions = DSM_HANDLE_INVALID;
+    dbentry->prev_functions = DSM_HANDLE_INVALID;
 }
 
 /*
  * Subroutine to clear stats in a database entry
  *
- * Tables and functions hashes are initialized to empty.
+ * Reset all counters in the dbentry. Tables and functions dshashes are
+ * destroyed.  If any backend is pinning this dbentry, the current dshashes
+ * are stashed out to the previous "generation" to wait until all accessors
+ * are gone. If the previous generation is already occupied, the current
+ * dshashes are so fresh that they don't need to be cleared.
  */
 static void
 reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
 {
-    HASHCTL        hash_ctl;
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
 
     dbentry->n_xact_commit = 0;
     dbentry->n_xact_rollback = 0;
@@ -4744,72 +5072,865 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
     dbentry->n_block_read_time = 0;
     dbentry->n_block_write_time = 0;
 
+    if (dbentry->refcnt == 0)
+    {
+        /*
+         * No one is referring to the current hash. It's very costly to remove
+         * entries in dshash individually so just destroy the whole.  If
+         * someone pins this entry just after, pin_hashes() returns the
+         * current generation and attach will happen after the following
+         * LWLock is released.
+         */
+        dshash_table *tbl;
+
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+            dbentry->tables = DSM_HANDLE_INVALID;
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+            dbentry->functions = DSM_HANDLE_INVALID;
+        }
+    }
+    else if (dbentry->prev_refcnt == 0)
+    {
+        /*
+         * Someone is still referring to the current hash and previous slot is
+         * vacant. Stash out the current hash to the previous slot.
+         */
+        dbentry->prev_refcnt = dbentry->refcnt;
+        dbentry->prev_tables = dbentry->tables;
+        dbentry->prev_functions = dbentry->functions;
+        dbentry->refcnt = 0;
+        dbentry->tables = DSM_HANDLE_INVALID;
+        dbentry->functions = DSM_HANDLE_INVALID;
+        dbentry->generation++;
+    }
+    else
+    {
+        Assert(dbentry->prev_refcnt > 0 && dbentry->refcnt > 0);
+        /*
+         * If we get here, we have just received another reset request while
+         * the old hashes are still waiting for all referrers to release them.
+         * That should take only a short time, so we can ignore this request.
+         *
+         * As a side effect, the resetter can see non-zero values before
+         * anyone updates them, but that is indistinguishable from someone
+         * having updated them before the read.
+         */
+    }
+
+    /* Create new table hash if not exists */
+    if (dbentry->tables == DSM_HANDLE_INVALID)
+    {
+        dshash_table *tbl = dshash_create(area, &dsh_tblparams, 0);
+        dbentry->tables = dshash_get_hash_table_handle(tbl);
+        dshash_detach(tbl);
+    }
+
+    /* Create new function hash if not exists and needed. */
+    if (dbentry->functions == DSM_HANDLE_INVALID &&
+        pgstat_track_functions != TRACK_FUNC_OFF)
+    {
+        dshash_table *tbl = dshash_create(area, &dsh_funcparams, 0);
+        dbentry->functions = dshash_get_hash_table_handle(tbl);
+        dshash_detach(tbl);
+    }
+
     dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
 
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
+    LWLockRelease(&dbentry->lock);
 }
 
 /*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+ * Create the filename for a DB stat file; filename is an output parameter
+ * that points to a character buffer of length len.
  */
-static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
+static void
+get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
 {
-    PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
+    int            printed;
 
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/db_%u.%s",
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
+                       databaseid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
+}
 
-    if (!create && !found)
-        return NULL;
+/* ----------
+ * pgstat_write_statsfiles() -
+ *        Write the global statistics file, as well as DB files.
+ * ----------
+ */
+void
+pgstat_write_statsfiles(void)
+{
+    dshash_seq_status hstat;
+    PgStat_StatDBEntry *dbentry;
+    FILE       *fpout;
+    int32        format_id;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    int            rc;
+
+    /* stats are not initialized yet; just return */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
+
+    elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
     /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
+     * Open the statistics temp file to write out the current values.
      */
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        return;
+    }
+
+    /*
+     * Set the timestamp of the stats file.
+     */
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
+
+    /*
+     * Write the file header --- currently just a format ID.
+     */
+    format_id = PGSTAT_FILE_FORMAT_ID;
+    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write global stats struct
+     */
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write archiver stats struct
+     */
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Walk through the database table.
+     */
+    dshash_seq_init(&hstat, pgStatDBHash, false, false);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&hstat)) != NULL)
+    {
+        /*
+         * Write out the table and function stats for this DB into the
+         * appropriate per-DB stat file, if required.
+         */
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
+
+        pgstat_write_pgStatDBHashfile(dbentry);
+
+        /*
+         * Write out the DB entry. We don't write the tables or functions
+         * pointers, since they're of no use to any other process.
+         */
+        fputc('D', fpout);
+        rc = fwrite(dbentry,
+                    offsetof(PgStat_StatDBEntry, generation), 1, fpout);
+        (void) rc;                /* we'll check for error with ferror */
+    }
+
+    /*
+     * No more output to be done. Close the temp file and replace the old
+     * pgstat.stat with it.  The ferror() check replaces testing for error
+     * after each individual fputc or fwrite above.
+     */
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+}
+
+/* ----------
+ * pgstat_write_pgStatDBHashfile() -
+ *        Write the stat file for a single database.
+ * ----------
+ */
+static void
+pgstat_write_pgStatDBHashfile(PgStat_StatDBEntry *dbentry)
+{
+    dshash_seq_status tstat;
+    dshash_seq_status fstat;
+    PgStat_StatTabEntry *tabentry;
+    PgStat_StatFuncEntry *funcentry;
+    FILE       *fpout;
+    int32        format_id;
+    Oid            dbid = dbentry->databaseid;
+    int            rc;
+    char        tmpfile[MAXPGPATH];
+    char        statfile[MAXPGPATH];
+    dshash_table *tbl;
+
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
+
+    elog(DEBUG2, "writing stats file \"%s\"", statfile);
+
+    /*
+     * Open the statistics temp file to write out the current values.
+     */
+    fpout = AllocateFile(tmpfile, PG_BINARY_W);
+    if (fpout == NULL)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not open temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        return;
+    }
+
+    /*
+     * Write the file header --- currently just a format ID.
+     */
+    format_id = PGSTAT_FILE_FORMAT_ID;
+    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Walk through the database's access stats per table.
+     */
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&tstat, tbl, false, false);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&tstat)) != NULL)
+    {
+        fputc('T', fpout);
+        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
+        (void) rc;                /* we'll check for error with ferror */
+    }
+    dshash_detach(tbl);
+
+    /*
+     * Walk through the database's function stats table.
+     */
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_seq_init(&fstat, tbl, false, false);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&fstat)) != NULL)
+        {
+            fputc('F', fpout);
+            rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+            (void) rc;                /* we'll check for error with ferror */
+        }
+        dshash_detach(tbl);
+    }
+
+    /*
+     * No more output to be done. Close the temp file and replace the old
+     * per-database stats file with it.  The ferror() check replaces testing
+     * for error after each individual fputc or fwrite above.
+     */
+    fputc('E', fpout);
+
+    if (ferror(fpout))
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not write temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        FreeFile(fpout);
+        unlink(tmpfile);
+    }
+    else if (FreeFile(fpout) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not close temporary statistics file \"%s\": %m",
+                        tmpfile)));
+        unlink(tmpfile);
+    }
+    else if (rename(tmpfile, statfile) < 0)
+    {
+        ereport(LOG,
+                (errcode_for_file_access(),
+                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+                        tmpfile, statfile)));
+        unlink(tmpfile);
+    }
+}
+
+/* ----------
+ * pgstat_read_statsfiles() -
+ *
+ *    Reads in existing statistics collector files into the shared stats hash.
+ *
+ * ----------
+ */
+void
+pgstat_read_statsfiles(void)
+{
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatDBEntry dbbuf;
+    FILE       *fpin;
+    int32        format_id;
+    bool        found;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+
+    /* shouldn't be called from postmaster */
+    Assert(IsUnderPostmaster);
+
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
+
+    /*
+     * Set the current timestamp (will be kept only in case we can't load an
+     * existing statsfile).
+     */
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp =
+        shared_globalStats->stat_reset_timestamp;
+
+    /*
+     * Try to open the stats file. If it doesn't exist, the backends simply
+     * return zero for anything and the collector simply starts from scratch
+     * with empty counters.
+     *
+     * ENOENT is a possibility if the stats collector is not running or has
+     * not yet written the stats file the first time.  Any other failure
+     * condition is suspicious.
+     */
+    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(LOG,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            statfile)));
+        return;
+    }
+
+    /*
+     * Verify it's of the expected format.
+     */
+    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
+        format_id != PGSTAT_FILE_FORMAT_ID)
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        goto done;
+    }
+
+    /*
+     * Read global stats struct
+     */
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+        sizeof(*shared_globalStats))
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+        goto done;
+    }
+
+    /*
+     * Read archiver stats struct
+     */
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+        sizeof(*shared_archiverStats))
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        goto done;
+    }
+
+    /*
+     * We found an existing collector stats file. Read it and put all the
+     * hashtable entries into place.
+     */
+    for (;;)
+    {
+        switch (fgetc(fpin))
+        {
+                /*
+                 * 'D'    A PgStat_StatDBEntry struct describing a database
+                 * follows.
+                 */
+            case 'D':
+                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, generation),
+                          fpin) != offsetof(PgStat_StatDBEntry, generation))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                /*
+                 * Add to the DB hash
+                 */
+                dbentry = (PgStat_StatDBEntry *)
+                    dshash_find_or_insert(pgStatDBHash, (void *) &dbbuf.databaseid,
+                                          &found);
+
+                /* don't allow duplicate dbentries */
+                if (found)
+                {
+                    dshash_release_lock(pgStatDBHash, dbentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                /* initialize the new shared entry */
+                init_dbentry(dbentry);
+
+                memcpy(dbentry, &dbbuf,
+                       offsetof(PgStat_StatDBEntry, generation));
+
+                /* Read the data from the database-specific file. */
+                pgstat_read_pgStatDBHashfile(dbentry);
+                dshash_release_lock(pgStatDBHash, dbentry);
+                break;
+
+            case 'E':
+                goto done;
+
+            default:
+                ereport(LOG,
+                        (errmsg("corrupted statistics file \"%s\"",
+                                statfile)));
+                goto done;
+        }
+    }
+
+done:
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+
+    return;
+}
+
+
+/* ----------
+ * pgstat_read_pgStatDBHashfile() -
+ *
+ *    Reads in the at-rest statistics file and creates shared statistics
+ *    tables. The file is removed after reading.
+ * ----------
+ */
+static void
+pgstat_read_pgStatDBHashfile(PgStat_StatDBEntry *dbentry)
+{
+    PgStat_StatTabEntry *tabentry;
+    PgStat_StatTabEntry tabbuf;
+    PgStat_StatFuncEntry funcbuf;
+    PgStat_StatFuncEntry *funcentry;
+    dshash_table *tabhash = NULL;
+    dshash_table *funchash = NULL;
+    FILE       *fpin;
+    int32        format_id;
+    bool        found;
+    char        statfile[MAXPGPATH];
+
+    get_dbstat_filename(false, dbentry->databaseid, statfile, MAXPGPATH);
+
+    /*
+     * Try to open the stats file. If it doesn't exist, the backends simply
+     * return zero for anything and the collector simply starts from scratch
+     * with empty counters.
+     *
+     * ENOENT is a possibility if the stats collector is not running or has
+     * not yet written the stats file the first time.  Any other failure
+     * condition is suspicious.
+     */
+    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+    {
+        if (errno != ENOENT)
+            ereport(LOG,
+                    (errcode_for_file_access(),
+                     errmsg("could not open statistics file \"%s\": %m",
+                            statfile)));
+        return;
+    }
+
+    /*
+     * Verify it's of the expected format.
+     */
+    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
+        format_id != PGSTAT_FILE_FORMAT_ID)
+    {
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        goto done;
+    }
+
+    /*
+     * We found an existing statistics file. Read it and put all the hashtable
+     * entries into place.
+     */
+    for (;;)
+    {
+        switch (fgetc(fpin))
+        {
+                /*
+                 * 'T'    A PgStat_StatTabEntry follows.
+                 */
+            case 'T':
+                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
+                          fpin) != sizeof(PgStat_StatTabEntry))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                if (tabhash == NULL)
+                {
+                    tabhash = dshash_create(area, &dsh_tblparams, 0);
+                    dbentry->tables =
+                        dshash_get_hash_table_handle(tabhash);
+                }
+
+                tabentry = (PgStat_StatTabEntry *)
+                    dshash_find_or_insert(tabhash,
+                                          (void *) &tabbuf.tableid, &found);
+
+                /* don't allow duplicate entries */
+                if (found)
+                {
+                    dshash_release_lock(tabhash, tabentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
+                dshash_release_lock(tabhash, tabentry);
+                break;
+
+                /*
+                 * 'F'    A PgStat_StatFuncEntry follows.
+                 */
+            case 'F':
+                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
+                          fpin) != sizeof(PgStat_StatFuncEntry))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                if (funchash == NULL)
+                {
+                    funchash = dshash_create(area, &dsh_funcparams, 0);
+                    dbentry->functions =
+                        dshash_get_hash_table_handle(funchash);
+                }
+
+                funcentry = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert(funchash,
+                                          (void *) &funcbuf.functionid, &found);
+
+                if (found)
+                {
+                    dshash_release_lock(funchash, funcentry);
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
+
+                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
+                dshash_release_lock(funchash, funcentry);
+                break;
+
+                /*
+                 * 'E'    The EOF marker of a complete stats file.
+                 */
+            case 'E':
+                goto done;
+
+            default:
+                ereport(LOG,
+                        (errmsg("corrupted statistics file \"%s\"",
+                                statfile)));
+                goto done;
+        }
+    }
+
+done:
+    if (tabhash)
+        dshash_detach(tabhash);
+    if (funchash)
+        dshash_detach(funchash);
+
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+}
+
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ *    Create pgStatLocalContext and pgStatSnapshotContext, if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+    if (!pgStatLocalContext)
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+
+    if (!pgStatSnapshotContext)
+        pgStatSnapshotContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Database statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+}
+
+/* ----------
+ * pgstat_clear_snapshot() -
+ *
+ *    Discard any data collected in the current transaction.  Any subsequent
+ *    request will cause new snapshots to be read.
+ *
+ *    This is also invoked during transaction commit or abort to discard
+ *    the no-longer-wanted snapshot.
+ * ----------
+ */
+void
+pgstat_clear_snapshot(void)
+{
+    /* Release memory, if any was allocated */
+    if (pgStatLocalContext)
+    {
+        MemoryContextDelete(pgStatLocalContext);
+
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
+    }
+
+    if (pgStatSnapshotContext)
+        clear_snapshot = true;
+}
+
+static bool
+pgstat_update_tabentry(dshash_table *tabhash, PgStat_TableStatus *stat,
+                       bool nowait)
+{
+    PgStat_StatTabEntry *tabentry;
+    bool    found;
+
+    if (tabhash == NULL)
+        return false;
+
+    tabentry = (PgStat_StatTabEntry *)
+        dshash_find_or_insert_extended(tabhash, (void *) &(stat->t_id),
+                                       &found, nowait);
+
+    /* failed to acquire lock */
+    if (tabentry == NULL)
+        return false;
+
     if (!found)
-        reset_dbentry_counters(result);
+    {
+        /*
+         * If it's a new table entry, initialize counters to the values we
+         * just got.
+         */
+        tabentry->numscans = stat->t_counts.t_numscans;
+        tabentry->tuples_returned = stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched = stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted = stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated = stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted = stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated = stat->t_counts.t_tuples_hot_updated;
+        tabentry->n_live_tuples = stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples = stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze = stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched = stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit = stat->t_counts.t_blocks_hit;
+
+        tabentry->vacuum_timestamp = 0;
+        tabentry->vacuum_count = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_count = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->analyze_count = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+        tabentry->autovac_analyze_count = 0;
+    }
+    else
+    {
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        tabentry->numscans += stat->t_counts.t_numscans;
+        tabentry->tuples_returned += stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched += stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted += stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated += stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted += stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated += stat->t_counts.t_tuples_hot_updated;
+        /* If table was truncated, first reset the live/dead counters */
+        if (stat->t_counts.t_truncated)
+        {
+            tabentry->n_live_tuples = 0;
+            tabentry->n_dead_tuples = 0;
+        }
+        tabentry->n_live_tuples += stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples += stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze += stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched += stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit += stat->t_counts.t_blocks_hit;
+    }
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
+
+    dshash_release_lock(tabhash, tabentry);
+
+    return true;
+}
+
+static void
+pgstat_update_dbentry(PgStat_StatDBEntry *dbentry, PgStat_TableStatus *stat)
+{
+    /*
+     * Add per-table stats to the per-database entry, too.
+     */
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
+    dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
+    dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
+    dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
+    dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
+    dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
+    dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
+    LWLockRelease(&dbentry->lock);
+}
+
+/*
+ * Look up the shared stats entry for the specified database. Returns NULL
+ * when PGSTAT_NOWAIT is specified and the required lock cannot be acquired.
+ */
+static PgStat_StatDBEntry *
+pgstat_get_db_entry(Oid databaseid, int op, PgStat_TableLookupResult *status)
+{
+    PgStat_StatDBEntry *result;
+    bool        nowait = ((op & PGSTAT_NOWAIT) != 0);
+    bool        lock_acquired = true;
+    bool        found = true;
+
+    if (!IsUnderPostmaster || !pgStatDBHash)
+        return NULL;
+
+    /* Lookup or create the hash table entry for this database */
+    if (op & PGSTAT_EXCLUSIVE)
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_or_insert_extended(pgStatDBHash, &databaseid,
+                                           &found, nowait);
+        if (result == NULL)
+            lock_acquired = false;
+        else if (!found)
+        {
+            /*
+             * If not found, initialize the new one.  This creates empty hash
+             * tables, too.
+             */
+            init_dbentry(result);
+            reset_dbentry_counters(result);
+        }
+    }
+    else
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_extended(pgStatDBHash, &databaseid, true, nowait,
+                                 nowait ? &lock_acquired : NULL);
+        if (result == NULL)
+            found = false;
+    }
+
+    /* Set return status if requested */
+    if (status)
+    {
+        if (!lock_acquired)
+        {
+            Assert(nowait);
+            *status = LOCK_FAILED;
+        }
+        else if (!found)
+            *status = NOT_FOUND;
+        else
+            *status = FOUND;
+    }
 
     return result;
 }
 
-
 /*
  * Lookup the hash table entry for the specified table. If no hash
  * table entry exists, initialize it, if the create parameter is true.
  * Else, return NULL.
  */
 static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create)
 {
     PgStat_StatTabEntry *result;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
 
     /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
+    if (create)
+        result = (PgStat_StatTabEntry *)
+            dshash_find_or_insert(table, &tableoid, &found);
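+    else
+    {
+        result = (PgStat_StatTabEntry *) dshash_find(table, &tableoid, false);
+        /* dshash_find has no "found" output; derive it from the result */
+        found = (result != NULL);
+    }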
 
     if (!create && !found)
         return NULL;
@@ -4842,1702 +5963,6 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
     return result;
 }
 
-
-/* ----------
- * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
- * ----------
- */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
-{
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    FILE       *fpout;
-    int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-    int            rc;
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Set the timestamp of the stats file.
-     */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Write global stats struct
-     */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Write archiver stats struct
-     */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database table.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        /*
-         * Write out the table and function stats for this DB into the
-         * appropriate per-DB stat file, if required.
-         */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
-
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
-
-        /*
-         * Write out the DB entry. We don't write the tables or functions
-         * pointers, since they're of no use to any other process.
-         */
-        fputc('D', fpout);
-        rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
-}
-
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
-
-/* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
- * ----------
- */
-static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
-{
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpout;
-    int32        format_id;
-    Oid            dbid = dbentry->databaseid;
-    int            rc;
-    char        tmpfile[MAXPGPATH];
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database's access stats per table.
-     */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
-    {
-        fputc('T', fpout);
-        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * Walk through the database's function stats table.
-     */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_statsfiles() -
- *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
- *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
- * ----------
- */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
-
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-
-    /*
-     * Set the current timestamp (will be kept only in case we can't load an
-     * existing statsfile).
-     */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return dbhash;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
-        goto done;
-    }
-
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Add to the DB hash
-                 */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-
-    return dbhash;
-}
-
-
-/* ----------
- * pgstat_read_db_statsfile() -
- *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
- * ----------
- */
-static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
-{
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatTabEntry tabbuf;
-    PgStat_StatFuncEntry funcbuf;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'T'    A PgStat_StatTabEntry follows.
-                 */
-            case 'T':
-                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
-                          fpin) != sizeof(PgStat_StatTabEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
-                break;
-
-                /*
-                 * 'F'    A PgStat_StatFuncEntry follows.
-                 */
-            case 'F':
-                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
-                          fpin) != sizeof(PgStat_StatFuncEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
-                break;
-
-                /*
-                 * 'E'    The EOF marker of a complete stats file.
-                 */
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
-}
-
-
-/* ----------
- * pgstat_setup_memcxt() -
- *
- *    Create pgStatLocalContext, if not already done.
- * ----------
- */
-static void
-pgstat_setup_memcxt(void)
-{
-    if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
-}
-
-
-/* ----------
- * pgstat_clear_snapshot() -
- *
- *    Discard any data collected in the current transaction.  Any subsequent
- *    request will cause new snapshots to be read.
- *
- *    This is also invoked during transaction commit or abort to discard
- *    the no-longer-wanted snapshot.
- * ----------
- */
-void
-pgstat_clear_snapshot(void)
-{
-    /* Release memory, if any was allocated */
-    if (pgStatLocalContext)
-        MemoryContextDelete(pgStatLocalContext);
-
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
-}
-
 /*
  * Convert a potentially unsafely truncated activity string (see
  * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 27a9e45074..d4a590fa5a 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -255,7 +255,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -503,7 +502,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1317,12 +1315,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1771,11 +1763,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2660,8 +2647,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -3008,8 +2993,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3076,13 +3059,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3157,22 +3133,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3631,22 +3591,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3842,8 +3786,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3878,8 +3820,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4080,8 +4021,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5054,18 +4993,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5178,12 +5105,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6072,7 +5993,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6128,8 +6048,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6364,7 +6282,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 885370698f..cfb3b91b11 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -147,6 +147,7 @@ CreateSharedMemoryAndSemaphores(void)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -263,6 +264,7 @@ CreateSharedMemoryAndSemaphores(void)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index fb0bf44264..b423aaaf02 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -522,6 +522,7 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
     LWLockRegisterTranche(LWTRANCHE_SXACT, "serializable_xact");
+    LWLockRegisterTranche(LWTRANCHE_STATS, "activity stats");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index e8d8e6f828..bec27c3034 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3159,6 +3159,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3733,6 +3739,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4173,9 +4180,17 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
-                ProcessCompletedNotifies();
-                pgstat_report_stat(false);
+                long stats_timeout;
 
+                ProcessCompletedNotifies();
+
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4210,7 +4225,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4218,6 +4233,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 3bf96de256..9c694f20c9 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 29c5ec7b58..66c6a2b1e8 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -74,6 +74,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -631,6 +632,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1241,6 +1244,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b7d36b65dd..13be46c172 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -6,7 +6,7 @@ use File::Basename qw(basename dirname);
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 106;
+use Test::More tests => 105;
 
 program_help_ok('pg_basebackup');
 program_version_ok('pg_basebackup');
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1f4db67f3f..43250c3885 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 65713abc2b..c9fbcead3f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL statistics collector facility.
  *
  *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
  *
@@ -13,10 +13,11 @@
 
 #include "datatype/timestamp.h"
 #include "libpq/pqcomm.h"
-#include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -40,33 +41,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -77,9 +51,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -115,13 +88,6 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_blocks_hit;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -180,236 +146,12 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
 /* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
- */
-typedef struct PgStat_MsgHdr
-{
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
+ * PgStat_MsgBgWriter            bgwriter statistics
  * ----------
  */
 typedef struct PgStat_MsgBgWriter
 {
-    PgStat_MsgHdr m_hdr;
-
     PgStat_Counter m_timed_checkpoints;
     PgStat_Counter m_requested_checkpoints;
     PgStat_Counter m_buf_written_checkpoints;
@@ -422,38 +164,14 @@ typedef struct PgStat_MsgBgWriter
     PgStat_Counter m_checkpoint_sync_time;
 } PgStat_MsgBgWriter;
 
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
-
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -485,96 +203,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Statistic collector data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -614,16 +244,29 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_block_write_time;
 
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;        /* time of db stats update */
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following members must be last in the struct, because we don't
+     * write them out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    int generation;                        /* current generation of the below */
+    int    refcnt;                            /* current gen reference count */
+    dshash_table_handle tables;            /* current gen tables hash */
+    dshash_table_handle functions;        /* current gen functions hash */
+    int    prev_refcnt;                    /* prev gen reference count */
+    dshash_table_handle prev_tables;    /* prev gen tables hash */
+    dshash_table_handle prev_functions;    /* prev gen functions hash */
+    LWLock    lock;                        /* Lock for the above members */
+
+    /* non-shared members */
+    HTAB *snapshot_tables;                 /* table entry snapshot */
+    HTAB *snapshot_functions;             /* function entry snapshot */
+    dshash_table    *dshash_tables;         /* attached tables dshash */
+    dshash_table    *dshash_functions;     /* attached functions dshash */
 } PgStat_StatDBEntry;
 
+#define SHARED_DBENT_SIZE offsetof(PgStat_StatDBEntry, snapshot_tables)
 
 /* ----------
  * PgStat_StatTabEntry            The collector's data per table (or index)
@@ -662,7 +305,7 @@ typedef struct PgStat_StatTabEntry
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            per function stats data
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
@@ -677,7 +320,7 @@ typedef struct PgStat_StatFuncEntry
 
 
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -693,7 +336,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -779,7 +422,6 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
     WAIT_EVENT_RECOVERY_WAL_ALL,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
@@ -1214,6 +856,8 @@ extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used, but will be removed with GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
@@ -1235,29 +879,26 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
 
+/* File input/output functions  */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 
 extern void pgstat_report_autovac(Oid dboid);
@@ -1429,11 +1070,13 @@ extern void pgstat_send_bgwriter(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_extended(PgStat_StatDBEntry *dbent, Oid relid);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index f627dfedc5..97801f4791 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -220,6 +220,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_SXACT,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 9244a2a7b7..a9b625211b 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.16.3
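
The SHARED_DBENT_SIZE macro in the patch above relies on the shared members of PgStat_StatDBEntry coming first and the per-process pointers coming last. Below is a minimal, self-contained sketch of that idiom. DemoDBEntry, DEMO_SHARED_SIZE and demo_copy_shared are hypothetical names for illustration only; they are not part of the patch, and the real struct has many more members.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, much simplified stand-in for PgStat_StatDBEntry. */
typedef struct DemoDBEntry
{
    /* shared members: safe to copy as one block */
    unsigned int databaseid;
    long        n_xact_commit;
    int         generation;

    /* non-shared members: pointers valid only in the owning process */
    void       *snapshot_tables;
    void       *snapshot_functions;
} DemoDBEntry;

/* Size of the leading shared part, same idiom as SHARED_DBENT_SIZE. */
#define DEMO_SHARED_SIZE offsetof(DemoDBEntry, snapshot_tables)

/*
 * Copy only the shared prefix of src into dst.  The local pointers at the
 * tail of dst are left untouched.
 */
static void
demo_copy_shared(DemoDBEntry *dst, const DemoDBEntry *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, DEMO_SHARED_SIZE);
}
```

With this layout, adding a new non-shared member after snapshot_tables does not change the copied size, while a member added before it is included automatically.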

From eda37e6344f6f848234e4ee79563b629a76737e6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 27 Nov 2018 14:42:12 +0900
Subject: [PATCH v23 5/5] Remove the GUC stats_temp_directory

The GUC used to specify the directory to store temporary statistics
files. It is no longer needed by the stats collector but is still used
by the programs in bin and contrib, and maybe by other extensions. Thus
this patch removes the GUC, but some backing variables and macro
definitions are left alone for backward compatibility.
---
 doc/src/sgml/backup.sgml                      |  2 --
 doc/src/sgml/config.sgml                      | 19 -------------
 doc/src/sgml/monitoring.sgml                  |  7 +----
 doc/src/sgml/storage.sgml                     |  3 +-
 src/backend/postmaster/pgstat.c               | 13 ++++-----
 src/backend/replication/basebackup.c          | 13 ++-------
 src/backend/utils/misc/guc.c                  | 41 ---------------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/pgstat.h                          |  5 +++-
 src/test/perl/PostgresNode.pm                 |  4 ---
 10 files changed, 14 insertions(+), 94 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index bdc9026c62..2885540362 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1146,8 +1146,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6612f95f9f..b346809c11 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6818,25 +6818,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index ea6aad4d1e..33ad2b8be8 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -195,12 +195,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
+   <productname>PostgreSQL</productname> processes through shared memory.
    When the server shuts down cleanly, a permanent copy of the statistics
    data is stored in the <filename>pg_stat</filename> subdirectory, so that
    statistics can be retained across server restarts.  When recovery is
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 1c19e863d2..2f04bb68bb 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -122,8 +122,7 @@ Item
 
 <row>
  <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
+ <entry>Subdirectory containing ephemeral files for extensions</entry>
 </row>
 
 <row>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index c0b20763b0..6b8025ad13 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -108,15 +108,12 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
+/*
+ * This used to be a GUC variable and is no longer used in this file, but is
+ * left alone, with its default value, for backward compatibility with
+ * extensions.
  */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+char       *pgstat_stat_directory = PG_STAT_TMP_DIR;
 
 #define        StatsLock (&StatsShmem->StatsMainLock)
 
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index d0f210de8c..39fcf29ff2 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -242,11 +242,8 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -277,13 +274,9 @@ perform_base_backup(basebackup_options *opt)
          * Calculate the relative path of temporary statistics directory in
          * order to skip the files which are located in that directory later.
          */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
+
+        Assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
+        statrelpath = psprintf("./%s", PG_STAT_TMP_DIR);
 
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 2178e1cf5e..50625421ab 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -194,7 +194,6 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static void assign_effective_io_concurrency(int newval, void *extra);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -4072,17 +4071,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11352,35 +11340,6 @@ assign_effective_io_concurrency(int newval, void *extra)
 #endif                            /* USE_PREFETCH */
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 0fc23e3a61..66f539c4bb 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -558,7 +558,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c9fbcead3f..e9e18ed27a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -30,7 +30,10 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
+/*
+ * This used to be the directory to store temporary statistics data in but is
+ * no longer used. Defined here for backward compatibility.
+ */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
 
 /* Values for track_functions GUC variable --- order is significant! */
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 270bd6c856..c604c5e90b 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -455,10 +455,6 @@ sub init
     print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
       if defined $ENV{TEMP_CONFIG};
 
-    # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
-    # concurrently must not share a stats_temp_directory.
-    print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
     if ($params{allows_streaming})
     {
         if ($params{allows_streaming} eq "logical")
-- 
2.16.3
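
The basebackup.c hunk in the patch above replaces a three-way path computation with a fixed string. Below is a rough standalone sketch of the two behaviours, for comparison only. The demo_* names are hypothetical, the absolute-path check is simplified to a leading '/' (the real is_absolute_path also handles Windows), and plain buffers stand in for psprintf.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Same value as PG_STAT_TMP_DIR in the patch. */
#define DEMO_STAT_TMP_DIR "pg_stat_tmp"

/*
 * Old behaviour: derive a "./"-prefixed relative path from a configurable
 * stats directory, which may be absolute (under datadir) or relative.
 */
static void
demo_old_statrelpath(const char *statdir, const char *datadir,
                     char *out, size_t outlen)
{
    size_t      datadirlen = strlen(datadir);

    if (statdir[0] == '/' && strncmp(statdir, datadir, datadirlen) == 0)
        snprintf(out, outlen, "./%s", statdir + datadirlen + 1);
    else if (strncmp(statdir, "./", 2) != 0)
        snprintf(out, outlen, "./%s", statdir);
    else
        snprintf(out, outlen, "%s", statdir);
}

/*
 * New behaviour: the directory is a compile-time constant without any
 * path separator, so the relative path is always the same.
 */
static void
demo_new_statrelpath(char *out, size_t outlen)
{
    assert(strchr(DEMO_STAT_TMP_DIR, '/') == NULL);
    snprintf(out, outlen, "./%s", DEMO_STAT_TMP_DIR);
}
```

For the default setting both variants yield "./pg_stat_tmp"; only the old one has to cope with a directory configured elsewhere.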


Re: shared-memory based stats collector

From
Michael Paquier
Date:
On Fri, Sep 27, 2019 at 09:46:47AM +0900, Kyotaro Horiguchi wrote:
> Affected by the code movement in 9a86f03b4e. Just
> rebased. Thanks.

This does not apply anymore.  Could you provide a rebase?  I have
moved the patch to next CF, waiting on author.

Thanks,
--
Michael

Attachments

Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
At Sun, 1 Dec 2019 11:12:32 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Fri, Sep 27, 2019 at 09:46:47AM +0900, Kyotaro Horiguchi wrote:
> > Affected by the code movement in 9a86f03b4e. Just
> > rebased. Thanks.
> 
> This does not apply anymore.  Could you provide a rebase?  I have
> moved the patch to next CF, waiting on author.

Thanks! Rebased.

# I should design and then run a performance test on this.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 761b0c55e88acc90c143d29a7d53dc6bb0495b7b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:41:04 +0900
Subject: [PATCH v24 1/5] sequential scan for dshash

Add sequential scan feature to dshash.
---
 src/backend/lib/dshash.c | 188 ++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  23 ++++-
 2 files changed, 206 insertions(+), 5 deletions(-)
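
A note on the bucket/partition arithmetic this patch adds to dshash.c (NUM_BUCKETS, PARTITION_FOR_BUCKET_INDEX): the sketch below copies the macros into a standalone file to show how a bucket index maps to a partition. DEMO_NUM_PARTITIONS_LOG2 = 7 is an assumption standing in for DSHASH_NUM_PARTITIONS_LOG2, and demo_count_partition_switches is a hypothetical helper, not part of the patch. It only models the idea that a scan walking buckets in index order crosses each partition boundary once.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: stands in for DSHASH_NUM_PARTITIONS_LOG2 (128 partitions). */
#define DEMO_NUM_PARTITIONS_LOG2 7

/* Macros as in the patch, with the demo constant substituted. */
#define NUM_SPLITS(size_log2) \
    ((size_log2) - DEMO_NUM_PARTITIONS_LOG2)
#define NUM_BUCKETS(size_log2) \
    (((size_t) 1) << (size_log2))
#define BUCKETS_PER_PARTITION(size_log2) \
    (((size_t) 1) << NUM_SPLITS(size_log2))
#define BUCKET_INDEX_FOR_PARTITION(partition, size_log2) \
    ((partition) << NUM_SPLITS(size_log2))
#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2) \
    ((bucket_idx) >> NUM_SPLITS(size_log2))

/*
 * Walk all buckets in index order and count how often the owning
 * partition changes, i.e. how many partitions a full scan enters.
 */
static size_t
demo_count_partition_switches(size_t size_log2)
{
    size_t      nbuckets = NUM_BUCKETS(size_log2);
    size_t      switches = 0;
    size_t      prev = (size_t) -1; /* no partition entered yet */
    size_t      b;

    for (b = 0; b < nbuckets; b++)
    {
        size_t      part = PARTITION_FOR_BUCKET_INDEX(b, size_log2);

        if (part != prev)
        {
            switches++;
            prev = part;
        }
    }
    return switches;
}
```

Because buckets of one partition are contiguous in index order, the count equals the number of partitions regardless of table size, as long as size_log2 is at least DEMO_NUM_PARTITIONS_LOG2.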

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 350f8c0a66..4f0c7ec840 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -112,6 +112,7 @@ struct dshash_table
     size_t        size_log2;        /* log2(number of buckets) */
     bool        find_locked;    /* Is any partition lock held by 'find'? */
     bool        find_exclusively_locked;    /* ... exclusively? */
+    bool        seqscan_running;/* now under sequential scan */
 };
 
 /* Given a pointer to an item, find the entry (user data) it holds. */
@@ -127,6 +128,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +158,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -228,6 +237,7 @@ dshash_create(dsa_area *area, const dshash_parameters *params, void *arg)
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
 
     /*
      * Set up the initial array of buckets.  Our initial size is the same as
@@ -279,6 +289,7 @@ dshash_attach(dsa_area *area, const dshash_parameters *params,
     hash_table->control = dsa_get_address(area, control);
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
     Assert(hash_table->control->magic == DSHASH_MAGIC);
 
     /*
@@ -324,7 +335,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
                                 LW_EXCLUSIVE));
 
     delete_item(hash_table, item);
-    hash_table->find_locked = false;
-    hash_table->find_exclusively_locked = false;
-    LWLockRelease(PARTITION_LOCK(hash_table, partition));
+
+    /* We need to keep the partition lock during a sequential scan */
+    if (!hash_table->seqscan_running)
+    {
+        hash_table->find_locked = false;
+        hash_table->find_exclusively_locked = false;
+        LWLockRelease(PARTITION_LOCK(hash_table, partition));
+    }
 }
 
 /*
@@ -568,6 +584,8 @@ dshash_release_lock(dshash_table *hash_table, void *entry)
     Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition_index),
                                 hash_table->find_exclusively_locked
                                 ? LW_EXCLUSIVE : LW_SHARED));
+    /* lock is under control of sequential scan */
+    Assert(!hash_table->seqscan_running);
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
@@ -592,6 +610,168 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through the dshash table and return all the
+ *           elements one by one; return NULL when there are no more.
+ *
+ * dshash_seq_term should be called if and only if the scan is abandoned
+ * before completion; if dshash_seq_next returns NULL then it has already done
+ * the end-of-scan cleanup.
+ *
+ * The returned element is locked, as is the case with dshash_find.
+ * However, the caller must not release the lock. The lock is released as
+ * necessary as the scan continues.
+ *
+ * As opposed to the equivalent for dynahash, the caller is not supposed to
+ * delete the returned element before continuing the scan.
+ *
+ * If consistent is set for dshash_seq_init, the whole hash table is
+ * non-exclusively locked. Otherwise a part of the hash table is locked in the
+ * same mode (partition lock).
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool consistent, bool exclusive)
+{
+    /* at most one scan is allowed at a time */
+    Assert(!hash_table->seqscan_running);
+
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->consistent = consistent;
+    status->exclusive = exclusive;
+    hash_table->seqscan_running = true;
+
+    /*
+     * Protect all partitions from modification if the caller wants a
+     * consistent result.
+     */
+    if (consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+        {
+            Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i)));
+
+            LWLockAcquire(PARTITION_LOCK(hash_table, i),
+                          exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        }
+        ensure_valid_bucket_pointers(hash_table);
+    }
+}
+
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    Assert(status->hash_table->seqscan_running);
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert(status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        if (!status->consistent)
+        {
+            partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+            LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            status->curpartition = partition;
+
+            /* resize doesn't happen from now until seq scan ends */
+            status->nbuckets =
+                NUM_BUCKETS(status->hash_table->control->size_log2);
+            ensure_valid_bucket_pointers(status->hash_table);
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            dshash_seq_term(status);
+            return NULL;
+        }
+
+        /* Also move partition lock if needed */
+        if (!status->consistent)
+        {
+            int next_partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+
+            /* Move lock along with partition for the bucket */
+            if (status->curpartition != next_partition)
+            {
+                /*
+                 * Take lock on the next partition then release the current,
+                 * not in the reverse order. This is required to avoid
+                 * resizing from happening during a sequential scan. Locks are
+                 * taken in partition order so no deadlock happens with other
+                 * seq scans or resizing.
+                 */
+                LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                             next_partition),
+                              status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+                LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                             status->curpartition));
+                status->curpartition = next_partition;
+            }
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * The caller may delete this item. Remember the next item now so that
+     * the next iteration can continue in that case.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    Assert(status->hash_table->seqscan_running);
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+    status->hash_table->seqscan_running = false;
+
+    if (status->consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+            LWLockRelease(PARTITION_LOCK(status->hash_table, i));
+    }
+    else if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index fa2e28ff3e..79698a6ad6 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,23 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state of dshash. The detail is exposed since the storage
+ * size should be known to users but it should be considered as an opaque
+ * type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;
+    int                    curbucket;
+    int                    nbuckets;
+    dshash_table_item  *curitem;
+    dsa_pointer            pnextitem;
+    int                    curpartition;
+    bool                consistent;
+    bool                exclusive;
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
                                    const dshash_parameters *params,
@@ -70,7 +87,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
 extern void dshash_detach(dshash_table *hash_table);
 extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
 extern void dshash_destroy(dshash_table *hash_table);
-
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
@@ -80,6 +96,11 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool consistent, bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.23.0
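
A note on the lock hand-over in dshash_seq_next above: in the non-consistent
mode the scan takes the lock on the next partition before releasing the
current one, always in ascending partition order. The following is only a
single-threaded sketch of that ordering, not code from the patch. The toy_*
names, the boolean "locks" and the small table sizes are assumptions made up
for illustration; real partition locks are LWLocks.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy sizes for illustration only; these are not the values dshash uses. */
#define TOY_NUM_PARTITIONS_LOG2 2
#define TOY_NUM_PARTITIONS      (1 << TOY_NUM_PARTITIONS_LOG2)
#define TOY_SIZE_LOG2           4
#define TOY_NUM_BUCKETS         (1 << TOY_SIZE_LOG2)

/* Neighbouring buckets share a partition, so each boundary is crossed once. */
#define TOY_PARTITION_FOR_BUCKET(b) \
    ((b) >> (TOY_SIZE_LOG2 - TOY_NUM_PARTITIONS_LOG2))

typedef struct toy_scan_stats
{
    int         max_held;       /* most locks held at the same moment */
    int         acquisitions;   /* number of lock acquisitions */
    bool        ordered;        /* were locks always taken in ascending order? */
} toy_scan_stats;

/* Hypothetical stand-in for partition locks: one flag per partition. */
static bool toy_lock_held[TOY_NUM_PARTITIONS];

int
toy_count_held(void)
{
    int         i;
    int         n = 0;

    for (i = 0; i < TOY_NUM_PARTITIONS; i++)
        if (toy_lock_held[i])
            n++;
    return n;
}

static void
toy_acquire(int partition, toy_scan_stats *stats)
{
    int         held;

    assert(!toy_lock_held[partition]);
    toy_lock_held[partition] = true;
    stats->acquisitions++;
    held = toy_count_held();
    if (held > stats->max_held)
        stats->max_held = held;
}

static void
toy_release(int partition)
{
    assert(toy_lock_held[partition]);
    toy_lock_held[partition] = false;
}

/* Walk all buckets, moving the lock along with the bucket's partition. */
toy_scan_stats
toy_scan_all_buckets(void)
{
    toy_scan_stats stats = {0, 0, true};
    int         cur = TOY_PARTITION_FOR_BUCKET(0);
    int         bucket;

    toy_acquire(cur, &stats);
    for (bucket = 1; bucket < TOY_NUM_BUCKETS; bucket++)
    {
        int         next = TOY_PARTITION_FOR_BUCKET(bucket);

        if (next != cur)
        {
            if (next < cur)
                stats.ordered = false;
            toy_acquire(next, &stats);  /* take the next lock first ... */
            toy_release(cur);           /* ... then drop the previous one */
            cur = next;
        }
    }
    toy_release(cur);
    return stats;
}
```

In this model the scan never holds more than two locks at once and never
leaves a moment where no partition is locked, which is the property the
comment in dshash_seq_next relies on to keep resizing out.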

From 54d24757f2ddac318bb781137972321d819d22a5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 27 Sep 2018 11:15:19 +0900
Subject: [PATCH v24 2/5] Add conditional lock feature to dshash

Dshash currently waits for locks unconditionally. This commit adds new
interfaces for dshash_find and dshash_find_or_insert. The new
interfaces have an extra parameter "nowait" that tells them not to wait
for the lock.
---
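For reviewers, a rough sketch of how a caller might use the nowait form. With
nowait, a NULL result can mean either "no such entry" or "lock was busy", so
the caller needs the lock_acquired flag to tell them apart. The code below is
a self-contained toy model under assumptions: toy_table, toy_try_lock and
toy_find_nowait are hypothetical names, the "lock" is a plain flag, and the
table holds a single entry. It is not the dshash implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical one-entry table with a flag standing in for a partition lock. */
typedef struct toy_table
{
    bool        locked;
    bool        has_entry;
    int         key;
    int         value;
} toy_table;

/* Conditional acquire: never waits, just reports success or failure. */
static bool
toy_try_lock(toy_table *t)
{
    if (t->locked)
        return false;
    t->locked = true;
    return true;
}

/*
 * Look up key without waiting for the lock.  On a hit the entry is returned
 * with the lock still held.  On a miss the lock is released (an assumption
 * of this model) and NULL is returned.  If the lock is busy, NULL is
 * returned and *lock_acquired is set to false.
 */
int *
toy_find_nowait(toy_table *t, int key, bool *lock_acquired)
{
    assert(t != NULL);

    if (!toy_try_lock(t))
    {
        if (lock_acquired)
            *lock_acquired = false;
        return NULL;
    }
    if (lock_acquired)
        *lock_acquired = true;

    if (t->has_entry && t->key == key)
        return &t->value;       /* lock stays held for the caller */

    t->locked = false;          /* nothing found, drop the lock */
    return NULL;
}
```

A caller that gets NULL with lock_acquired set to false would typically keep
its pending update locally and retry later instead of treating the entry as
missing.
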
 src/backend/lib/dshash.c | 69 ++++++++++++++++++++++++++++++++++++----
 src/include/lib/dshash.h |  6 ++++
 2 files changed, 68 insertions(+), 7 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 4f0c7ec840..60a6e3c0bc 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -394,19 +394,48 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  */
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
+{
+    return dshash_find_extended(hash_table, key, exclusive, false, NULL);
+}
+
+/*
+ * Like dshash_find, but returns immediately when nowait is true and the lock
+ * was not acquired. The lock status is stored in *lock_acquired if not NULL.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool *lock_acquired)
 {
     dshash_hash hash;
     size_t        partition;
     dshash_table_item *item;
 
+    /* reporting lock status makes sense only when nowait is true */
+    Assert(nowait || !lock_acquired);
+
     hash = hash_key(hash_table, key);
     partition = PARTITION_FOR_HASH(hash);
 
     Assert(hash_table->control->magic == DSHASH_MAGIC);
     Assert(!hash_table->find_locked);
 
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partition),
+                                      exclusive ? LW_EXCLUSIVE : LW_SHARED))
+        {
+            if (lock_acquired)
+                *lock_acquired = false;
+            return NULL;
+        }
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition),
+                      exclusive ? LW_EXCLUSIVE : LW_SHARED);
+
+    if (lock_acquired)
+        *lock_acquired = true;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -441,6 +470,22 @@ void *
 dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
+{
+    return dshash_find_or_insert_extended(hash_table, key, found, false);
+}
+
+/*
+ * Like dshash_find_or_insert, but returns NULL if nowait is true and the
+ * lock was not acquired.
+ *
+ * Notes above dshash_find_extended() regarding locking and error handling
+ * equally apply here.
+ */
+void *
+dshash_find_or_insert_extended(dshash_table *hash_table,
+                               const void *key,
+                               bool *found,
+                               bool nowait)
 {
     dshash_hash hash;
     size_t        partition_index;
@@ -455,8 +500,16 @@ dshash_find_or_insert(dshash_table *hash_table,
     Assert(!hash_table->find_locked);
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(
+                PARTITION_LOCK(hash_table, partition_index),
+                LW_EXCLUSIVE))
+            return NULL;
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
+                      LW_EXCLUSIVE);
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -626,9 +679,11 @@ dshash_memhash(const void *v, size_t size, void *arg)
  * As opposed to the equivalent for dynahash, the caller is not supposed to
  * delete the returned element before continuing the scan.
  *
- * If consistent is set for dshash_seq_init, the whole hash table is
- * non-exclusively locked. Otherwise a part of the hash table is locked in the
- * same mode (partition lock).
+ * If consistent is set for dshash_seq_init, all hash table
+ * partitions are locked in the requested mode (as determined by the
+ * exclusive flag), and the locks are held until the end of the scan.
+ * Otherwise the partition locks are acquired and released as needed
+ * during the scan (up to two partitions may be locked at the same time).
  */
 void
 dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 79698a6ad6..67f7d77f71 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -90,8 +90,14 @@ extern void dshash_destroy(dshash_table *hash_table);
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+                                  bool exclusive, bool nowait,
+                                  bool *lock_acquired);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                                    const void *key, bool *found);
+extern void *dshash_find_or_insert_extended(dshash_table *hash_table,
+                                            const void *key, bool *found,
+                                            bool nowait);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.23.0
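
A side note on the scan code from the sequential-scan patch: dshash_seq_next
stores the pointer to the following item (pnextitem) before handing out the
current one, so the scan can go on even if the current item is gone by the
next call. The sketch below shows the same idea on an ordinary singly linked
list. It is an illustration under assumptions, not dshash code: toy_item,
toy_scan and the helper functions are hypothetical, and plain malloc/free
stands in for dsa allocation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct toy_item
{
    int         value;
    struct toy_item *next;
} toy_item;

typedef struct toy_scan
{
    toy_item  **head;
    toy_item   *next;           /* saved successor of the last returned item */
    bool        started;
} toy_scan;

toy_item *
toy_push(toy_item *head, int value)
{
    toy_item   *n = malloc(sizeof(*n));

    assert(n != NULL);
    n->value = value;
    n->next = head;
    return n;
}

/* Unlink and free victim, which must be on the list. */
static void
toy_delete(toy_item **head, toy_item *victim)
{
    toy_item  **link = head;

    while (*link != NULL && *link != victim)
        link = &(*link)->next;
    assert(*link == victim);
    *link = victim->next;
    free(victim);
}

static toy_item *
toy_scan_next(toy_scan *scan)
{
    toy_item   *item = scan->started ? scan->next : *scan->head;

    scan->started = true;
    if (item == NULL)
        return NULL;
    /* Remember the successor now; the caller may free item. */
    scan->next = item->next;
    return item;
}

/* Delete every even value during the scan; returns the number visited. */
int
toy_delete_even_values(toy_item **head)
{
    toy_scan    scan = {head, NULL, false};
    toy_item   *item;
    int         visited = 0;

    while ((item = toy_scan_next(&scan)) != NULL)
    {
        visited++;
        if (item->value % 2 == 0)
            toy_delete(head, item);
    }
    return visited;
}

int
toy_sum(const toy_item *head)
{
    int         sum = 0;

    for (; head != NULL; head = head->next)
        sum += head->value;
    return sum;
}
```

Without the saved pointer, the iterator would have to read item->next after
the caller had already freed the item.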

From c4b747064b46f18de7d212a41210952ea27e3c5c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 7 Nov 2018 16:53:49 +0900
Subject: [PATCH v24 3/5] Make archiver process an auxiliary process

This is a preliminary patch for the shared-memory based stats collector.
The archiver process must be an auxiliary process since it uses shared
memory once stats data is moved into shared memory. Make the process
an auxiliary process in order to make it work.
---
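One detail in postmaster.c that is easy to miss: adding a BACKEND_TYPE_* bit
means BACKEND_TYPE_ALL has to grow with it, otherwise signalling "all"
children would skip the new type. The sketch below restates the masks and
checks them at compile time. The 0x0002 to 0x001F values are the ones in this
patch; BACKEND_TYPE_NORMAL as 0x0001 and the toy_type_matches helper are
assumptions added only to make the example self-contained.

```c
#include <assert.h>
#include <stdbool.h>

#define BACKEND_TYPE_NORMAL     0x0001  /* assumed here, not shown in the diff */
#define BACKEND_TYPE_AUTOVAC    0x0002  /* autovacuum worker process */
#define BACKEND_TYPE_WALSND     0x0004  /* walsender process */
#define BACKEND_TYPE_BGWORKER   0x0008  /* bgworker process */
#define BACKEND_TYPE_ARCHIVER   0x0010  /* archiver process */
#define BACKEND_TYPE_ALL        0x001F  /* OR of all the above */

#define BACKEND_TYPE_WORKER     (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)

/* Fails to compile if a new bit is added without widening the ALL mask. */
_Static_assert(BACKEND_TYPE_ALL ==
               (BACKEND_TYPE_NORMAL | BACKEND_TYPE_AUTOVAC |
                BACKEND_TYPE_WALSND | BACKEND_TYPE_BGWORKER |
                BACKEND_TYPE_ARCHIVER),
               "BACKEND_TYPE_ALL must cover every backend type bit");

/* Hypothetical helper: does a process of backend_type fall under target? */
bool
toy_type_matches(int backend_type, int target)
{
    assert(backend_type != 0);
    return (backend_type & target) != 0;
}
```

With the old mask value 0x000F the archiver bit would not match, which is why
the patch changes both defines together.
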
 src/backend/bootstrap/bootstrap.c   |  8 +++
 src/backend/postmaster/pgarch.c     | 98 +++++++----------------------
 src/backend/postmaster/pgstat.c     |  6 ++
 src/backend/postmaster/postmaster.c | 35 ++++++++---
 src/include/miscadmin.h             |  2 +
 src/include/pgstat.h                |  1 +
 src/include/postmaster/pgarch.h     |  4 +-
 7 files changed, 67 insertions(+), 87 deletions(-)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 8ea033610d..6e38f9a3d2 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -329,6 +329,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case BgWriterProcess:
                 statmsg = pgstat_get_backend_desc(B_BG_WRITER);
                 break;
+            case ArchiverProcess:
+                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
+                break;
             case CheckpointerProcess:
                 statmsg = pgstat_get_backend_desc(B_CHECKPOINTER);
                 break;
@@ -456,6 +459,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
             BackgroundWriterMain();
             proc_exit(1);        /* should never return */
 
+        case ArchiverProcess:
+            /* don't set signals, archiver has its own agenda */
+            PgArchiverMain();
+            proc_exit(1);        /* should never return */
+
         case CheckpointerProcess:
             /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index f84f882c4c..4342ebdab4 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -77,7 +77,6 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
@@ -96,7 +95,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgarch_exit(SIGNAL_ARGS);
 static void ArchSigHupHandler(SIGNAL_ARGS);
 static void ArchSigTermHandler(SIGNAL_ARGS);
@@ -114,75 +112,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -222,8 +151,8 @@ pgarch_forkexec(void)
  *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
  *    since we don't use 'em, it hardly matters...
  */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -255,8 +184,27 @@ PgArchiverMain(int argc, char *argv[])
 static void
 pgarch_exit(SIGNAL_ARGS)
 {
-    /* SIGQUIT means curl up and die ... */
-    exit(1);
+    PG_SETMASK(&BlockSig);
+
+    /*
+     * We DO NOT want to run proc_exit() callbacks -- we're here because
+     * shared memory may be corrupted, so we don't want to try to clean up our
+     * transaction.  Just nail the windows shut and get out of town.  Now that
+     * there's an atexit callback to prevent third-party code from breaking
+     * things by calling exit() directly, we have to reset the callbacks
+     * explicitly to make this work as intended.
+     */
+    on_exit_reset();
+
+    /*
+     * Note we do exit(2) not exit(0).  This is to force the postmaster into a
+     * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+     * backend.  This is necessary precisely because we don't clean up our
+     * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+     * should ensure the postmaster sees this as a crash, too, but no harm in
+     * being doubly sure.)
+     */
+    exit(2);
 }
 
 /* SIGHUP signal handler for archiver process */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index fabcf31de8..8299d2a435 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2932,6 +2932,9 @@ pgstat_bestart(void)
             case StartupProcess:
                 lbeentry.st_backendType = B_STARTUP;
                 break;
+            case ArchiverProcess:
+                lbeentry.st_backendType = B_ARCHIVER;
+                break;
             case BgWriterProcess:
                 lbeentry.st_backendType = B_BG_WRITER;
                 break;
@@ -4275,6 +4278,9 @@ pgstat_get_backend_desc(BackendType backendType)
 
     switch (backendType)
     {
+        case B_ARCHIVER:
+            backendDesc = "archiver";
+            break;
         case B_AUTOVAC_LAUNCHER:
             backendDesc = "autovacuum launcher";
             break;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 9ff2832c00..84fda38249 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -146,7 +146,8 @@
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
 #define BACKEND_TYPE_BGWORKER    0x0008    /* bgworker process */
-#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
+#define BACKEND_TYPE_ARCHIVER    0x0010    /* archiver process */
+#define BACKEND_TYPE_ALL        0x001F    /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -539,6 +540,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1785,7 +1787,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -3042,7 +3044,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3187,10 +3189,8 @@ reaper(SIGNAL_ARGS)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3438,7 +3438,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3643,6 +3643,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3915,6 +3927,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5190,7 +5203,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5475,6 +5488,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index bc6e03fbc7..1f4db67f3f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -399,6 +399,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -411,6 +412,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index fe076d823d..65713abc2b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -718,6 +718,7 @@ typedef struct PgStat_GlobalStats
  */
 typedef enum BackendType
 {
+    B_ARCHIVER,
     B_AUTOVAC_LAUNCHER,
     B_AUTOVAC_WORKER,
     B_BACKEND,
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index 2474eac26a..88f16863d4 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
-- 
2.23.0

From ecea667e9180e80f9f860326056d6180696c04bd Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 21 Feb 2019 12:44:56 +0900
Subject: [PATCH v24 4/5] Shared-memory based stats collector

Previously activity statistics were shared via files on disk. Every
backend sent the numbers to the stats collector process via a socket.
The collector wrote snapshots as a set of files on disk at a certain
interval, and every backend read them as necessary. That worked fine for
a comparatively small set of statistics, but the set is under pressure
to grow and the file size has reached the order of megabytes. To deal
with a larger statistics set, this patch lets backends share the
statistics directly via shared memory.
---
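The new header comment of pgstat.c names three intervals for flushing local
numbers to shared memory: PGSTAT_STAT_MIN_INTERVAL (500ms),
PGSTAT_STAT_RETRY_INTERVAL (100ms) and PGSTAT_STAT_MAX_INTERVAL (1000ms).
The following is one possible reading of how they interact, written as a
small decision function. It is a sketch under assumptions, not the logic in
the patch: the toy_* names, the three-way action enum and the rule that the
maximum interval is measured from when the numbers became pending are all
made up for illustration.

```c
#include <assert.h>

/* Interval values taken from the pgstat.c header comment in this patch. */
#define TOY_STAT_MIN_INTERVAL    500    /* ms between regular flushes */
#define TOY_STAT_RETRY_INTERVAL  100    /* ms to wait after a failed flush */
#define TOY_STAT_MAX_INTERVAL   1000    /* ms local numbers may stay pending */

typedef enum toy_flush_action
{
    TOY_FLUSH_SKIP,             /* too early, keep accumulating locally */
    TOY_FLUSH_TRY_NOWAIT,       /* try to flush, give up if the lock is busy */
    TOY_FLUSH_FORCE             /* pending too long, wait for the lock */
} toy_flush_action;

/*
 * Decide what to do at now_ms.  last_failed_ms is the time of the last
 * flush attempt that could not get the lock, or -1 if there was none.
 */
toy_flush_action
toy_flush_decide(long now_ms, long last_flush_ms,
                 long pending_since_ms, long last_failed_ms)
{
    assert(now_ms >= last_flush_ms && now_ms >= pending_since_ms);

    /* Numbers must not stay local longer than the maximum interval. */
    if (now_ms - pending_since_ms >= TOY_STAT_MAX_INTERVAL)
        return TOY_FLUSH_FORCE;

    /* Do not touch shared memory more often than the minimum interval. */
    if (now_ms - last_flush_ms < TOY_STAT_MIN_INTERVAL)
        return TOY_FLUSH_SKIP;

    /* After a failed attempt, back off for the retry interval. */
    if (last_failed_ms >= 0 &&
        now_ms - last_failed_ms < TOY_STAT_RETRY_INTERVAL)
        return TOY_FLUSH_SKIP;

    return TOY_FLUSH_TRY_NOWAIT;
}
```

The TRY_NOWAIT case is where the conditional-lock interfaces added to dshash
earlier in this series would come in.
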
 doc/src/sgml/monitoring.sgml                 |    6 +-
 src/backend/postmaster/autovacuum.c          |   12 +-
 src/backend/postmaster/pgstat.c              | 4652 ++++++++----------
 src/backend/postmaster/postmaster.c          |   85 +-
 src/backend/storage/ipc/ipci.c               |    2 +
 src/backend/storage/lmgr/lwlock.c            |    1 +
 src/backend/tcop/postgres.c                  |   26 +-
 src/backend/utils/init/globals.c             |    1 +
 src/backend/utils/init/postinit.c            |   11 +
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |    4 +-
 src/include/miscadmin.h                      |    1 +
 src/include/pgstat.h                         |  441 +-
 src/include/storage/lwlock.h                 |    1 +
 src/include/utils/timeout.h                  |    1 +
 14 files changed, 2132 insertions(+), 3112 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index a3c5f86b7e..eb94dec119 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    master server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   master process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   master process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index c1dd8168ca..e07c7cb311 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1961,15 +1961,15 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
+    /* Start a transaction so our commands have one to play into. */
+    StartTransactionCommand();
+
     /*
      * may be NULL if we couldn't find an entry (only happens if we are
      * forcing a vacuum for anti-wrap purposes).
      */
     dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
 
-    /* Start a transaction so our commands have one to play into. */
-    StartTransactionCommand();
-
     /*
      * Clean up any dead statistics collector entries for this DB. We always
      * want to do this exactly once per DB-processing cycle, even if we find
@@ -2752,12 +2752,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
     if (isshared)
     {
         if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
+            tabentry = pgstat_fetch_stat_tabentry_extended(shared, relid);
     }
     else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
+        tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
 
     return tabentry;
 }
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8299d2a435..bcf8c6f371 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,23 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Statistics collector facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects per-table and per-function usage statistics of all backends in
+ *  shared memory. The pg_count_*() interfaces and friends store the activity
+ *  of each backend during a transaction. Then pgstat_flush_stat() is called
+ *  at the end of a transaction to flush the local numbers to shared memory.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ *  To avoid congestion on the shared memory, we update shared stats no more
+ *  often than once per PGSTAT_STAT_MIN_INTERVAL (500ms). If a backend cannot
+ *  flush all or part of its local numbers immediately, we postpone the
+ *  update and retry after PGSTAT_STAT_RETRY_INTERVAL (100ms). Pending
+ *  numbers are nevertheless not held locally for longer than
+ *  PGSTAT_STAT_MAX_INTERVAL (1000ms).
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the stats collector creates the area and
+ *  loads the stored stats file, if any. At shutdown, the last process writes
+ *  the shared stats to the file and then destroys the area before exiting.
  *
  *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
  *
@@ -19,18 +27,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -40,66 +36,39 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
-#include "postmaster/fork_process.h"
-#include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
-#include "utils/timestamp.h"
 
 /* ----------
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
-
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
+#define PGSTAT_STAT_MIN_INTERVAL    500 /* Minimum time between stats data
+                                         * updates; in milliseconds. */
 
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_STAT_RETRY_INTERVAL    100 /* Retry interval after
+                                         * PGSTAT_STAT_MIN_INTERVAL elapses */
 
+#define PGSTAT_STAT_MAX_INTERVAL   1000 /* Maximum time between stats data
+                                         * updates; in milliseconds. */
 
 /* ----------
  * The initial size hints for the hash tables used in the collector.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
 #define PGSTAT_TAB_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
@@ -115,6 +84,19 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
+/*
+ * Operation mode and return code of pgstat_get_db_entry.
+ */
+#define    PGSTAT_SHARED        0
+#define    PGSTAT_EXCLUSIVE    1
+#define    PGSTAT_NOWAIT        2
+
+typedef enum PgStat_TableLookupResult
+{
+    NOT_FOUND,
+    FOUND,
+    LOCK_FAILED
+} PgStat_TableLookupResult;
 
 /* ----------
  * GUC parameters
@@ -130,31 +112,63 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used; to be removed together with the GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
-/*
- * BgWriter global statistics counters (unused in other processes).
- * Stored directly in a stats message structure so it can be sent
- * without needing to copy things around.  We assume this inits to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
-
-/* ----------
- * Local data
- * ----------
- */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
+#define        StatsLock (&StatsShmem->StatsMainLock)
 
-static time_t last_pgstat_start_time;
+/* Shared stats bootstrap information */
+typedef struct StatsShmemStruct
+{
+    LWLock                StatsMainLock;        /* lock protecting this struct */
+    dsa_handle             stats_dsa_handle;    /* DSA handle for stats collector */
+    dshash_table_handle db_hash_handle;
+    dsa_pointer            global_stats;
+    dsa_pointer            archiver_stats;
+    int                    refcount;
+} StatsShmemStruct;
 
-static bool pgStatRunningInCollector = false;
+/*
+ * BgWriter global statistics counters. The name is a remnant from the time
+ * when the stats collector was a dedicated process and these counters were
+ * sent to it over a socket.
+ */
+PgStat_MsgBgWriter BgWriterStats = {0};
+
+/* Variables that live for the backend lifetime */
+static StatsShmemStruct *StatsShmem = NULL;
+static dsa_area *area = NULL;
+static dshash_table *pgStatDBHash = NULL;
+
+
+/* parameters for each type of shared hash */
+static const dshash_parameters dsh_dbparams = {
+    sizeof(Oid),
+    SHARED_DBENT_SIZE,
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+static const dshash_parameters dsh_tblparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatTabEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+static const dshash_parameters dsh_funcparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatFuncEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
 
 /*
  * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
+ * written to shared memory.
  *
  * NOTE: once allocated, TabStatusArray structures are never moved or deleted
  * for the life of the backend.  Also, we zero out the t_id fields of the
@@ -189,8 +203,8 @@ typedef struct TabStatHashEntry
 static HTAB *pgStatTabHash = NULL;
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * Backends store per-function info that's waiting to be flushed out to shared
+ * memory in this hash table (indexed by function OID).
  */
 static HTAB *pgStatFunctions = NULL;
 
@@ -200,6 +214,68 @@ static HTAB *pgStatFunctions = NULL;
  */
 static bool have_function_stats = false;
 
+/* common header of snapshot entry in backend snapshot hash */
+typedef struct PgStat_snapshot
+{
+    Oid        key;
+    bool    negative;
+    void   *body;                /* end of header part: to keep alignment */
+} PgStat_snapshot;
+
+/* context struct for snapshot_statentry */
+typedef struct pgstat_snapshot_param
+{
+    char           *hash_name;            /* name of the snapshot hash */
+    int                hash_entsize;        /* element size of hash entry */
+    dshash_table_handle    dsh_handle;        /* dsh handle to attach */
+    const dshash_parameters *dsh_params;/* dshash params */
+    HTAB          **hash;                /* points to variable to hold hash */
+    dshash_table  **dshash;                /* ditto for dshash */
+} pgstat_snapshot_param;
+
+/*
+ * Backends store various database-wide info that's waiting to be flushed out
+ * to shared memory in these variables.
+ *
+ * checksum_failures is the exception in that it is cluster-wide.
+ */
+typedef struct BackendDBStats
+{
+    int        n_conflict_tablespace;
+    int        n_conflict_lock;
+    int        n_conflict_snapshot;
+    int        n_conflict_bufferpin;
+    int        n_conflict_startup_deadlock;
+    int        n_deadlocks;
+    size_t    n_tmpfiles;
+    size_t    tmpfilesize;
+    HTAB    *checksum_failures;
+} BackendDBStats;
+
+/* Hash entry struct for checksum_failures above */
+typedef struct ChecksumFailureEnt
+{
+    Oid    dboid;
+    int    count;
+} ChecksumFailureEnt;
+
+static BackendDBStats BeDBStats = {0};
+
+/* macros to check BeDBStats at once */
+#define HAVE_PENDING_CONFLICTS() \
+    (BeDBStats.n_conflict_tablespace > 0 ||        \
+     BeDBStats.n_conflict_lock > 0 ||            \
+     BeDBStats.n_conflict_bufferpin > 0 ||        \
+     BeDBStats.n_conflict_startup_deadlock > 0)
+
+#define HAVE_PENDING_DBSTATS()                \
+    (HAVE_PENDING_CONFLICTS() ||        \
+     BeDBStats.n_deadlocks > 0 ||                \
+     BeDBStats.n_tmpfiles > 0 ||                \
+     /* no need to check tmpfilesize */        \
+     BeDBStats.checksum_failures != NULL)
+
+
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
  * into PgStat_TableStatus counters until we know if it is going to commit
@@ -235,11 +311,11 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
+static MemoryContext pgStatSnapshotContext = NULL;
+static HTAB *pgStatLocalHash = NULL;
+static bool    clear_snapshot = false;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -248,23 +324,35 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Struct for context for pgstat_flush_* functions
+ *
+ * To avoid repeated attach/detach of the same dshash, dshashes once attached
+ * are stored in this structure and passed around across multiple calls and
+ * functions. "generation" here means the value returned by pin_hashes().
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
+typedef struct pgstat_flush_stat_context
+{
+    int    shgeneration;                /* "generation" of shdb_tabhash below */
+    PgStat_StatDBEntry *shdbentry;    /* dbentry for shared tables (oid = 0) */
+    dshash_table *shdb_tabhash;        /* tabentry dshash of shared tables */
+
+    int    mygeneration;                /* "generation" of mydb_tabhash below */
+    PgStat_StatDBEntry *mydbentry;    /* dbentry for my database */
+    dshash_table *mydb_tabhash;        /* tabentry dshash of my database */
+} pgstat_flush_stat_context;
 
 /*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
+ * Cluster wide statistics.
+ *
+ * Contains statistics that are collected neither per database nor per
+ * table.  shared_* point to shared memory and snapshot_* are backend
+ * snapshots. Their validity is indicated by global_snapshot_is_valid.
  */
-static List *pending_write_requests = NIL;
-
-/* Signal handler flags */
-static volatile bool need_exit = false;
-static volatile bool got_SIGHUP = false;
+static bool global_snapshot_is_valid = false;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats snapshot_globalStats;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -278,35 +366,41 @@ static instr_time total_func_time;
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
 
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgstat_exit(SIGNAL_ARGS);
 static void pgstat_beshutdown_hook(int code, Datum arg);
-static void pgstat_sighup_handler(SIGNAL_ARGS);
-
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
+static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, int op,
+                                    PgStat_TableLookupResult *status);
+static PgStat_StatTabEntry *pgstat_get_tab_entry(dshash_table *table,
                                                  Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static void pgstat_write_pgStatDBHashfile(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_pgStatDBHashfile(PgStat_StatDBEntry *dbentry);
 static void pgstat_read_current_status(void);
-
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
-
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
+static bool pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
+                                 PgStat_TableStatus *entry);
+static bool pgstat_flush_funcstats(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_flush_dbstats(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_update_tabentry(dshash_table *tabhash,
+                                   PgStat_TableStatus *stat, bool nowait);
+static void pgstat_update_dbentry(PgStat_StatDBEntry *dbentry,
+                                  PgStat_TableStatus *stat);
 static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
 
+static void pgstat_remove_useless_entries(const dshash_table_handle dshhandle,
+                              const dshash_parameters *dshparams,
+                              HTAB *oidtab);
 static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
 
 static void pgstat_setup_memcxt(void);
+static void pgstat_flush_recovery_conflict(PgStat_StatDBEntry *dbentry);
+static void pgstat_flush_deadlock(PgStat_StatDBEntry *dbentry);
+static void pgstat_flush_checksum_failure(PgStat_StatDBEntry *dbentry);
+static void pgstat_flush_tempfile(PgStat_StatDBEntry *dbentry);
+static HTAB *create_tabstat_hash(void);
+static PgStat_SubXactStatus *get_tabstat_stack_level(int nest_level);
+static void add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level);
+static PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry_extended(PgStat_StatDBEntry *dbent, Oid funcid);
+static void pgstat_snapshot_global_stats(void);
 
 static const char *pgstat_get_wait_activity(WaitEventActivity w);
 static const char *pgstat_get_wait_client(WaitEventClient w);
@@ -314,557 +408,491 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+/* ------------------------------------------------------------
+ * Local support functions follow
+ * ------------------------------------------------------------
+ */
+static int pin_hashes(PgStat_StatDBEntry *dbentry);
+static void unpin_hashes(PgStat_StatDBEntry *dbentry, int generation);
+static dshash_table *attach_table_hash(PgStat_StatDBEntry *dbent, int gen);
+static dshash_table *attach_function_hash(PgStat_StatDBEntry *dbent, int gen);
+static void reset_dbentry_counters(PgStat_StatDBEntry *dbentry);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute space needed for stats collector's shared memory
+ */
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+/*
+ * StatsShmemInit - initialize during shared-memory creation
  */
 void
-pgstat_init(void)
+StatsShmemInit(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
+    bool    found;
 
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
 
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
+    if (!IsUnderPostmaster)
     {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
-    }
+        Assert(!found);
 
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+    }
 
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
+    LWLockInitialize(StatsLock, LWTRANCHE_STATS);
+}
 
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
+/* ----------
+ * pgstat_attach_shared_stats() -
+ *
+ *    Attach to the shared stats memory, creating it if necessary.
+ * ---------
+ */
+static void
+pgstat_attach_shared_stats(void)
+{
+    MemoryContext oldcontext;
 
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    /*
+     * Don't use dsm when not tracking counts or when not under postmaster.
+     */
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
 
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    pgstat_setup_memcxt();
 
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    if (area)
+        return;
 
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
 
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
+    if (StatsShmem->refcount > 0)
+        StatsShmem->refcount++;
+    else
+    {
+        /* Need to create shared memory area and load saved stats if any. */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
 
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        /* Initialize shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatDBHash = dshash_create(area, &dsh_dbparams, 0);
 
-        test_byte++;            /* just make sure variable is changed */
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->global_stats =
+            dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+        StatsShmem->archiver_stats =
+            dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+        StatsShmem->db_hash_handle = dshash_get_hash_table_handle(pgStatDBHash);
 
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
 
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        /* Load saved data if any. */
+        pgstat_read_statsfiles();
 
-        /* If we get here, we have a working socket */
-        break;
+        StatsShmem->refcount = 1;
     }
 
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
+    LWLockRelease(StatsLock);
 
     /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
+     * If we're not the first process, attach existing shared stats area
+     * outside StatsLock.
      */
-    if (!pg_set_noblock(pgStatSock))
+    if (!area)
     {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
+        /* Shared area already exists. Just attach it. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatDBHash = dshash_attach(area, &dsh_dbparams,
+                                     StatsShmem->db_hash_handle, 0);
+
+        /* Setup local variables */
+        pgStatLocalHash = NULL;
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
     }
 
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
+    MemoryContextSwitchTo(oldcontext);
 
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
+    dsa_pin_mapping(area);
+    global_snapshot_is_valid = false;
+}
 
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
+/* ----------
+ * pgstat_detach_shared_stats() -
+ *
+ *    Detach shared stats. Write out to file if we're the last process and
+ *    instructed to write file.
+ * ----------
+ */
+static void
+pgstat_detach_shared_stats(bool write_stats)
+{
+    if (!area || !IsUnderPostmaster)
+        return;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+    /* write out the shared stats to file if needed */
+    if (--StatsShmem->refcount < 1)
+    {
+        if (write_stats)
+            pgstat_write_statsfiles();
+
+        /* We're the last process. Invalidate the dsa area handle. */
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
     }
 
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
+    LWLockRelease(StatsLock);
 
-    return;
+    /*
+     * Detach the area. It is automatically destroyed when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);
 
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
+    area = NULL;
+    pgStatDBHash = NULL;
+    shared_globalStats = NULL;
+    shared_archiverStats = NULL;
+    pgStatLocalHash = NULL;
+    global_snapshot_is_valid = false;
+}
 
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
+/*
+ * pgstat_reset_all() -
+ *
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
+ */
+void
+pgstat_reset_all(void)
+{
+    /* we must have shared stats attached */
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
+    /* Startup must be the only user of shared stats */
+    Assert(StatsShmem->refcount == 1);
 
     /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
+     * We could remove the files and recreate the shared memory area directly,
+     * but we detach and then re-attach for simplicity.
      */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    pgstat_detach_shared_stats(false);    /* Don't write */
+    pgstat_attach_shared_stats();
 }
 
-/*
- * subroutine for pgstat_reset_all
+/* ------------------------------------------------------------
+ * Public functions used by backends follow
+ *------------------------------------------------------------
  */
-static void
-pgstat_reset_remove_files(const char *directory)
+
+/* ----------
+ * pgstat_report_stat() -
+ *
+ *    Must be called by processes that perform DML: tcop/postgres.c, logical
+ *    receiver processes, SPI worker, etc. to apply the statistics collected
+ *    so far for tables and functions to the shared statistics hashes.
+ *
+ *    Updates are applied no more frequently than once every
+ *    PGSTAT_STAT_MIN_INTERVAL milliseconds. They are also postponed on lock
+ *    failure if force is false and no update has been pending for longer than
+ *    PGSTAT_STAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
+ *    succeeding calls of this function.
+ *
+ *    Returns the time in milliseconds until updates should next be attempted,
+ *    that is, the wait until the next flush time or the retry interval when
+ *    some updates are still pending. Returns 0 if nothing remains to be done.
+ *
+ *    Note that this is called only when not within a transaction, so it is
+ *    fine to use transaction stop time as an approximation of current time.
+ * ----------
+ */
+long
+pgstat_report_stat(bool force)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    static TimestampTz next_flush = 0;
+    static TimestampTz pending_since = 0;
+    TimestampTz now;
+    pgstat_flush_stat_context cxt = {0};
+    bool        pending_stats = false;
+    long        elapsed;
+    long        secs;
+    int            usecs;
+
+    /* Don't expend a clock check if nothing to do */
+    if (area == NULL ||
+        ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
+         pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
+         !HAVE_PENDING_DBSTATS()  && !have_function_stats))
+        return 0;
+
+    now = GetCurrentTransactionStopTimestamp();
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
+    if (!force)
     {
-        int            nchars;
-        Oid            tmp_oid;
+        /*
+         * Don't flush stats until the next flush time has arrived.  Return
+         * the time to wait in milliseconds.
+         */
+        if (now < next_flush)
+        {
+            /* Remember the time of the oldest pending update, if not yet set. */
+            if (pending_since == 0)
+                pending_since = now;
+
+            /* now < next_flush here */
+            return (next_flush - now) / 1000;
+        }
 
         /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
+         * Don't keep pending updates longer than PGSTAT_STAT_MAX_INTERVAL.
          */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
+        if (pending_since > 0)
         {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
+            TimestampDifference(pending_since, now, &secs, &usecs);
+            elapsed = secs * 1000 + usecs / 1000;
+
+            if (elapsed > PGSTAT_STAT_MAX_INTERVAL)
+                force = true;
         }
+    }
 
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
+    /* Flush out table stats */
+    if (pgStatTabList != NULL && !pgstat_flush_stat(&cxt, !force))
+        pending_stats = true;
 
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+    /* Flush out function stats */
+    if (pgStatFunctions != NULL && !pgstat_flush_funcstats(&cxt, !force))
+        pending_stats = true;
+
+    /* Flush out database-wide stats */
+    if (HAVE_PENDING_DBSTATS())
+    {
+        if (!pgstat_flush_dbstats(&cxt, !force))
+            pending_stats = true;
     }
-    FreeDir(dir);
-}
 
-/*
- * pgstat_reset_all() -
- *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
- */
-void
-pgstat_reset_all(void)
-{
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
-}
+    /* Unpin dbentry if pinned */
+    if (cxt.mydb_tabhash)
+    {
+        dshash_detach(cxt.mydb_tabhash);
+        unpin_hashes(cxt.mydbentry, cxt.mygeneration);
+        cxt.mydb_tabhash = NULL;
+        cxt.mydbentry = NULL;
+    }
 
-#ifdef EXEC_BACKEND
+    /* Publish the last flush time */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (shared_globalStats->stats_timestamp < now)
+        shared_globalStats->stats_timestamp = now;
+    LWLockRelease(StatsLock);
 
-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
-    char       *av[10];
-    int            ac = 0;
+    /* Record how long we have been keeping pending updates. */
+    if (pending_stats)
+    {
+        /* Preserve the first value */
+        if (pending_since == 0)
+            pending_since = now;
 
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
+        /*
+         * The retry interval may carry us slightly past the limit set by
+         * PGSTAT_STAT_MAX_INTERVAL. We don't worry about that since the
+         * excess is small.
+         */
+        return PGSTAT_STAT_RETRY_INTERVAL;
+    }
 
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
+    /* Set the next time to update stats */
+    next_flush = now + PGSTAT_STAT_MIN_INTERVAL * 1000;
+    pending_since = 0;
 
-    return postmaster_forkexec(ac, av);
+    return 0;
 }
-#endif                            /* EXEC_BACKEND */
-
 
 /*
- * pgstat_start() -
+ * snapshot_statentry() - Common routine for functions
+ *                             pgstat_fetch_stat_*entry()
  *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
+ *  Returns the pointer to a snapshot of a shared entry for the key or NULL if
+ *  not found. Returned snapshots are stable during the current transaction or
+ *  until pgstat_clear_snapshot() is called.
  *
- *    Returns PID of child process, or 0 if fail.
+ *  The snapshots are stored in a hash, a pointer to which is stored in the
+ *  HTAB * variable pointed to by cxt->hash. If not created yet, it is created
+ *  using hash_name and hash_entsize in cxt.
  *
- *    Note: if fail, we will be called again from the postmaster main loop.
+ *  cxt->dshash points to dshash_table for dbstat entries. If not yet
+ *  attached, it is attached using cxt->dsh_handle.
  */
-int
-pgstat_start(void)
+static void *
+snapshot_statentry(pgstat_snapshot_param *cxt, Oid key)
 {
-    time_t        curtime;
-    pid_t        pgStatPid;
+    PgStat_snapshot *lentry = NULL;
+    size_t keysize = cxt->dsh_params->key_size;
+    size_t dsh_entrysize = cxt->dsh_params->entry_size;
+    bool found;
 
     /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
+     * Don't refresh the stats snapshot too often. Keep it for at least
+     * PGSTAT_STAT_MIN_INTERVAL ms; earlier requests are ignored, not deferred.
      */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
+    if (clear_snapshot)
+    {
+        clear_snapshot = false;
 
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
+        if (pgStatSnapshotContext &&
+            snapshot_globalStats.stats_timestamp <
+            GetCurrentStatementStartTimestamp() -
+            PGSTAT_STAT_MIN_INTERVAL * 1000)
+        {
+            MemoryContextReset(pgStatSnapshotContext);
+
+            /* Reset variables */
+            global_snapshot_is_valid = false;
+            pgStatSnapshotContext = NULL;
+            pgStatLocalHash = NULL;
+
+            pgstat_setup_memcxt();
+        }
+    }
 
     /*
-     * Okay, fork off the collector.
+     * Create new hash, with rather arbitrary initial number of entries since
+     * we don't know how this hash will grow.
      */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
+    if (!*cxt->hash)
     {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
+        HASHCTL ctl;
 
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
+        /*
+         * Create the hash in the stats context
+         *
+         * Each entry is preceded by a common header part represented by
+         * PgStat_snapshot.
+         */
 
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
+        ctl.keysize        = keysize;
+        ctl.entrysize    = offsetof(PgStat_snapshot, body) + cxt->hash_entsize;
+        ctl.hcxt        = pgStatSnapshotContext;
+        *cxt->hash = hash_create(cxt->hash_name, 32, &ctl,
+                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    }
 
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
+    lentry = hash_search(*cxt->hash, &key, HASH_ENTER, &found);
 
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
+    /*
+     * Look up the shared hash if the entry is not in the local hash. Outside
+     * a transaction we return up-to-date entries, so do the same even if a
+     * snapshot entry was found.
+     */
+    if (!found || !IsTransactionState())
+    {
+        void *sentry;
 
-        default:
-            return (int) pgStatPid;
-    }
+        /* attach shared hash if not given, leave it alone for later use */
+        if (!*cxt->dshash)
+        {
+            MemoryContext oldcxt;
 
-    /* shouldn't get here */
-    return 0;
-}
+            Assert(cxt->dsh_handle != DSM_HANDLE_INVALID);
+            oldcxt = MemoryContextSwitchTo(pgStatSnapshotContext);
+            *cxt->dshash =
+                dshash_attach(area, cxt->dsh_params, cxt->dsh_handle, NULL);
+            MemoryContextSwitchTo(oldcxt);
+        }
 
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
-}
+        sentry = dshash_find(*cxt->dshash, &key, false);
 
-/* ------------------------------------------------------------
- * Public functions used by backends follow
- *------------------------------------------------------------
- */
+        if (sentry)
+        {
+            /*
+             * In transaction state, it is obvious that we should create local
+             * cache entries for consistency. If we are not, we return an
+             * up-to-date entry. Having said that, we need a local copy since
+             * dshash entry must be released immediately. We share the same
+             * local hash entry for the purpose.
+             */
+            memcpy(&lentry->body, sentry, dsh_entrysize);
+            dshash_release_lock(*cxt->dshash, sentry);
 
+            /* then zero out the local additional space if any */
+            if (dsh_entrysize < cxt->hash_entsize)
+                MemSet((char *)&lentry->body + dsh_entrysize, 0,
+                       cxt->hash_entsize - dsh_entrysize);
+        }
 
-/* ----------
- * pgstat_report_stat() -
+        lentry->negative = !sentry;
+    }
+
+    if (lentry->negative)
+        return NULL;
+
+    return &lentry->body;
+}
+
+/*
+ * pgstat_flush_stat: Flushes table stats out to shared statistics.
  *
- *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
- *    transaction stop time as an approximation of current time.
- * ----------
+ *  If nowait is true, returns false if required lock was not acquired
+ *  immediately. In that case, unapplied table stats updates are left alone in
+ *  TabStatusArray to wait for the next chance. cxt holds some dshash related
+ *  values that we want to carry around while updating shared stats.
+ *
+ *  Returns true if all stats info is flushed. Caller must detach dshashes
+ *  stored in cxt after use.
  */
-void
-pgstat_report_stat(bool force)
+static bool
+pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait)
 {
-    /* we assume this inits to all zeroes: */
     static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
-    TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
     TabStatusArray *tsa;
-    int            i;
-
-    /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    HTAB           *new_tsa_hash = NULL;
+    TabStatusArray *dest_tsa = pgStatTabList;
+    int                dest_elem = 0;
+    int                i;
 
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
-    now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
+    /* nothing to do, just return  */
+    if (pgStatTabHash == NULL)
+        return true;
 
     /*
      * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
+     * entries it points to.
      */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
+    hash_destroy(pgStatTabHash);
     pgStatTabHash = NULL;
 
     /*
      * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
+     * have counts, and try flushing them out to shared stats. We may fail on
+     * some entries in the array; those entries are left packed at the
+     * beginning of the array.
      */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
     for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
     {
         for (i = 0; i < tsa->tsa_used; i++)
         {
             PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
 
             /* Shouldn't have any pending transaction-dependent counts */
             Assert(entry->trans == NULL);
@@ -877,178 +905,352 @@ pgstat_report_stat(bool force)
                        sizeof(PgStat_TableCounts)) == 0)
                 continue;
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* try to apply the tab stats */
+            if (!pgstat_flush_tabstat(cxt, nowait, entry))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                /*
+                 * Failed. Move it to the beginning in TabStatusArray and
+                 * leave it.
+                 */
+                TabStatHashEntry *hash_entry;
+                bool found;
+
+                if (new_tsa_hash == NULL)
+                    new_tsa_hash = create_tabstat_hash();
+
+                /* Create hash entry for this entry */
+                hash_entry = hash_search(new_tsa_hash, &entry->t_id,
+                                         HASH_ENTER, &found);
+                Assert(!found);
+
+                /*
+                 * Move insertion pointer to the next segment if the segment
+                 * is filled up.
+                 */
+                if (dest_elem >= TABSTAT_QUANTUM)
+                {
+                    Assert(dest_tsa->tsa_next != NULL);
+                    dest_tsa = dest_tsa->tsa_next;
+                    dest_elem = 0;
+                }
+
+                /*
+                 * Pack the entry at the beginning of the array. Do nothing if
+                 * it doesn't need to be moved.
+                 */
+                if (tsa != dest_tsa || i != dest_elem)
+                {
+                    PgStat_TableStatus *new_entry;
+                    new_entry = &dest_tsa->tsa_entries[dest_elem];
+                    *new_entry = *entry;
+
+                    /* use new_entry as entry hereafter */
+                    entry = new_entry;
+                }
+
+                hash_entry->tsa_entry = entry;
+                dest_elem++;
             }
         }
-        /* zero out PgStat_TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
     }
 
+    /* zero out unused area of TableStatus */
+    dest_tsa->tsa_used = dest_elem;
+    MemSet(&dest_tsa->tsa_entries[dest_elem], 0,
+           (TABSTAT_QUANTUM - dest_elem) * sizeof(PgStat_TableStatus));
+    while (dest_tsa->tsa_next)
+    {
+        dest_tsa = dest_tsa->tsa_next;
+        MemSet(dest_tsa->tsa_entries, 0,
+               dest_tsa->tsa_used * sizeof(PgStat_TableStatus));
+        dest_tsa->tsa_used = 0;
+    }
+
+    /* and set the new TabStatusArray hash if any */
+    pgStatTabHash = new_tsa_hash;
+
     /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
+     * We no longer need the database and table entries for shared relations,
+     * but those for my database may be used later.
      */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
-
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+    if (cxt->shdb_tabhash)
+    {
+        dshash_detach(cxt->shdb_tabhash);
+        unpin_hashes(cxt->shdbentry, cxt->shgeneration);
+        cxt->shdb_tabhash = NULL;
+        cxt->shdbentry = NULL;
+    }
+
+    return pgStatTabHash == NULL;
 }
 
+/* -------
+ * Subroutines for pgstat_flush_stat.
+ * -------
+ */
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * pgstat_flush_tabstat: Flushes a table stats entry.
+ *
+ *  If nowait is true, returns false on lock failure.  Dshashes for table and
+ *  function stats are kept attached in cxt. The caller must detach them after
+ *  use.
+ *
+ *  Returns true if the entry is flushed out.
  */
-static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+bool
+pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
+                     PgStat_TableStatus *entry)
 {
-    int            n;
-    int            len;
+    Oid        dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
+    int        table_mode = PGSTAT_EXCLUSIVE;
+    bool    updated = false;
+    dshash_table *tabhash;
+    PgStat_StatDBEntry *dbent;
+    int        generation;
+
+    if (nowait)
+        table_mode |= PGSTAT_NOWAIT;
+
+    /* Attach required table hash if not yet. */
+    if ((entry->t_shared ? cxt->shdb_tabhash : cxt->mydb_tabhash) == NULL)
+    {
+        /*
+         *  Return if we don't have the corresponding dbentry. It may have
+         *  been removed.
+         */
+        dbent = pgstat_get_db_entry(dboid, table_mode, NULL);
+        if (!dbent)
+            return false;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+        /*
+         * We don't hold lock on the dbentry since it cannot be dropped while
+         * we are working on it.
+         */
+        generation = pin_hashes(dbent);
+        tabhash = attach_table_hash(dbent, generation);
 
-    /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
-     */
-    if (OidIsValid(tsmsg->m_databaseid))
-    {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
-        pgStatXactCommit = 0;
-        pgStatXactRollback = 0;
-        pgStatBlockReadTime = 0;
-        pgStatBlockWriteTime = 0;
+        if (entry->t_shared)
+        {
+            cxt->shgeneration = generation;
+            cxt->shdbentry = dbent;
+            cxt->shdb_tabhash = tabhash;
+        }
+        else
+        {
+            cxt->mygeneration = generation;
+            cxt->mydbentry = dbent;
+            cxt->mydb_tabhash = tabhash;
+
+            /*
+             * We come here once per database. Take the chance to update
+             * database-wide stats
+             */
+            LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
+            dbent->n_xact_commit += pgStatXactCommit;
+            dbent->n_xact_rollback += pgStatXactRollback;
+            dbent->n_block_read_time += pgStatBlockReadTime;
+            dbent->n_block_write_time += pgStatBlockWriteTime;
+            LWLockRelease(&dbent->lock);
+            pgStatXactCommit = 0;
+            pgStatXactRollback = 0;
+            pgStatBlockReadTime = 0;
+            pgStatBlockWriteTime = 0;
+        }
+    }
+    else if (entry->t_shared)
+    {
+        dbent = cxt->shdbentry;
+        tabhash = cxt->shdb_tabhash;
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        dbent = cxt->mydbentry;
+        tabhash = cxt->mydb_tabhash;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    /*
+     * Local table stats should be applied to both dbentry and tabentry at
+     * once. Update dbentry only if we could update tabentry.
+     */
+    if (pgstat_update_tabentry(tabhash, entry, nowait))
+    {
+        pgstat_update_dbentry(dbent, entry);
+        updated = true;
+    }
+
+    return updated;
 }
 
 /*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
+ * pgstat_flush_funcstats: Flushes function stats.
+ *
+ *  If nowait is true, returns false on lock failure. Unapplied local hash
+ *  entries are left alone.
+ *
+ *  Returns true if all entries are flushed out.
  */
-static void
-pgstat_send_funcstats(void)
+static bool
+pgstat_flush_funcstats(pgstat_flush_stat_context *cxt, bool nowait)
 {
     /* we assume this inits to all zeroes: */
     static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
+    dshash_table   *funchash;
     HASH_SEQ_STATUS fstat;
+    PgStat_BackendFunctionEntry *bestat;
 
+    /* nothing to do, just return  */
     if (pgStatFunctions == NULL)
-        return;
+        return true;
+
+    /* get dbentry into cxt if not yet.  */
+    if (cxt->mydbentry == NULL)
+    {
+        int op = PGSTAT_EXCLUSIVE;
+
+        if (nowait)
+            op |= PGSTAT_NOWAIT;
+
+        cxt->mydbentry = pgstat_get_db_entry(MyDatabaseId, op, NULL);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
+        if (cxt->mydbentry == NULL)
+            return false;
 
+        cxt->mygeneration = pin_hashes(cxt->mydbentry);
+    }
+
+    funchash = attach_function_hash(cxt->mydbentry, cxt->mygeneration);
+    if (funchash == NULL)
+        return false;
+
+    have_function_stats = false;
+
+    /*
+     * Scan through the pgStatFunctions to find functions that actually have
+     * counts, and try flushing them out to shared stats.
+     */
     hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    while ((bestat = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
     {
-        PgStat_FunctionEntry *m_ent;
+        bool found;
+        PgStat_StatFuncEntry *funcent = NULL;
 
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
+        /* Skip it if no counts accumulated for it so far */
+        if (memcmp(&bestat->f_counts, &all_zeroes,
                    sizeof(PgStat_FunctionCounts)) == 0)
             continue;
 
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
+        funcent = (PgStat_StatFuncEntry *)
+            dshash_find_or_insert_extended(funchash, (void *) &(bestat->f_id),
+                                           &found, nowait);
 
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
+        /*
+         * We couldn't acquire lock on the required entry. Leave the local
+         * entry alone.
+         */
+        if (!funcent)
         {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
+            have_function_stats = true;
+            continue;
         }
 
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
+        /* Initialize if it's new, or add to it. */
+        if (!found)
+        {
+            funcent->functionid = bestat->f_id;
+            funcent->f_numcalls = bestat->f_counts.f_numcalls;
+            funcent->f_total_time =
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_total_time);
+            funcent->f_self_time =
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_self_time);
+        }
+        else
+        {
+            funcent->f_numcalls += bestat->f_counts.f_numcalls;
+            funcent->f_total_time +=
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_total_time);
+            funcent->f_self_time +=
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_self_time);
+        }
+        dshash_release_lock(funchash, funcent);
 
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
+        /* reset used counts */
+        MemSet(&bestat->f_counts, 0, sizeof(PgStat_FunctionCounts));
+    }
 
-    have_function_stats = false;
+    return !have_function_stats;
 }
 
+/*
+ * pgstat_flush_dbstats: Flushes out miscellaneous database stats.
+ *
+ *  If nowait is true, returns false on lock failure on dbentry.
+ *
+ *  Returns true if all stats are flushed out.
+ */
+static bool
+pgstat_flush_dbstats(pgstat_flush_stat_context *cxt, bool nowait)
+{
+    /* get dbentry if not yet.  */
+    if (cxt->mydbentry == NULL)
+    {
+        int op = PGSTAT_EXCLUSIVE;
+        if (nowait)
+            op |= PGSTAT_NOWAIT;
+
+        cxt->mydbentry = pgstat_get_db_entry(MyDatabaseId, op, NULL);
+
+        /* return if lock failed. */
+        if (cxt->mydbentry == NULL)
+            return false;
+
+        /* we use this generation of table/function stats in this turn */
+        cxt->mygeneration = pin_hashes(cxt->mydbentry);
+    }
+
+    LWLockAcquire(&cxt->mydbentry->lock, LW_EXCLUSIVE);
+    if (HAVE_PENDING_CONFLICTS())
+        pgstat_flush_recovery_conflict(cxt->mydbentry);
+    if (BeDBStats.n_deadlocks != 0)
+        pgstat_flush_deadlock(cxt->mydbentry);
+    if (BeDBStats.n_tmpfiles != 0)
+        pgstat_flush_tempfile(cxt->mydbentry);
+    if (BeDBStats.checksum_failures != NULL)
+        pgstat_flush_checksum_failure(cxt->mydbentry);
+    LWLockRelease(&cxt->mydbentry->lock);
+
+    return true;
+}
 
 /* ----------
  * pgstat_vacuum_stat() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *    Remove objects we can get rid of.
  * ----------
  */
 void
 pgstat_vacuum_stat(void)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
+    HTAB       *oidtab;
+    dshash_seq_status dshstat;
     PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* we don't collect stats under standalone mode */
+    if (!IsUnderPostmaster)
         return;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
     /*
      * Read pg_database and make a list of OIDs of all existing databases
      */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    oidtab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
 
     /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
+     * Search the database hash table for dead databases and drop them
+     * from the hash.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+
+    dshash_seq_init(&dshstat, pgStatDBHash, false, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            dbid = dbentry->databaseid;
 
@@ -1056,137 +1258,43 @@ pgstat_vacuum_stat(void)
 
         /* the DB entry for shared tables (with InvalidOid) is never dropped */
         if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
+            hash_search(oidtab, (void *) &dbid, HASH_FIND, NULL) == NULL)
             pgstat_drop_database(dbid);
     }
 
     /* Clean up */
-    hash_destroy(htab);
+    hash_destroy(oidtab);
 
     /*
      * Lookup our own database entry; if not found, nothing more to do.
      */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_EXCLUSIVE, NULL);
+    if (!dbentry)
         return;
 
     /*
      * Similarly to above, make a list of all known relations in this DB.
      */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
+    oidtab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
 
     /*
      * Check for all tables listed in stats hashtable if they still exist.
+     * Stats cache is useless here so directly search the shared hash.
      */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            tabid = tabentry->tableid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
-            continue;
-
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
-    }
-
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
+    pgstat_remove_useless_entries(dbentry->tables, &dsh_tblparams, oidtab);
 
     /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
+     * Repeat the above for functions, but we needn't bother in the common
+     * case where no function stats are being collected.
      */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
+    if (dbentry->functions != DSM_HANDLE_INVALID)
     {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
-        {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
-                continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
-        }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
+        oidtab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
 
-        hash_destroy(htab);
+        pgstat_remove_useless_entries(dbentry->functions, &dsh_funcparams,
+                                      oidtab);
     }
+    dshash_release_lock(pgStatDBHash, dbentry);
 }
 
 
@@ -1240,66 +1348,99 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     return htab;
 }
 
-
-/* ----------
- * pgstat_drop_database() -
+/*
+ * pgstat_remove_useless_entries - Remove useless entries from per
+ * table/function dshashes.
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
- * ----------
+ *  Scan the dshash specified by dshhandle removing entries that are not in
+ *  oidtab. oidtab is destroyed before returning.
  */
 void
-pgstat_drop_database(Oid databaseid)
+pgstat_remove_useless_entries(const dshash_table_handle dshhandle,
+                              const dshash_parameters *dshparams,
+                              HTAB *oidtab)
 {
-    PgStat_MsgDropdb msg;
+    dshash_table *dshtable;
+    dshash_seq_status dshstat;
+    void         *ent;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    dshtable = dshash_attach(area, dshparams, dshhandle, 0);
+    dshash_seq_init(&dshstat, dshtable, false, true);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
+    {
+        CHECK_FOR_INTERRUPTS();
+
+        /* The first member of the entries must be Oid */
+        if (hash_search(oidtab, ent, HASH_FIND, NULL) != NULL)
+            continue;
 
+        /* Not there, so purge this entry */
+        dshash_delete_entry(dshtable, ent);
+    }
+    dshash_detach(dshtable);
+    hash_destroy(oidtab);
+}
 
 /* ----------
- * pgstat_drop_relation() -
+ * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
+ *    Remove entry for the database that we just dropped.
  *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
+ *    If some stats are flushed after this, this entry will be re-created but we
+ *    will still clean the dead DB eventually via future invocations of
+ *    pgstat_vacuum_stat().
  * ----------
  */
-#ifdef NOT_USED
 void
-pgstat_drop_relation(Oid relid)
+pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgTabpurge msg;
-    int            len;
+    PgStat_StatDBEntry *dbentry;
+
+    Assert(OidIsValid(databaseid));
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatDBHash)
         return;
 
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
+    /*
+     * Lookup the database in the hashtable with exclusive lock.
+     */
+    dbentry = pgstat_get_db_entry(databaseid, PGSTAT_EXCLUSIVE, NULL);
+
+    /*
+     * If found, remove it.
+     */
+    if (dbentry)
+    {
+        /* LWLock is needed to rewrite */
+        LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+
+        /* No one is using tables/functions in this dbentry */
+        Assert(dbentry->refcnt == 0);
 
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
+        /* Remove table/function stats dshash first. */
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+        }
+        LWLockRelease(&dbentry->lock);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
+        dshash_delete_entry(pgStatDBHash, (void *)dbentry);
+    }
 }
-#endif                            /* NOT_USED */
-
 
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1308,20 +1449,32 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    PgStat_StatDBEntry       *dbentry;
+    PgStat_TableLookupResult status;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!pgStatDBHash)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Lookup the database in the hashtable.  Nothing to do if not there.
+     */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_EXCLUSIVE, &status);
+
+    if (!dbentry)
+        return;
+
+    /* This database is active, safe to release the lock immediately. */
+    dshash_release_lock(pgStatDBHash, dbentry);
+
+    /* Reset database-level stats. */
+    reset_dbentry_counters(dbentry);
+
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1330,29 +1483,37 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
+    /* Reset the archiver statistics for the cluster. */
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    /* Reset the bgwriter statistics for the cluster. */
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1361,18 +1522,43 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
+    int generation;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_EXCLUSIVE, NULL);
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!dbentry)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* Pin the current generation of table/function hashes while we work. */
+    generation = pin_hashes(dbentry);
 
-    pgstat_send(&msg, sizeof(msg));
-}
+    /* Set the reset timestamp for the whole database */
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&dbentry->lock);
+
+    /* Remove object if it exists, ignore if not */
+    if (type == RESET_TABLE)
+    {
+        dshash_table *t = attach_table_hash(dbentry, generation);
+        dshash_delete_key(t, (void *) &objoid);
+        dshash_detach(t);
+    }
+
+    if (type == RESET_FUNCTION)
+    {
+        dshash_table *t = attach_function_hash(dbentry, generation);
+        if (t)
+        {
+            dshash_delete_key(t, (void *) &objoid);
+            dshash_detach(t);
+        }
+    }
+    unpin_hashes(dbentry, generation);
+}
 
 /* ----------
  * pgstat_report_autovac() -
@@ -1385,48 +1571,81 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    /*
+     * Store the last autovacuum time in the database's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_EXCLUSIVE, NULL);
+    dshash_release_lock(pgStatDBHash, dbentry);
+
+    ts = GetCurrentTimestamp();
 
-    pgstat_send(&msg, sizeof(msg));
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->last_autovac_time = ts;
+    LWLockRelease(&dbentry->lock);
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report about the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
+    int                    generation;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hash table entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_EXCLUSIVE, NULL);
+    generation = pin_hashes(dbentry);
+    table = attach_table_hash(dbentry, generation);
+
+    tabentry = pgstat_get_tab_entry(table, tableoid, true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->vacuum_count++;
+    }
+    dshash_release_lock(table, tabentry);
+
+    dshash_detach(table);
+    unpin_hashes(dbentry, generation);
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report about the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1437,9 +1656,14 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table       *table;
+    int                    generation;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1468,78 +1692,153 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_EXCLUSIVE, NULL);
+    generation = pin_hashes(dbentry);
+    table = attach_table_hash(dbentry, generation);
+    tabentry = pgstat_get_tab_entry(table, RelationGetRelid(rel), true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    dshash_release_lock(table, tabentry);
+
+    dshash_detach(table);
+    unpin_hashes(dbentry, generation);
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupResult status;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            BeDBStats.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            BeDBStats.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            BeDBStats.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            BeDBStats.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            BeDBStats.n_conflict_startup_deadlock++;
+            break;
+    }
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    if (status == LOCK_FAILED)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    /* We had a chance to flush immediately */
+    pgstat_flush_recovery_conflict(dbentry);
+
+    dshash_release_lock(pgStatDBHash, dbentry);
+}
+
+/*
+ * flush recovery conflict stats
+ */
+static void
+pgstat_flush_recovery_conflict(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_conflict_tablespace    += BeDBStats.n_conflict_tablespace;
+    dbentry->n_conflict_lock         += BeDBStats.n_conflict_lock;
+    dbentry->n_conflict_snapshot    += BeDBStats.n_conflict_snapshot;
+    dbentry->n_conflict_bufferpin    += BeDBStats.n_conflict_bufferpin;
+    dbentry->n_conflict_startup_deadlock += BeDBStats.n_conflict_startup_deadlock;
+
+    BeDBStats.n_conflict_tablespace = 0;
+    BeDBStats.n_conflict_lock = 0;
+    BeDBStats.n_conflict_snapshot = 0;
+    BeDBStats.n_conflict_bufferpin = 0;
+    BeDBStats.n_conflict_startup_deadlock = 0;
 }
 
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupResult status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
-}
-
+    BeDBStats.n_deadlocks++;
 
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
 
-/* --------
- * pgstat_report_checksum_failures_in_db() -
- *
- *    Tell the collector about one or more checksum failures.
- * --------
- */
-void
-pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
-{
-    PgStat_MsgChecksumFailure msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    if (status == LOCK_FAILED)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
+    dshash_release_lock(pgStatDBHash, dbentry);
+}
 
-    pgstat_send(&msg, sizeof(msg));
+/*
+ * flush deadlock stats
+ */
+static void
+pgstat_flush_deadlock(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_deadlocks += BeDBStats.n_deadlocks;
+    BeDBStats.n_deadlocks = 0;
 }
 
 /* --------
@@ -1557,60 +1856,153 @@ pgstat_report_checksum_failure(void)
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupResult status;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    if (filesize > 0) /* Is there a case where filesize is really 0? */
+    {
+        BeDBStats.tmpfilesize += filesize; /* needs overflow check */
+        BeDBStats.n_tmpfiles++;
+    }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    if (BeDBStats.n_tmpfiles == 0)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
+
+    if (status == LOCK_FAILED)
+        return;
+
+    /* We had a chance to flush immediately */
+    pgstat_flush_tempfile(dbentry);
+
+    dshash_release_lock(pgStatDBHash, dbentry);
 }
 
+/*
+ * flush temporary file stats
+ */
+static void
+pgstat_flush_tempfile(PgStat_StatDBEntry *dbentry)
+{
 
-/* ----------
- * pgstat_ping() -
+    dbentry->n_temp_bytes += BeDBStats.tmpfilesize;
+    dbentry->n_temp_files += BeDBStats.n_tmpfiles;
+    BeDBStats.tmpfilesize = 0;
+    BeDBStats.n_tmpfiles = 0;
+}
+
+/* --------
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
  *
- *    Send some junk data to the collector to increase traffic.
- * ----------
+ *    Report one or more checksum failures.
+ * --------
  */
 void
-pgstat_ping(void)
+pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
 {
-    PgStat_MsgDummy msg;
+    PgStat_StatDBEntry       *dbentry;
+    PgStat_TableLookupResult status;
+    ChecksumFailureEnt       *failent = NULL;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    if (BeDBStats.checksum_failures != NULL)
+    {
+        failent = hash_search(BeDBStats.checksum_failures, &dboid,
+                              HASH_FIND, NULL);
+        if (failent)
+            failurecount += failent->count;
+    }
+
+    if (failurecount == 0)
+        return;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
+
+    if (status == LOCK_FAILED)
+    {
+        if (!failent)
+        {
+            if (!BeDBStats.checksum_failures)
+            {
+                HASHCTL    ctl;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+                ctl.keysize = sizeof(Oid);
+                ctl.entrysize = sizeof(ChecksumFailureEnt);
+                BeDBStats.checksum_failures =
+                    hash_create("pgstat checksum failure count hash",
+                                32, &ctl, HASH_ELEM | HASH_BLOBS);
+            }
+
+            failent = hash_search(BeDBStats.checksum_failures,
+                                  &dboid, HASH_ENTER, NULL);
+        }
+
+        failent->count = failurecount;
         return;
+    }
+
+    /* We have a chance to flush immediately */
+    dbentry->n_checksum_failures += failurecount;
+    BeDBStats.checksum_failures = NULL;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dshash_release_lock(pgStatDBHash, dbentry);
 }
 
-/* ----------
- * pgstat_send_inquiry() -
- *
- *    Notify collector that we need fresh data.
- * ----------
+/*
+ * flush checksum failure count for all databases
  */
 static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
+pgstat_flush_checksum_failure(PgStat_StatDBEntry *dbentry)
 {
-    PgStat_MsgInquiry msg;
+    HASH_SEQ_STATUS     stat;
+    ChecksumFailureEnt *ent;
+    bool                release_dbent;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
+    if (BeDBStats.checksum_failures == NULL)
+        return;
+
+    hash_seq_init(&stat, BeDBStats.checksum_failures);
+    while ((ent = (ChecksumFailureEnt *) hash_seq_search(&stat)) != NULL)
+    {
+        release_dbent = false;
+
+        if (dbentry->databaseid != ent->dboid)
+        {
+            dbentry = pgstat_get_db_entry(ent->dboid,
+                                          PGSTAT_EXCLUSIVE, NULL);
+            if (!dbentry)
+                continue;
+
+            release_dbent = true;
+        }
+
+        dbentry->n_checksum_failures += ent->count;
 
+        if (release_dbent)
+            dshash_release_lock(pgStatDBHash, dbentry);
+    }
+
+    hash_destroy(BeDBStats.checksum_failures);
+    BeDBStats.checksum_failures = NULL;
+}
 
 /*
  * Initialize function call usage data.
@@ -1762,7 +2154,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1781,6 +2174,24 @@ pgstat_initstats(Relation rel)
     rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
 }
 
+/*
+ * create_tabstat_hash - create local hash as transactional storage
+ */
+static HTAB *
+create_tabstat_hash(void)
+{
+    HASHCTL        ctl;
+
+    MemSet(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(Oid);
+    ctl.entrysize = sizeof(TabStatHashEntry);
+
+    return hash_create("pgstat TabStatusArray lookup hash table",
+                       TABSTAT_QUANTUM,
+                       &ctl,
+                       HASH_ELEM | HASH_BLOBS);
+}
+
 /*
  * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
  */
@@ -1796,18 +2207,7 @@ get_tabstat_entry(Oid rel_id, bool isshared)
      * Create hash table if we don't have it already.
      */
     if (pgStatTabHash == NULL)
-    {
-        HASHCTL        ctl;
-
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
-    }
+        pgStatTabHash = create_tabstat_hash();
 
     /*
      * Find an entry or create a new one.
@@ -2420,30 +2820,33 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
- * ----------
+ *    Find database stats entry on backends. The returned entries are cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_param param =
+    {
+        .hash_name = "local database stats hash",
+        .hash_entsize = sizeof(PgStat_StatDBEntry),
+        .dsh_handle = DSM_HANDLE_INVALID,   /* already attached */
+        .dsh_params = &dsh_dbparams,
+        .hash = &pgStatLocalHash,
+        .dshash = &pgStatDBHash
+    };
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    /* If not done for this transaction, take a snapshot of global stats */
+    pgstat_snapshot_global_stats();
+
+    /* the caller has no business with snapshot-local members */
+    return (PgStat_StatDBEntry *) snapshot_statentry(&param, dbid);
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
@@ -2456,51 +2859,66 @@ pgstat_fetch_stat_dbentry(Oid dbid)
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* Lookup our database, then look in its table hash table. */
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    if (dbentry == NULL)
+        return NULL;
 
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    dbentry = pgstat_fetch_stat_dbentry(InvalidOid);
+    if (dbentry == NULL)
+        return NULL;
+
+    tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     return NULL;
 }
 
+/* ----------
+ * pgstat_fetch_stat_tabentry_extended() -
+ *
+ *    Find table stats entry on backends. The returned entries are cached until
+ *    transaction end or pgstat_clear_snapshot() is called.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_extended(PgStat_StatDBEntry *dbent, Oid reloid)
+{
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_param param =
+    {
+        .hash_name = "table stats snapshot hash",
+        .hash_entsize = sizeof(PgStat_StatTabEntry),
+        .dsh_handle = DSM_HANDLE_INVALID,
+        .dsh_params = &dsh_tblparams,
+        .hash = NULL,
+        .dshash = NULL
+    };
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    /* set target shared hash */
+    param.dsh_handle = dbent->tables;
+
+    /* tell snapshot_statentry what variables to use */
+    param.hash = &dbent->snapshot_tables;
+    param.dshash = &dbent->dshash_tables;
+
+    return (PgStat_StatTabEntry *)
+        snapshot_statentry(&param, reloid);
+}
+
 
 /* ----------
  * pgstat_fetch_stat_funcentry() -
@@ -2515,21 +2933,90 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     PgStat_StatDBEntry *dbentry;
     PgStat_StatFuncEntry *funcentry = NULL;
 
-    /* load the stats file if needed */
-    backend_read_statsfile();
-
-    /* Lookup our database, then find the requested function.  */
+    /* Lookup our database, then find the requested function */
     dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
+    if (dbentry == NULL)
+        return NULL;
+
+    funcentry = pgstat_fetch_stat_funcentry_extended(dbentry, func_id);
 
     return funcentry;
 }
 
+/* ----------
+ * pgstat_fetch_stat_funcentry_extended() -
+ *
+ *    Find function stats entry on backends. The returned entries are cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
+ *
+ *  dbent has type (PgStat_StatDBEntry *), but it must point to a
+ *  PgStat_StatDBEntry returned from pgstat_fetch_stat_dbentry().
+ */
+static PgStat_StatFuncEntry *
+pgstat_fetch_stat_funcentry_extended(PgStat_StatDBEntry *dbent, Oid funcid)
+{
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_param param =
+    {
+        .hash_name = "function stats snapshot hash",
+        .hash_entsize = sizeof(PgStat_StatFuncEntry),
+        .dsh_handle = DSM_HANDLE_INVALID,
+        .dsh_params = &dsh_funcparams,
+        .hash = NULL,
+        .dshash = NULL
+    };
+
+    /* should be called from backends  */
+    Assert(IsUnderPostmaster);
+
+    if (dbent->functions == DSM_HANDLE_INVALID)
+        return NULL;
+
+    /* set target shared hash */
+    param.dsh_handle = dbent->functions;
+
+    /* tell snapshot_statentry what variables to use */
+    param.hash = &dbent->snapshot_functions;
+    param.dshash = &dbent->dshash_functions;
+
+    return (PgStat_StatFuncEntry *)
+        snapshot_statentry(&param, funcid);
+}
+
+/*
+ * pgstat_snapshot_global_stats() -
+ *
+ * Makes a snapshot of global stats if not done yet.  They will be kept until
+ * subsequent call of pgstat_clear_snapshot() or the end of the current
+ * memory context (typically TopTransactionContext).
+ */
+static void
+pgstat_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext;
+
+    pgstat_attach_shared_stats();
+
+    /* Nothing to do if already done */
+    if (global_snapshot_is_valid)
+        return;
+
+    oldcontext = MemoryContextSwitchTo(pgStatSnapshotContext);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+
+    memcpy(&snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+    LWLockRelease(StatsLock);
+
+    global_snapshot_is_valid = true;
+
+    MemoryContextSwitchTo(oldcontext);
+
+    return;
+}
 
 /* ----------
  * pgstat_fetch_stat_beentry() -
@@ -2601,9 +3088,10 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &archiverStats;
+    return &snapshot_archiverStats;
 }
 
 
@@ -2618,9 +3106,10 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &globalStats;
+    return &snapshot_globalStats;
 }
 
 
@@ -2834,8 +3323,8 @@ pgstat_initialize(void)
         MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
     }
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* needs to be called before dsm shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -2933,7 +3422,7 @@ pgstat_bestart(void)
                 lbeentry.st_backendType = B_STARTUP;
                 break;
             case ArchiverProcess:
-                beentry->st_backendType = B_ARCHIVER;
+                lbeentry.st_backendType = B_ARCHIVER;
                 break;
             case BgWriterProcess:
                 lbeentry.st_backendType = B_BG_WRITER;
@@ -3069,6 +3558,10 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+
+    /* attach shared database stats area */
+    pgstat_attach_shared_stats();
 }
 
 /*
@@ -3104,6 +3597,8 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    pgstat_detach_shared_stats(true);
 }
 
 
@@ -3364,7 +3859,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3659,9 +4155,6 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
-            break;
         case WAIT_EVENT_RECOVERY_WAL_ALL:
             event_name = "RecoveryWalAll";
             break;
@@ -4321,75 +4814,43 @@ pgstat_get_backend_desc(BackendType backendType)
  * ------------------------------------------------------------
  */
 
-
 /* ----------
- * pgstat_setheader() -
+ * pgstat_send_archiver() -
  *
- *        Set common header fields in a statistics message
+ *        Report archiver statistics
  * ----------
  */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
+void
+pgstat_send_archiver(const char *xlog, bool failed)
 {
-    hdr->m_type = mtype;
-}
+    TimestampTz now = GetCurrentTimestamp();
 
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
-    {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
- *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
- * ----------
- */
-void
-pgstat_send_archiver(const char *xlog, bool failed)
-{
-    PgStat_MsgArchiver msg;
-
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
-}
+    if (failed)
+    {
+        /* Failed archival attempt */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->failed_count;
+        memcpy(shared_archiverStats->last_failed_wal, xlog,
+               sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    else
+    {
+        /* Successful archival operation */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->archived_count;
+        memcpy(shared_archiverStats->last_archived_wal, xlog,
+               sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+}
 
 /* ----------
  * pgstat_send_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
@@ -4398,6 +4859,8 @@ pgstat_send_bgwriter(void)
     /* We assume this initializes to zeroes */
     static const PgStat_MsgBgWriter all_zeroes;
 
+    PgStat_MsgBgWriter *s = &BgWriterStats;
+
     /*
      * This function can be called even if nothing at all has happened. In
      * this case, avoid sending a completely empty message to the stats
@@ -4406,11 +4869,18 @@ pgstat_send_bgwriter(void)
     if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += s->m_timed_checkpoints;
+    shared_globalStats->requested_checkpoints += s->m_requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += s->m_checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += s->m_checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += s->m_buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += s->m_buf_written_clean;
+    shared_globalStats->maxwritten_clean += s->m_maxwritten_clean;
+    shared_globalStats->buf_written_backend += s->m_buf_written_backend;
+    shared_globalStats->buf_fsync_backend += s->m_buf_fsync_backend;
+    shared_globalStats->buf_alloc += s->m_buf_alloc;
+    LWLockRelease(StatsLock);
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4419,305 +4889,164 @@ pgstat_send_bgwriter(void)
 }
 
 
-/* ----------
- * PgstatCollectorMain() -
- *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
+/*
+ * Pin and Unpin dbentry.
  *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
+ * To keep memory usage low, and for speed, counters are reset by recreating
+ * the dshash instead of removing entries one by one under a whole-dshash
+ * lock. On the other hand, a dshash cannot be destroyed until all referrers
+ * have gone, so other backends could be kept waiting for the counter reset
+ * for a long time. To avoid that, we isolate the hashes under destruction as
+ * another generation, which is no longer used but cannot be removed yet.
+ *
+ * When we start accessing hashes on a dbentry, call pin_hashes() and acquire
+ * the current "generation". unpin_hashes() removes the older generation's
+ * hashes when all referrers have gone.
  */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
+static int
+pin_hashes(PgStat_StatDBEntry *dbentry)
 {
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
+    int    generation;
 
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, pgstat_sighup_handler);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, pgstat_exit);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->refcnt++;
+    generation = dbentry->generation;
+    LWLockRelease(&dbentry->lock);
 
-    /*
-     * Identify myself via ps
-     */
-    init_ps_display("stats collector", "", "", "");
+    dshash_release_lock(pgStatDBHash, dbentry);
 
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
+    return generation;
+}
 
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do recognize got_SIGHUP inside
-     * the inner loop, which means that such interrupts will get serviced but
-     * the latch won't get cleared until next time there is a break in the
-     * action.
-     */
-    for (;;)
+/*
+ * Unpin hashes in dbentry. If the given generation is isolated, destroy it
+ * after all referrers have gone. Otherwise just decrease the reference count.
+ */
+static void
+unpin_hashes(PgStat_StatDBEntry *dbentry, int generation)
+{
+    dshash_table *tables;
+    dshash_table *funcs = NULL;
+
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+
+    /* using current generation, just decrease refcount */
+    if (dbentry->generation == generation)
     {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
+        dbentry->refcnt--;
+        LWLockRelease(&dbentry->lock);
+        return;
+    }
 
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (need_exit)
-            break;
+    /*
+     * It is isolated, waiting for all referrers to end.
+     */
+    Assert(dbentry->generation == generation + 1);
 
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * need_exit becomes set.
-         */
-        while (!need_exit)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (got_SIGHUP)
-            {
-                got_SIGHUP = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
+    if (--dbentry->prev_refcnt > 0)
+    {
+        LWLockRelease(&dbentry->lock);
+        return;
+    }
 
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
+    /* no referrer remains, remove the hashes */
+    tables = dshash_attach(area, &dsh_tblparams, dbentry->prev_tables, 0);
+    if (dbentry->prev_functions != DSM_HANDLE_INVALID)
+        funcs = dshash_attach(area, &dsh_funcparams,
+                              dbentry->prev_functions, 0);
 
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
+    dbentry->prev_tables = DSM_HANDLE_INVALID;
+    dbentry->prev_functions = DSM_HANDLE_INVALID;
 
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
+    /* release the entry immediately */
+    LWLockRelease(&dbentry->lock);
 
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
+    dshash_destroy(tables);
+    if (funcs)
+        dshash_destroy(funcs);
 
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
+    return;
+}
 
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
+/*
+ * attach and return the specified generation of table hash
+ * Returns NULL on lock failure.
+ */
+static dshash_table *
+attach_table_hash(PgStat_StatDBEntry *dbent, int gen)
+{
+    dshash_table *ret;
 
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
+    LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
 
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(
-                                                   &msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(
-                                                   &msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(
-                                                 &msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(
-                                                 &msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
+    if (dbent->generation == gen)
+        ret = dshash_attach(area, &dsh_tblparams, dbent->tables, 0);
+    else
+    {
+        Assert (dbent->generation == gen + 1);
+        Assert (dbent->prev_tables != DSM_HANDLE_INVALID);
+        ret = dshash_attach(area, &dsh_tblparams, dbent->prev_tables, 0);
+    }
+    LWLockRelease(&dbent->lock);
 
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
+    return ret;
+}
 
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr & WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
+/* attach and return the specified generation of function hash */
+static dshash_table *
+attach_function_hash(PgStat_StatDBEntry *dbent, int gen)
+{
+    dshash_table *ret = NULL;
 
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
 
-    exit(0);
-}
+    LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
 
+    if (dbent->generation == gen)
+    {
+        if (dbent->functions == DSM_HANDLE_INVALID)
+        {
+            dshash_table *funchash =
+                dshash_create(area, &dsh_funcparams, 0);
+            dbent->functions = dshash_get_hash_table_handle(funchash);
 
-/* SIGQUIT signal handler for collector process */
-static void
-pgstat_exit(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
+            ret = funchash;
+        }
+        else
+            ret =  dshash_attach(area, &dsh_funcparams, dbent->functions, 0);
+    }
+    /* don't bother creating useless hash */
 
-    need_exit = true;
-    SetLatch(MyLatch);
+    LWLockRelease(&dbent->lock);
 
-    errno = save_errno;
+    return  ret;
 }
 
-/* SIGHUP handler for collector process */
 static void
-pgstat_sighup_handler(SIGNAL_ARGS)
+init_dbentry(PgStat_StatDBEntry *dbentry)
 {
-    int            save_errno = errno;
-
-    got_SIGHUP = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
+    LWLockInitialize(&dbentry->lock, LWTRANCHE_STATS);
+    dbentry->generation = 0;
+    dbentry->refcnt = 0;
+    dbentry->prev_refcnt = 0;
+    dbentry->tables = DSM_HANDLE_INVALID;
+    dbentry->prev_tables = DSM_HANDLE_INVALID;
+    dbentry->functions = DSM_HANDLE_INVALID;
+    dbentry->prev_functions = DSM_HANDLE_INVALID;
 }
 
 /*
  * Subroutine to clear stats in a database entry
  *
- * Tables and functions hashes are initialized to empty.
+ * Reset all counters in the dbentry. Tables and functions dshashes are
+ * destroyed.  If any backend is pinning this dbentry, the current dshashes
+ * are stashed out to the previous "generation" to wait until all accessors
+ * are gone. If the previous generation is already occupied, the current
+ * dshashes are so fresh that they don't need to be cleared.
  */
 static void
 reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
 {
-    HASHCTL        hash_ctl;
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
 
     dbentry->n_xact_commit = 0;
     dbentry->n_xact_rollback = 0;
@@ -4742,130 +5071,118 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
     dbentry->n_block_read_time = 0;
     dbentry->n_block_write_time = 0;
 
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
-
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
-}
+    if (dbentry->refcnt == 0)
+    {
+        /*
+         * No one is referring to the current hash. It's very costly to remove
+         * entries in dshash individually so just destroy the whole.  If
+         * someone pins this entry just afterwards, pin_hashes() returns the
+         * current generation and the attach will happen after the following
+         * LWLock release.
+         */
+        dshash_table *tbl;
 
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
-{
-    PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+            dbentry->tables = DSM_HANDLE_INVALID;
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+            dbentry->functions = DSM_HANDLE_INVALID;
+        }
+    }
+    else if (dbentry->prev_refcnt == 0)
+    {
+        /*
+         * Someone is still referring to the current hash and previous slot is
+         * vacant. Stash out the current hash to the previous slot.
+         */
+        dbentry->prev_refcnt = dbentry->refcnt;
+        dbentry->prev_tables = dbentry->tables;
+        dbentry->prev_functions = dbentry->functions;
+        dbentry->refcnt = 0;
+        dbentry->tables = DSM_HANDLE_INVALID;
+        dbentry->functions = DSM_HANDLE_INVALID;
+        dbentry->generation++;
+    }
+    else
+    {
+        Assert(dbentry->prev_refcnt > 0 && dbentry->refcnt > 0);
+        /*
+         * If we get here, we have just received another reset request while
+         * the old hashes are still waiting for all referrers to release
+         * them. That wait should be short, so we can just ignore the request.
+         *
+         * As a side effect, the resetter can see non-zero values before
+         * anyone updates them, but that is indistinguishable from someone
+         * having updated them before the read.
+         */
+    }
 
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
+    /* Create a new table hash if none exists */
+    if (dbentry->tables == DSM_HANDLE_INVALID)
+    {
+        dshash_table *tbl = dshash_create(area, &dsh_tblparams, 0);
+        dbentry->tables = dshash_get_hash_table_handle(tbl);
+        dshash_detach(tbl);
+    }
 
-    if (!create && !found)
-        return NULL;
+    /* Create a new function hash if none exists and one is needed. */
+    if (dbentry->functions == DSM_HANDLE_INVALID &&
+        pgstat_track_functions != TRACK_FUNC_OFF)
+    {
+        dshash_table *tbl = dshash_create(area, &dsh_funcparams, 0);
+        dbentry->functions = dshash_get_hash_table_handle(tbl);
+        dshash_detach(tbl);
+    }
 
-    /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
-     */
-    if (!found)
-        reset_dbentry_counters(result);
+    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
 
-    return result;
+    LWLockRelease(&dbentry->lock);
 }
 
-
 /*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+ * Create the filename for a DB stat file; filename is an output parameter
+ * that points to a character buffer of length len.
  */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+static void
+get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
 {
-    PgStat_StatTabEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /* If not found, initialize the new one. */
-    if (!found)
-    {
-        result->numscans = 0;
-        result->tuples_returned = 0;
-        result->tuples_fetched = 0;
-        result->tuples_inserted = 0;
-        result->tuples_updated = 0;
-        result->tuples_deleted = 0;
-        result->tuples_hot_updated = 0;
-        result->n_live_tuples = 0;
-        result->n_dead_tuples = 0;
-        result->changes_since_analyze = 0;
-        result->blocks_fetched = 0;
-        result->blocks_hit = 0;
-        result->vacuum_timestamp = 0;
-        result->vacuum_count = 0;
-        result->autovac_vacuum_timestamp = 0;
-        result->autovac_vacuum_count = 0;
-        result->analyze_timestamp = 0;
-        result->analyze_count = 0;
-        result->autovac_analyze_timestamp = 0;
-        result->autovac_analyze_count = 0;
-    }
+    int            printed;
 
-    return result;
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/db_%u.%s",
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
+                       databaseid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
 }
 
-
 /* ----------
  * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ *        Write the global statistics file, as well as DB files.
  * ----------
  */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+void
+pgstat_write_statsfiles(void)
 {
-    HASH_SEQ_STATUS hstat;
+    dshash_seq_status hstat;
     PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
 
+    /* Stats are not initialized yet; just return. */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
+
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
     /*
@@ -4884,7 +5201,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
 
     /*
      * Write the file header --- currently just a format ID.
@@ -4896,39 +5213,37 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Write global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write archiver stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Walk through the database table.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_init(&hstat, pgStatDBHash, false, false);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&hstat)) != NULL)
     {
         /*
          * Write out the table and function stats for this DB into the
          * appropriate per-DB stat file, if required.
          */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
+        pgstat_write_pgStatDBHashfile(dbentry);
 
         /*
          * Write out the DB entry. We don't write the tables or functions
          * pointers, since they're of no use to any other process.
          */
         fputc('D', fpout);
-        rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
+        rc = fwrite(dbentry,
+                    offsetof(PgStat_StatDBEntry, generation), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
 
@@ -4964,53 +5279,18 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
-}
-
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
 }
 
 /* ----------
- * pgstat_write_db_statsfile() -
+ * pgstat_write_pgStatDBHashfile() -
  *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
  * ----------
  */
 static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
+pgstat_write_pgStatDBHashfile(PgStat_StatDBEntry *dbentry)
 {
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
+    dshash_seq_status tstat;
+    dshash_seq_status fstat;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatFuncEntry *funcentry;
     FILE       *fpout;
@@ -5019,9 +5299,10 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     int            rc;
     char        tmpfile[MAXPGPATH];
     char        statfile[MAXPGPATH];
+    dshash_table *tbl;
 
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -5048,23 +5329,30 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     /*
      * Walk through the database's access stats per table.
      */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&tstat, tbl, false, false);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&tstat)) != NULL)
     {
         fputc('T', fpout);
         rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_detach(tbl);
 
     /*
      * Walk through the database's function stats table.
      */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+    if (dbentry->functions != DSM_HANDLE_INVALID)
     {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
+        tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_seq_init(&fstat, tbl, false, false);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&fstat)) != NULL)
+        {
+            fputc('F', fpout);
+            rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+            (void) rc;                /* we'll check for error with ferror */
+        }
+        dshash_detach(tbl);
     }
 
     /*
@@ -5099,76 +5387,37 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
 }
 
 /* ----------
  * pgstat_read_statsfiles() -
  *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
- *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
+ *    Reads in existing statistics collector files into the shared stats hash.
  *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+void
+pgstat_read_statsfiles(void)
 {
     PgStat_StatDBEntry *dbentry;
     PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+
+    /* Shouldn't be called from the postmaster. */
+    Assert(IsUnderPostmaster);
+
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
 
     /*
-     * The tables will live in pgStatLocalContext.
+     * Set the current timestamp (will be kept only in case we can't load an
+     * existing statsfile).
      */
-    pgstat_setup_memcxt();
-
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-
-    /*
-     * Set the current timestamp (will be kept only in case we can't load an
-     * existing statsfile).
-     */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp =
+        shared_globalStats->stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
@@ -5182,11 +5431,11 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -5195,7 +5444,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
@@ -5203,32 +5452,24 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     /*
      * Read global stats struct
      */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+        sizeof(*shared_globalStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
         goto done;
     }
 
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
     /*
      * Read archiver stats struct
      */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+        sizeof(*shared_archiverStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
         goto done;
     }
 
@@ -5245,10 +5486,10 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                  * follows.
                  */
             case 'D':
-                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
+                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, generation),
+                          fpin) != offsetof(PgStat_StatDBEntry, generation))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5257,76 +5498,36 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 /*
                  * Add to the DB hash
                  */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
+                dbentry = (PgStat_StatDBEntry *)
+                    dshash_find_or_insert(pgStatDBHash, (void *) &dbbuf.databaseid,
+                                          &found);
+
+                /* don't allow duplicate dbentries */
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    dshash_release_lock(pgStatDBHash, dbentry);
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+                /* initialize the new shared entry */
+                init_dbentry(dbentry);
 
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
+                memcpy(dbentry, &dbbuf,
+                       offsetof(PgStat_StatDBEntry, generation));
 
+                /* Read the data from the database-specific file. */
+                pgstat_read_pgStatDBHashfile(dbentry);
+                dshash_release_lock(pgStatDBHash, dbentry);
                 break;
 
             case 'E':
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5336,45 +5537,35 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 done:
     FreeFile(fpin);
 
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 
-    return dbhash;
+    return;
 }
 
 
 /* ----------
- * pgstat_read_db_statsfile() -
+ * pgstat_read_pgStatDBHashfile() -
  *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
+ *    Reads in the at-rest statistics file and creates shared statistics
+ *    tables. The file is removed after reading.
  * ----------
  */
 static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
+pgstat_read_pgStatDBHashfile(PgStat_StatDBEntry *dbentry)
 {
     PgStat_StatTabEntry *tabentry;
     PgStat_StatTabEntry tabbuf;
     PgStat_StatFuncEntry funcbuf;
     PgStat_StatFuncEntry *funcentry;
+    dshash_table         *tabhash = NULL;
+    dshash_table         *funchash = NULL;
     FILE       *fpin;
     int32        format_id;
     bool        found;
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
+    get_dbstat_filename(false, dbentry->databaseid, statfile, MAXPGPATH);
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
@@ -5388,7 +5579,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
@@ -5401,14 +5592,14 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
 
     /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
+     * We found an existing statistics file. Read it and put all the hashtable
+     * entries into place.
      */
     for (;;)
     {
@@ -5421,31 +5612,35 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
                           fpin) != sizeof(PgStat_StatTabEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                /*
-                 * Skip if table data not wanted.
-                 */
                 if (tabhash == NULL)
-                    break;
+                {
+                    tabhash = dshash_create(area, &dsh_tblparams, 0);
+                    dbentry->tables =
+                        dshash_get_hash_table_handle(tabhash);
+                }
 
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
+                tabentry = (PgStat_StatTabEntry *)
+                    dshash_find_or_insert(tabhash,
+                                          (void *) &tabbuf.tableid, &found);
 
+                /* don't allow duplicate entries */
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    dshash_release_lock(tabhash, tabentry);
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
                 memcpy(tabentry, &tabbuf, sizeof(tabbuf));
+                dshash_release_lock(tabhash, tabentry);
                 break;
 
                 /*
@@ -5455,31 +5650,34 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
                           fpin) != sizeof(PgStat_StatFuncEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                /*
-                 * Skip if function data not wanted.
-                 */
                 if (funchash == NULL)
-                    break;
+                {
+                    funchash = dshash_create(area, &dsh_funcparams, 0);
+                    dbentry->functions =
+                        dshash_get_hash_table_handle(funchash);
+                }
 
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
+                funcentry = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert(funchash,
+                                          (void *) &funcbuf.functionid, &found);
 
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    dshash_release_lock(funchash, funcentry);
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
                 memcpy(funcentry, &funcbuf, sizeof(funcbuf));
+                dshash_release_lock(funchash, funcentry);
                 break;
 
                 /*
@@ -5489,7 +5687,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5497,295 +5695,39 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     }
 
 done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
+    if (tabhash)
+        dshash_detach(tabhash);
+    if (funchash)
+        dshash_detach(funchash);
 
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
     FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
 
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 }
 
-
 /* ----------
  * pgstat_setup_memcxt() -
  *
- *    Create pgStatLocalContext, if not already done.
+ *    Create pgStatLocalContext and pgStatSnapshotContext, if not already done.
  * ----------
  */
 static void
 pgstat_setup_memcxt(void)
 {
     if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+
+    if (!pgStatSnapshotContext)
+        pgStatSnapshotContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Database statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
 }
 
-
 /* ----------
  * pgstat_clear_snapshot() -
  *
@@ -5801,739 +5743,223 @@ pgstat_clear_snapshot(void)
 {
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
+    {
         MemoryContextDelete(pgStatLocalContext);
 
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
+    }
 
+    if (pgStatSnapshotContext)
+        clear_snapshot = true;
+}
 
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
+static bool
+pgstat_update_tabentry(dshash_table *tabhash, PgStat_TableStatus *stat,
+                       bool nowait)
 {
-    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    bool    found;
 
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
+    if (tabhash == NULL)
+        return false;
 
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
+    tabentry = (PgStat_StatTabEntry *)
+        dshash_find_or_insert_extended(tabhash, (void *) &(stat->t_id),
+                                       &found, nowait);
 
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
+    /* failed to acquire lock */
+    if (tabentry == NULL)
+        return false;
+
+    if (!found)
     {
         /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
+         * If it's a new table entry, initialize counters to the values we
+         * just got.
          */
+        tabentry->numscans = stat->t_counts.t_numscans;
+        tabentry->tuples_returned = stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched = stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted = stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated = stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted = stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated = stat->t_counts.t_tuples_hot_updated;
+        tabentry->n_live_tuples = stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples = stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze = stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched = stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit = stat->t_counts.t_blocks_hit;
+
+        tabentry->vacuum_timestamp = 0;
+        tabentry->vacuum_count = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_count = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->analyze_count = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+        tabentry->autovac_analyze_count = 0;
     }
-    else if (msg->clock_time < dbentry->stats_timestamp)
+    else
     {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        tabentry->numscans += stat->t_counts.t_numscans;
+        tabentry->tuples_returned += stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched += stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted += stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated += stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted += stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated += stat->t_counts.t_tuples_hot_updated;
+        /* If table was truncated, first reset the live/dead counters */
+        if (stat->t_counts.t_truncated)
         {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
+            tabentry->n_live_tuples = 0;
+            tabentry->n_dead_tuples = 0;
         }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
+        tabentry->n_live_tuples += stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples += stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze += stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched += stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit += stat->t_counts.t_blocks_hit;
     }
 
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
 
+    dshash_release_lock(tabhash, tabentry);
+
+    return true;
+}
 
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
 static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
+pgstat_update_dbentry(PgStat_StatDBEntry *dbentry, PgStat_TableStatus *stat)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
     /*
-     * Update database-wide stats.
+     * Add per-table stats to the per-database entry, too.
      */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
+    dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
+    dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
+    dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
+    dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
+    dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
+    dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
+    LWLockRelease(&dbentry->lock);
+}
 
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
+/*
+ * Look up the shared stats hash entry for the specified database.  Returns
+ * NULL when PGSTAT_NOWAIT is set and the required lock cannot be acquired.
+ */
+static PgStat_StatDBEntry *
+pgstat_get_db_entry(Oid databaseid, int op, PgStat_TableLookupResult *status)
+{
+    PgStat_StatDBEntry *result;
+    bool        nowait = ((op & PGSTAT_NOWAIT) != 0);
+    bool        lock_acquired = true;
+    bool        found = true;
 
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
+    if (!IsUnderPostmaster || !pgStatDBHash)
+        return NULL;
 
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
+    /* Lookup or create the hash table entry for this database */
+    if (op & PGSTAT_EXCLUSIVE)
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_or_insert_extended(pgStatDBHash, &databaseid,
+                                           &found, nowait);
+        if (result == NULL)
+            lock_acquired = false;
+        else if (!found)
         {
             /*
-             * Otherwise add the values to the existing entry.
+             * If not found, initialize the new one.  This creates empty hash
+             * tables, too.
              */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
+            init_dbentry(result);
+            reset_dbentry_counters(result);
         }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
     }
     else
     {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
+        result = (PgStat_StatDBEntry *)
+            dshash_find_extended(pgStatDBHash, &databaseid, true, nowait,
+                                 nowait ? &lock_acquired : NULL);
+        if (result == NULL)
+            found = false;
     }
-}
 
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
+    /* Set return status if requested */
+    if (status)
     {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
+        if (!lock_acquired)
         {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
+            Assert(nowait);
+            *status = LOCK_FAILED;
         }
+        else if (!found)
+            *status = NOT_FOUND;
         else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
+            *status = FOUND;
     }
+
+    return result;
 }
 
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
+/*
+ * Look up the hash table entry for the specified table.  If no entry
+ * exists, create and initialize one when the create parameter is true;
+ * otherwise return NULL.
  */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
+static PgStat_StatTabEntry *
+pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create)
 {
-    PgStat_StatDBEntry *dbentry;
-    int            i;
+    PgStat_StatTabEntry *result;
+    bool        found;
 
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
+    /* Look up or create the hash table entry for this table */
+    if (create)
+        result = (PgStat_StatTabEntry *)
+            dshash_find_or_insert(table, &tableoid, &found);
+    else
+        result = (PgStat_StatTabEntry *) dshash_find(table, &tableoid, false);
 
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
+    if (result == NULL)
+        return NULL;
 
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
+    /* Initialize a newly created entry; "found" is set only when create is true. */
+    if (create && !found)
     {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
+        result->numscans = 0;
+        result->tuples_returned = 0;
+        result->tuples_fetched = 0;
+        result->tuples_inserted = 0;
+        result->tuples_updated = 0;
+        result->tuples_deleted = 0;
+        result->tuples_hot_updated = 0;
+        result->n_live_tuples = 0;
+        result->n_dead_tuples = 0;
+        result->changes_since_analyze = 0;
+        result->blocks_fetched = 0;
+        result->blocks_hit = 0;
+        result->vacuum_timestamp = 0;
+        result->vacuum_count = 0;
+        result->autovac_vacuum_timestamp = 0;
+        result->autovac_vacuum_count = 0;
+        result->analyze_timestamp = 0;
+        result->analyze_count = 0;
+        result->autovac_analyze_timestamp = 0;
+        result->autovac_analyze_count = 0;
     }
-}
 
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
+    return result;
 }
 
 /*
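The lookup-with-status pattern in the pgstat.c hunks above can be illustrated outside PostgreSQL. What follows is only a minimal sketch under stated assumptions: every name in it (ToyTable, ToyTabEntry, toy_get_entry, the ENTRY_* values) is hypothetical and does not come from the patch, a fixed array stands in for the dshash table, and a plain boolean stands in for a lock held by another process. It mirrors the order in which the status is decided: lock failure first, then not found, then found.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; none of these come from the patch. */
typedef enum
{
    ENTRY_FOUND,
    ENTRY_NOT_FOUND,
    ENTRY_LOCK_FAILED
} LookupStatus;

typedef struct
{
    unsigned int id;            /* key; 0 marks an unused slot */
    long        numscans;
    long        tuples_inserted;
} ToyTabEntry;

#define TOY_TABLE_SIZE 8

typedef struct
{
    bool        busy;           /* stands in for a lock held elsewhere */
    ToyTabEntry slots[TOY_TABLE_SIZE];
} ToyTable;

/*
 * Return the entry for "id", creating a zeroed one if "create" is true.
 * "status", when not NULL, reports how the lookup went.  A newly created
 * entry is reported as ENTRY_NOT_FOUND, since it did not exist before.
 */
static ToyTabEntry *
toy_get_entry(ToyTable *table, unsigned int id, bool create, bool nowait,
              LookupStatus *status)
{
    ToyTabEntry *result = NULL;
    ToyTabEntry *free_slot = NULL;
    bool        found = false;
    bool        lock_acquired = !table->busy;
    int         i;

    assert(id != 0);
    /* A blocking caller would wait here; this toy cannot model that. */
    assert(lock_acquired || nowait);

    if (lock_acquired)
    {
        for (i = 0; i < TOY_TABLE_SIZE; i++)
        {
            if (table->slots[i].id == id)
            {
                result = &table->slots[i];
                found = true;
                break;
            }
            if (table->slots[i].id == 0 && free_slot == NULL)
                free_slot = &table->slots[i];
        }

        if (!found && create && free_slot != NULL)
        {
            /* Start every counter at zero, then claim the slot. */
            memset(free_slot, 0, sizeof(*free_slot));
            free_slot->id = id;
            result = free_slot;
        }
    }

    /* Set return status if requested */
    if (status)
    {
        if (!lock_acquired)
            *status = ENTRY_LOCK_FAILED;
        else if (!found)
            *status = ENTRY_NOT_FOUND;
        else
            *status = ENTRY_FOUND;
    }

    return result;
}
```

In this sketch the caller decides from the returned pointer whether it has an entry to update, and uses the status only to tell "no such entry" apart from "could not take the lock, retry later".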
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 84fda38249..4c0ea0cc23 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -255,7 +255,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -503,7 +502,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1326,12 +1324,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1780,11 +1772,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2681,8 +2668,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -3045,8 +3030,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3113,13 +3096,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3194,22 +3170,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3670,22 +3630,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3881,8 +3825,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3917,8 +3859,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4119,8 +4060,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5093,18 +5032,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5223,12 +5150,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6133,7 +6054,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6189,8 +6109,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6425,7 +6343,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 4829953ee6..5093a4a11d 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -147,6 +147,7 @@ CreateSharedMemoryAndSemaphores(void)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -263,6 +264,7 @@ CreateSharedMemoryAndSemaphores(void)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 18e3843e8b..caa00011a9 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -522,6 +522,7 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
     LWLockRegisterTranche(LWTRANCHE_SXACT, "serializable_xact");
+    LWLockRegisterTranche(LWTRANCHE_STATS, "activity stats");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 3b85e48333..c518424471 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3181,6 +3181,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3755,6 +3761,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4194,6 +4201,8 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long stats_timeout;
+
                 /* Send out notify signals and transmit self-notifies */
                 ProcessCompletedNotifies();
 
@@ -4206,8 +4215,13 @@ PostgresMain(int argc, char *argv[],
                 if (notifyInterruptPending)
                     ProcessNotifyInterrupt();
 
-                pgstat_report_stat(false);
-
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4242,7 +4256,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4250,6 +4264,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
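The postgres.c hunks above arm a timer when pgstat_report_stat(false) returns a positive value and disarm it once the next command arrives. The sketch below shows one way such a return value could be produced. It is an assumption-laden illustration, not the patch's implementation: ToyStats, toy_report_stat and the 500 ms interval are all hypothetical, and the only thing taken from the diff is the calling convention (zero means nothing left to do, a positive value means "try again after this many milliseconds").

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed minimum spacing between flushes; the real value is not shown here. */
#define TOY_MIN_FLUSH_INTERVAL_MS 500

/* Hypothetical stand-in for backend-local pending stats plus shared totals. */
typedef struct
{
    long        pending_updates;    /* counts not yet written to shared memory */
    long        last_flush_ms;      /* time of the previous flush */
    long        shared_total;       /* stands in for the shared statistics */
} ToyStats;

/*
 * Flush pending counts unless the last flush was too recent.
 * Returns 0 when nothing remains pending, otherwise the number of
 * milliseconds the caller should wait before trying again.
 */
static long
toy_report_stat(ToyStats *s, long now_ms, bool force)
{
    long        elapsed;

    assert(now_ms >= s->last_flush_ms);

    if (s->pending_updates == 0)
        return 0;

    elapsed = now_ms - s->last_flush_ms;
    if (!force && elapsed < TOY_MIN_FLUSH_INTERVAL_MS)
        return TOY_MIN_FLUSH_INTERVAL_MS - elapsed;

    s->shared_total += s->pending_updates;
    s->pending_updates = 0;
    s->last_flush_ms = now_ms;
    return 0;
}
```

Under this convention an idle backend that still holds unflushed counts gets a second chance through the timer, which calls the same function with force set to true.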
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 3bf96de256..9c694f20c9 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile sig_atomic_t ConfigReloadPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index cc38669a1e..a60bc58b0c 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -630,6 +631,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1240,6 +1243,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
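IdleStatsUpdateTimeoutHandler above follows the usual deferred-work pattern: the handler only sets a flag, and the real work happens later in ProcessInterrupts. A minimal sketch of that pattern in standard C is shown here. The names (toy_flush_pending, toy_timeout_handler, toy_process_interrupts) are hypothetical, a plain signal stands in for PostgreSQL's timeout machinery, and a counter stands in for the actual stats flush.

```c
#include <signal.h>

/* Written by the handler, read and cleared by the main loop. */
static volatile sig_atomic_t toy_flush_pending = 0;

/* Counts how many times the deferred work has run. */
static int  toy_flush_count = 0;

/* Handler: do nothing except record that work is pending. */
static void
toy_timeout_handler(int signo)
{
    (void) signo;
    toy_flush_pending = 1;
}

/* Main-loop side: clear the flag first, then do the deferred work. */
static void
toy_process_interrupts(void)
{
    if (toy_flush_pending)
    {
        toy_flush_pending = 0;
        toy_flush_count++;      /* stands in for the stats flush */
    }
}
```

Keeping the handler this small means nothing that might allocate or take a lock runs in signal context; the flag is the only state it touches.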
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b7d36b65dd..13be46c172 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -6,7 +6,7 @@ use File::Basename qw(basename dirname);
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 106;
+use Test::More tests => 105;
 
 program_help_ok('pg_basebackup');
 program_version_ok('pg_basebackup');
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1f4db67f3f..43250c3885 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 65713abc2b..c9fbcead3f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL statistics collector facility.
  *
  *    Copyright (c) 2001-2019, PostgreSQL Global Development Group
  *
@@ -13,10 +13,11 @@
 
 #include "datatype/timestamp.h"
 #include "libpq/pqcomm.h"
-#include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -40,33 +41,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -77,9 +51,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -115,13 +88,6 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_blocks_hit;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -180,236 +146,12 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
-/* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
- */
-typedef struct PgStat_MsgHdr
-{
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
 /* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
+ * PgStat_MsgBgWriter            bgwriter statistics
  * ----------
  */
 typedef struct PgStat_MsgBgWriter
 {
-    PgStat_MsgHdr m_hdr;
-
     PgStat_Counter m_timed_checkpoints;
     PgStat_Counter m_requested_checkpoints;
     PgStat_Counter m_buf_written_checkpoints;
@@ -422,38 +164,14 @@ typedef struct PgStat_MsgBgWriter
     PgStat_Counter m_checkpoint_sync_time;
 } PgStat_MsgBgWriter;
 
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
-
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -485,96 +203,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Statistic collector data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -614,16 +244,29 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_block_write_time;
 
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;        /* time of db stats update */
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following members must be last in the struct, because we don't
+     * write them out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    int generation;                        /* current generation of the below */
+    int    refcnt;                            /* current gen reference count */
+    dshash_table_handle tables;            /* current gen tables hash */
+    dshash_table_handle functions;        /* current gen functions hash */
+    int    prev_refcnt;                    /* prev gen reference count */
+    dshash_table_handle prev_tables;    /* prev gen tables hash */
+    dshash_table_handle prev_functions;    /* prev gen functions hash */
+    LWLock    lock;                        /* Lock for the above members */
+
+    /* non-shared members */
+    HTAB *snapshot_tables;                 /* table entry snapshot */
+    HTAB *snapshot_functions;             /* function entry snapshot */
+    dshash_table    *dshash_tables;         /* attached tables dshash */
+    dshash_table    *dshash_functions;     /* attached functions dshash */
 } PgStat_StatDBEntry;
 
+#define SHARED_DBENT_SIZE offsetof(PgStat_StatDBEntry, snapshot_tables)
 
 /* ----------
  * PgStat_StatTabEntry            The collector's data per table (or index)
@@ -662,7 +305,7 @@ typedef struct PgStat_StatTabEntry
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            per function stats data
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
@@ -677,7 +320,7 @@ typedef struct PgStat_StatFuncEntry
 
 
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -693,7 +336,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -779,7 +422,6 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
     WAIT_EVENT_RECOVERY_WAL_ALL,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
@@ -1214,6 +856,8 @@ extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used, but will be removed with GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
@@ -1235,29 +879,26 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
-extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
 
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void pgstat_reset_all(void);
 
+/* File input/output functions  */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 
 extern void pgstat_report_autovac(Oid dboid);
@@ -1429,11 +1070,13 @@ extern void pgstat_send_bgwriter(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_extended(PgStat_StatDBEntry *dbent, Oid relid);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index f9450dac90..50d0a0c9dd 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -220,6 +220,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_SXACT,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index ae5389ec96..3e02bc6f85 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.23.0
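
The SHARED_DBENT_SIZE macro in the pgstat.h hunk above relies on the
shared members preceding the process-local pointers in the struct. A
minimal sketch of that offsetof() technique follows, assuming a
hypothetical DemoDBEntry struct and demo_copy_shared() helper (neither
is part of the patch, and the field set is reduced for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for PgStat_StatDBEntry: members that live in
 * shared memory come first, process-local pointers come last.
 */
typedef struct DemoDBEntry
{
    unsigned int databaseid;        /* shared */
    long        n_xact_commit;      /* shared */
    int         generation;         /* shared */
    int         refcnt;             /* shared */

    /* non-shared members */
    void       *snapshot_tables;    /* local snapshot */
    void       *dshash_tables;      /* local attachment */
} DemoDBEntry;

/* Size of the shared prefix: everything before the first local member. */
#define DEMO_SHARED_SIZE offsetof(DemoDBEntry, snapshot_tables)

/*
 * Copy only the shared prefix from src to dst.  The local pointers in
 * dst are left untouched because they lie beyond DEMO_SHARED_SIZE.
 */
static void
demo_copy_shared(DemoDBEntry *dst, const DemoDBEntry *src)
{
    assert(DEMO_SHARED_SIZE <= sizeof(DemoDBEntry));
    memcpy(dst, src, DEMO_SHARED_SIZE);
}
```

With this layout, a backend can refresh the shared part of its copy
without clobbering pointers that are only meaningful in its own address
space; the price is that the member order becomes part of the contract.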

From 4e92947ed997fd13693ee73837240961ccd63bfc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 27 Nov 2018 14:42:12 +0900
Subject: [PATCH v24 5/5] Remove the GUC stats_temp_directory

The GUC used to specify the directory in which temporary statistics
files were stored. It is no longer needed by the stats collector but is
still used by programs in bin and contrib, and possibly by other
extensions. Thus this patch removes the GUC, but some backing variables
and macro definitions are left alone for backward compatibility.
---
 doc/src/sgml/backup.sgml                      |  2 -
 doc/src/sgml/config.sgml                      | 19 ---------
 doc/src/sgml/monitoring.sgml                  |  7 +---
 doc/src/sgml/storage.sgml                     |  3 +-
 src/backend/postmaster/pgstat.c               | 13 +++---
 src/backend/replication/basebackup.c          | 13 ++----
 src/backend/utils/misc/guc.c                  | 41 -------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/pgstat.h                          |  5 ++-
 src/test/perl/PostgresNode.pm                 |  4 --
 10 files changed, 14 insertions(+), 94 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index bdc9026c62..2885540362 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1146,8 +1146,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 4ec13f3311..389269999d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7004,25 +7004,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index eb94dec119..73cba4e21f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -195,12 +195,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
+   <productname>PostgreSQL</productname> processes through shared memory.
    When the server shuts down cleanly, a permanent copy of the statistics
    data is stored in the <filename>pg_stat</filename> subdirectory, so that
    statistics can be retained across server restarts.  When recovery is
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 1c19e863d2..2f04bb68bb 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -122,8 +122,7 @@ Item
 
 <row>
  <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
+ <entry>Subdirectory containing ephemeral files for extensions</entry>
 </row>
 
 <row>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index bcf8c6f371..7fe5c5019a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -107,15 +107,12 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
+/*
+ * This used to be a GUC variable and is no longer used in this file, but is
+ * left alone, holding the default value, just for backward compatibility
+ * with extensions.
  */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+char       *pgstat_stat_directory = PG_STAT_TMP_DIR;
 
 #define        StatsLock (&StatsShmem->StatsMainLock)
 
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 1fa4551eff..84f7acbc4f 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -241,11 +241,8 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -276,13 +273,9 @@ perform_base_backup(basebackup_options *opt)
          * Calculate the relative path of temporary statistics directory in
          * order to skip the files which are located in that directory later.
          */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
+
+        Assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
+        statrelpath = psprintf("./%s", PG_STAT_TMP_DIR);
 
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 5fccc9683e..809487ab69 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -195,7 +195,6 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static void assign_effective_io_concurrency(int newval, void *extra);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -4114,17 +4113,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11434,35 +11422,6 @@ assign_effective_io_concurrency(int newval, void *extra)
 #endif                            /* USE_PREFETCH */
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 46a06ffacd..7aeb789b0e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -570,7 +570,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c9fbcead3f..e9e18ed27a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -30,7 +30,10 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
+/*
+ * This used to be the directory to store temporary statistics data in but is
+ * no longer used. Defined here for backward compatibility.
+ */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
 
 /* Values for track_functions GUC variable --- order is significant! */
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 270bd6c856..c604c5e90b 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -455,10 +455,6 @@ sub init
     print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
       if defined $ENV{TEMP_CONFIG};
 
-    # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
-    # concurrently must not share a stats_temp_directory.
-    print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
     if ($params{allows_streaming})
     {
         if ($params{allows_streaming} eq "logical")
-- 
2.23.0


Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
At Tue, 03 Dec 2019 17:27:59 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> Thanks! Rebased.

CFbots seem unhappy with this. Rebased.

> # I should design then run a performance test on this..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 5f7946522dc189429008e830af33ff2db435dd42 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:41:04 +0900
Subject: [PATCH 1/5] sequential scan for dshash

Add sequential scan feature to dshash.
---
 src/backend/lib/dshash.c | 185 +++++++++++++++++++++++++++++++++++++--
 src/include/lib/dshash.h |  22 ++++-
 2 files changed, 201 insertions(+), 6 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index fbdd941325..7675370058 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -112,6 +112,7 @@ struct dshash_table
     size_t        size_log2;        /* log2(number of buckets) */
     bool        find_locked;    /* Is any partition lock held by 'find'? */
     bool        find_exclusively_locked;    /* ... exclusively? */
+    bool        seqscan_running;/* now under sequential scan */
 };
 
 /* Given a pointer to an item, find the entry (user data) it holds. */
@@ -127,6 +128,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +158,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -228,6 +237,7 @@ dshash_create(dsa_area *area, const dshash_parameters *params, void *arg)
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
 
     /*
      * Set up the initial array of buckets.  Our initial size is the same as
@@ -279,6 +289,7 @@ dshash_attach(dsa_area *area, const dshash_parameters *params,
     hash_table->control = dsa_get_address(area, control);
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
     Assert(hash_table->control->magic == DSHASH_MAGIC);
 
     /*
@@ -324,7 +335,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
                                 LW_EXCLUSIVE));
 
     delete_item(hash_table, item);
-    hash_table->find_locked = false;
-    hash_table->find_exclusively_locked = false;
-    LWLockRelease(PARTITION_LOCK(hash_table, partition));
+
+    /* Keep holding the partition lock during a sequential scan */
+    if (!hash_table->seqscan_running)
+    {
+        hash_table->find_locked = false;
+        hash_table->find_exclusively_locked = false;
+        LWLockRelease(PARTITION_LOCK(hash_table, partition));
+    }
 }
 
 /*
@@ -568,6 +584,8 @@ dshash_release_lock(dshash_table *hash_table, void *entry)
     Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition_index),
                                 hash_table->find_exclusively_locked
                                 ? LW_EXCLUSIVE : LW_SHARED));
+    /* lock is under control of sequential scan */
+    Assert(!hash_table->seqscan_running);
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
@@ -592,6 +610,164 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through a dshash table and return all the
+ *           elements one by one; return NULL when there are no more.
+ *
+ * dshash_seq_term should be called only for incomplete scans. Finished
+ * scans are cleaned up automatically.
+ *
+ * Returned elements are locked, as is the case with dshash_find.  However,
+ * the caller must not release the lock.
+ *
+ * As with dynahash, the caller may delete returned elements during a scan.
+ *
+ * If consistent is set for dshash_seq_init, all hash table partitions are
+ * locked in the requested mode (as determined by the exclusive flag) during
+ * the scan.  Otherwise partitions are locked one at a time during the
+ * scan.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool consistent, bool exclusive)
+{
+    /* at most one scan is allowed at a time */
+    Assert(!hash_table->seqscan_running);
+
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->consistent = consistent;
+    status->exclusive = exclusive;
+    hash_table->seqscan_running = true;
+
+    /*
+     * Protect all partitions from modification if the caller wants a
+     * consistent result.
+     */
+    if (consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+        {
+            Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i)));
+
+            LWLockAcquire(PARTITION_LOCK(hash_table, i),
+                          exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        }
+        ensure_valid_bucket_pointers(hash_table);
+    }
+}
+
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    Assert(status->hash_table->seqscan_running);
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert (status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        if (!status->consistent)
+        {
+            partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+            LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            status->curpartition = partition;
+
+            /* resize doesn't happen from now until seq scan ends */
+            status->nbuckets =
+                NUM_BUCKETS(status->hash_table->control->size_log2);
+            ensure_valid_bucket_pointers(status->hash_table);
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            dshash_seq_term(status);
+            return NULL;
+        }
+
+        /* Also move partition lock if needed */
+        if (!status->consistent)
+        {
+            int next_partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+
+            /* Move lock along with partition for the bucket */
+            if (status->curpartition != next_partition)
+            {
+                /*
+                 * Lock the next partition then release the current, not in the
+                 * reverse order, to avoid concurrent resizing. Partitions
+                 * are locked in the same order as in resize() so deadlocks
+                 * won't happen.
+                 */
+                LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                             next_partition),
+                              status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+                LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                             status->curpartition));
+                status->curpartition = next_partition;
+            }
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * The caller may delete the item. Store the next item in case of deletion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    Assert(status->hash_table->seqscan_running);
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+    status->hash_table->seqscan_running = false;
+
+    if (status->consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+            LWLockRelease(PARTITION_LOCK(status->hash_table, i));
+    }
+    else if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
@@ -673,7 +849,6 @@ delete_item(dshash_table *hash_table, dshash_table_item *item)
 /*
  * Grow the hash table if necessary to the requested number of buckets.  The
  * requested size must be double some previously observed size.
- *
  * Must be called without any partition lock held.
  */
 static void
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b86df68e77..ad88f32cdd 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,22 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state.  The details are exposed so that callers know the
+ * storage size, but callers should treat this as an opaque type.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;
+    int                    curbucket;
+    int                    nbuckets;
+    dshash_table_item  *curitem;
+    dsa_pointer            pnextitem;
+    int                    curpartition;
+    bool                consistent;
+    bool                exclusive;
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
                                    const dshash_parameters *params,
@@ -70,7 +86,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
 extern void dshash_detach(dshash_table *hash_table);
 extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
 extern void dshash_destroy(dshash_table *hash_table);
-
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
@@ -80,6 +95,11 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool consistent, bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.23.0
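
A note on the scan code above: dshash_seq_next() stores the successor of the current item in pnextitem before returning, so the caller may delete the entry it was just handed without breaking the scan. The following is a minimal sketch of that idea only, on an ordinary singly linked list with no locking and no shared memory. All names in it (Item, SeqStatus, seq_init, seq_next, delete_current, push) are hypothetical and are not part of dshash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical toy list item; stands in for dshash_table_item. */
typedef struct Item
{
    int          key;
    struct Item *next;
} Item;

/* Hypothetical scan state; nextitem plays the role of pnextitem. */
typedef struct SeqStatus
{
    Item **head;     /* list being scanned */
    Item  *curitem;  /* item returned by the last seq_next(), or NULL */
    Item  *nextitem; /* successor saved before curitem was handed out */
} SeqStatus;

/* Prepend a new item and return the new list head. */
Item *
push(Item *head, int key)
{
    Item *item = malloc(sizeof(Item));

    assert(item != NULL);
    item->key = key;
    item->next = head;
    return item;
}

void
seq_init(SeqStatus *status, Item **head)
{
    assert(head != NULL);
    status->head = head;
    status->curitem = NULL;
    status->nextitem = *head;
}

/* Return the next item, or NULL at the end of the list. */
Item *
seq_next(SeqStatus *status)
{
    status->curitem = status->nextitem;

    /* The caller may delete the item, so remember its successor now. */
    if (status->curitem != NULL)
        status->nextitem = status->curitem->next;

    return status->curitem;
}

/* Unlink and free the item most recently returned by seq_next(). */
void
delete_current(SeqStatus *status)
{
    Item **link = status->head;

    assert(status->curitem != NULL);
    while (*link != status->curitem)
        link = &(*link)->next;
    *link = status->curitem->next;
    free(status->curitem);
    status->curitem = NULL;
}
```

A caller would loop on seq_next() until it returns NULL and call delete_current() for any item it wants to drop. The scan never reads the freed item because the successor was saved beforehand.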

From 60da67814fe40fd2a0c1870b15dcf6fcb21c989a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 27 Sep 2018 11:15:19 +0900
Subject: [PATCH 2/5] Add conditional lock feature to dshash

Dshash currently waits for lock unconditionally. This commit adds new
interfaces for dshash_find and dshash_find_or_insert. The new
interfaces have an extra parameter "nowait" that tells them not to wait
for the lock.
---
 src/backend/lib/dshash.c | 64 +++++++++++++++++++++++++++++++++++++---
 src/include/lib/dshash.h |  6 ++++
 2 files changed, 66 insertions(+), 4 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 7675370058..7766fa7704 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -394,19 +394,51 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  */
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
+{
+    return dshash_find_extended(hash_table, key, exclusive, false, NULL);
+}
+
+/*
+ * A version of dshash_find that is allowed to return immediately on lock
+ * failure when nowait is true; *lock_acquired then reports the lock status.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool *lock_acquired)
 {
     dshash_hash hash;
     size_t        partition;
     dshash_table_item *item;
 
+    /*
+     * There is no lock result to report when !nowait.  With nowait, the
+     * caller may still pass NULL if it does not need the lock result.
+     */
+    Assert(nowait || !lock_acquired);
+
     hash = hash_key(hash_table, key);
     partition = PARTITION_FOR_HASH(hash);
 
     Assert(hash_table->control->magic == DSHASH_MAGIC);
     Assert(!hash_table->find_locked);
 
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partition),
+                                      exclusive ? LW_EXCLUSIVE : LW_SHARED))
+        {
+            if (lock_acquired)
+                *lock_acquired = false;
+            return NULL;
+        }
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition),
+                      exclusive ? LW_EXCLUSIVE : LW_SHARED);
+
+    if (lock_acquired)
+        *lock_acquired = true;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
@@ -441,6 +473,22 @@ void *
 dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
+{
+    return dshash_find_or_insert_extended(hash_table, key, found, false);
+}
+
+/*
+ * The version of dshash_find_or_insert, which is allowed to return immediately
+ * on lock failure.
+ *
+ * Notes above dshash_find_extended() regarding locking and error handling
+ * equally apply here.
+ */
+void *
+dshash_find_or_insert_extended(dshash_table *hash_table,
+                               const void *key,
+                               bool *found,
+                               bool nowait)
 {
     dshash_hash hash;
     size_t        partition_index;
@@ -455,8 +503,16 @@ dshash_find_or_insert(dshash_table *hash_table,
     Assert(!hash_table->find_locked);
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(
+                PARTITION_LOCK(hash_table, partition_index),
+                LW_EXCLUSIVE))
+            return NULL;
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
+                      LW_EXCLUSIVE);
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index ad88f32cdd..a7d19c6a85 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -89,8 +89,14 @@ extern void dshash_destroy(dshash_table *hash_table);
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+                                  bool exclusive, bool nowait,
+                                  bool *lock_acquired);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                                    const void *key, bool *found);
+extern void *dshash_find_or_insert_extended(dshash_table *hash_table,
+                                            const void *key, bool *found,
+                                            bool nowait);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.23.0
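
A note on the interface above: with nowait, dshash_find_extended() returns NULL both when the key is absent and when the partition lock could not be taken, and *lock_acquired is what tells the two cases apart. The following single-threaded sketch models only that contract. ToyLock, ToyTable, toy_try_acquire, toy_find_extended and toy_release are hypothetical names, and the "lock" is a plain flag rather than an LWLock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a partition lock. */
typedef struct ToyLock
{
    bool held;
} ToyLock;

/* Hypothetical one-partition table with a few fixed slots. */
typedef struct ToyTable
{
    ToyLock lock;
    int     keys[4];
    int     values[4];
    int     nkeys;
} ToyTable;

/* Conditional acquire: fail immediately instead of waiting. */
static bool
toy_try_acquire(ToyLock *lock)
{
    if (lock->held)
        return false;
    lock->held = true;
    return true;
}

void
toy_release(ToyTable *table)
{
    table->lock.held = false;
}

/*
 * Look up key.  On success the lock stays held until toy_release().
 * NULL means either "not found" (lock released again) or, with nowait,
 * "lock not available"; *lock_acquired distinguishes the two.
 */
int *
toy_find_extended(ToyTable *table, int key, bool nowait, bool *lock_acquired)
{
    /* Same rule as the patch: a lock result makes sense only with nowait. */
    assert(nowait || lock_acquired == NULL);

    if (nowait)
    {
        if (!toy_try_acquire(&table->lock))
        {
            if (lock_acquired)
                *lock_acquired = false;
            return NULL;
        }
    }
    else
    {
        /* A real lock would block here; this toy requires it to be free. */
        assert(!table->lock.held);
        table->lock.held = true;
    }

    if (lock_acquired)
        *lock_acquired = true;

    for (int i = 0; i < table->nkeys; i++)
    {
        if (table->keys[i] == key)
            return &table->values[i];
    }

    /* Not found: release the lock before returning. */
    toy_release(table);
    return NULL;
}
```

A caller that cannot afford to block checks the returned flag: false means "try again later", while true with a NULL result means the key really is not there.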

From d10c1117cec77a474dbb2cff001086d828b79624 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 7 Nov 2018 16:53:49 +0900
Subject: [PATCH 3/5] Make archiver process an auxiliary process

This is a preliminary patch for the shared-memory based stats collector.
The archiver process must be an auxiliary process because it uses shared
memory once stats data is moved into shared memory. Make the process
an auxiliary process so that it keeps working.
---
 src/backend/bootstrap/bootstrap.c   |  8 +++
 src/backend/postmaster/pgarch.c     | 98 +++++++----------------------
 src/backend/postmaster/pgstat.c     |  6 ++
 src/backend/postmaster/postmaster.c | 43 ++++++++-----
 src/include/miscadmin.h             |  2 +
 src/include/pgstat.h                |  1 +
 src/include/postmaster/pgarch.h     |  4 +-
 7 files changed, 70 insertions(+), 92 deletions(-)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index bfc629c753..2f38420313 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -326,6 +326,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case StartupProcess:
                 statmsg = pgstat_get_backend_desc(B_STARTUP);
                 break;
+            case ArchiverProcess:
+                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
+                break;
             case BgWriterProcess:
                 statmsg = pgstat_get_backend_desc(B_BG_WRITER);
                 break;
@@ -451,6 +454,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
             StartupProcessMain();
             proc_exit(1);        /* should never return */
 
+        case ArchiverProcess:
+            /* don't set signals, archiver has its own agenda */
+            PgArchiverMain();
+            proc_exit(1);        /* should never return */
+
         case BgWriterProcess:
             /* don't set signals, bgwriter has its own agenda */
             BackgroundWriterMain();
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 3ca30badb2..6441e69e9a 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -78,7 +78,6 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
@@ -95,7 +94,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgarch_exit(SIGNAL_ARGS);
 static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
@@ -111,75 +109,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -219,8 +148,8 @@ pgarch_forkexec(void)
  *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
  *    since we don't use 'em, it hardly matters...
  */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -252,8 +181,27 @@ PgArchiverMain(int argc, char *argv[])
 static void
 pgarch_exit(SIGNAL_ARGS)
 {
-    /* SIGQUIT means curl up and die ... */
-    exit(1);
+    PG_SETMASK(&BlockSig);
+
+    /*
+     * We DO NOT want to run proc_exit() callbacks -- we're here because
+     * shared memory may be corrupted, so we don't want to try to clean up our
+     * transaction.  Just nail the windows shut and get out of town.  Now that
+     * there's an atexit callback to prevent third-party code from breaking
+     * things by calling exit() directly, we have to reset the callbacks
+     * explicitly to make this work as intended.
+     */
+    on_exit_reset();
+
+    /*
+     * Note we do exit(2) not exit(0).  This is to force the postmaster into a
+     * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+     * process.  This is necessary precisely because we don't clean up our
+     * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+     * should ensure the postmaster sees this as a crash, too, but no harm in
+     * being doubly sure.)
+     */
+    exit(2);
 }
 
 /* SIGUSR1 signal handler for archiver process */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 51c486bebd..ca5c6376e5 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2927,6 +2927,9 @@ pgstat_bestart(void)
             case StartupProcess:
                 lbeentry.st_backendType = B_STARTUP;
                 break;
+            case ArchiverProcess:
+                lbeentry.st_backendType = B_ARCHIVER;
+                break;
             case BgWriterProcess:
                 lbeentry.st_backendType = B_BG_WRITER;
                 break;
@@ -4273,6 +4276,9 @@ pgstat_get_backend_desc(BackendType backendType)
 
     switch (backendType)
     {
+        case B_ARCHIVER:
+            backendDesc = "archiver";
+            break;
         case B_AUTOVAC_LAUNCHER:
             backendDesc = "autovacuum launcher";
             break;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 7a92dac525..ee41d7009e 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -146,7 +146,8 @@
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
 #define BACKEND_TYPE_BGWORKER    0x0008    /* bgworker process */
-#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
+#define BACKEND_TYPE_ARCHIVER    0x0010    /* archiver process */
+#define BACKEND_TYPE_ALL        0x001F    /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -539,6 +540,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1785,7 +1787,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -3042,7 +3044,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3177,20 +3179,16 @@ reaper(SIGNAL_ARGS)
         }
 
         /*
-         * Was it the archiver?  If so, just try to start a new one; no need
-         * to force reset of the rest of the system.  (If fail, we'll try
-         * again in future cycles of the main loop.).  Unless we were waiting
-         * for it to shut down; don't restart it in that case, and
-         * PostmasterStateMachine() will advance to the next shutdown step.
+         * Was it the archiver?  Normal exit can be ignored; we'll start a new
+         * one at the next iteration of the postmaster's main loop, if
+         * necessary. Any other exit condition is treated as a crash.
          */
         if (pid == PgArchPID)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3438,7 +3436,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3643,6 +3641,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3915,6 +3925,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5190,7 +5201,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5475,6 +5486,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 62d64aa0a1..dc7dd1c164 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -397,6 +397,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -409,6 +410,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 36b530bc27..0334213b98 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -718,6 +718,7 @@ typedef struct PgStat_GlobalStats
  */
 typedef enum BackendType
 {
+    B_ARCHIVER,
     B_AUTOVAC_LAUNCHER,
     B_AUTOVAC_WORKER,
     B_BACKEND,
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index b3200874ca..e3ffc63f14 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
-- 
2.23.0
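
A note on the next patch: the header comment of pgstat.c in 4/5 describes a flush policy with three intervals (at most one flush per 500ms, a retry after 100ms when some local numbers could not be flushed, and no more than 1000ms of postponing). The sketch below is one possible reading of that comment, not the code in the patch. FlushTimer, flush_due and flush_done are hypothetical names, and time is passed in as plain milliseconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MIN_INTERVAL_MS    500   /* minimum time between flushes */
#define RETRY_INTERVAL_MS  100   /* delay before retrying a partial flush */
#define MAX_INTERVAL_MS    1000  /* longest time updates may stay postponed */

/* Hypothetical per-backend flush timer. */
typedef struct FlushTimer
{
    long next_flush_ms;    /* earliest time of the next attempt */
    long pending_since_ms; /* first failed attempt of the current run, or -1 */
} FlushTimer;

/*
 * Is a flush attempt due at now_ms?  If so, *must_wait says whether the
 * attempt has to wait for locks because updates were postponed too long.
 */
bool
flush_due(const FlushTimer *timer, long now_ms, bool *must_wait)
{
    assert(timer != NULL && must_wait != NULL);

    *must_wait = false;
    if (now_ms < timer->next_flush_ms)
        return false;

    if (timer->pending_since_ms >= 0 &&
        now_ms - timer->pending_since_ms >= MAX_INTERVAL_MS)
        *must_wait = true;

    return true;
}

/* Record the outcome of an attempt; returns the delay until the next one. */
long
flush_done(FlushTimer *timer, long now_ms, bool all_flushed)
{
    assert(timer != NULL);

    if (all_flushed)
    {
        timer->pending_since_ms = -1;
        timer->next_flush_ms = now_ms + MIN_INTERVAL_MS;
        return MIN_INTERVAL_MS;
    }

    /* Some entries were left behind: remember when postponing started. */
    if (timer->pending_since_ms < 0)
        timer->pending_since_ms = now_ms;
    timer->next_flush_ms = now_ms + RETRY_INTERVAL_MS;
    return RETRY_INTERVAL_MS;
}
```

In this reading, a backend calls flush_due() at the end of a transaction, attempts a nowait flush unless must_wait is set, and feeds the result back through flush_done().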

From 5079583c447c3172aa0b4f8c0f0a46f6e1512812 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 21 Feb 2019 12:44:56 +0900
Subject: [PATCH 4/5] Shared-memory based stats collector

Previously, activity statistics were shared via files on disk. Every
backend sent its numbers to the stats collector process via a socket.
The collector wrote snapshots as a set of files on disk at a certain
interval, and every backend read them as necessary. That worked fine
for a comparatively small set of statistics, but the set is under
pressure to grow and the file size has reached the order of megabytes.
To deal with a larger statistics set, this patch lets backends share
the statistics directly via shared memory.
---
 doc/src/sgml/monitoring.sgml                 |    6 +-
 src/backend/postmaster/autovacuum.c          |   12 +-
 src/backend/postmaster/pgstat.c              | 4639 ++++++++----------
 src/backend/postmaster/postmaster.c          |   85 +-
 src/backend/storage/ipc/ipci.c               |    2 +
 src/backend/storage/lmgr/lwlock.c            |    1 +
 src/backend/tcop/postgres.c                  |   26 +-
 src/backend/utils/init/globals.c             |    1 +
 src/backend/utils/init/postinit.c            |   11 +
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |    4 +-
 src/include/miscadmin.h                      |    2 +
 src/include/pgstat.h                         |  441 +-
 src/include/storage/lwlock.h                 |    1 +
 src/include/utils/timeout.h                  |    1 +
 14 files changed, 2147 insertions(+), 3085 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 0bfd6151c4..a6b0bdec12 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    master server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   master process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   master process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 6d1f28c327..8dcb0fb7f7 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1956,15 +1956,15 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
+    /* Start a transaction so our commands have one to play into. */
+    StartTransactionCommand();
+
     /*
      * may be NULL if we couldn't find an entry (only happens if we are
      * forcing a vacuum for anti-wrap purposes).
      */
     dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
 
-    /* Start a transaction so our commands have one to play into. */
-    StartTransactionCommand();
-
     /*
      * Clean up any dead statistics collector entries for this DB. We always
      * want to do this exactly once per DB-processing cycle, even if we find
@@ -2747,12 +2747,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
     if (isshared)
     {
         if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
+            tabentry = pgstat_fetch_stat_tabentry_extended(shared, relid);
     }
     else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
+        tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
 
     return tabentry;
 }
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ca5c6376e5..1ffe073a1f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,23 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Statistics collector facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects per-table and per-function usage statistics of all backends in
+ *  shared memory. pg_count_*() and friends are the interface for storing
+ *  backend activity locally during a transaction. pgstat_flush_stat() is then
+ *  called at the end of a transaction to publish them to shared memory.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ *  To avoid congestion on the shared memory, we update shared stats no more
+ *  often than once per PGSTAT_STAT_MIN_INTERVAL (500ms). If not all of the
+ *  local numbers can be flushed immediately, we postpone the remaining
+ *  updates and try again after PGSTAT_STAT_RETRY_INTERVAL (100ms), but we
+ *  do not keep postponing them for longer than
+ *  PGSTAT_STAT_MAX_INTERVAL (1000ms).
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the stats collector creates the area and loads
+ *  the stored stats file, if any. The last process at shutdown writes the
+ *  shared stats to the file and then destroys the area before exiting.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -19,18 +27,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -40,67 +36,42 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/fork_process.h"
 #include "postmaster/interrupt.h"
 #include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
-#include "utils/timestamp.h"
 
 /* ----------
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
-
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
+#define PGSTAT_STAT_MIN_INTERVAL    500 /* Minimum interval of stats data
+                                         * updates; in milliseconds. */
 
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_STAT_RETRY_INTERVAL    100 /* Retry interval after
+                                         * PGSTAT_STAT_MIN_INTERVAL; in ms. */
 
+#define PGSTAT_STAT_MAX_INTERVAL   1000 /* Longest interval of stats data
+                                         * updates; in milliseconds. */
 
 /* ----------
  * The initial size hints for the hash tables used in the collector.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
 #define PGSTAT_TAB_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
@@ -116,6 +87,19 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
+/*
+ * Operation modes and return codes of pgstat_get_db_entry.
+ */
+#define    PGSTAT_SHARED        0
+#define    PGSTAT_EXCLUSIVE    1
+#define    PGSTAT_NOWAIT        2
+
+typedef enum PgStat_TableLookupResult
+{
+    NOT_FOUND,
+    FOUND,
+    LOCK_FAILED
+} PgStat_TableLookupResult;
 
 /* ----------
  * GUC parameters
@@ -131,31 +115,63 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used; to be removed together with the GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
-/*
- * BgWriter global statistics counters (unused in other processes).
- * Stored directly in a stats message structure so it can be sent
- * without needing to copy things around.  We assume this inits to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
-
-/* ----------
- * Local data
- * ----------
- */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
+#define        StatsLock (&StatsShmem->StatsMainLock)
 
-static time_t last_pgstat_start_time;
+/* Shared stats bootstrap information */
+typedef struct StatsShmemStruct
+{
+    LWLock                StatsMainLock;        /* lock to protect this struct */
+    dsa_handle             stats_dsa_handle;    /* DSA handle for stats data */
+    dshash_table_handle db_hash_handle;
+    dsa_pointer            global_stats;
+    dsa_pointer            archiver_stats;
+    int                    refcount;
+} StatsShmemStruct;
 
-static bool pgStatRunningInCollector = false;
+/*
+ * BgWriter global statistics counters. The name contains a remnant from the
+ * time when the stats collector was a dedicated process, which used sockets to
+ * send them.
+ */
+PgStat_MsgBgWriter BgWriterStats = {0};
+
+/* backend-lifetime storage */
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
+static dshash_table *pgStatDBHash = NULL;
+
+
+/* parameters for shared hashes */
+static const dshash_parameters dsh_dbparams = {
+    sizeof(Oid),
+    SHARED_DBENT_SIZE,
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+static const dshash_parameters dsh_tblparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatTabEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+static const dshash_parameters dsh_funcparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatFuncEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
 
 /*
  * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
+ * written to shared memory.
  *
  * NOTE: once allocated, TabStatusArray structures are never moved or deleted
  * for the life of the backend.  Also, we zero out the t_id fields of the
@@ -190,8 +206,8 @@ typedef struct TabStatHashEntry
 static HTAB *pgStatTabHash = NULL;
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * Backends store per-function info that's waiting to be flushed out to shared
+ * memory in this hash table (indexed by function OID).
  */
 static HTAB *pgStatFunctions = NULL;
 
@@ -201,6 +217,71 @@ static HTAB *pgStatFunctions = NULL;
  */
 static bool have_function_stats = false;
 
+/* common header of snapshot entry in reader snapshot hash */
+typedef struct PgStat_snapshot
+{
+    Oid        key;
+    bool    negative;
+    void   *body;                /* end of header part: to keep alignment */
+} PgStat_snapshot;
+
+/* context struct for snapshot_statentry */
+typedef struct pgstat_snapshot_param
+{
+    char           *hash_name;            /* name of the snapshot hash */
+    int                hash_entsize;        /* element size of hash entry */
+    dshash_table_handle    dsh_handle;        /* dsh handle to attach */
+    const dshash_parameters *dsh_params;/* dshash params */
+    HTAB          **hash;                /* points to variable to hold hash */
+    dshash_table  **dshash;                /* ditto for dshash */
+} pgstat_snapshot_param;
+
+/*
+ * Backends store various database-wide info that's waiting to be flushed out
+ * to shared memory in these variables.
+ *
+ * checksum_failures is the exception in that it is a cluster-wide value.
+ */
+typedef struct BackendDBStats
+{
+    int        n_conflict_tablespace;
+    int        n_conflict_lock;
+    int        n_conflict_snapshot;
+    int        n_conflict_bufferpin;
+    int        n_conflict_startup_deadlock;
+    int        n_deadlocks;
+    size_t    n_tmpfiles;
+    size_t    tmpfilesize;
+    HTAB    *checksum_failures;
+} BackendDBStats;
+
+/* Hash entry struct for checksum_failures above */
+typedef struct ChecksumFailureEnt
+{
+    Oid    dboid;
+    int    count;
+} ChecksumFailureEnt;
+
+static BackendDBStats BeDBStats = {0};
+
+/* macros to check BeDBStats at once */
+
+/* checks if there is any conflict waiting to be flushed out */
+#define HAVE_PENDING_CONFLICTS() \
+    (BeDBStats.n_conflict_tablespace > 0 ||        \
+     BeDBStats.n_conflict_lock > 0 || BeDBStats.n_conflict_snapshot > 0 || \
+     BeDBStats.n_conflict_bufferpin > 0 ||        \
+     BeDBStats.n_conflict_startup_deadlock > 0)
+
+/* checks if there are any database-wide stats waiting to be flushed out */
+#define HAVE_PENDING_DBSTATS()                \
+    (HAVE_PENDING_CONFLICTS() ||        \
+     BeDBStats.n_deadlocks > 0 ||                \
+     BeDBStats.n_tmpfiles > 0 ||                \
+     /* no need to check tmpfilesize */        \
+     BeDBStats.checksum_failures != NULL)
+
+
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
  * into PgStat_TableStatus counters until we know if it is going to commit
@@ -236,11 +317,11 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
+static MemoryContext pgStatSnapshotContext = NULL;
+static HTAB *pgStatLocalHash = NULL;
+static bool    clear_snapshot = false;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -249,19 +330,35 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Struct for context for pgstat_flush_* functions
+ *
+ * To avoid repeated attach/detach of the same dshash, dshashes once attached
+ * are stored in this structure and passed around across calls of multiple
+ * functions. "generation" here means the value returned by pin_hashes().
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
+typedef struct pgstat_flush_stat_context
+{
+    int    shgeneration;                /* "generation" of shdb_tabhash below */
+    PgStat_StatDBEntry *shdbentry;    /* dbentry for shared tables (oid = 0) */
+    dshash_table *shdb_tabhash;        /* tabentry dshash of shared tables */
+
+    int    mygeneration;                /* "generation" of mydb_tabhash below */
+    PgStat_StatDBEntry *mydbentry;    /* dbentry for my database */
+    dshash_table *mydb_tabhash;        /* tabentry dshash of my database */
+} pgstat_flush_stat_context;
 
 /*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
+ * Cluster wide statistics.
+ *
+ * Contains statistics that are collected neither per database nor per table.
+ * shared_* point to shared memory and snapshot_* are backend snapshots.
+ * Their validity is indicated by global_snapshot_is_valid.
  */
-static List *pending_write_requests = NIL;
+static bool global_snapshot_is_valid = false;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats snapshot_globalStats;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -275,33 +372,40 @@ static instr_time total_func_time;
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgstat_beshutdown_hook(int code, Datum arg);
-
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
+static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, int op,
+                                    PgStat_TableLookupResult *status);
+static PgStat_StatTabEntry *pgstat_get_tab_entry(dshash_table *table,
                                                  Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static void pgstat_write_pgStatDBHashfile(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_pgStatDBHashfile(PgStat_StatDBEntry *dbentry);
 static void pgstat_read_current_status(void);
-
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
-
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
+static bool pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
+                                 PgStat_TableStatus *entry);
+static bool pgstat_flush_funcstats(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_flush_dbstats(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_update_tabentry(dshash_table *tabhash,
+                                   PgStat_TableStatus *stat, bool nowait);
+static void pgstat_update_dbentry(PgStat_StatDBEntry *dbentry,
+                                  PgStat_TableStatus *stat);
 static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
 
+static void pgstat_remove_useless_entries(const dshash_table_handle dshhandle,
+                              const dshash_parameters *dshparams,
+                              HTAB *oidtab);
 static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
 
 static void pgstat_setup_memcxt(void);
+static void pgstat_flush_recovery_conflict(PgStat_StatDBEntry *dbentry);
+static void pgstat_flush_deadlock(PgStat_StatDBEntry *dbentry);
+static void pgstat_flush_checksum_failure(PgStat_StatDBEntry *dbentry);
+static void pgstat_flush_tempfile(PgStat_StatDBEntry *dbentry);
+static HTAB *create_tabstat_hash(void);
+static PgStat_SubXactStatus *get_tabstat_stack_level(int nest_level);
+static void add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level);
+static PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry_extended(PgStat_StatDBEntry *dbent, Oid funcid);
+static void pgstat_snapshot_global_stats(void);
 
 static const char *pgstat_get_wait_activity(WaitEventActivity w);
 static const char *pgstat_get_wait_client(WaitEventClient w);
@@ -309,557 +413,497 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+/* ------------------------------------------------------------
+ * Local support functions follow
+ * ------------------------------------------------------------
+ */
+static int pin_hashes(PgStat_StatDBEntry *dbentry);
+static void unpin_hashes(PgStat_StatDBEntry *dbentry, int generation);
+static dshash_table *attach_table_hash(PgStat_StatDBEntry *dbent, int gen);
+static dshash_table *attach_function_hash(PgStat_StatDBEntry *dbent, int gen);
+static void reset_dbentry_counters(PgStat_StatDBEntry *dbentry);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute space needed for stats collector's shared memory
+ */
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+/*
+ * StatsShmemInit - initialize during shared-memory creation
  */
 void
-pgstat_init(void)
+StatsShmemInit(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
+    bool    found;
 
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
 
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
+    if (!IsUnderPostmaster)
     {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
-    }
+        Assert(!found);
 
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+    }
 
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
+    LWLockInitialize(StatsLock, LWTRANCHE_STATS);
+}
 
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
+/* ----------
+ * pgstat_attach_shared_stats() -
+ *
+ *    Attach or create shared stats memory.
+ * ---------
+ */
+static void
+pgstat_attach_shared_stats(void)
+{
+    MemoryContext oldcontext;
 
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    /*
+     * Don't use dsm when not under postmaster, or when not tracking counts.
+     */
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
 
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    pgstat_setup_memcxt();
 
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    if (area)
+        return;
 
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
 
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
+    if (StatsShmem->refcount > 0)
+        StatsShmem->refcount++;
+    else
+    {
+        /* Need to create shared memory area and load saved stats if any. */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
 
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        /* Initialize shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatDBHash = dshash_create(area, &dsh_dbparams, 0);
 
-        test_byte++;            /* just make sure variable is changed */
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->global_stats =
+            dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+        StatsShmem->archiver_stats =
+            dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+        StatsShmem->db_hash_handle = dshash_get_hash_table_handle(pgStatDBHash);
 
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
 
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        /* Load saved data if any. */
+        pgstat_read_statsfiles();
 
-        /* If we get here, we have a working socket */
-        break;
+        StatsShmem->refcount = 1;
     }
 
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
+    LWLockRelease(StatsLock);
 
     /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
+     * If we're not the first process, attach existing shared stats area
+     * outside StatsLock.
      */
-    if (!pg_set_noblock(pgStatSock))
+    if (!area)
     {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
+        /* Shared area already exists. Just attach it. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatDBHash = dshash_attach(area, &dsh_dbparams,
+                                     StatsShmem->db_hash_handle, 0);
+
+        /* Setup local variables */
+        pgStatLocalHash = NULL;
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
     }
 
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
+    MemoryContextSwitchTo(oldcontext);
 
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
+    /* don't detach automatically */
+    dsa_pin_mapping(area);
+    global_snapshot_is_valid = false;
+}
 
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
+/* ----------
+ * pgstat_detach_shared_stats() -
+ *
+ *    Detach shared stats. Write out to file if we're the last process and told
+ *    to do so.
+ * ----------
+ */
+static void
+pgstat_detach_shared_stats(bool write_stats)
+{
+    /* return immediately if there is nothing to detach */
+    if (!area || !IsUnderPostmaster)
+        return;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+    /* write out the shared stats to file if needed */
+    if (--StatsShmem->refcount < 1)
+    {
+        if (write_stats)
+            pgstat_write_statsfiles();
+
+        /* We're the last process. Invalidate the dsa area handle. */
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
     }
 
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
+    LWLockRelease(StatsLock);
 
-    return;
+    /*
+     * Detach the area. It is automatically destroyed when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);
+
+    area = NULL;
+    pgStatDBHash = NULL;
+    shared_globalStats = NULL;
+    shared_archiverStats = NULL;
+    pgStatLocalHash = NULL;
+    global_snapshot_is_valid = false;
+}
 
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
+/*
+ * pgstat_reset_all() -
+ *
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
+ */
+void
+pgstat_reset_all(void)
+{
+    /* standalone server doesn't use shared stats */
+    if (!IsUnderPostmaster)
+        return;
 
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
+    /* we must have shared stats attached */
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
+    /* Startup must be the only user of shared stats */
+    Assert(StatsShmem->refcount == 1);
 
     /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
+     * We could remove the files directly and recreate the shared memory area,
+     * but we detach and then re-attach for simplicity.
      */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    pgstat_detach_shared_stats(false);    /* Don't write */
+    pgstat_attach_shared_stats();
 }
 
-/*
- * subroutine for pgstat_reset_all
+/* ------------------------------------------------------------
+ * Public functions used by backends follow
+ * ------------------------------------------------------------
  */
-static void
-pgstat_reset_remove_files(const char *directory)
+
+/* ----------
+ * pgstat_report_stat() -
+ *
+ *    Must be called by processes that perform DML: tcop/postgres.c, logical
+ *    receiver processes, SPI workers, etc. to apply the per-table and function
+ *    usage statistics collected so far to the shared statistics hashes.
+ *
+ *    Updates are applied no more frequently than once every
+ *    PGSTAT_STAT_MIN_INTERVAL milliseconds. They are also postponed on lock
+ *    failure if force is false and no update has been pending for longer than
+ *    PGSTAT_STAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
+ *    succeeding calls of this function.
+ *
+ *    Returns the time in milliseconds until updates are next applied, if no
+ *    updates have been held for more than PGSTAT_STAT_MIN_INTERVAL
+ *    milliseconds.
+ *
+ *    Note that this is called only outside a transaction, so it is fine to
+ *    use transaction stop time as an approximation of current time.
+ * ----------
+ */
+long
+pgstat_report_stat(bool force)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    static TimestampTz next_flush = 0;
+    static TimestampTz pending_since = 0;
+    TimestampTz now;
+    pgstat_flush_stat_context cxt = {0};
+    bool        pending_stats = false;
+    long        elapsed;
+    long        secs;
+    int            usecs;
+
+    /* Don't expend a clock check if nothing to do */
+    if (area == NULL ||
+        ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
+         pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
+         !HAVE_PENDING_DBSTATS() && !have_function_stats))
+        return 0;
+
+    now = GetCurrentTransactionStopTimestamp();
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
+    if (!force)
     {
-        int            nchars;
-        Oid            tmp_oid;
+        /*
+         * Don't flush stats until it is time to do so.  Return the time to
+         * wait in milliseconds.
+         */
+        if (now < next_flush)
+        {
+            /* Record the time of the oldest pending update if not set. */
+            if (pending_since == 0)
+                pending_since = now;
+
+            /* now < next_flush here */
+            return (next_flush - now) / 1000;
+        }
 
         /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
+         * Don't keep pending updates longer than PGSTAT_STAT_MAX_INTERVAL.
          */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
+        if (pending_since > 0)
         {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
+            TimestampDifference(pending_since, now, &secs, &usecs);
+            elapsed = secs * 1000 + usecs / 1000;
+
+            if (elapsed > PGSTAT_STAT_MAX_INTERVAL)
+                force = true;
         }
+    }
 
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
+    /* Flush out table stats */
+    if (pgStatTabList != NULL && !pgstat_flush_stat(&cxt, !force))
+        pending_stats = true;
+
+    /* Flush out function stats */
+    if (pgStatFunctions != NULL && !pgstat_flush_funcstats(&cxt, !force))
+        pending_stats = true;
 
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+    /* Flush out database-wide stats */
+    if (HAVE_PENDING_DBSTATS())
+    {
+        if (!pgstat_flush_dbstats(&cxt, !force))
+            pending_stats = true;
     }
-    FreeDir(dir);
-}
 
-/*
- * pgstat_reset_all() -
- *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
- */
-void
-pgstat_reset_all(void)
-{
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
-}
+    /* Unpin dbentry if pinned */
+    if (cxt.mydb_tabhash)
+    {
+        dshash_detach(cxt.mydb_tabhash);
+        unpin_hashes(cxt.mydbentry, cxt.mygeneration);
+        cxt.mydb_tabhash = NULL;
+        cxt.mydbentry = NULL;
+    }
 
-#ifdef EXEC_BACKEND
+    /* Publish the last flush time */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (shared_globalStats->stats_timestamp < now)
+        shared_globalStats->stats_timestamp = now;
+    LWLockRelease(StatsLock);
 
-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
-    char       *av[10];
-    int            ac = 0;
+    /* Record how long we have been keeping pending updates. */
+    if (pending_stats)
+    {
+        /* Preserve the first value */
+        if (pending_since == 0)
+            pending_since = now;
 
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
+        /*
+         * The retry interval may push the total wait slightly past the limit
+         * set by PGSTAT_STAT_MAX_INTERVAL. We don't bother about that since
+         * the excess is small.
+         */
+        return PGSTAT_STAT_RETRY_INTERVAL;
+    }
 
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
+    /* Set the next time to update stats */
+    next_flush = now + PGSTAT_STAT_MIN_INTERVAL * 1000;
+    pending_since = 0;
 
-    return postmaster_forkexec(ac, av);
+    return 0;
 }
-#endif                            /* EXEC_BACKEND */
-
 
 /*
- * pgstat_start() -
+ * snapshot_statentry() - Common routine for functions
+ *                             pgstat_fetch_stat_*entry()
  *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
+ *  Returns the pointer to a snapshot of a shared entry for the key or NULL if
+ *  not found. Returned snapshots are stable during the current transaction or
+ *  until pgstat_clear_snapshot() is called.
  *
- *    Returns PID of child process, or 0 if fail.
+ *  The snapshots are stored in a hash, a pointer to which is stored in the
+ *  *HTAB variable pointed to by cxt->hash. If not created yet, it is created
+ *  using hash_name, hash_entsize in cxt.
  *
- *    Note: if fail, we will be called again from the postmaster main loop.
+ *  cxt->dshash points to dshash_table for dbstat entries. If not yet
+ *  attached, it is attached using cxt->dsh_handle.
  */
-int
-pgstat_start(void)
+static void *
+snapshot_statentry(pgstat_snapshot_param *cxt, Oid key)
 {
-    time_t        curtime;
-    pid_t        pgStatPid;
+    PgStat_snapshot *lentry = NULL;
+    size_t keysize = cxt->dsh_params->key_size;
+    size_t dsh_entrysize = cxt->dsh_params->entry_size;
+    bool found;
 
     /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
+     * Don't refresh the stats snapshot too often. Keep it for at least
+     * PGSTAT_STAT_MIN_INTERVAL ms; an earlier cue is ignored, not postponed.
      */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
+    if (clear_snapshot)
+    {
+        clear_snapshot = false;
 
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
+        if (pgStatSnapshotContext &&
+            snapshot_globalStats.stats_timestamp <
+            GetCurrentStatementStartTimestamp() -
+            PGSTAT_STAT_MIN_INTERVAL * 1000)
+        {
+            MemoryContextReset(pgStatSnapshotContext);
+
+            /* Reset variables */
+            global_snapshot_is_valid = false;
+            pgStatSnapshotContext = NULL;
+            pgStatLocalHash = NULL;
+
+            pgstat_setup_memcxt();
+        }
+    }
 
     /*
-     * Okay, fork off the collector.
+     * Create new hash, with rather arbitrary initial number of entries since
+     * we don't know how this hash will grow.
      */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
+    if (!*cxt->hash)
     {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
+        HASHCTL ctl;
 
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
+        /*
+         * Create the hash in the stats context
+         *
+         * The entry is preceded by a common header part represented by
+         * PgStat_snapshot.
+         */
 
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
+        ctl.keysize        = keysize;
+        ctl.entrysize    = offsetof(PgStat_snapshot, body) + cxt->hash_entsize;
+        ctl.hcxt        = pgStatSnapshotContext;
+        *cxt->hash = hash_create(cxt->hash_name, 32, &ctl,
+                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    }
 
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
+    lentry = hash_search(*cxt->hash, &key, HASH_ENTER, &found);
 
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
+    /*
+     * Look up the shared hash if not found in the local hash. We return
+     * up-to-date entries outside a transaction, so do the same even if the
+     * snapshot is found.
+     */
+    if (!found || !IsTransactionState())
+    {
+        void *sentry;
 
-        default:
-            return (int) pgStatPid;
-    }
+        /* attach shared hash if not given, leave it alone for later use */
+        if (!*cxt->dshash)
+        {
+            MemoryContext oldcxt;
 
-    /* shouldn't get here */
-    return 0;
-}
+            Assert (cxt->dsh_handle != DSM_HANDLE_INVALID);
+            oldcxt = MemoryContextSwitchTo(pgStatSnapshotContext);
+            *cxt->dshash =
+                dshash_attach(area, cxt->dsh_params, cxt->dsh_handle, NULL);
+            MemoryContextSwitchTo(oldcxt);
+        }
 
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
-}
+        sentry = dshash_find(*cxt->dshash, &key, false);
 
-/* ------------------------------------------------------------
- * Public functions used by backends follow
- *------------------------------------------------------------
- */
+        if (sentry)
+        {
+            /*
+             * In transaction state, it is obvious that we should create local
+             * cache entries for consistency. If we are not, we return an
+             * up-to-date entry. Having said that, we need a local copy since
+             * dshash entry must be released immediately. We share the same
+             * local hash entry for the purpose.
+             */
+            memcpy(&lentry->body, sentry, dsh_entrysize);
+            dshash_release_lock(*cxt->dshash, sentry);
 
+            /* then zero out the local additional space if any */
+            if (dsh_entrysize < cxt->hash_entsize)
+                MemSet((char *)&lentry->body + dsh_entrysize, 0,
+                       cxt->hash_entsize - dsh_entrysize);
+        }
 
-/* ----------
- * pgstat_report_stat() -
+        lentry->negative = !sentry;
+    }
+
+    if (lentry->negative)
+        return NULL;
+
+    return &lentry->body;
+}
+
+/*
+ * pgstat_flush_stat: Flushes table stats out to shared statistics.
  *
- *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
- *    transaction stop time as an approximation of current time.
- * ----------
+ *  If nowait is true, returns false if required lock was not acquired
+ *  immediately. In that case, unapplied table stats updates are left alone in
+ *  TabStatusArray to wait for the next chance. cxt holds some dshash related
+ *  values that we want to carry around while updating shared stats.
+ *
+ *  Returns true if all stats info are flushed. Caller must detach dshashes
+ *  stored in cxt after use.
  */
-void
-pgstat_report_stat(bool force)
+static bool
+pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait)
 {
-    /* we assume this inits to all zeroes: */
     static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
-    TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
     TabStatusArray *tsa;
-    int            i;
-
-    /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    HTAB           *new_tsa_hash = NULL;
+    TabStatusArray *dest_tsa = pgStatTabList;
+    int                dest_elem = 0;
+    int                i;
 
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
-    now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
+    /* nothing to do, just return */
+    if (pgStatTabHash == NULL)
+        return true;
 
     /*
      * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
+     * entries it points to.
      */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
+    hash_destroy(pgStatTabHash);
     pgStatTabHash = NULL;
 
     /*
      * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
+     * have counts, and try flushing them out to shared stats. We may fail on
+     * some entries in the array; those are left packed at the beginning of
+     * the array.
      */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
     for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
     {
         for (i = 0; i < tsa->tsa_used; i++)
         {
             PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
 
             /* Shouldn't have any pending transaction-dependent counts */
             Assert(entry->trans == NULL);
@@ -872,178 +916,352 @@ pgstat_report_stat(bool force)
                        sizeof(PgStat_TableCounts)) == 0)
                 continue;
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* try to apply the tab stats */
+            if (!pgstat_flush_tabstat(cxt, nowait, entry))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                /*
+                 * Failed. Move it to the beginning in TabStatusArray and
+                 * leave it.
+                 */
+                TabStatHashEntry *hash_entry;
+                bool found;
+
+                if (new_tsa_hash == NULL)
+                    new_tsa_hash = create_tabstat_hash();
+
+                /* Create hash entry for this entry */
+                hash_entry = hash_search(new_tsa_hash, &entry->t_id,
+                                         HASH_ENTER, &found);
+                Assert(!found);
+
+                /*
+                 * Move insertion pointer to the next segment if the segment
+                 * is filled up.
+                 */
+                if (dest_elem >= TABSTAT_QUANTUM)
+                {
+                    Assert(dest_tsa->tsa_next != NULL);
+                    dest_tsa = dest_tsa->tsa_next;
+                    dest_elem = 0;
+                }
+
+                /*
+                 * Pack the entry at the beginning of the array. Do nothing if
+                 * it need not be moved.
+                 */
+                if (tsa != dest_tsa || i != dest_elem)
+                {
+                    PgStat_TableStatus *new_entry;
+                    new_entry = &dest_tsa->tsa_entries[dest_elem];
+                    *new_entry = *entry;
+
+                    /* use new_entry as entry hereafter */
+                    entry = new_entry;
+                }
+
+                hash_entry->tsa_entry = entry;
+                dest_elem++;
             }
         }
-        /* zero out PgStat_TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
     }
 
+    /* zero out unused area of TableStatus */
+    dest_tsa->tsa_used = dest_elem;
+    MemSet(&dest_tsa->tsa_entries[dest_elem], 0,
+           (TABSTAT_QUANTUM - dest_elem) * sizeof(PgStat_TableStatus));
+    while (dest_tsa->tsa_next)
+    {
+        dest_tsa = dest_tsa->tsa_next;
+        MemSet(dest_tsa->tsa_entries, 0,
+               dest_tsa->tsa_used * sizeof(PgStat_TableStatus));
+        dest_tsa->tsa_used = 0;
+    }
+
+    /* and set the new TabStatusArray hash if any */
+    pgStatTabHash = new_tsa_hash;
+
     /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
+     * We no longer need shared database and table entries, but those for my
+     * database may be used later.
      */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
-
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+    if (cxt->shdb_tabhash)
+    {
+        dshash_detach(cxt->shdb_tabhash);
+        unpin_hashes(cxt->shdbentry, cxt->shgeneration);
+        cxt->shdb_tabhash = NULL;
+        cxt->shdbentry = NULL;
+    }
+
+    return pgStatTabHash == NULL;
 }
 
+/* -------
+ * Subroutines for pgstat_flush_stat.
+ * -------
+ */
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * pgstat_flush_tabstat: Flushes a table stats entry.
+ *
+ *  If nowait is true, returns false on lock failure.  Dshashes for table and
+ *  function stats are kept attached in cxt. The caller must detach them after
+ *  use.
+ *
+ *  Returns true if the entry is flushed out.
  */
-static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+bool
+pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
+                     PgStat_TableStatus *entry)
 {
-    int            n;
-    int            len;
+    Oid        dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
+    int        table_mode = PGSTAT_EXCLUSIVE;
+    bool    updated = false;
+    dshash_table *tabhash;
+    PgStat_StatDBEntry *dbent;
+    int        generation;
+
+    if (nowait)
+        table_mode |= PGSTAT_NOWAIT;
+
+    /* Attach required table hash if not yet. */
+    if ((entry->t_shared ? cxt->shdb_tabhash : cxt->mydb_tabhash) == NULL)
+    {
+        /*
+         *  Return if we don't have corresponding dbentry. It would've been
+         *  removed.
+         */
+        dbent = pgstat_get_db_entry(dboid, table_mode, NULL);
+        if (!dbent)
+            return false;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+        /*
+         * We don't hold lock on the dbentry since it cannot be dropped while
+         * we are working on it.
+         */
+        generation = pin_hashes(dbent);
+        tabhash = attach_table_hash(dbent, generation);
 
-    /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
-     */
-    if (OidIsValid(tsmsg->m_databaseid))
-    {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
-        pgStatXactCommit = 0;
-        pgStatXactRollback = 0;
-        pgStatBlockReadTime = 0;
-        pgStatBlockWriteTime = 0;
+        if (entry->t_shared)
+        {
+            cxt->shgeneration = generation;
+            cxt->shdbentry = dbent;
+            cxt->shdb_tabhash = tabhash;
+        }
+        else
+        {
+            cxt->mygeneration = generation;
+            cxt->mydbentry = dbent;
+            cxt->mydb_tabhash = tabhash;
+
+            /*
+             * We come here once per database. Take the chance to update
+             * database-wide stats
+             */
+            LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
+            dbent->n_xact_commit += pgStatXactCommit;
+            dbent->n_xact_rollback += pgStatXactRollback;
+            dbent->n_block_read_time += pgStatBlockReadTime;
+            dbent->n_block_write_time += pgStatBlockWriteTime;
+            LWLockRelease(&dbent->lock);
+            pgStatXactCommit = 0;
+            pgStatXactRollback = 0;
+            pgStatBlockReadTime = 0;
+            pgStatBlockWriteTime = 0;
+        }
+    }
+    else if (entry->t_shared)
+    {
+        dbent = cxt->shdbentry;
+        tabhash = cxt->shdb_tabhash;
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        dbent = cxt->mydbentry;
+        tabhash = cxt->mydb_tabhash;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    /*
+     * Local table stats should be applied to both dbentry and tabentry at
+     * once. Update dbentry only if we could update tabentry.
+     */
+    if (pgstat_update_tabentry(tabhash, entry, nowait))
+    {
+        pgstat_update_dbentry(dbent, entry);
+        updated = true;
+    }
+
+    return updated;
 }
 
 /*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
+ * pgstat_flush_funcstats: Flushes function stats.
+ *
+ *  If nowait is true, returns false on lock failure. Unapplied local hash
+ *  entries are left alone.
+ *
+ *  Returns true if all entries are flushed out.
  */
-static void
-pgstat_send_funcstats(void)
+static bool
+pgstat_flush_funcstats(pgstat_flush_stat_context *cxt, bool nowait)
 {
     /* we assume this inits to all zeroes: */
     static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
+    dshash_table   *funchash;
     HASH_SEQ_STATUS fstat;
+    PgStat_BackendFunctionEntry *bestat;
 
+    /* nothing to do, just return */
     if (pgStatFunctions == NULL)
-        return;
+        return true;
+
+    /* get dbentry into cxt if not yet */
+    if (cxt->mydbentry == NULL)
+    {
+        int op = PGSTAT_EXCLUSIVE;
+
+        if (nowait)
+            op |= PGSTAT_NOWAIT;
+
+        cxt->mydbentry = pgstat_get_db_entry(MyDatabaseId, op, NULL);
+
+        if (cxt->mydbentry == NULL)
+            return false;
+
+        cxt->mygeneration = pin_hashes(cxt->mydbentry);
+    }
+
+    funchash = attach_function_hash(cxt->mydbentry, cxt->mygeneration);
+    if (funchash == NULL)
+        return false;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
+    have_function_stats = false;
 
+    /*
+     * Scan through pgStatFunctions to find functions that actually have
+     * counts, and try flushing them out to shared stats.
+     */
     hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    while ((bestat = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
     {
-        PgStat_FunctionEntry *m_ent;
+        bool found;
+        PgStat_StatFuncEntry *funcent = NULL;
 
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
+        /* Skip it if no counts accumulated for it so far */
+        if (memcmp(&bestat->f_counts, &all_zeroes,
                    sizeof(PgStat_FunctionCounts)) == 0)
             continue;
 
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
+        funcent = (PgStat_StatFuncEntry *)
+            dshash_find_or_insert_extended(funchash, (void *) &(bestat->f_id),
+                                           &found, nowait);
 
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
+        /*
+         * We couldn't acquire lock on the required entry. Leave the local
+         * entry alone.
+         */
+        if (!funcent)
         {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
+            have_function_stats = true;
+            continue;
         }
 
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
+        /* Initialize if it's new, or add to it. */
+        if (!found)
+        {
+            funcent->functionid = bestat->f_id;
+            funcent->f_numcalls = bestat->f_counts.f_numcalls;
+            funcent->f_total_time =
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_total_time);
+            funcent->f_self_time =
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_self_time);
+        }
+        else
+        {
+            funcent->f_numcalls += bestat->f_counts.f_numcalls;
+            funcent->f_total_time +=
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_total_time);
+            funcent->f_self_time +=
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_self_time);
+        }
+        dshash_release_lock(funchash, funcent);
 
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
+        /* reset used counts */
+        MemSet(&bestat->f_counts, 0, sizeof(PgStat_FunctionCounts));
+    }
 
-    have_function_stats = false;
+    return !have_function_stats;
 }
 
+/*
+ * pgstat_flush_dbstats: Flushes out miscellaneous database stats.
+ *
+ *  If nowait is true, returns with false on lock failure on dbentry.
+ *
+ *  Returns true if all stats are flushed out.
+ */
+static bool
+pgstat_flush_dbstats(pgstat_flush_stat_context *cxt, bool nowait)
+{
+    /* get dbentry if not yet */
+    if (cxt->mydbentry == NULL)
+    {
+        int op = PGSTAT_EXCLUSIVE;
+        if (nowait)
+            op |= PGSTAT_NOWAIT;
+
+        cxt->mydbentry = pgstat_get_db_entry(MyDatabaseId, op, NULL);
+
+        /* return if lock failed. */
+        if (cxt->mydbentry == NULL)
+            return false;
+
+        /* we use this generation of table/function stats in this turn */
+        cxt->mygeneration = pin_hashes(cxt->mydbentry);
+    }
+
+    LWLockAcquire(&cxt->mydbentry->lock, LW_EXCLUSIVE);
+    if (HAVE_PENDING_CONFLICTS())
+        pgstat_flush_recovery_conflict(cxt->mydbentry);
+    if (BeDBStats.n_deadlocks != 0)
+        pgstat_flush_deadlock(cxt->mydbentry);
+    if (BeDBStats.n_tmpfiles != 0)
+        pgstat_flush_tempfile(cxt->mydbentry);
+    if (BeDBStats.checksum_failures != NULL)
+        pgstat_flush_checksum_failure(cxt->mydbentry);
+    LWLockRelease(&cxt->mydbentry->lock);
+
+    return true;
+}
 
 /* ----------
  * pgstat_vacuum_stat() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *    Remove objects we can get rid of.
  * ----------
  */
 void
 pgstat_vacuum_stat(void)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
+    HTAB       *oidtab;
+    dshash_seq_status dshstat;
     PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* we don't collect stats under standalone mode */
+    if (!IsUnderPostmaster)
         return;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
     /*
      * Read pg_database and make a list of OIDs of all existing databases
      */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    oidtab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
 
     /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
+     * Search the database hash table for dead databases and drop them
+     * from the hash.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+
+    dshash_seq_init(&dshstat, pgStatDBHash, false, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            dbid = dbentry->databaseid;
 
@@ -1051,137 +1269,43 @@ pgstat_vacuum_stat(void)
 
         /* the DB entry for shared tables (with InvalidOid) is never dropped */
         if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
+            hash_search(oidtab, (void *) &dbid, HASH_FIND, NULL) == NULL)
             pgstat_drop_database(dbid);
     }
 
     /* Clean up */
-    hash_destroy(htab);
+    hash_destroy(oidtab);
 
     /*
      * Lookup our own database entry; if not found, nothing more to do.
      */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_EXCLUSIVE, NULL);
+    if (!dbentry)
         return;
 
     /*
      * Similarly to above, make a list of all known relations in this DB.
      */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
+    oidtab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
 
     /*
      * Check for all tables listed in stats hashtable if they still exist.
+     * Stats cache is useless here so directly search the shared hash.
      */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            tabid = tabentry->tableid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
-            continue;
-
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
-    }
-
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
+    pgstat_remove_useless_entries(dbentry->tables, &dsh_tblparams, oidtab);
 
     /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
+     * Repeat the above but we needn't bother in the common case where no
+     * function stats are being collected.
      */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
+    if (dbentry->functions != DSM_HANDLE_INVALID)
     {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
-        {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
-                continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
+        oidtab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
 
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
-        }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
+        pgstat_remove_useless_entries(dbentry->functions, &dsh_funcparams,
+                                      oidtab);
     }
+    dshash_release_lock(pgStatDBHash, dbentry);
 }
 
 
@@ -1235,66 +1359,99 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     return htab;
 }
 
-
-/* ----------
- * pgstat_drop_database() -
+/*
+ * pgstat_remove_useless_entries - Remove useless entries from per
+ * table/function dshashes.
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
- * ----------
+ *  Scan the dshash specified by dshhandle removing entries that are not in
+ *  oidtab. oidtab is destroyed before returning.
  */
 void
-pgstat_drop_database(Oid databaseid)
+pgstat_remove_useless_entries(const dshash_table_handle dshhandle,
+                              const dshash_parameters *dshparams,
+                              HTAB *oidtab)
 {
-    PgStat_MsgDropdb msg;
+    dshash_table *dshtable;
+    dshash_seq_status dshstat;
+    void         *ent;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    dshtable = dshash_attach(area, dshparams, dshhandle, 0);
+    dshash_seq_init(&dshstat, dshtable, false, true);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
+    {
+        CHECK_FOR_INTERRUPTS();
 
+        /* The first member of each entry must be an Oid */
+        if (hash_search(oidtab, ent, HASH_FIND, NULL) != NULL)
+            continue;
+
+        /* Not there, so purge this entry */
+        dshash_delete_entry(dshtable, ent);
+    }
+    dshash_detach(dshtable);
+    hash_destroy(oidtab);
+}
 
 /* ----------
- * pgstat_drop_relation() -
+ * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
+ *    Remove entry for the database that we just dropped.
  *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
+ *    If some stats are flushed after this, this entry will be re-created but we
+ *    will still clean the dead DB eventually via future invocations of
+ *    pgstat_vacuum_stat().
  * ----------
  */
-#ifdef NOT_USED
 void
-pgstat_drop_relation(Oid relid)
+pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgTabpurge msg;
-    int            len;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    Assert(OidIsValid(databaseid));
+
+    if (!IsUnderPostmaster || !pgStatDBHash)
         return;
 
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
+    /*
+     * Lookup the database in the hashtable with exclusive lock.
+     */
+    dbentry = pgstat_get_db_entry(databaseid, PGSTAT_EXCLUSIVE, NULL);
+
+    /*
+     * If found, remove it.
+     */
+    if (dbentry)
+    {
+        /* LWLock is needed to rewrite */
+        LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
 
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
+        /* No one is using tables/functions in this dbentry */
+        Assert(dbentry->refcnt == 0);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
-}
-#endif                            /* NOT_USED */
+        /* Remove table/function stats dshash first. */
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+        }
+        LWLockRelease(&dbentry->lock);
 
+        dshash_delete_entry(pgStatDBHash, (void *)dbentry);
+    }
+}
 
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1303,20 +1460,32 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    PgStat_StatDBEntry       *dbentry;
+    PgStat_TableLookupResult status;
+
+    if (!pgStatDBHash)
+        return;
+
+    /*
+     * Lookup the database in the hashtable.  Nothing to do if not there.
+     */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_EXCLUSIVE, &status);
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!dbentry)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    /* This database is active, safe to release the lock immediately. */
+    dshash_release_lock(pgStatDBHash, dbentry);
+
+    /* Reset database-level stats. */
+    reset_dbentry_counters(dbentry);
+
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1325,29 +1494,37 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
+    /* Reset the archiver statistics for the cluster. */
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    /* Reset the bgwriter statistics for the cluster. */
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1356,17 +1533,42 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
+    int generation;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_EXCLUSIVE, NULL);
+
+    if (!dbentry)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* This database is active, safe to release the lock immediately. */
+    generation = pin_hashes(dbentry);
+
+    /* Set the reset timestamp for the whole database */
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&dbentry->lock);
+
+    /* Remove object if it exists, ignore if not */
+    if (type == RESET_TABLE)
+    {
+        dshash_table *t = attach_table_hash(dbentry, generation);
+        dshash_delete_key(t, (void *) &objoid);
+        dshash_detach(t);
+    }
 
-    pgstat_send(&msg, sizeof(msg));
+    if (type == RESET_FUNCTION)
+    {
+        dshash_table *t = attach_function_hash(dbentry, generation);
+        if (t)
+        {
+            dshash_delete_key(t, (void *) &objoid);
+            dshash_detach(t);
+        }
+    }
+    unpin_hashes(dbentry, generation);
 }
 
 /* ----------
@@ -1380,48 +1582,81 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    /*
+     * Store the last autovacuum time in the database's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_EXCLUSIVE, NULL);
+    dshash_release_lock(pgStatDBHash, dbentry);
+
+    ts = GetCurrentTimestamp();
 
-    pgstat_send(&msg, sizeof(msg));
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->last_autovac_time = ts;
+    LWLockRelease(&dbentry->lock);
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report about the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
+    int                    generation;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hash table entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_EXCLUSIVE, NULL);
+    generation = pin_hashes(dbentry);
+    table = attach_table_hash(dbentry, generation);
+
+    tabentry = pgstat_get_tab_entry(table, tableoid, true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->vacuum_count++;
+    }
+    dshash_release_lock(table, tabentry);
+
+    dshash_detach(table);
+    unpin_hashes(dbentry, generation);
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report about the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1432,9 +1667,14 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table       *table;
+    int                    generation;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1463,78 +1703,153 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_EXCLUSIVE, NULL);
+    generation = pin_hashes(dbentry);
+    table = attach_table_hash(dbentry, generation);
+    tabentry = pgstat_get_tab_entry(table, RelationGetRelid(rel), true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    dshash_release_lock(table, tabentry);
+
+    dshash_detach(table);
+    unpin_hashes(dbentry, generation);
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupResult status;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            BeDBStats.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            BeDBStats.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            BeDBStats.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            BeDBStats.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            BeDBStats.n_conflict_startup_deadlock++;
+            break;
+    }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
+
+    if (status == LOCK_FAILED)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    /* We had a chance to flush immediately */
+    pgstat_flush_recovery_conflict(dbentry);
+
+    dshash_release_lock(pgStatDBHash, dbentry);
+}
+
+/*
+ * flush recovery conflict stats
+ */
+static void
+pgstat_flush_recovery_conflict(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_conflict_tablespace    += BeDBStats.n_conflict_tablespace;
+    dbentry->n_conflict_lock         += BeDBStats.n_conflict_lock;
+    dbentry->n_conflict_snapshot    += BeDBStats.n_conflict_snapshot;
+    dbentry->n_conflict_bufferpin    += BeDBStats.n_conflict_bufferpin;
+    dbentry->n_conflict_startup_deadlock += BeDBStats.n_conflict_startup_deadlock;
+
+    BeDBStats.n_conflict_tablespace = 0;
+    BeDBStats.n_conflict_lock = 0;
+    BeDBStats.n_conflict_snapshot = 0;
+    BeDBStats.n_conflict_bufferpin = 0;
+    BeDBStats.n_conflict_startup_deadlock = 0;
 }
 
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupResult status;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
+    BeDBStats.n_deadlocks++;
 
-/* --------
- * pgstat_report_checksum_failures_in_db() -
- *
- *    Tell the collector about one or more checksum failures.
- * --------
- */
-void
-pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
-{
-    PgStat_MsgChecksumFailure msg;
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    if (status == LOCK_FAILED)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
+    dshash_release_lock(pgStatDBHash, dbentry);
+}
 
-    pgstat_send(&msg, sizeof(msg));
+/*
+ * flush deadlock stats
+ */
+static void
+pgstat_flush_deadlock(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_deadlocks += BeDBStats.n_deadlocks;
+    BeDBStats.n_deadlocks = 0;
 }
 
 /* --------
@@ -1552,60 +1867,153 @@ pgstat_report_checksum_failure(void)
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupResult status;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    if (filesize > 0) /* Is there a case where filesize is really 0? */
+    {
+        BeDBStats.tmpfilesize += filesize; /* needs an overflow check */
+        BeDBStats.n_tmpfiles++;
+    }
+
+    if (BeDBStats.n_tmpfiles == 0)
+        return;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    if (status == LOCK_FAILED)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
+    /* We had a chance to flush immediately */
+    pgstat_flush_tempfile(dbentry);
+
+    dshash_release_lock(pgStatDBHash, dbentry);
 }
 
+/*
+ * flush temporary file stats
+ */
+static void
+pgstat_flush_tempfile(PgStat_StatDBEntry *dbentry)
+{
+
+    dbentry->n_temp_bytes += BeDBStats.tmpfilesize;
+    dbentry->n_temp_files += BeDBStats.n_tmpfiles;
+    BeDBStats.tmpfilesize = 0;
+    BeDBStats.n_tmpfiles = 0;
+}
 
-/* ----------
- * pgstat_ping() -
+/* --------
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
  *
- *    Send some junk data to the collector to increase traffic.
- * ----------
+ *    Report one or more checksum failures.
+ * --------
  */
 void
-pgstat_ping(void)
+pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
 {
-    PgStat_MsgDummy msg;
+    PgStat_StatDBEntry       *dbentry;
+    PgStat_TableLookupResult status;
+    ChecksumFailureEnt       *failent = NULL;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    if (BeDBStats.checksum_failures != NULL)
+    {
+        failent = hash_search(BeDBStats.checksum_failures, &dboid,
+                              HASH_FIND, NULL);
+        if (failent)
+            failurecount += failent->count;
+    }
+
+    if (failurecount == 0)
+        return;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
+
+    if (status == LOCK_FAILED)
+    {
+        if (!failent)
+        {
+            if (!BeDBStats.checksum_failures)
+            {
+                HASHCTL    ctl;
+
+                ctl.keysize = sizeof(Oid);
+                ctl.entrysize = sizeof(ChecksumFailureEnt);
+                BeDBStats.checksum_failures =
+                    hash_create("pgstat checksum failure count hash",
+                                32, &ctl, HASH_ELEM | HASH_BLOBS);
+            }
+
+            failent = hash_search(BeDBStats.checksum_failures,
+                                  &dboid, HASH_ENTER, NULL);
+        }
 
-    if (pgStatSock == PGINVALID_SOCKET)
+        failent->count = failurecount;
         return;
+    }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    /* We have a chance to flush immediately */
+    dbentry->n_checksum_failures += failurecount;
+    BeDBStats.checksum_failures = NULL;
+
+    dshash_release_lock(pgStatDBHash, dbentry);
 }
 
-/* ----------
- * pgstat_send_inquiry() -
- *
- *    Notify collector that we need fresh data.
- * ----------
+/*
+ * flush checksum failure counts for all databases
  */
 static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
+pgstat_flush_checksum_failure(PgStat_StatDBEntry *dbentry)
 {
-    PgStat_MsgInquiry msg;
+    HASH_SEQ_STATUS     stat;
+    ChecksumFailureEnt *ent;
+    bool                release_dbent;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
+    if (BeDBStats.checksum_failures == NULL)
+        return;
+
+    hash_seq_init(&stat, BeDBStats.checksum_failures);
+    while ((ent = (ChecksumFailureEnt *) hash_seq_search(&stat)) != NULL)
+    {
+        release_dbent = false;
+
+        if (dbentry->databaseid != ent->dboid)
+        {
+            dbentry = pgstat_get_db_entry(ent->dboid,
+                                          PGSTAT_EXCLUSIVE, NULL);
+            if (!dbentry)
+                continue;
+
+            release_dbent = true;
+        }
 
+        dbentry->n_checksum_failures += ent->count;
+
+        if (release_dbent)
+            dshash_release_lock(pgStatDBHash, dbentry);
+    }
+
+    hash_destroy(BeDBStats.checksum_failures);
+    BeDBStats.checksum_failures = NULL;
+}
 
 /*
  * Initialize function call usage data.
@@ -1757,7 +2165,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1776,6 +2185,24 @@ pgstat_initstats(Relation rel)
     rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
 }
 
+/*
+ * create_tabstat_hash - create local hash as transactional storage
+ */
+static HTAB *
+create_tabstat_hash(void)
+{
+    HASHCTL        ctl;
+
+    MemSet(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(Oid);
+    ctl.entrysize = sizeof(TabStatHashEntry);
+
+    return hash_create("pgstat TabStatusArray lookup hash table",
+                       TABSTAT_QUANTUM,
+                       &ctl,
+                       HASH_ELEM | HASH_BLOBS);
+}
+
 /*
  * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
  */
@@ -1791,18 +2218,7 @@ get_tabstat_entry(Oid rel_id, bool isshared)
      * Create hash table if we don't have it already.
      */
     if (pgStatTabHash == NULL)
-    {
-        HASHCTL        ctl;
-
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
-    }
+        pgStatTabHash = create_tabstat_hash();
 
     /*
      * Find an entry or create a new one.
@@ -2415,30 +2831,33 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
- * ----------
+ *    Find database stats entry on backends. The returned entries are cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_param param =
+    {
+        .hash_name = "local database stats hash",
+        .hash_entsize = sizeof(PgStat_StatDBEntry),
+        .dsh_handle = DSM_HANDLE_INVALID,   /* already attached */
+        .dsh_params = &dsh_dbparams,
+        .hash = &pgStatLocalHash,
+        .dshash = &pgStatDBHash
+    };
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* If not done for this transaction, take a snapshot of global stats */
+    pgstat_snapshot_global_stats();
+
+    /* the caller has no business with snapshot-local members */
+    return (PgStat_StatDBEntry *) snapshot_statentry(&param, dbid);
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
@@ -2451,51 +2870,66 @@ pgstat_fetch_stat_dbentry(Oid dbid)
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* Lookup our database, then look in its table hash table. */
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    if (dbentry == NULL)
+        return NULL;
 
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    dbentry = pgstat_fetch_stat_dbentry(InvalidOid);
+    if (dbentry == NULL)
+        return NULL;
+
+    tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     return NULL;
 }
 
+/* ----------
+ * pgstat_fetch_stat_tabentry_extended() -
+ *
+ *    Find table stats entry on backends. The returned entries are cached until
+ *    transaction end or pgstat_clear_snapshot() is called.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_extended(PgStat_StatDBEntry *dbent, Oid reloid)
+{
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_param param =
+    {
+        .hash_name = "table stats snapshot hash",
+        .hash_entsize = sizeof(PgStat_StatTabEntry),
+        .dsh_handle = DSM_HANDLE_INVALID,
+        .dsh_params = &dsh_tblparams,
+        .hash = NULL,
+        .dshash = NULL
+    };
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* set target shared hash */
+    param.dsh_handle = dbent->tables;
+
+    /* tell snapshot_statentry what variables to use */
+    param.hash = &dbent->snapshot_tables;
+    param.dshash = &dbent->dshash_tables;
+
+    return (PgStat_StatTabEntry *)
+        snapshot_statentry(&param, reloid);
+}
+
 
 /* ----------
  * pgstat_fetch_stat_funcentry() -
@@ -2510,21 +2944,90 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     PgStat_StatDBEntry *dbentry;
     PgStat_StatFuncEntry *funcentry = NULL;
 
-    /* load the stats file if needed */
-    backend_read_statsfile();
-
-    /* Lookup our database, then find the requested function.  */
+    /* Lookup our database, then find the requested function */
     dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
+    if (dbentry == NULL)
+        return NULL;
+
+    funcentry = pgstat_fetch_stat_funcentry_extended(dbentry, func_id);
 
     return funcentry;
 }
 
+/* ----------
+ * pgstat_fetch_stat_funcentry_extended() -
+ *
+ *    Find function stats entry on backends. The returned entries are cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
+ *
+ *  dbent is of type (PgStat_StatDBEntry *), but it must point to a
+ *  PgStat_StatDBEntry returned from pgstat_fetch_stat_dbentry().
+ */
+static PgStat_StatFuncEntry *
+pgstat_fetch_stat_funcentry_extended(PgStat_StatDBEntry *dbent, Oid funcid)
+{
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_param param =
+    {
+        .hash_name = "function stats snapshot hash",
+        .hash_entsize = sizeof(PgStat_StatFuncEntry),
+        .dsh_handle = DSM_HANDLE_INVALID,
+        .dsh_params = &dsh_funcparams,
+        .hash = NULL,
+        .dshash = NULL
+    };
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    if (dbent->functions == DSM_HANDLE_INVALID)
+        return NULL;
+
+    /* set target shared hash */
+    param.dsh_handle = dbent->functions;
+
+    /* tell snapshot_statentry what variables to use */
+    param.hash = &dbent->snapshot_functions;
+    param.dshash = &dbent->dshash_functions;
+
+    return (PgStat_StatFuncEntry *)
+        snapshot_statentry(&param, funcid);
+}
+
+/*
+ * pgstat_snapshot_global_stats() -
+ *
+ * Makes a snapshot of global stats if not done yet.  They will be kept until
+ * a subsequent call to pgstat_clear_snapshot() or the end of the current
+ * memory context (typically TopTransactionContext).
+ */
+static void
+pgstat_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext;
+
+    pgstat_attach_shared_stats();
+
+    /* Nothing to do if already done */
+    if (global_snapshot_is_valid)
+        return;
+
+    oldcontext = MemoryContextSwitchTo(pgStatSnapshotContext);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+
+    memcpy(&snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+    LWLockRelease(StatsLock);
+
+    global_snapshot_is_valid = true;
+
+    MemoryContextSwitchTo(oldcontext);
+
+    return;
+}
 
 /* ----------
  * pgstat_fetch_stat_beentry() -
@@ -2596,9 +3099,10 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &archiverStats;
+    return &snapshot_archiverStats;
 }
 
 
@@ -2613,9 +3117,10 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &globalStats;
+    return &snapshot_globalStats;
 }
 
 
@@ -2829,8 +3334,8 @@ pgstat_initialize(void)
         MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
     }
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* needs to be called before dsm shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -2928,7 +3433,7 @@ pgstat_bestart(void)
                 lbeentry.st_backendType = B_STARTUP;
                 break;
             case ArchiverProcess:
-                beentry->st_backendType = B_ARCHIVER;
+                lbeentry.st_backendType = B_ARCHIVER;
                 break;
             case BgWriterProcess:
                 lbeentry.st_backendType = B_BG_WRITER;
@@ -3064,6 +3569,10 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+
+    /* attach shared database stats area */
+    pgstat_attach_shared_stats();
 }
 
 /*
@@ -3099,6 +3608,8 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    pgstat_detach_shared_stats(true);
 }
 
 
@@ -3359,7 +3870,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3654,9 +4166,6 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
-            break;
         case WAIT_EVENT_RECOVERY_WAL_ALL:
             event_name = "RecoveryWalAll";
             break;
@@ -4319,75 +4828,43 @@ pgstat_get_backend_desc(BackendType backendType)
  * ------------------------------------------------------------
  */
 
-
 /* ----------
- * pgstat_setheader() -
+ * pgstat_send_archiver() -
  *
- *        Set common header fields in a statistics message
+ *        Report archiver statistics
  * ----------
  */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
+void
+pgstat_send_archiver(const char *xlog, bool failed)
 {
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
-    {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
- *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
- * ----------
- */
-void
-pgstat_send_archiver(const char *xlog, bool failed)
-{
-    PgStat_MsgArchiver msg;
-
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+    TimestampTz now = GetCurrentTimestamp();
+
+    if (failed)
+    {
+        /* Failed archival attempt */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->failed_count;
+        memcpy(shared_archiverStats->last_failed_wal, xlog,
+               sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    else
+    {
+        /* Successful archival operation */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->archived_count;
+        memcpy(shared_archiverStats->last_archived_wal, xlog,
+               sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
 }
 
 /* ----------
  * pgstat_send_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
@@ -4396,6 +4873,8 @@ pgstat_send_bgwriter(void)
     /* We assume this initializes to zeroes */
     static const PgStat_MsgBgWriter all_zeroes;
 
+    PgStat_MsgBgWriter *s = &BgWriterStats;
+
     /*
      * This function can be called even if nothing at all has happened. In
      * this case, avoid sending a completely empty message to the stats
@@ -4404,11 +4883,18 @@ pgstat_send_bgwriter(void)
     if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += s->m_timed_checkpoints;
+    shared_globalStats->requested_checkpoints += s->m_requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += s->m_checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += s->m_checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += s->m_buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += s->m_buf_written_clean;
+    shared_globalStats->maxwritten_clean += s->m_maxwritten_clean;
+    shared_globalStats->buf_written_backend += s->m_buf_written_backend;
+    shared_globalStats->buf_fsync_backend += s->m_buf_fsync_backend;
+    shared_globalStats->buf_alloc += s->m_buf_alloc;
+    LWLockRelease(StatsLock);
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4417,280 +4903,164 @@ pgstat_send_bgwriter(void)
 }
 
 
-/* ----------
- * PgstatCollectorMain() -
- *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
+/*
+ * Pin and Unpin dbentry.
  *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
+ * To keep memory usage low and for speed, counters are reset by recreating
+ * the dshash instead of removing entries one by one while holding a lock on
+ * the whole dshash. On the other hand, a dshash cannot be destroyed until all
+ * referrers have gone. As a result, other backends could be kept waiting for
+ * the counter reset for a long time. We isolate the hashes under destruction
+ * as another generation, which is no longer used but cannot be removed yet.
+ *
+ * When we start accessing hashes on a dbentry, call pin_hashes() and acquire
+ * the current "generation". Unpinning removes the older generation's hashes
+ * when all referrers have gone.
  */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
+static int
+pin_hashes(PgStat_StatDBEntry *dbentry)
 {
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
+    int    generation;
 
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, SignalHandlerForConfigReload);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->refcnt++;
+    generation = dbentry->generation;
+    LWLockRelease(&dbentry->lock);
 
-    /*
-     * Identify myself via ps
-     */
-    init_ps_display("stats collector", "", "", "");
+    dshash_release_lock(pgStatDBHash, dbentry);
 
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
+    return generation;
+}
+
+/*
+ * Unpin hashes in dbentry. If given generation is isolated, destroy it after
+ * all referrers have gone. Otherwise just decrease reference count then return.
+ */
+static void
+unpin_hashes(PgStat_StatDBEntry *dbentry, int generation)
+{
+    dshash_table *tables;
+    dshash_table *funcs = NULL;
+
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+
+    /* using current generation, just decrease refcount */
+    if (dbentry->generation == generation)
+    {
+        dbentry->refcnt--;
+        LWLockRelease(&dbentry->lock);
+        return;
+    }
 
     /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do check ConfigReloadPending
-     * inside the inner loop, which means that such interrupts will get
-     * serviced but the latch won't get cleared until next time there is a
-     * break in the action.
+     * It is isolated, waiting for all referrers to end.
      */
-    for (;;)
+    Assert(dbentry->generation == generation + 1);
+
+    if (--dbentry->prev_refcnt > 0)
     {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
+        LWLockRelease(&dbentry->lock);
+        return;
+    }
 
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (ShutdownRequestPending)
-            break;
+    /* no referrer remains, remove the hashes */
+    tables = dshash_attach(area, &dsh_tblparams, dbentry->prev_tables, 0);
+    if (dbentry->prev_functions != DSM_HANDLE_INVALID)
+        funcs = dshash_attach(area, &dsh_funcparams,
+                              dbentry->prev_functions, 0);
 
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * ShutdownRequestPending becomes set.
-         */
-        while (!ShutdownRequestPending)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (ConfigReloadPending)
-            {
-                ConfigReloadPending = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
+    dbentry->prev_tables = DSM_HANDLE_INVALID;
+    dbentry->prev_functions = DSM_HANDLE_INVALID;
 
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
+    /* release the entry immediately */
+    LWLockRelease(&dbentry->lock);
 
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
+    dshash_destroy(tables);
+    if (funcs)
+        dshash_destroy(funcs);
 
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
+    return;
+}
 
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
+/*
+ * attach and return the specified generation of table hash
+ * Returns NULL on lock failure.
+ */
+static dshash_table *
+attach_table_hash(PgStat_StatDBEntry *dbent, int gen)
+{
+    dshash_table *ret;
 
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
+    LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
 
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
+    if (dbent->generation == gen)
+        ret = dshash_attach(area, &dsh_tblparams, dbent->tables, 0);
+    else
+    {
+        Assert(dbent->generation == gen + 1);
+        Assert(dbent->prev_tables != DSM_HANDLE_INVALID);
+        ret = dshash_attach(area, &dsh_tblparams, dbent->prev_tables, 0);
+    }
+    LWLockRelease(&dbent->lock);
 
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
+    return ret;
+}
 
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(
-                                                   &msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(
-                                                   &msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(
-                                                 &msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(
-                                                 &msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
+/* attach and return the specified generation of function hash */
+static dshash_table *
+attach_function_hash(PgStat_StatDBEntry *dbent, int gen)
+{
+    dshash_table *ret = NULL;
 
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
 
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr & WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
+    LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
 
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
+    if (dbent->generation == gen)
+    {
+        if (dbent->functions == DSM_HANDLE_INVALID)
+        {
+            dshash_table *funchash =
+                dshash_create(area, &dsh_funcparams, 0);
+            dbent->functions = dshash_get_hash_table_handle(funchash);
+
+            ret = funchash;
+        }
+        else
+            ret = dshash_attach(area, &dsh_funcparams, dbent->functions, 0);
+    }
+    /* don't bother creating useless hash */
+
+    LWLockRelease(&dbent->lock);
+
+    return ret;
+}
 
-    exit(0);
+static void
+init_dbentry(PgStat_StatDBEntry *dbentry)
+{
+    LWLockInitialize(&dbentry->lock, LWTRANCHE_STATS);
+    dbentry->generation = 0;
+    dbentry->refcnt = 0;
+    dbentry->prev_refcnt = 0;
+    dbentry->tables = DSM_HANDLE_INVALID;
+    dbentry->prev_tables = DSM_HANDLE_INVALID;
+    dbentry->functions = DSM_HANDLE_INVALID;
+    dbentry->prev_functions = DSM_HANDLE_INVALID;
 }
 
 /*
  * Subroutine to clear stats in a database entry
  *
- * Tables and functions hashes are initialized to empty.
+ * Reset all counters in the dbentry. Tables and functions dshashes are
+ * destroyed.  If any backend is pinning this dbentry, the current dshashes
+ * are stashed out to the previous "generation" to wait until all accessors
+ * are gone. If the previous generation is already occupied, the current
+ * dshashes are so fresh that they don't need to be cleared.
  */
 static void
 reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
 {
-    HASHCTL        hash_ctl;
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
 
     dbentry->n_xact_commit = 0;
     dbentry->n_xact_rollback = 0;
@@ -4715,130 +5085,118 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
     dbentry->n_block_read_time = 0;
     dbentry->n_block_write_time = 0;
 
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
-
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
-}
+    if (dbentry->refcnt == 0)
+    {
+        /*
+         * No one is referring to the current hash. It's very costly to remove
+         * entries in dshash individually so just destroy the whole hash.  If
+         * someone pins this entry just after, pin_hashes() returns the
+         * current generation and the attach will happen after the following
+         * LWLock is released.
+         */
+        dshash_table *tbl;
 
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
-{
-    PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+            dbentry->tables = DSM_HANDLE_INVALID;
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+            dbentry->functions = DSM_HANDLE_INVALID;
+        }
+    }
+    else if (dbentry->prev_refcnt == 0)
+    {
+        /*
+         * Someone is still referring to the current hash and previous slot is
+         * vacant. Stash out the current hash to the previous slot.
+         */
+        dbentry->prev_refcnt = dbentry->refcnt;
+        dbentry->prev_tables = dbentry->tables;
+        dbentry->prev_functions = dbentry->functions;
+        dbentry->refcnt = 0;
+        dbentry->tables = DSM_HANDLE_INVALID;
+        dbentry->functions = DSM_HANDLE_INVALID;
+        dbentry->generation++;
+    }
+    else
+    {
+        Assert(dbentry->prev_refcnt > 0 && dbentry->refcnt > 0);
+        /*
+         * If we get here, we have just got another reset request while the
+         * old hashes are still waiting for all referrers to be released. That
+         * should take only a short time, so we can just ignore this request.
+         *
+         * As a side effect, the resetter can see non-zero values before
+         * anyone updates them, but that is indistinguishable from someone
+         * having updated them before the read.
+         */
+    }
 
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
+    /* Create a new table hash if it does not exist */
+    if (dbentry->tables == DSM_HANDLE_INVALID)
+    {
+        dshash_table *tbl = dshash_create(area, &dsh_tblparams, 0);
+        dbentry->tables = dshash_get_hash_table_handle(tbl);
+        dshash_detach(tbl);
+    }
 
-    if (!create && !found)
-        return NULL;
+    /* Create a new function hash if it does not exist and is needed. */
+    if (dbentry->functions == DSM_HANDLE_INVALID &&
+        pgstat_track_functions != TRACK_FUNC_OFF)
+    {
+        dshash_table *tbl = dshash_create(area, &dsh_funcparams, 0);
+        dbentry->functions = dshash_get_hash_table_handle(tbl);
+        dshash_detach(tbl);
+    }
 
-    /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
-     */
-    if (!found)
-        reset_dbentry_counters(result);
+    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
 
-    return result;
+    LWLockRelease(&dbentry->lock);
 }
 
-
 /*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+ * Create the filename for a DB stat file; filename is an output parameter
+ * that points to a character buffer of length len.
  */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+static void
+get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
 {
-    PgStat_StatTabEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /* If not found, initialize the new one. */
-    if (!found)
-    {
-        result->numscans = 0;
-        result->tuples_returned = 0;
-        result->tuples_fetched = 0;
-        result->tuples_inserted = 0;
-        result->tuples_updated = 0;
-        result->tuples_deleted = 0;
-        result->tuples_hot_updated = 0;
-        result->n_live_tuples = 0;
-        result->n_dead_tuples = 0;
-        result->changes_since_analyze = 0;
-        result->blocks_fetched = 0;
-        result->blocks_hit = 0;
-        result->vacuum_timestamp = 0;
-        result->vacuum_count = 0;
-        result->autovac_vacuum_timestamp = 0;
-        result->autovac_vacuum_count = 0;
-        result->analyze_timestamp = 0;
-        result->analyze_count = 0;
-        result->autovac_analyze_timestamp = 0;
-        result->autovac_analyze_count = 0;
-    }
+    int            printed;
 
-    return result;
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/db_%u.%s",
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
+                       databaseid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
 }
 
-
 /* ----------
  * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ *        Write the global statistics file, as well as DB files.
  * ----------
  */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+void
+pgstat_write_statsfiles(void)
 {
-    HASH_SEQ_STATUS hstat;
+    dshash_seq_status hstat;
     PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
 
+    /* Stats are not initialized yet; just return. */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
+
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
     /*
@@ -4857,7 +5215,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
 
     /*
      * Write the file header --- currently just a format ID.
@@ -4869,39 +5227,37 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Write global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write archiver stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Walk through the database table.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_init(&hstat, pgStatDBHash, false, false);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&hstat)) != NULL)
     {
         /*
          * Write out the table and function stats for this DB into the
          * appropriate per-DB stat file, if required.
          */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
+        pgstat_write_pgStatDBHashfile(dbentry);
 
         /*
          * Write out the DB entry. We don't write the tables or functions
          * pointers, since they're of no use to any other process.
          */
         fputc('D', fpout);
-        rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
+        rc = fwrite(dbentry,
+                    offsetof(PgStat_StatDBEntry, generation), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
 
@@ -4937,53 +5293,18 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
-}
-
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
 }
 
 /* ----------
- * pgstat_write_db_statsfile() -
+ * pgstat_write_pgStatDBHashfile() -
  *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
  * ----------
  */
 static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
+pgstat_write_pgStatDBHashfile(PgStat_StatDBEntry *dbentry)
 {
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
+    dshash_seq_status tstat;
+    dshash_seq_status fstat;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatFuncEntry *funcentry;
     FILE       *fpout;
@@ -4992,9 +5313,10 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     int            rc;
     char        tmpfile[MAXPGPATH];
     char        statfile[MAXPGPATH];
+    dshash_table *tbl;
 
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -5021,23 +5343,30 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     /*
      * Walk through the database's access stats per table.
      */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&tstat, tbl, false, false);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&tstat)) != NULL)
     {
         fputc('T', fpout);
         rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_detach(tbl);
 
     /*
      * Walk through the database's function stats table.
      */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+    if (dbentry->functions != DSM_HANDLE_INVALID)
     {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
+        tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_seq_init(&fstat, tbl, false, false);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&fstat)) != NULL)
+        {
+            fputc('F', fpout);
+            rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+            (void) rc;                /* we'll check for error with ferror */
+        }
+        dshash_detach(tbl);
     }
 
     /*
@@ -5072,76 +5401,37 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
 }
 
 /* ----------
  * pgstat_read_statsfiles() -
  *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
- *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
+ *    Reads in existing statistics collector files into the shared stats hash.
  *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+void
+pgstat_read_statsfiles(void)
 {
     PgStat_StatDBEntry *dbentry;
     PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+
+    /* shouldn't be called from postmaster */
+    Assert(IsUnderPostmaster);
+
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
 
     /*
-     * The tables will live in pgStatLocalContext.
+     * Set the current timestamp (will be kept only in case we can't load an
+     * existing statsfile).
      */
-    pgstat_setup_memcxt();
-
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-
-    /*
-     * Set the current timestamp (will be kept only in case we can't load an
-     * existing statsfile).
-     */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp =
+        shared_globalStats->stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
@@ -5155,11 +5445,11 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -5168,7 +5458,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
@@ -5176,32 +5466,24 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     /*
      * Read global stats struct
      */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+        sizeof(*shared_globalStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
         goto done;
     }
 
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
     /*
      * Read archiver stats struct
      */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+        sizeof(*shared_archiverStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
         goto done;
     }
 
@@ -5218,10 +5500,10 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                  * follows.
                  */
             case 'D':
-                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
+                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, generation),
+                          fpin) != offsetof(PgStat_StatDBEntry, generation))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5230,76 +5512,36 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 /*
                  * Add to the DB hash
                  */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
+                dbentry = (PgStat_StatDBEntry *)
+                    dshash_find_or_insert(pgStatDBHash, (void *) &dbbuf.databaseid,
+                                          &found);
+
+                /* don't allow duplicate dbentries */
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    dshash_release_lock(pgStatDBHash, dbentry);
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+                /* initialize the new shared entry */
+                init_dbentry(dbentry);
 
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
+                memcpy(dbentry, &dbbuf,
+                       offsetof(PgStat_StatDBEntry, generation));
 
+                /* Read the data from the database-specific file. */
+                pgstat_read_pgStatDBHashfile(dbentry);
+                dshash_release_lock(pgStatDBHash, dbentry);
                 break;
 
             case 'E':
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5309,45 +5551,35 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 done:
     FreeFile(fpin);
 
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 
-    return dbhash;
+    return;
 }
 
 
 /* ----------
- * pgstat_read_db_statsfile() -
+ * pgstat_read_pgStatDBHashfile() -
  *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
+ *    Reads in the at-rest statistics file and creates the shared statistics
+ *    tables. The file is removed after reading.
  * ----------
  */
 static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
+pgstat_read_pgStatDBHashfile(PgStat_StatDBEntry *dbentry)
 {
     PgStat_StatTabEntry *tabentry;
     PgStat_StatTabEntry tabbuf;
     PgStat_StatFuncEntry funcbuf;
     PgStat_StatFuncEntry *funcentry;
+    dshash_table         *tabhash = NULL;
+    dshash_table         *funchash = NULL;
     FILE       *fpin;
     int32        format_id;
     bool        found;
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
+    get_dbstat_filename(false, dbentry->databaseid, statfile, MAXPGPATH);
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
@@ -5361,7 +5593,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
@@ -5374,14 +5606,14 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
 
     /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
+     * We found an existing statistics file. Read it and put all the hashtable
+     * entries into place.
      */
     for (;;)
     {
@@ -5394,31 +5626,35 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
                           fpin) != sizeof(PgStat_StatTabEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                /*
-                 * Skip if table data not wanted.
-                 */
                 if (tabhash == NULL)
-                    break;
+                {
+                    tabhash = dshash_create(area, &dsh_tblparams, 0);
+                    dbentry->tables =
+                        dshash_get_hash_table_handle(tabhash);
+                }
 
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
+                tabentry = (PgStat_StatTabEntry *)
+                    dshash_find_or_insert(tabhash,
+                                          (void *) &tabbuf.tableid, &found);
 
+                /* don't allow duplicate entries */
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    dshash_release_lock(tabhash, tabentry);
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
                 memcpy(tabentry, &tabbuf, sizeof(tabbuf));
+                dshash_release_lock(tabhash, tabentry);
                 break;
 
                 /*
@@ -5428,31 +5664,34 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
                           fpin) != sizeof(PgStat_StatFuncEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                /*
-                 * Skip if function data not wanted.
-                 */
                 if (funchash == NULL)
-                    break;
+                {
+                    funchash = dshash_create(area, &dsh_funcparams, 0);
+                    dbentry->functions =
+                        dshash_get_hash_table_handle(funchash);
+                }
 
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
+                funcentry = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert(funchash,
+                                          (void *) &funcbuf.functionid, &found);
 
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    dshash_release_lock(funchash, funcentry);
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
                 memcpy(funcentry, &funcbuf, sizeof(funcbuf));
+                dshash_release_lock(funchash, funcentry);
                 break;
 
                 /*
@@ -5462,7 +5701,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5470,295 +5709,39 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     }
 
 done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
+    if (tabhash)
+        dshash_detach(tabhash);
+    if (funchash)
+        dshash_detach(funchash);
 
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
     FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
 
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 }
 
-
 /* ----------
  * pgstat_setup_memcxt() -
  *
- *    Create pgStatLocalContext, if not already done.
+ *    Create pgStatLocalContext and pgStatSnapshotContext, if not already done.
  * ----------
  */
 static void
 pgstat_setup_memcxt(void)
 {
     if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+
+    if (!pgStatSnapshotContext)
+        pgStatSnapshotContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Database statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
 }
 
-
 /* ----------
  * pgstat_clear_snapshot() -
  *
@@ -5774,739 +5757,223 @@ pgstat_clear_snapshot(void)
 {
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
+    {
         MemoryContextDelete(pgStatLocalContext);
 
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
+    }
 
+    if (pgStatSnapshotContext)
+        clear_snapshot = true;
+}
 
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
+static bool
+pgstat_update_tabentry(dshash_table *tabhash, PgStat_TableStatus *stat,
+                       bool nowait)
 {
-    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    bool    found;
 
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
+    if (tabhash == NULL)
+        return false;
 
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
+    tabentry = (PgStat_StatTabEntry *)
+        dshash_find_or_insert_extended(tabhash, (void *) &(stat->t_id),
+                                       &found, nowait);
 
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
+    /* failed to acquire lock */
+    if (tabentry == NULL)
+        return false;
+
+    if (!found)
     {
         /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
+         * If it's a new table entry, initialize counters to the values we
+         * just got.
          */
+        tabentry->numscans = stat->t_counts.t_numscans;
+        tabentry->tuples_returned = stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched = stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted = stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated = stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted = stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated = stat->t_counts.t_tuples_hot_updated;
+        tabentry->n_live_tuples = stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples = stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze = stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched = stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit = stat->t_counts.t_blocks_hit;
+
+        tabentry->vacuum_timestamp = 0;
+        tabentry->vacuum_count = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_count = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->analyze_count = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+        tabentry->autovac_analyze_count = 0;
     }
-    else if (msg->clock_time < dbentry->stats_timestamp)
+    else
     {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        tabentry->numscans += stat->t_counts.t_numscans;
+        tabentry->tuples_returned += stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched += stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted += stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated += stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted += stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated += stat->t_counts.t_tuples_hot_updated;
+        /* If table was truncated, first reset the live/dead counters */
+        if (stat->t_counts.t_truncated)
         {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
+            tabentry->n_live_tuples = 0;
+            tabentry->n_dead_tuples = 0;
         }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
+        tabentry->n_live_tuples += stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples += stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze += stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched += stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit += stat->t_counts.t_blocks_hit;
     }
 
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
 
+    dshash_release_lock(tabhash, tabentry);
+
+    return true;
+}
 
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
 static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
+pgstat_update_dbentry(PgStat_StatDBEntry *dbentry, PgStat_TableStatus *stat)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
     /*
-     * Update database-wide stats.
+     * Add per-table stats to the per-database entry, too.
      */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
+    dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
+    dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
+    dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
+    dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
+    dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
+    dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
+    LWLockRelease(&dbentry->lock);
+}
 
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
+/*
+ * Look up the shared stats hash entry for the specified database.  Returns
+ * NULL if not found, or if PGSTAT_NOWAIT is set and the lock is unavailable.
+ */
+static PgStat_StatDBEntry *
+pgstat_get_db_entry(Oid databaseid, int op, PgStat_TableLookupResult *status)
+{
+    PgStat_StatDBEntry *result;
+    bool        nowait = ((op & PGSTAT_NOWAIT) != 0);
+    bool        lock_acquired = true;
+    bool        found = true;
 
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
+    if (!IsUnderPostmaster || !pgStatDBHash)
+        return NULL;
 
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
+    /* Lookup or create the hash table entry for this database */
+    if (op & PGSTAT_EXCLUSIVE)
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_or_insert_extended(pgStatDBHash, &databaseid,
+                                           &found, nowait);
+        if (result == NULL)
+            lock_acquired = false;
+        else if (!found)
         {
             /*
-             * Otherwise add the values to the existing entry.
+             * If not found, initialize the new one.  This creates empty hash
+             * tables, too.
              */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
+            init_dbentry(result);
+            reset_dbentry_counters(result);
         }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
     }
     else
     {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
+        result = (PgStat_StatDBEntry *)
+            dshash_find_extended(pgStatDBHash, &databaseid, true, nowait,
+                                 nowait ? &lock_acquired : NULL);
+        if (result == NULL)
+            found = false;
     }
-}
 
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
+    /* Set return status if requested */
+    if (status)
     {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
+        if (!lock_acquired)
         {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
+            Assert(nowait);
+            *status = LOCK_FAILED;
         }
+        else if (!found)
+            *status = NOT_FOUND;
         else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
+            *status = FOUND;
     }
+
+    return result;
 }
 
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
+/*
+ * Look up the hash table entry for the specified table.  If no entry
+ * exists, create and initialize one when the create parameter is true;
+ * otherwise return NULL.
  */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
+static PgStat_StatTabEntry *
+pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create)
 {
-    PgStat_StatDBEntry *dbentry;
-    int            i;
+    PgStat_StatTabEntry *result;
+    bool        found;
 
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
+    /* Lookup or create the hash table entry for this table */
+    if (create)
+        result = (PgStat_StatTabEntry *)
+            dshash_find_or_insert(table, &tableoid, &found);
+    else
+        result = (PgStat_StatTabEntry *) dshash_find(table, &tableoid, false);
 
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
+    if (!create)
+        return result;        /* NULL if absent; found is unset on this path */
 
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
+    /* If not found, initialize the new one. */
+    if (!found)
     {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
+        result->numscans = 0;
+        result->tuples_returned = 0;
+        result->tuples_fetched = 0;
+        result->tuples_inserted = 0;
+        result->tuples_updated = 0;
+        result->tuples_deleted = 0;
+        result->tuples_hot_updated = 0;
+        result->n_live_tuples = 0;
+        result->n_dead_tuples = 0;
+        result->changes_since_analyze = 0;
+        result->blocks_fetched = 0;
+        result->blocks_hit = 0;
+        result->vacuum_timestamp = 0;
+        result->vacuum_count = 0;
+        result->autovac_vacuum_timestamp = 0;
+        result->autovac_vacuum_count = 0;
+        result->analyze_timestamp = 0;
+        result->analyze_count = 0;
+        result->autovac_analyze_timestamp = 0;
+        result->autovac_analyze_count = 0;
     }
-}
 
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
+    return result;
 }
 
 /*
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index ee41d7009e..433f86a0de 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -255,7 +255,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -503,7 +502,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1326,12 +1324,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1780,11 +1772,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2681,8 +2668,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -3045,8 +3030,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3113,13 +3096,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3192,22 +3168,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3668,22 +3628,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3879,8 +3823,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3915,8 +3857,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4117,8 +4058,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5091,18 +5030,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5221,12 +5148,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6131,7 +6052,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6187,8 +6107,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6423,7 +6341,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..58a442f482 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -147,6 +147,7 @@ CreateSharedMemoryAndSemaphores(void)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -263,6 +264,7 @@ CreateSharedMemoryAndSemaphores(void)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d07ce609d4..6bbb2849b3 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -522,6 +522,7 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
     LWLockRegisterTranche(LWTRANCHE_SXACT, "serializable_xact");
+    LWLockRegisterTranche(LWTRANCHE_STATS, "activity stats");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 0a6f80963b..73bc18440f 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3185,6 +3185,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3759,6 +3765,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4197,6 +4204,8 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long stats_timeout;
+
                 /* Send out notify signals and transmit self-notifies */
                 ProcessCompletedNotifies();
 
@@ -4209,8 +4218,13 @@ PostgresMain(int argc, char *argv[],
                 if (notifyInterruptPending)
                     ProcessNotifyInterrupt();
 
-                pgstat_report_stat(false);
-
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4245,7 +4259,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4253,6 +4267,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index b1f6291b99..1e548a042c 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 8a47dcdcb1..a9e530918d 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -630,6 +631,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1240,6 +1243,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b7d36b65dd..13be46c172 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -6,7 +6,7 @@ use File::Basename qw(basename dirname);
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 106;
+use Test::More tests => 105;
 
 program_help_ok('pg_basebackup');
 program_version_ok('pg_basebackup');
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index dc7dd1c164..d23372f5bd 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -83,6 +83,8 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0334213b98..b168e490c8 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL statistics collector facility.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -13,10 +13,11 @@
 
 #include "datatype/timestamp.h"
 #include "libpq/pqcomm.h"
-#include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -40,33 +41,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -77,9 +51,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -115,13 +88,6 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_blocks_hit;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -180,236 +146,12 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
-/* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
- */
-typedef struct PgStat_MsgHdr
-{
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
 /* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
+ * PgStat_MsgBgWriter            bgwriter statistics
  * ----------
  */
 typedef struct PgStat_MsgBgWriter
 {
-    PgStat_MsgHdr m_hdr;
-
     PgStat_Counter m_timed_checkpoints;
     PgStat_Counter m_requested_checkpoints;
     PgStat_Counter m_buf_written_checkpoints;
@@ -422,38 +164,14 @@ typedef struct PgStat_MsgBgWriter
     PgStat_Counter m_checkpoint_sync_time;
 } PgStat_MsgBgWriter;
 
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
-
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -485,96 +203,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Statistic collector data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -614,16 +244,29 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_block_write_time;
 
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;        /* time of db stats update */
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following members must be last in the struct, because we don't
+     * write them out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    int generation;                        /* current generation of the below */
+    int    refcnt;                            /* current gen reference count */
+    dshash_table_handle tables;            /* current gen tables hash */
+    dshash_table_handle functions;        /* current gen functions hash */
+    int    prev_refcnt;                    /* prev gen reference count */
+    dshash_table_handle prev_tables;    /* prev gen tables hash */
+    dshash_table_handle prev_functions;    /* prev gen functions hash */
+    LWLock    lock;                        /* Lock for the above members */
+
+    /* non-shared members */
+    HTAB *snapshot_tables;                 /* table entry snapshot */
+    HTAB *snapshot_functions;             /* function entry snapshot */
+    dshash_table    *dshash_tables;         /* attached tables dshash */
+    dshash_table    *dshash_functions;     /* attached functions dshash */
 } PgStat_StatDBEntry;
 
+#define SHARED_DBENT_SIZE offsetof(PgStat_StatDBEntry, snapshot_tables)
 
 /* ----------
  * PgStat_StatTabEntry            The collector's data per table (or index)
@@ -662,7 +305,7 @@ typedef struct PgStat_StatTabEntry
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            per function stats data
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
@@ -677,7 +320,7 @@ typedef struct PgStat_StatFuncEntry
 
 
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -693,7 +336,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -779,7 +422,6 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
     WAIT_EVENT_RECOVERY_WAL_ALL,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
@@ -1216,6 +858,8 @@ extern bool pgstat_track_counts;
 extern int    pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used, but will be removed with GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
@@ -1237,29 +881,26 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
-extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
 
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void pgstat_reset_all(void);
 
+/* File input/output functions  */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 
 extern void pgstat_report_autovac(Oid dboid);
@@ -1431,11 +1072,13 @@ extern void pgstat_send_bgwriter(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_extended(PgStat_StatDBEntry *dbent, Oid relid);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 8fda8e4f78..13371e8cb7 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -220,6 +220,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_SXACT,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 83a15f6795..77d1572a99 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.23.0

From 12d5b6c4cfc715fa7e55d405a3af8c57be5aca8e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 27 Nov 2018 14:42:12 +0900
Subject: [PATCH 5/5] Remove the GUC stats_temp_directory

The GUC used to specify the directory in which temporary statistics
files are stored. It is no longer needed by the stats collector but is
still used by programs in bin and contrib, and maybe by other extensions.
Thus this patch removes the GUC, but some backing variables and macro
definitions are left alone for backward compatibility.
---
 doc/src/sgml/backup.sgml                      |  2 -
 doc/src/sgml/config.sgml                      | 19 ---------
 doc/src/sgml/monitoring.sgml                  |  7 +---
 doc/src/sgml/storage.sgml                     |  3 +-
 src/backend/postmaster/pgstat.c               | 13 +++---
 src/backend/replication/basebackup.c          | 13 ++----
 src/backend/utils/misc/guc.c                  | 41 -------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/pgstat.h                          |  5 ++-
 src/test/perl/PostgresNode.pm                 |  4 --
 10 files changed, 14 insertions(+), 94 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index bdc9026c62..2885540362 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1146,8 +1146,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3ccacd528b..2052fff0a2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7047,25 +7047,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index a6b0bdec12..84a9c15422 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -195,12 +195,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
+   <productname>PostgreSQL</productname> processes through shared memory.
    When the server shuts down cleanly, a permanent copy of the statistics
    data is stored in the <filename>pg_stat</filename> subdirectory, so that
    statistics can be retained across server restarts.  When recovery is
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 1c19e863d2..2f04bb68bb 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -122,8 +122,7 @@ Item
 
 <row>
  <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
+ <entry>Subdirectory containing ephemeral files for extensions</entry>
 </row>
 
 <row>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 1ffe073a1f..34e2e268df 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -110,15 +110,12 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
+/*
+ * This used to be a GUC variable and is no longer used in this file, but is
+ * left in place, holding the default value, for backward compatibility with
+ * extensions.
  */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+char       *pgstat_stat_directory = PG_STAT_TMP_DIR;
 
 #define        StatsLock (&StatsShmem->StatsMainLock)
 
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index dea8aab45e..07fe94ba2f 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -229,11 +229,8 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -264,13 +261,9 @@ perform_base_backup(basebackup_options *opt)
          * Calculate the relative path of temporary statistics directory in
          * order to skip the files which are located in that directory later.
          */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
+
+        Assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
+        statrelpath = psprintf("./%s", PG_STAT_TMP_DIR);
 
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e44f71e991..70672d66b4 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -195,7 +195,6 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static void assign_effective_io_concurrency(int newval, void *extra);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -4137,17 +4136,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11456,35 +11444,6 @@ assign_effective_io_concurrency(int newval, void *extra)
 #endif                            /* USE_PREFETCH */
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e1048c0047..5f0ab2a82e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -572,7 +572,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index b168e490c8..8496204301 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -30,7 +30,10 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
+/*
+ * This used to be the directory to store temporary statistics data in but is
+ * no longer used. Defined here for backward compatibility.
+ */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
 
 /* Values for track_functions GUC variable --- order is significant! */
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 2e0cf4a2f3..c127f727ae 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -455,10 +455,6 @@ sub init
     print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
       if defined $ENV{TEMP_CONFIG};
 
-    # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
-    # concurrently must not share a stats_temp_directory.
-    print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
     if ($params{allows_streaming})
     {
         if ($params{allows_streaming} eq "logical")
-- 
2.23.0


Re: shared-memory based stats collector

From
Tom Lane
Date:
Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
> CFbots seem unhappy with this. Rebased.

It's failing to apply again, so I rebased again.  I haven't
read the code or done any testing beyond check-world.

            regards, tom lane
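
For readers skimming the dshash changes in the rebased patch: the
non-consistent mode of dshash_seq_next walks the buckets in order and moves
the partition lock along, taking the next partition's lock before releasing
the current one. A self-contained sketch of that locking order follows. It
assumes pthread mutexes in place of LWLocks, a fixed number of buckets (no
resizing) and a single lock mode. All Demo* and demo_* names are hypothetical
and exist only for this illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define DEMO_NPARTITIONS 4
#define DEMO_BUCKETS_PER_PARTITION 2
#define DEMO_NBUCKETS (DEMO_NPARTITIONS * DEMO_BUCKETS_PER_PARTITION)

typedef struct DemoItem
{
    struct DemoItem *next;
    int         value;
} DemoItem;

typedef struct DemoTable
{
    pthread_mutex_t locks[DEMO_NPARTITIONS];
    DemoItem   *buckets[DEMO_NBUCKETS];
} DemoTable;

typedef struct DemoScan
{
    DemoTable  *table;
    int         curbucket;
    int         curpartition;   /* -1 while no partition lock is held */
    DemoItem   *nextitem;       /* successor saved before returning an item */
} DemoScan;

static void
demo_table_init(DemoTable *t)
{
    for (int i = 0; i < DEMO_NPARTITIONS; i++)
        pthread_mutex_init(&t->locks[i], NULL);
    for (int i = 0; i < DEMO_NBUCKETS; i++)
        t->buckets[i] = NULL;
}

/* Setup helper: push at the bucket head; assumes no concurrent access. */
static void
demo_insert(DemoTable *t, int bucket, DemoItem *item)
{
    item->next = t->buckets[bucket];
    t->buckets[bucket] = item;
}

static void
demo_scan_init(DemoScan *s, DemoTable *t)
{
    s->table = t;
    s->curbucket = 0;
    s->curpartition = -1;
    s->nextitem = NULL;
}

/* Release whatever partition lock the scan still holds. */
static void
demo_scan_term(DemoScan *s)
{
    if (s->curpartition >= 0)
        pthread_mutex_unlock(&s->table->locks[s->curpartition]);
    s->curpartition = -1;
}

static DemoItem *
demo_scan_next(DemoScan *s)
{
    DemoItem   *item;

    if (s->curbucket >= DEMO_NBUCKETS)
        return NULL;            /* scan already finished */

    if (s->curpartition < 0)
    {
        /* first call: lock partition 0 and start at bucket 0 */
        pthread_mutex_lock(&s->table->locks[0]);
        s->curpartition = 0;
        item = s->table->buckets[0];
    }
    else
        item = s->nextitem;

    while (item == NULL)
    {
        int         next_partition;

        if (++s->curbucket >= DEMO_NBUCKETS)
        {
            demo_scan_term(s);  /* finished scans clean up themselves */
            return NULL;
        }
        next_partition = s->curbucket / DEMO_BUCKETS_PER_PARTITION;
        if (next_partition != s->curpartition)
        {
            /* take the next lock before dropping the current one */
            pthread_mutex_lock(&s->table->locks[next_partition]);
            pthread_mutex_unlock(&s->table->locks[s->curpartition]);
            s->curpartition = next_partition;
        }
        item = s->table->buckets[s->curbucket];
    }

    s->nextitem = item->next;
    return item;
}
```

A caller loops on demo_scan_next() until it returns NULL, and calls
demo_scan_term() only when it abandons the scan early, mirroring the contract
described in the dshash_seq_init/_next/_term comment.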

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 78ccf03..afaccb1 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -112,6 +112,7 @@ struct dshash_table
     size_t        size_log2;        /* log2(number of buckets) */
     bool        find_locked;    /* Is any partition lock held by 'find'? */
     bool        find_exclusively_locked;    /* ... exclusively? */
+    bool        seqscan_running;/* now under sequential scan */
 };

 /* Given a pointer to an item, find the entry (user data) it holds. */
@@ -127,6 +128,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)

+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +158,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))

+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -228,6 +237,7 @@ dshash_create(dsa_area *area, const dshash_parameters *params, void *arg)

     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;

     /*
      * Set up the initial array of buckets.  Our initial size is the same as
@@ -279,6 +289,7 @@ dshash_attach(dsa_area *area, const dshash_parameters *params,
     hash_table->control = dsa_get_address(area, control);
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
     Assert(hash_table->control->magic == DSHASH_MAGIC);

     /*
@@ -324,7 +335,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);

     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
                                 LW_EXCLUSIVE));

     delete_item(hash_table, item);
-    hash_table->find_locked = false;
-    hash_table->find_exclusively_locked = false;
-    LWLockRelease(PARTITION_LOCK(hash_table, partition));
+
+    /* We need to keep the partition lock during a sequential scan */
+    if (!hash_table->seqscan_running)
+    {
+        hash_table->find_locked = false;
+        hash_table->find_exclusively_locked = false;
+        LWLockRelease(PARTITION_LOCK(hash_table, partition));
+    }
 }

 /*
@@ -568,6 +584,8 @@ dshash_release_lock(dshash_table *hash_table, void *entry)
     Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition_index),
                                 hash_table->find_exclusively_locked
                                 ? LW_EXCLUSIVE : LW_SHARED));
+    /* lock is under control of sequential scan */
+    Assert(!hash_table->seqscan_running);

     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
@@ -593,6 +611,164 @@ dshash_memhash(const void *v, size_t size, void *arg)
 }

 /*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through a dshash table and return all the
+ *           elements one by one; return NULL when there are no more.
+ *
+ * dshash_seq_term should be called only for scans that end early; it must
+ * not be called otherwise. Finished scans are cleaned up automatically.
+ *
+ * Returned elements are locked, as is the case with dshash_find.  However, the
+ * caller must not release the lock.
+ *
+ * As with dynahash, the caller may delete returned elements in the midst of a
+ * scan.
+ *
+ * If consistent is set for dshash_seq_init, all hash table partitions are
+ * locked in the requested mode (as determined by the exclusive flag) during
+ * the scan.  Otherwise partitions are locked one at a time during the scan.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool consistent, bool exclusive)
+{
+    /* at most one scan is allowed at a time */
+    Assert(!hash_table->seqscan_running);
+
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->consistent = consistent;
+    status->exclusive = exclusive;
+    hash_table->seqscan_running = true;
+
+    /*
+     * Protect all partitions from modification if the caller wants a
+     * consistent result.
+     */
+    if (consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+        {
+            Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i)));
+
+            LWLockAcquire(PARTITION_LOCK(hash_table, i),
+                          exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        }
+        ensure_valid_bucket_pointers(hash_table);
+    }
+}
+
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    Assert(status->hash_table->seqscan_running);
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert(status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        if (!status->consistent)
+        {
+            partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+            LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            status->curpartition = partition;
+
+            /* resize doesn't happen from now until seq scan ends */
+            status->nbuckets =
+                NUM_BUCKETS(status->hash_table->control->size_log2);
+            ensure_valid_bucket_pointers(status->hash_table);
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            dshash_seq_term(status);
+            return NULL;
+        }
+
+        /* Also move the partition lock if needed */
+        if (!status->consistent)
+        {
+            int next_partition =
+                PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                           status->hash_table->size_log2);
+
+            /* Move lock along with partition for the bucket */
+            if (status->curpartition != next_partition)
+            {
+                /*
+                 * Lock the next partition then release the current, not in the
+                 * reverse order, to avoid concurrent resizing. Partitions
+                 * are locked in the same order as in resize(), so deadlocks
+                 * won't happen.
+                 */
+                LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                             next_partition),
+                              status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+                LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                             status->curpartition));
+                status->curpartition = next_partition;
+            }
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * The caller may delete the item. Store the next item in case of deletion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    Assert(status->hash_table->seqscan_running);
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+    status->hash_table->seqscan_running = false;
+
+    if (status->consistent)
+    {
+        int i;
+
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+            LWLockRelease(PARTITION_LOCK(status->hash_table, i));
+    }
+    else if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+/*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
  */
@@ -673,7 +849,6 @@ delete_item(dshash_table *hash_table, dshash_table_item *item)
 /*
  * Grow the hash table if necessary to the requested number of buckets.  The
  * requested size must be double some previously observed size.
- *
  * Must be called without any partition lock held.
  */
 static void
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b86df68..ad88f32 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,22 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;

+/*
+ * Sequential scan state. The detail is exposed to let users know the storage
+ * size but it should be considered as an opaque type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;
+    int                    curbucket;
+    int                    nbuckets;
+    dshash_table_item  *curitem;
+    dsa_pointer            pnextitem;
+    int                    curpartition;
+    bool                consistent;
+    bool                exclusive;
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
                                    const dshash_parameters *params,
@@ -70,7 +86,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
 extern void dshash_detach(dshash_table *hash_table);
 extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
 extern void dshash_destroy(dshash_table *hash_table);
-
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
@@ -80,6 +95,11 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);

+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool consistent, bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index afaccb1..4ba6354 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -395,18 +395,50 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
 {
+    return dshash_find_extended(hash_table, key, exclusive, false, NULL);
+}
+
+/*
+ * A version of dshash_find that is allowed to return immediately on lock
+ * failure. The lock status is stored in *lock_acquired in that case.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool *lock_acquired)
+{
     dshash_hash hash;
     size_t        partition;
     dshash_table_item *item;

+    /*
+     * No need to return the lock result when !nowait. Otherwise the caller may
+     * omit the lock result when NULL is returned.
+     */
+    Assert(nowait || !lock_acquired);
+
     hash = hash_key(hash_table, key);
     partition = PARTITION_FOR_HASH(hash);

     Assert(hash_table->control->magic == DSHASH_MAGIC);
     Assert(!hash_table->find_locked);

-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partition),
+                                      exclusive ? LW_EXCLUSIVE : LW_SHARED))
+        {
+            if (lock_acquired)
+                *lock_acquired = false;
+            return NULL;
+        }
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition),
+                      exclusive ? LW_EXCLUSIVE : LW_SHARED);
+
+    if (lock_acquired)
+        *lock_acquired = true;
+
     ensure_valid_bucket_pointers(hash_table);

     /* Search the active bucket. */
@@ -442,6 +474,22 @@ dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
 {
+    return dshash_find_or_insert_extended(hash_table, key, found, false);
+}
+
+/*
+ * A version of dshash_find_or_insert that is allowed to return immediately
+ * on lock failure.
+ *
+ * Notes above dshash_find_extended() regarding locking and error handling
+ * equally apply here.
+ */
+void *
+dshash_find_or_insert_extended(dshash_table *hash_table,
+                               const void *key,
+                               bool *found,
+                               bool nowait)
+{
     dshash_hash hash;
     size_t        partition_index;
     dshash_partition *partition;
@@ -455,8 +503,16 @@ dshash_find_or_insert(dshash_table *hash_table,
     Assert(!hash_table->find_locked);

 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (nowait)
+    {
+        if (!LWLockConditionalAcquire(
+                PARTITION_LOCK(hash_table, partition_index),
+                LW_EXCLUSIVE))
+            return NULL;
+    }
+    else
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
+                      LW_EXCLUSIVE);
     ensure_valid_bucket_pointers(hash_table);

     /* Search the active bucket. */
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index ad88f32..a7d19c6 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -89,8 +89,14 @@ extern void dshash_destroy(dshash_table *hash_table);
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+                                  bool exclusive, bool nowait,
+                                  bool *lock_acquired);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                                    const void *key, bool *found);
+extern void *dshash_find_or_insert_extended(dshash_table *hash_table,
+                                            const void *key, bool *found,
+                                            bool nowait);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 657b18e..c4859db 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -326,6 +326,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case StartupProcess:
                 statmsg = pgstat_get_backend_desc(B_STARTUP);
                 break;
+            case ArchiverProcess:
+                statmsg = pgstat_get_backend_desc(B_ARCHIVER);
+                break;
             case BgWriterProcess:
                 statmsg = pgstat_get_backend_desc(B_BG_WRITER);
                 break;
@@ -451,6 +454,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
             StartupProcessMain();
             proc_exit(1);        /* should never return */

+        case ArchiverProcess:
+            /* don't set signals, archiver has its own agenda */
+            PgArchiverMain();
+            proc_exit(1);        /* should never return */
+
         case BgWriterProcess:
             /* don't set signals, bgwriter has its own agenda */
             BackgroundWriterMain();
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 3ca30ba..6441e69 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -78,7 +78,6 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;

 /*
@@ -95,7 +94,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif

-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgarch_exit(SIGNAL_ARGS);
 static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
@@ -111,75 +109,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */

-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -219,8 +148,8 @@ pgarch_forkexec(void)
  *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
  *    since we don't use 'em, it hardly matters...
  */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -252,8 +181,27 @@ PgArchiverMain(int argc, char *argv[])
 static void
 pgarch_exit(SIGNAL_ARGS)
 {
-    /* SIGQUIT means curl up and die ... */
-    exit(1);
+    PG_SETMASK(&BlockSig);
+
+    /*
+     * We DO NOT want to run proc_exit() callbacks -- we're here because
+     * shared memory may be corrupted, so we don't want to try to clean up our
+     * transaction.  Just nail the windows shut and get out of town.  Now that
+     * there's an atexit callback to prevent third-party code from breaking
+     * things by calling exit() directly, we have to reset the callbacks
+     * explicitly to make this work as intended.
+     */
+    on_exit_reset();
+
+    /*
+     * Note we do exit(2) not exit(0).  This is to force the postmaster into a
+     * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+     * process.  This is necessary precisely because we don't clean up our
+     * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+     * should ensure the postmaster sees this as a crash, too, but no harm in
+     * being doubly sure.)
+     */
+    exit(2);
 }

 /* SIGUSR1 signal handler for archiver process */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 462b4d7..0f5ba82 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2930,6 +2930,9 @@ pgstat_bestart(void)
             case StartupProcess:
                 lbeentry.st_backendType = B_STARTUP;
                 break;
+            case ArchiverProcess:
+                lbeentry.st_backendType = B_ARCHIVER;
+                break;
             case BgWriterProcess:
                 lbeentry.st_backendType = B_BG_WRITER;
                 break;
@@ -4276,6 +4279,9 @@ pgstat_get_backend_desc(BackendType backendType)

     switch (backendType)
     {
+        case B_ARCHIVER:
+            backendDesc = "archiver";
+            break;
         case B_AUTOVAC_LAUNCHER:
             backendDesc = "autovacuum launcher";
             break;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 55187eb..5df9333 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -146,7 +146,8 @@
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
 #define BACKEND_TYPE_BGWORKER    0x0008    /* bgworker process */
-#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
+#define BACKEND_TYPE_ARCHIVER    0x0010    /* archiver process */
+#define BACKEND_TYPE_ALL        0x001F    /* OR of all the above */

 #define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)

@@ -539,6 +540,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */

 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1785,7 +1787,7 @@ ServerLoop(void)

         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();

         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -3050,7 +3052,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();

@@ -3185,20 +3187,16 @@ reaper(SIGNAL_ARGS)
         }

         /*
-         * Was it the archiver?  If so, just try to start a new one; no need
-         * to force reset of the rest of the system.  (If fail, we'll try
-         * again in future cycles of the main loop.).  Unless we were waiting
-         * for it to shut down; don't restart it in that case, and
-         * PostmasterStateMachine() will advance to the next shutdown step.
+         * Was it the archiver?  Normal exit can be ignored; we'll start a new
+         * one at the next iteration of the postmaster's main loop, if
+         * necessary. Any other exit condition is treated as a crash.
          */
         if (pid == PgArchPID)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }

@@ -3446,7 +3444,7 @@ CleanupBackend(int pid,

 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3651,6 +3649,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }

+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3923,6 +3933,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5207,7 +5218,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();

         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5492,6 +5503,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
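
To summarize the reaper() change above: a clean archiver exit is ignored and the main loop starts a new archiver later if one is needed, while any other exit status is handed to HandleChildCrash(). This is also why pgarch_exit() now ends with exit(2). A minimal sketch of that decision, assuming POSIX wait status macros; demo_child_action and demo_classify_archiver_exit are hypothetical names and not part of the patch:

```c
#include <sys/wait.h>

/* Hypothetical result type, for illustration only. */
typedef enum demo_child_action
{
    DEMO_RESTART_LATER,         /* normal exit: main loop may start a new one */
    DEMO_TREAT_AS_CRASH         /* anything else: crash handling */
} demo_child_action;

/*
 * Classify a wait status as the patched reaper() does for the archiver:
 * only a normal exit with code 0 is harmless.  A non-zero exit code, such
 * as the exit(2) in pgarch_exit(), or death by a signal counts as a crash.
 */
static demo_child_action
demo_classify_archiver_exit(int exitstatus)
{
    if (WIFEXITED(exitstatus) && WEXITSTATUS(exitstatus) == 0)
        return DEMO_RESTART_LATER;
    return DEMO_TREAT_AS_CRASH;
}
```

The sketch takes the raw status value as returned by waitpid().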
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f985453..ef4cb6a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -397,6 +397,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -409,6 +410,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 7bc36c6..dfdac00 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -718,6 +718,7 @@ typedef struct PgStat_GlobalStats
  */
 typedef enum BackendType
 {
+    B_ARCHIVER,
     B_AUTOVAC_LAUNCHER,
     B_AUTOVAC_WORKER,
     B_BACKEND,
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index b320087..e3ffc63 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);

-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();

 #endif                            /* _PGARCH_H */
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 987580d..11aaef5 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    master server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   master process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   master process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index e3a43d3..809ad51 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1956,15 +1956,15 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);

+    /* Start a transaction so our commands have one to play into. */
+    StartTransactionCommand();
+
     /*
      * may be NULL if we couldn't find an entry (only happens if we are
      * forcing a vacuum for anti-wrap purposes).
      */
     dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);

-    /* Start a transaction so our commands have one to play into. */
-    StartTransactionCommand();
-
     /*
      * Clean up any dead statistics collector entries for this DB. We always
      * want to do this exactly once per DB-processing cycle, even if we find
@@ -2748,12 +2748,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
     if (isshared)
     {
         if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
+            tabentry = pgstat_fetch_stat_tabentry_extended(shared, relid);
     }
     else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
+        tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);

     return tabentry;
 }
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 0f5ba82..8ea8e52 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,23 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Statistics collector facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects per-table and per-function usage statistics of all backends in
+ *  shared memory. pg_count_*() and friends are the interface to locally store
+ *  backend activities during a transaction. Then pgstat_flush_stat() is called
+ *  at the end of a transaction to publish the local stats to shared memory.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ *  To avoid congestion on the shared memory, we update shared stats no more
+ *  often than once per PGSTAT_STAT_MIN_INTERVAL (500ms). In the case where
+ *  not all of the local numbers can be flushed immediately, we postpone the
+ *  remaining updates and retry after an interval of
+ *  PGSTAT_STAT_RETRY_INTERVAL (100ms), but we never wait longer than
+ *  PGSTAT_STAT_MAX_INTERVAL (1000ms).
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the stats collector creates the area and loads
+ *  the stored stats file, if any; the last process at shutdown writes the
+ *  shared stats to the file and then destroys the area before exiting.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -19,18 +27,6 @@
 #include "postgres.h"

 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif

 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -40,67 +36,42 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/fork_process.h"
 #include "postmaster/interrupt.h"
 #include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
-#include "utils/timestamp.h"

 /* ----------
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
-
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
+#define PGSTAT_STAT_MIN_INTERVAL    500 /* Minimum interval of stats data
+                                         * updates; in milliseconds. */

-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_STAT_RETRY_INTERVAL    100 /* Retry interval after
+                                         * PGSTAT_STAT_MIN_INTERVAL */

+#define PGSTAT_STAT_MAX_INTERVAL   1000 /* Longest interval of stats data
+                                         * updates; in milliseconds. */

 /* ----------
  * The initial size hints for the hash tables used in the collector.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
 #define PGSTAT_TAB_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512

@@ -116,6 +87,19 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)

+/*
+ * Operation mode and return code of pgstat_get_db_entry.
+ */
+#define    PGSTAT_SHARED        0
+#define    PGSTAT_EXCLUSIVE    1
+#define    PGSTAT_NOWAIT        2
+
+typedef enum PgStat_TableLookupResult
+{
+    NOT_FOUND,
+    FOUND,
+    LOCK_FAILED
+} PgStat_TableLookupResult;

 /* ----------
  * GUC parameters
@@ -131,31 +115,63 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used; kept until the corresponding GUCs are removed */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;

-/*
- * BgWriter global statistics counters (unused in other processes).
- * Stored directly in a stats message structure so it can be sent
- * without needing to copy things around.  We assume this inits to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
-
-/* ----------
- * Local data
- * ----------
- */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
+#define        StatsLock (&StatsShmem->StatsMainLock)

-static time_t last_pgstat_start_time;
+/* Shared stats bootstrap information */
+typedef struct StatsShmemStruct
+{
+    LWLock                StatsMainLock;        /* lock to protect this struct */
+    dsa_handle             stats_dsa_handle;    /* DSA handle for stats data */
+    dshash_table_handle db_hash_handle;
+    dsa_pointer            global_stats;
+    dsa_pointer            archiver_stats;
+    int                    refcount;
+} StatsShmemStruct;

-static bool pgStatRunningInCollector = false;
+/*
+ * BgWriter global statistics counters. The name contains a remnant from the
+ * time when the stats collector was a dedicated process, to which the
+ * counters were sent over a socket.
+ */
+PgStat_MsgBgWriter BgWriterStats = {0};
+
+/* backend-lifetime storage */
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
+static dshash_table *pgStatDBHash = NULL;
+
+
+/* parameters for shared hashes */
+static const dshash_parameters dsh_dbparams = {
+    sizeof(Oid),
+    SHARED_DBENT_SIZE,
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+static const dshash_parameters dsh_tblparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatTabEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+static const dshash_parameters dsh_funcparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatFuncEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};

 /*
  * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
+ * written to shared memory.
  *
  * NOTE: once allocated, TabStatusArray structures are never moved or deleted
  * for the life of the backend.  Also, we zero out the t_id fields of the
@@ -190,8 +206,8 @@ typedef struct TabStatHashEntry
 static HTAB *pgStatTabHash = NULL;

 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * Backends store per-function info that's waiting to be flushed out to shared
+ * memory in this hash table (indexed by function OID).
  */
 static HTAB *pgStatFunctions = NULL;

@@ -201,6 +217,71 @@ static HTAB *pgStatFunctions = NULL;
  */
 static bool have_function_stats = false;

+/* common header of snapshot entry in reader snapshot hash */
+typedef struct PgStat_snapshot
+{
+    Oid        key;
+    bool    negative;
+    void   *body;                /* end of header part: to keep alignment */
+} PgStat_snapshot;
+
+/* context struct for snapshot_statentry */
+typedef struct pgstat_snapshot_param
+{
+    char           *hash_name;            /* name of the snapshot hash */
+    int                hash_entsize;        /* element size of hash entry */
+    dshash_table_handle    dsh_handle;        /* dsh handle to attach */
+    const dshash_parameters *dsh_params;/* dshash params */
+    HTAB          **hash;                /* points to variable to hold hash */
+    dshash_table  **dshash;                /* ditto for dshash */
+} pgstat_snapshot_param;
+
+/*
+ * Backends store various database-wide info that's waiting to be flushed out
+ * to shared memory in these variables.
+ *
+ * checksum_failures is an exception in that it is a cluster-wide value.
+ */
+typedef struct BackendDBStats
+{
+    int        n_conflict_tablespace;
+    int        n_conflict_lock;
+    int        n_conflict_snapshot;
+    int        n_conflict_bufferpin;
+    int        n_conflict_startup_deadlock;
+    int        n_deadlocks;
+    size_t    n_tmpfiles;
+    size_t    tmpfilesize;
+    HTAB    *checksum_failures;
+} BackendDBStats;
+
+/* Hash entry struct for checksum_failures above */
+typedef struct ChecksumFailureEnt
+{
+    Oid    dboid;
+    int    count;
+} ChecksumFailureEnt;
+
+static BackendDBStats BeDBStats = {0};
+
+/* macros to check BeDBStats at once */
+
+/* checks if there is any conflict waiting to be flushed out */
+#define HAVE_PENDING_CONFLICTS() \
+    (BeDBStats.n_conflict_tablespace > 0 ||        \
+     BeDBStats.n_conflict_lock > 0 || BeDBStats.n_conflict_snapshot > 0 || \
+     BeDBStats.n_conflict_bufferpin > 0 ||        \
+     BeDBStats.n_conflict_startup_deadlock > 0)
+
+/* checks if there are any database-wide stats waiting to be flushed out */
+#define HAVE_PENDING_DBSTATS()                \
+    (HAVE_PENDING_CONFLICTS() ||        \
+     BeDBStats.n_deadlocks > 0 ||                \
+     BeDBStats.n_tmpfiles > 0 ||                \
+     /* no need to check tmpfilesize */        \
+     BeDBStats.checksum_failures != NULL)
+
+
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
  * into PgStat_TableStatus counters until we know if it is going to commit
@@ -236,11 +317,11 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;

-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
+static MemoryContext pgStatSnapshotContext = NULL;
+static HTAB *pgStatLocalHash = NULL;
+static bool    clear_snapshot = false;

 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -249,19 +330,35 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;

 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Context struct for pgstat_flush_* functions
+ *
+ * To avoid repeated attach/detach of the same dshash, a dshash once attached
+ * is stored in this structure and passed along across calls to multiple
+ * functions. "generation" here means the value returned by pin_hashes().
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
+typedef struct pgstat_flush_stat_context
+{
+    int    shgeneration;                /* "generation" of shdb_tabhash below */
+    PgStat_StatDBEntry *shdbentry;    /* dbentry for shared tables (oid = 0) */
+    dshash_table *shdb_tabhash;        /* tabentry dshash of shared tables */
+
+    int    mygeneration;                /* "generation" of mydb_tabhash below */
+    PgStat_StatDBEntry *mydbentry;    /* dbentry for my database */
+    dshash_table *mydb_tabhash;        /* tabentry dshash of my database */
+} pgstat_flush_stat_context;

 /*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
+ * Cluster-wide statistics.
+ *
+ * Contains statistics that are collected neither per database nor per
+ * table.  shared_* point to shared memory and snapshot_* are backend
+ * snapshots. Snapshot validity is indicated by global_snapshot_is_valid.
  */
-static List *pending_write_requests = NIL;
+static bool global_snapshot_is_valid = false;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats snapshot_globalStats;

 /*
  * Total time charged to functions so far in the current backend.
@@ -275,33 +372,40 @@ static instr_time total_func_time;
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgstat_beshutdown_hook(int code, Datum arg);
-
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
+static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, int op,
+                                    PgStat_TableLookupResult *status);
+static PgStat_StatTabEntry *pgstat_get_tab_entry(dshash_table *table,
                                                  Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static void pgstat_write_pgStatDBHashfile(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_pgStatDBHashfile(PgStat_StatDBEntry *dbentry);
 static void pgstat_read_current_status(void);
-
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
-
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
+static bool pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
+                                 PgStat_TableStatus *entry);
+static bool pgstat_flush_funcstats(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_flush_dbstats(pgstat_flush_stat_context *cxt, bool nowait);
+static bool pgstat_update_tabentry(dshash_table *tabhash,
+                                   PgStat_TableStatus *stat, bool nowait);
+static void pgstat_update_dbentry(PgStat_StatDBEntry *dbentry,
+                                  PgStat_TableStatus *stat);
 static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);

+static void pgstat_remove_useless_entries(const dshash_table_handle dshhandle,
+                              const dshash_parameters *dshparams,
+                              HTAB *oidtab);
 static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);

 static void pgstat_setup_memcxt(void);
+static void pgstat_flush_recovery_conflict(PgStat_StatDBEntry *dbentry);
+static void pgstat_flush_deadlock(PgStat_StatDBEntry *dbentry);
+static void pgstat_flush_checksum_failure(PgStat_StatDBEntry *dbentry);
+static void pgstat_flush_tempfile(PgStat_StatDBEntry *dbentry);
+static HTAB *create_tabstat_hash(void);
+static PgStat_SubXactStatus *get_tabstat_stack_level(int nest_level);
+static void add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level);
+static PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry_extended(PgStat_StatDBEntry *dbent, Oid funcid);
+static void pgstat_snapshot_global_stats(void);

 static const char *pgstat_get_wait_activity(WaitEventActivity w);
 static const char *pgstat_get_wait_client(WaitEventClient w);
@@ -309,560 +413,497 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);

-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+/* ------------------------------------------------------------
+ * Local support functions follow
+ * ------------------------------------------------------------
+ */
+static int pin_hashes(PgStat_StatDBEntry *dbentry);
+static void unpin_hashes(PgStat_StatDBEntry *dbentry, int generation);
+static dshash_table *attach_table_hash(PgStat_StatDBEntry *dbent, int gen);
+static dshash_table *attach_function_hash(PgStat_StatDBEntry *dbent, int gen);
+static void reset_dbentry_counters(PgStat_StatDBEntry *dbentry);

 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */

-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute space needed for stats collector's shared memory
+ */
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+/*
+ * StatsShmemInit - initialize during shared-memory creation
  */
 void
-pgstat_init(void)
+StatsShmemInit(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
+    bool    found;

-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);

-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
+    if (!IsUnderPostmaster)
     {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
-    }
+        Assert(!found);

-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+    }

-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
+    LWLockInitialize(StatsLock, LWTRANCHE_STATS);
+}

-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
+/* ----------
+ * pgstat_attach_shared_stats() -
+ *
+ *    Attach to the shared stats memory, creating it if it does not exist.
+ * ---------
+ */
+static void
+pgstat_attach_shared_stats(void)
+{
+    MemoryContext oldcontext;

-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    /*
+     * Don't use dsm in the postmaster or a standalone backend, or when not tracking counts.
+     */
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;

-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    pgstat_setup_memcxt();

-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    if (area)
+        return;

-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);

-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);

-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
+    if (StatsShmem->refcount > 0)
+        StatsShmem->refcount++;
+    else
+    {
+        /* Need to create shared memory area and load saved stats if any. */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);

-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        /* Initialize shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatDBHash = dshash_create(area, &dsh_dbparams, NULL);

-        test_byte++;            /* just make sure variable is changed */
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->global_stats =
+            dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+        StatsShmem->archiver_stats =
+            dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+        StatsShmem->db_hash_handle = dshash_get_hash_table_handle(pgStatDBHash);

-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);

-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        /* Load saved data if any. */
+        pgstat_read_statsfiles();

-        /* If we get here, we have a working socket */
-        break;
+        StatsShmem->refcount = 1;
     }

-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
+    LWLockRelease(StatsLock);

     /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
+     * If we're not the first process, attach existing shared stats area
+     * outside StatsLock.
      */
-    if (!pg_set_noblock(pgStatSock))
+    if (!area)
     {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
+        /* Shared area already exists. Just attach it. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatDBHash = dshash_attach(area, &dsh_dbparams,
+                                     StatsShmem->db_hash_handle, NULL);
+
+        /* Setup local variables */
+        pgStatLocalHash = NULL;
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
     }

-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
+    MemoryContextSwitchTo(oldcontext);

-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
+    /* don't detach automatically */
+    dsa_pin_mapping(area);
+    global_snapshot_is_valid = false;
+}

-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
+/* ----------
+ * pgstat_detach_shared_stats() -
+ *
+ *    Detach shared stats. Write out to file if we're the last process and told
+ *    to do so.
+ * ----------
+ */
+static void
+pgstat_detach_shared_stats(bool write_stats)
+{
+    /* return immediately if there is nothing to detach */
+    if (!area || !IsUnderPostmaster)
+        return;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+    /* write out the shared stats to file if needed */
+    if (--StatsShmem->refcount < 1)
+    {
+        if (write_stats)
+            pgstat_write_statsfiles();
+
+        /* We're the last process. Invalidate the dsa area handle. */
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
     }

-    pg_freeaddrinfo_all(hints.ai_family, addrs);
+    LWLockRelease(StatsLock);

-    /* Now that we have a long-lived socket, tell fd.c about it. */
-    ReserveExternalFD();
+    /*
+     * Detach the area. It is automatically destroyed when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);

-    return;
+    area = NULL;
+    pgStatDBHash = NULL;
+    shared_globalStats = NULL;
+    shared_archiverStats = NULL;
+    pgStatLocalHash = NULL;
+    global_snapshot_is_valid = false;
+}

-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
+/*
+ * pgstat_reset_all() -
+ *
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
+ */
+void
+pgstat_reset_all(void)
+{
+    /* standalone server doesn't use shared stats */
+    if (!IsUnderPostmaster)
+        return;

-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
+    /* we must have shared stats attached */
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);

-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
+    /* Startup must be the only user of shared stats */
+    Assert(StatsShmem->refcount == 1);

     /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
+     * We could directly remove files and recreate the shared memory area, but
+     * we detach and then re-attach for simplicity.
      */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    pgstat_detach_shared_stats(false);    /* Don't write */
+    pgstat_attach_shared_stats();
 }

-/*
- * subroutine for pgstat_reset_all
+/* ------------------------------------------------------------
+ * Public functions used by backends follow
+ *------------------------------------------------------------
  */
-static void
-pgstat_reset_remove_files(const char *directory)
+
+/* ----------
+ * pgstat_report_stat() -
+ *
+ *    Must be called by processes that perform DML: tcop/postgres.c, logical
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *  Updates are applied no more frequently than once every
+ *  PGSTAT_STAT_MIN_INTERVAL milliseconds. They are also postponed on lock
+ *  failure if force is false and no update has been pending for longer than
+ *  PGSTAT_STAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
+ *  succeeding calls of this function.
+ *
+ *    Returns the time in milliseconds until the next attempt to apply updates
+ *    should be made, or 0 if no updates remain pending after this call
+ *    (including the case where there was nothing to do).
+ *
+ *    Note that this is called only out of a transaction, so it is fine to use
+ *    transaction stop time as an approximation of current time.
+ *    ----------
+ */
+long
+pgstat_report_stat(bool force)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    static TimestampTz next_flush = 0;
+    static TimestampTz pending_since = 0;
+    TimestampTz now;
+    pgstat_flush_stat_context cxt = {0};
+    bool        pending_stats = false;
+    long        elapsed;
+    long        secs;
+    int            usecs;
+
+    /* Don't expend a clock check if nothing to do */
+    if (area == NULL ||
+        ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
+         pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
+         !HAVE_PENDING_DBSTATS() && !have_function_stats))
+        return 0;
+
+    now = GetCurrentTransactionStopTimestamp();

-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
+    if (!force)
     {
-        int            nchars;
-        Oid            tmp_oid;
+        /*
+         * Don't flush stats until it is time to do so.  Return the time to
+         * wait in milliseconds.
+         */
+        if (now < next_flush)
+        {
+            /* Record the time of the oldest pending update if not yet set. */
+            if (pending_since == 0)
+                pending_since = now;
+
+            /* now < next_flush here */
+            return (next_flush - now) / 1000;
+        }

         /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
+         * Don't keep pending updates longer than PGSTAT_STAT_MAX_INTERVAL.
          */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
+        if (pending_since > 0)
         {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
+            TimestampDifference(pending_since, now, &secs, &usecs);
+            elapsed = secs * 1000 + usecs / 1000;
+
+            if (elapsed > PGSTAT_STAT_MAX_INTERVAL)
+                force = true;
         }
+    }

-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
+    /* Flush out table stats */
+    if (pgStatTabList != NULL && !pgstat_flush_stat(&cxt, !force))
+        pending_stats = true;

-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+    /* Flush out function stats */
+    if (pgStatFunctions != NULL && !pgstat_flush_funcstats(&cxt, !force))
+        pending_stats = true;
+
+    /* Flush out database-wide stats */
+    if (HAVE_PENDING_DBSTATS())
+    {
+        if (!pgstat_flush_dbstats(&cxt, !force))
+            pending_stats = true;
     }
-    FreeDir(dir);
-}

-/*
- * pgstat_reset_all() -
- *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
- */
-void
-pgstat_reset_all(void)
-{
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
-}
+    /* Unpin dbentry if pinned */
+    if (cxt.mydb_tabhash)
+    {
+        dshash_detach(cxt.mydb_tabhash);
+        unpin_hashes(cxt.mydbentry, cxt.mygeneration);
+        cxt.mydb_tabhash = NULL;
+        cxt.mydbentry = NULL;
+    }

-#ifdef EXEC_BACKEND
+    /* Publish the last flush time */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (shared_globalStats->stats_timestamp < now)
+        shared_globalStats->stats_timestamp = now;
+    LWLockRelease(StatsLock);

-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
-    char       *av[10];
-    int            ac = 0;
+    /* Record how long we have been keeping pending updates. */
+    if (pending_stats)
+    {
+        /* Preserve the first value */
+        if (pending_since == 0)
+            pending_since = now;

-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
+        /*
+         * The retry interval may push the pending time past the limit set by
+         * PGSTAT_STAT_MAX_INTERVAL. We don't bother about that since the
+         * excess is small.
+         */
+        return PGSTAT_STAT_RETRY_INTERVAL;
+    }

-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
+    /* Set the next time to update stats */
+    next_flush = now + PGSTAT_STAT_MIN_INTERVAL * 1000;
+    pending_since = 0;

-    return postmaster_forkexec(ac, av);
+    return 0;
 }
-#endif                            /* EXEC_BACKEND */
-

 /*
- * pgstat_start() -
+ * snapshot_statentry() - Common routine for functions
+ *                             pgstat_fetch_stat_*entry()
  *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
+ *  Returns the pointer to a snapshot of a shared entry for the key or NULL if
+ *  not found. Returned snapshots are stable during the current transaction or
+ *  until pgstat_clear_snapshot() is called.
  *
- *    Returns PID of child process, or 0 if fail.
+ *  The snapshots are stored in a hash, a pointer to which is stored in the
+ *  HTAB * variable pointed to by cxt->hash. If not created yet, it is created
+ *  using hash_name and hash_entsize in cxt.
  *
- *    Note: if fail, we will be called again from the postmaster main loop.
+ *  cxt->dshash points to dshash_table for dbstat entries. If not yet
+ *  attached, it is attached using cxt->dsh_handle.
  */
-int
-pgstat_start(void)
+static void *
+snapshot_statentry(pgstat_snapshot_param *cxt, Oid key)
 {
-    time_t        curtime;
-    pid_t        pgStatPid;
+    PgStat_snapshot *lentry = NULL;
+    size_t keysize = cxt->dsh_params->key_size;
+    size_t dsh_entrysize = cxt->dsh_params->entry_size;
+    bool found;

     /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
+     * Don't update the stats snapshot too frequently. Keep it for at least
+     * PGSTAT_STAT_MIN_INTERVAL ms. An earlier clear request is ignored, not postponed.
      */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
+    if (clear_snapshot)
+    {
+        clear_snapshot = false;
+
+        if (pgStatSnapshotContext &&
+            snapshot_globalStats.stats_timestamp <
+            GetCurrentStatementStartTimestamp() -
+            PGSTAT_STAT_MIN_INTERVAL * 1000)
+        {
+            MemoryContextReset(pgStatSnapshotContext);
+
+            /* Reset variables */
+            global_snapshot_is_valid = false;
+            pgStatSnapshotContext = NULL;
+            pgStatLocalHash = NULL;
+
+            pgstat_setup_memcxt();
+        }
+    }

     /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
+     * Create new hash, with rather arbitrary initial number of entries since
+     * we don't know how this hash will grow.
      */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
+    if (!*cxt->hash)
+    {
+        HASHCTL ctl;
+
+        /*
+         * Create the hash in the stats context
+         *
+         * Each entry is prefixed with a common header represented by
+         * PgStat_snapshot.
+         */
+
+        ctl.keysize        = keysize;
+        ctl.entrysize    = offsetof(PgStat_snapshot, body) + cxt->hash_entsize;
+        ctl.hcxt        = pgStatSnapshotContext;
+        *cxt->hash = hash_create(cxt->hash_name, 32, &ctl,
+                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    }
+
+    lentry = hash_search(*cxt->hash, &key, HASH_ENTER, &found);

     /*
-     * Okay, fork off the collector.
+     * Refer to the shared hash if the entry is not found in the local hash.
+     * Outside a transaction we return up-to-date entries, so consult the
+     * shared hash in that case even if a snapshot was found.
      */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
+    if (!found || !IsTransactionState())
     {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
+        void *sentry;

-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
+        /* attach shared hash if not given, leave it alone for later use */
+        if (!*cxt->dshash)
+        {
+            MemoryContext oldcxt;

-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
+            Assert(cxt->dsh_handle != DSM_HANDLE_INVALID);
+            oldcxt = MemoryContextSwitchTo(pgStatSnapshotContext);
+            *cxt->dshash =
+                dshash_attach(area, cxt->dsh_params, cxt->dsh_handle, NULL);
+            MemoryContextSwitchTo(oldcxt);
+        }

-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
+        sentry = dshash_find(*cxt->dshash, &key, false);

-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
+        if (sentry)
+        {
+            /*
+             * In transaction state, we must create local cache entries for
+             * consistency. Otherwise we return an up-to-date entry. Even
+             * then we need a local copy, since the dshash entry must be
+             * released immediately. We reuse the same local hash entry for
+             * that purpose.
+             */
+            memcpy(&lentry->body, sentry, dsh_entrysize);
+            dshash_release_lock(*cxt->dshash, sentry);

-        default:
-            return (int) pgStatPid;
+            /* then zero out the local additional space if any */
+            if (dsh_entrysize < cxt->hash_entsize)
+                MemSet((char *)&lentry->body + dsh_entrysize, 0,
+                       cxt->hash_entsize - dsh_entrysize);
+        }
+
+        lentry->negative = !sentry;
     }

-    /* shouldn't get here */
-    return 0;
-}
+    if (lentry->negative)
+        return NULL;

-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
+    return &lentry->body;
 }

-/* ------------------------------------------------------------
- * Public functions used by backends follow
- *------------------------------------------------------------
- */
-
-
-/* ----------
- * pgstat_report_stat() -
+/*
+ * pgstat_flush_stat: Flushes table stats out to shared statistics.
  *
- *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
- *    transaction stop time as an approximation of current time.
- * ----------
+ *  If nowait is true, returns false if required lock was not acquired
+ *  immediately. In that case, unapplied table stats updates are left alone in
+ *  TabStatusArray to wait for the next chance. cxt holds some dshash related
+ *  values that we want to carry around while updating shared stats.
+ *
+ *  Returns true if all stats info is flushed. Caller must detach dshashes
+ *  stored in cxt after use.
  */
-void
-pgstat_report_stat(bool force)
+static bool
+pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait)
 {
-    /* we assume this inits to all zeroes: */
     static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
-    TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
     TabStatusArray *tsa;
-    int            i;
+    HTAB           *new_tsa_hash = NULL;
+    TabStatusArray *dest_tsa = pgStatTabList;
+    int                dest_elem = 0;
+    int                i;

-    /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
-
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
-    now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
+    /* nothing to do, just return */
+    if (pgStatTabHash == NULL)
+        return true;

     /*
      * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
+     * entries it points to.
      */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
+    hash_destroy(pgStatTabHash);
     pgStatTabHash = NULL;

     /*
      * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
+     * have counts, and try flushing them out to shared stats. We may fail on
+     * some entries in the array; those entries are left packed at the
+     * beginning of the array.
      */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
     for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
     {
         for (i = 0; i < tsa->tsa_used; i++)
         {
             PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;

             /* Shouldn't have any pending transaction-dependent counts */
             Assert(entry->trans == NULL);
@@ -875,178 +916,352 @@ pgstat_report_stat(bool force)
                        sizeof(PgStat_TableCounts)) == 0)
                 continue;

-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* try to apply the tab stats */
+            if (!pgstat_flush_tabstat(cxt, nowait, entry))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                /*
+                 * Failed. Move it to the beginning of TabStatusArray and
+                 * leave it there for the next attempt.
+                 */
+                TabStatHashEntry *hash_entry;
+                bool found;
+
+                if (new_tsa_hash == NULL)
+                    new_tsa_hash = create_tabstat_hash();
+
+                /* Create hash entry for this entry */
+                hash_entry = hash_search(new_tsa_hash, &entry->t_id,
+                                         HASH_ENTER, &found);
+                Assert(!found);
+
+                /*
+                 * Move insertion pointer to the next segment if the segment
+                 * is filled up.
+                 */
+                if (dest_elem >= TABSTAT_QUANTUM)
+                {
+                    Assert(dest_tsa->tsa_next != NULL);
+                    dest_tsa = dest_tsa->tsa_next;
+                    dest_elem = 0;
+                }
+
+                /*
+                 * Pack the entry at the beginning of the array. Do nothing if
+                 * it doesn't need to be moved.
+                 */
+                if (tsa != dest_tsa || i != dest_elem)
+                {
+                    PgStat_TableStatus *new_entry;
+                    new_entry = &dest_tsa->tsa_entries[dest_elem];
+                    *new_entry = *entry;
+
+                    /* use new_entry as entry hereafter */
+                    entry = new_entry;
+                }
+
+                hash_entry->tsa_entry = entry;
+                dest_elem++;
             }
         }
-        /* zero out PgStat_TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
     }

+    /* zero out unused area of TableStatus */
+    dest_tsa->tsa_used = dest_elem;
+    MemSet(&dest_tsa->tsa_entries[dest_elem], 0,
+           (TABSTAT_QUANTUM - dest_elem) * sizeof(PgStat_TableStatus));
+    while (dest_tsa->tsa_next)
+    {
+        dest_tsa = dest_tsa->tsa_next;
+        MemSet(dest_tsa->tsa_entries, 0,
+               dest_tsa->tsa_used * sizeof(PgStat_TableStatus));
+        dest_tsa->tsa_used = 0;
+    }
+
+    /* and set the new TabStatusArray hash if any */
+    pgStatTabHash = new_tsa_hash;
+
     /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
+     * We no longer need the shared database and table entries, but those
+     * for my database may be used later.
      */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
-
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+    if (cxt->shdb_tabhash)
+    {
+        dshash_detach(cxt->shdb_tabhash);
+        unpin_hashes(cxt->shdbentry, cxt->shgeneration);
+        cxt->shdb_tabhash = NULL;
+        cxt->shdbentry = NULL;
+    }
+
+    return pgStatTabHash == NULL;
 }

+/* -------
+ * Subroutines for pgstat_flush_stat.
+ * -------
+ */
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * pgstat_flush_tabstat: Flushes a table stats entry.
+ *
+ *  If nowait is true, returns false on lock failure.  Dshashes for table and
+ *  function stats are kept attached in cxt. The caller must detach them after
+ *  use.
+ *
+ *  Returns true if the entry is flushed out.
  */
-static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+bool
+pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
+                     PgStat_TableStatus *entry)
 {
-    int            n;
-    int            len;
+    Oid        dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
+    int        table_mode = PGSTAT_EXCLUSIVE;
+    bool    updated = false;
+    dshash_table *tabhash;
+    PgStat_StatDBEntry *dbent;
+    int        generation;
+
+    if (nowait)
+        table_mode |= PGSTAT_NOWAIT;
+
+    /* Attach the required table hash if not yet attached. */
+    if ((entry->t_shared ? cxt->shdb_tabhash : cxt->mydb_tabhash) == NULL)
+    {
+        /*
+         *  Return if we don't have the corresponding dbentry. It may have
+         *  been removed, or with nowait the lock may not have been acquired.
+         */
+        dbent = pgstat_get_db_entry(dboid, table_mode, NULL);
+        if (!dbent)
+            return false;

-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+        /*
+         * We don't hold lock on the dbentry since it cannot be dropped while
+         * we are working on it.
+         */
+        generation = pin_hashes(dbent);
+        tabhash = attach_table_hash(dbent, generation);

-    /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
-     */
-    if (OidIsValid(tsmsg->m_databaseid))
-    {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
-        pgStatXactCommit = 0;
-        pgStatXactRollback = 0;
-        pgStatBlockReadTime = 0;
-        pgStatBlockWriteTime = 0;
+        if (entry->t_shared)
+        {
+            cxt->shgeneration = generation;
+            cxt->shdbentry = dbent;
+            cxt->shdb_tabhash = tabhash;
+        }
+        else
+        {
+            cxt->mygeneration = generation;
+            cxt->mydbentry = dbent;
+            cxt->mydb_tabhash = tabhash;
+
+            /*
+             * We come here once per database. Take the chance to update
+             * database-wide stats.
+             */
+            LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
+            dbent->n_xact_commit += pgStatXactCommit;
+            dbent->n_xact_rollback += pgStatXactRollback;
+            dbent->n_block_read_time += pgStatBlockReadTime;
+            dbent->n_block_write_time += pgStatBlockWriteTime;
+            LWLockRelease(&dbent->lock);
+            pgStatXactCommit = 0;
+            pgStatXactRollback = 0;
+            pgStatBlockReadTime = 0;
+            pgStatBlockWriteTime = 0;
+        }
+    }
+    else if (entry->t_shared)
+    {
+        dbent = cxt->shdbentry;
+        tabhash = cxt->shdb_tabhash;
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        dbent = cxt->mydbentry;
+        tabhash = cxt->mydb_tabhash;
     }

-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);

-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    /*
+     * Local table stats should be applied to both dbentry and tabentry at
+     * once. Update dbentry only if we could update tabentry.
+     */
+    if (pgstat_update_tabentry(tabhash, entry, nowait))
+    {
+        pgstat_update_dbentry(dbent, entry);
+        updated = true;
+    }
+
+    return updated;
 }

 /*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
+ * pgstat_flush_funcstats: Flushes function stats.
+ *
+ *  If nowait is true, returns false on lock failure. Unapplied local hash
+ *  entries are left alone.
+ *
+ *  Returns true if all entries are flushed out.
  */
-static void
-pgstat_send_funcstats(void)
+static bool
+pgstat_flush_funcstats(pgstat_flush_stat_context *cxt, bool nowait)
 {
     /* we assume this inits to all zeroes: */
     static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
+    dshash_table   *funchash;
     HASH_SEQ_STATUS fstat;
+    PgStat_BackendFunctionEntry *bestat;

+    /* nothing to do, just return */
     if (pgStatFunctions == NULL)
-        return;
+        return true;
+
+    /* get dbentry into cxt if not yet */
+    if (cxt->mydbentry == NULL)
+    {
+        int op = PGSTAT_EXCLUSIVE;
+
+        if (nowait)
+            op |= PGSTAT_NOWAIT;
+
+        cxt->mydbentry = pgstat_get_db_entry(MyDatabaseId, op, NULL);
+
+        if (cxt->mydbentry == NULL)
+            return false;
+
+        cxt->mygeneration = pin_hashes(cxt->mydbentry);
+    }
+
+    funchash = attach_function_hash(cxt->mydbentry, cxt->mygeneration);
+    if (funchash == NULL)
+        return false;

-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
+    have_function_stats = false;

+    /*
+     * Scan through the pgStatFunctions to find functions that actually have
+     * counts, and try flushing them out to shared stats.
+     */
     hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    while ((bestat = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
     {
-        PgStat_FunctionEntry *m_ent;
+        bool found;
+        PgStat_StatFuncEntry *funcent = NULL;

-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
+        /* Skip it if no counts accumulated for it so far */
+        if (memcmp(&bestat->f_counts, &all_zeroes,
                    sizeof(PgStat_FunctionCounts)) == 0)
             continue;

-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
+        funcent = (PgStat_StatFuncEntry *)
+            dshash_find_or_insert_extended(funchash, (void *) &(bestat->f_id),
+                                           &found, nowait);
+
+        /*
+         * We couldn't acquire lock on the required entry. Leave the local
+         * entry alone.
+         */
+        if (!funcent)
+        {
+            have_function_stats = true;
+            continue;
+        }

-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
+        /* Initialize if it's new, or add to it. */
+        if (!found)
+        {
+            funcent->functionid = bestat->f_id;
+            funcent->f_numcalls = bestat->f_counts.f_numcalls;
+            funcent->f_total_time =
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_total_time);
+            funcent->f_self_time =
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_self_time);
+        }
+        else
         {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
+            funcent->f_numcalls += bestat->f_counts.f_numcalls;
+            funcent->f_total_time +=
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_total_time);
+            funcent->f_self_time +=
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_self_time);
         }
+        dshash_release_lock(funchash, funcent);

-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
+        /* reset used counts */
+        MemSet(&bestat->f_counts, 0, sizeof(PgStat_FunctionCounts));
     }

-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
+    return !have_function_stats;
 }

+/*
+ * pgstat_flush_dbstats: Flushes out miscellaneous database stats.
+ *
+ *  If nowait is true, returns false on lock failure on dbentry.
+ *
+ *  Returns true if all stats are flushed out.
+ */
+static bool
+pgstat_flush_dbstats(pgstat_flush_stat_context *cxt, bool nowait)
+{
+    /* get dbentry if not yet */
+    if (cxt->mydbentry == NULL)
+    {
+        int op = PGSTAT_EXCLUSIVE;
+        if (nowait)
+            op |= PGSTAT_NOWAIT;
+
+        cxt->mydbentry = pgstat_get_db_entry(MyDatabaseId, op, NULL);
+
+        /* return if lock failed. */
+        if (cxt->mydbentry == NULL)
+            return false;
+
+        /* we use this generation of table/function stats in this turn */
+        cxt->mygeneration = pin_hashes(cxt->mydbentry);
+    }
+
+    LWLockAcquire(&cxt->mydbentry->lock, LW_EXCLUSIVE);
+    if (HAVE_PENDING_CONFLICTS())
+        pgstat_flush_recovery_conflict(cxt->mydbentry);
+    if (BeDBStats.n_deadlocks != 0)
+        pgstat_flush_deadlock(cxt->mydbentry);
+    if (BeDBStats.n_tmpfiles != 0)
+        pgstat_flush_tempfile(cxt->mydbentry);
+    if (BeDBStats.checksum_failures != NULL)
+        pgstat_flush_checksum_failure(cxt->mydbentry);
+    LWLockRelease(&cxt->mydbentry->lock);
+
+    return true;
+}

 /* ----------
  * pgstat_vacuum_stat() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *    Remove objects we can get rid of.
  * ----------
  */
 void
 pgstat_vacuum_stat(void)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
+    HTAB       *oidtab;
+    dshash_seq_status dshstat;
     PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;

-    if (pgStatSock == PGINVALID_SOCKET)
+    /* we don't collect stats under standalone mode */
+    if (!IsUnderPostmaster)
         return;

     /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
      * Read pg_database and make a list of OIDs of all existing databases
      */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    oidtab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);

     /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
+     * Search the database hash table for dead databases and drop them
+     * from the hash.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+
+    dshash_seq_init(&dshstat, pgStatDBHash, false, true);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            dbid = dbentry->databaseid;

@@ -1054,137 +1269,43 @@ pgstat_vacuum_stat(void)

         /* the DB entry for shared tables (with InvalidOid) is never dropped */
         if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
+            hash_search(oidtab, (void *) &dbid, HASH_FIND, NULL) == NULL)
             pgstat_drop_database(dbid);
     }

     /* Clean up */
-    hash_destroy(htab);
+    hash_destroy(oidtab);

     /*
      * Lookup our own database entry; if not found, nothing more to do.
      */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_EXCLUSIVE, NULL);
+    if (!dbentry)
         return;

     /*
      * Similarly to above, make a list of all known relations in this DB.
      */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
+    oidtab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);

     /*
      * Check for all tables listed in stats hashtable if they still exist.
+     * Stats cache is useless here so directly search the shared hash.
      */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            tabid = tabentry->tableid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
-            continue;
-
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
-    }
-
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
+    pgstat_remove_useless_entries(dbentry->tables, &dsh_tblparams, oidtab);

     /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
+     * Repeat the above for functions, but we needn't bother in the common
+     * case where no function stats are being collected.
      */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
+    if (dbentry->functions != DSM_HANDLE_INVALID)
     {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
-        {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
-                continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
-        }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
+        oidtab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);

-        hash_destroy(htab);
+        pgstat_remove_useless_entries(dbentry->functions, &dsh_funcparams,
+                                      oidtab);
     }
+    dshash_release_lock(pgStatDBHash, dbentry);
 }


@@ -1238,66 +1359,99 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     return htab;
 }

-
-/* ----------
- * pgstat_drop_database() -
+/*
+ * pgstat_remove_useless_entries - Remove useless entries from per
+ * table/function dshashes.
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
- * ----------
+ *  Scan the dshash specified by dshhandle, removing entries that are not in
+ *  oidtab. oidtab is destroyed before returning.
  */
 void
-pgstat_drop_database(Oid databaseid)
+pgstat_remove_useless_entries(const dshash_table_handle dshhandle,
+                              const dshash_parameters *dshparams,
+                              HTAB *oidtab)
 {
-    PgStat_MsgDropdb msg;
+    dshash_table *dshtable;
+    dshash_seq_status dshstat;
+    void         *ent;

-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    dshtable = dshash_attach(area, dshparams, dshhandle, 0);
+    dshash_seq_init(&dshstat, dshtable, false, true);

-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
+    {
+        CHECK_FOR_INTERRUPTS();
+
+        /* The first member of each entry must be the Oid used as the key */
+        if (hash_search(oidtab, ent, HASH_FIND, NULL) != NULL)
+            continue;

+        /* Not there, so purge this entry */
+        dshash_delete_entry(dshtable, ent);
+    }
+    dshash_detach(dshtable);
+    hash_destroy(oidtab);
+}

 /* ----------
- * pgstat_drop_relation() -
+ * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
+ *    Remove entry for the database that we just dropped.
  *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
+ *    If some stats are flushed after this, this entry will be re-created but we
+ *    will still clean the dead DB eventually via future invocations of
+ *    pgstat_vacuum_stat().
  * ----------
  */
-#ifdef NOT_USED
 void
-pgstat_drop_relation(Oid relid)
+pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgTabpurge msg;
-    int            len;
+    PgStat_StatDBEntry *dbentry;
+
+    Assert(OidIsValid(databaseid));

-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatDBHash)
         return;

-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
+    /*
+     * Lookup the database in the hashtable with exclusive lock.
+     */
+    dbentry = pgstat_get_db_entry(databaseid, PGSTAT_EXCLUSIVE, NULL);
+
+    /*
+     * If found, remove it.
+     */
+    if (dbentry)
+    {
+        /* LWLock is needed to modify the entry */
+        LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);

-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
+        /* No one is using tables/functions in this dbentry */
+        Assert(dbentry->refcnt == 0);

-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
-}
-#endif                            /* NOT_USED */
+        /* Remove table/function stats dshash first. */
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+        }
+        LWLockRelease(&dbentry->lock);

+        dshash_delete_entry(pgStatDBHash, (void *) dbentry);
+    }
+}

 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1306,20 +1460,32 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    PgStat_StatDBEntry       *dbentry;
+    PgStat_TableLookupResult status;
+
+    if (!pgStatDBHash)
+        return;

-    if (pgStatSock == PGINVALID_SOCKET)
+    /*
+     * Lookup the database in the hashtable.  Nothing to do if not there.
+     */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_EXCLUSIVE, &status);
+
+    if (!dbentry)
         return;

-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    /* This database is active, safe to release the lock immediately. */
+    dshash_release_lock(pgStatDBHash, dbentry);
+
+    /* Reset database-level stats. */
+    reset_dbentry_counters(dbentry);
+
 }

 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1328,29 +1494,37 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
+    /* Reset the archiver statistics for the cluster. */
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    /* Reset the bgwriter statistics for the cluster. */
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
 }

 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1359,20 +1533,45 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
+    int generation;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId, PGSTAT_EXCLUSIVE, NULL);

-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!dbentry)
         return;

-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* Pin the current generation of table/function hashes while we work. */
+    generation = pin_hashes(dbentry);

-    pgstat_send(&msg, sizeof(msg));
-}
+    /* Set the reset timestamp for the whole database */
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&dbentry->lock);

-/* ----------
+    /* Remove object if it exists, ignore if not */
+    if (type == RESET_TABLE)
+    {
+        dshash_table *t = attach_table_hash(dbentry, generation);
+        dshash_delete_key(t, (void *) &objoid);
+        dshash_detach(t);
+    }
+
+    if (type == RESET_FUNCTION)
+    {
+        dshash_table *t = attach_function_hash(dbentry, generation);
+        if (t)
+        {
+            dshash_delete_key(t, (void *) &objoid);
+            dshash_detach(t);
+        }
+    }
+    unpin_hashes(dbentry, generation);
+}
+
+/* ----------
  * pgstat_report_autovac() -
  *
  *    Called from autovacuum.c to report startup of an autovacuum process.
@@ -1383,48 +1582,81 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;

-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if we are not collecting stats */
+    if (!area)
         return;

-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    /*
+     * Store the last autovacuum time in the database's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_EXCLUSIVE, NULL);
+    dshash_release_lock(pgStatDBHash, dbentry);
+
+    ts = GetCurrentTimestamp();

-    pgstat_send(&msg, sizeof(msg));
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->last_autovac_time = ts;
+    LWLockRelease(&dbentry->lock);
 }


 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report about the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
+    int                    generation;

-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;

-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hash table entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_EXCLUSIVE, NULL);
+    generation = pin_hashes(dbentry);
+    table = attach_table_hash(dbentry, generation);
+
+    tabentry = pgstat_get_tab_entry(table, tableoid, true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = GetCurrentTimestamp();
+        tabentry->vacuum_count++;
+    }
+    dshash_release_lock(table, tabentry);
+
+    dshash_detach(table);
+    unpin_hashes(dbentry, generation);
 }

 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report about the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1435,9 +1667,14 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    Oid                    dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table       *table;
+    int                    generation;

-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;

     /*
@@ -1466,78 +1703,153 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }

-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hashtable entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, PGSTAT_EXCLUSIVE, NULL);
+    generation = pin_hashes(dbentry);
+    table = attach_table_hash(dbentry, generation);
+    tabentry = pgstat_get_tab_entry(table, RelationGetRelid(rel), true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    dshash_release_lock(table, tabentry);
+
+    dshash_detach(table);
+    unpin_hashes(dbentry, generation);
 }

 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupResult status;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            BeDBStats.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            BeDBStats.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            BeDBStats.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            BeDBStats.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            BeDBStats.n_conflict_startup_deadlock++;
+            break;
+    }
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);

-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    if (status == LOCK_FAILED)
         return;

-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    /* We had a chance to flush immediately */
+    pgstat_flush_recovery_conflict(dbentry);
+
+    dshash_release_lock(pgStatDBHash, dbentry);
+}
+
+/*
+ * flush recovery conflict stats
+ */
+static void
+pgstat_flush_recovery_conflict(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_conflict_tablespace    += BeDBStats.n_conflict_tablespace;
+    dbentry->n_conflict_lock         += BeDBStats.n_conflict_lock;
+    dbentry->n_conflict_snapshot    += BeDBStats.n_conflict_snapshot;
+    dbentry->n_conflict_bufferpin    += BeDBStats.n_conflict_bufferpin;
+    dbentry->n_conflict_startup_deadlock += BeDBStats.n_conflict_startup_deadlock;
+
+    BeDBStats.n_conflict_tablespace = 0;
+    BeDBStats.n_conflict_lock = 0;
+    BeDBStats.n_conflict_snapshot = 0;
+    BeDBStats.n_conflict_bufferpin = 0;
+    BeDBStats.n_conflict_startup_deadlock = 0;
 }

 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupResult status;

-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;

-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
-}
-
+    BeDBStats.n_deadlocks++;

+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);

-/* --------
- * pgstat_report_checksum_failures_in_db() -
- *
- *    Tell the collector about one or more checksum failures.
- * --------
- */
-void
-pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
-{
-    PgStat_MsgChecksumFailure msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    if (status == LOCK_FAILED)
         return;

-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
+    dshash_release_lock(pgStatDBHash, dbentry);
+}

-    pgstat_send(&msg, sizeof(msg));
+/*
+ * flush deadlock stats
+ */
+static void
+pgstat_flush_deadlock(PgStat_StatDBEntry *dbentry)
+{
+    dbentry->n_deadlocks += BeDBStats.n_deadlocks;
+    BeDBStats.n_deadlocks = 0;
 }

 /* --------
@@ -1555,60 +1867,153 @@ pgstat_report_checksum_failure(void)
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_TableLookupResult status;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    if (filesize > 0) /* Is there a case where filesize is really 0? */
+    {
+        BeDBStats.tmpfilesize += filesize; /* needs an overflow check */
+        BeDBStats.n_tmpfiles++;
+    }

-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    if (BeDBStats.n_tmpfiles == 0)
         return;

-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
+
+    if (status == LOCK_FAILED)
+        return;
+
+    /* We had a chance to flush immediately */
+    pgstat_flush_tempfile(dbentry);
+
+    dshash_release_lock(pgStatDBHash, dbentry);
 }

+/*
+ * flush temporary file stats
+ */
+static void
+pgstat_flush_tempfile(PgStat_StatDBEntry *dbentry)
+{

-/* ----------
- * pgstat_ping() -
+    dbentry->n_temp_bytes += BeDBStats.tmpfilesize;
+    dbentry->n_temp_files += BeDBStats.n_tmpfiles;
+    BeDBStats.tmpfilesize = 0;
+    BeDBStats.n_tmpfiles = 0;
+}
+
+/* --------
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
  *
- *    Send some junk data to the collector to increase traffic.
- * ----------
+ *    Report one or more checksum failures.
+ * --------
  */
 void
-pgstat_ping(void)
+pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
 {
-    PgStat_MsgDummy msg;
+    PgStat_StatDBEntry       *dbentry;
+    PgStat_TableLookupResult status;
+    ChecksumFailureEnt       *failent = NULL;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    if (BeDBStats.checksum_failures != NULL)
+    {
+        failent = hash_search(BeDBStats.checksum_failures, &dboid,
+                              HASH_FIND, NULL);
+        if (failent)
+            failurecount += failent->count;
+    }
+
+    if (failurecount == 0)
+        return;
+
+    dbentry = pgstat_get_db_entry(MyDatabaseId,
+                                  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
+                                  &status);
+
+    if (status == LOCK_FAILED)
+    {
+        if (!failent)
+        {
+            if (!BeDBStats.checksum_failures)
+            {
+                HASHCTL    ctl;

-    if (pgStatSock == PGINVALID_SOCKET)
+                ctl.keysize = sizeof(Oid);
+                ctl.entrysize = sizeof(ChecksumFailureEnt);
+                BeDBStats.checksum_failures =
+                    hash_create("pgstat checksum failure count hash",
+                                32, &ctl, HASH_ELEM | HASH_BLOBS);
+            }
+
+            failent = hash_search(BeDBStats.checksum_failures,
+                                  &dboid, HASH_ENTER, NULL);
+        }
+
+        failent->count = failurecount;
         return;
+    }
+
+    /* We have a chance to flush immediately */
+    dbentry->n_checksum_failures += failurecount;
+    BeDBStats.checksum_failures = NULL;

-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dshash_release_lock(pgStatDBHash, dbentry);
 }

-/* ----------
- * pgstat_send_inquiry() -
- *
- *    Notify collector that we need fresh data.
- * ----------
+/*
+ * flush checksum failure count for all databases
  */
 static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
+pgstat_flush_checksum_failure(PgStat_StatDBEntry *dbentry)
 {
-    PgStat_MsgInquiry msg;
+    HASH_SEQ_STATUS     stat;
+    ChecksumFailureEnt *ent;
+    bool                release_dbent;

-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
+    if (BeDBStats.checksum_failures == NULL)
+        return;
+
+    hash_seq_init(&stat, BeDBStats.checksum_failures);
+    while ((ent = (ChecksumFailureEnt *) hash_seq_search(&stat)) != NULL)
+    {
+        release_dbent = false;
+
+        if (dbentry->databaseid != ent->dboid)
+        {
+            dbentry = pgstat_get_db_entry(ent->dboid,
+                                          PGSTAT_EXCLUSIVE, NULL);
+            if (!dbentry)
+                continue;
+
+            release_dbent = true;
+        }
+
+        dbentry->n_checksum_failures += ent->count;

+        if (release_dbent)
+            dshash_release_lock(pgStatDBHash, dbentry);
+    }
+
+    hash_destroy(BeDBStats.checksum_failures);
+    BeDBStats.checksum_failures = NULL;
+}

 /*
  * Initialize function call usage data.
@@ -1760,7 +2165,8 @@ pgstat_initstats(Relation rel)
         return;
     }

-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1780,6 +2186,24 @@ pgstat_initstats(Relation rel)
 }

 /*
+ * create_tabstat_hash - create local hash as transactional storage
+ */
+static HTAB *
+create_tabstat_hash(void)
+{
+    HASHCTL        ctl;
+
+    MemSet(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(Oid);
+    ctl.entrysize = sizeof(TabStatHashEntry);
+
+    return hash_create("pgstat TabStatusArray lookup hash table",
+                       TABSTAT_QUANTUM,
+                       &ctl,
+                       HASH_ELEM | HASH_BLOBS);
+}
+
+/*
  * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
  */
 static PgStat_TableStatus *
@@ -1794,18 +2218,7 @@ get_tabstat_entry(Oid rel_id, bool isshared)
      * Create hash table if we don't have it already.
      */
     if (pgStatTabHash == NULL)
-    {
-        HASHCTL        ctl;
-
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
-    }
+        pgStatTabHash = create_tabstat_hash();

     /*
      * Find an entry or create a new one.
@@ -2418,30 +2831,33 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
- * ----------
+ *    Find database stats entry on backends. The returned entries are cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_param param =
+    {
+        .hash_name = "local database stats hash",
+        .hash_entsize = sizeof(PgStat_StatDBEntry),
+        .dsh_handle = DSM_HANDLE_INVALID,   /* already attached */
+        .dsh_params = &dsh_dbparams,
+        .hash = &pgStatLocalHash,
+        .dshash = &pgStatDBHash
+    };
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* If not done for this transaction, take a snapshot of global stats */
+    pgstat_snapshot_global_stats();
+
+    /* caller has no business with snapshot-local members */
+    return (PgStat_StatDBEntry *) snapshot_statentry(&param, dbid);
 }

-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
@@ -2454,51 +2870,66 @@ pgstat_fetch_stat_dbentry(Oid dbid)
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;

-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* Lookup our database, then look in its table hash table. */
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    if (dbentry == NULL)
+        return NULL;

-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
+    if (tabentry != NULL)
+        return tabentry;

     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    dbentry = pgstat_fetch_stat_dbentry(InvalidOid);
+    if (dbentry == NULL)
+        return NULL;
+
+    tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
+    if (tabentry != NULL)
+        return tabentry;

     return NULL;
 }

+/* ----------
+ * pgstat_fetch_stat_tabentry_extended() -
+ *
+ *    Find table stats entry on backends. The returned entries are cached until
+ *    transaction end or pgstat_clear_snapshot() is called.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_extended(PgStat_StatDBEntry *dbent, Oid reloid)
+{
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_param param =
+    {
+        .hash_name = "table stats snapshot hash",
+        .hash_entsize = sizeof(PgStat_StatTabEntry),
+        .dsh_handle = DSM_HANDLE_INVALID,
+        .dsh_params = &dsh_tblparams,
+        .hash = NULL,
+        .dshash = NULL
+    };
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* set target shared hash */
+    param.dsh_handle = dbent->tables;
+
+    /* tell snapshot_statentry what variables to use */
+    param.hash = &dbent->snapshot_tables;
+    param.dshash = &dbent->dshash_tables;
+
+    return (PgStat_StatTabEntry *)
+        snapshot_statentry(&param, reloid);
+}
+

 /* ----------
  * pgstat_fetch_stat_funcentry() -
@@ -2513,21 +2944,90 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     PgStat_StatDBEntry *dbentry;
     PgStat_StatFuncEntry *funcentry = NULL;

-    /* load the stats file if needed */
-    backend_read_statsfile();
-
-    /* Lookup our database, then find the requested function.  */
+    /* Lookup our database, then find the requested function */
     dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
+    if (dbentry == NULL)
+        return NULL;
+
+    funcentry = pgstat_fetch_stat_funcentry_extended(dbentry, func_id);

     return funcentry;
 }

+/* ----------
+ * pgstat_fetch_stat_funcentry_extended() -
+ *
+ *    Find function stats entry on backends. The returned entries are cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
+ *
+ *  dbent is of type (PgStat_StatDBEntry *), but it must point to a
+ *  PgStat_StatDBEntry returned from pgstat_fetch_stat_dbentry().
+ */
+static PgStat_StatFuncEntry *
+pgstat_fetch_stat_funcentry_extended(PgStat_StatDBEntry *dbent, Oid funcid)
+{
+    /* context for snapshot_statentry */
+    static pgstat_snapshot_param param =
+    {
+        .hash_name = "function stats snapshot hash",
+        .hash_entsize = sizeof(PgStat_StatFuncEntry),
+        .dsh_handle = DSM_HANDLE_INVALID,
+        .dsh_params = &dsh_funcparams,
+        .hash = NULL,
+        .dshash = NULL
+    };
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    if (dbent->functions == DSM_HANDLE_INVALID)
+        return NULL;
+
+    /* set target shared hash */
+    param.dsh_handle = dbent->functions;
+
+    /* tell snapshot_statentry what variables to use */
+    param.hash = &dbent->snapshot_functions;
+    param.dshash = &dbent->dshash_functions;
+
+    return (PgStat_StatFuncEntry *)
+        snapshot_statentry(&param, funcid);
+}
+
+/*
+ * pgstat_snapshot_global_stats() -
+ *
+ * Makes a snapshot of global stats if not done yet.  They will be kept until
+ * subsequent call of pgstat_clear_snapshot() or the end of the current
+ * memory context (typically TopTransactionContext).
+ */
+static void
+pgstat_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext;
+
+    pgstat_attach_shared_stats();
+
+    /* Nothing to do if already done */
+    if (global_snapshot_is_valid)
+        return;
+
+    oldcontext = MemoryContextSwitchTo(pgStatSnapshotContext);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+
+    memcpy(&snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+    LWLockRelease(StatsLock);
+
+    global_snapshot_is_valid = true;
+
+    MemoryContextSwitchTo(oldcontext);
+
+    return;
+}

 /* ----------
  * pgstat_fetch_stat_beentry() -
@@ -2599,9 +3099,10 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();

-    return &archiverStats;
+    return &snapshot_archiverStats;
 }


@@ -2616,9 +3117,10 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();

-    return &globalStats;
+    return &snapshot_globalStats;
 }


@@ -2832,8 +3334,8 @@ pgstat_initialize(void)
         MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
     }

-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* needs to be called before DSM shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }

 /* ----------
@@ -2931,7 +3433,7 @@ pgstat_bestart(void)
                 lbeentry.st_backendType = B_STARTUP;
                 break;
             case ArchiverProcess:
-                beentry->st_backendType = B_ARCHIVER;
+                lbeentry.st_backendType = B_ARCHIVER;
                 break;
             case BgWriterProcess:
                 lbeentry.st_backendType = B_BG_WRITER;
@@ -3067,6 +3569,10 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+
+    /* attach shared database stats area */
+    pgstat_attach_shared_stats();
 }

 /*
@@ -3102,6 +3608,8 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */

     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    pgstat_detach_shared_stats(true);
 }


@@ -3362,7 +3870,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;

-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */

@@ -3657,9 +4166,6 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
-            break;
         case WAIT_EVENT_RECOVERY_WAL_ALL:
             event_name = "RecoveryWalAll";
             break;
@@ -4322,75 +4828,43 @@ pgstat_get_backend_desc(BackendType backendType)
  * ------------------------------------------------------------
  */

-
 /* ----------
- * pgstat_setheader() -
+ * pgstat_send_archiver() -
  *
- *        Set common header fields in a statistics message
+ *        Report archiver statistics
  * ----------
  */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
+void
+pgstat_send_archiver(const char *xlog, bool failed)
 {
-    hdr->m_type = mtype;
-}
+    TimestampTz now = GetCurrentTimestamp();

-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
-    {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
- *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
- * ----------
- */
-void
-pgstat_send_archiver(const char *xlog, bool failed)
-{
-    PgStat_MsgArchiver msg;
-
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
-}
+    if (failed)
+    {
+        /* Failed archival attempt */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->failed_count;
+        memcpy(shared_archiverStats->last_failed_wal, xlog,
+               sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    else
+    {
+        /* Successful archival operation */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->archived_count;
+        memcpy(shared_archiverStats->last_archived_wal, xlog,
+               sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+}

 /* ----------
  * pgstat_send_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
@@ -4399,6 +4873,8 @@ pgstat_send_bgwriter(void)
     /* We assume this initializes to zeroes */
     static const PgStat_MsgBgWriter all_zeroes;

+    PgStat_MsgBgWriter *s = &BgWriterStats;
+
     /*
      * This function can be called even if nothing at all has happened. In
      * this case, avoid sending a completely empty message to the stats
@@ -4407,11 +4883,18 @@ pgstat_send_bgwriter(void)
     if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
         return;

-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += s->m_timed_checkpoints;
+    shared_globalStats->requested_checkpoints += s->m_requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += s->m_checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += s->m_checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += s->m_buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += s->m_buf_written_clean;
+    shared_globalStats->maxwritten_clean += s->m_maxwritten_clean;
+    shared_globalStats->buf_written_backend += s->m_buf_written_backend;
+    shared_globalStats->buf_fsync_backend += s->m_buf_fsync_backend;
+    shared_globalStats->buf_alloc += s->m_buf_alloc;
+    LWLockRelease(StatsLock);

     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4420,276 +4903,164 @@ pgstat_send_bgwriter(void)
 }


-/* ----------
- * PgstatCollectorMain() -
- *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
+/*
+ * Pin and Unpin dbentry.
  *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
+ * To reduce memory usage and for speed, counters are reset by recreating the
+ * dshash instead of removing entries one by one while holding a whole-dshash
+ * lock. On the other hand, a dshash cannot be destroyed until all referrers
+ * have gone. As a result, other backends could be kept waiting for the
+ * counter reset for a long time. We isolate the hashes under destruction as
+ * another generation, which means no longer used but not yet removable.
+ *
+ * When we start accessing hashes on a dbentry, call pin_hashes() to acquire
+ * the current "generation". unpin_hashes() removes the older generation's
+ * hashes when all referrers have gone.
  */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
+static int
+pin_hashes(PgStat_StatDBEntry *dbentry)
 {
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
+    int    generation;

-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, SignalHandlerForConfigReload);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->refcnt++;
+    generation = dbentry->generation;
+    LWLockRelease(&dbentry->lock);

-    /*
-     * Identify myself via ps
-     */
-    init_ps_display("stats collector", "", "", "");
+    dshash_release_lock(pgStatDBHash, dbentry);

-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
+    return generation;
+}
+
+/*
+ * Unpin hashes in dbentry. If the given generation is isolated, destroy it
+ * after all referrers have gone. Otherwise just decrease the reference count.
+ */
+static void
+unpin_hashes(PgStat_StatDBEntry *dbentry, int generation)
+{
+    dshash_table *tables;
+    dshash_table *funcs = NULL;
+
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+
+    /* using current generation, just decrease refcount */
+    if (dbentry->generation == generation)
+    {
+        dbentry->refcnt--;
+        LWLockRelease(&dbentry->lock);
+        return;
+    }

     /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do check ConfigReloadPending
-     * inside the inner loop, which means that such interrupts will get
-     * serviced but the latch won't get cleared until next time there is a
-     * break in the action.
+     * It is isolated, waiting for all referrers to end.
      */
-    for (;;)
+    Assert(dbentry->generation == generation + 1);
+
+    if (--dbentry->prev_refcnt > 0)
     {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
+        LWLockRelease(&dbentry->lock);
+        return;
+    }

-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (ShutdownRequestPending)
-            break;
+    /* no referrer remains, remove the hashes */
+    tables = dshash_attach(area, &dsh_tblparams, dbentry->prev_tables, 0);
+    if (dbentry->prev_functions != DSM_HANDLE_INVALID)
+        funcs = dshash_attach(area, &dsh_funcparams,
+                              dbentry->prev_functions, 0);

-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * ShutdownRequestPending becomes set.
-         */
-        while (!ShutdownRequestPending)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (ConfigReloadPending)
-            {
-                ConfigReloadPending = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
+    dbentry->prev_tables = DSM_HANDLE_INVALID;
+    dbentry->prev_functions = DSM_HANDLE_INVALID;

-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
+    /* release the entry immediately */
+    LWLockRelease(&dbentry->lock);

-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
+    dshash_destroy(tables);
+    if (funcs)
+        dshash_destroy(funcs);

-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
+    return;
+}

-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
+/*
+ * Attach and return the specified generation of table hash.
+ * The attach is done while holding the dbentry lock.
+ */
+static dshash_table *
+attach_table_hash(PgStat_StatDBEntry *dbent, int gen)
+{
+    dshash_table *ret;

-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
+    LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);

-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
+    if (dbent->generation == gen)
+        ret = dshash_attach(area, &dsh_tblparams, dbent->tables, 0);
+    else
+    {
+        Assert(dbent->generation == gen + 1);
+        Assert(dbent->prev_tables != DSM_HANDLE_INVALID);
+        ret = dshash_attach(area, &dsh_tblparams, dbent->prev_tables, 0);
+    }
+    LWLockRelease(&dbent->lock);

-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
+    return ret;
+}

-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
+/* attach and return the specified generation of function hash */
+static dshash_table *
+attach_function_hash(PgStat_StatDBEntry *dbent, int gen)
+{
+    dshash_table *ret = NULL;

-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif

-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr & WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
+    LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);

-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
+    if (dbent->generation == gen)
+    {
+        if (dbent->functions == DSM_HANDLE_INVALID)
+        {
+            dshash_table *funchash =
+                dshash_create(area, &dsh_funcparams, 0);
+            dbent->functions = dshash_get_hash_table_handle(funchash);
+
+            ret = funchash;
+        }
+        else
+            ret = dshash_attach(area, &dsh_funcparams, dbent->functions, 0);
+    }
+    /* for an older generation, don't bother creating a useless hash */
+
+    LWLockRelease(&dbent->lock);
+
+    return ret;
+}

-    exit(0);
+static void
+init_dbentry(PgStat_StatDBEntry *dbentry)
+{
+    LWLockInitialize(&dbentry->lock, LWTRANCHE_STATS);
+    dbentry->generation = 0;
+    dbentry->refcnt = 0;
+    dbentry->prev_refcnt = 0;
+    dbentry->tables = DSM_HANDLE_INVALID;
+    dbentry->prev_tables = DSM_HANDLE_INVALID;
+    dbentry->functions = DSM_HANDLE_INVALID;
+    dbentry->prev_functions = DSM_HANDLE_INVALID;
 }

 /*
  * Subroutine to clear stats in a database entry
  *
- * Tables and functions hashes are initialized to empty.
+ * Reset all counters in the dbentry. Tables and functions dshashes are
+ * destroyed.  If any backend is pinning this dbentry, the current dshashes
+ * are stashed out to the previous "generation" to wait until all accessors
+ * are gone. If the previous generation is already occupied, the current
+ * dshashes are so fresh that they don't need to be cleared.
  */
 static void
 reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
 {
-    HASHCTL        hash_ctl;
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);

     dbentry->n_xact_commit = 0;
     dbentry->n_xact_rollback = 0;
@@ -4714,130 +5085,118 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
     dbentry->n_block_read_time = 0;
     dbentry->n_block_write_time = 0;

-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
-
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
-}
+    if (dbentry->refcnt == 0)
+    {
+        /*
+         * No one is referring to the current hash. It's very costly to remove
+         * entries in dshash individually so just destroy the whole thing.  If
+         * someone pins this entry just afterwards, pin_hashes() returns the
+         * current generation and the attach will happen after the LWLock is
+         * released below.
+         */
+        dshash_table *tbl;

-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
-{
-    PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+            dbentry->tables = DSM_HANDLE_INVALID;
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+            dbentry->functions = DSM_HANDLE_INVALID;
+        }
+    }
+    else if (dbentry->prev_refcnt == 0)
+    {
+        /*
+         * Someone is still referring to the current hash and previous slot is
+         * vacant. Stash out the current hash to the previous slot.
+         */
+        dbentry->prev_refcnt = dbentry->refcnt;
+        dbentry->prev_tables = dbentry->tables;
+        dbentry->prev_functions = dbentry->functions;
+        dbentry->refcnt = 0;
+        dbentry->tables = DSM_HANDLE_INVALID;
+        dbentry->functions = DSM_HANDLE_INVALID;
+        dbentry->generation++;
+    }
+    else
+    {
+        Assert(dbentry->prev_refcnt > 0 && dbentry->refcnt > 0);
+        /*
+         * If we get here, we have just received another reset request while
+         * the old hashes are still waiting for all referrers to release them.
+         * That should take only a short time, so we can ignore this request.
+         *
+         * As a side effect, the resetter may see non-zero values before
+         * anyone updates them, but that is indistinguishable from someone
+         * having updated them before the read.
+         */
+    }

-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
+    /* Create a new table hash if it does not exist */
+    if (dbentry->tables == DSM_HANDLE_INVALID)
+    {
+        dshash_table *tbl = dshash_create(area, &dsh_tblparams, 0);
+        dbentry->tables = dshash_get_hash_table_handle(tbl);
+        dshash_detach(tbl);
+    }

-    if (!create && !found)
-        return NULL;
+    /* Create a new function hash if it does not exist and is needed. */
+    if (dbentry->functions == DSM_HANDLE_INVALID &&
+        pgstat_track_functions != TRACK_FUNC_OFF)
+    {
+        dshash_table *tbl = dshash_create(area, &dsh_funcparams, 0);
+        dbentry->functions = dshash_get_hash_table_handle(tbl);
+        dshash_detach(tbl);
+    }

-    /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
-     */
-    if (!found)
-        reset_dbentry_counters(result);
+    dbentry->stat_reset_timestamp = GetCurrentTimestamp();

-    return result;
+    LWLockRelease(&dbentry->lock);
 }

-
 /*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+ * Create the filename for a DB stat file; filename is an output parameter
+ * that points to a character buffer of length len.
  */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+static void
+get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
 {
-    PgStat_StatTabEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /* If not found, initialize the new one. */
-    if (!found)
-    {
-        result->numscans = 0;
-        result->tuples_returned = 0;
-        result->tuples_fetched = 0;
-        result->tuples_inserted = 0;
-        result->tuples_updated = 0;
-        result->tuples_deleted = 0;
-        result->tuples_hot_updated = 0;
-        result->n_live_tuples = 0;
-        result->n_dead_tuples = 0;
-        result->changes_since_analyze = 0;
-        result->blocks_fetched = 0;
-        result->blocks_hit = 0;
-        result->vacuum_timestamp = 0;
-        result->vacuum_count = 0;
-        result->autovac_vacuum_timestamp = 0;
-        result->autovac_vacuum_count = 0;
-        result->analyze_timestamp = 0;
-        result->analyze_count = 0;
-        result->autovac_analyze_timestamp = 0;
-        result->autovac_analyze_count = 0;
-    }
+    int            printed;

-    return result;
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/db_%u.%s",
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
+                       databaseid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
 }

-
 /* ----------
  * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ *        Write the global statistics file, as well as DB files.
  * ----------
  */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+void
+pgstat_write_statsfiles(void)
 {
-    HASH_SEQ_STATUS hstat;
+    dshash_seq_status hstat;
     PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;

+    /* Stats are not initialized yet; just return. */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
+
     elog(DEBUG2, "writing stats file \"%s\"", statfile);

     /*
@@ -4856,7 +5215,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();

     /*
      * Write the file header --- currently just a format ID.
@@ -4868,39 +5227,37 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Write global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */

     /*
      * Write archiver stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */

     /*
      * Walk through the database table.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_init(&hstat, pgStatDBHash, false, false);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&hstat)) != NULL)
     {
         /*
          * Write out the table and function stats for this DB into the
          * appropriate per-DB stat file, if required.
          */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;

-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
+        pgstat_write_pgStatDBHashfile(dbentry);

         /*
          * Write out the DB entry. We don't write the tables or functions
          * pointers, since they're of no use to any other process.
          */
         fputc('D', fpout);
-        rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
+        rc = fwrite(dbentry,
+                    offsetof(PgStat_StatDBEntry, generation), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }

@@ -4936,53 +5293,18 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
-}
-
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
 }

 /* ----------
- * pgstat_write_db_statsfile() -
+ * pgstat_write_pgStatDBHashfile() -
  *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
  * ----------
  */
 static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
+pgstat_write_pgStatDBHashfile(PgStat_StatDBEntry *dbentry)
 {
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
+    dshash_seq_status tstat;
+    dshash_seq_status fstat;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatFuncEntry *funcentry;
     FILE       *fpout;
@@ -4991,9 +5313,10 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     int            rc;
     char        tmpfile[MAXPGPATH];
     char        statfile[MAXPGPATH];
+    dshash_table *tbl;

-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);

     elog(DEBUG2, "writing stats file \"%s\"", statfile);

@@ -5020,23 +5343,30 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     /*
      * Walk through the database's access stats per table.
      */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&tstat, tbl, false, false);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&tstat)) != NULL)
     {
         fputc('T', fpout);
         rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_detach(tbl);

     /*
      * Walk through the database's function stats table.
      */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+    if (dbentry->functions != DSM_HANDLE_INVALID)
     {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
+        tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_seq_init(&fstat, tbl, false, false);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&fstat)) != NULL)
+        {
+            fputc('F', fpout);
+            rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+            (void) rc;                /* we'll check for error with ferror */
+        }
+        dshash_detach(tbl);
     }

     /*
@@ -5071,76 +5401,37 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
 }

 /* ----------
  * pgstat_read_statsfiles() -
  *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
- *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
+ *    Reads in existing statistics collector files into the shared stats hash.
  *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+void
+pgstat_read_statsfiles(void)
 {
     PgStat_StatDBEntry *dbentry;
     PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+
+    /* shouldn't be called from postmaster */
+    Assert(IsUnderPostmaster);
+
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);

     /*
-     * The tables will live in pgStatLocalContext.
+     * Set the current timestamp (will be kept only in case we can't load an
+     * existing statsfile).
      */
-    pgstat_setup_memcxt();
-
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-
-    /*
-     * Set the current timestamp (will be kept only in case we can't load an
-     * existing statsfile).
-     */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp =
+        shared_globalStats->stat_reset_timestamp;

     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
@@ -5154,11 +5445,11 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }

     /*
@@ -5167,7 +5458,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
@@ -5175,32 +5466,24 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     /*
      * Read global stats struct
      */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+        sizeof(*shared_globalStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
         goto done;
     }

     /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
-    /*
      * Read archiver stats struct
      */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+        sizeof(*shared_archiverStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
         goto done;
     }

@@ -5217,10 +5500,10 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                  * follows.
                  */
             case 'D':
-                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
+                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, generation),
+                          fpin) != offsetof(PgStat_StatDBEntry, generation))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5229,76 +5512,36 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 /*
                  * Add to the DB hash
                  */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
+                dbentry = (PgStat_StatDBEntry *)
+                    dshash_find_or_insert(pgStatDBHash, (void *) &dbbuf.databaseid,
+                                          &found);
+
+                /* don't allow duplicate dbentries */
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    dshash_release_lock(pgStatDBHash, dbentry);
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }

-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+                /* initialize the new shared entry */
+                init_dbentry(dbentry);

-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
+                memcpy(dbentry, &dbbuf,
+                       offsetof(PgStat_StatDBEntry, generation));

+                /* Read the data from the database-specific file. */
+                pgstat_read_pgStatDBHashfile(dbentry);
+                dshash_release_lock(pgStatDBHash, dbentry);
                 break;

             case 'E':
                 goto done;

             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5308,45 +5551,35 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 done:
     FreeFile(fpin);

-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);

-    return dbhash;
+    return;
 }


 /* ----------
- * pgstat_read_db_statsfile() -
- *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
+ * pgstat_read_pgStatDBHashfile() -
  *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
+ *    Reads in the at-rest statistics file and creates shared statistics
+ *    tables. The file is removed after reading.
  * ----------
  */
 static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
+pgstat_read_pgStatDBHashfile(PgStat_StatDBEntry *dbentry)
 {
     PgStat_StatTabEntry *tabentry;
     PgStat_StatTabEntry tabbuf;
     PgStat_StatFuncEntry funcbuf;
     PgStat_StatFuncEntry *funcentry;
+    dshash_table         *tabhash = NULL;
+    dshash_table         *funchash = NULL;
     FILE       *fpin;
     int32        format_id;
     bool        found;
     char        statfile[MAXPGPATH];

-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
+    get_dbstat_filename(false, dbentry->databaseid, statfile, MAXPGPATH);

     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
@@ -5360,7 +5593,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
@@ -5373,14 +5606,14 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }

     /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
+     * We found an existing statistics file. Read it and put all the hashtable
+     * entries into place.
      */
     for (;;)
     {
@@ -5393,31 +5626,35 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
                           fpin) != sizeof(PgStat_StatTabEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }

-                /*
-                 * Skip if table data not wanted.
-                 */
                 if (tabhash == NULL)
-                    break;
+                {
+                    tabhash = dshash_create(area, &dsh_tblparams, 0);
+                    dbentry->tables =
+                        dshash_get_hash_table_handle(tabhash);
+                }

-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
+                tabentry = (PgStat_StatTabEntry *)
+                    dshash_find_or_insert(tabhash,
+                                          (void *) &tabbuf.tableid, &found);

+                /* don't allow duplicate entries */
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    dshash_release_lock(tabhash, tabentry);
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }

                 memcpy(tabentry, &tabbuf, sizeof(tabbuf));
+                dshash_release_lock(tabhash, tabentry);
                 break;

                 /*
@@ -5427,31 +5664,34 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
                           fpin) != sizeof(PgStat_StatFuncEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }

-                /*
-                 * Skip if function data not wanted.
-                 */
                 if (funchash == NULL)
-                    break;
+                {
+                    funchash = dshash_create(area, &dsh_tblparams, 0);
+                    dbentry->functions =
+                        dshash_get_hash_table_handle(funchash);
+                }

-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
+                funcentry = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert(funchash,
+                                          (void *) &funcbuf.functionid, &found);

                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    dshash_release_lock(funchash, funcentry);
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }

                 memcpy(funcentry, &funcbuf, sizeof(funcbuf));
+                dshash_release_lock(funchash, funcentry);
                 break;

                 /*
@@ -5461,7 +5701,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 goto done;

             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5469,295 +5709,39 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     }

 done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
+    if (tabhash)
+        dshash_detach(tabhash);
+    if (funchash)
+        dshash_detach(funchash);

-done:
     FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);

-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 }

-
 /* ----------
  * pgstat_setup_memcxt() -
  *
- *    Create pgStatLocalContext, if not already done.
+ *    Create pgStatLocalContext and pgStatSnapshotContext, if not already done.
  * ----------
  */
 static void
 pgstat_setup_memcxt(void)
 {
     if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+
+    if (!pgStatSnapshotContext)
+        pgStatSnapshotContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Database statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
 }

-
 /* ----------
  * pgstat_clear_snapshot() -
  *
@@ -5773,739 +5757,223 @@ pgstat_clear_snapshot(void)
 {
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
+    {
         MemoryContextDelete(pgStatLocalContext);

-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
+    }

+    if (pgStatSnapshotContext)
+        clear_snapshot = true;
+}

-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
+static bool
+pgstat_update_tabentry(dshash_table *tabhash, PgStat_TableStatus *stat,
+                       bool nowait)
 {
-    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    bool    found;

-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
+    if (tabhash == NULL)
+        return false;

-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
+    tabentry = (PgStat_StatTabEntry *)
+        dshash_find_or_insert_extended(tabhash, (void *) &(stat->t_id),
+                                       &found, nowait);

-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
+    /* failed to acquire lock */
+    if (tabentry == NULL)
+        return false;
+
+    if (!found)
     {
         /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
+         * If it's a new table entry, initialize counters to the values we
+         * just got.
          */
+        tabentry->numscans = stat->t_counts.t_numscans;
+        tabentry->tuples_returned = stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched = stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted = stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated = stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted = stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated = stat->t_counts.t_tuples_hot_updated;
+        tabentry->n_live_tuples = stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples = stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze = stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched = stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit = stat->t_counts.t_blocks_hit;
+
+        tabentry->vacuum_timestamp = 0;
+        tabentry->vacuum_count = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_count = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->analyze_count = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+        tabentry->autovac_analyze_count = 0;
     }
-    else if (msg->clock_time < dbentry->stats_timestamp)
+    else
     {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        tabentry->numscans += stat->t_counts.t_numscans;
+        tabentry->tuples_returned += stat->t_counts.t_tuples_returned;
+        tabentry->tuples_fetched += stat->t_counts.t_tuples_fetched;
+        tabentry->tuples_inserted += stat->t_counts.t_tuples_inserted;
+        tabentry->tuples_updated += stat->t_counts.t_tuples_updated;
+        tabentry->tuples_deleted += stat->t_counts.t_tuples_deleted;
+        tabentry->tuples_hot_updated += stat->t_counts.t_tuples_hot_updated;
+        /* If table was truncated, first reset the live/dead counters */
+        if (stat->t_counts.t_truncated)
         {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
+            tabentry->n_live_tuples = 0;
+            tabentry->n_dead_tuples = 0;
         }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
+        tabentry->n_live_tuples += stat->t_counts.t_delta_live_tuples;
+        tabentry->n_dead_tuples += stat->t_counts.t_delta_dead_tuples;
+        tabentry->changes_since_analyze += stat->t_counts.t_changed_tuples;
+        tabentry->blocks_fetched += stat->t_counts.t_blocks_fetched;
+        tabentry->blocks_hit += stat->t_counts.t_blocks_hit;
     }

-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);

+    dshash_release_lock(tabhash, tabentry);
+
+    return true;
+}

-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
 static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
+pgstat_update_dbentry(PgStat_StatDBEntry *dbentry, PgStat_TableStatus *stat)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
     /*
-     * Update database-wide stats.
+     * Add per-table stats to the per-database entry, too.
      */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
+    dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
+    dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
+    dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
+    dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
+    dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
+    dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
+    LWLockRelease(&dbentry->lock);
+}

-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
+/*
+ * Look up the shared stats hash table entry for the specified database.
+ * Returns NULL if PGSTAT_NOWAIT is set and the lock cannot be acquired.
+ */
+static PgStat_StatDBEntry *
+pgstat_get_db_entry(Oid databaseid, int op, PgStat_TableLookupResult *status)
+{
+    PgStat_StatDBEntry *result;
+    bool        nowait = ((op & PGSTAT_NOWAIT) != 0);
+    bool        lock_acquired = true;
+    bool        found = true;

-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
+    if (!IsUnderPostmaster || !pgStatDBHash)
+        return NULL;

-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
+    /* Lookup or create the hash table entry for this database */
+    if (op & PGSTAT_EXCLUSIVE)
+    {
+        result = (PgStat_StatDBEntry *)
+            dshash_find_or_insert_extended(pgStatDBHash, &databaseid,
+                                           &found, nowait);
+        if (result == NULL)
+            lock_acquired = false;
+        else if (!found)
         {
             /*
-             * Otherwise add the values to the existing entry.
+             * If not found, initialize the new one.  This creates empty hash
+             * tables hash, too.
              */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
+            init_dbentry(result);
+            reset_dbentry_counters(result);
         }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
     }
     else
     {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
+        result = (PgStat_StatDBEntry *)
+            dshash_find_extended(pgStatDBHash, &databaseid, true, nowait,
+                                 nowait ? &lock_acquired : NULL);
+        if (result == NULL)
+            found = false;
     }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;

-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
+    /* Set return status if requested */
+    if (status)
     {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
+        if (!lock_acquired)
         {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
+            Assert(nowait);
+            *status = LOCK_FAILED;
         }
+        else if (!found)
+            *status = NOT_FOUND;
         else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
+            *status = FOUND;
     }
+
+    return result;
 }

-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
+/*
+ * Lookup the hash table entry for the specified table. If no hash
+ * table entry exists, initialize it, if the create parameter is true.
+ * Else, return NULL.
  */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
+static PgStat_StatTabEntry *
+pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create)
 {
-    PgStat_StatDBEntry *dbentry;
-    int            i;
+    PgStat_StatTabEntry *result;
+    bool        found = true;    /* dshash_find() does not set it */

-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
+    /* Lookup or create the hash table entry for this table */
+    if (create)
+        result = (PgStat_StatTabEntry *)
+            dshash_find_or_insert(table, &tableoid, &found);
+    else
+        result = (PgStat_StatTabEntry *) dshash_find(table, &tableoid, false);

-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
+    if (!create && !found)
+        return NULL;

-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
+    /* If not found, initialize the new one. */
+    if (!found)
     {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
+        result->numscans = 0;
+        result->tuples_returned = 0;
+        result->tuples_fetched = 0;
+        result->tuples_inserted = 0;
+        result->tuples_updated = 0;
+        result->tuples_deleted = 0;
+        result->tuples_hot_updated = 0;
+        result->n_live_tuples = 0;
+        result->n_dead_tuples = 0;
+        result->changes_since_analyze = 0;
+        result->blocks_fetched = 0;
+        result->blocks_hit = 0;
+        result->vacuum_timestamp = 0;
+        result->vacuum_count = 0;
+        result->autovac_vacuum_timestamp = 0;
+        result->autovac_vacuum_count = 0;
+        result->analyze_timestamp = 0;
+        result->analyze_count = 0;
+        result->autovac_analyze_timestamp = 0;
+        result->autovac_analyze_count = 0;
     }
-}

-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
+    return result;
 }

 /*
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 5df9333..59bf597 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -255,7 +255,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;

 /* Startup process's status */
@@ -503,7 +502,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1327,12 +1325,6 @@ PostmasterMain(int argc, char *argv[])
     RemovePgTempFiles();

     /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
-    /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
     autovac_init();
@@ -1780,11 +1772,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }

-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2689,8 +2676,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);

         /* Reload authentication config files too */
         if (!load_hba())
@@ -3053,8 +3038,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();

             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3121,13 +3104,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);

                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3200,22 +3176,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }

-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3676,22 +3636,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }

-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */

     if (Shutdown != ImmediateShutdown)
@@ -3887,8 +3831,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3923,8 +3865,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4125,8 +4066,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }

 /*
@@ -5108,18 +5047,6 @@ SubPostmasterMain(int argc, char *argv[])

         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5238,12 +5165,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));

@@ -6148,7 +6069,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;

 #ifndef WIN32
@@ -6204,8 +6124,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;

     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6440,7 +6358,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);

     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d5..58a442f 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -147,6 +147,7 @@ CreateSharedMemoryAndSemaphores(void)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -263,6 +264,7 @@ CreateSharedMemoryAndSemaphores(void)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    StatsShmemInit();

 #ifdef EXEC_BACKEND

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 4c14e51..162ef56 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -523,6 +523,7 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
     LWLockRegisterTranche(LWTRANCHE_SXACT, "serializable_xact");
+    LWLockRegisterTranche(LWTRANCHE_STATS, "activity stats");

     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 9dba3b0..a10ec11 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3189,6 +3189,12 @@ ProcessInterrupts(void)

     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }


@@ -3763,6 +3769,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;

     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4201,6 +4208,8 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long stats_timeout;
+
                 /* Send out notify signals and transmit self-notifies */
                 ProcessCompletedNotifies();

@@ -4213,8 +4222,13 @@ PostgresMain(int argc, char *argv[],
                 if (notifyInterruptPending)
                     ProcessNotifyInterrupt();

-                pgstat_report_stat(false);
-
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle", false);
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4249,7 +4263,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;

         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4257,6 +4271,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }

+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb19644..51748c9 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 8a47dcd..a9e5309 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -630,6 +631,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }

     /*
@@ -1240,6 +1243,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }

+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
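
Note on the hunks above: the new timeout follows the usual deferred-work pattern, where the handler only raises flags and the actual flush runs later from ProcessInterrupts() at a safe point. The following is a minimal standalone sketch of that pattern, not code from the patch. The names stats_flush_pending, interrupt_pending, flush_stats, stats_timeout_handler and process_interrupts are hypothetical stand-ins for IdleStatsUpdateTimeoutPending, InterruptPending, pgstat_report_stat(true), IdleStatsUpdateTimeoutHandler() and ProcessInterrupts(); the latch wakeup is omitted.

```c
#include <assert.h>
#include <signal.h>

/* Hypothetical stand-ins for the flags touched by the handler in the patch. */
static volatile sig_atomic_t stats_flush_pending = 0;
static volatile sig_atomic_t interrupt_pending = 0;

/* Counts how often the (simulated) stats flush actually ran. */
static int flush_count = 0;

/* Hypothetical stand-in for pgstat_report_stat(true). */
static void
flush_stats(void)
{
    /* The pending flag must already be cleared when the real work starts. */
    assert(!stats_flush_pending);
    flush_count++;
}

/*
 * Timeout handler: it only records that work is pending.  Doing the flush
 * here would be unsafe, because the handler can interrupt arbitrary code.
 */
void
stats_timeout_handler(void)
{
    stats_flush_pending = 1;
    interrupt_pending = 1;
}

/*
 * Called from the main loop at a point where it is safe to do real work.
 * The flag is cleared before the flush so that a timeout firing during
 * the flush is not lost.
 */
void
process_interrupts(void)
{
    if (!interrupt_pending)
        return;
    interrupt_pending = 0;

    if (stats_flush_pending)
    {
        stats_flush_pending = 0;
        flush_stats();
    }
}
```

In this sketch, calling process_interrupts() without a preceding stats_timeout_handler() does nothing, and each firing of the handler leads to at most one flush.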
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 3c70499..927ae31 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -6,7 +6,7 @@ use File::Basename qw(basename dirname);
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 107;
+use Test::More tests => 106;

 program_help_ok('pg_basebackup');
 program_version_ok('pg_basebackup');
@@ -123,7 +123,7 @@ is_deeply(

 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index ef4cb6a..96204b7 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -83,6 +83,8 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;

 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;

diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index dfdac00..c415820 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL statistics collector facility.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -13,10 +13,11 @@

 #include "datatype/timestamp.h"
 #include "libpq/pqcomm.h"
-#include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"

@@ -41,33 +42,6 @@ typedef enum TrackFunctionsLevel
 }            TrackFunctionsLevel;

 /* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE
-} StatMsgType;
-
-/* ----------
  * The data type used for counters.
  * ----------
  */
@@ -77,9 +51,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -115,13 +88,6 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_blocks_hit;
 } PgStat_TableCounts;

-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -180,236 +146,12 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;


-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
-/* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
- */
-typedef struct PgStat_MsgHdr
-{
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
 /* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
+ * PgStat_MsgBgWriter            bgwriter statistics
  * ----------
  */
 typedef struct PgStat_MsgBgWriter
 {
-    PgStat_MsgHdr m_hdr;
-
     PgStat_Counter m_timed_checkpoints;
     PgStat_Counter m_requested_checkpoints;
     PgStat_Counter m_buf_written_checkpoints;
@@ -423,37 +165,13 @@ typedef struct PgStat_MsgBgWriter
 } PgStat_MsgBgWriter;

 /* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
-
-/* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -485,96 +203,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;

-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Statistic collector data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -614,16 +244,29 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_block_write_time;

     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;        /* time of db stats update */

     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following members must be last in the struct, because we don't write
+     * them out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    int generation;                        /* current generation of the below */
+    int    refcnt;                            /* current gen reference count */
+    dshash_table_handle tables;            /* current gen tables hash */
+    dshash_table_handle functions;        /* current gen functions hash */
+    int    prev_refcnt;                    /* prev gen reference count */
+    dshash_table_handle prev_tables;    /* prev gen tables hash */
+    dshash_table_handle prev_functions;    /* prev gen functions hash */
+    LWLock    lock;                        /* Lock for the above members */
+
+    /* non-shared members */
+    HTAB *snapshot_tables;                 /* table entry snapshot */
+    HTAB *snapshot_functions;             /* function entry snapshot */
+    dshash_table    *dshash_tables;         /* attached tables dshash */
+    dshash_table    *dshash_functions;     /* attached functions dshash */
 } PgStat_StatDBEntry;

+#define SHARED_DBENT_SIZE offsetof(PgStat_StatDBEntry, snapshot_tables)

 /* ----------
  * PgStat_StatTabEntry            The collector's data per table (or index)
@@ -662,7 +305,7 @@ typedef struct PgStat_StatTabEntry


 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            per function stats data
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
@@ -677,7 +320,7 @@ typedef struct PgStat_StatFuncEntry


 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -693,7 +336,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;

 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -779,7 +422,6 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
     WAIT_EVENT_RECOVERY_WAL_ALL,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
@@ -1217,6 +859,8 @@ extern PGDLLIMPORT bool pgstat_track_counts;
 extern PGDLLIMPORT int pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used, but will be removed with GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;

@@ -1238,29 +882,26 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);

-extern void pgstat_init(void);
-extern int    pgstat_start(void);
-extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);

-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void pgstat_reset_all(void);

+/* File input/output functions  */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);

 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);

 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);

 extern void pgstat_report_autovac(Oid dboid);
@@ -1432,11 +1073,13 @@ extern void pgstat_send_bgwriter(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_extended(PgStat_StatDBEntry *dbent, Oid relid);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
+extern void pgstat_clear_snapshot(void);

 #endif                            /* PGSTAT_H */
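
Note on SHARED_DBENT_SIZE above: the macro uses offsetof() to mark where the shared part of PgStat_StatDBEntry ends and the backend-local members begin. How the patch uses the macro is not visible in this part of the diff, so the following is only a sketch of the general technique under that assumption. DemoDBEntry, DEMO_SHARED_SIZE and demo_copy_shared are hypothetical names, and the struct is heavily simplified.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical, simplified entry.  Members that are meaningful to every
 * process come first; process-local pointers come last.
 */
typedef struct DemoDBEntry
{
    unsigned int databaseid;
    long         n_xact_commit;
    int          generation;

    /* non-shared members: only valid in the process that set them */
    void        *snapshot_tables;
    void        *snapshot_functions;
} DemoDBEntry;

/* Size of the leading part that may be copied between processes. */
#define DEMO_SHARED_SIZE offsetof(DemoDBEntry, snapshot_tables)

/*
 * Copy only the shared prefix of src into dst and reset the local members,
 * so that pointers from another process never leak into the copy.
 */
void
demo_copy_shared(DemoDBEntry *dst, const DemoDBEntry *src)
{
    assert(dst != NULL && src != NULL);

    memcpy(dst, src, DEMO_SHARED_SIZE);
    dst->snapshot_tables = NULL;
    dst->snapshot_functions = NULL;
}
```

The layout rule is the same one the struct comment states: anything that must not be written out or shared has to sit after the member named in the offsetof() expression.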
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 8fda8e4..13371e8 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -220,6 +220,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_SXACT,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;

diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 83a15f6..77d1572 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index bdc9026..2885540 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1146,8 +1146,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>

    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c1128f8..8b63b79 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7052,25 +7052,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>

-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 11aaef5..0f8aa93 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -195,12 +195,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser

   <para>
    The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
+   <productname>PostgreSQL</productname> processes through shared memory.
    When the server shuts down cleanly, a permanent copy of the statistics
    data is stored in the <filename>pg_stat</filename> subdirectory, so that
    statistics can be retained across server restarts.  When recovery is
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 1c19e86..2f04bb6 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -122,8 +122,7 @@ Item

 <row>
  <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
+ <entry>Subdirectory containing ephemeral files for extensions</entry>
 </row>

 <row>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8ea8e52..4ba2460 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -110,15 +110,12 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;

-/* ----------
- * Built from GUC parameter
- * ----------
+/*
+ * This used to be a GUC variable and is no longer used in this file, but is
+ * kept, holding its default value, only for backward compatibility with
+ * extensions.
  */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+char       *pgstat_stat_directory = PG_STAT_TMP_DIR;

 #define        StatsLock (&StatsShmem->StatsMainLock)

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index f66cbc2..fa66bef 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -251,15 +251,12 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;

     backup_total = 0;
     backup_streamed = 0;
     pgstat_progress_start_command(PROGRESS_COMMAND_BASEBACKUP, InvalidOid);

-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();

     labelfile = makeStringInfo();
@@ -291,13 +288,9 @@ perform_base_backup(basebackup_options *opt)
          * Calculate the relative path of temporary statistics directory in
          * order to skip the files which are located in that directory later.
          */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
+
+        Assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
+        statrelpath = psprintf("./%s", PG_STAT_TMP_DIR);

         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c1fad3b..66efb33 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -196,7 +196,6 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static void assign_effective_io_concurrency(int newval, void *extra);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -4193,17 +4192,6 @@ static struct config_string ConfigureNamesString[] =
     },

     {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
-    {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
             NULL,
@@ -11488,35 +11476,6 @@ assign_effective_io_concurrency(int newval, void *extra)
 #endif                            /* USE_PREFETCH */
 }

-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e58e478..99f8513 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -572,7 +572,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'


 # - Monitoring -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c415820..a1f1ad8 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -30,7 +30,10 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"

-/* Default directory to store temporary statistics data in */
+/*
+ * This used to be the directory to store temporary statistics data in but is
+ * no longer used. Defined here for backward compatibility.
+ */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"

 /* Values for track_functions GUC variable --- order is significant! */
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 9575268..f3340f7 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -455,10 +455,6 @@ sub init
     print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
       if defined $ENV{TEMP_CONFIG};

-    # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
-    # concurrently must not share a stats_temp_directory.
-    print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
     if ($params{allows_streaming})
     {
         if ($params{allows_streaming} eq "logical")

Re: shared-memory based stats collector

From
Alvaro Herrera
Date:
Tom Lane wrote:

In patch 0003,

>          /*
> -         * Was it the archiver?  If so, just try to start a new one; no need
> -         * to force reset of the rest of the system.  (If fail, we'll try
> -         * again in future cycles of the main loop.).  Unless we were waiting
> -         * for it to shut down; don't restart it in that case, and
> -         * PostmasterStateMachine() will advance to the next shutdown step.
> +         * Was it the archiver?  Normal exit can be ignored; we'll start a new
> +         * one at the next iteration of the postmaster's main loop, if
> +         * necessary. Any other exit condition is treated as a crash.
>           */
>          if (pid == PgArchPID)
>          {
>              PgArchPID = 0;
>              if (!EXIT_STATUS_0(exitstatus))
> -                LogChildExit(LOG, _("archiver process"),
> -                             pid, exitstatus);
> -            if (PgArchStartupAllowed())
> -                PgArchPID = pgarch_start();
> +                HandleChildCrash(pid, exitstatus,
> +                                 _("archiver process"));
>              continue;
>          }

I'm worried that we're causing all processes to terminate when an
archiver dies in some ugly way; but in the current coding, it's pretty
harmless and we'd just start a new one.  I think this needs to be
reconsidered.  As far as I know, pgarchiver remains unconnected to
shared memory so a crash-restart cycle is not necessary.  We should
continue to just log the error message and move on.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: shared-memory based stats collector

From
Andres Freund
Date:
Hi,

On 2020-03-09 15:37:05 -0300, Alvaro Herrera wrote:
> Tom Lane wrote:
> 
> In patch 0003,
> 
> >          /*
> > -         * Was it the archiver?  If so, just try to start a new one; no need
> > -         * to force reset of the rest of the system.  (If fail, we'll try
> > -         * again in future cycles of the main loop.).  Unless we were waiting
> > -         * for it to shut down; don't restart it in that case, and
> > -         * PostmasterStateMachine() will advance to the next shutdown step.
> > +         * Was it the archiver?  Normal exit can be ignored; we'll start a new
> > +         * one at the next iteration of the postmaster's main loop, if
> > +         * necessary. Any other exit condition is treated as a crash.
> >           */
> >          if (pid == PgArchPID)
> >          {
> >              PgArchPID = 0;
> >              if (!EXIT_STATUS_0(exitstatus))
> > -                LogChildExit(LOG, _("archiver process"),
> > -                             pid, exitstatus);
> > -            if (PgArchStartupAllowed())
> > -                PgArchPID = pgarch_start();
> > +                HandleChildCrash(pid, exitstatus,
> > +                                 _("archiver process"));
> >              continue;
> >          }
> 
> I'm worried that we're causing all processes to terminate when an
> archiver dies in some ugly way; but in the current coding, it's pretty
> harmless and we'd just start a new one.  I think this needs to be
> reconsidered.  As far as I know, pgarchiver remains unconnected to
> shared memory so a crash-restart cycle is not necessary.  We should
> continue to just log the error message and move on.

Why is it worth having the archiver be "robust" that way? Except that
random implementation details led to it not being connected to shared
memory, and thus allowing a restart for any exit code, I don't see a
need? It doesn't have exit paths that could validly trigger another exit
code, as far as I can see.

Greetings,

Andres Freund



Re: shared-memory based stats collector

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2020-03-09 15:37:05 -0300, Alvaro Herrera wrote:
>> I'm worried that we're causing all processes to terminate when an
>> archiver dies in some ugly way; but in the current coding, it's pretty
>> harmless and we'd just start a new one.  I think this needs to be
>> reconsidered.  As far as I know, pgarchiver remains unconnected to
>> shared memory so a crash-restart cycle is not necessary.  We should
>> continue to just log the error message and move on.

> Why is it worth having the archiver be "robust" that way?

I'd ask a different question: what the heck is this patchset doing
touching the archiver in the first place?  I can see no plausible
reason for that doing anything related to stats collection.  If we
now need some new background processing for stats, let's make a
new postmaster child process to do that, not overload the archiver
with unrelated responsibilities.

            regards, tom lane



Re: shared-memory based stats collector

From
Andres Freund
Date:
Hi,

On 2020-03-09 15:04:23 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2020-03-09 15:37:05 -0300, Alvaro Herrera wrote:
> >> I'm worried that we're causing all processes to terminate when an
> >> archiver dies in some ugly way; but in the current coding, it's pretty
> >> harmless and we'd just start a new one.  I think this needs to be
> >> reconsidered.  As far as I know, pgarchiver remains unconnected to
> >> shared memory so a crash-restart cycle is not necessary.  We should
> >> continue to just log the error message and move on.
> 
> > Why is it worth having the archiver be "robust" that way?
> 
> I'd ask a different question: what the heck is this patchset doing
> touching the archiver in the first place?  I can see no plausible
> reason for that doing anything related to stats collection.

As of a release or two back, it sends stats messages for archiving
events:

            if (pgarch_archiveXlog(xlog))
            {
                /* successful */
                pgarch_archiveDone(xlog);

                /*
                 * Tell the collector about the WAL file that we successfully
                 * archived
                 */
                pgstat_send_archiver(xlog, false);

                break;            /* out of inner retry loop */
            }
            else
            {
                /*
                 * Tell the collector about the WAL file that we failed to
                 * archive
                 */
                pgstat_send_archiver(xlog, true);


> If we now need some new background processing for stats, let's make a
> new postmaster child process to do that, not overload the archiver
> with unrelated responsibilities.

I don't think that's what's going on. It's just changing the archiver so
it can report stats via shared memory - which subsequently means that it
needs to deal differently with errors than when it wasn't connected.

Greetings,

Andres Freund



Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
Thank you all for the suggestions.

At Mon, 9 Mar 2020 12:25:39 -0700, Andres Freund <andres@anarazel.de> wrote in 
> Hi,
> 
> On 2020-03-09 15:04:23 -0400, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > On 2020-03-09 15:37:05 -0300, Alvaro Herrera wrote:
> > >> I'm worried that we're causing all processes to terminate when an
> > >> archiver dies in some ugly way; but in the current coding, it's pretty
> > >> harmless and we'd just start a new one.  I think this needs to be
> > >> reconsidered.  As far as I know, pgarchiver remains unconnected to
> > >> shared memory so a crash-restart cycle is not necessary.  We should
> > >> continue to just log the error message and move on.
> > 
> > > Why is it worth having the archiver be "robust" that way?
> > 
> > I'd ask a different question: what the heck is this patchset doing
> > touching the archiver in the first place?  I can see no plausible
> > reason for that doing anything related to stats collection.
> 
> As of a release or two back, it sends stats messages for archiving
> events:
> 
>             if (pgarch_archiveXlog(xlog))
>             {
>                 /* successful */
>                 pgarch_archiveDone(xlog);
> 
>                 /*
>                  * Tell the collector about the WAL file that we successfully
>                  * archived
>                  */
>                 pgstat_send_archiver(xlog, false);
> 
>                 break;            /* out of inner retry loop */
>             }
>             else
>             {
>                 /*
>                  * Tell the collector about the WAL file that we failed to
>                  * archive
>                  */
>                 pgstat_send_archiver(xlog, true);
> 
> 
> > If we now need some new background processing for stats, let's make a
> > new postmaster child process to do that, not overload the archiver
> > with unrelated responsibilities.
> 
> I don't think that's what's going on. It's just changing the archiver so
> it can report stats via shared memory - which subsequently means that it
> needs to deal differently with errors than when it wasn't connected.

That's true, but I have the same concern as Tom. The archiver became
more tightly linked with other processes than its actual relation
warrants. We may need a second static shared memory segment apart from
the current one.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
At Tue, 10 Mar 2020 12:27:25 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> That's true, but I have the same concern as Tom. The archiver became
> more tightly linked with other processes than its actual relation
> warrants. We may need a second static shared memory segment apart from
> the current one.

Anyway, I changed the target version of the entry to 14.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: shared-memory based stats collector

From
Andres Freund
Date:
On 2020-03-10 12:27:25 +0900, Kyotaro Horiguchi wrote:
> That's true, but I have the same concern as Tom. The archiver became
> more tightly linked with other processes than its actual relation warrants.

What's the problem here? We have a number of helper processes
(checkpointer, bgwriter) that are attached to shared memory, and it's
not a problem.


> We may need the second static shared memory segment apart from the
> current one.

That seems absurd to me. Solving a non-problem by introducing complex
new infrastructure.



Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
At Mon, 9 Mar 2020 20:34:20 -0700, Andres Freund <andres@anarazel.de> wrote in 
> On 2020-03-10 12:27:25 +0900, Kyotaro Horiguchi wrote:
> > That's true, but I have the same concern as Tom. The archiver became
> > more tightly linked with other processes than its actual relation warrants.
> 
> What's the problem here? We have a number of helper processes
> (checkpointer, bgwriter) that are attached to shared memory, and it's
> not a problem.

That theoretically raises the chance of a server crash by a small
amount. But, yes, it's absurd to presume that the archiver process
crashes.

> > We may need the second static shared memory segment apart from the
> > current one.
> 
> That seems absurd to me. Solving a non-problem by introducing complex
> new infrastructure.

Ok. I think I must be worrying too much.

Thanks for the suggestion.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: shared-memory based stats collector

From
Alvaro Herrera
Date:
On 2020-Mar-10, Kyotaro Horiguchi wrote:

> At Mon, 9 Mar 2020 20:34:20 -0700, Andres Freund <andres@anarazel.de> wrote in 
> > On 2020-03-10 12:27:25 +0900, Kyotaro Horiguchi wrote:
> > > That's true, but I have the same concern as Tom. The archiver became
> > > more tightly linked with other processes than its actual relation warrants.
> > 
> > What's the problem here? We have a number of helper processes
> > (checkpointer, bgwriter) that are attached to shared memory, and it's
> > not a problem.
> 
> That theoretically raises the chance of a server crash by a small
> amount. But, yes, it's absurd to presume that the archiver process
> crashes.

The case I'm worried about is a misconfigured archive_command that
causes the archiver to misbehave (exit with a code other than 0); if
that already doesn't happen, or we can make it not happen, then I'm okay
with the changes to archiver.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: shared-memory based stats collector

From
Julien Rouhaud
Date:
On Tue, Mar 10, 2020 at 1:48 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> On 2020-Mar-10, Kyotaro Horiguchi wrote:
>
> > At Mon, 9 Mar 2020 20:34:20 -0700, Andres Freund <andres@anarazel.de> wrote in
> > > On 2020-03-10 12:27:25 +0900, Kyotaro Horiguchi wrote:
> > > > That's true, but I have the same concern as Tom. The archiver became
> > > > more tightly linked with other processes than its actual relation warrants.
> > >
> > > What's the problem here? We have a number of helper processes
> > > (checkpointer, bgwriter) that are attached to shared memory, and it's
> > > not a problem.
> >
> > That theoretically raises the chance of a server crash by a small
> > amount. But, yes, it's absurd to presume that the archiver process
> > crashes.
>
> The case I'm worried about is a misconfigured archive_command that
> causes the archiver to misbehave (exit with a code other than 0); if
> that already doesn't happen, or we can make it not happen, then I'm okay
> with the changes to archiver.

Any script that gets killed, or that exits with a return code > 127,
would do that.



Re: shared-memory based stats collector

From
Andres Freund
Date:
On 2020-03-10 09:48:07 -0300, Alvaro Herrera wrote:
> On 2020-Mar-10, Kyotaro Horiguchi wrote:
> 
> > At Mon, 9 Mar 2020 20:34:20 -0700, Andres Freund <andres@anarazel.de> wrote in 
> > > On 2020-03-10 12:27:25 +0900, Kyotaro Horiguchi wrote:
> > > > That's true, but I have the same concern as Tom. The archiver became
> > > > more tightly linked with other processes than its actual relation warrants.
> > > 
> > > What's the problem here? We have a number of helper processes
> > > (checkpointer, bgwriter) that are attached to shared memory, and it's
> > > not a problem.
> > 
> > That theoretically raises the chance of a server crash by a small
> > amount. But, yes, it's absurd to presume that the archiver process
> > crashes.
> 
> The case I'm worried about is a misconfigured archive_command that
> causes the archiver to misbehave (exit with a code other than 0); if
> that already doesn't happen, or we can make it not happen, then I'm okay
> with the changes to archiver.

Well, an exit(1) is also fine, afaict. No?

The archive command can just trigger either a FATAL or a LOG:

    rc = system(xlogarchcmd);
    if (rc != 0)
    {
        /*
         * If either the shell itself, or a called command, died on a signal,
         * abort the archiver.  We do this because system() ignores SIGINT and
         * SIGQUIT while waiting; so a signal is very likely something that
         * should have interrupted us too.  Also die if the shell got a hard
         * "command not found" type of error.  If we overreact it's no big
         * deal, the postmaster will just start the archiver again.
         */
        int            lev = wait_result_is_any_signal(rc, true) ? FATAL : LOG;

        if (WIFEXITED(rc))
        {
            ereport(lev,
                    (errmsg("archive command failed with exit code %d",
                            WEXITSTATUS(rc)),
                     errdetail("The failed archive command was: %s",
                               xlogarchcmd)));
        }
        else if (WIFSIGNALED(rc))
        {
#if defined(WIN32)
            ereport(lev,
                    (errmsg("archive command was terminated by exception 0x%X",
                            WTERMSIG(rc)),
                     errhint("See C include file \"ntstatus.h\" for a description of the hexadecimal value."),
                     errdetail("The failed archive command was: %s",
                               xlogarchcmd)));
#else
            ereport(lev,
                    (errmsg("archive command was terminated by signal %d: %s",
                            WTERMSIG(rc), pg_strsignal(WTERMSIG(rc))),
                     errdetail("The failed archive command was: %s",
                               xlogarchcmd)));
#endif
        }
        else
        {
            ereport(lev,
                    (errmsg("archive command exited with unrecognized status %d",
                            rc),
                     errdetail("The failed archive command was: %s",
                               xlogarchcmd)));
        }

        snprintf(activitymsg, sizeof(activitymsg), "failed on %s", xlog);
        set_ps_display(activitymsg, false);

        return false;
    }

I.e. there are only normal ways for the archiver to shut down due to a
failing archive command.
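
The severity choice in the snippet above can be isolated into a small
standalone sketch. The names toy_level, toy_archive_failure_level and
toy_run_and_classify are hypothetical, and the threshold of 125 is an
assumption about what wait_result_is_any_signal(rc, true) checks (death by
signal, or a shell exit status above 125); it is not quoted in this thread.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Hypothetical severity levels standing in for LOG and FATAL. */
typedef enum { TOY_LOG, TOY_FATAL } toy_level;

/*
 * Classify a wait status as returned by system().
 *
 * Assumption: death by signal, or a shell exit status above 125 (126 and 127
 * are "cannot execute" / "command not found", 128+n reports a signal), is
 * treated as fatal for the archiver; any other failure is only logged.
 */
toy_level
toy_archive_failure_level(int rc)
{
    if (WIFSIGNALED(rc))
        return TOY_FATAL;
    if (WIFEXITED(rc) && WEXITSTATUS(rc) > 125)
        return TOY_FATAL;
    return TOY_LOG;
}

/* Run a shell command the way the archiver does and classify its failure. */
toy_level
toy_run_and_classify(const char *command)
{
    assert(command != NULL);
    return toy_archive_failure_level(system(command));
}
```

Under these assumptions, an archive command that merely fails (for example
"exit 3") is logged and retried, while one that is killed or cannot be found
makes the archiver exit with FATAL, which is still an orderly exit.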

Greetings,

Andres Freund



Re: shared-memory based stats collector

From
Andres Freund
Date:
Hi,

On 2020-03-10 19:52:22 +0100, Julien Rouhaud wrote:
> On Tue, Mar 10, 2020 at 1:48 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> >
> > On 2020-Mar-10, Kyotaro Horiguchi wrote:
> >
> > > At Mon, 9 Mar 2020 20:34:20 -0700, Andres Freund <andres@anarazel.de> wrote in
> > > > On 2020-03-10 12:27:25 +0900, Kyotaro Horiguchi wrote:
> > > > > That's true, but I have the same concern as Tom. The archiver became
> > > > > more tightly linked with other processes than its actual relation warrants.
> > > >
> > > > What's the problem here? We have a number of helper processes
> > > > (checkpointer, bgwriter) that are attached to shared memory, and it's
> > > > not a problem.
> > >
> > > That theoretically raises the chance of a server crash by a small
> > > amount. But, yes, it's absurd to presume that the archiver process
> > > crashes.
> >
> > The case I'm worried about is a misconfigured archive_command that
> > causes the archiver to misbehave (exit with a code other than 0); if
> > that already doesn't happen, or we can make it not happen, then I'm okay
> > with the changes to archiver.
> 
> Any script that gets killed, or that exits with a return code > 127,
> would do that.

But just with a FATAL, not with something worse. And the default
handling for aux backends accepts exit code 1 (which elog uses for
FATAL) as a normal shutdown. Am I missing something here?
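
A minimal sketch of that rule, assuming the postmaster treats exit status 0
and a plain exit(1) as normal and everything else as a crash. The TOY_ macros
and function names are hypothetical stand-ins, not the actual postmaster code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Assumed shape of the exit-status checks; the TOY_ names are hypothetical. */
#define TOY_EXIT_STATUS_0(st)  ((st) == 0)
#define TOY_EXIT_STATUS_1(st)  (WIFEXITED(st) && WEXITSTATUS(st) == 1)

/*
 * Return true if a child's wait status should trigger a crash-restart cycle.
 * Exit code 0 (clean exit) and exit code 1 (FATAL error) count as normal.
 */
bool
toy_child_exit_is_crash(int wait_status)
{
    return !TOY_EXIT_STATUS_0(wait_status) &&
           !TOY_EXIT_STATUS_1(wait_status);
}

/* Obtain a real wait status from a shell command and classify it. */
bool
toy_command_exit_is_crash(const char *shell_command)
{
    assert(shell_command != NULL);
    return toy_child_exit_is_crash(system(shell_command));
}
```

With this rule a FATAL in the archiver (exit code 1) is not a crash, whereas
exit(2) or death by a signal is.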

Greetings,

Andres Freund



Re: shared-memory based stats collector

From
Andres Freund
Date:
Hi,

Thomas, could you look at the first two patches here, and my review
questions?


General comments about this series:
- A lot of the review comments feel like I've written them before, a
  year or more ago. I feel this patch ought to be in a much better
  state. There's a lot of IMO fairly obvious stuff here, and things that
  have been mentioned multiple times previously.
- There's a *lot* of typos in here. I realize being an ESL is hard, but
  a lot of these can be found with the simplest spellchecker.  That's
  one thing for a patch that just has been hacked up as a POC, but this
  is a multi year thread?
- There's some odd formatting. Consider using pgindent more regularly.

More detailed comments below.

I'm considering rewriting the parts of the patchset that I don't like -
but it'll look quite different afterwards.


On 2020-01-22 17:24:04 +0900, Kyotaro Horiguchi wrote:
> From 5f7946522dc189429008e830af33ff2db435dd42 Mon Sep 17 00:00:00 2001
> From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
> Date: Fri, 29 Jun 2018 16:41:04 +0900
> Subject: [PATCH 1/5] sequential scan for dshash
>
> Add sequential scan feature to dshash.


>          dsa_pointer item_pointer = hash_table->buckets[i];
> @@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
>                                  LW_EXCLUSIVE));
>
>      delete_item(hash_table, item);
> -    hash_table->find_locked = false;
> -    hash_table->find_exclusively_locked = false;
> -    LWLockRelease(PARTITION_LOCK(hash_table, partition));
> +
> +    /* We need to keep partition lock while sequential scan */
> +    if (!hash_table->seqscan_running)
> +    {
> +        hash_table->find_locked = false;
> +        hash_table->find_exclusively_locked = false;
> +        LWLockRelease(PARTITION_LOCK(hash_table, partition));
> +    }
>  }

This seems like a failure prone API.

>  /*
> @@ -568,6 +584,8 @@ dshash_release_lock(dshash_table *hash_table, void *entry)
>      Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition_index),
>                                  hash_table->find_exclusively_locked
>                                  ? LW_EXCLUSIVE : LW_SHARED));
> +    /* lock is under control of sequential scan */
> +    Assert(!hash_table->seqscan_running);
>
>      hash_table->find_locked = false;
>      hash_table->find_exclusively_locked = false;
> @@ -592,6 +610,164 @@ dshash_memhash(const void *v, size_t size, void *arg)
>      return tag_hash(v, size);
>  }
>
> +/*
> + * dshash_seq_init/_next/_term
> + *           Sequentially scan trhough dshash table and return all the
> + *           elements one by one, return NULL when no more.

s/trhough/through/

This uses a different comment style than the other functions in this
file. Why?


> + * dshash_seq_term should be called for incomplete scans and otherwise
> + * shoudln't. Finished scans are cleaned up automatically.

s/shoudln't/shouldn't/

I find the "cleaned up automatically" API terrible. I know you copied it
from dynahash, but I find it to be really failure prone. dynahash isn't
an example of good postgres code, the opposite, I'd say. It's a lot
easier to unconditionally have a terminate call if we need that.
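
To illustrate the unconditional-terminate shape, here is a toy sketch. It is
not dshash: toy_table, toy_seq_status and the toy_seq_* functions are
hypothetical, and lock_depth merely stands in for a held partition lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NSLOTS 8

typedef struct toy_table
{
    int   values[TOY_NSLOTS];
    bool  used[TOY_NSLOTS];
    int   lock_depth;           /* stands in for a held partition lock */
} toy_table;

typedef struct toy_seq_status
{
    toy_table *table;
    size_t     next_slot;
} toy_seq_status;

/* Start a scan; takes the (toy) lock. */
void
toy_seq_init(toy_seq_status *status, toy_table *table)
{
    status->table = table;
    status->next_slot = 0;
    table->lock_depth++;
}

/* Return the next used element, or NULL at the end.  Never unlocks. */
int *
toy_seq_next(toy_seq_status *status)
{
    toy_table *table = status->table;

    while (status->next_slot < TOY_NSLOTS)
    {
        size_t i = status->next_slot++;

        if (table->used[i])
            return &table->values[i];
    }
    return NULL;
}

/* End a scan; must be called exactly once, finished or not. */
void
toy_seq_term(toy_seq_status *status)
{
    assert(status->table->lock_depth > 0);
    status->table->lock_depth--;
    status->table = NULL;
}

/* Sum values, stopping early once the sum reaches limit. */
int
toy_sum_until(toy_table *table, int limit)
{
    toy_seq_status status;
    int            sum = 0;
    int           *value;

    toy_seq_init(&status, table);
    while ((value = toy_seq_next(&status)) != NULL)
    {
        sum += *value;
        if (sum >= limit)
            break;              /* early exit needs no special handling */
    }
    toy_seq_term(&status);      /* same cleanup on both paths */
    return sum;
}
```

The caller does not need to know whether the scan ran to completion, so
there is no "finished scans are cleaned up automatically" case to remember.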


> + * Returned elements are locked as is the case with dshash_find.  However, the
> + * caller must not release the lock.
> + *
> + * Same as dynanash, the caller may delete returned elements midst of a scan.

I think it's a bad idea to refer to dynahash here. That's just going to
get out of date. Also, code should be documented on its own.


> + * If consistent is set for dshash_seq_init, the all hash table partitions are
> + * locked in the requested mode (as determined by the exclusive flag) during
> + * the scan.  Otherwise partitions are locked in one-at-a-time way during the
> + * scan.

Yet delete unconditionally retains locks?


> + */
> +void
> +dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
> +                bool consistent, bool exclusive)
> +{

Why does this patch add the consistent mode? There are no users currently?
Without it, it's not clear that we need a separate _term function, I think?

I think we also can get rid of the dshash_delete changes, by instead
adding a dshash_delete_current(dshash_seq_stat *status, void *entry) API
or such.


> @@ -70,7 +86,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
>  extern void dshash_detach(dshash_table *hash_table);
>  extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
>  extern void dshash_destroy(dshash_table *hash_table);
> -
>  /* Finding, creating, deleting entries. */
>  extern void *dshash_find(dshash_table *hash_table,
>                           const void *key, bool
>  exclusive);

There's a number of spurious changes like this.



> From 60da67814fe40fd2a0c1870b15dcf6fcb21c989a Mon Sep 17 00:00:00 2001
> From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
> Date: Thu, 27 Sep 2018 11:15:19 +0900
> Subject: [PATCH 2/5] Add conditional lock feature to dshash
>
> Dshash currently waits for lock unconditionally. This commit adds new
> interfaces for dshash_find and dshash_find_or_insert. The new
> interfaces have an extra parameter "nowait" taht commands not to wait
> for lock.

s/taht/that/

There should be at least a sentence or two explaining why these are
useful.


> +/*
> + * The version of dshash_find, which is allowed to return immediately on lock
> + * failure. Lock status is set to *lock_failed in that case.
> + */

Hm. Not sure I like the *lock_acquired API.

> +void *
> +dshash_find_extended(dshash_table *hash_table, const void *key,
> +                     bool exclusive, bool nowait, bool *lock_acquired)
>  {
>      dshash_hash hash;
>      size_t        partition;
>      dshash_table_item *item;
>
> +    /*
> +     * No need to return lock resut when !nowait. Otherwise the caller may
> +     * omit the lock result when NULL is returned.
> +     */
> +    Assert(nowait || !lock_acquired);
> +
>      hash = hash_key(hash_table, key);
>      partition = PARTITION_FOR_HASH(hash);
>
>      Assert(hash_table->control->magic == DSHASH_MAGIC);
>      Assert(!hash_table->find_locked);
>
> -    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
> -                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
> +    if (nowait)
> +    {
> +        if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partition),
> +                                      exclusive ? LW_EXCLUSIVE : LW_SHARED))
> +        {
> +            if (lock_acquired)
> +                *lock_acquired = false;

Why is the test for lock_acquired needed here? I don't think it's
possible to use nowait correctly without passing in lock_acquired?

Think it'd make sense to document & assert that nowait = true implies
lock_acquired set, and nowait = false implies lock_acquired not being
set.

But, uh, why do we even need the lock_acquired parameter? If we couldn't
find an entry, then we should just release the lock, no?


I'm however inclined to think it's better to just have a separate
function for the nowait case, rather than an extended version supporting
both (with an internal helper doing most of the work).
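
Roughly what I mean, as a toy sketch rather than dshash code. All toy_ names
are hypothetical, and a C11 atomic_flag stands in for the partition LWLock
(spinning instead of sleeping), which is a simplification.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NENTRIES 4

typedef struct toy_table
{
    atomic_flag lock;           /* stands in for a partition LWLock */
    int         keys[TOY_NENTRIES];
    int         values[TOY_NENTRIES];
    bool        used[TOY_NENTRIES];
} toy_table;

/*
 * Shared worker.  If the entry is found it is returned with the lock held.
 * If it is not found the lock is released and NULL is returned.  With
 * nowait, a busy lock yields NULL and sets *lock_failed.
 */
static int *
toy_find_internal(toy_table *table, int key, bool nowait, bool *lock_failed)
{
    assert(lock_failed != NULL);
    *lock_failed = false;

    if (nowait)
    {
        if (atomic_flag_test_and_set(&table->lock))
        {
            *lock_failed = true;
            return NULL;
        }
    }
    else
    {
        while (atomic_flag_test_and_set(&table->lock))
            ;                   /* toy blocking acquire: spin */
    }

    for (size_t i = 0; i < TOY_NENTRIES; i++)
    {
        if (table->used[i] && table->keys[i] == key)
            return &table->values[i];
    }

    atomic_flag_clear(&table->lock);
    return NULL;
}

/* Blocking lookup: no lock-status parameter at all. */
int *
toy_find(toy_table *table, int key)
{
    bool ignored;

    return toy_find_internal(table, key, false, &ignored);
}

/* Non-blocking lookup: the lock status is mandatory. */
int *
toy_find_nowait(toy_table *table, int key, bool *lock_failed)
{
    return toy_find_internal(table, key, true, lock_failed);
}

/* Release the lock held on a returned entry. */
void
toy_release(toy_table *table)
{
    atomic_flag_clear(&table->lock);
}
```

Each public function then has exactly the parameters that make sense for it,
and there is no combination of nowait and lock status to assert against.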


> +/*
> + * The version of dshash_find_or_insert, which is allowed to return immediately
> + * on lock failure.
> + *
> + * Notes above dshash_find_extended() regarding locking and error handling
> + * equally apply here.

They don't, there's no lock_acquired parameter.

> + */
> +void *
> +dshash_find_or_insert_extended(dshash_table *hash_table,
> +                               const void *key,
> +                               bool *found,
> +                               bool nowait)

I think it's absurd to have dshash_find, dshash_find_extended,
dshash_find_or_insert, dshash_find_or_insert_extended. If they're
extended they should also be able to specify whether the entry will get
created.


> From d10c1117cec77a474dbb2cff001086d828b79624 Mon Sep 17 00:00:00 2001
> From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
> Date: Wed, 7 Nov 2018 16:53:49 +0900
> Subject: [PATCH 3/5] Make archiver process an auxiliary process
>
> This is a preliminary patch for shared-memory based stats collector.
> Archiver process must be a auxiliary process since it uses shared
> memory after stats data wes moved onto shared-memory. Make the process

s/wes/was/ s/onto/into/

> an auxiliary process in order to make it work.

>

> @@ -451,6 +454,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
>              StartupProcessMain();
>              proc_exit(1);        /* should never return */
>
> +        case ArchiverProcess:
> +            /* don't set signals, archiver has its own agenda */
> +            PgArchiverMain();
> +            proc_exit(1);        /* should never return */
> +
>          case BgWriterProcess:
>              /* don't set signals, bgwriter has its own agenda */
>              BackgroundWriterMain();

I think I'd rather remove the two comments that are copied to 6 out of 8
cases - they don't add anything.


>  /* ------------------------------------------------------------
>   * Local functions called by archiver follow
>   * ------------------------------------------------------------
> @@ -219,8 +148,8 @@ pgarch_forkexec(void)
>   *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
>   *    since we don't use 'em, it hardly matters...
>   */
> -NON_EXEC_STATIC void
> -PgArchiverMain(int argc, char *argv[])
> +void
> +PgArchiverMain(void)
>  {
>      /*
>       * Ignore all signals usually bound to some action in the postmaster,
> @@ -252,8 +181,27 @@ PgArchiverMain(int argc, char *argv[])
>  static void
>  pgarch_exit(SIGNAL_ARGS)
>  {
> -    /* SIGQUIT means curl up and die ... */
> -    exit(1);
> +    PG_SETMASK(&BlockSig);
> +
> +    /*
> +     * We DO NOT want to run proc_exit() callbacks -- we're here because
> +     * shared memory may be corrupted, so we don't want to try to clean up our
> +     * transaction.  Just nail the windows shut and get out of town.  Now that
> +     * there's an atexit callback to prevent third-party code from breaking
> +     * things by calling exit() directly, we have to reset the callbacks
> +     * explicitly to make this work as intended.
> +     */
> +    on_exit_reset();
> +
> +    /*
> +     * Note we do exit(2) not exit(0).  This is to force the postmaster into a
> +     * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
> +     * process.  This is necessary precisely because we don't clean up our
> +     * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
> +     * should ensure the postmaster sees this as a crash, too, but no harm in
> +     * being doubly sure.)
> +     */
> +    exit(2);
>  }
>

This seems to be a copy of code & comments from other signal handlers that predates

commit 8e19a82640d3fa2350db146ec72916856dd02f0a
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date:   2018-08-08 19:08:10 +0300

    Don't run atexit callbacks in quickdie signal handlers.


I think this just should use SignalHandlerForCrashExit().


I think we can even commit that separately - there's not really a reason
to not do that today, as far as I can tell?


>  /* SIGUSR1 signal handler for archiver process */

Hm - this currently doesn't set up a correct sigusr1 handler for a
shared memory backend - needs to invoke procsignal_sigusr1_handler
somewhere.

We can probably just convert to using normal latches here, and remove
the current 'wakened' logic? That'll remove the indirection via
postmaster too, which is nice.

> @@ -4273,6 +4276,9 @@ pgstat_get_backend_desc(BackendType backendType)
>
>      switch (backendType)
>      {
> +        case B_ARCHIVER:
> +            backendDesc = "archiver";
> +            break;

should imo include 'WAL' or such.



> From 5079583c447c3172aa0b4f8c0f0a46f6e1512812 Mon Sep 17 00:00:00 2001
> From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
> Date: Thu, 21 Feb 2019 12:44:56 +0900
> Subject: [PATCH 4/5] Shared-memory based stats collector
>
> Previously activity statistics is shared via files on disk. Every
> backend sends the numbers to the stats collector process via a socket.
> It makes snapshots as a set of files on disk with a certain interval
> then every backend reads them as necessary. It worked fine for
> comparatively small set of statistics but the set is under the
> pressure to growing up and the file size has reached the order of
> megabytes. To deal with larger statistics set, this patch let backends
> directly share the statistics via shared memory.

This spends a fair bit describing the old state, but very little
describing the new state.


> diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> index 0bfd6151c4..a6b0bdec12 100644
> --- a/doc/src/sgml/monitoring.sgml
> +++ b/doc/src/sgml/monitoring.sgml
> @@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
>  postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
>  postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
>  postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
> -postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
>  postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
>  postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
>  postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
> @@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
>     master server process.  The command arguments
>     shown for it are the same ones used when it was launched.  The next five
>     processes are background worker processes automatically launched by the
> -   master process.  (The <quote>stats collector</quote> process will not be present
> -   if you have set the system not to start the statistics collector; likewise
> -   the <quote>autovacuum launcher</quote> process can be disabled.)
> +   master process.  (The <quote>autovacuum launcher</quote> process will not
> +   be present if you have set the system not to start it.)
>     Each of the remaining
>     processes is a server process handling one client connection.  Each such
>     process sets its command line display in the form

There's more references to the stats collector than this... E.g. in
catalogs.sgml

   <xref linkend="view-table"/> lists the system views described here.
   More detailed documentation of each view follows below.
   There are some additional views that provide access to the results of
   the statistics collector; they are described in <xref
   linkend="monitoring-stats-views-table"/>.
  </para>


> diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
> index 6d1f28c327..8dcb0fb7f7 100644
> --- a/src/backend/postmaster/autovacuum.c
> +++ b/src/backend/postmaster/autovacuum.c
> @@ -1956,15 +1956,15 @@ do_autovacuum(void)
>                                            ALLOCSET_DEFAULT_SIZES);
>      MemoryContextSwitchTo(AutovacMemCxt);
>
> +    /* Start a transaction so our commands have one to play into. */
> +    StartTransactionCommand();
> +
>      /*
>       * may be NULL if we couldn't find an entry (only happens if we are
>       * forcing a vacuum for anti-wrap purposes).
>       */
>      dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
>
> -    /* Start a transaction so our commands have one to play into. */
> -    StartTransactionCommand();
> -
>      /*
>       * Clean up any dead statistics collector entries for this DB. We always
>       * want to do this exactly once per DB-processing cycle, even if we find
> @@ -2747,12 +2747,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
>      if (isshared)
>      {
>          if (PointerIsValid(shared))
> -            tabentry = hash_search(shared->tables, &relid,
> -                                   HASH_FIND, NULL);
> +            tabentry = pgstat_fetch_stat_tabentry_extended(shared, relid);
>      }
>      else if (PointerIsValid(dbentry))
> -        tabentry = hash_search(dbentry->tables, &relid,
> -                               HASH_FIND, NULL);
> +        tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
>
>      return tabentry;
>  }

Why is pgstat_fetch_stat_tabentry_extended called "_extended"? Outside
the stats subsystem there is exactly one caller of the non-extended
version, as far as I can see. That's index_concurrently_swap() - and imo
that's code that should live in the stats subsystem, rather than being
open coded in index.c.



> diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
> index ca5c6376e5..1ffe073a1f 100644
> --- a/src/backend/postmaster/pgstat.c
> +++ b/src/backend/postmaster/pgstat.c
> @@ -1,15 +1,23 @@
>  /* ----------
>   * pgstat.c
>   *
> - *    All the statistics collector stuff hacked up in one big, ugly file.
> + *    Statistics collector facility.
>   *
> - *    TODO:    - Separate collector, postmaster and backend stuff
> - *              into different files.
> + *  Collects per-table and per-function usage statistics of all backends on
> + *  shared memory. pg_count_*() and friends are the interface to locally store
> + *  backend activities during a transaction. Then pgstat_flush_stat() is called
> + *  at the end of a transaction to pulish the local stats on shared memory.
>   *

I'd rather not exhaustively list the different objects this handles -
it'll either be annoying to maintain, or just get out of date.


> - *            - Add some automatic call for pgstat vacuuming.
> + *  To avoid congestion on the shared memory, we update shared stats no more
> + *  often than intervals of PGSTAT_STAT_MIN_INTERVAL(500ms). In the case where
> + *  all the local numbers cannot be flushed immediately, we postpone updates
> + *  and try the next chance after the interval of
> + *  PGSTAT_STAT_RETRY_INTERVAL(100ms), but we don't wait for no longer than
> + *  PGSTAT_STAT_MAX_INTERVAL(1000ms).

I'm not convinced by this backoff logic. The basic interval seems quite
high for something going through shared memory, and the max retry seems
pretty low.
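
For reference, the throttling rules quoted above can be written down as a
small decision function. This is only a sketch of the comment's wording,
not the patch's code: FlushDecision, decide_flush(), next_delay_ms() and
their parameters are hypothetical names, and the three intervals are the
values quoted in the comment.

```c
#include <stdbool.h>

/* Interval values as quoted in the patch comment, in milliseconds. */
#define PGSTAT_STAT_MIN_INTERVAL   500
#define PGSTAT_STAT_RETRY_INTERVAL 100
#define PGSTAT_STAT_MAX_INTERVAL   1000

/* Hypothetical result type, for illustration only. */
typedef enum FlushDecision
{
    FLUSH_SKIP,         /* too soon since the last flush */
    FLUSH_TRY_NOWAIT,   /* try, but give up on lock contention */
    FLUSH_WAIT          /* take the locks unconditionally */
} FlushDecision;

/*
 * now_ms:        current time
 * last_flush_ms: time of the last successful flush
 * pending_ms:    time the oldest postponed update was recorded, -1 if none
 */
static FlushDecision
decide_flush(long now_ms, long last_flush_ms, long pending_ms, bool force)
{
    if (force)
        return FLUSH_WAIT;

    /* Postponed updates are retried, but never held longer than MAX. */
    if (pending_ms >= 0)
        return (now_ms - pending_ms >= PGSTAT_STAT_MAX_INTERVAL)
            ? FLUSH_WAIT : FLUSH_TRY_NOWAIT;

    /* Nothing postponed: publish at most once per MIN interval. */
    if (now_ms - last_flush_ms < PGSTAT_STAT_MIN_INTERVAL)
        return FLUSH_SKIP;

    return FLUSH_TRY_NOWAIT;
}

/* Delay until the next attempt, depending on whether everything went out. */
static long
next_delay_ms(bool flushed_all)
{
    return flushed_all ? PGSTAT_STAT_MIN_INTERVAL : PGSTAT_STAT_RETRY_INTERVAL;
}
```

Read this way, an uncontended backend publishes at most twice per second,
while a contended one starts blocking on the lock after about one second
of postponed updates.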


> +/*
> + * Operation mode and return code of pgstat_get_db_entry.
> + */
> +#define    PGSTAT_SHARED        0

This is unreferenced.


> +#define    PGSTAT_EXCLUSIVE    1
> +#define    PGSTAT_NOWAIT        2

And these should imo rather be parameters.


> +typedef enum PgStat_TableLookupResult
> +{
> +    NOT_FOUND,
> +    FOUND,
> +    LOCK_FAILED
> +} PgStat_TableLookupResult;

This seems like a seriously bad idea to me. These are very generic
names. There's also basically no references except setting them to the
first two?
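
One possible shape for both points, lock mode passed as parameters and a
prefixed result enum, is sketched below. Everything here is hypothetical
(PgStat_LookupResult, toy_get_entry(), the toy table and its lock state);
it only illustrates the naming and the calling convention, not the
patch's lookup code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Prefixed names cannot collide with other FOUND/NOT_FOUND symbols. */
typedef enum PgStat_LookupResult
{
    PGSTAT_LOOKUP_NOT_FOUND,
    PGSTAT_LOOKUP_FOUND,
    PGSTAT_LOOKUP_LOCK_FAILED
} PgStat_LookupResult;

typedef struct ToyEntry
{
    unsigned int key;
    int          shared_holders;    /* toy lock state: shared holders */
    bool         exclusive_held;    /* toy lock state: held exclusively */
} ToyEntry;

static ToyEntry toy_table[] = {
    {1, 0, false},                  /* unlocked */
    {2, 0, true},                   /* locked by "someone else" */
};

/* Lock mode and wait behaviour are plain parameters, not bit flags. */
static PgStat_LookupResult
toy_get_entry(unsigned int key, bool exclusive, bool nowait, ToyEntry **entry)
{
    *entry = NULL;

    for (size_t i = 0; i < sizeof(toy_table) / sizeof(toy_table[0]); i++)
    {
        ToyEntry   *e = &toy_table[i];
        bool        conflict;

        if (e->key != key)
            continue;

        conflict = e->exclusive_held || (exclusive && e->shared_holders > 0);
        if (conflict)
        {
            /* Real code would sleep here when !nowait; the toy cannot. */
            assert(nowait);
            return PGSTAT_LOOKUP_LOCK_FAILED;
        }

        if (exclusive)
            e->exclusive_held = true;
        else
            e->shared_holders++;

        *entry = e;
        return PGSTAT_LOOKUP_FOUND;
    }

    return PGSTAT_LOOKUP_NOT_FOUND;
}
```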

> +#define        StatsLock (&StatsShmem->StatsMainLock)
>
> -static time_t last_pgstat_start_time;
> +/* Shared stats bootstrap information */
> +typedef struct StatsShmemStruct
> +{
> +    LWLock                StatsMainLock;        /* lock to protect this struct */
> +    dsa_handle             stats_dsa_handle;    /* DSA handle for stats data */
> +    dshash_table_handle db_hash_handle;
> +    dsa_pointer            global_stats;
> +    dsa_pointer            archiver_stats;
> +    int                    refcount;
> +} StatsShmemStruct;

Why isn't this an lwlock in lwlocknames.h, rather than
allocated here?


> +/*
> + * BgWriter global statistics counters. The name cntains a remnant from the
> + * time when the stats collector was a dedicate process, which used sockets to
> + * send it.
> + */
> +PgStat_MsgBgWriter BgWriterStats = {0};

I am strongly against keeping the 'Msg' prefix. That seems extremely
confusing going forward.


> +/* common header of snapshot entry in reader snapshot hash */
> +typedef struct PgStat_snapshot
> +{
> +    Oid        key;
> +    bool    negative;
> +    void   *body;                /* end of header part: to keep alignment */
> +} PgStat_snapshot;


> +/* context struct for snapshot_statentry */
> +typedef struct pgstat_snapshot_param
> +{
> +    char           *hash_name;            /* name of the snapshot hash */
> +    int                hash_entsize;        /* element size of hash entry */
> +    dshash_table_handle    dsh_handle;        /* dsh handle to attach */
> +    const dshash_parameters *dsh_params;/* dshash params */
> +    HTAB          **hash;                /* points to variable to hold hash */
> +    dshash_table  **dshash;                /* ditto for dshash */
> +} pgstat_snapshot_param;

Why does this exist? The struct contents are actually constant across
calls, yet you have declared them inside functions (as static - static
on function scope isn't really the same as global static).

If we want it, I think we should separate the naming more
meaningfully. The important difference between 'hash' and 'dshash' isn't
the hashing module, it's that one is a local copy, the other a shared
hashtable!


> +/*
> + * Backends store various database-wide info that's waiting to be flushed out
> + * to shared memory in these variables.
> + *
> + * checksum_failures is the exception in that it is cluster-wide value.
> + */
> +typedef struct BackendDBStats
> +{
> +    int        n_conflict_tablespace;
> +    int        n_conflict_lock;
> +    int        n_conflict_snapshot;
> +    int        n_conflict_bufferpin;
> +    int        n_conflict_startup_deadlock;
> +    int        n_deadlocks;
> +    size_t    n_tmpfiles;
> +    size_t    tmpfilesize;
> +    HTAB    *checksum_failures;
> +} BackendDBStats;

Why is this a separate struct from PgStat_StatDBEntry? We shouldn't have
these fields in multiple places.


> +    if (StatsShmem->refcount > 0)
> +        StatsShmem->refcount++;

What prevents us from leaking the refcount here? We could e.g. error out
while attaching, no? Which'd mean we'd leak the refcount.
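
One way to make such a leak impossible is to bump the count only after
every step that can fail has succeeded. A minimal sketch under that
assumption follows; ToyShmem, toy_map_area(), toy_attach() and the
"fail" switch are hypothetical stand-ins, and a plain return value
replaces the backend's error handling.

```c
#include <stdbool.h>

typedef struct ToyShmem
{
    int         refcount;
} ToyShmem;

/* Stand-in for the part of attaching that can fail. */
static bool
toy_map_area(bool fail)
{
    return !fail;
}

static bool
toy_attach(ToyShmem *shmem, bool fail)
{
    /* Do everything that can fail first ... */
    if (!toy_map_area(fail))
        return false;           /* refcount untouched, nothing to undo */

    /* ... and only then publish our reference. */
    shmem->refcount++;
    return true;
}

static void
toy_detach(ToyShmem *shmem)
{
    shmem->refcount--;
}
```

In the backend an error is raised with ereport(ERROR) rather than
returned, so the same ordering argument applies to anything that can
throw between the increment and the point where cleanup is registered.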


To me it looks like there's a lot of added complexity just because you
want to be able to reset stats via

void
pgstat_reset_all(void)
{

    /*
     * We could directly remove files and recreate the shared memory area. But
     * detach then attach for simplicity.
     */
    pgstat_detach_shared_stats(false);    /* Don't write */
    pgstat_attach_shared_stats();

Without that you'd not need the complexity of attaching, detaching to
the same degree - every backend could just cache lookup data during
initialization, instead of having to constantly re-compute that.

Nor would the dynamic re-creation of the db dshash table be needed.


> +/* ----------
> + * pgstat_report_stat() -
> + *
> + *    Must be called by processes that performs DML: tcop/postgres.c, logical
> + *    receiver processes, SPI worker, etc. to apply the so far collected
> + *    per-table and function usage statistics to the shared statistics hashes.
> + *
> + *  Updates are applied not more frequent than the interval of
> + *  PGSTAT_STAT_MIN_INTERVAL milliseconds. They are also postponed on lock
> + *  failure if force is false and there's no pending updates longer than
> + *  PGSTAT_STAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
> + *  succeeding calls of this function.
> + *
> + *    Returns the time until the next timing when updates are applied in
> + *    milliseconds if there are no updates holded for more than
> + *    PGSTAT_STAT_MIN_INTERVAL milliseconds.
> + *
> + *    Note that this is called only out of a transaction, so it is fine to use
> + *    transaction stop time as an approximation of current time.
> + *    ----------
> + */

Inconsistent indentation.

> +long
> +pgstat_report_stat(bool force)
>  {

> +    /* Flush out table stats */
> +    if (pgStatTabList != NULL && !pgstat_flush_stat(&cxt, !force))
> +        pending_stats = true;
> +
> +    /* Flush out function stats */
> +    if (pgStatFunctions != NULL && !pgstat_flush_funcstats(&cxt, !force))
> +        pending_stats = true;

This seems weird. pgstat_flush_stat(), pgstat_flush_funcstats() operate
on pgStatTabList/pgStatFunctions, but don't actually reset it? Besides
being confusing while reading the code, it also made the diff much
harder to read.


> -        snprintf(fname, sizeof(fname), "%s/%s", directory,
> -                 entry->d_name);
> -        unlink(fname);
> +    /* Flush out database-wide stats */
> +    if (HAVE_PENDING_DBSTATS())
> +    {
> +        if (!pgstat_flush_dbstats(&cxt, !force))
> +            pending_stats = true;
>      }

Linearly checking a number of stats doesn't seem like the right way
going forward. Also seems fairly omission prone.

Why does this code check live in pgstat_report_stat(), rather than
pgstat_flush_dbstats()?


> /*
>  * snapshot_statentry() - Common routine for functions
>  *                             pgstat_fetch_stat_*entry()
>  *

Why has this function been added between the closely linked
pgstat_report_stat() and pgstat_flush_stat() etc?


>  *  Returns the pointer to a snapshot of a shared entry for the key or NULL if
>  *  not found. Returned snapshots are stable during the current transaction or
>  *  until pgstat_clear_snapshot() is called.
>  *
>  *  The snapshots are stored in a hash, pointer to which is stored in the
>  *  *HTAB variable pointed by cxt->hash. If not created yet, it is created
>  *  using hash_name, hash_entsize in cxt.
>  *
>  *  cxt->dshash points to dshash_table for dbstat entries. If not yet
>  *  attached, it is attached using cxt->dsh_handle.

Why do we still have this? A hashtable lookup is cheap, compared to
fetching a file - so it's not to save time. Given how infrequent the
pgstat_fetch_* calls are, it's not to avoid contention either.

At first one could think it's for consistency - but no, that's not it
either, because snapshot_statentry() refetches the snapshot without
control from the outside:


>   /*
>    * We don't want so frequent update of stats snapshot. Keep it at least
>    * for PGSTAT_STAT_MIN_INTERVAL ms. Not postpone but just ignore the cue.
>    */
>   if (clear_snapshot)
>   {
>       clear_snapshot = false;
>
>       if (pgStatSnapshotContext &&
>           snapshot_globalStats.stats_timestamp <
>           GetCurrentStatementStartTimestamp() -
>           PGSTAT_STAT_MIN_INTERVAL * 1000)
>       {
>           MemoryContextReset(pgStatSnapshotContext);
>
>           /* Reset variables */
>           global_snapshot_is_valid = false;
>           pgStatSnapshotContext = NULL;
>           pgStatLocalHash = NULL;
>
>           pgstat_setup_memcxt();
>       }
>   }

I think we should just remove this entire local caching snapshot layer
for lookups.


> /*
>  * pgstat_flush_stat: Flushes table stats out to shared statistics.
>  *

Why is this named pgstat_flush_stat, rather than pgstat_flush_tabstats
or such? Given that the code for dealing with an individual table's
entry is named pgstat_flush_tabstat() that's very confusing.



>  *  If nowait is true, returns false if required lock was not acquired
>  *  immediately. In that case, unapplied table stats updates are left alone in
>  *  TabStatusArray to wait for the next chance. cxt holds some dshash related
>  *  values that we want to carry around while updating shared stats.
>  *
>  *  Returns true if all stats info are flushed. Caller must detach dshashes
>  *  stored in cxt after use.
>  */
> static bool
> pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait)
> {
>   static const PgStat_TableCounts all_zeroes;
>   TabStatusArray *tsa;
>   HTAB           *new_tsa_hash = NULL;
>   TabStatusArray *dest_tsa = pgStatTabList;
>   int             dest_elem = 0;
>   int             i;
>
>   /* nothing to do, just return */
>   if (pgStatTabHash == NULL)
>       return true;
>
>   /*
>    * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
>    * entries it points to.
>    */
>   hash_destroy(pgStatTabHash);
>   pgStatTabHash = NULL;
>
>   /*
>    * Scan through the TabStatusArray struct(s) to find tables that actually
>    * have counts, and try flushing it out to shared stats. We may fail on
>    * some entries in the array. Leaving the entries being packed at the
>    * beginning of the array.
>    */
>   for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
>   {

It seems odd that there's tabstat specific code in pgstat_flush_stat
(also note the singular, while it's processing all stats, whereas below
you're treating pgstat_flush_tabstat as only affecting one table).


>       for (i = 0; i < tsa->tsa_used; i++)
>       {
>           PgStat_TableStatus *entry = &tsa->tsa_entries[i];
>
>           /* Shouldn't have any pending transaction-dependent counts */
>           Assert(entry->trans == NULL);
>
>           /*
>            * Ignore entries that didn't accumulate any actual counts, such
>            * as indexes that were opened by the planner but not used.
>            */
>           if (memcmp(&entry->t_counts, &all_zeroes,
>                      sizeof(PgStat_TableCounts)) == 0)
>               continue;
>
>           /* try to apply the tab stats */
>           if (!pgstat_flush_tabstat(cxt, nowait, entry))
>           {
>               /*
>                * Failed. Move it to the beginning in TabStatusArray and
>                * leave it.
>                */
>               TabStatHashEntry *hash_entry;
>               bool found;
>
>               if (new_tsa_hash == NULL)
>                   new_tsa_hash = create_tabstat_hash();
>
>               /* Create hash entry for this entry */
>               hash_entry = hash_search(new_tsa_hash, &entry->t_id,
>                                        HASH_ENTER, &found);
>               Assert(!found);
>
>               /*
>                * Move insertion pointer to the next segment if the segment
>                * is filled up.
>                */
>               if (dest_elem >= TABSTAT_QUANTUM)
>               {
>                   Assert(dest_tsa->tsa_next != NULL);
>                   dest_tsa = dest_tsa->tsa_next;
>                   dest_elem = 0;
>               }
>
>               /*
>                * Pack the entry at the begining of the array. Do nothing if
>                * no need to be moved.
>                */
>               if (tsa != dest_tsa || i != dest_elem)
>               {
>                   PgStat_TableStatus *new_entry;
>                   new_entry = &dest_tsa->tsa_entries[dest_elem];
>                   *new_entry = *entry;
>
>                   /* use new_entry as entry hereafter */
>                   entry = new_entry;
>               }
>
>               hash_entry->tsa_entry = entry;
>               dest_elem++;
>           }

This seems like too much code. Why is this entirely different from the
way funcstats works? The difference was already too big before, but this
made it *way* worse.

One goal of this project, as I understand it, is to make it easier to
add additional stats. As is, this seems to make it harder from the code
level.


> bool
> pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
>                    PgStat_TableStatus *entry)
> {
>   Oid     dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
>   int     table_mode = PGSTAT_EXCLUSIVE;
>   bool    updated = false;
>   dshash_table *tabhash;
>   PgStat_StatDBEntry *dbent;
>   int     generation;
>
>   if (nowait)
>       table_mode |= PGSTAT_NOWAIT;
>
>   /* Attach required table hash if not yet. */
>   if ((entry->t_shared ? cxt->shdb_tabhash : cxt->mydb_tabhash) == NULL)
>   {
>       /*
>        *  Return if we don't have corresponding dbentry. It would've been
>        *  removed.
>        */
>       dbent = pgstat_get_db_entry(dboid, table_mode, NULL);
>       if (!dbent)
>           return false;
>
>       /*
>        * We don't hold lock on the dbentry since it cannot be dropped while
>        * we are working on it.
>        */
>       generation = pin_hashes(dbent);
>       tabhash = attach_table_hash(dbent, generation);

This again is just cost incurred by insisting on destroying hashtables
instead of keeping them around as long as necessary.


>       if (entry->t_shared)
>       {
>           cxt->shgeneration = generation;
>           cxt->shdbentry = dbent;
>           cxt->shdb_tabhash = tabhash;
>       }
>       else
>       {
>           cxt->mygeneration = generation;
>           cxt->mydbentry = dbent;
>           cxt->mydb_tabhash = tabhash;
>
>           /*
>            * We come here once per database. Take the chance to update
>            * database-wide stats
>            */
>           LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
>           dbent->n_xact_commit += pgStatXactCommit;
>           dbent->n_xact_rollback += pgStatXactRollback;
>           dbent->n_block_read_time += pgStatBlockReadTime;
>           dbent->n_block_write_time += pgStatBlockWriteTime;
>           LWLockRelease(&dbent->lock);
>           pgStatXactCommit = 0;
>           pgStatXactRollback = 0;
>           pgStatBlockReadTime = 0;
>           pgStatBlockWriteTime = 0;
>       }
>   }
>   else if (entry->t_shared)
>   {
>       dbent = cxt->shdbentry;
>       tabhash = cxt->shdb_tabhash;
>   }
>   else
>   {
>       dbent = cxt->mydbentry;
>       tabhash = cxt->mydb_tabhash;
>   }
>
>
>   /*
>    * Local table stats should be applied to both dbentry and tabentry at
>    * once. Update dbentry only if we could update tabentry.
>    */
>   if (pgstat_update_tabentry(tabhash, entry, nowait))
>   {
>       pgstat_update_dbentry(dbent, entry);
>       updated = true;
>   }

At this point we're very deeply nested. pgstat_report_stat() ->
pgstat_flush_stat() -> pgstat_flush_tabstat() ->
pgstat_update_tabentry().

That's way over the top imo.


I don't think it makes much sense that pgstat_update_dbentry() is called
separately for each table. Why would we want to constantly lock that
entry? It seems to be much more sensible to instead have
pgstat_flush_stat() transfer the stats it reported to the pending
database wide counters, and then report that to shared memory *once* per
pgstat_report_stat() with pgstat_flush_dbstats()?
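
To illustrate the shape I mean, here is a minimal, self-contained sketch.
All names (ToyTabCounts, ToyPendingDB, ToyDBEntry, toy_flush_tables(),
toy_flush_db()) are made up, and a counter stands in for the dbentry
lock so that the number of acquisitions is visible.

```c
#include <stddef.h>

/* Per-table counts collected locally by a backend. */
typedef struct ToyTabCounts
{
    long        tuples_inserted;
    long        blocks_hit;
} ToyTabCounts;

/* Backend-local database-wide totals; no locking needed. */
typedef struct ToyPendingDB
{
    long        tuples_inserted;
    long        blocks_hit;
} ToyPendingDB;

/* Shared per-database entry. */
typedef struct ToyDBEntry
{
    int         lock_acquisitions;  /* how often the "lock" was taken */
    long        tuples_inserted;
    long        blocks_hit;
} ToyDBEntry;

/* Flush tables; only accumulate the database-wide part locally. */
static void
toy_flush_tables(const ToyTabCounts *tabs, size_t ntabs, ToyPendingDB *pending)
{
    for (size_t i = 0; i < ntabs; i++)
    {
        /* the per-table shared entry would be updated here */
        pending->tuples_inserted += tabs[i].tuples_inserted;
        pending->blocks_hit += tabs[i].blocks_hit;
    }
}

/* Publish the accumulated totals under a single lock acquisition. */
static void
toy_flush_db(ToyDBEntry *db, ToyPendingDB *pending)
{
    db->lock_acquisitions++;        /* LWLockAcquire() in real code */
    db->tuples_inserted += pending->tuples_inserted;
    db->blocks_hit += pending->blocks_hit;
    /* LWLockRelease() in real code */

    pending->tuples_inserted = 0;
    pending->blocks_hit = 0;
}
```

With three tables flushed, the database entry is locked once instead of
three times.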


> /*
>  * pgstat_flush_dbstats: Flushes out miscellaneous database stats.
>  *
>  *  If nowait is true, returns with false on lock failure on dbentry.
>  *
>  *  Returns true if all stats are flushed out.
>  */
> static bool
> pgstat_flush_dbstats(pgstat_flush_stat_context *cxt, bool nowait)
> {
>   /* get dbentry if not yet */
>   if (cxt->mydbentry == NULL)
>   {
>       int op = PGSTAT_EXCLUSIVE;
>       if (nowait)
>           op |= PGSTAT_NOWAIT;
>
>       cxt->mydbentry = pgstat_get_db_entry(MyDatabaseId, op, NULL);
>
>       /* return if lock failed. */
>       if (cxt->mydbentry == NULL)
>           return false;
>
>       /* we use this generation of table /function stats in this turn */
>       cxt->mygeneration = pin_hashes(cxt->mydbentry);
>   }
>
>   LWLockAcquire(&cxt->mydbentry->lock, LW_EXCLUSIVE);
>   if (HAVE_PENDING_CONFLICTS())
>       pgstat_flush_recovery_conflict(cxt->mydbentry);
>   if (BeDBStats.n_deadlocks != 0)
>       pgstat_flush_deadlock(cxt->mydbentry);
>   if (BeDBStats.n_tmpfiles != 0)
>       pgstat_flush_tempfile(cxt->mydbentry);
>   if (BeDBStats.checksum_failures != NULL)
>       pgstat_flush_checksum_failure(cxt->mydbentry);
>   LWLockRelease(&cxt->mydbentry->lock);

What's the point of having all these sub-functions? I see that you, for
an undocumented reason, have pgstat_report_recovery_conflict() flush
conflict stats immediately:

>   dbentry = pgstat_get_db_entry(MyDatabaseId,
>                                 PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
>                                 &status);
>
>   if (status == LOCK_FAILED)
>       return;
>
>   /* We had a chance to flush immediately */
>   pgstat_flush_recovery_conflict(dbentry);
>
>   dshash_release_lock(pgStatDBHash, dbentry);

But I don't understand why? Nor why we'd not just report all pending
database wide changes in that case?

The fact that you're locking the per-database entry unconditionally once
for each table almost guarantees contention - and you're not using the
'conditional lock' approach for that. I don't understand.



> /* ----------
>  * pgstat_vacuum_stat() -
>  *
>  *    Remove objects we can get rid of.
>  * ----------
>  */
> void
> pgstat_vacuum_stat(void)
> {
>   HTAB       *oidtab;
>   dshash_seq_status dshstat;
>   PgStat_StatDBEntry *dbentry;
>
>   /* we don't collect stats under standalone mode */
>   if (!IsUnderPostmaster)
>       return;
>
>   /*
>    * Read pg_database and make a list of OIDs of all existing databases
>    */
>   oidtab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
>
>   /*
>    * Search the database hash table for dead databases and drop them
>    * from the hash.
>    */
>
>   dshash_seq_init(&dshstat, pgStatDBHash, false, true);
>   while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
>   {
>       Oid         dbid = dbentry->databaseid;
>
>       CHECK_FOR_INTERRUPTS();
>
>       /* the DB entry for shared tables (with InvalidOid) is never dropped */
>       if (OidIsValid(dbid) &&
>           hash_search(oidtab, (void *) &dbid, HASH_FIND, NULL) == NULL)
>           pgstat_drop_database(dbid);
>   }
>
>   /* Clean up */
>   hash_destroy(oidtab);

So, uh, pgstat_drop_database() again does a *separate* lookup in the
dshash, locking the entry. Which only works because you added this dirty
hack:

    /* We need to keep partition lock while sequential scan */
    if (!hash_table->seqscan_running)
    {
        hash_table->find_locked = false;
        hash_table->find_exclusively_locked = false;
        LWLockRelease(PARTITION_LOCK(hash_table, partition));
    }

to dshash_delete_entry(). This seems insane to me. There's not even a
comment explaining this?


>   /*
>    * Similarly to above, make a list of all known relations in this DB.
>    */
>   oidtab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
>
>   /*
>    * Check for all tables listed in stats hashtable if they still exist.
>    * Stats cache is useless here so directly search the shared hash.
>    */
>   pgstat_remove_useless_entries(dbentry->tables, &dsh_tblparams, oidtab);
>
>   /*
>    * Repeat the above but we needn't bother in the common case where no
>    * function stats are being collected.
>    */
>   if (dbentry->functions != DSM_HANDLE_INVALID)
>   {
>       oidtab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
>
>       pgstat_remove_useless_entries(dbentry->functions, &dsh_funcparams,
>                                     oidtab);
>   }
>   dshash_release_lock(pgStatDBHash, dbentry);

Wait, why are we holding the database partition lock across all this?
Again without any comments explaining why?


> +void
> +pgstat_send_archiver(const char *xlog, bool failed)

Why do we still have functions named pgstat_send*?


Greetings,

Andres Freund




Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
Thank you very much!!

At Thu, 12 Mar 2020 20:13:24 -0700, Andres Freund <andres@anarazel.de> wrote in 
> Hi,
> 
> Thomas, could you look at the first two patches here, and my review
> questions?
> 
> 
> General comments about this series:
> - A lot of the review comments feel like I've written them before, a
>   year or more ago. I feel this patch ought to be in a much better
>   state. There's a lot of IMO fairly obvious stuff here, and things that
>   have been mentioned multiple times previously.

I apologize for all of the obvious stuff and the things that have
already been mentioned. I'll address them.

> - There's a *lot* of typos in here. I realize being an ESL is hard, but
>   a lot of these can be found with the simplest spellchecker.  That's
>   one thing for a patch that just has been hacked up as a POC, but this
>   is a multi year thread?

I'll review all the changed parts again. I used ispell, but I must have
failed to check many of the changes.

> - There's some odd formatting. Consider using pgindent more regularly.

I'll do so.

> More detailed comments below.

Thank you very much for the intensive review, I'm going to revise the
patch according to them.

> I'm considering rewriting the parts of the patchset that I don't like -
> but it'll look quite different afterwards.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: shared-memory based stats collector

From
Andres Freund
Date:
Hi,

On 2020-03-13 16:34:50 +0900, Kyotaro Horiguchi wrote:
> Thank you very much!!
> 
> At Thu, 12 Mar 2020 20:13:24 -0700, Andres Freund <andres@anarazel.de> wrote in 
> > Hi,
> > 
> > Thomas, could you look at the first two patches here, and my review
> > questions?
> > 
> > 
> > General comments about this series:
> > - A lot of the review comments feel like I've written them before, a
> >   year or more ago. I feel this patch ought to be in a much better
> >   state. There's a lot of IMO fairly obvious stuff here, and things that
> >   have been mentioned multiple times previously.
> 
> I apologize for all of the obvious stuff and the things that have
> already been mentioned. I'll address them.
> 
> > - There's a *lot* of typos in here. I realize being an ESL is hard, but
> >   a lot of these can be found with the simplest spellchecker.  That's
> >   one thing for a patch that just has been hacked up as a POC, but this
> >   is a multi year thread?
> 
> I'll review all the changed parts again. I used ispell, but I must have
> failed to check many of the changes.
> 
> > - There's some odd formatting. Consider using pgindent more regularly.
> 
> I'll do so.
> 
> > More detailed comments below.
> 
> Thank you very much for the intensive review, I'm going to revise the
> patch according to them.
> 
> > I'm considering rewriting the parts of the patchset that I don't like -
> > but it'll look quite different afterwards.

I take your response to mean that you'd prefer to evolve the patch
largely on your own? I'm mainly asking because I think there's some
chance that we could still get this into v13, but if so we'll have to go
for it now.

Greetings,

Andres Freund



Re: shared-memory based stats collector

From
Thomas Munro
Date:
Hi Horiguchi-san, Andres,

I tried to rebase this (see attached, no intentional changes beyond
rebasing).  Some feedback:

On Fri, Mar 13, 2020 at 4:13 PM Andres Freund <andres@anarazel.de> wrote:
> Thomas, could you look at the first two patches here, and my review
> questions?

Ack.

> >               dsa_pointer item_pointer = hash_table->buckets[i];
> > @@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
> >                                                               LW_EXCLUSIVE));
> >
> >       delete_item(hash_table, item);
> > -     hash_table->find_locked = false;
> > -     hash_table->find_exclusively_locked = false;
> > -     LWLockRelease(PARTITION_LOCK(hash_table, partition));
> > +
> > +     /* We need to keep partition lock while sequential scan */
> > +     if (!hash_table->seqscan_running)
> > +     {
> > +             hash_table->find_locked = false;
> > +             hash_table->find_exclusively_locked = false;
> > +             LWLockRelease(PARTITION_LOCK(hash_table, partition));
> > +     }
> >  }
>
> This seems like a failure prone API.

If I understand correctly, the only purpose of the seqscan_running
variable is to control that behaviour ^^^.  That is, to make
dshash_delete_entry() keep the partition lock if you delete an entry
while doing a seq scan.  Why not get rid of that, and provide a
separate interface for deleting while scanning?
dshash_seq_delete(dshash_seq_status *scan, void *entry).  I suppose it
would be most common to want to delete the "current" item in the seq
scan, but it could allow you to delete anything in the same partition,
or any entry if using the "consistent" mode.  Oh, I see that Andres
said the same thing later.
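
A minimal sketch of what such a separate "delete while scanning" interface could look like, using a toy array-backed table rather than dshash itself. All names here (scan_table, scan_init, scan_next, scan_delete_current, scan_term) are hypothetical, and the locks_held counter only stands in for a partition lock. The point is that the scan state owns the lock from init to term, so deleting the current entry never has to decide whether to release it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SCAN_NSLOTS 8

/* Toy table: a flat array of slots; locks_held stands in for a partition lock. */
typedef struct scan_table
{
	int			keys[SCAN_NSLOTS];
	bool		used[SCAN_NSLOTS];
	int			locks_held;
} scan_table;

/* The scan state owns the lock for the whole lifetime of the scan. */
typedef struct scan_status
{
	scan_table *table;
	int			cur;			/* slot returned by the last scan_next(), or -1 */
} scan_status;

void
scan_init(scan_status *scan, scan_table *table)
{
	scan->table = table;
	scan->cur = -1;
	table->locks_held++;		/* "acquire" */
}

/* Return a pointer to the next used key, or NULL at the end of the table. */
int *
scan_next(scan_status *scan)
{
	for (int i = scan->cur + 1; i < SCAN_NSLOTS; i++)
	{
		if (scan->table->used[i])
		{
			scan->cur = i;
			return &scan->table->keys[i];
		}
	}
	scan->cur = SCAN_NSLOTS;
	return NULL;
}

/* Delete the entry most recently returned; the lock stays with the scan. */
void
scan_delete_current(scan_status *scan)
{
	assert(scan->cur >= 0 && scan->cur < SCAN_NSLOTS);
	assert(scan->table->used[scan->cur]);
	scan->table->used[scan->cur] = false;
}

/* Always called, whether or not the scan ran to completion. */
void
scan_term(scan_status *scan)
{
	scan->table->locks_held--;	/* "release" */
	scan->table = NULL;
}

/* Example caller: remove every even key and report how many were removed. */
int
scan_remove_even(scan_table *table)
{
	scan_status scan;
	int		   *key;
	int			removed = 0;

	scan_init(&scan, table);
	while ((key = scan_next(&scan)) != NULL)
	{
		if (*key % 2 == 0)
		{
			scan_delete_current(&scan);
			removed++;
		}
	}
	scan_term(&scan);
	return removed;
}
```

With this shape the ordinary delete function can keep its original "delete and release" behaviour, since it is never called from inside a scan.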

> [Andres complaining about comments and language stuff]

I would be happy to proofread and maybe extend the comments (writing
new comments will also help me understand and review the code!), and
maybe some code changes to move this forward.  Horiguchi-san, are you
working on another version now?  If so I'll wait for it before I do
that.

> > + */
> > +void
> > +dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
> > +                             bool consistent, bool exclusive)
> > +{
>
> Why does this patch add the consistent mode? There's no users currently?
> Without it's not clear that we need a seperate _term function, I think?

+1, let's not do that if we don't need it!

> The fact that you're locking the per-database entry unconditionally once
> for each table almost guarantees contention - and you're not using the
> 'conditional lock' approach for that. I don't understand.

Right, I also noticed that:

    /*
     * Local table stats should be applied to both dbentry and tabentry at
     * once. Update dbentry only if we could update tabentry.
     */
    if (pgstat_update_tabentry(tabhash, entry, nowait))
    {
        pgstat_update_dbentry(dbent, entry);
        updated = true;
    }

So pgstat_update_tabentry() goes to great trouble to take locks
conditionally, but then pgstat_update_dbentry() immediately does:

    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
    dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
    dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
    dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
    dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
    dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
    dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
    dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
    LWLockRelease(&dbentry->lock);

Why can't we be "lazy" with the dbentry stats too?  Is it really
important for the table stats and DB stats to agree with each other?
Even if it were, your current coding doesn't achieve that: the table
stats are updated before the DB stat under different locks, so I'm not
sure why it can't wait longer.

Hmm.  Even if you change the above code to use a conditional lock, I am
wondering (admittedly entirely without data) if this approach is still
too clunky: even trying and failing to acquire the lock creates
contention, just a bit less.  I wonder if it would make sense to make
readers do more work, so that writers can avoid contention.  For
example, maybe PgStat_StatDBEntry could hold an array of N sets of
counters, and readers have to add them all up.  An advanced version of
this idea would use a reasonably fresh copy of something like
sched_getcpu() and numa_node_of_cpu() to select a partition to
minimise contention and cross-node traffic, with a portable fallback
based on PID or something.  CPU core/node awareness is something I
haven't looked into too seriously, but it's been on my mind to solve
some other problems.

Attachments

Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
Thank you for the comment.

The new version is attached.

At Thu, 12 Mar 2020 20:13:24 -0700, Andres Freund <andres@anarazel.de> wrote in 
> General comments about this series:
> - A lot of the review comments feel like I've written them before, a
>   year or more ago. I feel this patch ought to be in a much better
>   state. There's a lot of IMO fairly obvious stuff here, and things that
>   have been mentioned multiple times previously.
> - There's a *lot* of typos in here. I realize being an ESL is hard, but
>   a lot of these can be found with the simplest spellchecker.  That's
>   one thing for a patch that just has been hacked up as a POC, but this
>   is a multi year thread?
> - There's some odd formatting. Consider using pgindent more regularly.
> 
> More detailed comments below.
> 
> I'm considering rewriting the parts of the patchset that I don't like -
> but it'll look quite different afterwards.
> 
> On 2020-01-22 17:24:04 +0900, Kyotaro Horiguchi wrote:
> > From 5f7946522dc189429008e830af33ff2db435dd42 Mon Sep 17 00:00:00 2001
> > From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
> > Date: Fri, 29 Jun 2018 16:41:04 +0900
> > Subject: [PATCH 1/5] sequential scan for dshash
> >
> > Add sequential scan feature to dshash.
> 
> 
> >          dsa_pointer item_pointer = hash_table->buckets[i];
> > @@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
> >                                  LW_EXCLUSIVE));
> >
> >      delete_item(hash_table, item);
> > -    hash_table->find_locked = false;
> > -    hash_table->find_exclusively_locked = false;
> > -    LWLockRelease(PARTITION_LOCK(hash_table, partition));
> > +
> > +    /* We need to keep partition lock while sequential scan */
> > +    if (!hash_table->seqscan_running)
> > +    {
> > +        hash_table->find_locked = false;
> > +        hash_table->find_exclusively_locked = false;
> > +        LWLockRelease(PARTITION_LOCK(hash_table, partition));
> > +    }
> >  }
> 
> This seems like a failure prone API.

[001] (Fixed)
As a result of the fix in [044], it's gone now.

> >  /*
> > @@ -568,6 +584,8 @@ dshash_release_lock(dshash_table *hash_table, void *entr
> > + * dshash_seq_init/_next/_term
> > + *           Sequentially scan trhough dshash table and return all the
> > + *           elements one by one, return NULL when no more.
> 
> s/trhough/through/

[002] (Fixed)

> This uses a different comment style that the other functions in this
> file. Why?

[003] (Fixed)

It was following the equivalent in dynahash.c.  I rewrote it in a
different way.

> > + * dshash_seq_term should be called for incomplete scans and otherwise
> > + * shoudln't. Finished scans are cleaned up automatically.
> 
> s/shoudln't/shouldn't/

[004] (Fixed)

> I find the "cleaned up automatically" API terrible. I know you copied it
> from dynahash, but I find it to be really failure prone. dynahash isn't
> an example of good postgres code, the opposite, I'd say. It's a lot
> easier to unconditionally have a terminate call if we need that.

[005] (Fixed)
OK, I remember I had a similar thought on this. Fixed, along with all
the corresponding places.

> > + * Returned elements are locked as is the case with dshash_find.  However, the
> > + * caller must not release the lock.
> > + *
> > + * Same as dynanash, the caller may delete returned elements midst of a scan.
> 
> I think it's a bad idea to refer to dynahash here. That's just going to
> get out of date. Also, code should be documented on its own.

[006] (Fixed)
Understood; fixed as follows.

 * Returned elements are locked and the caller must not explicitly release
 * it.

> > + * If consistent is set for dshash_seq_init, the all hash table partitions are
> > + * locked in the requested mode (as determined by the exclusive flag) during
> > + * the scan.  Otherwise partitions are locked in one-at-a-time way during the
> > + * scan.
> 
> Yet delete unconditionally retains locks?

[007] (Not fixed)
Yes. If we release the lock on the current partition, hash resize
breaks the concurrent scans.

> > + */
> > +void
> > +dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
> > +                bool consistent, bool exclusive)
> > +{
> 
> Why does this patch add the consistent mode? There's no users currently?
> Without it's not clear that we need a seperate _term function, I think?

[008] (Fixed)
I remember that it was used at an early stage of development. I left it
for the sake of API completeness, but it is actually not used. _term is
another matter.  We need to release the lock and clean up some dshash
state if we allow a seq scan to be exited before it reaches the end.

I removed the "consistent" from dshash_seq_init and reverted
dshash_seq_term.

> I think we also can get rid of the dshash_delete changes, by instead
> adding a dshash_delete_current(dshash_seq_stat *status, void *entry) API
> or such.

[009] (Fixed)
I'm not sure about the point of having two interfaces that are hard to
distinguish.  Maybe dshash_delete_current(dshash_seq_stat *status) is
enough. I also reverted the changes to dshash_delete().


> > @@ -70,7 +86,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
> >  extern void dshash_detach(dshash_table *hash_table);
> >  extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
> >  extern void dshash_destroy(dshash_table *hash_table);
> > -
> >  /* Finding, creating, deleting entries. */
> >  extern void *dshash_find(dshash_table *hash_table,
> >                           const void *key, bool
> >  exclusive);
> 
> There's a number of spurious changes like this.

[010] (Fixed)
I found such isolated line insertions or removals: two in 0001 and
eight in 0004.

> > From 60da67814fe40fd2a0c1870b15dcf6fcb21c989a Mon Sep 17 00:00:00 2001
> > From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
> > Date: Thu, 27 Sep 2018 11:15:19 +0900
> > Subject: [PATCH 2/5] Add conditional lock feature to dshash
> >
> > Dshash currently waits for lock unconditionally. This commit adds new
> > interfaces for dshash_find and dshash_find_or_insert. The new
> > interfaces have an extra parameter "nowait" taht commands not to wait
> > for lock.
> 
> s/taht/that/

[011] (Fixed)
Applied ispell on all commit messages.

> There should be at least a sentence or two explaining why these are
> useful.

[011] (Fixed)
Sounds reasonable. I rewrote it that way.

> > +/*
> > + * The version of dshash_find, which is allowed to return immediately on lock
> > + * failure. Lock status is set to *lock_failed in that case.
> > + */
> 
> Hm. Not sure I like the *lock_acquired API.
>
> > +void *
> > +dshash_find_extended(dshash_table *hash_table, const void *key,
> > +                     bool exclusive, bool nowait, bool *lock_acquired)
...
> > +    Assert(nowait || !lock_acquired);
...
> > +            if (lock_acquired)
> > +                *lock_acquired = false;
> 
> Why is the test for lock_acquired needed here? I don't think it's
> possible to use nowait correctly without passing in lock_acquired?
> 
> Think it'd make sense to document & assert that nowait = true implies
> lock_acquired set, and nowait = false implies lock_acquired not being
> set.
> 
> But, uh, why do we even need the lock_acquired parameter? If we couldn't
> find an entry, then we should just release the lock, no?

[012] (Fixed) (related to [013], [014])
The name is confusing. In this version the old dshash_find_extended
and dshash_find_or_insert_extended are merged into a new
dshash_find_extended, which covers all the functionality of dshash_find
and dshash_find_or_insert; in addition, insertion with a shared lock is
allowed now.


> I'm however inclined to think it's better to just have a separate
> function for the nowait case, rather than an extended version supporting
> both (with an internal helper doing most of the work).

[013] (Fixed) (related to [012], [014])
After some thought, nowait is no longer a source of complexity.
In the end I did as described in [012].

> > +/*
> > + * The version of dshash_find_or_insert, which is allowed to return immediately
> > + * on lock failure.
> > + *
> > + * Notes above dshash_find_extended() regarding locking and error handling
> > + * equally apply here.
> 
> They don't, there's no lock_acquired parameter.
> 
> > + */
> > +void *
> > +dshash_find_or_insert_extended(dshash_table *hash_table,
> > +                               const void *key,
> > +                               bool *found,
> > +                               bool nowait)
> 
> I think it's absurd to have dshash_find, dshash_find_extended,
> dshash_find_or_insert, dshash_find_or_insert_extended. If they're
> extended they should also be able to specify whether the entry will get
> created.

[014] (Fixed) (related to [012], [013])
As mentioned above, this version has the original two functions and
one dshash_find_extended().
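
As an illustration of how one "extended" entry point can cover lookup, find-or-insert and the nowait case, here is a toy sketch. It is not the dshash API: kv_table, kv_find_extended and kv_release are hypothetical names, the table is a small fixed array guarded by a single try-lock, and the parameter list is an assumption chosen for the example. A non-NULL result is returned with the lock held; NULL means either "lock busy" (only possible with nowait) or "not found and not inserted".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define KV_CAP 4

typedef struct kv_entry
{
	int			key;
	int			value;
	bool		used;
} kv_entry;

/* Toy table guarded by one lock; the atomic_flag stands in for an LWLock. */
typedef struct kv_table
{
	atomic_flag lock;
	kv_entry	entries[KV_CAP];
} kv_table;

/* Release the lock held on behalf of an entry returned by kv_find_extended. */
void
kv_release(kv_table *t)
{
	atomic_flag_clear_explicit(&t->lock, memory_order_release);
}

/*
 * Look up key.  With insert = true a missing key is created (if there is
 * room).  With nowait = true the function returns NULL instead of waiting
 * for a busy lock.  *found reports whether the key already existed.
 */
kv_entry *
kv_find_extended(kv_table *t, int key, bool nowait, bool insert, bool *found)
{
	kv_entry   *free_slot = NULL;

	assert(t != NULL && found != NULL);
	*found = false;

	while (atomic_flag_test_and_set_explicit(&t->lock, memory_order_acquire))
	{
		if (nowait)
			return NULL;		/* lock busy: the caller retries later */
	}

	for (int i = 0; i < KV_CAP; i++)
	{
		kv_entry   *e = &t->entries[i];

		if (e->used && e->key == key)
		{
			*found = true;
			return e;			/* returned with the lock held */
		}
		if (!e->used && free_slot == NULL)
			free_slot = e;
	}

	if (insert && free_slot != NULL)
	{
		free_slot->key = key;
		free_slot->value = 0;
		free_slot->used = true;
		return free_slot;		/* returned with the lock held */
	}

	kv_release(t);				/* nothing to return: drop the lock */
	return NULL;
}
```

Because NULL is ambiguous under nowait in this toy, a caller that needs to tell "busy" from "absent" would need an extra output flag, which is the API question discussed above.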

> > From d10c1117cec77a474dbb2cff001086d828b79624 Mon Sep 17 00:00:00 2001
> > From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
> > Date: Wed, 7 Nov 2018 16:53:49 +0900
> > Subject: [PATCH 3/5] Make archiver process an auxiliary process
> >
> > This is a preliminary patch for shared-memory based stats collector.
> > Archiver process must be a auxiliary process since it uses shared
> > memory after stats data wes moved onto shared-memory. Make the process
> 
> s/wes/was/ s/onto/into/

[015] (Fixed)

> > an auxiliary process in order to make it work.
> 
> >
> 
> > @@ -451,6 +454,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
> >              StartupProcessMain();
> >              proc_exit(1);        /* should never return */
> >
> > +        case ArchiverProcess:
> > +            /* don't set signals, archiver has its own agenda */
> > +            PgArchiverMain();
> > +            proc_exit(1);        /* should never return */
> > +
> >          case BgWriterProcess:
> >              /* don't set signals, bgwriter has its own agenda */
> >              BackgroundWriterMain();
> 
> I think I'd rather remove the two comments that are copied to 6 out of 8
> cases - they don't add anything.

[016] (Fixed)
Agreed. I removed the comments from StartProcess to WalReceiverProcess.

> >  pgarch_exit(SIGNAL_ARGS)
> >  {
..
> > +     * We DO NOT want to run proc_exit() callbacks -- we're here because
> > +     * shared memory may be corrupted, so we don't want to try to clean up our
...
> > +     * being doubly sure.)
> > +     */
> > +    exit(2);
...
> This seems to be a copy of code & comments from other signal handlers that predates
..
> I think this just should use SignalHandlerForCrashExit().
> I think we can even commit that separately - there's not really a reason
> to not do that today, as far as I can tell?

[017] (Fixed, separate patch 0001)
Exactly. Although on_*_exit_list is empty in this process, SIGQUIT
ought to prevent the process from calling such functions even if there
were any. That changes the exit status of the archiver on SIGQUIT from 1
to 2, but that doesn't change any behavior (other than the log message).

> >  /* SIGUSR1 signal handler for archiver process */
> 
> Hm - this currently doesn't set up a correct sigusr1 handler for a
> shared memory backend - needs to invoke procsignal_sigusr1_handler
> somewhere.
> 
> We can probably just convert to using normal latches here, and remove
> the current 'wakened' logic? That'll remove the indirection via
> postmaster too, which is nice.

[018] (Fixed, separate patch 0005)
That seems better. I added it as a separate patch just after the patch
that turns the archiver into an auxiliary process.

> > @@ -4273,6 +4276,9 @@ pgstat_get_backend_desc(BackendType backendType)
> >
> >      switch (backendType)
> >      {
> > +        case B_ARCHIVER:
> > +            backendDesc = "archiver";
> > +            break;
> 
> should imo include 'WAL' or such.

[019] (Not Fixed)
It is already named "archiver" by 8e8a0becb3. Should I rename it in this
patch set?

> > From 5079583c447c3172aa0b4f8c0f0a46f6e1512812 Mon Sep 17 00:00:00 2001
> > From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
> > Date: Thu, 21 Feb 2019 12:44:56 +0900
> > Subject: [PATCH 4/5] Shared-memory based stats collector
..
> > megabytes. To deal with larger statistics set, this patch let backends
> > directly share the statistics via shared memory.
> 
> This spends a fair bit describing the old state, but very little
> describing the new state.

[020] (Fixed, Maybe)
Ugg.  I got the same comment in the last round. I rewrote it this time.

> > diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> > index 0bfd6151c4..a6b0bdec12 100644
> > --- a/doc/src/sgml/monitoring.sgml
> > +++ b/doc/src/sgml/monitoring.sgml
...
> > -   master process.  (The <quote>stats collector</quote> process will not be present
> > -   if you have set the system not to start the statistics collector; likewise
> > +   master process.  (The <quote>autovacuum launcher</quote> process will not
...
> There's more references to the stats collector than this... E.g. in
> catalogs.sgml

[021] (Fixed, separate patch 0007)
Although the "statistics collector process" is gone, I'm not sure the
"statistics collector" feature is gone as well. But the word
"collector" does look a bit odd in some contexts. I replaced "the results
of statistics collector" with "the activity statistics". (I'm not sure
"the activity statistics" is appropriate as a subsystem name.) The word
"collect" is replaced with "track".  I didn't change the section IDs
to match the renaming, so that old links keep working. I also changed
the tranche name for LWTRANCHE_STATS from "activity stats" to
"activity_statistics".

> > diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
> > index 6d1f28c327..8dcb0fb7f7 100644
> > --- a/src/backend/postmaster/autovacuum.c
> > +++ b/src/backend/postmaster/autovacuum.c
...
> > @@ -2747,12 +2747,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
> >      if (isshared)
> >      {
> >          if (PointerIsValid(shared))
> > -            tabentry = hash_search(shared->tables, &relid,
> > -                                   HASH_FIND, NULL);
> > +            tabentry = pgstat_fetch_stat_tabentry_extended(shared, relid);
> >      }
> >      else if (PointerIsValid(dbentry))
> > -        tabentry = hash_search(dbentry->tables, &relid,
> > -                               HASH_FIND, NULL);
> > +        tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);
> >
> >      return tabentry;
> >  }
> 
> Why is pgstat_fetch_stat_tabentry_extended called "_extended"? Outside

[022] (Fixed)
The _extended function is not an extended version of the original
function. I renamed pgstat_fetch_stat_tabentry_extended to
pgstat_fetch_stat_tabentry_snapshot. pgstat_fetch_funcentry_extended
and pgstat_fetch_dbentry() were also renamed accordingly.

> the stats subsystem there are exactly one caller for the non extended
> version, as far as I can see. That's index_concurrently_swap() - and imo
> that's code that should live in the stats subsystem, rather than open
> coded in index.c.

[023] (Fixed)
Agreed. I added a new function "pgstat_copy_index_counters()" and now
pgstat_fetch_stat_tabentry() has no call sites outside the pgstat
subsystem.

> > diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
> > index ca5c6376e5..1ffe073a1f 100644
> > --- a/src/backend/postmaster/pgstat.c
> > +++ b/src/backend/postmaster/pgstat.c
> > + *  Collects per-table and per-function usage statistics of all backends on
> > + *  shared memory. pg_count_*() and friends are the interface to locally store
> > + *  backend activities during a transaction. Then pgstat_flush_stat() is called
> > + *  at the end of a transaction to pulish the local stats on shared memory.
> >   *
> 
> I'd rather not exhaustively list the different objects this handles -
> it'll either be annoying to maintain, or just get out of date.

[024] (Fixed, Maybe)
Although I'm not sure I understood you correctly, I rewrote it as follows.

 *  Collects per-table and per-function usage statistics of all backends on
 *  shared memory. The activity numbers are once stored locally, then written
 *  to shared memory at commit time or by idle-timeout.

> > - *            - Add some automatic call for pgstat vacuuming.
> > + *  To avoid congestion on the shared memory, we update shared stats no more
> > + *  often than intervals of PGSTAT_STAT_MIN_INTERVAL(500ms). In the case where
> > + *  all the local numbers cannot be flushed immediately, we postpone updates
> > + *  and try the next chance after the interval of
> > + *  PGSTAT_STAT_RETRY_INTERVAL(100ms), but we don't wait for no longer than
> > + *  PGSTAT_STAT_MAX_INTERVAL(1000ms).
> 
> I'm not convinced by this backoff logic. The basic interval seems quite
> high for something going through shared memory, and the max retry seems
> pretty low.

[025] (Not Fixed)
Is it a matter of the intervals? Would (MIN, RETRY, MAX) = (1000, 500,
10000) be reasonable?
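
For reference, one possible reading of the backoff rule in the quoted header comment, written as a pure function. This is an interpretation, not the patch's logic: the function name, the struct and the "pending_since" bookkeeping are assumptions. With no pending stats the next flush waits until MIN has elapsed since the last flush. With pending stats it retries every RETRY, and once MAX has elapsed since they first became pending it asks for a forced (waiting) flush.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct flush_intervals
{
	long		min_ms;			/* minimum spacing between flushes */
	long		retry_ms;		/* delay before retrying a partial flush */
	long		max_ms;			/* longest time stats may stay pending */
} flush_intervals;

/*
 * Return how many milliseconds to wait before the next flush attempt.
 * pending_since_ms < 0 means nothing is left over from an earlier attempt.
 * *force is set when the attempt should wait for locks instead of skipping.
 */
long
next_flush_delay(const flush_intervals *iv, long now_ms, long last_flush_ms,
				 long pending_since_ms, bool *force)
{
	long		remaining;

	assert(iv->min_ms >= 0 && iv->retry_ms >= 0 && iv->max_ms >= 0);
	*force = false;

	if (pending_since_ms < 0)
	{
		/* Nothing pending: just respect the minimum interval. */
		long		earliest = last_flush_ms + iv->min_ms;

		return (earliest > now_ms) ? earliest - now_ms : 0;
	}

	/* Something is pending: retry soon, but never exceed the maximum. */
	remaining = pending_since_ms + iv->max_ms - now_ms;
	if (remaining <= 0)
	{
		*force = true;
		return 0;
	}
	return (remaining < iv->retry_ms) ? remaining : iv->retry_ms;
}
```

Trying the two proposed triples, (500, 100, 1000) and (1000, 500, 10000), in a function like this makes it easy to see how many retries fit before a forced flush.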

> > +/*
> > + * Operation mode and return code of pgstat_get_db_entry.
> > + */
> > +#define    PGSTAT_SHARED        0
> 
> This is unreferenced.
> 
> 
> > +#define    PGSTAT_EXCLUSIVE    1
> > +#define    PGSTAT_NOWAIT        2
> 
> And these should imo rather be parameters.

[026] (Fixed)
Mmm. Right. The two symbols convey just two distinct parameters, so two
booleans suffice. But I found some confusion here. As a result,
pgstat_get_db_entry now has three boolean parameters: exclusive, nowait
and create.

> > +typedef enum PgStat_TableLookupResult
> > +{
> > +    NOT_FOUND,
> > +    FOUND,
> > +    LOCK_FAILED
> > +} PgStat_TableLookupResult;
> 
> This seems like a seriously bad idea to me. These are very generic
> names. There's also basically no references except setting them to the
> first two?

[027] (Fixed)
Considering some related comments above, I decided not to return lock
status from pgstat_get_db_entry. That makes the enum useless and makes
the function simpler.

> > +#define        StatsLock (&StatsShmem->StatsMainLock)
> >
> > -static time_t last_pgstat_start_time;
> > +/* Shared stats bootstrap information */
> > +typedef struct StatsShmemStruct
> > +{
> > +    LWLock                StatsMainLock;        /* lock to protect this struct */
...
> > +} StatsShmemStruct;
> 
> Why isn't this an lwlock in lwlocknames.h, rather than
> allocated here?

[028] (Fixed)
The activity stats system already used a dedicated tranche, so I
thought it natural to put the lock in the same tranche. Actually that is
not a very firm reason. I moved the lock into the main tranche.

> > +/*
> > + * BgWriter global statistics counters. The name cntains a remnant from the
> > + * time when the stats collector was a dedicate process, which used sockets to
> > + * send it.
> > + */
> > +PgStat_MsgBgWriter BgWriterStats = {0};
> 
> I am strongly against keeping the 'Msg' prefix. That seems extremely
> confusing going forward.

[029] (Fixed) (Related  to [046])
Mmm. It was following your old suggestion to avoid insubstantial
diffs. I'm happy to change it. The functions that have "send" in their
names are there for the same reason. I removed the "m_" prefix from the
members of the struct. (The comment above (with a typo) explains that.)

> > +/* common header of snapshot entry in reader snapshot hash */
> > +typedef struct PgStat_snapshot
> > +{
> > +    Oid        key;
> > +    bool    negative;
> > +    void   *body;                /* end of header part: to keep alignment */
> > +} PgStat_snapshot;
> 
> 
> > +/* context struct for snapshot_statentry */
> > +typedef struct pgstat_snapshot_param
> > +{
> > +    char           *hash_name;            /* name of the snapshot hash */
> > +    int                hash_entsize;        /* element size of hash entry */
> > +    dshash_table_handle    dsh_handle;        /* dsh handle to attach */
> > +    const dshash_parameters *dsh_params;/* dshash params */
> > +    HTAB          **hash;                /* points to variable to hold hash */
> > +    dshash_table  **dshash;                /* ditto for dshash */
> > +} pgstat_snapshot_param;
> 
> Why does this exist? The struct contents are actually constant across
> calls, yet you have declared them inside functions (as static - static
> on function scope isn't really the same as global static).

[030] (Fixed)
IIUC, I didn't want it initialized at every call and it doesn't need
external linkage, so it was a static variable at function scope.
But, first, the name _param is bogus since it actually contains
context variables.  Second, the "context" variables have been moved to
other variables.  I removed the struct and moved its members to the
parameters of snapshot_statentry.

> If we want it, I think we should separate the naming more
> meaningfully. The important difference between 'hash' and 'dshash' isn't
> the hashing module, it's that one is a local copy, the other a shared
> hashtable!

[031] (Fixed)
Definitely. The parameters of snapshot_statentry now have more
meaningful names.

> > +/*
> > + * Backends store various database-wide info that's waiting to be flushed out
> > + * to shared memory in these variables.
> > + *
> > + * checksum_failures is the exception in that it is cluster-wide value.
> > + */
> > +typedef struct BackendDBStats
> > +{
> > +    int        n_conflict_tablespace;
> > +    int        n_conflict_lock;
> > +    int        n_conflict_snapshot;
> > +    int        n_conflict_bufferpin;
> > +    int        n_conflict_startup_deadlock;
> > +    int        n_deadlocks;
> > +    size_t    n_tmpfiles;
> > +    size_t    tmpfilesize;
> > +    HTAB    *checksum_failures;
> > +} BackendDBStats;
> 
> Why is this a separate struct from PgStat_StatDBEntry? We shouldn't have
> these fields in multiple places.

[032] (Fixed, Maybe) (Related to [042])
It is almost a subset of PgStat_StatDBEntry, with one
exception: checksum_failures differs between the two.
Anyway, tracking of conflict events doesn't need to be that fast, so they
are now counted directly on the shared hash entries.

Checksum failures are handled in a different way, so only that part is
left as it was.

> > +    if (StatsShmem->refcount > 0)
> > +        StatsShmem->refcount++;
> 
> What prevents us from leaking the refcount here? We could e.g. error out
> while attaching, no? Which'd mean we'd leak the refcount.

[033] (Fixed)
We don't attach shared stats in the postmaster process, so I want to know
the first process to attach to and the last process to detach from the
shared stats.  Leaks are not what I had in mind here.
(continued below)

> To me it looks like there's a lot of added complexity just because you
> want to be able to reset stats via
> 
> void
> pgstat_reset_all(void)
> {
> 
>     /*
>      * We could directly remove files and recreate the shared memory area. But
>      * detach then attach for simplicity.
>      */
>     pgstat_detach_shared_stats(false);    /* Don't write */
>     pgstat_attach_shared_stats();
> 
> Without that you'd not need the complexity of attaching, detaching to
> the same degree - every backend could just cache lookup data during
> initialization, instead of having to constantly re-compute that.

Mmm. I don't get that (or I failed to read the meaning clearly). The
function is assumed to be called only from StartupXLOG().
(continued)

> Nor would the dynamic re-creation of the db dshash table be needed.

Maybe you are referring to the complexity of reset_dbentry_counters? It
is indeed complex.  The shared stats dshash cannot be destroyed (and a
dshash entry cannot be removed) while someone is working on it. It
would be simpler to wait for the other process to finish its work, but
that could slow down not only the clearing process but also other
processes when counters are reset frequently.

After some thought, I decided to rip out all the "generation" stuff,
and it got far simpler. But a counter reset may now conflict with other
backends to a slightly higher degree, because a counter reset needs an
exclusive lock.

> > +/* ----------
> > + * pgstat_report_stat() -
> > + *
> > + *    Must be called by processes that performs DML: tcop/postgres.c, logical
> > + *    receiver processes, SPI worker, etc. to apply the so far collected
> > + *    per-table and function usage statistics to the shared statistics hashes.
> > + *
> > + *  Updates are applied not more frequent than the interval of
> > + *  PGSTAT_STAT_MIN_INTERVAL milliseconds. They are also postponed on lock
...
> Inconsistent indentation.

[034] (Fixed)

> > +long
> > +pgstat_report_stat(bool force)
> >  {
> 
> > +    /* Flush out table stats */
> > +    if (pgStatTabList != NULL && !pgstat_flush_stat(&cxt, !force))
> > +        pending_stats = true;
> > +
> > +    /* Flush out function stats */
> > +    if (pgStatFunctions != NULL && !pgstat_flush_funcstats(&cxt, !force))
> > +        pending_stats = true;
> 
> This seems weird. pgstat_flush_stat(), pgstat_flush_funcstats() operate
> on pgStatTabList/pgStatFunctions, but don't actually reset it? Besides
> being confusing while reading the code, it also made the diff much
> harder to read.

[035] (Maybe Fixed)
Is the question whether there is any case where
pgstat_flush_stat/functions leaves some counters unflushed? It skips
tables that someone else is working on (or other tables that are in
the same dshash partition). Or is "!force == nowait" the cause of the
confusion? It is now written as "nowait = !force". (Or should I change
the parameter of pgstat_report_stat from "force" to "nowait"?)

> > -        snprintf(fname, sizeof(fname), "%s/%s", directory,
> > -                 entry->d_name);
> > -        unlink(fname);
> > +    /* Flush out database-wide stats */
> > +    if (HAVE_PENDING_DBSTATS())
> > +    {
> > +        if (!pgstat_flush_dbstats(&cxt, !force))
> > +            pending_stats = true;
> >      }
> 
> Linearly checking a number of stats doesn't seem like the right way
> going forward. Also seems fairly omission prone.
> 
> Why does this code check live in pgstat_report_stat(), rather than
> pgstat_flush_dbstats()?

[036] (Maybe Fixed) (Related to [041])
It was there to avoid useless calls, but it no longer exists. Anyway, the
code disappeared as a result of [041].

|    /* Flush out individual stats tables */
|    pending_stats |= pgstat_flush_stat(&cxt, nowait);
|    pending_stats |= pgstat_flush_funcstats(&cxt, nowait);
|    pending_stats |= pgstat_flush_checksum_failure(cxt.mydbentry, nowait);


> > /*
> >  * snapshot_statentry() - Common routine for functions
> >  *                             pgstat_fetch_stat_*entry()
> >  *
> 
> Why has this function been added between the closely linked
> pgstat_report_stat() and pgstat_flush_stat() etc?

[037]
It seems to have been left there after some editing. Moved it to just
before the caller functions.

> Why do we still have this? A hashtable lookup is cheap, compared to
> fetching a file - so it's not to save time. Given how infrequent the
> pgstat_fetch_* calls are, it's not to avoid contention either.
> 
> At first one could think it's for consistency - but no, that's not it
> either, because snapshot_statentry() refetches the snapshot without
> control from the outside:

[038]
I don't get the second paragraph. When does the function re*create* a
snapshot without control from the outside? It keeps snapshots for the
duration of a transaction.  If it doesn't, it is broken.
(continued)

> >   /*
> >    * We don't want so frequent update of stats snapshot. Keep it at least
> >    * for PGSTAT_STAT_MIN_INTERVAL ms. Not postpone but just ignore the cue.
> >    */
...
> I think we should just remove this entire local caching snapshot layer
> for lookups.

Currently the behavior is documented as follows, and it seems reasonable.

   Another important point is that when a server process is asked to display
   any of these statistics, it first fetches the most recent report emitted by
   the collector process and then continues to use this snapshot for all
   statistical views and functions until the end of its current transaction.
   So the statistics will show static information as long as you continue the
   current transaction.  Similarly, information about the current queries of
   all sessions is collected when any such information is first requested
   within a transaction, and the same information will be displayed throughout
   the transaction.
   This is a feature, not a bug, because it allows you to perform several
   queries on the statistics and correlate the results without worrying that
   the numbers are changing underneath you.  But if you want to see new
   results with each query, be sure to do the queries outside any transaction
   block.  Alternatively, you can invoke
   <function>pg_stat_clear_snapshot</function>(), which will discard the
   current transaction's statistics snapshot (if any).  The next use of
   statistical information will cause a new snapshot to be fetched.
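
The documented behavior can be sketched as below. This is only a toy
model under my own assumptions, not the patch's code: toy_shared_value
stands in for a shared stats entry, and toy_fetch_stat() and
toy_clear_snapshot() are hypothetical names for "first use in a
transaction takes a snapshot" and "transaction end or
pg_stat_clear_snapshot() discards it".

```c
#include <assert.h>
#include <stdbool.h>

static long toy_shared_value = 0;       /* stand-in for shared stats */
static long toy_snapshot_value = 0;     /* backend-local copy */
static bool toy_snapshot_valid = false; /* is the local copy usable? */

/* Return the stat, taking a snapshot on first use and reusing it after. */
static long
toy_fetch_stat(void)
{
    if (!toy_snapshot_valid)
    {
        toy_snapshot_value = toy_shared_value;
        toy_snapshot_valid = true;
    }

    assert(toy_snapshot_valid);
    return toy_snapshot_value;
}

/* Called at transaction end, or explicitly to discard the snapshot. */
static void
toy_clear_snapshot(void)
{
    toy_snapshot_valid = false;
}
```

Repeated calls to toy_fetch_stat() return the same number until
toy_clear_snapshot() is called, even if the shared value changes in
between.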

> > /*
> >  * pgstat_flush_stat: Flushes table stats out to shared statistics.
> >  *
> 
> Why is this named pgstat_flush_stat, rather than pgstat_flush_tabstats
> or such? Given that the code for dealing with an individual table's
> entry is named pgstat_flush_tabstat() that's very confusing.

[039]
Definitely. The names were changed while addressing [041].

> > static bool
> > pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait)
...
> It seems odd that there's a tabstat specific code in pgstat_flush_stat
> (also note singular while it's processing all stats, whereas you're
> below treating pgstat_flush_tabstat as only affecting one table).

[039]
The names were changed while addressing [041].


> >       for (i = 0; i < tsa->tsa_used; i++)
> >       {
> >           PgStat_TableStatus *entry = &tsa->tsa_entries[i];
> >
<many TableStatsArray code>
> >               hash_entry->tsa_entry = entry;
> >               dest_elem++;
> >           }
> 
> This seems like too much code. Why is this entirely different from the
> way funcstats works? The difference was already too big before, but this
> made it *way* worse.

[040]
We don't flush stats until the transaction ends. So is the description
of TabStatusArray stale?

 * NOTE: once allocated, TabStatusArray structures are never moved or deleted
 * for the life of the backend.  Also, we zero out the t_id fields of the
 * contained PgStat_TableStatus structs whenever they are not actively in use.
 * This allows relcache pgstat_info pointers to be treated as long-lived data,
 * avoiding repeated searches in pgstat_initstats() when a relation is
 * repeatedly opened during a transaction.
(continued to below)

> One goal of this project, as I understand it, is to make it easier to
> add additional stats. As is, this seems to make it harder from the code
> level.

Indeed. I removed the TabStatsArray. Although the comment says it is
long-lived, its lifetime actually lasts until transaction end at most. I
used a dynahash entry as the pgstat_info entry. One tricky part is that I
had to clear entry->t_id after removing the entry so that
pgstat_initstats can detect the removal.  It is actually safe, but we
could add another table id member to the struct for that purpose.

> > bool
> > pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
> >                    PgStat_TableStatus *entry)
> > {
> >   Oid     dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
> >   int     table_mode = PGSTAT_EXCLUSIVE;
> >   bool    updated = false;
> >   dshash_table *tabhash;
> >   PgStat_StatDBEntry *dbent;
> >   int     generation;
> >
> >   if (nowait)
> >       table_mode |= PGSTAT_NOWAIT;
> >
> >   /* Attach required table hash if not yet. */
> >   if ((entry->t_shared ? cxt->shdb_tabhash : cxt->mydb_tabhash) == NULL)
> >   {
> >       /*
> >        *  Return if we don't have corresponding dbentry. It would've been
> >        *  removed.
> >        */
> >       dbent = pgstat_get_db_entry(dboid, table_mode, NULL);
> >       if (!dbent)
> >           return false;
> >
> >       /*
> >        * We don't hold lock on the dbentry since it cannot be dropped while
> >        * we are working on it.
> >        */
> >       generation = pin_hashes(dbent);
> >       tabhash = attach_table_hash(dbent, generation);
> 
> This again is just cost incurred by insisting on destroying hashtables
> instead of keeping them around as long as necessary.

[040]
Maybe you are insisting on the reverse? The pin_hash complexity remains
in this version. -> [033]

> >   /*
> >    * Local table stats should be applied to both dbentry and tabentry at
> >    * once. Update dbentry only if we could update tabentry.
> >    */
> >   if (pgstat_update_tabentry(tabhash, entry, nowait))
> >   {
> >       pgstat_update_dbentry(dbent, entry);
> >       updated = true;
> >   }
> 
> At this point we're very deeply nested. pgstat_report_stat() ->
> pgstat_flush_stat() -> pgstat_flush_tabstat() ->
> pgstat_update_tabentry().
> 
> That's way over the top imo.

[041] (Fixed) (Related to [036])
Completely agree. It is a result of my wanting to avoid scanning
pgStatTables twice.
(continued)

> I don't think it makes much sense that pgstat_update_dbentry() is called
> separately for each table. Why would we want to constantly lock that
> entry? It seems to be much more sensible to instead have
> pgstat_flush_stat() transfer the stats it reported to the pending
> database wide counters, and then report that to shared memory *once* per
> pgstat_report_stat() with pgstat_flush_dbstats()?

In the attached, it scans PgStat_StatDBEntry twice: once for the tables
of the current database and once for shared tables. That change
simplified the surrounding logic.

pgstat_report_stat()
  pgstat_flush_tabstats(<tables of current database>)
    pgstat_update_tabentry() (at bottom)
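
The suggestion quoted above (accumulate per-table numbers into pending
database-wide counters, then touch the shared database entry once) can be
sketched roughly as follows. This is an illustration under my own
assumptions, not the patch's code: all Toy* types and toy_* functions are
hypothetical, and lock_acquisitions only counts how often the shared
database entry would have been locked.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-table delta collected by a backend. */
typedef struct ToyTabDelta
{
    long        tuples_inserted;
    long        tuples_deleted;
} ToyTabDelta;

/* Hypothetical database-wide counters (used for both pending and shared). */
typedef struct ToyDbCounters
{
    long        tuples_inserted;
    long        tuples_deleted;
    int         lock_acquisitions;  /* how often the entry was "locked" */
} ToyDbCounters;

/* Fold every table delta into the backend-local pending db counters. */
static void
toy_flush_tabstats(const ToyTabDelta *tabs, size_t n,
                   ToyDbCounters *pending_db)
{
    size_t      i;

    assert(pending_db != NULL);

    for (i = 0; i < n; i++)
    {
        pending_db->tuples_inserted += tabs[i].tuples_inserted;
        pending_db->tuples_deleted += tabs[i].tuples_deleted;
    }
}

/* Apply the pending db counters to the shared entry with a single lock. */
static void
toy_flush_dbstats(ToyDbCounters *shared_db, ToyDbCounters *pending_db)
{
    shared_db->lock_acquisitions++;     /* lock taken once per report */
    shared_db->tuples_inserted += pending_db->tuples_inserted;
    shared_db->tuples_deleted += pending_db->tuples_deleted;

    pending_db->tuples_inserted = 0;
    pending_db->tuples_deleted = 0;
}
```

However many tables are flushed, the shared database entry is locked only
once per report in this model.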

> >   LWLockAcquire(&cxt->mydbentry->lock, LW_EXCLUSIVE);
> >   if (HAVE_PENDING_CONFLICTS())
> >       pgstat_flush_recovery_conflict(cxt->mydbentry);
> >   if (BeDBStats.n_deadlocks != 0)
> >       pgstat_flush_deadlock(cxt->mydbentry);
..
> What's the point of having all these sub-functions? I see that you, for
> an undocumented reason, have pgstat_report_recovery_conflict() flush
> conflict stats immediately:

[042]
Fixed by [032].

> >   dbentry = pgstat_get_db_entry(MyDatabaseId,
> >                                 PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
> >                                 &status);
> >
> >   if (status == LOCK_FAILED)
> >       return;
> >
> >   /* We had a chance to flush immediately */
> >   pgstat_flush_recovery_conflict(dbentry);
> >
> >   dshash_release_lock(pgStatDBHash, dbentry);
> 
> But I don't understand why? Nor why we'd not just report all pending
> database wide changes in that case?
> 
> The fact that you're locking the per-database entry unconditionally once
> for each table almost guarantees contention - and you're not using the
> 'conditional lock' approach for that. I don't understand.

[043] (Maybe fixed) (Related to [045].)
Vacuum, analyze, DROP DB and reset cannot be delayed, so the
conditional lock is mainly used by pgstat_report_stat().
dshash_find_or_insert didn't allow a shared lock. I changed
dshash_find_extended to allow a shared lock even if it is told to
create a missing entry. Although it takes an exclusive lock at the
moment of entry creation, in most cases it doesn't need an exclusive
lock. This allows using a shared lock while processing vacuum or
analyze stats.

Previously I thought that we could work on a shared database entry while
the lock is not held, but actually there are cases where insertion of a
new database entry causes a rehash (resize). The operation moves entries,
so we need at least a shared lock on the database entry while we are
working on it. So in the attached version, most operations basically
work in the following steps.

- get shared database entry with shared lock
  - attach table/function hash
    - fetch an entry with exclusive lock
      - update entry
    - release the table/function entry
  - detach table/function hash

  if needed
    - take LW_EXCLUSIVE on database entry
      - update database numbers
    - release LWLock
- release shared database entry
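
The steps above can be sketched with toy locks. This is only a model
under my own assumptions, not the patch's code: ToyLock is a hypothetical
single-process stand-in for LWLock/dshash partition locks, and the
asserts merely check that the locks are taken and released in the order
listed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock that only tracks who holds it. */
typedef struct ToyLock
{
    int         shared_holders;
    bool        exclusive;
} ToyLock;

static void
toy_lock_shared(ToyLock *l)
{
    assert(!l->exclusive);
    l->shared_holders++;
}

static void
toy_unlock_shared(ToyLock *l)
{
    assert(l->shared_holders > 0);
    l->shared_holders--;
}

static void
toy_lock_exclusive(ToyLock *l)
{
    assert(!l->exclusive && l->shared_holders == 0);
    l->exclusive = true;
}

static void
toy_unlock_exclusive(ToyLock *l)
{
    assert(l->exclusive);
    l->exclusive = false;
}

typedef struct ToyTabEntry
{
    ToyLock     lock;           /* lock covering the table entry */
    long        n_scans;
} ToyTabEntry;

typedef struct ToyDbEntry
{
    ToyLock     partition_lock; /* lock covering the database entry */
    ToyLock     lwlock;         /* protects the database-wide numbers */
    long        total_scans;
    ToyTabEntry table;          /* stand-in for the attached table hash */
} ToyDbEntry;

static void
toy_report(ToyDbEntry *db, long scans, bool update_db)
{
    ToyTabEntry *tab;

    toy_lock_shared(&db->partition_lock);   /* get db entry, shared lock */

    tab = &db->table;                       /* attach table hash */
    toy_lock_exclusive(&tab->lock);         /* fetch entry, exclusive lock */
    tab->n_scans += scans;                  /* update entry */
    toy_unlock_exclusive(&tab->lock);       /* release the table entry */
    /* detaching the table hash would happen here */

    if (update_db)
    {
        toy_lock_exclusive(&db->lwlock);    /* LW_EXCLUSIVE on db entry */
        db->total_scans += scans;           /* update database numbers */
        toy_unlock_exclusive(&db->lwlock);  /* release LWLock */
    }

    toy_unlock_shared(&db->partition_lock); /* release shared db entry */
}
```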
  
> > pgstat_vacuum_stat(void)
> > {
...
> >   dshash_seq_init(&dshstat, pgStatDBHash, false, true);
> >   while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
> >           pgstat_drop_database(dbid);
..
> So, uh, pgstat_drop_database() again does a *separate* lookup in the
> dshash, locking the entry. Which only works because you added this dirty
> hack:
> 
>     /* We need to keep partition lock while sequential scan */
>     if (!hash_table->seqscan_running)
>     {
>         hash_table->find_locked = false;
>         hash_table->find_exclusively_locked = false;
>         LWLockRelease(PARTITION_LOCK(hash_table, partition));
>     }
> 
> to dshash_delete_entry(). This seems insane to me. There's not even a
> comment explaining this?

[044]

Following [001] and [009], I added
dshash_delete_current(). pgstat_vacuum_stat() uses it instead of
pgstat_delete_entry(). The hack is gone.
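
The idea behind deleting the current entry during a scan is that the scan
remembers the next item before handing out the current one. Here is a
minimal sketch of that technique on a plain singly linked list. It is not
the dshash code: ToyItem, ToyScan and the toy_* functions are hypothetical
names, and there is no locking at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct ToyItem
{
    struct ToyItem *next;
    int         key;
} ToyItem;

typedef struct ToyScan
{
    ToyItem   **head;       /* list being scanned */
    ToyItem    *cur;        /* item returned by the last toy_scan_next() */
    ToyItem    *next;       /* saved successor, valid even if cur is freed */
    bool        started;
} ToyScan;

static ToyItem *
toy_push(ToyItem *head, int key)
{
    ToyItem    *item = malloc(sizeof(ToyItem));

    if (item == NULL)
        abort();
    item->key = key;
    item->next = head;
    return item;
}

static void
toy_scan_init(ToyScan *scan, ToyItem **head)
{
    scan->head = head;
    scan->cur = NULL;
    scan->next = NULL;
    scan->started = false;
}

/* Return the next item, or NULL when the scan is finished. */
static ToyItem *
toy_scan_next(ToyScan *scan)
{
    ToyItem    *item = scan->started ? scan->next : *scan->head;

    scan->started = true;
    scan->cur = item;
    /* the caller may delete the item, so remember its successor now */
    scan->next = (item != NULL) ? item->next : NULL;
    return item;
}

/* Unlink and free the item most recently returned by toy_scan_next(). */
static void
toy_delete_current(ToyScan *scan)
{
    ToyItem   **link = scan->head;

    assert(scan->cur != NULL);

    while (*link != NULL && *link != scan->cur)
        link = &(*link)->next;
    assert(*link == scan->cur);

    *link = scan->cur->next;
    free(scan->cur);
    scan->cur = NULL;
}
```

A caller loops on toy_scan_next() and may call toy_delete_current() for
any returned item without disturbing the rest of the scan.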

(pgstat_vacuum_stat(void))
> >   }
> >   dshash_release_lock(pgStatDBHash, dbentry);
> 
> Wait, why are we holding the database partition lock across all this?
> Again without any comments explaining why?

[045] (I'm not sure it is fixed)
The lock is a shared lock in the current version. The database entry is
needed only for attaching the table hash, and now the hashes won't be
removed. So, as you perhaps suggested, the lock can be released earlier in:

 pgstat_report_stat()
 pgstat_flush_funcstats()
 pgstat_vacuum_stat()
 pgstat_reset_single_counter()
 pgstat_report_vacuum()
 pgstat_report_analyze()

The following functions work on the database entry itself, so the lock
needs to be retained until the end of their work.

 pgstat_flush_dbstats()
 pgstat_drop_database()   /* needs exclusive lock */
 pgstat_reset_counters()
 pgstat_report_autovac()
 pgstat_report_recovery_conflict()
 pgstat_report_deadlock()
 pgstat_report_tempfile()
 pgstat_report_checksum_failures_in_db()
 pgstat_flush_checksum_failure() /* repeats short-time lock on each dbs */

> > +void
> > +pgstat_send_archiver(const char *xlog, bool failed)
> 
> Why do we still have functions named pgstat_send*?

[046] (Fixed)
Same as [029] and I changed it to pgstat_report_archiver().

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 79aad94b4cf07c1de8e1a085c9b2c1365a78d4be Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Mon, 16 Mar 2020 17:15:35 +0900
Subject: [PATCH v25 1/8] Use standard crash handler in archiver.

The commit 8e19a82640 changed SIGQUIT handler of almost all processes
not to run atexit callbacks for safety. Archiver process should behave
the same way for the same reason. The exit status changes from 1 to 2,
but that doesn't make any behavioral change.
---
 src/backend/postmaster/pgarch.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 01ffd6513c..37be0e2bbb 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -96,7 +96,6 @@ static pid_t pgarch_forkexec(void);
 #endif
 
 NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_exit(SIGNAL_ARGS);
 static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
@@ -229,7 +228,7 @@ PgArchiverMain(int argc, char *argv[])
     pqsignal(SIGHUP, SignalHandlerForConfigReload);
     pqsignal(SIGINT, SIG_IGN);
     pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
-    pqsignal(SIGQUIT, pgarch_exit);
+    pqsignal(SIGQUIT, SignalHandlerForCrashExit);
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
     pqsignal(SIGUSR1, pgarch_waken);
@@ -246,14 +245,6 @@ PgArchiverMain(int argc, char *argv[])
     exit(0);
 }
 
-/* SIGQUIT signal handler for archiver process */
-static void
-pgarch_exit(SIGNAL_ARGS)
-{
-    /* SIGQUIT means curl up and die ... */
-    exit(1);
-}
-
 /* SIGUSR1 signal handler for archiver process */
 static void
 pgarch_waken(SIGNAL_ARGS)
-- 
2.18.2

From 269f8966be3fbc958f7df4c505b104527c506fdf Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:03 +0900
Subject: [PATCH v25 2/8] sequential scan for dshash

Dshash did not allow scanning all entries sequentially. This adds that
functionality. The interface is similar to, but a bit different from,
both that of dynahash and the simple dshash search functions. One of the
most significant differences is that the sequential scan interface of
dshash always needs a call to dshash_seq_term when a scan ends. Another
is locking. Dshash holds a partition lock when returning an entry;
dshash_seq_next() also holds a lock when returning an entry, but callers
shouldn't release it, since the lock is essential to continue the
scan. The seqscan interface allows entry deletion during a scan. The
in-scan deletion should be performed by dshash_delete_current().
---
 src/backend/lib/dshash.c | 162 ++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  21 +++++
 2 files changed, 182 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 78ccf03217..2086bdbea9 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -112,6 +112,7 @@ struct dshash_table
     size_t        size_log2;        /* log2(number of buckets) */
     bool        find_locked;    /* Is any partition lock held by 'find'? */
     bool        find_exclusively_locked;    /* ... exclusively? */
+    bool        seqscan_running;/* now under sequential scan */
 };
 
 /* Given a pointer to an item, find the entry (user data) it holds. */
@@ -127,6 +128,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +158,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -228,6 +237,7 @@ dshash_create(dsa_area *area, const dshash_parameters *params, void *arg)
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
 
     /*
      * Set up the initial array of buckets.  Our initial size is the same as
@@ -279,6 +289,7 @@ dshash_attach(dsa_area *area, const dshash_parameters *params,
     hash_table->control = dsa_get_address(area, control);
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->seqscan_running = false;
     Assert(hash_table->control->magic == DSHASH_MAGIC);
 
     /*
@@ -324,7 +335,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -568,6 +579,8 @@ dshash_release_lock(dshash_table *hash_table, void *entry)
     Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition_index),
                                 hash_table->find_exclusively_locked
                                 ? LW_EXCLUSIVE : LW_SHARED));
+    /* lock is under control of sequential scan */
+    Assert(!hash_table->seqscan_running);
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
@@ -592,6 +605,153 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should always be called when a scan finished.
+ * The caller may delete returned elements midst of a scan by using
+ * dshash_delete_current(). exclusive must be true to delete elements.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool exclusive)
+{
+    /* allowed at most one scan at once */
+    Assert(!hash_table->seqscan_running);
+
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->exclusive = exclusive;
+    hash_table->seqscan_running = true;
+}
+
+/*
+ * Returns the next element.
+ *
+ * Returned elements are locked and the caller must not explicitly release
+ * it. It is released at the next call to dshash_seq_next().
+ */
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    Assert(status->hash_table->seqscan_running);
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert (status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+        LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                      status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        status->curpartition = partition;
+
+        /* resize doesn't happen from now until seq scan ends */
+        status->nbuckets =
+            NUM_BUCKETS(status->hash_table->control->size_log2);
+        ensure_valid_bucket_pointers(status->hash_table);
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        int next_partition;
+
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            return NULL;
+        }
+
+        /* Also move partition lock if needed */
+        next_partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+
+        /* Move lock along with partition for the bucket */
+        if (status->curpartition != next_partition)
+        {
+            /*
+             * Lock the next partition then release the current, not in the
+             * reverse order to avoid concurrent resizing. Partitions are
+             * locked in the same order with resize() so dead locks won't
+             * happen.
+             */
+            LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                         next_partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                         status->curpartition));
+            status->curpartition = next_partition;
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * The caller may delete the item. Store the next item in case of deletion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+/*
+ * Terminates the seqscan and release all locks.
+ *
+ * Should be always called when finishing or exiting a seqscan.
+ */
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    Assert(status->hash_table->seqscan_running);
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+    status->hash_table->seqscan_running = false;
+
+    if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+/* Remove the current entry while a seq scan. */
+void
+dshash_delete_current(dshash_seq_status *status)
+{
+    dshash_table       *hash_table    = status->hash_table;
+    dshash_table_item  *item        = status->curitem;
+    size_t                partition    = PARTITION_FOR_HASH(item->hash);
+
+    Assert(status->exclusive);
+    Assert(hash_table->control->magic == DSHASH_MAGIC);
+    Assert(hash_table->find_locked);
+    Assert(hash_table->find_exclusively_locked);
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition),
+                                LW_EXCLUSIVE));
+
+    delete_item(hash_table, item);
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b86df68e77..81a929b8d9 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,21 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state. The detail is exposed to let users know the storage
+ * size but it should be considered as an opaque type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;
+    int                    curbucket;
+    int                    nbuckets;
+    dshash_table_item  *curitem;
+    dsa_pointer            pnextitem;
+    int                    curpartition;
+    bool                exclusive;
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
                                    const dshash_parameters *params,
@@ -80,6 +95,12 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
+extern void dshash_delete_current(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.18.2

From e6d9ea8f7ac0ec29e8da064a7fed2d943a57bcec Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:57 +0900
Subject: [PATCH v25 3/8] Add conditional lock feature to dshash

Dshash currently waits for a lock unconditionally. That is inconvenient
when we want to avoid being blocked by other processes. This commit
adds an alternative to dshash_find and dshash_find_or_insert
that allows immediate return on lock failure.
---
 src/backend/lib/dshash.c | 117 ++++++++++++++++++++++++---------------
 src/include/lib/dshash.h |   3 +
 2 files changed, 75 insertions(+), 45 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 2086bdbea9..9a9b818d86 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -386,6 +386,10 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  * the caller must take care to ensure that the entry is not left corrupted.
  * The lock mode is either shared or exclusive depending on 'exclusive'.
  *
+ * If found is not NULL, *found is set to true if the key is found in the hash
+ * table. If the key is not found, *found is set to false and a pointer to a
+ * newly created entry is returned.
+ *
  * The caller must not lock a lock already.
  *
  * Note that the lock held is in fact an LWLock, so interrupts will be held on
@@ -395,36 +399,7 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
 {
-    dshash_hash hash;
-    size_t        partition;
-    dshash_table_item *item;
-
-    hash = hash_key(hash_table, key);
-    partition = PARTITION_FOR_HASH(hash);
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
-
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
-    ensure_valid_bucket_pointers(hash_table);
-
-    /* Search the active bucket. */
-    item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
-
-    if (!item)
-    {
-        /* Not found. */
-        LWLockRelease(PARTITION_LOCK(hash_table, partition));
-        return NULL;
-    }
-    else
-    {
-        /* The caller will free the lock by calling dshash_release_lock. */
-        hash_table->find_locked = true;
-        hash_table->find_exclusively_locked = exclusive;
-        return ENTRY_FROM_ITEM(item);
-    }
+    return dshash_find_extended(hash_table, key, exclusive, false, false, NULL);
 }
 
 /*
@@ -442,30 +417,61 @@ dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
 {
-    dshash_hash hash;
-    size_t        partition_index;
-    dshash_partition *partition;
-    dshash_table_item *item;
+    return dshash_find_extended(hash_table, key, true, false, true, found);
+}
 
-    hash = hash_key(hash_table, key);
-    partition_index = PARTITION_FOR_HASH(hash);
-    partition = &hash_table->control->partitions[partition_index];
 
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
+/*
+ * Find the key in the hash table.
+ *
+ * "insert" indicates insert mode. In this mode new entry is inserted and set
+ * *found to false. *found is set to true if found. "found" must be non-null in
+ * this mode.  exclusive may be false in insert mode, but this function may
+ * take exclusive lock temporarily when actual insertion happens.
+ *
+ * If nowait is true, the function immediately returns if required lock was not
+ * acquired.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool insert, bool *found)
+{
+    dshash_hash hash = hash_key(hash_table, key);
+    size_t        partidx = PARTITION_FOR_HASH(hash);
+    dshash_partition *partition = &hash_table->control->partitions[partidx];
+    LWLockMode  lockmode = exclusive ? LW_EXCLUSIVE : LW_SHARED;
+    dshash_table_item *item;
+    bool        inserted = false;
+
+    /* found must be non-NULL when insert is allowed */
+    Assert(!insert || found != NULL);
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (!nowait)
+        LWLockAcquire(PARTITION_LOCK(hash_table, partidx), lockmode);
+    else if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partidx),
+                                       lockmode))
+        return NULL;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
     item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
 
     if (item)
-        *found = true;
+    {
+        if (found)
+            *found = !inserted;
+    }
     else
     {
+        if (!insert)
+        {
+            /* The caller didn't ask us to add a new entry. */
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+            return NULL;
+        }
+
         *found = false;
 
         /* Check if we are getting too full. */
@@ -482,26 +488,47 @@ restart:
              * Give up our existing lock first, because resizing needs to
              * reacquire all the locks in the right order to avoid deadlocks.
              */
-            LWLockRelease(PARTITION_LOCK(hash_table, partition_index));
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+
             resize(hash_table, hash_table->size_log2 + 1);
 
             goto restart;
         }
 
+        /* need to upgrade the lock to exclusive mode */
+        if (!exclusive)
+        {
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+            LWLockAcquire(PARTITION_LOCK(hash_table, partidx), LW_EXCLUSIVE);
+        }
+
         /* Finally we can try to insert the new item. */
         item = insert_into_bucket(hash_table, key,
                                   &BUCKET_FOR_HASH(hash_table, hash));
         item->hash = hash;
         /* Adjust per-lock-partition counter for load factor knowledge. */
         ++partition->count;
+
+        if (!exclusive)
+        {
+            /*
+             * The entry can be removed while downgrading lock. Re-find it for
+             * safety.
+             */
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+            inserted = true;
+
+            goto restart;
+        }
     }
 
-    /* The caller must release the lock with dshash_release_lock. */
+    /* The caller will free the lock by calling dshash_release_lock. */
     hash_table->find_locked = true;
-    hash_table->find_exclusively_locked = true;
+    hash_table->find_exclusively_locked = exclusive;
     return ENTRY_FROM_ITEM(item);
 }
 
+
 /*
  * Remove an entry by key.  Returns true if the key was found and the
  * corresponding entry was removed.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 81a929b8d9..80a896a99b 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -91,6 +91,9 @@ extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                                    const void *key, bool *found);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+                                  bool exclusive, bool nowait, bool insert,
+                                  bool *found);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.18.2

From d214960c1f1ed8cde84f9c97e5a9caec1e411c48 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:59:38 +0900
Subject: [PATCH v25 4/8] Make archiver process an auxiliary process

This is a preliminary patch for shared-memory based stats collector.

The archiver process must be an auxiliary process since it uses shared
memory once stats data is moved into shared memory. Make the process
an auxiliary process in order to make it work.
---
 src/backend/bootstrap/bootstrap.c   | 22 +++++----
 src/backend/postmaster/pgarch.c     | 75 +----------------------------
 src/backend/postmaster/postmaster.c | 43 +++++++++++------
 src/include/miscadmin.h             |  2 +
 src/include/postmaster/pgarch.h     |  4 +-
 5 files changed, 46 insertions(+), 100 deletions(-)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 5480a024e0..d398ce6f03 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -319,6 +319,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
         case StartupProcess:
             MyBackendType = B_STARTUP;
             break;
+        case ArchiverProcess:
+            MyBackendType = B_ARCHIVER;
+            break;
         case BgWriterProcess:
             MyBackendType = B_BG_WRITER;
             break;
@@ -439,30 +442,29 @@ AuxiliaryProcessMain(int argc, char *argv[])
             proc_exit(1);        /* should never return */
 
         case StartupProcess:
-            /* don't set signals, startup process has its own agenda */
             StartupProcessMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
+
+        case ArchiverProcess:
+            PgArchiverMain();
+            proc_exit(1);
 
         case BgWriterProcess:
-            /* don't set signals, bgwriter has its own agenda */
             BackgroundWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case CheckpointerProcess:
-            /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalWriterProcess:
-            /* don't set signals, walwriter has its own agenda */
             InitXLOGAccess();
             WalWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalReceiverProcess:
-            /* don't set signals, walreceiver has its own agenda */
             WalReceiverMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         default:
             elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 37be0e2bbb..4971b3ae42 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -78,7 +78,6 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
@@ -95,7 +94,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
@@ -110,75 +108,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -218,8 +147,8 @@ pgarch_forkexec(void)
  *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
  *    since we don't use 'em, it hardly matters...
  */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 2b9ab32293..cab7fb5381 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -146,7 +146,8 @@
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
 #define BACKEND_TYPE_BGWORKER    0x0008    /* bgworker process */
-#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
+#define BACKEND_TYPE_ARCHIVER    0x0010    /* archiver process */
+#define BACKEND_TYPE_ALL        0x001F    /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -539,6 +540,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1785,7 +1787,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -3055,7 +3057,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3190,20 +3192,16 @@ reaper(SIGNAL_ARGS)
         }
 
         /*
-         * Was it the archiver?  If so, just try to start a new one; no need
-         * to force reset of the rest of the system.  (If fail, we'll try
-         * again in future cycles of the main loop.).  Unless we were waiting
-         * for it to shut down; don't restart it in that case, and
-         * PostmasterStateMachine() will advance to the next shutdown step.
+         * Was it the archiver?  Normal exit can be ignored; we'll start a new
+         * one at the next iteration of the postmaster's main loop, if
+         * necessary. Any other exit condition is treated as a crash.
          */
         if (pid == PgArchPID)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3451,7 +3449,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3656,6 +3654,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3928,6 +3938,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5208,7 +5219,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5493,6 +5504,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14fa127ab1..619b2f9c71 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -417,6 +417,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -429,6 +430,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index b3200874ca..e3ffc63f14 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
-- 
2.18.2

From 6de0656c222bbb5e1f8c84c703796fee0e518740 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Mon, 16 Mar 2020 22:30:41 +0900
Subject: [PATCH v25 5/8] Use latch instead of SIGUSR1 to wake up archiver

This will be folded into the preceding archiver patch.
---
 src/backend/access/transam/xlog.c        | 49 ++++++++++++++++++++++++
 src/backend/access/transam/xlogarchive.c |  2 +-
 src/backend/postmaster/pgarch.c          | 27 ++++++-------
 src/backend/postmaster/postmaster.c      | 10 -----
 src/include/access/xlog.h                |  2 +
 src/include/access/xlog_internal.h       |  1 +
 src/include/storage/pmsignal.h           |  1 -
 7 files changed, 65 insertions(+), 27 deletions(-)
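
A minimal, standalone sketch of the wake-up scheme this patch introduces: the
archiver publishes a pointer to its latch in shared state under a spinlock, and
a notifier copies that pointer under the lock and sets the latch after releasing
it. Everything named Sketch* or sketch_* is hypothetical and only models the
pattern; it is not PostgreSQL's Latch, XLogCtl, or spinlock code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a latch; only the "is set" flag is modeled. */
typedef struct SketchLatch
{
    atomic_bool is_set;
} SketchLatch;

/* Hypothetical stand-in for the shared fields involved. */
typedef struct SketchShared
{
    atomic_flag  lock;            /* plays the role of info_lck */
    SketchLatch *archiver_latch;  /* NULL while no archiver is registered */
} SketchShared;

static SketchShared sketch_shared = { ATOMIC_FLAG_INIT, NULL };

static void
sketch_lock_acquire(void)
{
    /* Spin until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(&sketch_shared.lock,
                                             memory_order_acquire))
        ;
}

static void
sketch_lock_release(void)
{
    atomic_flag_clear_explicit(&sketch_shared.lock, memory_order_release);
}

/* Archiver side: publish our latch at startup. */
void
sketch_wakeup_start(SketchLatch *my_latch)
{
    SketchLatch *old;

    sketch_lock_acquire();
    old = sketch_shared.archiver_latch;
    sketch_shared.archiver_latch = my_latch;
    sketch_lock_release();

    /* Only one archiver is expected to be registered at a time. */
    assert(old == NULL);
    (void) old;
}

/* Archiver side: withdraw the latch on exit. */
void
sketch_wakeup_end(void)
{
    sketch_lock_acquire();
    sketch_shared.archiver_latch = NULL;
    sketch_lock_release();
}

/* Notifier side: returns true if an archiver was registered and woken. */
bool
sketch_wakeup(void)
{
    SketchLatch *latch;

    /* Copy the pointer under the lock, act on it after releasing. */
    sketch_lock_acquire();
    latch = sketch_shared.archiver_latch;
    sketch_lock_release();

    if (latch == NULL)
        return false;

    atomic_store(&latch->is_set, true);
    return true;
}
```

Keeping the critical section down to a pointer copy matches how the patch uses
the spinlock: nothing that could block happens while it is held.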

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4fa446ffa4..5c477211e9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -668,6 +668,13 @@ typedef struct XLogCtlData
      */
     Latch        recoveryWakeupLatch;
 
+    /*
+     * archiverWakeupLatch is used to wake up the archiver process to process
+     * completed WAL segments, if it is waiting for WAL to arrive.
+     * Protected by info_lck.
+     */
+    Latch       *archiverWakeupLatch;
+
     /*
      * During recovery, we keep a copy of the latest checkpoint record here.
      * lastCheckPointRecPtr points to start of checkpoint record and
@@ -8359,6 +8366,48 @@ GetLastSegSwitchData(XLogRecPtr *lastSwitchLSN)
     return result;
 }
 
+/*
+ * XLogArchiveWakeupStart - Set up archiver wakeup stuff
+ */
+void
+XLogArchiveWakeupStart(void)
+{
+    Latch *old_latch PG_USED_FOR_ASSERTS_ONLY;
+    
+    SpinLockAcquire(&XLogCtl->info_lck);
+    old_latch = XLogCtl->archiverWakeupLatch;
+    XLogCtl->archiverWakeupLatch = MyLatch;
+    SpinLockRelease(&XLogCtl->info_lck);
+    Assert(old_latch == NULL);
+}
+
+/*
+ * XLogArchiveWakeupEnd - Clean up archiver wakeup stuff
+ */
+void
+XLogArchiveWakeupEnd(void)
+{
+    SpinLockAcquire(&XLogCtl->info_lck);
+    XLogCtl->archiverWakeupLatch = NULL;
+    SpinLockRelease(&XLogCtl->info_lck);
+}
+
+/*
+ * XLogArchiveWakeup - Wake up archiver process
+ */
+void
+XLogArchiveWakeup(void)
+{
+    Latch *latch;
+    
+    SpinLockAcquire(&XLogCtl->info_lck);
+    latch = XLogCtl->archiverWakeupLatch;
+    SpinLockRelease(&XLogCtl->info_lck);
+
+    if (latch)
+        SetLatch(latch);
+}
+
 /*
  * This must be called ONCE during postmaster or standalone-backend shutdown
  */
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 188b73e752..cedf969812 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -535,7 +535,7 @@ XLogArchiveNotify(const char *xlog)
 
     /* Notify archiver that it's got something to do */
     if (IsUnderPostmaster)
-        SendPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER);
+        XLogArchiveWakeup();
 }
 
 /*
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 4971b3ae42..6fe7a136ba 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -48,6 +48,7 @@
 #include "storage/latch.h"
 #include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
+#include "storage/procsignal.h"
 #include "utils/guc.h"
 #include "utils/ps_status.h"
 
@@ -94,7 +95,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
 static void pgarch_ArchiverCopyLoop(void);
@@ -141,6 +141,13 @@ pgarch_forkexec(void)
 #endif                            /* EXEC_BACKEND */
 
 
+/* Clean up notification stuff on exit */
+static void
+PgArchiverKill(int code, Datum arg)
+{
+    XLogArchiveWakeupEnd();
+}
+
 /*
  * PgArchiverMain
  *
@@ -160,7 +167,7 @@ PgArchiverMain(void)
     pqsignal(SIGQUIT, SignalHandlerForCrashExit);
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, pgarch_waken);
+    pqsignal(SIGUSR1, procsignal_sigusr1_handler);
     pqsignal(SIGUSR2, pgarch_waken_stop);
     /* Reset some signals that are accepted by postmaster but not here */
     pqsignal(SIGCHLD, SIG_DFL);
@@ -169,24 +176,14 @@ PgArchiverMain(void)
     MyBackendType = B_ARCHIVER;
     init_ps_display(NULL);
 
+    XLogArchiveWakeupStart();
+    on_shmem_exit(PgArchiverKill, 0);
+    
     pgarch_MainLoop();
 
     exit(0);
 }
 
-/* SIGUSR1 signal handler for archiver process */
-static void
-pgarch_waken(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    /* set flag that there is work to be done */
-    wakened = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
 /* SIGUSR2 signal handler for archiver process */
 static void
 pgarch_waken_stop(SIGNAL_ARGS)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index cab7fb5381..fab4a9dd51 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -5262,16 +5262,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (StartWorkerNeeded || HaveCrashedWorker)
         maybe_start_bgworkers();
 
-    if (CheckPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER) &&
-        PgArchPID != 0)
-    {
-        /*
-         * Send SIGUSR1 to archiver process, to wake it up and begin archiving
-         * next WAL file.
-         */
-        signal_child(PgArchPID, SIGUSR1);
-    }
-
     /* Tell syslogger to rotate logfile if requested */
     if (SysLoggerPID != 0)
     {
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 98b033fc20..59e2f0f95a 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -311,6 +311,8 @@ extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern void XLogArchiveWakeupStart(void);
+extern void XLogArchiveWakeupEnd(void);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool CheckPromoteSignal(void);
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 27ded593ab..a272d62b1f 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -331,6 +331,7 @@ extern void ExecuteRecoveryCommand(const char *command, const char *commandName,
 extern void KeepFileRestoredFromArchive(const char *path, const char *xlogfname);
 extern void XLogArchiveNotify(const char *xlog);
 extern void XLogArchiveNotifySeg(XLogSegNo segno);
+extern void XLogArchiveWakeup(void);
 extern void XLogArchiveForceDone(const char *xlog);
 extern bool XLogArchiveCheckDone(const char *xlog);
 extern bool XLogArchiveIsBusy(const char *xlog);
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 56c5ec4481..c691acf8cd 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -34,7 +34,6 @@ typedef enum
 {
     PMSIGNAL_RECOVERY_STARTED,    /* recovery has started */
     PMSIGNAL_BEGIN_HOT_STANDBY, /* begin Hot Standby */
-    PMSIGNAL_WAKEN_ARCHIVER,    /* send a NOTIFY signal to xlog archiver */
     PMSIGNAL_ROTATE_LOGFILE,    /* send SIGUSR1 to syslogger to rotate logfile */
     PMSIGNAL_START_AUTOVAC_LAUNCHER,    /* start an autovacuum launcher */
     PMSIGNAL_START_AUTOVAC_WORKER,    /* start an autovacuum worker */
-- 
2.18.2

From d2a7f51a744b2feca23bef8b95f48cef9ff61acf Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:14 +0900
Subject: [PATCH v25 6/8] Shared-memory based stats collector

Previously, activity statistics were collected via sockets and shared
among backends through files written periodically. Such files can reach
tens of megabytes and are created at most once every 500ms, and that
large amount of data is serialized by the stats collector and then
de-serialized by every backend periodically. To avoid that cost, this
patch places activity statistics data in shared memory. Each backend
accumulates statistics locally, then tries to move them into the shared
statistics at every transaction end, but at intervals no shorter than
500ms. Locks on the shared statistics are acquired per unit, such as a
table or a function, so the expected chance of collision is not high.
Furthermore, until 1 second has elapsed since the last flush to shared
stats, a lock failure postpones the flush so that lock contention
doesn't slow down transactions. After that, the flush waits for the
locks so that the shared statistics don't get stale.
---
 src/backend/access/transam/xlog.c            |    4 +-
 src/backend/catalog/index.c                  |   24 +-
 src/backend/postmaster/autovacuum.c          |   12 +-
 src/backend/postmaster/bgwriter.c            |    2 +-
 src/backend/postmaster/checkpointer.c        |   12 +-
 src/backend/postmaster/pgarch.c              |    4 +-
 src/backend/postmaster/pgstat.c              | 4625 +++++++-----------
 src/backend/postmaster/postmaster.c          |   85 +-
 src/backend/storage/buffer/bufmgr.c          |    8 +-
 src/backend/storage/ipc/ipci.c               |    2 +
 src/backend/storage/lmgr/lwlock.c            |    1 +
 src/backend/tcop/postgres.c                  |   26 +-
 src/backend/utils/adt/pgstatfuncs.c          |   53 +-
 src/backend/utils/init/globals.c             |    1 +
 src/backend/utils/init/postinit.c            |   11 +
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |    4 +-
 src/include/miscadmin.h                      |    2 +
 src/include/pgstat.h                         |  500 +-
 src/include/storage/lwlock.h                 |    1 +
 src/include/utils/timeout.h                  |    1 +
 20 files changed, 1991 insertions(+), 3387 deletions(-)
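
The flush timing described in the commit message can be sketched as a small
decision function. This is an illustration under assumptions, not the patch's
code: the SKETCH_* constants mirror the values of PGSTAT_STAT_MIN_INTERVAL,
PGSTAT_STAT_RETRY_INTERVAL and PGSTAT_STAT_MAX_INTERVAL, while FlushMode,
sketch_flush_mode() and sketch_next_attempt_ms() are hypothetical names, and the
exact boundary handling is a guess.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MIN_INTERVAL_MS    500   /* never flush more often than this */
#define SKETCH_RETRY_INTERVAL_MS  100   /* retry delay after a partial flush */
#define SKETCH_MAX_INTERVAL_MS   1000   /* past this, wait for locks */

typedef enum
{
    FLUSH_SKIP,         /* too soon since the last flush */
    FLUSH_TRY_NOWAIT,   /* flush, but give up on entries whose lock is busy */
    FLUSH_WAIT          /* flush and wait for every lock */
} FlushMode;

/* Decide how to flush at transaction end, given times in milliseconds. */
FlushMode
sketch_flush_mode(int64_t now_ms, int64_t last_flush_ms)
{
    int64_t     elapsed;

    assert(now_ms >= last_flush_ms);
    elapsed = now_ms - last_flush_ms;

    if (elapsed < SKETCH_MIN_INTERVAL_MS)
        return FLUSH_SKIP;
    if (elapsed < SKETCH_MAX_INTERVAL_MS)
        return FLUSH_TRY_NOWAIT;
    return FLUSH_WAIT;
}

/* When to try again: soon if something was left behind, else a full interval. */
int64_t
sketch_next_attempt_ms(int64_t now_ms, bool all_flushed)
{
    return now_ms + (all_flushed ? SKETCH_MIN_INTERVAL_MS
                                 : SKETCH_RETRY_INTERVAL_MS);
}
```

For example, with the last flush at time 0, a transaction ending at 400ms skips
the flush, one ending at 700ms attempts it without waiting on locks, and one
ending at 1000ms or later waits for the locks.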

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5c477211e9..4ea29b8997 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8506,9 +8506,9 @@ LogCheckpointEnd(bool restartpoint)
                         &sync_secs, &sync_usecs);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time +=
+    BgWriterStats.checkpoint_write_time +=
         write_secs * 1000 + write_usecs / 1000;
-    BgWriterStats.m_checkpoint_sync_time +=
+    BgWriterStats.checkpoint_sync_time +=
         sync_secs * 1000 + sync_usecs / 1000;
 
     /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 76fd938ce3..613cef9282 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1687,28 +1687,10 @@ index_concurrently_swap(Oid newIndexId, Oid oldIndexId, const char *oldName)
 
     /*
      * Copy over statistics from old to new index
+     *
+     * The data will be sent by the next pgstat_report_stat() call.
      */
-    {
-        PgStat_StatTabEntry *tabentry;
-
-        tabentry = pgstat_fetch_stat_tabentry(oldIndexId);
-        if (tabentry)
-        {
-            if (newClassRel->pgstat_info)
-            {
-                newClassRel->pgstat_info->t_counts.t_numscans = tabentry->numscans;
-                newClassRel->pgstat_info->t_counts.t_tuples_returned = tabentry->tuples_returned;
-                newClassRel->pgstat_info->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_hit = tabentry->blocks_hit;
-
-                /*
-                 * The data will be sent by the next pgstat_report_stat()
-                 * call.
-                 */
-            }
-        }
-    }
+    pgstat_copy_index_counters(oldIndexId, newClassRel->pgstat_info);
 
     /* Close relations */
     table_close(pg_class, RowExclusiveLock);
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index da75e755f0..333712d3c5 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1956,15 +1956,15 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
+    /* Start a transaction so our commands have one to play into. */
+    StartTransactionCommand();
+
     /*
      * may be NULL if we couldn't find an entry (only happens if we are
      * forcing a vacuum for anti-wrap purposes).
      */
     dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
 
-    /* Start a transaction so our commands have one to play into. */
-    StartTransactionCommand();
-
     /*
      * Clean up any dead statistics collector entries for this DB. We always
      * want to do this exactly once per DB-processing cycle, even if we find
@@ -2748,12 +2748,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
     if (isshared)
     {
         if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
+            tabentry = pgstat_fetch_stat_tabentry_snapshot(shared, relid);
     }
     else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
+        tabentry = pgstat_fetch_stat_tabentry_snapshot(dbentry, relid);
 
     return tabentry;
 }
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 069e27e427..94bdd664b5 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -236,7 +236,7 @@ BackgroundWriterMain(void)
         /*
          * Send off activity statistics to the stats collector
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index e354a78725..8a2fd0ddb2 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -350,7 +350,7 @@ CheckpointerMain(void)
         if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
         {
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            BgWriterStats.requested_checkpoints++;
         }
 
         /*
@@ -364,7 +364,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                BgWriterStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -492,7 +492,7 @@ CheckpointerMain(void)
          * worth the trouble to split the stats support into two independent
          * stats message types.)
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         /*
          * Sleep until we are signaled or it's time for another checkpoint or
@@ -693,7 +693,7 @@ CheckpointWriteDelay(int flags, double progress)
         /*
          * Report interim activity statistics to the stats collector.
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1238,8 +1238,8 @@ AbsorbSyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    BgWriterStats.buf_written_backend += CheckpointerShmem->num_backend_writes;
+    BgWriterStats.buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 6fe7a136ba..f0b524ca50 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -401,7 +401,7 @@ pgarch_ArchiverCopyLoop(void)
                  * Tell the collector about the WAL file that we successfully
                  * archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_report_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
@@ -411,7 +411,7 @@ pgarch_ArchiverCopyLoop(void)
                  * Tell the collector about the WAL file that we failed to
                  * archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_report_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f9287b7942..34a4005791 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,23 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Activity Statistics facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects per-table and per-function usage statistics of all backends in
+ *  shared memory. pg_count_*() and friends are the interface to locally store
+ *  backend activities during a transaction. Then pgstat_flush_stat() is called
+ *  at the end of a transaction to publish the local stats to shared memory.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ *  To avoid congestion on the shared memory, we update shared stats no more
+ *  often than once per PGSTAT_STAT_MIN_INTERVAL (500ms). In the case where
+ *  not all the local numbers can be flushed immediately, we postpone updates
+ *  and try again after an interval of PGSTAT_STAT_RETRY_INTERVAL (100ms),
+ *  but we don't postpone a flush for longer than
+ *  PGSTAT_STAT_MAX_INTERVAL (1000ms).
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the activity statistics facility creates the
+ *  area and loads the stored stats file, if any. The last process at shutdown
+ *  writes the shared stats to the file and destroys the area before exit.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -19,18 +27,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -40,68 +36,43 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/fork_process.h"
 #include "postmaster/interrupt.h"
 #include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
-#include "utils/timestamp.h"
 
 /* ----------
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
-
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
+#define PGSTAT_STAT_MIN_INTERVAL    500 /* Minimum interval of stats data
+                                         * updates; in milliseconds. */
 
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_STAT_RETRY_INTERVAL    100 /* Retry interval after
+                                         * PGSTAT_STAT_MIN_INTERVAL */
 
+#define PGSTAT_STAT_MAX_INTERVAL   1000 /* Longest interval of stats data
+                                         * updates; in milliseconds. */
 
 /* ----------
- * The initial size hints for the hash tables used in the collector.
+ * The initial size hints for the hash tables used in the activity statistics.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
-#define PGSTAT_TAB_HASH_SIZE    512
+#define PGSTAT_TABLE_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
 
@@ -116,7 +87,6 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
-
 /* ----------
  * GUC parameters
  * ----------
@@ -131,76 +101,96 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used; to be removed together with the GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
-/*
- * BgWriter global statistics counters (unused in other processes).
- * Stored directly in a stats message structure so it can be sent
- * without needing to copy things around.  We assume this inits to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
-
-/* ----------
- * Local data
- * ----------
- */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
-
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
-
-/*
- * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
- *
- * NOTE: once allocated, TabStatusArray structures are never moved or deleted
- * for the life of the backend.  Also, we zero out the t_id fields of the
- * contained PgStat_TableStatus structs whenever they are not actively in use.
- * This allows relcache pgstat_info pointers to be treated as long-lived data,
- * avoiding repeated searches in pgstat_initstats() when a relation is
- * repeatedly opened during a transaction.
- */
-#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
-
-typedef struct TabStatusArray
+/* Shared stats bootstrap information, protected by StatsLock */
+typedef struct StatsShmemStruct
 {
-    struct TabStatusArray *tsa_next;    /* link to next array, if any */
-    int            tsa_used;        /* # entries currently used */
-    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
-} TabStatusArray;
-
-static TabStatusArray *pgStatTabList = NULL;
+    dsa_handle    stats_dsa_handle;    /* DSA handle for stats data */
+    dshash_table_handle db_hash_handle;
+    dsa_pointer global_stats;
+    dsa_pointer archiver_stats;
+    int            refcount;
+}            StatsShmemStruct;
+
+/* BgWriter global statistics counters */
+PgStat_BgWriter BgWriterStats = {0};
+
+/* backend-lifetime storage */
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
+static dshash_table *pgStatDBHash = NULL;
+
+
+/* parameters for shared hashes */
+static const dshash_parameters dsh_dbparams = {
+    sizeof(Oid),
+    SHARED_DBENT_SIZE,
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+static const dshash_parameters dsh_tblparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatTabEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+static const dshash_parameters dsh_funcparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatFuncEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
 
 /*
- * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
+ * Backends store per-table info that's waiting to be flushed out to shared
+ * memory in this hash table (indexed by table OID).
  */
-typedef struct TabStatHashEntry
-{
-    Oid            t_id;
-    PgStat_TableStatus *tsa_entry;
-} TabStatHashEntry;
+static HTAB *pgStatTables = NULL;
 
 /*
- * Hash table for O(1) t_id -> tsa_entry lookup
+ * Backends store per-function info that's waiting to be flushed out to shared
+ * memory in this hash table (indexed by function OID).
  */
-static HTAB *pgStatTabHash = NULL;
+static HTAB *pgStatFunctions = NULL;
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * Backends store database-wide counters that are waiting to be flushed out to
+ * shared memory.
  */
-static HTAB *pgStatFunctions = NULL;
+static PgStat_TableCounts pgStatMyDatabaseStats = {0};
+static PgStat_TableCounts pgStatSharedDatabaseStats = {0};
 
 /*
  * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
+ * written out to the shared stats.
  */
+static bool have_mydatabase_stats = false;
+static bool have_shdatabase_stats = false;
+static bool have_table_stats = false;
 static bool have_function_stats = false;
 
+/* common header of snapshot entry in reader snapshot hash */
+typedef struct PgStat_snapshot
+{
+    Oid            key;
+    bool        negative;
+    void       *body;            /* end of header part: to keep alignment */
+}            PgStat_snapshot;
+
+/* Hash entry struct for checksum_failures below */
+typedef struct ChecksumFailureEnt
+{
+    Oid            dboid;
+    int            count;
+}            ChecksumFailureEnt;
+
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
  * into PgStat_TableStatus counters until we know if it is going to commit
@@ -236,11 +226,15 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
+static MemoryContext pgStatSnapshotContext = NULL;
+static HTAB *pgStatLocalHash = NULL;
+static bool clear_snapshot = false;
+
+/* Count checksum failure for each database */
+HTAB       *checksum_failures = NULL;
+int            nchecksum_failures = 0;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -249,19 +243,17 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
- */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
+ * Cluster wide statistics.
+ *
+ * Contains statistics that are collected neither per database nor per
+ * table.  shared_* point to shared memory and snapshot_* are backend
+ * snapshots. Their validity is indicated by global_snapshot_is_valid.
  */
-static List *pending_write_requests = NIL;
+static bool global_snapshot_is_valid = false;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats snapshot_globalStats;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -275,33 +267,34 @@ static instr_time total_func_time;
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgstat_beshutdown_hook(int code, Datum arg);
-
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
+static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool exclusive,
+                                               bool nowait, bool create);
+static PgStat_StatTabEntry *pgstat_get_tab_entry(dshash_table *table,
                                                  Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static void pgstat_write_pgStatDBHashfile(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_pgStatDBHashfile(PgStat_StatDBEntry *dbentry);
 static void pgstat_read_current_status(void);
 
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
-
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
+static void pgstat_flush_dbstats(bool shared, bool nowait);
+static bool pgstat_flush_tabstats(Oid dbid, dshash_table_handle tabhandle,
+                                  bool nowait);
+static bool pgstat_flush_funcstats(dshash_table_handle funchandle, bool nowait);
+static bool pgstat_update_tabentry(dshash_table *tabhash,
+                                   PgStat_TableStatus *stat, bool nowait);
 static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
 
+static void pgstat_remove_useless_entries(const dshash_table_handle dshhandle,
+                                          const dshash_parameters *dshparams,
+                                          HTAB *oidtab);
 static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
 
 static void pgstat_setup_memcxt(void);
+static void pgstat_flush_checksum_failure(bool nowait);
+static PgStat_SubXactStatus *get_tabstat_stack_level(int nest_level);
+static void add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level);
+static PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry_snapshot(PgStat_StatDBEntry *dbent, Oid funcid);
+static void pgstat_snapshot_global_stats(void);
 
 static const char *pgstat_get_wait_activity(WaitEventActivity w);
 static const char *pgstat_get_wait_client(WaitEventClient w);
@@ -309,484 +302,210 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+/* ------------------------------------------------------------
+ * Local support functions follow
+ * ------------------------------------------------------------
+ */
+static void reset_dbentry_counters(PgStat_StatDBEntry *dbentry);
 
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute shared memory space needed for activity statistics
+ */
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+/*
+ * StatsShmemInit - initialize during shared-memory creation
  */
 void
-pgstat_init(void)
+StatsShmemInit(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
+    bool        found;
 
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
 
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
+    if (!IsUnderPostmaster)
     {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
-    }
+        Assert(!found);
 
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+    }
+}
 
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
+/* ----------
+ * pgstat_attach_shared_stats() -
+ *
+ *    Attach to the shared stats memory, creating it if necessary.
+ * ---------
+ */
+static void
+pgstat_attach_shared_stats(void)
+{
+    PgStat_StatDBEntry *dbent;
 
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
+    MemoryContext oldcontext;
 
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    /*
+     * Don't use dsm unless running under postmaster and tracking counts.
+     */
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
 
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    pgstat_setup_memcxt();
 
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    if (area)
+        return;
 
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
 
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
+    if (StatsShmem->refcount > 0)
+        StatsShmem->refcount++;
+    else
+    {
+        /* Need to create shared memory area and load saved stats if any. */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
 
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        /* Initialize shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatDBHash = dshash_create(area, &dsh_dbparams, 0);
 
-        test_byte++;            /* just make sure variable is changed */
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->global_stats =
+            dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+        StatsShmem->archiver_stats =
+            dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+        StatsShmem->db_hash_handle = dshash_get_hash_table_handle(pgStatDBHash);
 
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
 
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        /* Load saved data if any. */
+        pgstat_read_statsfiles();
 
-        /* If we get here, we have a working socket */
-        break;
+        StatsShmem->refcount = 1;
     }
 
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
+    LWLockRelease(StatsLock);
 
     /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
+     * If we're not the first process, attach to the existing shared stats
+     * area outside StatsLock.
      */
-    if (!pg_set_noblock(pgStatSock))
+    if (!area)
     {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
-    }
-
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
+        /* Shared area already exists. Just attach it. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatDBHash = dshash_attach(area, &dsh_dbparams,
+                                     StatsShmem->db_hash_handle, 0);
+
+        /* Setup local variables */
+        pgStatLocalHash = NULL;
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
     }
 
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    /* Now that we have a long-lived socket, tell fd.c about it. */
-    ReserveExternalFD();
-
-    return;
-
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
-
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
+    MemoryContextSwitchTo(oldcontext);
 
     /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
+     * Create db entries for the current database and for shared tables if
+     * they have not been created yet.
      */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    dbent = pgstat_get_db_entry(MyDatabaseId, false, false, true);
+    Assert(dbent);
+    dshash_release_lock(pgStatDBHash, dbent);
+    dbent = pgstat_get_db_entry(InvalidOid, false, false, true);
+    Assert(dbent);
+    dshash_release_lock(pgStatDBHash, dbent);
+
+    /* don't detach automatically */
+    dsa_pin_mapping(area);
+    global_snapshot_is_valid = false;
 }
 
-/*
- * subroutine for pgstat_reset_all
+/* ----------
+ * pgstat_detach_shared_stats() -
+ *
+ *    Detach shared stats. Write out to file if we're the last process and told
+ *    to do so.
+ * ----------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+pgstat_detach_shared_stats(bool write_stats)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    /* return immediately if there is nothing to detach */
+    if (!area || !IsUnderPostmaster)
+        return;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
+    /* write out the shared stats to file if needed */
+    if (--StatsShmem->refcount < 1)
     {
-        int            nchars;
-        Oid            tmp_oid;
+        if (write_stats)
+            pgstat_write_statsfiles();
 
-        /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
-         */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
+        /* We're the last process. Invalidate the dsa area handle. */
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+    }
 
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
+    LWLockRelease(StatsLock);
 
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
-    }
-    FreeDir(dir);
+    /*
+     * Detach the area. It is destroyed automatically when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);
+
+    area = NULL;
+    pgStatDBHash = NULL;
+    shared_globalStats = NULL;
+    shared_archiverStats = NULL;
+    pgStatLocalHash = NULL;
+    global_snapshot_is_valid = false;
 }
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
-}
-
-#ifdef EXEC_BACKEND
-
-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
-    char       *av[10];
-    int            ac = 0;
-
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
-
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
-
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
-
-
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
+    /* standalone server doesn't use shared stats */
+    if (!IsUnderPostmaster)
+        return;
 
-    /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
-     */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
+    /* we must have shared stats attached */
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
+    /* Startup must be the only user of shared stats */
+    Assert(StatsShmem->refcount == 1);
 
     /*
-     * Okay, fork off the collector.
+     * We could directly remove files and recreate the shared memory area, but
+     * we just detach and then re-attach for simplicity.
      */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
+    pgstat_detach_shared_stats(false);    /* Don't write */
+    pgstat_attach_shared_stats();
 }
 
 /* ------------------------------------------------------------
@@ -794,259 +513,441 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *    Updates are applied no more often than once every
+ *    PGSTAT_STAT_MIN_INTERVAL milliseconds.  If force is false, they are also
+ *    postponed on lock failure unless some update has been pending for longer
+ *    than PGSTAT_STAT_MAX_INTERVAL milliseconds.  Postponed updates are
+ *    retried in succeeding calls of this function.
+ *
+ *    Returns the time in milliseconds to wait before the next attempt, or 0
+ *    if no updates remain pending (there was nothing to do or everything was
+ *    flushed).
+ *
+ *    Note that this is called only outside a transaction, so it is fine to use
  *    transaction stop time as an approximation of current time.
- * ----------
+ *    ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz next_flush = 0;
+    static TimestampTz pending_since = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
-    int            i;
+    PgStat_StatDBEntry *dbent;
+    bool        nowait = !force;    /* Don't use force ever after */
+    long        elapsed;
+    long        secs;
+    int            usecs;
+    dshash_table_handle tables_handle;
+    dshash_table_handle functions_handle;
+    bool        process_shared_tables = false;
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (area == NULL ||
+        (!have_table_stats && !have_function_stats &&
+         !have_mydatabase_stats && !have_shdatabase_stats &&
+         pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
+         checksum_failures == NULL))
+        return 0;
 
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
     now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
-
-    /*
-     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
-     */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
-    pgStatTabHash = NULL;
 
-    /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
-     */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
-    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+    if (nowait)
     {
-        for (i = 0; i < tsa->tsa_used; i++)
+        /*
+         * Don't flush stats until it's time.  Return the time to wait in
+         * milliseconds.
+         */
+        if (now < next_flush)
         {
-            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
+            /* Record the time of the oldest pending update, if not yet set. */
+            if (pending_since == 0)
+                pending_since = now;
 
-            /* Shouldn't have any pending transaction-dependent counts */
-            Assert(entry->trans == NULL);
+            /* now < next_flush here */
+            return (next_flush - now) / 1000;
+        }
 
-            /*
-             * Ignore entries that didn't accumulate any actual counts, such
-             * as indexes that were opened by the planner but not used.
-             */
-            if (memcmp(&entry->t_counts, &all_zeroes,
-                       sizeof(PgStat_TableCounts)) == 0)
-                continue;
+        /*
+         * Don't keep pending updates longer than PGSTAT_STAT_MAX_INTERVAL.
+         */
+        if (pending_since > 0)
+        {
+            TimestampDifference(pending_since, now, &secs, &usecs);
+            elapsed = secs * 1000 + usecs / 1000;
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
-            {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
-            }
+            if (elapsed > PGSTAT_STAT_MAX_INTERVAL)
+                nowait = false;
         }
-        /* zero out PgStat_TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
     }
 
-    /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
-     */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
-
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
-}
+    /* Flush out individual stats tables */
+    dbent = pgstat_get_db_entry(MyDatabaseId, false, nowait, false);
+    if (dbent)
+    {
+        /* remember the handles so that the entry can be unlocked early */
+        tables_handle = dbent->tables;
+        functions_handle = dbent->functions;
+        dshash_release_lock(pgStatDBHash, dbent);
 
-/*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
- */
+        /* dbent is no longer usable from here on */
+        process_shared_tables =
+            pgstat_flush_tabstats(MyDatabaseId, tables_handle, nowait);
+        pgstat_flush_funcstats(functions_handle, nowait);
+    }
+    else
+    {
+        /* uncertain whether shared table stats exist, try them */
+        process_shared_tables = true;
+    }
+
+    /* update database-side stats */
+    pgstat_flush_checksum_failure(nowait);
+    pgstat_flush_dbstats(false, nowait);    /* MyDatabase */
+
+    if (process_shared_tables)
+    {
+        /* shared tables found, process them */
+        dbent = pgstat_get_db_entry(InvalidOid, false, nowait, false);
+        tables_handle = dbent ? dbent->tables : DSM_HANDLE_INVALID;
+        if (dbent)
+            dshash_release_lock(pgStatDBHash, dbent);
+        if (dbent)
+            pgstat_flush_tabstats(InvalidOid, tables_handle, nowait);
+    }
+    pgstat_flush_dbstats(true, nowait); /* Shared tables */
+
+    /* Publish the last flush time */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (shared_globalStats->stats_timestamp < now)
+        shared_globalStats->stats_timestamp = now;
+    LWLockRelease(StatsLock);
+
+    /* Record how long we are keeping pending updates. */
+    if (have_table_stats || have_function_stats ||
+        have_mydatabase_stats || have_shdatabase_stats ||
+        checksum_failures != NULL)
+    {
+        /* Preserve the first value */
+        if (pending_since == 0)
+            pending_since = now;
+
+        /*
+         * The retry interval may carry us past the limit set by
+         * PGSTAT_STAT_MAX_INTERVAL.  We don't bother about that since the
+         * overshoot is small.
+         */
+        return PGSTAT_STAT_RETRY_INTERVAL;
+    }
+
+    /* Set the next time to update stats */
+    next_flush = now + PGSTAT_STAT_MIN_INTERVAL * 1000;
+    pending_since = 0;
+
+    return 0;
+}
+
+
+/*
+ * pgstat_flush_dbstats: Flushes database stats out to shared statistics.
+ *
+ *  If nowait is true, returns immediately if required lock was not acquired.
+ */
 static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+pgstat_flush_dbstats(bool shared, bool nowait)
 {
-    int            n;
-    int            len;
+    PgStat_StatDBEntry *dbent;
+    PgStat_TableCounts *s;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    if (shared)
+    {
+        if (!have_shdatabase_stats)
+            return;
+        dbent = pgstat_get_db_entry(InvalidOid, false, nowait, false);
+        if (!dbent)
+            return;
 
-    /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
-     */
-    if (OidIsValid(tsmsg->m_databaseid))
+        s = &pgStatSharedDatabaseStats;
+        have_shdatabase_stats = false;
+    }
+    else
+    {
+        if (!have_mydatabase_stats)
+            return;
+        dbent = pgstat_get_db_entry(MyDatabaseId, false, nowait, false);
+        if (!dbent)
+            return;
+
+        s = &pgStatMyDatabaseStats;
+        have_mydatabase_stats = false;
+    }
+
+    /* We got the database entry, update database-wide stats */
+    LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
+    dbent->counts.n_tuples_returned += s->t_tuples_returned;
+    dbent->counts.n_tuples_fetched += s->t_tuples_fetched;
+    dbent->counts.n_tuples_inserted += s->t_tuples_inserted;
+    dbent->counts.n_tuples_updated += s->t_tuples_updated;
+    dbent->counts.n_tuples_deleted += s->t_tuples_deleted;
+    dbent->counts.n_blocks_fetched += s->t_blocks_fetched;
+    dbent->counts.n_blocks_hit += s->t_blocks_hit;
+
+    if (!shared)
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
+        dbent->counts.n_xact_commit += pgStatXactCommit;
+        dbent->counts.n_xact_rollback += pgStatXactRollback;
+        dbent->counts.n_block_read_time += pgStatBlockReadTime;
+        dbent->counts.n_block_write_time += pgStatBlockWriteTime;
         pgStatXactCommit = 0;
         pgStatXactRollback = 0;
         pgStatBlockReadTime = 0;
         pgStatBlockWriteTime = 0;
     }
-    else
+    LWLockRelease(&dbent->lock);
+
+    dshash_release_lock(pgStatDBHash, dbent);
+}
+
+/*
+ * pgstat_flush_tabstats: Flushes table stats out to shared statistics.
+ *
+ *  If nowait is true, entries whose required lock could not be acquired
+ *  immediately are skipped.  Such unapplied table stats updates are left
+ *  alone in pgStatTables to wait for the next chance; have_table_stats stays
+ *  true in that case.
+ *
+ *  Returns true if entries for another database are found in pgStatTables.
+ */
+static bool
+pgstat_flush_tabstats(Oid dbid, dshash_table_handle tabhandle, bool nowait)
+{
+    static const PgStat_TableCounts all_zeroes;
+
+    HASH_SEQ_STATUS scan;
+    PgStat_TableStatus *bestat;
+    dshash_table *tabhash;
+    bool        anotherdb_found = false;
+
+    /* nothing to do, just return */
+    if (!have_table_stats)
+        return false;
+
+    have_table_stats = false;
+
+    tabhash = dshash_attach(area, &dsh_tblparams, tabhandle, 0);
+
+    /*
+     * Scan through the pgStatTables to find tables that actually have counts,
+     * and try flushing them out to shared stats.
+     */
+    hash_seq_init(&scan, pgStatTables);
+    while ((bestat = (PgStat_TableStatus *) hash_seq_search(&scan)) != NULL)
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        bool        remove_entry = false;
+
+        /*
+         * Ignore entries that didn't accumulate any actual counts, such as
+         * indexes that were opened by the planner but not used.
+         */
+        if (memcmp(&bestat->t_counts, &all_zeroes,
+                   sizeof(PgStat_TableCounts)) == 0)
+            remove_entry = true;
+        /* Ignore entries of databases other than our current target */
+        else if (dbid != (bestat->t_shared ? InvalidOid : MyDatabaseId))
+            anotherdb_found = true;
+        else if (pgstat_update_tabentry(tabhash, bestat, nowait))
+        {
+            PgStat_TableCounts *s;
+
+            if (bestat->t_shared)
+            {
+                s = &pgStatSharedDatabaseStats;
+                have_shdatabase_stats = true;
+            }
+            else
+            {
+                s = &pgStatMyDatabaseStats;
+                have_mydatabase_stats = true;
+            }
+
+            /* database count is applied at once later */
+            s->t_tuples_returned += bestat->t_counts.t_tuples_returned;
+            s->t_tuples_fetched += bestat->t_counts.t_tuples_fetched;
+            s->t_tuples_inserted += bestat->t_counts.t_tuples_inserted;
+            s->t_tuples_updated += bestat->t_counts.t_tuples_updated;
+            s->t_tuples_deleted += bestat->t_counts.t_tuples_deleted;
+            s->t_blocks_fetched += bestat->t_counts.t_blocks_fetched;
+            s->t_blocks_hit += bestat->t_counts.t_blocks_hit;
+
+            remove_entry = true;
+        }
+
+        if (remove_entry)
+        {
+            /*
+             * Reuse of the entry is detected with t_id in pgstat_initstats.
+             * Set invalid value after removal because the value is needed to
+             * remove the entry.
+             */
+            hash_search(pgStatTables, &bestat->t_id, HASH_REMOVE, NULL);
+            bestat->t_id = InvalidOid;
+        }
+        else
+            have_table_stats = true;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
+    dshash_detach(tabhash);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    return anotherdb_found;
 }
 
+
 /*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
+ * pgstat_flush_funcstats: Flushes function stats.
+ *
+ *  If nowait is true, entries whose lock could not be acquired immediately
+ *  are skipped and the unapplied local hash entries are left alone.
+ *
+ *  Returns true if some entries are left unflushed.
  */
-static void
-pgstat_send_funcstats(void)
+static bool
+pgstat_flush_funcstats(dshash_table_handle funchandle, bool nowait)
 {
     /* we assume this inits to all zeroes: */
     static const PgStat_FunctionCounts all_zeroes;
+    HASH_SEQ_STATUS scan;
+    PgStat_BackendFunctionEntry *bestat;
+    dshash_table *funchash = NULL;
 
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
-
+    /* nothing to do, just return */
     if (pgStatFunctions == NULL)
-        return;
+        return false;
+
+    have_function_stats = false;
+
+    /* dshash for function stats is created on-demand */
+    if (funchandle == DSM_HANDLE_INVALID)
+    {
+        PgStat_StatDBEntry *dbent =
+        pgstat_get_db_entry(MyDatabaseId, false, false, false);
+
+        Assert(dbent);
+
+        funchash = dshash_create(area, &dsh_funcparams, 0);
+
+        LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
+        if (dbent->functions == DSM_HANDLE_INVALID)
+            funchandle = dbent->functions =
+                dshash_get_hash_table_handle(funchash);
+        else
+        {
+            /* someone else simultaneously created it, discard mine. */
+            dshash_destroy(funchash);
+            funchandle = dbent->functions;
+        }
+        LWLockRelease(&dbent->lock);
+
+        /* dbent is no longer needed, release it right now */
+        dshash_release_lock(pgStatDBHash, dbent);
+    }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
+    if (funchash == NULL)
+        funchash = dshash_attach(area, &dsh_funcparams, funchandle, 0);
 
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    /*
+     * Scan through the pgStatFunctions to find functions that actually have
+     * counts, and try flushing them out to shared stats.
+     */
+    hash_seq_init(&scan, pgStatFunctions);
+    while ((bestat = (PgStat_BackendFunctionEntry *) hash_seq_search(&scan)) != NULL)
     {
-        PgStat_FunctionEntry *m_ent;
+        bool        found;
+        PgStat_StatFuncEntry *shstat = NULL;
 
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
+        /* Skip it if no counts accumulated for it so far */
+        if (memcmp(&bestat->f_counts, &all_zeroes,
                    sizeof(PgStat_FunctionCounts)) == 0)
             continue;
 
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
+        shstat = (PgStat_StatFuncEntry *)
+            dshash_find_extended(funchash, (void *) &(bestat->f_id),
+                                 true, nowait, true, &found);
 
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
+        /*
+         * We couldn't acquire lock on the required entry. Leave the local
+         * entry alone.
+         */
+        if (!shstat)
         {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
+            have_function_stats = true;
+            continue;
         }
 
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
+        /* Initialize if it's new, or add to it. */
+        if (!found)
+        {
+            shstat->functionid = bestat->f_id;
+            shstat->f_numcalls = bestat->f_counts.f_numcalls;
+            shstat->f_total_time =
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_total_time);
+            shstat->f_self_time =
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_self_time);
+        }
+        else
+        {
+            shstat->f_numcalls += bestat->f_counts.f_numcalls;
+            shstat->f_total_time +=
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_total_time);
+            shstat->f_self_time +=
+                INSTR_TIME_GET_MICROSEC(bestat->f_counts.f_self_time);
+        }
+        dshash_release_lock(funchash, shstat);
 
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
+        /* reset used counts */
+        MemSet(&bestat->f_counts, 0, sizeof(PgStat_FunctionCounts));
+    }
 
-    have_function_stats = false;
+    return have_function_stats;
 }
 
 
 /* ----------
  * pgstat_vacuum_stat() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *    Remove objects we can get rid of.
  * ----------
  */
 void
 pgstat_vacuum_stat(void)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
+    HTAB       *oidtab;
+    dshash_seq_status dshstat;
     PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
+    dshash_table_handle tables_handle;
+    dshash_table_handle functions_handle;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* we don't collect stats under standalone mode */
+    if (!IsUnderPostmaster)
         return;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
     /*
      * Read pg_database and make a list of OIDs of all existing databases
      */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    oidtab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
 
     /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
+     * Search the database hash table for dead databases and drop them from
+     * the hash.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+
+    dshash_seq_init(&dshstat, pgStatDBHash, false);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
     {
         Oid            dbid = dbentry->databaseid;
 
@@ -1054,136 +955,48 @@ pgstat_vacuum_stat(void)
 
         /* the DB entry for shared tables (with InvalidOid) is never dropped */
         if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
+            hash_search(oidtab, (void *) &dbid, HASH_FIND, NULL) == NULL)
             pgstat_drop_database(dbid);
     }
 
     /* Clean up */
-    hash_destroy(htab);
+    dshash_seq_term(&dshstat);
+    hash_destroy(oidtab);
 
     /*
      * Lookup our own database entry; if not found, nothing more to do.
+     * Our database entry cannot be removed and its hashes are not replaced,
+     * so we can release the lock right after copying the handles.
      */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
+    dbentry = pgstat_get_db_entry(MyDatabaseId, false, false, false);
+    if (dbentry == NULL)
         return;
+    tables_handle = dbentry->tables;
+    functions_handle = dbentry->functions;
+    dshash_release_lock(pgStatDBHash, dbentry);
 
     /*
      * Similarly to above, make a list of all known relations in this DB.
      */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
-
-    /*
-     * Check for all tables listed in stats hashtable if they still exist.
-     */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            tabid = tabentry->tableid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
-            continue;
-
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
-    }
+    oidtab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
 
     /*
-     * Send the rest
+     * Check for all tables listed in stats hash table if they still exist.
+     * Stats cache is useless here so directly search the shared hash.
      */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
+    pgstat_remove_useless_entries(tables_handle, &dsh_tblparams, oidtab);
+    hash_destroy(oidtab);
 
     /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
+     * Repeat the above but we needn't bother in the common case where no
+     * function stats are being collected.
      */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
+    if (functions_handle != DSM_HANDLE_INVALID)
     {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
-        {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
-                continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
-        }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
+        oidtab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
+        pgstat_remove_useless_entries(functions_handle, &dsh_funcparams,
+                                      oidtab);
+        hash_destroy(oidtab);
     }
 }
 
@@ -1212,7 +1025,7 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     hash_ctl.entrysize = sizeof(Oid);
     hash_ctl.hcxt = CurrentMemoryContext;
     htab = hash_create("Temporary table of OIDs",
-                       PGSTAT_TAB_HASH_SIZE,
+                       PGSTAT_TABLE_HASH_SIZE,
                        &hash_ctl,
                        HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
 
@@ -1239,65 +1052,96 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
 }
 
 
-/* ----------
- * pgstat_drop_database() -
+/*
+ * pgstat_remove_useless_entries - Remove entries that no longer exist from
+ * per-table/function dshashes.
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
- * ----------
+ *  Scan the dshash specified by dshhandle removing entries that are not in
+ *  oidtab. oidtab is left intact; the caller is responsible for destroying it.
  */
 void
-pgstat_drop_database(Oid databaseid)
+pgstat_remove_useless_entries(const dshash_table_handle dshhandle,
+                              const dshash_parameters *dshparams,
+                              HTAB *oidtab)
 {
-    PgStat_MsgDropdb msg;
+    dshash_table *dshtable;
+    dshash_seq_status dshstat;
+    void       *ent;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    dshtable = dshash_attach(area, dshparams, dshhandle, 0);
+    dshash_seq_init(&dshstat, dshtable, true);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
+    {
+        CHECK_FOR_INTERRUPTS();
+
+        /* The first member of the entries must be Oid */
+        if (hash_search(oidtab, ent, HASH_FIND, NULL) != NULL)
+            continue;
+
+        /* Not there, so purge this entry */
+        dshash_delete_current(&dshstat);
+    }
 
+    /* clean up */
+    dshash_seq_term(&dshstat);
+    dshash_detach(dshtable);
+}
 
 /* ----------
- * pgstat_drop_relation() -
+ * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
+ *    Remove entry for the database that we just dropped.
  *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
+ *    If some stats are flushed after this, this entry will be re-created but we
+ *    will still clean the dead DB eventually via future invocations of
+ *    pgstat_vacuum_stat().
  * ----------
  */
-#ifdef NOT_USED
 void
-pgstat_drop_relation(Oid relid)
+pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgTabpurge msg;
-    int            len;
+    PgStat_StatDBEntry *dbentry;
+
+    Assert(OidIsValid(databaseid));
+
+    if (!IsUnderPostmaster || !pgStatDBHash)
+        return;
+
+    /*
+     * Look up the database; removal needs an exclusive lock.
+     */
+    dbentry = pgstat_get_db_entry(databaseid, true, false, false);
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (dbentry == NULL)
         return;
 
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
+    /* found, remove it */
 
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
+    /* Remove table/function stats dshash first. */
+    if (dbentry->tables != DSM_HANDLE_INVALID)
+    {
+        dshash_table *tbl =
+        dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
-}
-#endif                            /* NOT_USED */
+        dshash_destroy(tbl);
+    }
 
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        dshash_table *tbl =
+        dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+
+        dshash_destroy(tbl);
+    }
+
+    dshash_delete_entry(pgStatDBHash, (void *) dbentry);
+}
 
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1306,20 +1150,30 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    PgStat_StatDBEntry *dbentry;
+
+    if (!pgStatDBHash)
+        return;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /*
+     * Lookup the database in the hash table.  Nothing to do if not there.
+     * This function works on the dbentry, so we cannot release it earlier.
+     */
+    dbentry = pgstat_get_db_entry(MyDatabaseId, false, false, false);
+    if (!dbentry)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    /* Reset database-level stats. */
+    reset_dbentry_counters(dbentry);
+
+    dshash_release_lock(pgStatDBHash, dbentry);
+
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1328,29 +1182,37 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
+    /* Reset the archiver statistics for the cluster. */
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    /* Reset the bgwriter statistics for the cluster. */
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1359,17 +1221,39 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
+    dshash_table *t;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    dbentry = pgstat_get_db_entry(MyDatabaseId, false, false, false);
+    Assert(dbentry);
+
+    /* Set the reset timestamp for the whole database */
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&dbentry->lock);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /*
+     * The entry for MyDatabaseId cannot be removed and its hash handles do
+     * not change, so we can release the lock right now.
+     */
+    dshash_release_lock(pgStatDBHash, dbentry);
+
+    /* Remove object if it exists, ignore if not */
+    if (type == RESET_TABLE)
+    {
+        t = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_delete_key(t, (void *) &objoid);
+        dshash_detach(t);
+    }
 
-    pgstat_send(&msg, sizeof(msg));
+    if (type == RESET_FUNCTION && dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        t = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_delete_key(t, (void *) &objoid);
+        dshash_detach(t);
+    }
 }
 
 /* ----------
@@ -1383,48 +1267,87 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    ts = GetCurrentTimestamp();
+
+    /*
+     * Store the last autovacuum time in the database's hash table entry.
+     */
+    dbentry = pgstat_get_db_entry(dboid, false, false, true);
+    Assert(dbentry);
+
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->last_autovac_time = ts;
+    LWLockRelease(&dbentry->lock);
 
-    pgstat_send(&msg, sizeof(msg));
+    dshash_release_lock(pgStatDBHash, dbentry);
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    Oid            dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table_handle table_handle;
+    dshash_table *table;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hash table entry. The dshash table cannot
+     * be destroyed meanwhile, so release the dbentry right now.
+     */
+    dbentry = pgstat_get_db_entry(dboid, false, false, true);
+    Assert(dbentry);
+    table_handle = dbentry->tables;
+    dshash_release_lock(pgStatDBHash, dbentry);
+
+    ts = GetCurrentTimestamp();
+
+    table = dshash_attach(area, &dsh_tblparams, table_handle, 0);
+    tabentry = pgstat_get_tab_entry(table, tableoid, true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = ts;
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = ts;
+        tabentry->vacuum_count++;
+    }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1435,9 +1358,14 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    Oid            dboid;
+    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabentry;
+    dshash_table_handle table_handle;
+    dshash_table *table;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1445,10 +1373,10 @@ pgstat_report_analyze(Relation rel,
      * already inserted and/or deleted rows in the target table. ANALYZE will
      * have counted such rows as live or dead respectively. Because we will
      * report our counts of such rows at transaction end, we should subtract
-     * off these counts from what we send to the collector now, else they'll
-     * be double-counted after commit.  (This approach also ensures that the
-     * collector ends up with the right numbers if we abort instead of
-     * committing.)
+     * off these counts from what is already written to shared stats now, else
+     * they'll be double-counted after commit.  (This approach also ensures
+     * that the shared stats end up with the right numbers if we abort
+     * instead of committing.)
      */
     if (rel->pgstat_info != NULL)
     {
@@ -1466,84 +1394,125 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    dboid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Store the data in the table's hash table entry. The dshash table cannot
+     * be destroyed meanwhile, so release the dbentry right now.
+     */
+    dbentry = pgstat_get_db_entry(dboid, false, false, true);
+    Assert(dbentry);
+    table_handle = dbentry->tables;
+    dshash_release_lock(pgStatDBHash, dbentry);
+
+    table = dshash_attach(area, &dsh_tblparams, table_handle, 0);
+    tabentry = pgstat_get_tab_entry(table, RelationGetRelid(rel), true);
+
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-/* --------
- * pgstat_report_deadlock() -
- *
- *    Tell the collector about a deadlock detected.
- * --------
- */
-void
-pgstat_report_deadlock(void)
-{
-    PgStat_MsgDeadlock msg;
+    dbentry = pgstat_get_db_entry(MyDatabaseId, false, false, false);
+    Assert(dbentry);
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            dbentry->counts.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            dbentry->counts.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            dbentry->counts.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            dbentry->counts.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            dbentry->counts.n_conflict_startup_deadlock++;
+            break;
+    }
+    LWLockRelease(&dbentry->lock);
+    dshash_release_lock(pgStatDBHash, dbentry);
 }
 
 
-
 /* --------
- * pgstat_report_checksum_failures_in_db() -
+ * pgstat_report_deadlock() -
  *
- *    Tell the collector about one or more checksum failures.
+ *    Report a detected deadlock.
  * --------
  */
 void
-pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
+pgstat_report_deadlock(void)
 {
-    PgStat_MsgChecksumFailure msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
+    dbentry = pgstat_get_db_entry(MyDatabaseId, false, false, false);
+    Assert(dbentry);
 
-    pgstat_send(&msg, sizeof(msg));
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->counts.n_deadlocks++;
+    LWLockRelease(&dbentry->lock);
+
+    dshash_release_lock(pgStatDBHash, dbentry);
 }
 
+
 /* --------
  * pgstat_report_checksum_failure() -
  *
- *    Tell the collector about a checksum failure.
+ *    Report a checksum failure.
  * --------
  */
 void
@@ -1555,60 +1524,153 @@ pgstat_report_checksum_failure(void)
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
+    dbentry = pgstat_get_db_entry(MyDatabaseId, false, false, false);
+    Assert(dbentry);
+
+    if (filesize > 0)            /* Is there a case where filesize is really 0? */
+    {
+        LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+        dbentry->counts.n_temp_bytes += filesize;    /* needs overflow check */
+        dbentry->counts.n_temp_files++;
+        LWLockRelease(&dbentry->lock);
+    }
+
+    dshash_release_lock(pgStatDBHash, dbentry);
 }
 
 
-/* ----------
- * pgstat_ping() -
+/* --------
+ * pgstat_report_checksum_failures_in_db() -
  *
- *    Send some junk data to the collector to increase traffic.
- * ----------
+ *    Report one or more checksum failures.
+ * --------
  */
 void
-pgstat_ping(void)
+pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
 {
-    PgStat_MsgDummy msg;
+    PgStat_StatDBEntry *dbentry;
+    ChecksumFailureEnt *failent = NULL;
+
+    /* return if we are not active */
+    if (!area)
+        return;
+
+    /* add accumulated count to the parameter */
+    if (checksum_failures != NULL)
+    {
+        failent = hash_search(checksum_failures, &dboid, HASH_FIND, NULL);
+        if (failent)
+            failurecount += failent->count;
+    }
+
+    if (failurecount == 0)
+        return;
+
+    dbentry = pgstat_get_db_entry(dboid, false, true, false);
+
+    if (!dbentry)
+    {
+        /* failed to acquire shared entry, store the number locally */
+        if (!failent)
+        {
+            bool        found;
+
+            if (!checksum_failures)
+            {
+                HASHCTL        ctl;
+
+                ctl.keysize = sizeof(Oid);
+                ctl.entrysize = sizeof(ChecksumFailureEnt);
+                checksum_failures =
+                    hash_create("pgstat checksum failure count hash",
+                                32, &ctl, HASH_ELEM | HASH_BLOBS);
+            }
+
+            failent = hash_search(checksum_failures, &dboid, HASH_ENTER,
+                                  &found);
 
-    if (pgStatSock == PGINVALID_SOCKET)
+            if (!found)
+                nchecksum_failures++;
+        }
+
+        failent->count = failurecount;
         return;
+    }
+
+    /* We have a chance to flush immediately */
+    LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+    dbentry->counts.n_checksum_failures += failurecount;
+    LWLockRelease(&dbentry->lock);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dshash_release_lock(pgStatDBHash, dbentry);
+
+    if (checksum_failures)
+    {
+        /* Remove the entry and the hash if it gets empty */
+        hash_search(checksum_failures, &dboid, HASH_REMOVE, NULL);
+
+        if (failent != NULL && --nchecksum_failures < 1)
+        {
+            hash_destroy(checksum_failures);
+            checksum_failures = NULL;
+        }
+    }
 }
 
-/* ----------
- * pgstat_send_inquiry() -
+/*
+ * flush checksum failure counts for all databases
  *
- *    Notify collector that we need fresh data.
- * ----------
+ *  Entries whose database entry cannot be locked are left for a later flush.
  */
 static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
+pgstat_flush_checksum_failure(bool nowait)
 {
-    PgStat_MsgInquiry msg;
+    HASH_SEQ_STATUS stat;
+    ChecksumFailureEnt *ent;
+    PgStat_StatDBEntry *dbentry;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
+    if (checksum_failures == NULL)
+        return;
+
+    hash_seq_init(&stat, checksum_failures);
+    while ((ent = (ChecksumFailureEnt *) hash_seq_search(&stat)) != NULL)
+    {
+        dbentry = pgstat_get_db_entry(ent->dboid, false, nowait, true);
+        if (dbentry)
+        {
+            /* update database stats */
+            LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
+            dbentry->counts.n_checksum_failures += ent->count;
+            LWLockRelease(&dbentry->lock);
+
+            hash_search(checksum_failures, &ent->dboid, HASH_REMOVE, NULL);
+
+            dshash_release_lock(pgStatDBHash, dbentry);
+            nchecksum_failures--;
+        }
+    }
+
+    /* The hash is empty, destroy it. */
+    if (nchecksum_failures < 1)
+    {
+        hash_destroy(checksum_failures);
+        checksum_failures = NULL;
+    }
 
+    return;
+}
 
 /*
  * Initialize function call usage data.
@@ -1739,8 +1801,7 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
  *
  *    We assume that a relcache entry's pgstat_info field is zeroed by
  *    relcache.c when the relcache entry is made; thereafter it is long-lived
- *    data.  We can avoid repeated searches of the TabStatus arrays when the
- *    same relation is touched repeatedly within a transaction.
+ *    data.
  * ----------
  */
 void
@@ -1760,7 +1821,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1779,85 +1841,45 @@ pgstat_initstats(Relation rel)
     rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
 }
 
+
 /*
  * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
  */
 static PgStat_TableStatus *
 get_tabstat_entry(Oid rel_id, bool isshared)
 {
-    TabStatHashEntry *hash_entry;
     PgStat_TableStatus *entry;
-    TabStatusArray *tsa;
     bool        found;
 
     /*
      * Create hash table if we don't have it already.
      */
-    if (pgStatTabHash == NULL)
+    if (pgStatTables == NULL)
     {
         HASHCTL        ctl;
 
-        memset(&ctl, 0, sizeof(ctl));
+        MemSet(&ctl, 0, sizeof(ctl));
         ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
+        ctl.entrysize = sizeof(PgStat_TableStatus);
 
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
+        pgStatTables = hash_create("Table stat entries",
+                                   PGSTAT_TABLE_HASH_SIZE,
+                                   &ctl,
+                                   HASH_ELEM | HASH_BLOBS);
     }
 
     /*
      * Find an entry or create a new one.
      */
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
+    entry = hash_search(pgStatTables, &rel_id, HASH_ENTER, &found);
     if (!found)
     {
-        /* initialize new entry with null pointer */
-        hash_entry->tsa_entry = NULL;
-    }
-
-    /*
-     * If entry is already valid, we're done.
-     */
-    if (hash_entry->tsa_entry)
-        return hash_entry->tsa_entry;
-
-    /*
-     * Locate the first pgStatTabList entry with free space, making a new list
-     * entry if needed.  Note that we could get an OOM failure here, but if so
-     * we have left the hashtable and the list in a consistent state.
-     */
-    if (pgStatTabList == NULL)
-    {
-        /* Set up first pgStatTabList entry */
-        pgStatTabList = (TabStatusArray *)
-            MemoryContextAllocZero(TopMemoryContext,
-                                   sizeof(TabStatusArray));
-    }
-
-    tsa = pgStatTabList;
-    while (tsa->tsa_used >= TABSTAT_QUANTUM)
-    {
-        if (tsa->tsa_next == NULL)
-            tsa->tsa_next = (TabStatusArray *)
-                MemoryContextAllocZero(TopMemoryContext,
-                                       sizeof(TabStatusArray));
-        tsa = tsa->tsa_next;
+        entry->t_shared = isshared;
+        entry->trans = NULL;
+        MemSet(&entry->t_counts, 0, sizeof(PgStat_TableCounts));
     }
 
-    /*
-     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
-     * the entry was already zeroed, either at creation or after last use.
-     */
-    entry = &tsa->tsa_entries[tsa->tsa_used++];
-    entry->t_id = rel_id;
-    entry->t_shared = isshared;
-
-    /*
-     * Now we can fill the entry in pgStatTabHash.
-     */
-    hash_entry->tsa_entry = entry;
+    have_table_stats = true;
 
     return entry;
 }
@@ -1866,26 +1888,16 @@ get_tabstat_entry(Oid rel_id, bool isshared)
  * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
  *
  * If no entry, return NULL, don't create a new one
- *
- * Note: if we got an error in the most recent execution of pgstat_report_stat,
- * it's possible that an entry exists but there's no hashtable entry for it.
- * That's okay, we'll treat this case as "doesn't exist".
  */
 PgStat_TableStatus *
 find_tabstat_entry(Oid rel_id)
 {
-    TabStatHashEntry *hash_entry;
-
-    /* If hashtable doesn't exist, there are no entries at all */
-    if (!pgStatTabHash)
-        return NULL;
-
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
-    if (!hash_entry)
+    /* If hash table doesn't exist, there are no entries at all */
+    if (!pgStatTables)
         return NULL;
 
-    /* Note that this step could also return NULL, but that's correct */
-    return hash_entry->tsa_entry;
+    return (PgStat_TableStatus *) hash_search(pgStatTables,
+                                              &rel_id, HASH_FIND, NULL);
 }
 
 /*
@@ -2315,9 +2327,9 @@ AtPrepare_PgStat(void)
  *        Clean up after successful PREPARE.
  *
  * All we need do here is unlink the transaction stats state from the
- * nontransactional state.  The nontransactional action counts will be
- * reported to the stats collector immediately, while the effects on live
- * and dead tuple counts are preserved in the 2PC state file.
+ * nontransactional state.  The nontransactional action counts will be reported
+ * immediately, while the effects on live and dead tuple counts are preserved
+ * in the 2PC state file.
  *
  * Note: AtEOXact_PgStat is not called during PREPARE.
  */
@@ -2415,91 +2427,248 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 }
 
 
-/* ----------
- * pgstat_fetch_stat_dbentry() -
+/*
+ * snapshot_statentry() - Common routine for functions
+ *                             pgstat_fetch_stat_*entry()
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
- * ----------
+ *  Returns the pointer to a snapshot of a shared entry for the key or NULL if
+ *  not found. Returned snapshots are stable during the current transaction or
+ *  until pgstat_clear_snapshot() is called.
+ *
+ *  Created snapshots are stored in HTAB *snapshot_hash. If not created yet, it
+ *  is created using snapshot_hash_name, snapshot_hash_entsize.
+ *
+ *  *table_hash points to dshash_table. If not yet attached, it is attached
+ *  using table_hash_params and table_hash_handle.
  */
-PgStat_StatDBEntry *
-pgstat_fetch_stat_dbentry(Oid dbid)
+static void *
+snapshot_statentry(const Oid key, const char *snapshot_hash_name,
+                   const int snapshot_hash_entsize,
+                   const dshash_table_handle table_hash_handle,
+                   const dshash_parameters *table_hash_params,
+                   HTAB **snapshot_hash, dshash_table **table_hash)
 {
+    PgStat_snapshot *lentry = NULL;
+    size_t        table_hash_keysize = table_hash_params->key_size;
+    size_t        table_hash_entrysize = table_hash_params->entry_size;
+    bool        found;
+
+    /*
+     * Avoid updating the stats snapshot too often: keep it for at least
+     * PGSTAT_STAT_MIN_INTERVAL ms. An early cue is ignored, not postponed.
+     */
+    if (clear_snapshot)
+    {
+        clear_snapshot = false;
+
+        if (pgStatSnapshotContext &&
+            snapshot_globalStats.stats_timestamp <
+            GetCurrentStatementStartTimestamp() -
+            PGSTAT_STAT_MIN_INTERVAL * 1000)
+        {
+            MemoryContextReset(pgStatSnapshotContext);
+
+            /* Reset variables */
+            global_snapshot_is_valid = false;
+            pgStatSnapshotContext = NULL;
+            pgStatLocalHash = NULL;
+
+            pgstat_setup_memcxt();
+            *snapshot_hash = NULL;
+        }
+    }
+
     /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
+     * Create new hash, with rather arbitrary initial number of entries since
+     * we don't know how this hash will grow.
      */
-    backend_read_statsfile();
+    if (!*snapshot_hash)
+    {
+        HASHCTL        ctl;
+
+        /*
+         * Create the hash in the stats context
+         *
+         * The entry is preceded by a common header part represented by
+         * PgStat_snapshot.
+         */
+
+        ctl.keysize = table_hash_keysize;
+        ctl.entrysize =
+            offsetof(PgStat_snapshot, body) + snapshot_hash_entsize;
+        ctl.hcxt = pgStatSnapshotContext;
+        *snapshot_hash = hash_create(snapshot_hash_name, 32, &ctl,
+                                     HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    }
+
+    lentry = hash_search(*snapshot_hash, &key, HASH_ENTER, &found);
 
     /*
-     * Lookup the requested database; return NULL if not found
+     * Refer to the shared hash if not found in the local hash. We return
+     * up-to-date entries outside a transaction, so do the same even if the
+     * snapshot is found.
      */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    if (!found || !IsTransactionState())
+    {
+        void       *sentry;
+
+        /* attach shared hash if not given, leave it alone for later use */
+        if (!*table_hash)
+        {
+            MemoryContext oldcxt;
+
+            Assert(table_hash_handle != DSM_HANDLE_INVALID);
+            oldcxt = MemoryContextSwitchTo(pgStatSnapshotContext);
+            *table_hash =
+                dshash_attach(area, table_hash_params, table_hash_handle, NULL);
+            MemoryContextSwitchTo(oldcxt);
+        }
+
+        sentry = dshash_find(*table_hash, &key, false);
+
+        if (sentry)
+        {
+            /*
+             * In transaction state, it is obvious that we should create local
+             * cache entries for consistency. If we are not, we return an
+             * up-to-date entry. Having said that, we need a local copy since
+             * dshash entry must be released immediately. We share the same
+             * local hash entry for the purpose.
+             */
+            memcpy(&lentry->body, sentry, table_hash_entrysize);
+            dshash_release_lock(*table_hash, sentry);
+
+            /* then zero out the local additional space if any */
+            if (table_hash_entrysize < snapshot_hash_entsize)
+                MemSet((char *) &lentry->body + table_hash_entrysize, 0,
+                       snapshot_hash_entsize - table_hash_entrysize);
+        }
+
+        lentry->negative = !sentry;
+    }
+
+    if (lentry->negative)
+        return NULL;
+
+    return &lentry->body;
 }
 
 
+/* ----------
+ * pgstat_fetch_stat_dbentry() -
+ *
+ *    Find database stats entry on backends. The returned entries are cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
+ */
+PgStat_StatDBEntry *
+pgstat_fetch_stat_dbentry(Oid dbid)
+{
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* If not done for this transaction, take a snapshot of global stats */
+    pgstat_snapshot_global_stats();
+
+    /* the caller has no business with snapshot-local members */
+    return (PgStat_StatDBEntry *)
+        snapshot_statentry(dbid,
+                           "local database stats",    /* snapshot hash name */
+                           sizeof(PgStat_StatDBEntry),    /* snapshot ent size */
+                           DSM_HANDLE_INVALID,    /* dshash handle  */
+                           &dsh_dbparams,    /* dshash params */
+                           &pgStatLocalHash,    /* snapshot hash */
+                           &pgStatDBHash);    /* shared hash */
+}
+
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one table or NULL. NULL doesn't mean
+ *    the activity statistics for one table or NULL. NULL doesn't mean
  *    that the table doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    activity statistics facilities, so the caller is better off to
+ *    report ZERO instead.
  * ----------
  */
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* Lookup our database, then look in its table hash table. */
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+    if (dbentry == NULL)
+        return NULL;
 
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(dbentry, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    dbentry = pgstat_fetch_stat_dbentry(InvalidOid);
+    if (dbentry == NULL)
+        return NULL;
+
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(dbentry, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     return NULL;
 }
 
 
+/* ----------
+ * pgstat_fetch_stat_tabentry_snapshot() -
+ *
+ *    Find table stats entry on backends in dbent. The returned entry is cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_snapshot(PgStat_StatDBEntry *dbent, Oid reloid)
+{
+    return (PgStat_StatTabEntry *)
+        snapshot_statentry(reloid,
+                           "table stats snapshot",    /* snapshot hash name */
+                           sizeof(PgStat_StatTabEntry), /* snapshot ent size */
+                           dbent->tables,    /* dshash handle  */
+                           &dsh_tblparams,    /* dshash params */
+                           &dbent->snapshot_tables, /* snapshot hash */
+                           &dbent->dshash_tables);    /* shared hash */
+}
+
+
+/* ----------
+ * pgstat_copy_index_counters() -
+ *
+ *    Support function for index swapping. Copies the index counters into
+ *    the caller-supplied destination.
+ * ----------
+ */
+void
+pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst)
+{
+    PgStat_StatTabEntry *tabentry;
+
+    /* No point fetching tabentry when dst is NULL */
+    if (!dst)
+        return;
+
+    tabentry = pgstat_fetch_stat_tabentry(relid);
+
+    if (!tabentry)
+        return;
+
+    dst->t_counts.t_numscans = tabentry->numscans;
+    dst->t_counts.t_tuples_returned = tabentry->tuples_returned;
+    dst->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
+    dst->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
+    dst->t_counts.t_blocks_hit = tabentry->blocks_hit;
+}
+
+
 /* ----------
  * pgstat_fetch_stat_funcentry() -
  *
@@ -2513,49 +2682,103 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     PgStat_StatDBEntry *dbentry;
     PgStat_StatFuncEntry *funcentry = NULL;
 
-    /* load the stats file if needed */
-    backend_read_statsfile();
-
-    /* Lookup our database, then find the requested function.  */
+    /* Lookup our database, then find the requested function */
     dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
+    if (dbentry == NULL)
+        return NULL;
+
+    funcentry = pgstat_fetch_stat_funcentry_snapshot(dbentry, func_id);
 
     return funcentry;
 }
 
-
 /* ----------
- * pgstat_fetch_stat_beentry() -
- *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    our local copy of the current-activity entry for one backend.
+ * pgstat_fetch_stat_funcentry_snapshot() -
  *
- *    NB: caller is responsible for a check if the user is permitted to see
- *    this info (especially the querystring).
- * ----------
+ *    Find function stats entry on backends in dbent. The returned entry is
+ *    cached until transaction end or pgstat_clear_snapshot() is called.
  */
-PgBackendStatus *
-pgstat_fetch_stat_beentry(int beid)
+static PgStat_StatFuncEntry *
+pgstat_fetch_stat_funcentry_snapshot(PgStat_StatDBEntry *dbent, Oid funcid)
 {
-    pgstat_read_current_status();
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
 
-    if (beid < 1 || beid > localNumBackends)
+    if (dbent->functions == DSM_HANDLE_INVALID)
         return NULL;
 
-    return &localBackendStatusTable[beid - 1].backendStatus;
+    return (PgStat_StatFuncEntry *)
+        snapshot_statentry(funcid,
+                           "function stats snapshot",    /* snapshot hash name */
+                           sizeof(PgStat_StatFuncEntry),    /* snapshot ent size */
+                           dbent->functions,    /* dshash handle  */
+                           &dsh_funcparams, /* dshash params */
+                           &dbent->snapshot_functions,    /* snapshot hash */
+                           &dbent->dshash_functions);    /* shared hash */
 }
 
-
-/* ----------
- * pgstat_fetch_stat_local_beentry() -
- *
- *    Like pgstat_fetch_stat_beentry() but with locally computed additions (like
- *    xid and xmin values of the backend)
+/*
+ * pgstat_snapshot_global_stats() -
+ *
+ * Makes a snapshot of global stats if not done yet.  They will be kept until
+ * subsequent call of pgstat_clear_snapshot() or the end of the current
+ * memory context (typically TopTransactionContext).
+ */
+static void
+pgstat_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext;
+
+    pgstat_attach_shared_stats();
+
+    /* Nothing to do if already done */
+    if (global_snapshot_is_valid)
+        return;
+
+    oldcontext = MemoryContextSwitchTo(pgStatSnapshotContext);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+
+    memcpy(&snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+    LWLockRelease(StatsLock);
+
+    global_snapshot_is_valid = true;
+
+    MemoryContextSwitchTo(oldcontext);
+
+    return;
+}
+
+/* ----------
+ * pgstat_fetch_stat_beentry() -
+ *
+ *    Support function for the SQL-callable pgstat* functions. Returns
+ *    our local copy of the current-activity entry for one backend.
+ *
+ *    NB: caller is responsible for a check if the user is permitted to see
+ *    this info (especially the querystring).
+ * ----------
+ */
+PgBackendStatus *
+pgstat_fetch_stat_beentry(int beid)
+{
+    pgstat_read_current_status();
+
+    if (beid < 1 || beid > localNumBackends)
+        return NULL;
+
+    return &localBackendStatusTable[beid - 1].backendStatus;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_local_beentry() -
+ *
+ *    Like pgstat_fetch_stat_beentry() but with locally computed additions (like
+ *    xid and xmin values of the backend)
  *
  *    NB: caller is responsible for a check if the user is permitted to see
  *    this info (especially the querystring).
@@ -2599,9 +2822,10 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &archiverStats;
+    return &snapshot_archiverStats;
 }
 
 
@@ -2616,9 +2840,10 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &globalStats;
+    return &snapshot_globalStats;
 }
 
 
@@ -2832,8 +3057,8 @@ pgstat_initialize(void)
         MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
     }
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* needs to be called before dsm shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -3009,12 +3234,16 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+
+    /* attach shared database stats area */
+    pgstat_attach_shared_stats();
 }
 
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
- * Flush any remaining statistics counts out to the collector.
+ * Flush any remaining statistics counts out to shared stats.
  * Without this, operations triggered during backend exit (such as
  * temp table deletions) won't be counted.
  *
@@ -3027,7 +3256,7 @@ pgstat_beshutdown_hook(int code, Datum arg)
 
     /*
      * If we got as far as discovering our own database ID, we can report what
-     * we did to the collector.  Otherwise, we'd be sending an invalid
+     * we did to the shared stats.  Otherwise, we'd be sending an invalid
      * database ID, so forget it.  (This means that accesses to pg_database
      * during failed backend starts might never get counted.)
      */
@@ -3044,6 +3273,8 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    pgstat_detach_shared_stats(true);
 }
 
 
@@ -3304,7 +3535,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3599,9 +3831,6 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
-            break;
         case WAIT_EVENT_RECOVERY_WAL_ALL:
             event_name = "RecoveryWalAll";
             break;
@@ -4221,94 +4450,71 @@ pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
 
 
 /* ----------
- * pgstat_setheader() -
- *
- *        Set common header fields in a statistics message
- * ----------
- */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
-    {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
+ * pgstat_report_archiver() -
  *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
+ *        Report archiver statistics
  * ----------
  */
 void
-pgstat_send_archiver(const char *xlog, bool failed)
+pgstat_report_archiver(const char *xlog, bool failed)
 {
-    PgStat_MsgArchiver msg;
+    TimestampTz now = GetCurrentTimestamp();
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+    if (failed)
+    {
+        /* Failed archival attempt */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->failed_count;
+        memcpy(shared_archiverStats->last_failed_wal, xlog,
+               sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    else
+    {
+        /* Successful archival operation */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->archived_count;
+        memcpy(shared_archiverStats->last_archived_wal, xlog,
+               sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
 }
 
 /* ----------
- * pgstat_send_bgwriter() -
+ * pgstat_report_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
-pgstat_send_bgwriter(void)
+pgstat_report_bgwriter(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+
+    PgStat_BgWriter *s = &BgWriterStats;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid taking the lock for a completely empty message.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += s->timed_checkpoints;
+    shared_globalStats->requested_checkpoints += s->requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += s->checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += s->checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += s->buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += s->buf_written_clean;
+    shared_globalStats->maxwritten_clean += s->maxwritten_clean;
+    shared_globalStats->buf_written_backend += s->buf_written_backend;
+    shared_globalStats->buf_fsync_backend += s->buf_fsync_backend;
+    shared_globalStats->buf_alloc += s->buf_alloc;
+    LWLockRelease(StatsLock);
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4317,422 +4523,162 @@ pgstat_send_bgwriter(void)
 }
 
 
-/* ----------
- * PgstatCollectorMain() -
- *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
+static void
+init_dbentry(PgStat_StatDBEntry *dbentry)
 {
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, SignalHandlerForConfigReload);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    MyBackendType = B_STATS_COLLECTOR;
-    init_ps_display(NULL);
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do check ConfigReloadPending
-     * inside the inner loop, which means that such interrupts will get
-     * serviced but the latch won't get cleared until next time there is a
-     * break in the action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (ShutdownRequestPending)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * ShutdownRequestPending becomes set.
-         */
-        while (!ShutdownRequestPending)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (ConfigReloadPending)
-            {
-                ConfigReloadPending = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
+    dshash_table *tabhash;
 
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
+    LWLockInitialize(&dbentry->lock, LWTRANCHE_STATS);
 
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr & WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
+    dbentry->last_autovac_time = 0;
+    dbentry->last_checksum_failure = 0;
+    dbentry->stat_reset_timestamp = 0;
+    dbentry->stats_timestamp = 0;
+    /* initialize the new shared entry */
+    MemSet(&dbentry->counts, 0, sizeof(PgStat_StatDBCounts));
 
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
+    dbentry->functions = DSM_HANDLE_INVALID;
 
-    exit(0);
+    /* dbentry always has the table hash */
+    tabhash = dshash_create(area, &dsh_tblparams, 0);
+    dbentry->tables = dshash_get_hash_table_handle(tabhash);
+    dshash_detach(tabhash);
 }
 
+
 /*
- * Subroutine to clear stats in a database entry
- *
- * Tables and functions hashes are initialized to empty.
+ * Create the filename for a DB stat file; filename is an output parameter
+ * that points to a character buffer of length len.
  */
 static void
-reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
+get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
 {
-    HASHCTL        hash_ctl;
-
-    dbentry->n_xact_commit = 0;
-    dbentry->n_xact_rollback = 0;
-    dbentry->n_blocks_fetched = 0;
-    dbentry->n_blocks_hit = 0;
-    dbentry->n_tuples_returned = 0;
-    dbentry->n_tuples_fetched = 0;
-    dbentry->n_tuples_inserted = 0;
-    dbentry->n_tuples_updated = 0;
-    dbentry->n_tuples_deleted = 0;
-    dbentry->last_autovac_time = 0;
-    dbentry->n_conflict_tablespace = 0;
-    dbentry->n_conflict_lock = 0;
-    dbentry->n_conflict_snapshot = 0;
-    dbentry->n_conflict_bufferpin = 0;
-    dbentry->n_conflict_startup_deadlock = 0;
-    dbentry->n_temp_files = 0;
-    dbentry->n_temp_bytes = 0;
-    dbentry->n_deadlocks = 0;
-    dbentry->n_checksum_failures = 0;
-    dbentry->last_checksum_failure = 0;
-    dbentry->n_block_read_time = 0;
-    dbentry->n_block_write_time = 0;
-
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
-
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
+    int            printed;
 
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/db_%u.%s",
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
+                       databaseid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
 }
 
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
+
+static void
+reset_tabcount(PgStat_StatTabEntry *ent)
 {
-    PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
+    ent->numscans = 0;
+    ent->tuples_returned = 0;
+    ent->tuples_fetched = 0;
+    ent->tuples_inserted = 0;
+    ent->tuples_updated = 0;
+    ent->tuples_deleted = 0;
+    ent->tuples_hot_updated = 0;
+    ent->n_live_tuples = 0;
+    ent->n_dead_tuples = 0;
+    ent->changes_since_analyze = 0;
+    ent->blocks_fetched = 0;
+    ent->blocks_hit = 0;
+    ent->vacuum_count = 0;
+    ent->autovac_vacuum_count = 0;
+    ent->analyze_count = 0;
+    ent->autovac_analyze_count = 0;
+
+    ent->vacuum_timestamp = 0;
+    ent->autovac_vacuum_timestamp = 0;
+    ent->analyze_timestamp = 0;
+    ent->autovac_analyze_timestamp = 0;
+}
 
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
 
-    if (!create && !found)
-        return NULL;
+static void
+reset_dbcount(PgStat_StatDBEntry *ent)
+{
+    TimestampTz ts = GetCurrentTimestamp();
+
+    LWLockAcquire(&ent->lock, LW_EXCLUSIVE);
+
+    ent->counts.n_tuples_returned = 0;
+    ent->counts.n_tuples_fetched = 0;
+    ent->counts.n_tuples_inserted = 0;
+    ent->counts.n_tuples_updated = 0;
+    ent->counts.n_tuples_deleted = 0;
+    ent->counts.n_blocks_fetched = 0;
+    ent->counts.n_blocks_hit = 0;
+    ent->counts.n_xact_commit = 0;
+    ent->counts.n_xact_rollback = 0;
+    ent->counts.n_block_read_time = 0;
+    ent->counts.n_block_write_time = 0;
+    ent->stat_reset_timestamp = ts;
+
+    LWLockRelease(&ent->lock);
+}
 
-    /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
-     */
-    if (!found)
-        reset_dbentry_counters(result);
 
-    return result;
+static void
+reset_funccount(PgStat_StatFuncEntry *ent)
+{
+    ent->f_numcalls = 0;
+    ent->f_total_time = 0;
+    ent->f_self_time = 0;
 }
 
 
 /*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+ * Subroutine to clear stats in a database entry
+ *
+ * Reset all counters in the dbentry.
  */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+static void
+reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
 {
-    PgStat_StatTabEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /* If not found, initialize the new one. */
-    if (!found)
+    dshash_table *tbl;
+    dshash_seq_status dshstat;
+    PgStat_StatTabEntry *tabent;
+    PgStat_StatFuncEntry *funcent;
+
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&dshstat, tbl, true);
+    while ((tabent = dshash_seq_next(&dshstat)) != NULL)
+        reset_tabcount(tabent);
+    dshash_seq_term(&dshstat);
+    dshash_detach(tbl);
+
+    if (dbentry->functions != DSM_HANDLE_INVALID)
     {
-        result->numscans = 0;
-        result->tuples_returned = 0;
-        result->tuples_fetched = 0;
-        result->tuples_inserted = 0;
-        result->tuples_updated = 0;
-        result->tuples_deleted = 0;
-        result->tuples_hot_updated = 0;
-        result->n_live_tuples = 0;
-        result->n_dead_tuples = 0;
-        result->changes_since_analyze = 0;
-        result->blocks_fetched = 0;
-        result->blocks_hit = 0;
-        result->vacuum_timestamp = 0;
-        result->vacuum_count = 0;
-        result->autovac_vacuum_timestamp = 0;
-        result->autovac_vacuum_count = 0;
-        result->analyze_timestamp = 0;
-        result->analyze_count = 0;
-        result->autovac_analyze_timestamp = 0;
-        result->autovac_analyze_count = 0;
+        tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_seq_init(&dshstat, tbl, true);
+        while ((funcent = dshash_seq_next(&dshstat)) != NULL)
+            reset_funccount(funcent);
+        dshash_seq_term(&dshstat);
+        dshash_detach(tbl);
     }
 
-    return result;
+    reset_dbcount(dbentry);
 }
 
 
 /* ----------
  * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ *        Write the global statistics file, as well as DB files.
  * ----------
  */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+void
+pgstat_write_statsfiles(void)
 {
-    HASH_SEQ_STATUS hstat;
+    dshash_seq_status hstat;
     PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
 
+    /* Stats are not initialized yet, so there is nothing to write. */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
+
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
     /*
@@ -4751,7 +4697,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
 
     /*
      * Write the file header --- currently just a format ID.
@@ -4763,32 +4709,29 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Write global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write archiver stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Walk through the database table.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_init(&hstat, pgStatDBHash, false);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&hstat)) != NULL)
     {
         /*
          * Write out the table and function stats for this DB into the
          * appropriate per-DB stat file, if required.
          */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
+        pgstat_write_pgStatDBHashfile(dbentry);
 
         /*
          * Write out the DB entry. We don't write the tables or functions
@@ -4798,6 +4741,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
         rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_seq_term(&hstat);
 
     /*
      * No more output to be done. Close the temp file and replace the old
@@ -4831,53 +4775,19 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
 }
 
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
 
 /* ----------
- * pgstat_write_db_statsfile() -
+ * pgstat_write_pgStatDBHashfile() -
  *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
  * ----------
  */
 static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
+pgstat_write_pgStatDBHashfile(PgStat_StatDBEntry *dbentry)
 {
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
+    dshash_seq_status tstat;
+    dshash_seq_status fstat;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatFuncEntry *funcentry;
     FILE       *fpout;
@@ -4886,9 +4796,10 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     int            rc;
     char        tmpfile[MAXPGPATH];
     char        statfile[MAXPGPATH];
+    dshash_table *tbl;
 
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -4915,23 +4826,34 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     /*
      * Walk through the database's access stats per table.
      */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+    Assert(dbentry->tables != DSM_HANDLE_INVALID);
+
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&tstat, tbl, false);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&tstat)) != NULL)
     {
         fputc('T', fpout);
         rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_seq_term(&tstat);
+    dshash_detach(tbl);
 
     /*
      * Walk through the database's function stats table.
      */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+    if (dbentry->functions != DSM_HANDLE_INVALID)
     {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
+        tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_seq_init(&fstat, tbl, false);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&fstat)) != NULL)
+        {
+            fputc('F', fpout);
+            rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+            (void) rc;            /* we'll check for error with ferror */
+        }
+        dshash_seq_term(&fstat);
+        dshash_detach(tbl);
     }
 
     /*
@@ -4966,94 +4888,56 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
 }
 
+
 /* ----------
  * pgstat_read_statsfiles() -
  *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ *    Reads existing activity statistics files into the shared stats hash.
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+void
+pgstat_read_statsfiles(void)
 {
     PgStat_StatDBEntry *dbentry;
     PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
 
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    /* shouldn't be called from postmaster */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp =
+        shared_globalStats->stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the activity statistics facility is not
+     * running or has not yet written the stats file.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -5062,7 +4946,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
@@ -5070,38 +4954,30 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     /*
      * Read global stats struct
      */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+        sizeof(*shared_globalStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
         goto done;
     }
 
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
     /*
      * Read archiver stats struct
      */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+        sizeof(*shared_archiverStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
         goto done;
     }
 
     /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
     for (;;)
     {
@@ -5115,7 +4991,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
                           fpin) != offsetof(PgStat_StatDBEntry, tables))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5124,76 +5000,36 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 /*
                  * Add to the DB hash
                  */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
+                dbentry = (PgStat_StatDBEntry *)
+                    dshash_find_or_insert(pgStatDBHash, (void *) &dbbuf.databaseid,
+                                          &found);
+
+                /* don't allow duplicate dbentries */
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    dshash_release_lock(pgStatDBHash, dbentry);
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
+                init_dbentry(dbentry);
+                memcpy(dbentry, &dbbuf,
+                       offsetof(PgStat_StatDBEntry, tables));
 
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
+                Assert(dbentry->tables != DSM_HANDLE_INVALID);
 
+                /* Read the data from the database-specific file. */
+                pgstat_read_pgStatDBHashfile(dbentry);
+                dshash_release_lock(pgStatDBHash, dbentry);
                 break;
 
             case 'E':
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5203,59 +5039,49 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 done:
     FreeFile(fpin);
 
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 
-    return dbhash;
+    return;
 }
 
 
 /* ----------
- * pgstat_read_db_statsfile() -
- *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
+ * pgstat_read_pgStatDBHashfile() -
  *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
+ *    Reads in the at-rest statistics file and creates the shared statistics
+ *    tables. The file is removed after reading.
  * ----------
  */
 static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
+pgstat_read_pgStatDBHashfile(PgStat_StatDBEntry *dbentry)
 {
     PgStat_StatTabEntry *tabentry;
     PgStat_StatTabEntry tabbuf;
     PgStat_StatFuncEntry funcbuf;
     PgStat_StatFuncEntry *funcentry;
+    dshash_table *tabhash = NULL;
+    dshash_table *funchash = NULL;
     FILE       *fpin;
     int32        format_id;
     bool        found;
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
+    get_dbstat_filename(false, dbentry->databaseid, statfile, MAXPGPATH);
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the activity statistics facility is not
+     * running or has not yet written the stats file.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
@@ -5268,14 +5094,17 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
 
+    /* The table stats hash must already have been created */
+    Assert(dbentry->tables != DSM_HANDLE_INVALID);
+
     /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
+     * We found an existing statistics file. Read it and put all the hash
+     * table entries into place.
      */
     for (;;)
     {
@@ -5288,31 +5117,32 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
                           fpin) != sizeof(PgStat_StatTabEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
+                if (!tabhash)
+                    tabhash = dshash_attach(area, &dsh_tblparams,
+                                            dbentry->tables, 0);
 
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
+                tabentry = (PgStat_StatTabEntry *)
+                    dshash_find_or_insert(tabhash,
+                                          (void *) &tabbuf.tableid, &found);
 
+                /* don't allow duplicate entries */
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    dshash_release_lock(tabhash, tabentry);
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
                 memcpy(tabentry, &tabbuf, sizeof(tabbuf));
+                dshash_release_lock(tabhash, tabentry);
                 break;
 
                 /*
@@ -5322,31 +5152,34 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
                           fpin) != sizeof(PgStat_StatFuncEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                /*
-                 * Skip if function data not wanted.
-                 */
                 if (funchash == NULL)
-                    break;
+                {
+                    funchash = dshash_create(area, &dsh_funcparams, 0);
+                    dbentry->functions =
+                        dshash_get_hash_table_handle(funchash);
+                }
 
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
+                funcentry = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert(funchash,
+                                          (void *) &funcbuf.functionid, &found);
 
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    dshash_release_lock(funchash, funcentry);
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
                 memcpy(funcentry, &funcbuf, sizeof(funcbuf));
+                dshash_release_lock(funchash, funcentry);
                 break;
 
                 /*
@@ -5356,7 +5189,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5364,292 +5197,38 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     }
 
 done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
+    if (tabhash)
+        dshash_detach(tabhash);
+    if (funchash)
+        dshash_detach(funchash);
 
-done:
     FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
 
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 }
 
 
 /* ----------
  * pgstat_setup_memcxt() -
  *
- *    Create pgStatLocalContext, if not already done.
+ *    Create pgStatLocalContext and pgStatSnapshotContext, if not already done.
  * ----------
  */
 static void
 pgstat_setup_memcxt(void)
 {
     if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+
+    if (!pgStatSnapshotContext)
+        pgStatSnapshotContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Database statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
 }
 
 
@@ -5668,739 +5247,185 @@ pgstat_clear_snapshot(void)
 {
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
+    {
         MemoryContextDelete(pgStatLocalContext);
 
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
+    }
 
+    if (pgStatSnapshotContext)
+        clear_snapshot = true;
+}
 
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
+static bool
+pgstat_update_tabentry(dshash_table *tabhash, PgStat_TableStatus *stat,
+                       bool nowait)
 {
-    PgStat_StatDBEntry *dbentry;
+    PgStat_StatTabEntry *tabent;
+    bool        found;
 
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
+    if (tabhash == NULL)
+        return false;
 
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
+    tabent = (PgStat_StatTabEntry *)
+        dshash_find_extended(tabhash, (void *) &(stat->t_id),
+                             true, nowait, true, &found);
 
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
+    /* failed to acquire lock */
+    if (tabent == NULL)
+        return false;
+
+    if (!found)
     {
         /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
+         * If it's a new table entry, initialize counters to the values we
+         * just got.
          */
+        tabent->numscans = stat->t_counts.t_numscans;
+        tabent->tuples_returned = stat->t_counts.t_tuples_returned;
+        tabent->tuples_fetched = stat->t_counts.t_tuples_fetched;
+        tabent->tuples_inserted = stat->t_counts.t_tuples_inserted;
+        tabent->tuples_updated = stat->t_counts.t_tuples_updated;
+        tabent->tuples_deleted = stat->t_counts.t_tuples_deleted;
+        tabent->tuples_hot_updated = stat->t_counts.t_tuples_hot_updated;
+        tabent->n_live_tuples = stat->t_counts.t_delta_live_tuples;
+        tabent->n_dead_tuples = stat->t_counts.t_delta_dead_tuples;
+        tabent->changes_since_analyze = stat->t_counts.t_changed_tuples;
+        tabent->blocks_fetched = stat->t_counts.t_blocks_fetched;
+        tabent->blocks_hit = stat->t_counts.t_blocks_hit;
+
+        tabent->vacuum_timestamp = 0;
+        tabent->vacuum_count = 0;
+        tabent->autovac_vacuum_timestamp = 0;
+        tabent->autovac_vacuum_count = 0;
+        tabent->analyze_timestamp = 0;
+        tabent->analyze_count = 0;
+        tabent->autovac_analyze_timestamp = 0;
+        tabent->autovac_analyze_count = 0;
     }
-    else if (msg->clock_time < dbentry->stats_timestamp)
+    else
     {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
+        /*
+         * Otherwise add the values to the existing entry.
+         */
+        tabent->numscans += stat->t_counts.t_numscans;
+        tabent->tuples_returned += stat->t_counts.t_tuples_returned;
+        tabent->tuples_fetched += stat->t_counts.t_tuples_fetched;
+        tabent->tuples_inserted += stat->t_counts.t_tuples_inserted;
+        tabent->tuples_updated += stat->t_counts.t_tuples_updated;
+        tabent->tuples_deleted += stat->t_counts.t_tuples_deleted;
+        tabent->tuples_hot_updated += stat->t_counts.t_tuples_hot_updated;
+        /* If table was truncated, first reset the live/dead counters */
+        if (stat->t_counts.t_truncated)
         {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
+            tabent->n_live_tuples = 0;
+            tabent->n_dead_tuples = 0;
         }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
+        tabent->n_live_tuples += stat->t_counts.t_delta_live_tuples;
+        tabent->n_dead_tuples += stat->t_counts.t_delta_dead_tuples;
+        tabent->changes_since_analyze += stat->t_counts.t_changed_tuples;
+        tabent->blocks_fetched += stat->t_counts.t_blocks_fetched;
+        tabent->blocks_hit += stat->t_counts.t_blocks_hit;
     }
 
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    tabent->n_live_tuples = Max(tabent->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    tabent->n_dead_tuples = Max(tabent->n_dead_tuples, 0);
+
+    dshash_release_lock(tabhash, tabent);
+
+    return true;
 }
 
 
-/* ----------
- * pgstat_recv_tabstat() -
+/*
+ * Look up the entry for the specified database in the shared stats hash
+ * table.  Returns NULL if nowait is set and the lock cannot be acquired.
  *
- *    Count what the backend has done.
- * ----------
  */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
+static PgStat_StatDBEntry *
+pgstat_get_db_entry(Oid databaseid, bool exclusive, bool nowait, bool create)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
+    PgStat_StatDBEntry *result;
+    bool        found = true;
 
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
+    if (!IsUnderPostmaster || !pgStatDBHash)
+        return NULL;
 
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
+    /* Lookup or create the hash table entry for this database */
+    result = (PgStat_StatDBEntry *)
+        dshash_find_extended(pgStatDBHash, &databaseid,
+                             exclusive, nowait, create, &found);
 
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
+    if (result == NULL)
+        return NULL;
 
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
+    if (create && !found)
+    {
+        Assert(create);
 
         /*
-         * Add per-table stats to the per-database entry, too.
+         * Initialize the new entry.  This creates empty hash tables,
+         * too.
          */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
+        init_dbentry(result);
     }
-}
 
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
+    return result;
 }
 
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
+/*
+ * Look up the hash table entry for the specified table.  The returned entry
+ * is exclusively locked.
+ * If no hash table entry exists and create is true, creates one.
+ * Otherwise, returns NULL.
  */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
+static PgStat_StatTabEntry *
+pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create)
 {
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
+    PgStat_StatTabEntry *result;
     bool        found;
 
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
+    /* Lookup or create the hash table entry for this table */
+    if (create)
+        result = (PgStat_StatTabEntry *)
+            dshash_find_or_insert(table, &tableoid, &found);
+    else
+        result = (PgStat_StatTabEntry *) dshash_find(table, &tableoid, false);
 
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
+    if (result == NULL)
+        return NULL;
 
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
+    /* If not found, initialize the new one. */
+    if (create && !found)
     {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
+        result->numscans = 0;
+        result->tuples_returned = 0;
+        result->tuples_fetched = 0;
+        result->tuples_inserted = 0;
+        result->tuples_updated = 0;
+        result->tuples_deleted = 0;
+        result->tuples_hot_updated = 0;
+        result->n_live_tuples = 0;
+        result->n_dead_tuples = 0;
+        result->changes_since_analyze = 0;
+        result->blocks_fetched = 0;
+        result->blocks_hit = 0;
+        result->vacuum_timestamp = 0;
+        result->vacuum_count = 0;
+        result->autovac_vacuum_timestamp = 0;
+        result->autovac_vacuum_count = 0;
+        result->analyze_timestamp = 0;
+        result->analyze_count = 0;
+        result->autovac_analyze_timestamp = 0;
+        result->autovac_analyze_count = 0;
     }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
 
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
+    return result;
 }
 
 /*
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index fab4a9dd51..d418fe3bd0 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -255,7 +255,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -503,7 +502,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1326,12 +1324,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1780,11 +1772,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2694,8 +2681,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -3058,8 +3043,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3126,13 +3109,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3205,22 +3181,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3681,22 +3641,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3892,8 +3836,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3928,8 +3870,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4130,8 +4071,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5109,18 +5048,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5239,12 +5166,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6139,7 +6060,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6195,8 +6115,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6431,7 +6349,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 5880054245..04445c4c76 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2000,7 +2000,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                BgWriterStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2110,7 +2110,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2300,7 +2300,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2308,7 +2308,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
 
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..58a442f482 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -147,6 +147,7 @@ CreateSharedMemoryAndSemaphores(void)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -263,6 +264,7 @@ CreateSharedMemoryAndSemaphores(void)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 4c14e51c67..f61fd3e8ad 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -523,6 +523,7 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
     LWLockRegisterTranche(LWTRANCHE_SXACT, "serializable_xact");
+    LWLockRegisterTranche(LWTRANCHE_STATS, "activity_statistics");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 00c77b66c7..e2998f965e 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3189,6 +3189,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3763,6 +3769,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4201,6 +4208,8 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long stats_timeout;
+
                 /* Send out notify signals and transmit self-notifies */
                 ProcessCompletedNotifies();
 
@@ -4213,8 +4222,13 @@ PostgresMain(int argc, char *argv[],
                 if (notifyInterruptPending)
                     ProcessNotifyInterrupt();
 
-                pgstat_report_stat(false);
-
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle");
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4249,7 +4263,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4257,6 +4271,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index cea01534a5..a1304dc3ce 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -33,9 +33,6 @@
 
 #define UINT32_ACCESS_ONCE(var)         ((uint32)(*((volatile uint32 *)&(var))))
 
-/* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
-
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
 {
@@ -1244,7 +1241,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_commit);
+        result = (int64) (dbentry->counts.n_xact_commit);
 
     PG_RETURN_INT64(result);
 }
@@ -1260,7 +1257,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_rollback);
+        result = (int64) (dbentry->counts.n_xact_rollback);
 
     PG_RETURN_INT64(result);
 }
@@ -1276,7 +1273,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_fetched);
+        result = (int64) (dbentry->counts.n_blocks_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1292,7 +1289,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_hit);
+        result = (int64) (dbentry->counts.n_blocks_hit);
 
     PG_RETURN_INT64(result);
 }
@@ -1308,7 +1305,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_returned);
+        result = (int64) (dbentry->counts.n_tuples_returned);
 
     PG_RETURN_INT64(result);
 }
@@ -1324,7 +1321,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_fetched);
+        result = (int64) (dbentry->counts.n_tuples_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1340,7 +1337,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_inserted);
+        result = (int64) (dbentry->counts.n_tuples_inserted);
 
     PG_RETURN_INT64(result);
 }
@@ -1356,7 +1353,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_updated);
+        result = (int64) (dbentry->counts.n_tuples_updated);
 
     PG_RETURN_INT64(result);
 }
@@ -1372,7 +1369,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_deleted);
+        result = (int64) (dbentry->counts.n_tuples_deleted);
 
     PG_RETURN_INT64(result);
 }
@@ -1405,7 +1402,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_files;
+        result = dbentry->counts.n_temp_files;
 
     PG_RETURN_INT64(result);
 }
@@ -1421,7 +1418,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_bytes;
+        result = dbentry->counts.n_temp_bytes;
 
     PG_RETURN_INT64(result);
 }
@@ -1436,7 +1433,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace);
+        result = (int64) (dbentry->counts.n_conflict_tablespace);
 
     PG_RETURN_INT64(result);
 }
@@ -1451,7 +1448,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_lock);
+        result = (int64) (dbentry->counts.n_conflict_lock);
 
     PG_RETURN_INT64(result);
 }
@@ -1466,7 +1463,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_snapshot);
+        result = (int64) (dbentry->counts.n_conflict_snapshot);
 
     PG_RETURN_INT64(result);
 }
@@ -1481,7 +1478,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_bufferpin);
+        result = (int64) (dbentry->counts.n_conflict_bufferpin);
 
     PG_RETURN_INT64(result);
 }
@@ -1496,7 +1493,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1511,11 +1508,11 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace +
-                          dbentry->n_conflict_lock +
-                          dbentry->n_conflict_snapshot +
-                          dbentry->n_conflict_bufferpin +
-                          dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_tablespace +
+                          dbentry->counts.n_conflict_lock +
+                          dbentry->counts.n_conflict_snapshot +
+                          dbentry->counts.n_conflict_bufferpin +
+                          dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1530,7 +1527,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_deadlocks);
+        result = (int64) (dbentry->counts.n_deadlocks);
 
     PG_RETURN_INT64(result);
 }
@@ -1548,7 +1545,7 @@ pg_stat_get_db_checksum_failures(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_checksum_failures);
+        result = (int64) (dbentry->counts.n_checksum_failures);
 
     PG_RETURN_INT64(result);
 }
@@ -1585,7 +1582,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_read_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_read_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1601,7 +1598,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_write_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_write_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb19644419..51748c99ad 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index f4247ea70d..f65d05c24c 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -631,6 +632,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1241,6 +1244,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 3c70499feb..927ae319b1 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -6,7 +6,7 @@ use File::Basename qw(basename dirname);
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 107;
+use Test::More tests => 106;
 
 program_help_ok('pg_basebackup');
 program_version_ok('pg_basebackup');
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 619b2f9c71..9f1de1e42f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -83,6 +83,8 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1a19921f80..4e137140bd 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL activity statistics facility.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -15,9 +15,11 @@
 #include "libpq/pqcomm.h"
 #include "miscadmin.h"
 #include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -41,33 +43,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -78,9 +53,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -116,13 +90,6 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_blocks_hit;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -181,280 +148,32 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
-/* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
- */
-typedef struct PgStat_MsgHdr
-{
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
 /* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgBgWriter
+typedef struct PgStat_BgWriter
 {
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter buf_alloc;
+    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+} PgStat_BgWriter;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -486,96 +205,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Activity statistics data structures, on file and in shared memory, follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -584,13 +215,9 @@ typedef union PgStat_Msg
 
 #define PGSTAT_FILE_FORMAT_ID    0x01A5BC9D
 
-/* ----------
- * PgStat_StatDBEntry            The collector's data per database
- * ----------
- */
-typedef struct PgStat_StatDBEntry
+
+typedef struct PgStat_StatDBCounts
 {
-    Oid            databaseid;
     PgStat_Counter n_xact_commit;
     PgStat_Counter n_xact_rollback;
     PgStat_Counter n_blocks_fetched;
@@ -600,7 +227,6 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_tuples_inserted;
     PgStat_Counter n_tuples_updated;
     PgStat_Counter n_tuples_deleted;
-    TimestampTz last_autovac_time;
     PgStat_Counter n_conflict_tablespace;
     PgStat_Counter n_conflict_lock;
     PgStat_Counter n_conflict_snapshot;
@@ -610,29 +236,52 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_temp_bytes;
     PgStat_Counter n_deadlocks;
     PgStat_Counter n_checksum_failures;
-    TimestampTz last_checksum_failure;
     PgStat_Counter n_block_read_time;    /* times in microseconds */
     PgStat_Counter n_block_write_time;
+} PgStat_StatDBCounts;
 
+/* ----------
+ * PgStat_StatDBEntry            The statistics per database
+ * ----------
+ */
+typedef struct PgStat_StatDBEntry
+{
+    Oid            databaseid;
+    TimestampTz last_autovac_time;
+    TimestampTz last_checksum_failure;
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;    /* time of db stats update */
+
+    PgStat_StatDBCounts counts;
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following members must be last in the struct, because we don't
+     * write them out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables; /* current gen tables hash */
+    dshash_table_handle functions;    /* current gen functions hash */
+    LWLock        lock;            /* Lock for the above members */
+
+    /* non-shared members */
+    HTAB       *snapshot_tables;    /* table entry snapshot */
+    HTAB       *snapshot_functions; /* function entry snapshot */
+    dshash_table *dshash_tables;    /* attached tables dshash */
+    dshash_table *dshash_functions; /* attached functions dshash */
 } PgStat_StatDBEntry;
 
+#define SHARED_DBENT_SIZE offsetof(PgStat_StatDBEntry, snapshot_tables)
 
 /* ----------
- * PgStat_StatTabEntry            The collector's data per table (or index)
+ * PgStat_StatTabEntry            The statistics per table (or index)
  * ----------
  */
 typedef struct PgStat_StatTabEntry
 {
     Oid            tableid;
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
 
     PgStat_Counter numscans;
 
@@ -651,19 +300,15 @@ typedef struct PgStat_StatTabEntry
     PgStat_Counter blocks_fetched;
     PgStat_Counter blocks_hit;
 
-    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
     PgStat_Counter vacuum_count;
-    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_vacuum_count;
-    TimestampTz analyze_timestamp;    /* user initiated */
     PgStat_Counter analyze_count;
-    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_analyze_count;
 } PgStat_StatTabEntry;
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            The statistics per function
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
@@ -678,7 +323,7 @@ typedef struct PgStat_StatFuncEntry
 
 
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -694,7 +339,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -760,7 +405,6 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
     WAIT_EVENT_RECOVERY_WAL_ALL,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
@@ -1001,7 +645,7 @@ typedef struct PgBackendGSSStatus
  *
  * Each live backend maintains a PgBackendStatus struct in shared memory
  * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
+ * BackendId, but that is not critical.)  Note that the stats-writing code
  * has no involvement in, or even access to, these structs.
  *
  * Each auxiliary process also maintains a PgBackendStatus struct in shared
@@ -1198,13 +842,15 @@ extern PGDLLIMPORT bool pgstat_track_counts;
 extern PGDLLIMPORT int pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used; to be removed together with the corresponding GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1219,29 +865,26 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
-extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
 
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void pgstat_reset_all(void);
 
+/* File input/output functions  */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 
 extern void pgstat_report_autovac(Oid dboid);
@@ -1402,8 +1045,8 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                                       void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
+extern void pgstat_report_archiver(const char *xlog, bool failed);
+extern void pgstat_report_bgwriter(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1412,11 +1055,14 @@ extern void pgstat_send_bgwriter(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_snapshot(PgStat_StatDBEntry *dbent, Oid relid);
+extern void pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 8fda8e4f78..13371e8cb7 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -220,6 +220,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_SXACT,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 83a15f6795..77d1572a99 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.18.2
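
Note on the header changes above: PgStat_StatDBEntry now keeps its shared
members first and its process-local pointers last, with SHARED_DBENT_SIZE
defined as the offset of the first non-shared member, and the counter structs
are still compared against zeroes with memcmp. The following is a minimal,
self-contained sketch of both idioms, not code from the patch. All names in it
(DemoDBEntry, DemoCounts, DEMO_SHARED_SIZE, demo_copy_shared,
demo_counts_are_zero) are hypothetical, and how the patch itself uses
SHARED_DBENT_SIZE is an assumption here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical miniature of a per-database entry: members meant to live in
 * shared memory come first, process-local pointers come last.
 */
typedef struct DemoDBEntry
{
    unsigned int databaseid;        /* shared */
    long long   n_xact_commit;      /* shared */
    void       *snapshot_tables;    /* process-local; first non-shared member */
    void       *snapshot_functions; /* process-local */
} DemoDBEntry;

/* Size of the shared prefix, in the spirit of SHARED_DBENT_SIZE */
#define DEMO_SHARED_SIZE offsetof(DemoDBEntry, snapshot_tables)

/* Copy only the shared prefix; dst's local pointers are left untouched. */
static void
demo_copy_shared(DemoDBEntry *dst, const DemoDBEntry *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, DEMO_SHARED_SIZE);
}

/* A struct that holds nothing but event counters */
typedef struct DemoCounts
{
    long long   tuples_inserted;
    long long   tuples_deleted;
} DemoCounts;

/* Returns 1 if there is nothing to write, by comparing against zeroes. */
static int
demo_counts_are_zero(const DemoCounts *counts)
{
    static const DemoCounts all_zeroes; /* static storage is zero-initialized */

    return memcmp(counts, &all_zeroes, sizeof(DemoCounts)) == 0;
}
```

The memcmp test is only reliable while the struct contains nothing but
counters, which is why the header comments insist on that; a timestamp or
pointer member would make an otherwise idle entry look non-zero.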

From bfc8b896ed12d29b8185a7053b3ed586b23e2487 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:09 +0900
Subject: [PATCH v25 7/8] Doc part of shared-memory based stats collector.

---
 doc/src/sgml/catalogs.sgml          |   6 +-
 doc/src/sgml/config.sgml            |  29 +++---
 doc/src/sgml/high-availability.sgml |  13 +--
 doc/src/sgml/monitoring.sgml        | 133 +++++++++++++---------------
 doc/src/sgml/ref/pg_dump.sgml       |   9 +-
 5 files changed, 91 insertions(+), 99 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index c6f95fa688..12c8d19ccb 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -8135,9 +8135,9 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l
   <para>
    <xref linkend="view-table"/> lists the system views described here.
    More detailed documentation of each view follows below.
-   There are some additional views that provide access to the results of
-   the statistics collector; they are described in <xref
-   linkend="monitoring-stats-views-table"/>.
+   There are some additional views that provide access to the activity
+   statistics; they are described in
+   <xref linkend="monitoring-stats-views-table"/>.
   </para>
 
   <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3cac340f32..8cd86beb9d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6944,11 +6944,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
     <title>Run-time Statistics</title>
 
     <sect2 id="runtime-config-statistics-collector">
-     <title>Query and Index Statistics Collector</title>
+     <title>Query and Index Activity Statistics</title>
 
      <para>
-      These parameters control server-wide statistics collection features.
-      When statistics collection is enabled, the data that is produced can be
+      These parameters control server-wide activity statistics features.
+      When activity statistics are enabled, the data that is produced can be
       accessed via the <structname>pg_stat</structname> and
       <structname>pg_statio</structname> family of system views.
       Refer to <xref linkend="monitoring"/> for more information.
@@ -6964,14 +6964,13 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables the collection of information on the currently
-        executing command of each session, along with the time when
-        that command began execution. This parameter is on by
-        default. Note that even when enabled, this information is not
-        visible to all users, only to superusers and the user owning
-        the session being reported on, so it should not represent a
-        security risk.
-        Only superusers can change this setting.
+        Enables tracking of the currently executing command of
+        each session, along with the time when that command began
+        execution. This parameter is on by default. Note that even when
+        enabled, this information is not visible to all users, only to
+        superusers and the user owning the session being reported on, so it
+        should not represent a security risk.  Only superusers can change this
+        setting.
        </para>
       </listitem>
      </varlistentry>
@@ -7002,9 +7001,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables collection of statistics on database activity.
+        Enables tracking of database activity.
         This parameter is on by default, because the autovacuum
-        daemon needs the collected information.
+        daemon needs the activity information.
         Only superusers can change this setting.
        </para>
       </listitem>
@@ -8022,7 +8021,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Specifies the fraction of the total number of heap tuples counted in
-        the previous statistics collection that can be inserted without
+        the previous activity statistics that can be inserted without
         incurring an index scan at the <command>VACUUM</command> cleanup stage.
         This setting currently applies to B-tree indexes only.
        </para>
@@ -8035,7 +8034,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         Index statistics are considered to be stale if the number of newly
         inserted tuples exceeds the <varname>vacuum_cleanup_index_scale_factor</varname>
         fraction of the total number of heap tuples detected by the previous
-        statistics collection. The total number of heap tuples is stored in
+        activity statistics. The total number of heap tuples is stored in
         the index meta-page. Note that the meta-page does not include this data
         until <command>VACUUM</command> finds no dead tuples, so B-tree index
         scan at the cleanup stage can only be skipped if the second and
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index bc4d98fe03..d56afa17db 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2357,12 +2357,13 @@ LOG:  database system is ready to accept read only connections
    </para>
 
    <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
-    index usage, etc., will be recorded normally on the standby. Replayed
-    actions will not duplicate their effects on primary, so replaying an
-    insert will not increment the Inserts column of pg_stat_user_tables.
-    The stats file is deleted at the start of recovery, so stats from primary
-    and standby will differ; this is considered a feature, not a bug.
+    Activity statistics are collected during recovery. All scans, reads,
+    blocks, index usage, etc., will be recorded normally on the
+    standby. Replayed actions will not duplicate their effects on primary, so
+    replaying an insert will not increment the Inserts column of
+    pg_stat_user_tables.  Activity statistics are reset at the start of
+    recovery, so stats from primary and standby will differ; this is
+    considered a feature, not a bug.
    </para>
 
    <para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 987580d6df..9605e0ebd4 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -22,7 +22,7 @@
   <para>
    Several tools are available for monitoring database activity and
    analyzing performance.  Most of this chapter is devoted to describing
-   <productname>PostgreSQL</productname>'s statistics collector,
+   <productname>PostgreSQL</productname>'s activity statistics,
    but one should not neglect regular Unix monitoring programs such as
    <command>ps</command>, <command>top</command>, <command>iostat</command>, and <command>vmstat</command>.
    Also, once one has identified a
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    master server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   master process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   master process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
@@ -130,20 +128,21 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
  </sect1>
 
  <sect1 id="monitoring-stats">
-  <title>The Statistics Collector</title>
+  <title>The Activity Statistics</title>
 
   <indexterm zone="monitoring-stats">
    <primary>statistics</primary>
   </indexterm>
 
   <para>
-   <productname>PostgreSQL</productname>'s <firstterm>statistics collector</firstterm>
-   is a subsystem that supports collection and reporting of information about
-   server activity.  Presently, the collector can count accesses to tables
-   and indexes in both disk-block and individual-row terms.  It also tracks
-   the total number of rows in each table, and information about vacuum and
-   analyze actions for each table.  It can also count calls to user-defined
-   functions and the total time spent in each one.
+   <productname>PostgreSQL</productname>'s <firstterm>activity
+   statistics</firstterm> facility is a subsystem that supports tracking and
+   reporting of information about server activity.  Presently, it counts
+   accesses to tables and indexes in both disk-block and individual-row
+   terms.  It also tracks the total number of rows in each table, and
+   information about vacuum and analyze actions for each table.  It can also
+   track calls to user-defined functions and the total time spent in each
+   one.
   </para>
 
   <para>
@@ -151,15 +150,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    information about exactly what is going on in the system right now, such as
    the exact command currently being executed by other server processes, and
    which other connections exist in the system.  This facility is independent
-   of the collector process.
+   of the activity statistics.
   </para>
 
  <sect2 id="monitoring-stats-setup">
-  <title>Statistics Collection Configuration</title>
+  <title>Activity Statistics Configuration</title>
 
   <para>
-   Since collection of statistics adds some overhead to query execution,
-   the system can be configured to collect or not collect information.
+   Since tracking activity statistics adds some overhead to query
+   execution, the system can be configured to track or not track activity.
    This is controlled by configuration parameters that are normally set in
    <filename>postgresql.conf</filename>.  (See <xref linkend="runtime-config"/> for
    details about setting configuration parameters.)
@@ -172,7 +171,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The parameter <xref linkend="guc-track-counts"/> controls whether
-   statistics are collected about table and index accesses.
+   table and index accesses are tracked.
   </para>
 
   <para>
@@ -196,18 +195,12 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
   </para>
 
   <para>
-   The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
-   When the server shuts down cleanly, a permanent copy of the statistics
-   data is stored in the <filename>pg_stat</filename> subdirectory, so that
-   statistics can be retained across server restarts.  When recovery is
-   performed at server start (e.g. after immediate shutdown, server crash,
-   and point-in-time recovery), all statistics counters are reset.
+   The activity statistics data is kept in shared memory. When the server shuts
+   down cleanly, a permanent copy of the statistics data is stored in
+   the <filename>pg_stat</filename> subdirectory, so that statistics can be
+   retained across server restarts.  When recovery is performed at server
+   start (e.g. after immediate shutdown, server crash, and point-in-time
+   recovery), all statistics counters are reset.
   </para>
 
  </sect2>
@@ -220,48 +213,46 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    linkend="monitoring-stats-dynamic-views-table"/>, are available to show
    the current state of the system. There are also several other
    views, listed in <xref
-   linkend="monitoring-stats-views-table"/>, available to show the results
-   of statistics collection.  Alternatively, one can
-   build custom views using the underlying statistics functions, as discussed
-   in <xref linkend="monitoring-stats-functions"/>.
+   linkend="monitoring-stats-views-table"/>, available to show the activity
+   statistics.  Alternatively, one can build custom views using the underlying
+   statistics functions, as discussed in
+   <xref linkend="monitoring-stats-functions"/>.
   </para>
 
   <para>
-   When using the statistics to monitor collected data, it is important
-   to realize that the information does not update instantaneously.
-   Each individual server process transmits new statistical counts to
-   the collector just before going idle; so a query or transaction still in
-   progress does not affect the displayed totals.  Also, the collector itself
-   emits a new report at most once per <varname>PGSTAT_STAT_INTERVAL</varname>
-   milliseconds (500 ms unless altered while building the server).  So the
-   displayed information lags behind actual activity.  However, current-query
-   information collected by <varname>track_activities</varname> is
-   always up-to-date.
+   When using the activity statistics, it is important to realize that the
+   information does not update instantaneously.  Each individual server process
+   writes out new statistical counts just before going idle, no more often than once
+   per <varname>PGSTAT_STAT_INTERVAL</varname> milliseconds (500 ms unless
+   altered while building the server); so a query or transaction still in
+   progress does not affect the displayed totals.  However, current-query
+   information tracked by <varname>track_activities</varname> is always
+   up-to-date.
   </para>
 
   <para>
    Another important point is that when a server process is asked to display
-   any of these statistics, it first fetches the most recent report emitted by
-   the collector process and then continues to use this snapshot for all
-   statistical views and functions until the end of its current transaction.
-   So the statistics will show static information as long as you continue the
-   current transaction.  Similarly, information about the current queries of
-   all sessions is collected when any such information is first requested
-   within a transaction, and the same information will be displayed throughout
-   the transaction.
-   This is a feature, not a bug, because it allows you to perform several
-   queries on the statistics and correlate the results without worrying that
-   the numbers are changing underneath you.  But if you want to see new
-   results with each query, be sure to do the queries outside any transaction
-   block.  Alternatively, you can invoke
+   any of these statistics, it first reads the current statistics and then
+   continues to use this snapshot for all statistical views and functions
+   until the end of its current transaction.  So the statistics will show
+   static information as long as you continue the current transaction.
+   Similarly, information about the current queries of all sessions is tracked
+   when any such information is first requested within a transaction, and the
+   same information will be displayed throughout the transaction.  This is a
+   feature, not a bug, because it allows you to perform several queries on the
+   statistics and correlate the results without worrying that the numbers are
+   changing underneath you.  But if you want to see new results with each
+   query, be sure to do the queries outside any transaction block.
+   Alternatively, you can invoke
    <function>pg_stat_clear_snapshot</function>(), which will discard the
    current transaction's statistics snapshot (if any).  The next use of
    statistical information will cause a new snapshot to be fetched.
   </para>
-
+  
   <para>
-   A transaction can also see its own statistics (as yet untransmitted to the
-   collector) in the views <structname>pg_stat_xact_all_tables</structname>,
+   A transaction can also see its own statistics (as yet unwritten to the
+   server-wide activity statistics) in the
+   views <structname>pg_stat_xact_all_tables</structname>,
    <structname>pg_stat_xact_sys_tables</structname>,
    <structname>pg_stat_xact_user_tables</structname>, and
    <structname>pg_stat_xact_user_functions</structname>.  These numbers do not act as
@@ -596,7 +587,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    kernel's I/O cache, and might therefore still be fetched without
    requiring a physical read. Users interested in obtaining more
    detailed information on <productname>PostgreSQL</productname> I/O behavior are
-   advised to use the <productname>PostgreSQL</productname> statistics collector
+   advised to use the <productname>PostgreSQL</productname> activity statistics
    in combination with operating system utilities that allow insight
    into the kernel's handling of I/O.
   </para>
@@ -914,7 +905,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
       <tbody>
        <row>
-        <entry morerows="64"><literal>LWLock</literal></entry>
+        <entry morerows="65"><literal>LWLock</literal></entry>
         <entry><literal>ShmemIndexLock</literal></entry>
         <entry>Waiting to find or allocate space in shared memory.</entry>
        </row>
@@ -1197,6 +1188,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting to allocate or exchange a chunk of memory or update
          counters during Parallel Hash plan execution.</entry>
         </row>
+        <row>
+         <entry><literal>activity_statistics</literal></entry>
+         <entry>Waiting to write out activity statistics to shared memory
+         during transaction end or idle time.</entry>
+        </row>
         <row>
          <entry morerows="9"><literal>Lock</literal></entry>
          <entry><literal>relation</literal></entry>
@@ -1244,7 +1240,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting to acquire a pin on a buffer.</entry>
         </row>
         <row>
-         <entry morerows="13"><literal>Activity</literal></entry>
+         <entry morerows="12"><literal>Activity</literal></entry>
          <entry><literal>ArchiverMain</literal></entry>
          <entry>Waiting in main loop of the archiver process.</entry>
         </row>
@@ -1272,10 +1268,6 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry><literal>LogicalLauncherMain</literal></entry>
          <entry>Waiting in main loop of logical launcher process.</entry>
         </row>
-        <row>
-         <entry><literal>PgStatMain</literal></entry>
-         <entry>Waiting in main loop of the statistics collector process.</entry>
-        </row>
         <row>
          <entry><literal>RecoveryWalAll</literal></entry>
          <entry>Waiting for WAL from any kind of source (local, archive or stream) at recovery.</entry>
@@ -4156,9 +4148,10 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
      <entry><literal>performing final cleanup</literal></entry>
      <entry>
        <command>VACUUM</command> is performing final cleanup.  During this phase,
-       <command>VACUUM</command> will vacuum the free space map, update statistics
-       in <literal>pg_class</literal>, and report statistics to the statistics
-       collector.  When this phase is completed, <command>VACUUM</command> will end.
+       <command>VACUUM</command> will vacuum the free space map, update
+       statistics in <literal>pg_class</literal>, and update the system-wide
+       activity statistics.  When this phase is completed, <command>VACUUM</command>
+       will end.
      </entry>
     </row>
    </tbody>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 13bd320b31..52c61d222a 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1259,11 +1259,10 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The database activity of <application>pg_dump</application> is
-   normally collected by the statistics collector.  If this is
-   undesirable, you can set parameter <varname>track_counts</varname>
-   to false via <envar>PGOPTIONS</envar> or the <literal>ALTER
-   USER</literal> command.
+   The database activity of <application>pg_dump</application> is normally
+   collected.  If this is undesirable, you can set
+   parameter <varname>track_counts</varname> to false
+   via <envar>PGOPTIONS</envar> or the <literal>ALTER USER</literal> command.
   </para>
 
  </refsect1>
-- 
2.18.2

From bb7d2f7184169280fa45c3fa6e69776d37a6de4a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 17:00:44 +0900
Subject: [PATCH v25 8/8] Remove the GUC stats_temp_directory

The GUC used to specify the directory to store temporary statistics
files. It is no longer needed by the stats collector but still used by
the programs in bin and contrib, and maybe other extensions. Thus this
patch removes the GUC but some backing variables and macro definitions
are left alone for backward compatibility.
---
 doc/src/sgml/backup.sgml                      |  2 -
 doc/src/sgml/config.sgml                      | 19 ---------
 doc/src/sgml/storage.sgml                     |  3 +-
 src/backend/postmaster/pgstat.c               | 13 +++---
 src/backend/replication/basebackup.c          | 13 ++----
 src/backend/utils/misc/guc.c                  | 41 -------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/pgstat.h                          |  5 ++-
 src/test/perl/PostgresNode.pm                 |  4 --
 9 files changed, 13 insertions(+), 88 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index bdc9026c62..2885540362 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1146,8 +1146,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8cd86beb9d..7f6056b9e9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7056,25 +7056,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 1c19e863d2..2f04bb68bb 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -122,8 +122,7 @@ Item
 
 <row>
  <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
+ <entry>Subdirectory containing ephemeral files for extensions</entry>
 </row>
 
 <row>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 34a4005791..4cd8530e91 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -96,15 +96,12 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
+/*
+ * This used to be a GUC variable and is no longer used in this file, but left
+ * alone just for backward compatibility for extensions, having the default
+ * value.
  */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+char       *pgstat_stat_directory = PG_STAT_TMP_DIR;
 
 /* Shared stats bootstrap information, protected by StatsLock */
 typedef struct StatsShmemStruct
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 806d013108..c086ab781b 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -251,15 +251,12 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
     backup_total = 0;
     backup_streamed = 0;
     pgstat_progress_start_command(PROGRESS_COMMAND_BASEBACKUP, InvalidOid);
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -291,13 +288,9 @@ perform_base_backup(basebackup_options *opt)
          * Calculate the relative path of temporary statistics directory in
          * order to skip the files which are located in that directory later.
          */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
+
+        Assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
+        statrelpath = psprintf("./%s", PG_STAT_TMP_DIR);
 
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 4c6d648662..417fbbdc5d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -197,7 +197,6 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static void assign_effective_io_concurrency(int newval, void *extra);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -4193,17 +4192,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11489,35 +11477,6 @@ assign_effective_io_concurrency(int newval, void *extra)
 #endif                            /* USE_PREFETCH */
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index aa44f0c9bf..207e042e99 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -573,7 +573,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 4e137140bd..062f393941 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -32,7 +32,10 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
+/*
+ * This used to be the directory to store temporary statistics data in but is
+ * no longer used. Defined here for backward compatibility.
+ */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
 
 /* Values for track_functions GUC variable --- order is significant! */
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 9575268bd7..f3340f726c 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -455,10 +455,6 @@ sub init
     print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
       if defined $ENV{TEMP_CONFIG};
 
-    # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
-    # concurrently must not share a stats_temp_directory.
-    print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
     if ($params{allows_streaming})
     {
         if ($params{allows_streaming} eq "logical")
-- 
2.18.2


Re: shared-memory based stats collector

From
Andres Freund
Date:
Hi,

On 2020-03-19 20:30:04 +0900, Kyotaro Horiguchi wrote:
> > I think we also can get rid of the dshash_delete changes, by instead
> > adding a dshash_delete_current(dshash_seq_stat *status, void *entry) API
> > or such.
>
> [009] (Fixed)
> I'm not sure about the point of having two interfaces that are hard to
> distinguish.  Maybe dshash_delete_current(dshash_seq_stat *status) is
> enough(). I also reverted the dshash_delete().

Well, dshash_delete() cannot generally safely be used together with
iteration. It has to be the current element etc. And I think the locking
changes make dshash less robust. By explicitly tying "delete the current
element" to the iterator, most of that can be avoided.
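
To illustrate what tying the delete to the iterator means, here is a minimal sketch on a plain singly linked list. This is not dshash code: SeqScan, seq_next() and seq_delete_current() are hypothetical names, and the real thing would also have to deal with partition locks. The only point is that the scan state, not the caller, knows which element may be removed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a hash bucket chain. */
typedef struct Node
{
	int			key;
	struct Node *next;
} Node;

/* Scan state: remembers the link that points at the current node. */
typedef struct SeqScan
{
	Node	  **head;			/* list being scanned */
	Node	  **link;			/* pointer that points at the current node */
	Node	   *cur;			/* node returned last, NULL if deleted/none */
} SeqScan;

static void
list_push(Node **head, int key)
{
	Node	   *n = malloc(sizeof(Node));

	if (n == NULL)
		abort();
	n->key = key;
	n->next = *head;
	*head = n;
}

static void
seq_init(SeqScan *s, Node **head)
{
	s->head = head;
	s->link = NULL;
	s->cur = NULL;
}

static Node *
seq_next(SeqScan *s)
{
	if (s->link == NULL)
		s->link = s->head;		/* first call */
	else if (s->cur != NULL)
		s->link = &s->cur->next;	/* step past the node returned last */
	/* else the current node was deleted; *link is already its successor */

	s->cur = *s->link;
	return s->cur;
}

/* Only the element the scan is positioned on can be deleted. */
static void
seq_delete_current(SeqScan *s)
{
	Node	   *victim = s->cur;

	assert(victim != NULL);
	*s->link = victim->next;
	free(victim);
	s->cur = NULL;
}

/* Example caller: drop every even key while scanning, return the count. */
static int
delete_even_keys(Node **head)
{
	SeqScan		scan;
	Node	   *n;
	int			ndeleted = 0;

	seq_init(&scan, head);
	while ((n = seq_next(&scan)) != NULL)
	{
		if (n->key % 2 == 0)
		{
			seq_delete_current(&scan);
			ndeleted++;
		}
	}
	return ndeleted;
}
```

Because the caller never passes an entry pointer to the delete call, it cannot hand in an element the scan is not positioned on.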



> > >  /* SIGUSR1 signal handler for archiver process */
> >
> > Hm - this currently doesn't set up a correct sigusr1 handler for a
> > shared memory backend - needs to invoke procsignal_sigusr1_handler
> > somewhere.
> >
> > We can probably just convert to using normal latches here, and remove
> > the current 'wakened' logic? That'll remove the indirection via
> > postmaster too, which is nice.
>
> [018] (Fixed, separate patch 0005)
> It seems better. I added it as a separate patch just after the patch
> that turns archiver an auxiliary process.

I don't think it's correct to do it separately, but I can just merge
that on commit.


> > > @@ -4273,6 +4276,9 @@ pgstat_get_backend_desc(BackendType backendType)
> > >
> > >      switch (backendType)
> > >      {
> > > +        case B_ARCHIVER:
> > > +            backendDesc = "archiver";
> > > +            break;
> >
> > should imo include 'WAL' or such.
>
> [019] (Not Fixed)
> It is already named "archiver" by 8e8a0becb3. Do I rename it in this
> patch set?

Oh. No, don't rename it as part of this. Could you reply to the thread
in which Peter made that change, and reference this complaint?


> [021] (Fixed, separate patch 0007)
> However the "statistics collector process" is gone, I'm not sure
> "statistics collector" feature also is gone. But actually the word
> "collector" looks a bit odd in some context. I replaced "the results
> of statistics collector" with "the activity statistics". (I'm not sure
> "the activity statistics" is proper as a subsystem name.) The word
> "collect" is replaced with "track".  I didn't change section IDs
> corresponding to the renaming so that old links can work. I also fixed
> the tranche name for LWTRANCHE_STATS from "activity stats" to
> "activity_statistics"

Without having gone through the changes, that sounds like the correct
direction to me. There's no "collector" anymore, so removing that seems
like the right thing.


> > > diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
> > > index ca5c6376e5..1ffe073a1f 100644
> > > --- a/src/backend/postmaster/pgstat.c
> > > +++ b/src/backend/postmaster/pgstat.c
> > > + *  Collects per-table and per-function usage statistics of all backends on
> > > + *  shared memory. pg_count_*() and friends are the interface to locally store
> > > + *  backend activities during a transaction. Then pgstat_flush_stat() is called
> > > + *  at the end of a transaction to pulish the local stats on shared memory.
> > >   *
> >
> > I'd rather not exhaustively list the different objects this handles -
> > it'll either be annoying to maintain, or just get out of date.
>
> [024] (Fixed, Maybe)
> Although not sure I get you correctly, I rewrote it as the follows.
>
>  *  Collects per-table and per-function usage statistics of all backends on
>  *  shared memory. The activity numbers are once stored locally, then written
>  *  to shared memory at commit time or by idle-timeout.

s/backends on/backends in/

I was thinking of something like:
 *  Collects activity statistics, e.g. per-table access statistics, of
 *  all backends in shared memory. The activity numbers are first stored
 *  locally in each process, then flushed to shared memory at commit
 *  time or by idle-timeout.



> > > - *            - Add some automatic call for pgstat vacuuming.
> > > + *  To avoid congestion on the shared memory, we update shared stats no more
> > > + *  often than intervals of PGSTAT_STAT_MIN_INTERVAL(500ms). In the case where
> > > + *  all the local numbers cannot be flushed immediately, we postpone updates
> > > + *  and try the next chance after the interval of
> > > + *  PGSTAT_STAT_RETRY_INTERVAL(100ms), but we don't wait for no longer than
> > > + *  PGSTAT_STAT_MAX_INTERVAL(1000ms).
> >
> > I'm not convinced by this backoff logic. The basic interval seems quite
> > high for something going through shared memory, and the max retry seems
> > pretty low.
>
> [025] (Not Fixed)
> Is it the matter of intervals? Is (MIN, RETRY, MAX) = (1000, 500,
> 10000) reasonable?

Partially. I think for access to shared resources we want *increasing*
wait times, rather than a shorter retry timeout. The goal should be to
make it more likely for all processes to be able to flush their
stats, which can be achieved by flushing less often after hitting
contention.
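
As a rough sketch of the increasing-wait idea (the constants and the function name below are made up for illustration, they are not the values or names from the patch): double the interval each time a flush hits contention, cap it, and fall back to the base interval once a flush succeeds.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical values, chosen only to make the example concrete. */
#define FLUSH_MIN_INTERVAL_MS	1000
#define FLUSH_MAX_INTERVAL_MS	60000

/*
 * Return the interval to wait before the next flush attempt.
 *
 * After contention the wait doubles, up to the cap, so a busy backend
 * tries less often.  After a successful flush it returns to the base
 * interval.
 */
static int
next_flush_interval(int cur_interval_ms, bool hit_contention)
{
	if (!hit_contention)
		return FLUSH_MIN_INTERVAL_MS;

	assert(cur_interval_ms > 0);

	/* Compare against half the cap so the doubling cannot overshoot it. */
	if (cur_interval_ms >= FLUSH_MAX_INTERVAL_MS / 2)
		return FLUSH_MAX_INTERVAL_MS;

	return cur_interval_ms * 2;
}
```

With these numbers a backend that keeps losing would wait 1s, 2s, 4s, ... up to 60s between attempts, rather than retrying at a shorter fixed interval.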

> > > +/*
> > > + * BgWriter global statistics counters. The name cntains a remnant from the
> > > + * time when the stats collector was a dedicate process, which used sockets to
> > > + * send it.
> > > + */
> > > +PgStat_MsgBgWriter BgWriterStats = {0};
> >
> > I am strongly against keeping the 'Msg' prefix. That seems extremely
> > confusing going forward.
>
> [029] (Fixed) (Related  to [046])
> Mmm. It's following your old suggestion to avoid unsubstantial
> diffs. I'm happy to change it. The functions that have "send" in their
> names are for the same reason. I removed the prefix "m_" of the
> members of the struct. (The comment above (with a typo) explains that).

I don't object to having the rename be a separate patch...


> > > +    if (StatsShmem->refcount > 0)
> > > +        StatsShmem->refcount++;
> >
> > What prevents us from leaking the refcount here? We could e.g. error out
> > while attaching, no? Which'd mean we'd leak the refcount.
>
> [033] (Fixed)
> We don't attach shared stats on postmaster process, so I want to know
> the first attacher process and the last detacher process of shared
> stats.  It's not leaks that I'm considering here.
> (continued below)
>
> > To me it looks like there's a lot of added complexity just because you
> > want to be able to reset stats via
> >
> > void
> > pgstat_reset_all(void)
> > {
> >
> >     /*
> >      * We could directly remove files and recreate the shared memory area. But
> >      * detach then attach for simplicity.
> >      */
> >     pgstat_detach_shared_stats(false);    /* Don't write */
> >     pgstat_attach_shared_stats();
> >
> > Without that you'd not need the complexity of attaching, detaching to
> > the same degree - every backend could just cache lookup data during
> > initialization, instead of having to constantly re-compute that.
>
> Mmm. I don't get that (or I failed to read clear meaning). The
> function is assumed be called only from StartupXLOG().
> (continued)

Oh? I didn't get that you're only using it for that purpose - there's
very little documentation about what it's trying to do.

I don't see why that means we don't need to accurately track the
refcount? Otherwise we'll forget to write out the stats.
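
To make the concern concrete, a small sketch with hypothetical names (SharedStats, stats_attach(), stats_detach() are not the patch's functions, and a boolean stands in for an error raised during attach): if the increment is not undone on the failure path, the count never returns to zero and the last detacher never writes the file.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the shared stats control struct. */
typedef struct SharedStats
{
	int			refcount;		/* number of attached processes */
	bool		written_out;	/* set when the last detacher writes stats */
} SharedStats;

/*
 * Attach to the shared stats.  setup_fails simulates an error that
 * happens after the refcount has already been bumped.
 */
static bool
stats_attach(SharedStats *st, bool setup_fails)
{
	st->refcount++;

	if (setup_fails)
	{
		/* Undo the increment, otherwise the count is leaked. */
		st->refcount--;
		return false;
	}
	return true;
}

/* Detach; the last process to leave writes the stats out. */
static void
stats_detach(SharedStats *st)
{
	assert(st->refcount > 0);

	if (--st->refcount == 0)
		st->written_out = true;
}
```

In the server the undo would have to live in whatever error-cleanup path runs when attaching fails; the sketch only shows why the count has to stay accurate.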


> > Nor would the dynamic re-creation of the db dshash table be needed.
>
> Maybe you are mentioning the complexity of reset_dbentry_counters? It
> is actually complex.  Shared stats dshash cannot be destroyed (or
> dshash entry cannot be removed) during someone is working on it. It
> was simpler to wait for another process to end its work but that could
> slow not only the clearing process but also other processes by
> frequent resetting of counters.

I was referring to the fact that the last version of the patch
attached/detached from hashtables regularly. pin_hashes, unpin_hashes,
attach_table_hash, attach_function_hash etc.


> After some thoughts, I decided to rip the all "generation" stuff off
> and it gets far simpler. But counter reset may conflict with other
> backends with a litter higher degree because counter reset needs
> exclusive lock.

That seems harmless to me - stats reset should never happen at a high
enough frequency to make the contention it causes problematic. There's also
an argument to be made that it makes sense for the reset to be atomic.


> > > +    /* Flush out table stats */
> > > +    if (pgStatTabList != NULL && !pgstat_flush_stat(&cxt, !force))
> > > +        pending_stats = true;
> > > +
> > > +    /* Flush out function stats */
> > > +    if (pgStatFunctions != NULL && !pgstat_flush_funcstats(&cxt, !force))
> > > +        pending_stats = true;
> >
> > This seems weird. pgstat_flush_stat(), pgstat_flush_funcstats() operate
> > on pgStatTabList/pgStatFunctions, but don't actually reset it? Besides
> > being confusing while reading the code, it also made the diff much
> > harder to read.
>
> [035] (Maybe Fixed)
> Is the question that, is there any case where
> pgstat_flush_stat/functions leaves some counters unflushed?

No, the point is that there's knowledge about
pgstat_flush_stat/pgstat_flush_funcstats outside of those functions,
namely the pgStatTabList, pgStatFunctions lists.


> > Why do we still have this? A hashtable lookup is cheap, compared to
> > fetching a file - so it's not to save time. Given how infrequent the
> > pgstat_fetch_* calls are, it's not to avoid contention either.
> >
> > At first one could think it's for consistency - but no, that's not it
> > either, because snapshot_statentry() refetches the snapshot without
> > control from the outside:
>
> [038]
> I don't get the second paragraph. When the function re*create*s a
> snapshot without control from the outside? It keeps snapshots during a
> transaction.  If not, it is broken.
> (continued)

Maybe I just misunderstood the code flow - partially due to the global
clear_snapshot variable. I just had read the
+     * We don't want so frequent update of stats snapshot. Keep it at least
+     * for PGSTAT_STAT_MIN_INTERVAL ms. Not postpone but just ignore the cue.

comment, and took it to mean that you're unconditionally updating the
snapshot every PGSTAT_STAT_MIN_INTERVAL. Which'd mean we don't actually
have consistent snapshot across all fetches.

(partially this might have been due to the diff:
     /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
+     * We don't want so frequent update of stats snapshot. Keep it at least
+     * for PGSTAT_STAT_MIN_INTERVAL ms. Not postpone but just ignore the cue.
      */
-    now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
+    if (clear_snapshot)
+    {
+        clear_snapshot = false;
+
+        if (pgStatSnapshotContext &&
)

But I think my question remains: Why do we need the whole snapshot thing
now? Previously we needed to avoid reading a potentially large file -
but that's not a concern anymore?


> > >   /*
> > >    * We don't want so frequent update of stats snapshot. Keep it at least
> > >    * for PGSTAT_STAT_MIN_INTERVAL ms. Not postpone but just ignore the cue.
> > >    */
> ...
> > I think we should just remove this entire local caching snapshot layer
> > for lookups.
>
> Currently the behavior is documented as the follows and it seems reasonable.
>
>    Another important point is that when a server process is asked to display
>    any of these statistics, it first fetches the most recent report emitted by
>    the collector process and then continues to use this snapshot for all
>    statistical views and functions until the end of its current transaction.
>    So the statistics will show static information as long as you continue the
>    current transaction.  Similarly, information about the current queries of
>    all sessions is collected when any such information is first requested
>    within a transaction, and the same information will be displayed throughout
>    the transaction.
>    This is a feature, not a bug, because it allows you to perform several
>    queries on the statistics and correlate the results without worrying that
>    the numbers are changing underneath you.  But if you want to see new
>    results with each query, be sure to do the queries outside any transaction
>    block.  Alternatively, you can invoke
>    <function>pg_stat_clear_snapshot</function>(), which will discard the
>    current transaction's statistics snapshot (if any).  The next use of
>    statistical information will cause a new snapshot to be fetched.

I am very unconvinced this is worth the cost. Especially because plenty
of other stats related parts of the system do *NOT* behave this way. How
is a user supposed to understand that pg_stat_database behaves one way,
pg_stat_activity, another, pg_stat_statements a third,
pg_stat_progress_* ...

Perhaps it's best to not touch the semantics here, but I'm also very
wary of introducing significant complications and overhead just to have
this "feature".


> > >       for (i = 0; i < tsa->tsa_used; i++)
> > >       {
> > >           PgStat_TableStatus *entry = &tsa->tsa_entries[i];
> > >
> <many TableStatsArray code>
> > >               hash_entry->tsa_entry = entry;
> > >               dest_elem++;
> > >           }
> >
> > This seems like too much code. Why is this entirely different from the
> > way funcstats works? The difference was already too big before, but this
> > made it *way* worse.
>
> [040]
> We don't flush stats until transaction ends. So the description about
> TabStatuArray is stale?

How is your comment related to my comment above?


> > > bool
> > > pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
> > >                    PgStat_TableStatus *entry)
> > > {
> > >   Oid     dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
> > >   int     table_mode = PGSTAT_EXCLUSIVE;
> > >   bool    updated = false;
> > >   dshash_table *tabhash;
> > >   PgStat_StatDBEntry *dbent;
> > >   int     generation;
> > >
> > >   if (nowait)
> > >       table_mode |= PGSTAT_NOWAIT;
> > >
> > >   /* Attach required table hash if not yet. */
> > >   if ((entry->t_shared ? cxt->shdb_tabhash : cxt->mydb_tabhash) == NULL)
> > >   {
> > >       /*
> > >        *  Return if we don't have corresponding dbentry. It would've been
> > >        *  removed.
> > >        */
> > >       dbent = pgstat_get_db_entry(dboid, table_mode, NULL);
> > >       if (!dbent)
> > >           return false;
> > >
> > >       /*
> > >        * We don't hold lock on the dbentry since it cannot be dropped while
> > >        * we are working on it.
> > >        */
> > >       generation = pin_hashes(dbent);
> > >       tabhash = attach_table_hash(dbent, generation);
> >
> > This again is just cost incurred by insisting on destroying hashtables
> > instead of keeping them around as long as necessary.
>
> [040]
> Maybe you are insisting the reverse? The pin_hash complexity is left
> in this version. -> [033]

What do you mean? What I'm saying is that we should never end up in a
situation where there's no pgstat entry for the current database. And
that that's trivial, as long as we don't drop the hashtable, but instead
reset counters to 0.


> > >   dbentry = pgstat_get_db_entry(MyDatabaseId,
> > >                                 PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
> > >                                 &status);
> > >
> > >   if (status == LOCK_FAILED)
> > >       return;
> > >
> > >   /* We had a chance to flush immediately */
> > >   pgstat_flush_recovery_conflict(dbentry);
> > >
> > >   dshash_release_lock(pgStatDBHash, dbentry);
> >
> > But I don't understand why? Nor why we'd not just report all pending
> > database wide changes in that case?
> >
> > The fact that you're locking the per-database entry unconditionally once
> > for each table almost guarantees contention - and you're not using the
> > 'conditional lock' approach for that. I don't understand.
>
> [043] (Maybe fixed) (Related to [045].)
> Vacuum, analyze, DROP DB and reset cannot be delayed. So the
> conditional lock is mainly used by
> pgstat_report_stat().

You're saying "cannot be delayed" - but you're not explaining *why* that
is.

Even if true, I don't see why that necessitates doing the flushing and
locking once for each of these functions?


> dshash_find_or_insert didn't allow shared lock. I changed
> dshash_find_extended to allow shared-lock even if it is told to create
> a missing entry. Alrhough it takes exclusive lock at the mement of
> entry creation, most of all cases it doesn't need exclusive lock. This
> allows use shared lock while processing vacuum or analyze stats.

Huh?


> Previously I thought that we can work on a shared database entry while
> lock is not held, but actually there are cases where insertion of a
> new database entry causes rehash (resize). The operation moves entries
> so we need at least shared lock on database entry while we are working
> on it.  So in the attched basically most operations are working by the
> following steps.
> - get shared database entry with shared lock
>   - attach table/function hash
>     - fetch an entry with exclusive lock
>       - update entry
>     - release the table/function entry
>   - detach table/function hash
>   if needed
>     - take LW_EXCLUSIVE on database entry
>       - update database numbers
>     - release LWLock
> - release shared database entry

Just to be crystal clear: I am exceedingly unlikely to commit this with
any sort of short term attach/detach operations. Both because the
runtime overhead/contention it causes is significant, and because of the
code complexity implied by it.


Leaving attach/detach aside: I think it's a complete no-go to acquire
database wide locks at this frequency, and then to hold them over other
operations that are a) not cheap b) can block. The contention due to
that would be *terrible* for scalability, even if it's just a shared
lock.

The way this *should* work is that:
1.1) At backend startup, attach to the database wide hashtable
1.2) At backend startup, attach to the various per-database hashtables
  (including ones for shared tables)
2.1) When flushing stats (e.g. at eoxact): havestats && trylock && flush per-table stats
2.2) When flushing stats (e.g. at eoxact): havestats && trylock && flush per-function stats
2.3) When flushing stats (e.g. at eoxact): havestats && trylock && flush per-database stats
2.4) When flushing stats that need to be flushed (e.g. vacuum): havestats && lock && flush
3.1) When shutting down backend, detach from all hashtables


That way we never need to hold onto the database-wide hashtables for
long, and we can do it with conditional locks (trylock above), unless we
need to force flushing.
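
The "havestats && trylock && flush" rule in steps 2.1-2.4 could be
sketched roughly as below. This is a standalone illustration, not code
from the patch: SharedStats, LocalStats and flush_stats() are
hypothetical names, and a C11 atomic_flag stands in for the real lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared entry: one counter guarded by one lock. */
typedef struct SharedStats
{
    atomic_flag lock;               /* stand-in for an LWLock */
    long        n_tuples_inserted;
} SharedStats;

/* Hypothetical backend-local pending counts. */
typedef struct LocalStats
{
    long        n_tuples_inserted;
} LocalStats;

/*
 * Flush pending local counts into the shared entry.
 *
 * nowait = true:  only try the lock (steps 2.1-2.3); on failure the
 *                 counts stay pending and false is returned.
 * nowait = false: wait for the lock (step 2.4, forced flush).
 */
static bool
flush_stats(SharedStats *shared, LocalStats *local, bool nowait)
{
    assert(shared && local);

    if (local->n_tuples_inserted == 0)
        return true;                /* havestats is false: nothing to do */

    if (atomic_flag_test_and_set(&shared->lock))
    {
        if (nowait)
            return false;           /* contended: keep counts, retry later */
        while (atomic_flag_test_and_set(&shared->lock))
            ;                       /* forced flush: spin until acquired */
    }

    shared->n_tuples_inserted += local->n_tuples_inserted;
    local->n_tuples_inserted = 0;
    atomic_flag_clear(&shared->lock);
    return true;
}
```

A caller at end of transaction would pass nowait = true and simply
remember that stats are still pending when false comes back.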

It might be worthwhile to merge per-table, per-function, per-database
hashes into a single hash. Where the key is either something like
{hashkind, objoid} (referenced from a per-database hashtable), or even
{hashkind, dboid, objoid} (one global hashtable).
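
For the one-global-hashtable variant, the key could look roughly like
this. It is only a sketch: StatsKind, StatsKey and the helper functions
are hypothetical names, and FNV-1a is used purely as a convenient
example hash, not as what dshash would use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical kinds of stats objects sharing one hashtable. */
typedef enum StatsKind
{
    STATS_KIND_DATABASE,
    STATS_KIND_TABLE,
    STATS_KIND_FUNCTION
} StatsKind;

/* Hypothetical composite key: {hashkind, dboid, objoid}. */
typedef struct StatsKey
{
    StatsKind   kind;
    uint32_t    dboid;      /* 0 for shared objects */
    uint32_t    objoid;
} StatsKey;

/* Fold one 32-bit value into an FNV-1a hash, byte by byte. */
static uint32_t
mix_u32(uint32_t h, uint32_t v)
{
    for (int i = 0; i < 4; i++)
    {
        h ^= (v >> (8 * i)) & 0xffu;
        h *= 16777619u;
    }
    return h;
}

/* Hash the fields individually so struct padding never matters. */
static uint32_t
stats_key_hash(const StatsKey *key)
{
    uint32_t    h = 2166136261u;

    h = mix_u32(h, (uint32_t) key->kind);
    h = mix_u32(h, key->dboid);
    h = mix_u32(h, key->objoid);
    return h;
}

static bool
stats_key_equal(const StatsKey *a, const StatsKey *b)
{
    return a->kind == b->kind &&
           a->dboid == b->dboid &&
           a->objoid == b->objoid;
}
```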


I think the contents of the hashtable should likely just be a single
dsa_pointer (plus some bookkeeping). Several reasons for that:

1) Since one goal of this is to make the stats system more extensible,
  it seems important that we can make the set of stats kept
  runtime configurable. Otherwise everyone will continue to have to pay
  the price for every potential stat that we have an option to track.

2) Having hashtable resizes move fairly large stat entries around is
   expensive. Whereas just moving key + dsa_pointer around is pretty
   cheap. I don't think the cost of a pointer dereference matters in
   *this* case.

3) If the stats contents aren't moved around, there's no need to worry
   about hashtable resizes. Therefore the stats can be referenced
   without holding dshash partition locks.

4) If the stats entries aren't moved around by hashtable resizes, we can
   use atomics, lwlocks, spinlocks etc as part of the stats entry. It's
   not generally correct/safe to have dshash resize to move those
   around.


All of that would be addressed if we instead allocate the stats data
separately from the dshash entry.
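
To illustrate points 2) to 4): if the table only stores key + pointer,
growing the table moves the small slots while the stats bodies keep
their addresses. The sketch below is hypothetical throughout. A plain
growable array searched linearly stands in for dshash, and a malloc'ed
pointer stands in for a dsa_pointer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stats body, allocated separately from the table. */
typedef struct StatsBody
{
    long        n_tuples_inserted;
} StatsBody;

/* Table slot: only the key and a pointer move when the table grows. */
typedef struct Slot
{
    uint32_t    key;
    StatsBody  *body;       /* stand-in for a dsa_pointer */
} Slot;

typedef struct Table
{
    Slot       *slots;
    size_t      used;
    size_t      cap;
} Table;

/* Return the body for key, creating it if missing; NULL on OOM. */
static StatsBody *
table_find_or_insert(Table *t, uint32_t key)
{
    for (size_t i = 0; i < t->used; i++)
        if (t->slots[i].key == key)
            return t->slots[i].body;

    if (t->used == t->cap)
    {
        /* "Resize": slots may be relocated, bodies are untouched. */
        size_t      newcap = t->cap ? t->cap * 2 : 2;
        Slot       *newslots = realloc(t->slots, newcap * sizeof(Slot));

        if (newslots == NULL)
            return NULL;
        t->slots = newslots;
        t->cap = newcap;
    }

    StatsBody  *body = calloc(1, sizeof(StatsBody));

    if (body == NULL)
        return NULL;
    t->slots[t->used].key = key;
    t->slots[t->used].body = body;
    t->used++;
    return body;
}
```

A backend that cached the StatsBody pointer at startup can keep using
it after any number of inserts by other keys, which is what would make
embedding a lock or atomics in the body safe.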

Greetings,

Andres Freund



Re: shared-memory based stats collector

From
Andres Freund
Date:
Hi,

On 2020-03-19 16:51:59 +1300, Thomas Munro wrote:
> On Fri, Mar 13, 2020 at 4:13 PM Andres Freund <andres@anarazel.de> wrote:
> > Thomas, could you look at the first two patches here, and my review
> > questions?
> 
> Ack.

Thanks!


> > >               dsa_pointer item_pointer = hash_table->buckets[i];
> > > @@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
> > >                                                               LW_EXCLUSIVE));
> > >
> > >       delete_item(hash_table, item);
> > > -     hash_table->find_locked = false;
> > > -     hash_table->find_exclusively_locked = false;
> > > -     LWLockRelease(PARTITION_LOCK(hash_table, partition));
> > > +
> > > +     /* We need to keep partition lock while sequential scan */
> > > +     if (!hash_table->seqscan_running)
> > > +     {
> > > +             hash_table->find_locked = false;
> > > +             hash_table->find_exclusively_locked = false;
> > > +             LWLockRelease(PARTITION_LOCK(hash_table, partition));
> > > +     }
> > >  }
> >
> > This seems like a failure prone API.
> 
> If I understand correctly, the only purpose of the seqscan_running
> variable is to control that behaviour ^^^.  That is, to make
> dshash_delete_entry() keep the partition lock if you delete an entry
> while doing a seq scan.  Why not get rid of that, and provide a
> separate interface for deleting while scanning?
> dshash_seq_delete(dshash_seq_status *scan, void *entry).  I suppose it
> would be most common to want to delete the "current" item in the seq
> scan, but it could allow you to delete anything in the same partition,
> or any entry if using the "consistent" mode.  Oh, I see that Andres
> said the same thing later.


> > [Andres complaining about comments and language stuff]
> 
> I would be happy to proof read and maybe extend the comments (writing
> new comments will also help me understand and review the code!), and
> maybe some code changes to move this forward.  Horiguchi-san, are you
> working on another version now?  If so I'll wait for it before I do
> that.

Cool! Being ESL myself and mildly dyslexic to boot, that'd be
helpful. But I'd hold off for a moment, because I think there'll need to
be some open heart surgery on this patch (see bottom of my last email in
this thread, from minutes ago (don't yet have a message id, sorry)).


> > The fact that you're locking the per-database entry unconditionally once
> > for each table almost guarantees contention - and you're not using the
> > 'conditional lock' approach for that. I don't understand.
> 
> Right, I also noticed that:
> 
>     /*
>      * Local table stats should be applied to both dbentry and tabentry at
>      * once. Update dbentry only if we could update tabentry.
>      */
>     if (pgstat_update_tabentry(tabhash, entry, nowait))
>     {
>         pgstat_update_dbentry(dbent, entry);
>         updated = true;
>     }
> 
> So pgstat_update_tabentry() goes to great trouble to take locks
> conditionally, but then pgstat_update_dbentry() immediately does:
> 
>     LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
>     dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
>     dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
>     dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
>     dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
>     dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
>     dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
>     dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
>     LWLockRelease(&dbentry->lock);
> 
> Why can't we be "lazy" with the dbentry stats too?  Is it really
> important for the table stats and DB stats to agree with each other?

We *need* to be lazy here, I think.


> Hmm.  Even if you change the above code use a conditional lock, I am
> wondering (admittedly entirely without data) if this approach is still
> too clunky: even trying and failing to acquire the lock creates
> contention, just a bit less.  I wonder if it would make sense to make
> readers do more work, so that writers can avoid contention.  For
> example, maybe PgStat_StatDBEntry could hold an array of N sets of
> counters, and readers have to add them all up.  An advanced version of
> this idea would use a reasonably fresh copy of something like
> sched_getcpu() and numa_node_of_cpu() to select a partition to
> minimise contention and cross-node traffic, with a portable fallback
> based on PID or something.  CPU core/node awareness is something I
> haven't looked into too seriously, but it's been on my mind to solve
> some other problems.

I don't think we really need that for the per-object stats. The easier
way to address that is to instead reduce the rate of flushing to the
shared table. There's not really a problem with the shared state of the
stats lagging by a few hundred ms or so.

The amount of code complexity that a scheme like the one you describe
requires doesn't seem worth it to me without very clear evidence it's
needed. If we didn't need to handle the case where the "static" slots
are insufficient to handle all the stats, it'd be different. But given
the number of tables etc that can exist in systems, I don't think
that's achievable.


I think we should go for per-backend counters for other parts of the
system though. I think it should basically be the default for cluster
wide stats like IO (even if we additionally flush it to per table
stats). Currently we have more complicated schemes for those. But that's
imo a separate patch.


Thanks!

Andres



Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
Thank you for looking at this.

At Thu, 19 Mar 2020 16:51:59 +1300, Thomas Munro <thomas.munro@gmail.com> wrote in 
> > This seems like a failure prone API.
> 
> If I understand correctly, the only purpose of the seqscan_running
> variable is to control that behaviour ^^^.  That is, to make
> dshash_delete_entry() keep the partition lock if you delete an entry
> while doing a seq scan.  Why not get rid of that, and provide a
> separate interface for deleting while scanning?
> dshash_seq_delete(dshash_seq_status *scan, void *entry).  I suppose it
> would be most common to want to delete the "current" item in the seq
> scan, but it could allow you to delete anything in the same partition,
> or any entry if using the "consistent" mode.  Oh, I see that Andres
> said the same thing later.

The attached v25 in [1] is the new version.

> > Why does this patch add the consistent mode? There's no users currently?
> > Without it's not clear that we need a seperate _term function, I think?
> 
> +1, let's not do that if we don't need it!

Yes, it is removed.

> > The fact that you're locking the per-database entry unconditionally once
> > for each table almost guarantees contention - and you're not using the
> > 'conditional lock' approach for that. I don't understand.
> 
> Right, I also noticed that:

I think I fixed all cases, except for operations such as drop that
need an exclusive lock.

> So pgstat_update_tabentry() goes to great trouble to take locks
> conditionally, but then pgstat_update_dbentry() immediately does:
> 
>     LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
>     dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
>     dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
>     dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
>     dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
>     dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
>     dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
>     dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
>     LWLockRelease(&dbentry->lock);
> 
> Why can't we be "lazy" with the dbentry stats too?  Is it really
> important for the table stats and DB stats to agree with each other?
> Even if it were, your current coding doesn't achieve that: the table
> stats are updated before the DB stat under different locks, so I'm not
> sure why it can't wait longer.

It is now done the lazy way.

> Hmm.  Even if you change the above code use a conditional lock, I am
> wondering (admittedly entirely without data) if this approach is still
> too clunky: even trying and failing to acquire the lock creates
> contention, just a bit less.  I wonder if it would make sense to make
> readers do more work, so that writers can avoid contention.  For
> example, maybe PgStat_StatDBEntry could hold an array of N sets of
> counters, and readers have to add them all up.  An advanced version of

I thought about that kind of solution, but it needs more memory,
multiplied by the number of backends. If the contention is not
negligible, we can go back to a stats collector process connected via
sockets and then share the result in shared memory. The motivation was
the file I/O incurred when backends read stats.

> this idea would use a reasonably fresh copy of something like
> sched_getcpu() and numa_node_of_cpu() to select a partition to
> minimise contention and cross-node traffic, with a portable fallback
> based on PID or something.  CPU core/node awareness is something I
> haven't looked into too seriously, but it's been on my mind to solve
> some other problems.

I have been asked about CPU core/node awareness several times.  There
may be a certain degree of need for it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
Hello.

At Thu, 19 Mar 2020 12:54:10 -0700, Andres Freund <andres@anarazel.de> wrote in 
> Hi,
> 
> On 2020-03-19 20:30:04 +0900, Kyotaro Horiguchi wrote:
> > > I think we also can get rid of the dshash_delete changes, by instead
> > > adding a dshash_delete_current(dshash_seq_stat *status, void *entry) API
> > > or such.
> >
> > [009] (Fixed)
> > I'm not sure about the point of having two interfaces that are hard to
> > distinguish.  Maybe dshash_delete_current(dshash_seq_stat *status) is
> > enough(). I also reverted the dshash_delete().
> 
> Well, dshash_delete() cannot generally safely be used together with
> iteration. It has to be the current element etc. And I think the locking
> changes make dshash less robust. By explicitly tying "delete the current
> element" to the iterator, most of that can be avoided.

Sure.  By the way, I forgot to remove the seqscan_running stuff. Removed.

> > > >  /* SIGUSR1 signal handler for archiver process */
> > >
> > > Hm - this currently doesn't set up a correct sigusr1 handler for a
> > > shared memory backend - needs to invoke procsignal_sigusr1_handler
> > > somewhere.
> > >
> > > We can probably just convert to using normal latches here, and remove
> > > the current 'wakened' logic? That'll remove the indirection via
> > > postmaster too, which is nice.
> >
> > [018] (Fixed, separate patch 0005)
> > It seems better. I added it as a separate patch just after the patch
> > that turns archiver an auxiliary process.
> 
> I don't think it's correct to do it separately, but I can just merge
> that on commit.

Yes, it's just for the convenience of reviewing. Merged.

> > > > @@ -4273,6 +4276,9 @@ pgstat_get_backend_desc(BackendType backendType)
> > > >
> > > >      switch (backendType)
> > > >      {
> > > > +        case B_ARCHIVER:
> > > > +            backendDesc = "archiver";
> > > > +            break;
> > >
> > > should imo include 'WAL' or such.
> >
> > [019] (Not Fixed)
> > It is already named "archiver" by 8e8a0becb3. Do I rename it in this
> > patch set?
> 
> Oh. No, don't rename it as part of this. Could you reply to the thread
> in which Peter made that change, and reference this complaint?

I sent a mail like that.

https://www.postgresql.org/message-id/20200327.163007.128069746774242774.horikyota.ntt%40gmail.com

> > [021] (Fixed, separate patch 0007)
> > However the "statistics collector process" is gone, I'm not sure
> > "statistics collector" feature also is gone. But actually the word
> > "collector" looks a bit odd in some context. I replaced "the results
> > of statistics collector" with "the activity statistics". (I'm not sure
> > "the activity statistics" is proper as a subsystem name.) The word
> > "collect" is replaced with "track".  I didn't change section IDs
> > corresponding to the renaming so that old links can work. I also fixed
> > the tranche name for LWTRANCHE_STATS from "activity stats" to
> > "activity_statistics"
> 
> Without having gone through the changes, that sounds like the correct
> direction to me. There's no "collector" anymore, so removing that seems
> like the right thing.

Thanks.

> > > > diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
> > > > index ca5c6376e5..1ffe073a1f 100644
> > > > --- a/src/backend/postmaster/pgstat.c
> > > > +++ b/src/backend/postmaster/pgstat.c
...
> > [024] (Fixed, Maybe)
> > Although not sure I get you correctly, I rewrote it as the follows.
..
> I was thinking of something like:
>  *  Collects activity statistics, e.g. per-table access statistics, of
>  *  all backends in shared memory. The activity numbers are first stored
>  *  locally in each process, then flushed to shared memory at commit
>  *  time or by idle-timeout.

Looks fine. Replaced it with the above.

> > [025] (Not Fixed)
> > Is it the matter of intervals? Is (MIN, RETRY, MAX) = (1000, 500,
> > 10000) reasonable?
> 
> Partially. I think for access to shared resources we want *increasing*
> wait times, rather than shorter retry timeout. The goal should be to be
> to make it more likely for all processes to be able to flush their
> stats, which can be achieved by flushing less often after hitting
> contention.

Ah! Indeed. The attached works the following way.

 * To avoid congestion on the shared memory, shared stats is updated no more
 * often than once per PGSTAT_MIN_INTERVAL (1000ms). If some local numbers
 * remain unflushed for lock failure, retry with intervals that is initially
 * PGSTAT_RETRY_MIN_INTERVAL (250ms) then doubled at every retry. Finally we
 * force update after PGSTAT_MAX_INTERVAL (10000ms) since the first trial.

Concretely the interval changes as:

    elapsed        interval
-------------+--------------    
    0ms            (1000ms)
    1000ms        250ms
    1250ms        500ms
    1750ms        1000ms
    2750ms        2000ms
    4750ms        5250ms (not 4000ms)
    10000ms
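
A rule that reproduces the retry intervals in the table above could
look like the following. This is a sketch, not the code from the patch:
next_retry_interval() is a hypothetical name, and the stretching rule
(if two more intervals of the doubled length would overshoot the
maximum, extend the current one up to the maximum) is an assumption
chosen because it matches the listed numbers.

```c
#include <assert.h>

/* Values quoted in the comment above, in milliseconds. */
#define PGSTAT_MIN_INTERVAL         1000
#define PGSTAT_RETRY_MIN_INTERVAL    250
#define PGSTAT_MAX_INTERVAL        10000

/*
 * Compute the wait before the next retry.
 *
 * elapsed_ms:       the "elapsed" column of the table
 * prev_interval_ms: previous retry interval, 0 before the first retry
 */
static int
next_retry_interval(int elapsed_ms, int prev_interval_ms)
{
    /* Start at the minimum retry interval, then double on every retry. */
    int         interval = (prev_interval_ms == 0)
        ? PGSTAT_RETRY_MIN_INTERVAL
        : prev_interval_ms * 2;

    /*
     * Assumed rule: avoid a final interval shorter than the previous
     * one by stretching this interval to the forced-flush deadline.
     */
    if (elapsed_ms + 2 * interval > PGSTAT_MAX_INTERVAL)
        interval = PGSTAT_MAX_INTERVAL - elapsed_ms;

    assert(interval > 0);
    return interval;
}
```

Starting from elapsed = PGSTAT_MIN_INTERVAL with no previous interval
and feeding each result back in yields the retry intervals listed in
the table, ending exactly at PGSTAT_MAX_INTERVAL.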
                
While fixing it, I also fixed several silly bugs:
  - pgstat_report_stat accessed dbent even if it is NULL.
  - pgstat_flush_tabstats set have_(sh|my)database_stats wrongly.

> > [029] (Fixed) (Related  to [046])
> > Mmm. It's following your old suggestion to avoid unsubstantial
> > diffs. I'm happy to change it. The functions that have "send" in their
> > names are for the same reason. I removed the prefix "m_" of the
> > members of the struct. (The comment above (with a typo) explains that).
> 
> I don't object to having the rename be a separate patch...

Nope. I don't want to make it a separate patch.

> > > > +    if (StatsShmem->refcount > 0)
> > > > +        StatsShmem->refcount++;
> > >
> > > What prevents us from leaking the refcount here? We could e.g. error out
> > > while attaching, no? Which'd mean we'd leak the refcount.
> >
> > [033] (Fixed)
> > We don't attach shared stats on postmaster process, so I want to know
> > the first attacher process and the last detacher process of shared
> > stats.  It's not leaks that I'm considering here.
> > (continued below)
> >
> > > To me it looks like there's a lot of added complexity just because you
> > > want to be able to reset stats via
...
> > > Without that you'd not need the complexity of attaching, detaching to
> > > the same degree - every backend could just cache lookup data during
> > > initialization, instead of having to constantly re-compute that.
> >
> > Mmm. I don't get that (or I failed to read clear meaning). The
> > function is assumed be called only from StartupXLOG().
> > (continued)
> 
> Oh? I didn't get that you're only using it for that purpose - there's
> very little documentation about what it's trying to do.

Ugg..

> I don't see why that means we don't need to accurately track the
> refcount? Otherwise we'll forget to write out the stats.

Exactly, and I added comments for that.

|  * refcount is used to know whether a process going to detach shared stats is
|  * the last process or not. The last process writes out the stats files.
|  */
| typedef struct StatsShmemStruct

|     if (--StatsShmem->refcount < 1)
|     {
|         /*
|          * The process is the last one that is attaching the shared stats
|          * memory. Write out the stats files if requested.
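
Reduced to its core, the last-detacher rule is as follows. This is a
hypothetical standalone sketch: the Demo* names are not from the patch,
and the locking that must protect refcount in shared memory is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the shared stats control struct. */
typedef struct DemoStatsShmem
{
    int         refcount;       /* number of attached processes */
    int         files_written;  /* counts simulated write-outs */
} DemoStatsShmem;

/* Attach: every successful attach must be paired with one detach. */
static void
demo_attach(DemoStatsShmem *shmem)
{
    shmem->refcount++;
}

/*
 * Detach: returns true if the caller was the last attached process,
 * in which case the stats files are (here: pretend to be) written.
 */
static bool
demo_detach(DemoStatsShmem *shmem)
{
    assert(shmem->refcount > 0);

    if (--shmem->refcount < 1)
    {
        shmem->files_written++;
        return true;
    }
    return false;
}
```

If an attach is counted but the matching detach never runs, refcount
never reaches zero and the write-out is skipped, which is why the count
has to be tracked accurately.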

> > > Nor would the dynamic re-creation of the db dshash table be needed.
..
> I was referring to the fact that the last version of the patch
> attached/detached from hashtables regularly. pin_hashes, unpin_hashes,
> attach_table_hash, attach_function_hash etc.

pin/unpin is gone. Now there is only one dshash and it is attached for
the lifetime of the process.

> > After some thoughts, I decided to rip the all "generation" stuff off
> > and it gets far simpler. But counter reset may conflict with other
> > backends with a litter higher degree because counter reset needs
> > exclusive lock.
> 
> That seems harmless to me - stats reset should never happen at a high
> enough frequency to make contention it causes problematic. There's also
> an argument to be made that it makes sense for the reset to be atomic.

Agreed.

> > > > +    /* Flush out table stats */
> > > > +    if (pgStatTabList != NULL && !pgstat_flush_stat(&cxt, !force))
> > > > +        pending_stats = true;
> > > > +
> > > > +    /* Flush out function stats */
> > > > +    if (pgStatFunctions != NULL && !pgstat_flush_funcstats(&cxt, !force))
> > > > +        pending_stats = true;
> > >
> > > This seems weird. pgstat_flush_stat(), pgstat_flush_funcstats() operate
> > > on pgStatTabList/pgStatFunctions, but don't actually reset it? Besides
> > > being confusing while reading the code, it also made the diff much
> > > harder to read.
> >
> > [035] (Maybe Fixed)
> > Is the question that, is there any case where
> > pgstat_flush_stat/functions leaves some counters unflushed?
> 
> No, the point is that there's knowledge about
> pgstat_flush_stat/pgstat_flush_funcstats outside of those functions,
> namely the pgStatTabList, pgStatFunctions lists.

Mmm. Anyway the stuff has been largely changed in this version.

> > > Why do we still have this? A hashtable lookup is cheap, compared to
> > > fetching a file - so it's not to save time. Given how infrequent the
> > > pgstat_fetch_* calls are, it's not to avoid contention either.
> > >
> > > At first one could think it's for consistency - but no, that's not it
> > > either, because snapshot_statentry() refetches the snapshot without
> > > control from the outside:
> >
> > [038]
> > I don't get the second paragraph. When the function re*create*s a
> > snapshot without control from the outside? It keeps snapshots during a
> > transaction.  If not, it is broken.
> > (continued)
> 
> Maybe I just misunderstood the code flow - partially due to the global
> clear_snapshot variable. I just had read the
> +     * We don't want so frequent update of stats snapshot. Keep it at least
> +     * for PGSTAT_STAT_MIN_INTERVAL ms. Not postpone but just ignore the cue.
> 
> comment, and took it to mean that you're unconditionally updating the
> snapshot every PGSTAT_STAT_MIN_INTERVAL. Which'd mean we don't actually
> have consistent snapshot across all fetches.
> 
> (partially this might have been due to the diff:
>      /*
> -     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
> -     * msec since we last sent one, or the caller wants to force stats out.
> +     * We don't want so frequent update of stats snapshot. Keep it at least
> +     * for PGSTAT_STAT_MIN_INTERVAL ms. Not postpone but just ignore the cue.
>       */

Wow.. I tried "git config --global diff.algorithm patience" and it
seems to work well.

> But I think my question remains: Why do we need the whole snapshot thing
> now? Previously we needed to avoid reading a potentially large file -
> but that's not a concern anymore?
...
> > Currently the behavior is documented as the follows and it seems reasonable.
> >
...
> I am very unconvinded this is worth the cost. Especially because plenty
> of other stats related parts of the system do *NOT* behave this way. How
> is a user supposed to understand that pg_stat_database behaves one way,
> pg_stat_activity, another, pg_stat_statements a third,
> pg_stat_progress_* ...
> 
> Perhaps it's best to not touch the semantics here, but I'm also very
> wary of introducing significant complications and overhead just to have
> this "feature".

As a compromise, I removed the "clear_snapshot" stuff.  Snapshots still
work, but now clear_snapshot() clears them immediately. It works the
same way as pg_stat_activity.

> > > >       for (i = 0; i < tsa->tsa_used; i++)
> > > >       {
> > > >           PgStat_TableStatus *entry = &tsa->tsa_entries[i];
> > > >
> > <many TableStatsArray code>
> > > >               hash_entry->tsa_entry = entry;
> > > >               dest_elem++;
> > > >           }
> > >
> > > This seems like too much code. Why is this entirely different from the
> > > way funcstats works? The difference was already too big before, but this
> > > made it *way* worse.
> >
> > [040]
> > We don't flush stats until transaction ends. So the description about
> > TabStatuArray is stale?
> 
> How is your comment related to my comment above?

Hmm. It looks like it was truncated. The TableStatsArray is removed; all
kinds of local stats (except global stats) are now stored directly in
pgStatLocalHashEntry. The code gets far simpler.

> > > >       generation = pin_hashes(dbent);
> > > >       tabhash = attach_table_hash(dbent, generation);
> > >
> > > This again is just cost incurred by insisting on destroying hashtables
> > > instead of keeping them around as long as necessary.
> >
> > [040]
> > Maybe you are insisting the reverse? The pin_hash complexity is left
> > in this version. -> [033]
> 
> What do you mean? What I'm saying is that we should never end up in a
> situation where there's no pgstat entry for the current database. And
> that that's trivial, as long as we don't drop the hashtable, but instead
> reset counters to 0.

The previous version, which was not sent to the ML, attached/detached
only at process start/end. But in this version the table/function
dshashes are gone.

> > > The fact that you're locking the per-database entry unconditionally once
> > > for each table almost guarantees contention - and you're not using the
> > > 'conditional lock' approach for that. I don't understand.
> >
> > [043] (Maybe fixed) (Related to [045].)
> > Vacuum, analyze, DROP DB and reset cannot be delayed. So the
> > conditional lock is mainly used by
> > pgstat_report_stat().
> 
> You're saying "cannot be delayed" - but you're not explaining *why* that
> is.
> 
> Even if true, I don't see why that necessitates doing the flushing and
> locking once for each of these functions?

Sorry, that was wrong.  We can just skip removal on lock failure
during pgstat_vacuum_stat(). It will be retried the next time.  Other
database stats (deadlock, checksum failure, tmpfile and conflict
counts) are now collected locally and then flushed.

> > dshash_find_or_insert didn't allow shared lock. I changed
> > dshash_find_extended to allow shared-lock even if it is told to create
> > a missing entry. Although it takes an exclusive lock at the moment of
> > entry creation, in most cases it doesn't need an exclusive lock. This
> > allows using a shared lock while processing vacuum or analyze stats.
> 
> Huh?

Well, anyway, the shared-insert mode of dshash_find_extended is no
longer needed so I removed the mode in this version.

> Just to be crystal clear: I am exceedingly unlikely to commit this with
> any sort of short term attach/detach operations. Both because of the
> runtime overhead/contention it causes is significant, and because of the
> code complexity implied by it.

I think it is addressed in this version.

> Leaving attach/detach aside: I think it's a complete no-go to acquire
> database wide locks at this frequency, and then to hold them over other
> operations that are a) not cheap b) can block. The contention due to
> that would be *terrible* for scalability, even if it's just a shared
> lock.

> The way this *should* work is that:
> 1.1) At backend startup, attach to the database wide hashtable
> 1.2) At backend startup, attach to the various per-database hashtables
>   (including ones for shared tables)
> 2.1) When flushing stats (e.g. at eoxact): havestats && trylock && flush per-table stats
> 2.2) When flushing stats (e.g. at eoxact): havestats && trylock && flush per-function stats
> 2.3) When flushing stats (e.g. at eoxact): havestats && trylock && flush per-database stats
> 2.4) When flushing stats that need to be flushed (e.g. vacuum): havestats && lock && flush
> 3.1) When shutting down backend, detach from all hashtables
> 
> 
> That way we never need to hold onto the database-wide hashtables for
> long, and we can do it with conditional locks (trylock above), unless we
> need to force flushing.

I think the attached works in a similar way. Table/function stats are
processed together, then database stats are processed.
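
To illustrate the "havestats && trylock && flush" pattern from the list
quoted above, here is a minimal standalone sketch. It is not code from the
patch: SharedCounter, LocalCounter and flush_counter are hypothetical names,
and a C11 atomic_flag stands in for the LWLock that the real code would use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins; none of these names come from the patch. */
typedef struct SharedCounter
{
    atomic_flag lock;           /* models the lock protecting the entry */
    long        n_tuples;       /* shared, accumulated value */
} SharedCounter;

typedef struct LocalCounter
{
    long        pending;        /* changes not yet flushed */
} LocalCounter;

/*
 * Flush pending local changes into the shared counter.
 *
 * With force = false the lock is tried only once; on failure the pending
 * value is kept and false is returned so that the caller can retry later.
 * With force = true the function waits until the lock is available.
 */
bool
flush_counter(LocalCounter *local, SharedCounter *shared, bool force)
{
    if (local->pending == 0)
        return true;            /* no stats to flush */

    if (atomic_flag_test_and_set(&shared->lock))
    {
        if (!force)
            return false;       /* trylock failed: keep pending stats */

        while (atomic_flag_test_and_set(&shared->lock))
            ;                   /* spin until the lock is free */
    }

    shared->n_tuples += local->pending;
    local->pending = 0;
    atomic_flag_clear(&shared->lock);
    return true;
}
```

A regular end-of-transaction flush would call this with force = false and
simply leave the pending values for the next attempt on failure; a flush
that must not be delayed would pass force = true.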

> It might be worthwhile to merge per-table, per-function, per-database
> hashes into a single hash. Where the key is either something like
> {hashkind, objoid} (referenced from a per-database hashtable), or even
> {hashkind, dboid, objoid} (one global hashtable).
>
> I think the contents of the hashtable should likely just be a single
> dsa_pointer (plus some bookkeeping). Several reasons for that:
> 
> 1) Since one goal of this is to make the stats system more extensible,
>   it seems important that we can make the set of stats kept
>   runtime configurable. Otherwise everyone will continue to have to pay
>   the price for every potential stat that we have an option to track.
>
> 2) Having hashtable resizes move fairly large stat entries around is
>    expensive. Whereas just moving key + dsa_pointer around is pretty
>    cheap. I don't think the cost of a pointer dereference matters in
>    *this* case.
>
> 3) If the stats contents aren't moved around, there's no need to worry
>    about hashtable resizes. Therefore the stats can be referenced
>    without holding dshash partition locks.
> 
> 4) If the stats entries aren't moved around by hashtable resizes, we can
>    use atomics, lwlocks, spinlocks etc as part of the stats entry. It's
>    not generally correct/safe to have dshash resize to move those
>    around.
> 
> 
> All of that would be addressed if we instead allocate the stats data
> separately from the dshash entry.

OK, I'm convinced by that (and I like it). The attached v27 is largely
changed from the previous version, following the suggestion.

1) DB, table and function stats are stored in one hash keyed by (type,
   dbid, objectid) and handled in a unified way. Now pgstat_report_stat
   flushes stats in the following way.

    while (hash_seq_search on local stats hash)
    {
        switch (ent->stats->type)
        {
            case PGSTAT_TYPE_DB:  ...
            case PGSTAT_TYPE_TABLE:  ...
            case PGSTAT_TYPE_FUNCTION:  ...
        }
    }

2, 3) There's only one dshash table, pgStatSharedHash.  Its entry is
   defined as follows.

   +typedef struct PgStatHashEntry
   +{
   +    PgStatHashEntryKey    key;    /* hash key */
   +    dsa_pointer            stats;    /* pointer to shared stats entry in DSA */
   +} PgStatHashEntry;

   key is (type, databaseid, objectid)

   To handle entries of different types in a common way, the hash entry
   points to the following struct stored in DSA memory.

   +typedef struct PgStatEntry
   +{
   +    PgStatTypes    type;        /* statistics entry type */
   +    size_t        len;        /* length of body, fixed per type. */
   +    LWLock        lock;        /* lightweight lock to protect body */
   +    char        body[FLEXIBLE_ARRAY_MEMBER];    /* statistics body */
   +} PgStatEntry;

   The body stores the existing PgStat_Stat*Entry structs.

   To match the shared stats, locally-stored stats entries are changed
   in a similar way.

4) As shown above, I'm using LWLock in this version.
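
The layout described in 2, 3) can be sketched outside PostgreSQL as follows.
This is only an illustration under stated assumptions: StatKey, StatBody,
HashSlot and the three stats structs are hypothetical stand-ins, calloc
replaces DSA allocation, a plain pointer replaces dsa_pointer, and the
LWLock member is left out. It shows that the body length is fixed per type
and that moving the hash slot (as a resize would) leaves the separately
allocated body where it is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef enum { STAT_DB, STAT_TABLE, STAT_FUNCTION } StatType;

/* Hash key: (type, database id, object id). */
typedef struct StatKey
{
    StatType     type;
    unsigned int dbid;
    unsigned int objid;
} StatKey;

/* Stand-in for PgStatEntry (the lock member is omitted here). */
typedef struct StatBody
{
    StatType    type;
    size_t      len;            /* length of body, fixed per type */
    char        body[];         /* type-specific statistics */
} StatBody;

/* Stand-in for PgStatHashEntry: key plus a pointer to the body. */
typedef struct HashSlot
{
    StatKey     key;
    StatBody   *stats;          /* the patch stores a dsa_pointer here */
} HashSlot;

/* Hypothetical per-type bodies. */
typedef struct { long xact_commit; long xact_rollback; } DbStats;
typedef struct { long tuples_inserted; } TableStats;
typedef struct { long calls; } FuncStats;

size_t
stat_body_len(StatType type)
{
    switch (type)
    {
        case STAT_DB:       return sizeof(DbStats);
        case STAT_TABLE:    return sizeof(TableStats);
        case STAT_FUNCTION: return sizeof(FuncStats);
    }
    return 0;
}

/* Allocate a zeroed body of the right size; returns NULL on failure. */
StatBody *
stat_body_create(StatType type)
{
    size_t      len = stat_body_len(type);
    StatBody   *e = calloc(1, offsetof(StatBody, body) + len);

    if (e == NULL)
        return NULL;
    e->type = type;
    e->len = len;
    return e;
}

/*
 * Model of a resize: slots are copied into a larger array and the old
 * array is freed.  Only key + pointer move; the bodies stay in place.
 * On allocation failure NULL is returned and the old array is untouched.
 */
HashSlot *
slots_grow(HashSlot *old, size_t n, size_t new_n)
{
    HashSlot   *new_slots = calloc(new_n, sizeof(HashSlot));

    if (new_slots == NULL)
        return NULL;
    memcpy(new_slots, old, n * sizeof(HashSlot));
    free(old);
    return new_slots;
}
```

Because a body never moves after slots_grow(), a pointer to it that was
obtained before the resize stays valid, which is what makes it reasonable
to embed a lock in the body.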

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From c6baa406e0efb15504049cfbd33602f1c1d65b42 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Mon, 16 Mar 2020 17:15:35 +0900
Subject: [PATCH v27 1/7] Use standard crash handler in archiver.

Commit 8e19a82640 changed the SIGQUIT handler of almost all processes
not to run atexit callbacks for safety. The archiver process should
behave the same way for the same reason. The exit status changes from 1
to 2, but that doesn't make any behavioral change.
---
 src/backend/postmaster/pgarch.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 01ffd6513c..37be0e2bbb 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -96,7 +96,6 @@ static pid_t pgarch_forkexec(void);
 #endif
 
 NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_exit(SIGNAL_ARGS);
 static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
@@ -229,7 +228,7 @@ PgArchiverMain(int argc, char *argv[])
     pqsignal(SIGHUP, SignalHandlerForConfigReload);
     pqsignal(SIGINT, SIG_IGN);
     pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
-    pqsignal(SIGQUIT, pgarch_exit);
+    pqsignal(SIGQUIT, SignalHandlerForCrashExit);
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
     pqsignal(SIGUSR1, pgarch_waken);
@@ -246,14 +245,6 @@ PgArchiverMain(int argc, char *argv[])
     exit(0);
 }
 
-/* SIGQUIT signal handler for archiver process */
-static void
-pgarch_exit(SIGNAL_ARGS)
-{
-    /* SIGQUIT means curl up and die ... */
-    exit(1);
-}
-
 /* SIGUSR1 signal handler for archiver process */
 static void
 pgarch_waken(SIGNAL_ARGS)
-- 
2.18.2

From 1592ccaf8fd790a18768dc301338efef99a99af5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:03 +0900
Subject: [PATCH v27 2/7] sequential scan for dshash

Dshash did not allow scanning all entries sequentially. This adds that
functionality. The interface is similar to, but a bit different from,
both that of dynahash and the simple dshash search functions. One of the
most significant differences is that the sequential scan interface of
dshash always needs a call to dshash_seq_term when the scan ends. Another
is locking. Dshash holds the partition lock when returning an entry;
dshash_seq_next() also holds the lock when returning an entry, but callers
shouldn't release it, since the lock is essential to continue the
scan. The seqscan interface allows entry deletion during a scan. The
in-scan deletion should be performed by dshash_delete_current().
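
The in-scan deletion relies on remembering the successor of the current
item before handing the item to the caller. A minimal standalone sketch of
that idea follows. It models only the iteration, not the partition locking
or dshash_seq_term, and every name in it (Table, Item, SeqScan, seq_init,
seq_next, seq_delete_current) is hypothetical rather than part of dshash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NBUCKETS 4

typedef struct Item
{
    struct Item *next;
    int          value;
} Item;

typedef struct Table
{
    Item       *buckets[NBUCKETS];
} Table;

typedef struct SeqScan
{
    Table      *table;
    int         curbucket;
    Item       *curitem;        /* item last returned, or NULL */
    Item       *nextitem;       /* successor saved before returning curitem */
    int         started;
} SeqScan;

/* Push a new item onto the head of its bucket's chain. */
void
table_insert(Table *t, int value)
{
    Item       *it = malloc(sizeof(Item));
    unsigned    b = (unsigned) value % NBUCKETS;

    if (it == NULL)
        abort();
    it->value = value;
    it->next = t->buckets[b];
    t->buckets[b] = it;
}

void
seq_init(SeqScan *s, Table *t)
{
    s->table = t;
    s->curbucket = 0;
    s->curitem = NULL;
    s->nextitem = NULL;
    s->started = 0;
}

/* Return the next item, or NULL when all buckets have been scanned. */
Item *
seq_next(SeqScan *s)
{
    Item       *next;

    if (!s->started)
    {
        s->started = 1;
        next = s->table->buckets[0];
    }
    else
        next = s->nextitem;

    /* Move to the next bucket if the current chain is exhausted. */
    while (next == NULL)
    {
        if (++s->curbucket >= NBUCKETS)
        {
            s->curitem = NULL;
            s->nextitem = NULL;
            return NULL;
        }
        next = s->table->buckets[s->curbucket];
    }

    s->curitem = next;
    /* The caller may delete the item, so save its successor now. */
    s->nextitem = next->next;
    return next;
}

/* Unlink and free the item last returned by seq_next(). */
void
seq_delete_current(SeqScan *s)
{
    Item      **link = &s->table->buckets[s->curbucket];

    assert(s->curitem != NULL);
    while (*link != s->curitem)
        link = &(*link)->next;
    *link = s->curitem->next;
    free(s->curitem);
    s->curitem = NULL;
}
```

A caller loops on seq_next() and may call seq_delete_current() for any
returned item; the scan continues from the saved successor.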
---
 src/backend/lib/dshash.c | 150 ++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  21 ++++++
 2 files changed, 170 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 78ccf03217..fb7e23c4cb 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -127,6 +127,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +157,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -324,7 +332,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -592,6 +600,146 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should always be called when a scan finishes.
+ * The caller may delete returned elements in the midst of a scan by using
+ * dshash_delete_current(). exclusive must be true to delete elements.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool exclusive)
+{
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->exclusive = exclusive;
+}
+
+/*
+ * Returns the next element.
+ *
+ * Returned elements are locked and the caller must not explicitly release
+ * the lock. It is released at the next call to dshash_seq_next().
+ */
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert(status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+        LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                      status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        status->curpartition = partition;
+
+        /* resize doesn't happen from now until seq scan ends */
+        status->nbuckets =
+            NUM_BUCKETS(status->hash_table->control->size_log2);
+        ensure_valid_bucket_pointers(status->hash_table);
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        int next_partition;
+
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            return NULL;
+        }
+
+        /* Also move partition lock if needed */
+        next_partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+
+        /* Move lock along with partition for the bucket */
+        if (status->curpartition != next_partition)
+        {
+            /*
+             * Lock the next partition then release the current, not in the
+             * reverse order to avoid concurrent resizing. Partitions are
+             * locked in the same order as resize() so deadlocks won't
+             * happen.
+             */
+            LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                         next_partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                         status->curpartition));
+            status->curpartition = next_partition;
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * The caller may delete the item. Store the next item in case of deletion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+/*
+ * Terminates the seqscan and releases all locks.
+ *
+ * Should always be called when finishing or exiting a seqscan.
+ */
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+
+    if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+/* Remove the current entry during a seq scan. */
+void
+dshash_delete_current(dshash_seq_status *status)
+{
+    dshash_table       *hash_table    = status->hash_table;
+    dshash_table_item  *item        = status->curitem;
+    size_t                partition    = PARTITION_FOR_HASH(item->hash);
+
+    Assert(status->exclusive);
+    Assert(hash_table->control->magic == DSHASH_MAGIC);
+    Assert(hash_table->find_locked);
+    Assert(hash_table->find_exclusively_locked);
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition),
+                                LW_EXCLUSIVE));
+
+    delete_item(hash_table, item);
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b86df68e77..81a929b8d9 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,21 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state. The detail is exposed to let users know the storage
+ * size but it should be considered as an opaque type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;
+    int                    curbucket;
+    int                    nbuckets;
+    dshash_table_item  *curitem;
+    dsa_pointer            pnextitem;
+    int                    curpartition;
+    bool                exclusive;
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
                                    const dshash_parameters *params,
@@ -80,6 +95,12 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
+extern void dshash_delete_current(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.18.2

From 9d7d47040b64e32ef58ba4acac381bd5362f9615 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:57 +0900
Subject: [PATCH v27 3/7] Add conditional lock feature to dshash

Dshash currently waits for a lock unconditionally. That is inconvenient
when we want to avoid being blocked by other processes. This commit
adds dshash_find_extended, an alternative to dshash_find and
dshash_find_or_insert that can return immediately on lock failure.
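
As a rough standalone model of the nowait/insert semantics described
above (not the dshash implementation), the sketch below uses a single
atomic_flag instead of per-partition LWLocks and a small fixed array
instead of a real hash table. ToyTable, Entry and the toy_* functions are
hypothetical names. One consequence it makes visible: with nowait set and
insert unset, a NULL result can mean either "lock busy" or "key absent".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ENTRIES 8

typedef struct Entry
{
    int         key;
    long        value;
    bool        used;
} Entry;

typedef struct ToyTable
{
    atomic_flag lock;           /* one lock for the whole table */
    Entry       entries[MAX_ENTRIES];
} ToyTable;

/* Release the lock held after a successful lookup. */
void
toy_release(ToyTable *t)
{
    atomic_flag_clear(&t->lock);
}

/*
 * Returns the entry with the lock held, or NULL with the lock not held.
 * NULL means: lock busy (nowait only), key absent (insert = false), or
 * table full (insert = true).  *found is not touched when the lock could
 * not be taken.
 */
Entry *
toy_find_extended(ToyTable *t, int key, bool nowait, bool insert,
                  bool *found)
{
    Entry      *free_slot = NULL;

    if (atomic_flag_test_and_set(&t->lock))
    {
        if (nowait)
            return NULL;        /* do not block */
        while (atomic_flag_test_and_set(&t->lock))
            ;                   /* wait for the lock */
    }

    if (found)
        *found = false;

    for (size_t i = 0; i < MAX_ENTRIES; i++)
    {
        Entry      *e = &t->entries[i];

        if (e->used && e->key == key)
        {
            if (found)
                *found = true;
            return e;           /* caller releases the lock */
        }
        if (!e->used && free_slot == NULL)
            free_slot = e;
    }

    if (insert && free_slot != NULL)
    {
        free_slot->used = true;
        free_slot->key = key;
        free_slot->value = 0;
        return free_slot;       /* caller releases the lock */
    }

    toy_release(t);
    return NULL;
}

/* The two existing entry points become thin wrappers. */
Entry *
toy_find(ToyTable *t, int key)
{
    return toy_find_extended(t, key, false, false, NULL);
}

Entry *
toy_find_or_insert(ToyTable *t, int key, bool *found)
{
    return toy_find_extended(t, key, false, true, found);
}
```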
---
 src/backend/lib/dshash.c | 99 +++++++++++++++++++++-------------------
 src/include/lib/dshash.h |  3 ++
 2 files changed, 56 insertions(+), 46 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index fb7e23c4cb..b4dc8e1ece 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -383,6 +383,10 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  * the caller must take care to ensure that the entry is not left corrupted.
  * The lock mode is either shared or exclusive depending on 'exclusive'.
  *
+ * If found is not NULL, *found is set to true if the key is found in the hash
+ * table. If the key is not found, *found is set to false and a pointer to a
+ * newly created entry is returned.
+ *
  * The caller must not lock a lock already.
  *
  * Note that the lock held is in fact an LWLock, so interrupts will be held on
@@ -392,36 +396,7 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
 {
-    dshash_hash hash;
-    size_t        partition;
-    dshash_table_item *item;
-
-    hash = hash_key(hash_table, key);
-    partition = PARTITION_FOR_HASH(hash);
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
-
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
-    ensure_valid_bucket_pointers(hash_table);
-
-    /* Search the active bucket. */
-    item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
-
-    if (!item)
-    {
-        /* Not found. */
-        LWLockRelease(PARTITION_LOCK(hash_table, partition));
-        return NULL;
-    }
-    else
-    {
-        /* The caller will free the lock by calling dshash_release_lock. */
-        hash_table->find_locked = true;
-        hash_table->find_exclusively_locked = exclusive;
-        return ENTRY_FROM_ITEM(item);
-    }
+    return dshash_find_extended(hash_table, key, exclusive, false, false, NULL);
 }
 
 /*
@@ -439,31 +414,61 @@ dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
 {
-    dshash_hash hash;
-    size_t        partition_index;
-    dshash_partition *partition;
+    return dshash_find_extended(hash_table, key, true, false, true, found);
+}
+
+
+/*
+ * Find the key in the hash table.
+ *
+ * "insert" indicates insert mode. In this mode, if the key is not found, a
+ * new entry is inserted and *found is set to false; *found is set to true if
+ * the key is found. "found" must be non-null in this mode.
+ *
+ * If nowait is true, the function returns immediately if the required lock
+ * could not be acquired.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool insert, bool *found)
+{
+    dshash_hash hash = hash_key(hash_table, key);
+    size_t        partidx = PARTITION_FOR_HASH(hash);
+    dshash_partition *partition = &hash_table->control->partitions[partidx];
+    LWLockMode  lockmode = exclusive ? LW_EXCLUSIVE : LW_SHARED;
     dshash_table_item *item;
 
-    hash = hash_key(hash_table, key);
-    partition_index = PARTITION_FOR_HASH(hash);
-    partition = &hash_table->control->partitions[partition_index];
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
+    /* must be exclusive when insert allowed */
+    Assert(!insert || (exclusive && found != NULL));
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (!nowait)
+        LWLockAcquire(PARTITION_LOCK(hash_table, partidx), lockmode);
+    else if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partidx),
+                                       lockmode))
+        return NULL;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
     item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
 
     if (item)
-        *found = true;
+    {
+        if (found)
+            *found = true;
+    }
     else
     {
-        *found = false;
+        if (found)
+            *found = false;
+
+        if (!insert)
+        {
+            /* The caller didn't ask to add a new entry. */
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+            return NULL;
+        }
 
         /* Check if we are getting too full. */
         if (partition->count > MAX_COUNT_PER_PARTITION(hash_table))
@@ -479,7 +484,8 @@ restart:
              * Give up our existing lock first, because resizing needs to
              * reacquire all the locks in the right order to avoid deadlocks.
              */
-            LWLockRelease(PARTITION_LOCK(hash_table, partition_index));
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+
             resize(hash_table, hash_table->size_log2 + 1);
 
             goto restart;
@@ -493,12 +499,13 @@ restart:
         ++partition->count;
     }
 
-    /* The caller must release the lock with dshash_release_lock. */
+    /* The caller will free the lock by calling dshash_release_lock. */
     hash_table->find_locked = true;
-    hash_table->find_exclusively_locked = true;
+    hash_table->find_exclusively_locked = exclusive;
     return ENTRY_FROM_ITEM(item);
 }
 
+
 /*
  * Remove an entry by key.  Returns true if the key was found and the
  * corresponding entry was removed.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 81a929b8d9..80a896a99b 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -91,6 +91,9 @@ extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                                    const void *key, bool *found);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+                                  bool exclusive, bool nowait, bool insert,
+                                  bool *found);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.18.2

From 6d52d83a00e9b0e8849f5adfe3871099d602485e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:59:38 +0900
Subject: [PATCH v27 4/7] Make archiver process an auxiliary process

This is a preliminary patch for shared-memory based stats collector.

The archiver process must be an auxiliary process since it uses shared
memory once stats data is moved into shared memory. Make the process
an auxiliary process in order to make it work.
---
 src/backend/access/transam/xlog.c        |  49 +++++++++++
 src/backend/access/transam/xlogarchive.c |   2 +-
 src/backend/bootstrap/bootstrap.c        |  22 ++---
 src/backend/postmaster/pgarch.c          | 102 ++++-------------------
 src/backend/postmaster/postmaster.c      |  53 ++++++------
 src/include/access/xlog.h                |   2 +
 src/include/access/xlog_internal.h       |   1 +
 src/include/miscadmin.h                  |   2 +
 src/include/postmaster/pgarch.h          |   4 +-
 src/include/storage/pmsignal.h           |   1 -
 10 files changed, 111 insertions(+), 127 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7621fc05e2..4da7ed3657 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -680,6 +680,13 @@ typedef struct XLogCtlData
      */
     Latch        recoveryWakeupLatch;
 
+    /*
+     * archiverWakeupLatch is used to wake up the archiver process to process
+     * completed WAL segments, if it is waiting for WAL to arrive.
+     * Protected by info_lck.
+     */
+    Latch       *archiverWakeupLatch;
+
     /*
      * During recovery, we keep a copy of the latest checkpoint record here.
      * lastCheckPointRecPtr points to start of checkpoint record and
@@ -8381,6 +8388,48 @@ GetLastSegSwitchData(XLogRecPtr *lastSwitchLSN)
     return result;
 }
 
+/*
+ * XLogArchiveWakeupStart - Set up archiver wakeup stuff
+ */
+void
+XLogArchiveWakeupStart(void)
+{
+    Latch *old_latch PG_USED_FOR_ASSERTS_ONLY;
+
+    SpinLockAcquire(&XLogCtl->info_lck);
+    old_latch = XLogCtl->archiverWakeupLatch;
+    XLogCtl->archiverWakeupLatch = MyLatch;
+    SpinLockRelease(&XLogCtl->info_lck);
+    Assert(old_latch == NULL);
+}
+
+/*
+ * XLogArchiveWakeupEnd - Clean up archiver wakeup stuff
+ */
+void
+XLogArchiveWakeupEnd(void)
+{
+    SpinLockAcquire(&XLogCtl->info_lck);
+    XLogCtl->archiverWakeupLatch = NULL;
+    SpinLockRelease(&XLogCtl->info_lck);
+}
+
+/*
+ * XLogArchiveWakeup - Wake up archiver process
+ */
+void
+XLogArchiveWakeup(void)
+{
+    Latch *latch;
+
+    SpinLockAcquire(&XLogCtl->info_lck);
+    latch = XLogCtl->archiverWakeupLatch;
+    SpinLockRelease(&XLogCtl->info_lck);
+
+    if (latch)
+        SetLatch(latch);
+}
+
 /*
  * This must be called ONCE during postmaster or standalone-backend shutdown
  */
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 914ad340ea..47c2b4a373 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -489,7 +489,7 @@ XLogArchiveNotify(const char *xlog)
 
     /* Notify archiver that it's got something to do */
     if (IsUnderPostmaster)
-        SendPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER);
+        XLogArchiveWakeup();
 }
 
 /*
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 5480a024e0..d398ce6f03 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -319,6 +319,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
         case StartupProcess:
             MyBackendType = B_STARTUP;
             break;
+        case ArchiverProcess:
+            MyBackendType = B_ARCHIVER;
+            break;
         case BgWriterProcess:
             MyBackendType = B_BG_WRITER;
             break;
@@ -439,30 +442,29 @@ AuxiliaryProcessMain(int argc, char *argv[])
             proc_exit(1);        /* should never return */
 
         case StartupProcess:
-            /* don't set signals, startup process has its own agenda */
             StartupProcessMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
+
+        case ArchiverProcess:
+            PgArchiverMain();
+            proc_exit(1);
 
         case BgWriterProcess:
-            /* don't set signals, bgwriter has its own agenda */
             BackgroundWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case CheckpointerProcess:
-            /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalWriterProcess:
-            /* don't set signals, walwriter has its own agenda */
             InitXLOGAccess();
             WalWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalReceiverProcess:
-            /* don't set signals, walreceiver has its own agenda */
             WalReceiverMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         default:
             elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 37be0e2bbb..6fe7a136ba 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -48,6 +48,7 @@
 #include "storage/latch.h"
 #include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
+#include "storage/procsignal.h"
 #include "utils/guc.h"
 #include "utils/ps_status.h"
 
@@ -78,7 +79,6 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
@@ -95,8 +95,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
 static void pgarch_ArchiverCopyLoop(void);
@@ -110,75 +108,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -212,14 +141,21 @@ pgarch_forkexec(void)
 #endif                            /* EXEC_BACKEND */
 
 
+/* Clean up notification stuff on exit */
+static void
+PgArchiverKill(int code, Datum arg)
+{
+    XLogArchiveWakeupEnd();
+}
+
 /*
  * PgArchiverMain
  *
  *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
  *    since we don't use 'em, it hardly matters...
  */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -231,7 +167,7 @@ PgArchiverMain(int argc, char *argv[])
     pqsignal(SIGQUIT, SignalHandlerForCrashExit);
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, pgarch_waken);
+    pqsignal(SIGUSR1, procsignal_sigusr1_handler);
     pqsignal(SIGUSR2, pgarch_waken_stop);
     /* Reset some signals that are accepted by postmaster but not here */
     pqsignal(SIGCHLD, SIG_DFL);
@@ -240,24 +176,14 @@ PgArchiverMain(int argc, char *argv[])
     MyBackendType = B_ARCHIVER;
     init_ps_display(NULL);
 
+    XLogArchiveWakeupStart();
+    on_shmem_exit(PgArchiverKill, 0);
+    
     pgarch_MainLoop();
 
     exit(0);
 }
 
-/* SIGUSR1 signal handler for archiver process */
-static void
-pgarch_waken(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    /* set flag that there is work to be done */
-    wakened = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
 /* SIGUSR2 signal handler for archiver process */
 static void
 pgarch_waken_stop(SIGNAL_ARGS)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 2b9ab32293..fab4a9dd51 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -146,7 +146,8 @@
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
 #define BACKEND_TYPE_BGWORKER    0x0008    /* bgworker process */
-#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
+#define BACKEND_TYPE_ARCHIVER    0x0010    /* archiver process */
+#define BACKEND_TYPE_ALL        0x001F    /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -539,6 +540,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1785,7 +1787,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -3055,7 +3057,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3190,20 +3192,16 @@ reaper(SIGNAL_ARGS)
         }
 
         /*
-         * Was it the archiver?  If so, just try to start a new one; no need
-         * to force reset of the rest of the system.  (If fail, we'll try
-         * again in future cycles of the main loop.).  Unless we were waiting
-         * for it to shut down; don't restart it in that case, and
-         * PostmasterStateMachine() will advance to the next shutdown step.
+         * Was it the archiver?  Normal exit can be ignored; we'll start a new
+         * one at the next iteration of the postmaster's main loop, if
+         * necessary. Any other exit condition is treated as a crash.
          */
         if (pid == PgArchPID)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3451,7 +3449,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3656,6 +3654,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3928,6 +3938,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5208,7 +5219,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5251,16 +5262,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (StartWorkerNeeded || HaveCrashedWorker)
         maybe_start_bgworkers();
 
-    if (CheckPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER) &&
-        PgArchPID != 0)
-    {
-        /*
-         * Send SIGUSR1 to archiver process, to wake it up and begin archiving
-         * next WAL file.
-         */
-        signal_child(PgArchPID, SIGUSR1);
-    }
-
     /* Tell syslogger to rotate logfile if requested */
     if (SysLoggerPID != 0)
     {
@@ -5493,6 +5494,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 331497bcfb..f38eaee092 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -311,6 +311,8 @@ extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern void XLogArchiveWakeupStart(void);
+extern void XLogArchiveWakeupEnd(void);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool PromoteIsTriggered(void);
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 27ded593ab..a272d62b1f 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -331,6 +331,7 @@ extern void ExecuteRecoveryCommand(const char *command, const char *commandName,
 extern void KeepFileRestoredFromArchive(const char *path, const char *xlogfname);
 extern void XLogArchiveNotify(const char *xlog);
 extern void XLogArchiveNotifySeg(XLogSegNo segno);
+extern void XLogArchiveWakeup(void);
 extern void XLogArchiveForceDone(const char *xlog);
 extern bool XLogArchiveCheckDone(const char *xlog);
 extern bool XLogArchiveIsBusy(const char *xlog);
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14fa127ab1..619b2f9c71 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -417,6 +417,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -429,6 +430,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index b3200874ca..e3ffc63f14 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 56c5ec4481..c691acf8cd 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -34,7 +34,6 @@ typedef enum
 {
     PMSIGNAL_RECOVERY_STARTED,    /* recovery has started */
     PMSIGNAL_BEGIN_HOT_STANDBY, /* begin Hot Standby */
-    PMSIGNAL_WAKEN_ARCHIVER,    /* send a NOTIFY signal to xlog archiver */
     PMSIGNAL_ROTATE_LOGFILE,    /* send SIGUSR1 to syslogger to rotate logfile */
     PMSIGNAL_START_AUTOVAC_LAUNCHER,    /* start an autovacuum launcher */
     PMSIGNAL_START_AUTOVAC_WORKER,    /* start an autovacuum worker */
-- 
2.18.2

From 0d409e7384f01a6a69374e90731efcb357905b4f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:14 +0900
Subject: [PATCH v27 5/7] Shared-memory based stats collector

Previously, activity statistics were collected via sockets and shared
among backends through files written periodically. Such files can reach
tens of megabytes and are created as often as every 500ms; the stats
collector serializes this data and every backend periodically
de-serializes it. To avoid that cost, this patch places activity
statistics in shared memory. Each backend accumulates statistics
locally, then tries to move them into the shared statistics at
transaction end, but at intervals no shorter than 500ms. Locks on the
shared statistics are acquired per object, such as a table or a
function, so the chance of collision is expected to be low.
Furthermore, until 1 second has elapsed since the last flush to shared
stats, a lock failure postpones the flush so that lock contention
doesn't slow down transactions. After that, the flush waits for the
locks so that the shared statistics don't get stale.
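
For reference, the header comment of pgstat.c in this patch describes the
flush schedule with three constants (1000ms, 250ms, 10000ms). The sketch
below is only my reading of that comment, not code from the patch: the
helper names are hypothetical, and the real implementation may clamp or
measure the intervals differently.

```c
#include <assert.h>
#include <stdbool.h>

/* Values quoted from the pgstat.c header comment in this patch (ms). */
#define PGSTAT_MIN_INTERVAL        1000
#define PGSTAT_RETRY_MIN_INTERVAL   250
#define PGSTAT_MAX_INTERVAL       10000

/* Hypothetical: a new flush may start only after the minimum interval. */
static bool
flush_allowed(int ms_since_last_flush)
{
    return ms_since_last_flush >= PGSTAT_MIN_INTERVAL;
}

/*
 * Hypothetical: delay before the next retry after a lock failure.
 * Pass 0 for the first retry; later retries double the previous delay.
 */
static int
next_retry_interval(int current_interval)
{
    assert(current_interval >= 0);
    if (current_interval == 0)
        return PGSTAT_RETRY_MIN_INTERVAL;
    return current_interval * 2;
}

/* Hypothetical: once this is true the flush waits for locks instead. */
static bool
must_force_flush(int ms_since_first_attempt)
{
    return ms_since_first_attempt >= PGSTAT_MAX_INTERVAL;
}

/* Count the non-blocking retries that fit before a forced flush. */
static int
count_nowait_retries(void)
{
    int         elapsed = 0;
    int         interval = 0;
    int         retries = 0;

    for (;;)
    {
        interval = next_retry_interval(interval);
        elapsed += interval;
        if (must_force_flush(elapsed))
            break;
        retries++;
    }
    return retries;
}
```

Under these assumptions a backend that keeps failing to get the lock
retries after 250, 500, 1000, 2000 and 4000ms, and the next attempt
falls past the 10000ms limit, so it waits for the lock.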
---
 src/backend/access/transam/xlog.c            |    4 +-
 src/backend/catalog/index.c                  |   24 +-
 src/backend/postmaster/autovacuum.c          |   54 +-
 src/backend/postmaster/bgwriter.c            |    2 +-
 src/backend/postmaster/checkpointer.c        |   12 +-
 src/backend/postmaster/pgarch.c              |    4 +-
 src/backend/postmaster/pgstat.c              | 4843 +++++++-----------
 src/backend/postmaster/postmaster.c          |   85 +-
 src/backend/storage/buffer/bufmgr.c          |    8 +-
 src/backend/storage/ipc/ipci.c               |    2 +
 src/backend/storage/lmgr/lwlock.c            |    1 +
 src/backend/storage/lmgr/lwlocknames.txt     |    1 +
 src/backend/tcop/postgres.c                  |   26 +-
 src/backend/utils/adt/pgstatfuncs.c          |   53 +-
 src/backend/utils/init/globals.c             |    1 +
 src/backend/utils/init/postinit.c            |   11 +
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |    4 +-
 src/include/miscadmin.h                      |    2 +
 src/include/pgstat.h                         |  514 +-
 src/include/storage/lwlock.h                 |    1 +
 src/include/utils/timeout.h                  |    1 +
 21 files changed, 2055 insertions(+), 3598 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4da7ed3657..cee0572367 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8528,9 +8528,9 @@ LogCheckpointEnd(bool restartpoint)
                         &sync_secs, &sync_usecs);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time +=
+    BgWriterStats.checkpoint_write_time +=
         write_secs * 1000 + write_usecs / 1000;
-    BgWriterStats.m_checkpoint_sync_time +=
+    BgWriterStats.checkpoint_sync_time +=
         sync_secs * 1000 + sync_usecs / 1000;
 
     /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 2d81bc3cbc..4de574ae00 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1687,28 +1687,10 @@ index_concurrently_swap(Oid newIndexId, Oid oldIndexId, const char *oldName)
 
     /*
      * Copy over statistics from old to new index
+     * The data will be sent by the next pgstat_report_stat()
+     * call.
      */
-    {
-        PgStat_StatTabEntry *tabentry;
-
-        tabentry = pgstat_fetch_stat_tabentry(oldIndexId);
-        if (tabentry)
-        {
-            if (newClassRel->pgstat_info)
-            {
-                newClassRel->pgstat_info->t_counts.t_numscans = tabentry->numscans;
-                newClassRel->pgstat_info->t_counts.t_tuples_returned = tabentry->tuples_returned;
-                newClassRel->pgstat_info->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_hit = tabentry->blocks_hit;
-
-                /*
-                 * The data will be sent by the next pgstat_report_stat()
-                 * call.
-                 */
-            }
-        }
-    }
+    pgstat_copy_index_counters(oldIndexId, newClassRel->pgstat_info);
 
     /* Close relations */
     table_close(pg_class, RowExclusiveLock);
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index da75e755f0..c00b04a624 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -336,9 +336,6 @@ static void autovacuum_do_vac_analyze(autovac_table *tab,
                                       BufferAccessStrategy bstrategy);
 static AutoVacOpts *extract_autovac_opts(HeapTuple tup,
                                          TupleDesc pg_class_desc);
-static PgStat_StatTabEntry *get_pgstat_tabentry_relid(Oid relid, bool isshared,
-                                                      PgStat_StatDBEntry *shared,
-                                                      PgStat_StatDBEntry *dbentry);
 static void perform_work_item(AutoVacuumWorkItem *workitem);
 static void autovac_report_activity(autovac_table *tab);
 static void autovac_report_workitem(AutoVacuumWorkItem *workitem,
@@ -1936,8 +1933,6 @@ do_autovacuum(void)
     HASHCTL        ctl;
     HTAB       *table_toast_map;
     ListCell   *volatile cell;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     BufferAccessStrategy bstrategy;
     ScanKeyData key;
     TupleDesc    pg_class_desc;
@@ -1956,12 +1951,6 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /*
-     * may be NULL if we couldn't find an entry (only happens if we are
-     * forcing a vacuum for anti-wrap purposes).
-     */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
 
@@ -2009,9 +1998,6 @@ do_autovacuum(void)
     /* StartTransactionCommand changed elsewhere */
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-
     classRel = table_open(RelationRelationId, AccessShareLock);
 
     /* create a copy so we can use it after closing pg_class */
@@ -2090,8 +2076,8 @@ do_autovacuum(void)
 
         /* Fetch reloptions and the pgstat entry for this table */
         relopts = extract_autovac_opts(tuple, pg_class_desc);
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                       relid);
 
         /* Check if it needs vacuum or analyze */
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
@@ -2174,8 +2160,8 @@ do_autovacuum(void)
         }
 
         /* Fetch the pgstat entry for this table */
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                       relid);
 
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
@@ -2734,29 +2720,6 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
     return av;
 }
 
-/*
- * get_pgstat_tabentry_relid
- *
- * Fetch the pgstat entry of a table, either local to a database or shared.
- */
-static PgStat_StatTabEntry *
-get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
-                          PgStat_StatDBEntry *dbentry)
-{
-    PgStat_StatTabEntry *tabentry = NULL;
-
-    if (isshared)
-    {
-        if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
-    }
-    else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
-
-    return tabentry;
-}
 
 /*
  * table_recheck_autovac
@@ -2777,17 +2740,12 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     bool        doanalyze;
     autovac_table *tab = NULL;
     PgStat_StatTabEntry *tabentry;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     bool        wraparound;
     AutoVacOpts *avopts;
 
     /* use fresh stats */
     autovac_refresh_stats();
 
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* fetch the relation's relcache entry */
     classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
     if (!HeapTupleIsValid(classTup))
@@ -2811,8 +2769,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     }
 
     /* fetch the pgstat table entry */
-    tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                         shared, dbentry);
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                   relid);
 
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 069e27e427..94bdd664b5 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -236,7 +236,7 @@ BackgroundWriterMain(void)
         /*
          * Send off activity statistics to the stats collector
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index e354a78725..8a2fd0ddb2 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -350,7 +350,7 @@ CheckpointerMain(void)
         if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
         {
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            BgWriterStats.requested_checkpoints++;
         }
 
         /*
@@ -364,7 +364,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                BgWriterStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -492,7 +492,7 @@ CheckpointerMain(void)
          * worth the trouble to split the stats support into two independent
          * stats message types.)
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         /*
          * Sleep until we are signaled or it's time for another checkpoint or
@@ -693,7 +693,7 @@ CheckpointWriteDelay(int flags, double progress)
         /*
          * Report interim activity statistics to the stats collector.
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1238,8 +1238,8 @@ AbsorbSyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    BgWriterStats.buf_written_backend += CheckpointerShmem->num_backend_writes;
+    BgWriterStats.buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 6fe7a136ba..f0b524ca50 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -401,7 +401,7 @@ pgarch_ArchiverCopyLoop(void)
                  * Tell the collector about the WAL file that we successfully
                  * archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_report_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
@@ -411,7 +411,7 @@ pgarch_ArchiverCopyLoop(void)
                  * Tell the collector about the WAL file that we failed to
                  * archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_report_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 4763c24be9..c0760854f4 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,22 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Activity Statistics facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects activity statistics, e.g. per-table access statistics, of
+ *  all backends in shared memory. The activity numbers are first stored
+ *  locally in each process, then flushed to shared memory at commit
+ *  time or by idle-timeout.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ * To avoid congestion on the shared memory, shared stats are updated no more
+ * often than once per PGSTAT_MIN_INTERVAL (1000ms). If some local numbers
+ * remain unflushed because of lock failure, we retry at intervals that start
+ * at PGSTAT_RETRY_MIN_INTERVAL (250ms) and double at every retry. Finally we
+ * force an update PGSTAT_MAX_INTERVAL (10000ms) after the first attempt.
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the activity statistics facility creates the
+ *  area and loads the stored stats file, if any; the last process at shutdown
+ *  writes the shared stats to the file and destroys the area before exiting.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -19,18 +26,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -40,68 +35,43 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/fork_process.h"
 #include "postmaster/interrupt.h"
 #include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
-#include "utils/timestamp.h"
 
 /* ----------
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
+#define PGSTAT_MIN_INTERVAL            1000    /* Minimum interval of stats data
+                                             * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_RETRY_MIN_INTERVAL    250 /* Initial retry interval after
+                                         * PGSTAT_MIN_INTERVAL */
 
+#define PGSTAT_MAX_INTERVAL            10000    /* Longest interval of stats data
+                                             * updates */
 
 /* ----------
- * The initial size hints for the hash tables used in the collector.
+ * The initial size hints for the hash tables used in the activity statistics.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
-#define PGSTAT_TAB_HASH_SIZE    512
+#define PGSTAT_TABLE_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
 
@@ -116,7 +86,6 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
-
 /* ----------
  * GUC parameters
  * ----------
@@ -131,75 +100,162 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used, but will be removed with GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
 /*
- * BgWriter global statistics counters (unused in other processes).
- * Stored directly in a stats message structure so it can be sent
- * without needing to copy things around.  We assume this inits to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
-
-/* ----------
- * Local data
- * ----------
+ * Shared stats bootstrap information, protected by StatsLock.
+ *
+ * refcount is used to know whether a process going to detach shared stats is
+ * the last process or not. The last process writes out the stats files.
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
+typedef struct StatsShmemStruct
+{
+    dsa_handle    stats_dsa_handle;    /* handle for stats data area */
+    dshash_table_handle hash_handle;    /* shared dbstat hash */
+    dsa_pointer global_stats;    /* DSA pointer to global stats */
+    dsa_pointer archiver_stats; /* Ditto for archiver stats */
+    int            refcount;        /* # of processes that is attaching the shared
+                                 * stats memory */
+}            StatsShmemStruct;
 
-static time_t last_pgstat_start_time;
+/* BgWriter global statistics counters */
+PgStat_BgWriter BgWriterStats = {0};
 
-static bool pgStatRunningInCollector = false;
+/* backend-lifetime storages */
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
 
 /*
- * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
+ * Enums and types to define shared statistics structure.
+ *
+ * Statistics entries for each object are stored in individually DSA-allocated
+ * memory. Every entry is pointed to from the dshash pgStatSharedHash via a
+ * dsa_pointer. This structure keeps object-stats entries from being moved by
+ * dshash resizing and lets the dshash release its lock sooner on stats
+ * updates. It also reduces interference among write-locks on each stat entry
+ * by not relying on the partition lock of dshash. pgStatLocalHashEntry is the
+ * equivalent of pgStatSharedHash for local stat entries.
+ *
+ * Each stat entry is enveloped by the type PgStatEnvelope, which stores the
+ * attributes common to all kinds of statistics and an LWLock object.
  *
- * NOTE: once allocated, TabStatusArray structures are never moved or deleted
- * for the life of the backend.  Also, we zero out the t_id fields of the
- * contained PgStat_TableStatus structs whenever they are not actively in use.
- * This allows relcache pgstat_info pointers to be treated as long-lived data,
- * avoiding repeated searches in pgstat_initstats() when a relation is
- * repeatedly opened during a transaction.
+ * Shared stats are stored as:
+ *
+ * dshash pgStatSharedHash
+ *    -> PgStatHashEntry                (dshash entry)
+ *      (dsa_pointer)-> PgStatEnvelope    (dsa memory block)
+ *
+ * Local stats are stored as:
+ *
+ * dynahash pgStatLocalHash
+ *    -> PgStatLocalHashEntry            (dynahash entry)
+ *      (direct pointer)-> PgStatEnvelope (palloc'ed memory)
  */
-#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
 
-typedef struct TabStatusArray
+/* The types of statistics entries. */
+typedef enum PgStatTypes
 {
-    struct TabStatusArray *tsa_next;    /* link to next array, if any */
-    int            tsa_used;        /* # entries currently used */
-    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
-} TabStatusArray;
-
-static TabStatusArray *pgStatTabList = NULL;
+    PGSTAT_TYPE_ALL,            /* Not a type, for the parameters of
+                                 * pgstat_collect_stat_entries */
+    PGSTAT_TYPE_DB,                /* database-wide statistics */
+    PGSTAT_TYPE_TABLE,            /* per-table statistics */
+    PGSTAT_TYPE_FUNCTION,        /* per-function statistics */
+}            PgStatTypes;
 
 /*
- * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
+ * entry size lookup table of shared statistics entries corresponding to
+ * PgStatTypes
  */
-typedef struct TabStatHashEntry
+static size_t pgstat_entsize[] =
 {
-    Oid            t_id;
-    PgStat_TableStatus *tsa_entry;
-} TabStatHashEntry;
+    0,                            /* PGSTAT_TYPE_ALL: not an entry */
+    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_StatTabEntry),    /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_StatFuncEntry)    /* PGSTAT_TYPE_FUNCTION */
+};
 
-/*
- * Hash table for O(1) t_id -> tsa_entry lookup
- */
-static HTAB *pgStatTabHash = NULL;
+/* Ditto for local statistics entries */
+static size_t pgstat_localentsize[] =
+{
+    0,                            /* PGSTAT_TYPE_ALL: not an entry */
+    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_TableStatus), /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_BackendFunctionEntry) /* PGSTAT_TYPE_FUNCTION */
+};
+
+/* struct for shared statistics hash entry key. */
+typedef struct PgStatHashEntryKey
+{
+    PgStatTypes type;            /* statistics entry type */
+    Oid            databaseid;        /* database ID. InvalidOid for shared objects. */
+    Oid            objectid;        /* object OID */
+}            PgStatHashEntryKey;
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * Stats numbers that are waiting to be flushed out to shared stats are held in
+ * pgStatLocalHash.
  */
-static HTAB *pgStatFunctions = NULL;
+typedef struct PgStatHashEntry
+{
+    PgStatHashEntryKey key;        /* hash key */
+    dsa_pointer env;            /* pointer to shared stats envelope in DSA */
+}            PgStatHashEntry;
+
+/* struct for shared statistics entry pointed from shared hash entry. */
+typedef struct PgStatEnvelope
+{
+    PgStatTypes type;            /* statistics entry type */
+    Oid            databaseid;        /* databaseid */
+    Oid            objectid;        /* objectid */
+    size_t        len;            /* length of body, fixed per type. */
+    LWLock        lock;            /* lightweight lock to protect body */
+    int            body[FLEXIBLE_ARRAY_MEMBER];    /* statistics body */
+}            PgStatEnvelope;
+
+#define PgStatEnvelopeSize(bodylen) \
+    (offsetof(PgStatEnvelope, body) + (bodylen))
+
+/* struct for shared statistics local hash entry. */
+typedef struct PgStatLocalHashEntry
+{
+    PgStatHashEntryKey key;        /* hash key */
+    PgStatEnvelope *env;        /* pointer to stats envelope in heap */
+}            PgStatLocalHashEntry;
 
 /*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
+ * A snapshot is a locally copied stats entry that offers stable values for a
+ * transaction.
  */
-static bool have_function_stats = false;
+typedef struct PgStatSnapshot
+{
+    PgStatHashEntryKey key;
+    bool        negative;
+    int            body[FLEXIBLE_ARRAY_MEMBER];    /* statistics body */
+}            PgStatSnapshot;
+
+#define PgStatSnapshotSize(bodylen)                \
+    (offsetof(PgStatSnapshot, body) + (bodylen))
+
+/* parameter for shared hashes */
+static const dshash_parameters dsh_rootparams = {
+    sizeof(PgStatHashEntryKey),
+    sizeof(PgStatHashEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+
+/* The shared hash to index activity stats entries. */
+static dshash_table *pgStatSharedHash = NULL;
+
+/* Local stats numbers are stored here */
+static HTAB *pgStatLocalHash = NULL;
+
+#define HAVE_ANY_PENDING_STATS()                        \
+    (pgStatLocalHash != NULL ||                            \
+     pgStatXactCommit != 0 || pgStatXactRollback != 0)
 
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
@@ -236,11 +292,10 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
+static MemoryContext pgStatSnapshotContext = NULL;
+static HTAB *pgStatSnapshotHash = NULL;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -249,19 +304,17 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Cluster wide statistics.
+ *
+ * Contains statistics that are collected neither per database nor per
+ * table.  shared_* point to shared memory and snapshot_* are backend
+ * snapshots.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
+static bool global_snapshot_is_valid = false;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats snapshot_globalStats;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -270,523 +323,269 @@ static List *pending_write_requests = NIL;
  */
 static instr_time total_func_time;
 
+/*
+ * Newly created shared stats entries need to be initialized before other
+ * processes can access them. get_stat_entry() calls this for that purpose.
+ */
+typedef void (*entry_initializer) (PgStatEnvelope * env);
 
 /* ----------
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgstat_beshutdown_hook(int code, Datum arg);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                                                 Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStatEnvelope * get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                                       bool nowait,
+                                       entry_initializer initfunc, bool *found);
+static PgStatEnvelope * *collect_stat_entries(PgStatTypes type, Oid dbid);
+static void create_missing_dbentries(void);
+static void pgstat_write_database_stats(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_db_statsfile(PgStat_StatDBEntry *dbentry);
+
+static bool flush_tabstat(PgStatEnvelope * env, bool nowait);
+static bool flush_funcstat(PgStatEnvelope * env, bool nowait);
+static bool flush_dbstat(PgStatEnvelope * env, bool nowait);
+
+static void init_dbentry(PgStatEnvelope * env);
+static void init_funcentry(PgStatEnvelope * env);
+static void init_tabentry(PgStatEnvelope * env);
+
+static bool delete_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                              bool nowait);
+static PgStatEnvelope * get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                                             bool create, bool *found);
+static PgStat_StatDBEntry *get_local_dbstat_entry(Oid dbid);
+static PgStat_TableStatus *get_local_tabstat_entry(Oid rel_id, bool isshared);
+
+static PgStat_SubXactStatus *get_tabstat_stack_level(int nest_level);
+static void add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level);
+static void pgstat_snapshot_global_stats(void);
+
 static void pgstat_read_current_status(void);
-
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
-
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
-static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
-
-static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
-
-static void pgstat_setup_memcxt(void);
-
 static const char *pgstat_get_wait_activity(WaitEventActivity w);
 static const char *pgstat_get_wait_client(WaitEventClient w);
 static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute shared memory space needed for activity statistics
+ */
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+/*
+ * StatsShmemInit - initialize during shared-memory creation
  */
 void
-pgstat_init(void)
+StatsShmemInit(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
+    bool        found;
 
-#define TESTBYTEVAL ((char) 199)
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
 
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
-
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
+    if (!IsUnderPostmaster)
     {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
     }
+}
 
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ *    Create pgStatLocalContext and pgStatSnapshotContext, if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+    if (!pgStatLocalContext)
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+
+    if (!pgStatSnapshotContext)
+        pgStatSnapshotContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Database statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+}
 
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
 
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
+/* ----------
+ * attach_shared_stats() -
+ *
+ *    Attach or create shared stats memory. If we are the first process to use
+ *    the activity stats system, read the saved statistics files, if any.
+ * ---------
+ */
+static void
+attach_shared_stats(void)
+{
+    MemoryContext oldcontext;
 
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    /*
+     * Don't use DSM outside a postmaster child, or when not tracking counts.
+     */
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
 
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    pgstat_setup_memcxt();
 
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    if (area)
+        return;
 
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
 
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
+    /*
+     * The last process is responsible for writing out the stats files at exit.
+     * Maintain a refcount so that an exiting process can tell whether it is
+     * the last one.
+     */
+    if (StatsShmem->refcount > 0)
+        StatsShmem->refcount++;
+    else
+    {
+        /* We're the first process to attach the shared stats memory */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
 
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        /* Initialize shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatSharedHash = dshash_create(area, &dsh_rootparams, 0);
 
-        test_byte++;            /* just make sure variable is changed */
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->global_stats =
+            dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+        StatsShmem->archiver_stats =
+            dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+        StatsShmem->hash_handle = dshash_get_hash_table_handle(pgStatSharedHash);
 
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
 
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        /* Load saved data if any. */
+        pgstat_read_statsfiles();
 
-        /* If we get here, we have a working socket */
-        break;
+        StatsShmem->refcount = 1;
     }
 
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
+    LWLockRelease(StatsLock);
 
     /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
+     * If we're not the first process, attach existing shared stats area
+     * outside the StatsLock section.
      */
-    if (!pg_set_noblock(pgStatSock))
+    if (!area)
     {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
+        /* Attach shared area. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatSharedHash = dshash_attach(area, &dsh_rootparams,
+                                         StatsShmem->hash_handle, 0);
+
+        /* Setup local variables */
+        pgStatLocalHash = NULL;
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
     }
 
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
-    }
-
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    /* Now that we have a long-lived socket, tell fd.c about it. */
-    ReserveExternalFD();
+    MemoryContextSwitchTo(oldcontext);
 
-    return;
-
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
-
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
-
-    /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
-     */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    /* don't detach automatically */
+    dsa_pin_mapping(area);
+    global_snapshot_is_valid = false;
 }
 
-/*
- * subroutine for pgstat_reset_all
+/* ----------
+ * detach_shared_stats() -
+ *
+ *    Detach shared stats. Write out to file if we're the last process and told
+ *    to do so.
+ * ----------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+detach_shared_stats(bool write_stats)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    /* immediately return if useless */
+    if (!area || !IsUnderPostmaster)
+        return;
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
-    {
-        int            nchars;
-        Oid            tmp_oid;
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
+    if (--StatsShmem->refcount < 1)
+    {
         /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
+         * This process is the last one attached to the shared stats
+         * memory. Write out the stats files if requested.
          */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
-
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
+        if (write_stats)
+            pgstat_write_statsfiles();
 
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+        /* No one is using the area. */
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
     }
-    FreeDir(dir);
+
+    LWLockRelease(StatsLock);
+
+    /*
+     * Detach the area. It is destroyed automatically when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);
+
+    area = NULL;
+    pgStatSharedHash = NULL;
+    shared_globalStats = NULL;
+    shared_archiverStats = NULL;
+    pgStatLocalHash = NULL;
+    global_snapshot_is_valid = false;
 }
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
-}
+    /* standalone server doesn't use shared stats */
+    if (!IsUnderPostmaster)
+        return;
 
-#ifdef EXEC_BACKEND
+    /* we must have shared stats attached */
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
-    char       *av[10];
-    int            ac = 0;
-
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
-
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
-
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
-
-
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
-
-    /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
-     */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
-
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
+    /* Startup must be the only user of shared stats */
+    Assert(StatsShmem->refcount == 1);
 
     /*
-     * Okay, fork off the collector.
+     * We could directly remove files and recreate the shared memory area, but
+     * for simplicity we just discard it and then create it again.
      */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
+    detach_shared_stats(false); /* Don't write files. */
+    attach_shared_stats();
 }
 
 /* ------------------------------------------------------------
@@ -794,144 +593,479 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *    Updates are applied no more frequently than once every
+ *    PGSTAT_MIN_INTERVAL milliseconds. They are also postponed on lock
+ *    failure if force is false and no update has been pending for longer
+ *    than PGSTAT_MAX_INTERVAL milliseconds. Postponed updates are retried
+ *    in succeeding calls of this function.
+ *
+ *    Returns the time in milliseconds until the next attempt to apply
+ *    updates if any updates remain pending, or 0 if nothing is left to
+ *    apply.
+ *
+ *    Note that this is called only outside a transaction, so it is fine to use
  *    transaction stop time as an approximation of current time.
- * ----------
+ *    ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz next_flush = 0;
+    static TimestampTz pending_since = 0;
+    static long retry_interval = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
+    bool        nowait = !force;    /* Don't use force ever after */
+    HASH_SEQ_STATUS scan;
+    PgStatLocalHashEntry *lent;
+    PgStatLocalHashEntry **dbentlist;
+    int            dbentlistlen = 8;
+    int            ndbentries = 0;
+    int            remains = 0;
     int            i;
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (area == NULL || !HAVE_ANY_PENDING_STATS())
+        return 0;
+
+    dbentlist = palloc(sizeof(PgStatLocalHashEntry *) * dbentlistlen);
 
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
     now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
 
-    /*
-     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
-     */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
-    pgStatTabHash = NULL;
+    if (nowait)
+    {
+        /*
+         * Don't flush stats too frequently.  Return the time to the next
+         * flush.
+         */
+        if (now < next_flush)
+        {
+            /* Record the epoch time if retrying. */
+            if (pending_since == 0)
+                pending_since = now;
+
+            return (next_flush - now) / 1000;
+        }
+
+        /* But, don't keep pending updates longer than PGSTAT_MAX_INTERVAL. */
+
+        if (pending_since > 0 &&
+            TimestampDifferenceExceeds(pending_since, now, PGSTAT_MAX_INTERVAL))
+            nowait = false;
+    }
 
     /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
+     * flush_tabstat adds some of the counts of flushed entries to the local
+     * database stats, so flush out the database stats afterwards.
      */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
-    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+    if (pgStatLocalHash)
     {
-        for (i = 0; i < tsa->tsa_used; i++)
+        /* Step 1: flush out other than database stats */
+        hash_seq_init(&scan, pgStatLocalHash);
+        while ((lent = (PgStatLocalHashEntry *) hash_seq_search(&scan)) != NULL)
         {
-            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
+            bool        remove = false;
 
-            /* Shouldn't have any pending transaction-dependent counts */
-            Assert(entry->trans == NULL);
+            switch (lent->env->type)
+            {
+                case PGSTAT_TYPE_DB:
+                    if (ndbentries >= dbentlistlen)
+                    {
+                        dbentlistlen *= 2;
+                        dbentlist = repalloc(dbentlist,
+                                             sizeof(PgStatLocalHashEntry *) *
+                                             dbentlistlen);
+                    }
+                    dbentlist[ndbentries++] = lent;
+                    break;
+                case PGSTAT_TYPE_TABLE:
+                    if (flush_tabstat(lent->env, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_FUNCTION:
+                    if (flush_funcstat(lent->env, nowait))
+                        remove = true;
+                    break;
+                default:
+                    Assert(false);
+            }
 
-            /*
-             * Ignore entries that didn't accumulate any actual counts, such
-             * as indexes that were opened by the planner but not used.
-             */
-            if (memcmp(&entry->t_counts, &all_zeroes,
-                       sizeof(PgStat_TableCounts)) == 0)
+            if (!remove)
+            {
+                remains++;
                 continue;
+            }
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : ®ular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* Remove the successfully flushed entry */
+            pfree(lent->env);
+            hash_search(pgStatLocalHash, &lent->key, HASH_REMOVE, NULL);
+        }
+
+        /* Step 2: flush out database stats */
+        for (i = 0; i < ndbentries; i++)
+        {
+            PgStatLocalHashEntry *lent = dbentlist[i];
+
+            if (flush_dbstat(lent->env, nowait))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                remains--;
+                /* Remove the successfully flushed entry */
+                pfree(lent->env);
+                hash_search(pgStatLocalHash, &lent->key, HASH_REMOVE, NULL);
             }
         }
-        /* zero out PgStat_TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
+        pfree(dbentlist);
+
+        if (remains <= 0)
+        {
+            hash_destroy(pgStatLocalHash);
+            pgStatLocalHash = NULL;
+        }
+    }
+
+    /* Publish the last flush time */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (shared_globalStats->stats_timestamp < now)
+        shared_globalStats->stats_timestamp = now;
+    LWLockRelease(StatsLock);
+
+    /*
+     * If we have pending local stats, let the caller know the retry interval.
+     */
+    if (HAVE_ANY_PENDING_STATS())
+    {
+        /* Retain the epoch time */
+        if (pending_since == 0)
+            pending_since = now;
+
+        /* The interval is doubled at every retry. */
+        if (retry_interval == 0)
+            retry_interval = PGSTAT_RETRY_MIN_INTERVAL * 1000;
+        else
+            retry_interval = retry_interval * 2;
+
+        /*
+         * Determine the next retry interval so as not to get shorter than the
+         * previous interval.
+         */
+        if (!TimestampDifferenceExceeds(pending_since,
+                                        now + 2 * retry_interval,
+                                        PGSTAT_MAX_INTERVAL))
+            next_flush = now + retry_interval;
+        else
+        {
+            next_flush = pending_since + PGSTAT_MAX_INTERVAL * 1000;
+            retry_interval = next_flush - now;
+        }
+
+        return retry_interval / 1000;
     }
 
+    /* Set the next time to update stats */
+    next_flush = now + PGSTAT_MIN_INTERVAL * 1000;
+    retry_interval = 0;
+    pending_since = 0;
+
+    return 0;
+}
+
+/*
+ * flush_tabstat - flush out a local table stats entry
+ *
+ * Some of the stats numbers are copied to local database stats entry after
+ * successful flush-out.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_tabstat(PgStatEnvelope * lenv, bool nowait)
+{
+    static const PgStat_TableCounts all_zeroes;
+    Oid            dboid;            /* database OID of the table */
+    PgStat_TableStatus *lstats; /* local stats entry  */
+    PgStatEnvelope *shenv;        /* shared stats envelope */
+    PgStat_StatTabEntry *shtabstats;    /* table entry of shared stats */
+    PgStat_StatDBEntry *ldbstats;    /* local database entry */
+    bool        found;
+
+    Assert(lenv->type == PGSTAT_TYPE_TABLE);
+
+    lstats = (PgStat_TableStatus *) &lenv->body;
+    dboid = lstats->t_shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Ignore entries that didn't accumulate any actual counts, such as
+     * indexes that were opened by the planner but not used.
+     */
+    if (memcmp(&lstats->t_counts, &all_zeroes,
+               sizeof(PgStat_TableCounts)) == 0)
+        return true;
+
+    /* find shared table stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_TABLE, dboid, lstats->t_id,
+                           nowait, init_tabentry, &found);
+
+    /* skip if dshash failed to acquire lock */
+    if (shenv == NULL)
+        return false;
+
+    /* retrieve the shared table stats entry from the envelope */
+    shtabstats = (PgStat_StatTabEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;
+
+    /* add the values to the shared entry. */
+    shtabstats->numscans += lstats->t_counts.t_numscans;
+    shtabstats->tuples_returned += lstats->t_counts.t_tuples_returned;
+    shtabstats->tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    shtabstats->tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    shtabstats->tuples_updated += lstats->t_counts.t_tuples_updated;
+    shtabstats->tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    shtabstats->tuples_hot_updated += lstats->t_counts.t_tuples_hot_updated;
+
+    /*
+     * If the table was truncated or vacuum/analyze has run, first reset the
+     * live/dead counters.
+     */
+    if (lstats->t_counts.t_truncated ||
+        lstats->t_counts.vacuum_count > 0 ||
+        lstats->t_counts.analyze_count > 0 ||
+        lstats->t_counts.autovac_vacuum_count > 0 ||
+        lstats->t_counts.autovac_analyze_count > 0)
+    {
+        shtabstats->n_live_tuples = 0;
+        shtabstats->n_dead_tuples = 0;
+    }
+
+    /* clear the change counter if requested */
+    if (lstats->t_counts.reset_changed_tuples)
+        shtabstats->changes_since_analyze = 0;
+
+    shtabstats->n_live_tuples += lstats->t_counts.t_delta_live_tuples;
+    shtabstats->n_dead_tuples += lstats->t_counts.t_delta_dead_tuples;
+    shtabstats->changes_since_analyze += lstats->t_counts.t_changed_tuples;
+    shtabstats->blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    shtabstats->blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /*
+     * Update vacuum/analyze timestamps and counters, making sure that the
+     * timestamps never go backward.
+     */
+    if (shtabstats->vacuum_timestamp < lstats->vacuum_timestamp)
+        shtabstats->vacuum_timestamp = lstats->vacuum_timestamp;
+    shtabstats->vacuum_count += lstats->t_counts.vacuum_count;
+
+    if (shtabstats->autovac_vacuum_timestamp < lstats->autovac_vacuum_timestamp)
+        shtabstats->autovac_vacuum_timestamp = lstats->autovac_vacuum_timestamp;
+    shtabstats->autovac_vacuum_count += lstats->t_counts.autovac_vacuum_count;
+
+    if (shtabstats->analyze_timestamp < lstats->analyze_timestamp)
+        shtabstats->analyze_timestamp = lstats->analyze_timestamp;
+    shtabstats->analyze_count += lstats->t_counts.analyze_count;
+
+    if (shtabstats->autovac_analyze_timestamp < lstats->autovac_analyze_timestamp)
+        shtabstats->autovac_analyze_timestamp = lstats->autovac_analyze_timestamp;
+    shtabstats->autovac_analyze_count += lstats->t_counts.autovac_analyze_count;
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    shtabstats->n_live_tuples = Max(shtabstats->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    shtabstats->n_dead_tuples = Max(shtabstats->n_dead_tuples, 0);
+
+    LWLockRelease(&shenv->lock);
+
+    /* The entry was flushed successfully, so add the same counts to database stats */
+    ldbstats = get_local_dbstat_entry(dboid);
+    ldbstats->counts.n_tuples_returned += lstats->t_counts.t_tuples_returned;
+    ldbstats->counts.n_tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    ldbstats->counts.n_tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    ldbstats->counts.n_tuples_updated += lstats->t_counts.t_tuples_updated;
+    ldbstats->counts.n_tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    ldbstats->counts.n_blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    ldbstats->counts.n_blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    return true;
+}
+
+/* ----------
+ * init_tabentry() -
+ *
+ * initializes table stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
+ */
+static void
+init_tabentry(PgStatEnvelope * env)
+{
+    PgStat_StatTabEntry *tabent = (PgStat_StatTabEntry *) &env->body;
+
     /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
+     * Initialize all counters and timestamps of a new table entry to zero;
+     * the caller adds the actual counts afterwards.
      */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(®ular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
-
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+    Assert(env->type == PGSTAT_TYPE_TABLE);
+    tabent->tableid = env->objectid;
+    tabent->numscans = 0;
+    tabent->tuples_returned = 0;
+    tabent->tuples_fetched = 0;
+    tabent->tuples_inserted = 0;
+    tabent->tuples_updated = 0;
+    tabent->tuples_deleted = 0;
+    tabent->tuples_hot_updated = 0;
+    tabent->n_live_tuples = 0;
+    tabent->n_dead_tuples = 0;
+    tabent->changes_since_analyze = 0;
+    tabent->blocks_fetched = 0;
+    tabent->blocks_hit = 0;
+
+    tabent->vacuum_timestamp = 0;
+    tabent->vacuum_count = 0;
+    tabent->autovac_vacuum_timestamp = 0;
+    tabent->autovac_vacuum_count = 0;
+    tabent->analyze_timestamp = 0;
+    tabent->analyze_count = 0;
+    tabent->autovac_analyze_timestamp = 0;
+    tabent->autovac_analyze_count = 0;
 }
 
+
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * flush_funcstat - flush out a local function stats entry
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_funcstat(PgStatEnvelope * env, bool nowait)
+{
+    /* we assume this inits to all zeroes: */
+    static const PgStat_FunctionCounts all_zeroes;
+    PgStat_BackendFunctionEntry *localent;    /* local stats entry */
+    PgStatEnvelope *shenv;        /* shared stats envelope */
+    PgStat_StatFuncEntry *sharedent = NULL; /* shared stats entry */
+    bool        found;
+
+    Assert(env->type == PGSTAT_TYPE_FUNCTION);
+    localent = (PgStat_BackendFunctionEntry *) &env->body;
+
+    /* Skip it if no counts accumulated for it so far */
+    if (memcmp(&localent->f_counts, &all_zeroes,
+               sizeof(PgStat_FunctionCounts)) == 0)
+        return true;
+
+    /* find shared function stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId, localent->f_id,
+                           nowait, init_funcentry, &found);
+    /* skip if dshash failed to acquire lock */
+    if (shenv == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    /* retrieve the shared function stats entry from the envelope */
+    sharedent = (PgStat_StatFuncEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
+
+    sharedent->f_numcalls += localent->f_counts.f_numcalls;
+    sharedent->f_total_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_total_time);
+    sharedent->f_self_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_self_time);
+
+    LWLockRelease(&shenv->lock);
+
+    return true;
+}
+
+
+/* ----------
+ * init_funcentry() -
+ *
+ * initializes function stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
  */
 static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+init_funcentry(PgStatEnvelope * env)
 {
-    int            n;
-    int            len;
+    PgStat_StatFuncEntry *shstat = (PgStat_StatFuncEntry *) &env->body;
+
+    Assert(env->type == PGSTAT_TYPE_FUNCTION);
+    shstat->functionid = env->objectid;
+    shstat->f_numcalls = 0;
+    shstat->f_total_time = 0;
+    shstat->f_self_time = 0;
+}
+
+
+/*
+ * flush_dbstat - flush out a local database stats entry
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_dbstat(PgStatEnvelope * env, bool nowait)
+{
+    PgStat_StatDBEntry *localent;
+    PgStatEnvelope *shenv;
+    PgStat_StatDBEntry *sharedent;
+
+    Assert(env->type == PGSTAT_TYPE_DB);
+
+    localent = (PgStat_StatDBEntry *) &env->body;
+
+    /* find shared database stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_DB, localent->databaseid, InvalidOid,
+                           nowait, init_dbentry, NULL);
+
+    /* skip if dshash failed to acquire lock */
+    if (!shenv)
+        return false;
+
+    /* retrieve the shared stats entry from the envelope */
+    sharedent = (PgStat_StatDBEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;
+
+    sharedent->counts.n_tuples_returned += localent->counts.n_tuples_returned;
+    sharedent->counts.n_tuples_fetched += localent->counts.n_tuples_fetched;
+    sharedent->counts.n_tuples_inserted += localent->counts.n_tuples_inserted;
+    sharedent->counts.n_tuples_updated += localent->counts.n_tuples_updated;
+    sharedent->counts.n_tuples_deleted += localent->counts.n_tuples_deleted;
+    sharedent->counts.n_blocks_fetched += localent->counts.n_blocks_fetched;
+    sharedent->counts.n_blocks_hit += localent->counts.n_blocks_hit;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    sharedent->counts.n_deadlocks += localent->counts.n_deadlocks;
+    sharedent->counts.n_temp_bytes += localent->counts.n_temp_bytes;
+    sharedent->counts.n_temp_files += localent->counts.n_temp_files;
+    sharedent->counts.n_checksum_failures += localent->counts.n_checksum_failures;
 
     /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
+     * Accumulate xact commit/rollback and I/O timings to stats entry of the
+     * current database.
      */
-    if (OidIsValid(tsmsg->m_databaseid))
+    if (OidIsValid(localent->databaseid))
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
+        sharedent->counts.n_xact_commit += pgStatXactCommit;
+        sharedent->counts.n_xact_rollback += pgStatXactRollback;
+        sharedent->counts.n_block_read_time += pgStatBlockReadTime;
+        sharedent->counts.n_block_write_time += pgStatBlockWriteTime;
         pgStatXactCommit = 0;
         pgStatXactRollback = 0;
         pgStatBlockReadTime = 0;
@@ -939,257 +1073,102 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        sharedent->counts.n_xact_commit = 0;
+        sharedent->counts.n_xact_rollback = 0;
+        sharedent->counts.n_block_read_time = 0;
+        sharedent->counts.n_block_write_time = 0;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
+    LWLockRelease(&shenv->lock);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    return true;
 }
 
+
+/* ----------
+ * init_dbentry() -
+ *
+ * initializes database stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
+ */
+static void
+init_dbentry(PgStatEnvelope * env)
+{
+    PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) &env->body;
+
+    Assert(env->type == PGSTAT_TYPE_DB);
+    dbentry->databaseid = env->databaseid;
+    dbentry->last_autovac_time = 0;
+    dbentry->last_checksum_failure = 0;
+    dbentry->stat_reset_timestamp = 0;
+    dbentry->stats_timestamp = 0;
+    /* initialize the new shared entry */
+    MemSet(&dbentry->counts, 0, sizeof(PgStat_StatDBCounts));
+}
+
+
 /*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
+ * Create the filename for a DB stat file; filename is an output parameter
+ * that points to a character buffer of length len.
  */
 static void
-pgstat_send_funcstats(void)
+get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
-
-    if (pgStatFunctions == NULL)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
-
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
-
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
-
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
-
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
-
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
+    int            printed;
+
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/db_%u.%s",
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
+                       databaseid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
 }
 
 
 /* ----------
- * pgstat_vacuum_stat() -
+ * collect_stat_entries() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *    Collect the shared statistics entries specified by type and dbid. Returns
+ *    a NULL-terminated list of pointers to shared statistics in palloc'ed
+ *    memory. If type is PGSTAT_TYPE_ALL, all types of statistics of the
+ *    database are collected. If type is PGSTAT_TYPE_DB, the parameter dbid is
+ *    ignored and all PGSTAT_TYPE_DB entries are collected.
  * ----------
  */
-void
-pgstat_vacuum_stat(void)
+static PgStatEnvelope **collect_stat_entries(PgStatTypes type, Oid dbid)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Read pg_database and make a list of OIDs of all existing databases
-     */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
-
-    /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            dbid = dbentry->databaseid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        /* the DB entry for shared tables (with InvalidOid) is never dropped */
-        if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
-            pgstat_drop_database(dbid);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Lookup our own database entry; if not found, nothing more to do.
-     */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
-        return;
-
-    /*
-     * Similarly to above, make a list of all known relations in this DB.
-     */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
-
-    /*
-     * Check for all tables listed in stats hashtable if they still exist.
-     */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
+    int            listlen = 16;
+    PgStatEnvelope **envlist = palloc(sizeof(PgStatEnvelope *) * listlen);
+    int            n = 0;
+
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
     {
-        Oid            tabid = tabentry->tableid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+        if ((type != PGSTAT_TYPE_ALL && p->key.type != type) ||
+            (type != PGSTAT_TYPE_DB && p->key.databaseid != dbid))
             continue;
 
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
+        if (n >= listlen - 1)
         {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
+            listlen *= 2;
+            envlist = repalloc(envlist, listlen * sizeof(PgStatEnvelope *));
         }
+        envlist[n++] = dsa_get_address(area, p->env);
     }
+    dshash_seq_term(&hstat);
 
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
-        {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
+    envlist[n] = NULL;
 
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
-                continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
-        }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
-    }
+    return envlist;
 }
 
 
 /* ----------
- * pgstat_collect_oids() -
+ * collect_oids() -
  *
  *    Collect the OIDs of all objects listed in the specified system catalog
  *    into a temporary hash table.  Caller should hash_destroy the result
@@ -1198,7 +1177,7 @@ pgstat_vacuum_stat(void)
  * ----------
  */
 static HTAB *
-pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
+collect_oids(Oid catalogid, AttrNumber anum_oid)
 {
     HTAB       *htab;
     HASHCTL        hash_ctl;
@@ -1212,7 +1191,7 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     hash_ctl.entrysize = sizeof(Oid);
     hash_ctl.hcxt = CurrentMemoryContext;
     htab = hash_create("Temporary table of OIDs",
-                       PGSTAT_TAB_HASH_SIZE,
+                       PGSTAT_TABLE_HASH_SIZE,
                        &hash_ctl,
                        HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
 
@@ -1239,65 +1218,184 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
 }
 
 
+/* ----------
+ * pgstat_vacuum_stat() -
+ *
+ *  Delete shared stat entries that are not in system catalogs.
+ *
+ *  To avoid holding exclusive lock on dshash for a long time, the process is
+ *  performed in three steps.
+ *
+ *   1: Collect existent oids of every kind of object.
+ *   2: Collect victim entries by scanning with shared lock.
+ *   3: Try removing every nominated entry without waiting for lock.
+ *
+ *  As a consequence of the last step, some entries may be left behind due to
+ *  lock failure, but that is harmless since they will be deleted by later
+ *  invocations of this function.
+ * ----------
+ */
+void
+pgstat_vacuum_stat(void)
+{
+    HTAB       *dbids;            /* database ids */
+    HTAB       *relids;            /* relation ids in the current database */
+    HTAB       *funcids;        /* function ids in the current database */
+    PgStatEnvelope **victims;    /* victim entry list */
+    int            arraylen = 0;    /* storage size of the above */
+    int            nvictims = 0;    /* # of entries of the above */
+    dshash_seq_status dshstat;
+    PgStatHashEntry *ent;
+    int            i;
+
+    /* we don't collect stats under standalone mode */
+    if (!IsUnderPostmaster)
+        return;
+
+    /* collect oids of existent objects */
+    dbids = collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    relids = collect_oids(RelationRelationId, Anum_pg_class_oid);
+    funcids = collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
+
+    /* collect victims from shared stats */
+    arraylen = 16;
+    victims = palloc(sizeof(PgStatEnvelope *) * arraylen);
+    nvictims = 0;
+
+    dshash_seq_init(&dshstat, pgStatSharedHash, false);
+
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
+    {
+        HTAB       *oidtab;
+        Oid           *key;
+
+        CHECK_FOR_INTERRUPTS();
+
+        /*
+         * Entries other than database entries are examined only when they
+         * belong to the current database.
+         */
+        if (ent->key.type != PGSTAT_TYPE_DB &&
+            ent->key.databaseid != MyDatabaseId)
+            continue;
+
+        switch (ent->key.type)
+        {
+            case PGSTAT_TYPE_DB:
+                /* don't remove database entry for shared tables */
+                if (ent->key.databaseid == 0)
+                    continue;
+                oidtab = dbids;
+                key = &ent->key.databaseid;
+                break;
+
+            case PGSTAT_TYPE_TABLE:
+                oidtab = relids;
+                key = &ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                oidtab = funcids;
+                key = &ent->key.objectid;
+                break;
+            case PGSTAT_TYPE_ALL:
+                Assert(false);
+                break;
+        }
+
+        /* Skip existent objects. */
+        if (hash_search(oidtab, key, HASH_FIND, NULL) != NULL)
+            continue;
+
+        /* extend the list if needed */
+        if (nvictims >= arraylen)
+        {
+            arraylen *= 2;
+            victims = repalloc(victims, sizeof(PgStatEnvelope *) * arraylen);
+        }
+
+        victims[nvictims++] = dsa_get_address(area, ent->env);
+    }
+    dshash_seq_term(&dshstat);
+    hash_destroy(dbids);
+    hash_destroy(relids);
+    hash_destroy(funcids);
+
+    /* Now try removing the victim entries */
+    for (i = 0; i < nvictims; i++)
+    {
+        PgStatEnvelope *p = victims[i];
+
+        delete_stat_entry(p->type, p->databaseid, p->objectid, true);
+    }
+}
+
+
+/* ----------
+ * delete_stat_entry -
+ *
+ *  Deletes the specified entry from shared stats hash
+ *
+ *  Returns true when successfully deleted.
+ * ----------
+ */
+static bool
+delete_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait)
+{
+    PgStatHashEntryKey key;
+    PgStatHashEntry *ent;
+
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    ent = dshash_find_extended(pgStatSharedHash, &key,
+                               true, nowait, false, NULL);
+
+    if (!ent)
+        return false;            /* lock failed or not found */
+
+    /* The entry is exclusively locked, so we can free the chunk first. */
+    dsa_free(area, ent->env);
+    dshash_delete_entry(pgStatSharedHash, ent);
+
+    return true;
+}
+
+
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
- * ----------
+ *    Remove entry for the database that we just dropped.
+ *
+ *    Some entries might be left behind due to lock failure, or some stats
+ *    might be flushed after this, but we will still clean the dead DB
+ *    eventually via future invocations of pgstat_vacuum_stat().
+ * ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
+    Assert(OidIsValid(databaseid));
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
+    envlist = collect_stat_entries(PGSTAT_TYPE_ALL, databaseid);
 
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
+    for (p = envlist; *p != NULL; p++)
+        delete_stat_entry((*p)->type, (*p)->databaseid, (*p)->objectid, true);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
+    pfree(envlist);
 }
-#endif                            /* NOT_USED */
 
 
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1306,20 +1404,47 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    /* Lookup the entries of the current database in the stats hash. */
+    envlist = collect_stat_entries(PGSTAT_TYPE_ALL, MyDatabaseId);
+    for (p = envlist; *p != NULL; p++)
+    {
+        PgStatEnvelope *env = *p;
+        PgStat_StatDBEntry *dbstat;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+        LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+
+        switch (env->type)
+        {
+            case PGSTAT_TYPE_TABLE:
+                init_tabentry(env);
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                init_funcentry(env);
+                break;
+
+            case PGSTAT_TYPE_DB:
+                init_dbentry(env);
+                dbstat = (PgStat_StatDBEntry *) &env->body;
+                dbstat->stat_reset_timestamp = GetCurrentTimestamp();
+                break;
+            default:
+                Assert(false);
+        }
+
+        LWLockRelease(&env->lock);
+    }
+
+    pfree(envlist);
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1328,29 +1453,37 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
+    /* Reset the archiver statistics for the cluster. */
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    /* Reset the bgwriter statistics for the cluster. */
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1359,17 +1492,42 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStatEnvelope *env;
+    PgStat_StatDBEntry *dbentry;
+    PgStatTypes stattype;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    env = get_stat_entry(PGSTAT_TYPE_DB, MyDatabaseId, InvalidOid,
+                         false, NULL, NULL);
+    Assert(env);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* Set the reset timestamp for the whole database */
+    dbentry = (PgStat_StatDBEntry *) &env->body;
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&env->lock);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Determine the stats type of the object whose counters we reset */
+    switch (type)
+    {
+        case RESET_TABLE:
+            stattype = PGSTAT_TYPE_TABLE;
+            break;
+        case RESET_FUNCTION:
+            stattype = PGSTAT_TYPE_FUNCTION;
+    }
+
+    env = get_stat_entry(stattype, MyDatabaseId, objoid, false, NULL, NULL);
+    LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+    if (env->type == PGSTAT_TYPE_TABLE)
+        init_tabentry(env);
+    else
+    {
+        Assert(env->type == PGSTAT_TYPE_FUNCTION);
+        init_funcentry(env);
+    }
+    LWLockRelease(&env->lock);
 }
 
 /* ----------
@@ -1383,48 +1541,63 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    ts = GetCurrentTimestamp();
 
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Store the last autovacuum time in the database's hash table entry.
+     */
+    dbentry = get_local_dbstat_entry(dboid);
+    dbentry->last_autovac_time = ts;
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    PgStat_TableStatus *tabentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    ts = GetCurrentTimestamp();
+    tabentry = get_local_tabstat_entry(tableoid, shared);
+
+    tabentry->t_counts.t_delta_live_tuples = livetuples;
+    tabentry->t_counts.t_delta_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = ts;
+        tabentry->t_counts.autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = ts;
+        tabentry->t_counts.vacuum_count++;
+    }
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1435,9 +1608,10 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    PgStat_TableStatus *tabentry;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1445,10 +1619,10 @@ pgstat_report_analyze(Relation rel,
      * already inserted and/or deleted rows in the target table. ANALYZE will
      * have counted such rows as live or dead respectively. Because we will
      * report our counts of such rows at transaction end, we should subtract
-     * off these counts from what we send to the collector now, else they'll
-     * be double-counted after commit.  (This approach also ensures that the
-     * collector ends up with the right numbers if we abort instead of
-     * committing.)
+     * off these counts from what we write to shared stats now, else
+     * they'll be double-counted after commit.  (This approach also ensures
+     * that the shared stats end up with the right numbers if we abort
+     * instead of committing.)
      */
     if (rel->pgstat_info != NULL)
     {
@@ -1466,158 +1640,172 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    tabentry = get_local_tabstat_entry(RelationGetRelid(rel),
+                                       rel->rd_rel->relisshared);
+
+    tabentry->t_counts.t_delta_live_tuples = livetuples;
+    tabentry->t_counts.t_delta_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->t_counts.reset_changed_tuples = true;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->t_counts.autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->t_counts.analyze_count++;
+    }
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            dbent->counts.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            dbent->counts.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            dbent->counts.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            dbent->counts.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            dbent->counts.n_conflict_startup_deadlock++;
+            break;
+    }
 }
 
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-
-/* --------
- * pgstat_report_checksum_failures_in_db() -
- *
- *    Tell the collector about one or more checksum failures.
- * --------
- */
-void
-pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
-{
-    PgStat_MsgChecksumFailure msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
-
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_deadlocks++;
 }
 
 /* --------
  * pgstat_report_checksum_failure() -
  *
- *    Tell the collector about a checksum failure.
+ *    Report a checksum failure.
  * --------
  */
 void
 pgstat_report_checksum_failure(void)
 {
-    pgstat_report_checksum_failures_in_db(MyDatabaseId, 1);
+    PgStat_StatDBEntry *dbent;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_checksum_failures++;
 }
 
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
+    if (filesize == 0)            /* Is there a case where filesize is really 0? */
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_temp_bytes += filesize; /* needs overflow check */
+    dbent->counts.n_temp_files++;
 }
 
 
-/* ----------
- * pgstat_ping() -
+/* --------
+ * pgstat_report_checksum_failures_in_db() -
  *
- *    Send some junk data to the collector to increase traffic.
- * ----------
+ *    Report one or more checksum failures.
+ * --------
  */
 void
-pgstat_ping(void)
+pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
 {
-    PgStat_MsgDummy msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if we are not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dbentry = get_local_dbstat_entry(dboid);
+
+    /* add the reported failures to the accumulated count */
+    dbentry->counts.n_checksum_failures += failurecount;
 }
 
 /* ----------
- * pgstat_send_inquiry() -
+ * pgstat_init_function_usage() -
  *
- *    Notify collector that we need fresh data.
+ *  Initialize function call usage data.
+ *  Called by the executor before invoking a function.
  * ----------
  */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/*
- * Initialize function call usage data.
- * Called by the executor before invoking a function.
- */
 void
 pgstat_init_function_usage(FunctionCallInfo fcinfo,
                            PgStat_FunctionCallUsage *fcu)
 {
+    PgStatEnvelope *env;
     PgStat_BackendFunctionEntry *htabent;
     bool        found;
 
@@ -1628,26 +1816,15 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
         return;
     }
 
-    if (!pgStatFunctions)
-    {
-        /* First time through - initialize function stat table */
-        HASHCTL        hash_ctl;
+    env = get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                               fcinfo->flinfo->fn_oid, true, &found);
+    htabent = (PgStat_BackendFunctionEntry *) &env->body;
 
-        memset(&hash_ctl, 0, sizeof(hash_ctl));
-        hash_ctl.keysize = sizeof(Oid);
-        hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
-        pgStatFunctions = hash_create("Function stat entries",
-                                      PGSTAT_FUNCTION_HASH_SIZE,
-                                      &hash_ctl,
-                                      HASH_ELEM | HASH_BLOBS);
-    }
-
-    /* Get the stats entry for this function, create if necessary */
-    htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
-                          HASH_ENTER, &found);
     if (!found)
         MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
 
+    htabent->f_id = fcinfo->flinfo->fn_oid;
+
     fcu->fs = &htabent->f_counts;
 
     /* save stats for this function, later used to compensate for recursion */
@@ -1660,31 +1837,38 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
     INSTR_TIME_SET_CURRENT(fcu->f_start);
 }
 
-/*
- * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
- *        for specified function
+/* ----------
+ * find_funcstat_entry() -
  *
- * If no entry, return NULL, don't create a new one
+ *  find any existing PgStat_BackendFunctionEntry entry for specified function
+ *
+ *  If no entry, return NULL, not creating a new one.
+ * ----------
  */
 PgStat_BackendFunctionEntry *
 find_funcstat_entry(Oid func_id)
 {
-    if (pgStatFunctions == NULL)
+    PgStatEnvelope *env;
+
+    env = get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                               func_id, false, NULL);
+    if (!env)
         return NULL;
 
-    return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
-                                                       (void *) &func_id,
-                                                       HASH_FIND, NULL);
+    return (PgStat_BackendFunctionEntry *) &env->body;
 }
 
-/*
- * Calculate function call usage and update stat counters.
- * Called by the executor after invoking a function.
+/* ----------
+ * pgstat_end_function_usage() -
  *
- * In the case of a set-returning function that runs in value-per-call mode,
- * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
- * calls for what the user considers a single call of the function.  The
- * finalize flag should be TRUE on the last call.
+ *  Calculate function call usage and update stat counters.
+ *  Called by the executor after invoking a function.
+ *
+ *  In the case of a set-returning function that runs in value-per-call mode,
+ *  we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
+ *  calls for what the user considers a single call of the function.  The
+ *  finalize flag should be TRUE on the last call.
+ * ----------
  */
 void
 pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
@@ -1725,9 +1909,6 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
         fs->f_numcalls++;
     fs->f_total_time = f_total;
     INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
 }
 
 
@@ -1739,8 +1920,7 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
  *
  *    We assume that a relcache entry's pgstat_info field is zeroed by
  *    relcache.c when the relcache entry is made; thereafter it is long-lived
- *    data.  We can avoid repeated searches of the TabStatus arrays when the
- *    same relation is touched repeatedly within a transaction.
+ *    data.
  * ----------
  */
 void
@@ -1760,7 +1940,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1776,116 +1957,157 @@ pgstat_initstats(Relation rel)
         return;
 
     /* Else find or make the PgStat_TableStatus entry, and update link */
-    rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+    rel->pgstat_info = get_local_tabstat_entry(rel_id, rel->rd_rel->relisshared);
 }
 
-/*
- * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
+
+/* ----------
+ * get_local_stat_entry() -
+ *
+ *  Returns the local stats entry for the type, dbid and objid.  If create is
+ *  true, a new entry is created if none exists yet; found must be non-null
+ *  in that case.  If create is false and no entry exists, returns NULL.
+ *
+ *  The caller is responsible for initializing the body part of the returned
+ *  envelope.
+ * ----------
  */
-static PgStat_TableStatus *
-get_tabstat_entry(Oid rel_id, bool isshared)
+static PgStatEnvelope *
+get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                     bool create, bool *found)
 {
-    TabStatHashEntry *hash_entry;
-    PgStat_TableStatus *entry;
-    TabStatusArray *tsa;
-    bool        found;
+    PgStatHashEntryKey key;
+    PgStatLocalHashEntry *entry;
 
-    /*
-     * Create hash table if we don't have it already.
-     */
-    if (pgStatTabHash == NULL)
+    if (pgStatLocalHash == NULL)
     {
         HASHCTL        ctl;
 
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
+        MemSet(&ctl, 0, sizeof(ctl));
+        ctl.keysize = sizeof(PgStatHashEntryKey);
+        ctl.entrysize = sizeof(PgStatLocalHashEntry);
+
+        pgStatLocalHash = hash_create("Local stat entries",
+                                      PGSTAT_TABLE_HASH_SIZE,
+                                      &ctl,
+                                      HASH_ELEM | HASH_BLOBS);
+    }
+
+    /* Find an entry or create a new one. */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    entry = hash_search(pgStatLocalHash, &key,
+                        create ? HASH_ENTER : HASH_FIND, found);
+
+    if (!create && !entry)
+        return NULL;
+
+    if (create && !*found)
+    {
+        int            len = pgstat_localentsize[type];
+
+        entry->env = MemoryContextAlloc(CacheMemoryContext,
+                                        PgStatEnvelopeSize(len));
+        entry->env->type = type;
+        entry->env->len = len;
     }
 
+    return entry->env;
+}
+
+/* ----------
+ * get_local_dbstat_entry() -
+ *
+ *  Find or create a local PgStat_StatDBEntry entry for dbid.  A new entry is
+ *  created and initialized if none exists.
+ */
+static PgStat_StatDBEntry *
+get_local_dbstat_entry(Oid dbid)
+{
+    PgStatEnvelope *env;
+    PgStat_StatDBEntry *dbentry;
+    bool        found;
+
     /*
      * Find an entry or create a new one.
      */
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
+    env = get_local_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid,
+                               true, &found);
+    dbentry = (PgStat_StatDBEntry *) &env->body;
+
     if (!found)
     {
-        /* initialize new entry with null pointer */
-        hash_entry->tsa_entry = NULL;
+        dbentry->databaseid = dbid;
+        dbentry->last_autovac_time = 0;
+        dbentry->last_checksum_failure = 0;
+        dbentry->stat_reset_timestamp = 0;
+        dbentry->stats_timestamp = 0;
+        MemSet(&dbentry->counts, 0, sizeof(PgStat_StatDBCounts));
     }
 
-    /*
-     * If entry is already valid, we're done.
-     */
-    if (hash_entry->tsa_entry)
-        return hash_entry->tsa_entry;
-
-    /*
-     * Locate the first pgStatTabList entry with free space, making a new list
-     * entry if needed.  Note that we could get an OOM failure here, but if so
-     * we have left the hashtable and the list in a consistent state.
-     */
-    if (pgStatTabList == NULL)
-    {
-        /* Set up first pgStatTabList entry */
-        pgStatTabList = (TabStatusArray *)
-            MemoryContextAllocZero(TopMemoryContext,
-                                   sizeof(TabStatusArray));
-    }
+    return dbentry;
+}
 
-    tsa = pgStatTabList;
-    while (tsa->tsa_used >= TABSTAT_QUANTUM)
-    {
-        if (tsa->tsa_next == NULL)
-            tsa->tsa_next = (TabStatusArray *)
-                MemoryContextAllocZero(TopMemoryContext,
-                                       sizeof(TabStatusArray));
-        tsa = tsa->tsa_next;
-    }
 
-    /*
-     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
-     * the entry was already zeroed, either at creation or after last use.
-     */
-    entry = &tsa->tsa_entries[tsa->tsa_used++];
-    entry->t_id = rel_id;
-    entry->t_shared = isshared;
+/* ----------
+ * get_local_tabstat_entry() -
+ *  Find or create a PgStat_TableStatus entry for rel.  A new entry is created
+ *  and initialized if none exists.
+ * ----------
+ */
+static PgStat_TableStatus *
+get_local_tabstat_entry(Oid rel_id, bool isshared)
+{
+    PgStatEnvelope *env;
+    PgStat_TableStatus *tabentry;
+    bool        found;
 
-    /*
-     * Now we can fill the entry in pgStatTabHash.
-     */
-    hash_entry->tsa_entry = entry;
+    env = get_local_stat_entry(PGSTAT_TYPE_TABLE,
+                               isshared ? InvalidOid : MyDatabaseId,
+                               rel_id, true, &found);
 
-    return entry;
+    tabentry = (PgStat_TableStatus *) &env->body;
+
+    if (!found)
+    {
+        tabentry->t_id = rel_id;
+        tabentry->t_shared = isshared;
+        tabentry->trans = NULL;
+        MemSet(&tabentry->t_counts, 0, sizeof(PgStat_TableCounts));
+        tabentry->vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+    }
+
+    return tabentry;
 }
 
-/*
- * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
+
+/* ----------
+ * find_tabstat_entry() -
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_TableStatus entry for rel from the current
+ *  database then from shared tables.
  *
- * Note: if we got an error in the most recent execution of pgstat_report_stat,
- * it's possible that an entry exists but there's no hashtable entry for it.
- * That's okay, we'll treat this case as "doesn't exist".
+ *  If no entry, return NULL, don't create a new one.
+ * ----------
  */
 PgStat_TableStatus *
 find_tabstat_entry(Oid rel_id)
 {
-    TabStatHashEntry *hash_entry;
+    PgStatEnvelope *env;
 
-    /* If hashtable doesn't exist, there are no entries at all */
-    if (!pgStatTabHash)
-        return NULL;
+    env = get_local_stat_entry(PGSTAT_TYPE_TABLE, MyDatabaseId, rel_id,
+                               false, NULL);
+    if (!env)
+        env = get_local_stat_entry(PGSTAT_TYPE_TABLE, InvalidOid, rel_id,
+                                   false, NULL);
+    if (env)
+        return (PgStat_TableStatus *) &env->body;
 
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
-    if (!hash_entry)
-        return NULL;
-
-    /* Note that this step could also return NULL, but that's correct */
-    return hash_entry->tsa_entry;
+    return NULL;
 }
 
 /*
@@ -2362,7 +2584,7 @@ pgstat_twophase_postcommit(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, commit case */
     pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
@@ -2398,7 +2620,7 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, abort case */
     if (rec->t_truncated)
@@ -2415,88 +2637,176 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 }
 
 
+/* ----------
+ * snapshot_statentry() -
+ *
+ *  Common routine for functions pgstat_fetch_stat_*entry()
+ *
+ *  Returns the pointer to the snapshot of the shared entry for the key or NULL
+ *  if not found. Returned snapshots are stable during the current transaction
+ *  or until pgstat_clear_snapshot() is called.
+ *
+ *  Created snapshots are stored in pgStatSnapshotHash.
+ */
+static void *
+snapshot_statentry(const PgStatTypes type, const Oid dbid, const Oid objid)
+{
+    PgStatSnapshot *snap = NULL;
+    bool        found;
+    PgStatHashEntryKey key;
+    size_t        statentsize = pgstat_entsize[type];
+
+    Assert(type != PGSTAT_TYPE_ALL);
+
+    /*
+     * Create new hash, with rather arbitrary initial number of entries since
+     * we don't know how this hash will grow.
+     */
+    if (!pgStatSnapshotHash)
+    {
+        HASHCTL        ctl;
+
+        /*
+         * Create the hash in the stats context
+         *
+         * The entry is prepended by common header part represented by
+         * PgStatSnapshot.
+         */
+
+        ctl.keysize = sizeof(PgStatHashEntryKey);
+        ctl.entrysize = PgStatSnapshotSize(statentsize);
+        ctl.hcxt = pgStatSnapshotContext;
+        pgStatSnapshotHash = hash_create("pgstat snapshot hash", 32, &ctl,
+                                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    }
+
+    /* Find a snapshot */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+
+    snap = hash_search(pgStatSnapshotHash, &key, HASH_ENTER, &found);
+
+    /*
+     * Refer to the shared hash if not found in the snapshot hash.
+     *
+     * In transaction state, we obviously should create a snapshot entry for
+     * consistency.  Outside a transaction, we return an up-to-date entry.
+     * Even then we need a snapshot, since a shared stats entry can be
+     * modified at any time.  We reuse the same snapshot entry for that purpose.
+     */
+    if (!found || !IsTransactionState())
+    {
+        PgStatEnvelope *shenv;
+
+        shenv = get_stat_entry(type, dbid, objid, true, NULL, NULL);
+
+        if (shenv)
+            memcpy(&snap->body, &shenv->body, statentsize);
+
+        snap->negative = !shenv;
+    }
+
+    if (snap->negative)
+        return NULL;
+
+    return &snap->body;
+}
+
+
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    Find a database stats entry on backends. The returned entry is cached
+ *    until transaction end or until pgstat_clear_snapshot() is called.
  * ----------
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    /* If not done for this transaction, take a snapshot of global stats */
+    pgstat_snapshot_global_stats();
+
+    /* the caller has no business with snapshot-local members */
+    return (PgStat_StatDBEntry *)
+        snapshot_statentry(PGSTAT_TYPE_DB, dbid, InvalidOid);
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one table or NULL. NULL doesn't mean
+ *    the activity statistics for one table or NULL. NULL doesn't mean
  *    that the table doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    activity statistics facilities, so the caller is better off to
+ *    report ZERO instead.
  * ----------
  */
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
-    PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(false, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
-
-    return NULL;
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(true, relid);
+    return tabentry;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_tabentry_snapshot() -
+ *
+ *    Find a table stats entry on backends. The returned entry is cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_snapshot(bool shared, Oid reloid)
+{
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    return (PgStat_StatTabEntry *)
+        snapshot_statentry(PGSTAT_TYPE_TABLE, dboid, reloid);
+}
+
+
+/* ----------
+ * pgstat_copy_index_counters() -
+ *
+ *    Support function for index swapping. Copy a portion of the counters of the
+ *    relation to the specified place.
+ * ----------
+ */
+void
+pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst)
+{
+    PgStat_StatTabEntry *tabentry;
+
+    /* No point fetching tabentry when dst is NULL */
+    if (!dst)
+        return;
+
+    tabentry = pgstat_fetch_stat_tabentry(relid);
+
+    if (!tabentry)
+        return;
+
+    dst->t_counts.t_numscans = tabentry->numscans;
+    dst->t_counts.t_tuples_returned = tabentry->tuples_returned;
+    dst->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
+    dst->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
+    dst->t_counts.t_blocks_hit = tabentry->blocks_hit;
 }
 
 
@@ -2510,24 +2820,48 @@ pgstat_fetch_stat_tabentry(Oid relid)
 PgStat_StatFuncEntry *
 pgstat_fetch_stat_funcentry(Oid func_id)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry = NULL;
-
-    /* load the stats file if needed */
-    backend_read_statsfile();
-
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
-
-    return funcentry;
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    return (PgStat_StatFuncEntry *)
+        snapshot_statentry(PGSTAT_TYPE_FUNCTION, MyDatabaseId, func_id);
 }
 
+/* ----------
+ * pgstat_snapshot_global_stats() -
+ *
+ * Makes a snapshot of global stats if not done yet.  They will be kept until
+ * a subsequent call to pgstat_clear_snapshot() or the end of the current
+ * memory context (typically TopTransactionContext).
+ * ----------
+ */
+static void
+pgstat_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext;
+
+    attach_shared_stats();
+
+    /* Nothing to do if already done */
+    if (global_snapshot_is_valid)
+        return;
+
+    oldcontext = MemoryContextSwitchTo(pgStatSnapshotContext);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+
+    memcpy(&snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+    LWLockRelease(StatsLock);
+
+    global_snapshot_is_valid = true;
+
+    MemoryContextSwitchTo(oldcontext);
+
+    return;
+}
 
 /* ----------
  * pgstat_fetch_stat_beentry() -
@@ -2599,9 +2933,10 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &archiverStats;
+    return &snapshot_archiverStats;
 }
 
 
@@ -2616,9 +2951,10 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &globalStats;
+    return &snapshot_globalStats;
 }
 
 
@@ -2832,8 +3168,8 @@ pgstat_initialize(void)
         MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
     }
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* needs to be called before DSM shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -3009,12 +3345,15 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+    /* attach shared database stats area */
+    attach_shared_stats();
 }
 
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
- * Flush any remaining statistics counts out to the collector.
+ * Flush any remaining statistics counts out to shared stats.
  * Without this, operations triggered during backend exit (such as
  * temp table deletions) won't be counted.
  *
@@ -3027,7 +3366,7 @@ pgstat_beshutdown_hook(int code, Datum arg)
 
     /*
      * If we got as far as discovering our own database ID, we can report what
-     * we did to the collector.  Otherwise, we'd be sending an invalid
+     * we did to the shared stats.  Otherwise, we'd be sending an invalid
      * database ID, so forget it.  (This means that accesses to pg_database
      * during failed backend starts might never get counted.)
      */
@@ -3044,6 +3383,8 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    detach_shared_stats(true);
 }
 
 
@@ -3304,7 +3645,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3599,9 +3941,6 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
-            break;
         case WAIT_EVENT_RECOVERY_WAL_STREAM:
             event_name = "RecoveryWalStream";
             break;
@@ -4230,94 +4569,71 @@ pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
 
 
 /* ----------
- * pgstat_setheader() -
+ * pgstat_report_archiver() -
  *
- *        Set common header fields in a statistics message
- * ----------
- */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
-    {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
- *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
+ *        Report archiver statistics
  * ----------
  */
 void
-pgstat_send_archiver(const char *xlog, bool failed)
+pgstat_report_archiver(const char *xlog, bool failed)
 {
-    PgStat_MsgArchiver msg;
+    TimestampTz now = GetCurrentTimestamp();
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+    if (failed)
+    {
+        /* Failed archival attempt */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->failed_count;
+        strlcpy(shared_archiverStats->last_failed_wal, xlog,
+                sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    else
+    {
+        /* Successful archival operation */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->archived_count;
+        strlcpy(shared_archiverStats->last_archived_wal, xlog,
+                sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
 }
 
 /* ----------
- * pgstat_send_bgwriter() -
+ * pgstat_report_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
-pgstat_send_bgwriter(void)
+pgstat_report_bgwriter(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+
+    PgStat_BgWriter *l = &BgWriterStats;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid taking the lock for completely empty stats.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += l->timed_checkpoints;
+    shared_globalStats->requested_checkpoints += l->requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += l->checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += l->checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += l->buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += l->buf_written_clean;
+    shared_globalStats->maxwritten_clean += l->maxwritten_clean;
+    shared_globalStats->buf_written_backend += l->buf_written_backend;
+    shared_globalStats->buf_fsync_backend += l->buf_fsync_backend;
+    shared_globalStats->buf_alloc += l->buf_alloc;
+    LWLockRelease(StatsLock);
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4326,424 +4642,30 @@ pgstat_send_bgwriter(void)
 }
 
 
-/* ----------
- * PgstatCollectorMain() -
- *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, SignalHandlerForConfigReload);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    MyBackendType = B_STATS_COLLECTOR;
-    init_ps_display(NULL);
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do check ConfigReloadPending
-     * inside the inner loop, which means that such interrupts will get
-     * serviced but the latch won't get cleared until next time there is a
-     * break in the action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (ShutdownRequestPending)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * ShutdownRequestPending becomes set.
-         */
-        while (!ShutdownRequestPending)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (ConfigReloadPending)
-            {
-                ConfigReloadPending = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr & WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    exit(0);
-}
-
-/*
- * Subroutine to clear stats in a database entry
- *
- * Tables and functions hashes are initialized to empty.
- */
-static void
-reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
-{
-    HASHCTL        hash_ctl;
-
-    dbentry->n_xact_commit = 0;
-    dbentry->n_xact_rollback = 0;
-    dbentry->n_blocks_fetched = 0;
-    dbentry->n_blocks_hit = 0;
-    dbentry->n_tuples_returned = 0;
-    dbentry->n_tuples_fetched = 0;
-    dbentry->n_tuples_inserted = 0;
-    dbentry->n_tuples_updated = 0;
-    dbentry->n_tuples_deleted = 0;
-    dbentry->last_autovac_time = 0;
-    dbentry->n_conflict_tablespace = 0;
-    dbentry->n_conflict_lock = 0;
-    dbentry->n_conflict_snapshot = 0;
-    dbentry->n_conflict_bufferpin = 0;
-    dbentry->n_conflict_startup_deadlock = 0;
-    dbentry->n_temp_files = 0;
-    dbentry->n_temp_bytes = 0;
-    dbentry->n_deadlocks = 0;
-    dbentry->n_checksum_failures = 0;
-    dbentry->last_checksum_failure = 0;
-    dbentry->n_block_read_time = 0;
-    dbentry->n_block_write_time = 0;
-
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
-
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
-}
-
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
-{
-    PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
-     */
-    if (!found)
-        reset_dbentry_counters(result);
-
-    return result;
-}
-
-
-/*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
-{
-    PgStat_StatTabEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /* If not found, initialize the new one. */
-    if (!found)
-    {
-        result->numscans = 0;
-        result->tuples_returned = 0;
-        result->tuples_fetched = 0;
-        result->tuples_inserted = 0;
-        result->tuples_updated = 0;
-        result->tuples_deleted = 0;
-        result->tuples_hot_updated = 0;
-        result->n_live_tuples = 0;
-        result->n_dead_tuples = 0;
-        result->changes_since_analyze = 0;
-        result->blocks_fetched = 0;
-        result->blocks_hit = 0;
-        result->vacuum_timestamp = 0;
-        result->vacuum_count = 0;
-        result->autovac_vacuum_timestamp = 0;
-        result->autovac_vacuum_count = 0;
-        result->analyze_timestamp = 0;
-        result->analyze_count = 0;
-        result->autovac_analyze_timestamp = 0;
-        result->autovac_analyze_count = 0;
-    }
-
-    return result;
-}
-
-
 /* ----------
  * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ *        Write the global statistics file, as well as DB files.
  * ----------
  */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+void
+pgstat_write_statsfiles(void)
 {
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **penv;
+
+    /* Stats are not initialized yet; just return. */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
+    create_missing_dbentries();
+
     /*
      * Open the statistics temp file to write out the current values.
      */
@@ -4760,7 +4682,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
 
     /*
      * Write the file header --- currently just a format ID.
@@ -4772,32 +4694,31 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Write global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write archiver stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Walk through the database table.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_DB, InvalidOid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) &(*penv)->body;
+
         /*
          * Write out the table and function stats for this DB into the
          * appropriate per-DB stat file, if required.
          */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
+        pgstat_write_database_stats(dbentry);
 
         /*
          * Write out the DB entry. We don't write the tables or functions
@@ -4808,6 +4729,8 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
         (void) rc;                /* we'll check for error with ferror */
     }
 
+    pfree(envlist);
+
     /*
      * No more output to be done. Close the temp file and replace the old
      * pgstat.stat with it.  The ferror() check replaces testing for error
@@ -4840,55 +4763,19 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
 }
 
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
 
 /* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
+ * pgstat_write_database_stats() -
+ *  Write the stat file for a single database.
  * ----------
  */
 static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
+pgstat_write_database_stats(PgStat_StatDBEntry *dbentry)
 {
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **penv;
     FILE       *fpout;
     int32        format_id;
     Oid            dbid = dbentry->databaseid;
@@ -4896,8 +4783,8 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     char        tmpfile[MAXPGPATH];
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -4924,24 +4811,31 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     /*
      * Walk through the database's access stats per table.
      */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_TABLE, dbentry->databaseid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatTabEntry *tabentry = (PgStat_StatTabEntry *) &(*penv)->body;
+
         fputc('T', fpout);
         rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    pfree(envlist);
 
     /*
      * Walk through the database's function stats table.
      */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_FUNCTION, dbentry->databaseid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatFuncEntry *funcentry =
+        (PgStat_StatFuncEntry *) &(*penv)->body;
+
         fputc('F', fpout);
         rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    pfree(envlist);
 
     /*
      * No more output to be done. Close the temp file and replace the old
@@ -4975,94 +4869,165 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
+}
 
-    if (permanent)
+/* ----------
+ * create_missing_dbentries() -
+ *
+ *  A database entry may be missing for a database in which object stats
+ *  are recorded. This function creates such missing dbentries so that
+ *  all stats entries can be written out to files.
+ * ----------
+ */
+static void
+create_missing_dbentries(void)
+{
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
+    HTAB       *oidhash;
+    HASHCTL        ctl;
+    HASH_SEQ_STATUS scan;
+    Oid           *poid;
+
+    memset(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(Oid);
+    ctl.entrysize = sizeof(Oid);
+    ctl.hcxt = CurrentMemoryContext;
+    oidhash = hash_create("Temporary table of OIDs",
+                          PGSTAT_TABLE_HASH_SIZE,
+                          &ctl,
+                          HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+    /* Collect database OIDs from the shared stats hash */
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+        hash_search(oidhash, &p->key.databaseid, HASH_ENTER, NULL);
+    dshash_seq_term(&hstat);
+
+    /* Create database entries that do not exist yet. */
+    hash_seq_init(&scan, oidhash);
+    while ((poid = (Oid *) hash_seq_search(&scan)) != NULL)
+        (void) get_stat_entry(PGSTAT_TYPE_DB, *poid, InvalidOid,
+                              false, init_dbentry, NULL);
+
+    hash_destroy(oidhash);
+}
+
+
+/* ----------
+ * get_stat_entry() -
+ *
+ *  Get the shared stats entry for the specified type, dbid and objid.
+ *  If nowait is true, returns NULL on lock failure.
+ *
+ *  If initfunc is not NULL, a new entry is created if none exists yet, and
+ *  initfunc is called with the new envelope. If found is not NULL, it is set
+ *  to true if an existing entry was found, or false otherwise.
+ * ----------
+ */
+static PgStatEnvelope *
+get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+               bool nowait, entry_initializer initfunc, bool *found)
+{
+    bool        create = (initfunc != NULL);
+    PgStatHashEntry *shent;
+    PgStatEnvelope *shenv = NULL;
+    PgStatHashEntryKey key;
+    bool        myfound;
+
+    Assert(type != PGSTAT_TYPE_ALL);
+
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    shent = dshash_find_extended(pgStatSharedHash, &key,
+                                 create, nowait, create, &myfound);
+    if (shent)
     {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
+        if (create && !myfound)
+        {
+            /* Create new stats envelope. */
+            size_t        envsize = PgStatEnvelopeSize(pgstat_entsize[type]);
+            dsa_pointer chunk = dsa_allocate0(area, envsize);
 
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
+            shenv = dsa_get_address(area, chunk);
+            shenv->type = type;
+            shenv->databaseid = dbid;
+            shenv->objectid = objid;
+            shenv->len = pgstat_entsize[type];
+            LWLockInitialize(&shenv->lock, LWTRANCHE_STATS);
+
+            /*
+             * The dshash lock is released right after this, so call the
+             * initializer before the entry is exposed to other processes.
+             */
+            if (initfunc)
+                initfunc(shenv);
+
+            /* Link the new entry from the hash entry. */
+            shent->env = chunk;
+        }
+        else
+            shenv = dsa_get_address(area, shent->env);
+
+        dshash_release_lock(pgStatSharedHash, shent);
     }
+
+    if (found)
+        *found = myfound;
+
+    return shenv;
 }
 
+
 /* ----------
  * pgstat_read_statsfiles() -
  *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ *    Reads in existing activity statistics files into the shared stats hash.
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+void
+pgstat_read_statsfiles(void)
 {
+    PgStatEnvelope *env;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
 
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
+    /* shouldn't be called from postmaster */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp =
+        shared_globalStats->stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the activity statistics facility has not
+     * yet written the stats file for the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -5071,7 +5036,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
@@ -5079,38 +5044,30 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     /*
      * Read global stats struct
      */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+        sizeof(*shared_globalStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
         goto done;
     }
 
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
     /*
      * Read archiver stats struct
      */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+        sizeof(*shared_archiverStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
         goto done;
     }
 
     /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
     for (;;)
     {
@@ -5124,7 +5081,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
                           fpin) != offsetof(PgStat_StatDBEntry, tables))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5133,76 +5090,33 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 /*
                  * Add to the DB hash
                  */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
+
+                env = get_stat_entry(PGSTAT_TYPE_DB, dbbuf.databaseid,
+                                     InvalidOid,
+                                     false, init_dbentry, &found);
+                dbentry = (PgStat_StatDBEntry *) &env->body;
+
+                /* don't allow duplicate dbentries */
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
+                memcpy(dbentry, &dbbuf,
+                       offsetof(PgStat_StatDBEntry, tables));
 
+                /* Read the data from the database-specific file. */
+                pgstat_read_db_statsfile(dbentry);
                 break;
 
             case 'E':
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5212,59 +5126,50 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 done:
     FreeFile(fpin);
 
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 
-    return dbhash;
+    return;
 }
 
 
 /* ----------
  * pgstat_read_db_statsfile() -
  *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
+ *    Reads in the at-rest statistics file and creates shared statistics
+ *    tables. The file is removed after reading.
  * ----------
  */
 static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
+pgstat_read_db_statsfile(PgStat_StatDBEntry *dbentry)
 {
+    PgStatEnvelope *env;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatTabEntry tabbuf;
     PgStat_StatFuncEntry funcbuf;
     PgStat_StatFuncEntry *funcentry;
+    dshash_table *tabhash = NULL;
+    dshash_table *funchash = NULL;
     FILE       *fpin;
     int32        format_id;
     bool        found;
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
+    get_dbstat_filename(false, dbentry->databaseid, statfile, MAXPGPATH);
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the activity statistics facility has not
+     * yet written the stats file for the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
@@ -5277,14 +5182,14 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
 
     /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
     for (;;)
     {
@@ -5297,25 +5202,21 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
                           fpin) != sizeof(PgStat_StatTabEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
+                env = get_stat_entry(PGSTAT_TYPE_TABLE,
+                                     dbentry->databaseid, tabbuf.tableid,
+                                     false, init_tabentry, &found);
+                tabentry = (PgStat_StatTabEntry *) &env->body;
 
+                /* don't allow duplicate entries */
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5331,25 +5232,20 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
                           fpin) != sizeof(PgStat_StatFuncEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
+                env = get_stat_entry(PGSTAT_TYPE_FUNCTION,
+                                     dbentry->databaseid, funcbuf.functionid,
+                                     false, init_funcentry, &found);
+                funcentry = (PgStat_StatFuncEntry *) &env->body;
 
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5365,7 +5261,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5373,292 +5269,15 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     }
 
 done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
+    if (tabhash)
+        dshash_detach(tabhash);
+    if (funchash)
+        dshash_detach(funchash);
 
-done:
     FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
-}
-
 
-/* ----------
- * pgstat_setup_memcxt() -
- *
- *    Create pgStatLocalContext, if not already done.
- * ----------
- */
-static void
-pgstat_setup_memcxt(void)
-{
-    if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 }
 
 
@@ -5677,741 +5296,25 @@ pgstat_clear_snapshot(void)
 {
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
+    {
         MemoryContextDelete(pgStatLocalContext);
 
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
     }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
 
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
+    if (pgStatSnapshotContext)
     {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
+        MemoryContextReset(pgStatSnapshotContext);
 
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
+        /* Reset variables that pointed to the context */
+        global_snapshot_is_valid = false;
+        pgStatSnapshotHash = NULL;
     }
 }
 
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
-}
-
 /*
  * Convert a potentially unsafely truncated activity string (see
  * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index fab4a9dd51..d418fe3bd0 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -255,7 +255,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -503,7 +502,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1326,12 +1324,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1780,11 +1772,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2694,8 +2681,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -3058,8 +3043,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3126,13 +3109,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3205,22 +3181,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3681,22 +3641,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3892,8 +3836,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3928,8 +3870,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4130,8 +4071,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5109,18 +5048,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5239,12 +5166,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6139,7 +6060,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6195,8 +6115,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6431,7 +6349,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e05e2b3456..26414dadb2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1947,7 +1947,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                BgWriterStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2057,7 +2057,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2247,7 +2247,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2255,7 +2255,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%dscanned=%d wrote=%d reusable=%d",
 
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..58a442f482 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -147,6 +147,7 @@ CreateSharedMemoryAndSemaphores(void)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -263,6 +264,7 @@ CreateSharedMemoryAndSemaphores(void)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 4c14e51c67..f61fd3e8ad 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -523,6 +523,7 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
     LWLockRegisterTranche(LWTRANCHE_SXACT, "serializable_xact");
+    LWLockRegisterTranche(LWTRANCHE_STATS, "activity_statistics");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index db47843229..97eccb35d3 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -49,3 +49,4 @@ MultiXactTruncationLock                41
 OldSnapshotTimeMapLock                42
 LogicalRepWorkerLock                43
 CLogTruncationLock                    44
+StatsLock                            45
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 00c77b66c7..e2998f965e 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3189,6 +3189,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3763,6 +3769,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4201,6 +4208,8 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long stats_timeout;
+
                 /* Send out notify signals and transmit self-notifies */
                 ProcessCompletedNotifies();
 
@@ -4213,8 +4222,13 @@ PostgresMain(int argc, char *argv[],
                 if (notifyInterruptPending)
                     ProcessNotifyInterrupt();
 
-                pgstat_report_stat(false);
-
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle");
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4249,7 +4263,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4257,6 +4271,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index cea01534a5..a1304dc3ce 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -33,9 +33,6 @@
 
 #define UINT32_ACCESS_ONCE(var)         ((uint32)(*((volatile uint32 *)&(var))))
 
-/* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
-
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
 {
@@ -1244,7 +1241,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_commit);
+        result = (int64) (dbentry->counts.n_xact_commit);
 
     PG_RETURN_INT64(result);
 }
@@ -1260,7 +1257,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_rollback);
+        result = (int64) (dbentry->counts.n_xact_rollback);
 
     PG_RETURN_INT64(result);
 }
@@ -1276,7 +1273,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_fetched);
+        result = (int64) (dbentry->counts.n_blocks_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1292,7 +1289,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_hit);
+        result = (int64) (dbentry->counts.n_blocks_hit);
 
     PG_RETURN_INT64(result);
 }
@@ -1308,7 +1305,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_returned);
+        result = (int64) (dbentry->counts.n_tuples_returned);
 
     PG_RETURN_INT64(result);
 }
@@ -1324,7 +1321,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_fetched);
+        result = (int64) (dbentry->counts.n_tuples_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1340,7 +1337,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_inserted);
+        result = (int64) (dbentry->counts.n_tuples_inserted);
 
     PG_RETURN_INT64(result);
 }
@@ -1356,7 +1353,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_updated);
+        result = (int64) (dbentry->counts.n_tuples_updated);
 
     PG_RETURN_INT64(result);
 }
@@ -1372,7 +1369,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_deleted);
+        result = (int64) (dbentry->counts.n_tuples_deleted);
 
     PG_RETURN_INT64(result);
 }
@@ -1405,7 +1402,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_files;
+        result = dbentry->counts.n_temp_files;
 
     PG_RETURN_INT64(result);
 }
@@ -1421,7 +1418,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_bytes;
+        result = dbentry->counts.n_temp_bytes;
 
     PG_RETURN_INT64(result);
 }
@@ -1436,7 +1433,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace);
+        result = (int64) (dbentry->counts.n_conflict_tablespace);
 
     PG_RETURN_INT64(result);
 }
@@ -1451,7 +1448,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_lock);
+        result = (int64) (dbentry->counts.n_conflict_lock);
 
     PG_RETURN_INT64(result);
 }
@@ -1466,7 +1463,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_snapshot);
+        result = (int64) (dbentry->counts.n_conflict_snapshot);
 
     PG_RETURN_INT64(result);
 }
@@ -1481,7 +1478,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_bufferpin);
+        result = (int64) (dbentry->counts.n_conflict_bufferpin);
 
     PG_RETURN_INT64(result);
 }
@@ -1496,7 +1493,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1511,11 +1508,11 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace +
-                          dbentry->n_conflict_lock +
-                          dbentry->n_conflict_snapshot +
-                          dbentry->n_conflict_bufferpin +
-                          dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_tablespace +
+                          dbentry->counts.n_conflict_lock +
+                          dbentry->counts.n_conflict_snapshot +
+                          dbentry->counts.n_conflict_bufferpin +
+                          dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1530,7 +1527,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_deadlocks);
+        result = (int64) (dbentry->counts.n_deadlocks);
 
     PG_RETURN_INT64(result);
 }
@@ -1548,7 +1545,7 @@ pg_stat_get_db_checksum_failures(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_checksum_failures);
+        result = (int64) (dbentry->counts.n_checksum_failures);
 
     PG_RETURN_INT64(result);
 }
@@ -1585,7 +1582,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_read_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_read_time) / 1000.0;

     PG_RETURN_FLOAT8(result);
 }
@@ -1601,7 +1598,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_write_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_write_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb19644419..51748c99ad 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index f4247ea70d..f65d05c24c 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -631,6 +632,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1241,6 +1244,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 3c70499feb..927ae319b1 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -6,7 +6,7 @@ use File::Basename qw(basename dirname);
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 107;
+use Test::More tests => 106;
 
 program_help_ok('pg_basebackup');
 program_version_ok('pg_basebackup');
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 619b2f9c71..9f1de1e42f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -83,6 +83,8 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index a07012bf4b..6fad13c4be 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL activity statistics facility.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -15,9 +15,11 @@
 #include "libpq/pqcomm.h"
 #include "miscadmin.h"
 #include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -41,33 +43,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -78,9 +53,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -111,18 +85,17 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_delta_live_tuples;
     PgStat_Counter t_delta_dead_tuples;
     PgStat_Counter t_changed_tuples;
+    bool        reset_changed_tuples;
 
     PgStat_Counter t_blocks_fetched;
     PgStat_Counter t_blocks_hit;
+
+    PgStat_Counter vacuum_count;
+    PgStat_Counter autovac_vacuum_count;
+    PgStat_Counter analyze_count;
+    PgStat_Counter autovac_analyze_count;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -156,6 +129,10 @@ typedef struct PgStat_TableStatus
     Oid            t_id;            /* table's OID */
     bool        t_shared;        /* is it a shared catalog? */
     struct PgStat_TableXactStatus *trans;    /* lowest subxact's counts */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_TableCounts t_counts;    /* event counts to be sent */
 } PgStat_TableStatus;
 
@@ -181,280 +158,32 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
-/* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
- */
-typedef struct PgStat_MsgHdr
-{
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
 /* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgTempFile
+typedef struct PgStat_BgWriter
 {
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter buf_alloc;
+    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+} PgStat_BgWriter;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -486,96 +215,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Activity statistics data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -584,13 +225,9 @@ typedef union PgStat_Msg
 
 #define PGSTAT_FILE_FORMAT_ID    0x01A5BC9D
 
-/* ----------
- * PgStat_StatDBEntry            The collector's data per database
- * ----------
- */
-typedef struct PgStat_StatDBEntry
+
+typedef struct PgStat_StatDBCounts
 {
-    Oid            databaseid;
     PgStat_Counter n_xact_commit;
     PgStat_Counter n_xact_rollback;
     PgStat_Counter n_blocks_fetched;
@@ -600,7 +237,6 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_tuples_inserted;
     PgStat_Counter n_tuples_updated;
     PgStat_Counter n_tuples_deleted;
-    TimestampTz last_autovac_time;
     PgStat_Counter n_conflict_tablespace;
     PgStat_Counter n_conflict_lock;
     PgStat_Counter n_conflict_snapshot;
@@ -610,29 +246,55 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_temp_bytes;
     PgStat_Counter n_deadlocks;
     PgStat_Counter n_checksum_failures;
-    TimestampTz last_checksum_failure;
     PgStat_Counter n_block_read_time;    /* times in microseconds */
     PgStat_Counter n_block_write_time;
+} PgStat_StatDBCounts;
 
+/* ----------
+ * PgStat_StatDBEntry            The statistics per database
+ * ----------
+ */
+typedef struct PgStat_StatDBEntry
+{
+    Oid            databaseid;
+    TimestampTz last_autovac_time;
+    TimestampTz last_checksum_failure;
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;    /* time of db stats update */
+
+    PgStat_StatDBCounts counts;
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following fields must be last in the struct, because we don't
+     * write them out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables; /* current gen tables hash */
+    dshash_table_handle functions;    /* current gen functions hash */
 } PgStat_StatDBEntry;
 
+/* ----------
+ * PgStat_HashMountInfo  dshash mount information
+ * ----------
+ */
+typedef struct PgStat_HashMountInfo
+{
+    HTAB       *snapshot_tables;    /* table entry snapshot */
+    HTAB       *snapshot_functions; /* function entry snapshot */
+    dshash_table *dshash_tables;    /* attached tables dshash */
+    dshash_table *dshash_functions; /* attached functions dshash */
+} PgStat_HashMountInfo;
 
 /* ----------
- * PgStat_StatTabEntry            The collector's data per table (or index)
+ * PgStat_StatTabEntry            The statistics per table (or index)
  * ----------
  */
 typedef struct PgStat_StatTabEntry
 {
     Oid            tableid;
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
 
     PgStat_Counter numscans;
 
@@ -651,19 +313,15 @@ typedef struct PgStat_StatTabEntry
     PgStat_Counter blocks_fetched;
     PgStat_Counter blocks_hit;
 
-    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
     PgStat_Counter vacuum_count;
-    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_vacuum_count;
-    TimestampTz analyze_timestamp;    /* user initiated */
     PgStat_Counter analyze_count;
-    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_analyze_count;
 } PgStat_StatTabEntry;
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            Per-function statistics data
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
@@ -678,7 +336,7 @@ typedef struct PgStat_StatFuncEntry
 
 
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -694,7 +352,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -760,7 +418,6 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
     WAIT_EVENT_WAL_RECEIVER_MAIN,
@@ -1004,7 +661,7 @@ typedef struct PgBackendGSSStatus
  *
  * Each live backend maintains a PgBackendStatus struct in shared memory
  * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
+ * BackendId, but that is not critical.)  Note that the stats-writing code
  * has no involvement in, or even access to, these structs.
  *
  * Each auxiliary process also maintains a PgBackendStatus struct in shared
@@ -1201,13 +858,15 @@ extern PGDLLIMPORT bool pgstat_track_counts;
 extern PGDLLIMPORT int pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used; these will be removed together with the GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1222,29 +881,26 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
 
+/* File input/output functions  */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 
 extern void pgstat_report_autovac(Oid dboid);
@@ -1405,8 +1061,8 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                                       void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
+extern void pgstat_report_archiver(const char *xlog, bool failed);
+extern void pgstat_report_bgwriter(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1415,11 +1071,15 @@ extern void pgstat_send_bgwriter(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_snapshot(bool shared,
+                                                                Oid relid);
+extern void pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 8fda8e4f78..13371e8cb7 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -220,6 +220,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_SXACT,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 83a15f6795..77d1572a99 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.18.2
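
A note on the PgStat_FunctionCounts comment in the header above: it requires the struct to hold only event counters because the code compares it against zeroes with memcmp to decide whether anything needs to be written. The following is a minimal sketch of that idea, not code from the patch. DemoCounter, DemoFunctionCounts, demo_has_pending_counts and demo_flush_counts are hypothetical names, and the sketch assumes all fields are plain 64-bit integers so the struct has no padding bytes that could confuse memcmp.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins; not the real PgStat_* types. */
typedef int64_t DemoCounter;

typedef struct DemoFunctionCounts
{
    DemoCounter numcalls;
    DemoCounter total_time;
    DemoCounter self_time;
} DemoFunctionCounts;

/*
 * Return true if at least one counter is nonzero.  This only works when
 * the struct holds nothing but counters (and no padding), which is why
 * the header comment insists on that.
 */
bool
demo_has_pending_counts(const DemoFunctionCounts *counts)
{
    static const DemoFunctionCounts all_zeroes = {0};

    assert(counts != NULL);
    return memcmp(counts, &all_zeroes, sizeof(DemoFunctionCounts)) != 0;
}

/*
 * Add the local counts to the (assumed already locked) shared entry and
 * clear the local copy.  Returns false when there was nothing to write.
 */
bool
demo_flush_counts(DemoFunctionCounts *local, DemoFunctionCounts *shared)
{
    if (!demo_has_pending_counts(local))
        return false;

    shared->numcalls += local->numcalls;
    shared->total_time += local->total_time;
    shared->self_time += local->self_time;

    memset(local, 0, sizeof(*local));
    return true;
}
```

Locking of the shared entry is left out on purpose; in the patch that would presumably go through the dshash entry lock, but the sketch makes no claim about how.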

From 23f99af7ca3754bcd8bb567e2d8424a0e0abd5b3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:09 +0900
Subject: [PATCH v27 6/7] Doc part of shared-memory based stats collector.

---
 doc/src/sgml/catalogs.sgml          |   6 +-
 doc/src/sgml/config.sgml            |  29 +++---
 doc/src/sgml/high-availability.sgml |  13 +--
 doc/src/sgml/monitoring.sgml        | 133 +++++++++++++---------------
 doc/src/sgml/ref/pg_dump.sgml       |   9 +-
 5 files changed, 91 insertions(+), 99 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 64614b569c..8bd8fc4d5f 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -8151,9 +8151,9 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l
   <para>
    <xref linkend="view-table"/> lists the system views described here.
    More detailed documentation of each view follows below.
-   There are some additional views that provide access to the results of
-   the statistics collector; they are described in <xref
-   linkend="monitoring-stats-views-table"/>.
+   There are some additional views that provide access to the activity
+   statistics; they are described in
+   <xref linkend="monitoring-stats-views-table"/>.
   </para>
 
   <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 355b408b0a..680e1c3564 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6999,11 +6999,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
     <title>Run-time Statistics</title>
 
     <sect2 id="runtime-config-statistics-collector">
-     <title>Query and Index Statistics Collector</title>
+     <title>Query and Index Activity Statistics</title>
 
      <para>
-      These parameters control server-wide statistics collection features.
-      When statistics collection is enabled, the data that is produced can be
+      These parameters control server-wide activity statistics features.
+      When activity statistics are enabled, the data that is produced can be
       accessed via the <structname>pg_stat</structname> and
       <structname>pg_statio</structname> family of system views.
       Refer to <xref linkend="monitoring"/> for more information.
@@ -7019,14 +7019,13 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables the collection of information on the currently
-        executing command of each session, along with the time when
-        that command began execution. This parameter is on by
-        default. Note that even when enabled, this information is not
-        visible to all users, only to superusers and the user owning
-        the session being reported on, so it should not represent a
-        security risk.
-        Only superusers can change this setting.
+        Enables tracking of the currently executing command of
+        each session, along with the time when that command began
+        execution. This parameter is on by default. Note that even when
+        enabled, this information is not visible to all users, only to
+        superusers and the user owning the session being reported on, so it
+        should not represent a security risk.  Only superusers can change this
+        setting.
        </para>
       </listitem>
      </varlistentry>
@@ -7057,9 +7056,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables collection of statistics on database activity.
+        Enables tracking of database activity.
         This parameter is on by default, because the autovacuum
-        daemon needs the collected information.
+        daemon needs the activity information.
         Only superusers can change this setting.
        </para>
       </listitem>
@@ -8077,7 +8076,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Specifies the fraction of the total number of heap tuples counted in
-        the previous statistics collection that can be inserted without
+        the previous activity statistics that can be inserted without
         incurring an index scan at the <command>VACUUM</command> cleanup stage.
         This setting currently applies to B-tree indexes only.
        </para>
@@ -8090,7 +8089,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         Index statistics are considered to be stale if the number of newly
         inserted tuples exceeds the <varname>vacuum_cleanup_index_scale_factor</varname>
         fraction of the total number of heap tuples detected by the previous
-        statistics collection. The total number of heap tuples is stored in
+        activity statistics. The total number of heap tuples is stored in
         the index meta-page. Note that the meta-page does not include this data
         until <command>VACUUM</command> finds no dead tuples, so B-tree index
         scan at the cleanup stage can only be skipped if the second and
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index bc4d98fe03..d56afa17db 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2357,12 +2357,13 @@ LOG:  database system is ready to accept read only connections
    </para>
 
    <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
-    index usage, etc., will be recorded normally on the standby. Replayed
-    actions will not duplicate their effects on primary, so replaying an
-    insert will not increment the Inserts column of pg_stat_user_tables.
-    The stats file is deleted at the start of recovery, so stats from primary
-    and standby will differ; this is considered a feature, not a bug.
+    Activity statistics are collected during recovery. All scans, reads,
+    blocks, index usage, etc., will be recorded normally on the
+    standby. Replayed actions will not duplicate their effects on the primary,
+    so replaying an insert will not increment the Inserts column of
+    pg_stat_user_tables.  The activity statistics are reset at the start of
+    recovery, so stats from primary and standby will differ; this is
+    considered a feature, not a bug.
    </para>
 
    <para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e87fb9634e..80ad6e72dc 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -22,7 +22,7 @@
   <para>
    Several tools are available for monitoring database activity and
    analyzing performance.  Most of this chapter is devoted to describing
-   <productname>PostgreSQL</productname>'s statistics collector,
+   <productname>PostgreSQL</productname>'s activity statistics,
    but one should not neglect regular Unix monitoring programs such as
    <command>ps</command>, <command>top</command>, <command>iostat</command>, and <command>vmstat</command>.
    Also, once one has identified a
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    master server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   master process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   master process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
@@ -130,20 +128,21 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
  </sect1>
 
  <sect1 id="monitoring-stats">
-  <title>The Statistics Collector</title>
+  <title>The Activity Statistics</title>
 
   <indexterm zone="monitoring-stats">
    <primary>statistics</primary>
   </indexterm>
 
   <para>
-   <productname>PostgreSQL</productname>'s <firstterm>statistics collector</firstterm>
-   is a subsystem that supports collection and reporting of information about
-   server activity.  Presently, the collector can count accesses to tables
-   and indexes in both disk-block and individual-row terms.  It also tracks
-   the total number of rows in each table, and information about vacuum and
-   analyze actions for each table.  It can also count calls to user-defined
-   functions and the total time spent in each one.
+   <productname>PostgreSQL</productname>'s <firstterm>activity
+   statistics</firstterm> is a subsystem that supports tracking and reporting
+   of information about server activity.  Presently, it counts accesses to
+   tables and indexes in both disk-block and individual-row terms.  It also
+   tracks the total number of rows in each table, and information about
+   vacuum and analyze actions for each table.  It can also track calls to
+   user-defined functions and the total time spent in
+   each one.
   </para>
 
   <para>
@@ -151,15 +150,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    information about exactly what is going on in the system right now, such as
    the exact command currently being executed by other server processes, and
    which other connections exist in the system.  This facility is independent
-   of the collector process.
+   of the activity statistics.
   </para>
 
  <sect2 id="monitoring-stats-setup">
-  <title>Statistics Collection Configuration</title>
+  <title>Activity Statistics Configuration</title>
 
   <para>
-   Since collection of statistics adds some overhead to query execution,
-   the system can be configured to collect or not collect information.
+   Since tracking for the activity statistics adds some overhead to query
+   execution, the system can be configured to track or not track activity.
    This is controlled by configuration parameters that are normally set in
    <filename>postgresql.conf</filename>.  (See <xref linkend="runtime-config"/> for
    details about setting configuration parameters.)
@@ -172,7 +171,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The parameter <xref linkend="guc-track-counts"/> controls whether
-   statistics are collected about table and index accesses.
+   table and index accesses are tracked.
   </para>
 
   <para>
@@ -196,18 +195,12 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
   </para>
 
   <para>
-   The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
-   When the server shuts down cleanly, a permanent copy of the statistics
-   data is stored in the <filename>pg_stat</filename> subdirectory, so that
-   statistics can be retained across server restarts.  When recovery is
-   performed at server start (e.g. after immediate shutdown, server crash,
-   and point-in-time recovery), all statistics counters are reset.
+   Activity statistics are kept in shared memory. When the server shuts
+   down cleanly, a permanent copy of the statistics data is stored in
+   the <filename>pg_stat</filename> subdirectory, so that statistics can be
+   retained across server restarts.  When recovery is performed at server
+   start (e.g. after immediate shutdown, server crash, and point-in-time
+   recovery), all statistics counters are reset.
   </para>
 
  </sect2>
@@ -220,48 +213,46 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    linkend="monitoring-stats-dynamic-views-table"/>, are available to show
    the current state of the system. There are also several other
    views, listed in <xref
-   linkend="monitoring-stats-views-table"/>, available to show the results
-   of statistics collection.  Alternatively, one can
-   build custom views using the underlying statistics functions, as discussed
-   in <xref linkend="monitoring-stats-functions"/>.
+   linkend="monitoring-stats-views-table"/>, available to show the activity
+   statistics.  Alternatively, one can build custom views using the underlying
+   statistics functions, as discussed in
+   <xref linkend="monitoring-stats-functions"/>.
   </para>
 
   <para>
-   When using the statistics to monitor collected data, it is important
-   to realize that the information does not update instantaneously.
-   Each individual server process transmits new statistical counts to
-   the collector just before going idle; so a query or transaction still in
-   progress does not affect the displayed totals.  Also, the collector itself
-   emits a new report at most once per <varname>PGSTAT_STAT_INTERVAL</varname>
-   milliseconds (500 ms unless altered while building the server).  So the
-   displayed information lags behind actual activity.  However, current-query
-   information collected by <varname>track_activities</varname> is
-   always up-to-date.
+   When using the activity statistics, it is important to realize that the
+   information does not update instantaneously.  Each individual server process
+   writes out new statistical counts just before going idle, but not more often
+   than once per <varname>PGSTAT_STAT_INTERVAL</varname> milliseconds (1 second
+   unless altered while building the server); so a query or transaction still in
+   progress does not affect the displayed totals.  However, current-query
+   information tracked by <varname>track_activities</varname> is always
+   up-to-date.
   </para>
 
   <para>
    Another important point is that when a server process is asked to display
-   any of these statistics, it first fetches the most recent report emitted by
-   the collector process and then continues to use this snapshot for all
-   statistical views and functions until the end of its current transaction.
-   So the statistics will show static information as long as you continue the
-   current transaction.  Similarly, information about the current queries of
-   all sessions is collected when any such information is first requested
-   within a transaction, and the same information will be displayed throughout
-   the transaction.
-   This is a feature, not a bug, because it allows you to perform several
-   queries on the statistics and correlate the results without worrying that
-   the numbers are changing underneath you.  But if you want to see new
-   results with each query, be sure to do the queries outside any transaction
-   block.  Alternatively, you can invoke
+   any of these statistics, it first reads the current statistics and then
+   continues to use this snapshot for all statistical views and functions
+   until the end of its current transaction.  So the statistics will show
+   static information as long as you continue the current transaction.
+   Similarly, information about the current queries of all sessions is tracked
+   when any such information is first requested within a transaction, and the
+   same information will be displayed throughout the transaction.  This is a
+   feature, not a bug, because it allows you to perform several queries on the
+   statistics and correlate the results without worrying that the numbers are
+   changing underneath you.  But if you want to see new results with each
+   query, be sure to do the queries outside any transaction block.
+   Alternatively, you can invoke
    <function>pg_stat_clear_snapshot</function>(), which will discard the
    current transaction's statistics snapshot (if any).  The next use of
    statistical information will cause a new snapshot to be fetched.
   </para>
-
+  
   <para>
-   A transaction can also see its own statistics (as yet untransmitted to the
-   collector) in the views <structname>pg_stat_xact_all_tables</structname>,
+   A transaction can also see its own statistics (as yet unwritten to the
+   server-wide activity statistics) in the
+   views <structname>pg_stat_xact_all_tables</structname>,
    <structname>pg_stat_xact_sys_tables</structname>,
    <structname>pg_stat_xact_user_tables</structname>, and
    <structname>pg_stat_xact_user_functions</structname>.  These numbers do not act as
@@ -596,7 +587,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    kernel's I/O cache, and might therefore still be fetched without
    requiring a physical read. Users interested in obtaining more
    detailed information on <productname>PostgreSQL</productname> I/O behavior are
-   advised to use the <productname>PostgreSQL</productname> statistics collector
+   advised to use the <productname>PostgreSQL</productname> activity statistics
    in combination with operating system utilities that allow insight
    into the kernel's handling of I/O.
   </para>
@@ -914,7 +905,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
       <tbody>
        <row>
-        <entry morerows="64"><literal>LWLock</literal></entry>
+        <entry morerows="65"><literal>LWLock</literal></entry>
         <entry><literal>ShmemIndexLock</literal></entry>
         <entry>Waiting to find or allocate space in shared memory.</entry>
        </row>
@@ -1197,6 +1188,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting to allocate or exchange a chunk of memory or update
          counters during Parallel Hash plan execution.</entry>
         </row>
+        <row>
+         <entry><literal>activity_statistics</literal></entry>
+         <entry>Waiting to write out activity statistics to shared memory
+         during transaction end or idle time.</entry>
+        </row>
         <row>
          <entry morerows="9"><literal>Lock</literal></entry>
          <entry><literal>relation</literal></entry>
@@ -1244,7 +1240,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting to acquire a pin on a buffer.</entry>
         </row>
         <row>
-         <entry morerows="12"><literal>Activity</literal></entry>
+         <entry morerows="11"><literal>Activity</literal></entry>
          <entry><literal>ArchiverMain</literal></entry>
          <entry>Waiting in main loop of the archiver process.</entry>
         </row>
@@ -1272,10 +1268,6 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry><literal>LogicalLauncherMain</literal></entry>
          <entry>Waiting in main loop of logical launcher process.</entry>
         </row>
-        <row>
-         <entry><literal>PgStatMain</literal></entry>
-         <entry>Waiting in main loop of the statistics collector process.</entry>
-        </row>
         <row>
          <entry><literal>RecoveryWalStream</literal></entry>
          <entry>Waiting for WAL from a stream at recovery.</entry>
@@ -4170,9 +4162,10 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
      <entry><literal>performing final cleanup</literal></entry>
      <entry>
        <command>VACUUM</command> is performing final cleanup.  During this phase,
-       <command>VACUUM</command> will vacuum the free space map, update statistics
-       in <literal>pg_class</literal>, and report statistics to the statistics
-       collector.  When this phase is completed, <command>VACUUM</command> will end.
+       <command>VACUUM</command> will vacuum the free space map, update
+       statistics in <literal>pg_class</literal>, and system-wide activity
+       statistics.  When this phase is completed, <command>VACUUM</command>
+       will end.
      </entry>
     </row>
    </tbody>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 13bd320b31..52c61d222a 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1259,11 +1259,10 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The database activity of <application>pg_dump</application> is
-   normally collected by the statistics collector.  If this is
-   undesirable, you can set parameter <varname>track_counts</varname>
-   to false via <envar>PGOPTIONS</envar> or the <literal>ALTER
-   USER</literal> command.
+   The database activity of <application>pg_dump</application> is normally
+   collected.  If this is undesirable, you can set
+   parameter <varname>track_counts</varname> to false
+   via <envar>PGOPTIONS</envar> or the <literal>ALTER USER</literal> command.
   </para>
 
  </refsect1>
-- 
2.18.2

From d7b5e4d7a44e75a973ff1cefe68e12325417e320 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 17:00:44 +0900
Subject: [PATCH v27 7/7] Remove the GUC stats_temp_directory

The GUC was used to specify the directory in which temporary statistics
files were stored. It is no longer needed by the stats collector but is
still used by programs in bin and contrib, and possibly by other
extensions. Thus this patch removes the GUC but leaves some backing
variables and macro definitions in place for backward compatibility.
---
 doc/src/sgml/backup.sgml                      |  2 -
 doc/src/sgml/config.sgml                      | 19 ---------
 doc/src/sgml/storage.sgml                     |  3 +-
 src/backend/postmaster/pgstat.c               | 13 +++---
 src/backend/replication/basebackup.c          | 13 ++----
 src/backend/utils/misc/guc.c                  | 41 -------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/pgstat.h                          |  5 ++-
 src/test/perl/PostgresNode.pm                 |  4 --
 9 files changed, 13 insertions(+), 88 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index bdc9026c62..2885540362 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1146,8 +1146,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 680e1c3564..43d0d303ad 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7111,25 +7111,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 1c19e863d2..2f04bb68bb 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -122,8 +122,7 @@ Item
 
 <row>
  <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
+ <entry>Subdirectory containing ephemeral files for extensions</entry>
 </row>
 
 <row>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index c0760854f4..053cd467fd 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -95,15 +95,12 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
+/*
+ * This used to be a GUC variable.  It is no longer used in this file, but is
+ * kept, fixed at the default value, for backward compatibility with
+ * extensions.
  */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+char       *pgstat_stat_directory = PG_STAT_TMP_DIR;
 
 /*
  * Shared stats bootstrap information, protected by StatsLock.
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index a2e28b064c..7b7d87b938 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -254,7 +254,6 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
     backup_total = 0;
@@ -273,8 +272,6 @@ perform_base_backup(basebackup_options *opt)
                                      backup_total);
     }
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -306,13 +303,9 @@ perform_base_backup(basebackup_options *opt)
          * Calculate the relative path of temporary statistics directory in
          * order to skip the files which are located in that directory later.
          */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
+
+        Assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
+        statrelpath = psprintf("./%s", PG_STAT_TMP_DIR);
 
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index af876d1f01..cabeb806c5 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -197,7 +197,6 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -4231,17 +4230,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11518,35 +11506,6 @@ check_maintenance_io_concurrency(int *newval, void **extra, GucSource source)
     return true;
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index aa44f0c9bf..207e042e99 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -573,7 +573,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 6fad13c4be..4971a88c70 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -32,7 +32,10 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
+/*
+ * This used to be the directory to store temporary statistics data in but is
+ * no longer used. Defined here for backward compatibility.
+ */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
 
 /* Values for track_functions GUC variable --- order is significant! */
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 9575268bd7..f3340f726c 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -455,10 +455,6 @@ sub init
     print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
       if defined $ENV{TEMP_CONFIG};
 
-    # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
-    # concurrently must not share a stats_temp_directory.
-    print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
     if ($params{allows_streaming})
     {
         if ($params{allows_streaming} eq "logical")
-- 
2.18.2


Re: shared-memory based stats collector

From
Alvaro Herrera
Date:
On 2020-Mar-27, Kyotaro Horiguchi wrote:

> +/*
> + * XLogArchiveWakeupEnd - Set up archiver wakeup stuff
> + */
> +void
> +XLogArchiveWakeupStart(void)
> +{
> +    Latch *old_latch PG_USED_FOR_ASSERTS_ONLY;
> +
> +    SpinLockAcquire(&XLogCtl->info_lck);
> +    old_latch = XLogCtl->archiverWakeupLatch;
> +    XLogCtl->archiverWakeupLatch = MyLatch;
> +    SpinLockRelease(&XLogCtl->info_lck);
> +    Assert (old_latch == NULL);
> +}

Comment is wrong about the function name; OTOH I don't think the
old_latch assignment in the fourth line will work well in non-assert
builds.  But why do you need those shenanigans?  Surely
"Assert(XLogCtl->archiverWakeupLatch == NULL)" in the locked region
before assigning MyLatch should be sufficient and acceptable?
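
A minimal sketch of that simplification follows. It is not the patch's code: Latch, slock_t, XLogCtl, MyLatch and the SpinLock functions below are hypothetical stand-ins so that the fragment compiles on its own, and only the body of XLogArchiveWakeupStart() reflects the suggestion.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for PostgreSQL's Latch, slock_t and XLogCtl. */
typedef struct Latch { int is_set; } Latch;
typedef struct slock_t { int held; } slock_t;

typedef struct XLogCtlData
{
    slock_t info_lck;
    Latch  *archiverWakeupLatch;
} XLogCtlData;

static XLogCtlData XLogCtlStorage;      /* zero-initialized */
static XLogCtlData *XLogCtl = &XLogCtlStorage;
static Latch MyLatchStorage;
static Latch *MyLatch = &MyLatchStorage;

/* Toy single-threaded "spinlock": only records whether it is held. */
static void SpinLockAcquire(slock_t *lock) { assert(!lock->held); lock->held = 1; }
static void SpinLockRelease(slock_t *lock) { assert(lock->held); lock->held = 0; }

/*
 * XLogArchiveWakeupStart - Set up archiver wakeup stuff
 */
void
XLogArchiveWakeupStart(void)
{
    SpinLockAcquire(&XLogCtl->info_lck);
    /* Checked while the lock is held, so no temporary variable is needed. */
    assert(XLogCtl->archiverWakeupLatch == NULL);
    XLogCtl->archiverWakeupLatch = MyLatch;
    SpinLockRelease(&XLogCtl->info_lck);
}
```

With the check inside the locked region there is no extra local variable at all, so nothing differs between assert and non-assert builds.
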

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
Thank you for looking at this.

At Fri, 27 Mar 2020 12:34:02 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in 
> On 2020-Mar-27, Kyotaro Horiguchi wrote:
> 
> > +/*
> > + * XLogArchiveWakeupEnd - Set up archiver wakeup stuff
> > + */
> > +void
> > +XLogArchiveWakeupStart(void)
> > +{
> > +    Latch *old_latch PG_USED_FOR_ASSERTS_ONLY;
> > +
> > +    SpinLockAcquire(&XLogCtl->info_lck);
> > +    old_latch = XLogCtl->archiverWakeupLatch;
> > +    XLogCtl->archiverWakeupLatch = MyLatch;
> > +    SpinLockRelease(&XLogCtl->info_lck);
> > +    Assert (old_latch == NULL);
> > +}
> 
> Comment is wrong about the function name; OTOH I don't think the

Oops!  I found a similar mistake in another place. (pgstat_flush_funcentry)

> old_latch assignment in the fourth line will work well in non-assert
> builds.  But why do you need those shenanigans?  Surely
> "Assert(XLogCtl->archiverWakeupLatch == NULL)" in the locked region
> before assigning MyLatch should be sufficient and acceptable?

Right. Maybe I wanted to move the assertion out of the lock section,
but that's actually useless.

Fixed them and rebased.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 02b6ed91dd59ad0e2bf07dc46114f6954b34a292 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Mon, 16 Mar 2020 17:15:35 +0900
Subject: [PATCH v28 1/7] Use standard crash handler in archiver.

Commit 8e19a82640 changed the SIGQUIT handler of almost all processes
so that it does not run atexit callbacks, for safety. The archiver
process should behave the same way for the same reason. The exit status
changes from 1 to 2, but that makes no behavioral difference.
---
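
The difference described above can be sketched in a standalone program. This is an illustration under assumptions, not PostgreSQL code: crash_exit_handler, cleanup_callback and exit_status_after_sigquit are hypothetical names, and the handler simply calls _exit(2) so that registered atexit callbacks are skipped.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical atexit callback that a crashing process must not run. */
static void
cleanup_callback(void)
{
    /* If this ran, the child would leave with status 3 instead of 2. */
    _exit(3);
}

/* Crash-style SIGQUIT handler: leave at once, skipping atexit callbacks. */
static void
crash_exit_handler(int signo)
{
    (void) signo;
    _exit(2);
}

/*
 * Fork a child that installs the handler and sends itself SIGQUIT.
 * Returns the child's exit status, or -1 on failure.
 */
int
exit_status_after_sigquit(void)
{
    int   status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
    {
        atexit(cleanup_callback);
        signal(SIGQUIT, crash_exit_handler);
        raise(SIGQUIT);
        _exit(0);               /* not reached if the handler ran */
    }
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

A handler that called exit(1) instead would run cleanup_callback first, which is the behavior the removed pgarch_exit() had with respect to atexit callbacks.
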
 src/backend/postmaster/pgarch.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 01ffd6513c..37be0e2bbb 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -96,7 +96,6 @@ static pid_t pgarch_forkexec(void);
 #endif
 
 NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_exit(SIGNAL_ARGS);
 static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
@@ -229,7 +228,7 @@ PgArchiverMain(int argc, char *argv[])
     pqsignal(SIGHUP, SignalHandlerForConfigReload);
     pqsignal(SIGINT, SIG_IGN);
     pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
-    pqsignal(SIGQUIT, pgarch_exit);
+    pqsignal(SIGQUIT, SignalHandlerForCrashExit);
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
     pqsignal(SIGUSR1, pgarch_waken);
@@ -246,14 +245,6 @@ PgArchiverMain(int argc, char *argv[])
     exit(0);
 }
 
-/* SIGQUIT signal handler for archiver process */
-static void
-pgarch_exit(SIGNAL_ARGS)
-{
-    /* SIGQUIT means curl up and die ... */
-    exit(1);
-}
-
 /* SIGUSR1 signal handler for archiver process */
 static void
 pgarch_waken(SIGNAL_ARGS)
-- 
2.18.2

From 632c5e2632884b1322c898d3737451fa44cd9894 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:03 +0900
Subject: [PATCH v28 2/7] sequential scan for dshash

Dshash did not allow scanning all entries sequentially. This adds that
functionality. The interface is similar to, but slightly different
from, both dynahash's and the simple dshash search functions. One of
the most significant differences is that the sequential scan interface
of dshash always needs a call to dshash_seq_term when a scan ends.
Another is locking. Dshash holds a partition lock when returning an
entry; dshash_seq_next() also holds a lock when returning an entry, but
callers must not release it, since the lock is essential to continue
the scan. The seqscan interface allows entry deletion during a scan.
In-scan deletion should be performed by dshash_delete_current().
---
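
The scan protocol described above (init, repeated next, mandatory term, with the partition lock moved along as buckets are visited) can be sketched with a toy in-process table. Everything prefixed toy_ is hypothetical, the "locks" are plain flags rather than LWLocks, and the first call is detected through curpartition rather than the way the patch does it; only the two macros mirror the ones in the diff.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PARTITIONS_LOG2 1   /* toy: 2 partitions */
#define SIZE_LOG2           3   /* toy: 8 buckets */
#define NUM_SPLITS(size_log2)  ((size_log2) - NUM_PARTITIONS_LOG2)
#define NUM_BUCKETS(size_log2) (((size_t) 1) << (size_log2))
#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2) \
    ((bucket_idx) >> NUM_SPLITS(size_log2))

typedef struct toy_item { int value; struct toy_item *next; } toy_item;

typedef struct toy_table
{
    toy_item *buckets[1 << SIZE_LOG2];
    int       lock_held[1 << NUM_PARTITIONS_LOG2]; /* stand-in for LWLocks */
} toy_table;

typedef struct toy_seq_status
{
    toy_table *table;
    size_t     curbucket;
    toy_item  *nextitem;
    int        curpartition;    /* -1 until the scan has started */
} toy_seq_status;

void
toy_seq_init(toy_seq_status *st, toy_table *table)
{
    st->table = table;
    st->curbucket = 0;
    st->nextitem = NULL;
    st->curpartition = -1;
}

/* Returns the next item, or NULL when all buckets have been scanned. */
toy_item *
toy_seq_next(toy_seq_status *st)
{
    toy_item *item;

    if (st->curpartition < 0)
    {
        /* First call: lock the partition of bucket 0. */
        st->curpartition = 0;
        st->table->lock_held[0] = 1;
        st->nextitem = st->table->buckets[0];
    }

    while (st->nextitem == NULL)
    {
        int next_partition;

        if (++st->curbucket >= NUM_BUCKETS(SIZE_LOG2))
            return NULL;
        next_partition =
            (int) PARTITION_FOR_BUCKET_INDEX(st->curbucket, SIZE_LOG2);
        if (next_partition != st->curpartition)
        {
            /* Take the next lock before dropping the current one. */
            st->table->lock_held[next_partition] = 1;
            st->table->lock_held[st->curpartition] = 0;
            st->curpartition = next_partition;
        }
        st->nextitem = st->table->buckets[st->curbucket];
    }

    item = st->nextitem;
    st->nextitem = item->next;  /* remembered in case item is deleted */
    return item;
}

/* Must always be called: releases whatever lock the scan still holds. */
void
toy_seq_term(toy_seq_status *st)
{
    if (st->curpartition >= 0)
        st->table->lock_held[st->curpartition] = 0;
    st->curpartition = -1;
}

/* Typical caller: init, loop on next, then term. */
int
toy_sum_all(toy_table *table)
{
    toy_seq_status st;
    toy_item      *item;
    int            sum = 0;

    toy_seq_init(&st, table);
    while ((item = toy_seq_next(&st)) != NULL)
        sum += item->value;
    toy_seq_term(&st);
    return sum;
}
```

A caller of the real interface would follow the same shape as toy_sum_all(), using dshash_seq_init(), dshash_seq_next() and dshash_seq_term() as declared in dshash.h below.
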
 src/backend/lib/dshash.c | 150 ++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  21 ++++++
 2 files changed, 170 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 78ccf03217..fb7e23c4cb 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -127,6 +127,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +157,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -324,7 +332,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -592,6 +600,146 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should always be called when a scan finishes.
+ * The caller may delete returned elements in the midst of a scan by using
+ * dshash_delete_current(). exclusive must be true to delete elements.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool exclusive)
+{
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->exclusive = exclusive;
+}
+
+/*
+ * Returns the next element.
+ *
+ * Returned elements are locked and the caller must not explicitly release
+ * them. The lock is released at the next call to dshash_seq_next().
+ */
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert(status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+        LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                      status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        status->curpartition = partition;
+
+        /* resize doesn't happen from now until seq scan ends */
+        status->nbuckets =
+            NUM_BUCKETS(status->hash_table->control->size_log2);
+        ensure_valid_bucket_pointers(status->hash_table);
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        int next_partition;
+
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            return NULL;
+        }
+
+        /* Also move partition lock if needed */
+        next_partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+
+        /* Move lock along with partition for the bucket */
+        if (status->curpartition != next_partition)
+        {
+            /*
+             * Lock the next partition then release the current, not in the
+             * reverse order, to avoid concurrent resizing. Partitions are
+             * locked in the same order as in resize() so deadlocks won't
+             * happen.
+             */
+            LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                         next_partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                         status->curpartition));
+            status->curpartition = next_partition;
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * The caller may delete the item. Store the next item in case of deletion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+/*
+ * Terminates the seqscan and releases all locks.
+ *
+ * Should always be called when finishing or exiting a seqscan.
+ */
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+
+    if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+/* Remove the current entry during a seq scan. */
+void
+dshash_delete_current(dshash_seq_status *status)
+{
+    dshash_table       *hash_table    = status->hash_table;
+    dshash_table_item  *item        = status->curitem;
+    size_t                partition    = PARTITION_FOR_HASH(item->hash);
+
+    Assert(status->exclusive);
+    Assert(hash_table->control->magic == DSHASH_MAGIC);
+    Assert(hash_table->find_locked);
+    Assert(hash_table->find_exclusively_locked);
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition),
+                                LW_EXCLUSIVE));
+
+    delete_item(hash_table, item);
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b86df68e77..81a929b8d9 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,21 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state. The detail is exposed to let users know the storage
+ * size but it should be considered as an opaque type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;
+    int                    curbucket;
+    int                    nbuckets;
+    dshash_table_item  *curitem;
+    dsa_pointer            pnextitem;
+    int                    curpartition;
+    bool                exclusive;
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
                                    const dshash_parameters *params,
@@ -80,6 +95,12 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
+extern void dshash_delete_current(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.18.2

From 06f4074ce7c80379ff6a088e85271b82e462f111 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:57 +0900
Subject: [PATCH v28 3/7] Add conditional lock feature to dshash

Dshash currently waits for locks unconditionally, which is inconvenient
when a caller wants to avoid being blocked by other processes. This
commit adds dshash_find_extended, an alternative to dshash_find and
dshash_find_or_insert that can return immediately on lock failure.
---
 src/backend/lib/dshash.c | 99 +++++++++++++++++++++-------------------
 src/include/lib/dshash.h |  3 ++
 2 files changed, 56 insertions(+), 46 deletions(-)
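
The conditional-lock behaviour described in the commit message can be
illustrated with a small standalone model. This is only a sketch under
stated assumptions: it does not use dshash or LWLocks, every demo_* name is
hypothetical, the lock is a toy built on C11 atomics, and the lock_failed
output argument is an addition for illustration that is not part of this
patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy lock, NOT an LWLock: 0 = free, -1 = exclusive, n > 0 = n sharers. */
typedef struct demo_lock
{
    atomic_int  state;
} demo_lock;

/* Hypothetical stand-in for one hash partition holding a single entry. */
typedef struct demo_partition
{
    demo_lock   lock;
    bool        used;
    int         key;
    int         value;
} demo_partition;

static void
demo_partition_init(demo_partition *part, int key, int value)
{
    atomic_init(&part->lock.state, 0);
    part->used = true;
    part->key = key;
    part->value = value;
}

/* Make one attempt to take the lock; never blocks. */
static bool
demo_lock_try(demo_lock *lock, bool exclusive)
{
    int         cur = 0;

    if (exclusive)
        return atomic_compare_exchange_strong(&lock->state, &cur, -1);

    cur = atomic_load(&lock->state);
    while (cur >= 0)
    {
        /* on failure, cur is reloaded with the current state */
        if (atomic_compare_exchange_weak(&lock->state, &cur, cur + 1))
            return true;
    }
    return false;               /* held exclusively by someone else */
}

static void
demo_lock_release(demo_lock *lock)
{
    int         cur = atomic_load(&lock->state);

    assert(cur != 0);           /* caller must hold the lock */
    if (cur == -1)
        atomic_store(&lock->state, 0);
    else
        atomic_fetch_sub(&lock->state, 1);
}

/*
 * Look up 'key'.  On success the entry is returned with the lock still
 * held, and the caller must call demo_lock_release().  On NULL, *lock_failed
 * tells a lock failure (only possible with nowait) from a missing key.
 */
static int *
demo_find(demo_partition *part, int key, bool exclusive, bool nowait,
          bool *lock_failed)
{
    assert(lock_failed != NULL);
    *lock_failed = false;

    if (nowait)
    {
        if (!demo_lock_try(&part->lock, exclusive))
        {
            *lock_failed = true;
            return NULL;
        }
    }
    else
    {
        while (!demo_lock_try(&part->lock, exclusive))
            ;                   /* busy-wait; a real lock would sleep */
    }

    if (!part->used || part->key != key)
    {
        demo_lock_release(&part->lock);
        return NULL;
    }
    return &part->value;
}
```

In the patch itself, dshash_find_extended returns NULL both when the lock
could not be taken and when the key is absent in non-insert mode, so a
caller that needs to tell the two apart has to arrange that separately; the
extra flag above is one possible way to model that.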

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index fb7e23c4cb..b4dc8e1ece 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -383,6 +383,10 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  * the caller must take care to ensure that the entry is not left corrupted.
  * The lock mode is either shared or exclusive depending on 'exclusive'.
  *
+ * If found is not NULL, *found is set to true if the key is found in the hash
+ * table. If the key is not found, *found is set to false and a pointer to a
+ * newly created entry is returned.
+ *
  * The caller must not lock a lock already.
  *
  * Note that the lock held is in fact an LWLock, so interrupts will be held on
@@ -392,36 +396,7 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
 {
-    dshash_hash hash;
-    size_t        partition;
-    dshash_table_item *item;
-
-    hash = hash_key(hash_table, key);
-    partition = PARTITION_FOR_HASH(hash);
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
-
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
-    ensure_valid_bucket_pointers(hash_table);
-
-    /* Search the active bucket. */
-    item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
-
-    if (!item)
-    {
-        /* Not found. */
-        LWLockRelease(PARTITION_LOCK(hash_table, partition));
-        return NULL;
-    }
-    else
-    {
-        /* The caller will free the lock by calling dshash_release_lock. */
-        hash_table->find_locked = true;
-        hash_table->find_exclusively_locked = exclusive;
-        return ENTRY_FROM_ITEM(item);
-    }
+    return dshash_find_extended(hash_table, key, exclusive, false, false, NULL);
 }
 
 /*
@@ -439,31 +414,61 @@ dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
 {
-    dshash_hash hash;
-    size_t        partition_index;
-    dshash_partition *partition;
+    return dshash_find_extended(hash_table, key, true, false, true, found);
+}
+
+
+/*
+ * Find the key in the hash table.
+ *
+ * "insert" selects insert mode.  In this mode, if the key is not found, a new
+ * entry is inserted and *found is set to false; if the key is found, *found
+ * is set to true.  "found" must be non-null in this mode.
+ *
+ * If nowait is true, the function returns NULL immediately if the required
+ * lock could not be acquired.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool insert, bool *found)
+{
+    dshash_hash hash = hash_key(hash_table, key);
+    size_t        partidx = PARTITION_FOR_HASH(hash);
+    dshash_partition *partition = &hash_table->control->partitions[partidx];
+    LWLockMode  lockmode = exclusive ? LW_EXCLUSIVE : LW_SHARED;
     dshash_table_item *item;
 
-    hash = hash_key(hash_table, key);
-    partition_index = PARTITION_FOR_HASH(hash);
-    partition = &hash_table->control->partitions[partition_index];
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
+    /* must be exclusive when insert allowed */
+    Assert(!insert || (exclusive && found != NULL));
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (!nowait)
+        LWLockAcquire(PARTITION_LOCK(hash_table, partidx), lockmode);
+    else if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partidx),
+                                       lockmode))
+        return NULL;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
     item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
 
     if (item)
-        *found = true;
+    {
+        if (found)
+            *found = true;
+    }
     else
     {
-        *found = false;
+        if (found)
+            *found = false;
+
+        if (!insert)
+        {
+            /* The caller didn't ask us to add a new entry. */
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+            return NULL;
+        }
 
         /* Check if we are getting too full. */
         if (partition->count > MAX_COUNT_PER_PARTITION(hash_table))
@@ -479,7 +484,8 @@ restart:
              * Give up our existing lock first, because resizing needs to
              * reacquire all the locks in the right order to avoid deadlocks.
              */
-            LWLockRelease(PARTITION_LOCK(hash_table, partition_index));
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+
             resize(hash_table, hash_table->size_log2 + 1);
 
             goto restart;
@@ -493,12 +499,13 @@ restart:
         ++partition->count;
     }
 
-    /* The caller must release the lock with dshash_release_lock. */
+    /* The caller will free the lock by calling dshash_release_lock. */
     hash_table->find_locked = true;
-    hash_table->find_exclusively_locked = true;
+    hash_table->find_exclusively_locked = exclusive;
     return ENTRY_FROM_ITEM(item);
 }
 
+
 /*
  * Remove an entry by key.  Returns true if the key was found and the
  * corresponding entry was removed.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 81a929b8d9..80a896a99b 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -91,6 +91,9 @@ extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                                    const void *key, bool *found);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+                                  bool exclusive, bool nowait, bool insert,
+                                  bool *found);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.18.2

From e5e47039ac582bc004d96acac2b0c6239f46fa5b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:59:38 +0900
Subject: [PATCH v28 4/7] Make archiver process an auxiliary process

This is a preliminary patch for the shared-memory based stats collector.

The archiver process must be an auxiliary process because it uses shared
memory once stats data has been moved into shared memory. Make the
process an auxiliary process so that it keeps working.
---
 src/backend/access/transam/xlog.c        |  46 ++++++++++
 src/backend/access/transam/xlogarchive.c |   2 +-
 src/backend/bootstrap/bootstrap.c        |  22 ++---
 src/backend/postmaster/pgarch.c          | 102 ++++-------------------
 src/backend/postmaster/postmaster.c      |  53 ++++++------
 src/include/access/xlog.h                |   2 +
 src/include/access/xlog_internal.h       |   1 +
 src/include/miscadmin.h                  |   2 +
 src/include/postmaster/pgarch.h          |   4 +-
 src/include/storage/pmsignal.h           |   1 -
 10 files changed, 108 insertions(+), 127 deletions(-)
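
The wakeup path this patch introduces (the archiver publishes its latch in
shared state, and notifiers set that latch directly instead of signalling
the postmaster) can be modelled in a few lines of standalone C. This is a
sketch under assumptions, not PostgreSQL code: demo_ctl, demo_latch and the
demo_* functions are hypothetical stand-ins for XLogCtl, Latch and the
XLogArchiveWakeup* functions, and the spinlock is a toy built on
atomic_flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a Latch: just a flag a waiter would poll. */
typedef struct demo_latch
{
    atomic_bool is_set;
} demo_latch;

/* Hypothetical stand-in for the shared control struct. */
typedef struct demo_ctl
{
    atomic_flag lck;            /* toy spinlock protecting the pointer */
    demo_latch *archiver_latch; /* NULL while no archiver is registered */
} demo_ctl;

#define DEMO_CTL_INIT { ATOMIC_FLAG_INIT, NULL }

static void
demo_spin_lock(demo_ctl *ctl)
{
    while (atomic_flag_test_and_set(&ctl->lck))
        ;                       /* spin until the flag was clear */
}

static void
demo_spin_unlock(demo_ctl *ctl)
{
    atomic_flag_clear(&ctl->lck);
}

/* Called by the archiver at startup: publish its latch. */
static void
demo_wakeup_start(demo_ctl *ctl, demo_latch *my_latch)
{
    demo_spin_lock(ctl);
    assert(ctl->archiver_latch == NULL);
    ctl->archiver_latch = my_latch;
    demo_spin_unlock(ctl);
}

/* Called by the archiver on exit: withdraw the latch. */
static void
demo_wakeup_end(demo_ctl *ctl)
{
    demo_spin_lock(ctl);
    ctl->archiver_latch = NULL;
    demo_spin_unlock(ctl);
}

/*
 * Called by any process with new work for the archiver.  The pointer is
 * copied under the lock and the latch is set after releasing it.  Returns
 * true if an archiver was registered and has been woken.
 */
static bool
demo_wakeup(demo_ctl *ctl)
{
    demo_latch *latch;

    demo_spin_lock(ctl);
    latch = ctl->archiver_latch;
    demo_spin_unlock(ctl);

    if (latch == NULL)
        return false;

    atomic_store(&latch->is_set, true);
    return true;
}
```

Copying the pointer under the lock and setting the latch afterwards keeps
the critical section short, which mirrors the shape of XLogArchiveWakeup in
the diff below.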

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8fe92962b0..5e663699d5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -681,6 +681,13 @@ typedef struct XLogCtlData
      */
     Latch        recoveryWakeupLatch;
 
+    /*
+     * archiverWakeupLatch is used to wake up the archiver process to process
+     * completed WAL segments, if it is waiting for WAL to arrive.  Protected
+     * by info_lck.
+     */
+    Latch       *archiverWakeupLatch;
+
     /*
      * During recovery, we keep a copy of the latest checkpoint record here.
      * lastCheckPointRecPtr points to start of checkpoint record and
@@ -8386,6 +8393,45 @@ GetLastSegSwitchData(XLogRecPtr *lastSwitchLSN)
     return result;
 }
 
+/*
+ * XLogArchiveWakeupStart - Set up archiver wakeup stuff
+ */
+void
+XLogArchiveWakeupStart(void)
+{
+    SpinLockAcquire(&XLogCtl->info_lck);
+    Assert(XLogCtl->archiverWakeupLatch == NULL);
+    XLogCtl->archiverWakeupLatch = MyLatch;
+    SpinLockRelease(&XLogCtl->info_lck);
+}
+
+/*
+ * XLogArchiveWakeupEnd - Clean up archiver wakeup stuff
+ */
+void
+XLogArchiveWakeupEnd(void)
+{
+    SpinLockAcquire(&XLogCtl->info_lck);
+    XLogCtl->archiverWakeupLatch = NULL;
+    SpinLockRelease(&XLogCtl->info_lck);
+}
+
+/*
+ * XLogArchiveWakeup - Wake up archiver process
+ */
+void
+XLogArchiveWakeup(void)
+{
+    Latch *latch;
+
+    SpinLockAcquire(&XLogCtl->info_lck);
+    latch = XLogCtl->archiverWakeupLatch;
+    SpinLockRelease(&XLogCtl->info_lck);
+
+    if (latch)
+        SetLatch(latch);
+}
+
 /*
  * This must be called ONCE during postmaster or standalone-backend shutdown
  */
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 914ad340ea..47c2b4a373 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -489,7 +489,7 @@ XLogArchiveNotify(const char *xlog)
 
     /* Notify archiver that it's got something to do */
     if (IsUnderPostmaster)
-        SendPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER);
+        XLogArchiveWakeup();
 }
 
 /*
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 5480a024e0..d398ce6f03 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -319,6 +319,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
         case StartupProcess:
             MyBackendType = B_STARTUP;
             break;
+        case ArchiverProcess:
+            MyBackendType = B_ARCHIVER;
+            break;
         case BgWriterProcess:
             MyBackendType = B_BG_WRITER;
             break;
@@ -439,30 +442,29 @@ AuxiliaryProcessMain(int argc, char *argv[])
             proc_exit(1);        /* should never return */
 
         case StartupProcess:
-            /* don't set signals, startup process has its own agenda */
             StartupProcessMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
+
+        case ArchiverProcess:
+            PgArchiverMain();
+            proc_exit(1);
 
         case BgWriterProcess:
-            /* don't set signals, bgwriter has its own agenda */
             BackgroundWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case CheckpointerProcess:
-            /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalWriterProcess:
-            /* don't set signals, walwriter has its own agenda */
             InitXLOGAccess();
             WalWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalReceiverProcess:
-            /* don't set signals, walreceiver has its own agenda */
             WalReceiverMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         default:
             elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 37be0e2bbb..6fe7a136ba 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -48,6 +48,7 @@
 #include "storage/latch.h"
 #include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
+#include "storage/procsignal.h"
 #include "utils/guc.h"
 #include "utils/ps_status.h"
 
@@ -78,7 +79,6 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
@@ -95,8 +95,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
 static void pgarch_ArchiverCopyLoop(void);
@@ -110,75 +108,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -212,14 +141,21 @@ pgarch_forkexec(void)
 #endif                            /* EXEC_BACKEND */
 
 
+/* Clean up notification stuff on exit */
+static void
+PgArchiverKill(int code, Datum arg)
+{
+    XLogArchiveWakeupEnd();
+}
+
 /*
  * PgArchiverMain
  *
  *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
  *    since we don't use 'em, it hardly matters...
  */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -231,7 +167,7 @@ PgArchiverMain(int argc, char *argv[])
     pqsignal(SIGQUIT, SignalHandlerForCrashExit);
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, pgarch_waken);
+    pqsignal(SIGUSR1, procsignal_sigusr1_handler);
     pqsignal(SIGUSR2, pgarch_waken_stop);
     /* Reset some signals that are accepted by postmaster but not here */
     pqsignal(SIGCHLD, SIG_DFL);
@@ -240,24 +176,14 @@ PgArchiverMain(int argc, char *argv[])
     MyBackendType = B_ARCHIVER;
     init_ps_display(NULL);
 
+    XLogArchiveWakeupStart();
+    on_shmem_exit(PgArchiverKill, 0);
+
     pgarch_MainLoop();
 
     exit(0);
 }
 
-/* SIGUSR1 signal handler for archiver process */
-static void
-pgarch_waken(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    /* set flag that there is work to be done */
-    wakened = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
 /* SIGUSR2 signal handler for archiver process */
 static void
 pgarch_waken_stop(SIGNAL_ARGS)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 73d278f3b2..a4b9d212a2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -146,7 +146,8 @@
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
 #define BACKEND_TYPE_BGWORKER    0x0008    /* bgworker process */
-#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
+#define BACKEND_TYPE_ARCHIVER    0x0010    /* archiver process */
+#define BACKEND_TYPE_ALL        0x001F    /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -539,6 +540,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1785,7 +1787,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -3055,7 +3057,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3190,20 +3192,16 @@ reaper(SIGNAL_ARGS)
         }
 
         /*
-         * Was it the archiver?  If so, just try to start a new one; no need
-         * to force reset of the rest of the system.  (If fail, we'll try
-         * again in future cycles of the main loop.).  Unless we were waiting
-         * for it to shut down; don't restart it in that case, and
-         * PostmasterStateMachine() will advance to the next shutdown step.
+         * Was it the archiver?  Normal exit can be ignored; we'll start a new
+         * one at the next iteration of the postmaster's main loop, if
+         * necessary. Any other exit condition is treated as a crash.
          */
         if (pid == PgArchPID)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3451,7 +3449,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3656,6 +3654,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3928,6 +3938,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5208,7 +5219,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5251,16 +5262,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (StartWorkerNeeded || HaveCrashedWorker)
         maybe_start_bgworkers();
 
-    if (CheckPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER) &&
-        PgArchPID != 0)
-    {
-        /*
-         * Send SIGUSR1 to archiver process, to wake it up and begin archiving
-         * next WAL file.
-         */
-        signal_child(PgArchPID, SIGUSR1);
-    }
-
     /* Tell syslogger to rotate logfile if requested */
     if (SysLoggerPID != 0)
     {
@@ -5493,6 +5494,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 2b1b67d35c..cc134fd61e 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -312,6 +312,8 @@ extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern void XLogArchiveWakeupStart(void);
+extern void XLogArchiveWakeupEnd(void);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool PromoteIsTriggered(void);
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 27ded593ab..a272d62b1f 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -331,6 +331,7 @@ extern void ExecuteRecoveryCommand(const char *command, const char *commandName,
 extern void KeepFileRestoredFromArchive(const char *path, const char *xlogfname);
 extern void XLogArchiveNotify(const char *xlog);
 extern void XLogArchiveNotifySeg(XLogSegNo segno);
+extern void XLogArchiveWakeup(void);
 extern void XLogArchiveForceDone(const char *xlog);
 extern bool XLogArchiveCheckDone(const char *xlog);
 extern bool XLogArchiveIsBusy(const char *xlog);
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14fa127ab1..619b2f9c71 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -417,6 +417,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -429,6 +430,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index b3200874ca..e3ffc63f14 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 56c5ec4481..c691acf8cd 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -34,7 +34,6 @@ typedef enum
 {
     PMSIGNAL_RECOVERY_STARTED,    /* recovery has started */
     PMSIGNAL_BEGIN_HOT_STANDBY, /* begin Hot Standby */
-    PMSIGNAL_WAKEN_ARCHIVER,    /* send a NOTIFY signal to xlog archiver */
     PMSIGNAL_ROTATE_LOGFILE,    /* send SIGUSR1 to syslogger to rotate logfile */
     PMSIGNAL_START_AUTOVAC_LAUNCHER,    /* start an autovacuum launcher */
     PMSIGNAL_START_AUTOVAC_WORKER,    /* start an autovacuum worker */
-- 
2.18.2

From b50d4f62f0539048c3f48b6e48944e456ce4b809 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:14 +0900
Subject: [PATCH v28 5/7] Shared-memory based stats collector

Previously, activity statistics were collected via sockets and shared
among backends through files written periodically. Such files can reach
tens of megabytes and are created as often as every 500ms; the stats
collector serializes that data and every backend periodically
de-serializes it. To avoid that cost, this patch places activity
statistics data in shared memory. Each backend accumulates statistics
locally, then tries to move them into the shared statistics at every
transaction end, but at intervals no shorter than 500ms. Locks on the
shared statistics are acquired per object, such as a table or a
function, so the expected chance of collision is not high. Furthermore,
until 1 second has elapsed since the last flush to shared stats, a lock
failure postpones the flush so that lock contention doesn't slow down
transactions. After that, the flush waits for locks so that the shared
statistics don't get stale.
---
 src/backend/access/transam/xlog.c            |    4 +-
 src/backend/catalog/index.c                  |   24 +-
 src/backend/postmaster/autovacuum.c          |   54 +-
 src/backend/postmaster/bgwriter.c            |    2 +-
 src/backend/postmaster/checkpointer.c        |   12 +-
 src/backend/postmaster/pgarch.c              |    4 +-
 src/backend/postmaster/pgstat.c              | 4864 +++++++-----------
 src/backend/postmaster/postmaster.c          |   85 +-
 src/backend/storage/buffer/bufmgr.c          |    8 +-
 src/backend/storage/ipc/ipci.c               |    2 +
 src/backend/storage/lmgr/lwlock.c            |    1 +
 src/backend/storage/lmgr/lwlocknames.txt     |    1 +
 src/backend/tcop/postgres.c                  |   26 +-
 src/backend/utils/adt/pgstatfuncs.c          |   53 +-
 src/backend/utils/init/globals.c             |    1 +
 src/backend/utils/init/postinit.c            |   11 +
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |    4 +-
 src/include/miscadmin.h                      |    2 +
 src/include/pgstat.h                         |  514 +-
 src/include/storage/lwlock.h                 |    1 +
 src/include/utils/timeout.h                  |    1 +
 21 files changed, 2062 insertions(+), 3612 deletions(-)
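
The flush timing described in the commit message can be summarised as a
three-way decision on the time elapsed since the last flush. The following
standalone sketch assumes the 500ms and 1 second thresholds quoted above;
the enum, the constants and choose_flush_mode are hypothetical names for
illustration and do not appear in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Thresholds taken from the commit message; the names are hypothetical. */
#define DEMO_MIN_FLUSH_INTERVAL_MS  500     /* never flush more often */
#define DEMO_MAX_NOWAIT_INTERVAL_MS 1000    /* after this, wait for locks */

typedef enum demo_flush_mode
{
    DEMO_FLUSH_SKIP,            /* too soon: keep accumulating locally */
    DEMO_FLUSH_NOWAIT,          /* flush, but postpone on lock failure */
    DEMO_FLUSH_WAIT             /* flush and wait for contended locks */
} demo_flush_mode;

/*
 * Decide how to flush local stats at transaction end, given the current
 * time and the time of the last successful flush, both in milliseconds.
 */
static demo_flush_mode
choose_flush_mode(int64_t now_ms, int64_t last_flush_ms)
{
    int64_t     elapsed = now_ms - last_flush_ms;

    assert(elapsed >= 0);

    if (elapsed < DEMO_MIN_FLUSH_INTERVAL_MS)
        return DEMO_FLUSH_SKIP;
    if (elapsed < DEMO_MAX_NOWAIT_INTERVAL_MS)
        return DEMO_FLUSH_NOWAIT;
    return DEMO_FLUSH_WAIT;
}
```

A caller would try conditional lock acquisition in the DEMO_FLUSH_NOWAIT
window and leave any entry whose lock is busy for a later attempt; only once
the window has passed would it block on the lock, which bounds how stale the
shared numbers can become.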

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5e663699d5..749c2e8adb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8530,9 +8530,9 @@ LogCheckpointEnd(bool restartpoint)
                         &sync_secs, &sync_usecs);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time +=
+    BgWriterStats.checkpoint_write_time +=
         write_secs * 1000 + write_usecs / 1000;
-    BgWriterStats.m_checkpoint_sync_time +=
+    BgWriterStats.checkpoint_sync_time +=
         sync_secs * 1000 + sync_usecs / 1000;
 
     /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 2d81bc3cbc..4de574ae00 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1687,28 +1687,10 @@ index_concurrently_swap(Oid newIndexId, Oid oldIndexId, const char *oldName)
 
     /*
      * Copy over statistics from old to new index
+     * The data will be sent by the next pgstat_report_stat()
+     * call.
      */
-    {
-        PgStat_StatTabEntry *tabentry;
-
-        tabentry = pgstat_fetch_stat_tabentry(oldIndexId);
-        if (tabentry)
-        {
-            if (newClassRel->pgstat_info)
-            {
-                newClassRel->pgstat_info->t_counts.t_numscans = tabentry->numscans;
-                newClassRel->pgstat_info->t_counts.t_tuples_returned = tabentry->tuples_returned;
-                newClassRel->pgstat_info->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_hit = tabentry->blocks_hit;
-
-                /*
-                 * The data will be sent by the next pgstat_report_stat()
-                 * call.
-                 */
-            }
-        }
-    }
+    pgstat_copy_index_counters(oldIndexId, newClassRel->pgstat_info);
 
     /* Close relations */
     table_close(pg_class, RowExclusiveLock);
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 7e97ffab27..800693af4a 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -338,9 +338,6 @@ static void autovacuum_do_vac_analyze(autovac_table *tab,
                                       BufferAccessStrategy bstrategy);
 static AutoVacOpts *extract_autovac_opts(HeapTuple tup,
                                          TupleDesc pg_class_desc);
-static PgStat_StatTabEntry *get_pgstat_tabentry_relid(Oid relid, bool isshared,
-                                                      PgStat_StatDBEntry *shared,
-                                                      PgStat_StatDBEntry *dbentry);
 static void perform_work_item(AutoVacuumWorkItem *workitem);
 static void autovac_report_activity(autovac_table *tab);
 static void autovac_report_workitem(AutoVacuumWorkItem *workitem,
@@ -1938,8 +1935,6 @@ do_autovacuum(void)
     HASHCTL        ctl;
     HTAB       *table_toast_map;
     ListCell   *volatile cell;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     BufferAccessStrategy bstrategy;
     ScanKeyData key;
     TupleDesc    pg_class_desc;
@@ -1958,12 +1953,6 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /*
-     * may be NULL if we couldn't find an entry (only happens if we are
-     * forcing a vacuum for anti-wrap purposes).
-     */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
 
@@ -2011,9 +2000,6 @@ do_autovacuum(void)
     /* StartTransactionCommand changed elsewhere */
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-
     classRel = table_open(RelationRelationId, AccessShareLock);
 
     /* create a copy so we can use it after closing pg_class */
@@ -2092,8 +2078,8 @@ do_autovacuum(void)
 
         /* Fetch reloptions and the pgstat entry for this table */
         relopts = extract_autovac_opts(tuple, pg_class_desc);
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                       relid);
 
         /* Check if it needs vacuum or analyze */
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
@@ -2176,8 +2162,8 @@ do_autovacuum(void)
         }
 
         /* Fetch the pgstat entry for this table */
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                       relid);
 
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
@@ -2736,29 +2722,6 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
     return av;
 }
 
-/*
- * get_pgstat_tabentry_relid
- *
- * Fetch the pgstat entry of a table, either local to a database or shared.
- */
-static PgStat_StatTabEntry *
-get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
-                          PgStat_StatDBEntry *dbentry)
-{
-    PgStat_StatTabEntry *tabentry = NULL;
-
-    if (isshared)
-    {
-        if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
-    }
-    else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
-
-    return tabentry;
-}
 
 /*
  * table_recheck_autovac
@@ -2779,17 +2742,12 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     bool        doanalyze;
     autovac_table *tab = NULL;
     PgStat_StatTabEntry *tabentry;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     bool        wraparound;
     AutoVacOpts *avopts;
 
     /* use fresh stats */
     autovac_refresh_stats();
 
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* fetch the relation's relcache entry */
     classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
     if (!HeapTupleIsValid(classTup))
@@ -2813,8 +2771,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     }
 
     /* fetch the pgstat table entry */
-    tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                         shared, dbentry);
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                   relid);
 
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 069e27e427..94bdd664b5 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -236,7 +236,7 @@ BackgroundWriterMain(void)
         /*
          * Send off activity statistics to the stats collector
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index e354a78725..8a2fd0ddb2 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -350,7 +350,7 @@ CheckpointerMain(void)
         if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
         {
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            BgWriterStats.requested_checkpoints++;
         }
 
         /*
@@ -364,7 +364,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                BgWriterStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -492,7 +492,7 @@ CheckpointerMain(void)
          * worth the trouble to split the stats support into two independent
          * stats message types.)
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         /*
          * Sleep until we are signaled or it's time for another checkpoint or
@@ -693,7 +693,7 @@ CheckpointWriteDelay(int flags, double progress)
         /*
          * Report interim activity statistics to the stats collector.
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1238,8 +1238,8 @@ AbsorbSyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    BgWriterStats.buf_written_backend += CheckpointerShmem->num_backend_writes;
+    BgWriterStats.buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 6fe7a136ba..f0b524ca50 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -401,7 +401,7 @@ pgarch_ArchiverCopyLoop(void)
                  * Tell the collector about the WAL file that we successfully
                  * archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_report_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
@@ -411,7 +411,7 @@ pgarch_ArchiverCopyLoop(void)
                  * Tell the collector about the WAL file that we failed to
                  * archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_report_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ab42df7e1b..8c65df9bfd 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,22 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Activity Statistics facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects activity statistics, e.g. per-table access statistics, of
+ *  all backends in shared memory. The activity numbers are first stored
+ *  locally in each process, then flushed to shared memory at commit
+ *  time or by idle-timeout.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ * To avoid congestion on the shared memory, shared stats are updated no more
+ * often than once per PGSTAT_MIN_INTERVAL (1000ms). If some local numbers
+ * remain unflushed due to lock failure, retry at intervals that start at
+ * PGSTAT_RETRY_MIN_INTERVAL (250ms) and double at every retry. Finally we
+ * force an update PGSTAT_MAX_INTERVAL (10000ms) after the first attempt.
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the activity statistics facility creates the
+ *  area and loads the stored stats file, if any. The last process at shutdown
+ *  writes the shared stats to the file, then destroys the area before exit.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -19,18 +26,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -40,68 +35,43 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/fork_process.h"
 #include "postmaster/interrupt.h"
 #include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
-#include "utils/timestamp.h"
 
 /* ----------
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
+#define PGSTAT_MIN_INTERVAL            1000    /* Minimum interval of stats data
+                                             * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_RETRY_MIN_INTERVAL    250 /* Initial retry interval after
+                                         * PGSTAT_MIN_INTERVAL */
 
+#define PGSTAT_MAX_INTERVAL            10000    /* Longest interval of stats data
+                                             * updates */
 
 /* ----------
- * The initial size hints for the hash tables used in the collector.
+ * The initial size hints for the hash tables used in the activity statistics.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
-#define PGSTAT_TAB_HASH_SIZE    512
+#define PGSTAT_TABLE_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
 
@@ -116,7 +86,6 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
-
 /* ----------
  * GUC parameters
  * ----------
@@ -131,75 +100,162 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used; to be removed together with the GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
 /*
- * BgWriter global statistics counters (unused in other processes).
- * Stored directly in a stats message structure so it can be sent
- * without needing to copy things around.  We assume this inits to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
-
-/* ----------
- * Local data
- * ----------
+ * Shared stats bootstrap information, protected by StatsLock.
+ *
+ * refcount is used to know whether a process going to detach shared stats is
+ * the last process or not. The last process writes out the stats files.
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
+typedef struct StatsShmemStruct
+{
+    dsa_handle    stats_dsa_handle;    /* handle for stats data area */
+    dshash_table_handle hash_handle;    /* shared dbstat hash */
+    dsa_pointer global_stats;    /* DSA pointer to global stats */
+    dsa_pointer archiver_stats; /* Ditto for archiver stats */
+    int            refcount;        /* # of processes attached to the shared
+                                 * stats memory */
+}            StatsShmemStruct;
 
-static time_t last_pgstat_start_time;
+/* BgWriter global statistics counters */
+PgStat_BgWriter BgWriterStats = {0};
 
-static bool pgStatRunningInCollector = false;
+/* backend-lifetime storage */
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
 
 /*
- * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
+ * Enums and types to define shared statistics structure.
+ *
+ * Statistics entries for each object are stored in individually DSA-allocated
+ * memory. Every entry is pointed to from the dshash pgStatSharedHash via a
+ * dsa_pointer. This structure keeps object-stats entries from being moved by
+ * dshash resizing, and allows the dshash to release its lock sooner on stats
+ * updates. It also reduces interference among write-locks on each stat entry
+ * by not relying on the partition lock of dshash. pgStatLocalHash is the
+ * equivalent of pgStatSharedHash for local stat entries.
+ *
+ * Each stat entry is enveloped by the type PgStatEnvelope, which stores the
+ * attributes common to all kinds of statistics and an LWLock.
  *
- * NOTE: once allocated, TabStatusArray structures are never moved or deleted
- * for the life of the backend.  Also, we zero out the t_id fields of the
- * contained PgStat_TableStatus structs whenever they are not actively in use.
- * This allows relcache pgstat_info pointers to be treated as long-lived data,
- * avoiding repeated searches in pgstat_initstats() when a relation is
- * repeatedly opened during a transaction.
+ * Shared stats are stored as:
+ *
+ * dshash pgStatSharedHash
+ *    -> PgStatHashEntry                (dshash entry)
+ *      (dsa_pointer)-> PgStatEnvelope    (dsa memory block)
+ *
+ * Local stats are stored as:
+ *
+ * dynahash pgStatLocalHash
+ *    -> PgStatLocalHashEntry            (dynahash entry)
+ *      (direct pointer)-> PgStatEnvelope (palloc'ed memory)
  */
-#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
 
-typedef struct TabStatusArray
+/* The types of statistics entries. */
+typedef enum PgStatTypes
 {
-    struct TabStatusArray *tsa_next;    /* link to next array, if any */
-    int            tsa_used;        /* # entries currently used */
-    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
-} TabStatusArray;
-
-static TabStatusArray *pgStatTabList = NULL;
+    PGSTAT_TYPE_ALL,            /* Not a type, for the parameters of
+                                 * pgstat_collect_stat_entries */
+    PGSTAT_TYPE_DB,                /* database-wide statistics */
+    PGSTAT_TYPE_TABLE,            /* per-table statistics */
+    PGSTAT_TYPE_FUNCTION,        /* per-function statistics */
+}            PgStatTypes;
 
 /*
- * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
+ * entry size lookup table of shared statistics entries corresponding to
+ * PgStatTypes
  */
-typedef struct TabStatHashEntry
+static size_t pgstat_entsize[] =
 {
-    Oid            t_id;
-    PgStat_TableStatus *tsa_entry;
-} TabStatHashEntry;
+    0,                            /* PGSTAT_TYPE_ALL: not an entry */
+    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_StatTabEntry),    /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_StatFuncEntry)    /* PGSTAT_TYPE_FUNCTION */
+};
 
-/*
- * Hash table for O(1) t_id -> tsa_entry lookup
- */
-static HTAB *pgStatTabHash = NULL;
+/* Ditto for local statistics entries */
+static size_t pgstat_localentsize[] =
+{
+    0,                            /* PGSTAT_TYPE_ALL: not an entry */
+    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_TableStatus), /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_BackendFunctionEntry) /* PGSTAT_TYPE_FUNCTION */
+};
+
+/* struct for shared statistics hash entry key. */
+typedef struct PgStatHashEntryKey
+{
+    PgStatTypes type;            /* statistics entry type */
+    Oid            databaseid;        /* database ID. InvalidOid for shared objects. */
+    Oid            objectid;        /* object OID */
+}            PgStatHashEntryKey;
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * Shared statistics hash entry. Stats numbers that are waiting to be flushed
+ * out to shared stats are held separately in pgStatLocalHash.
  */
-static HTAB *pgStatFunctions = NULL;
+typedef struct PgStatHashEntry
+{
+    PgStatHashEntryKey key;        /* hash key */
+    dsa_pointer env;            /* pointer to shared stats envelope in DSA */
+}            PgStatHashEntry;
+
+/* struct for shared statistics entry pointed from shared hash entry. */
+typedef struct PgStatEnvelope
+{
+    PgStatTypes type;            /* statistics entry type */
+    Oid            databaseid;        /* databaseid */
+    Oid            objectid;        /* objectid */
+    size_t        len;            /* length of body, fixed per type. */
+    LWLock        lock;            /* lightweight lock to protect body */
+    int            body[FLEXIBLE_ARRAY_MEMBER];    /* statistics body */
+}            PgStatEnvelope;
+
+#define PgStatEnvelopeSize(bodylen) \
+    (offsetof(PgStatEnvelope, body) + (bodylen))
+
+/* struct for shared statistics local hash entry. */
+typedef struct PgStatLocalHashEntry
+{
+    PgStatHashEntryKey key;        /* hash key */
+    PgStatEnvelope *env;        /* pointer to stats envelope in heap */
+}            PgStatLocalHashEntry;
 
 /*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
+ * A snapshot is a stats entry that is copied locally to offer stable values
+ * for a transaction.
  */
-static bool have_function_stats = false;
+typedef struct PgStatSnapshot
+{
+    PgStatHashEntryKey key;
+    bool        negative;
+    int            body[FLEXIBLE_ARRAY_MEMBER];    /* statistics body */
+}            PgStatSnapshot;
+
+#define PgStatSnapshotSize(bodylen)                \
+    (offsetof(PgStatSnapshot, body) + (bodylen))
+
+/* parameter for shared hashes */
+static const dshash_parameters dsh_rootparams = {
+    sizeof(PgStatHashEntryKey),
+    sizeof(PgStatHashEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+
+/* The shared hash to index activity stats entries. */
+static dshash_table *pgStatSharedHash = NULL;
+
+/* Local stats numbers are stored here */
+static HTAB *pgStatLocalHash = NULL;
+
+#define HAVE_ANY_PENDING_STATS()                        \
+    (pgStatLocalHash != NULL ||                            \
+     pgStatXactCommit != 0 || pgStatXactRollback != 0)
 
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
@@ -236,11 +292,10 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
+static MemoryContext pgStatSnapshotContext = NULL;
+static HTAB *pgStatSnapshotHash = NULL;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -249,19 +304,17 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Cluster wide statistics.
+ *
+ * Contains statistics that are collected neither per database nor per
+ * table.  shared_* point to shared memory and snapshot_* are backend
+ * snapshots.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
+static bool global_snapshot_is_valid = false;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats snapshot_globalStats;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -270,523 +323,269 @@ static List *pending_write_requests = NIL;
  */
 static instr_time total_func_time;
 
+/*
+ * Newly created shared stats entries need to be initialized before other
+ * processes can access them. get_stat_entry() calls this for that purpose.
+ */
+typedef void (*entry_initializer) (PgStatEnvelope * env);
 
 /* ----------
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgstat_beshutdown_hook(int code, Datum arg);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                                                 Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStatEnvelope * get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                                       bool nowait,
+                                       entry_initializer initfunc, bool *found);
+static PgStatEnvelope * *collect_stat_entries(PgStatTypes type, Oid dbid);
+static void create_missing_dbentries(void);
+static void pgstat_write_database_stats(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_db_statsfile(PgStat_StatDBEntry *dbentry);
+
+static bool flush_tabstat(PgStatEnvelope * env, bool nowait);
+static bool flush_funcstat(PgStatEnvelope * env, bool nowait);
+static bool flush_dbstat(PgStatEnvelope * env, bool nowait);
+
+static void init_dbentry(PgStatEnvelope * env);
+static void init_funcentry(PgStatEnvelope * env);
+static void init_tabentry(PgStatEnvelope * env);
+
+static bool delete_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                              bool nowait);
+static PgStatEnvelope * get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                                             bool create, bool *found);
+static PgStat_StatDBEntry *get_local_dbstat_entry(Oid dbid);
+static PgStat_TableStatus *get_local_tabstat_entry(Oid rel_id, bool isshared);
+
+static PgStat_SubXactStatus *get_tabstat_stack_level(int nest_level);
+static void add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level);
+static void pgstat_snapshot_global_stats(void);
+
 static void pgstat_read_current_status(void);
-
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
-
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
-static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
-
-static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
-
-static void pgstat_setup_memcxt(void);
-
 static const char *pgstat_get_wait_activity(WaitEventActivity w);
 static const char *pgstat_get_wait_client(WaitEventClient w);
 static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute shared memory space needed for activity statistics
+ */
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+/*
+ * StatsShmemInit - initialize during shared-memory creation
  */
 void
-pgstat_init(void)
+StatsShmemInit(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
+    bool        found;
 
-#define TESTBYTEVAL ((char) 199)
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
 
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
-
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
+    if (!IsUnderPostmaster)
     {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
     }
+}
 
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ *    Create pgStatLocalContext and pgStatSnapshotContext, if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+    if (!pgStatLocalContext)
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+
+    if (!pgStatSnapshotContext)
+        pgStatSnapshotContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Database statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+}
 
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
 
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
+/* ----------
+ * attach_shared_stats() -
+ *
+ *    Attach or create shared stats memory. If we are the first process to use
+ *    the activity stats system, read the saved statistics files, if any.
+ * ---------
+ */
+static void
+attach_shared_stats(void)
+{
+    MemoryContext oldcontext;
 
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    /*
+     * Don't use dsm in the postmaster or in a standalone backend, or when not
+     * tracking counts.
+     */
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
 
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    pgstat_setup_memcxt();
 
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    if (area)
+        return;
 
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
 
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
+    /*
+     * The last process to detach is responsible for writing out the stats
+     * files at exit.  Maintain a refcount so that an exiting process can tell
+     * whether it is the last one.
+     */
+    if (StatsShmem->refcount > 0)
+        StatsShmem->refcount++;
+    else
+    {
+        /* We're the first process to attach the shared stats memory */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
 
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        /* Initialize shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatSharedHash = dshash_create(area, &dsh_rootparams, 0);
 
-        test_byte++;            /* just make sure variable is changed */
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->global_stats =
+            dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+        StatsShmem->archiver_stats =
+            dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+        StatsShmem->hash_handle = dshash_get_hash_table_handle(pgStatSharedHash);
 
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
 
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        /* Load saved data if any. */
+        pgstat_read_statsfiles();
 
-        /* If we get here, we have a working socket */
-        break;
+        StatsShmem->refcount = 1;
     }
 
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
+    LWLockRelease(StatsLock);
 
     /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
+     * If we're not the first process, attach existing shared stats area
+     * outside the StatsLock section.
      */
-    if (!pg_set_noblock(pgStatSock))
+    if (!area)
     {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
+        /* Attach shared area. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatSharedHash = dshash_attach(area, &dsh_rootparams,
+                                         StatsShmem->hash_handle, 0);
+
+        /* Setup local variables */
+        pgStatLocalHash = NULL;
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
     }
 
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
-    }
-
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    /* Now that we have a long-lived socket, tell fd.c about it. */
-    ReserveExternalFD();
+    MemoryContextSwitchTo(oldcontext);
 
-    return;
-
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
-
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
-
-    /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
-     */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    /* don't detach automatically */
+    dsa_pin_mapping(area);
+    global_snapshot_is_valid = false;
 }
 
-/*
- * subroutine for pgstat_reset_all
+/* ----------
+ * detach_shared_stats() -
+ *
+ *    Detach shared stats. Write out to file if we're the last process and told
+ *    to do so.
+ * ----------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+detach_shared_stats(bool write_stats)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    /* nothing to do if we are not attached to the shared stats */
+    if (!area || !IsUnderPostmaster)
+        return;
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
-    {
-        int            nchars;
-        Oid            tmp_oid;
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
+    if (--StatsShmem->refcount < 1)
+    {
         /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
+         * This process is the last one attached to the shared stats
+         * memory. Write out the stats files if requested.
          */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
-
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
+        if (write_stats)
+            pgstat_write_statsfiles();
 
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+        /* No one is using the area. */
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
     }
-    FreeDir(dir);
+
+    LWLockRelease(StatsLock);
+
+    /*
+     * Detach the area. It is automatically destroyed when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);
+
+    area = NULL;
+    pgStatSharedHash = NULL;
+    shared_globalStats = NULL;
+    shared_archiverStats = NULL;
+    pgStatLocalHash = NULL;
+    global_snapshot_is_valid = false;
 }
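
The refcount protocol used by attach_shared_stats() and detach_shared_stats() above can be sketched in isolation as follows. This is only an illustration under stated assumptions: SharedStatsCtl, attach_stats() and detach_stats() are hypothetical names, the StatsLock locking and the DSA calls are omitted, and a counter stands in for pgstat_write_statsfiles().

```c
#include <stdbool.h>

/* Hypothetical stand-in for the StatsShmem control struct. */
typedef struct SharedStatsCtl
{
    int         refcount;       /* number of attached processes */
    bool        initialized;    /* stands in for a valid DSA handle */
    int         files_written;  /* counts simulated stats-file writes */
} SharedStatsCtl;

/*
 * Attach to the shared stats.  Returns true if this call created the area
 * (the first attacher), false if it joined an existing one.
 * In the patch this section runs under StatsLock; locking is omitted here.
 */
static bool
attach_stats(SharedStatsCtl *ctl)
{
    if (ctl->refcount > 0)
    {
        ctl->refcount++;
        return false;
    }

    /* First attacher: create the area and load saved stats. */
    ctl->initialized = true;
    ctl->refcount = 1;
    return true;
}

/*
 * Detach from the shared stats.  Returns true if this was the last
 * attached process, which writes the files when asked to.
 */
static bool
detach_stats(SharedStatsCtl *ctl, bool write_stats)
{
    if (--ctl->refcount < 1)
    {
        if (write_stats)
            ctl->files_written++;
        ctl->initialized = false;   /* no one is using the area */
        return true;
    }
    return false;
}
```

With two attachers, only the second detach reports itself as the last one and performs the write; pgstat_reset_all() corresponds to calling detach_stats(ctl, false) followed by attach_stats(ctl).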
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
-}
+    /* standalone server doesn't use shared stats */
+    if (!IsUnderPostmaster)
+        return;
 
-#ifdef EXEC_BACKEND
+    /* we must have shared stats attached */
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
-    char       *av[10];
-    int            ac = 0;
-
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
-
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
-
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
-
-
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
-
-    /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
-     */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
-
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
+    /* Startup must be the only user of shared stats */
+    Assert(StatsShmem->refcount == 1);
 
     /*
-     * Okay, fork off the collector.
+     * We could remove the files directly and recreate the shared memory area,
+     * but for simplicity just discard the area and create it again.
      */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
+    detach_shared_stats(false); /* Don't write files. */
+    attach_shared_stats();
 }
 
 /* ------------------------------------------------------------
@@ -794,144 +593,488 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *    Updates are applied no more often than once every PGSTAT_MIN_INTERVAL
+ *    milliseconds. If force is false, they are also postponed on lock
+ *    failure, unless some update has been pending for longer than
+ *    PGSTAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
+ *    succeeding calls of this function.
+ *
+ *    Returns the time in milliseconds until the next attempt to apply
+ *    updates: the remaining wait if called before the next flush time, the
+ *    retry interval if some updates are still pending, or 0 otherwise.
+ *
+ *    Note that this is called only outside a transaction, so it is fine to use
  *    transaction stop time as an approximation of current time.
- * ----------
+ *    ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz next_flush = 0;
+    static TimestampTz pending_since = 0;
+    static long retry_interval = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
+    bool        nowait = !force;    /* use nowait, not force, from here on */
+    HASH_SEQ_STATUS scan;
+    PgStatLocalHashEntry *lent;
+    PgStatLocalHashEntry **dbentlist;
+    int            dbentlistlen = 8;
+    int            ndbentries = 0;
+    int            remains = 0;
     int            i;
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (area == NULL || !HAVE_ANY_PENDING_STATS())
+        return 0;
+
+    dbentlist = palloc(sizeof(PgStatLocalHashEntry *) * dbentlistlen);
 
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
     now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
 
-    /*
-     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
-     */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
-    pgStatTabHash = NULL;
+    if (nowait)
+    {
+        /*
+         * Don't flush stats too frequently.  Return the time to the next
+         * flush.
+         */
+        if (now < next_flush)
+        {
+            /* Record the epoch time if retrying. */
+            if (pending_since == 0)
+                pending_since = now;
+
+            return (next_flush - now) / 1000;
+        }
+
+        /* But, don't keep pending updates longer than PGSTAT_MAX_INTERVAL. */
+
+        if (pending_since > 0 &&
+            TimestampDifferenceExceeds(pending_since, now, PGSTAT_MAX_INTERVAL))
+            nowait = false;
+    }
 
     /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
+     * flush_tabstat adds some of the counters of each flushed entry to the
+     * local database stats, so flush the database stats afterwards.
      */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
-    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+    if (pgStatLocalHash)
     {
-        for (i = 0; i < tsa->tsa_used; i++)
+        /* Step 1: flush out everything except database stats */
+        hash_seq_init(&scan, pgStatLocalHash);
+        while ((lent = (PgStatLocalHashEntry *) hash_seq_search(&scan)) != NULL)
         {
-            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
+            bool        remove = false;
 
-            /* Shouldn't have any pending transaction-dependent counts */
-            Assert(entry->trans == NULL);
+            switch (lent->env->type)
+            {
+                case PGSTAT_TYPE_DB:
+                    if (ndbentries >= dbentlistlen)
+                    {
+                        dbentlistlen *= 2;
+                        dbentlist = repalloc(dbentlist,
+                                             sizeof(PgStatLocalHashEntry *) *
+                                             dbentlistlen);
+                    }
+                    dbentlist[ndbentries++] = lent;
+                    break;
+                case PGSTAT_TYPE_TABLE:
+                    if (flush_tabstat(lent->env, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_FUNCTION:
+                    if (flush_funcstat(lent->env, nowait))
+                        remove = true;
+                    break;
+                default:
+                    Assert(false);
+            }
 
-            /*
-             * Ignore entries that didn't accumulate any actual counts, such
-             * as indexes that were opened by the planner but not used.
-             */
-            if (memcmp(&entry->t_counts, &all_zeroes,
-                       sizeof(PgStat_TableCounts)) == 0)
+            if (!remove)
+            {
+                remains++;
                 continue;
+            }
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* Remove the successfully flushed entry */
+            pfree(lent->env);
+            hash_search(pgStatLocalHash, &lent->key, HASH_REMOVE, NULL);
+        }
+
+        /* Step 2: flush out database stats */
+        for (i = 0; i < ndbentries; i++)
+        {
+            PgStatLocalHashEntry *lent = dbentlist[i];
+
+            if (flush_dbstat(lent->env, nowait))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                remains--;
+                /* Remove the successfully flushed entry */
+                pfree(lent->env);
+                hash_search(pgStatLocalHash, &lent->key, HASH_REMOVE, NULL);
             }
         }
-        /* zero out PgStat_TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
+        pfree(dbentlist);
+
+        if (remains <= 0)
+        {
+            hash_destroy(pgStatLocalHash);
+            pgStatLocalHash = NULL;
+        }
+    }
+
+    /* Publish the last flush time */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (shared_globalStats->stats_timestamp < now)
+        shared_globalStats->stats_timestamp = now;
+    LWLockRelease(StatsLock);
+
+    /*
+     * If we have pending local stats, let the caller know the retry interval.
+     */
+    if (HAVE_ANY_PENDING_STATS())
+    {
+        /* Retain the epoch time */
+        if (pending_since == 0)
+            pending_since = now;
+
+        /* The interval is doubled at every retry. */
+        if (retry_interval == 0)
+            retry_interval = PGSTAT_RETRY_MIN_INTERVAL * 1000;
+        else
+            retry_interval = retry_interval * 2;
+
+        /*
+         * Schedule the next retry. If twice the retry interval from now would
+         * reach PGSTAT_MAX_INTERVAL since the updates became pending, retry
+         * exactly at that limit instead.
+         */
+        if (!TimestampDifferenceExceeds(pending_since,
+                                        now + 2 * retry_interval,
+                                        PGSTAT_MAX_INTERVAL))
+            next_flush = now + retry_interval;
+        else
+        {
+            next_flush = pending_since + PGSTAT_MAX_INTERVAL * 1000;
+            retry_interval = next_flush - now;
+        }
+
+        return retry_interval / 1000;
     }
 
+    /* Set the next time to update stats */
+    next_flush = now + PGSTAT_MIN_INTERVAL * 1000;
+    retry_interval = 0;
+    pending_since = 0;
+
+    return 0;
+}
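
The retry scheduling at the end of pgstat_report_stat() can be read as a capped exponential backoff. The sketch below isolates just that arithmetic. It is an illustration, not the patch's code: FlushState and schedule_retry() are hypothetical names, and the two interval constants are assumed values, since the actual definitions of PGSTAT_RETRY_MIN_INTERVAL and PGSTAT_MAX_INTERVAL are not shown in this part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values in milliseconds, for illustration only. */
#define RETRY_MIN_INTERVAL_MS 250
#define MAX_INTERVAL_MS       10000

typedef int64_t usec_t;         /* timestamps and intervals in microseconds */

/* Hypothetical holder for the function's static variables. */
typedef struct FlushState
{
    usec_t      next_flush;     /* earliest time of the next flush attempt */
    usec_t      pending_since;  /* when updates first became pending, 0 if none */
    usec_t      retry_interval; /* current retry interval, 0 if not retrying */
} FlushState;

/*
 * Called when some local stats could not be flushed at time "now".
 * Returns the number of milliseconds the caller should wait before retrying.
 */
static long
schedule_retry(FlushState *st, usec_t now)
{
    const usec_t max_usec = (usec_t) MAX_INTERVAL_MS * 1000;

    assert(now > 0);            /* 0 is reserved as the "not pending" marker */

    /* Remember when the updates first became pending. */
    if (st->pending_since == 0)
        st->pending_since = now;

    /* The interval doubles at every retry. */
    if (st->retry_interval == 0)
        st->retry_interval = (usec_t) RETRY_MIN_INTERVAL_MS * 1000;
    else
        st->retry_interval *= 2;

    if ((now + 2 * st->retry_interval) - st->pending_since < max_usec)
        st->next_flush = now + st->retry_interval;
    else
    {
        /* Never hold updates past the maximum interval. */
        st->next_flush = st->pending_since + max_usec;
        st->retry_interval = st->next_flush - now;
    }

    return (long) (st->retry_interval / 1000);
}
```

Starting from a zeroed FlushState and calling schedule_retry() again at each returned next_flush, the waits under these assumed constants are 250, 500, 1000 and 2000 ms, after which the final wait is shortened to 6250 ms so that the flush lands exactly 10 seconds after the updates first became pending.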
+
+/*
+ * flush_tabstat - flush out a local table stats entry
+ *
+ * Some of the stats numbers are copied to local database stats entry after
+ * successful flush-out.
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_tabstat(PgStatEnvelope * lenv, bool nowait)
+{
+    static const PgStat_TableCounts all_zeroes;
+    Oid            dboid;            /* database OID of the table */
+    PgStat_TableStatus *lstats; /* local stats entry  */
+    PgStatEnvelope *shenv;        /* shared stats envelope */
+    PgStat_StatTabEntry *shtabstats;    /* table entry of shared stats */
+    PgStat_StatDBEntry *ldbstats;    /* local database entry */
+    bool        found;
+
+    Assert(lenv->type == PGSTAT_TYPE_TABLE);
+
+    lstats = (PgStat_TableStatus *) &lenv->body;
+    dboid = lstats->t_shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Ignore entries that didn't accumulate any actual counts, such as
+     * indexes that were opened by the planner but not used.
+     */
+    if (memcmp(&lstats->t_counts, &all_zeroes,
+               sizeof(PgStat_TableCounts)) == 0)
+        return true;
+
+    /* find shared table stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_TABLE, dboid, lstats->t_id,
+                           nowait, init_tabentry, &found);
+
+    /* skip if dshash failed to acquire lock */
+    if (shenv == NULL)
+        return false;
+
+    /* retrieve the shared table stats entry from the envelope */
+    shtabstats = (PgStat_StatTabEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;
+
+    /* add the values to the shared entry. */
+    shtabstats->numscans += lstats->t_counts.t_numscans;
+    shtabstats->tuples_returned += lstats->t_counts.t_tuples_returned;
+    shtabstats->tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    shtabstats->tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    shtabstats->tuples_updated += lstats->t_counts.t_tuples_updated;
+    shtabstats->tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    shtabstats->tuples_hot_updated += lstats->t_counts.t_tuples_hot_updated;
+
+    /*
+     * If the table was truncated or vacuum/analyze has run, first reset the
+     * live/dead counters.
+     */
+    if (lstats->t_counts.t_truncated ||
+        lstats->t_counts.vacuum_count > 0 ||
+        lstats->t_counts.analyze_count > 0 ||
+        lstats->t_counts.autovac_vacuum_count > 0 ||
+        lstats->t_counts.autovac_analyze_count > 0)
+    {
+        shtabstats->n_live_tuples = 0;
+        shtabstats->n_dead_tuples = 0;
+    }
+
+    /* clear the change counter if requested */
+    if (lstats->t_counts.reset_changed_tuples)
+        shtabstats->changes_since_analyze = 0;
+
+    shtabstats->n_live_tuples += lstats->t_counts.t_delta_live_tuples;
+    shtabstats->n_dead_tuples += lstats->t_counts.t_delta_dead_tuples;
+    shtabstats->changes_since_analyze += lstats->t_counts.t_changed_tuples;
+    shtabstats->blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    shtabstats->blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /*
+     * Update the vacuum/analyze timestamps and counters, making sure that the
+     * timestamps never go backwards.
+     */
+    if (shtabstats->vacuum_timestamp < lstats->vacuum_timestamp)
+        shtabstats->vacuum_timestamp = lstats->vacuum_timestamp;
+    shtabstats->vacuum_count += lstats->t_counts.vacuum_count;
+
+    if (shtabstats->autovac_vacuum_timestamp < lstats->autovac_vacuum_timestamp)
+        shtabstats->autovac_vacuum_timestamp = lstats->autovac_vacuum_timestamp;
+    shtabstats->autovac_vacuum_count += lstats->t_counts.autovac_vacuum_count;
+
+    if (shtabstats->analyze_timestamp < lstats->analyze_timestamp)
+        shtabstats->analyze_timestamp = lstats->analyze_timestamp;
+    shtabstats->analyze_count += lstats->t_counts.analyze_count;
+
+    if (shtabstats->autovac_analyze_timestamp < lstats->autovac_analyze_timestamp)
+        shtabstats->autovac_analyze_timestamp = lstats->autovac_analyze_timestamp;
+    shtabstats->autovac_analyze_count += lstats->t_counts.autovac_analyze_count;
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    shtabstats->n_live_tuples = Max(shtabstats->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    shtabstats->n_dead_tuples = Max(shtabstats->n_dead_tuples, 0);
+
+    LWLockRelease(&shenv->lock);
+
+    /* The entry is flushed; add the same counts to the local database stats */
+    ldbstats = get_local_dbstat_entry(dboid);
+    ldbstats->counts.n_tuples_returned += lstats->t_counts.t_tuples_returned;
+    ldbstats->counts.n_tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    ldbstats->counts.n_tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    ldbstats->counts.n_tuples_updated += lstats->t_counts.t_tuples_updated;
+    ldbstats->counts.n_tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    ldbstats->counts.n_blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    ldbstats->counts.n_blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    return true;
+}
+
+/* ----------
+ * init_tabentry() -
+ *
+ * initializes table stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
+ */
+static void
+init_tabentry(PgStatEnvelope * env)
+{
+    PgStat_StatTabEntry *tabent = (PgStat_StatTabEntry *) &env->body;
+
     /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
+     * Initialize all counters and timestamps of the new table entry to
+     * zero.
      */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
-
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+    Assert(env->type == PGSTAT_TYPE_TABLE);
+    tabent->tableid = env->objectid;
+    tabent->numscans = 0;
+    tabent->tuples_returned = 0;
+    tabent->tuples_fetched = 0;
+    tabent->tuples_inserted = 0;
+    tabent->tuples_updated = 0;
+    tabent->tuples_deleted = 0;
+    tabent->tuples_hot_updated = 0;
+    tabent->n_live_tuples = 0;
+    tabent->n_dead_tuples = 0;
+    tabent->changes_since_analyze = 0;
+    tabent->blocks_fetched = 0;
+    tabent->blocks_hit = 0;
+
+    tabent->vacuum_timestamp = 0;
+    tabent->vacuum_count = 0;
+    tabent->autovac_vacuum_timestamp = 0;
+    tabent->autovac_vacuum_count = 0;
+    tabent->analyze_timestamp = 0;
+    tabent->analyze_count = 0;
+    tabent->autovac_analyze_timestamp = 0;
+    tabent->autovac_analyze_count = 0;
 }
 
+
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * flush_funcstat - flush out a local function stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_funcstat(PgStatEnvelope * env, bool nowait)
+{
+    /* we assume this inits to all zeroes: */
+    static const PgStat_FunctionCounts all_zeroes;
+    PgStat_BackendFunctionEntry *localent;    /* local stats entry */
+    PgStatEnvelope *shenv;        /* shared stats envelope */
+    PgStat_StatFuncEntry *sharedent = NULL; /* shared stats entry */
+    bool        found;
+
+    Assert(env->type == PGSTAT_TYPE_FUNCTION);
+    localent = (PgStat_BackendFunctionEntry *) &env->body;
+
+    /* Skip it if no counts accumulated for it so far */
+    if (memcmp(&localent->f_counts, &all_zeroes,
+               sizeof(PgStat_FunctionCounts)) == 0)
+        return true;
+
+    /* find shared function stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId, localent->f_id,
+                           nowait, init_funcentry, &found);
+    /* skip if dshash failed to acquire lock */
+    if (shenv == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    /* retrieve the shared function stats entry from the envelope */
+    sharedent = (PgStat_StatFuncEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
+
+    sharedent->f_numcalls += localent->f_counts.f_numcalls;
+    sharedent->f_total_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_total_time);
+    sharedent->f_self_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_self_time);
+
+    LWLockRelease(&shenv->lock);
+
+    return true;
+}
+
+
+/* ----------
+ * init_funcentry() -
+ *
+ * initializes function stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
  */
 static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+init_funcentry(PgStatEnvelope * env)
 {
-    int            n;
-    int            len;
+    PgStat_StatFuncEntry *shstat = (PgStat_StatFuncEntry *) &env->body;
+
+    Assert(env->type == PGSTAT_TYPE_FUNCTION);
+    shstat->functionid = env->objectid;
+    shstat->f_numcalls = 0;
+    shstat->f_total_time = 0;
+    shstat->f_self_time = 0;
+}
+
+
+/*
+ * flush_dbstat - flush out a local database stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_dbstat(PgStatEnvelope * env, bool nowait)
+{
+    PgStat_StatDBEntry *localent;
+    PgStatEnvelope *shenv;
+    PgStat_StatDBEntry *sharedent;
+
+    Assert(env->type == PGSTAT_TYPE_DB);
+
+    localent = (PgStat_StatDBEntry *) &env->body;
+
+    /* find shared database stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_DB, localent->databaseid, InvalidOid,
+                           nowait, init_dbentry, NULL);
+
+    /* skip if dshash failed to acquire lock */
+    if (!shenv)
+        return false;
+
+    /* retrieve the shared stats entry from the envelope */
+    sharedent = (PgStat_StatDBEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;
+
+    sharedent->counts.n_tuples_returned += localent->counts.n_tuples_returned;
+    sharedent->counts.n_tuples_fetched += localent->counts.n_tuples_fetched;
+    sharedent->counts.n_tuples_inserted += localent->counts.n_tuples_inserted;
+    sharedent->counts.n_tuples_updated += localent->counts.n_tuples_updated;
+    sharedent->counts.n_tuples_deleted += localent->counts.n_tuples_deleted;
+    sharedent->counts.n_blocks_fetched += localent->counts.n_blocks_fetched;
+    sharedent->counts.n_blocks_hit += localent->counts.n_blocks_hit;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    sharedent->counts.n_deadlocks += localent->counts.n_deadlocks;
+    sharedent->counts.n_temp_bytes += localent->counts.n_temp_bytes;
+    sharedent->counts.n_temp_files += localent->counts.n_temp_files;
+    sharedent->counts.n_checksum_failures += localent->counts.n_checksum_failures;
 
     /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
+     * Accumulate xact commit/rollback and I/O timings to stats entry of the
+     * current database.
      */
-    if (OidIsValid(tsmsg->m_databaseid))
+    if (OidIsValid(localent->databaseid))
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
+        sharedent->counts.n_xact_commit += pgStatXactCommit;
+        sharedent->counts.n_xact_rollback += pgStatXactRollback;
+        sharedent->counts.n_block_read_time += pgStatBlockReadTime;
+        sharedent->counts.n_block_write_time += pgStatBlockWriteTime;
         pgStatXactCommit = 0;
         pgStatXactRollback = 0;
         pgStatBlockReadTime = 0;
@@ -939,257 +1082,102 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        sharedent->counts.n_xact_commit = 0;
+        sharedent->counts.n_xact_rollback = 0;
+        sharedent->counts.n_block_read_time = 0;
+        sharedent->counts.n_block_write_time = 0;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
+    LWLockRelease(&shenv->lock);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    return true;
 }
 
+
+/* ----------
+ * init_dbentry() -
+ *
+ * initializes database stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
+ */
+static void
+init_dbentry(PgStatEnvelope * env)
+{
+    PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) &env->body;
+
+    Assert(env->type == PGSTAT_TYPE_DB);
+    dbentry->databaseid = env->databaseid;
+    dbentry->last_autovac_time = 0;
+    dbentry->last_checksum_failure = 0;
+    dbentry->stat_reset_timestamp = 0;
+    dbentry->stats_timestamp = 0;
+    /* initialize the new shared entry */
+    MemSet(&dbentry->counts, 0, sizeof(PgStat_StatDBCounts));
+}
+
+
 /*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
+ * Create the filename for a DB stat file; filename is an output parameter that
+ * points to a character buffer of length len.
  */
 static void
-pgstat_send_funcstats(void)
+get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
-
-    if (pgStatFunctions == NULL)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
-
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
-
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
-
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
-
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
-
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
+    int            printed;
+
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/db_%u.%s",
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
+                       databaseid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
 }
 
 
 /* ----------
- * pgstat_vacuum_stat() -
+ * collect_stat_entries() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *    Collect the shared statistics entries specified by type and dbid. Returns
+ *  a NULL-terminated list of pointers to shared statistics in palloc'ed
+ *  memory. If type is PGSTAT_TYPE_ALL, all types of statistics of the
+ *  database are collected. If type is PGSTAT_TYPE_DB, the parameter dbid is
+ *  ignored and all PGSTAT_TYPE_DB entries are collected.
  * ----------
  */
-void
-pgstat_vacuum_stat(void)
+static PgStatEnvelope **collect_stat_entries(PgStatTypes type, Oid dbid)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Read pg_database and make a list of OIDs of all existing databases
-     */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
-
-    /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            dbid = dbentry->databaseid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        /* the DB entry for shared tables (with InvalidOid) is never dropped */
-        if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
-            pgstat_drop_database(dbid);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Lookup our own database entry; if not found, nothing more to do.
-     */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
-        return;
-
-    /*
-     * Similarly to above, make a list of all known relations in this DB.
-     */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
-
-    /*
-     * Check for all tables listed in stats hashtable if they still exist.
-     */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
+    int            listlen = 16;
+    PgStatEnvelope **envlist = palloc(sizeof(PgStatEnvelope * *) * listlen);
+    int            n = 0;
+
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
     {
-        Oid            tabid = tabentry->tableid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+        if ((type != PGSTAT_TYPE_ALL && p->key.type != type) ||
+            (type != PGSTAT_TYPE_DB && p->key.databaseid != dbid))
             continue;
 
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
+        if (n >= listlen - 1)
         {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
+            listlen *= 2;
+            envlist = repalloc(envlist, listlen * sizeof(PgStatEnvelope * *));
         }
+        envlist[n++] = dsa_get_address(area, p->env);
     }
+    dshash_seq_term(&hstat);
 
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
-        {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
+    envlist[n] = NULL;
 
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
-                continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
-        }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
-    }
+    return envlist;
 }
 
 
 /* ----------
- * pgstat_collect_oids() -
+ * collect_oids() -
  *
  *    Collect the OIDs of all objects listed in the specified system catalog
  *    into a temporary hash table.  Caller should hash_destroy the result
@@ -1198,7 +1186,7 @@ pgstat_vacuum_stat(void)
  * ----------
  */
 static HTAB *
-pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
+collect_oids(Oid catalogid, AttrNumber anum_oid)
 {
     HTAB       *htab;
     HASHCTL        hash_ctl;
@@ -1212,7 +1200,7 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     hash_ctl.entrysize = sizeof(Oid);
     hash_ctl.hcxt = CurrentMemoryContext;
     htab = hash_create("Temporary table of OIDs",
-                       PGSTAT_TAB_HASH_SIZE,
+                       PGSTAT_TABLE_HASH_SIZE,
                        &hash_ctl,
                        HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
 
@@ -1239,65 +1227,184 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
 }
 
 
+/* ----------
+ * pgstat_vacuum_stat() -
+ *
+ *  Delete shared stat entries that are not in system catalogs.
+ *
+ *  To avoid holding exclusive lock on dshash for a long time, the process is
+ *  performed in three steps.
+ *
+ *   1: Collect existent oids of every kind of object.
+ *   2: Collect victim entries by scanning with shared lock.
+ *   3: Try removing every nominated entry without waiting for lock.
+ *
+ *  As a consequence of the last step, some entries may be left behind due to
+ *  lock failure, but they will be deleted by later invocations of
+ *  pgstat_vacuum_stat().
+ * ----------
+ */
+void
+pgstat_vacuum_stat(void)
+{
+    HTAB       *dbids;            /* database ids */
+    HTAB       *relids;            /* relation ids in the current database */
+    HTAB       *funcids;        /* function ids in the current database */
+    PgStatEnvelope **victims;    /* victim entry list */
+    int            arraylen = 0;    /* storage size of the above */
+    int            nvictims = 0;    /* # of entries of the above */
+    dshash_seq_status dshstat;
+    PgStatHashEntry *ent;
+    int            i;
+
+    /* we don't collect stats under standalone mode */
+    if (!IsUnderPostmaster)
+        return;
+
+    /* collect oids of existent objects */
+    dbids = collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    relids = collect_oids(RelationRelationId, Anum_pg_class_oid);
+    funcids = collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
+
+    /* collect victims from shared stats */
+    arraylen = 16;
+    victims = palloc(sizeof(PgStatEnvelope * *) * arraylen);
+    nvictims = 0;
+
+    dshash_seq_init(&dshstat, pgStatSharedHash, false);
+
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
+    {
+        HTAB       *oidtab;
+        Oid           *key;
+
+        CHECK_FOR_INTERRUPTS();
+
+        /*
+         * Don't drop entries of non-database objects that belong to databases
+         * other than the current one.
+         */
+        if (ent->key.type != PGSTAT_TYPE_DB &&
+            ent->key.databaseid != MyDatabaseId)
+            continue;
+
+        switch (ent->key.type)
+        {
+            case PGSTAT_TYPE_DB:
+                /* don't remove database entry for shared tables */
+                if (ent->key.databaseid == 0)
+                    continue;
+                oidtab = dbids;
+                key = &ent->key.databaseid;
+                break;
+
+            case PGSTAT_TYPE_TABLE:
+                oidtab = relids;
+                key = &ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                oidtab = funcids;
+                key = &ent->key.objectid;
+                break;
+            case PGSTAT_TYPE_ALL:
+                Assert(false);
+                break;
+        }
+
+        /* Skip existent objects. */
+        if (hash_search(oidtab, key, HASH_FIND, NULL) != NULL)
+            continue;
+
+        /* extend the list if needed */
+        if (nvictims >= arraylen)
+        {
+            arraylen *= 2;
+            victims = repalloc(victims, sizeof(PgStatEnvelope * *) * arraylen);
+        }
+
+        victims[nvictims++] = dsa_get_address(area, ent->env);
+    }
+    dshash_seq_term(&dshstat);
+    hash_destroy(dbids);
+    hash_destroy(relids);
+    hash_destroy(funcids);
+
+    /* Now try removing the victim entries */
+    for (i = 0; i < nvictims; i++)
+    {
+        PgStatEnvelope *p = victims[i];
+
+        delete_stat_entry(p->type, p->databaseid, p->objectid, true);
+    }
+}
+
+
+/* ----------
+ * delete_stat_entry() -
+ *
+ *  Deletes the specified entry from shared stats hash.
+ *
+ *  Returns true when successfully deleted.
+ * ----------
+ */
+static bool
+delete_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait)
+{
+    PgStatHashEntryKey key;
+    PgStatHashEntry *ent;
+
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    ent = dshash_find_extended(pgStatSharedHash, &key,
+                               true, nowait, false, NULL);
+
+    if (!ent)
+        return false;            /* lock failed or not found */
+
+    /* The entry is exclusively locked, so we can free the chunk first. */
+    dsa_free(area, ent->env);
+    dshash_delete_entry(pgStatSharedHash, ent);
+
+    return true;
+}
+
+
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
- * ----------
+ *    Remove the entries for the database that we just dropped.
+ *
+ *    Some entries might be left behind due to lock failure, or some stats
+ *    might be flushed after this, but we will still clean the dead DB
+ *    eventually via future invocations of pgstat_vacuum_stat().
+ * ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
+    Assert(OidIsValid(databaseid));
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
+    envlist = collect_stat_entries(PGSTAT_TYPE_ALL, databaseid);
 
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
+    for (p = envlist; *p != NULL; p++)
+        delete_stat_entry((*p)->type, (*p)->databaseid, (*p)->objectid, true);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
+    pfree(envlist);
 }
-#endif                            /* NOT_USED */
 
 
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1306,20 +1413,47 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    /* Lookup the entries of the current database in the stats hash. */
+    envlist = collect_stat_entries(PGSTAT_TYPE_ALL, MyDatabaseId);
+    for (p = envlist; *p != NULL; p++)
+    {
+        PgStatEnvelope *env = *p;
+        PgStat_StatDBEntry *dbstat;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+        LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+
+        switch (env->type)
+        {
+            case PGSTAT_TYPE_TABLE:
+                init_tabentry(env);
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                init_funcentry(env);
+                break;
+
+            case PGSTAT_TYPE_DB:
+                init_dbentry(env);
+                dbstat = (PgStat_StatDBEntry *) &env->body;
+                dbstat->stat_reset_timestamp = GetCurrentTimestamp();
+                break;
+            default:
+                Assert(false);
+        }
+
+        LWLockRelease(&env->lock);
+    }
+
+    pfree(envlist);
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1328,29 +1462,37 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
+    /* Reset the archiver statistics for the cluster. */
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    /* Reset the bgwriter statistics for the cluster. */
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1359,17 +1501,42 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStatEnvelope *env;
+    PgStat_StatDBEntry *dbentry;
+    PgStatTypes stattype;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    env = get_stat_entry(PGSTAT_TYPE_DB, MyDatabaseId, InvalidOid,
+                         false, NULL, NULL);
+    Assert(env);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* Set the reset timestamp for the whole database */
+    dbentry = (PgStat_StatDBEntry *) &env->body;
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&env->lock);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Reset the stats entry of the specified object */
+    switch (type)
+    {
+        case RESET_TABLE:
+            stattype = PGSTAT_TYPE_TABLE;
+            break;
+        case RESET_FUNCTION:
+            stattype = PGSTAT_TYPE_FUNCTION;
+    }
+
+    env = get_stat_entry(stattype, MyDatabaseId, objoid, false, NULL, NULL);
+    LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+    if (env->type == PGSTAT_TYPE_TABLE)
+        init_tabentry(env);
+    else
+    {
+        Assert(env->type == PGSTAT_TYPE_FUNCTION);
+        init_funcentry(env);
+    }
+    LWLockRelease(&env->lock);
 }
 
 /* ----------
@@ -1383,48 +1550,63 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if activity stats are not active */
+    if (!area)
         return;

-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    ts = GetCurrentTimestamp();
 
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Store the last autovacuum time in the database's hash table entry.
+     */
+    dbentry = get_local_dbstat_entry(dboid);
+    dbentry->last_autovac_time = ts;
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report about the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    PgStat_TableStatus *tabentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    ts = GetCurrentTimestamp();
+    tabentry = get_local_tabstat_entry(tableoid, shared);
+
+    tabentry->t_counts.t_delta_live_tuples = livetuples;
+    tabentry->t_counts.t_delta_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = ts;
+        tabentry->t_counts.autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = ts;
+        tabentry->t_counts.vacuum_count++;
+    }
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report about the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1435,9 +1617,10 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    PgStat_TableStatus *tabentry;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1445,10 +1628,10 @@ pgstat_report_analyze(Relation rel,
      * already inserted and/or deleted rows in the target table. ANALYZE will
      * have counted such rows as live or dead respectively. Because we will
      * report our counts of such rows at transaction end, we should subtract
-     * off these counts from what we send to the collector now, else they'll
-     * be double-counted after commit.  (This approach also ensures that the
-     * collector ends up with the right numbers if we abort instead of
-     * committing.)
+     * off these counts from what we report to shared stats now, else they'll
+     * be double-counted after commit.  (This approach also ensures that the
+     * shared stats end up with the right numbers if we abort instead of
+     * committing.)
      */
     if (rel->pgstat_info != NULL)
     {
@@ -1466,158 +1649,172 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    tabentry = get_local_tabstat_entry(RelationGetRelid(rel),
+                                       rel->rd_rel->relisshared);
+
+    tabentry->t_counts.t_delta_live_tuples = livetuples;
+    tabentry->t_counts.t_delta_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->t_counts.reset_changed_tuples = true;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->t_counts.autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->t_counts.analyze_count++;
+    }
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            dbent->counts.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            dbent->counts.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            dbent->counts.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            dbent->counts.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            dbent->counts.n_conflict_startup_deadlock++;
+            break;
+    }
 }
 
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-
-/* --------
- * pgstat_report_checksum_failures_in_db() -
- *
- *    Tell the collector about one or more checksum failures.
- * --------
- */
-void
-pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
-{
-    PgStat_MsgChecksumFailure msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
-
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_deadlocks++;
 }
 
 /* --------
  * pgstat_report_checksum_failure() -
  *
- *    Tell the collector about a checksum failure.
+ *    Report a checksum failure.
  * --------
  */
 void
 pgstat_report_checksum_failure(void)
 {
-    pgstat_report_checksum_failures_in_db(MyDatabaseId, 1);
+    PgStat_StatDBEntry *dbent;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_checksum_failures++;
 }
 
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
+    if (filesize == 0)            /* Is there a case where filesize is really 0? */
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_temp_bytes += filesize; /* needs overflow check */
+    dbent->counts.n_temp_files++;
 }
 
 
-/* ----------
- * pgstat_ping() -
+/* --------
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
  *
- *    Send some junk data to the collector to increase traffic.
- * ----------
+ *    Report one or more checksum failures.
+ * --------
  */
 void
-pgstat_ping(void)
+pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
 {
-    PgStat_MsgDummy msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if we are not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dbentry = get_local_dbstat_entry(dboid);
+
+    /* add the reported count to the accumulated count */
+    dbentry->counts.n_checksum_failures += failurecount;
 }
 
 /* ----------
- * pgstat_send_inquiry() -
+ * pgstat_init_function_usage() -
  *
- *    Notify collector that we need fresh data.
+ *  Initialize function call usage data.
+ *  Called by the executor before invoking a function.
  * ----------
  */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/*
- * Initialize function call usage data.
- * Called by the executor before invoking a function.
- */
 void
 pgstat_init_function_usage(FunctionCallInfo fcinfo,
                            PgStat_FunctionCallUsage *fcu)
 {
+    PgStatEnvelope *env;
     PgStat_BackendFunctionEntry *htabent;
     bool        found;
 
@@ -1628,26 +1825,15 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
         return;
     }
 
-    if (!pgStatFunctions)
-    {
-        /* First time through - initialize function stat table */
-        HASHCTL        hash_ctl;
+    env = get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                               fcinfo->flinfo->fn_oid, true, &found);
+    htabent = (PgStat_BackendFunctionEntry *) &env->body;
 
-        memset(&hash_ctl, 0, sizeof(hash_ctl));
-        hash_ctl.keysize = sizeof(Oid);
-        hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
-        pgStatFunctions = hash_create("Function stat entries",
-                                      PGSTAT_FUNCTION_HASH_SIZE,
-                                      &hash_ctl,
-                                      HASH_ELEM | HASH_BLOBS);
-    }
-
-    /* Get the stats entry for this function, create if necessary */
-    htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
-                          HASH_ENTER, &found);
     if (!found)
         MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
 
+    htabent->f_id = fcinfo->flinfo->fn_oid;
+
     fcu->fs = &htabent->f_counts;
 
     /* save stats for this function, later used to compensate for recursion */
@@ -1660,31 +1846,38 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
     INSTR_TIME_SET_CURRENT(fcu->f_start);
 }
 
-/*
- * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
- *        for specified function
+/* ----------
+ * find_funcstat_entry() -
  *
- * If no entry, return NULL, don't create a new one
+ *  find any existing PgStat_BackendFunctionEntry entry for specified function
+ *
+ *  If there is no entry, return NULL without creating a new one.
+ * ----------
  */
 PgStat_BackendFunctionEntry *
 find_funcstat_entry(Oid func_id)
 {
-    if (pgStatFunctions == NULL)
+    PgStatEnvelope *env;
+
+    env = get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                               func_id, false, NULL);
+    if (!env)
         return NULL;
 
-    return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
-                                                       (void *) &func_id,
-                                                       HASH_FIND, NULL);
+    return (PgStat_BackendFunctionEntry *) &env->body;
 }
 
-/*
- * Calculate function call usage and update stat counters.
- * Called by the executor after invoking a function.
+/* ----------
+ * pgstat_end_function_usage() -
  *
- * In the case of a set-returning function that runs in value-per-call mode,
- * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
- * calls for what the user considers a single call of the function.  The
- * finalize flag should be TRUE on the last call.
+ *  Calculate function call usage and update stat counters.
+ *  Called by the executor after invoking a function.
+ *
+ *  In the case of a set-returning function that runs in value-per-call mode,
+ *  we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
+ *  calls for what the user considers a single call of the function.  The
+ *  finalize flag should be TRUE on the last call.
+ * ----------
  */
 void
 pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
@@ -1725,9 +1918,6 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
         fs->f_numcalls++;
     fs->f_total_time = f_total;
     INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
 }
 
 
@@ -1739,8 +1929,7 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
  *
  *    We assume that a relcache entry's pgstat_info field is zeroed by
  *    relcache.c when the relcache entry is made; thereafter it is long-lived
- *    data.  We can avoid repeated searches of the TabStatus arrays when the
- *    same relation is touched repeatedly within a transaction.
+ *    data.
  * ----------
  */
 void
@@ -1760,7 +1949,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1776,116 +1966,157 @@ pgstat_initstats(Relation rel)
         return;
 
     /* Else find or make the PgStat_TableStatus entry, and update link */
-    rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+    rel->pgstat_info = get_local_tabstat_entry(rel_id, rel->rd_rel->relisshared);
 }
 
-/*
- * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
+
+/* ----------
+ * get_local_stat_entry() -
+ *
+ *  Returns the local stats entry for the type, dbid and objid.
+ *  If create is true, a new entry is created if none exists yet; found must
+ *  be non-null in that case.
+ *
+ *
+ *  The caller must initialize the body part of the returned envelope.
+ * ----------
  */
-static PgStat_TableStatus *
-get_tabstat_entry(Oid rel_id, bool isshared)
+static PgStatEnvelope *
+get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                     bool create, bool *found)
 {
-    TabStatHashEntry *hash_entry;
-    PgStat_TableStatus *entry;
-    TabStatusArray *tsa;
-    bool        found;
+    PgStatHashEntryKey key;
+    PgStatLocalHashEntry *entry;
 
-    /*
-     * Create hash table if we don't have it already.
-     */
-    if (pgStatTabHash == NULL)
+    if (pgStatLocalHash == NULL)
     {
         HASHCTL        ctl;
 
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
+        MemSet(&ctl, 0, sizeof(ctl));
+        ctl.keysize = sizeof(PgStatHashEntryKey);
+        ctl.entrysize = sizeof(PgStatLocalHashEntry);
+
+        pgStatLocalHash = hash_create("Local stat entries",
+                                      PGSTAT_TABLE_HASH_SIZE,
+                                      &ctl,
+                                      HASH_ELEM | HASH_BLOBS);
+    }
+
+    /* Find an entry or create a new one. */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    entry = hash_search(pgStatLocalHash, &key,
+                        create ? HASH_ENTER : HASH_FIND, found);
+
+    if (!create && !entry)
+        return NULL;
+
+    if (create && !*found)
+    {
+        int            len = pgstat_localentsize[type];
+
+        entry->env = MemoryContextAlloc(CacheMemoryContext,
+                                        PgStatEnvelopeSize(len));
+        entry->env->type = type;
+        entry->env->len = len;
     }
 
+    return entry->env;
+}
+
+/* ----------
+ * get_local_dbstat_entry() -
+ *
+ *  Find or create a local PgStat_StatDBEntry entry for dbid.  A new entry is
+ *  created and initialized if none exists.
+ */
+static PgStat_StatDBEntry *
+get_local_dbstat_entry(Oid dbid)
+{
+    PgStatEnvelope *env;
+    PgStat_StatDBEntry *dbentry;
+    bool        found;
+
     /*
      * Find an entry or create a new one.
      */
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
+    env = get_local_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid,
+                               true, &found);
+    dbentry = (PgStat_StatDBEntry *) &env->body;
+
     if (!found)
     {
-        /* initialize new entry with null pointer */
-        hash_entry->tsa_entry = NULL;
+        dbentry->databaseid = dbid;
+        dbentry->last_autovac_time = 0;
+        dbentry->last_checksum_failure = 0;
+        dbentry->stat_reset_timestamp = 0;
+        dbentry->stats_timestamp = 0;
+        MemSet(&dbentry->counts, 0, sizeof(PgStat_StatDBCounts));
     }
 
-    /*
-     * If entry is already valid, we're done.
-     */
-    if (hash_entry->tsa_entry)
-        return hash_entry->tsa_entry;
-
-    /*
-     * Locate the first pgStatTabList entry with free space, making a new list
-     * entry if needed.  Note that we could get an OOM failure here, but if so
-     * we have left the hashtable and the list in a consistent state.
-     */
-    if (pgStatTabList == NULL)
-    {
-        /* Set up first pgStatTabList entry */
-        pgStatTabList = (TabStatusArray *)
-            MemoryContextAllocZero(TopMemoryContext,
-                                   sizeof(TabStatusArray));
-    }
+    return dbentry;
+}
 
-    tsa = pgStatTabList;
-    while (tsa->tsa_used >= TABSTAT_QUANTUM)
-    {
-        if (tsa->tsa_next == NULL)
-            tsa->tsa_next = (TabStatusArray *)
-                MemoryContextAllocZero(TopMemoryContext,
-                                       sizeof(TabStatusArray));
-        tsa = tsa->tsa_next;
-    }
 
-    /*
-     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
-     * the entry was already zeroed, either at creation or after last use.
-     */
-    entry = &tsa->tsa_entries[tsa->tsa_used++];
-    entry->t_id = rel_id;
-    entry->t_shared = isshared;
+/* ----------
+ * get_local_tabstat_entry() -
+ *  Find or create a PgStat_TableStatus entry for rel.  A new entry is created
+ *  and initialized if none exists.
+ * ----------
+ */
+static PgStat_TableStatus *
+get_local_tabstat_entry(Oid rel_id, bool isshared)
+{
+    PgStatEnvelope *env;
+    PgStat_TableStatus *tabentry;
+    bool        found;
 
-    /*
-     * Now we can fill the entry in pgStatTabHash.
-     */
-    hash_entry->tsa_entry = entry;
+    env = get_local_stat_entry(PGSTAT_TYPE_TABLE,
+                               isshared ? InvalidOid : MyDatabaseId,
+                               rel_id, true, &found);
 
-    return entry;
+    tabentry = (PgStat_TableStatus *) &env->body;
+
+    if (!found)
+    {
+        tabentry->t_id = rel_id;
+        tabentry->t_shared = isshared;
+        tabentry->trans = NULL;
+        MemSet(&tabentry->t_counts, 0, sizeof(PgStat_TableCounts));
+        tabentry->vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+    }
+
+    return tabentry;
 }
 
+
 /*
  * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_TableStatus entry for rel from the current
+ *  database then from shared tables.
  *
- * Note: if we got an error in the most recent execution of pgstat_report_stat,
- * it's possible that an entry exists but there's no hashtable entry for it.
- * That's okay, we'll treat this case as "doesn't exist".
+ *  If no entry, return NULL, don't create a new one
+ * ----------
  */
 PgStat_TableStatus *
 find_tabstat_entry(Oid rel_id)
 {
-    TabStatHashEntry *hash_entry;
+    PgStatEnvelope *env;
 
-    /* If hashtable doesn't exist, there are no entries at all */
-    if (!pgStatTabHash)
-        return NULL;
+    env = get_local_stat_entry(PGSTAT_TYPE_TABLE, MyDatabaseId, rel_id,
+                               false, NULL);
+    if (!env)
+        env = get_local_stat_entry(PGSTAT_TYPE_TABLE, InvalidOid, rel_id,
+                                   false, NULL);
+    if (env)
+        return (PgStat_TableStatus *) &env->body;
 
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
-    if (!hash_entry)
-        return NULL;
-
-    /* Note that this step could also return NULL, but that's correct */
-    return hash_entry->tsa_entry;
+    return NULL;
 }
 
 /*
@@ -2362,7 +2593,7 @@ pgstat_twophase_postcommit(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, commit case */
     pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
@@ -2398,7 +2629,7 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, abort case */
     if (rec->t_truncated)
@@ -2415,88 +2646,176 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 }
 
 
+/* ----------
+ * snapshot_statentry() -
+ *
+ *  Common routine for functions pgstat_fetch_stat_*entry()
+ *
+ *  Returns the pointer to the snapshot of the shared entry for the key or NULL
+ *  if not found. Returned snapshots are stable during the current transaction
+ *  or until pgstat_clear_snapshot() is called.
+ *
+ *  Created snapshots are stored in pgStatSnapshotHash.
+ */
+static void *
+snapshot_statentry(const PgStatTypes type, const Oid dbid, const Oid objid)
+{
+    PgStatSnapshot *snap = NULL;
+    bool        found;
+    PgStatHashEntryKey key;
+    size_t        statentsize = pgstat_entsize[type];
+
+    Assert(type != PGSTAT_TYPE_ALL);
+
+    /*
+     * Create new hash, with rather arbitrary initial number of entries since
+     * we don't know how this hash will grow.
+     */
+    if (!pgStatSnapshotHash)
+    {
+        HASHCTL        ctl;
+
+        /*
+         * Create the hash in the stats context
+         *
+         * Each entry is preceded by a common header part, represented by
+         * PgStatSnapshot.
+         */
+
+        ctl.keysize = sizeof(PgStatHashEntryKey);
+        ctl.entrysize = PgStatSnapshotSize(statentsize);
+        ctl.hcxt = pgStatSnapshotContext;
+        pgStatSnapshotHash = hash_create("pgstat snapshot hash", 32, &ctl,
+                                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    }
+
+    /* Find a snapshot  */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+
+    snap = hash_search(pgStatSnapshotHash, &key, HASH_ENTER, &found);
+
+    /*
+     * Consult the shared hash if not found in the snapshot hash.
+     *
+     * In transaction state, we obviously should create a snapshot entry for
+     * consistency. Otherwise we return an up-to-date entry.  Even then we
+     * need a copy, since the shared stats entry can be modified at any time,
+     * so we reuse the same snapshot entry for that purpose.
+     */
+    if (!found || !IsTransactionState())
+    {
+        PgStatEnvelope *shenv;
+
+        shenv = get_stat_entry(type, dbid, objid, true, NULL, NULL);
+
+        if (shenv)
+            memcpy(&snap->body, &shenv->body, statentsize);
+
+        snap->negative = !shenv;
+    }
+
+    if (snap->negative)
+        return NULL;
+
+    return &snap->body;
+}
+
+
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    Find database stats entry on backends. The returned entries are cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
  * ----------
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    /* If not done for this transaction, take a snapshot of global stats */
+    pgstat_snapshot_global_stats();
+
+    /* the caller has no business with snapshot-local members */
+    return (PgStat_StatDBEntry *)
+        snapshot_statentry(PGSTAT_TYPE_DB, dbid, InvalidOid);
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one table or NULL. NULL doesn't mean
+ *    the activity statistics for one table or NULL. NULL doesn't mean
  *    that the table doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    activity statistics facilities, so the caller is better off to
+ *    report ZERO instead.
  * ----------
  */
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
-    PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(false, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
-
-    return NULL;
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(true, relid);
+    return tabentry;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_tabentry_snapshot() -
+ *
+ *    Find the table stats entry on backends. The returned entry is cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_snapshot(bool shared, Oid reloid)
+{
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    return (PgStat_StatTabEntry *)
+        snapshot_statentry(PGSTAT_TYPE_TABLE, dboid, reloid);
+}
+
+
+/* ----------
+ * pgstat_copy_index_counters() -
+ *
+ *    Support function for index swapping. Copy a portion of the counters of
+ *    the relation to the specified place.
+ * ----------
+ */
+void
+pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst)
+{
+    PgStat_StatTabEntry *tabentry;
+
+    /* No point fetching tabentry when dst is NULL */
+    if (!dst)
+        return;
+
+    tabentry = pgstat_fetch_stat_tabentry(relid);
+
+    if (!tabentry)
+        return;
+
+    dst->t_counts.t_numscans = tabentry->numscans;
+    dst->t_counts.t_tuples_returned = tabentry->tuples_returned;
+    dst->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
+    dst->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
+    dst->t_counts.t_blocks_hit = tabentry->blocks_hit;
 }
 
 
@@ -2510,24 +2829,48 @@ pgstat_fetch_stat_tabentry(Oid relid)
 PgStat_StatFuncEntry *
 pgstat_fetch_stat_funcentry(Oid func_id)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry = NULL;
-
-    /* load the stats file if needed */
-    backend_read_statsfile();
-
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
-
-    return funcentry;
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    return (PgStat_StatFuncEntry *)
+        snapshot_statentry(PGSTAT_TYPE_FUNCTION, MyDatabaseId, func_id);
 }
 
+/*
+ * pgstat_snapshot_global_stats() -
+ *
+ * Makes a snapshot of global stats if not done yet.  They will be kept until
+ * subsequent call of pgstat_clear_snapshot() or the end of the current
+ * memory context (typically TopTransactionContext).
+ * ----------
+ */
+static void
+pgstat_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext;
+
+    attach_shared_stats();
+
+    /* Nothing to do if already done */
+    if (global_snapshot_is_valid)
+        return;
+
+    oldcontext = MemoryContextSwitchTo(pgStatSnapshotContext);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+
+    memcpy(&snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+    LWLockRelease(StatsLock);
+
+    global_snapshot_is_valid = true;
+
+    MemoryContextSwitchTo(oldcontext);
+
+    return;
+}
 
 /* ----------
  * pgstat_fetch_stat_beentry() -
@@ -2599,9 +2942,10 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &archiverStats;
+    return &snapshot_archiverStats;
 }
 
 
@@ -2616,9 +2960,10 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &globalStats;
+    return &snapshot_globalStats;
 }
 
 
@@ -2832,8 +3177,8 @@ pgstat_initialize(void)
         MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
     }
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* needs to be called before DSM shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -3009,12 +3354,15 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+    /* attach shared database stats area */
+    attach_shared_stats();
 }
 
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
- * Flush any remaining statistics counts out to the collector.
+ * Flush any remaining statistics counts out to shared stats.
  * Without this, operations triggered during backend exit (such as
  * temp table deletions) won't be counted.
  *
@@ -3027,7 +3375,7 @@ pgstat_beshutdown_hook(int code, Datum arg)
 
     /*
      * If we got as far as discovering our own database ID, we can report what
-     * we did to the collector.  Otherwise, we'd be sending an invalid
+     * we did to the shared stats.  Otherwise, we'd be sending an invalid
      * database ID, so forget it.  (This means that accesses to pg_database
      * during failed backend starts might never get counted.)
      */
@@ -3044,6 +3392,8 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    detach_shared_stats(true);
 }
 
 
@@ -3304,7 +3654,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3599,9 +3950,6 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
-            break;
         case WAIT_EVENT_RECOVERY_WAL_STREAM:
             event_name = "RecoveryWalStream";
             break;
@@ -4230,94 +4578,71 @@ pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
 
 
 /* ----------
- * pgstat_setheader() -
+ * pgstat_report_archiver() -
  *
- *        Set common header fields in a statistics message
- * ----------
- */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
-    {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
- *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
+ *        Report archiver statistics
  * ----------
  */
 void
-pgstat_send_archiver(const char *xlog, bool failed)
+pgstat_report_archiver(const char *xlog, bool failed)
 {
-    PgStat_MsgArchiver msg;
+    TimestampTz now = GetCurrentTimestamp();
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+    if (failed)
+    {
+        /* Failed archival attempt */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->failed_count;
+        memcpy(shared_archiverStats->last_failed_wal, xlog,
+               sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    else
+    {
+        /* Successful archival operation */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->archived_count;
+        memcpy(shared_archiverStats->last_archived_wal, xlog,
+               sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
 }
 
 /* ----------
- * pgstat_send_bgwriter() -
+ * pgstat_report_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
-pgstat_send_bgwriter(void)
+pgstat_report_bgwriter(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+
+    PgStat_BgWriter *l = &BgWriterStats;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid taking the lock for completely empty stats.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += l->timed_checkpoints;
+    shared_globalStats->requested_checkpoints += l->requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += l->checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += l->checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += l->buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += l->buf_written_clean;
+    shared_globalStats->maxwritten_clean += l->maxwritten_clean;
+    shared_globalStats->buf_written_backend += l->buf_written_backend;
+    shared_globalStats->buf_fsync_backend += l->buf_fsync_backend;
+    shared_globalStats->buf_alloc += l->buf_alloc;
+    LWLockRelease(StatsLock);
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4326,425 +4651,30 @@ pgstat_send_bgwriter(void)
 }
 
 
-/* ----------
- * PgstatCollectorMain() -
- *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, SignalHandlerForConfigReload);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    MyBackendType = B_STATS_COLLECTOR;
-    init_ps_display(NULL);
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do check ConfigReloadPending
-     * inside the inner loop, which means that such interrupts will get
-     * serviced but the latch won't get cleared until next time there is a
-     * break in the action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (ShutdownRequestPending)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * ShutdownRequestPending becomes set.
-         */
-        while (!ShutdownRequestPending)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (ConfigReloadPending)
-            {
-                ConfigReloadPending = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr & WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    exit(0);
-}
-
-/*
- * Subroutine to clear stats in a database entry
- *
- * Tables and functions hashes are initialized to empty.
- */
-static void
-reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
-{
-    HASHCTL        hash_ctl;
-
-    dbentry->n_xact_commit = 0;
-    dbentry->n_xact_rollback = 0;
-    dbentry->n_blocks_fetched = 0;
-    dbentry->n_blocks_hit = 0;
-    dbentry->n_tuples_returned = 0;
-    dbentry->n_tuples_fetched = 0;
-    dbentry->n_tuples_inserted = 0;
-    dbentry->n_tuples_updated = 0;
-    dbentry->n_tuples_deleted = 0;
-    dbentry->last_autovac_time = 0;
-    dbentry->n_conflict_tablespace = 0;
-    dbentry->n_conflict_lock = 0;
-    dbentry->n_conflict_snapshot = 0;
-    dbentry->n_conflict_bufferpin = 0;
-    dbentry->n_conflict_startup_deadlock = 0;
-    dbentry->n_temp_files = 0;
-    dbentry->n_temp_bytes = 0;
-    dbentry->n_deadlocks = 0;
-    dbentry->n_checksum_failures = 0;
-    dbentry->last_checksum_failure = 0;
-    dbentry->n_block_read_time = 0;
-    dbentry->n_block_write_time = 0;
-
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
-
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
-}
-
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
-{
-    PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
-     */
-    if (!found)
-        reset_dbentry_counters(result);
-
-    return result;
-}
-
-
-/*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
-{
-    PgStat_StatTabEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /* If not found, initialize the new one. */
-    if (!found)
-    {
-        result->numscans = 0;
-        result->tuples_returned = 0;
-        result->tuples_fetched = 0;
-        result->tuples_inserted = 0;
-        result->tuples_updated = 0;
-        result->tuples_deleted = 0;
-        result->tuples_hot_updated = 0;
-        result->n_live_tuples = 0;
-        result->n_dead_tuples = 0;
-        result->changes_since_analyze = 0;
-        result->inserts_since_vacuum = 0;
-        result->blocks_fetched = 0;
-        result->blocks_hit = 0;
-        result->vacuum_timestamp = 0;
-        result->vacuum_count = 0;
-        result->autovac_vacuum_timestamp = 0;
-        result->autovac_vacuum_count = 0;
-        result->analyze_timestamp = 0;
-        result->analyze_count = 0;
-        result->autovac_analyze_timestamp = 0;
-        result->autovac_analyze_count = 0;
-    }
-
-    return result;
-}
-
-
 /* ----------
  * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ *        Write the global statistics file, as well as DB files.
  * ----------
  */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+void
+pgstat_write_statsfiles(void)
 {
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **penv;
+
+    /* Stats are not initialized yet; just return. */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
+    create_missing_dbentries();
+
     /*
      * Open the statistics temp file to write out the current values.
      */
@@ -4761,7 +4691,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
 
     /*
      * Write the file header --- currently just a format ID.
@@ -4773,32 +4703,31 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Write global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write archiver stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Walk through the database table.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_DB, InvalidOid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) &(*penv)->body;
+
         /*
          * Write out the table and function stats for this DB into the
          * appropriate per-DB stat file, if required.
          */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
+        pgstat_write_database_stats(dbentry);
 
         /*
          * Write out the DB entry. We don't write the tables or functions
@@ -4809,6 +4738,8 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
         (void) rc;                /* we'll check for error with ferror */
     }
 
+    pfree(envlist);
+
     /*
      * No more output to be done. Close the temp file and replace the old
      * pgstat.stat with it.  The ferror() check replaces testing for error
@@ -4841,55 +4772,19 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
 }
 
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
 
 /* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
+ * pgstat_write_database_stats() -
+ *  Write the stat file for a single database.
  * ----------
  */
 static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
+pgstat_write_database_stats(PgStat_StatDBEntry *dbentry)
 {
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **penv;
     FILE       *fpout;
     int32        format_id;
     Oid            dbid = dbentry->databaseid;
@@ -4897,8 +4792,8 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     char        tmpfile[MAXPGPATH];
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -4925,24 +4820,31 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     /*
      * Walk through the database's access stats per table.
      */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_TABLE, dbentry->databaseid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatTabEntry *tabentry = (PgStat_StatTabEntry *) &(*penv)->body;
+
         fputc('T', fpout);
         rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    pfree(envlist);
 
     /*
      * Walk through the database's function stats table.
      */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_FUNCTION, dbentry->databaseid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatFuncEntry *funcentry =
+        (PgStat_StatFuncEntry *) &(*penv)->body;
+
         fputc('F', fpout);
         rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    pfree(envlist);
 
     /*
      * No more output to be done. Close the temp file and replace the old
@@ -4976,94 +4878,165 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
+}
 
-    if (permanent)
+/* ----------
+ * create_missing_dbentries() -
+ *
+ *  A database entry may be missing for a database in which object stats
+ *  are recorded. This function creates such missing dbentries so that all
+ *  stats entries can be written out to files.
+ * ----------
+ */
+static void
+create_missing_dbentries(void)
+{
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
+    HTAB       *oidhash;
+    HASHCTL        ctl;
+    HASH_SEQ_STATUS scan;
+    Oid           *poid;
+
+    memset(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(Oid);
+    ctl.entrysize = sizeof(Oid);
+    ctl.hcxt = CurrentMemoryContext;
+    oidhash = hash_create("Temporary table of OIDs",
+                          PGSTAT_TABLE_HASH_SIZE,
+                          &ctl,
+                          HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+    /* Collect OID from the shared stats hash */
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+        hash_search(oidhash, &p->key.databaseid, HASH_ENTER, NULL);
+    dshash_seq_term(&hstat);
+
+    /* Create database entries that do not exist yet. */
+    hash_seq_init(&scan, oidhash);
+    while ((poid = (Oid *) hash_seq_search(&scan)) != NULL)
+        (void) get_stat_entry(PGSTAT_TYPE_DB, *poid, InvalidOid,
+                              false, init_dbentry, NULL);
+
+    hash_destroy(oidhash);
+}
+
+
+/* ----------
+ * get_stat_entry() -
+ *
+ *  Get the shared stats entry for the specified type, dbid and objid.
+ *  If nowait is true, returns NULL on lock failure.
+ *
+ *  If initfunc is not NULL, a new entry is created if none exists yet, and
+ *  initfunc is called with the new envelope. If found is not NULL, it is set
+ *  to true if an existing entry was found, or false if not.
+ * ----------
+ */
+static PgStatEnvelope *
+get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+               bool nowait, entry_initializer initfunc, bool *found)
+{
+    bool        create = (initfunc != NULL);
+    PgStatHashEntry *shent;
+    PgStatEnvelope *shenv = NULL;
+    PgStatHashEntryKey key;
+    bool        myfound;
+
+    Assert(type != PGSTAT_TYPE_ALL);
+
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    shent = dshash_find_extended(pgStatSharedHash, &key,
+                                 create, nowait, create, &myfound);
+    if (shent)
     {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
+        if (create && !myfound)
+        {
+            /* Create new stats envelope. */
+            size_t        envsize = PgStatEnvelopeSize(pgstat_entsize[type]);
+            dsa_pointer chunk = dsa_allocate0(area, envsize);
 
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
+            shenv = dsa_get_address(area, chunk);
+            shenv->type = type;
+            shenv->databaseid = dbid;
+            shenv->objectid = objid;
+            shenv->len = pgstat_entsize[type];
+            LWLockInitialize(&shenv->lock, LWTRANCHE_STATS);
+
+            /*
+             * The dshash lock is released just below. Call the initializer
+             * callback before the entry is exposed to other processes.
+             */
+            if (initfunc)
+                initfunc(shenv);
+
+            /* Link the new entry from the hash entry. */
+            shent->env = chunk;
+        }
+        else
+            shenv = dsa_get_address(area, shent->env);
+
+        dshash_release_lock(pgStatSharedHash, shent);
     }
+
+    if (found)
+        *found = myfound;
+
+    return shenv;
 }
 
+
 /* ----------
  * pgstat_read_statsfiles() -
  *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ *    Reads in existing activity statistics files into the shared stats hash.
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+void
+pgstat_read_statsfiles(void)
 {
+    PgStatEnvelope *env;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
 
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
+    /* shouldn't be called from postmaster */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp =
+        shared_globalStats->stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the stats file has not yet been written
+     * for the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -5072,7 +5045,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
@@ -5080,38 +5053,30 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     /*
      * Read global stats struct
      */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+        sizeof(*shared_globalStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
         goto done;
     }
 
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
     /*
      * Read archiver stats struct
      */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+        sizeof(*shared_archiverStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
         goto done;
     }
 
     /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
     for (;;)
     {
@@ -5125,7 +5090,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
                           fpin) != offsetof(PgStat_StatDBEntry, tables))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5134,76 +5099,33 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 /*
                  * Add to the DB hash
                  */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
+
+                env = get_stat_entry(PGSTAT_TYPE_DB, dbbuf.databaseid,
+                                     InvalidOid,
+                                     false, init_dbentry, &found);
+                dbentry = (PgStat_StatDBEntry *) &env->body;
+
+                /* don't allow duplicate dbentries */
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
+                memcpy(dbentry, &dbbuf,
+                       offsetof(PgStat_StatDBEntry, tables));
 
+                /* Read the data from the database-specific file. */
+                pgstat_read_db_statsfile(dbentry);
                 break;
 
             case 'E':
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5213,59 +5135,50 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 done:
     FreeFile(fpin);
 
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 
-    return dbhash;
+    return;
 }
 
 
 /* ----------
  * pgstat_read_db_statsfile() -
  *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
+ *    Reads in the at-rest statistics file and creates shared statistics
+ *    tables. The file is removed after reading.
  * ----------
  */
 static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
+pgstat_read_db_statsfile(PgStat_StatDBEntry *dbentry)
 {
+    PgStatEnvelope *env;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatTabEntry tabbuf;
     PgStat_StatFuncEntry funcbuf;
     PgStat_StatFuncEntry *funcentry;
+    dshash_table *tabhash = NULL;
+    dshash_table *funchash = NULL;
     FILE       *fpin;
     int32        format_id;
     bool        found;
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
+    get_dbstat_filename(false, dbentry->databaseid, statfile, MAXPGPATH);
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the stats file has not yet been written
+     * for the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
@@ -5278,14 +5191,14 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
 
     /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
     for (;;)
     {
@@ -5298,25 +5211,21 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
                           fpin) != sizeof(PgStat_StatTabEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
+                env = get_stat_entry(PGSTAT_TYPE_TABLE,
+                                     dbentry->databaseid, tabbuf.tableid,
+                                     false, init_tabentry, &found);
+                tabentry = (PgStat_StatTabEntry *) &env->body;
 
+                /* don't allow duplicate entries */
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5332,25 +5241,20 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
                           fpin) != sizeof(PgStat_StatFuncEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
+                env = get_stat_entry(PGSTAT_TYPE_FUNCTION, dbentry->databaseid,
+                                     funcbuf.functionid,
+                                     false, init_funcentry, &found);
+                funcentry = (PgStat_StatFuncEntry *) &env->body;
 
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5366,7 +5270,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5374,292 +5278,15 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     }
 
 done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
+    if (tabhash)
+        dshash_detach(tabhash);
+    if (funchash)
+        dshash_detach(funchash);
 
-done:
     FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
-}
-
 
-/* ----------
- * pgstat_setup_memcxt() -
- *
- *    Create pgStatLocalContext, if not already done.
- * ----------
- */
-static void
-pgstat_setup_memcxt(void)
-{
-    if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 }
 
 
@@ -5678,756 +5305,25 @@ pgstat_clear_snapshot(void)
 {
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
+    {
         MemoryContextDelete(pgStatLocalContext);
 
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-                tabentry->inserts_since_vacuum = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * It is quite possible that a non-aggressive VACUUM ended up skipping
-     * various pages, however, we'll zero the insert counter here regardless.
-     * It's currently used only to track when we need to perform an
-     * "insert" autovacuum, which are mainly intended to freeze newly inserted
-     * tuples.  Zeroing this may just mean we'll not try to vacuum the table
-     * again until enough tuples have been inserted to trigger another insert
-     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
-     * stragglers.
-     */
-    tabentry->inserts_since_vacuum = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
     }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
 
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
+    if (pgStatSnapshotContext)
     {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
+        MemoryContextReset(pgStatSnapshotContext);
 
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
+        /* Reset variables that pointed to the context */
+        global_snapshot_is_valid = false;
+        pgStatSnapshotHash = NULL;
     }
 }
 
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
-}
-
 /*
  * Convert a potentially unsafely truncated activity string (see
  * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a4b9d212a2..83ca2c1113 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -255,7 +255,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -503,7 +502,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1326,12 +1324,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1780,11 +1772,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2694,8 +2681,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -3058,8 +3043,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3126,13 +3109,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3205,22 +3181,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3681,22 +3641,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3892,8 +3836,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3928,8 +3870,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4130,8 +4071,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5109,18 +5048,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5239,12 +5166,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6139,7 +6060,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6195,8 +6115,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6431,7 +6349,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e05e2b3456..26414dadb2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1947,7 +1947,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                BgWriterStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2057,7 +2057,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2247,7 +2247,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2255,7 +2255,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
 
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..58a442f482 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -147,6 +147,7 @@ CreateSharedMemoryAndSemaphores(void)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -263,6 +264,7 @@ CreateSharedMemoryAndSemaphores(void)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 4c14e51c67..f61fd3e8ad 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -523,6 +523,7 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
     LWLockRegisterTranche(LWTRANCHE_SXACT, "serializable_xact");
+    LWLockRegisterTranche(LWTRANCHE_STATS, "activity_statistics");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index db47843229..97eccb35d3 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -49,3 +49,4 @@ MultiXactTruncationLock                41
 OldSnapshotTimeMapLock                42
 LogicalRepWorkerLock                43
 CLogTruncationLock                    44
+StatsLock                            45
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index cb8c23e4b7..29b9b8d9fc 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3189,6 +3189,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3763,6 +3769,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4201,6 +4208,8 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long stats_timeout;
+
                 /* Send out notify signals and transmit self-notifies */
                 ProcessCompletedNotifies();
 
@@ -4213,8 +4222,13 @@ PostgresMain(int argc, char *argv[],
                 if (notifyInterruptPending)
                     ProcessNotifyInterrupt();
 
-                pgstat_report_stat(false);
-
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle");
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4249,7 +4263,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4257,6 +4271,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 6d66ff8b44..b59b22c367 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -33,9 +33,6 @@
 
 #define UINT32_ACCESS_ONCE(var)         ((uint32)(*((volatile uint32 *)&(var))))
 
-/* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
-
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
 {
@@ -1260,7 +1257,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_commit);
+        result = (int64) (dbentry->counts.n_xact_commit);
 
     PG_RETURN_INT64(result);
 }
@@ -1276,7 +1273,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_rollback);
+        result = (int64) (dbentry->counts.n_xact_rollback);
 
     PG_RETURN_INT64(result);
 }
@@ -1292,7 +1289,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_fetched);
+        result = (int64) (dbentry->counts.n_blocks_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1308,7 +1305,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_hit);
+        result = (int64) (dbentry->counts.n_blocks_hit);
 
     PG_RETURN_INT64(result);
 }
@@ -1324,7 +1321,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_returned);
+        result = (int64) (dbentry->counts.n_tuples_returned);
 
     PG_RETURN_INT64(result);
 }
@@ -1340,7 +1337,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_fetched);
+        result = (int64) (dbentry->counts.n_tuples_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1356,7 +1353,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_inserted);
+        result = (int64) (dbentry->counts.n_tuples_inserted);
 
     PG_RETURN_INT64(result);
 }
@@ -1372,7 +1369,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_updated);
+        result = (int64) (dbentry->counts.n_tuples_updated);
 
     PG_RETURN_INT64(result);
 }
@@ -1388,7 +1385,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_deleted);
+        result = (int64) (dbentry->counts.n_tuples_deleted);
 
     PG_RETURN_INT64(result);
 }
@@ -1421,7 +1418,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_files;
+        result = dbentry->counts.n_temp_files;
 
     PG_RETURN_INT64(result);
 }
@@ -1437,7 +1434,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_bytes;
+        result = dbentry->counts.n_temp_bytes;
 
     PG_RETURN_INT64(result);
 }
@@ -1452,7 +1449,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace);
+        result = (int64) (dbentry->counts.n_conflict_tablespace);
 
     PG_RETURN_INT64(result);
 }
@@ -1467,7 +1464,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_lock);
+        result = (int64) (dbentry->counts.n_conflict_lock);
 
     PG_RETURN_INT64(result);
 }
@@ -1482,7 +1479,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_snapshot);
+        result = (int64) (dbentry->counts.n_conflict_snapshot);
 
     PG_RETURN_INT64(result);
 }
@@ -1497,7 +1494,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_bufferpin);
+        result = (int64) (dbentry->counts.n_conflict_bufferpin);
 
     PG_RETURN_INT64(result);
 }
@@ -1512,7 +1509,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1527,11 +1524,11 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace +
-                          dbentry->n_conflict_lock +
-                          dbentry->n_conflict_snapshot +
-                          dbentry->n_conflict_bufferpin +
-                          dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_tablespace +
+                          dbentry->counts.n_conflict_lock +
+                          dbentry->counts.n_conflict_snapshot +
+                          dbentry->counts.n_conflict_bufferpin +
+                          dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1546,7 +1543,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_deadlocks);
+        result = (int64) (dbentry->counts.n_deadlocks);
 
     PG_RETURN_INT64(result);
 }
@@ -1564,7 +1561,7 @@ pg_stat_get_db_checksum_failures(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_checksum_failures);
+        result = (int64) (dbentry->counts.n_checksum_failures);
 
     PG_RETURN_INT64(result);
 }
@@ -1601,7 +1598,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_read_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_read_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1617,7 +1614,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_write_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_write_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb19644419..51748c99ad 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index f4247ea70d..f65d05c24c 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -631,6 +632,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1241,6 +1244,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
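
IdleStatsUpdateTimeoutHandler() above only sets IdleStatsUpdateTimeoutPending and InterruptPending and then sets the latch; the code that reacts to the flag is not part of this excerpt. As a rough illustration of that split between handler context and normal context, here is a minimal sketch. All names in it (stats_timeout_pending, interrupt_pending, process_pending_stats_timeout, flush_count) are hypothetical stand-ins, and the "flush" is just a counter.

```c
#include <signal.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the pending flags; not the PostgreSQL globals. */
static volatile sig_atomic_t stats_timeout_pending = 0;
static volatile sig_atomic_t interrupt_pending = 0;

/* Counts how often the deferred work actually ran. */
static int  flush_count = 0;

/* Handler context: only record that the timeout fired, do no real work. */
static void
stats_timeout_handler(void)
{
    stats_timeout_pending = 1;
    interrupt_pending = 1;
}

/*
 * Normal context: perform the deferred work if the flag is set.
 * Returns true if work was done.
 */
static bool
process_pending_stats_timeout(void)
{
    if (!stats_timeout_pending)
        return false;

    stats_timeout_pending = 0;  /* clear first, so a new timeout is not lost */
    flush_count++;              /* stand-in for flushing pending statistics */
    return true;
}
```

Keeping the handler this small is the point of the pattern: anything that takes locks or touches shared memory is left to the code that later checks the flag.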
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 3c70499feb..927ae319b1 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -6,7 +6,7 @@ use File::Basename qw(basename dirname);
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 107;
+use Test::More tests => 106;
 
 program_help_ok('pg_basebackup');
 program_version_ok('pg_basebackup');
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 619b2f9c71..9f1de1e42f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -83,6 +83,8 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 763c1ee2bd..69bd794806 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL activity statistics facility.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -15,9 +15,11 @@
 #include "libpq/pqcomm.h"
 #include "miscadmin.h"
 #include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -41,33 +43,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -78,9 +53,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -111,18 +85,17 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_delta_live_tuples;
     PgStat_Counter t_delta_dead_tuples;
     PgStat_Counter t_changed_tuples;
+    bool        reset_changed_tuples;
 
     PgStat_Counter t_blocks_fetched;
     PgStat_Counter t_blocks_hit;
+
+    PgStat_Counter vacuum_count;
+    PgStat_Counter autovac_vacuum_count;
+    PgStat_Counter analyze_count;
+    PgStat_Counter autovac_analyze_count;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -156,6 +129,10 @@ typedef struct PgStat_TableStatus
     Oid            t_id;            /* table's OID */
     bool        t_shared;        /* is it a shared catalog? */
     struct PgStat_TableXactStatus *trans;    /* lowest subxact's counts */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_TableCounts t_counts;    /* event counts to be sent */
 } PgStat_TableStatus;
 
@@ -181,280 +158,32 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
-/* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
- */
-typedef struct PgStat_MsgHdr
-{
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
 /* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgTempFile
+typedef struct PgStat_BgWriter
 {
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter buf_alloc;
+    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+} PgStat_BgWriter;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -486,96 +215,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Activity statistics data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -584,13 +225,9 @@ typedef union PgStat_Msg
 
 #define PGSTAT_FILE_FORMAT_ID    0x01A5BC9D
 
-/* ----------
- * PgStat_StatDBEntry            The collector's data per database
- * ----------
- */
-typedef struct PgStat_StatDBEntry
+
+typedef struct PgStat_StatDBCounts
 {
-    Oid            databaseid;
     PgStat_Counter n_xact_commit;
     PgStat_Counter n_xact_rollback;
     PgStat_Counter n_blocks_fetched;
@@ -600,7 +237,6 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_tuples_inserted;
     PgStat_Counter n_tuples_updated;
     PgStat_Counter n_tuples_deleted;
-    TimestampTz last_autovac_time;
     PgStat_Counter n_conflict_tablespace;
     PgStat_Counter n_conflict_lock;
     PgStat_Counter n_conflict_snapshot;
@@ -610,29 +246,55 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_temp_bytes;
     PgStat_Counter n_deadlocks;
     PgStat_Counter n_checksum_failures;
-    TimestampTz last_checksum_failure;
     PgStat_Counter n_block_read_time;    /* times in microseconds */
     PgStat_Counter n_block_write_time;
+} PgStat_StatDBCounts;
 
+/* ----------
+ * PgStat_StatDBEntry            The statistics per database
+ * ----------
+ */
+typedef struct PgStat_StatDBEntry
+{
+    Oid            databaseid;
+    TimestampTz last_autovac_time;
+    TimestampTz last_checksum_failure;
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;    /* time of db stats update */
+
+    PgStat_StatDBCounts counts;
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following fields must be last in the struct, because we don't write
+     * them out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables; /* current gen tables hash */
+    dshash_table_handle functions;    /* current gen functions hash */
 } PgStat_StatDBEntry;
 
+/* ----------
+ * PgStat_HashMountInfo  dshash mount information
+ * ----------
+ */
+typedef struct PgStat_HashMountInfo
+{
+    HTAB       *snapshot_tables;    /* table entry snapshot */
+    HTAB       *snapshot_functions; /* function entry snapshot */
+    dshash_table *dshash_tables;    /* attached tables dshash */
+    dshash_table *dshash_functions; /* attached functions dshash */
+} PgStat_HashMountInfo;
 
 /* ----------
- * PgStat_StatTabEntry            The collector's data per table (or index)
+ * PgStat_StatTabEntry            The statistics per table (or index)
  * ----------
  */
 typedef struct PgStat_StatTabEntry
 {
     Oid            tableid;
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
 
     PgStat_Counter numscans;
 
@@ -652,19 +314,15 @@ typedef struct PgStat_StatTabEntry
     PgStat_Counter blocks_fetched;
     PgStat_Counter blocks_hit;
 
-    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
     PgStat_Counter vacuum_count;
-    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_vacuum_count;
-    TimestampTz analyze_timestamp;    /* user initiated */
     PgStat_Counter analyze_count;
-    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_analyze_count;
 } PgStat_StatTabEntry;
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            The statistics per function
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
@@ -679,7 +337,7 @@ typedef struct PgStat_StatFuncEntry
 
 
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -695,7 +353,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -761,7 +419,6 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
     WAIT_EVENT_WAL_RECEIVER_MAIN,
@@ -1005,7 +662,7 @@ typedef struct PgBackendGSSStatus
  *
  * Each live backend maintains a PgBackendStatus struct in shared memory
  * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
+ * BackendId, but that is not critical.)  Note that the stats-writing code
  * has no involvement in, or even access to, these structs.
  *
  * Each auxiliary process also maintains a PgBackendStatus struct in shared
@@ -1202,13 +859,15 @@ extern PGDLLIMPORT bool pgstat_track_counts;
 extern PGDLLIMPORT int pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used; kept only until the corresponding GUC is removed */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1223,29 +882,26 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
 
+/* File input/output functions  */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 
 extern void pgstat_report_autovac(Oid dboid);
@@ -1406,8 +1062,8 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                                       void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
+extern void pgstat_report_archiver(const char *xlog, bool failed);
+extern void pgstat_report_bgwriter(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1416,11 +1072,15 @@ extern void pgstat_send_bgwriter(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_snapshot(bool shared,
+                                                                Oid relid);
+extern void pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 8fda8e4f78..13371e8cb7 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -220,6 +220,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_SXACT,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 83a15f6795..77d1572a99 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.18.2

From f1c3824c394f84e2d5a9abd87e07ec0f395598c8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:09 +0900
Subject: [PATCH v28 6/7] Doc part of shared-memory based stats collector.

---
 doc/src/sgml/catalogs.sgml          |   6 +-
 doc/src/sgml/config.sgml            |  29 +++---
 doc/src/sgml/high-availability.sgml |  13 +--
 doc/src/sgml/monitoring.sgml        | 133 +++++++++++++---------------
 doc/src/sgml/ref/pg_dump.sgml       |   9 +-
 5 files changed, 91 insertions(+), 99 deletions(-)
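
The monitoring.sgml text in this patch says that each server process writes
out new counts just before going idle, but not more often than once per
PGSTAT_STAT_INTERVAL. The following is a minimal standalone sketch of that
rate limiting, not code from this patch: sketch_report_stat,
SKETCH_STAT_INTERVAL_MS and the explicit now_ms argument are hypothetical
names, and the return convention (0 when flushed, otherwise the milliseconds
left to wait) is only an assumption suggested by the long return type that
the series gives pgstat_report_stat.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for PGSTAT_STAT_INTERVAL, in milliseconds. */
#define SKETCH_STAT_INTERVAL_MS 1000L

/* Time of the last flush to shared stats; -1 means "never flushed". */
static long sketch_last_flush_ms = -1;

/* Pending backend-local count and the shared total it is flushed into. */
static long sketch_pending_count = 0;
static long sketch_shared_count = 0;

/*
 * Flush pending counts unless the previous flush was too recent.
 *
 * Returns 0 if the counts were flushed (or nothing was pending), otherwise
 * the number of milliseconds the caller should wait before trying again.
 * "force" bypasses the interval check.
 */
static long
sketch_report_stat(bool force, long now_ms)
{
    assert(now_ms >= 0);

    if (sketch_pending_count == 0)
        return 0;

    if (!force && sketch_last_flush_ms >= 0)
    {
        long        elapsed = now_ms - sketch_last_flush_ms;

        if (elapsed < SKETCH_STAT_INTERVAL_MS)
            return SKETCH_STAT_INTERVAL_MS - elapsed;
    }

    /* Publish the pending counts and remember when we did so. */
    sketch_shared_count += sketch_pending_count;
    sketch_pending_count = 0;
    sketch_last_flush_ms = now_ms;
    return 0;
}
```

A caller could use a nonzero return value to arm a timer and retry later.
The series adds an IDLE_STATS_UPDATE_TIMEOUT timeout reason, but how it is
wired up is not visible in this part of the diff.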

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 64614b569c..8bd8fc4d5f 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -8151,9 +8151,9 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l
   <para>
    <xref linkend="view-table"/> lists the system views described here.
    More detailed documentation of each view follows below.
-   There are some additional views that provide access to the results of
-   the statistics collector; they are described in <xref
-   linkend="monitoring-stats-views-table"/>.
+   There are some additional views that provide access to the activity
+   statistics; they are described in
+   <xref linkend="monitoring-stats-views-table"/>.
   </para>
 
   <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2de21903a1..7ed2b3884c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7008,11 +7008,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
     <title>Run-time Statistics</title>
 
     <sect2 id="runtime-config-statistics-collector">
-     <title>Query and Index Statistics Collector</title>
+     <title>Query and Index Activity Statistics</title>
 
      <para>
-      These parameters control server-wide statistics collection features.
-      When statistics collection is enabled, the data that is produced can be
+      These parameters control server-wide activity statistics features.
+      When activity statistics are enabled, the data that is produced can be
       accessed via the <structname>pg_stat</structname> and
       <structname>pg_statio</structname> family of system views.
       Refer to <xref linkend="monitoring"/> for more information.
@@ -7028,14 +7028,13 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables the collection of information on the currently
-        executing command of each session, along with the time when
-        that command began execution. This parameter is on by
-        default. Note that even when enabled, this information is not
-        visible to all users, only to superusers and the user owning
-        the session being reported on, so it should not represent a
-        security risk.
-        Only superusers can change this setting.
+        Enables tracking of the currently executing command of
+        each session, along with the time when that command began
+        execution. This parameter is on by default. Note that even when
+        enabled, this information is not visible to all users, only to
+        superusers and the user owning the session being reported on, so it
+        should not represent a security risk.  Only superusers can change this
+        setting.
        </para>
       </listitem>
      </varlistentry>
@@ -7066,9 +7065,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables collection of statistics on database activity.
+        Enables tracking of database activity.
         This parameter is on by default, because the autovacuum
-        daemon needs the collected information.
+        daemon needs the activity information.
         Only superusers can change this setting.
        </para>
       </listitem>
@@ -8129,7 +8128,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Specifies the fraction of the total number of heap tuples counted in
-        the previous statistics collection that can be inserted without
+        the previous activity statistics that can be inserted without
         incurring an index scan at the <command>VACUUM</command> cleanup stage.
         This setting currently applies to B-tree indexes only.
        </para>
@@ -8142,7 +8141,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         Index statistics are considered to be stale if the number of newly
         inserted tuples exceeds the <varname>vacuum_cleanup_index_scale_factor</varname>
         fraction of the total number of heap tuples detected by the previous
-        statistics collection. The total number of heap tuples is stored in
+        activity statistics. The total number of heap tuples is stored in
         the index meta-page. Note that the meta-page does not include this data
         until <command>VACUUM</command> finds no dead tuples, so B-tree index
         scan at the cleanup stage can only be skipped if the second and
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index bb5d9962ed..fd8f6098e1 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2357,12 +2357,13 @@ LOG:  database system is ready to accept read only connections
    </para>
 
    <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
-    index usage, etc., will be recorded normally on the standby. Replayed
-    actions will not duplicate their effects on primary, so replaying an
-    insert will not increment the Inserts column of pg_stat_user_tables.
-    The stats file is deleted at the start of recovery, so stats from primary
-    and standby will differ; this is considered a feature, not a bug.
+    Activity statistics are collected during recovery. All scans, reads,
+    blocks, index usage, etc., will be recorded normally on the
+    standby. Replayed actions will not duplicate their effects on primary, so
+    replaying an insert will not increment the Inserts column of
+    pg_stat_user_tables.  Activity statistics are reset at the start of
+    recovery, so stats from primary and standby will differ; this is
+    considered a feature, not a bug.
    </para>
 
    <para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 220b8164c3..efdcd6fda8 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -22,7 +22,7 @@
   <para>
    Several tools are available for monitoring database activity and
    analyzing performance.  Most of this chapter is devoted to describing
-   <productname>PostgreSQL</productname>'s statistics collector,
+   <productname>PostgreSQL</productname>'s activity statistics,
    but one should not neglect regular Unix monitoring programs such as
    <command>ps</command>, <command>top</command>, <command>iostat</command>, and <command>vmstat</command>.
    Also, once one has identified a
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    master server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   master process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   master process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
@@ -130,20 +128,21 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
  </sect1>
 
  <sect1 id="monitoring-stats">
-  <title>The Statistics Collector</title>
+  <title>The Activity Statistics</title>
 
   <indexterm zone="monitoring-stats">
    <primary>statistics</primary>
   </indexterm>
 
   <para>
-   <productname>PostgreSQL</productname>'s <firstterm>statistics collector</firstterm>
-   is a subsystem that supports collection and reporting of information about
-   server activity.  Presently, the collector can count accesses to tables
-   and indexes in both disk-block and individual-row terms.  It also tracks
-   the total number of rows in each table, and information about vacuum and
-   analyze actions for each table.  It can also count calls to user-defined
-   functions and the total time spent in each one.
+   <productname>PostgreSQL</productname>'s <firstterm>activity
+   statistics</firstterm> subsystem supports tracking and reporting of
+   information about server activity.  Presently, it counts accesses to
+   tables and indexes in both disk-block and individual-row terms.  It
+   also tracks the total number of rows in each table, and information
+   about vacuum and analyze actions for each table.  It can also count
+   calls to user-defined functions and track the total time spent in
+   each one.
   </para>
 
   <para>
@@ -151,15 +150,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    information about exactly what is going on in the system right now, such as
    the exact command currently being executed by other server processes, and
    which other connections exist in the system.  This facility is independent
-   of the collector process.
+   of the activity statistics.
   </para>
 
  <sect2 id="monitoring-stats-setup">
-  <title>Statistics Collection Configuration</title>
+  <title>Activity Statistics Configuration</title>
 
   <para>
-   Since collection of statistics adds some overhead to query execution,
-   the system can be configured to collect or not collect information.
+   Since tracking activity statistics adds some overhead to query
+   execution, the system can be configured to track or not track activity.
    This is controlled by configuration parameters that are normally set in
    <filename>postgresql.conf</filename>.  (See <xref linkend="runtime-config"/> for
    details about setting configuration parameters.)
@@ -172,7 +171,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The parameter <xref linkend="guc-track-counts"/> controls whether
-   statistics are collected about table and index accesses.
+   table and index accesses are tracked.
   </para>
 
   <para>
@@ -196,18 +195,12 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
   </para>
 
   <para>
-   The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
-   When the server shuts down cleanly, a permanent copy of the statistics
-   data is stored in the <filename>pg_stat</filename> subdirectory, so that
-   statistics can be retained across server restarts.  When recovery is
-   performed at server start (e.g. after immediate shutdown, server crash,
-   and point-in-time recovery), all statistics counters are reset.
+   Activity statistics are kept in shared memory. When the server shuts
+   down cleanly, a permanent copy of the statistics data is stored in
+   the <filename>pg_stat</filename> subdirectory, so that statistics can be
+   retained across server restarts.  When recovery is performed at server
+   start (e.g. after immediate shutdown, server crash, or point-in-time
+   recovery), all statistics counters are reset.
   </para>
 
  </sect2>
@@ -220,48 +213,46 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    linkend="monitoring-stats-dynamic-views-table"/>, are available to show
    the current state of the system. There are also several other
    views, listed in <xref
-   linkend="monitoring-stats-views-table"/>, available to show the results
-   of statistics collection.  Alternatively, one can
-   build custom views using the underlying statistics functions, as discussed
-   in <xref linkend="monitoring-stats-functions"/>.
+   linkend="monitoring-stats-views-table"/>, available to show the activity
+   statistics.  Alternatively, one can build custom views using the underlying
+   statistics functions, as discussed in
+   <xref linkend="monitoring-stats-functions"/>.
   </para>
 
   <para>
-   When using the statistics to monitor collected data, it is important
-   to realize that the information does not update instantaneously.
-   Each individual server process transmits new statistical counts to
-   the collector just before going idle; so a query or transaction still in
-   progress does not affect the displayed totals.  Also, the collector itself
-   emits a new report at most once per <varname>PGSTAT_STAT_INTERVAL</varname>
-   milliseconds (500 ms unless altered while building the server).  So the
-   displayed information lags behind actual activity.  However, current-query
-   information collected by <varname>track_activities</varname> is
-   always up-to-date.
+   When using the activity statistics, it is important to realize that the
+   information does not update instantaneously.  Each individual server
+   process writes out new statistical counts just before going idle, but no
+   more often than once per <varname>PGSTAT_STAT_INTERVAL</varname>
+   milliseconds (1 second unless altered while building the server); so a
+   query or transaction still in progress does not affect the displayed
+   totals.  However, current-query information tracked by
+   <varname>track_activities</varname> is always up-to-date.
   </para>
 
   <para>
    Another important point is that when a server process is asked to display
-   any of these statistics, it first fetches the most recent report emitted by
-   the collector process and then continues to use this snapshot for all
-   statistical views and functions until the end of its current transaction.
-   So the statistics will show static information as long as you continue the
-   current transaction.  Similarly, information about the current queries of
-   all sessions is collected when any such information is first requested
-   within a transaction, and the same information will be displayed throughout
-   the transaction.
-   This is a feature, not a bug, because it allows you to perform several
-   queries on the statistics and correlate the results without worrying that
-   the numbers are changing underneath you.  But if you want to see new
-   results with each query, be sure to do the queries outside any transaction
-   block.  Alternatively, you can invoke
+   any of these statistics, it first reads the current statistics and then
+   continues to use this snapshot for all statistical views and functions
+   until the end of its current transaction.  So the statistics will show
+   static information as long as you continue the current transaction.
+   Similarly, information about the current queries of all sessions is read
+   when any such information is first requested within a transaction, and the
+   same information will be displayed throughout the transaction.  This is a
+   feature, not a bug, because it allows you to perform several queries on the
+   statistics and correlate the results without worrying that the numbers are
+   changing underneath you.  But if you want to see new results with each
+   query, be sure to do the queries outside any transaction block.
+   Alternatively, you can invoke
    <function>pg_stat_clear_snapshot</function>(), which will discard the
    current transaction's statistics snapshot (if any).  The next use of
    statistical information will cause a new snapshot to be fetched.
   </para>
-
+  
   <para>
-   A transaction can also see its own statistics (as yet untransmitted to the
-   collector) in the views <structname>pg_stat_xact_all_tables</structname>,
+   A transaction can also see its own statistics (as yet unwritten to the
+   server-wide activity statistics) in the
+   views <structname>pg_stat_xact_all_tables</structname>,
    <structname>pg_stat_xact_sys_tables</structname>,
    <structname>pg_stat_xact_user_tables</structname>, and
    <structname>pg_stat_xact_user_functions</structname>.  These numbers do not act as
@@ -596,7 +587,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    kernel's I/O cache, and might therefore still be fetched without
    requiring a physical read. Users interested in obtaining more
    detailed information on <productname>PostgreSQL</productname> I/O behavior are
-   advised to use the <productname>PostgreSQL</productname> statistics collector
+   advised to use the <productname>PostgreSQL</productname> activity statistics
    in combination with operating system utilities that allow insight
    into the kernel's handling of I/O.
   </para>
@@ -914,7 +905,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
       <tbody>
        <row>
-        <entry morerows="64"><literal>LWLock</literal></entry>
+        <entry morerows="65"><literal>LWLock</literal></entry>
         <entry><literal>ShmemIndexLock</literal></entry>
         <entry>Waiting to find or allocate space in shared memory.</entry>
        </row>
@@ -1197,6 +1188,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting to allocate or exchange a chunk of memory or update
          counters during Parallel Hash plan execution.</entry>
         </row>
+        <row>
+         <entry><literal>activity_statistics</literal></entry>
+         <entry>Waiting to write out activity statistics to shared memory
+         at transaction end or while idle.</entry>
+        </row>
         <row>
          <entry morerows="9"><literal>Lock</literal></entry>
          <entry><literal>relation</literal></entry>
@@ -1244,7 +1240,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting to acquire a pin on a buffer.</entry>
         </row>
         <row>
-         <entry morerows="12"><literal>Activity</literal></entry>
+         <entry morerows="11"><literal>Activity</literal></entry>
          <entry><literal>ArchiverMain</literal></entry>
          <entry>Waiting in main loop of the archiver process.</entry>
         </row>
@@ -1272,10 +1268,6 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry><literal>LogicalLauncherMain</literal></entry>
          <entry>Waiting in main loop of logical launcher process.</entry>
         </row>
-        <row>
-         <entry><literal>PgStatMain</literal></entry>
-         <entry>Waiting in main loop of the statistics collector process.</entry>
-        </row>
         <row>
          <entry><literal>RecoveryWalStream</literal></entry>
          <entry>Waiting for WAL from a stream at recovery.</entry>
@@ -4177,9 +4169,10 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
      <entry><literal>performing final cleanup</literal></entry>
      <entry>
        <command>VACUUM</command> is performing final cleanup.  During this phase,
-       <command>VACUUM</command> will vacuum the free space map, update statistics
-       in <literal>pg_class</literal>, and report statistics to the statistics
-       collector.  When this phase is completed, <command>VACUUM</command> will end.
+       <command>VACUUM</command> will vacuum the free space map, update
+       statistics in <literal>pg_class</literal>, and update the system-wide
+       activity statistics.  When this phase is completed,
+       <command>VACUUM</command> will end.
      </entry>
     </row>
    </tbody>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index a9bc397165..a9289f84b0 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1289,11 +1289,10 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The database activity of <application>pg_dump</application> is
-   normally collected by the statistics collector.  If this is
-   undesirable, you can set parameter <varname>track_counts</varname>
-   to false via <envar>PGOPTIONS</envar> or the <literal>ALTER
-   USER</literal> command.
+   The database activity of <application>pg_dump</application> is normally
+   collected.  If this is undesirable, you can set
+   parameter <varname>track_counts</varname> to false
+   via <envar>PGOPTIONS</envar> or the <literal>ALTER USER</literal> command.
   </para>
 
  </refsect1>
-- 
2.18.2

From c21b8db796c97211c50c186b8456a53e4fe276d7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 17:00:44 +0900
Subject: [PATCH v28 7/7] Remove the GUC stats_temp_directory

The GUC stats_temp_directory specified the directory in which temporary
statistics files were stored. The stats collector no longer needs it,
but programs in bin and contrib, and possibly other extensions, still
use the directory. This patch therefore removes the GUC but leaves some
backing variables and macro definitions in place for backward
compatibility.
---
 doc/src/sgml/backup.sgml                      |  2 -
 doc/src/sgml/config.sgml                      | 19 ---------
 doc/src/sgml/storage.sgml                     |  3 +-
 src/backend/postmaster/pgstat.c               | 13 +++---
 src/backend/replication/basebackup.c          | 13 ++----
 src/backend/utils/misc/guc.c                  | 41 -------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/pgstat.h                          |  5 ++-
 src/test/perl/PostgresNode.pm                 |  4 --
 9 files changed, 13 insertions(+), 88 deletions(-)
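
With the GUC gone, basebackup.c no longer has to handle absolute or
already-relative values of pgstat_stat_directory; it only prefixes "./" to
the fixed directory name. A minimal standalone sketch of that simplified
path computation follows. It is an illustration, not code from this patch:
it uses snprintf in place of the backend's psprintf, and
SKETCH_STAT_TMP_DIR and sketch_stat_relpath are hypothetical names, with the
macro assumed to hold "pg_stat_tmp", the documented default.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for PG_STAT_TMP_DIR. */
#define SKETCH_STAT_TMP_DIR "pg_stat_tmp"

/*
 * Write "./<dir>" into buf.  Returns 0 on success, or -1 if buf is too
 * small to hold the whole path including the terminating NUL.
 */
static int
sketch_stat_relpath(char *buf, size_t buflen)
{
    int         n;

    /* As in the patch's Assert: the name must be a single path component. */
    assert(strchr(SKETCH_STAT_TMP_DIR, '/') == NULL);

    n = snprintf(buf, buflen, "./%s", SKETCH_STAT_TMP_DIR);
    if (n < 0 || (size_t) n >= buflen)
        return -1;
    return 0;
}
```

Because the directory name is a compile-time constant, the three-way branch
on the GUC value that the hunk below removes has no remaining cases to
distinguish.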

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index bdc9026c62..2885540362 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1146,8 +1146,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 7ed2b3884c..0c251c8ac6 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7120,25 +7120,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 1c19e863d2..2f04bb68bb 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -122,8 +122,7 @@ Item
 
 <row>
  <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
+ <entry>Subdirectory containing ephemeral files for extensions</entry>
 </row>
 
 <row>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8c65df9bfd..fbb6287fa5 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -95,15 +95,12 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
+/*
+ * This used to be a GUC variable and is no longer used in this file.  It is
+ * kept, fixed at the default value, only for backward compatibility with
+ * extensions.
  */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+char       *pgstat_stat_directory = PG_STAT_TMP_DIR;
 
 /*
  * Shared stats bootstrap information, protected by StatsLock.
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index a2e28b064c..7b7d87b938 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -254,7 +254,6 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
     backup_total = 0;
@@ -273,8 +272,6 @@ perform_base_backup(basebackup_options *opt)
                                      backup_total);
     }
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -306,13 +303,9 @@ perform_base_backup(basebackup_options *opt)
          * Calculate the relative path of temporary statistics directory in
          * order to skip the files which are located in that directory later.
          */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
+
+        Assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
+        statrelpath = psprintf("./%s", PG_STAT_TMP_DIR);
 
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 79bc7ac8ca..f94cec4677 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -197,7 +197,6 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -4251,17 +4250,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11538,35 +11526,6 @@ check_maintenance_io_concurrency(int *newval, void **extra, GucSource source)
     return true;
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e9f8ca775d..4fd040b9c7 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -572,7 +572,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 69bd794806..263f9ace1f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -32,7 +32,10 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
+/*
+ * This used to be the directory to store temporary statistics data in but is
+ * no longer used. Defined here for backward compatibility.
+ */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
 
 /* Values for track_functions GUC variable --- order is significant! */
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 9575268bd7..f3340f726c 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -455,10 +455,6 @@ sub init
     print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
       if defined $ENV{TEMP_CONFIG};
 
-    # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
-    # concurrently must not share a stats_temp_directory.
-    print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
     if ($params{allows_streaming})
     {
         if ($params{allows_streaming} eq "logical")
-- 
2.18.2


Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
This conflicted with 616ae3d2b0, so I rebased.

Also fixed one broken comment style.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From ff3dba49721871c72d681a6227e1f99fa59bbb1e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Mon, 16 Mar 2020 17:15:35 +0900
Subject: [PATCH v29 1/7] Use standard crash handler in archiver.

Commit 8e19a82640 changed the SIGQUIT handler of almost all processes
so that it does not run atexit callbacks, for safety. The archiver
process should behave the same way for the same reason. The exit status
changes from 1 to 2, but that makes no behavioral difference.
---
 src/backend/postmaster/pgarch.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 01ffd6513c..37be0e2bbb 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -96,7 +96,6 @@ static pid_t pgarch_forkexec(void);
 #endif
 
 NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_exit(SIGNAL_ARGS);
 static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
@@ -229,7 +228,7 @@ PgArchiverMain(int argc, char *argv[])
     pqsignal(SIGHUP, SignalHandlerForConfigReload);
     pqsignal(SIGINT, SIG_IGN);
     pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
-    pqsignal(SIGQUIT, pgarch_exit);
+    pqsignal(SIGQUIT, SignalHandlerForCrashExit);
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
     pqsignal(SIGUSR1, pgarch_waken);
@@ -246,14 +245,6 @@ PgArchiverMain(int argc, char *argv[])
     exit(0);
 }
 
-/* SIGQUIT signal handler for archiver process */
-static void
-pgarch_exit(SIGNAL_ARGS)
-{
-    /* SIGQUIT means curl up and die ... */
-    exit(1);
-}
-
 /* SIGUSR1 signal handler for archiver process */
 static void
 pgarch_waken(SIGNAL_ARGS)
-- 
2.18.2
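
The difference this patch relies on can be reproduced outside PostgreSQL. The following is only a sketch under stated assumptions: it does not call SignalHandlerForCrashExit, and it assumes that handler ends in _exit(2), which is what the exit-status change in the commit message suggests. The names child_exit_demo and on_exit_cb are hypothetical. A forked child registers an atexit callback and then leaves either through exit(1), like the removed pgarch_exit, or through _exit(2); the parent reports the exit status and whether the callback ran.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write end of the pipe the child uses to report that its callback ran. */
static int cb_fd = -1;

/* atexit callback: leaves one byte in the pipe as evidence that it ran. */
static void
on_exit_cb(void)
{
    if (write(cb_fd, "x", 1) != 1)
    {
        /* nothing useful to do in this sketch */
    }
}

/*
 * Fork a child that exits via exit(1) (crash_style == 0, like the removed
 * pgarch_exit) or via _exit(2) (crash_style != 0, assumed to match the
 * standard crash handler).  Returns the child's exit status, or -1 on error;
 * *callback_ran is set to 1 if the atexit callback ran in the child.
 */
static int
child_exit_demo(int crash_style, int *callback_ran)
{
    int     pipefd[2];
    int     status;
    char    buf;
    pid_t   pid;

    assert(callback_ran != NULL);

    if (pipe(pipefd) != 0)
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0)
    {
        /* child */
        close(pipefd[0]);
        cb_fd = pipefd[1];
        atexit(on_exit_cb);
        if (crash_style)
            _exit(2);           /* skips atexit callbacks */
        exit(1);                /* runs atexit callbacks */
    }

    /* parent: read returns 0 at EOF if the child never wrote */
    close(pipefd[1]);
    *callback_ran = (read(pipefd[0], &buf, 1) == 1);
    close(pipefd[0]);

    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling child_exit_demo(0, &ran) is expected to report status 1 with the callback run, and child_exit_demo(1, &ran) status 2 with the callback skipped, which is the only observable change the commit message describes.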

From 7645095e932ae2c6ef9d62d93ce76cbda34b57f4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:03 +0900
Subject: [PATCH v29 2/7] sequential scan for dshash

dshash did not allow scanning all entries sequentially. This patch adds
that functionality. The interface is similar to, but a bit different
from, both that of dynahash and the simple dshash search functions. One
of the most significant differences is that the sequential scan
interface of dshash always needs a call to dshash_seq_term when the scan
ends. Another is locking. dshash holds the partition lock when returning
an entry; dshash_seq_next() also holds the lock when returning an entry,
but callers must not release it, since the lock is essential to continue
the scan. The seqscan interface allows entry deletion during a scan.
In-scan deletion should be performed by dshash_delete_current().
---
 src/backend/lib/dshash.c | 150 ++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  21 ++++++
 2 files changed, 170 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 78ccf03217..fb7e23c4cb 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -127,6 +127,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +157,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -324,7 +332,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -592,6 +600,146 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should always be called when a scan finishes.
+ * The caller may delete returned elements in the midst of a scan by using
+ * dshash_delete_current(). exclusive must be true to delete elements.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool exclusive)
+{
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->exclusive = exclusive;
+}
+
+/*
+ * Returns the next element.
+ *
+ * Returned elements are locked and the caller must not release the lock
+ * itself; a later dshash_seq_next() or dshash_seq_term() releases it.
+ */
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert(status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+        LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                      status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        status->curpartition = partition;
+
+        /* resize doesn't happen from now until seq scan ends */
+        status->nbuckets =
+            NUM_BUCKETS(status->hash_table->control->size_log2);
+        ensure_valid_bucket_pointers(status->hash_table);
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        int next_partition;
+
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            return NULL;
+        }
+
+        /* Also move partition lock if needed */
+        next_partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+
+        /* Move lock along with partition for the bucket */
+        if (status->curpartition != next_partition)
+        {
+            /*
+             * Lock the next partition, then release the current one, not in
+             * the reverse order, to avoid concurrent resizing. Partitions are
+             * locked in the same order as in resize(), so deadlocks won't
+             * happen.
+             */
+            LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                         next_partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                         status->curpartition));
+            status->curpartition = next_partition;
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * The caller may delete the item. Store the next item in case of deletion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+/*
+ * Terminates the seqscan and releases all locks.
+ *
+ * Should always be called when finishing or exiting a seqscan.
+ */
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+
+    if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+/* Remove the current entry during a seq scan. */
+void
+dshash_delete_current(dshash_seq_status *status)
+{
+    dshash_table       *hash_table    = status->hash_table;
+    dshash_table_item  *item        = status->curitem;
+    size_t                partition    = PARTITION_FOR_HASH(item->hash);
+
+    Assert(status->exclusive);
+    Assert(hash_table->control->magic == DSHASH_MAGIC);
+    Assert(hash_table->find_locked);
+    Assert(hash_table->find_exclusively_locked);
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition),
+                                LW_EXCLUSIVE));
+
+    delete_item(hash_table, item);
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b86df68e77..81a929b8d9 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,21 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state. The detail is exposed to let users know the storage
+ * size but it should be considered as an opaque type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;
+    int                    curbucket;
+    int                    nbuckets;
+    dshash_table_item  *curitem;
+    dsa_pointer            pnextitem;
+    int                    curpartition;
+    bool                exclusive;
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
                                    const dshash_parameters *params,
@@ -80,6 +95,12 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
+extern void dshash_delete_current(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.18.2
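
The deletion-safe part of the scan above can be shown without any shared memory. This is a minimal single-process sketch, not the dshash code: all toy_* names are hypothetical, partition locks are left out entirely, and a plain malloc'ed chained table stands in for the DSA-backed one. What it keeps is the technique dshash_seq_next() uses: the pointer to the next item is saved before the current item is handed to the caller, so the caller can delete the current item and the scan still continues.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_NBUCKETS 4

typedef struct toy_item
{
    struct toy_item *next;
    int         key;
} toy_item;

typedef struct toy_table
{
    toy_item   *buckets[TOY_NBUCKETS];
} toy_table;

/* Scan state; plays the role of dshash_seq_status, without lock fields. */
typedef struct toy_seq_status
{
    toy_table  *table;
    int         curbucket;
    int         started;
    toy_item   *curitem;
    toy_item   *nextitem;       /* saved before curitem is returned */
} toy_seq_status;

static void
toy_insert(toy_table *t, int key)
{
    toy_item   *item = malloc(sizeof(toy_item));
    int         b = (int) ((unsigned int) key % TOY_NBUCKETS);

    assert(item != NULL);
    item->key = key;
    item->next = t->buckets[b];
    t->buckets[b] = item;
}

static void
toy_seq_init(toy_seq_status *s, toy_table *t)
{
    s->table = t;
    s->curbucket = 0;
    s->started = 0;
    s->curitem = NULL;
    s->nextitem = NULL;
}

/* Return the next item, or NULL when every bucket has been visited. */
static toy_item *
toy_seq_next(toy_seq_status *s)
{
    toy_item   *next = s->started ? s->nextitem : s->table->buckets[0];

    s->started = 1;

    /* Move to the next bucket once the current chain is exhausted. */
    while (next == NULL)
    {
        if (++s->curbucket >= TOY_NBUCKETS)
            return NULL;
        next = s->table->buckets[s->curbucket];
    }

    s->curitem = next;
    /* The caller may delete curitem, so remember its successor now. */
    s->nextitem = next->next;
    return next;
}

/* Unlink and free the item most recently returned by toy_seq_next(). */
static void
toy_delete_current(toy_seq_status *s)
{
    toy_item  **pp = &s->table->buckets[s->curbucket];

    assert(s->curitem != NULL);
    while (*pp != s->curitem)
        pp = &(*pp)->next;
    *pp = s->curitem->next;
    free(s->curitem);
    s->curitem = NULL;
}

/* Visit every item, optionally deleting odd keys; returns items visited. */
static int
toy_scan(toy_table *t, int delete_odd)
{
    toy_seq_status s;
    toy_item   *item;
    int         visited = 0;

    toy_seq_init(&s, t);
    while ((item = toy_seq_next(&s)) != NULL)
    {
        visited++;
        if (delete_odd && item->key % 2 != 0)
            toy_delete_current(&s);
    }
    return visited;
}
```

In the real patch the same loop additionally needs dshash_seq_term() after the last dshash_seq_next() call, because the scan keeps a partition lock held between calls; the toy has no locks and therefore no terminate step.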

From e9b95aceddf8534b886bff62ee08303c35d7e51b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:57 +0900
Subject: [PATCH v29 3/7] Add conditional lock feature to dshash

dshash currently waits for locks unconditionally, which is inconvenient
when we want to avoid being blocked by other processes. This commit
adds dshash_find_extended(), an alternative to dshash_find and
dshash_find_or_insert that can return immediately on lock failure.
---
 src/backend/lib/dshash.c | 99 +++++++++++++++++++++-------------------
 src/include/lib/dshash.h |  3 ++
 2 files changed, 56 insertions(+), 46 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index fb7e23c4cb..b4dc8e1ece 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -383,6 +383,10 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  * the caller must take care to ensure that the entry is not left corrupted.
  * The lock mode is either shared or exclusive depending on 'exclusive'.
  *
+ * If found is not NULL, *found is set to true if the key is found in the hash
+ * table. If the key is not found, *found is set to false and a pointer to a
+ * newly created entry is returned.
+ *
  * The caller must not lock a lock already.
  *
  * Note that the lock held is in fact an LWLock, so interrupts will be held on
@@ -392,36 +396,7 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
 {
-    dshash_hash hash;
-    size_t        partition;
-    dshash_table_item *item;
-
-    hash = hash_key(hash_table, key);
-    partition = PARTITION_FOR_HASH(hash);
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
-
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
-    ensure_valid_bucket_pointers(hash_table);
-
-    /* Search the active bucket. */
-    item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
-
-    if (!item)
-    {
-        /* Not found. */
-        LWLockRelease(PARTITION_LOCK(hash_table, partition));
-        return NULL;
-    }
-    else
-    {
-        /* The caller will free the lock by calling dshash_release_lock. */
-        hash_table->find_locked = true;
-        hash_table->find_exclusively_locked = exclusive;
-        return ENTRY_FROM_ITEM(item);
-    }
+    return dshash_find_extended(hash_table, key, exclusive, false, false, NULL);
 }
 
 /*
@@ -439,31 +414,61 @@ dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
 {
-    dshash_hash hash;
-    size_t        partition_index;
-    dshash_partition *partition;
+    return dshash_find_extended(hash_table, key, true, false, true, found);
+}
+
+
+/*
+ * Find the key in the hash table.
+ *
+ * "insert" indicates insert mode. In this mode, if the key is not found, a new
+ * entry is inserted and *found is set to false; *found is set to true if the
+ * key is found. "found" must be non-null in this mode.
+ *
+ * If nowait is true, the function returns NULL immediately if the required
+ * lock could not be acquired.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool insert, bool *found)
+{
+    dshash_hash hash = hash_key(hash_table, key);
+    size_t        partidx = PARTITION_FOR_HASH(hash);
+    dshash_partition *partition = &hash_table->control->partitions[partidx];
+    LWLockMode  lockmode = exclusive ? LW_EXCLUSIVE : LW_SHARED;
     dshash_table_item *item;
 
-    hash = hash_key(hash_table, key);
-    partition_index = PARTITION_FOR_HASH(hash);
-    partition = &hash_table->control->partitions[partition_index];
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
+    /* must be exclusive when insert allowed */
+    Assert(!insert || (exclusive && found != NULL));
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (!nowait)
+        LWLockAcquire(PARTITION_LOCK(hash_table, partidx), lockmode);
+    else if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partidx),
+                                       lockmode))
+        return NULL;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
     item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
 
     if (item)
-        *found = true;
+    {
+        if (found)
+            *found = true;
+    }
     else
     {
-        *found = false;
+        if (found)
+            *found = false;
+
+        if (!insert)
+        {
+            /* The caller didn't ask to add a new entry. */
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+            return NULL;
+        }
 
         /* Check if we are getting too full. */
         if (partition->count > MAX_COUNT_PER_PARTITION(hash_table))
@@ -479,7 +484,8 @@ restart:
              * Give up our existing lock first, because resizing needs to
              * reacquire all the locks in the right order to avoid deadlocks.
              */
-            LWLockRelease(PARTITION_LOCK(hash_table, partition_index));
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+
             resize(hash_table, hash_table->size_log2 + 1);
 
             goto restart;
@@ -493,12 +499,13 @@ restart:
         ++partition->count;
     }
 
-    /* The caller must release the lock with dshash_release_lock. */
+    /* The caller will free the lock by calling dshash_release_lock. */
     hash_table->find_locked = true;
-    hash_table->find_exclusively_locked = true;
+    hash_table->find_exclusively_locked = exclusive;
     return ENTRY_FROM_ITEM(item);
 }
 
+
 /*
  * Remove an entry by key.  Returns true if the key was found and the
  * corresponding entry was removed.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 81a929b8d9..80a896a99b 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -91,6 +91,9 @@ extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                                    const void *key, bool *found);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+                                  bool exclusive, bool nowait, bool insert,
+                                  bool *found);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.18.2
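
The control flow of the nowait path can be sketched in isolation. This is an illustration under loud assumptions, not the dshash implementation: a C11 atomic_flag stands in for the partition LWLock, a test-and-set that fails immediately stands in for LWLockConditionalAcquire(), there is a single lock and a flat array instead of partitions and buckets, and every toy_* name is hypothetical. As in the patch, a successful call returns with the lock held and the caller releases it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_CAPACITY 8

typedef struct toy_entry
{
    int         key;
    int         value;
} toy_entry;

typedef struct toy_table
{
    atomic_flag lock;           /* stand-in for one partition lock */
    toy_entry   entries[TOY_CAPACITY];
    int         nentries;
} toy_table;

static void
toy_init(toy_table *t)
{
    atomic_flag_clear(&t->lock);
    t->nentries = 0;
}

/* Blocking acquire: spin until the flag is free. */
static void
toy_lock(toy_table *t)
{
    while (atomic_flag_test_and_set(&t->lock))
        ;
}

/* Conditional acquire: true if the lock was taken, false if it was busy. */
static bool
toy_trylock(toy_table *t)
{
    return !atomic_flag_test_and_set(&t->lock);
}

static void
toy_release_lock(toy_table *t)
{
    atomic_flag_clear(&t->lock);
}

/*
 * Find "key".  On success the entry is returned with the lock held, and the
 * caller must call toy_release_lock().  Returns NULL with the lock not held
 * if the key is absent and insert is false, if the table is full, or if
 * nowait is true and the lock was busy.
 */
static toy_entry *
toy_find_extended(toy_table *t, int key, bool nowait, bool insert,
                  bool *found)
{
    toy_entry  *e;
    int         i;

    assert(!insert || found != NULL);

    if (!nowait)
        toy_lock(t);
    else if (!toy_trylock(t))
        return NULL;            /* lock busy: give up at once */

    for (i = 0; i < t->nentries; i++)
    {
        if (t->entries[i].key == key)
        {
            if (found)
                *found = true;
            return &t->entries[i];
        }
    }

    if (found)
        *found = false;

    if (!insert || t->nentries >= TOY_CAPACITY)
    {
        toy_release_lock(t);
        return NULL;
    }

    e = &t->entries[t->nentries++];
    e->key = key;
    e->value = 0;
    return e;
}
```

One property carries over from the patch as posted: with nowait set, a NULL result does not by itself say whether the lock was busy or the key was absent, and *found is left untouched when the lock could not be taken. A caller that needs to tell the two cases apart could initialize *found to a known value before the call, though whether the patch intends callers to do so is not stated in the thread.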

From 598907c349d19e5f24da64b7f3120c422b5202bf Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:59:38 +0900
Subject: [PATCH v29 4/7] Make archiver process an auxiliary process

This is a preliminary patch for shared-memory based stats collector.

The archiver process must be an auxiliary process because it uses shared
memory once stats data is moved into shared memory. Make the process
an auxiliary process so that it keeps working.
---
 src/backend/access/transam/xlog.c        |  46 ++++++++++
 src/backend/access/transam/xlogarchive.c |   2 +-
 src/backend/bootstrap/bootstrap.c        |  22 ++---
 src/backend/postmaster/pgarch.c          | 102 ++++-------------------
 src/backend/postmaster/postmaster.c      |  53 ++++++------
 src/include/access/xlog.h                |   2 +
 src/include/access/xlogarchive.h         |   1 +
 src/include/miscadmin.h                  |   2 +
 src/include/postmaster/pgarch.h          |   4 +-
 src/include/storage/pmsignal.h           |   1 -
 10 files changed, 108 insertions(+), 127 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 977d448f50..d75a257da2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -684,6 +684,13 @@ typedef struct XLogCtlData
      */
     Latch        recoveryWakeupLatch;
 
+    /*
+     * archiverWakeupLatch is used to wake up the archiver process to process
+     * completed WAL segments, if it is waiting for WAL to arrive.  Protected
+     * by info_lck.
+     */
+    Latch       *archiverWakeupLatch;
+
     /*
      * During recovery, we keep a copy of the latest checkpoint record here.
      * lastCheckPointRecPtr points to start of checkpoint record and
@@ -8450,6 +8457,45 @@ GetLastSegSwitchData(XLogRecPtr *lastSwitchLSN)
     return result;
 }
 
+/*
+ * XLogArchiveWakeupStart - Set up archiver wakeup stuff
+ */
+void
+XLogArchiveWakeupStart(void)
+{
+    SpinLockAcquire(&XLogCtl->info_lck);
+    Assert(XLogCtl->archiverWakeupLatch == NULL);
+    XLogCtl->archiverWakeupLatch = MyLatch;
+    SpinLockRelease(&XLogCtl->info_lck);
+}
+
+/*
+ * XLogArchiveWakeupEnd - Clean up archiver wakeup stuff
+ */
+void
+XLogArchiveWakeupEnd(void)
+{
+    SpinLockAcquire(&XLogCtl->info_lck);
+    XLogCtl->archiverWakeupLatch = NULL;
+    SpinLockRelease(&XLogCtl->info_lck);
+}
+
+/*
+ * XLogArchiveWakeup - Wake up archiver process
+ */
+void
+XLogArchiveWakeup(void)
+{
+    Latch *latch;
+
+    SpinLockAcquire(&XLogCtl->info_lck);
+    latch = XLogCtl->archiverWakeupLatch;
+    SpinLockRelease(&XLogCtl->info_lck);
+
+    if (latch)
+        SetLatch(latch);
+}
+
 /*
  * This must be called ONCE during postmaster or standalone-backend shutdown
  */
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index d62c12310a..d6c8bcce8b 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -490,7 +490,7 @@ XLogArchiveNotify(const char *xlog)
 
     /* Notify archiver that it's got something to do */
     if (IsUnderPostmaster)
-        SendPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER);
+        XLogArchiveWakeup();
 }
 
 /*
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 5480a024e0..d398ce6f03 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -319,6 +319,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
         case StartupProcess:
             MyBackendType = B_STARTUP;
             break;
+        case ArchiverProcess:
+            MyBackendType = B_ARCHIVER;
+            break;
         case BgWriterProcess:
             MyBackendType = B_BG_WRITER;
             break;
@@ -439,30 +442,29 @@ AuxiliaryProcessMain(int argc, char *argv[])
             proc_exit(1);        /* should never return */
 
         case StartupProcess:
-            /* don't set signals, startup process has its own agenda */
             StartupProcessMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
+
+        case ArchiverProcess:
+            PgArchiverMain();
+            proc_exit(1);
 
         case BgWriterProcess:
-            /* don't set signals, bgwriter has its own agenda */
             BackgroundWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case CheckpointerProcess:
-            /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalWriterProcess:
-            /* don't set signals, walwriter has its own agenda */
             InitXLOGAccess();
             WalWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalReceiverProcess:
-            /* don't set signals, walreceiver has its own agenda */
             WalReceiverMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         default:
             elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 37be0e2bbb..6fe7a136ba 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -48,6 +48,7 @@
 #include "storage/latch.h"
 #include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
+#include "storage/procsignal.h"
 #include "utils/guc.h"
 #include "utils/ps_status.h"
 
@@ -78,7 +79,6 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
@@ -95,8 +95,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
 static void pgarch_ArchiverCopyLoop(void);
@@ -110,75 +108,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -212,14 +141,21 @@ pgarch_forkexec(void)
 #endif                            /* EXEC_BACKEND */
 
 
+/* Clean up notification stuff on exit */
+static void
+PgArchiverKill(int code, Datum arg)
+{
+    XLogArchiveWakeupEnd();
+}
+
 /*
  * PgArchiverMain
  *
  *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
  *    since we don't use 'em, it hardly matters...
  */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -231,7 +167,7 @@ PgArchiverMain(int argc, char *argv[])
     pqsignal(SIGQUIT, SignalHandlerForCrashExit);
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, pgarch_waken);
+    pqsignal(SIGUSR1, procsignal_sigusr1_handler);
     pqsignal(SIGUSR2, pgarch_waken_stop);
     /* Reset some signals that are accepted by postmaster but not here */
     pqsignal(SIGCHLD, SIG_DFL);
@@ -240,24 +176,14 @@ PgArchiverMain(int argc, char *argv[])
     MyBackendType = B_ARCHIVER;
     init_ps_display(NULL);
 
+    XLogArchiveWakeupStart();
+    on_shmem_exit(PgArchiverKill, 0);
+
     pgarch_MainLoop();
 
     exit(0);
 }
 
-/* SIGUSR1 signal handler for archiver process */
-static void
-pgarch_waken(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    /* set flag that there is work to be done */
-    wakened = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
 /* SIGUSR2 signal handler for archiver process */
 static void
 pgarch_waken_stop(SIGNAL_ARGS)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 73d278f3b2..a4b9d212a2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -146,7 +146,8 @@
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
 #define BACKEND_TYPE_BGWORKER    0x0008    /* bgworker process */
-#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
+#define BACKEND_TYPE_ARCHIVER    0x0010    /* archiver process */
+#define BACKEND_TYPE_ALL        0x001F    /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -539,6 +540,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1785,7 +1787,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -3055,7 +3057,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3190,20 +3192,16 @@ reaper(SIGNAL_ARGS)
         }
 
         /*
-         * Was it the archiver?  If so, just try to start a new one; no need
-         * to force reset of the rest of the system.  (If fail, we'll try
-         * again in future cycles of the main loop.).  Unless we were waiting
-         * for it to shut down; don't restart it in that case, and
-         * PostmasterStateMachine() will advance to the next shutdown step.
+         * Was it the archiver?  Normal exit can be ignored; we'll start a new
+         * one at the next iteration of the postmaster's main loop, if
+         * necessary. Any other exit condition is treated as a crash.
          */
         if (pid == PgArchPID)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3451,7 +3449,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3656,6 +3654,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3928,6 +3938,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5208,7 +5219,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5251,16 +5262,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (StartWorkerNeeded || HaveCrashedWorker)
         maybe_start_bgworkers();
 
-    if (CheckPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER) &&
-        PgArchPID != 0)
-    {
-        /*
-         * Send SIGUSR1 to archiver process, to wake it up and begin archiving
-         * next WAL file.
-         */
-        signal_child(PgArchPID, SIGUSR1);
-    }
-
     /* Tell syslogger to rotate logfile if requested */
     if (SysLoggerPID != 0)
     {
@@ -5493,6 +5494,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 9ec7b31cce..c0e149d526 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -313,6 +313,8 @@ extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern void XLogArchiveWakeupStart(void);
+extern void XLogArchiveWakeupEnd(void);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool PromoteIsTriggered(void);
diff --git a/src/include/access/xlogarchive.h b/src/include/access/xlogarchive.h
index 1c67de2ede..54ce0b97d7 100644
--- a/src/include/access/xlogarchive.h
+++ b/src/include/access/xlogarchive.h
@@ -25,6 +25,7 @@ extern void ExecuteRecoveryCommand(const char *command, const char *commandName,
 extern void KeepFileRestoredFromArchive(const char *path, const char *xlogfname);
 extern void XLogArchiveNotify(const char *xlog);
 extern void XLogArchiveNotifySeg(XLogSegNo segno);
+extern void XLogArchiveWakeup(void);
 extern void XLogArchiveForceDone(const char *xlog);
 extern bool XLogArchiveCheckDone(const char *xlog);
 extern bool XLogArchiveIsBusy(const char *xlog);
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14fa127ab1..619b2f9c71 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -417,6 +417,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -429,6 +430,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index b3200874ca..e3ffc63f14 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 56c5ec4481..c691acf8cd 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -34,7 +34,6 @@ typedef enum
 {
     PMSIGNAL_RECOVERY_STARTED,    /* recovery has started */
     PMSIGNAL_BEGIN_HOT_STANDBY, /* begin Hot Standby */
-    PMSIGNAL_WAKEN_ARCHIVER,    /* send a NOTIFY signal to xlog archiver */
     PMSIGNAL_ROTATE_LOGFILE,    /* send SIGUSR1 to syslogger to rotate logfile */
     PMSIGNAL_START_AUTOVAC_LAUNCHER,    /* start an autovacuum launcher */
     PMSIGNAL_START_AUTOVAC_WORKER,    /* start an autovacuum worker */
-- 
2.18.2

From 010887ce9294dea8a4e8b4ea6183dfd4e6abc1e0 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:14 +0900
Subject: [PATCH v29 5/7] Shared-memory based stats collector

Previously, activity statistics were collected via sockets and shared
among backends through files periodically. Such files can reach tens of
megabytes and are created at most every 500ms, and such large data is
serialized by the stats collector and then de-serialized on every
backend periodically. To avoid that large cost, this patch places
activity statistics data in shared memory. Each backend accumulates
statistics numbers locally, then tries to move them into the shared
statistics at every transaction end, but at intervals no shorter than
500ms. Locks on the shared statistics are acquired per object, such as
a table or a function, so the expected chance of collision is not
high. Furthermore, until 1 second has elapsed since the last flush to
shared stats, a lock failure postpones the flush so that lock
contention doesn't slow down transactions.  Finally, the stats flush
waits for locks so that shared statistics don't get stale.
---
 src/backend/access/transam/xlog.c            |    4 +-
 src/backend/catalog/index.c                  |   24 +-
 src/backend/postmaster/autovacuum.c          |   54 +-
 src/backend/postmaster/bgwriter.c            |    2 +-
 src/backend/postmaster/checkpointer.c        |   12 +-
 src/backend/postmaster/pgarch.c              |    4 +-
 src/backend/postmaster/pgstat.c              | 4865 +++++++-----------
 src/backend/postmaster/postmaster.c          |   85 +-
 src/backend/storage/buffer/bufmgr.c          |    8 +-
 src/backend/storage/ipc/ipci.c               |    2 +
 src/backend/storage/lmgr/lwlock.c            |    1 +
 src/backend/storage/lmgr/lwlocknames.txt     |    1 +
 src/backend/tcop/postgres.c                  |   26 +-
 src/backend/utils/adt/pgstatfuncs.c          |   53 +-
 src/backend/utils/init/globals.c             |    1 +
 src/backend/utils/init/postinit.c            |   11 +
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |    4 +-
 src/include/miscadmin.h                      |    2 +
 src/include/pgstat.h                         |  514 +-
 src/include/storage/lwlock.h                 |    1 +
 src/include/utils/timeout.h                  |    1 +
 21 files changed, 2063 insertions(+), 3612 deletions(-)
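
For readers following the flush timing, here is a minimal sketch of the
retry policy as the pgstat.c header comment in this patch describes it: a
250ms initial retry interval, doubled on every retry, with a forced flush
10000ms after the first attempt. The interval values are quoted from the
patch; the helper names next_retry_delay() and must_wait_for_locks() are
hypothetical, and capping the delay at the remaining time is an assumption
of this sketch, not something taken from the patch.

```c
#include <stdbool.h>

/* Values quoted from the pgstat.c header comment; all in milliseconds. */
#define PGSTAT_MIN_INTERVAL        1000   /* minimum time between flushes */
#define PGSTAT_RETRY_MIN_INTERVAL  250    /* first retry delay after a lock failure */
#define PGSTAT_MAX_INTERVAL        10000  /* force a flush after this long */

/*
 * Hypothetical helper: delay before the next flush attempt after a lock
 * failure.  prev_delay is the previous retry delay (0 if this was the
 * first failure); elapsed is the time since the first attempt.
 * Returns 0 when the flush should be forced immediately.
 */
static int
next_retry_delay(int prev_delay, int elapsed)
{
    int delay = (prev_delay <= 0) ? PGSTAT_RETRY_MIN_INTERVAL : prev_delay * 2;
    int remaining = PGSTAT_MAX_INTERVAL - elapsed;

    if (remaining <= 0)
        return 0;               /* deadline reached: flush now */

    /* Assumption of this sketch: never sleep past the forced-flush deadline. */
    return (delay < remaining) ? delay : remaining;
}

/*
 * Hypothetical helper: once the deadline has passed, the flush waits for
 * locks instead of skipping entries it cannot lock immediately.
 */
static bool
must_wait_for_locks(int elapsed)
{
    return elapsed >= PGSTAT_MAX_INTERVAL;
}
```

With these definitions, repeated lock failures would be retried after 250,
500, 1000, 2000 and 4000ms, and the following retry would be shortened so
that the forced flush happens at the 10000ms mark.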

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d75a257da2..99beaf0215 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8594,9 +8594,9 @@ LogCheckpointEnd(bool restartpoint)
                         &sync_secs, &sync_usecs);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time +=
+    BgWriterStats.checkpoint_write_time +=
         write_secs * 1000 + write_usecs / 1000;
-    BgWriterStats.m_checkpoint_sync_time +=
+    BgWriterStats.checkpoint_sync_time +=
         sync_secs * 1000 + sync_usecs / 1000;
 
     /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index bd7ec923e9..31202766e2 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1699,28 +1699,10 @@ index_concurrently_swap(Oid newIndexId, Oid oldIndexId, const char *oldName)
 
     /*
      * Copy over statistics from old to new index
+     * The data will be sent by the next pgstat_report_stat()
+     * call.
      */
-    {
-        PgStat_StatTabEntry *tabentry;
-
-        tabentry = pgstat_fetch_stat_tabentry(oldIndexId);
-        if (tabentry)
-        {
-            if (newClassRel->pgstat_info)
-            {
-                newClassRel->pgstat_info->t_counts.t_numscans = tabentry->numscans;
-                newClassRel->pgstat_info->t_counts.t_tuples_returned = tabentry->tuples_returned;
-                newClassRel->pgstat_info->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_hit = tabentry->blocks_hit;
-
-                /*
-                 * The data will be sent by the next pgstat_report_stat()
-                 * call.
-                 */
-            }
-        }
-    }
+    pgstat_copy_index_counters(oldIndexId, newClassRel->pgstat_info);
 
     /* Close relations */
     table_close(pg_class, RowExclusiveLock);
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 7e97ffab27..800693af4a 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -338,9 +338,6 @@ static void autovacuum_do_vac_analyze(autovac_table *tab,
                                       BufferAccessStrategy bstrategy);
 static AutoVacOpts *extract_autovac_opts(HeapTuple tup,
                                          TupleDesc pg_class_desc);
-static PgStat_StatTabEntry *get_pgstat_tabentry_relid(Oid relid, bool isshared,
-                                                      PgStat_StatDBEntry *shared,
-                                                      PgStat_StatDBEntry *dbentry);
 static void perform_work_item(AutoVacuumWorkItem *workitem);
 static void autovac_report_activity(autovac_table *tab);
 static void autovac_report_workitem(AutoVacuumWorkItem *workitem,
@@ -1938,8 +1935,6 @@ do_autovacuum(void)
     HASHCTL        ctl;
     HTAB       *table_toast_map;
     ListCell   *volatile cell;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     BufferAccessStrategy bstrategy;
     ScanKeyData key;
     TupleDesc    pg_class_desc;
@@ -1958,12 +1953,6 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /*
-     * may be NULL if we couldn't find an entry (only happens if we are
-     * forcing a vacuum for anti-wrap purposes).
-     */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
 
@@ -2011,9 +2000,6 @@ do_autovacuum(void)
     /* StartTransactionCommand changed elsewhere */
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-
     classRel = table_open(RelationRelationId, AccessShareLock);
 
     /* create a copy so we can use it after closing pg_class */
@@ -2092,8 +2078,8 @@ do_autovacuum(void)
 
         /* Fetch reloptions and the pgstat entry for this table */
         relopts = extract_autovac_opts(tuple, pg_class_desc);
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                       relid);
 
         /* Check if it needs vacuum or analyze */
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
@@ -2176,8 +2162,8 @@ do_autovacuum(void)
         }
 
         /* Fetch the pgstat entry for this table */
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                       relid);
 
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
@@ -2736,29 +2722,6 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
     return av;
 }
 
-/*
- * get_pgstat_tabentry_relid
- *
- * Fetch the pgstat entry of a table, either local to a database or shared.
- */
-static PgStat_StatTabEntry *
-get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
-                          PgStat_StatDBEntry *dbentry)
-{
-    PgStat_StatTabEntry *tabentry = NULL;
-
-    if (isshared)
-    {
-        if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
-    }
-    else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
-
-    return tabentry;
-}
 
 /*
  * table_recheck_autovac
@@ -2779,17 +2742,12 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     bool        doanalyze;
     autovac_table *tab = NULL;
     PgStat_StatTabEntry *tabentry;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     bool        wraparound;
     AutoVacOpts *avopts;
 
     /* use fresh stats */
     autovac_refresh_stats();
 
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* fetch the relation's relcache entry */
     classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
     if (!HeapTupleIsValid(classTup))
@@ -2813,8 +2771,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     }
 
     /* fetch the pgstat table entry */
-    tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                         shared, dbentry);
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                   relid);
 
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 069e27e427..94bdd664b5 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -236,7 +236,7 @@ BackgroundWriterMain(void)
         /*
          * Send off activity statistics to the stats collector
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index e354a78725..8a2fd0ddb2 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -350,7 +350,7 @@ CheckpointerMain(void)
         if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
         {
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            BgWriterStats.requested_checkpoints++;
         }
 
         /*
@@ -364,7 +364,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                BgWriterStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -492,7 +492,7 @@ CheckpointerMain(void)
          * worth the trouble to split the stats support into two independent
          * stats message types.)
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         /*
          * Sleep until we are signaled or it's time for another checkpoint or
@@ -693,7 +693,7 @@ CheckpointWriteDelay(int flags, double progress)
         /*
          * Report interim activity statistics to the stats collector.
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1238,8 +1238,8 @@ AbsorbSyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    BgWriterStats.buf_written_backend += CheckpointerShmem->num_backend_writes;
+    BgWriterStats.buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 6fe7a136ba..f0b524ca50 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -401,7 +401,7 @@ pgarch_ArchiverCopyLoop(void)
                  * Tell the collector about the WAL file that we successfully
                  * archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_report_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
@@ -411,7 +411,7 @@ pgarch_ArchiverCopyLoop(void)
                  * Tell the collector about the WAL file that we failed to
                  * archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_report_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ab42df7e1b..7679b20833 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,22 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Activity Statistics facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects activity statistics, e.g. per-table access statistics, of
+ *  all backends in shared memory. The activity numbers are first stored
+ *  locally in each process, then flushed to shared memory at commit
+ *  time or by idle-timeout.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ * To avoid congestion on the shared memory, shared stats are updated no more
+ * often than once per PGSTAT_MIN_INTERVAL (1000ms). If some local numbers
+ * remain unflushed due to lock failure, we retry at intervals that start at
+ * PGSTAT_RETRY_MIN_INTERVAL (250ms) and double at every retry. Finally, we
+ * force an update PGSTAT_MAX_INTERVAL (10000ms) after the first attempt.
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the activity statistics facility creates the
+ *  area, then loads the stored stats file if any. At shutdown, the last
+ *  process writes the shared stats to the file, then destroys the area.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -19,18 +26,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -40,68 +35,43 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/fork_process.h"
 #include "postmaster/interrupt.h"
 #include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
-#include "utils/timestamp.h"
 
 /* ----------
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
+#define PGSTAT_MIN_INTERVAL            1000    /* Minimum interval of stats data
+                                             * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_RETRY_MIN_INTERVAL    250 /* Initial retry interval after
+                                         * PGSTAT_MIN_INTERVAL */
 
+#define PGSTAT_MAX_INTERVAL            10000    /* Longest interval of stats data
+                                             * updates */
 
 /* ----------
- * The initial size hints for the hash tables used in the collector.
+ * The initial size hints for the hash tables used in the activity statistics.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
-#define PGSTAT_TAB_HASH_SIZE    512
+#define PGSTAT_TABLE_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
 
@@ -116,7 +86,6 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
-
 /* ----------
  * GUC parameters
  * ----------
@@ -131,75 +100,163 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used, but will be removed with GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
 /*
- * BgWriter global statistics counters (unused in other processes).
- * Stored directly in a stats message structure so it can be sent
- * without needing to copy things around.  We assume this inits to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
-
-/* ----------
- * Local data
- * ----------
+ * Shared stats bootstrap information, protected by StatsLock.
+ *
+ * refcount is used to know whether a process going to detach shared stats is
+ * the last process or not. The last process writes out the stats files.
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
+typedef struct StatsShmemStruct
+{
+    dsa_handle    stats_dsa_handle;    /* handle for stats data area */
+    dshash_table_handle hash_handle;    /* shared dbstat hash */
+    dsa_pointer global_stats;    /* DSA pointer to global stats */
+    dsa_pointer archiver_stats; /* Ditto for archiver stats */
+    int            refcount;        /* # of processes attached to the shared
+                                 * stats memory */
+}            StatsShmemStruct;
 
-static time_t last_pgstat_start_time;
+/* BgWriter global statistics counters */
+PgStat_BgWriter BgWriterStats = {0};
 
-static bool pgStatRunningInCollector = false;
+/* backend-lifetime storage */
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
 
 /*
- * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
+ * Enums and types to define the shared statistics structure.
+ *
+ * The statistics entry for each object is stored in individually DSA-allocated
+ * memory. Every entry is pointed to from the dshash pgStatSharedHash via a
+ * dsa_pointer. This structure keeps object-stats entries from being moved by
+ * dshash resizing, and allows the dshash to release its lock sooner on stats
+ * updates. It also reduces interference among write-locks on each stat entry
+ * by not relying on the partition lock of dshash. PgStatLocalHashEntry is the
+ * local-stats equivalent of PgStatHashEntry for shared stat entries.
+ *
+ * Each stat entry is enveloped by the type PgStatEnvelope, which stores the
+ * common attributes of all kinds of statistics and an LWLock lock object.
  *
- * NOTE: once allocated, TabStatusArray structures are never moved or deleted
- * for the life of the backend.  Also, we zero out the t_id fields of the
- * contained PgStat_TableStatus structs whenever they are not actively in use.
- * This allows relcache pgstat_info pointers to be treated as long-lived data,
- * avoiding repeated searches in pgstat_initstats() when a relation is
- * repeatedly opened during a transaction.
+ * Shared stats are stored as:
+ *
+ * dshash pgStatSharedHash
+ *    -> PgStatHashEntry                (dshash entry)
+ *      (dsa_pointer)-> PgStatEnvelope    (dsa memory block)
+ *
+ * Local stats are stored as:
+ *
+ * dynahash pgStatLocalHash
+ *    -> PgStatLocalHashEntry            (dynahash entry)
+ *      (direct pointer)-> PgStatEnvelope (palloc'ed memory)
  */
-#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
 
-typedef struct TabStatusArray
+/* The types of statistics entries. */
+typedef enum PgStatTypes
 {
-    struct TabStatusArray *tsa_next;    /* link to next array, if any */
-    int            tsa_used;        /* # entries currently used */
-    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
-} TabStatusArray;
-
-static TabStatusArray *pgStatTabList = NULL;
+    PGSTAT_TYPE_ALL,            /* Not a type, for the parameters of
+                                 * pgstat_collect_stat_entries */
+    PGSTAT_TYPE_DB,                /* database-wide statistics */
+    PGSTAT_TYPE_TABLE,            /* per-table statistics */
+    PGSTAT_TYPE_FUNCTION,        /* per-function statistics */
+}            PgStatTypes;
 
 /*
- * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
+ * entry size lookup table of shared statistics entries corresponding to
+ * PgStatTypes
  */
-typedef struct TabStatHashEntry
+static size_t pgstat_entsize[] =
 {
-    Oid            t_id;
-    PgStat_TableStatus *tsa_entry;
-} TabStatHashEntry;
+    0,                            /* PGSTAT_TYPE_ALL: not an entry */
+    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_StatTabEntry),    /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_StatFuncEntry)    /* PGSTAT_TYPE_FUNCTION */
+};
 
-/*
- * Hash table for O(1) t_id -> tsa_entry lookup
- */
-static HTAB *pgStatTabHash = NULL;
+/* Ditto for local statistics entries */
+static size_t pgstat_localentsize[] =
+{
+    0,                            /* PGSTAT_TYPE_ALL: not an entry */
+    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_TableStatus), /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_BackendFunctionEntry) /* PGSTAT_TYPE_FUNCTION */
+};
+
+/* struct for shared statistics hash entry key. */
+typedef struct PgStatHashEntryKey
+{
+    PgStatTypes type;            /* statistics entry type */
+    Oid            databaseid;        /* database ID. InvalidOid for shared objects. */
+    Oid            objectid;        /* object OID */
+}            PgStatHashEntryKey;
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * Shared statistics hash entry.  Stats numbers waiting to be flushed out to
+ * shared stats are held in pgStatLocalHash instead.
  */
-static HTAB *pgStatFunctions = NULL;
+typedef struct PgStatHashEntry
+{
+    PgStatHashEntryKey key;        /* hash key */
+    dsa_pointer env;            /* pointer to shared stats envelope in DSA */
+}            PgStatHashEntry;
+
+/* struct for shared statistics entry pointed from shared hash entry. */
+typedef struct PgStatEnvelope
+{
+    PgStatTypes type;            /* statistics entry type */
+    Oid            databaseid;        /* databaseid */
+    Oid            objectid;        /* objectid */
+    size_t        len;            /* length of body, fixed per type. */
+    LWLock        lock;            /* lightweight lock to protect body */
+    int            body[FLEXIBLE_ARRAY_MEMBER];    /* statistics body */
+}            PgStatEnvelope;
+
+#define PgStatEnvelopeSize(bodylen) \
+    (offsetof(PgStatEnvelope, body) + (bodylen))
+
+/* struct for shared statistics local hash entry. */
+typedef struct PgStatLocalHashEntry
+{
+    PgStatHashEntryKey key;        /* hash key */
+    PgStatEnvelope *env;        /* pointer to stats envelope in heap */
+}            PgStatLocalHashEntry;
 
 /*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
+ * A snapshot is a stats entry that is copied locally to offer stable values
+ * for a transaction.
  */
-static bool have_function_stats = false;
+typedef struct PgStatSnapshot
+{
+    PgStatHashEntryKey key;
+    bool        negative;
+    int            body[FLEXIBLE_ARRAY_MEMBER];    /* statistics body */
+}            PgStatSnapshot;
+
+#define PgStatSnapshotSize(bodylen)                \
+    (offsetof(PgStatSnapshot, body) + (bodylen))
+
+/* parameter for shared hashes */
+static const dshash_parameters dsh_rootparams = {
+    sizeof(PgStatHashEntryKey),
+    sizeof(PgStatHashEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+
+/* The shared hash to index activity stats entries. */
+static dshash_table *pgStatSharedHash = NULL;
+
+/* Local stats numbers are stored here */
+static HTAB *pgStatLocalHash = NULL;
+
+#define HAVE_ANY_PENDING_STATS()                        \
+    (pgStatLocalHash != NULL ||                            \
+     pgStatXactCommit != 0 || pgStatXactRollback != 0)
 
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
@@ -236,11 +293,10 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
+static MemoryContext pgStatSnapshotContext = NULL;
+static HTAB *pgStatSnapshotHash = NULL;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -249,19 +305,17 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Cluster wide statistics.
+ *
+ * Contains statistics that are collected neither per database nor per
+ * table.  shared_* point to shared memory and snapshot_* are backend
+ * snapshots.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
+static bool global_snapshot_is_valid = false;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats snapshot_globalStats;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -270,523 +324,269 @@ static List *pending_write_requests = NIL;
  */
 static instr_time total_func_time;
 
+/*
+ * Newly created shared stats entries need to be initialized before other
+ * processes can access them. get_stat_entry() calls this for that purpose.
+ */
+typedef void (*entry_initializer) (PgStatEnvelope * env);
 
 /* ----------
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgstat_beshutdown_hook(int code, Datum arg);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                                                 Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStatEnvelope * get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                                       bool nowait,
+                                       entry_initializer initfunc, bool *found);
+static PgStatEnvelope * *collect_stat_entries(PgStatTypes type, Oid dbid);
+static void create_missing_dbentries(void);
+static void pgstat_write_database_stats(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_db_statsfile(PgStat_StatDBEntry *dbentry);
+
+static bool flush_tabstat(PgStatEnvelope * env, bool nowait);
+static bool flush_funcstat(PgStatEnvelope * env, bool nowait);
+static bool flush_dbstat(PgStatEnvelope * env, bool nowait);
+
+static void init_dbentry(PgStatEnvelope * env);
+static void init_funcentry(PgStatEnvelope * env);
+static void init_tabentry(PgStatEnvelope * env);
+
+static bool delete_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                              bool nowait);
+static PgStatEnvelope * get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                                             bool create, bool *found);
+static PgStat_StatDBEntry *get_local_dbstat_entry(Oid dbid);
+static PgStat_TableStatus *get_local_tabstat_entry(Oid rel_id, bool isshared);
+
+static PgStat_SubXactStatus *get_tabstat_stack_level(int nest_level);
+static void add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level);
+static void pgstat_snapshot_global_stats(void);
+
 static void pgstat_read_current_status(void);
-
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
-
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
-static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
-
-static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
-
-static void pgstat_setup_memcxt(void);
-
 static const char *pgstat_get_wait_activity(WaitEventActivity w);
 static const char *pgstat_get_wait_client(WaitEventClient w);
 static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute shared memory space needed for activity statistics
+ */
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+/*
+ * StatsShmemInit - initialize during shared-memory creation
  */
 void
-pgstat_init(void)
+StatsShmemInit(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
+    bool        found;
 
-#define TESTBYTEVAL ((char) 199)
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
 
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
-
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
+    if (!IsUnderPostmaster)
     {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
     }
+}
 
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ *    Create pgStatLocalContext and pgStatSnapshotContext, if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+    if (!pgStatLocalContext)
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+
+    if (!pgStatSnapshotContext)
+        pgStatSnapshotContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Database statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+}
 
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
 
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
+/* ----------
+ * attach_shared_stats() -
+ *
+ *    Attach or create shared stats memory. If we are the first process to use
+ *    the activity stats system, read the saved statistics files, if any.
+ * ----------
+ */
+static void
+attach_shared_stats(void)
+{
+    MemoryContext oldcontext;
 
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    /*
+     * Don't use dsm outside a postmaster child, or when not tracking counts.
+     */
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
 
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    pgstat_setup_memcxt();
 
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    if (area)
+        return;
 
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
 
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
+    /*
+     * The last process is responsible for writing out stats files at exit.
+     * Maintain refcount so that an exiting process can tell whether it is
+     * the last one.
+     */
+    if (StatsShmem->refcount > 0)
+        StatsShmem->refcount++;
+    else
+    {
+        /* We're the first process to attach the shared stats memory */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
 
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        /* Initialize shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatSharedHash = dshash_create(area, &dsh_rootparams, 0);
 
-        test_byte++;            /* just make sure variable is changed */
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->global_stats =
+            dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+        StatsShmem->archiver_stats =
+            dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+        StatsShmem->hash_handle = dshash_get_hash_table_handle(pgStatSharedHash);
 
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
 
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        /* Load saved data if any. */
+        pgstat_read_statsfiles();
 
-        /* If we get here, we have a working socket */
-        break;
+        StatsShmem->refcount = 1;
     }
 
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
+    LWLockRelease(StatsLock);
 
     /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
+     * If we're not the first process, attach existing shared stats area
+     * outside the StatsLock section.
      */
-    if (!pg_set_noblock(pgStatSock))
+    if (!area)
     {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
+        /* Attach shared area. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatSharedHash = dshash_attach(area, &dsh_rootparams,
+                                         StatsShmem->hash_handle, 0);
+
+        /* Setup local variables */
+        pgStatLocalHash = NULL;
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
     }
 
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
-    }
-
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    /* Now that we have a long-lived socket, tell fd.c about it. */
-    ReserveExternalFD();
+    MemoryContextSwitchTo(oldcontext);
 
-    return;
-
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
-
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
-
-    /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
-     */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    /* don't detach automatically */
+    dsa_pin_mapping(area);
+    global_snapshot_is_valid = false;
 }
 
-/*
- * subroutine for pgstat_reset_all
+/* ----------
+ * detach_shared_stats() -
+ *
+ *    Detach shared stats. Write out to file if we're the last process and told
+ *    to do so.
+ * ----------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+detach_shared_stats(bool write_stats)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    /* return immediately if there is nothing to detach */
+    if (!area || !IsUnderPostmaster)
+        return;
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
-    {
-        int            nchars;
-        Oid            tmp_oid;
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
+    if (--StatsShmem->refcount < 1)
+    {
         /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
+         * This process is the last one attached to the shared stats memory.
+         * Write out the stats files if requested.
          */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
-
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
+        if (write_stats)
+            pgstat_write_statsfiles();
 
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+        /* No one is using the area. */
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
     }
-    FreeDir(dir);
+
+    LWLockRelease(StatsLock);
+
+    /*
+     * Detach the area. It is automatically destroyed when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);
+
+    area = NULL;
+    pgStatSharedHash = NULL;
+    shared_globalStats = NULL;
+    shared_archiverStats = NULL;
+    pgStatLocalHash = NULL;
+    global_snapshot_is_valid = false;
 }
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
-}
+    /* standalone server doesn't use shared stats */
+    if (!IsUnderPostmaster)
+        return;
 
-#ifdef EXEC_BACKEND
+    /* we must have shared stats attached */
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
-    char       *av[10];
-    int            ac = 0;
-
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
-
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
-
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
-
-
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
-
-    /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
-     */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
-
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
+    /* Startup must be the only user of shared stats */
+    Assert(StatsShmem->refcount == 1);
 
     /*
-     * Okay, fork off the collector.
+     * We could directly remove files and recreate the shared memory area, but
+     * for simplicity we just discard it and then create it again.
      */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
+    detach_shared_stats(false); /* Don't write files. */
+    attach_shared_stats();
 }
 
 /* ------------------------------------------------------------
@@ -794,144 +594,488 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *    Updates are applied no more frequently than once per
+ *    PGSTAT_MIN_INTERVAL milliseconds. They are also postponed on lock
+ *    failure if force is false and no update has been pending for longer
+ *    than PGSTAT_MAX_INTERVAL milliseconds. Postponed updates are retried
+ *    in subsequent calls of this function.
+ *
+ *    Returns the time in milliseconds until updates are next applied, if
+ *    no updates have been held for more than PGSTAT_MIN_INTERVAL
+ *    milliseconds.
+ *
+ *    Note that this is called only outside a transaction, so it is fine to use
  *    transaction stop time as an approximation of current time.
- * ----------
+ *    ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz next_flush = 0;
+    static TimestampTz pending_since = 0;
+    static long retry_interval = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
+    bool        nowait = !force;    /* Don't use force after this point */
+    HASH_SEQ_STATUS scan;
+    PgStatLocalHashEntry *lent;
+    PgStatLocalHashEntry **dbentlist;
+    int            dbentlistlen = 8;
+    int            ndbentries = 0;
+    int            remains = 0;
     int            i;
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (area == NULL || !HAVE_ANY_PENDING_STATS())
+        return 0;
+
+    dbentlist = palloc(sizeof(PgStatLocalHashEntry *) * dbentlistlen);
 
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
     now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
 
-    /*
-     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
-     */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
-    pgStatTabHash = NULL;
+    if (nowait)
+    {
+        /*
+         * Don't flush stats too frequently.  Return the time to the next
+         * flush.
+         */
+        if (now < next_flush)
+        {
+            /* Record the epoch time if retrying. */
+            if (pending_since == 0)
+                pending_since = now;
+
+            return (next_flush - now) / 1000;
+        }
+
+        /* But, don't keep pending updates longer than PGSTAT_MAX_INTERVAL. */
+
+        if (pending_since > 0 &&
+            TimestampDifferenceExceeds(pending_since, now, PGSTAT_MAX_INTERVAL))
+            nowait = false;
+    }
 
     /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
+     * flush_tabstat adds some counters of the flushed entries to the local
+     * database stats, so flush out database stats afterwards.
      */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
-    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+    if (pgStatLocalHash)
     {
-        for (i = 0; i < tsa->tsa_used; i++)
+        /* Step 1: flush out everything other than database stats */
+        hash_seq_init(&scan, pgStatLocalHash);
+        while ((lent = (PgStatLocalHashEntry *) hash_seq_search(&scan)) != NULL)
         {
-            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
+            bool        remove = false;
 
-            /* Shouldn't have any pending transaction-dependent counts */
-            Assert(entry->trans == NULL);
+            switch (lent->env->type)
+            {
+                case PGSTAT_TYPE_DB:
+                    if (ndbentries >= dbentlistlen)
+                    {
+                        dbentlistlen *= 2;
+                        dbentlist = repalloc(dbentlist,
+                                             sizeof(PgStatLocalHashEntry *) *
+                                             dbentlistlen);
+                    }
+                    dbentlist[ndbentries++] = lent;
+                    break;
+                case PGSTAT_TYPE_TABLE:
+                    if (flush_tabstat(lent->env, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_FUNCTION:
+                    if (flush_funcstat(lent->env, nowait))
+                        remove = true;
+                    break;
+                default:
+                    Assert(false);
+            }
 
-            /*
-             * Ignore entries that didn't accumulate any actual counts, such
-             * as indexes that were opened by the planner but not used.
-             */
-            if (memcmp(&entry->t_counts, &all_zeroes,
-                       sizeof(PgStat_TableCounts)) == 0)
+            if (!remove)
+            {
+                remains++;
                 continue;
+            }
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* Remove the successfully flushed entry */
+            pfree(lent->env);
+            hash_search(pgStatLocalHash, &lent->key, HASH_REMOVE, NULL);
+        }
+
+        /* Step 2: flush out database stats */
+        for (i = 0; i < ndbentries; i++)
+        {
+            PgStatLocalHashEntry *lent = dbentlist[i];
+
+            if (flush_dbstat(lent->env, nowait))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                remains--;
+                /* Remove the successfully flushed entry */
+                pfree(lent->env);
+                hash_search(pgStatLocalHash, &lent->key, HASH_REMOVE, NULL);
             }
         }
-        /* zero out PgStat_TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
+        pfree(dbentlist);
+
+        if (remains <= 0)
+        {
+            hash_destroy(pgStatLocalHash);
+            pgStatLocalHash = NULL;
+        }
+    }
+
+    /* Publish the last flush time */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (shared_globalStats->stats_timestamp < now)
+        shared_globalStats->stats_timestamp = now;
+    LWLockRelease(StatsLock);
+
+    /*
+     * If we have pending local stats, let the caller know the retry interval.
+     */
+    if (HAVE_ANY_PENDING_STATS())
+    {
+        /* Retain the epoch time */
+        if (pending_since == 0)
+            pending_since = now;
+
+        /* The interval is doubled at every retry. */
+        if (retry_interval == 0)
+            retry_interval = PGSTAT_RETRY_MIN_INTERVAL * 1000;
+        else
+            retry_interval = retry_interval * 2;
+
+        /*
+         * Determine the next retry interval so as not to get shorter than the
+         * previous interval.
+         */
+        if (!TimestampDifferenceExceeds(pending_since,
+                                        now + 2 * retry_interval,
+                                        PGSTAT_MAX_INTERVAL))
+            next_flush = now + retry_interval;
+        else
+        {
+            next_flush = pending_since + PGSTAT_MAX_INTERVAL * 1000;
+            retry_interval = next_flush - now;
+        }
+
+        return retry_interval / 1000;
     }
 
+    /* Set the next time to update stats */
+    next_flush = now + PGSTAT_MIN_INTERVAL * 1000;
+    retry_interval = 0;
+    pending_since = 0;
+
+    return 0;
+}
+
+/*
+ * flush_tabstat - flush out a local table stats entry
+ *
+ * Some of the counters are added to the local database stats entry after a
+ * successful flush.
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_tabstat(PgStatEnvelope * lenv, bool nowait)
+{
+    static const PgStat_TableCounts all_zeroes;
+    Oid            dboid;            /* database OID of the table */
+    PgStat_TableStatus *lstats; /* local stats entry  */
+    PgStatEnvelope *shenv;        /* shared stats envelope */
+    PgStat_StatTabEntry *shtabstats;    /* table entry of shared stats */
+    PgStat_StatDBEntry *ldbstats;    /* local database entry */
+    bool        found;
+
+    Assert(lenv->type == PGSTAT_TYPE_TABLE);
+
+    lstats = (PgStat_TableStatus *) &lenv->body;
+    dboid = lstats->t_shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Ignore entries that didn't accumulate any actual counts, such as
+     * indexes that were opened by the planner but not used.
+     */
+    if (memcmp(&lstats->t_counts, &all_zeroes,
+               sizeof(PgStat_TableCounts)) == 0)
+        return true;
+
+    /* find shared table stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_TABLE, dboid, lstats->t_id,
+                           nowait, init_tabentry, &found);
+
+    /* skip if dshash failed to acquire lock */
+    if (shenv == NULL)
+        return false;
+
+    /* retrieve the shared table stats entry from the envelope */
+    shtabstats = (PgStat_StatTabEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;
+
+    /* add the values to the shared entry. */
+    shtabstats->numscans += lstats->t_counts.t_numscans;
+    shtabstats->tuples_returned += lstats->t_counts.t_tuples_returned;
+    shtabstats->tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    shtabstats->tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    shtabstats->tuples_updated += lstats->t_counts.t_tuples_updated;
+    shtabstats->tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    shtabstats->tuples_hot_updated += lstats->t_counts.t_tuples_hot_updated;
+
+    /*
+     * If the table was truncated or vacuum/analyze has run, first reset the
+     * live/dead counters.
+     */
+    if (lstats->t_counts.t_truncated ||
+        lstats->t_counts.vacuum_count > 0 ||
+        lstats->t_counts.analyze_count > 0 ||
+        lstats->t_counts.autovac_vacuum_count > 0 ||
+        lstats->t_counts.autovac_analyze_count > 0)
+    {
+        shtabstats->n_live_tuples = 0;
+        shtabstats->n_dead_tuples = 0;
+    }
+
+    /* clear the change counter if requested */
+    if (lstats->t_counts.reset_changed_tuples)
+        shtabstats->changes_since_analyze = 0;
+
+    shtabstats->n_live_tuples += lstats->t_counts.t_delta_live_tuples;
+    shtabstats->n_dead_tuples += lstats->t_counts.t_delta_dead_tuples;
+    shtabstats->changes_since_analyze += lstats->t_counts.t_changed_tuples;
+    shtabstats->blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    shtabstats->blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /*
+     * Update vacuum/analyze timestamps and counters, making sure that the
+     * timestamps never go backwards.
+     */
+    if (shtabstats->vacuum_timestamp < lstats->vacuum_timestamp)
+        shtabstats->vacuum_timestamp = lstats->vacuum_timestamp;
+    shtabstats->vacuum_count += lstats->t_counts.vacuum_count;
+
+    if (shtabstats->autovac_vacuum_timestamp < lstats->autovac_vacuum_timestamp)
+        shtabstats->autovac_vacuum_timestamp = lstats->autovac_vacuum_timestamp;
+    shtabstats->autovac_vacuum_count += lstats->t_counts.autovac_vacuum_count;
+
+    if (shtabstats->analyze_timestamp < lstats->analyze_timestamp)
+        shtabstats->analyze_timestamp = lstats->analyze_timestamp;
+    shtabstats->analyze_count += lstats->t_counts.analyze_count;
+
+    if (shtabstats->autovac_analyze_timestamp < lstats->autovac_analyze_timestamp)
+        shtabstats->autovac_analyze_timestamp = lstats->autovac_analyze_timestamp;
+    shtabstats->autovac_analyze_count += lstats->t_counts.autovac_analyze_count;
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    shtabstats->n_live_tuples = Max(shtabstats->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    shtabstats->n_dead_tuples = Max(shtabstats->n_dead_tuples, 0);
+
+    LWLockRelease(&shenv->lock);
+
+    /* Successfully flushed, so add the same counts to database stats */
+    ldbstats = get_local_dbstat_entry(dboid);
+    ldbstats->counts.n_tuples_returned += lstats->t_counts.t_tuples_returned;
+    ldbstats->counts.n_tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    ldbstats->counts.n_tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    ldbstats->counts.n_tuples_updated += lstats->t_counts.t_tuples_updated;
+    ldbstats->counts.n_tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    ldbstats->counts.n_blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    ldbstats->counts.n_blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    return true;
+}
+
+/* ----------
+ * init_tabentry() -
+ *
+ * initializes table stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
+ */
+static void
+init_tabentry(PgStatEnvelope * env)
+{
+    PgStat_StatTabEntry *tabent = (PgStat_StatTabEntry *) &env->body;
+
     /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
+     * This is a new table entry, so initialize all of its counters and
+     * timestamps to zero.
      */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
-
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+    Assert(env->type == PGSTAT_TYPE_TABLE);
+    tabent->tableid = env->objectid;
+    tabent->numscans = 0;
+    tabent->tuples_returned = 0;
+    tabent->tuples_fetched = 0;
+    tabent->tuples_inserted = 0;
+    tabent->tuples_updated = 0;
+    tabent->tuples_deleted = 0;
+    tabent->tuples_hot_updated = 0;
+    tabent->n_live_tuples = 0;
+    tabent->n_dead_tuples = 0;
+    tabent->changes_since_analyze = 0;
+    tabent->blocks_fetched = 0;
+    tabent->blocks_hit = 0;
+
+    tabent->vacuum_timestamp = 0;
+    tabent->vacuum_count = 0;
+    tabent->autovac_vacuum_timestamp = 0;
+    tabent->autovac_vacuum_count = 0;
+    tabent->analyze_timestamp = 0;
+    tabent->analyze_count = 0;
+    tabent->autovac_analyze_timestamp = 0;
+    tabent->autovac_analyze_count = 0;
 }
 
+
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * flush_funcstat - flush out a local function stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_funcstat(PgStatEnvelope * env, bool nowait)
+{
+    /* we assume this inits to all zeroes: */
+    static const PgStat_FunctionCounts all_zeroes;
+    PgStat_BackendFunctionEntry *localent;    /* local stats entry */
+    PgStatEnvelope *shenv;        /* shared stats envelope */
+    PgStat_StatFuncEntry *sharedent = NULL; /* shared stats entry */
+    bool        found;
+
+    Assert(env->type == PGSTAT_TYPE_FUNCTION);
+    localent = (PgStat_BackendFunctionEntry *) &env->body;
+
+    /* Skip it if no counts accumulated for it so far */
+    if (memcmp(&localent->f_counts, &all_zeroes,
+               sizeof(PgStat_FunctionCounts)) == 0)
+        return true;
+
+    /* find shared function stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId, localent->f_id,
+                           nowait, init_funcentry, &found);
+    /* skip if dshash failed to acquire lock */
+    if (shenv == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    /* retrieve the shared function stats entry from the envelope */
+    sharedent = (PgStat_StatFuncEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
+
+    sharedent->f_numcalls += localent->f_counts.f_numcalls;
+    sharedent->f_total_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_total_time);
+    sharedent->f_self_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_self_time);
+
+    LWLockRelease(&shenv->lock);
+
+    return true;
+}
+
+
+/* ----------
+ * init_funcentry() -
+ *
+ * initializes function stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
  */
 static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+init_funcentry(PgStatEnvelope * env)
 {
-    int            n;
-    int            len;
+    PgStat_StatFuncEntry *shstat = (PgStat_StatFuncEntry *) &env->body;
+
+    Assert(env->type == PGSTAT_TYPE_FUNCTION);
+    shstat->functionid = env->objectid;
+    shstat->f_numcalls = 0;
+    shstat->f_total_time = 0;
+    shstat->f_self_time = 0;
+}
+
+
+/*
+ * flush_dbstat - flush out a local database stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_dbstat(PgStatEnvelope * env, bool nowait)
+{
+    PgStat_StatDBEntry *localent;
+    PgStatEnvelope *shenv;
+    PgStat_StatDBEntry *sharedent;
+
+    Assert(env->type == PGSTAT_TYPE_DB);
+
+    localent = (PgStat_StatDBEntry *) &env->body;
+
+    /* find shared database stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_DB, localent->databaseid, InvalidOid,
+                           nowait, init_dbentry, NULL);
+
+    /* skip if dshash failed to acquire lock */
+    if (!shenv)
+        return false;
+
+    /* retrieve the shared stats entry from the envelope */
+    sharedent = (PgStat_StatDBEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;
+
+    sharedent->counts.n_tuples_returned += localent->counts.n_tuples_returned;
+    sharedent->counts.n_tuples_fetched += localent->counts.n_tuples_fetched;
+    sharedent->counts.n_tuples_inserted += localent->counts.n_tuples_inserted;
+    sharedent->counts.n_tuples_updated += localent->counts.n_tuples_updated;
+    sharedent->counts.n_tuples_deleted += localent->counts.n_tuples_deleted;
+    sharedent->counts.n_blocks_fetched += localent->counts.n_blocks_fetched;
+    sharedent->counts.n_blocks_hit += localent->counts.n_blocks_hit;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    sharedent->counts.n_deadlocks += localent->counts.n_deadlocks;
+    sharedent->counts.n_temp_bytes += localent->counts.n_temp_bytes;
+    sharedent->counts.n_temp_files += localent->counts.n_temp_files;
+    sharedent->counts.n_checksum_failures += localent->counts.n_checksum_failures;
 
     /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
+     * Accumulate xact commit/rollback and I/O timings to stats entry of the
+     * current database.
      */
-    if (OidIsValid(tsmsg->m_databaseid))
+    if (OidIsValid(localent->databaseid))
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
+        sharedent->counts.n_xact_commit += pgStatXactCommit;
+        sharedent->counts.n_xact_rollback += pgStatXactRollback;
+        sharedent->counts.n_block_read_time += pgStatBlockReadTime;
+        sharedent->counts.n_block_write_time += pgStatBlockWriteTime;
         pgStatXactCommit = 0;
         pgStatXactRollback = 0;
         pgStatBlockReadTime = 0;
@@ -939,257 +1083,102 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        sharedent->counts.n_xact_commit = 0;
+        sharedent->counts.n_xact_rollback = 0;
+        sharedent->counts.n_block_read_time = 0;
+        sharedent->counts.n_block_write_time = 0;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
+    LWLockRelease(&shenv->lock);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    return true;
 }
 
+
+/* ----------
+ * init_dbentry() -
+ *
+ * initializes database stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
+ */
+static void
+init_dbentry(PgStatEnvelope * env)
+{
+    PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) &env->body;
+
+    Assert(env->type == PGSTAT_TYPE_DB);
+    dbentry->databaseid = env->databaseid;
+    dbentry->last_autovac_time = 0;
+    dbentry->last_checksum_failure = 0;
+    dbentry->stat_reset_timestamp = 0;
+    dbentry->stats_timestamp = 0;
+    /* initialize the new shared entry */
+    MemSet(&dbentry->counts, 0, sizeof(PgStat_StatDBCounts));
+}
+
+
 /*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
+ * Create the filename for a DB stat file; filename is an output parameter
+ * that points to a character buffer of length len.
  */
 static void
-pgstat_send_funcstats(void)
+get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
-
-    if (pgStatFunctions == NULL)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
-
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
-
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
-
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
-
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
-
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
+    int            printed;
+
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/db_%u.%s",
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
+                       databaseid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
 }
 
 
 /* ----------
- * pgstat_vacuum_stat() -
+ * collect_stat_entries() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *    Collect the shared statistics entries specified by type and dbid. Returns a
+ *  NULL-terminated list of pointers to shared statistics in palloc'ed memory.
+ *  If type is PGSTAT_TYPE_ALL, all types of statistics of the database are
+ *  collected. If type is PGSTAT_TYPE_DB, the parameter dbid is ignored and
+ *  all PGSTAT_TYPE_DB entries are collected.
  * ----------
  */
-void
-pgstat_vacuum_stat(void)
+static PgStatEnvelope * *collect_stat_entries(PgStatTypes type, Oid dbid)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Read pg_database and make a list of OIDs of all existing databases
-     */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
-
-    /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            dbid = dbentry->databaseid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        /* the DB entry for shared tables (with InvalidOid) is never dropped */
-        if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
-            pgstat_drop_database(dbid);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Lookup our own database entry; if not found, nothing more to do.
-     */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
-        return;
-
-    /*
-     * Similarly to above, make a list of all known relations in this DB.
-     */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
-
-    /*
-     * Check for all tables listed in stats hashtable if they still exist.
-     */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
+    int            listlen = 16;
+    PgStatEnvelope **envlist = palloc(sizeof(PgStatEnvelope *) * listlen);
+    int            n = 0;
+
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
     {
-        Oid            tabid = tabentry->tableid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+        if ((type != PGSTAT_TYPE_ALL && p->key.type != type) ||
+            (type != PGSTAT_TYPE_DB && p->key.databaseid != dbid))
             continue;
 
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
+        if (n >= listlen - 1)
         {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
+            listlen *= 2;
+            envlist = repalloc(envlist, listlen * sizeof(PgStatEnvelope *));
         }
+        envlist[n++] = dsa_get_address(area, p->env);
     }
+    dshash_seq_term(&hstat);
 
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
-        {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
+    envlist[n] = NULL;
 
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
-                continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
-        }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
-    }
+    return envlist;
 }
 
 
 /* ----------
- * pgstat_collect_oids() -
+ * collect_oids() -
  *
  *    Collect the OIDs of all objects listed in the specified system catalog
  *    into a temporary hash table.  Caller should hash_destroy the result
@@ -1198,7 +1187,7 @@ pgstat_vacuum_stat(void)
  * ----------
  */
 static HTAB *
-pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
+collect_oids(Oid catalogid, AttrNumber anum_oid)
 {
     HTAB       *htab;
     HASHCTL        hash_ctl;
@@ -1212,7 +1201,7 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     hash_ctl.entrysize = sizeof(Oid);
     hash_ctl.hcxt = CurrentMemoryContext;
     htab = hash_create("Temporary table of OIDs",
-                       PGSTAT_TAB_HASH_SIZE,
+                       PGSTAT_TABLE_HASH_SIZE,
                        &hash_ctl,
                        HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
 
@@ -1239,65 +1228,184 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
 }
 
 
+/* ----------
+ * pgstat_vacuum_stat() -
+ *
+ *  Delete shared stat entries that are not in system catalogs.
+ *
+ *  To avoid holding exclusive lock on dshash for a long time, the process is
+ *  performed in three steps.
+ *
+ *   1: Collect existent oids of every kind of object.
+ *   2: Collect victim entries by scanning with shared lock.
+ *   3: Try removing every nominated entry without waiting for lock.
+ *
+ *  As a consequence of the last step, some entries may be left behind due to
+ *  lock failure, but they will be deleted by later invocations of this
+ *  function.
+ * ----------
+ */
+void
+pgstat_vacuum_stat(void)
+{
+    HTAB       *dbids;            /* database ids */
+    HTAB       *relids;            /* relation ids in the current database */
+    HTAB       *funcids;        /* function ids in the current database */
+    PgStatEnvelope **victims;    /* victim entry list */
+    int            arraylen = 0;    /* storage size of the above */
+    int            nvictims = 0;    /* # of entries of the above */
+    dshash_seq_status dshstat;
+    PgStatHashEntry *ent;
+    int            i;
+
+    /* we don't collect stats in standalone mode */
+    if (!IsUnderPostmaster)
+        return;
+
+    /* collect oids of existent objects */
+    dbids = collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    relids = collect_oids(RelationRelationId, Anum_pg_class_oid);
+    funcids = collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
+
+    /* collect victims from shared stats */
+    arraylen = 16;
+    victims = palloc(sizeof(PgStatEnvelope *) * arraylen);
+    nvictims = 0;
+
+    dshash_seq_init(&dshstat, pgStatSharedHash, false);
+
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
+    {
+        HTAB       *oidtab;
+        Oid           *key;
+
+        CHECK_FOR_INTERRUPTS();
+
+        /*
+         * Skip entries other than database entries that don't belong to the
+         * current database.
+         */
+        if (ent->key.type != PGSTAT_TYPE_DB &&
+            ent->key.databaseid != MyDatabaseId)
+            continue;
+
+        switch (ent->key.type)
+        {
+            case PGSTAT_TYPE_DB:
+                /* don't remove database entry for shared tables */
+                if (ent->key.databaseid == 0)
+                    continue;
+                oidtab = dbids;
+                key = &ent->key.databaseid;
+                break;
+
+            case PGSTAT_TYPE_TABLE:
+                oidtab = relids;
+                key = &ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                oidtab = funcids;
+                key = &ent->key.objectid;
+                break;
+            case PGSTAT_TYPE_ALL:
+                Assert(false);
+                break;
+        }
+
+        /* Skip existent objects. */
+        if (hash_search(oidtab, key, HASH_FIND, NULL) != NULL)
+            continue;
+
+        /* extend the list if needed */
+        if (nvictims >= arraylen)
+        {
+            arraylen *= 2;
+            victims = repalloc(victims, sizeof(PgStatEnvelope *) * arraylen);
+        }
+
+        victims[nvictims++] = dsa_get_address(area, ent->env);
+    }
+    dshash_seq_term(&dshstat);
+    hash_destroy(dbids);
+    hash_destroy(relids);
+    hash_destroy(funcids);
+
+    /* Now try removing the victim entries */
+    for (i = 0; i < nvictims; i++)
+    {
+        PgStatEnvelope *p = victims[i];
+
+        delete_stat_entry(p->type, p->databaseid, p->objectid, true);
+    }
+}
+
+
+/* ----------
+ * delete_stat_entry() -
+ *
+ *  Deletes the specified entry from shared stats hash.
+ *
+ *  Returns true when successfully deleted.
+ * ----------
+ */
+static bool
+delete_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait)
+{
+    PgStatHashEntryKey key;
+    PgStatHashEntry *ent;
+
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    ent = dshash_find_extended(pgStatSharedHash, &key,
+                               true, nowait, false, NULL);
+
+    if (!ent)
+        return false;            /* lock failed or not found */
+
+    /* The entry is exclusively locked, so we can free the chunk first. */
+    dsa_free(area, ent->env);
+    dshash_delete_entry(pgStatSharedHash, ent);
+
+    return true;
+}
+
+
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
- * ----------
+ *    Remove entries for the database that we just dropped.
+ *
+ *    Some entries might be left behind due to lock failure, or some stats
+ *    might be flushed after this, but we will still clean the dead DB
+ *    eventually via future invocations of pgstat_vacuum_stat().
+ * ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
+    Assert(OidIsValid(databaseid));
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
+    envlist = collect_stat_entries(PGSTAT_TYPE_ALL, databaseid);
 
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
+    for (p = envlist; *p != NULL; p++)
+        delete_stat_entry((*p)->type, (*p)->databaseid, (*p)->objectid, true);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
+    pfree(envlist);
 }
-#endif                            /* NOT_USED */
 
 
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1306,20 +1414,47 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    /* Lookup the entries of the current database in the stats hash. */
+    envlist = collect_stat_entries(PGSTAT_TYPE_ALL, MyDatabaseId);
+    for (p = envlist; *p != NULL; p++)
+    {
+        PgStatEnvelope *env = *p;
+        PgStat_StatDBEntry *dbstat;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+        LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+
+        switch (env->type)
+        {
+            case PGSTAT_TYPE_TABLE:
+                init_tabentry(env);
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                init_funcentry(env);
+                break;
+
+            case PGSTAT_TYPE_DB:
+                init_dbentry(env);
+                dbstat = (PgStat_StatDBEntry *) &env->body;
+                dbstat->stat_reset_timestamp = GetCurrentTimestamp();
+                break;
+            default:
+                Assert(false);
+        }
+
+        LWLockRelease(&env->lock);
+    }
+
+    pfree(envlist);
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1328,29 +1463,37 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
+    /* Reset the archiver statistics for the cluster. */
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    /* Reset the bgwriter statistics for the cluster. */
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1359,17 +1502,42 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStatEnvelope *env;
+    PgStat_StatDBEntry *dbentry;
+    PgStatTypes stattype;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    env = get_stat_entry(PGSTAT_TYPE_DB, MyDatabaseId, InvalidOid,
+                         false, NULL, NULL);
+    Assert(env);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* Set the reset timestamp for the whole database */
+    dbentry = (PgStat_StatDBEntry *) &env->body;
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&env->lock);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Reinitialize the stats entry of the specified object */
+    switch (type)
+    {
+        case RESET_TABLE:
+            stattype = PGSTAT_TYPE_TABLE;
+            break;
+        case RESET_FUNCTION:
+            stattype = PGSTAT_TYPE_FUNCTION;
+    }
+
+    env = get_stat_entry(stattype, MyDatabaseId, objoid, false, NULL, NULL);
+    LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+    if (env->type == PGSTAT_TYPE_TABLE)
+        init_tabentry(env);
+    else
+    {
+        Assert(env->type == PGSTAT_TYPE_FUNCTION);
+        init_funcentry(env);
+    }
+    LWLockRelease(&env->lock);
 }
 
 /* ----------
@@ -1383,48 +1551,63 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if activity stats are not active */
+    if (!area)
         return;

-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    ts = GetCurrentTimestamp();
 
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Store the last autovacuum time in the database's hash table entry.
+     */
+    dbentry = get_local_dbstat_entry(dboid);
+    dbentry->last_autovac_time = ts;
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    PgStat_TableStatus *tabentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    ts = GetCurrentTimestamp();
+    tabentry = get_local_tabstat_entry(tableoid, shared);
+
+    tabentry->t_counts.t_delta_live_tuples = livetuples;
+    tabentry->t_counts.t_delta_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = ts;
+        tabentry->t_counts.autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = ts;
+        tabentry->t_counts.vacuum_count++;
+    }
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1435,9 +1618,10 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    PgStat_TableStatus *tabentry;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1445,10 +1629,10 @@ pgstat_report_analyze(Relation rel,
      * already inserted and/or deleted rows in the target table. ANALYZE will
      * have counted such rows as live or dead respectively. Because we will
      * report our counts of such rows at transaction end, we should subtract
-     * off these counts from what we send to the collector now, else they'll
-     * be double-counted after commit.  (This approach also ensures that the
-     * collector ends up with the right numbers if we abort instead of
-     * committing.)
+     * off these counts from what we report to shared stats now, else they'll
+     * be double-counted after commit.  (This approach also ensures that the
+     * shared stats end up with the right numbers if we abort instead of
+     * committing.)
      */
     if (rel->pgstat_info != NULL)
     {
@@ -1466,158 +1650,172 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    tabentry = get_local_tabstat_entry(RelationGetRelid(rel),
+                                       rel->rd_rel->relisshared);
+
+    tabentry->t_counts.t_delta_live_tuples = livetuples;
+    tabentry->t_counts.t_delta_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->t_counts.reset_changed_tuples = true;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->t_counts.autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->t_counts.analyze_count++;
+    }
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            dbent->counts.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            dbent->counts.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            dbent->counts.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            dbent->counts.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            dbent->counts.n_conflict_startup_deadlock++;
+            break;
+    }
 }
 
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-
-/* --------
- * pgstat_report_checksum_failures_in_db() -
- *
- *    Tell the collector about one or more checksum failures.
- * --------
- */
-void
-pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
-{
-    PgStat_MsgChecksumFailure msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
-
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_deadlocks++;
 }
 
 /* --------
  * pgstat_report_checksum_failure() -
  *
- *    Tell the collector about a checksum failure.
+ *    Report a checksum failure.
  * --------
  */
 void
 pgstat_report_checksum_failure(void)
 {
-    pgstat_report_checksum_failures_in_db(MyDatabaseId, 1);
+    PgStat_StatDBEntry *dbent;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_checksum_failures++;
 }
 
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
+    if (filesize == 0)            /* Is there a case where filesize is really 0? */
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_temp_bytes += filesize; /* needs overflow check */
+    dbent->counts.n_temp_files++;
 }
 
 
-/* ----------
- * pgstat_ping() -
+/* --------
+ * pgstat_report_checksum_failures_in_db() -
  *
- *    Send some junk data to the collector to increase traffic.
- * ----------
+ *    Report one or more checksum failures.
+ * --------
  */
 void
-pgstat_ping(void)
+pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
 {
-    PgStat_MsgDummy msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dbentry = get_local_dbstat_entry(dboid);
+
+    /* add the reported failures to the accumulated count */
+    dbentry->counts.n_checksum_failures += failurecount;
 }
 
 /* ----------
- * pgstat_send_inquiry() -
+ * pgstat_init_function_usage() -
  *
- *    Notify collector that we need fresh data.
+ *  Initialize function call usage data.
+ *  Called by the executor before invoking a function.
  * ----------
  */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/*
- * Initialize function call usage data.
- * Called by the executor before invoking a function.
- */
 void
 pgstat_init_function_usage(FunctionCallInfo fcinfo,
                            PgStat_FunctionCallUsage *fcu)
 {
+    PgStatEnvelope *env;
     PgStat_BackendFunctionEntry *htabent;
     bool        found;
 
@@ -1628,26 +1826,15 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
         return;
     }
 
-    if (!pgStatFunctions)
-    {
-        /* First time through - initialize function stat table */
-        HASHCTL        hash_ctl;
+    env = get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                               fcinfo->flinfo->fn_oid, true, &found);
+    htabent = (PgStat_BackendFunctionEntry *) &env->body;
 
-        memset(&hash_ctl, 0, sizeof(hash_ctl));
-        hash_ctl.keysize = sizeof(Oid);
-        hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
-        pgStatFunctions = hash_create("Function stat entries",
-                                      PGSTAT_FUNCTION_HASH_SIZE,
-                                      &hash_ctl,
-                                      HASH_ELEM | HASH_BLOBS);
-    }
-
-    /* Get the stats entry for this function, create if necessary */
-    htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
-                          HASH_ENTER, &found);
     if (!found)
         MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
 
+    htabent->f_id = fcinfo->flinfo->fn_oid;
+
     fcu->fs = &htabent->f_counts;
 
     /* save stats for this function, later used to compensate for recursion */
@@ -1660,31 +1847,38 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
     INSTR_TIME_SET_CURRENT(fcu->f_start);
 }
 
-/*
- * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
- *        for specified function
+/* ----------
+ * find_funcstat_entry() -
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_BackendFunctionEntry entry for specified function.
+ *
+ *  If there is no entry, return NULL without creating a new one.
+ * ----------
  */
 PgStat_BackendFunctionEntry *
 find_funcstat_entry(Oid func_id)
 {
-    if (pgStatFunctions == NULL)
+    PgStatEnvelope *env;
+
+    env = get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                               func_id, false, NULL);
+    if (!env)
         return NULL;
 
-    return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
-                                                       (void *) &func_id,
-                                                       HASH_FIND, NULL);
+    return (PgStat_BackendFunctionEntry *) &env->body;
 }
 
-/*
- * Calculate function call usage and update stat counters.
- * Called by the executor after invoking a function.
+/* ----------
+ * pgstat_end_function_usage() -
  *
- * In the case of a set-returning function that runs in value-per-call mode,
- * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
- * calls for what the user considers a single call of the function.  The
- * finalize flag should be TRUE on the last call.
+ *  Calculate function call usage and update stat counters.
+ *  Called by the executor after invoking a function.
+ *
+ *  In the case of a set-returning function that runs in value-per-call mode,
+ *  we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
+ *  calls for what the user considers a single call of the function.  The
+ *  finalize flag should be TRUE on the last call.
+ * ----------
  */
 void
 pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
@@ -1725,9 +1919,6 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
         fs->f_numcalls++;
     fs->f_total_time = f_total;
     INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
 }
 
 
@@ -1739,8 +1930,7 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
  *
  *    We assume that a relcache entry's pgstat_info field is zeroed by
  *    relcache.c when the relcache entry is made; thereafter it is long-lived
- *    data.  We can avoid repeated searches of the TabStatus arrays when the
- *    same relation is touched repeatedly within a transaction.
+ *    data.
  * ----------
  */
 void
@@ -1760,7 +1950,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1776,116 +1967,157 @@ pgstat_initstats(Relation rel)
         return;
 
     /* Else find or make the PgStat_TableStatus entry, and update link */
-    rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+    rel->pgstat_info = get_local_tabstat_entry(rel_id, rel->rd_rel->relisshared);
 }
 
-/*
- * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
+
+/* ----------
+ * get_local_stat_entry() -
+ *
+ *  Returns the local stats entry for the given type, dbid and objid.
+ *  If create is true, a new entry is created when none exists yet; found
+ *  must be non-NULL in that case.  If create is false and no entry exists,
+ *  NULL is returned.
+ *
+ *  The caller must initialize the body part of the returned envelope.
+ * ----------
  */
-static PgStat_TableStatus *
-get_tabstat_entry(Oid rel_id, bool isshared)
+static PgStatEnvelope *
+get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                     bool create, bool *found)
 {
-    TabStatHashEntry *hash_entry;
-    PgStat_TableStatus *entry;
-    TabStatusArray *tsa;
-    bool        found;
+    PgStatHashEntryKey key;
+    PgStatLocalHashEntry *entry;
 
-    /*
-     * Create hash table if we don't have it already.
-     */
-    if (pgStatTabHash == NULL)
+    if (pgStatLocalHash == NULL)
     {
         HASHCTL        ctl;
 
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
+        MemSet(&ctl, 0, sizeof(ctl));
+        ctl.keysize = sizeof(PgStatHashEntryKey);
+        ctl.entrysize = sizeof(PgStatLocalHashEntry);
+
+        pgStatLocalHash = hash_create("Local stat entries",
+                                      PGSTAT_TABLE_HASH_SIZE,
+                                      &ctl,
+                                      HASH_ELEM | HASH_BLOBS);
+    }
+
+    /* Find an entry or create a new one. */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    entry = hash_search(pgStatLocalHash, &key,
+                        create ? HASH_ENTER : HASH_FIND, found);
+
+    if (!create && !entry)
+        return NULL;
+
+    if (create && !*found)
+    {
+        int            len = pgstat_localentsize[type];
+
+        entry->env = MemoryContextAlloc(CacheMemoryContext,
+                                        PgStatEnvelopeSize(len));
+        entry->env->type = type;
+        entry->env->len = len;
     }
 
+    return entry->env;
+}
+
+/* ----------
+ * get_local_dbstat_entry() -
+ *
+ *  Find or create (and initialize) the local PgStat_StatDBEntry for dbid.
+ * ----------
+ */
+static PgStat_StatDBEntry *
+get_local_dbstat_entry(Oid dbid)
+{
+    PgStatEnvelope *env;
+    PgStat_StatDBEntry *dbentry;
+    bool        found;
+
     /*
      * Find an entry or create a new one.
      */
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
+    env = get_local_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid,
+                               true, &found);
+    dbentry = (PgStat_StatDBEntry *) &env->body;
+
     if (!found)
     {
-        /* initialize new entry with null pointer */
-        hash_entry->tsa_entry = NULL;
+        dbentry->databaseid = dbid;
+        dbentry->last_autovac_time = 0;
+        dbentry->last_checksum_failure = 0;
+        dbentry->stat_reset_timestamp = 0;
+        dbentry->stats_timestamp = 0;
+        MemSet(&dbentry->counts, 0, sizeof(PgStat_StatDBCounts));
     }
 
-    /*
-     * If entry is already valid, we're done.
-     */
-    if (hash_entry->tsa_entry)
-        return hash_entry->tsa_entry;
-
-    /*
-     * Locate the first pgStatTabList entry with free space, making a new list
-     * entry if needed.  Note that we could get an OOM failure here, but if so
-     * we have left the hashtable and the list in a consistent state.
-     */
-    if (pgStatTabList == NULL)
-    {
-        /* Set up first pgStatTabList entry */
-        pgStatTabList = (TabStatusArray *)
-            MemoryContextAllocZero(TopMemoryContext,
-                                   sizeof(TabStatusArray));
-    }
+    return dbentry;
+}
 
-    tsa = pgStatTabList;
-    while (tsa->tsa_used >= TABSTAT_QUANTUM)
-    {
-        if (tsa->tsa_next == NULL)
-            tsa->tsa_next = (TabStatusArray *)
-                MemoryContextAllocZero(TopMemoryContext,
-                                       sizeof(TabStatusArray));
-        tsa = tsa->tsa_next;
-    }
 
-    /*
-     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
-     * the entry was already zeroed, either at creation or after last use.
-     */
-    entry = &tsa->tsa_entries[tsa->tsa_used++];
-    entry->t_id = rel_id;
-    entry->t_shared = isshared;
+/* ----------
+ * get_local_tabstat_entry() -
+ *  Find or create a PgStat_TableStatus entry for rel.  A new entry is
+ *  created and initialized if none exists.
+ * ----------
+ */
+static PgStat_TableStatus *
+get_local_tabstat_entry(Oid rel_id, bool isshared)
+{
+    PgStatEnvelope *env;
+    PgStat_TableStatus *tabentry;
+    bool        found;
 
-    /*
-     * Now we can fill the entry in pgStatTabHash.
-     */
-    hash_entry->tsa_entry = entry;
+    env = get_local_stat_entry(PGSTAT_TYPE_TABLE,
+                               isshared ? InvalidOid : MyDatabaseId,
+                               rel_id, true, &found);
 
-    return entry;
+    tabentry = (PgStat_TableStatus *) &env->body;
+
+    if (!found)
+    {
+        tabentry->t_id = rel_id;
+        tabentry->t_shared = isshared;
+        tabentry->trans = NULL;
+        MemSet(&tabentry->t_counts, 0, sizeof(PgStat_TableCounts));
+        tabentry->vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+    }
+
+    return tabentry;
 }
 
+
 /*
  * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_TableStatus entry for rel from the current
+ *  database then from shared tables.
  *
- * Note: if we got an error in the most recent execution of pgstat_report_stat,
- * it's possible that an entry exists but there's no hashtable entry for it.
- * That's okay, we'll treat this case as "doesn't exist".
+ *  If no entry, return NULL, don't create a new one
+ * ----------
  */
 PgStat_TableStatus *
 find_tabstat_entry(Oid rel_id)
 {
-    TabStatHashEntry *hash_entry;
+    PgStatEnvelope *env;
 
-    /* If hashtable doesn't exist, there are no entries at all */
-    if (!pgStatTabHash)
-        return NULL;
+    env = get_local_stat_entry(PGSTAT_TYPE_TABLE, MyDatabaseId, rel_id,
+                               false, NULL);
+    if (!env)
+        env = get_local_stat_entry(PGSTAT_TYPE_TABLE, InvalidOid, rel_id,
+                                   false, NULL);
+    if (env)
+        return (PgStat_TableStatus *) &env->body;
 
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
-    if (!hash_entry)
-        return NULL;
-
-    /* Note that this step could also return NULL, but that's correct */
-    return hash_entry->tsa_entry;
+    return NULL;
 }
 
 /*
@@ -2362,7 +2594,7 @@ pgstat_twophase_postcommit(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, commit case */
     pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
@@ -2398,7 +2630,7 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, abort case */
     if (rec->t_truncated)
@@ -2415,88 +2647,176 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 }
 
 
+/* ----------
+ * snapshot_statentry() -
+ *
+ *  Common routine for functions pgstat_fetch_stat_*entry()
+ *
+ *  Returns the pointer to the snapshot of the shared entry for the key or NULL
+ *  if not found. Returned snapshots are stable during the current transaction
+ *  or until pgstat_clear_snapshot() is called.
+ *
+ *  Created snapshots are stored in pgStatSnapshotHash.
+ */
+static void *
+snapshot_statentry(const PgStatTypes type, const Oid dbid, const Oid objid)
+{
+    PgStatSnapshot *snap = NULL;
+    bool        found;
+    PgStatHashEntryKey key;
+    size_t        statentsize = pgstat_entsize[type];
+
+    Assert(type != PGSTAT_TYPE_ALL);
+
+    /*
+     * Create new hash, with rather arbitrary initial number of entries since
+     * we don't know how this hash will grow.
+     */
+    if (!pgStatSnapshotHash)
+    {
+        HASHCTL        ctl;
+
+        /*
+         * Create the hash in the stats context
+         *
+         * Each entry is preceded by a common header part, represented by
+         * PgStatSnapshot.
+         */
+
+        ctl.keysize = sizeof(PgStatHashEntryKey);
+        ctl.entrysize = PgStatSnapshotSize(statentsize);
+        ctl.hcxt = pgStatSnapshotContext;
+        pgStatSnapshotHash = hash_create("pgstat snapshot hash", 32, &ctl,
+                                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    }
+
+    /* Find a snapshot */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+
+    snap = hash_search(pgStatSnapshotHash, &key, HASH_ENTER, &found);
+
+    /*
+     * Refer to the shared hash if not found in the snapshot hash.
+     *
+     * In transaction state, we obviously need a snapshot entry for
+     * consistency. Outside a transaction we return an up-to-date entry, but
+     * still need a copy because the shared stats entry can be modified at
+     * any time, so we reuse the same snapshot entry for that purpose.
+     */
+    if (!found || !IsTransactionState())
+    {
+        PgStatEnvelope *shenv;
+
+        shenv = get_stat_entry(type, dbid, objid, true, NULL, NULL);
+
+        if (shenv)
+            memcpy(&snap->body, &shenv->body, statentsize);
+
+        snap->negative = !shenv;
+    }
+
+    if (snap->negative)
+        return NULL;
+
+    return &snap->body;
+}
+
+
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    Find database stats entry on backends. The returned entries are cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
  * ----------
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    /* If not done for this transaction, take a snapshot of global stats */
+    pgstat_snapshot_global_stats();
+
+    /* callers have no business with snapshot-local members */
+    return (PgStat_StatDBEntry *)
+        snapshot_statentry(PGSTAT_TYPE_DB, dbid, InvalidOid);
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one table or NULL. NULL doesn't mean
+ *    the activity statistics for one table or NULL. NULL doesn't mean
  *    that the table doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    activity statistics facilities, so the caller is better off to
+ *    report ZERO instead.
  * ----------
  */
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
-    PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(false, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
-
-    return NULL;
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(true, relid);
+    return tabentry;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_tabentry_snapshot() -
+ *
+ *    Find the table stats entry on backends. The returned entry is cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_snapshot(bool shared, Oid reloid)
+{
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    return (PgStat_StatTabEntry *)
+        snapshot_statentry(PGSTAT_TYPE_TABLE, dboid, reloid);
+}
+
+
+/* ----------
+ * pgstat_copy_index_counters() -
+ *
+ *    Support function for index swapping. Copy a portion of the counters of the
+ *    relation to the specified place.
+ * ----------
+ */
+void
+pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst)
+{
+    PgStat_StatTabEntry *tabentry;
+
+    /* No point fetching tabentry when dst is NULL */
+    if (!dst)
+        return;
+
+    tabentry = pgstat_fetch_stat_tabentry(relid);
+
+    if (!tabentry)
+        return;
+
+    dst->t_counts.t_numscans = tabentry->numscans;
+    dst->t_counts.t_tuples_returned = tabentry->tuples_returned;
+    dst->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
+    dst->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
+    dst->t_counts.t_blocks_hit = tabentry->blocks_hit;
 }
 
 
@@ -2510,24 +2830,48 @@ pgstat_fetch_stat_tabentry(Oid relid)
 PgStat_StatFuncEntry *
 pgstat_fetch_stat_funcentry(Oid func_id)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry = NULL;
-
-    /* load the stats file if needed */
-    backend_read_statsfile();
-
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
-
-    return funcentry;
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    return (PgStat_StatFuncEntry *)
+        snapshot_statentry(PGSTAT_TYPE_FUNCTION, MyDatabaseId, func_id);
 }
 
+/*
+ * pgstat_snapshot_global_stats() -
+ *
+ * Makes a snapshot of global stats if not done yet.  It is kept until a
+ * subsequent call of pgstat_clear_snapshot() or the end of the current
+ * memory context (typically TopTransactionContext).
+ * ----------
+ */
+static void
+pgstat_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext;
+
+    attach_shared_stats();
+
+    /* Nothing to do if already done */
+    if (global_snapshot_is_valid)
+        return;
+
+    oldcontext = MemoryContextSwitchTo(pgStatSnapshotContext);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+
+    memcpy(&snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+    LWLockRelease(StatsLock);
+
+    global_snapshot_is_valid = true;
+
+    MemoryContextSwitchTo(oldcontext);
+
+    return;
+}
 
 /* ----------
  * pgstat_fetch_stat_beentry() -
@@ -2599,9 +2943,10 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &archiverStats;
+    return &snapshot_archiverStats;
 }
 
 
@@ -2616,9 +2961,10 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &globalStats;
+    return &snapshot_globalStats;
 }
 
 
@@ -2832,8 +3178,8 @@ pgstat_initialize(void)
         MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
     }
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* needs to be called before dsm shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -3009,12 +3355,15 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+    /* attach shared database stats area */
+    attach_shared_stats();
 }
 
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
- * Flush any remaining statistics counts out to the collector.
+ * Flush any remaining statistics counts out to shared stats.
  * Without this, operations triggered during backend exit (such as
  * temp table deletions) won't be counted.
  *
@@ -3027,7 +3376,7 @@ pgstat_beshutdown_hook(int code, Datum arg)
 
     /*
      * If we got as far as discovering our own database ID, we can report what
-     * we did to the collector.  Otherwise, we'd be sending an invalid
+     * we did to the shared stats.  Otherwise, we'd be sending an invalid
      * database ID, so forget it.  (This means that accesses to pg_database
      * during failed backend starts might never get counted.)
      */
@@ -3044,6 +3393,8 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    detach_shared_stats(true);
 }
 
 
@@ -3304,7 +3655,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3599,9 +3951,6 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
-            break;
         case WAIT_EVENT_RECOVERY_WAL_STREAM:
             event_name = "RecoveryWalStream";
             break;
@@ -4230,94 +4579,71 @@ pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
 
 
 /* ----------
- * pgstat_setheader() -
+ * pgstat_report_archiver() -
  *
- *        Set common header fields in a statistics message
- * ----------
- */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
-    {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
- *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
+ *        Report archiver statistics
  * ----------
  */
 void
-pgstat_send_archiver(const char *xlog, bool failed)
+pgstat_report_archiver(const char *xlog, bool failed)
 {
-    PgStat_MsgArchiver msg;
+    TimestampTz now = GetCurrentTimestamp();
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+    if (failed)
+    {
+        /* Failed archival attempt */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->failed_count;
+        memcpy(shared_archiverStats->last_failed_wal, xlog,
+               sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    else
+    {
+        /* Successful archival operation */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->archived_count;
+        memcpy(shared_archiverStats->last_archived_wal, xlog,
+               sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
 }
 
 /* ----------
- * pgstat_send_bgwriter() -
+ * pgstat_report_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
-pgstat_send_bgwriter(void)
+pgstat_report_bgwriter(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+
+    PgStat_BgWriter *l = &BgWriterStats;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid taking the lock for completely empty stats.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += l->timed_checkpoints;
+    shared_globalStats->requested_checkpoints += l->requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += l->checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += l->checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += l->buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += l->buf_written_clean;
+    shared_globalStats->maxwritten_clean += l->maxwritten_clean;
+    shared_globalStats->buf_written_backend += l->buf_written_backend;
+    shared_globalStats->buf_fsync_backend += l->buf_fsync_backend;
+    shared_globalStats->buf_alloc += l->buf_alloc;
+    LWLockRelease(StatsLock);
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4326,425 +4652,30 @@ pgstat_send_bgwriter(void)
 }
 
 
-/* ----------
- * PgstatCollectorMain() -
- *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, SignalHandlerForConfigReload);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    MyBackendType = B_STATS_COLLECTOR;
-    init_ps_display(NULL);
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do check ConfigReloadPending
-     * inside the inner loop, which means that such interrupts will get
-     * serviced but the latch won't get cleared until next time there is a
-     * break in the action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (ShutdownRequestPending)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * ShutdownRequestPending becomes set.
-         */
-        while (!ShutdownRequestPending)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (ConfigReloadPending)
-            {
-                ConfigReloadPending = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr & WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    exit(0);
-}
-
-/*
- * Subroutine to clear stats in a database entry
- *
- * Tables and functions hashes are initialized to empty.
- */
-static void
-reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
-{
-    HASHCTL        hash_ctl;
-
-    dbentry->n_xact_commit = 0;
-    dbentry->n_xact_rollback = 0;
-    dbentry->n_blocks_fetched = 0;
-    dbentry->n_blocks_hit = 0;
-    dbentry->n_tuples_returned = 0;
-    dbentry->n_tuples_fetched = 0;
-    dbentry->n_tuples_inserted = 0;
-    dbentry->n_tuples_updated = 0;
-    dbentry->n_tuples_deleted = 0;
-    dbentry->last_autovac_time = 0;
-    dbentry->n_conflict_tablespace = 0;
-    dbentry->n_conflict_lock = 0;
-    dbentry->n_conflict_snapshot = 0;
-    dbentry->n_conflict_bufferpin = 0;
-    dbentry->n_conflict_startup_deadlock = 0;
-    dbentry->n_temp_files = 0;
-    dbentry->n_temp_bytes = 0;
-    dbentry->n_deadlocks = 0;
-    dbentry->n_checksum_failures = 0;
-    dbentry->last_checksum_failure = 0;
-    dbentry->n_block_read_time = 0;
-    dbentry->n_block_write_time = 0;
-
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
-
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
-}
-
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
-{
-    PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
-     */
-    if (!found)
-        reset_dbentry_counters(result);
-
-    return result;
-}
-
-
-/*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
-{
-    PgStat_StatTabEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /* If not found, initialize the new one. */
-    if (!found)
-    {
-        result->numscans = 0;
-        result->tuples_returned = 0;
-        result->tuples_fetched = 0;
-        result->tuples_inserted = 0;
-        result->tuples_updated = 0;
-        result->tuples_deleted = 0;
-        result->tuples_hot_updated = 0;
-        result->n_live_tuples = 0;
-        result->n_dead_tuples = 0;
-        result->changes_since_analyze = 0;
-        result->inserts_since_vacuum = 0;
-        result->blocks_fetched = 0;
-        result->blocks_hit = 0;
-        result->vacuum_timestamp = 0;
-        result->vacuum_count = 0;
-        result->autovac_vacuum_timestamp = 0;
-        result->autovac_vacuum_count = 0;
-        result->analyze_timestamp = 0;
-        result->analyze_count = 0;
-        result->autovac_analyze_timestamp = 0;
-        result->autovac_analyze_count = 0;
-    }
-
-    return result;
-}
-
-
 /* ----------
  * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ *        Write the global statistics file, as well as DB files.
  * ----------
  */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+void
+pgstat_write_statsfiles(void)
 {
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **penv;
+
+    /* Stats are not initialized yet; just return. */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
+    create_missing_dbentries();
+
     /*
      * Open the statistics temp file to write out the current values.
      */
@@ -4761,7 +4692,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
 
     /*
      * Write the file header --- currently just a format ID.
@@ -4773,32 +4704,31 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Write global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write archiver stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Walk through the database table.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_DB, InvalidOid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) &(*penv)->body;
+
         /*
          * Write out the table and function stats for this DB into the
          * appropriate per-DB stat file, if required.
          */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
+        pgstat_write_database_stats(dbentry);
 
         /*
          * Write out the DB entry. We don't write the tables or functions
@@ -4809,6 +4739,8 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
         (void) rc;                /* we'll check for error with ferror */
     }
 
+    pfree(envlist);
+
     /*
      * No more output to be done. Close the temp file and replace the old
      * pgstat.stat with it.  The ferror() check replaces testing for error
@@ -4841,55 +4773,19 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
 }
 
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
 
 /* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
+ * pgstat_write_database_stats() -
+ *  Write the stat file for a single database.
  * ----------
  */
 static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
+pgstat_write_database_stats(PgStat_StatDBEntry *dbentry)
 {
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **penv;
     FILE       *fpout;
     int32        format_id;
     Oid            dbid = dbentry->databaseid;
@@ -4897,8 +4793,8 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     char        tmpfile[MAXPGPATH];
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -4925,24 +4821,31 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     /*
      * Walk through the database's access stats per table.
      */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_TABLE, dbentry->databaseid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatTabEntry *tabentry = (PgStat_StatTabEntry *) &(*penv)->body;
+
         fputc('T', fpout);
         rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    pfree(envlist);
 
     /*
      * Walk through the database's function stats table.
      */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_FUNCTION, dbentry->databaseid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatFuncEntry *funcentry =
+        (PgStat_StatFuncEntry *) &(*penv)->body;
+
         fputc('F', fpout);
         rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    pfree(envlist);
 
     /*
      * No more output to be done. Close the temp file and replace the old
@@ -4976,94 +4879,165 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
+}
 
-    if (permanent)
+/* ----------
+ * create_missing_dbentries() -
+ *
+ *  A database entry may be missing for a database in which object stats
+ *  are recorded. This function creates such missing dbentries so that
+ *  all stats entries can be written out to files.
+ * ----------
+ */
+static void
+create_missing_dbentries(void)
+{
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
+    HTAB       *oidhash;
+    HASHCTL        ctl;
+    HASH_SEQ_STATUS scan;
+    Oid           *poid;
+
+    memset(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(Oid);
+    ctl.entrysize = sizeof(Oid);
+    ctl.hcxt = CurrentMemoryContext;
+    oidhash = hash_create("Temporary table of OIDs",
+                          PGSTAT_TABLE_HASH_SIZE,
+                          &ctl,
+                          HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+    /* Collect database OIDs from the shared stats hash */
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+        hash_search(oidhash, &p->key.databaseid, HASH_ENTER, NULL);
+    dshash_seq_term(&hstat);
+
+    /* Create database entries that do not exist yet. */
+    hash_seq_init(&scan, oidhash);
+    while ((poid = (Oid *) hash_seq_search(&scan)) != NULL)
+        (void) get_stat_entry(PGSTAT_TYPE_DB, *poid, InvalidOid,
+                              false, init_dbentry, NULL);
+
+    hash_destroy(oidhash);
+}
+
+
+/* ----------
+ * get_stat_entry() -
+ *
+ *  Get the shared stats entry for the specified type, dbid and objid.
+ *  If nowait is true, returns NULL on lock failure.
+ *
+ *  If initfunc is not NULL, a new entry is created if none exists yet and
+ *  initfunc is called with the new envelope. If found is not NULL, it is set
+ *  to true if an existing entry was found, or false if not.
+ * ----------
+ */
+static PgStatEnvelope *
+get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+               bool nowait, entry_initializer initfunc, bool *found)
+{
+    bool        create = (initfunc != NULL);
+    PgStatHashEntry *shent;
+    PgStatEnvelope *shenv = NULL;
+    PgStatHashEntryKey key;
+    bool        myfound;
+
+    Assert(type != PGSTAT_TYPE_ALL);
+
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    shent = dshash_find_extended(pgStatSharedHash, &key,
+                                 create, nowait, create, &myfound);
+    if (shent)
     {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
+        if (create && !myfound)
+        {
+            /* Create new stats envelope. */
+            size_t        envsize = PgStatEnvelopeSize(pgstat_entsize[type]);
+            dsa_pointer chunk = dsa_allocate0(area, envsize);
 
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
+            shenv = dsa_get_address(area, chunk);
+            shenv->type = type;
+            shenv->databaseid = dbid;
+            shenv->objectid = objid;
+            shenv->len = pgstat_entsize[type];
+            LWLockInitialize(&shenv->lock, LWTRANCHE_STATS);
+
+            /*
+             * The dshash lock is released right after this. Call the
+             * initializer before the entry is exposed to other processes.
+             */
+            if (initfunc)
+                initfunc(shenv);
+
+            /* Link the new entry from the hash entry. */
+            shent->env = chunk;
+        }
+        else
+            shenv = dsa_get_address(area, shent->env);
+
+        dshash_release_lock(pgStatSharedHash, shent);
     }
+
+    if (found)
+        *found = myfound;
+
+    return shenv;
 }
 
+
 /* ----------
  * pgstat_read_statsfiles() -
  *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ *    Reads existing activity statistics files into the shared stats hash.
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+void
+pgstat_read_statsfiles(void)
 {
+    PgStatEnvelope *env;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
 
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
+    /* shouldn't be called from postmaster */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp =
+        shared_globalStats->stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply starts
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the activity statistics is not running or
+     * has not yet written the stats file the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -5072,7 +5046,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
@@ -5080,38 +5054,30 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     /*
      * Read global stats struct
      */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+        sizeof(*shared_globalStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
         goto done;
     }
 
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
     /*
      * Read archiver stats struct
      */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+        sizeof(*shared_archiverStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
         goto done;
     }
 
     /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
     for (;;)
     {
@@ -5125,7 +5091,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
                           fpin) != offsetof(PgStat_StatDBEntry, tables))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5134,76 +5100,33 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 /*
                  * Add to the DB hash
                  */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
+
+                env = get_stat_entry(PGSTAT_TYPE_DB, dbbuf.databaseid,
+                                     InvalidOid,
+                                     false, init_dbentry, &found);
+                dbentry = (PgStat_StatDBEntry *) &env->body;
+
+                /* don't allow duplicate dbentries */
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
+                memcpy(dbentry, &dbbuf,
+                       offsetof(PgStat_StatDBEntry, tables));
 
+                /* Read the data from the database-specific file. */
+                pgstat_read_db_statsfile(dbentry);
                 break;
 
             case 'E':
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5213,59 +5136,50 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 done:
     FreeFile(fpin);
 
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 
-    return dbhash;
+    return;
 }
 
 
 /* ----------
  * pgstat_read_db_statsfile() -
  *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
+ *    Reads in the at-rest statistics file for the given database and creates
+ *    the shared statistics entries. The file is removed after reading.
  * ----------
  */
 static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
+pgstat_read_db_statsfile(PgStat_StatDBEntry *dbentry)
 {
+    PgStatEnvelope *env;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatTabEntry tabbuf;
     PgStat_StatFuncEntry funcbuf;
     PgStat_StatFuncEntry *funcentry;
+    dshash_table *tabhash = NULL;
+    dshash_table *funchash = NULL;
     FILE       *fpin;
     int32        format_id;
     bool        found;
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
+    get_dbstat_filename(false, dbentry->databaseid, statfile, MAXPGPATH);
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply starts
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the activity statistics is not running or
+     * has not yet written the stats file the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
@@ -5278,14 +5192,14 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
 
     /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
     for (;;)
     {
@@ -5298,25 +5212,21 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
                           fpin) != sizeof(PgStat_StatTabEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
+                env = get_stat_entry(PGSTAT_TYPE_TABLE,
+                                     dbentry->databaseid, tabbuf.tableid,
+                                     false, init_tabentry, &found);
+                tabentry = (PgStat_StatTabEntry *) &env->body;
 
+                /* don't allow duplicate entries */
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5332,25 +5242,20 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
                           fpin) != sizeof(PgStat_StatFuncEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
+                env = get_stat_entry(PGSTAT_TYPE_FUNCTION, dbentry->databaseid,
+                                     funcbuf.functionid,
+                                     false, init_funcentry, &found);
+                funcentry = (PgStat_StatFuncEntry *) &env->body;
 
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5366,7 +5271,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5374,292 +5279,15 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     }
 
 done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
+    if (tabhash)
+        dshash_detach(tabhash);
+    if (funchash)
+        dshash_detach(funchash);
 
-done:
     FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
-}
-
 
-/* ----------
- * pgstat_setup_memcxt() -
- *
- *    Create pgStatLocalContext, if not already done.
- * ----------
- */
-static void
-pgstat_setup_memcxt(void)
-{
-    if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 }
 
 
@@ -5678,756 +5306,25 @@ pgstat_clear_snapshot(void)
 {
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
+    {
         MemoryContextDelete(pgStatLocalContext);
 
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-                tabentry->inserts_since_vacuum = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * It is quite possible that a non-aggressive VACUUM ended up skipping
-     * various pages, however, we'll zero the insert counter here regardless.
-     * It's currently used only to track when we need to perform an
-     * "insert" autovacuum, which are mainly intended to freeze newly inserted
-     * tuples.  Zeroing this may just mean we'll not try to vacuum the table
-     * again until enough tuples have been inserted to trigger another insert
-     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
-     * stragglers.
-     */
-    tabentry->inserts_since_vacuum = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
     }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
 
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
+    if (pgStatSnapshotContext)
     {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
+        MemoryContextReset(pgStatSnapshotContext);
 
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
+        /* Reset variables that pointed to the context */
+        global_snapshot_is_valid = false;
+        pgStatSnapshotHash = NULL;
     }
 }
 
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
-}
-
 /*
  * Convert a potentially unsafely truncated activity string (see
  * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a4b9d212a2..83ca2c1113 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -255,7 +255,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -503,7 +502,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1326,12 +1324,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1780,11 +1772,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2694,8 +2681,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -3058,8 +3043,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3126,13 +3109,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3205,22 +3181,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3681,22 +3641,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3892,8 +3836,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3928,8 +3870,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4130,8 +4071,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5109,18 +5048,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5239,12 +5166,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6139,7 +6060,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6195,8 +6115,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6431,7 +6349,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e72d607a23..ea5030452d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1948,7 +1948,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                BgWriterStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2058,7 +2058,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2248,7 +2248,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2256,7 +2256,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
 
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..58a442f482 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -147,6 +147,7 @@ CreateSharedMemoryAndSemaphores(void)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -263,6 +264,7 @@ CreateSharedMemoryAndSemaphores(void)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 4c14e51c67..f61fd3e8ad 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -523,6 +523,7 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
     LWLockRegisterTranche(LWTRANCHE_SXACT, "serializable_xact");
+    LWLockRegisterTranche(LWTRANCHE_STATS, "activity_statistics");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index db47843229..97eccb35d3 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -49,3 +49,4 @@ MultiXactTruncationLock                41
 OldSnapshotTimeMapLock                42
 LogicalRepWorkerLock                43
 CLogTruncationLock                    44
+StatsLock                            45
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 5b677863b9..4461214ff3 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3192,6 +3192,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3766,6 +3772,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4204,6 +4211,8 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long stats_timeout;
+
                 /* Send out notify signals and transmit self-notifies */
                 ProcessCompletedNotifies();
 
@@ -4216,8 +4225,13 @@ PostgresMain(int argc, char *argv[],
                 if (notifyInterruptPending)
                     ProcessNotifyInterrupt();
 
-                pgstat_report_stat(false);
-
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle");
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4252,7 +4266,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4260,6 +4274,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 6d66ff8b44..b59b22c367 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -33,9 +33,6 @@
 
 #define UINT32_ACCESS_ONCE(var)         ((uint32)(*((volatile uint32 *)&(var))))
 
-/* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
-
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
 {
@@ -1260,7 +1257,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_commit);
+        result = (int64) (dbentry->counts.n_xact_commit);
 
     PG_RETURN_INT64(result);
 }
@@ -1276,7 +1273,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_rollback);
+        result = (int64) (dbentry->counts.n_xact_rollback);
 
     PG_RETURN_INT64(result);
 }
@@ -1292,7 +1289,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_fetched);
+        result = (int64) (dbentry->counts.n_blocks_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1308,7 +1305,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_hit);
+        result = (int64) (dbentry->counts.n_blocks_hit);
 
     PG_RETURN_INT64(result);
 }
@@ -1324,7 +1321,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_returned);
+        result = (int64) (dbentry->counts.n_tuples_returned);
 
     PG_RETURN_INT64(result);
 }
@@ -1340,7 +1337,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_fetched);
+        result = (int64) (dbentry->counts.n_tuples_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1356,7 +1353,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_inserted);
+        result = (int64) (dbentry->counts.n_tuples_inserted);
 
     PG_RETURN_INT64(result);
 }
@@ -1372,7 +1369,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_updated);
+        result = (int64) (dbentry->counts.n_tuples_updated);
 
     PG_RETURN_INT64(result);
 }
@@ -1388,7 +1385,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_deleted);
+        result = (int64) (dbentry->counts.n_tuples_deleted);
 
     PG_RETURN_INT64(result);
 }
@@ -1421,7 +1418,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_files;
+        result = dbentry->counts.n_temp_files;
 
     PG_RETURN_INT64(result);
 }
@@ -1437,7 +1434,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_bytes;
+        result = dbentry->counts.n_temp_bytes;
 
     PG_RETURN_INT64(result);
 }
@@ -1452,7 +1449,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace);
+        result = (int64) (dbentry->counts.n_conflict_tablespace);
 
     PG_RETURN_INT64(result);
 }
@@ -1467,7 +1464,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_lock);
+        result = (int64) (dbentry->counts.n_conflict_lock);
 
     PG_RETURN_INT64(result);
 }
@@ -1482,7 +1479,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_snapshot);
+        result = (int64) (dbentry->counts.n_conflict_snapshot);
 
     PG_RETURN_INT64(result);
 }
@@ -1497,7 +1494,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_bufferpin);
+        result = (int64) (dbentry->counts.n_conflict_bufferpin);
 
     PG_RETURN_INT64(result);
 }
@@ -1512,7 +1509,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1527,11 +1524,11 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace +
-                          dbentry->n_conflict_lock +
-                          dbentry->n_conflict_snapshot +
-                          dbentry->n_conflict_bufferpin +
-                          dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_tablespace +
+                          dbentry->counts.n_conflict_lock +
+                          dbentry->counts.n_conflict_snapshot +
+                          dbentry->counts.n_conflict_bufferpin +
+                          dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1546,7 +1543,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_deadlocks);
+        result = (int64) (dbentry->counts.n_deadlocks);
 
     PG_RETURN_INT64(result);
 }
@@ -1564,7 +1561,7 @@ pg_stat_get_db_checksum_failures(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_checksum_failures);
+        result = (int64) (dbentry->counts.n_checksum_failures);
 
     PG_RETURN_INT64(result);
 }
@@ -1601,7 +1598,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_read_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_read_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1617,7 +1614,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_write_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_write_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb19644419..51748c99ad 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index f4247ea70d..f65d05c24c 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -631,6 +632,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1241,6 +1244,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 3c70499feb..927ae319b1 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -6,7 +6,7 @@ use File::Basename qw(basename dirname);
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 107;
+use Test::More tests => 106;
 
 program_help_ok('pg_basebackup');
 program_version_ok('pg_basebackup');
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 619b2f9c71..9f1de1e42f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -83,6 +83,8 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 763c1ee2bd..69bd794806 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL activity statistics facility.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -15,9 +15,11 @@
 #include "libpq/pqcomm.h"
 #include "miscadmin.h"
 #include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -41,33 +43,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -78,9 +53,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -111,18 +85,17 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_delta_live_tuples;
     PgStat_Counter t_delta_dead_tuples;
     PgStat_Counter t_changed_tuples;
+    bool        reset_changed_tuples;
 
     PgStat_Counter t_blocks_fetched;
     PgStat_Counter t_blocks_hit;
+
+    PgStat_Counter vacuum_count;
+    PgStat_Counter autovac_vacuum_count;
+    PgStat_Counter analyze_count;
+    PgStat_Counter autovac_analyze_count;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -156,6 +129,10 @@ typedef struct PgStat_TableStatus
     Oid            t_id;            /* table's OID */
     bool        t_shared;        /* is it a shared catalog? */
     struct PgStat_TableXactStatus *trans;    /* lowest subxact's counts */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_TableCounts t_counts;    /* event counts to be sent */
 } PgStat_TableStatus;
 
@@ -181,280 +158,32 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
-/* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
- */
-typedef struct PgStat_MsgHdr
-{
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
 /* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgTempFile
+typedef struct PgStat_BgWriter
 {
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter buf_alloc;
+    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+}            PgStat_BgWriter;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -486,96 +215,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Activity statistics data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -584,13 +225,9 @@ typedef union PgStat_Msg
 
 #define PGSTAT_FILE_FORMAT_ID    0x01A5BC9D
 
-/* ----------
- * PgStat_StatDBEntry            The collector's data per database
- * ----------
- */
-typedef struct PgStat_StatDBEntry
+
+typedef struct PgStat_StatDBCounts
 {
-    Oid            databaseid;
     PgStat_Counter n_xact_commit;
     PgStat_Counter n_xact_rollback;
     PgStat_Counter n_blocks_fetched;
@@ -600,7 +237,6 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_tuples_inserted;
     PgStat_Counter n_tuples_updated;
     PgStat_Counter n_tuples_deleted;
-    TimestampTz last_autovac_time;
     PgStat_Counter n_conflict_tablespace;
     PgStat_Counter n_conflict_lock;
     PgStat_Counter n_conflict_snapshot;
@@ -610,29 +246,55 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_temp_bytes;
     PgStat_Counter n_deadlocks;
     PgStat_Counter n_checksum_failures;
-    TimestampTz last_checksum_failure;
     PgStat_Counter n_block_read_time;    /* times in microseconds */
     PgStat_Counter n_block_write_time;
+} PgStat_StatDBCounts;
 
+/* ----------
+ * PgStat_StatDBEntry            The statistics per database
+ * ----------
+ */
+typedef struct PgStat_StatDBEntry
+{
+    Oid            databaseid;
+    TimestampTz last_autovac_time;
+    TimestampTz last_checksum_failure;
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;    /* time of db stats update */
+
+    PgStat_StatDBCounts counts;
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following fields must be last in the struct, because we don't
+     * write them out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables; /* current gen tables hash */
+    dshash_table_handle functions;    /* current gen functions hash */
 } PgStat_StatDBEntry;
 
+/* ----------
+ * PgStat_HashMountInfo  dshash mount information
+ * ----------
+ */
+typedef struct PgStat_HashMountInfo
+{
+    HTAB       *snapshot_tables;    /* table entry snapshot */
+    HTAB       *snapshot_functions; /* function entry snapshot */
+    dshash_table *dshash_tables;    /* attached tables dshash */
+    dshash_table *dshash_functions; /* attached functions dshash */
+} PgStat_HashMountInfo;
 
 /* ----------
- * PgStat_StatTabEntry            The collector's data per table (or index)
+ * PgStat_StatTabEntry            The statistics per table (or index)
  * ----------
  */
 typedef struct PgStat_StatTabEntry
 {
     Oid            tableid;
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
 
     PgStat_Counter numscans;
 
@@ -652,19 +314,15 @@ typedef struct PgStat_StatTabEntry
     PgStat_Counter blocks_fetched;
     PgStat_Counter blocks_hit;
 
-    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
     PgStat_Counter vacuum_count;
-    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_vacuum_count;
-    TimestampTz analyze_timestamp;    /* user initiated */
     PgStat_Counter analyze_count;
-    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_analyze_count;
 } PgStat_StatTabEntry;
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            The statistics per function
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
@@ -679,7 +337,7 @@ typedef struct PgStat_StatFuncEntry
 
 
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -695,7 +353,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -761,7 +419,6 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
     WAIT_EVENT_WAL_RECEIVER_MAIN,
@@ -1005,7 +662,7 @@ typedef struct PgBackendGSSStatus
  *
  * Each live backend maintains a PgBackendStatus struct in shared memory
  * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
+ * BackendId, but that is not critical.)  Note that the stats-writing code
  * has no involvement in, or even access to, these structs.
  *
  * Each auxiliary process also maintains a PgBackendStatus struct in shared
@@ -1202,13 +859,15 @@ extern PGDLLIMPORT bool pgstat_track_counts;
 extern PGDLLIMPORT int pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used; will be removed together with the GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1223,29 +882,26 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
 
+/* File input/output functions */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 
 extern void pgstat_report_autovac(Oid dboid);
@@ -1406,8 +1062,8 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                                       void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
+extern void pgstat_report_archiver(const char *xlog, bool failed);
+extern void pgstat_report_bgwriter(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1416,11 +1072,15 @@ extern void pgstat_send_bgwriter(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_snapshot(bool shared,
+                                                                Oid relid);
+extern void pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 8fda8e4f78..13371e8cb7 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -220,6 +220,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_SXACT,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 83a15f6795..77d1572a99 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.18.2

From e009ef0fc896e4bd5b3b1a7a5036d07a4b713ab6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:09 +0900
Subject: [PATCH v29 6/7] Doc part of shared-memory based stats collector.

---
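Reviewer note (illustrative only, not part of the patch): the monitoring.sgml
hunk below says that each server process writes out its pending counts just
before going idle, but no more often than once per PGSTAT_STAT_INTERVAL
milliseconds. The following is a minimal sketch of that throttling under
stated assumptions. All names here (sketch_report_stat,
SKETCH_STAT_INTERVAL_MS, pending_counts, shared_counts) are hypothetical, and
the meaning of the return value (milliseconds until the next flush is
allowed) is an assumption; this excerpt only shows that pgstat_report_stat()
now returns long.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for PGSTAT_STAT_INTERVAL (1 second per the docs). */
#define SKETCH_STAT_INTERVAL_MS 1000L

/* Time of the last flush; initialised so that the first flush is allowed. */
static long last_flush_ms = -SKETCH_STAT_INTERVAL_MS;
static long pending_counts = 0;		/* counts local to this process */
static long shared_counts = 0;		/* stands in for the shared-memory total */

/*
 * Flush pending counts into the shared total if the interval has elapsed or
 * "force" is set.  Returns 0 when nothing remains pending, otherwise the
 * number of milliseconds the caller should wait before trying again.
 */
static long
sketch_report_stat(long now_ms, bool force)
{
	long		elapsed;

	assert(now_ms >= last_flush_ms);	/* sketch assumes a monotonic clock */
	elapsed = now_ms - last_flush_ms;

	if (pending_counts == 0)
		return 0;
	if (!force && elapsed < SKETCH_STAT_INTERVAL_MS)
		return SKETCH_STAT_INTERVAL_MS - elapsed;

	shared_counts += pending_counts;
	pending_counts = 0;
	last_flush_ms = now_ms;
	return 0;
}
```

A caller that gets a nonzero result back could arm a timer for that many
milliseconds; whether the patch wires its new IDLE_STATS_UPDATE_TIMEOUT up
this way is not visible in this excerpt.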
 doc/src/sgml/catalogs.sgml          |   6 +-
 doc/src/sgml/config.sgml            |  29 +++---
 doc/src/sgml/high-availability.sgml |  13 +--
 doc/src/sgml/monitoring.sgml        | 133 +++++++++++++---------------
 doc/src/sgml/ref/pg_dump.sgml       |   9 +-
 5 files changed, 91 insertions(+), 99 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 64614b569c..8bd8fc4d5f 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -8151,9 +8151,9 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l
   <para>
    <xref linkend="view-table"/> lists the system views described here.
    More detailed documentation of each view follows below.
-   There are some additional views that provide access to the results of
-   the statistics collector; they are described in <xref
-   linkend="monitoring-stats-views-table"/>.
+   There are some additional views that provide access to the activity
+   statistics; they are described in
+   <xref linkend="monitoring-stats-views-table"/>.
   </para>
 
   <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2de21903a1..7ed2b3884c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7008,11 +7008,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
     <title>Run-time Statistics</title>
 
     <sect2 id="runtime-config-statistics-collector">
-     <title>Query and Index Statistics Collector</title>
+     <title>Query and Index Activity Statistics</title>
 
      <para>
-      These parameters control server-wide statistics collection features.
-      When statistics collection is enabled, the data that is produced can be
+      These parameters control server-wide activity statistics features.
+      When activity statistics tracking is enabled, the data that is produced can be
       accessed via the <structname>pg_stat</structname> and
       <structname>pg_statio</structname> family of system views.
       Refer to <xref linkend="monitoring"/> for more information.
@@ -7028,14 +7028,13 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables the collection of information on the currently
-        executing command of each session, along with the time when
-        that command began execution. This parameter is on by
-        default. Note that even when enabled, this information is not
-        visible to all users, only to superusers and the user owning
-        the session being reported on, so it should not represent a
-        security risk.
-        Only superusers can change this setting.
+        Enables tracking of the currently executing command of
+        each session, along with the time when that command began
+        execution. This parameter is on by default. Note that even when
+        enabled, this information is not visible to all users, only to
+        superusers and the user owning the session being reported on, so it
+        should not represent a security risk.  Only superusers can change this
+        setting.
        </para>
       </listitem>
      </varlistentry>
@@ -7066,9 +7065,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables collection of statistics on database activity.
+        Enables tracking of database activity.
         This parameter is on by default, because the autovacuum
-        daemon needs the collected information.
+        daemon needs the activity information.
         Only superusers can change this setting.
        </para>
       </listitem>
@@ -8129,7 +8128,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Specifies the fraction of the total number of heap tuples counted in
-        the previous statistics collection that can be inserted without
+        the previously gathered statistics that can be inserted without
         incurring an index scan at the <command>VACUUM</command> cleanup stage.
         This setting currently applies to B-tree indexes only.
        </para>
@@ -8142,7 +8141,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         Index statistics are considered to be stale if the number of newly
         inserted tuples exceeds the <varname>vacuum_cleanup_index_scale_factor</varname>
         fraction of the total number of heap tuples detected by the previous
-        statistics collection. The total number of heap tuples is stored in
+        statistics gathering. The total number of heap tuples is stored in
         the index meta-page. Note that the meta-page does not include this data
         until <command>VACUUM</command> finds no dead tuples, so B-tree index
         scan at the cleanup stage can only be skipped if the second and
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b5d32bb720..b9b73e59f6 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2381,12 +2381,13 @@ HINT:  Recovery cannot continue unless the configuration is changed and the serv
    </para>
 
    <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
-    index usage, etc., will be recorded normally on the standby. Replayed
-    actions will not duplicate their effects on primary, so replaying an
-    insert will not increment the Inserts column of pg_stat_user_tables.
-    The stats file is deleted at the start of recovery, so stats from primary
-    and standby will differ; this is considered a feature, not a bug.
+    Activity statistics are collected during recovery. All scans, reads,
+    blocks, index usage, etc., will be recorded normally on the
+    standby. Replayed actions will not duplicate their effects on the primary,
+    so replaying an insert will not increment the Inserts column of
+    pg_stat_user_tables.  Activity statistics are reset at the start of
+    recovery, so stats from primary and standby will differ; this is
+    considered a feature, not a bug.
    </para>
 
    <para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 220b8164c3..efdcd6fda8 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -22,7 +22,7 @@
   <para>
    Several tools are available for monitoring database activity and
    analyzing performance.  Most of this chapter is devoted to describing
-   <productname>PostgreSQL</productname>'s statistics collector,
+   <productname>PostgreSQL</productname>'s activity statistics,
    but one should not neglect regular Unix monitoring programs such as
    <command>ps</command>, <command>top</command>, <command>iostat</command>, and <command>vmstat</command>.
    Also, once one has identified a
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    master server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   master process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   master process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
@@ -130,20 +128,21 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
  </sect1>
 
  <sect1 id="monitoring-stats">
-  <title>The Statistics Collector</title>
+  <title>The Activity Statistics</title>
 
   <indexterm zone="monitoring-stats">
    <primary>statistics</primary>
   </indexterm>
 
   <para>
-   <productname>PostgreSQL</productname>'s <firstterm>statistics collector</firstterm>
-   is a subsystem that supports collection and reporting of information about
-   server activity.  Presently, the collector can count accesses to tables
-   and indexes in both disk-block and individual-row terms.  It also tracks
-   the total number of rows in each table, and information about vacuum and
-   analyze actions for each table.  It can also count calls to user-defined
-   functions and the total time spent in each one.
+   <productname>PostgreSQL</productname>'s <firstterm>activity
+   statistics</firstterm> is a subsystem that supports tracking and reporting
+   of information about server activity.  Presently, it counts accesses to
+   tables and indexes in both disk-block and individual-row terms.  It also
+   tracks the total number of rows in each table, and information about
+   vacuum and analyze actions for each table.  It can also count calls to
+   user-defined functions and the total time spent in each
+   one.
   </para>
 
   <para>
@@ -151,15 +150,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    information about exactly what is going on in the system right now, such as
    the exact command currently being executed by other server processes, and
    which other connections exist in the system.  This facility is independent
-   of the collector process.
+   of the activity statistics.
   </para>
 
  <sect2 id="monitoring-stats-setup">
-  <title>Statistics Collection Configuration</title>
+  <title>Activity Statistics Configuration</title>
 
   <para>
-   Since collection of statistics adds some overhead to query execution,
-   the system can be configured to collect or not collect information.
+   Since tracking activity statistics adds some overhead to query
+   execution, the system can be configured to track or not track activity.
    This is controlled by configuration parameters that are normally set in
    <filename>postgresql.conf</filename>.  (See <xref linkend="runtime-config"/> for
    details about setting configuration parameters.)
@@ -172,7 +171,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The parameter <xref linkend="guc-track-counts"/> controls whether
-   statistics are collected about table and index accesses.
+   table and index accesses are tracked.
   </para>
 
   <para>
@@ -196,18 +195,12 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
   </para>
 
   <para>
-   The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
-   When the server shuts down cleanly, a permanent copy of the statistics
-   data is stored in the <filename>pg_stat</filename> subdirectory, so that
-   statistics can be retained across server restarts.  When recovery is
-   performed at server start (e.g. after immediate shutdown, server crash,
-   and point-in-time recovery), all statistics counters are reset.
+   Activity statistics are kept in shared memory. When the server shuts
+   down cleanly, a permanent copy of the statistics data is stored in
+   the <filename>pg_stat</filename> subdirectory, so that statistics can be
+   retained across server restarts.  When recovery is performed at server
+   start (e.g. after immediate shutdown, server crash, and point-in-time
+   recovery), all statistics counters are reset.
   </para>
 
  </sect2>
@@ -220,48 +213,46 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    linkend="monitoring-stats-dynamic-views-table"/>, are available to show
    the current state of the system. There are also several other
    views, listed in <xref
-   linkend="monitoring-stats-views-table"/>, available to show the results
-   of statistics collection.  Alternatively, one can
-   build custom views using the underlying statistics functions, as discussed
-   in <xref linkend="monitoring-stats-functions"/>.
+   linkend="monitoring-stats-views-table"/>, available to show the activity
+   statistics.  Alternatively, one can build custom views using the underlying
+   statistics functions, as discussed in
+   <xref linkend="monitoring-stats-functions"/>.
   </para>
 
   <para>
-   When using the statistics to monitor collected data, it is important
-   to realize that the information does not update instantaneously.
-   Each individual server process transmits new statistical counts to
-   the collector just before going idle; so a query or transaction still in
-   progress does not affect the displayed totals.  Also, the collector itself
-   emits a new report at most once per <varname>PGSTAT_STAT_INTERVAL</varname>
-   milliseconds (500 ms unless altered while building the server).  So the
-   displayed information lags behind actual activity.  However, current-query
-   information collected by <varname>track_activities</varname> is
-   always up-to-date.
+   When using the activity statistics, it is important to realize that the
+   information does not update instantaneously.  Each individual server process
+   writes out new statistical counts just before going idle, but no more often than once
+   per <varname>PGSTAT_STAT_INTERVAL</varname> milliseconds (1 second unless
+   altered while building the server); so a query or transaction still in
+   progress does not affect the displayed totals.  However, current-query
+   information tracked by <varname>track_activities</varname> is always
+   up-to-date.
   </para>
 
   <para>
    Another important point is that when a server process is asked to display
-   any of these statistics, it first fetches the most recent report emitted by
-   the collector process and then continues to use this snapshot for all
-   statistical views and functions until the end of its current transaction.
-   So the statistics will show static information as long as you continue the
-   current transaction.  Similarly, information about the current queries of
-   all sessions is collected when any such information is first requested
-   within a transaction, and the same information will be displayed throughout
-   the transaction.
-   This is a feature, not a bug, because it allows you to perform several
-   queries on the statistics and correlate the results without worrying that
-   the numbers are changing underneath you.  But if you want to see new
-   results with each query, be sure to do the queries outside any transaction
-   block.  Alternatively, you can invoke
+   any of these statistics, it first reads the current statistics and then
+   continues to use this snapshot for all statistical views and functions
+   until the end of its current transaction.  So the statistics will show
+   static information as long as you continue the current transaction.
+   Similarly, information about the current queries of all sessions is tracked
+   when any such information is first requested within a transaction, and the
+   same information will be displayed throughout the transaction.  This is a
+   feature, not a bug, because it allows you to perform several queries on the
+   statistics and correlate the results without worrying that the numbers are
+   changing underneath you.  But if you want to see new results with each
+   query, be sure to do the queries outside any transaction block.
+   Alternatively, you can invoke
    <function>pg_stat_clear_snapshot</function>(), which will discard the
    current transaction's statistics snapshot (if any).  The next use of
    statistical information will cause a new snapshot to be fetched.
   </para>
-
+  
   <para>
-   A transaction can also see its own statistics (as yet untransmitted to the
-   collector) in the views <structname>pg_stat_xact_all_tables</structname>,
+   A transaction can also see its own statistics (as yet unwritten to the
+   server-wide activity statistics) in the
+   views <structname>pg_stat_xact_all_tables</structname>,
    <structname>pg_stat_xact_sys_tables</structname>,
    <structname>pg_stat_xact_user_tables</structname>, and
    <structname>pg_stat_xact_user_functions</structname>.  These numbers do not act as
@@ -596,7 +587,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    kernel's I/O cache, and might therefore still be fetched without
    requiring a physical read. Users interested in obtaining more
    detailed information on <productname>PostgreSQL</productname> I/O behavior are
-   advised to use the <productname>PostgreSQL</productname> statistics collector
+   advised to use the <productname>PostgreSQL</productname> activity statistics
    in combination with operating system utilities that allow insight
    into the kernel's handling of I/O.
   </para>
@@ -914,7 +905,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
       <tbody>
        <row>
-        <entry morerows="64"><literal>LWLock</literal></entry>
+        <entry morerows="65"><literal>LWLock</literal></entry>
         <entry><literal>ShmemIndexLock</literal></entry>
         <entry>Waiting to find or allocate space in shared memory.</entry>
        </row>
@@ -1197,6 +1188,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting to allocate or exchange a chunk of memory or update
          counters during Parallel Hash plan execution.</entry>
         </row>
+        <row>
+         <entry><literal>activity_statistics</literal></entry>
+         <entry>Waiting to write out activity statistics to shared memory
+         during transaction end or idle time.</entry>
+        </row>
         <row>
          <entry morerows="9"><literal>Lock</literal></entry>
          <entry><literal>relation</literal></entry>
@@ -1244,7 +1240,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting to acquire a pin on a buffer.</entry>
         </row>
         <row>
-         <entry morerows="12"><literal>Activity</literal></entry>
+         <entry morerows="11"><literal>Activity</literal></entry>
          <entry><literal>ArchiverMain</literal></entry>
          <entry>Waiting in main loop of the archiver process.</entry>
         </row>
@@ -1272,10 +1268,6 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry><literal>LogicalLauncherMain</literal></entry>
          <entry>Waiting in main loop of logical launcher process.</entry>
         </row>
-        <row>
-         <entry><literal>PgStatMain</literal></entry>
-         <entry>Waiting in main loop of the statistics collector process.</entry>
-        </row>
         <row>
          <entry><literal>RecoveryWalStream</literal></entry>
          <entry>Waiting for WAL from a stream at recovery.</entry>
@@ -4177,9 +4169,10 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
      <entry><literal>performing final cleanup</literal></entry>
      <entry>
        <command>VACUUM</command> is performing final cleanup.  During this phase,
-       <command>VACUUM</command> will vacuum the free space map, update statistics
-       in <literal>pg_class</literal>, and report statistics to the statistics
-       collector.  When this phase is completed, <command>VACUUM</command> will end.
+       <command>VACUUM</command> will vacuum the free space map, update
+       statistics in <literal>pg_class</literal>, and update the system-wide activity
+       statistics.  When this phase is completed, <command>VACUUM</command>
+       will end.
      </entry>
     </row>
    </tbody>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index a9bc397165..a9289f84b0 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1289,11 +1289,10 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The database activity of <application>pg_dump</application> is
-   normally collected by the statistics collector.  If this is
-   undesirable, you can set parameter <varname>track_counts</varname>
-   to false via <envar>PGOPTIONS</envar> or the <literal>ALTER
-   USER</literal> command.
+   The database activity of <application>pg_dump</application> is normally
+   collected.  If this is undesirable, you can set
+   parameter <varname>track_counts</varname> to false
+   via <envar>PGOPTIONS</envar> or the <literal>ALTER USER</literal> command.
   </para>
 
  </refsect1>
-- 
2.18.2

From 6128bcd8b1c41b9e06f768a93d779062a364e47f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 17:00:44 +0900
Subject: [PATCH v29 7/7] Remove the GUC stats_temp_directory

The GUC used to specify the directory to store temporary statistics
files. It is no longer needed by the stats collector but still used by
the programs in bin and contrib, and maybe other extensions. Thus this
patch removes the GUC but some backing variables and macro definitions
are left alone for backward compatibility.
---
 doc/src/sgml/backup.sgml                      |  2 -
 doc/src/sgml/config.sgml                      | 19 ---------
 doc/src/sgml/storage.sgml                     |  3 +-
 src/backend/postmaster/pgstat.c               | 13 +++---
 src/backend/replication/basebackup.c          | 13 ++----
 src/backend/utils/misc/guc.c                  | 41 -------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/pgstat.h                          |  5 ++-
 src/test/perl/PostgresNode.pm                 |  4 --
 9 files changed, 13 insertions(+), 88 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index bdc9026c62..2885540362 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1146,8 +1146,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 7ed2b3884c..0c251c8ac6 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7120,25 +7120,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 1c19e863d2..2f04bb68bb 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -122,8 +122,7 @@ Item
 
 <row>
  <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
+ <entry>Subdirectory containing ephemeral files for extensions</entry>
 </row>
 
 <row>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7679b20833..d601c6f114 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -95,15 +95,12 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
+/*
+ * This used to be a GUC variable and is no longer used in this file.  It is
+ * kept, fixed at the default value, only for backward compatibility with
+ * extensions.
  */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+char       *pgstat_stat_directory = PG_STAT_TMP_DIR;
 
 /*
  * Shared stats bootstrap information, protected by StatsLock.
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index a2e28b064c..7b7d87b938 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -254,7 +254,6 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
     backup_total = 0;
@@ -273,8 +272,6 @@ perform_base_backup(basebackup_options *opt)
                                      backup_total);
     }
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -306,13 +303,9 @@ perform_base_backup(basebackup_options *opt)
          * Calculate the relative path of temporary statistics directory in
          * order to skip the files which are located in that directory later.
          */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
+
+        Assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
+        statrelpath = psprintf("./%s", PG_STAT_TMP_DIR);
 
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 79bc7ac8ca..f94cec4677 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -197,7 +197,6 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -4251,17 +4250,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11538,35 +11526,6 @@ check_maintenance_io_concurrency(int *newval, void **extra, GucSource source)
     return true;
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e9f8ca775d..4fd040b9c7 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -572,7 +572,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 69bd794806..263f9ace1f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -32,7 +32,10 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
+/*
+ * This used to be the directory to store temporary statistics data in but is
+ * no longer used. Defined here for backward compatibility.
+ */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
 
 /* Values for track_functions GUC variable --- order is significant! */
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 1d5450758e..28b39f6b2a 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -455,10 +455,6 @@ sub init
     print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
       if defined $ENV{TEMP_CONFIG};
 
-    # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
-    # concurrently must not share a stats_temp_directory.
-    print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
     if ($params{allows_streaming})
     {
         if ($params{allows_streaming} eq "logical")
-- 
2.18.2


Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
> Conflicted with 616ae3d2b0, so rebased.

I did some cleanup (v30).

- Added comments for the members of dshash_seq_status.
- Some style and comment fixes in dshash.

- Cleaned up more uses of the term "stat(istics) collector" in comments.
- Changed the GUC attribute STATS_COLLECTOR to STATS_ACTIVITY.
- Removed duplicate setup of MyBackendType and ps display in PgArchiverMain
- Removed B_STATS_COLLECTOR from BackendType and removed related code.
- Corrected the comment of PgArchiverMain, which mentioned "argv/argc".

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From ff3dba49721871c72d681a6227e1f99fa59bbb1e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Mon, 16 Mar 2020 17:15:35 +0900
Subject: [PATCH v30 1/7] Use standard crash handler in archiver.

The commit 8e19a82640 changed SIGQUIT handler of almost all processes
not to run atexit callbacks for safety. Archiver process should behave
the same way for the same reason. The exit status changes from 1 to 2, but
that doesn't make any behavioral change.
---
 src/backend/postmaster/pgarch.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 01ffd6513c..37be0e2bbb 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -96,7 +96,6 @@ static pid_t pgarch_forkexec(void);
 #endif
 
 NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_exit(SIGNAL_ARGS);
 static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
@@ -229,7 +228,7 @@ PgArchiverMain(int argc, char *argv[])
     pqsignal(SIGHUP, SignalHandlerForConfigReload);
     pqsignal(SIGINT, SIG_IGN);
     pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
-    pqsignal(SIGQUIT, pgarch_exit);
+    pqsignal(SIGQUIT, SignalHandlerForCrashExit);
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
     pqsignal(SIGUSR1, pgarch_waken);
@@ -246,14 +245,6 @@ PgArchiverMain(int argc, char *argv[])
     exit(0);
 }
 
-/* SIGQUIT signal handler for archiver process */
-static void
-pgarch_exit(SIGNAL_ARGS)
-{
-    /* SIGQUIT means curl up and die ... */
-    exit(1);
-}
-
 /* SIGUSR1 signal handler for archiver process */
 static void
 pgarch_waken(SIGNAL_ARGS)
-- 
2.18.2
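
As a rough illustration of why the SIGQUIT handler matters, here is a minimal standalone sketch (plain POSIX C, not PostgreSQL code). It assumes, as the commit message implies, that the standard crash handler terminates with _exit(2) rather than exit(1). The names cleanup_hook, quit_with_exit, quit_with_underscore_exit and child_status_after_sigquit are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical atexit callback: if it ever runs, the child exits with 99. */
static void
cleanup_hook(void)
{
    _exit(99);
}

/* Old archiver style: exit() runs atexit callbacks from signal context. */
static void
quit_with_exit(int signo)
{
    (void) signo;
    exit(1);
}

/* Crash-handler style (assumed): _exit() skips atexit callbacks entirely. */
static void
quit_with_underscore_exit(int signo)
{
    (void) signo;
    _exit(2);
}

/*
 * Fork a child that registers cleanup_hook, installs the given SIGQUIT
 * handler and raises SIGQUIT.  Returns the child's exit status, or -1 on
 * failure.
 */
static int
child_status_after_sigquit(void (*handler) (int))
{
    pid_t       pid = fork();
    int         status = 0;

    if (pid < 0)
        return -1;
    if (pid == 0)
    {
        atexit(cleanup_hook);
        signal(SIGQUIT, handler);
        raise(SIGQUIT);
        _exit(0);               /* not reached when the handler terminates */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With the exit()-based handler the atexit callback runs and decides the exit status; with the _exit()-based handler it never runs, which is the safety property the commit message refers to.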

From 8a4f7987c1b0fab77e079168633b958a7e2a5a75 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:03 +0900
Subject: [PATCH v30 2/7] sequential scan for dshash

Dshash did not allow scanning all entries sequentially. This adds that
functionality. The interface is similar to, but slightly different from,
both that of dynahash and the simple dshash search functions. The most
significant difference is that the sequential scan interface of dshash
always needs a call to dshash_seq_term when a scan ends. Another is
locking. Dshash holds a partition lock when returning an entry;
dshash_seq_next() also holds a lock when returning an entry, but callers
must not release it, since the lock is essential to continue the
scan. The seqscan interface allows entry deletion during a scan.
In-scan deletion should be performed with dshash_delete_current().
---
 src/backend/lib/dshash.c | 149 ++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  21 ++++++
 2 files changed, 169 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 78ccf03217..1ef093e2e9 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -127,6 +127,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +157,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -324,7 +332,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -592,6 +600,145 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should always be called when a scan finishes.
+ * The caller may delete returned elements in the middle of a scan by using
+ * dshash_delete_current(). exclusive must be true to delete elements.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool exclusive)
+{
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->exclusive = exclusive;
+}
+
+/*
+ * Returns the next element.
+ *
+ * Returned elements are locked and the caller must not explicitly release
+ * the lock. It is released at the next call to dshash_seq_next().
+ */
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert(status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+        LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                      status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        status->curpartition = partition;
+
+        /* resize doesn't happen from now until seq scan ends */
+        status->nbuckets =
+            NUM_BUCKETS(status->hash_table->control->size_log2);
+        ensure_valid_bucket_pointers(status->hash_table);
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        int next_partition;
+
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            return NULL;
+        }
+
+        /* Check whether we move to the next partition */
+        next_partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+
+        if (status->curpartition != next_partition)
+        {
+            /*
+             * Move to the next partition. Lock the next partition before
+             * releasing the current one, not the reverse, to avoid concurrent
+             * resizing.  Avoid deadlock by taking locks in the same order
+             * as resize().
+             */
+            LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                         next_partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                         status->curpartition));
+            status->curpartition = next_partition;
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * The caller may delete the item. Store the next item in case of deletion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+/*
+ * Terminates the seqscan and releases all locks.
+ *
+ * Should always be called when finishing or exiting a seqscan.
+ */
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+
+    if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+/* Remove the current entry during a seq scan. */
+void
+dshash_delete_current(dshash_seq_status *status)
+{
+    dshash_table       *hash_table    = status->hash_table;
+    dshash_table_item  *item        = status->curitem;
+    size_t                partition    = PARTITION_FOR_HASH(item->hash);
+
+    Assert(status->exclusive);
+    Assert(hash_table->control->magic == DSHASH_MAGIC);
+    Assert(hash_table->find_locked);
+    Assert(hash_table->find_exclusively_locked);
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition),
+                                LW_EXCLUSIVE));
+
+    delete_item(hash_table, item);
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b86df68e77..0ca9514021 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,21 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state. The detail is exposed to let users know the storage
+ * size but it should be considered as an opaque type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;    /* dshash table working on */
+    int                    curbucket;    /* bucket number we are at */
+    int                    nbuckets;    /* total number of buckets in the dshash */
+    dshash_table_item  *curitem;    /* item we are currently at */
+    dsa_pointer            pnextitem;    /* dsa-pointer to the next item */
+    int                    curpartition;    /* partition number we are at */
+    bool                exclusive;    /* locking mode */
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
                                    const dshash_parameters *params,
@@ -80,6 +95,12 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
+extern void dshash_delete_current(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.18.2
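
The init/next/term calling pattern and the lock hand-over between partitions described in the patch above can be sketched with a small standalone model. This is not the dshash code: the toy_* names, the fixed table geometry and the integer "lock" counters standing in for partition LWLocks are all assumptions made so the example compiles on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Toy geometry (hypothetical): 8 buckets split evenly over 2 partitions. */
#define TOY_PARTITIONS_LOG2 1
#define TOY_SIZE_LOG2       3
#define TOY_NUM_PARTITIONS  (1 << TOY_PARTITIONS_LOG2)
#define TOY_NUM_BUCKETS     (1 << TOY_SIZE_LOG2)
#define TOY_NUM_SPLITS      (TOY_SIZE_LOG2 - TOY_PARTITIONS_LOG2)
/* Same shift as PARTITION_FOR_BUCKET_INDEX in the patch. */
#define TOY_PARTITION_FOR_BUCKET(b) ((b) >> TOY_NUM_SPLITS)

typedef struct toy_item
{
    int         value;
    struct toy_item *next;
} toy_item;

typedef struct toy_table
{
    toy_item   *buckets[TOY_NUM_BUCKETS];
    int         lock_held[TOY_NUM_PARTITIONS];  /* stand-in for LWLocks */
} toy_table;

typedef struct toy_seq_status
{
    toy_table  *table;
    int         curbucket;
    toy_item   *curitem;
    toy_item   *nextitem;
    int         curpartition;
} toy_seq_status;

/* Push an item onto the head of the given bucket's chain. */
static void
toy_insert(toy_table *t, int bucket, toy_item *item)
{
    item->next = t->buckets[bucket];
    t->buckets[bucket] = item;
}

static void
toy_seq_init(toy_seq_status *st, toy_table *t)
{
    st->table = t;
    st->curbucket = 0;
    st->curitem = NULL;
    st->nextitem = NULL;
    st->curpartition = -1;
}

/* Return the next item, or NULL when every bucket has been visited. */
static toy_item *
toy_seq_next(toy_seq_status *st)
{
    toy_item   *next;

    if (st->curitem == NULL)
    {
        /* First call: lock the partition that owns bucket 0. */
        st->curpartition = TOY_PARTITION_FOR_BUCKET(st->curbucket);
        st->table->lock_held[st->curpartition]++;
        next = st->table->buckets[st->curbucket];
    }
    else
        next = st->nextitem;

    while (next == NULL)
    {
        int         np;

        if (++st->curbucket >= TOY_NUM_BUCKETS)
            return NULL;
        np = TOY_PARTITION_FOR_BUCKET(st->curbucket);
        if (np != st->curpartition)
        {
            /* Take the next partition's lock before dropping the current. */
            st->table->lock_held[np]++;
            st->table->lock_held[st->curpartition]--;
            st->curpartition = np;
        }
        next = st->table->buckets[st->curbucket];
    }

    assert(st->table->lock_held[st->curpartition] == 1);
    st->curitem = next;
    st->nextitem = next->next;  /* remembered in case the caller deletes */
    return next;
}

/* Release whatever partition lock the scan still holds. */
static void
toy_seq_term(toy_seq_status *st)
{
    if (st->curpartition >= 0)
        st->table->lock_held[st->curpartition]--;
}

/* Typical caller: init, loop on next, always finish with term. */
static int
toy_sum_all(toy_table *t)
{
    toy_seq_status st;
    toy_item   *item;
    int         sum = 0;

    toy_seq_init(&st, t);
    while ((item = toy_seq_next(&st)) != NULL)
        sum += item->value;
    toy_seq_term(&st);
    return sum;
}
```

The point of toy_sum_all is the shape of the loop: the scan holds exactly one partition lock at any time between the first next call and term, so skipping term would leave the last partition locked.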

From 0deacffac4fb001a9ba764358b9637fb986b6e58 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:57 +0900
Subject: [PATCH v30 3/7] Add conditional lock feature to dshash

Dshash currently waits for locks unconditionally. That is inconvenient
when we want to avoid being blocked by other processes. This commit
adds alternatives to dshash_find and dshash_find_or_insert
that allow immediate return on lock failure.
---
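
The "immediate return on lock failure" idea from the commit message above can be illustrated with a standalone sketch. It does not use the dshash API: toy_entry, toy_lock_entry, toy_release_entry and toy_flush_pending are hypothetical names, a C11 atomic_flag stands in for the partition lock, and the "keep pending counts locally and retry later" caller is only an assumption about how such a nowait mode might be used.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct toy_entry
{
    atomic_flag lock;           /* stand-in for the partition lock */
    long        count;          /* shared counter protected by the lock */
} toy_entry;

/*
 * Lock and return the entry.  With nowait set, return NULL at once when the
 * lock is busy instead of waiting for it.
 */
static toy_entry *
toy_lock_entry(toy_entry *entry, bool nowait)
{
    while (atomic_flag_test_and_set(&entry->lock))
    {
        if (nowait)
            return NULL;
    }
    return entry;
}

static void
toy_release_entry(toy_entry *entry)
{
    atomic_flag_clear(&entry->lock);
}

/*
 * Try to add locally accumulated counts to the shared entry.  If the lock is
 * busy, leave *pending untouched so the caller can retry later.  Returns
 * true when the counts were flushed.
 */
static bool
toy_flush_pending(toy_entry *shared, long *pending)
{
    toy_entry  *e = toy_lock_entry(shared, true);

    if (e == NULL)
        return false;
    e->count += *pending;
    *pending = 0;
    toy_release_entry(e);
    return true;
}
```

A caller that must not stall can call toy_flush_pending opportunistically and fall back to a later attempt when it returns false.
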
 src/backend/lib/dshash.c | 98 +++++++++++++++++++++-------------------
 src/include/lib/dshash.h |  3 ++
 2 files changed, 55 insertions(+), 46 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 1ef093e2e9..ee612b778f 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -383,6 +383,10 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  * the caller must take care to ensure that the entry is not left corrupted.
  * The lock mode is either shared or exclusive depending on 'exclusive'.
  *
+ * If found is not NULL, *found is set to true if the key is found in the hash
+ * table. If the key is not found, *found is set to false and a pointer to a
+ * newly created entry is returned.
+ *
  * The caller must not lock a lock already.
  *
  * Note that the lock held is in fact an LWLock, so interrupts will be held on
@@ -392,36 +396,7 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
 {
-    dshash_hash hash;
-    size_t        partition;
-    dshash_table_item *item;
-
-    hash = hash_key(hash_table, key);
-    partition = PARTITION_FOR_HASH(hash);
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
-
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
-    ensure_valid_bucket_pointers(hash_table);
-
-    /* Search the active bucket. */
-    item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
-
-    if (!item)
-    {
-        /* Not found. */
-        LWLockRelease(PARTITION_LOCK(hash_table, partition));
-        return NULL;
-    }
-    else
-    {
-        /* The caller will free the lock by calling dshash_release_lock. */
-        hash_table->find_locked = true;
-        hash_table->find_exclusively_locked = exclusive;
-        return ENTRY_FROM_ITEM(item);
-    }
+    return dshash_find_extended(hash_table, key, exclusive, false, false, NULL);
 }
 
 /*
@@ -439,31 +414,60 @@ dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
 {
-    dshash_hash hash;
-    size_t        partition_index;
-    dshash_partition *partition;
+    return dshash_find_extended(hash_table, key, true, false, true, found);
+}
+
+
+/*
+ * Find the key in the hash table.
+ *
+ * "exclusive" is the lock mode in which the partition for the returned item
+ * is locked.  If "nowait" is true, the function returns NULL immediately if
+ * the required lock cannot be acquired.  "insert" selects insert mode: if the
+ * key is not found, a new entry is inserted and *found is set to false;
+ * otherwise *found is set to true.  "found" must be non-null in this mode.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool insert, bool *found)
+{
+    dshash_hash hash = hash_key(hash_table, key);
+    size_t        partidx = PARTITION_FOR_HASH(hash);
+    dshash_partition *partition = &hash_table->control->partitions[partidx];
+    LWLockMode  lockmode = exclusive ? LW_EXCLUSIVE : LW_SHARED;
     dshash_table_item *item;
 
-    hash = hash_key(hash_table, key);
-    partition_index = PARTITION_FOR_HASH(hash);
-    partition = &hash_table->control->partitions[partition_index];
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
+    /* must be exclusive when insert allowed */
+    Assert(!insert || (exclusive && found != NULL));
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (!nowait)
+        LWLockAcquire(PARTITION_LOCK(hash_table, partidx), lockmode);
+    else if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partidx),
+                                       lockmode))
+        return NULL;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
     item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
 
     if (item)
-        *found = true;
+    {
+        if (found)
+            *found = true;
+    }
     else
     {
-        *found = false;
+        if (found)
+            *found = false;
+
+        if (!insert)
+        {
+            /* The caller didn't ask us to add a new entry. */
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+            return NULL;
+        }
 
         /* Check if we are getting too full. */
         if (partition->count > MAX_COUNT_PER_PARTITION(hash_table))
@@ -479,7 +483,8 @@ restart:
              * Give up our existing lock first, because resizing needs to
              * reacquire all the locks in the right order to avoid deadlocks.
              */
-            LWLockRelease(PARTITION_LOCK(hash_table, partition_index));
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+
             resize(hash_table, hash_table->size_log2 + 1);
 
             goto restart;
@@ -493,12 +498,13 @@ restart:
         ++partition->count;
     }
 
-    /* The caller must release the lock with dshash_release_lock. */
+    /* The caller will free the lock by calling dshash_release_lock. */
     hash_table->find_locked = true;
-    hash_table->find_exclusively_locked = true;
+    hash_table->find_exclusively_locked = exclusive;
     return ENTRY_FROM_ITEM(item);
 }
 
+
 /*
  * Remove an entry by key.  Returns true if the key was found and the
  * corresponding entry was removed.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 0ca9514021..5f7a60febd 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -91,6 +91,9 @@ extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                                    const void *key, bool *found);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+                                  bool exclusive, bool nowait, bool insert,
+                                  bool *found);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.18.2

From 624b9eb4703fe1d827aa78a595fcfc4a4bb3c7a0 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:59:38 +0900
Subject: [PATCH v30 4/7] Make archiver process an auxiliary process

This is a preliminary patch for shared-memory based stats collector.

Archiver process must be an auxiliary process since it uses shared
memory once stats data has been moved into shared memory. Make the
process an auxiliary process in order to make it work.
---
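
A rough sketch of the wake-up handshake this patch switches to: the archiver
publishes a pointer to its latch in shared state, and a notifier copies that
pointer under a spinlock and sets the latch after releasing it. The toy_*
types and functions below are hypothetical single-process stand-ins, not
PostgreSQL's Latch or slock_t.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a latch: just a flag a waiter would poll. */
typedef struct toy_latch
{
    atomic_bool is_set;
} toy_latch;

/* Hypothetical stand-in for the shared control struct and its spinlock. */
typedef struct toy_ctl
{
    atomic_flag info_lck;
    toy_latch  *archiver_latch;
} toy_ctl;

static toy_ctl ctl = {ATOMIC_FLAG_INIT, NULL};

static void
spin_acquire(void)
{
    while (atomic_flag_test_and_set(&ctl.info_lck))
        ;                       /* busy-wait; held only for a few stores */
}

static void
spin_release(void)
{
    atomic_flag_clear(&ctl.info_lck);
}

/* Archiver side: publish our latch when the process starts. */
static void
toy_wakeup_start(toy_latch *mine)
{
    spin_acquire();
    assert(ctl.archiver_latch == NULL);
    ctl.archiver_latch = mine;
    spin_release();
}

/* Archiver side: withdraw the latch on exit. */
static void
toy_wakeup_end(void)
{
    spin_acquire();
    ctl.archiver_latch = NULL;
    spin_release();
}

/* Notifier side: returns true if an archiver latch was set. */
static bool
toy_wakeup(void)
{
    toy_latch  *latch;

    spin_acquire();
    latch = ctl.archiver_latch;
    spin_release();

    if (latch == NULL)
        return false;           /* no archiver registered; nothing to do */

    atomic_store(&latch->is_set, true);
    return true;
}
```

Copying the pointer out and setting the latch only after the spinlock is
released keeps the critical section down to a single load, which is the same
shape as XLogArchiveWakeup() in the diff below.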
 src/backend/access/transam/xlog.c        |  46 +++++++++
 src/backend/access/transam/xlogarchive.c |   2 +-
 src/backend/bootstrap/bootstrap.c        |  22 +++--
 src/backend/postmaster/pgarch.c          | 115 ++++-------------------
 src/backend/postmaster/postmaster.c      |  53 ++++++-----
 src/include/access/xlog.h                |   2 +
 src/include/access/xlogarchive.h         |   1 +
 src/include/miscadmin.h                  |   2 +
 src/include/postmaster/pgarch.h          |   4 +-
 src/include/storage/pmsignal.h           |   1 -
 10 files changed, 112 insertions(+), 136 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 977d448f50..d75a257da2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -684,6 +684,13 @@ typedef struct XLogCtlData
      */
     Latch        recoveryWakeupLatch;
 
+    /*
+     * archiverWakeupLatch is used to wake up the archiver process to process
+     * completed WAL segments, if it is waiting for WAL to arrive.  Protected
+     * by info_lck.
+     */
+    Latch       *archiverWakeupLatch;
+
     /*
      * During recovery, we keep a copy of the latest checkpoint record here.
      * lastCheckPointRecPtr points to start of checkpoint record and
@@ -8450,6 +8457,45 @@ GetLastSegSwitchData(XLogRecPtr *lastSwitchLSN)
     return result;
 }
 
+/*
+ * XLogArchiveWakeupStart - Set up archiver wakeup stuff
+ */
+void
+XLogArchiveWakeupStart(void)
+{
+    SpinLockAcquire(&XLogCtl->info_lck);
+    Assert(XLogCtl->archiverWakeupLatch == NULL);
+    XLogCtl->archiverWakeupLatch = MyLatch;
+    SpinLockRelease(&XLogCtl->info_lck);
+}
+
+/*
+ * XLogArchiveWakeupEnd - Clean up archiver wakeup stuff
+ */
+void
+XLogArchiveWakeupEnd(void)
+{
+    SpinLockAcquire(&XLogCtl->info_lck);
+    XLogCtl->archiverWakeupLatch = NULL;
+    SpinLockRelease(&XLogCtl->info_lck);
+}
+
+/*
+ * XLogArchiveWakeup - Wake up archiver process
+ */
+void
+XLogArchiveWakeup(void)
+{
+    Latch *latch;
+
+    SpinLockAcquire(&XLogCtl->info_lck);
+    latch = XLogCtl->archiverWakeupLatch;
+    SpinLockRelease(&XLogCtl->info_lck);
+
+    if (latch)
+        SetLatch(latch);
+}
+
 /*
  * This must be called ONCE during postmaster or standalone-backend shutdown
  */
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index d62c12310a..d6c8bcce8b 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -490,7 +490,7 @@ XLogArchiveNotify(const char *xlog)
 
     /* Notify archiver that it's got something to do */
     if (IsUnderPostmaster)
-        SendPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER);
+        XLogArchiveWakeup();
 }
 
 /*
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 5480a024e0..d398ce6f03 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -319,6 +319,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
         case StartupProcess:
             MyBackendType = B_STARTUP;
             break;
+        case ArchiverProcess:
+            MyBackendType = B_ARCHIVER;
+            break;
         case BgWriterProcess:
             MyBackendType = B_BG_WRITER;
             break;
@@ -439,30 +442,29 @@ AuxiliaryProcessMain(int argc, char *argv[])
             proc_exit(1);        /* should never return */
 
         case StartupProcess:
-            /* don't set signals, startup process has its own agenda */
             StartupProcessMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
+
+        case ArchiverProcess:
+            PgArchiverMain();
+            proc_exit(1);
 
         case BgWriterProcess:
-            /* don't set signals, bgwriter has its own agenda */
             BackgroundWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case CheckpointerProcess:
-            /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalWriterProcess:
-            /* don't set signals, walwriter has its own agenda */
             InitXLOGAccess();
             WalWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalReceiverProcess:
-            /* don't set signals, walreceiver has its own agenda */
             WalReceiverMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         default:
             elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 37be0e2bbb..7b66cb5613 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -48,6 +48,7 @@
 #include "storage/latch.h"
 #include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
+#include "storage/procsignal.h"
 #include "utils/guc.h"
 #include "utils/ps_status.h"
 
@@ -78,7 +79,6 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
@@ -95,8 +95,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
 static void pgarch_ArchiverCopyLoop(void);
@@ -110,75 +108,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -212,14 +141,16 @@ pgarch_forkexec(void)
 #endif                            /* EXEC_BACKEND */
 
 
-/*
- * PgArchiverMain
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
- *    since we don't use 'em, it hardly matters...
- */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+/* Clean up notification stuff on exit */
+static void
+PgArchiverKill(int code, Datum arg)
+{
+    XLogArchiveWakeupEnd();
+}
+
+/* Main entry point for archiver process */
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -231,33 +162,23 @@ PgArchiverMain(int argc, char *argv[])
     pqsignal(SIGQUIT, SignalHandlerForCrashExit);
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, pgarch_waken);
+    pqsignal(SIGUSR1, procsignal_sigusr1_handler);
     pqsignal(SIGUSR2, pgarch_waken_stop);
+
     /* Reset some signals that are accepted by postmaster but not here */
     pqsignal(SIGCHLD, SIG_DFL);
+
+    /* Unblock signals (they were blocked when the postmaster forked us) */
     PG_SETMASK(&UnBlockSig);
 
-    MyBackendType = B_ARCHIVER;
-    init_ps_display(NULL);
-
+    XLogArchiveWakeupStart();
+    on_shmem_exit(PgArchiverKill, 0);
+
     pgarch_MainLoop();
 
     exit(0);
 }
 
-/* SIGUSR1 signal handler for archiver process */
-static void
-pgarch_waken(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    /* set flag that there is work to be done */
-    wakened = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
 /* SIGUSR2 signal handler for archiver process */
 static void
 pgarch_waken_stop(SIGNAL_ARGS)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 73d278f3b2..a4b9d212a2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -146,7 +146,8 @@
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
 #define BACKEND_TYPE_BGWORKER    0x0008    /* bgworker process */
-#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
+#define BACKEND_TYPE_ARCHIVER    0x0010    /* archiver process */
+#define BACKEND_TYPE_ALL        0x001F    /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -539,6 +540,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1785,7 +1787,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -3055,7 +3057,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3190,20 +3192,16 @@ reaper(SIGNAL_ARGS)
         }
 
         /*
-         * Was it the archiver?  If so, just try to start a new one; no need
-         * to force reset of the rest of the system.  (If fail, we'll try
-         * again in future cycles of the main loop.).  Unless we were waiting
-         * for it to shut down; don't restart it in that case, and
-         * PostmasterStateMachine() will advance to the next shutdown step.
+         * Was it the archiver?  Normal exit can be ignored; we'll start a new
+         * one at the next iteration of the postmaster's main loop, if
+         * necessary. Any other exit condition is treated as a crash.
          */
         if (pid == PgArchPID)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3451,7 +3449,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3656,6 +3654,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3928,6 +3938,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5208,7 +5219,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5251,16 +5262,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (StartWorkerNeeded || HaveCrashedWorker)
         maybe_start_bgworkers();
 
-    if (CheckPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER) &&
-        PgArchPID != 0)
-    {
-        /*
-         * Send SIGUSR1 to archiver process, to wake it up and begin archiving
-         * next WAL file.
-         */
-        signal_child(PgArchPID, SIGUSR1);
-    }
-
     /* Tell syslogger to rotate logfile if requested */
     if (SysLoggerPID != 0)
     {
@@ -5493,6 +5494,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 9ec7b31cce..c0e149d526 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -313,6 +313,8 @@ extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern void XLogArchiveWakeupStart(void);
+extern void XLogArchiveWakeupEnd(void);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool PromoteIsTriggered(void);
diff --git a/src/include/access/xlogarchive.h b/src/include/access/xlogarchive.h
index 1c67de2ede..54ce0b97d7 100644
--- a/src/include/access/xlogarchive.h
+++ b/src/include/access/xlogarchive.h
@@ -25,6 +25,7 @@ extern void ExecuteRecoveryCommand(const char *command, const char *commandName,
 extern void KeepFileRestoredFromArchive(const char *path, const char *xlogfname);
 extern void XLogArchiveNotify(const char *xlog);
 extern void XLogArchiveNotifySeg(XLogSegNo segno);
+extern void XLogArchiveWakeup(void);
 extern void XLogArchiveForceDone(const char *xlog);
 extern bool XLogArchiveCheckDone(const char *xlog);
 extern bool XLogArchiveIsBusy(const char *xlog);
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14fa127ab1..619b2f9c71 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -417,6 +417,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -429,6 +430,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index b3200874ca..e3ffc63f14 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 56c5ec4481..c691acf8cd 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -34,7 +34,6 @@ typedef enum
 {
     PMSIGNAL_RECOVERY_STARTED,    /* recovery has started */
     PMSIGNAL_BEGIN_HOT_STANDBY, /* begin Hot Standby */
-    PMSIGNAL_WAKEN_ARCHIVER,    /* send a NOTIFY signal to xlog archiver */
     PMSIGNAL_ROTATE_LOGFILE,    /* send SIGUSR1 to syslogger to rotate logfile */
     PMSIGNAL_START_AUTOVAC_LAUNCHER,    /* start an autovacuum launcher */
     PMSIGNAL_START_AUTOVAC_WORKER,    /* start an autovacuum worker */
-- 
2.18.2

From 6c75b3f8e226a1bc56f7fe74a017284f0f2de4d1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:14 +0900
Subject: [PATCH v30 5/7] Shared-memory based stats collector

Previously, activity statistics were collected via sockets and shared
among backends through periodically written files. Such files can reach
tens of megabytes and are created at most once every 500ms, and that
much data is serialized by the stats collector and then de-serialized by
every backend periodically. To avoid that cost, this patch places
activity statistics data in shared memory. Each backend accumulates
statistics locally and then tries to move them to the shared statistics
at every transaction end, but at intervals no shorter than 500ms. Locks
on the shared statistics are acquired per unit, such as a table or a
function, so the expected chance of collision is not high. Furthermore,
until 1 second has elapsed since the last flush to shared stats, a lock
failure postpones the flush so that lock contention doesn't slow down
transactions.  After that, the stats flush waits for locks so that the
shared statistics don't get stale.
---
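
The flush policy described above can be sketched as a small decision
function. This is only an illustration of the timing rule as worded in the
commit message, not code from pgstat.c: the names are hypothetical, and
treating the 500ms and 1 second thresholds as inclusive lower bounds is an
assumption.

```c
#include <assert.h>

/* Thresholds taken from the commit message; the names are hypothetical. */
#define MIN_FLUSH_INTERVAL_MS    500
#define MAX_NOWAIT_INTERVAL_MS  1000

typedef enum
{
    FLUSH_SKIP,                 /* too soon, keep accumulating locally */
    FLUSH_NOWAIT,               /* try to flush, postpone on lock failure */
    FLUSH_WAIT                  /* flush and wait for locks if necessary */
} flush_mode;

/* Decide how a backend should flush at transaction end. */
static flush_mode
choose_flush_mode(long now_ms, long last_flush_ms)
{
    long        elapsed = now_ms - last_flush_ms;

    assert(elapsed >= 0);

    if (elapsed < MIN_FLUSH_INTERVAL_MS)
        return FLUSH_SKIP;
    if (elapsed < MAX_NOWAIT_INTERVAL_MS)
        return FLUSH_NOWAIT;
    return FLUSH_WAIT;
}
```

The middle band is where a conditional lock is needed: between the two
thresholds a busy entry is left for a later attempt rather than stalling the
transaction.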
 src/backend/access/heap/heapam_handler.c      |    4 +-
 src/backend/access/heap/vacuumlazy.c          |    2 +-
 src/backend/access/transam/xlog.c             |    4 +-
 src/backend/catalog/index.c                   |   24 +-
 src/backend/commands/analyze.c                |    8 +-
 src/backend/commands/dbcommands.c             |    2 +-
 src/backend/commands/matview.c                |    8 +-
 src/backend/commands/vacuum.c                 |    4 +-
 src/backend/postmaster/autovacuum.c           |   74 +-
 src/backend/postmaster/bgwriter.c             |    4 +-
 src/backend/postmaster/checkpointer.c         |   46 +-
 src/backend/postmaster/pgarch.c               |   14 +-
 src/backend/postmaster/pgstat.c               | 4869 +++++++----------
 src/backend/postmaster/postmaster.c           |   88 +-
 src/backend/replication/basebackup.c          |    4 +-
 src/backend/storage/buffer/bufmgr.c           |    8 +-
 src/backend/storage/ipc/ipci.c                |    2 +
 src/backend/storage/lmgr/lwlock.c             |    1 +
 src/backend/storage/lmgr/lwlocknames.txt      |    1 +
 src/backend/storage/smgr/smgr.c               |   12 +-
 src/backend/tcop/postgres.c                   |   37 +-
 src/backend/utils/adt/pgstatfuncs.c           |   55 +-
 src/backend/utils/init/globals.c              |    1 +
 src/backend/utils/init/miscinit.c             |    3 -
 src/backend/utils/init/postinit.c             |   11 +
 src/backend/utils/misc/guc.c                  |   12 +-
 src/backend/utils/misc/postgresql.conf.sample |    2 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |    4 +-
 src/include/miscadmin.h                       |    3 +-
 src/include/pgstat.h                          |  514 +-
 src/include/storage/lwlock.h                  |    1 +
 src/include/utils/guc_tables.h                |    2 +-
 src/include/utils/timeout.h                   |    1 +
 33 files changed, 2136 insertions(+), 3689 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ca52846b97..d1be516db4 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1074,8 +1074,8 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
                  * our own.  In this case we should count and sample the row,
                  * to accommodate users who load a table and analyze it in one
                  * transaction.  (pgstat_report_analyze has to adjust the
-                 * numbers we send to the stats collector to make this come
-                 * out right.)
+                 * numbers we report to the activity stats facility to make
+                 * this come out right.)
                  */
                 if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(targtuple->t_data)))
                 {
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 04b12342b8..36e2d542f6 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -584,7 +584,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
                         new_min_multi,
                         false);
 
-    /* report results to the stats collector, too */
+    /* report results to the activity stats facility, too */
     pgstat_report_vacuum(RelationGetRelid(onerel),
                          onerel->rd_rel->relisshared,
                          new_live_tuples,
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d75a257da2..99beaf0215 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8594,9 +8594,9 @@ LogCheckpointEnd(bool restartpoint)
                         &sync_secs, &sync_usecs);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time +=
+    BgWriterStats.checkpoint_write_time +=
         write_secs * 1000 + write_usecs / 1000;
-    BgWriterStats.m_checkpoint_sync_time +=
+    BgWriterStats.checkpoint_sync_time +=
         sync_secs * 1000 + sync_usecs / 1000;
 
     /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index bd7ec923e9..46fe9fd85f 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1699,28 +1699,10 @@ index_concurrently_swap(Oid newIndexId, Oid oldIndexId, const char *oldName)
 
     /*
      * Copy over statistics from old to new index
+     * The data will be flushed by the next pgstat_report_stat()
+     * call.
      */
-    {
-        PgStat_StatTabEntry *tabentry;
-
-        tabentry = pgstat_fetch_stat_tabentry(oldIndexId);
-        if (tabentry)
-        {
-            if (newClassRel->pgstat_info)
-            {
-                newClassRel->pgstat_info->t_counts.t_numscans = tabentry->numscans;
-                newClassRel->pgstat_info->t_counts.t_tuples_returned = tabentry->tuples_returned;
-                newClassRel->pgstat_info->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_hit = tabentry->blocks_hit;
-
-                /*
-                 * The data will be sent by the next pgstat_report_stat()
-                 * call.
-                 */
-            }
-        }
-    }
+    pgstat_copy_index_counters(oldIndexId, newClassRel->pgstat_info);
 
     /* Close relations */
     table_close(pg_class, RowExclusiveLock);
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 924ef37c81..06b03cb8e1 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -655,10 +655,10 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
     }
 
     /*
-     * Report ANALYZE to the stats collector, too.  However, if doing
-     * inherited stats we shouldn't report, because the stats collector only
-     * tracks per-table stats.  Reset the changes_since_analyze counter only
-     * if we analyzed all columns; otherwise, there is still work for
+     * Report ANALYZE to the activity stats facility, too.  However, if doing
+     * inherited stats we shouldn't report, because the activity stats facility
+     * only tracks per-table stats.  Reset the changes_since_analyze counter
+     * only if we analyzed all columns; otherwise, there is still work for
      * auto-analyze to do.
      */
     if (!inh)
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 367c30adb0..a18e7068ae 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -971,7 +971,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
     DropDatabaseBuffers(db_id);
 
     /*
-     * Tell the stats collector to forget it immediately, too.
+     * Tell the activity stats facility to forget it immediately, too.
      */
     pgstat_drop_database(db_id);
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index e5a5eef102..4151fa5335 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -336,10 +336,10 @@ ExecRefreshMatView(RefreshMatViewStmt *stmt, const char *queryString,
         refresh_by_heap_swap(matviewOid, OIDNewHeap, relpersistence);
 
         /*
-         * Inform stats collector about our activity: basically, we truncated
-         * the matview and inserted some new data.  (The concurrent code path
-         * above doesn't need to worry about this because the inserts and
-         * deletes it issues get counted by lower-level code.)
+         * Inform activity stats facility about our activity: basically, we
+         * truncated the matview and inserted some new data.  (The concurrent
+         * code path above doesn't need to worry about this because the inserts
+         * and deletes it issues get counted by lower-level code.)
          */
         pgstat_count_truncate(matviewRel);
         if (!stmt->skipData)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 59731d687f..4e4d34d63b 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -321,8 +321,8 @@ vacuum(List *relations, VacuumParams *params,
                  errmsg("VACUUM option DISABLE_PAGE_SKIPPING cannot be used with FULL")));
 
     /*
-     * Send info about dead objects to the statistics collector, unless we are
-     * in autovacuum --- autovacuum.c does this for itself.
+     * Send info about dead objects to the activity statistics facility, unless
+     * we are in autovacuum --- autovacuum.c does this for itself.
      */
     if ((params->options & VACOPT_VACUUM) && !IsAutoVacuumWorkerProcess())
         pgstat_vacuum_stat();
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 7e97ffab27..19a2357f0d 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -338,9 +338,6 @@ static void autovacuum_do_vac_analyze(autovac_table *tab,
                                       BufferAccessStrategy bstrategy);
 static AutoVacOpts *extract_autovac_opts(HeapTuple tup,
                                          TupleDesc pg_class_desc);
-static PgStat_StatTabEntry *get_pgstat_tabentry_relid(Oid relid, bool isshared,
-                                                      PgStat_StatDBEntry *shared,
-                                                      PgStat_StatDBEntry *dbentry);
 static void perform_work_item(AutoVacuumWorkItem *workitem);
 static void autovac_report_activity(autovac_table *tab);
 static void autovac_report_workitem(AutoVacuumWorkItem *workitem,
@@ -1665,12 +1662,12 @@ AutoVacWorkerMain(int argc, char *argv[])
         char        dbname[NAMEDATALEN];
 
         /*
-         * Report autovac startup to the stats collector.  We deliberately do
-         * this before InitPostgres, so that the last_autovac_time will get
-         * updated even if the connection attempt fails.  This is to prevent
-         * autovac from getting "stuck" repeatedly selecting an unopenable
-         * database, rather than making any progress on stuff it can connect
-         * to.
+         * Report autovac startup to the activity stats facility.  We
+         * deliberately do this before InitPostgres, so that the
+         * last_autovac_time will get updated even if the connection attempt
+         * fails.  This is to prevent autovac from getting "stuck" repeatedly
+         * selecting an unopenable database, rather than making any progress on
+         * stuff it can connect to.
          */
         pgstat_report_autovac(dbid);
 
@@ -1938,8 +1935,6 @@ do_autovacuum(void)
     HASHCTL        ctl;
     HTAB       *table_toast_map;
     ListCell   *volatile cell;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     BufferAccessStrategy bstrategy;
     ScanKeyData key;
     TupleDesc    pg_class_desc;
@@ -1958,17 +1953,11 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /*
-     * may be NULL if we couldn't find an entry (only happens if we are
-     * forcing a vacuum for anti-wrap purposes).
-     */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
 
     /*
-     * Clean up any dead statistics collector entries for this DB. We always
+     * Clean up any dead activity statistics entries for this DB. We always
      * want to do this exactly once per DB-processing cycle, even if we find
      * nothing worth vacuuming in the database.
      */
@@ -2011,9 +2000,6 @@ do_autovacuum(void)
     /* StartTransactionCommand changed elsewhere */
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-
     classRel = table_open(RelationRelationId, AccessShareLock);
 
     /* create a copy so we can use it after closing pg_class */
@@ -2092,8 +2078,8 @@ do_autovacuum(void)
 
         /* Fetch reloptions and the pgstat entry for this table */
         relopts = extract_autovac_opts(tuple, pg_class_desc);
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                       relid);
 
         /* Check if it needs vacuum or analyze */
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
@@ -2176,8 +2162,8 @@ do_autovacuum(void)
         }
 
         /* Fetch the pgstat entry for this table */
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                       relid);
 
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
@@ -2736,29 +2722,6 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
     return av;
 }
 
-/*
- * get_pgstat_tabentry_relid
- *
- * Fetch the pgstat entry of a table, either local to a database or shared.
- */
-static PgStat_StatTabEntry *
-get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
-                          PgStat_StatDBEntry *dbentry)
-{
-    PgStat_StatTabEntry *tabentry = NULL;
-
-    if (isshared)
-    {
-        if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
-    }
-    else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
-
-    return tabentry;
-}
 
 /*
  * table_recheck_autovac
@@ -2779,17 +2742,12 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     bool        doanalyze;
     autovac_table *tab = NULL;
     PgStat_StatTabEntry *tabentry;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     bool        wraparound;
     AutoVacOpts *avopts;
 
     /* use fresh stats */
     autovac_refresh_stats();
 
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* fetch the relation's relcache entry */
     classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
     if (!HeapTupleIsValid(classTup))
@@ -2813,8 +2771,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     }
 
     /* fetch the pgstat table entry */
-    tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                         shared, dbentry);
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                   relid);
 
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
@@ -2936,7 +2894,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
  *
  * For analyze, the analysis done is that the number of tuples inserted,
  * deleted and updated since the last analyze exceeds a threshold calculated
- * in the same fashion as above.  Note that the collector actually stores
+ * in the same fashion as above.  Note that the activity stats facility stores
  * the number of tuples (both live and dead) that there were as of the last
  * analyze.  This is asymmetric to the VACUUM case.
  *
@@ -2946,8 +2904,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
  *
  * A table whose autovacuum_enabled option is false is
  * automatically skipped (unless we have to vacuum it due to freeze_max_age).
- * Thus autovacuum can be disabled for specific tables. Also, when the stats
- * collector does not have data about a table, it will be skipped.
+ * Thus autovacuum can be disabled for specific tables. Also, when the activity
+ * stats facility does not have data about a table, it will be skipped.
  *
  * A table whose vac_base_thresh value is < 0 takes the base value from the
  * autovacuum_vacuum_threshold GUC variable.  Similarly, a vac_scale_factor
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 069e27e427..4382b1726f 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -234,9 +234,9 @@ BackgroundWriterMain(void)
         can_hibernate = BgBufferSync(&wb_context);
 
         /*
-         * Send off activity statistics to the stats collector
+         * Send off activity statistics to the activity stats facility
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index e354a78725..12f06a316d 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -350,7 +350,7 @@ CheckpointerMain(void)
         if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
         {
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            BgWriterStats.requested_checkpoints++;
         }
 
         /*
@@ -364,7 +364,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                BgWriterStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -486,13 +486,13 @@ CheckpointerMain(void)
         CheckArchiveTimeout();
 
         /*
-         * Send off activity statistics to the stats collector.  (The reason
-         * why we re-use bgwriter-related code for this is that the bgwriter
-         * and checkpointer used to be just one process.  It's probably not
-         * worth the trouble to split the stats support into two independent
-         * stats message types.)
+         * Send off activity statistics to the activity stats facility.  (The
+         * reason why we re-use bgwriter-related code for this is that the
+         * bgwriter and checkpointer used to be just one process.  It's
+         * probably not worth the trouble to split the stats support into two
+         * independent stats message types.)
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         /*
          * Sleep until we are signaled or it's time for another checkpoint or
@@ -533,29 +533,29 @@ HandleCheckpointerInterrupts(void)
         ProcessConfigFile(PGC_SIGHUP);
 
         /*
-         * Checkpointer is the last process to shut down, so we ask it to
-         * hold the keys for a range of other tasks required most of which
-         * have nothing to do with checkpointing at all.
+         * Checkpointer is the last process to shut down, so we ask it to hold
+         * the keys for a range of other tasks required most of which have
+         * nothing to do with checkpointing at all.
          *
-         * For various reasons, some config values can change dynamically
-         * so the primary copy of them is held in shared memory to make
-         * sure all backends see the same value.  We make Checkpointer
-         * responsible for updating the shared memory copy if the
-         * parameter setting changes because of SIGHUP.
+         * For various reasons, some config values can change dynamically so
+         * the primary copy of them is held in shared memory to make sure all
+         * backends see the same value.  We make Checkpointer responsible for
+         * updating the shared memory copy if the parameter setting changes
+         * because of SIGHUP.
          */
         UpdateSharedMemoryConfig();
     }
     if (ShutdownRequestPending)
     {
         /*
-         * From here on, elog(ERROR) should end with exit(1), not send
-         * control back to the sigsetjmp block above
+         * From here on, elog(ERROR) should end with exit(1), not send control
+         * back to the sigsetjmp block above
          */
         ExitOnAnyError = true;
         /* Close down the database */
         ShutdownXLOG(0, 0);
         /* Normal exit from the checkpointer is here */
-        proc_exit(0);        /* done */
+        proc_exit(0);            /* done */
     }
 }
 
@@ -691,9 +691,9 @@ CheckpointWriteDelay(int flags, double progress)
         CheckArchiveTimeout();
 
         /*
-         * Report interim activity statistics to the stats collector.
+         * Report interim activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1238,8 +1238,8 @@ AbsorbSyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    BgWriterStats.buf_written_backend += CheckpointerShmem->num_backend_writes;
+    BgWriterStats.buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 7b66cb5613..f271714f45 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -173,7 +173,7 @@ PgArchiverMain(void)
 
     XLogArchiveWakeupStart();
     on_shmem_exit(PgArchiverKill, 0);
-    
+
     pgarch_MainLoop();
 
     exit(0);
@@ -393,20 +393,20 @@ pgarch_ArchiverCopyLoop(void)
                 pgarch_archiveDone(xlog);
 
                 /*
-                 * Tell the collector about the WAL file that we successfully
-                 * archived
+                 * Tell the activity statistics facility about the WAL file
+                 * that we successfully archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_report_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
             else
             {
                 /*
-                 * Tell the collector about the WAL file that we failed to
-                 * archive
+                 * Tell the activity statistics facility about the WAL file
+                 * that we failed to archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_report_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ab42df7e1b..6e76cc40bc 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,22 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Activity Statistics facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects activity statistics, e.g. per-table access statistics, of
+ *  all backends in shared memory. The activity numbers are first stored
+ *  locally in each process, then flushed to shared memory at commit
+ *  time or by idle-timeout.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ * To avoid congestion on the shared memory, shared stats are updated no more
+ * often than once per PGSTAT_MIN_INTERVAL (1000ms). If some local numbers
+ * remain unflushed due to lock failure, we retry at intervals that start at
+ * PGSTAT_RETRY_MIN_INTERVAL (250ms) and double at every retry. Finally we
+ * force an update PGSTAT_MAX_INTERVAL (10000ms) after the first attempt.
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the activity statistics facility creates the
+ *  area, then loads the stored stats file if any. The last process at shutdown
+ *  writes the shared stats to the file, then destroys the area before exit.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -19,18 +26,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -40,68 +35,43 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/fork_process.h"
 #include "postmaster/interrupt.h"
 #include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
-#include "utils/timestamp.h"
 
 /* ----------
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
+#define PGSTAT_MIN_INTERVAL            1000    /* Minimum interval of stats data
+                                             * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_RETRY_MIN_INTERVAL    250 /* Initial retry interval after
+                                         * PGSTAT_MIN_INTERVAL */
 
+#define PGSTAT_MAX_INTERVAL            10000    /* Longest interval of stats data
+                                             * updates */
 
 /* ----------
- * The initial size hints for the hash tables used in the collector.
+ * The initial size hints for the hash tables used in the activity statistics.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
-#define PGSTAT_TAB_HASH_SIZE    512
+#define PGSTAT_TABLE_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
 
@@ -116,7 +86,6 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
-
 /* ----------
  * GUC parameters
  * ----------
@@ -131,75 +100,163 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used, but will be removed with GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
 /*
- * BgWriter global statistics counters (unused in other processes).
- * Stored directly in a stats message structure so it can be sent
- * without needing to copy things around.  We assume this inits to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
-
-/* ----------
- * Local data
- * ----------
+ * Shared stats bootstrap information, protected by StatsLock.
+ *
+ * refcount is used to know whether a process going to detach shared stats is
+ * the last process or not. The last process writes out the stats files.
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
+typedef struct StatsShmemStruct
+{
+    dsa_handle    stats_dsa_handle;    /* handle for stats data area */
+    dshash_table_handle hash_handle;    /* shared dbstat hash */
+    dsa_pointer global_stats;    /* DSA pointer to global stats */
+    dsa_pointer archiver_stats; /* Ditto for archiver stats */
+    int            refcount;        /* # of processes attached to the shared
+                                 * stats memory */
+}            StatsShmemStruct;
 
-static time_t last_pgstat_start_time;
+/* BgWriter global statistics counters */
+PgStat_BgWriter BgWriterStats = {0};
 
-static bool pgStatRunningInCollector = false;
+/* backend-lifetime storage */
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
 
 /*
- * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
+ * Enums and types to define shared statistics structure.
+ *
+ * Statistics entries for each object are stored in individual DSA-allocated
+ * memory. Every entry is pointed to from the dshash pgStatSharedHash via
+ * dsa_pointer. This keeps object-stats entries from being moved by dshash
+ * resizing, and allows the dshash to release its lock sooner on stats
+ * updates. It also reduces interference among write-locks on each stat entry
+ * by not relying on the partition lock of dshash. PgStatLocalHashEntry is the
+ * local-stats equivalent of PgStatHashEntry for shared stat entries.
+ *
+ * Each stat entry is enveloped by the type PgStatEnvelope, which stores the
+ * attributes common to all kinds of statistics and an LWLock.
  *
- * NOTE: once allocated, TabStatusArray structures are never moved or deleted
- * for the life of the backend.  Also, we zero out the t_id fields of the
- * contained PgStat_TableStatus structs whenever they are not actively in use.
- * This allows relcache pgstat_info pointers to be treated as long-lived data,
- * avoiding repeated searches in pgstat_initstats() when a relation is
- * repeatedly opened during a transaction.
+ * Shared stats are stored as:
+ *
+ * dshash pgStatSharedHash
+ *    -> PgStatHashEntry                (dshash entry)
+ *      (dsa_pointer)-> PgStatEnvelope    (dsa memory block)
+ *
+ * Local stats are stored as:
+ *
+ * dynahash pgStatLocalHash
+ *    -> PgStatLocalHashEntry            (dynahash entry)
+ *      (direct pointer)-> PgStatEnvelope (palloc'ed memory)
  */
-#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
 
-typedef struct TabStatusArray
+/* The types of statistics entries. */
+typedef enum PgStatTypes
 {
-    struct TabStatusArray *tsa_next;    /* link to next array, if any */
-    int            tsa_used;        /* # entries currently used */
-    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
-} TabStatusArray;
-
-static TabStatusArray *pgStatTabList = NULL;
+    PGSTAT_TYPE_ALL,            /* Not a type, for the parameters of
+                                 * collect_stat_entries */
+    PGSTAT_TYPE_DB,                /* database-wide statistics */
+    PGSTAT_TYPE_TABLE,            /* per-table statistics */
+    PGSTAT_TYPE_FUNCTION,        /* per-function statistics */
+}            PgStatTypes;
 
 /*
- * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
+ * entry size lookup table of shared statistics entries corresponding to
+ * PgStatTypes
  */
-typedef struct TabStatHashEntry
+static size_t pgstat_entsize[] =
 {
-    Oid            t_id;
-    PgStat_TableStatus *tsa_entry;
-} TabStatHashEntry;
+    0,                            /* PGSTAT_TYPE_ALL: not an entry */
+    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_StatTabEntry),    /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_StatFuncEntry)    /* PGSTAT_TYPE_FUNCTION */
+};
 
-/*
- * Hash table for O(1) t_id -> tsa_entry lookup
- */
-static HTAB *pgStatTabHash = NULL;
+/* Ditto for local statistics entries */
+static size_t pgstat_localentsize[] =
+{
+    0,                            /* PGSTAT_TYPE_ALL: not an entry */
+    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_TableStatus), /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_BackendFunctionEntry) /* PGSTAT_TYPE_FUNCTION */
+};
+
+/* struct for shared statistics hash entry key. */
+typedef struct PgStatHashEntryKey
+{
+    PgStatTypes type;            /* statistics entry type */
+    Oid            databaseid;        /* database ID. InvalidOid for shared objects. */
+    Oid            objectid;        /* object OID */
+}            PgStatHashEntryKey;
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * Stats numbers that are waiting to be flushed out to shared stats are held in
+ * pgStatLocalHash.
  */
-static HTAB *pgStatFunctions = NULL;
+typedef struct PgStatHashEntry
+{
+    PgStatHashEntryKey key;        /* hash key */
+    dsa_pointer env;            /* pointer to shared stats envelope in DSA */
+}            PgStatHashEntry;
+
+/* struct for shared statistics entry pointed from shared hash entry. */
+typedef struct PgStatEnvelope
+{
+    PgStatTypes type;            /* statistics entry type */
+    Oid            databaseid;        /* databaseid */
+    Oid            objectid;        /* objectid */
+    size_t        len;            /* length of body, fixed per type. */
+    LWLock        lock;            /* lightweight lock to protect body */
+    int            body[FLEXIBLE_ARRAY_MEMBER];    /* statistics body */
+}            PgStatEnvelope;
+
+#define PgStatEnvelopeSize(bodylen) \
+    (offsetof(PgStatEnvelope, body) + (bodylen))
+
+/* struct for shared statistics local hash entry. */
+typedef struct PgStatLocalHashEntry
+{
+    PgStatHashEntryKey key;        /* hash key */
+    PgStatEnvelope *env;        /* pointer to stats envelope in heap */
+}            PgStatLocalHashEntry;
 
 /*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
+ * A snapshot is a stats entry that is copied locally to offer stable values
+ * for a transaction.
  */
-static bool have_function_stats = false;
+typedef struct PgStatSnapshot
+{
+    PgStatHashEntryKey key;
+    bool        negative;
+    int            body[FLEXIBLE_ARRAY_MEMBER];    /* statistics body */
+}            PgStatSnapshot;
+
+#define PgStatSnapshotSize(bodylen)                \
+    (offsetof(PgStatSnapshot, body) + (bodylen))
+
+/* parameter for shared hashes */
+static const dshash_parameters dsh_rootparams = {
+    sizeof(PgStatHashEntryKey),
+    sizeof(PgStatHashEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+
+/* The shared hash to index activity stats entries. */
+static dshash_table *pgStatSharedHash = NULL;
+
+/* Local stats numbers are stored here */
+static HTAB *pgStatLocalHash = NULL;
+
+#define HAVE_ANY_PENDING_STATS()                        \
+    (pgStatLocalHash != NULL ||                            \
+     pgStatXactCommit != 0 || pgStatXactRollback != 0)
 
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
@@ -236,11 +293,10 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
+static MemoryContext pgStatSnapshotContext = NULL;
+static HTAB *pgStatSnapshotHash = NULL;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -249,19 +305,17 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Cluster wide statistics.
+ *
+ * Contains statistics that are collected neither per database nor per table.
+ * The shared_* variables point to shared memory and the snapshot_* variables
+ * are backend snapshots.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
+static bool global_snapshot_is_valid = false;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats snapshot_globalStats;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -270,523 +324,269 @@ static List *pending_write_requests = NIL;
  */
 static instr_time total_func_time;
 
+/*
+ * Newly created shared stats entries need to be initialized before other
+ * processes can access them.  get_stat_entry() calls this for that purpose.
+ */
+typedef void (*entry_initializer) (PgStatEnvelope * env);
 
 /* ----------
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgstat_beshutdown_hook(int code, Datum arg);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                                                 Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStatEnvelope * get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                                       bool nowait,
+                                       entry_initializer initfunc, bool *found);
+static PgStatEnvelope * *collect_stat_entries(PgStatTypes type, Oid dbid);
+static void create_missing_dbentries(void);
+static void pgstat_write_database_stats(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_db_statsfile(PgStat_StatDBEntry *dbentry);
+
+static bool flush_tabstat(PgStatEnvelope * env, bool nowait);
+static bool flush_funcstat(PgStatEnvelope * env, bool nowait);
+static bool flush_dbstat(PgStatEnvelope * env, bool nowait);
+
+static void init_dbentry(PgStatEnvelope * env);
+static void init_funcentry(PgStatEnvelope * env);
+static void init_tabentry(PgStatEnvelope * env);
+
+static bool delete_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                              bool nowait);
+static PgStatEnvelope * get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                                             bool create, bool *found);
+static PgStat_StatDBEntry *get_local_dbstat_entry(Oid dbid);
+static PgStat_TableStatus *get_local_tabstat_entry(Oid rel_id, bool isshared);
+
+static PgStat_SubXactStatus *get_tabstat_stack_level(int nest_level);
+static void add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level);
+static void pgstat_snapshot_global_stats(void);
+
 static void pgstat_read_current_status(void);
-
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
-
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
-static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
-
-static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
-
-static void pgstat_setup_memcxt(void);
-
 static const char *pgstat_get_wait_activity(WaitEventActivity w);
 static const char *pgstat_get_wait_client(WaitEventClient w);
 static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute shared memory space needed for activity statistics
+ */
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+/*
+ * StatsShmemInit - initialize during shared-memory creation
  */
 void
-pgstat_init(void)
+StatsShmemInit(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
+    bool        found;
 
-#define TESTBYTEVAL ((char) 199)
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
 
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
-
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
+    if (!IsUnderPostmaster)
     {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
     }
+}
 
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ *    Create pgStatLocalContext and pgStatSnapshotContext, if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+    if (!pgStatLocalContext)
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+
+    if (!pgStatSnapshotContext)
+        pgStatSnapshotContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Database statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+}
 
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
 
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
+/* ----------
+ * attach_shared_stats() -
+ *
+ *    Attach or create the shared stats memory. If we are the first process to
+ *    use the activity stats system, read the saved statistics files, if any.
+ * ---------
+ */
+static void
+attach_shared_stats(void)
+{
+    MemoryContext oldcontext;
 
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    /*
+     * Don't use DSM outside postmaster children, or when not tracking counts.
+     */
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
 
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    pgstat_setup_memcxt();
 
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    if (area)
+        return;
 
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
 
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
+    /*
+     * The last process is responsible for writing out the stats files at
+     * exit.  Maintain a refcount so that an exiting process can tell whether
+     * it is the last one.
+     */
+    if (StatsShmem->refcount > 0)
+        StatsShmem->refcount++;
+    else
+    {
+        /* We're the first process to attach to the shared stats memory */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
 
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        /* Initialize shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatSharedHash = dshash_create(area, &dsh_rootparams, 0);
 
-        test_byte++;            /* just make sure variable is changed */
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->global_stats =
+            dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+        StatsShmem->archiver_stats =
+            dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+        StatsShmem->hash_handle = dshash_get_hash_table_handle(pgStatSharedHash);
 
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
 
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        /* Load saved data if any. */
+        pgstat_read_statsfiles();
 
-        /* If we get here, we have a working socket */
-        break;
+        StatsShmem->refcount = 1;
     }
 
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
+    LWLockRelease(StatsLock);
 
     /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
+     * If we're not the first process, attach to the existing shared stats
+     * area outside the StatsLock section.
      */
-    if (!pg_set_noblock(pgStatSock))
+    if (!area)
     {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
+        /* Attach shared area. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatSharedHash = dshash_attach(area, &dsh_rootparams,
+                                         StatsShmem->hash_handle, 0);
+
+        /* Setup local variables */
+        pgStatLocalHash = NULL;
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
     }
 
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
-    }
-
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    /* Now that we have a long-lived socket, tell fd.c about it. */
-    ReserveExternalFD();
+    MemoryContextSwitchTo(oldcontext);
 
-    return;
-
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
-
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
-
-    /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
-     */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    /* don't detach automatically */
+    dsa_pin_mapping(area);
+    global_snapshot_is_valid = false;
 }
 
-/*
- * subroutine for pgstat_reset_all
+/* ----------
+ * detach_shared_stats() -
+ *
+ *    Detach shared stats. Write out to file if we're the last process and told
+ *    to do so.
+ * ----------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+detach_shared_stats(bool write_stats)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    /* return immediately if there is nothing to detach */
+    if (!area || !IsUnderPostmaster)
+        return;
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
-    {
-        int            nchars;
-        Oid            tmp_oid;
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
+    if (--StatsShmem->refcount < 1)
+    {
         /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
+         * This process is the last one attached to the shared stats memory.
+         * Write out the stats files if requested.
          */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
-
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
+        if (write_stats)
+            pgstat_write_statsfiles();
 
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+        /* No one is using the area. */
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
     }
-    FreeDir(dir);
+
+    LWLockRelease(StatsLock);
+
+    /*
+     * Detach the area.  It is destroyed automatically when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);
+
+    area = NULL;
+    pgStatSharedHash = NULL;
+    shared_globalStats = NULL;
+    shared_archiverStats = NULL;
+    pgStatLocalHash = NULL;
+    global_snapshot_is_valid = false;
 }
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
-}
+    /* standalone server doesn't use shared stats */
+    if (!IsUnderPostmaster)
+        return;
 
-#ifdef EXEC_BACKEND
+    /* we must have shared stats attached */
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
-    char       *av[10];
-    int            ac = 0;
-
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
-
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
-
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
-
-
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
-
-    /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
-     */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
-
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
+    /* Startup must be the only user of shared stats */
+    Assert(StatsShmem->refcount == 1);
 
     /*
-     * Okay, fork off the collector.
+     * We could remove the files directly and recreate the shared memory
+     * area, but for simplicity we just discard the area and create it again.
      */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
+    detach_shared_stats(false); /* Don't write files. */
+    attach_shared_stats();
 }
 
 /* ------------------------------------------------------------
@@ -794,144 +594,488 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *    Updates are applied no more often than once every PGSTAT_MIN_INTERVAL
+ *    milliseconds.  They are also postponed on lock failure if force is
+ *    false and no update has been pending for longer than
+ *    PGSTAT_MAX_INTERVAL milliseconds.  Postponed updates are retried in
+ *    subsequent calls of this function.
+ *
+ *    Returns the time, in milliseconds, until updates are next applied, if
+ *    no updates have been held for more than PGSTAT_MIN_INTERVAL
+ *    milliseconds.
+ *
+ *    Note that this is called only outside a transaction, so it is fine to use
  *    transaction stop time as an approximation of current time.
- * ----------
+ *    ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz next_flush = 0;
+    static TimestampTz pending_since = 0;
+    static long retry_interval = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
+    bool        nowait = !force;    /* don't use force after this point */
+    HASH_SEQ_STATUS scan;
+    PgStatLocalHashEntry *lent;
+    PgStatLocalHashEntry **dbentlist;
+    int            dbentlistlen = 8;
+    int            ndbentries = 0;
+    int            remains = 0;
     int            i;
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (area == NULL || !HAVE_ANY_PENDING_STATS())
+        return 0;
+
+    dbentlist = palloc(sizeof(PgStatLocalHashEntry *) * dbentlistlen);
 
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
     now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
 
-    /*
-     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
-     */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
-    pgStatTabHash = NULL;
+    if (nowait)
+    {
+        /*
+         * Don't flush stats too frequently.  Return the time to the next
+         * flush.
+         */
+        if (now < next_flush)
+        {
+            /* Record when updates first became pending. */
+            if (pending_since == 0)
+                pending_since = now;
+
+            return (next_flush - now) / 1000;
+        }
+
+        /* But, don't keep pending updates longer than PGSTAT_MAX_INTERVAL. */
+
+        if (pending_since > 0 &&
+            TimestampDifferenceExceeds(pending_since, now, PGSTAT_MAX_INTERVAL))
+            nowait = false;
+    }
 
     /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
+     * flush_tabstat adds some counters of the flushed entries to the local
+     * database stats, so flush out the database stats afterwards.
      */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
-    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+    if (pgStatLocalHash)
     {
-        for (i = 0; i < tsa->tsa_used; i++)
+        /* Step 1: flush out everything except database stats */
+        hash_seq_init(&scan, pgStatLocalHash);
+        while ((lent = (PgStatLocalHashEntry *) hash_seq_search(&scan)) != NULL)
         {
-            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
+            bool        remove = false;
 
-            /* Shouldn't have any pending transaction-dependent counts */
-            Assert(entry->trans == NULL);
+            switch (lent->env->type)
+            {
+                case PGSTAT_TYPE_DB:
+                    if (ndbentries >= dbentlistlen)
+                    {
+                        dbentlistlen *= 2;
+                        dbentlist = repalloc(dbentlist,
+                                             sizeof(PgStatLocalHashEntry *) *
+                                             dbentlistlen);
+                    }
+                    dbentlist[ndbentries++] = lent;
+                    break;
+                case PGSTAT_TYPE_TABLE:
+                    if (flush_tabstat(lent->env, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_FUNCTION:
+                    if (flush_funcstat(lent->env, nowait))
+                        remove = true;
+                    break;
+                default:
+                    Assert(false);
+            }
 
-            /*
-             * Ignore entries that didn't accumulate any actual counts, such
-             * as indexes that were opened by the planner but not used.
-             */
-            if (memcmp(&entry->t_counts, &all_zeroes,
-                       sizeof(PgStat_TableCounts)) == 0)
+            if (!remove)
+            {
+                remains++;
                 continue;
+            }
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* Remove the successfully flushed entry */
+            pfree(lent->env);
+            hash_search(pgStatLocalHash, &lent->key, HASH_REMOVE, NULL);
+        }
+
+        /* Step 2: flush out database stats */
+        for (i = 0; i < ndbentries; i++)
+        {
+            PgStatLocalHashEntry *lent = dbentlist[i];
+
+            if (flush_dbstat(lent->env, nowait))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                remains--;
+                /* Remove the successfully flushed entry */
+                pfree(lent->env);
+                hash_search(pgStatLocalHash, &lent->key, HASH_REMOVE, NULL);
             }
         }
-        /* zero out PgStat_TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
+        pfree(dbentlist);
+
+        if (remains <= 0)
+        {
+            hash_destroy(pgStatLocalHash);
+            pgStatLocalHash = NULL;
+        }
+    }
+
+    /* Publish the last flush time */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (shared_globalStats->stats_timestamp < now)
+        shared_globalStats->stats_timestamp = now;
+    LWLockRelease(StatsLock);
+
+    /*
+     * If we have pending local stats, let the caller know the retry interval.
+     */
+    if (HAVE_ANY_PENDING_STATS())
+    {
+        /* Retain the epoch time */
+        if (pending_since == 0)
+            pending_since = now;
+
+        /* The interval is doubled at every retry. */
+        if (retry_interval == 0)
+            retry_interval = PGSTAT_RETRY_MIN_INTERVAL * 1000;
+        else
+            retry_interval = retry_interval * 2;
+
+        /*
+         * Determine the next retry interval so as not to get shorter than the
+         * previous interval.
+         */
+        if (!TimestampDifferenceExceeds(pending_since,
+                                        now + 2 * retry_interval,
+                                        PGSTAT_MAX_INTERVAL))
+            next_flush = now + retry_interval;
+        else
+        {
+            next_flush = pending_since + PGSTAT_MAX_INTERVAL * 1000;
+            retry_interval = next_flush - now;
+        }
+
+        return retry_interval / 1000;
     }
 
+    /* Set the next time to update stats */
+    next_flush = now + PGSTAT_MIN_INTERVAL * 1000;
+    retry_interval = 0;
+    pending_since = 0;
+
+    return 0;
+}
+
+/*
+ * flush_tabstat - flush out a local table stats entry
+ *
+ * Some of the stats numbers are copied to the local database stats entry
+ * after a successful flush-out.
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_tabstat(PgStatEnvelope * lenv, bool nowait)
+{
+    static const PgStat_TableCounts all_zeroes;
+    Oid            dboid;            /* database OID of the table */
+    PgStat_TableStatus *lstats; /* local stats entry  */
+    PgStatEnvelope *shenv;        /* shared stats envelope */
+    PgStat_StatTabEntry *shtabstats;    /* table entry of shared stats */
+    PgStat_StatDBEntry *ldbstats;    /* local database entry */
+    bool        found;
+
+    Assert(lenv->type == PGSTAT_TYPE_TABLE);
+
+    lstats = (PgStat_TableStatus *) &lenv->body;
+    dboid = lstats->t_shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Ignore entries that didn't accumulate any actual counts, such as
+     * indexes that were opened by the planner but not used.
+     */
+    if (memcmp(&lstats->t_counts, &all_zeroes,
+               sizeof(PgStat_TableCounts)) == 0)
+        return true;
+
+    /* find shared table stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_TABLE, dboid, lstats->t_id,
+                           nowait, init_tabentry, &found);
+
+    /* skip if dshash failed to acquire lock */
+    if (shenv == NULL)
+        return false;
+
+    /* retrieve the shared table stats entry from the envelope */
+    shtabstats = (PgStat_StatTabEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;
+
+    /* add the values to the shared entry. */
+    shtabstats->numscans += lstats->t_counts.t_numscans;
+    shtabstats->tuples_returned += lstats->t_counts.t_tuples_returned;
+    shtabstats->tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    shtabstats->tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    shtabstats->tuples_updated += lstats->t_counts.t_tuples_updated;
+    shtabstats->tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    shtabstats->tuples_hot_updated += lstats->t_counts.t_tuples_hot_updated;
+
+    /*
+     * If the table was truncated or vacuum/analyze has run, first reset the
+     * live/dead counters.
+     */
+    if (lstats->t_counts.t_truncated ||
+        lstats->t_counts.vacuum_count > 0 ||
+        lstats->t_counts.analyze_count > 0 ||
+        lstats->t_counts.autovac_vacuum_count > 0 ||
+        lstats->t_counts.autovac_analyze_count > 0)
+    {
+        shtabstats->n_live_tuples = 0;
+        shtabstats->n_dead_tuples = 0;
+    }
+
+    /* clear the change counter if requested */
+    if (lstats->t_counts.reset_changed_tuples)
+        shtabstats->changes_since_analyze = 0;
+
+    shtabstats->n_live_tuples += lstats->t_counts.t_delta_live_tuples;
+    shtabstats->n_dead_tuples += lstats->t_counts.t_delta_dead_tuples;
+    shtabstats->changes_since_analyze += lstats->t_counts.t_changed_tuples;
+    shtabstats->blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    shtabstats->blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /*
+     * Update vacuum/analyze timestamps and counters so that the values never
+     * go backward.
+     */
+    if (shtabstats->vacuum_timestamp < lstats->vacuum_timestamp)
+        shtabstats->vacuum_timestamp = lstats->vacuum_timestamp;
+    shtabstats->vacuum_count += lstats->t_counts.vacuum_count;
+
+    if (shtabstats->autovac_vacuum_timestamp < lstats->autovac_vacuum_timestamp)
+        shtabstats->autovac_vacuum_timestamp = lstats->autovac_vacuum_timestamp;
+    shtabstats->autovac_vacuum_count += lstats->t_counts.autovac_vacuum_count;
+
+    if (shtabstats->analyze_timestamp < lstats->analyze_timestamp)
+        shtabstats->analyze_timestamp = lstats->analyze_timestamp;
+    shtabstats->analyze_count += lstats->t_counts.analyze_count;
+
+    if (shtabstats->autovac_analyze_timestamp < lstats->autovac_analyze_timestamp)
+        shtabstats->autovac_analyze_timestamp = lstats->autovac_analyze_timestamp;
+    shtabstats->autovac_analyze_count += lstats->t_counts.autovac_analyze_count;
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    shtabstats->n_live_tuples = Max(shtabstats->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    shtabstats->n_dead_tuples = Max(shtabstats->n_dead_tuples, 0);
+
+    LWLockRelease(&shenv->lock);
+
+    /* Flushed successfully, so add the same counts to the local database stats */
+    ldbstats = get_local_dbstat_entry(dboid);
+    ldbstats->counts.n_tuples_returned += lstats->t_counts.t_tuples_returned;
+    ldbstats->counts.n_tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    ldbstats->counts.n_tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    ldbstats->counts.n_tuples_updated += lstats->t_counts.t_tuples_updated;
+    ldbstats->counts.n_tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    ldbstats->counts.n_blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    ldbstats->counts.n_blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    return true;
+}
+
+/* ----------
+ * init_tabentry() -
+ *
+ * initializes table stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
+ */
+static void
+init_tabentry(PgStatEnvelope * env)
+{
+    PgStat_StatTabEntry *tabent = (PgStat_StatTabEntry *) &env->body;
+
     /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
+     * Reset every counter and timestamp of the table entry to zero, whether
+     * the entry is new or is being reset.
      */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
-
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+    Assert(env->type == PGSTAT_TYPE_TABLE);
+    tabent->tableid = env->objectid;
+    tabent->numscans = 0;
+    tabent->tuples_returned = 0;
+    tabent->tuples_fetched = 0;
+    tabent->tuples_inserted = 0;
+    tabent->tuples_updated = 0;
+    tabent->tuples_deleted = 0;
+    tabent->tuples_hot_updated = 0;
+    tabent->n_live_tuples = 0;
+    tabent->n_dead_tuples = 0;
+    tabent->changes_since_analyze = 0;
+    tabent->blocks_fetched = 0;
+    tabent->blocks_hit = 0;
+
+    tabent->vacuum_timestamp = 0;
+    tabent->vacuum_count = 0;
+    tabent->autovac_vacuum_timestamp = 0;
+    tabent->autovac_vacuum_count = 0;
+    tabent->analyze_timestamp = 0;
+    tabent->analyze_count = 0;
+    tabent->autovac_analyze_timestamp = 0;
+    tabent->autovac_analyze_count = 0;
 }
 
+
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * flush_funcstat - flush out a local function stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_funcstat(PgStatEnvelope * env, bool nowait)
+{
+    /* we assume this inits to all zeroes: */
+    static const PgStat_FunctionCounts all_zeroes;
+    PgStat_BackendFunctionEntry *localent;    /* local stats entry */
+    PgStatEnvelope *shenv;        /* shared stats envelope */
+    PgStat_StatFuncEntry *sharedent = NULL; /* shared stats entry */
+    bool        found;
+
+    Assert(env->type == PGSTAT_TYPE_FUNCTION);
+    localent = (PgStat_BackendFunctionEntry *) &env->body;
+
+    /* Skip it if no counts accumulated for it so far */
+    if (memcmp(&localent->f_counts, &all_zeroes,
+               sizeof(PgStat_FunctionCounts)) == 0)
+        return true;
+
+    /* find shared function stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId, localent->f_id,
+                           nowait, init_funcentry, &found);
+    /* skip if dshash failed to acquire lock */
+    if (shenv == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    /* retrieve the shared function stats entry from the envelope */
+    sharedent = (PgStat_StatFuncEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
+
+    sharedent->f_numcalls += localent->f_counts.f_numcalls;
+    sharedent->f_total_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_total_time);
+    sharedent->f_self_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_self_time);
+
+    LWLockRelease(&shenv->lock);
+
+    return true;
+}
+
+
+/* ----------
+ * init_funcentry() -
+ *
+ * initializes function stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
  */
 static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+init_funcentry(PgStatEnvelope * env)
 {
-    int            n;
-    int            len;
+    PgStat_StatFuncEntry *shstat = (PgStat_StatFuncEntry *) &env->body;
+
+    Assert(env->type == PGSTAT_TYPE_FUNCTION);
+    shstat->functionid = env->objectid;
+    shstat->f_numcalls = 0;
+    shstat->f_total_time = 0;
+    shstat->f_self_time = 0;
+}
+
+
+/*
+ * flush_dbstat - flush out a local database stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_dbstat(PgStatEnvelope * env, bool nowait)
+{
+    PgStat_StatDBEntry *localent;
+    PgStatEnvelope *shenv;
+    PgStat_StatDBEntry *sharedent;
+
+    Assert(env->type == PGSTAT_TYPE_DB);
+
+    localent = (PgStat_StatDBEntry *) &env->body;
+
+    /* find shared database stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_DB, localent->databaseid, InvalidOid,
+                           nowait, init_dbentry, NULL);
+
+    /* skip if dshash failed to acquire lock */
+    if (!shenv)
+        return false;
+
+    /* retrieve the shared stats entry from the envelope */
+    sharedent = (PgStat_StatDBEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;
+
+    sharedent->counts.n_tuples_returned += localent->counts.n_tuples_returned;
+    sharedent->counts.n_tuples_fetched += localent->counts.n_tuples_fetched;
+    sharedent->counts.n_tuples_inserted += localent->counts.n_tuples_inserted;
+    sharedent->counts.n_tuples_updated += localent->counts.n_tuples_updated;
+    sharedent->counts.n_tuples_deleted += localent->counts.n_tuples_deleted;
+    sharedent->counts.n_blocks_fetched += localent->counts.n_blocks_fetched;
+    sharedent->counts.n_blocks_hit += localent->counts.n_blocks_hit;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    sharedent->counts.n_deadlocks += localent->counts.n_deadlocks;
+    sharedent->counts.n_temp_bytes += localent->counts.n_temp_bytes;
+    sharedent->counts.n_temp_files += localent->counts.n_temp_files;
+    sharedent->counts.n_checksum_failures += localent->counts.n_checksum_failures;
 
     /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
+     * Accumulate xact commit/rollback and I/O timings to stats entry of the
+     * current database.
      */
-    if (OidIsValid(tsmsg->m_databaseid))
+    if (OidIsValid(localent->databaseid))
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
+        sharedent->counts.n_xact_commit += pgStatXactCommit;
+        sharedent->counts.n_xact_rollback += pgStatXactRollback;
+        sharedent->counts.n_block_read_time += pgStatBlockReadTime;
+        sharedent->counts.n_block_write_time += pgStatBlockWriteTime;
         pgStatXactCommit = 0;
         pgStatXactRollback = 0;
         pgStatBlockReadTime = 0;
@@ -939,257 +1083,102 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        sharedent->counts.n_xact_commit = 0;
+        sharedent->counts.n_xact_rollback = 0;
+        sharedent->counts.n_block_read_time = 0;
+        sharedent->counts.n_block_write_time = 0;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
+    LWLockRelease(&shenv->lock);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    return true;
 }
 
+
+/* ----------
+ * init_dbentry() -
+ *
+ * initializes database stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
+ */
+static void
+init_dbentry(PgStatEnvelope * env)
+{
+    PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) &env->body;
+
+    Assert(env->type == PGSTAT_TYPE_DB);
+    dbentry->databaseid = env->databaseid;
+    dbentry->last_autovac_time = 0;
+    dbentry->last_checksum_failure = 0;
+    dbentry->stat_reset_timestamp = 0;
+    dbentry->stats_timestamp = 0;
+    /* initialize the new shared entry */
+    MemSet(&dbentry->counts, 0, sizeof(PgStat_StatDBCounts));
+}
+
+
 /*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
+ * Create the filename for a DB stat file; filename is an output parameter that
+ * points to a character buffer of length len.
  */
 static void
-pgstat_send_funcstats(void)
+get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
-
-    if (pgStatFunctions == NULL)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
-
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
-
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
-
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
-
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
-
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
+    int            printed;
+
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/db_%u.%s",
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
+                       databaseid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
 }
 
 
 /* ----------
- * pgstat_vacuum_stat() -
+ * collect_stat_entries() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *    Collect the shared statistics entries specified by type and dbid. Returns a
+ *  NULL-terminated list of pointers to shared statistics in palloc'ed memory.
+ *  If type is PGSTAT_TYPE_ALL, all types of statistics of the database are
+ *  collected. If type is PGSTAT_TYPE_DB, the parameter dbid is ignored and all
+ *  PGSTAT_TYPE_DB entries are collected.
  * ----------
  */
-void
-pgstat_vacuum_stat(void)
+static PgStatEnvelope * *collect_stat_entries(PgStatTypes type, Oid dbid)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Read pg_database and make a list of OIDs of all existing databases
-     */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
-
-    /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            dbid = dbentry->databaseid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        /* the DB entry for shared tables (with InvalidOid) is never dropped */
-        if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
-            pgstat_drop_database(dbid);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Lookup our own database entry; if not found, nothing more to do.
-     */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
-        return;
-
-    /*
-     * Similarly to above, make a list of all known relations in this DB.
-     */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
-
-    /*
-     * Check for all tables listed in stats hashtable if they still exist.
-     */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
+    int            listlen = 16;
+    PgStatEnvelope **envlist = palloc(sizeof(PgStatEnvelope * *) * listlen);
+    int            n = 0;
+
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
     {
-        Oid            tabid = tabentry->tableid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+        if ((type != PGSTAT_TYPE_ALL && p->key.type != type) ||
+            (type != PGSTAT_TYPE_DB && p->key.databaseid != dbid))
             continue;
 
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
+        if (n >= listlen - 1)
         {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
+            listlen *= 2;
+            envlist = repalloc(envlist, listlen * sizeof(PgStatEnvelope * *));
         }
+        envlist[n++] = dsa_get_address(area, p->env);
     }
+    dshash_seq_term(&hstat);
 
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
-        {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
+    envlist[n] = NULL;
 
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
-                continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
-        }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
-    }
+    return envlist;
 }
 
 
 /* ----------
- * pgstat_collect_oids() -
+ * collect_oids() -
  *
  *    Collect the OIDs of all objects listed in the specified system catalog
  *    into a temporary hash table.  Caller should hash_destroy the result
@@ -1198,7 +1187,7 @@ pgstat_vacuum_stat(void)
  * ----------
  */
 static HTAB *
-pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
+collect_oids(Oid catalogid, AttrNumber anum_oid)
 {
     HTAB       *htab;
     HASHCTL        hash_ctl;
@@ -1212,7 +1201,7 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     hash_ctl.entrysize = sizeof(Oid);
     hash_ctl.hcxt = CurrentMemoryContext;
     htab = hash_create("Temporary table of OIDs",
-                       PGSTAT_TAB_HASH_SIZE,
+                       PGSTAT_TABLE_HASH_SIZE,
                        &hash_ctl,
                        HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
 
@@ -1239,65 +1228,184 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
 }
 
 
+/* ----------
+ * pgstat_vacuum_stat() -
+ *
+ *  Delete shared stat entries that are not in system catalogs.
+ *
+ *  To avoid holding exclusive lock on dshash for a long time, the process is
+ *  performed in three steps.
+ *
+ *   1: Collect existent oids of every kind of object.
+ *   2: Collect victim entries by scanning with shared lock.
+ *   3: Try removing every nominated entry without waiting for lock.
+ *
+ *  As a consequence of the last step, some entries may be left behind due to
+ *  lock failure, but they will be deleted by later invocations of
+ *  pgstat_vacuum_stat().
+ * ----------
+ */
+void
+pgstat_vacuum_stat(void)
+{
+    HTAB       *dbids;            /* database ids */
+    HTAB       *relids;            /* relation ids in the current database */
+    HTAB       *funcids;        /* function ids in the current database */
+    PgStatEnvelope **victims;    /* victim entry list */
+    int            arraylen = 0;    /* storage size of the above */
+    int            nvictims = 0;    /* # of entries of the above */
+    dshash_seq_status dshstat;
+    PgStatHashEntry *ent;
+    int            i;
+
+    /* we don't collect stats under standalone mode */
+    if (!IsUnderPostmaster)
+        return;
+
+    /* collect oids of existent objects */
+    dbids = collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    relids = collect_oids(RelationRelationId, Anum_pg_class_oid);
+    funcids = collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
+
+    /* collect victims from shared stats */
+    arraylen = 16;
+    victims = palloc(sizeof(PgStatEnvelope * *) * arraylen);
+    nvictims = 0;
+
+    dshash_seq_init(&dshstat, pgStatSharedHash, false);
+
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
+    {
+        HTAB       *oidtab;
+        Oid           *key;
+
+        CHECK_FOR_INTERRUPTS();
+
+        /*
+         * Don't drop non-database entries that belong to databases other
+         * than the current one.
+         */
+        if (ent->key.type != PGSTAT_TYPE_DB &&
+            ent->key.databaseid != MyDatabaseId)
+            continue;
+
+        switch (ent->key.type)
+        {
+            case PGSTAT_TYPE_DB:
+                /* don't remove database entry for shared tables */
+                if (ent->key.databaseid == 0)
+                    continue;
+                oidtab = dbids;
+                key = &ent->key.databaseid;
+                break;
+
+            case PGSTAT_TYPE_TABLE:
+                oidtab = relids;
+                key = &ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                oidtab = funcids;
+                key = &ent->key.objectid;
+                break;
+            case PGSTAT_TYPE_ALL:
+                Assert(false);
+                break;
+        }
+
+        /* Skip existent objects. */
+        if (hash_search(oidtab, key, HASH_FIND, NULL) != NULL)
+            continue;
+
+        /* extend the list if needed */
+        if (nvictims >= arraylen)
+        {
+            arraylen *= 2;
+            victims = repalloc(victims, sizeof(PgStatEnvelope * *) * arraylen);
+        }
+
+        victims[nvictims++] = dsa_get_address(area, ent->env);
+    }
+    dshash_seq_term(&dshstat);
+    hash_destroy(dbids);
+    hash_destroy(relids);
+    hash_destroy(funcids);
+
+    /* Now try removing the victim entries */
+    for (i = 0; i < nvictims; i++)
+    {
+        PgStatEnvelope *p = victims[i];
+
+        delete_stat_entry(p->type, p->databaseid, p->objectid, true);
+    }
+}
+
+
+/* ----------
+ * delete_stat_entry() -
+ *
+ *  Deletes the specified entry from shared stats hash.
+ *
+ *  Returns true when successfully deleted.
+ * ----------
+ */
+static bool
+delete_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait)
+{
+    PgStatHashEntryKey key;
+    PgStatHashEntry *ent;
+
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    ent = dshash_find_extended(pgStatSharedHash, &key,
+                               true, nowait, false, NULL);
+
+    if (!ent)
+        return false;            /* lock failed or not found */
+
+    /* The entry is exclusively locked, so we can free the chunk first. */
+    dsa_free(area, ent->env);
+    dshash_delete_entry(pgStatSharedHash, ent);
+
+    return true;
+}
+
+
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
- * ----------
+ *    Remove the entries for the database that we just dropped.
+ *
+ *    Some entries might be left behind due to lock failure, or some stats
+ *    might be flushed after this, but we will still clean the dead DB
+ *    eventually via future invocations of pgstat_vacuum_stat().
+ *    ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
+    Assert(OidIsValid(databaseid));
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
+    envlist = collect_stat_entries(PGSTAT_TYPE_ALL, databaseid);
 
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
+    for (p = envlist; *p != NULL; p++)
+        delete_stat_entry((*p)->type, (*p)->databaseid, (*p)->objectid, true);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
+    pfree(envlist);
 }
-#endif                            /* NOT_USED */
 
 
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1306,20 +1414,47 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    /* Look up the entries of the current database in the stats hash. */
+    envlist = collect_stat_entries(PGSTAT_TYPE_ALL, MyDatabaseId);
+    for (p = envlist; *p != NULL; p++)
+    {
+        PgStatEnvelope *env = *p;
+        PgStat_StatDBEntry *dbstat;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+        LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+
+        switch (env->type)
+        {
+            case PGSTAT_TYPE_TABLE:
+                init_tabentry(env);
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                init_funcentry(env);
+                break;
+
+            case PGSTAT_TYPE_DB:
+                init_dbentry(env);
+                dbstat = (PgStat_StatDBEntry *) &env->body;
+                dbstat->stat_reset_timestamp = GetCurrentTimestamp();
+                break;
+            default:
+                Assert(false);
+        }
+
+        LWLockRelease(&env->lock);
+    }
+
+    pfree(envlist);
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1328,29 +1463,37 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
+    /* Reset the archiver statistics for the cluster. */
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    /* Reset the bgwriter statistics for the cluster. */
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1359,17 +1502,42 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStatEnvelope *env;
+    PgStat_StatDBEntry *dbentry;
+    PgStatTypes stattype;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    env = get_stat_entry(PGSTAT_TYPE_DB, MyDatabaseId, InvalidOid,
+                         false, NULL, NULL);
+    Assert(env);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* Set the reset timestamp for the whole database */
+    dbentry = (PgStat_StatDBEntry *) &env->body;
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&env->lock);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Reset the stats entry for the object */
+    switch (type)
+    {
+        case RESET_TABLE:
+            stattype = PGSTAT_TYPE_TABLE;
+            break;
+        case RESET_FUNCTION:
+            stattype = PGSTAT_TYPE_FUNCTION;
+    }
+
+    env = get_stat_entry(stattype, MyDatabaseId, objoid, false, NULL, NULL);
+    LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+    if (env->type == PGSTAT_TYPE_TABLE)
+        init_tabentry(env);
+    else
+    {
+        Assert(env->type == PGSTAT_TYPE_FUNCTION);
+        init_funcentry(env);
+    }
+    LWLockRelease(&env->lock);
 }
 
 /* ----------
@@ -1383,48 +1551,63 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if activity stats is not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    ts = GetCurrentTimestamp();
 
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Store the last autovacuum time in the database's hash table entry.
+     */
+    dbentry = get_local_dbstat_entry(dboid);
+    dbentry->last_autovac_time = ts;
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report about the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    PgStat_TableStatus *tabentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    ts = GetCurrentTimestamp();
+    tabentry = get_local_tabstat_entry(tableoid, shared);
+
+    tabentry->t_counts.t_delta_live_tuples = livetuples;
+    tabentry->t_counts.t_delta_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = ts;
+        tabentry->t_counts.autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = ts;
+        tabentry->t_counts.vacuum_count++;
+    }
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report about the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1435,9 +1618,10 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    PgStat_TableStatus *tabentry;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1445,10 +1629,10 @@ pgstat_report_analyze(Relation rel,
      * already inserted and/or deleted rows in the target table. ANALYZE will
      * have counted such rows as live or dead respectively. Because we will
      * report our counts of such rows at transaction end, we should subtract
-     * off these counts from what we send to the collector now, else they'll
-     * be double-counted after commit.  (This approach also ensures that the
-     * collector ends up with the right numbers if we abort instead of
-     * committing.)
+     * off these counts from what is already written to shared stats now, else
+     * they'll be double-counted after commit.  (This approach also ensures
+     * that the shared stats end up with the right numbers if we abort
+     * instead of committing.)
      */
     if (rel->pgstat_info != NULL)
     {
@@ -1466,158 +1650,172 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    tabentry = get_local_tabstat_entry(RelationGetRelid(rel),
+                                       rel->rd_rel->relisshared);
+
+    tabentry->t_counts.t_delta_live_tuples = livetuples;
+    tabentry->t_counts.t_delta_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->t_counts.reset_changed_tuples = true;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->t_counts.autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->t_counts.analyze_count++;
+    }
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            dbent->counts.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            dbent->counts.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            dbent->counts.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            dbent->counts.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            dbent->counts.n_conflict_startup_deadlock++;
+            break;
+    }
 }
 
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a deadlock detected.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-
-/* --------
- * pgstat_report_checksum_failures_in_db() -
- *
- *    Tell the collector about one or more checksum failures.
- * --------
- */
-void
-pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
-{
-    PgStat_MsgChecksumFailure msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
-
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_deadlocks++;
 }
 
 /* --------
  * pgstat_report_checksum_failure() -
  *
- *    Tell the collector about a checksum failure.
+ *    Report a checksum failure.
  * --------
  */
 void
 pgstat_report_checksum_failure(void)
 {
-    pgstat_report_checksum_failures_in_db(MyDatabaseId, 1);
+    PgStat_StatDBEntry *dbent;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_checksum_failures++;
 }
 
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
+    if (filesize == 0)            /* Is there a case where filesize is really 0? */
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_temp_bytes += filesize; /* needs overflow check */
+    dbent->counts.n_temp_files++;
 }
 
 
-/* ----------
- * pgstat_ping() -
+/* --------
+ * pgstat_report_checksum_failures_in_db() -
  *
- *    Send some junk data to the collector to increase traffic.
- * ----------
+ *    Report one or more checksum failures.
+ * --------
  */
 void
-pgstat_ping(void)
+pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
 {
-    PgStat_MsgDummy msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if we are not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dbentry = get_local_dbstat_entry(dboid);
+
+    /* add the parameter to the accumulated count */
+    dbentry->counts.n_checksum_failures += failurecount;
 }
 
 /* ----------
- * pgstat_send_inquiry() -
+ * pgstat_init_function_usage() -
  *
- *    Notify collector that we need fresh data.
+ *  Initialize function call usage data.
+ *  Called by the executor before invoking a function.
  * ----------
  */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/*
- * Initialize function call usage data.
- * Called by the executor before invoking a function.
- */
 void
 pgstat_init_function_usage(FunctionCallInfo fcinfo,
                            PgStat_FunctionCallUsage *fcu)
 {
+    PgStatEnvelope *env;
     PgStat_BackendFunctionEntry *htabent;
     bool        found;
 
@@ -1628,26 +1826,15 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
         return;
     }
 
-    if (!pgStatFunctions)
-    {
-        /* First time through - initialize function stat table */
-        HASHCTL        hash_ctl;
+    env = get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                               fcinfo->flinfo->fn_oid, true, &found);
+    htabent = (PgStat_BackendFunctionEntry *) &env->body;
 
-        memset(&hash_ctl, 0, sizeof(hash_ctl));
-        hash_ctl.keysize = sizeof(Oid);
-        hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
-        pgStatFunctions = hash_create("Function stat entries",
-                                      PGSTAT_FUNCTION_HASH_SIZE,
-                                      &hash_ctl,
-                                      HASH_ELEM | HASH_BLOBS);
-    }
-
-    /* Get the stats entry for this function, create if necessary */
-    htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
-                          HASH_ENTER, &found);
     if (!found)
         MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
 
+    htabent->f_id = fcinfo->flinfo->fn_oid;
+
     fcu->fs = &htabent->f_counts;
 
     /* save stats for this function, later used to compensate for recursion */
@@ -1660,31 +1847,38 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
     INSTR_TIME_SET_CURRENT(fcu->f_start);
 }
 
-/*
- * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
- *        for specified function
+/* ----------
+ * find_funcstat_entry() -
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_BackendFunctionEntry entry for specified function
+ *
+ *  If no entry, return NULL, not creating a new one.
+ * ----------
  */
 PgStat_BackendFunctionEntry *
 find_funcstat_entry(Oid func_id)
 {
-    if (pgStatFunctions == NULL)
+    PgStatEnvelope *env;
+
+    env = get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                               func_id, false, NULL);
+    if (!env)
         return NULL;
 
-    return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
-                                                       (void *) &func_id,
-                                                       HASH_FIND, NULL);
+    return (PgStat_BackendFunctionEntry *) &env->body;
 }
 
-/*
- * Calculate function call usage and update stat counters.
- * Called by the executor after invoking a function.
+/* ----------
+ * pgstat_end_function_usage() -
  *
- * In the case of a set-returning function that runs in value-per-call mode,
- * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
- * calls for what the user considers a single call of the function.  The
- * finalize flag should be TRUE on the last call.
+ *  Calculate function call usage and update stat counters.
+ *  Called by the executor after invoking a function.
+ *
+ *  In the case of a set-returning function that runs in value-per-call mode,
+ *  we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
+ *  calls for what the user considers a single call of the function.  The
+ *  finalize flag should be TRUE on the last call.
+ * ----------
  */
 void
 pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
@@ -1725,9 +1919,6 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
         fs->f_numcalls++;
     fs->f_total_time = f_total;
     INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
 }
 
 
@@ -1739,8 +1930,7 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
  *
  *    We assume that a relcache entry's pgstat_info field is zeroed by
  *    relcache.c when the relcache entry is made; thereafter it is long-lived
- *    data.  We can avoid repeated searches of the TabStatus arrays when the
- *    same relation is touched repeatedly within a transaction.
+ *    data.
  * ----------
  */
 void
@@ -1760,7 +1950,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1776,116 +1967,157 @@ pgstat_initstats(Relation rel)
         return;
 
     /* Else find or make the PgStat_TableStatus entry, and update link */
-    rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+    rel->pgstat_info = get_local_tabstat_entry(rel_id, rel->rd_rel->relisshared);
 }
 
-/*
- * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
+
+/* ----------
+ * get_local_stat_entry() -
+ *
+ *  Returns the local stats entry for the type, dbid and objid.
+ *  If create is true, a new entry is created if it does not exist yet; found
+ *  must be non-null in that case.
+ *
+ *  The caller is responsible for initializing the body part of the returned
+ *  envelope.
+ * ----------
  */
-static PgStat_TableStatus *
-get_tabstat_entry(Oid rel_id, bool isshared)
+static PgStatEnvelope *
+get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                     bool create, bool *found)
 {
-    TabStatHashEntry *hash_entry;
-    PgStat_TableStatus *entry;
-    TabStatusArray *tsa;
-    bool        found;
+    PgStatHashEntryKey key;
+    PgStatLocalHashEntry *entry;
 
-    /*
-     * Create hash table if we don't have it already.
-     */
-    if (pgStatTabHash == NULL)
+    if (pgStatLocalHash == NULL)
     {
         HASHCTL        ctl;
 
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
+        MemSet(&ctl, 0, sizeof(ctl));
+        ctl.keysize = sizeof(PgStatHashEntryKey);
+        ctl.entrysize = sizeof(PgStatLocalHashEntry);
+
+        pgStatLocalHash = hash_create("Local stat entries",
+                                      PGSTAT_TABLE_HASH_SIZE,
+                                      &ctl,
+                                      HASH_ELEM | HASH_BLOBS);
+    }
+
+    /* Find an entry or create a new one. */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    entry = hash_search(pgStatLocalHash, &key,
+                        create ? HASH_ENTER : HASH_FIND, found);
+
+    if (!create && !entry)
+        return NULL;
+
+    if (create && !*found)
+    {
+        int            len = pgstat_localentsize[type];
+
+        entry->env = MemoryContextAlloc(CacheMemoryContext,
+                                        PgStatEnvelopeSize(len));
+        entry->env->type = type;
+        entry->env->len = len;
     }
 
+    return entry->env;
+}
+
+/* ----------
+ * get_local_dbstat_entry() -
+ *
+ *  Find or create a local PgStat_StatDBEntry entry for dbid.  A new entry is
+ *  created and initialized if none exists.
+ */
+static PgStat_StatDBEntry *
+get_local_dbstat_entry(Oid dbid)
+{
+    PgStatEnvelope *env;
+    PgStat_StatDBEntry *dbentry;
+    bool        found;
+
     /*
      * Find an entry or create a new one.
      */
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
+    env = get_local_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid,
+                               true, &found);
+    dbentry = (PgStat_StatDBEntry *) &env->body;
+
     if (!found)
     {
-        /* initialize new entry with null pointer */
-        hash_entry->tsa_entry = NULL;
+        dbentry->databaseid = dbid;
+        dbentry->last_autovac_time = 0;
+        dbentry->last_checksum_failure = 0;
+        dbentry->stat_reset_timestamp = 0;
+        dbentry->stats_timestamp = 0;
+        MemSet(&dbentry->counts, 0, sizeof(PgStat_StatDBCounts));
     }
 
-    /*
-     * If entry is already valid, we're done.
-     */
-    if (hash_entry->tsa_entry)
-        return hash_entry->tsa_entry;
-
-    /*
-     * Locate the first pgStatTabList entry with free space, making a new list
-     * entry if needed.  Note that we could get an OOM failure here, but if so
-     * we have left the hashtable and the list in a consistent state.
-     */
-    if (pgStatTabList == NULL)
-    {
-        /* Set up first pgStatTabList entry */
-        pgStatTabList = (TabStatusArray *)
-            MemoryContextAllocZero(TopMemoryContext,
-                                   sizeof(TabStatusArray));
-    }
+    return dbentry;
+}
 
-    tsa = pgStatTabList;
-    while (tsa->tsa_used >= TABSTAT_QUANTUM)
-    {
-        if (tsa->tsa_next == NULL)
-            tsa->tsa_next = (TabStatusArray *)
-                MemoryContextAllocZero(TopMemoryContext,
-                                       sizeof(TabStatusArray));
-        tsa = tsa->tsa_next;
-    }
 
-    /*
-     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
-     * the entry was already zeroed, either at creation or after last use.
-     */
-    entry = &tsa->tsa_entries[tsa->tsa_used++];
-    entry->t_id = rel_id;
-    entry->t_shared = isshared;
+/* ----------
+ * get_local_tabstat_entry() -
+ *  Find or create a PgStat_TableStatus entry for rel.  A new entry is created
+ *  and initialized if none exists.
+ * ----------
+ */
+static PgStat_TableStatus *
+get_local_tabstat_entry(Oid rel_id, bool isshared)
+{
+    PgStatEnvelope *env;
+    PgStat_TableStatus *tabentry;
+    bool        found;
 
-    /*
-     * Now we can fill the entry in pgStatTabHash.
-     */
-    hash_entry->tsa_entry = entry;
+    env = get_local_stat_entry(PGSTAT_TYPE_TABLE,
+                               isshared ? InvalidOid : MyDatabaseId,
+                               rel_id, true, &found);
 
-    return entry;
+    tabentry = (PgStat_TableStatus *) &env->body;
+
+    if (!found)
+    {
+        tabentry->t_id = rel_id;
+        tabentry->t_shared = isshared;
+        tabentry->trans = NULL;
+        MemSet(&tabentry->t_counts, 0, sizeof(PgStat_TableCounts));
+        tabentry->vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+    }
+
+    return tabentry;
 }
 
+
 /*
  * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_TableStatus entry for rel from the current
+ *  database then from shared tables.
  *
- * Note: if we got an error in the most recent execution of pgstat_report_stat,
- * it's possible that an entry exists but there's no hashtable entry for it.
- * That's okay, we'll treat this case as "doesn't exist".
+ *  If no entry, return NULL, don't create a new one
+ * ----------
  */
 PgStat_TableStatus *
 find_tabstat_entry(Oid rel_id)
 {
-    TabStatHashEntry *hash_entry;
+    PgStatEnvelope *env;
 
-    /* If hashtable doesn't exist, there are no entries at all */
-    if (!pgStatTabHash)
-        return NULL;
+    env = get_local_stat_entry(PGSTAT_TYPE_TABLE, MyDatabaseId, rel_id,
+                               false, NULL);
+    if (!env)
+        env = get_local_stat_entry(PGSTAT_TYPE_TABLE, InvalidOid, rel_id,
+                                   false, NULL);
+    if (env)
+        return (PgStat_TableStatus *) &env->body;
 
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
-    if (!hash_entry)
-        return NULL;
-
-    /* Note that this step could also return NULL, but that's correct */
-    return hash_entry->tsa_entry;
+    return NULL;
 }
 
 /*
@@ -2316,8 +2548,8 @@ AtPrepare_PgStat(void)
  *
  * All we need do here is unlink the transaction stats state from the
  * nontransactional state.  The nontransactional action counts will be
- * reported to the stats collector immediately, while the effects on live
- * and dead tuple counts are preserved in the 2PC state file.
+ * reported to the activity stats facility immediately, while the effects on
+ * live and dead tuple counts are preserved in the 2PC state file.
  *
  * Note: AtEOXact_PgStat is not called during PREPARE.
  */
@@ -2362,7 +2594,7 @@ pgstat_twophase_postcommit(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, commit case */
     pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
@@ -2398,7 +2630,7 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, abort case */
     if (rec->t_truncated)
@@ -2415,88 +2647,176 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 }
 
 
+/* ----------
+ * snapshot_statentry() -
+ *
+ *  Common routine for functions pgstat_fetch_stat_*entry()
+ *
+ *  Returns the pointer to the snapshot of the shared entry for the key or NULL
+ *  if not found. Returned snapshots are stable during the current transaction
+ *  or until pgstat_clear_snapshot() is called.
+ *
+ *  Created snapshots are stored in pgStatSnapshotHash.
+ */
+static void *
+snapshot_statentry(const PgStatTypes type, const Oid dbid, const Oid objid)
+{
+    PgStatSnapshot *snap = NULL;
+    bool        found;
+    PgStatHashEntryKey key;
+    size_t        statentsize = pgstat_entsize[type];
+
+    Assert(type != PGSTAT_TYPE_ALL);
+
+    /*
+     * Create new hash, with rather arbitrary initial number of entries since
+     * we don't know how this hash will grow.
+     */
+    if (!pgStatSnapshotHash)
+    {
+        HASHCTL        ctl;
+
+        /*
+         * Create the hash in the stats context
+         *
+         * The entry is prepended by common header part represented by
+         * PgStatSnapshot.
+         */
+
+        ctl.keysize = sizeof(PgStatHashEntryKey);
+        ctl.entrysize = PgStatSnapshotSize(statentsize);
+        ctl.hcxt = pgStatSnapshotContext;
+        pgStatSnapshotHash = hash_create("pgstat snapshot hash", 32, &ctl,
+                                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    }
+
+    /* Find a snapshot */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+
+    snap = hash_search(pgStatSnapshotHash, &key, HASH_ENTER, &found);
+
+    /*
+     * Refer to the shared hash if not found in the snapshot hash.
+     *
+     * In transaction state, it is obvious that we should create a snapshot
+     * entry for consistency. If we are not, we return an up-to-date entry.
+     * Having said that, we need a snapshot since a shared stats entry can be
+     * modified anytime. We share the same snapshot entry for the purpose.
+     */
+    if (!found || !IsTransactionState())
+    {
+        PgStatEnvelope *shenv;
+
+        shenv = get_stat_entry(type, dbid, objid, true, NULL, NULL);
+
+        if (shenv)
+            memcpy(&snap->body, &shenv->body, statentsize);
+
+        snap->negative = !shenv;
+    }
+
+    if (snap->negative)
+        return NULL;
+
+    return &snap->body;
+}
+
+
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    Find the database stats entry on backends. The returned entry is cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
  * ----------
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    /* If not done for this transaction, take a snapshot of global stats */
+    pgstat_snapshot_global_stats();
+
+    /* callers have no business with snapshot-local members */
+    return (PgStat_StatDBEntry *)
+        snapshot_statentry(PGSTAT_TYPE_DB, dbid, InvalidOid);
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one table or NULL. NULL doesn't mean
+ *    the activity statistics for one table or NULL. NULL doesn't mean
  *    that the table doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    activity statistics facilities, so the caller is better off to
+ *    report ZERO instead.
  * ----------
  */
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
-    PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(false, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
-
-    return NULL;
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(true, relid);
+    return tabentry;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_tabentry_snapshot() -
+ *
+ *    Find the table stats entry on backends. The returned entry is cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_snapshot(bool shared, Oid reloid)
+{
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    return (PgStat_StatTabEntry *)
+        snapshot_statentry(PGSTAT_TYPE_TABLE, dboid, reloid);
+}
+
+
+/* ----------
+ * pgstat_copy_index_counters() -
+ *
+ *    Support function for index swapping. Copy a portion of the counters of the
+ *    relation to specified place.
+ * ----------
+ */
+void
+pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst)
+{
+    PgStat_StatTabEntry *tabentry;
+
+    /* No point fetching tabentry when dst is NULL */
+    if (!dst)
+        return;
+
+    tabentry = pgstat_fetch_stat_tabentry(relid);
+
+    if (!tabentry)
+        return;
+
+    dst->t_counts.t_numscans = tabentry->numscans;
+    dst->t_counts.t_tuples_returned = tabentry->tuples_returned;
+    dst->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
+    dst->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
+    dst->t_counts.t_blocks_hit = tabentry->blocks_hit;
 }
 
 
@@ -2510,24 +2830,48 @@ pgstat_fetch_stat_tabentry(Oid relid)
 PgStat_StatFuncEntry *
 pgstat_fetch_stat_funcentry(Oid func_id)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry = NULL;
-
-    /* load the stats file if needed */
-    backend_read_statsfile();
-
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
-
-    return funcentry;
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    return (PgStat_StatFuncEntry *)
+        snapshot_statentry(PGSTAT_TYPE_FUNCTION, MyDatabaseId, func_id);
 }
 
+/* ----------
+ * pgstat_snapshot_global_stats() -
+ *
+ * Makes a snapshot of global stats if not done yet.  It is kept until a
+ * subsequent call to pgstat_clear_snapshot() or the end of the current
+ * memory context (typically TopTransactionContext).
+ * ----------
+ */
+static void
+pgstat_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext;
+
+    attach_shared_stats();
+
+    /* Nothing to do if already done */
+    if (global_snapshot_is_valid)
+        return;
+
+    oldcontext = MemoryContextSwitchTo(pgStatSnapshotContext);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+
+    memcpy(&snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+    LWLockRelease(StatsLock);
+
+    global_snapshot_is_valid = true;
+
+    MemoryContextSwitchTo(oldcontext);
+
+    return;
+}
 
 /* ----------
  * pgstat_fetch_stat_beentry() -
@@ -2599,9 +2943,10 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &archiverStats;
+    return &snapshot_archiverStats;
 }
 
 
@@ -2616,9 +2961,10 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &globalStats;
+    return &snapshot_globalStats;
 }
 
 
@@ -2832,8 +3178,8 @@ pgstat_initialize(void)
         MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
     }
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* needs to be called before DSM shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -3009,12 +3355,15 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+    /* attach shared database stats area */
+    attach_shared_stats();
 }
 
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
- * Flush any remaining statistics counts out to the collector.
+ * Flush any remaining statistics counts out to shared stats.
  * Without this, operations triggered during backend exit (such as
  * temp table deletions) won't be counted.
  *
@@ -3027,7 +3376,7 @@ pgstat_beshutdown_hook(int code, Datum arg)
 
     /*
      * If we got as far as discovering our own database ID, we can report what
-     * we did to the collector.  Otherwise, we'd be sending an invalid
+     * we did to the shared stats.  Otherwise, we'd be sending an invalid
      * database ID, so forget it.  (This means that accesses to pg_database
      * during failed backend starts might never get counted.)
      */
@@ -3044,6 +3393,8 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    detach_shared_stats(true);
 }
 
 
@@ -3304,7 +3655,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3599,9 +3951,6 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
-            break;
         case WAIT_EVENT_RECOVERY_WAL_STREAM:
             event_name = "RecoveryWalStream";
             break;
@@ -4230,94 +4579,71 @@ pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
 
 
 /* ----------
- * pgstat_setheader() -
+ * pgstat_report_archiver() -
  *
- *        Set common header fields in a statistics message
- * ----------
- */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
-    {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
- *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
+ *        Report archiver statistics
  * ----------
  */
 void
-pgstat_send_archiver(const char *xlog, bool failed)
+pgstat_report_archiver(const char *xlog, bool failed)
 {
-    PgStat_MsgArchiver msg;
+    TimestampTz now = GetCurrentTimestamp();
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+    if (failed)
+    {
+        /* Failed archival attempt */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->failed_count;
+        memcpy(shared_archiverStats->last_failed_wal, xlog,
+               sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    else
+    {
+        /* Successful archival operation */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->archived_count;
+        memcpy(shared_archiverStats->last_archived_wal, xlog,
+               sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
 }
 
 /* ----------
- * pgstat_send_bgwriter() -
+ * pgstat_report_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
-pgstat_send_bgwriter(void)
+pgstat_report_bgwriter(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+
+    PgStat_BgWriter *l = &BgWriterStats;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid taking the lock for completely empty stats.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += l->timed_checkpoints;
+    shared_globalStats->requested_checkpoints += l->requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += l->checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += l->checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += l->buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += l->buf_written_clean;
+    shared_globalStats->maxwritten_clean += l->maxwritten_clean;
+    shared_globalStats->buf_written_backend += l->buf_written_backend;
+    shared_globalStats->buf_fsync_backend += l->buf_fsync_backend;
+    shared_globalStats->buf_alloc += l->buf_alloc;
+    LWLockRelease(StatsLock);
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4326,425 +4652,30 @@ pgstat_send_bgwriter(void)
 }
 
 
-/* ----------
- * PgstatCollectorMain() -
- *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, SignalHandlerForConfigReload);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    MyBackendType = B_STATS_COLLECTOR;
-    init_ps_display(NULL);
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do check ConfigReloadPending
-     * inside the inner loop, which means that such interrupts will get
-     * serviced but the latch won't get cleared until next time there is a
-     * break in the action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (ShutdownRequestPending)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * ShutdownRequestPending becomes set.
-         */
-        while (!ShutdownRequestPending)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (ConfigReloadPending)
-            {
-                ConfigReloadPending = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr & WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    exit(0);
-}
-
-/*
- * Subroutine to clear stats in a database entry
- *
- * Tables and functions hashes are initialized to empty.
- */
-static void
-reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
-{
-    HASHCTL        hash_ctl;
-
-    dbentry->n_xact_commit = 0;
-    dbentry->n_xact_rollback = 0;
-    dbentry->n_blocks_fetched = 0;
-    dbentry->n_blocks_hit = 0;
-    dbentry->n_tuples_returned = 0;
-    dbentry->n_tuples_fetched = 0;
-    dbentry->n_tuples_inserted = 0;
-    dbentry->n_tuples_updated = 0;
-    dbentry->n_tuples_deleted = 0;
-    dbentry->last_autovac_time = 0;
-    dbentry->n_conflict_tablespace = 0;
-    dbentry->n_conflict_lock = 0;
-    dbentry->n_conflict_snapshot = 0;
-    dbentry->n_conflict_bufferpin = 0;
-    dbentry->n_conflict_startup_deadlock = 0;
-    dbentry->n_temp_files = 0;
-    dbentry->n_temp_bytes = 0;
-    dbentry->n_deadlocks = 0;
-    dbentry->n_checksum_failures = 0;
-    dbentry->last_checksum_failure = 0;
-    dbentry->n_block_read_time = 0;
-    dbentry->n_block_write_time = 0;
-
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
-
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
-}
-
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
-{
-    PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
-     */
-    if (!found)
-        reset_dbentry_counters(result);
-
-    return result;
-}
-
-
-/*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
-{
-    PgStat_StatTabEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /* If not found, initialize the new one. */
-    if (!found)
-    {
-        result->numscans = 0;
-        result->tuples_returned = 0;
-        result->tuples_fetched = 0;
-        result->tuples_inserted = 0;
-        result->tuples_updated = 0;
-        result->tuples_deleted = 0;
-        result->tuples_hot_updated = 0;
-        result->n_live_tuples = 0;
-        result->n_dead_tuples = 0;
-        result->changes_since_analyze = 0;
-        result->inserts_since_vacuum = 0;
-        result->blocks_fetched = 0;
-        result->blocks_hit = 0;
-        result->vacuum_timestamp = 0;
-        result->vacuum_count = 0;
-        result->autovac_vacuum_timestamp = 0;
-        result->autovac_vacuum_count = 0;
-        result->analyze_timestamp = 0;
-        result->analyze_count = 0;
-        result->autovac_analyze_timestamp = 0;
-        result->autovac_analyze_count = 0;
-    }
-
-    return result;
-}
-
-
 /* ----------
  * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ *        Write the global statistics file, as well as DB files.
  * ----------
  */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+void
+pgstat_write_statsfiles(void)
 {
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **penv;
+
+    /* Stats are not initialized yet; just return. */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
+    create_missing_dbentries();
+
     /*
      * Open the statistics temp file to write out the current values.
      */
@@ -4761,7 +4692,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
 
     /*
      * Write the file header --- currently just a format ID.
@@ -4773,32 +4704,31 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Write global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write archiver stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Walk through the database table.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_DB, InvalidOid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) &(*penv)->body;
+
         /*
          * Write out the table and function stats for this DB into the
          * appropriate per-DB stat file, if required.
          */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
+        pgstat_write_database_stats(dbentry);
 
         /*
          * Write out the DB entry. We don't write the tables or functions
@@ -4809,6 +4739,8 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
         (void) rc;                /* we'll check for error with ferror */
     }
 
+    pfree(envlist);
+
     /*
      * No more output to be done. Close the temp file and replace the old
      * pgstat.stat with it.  The ferror() check replaces testing for error
@@ -4841,55 +4773,19 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
 }
 
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
 
 /* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
+ * pgstat_write_database_stats() -
+ *  Write the stat file for a single database.
  * ----------
  */
 static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
+pgstat_write_database_stats(PgStat_StatDBEntry *dbentry)
 {
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **penv;
     FILE       *fpout;
     int32        format_id;
     Oid            dbid = dbentry->databaseid;
@@ -4897,8 +4793,8 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     char        tmpfile[MAXPGPATH];
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -4925,24 +4821,31 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     /*
      * Walk through the database's access stats per table.
      */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_TABLE, dbentry->databaseid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatTabEntry *tabentry = (PgStat_StatTabEntry *) &(*penv)->body;
+
         fputc('T', fpout);
         rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    pfree(envlist);
 
     /*
      * Walk through the database's function stats table.
      */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_FUNCTION, dbentry->databaseid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatFuncEntry *funcentry =
+        (PgStat_StatFuncEntry *) &(*penv)->body;
+
         fputc('F', fpout);
         rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    pfree(envlist);
 
     /*
      * No more output to be done. Close the temp file and replace the old
@@ -4976,94 +4879,165 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
+}
 
-    if (permanent)
+/* ----------
+ * create_missing_dbentries() -
+ *
+ *  A database entry may be missing for a database in which object stats
+ *  are recorded. This function creates such missing dbentries so that
+ *  all stats entries can be written out to files.
+ * ----------
+ */
+static void
+create_missing_dbentries(void)
+{
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
+    HTAB       *oidhash;
+    HASHCTL        ctl;
+    HASH_SEQ_STATUS scan;
+    Oid           *poid;
+
+    memset(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(Oid);
+    ctl.entrysize = sizeof(Oid);
+    ctl.hcxt = CurrentMemoryContext;
+    oidhash = hash_create("Temporary table of OIDs",
+                          PGSTAT_TABLE_HASH_SIZE,
+                          &ctl,
+                          HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+    /* Collect OID from the shared stats hash */
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+        hash_search(oidhash, &p->key.databaseid, HASH_ENTER, NULL);
+    dshash_seq_term(&hstat);
+
+    /* Create database entries that do not exist yet. */
+    hash_seq_init(&scan, oidhash);
+    while ((poid = (Oid *) hash_seq_search(&scan)) != NULL)
+        (void) get_stat_entry(PGSTAT_TYPE_DB, *poid, InvalidOid,
+                              false, init_dbentry, NULL);
+
+    hash_destroy(oidhash);
+}
+
+
+/* ----------
+ * get_stat_entry() -
+ *
+ *    Get the shared stats entry for the specified type, dbid and objid.
+ *  If nowait is true, returns NULL on lock failure.
+ *
+ *  If initfunc is not NULL, a new entry is created when none exists yet, and
+ *  the function is called with the new envelope. If found is not NULL, it is
+ *  set to true if an existing entry was found, false otherwise.
+ * ----------
+ */
+static PgStatEnvelope *
+get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+               bool nowait, entry_initializer initfunc, bool *found)
+{
+    bool        create = (initfunc != NULL);
+    PgStatHashEntry *shent;
+    PgStatEnvelope *shenv = NULL;
+    PgStatHashEntryKey key;
+    bool        myfound;
+
+    Assert(type != PGSTAT_TYPE_ALL);
+
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    shent = dshash_find_extended(pgStatSharedHash, &key,
+                                 create, nowait, create, &myfound);
+    if (shent)
     {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
+        if (create && !myfound)
+        {
+            /* Create new stats envelope. */
+            size_t        envsize = PgStatEnvelopeSize(pgstat_entsize[type]);
+            dsa_pointer chunk = dsa_allocate0(area, envsize);
 
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
+            shenv = dsa_get_address(area, chunk);
+            shenv->type = type;
+            shenv->databaseid = dbid;
+            shenv->objectid = objid;
+            shenv->len = pgstat_entsize[type];
+            LWLockInitialize(&shenv->lock, LWTRANCHE_STATS);
+
+            /*
+             * The dshash lock is released just below. Call the initializer
+             * callback before the entry is exposed to other processes.
+             */
+            if (initfunc)
+                initfunc(shenv);
+
+            /* Link the new entry from the hash entry. */
+            shent->env = chunk;
+        }
+        else
+            shenv = dsa_get_address(area, shent->env);
+
+        dshash_release_lock(pgStatSharedHash, shent);
     }
+
+    if (found)
+        *found = myfound;
+
+    return shenv;
 }
 
+
 /* ----------
  * pgstat_read_statsfiles() -
  *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ *    Reads in existing activity statistics files into the shared stats hash.
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+void
+pgstat_read_statsfiles(void)
 {
+    PgStatEnvelope *env;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
 
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
+    /* shouldn't be called from postmaster */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp =
+        shared_globalStats->stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the activity statistics is not running or
+     * has not yet written the stats file the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -5072,7 +5046,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
@@ -5080,38 +5054,30 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     /*
      * Read global stats struct
      */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+        sizeof(*shared_globalStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
         goto done;
     }
 
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
     /*
      * Read archiver stats struct
      */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+        sizeof(*shared_archiverStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
         goto done;
     }
 
     /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
     for (;;)
     {
@@ -5125,7 +5091,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
                           fpin) != offsetof(PgStat_StatDBEntry, tables))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5134,76 +5100,33 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 /*
                  * Add to the DB hash
                  */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
+
+                env = get_stat_entry(PGSTAT_TYPE_DB, dbbuf.databaseid,
+                                     InvalidOid,
+                                     false, init_dbentry, &found);
+                dbentry = (PgStat_StatDBEntry *) &env->body;
+
+                /* don't allow duplicate dbentries */
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
+                memcpy(dbentry, &dbbuf,
+                       offsetof(PgStat_StatDBEntry, tables));
 
+                /* Read the data from the database-specific file. */
+                pgstat_read_db_statsfile(dbentry);
                 break;
 
             case 'E':
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5213,59 +5136,50 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 done:
     FreeFile(fpin);
 
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 
-    return dbhash;
+    return;
 }
 
 
 /* ----------
  * pgstat_read_db_statsfile() -
  *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
+ *    Reads in the at-rest statistics file and creates shared statistics
+ *    tables. The file is removed after reading.
  * ----------
  */
 static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
+pgstat_read_db_statsfile(PgStat_StatDBEntry *dbentry)
 {
+    PgStatEnvelope *env;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatTabEntry tabbuf;
     PgStat_StatFuncEntry funcbuf;
     PgStat_StatFuncEntry *funcentry;
+    dshash_table *tabhash = NULL;
+    dshash_table *funchash = NULL;
     FILE       *fpin;
     int32        format_id;
     bool        found;
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
+    get_dbstat_filename(false, dbentry->databaseid, statfile, MAXPGPATH);
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the activity statistics is not running or
+     * has not yet written the stats file the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
@@ -5278,14 +5192,14 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
 
     /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
     for (;;)
     {
@@ -5298,25 +5212,21 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
                           fpin) != sizeof(PgStat_StatTabEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
+                env = get_stat_entry(PGSTAT_TYPE_TABLE,
+                                     dbentry->databaseid, tabbuf.tableid,
+                                     false, init_tabentry, &found);
+                tabentry = (PgStat_StatTabEntry *) &env->body;
 
+                /* don't allow duplicate entries */
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5332,25 +5242,20 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
                           fpin) != sizeof(PgStat_StatFuncEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
+                env = get_stat_entry(PGSTAT_TYPE_FUNCTION, dbentry->databaseid,
+                                     funcbuf.functionid,
+                                     false, init_funcentry, &found);
+                funcentry = (PgStat_StatFuncEntry *) &env->body;
 
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5366,7 +5271,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5374,292 +5279,15 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     }
 
 done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
+    if (tabhash)
+        dshash_detach(tabhash);
+    if (funchash)
+        dshash_detach(funchash);
 
-done:
     FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
-}
-
 
-/* ----------
- * pgstat_setup_memcxt() -
- *
- *    Create pgStatLocalContext, if not already done.
- * ----------
- */
-static void
-pgstat_setup_memcxt(void)
-{
-    if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 }
 
 
@@ -5678,756 +5306,25 @@ pgstat_clear_snapshot(void)
 {
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
+    {
         MemoryContextDelete(pgStatLocalContext);
 
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-                tabentry->inserts_since_vacuum = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * It is quite possible that a non-aggressive VACUUM ended up skipping
-     * various pages, however, we'll zero the insert counter here regardless.
-     * It's currently used only to track when we need to perform an
-     * "insert" autovacuum, which are mainly intended to freeze newly inserted
-     * tuples.  Zeroing this may just mean we'll not try to vacuum the table
-     * again until enough tuples have been inserted to trigger another insert
-     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
-     * stragglers.
-     */
-    tabentry->inserts_since_vacuum = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
     }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
 
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
+    if (pgStatSnapshotContext)
     {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
+        MemoryContextReset(pgStatSnapshotContext);
 
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
+        /* Reset variables that pointed to the context */
+        global_snapshot_is_valid = false;
+        pgStatSnapshotHash = NULL;
     }
 }
 
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
-}
-
 /*
  * Convert a potentially unsafely truncated activity string (see
  * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a4b9d212a2..51d02b6cd7 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -255,7 +255,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -503,7 +502,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1326,12 +1324,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1780,11 +1772,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2694,8 +2681,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -3058,8 +3043,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3126,13 +3109,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3205,22 +3181,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3681,22 +3641,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3892,8 +3836,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3917,8 +3859,7 @@ PostmasterStateMachine(void)
     {
         /*
          * PM_WAIT_DEAD_END state ends when the BackendList is entirely empty
-         * (ie, no dead_end children remain), and the archiver and stats
-         * collector are gone too.
+         * (ie, no dead_end children remain), and the archiver is gone too.
          *
          * The reason we wait for those two is to protect them against a new
          * postmaster starting conflicting subprocesses; this isn't an
@@ -3928,8 +3869,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4130,8 +4070,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5109,18 +5047,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5239,12 +5165,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6139,7 +6059,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6195,8 +6114,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6431,7 +6348,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index a2e28b064c..261920b961 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1445,8 +1445,8 @@ is_checksummed_file(const char *fullpath, const char *filename)
  *
  * If 'missing_ok' is true, will not throw an error if the file is not found.
  *
- * If dboid is anything other than InvalidOid then any checksum failures detected
- * will get reported to the stats collector.
+ * If dboid is anything other than InvalidOid then any checksum failures
+ * detected will get reported to the activity stats facility.
  *
  * Returns true if the file was successfully sent, false if 'missing_ok',
  * and the file did not exist.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e72d607a23..ea5030452d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1948,7 +1948,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                BgWriterStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2058,7 +2058,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2248,7 +2248,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2256,7 +2256,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
 
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..58a442f482 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -147,6 +147,7 @@ CreateSharedMemoryAndSemaphores(void)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -263,6 +264,7 @@ CreateSharedMemoryAndSemaphores(void)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 4c14e51c67..f61fd3e8ad 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -523,6 +523,7 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
     LWLockRegisterTranche(LWTRANCHE_SXACT, "serializable_xact");
+    LWLockRegisterTranche(LWTRANCHE_STATS, "activity_statistics");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index db47843229..97eccb35d3 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -49,3 +49,4 @@ MultiXactTruncationLock                41
 OldSnapshotTimeMapLock                42
 LogicalRepWorkerLock                43
 CLogTruncationLock                    44
+StatsLock                            45
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 360b5bf5bf..c10d80f035 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -362,10 +362,10 @@ smgrdounlink(SMgrRelation reln, bool isRedo)
     DropRelFileNodesAllBuffers(&rnode, 1);
 
     /*
-     * It'd be nice to tell the stats collector to forget it immediately, too.
-     * But we can't because we don't know the OID (and in cases involving
-     * relfilenode swaps, it's not always clear which table OID to forget,
-     * anyway).
+     * It'd be nice to tell the activity stats facility to forget it
+     * immediately, too.  But we can't because we don't know the OID (and in
+     * cases involving relfilenode swaps, it's not always clear which table OID
+     * to forget, anyway).
      */
 
     /*
@@ -435,8 +435,8 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
     DropRelFileNodesAllBuffers(rnodes, nrels);
 
     /*
-     * It'd be nice to tell the stats collector to forget them immediately,
-     * too. But we can't because we don't know the OIDs.
+     * It'd be nice to tell the activity stats facility to forget them
+     * immediately, too. But we can't because we don't know the OIDs.
      */
 
     /*
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 5b677863b9..3f906210c2 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3192,6 +3192,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3766,6 +3772,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4168,11 +4175,12 @@ PostgresMain(int argc, char *argv[],
          * Note: this includes fflush()'ing the last of the prior output.
          *
          * This is also a good time to send collected statistics to the
-         * collector, and to update the PS stats display.  We avoid doing
-         * those every time through the message loop because it'd slow down
-         * processing of batched messages, and because we don't want to report
-         * uncommitted updates (that confuses autovacuum).  The notification
-         * processor wants a call too, if we are not in a transaction block.
+         * activity stats facility, and to update the PS stats display.  We avoid
+         * doing those every time through the message loop because it'd slow
+         * down processing of batched messages, and because we don't want to
+         * report uncommitted updates (that confuses autovacuum).  The
+         * notification processor wants a call too, if we are not in a
+         * transaction block.
          */
         if (send_ready_for_query)
         {
@@ -4204,6 +4212,8 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long stats_timeout;
+
                 /* Send out notify signals and transmit self-notifies */
                 ProcessCompletedNotifies();
 
@@ -4216,8 +4226,13 @@ PostgresMain(int argc, char *argv[],
                 if (notifyInterruptPending)
                     ProcessNotifyInterrupt();
 
-                pgstat_report_stat(false);
-
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle");
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4252,7 +4267,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4260,6 +4275,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 6d66ff8b44..5f80fa755b 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * pgstatfuncs.c
- *      Functions for accessing the statistics collector data
+ *      Functions for accessing the activity statistics data
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -33,9 +33,6 @@
 
 #define UINT32_ACCESS_ONCE(var)         ((uint32)(*((volatile uint32 *)&(var))))
 
-/* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
-
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
 {
@@ -1260,7 +1257,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_commit);
+        result = (int64) (dbentry->counts.n_xact_commit);
 
     PG_RETURN_INT64(result);
 }
@@ -1276,7 +1273,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_rollback);
+        result = (int64) (dbentry->counts.n_xact_rollback);
 
     PG_RETURN_INT64(result);
 }
@@ -1292,7 +1289,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_fetched);
+        result = (int64) (dbentry->counts.n_blocks_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1308,7 +1305,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_hit);
+        result = (int64) (dbentry->counts.n_blocks_hit);
 
     PG_RETURN_INT64(result);
 }
@@ -1324,7 +1321,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_returned);
+        result = (int64) (dbentry->counts.n_tuples_returned);
 
     PG_RETURN_INT64(result);
 }
@@ -1340,7 +1337,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_fetched);
+        result = (int64) (dbentry->counts.n_tuples_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1356,7 +1353,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_inserted);
+        result = (int64) (dbentry->counts.n_tuples_inserted);
 
     PG_RETURN_INT64(result);
 }
@@ -1372,7 +1369,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_updated);
+        result = (int64) (dbentry->counts.n_tuples_updated);
 
     PG_RETURN_INT64(result);
 }
@@ -1388,7 +1385,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_deleted);
+        result = (int64) (dbentry->counts.n_tuples_deleted);
 
     PG_RETURN_INT64(result);
 }
@@ -1421,7 +1418,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_files;
+        result = dbentry->counts.n_temp_files;
 
     PG_RETURN_INT64(result);
 }
@@ -1437,7 +1434,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_bytes;
+        result = dbentry->counts.n_temp_bytes;
 
     PG_RETURN_INT64(result);
 }
@@ -1452,7 +1449,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace);
+        result = (int64) (dbentry->counts.n_conflict_tablespace);
 
     PG_RETURN_INT64(result);
 }
@@ -1467,7 +1464,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_lock);
+        result = (int64) (dbentry->counts.n_conflict_lock);
 
     PG_RETURN_INT64(result);
 }
@@ -1482,7 +1479,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_snapshot);
+        result = (int64) (dbentry->counts.n_conflict_snapshot);
 
     PG_RETURN_INT64(result);
 }
@@ -1497,7 +1494,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_bufferpin);
+        result = (int64) (dbentry->counts.n_conflict_bufferpin);
 
     PG_RETURN_INT64(result);
 }
@@ -1512,7 +1509,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1527,11 +1524,11 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace +
-                          dbentry->n_conflict_lock +
-                          dbentry->n_conflict_snapshot +
-                          dbentry->n_conflict_bufferpin +
-                          dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_tablespace +
+                          dbentry->counts.n_conflict_lock +
+                          dbentry->counts.n_conflict_snapshot +
+                          dbentry->counts.n_conflict_bufferpin +
+                          dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1546,7 +1543,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_deadlocks);
+        result = (int64) (dbentry->counts.n_deadlocks);
 
     PG_RETURN_INT64(result);
 }
@@ -1564,7 +1561,7 @@ pg_stat_get_db_checksum_failures(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_checksum_failures);
+        result = (int64) (dbentry->counts.n_checksum_failures);
 
     PG_RETURN_INT64(result);
 }
@@ -1601,7 +1598,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_read_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_read_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1617,7 +1614,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_write_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_write_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb19644419..51748c99ad 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index a7b7b12249..e6b6126141 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -232,9 +232,6 @@ GetBackendTypeDesc(BackendType backendType)
         case B_ARCHIVER:
             backendDesc = "archiver";
             break;
-        case B_STATS_COLLECTOR:
-            backendDesc = "stats collector";
-            break;
         case B_LOGGER:
             backendDesc = "logger";
             break;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index f4247ea70d..f65d05c24c 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -631,6 +632,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1241,6 +1244,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 79bc7ac8ca..377bc43132 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -745,8 +745,8 @@ const char *const config_group_names[] =
     gettext_noop("Statistics"),
     /* STATS_MONITORING */
     gettext_noop("Statistics / Monitoring"),
-    /* STATS_COLLECTOR */
-    gettext_noop("Statistics / Query and Index Statistics Collector"),
+    /* STATS_ACTIVITY */
+    gettext_noop("Statistics / Query and Index Activity Statistics"),
     /* AUTOVACUUM */
     gettext_noop("Autovacuum"),
     /* CLIENT_CONN */
@@ -1476,7 +1476,7 @@ static struct config_bool ConfigureNamesBool[] =
 #endif
 
     {
-        {"track_activities", PGC_SUSET, STATS_COLLECTOR,
+        {"track_activities", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects information about executing commands."),
             gettext_noop("Enables the collection of information on the currently "
                          "executing command of each session, along with "
@@ -1487,7 +1487,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_counts", PGC_SUSET, STATS_COLLECTOR,
+        {"track_counts", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects statistics on database activity."),
             NULL
         },
@@ -1496,7 +1496,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_io_timing", PGC_SUSET, STATS_COLLECTOR,
+        {"track_io_timing", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects timing statistics for database I/O activity."),
             NULL
         },
@@ -4588,7 +4588,7 @@ static struct config_enum ConfigureNamesEnum[] =
     },
 
     {
-        {"track_functions", PGC_SUSET, STATS_COLLECTOR,
+        {"track_functions", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects function-level statistics on database activity."),
             NULL
         },
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e9f8ca775d..970214f275 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -565,7 +565,7 @@
 # STATISTICS
 #------------------------------------------------------------------------------
 
-# - Query and Index Statistics Collector -
+# - Query and Index Activity Statistics -
 
 #track_activities = on
 #track_counts = on
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 3c70499feb..927ae319b1 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -6,7 +6,7 @@ use File::Basename qw(basename dirname);
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 107;
+use Test::More tests => 106;
 
 program_help_ok('pg_basebackup');
 program_version_ok('pg_basebackup');
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 619b2f9c71..23984d6e24 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -83,6 +83,8 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
 
@@ -320,7 +322,6 @@ typedef enum BackendType
     B_WAL_SENDER,
     B_WAL_WRITER,
     B_ARCHIVER,
-    B_STATS_COLLECTOR,
     B_LOGGER,
 } BackendType;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 763c1ee2bd..69bd794806 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL activity statistics facility.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -15,9 +15,11 @@
 #include "libpq/pqcomm.h"
 #include "miscadmin.h"
 #include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -41,33 +43,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -78,9 +53,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -111,18 +85,17 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_delta_live_tuples;
     PgStat_Counter t_delta_dead_tuples;
     PgStat_Counter t_changed_tuples;
+    bool        reset_changed_tuples;
 
     PgStat_Counter t_blocks_fetched;
     PgStat_Counter t_blocks_hit;
+
+    PgStat_Counter vacuum_count;
+    PgStat_Counter autovac_vacuum_count;
+    PgStat_Counter analyze_count;
+    PgStat_Counter autovac_analyze_count;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -156,6 +129,10 @@ typedef struct PgStat_TableStatus
     Oid            t_id;            /* table's OID */
     bool        t_shared;        /* is it a shared catalog? */
     struct PgStat_TableXactStatus *trans;    /* lowest subxact's counts */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_TableCounts t_counts;    /* event counts to be sent */
 } PgStat_TableStatus;
 
@@ -181,280 +158,32 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
-/* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
- */
-typedef struct PgStat_MsgHdr
-{
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
 /* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgTempFile
+typedef struct PgStat_BgWriter
 {
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter buf_alloc;
+    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+}            PgStat_BgWriter;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -486,96 +215,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Activity statistics data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -584,13 +225,9 @@ typedef union PgStat_Msg
 
 #define PGSTAT_FILE_FORMAT_ID    0x01A5BC9D
 
-/* ----------
- * PgStat_StatDBEntry            The collector's data per database
- * ----------
- */
-typedef struct PgStat_StatDBEntry
+
+typedef struct PgStat_StatDBCounts
 {
-    Oid            databaseid;
     PgStat_Counter n_xact_commit;
     PgStat_Counter n_xact_rollback;
     PgStat_Counter n_blocks_fetched;
@@ -600,7 +237,6 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_tuples_inserted;
     PgStat_Counter n_tuples_updated;
     PgStat_Counter n_tuples_deleted;
-    TimestampTz last_autovac_time;
     PgStat_Counter n_conflict_tablespace;
     PgStat_Counter n_conflict_lock;
     PgStat_Counter n_conflict_snapshot;
@@ -610,29 +246,55 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_temp_bytes;
     PgStat_Counter n_deadlocks;
     PgStat_Counter n_checksum_failures;
-    TimestampTz last_checksum_failure;
     PgStat_Counter n_block_read_time;    /* times in microseconds */
     PgStat_Counter n_block_write_time;
+} PgStat_StatDBCounts;
 
+/* ----------
+ * PgStat_StatDBEntry            The statistics per database
+ * ----------
+ */
+typedef struct PgStat_StatDBEntry
+{
+    Oid            databaseid;
+    TimestampTz last_autovac_time;
+    TimestampTz last_checksum_failure;
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;    /* time of db stats update */
+
+    PgStat_StatDBCounts counts;
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following must be last in the struct, because we don't write them
+     * out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables; /* current gen tables hash */
+    dshash_table_handle functions;    /* current gen functions hash */
 } PgStat_StatDBEntry;
 
+/* ----------
+ * PgStat_HashMountInfo  dshash mount information
+ * ----------
+ */
+typedef struct PgStat_HashMountInfo
+{
+    HTAB       *snapshot_tables;    /* table entry snapshot */
+    HTAB       *snapshot_functions; /* function entry snapshot */
+    dshash_table *dshash_tables;    /* attached tables dshash */
+    dshash_table *dshash_functions; /* attached functions dshash */
+} PgStat_HashMountInfo;
 
 /* ----------
- * PgStat_StatTabEntry            The collector's data per table (or index)
+ * PgStat_StatTabEntry            The statistics per table (or index)
  * ----------
  */
 typedef struct PgStat_StatTabEntry
 {
     Oid            tableid;
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
 
     PgStat_Counter numscans;
 
@@ -652,19 +314,15 @@ typedef struct PgStat_StatTabEntry
     PgStat_Counter blocks_fetched;
     PgStat_Counter blocks_hit;
 
-    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
     PgStat_Counter vacuum_count;
-    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_vacuum_count;
-    TimestampTz analyze_timestamp;    /* user initiated */
     PgStat_Counter analyze_count;
-    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_analyze_count;
 } PgStat_StatTabEntry;
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            The statistics per function
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
@@ -679,7 +337,7 @@ typedef struct PgStat_StatFuncEntry
 
 
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -695,7 +353,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -761,7 +419,6 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
     WAIT_EVENT_WAL_RECEIVER_MAIN,
@@ -1005,7 +662,7 @@ typedef struct PgBackendGSSStatus
  *
  * Each live backend maintains a PgBackendStatus struct in shared memory
  * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
+ * BackendId, but that is not critical.)  Note that the stats-writing code
  * has no involvement in, or even access to, these structs.
  *
  * Each auxiliary process also maintains a PgBackendStatus struct in shared
@@ -1202,13 +859,15 @@ extern PGDLLIMPORT bool pgstat_track_counts;
 extern PGDLLIMPORT int pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used; will be removed together with the GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1223,29 +882,26 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
 
+/* File input/output functions  */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 
 extern void pgstat_report_autovac(Oid dboid);
@@ -1406,8 +1062,8 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                                       void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
+extern void pgstat_report_archiver(const char *xlog, bool failed);
+extern void pgstat_report_bgwriter(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1416,11 +1072,15 @@ extern void pgstat_send_bgwriter(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_snapshot(bool shared,
+                                                                Oid relid);
+extern void pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 8fda8e4f78..13371e8cb7 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -220,6 +220,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_SXACT,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 454c2df487..2c65231b04 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -88,7 +88,7 @@ enum config_group
     PROCESS_TITLE,
     STATS,
     STATS_MONITORING,
-    STATS_COLLECTOR,
+    STATS_ACTIVITY,
     AUTOVACUUM,
     CLIENT_CONN,
     CLIENT_CONN_STATEMENT,
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 83a15f6795..77d1572a99 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.18.2

From 02c52b27190e5778cb92002c5488209b0a188432 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:09 +0900
Subject: [PATCH v30 6/7] Doc part of shared-memory based stats collector.

---
 doc/src/sgml/catalogs.sgml          |   6 +-
 doc/src/sgml/config.sgml            |  29 +++---
 doc/src/sgml/high-availability.sgml |  13 +--
 doc/src/sgml/monitoring.sgml        | 133 +++++++++++++---------------
 doc/src/sgml/ref/pg_dump.sgml       |   9 +-
 5 files changed, 91 insertions(+), 99 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 64614b569c..8bd8fc4d5f 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -8151,9 +8151,9 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l
   <para>
    <xref linkend="view-table"/> lists the system views described here.
    More detailed documentation of each view follows below.
-   There are some additional views that provide access to the results of
-   the statistics collector; they are described in <xref
-   linkend="monitoring-stats-views-table"/>.
+   There are some additional views that provide access to the activity
+   statistics; they are described in
+   <xref linkend="monitoring-stats-views-table"/>.
   </para>
 
   <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2de21903a1..7ed2b3884c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7008,11 +7008,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
     <title>Run-time Statistics</title>
 
     <sect2 id="runtime-config-statistics-collector">
-     <title>Query and Index Statistics Collector</title>
+     <title>Query and Index Activity Statistics</title>
 
      <para>
-      These parameters control server-wide statistics collection features.
-      When statistics collection is enabled, the data that is produced can be
+      These parameters control server-wide activity statistics features.
+      When activity statistics are enabled, the data that is produced can be
       accessed via the <structname>pg_stat</structname> and
       <structname>pg_statio</structname> family of system views.
       Refer to <xref linkend="monitoring"/> for more information.
@@ -7028,14 +7028,13 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables the collection of information on the currently
-        executing command of each session, along with the time when
-        that command began execution. This parameter is on by
-        default. Note that even when enabled, this information is not
-        visible to all users, only to superusers and the user owning
-        the session being reported on, so it should not represent a
-        security risk.
-        Only superusers can change this setting.
+        Enables tracking of the currently executing command of
+        each session, along with the time when that command began
+        execution. This parameter is on by default. Note that even when
+        enabled, this information is not visible to all users, only to
+        superusers and the user owning the session being reported on, so it
+        should not represent a security risk.  Only superusers can change this
+        setting.
        </para>
       </listitem>
      </varlistentry>
@@ -7066,9 +7065,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables collection of statistics on database activity.
+        Enables tracking of database activity.
         This parameter is on by default, because the autovacuum
-        daemon needs the collected information.
+        daemon needs the activity information.
         Only superusers can change this setting.
        </para>
       </listitem>
@@ -8129,7 +8128,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Specifies the fraction of the total number of heap tuples counted in
-        the previous statistics collection that can be inserted without
+        the previous activity statistics that can be inserted without
         incurring an index scan at the <command>VACUUM</command> cleanup stage.
         This setting currently applies to B-tree indexes only.
        </para>
@@ -8142,7 +8141,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         Index statistics are considered to be stale if the number of newly
         inserted tuples exceeds the <varname>vacuum_cleanup_index_scale_factor</varname>
         fraction of the total number of heap tuples detected by the previous
-        statistics collection. The total number of heap tuples is stored in
+        activity statistics. The total number of heap tuples is stored in
         the index meta-page. Note that the meta-page does not include this data
         until <command>VACUUM</command> finds no dead tuples, so B-tree index
         scan at the cleanup stage can only be skipped if the second and
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b5d32bb720..b9b73e59f6 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2381,12 +2381,13 @@ HINT:  Recovery cannot continue unless the configuration is changed and the serv
    </para>
 
    <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
-    index usage, etc., will be recorded normally on the standby. Replayed
-    actions will not duplicate their effects on primary, so replaying an
-    insert will not increment the Inserts column of pg_stat_user_tables.
-    The stats file is deleted at the start of recovery, so stats from primary
-    and standby will differ; this is considered a feature, not a bug.
+    Activity statistics are collected during recovery. All scans, reads,
+    blocks, index usage, etc., will be recorded normally on the
+    standby. Replayed actions will not duplicate their effects on primary, so
+    replaying an insert will not increment the Inserts column of
+    pg_stat_user_tables.  Activity statistics are reset at the start of
+    recovery, so stats from primary and standby will differ; this is
+    considered a feature, not a bug.
    </para>
 
    <para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 220b8164c3..efdcd6fda8 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -22,7 +22,7 @@
   <para>
    Several tools are available for monitoring database activity and
    analyzing performance.  Most of this chapter is devoted to describing
-   <productname>PostgreSQL</productname>'s statistics collector,
+   <productname>PostgreSQL</productname>'s activity statistics,
    but one should not neglect regular Unix monitoring programs such as
    <command>ps</command>, <command>top</command>, <command>iostat</command>, and <command>vmstat</command>.
    Also, once one has identified a
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    master server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   master process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   master process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
@@ -130,20 +128,21 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
  </sect1>
 
  <sect1 id="monitoring-stats">
-  <title>The Statistics Collector</title>
+  <title>The Activity Statistics</title>
 
   <indexterm zone="monitoring-stats">
    <primary>statistics</primary>
   </indexterm>
 
   <para>
-   <productname>PostgreSQL</productname>'s <firstterm>statistics collector</firstterm>
-   is a subsystem that supports collection and reporting of information about
-   server activity.  Presently, the collector can count accesses to tables
-   and indexes in both disk-block and individual-row terms.  It also tracks
-   the total number of rows in each table, and information about vacuum and
-   analyze actions for each table.  It can also count calls to user-defined
-   functions and the total time spent in each one.
+   <productname>PostgreSQL</productname>'s <firstterm>activity
+   statistics</firstterm> is a subsystem that supports tracking and reporting
+   of information about server activity.  Presently, the activity statistics
+   tracks the count of accesses to tables and indexes in both disk-block and
+   individual-row terms.  It also tracks the total number of rows in each
+   table, and information about vacuum and analyze actions for each table.  It
+   can also track calls to user-defined functions and the total time spent in
+   each one.
   </para>
 
   <para>
@@ -151,15 +150,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    information about exactly what is going on in the system right now, such as
    the exact command currently being executed by other server processes, and
    which other connections exist in the system.  This facility is independent
-   of the collector process.
+   of the activity statistics.
   </para>
 
  <sect2 id="monitoring-stats-setup">
-  <title>Statistics Collection Configuration</title>
+  <title>Activity Statistics Configuration</title>
 
   <para>
-   Since collection of statistics adds some overhead to query execution,
-   the system can be configured to collect or not collect information.
+   Since tracking for the activity statistics adds some overhead to query
+   execution, the system can be configured to track or not track activity.
    This is controlled by configuration parameters that are normally set in
    <filename>postgresql.conf</filename>.  (See <xref linkend="runtime-config"/> for
    details about setting configuration parameters.)
@@ -172,7 +171,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The parameter <xref linkend="guc-track-counts"/> controls whether
-   statistics are collected about table and index accesses.
+   to track activity about table and index accesses.
   </para>
 
   <para>
@@ -196,18 +195,12 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
   </para>
 
   <para>
-   The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
-   When the server shuts down cleanly, a permanent copy of the statistics
-   data is stored in the <filename>pg_stat</filename> subdirectory, so that
-   statistics can be retained across server restarts.  When recovery is
-   performed at server start (e.g. after immediate shutdown, server crash,
-   and point-in-time recovery), all statistics counters are reset.
+   The activity statistics are kept in shared memory.  When the server shuts
+   down cleanly, a permanent copy of the statistics data is stored in
+   the <filename>pg_stat</filename> subdirectory, so that statistics can be
+   retained across server restarts.  When recovery is performed at server
+   start (e.g. after immediate shutdown, server crash, and point-in-time
+   recovery), all statistics counters are reset.
   </para>
 
  </sect2>
@@ -220,48 +213,46 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    linkend="monitoring-stats-dynamic-views-table"/>, are available to show
    the current state of the system. There are also several other
    views, listed in <xref
-   linkend="monitoring-stats-views-table"/>, available to show the results
-   of statistics collection.  Alternatively, one can
-   build custom views using the underlying statistics functions, as discussed
-   in <xref linkend="monitoring-stats-functions"/>.
+   linkend="monitoring-stats-views-table"/>, available to show the activity
+   statistics.  Alternatively, one can build custom views using the underlying
+   statistics functions, as discussed in
+   <xref linkend="monitoring-stats-functions"/>.
   </para>
 
   <para>
-   When using the statistics to monitor collected data, it is important
-   to realize that the information does not update instantaneously.
-   Each individual server process transmits new statistical counts to
-   the collector just before going idle; so a query or transaction still in
-   progress does not affect the displayed totals.  Also, the collector itself
-   emits a new report at most once per <varname>PGSTAT_STAT_INTERVAL</varname>
-   milliseconds (500 ms unless altered while building the server).  So the
-   displayed information lags behind actual activity.  However, current-query
-   information collected by <varname>track_activities</varname> is
-   always up-to-date.
+   When using the activity statistics, it is important to realize that the
+   information does not update instantaneously.  Each server process writes
+   out new statistical counts just before going idle, but no more often than
+   once per <varname>PGSTAT_STAT_INTERVAL</varname> milliseconds (1 second
+   unless altered while building the server); so a query or transaction still in
+   progress does not affect the displayed totals.  However, current-query
+   information tracked by <varname>track_activities</varname> is always
+   up-to-date.
   </para>
 
   <para>
    Another important point is that when a server process is asked to display
-   any of these statistics, it first fetches the most recent report emitted by
-   the collector process and then continues to use this snapshot for all
-   statistical views and functions until the end of its current transaction.
-   So the statistics will show static information as long as you continue the
-   current transaction.  Similarly, information about the current queries of
-   all sessions is collected when any such information is first requested
-   within a transaction, and the same information will be displayed throughout
-   the transaction.
-   This is a feature, not a bug, because it allows you to perform several
-   queries on the statistics and correlate the results without worrying that
-   the numbers are changing underneath you.  But if you want to see new
-   results with each query, be sure to do the queries outside any transaction
-   block.  Alternatively, you can invoke
+   any of these statistics, it first reads the current statistics and then
+   continues to use this snapshot for all statistical views and functions
+   until the end of its current transaction.  So the statistics will show
+   static information as long as you continue the current transaction.
+   Similarly, information about the current queries of all sessions is tracked
+   when any such information is first requested within a transaction, and the
+   same information will be displayed throughout the transaction.  This is a
+   feature, not a bug, because it allows you to perform several queries on the
+   statistics and correlate the results without worrying that the numbers are
+   changing underneath you.  But if you want to see new results with each
+   query, be sure to do the queries outside any transaction block.
+   Alternatively, you can invoke
    <function>pg_stat_clear_snapshot</function>(), which will discard the
    current transaction's statistics snapshot (if any).  The next use of
    statistical information will cause a new snapshot to be fetched.
   </para>
-
+  
   <para>
-   A transaction can also see its own statistics (as yet untransmitted to the
-   collector) in the views <structname>pg_stat_xact_all_tables</structname>,
+   A transaction can also see its own statistics (as yet unwritten to the
+   server-wide activity statistics) in the
+   views <structname>pg_stat_xact_all_tables</structname>,
    <structname>pg_stat_xact_sys_tables</structname>,
    <structname>pg_stat_xact_user_tables</structname>, and
    <structname>pg_stat_xact_user_functions</structname>.  These numbers do not act as
@@ -596,7 +587,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    kernel's I/O cache, and might therefore still be fetched without
    requiring a physical read. Users interested in obtaining more
    detailed information on <productname>PostgreSQL</productname> I/O behavior are
-   advised to use the <productname>PostgreSQL</productname> statistics collector
+   advised to use the <productname>PostgreSQL</productname> activity statistics
    in combination with operating system utilities that allow insight
    into the kernel's handling of I/O.
   </para>
@@ -914,7 +905,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
       <tbody>
        <row>
-        <entry morerows="64"><literal>LWLock</literal></entry>
+        <entry morerows="65"><literal>LWLock</literal></entry>
         <entry><literal>ShmemIndexLock</literal></entry>
         <entry>Waiting to find or allocate space in shared memory.</entry>
        </row>
@@ -1197,6 +1188,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting to allocate or exchange a chunk of memory or update
          counters during Parallel Hash plan execution.</entry>
         </row>
+        <row>
+         <entry><literal>activity_statistics</literal></entry>
+         <entry>Waiting to write out activity statistics to shared memory
+         during transaction end or idle time.</entry>
+        </row>
         <row>
          <entry morerows="9"><literal>Lock</literal></entry>
          <entry><literal>relation</literal></entry>
@@ -1244,7 +1240,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting to acquire a pin on a buffer.</entry>
         </row>
         <row>
-         <entry morerows="12"><literal>Activity</literal></entry>
+         <entry morerows="11"><literal>Activity</literal></entry>
          <entry><literal>ArchiverMain</literal></entry>
          <entry>Waiting in main loop of the archiver process.</entry>
         </row>
@@ -1272,10 +1268,6 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry><literal>LogicalLauncherMain</literal></entry>
          <entry>Waiting in main loop of logical launcher process.</entry>
         </row>
-        <row>
-         <entry><literal>PgStatMain</literal></entry>
-         <entry>Waiting in main loop of the statistics collector process.</entry>
-        </row>
         <row>
          <entry><literal>RecoveryWalStream</literal></entry>
          <entry>Waiting for WAL from a stream at recovery.</entry>
@@ -4177,9 +4169,10 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
      <entry><literal>performing final cleanup</literal></entry>
      <entry>
        <command>VACUUM</command> is performing final cleanup.  During this phase,
-       <command>VACUUM</command> will vacuum the free space map, update statistics
-       in <literal>pg_class</literal>, and report statistics to the statistics
-       collector.  When this phase is completed, <command>VACUUM</command> will end.
+       <command>VACUUM</command> will vacuum the free space map, update
+       statistics in <literal>pg_class</literal>, and update the system-wide
+       activity statistics.  When this phase is completed,
+       <command>VACUUM</command> will end.
      </entry>
     </row>
    </tbody>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index a9bc397165..a9289f84b0 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1289,11 +1289,10 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The database activity of <application>pg_dump</application> is
-   normally collected by the statistics collector.  If this is
-   undesirable, you can set parameter <varname>track_counts</varname>
-   to false via <envar>PGOPTIONS</envar> or the <literal>ALTER
-   USER</literal> command.
+   The database activity of <application>pg_dump</application> is normally
+   collected.  If this is undesirable, you can set
+   parameter <varname>track_counts</varname> to false
+   via <envar>PGOPTIONS</envar> or the <literal>ALTER USER</literal> command.
   </para>
 
  </refsect1>
-- 
2.18.2

From dee2d8ebb706f69470922a065d9a8c771ed7aeb1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 17:00:44 +0900
Subject: [PATCH v30 7/7] Remove the GUC stats_temp_directory

The GUC used to specify the directory to store temporary statistics
files. It is no longer needed by the stats collector but still used by
the programs in bin and contrib, and maybe other extensions. Thus this
patch removes the GUC but some backing variables and macro definitions
are left alone for backward compatibility.
---
 doc/src/sgml/backup.sgml                      |  2 -
 doc/src/sgml/config.sgml                      | 19 ---------
 doc/src/sgml/storage.sgml                     |  3 +-
 src/backend/postmaster/pgstat.c               | 13 +++---
 src/backend/replication/basebackup.c          | 13 ++----
 src/backend/utils/misc/guc.c                  | 41 -------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/pgstat.h                          |  5 ++-
 src/test/perl/PostgresNode.pm                 |  4 --
 9 files changed, 13 insertions(+), 88 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index bdc9026c62..2885540362 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1146,8 +1146,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 7ed2b3884c..0c251c8ac6 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7120,25 +7120,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 1c19e863d2..2f04bb68bb 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -122,8 +122,7 @@ Item
 
 <row>
  <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
+ <entry>Subdirectory containing ephemeral files for extensions</entry>
 </row>
 
 <row>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 6e76cc40bc..c32f7d19db 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -95,15 +95,12 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
+/*
+ * This used to be a GUC variable and is no longer used in this file, but left
+ * alone just for backward compatibility for extensions, having the default
+ * value.
  */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+char       *pgstat_stat_directory = PG_STAT_TMP_DIR;
 
 /*
  * Shared stats bootstrap information, protected by StatsLock.
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 261920b961..733533b955 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -254,7 +254,6 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
     backup_total = 0;
@@ -273,8 +272,6 @@ perform_base_backup(basebackup_options *opt)
                                      backup_total);
     }
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -306,13 +303,9 @@ perform_base_backup(basebackup_options *opt)
          * Calculate the relative path of temporary statistics directory in
          * order to skip the files which are located in that directory later.
          */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
+
+        Assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
+        statrelpath = psprintf("./%s", PG_STAT_TMP_DIR);
 
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 377bc43132..ce7d060ef4 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -197,7 +197,6 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -4251,17 +4250,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11538,35 +11526,6 @@ check_maintenance_io_concurrency(int *newval, void **extra, GucSource source)
     return true;
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 970214f275..69ba46b5be 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -572,7 +572,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 69bd794806..263f9ace1f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -32,7 +32,10 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
+/*
+ * This used to be the directory to store temporary statistics data in but is
+ * no longer used. Defined here for backward compatibility.
+ */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
 
 /* Values for track_functions GUC variable --- order is significant! */
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 1d5450758e..28b39f6b2a 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -455,10 +455,6 @@ sub init
     print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
       if defined $ENV{TEMP_CONFIG};
 
-    # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
-    # concurrently must not share a stats_temp_directory.
-    print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
     if ($params{allows_streaming})
     {
         if ($params{allows_streaming} eq "logical")
-- 
2.18.2


Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
Conflicted with the commit 28cac71bd3 - SLRU stats.

Rebased and fixed some issues.

At Wed, 01 Apr 2020 17:37:23 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > Conflicted with 616ae3d2b0, so rebased.
> 
> I made some cleanup. (v30)
> 
> - Added comments for members of dshash_seq_scans.
> - Some style fix and comment fix of dshash.
> 
> - Cleaned up more usage of the word "stat(istics) collector" in comments,
> - Changed the GUC attribute STATS_COLLECTOR to STATS_ACTIVITY
> - Removed duplicate setup of MyBackendType and ps display in PgArchiverMain
> - Removed B_STATS_COLLECTOR from BackendType and removed related code.
> - Corrected the comment of PgArchiverMain, which mentioned "argv/argc".

I made further cleanups.

- Removed wrongly added BACKEND_TYPE_ARCHIVER.

- Moved archiver latch from XLogCtlData to ProcGlobal.

- Removed XLogArchiverStart/End/Wakeup from xlog.h, which was not the
  proper place for them anyway.

- pgarch_MainLoop starts the loop with wakened = true both when
  notified and when timed out. Otherwise time_to_stop is set and it
  exits from the loop immediately. So the variable wakened is actually
  useless. Removed it.

- A PoC (or a rush work) of refactoring of SLRU stats.

  I tried to combine it into the global stats hash, but the SLRU
  report functions are called within critical sections, so memory
  allocation fails. The current pgstat module removes the local entry
  after successfully flushing it out to shared stats, so allocation at
  the first report is inevitable.  In the attached version it is handled
  the same way as global stats.  I am still looking for a way to
  combine it into the global stats hash.
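
For illustration only: one allocation-free way to count events from inside a critical section is to keep the pending counters in a fixed-size array and fold them into the shared totals later. The sketch below assumes that shape. All demo_* names and the array size are hypothetical, a plain static array stands in for shared memory, and none of it is taken from the attached patch.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical names and sizes; nothing here exists in PostgreSQL. */
#define DEMO_NUM_SLRUS 8

typedef struct DemoSlruCounts
{
    long        blocks_hit;
    long        blocks_read;
} DemoSlruCounts;

/* Statically sized, so reporting an event never allocates memory. */
static DemoSlruCounts demo_pending[DEMO_NUM_SLRUS];

/* Stand-in for the totals that would live in shared memory. */
static DemoSlruCounts demo_shared[DEMO_NUM_SLRUS];

/* Safe to call where allocation is forbidden: only bumps a counter. */
void
demo_count_hit(int slru)
{
    assert(slru >= 0 && slru < DEMO_NUM_SLRUS);
    demo_pending[slru].blocks_hit++;
}

void
demo_count_read(int slru)
{
    assert(slru >= 0 && slru < DEMO_NUM_SLRUS);
    demo_pending[slru].blocks_read++;
}

/*
 * Fold pending counts into the shared totals and reset the pending array.
 * Returns the number of slots that had something to flush.
 */
int
demo_flush(void)
{
    int         flushed = 0;

    for (int i = 0; i < DEMO_NUM_SLRUS; i++)
    {
        if (demo_pending[i].blocks_hit == 0 &&
            demo_pending[i].blocks_read == 0)
            continue;

        demo_shared[i].blocks_hit += demo_pending[i].blocks_hit;
        demo_shared[i].blocks_read += demo_pending[i].blocks_read;
        flushed++;
    }

    memset(demo_pending, 0, sizeof(demo_pending));
    return flushed;
}

long
demo_shared_hits(int slru)
{
    assert(slru >= 0 && slru < DEMO_NUM_SLRUS);
    return demo_shared[slru].blocks_hit;
}
```

The trade-off is that the set of counters must be known in advance, which is why this shape fits a small fixed list better than a hash keyed by arbitrary objects.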

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 7a50aa4424c4db32b9d529bd2b5d1c20ac3da8e8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Mon, 16 Mar 2020 17:15:35 +0900
Subject: [PATCH v31 1/7] Use standard crash handler in archiver.

The commit 8e19a82640 changed SIGQUIT handler of almost all processes
not to run atexit callbacks for safety. Archiver process should behave
the same way for the same reason. The exit status changes from 1 to 2,
but that doesn't make any behavioral change.
---
 src/backend/postmaster/pgarch.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 01ffd6513c..37be0e2bbb 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -96,7 +96,6 @@ static pid_t pgarch_forkexec(void);
 #endif
 
 NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_exit(SIGNAL_ARGS);
 static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
@@ -229,7 +228,7 @@ PgArchiverMain(int argc, char *argv[])
     pqsignal(SIGHUP, SignalHandlerForConfigReload);
     pqsignal(SIGINT, SIG_IGN);
     pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
-    pqsignal(SIGQUIT, pgarch_exit);
+    pqsignal(SIGQUIT, SignalHandlerForCrashExit);
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
     pqsignal(SIGUSR1, pgarch_waken);
@@ -246,14 +245,6 @@ PgArchiverMain(int argc, char *argv[])
     exit(0);
 }
 
-/* SIGQUIT signal handler for archiver process */
-static void
-pgarch_exit(SIGNAL_ARGS)
-{
-    /* SIGQUIT means curl up and die ... */
-    exit(1);
-}
-
 /* SIGUSR1 signal handler for archiver process */
 static void
 pgarch_waken(SIGNAL_ARGS)
-- 
2.18.2

From fc8429dd9be49fad2dcfef191f2367dadc755343 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:03 +0900
Subject: [PATCH v31 2/7] sequential scan for dshash

Dshash did not allow scanning all entries sequentially. This adds that
functionality. The interface is similar to, but a bit different from,
both that of dynahash and the simple dshash search functions. One of the
most significant differences is that the sequential scan interface of
dshash always needs a call to dshash_seq_term when a scan ends. Another
is locking. Dshash holds a partition lock when returning an entry;
dshash_seq_next() also holds a lock when returning an entry, but callers
shouldn't release it, since the lock is essential to continue the
scan. The seqscan interface allows entry deletion during a scan. The
in-scan deletion should be performed by dshash_delete_current().
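
The call pattern described above can be sketched with a small standalone model. This is not the patch's code: the Demo* and demo_* names are hypothetical, an int array stands in for the partition LWLocks, and ordinary malloc'ed lists stand in for DSA memory. It only shows why the scan saves the next item before returning the current one, and why the terminating call is always needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_NBUCKETS    8
#define DEMO_NPARTITIONS 4      /* two buckets per partition */

typedef struct DemoItem { int value; struct DemoItem *next; } DemoItem;

typedef struct DemoTable
{
    DemoItem   *buckets[DEMO_NBUCKETS];
    int         lock_held[DEMO_NPARTITIONS];    /* stand-in for LWLocks */
} DemoTable;

typedef struct DemoSeqStatus
{
    DemoTable  *table;
    int         curbucket;
    int         curpartition;   /* -1 until the first lock is taken */
    DemoItem   *curitem;
    DemoItem   *nextitem;       /* saved so curitem may be deleted */
} DemoSeqStatus;

void demo_insert(DemoTable *t, int value)
{
    DemoItem   *item = malloc(sizeof(DemoItem));
    int         b = (int) ((unsigned) value % DEMO_NBUCKETS);

    if (item == NULL)
        abort();
    item->value = value;
    item->next = t->buckets[b];
    t->buckets[b] = item;
}

void demo_seq_init(DemoSeqStatus *s, DemoTable *t)
{
    s->table = t;
    s->curbucket = 0;
    s->curpartition = -1;
    s->curitem = NULL;
    s->nextitem = NULL;
}

/* Returns the next item, or NULL when the scan is exhausted. */
DemoItem *demo_seq_next(DemoSeqStatus *s)
{
    DemoItem   *next;

    if (s->curpartition < 0)
    {
        /* First call: lock the first partition. */
        s->curpartition = 0;
        s->table->lock_held[0] = 1;
        next = s->table->buckets[0];
    }
    else
        next = s->nextitem;

    while (next == NULL)
    {
        int         np;

        if (++s->curbucket >= DEMO_NBUCKETS)
            return NULL;        /* lock stays held until demo_seq_term */
        np = s->curbucket / (DEMO_NBUCKETS / DEMO_NPARTITIONS);
        if (np != s->curpartition)
        {
            /* Take the next lock before releasing the current one. */
            s->table->lock_held[np] = 1;
            s->table->lock_held[s->curpartition] = 0;
            s->curpartition = np;
        }
        next = s->table->buckets[s->curbucket];
    }
    s->curitem = next;
    s->nextitem = next->next;   /* survives deletion of curitem */
    return next;
}

/* Unlink and free the item most recently returned by demo_seq_next. */
void demo_delete_current(DemoSeqStatus *s)
{
    DemoItem  **link = &s->table->buckets[s->curbucket];

    assert(s->curitem != NULL);
    while (*link != s->curitem)
        link = &(*link)->next;
    *link = s->curitem->next;
    free(s->curitem);
    s->curitem = NULL;
}

/* Must always be called: releases whatever lock the scan still holds. */
void demo_seq_term(DemoSeqStatus *s)
{
    if (s->curpartition >= 0)
        s->table->lock_held[s->curpartition] = 0;
}

/* Usage: delete every even value; returns how many were deleted. */
int demo_delete_even(DemoTable *t)
{
    DemoSeqStatus s;
    DemoItem   *item;
    int         n = 0;

    demo_seq_init(&s, t);
    while ((item = demo_seq_next(&s)) != NULL)
        if (item->value % 2 == 0)
        {
            demo_delete_current(&s);
            n++;
        }
    demo_seq_term(&s);
    return n;
}
```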
---
 src/backend/lib/dshash.c | 149 ++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  21 ++++++
 2 files changed, 169 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 78ccf03217..1ef093e2e9 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -127,6 +127,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +157,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -324,7 +332,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -592,6 +600,145 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should always be called when a scan finishes.
+ * The caller may delete returned elements in the midst of a scan by using
+ * dshash_delete_current(). exclusive must be true to delete elements.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool exclusive)
+{
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->exclusive = exclusive;
+}
+
+/*
+ * Returns the next element.
+ *
+ * Returned elements are locked and the caller must not explicitly release
+ * it. It is released at the next call to dshash_seq_next().
+ */
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert(status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+        LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                      status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        status->curpartition = partition;
+
+        /* resize doesn't happen from now until seq scan ends */
+        status->nbuckets =
+            NUM_BUCKETS(status->hash_table->control->size_log2);
+        ensure_valid_bucket_pointers(status->hash_table);
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        int next_partition;
+
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            return NULL;
+        }
+
+        /* Check if move to the next partition */
+        next_partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+
+        if (status->curpartition != next_partition)
+        {
+            /*
+             * Move to the next partition. Lock the next partition then release
+             * the current, not in the reverse order to avoid concurrent
+             * resizing.  Avoid deadlock by taking locks in the same order
+             * as resize().
+             */
+            LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                         next_partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                         status->curpartition));
+            status->curpartition = next_partition;
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * The caller may delete the item. Store the next item in case of deletion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+/*
+ * Terminates the seqscan and release all locks.
+ *
+ * Should be always called when finishing or exiting a seqscan.
+ */
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+
+    if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+/* Remove the current entry during a seq scan. */
+void
+dshash_delete_current(dshash_seq_status *status)
+{
+    dshash_table       *hash_table    = status->hash_table;
+    dshash_table_item  *item        = status->curitem;
+    size_t                partition    = PARTITION_FOR_HASH(item->hash);
+
+    Assert(status->exclusive);
+    Assert(hash_table->control->magic == DSHASH_MAGIC);
+    Assert(hash_table->find_locked);
+    Assert(hash_table->find_exclusively_locked);
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition),
+                                LW_EXCLUSIVE));
+
+    delete_item(hash_table, item);
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b86df68e77..0ca9514021 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,21 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state. The details are exposed so that callers know the
+ * storage size, but callers should treat it as an opaque type.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;    /* dshash table working on */
+    int                    curbucket;    /* bucket number we are at */
+    int                    nbuckets;    /* total number of buckets in the dshash */
+    dshash_table_item  *curitem;    /* item we are currently at */
+    dsa_pointer            pnextitem;    /* dsa-pointer to the next item */
+    int                    curpartition;    /* partition number we are at */
+    bool                exclusive;    /* locking mode */
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
                                    const dshash_parameters *params,
@@ -80,6 +95,12 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
+extern void dshash_delete_current(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.18.2
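
Note on the seq scan above: a minimal single-threaded sketch of the
partition hand-over in dshash_seq_next, not part of the patch. All
demo_* names are hypothetical, the "locks" are plain booleans standing
in for the per-partition LWLocks, and each bucket holds one value
instead of a chain of items. It only shows the ordering: the next
partition is locked before the current one is released, and the last
partition stays locked until the scan is terminated.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_NPARTITIONS 4
#define DEMO_BUCKETS_PER_PARTITION 2
#define DEMO_NBUCKETS (DEMO_NPARTITIONS * DEMO_BUCKETS_PER_PARTITION)

typedef struct demo_table
{
    bool        locked[DEMO_NPARTITIONS];   /* stand-in for partition locks */
    int         buckets[DEMO_NBUCKETS];     /* 0 means "empty bucket" */
} demo_table;

typedef struct demo_scan
{
    demo_table *table;
    int         curbucket;      /* bucket we are at, -1 before the start */
    int         curpartition;   /* partition we hold, -1 if none */
} demo_scan;

static void
demo_scan_init(demo_scan *scan, demo_table *table)
{
    scan->table = table;
    scan->curbucket = -1;
    scan->curpartition = -1;
}

/* Returns the next non-empty bucket value, or 0 when the scan is done. */
static int
demo_scan_next(demo_scan *scan)
{
    while (++scan->curbucket < DEMO_NBUCKETS)
    {
        int         next_partition =
            scan->curbucket / DEMO_BUCKETS_PER_PARTITION;

        if (next_partition != scan->curpartition)
        {
            /* Lock the next partition first, then release the current. */
            scan->table->locked[next_partition] = true;
            if (scan->curpartition >= 0)
                scan->table->locked[scan->curpartition] = false;
            scan->curpartition = next_partition;
        }

        if (scan->table->buckets[scan->curbucket] != 0)
            return scan->table->buckets[scan->curbucket];
    }

    return 0;
}

/* Releases whatever partition the scan still holds. */
static void
demo_scan_term(demo_scan *scan)
{
    if (scan->curpartition >= 0)
        scan->table->locked[scan->curpartition] = false;
    scan->curpartition = -1;
}
```

A caller would loop on demo_scan_next until it returns 0 and then call
demo_scan_term, which mirrors the dshash_seq_init / dshash_seq_next /
dshash_seq_term sequence declared in dshash.h.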

From 2b2b7032de446e74c3db429af5fcad30641db2b1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:57 +0900
Subject: [PATCH v31 3/7] Add conditional lock feature to dshash

dshash currently waits for partition locks unconditionally, which is
inconvenient when a caller wants to avoid being blocked by other
processes. This commit adds dshash_find_extended, a generalization of
dshash_find and dshash_find_or_insert that can return immediately on
lock failure.
---
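
Note: a minimal sketch of the "nowait" behaviour described above, not
part of the patch. It assumes a single partition guarded by a C11
atomic_flag instead of an LWLock, and all demo_* names are hypothetical.
The lock_failed output parameter is an addition for illustration only;
dshash_find_extended in this patch simply returns NULL when the
conditional acquire fails.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one partition: a lock plus a single entry. */
typedef struct demo_partition
{
    atomic_flag lock;           /* set while somebody holds the partition */
    int         key;
    int         value;
    bool        used;
} demo_partition;

/* Try once to take the lock; returns true on success. */
static bool
demo_try_lock(demo_partition *part)
{
    return !atomic_flag_test_and_set_explicit(&part->lock,
                                              memory_order_acquire);
}

/* Unconditional acquire: spin until the lock is free. */
static void
demo_lock(demo_partition *part)
{
    while (!demo_try_lock(part))
        ;                       /* busy-wait; a real lock would sleep */
}

static void
demo_unlock(demo_partition *part)
{
    atomic_flag_clear_explicit(&part->lock, memory_order_release);
}

/*
 * Look up "key".  On success the partition stays locked and the caller
 * must call demo_unlock().  Returns NULL if the key is absent, or if
 * "nowait" is true and the lock is busy (then *lock_failed is true).
 */
static int *
demo_find(demo_partition *part, int key, bool nowait, bool *lock_failed)
{
    *lock_failed = false;

    if (!nowait)
        demo_lock(part);
    else if (!demo_try_lock(part))
    {
        *lock_failed = true;
        return NULL;
    }

    if (!part->used || part->key != key)
    {
        /* Not found: release the lock before returning. */
        demo_unlock(part);
        return NULL;
    }

    return &part->value;
}
```

With nowait set, a caller that finds the partition busy gets control
back at once and can retry later instead of stalling.
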
 src/backend/lib/dshash.c | 98 +++++++++++++++++++++-------------------
 src/include/lib/dshash.h |  3 ++
 2 files changed, 55 insertions(+), 46 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 1ef093e2e9..d7ee6de11e 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -383,6 +383,10 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  * the caller must take care to ensure that the entry is not left corrupted.
  * The lock mode is either shared or exclusive depending on 'exclusive'.
  *
+ * If found is not NULL, *found is set to true if the key is found in the hash
+ * table. If the key is not found, *found is set to false and a pointer to a
+ * newly created entry is returned.
+ *
  * The caller must not lock a lock already.
  *
  * Note that the lock held is in fact an LWLock, so interrupts will be held on
@@ -392,36 +396,7 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
 {
-    dshash_hash hash;
-    size_t        partition;
-    dshash_table_item *item;
-
-    hash = hash_key(hash_table, key);
-    partition = PARTITION_FOR_HASH(hash);
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
-
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
-    ensure_valid_bucket_pointers(hash_table);
-
-    /* Search the active bucket. */
-    item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
-
-    if (!item)
-    {
-        /* Not found. */
-        LWLockRelease(PARTITION_LOCK(hash_table, partition));
-        return NULL;
-    }
-    else
-    {
-        /* The caller will free the lock by calling dshash_release_lock. */
-        hash_table->find_locked = true;
-        hash_table->find_exclusively_locked = exclusive;
-        return ENTRY_FROM_ITEM(item);
-    }
+    return dshash_find_extended(hash_table, key, exclusive, false, false, NULL);
 }
 
 /*
@@ -439,31 +414,60 @@ dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
 {
-    dshash_hash hash;
-    size_t        partition_index;
-    dshash_partition *partition;
+    return dshash_find_extended(hash_table, key, true, false, true, found);
+}
+
+
+/*
+ * Find the key in the hash table.
+ *
+ * "exclusive" is the lock mode in which the partition for the returned item
+ * is locked.  If "nowait" is true, the function returns NULL immediately if
+ * the required lock cannot be acquired.  "insert" indicates insert mode: if
+ * the key is not found, a new entry is inserted and *found is set to false;
+ * otherwise *found is set to true.  "found" must be non-null in this mode.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool insert, bool *found)
+{
+    dshash_hash hash = hash_key(hash_table, key);
+    size_t        partidx = PARTITION_FOR_HASH(hash);
+    dshash_partition *partition = &hash_table->control->partitions[partidx];
+    LWLockMode  lockmode = exclusive ? LW_EXCLUSIVE : LW_SHARED;
     dshash_table_item *item;
 
-    hash = hash_key(hash_table, key);
-    partition_index = PARTITION_FOR_HASH(hash);
-    partition = &hash_table->control->partitions[partition_index];
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
+    /* must be exclusive when insert allowed */
+    Assert(!insert || (exclusive && found != NULL));
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (!nowait)
+        LWLockAcquire(PARTITION_LOCK(hash_table, partidx), lockmode);
+    else if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partidx),
+                                       lockmode))
+        return NULL;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
     item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
 
     if (item)
-        *found = true;
+    {
+        if (found)
+            *found = true;
+    }
     else
     {
-        *found = false;
+        if (found)
+            *found = false;
+
+        if (!insert)
+        {
+            /* The caller didn't ask us to add a new entry. */
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+            return NULL;
+        }
 
         /* Check if we are getting too full. */
         if (partition->count > MAX_COUNT_PER_PARTITION(hash_table))
@@ -479,7 +483,8 @@ restart:
              * Give up our existing lock first, because resizing needs to
              * reacquire all the locks in the right order to avoid deadlocks.
              */
-            LWLockRelease(PARTITION_LOCK(hash_table, partition_index));
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+
             resize(hash_table, hash_table->size_log2 + 1);
 
             goto restart;
@@ -493,12 +498,13 @@ restart:
         ++partition->count;
     }
 
-    /* The caller must release the lock with dshash_release_lock. */
+    /* The caller will free the lock by calling dshash_release_lock. */
     hash_table->find_locked = true;
-    hash_table->find_exclusively_locked = true;
+    hash_table->find_exclusively_locked = exclusive;
     return ENTRY_FROM_ITEM(item);
 }
 
+
 /*
  * Remove an entry by key.  Returns true if the key was found and the
  * corresponding entry was removed.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 0ca9514021..5f7a60febd 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -91,6 +91,9 @@ extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                                    const void *key, bool *found);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+                                  bool exclusive, bool nowait, bool insert,
+                                  bool *found);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.18.2

From c15af49f474ac5cd0908fc122e3b4c6c5c6f0a7d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:59:38 +0900
Subject: [PATCH v31 4/7] Make archiver process an auxiliary process

This is a preliminary patch for the shared-memory based stats collector.

The archiver process must be an auxiliary process because it accesses
shared memory once stats data is moved into shared memory. Make it an
auxiliary process so that this works.
---
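
Note: a minimal model of the wake-up path this patch switches to, not
part of the patch. Instead of relaying a signal through the postmaster,
the archiver advertises its latch in shared memory and notifiers set it
directly, doing nothing while no latch is advertised. Here the latch is
reduced to a C11 atomic flag; a real PostgreSQL latch also wakes a
sleeping process, which this sketch does not attempt. All demo_* names
are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical latch: just a flag that notifiers set and the owner clears. */
typedef struct demo_latch
{
    atomic_bool is_set;
} demo_latch;

/* Models ProcGlobal->archiverLatch: NULL until the archiver is running. */
static demo_latch *demo_advertised_latch = NULL;

/* Called by the archiver at startup to publish its latch. */
static void
demo_advertise(demo_latch *latch)
{
    demo_advertised_latch = latch;
}

/* Called by any process that has queued work for the archiver. */
static void
demo_notify(void)
{
    if (demo_advertised_latch != NULL)
        atomic_store(&demo_advertised_latch->is_set, true);
}

/*
 * Called by the archiver in its main loop: returns true if a wake-up
 * was pending, and clears it so that later notifications are seen.
 */
static bool
demo_consume(demo_latch *latch)
{
    return atomic_exchange(&latch->is_set, false);
}
```

Several notifications that arrive before the archiver looks at its latch
collapse into one wake-up, which is harmless because the archiver scans
for all pending work each time it wakes.
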
 src/backend/access/transam/xlogarchive.c |   6 +-
 src/backend/bootstrap/bootstrap.c        |  22 ++--
 src/backend/postmaster/pgarch.c          | 130 +++--------------------
 src/backend/postmaster/postmaster.c      |  50 +++++----
 src/backend/storage/lmgr/proc.c          |   1 +
 src/include/access/xlog.h                |   3 +
 src/include/access/xlogarchive.h         |   1 +
 src/include/miscadmin.h                  |   2 +
 src/include/postmaster/pgarch.h          |   4 +-
 src/include/storage/pmsignal.h           |   1 -
 src/include/storage/proc.h               |   3 +
 11 files changed, 69 insertions(+), 154 deletions(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index d62c12310a..2f8672ac0c 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -29,7 +29,9 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/latch.h"
 #include "storage/pmsignal.h"
+#include "storage/proc.h"
 
 /*
  * Attempt to retrieve the specified file from off-line archival storage.
@@ -489,8 +491,8 @@ XLogArchiveNotify(const char *xlog)
     }
 
     /* Notify archiver that it's got something to do */
-    if (IsUnderPostmaster)
-        SendPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER);
+    if (IsUnderPostmaster && ProcGlobal->archiverLatch)
+        SetLatch(ProcGlobal->archiverLatch);
 }
 
 /*
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 5480a024e0..d398ce6f03 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -319,6 +319,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
         case StartupProcess:
             MyBackendType = B_STARTUP;
             break;
+        case ArchiverProcess:
+            MyBackendType = B_ARCHIVER;
+            break;
         case BgWriterProcess:
             MyBackendType = B_BG_WRITER;
             break;
@@ -439,30 +442,29 @@ AuxiliaryProcessMain(int argc, char *argv[])
             proc_exit(1);        /* should never return */
 
         case StartupProcess:
-            /* don't set signals, startup process has its own agenda */
             StartupProcessMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
+
+        case ArchiverProcess:
+            PgArchiverMain();
+            proc_exit(1);
 
         case BgWriterProcess:
-            /* don't set signals, bgwriter has its own agenda */
             BackgroundWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case CheckpointerProcess:
-            /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalWriterProcess:
-            /* don't set signals, walwriter has its own agenda */
             InitXLOGAccess();
             WalWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalReceiverProcess:
-            /* don't set signals, walreceiver has its own agenda */
             WalReceiverMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         default:
             elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 37be0e2bbb..063d1323ea 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -48,6 +48,7 @@
 #include "storage/latch.h"
 #include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
+#include "storage/procsignal.h"
 #include "utils/guc.h"
 #include "utils/ps_status.h"
 
@@ -78,13 +79,11 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
-static volatile sig_atomic_t wakened = false;
 static volatile sig_atomic_t ready_to_stop = false;
 
 /* ----------
@@ -95,8 +94,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
 static void pgarch_ArchiverCopyLoop(void);
@@ -110,75 +107,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -212,14 +140,9 @@ pgarch_forkexec(void)
 #endif                            /* EXEC_BACKEND */
 
 
-/*
- * PgArchiverMain
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
- *    since we don't use 'em, it hardly matters...
- */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+/* Main entry point for archiver process */
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -231,33 +154,26 @@ PgArchiverMain(int argc, char *argv[])
     pqsignal(SIGQUIT, SignalHandlerForCrashExit);
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, pgarch_waken);
+    pqsignal(SIGUSR1, procsignal_sigusr1_handler);
     pqsignal(SIGUSR2, pgarch_waken_stop);
+
     /* Reset some signals that are accepted by postmaster but not here */
     pqsignal(SIGCHLD, SIG_DFL);
+
+    /* Unblock signals (they were blocked when the postmaster forked us) */
     PG_SETMASK(&UnBlockSig);
 
-    MyBackendType = B_ARCHIVER;
-    init_ps_display(NULL);
+    /*
+     * Advertise our latch that backends can use to wake us up while we're
+     * sleeping.
+     */
+    ProcGlobal->archiverLatch = &MyProc->procLatch;
 
     pgarch_MainLoop();
 
     exit(0);
 }
 
-/* SIGUSR1 signal handler for archiver process */
-static void
-pgarch_waken(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    /* set flag that there is work to be done */
-    wakened = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
 /* SIGUSR2 signal handler for archiver process */
 static void
 pgarch_waken_stop(SIGNAL_ARGS)
@@ -282,14 +198,6 @@ pgarch_MainLoop(void)
     pg_time_t    last_copy_time = 0;
     bool        time_to_stop;
 
-    /*
-     * We run the copy loop immediately upon entry, in case there are
-     * unarchived files left over from a previous database run (or maybe the
-     * archiver died unexpectedly).  After that we wait for a signal or
-     * timeout before doing more.
-     */
-    wakened = true;
-
     /*
      * There shouldn't be anything for the archiver to do except to wait for a
      * signal ... however, the archiver exists to protect our data, so she
@@ -328,12 +236,8 @@ pgarch_MainLoop(void)
         }
 
         /* Do what we're here for */
-        if (wakened || time_to_stop)
-        {
-            wakened = false;
-            pgarch_ArchiverCopyLoop();
-            last_copy_time = time(NULL);
-        }
+        pgarch_ArchiverCopyLoop();
+        last_copy_time = time(NULL);
 
         /*
          * Sleep until a signal is received, or until a poll is forced by
@@ -354,13 +258,9 @@ pgarch_MainLoop(void)
                                WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                                timeout * 1000L,
                                WAIT_EVENT_ARCHIVER_MAIN);
-                if (rc & WL_TIMEOUT)
-                    wakened = true;
                 if (rc & WL_POSTMASTER_DEATH)
                     time_to_stop = true;
             }
-            else
-                wakened = true;
         }
 
         /*
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 73d278f3b2..e97995f973 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -539,6 +539,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1785,7 +1786,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -3055,7 +3056,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3190,20 +3191,16 @@ reaper(SIGNAL_ARGS)
         }
 
         /*
-         * Was it the archiver?  If so, just try to start a new one; no need
-         * to force reset of the rest of the system.  (If fail, we'll try
-         * again in future cycles of the main loop.).  Unless we were waiting
-         * for it to shut down; don't restart it in that case, and
-         * PostmasterStateMachine() will advance to the next shutdown step.
+         * Was it the archiver?  Normal exit can be ignored; we'll start a new
+         * one at the next iteration of the postmaster's main loop, if
+         * necessary. Any other exit condition is treated as a crash.
          */
         if (pid == PgArchPID)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3451,7 +3448,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3656,6 +3653,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3928,6 +3937,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5208,7 +5218,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5251,16 +5261,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (StartWorkerNeeded || HaveCrashedWorker)
         maybe_start_bgworkers();
 
-    if (CheckPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER) &&
-        PgArchPID != 0)
-    {
-        /*
-         * Send SIGUSR1 to archiver process, to wake it up and begin archiving
-         * next WAL file.
-         */
-        signal_child(PgArchPID, SIGUSR1);
-    }
-
     /* Tell syslogger to rotate logfile if requested */
     if (SysLoggerPID != 0)
     {
@@ -5493,6 +5493,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 9938cddb57..af4599bd82 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -187,6 +187,7 @@ InitProcGlobal(void)
     ProcGlobal->startupBufferPinWaitBufId = -1;
     ProcGlobal->walwriterLatch = NULL;
     ProcGlobal->checkpointerLatch = NULL;
+    ProcGlobal->archiverLatch = NULL;
     pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
     pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 9ec7b31cce..4435df82b6 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -313,6 +313,9 @@ extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern void XLogArchiveWakeupStart(void);
+extern void XLogArchiveWakeupEnd(void);
+extern void XLogArchiveWakeup(void);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool PromoteIsTriggered(void);
diff --git a/src/include/access/xlogarchive.h b/src/include/access/xlogarchive.h
index 1c67de2ede..54ce0b97d7 100644
--- a/src/include/access/xlogarchive.h
+++ b/src/include/access/xlogarchive.h
@@ -25,6 +25,7 @@ extern void ExecuteRecoveryCommand(const char *command, const char *commandName,
 extern void KeepFileRestoredFromArchive(const char *path, const char *xlogfname);
 extern void XLogArchiveNotify(const char *xlog);
 extern void XLogArchiveNotifySeg(XLogSegNo segno);
+extern void XLogArchiveWakeup(void);
 extern void XLogArchiveForceDone(const char *xlog);
 extern bool XLogArchiveCheckDone(const char *xlog);
 extern bool XLogArchiveIsBusy(const char *xlog);
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14fa127ab1..619b2f9c71 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -417,6 +417,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -429,6 +430,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index b3200874ca..e3ffc63f14 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 56c5ec4481..c691acf8cd 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -34,7 +34,6 @@ typedef enum
 {
     PMSIGNAL_RECOVERY_STARTED,    /* recovery has started */
     PMSIGNAL_BEGIN_HOT_STANDBY, /* begin Hot Standby */
-    PMSIGNAL_WAKEN_ARCHIVER,    /* send a NOTIFY signal to xlog archiver */
     PMSIGNAL_ROTATE_LOGFILE,    /* send SIGUSR1 to syslogger to rotate logfile */
     PMSIGNAL_START_AUTOVAC_LAUNCHER,    /* start an autovacuum launcher */
     PMSIGNAL_START_AUTOVAC_WORKER,    /* start an autovacuum worker */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index d21780108b..a87e7dc711 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -272,6 +272,9 @@ typedef struct PROC_HDR
     int            startupProcPid;
     /* Buffer id of the buffer that Startup process waits for pin on, or -1 */
     int            startupBufferPinWaitBufId;
+    /* Archiver process's latch */
+    Latch       *archiverLatch;
+    /* Current shared estimate of appropriate spins_per_delay value */
 } PROC_HDR;
 
 extern PGDLLIMPORT PROC_HDR *ProcGlobal;
-- 
2.18.2

From 9dc0074b4eb6198ac71523ed0a4e7d5fa9dfab07 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:14 +0900
Subject: [PATCH v31 5/7] Shared-memory based stats collector

Previously, activity statistics were collected via sockets and shared
among backends through files written periodically. Such files can reach
tens of megabytes and are created at most once every 500ms, and that
large amount of data is serialized by the stats collector and then
de-serialized by every backend periodically. To avoid that cost, this
patch places activity statistics data in shared memory. Each backend
accumulates statistics numbers locally and then tries to move them into
the shared statistics at every transaction end, but at intervals no
shorter than 500ms. Locks on the shared statistics are acquired per unit
such as a table or a function, so the expected chance of collision is
not high. Furthermore, until 1 second has elapsed since the last flush
to shared stats, a lock failure postpones the flush so that lock
contention doesn't slow down transactions. After that, the stats flush
waits for locks so that shared statistics don't get stale.
---
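
Note: a minimal sketch of the flush timing described above, not code
from the patch. It assumes one reading of the commit message: no flush
within 500ms of the last one, a non-blocking attempt between 500ms and
1 second, and a blocking flush after that. All demo_* names and the two
constants are hypothetical, and lock availability is passed in as a
flag so that the example stays deterministic.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_MIN_INTERVAL_MS 500    /* assumed minimum flush interval */
#define DEMO_MAX_INTERVAL_MS 1000   /* assumed point where we start waiting */

typedef enum demo_flush_mode
{
    DEMO_FLUSH_SKIP,            /* too soon since the last flush */
    DEMO_FLUSH_NOWAIT,          /* try the lock, postpone on failure */
    DEMO_FLUSH_WAIT             /* block on the lock so stats don't go stale */
} demo_flush_mode;

typedef struct demo_stats
{
    long        pending;        /* counts accumulated locally */
    long        shared;         /* stand-in for the shared-memory entry */
    long        last_flush_ms;  /* time of the last successful flush */
} demo_stats;

static demo_flush_mode
demo_choose_flush_mode(long now_ms, long last_flush_ms)
{
    long        elapsed = now_ms - last_flush_ms;

    if (elapsed < DEMO_MIN_INTERVAL_MS)
        return DEMO_FLUSH_SKIP;
    if (elapsed < DEMO_MAX_INTERVAL_MS)
        return DEMO_FLUSH_NOWAIT;
    return DEMO_FLUSH_WAIT;
}

/*
 * Called at transaction end.  Returns true if the pending counts were
 * moved to the shared entry.  "lock_free" says whether a conditional
 * lock attempt would succeed; a blocking acquire is modeled as always
 * succeeding eventually.
 */
static bool
demo_flush_at_commit(demo_stats *stats, long now_ms, bool lock_free)
{
    demo_flush_mode mode = demo_choose_flush_mode(now_ms,
                                                  stats->last_flush_ms);

    if (mode == DEMO_FLUSH_SKIP)
        return false;
    if (mode == DEMO_FLUSH_NOWAIT && !lock_free)
        return false;           /* postpone; counts stay pending */

    stats->shared += stats->pending;
    stats->pending = 0;
    stats->last_flush_ms = now_ms;
    return true;
}
```

The counts that could not be flushed are not lost: they stay in the
local accumulator and go out with the next successful flush.
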
 src/backend/access/heap/heapam_handler.c      |    4 +-
 src/backend/access/heap/vacuumlazy.c          |    2 +-
 src/backend/access/transam/xlog.c             |    4 +-
 src/backend/catalog/index.c                   |   24 +-
 src/backend/commands/analyze.c                |    8 +-
 src/backend/commands/dbcommands.c             |    2 +-
 src/backend/commands/matview.c                |    8 +-
 src/backend/commands/vacuum.c                 |    4 +-
 src/backend/postmaster/autovacuum.c           |   74 +-
 src/backend/postmaster/bgwriter.c             |    4 +-
 src/backend/postmaster/checkpointer.c         |   46 +-
 src/backend/postmaster/interrupt.c            |    5 +-
 src/backend/postmaster/pgarch.c               |   12 +-
 src/backend/postmaster/pgstat.c               | 5112 +++++++----------
 src/backend/postmaster/postmaster.c           |   88 +-
 src/backend/replication/basebackup.c          |    4 +-
 src/backend/storage/buffer/bufmgr.c           |    8 +-
 src/backend/storage/ipc/ipci.c                |    2 +
 src/backend/storage/lmgr/lwlock.c             |    1 +
 src/backend/storage/lmgr/lwlocknames.txt      |    1 +
 src/backend/storage/smgr/smgr.c               |   12 +-
 src/backend/tcop/postgres.c                   |   37 +-
 src/backend/utils/adt/pgstatfuncs.c           |   75 +-
 src/backend/utils/init/globals.c              |    1 +
 src/backend/utils/init/miscinit.c             |    3 -
 src/backend/utils/init/postinit.c             |   11 +
 src/backend/utils/misc/guc.c                  |   14 +-
 src/backend/utils/misc/postgresql.conf.sample |    2 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |    4 +-
 src/include/miscadmin.h                       |    3 +-
 src/include/pgstat.h                          |  583 +-
 src/include/storage/lwlock.h                  |    1 +
 src/include/utils/guc_tables.h                |    2 +-
 src/include/utils/timeout.h                   |    1 +
 34 files changed, 2256 insertions(+), 3906 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ca52846b97..d1be516db4 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1074,8 +1074,8 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
                  * our own.  In this case we should count and sample the row,
                  * to accommodate users who load a table and analyze it in one
                  * transaction.  (pgstat_report_analyze has to adjust the
-                 * numbers we send to the stats collector to make this come
-                 * out right.)
+                 * numbers we report to the activity stats facility to make
+                 * this come out right.)
                  */
                 if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(targtuple->t_data)))
                 {
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 9f9596c718..accc4b7e95 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -589,7 +589,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
                         new_min_multi,
                         false);
 
-    /* report results to the stats collector, too */
+    /* report results to the activity stats facility, too */
     pgstat_report_vacuum(RelationGetRelid(onerel),
                          onerel->rd_rel->relisshared,
                          new_live_tuples,
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 977d448f50..9a59e9d0eb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8548,9 +8548,9 @@ LogCheckpointEnd(bool restartpoint)
                         &sync_secs, &sync_usecs);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time +=
+    BgWriterStats.checkpoint_write_time +=
         write_secs * 1000 + write_usecs / 1000;
-    BgWriterStats.m_checkpoint_sync_time +=
+    BgWriterStats.checkpoint_sync_time +=
         sync_secs * 1000 + sync_usecs / 1000;
 
     /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index bd7ec923e9..46fe9fd85f 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1699,28 +1699,10 @@ index_concurrently_swap(Oid newIndexId, Oid oldIndexId, const char *oldName)
 
     /*
      * Copy over statistics from old to new index
+     * The data will be flushed by the next pgstat_report_stat()
+     * call.
      */
-    {
-        PgStat_StatTabEntry *tabentry;
-
-        tabentry = pgstat_fetch_stat_tabentry(oldIndexId);
-        if (tabentry)
-        {
-            if (newClassRel->pgstat_info)
-            {
-                newClassRel->pgstat_info->t_counts.t_numscans = tabentry->numscans;
-                newClassRel->pgstat_info->t_counts.t_tuples_returned = tabentry->tuples_returned;
-                newClassRel->pgstat_info->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_hit = tabentry->blocks_hit;
-
-                /*
-                 * The data will be sent by the next pgstat_report_stat()
-                 * call.
-                 */
-            }
-        }
-    }
+    pgstat_copy_index_counters(oldIndexId, newClassRel->pgstat_info);
 
     /* Close relations */
     table_close(pg_class, RowExclusiveLock);
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 924ef37c81..06b03cb8e1 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -655,10 +655,10 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
     }
 
     /*
-     * Report ANALYZE to the stats collector, too.  However, if doing
-     * inherited stats we shouldn't report, because the stats collector only
-     * tracks per-table stats.  Reset the changes_since_analyze counter only
-     * if we analyzed all columns; otherwise, there is still work for
+     * Report ANALYZE to the activity stats facility, too.  However, if doing
+     * inherited stats we shouldn't report, because the activity stats facility
+     * only tracks per-table stats.  Reset the changes_since_analyze counter
+     * only if we analyzed all columns; otherwise, there is still work for
      * auto-analyze to do.
      */
     if (!inh)
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 367c30adb0..a18e7068ae 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -971,7 +971,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
     DropDatabaseBuffers(db_id);
 
     /*
-     * Tell the stats collector to forget it immediately, too.
+     * Tell the activity stats facility to forget it immediately, too.
      */
     pgstat_drop_database(db_id);
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index e5a5eef102..4151fa5335 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -336,10 +336,10 @@ ExecRefreshMatView(RefreshMatViewStmt *stmt, const char *queryString,
         refresh_by_heap_swap(matviewOid, OIDNewHeap, relpersistence);
 
         /*
-         * Inform stats collector about our activity: basically, we truncated
-         * the matview and inserted some new data.  (The concurrent code path
-         * above doesn't need to worry about this because the inserts and
-         * deletes it issues get counted by lower-level code.)
+         * Inform activity stats facility about our activity: basically, we
+         * truncated the matview and inserted some new data.  (The concurrent
+         * code path above doesn't need to worry about this because the inserts
+         * and deletes it issues get counted by lower-level code.)
          */
         pgstat_count_truncate(matviewRel);
         if (!stmt->skipData)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 3a89f8fe1e..eaba7d166e 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -321,8 +321,8 @@ vacuum(List *relations, VacuumParams *params,
                  errmsg("VACUUM option DISABLE_PAGE_SKIPPING cannot be used with FULL")));
 
     /*
-     * Send info about dead objects to the statistics collector, unless we are
-     * in autovacuum --- autovacuum.c does this for itself.
+     * Send info about dead objects to the activity statistics facility, unless
+     * we are in autovacuum --- autovacuum.c does this for itself.
      */
     if ((params->options & VACOPT_VACUUM) && !IsAutoVacuumWorkerProcess())
         pgstat_vacuum_stat();
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 7e97ffab27..19a2357f0d 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -338,9 +338,6 @@ static void autovacuum_do_vac_analyze(autovac_table *tab,
                                       BufferAccessStrategy bstrategy);
 static AutoVacOpts *extract_autovac_opts(HeapTuple tup,
                                          TupleDesc pg_class_desc);
-static PgStat_StatTabEntry *get_pgstat_tabentry_relid(Oid relid, bool isshared,
-                                                      PgStat_StatDBEntry *shared,
-                                                      PgStat_StatDBEntry *dbentry);
 static void perform_work_item(AutoVacuumWorkItem *workitem);
 static void autovac_report_activity(autovac_table *tab);
 static void autovac_report_workitem(AutoVacuumWorkItem *workitem,
@@ -1665,12 +1662,12 @@ AutoVacWorkerMain(int argc, char *argv[])
         char        dbname[NAMEDATALEN];
 
         /*
-         * Report autovac startup to the stats collector.  We deliberately do
-         * this before InitPostgres, so that the last_autovac_time will get
-         * updated even if the connection attempt fails.  This is to prevent
-         * autovac from getting "stuck" repeatedly selecting an unopenable
-         * database, rather than making any progress on stuff it can connect
-         * to.
+         * Report autovac startup to the activity stats facility.  We
+         * deliberately do this before InitPostgres, so that the
+         * last_autovac_time will get updated even if the connection attempt
+         * fails.  This is to prevent autovac from getting "stuck" repeatedly
+         * selecting an unopenable database, rather than making any progress on
+         * stuff it can connect to.
          */
         pgstat_report_autovac(dbid);
 
@@ -1938,8 +1935,6 @@ do_autovacuum(void)
     HASHCTL        ctl;
     HTAB       *table_toast_map;
     ListCell   *volatile cell;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     BufferAccessStrategy bstrategy;
     ScanKeyData key;
     TupleDesc    pg_class_desc;
@@ -1958,17 +1953,11 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /*
-     * may be NULL if we couldn't find an entry (only happens if we are
-     * forcing a vacuum for anti-wrap purposes).
-     */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
 
     /*
-     * Clean up any dead statistics collector entries for this DB. We always
+     * Clean up any dead activity statistics entries for this DB. We always
      * want to do this exactly once per DB-processing cycle, even if we find
      * nothing worth vacuuming in the database.
      */
@@ -2011,9 +2000,6 @@ do_autovacuum(void)
     /* StartTransactionCommand changed elsewhere */
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-
     classRel = table_open(RelationRelationId, AccessShareLock);
 
     /* create a copy so we can use it after closing pg_class */
@@ -2092,8 +2078,8 @@ do_autovacuum(void)
 
         /* Fetch reloptions and the pgstat entry for this table */
         relopts = extract_autovac_opts(tuple, pg_class_desc);
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                       relid);
 
         /* Check if it needs vacuum or analyze */
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
@@ -2176,8 +2162,8 @@ do_autovacuum(void)
         }
 
         /* Fetch the pgstat entry for this table */
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                       relid);
 
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
@@ -2736,29 +2722,6 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
     return av;
 }
 
-/*
- * get_pgstat_tabentry_relid
- *
- * Fetch the pgstat entry of a table, either local to a database or shared.
- */
-static PgStat_StatTabEntry *
-get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
-                          PgStat_StatDBEntry *dbentry)
-{
-    PgStat_StatTabEntry *tabentry = NULL;
-
-    if (isshared)
-    {
-        if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
-    }
-    else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
-
-    return tabentry;
-}
 
 /*
  * table_recheck_autovac
@@ -2779,17 +2742,12 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     bool        doanalyze;
     autovac_table *tab = NULL;
     PgStat_StatTabEntry *tabentry;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     bool        wraparound;
     AutoVacOpts *avopts;
 
     /* use fresh stats */
     autovac_refresh_stats();
 
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* fetch the relation's relcache entry */
     classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
     if (!HeapTupleIsValid(classTup))
@@ -2813,8 +2771,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     }
 
     /* fetch the pgstat table entry */
-    tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                         shared, dbentry);
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                   relid);
 
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
@@ -2936,7 +2894,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
  *
  * For analyze, the analysis done is that the number of tuples inserted,
  * deleted and updated since the last analyze exceeds a threshold calculated
- * in the same fashion as above.  Note that the collector actually stores
+ * in the same fashion as above.  Note that the activity statistics stores
  * the number of tuples (both live and dead) that there were as of the last
  * analyze.  This is asymmetric to the VACUUM case.
  *
@@ -2946,8 +2904,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
  *
  * A table whose autovacuum_enabled option is false is
  * automatically skipped (unless we have to vacuum it due to freeze_max_age).
- * Thus autovacuum can be disabled for specific tables. Also, when the stats
- * collector does not have data about a table, it will be skipped.
+ * Thus autovacuum can be disabled for specific tables. Also, when the activity
+ * statistics does not have data about a table, it will be skipped.
  *
  * A table whose vac_base_thresh value is < 0 takes the base value from the
  * autovacuum_vacuum_threshold GUC variable.  Similarly, a vac_scale_factor
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 069e27e427..4382b1726f 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -234,9 +234,9 @@ BackgroundWriterMain(void)
         can_hibernate = BgBufferSync(&wb_context);
 
         /*
-         * Send off activity statistics to the stats collector
+         * Send off activity statistics to the activity stats facility
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index e354a78725..12f06a316d 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -350,7 +350,7 @@ CheckpointerMain(void)
         if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
         {
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            BgWriterStats.requested_checkpoints++;
         }
 
         /*
@@ -364,7 +364,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                BgWriterStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -486,13 +486,13 @@ CheckpointerMain(void)
         CheckArchiveTimeout();
 
         /*
-         * Send off activity statistics to the stats collector.  (The reason
-         * why we re-use bgwriter-related code for this is that the bgwriter
-         * and checkpointer used to be just one process.  It's probably not
-         * worth the trouble to split the stats support into two independent
-         * stats message types.)
+         * Send off activity statistics to the activity stats facility.  (The
+         * reason why we re-use bgwriter-related code for this is that the
+         * bgwriter and checkpointer used to be just one process.  It's
+         * probably not worth the trouble to split the stats support into two
+         * independent stats message types.)
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         /*
          * Sleep until we are signaled or it's time for another checkpoint or
@@ -533,29 +533,29 @@ HandleCheckpointerInterrupts(void)
         ProcessConfigFile(PGC_SIGHUP);
 
         /*
-         * Checkpointer is the last process to shut down, so we ask it to
-         * hold the keys for a range of other tasks required most of which
-         * have nothing to do with checkpointing at all.
+         * Checkpointer is the last process to shut down, so we ask it to hold
+         * the keys for a range of other tasks required most of which have
+         * nothing to do with checkpointing at all.
          *
-         * For various reasons, some config values can change dynamically
-         * so the primary copy of them is held in shared memory to make
-         * sure all backends see the same value.  We make Checkpointer
-         * responsible for updating the shared memory copy if the
-         * parameter setting changes because of SIGHUP.
+         * For various reasons, some config values can change dynamically so
+         * the primary copy of them is held in shared memory to make sure all
+         * backends see the same value.  We make Checkpointer responsible for
+         * updating the shared memory copy if the parameter setting changes
+         * because of SIGHUP.
          */
         UpdateSharedMemoryConfig();
     }
     if (ShutdownRequestPending)
     {
         /*
-         * From here on, elog(ERROR) should end with exit(1), not send
-         * control back to the sigsetjmp block above
+         * From here on, elog(ERROR) should end with exit(1), not send control
+         * back to the sigsetjmp block above
          */
         ExitOnAnyError = true;
         /* Close down the database */
         ShutdownXLOG(0, 0);
         /* Normal exit from the checkpointer is here */
-        proc_exit(0);        /* done */
+        proc_exit(0);            /* done */
     }
 }
 
@@ -691,9 +691,9 @@ CheckpointWriteDelay(int flags, double progress)
         CheckArchiveTimeout();
 
         /*
-         * Report interim activity statistics to the stats collector.
+         * Report interim activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1238,8 +1238,8 @@ AbsorbSyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    BgWriterStats.buf_written_backend += CheckpointerShmem->num_backend_writes;
+    BgWriterStats.buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/interrupt.c b/src/backend/postmaster/interrupt.c
index 3d02439b79..53844eb8bb 100644
--- a/src/backend/postmaster/interrupt.c
+++ b/src/backend/postmaster/interrupt.c
@@ -93,9 +93,8 @@ SignalHandlerForCrashExit(SIGNAL_ARGS)
  * shut down and exit.
  *
  * Typically, this handler would be used for SIGTERM, but some procesess use
- * other signals. In particular, the checkpointer exits on SIGUSR2, the
- * stats collector on SIGQUIT, and the WAL writer exits on either SIGINT
- * or SIGTERM.
+ * other signals. In particular, the checkpointer exits on SIGUSR2, and the WAL
+ * writer exits on either SIGINT or SIGTERM.
  *
  * ShutdownRequestPending should be checked at a convenient place within the
  * main loop, or else the main loop should call HandleMainLoopInterrupts.
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 063d1323ea..08fe87341c 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -372,20 +372,20 @@ pgarch_ArchiverCopyLoop(void)
                 pgarch_archiveDone(xlog);
 
                 /*
-                 * Tell the collector about the WAL file that we successfully
-                 * archived
+                 * Tell the activity statistics facility about the WAL file
+                 * that we successfully archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_report_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
             else
             {
                 /*
-                 * Tell the collector about the WAL file that we failed to
-                 * archive
+                 * Tell the activity statistics facility about the WAL file
+                 * that we failed to archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_report_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 9ebde47dea..3e11d50f86 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,22 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Activity Statistics facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects activity statistics, e.g. per-table access statistics, of
+ *  all backends in shared memory. The activity numbers are first stored
+ *  locally in each process, then flushed to shared memory at commit
+ *  time or by idle-timeout.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ * To avoid congestion on the shared memory, shared stats are updated no more
+ * often than once per PGSTAT_MIN_INTERVAL (1000ms). If some local numbers
+ * remain unflushed because of lock failure, we retry at intervals that start
+ * at PGSTAT_RETRY_MIN_INTERVAL (250ms) and double at every retry. Finally we
+ * force the update PGSTAT_MAX_INTERVAL (10000ms) after the first attempt.
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the activity statistics facility creates the
+ *  area and loads the stored stats file if any; the last process at shutdown
+ *  writes the shared stats to the file and destroys the area before exit.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -19,18 +26,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -40,68 +35,43 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/fork_process.h"
 #include "postmaster/interrupt.h"
 #include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
-#include "utils/timestamp.h"
 
 /* ----------
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
+#define PGSTAT_MIN_INTERVAL            1000    /* Minimum interval of stats data
+                                             * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_RETRY_MIN_INTERVAL    250 /* Initial retry interval after
+                                         * PGSTAT_MIN_INTERVAL */
 
+#define PGSTAT_MAX_INTERVAL            10000    /* Longest interval of stats data
+                                             * updates */
 
 /* ----------
- * The initial size hints for the hash tables used in the collector.
+ * The initial size hints for the hash tables used in the activity statistics.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
-#define PGSTAT_TAB_HASH_SIZE    512
+#define PGSTAT_TABLE_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
 
@@ -116,7 +86,6 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
-
 /* ----------
  * GUC parameters
  * ----------
@@ -131,15 +100,24 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used, but will be removed with GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
-/*
- * BgWriter global statistics counters (unused in other processes).
- * Stored directly in a stats message structure so it can be sent
- * without needing to copy things around.  We assume this inits to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
+typedef struct StatsShmemStruct
+{
+    dsa_handle    stats_dsa_handle;    /* handle for stats data area */
+    dshash_table_handle hash_handle;    /* shared dbstat hash */
+    dsa_pointer global_stats;    /* DSA pointer to global stats */
+    dsa_pointer archiver_stats; /* Ditto for archiver stats */
+    dsa_pointer slru_stats;        /* Ditto for SLRU stats */
+    int            refcount;        /* # of processes attached to the shared
+                                 * stats memory */
+}            StatsShmemStruct;
+
+/* BgWriter global statistics counters */
+PgStat_BgWriter BgWriterStats = {0};
 
 /*
  * SLRU statistics counters (unused in other processes) stored directly in
@@ -158,68 +136,141 @@ static char *slru_names[] = {"async", "clog", "commit_timestamp",
 /* number of elemenents of slru_name array */
 #define SLRU_NUM_ELEMENTS    (sizeof(slru_names) / sizeof(char *))
 
-/* entries in the same order as slru_names */
-PgStat_MsgSLRU SLRUStats[SLRU_NUM_ELEMENTS];
+PgStat_StatSLRUEntry local_SLRUStats[SLRU_NUM_ELEMENTS];
 
-/* ----------
- * Local data
- * ----------
- */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
-
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
+/* backend-lifetime storages */
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
 
 /*
- * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
+ * Enums and types to define shared statistics structure.
+ *
+ * The statistics entry for each object is stored in individually
+ * DSA-allocated memory. Every entry is pointed to from the dshash
+ * pgStatSharedHash via a dsa_pointer. This structure keeps object-stats
+ * entries from being moved by dshash resizing, and allows the dshash to
+ * release its lock sooner on stats updates. It also reduces interference among
+ * write locks on each stat entry by not relying on the dshash partition lock.
+ * PgStatLocalHashEntry is the local-stats equivalent of PgStatHashEntry.
+ *
+ * Each stat entry is enveloped by the type PgStatEnvelope, which stores the
+ * attributes common to all kinds of statistics and an LWLock object.
+ *
+ * Shared stats are stored as:
  *
- * NOTE: once allocated, TabStatusArray structures are never moved or deleted
- * for the life of the backend.  Also, we zero out the t_id fields of the
- * contained PgStat_TableStatus structs whenever they are not actively in use.
- * This allows relcache pgstat_info pointers to be treated as long-lived data,
- * avoiding repeated searches in pgstat_initstats() when a relation is
- * repeatedly opened during a transaction.
+ * dshash pgStatSharedHash
+ *    -> PgStatHashEntry                (dshash entry)
+ *      (dsa_pointer)-> PgStatEnvelope    (dsa memory block)
+ *
+ * Local stats are stored as:
+ *
+ * dynahash pgStatLocalHash
+ *    -> PgStatLocalHashEntry            (dynahash entry)
+ *      (direct pointer)-> PgStatEnvelope (palloc'ed memory)
  */
-#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
 
-typedef struct TabStatusArray
+/* The types of statistics entries. */
+typedef enum PgStatTypes
 {
-    struct TabStatusArray *tsa_next;    /* link to next array, if any */
-    int            tsa_used;        /* # entries currently used */
-    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
-} TabStatusArray;
-
-static TabStatusArray *pgStatTabList = NULL;
+    PGSTAT_TYPE_ALL,            /* Not a type, for the parameters of
+                                 * pgstat_collect_stat_entries */
+    PGSTAT_TYPE_DB,                /* database-wide statistics */
+    PGSTAT_TYPE_TABLE,            /* per-table statistics */
+    PGSTAT_TYPE_FUNCTION        /* per-function statistics */
+}            PgStatTypes;
 
 /*
- * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
+ * Entry size lookup table for shared statistics entries, indexed by
+ * PgStatTypes
  */
-typedef struct TabStatHashEntry
+static size_t pgstat_entsize[] =
 {
-    Oid            t_id;
-    PgStat_TableStatus *tsa_entry;
-} TabStatHashEntry;
+    0,                            /* PGSTAT_TYPE_ALL: not an entry */
+    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_StatTabEntry),    /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_StatFuncEntry)    /* PGSTAT_TYPE_FUNCTION */
+};
 
-/*
- * Hash table for O(1) t_id -> tsa_entry lookup
- */
-static HTAB *pgStatTabHash = NULL;
+/* Ditto for local statistics entries */
+static size_t pgstat_localentsize[] =
+{
+    0,                            /* PGSTAT_TYPE_ALL: not an entry */
+    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_TableStatus), /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_BackendFunctionEntry) /* PGSTAT_TYPE_FUNCTION */
+};
+
+/* struct for shared statistics hash entry key. */
+typedef struct PgStatHashEntryKey
+{
+    PgStatTypes type;            /* statistics entry type */
+    Oid            databaseid;        /* database ID. InvalidOid for shared objects. */
+    Oid            objectid;        /* object OID */
+}            PgStatHashEntryKey;
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * Entry of the shared hash pgStatSharedHash. Stats numbers waiting to be
+ * flushed out to shared stats are held separately in pgStatLocalHash.
  */
-static HTAB *pgStatFunctions = NULL;
+typedef struct PgStatHashEntry
+{
+    PgStatHashEntryKey key;        /* hash key */
+    dsa_pointer env;            /* pointer to shared stats envelope in DSA */
+}            PgStatHashEntry;
+
+/* struct for shared statistics entry pointed from shared hash entry. */
+typedef struct PgStatEnvelope
+{
+    PgStatTypes type;            /* statistics entry type */
+    Oid            databaseid;        /* databaseid */
+    Oid            objectid;        /* objectid */
+    size_t        len;            /* length of body, fixed per type. */
+    LWLock        lock;            /* lightweight lock to protect body */
+    int            body[FLEXIBLE_ARRAY_MEMBER];    /* statistics body */
+}            PgStatEnvelope;
+
+#define PgStatEnvelopeSize(bodylen) \
+    (offsetof(PgStatEnvelope, body) + (bodylen))
+
+/* struct for shared statistics local hash entry. */
+typedef struct PgStatLocalHashEntry
+{
+    PgStatHashEntryKey key;        /* hash key */
+    PgStatEnvelope *env;        /* pointer to stats envelope in heap */
+}            PgStatLocalHashEntry;
 
 /*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
+ * A snapshot is a stats entry that is copied locally to offer stable values
+ * for a transaction.
  */
-static bool have_function_stats = false;
+typedef struct PgStatSnapshot
+{
+    PgStatHashEntryKey key;
+    bool        negative;
+    int            body[FLEXIBLE_ARRAY_MEMBER];    /* statistics body */
+}            PgStatSnapshot;
+
+#define PgStatSnapshotSize(bodylen)                \
+    (offsetof(PgStatSnapshot, body) + (bodylen))
+
+/* parameters for the shared hash */
+static const dshash_parameters dsh_rootparams = {
+    sizeof(PgStatHashEntryKey),
+    sizeof(PgStatHashEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+
+/* The shared hash to index activity stats entries. */
+static dshash_table *pgStatSharedHash = NULL;
+
+/* Local stats numbers are stored here */
+static HTAB *pgStatLocalHash = NULL;
+
+#define HAVE_ANY_PENDING_STATS()                        \
+    (pgStatLocalHash != NULL ||                            \
+     pgStatXactCommit != 0 || pgStatXactRollback != 0)
 
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
@@ -256,11 +307,10 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
+static MemoryContext pgStatSnapshotContext = NULL;
+static HTAB *pgStatSnapshotHash = NULL;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -269,20 +319,19 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Cluster wide statistics.
+ *
+ * Contains statistics that are collected neither on a per-database nor on a
+ * per-table basis.  shared_* point to shared memory and snapshot_* are
+ * backend snapshots.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
+static bool global_snapshot_is_valid = false;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats snapshot_globalStats;
+static PgStat_StatSLRUEntry *shared_SLRUStats;
+static PgStat_StatSLRUEntry snapshot_SLRUStats[SLRU_NUM_ELEMENTS];
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -291,526 +340,277 @@ static List *pending_write_requests = NIL;
  */
 static instr_time total_func_time;
 
+/*
+ * Newly created shared stats entries need to be initialized before other
+ * processes can access them. get_stat_entry() calls this for that purpose.
+ */
+typedef void (*entry_initializer) (PgStatEnvelope * env);
 
 /* ----------
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgstat_beshutdown_hook(int code, Datum arg);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                                                 Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStatEnvelope * get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                                       bool nowait,
+                                       entry_initializer initfunc, bool *found);
+static PgStatEnvelope * *collect_stat_entries(PgStatTypes type, Oid dbid);
+static void create_missing_dbentries(void);
+static void pgstat_write_database_stats(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_db_statsfile(PgStat_StatDBEntry *dbentry);
+
+static bool flush_tabstat(PgStatEnvelope * env, bool nowait);
+static bool flush_funcstat(PgStatEnvelope * env, bool nowait);
+static bool flush_dbstat(PgStatEnvelope * env, bool nowait);
+static bool flush_slrustat(bool nowait);
+
+static void init_tabentry(PgStatEnvelope * env);
+static void init_funcentry(PgStatEnvelope * env);
+static void init_dbentry(PgStatEnvelope * env);
+
+static bool delete_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                              bool nowait);
+static PgStatEnvelope * get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                                             bool create, bool *found);
+static PgStat_StatDBEntry *get_local_dbstat_entry(Oid dbid);
+static PgStat_TableStatus *get_local_tabstat_entry(Oid rel_id, bool isshared);
+
+static PgStat_SubXactStatus *get_tabstat_stack_level(int nest_level);
+static void add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level);
+static void pgstat_snapshot_global_stats(void);
+
 static void pgstat_read_current_status(void);
-
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
-
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
-static void pgstat_send_slru(void);
-static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
-
-static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
-
-static void pgstat_setup_memcxt(void);
-
 static const char *pgstat_get_wait_activity(WaitEventActivity w);
 static const char *pgstat_get_wait_client(WaitEventClient w);
 static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute shared memory space needed for activity statistics
+ */
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+/*
+ * StatsShmemInit - initialize during shared-memory creation
  */
 void
-pgstat_init(void)
+StatsShmemInit(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
+    bool        found;
 
-#define TESTBYTEVAL ((char) 199)
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
 
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
-
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
-    {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
-    }
-
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
+    if (!IsUnderPostmaster)
     {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
-
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
-
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
+        Assert(!found);
 
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+    }
+}
 
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ *    Create pgStatLocalContext and pgStatSnapshotContext, if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+    if (!pgStatLocalContext)
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+
+    if (!pgStatSnapshotContext)
+        pgStatSnapshotContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Database statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+}
 
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
 
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+/* ----------
+ * attach_shared_stats() -
+ *
+ *    Attach or create shared stats memory. If we are the first process to use
+ *    the activity stats system, read saved statistics files, if any.
+ * ----------
+ */
+static void
+attach_shared_stats(void)
+{
+    MemoryContext oldcontext;
 
-        test_byte++;            /* just make sure variable is changed */
+    /*
+     * Don't use dsm unless under postmaster, or when not tracking counts.
+     */
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
 
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    pgstat_setup_memcxt();
 
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    if (area)
+        return;
 
-        /* If we get here, we have a working socket */
-        break;
-    }
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
 
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
     /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
+     * The last process is responsible for writing out stats files at exit.
+     * Maintain refcount so that a process about to exit can tell whether it
+     * is the last one or not.
      */
-    if (!pg_set_noblock(pgStatSock))
+    if (StatsShmem->refcount > 0)
+        StatsShmem->refcount++;
+    else
     {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
+        /* We're the first process to attach the shared stats memory */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
+
+        /* Initialize shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatSharedHash = dshash_create(area, &dsh_rootparams, 0);
+
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->global_stats =
+            dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+        StatsShmem->archiver_stats =
+            dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+        StatsShmem->slru_stats =
+            dsa_allocate0(area,
+                          sizeof(PgStat_StatSLRUEntry) * SLRU_NUM_ELEMENTS);
+        StatsShmem->hash_handle = dshash_get_hash_table_handle(pgStatSharedHash);
+
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
+        shared_SLRUStats = (PgStat_StatSLRUEntry *)
+            dsa_get_address(area, StatsShmem->slru_stats);
+
+        /* Load saved data if any. */
+        pgstat_read_statsfiles();
+
+        StatsShmem->refcount = 1;
     }
 
+    LWLockRelease(StatsLock);
+
     /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
+     * If we're not the first process, attach existing shared stats area
+     * outside the StatsLock section.
      */
+    if (!area)
     {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
+        /* Attach shared area. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatSharedHash = dshash_attach(area, &dsh_rootparams,
+                                         StatsShmem->hash_handle, 0);
+
+        /* Setup local variables */
+        pgStatLocalHash = NULL;
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
+        shared_SLRUStats = (PgStat_StatSLRUEntry *)
+            dsa_get_address(area, StatsShmem->slru_stats);
     }
 
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    /* Now that we have a long-lived socket, tell fd.c about it. */
-    ReserveExternalFD();
+    MemoryContextSwitchTo(oldcontext);
 
-    return;
-
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
-
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
-
-    /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
-     */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    /* don't detach automatically */
+    dsa_pin_mapping(area);
+    global_snapshot_is_valid = false;
 }
 
-/*
- * subroutine for pgstat_reset_all
+/* ----------
+ * detach_shared_stats() -
+ *
+ *    Detach shared stats. Write out to file if we're the last process and told
+ *    to do so.
+ * ----------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+detach_shared_stats(bool write_stats)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    /* immediately return if useless */
+    if (!area || !IsUnderPostmaster)
+        return;
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
-    {
-        int            nchars;
-        Oid            tmp_oid;
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
+    if (--StatsShmem->refcount < 1)
+    {
         /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
+         * This process is the last one attached to the shared stats
+         * memory. Write out the stats files if requested.
          */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
-
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
+        if (write_stats)
+            pgstat_write_statsfiles();
 
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+        /* No one is using the area. */
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
     }
-    FreeDir(dir);
+
+    LWLockRelease(StatsLock);
+
+    /*
+     * Detach the area. It is automatically destroyed when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);
+
+    area = NULL;
+    pgStatSharedHash = NULL;
+    shared_globalStats = NULL;
+    shared_archiverStats = NULL;
+    pgStatLocalHash = NULL;
+    global_snapshot_is_valid = false;
 }
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
-}
+    /* standalone server doesn't use shared stats */
+    if (!IsUnderPostmaster)
+        return;
 
-#ifdef EXEC_BACKEND
+    /* we must have shared stats attached */
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
-    char       *av[10];
-    int            ac = 0;
-
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
-
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
-
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
-
-
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
-
-    /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
-     */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
-
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
+    /* Startup must be the only user of shared stats */
+    Assert(StatsShmem->refcount == 1);
 
     /*
-     * Okay, fork off the collector.
+     * We could directly remove files and recreate the shared memory area. But
+     * just detach and then re-attach for simplicity.
      */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
+    detach_shared_stats(false); /* Don't write files. */
+    attach_shared_stats();
 }
 
 /* ------------------------------------------------------------
@@ -818,147 +618,491 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *    Updates are applied no more frequently than once every
+ *    PGSTAT_MIN_INTERVAL milliseconds. They are also postponed on lock
+ *    failure if force is false and no update has been pending for longer
+ *    than PGSTAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
+ *    succeeding calls of this function.
+ *
+ *    Returns the time in milliseconds until the next flush or retry is due,
+ *    or 0 if there is nothing to flush or all pending updates have been
+ *    applied.
+ *
+ *    Note that this is called only outside a transaction, so it is fine to use
  *    transaction stop time as an approximation of current time.
- * ----------
+ *    ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz next_flush = 0;
+    static TimestampTz pending_since = 0;
+    static long retry_interval = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
+    bool        nowait = !force;    /* don't use force after this point */
+    HASH_SEQ_STATUS scan;
+    PgStatLocalHashEntry *lent;
+    PgStatLocalHashEntry **dbentlist;
+    int            dbentlistlen = 8;
+    int            ndbentries = 0;
+    int            remains = 0;
     int            i;
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (area == NULL || !HAVE_ANY_PENDING_STATS())
+        return 0;
+
+    dbentlist = palloc(sizeof(PgStatLocalHashEntry *) * dbentlistlen);
 
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
     now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
 
-    /*
-     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
-     */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
-    pgStatTabHash = NULL;
+    if (nowait)
+    {
+        /*
+         * Don't flush stats too frequently.  Return the time to the next
+         * flush.
+         */
+        if (now < next_flush)
+        {
+            /* Record when updates first became pending, if not yet set. */
+            if (pending_since == 0)
+                pending_since = now;
+
+            return (next_flush - now) / 1000;
+        }
+
+        /* But, don't keep pending updates longer than PGSTAT_MAX_INTERVAL. */
+
+        if (pending_since > 0 &&
+            TimestampDifferenceExceeds(pending_since, now, PGSTAT_MAX_INTERVAL))
+            nowait = false;
+    }
 
     /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
+     * flush_tabstat adds some counters of the flushed entries to the local
+     * database stats, so flush out database stats after table stats.
      */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
-    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+    if (pgStatLocalHash)
     {
-        for (i = 0; i < tsa->tsa_used; i++)
+        /* Step 1: flush out everything other than database stats */
+        hash_seq_init(&scan, pgStatLocalHash);
+        while ((lent = (PgStatLocalHashEntry *) hash_seq_search(&scan)) != NULL)
         {
-            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
+            bool        remove = false;
 
-            /* Shouldn't have any pending transaction-dependent counts */
-            Assert(entry->trans == NULL);
+            switch (lent->env->type)
+            {
+                case PGSTAT_TYPE_DB:
+                    if (ndbentries >= dbentlistlen)
+                    {
+                        dbentlistlen *= 2;
+                        dbentlist = repalloc(dbentlist,
+                                             sizeof(PgStatLocalHashEntry *) *
+                                             dbentlistlen);
+                    }
+                    dbentlist[ndbentries++] = lent;
+                    break;
+                case PGSTAT_TYPE_TABLE:
+                    if (flush_tabstat(lent->env, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_FUNCTION:
+                    if (flush_funcstat(lent->env, nowait))
+                        remove = true;
+                    break;
+                default:
+                    Assert(false);
+            }
 
-            /*
-             * Ignore entries that didn't accumulate any actual counts, such
-             * as indexes that were opened by the planner but not used.
-             */
-            if (memcmp(&entry->t_counts, &all_zeroes,
-                       sizeof(PgStat_TableCounts)) == 0)
+            if (!remove)
+            {
+                remains++;
                 continue;
+            }
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* Remove the successfully flushed entry */
+            pfree(lent->env);
+            hash_search(pgStatLocalHash, &lent->key, HASH_REMOVE, NULL);
+        }
+
+        /* Step 2: flush out database stats */
+        for (i = 0; i < ndbentries; i++)
+        {
+            PgStatLocalHashEntry *lent = dbentlist[i];
+
+            if (flush_dbstat(lent->env, nowait))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                remains--;
+                /* Remove the successfully flushed entry */
+                pfree(lent->env);
+                hash_search(pgStatLocalHash, &lent->key, HASH_REMOVE, NULL);
             }
         }
-        /* zero out PgStat_TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
+        pfree(dbentlist);
+
+        if (remains <= 0)
+        {
+            hash_destroy(pgStatLocalHash);
+            pgStatLocalHash = NULL;
+        }
     }
 
+    /* flush SLRU stats */
+    flush_slrustat(nowait);
+
+    /* Publish the last flush time */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (shared_globalStats->stats_timestamp < now)
+        shared_globalStats->stats_timestamp = now;
+    LWLockRelease(StatsLock);
+
     /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
+     * If we have pending local stats, let the caller know the retry interval.
      */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    if (HAVE_ANY_PENDING_STATS())
+    {
+        /* Remember when updates first became pending */
+        if (pending_since == 0)
+            pending_since = now;
+
+        /* The interval is doubled at every retry. */
+        if (retry_interval == 0)
+            retry_interval = PGSTAT_RETRY_MIN_INTERVAL * 1000;
+        else
+            retry_interval = retry_interval * 2;
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+        /*
+         * Determine the next retry interval so as not to get shorter than the
+         * previous interval.
+         */
+        if (!TimestampDifferenceExceeds(pending_since,
+                                        now + 2 * retry_interval,
+                                        PGSTAT_MAX_INTERVAL))
+            next_flush = now + retry_interval;
+        else
+        {
+            next_flush = pending_since + PGSTAT_MAX_INTERVAL * 1000;
+            retry_interval = next_flush - now;
+        }
 
-    /* Finally send SLRU statistics */
-    pgstat_send_slru();
+        return retry_interval / 1000;
+    }
+
+    /* Set the next time to update stats */
+    next_flush = now + PGSTAT_MIN_INTERVAL * 1000;
+    retry_interval = 0;
+    pending_since = 0;
+
+    return 0;
 }
 
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * flush_tabstat - flush out a local table stats entry
+ *
+ * Some of the counters are added to the local database stats entry after a
+ * successful flush.
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_tabstat(PgStatEnvelope * lenv, bool nowait)
+{
+    static const PgStat_TableCounts all_zeroes;
+    Oid            dboid;            /* database OID of the table */
+    PgStat_TableStatus *lstats; /* local stats entry  */
+    PgStatEnvelope *shenv;        /* shared stats envelope */
+    PgStat_StatTabEntry *shtabstats;    /* table entry of shared stats */
+    PgStat_StatDBEntry *ldbstats;    /* local database entry */
+    bool        found;
+
+    Assert(lenv->type == PGSTAT_TYPE_TABLE);
+
+    lstats = (PgStat_TableStatus *) &lenv->body;
+    dboid = lstats->t_shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Ignore entries that didn't accumulate any actual counts, such as
+     * indexes that were opened by the planner but not used.
+     */
+    if (memcmp(&lstats->t_counts, &all_zeroes,
+               sizeof(PgStat_TableCounts)) == 0)
+        return true;
+
+    /* find shared table stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_TABLE, dboid, lstats->t_id,
+                           nowait, init_tabentry, &found);
+
+    /* skip if dshash failed to acquire lock */
+    if (shenv == NULL)
+        return false;
+
+    /* retrieve the shared table stats entry from the envelope */
+    shtabstats = (PgStat_StatTabEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;
+
+    /* add the values to the shared entry. */
+    shtabstats->numscans += lstats->t_counts.t_numscans;
+    shtabstats->tuples_returned += lstats->t_counts.t_tuples_returned;
+    shtabstats->tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    shtabstats->tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    shtabstats->tuples_updated += lstats->t_counts.t_tuples_updated;
+    shtabstats->tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    shtabstats->tuples_hot_updated += lstats->t_counts.t_tuples_hot_updated;
+
+    /*
+     * If the table was truncated or vacuum/analyze has run, first reset the
+     * live/dead counters.
+     */
+    if (lstats->t_counts.t_truncated ||
+        lstats->t_counts.vacuum_count > 0 ||
+        lstats->t_counts.analyze_count > 0 ||
+        lstats->t_counts.autovac_vacuum_count > 0 ||
+        lstats->t_counts.autovac_analyze_count > 0)
+    {
+        shtabstats->n_live_tuples = 0;
+        shtabstats->n_dead_tuples = 0;
+    }
+
+    /* clear the change counter if requested */
+    if (lstats->t_counts.reset_changed_tuples)
+        shtabstats->changes_since_analyze = 0;
+
+    shtabstats->n_live_tuples += lstats->t_counts.t_delta_live_tuples;
+    shtabstats->n_dead_tuples += lstats->t_counts.t_delta_dead_tuples;
+    shtabstats->changes_since_analyze += lstats->t_counts.t_changed_tuples;
+    shtabstats->blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    shtabstats->blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /*
+     * Update vacuum/analyze timestamps and counters so that the timestamps
+     * never go backwards.
+     */
+    if (shtabstats->vacuum_timestamp < lstats->vacuum_timestamp)
+        shtabstats->vacuum_timestamp = lstats->vacuum_timestamp;
+    shtabstats->vacuum_count += lstats->t_counts.vacuum_count;
+
+    if (shtabstats->autovac_vacuum_timestamp < lstats->autovac_vacuum_timestamp)
+        shtabstats->autovac_vacuum_timestamp = lstats->autovac_vacuum_timestamp;
+    shtabstats->autovac_vacuum_count += lstats->t_counts.autovac_vacuum_count;
+
+    if (shtabstats->analyze_timestamp < lstats->analyze_timestamp)
+        shtabstats->analyze_timestamp = lstats->analyze_timestamp;
+    shtabstats->analyze_count += lstats->t_counts.analyze_count;
+
+    if (shtabstats->autovac_analyze_timestamp < lstats->autovac_analyze_timestamp)
+        shtabstats->autovac_analyze_timestamp = lstats->autovac_analyze_timestamp;
+    shtabstats->autovac_analyze_count += lstats->t_counts.autovac_analyze_count;
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    shtabstats->n_live_tuples = Max(shtabstats->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    shtabstats->n_dead_tuples = Max(shtabstats->n_dead_tuples, 0);
+
+    LWLockRelease(&shenv->lock);
+
+    /* The entry was flushed successfully; add the same counts to database stats */
+    ldbstats = get_local_dbstat_entry(dboid);
+    ldbstats->counts.n_tuples_returned += lstats->t_counts.t_tuples_returned;
+    ldbstats->counts.n_tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    ldbstats->counts.n_tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    ldbstats->counts.n_tuples_updated += lstats->t_counts.t_tuples_updated;
+    ldbstats->counts.n_tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    ldbstats->counts.n_blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    ldbstats->counts.n_blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    return true;
+}
+
+/* ----------
+ * init_tabentry() -
+ *
+ * initializes table stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
  */
 static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+init_tabentry(PgStatEnvelope * env)
 {
-    int            n;
-    int            len;
+    PgStat_StatTabEntry *tabent = (PgStat_StatTabEntry *) &env->body;
+
+    /*
+     * This is a new table entry, so initialize all counters and timestamps
+     * to zero.
+     */
+    Assert(env->type == PGSTAT_TYPE_TABLE);
+    tabent->tableid = env->objectid;
+    tabent->numscans = 0;
+    tabent->tuples_returned = 0;
+    tabent->tuples_fetched = 0;
+    tabent->tuples_inserted = 0;
+    tabent->tuples_updated = 0;
+    tabent->tuples_deleted = 0;
+    tabent->tuples_hot_updated = 0;
+    tabent->n_live_tuples = 0;
+    tabent->n_dead_tuples = 0;
+    tabent->changes_since_analyze = 0;
+    tabent->blocks_fetched = 0;
+    tabent->blocks_hit = 0;
+
+    tabent->vacuum_timestamp = 0;
+    tabent->vacuum_count = 0;
+    tabent->autovac_vacuum_timestamp = 0;
+    tabent->autovac_vacuum_count = 0;
+    tabent->analyze_timestamp = 0;
+    tabent->analyze_count = 0;
+    tabent->autovac_analyze_timestamp = 0;
+    tabent->autovac_analyze_count = 0;
+}
+
+
+/*
+ * flush_funcstat - flush out a local function stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_funcstat(PgStatEnvelope * env, bool nowait)
+{
+    /* we assume this inits to all zeroes: */
+    static const PgStat_FunctionCounts all_zeroes;
+    PgStat_BackendFunctionEntry *localent;    /* local stats entry */
+    PgStatEnvelope *shenv;        /* shared stats envelope */
+    PgStat_StatFuncEntry *sharedent = NULL; /* shared stats entry */
+    bool        found;
+
+    Assert(env->type == PGSTAT_TYPE_FUNCTION);
+    localent = (PgStat_BackendFunctionEntry *) &env->body;
+
+    /* Skip it if no counts accumulated for it so far */
+    if (memcmp(&localent->f_counts, &all_zeroes,
+               sizeof(PgStat_FunctionCounts)) == 0)
+        return true;
+
+    /* find shared function stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId, localent->f_id,
+                           nowait, init_funcentry, &found);
+    /* skip if dshash failed to acquire lock */
+    if (shenv == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    /* retrieve the shared function stats entry from the envelope */
+    sharedent = (PgStat_StatFuncEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
+
+    sharedent->f_numcalls += localent->f_counts.f_numcalls;
+    sharedent->f_total_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_total_time);
+    sharedent->f_self_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_self_time);
+
+    LWLockRelease(&shenv->lock);
+
+    return true;
+}
+
+
+/* ----------
+ * init_funcentry() -
+ *
+ * initializes function stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
+ */
+static void
+init_funcentry(PgStatEnvelope * env)
+{
+    PgStat_StatFuncEntry *shstat = (PgStat_StatFuncEntry *) &env->body;
+
+    Assert(env->type == PGSTAT_TYPE_FUNCTION);
+    shstat->functionid = env->objectid;
+    shstat->f_numcalls = 0;
+    shstat->f_total_time = 0;
+    shstat->f_self_time = 0;
+}
+
+
+/*
+ * flush_dbstat - flush out a local database stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_dbstat(PgStatEnvelope * env, bool nowait)
+{
+    PgStat_StatDBEntry *localent;
+    PgStatEnvelope *shenv;
+    PgStat_StatDBEntry *sharedent;
+
+    Assert(env->type == PGSTAT_TYPE_DB);
+
+    localent = (PgStat_StatDBEntry *) &env->body;
+
+    /* find shared database stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_DB, localent->databaseid, InvalidOid,
+                           nowait, init_dbentry, NULL);
+
+    /* skip if dshash failed to acquire lock */
+    if (!shenv)
+        return false;
+
+    /* retrieve the shared stats entry from the envelope */
+    sharedent = (PgStat_StatDBEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;
+
+    sharedent->counts.n_tuples_returned += localent->counts.n_tuples_returned;
+    sharedent->counts.n_tuples_fetched += localent->counts.n_tuples_fetched;
+    sharedent->counts.n_tuples_inserted += localent->counts.n_tuples_inserted;
+    sharedent->counts.n_tuples_updated += localent->counts.n_tuples_updated;
+    sharedent->counts.n_tuples_deleted += localent->counts.n_tuples_deleted;
+    sharedent->counts.n_blocks_fetched += localent->counts.n_blocks_fetched;
+    sharedent->counts.n_blocks_hit += localent->counts.n_blocks_hit;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    sharedent->counts.n_deadlocks += localent->counts.n_deadlocks;
+    sharedent->counts.n_temp_bytes += localent->counts.n_temp_bytes;
+    sharedent->counts.n_temp_files += localent->counts.n_temp_files;
+    sharedent->counts.n_checksum_failures += localent->counts.n_checksum_failures;
 
     /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
+     * Accumulate xact commit/rollback and I/O timings into the stats entry of
+     * the current database.
      */
-    if (OidIsValid(tsmsg->m_databaseid))
+    if (OidIsValid(localent->databaseid))
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
+        sharedent->counts.n_xact_commit += pgStatXactCommit;
+        sharedent->counts.n_xact_rollback += pgStatXactRollback;
+        sharedent->counts.n_block_read_time += pgStatBlockReadTime;
+        sharedent->counts.n_block_write_time += pgStatBlockWriteTime;
         pgStatXactCommit = 0;
         pgStatXactRollback = 0;
         pgStatBlockReadTime = 0;
@@ -966,257 +1110,143 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        sharedent->counts.n_xact_commit = 0;
+        sharedent->counts.n_xact_rollback = 0;
+        sharedent->counts.n_block_read_time = 0;
+        sharedent->counts.n_block_write_time = 0;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
+    LWLockRelease(&shenv->lock);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    return true;
 }
 
-/*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
+
+/* ----------
+ * init_dbentry() -
+ *
+ * initializes database stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
  */
 static void
-pgstat_send_funcstats(void)
+init_dbentry(PgStatEnvelope * env)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_FunctionCounts all_zeroes;
+    PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) &env->body;
+
+    Assert(env->type == PGSTAT_TYPE_DB);
+    dbentry->databaseid = env->databaseid;
+    dbentry->last_autovac_time = 0;
+    dbentry->last_checksum_failure = 0;
+    dbentry->stat_reset_timestamp = 0;
+    dbentry->stats_timestamp = 0;
+    /* initialize the new shared entry */
+    MemSet(&dbentry->counts, 0, sizeof(PgStat_StatDBCounts));
+}
 
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
 
-    if (pgStatFunctions == NULL)
-        return;
+/*
+ * flush_slrustat - flush out local SLRU stats entries
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if all local SLRU stats are successfully flushed out.
+ */
+static bool
+flush_slrustat(bool nowait)
+{
+    int i;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(StatsLock, LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
 
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
     {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
-
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
-
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
-
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
+        PgStat_StatSLRUEntry *sharedent = &shared_SLRUStats[i];
+        PgStat_StatSLRUEntry *localent = &local_SLRUStats[i];
+
+        sharedent->blocks_zeroed += localent->blocks_zeroed;
+        sharedent->blocks_hit += localent->blocks_hit;
+        sharedent->blocks_read += localent->blocks_read;
+        sharedent->blocks_written += localent->blocks_written;
+        sharedent->blocks_exists += localent->blocks_exists;
+        sharedent->flush += localent->flush;
+        sharedent->truncate += localent->truncate;
+
+        /* done, clear the local entry */
+        MemSet(localent, 0, sizeof(PgStat_StatSLRUEntry));
+    }
+    LWLockRelease(StatsLock);
+
+    return true;
+}
 
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
 
-    have_function_stats = false;
+/*
+ * Create the filename for a DB stat file; filename is an output parameter
+ * that points to a character buffer of length len.
+ */
+static void
+get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
+{
+    int            printed;
+
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/db_%u.%s",
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
+                       databaseid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
 }
 
 
 /* ----------
- * pgstat_vacuum_stat() -
+ * collect_stat_entries() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *    Collect the shared statistics entries specified by type and dbid. Returns
+ *    a NULL-terminated list of pointers to shared statistics in palloc'ed
+ *    memory. If type is PGSTAT_TYPE_ALL, all types of statistics of the
+ *    database are collected. If type is PGSTAT_TYPE_DB, the parameter dbid is
+ *    ignored and all PGSTAT_TYPE_DB entries are collected.
  * ----------
  */
-void
-pgstat_vacuum_stat(void)
+static PgStatEnvelope * *collect_stat_entries(PgStatTypes type, Oid dbid)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Read pg_database and make a list of OIDs of all existing databases
-     */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
-
-    /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
+    int            listlen = 16;
+    PgStatEnvelope **envlist = palloc(sizeof(PgStatEnvelope *) * listlen);
+    int            n = 0;
+
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
     {
-        Oid            dbid = dbentry->databaseid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        /* the DB entry for shared tables (with InvalidOid) is never dropped */
-        if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
-            pgstat_drop_database(dbid);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Lookup our own database entry; if not found, nothing more to do.
-     */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
-        return;
-
-    /*
-     * Similarly to above, make a list of all known relations in this DB.
-     */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
-
-    /*
-     * Check for all tables listed in stats hashtable if they still exist.
-     */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            tabid = tabentry->tableid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+        if ((type != PGSTAT_TYPE_ALL && p->key.type != type) ||
+            (type != PGSTAT_TYPE_DB && p->key.databaseid != dbid))
             continue;
 
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
+        if (n >= listlen - 1)
         {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
+            listlen *= 2;
+            envlist = repalloc(envlist, listlen * sizeof(PgStatEnvelope *));
         }
+        envlist[n++] = dsa_get_address(area, p->env);
     }
+    dshash_seq_term(&hstat);
 
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
+    envlist[n] = NULL;
 
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
-        {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
-                continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
-        }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
-    }
+    return envlist;
 }
 
 
 /* ----------
- * pgstat_collect_oids() -
+ * collect_oids() -
  *
  *    Collect the OIDs of all objects listed in the specified system catalog
  *    into a temporary hash table.  Caller should hash_destroy the result
@@ -1225,7 +1255,7 @@ pgstat_vacuum_stat(void)
  * ----------
  */
 static HTAB *
-pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
+collect_oids(Oid catalogid, AttrNumber anum_oid)
 {
     HTAB       *htab;
     HASHCTL        hash_ctl;
@@ -1239,7 +1269,7 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     hash_ctl.entrysize = sizeof(Oid);
     hash_ctl.hcxt = CurrentMemoryContext;
     htab = hash_create("Temporary table of OIDs",
-                       PGSTAT_TAB_HASH_SIZE,
+                       PGSTAT_TABLE_HASH_SIZE,
                        &hash_ctl,
                        HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
 
@@ -1266,65 +1296,185 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
 }
 
 
+/* ----------
+ * pgstat_vacuum_stat() -
+ *
+ *  Delete shared stat entries that are not in system catalogs.
+ *
+ *  To avoid holding an exclusive lock on the dshash for a long time, the
+ *  process is performed in three steps.
+ *
+ *   1: Collect the OIDs of all existing objects of each kind.
+ *   2: Collect victim entries by scanning with a shared lock.
+ *   3: Try removing every nominated entry without waiting for the lock.
+ *
+ *  As a consequence of the last step, some entries may be left behind due to
+ *  lock failure, but they will be deleted by later invocations of this
+ *  function.
+ * ----------
+ */
+void
+pgstat_vacuum_stat(void)
+{
+    HTAB       *dbids;            /* database ids */
+    HTAB       *relids;            /* relation ids in the current database */
+    HTAB       *funcids;        /* function ids in the current database */
+    PgStatEnvelope **victims;    /* victim entry list */
+    int            arraylen = 0;    /* storage size of the above */
+    int            nvictims = 0;    /* # of entries of the above */
+    dshash_seq_status dshstat;
+    PgStatHashEntry *ent;
+    int            i;
+
+    /* we don't collect stats in standalone mode */
+    if (!IsUnderPostmaster)
+        return;
+
+    /* collect OIDs of existing objects */
+    dbids = collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    relids = collect_oids(RelationRelationId, Anum_pg_class_oid);
+    funcids = collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
+
+    /* collect victims from shared stats */
+    arraylen = 16;
+    victims = palloc(sizeof(PgStatEnvelope *) * arraylen);
+    nvictims = 0;
+
+    dshash_seq_init(&dshstat, pgStatSharedHash, false);
+
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
+    {
+        HTAB       *oidtab;
+        Oid           *key;
+
+        CHECK_FOR_INTERRUPTS();
+
+        /*
+         * Don't drop non-database entries that belong to a database other
+         * than the current one.
+         */
+        if (ent->key.type != PGSTAT_TYPE_DB &&
+            ent->key.databaseid != MyDatabaseId)
+            continue;
+
+        switch (ent->key.type)
+        {
+            case PGSTAT_TYPE_DB:
+                /* don't remove database entry for shared tables */
+                if (ent->key.databaseid == 0)
+                    continue;
+                oidtab = dbids;
+                key = &ent->key.databaseid;
+                break;
+
+            case PGSTAT_TYPE_TABLE:
+                oidtab = relids;
+                key = &ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                oidtab = funcids;
+                key = &ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_ALL:
+                Assert(false);
+                break;
+        }
+
+        /* Skip objects that still exist. */
+        if (hash_search(oidtab, key, HASH_FIND, NULL) != NULL)
+            continue;
+
+        /* extend the list if needed */
+        if (nvictims >= arraylen)
+        {
+            arraylen *= 2;
+            victims = repalloc(victims, sizeof(PgStatEnvelope *) * arraylen);
+        }
+
+        victims[nvictims++] = dsa_get_address(area, ent->env);
+    }
+    dshash_seq_term(&dshstat);
+    hash_destroy(dbids);
+    hash_destroy(relids);
+    hash_destroy(funcids);
+
+    /* Now try removing the victim entries */
+    for (i = 0; i < nvictims; i++)
+    {
+        PgStatEnvelope *p = victims[i];
+
+        delete_stat_entry(p->type, p->databaseid, p->objectid, true);
+    }
+}
+
+
+/* ----------
+ * delete_stat_entry() -
+ *
+ *  Deletes the specified entry from shared stats hash.
+ *
+ *  Returns true when successfully deleted.
+ * ----------
+ */
+static bool
+delete_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait)
+{
+    PgStatHashEntryKey key;
+    PgStatHashEntry *ent;
+
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    ent = dshash_find_extended(pgStatSharedHash, &key,
+                               true, nowait, false, NULL);
+
+    if (!ent)
+        return false;            /* lock failed or not found */
+
+    /* The entry is exclusively locked, so we can free the chunk first. */
+    dsa_free(area, ent->env);
+    dshash_delete_entry(pgStatSharedHash, ent);
+
+    return true;
+}
+
+
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
- * ----------
+ *    Remove the entries for the database that we just dropped.
+ *
+ *    Some entries might be left behind due to lock failure, or because some
+ *    stats are flushed after this, but we will still clean up the dead DB
+ *    eventually via future invocations of pgstat_vacuum_stat().
+ * ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
+    Assert(OidIsValid(databaseid));
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
+    envlist = collect_stat_entries(PGSTAT_TYPE_ALL, databaseid);
 
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
+    for (p = envlist; *p != NULL; p++)
+        delete_stat_entry((*p)->type, (*p)->databaseid, (*p)->objectid, true);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
+    pfree(envlist);
 }
-#endif                            /* NOT_USED */
 
 
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1333,20 +1483,47 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    /* Look up the entries of the current database in the stats hash. */
+    envlist = collect_stat_entries(PGSTAT_TYPE_ALL, MyDatabaseId);
+    for (p = envlist; *p != NULL; p++)
+    {
+        PgStatEnvelope *env = *p;
+        PgStat_StatDBEntry *dbstat;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+        LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+
+        switch (env->type)
+        {
+            case PGSTAT_TYPE_TABLE:
+                init_tabentry(env);
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                init_funcentry(env);
+                break;
+
+            case PGSTAT_TYPE_DB:
+                init_dbentry(env);
+                dbstat = (PgStat_StatDBEntry *) &env->body;
+                dbstat->stat_reset_timestamp = GetCurrentTimestamp();
+                break;
+            default:
+                Assert(false);
+        }
+
+        LWLockRelease(&env->lock);
+    }
+
+    pfree(envlist);
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1355,29 +1532,37 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
+    /* Reset the archiver statistics for the cluster. */
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    /* Reset the bgwriter statistics for the cluster. */
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1386,17 +1571,42 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStatEnvelope *env;
+    PgStat_StatDBEntry *dbentry;
+    PgStatTypes stattype;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    env = get_stat_entry(PGSTAT_TYPE_DB, MyDatabaseId, InvalidOid,
+                         false, NULL, NULL);
+    Assert(env);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* Set the reset timestamp for the whole database */
+    dbentry = (PgStat_StatDBEntry *) &env->body;
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&env->lock);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Reset the object's counters if it exists, ignore if not */
+    switch (type)
+    {
+        case RESET_TABLE:
+            stattype = PGSTAT_TYPE_TABLE;
+            break;
+        case RESET_FUNCTION:
+            stattype = PGSTAT_TYPE_FUNCTION;
+    }
+
+    env = get_stat_entry(stattype, MyDatabaseId, objoid, false, NULL, NULL);
+    if (!env)
+        return;                    /* no such entry, nothing to reset */
+    LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+    Assert(env->type == PGSTAT_TYPE_TABLE || env->type == PGSTAT_TYPE_FUNCTION);
+    if (env->type == PGSTAT_TYPE_TABLE)
+        init_tabentry(env);
+    else
+        init_funcentry(env);
+    LWLockRelease(&env->lock);
 }
 
 /* ----------
@@ -1412,15 +1622,27 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_reset_slru_counter(const char *name)
 {
-    PgStat_MsgResetslrucounter msg;
+    int i;
+    TimestampTz    ts = GetCurrentTimestamp();
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSLRUCOUNTER);
-    msg.m_index = (name) ? pgstat_slru_index(name) : -1;
-
-    pgstat_send(&msg, sizeof(msg));
+    if (name)
+    {
+        i = pgstat_slru_index(name);
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(&shared_SLRUStats[i], 0, sizeof(PgStat_StatSLRUEntry));
+        shared_SLRUStats[i].stat_reset_timestamp = ts;
+        LWLockRelease(StatsLock);
+    }
+    else
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
+        {
+            MemSet(&shared_SLRUStats[i], 0, sizeof(PgStat_StatSLRUEntry));
+            shared_SLRUStats[i].stat_reset_timestamp = ts;
+        }
+        LWLockRelease(StatsLock);
+    }
 }
 
 /* ----------
@@ -1434,48 +1656,63 @@ pgstat_reset_slru_counter(const char *name)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    ts = GetCurrentTimestamp();
 
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Store the last autovacuum time in the database's hash table entry.
+     */
+    dbentry = get_local_dbstat_entry(dboid);
+    dbentry->last_autovac_time = ts;
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report about the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    PgStat_TableStatus *tabentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    ts = GetCurrentTimestamp();
+    tabentry = get_local_tabstat_entry(tableoid, shared);
+
+    tabentry->t_counts.t_delta_live_tuples = livetuples;
+    tabentry->t_counts.t_delta_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = ts;
+        tabentry->t_counts.autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = ts;
+        tabentry->t_counts.vacuum_count++;
+    }
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report about the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1486,9 +1723,10 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    PgStat_TableStatus *tabentry;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1496,10 +1734,10 @@ pgstat_report_analyze(Relation rel,
      * already inserted and/or deleted rows in the target table. ANALYZE will
      * have counted such rows as live or dead respectively. Because we will
      * report our counts of such rows at transaction end, we should subtract
-     * off these counts from what we send to the collector now, else they'll
-     * be double-counted after commit.  (This approach also ensures that the
-     * collector ends up with the right numbers if we abort instead of
-     * committing.)
+     * off these counts from what we report now, else they'll be
+     * double-counted after commit.  (This approach also ensures that the
+     * shared stats end up with the right numbers if we abort instead of
+     * committing.)
      */
     if (rel->pgstat_info != NULL)
     {
@@ -1517,158 +1755,172 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    tabentry = get_local_tabstat_entry(RelationGetRelid(rel),
+                                       rel->rd_rel->relisshared);
+
+    tabentry->t_counts.t_delta_live_tuples = livetuples;
+    tabentry->t_counts.t_delta_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->t_counts.reset_changed_tuples = true;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->t_counts.autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->t_counts.analyze_count++;
+    }
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            dbent->counts.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            dbent->counts.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            dbent->counts.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            dbent->counts.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            dbent->counts.n_conflict_startup_deadlock++;
+            break;
+    }
 }
 
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-
-/* --------
- * pgstat_report_checksum_failures_in_db() -
- *
- *    Tell the collector about one or more checksum failures.
- * --------
- */
-void
-pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
-{
-    PgStat_MsgChecksumFailure msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
-
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_deadlocks++;
 }
 
 /* --------
  * pgstat_report_checksum_failure() -
  *
- *    Tell the collector about a checksum failure.
+ *    Report a checksum failure.
  * --------
  */
 void
 pgstat_report_checksum_failure(void)
 {
-    pgstat_report_checksum_failures_in_db(MyDatabaseId, 1);
+    PgStat_StatDBEntry *dbent;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_checksum_failures++;
 }
 
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
+    if (filesize == 0)            /* Is there a case where filesize is really 0? */
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_temp_bytes += filesize; /* needs overflow check */
+    dbent->counts.n_temp_files++;
 }
 
 
-/* ----------
- * pgstat_ping() -
+/* --------
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
  *
- *    Send some junk data to the collector to increase traffic.
- * ----------
+ *    Report one or more checksum failures.
+ * --------
  */
 void
-pgstat_ping(void)
+pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
 {
-    PgStat_MsgDummy msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if we are not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dbentry = get_local_dbstat_entry(dboid);
+
+    /* add the reported failure count to the accumulated count */
+    dbentry->counts.n_checksum_failures += failurecount;
 }
 
 /* ----------
- * pgstat_send_inquiry() -
+ * pgstat_init_function_usage() -
  *
- *    Notify collector that we need fresh data.
+ *  Initialize function call usage data.
+ *  Called by the executor before invoking a function.
  * ----------
  */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/*
- * Initialize function call usage data.
- * Called by the executor before invoking a function.
- */
 void
 pgstat_init_function_usage(FunctionCallInfo fcinfo,
                            PgStat_FunctionCallUsage *fcu)
 {
+    PgStatEnvelope *env;
     PgStat_BackendFunctionEntry *htabent;
     bool        found;
 
@@ -1679,26 +1931,15 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
         return;
     }
 
-    if (!pgStatFunctions)
-    {
-        /* First time through - initialize function stat table */
-        HASHCTL        hash_ctl;
+    env = get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                               fcinfo->flinfo->fn_oid, true, &found);
+    htabent = (PgStat_BackendFunctionEntry *) &env->body;
 
-        memset(&hash_ctl, 0, sizeof(hash_ctl));
-        hash_ctl.keysize = sizeof(Oid);
-        hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
-        pgStatFunctions = hash_create("Function stat entries",
-                                      PGSTAT_FUNCTION_HASH_SIZE,
-                                      &hash_ctl,
-                                      HASH_ELEM | HASH_BLOBS);
-    }
-
-    /* Get the stats entry for this function, create if necessary */
-    htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
-                          HASH_ENTER, &found);
     if (!found)
         MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
 
+    htabent->f_id = fcinfo->flinfo->fn_oid;
+
     fcu->fs = &htabent->f_counts;
 
     /* save stats for this function, later used to compensate for recursion */
@@ -1711,31 +1952,38 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
     INSTR_TIME_SET_CURRENT(fcu->f_start);
 }
 
-/*
- * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
- *        for specified function
+/* ----------
+ * find_funcstat_entry() -
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_BackendFunctionEntry for the specified function.
+ *
+ *  If there is no entry, return NULL without creating a new one.
+ * ----------
  */
 PgStat_BackendFunctionEntry *
 find_funcstat_entry(Oid func_id)
 {
-    if (pgStatFunctions == NULL)
+    PgStatEnvelope *env;
+
+    env = get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                               func_id, false, NULL);
+    if (!env)
         return NULL;
 
-    return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
-                                                       (void *) &func_id,
-                                                       HASH_FIND, NULL);
+    return (PgStat_BackendFunctionEntry *) &env->body;
 }
 
-/*
- * Calculate function call usage and update stat counters.
- * Called by the executor after invoking a function.
+/* ----------
+ * pgstat_end_function_usage() -
  *
- * In the case of a set-returning function that runs in value-per-call mode,
- * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
- * calls for what the user considers a single call of the function.  The
- * finalize flag should be TRUE on the last call.
+ *  Calculate function call usage and update stat counters.
+ *  Called by the executor after invoking a function.
+ *
+ *  In the case of a set-returning function that runs in value-per-call mode,
+ *  we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
+ *  calls for what the user considers a single call of the function.  The
+ *  finalize flag should be TRUE on the last call.
+ * ----------
  */
 void
 pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
@@ -1776,9 +2024,6 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
         fs->f_numcalls++;
     fs->f_total_time = f_total;
     INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
 }
 
 
@@ -1790,8 +2035,7 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
  *
  *    We assume that a relcache entry's pgstat_info field is zeroed by
  *    relcache.c when the relcache entry is made; thereafter it is long-lived
- *    data.  We can avoid repeated searches of the TabStatus arrays when the
- *    same relation is touched repeatedly within a transaction.
+ *    data.
  * ----------
  */
 void
@@ -1811,7 +2055,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1827,116 +2072,157 @@ pgstat_initstats(Relation rel)
         return;
 
     /* Else find or make the PgStat_TableStatus entry, and update link */
-    rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+    rel->pgstat_info = get_local_tabstat_entry(rel_id, rel->rd_rel->relisshared);
 }
 
-/*
- * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
+
+/* ----------
+ * get_local_stat_entry() -
+ *
+ *  Returns the local stats entry for the given type, dbid and objid.  If
+ *  create is true, a new entry is created when none exists yet, and found
+ *  must be non-null.  If create is false and no entry exists, returns NULL.
+ *
+ *  The caller is responsible for initializing the body of the returned
+ *  envelope.
+ * ----------
  */
-static PgStat_TableStatus *
-get_tabstat_entry(Oid rel_id, bool isshared)
+static PgStatEnvelope *
+get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                     bool create, bool *found)
 {
-    TabStatHashEntry *hash_entry;
-    PgStat_TableStatus *entry;
-    TabStatusArray *tsa;
-    bool        found;
+    PgStatHashEntryKey key;
+    PgStatLocalHashEntry *entry;
 
-    /*
-     * Create hash table if we don't have it already.
-     */
-    if (pgStatTabHash == NULL)
+    if (pgStatLocalHash == NULL)
     {
         HASHCTL        ctl;
 
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
+        MemSet(&ctl, 0, sizeof(ctl));
+        ctl.keysize = sizeof(PgStatHashEntryKey);
+        ctl.entrysize = sizeof(PgStatLocalHashEntry);
+
+        pgStatLocalHash = hash_create("Local stat entries",
+                                      PGSTAT_TABLE_HASH_SIZE,
+                                      &ctl,
+                                      HASH_ELEM | HASH_BLOBS);
+    }
+
+    /* Find an entry or create a new one. */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    entry = hash_search(pgStatLocalHash, &key,
+                        create ? HASH_ENTER : HASH_FIND, found);
+
+    if (!create && !entry)
+        return NULL;
+
+    if (create && !*found)
+    {
+        int            len = pgstat_localentsize[type];
+
+        entry->env = MemoryContextAlloc(TopMemoryContext,
+                                        PgStatEnvelopeSize(len));
+        entry->env->type = type;
+        entry->env->len = len;
     }
 
+    return entry->env;
+}
+
+/* ----------
+ * get_local_dbstat_entry() -
+ *
+ *  Find or create a local PgStat_StatDBEntry entry for dbid.  A new entry is
+ *  created and initialized if none exists.
+ */
+static PgStat_StatDBEntry *
+get_local_dbstat_entry(Oid dbid)
+{
+    PgStatEnvelope *env;
+    PgStat_StatDBEntry *dbentry;
+    bool        found;
+
     /*
      * Find an entry or create a new one.
      */
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
+    env = get_local_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid,
+                               true, &found);
+    dbentry = (PgStat_StatDBEntry *) &env->body;
+
     if (!found)
     {
-        /* initialize new entry with null pointer */
-        hash_entry->tsa_entry = NULL;
+        dbentry->databaseid = dbid;
+        dbentry->last_autovac_time = 0;
+        dbentry->last_checksum_failure = 0;
+        dbentry->stat_reset_timestamp = 0;
+        dbentry->stats_timestamp = 0;
+        MemSet(&dbentry->counts, 0, sizeof(PgStat_StatDBCounts));
     }
 
-    /*
-     * If entry is already valid, we're done.
-     */
-    if (hash_entry->tsa_entry)
-        return hash_entry->tsa_entry;
-
-    /*
-     * Locate the first pgStatTabList entry with free space, making a new list
-     * entry if needed.  Note that we could get an OOM failure here, but if so
-     * we have left the hashtable and the list in a consistent state.
-     */
-    if (pgStatTabList == NULL)
-    {
-        /* Set up first pgStatTabList entry */
-        pgStatTabList = (TabStatusArray *)
-            MemoryContextAllocZero(TopMemoryContext,
-                                   sizeof(TabStatusArray));
-    }
+    return dbentry;
+}
 
-    tsa = pgStatTabList;
-    while (tsa->tsa_used >= TABSTAT_QUANTUM)
-    {
-        if (tsa->tsa_next == NULL)
-            tsa->tsa_next = (TabStatusArray *)
-                MemoryContextAllocZero(TopMemoryContext,
-                                       sizeof(TabStatusArray));
-        tsa = tsa->tsa_next;
-    }
 
-    /*
-     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
-     * the entry was already zeroed, either at creation or after last use.
-     */
-    entry = &tsa->tsa_entries[tsa->tsa_used++];
-    entry->t_id = rel_id;
-    entry->t_shared = isshared;
+/* ----------
+ * get_local_tabstat_entry() -
+ *  Find or create a PgStat_TableStatus entry for rel.  A new entry is created
+ *  and initialized if none exists.
+ * ----------
+ */
+static PgStat_TableStatus *
+get_local_tabstat_entry(Oid rel_id, bool isshared)
+{
+    PgStatEnvelope *env;
+    PgStat_TableStatus *tabentry;
+    bool        found;
 
-    /*
-     * Now we can fill the entry in pgStatTabHash.
-     */
-    hash_entry->tsa_entry = entry;
+    env = get_local_stat_entry(PGSTAT_TYPE_TABLE,
+                               isshared ? InvalidOid : MyDatabaseId,
+                               rel_id, true, &found);
 
-    return entry;
+    tabentry = (PgStat_TableStatus *) &env->body;
+
+    if (!found)
+    {
+        tabentry->t_id = rel_id;
+        tabentry->t_shared = isshared;
+        tabentry->trans = NULL;
+        MemSet(&tabentry->t_counts, 0, sizeof(PgStat_TableCounts));
+        tabentry->vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+    }
+
+    return tabentry;
 }
 
+
 /*
  * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_TableStatus entry for rel from the current
+ *  database then from shared tables.
  *
- * Note: if we got an error in the most recent execution of pgstat_report_stat,
- * it's possible that an entry exists but there's no hashtable entry for it.
- * That's okay, we'll treat this case as "doesn't exist".
+ *  If no entry, return NULL, don't create a new one
+ * ----------
  */
 PgStat_TableStatus *
 find_tabstat_entry(Oid rel_id)
 {
-    TabStatHashEntry *hash_entry;
+    PgStatEnvelope *env;
 
-    /* If hashtable doesn't exist, there are no entries at all */
-    if (!pgStatTabHash)
-        return NULL;
+    env = get_local_stat_entry(PGSTAT_TYPE_TABLE, MyDatabaseId, rel_id,
+                               false, NULL);
+    if (!env)
+        env = get_local_stat_entry(PGSTAT_TYPE_TABLE, InvalidOid, rel_id,
+                                   false, NULL);
+    if (env)
+        return (PgStat_TableStatus *) &env->body;
 
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
-    if (!hash_entry)
-        return NULL;
-
-    /* Note that this step could also return NULL, but that's correct */
-    return hash_entry->tsa_entry;
+    return NULL;
 }
 
 /*
@@ -2367,8 +2653,8 @@ AtPrepare_PgStat(void)
  *
  * All we need do here is unlink the transaction stats state from the
  * nontransactional state.  The nontransactional action counts will be
- * reported to the stats collector immediately, while the effects on live
- * and dead tuple counts are preserved in the 2PC state file.
+ * reported to the activity stats facility immediately, while the effects on
+ * live and dead tuple counts are preserved in the 2PC state file.
  *
  * Note: AtEOXact_PgStat is not called during PREPARE.
  */
@@ -2413,7 +2699,7 @@ pgstat_twophase_postcommit(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, commit case */
     pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
@@ -2449,7 +2735,7 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, abort case */
     if (rec->t_truncated)
@@ -2466,88 +2752,176 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 }
 
 
+/* ----------
+ * snapshot_statentry() -
+ *
+ *  Common routine for functions pgstat_fetch_stat_*entry()
+ *
+ *  Returns the pointer to the snapshot of the shared entry for the key or NULL
+ *  if not found. Returned snapshots are stable during the current transaction
+ *  or until pgstat_clear_snapshot() is called.
+ *
+ *  Created snapshots are stored in pgStatSnapshotHash.
+ */
+static void *
+snapshot_statentry(const PgStatTypes type, const Oid dbid, const Oid objid)
+{
+    PgStatSnapshot *snap = NULL;
+    bool        found;
+    PgStatHashEntryKey key;
+    size_t        statentsize = pgstat_entsize[type];
+
+    Assert(type != PGSTAT_TYPE_ALL);
+
+    /*
+     * Create new hash, with rather arbitrary initial number of entries since
+     * we don't know how this hash will grow.
+     */
+    if (!pgStatSnapshotHash)
+    {
+        HASHCTL        ctl;
+
+        /*
+         * Create the hash in the stats context
+         *
+         * The entry is prepended by common header part represented by
+         * PgStatSnapshot.
+         */
+
+        ctl.keysize = sizeof(PgStatHashEntryKey);
+        ctl.entrysize = PgStatSnapshotSize(statentsize);
+        ctl.hcxt = pgStatSnapshotContext;
+        pgStatSnapshotHash = hash_create("pgstat snapshot hash", 32, &ctl,
+                                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    }
+
+    /* Find a snapshot  */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+
+    snap = hash_search(pgStatSnapshotHash, &key, HASH_ENTER, &found);
+
+    /*
+     * Refer to the shared hash if not found in the snapshot hash.
+     *
+     * In transaction state, it is obvious that we should create a snapshot
+     * entry for consistency. If we are not, we return an up-to-date entry.
+     * Having said that, we need a snapshot since a shared stats entry can be
+     * modified at any time. We share the same snapshot entry for that purpose.
+     */
+    if (!found || !IsTransactionState())
+    {
+        PgStatEnvelope *shenv;
+
+        shenv = get_stat_entry(type, dbid, objid, true, NULL, NULL);
+
+        if (shenv)
+            memcpy(&snap->body, &shenv->body, statentsize);
+
+        snap->negative = !shenv;
+    }
+
+    if (snap->negative)
+        return NULL;
+
+    return &snap->body;
+}
+
+
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    Find database stats entry on backends. The returned entries are cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
  * ----------
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    /* If not done for this transaction, take a snapshot of global stats */
+    pgstat_snapshot_global_stats();
+
+    /* the caller has no business with snapshot-local members */
+    return (PgStat_StatDBEntry *)
+        snapshot_statentry(PGSTAT_TYPE_DB, dbid, InvalidOid);
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one table or NULL. NULL doesn't mean
+ *    the activity statistics for one table or NULL. NULL doesn't mean
  *    that the table doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    activity statistics facilities, so the caller is better off to
+ *    report ZERO instead.
  * ----------
  */
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
-    PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(false, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
-
-    return NULL;
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(true, relid);
+    return tabentry;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_tabentry_snapshot() -
+ *
+ *    Find table stats entry on backends. The returned entry is cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_snapshot(bool shared, Oid reloid)
+{
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    return (PgStat_StatTabEntry *)
+        snapshot_statentry(PGSTAT_TYPE_TABLE, dboid, reloid);
+}
+
+
+/* ----------
+ * pgstat_copy_index_counters() -
+ *
+ *    Support function for index swapping. Copy a portion of the counters of the
+ *    relation to the specified place.
+ * ----------
+ */
+void
+pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst)
+{
+    PgStat_StatTabEntry *tabentry;
+
+    /* No point fetching tabentry when dst is NULL */
+    if (!dst)
+        return;
+
+    tabentry = pgstat_fetch_stat_tabentry(relid);
+
+    if (!tabentry)
+        return;
+
+    dst->t_counts.t_numscans = tabentry->numscans;
+    dst->t_counts.t_tuples_returned = tabentry->tuples_returned;
+    dst->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
+    dst->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
+    dst->t_counts.t_blocks_hit = tabentry->blocks_hit;
 }
 
 
@@ -2561,25 +2935,60 @@ pgstat_fetch_stat_tabentry(Oid relid)
 PgStat_StatFuncEntry *
 pgstat_fetch_stat_funcentry(Oid func_id)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry = NULL;
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
 
-    /* load the stats file if needed */
-    backend_read_statsfile();
+    return (PgStat_StatFuncEntry *)
+        snapshot_statentry(PGSTAT_TYPE_FUNCTION, MyDatabaseId, func_id);
+}
 
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
+/*
+ * pgstat_snapshot_global_stats() -
+ *
+ * Makes a snapshot of global stats if not done yet.  They will be kept until
+ * subsequent call of pgstat_clear_snapshot() or the end of the current
+ * memory context (typically TopTransactionContext).
+ * ----------
+ */
+static void
+pgstat_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext;
+    int    i;
+
+    attach_shared_stats();
+
+    /* Nothing to do if already done */
+    if (global_snapshot_is_valid)
+        return;
+
+    oldcontext = MemoryContextSwitchTo(pgStatSnapshotContext);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+
+    memcpy(&snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+    memcpy(&snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+    memcpy(snapshot_SLRUStats, shared_SLRUStats,
+           sizeof(PgStat_StatSLRUEntry) * SLRU_NUM_ELEMENTS);
+
+    LWLockRelease(StatsLock);
+
+    /* Fill in empty timestamp of SLRU stats  */
+    for (i = 0 ; i < SLRU_NUM_ELEMENTS ; i++)
     {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
+        if (snapshot_SLRUStats[i].stat_reset_timestamp == 0)
+            snapshot_SLRUStats[i].stat_reset_timestamp =
+                snapshot_globalStats.stat_reset_timestamp;
     }
+    global_snapshot_is_valid = true;
 
-    return funcentry;
+    MemoryContextSwitchTo(oldcontext);
+
+    return;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_beentry() -
  *
@@ -2650,9 +3059,10 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &archiverStats;
+    return &snapshot_archiverStats;
 }
 
 
@@ -2667,9 +3077,10 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &globalStats;
+    return &snapshot_globalStats;
 }
 
 
@@ -2681,12 +3092,12 @@ pgstat_fetch_global(void)
  *    a pointer to the slru statistics struct.
  * ---------
  */
-PgStat_SLRUStats *
+PgStat_StatSLRUEntry *
 pgstat_fetch_slru(void)
 {
-    backend_read_statsfile();
+    pgstat_snapshot_global_stats();
 
-    return slruStats;
+    return snapshot_SLRUStats;
 }
 
 
@@ -2900,8 +3311,8 @@ pgstat_initialize(void)
         MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
     }
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* needs to be called before dsm shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -3077,12 +3488,15 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+    /* attach shared database stats area */
+    attach_shared_stats();
 }
 
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
- * Flush any remaining statistics counts out to the collector.
+ * Flush any remaining statistics counts out to shared stats.
  * Without this, operations triggered during backend exit (such as
  * temp table deletions) won't be counted.
  *
@@ -3095,7 +3509,7 @@ pgstat_beshutdown_hook(int code, Datum arg)
 
     /*
      * If we got as far as discovering our own database ID, we can report what
-     * we did to the collector.  Otherwise, we'd be sending an invalid
+     * we did to the shared stats.  Otherwise, we'd be sending an invalid
      * database ID, so forget it.  (This means that accesses to pg_database
      * during failed backend starts might never get counted.)
      */
@@ -3112,6 +3526,8 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    detach_shared_stats(true);
 }
 
 
@@ -3372,7 +3788,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3667,9 +4084,6 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
-            break;
         case WAIT_EVENT_RECOVERY_WAL_STREAM:
             event_name = "RecoveryWalStream";
             break;
@@ -4304,94 +4718,71 @@ pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
 
 
 /* ----------
- * pgstat_setheader() -
+ * pgstat_report_archiver() -
  *
- *        Set common header fields in a statistics message
- * ----------
- */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
-    {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
- *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
+ *        Report archiver statistics
  * ----------
  */
 void
-pgstat_send_archiver(const char *xlog, bool failed)
+pgstat_report_archiver(const char *xlog, bool failed)
 {
-    PgStat_MsgArchiver msg;
+    TimestampTz now = GetCurrentTimestamp();
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+    if (failed)
+    {
+        /* Failed archival attempt */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->failed_count;
+        memcpy(shared_archiverStats->last_failed_wal, xlog,
+               sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    else
+    {
+        /* Successful archival operation */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->archived_count;
+        memcpy(shared_archiverStats->last_archived_wal, xlog,
+               sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
 }
 
 /* ----------
- * pgstat_send_bgwriter() -
+ * pgstat_report_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
-pgstat_send_bgwriter(void)
+pgstat_report_bgwriter(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+
+    PgStat_BgWriter *l = &BgWriterStats;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid taking the lock for completely empty stats.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += l->timed_checkpoints;
+    shared_globalStats->requested_checkpoints += l->requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += l->checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += l->checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += l->buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += l->buf_written_clean;
+    shared_globalStats->maxwritten_clean += l->maxwritten_clean;
+    shared_globalStats->buf_written_backend += l->buf_written_backend;
+    shared_globalStats->buf_fsync_backend += l->buf_fsync_backend;
+    shared_globalStats->buf_alloc += l->buf_alloc;
+    LWLockRelease(StatsLock);
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4399,475 +4790,31 @@ pgstat_send_bgwriter(void)
     MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
 }
 
-/* ----------
- * pgstat_send_slru() -
- *
- *        Send SLRU statistics to the collector
- * ----------
- */
-static void
-pgstat_send_slru(void)
-{
-    int        i;
-
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgSLRU all_zeroes;
-
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /*
-         * This function can be called even if nothing at all has happened. In
-         * this case, avoid sending a completely empty message to the stats
-         * collector.
-         */
-        if (memcmp(&SLRUStats[i], &all_zeroes, sizeof(PgStat_MsgSLRU)) == 0)
-            continue;
-
-        /* set the SLRU type before each send */
-        SLRUStats[i].m_index = i;
-
-        /*
-         * Prepare and send the message
-         */
-        pgstat_setheader(&SLRUStats[i].m_hdr, PGSTAT_MTYPE_SLRU);
-        pgstat_send(&SLRUStats[i], sizeof(PgStat_MsgSLRU));
-
-        /*
-         * Clear out the statistics buffer, so it can be re-used.
-         */
-        MemSet(&SLRUStats[i], 0, sizeof(PgStat_MsgSLRU));
-    }
-}
-
-
-/* ----------
- * PgstatCollectorMain() -
- *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, SignalHandlerForConfigReload);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    MyBackendType = B_STATS_COLLECTOR;
-    init_ps_display(NULL);
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do check ConfigReloadPending
-     * inside the inner loop, which means that such interrupts will get
-     * serviced but the latch won't get cleared until next time there is a
-     * break in the action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (ShutdownRequestPending)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * ShutdownRequestPending becomes set.
-         */
-        while (!ShutdownRequestPending)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (ConfigReloadPending)
-            {
-                ConfigReloadPending = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSLRUCOUNTER:
-                    pgstat_recv_resetslrucounter(&msg.msg_resetslrucounter,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_SLRU:
-                    pgstat_recv_slru(&msg.msg_slru, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr & WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    exit(0);
-}
-
-/*
- * Subroutine to clear stats in a database entry
- *
- * Tables and functions hashes are initialized to empty.
- */
-static void
-reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
-{
-    HASHCTL        hash_ctl;
-
-    dbentry->n_xact_commit = 0;
-    dbentry->n_xact_rollback = 0;
-    dbentry->n_blocks_fetched = 0;
-    dbentry->n_blocks_hit = 0;
-    dbentry->n_tuples_returned = 0;
-    dbentry->n_tuples_fetched = 0;
-    dbentry->n_tuples_inserted = 0;
-    dbentry->n_tuples_updated = 0;
-    dbentry->n_tuples_deleted = 0;
-    dbentry->last_autovac_time = 0;
-    dbentry->n_conflict_tablespace = 0;
-    dbentry->n_conflict_lock = 0;
-    dbentry->n_conflict_snapshot = 0;
-    dbentry->n_conflict_bufferpin = 0;
-    dbentry->n_conflict_startup_deadlock = 0;
-    dbentry->n_temp_files = 0;
-    dbentry->n_temp_bytes = 0;
-    dbentry->n_deadlocks = 0;
-    dbentry->n_checksum_failures = 0;
-    dbentry->last_checksum_failure = 0;
-    dbentry->n_block_read_time = 0;
-    dbentry->n_block_write_time = 0;
-
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
-
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
-}
-
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
-{
-    PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
-     */
-    if (!found)
-        reset_dbentry_counters(result);
-
-    return result;
-}
-
-
-/*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
-{
-    PgStat_StatTabEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /* If not found, initialize the new one. */
-    if (!found)
-    {
-        result->numscans = 0;
-        result->tuples_returned = 0;
-        result->tuples_fetched = 0;
-        result->tuples_inserted = 0;
-        result->tuples_updated = 0;
-        result->tuples_deleted = 0;
-        result->tuples_hot_updated = 0;
-        result->n_live_tuples = 0;
-        result->n_dead_tuples = 0;
-        result->changes_since_analyze = 0;
-        result->inserts_since_vacuum = 0;
-        result->blocks_fetched = 0;
-        result->blocks_hit = 0;
-        result->vacuum_timestamp = 0;
-        result->vacuum_count = 0;
-        result->autovac_vacuum_timestamp = 0;
-        result->autovac_vacuum_count = 0;
-        result->analyze_timestamp = 0;
-        result->analyze_count = 0;
-        result->autovac_analyze_timestamp = 0;
-        result->autovac_analyze_count = 0;
-    }
-
-    return result;
-}
-
 
 /* ----------
  * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ *        Write the global statistics file, as well as DB files.
  * ----------
  */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+void
+pgstat_write_statsfiles(void)
 {
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **penv;
+
+    /* Stats are not initialized yet; just return. */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
+    create_missing_dbentries();
+
     /*
      * Open the statistics temp file to write out the current values.
      */
@@ -4884,7 +4831,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
 
     /*
      * Write the file header --- currently just a format ID.
@@ -4896,38 +4843,31 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Write global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write archiver stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Write SLRU stats struct
-     */
-    rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Walk through the database table.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_DB, InvalidOid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) &(*penv)->body;
+
         /*
          * Write out the table and function stats for this DB into the
          * appropriate per-DB stat file, if required.
          */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
+        pgstat_write_database_stats(dbentry);
 
         /*
          * Write out the DB entry. We don't write the tables or functions
@@ -4938,6 +4878,8 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
         (void) rc;                /* we'll check for error with ferror */
     }
 
+    pfree(envlist);
+
     /*
      * No more output to be done. Close the temp file and replace the old
      * pgstat.stat with it.  The ferror() check replaces testing for error
@@ -4970,55 +4912,19 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
 }
 
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
 
 /* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
+ * pgstat_write_database_stats() -
+ *  Write the stat file for a single database.
  * ----------
  */
 static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
+pgstat_write_database_stats(PgStat_StatDBEntry *dbentry)
 {
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **penv;
     FILE       *fpout;
     int32        format_id;
     Oid            dbid = dbentry->databaseid;
@@ -5026,8 +4932,8 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     char        tmpfile[MAXPGPATH];
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -5054,24 +4960,31 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     /*
      * Walk through the database's access stats per table.
      */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_TABLE, dbentry->databaseid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatTabEntry *tabentry = (PgStat_StatTabEntry *) &(*penv)->body;
+
         fputc('T', fpout);
         rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    pfree(envlist);
 
     /*
      * Walk through the database's function stats table.
      */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_FUNCTION, dbentry->databaseid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatFuncEntry *funcentry =
+        (PgStat_StatFuncEntry *) &(*penv)->body;
+
         fputc('F', fpout);
         rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    pfree(envlist);
 
     /*
      * No more output to be done. Close the temp file and replace the old
@@ -5105,102 +5018,165 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
+}
 
-    if (permanent)
+/* ----------
+ * create_missing_dbentries() -
+ *
+ *  A database entry may be missing for a database in which object stats
+ *  are recorded. This function creates such missing dbentries so that all
+ *  stats entries can be written out to files.
+ * ----------
+ */
+static void
+create_missing_dbentries(void)
+{
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
+    HTAB       *oidhash;
+    HASHCTL        ctl;
+    HASH_SEQ_STATUS scan;
+    Oid           *poid;
+
+    memset(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(Oid);
+    ctl.entrysize = sizeof(Oid);
+    ctl.hcxt = CurrentMemoryContext;
+    oidhash = hash_create("Temporary table of OIDs",
+                          PGSTAT_TABLE_HASH_SIZE,
+                          &ctl,
+                          HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+    /* Collect database OIDs from the shared stats hash */
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+        hash_search(oidhash, &p->key.databaseid, HASH_ENTER, NULL);
+    dshash_seq_term(&hstat);
+
+    /* Create database entries that do not exist yet. */
+    hash_seq_init(&scan, oidhash);
+    while ((poid = (Oid *) hash_seq_search(&scan)) != NULL)
+        (void) get_stat_entry(PGSTAT_TYPE_DB, *poid, InvalidOid,
+                              false, init_dbentry, NULL);
+
+    hash_destroy(oidhash);
+}
+
+
+/* ----------
+ * get_stat_entry() -
+ *
+ *    Get the shared stats entry for the specified type, dbid and objid.
+ *  If nowait is true, returns NULL on lock failure.
+ *
+ *  If initfunc is not NULL, a new entry is created if none exists yet, and
+ *  the function is called with the new envelope. If found is not NULL, it is
+ *  set to true if an existing entry is found, or false if not.
+ * ----------
+ */
+static PgStatEnvelope *
+get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+               bool nowait, entry_initializer initfunc, bool *found)
+{
+    bool        create = (initfunc != NULL);
+    PgStatHashEntry *shent;
+    PgStatEnvelope *shenv = NULL;
+    PgStatHashEntryKey key;
+    bool        myfound;
+
+    Assert(type != PGSTAT_TYPE_ALL);
+
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    shent = dshash_find_extended(pgStatSharedHash, &key,
+                                 create, nowait, create, &myfound);
+    if (shent)
     {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
+        if (create && !myfound)
+        {
+            /* Create new stats envelope. */
+            size_t        envsize = PgStatEnvelopeSize(pgstat_entsize[type]);
+            dsa_pointer chunk = dsa_allocate0(area, envsize);
 
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
+            shenv = dsa_get_address(area, chunk);
+            shenv->type = type;
+            shenv->databaseid = dbid;
+            shenv->objectid = objid;
+            shenv->len = pgstat_entsize[type];
+            LWLockInitialize(&shenv->lock, LWTRANCHE_STATS);
+
+            /*
+             * The dshash lock is released just after this. Call the initializer
+             * callback before the entry is exposed to other processes.
+             */
+            if (initfunc)
+                initfunc(shenv);
+
+            /* Link the new entry from the hash entry. */
+            shent->env = chunk;
+        }
+        else
+            shenv = dsa_get_address(area, shent->env);
+
+        dshash_release_lock(pgStatSharedHash, shent);
     }
+
+    if (found)
+        *found = myfound;
+
+    return shenv;
 }
 
+
 /* ----------
  * pgstat_read_statsfiles() -
  *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ *    Reads in existing activity statistics files into the shared stats hash.
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+void
+pgstat_read_statsfiles(void)
 {
+    PgStatEnvelope *env;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-    int            i;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
 
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
+    /* shouldn't be called from postmaster */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-    memset(&slruStats, 0, sizeof(slruStats));
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Set the same reset timestamp for all SLRU items too.
-     */
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-        slruStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp =
+        shared_globalStats->stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the activity statistics are not running or
+     * have not yet written the stats file the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -5209,7 +5185,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
@@ -5217,49 +5193,30 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     /*
      * Read global stats struct
      */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+        sizeof(*shared_globalStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
         goto done;
     }
 
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
     /*
      * Read archiver stats struct
      */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+        sizeof(*shared_archiverStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
         goto done;
     }
 
     /*
-     * Read SLRU stats struct
-     */
-    if (fread(slruStats, 1, sizeof(slruStats), fpin) != sizeof(slruStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&slruStats, 0, sizeof(slruStats));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
     for (;;)
     {
@@ -5273,7 +5230,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
                           fpin) != offsetof(PgStat_StatDBEntry, tables))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5282,76 +5239,33 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 /*
                  * Add to the DB hash
                  */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
+
+                env = get_stat_entry(PGSTAT_TYPE_DB, dbbuf.databaseid,
+                                     InvalidOid,
+                                     false, init_dbentry, &found);
+                dbentry = (PgStat_StatDBEntry *) &env->body;
+
+                /* don't allow duplicate dbentries */
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
+                memcpy(dbentry, &dbbuf,
+                       offsetof(PgStat_StatDBEntry, tables));
 
+                /* Read the data from the database-specific file. */
+                pgstat_read_db_statsfile(dbentry);
                 break;
 
             case 'E':
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5361,59 +5275,50 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 done:
     FreeFile(fpin);
 
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 
-    return dbhash;
+    return;
 }
 
 
 /* ----------
  * pgstat_read_db_statsfile() -
  *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
+ *    Reads in the at-rest statistics file and creates shared statistics
+ *    tables. The file is removed after reading.
  * ----------
  */
 static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
+pgstat_read_db_statsfile(PgStat_StatDBEntry *dbentry)
 {
+    PgStatEnvelope *env;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatTabEntry tabbuf;
     PgStat_StatFuncEntry funcbuf;
     PgStat_StatFuncEntry *funcentry;
+    dshash_table *tabhash = NULL;
+    dshash_table *funchash = NULL;
     FILE       *fpin;
     int32        format_id;
     bool        found;
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
+    get_dbstat_filename(false, dbentry->databaseid, statfile, MAXPGPATH);
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the activity statistics is not running or
+     * has not yet written the stats file the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
@@ -5426,14 +5331,14 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
 
     /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
     for (;;)
     {
@@ -5446,25 +5351,21 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
                           fpin) != sizeof(PgStat_StatTabEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
+                env = get_stat_entry(PGSTAT_TYPE_TABLE,
+                                     dbentry->databaseid, tabbuf.tableid,
+                                     false, init_tabentry, &found);
+                tabentry = (PgStat_StatTabEntry *) &env->body;
 
+                /* don't allow duplicate entries */
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5480,25 +5381,20 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
                           fpin) != sizeof(PgStat_StatFuncEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
+                env = get_stat_entry(PGSTAT_TYPE_FUNCTION, dbentry->databaseid,
+                                     funcbuf.functionid,
+                                     false, init_funcentry, &found);
+                funcentry = (PgStat_StatFuncEntry *) &env->body;
 
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5514,7 +5410,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5522,304 +5418,15 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     }
 
 done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read SLRU stats struct
-     */
-    if (fread(mySLRUStats, 1, sizeof(mySLRUStats), fpin) != sizeof(mySLRUStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
+    if (tabhash)
+        dshash_detach(tabhash);
+    if (funchash)
+        dshash_detach(funchash);
 
-done:
     FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
-}
-
 
-/* ----------
- * pgstat_setup_memcxt() -
- *
- *    Create pgStatLocalContext, if not already done.
- * ----------
- */
-static void
-pgstat_setup_memcxt(void)
-{
-    if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 }
 
 
@@ -5838,801 +5445,25 @@ pgstat_clear_snapshot(void)
 {
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
+    {
         MemoryContextDelete(pgStatLocalContext);
 
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
     }
 
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
+    if (pgStatSnapshotContext)
     {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-                tabentry->inserts_since_vacuum = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
+        MemoryContextReset(pgStatSnapshotContext);
 
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
+        /* Reset variables that pointed to the context */
+        global_snapshot_is_valid = false;
+        pgStatSnapshotHash = NULL;
     }
 }
 
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_resetslrucounter() -
- *
- *    Reset some SLRU statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len)
-{
-    int            i;
-    TimestampTz    ts = GetCurrentTimestamp();
-
-    memset(&slruStats, 0, sizeof(slruStats));
-
-    elog(LOG, "msg->m_index = %d", msg->m_index);
-
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /* reset entry with the given index, or all entries (index is -1) */
-        if ((msg->m_index == -1) || (msg->m_index == i))
-        {
-            memset(&slruStats[i], 0, sizeof(slruStats[i]));
-            slruStats[i].stat_reset_timestamp = ts;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * It is quite possible that a non-aggressive VACUUM ended up skipping
-     * various pages, however, we'll zero the insert counter here regardless.
-     * It's currently used only to track when we need to perform an
-     * "insert" autovacuum, which are mainly intended to freeze newly inserted
-     * tuples.  Zeroing this may just mean we'll not try to vacuum the table
-     * again until enough tuples have been inserted to trigger another insert
-     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
-     * stragglers.
-     */
-    tabentry->inserts_since_vacuum = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_slru() -
- *
- *    Process a SLRU message.
- * ----------
- */
-static void
-pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
-{
-    slruStats[msg->m_index].blocks_zeroed += msg->m_blocks_zeroed;
-    slruStats[msg->m_index].blocks_hit += msg->m_blocks_hit;
-    slruStats[msg->m_index].blocks_read += msg->m_blocks_read;
-    slruStats[msg->m_index].blocks_written += msg->m_blocks_written;
-    slruStats[msg->m_index].blocks_exists += msg->m_blocks_exists;
-    slruStats[msg->m_index].flush += msg->m_flush;
-    slruStats[msg->m_index].truncate += msg->m_truncate;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
-}
-
 /*
  * Convert a potentially unsafely truncated activity string (see
  * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
@@ -6723,54 +5554,55 @@ pgstat_slru_name(int idx)
  * Returns pointer to entry with counters for given SLRU (based on the name
  * stored in SlruCtl as lwlock tranche name).
  */
-static PgStat_MsgSLRU *
+static PgStat_StatSLRUEntry *
 slru_entry(SlruCtl ctl)
 {
     int        idx = pgstat_slru_index(ctl->shared->lwlock_tranche_name);
 
     Assert((idx >= 0) && (idx < SLRU_NUM_ELEMENTS));
 
-    return &SLRUStats[idx];
+    return &local_SLRUStats[idx];
 }
 
+
 void
 pgstat_count_slru_page_zeroed(SlruCtl ctl)
 {
-    slru_entry(ctl)->m_blocks_zeroed += 1;
+    slru_entry(ctl)->blocks_zeroed += 1;
 }
 
 void
 pgstat_count_slru_page_hit(SlruCtl ctl)
 {
-    slru_entry(ctl)->m_blocks_hit += 1;
+    slru_entry(ctl)->blocks_hit += 1;
 }
 
 void
 pgstat_count_slru_page_exists(SlruCtl ctl)
 {
-    slru_entry(ctl)->m_blocks_exists += 1;
+    slru_entry(ctl)->blocks_exists += 1;
 }
 
 void
 pgstat_count_slru_page_read(SlruCtl ctl)
 {
-    slru_entry(ctl)->m_blocks_read += 1;
+    slru_entry(ctl)->blocks_read += 1;
 }
 
 void
 pgstat_count_slru_page_written(SlruCtl ctl)
 {
-    slru_entry(ctl)->m_blocks_written += 1;
+    slru_entry(ctl)->blocks_written += 1;
 }
 
 void
 pgstat_count_slru_flush(SlruCtl ctl)
 {
-    slru_entry(ctl)->m_flush += 1;
+    slru_entry(ctl)->flush += 1;
 }
 
 void
 pgstat_count_slru_truncate(SlruCtl ctl)
 {
-    slru_entry(ctl)->m_truncate += 1;
+    slru_entry(ctl)->truncate += 1;
 }
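The hunk above makes the SLRU counting functions increment a backend-local array (`local_SLRUStats`) instead of fields of a pending message. The following is a minimal sketch of that accumulate-locally-then-flush pattern, not the patch's implementation. `SlruCounters`, `NUM_SLRU`, `shared_stats` and `flush_slru_stats()` are hypothetical names, and the plain array stands in for shared memory, which in a real server would need locking.

```c
#include <assert.h>
#include <string.h>

#define NUM_SLRU 2				/* hypothetical number of SLRU caches */

/* Hypothetical subset of the per-SLRU counters */
typedef struct SlruCounters
{
	long		blocks_hit;
	long		blocks_read;
} SlruCounters;

static SlruCounters local_stats[NUM_SLRU];	/* private to one backend */
static SlruCounters shared_stats[NUM_SLRU]; /* stand-in for shared memory */

/* Return the local entry for an SLRU index, checking the range first */
static SlruCounters *
local_entry(int idx)
{
	assert(idx >= 0 && idx < NUM_SLRU);
	return &local_stats[idx];
}

/* Cheap hot-path counters: touch local memory only */
static void
count_page_hit(int idx)
{
	local_entry(idx)->blocks_hit += 1;
}

static void
count_page_read(int idx)
{
	local_entry(idx)->blocks_read += 1;
}

/* Add the local counters to the shared ones, then clear the local copy */
static void
flush_slru_stats(void)
{
	int			i;

	for (i = 0; i < NUM_SLRU; i++)
	{
		shared_stats[i].blocks_hit += local_stats[i].blocks_hit;
		shared_stats[i].blocks_read += local_stats[i].blocks_read;
	}
	memset(local_stats, 0, sizeof(local_stats));
}
```

In this sketch the shared totals change only when `flush_slru_stats()` runs, so the counting functions stay cheap and any synchronization cost would be paid once per flush.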
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index e97995f973..0f0f3ece36 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -254,7 +254,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -502,7 +501,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1325,12 +1323,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1779,11 +1771,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2693,8 +2680,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -3057,8 +3042,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3125,13 +3108,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3204,22 +3180,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3680,22 +3640,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3891,8 +3835,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3916,8 +3858,7 @@ PostmasterStateMachine(void)
     {
         /*
          * PM_WAIT_DEAD_END state ends when the BackendList is entirely empty
-         * (ie, no dead_end children remain), and the archiver and stats
-         * collector are gone too.
+         * (ie, no dead_end children remain), and the archiver is gone too.
          *
          * The reason we wait for those two is to protect them against a new
          * postmaster starting conflicting subprocesses; this isn't an
@@ -3927,8 +3868,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4129,8 +4069,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5108,18 +5046,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5238,12 +5164,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6138,7 +6058,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6194,8 +6113,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6430,7 +6347,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index a2e28b064c..261920b961 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1445,8 +1445,8 @@ is_checksummed_file(const char *fullpath, const char *filename)
  *
  * If 'missing_ok' is true, will not throw an error if the file is not found.
  *
- * If dboid is anything other than InvalidOid then any checksum failures detected
- * will get reported to the stats collector.
+ * If dboid is anything other than InvalidOid then any checksum failures
+ * detected will get reported to the activity stats facility.
  *
  * Returns true if the file was successfully sent, false if 'missing_ok',
  * and the file did not exist.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e72d607a23..ea5030452d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1948,7 +1948,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                BgWriterStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2058,7 +2058,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2248,7 +2248,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2256,7 +2256,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
 
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..58a442f482 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -147,6 +147,7 @@ CreateSharedMemoryAndSemaphores(void)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -263,6 +264,7 @@ CreateSharedMemoryAndSemaphores(void)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 4c14e51c67..f61fd3e8ad 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -523,6 +523,7 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
     LWLockRegisterTranche(LWTRANCHE_SXACT, "serializable_xact");
+    LWLockRegisterTranche(LWTRANCHE_STATS, "activity_statistics");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index db47843229..97eccb35d3 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -49,3 +49,4 @@ MultiXactTruncationLock                41
 OldSnapshotTimeMapLock                42
 LogicalRepWorkerLock                43
 CLogTruncationLock                    44
+StatsLock                            45
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 360b5bf5bf..c10d80f035 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -362,10 +362,10 @@ smgrdounlink(SMgrRelation reln, bool isRedo)
     DropRelFileNodesAllBuffers(&rnode, 1);
 
     /*
-     * It'd be nice to tell the stats collector to forget it immediately, too.
-     * But we can't because we don't know the OID (and in cases involving
-     * relfilenode swaps, it's not always clear which table OID to forget,
-     * anyway).
+     * It'd be nice to tell the activity stats facility to forget it
+     * immediately, too.  But we can't because we don't know the OID (and in
+     * cases involving relfilenode swaps, it's not always clear which table OID
+     * to forget, anyway).
      */
 
     /*
@@ -435,8 +435,8 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
     DropRelFileNodesAllBuffers(rnodes, nrels);
 
     /*
-     * It'd be nice to tell the stats collector to forget them immediately,
-     * too. But we can't because we don't know the OIDs.
+     * It'd be nice to tell the activity stats facility to forget them
+     * immediately, too. But we can't because we don't know the OIDs.
      */
 
     /*
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8958ec8103..9932ce4dac 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3209,6 +3209,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3783,6 +3789,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4185,11 +4192,12 @@ PostgresMain(int argc, char *argv[],
          * Note: this includes fflush()'ing the last of the prior output.
          *
          * This is also a good time to send collected statistics to the
-         * collector, and to update the PS stats display.  We avoid doing
-         * those every time through the message loop because it'd slow down
-         * processing of batched messages, and because we don't want to report
-         * uncommitted updates (that confuses autovacuum).  The notification
-         * processor wants a call too, if we are not in a transaction block.
+         * activity statistics facility, and to update the PS stats display.
+         * We avoid doing those every time through the message loop because
+         * it'd slow down processing of batched messages, and because we
+         * don't want to report uncommitted updates (that confuses
+         * autovacuum).  The notification processor wants a call too, if we
+         * are not in a transaction block.
          */
         if (send_ready_for_query)
         {
@@ -4221,6 +4229,8 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long stats_timeout;
+
                 /* Send out notify signals and transmit self-notifies */
                 ProcessCompletedNotifies();
 
@@ -4233,8 +4243,13 @@ PostgresMain(int argc, char *argv[],
                 if (notifyInterruptPending)
                     ProcessNotifyInterrupt();
 
-                pgstat_report_stat(false);
-
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle");
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4269,7 +4284,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4277,6 +4292,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
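Judging from the hunks above, `pgstat_report_stat()` appears to return a positive timeout when statistics are still pending, and the caller then arms `IDLE_STATS_UPDATE_TIMEOUT` so that a forced report happens later from `ProcessInterrupts()`. The sketch below models only that contract under stated assumptions. `report_stat()`, `MIN_FLUSH_INTERVAL_MS`, the counters and the explicit `now_ms` clock argument are all hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define MIN_FLUSH_INTERVAL_MS 500	/* hypothetical minimum flush spacing */

static long last_flush_ms = -MIN_FLUSH_INTERVAL_MS; /* lets the first flush pass */
static int	pending_updates = 0;	/* counters not yet published */
static int	flushed_updates = 0;	/* counters already published */

/*
 * Try to publish pending counters.
 *
 * Returns 0 when nothing is left pending.  Otherwise returns the number of
 * milliseconds after which the caller should retry, typically by arming a
 * timer whose handler calls this function again with force = true.
 */
static long
report_stat(bool force, long now_ms)
{
	long		elapsed = now_ms - last_flush_ms;

	assert(pending_updates >= 0);

	if (pending_updates == 0)
		return 0;

	/* Too soon after the last flush: tell the caller how long to wait */
	if (!force && elapsed < MIN_FLUSH_INTERVAL_MS)
		return MIN_FLUSH_INTERVAL_MS - elapsed;

	flushed_updates += pending_updates;
	pending_updates = 0;
	last_flush_ms = now_ms;
	return 0;
}
```

A caller that receives a positive value would schedule a timer for that many milliseconds and cancel it when the next command arrives, which mirrors the `enable_timeout_after()` and `disable_timeout()` pair added in the diff.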
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 175f4fd26b..91849e038a 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * pgstatfuncs.c
- *      Functions for accessing the statistics collector data
+ *      Functions for accessing the activity statistics data
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -33,9 +33,6 @@
 
 #define UINT32_ACCESS_ONCE(var)         ((uint32)(*((volatile uint32 *)&(var))))
 
-/* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
-
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
 {
@@ -1260,7 +1257,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_commit);
+        result = (int64) (dbentry->counts.n_xact_commit);
 
     PG_RETURN_INT64(result);
 }
@@ -1276,7 +1273,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_rollback);
+        result = (int64) (dbentry->counts.n_xact_rollback);
 
     PG_RETURN_INT64(result);
 }
@@ -1292,7 +1289,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_fetched);
+        result = (int64) (dbentry->counts.n_blocks_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1308,7 +1305,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_hit);
+        result = (int64) (dbentry->counts.n_blocks_hit);
 
     PG_RETURN_INT64(result);
 }
@@ -1324,7 +1321,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_returned);
+        result = (int64) (dbentry->counts.n_tuples_returned);
 
     PG_RETURN_INT64(result);
 }
@@ -1340,7 +1337,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_fetched);
+        result = (int64) (dbentry->counts.n_tuples_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1356,7 +1353,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_inserted);
+        result = (int64) (dbentry->counts.n_tuples_inserted);
 
     PG_RETURN_INT64(result);
 }
@@ -1372,7 +1369,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_updated);
+        result = (int64) (dbentry->counts.n_tuples_updated);
 
     PG_RETURN_INT64(result);
 }
@@ -1388,7 +1385,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_deleted);
+        result = (int64) (dbentry->counts.n_tuples_deleted);
 
     PG_RETURN_INT64(result);
 }
@@ -1421,7 +1418,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_files;
+        result = dbentry->counts.n_temp_files;
 
     PG_RETURN_INT64(result);
 }
@@ -1437,7 +1434,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_bytes;
+        result = dbentry->counts.n_temp_bytes;
 
     PG_RETURN_INT64(result);
 }
@@ -1452,7 +1449,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace);
+        result = (int64) (dbentry->counts.n_conflict_tablespace);
 
     PG_RETURN_INT64(result);
 }
@@ -1467,7 +1464,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_lock);
+        result = (int64) (dbentry->counts.n_conflict_lock);
 
     PG_RETURN_INT64(result);
 }
@@ -1482,7 +1479,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_snapshot);
+        result = (int64) (dbentry->counts.n_conflict_snapshot);
 
     PG_RETURN_INT64(result);
 }
@@ -1497,7 +1494,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_bufferpin);
+        result = (int64) (dbentry->counts.n_conflict_bufferpin);
 
     PG_RETURN_INT64(result);
 }
@@ -1512,7 +1509,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1527,11 +1524,11 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace +
-                          dbentry->n_conflict_lock +
-                          dbentry->n_conflict_snapshot +
-                          dbentry->n_conflict_bufferpin +
-                          dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_tablespace +
+                          dbentry->counts.n_conflict_lock +
+                          dbentry->counts.n_conflict_snapshot +
+                          dbentry->counts.n_conflict_bufferpin +
+                          dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1546,7 +1543,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_deadlocks);
+        result = (int64) (dbentry->counts.n_deadlocks);
 
     PG_RETURN_INT64(result);
 }
@@ -1564,7 +1561,7 @@ pg_stat_get_db_checksum_failures(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_checksum_failures);
+        result = (int64) (dbentry->counts.n_checksum_failures);
 
     PG_RETURN_INT64(result);
 }
@@ -1601,7 +1598,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_read_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_read_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1617,7 +1614,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_write_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_write_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1703,7 +1700,7 @@ pg_stat_get_slru(PG_FUNCTION_ARGS)
     MemoryContext     per_query_ctx;
     MemoryContext     oldcontext;
     int                i;
-    PgStat_SLRUStats *stats;
+    PgStat_StatSLRUEntry *stats;
 
     /* check to see if caller supports us returning a tuplestore */
     if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
@@ -1737,7 +1734,7 @@ pg_stat_get_slru(PG_FUNCTION_ARGS)
         /* for each row */
         Datum        values[PG_STAT_GET_SLRU_COLS];
         bool        nulls[PG_STAT_GET_SLRU_COLS];
-        PgStat_SLRUStats    stat = stats[i];
+        PgStat_StatSLRUEntry    *stat = &stats[i];
         char       *name;
 
         name = pgstat_slru_name(i);
@@ -1749,14 +1746,14 @@ pg_stat_get_slru(PG_FUNCTION_ARGS)
         MemSet(nulls, 0, sizeof(nulls));
 
         values[0] = PointerGetDatum(cstring_to_text(name));
-        values[1] = Int64GetDatum(stat.blocks_zeroed);
-        values[2] = Int64GetDatum(stat.blocks_hit);
-        values[3] = Int64GetDatum(stat.blocks_read);
-        values[4] = Int64GetDatum(stat.blocks_written);
-        values[5] = Int64GetDatum(stat.blocks_exists);
-        values[6] = Int64GetDatum(stat.flush);
-        values[7] = Int64GetDatum(stat.truncate);
-        values[8] = Int64GetDatum(stat.stat_reset_timestamp);
+        values[1] = Int64GetDatum(stat->blocks_zeroed);
+        values[2] = Int64GetDatum(stat->blocks_hit);
+        values[3] = Int64GetDatum(stat->blocks_read);
+        values[4] = Int64GetDatum(stat->blocks_written);
+        values[5] = Int64GetDatum(stat->blocks_exists);
+        values[6] = Int64GetDatum(stat->flush);
+        values[7] = Int64GetDatum(stat->truncate);
+        values[8] = Int64GetDatum(stat->stat_reset_timestamp);
 
         tuplestore_putvalues(tupstore, tupdesc, values, nulls);
     }
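Note on the pgstatfuncs.c changes above: every per-database accessor now reads its counter through the nested "counts" member instead of directly from the entry. A minimal sketch of that layout follows. The Demo* names are hypothetical stand-ins, not the patch's types, and only the five conflict counters are modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for PgStat_Counter. */
typedef int64_t DemoCounter;

/* Pure event counters grouped in one struct (cf. PgStat_StatDBCounts). */
typedef struct DemoDBCounts
{
    DemoCounter n_conflict_tablespace;
    DemoCounter n_conflict_lock;
    DemoCounter n_conflict_snapshot;
    DemoCounter n_conflict_bufferpin;
    DemoCounter n_conflict_startup_deadlock;
} DemoDBCounts;

/* Identity stays in the entry; the counters live in "counts". */
typedef struct DemoDBEntry
{
    uint32_t     databaseid;
    DemoDBCounts counts;
} DemoDBEntry;

/* Same shape as pg_stat_get_db_conflict_all: a missing entry reads as 0. */
int64_t
demo_conflict_all(const DemoDBEntry *dbentry)
{
    if (dbentry == NULL)
        return 0;

    return dbentry->counts.n_conflict_tablespace +
           dbentry->counts.n_conflict_lock +
           dbentry->counts.n_conflict_snapshot +
           dbentry->counts.n_conflict_bufferpin +
           dbentry->counts.n_conflict_startup_deadlock;
}

/* Grouping the counters lets them be cleared in one call, leaving the id alone. */
void
demo_reset_counts(DemoDBEntry *dbentry)
{
    assert(dbentry != NULL);
    memset(&dbentry->counts, 0, sizeof(dbentry->counts));
}
```

With counts {1, 2, 3, 4, 5}, demo_conflict_all() returns 15; after demo_reset_counts() it returns 0 and databaseid is unchanged. Whether the patch itself resets counters this way is not shown in this part of the diff.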
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb19644419..51748c99ad 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index a7b7b12249..e6b6126141 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -232,9 +232,6 @@ GetBackendTypeDesc(BackendType backendType)
         case B_ARCHIVER:
             backendDesc = "archiver";
             break;
-        case B_STATS_COLLECTOR:
-            backendDesc = "stats collector";
-            break;
         case B_LOGGER:
             backendDesc = "logger";
             break;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index f4247ea70d..f65d05c24c 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -631,6 +632,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1241,6 +1244,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
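Note on IdleStatsUpdateTimeoutHandler above: like the neighbouring timeout handlers, it only sets flags and the latch, leaving the real work to code that runs later at a safe point. The sketch below illustrates that deferred-work pattern in isolation. All demo_* names are hypothetical, the latch is omitted, and the place where the patch actually consumes IdleStatsUpdateTimeoutPending is not part of this hunk, so the consumer side here is an assumption.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

/* Flags written by the handler and read by the main loop. */
static volatile sig_atomic_t demo_stats_update_pending = false;
static volatile sig_atomic_t demo_interrupt_pending = false;

/* Handler side: record that work is due, and do nothing else. */
void
demo_stats_timeout_handler(void)
{
    demo_stats_update_pending = true;
    demo_interrupt_pending = true;
}

/*
 * Main-loop side: called at a point where doing real work is safe.
 * Increments *flush_count and returns true if a stats flush was due.
 */
bool
demo_process_interrupts(int *flush_count)
{
    assert(flush_count != NULL);

    if (!demo_interrupt_pending)
        return false;
    demo_interrupt_pending = false;

    if (demo_stats_update_pending)
    {
        demo_stats_update_pending = false;
        (*flush_count)++;       /* stands in for flushing pending stats */
        return true;
    }
    return false;
}
```

Calling demo_process_interrupts() before the handler has fired does nothing. After one call to demo_stats_timeout_handler() it performs exactly one flush, and later calls do nothing until the handler fires again.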
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 64dc9fbd13..fd45e72b44 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -746,8 +746,8 @@ const char *const config_group_names[] =
     gettext_noop("Statistics"),
     /* STATS_MONITORING */
     gettext_noop("Statistics / Monitoring"),
-    /* STATS_COLLECTOR */
-    gettext_noop("Statistics / Query and Index Statistics Collector"),
+    /* STATS_ACTIVITY */
+    gettext_noop("Statistics / Query and Index Activity Statistics"),
     /* AUTOVACUUM */
     gettext_noop("Autovacuum"),
     /* CLIENT_CONN */
@@ -1468,7 +1468,7 @@ static struct config_bool ConfigureNamesBool[] =
 #endif
 
     {
-        {"track_activities", PGC_SUSET, STATS_COLLECTOR,
+        {"track_activities", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects information about executing commands."),
             gettext_noop("Enables the collection of information on the currently "
                          "executing command of each session, along with "
@@ -1479,7 +1479,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_counts", PGC_SUSET, STATS_COLLECTOR,
+        {"track_counts", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects statistics on database activity."),
             NULL
         },
@@ -1488,7 +1488,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_io_timing", PGC_SUSET, STATS_COLLECTOR,
+        {"track_io_timing", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects timing statistics for database I/O activity."),
             NULL
         },
@@ -4266,7 +4266,7 @@ static struct config_string ConfigureNamesString[] =
     },
 
     {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
+        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
             gettext_noop("Writes temporary statistics files to the specified directory."),
             NULL,
             GUC_SUPERUSER_ONLY
@@ -4602,7 +4602,7 @@ static struct config_enum ConfigureNamesEnum[] =
     },
 
     {
-        {"track_functions", PGC_SUSET, STATS_COLLECTOR,
+        {"track_functions", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects function-level statistics on database activity."),
             NULL
         },
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e904fa7300..03760ca6a4 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -571,7 +571,7 @@
 # STATISTICS
 #------------------------------------------------------------------------------
 
-# - Query and Index Statistics Collector -
+# - Query and Index Activity Statistics -
 
 #track_activities = on
 #track_counts = on
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 3c70499feb..927ae319b1 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -6,7 +6,7 @@ use File::Basename qw(basename dirname);
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 107;
+use Test::More tests => 106;
 
 program_help_ok('pg_basebackup');
 program_version_ok('pg_basebackup');
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 619b2f9c71..23984d6e24 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -83,6 +83,8 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
 
@@ -320,7 +322,6 @@ typedef enum BackendType
     B_WAL_SENDER,
     B_WAL_WRITER,
     B_ARCHIVER,
-    B_STATS_COLLECTOR,
     B_LOGGER,
 } BackendType;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index b8041d9988..5a63840dfe 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL activity statistics facility.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -16,9 +16,11 @@
 #include "libpq/pqcomm.h"
 #include "miscadmin.h"
 #include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -42,35 +44,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_RESETSLRUCOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_SLRU,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -81,9 +54,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -114,18 +86,17 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_delta_live_tuples;
     PgStat_Counter t_delta_dead_tuples;
     PgStat_Counter t_changed_tuples;
+    bool        reset_changed_tuples;
 
     PgStat_Counter t_blocks_fetched;
     PgStat_Counter t_blocks_hit;
+
+    PgStat_Counter vacuum_count;
+    PgStat_Counter autovac_vacuum_count;
+    PgStat_Counter analyze_count;
+    PgStat_Counter autovac_analyze_count;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -159,6 +130,10 @@ typedef struct PgStat_TableStatus
     Oid            t_id;            /* table's OID */
     bool        t_shared;        /* is it a shared catalog? */
     struct PgStat_TableXactStatus *trans;    /* lowest subxact's counts */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_TableCounts t_counts;    /* event counts to be sent */
 } PgStat_TableStatus;
 
@@ -184,308 +159,32 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
-/* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
- */
-typedef struct PgStat_MsgHdr
-{
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgResetslrucounter Sent by the backend to tell the collector
- *                                to reset a SLRU counter
- * ----------
- */
-typedef struct PgStat_MsgResetslrucounter
-{
-    PgStat_MsgHdr m_hdr;
-    int            m_index;
-} PgStat_MsgResetslrucounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgSLRU            Sent by the SLRU to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgSLRU
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Counter m_index;
-    PgStat_Counter m_blocks_zeroed;
-    PgStat_Counter m_blocks_hit;
-    PgStat_Counter m_blocks_read;
-    PgStat_Counter m_blocks_written;
-    PgStat_Counter m_blocks_exists;
-    PgStat_Counter m_flush;
-    PgStat_Counter m_truncate;
-} PgStat_MsgSLRU;
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
 /* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgTempFile
+typedef struct PgStat_BgWriter
 {
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter buf_alloc;
+    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+} PgStat_BgWriter;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -517,98 +216,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgResetslrucounter msg_resetslrucounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgSLRU msg_slru;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Activity statistics data structures in files and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -617,13 +226,9 @@ typedef union PgStat_Msg
 
 #define PGSTAT_FILE_FORMAT_ID    0x01A5BC9D
 
-/* ----------
- * PgStat_StatDBEntry            The collector's data per database
- * ----------
- */
-typedef struct PgStat_StatDBEntry
+
+typedef struct PgStat_StatDBCounts
 {
-    Oid            databaseid;
     PgStat_Counter n_xact_commit;
     PgStat_Counter n_xact_rollback;
     PgStat_Counter n_blocks_fetched;
@@ -633,7 +238,6 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_tuples_inserted;
     PgStat_Counter n_tuples_updated;
     PgStat_Counter n_tuples_deleted;
-    TimestampTz last_autovac_time;
     PgStat_Counter n_conflict_tablespace;
     PgStat_Counter n_conflict_lock;
     PgStat_Counter n_conflict_snapshot;
@@ -643,29 +247,72 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_temp_bytes;
     PgStat_Counter n_deadlocks;
     PgStat_Counter n_checksum_failures;
-    TimestampTz last_checksum_failure;
     PgStat_Counter n_block_read_time;    /* times in microseconds */
     PgStat_Counter n_block_write_time;
+} PgStat_StatDBCounts;
 
+/* ----------
+ * PgStat_StatDBEntry            The statistics per database
+ * ----------
+ */
+typedef struct PgStat_StatDBEntry
+{
+    Oid            databaseid;
+    TimestampTz last_autovac_time;
+    TimestampTz last_checksum_failure;
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;    /* time of db stats update */
+
+    PgStat_StatDBCounts counts;
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following fields must be last in the struct, because we don't write
+     * them out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables; /* current gen tables hash */
+    dshash_table_handle functions;    /* current gen functions hash */
 } PgStat_StatDBEntry;
 
 
+/*
+ * SLRU statistics kept in the shared stats
+ */
+typedef struct PgStat_StatSLRUEntry
+{
+    PgStat_Counter blocks_zeroed;
+    PgStat_Counter blocks_hit;
+    PgStat_Counter blocks_read;
+    PgStat_Counter blocks_written;
+    PgStat_Counter blocks_exists;
+    PgStat_Counter flush;
+    PgStat_Counter truncate;
+    TimestampTz stat_reset_timestamp;
+} PgStat_StatSLRUEntry;
+
+
 /* ----------
- * PgStat_StatTabEntry            The collector's data per table (or index)
+ * PgStat_HashMountInfo  dshash mount information
+ * ----------
+ */
+typedef struct PgStat_HashMountInfo
+{
+    HTAB       *snapshot_tables;    /* table entry snapshot */
+    HTAB       *snapshot_functions; /* function entry snapshot */
+    dshash_table *dshash_tables;    /* attached tables dshash */
+    dshash_table *dshash_functions; /* attached functions dshash */
+} PgStat_HashMountInfo;
+
+/* ----------
+ * PgStat_StatTabEntry            The statistics per table (or index)
  * ----------
  */
 typedef struct PgStat_StatTabEntry
 {
     Oid            tableid;
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
 
     PgStat_Counter numscans;
 
@@ -685,19 +332,15 @@ typedef struct PgStat_StatTabEntry
     PgStat_Counter blocks_fetched;
     PgStat_Counter blocks_hit;
 
-    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
     PgStat_Counter vacuum_count;
-    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_vacuum_count;
-    TimestampTz analyze_timestamp;    /* user initiated */
     PgStat_Counter analyze_count;
-    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_analyze_count;
 } PgStat_StatTabEntry;
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            The statistics per function
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
@@ -710,9 +353,8 @@ typedef struct PgStat_StatFuncEntry
     PgStat_Counter f_self_time;
 } PgStat_StatFuncEntry;
 
-
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -728,7 +370,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -746,21 +388,6 @@ typedef struct PgStat_GlobalStats
     TimestampTz stat_reset_timestamp;
 } PgStat_GlobalStats;
 
-/*
- * SLRU statistics kept in the stats collector
- */
-typedef struct PgStat_SLRUStats
-{
-    PgStat_Counter blocks_zeroed;
-    PgStat_Counter blocks_hit;
-    PgStat_Counter blocks_read;
-    PgStat_Counter blocks_written;
-    PgStat_Counter blocks_exists;
-    PgStat_Counter flush;
-    PgStat_Counter truncate;
-    TimestampTz stat_reset_timestamp;
-} PgStat_SLRUStats;
-
 
 /* ----------
  * Backend states
@@ -809,7 +436,6 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
     WAIT_EVENT_WAL_RECEIVER_MAIN,
@@ -1055,7 +681,7 @@ typedef struct PgBackendGSSStatus
  *
  * Each live backend maintains a PgBackendStatus struct in shared memory
  * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
- * has no involvement in, or even access to, these structs.
+ * BackendId, but that is not critical.)  Note that the stats-writing functions
+ * have no involvement in, or even access to, these structs.
  *
  * Each auxiliary process also maintains a PgBackendStatus struct in shared
@@ -1252,18 +878,20 @@ extern PGDLLIMPORT bool pgstat_track_counts;
 extern PGDLLIMPORT int pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used; kept only until the GUC itself is removed */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
 
 /*
  * SLRU statistics counters are updated directly by slru.
  */
-extern PgStat_MsgSLRU SlruStats[];
+//extern PgStat_MsgSLRU SlruStats[];
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1278,29 +906,26 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
 
+/* File input/output functions  */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 extern void pgstat_reset_slru_counter(const char *);
 
@@ -1462,8 +1087,8 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                                       void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
+extern void pgstat_report_archiver(const char *xlog, bool failed);
+extern void pgstat_report_bgwriter(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1472,13 +1097,16 @@ extern void pgstat_send_bgwriter(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_snapshot(bool shared,
+                                                                Oid relid);
+extern void pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
-extern PgStat_SLRUStats *pgstat_fetch_slru(void);
+extern PgStat_StatSLRUEntry *pgstat_fetch_slru(void);
 
 extern void pgstat_count_slru_page_zeroed(SlruCtl ctl);
 extern void pgstat_count_slru_page_hit(SlruCtl ctl);
@@ -1489,5 +1117,6 @@ extern void pgstat_count_slru_flush(SlruCtl ctl);
 extern void pgstat_count_slru_truncate(SlruCtl ctl);
 extern char *pgstat_slru_name(int idx);
 extern int pgstat_slru_index(const char *name);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 8fda8e4f78..13371e8cb7 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -220,6 +220,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_SXACT,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 454c2df487..2c65231b04 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -88,7 +88,7 @@ enum config_group
     PROCESS_TITLE,
     STATS,
     STATS_MONITORING,
-    STATS_COLLECTOR,
+    STATS_ACTIVITY,
     AUTOVACUUM,
     CLIENT_CONN,
     CLIENT_CONN_STATEMENT,
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 83a15f6795..77d1572a99 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.18.2
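
A note on the header changes above: the patch moves the per-database counters into PgStat_StatDBCounts and keeps the dshash handles at the end of PgStat_StatDBEntry because they are not written to the stats file. The following is only a minimal sketch of why that layout is convenient, not code from the patch. The typedefs are stand-ins for the PostgreSQL ones, the structs are reduced to a few fields, and db_entry_persisted_size() and db_entry_reset_counts() are hypothetical helper names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins for PostgreSQL typedefs (assumptions for this sketch only). */
typedef uint32_t Oid;
typedef int64_t TimestampTz;
typedef int64_t PgStat_Counter;
typedef uint64_t dshash_table_handle;

/* Reduced version of the counter block introduced by the patch. */
typedef struct PgStat_StatDBCounts
{
    PgStat_Counter n_xact_commit;
    PgStat_Counter n_xact_rollback;
    PgStat_Counter n_deadlocks;
} PgStat_StatDBCounts;

/* Reduced version of the per-database entry. */
typedef struct PgStat_StatDBEntry
{
    Oid         databaseid;
    TimestampTz stat_reset_timestamp;
    PgStat_StatDBCounts counts;

    /* Not written to the stats file, so these must stay last. */
    dshash_table_handle tables;
    dshash_table_handle functions;
} PgStat_StatDBEntry;

/* Hypothetical helper: number of leading bytes that would be persisted. */
static size_t
db_entry_persisted_size(void)
{
    return offsetof(PgStat_StatDBEntry, tables);
}

/* Hypothetical helper: zero every counter, keep identity and handles. */
static void
db_entry_reset_counts(PgStat_StatDBEntry *entry, TimestampTz now)
{
    assert(entry != NULL);
    memset(&entry->counts, 0, sizeof(entry->counts));
    entry->stat_reset_timestamp = now;
}
```

With the counters grouped in one member, a reset is a single memset over that member, and the size of the part that goes to the file is simply the offset of the first handle field.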

From 244041acaaa55a3665fc4fc46ef88375cf0322a9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:09 +0900
Subject: [PATCH v31 6/7] Doc part of shared-memory based stats collector.

---
 doc/src/sgml/catalogs.sgml          |   6 +-
 doc/src/sgml/config.sgml            |  29 +++---
 doc/src/sgml/high-availability.sgml |  13 +--
 doc/src/sgml/monitoring.sgml        | 133 +++++++++++++---------------
 doc/src/sgml/ref/pg_dump.sgml       |   9 +-
 5 files changed, 91 insertions(+), 99 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 64614b569c..8bd8fc4d5f 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -8151,9 +8151,9 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l
   <para>
    <xref linkend="view-table"/> lists the system views described here.
    More detailed documentation of each view follows below.
-   There are some additional views that provide access to the results of
-   the statistics collector; they are described in <xref
-   linkend="monitoring-stats-views-table"/>.
+   There are some additional views that provide access to the activity
+   statistics; they are described in
+   <xref linkend="monitoring-stats-views-table"/>.
   </para>
 
   <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c4d6ed4bbc..ca737ee1fc 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7035,11 +7035,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
     <title>Run-time Statistics</title>
 
     <sect2 id="runtime-config-statistics-collector">
-     <title>Query and Index Statistics Collector</title>
+     <title>Query and Index Activity Statistics</title>
 
      <para>
-      These parameters control server-wide statistics collection features.
-      When statistics collection is enabled, the data that is produced can be
+      These parameters control server-wide activity statistics features.
+      When activity statistics are enabled, the data that is produced can be
       accessed via the <structname>pg_stat</structname> and
       <structname>pg_statio</structname> family of system views.
       Refer to <xref linkend="monitoring"/> for more information.
@@ -7055,14 +7055,13 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables the collection of information on the currently
-        executing command of each session, along with the time when
-        that command began execution. This parameter is on by
-        default. Note that even when enabled, this information is not
-        visible to all users, only to superusers and the user owning
-        the session being reported on, so it should not represent a
-        security risk.
-        Only superusers can change this setting.
+        Enables activity tracking on the currently executing command of
+        each session, along with the time when that command began
+        execution. This parameter is on by default. Note that even when
+        enabled, this information is not visible to all users, only to
+        superusers and the user owning the session being reported on, so it
+        should not represent a security risk.  Only superusers can change this
+        setting.
        </para>
       </listitem>
      </varlistentry>
@@ -7093,9 +7092,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables collection of statistics on database activity.
+        Enables tracking of database activity.
         This parameter is on by default, because the autovacuum
-        daemon needs the collected information.
+        daemon needs the activity information.
         Only superusers can change this setting.
        </para>
       </listitem>
@@ -8156,7 +8155,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Specifies the fraction of the total number of heap tuples counted in
-        the previous statistics collection that can be inserted without
+        the previous activity statistics that can be inserted without
         incurring an index scan at the <command>VACUUM</command> cleanup stage.
         This setting currently applies to B-tree indexes only.
        </para>
@@ -8169,7 +8168,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         Index statistics are considered to be stale if the number of newly
         inserted tuples exceeds the <varname>vacuum_cleanup_index_scale_factor</varname>
         fraction of the total number of heap tuples detected by the previous
-        statistics collection. The total number of heap tuples is stored in
+        activity statistics. The total number of heap tuples is stored in
         the index meta-page. Note that the meta-page does not include this data
         until <command>VACUUM</command> finds no dead tuples, so B-tree index
         scan at the cleanup stage can only be skipped if the second and
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b5d32bb720..b9b73e59f6 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2381,12 +2381,13 @@ HINT:  Recovery cannot continue unless the configuration is changed and the serv
    </para>
 
    <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
-    index usage, etc., will be recorded normally on the standby. Replayed
-    actions will not duplicate their effects on primary, so replaying an
-    insert will not increment the Inserts column of pg_stat_user_tables.
-    The stats file is deleted at the start of recovery, so stats from primary
-    and standby will differ; this is considered a feature, not a bug.
+    Activity statistics are collected during recovery. All scans, reads,
+    blocks, index usage, etc., will be recorded normally on the
+    standby. Replayed actions will not duplicate their effects on primary, so
+    replaying an insert will not increment the Inserts column of
+    pg_stat_user_tables.  Activity statistics are reset at the start of
+    recovery, so stats from primary and standby will differ; this is
+    considered a feature, not a bug.
    </para>
 
    <para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index fd8b17ef8f..95b8c3a884 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -22,7 +22,7 @@
   <para>
    Several tools are available for monitoring database activity and
    analyzing performance.  Most of this chapter is devoted to describing
-   <productname>PostgreSQL</productname>'s statistics collector,
+   <productname>PostgreSQL</productname>'s activity statistics,
    but one should not neglect regular Unix monitoring programs such as
    <command>ps</command>, <command>top</command>, <command>iostat</command>, and <command>vmstat</command>.
    Also, once one has identified a
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    master server process.  The command arguments
-   shown for it are the same ones used when it was launched.  The next five
+   shown for it are the same ones used when it was launched.  The next four
    processes are background worker processes automatically launched by the
-   master process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   master process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
@@ -130,20 +128,21 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
  </sect1>
 
  <sect1 id="monitoring-stats">
-  <title>The Statistics Collector</title>
+  <title>The Activity Statistics</title>
 
   <indexterm zone="monitoring-stats">
    <primary>statistics</primary>
   </indexterm>
 
   <para>
-   <productname>PostgreSQL</productname>'s <firstterm>statistics collector</firstterm>
-   is a subsystem that supports collection and reporting of information about
-   server activity.  Presently, the collector can count accesses to tables
-   and indexes in both disk-block and individual-row terms.  It also tracks
-   the total number of rows in each table, and information about vacuum and
-   analyze actions for each table.  It can also count calls to user-defined
-   functions and the total time spent in each one.
+   <productname>PostgreSQL</productname>'s <firstterm>activity
+   statistics</firstterm> is a subsystem that supports tracking and reporting
+   of information about server activity.  Presently, the activity statistics
+   tracks the count of accesses to tables and indexes in both disk-block and
+   individual-row terms.  It also tracks the total number of rows in each
+   table, and information about vacuum and analyze actions for each table.  It
+   can also track calls to user-defined functions and the total time spent in
+   each one.
   </para>
 
   <para>
@@ -151,15 +150,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    information about exactly what is going on in the system right now, such as
    the exact command currently being executed by other server processes, and
    which other connections exist in the system.  This facility is independent
-   of the collector process.
+   of the activity statistics.
   </para>
 
  <sect2 id="monitoring-stats-setup">
-  <title>Statistics Collection Configuration</title>
+  <title>Activity Statistics Configuration</title>
 
   <para>
-   Since collection of statistics adds some overhead to query execution,
-   the system can be configured to collect or not collect information.
+   Since tracking for the activity statistics adds some overhead to query
+   execution, the system can be configured to track or not track activity.
    This is controlled by configuration parameters that are normally set in
    <filename>postgresql.conf</filename>.  (See <xref linkend="runtime-config"/> for
    details about setting configuration parameters.)
@@ -172,7 +171,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The parameter <xref linkend="guc-track-counts"/> controls whether
-   statistics are collected about table and index accesses.
+   table and index accesses are tracked.
   </para>
 
   <para>
@@ -196,18 +195,12 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
   </para>
 
   <para>
-   The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
-   When the server shuts down cleanly, a permanent copy of the statistics
-   data is stored in the <filename>pg_stat</filename> subdirectory, so that
-   statistics can be retained across server restarts.  When recovery is
-   performed at server start (e.g. after immediate shutdown, server crash,
-   and point-in-time recovery), all statistics counters are reset.
+   Activity statistics are kept in shared memory.  When the server shuts
+   down cleanly, a permanent copy of the statistics data is stored in
+   the <filename>pg_stat</filename> subdirectory, so that statistics can be
+   retained across server restarts.  When recovery is performed at server
+   start (e.g. after immediate shutdown, server crash, and point-in-time
+   recovery), all statistics counters are reset.
   </para>
 
  </sect2>
@@ -220,48 +213,46 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    linkend="monitoring-stats-dynamic-views-table"/>, are available to show
    the current state of the system. There are also several other
    views, listed in <xref
-   linkend="monitoring-stats-views-table"/>, available to show the results
-   of statistics collection.  Alternatively, one can
-   build custom views using the underlying statistics functions, as discussed
-   in <xref linkend="monitoring-stats-functions"/>.
+   linkend="monitoring-stats-views-table"/>, available to show the activity
+   statistics.  Alternatively, one can build custom views using the underlying
+   statistics functions, as discussed in
+   <xref linkend="monitoring-stats-functions"/>.
   </para>
 
   <para>
-   When using the statistics to monitor collected data, it is important
-   to realize that the information does not update instantaneously.
-   Each individual server process transmits new statistical counts to
-   the collector just before going idle; so a query or transaction still in
-   progress does not affect the displayed totals.  Also, the collector itself
-   emits a new report at most once per <varname>PGSTAT_STAT_INTERVAL</varname>
-   milliseconds (500 ms unless altered while building the server).  So the
-   displayed information lags behind actual activity.  However, current-query
-   information collected by <varname>track_activities</varname> is
-   always up-to-date.
+   When using the activity statistics, it is important to realize that the
+   information does not update instantaneously.  Each individual server
+   process writes out new statistical counts just before going idle, but no
+   more often than once per <varname>PGSTAT_STAT_INTERVAL</varname>
+   milliseconds (1 second unless altered while building the server); so a
+   query or transaction still in progress does not affect the displayed
+   totals.  However, current-query information tracked by
+   <varname>track_activities</varname> is always up-to-date.
   </para>
 
   <para>
    Another important point is that when a server process is asked to display
-   any of these statistics, it first fetches the most recent report emitted by
-   the collector process and then continues to use this snapshot for all
-   statistical views and functions until the end of its current transaction.
-   So the statistics will show static information as long as you continue the
-   current transaction.  Similarly, information about the current queries of
-   all sessions is collected when any such information is first requested
-   within a transaction, and the same information will be displayed throughout
-   the transaction.
-   This is a feature, not a bug, because it allows you to perform several
-   queries on the statistics and correlate the results without worrying that
-   the numbers are changing underneath you.  But if you want to see new
-   results with each query, be sure to do the queries outside any transaction
-   block.  Alternatively, you can invoke
+   any of these statistics, it first reads the current statistics and then
+   continues to use this snapshot for all statistical views and functions
+   until the end of its current transaction.  So the statistics will show
+   static information as long as you continue the current transaction.
+   Similarly, information about the current queries of all sessions is collected
+   when any such information is first requested within a transaction, and the
+   same information will be displayed throughout the transaction.  This is a
+   feature, not a bug, because it allows you to perform several queries on the
+   statistics and correlate the results without worrying that the numbers are
+   changing underneath you.  But if you want to see new results with each
+   query, be sure to do the queries outside any transaction block.
+   Alternatively, you can invoke
    <function>pg_stat_clear_snapshot</function>(), which will discard the
    current transaction's statistics snapshot (if any).  The next use of
    statistical information will cause a new snapshot to be fetched.
   </para>
-
+
   <para>
-   A transaction can also see its own statistics (as yet untransmitted to the
-   collector) in the views <structname>pg_stat_xact_all_tables</structname>,
+   A transaction can also see its own statistics (as yet unwritten to the
+   server-wide activity statistics) in the
+   views <structname>pg_stat_xact_all_tables</structname>,
    <structname>pg_stat_xact_sys_tables</structname>,
    <structname>pg_stat_xact_user_tables</structname>, and
    <structname>pg_stat_xact_user_functions</structname>.  These numbers do not act as
@@ -603,7 +594,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    kernel's I/O cache, and might therefore still be fetched without
    requiring a physical read. Users interested in obtaining more
    detailed information on <productname>PostgreSQL</productname> I/O behavior are
-   advised to use the <productname>PostgreSQL</productname> statistics collector
+   advised to use the <productname>PostgreSQL</productname> activity statistics
    in combination with operating system utilities that allow insight
    into the kernel's handling of I/O.
   </para>
@@ -921,7 +912,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
       <tbody>
        <row>
-        <entry morerows="64"><literal>LWLock</literal></entry>
+        <entry morerows="65"><literal>LWLock</literal></entry>
         <entry><literal>ShmemIndexLock</literal></entry>
         <entry>Waiting to find or allocate space in shared memory.</entry>
        </row>
@@ -1204,6 +1195,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting to allocate or exchange a chunk of memory or update
          counters during Parallel Hash plan execution.</entry>
         </row>
+        <row>
+         <entry><literal>activity_statistics</literal></entry>
+         <entry>Waiting to write out activity statistics to shared memory
+         during transaction end or idle time.</entry>
+        </row>
         <row>
          <entry morerows="9"><literal>Lock</literal></entry>
          <entry><literal>relation</literal></entry>
@@ -1251,7 +1247,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting to acquire a pin on a buffer.</entry>
         </row>
         <row>
-         <entry morerows="12"><literal>Activity</literal></entry>
+         <entry morerows="11"><literal>Activity</literal></entry>
          <entry><literal>ArchiverMain</literal></entry>
          <entry>Waiting in main loop of the archiver process.</entry>
         </row>
@@ -1279,10 +1275,6 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry><literal>LogicalLauncherMain</literal></entry>
          <entry>Waiting in main loop of logical launcher process.</entry>
         </row>
-        <row>
-         <entry><literal>PgStatMain</literal></entry>
-         <entry>Waiting in main loop of the statistics collector process.</entry>
-        </row>
         <row>
          <entry><literal>RecoveryWalStream</literal></entry>
          <entry>Waiting for WAL from a stream at recovery.</entry>
@@ -4282,9 +4274,10 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
      <entry><literal>performing final cleanup</literal></entry>
      <entry>
        <command>VACUUM</command> is performing final cleanup.  During this phase,
-       <command>VACUUM</command> will vacuum the free space map, update statistics
-       in <literal>pg_class</literal>, and report statistics to the statistics
-       collector.  When this phase is completed, <command>VACUUM</command> will end.
+       <command>VACUUM</command> will vacuum the free space map, update
+       statistics in <literal>pg_class</literal>, and update the system-wide
+       activity statistics.  When this phase is completed,
+       <command>VACUUM</command> will end.
      </entry>
     </row>
    </tbody>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index a9bc397165..a9289f84b0 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1289,11 +1289,10 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The database activity of <application>pg_dump</application> is
-   normally collected by the statistics collector.  If this is
-   undesirable, you can set parameter <varname>track_counts</varname>
-   to false via <envar>PGOPTIONS</envar> or the <literal>ALTER
-   USER</literal> command.
+   The database activity of <application>pg_dump</application> is normally
+   collected.  If this is undesirable, you can set
+   parameter <varname>track_counts</varname> to false
+   via <envar>PGOPTIONS</envar> or the <literal>ALTER USER</literal> command.
   </para>
 
  </refsect1>
-- 
2.18.2

From b14795834b8c9b7449de1e9bb0dafe3ee5e2ee05 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 17:00:44 +0900
Subject: [PATCH v31 7/7] Remove the GUC stats_temp_directory

The GUC specified the directory in which to store temporary statistics
files. It is no longer needed by the stats collector but is still used
by programs in bin and contrib, and maybe by other extensions. Thus this
patch removes the GUC but leaves some backing variables and macro
definitions in place for backward compatibility.
---
 doc/src/sgml/backup.sgml                      |  2 -
 doc/src/sgml/config.sgml                      | 19 ---------
 doc/src/sgml/storage.sgml                     |  3 +-
 src/backend/postmaster/pgstat.c               | 13 +++---
 src/backend/replication/basebackup.c          | 13 ++----
 src/backend/utils/misc/guc.c                  | 41 -------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/pgstat.h                          |  5 ++-
 src/test/perl/PostgresNode.pm                 |  4 --
 9 files changed, 13 insertions(+), 88 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index bdc9026c62..2885540362 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1146,8 +1146,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ca737ee1fc..e2eb6d630d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7147,25 +7147,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 3234adb639..559f75fb54 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -122,8 +122,7 @@ Item
 
 <row>
  <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
+ <entry>Subdirectory containing ephemeral files for extensions</entry>
 </row>
 
 <row>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3e11d50f86..a981b1ca0f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -95,15 +95,12 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
+/*
+ * This used to be a GUC variable and is no longer used in this file. It is
+ * kept, fixed at the default value, only for backward compatibility with
+ * extensions.
  */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+char       *pgstat_stat_directory = PG_STAT_TMP_DIR;
 
 typedef struct StatsShmemStruct
 {
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 261920b961..733533b955 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -254,7 +254,6 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
     backup_total = 0;
@@ -273,8 +272,6 @@ perform_base_backup(basebackup_options *opt)
                                      backup_total);
     }
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -306,13 +303,9 @@ perform_base_backup(basebackup_options *opt)
          * Calculate the relative path of temporary statistics directory in
          * order to skip the files which are located in that directory later.
          */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
+
+        Assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
+        statrelpath = psprintf("./%s", PG_STAT_TMP_DIR);
 
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index fd45e72b44..8977c5e2da 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -197,7 +197,6 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -4265,17 +4264,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11552,35 +11540,6 @@ check_maintenance_io_concurrency(int *newval, void **extra, GucSource source)
     return true;
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 03760ca6a4..c76ea4a1e7 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -578,7 +578,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5a63840dfe..b8d2f1bd32 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -33,7 +33,10 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
+/*
+ * This used to be the directory to store temporary statistics data in but is
+ * no longer used. Defined here for backward compatibility.
+ */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
 
 /* Values for track_functions GUC variable --- order is significant! */
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 1d5450758e..28b39f6b2a 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -455,10 +455,6 @@ sub init
     print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
       if defined $ENV{TEMP_CONFIG};
 
-    # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
-    # concurrently must not share a stats_temp_directory.
-    print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
     if ($params{allows_streaming})
     {
         if ($params{allows_streaming} eq "logical")
-- 
2.18.2
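
For illustration only (this is not code from the patch): with the GUC gone, base backup no longer has to normalize an arbitrary directory into a "./"-relative path. The sketch below contrasts the removed three-way logic with the fixed form. The function names `old_stat_relpath` and `new_stat_relpath` are hypothetical, a leading `/` check stands in for `is_absolute_path()`, and `snprintf` stands in for `psprintf`.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PG_STAT_TMP_DIR "pg_stat_tmp"

/* Removed logic: map a configurable directory to a "./"-relative path. */
void
old_stat_relpath(const char *statdir, const char *datadir,
                 char *out, size_t outlen)
{
    size_t datadirlen = strlen(datadir);

    if (statdir[0] == '/' && strncmp(statdir, datadir, datadirlen) == 0)
        snprintf(out, outlen, "./%s", statdir + datadirlen + 1);   /* under data dir */
    else if (strncmp(statdir, "./", 2) != 0)
        snprintf(out, outlen, "./%s", statdir);                    /* add "./" prefix */
    else
        snprintf(out, outlen, "%s", statdir);                      /* already "./..." */
}

/* New logic: the directory is a compile-time constant without a slash. */
void
new_stat_relpath(char *out, size_t outlen)
{
    assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
    snprintf(out, outlen, "./%s", PG_STAT_TMP_DIR);
}
```

For the default setting both variants produce `./pg_stat_tmp`; only the new one cannot be influenced by configuration.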


Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
At Fri, 03 Apr 2020 17:31:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> Conflicted with the commit 28cac71bd3 - SLRU stats.
> 
> Rebased and fixed some issues.
> - A PoC (or rush work) of refactoring SLRU stats.
> 
>   I tried to combine it into the global stats hash, but the SLRU
>   report functions are called within critical sections, so memory
>   allocation fails. The current pgstat module removes the local entry
>   after successfully flushing it out to shared stats, so allocation at
>   the first report is inevitable.  In the attached version it is
>   handled the same way as global stats.  I continue seeking a way to
>   combine it into the global stats hash.

I couldn't find a way to consolidate the SLRU stats into the general
local stats hash. The hash can be large, so an allocation is more likely
to fail there than in the other places where memory allocation inside a
critical section is allowed. Instead, in the attached version, I
separated the shared-SLRU lock from StatsLock and added logic to avoid
useless scans of the array.
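
To make the idea concrete, here is a minimal sketch. It is not the code in the attached patch, and every name in it (`N_SLRU`, `SlruCounts`, `have_slru_pending`, `count_slru_hit`, `flush_slru`) is hypothetical. The local counters live in a fixed-size array, so reporting never allocates memory, and a single flag lets the flush skip the array scan when nothing was counted.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define N_SLRU 8                    /* hypothetical number of SLRU kinds */

typedef struct SlruCounts
{
    long blocks_hit;
    long blocks_read;
} SlruCounts;

SlruCounts local_slru[N_SLRU];      /* fixed size: no allocation when reporting */
SlruCounts shared_slru[N_SLRU];     /* stands in for the shared-memory array */
bool       have_slru_pending = false;

/* Safe inside a critical section: only touches preallocated memory. */
void
count_slru_hit(int idx)
{
    assert(idx >= 0 && idx < N_SLRU);
    local_slru[idx].blocks_hit++;
    have_slru_pending = true;
}

/* Returns the number of array entries scanned (0 if the scan was skipped). */
int
flush_slru(void)
{
    int i;

    if (!have_slru_pending)
        return 0;                   /* nothing counted: avoid the useless scan */

    /* A real implementation would take the dedicated SLRU lock here. */
    for (i = 0; i < N_SLRU; i++)
    {
        shared_slru[i].blocks_hit += local_slru[i].blocks_hit;
        shared_slru[i].blocks_read += local_slru[i].blocks_read;
    }
    memset(local_slru, 0, sizeof(local_slru));
    have_slru_pending = false;
    return N_SLRU;
}
```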

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 7a50aa4424c4db32b9d529bd2b5d1c20ac3da8e8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Mon, 16 Mar 2020 17:15:35 +0900
Subject: [PATCH v32 1/7] Use standard crash handler in archiver.

Commit 8e19a82640 changed the SIGQUIT handler of almost all processes
so that it does not run atexit callbacks, for safety. The archiver
process should behave the same way for the same reason. The exit status
changes from 1 to 2, but that doesn't change any behavior.
---
 src/backend/postmaster/pgarch.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 01ffd6513c..37be0e2bbb 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -96,7 +96,6 @@ static pid_t pgarch_forkexec(void);
 #endif
 
 NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_exit(SIGNAL_ARGS);
 static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
@@ -229,7 +228,7 @@ PgArchiverMain(int argc, char *argv[])
     pqsignal(SIGHUP, SignalHandlerForConfigReload);
     pqsignal(SIGINT, SIG_IGN);
     pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
-    pqsignal(SIGQUIT, pgarch_exit);
+    pqsignal(SIGQUIT, SignalHandlerForCrashExit);
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
     pqsignal(SIGUSR1, pgarch_waken);
@@ -246,14 +245,6 @@ PgArchiverMain(int argc, char *argv[])
     exit(0);
 }
 
-/* SIGQUIT signal handler for archiver process */
-static void
-pgarch_exit(SIGNAL_ARGS)
-{
-    /* SIGQUIT means curl up and die ... */
-    exit(1);
-}
-
 /* SIGUSR1 signal handler for archiver process */
 static void
 pgarch_waken(SIGNAL_ARGS)
-- 
2.18.2
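
As a rough illustration of why the handler matters (this sketch is not part of the patch), a SIGQUIT handler that calls `_exit(2)` leaves the process without running atexit callbacks, whereas `exit(1)` would run them. It assumes the standard crash handler behaves like the hypothetical `crash_exit_handler` below; `cleanup_callback` and `quit_child_exit_status` are likewise names made up for the example.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the standard crash handler: leave immediately. */
void
crash_exit_handler(int signo)
{
    (void) signo;
    _exit(2);                   /* async-signal-safe; skips atexit callbacks */
}

/* atexit callback; if it ever ran, the exit status would become 99. */
void
cleanup_callback(void)
{
    _exit(99);
}

/* Fork a child, send it SIGQUIT, and report the child's exit status. */
int
quit_child_exit_status(void)
{
    int   status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
    {
        atexit(cleanup_callback);
        signal(SIGQUIT, crash_exit_handler);
        raise(SIGQUIT);         /* handler runs and never returns */
        _exit(0);               /* not reached */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The child exits with status 2, which shows that the atexit callback was skipped.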

From fc8429dd9be49fad2dcfef191f2367dadc755343 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:03 +0900
Subject: [PATCH v32 2/7] sequential scan for dshash

Dshash did not allow scanning all entries sequentially. This adds that
functionality. The interface is similar to, but a bit different from,
both dynahash's and the simple dshash search functions. One of the most
significant differences is that the sequential scan interface of dshash
always needs a call to dshash_seq_term when a scan ends. Another is
locking. Dshash holds a partition lock when returning an entry;
dshash_seq_next() also holds a lock when returning an entry, but callers
must not release it, since the lock is essential to continue the scan.
The seqscan interface allows entry deletion during a scan. In-scan
deletion should be performed with dshash_delete_current().
---
 src/backend/lib/dshash.c | 149 ++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  21 ++++++
 2 files changed, 169 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 78ccf03217..1ef093e2e9 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -127,6 +127,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +157,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -324,7 +332,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -592,6 +600,145 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through a dshash table and return all the
+ *           elements one by one; return NULL when there are no more.
+ *
+ * dshash_seq_term must always be called when a scan finishes.
+ * The caller may delete returned elements in the middle of a scan by using
+ * dshash_delete_current(). exclusive must be true to delete elements.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool exclusive)
+{
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->exclusive = exclusive;
+}
+
+/*
+ * Returns the next element.
+ *
+ * Returned elements are locked; the caller must not release the lock itself.
+ * It is released by a later dshash_seq_next() or by dshash_seq_term().
+ */
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert(status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+        LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                      status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        status->curpartition = partition;
+
+        /* resize doesn't happen from now until seq scan ends */
+        status->nbuckets =
+            NUM_BUCKETS(status->hash_table->control->size_log2);
+        ensure_valid_bucket_pointers(status->hash_table);
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        int next_partition;
+
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            return NULL;
+        }
+
+        /* Check whether we move to the next partition */
+        next_partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+
+        if (status->curpartition != next_partition)
+        {
+            /*
+             * Move to the next partition. Lock the next partition, then
+             * release the current one, not in the reverse order, to avoid
+             * concurrent resizing.  Avoid deadlock by taking locks in the
+             * same order as resize().
+             */
+            LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                         next_partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                         status->curpartition));
+            status->curpartition = next_partition;
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * The caller may delete the item. Store the next item in case of deletion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+/*
+ * Terminates the seqscan and releases all locks.
+ *
+ * Should always be called when finishing or exiting a seqscan.
+ */
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+
+    if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+/* Remove the current entry during a seq scan. */
+void
+dshash_delete_current(dshash_seq_status *status)
+{
+    dshash_table       *hash_table    = status->hash_table;
+    dshash_table_item  *item        = status->curitem;
+    size_t                partition    = PARTITION_FOR_HASH(item->hash);
+
+    Assert(status->exclusive);
+    Assert(hash_table->control->magic == DSHASH_MAGIC);
+    Assert(hash_table->find_locked);
+    Assert(hash_table->find_exclusively_locked);
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition),
+                                LW_EXCLUSIVE));
+
+    delete_item(hash_table, item);
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b86df68e77..0ca9514021 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,21 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state. The detail is exposed to let users know the storage
+ * size but it should be considered as an opaque type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;    /* dshash table working on */
+    int                    curbucket;    /* bucket number we are at */
+    int                    nbuckets;    /* total number of buckets in the dshash */
+    dshash_table_item  *curitem;    /* item we are currently at */
+    dsa_pointer            pnextitem;    /* dsa-pointer to the next item */
+    int                    curpartition;    /* partition number we are at */
+    bool                exclusive;    /* locking mode */
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
                                    const dshash_parameters *params,
@@ -80,6 +95,12 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
+extern void dshash_delete_current(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.18.2
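
The bucket-to-partition arithmetic behind the scan can be checked in isolation. The sketch below is not part of the patch: it copies the two new macros, assumes `DSHASH_NUM_PARTITIONS_LOG2` is 7 as in the existing dshash.c, and uses a hypothetical helper `count_partition_switches` to walk the buckets in the same order as dshash_seq_next(). Each switch is where the scan would acquire the next partition lock and then release the current one.

```c
#include <assert.h>
#include <stddef.h>

#define DSHASH_NUM_PARTITIONS_LOG2 7    /* assumed value, see lead-in */

#define NUM_SPLITS(size_log2) \
    ((size_log2) - DSHASH_NUM_PARTITIONS_LOG2)

/* How many buckets are there in a given size? */
#define NUM_BUCKETS(size_log2) \
    (((size_t) 1) << (size_log2))

/* Choose partition based on bucket index. */
#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2) \
    ((bucket_idx) >> NUM_SPLITS(size_log2))

/* Count how often a full scan hands over from one partition lock to the next. */
size_t
count_partition_switches(int size_log2)
{
    size_t nbuckets;
    size_t curpartition;
    size_t switches = 0;
    size_t bucket;

    assert(size_log2 >= DSHASH_NUM_PARTITIONS_LOG2);

    nbuckets = NUM_BUCKETS(size_log2);
    curpartition = PARTITION_FOR_BUCKET_INDEX((size_t) 0, size_log2);

    for (bucket = 1; bucket < nbuckets; bucket++)
    {
        size_t next = PARTITION_FOR_BUCKET_INDEX(bucket, size_log2);

        if (next != curpartition)
        {
            /* the scan would lock "next" here, then unlock "curpartition" */
            assert(next == curpartition + 1);   /* always ascending order */
            switches++;
            curpartition = next;
        }
    }
    return switches;
}
```

Because partitions are always visited in ascending order, a scan takes partition locks in one fixed order, which is the property the deadlock-avoidance comment in dshash_seq_next() relies on.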

From 2b2b7032de446e74c3db429af5fcad30641db2b1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:57 +0900
Subject: [PATCH v32 3/7] Add conditional lock feature to dshash

Dshash currently waits for a lock unconditionally. That is inconvenient
when we want to avoid being blocked by other processes. This commit
adds dshash_find_extended, an alternative to dshash_find and
dshash_find_or_insert that can return immediately on lock failure.
---
 src/backend/lib/dshash.c | 98 +++++++++++++++++++++-------------------
 src/include/lib/dshash.h |  3 ++
 2 files changed, 55 insertions(+), 46 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 1ef093e2e9..d7ee6de11e 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -383,6 +383,10 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  * the caller must take care to ensure that the entry is not left corrupted.
  * The lock mode is either shared or exclusive depending on 'exclusive'.
  *
+ * If found is not NULL, *found is set to true if the key is found in the hash
+ * table. If the key is not found, *found is set to false and a pointer to a
+ * newly created entry is returned.
+ *
  * The caller must not lock a lock already.
  *
  * Note that the lock held is in fact an LWLock, so interrupts will be held on
@@ -392,36 +396,7 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
 {
-    dshash_hash hash;
-    size_t        partition;
-    dshash_table_item *item;
-
-    hash = hash_key(hash_table, key);
-    partition = PARTITION_FOR_HASH(hash);
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
-
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
-    ensure_valid_bucket_pointers(hash_table);
-
-    /* Search the active bucket. */
-    item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
-
-    if (!item)
-    {
-        /* Not found. */
-        LWLockRelease(PARTITION_LOCK(hash_table, partition));
-        return NULL;
-    }
-    else
-    {
-        /* The caller will free the lock by calling dshash_release_lock. */
-        hash_table->find_locked = true;
-        hash_table->find_exclusively_locked = exclusive;
-        return ENTRY_FROM_ITEM(item);
-    }
+    return dshash_find_extended(hash_table, key, exclusive, false, false, NULL);
 }
 
 /*
@@ -439,31 +414,60 @@ dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
 {
-    dshash_hash hash;
-    size_t        partition_index;
-    dshash_partition *partition;
+    return dshash_find_extended(hash_table, key, true, false, true, found);
+}
+
+
+/*
+ * Find the key in the hash table.
+ *
+ * "exclusive" is the lock mode in which the partition for the returned item
+ * is locked.  If "nowait" is true, the function returns immediately if the
+ * required lock cannot be acquired.  "insert" selects insert mode, in which a
+ * missing key is inserted as a new entry and *found is set to false; *found
+ * is set to true if the key exists. "found" must be non-null in this mode.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool insert, bool *found)
+{
+    dshash_hash hash = hash_key(hash_table, key);
+    size_t        partidx = PARTITION_FOR_HASH(hash);
+    dshash_partition *partition = &hash_table->control->partitions[partidx];
+    LWLockMode  lockmode = exclusive ? LW_EXCLUSIVE : LW_SHARED;
     dshash_table_item *item;
 
-    hash = hash_key(hash_table, key);
-    partition_index = PARTITION_FOR_HASH(hash);
-    partition = &hash_table->control->partitions[partition_index];
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
+    /* must be exclusive when insert allowed */
+    Assert(!insert || (exclusive && found != NULL));
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (!nowait)
+        LWLockAcquire(PARTITION_LOCK(hash_table, partidx), lockmode);
+    else if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partidx),
+                                       lockmode))
+        return NULL;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
     item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
 
     if (item)
-        *found = true;
+    {
+        if (found)
+            *found = true;
+    }
     else
     {
-        *found = false;
+        if (found)
+            *found = false;
+
+        if (!insert)
+        {
+            /* The caller didn't ask to add a new entry. */
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+            return NULL;
+        }
 
         /* Check if we are getting too full. */
         if (partition->count > MAX_COUNT_PER_PARTITION(hash_table))
@@ -479,7 +483,8 @@ restart:
              * Give up our existing lock first, because resizing needs to
              * reacquire all the locks in the right order to avoid deadlocks.
              */
-            LWLockRelease(PARTITION_LOCK(hash_table, partition_index));
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+
             resize(hash_table, hash_table->size_log2 + 1);
 
             goto restart;
@@ -493,12 +498,13 @@ restart:
         ++partition->count;
     }
 
-    /* The caller must release the lock with dshash_release_lock. */
+    /* The caller will free the lock by calling dshash_release_lock. */
     hash_table->find_locked = true;
-    hash_table->find_exclusively_locked = true;
+    hash_table->find_exclusively_locked = exclusive;
     return ENTRY_FROM_ITEM(item);
 }
 
+
 /*
  * Remove an entry by key.  Returns true if the key was found and the
  * corresponding entry was removed.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 0ca9514021..5f7a60febd 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -91,6 +91,9 @@ extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                                    const void *key, bool *found);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+                                  bool exclusive, bool nowait, bool insert,
+                                  bool *found);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.18.2
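
The flag combinations of dshash_find_extended can be tried out on a toy model. The sketch below is not dshash and not part of the patch: `ToyTable`, `toy_init`, `toy_release` and `toy_find_extended` are hypothetical, a C11 `atomic_flag` stands in for the partition LWLock, and a full table simply fails where the real code would resize.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_CAPACITY 4

typedef struct ToyTable
{
    atomic_flag lock;               /* stands in for one partition lock */
    int         keys[TOY_CAPACITY];
    int         nkeys;
} ToyTable;

void
toy_init(ToyTable *t)
{
    atomic_flag_clear(&t->lock);
    t->nkeys = 0;
}

/* Counterpart of dshash_release_lock() in this toy. */
void
toy_release(ToyTable *t)
{
    atomic_flag_clear(&t->lock);
}

/* On a non-NULL return the lock stays held until toy_release(). */
int *
toy_find_extended(ToyTable *t, int key, bool nowait, bool insert, bool *found)
{
    int i;

    assert(!insert || found != NULL);

    while (atomic_flag_test_and_set(&t->lock))
    {
        if (nowait)
            return NULL;            /* lock busy: give up immediately */
    }

    for (i = 0; i < t->nkeys; i++)
    {
        if (t->keys[i] == key)
        {
            if (found)
                *found = true;
            return &t->keys[i];
        }
    }

    if (found)
        *found = false;

    if (!insert || t->nkeys == TOY_CAPACITY)
    {
        atomic_flag_clear(&t->lock);    /* nothing to return: drop the lock */
        return NULL;
    }

    t->keys[t->nkeys] = key;
    return &t->keys[t->nkeys++];
}
```

In this toy, as in the patch, a NULL result with nowait set does not by itself say whether the lock was busy or the key was missing, because *found is left untouched when the lock cannot be acquired.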

From c15af49f474ac5cd0908fc122e3b4c6c5c6f0a7d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:59:38 +0900
Subject: [PATCH v32 4/7] Make archiver process an auxiliary process

This is a preliminary patch for shared-memory based stats collector.

The archiver process must be an auxiliary process because it uses shared
memory once stats data is moved into shared memory. Make the process
an auxiliary process so that it keeps working.
---
 src/backend/access/transam/xlogarchive.c |   6 +-
 src/backend/bootstrap/bootstrap.c        |  22 ++--
 src/backend/postmaster/pgarch.c          | 130 +++--------------------
 src/backend/postmaster/postmaster.c      |  50 +++++----
 src/backend/storage/lmgr/proc.c          |   1 +
 src/include/access/xlog.h                |   3 +
 src/include/access/xlogarchive.h         |   1 +
 src/include/miscadmin.h                  |   2 +
 src/include/postmaster/pgarch.h          |   4 +-
 src/include/storage/pmsignal.h           |   1 -
 src/include/storage/proc.h               |   3 +
 11 files changed, 69 insertions(+), 154 deletions(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index d62c12310a..2f8672ac0c 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -29,7 +29,9 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/latch.h"
 #include "storage/pmsignal.h"
+#include "storage/proc.h"
 
 /*
  * Attempt to retrieve the specified file from off-line archival storage.
@@ -489,8 +491,8 @@ XLogArchiveNotify(const char *xlog)
     }
 
     /* Notify archiver that it's got something to do */
-    if (IsUnderPostmaster)
-        SendPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER);
+    if (IsUnderPostmaster && ProcGlobal->archiverLatch)
+        SetLatch(ProcGlobal->archiverLatch);
 }
 
 /*
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 5480a024e0..d398ce6f03 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -319,6 +319,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
         case StartupProcess:
             MyBackendType = B_STARTUP;
             break;
+        case ArchiverProcess:
+            MyBackendType = B_ARCHIVER;
+            break;
         case BgWriterProcess:
             MyBackendType = B_BG_WRITER;
             break;
@@ -439,30 +442,29 @@ AuxiliaryProcessMain(int argc, char *argv[])
             proc_exit(1);        /* should never return */
 
         case StartupProcess:
-            /* don't set signals, startup process has its own agenda */
             StartupProcessMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
+
+        case ArchiverProcess:
+            PgArchiverMain();
+            proc_exit(1);
 
         case BgWriterProcess:
-            /* don't set signals, bgwriter has its own agenda */
             BackgroundWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case CheckpointerProcess:
-            /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalWriterProcess:
-            /* don't set signals, walwriter has its own agenda */
             InitXLOGAccess();
             WalWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalReceiverProcess:
-            /* don't set signals, walreceiver has its own agenda */
             WalReceiverMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         default:
             elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 37be0e2bbb..063d1323ea 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -48,6 +48,7 @@
 #include "storage/latch.h"
 #include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
+#include "storage/procsignal.h"
 #include "utils/guc.h"
 #include "utils/ps_status.h"
 
@@ -78,13 +79,11 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
-static volatile sig_atomic_t wakened = false;
 static volatile sig_atomic_t ready_to_stop = false;
 
 /* ----------
@@ -95,8 +94,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
 static void pgarch_ArchiverCopyLoop(void);
@@ -110,75 +107,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -212,14 +140,9 @@ pgarch_forkexec(void)
 #endif                            /* EXEC_BACKEND */
 
 
-/*
- * PgArchiverMain
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
- *    since we don't use 'em, it hardly matters...
- */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+/* Main entry point for archiver process */
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -231,33 +154,26 @@ PgArchiverMain(int argc, char *argv[])
     pqsignal(SIGQUIT, SignalHandlerForCrashExit);
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, pgarch_waken);
+    pqsignal(SIGUSR1, procsignal_sigusr1_handler);
     pqsignal(SIGUSR2, pgarch_waken_stop);
+
     /* Reset some signals that are accepted by postmaster but not here */
     pqsignal(SIGCHLD, SIG_DFL);
+
+    /* Unblock signals (they were blocked when the postmaster forked us) */
     PG_SETMASK(&UnBlockSig);
 
-    MyBackendType = B_ARCHIVER;
-    init_ps_display(NULL);
+    /*
+     * Advertise our latch that backends can use to wake us up while we're
+     * sleeping.
+     */
+    ProcGlobal->archiverLatch = &MyProc->procLatch;
 
     pgarch_MainLoop();
 
     exit(0);
 }
 
-/* SIGUSR1 signal handler for archiver process */
-static void
-pgarch_waken(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    /* set flag that there is work to be done */
-    wakened = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
 /* SIGUSR2 signal handler for archiver process */
 static void
 pgarch_waken_stop(SIGNAL_ARGS)
@@ -282,14 +198,6 @@ pgarch_MainLoop(void)
     pg_time_t    last_copy_time = 0;
     bool        time_to_stop;
 
-    /*
-     * We run the copy loop immediately upon entry, in case there are
-     * unarchived files left over from a previous database run (or maybe the
-     * archiver died unexpectedly).  After that we wait for a signal or
-     * timeout before doing more.
-     */
-    wakened = true;
-
     /*
      * There shouldn't be anything for the archiver to do except to wait for a
      * signal ... however, the archiver exists to protect our data, so she
@@ -328,12 +236,8 @@ pgarch_MainLoop(void)
         }
 
         /* Do what we're here for */
-        if (wakened || time_to_stop)
-        {
-            wakened = false;
-            pgarch_ArchiverCopyLoop();
-            last_copy_time = time(NULL);
-        }
+        pgarch_ArchiverCopyLoop();
+        last_copy_time = time(NULL);
 
         /*
          * Sleep until a signal is received, or until a poll is forced by
@@ -354,13 +258,9 @@ pgarch_MainLoop(void)
                                WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                                timeout * 1000L,
                                WAIT_EVENT_ARCHIVER_MAIN);
-                if (rc & WL_TIMEOUT)
-                    wakened = true;
                 if (rc & WL_POSTMASTER_DEATH)
                     time_to_stop = true;
             }
-            else
-                wakened = true;
         }
 
         /*
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 73d278f3b2..e97995f973 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -539,6 +539,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1785,7 +1786,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -3055,7 +3056,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3190,20 +3191,16 @@ reaper(SIGNAL_ARGS)
         }
 
         /*
-         * Was it the archiver?  If so, just try to start a new one; no need
-         * to force reset of the rest of the system.  (If fail, we'll try
-         * again in future cycles of the main loop.).  Unless we were waiting
-         * for it to shut down; don't restart it in that case, and
-         * PostmasterStateMachine() will advance to the next shutdown step.
+         * Was it the archiver?  Normal exit can be ignored; we'll start a new
+         * one at the next iteration of the postmaster's main loop, if
+         * necessary. Any other exit condition is treated as a crash.
          */
         if (pid == PgArchPID)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3451,7 +3448,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3656,6 +3653,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3928,6 +3937,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5208,7 +5218,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5251,16 +5261,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (StartWorkerNeeded || HaveCrashedWorker)
         maybe_start_bgworkers();
 
-    if (CheckPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER) &&
-        PgArchPID != 0)
-    {
-        /*
-         * Send SIGUSR1 to archiver process, to wake it up and begin archiving
-         * next WAL file.
-         */
-        signal_child(PgArchPID, SIGUSR1);
-    }
-
     /* Tell syslogger to rotate logfile if requested */
     if (SysLoggerPID != 0)
     {
@@ -5493,6 +5493,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 9938cddb57..af4599bd82 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -187,6 +187,7 @@ InitProcGlobal(void)
     ProcGlobal->startupBufferPinWaitBufId = -1;
     ProcGlobal->walwriterLatch = NULL;
     ProcGlobal->checkpointerLatch = NULL;
+    ProcGlobal->archiverLatch = NULL;
     pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
     pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 9ec7b31cce..4435df82b6 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -313,6 +313,9 @@ extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern void XLogArchiveWakeupStart(void);
+extern void XLogArchiveWakeupEnd(void);
+extern void XLogArchiveWakeup(void);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool PromoteIsTriggered(void);
diff --git a/src/include/access/xlogarchive.h b/src/include/access/xlogarchive.h
index 1c67de2ede..54ce0b97d7 100644
--- a/src/include/access/xlogarchive.h
+++ b/src/include/access/xlogarchive.h
@@ -25,6 +25,7 @@ extern void ExecuteRecoveryCommand(const char *command, const char *commandName,
 extern void KeepFileRestoredFromArchive(const char *path, const char *xlogfname);
 extern void XLogArchiveNotify(const char *xlog);
 extern void XLogArchiveNotifySeg(XLogSegNo segno);
+extern void XLogArchiveWakeup(void);
 extern void XLogArchiveForceDone(const char *xlog);
 extern bool XLogArchiveCheckDone(const char *xlog);
 extern bool XLogArchiveIsBusy(const char *xlog);
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14fa127ab1..619b2f9c71 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -417,6 +417,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -429,6 +430,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index b3200874ca..e3ffc63f14 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 56c5ec4481..c691acf8cd 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -34,7 +34,6 @@ typedef enum
 {
     PMSIGNAL_RECOVERY_STARTED,    /* recovery has started */
     PMSIGNAL_BEGIN_HOT_STANDBY, /* begin Hot Standby */
-    PMSIGNAL_WAKEN_ARCHIVER,    /* send a NOTIFY signal to xlog archiver */
     PMSIGNAL_ROTATE_LOGFILE,    /* send SIGUSR1 to syslogger to rotate logfile */
     PMSIGNAL_START_AUTOVAC_LAUNCHER,    /* start an autovacuum launcher */
     PMSIGNAL_START_AUTOVAC_WORKER,    /* start an autovacuum worker */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index d21780108b..a87e7dc711 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -272,6 +272,9 @@ typedef struct PROC_HDR
     int            startupProcPid;
     /* Buffer id of the buffer that Startup process waits for pin on, or -1 */
     int            startupBufferPinWaitBufId;
+    /* Archiver process's latch */
+    Latch       *archiverLatch;
+    /* Current shared estimate of appropriate spins_per_delay value */
 } PROC_HDR;
 
 extern PGDLLIMPORT PROC_HDR *ProcGlobal;
-- 
2.18.2

From 5fd645f8fcd6f93494e04721de1a0a34deb01832 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:14 +0900
Subject: [PATCH v32 5/7] Shared-memory based stats collector

Previously, activity statistics were collected via sockets and shared
among backends through files written periodically. Such files can reach
tens of megabytes and are created at most once every 500ms; that much
data is serialized by the stats collector and then de-serialized by
every backend periodically. To avoid that cost, this patch places
activity statistics data in shared memory. Each backend accumulates
statistics locally and then tries to move them into the shared
statistics at every transaction end, but at intervals no shorter than
500ms. Locks on the shared statistics are acquired per object, such as
a table or a function, so the expected chance of collision is not
high. Furthermore, until 1 second has elapsed since the last flush to
shared stats, a lock failure postpones the flush so that lock
contention doesn't slow down transactions. After that, the stats flush
waits for locks so that the shared statistics don't get stale.
---
 src/backend/access/heap/heapam_handler.c      |    4 +-
 src/backend/access/heap/vacuumlazy.c          |    2 +-
 src/backend/access/transam/xlog.c             |    4 +-
 src/backend/catalog/index.c                   |   24 +-
 src/backend/commands/analyze.c                |    8 +-
 src/backend/commands/dbcommands.c             |    2 +-
 src/backend/commands/matview.c                |    8 +-
 src/backend/commands/vacuum.c                 |    4 +-
 src/backend/postmaster/autovacuum.c           |   74 +-
 src/backend/postmaster/bgwriter.c             |    4 +-
 src/backend/postmaster/checkpointer.c         |   46 +-
 src/backend/postmaster/interrupt.c            |    5 +-
 src/backend/postmaster/pgarch.c               |   12 +-
 src/backend/postmaster/pgstat.c               | 5138 +++++++----------
 src/backend/postmaster/postmaster.c           |   88 +-
 src/backend/replication/basebackup.c          |    4 +-
 src/backend/storage/buffer/bufmgr.c           |    8 +-
 src/backend/storage/ipc/ipci.c                |    2 +
 src/backend/storage/lmgr/lwlock.c             |    1 +
 src/backend/storage/lmgr/lwlocknames.txt      |    1 +
 src/backend/storage/smgr/smgr.c               |   12 +-
 src/backend/tcop/postgres.c                   |   37 +-
 src/backend/utils/adt/pgstatfuncs.c           |   75 +-
 src/backend/utils/init/globals.c              |    1 +
 src/backend/utils/init/miscinit.c             |    3 -
 src/backend/utils/init/postinit.c             |   11 +
 src/backend/utils/misc/guc.c                  |   14 +-
 src/backend/utils/misc/postgresql.conf.sample |    2 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |    4 +-
 src/include/miscadmin.h                       |    3 +-
 src/include/pgstat.h                          |  583 +-
 src/include/storage/lwlock.h                  |    1 +
 src/include/utils/guc_tables.h                |    2 +-
 src/include/utils/timeout.h                   |    1 +
 34 files changed, 2281 insertions(+), 3907 deletions(-)
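
Note on the flush timing described in the commit message: it amounts to
a three-way decision based on the time since the last successful flush.
The following is a minimal sketch of that decision only, assuming
millisecond timestamps. The names (FlushAction, toy_decide_flush,
toy_try_flush) are hypothetical and do not appear in the patch; only
the 500ms and 1 second thresholds come from the message above.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_MIN_FLUSH_INTERVAL_MS  500  /* never flush more often than this */
#define TOY_MAX_FLUSH_DELAY_MS    1000  /* past this, wait for the lock */

typedef enum
{
    FLUSH_SKIP,     /* too soon since the last flush, keep counts local */
    FLUSH_NOWAIT,   /* try the lock, postpone the flush if it is busy */
    FLUSH_WAIT      /* block on the lock so shared stats do not go stale */
} FlushAction;

/* Decide what to do at transaction end, given the last flush time. */
static FlushAction
toy_decide_flush(int64_t now_ms, int64_t last_flush_ms)
{
    int64_t     elapsed = now_ms - last_flush_ms;

    if (elapsed < TOY_MIN_FLUSH_INTERVAL_MS)
        return FLUSH_SKIP;
    if (elapsed < TOY_MAX_FLUSH_DELAY_MS)
        return FLUSH_NOWAIT;
    return FLUSH_WAIT;
}

/*
 * Simulate one flush attempt.  lock_is_free says whether a no-wait lock
 * attempt would succeed; in the FLUSH_WAIT case we pretend to block until
 * the lock is granted.  Returns true if the local counts were flushed.
 */
static bool
toy_try_flush(int64_t now_ms, int64_t *last_flush_ms, bool lock_is_free)
{
    FlushAction action = toy_decide_flush(now_ms, *last_flush_ms);

    if (action == FLUSH_SKIP)
        return false;
    if (action == FLUSH_NOWAIT && !lock_is_free)
        return false;           /* postponed; local counts keep growing */

    *last_flush_ms = now_ms;
    return true;
}
```

With the last flush at time 0, an attempt at 600ms against a busy lock
is postponed, while an attempt at 1000ms goes through even if the lock
is busy, because the sketch then waits for it.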

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ca52846b97..d1be516db4 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1074,8 +1074,8 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
                  * our own.  In this case we should count and sample the row,
                  * to accommodate users who load a table and analyze it in one
                  * transaction.  (pgstat_report_analyze has to adjust the
-                 * numbers we send to the stats collector to make this come
-                 * out right.)
+                 * numbers we report to the activity stats facility to make
+                 * this come out right.)
                  */
                 if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(targtuple->t_data)))
                 {
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 9f9596c718..accc4b7e95 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -589,7 +589,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
                         new_min_multi,
                         false);
 
-    /* report results to the stats collector, too */
+    /* report results to the activity stats facility, too */
     pgstat_report_vacuum(RelationGetRelid(onerel),
                          onerel->rd_rel->relisshared,
                          new_live_tuples,
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 977d448f50..9a59e9d0eb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8548,9 +8548,9 @@ LogCheckpointEnd(bool restartpoint)
                         &sync_secs, &sync_usecs);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time +=
+    BgWriterStats.checkpoint_write_time +=
         write_secs * 1000 + write_usecs / 1000;
-    BgWriterStats.m_checkpoint_sync_time +=
+    BgWriterStats.checkpoint_sync_time +=
         sync_secs * 1000 + sync_usecs / 1000;
 
     /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index bd7ec923e9..46fe9fd85f 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1699,28 +1699,10 @@ index_concurrently_swap(Oid newIndexId, Oid oldIndexId, const char *oldName)
 
     /*
      * Copy over statistics from old to new index
+     * The data will be flushed by the next pgstat_report_stat()
+     * call.
      */
-    {
-        PgStat_StatTabEntry *tabentry;
-
-        tabentry = pgstat_fetch_stat_tabentry(oldIndexId);
-        if (tabentry)
-        {
-            if (newClassRel->pgstat_info)
-            {
-                newClassRel->pgstat_info->t_counts.t_numscans = tabentry->numscans;
-                newClassRel->pgstat_info->t_counts.t_tuples_returned = tabentry->tuples_returned;
-                newClassRel->pgstat_info->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_hit = tabentry->blocks_hit;
-
-                /*
-                 * The data will be sent by the next pgstat_report_stat()
-                 * call.
-                 */
-            }
-        }
-    }
+    pgstat_copy_index_counters(oldIndexId, newClassRel->pgstat_info);
 
     /* Close relations */
     table_close(pg_class, RowExclusiveLock);
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 924ef37c81..06b03cb8e1 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -655,10 +655,10 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
     }
 
     /*
-     * Report ANALYZE to the stats collector, too.  However, if doing
-     * inherited stats we shouldn't report, because the stats collector only
-     * tracks per-table stats.  Reset the changes_since_analyze counter only
-     * if we analyzed all columns; otherwise, there is still work for
+     * Report ANALYZE to the activity stats facility, too.  However, if doing
+     * inherited stats we shouldn't report, because the activity stats facility
+     * only tracks per-table stats.  Reset the changes_since_analyze counter
+     * only if we analyzed all columns; otherwise, there is still work for
      * auto-analyze to do.
      */
     if (!inh)
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 367c30adb0..a18e7068ae 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -971,7 +971,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
     DropDatabaseBuffers(db_id);
 
     /*
-     * Tell the stats collector to forget it immediately, too.
+     * Tell the activity stats facility to forget it immediately, too.
      */
     pgstat_drop_database(db_id);
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index e5a5eef102..4151fa5335 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -336,10 +336,10 @@ ExecRefreshMatView(RefreshMatViewStmt *stmt, const char *queryString,
         refresh_by_heap_swap(matviewOid, OIDNewHeap, relpersistence);
 
         /*
-         * Inform stats collector about our activity: basically, we truncated
-         * the matview and inserted some new data.  (The concurrent code path
-         * above doesn't need to worry about this because the inserts and
-         * deletes it issues get counted by lower-level code.)
+         * Inform activity stats facility about our activity: basically, we
+         * truncated the matview and inserted some new data.  (The concurrent
+         * code path above doesn't need to worry about this because the inserts
+         * and deletes it issues get counted by lower-level code.)
          */
         pgstat_count_truncate(matviewRel);
         if (!stmt->skipData)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 3a89f8fe1e..eaba7d166e 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -321,8 +321,8 @@ vacuum(List *relations, VacuumParams *params,
                  errmsg("VACUUM option DISABLE_PAGE_SKIPPING cannot be used with FULL")));
 
     /*
-     * Send info about dead objects to the statistics collector, unless we are
-     * in autovacuum --- autovacuum.c does this for itself.
+     * Send info about dead objects to the activity statistics facility, unless
+     * we are in autovacuum --- autovacuum.c does this for itself.
      */
     if ((params->options & VACOPT_VACUUM) && !IsAutoVacuumWorkerProcess())
         pgstat_vacuum_stat();
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 7e97ffab27..19a2357f0d 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -338,9 +338,6 @@ static void autovacuum_do_vac_analyze(autovac_table *tab,
                                       BufferAccessStrategy bstrategy);
 static AutoVacOpts *extract_autovac_opts(HeapTuple tup,
                                          TupleDesc pg_class_desc);
-static PgStat_StatTabEntry *get_pgstat_tabentry_relid(Oid relid, bool isshared,
-                                                      PgStat_StatDBEntry *shared,
-                                                      PgStat_StatDBEntry *dbentry);
 static void perform_work_item(AutoVacuumWorkItem *workitem);
 static void autovac_report_activity(autovac_table *tab);
 static void autovac_report_workitem(AutoVacuumWorkItem *workitem,
@@ -1665,12 +1662,12 @@ AutoVacWorkerMain(int argc, char *argv[])
         char        dbname[NAMEDATALEN];
 
         /*
-         * Report autovac startup to the stats collector.  We deliberately do
-         * this before InitPostgres, so that the last_autovac_time will get
-         * updated even if the connection attempt fails.  This is to prevent
-         * autovac from getting "stuck" repeatedly selecting an unopenable
-         * database, rather than making any progress on stuff it can connect
-         * to.
+         * Report autovac startup to the activity stats facility.  We
+         * deliberately do this before InitPostgres, so that the
+         * last_autovac_time will get updated even if the connection attempt
+         * fails.  This is to prevent autovac from getting "stuck" repeatedly
+         * selecting an unopenable database, rather than making any progress on
+         * stuff it can connect to.
          */
         pgstat_report_autovac(dbid);
 
@@ -1938,8 +1935,6 @@ do_autovacuum(void)
     HASHCTL        ctl;
     HTAB       *table_toast_map;
     ListCell   *volatile cell;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     BufferAccessStrategy bstrategy;
     ScanKeyData key;
     TupleDesc    pg_class_desc;
@@ -1958,17 +1953,11 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /*
-     * may be NULL if we couldn't find an entry (only happens if we are
-     * forcing a vacuum for anti-wrap purposes).
-     */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
 
     /*
-     * Clean up any dead statistics collector entries for this DB. We always
+     * Clean up any dead activity statistics entries for this DB. We always
      * want to do this exactly once per DB-processing cycle, even if we find
      * nothing worth vacuuming in the database.
      */
@@ -2011,9 +2000,6 @@ do_autovacuum(void)
     /* StartTransactionCommand changed elsewhere */
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-
     classRel = table_open(RelationRelationId, AccessShareLock);
 
     /* create a copy so we can use it after closing pg_class */
@@ -2092,8 +2078,8 @@ do_autovacuum(void)
 
         /* Fetch reloptions and the pgstat entry for this table */
         relopts = extract_autovac_opts(tuple, pg_class_desc);
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                       relid);
 
         /* Check if it needs vacuum or analyze */
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
@@ -2176,8 +2162,8 @@ do_autovacuum(void)
         }
 
         /* Fetch the pgstat entry for this table */
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                       relid);
 
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
@@ -2736,29 +2722,6 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
     return av;
 }
 
-/*
- * get_pgstat_tabentry_relid
- *
- * Fetch the pgstat entry of a table, either local to a database or shared.
- */
-static PgStat_StatTabEntry *
-get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
-                          PgStat_StatDBEntry *dbentry)
-{
-    PgStat_StatTabEntry *tabentry = NULL;
-
-    if (isshared)
-    {
-        if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
-    }
-    else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
-
-    return tabentry;
-}
 
 /*
  * table_recheck_autovac
@@ -2779,17 +2742,12 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     bool        doanalyze;
     autovac_table *tab = NULL;
     PgStat_StatTabEntry *tabentry;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     bool        wraparound;
     AutoVacOpts *avopts;
 
     /* use fresh stats */
     autovac_refresh_stats();
 
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* fetch the relation's relcache entry */
     classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
     if (!HeapTupleIsValid(classTup))
@@ -2813,8 +2771,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     }
 
     /* fetch the pgstat table entry */
-    tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                         shared, dbentry);
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                   relid);
 
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
@@ -2936,7 +2894,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
  *
  * For analyze, the analysis done is that the number of tuples inserted,
  * deleted and updated since the last analyze exceeds a threshold calculated
- * in the same fashion as above.  Note that the collector actually stores
+ * in the same fashion as above.  Note that the activity statistics stores
  * the number of tuples (both live and dead) that there were as of the last
  * analyze.  This is asymmetric to the VACUUM case.
  *
@@ -2946,8 +2904,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
  *
  * A table whose autovacuum_enabled option is false is
  * automatically skipped (unless we have to vacuum it due to freeze_max_age).
- * Thus autovacuum can be disabled for specific tables. Also, when the stats
- * collector does not have data about a table, it will be skipped.
+ * Thus autovacuum can be disabled for specific tables. Also, when the activity
+ * statistics do not have data about a table, it will be skipped.
  *
  * A table whose vac_base_thresh value is < 0 takes the base value from the
  * autovacuum_vacuum_threshold GUC variable.  Similarly, a vac_scale_factor
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 069e27e427..4382b1726f 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -234,9 +234,9 @@ BackgroundWriterMain(void)
         can_hibernate = BgBufferSync(&wb_context);
 
         /*
-         * Send off activity statistics to the stats collector
+         * Send off activity statistics to the activity stats facility
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index e354a78725..12f06a316d 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -350,7 +350,7 @@ CheckpointerMain(void)
         if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
         {
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            BgWriterStats.requested_checkpoints++;
         }
 
         /*
@@ -364,7 +364,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                BgWriterStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -486,13 +486,13 @@ CheckpointerMain(void)
         CheckArchiveTimeout();
 
         /*
-         * Send off activity statistics to the stats collector.  (The reason
-         * why we re-use bgwriter-related code for this is that the bgwriter
-         * and checkpointer used to be just one process.  It's probably not
-         * worth the trouble to split the stats support into two independent
-         * stats message types.)
+         * Send off activity statistics to the activity stats facility.  (The
+         * reason why we re-use bgwriter-related code for this is that the
+         * bgwriter and checkpointer used to be just one process.  It's
+         * probably not worth the trouble to split the stats support into two
+         * independent stats message types.)
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         /*
          * Sleep until we are signaled or it's time for another checkpoint or
@@ -533,29 +533,29 @@ HandleCheckpointerInterrupts(void)
         ProcessConfigFile(PGC_SIGHUP);
 
         /*
-         * Checkpointer is the last process to shut down, so we ask it to
-         * hold the keys for a range of other tasks required most of which
-         * have nothing to do with checkpointing at all.
+         * Checkpointer is the last process to shut down, so we ask it to hold
+         * the keys for a range of other tasks required most of which have
+         * nothing to do with checkpointing at all.
          *
-         * For various reasons, some config values can change dynamically
-         * so the primary copy of them is held in shared memory to make
-         * sure all backends see the same value.  We make Checkpointer
-         * responsible for updating the shared memory copy if the
-         * parameter setting changes because of SIGHUP.
+         * For various reasons, some config values can change dynamically so
+         * the primary copy of them is held in shared memory to make sure all
+         * backends see the same value.  We make Checkpointer responsible for
+         * updating the shared memory copy if the parameter setting changes
+         * because of SIGHUP.
          */
         UpdateSharedMemoryConfig();
     }
     if (ShutdownRequestPending)
     {
         /*
-         * From here on, elog(ERROR) should end with exit(1), not send
-         * control back to the sigsetjmp block above
+         * From here on, elog(ERROR) should end with exit(1), not send control
+         * back to the sigsetjmp block above
          */
         ExitOnAnyError = true;
         /* Close down the database */
         ShutdownXLOG(0, 0);
         /* Normal exit from the checkpointer is here */
-        proc_exit(0);        /* done */
+        proc_exit(0);            /* done */
     }
 }
 
@@ -691,9 +691,9 @@ CheckpointWriteDelay(int flags, double progress)
         CheckArchiveTimeout();
 
         /*
-         * Report interim activity statistics to the stats collector.
+         * Report interim activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1238,8 +1238,8 @@ AbsorbSyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    BgWriterStats.buf_written_backend += CheckpointerShmem->num_backend_writes;
+    BgWriterStats.buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/interrupt.c b/src/backend/postmaster/interrupt.c
index 3d02439b79..53844eb8bb 100644
--- a/src/backend/postmaster/interrupt.c
+++ b/src/backend/postmaster/interrupt.c
@@ -93,9 +93,8 @@ SignalHandlerForCrashExit(SIGNAL_ARGS)
  * shut down and exit.
  *
  * Typically, this handler would be used for SIGTERM, but some procesess use
- * other signals. In particular, the checkpointer exits on SIGUSR2, the
- * stats collector on SIGQUIT, and the WAL writer exits on either SIGINT
- * or SIGTERM.
+ * other signals. In particular, the checkpointer exits on SIGUSR2, and the WAL
+ * writer exits on either SIGINT or SIGTERM.
  *
  * ShutdownRequestPending should be checked at a convenient place within the
  * main loop, or else the main loop should call HandleMainLoopInterrupts.
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 063d1323ea..08fe87341c 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -372,20 +372,20 @@ pgarch_ArchiverCopyLoop(void)
                 pgarch_archiveDone(xlog);
 
                 /*
-                 * Tell the collector about the WAL file that we successfully
-                 * archived
+                 * Tell the activity statistics facility about the WAL file
+                 * that we successfully archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_report_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
             else
             {
                 /*
-                 * Tell the collector about the WAL file that we failed to
-                 * archive
+                 * Tell the activity statistics facility about the WAL file
+                 * that we failed to archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_report_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 9ebde47dea..2df9e858df 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,22 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Activity Statistics facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects activity statistics, e.g. per-table access statistics, of
+ *  all backends in shared memory. The activity numbers are first stored
+ *  locally in each process, then flushed to shared memory at commit
+ *  time or by idle-timeout.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ * To avoid congestion on the shared memory, shared stats are updated no more
+ * often than once per PGSTAT_MIN_INTERVAL (1000ms). If some local numbers
+ * remain unflushed because of lock failure, retry at intervals that start at
+ * PGSTAT_RETRY_MIN_INTERVAL (250ms) and double at every retry. Finally we
+ * force an update PGSTAT_MAX_INTERVAL (10000ms) after the first attempt.
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the activity statistics facility creates the
+ *  area and loads the stored stats file if any. The last process at shutdown
+ *  writes the shared stats to the file, then destroys the area before exit.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -19,18 +26,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -40,68 +35,43 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/fork_process.h"
 #include "postmaster/interrupt.h"
 #include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
-#include "utils/timestamp.h"
 
 /* ----------
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
+#define PGSTAT_MIN_INTERVAL            1000    /* Minimum interval of stats data
+                                             * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_RETRY_MIN_INTERVAL    250 /* Initial retry interval after
+                                         * PGSTAT_MIN_INTERVAL */
 
+#define PGSTAT_MAX_INTERVAL            10000    /* Longest interval of stats data
+                                             * updates */
 
 /* ----------
- * The initial size hints for the hash tables used in the collector.
+ * The initial size hints for the hash tables used in the activity statistics.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
-#define PGSTAT_TAB_HASH_SIZE    512
+#define PGSTAT_TABLE_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
 
@@ -116,7 +86,6 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
-
 /* ----------
  * GUC parameters
  * ----------
@@ -131,19 +100,26 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used; to be removed together with the GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
-/*
- * BgWriter global statistics counters (unused in other processes).
- * Stored directly in a stats message structure so it can be sent
- * without needing to copy things around.  We assume this inits to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
+typedef struct StatsShmemStruct
+{
+    dsa_handle    stats_dsa_handle;    /* handle for stats data area */
+    dshash_table_handle hash_handle;    /* shared dbstat hash */
+    dsa_pointer global_stats;    /* DSA pointer to global stats */
+    dsa_pointer archiver_stats; /* Ditto for archiver stats */
+    dsa_pointer slru_stats;        /* Ditto for SLRU stats */
+    int            refcount;        /* # of processes attached to the shared
+                                 * stats memory */
+}            StatsShmemStruct;
+
+/* BgWriter global statistics counters */
+PgStat_BgWriter BgWriterStats = {0};
 
 /*
- * SLRU statistics counters (unused in other processes) stored directly in
- * stats structure so it can be sent without needing to copy things around.
  * We assume this inits to zeroes. There is no central registry of SLRUs,
  * so we use this fixed list instead.
  *
@@ -158,68 +134,153 @@ static char *slru_names[] = {"async", "clog", "commit_timestamp",
 /* number of elemenents of slru_name array */
 #define SLRU_NUM_ELEMENTS    (sizeof(slru_names) / sizeof(char *))
 
-/* entries in the same order as slru_names */
-PgStat_MsgSLRU SLRUStats[SLRU_NUM_ELEMENTS];
-
-/* ----------
- * Local data
- * ----------
+/*
+ *  Changes of SLRU counters are reported within critical sections so we use
+ *  static memory in order to avoid memory allocation.
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
-
-static time_t last_pgstat_start_time;
+static PgStat_StatSLRUEntry local_SLRUStats[SLRU_NUM_ELEMENTS];
+static bool     have_slrustats = false;
 
-static bool pgStatRunningInCollector = false;
+/* backend-lifetime storage */
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
 
 /*
- * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
+ * Enums and types to define shared statistics structure.
+ *
+ * Statistics entries for each object are stored in individual DSA-allocated
+ * memory. Every entry is pointed to from the dshash pgStatSharedHash via a
+ * dsa_pointer. This structure keeps object-stats entries from being moved by
+ * dshash resizing, and allows dshash to release its lock sooner on stats
+ * updates. It also reduces interference among write locks on each stat entry
+ * by not relying on the partition lock of dshash. PgStatLocalHashEntry is the
+ * local-stats equivalent of PgStatHashEntry for shared stat entries.
+ *
+ * Each stat entry is enveloped by the type PgStatEnvelope, which stores the
+ * attributes common to all kinds of statistics and an LWLock.
+ *
+ * Shared stats are stored as:
+ *
+ * dshash pgStatSharedHash
+ *    -> PgStatHashEntry                (dshash entry)
+ *      (dsa_pointer)-> PgStatEnvelope    (dsa memory block)
  *
- * NOTE: once allocated, TabStatusArray structures are never moved or deleted
- * for the life of the backend.  Also, we zero out the t_id fields of the
- * contained PgStat_TableStatus structs whenever they are not actively in use.
- * This allows relcache pgstat_info pointers to be treated as long-lived data,
- * avoiding repeated searches in pgstat_initstats() when a relation is
- * repeatedly opened during a transaction.
+ * Local stats are stored as:
+ *
+ * dynahash pgStatLocalHash
+ *    -> PgStatLocalHashEntry            (dynahash entry)
+ *      (direct pointer)-> PgStatEnvelope (palloc'ed memory)
  */
-#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
 
-typedef struct TabStatusArray
+/* The types of statistics entries. */
+typedef enum PgStatTypes
 {
-    struct TabStatusArray *tsa_next;    /* link to next array, if any */
-    int            tsa_used;        /* # entries currently used */
-    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
-} TabStatusArray;
-
-static TabStatusArray *pgStatTabList = NULL;
+    PGSTAT_TYPE_ALL,            /* Not a type, for the parameters of
+                                 * pgstat_collect_stat_entries */
+    PGSTAT_TYPE_DB,                /* database-wide statistics */
+    PGSTAT_TYPE_TABLE,            /* per-table statistics */
+    PGSTAT_TYPE_FUNCTION        /* per-function statistics */
+}            PgStatTypes;
 
 /*
- * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
+ * entry size lookup table of shared statistics entries corresponding to
+ * PgStatTypes
  */
-typedef struct TabStatHashEntry
+static size_t pgstat_entsize[] =
 {
-    Oid            t_id;
-    PgStat_TableStatus *tsa_entry;
-} TabStatHashEntry;
+    0,                            /* PGSTAT_TYPE_ALL: not an entry */
+    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_StatTabEntry),    /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_StatFuncEntry)    /* PGSTAT_TYPE_FUNCTION */
+};
 
-/*
- * Hash table for O(1) t_id -> tsa_entry lookup
- */
-static HTAB *pgStatTabHash = NULL;
+/* Ditto for local statistics entries */
+static size_t pgstat_localentsize[] =
+{
+    0,                            /* PGSTAT_TYPE_ALL: not an entry */
+    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_TableStatus), /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_BackendFunctionEntry) /* PGSTAT_TYPE_FUNCTION */
+};
+
+/* struct for shared statistics hash entry key. */
+typedef struct PgStatHashEntryKey
+{
+    PgStatTypes type;            /* statistics entry type */
+    Oid            databaseid;        /* database ID. InvalidOid for shared objects. */
+    Oid            objectid;        /* object OID */
+}            PgStatHashEntryKey;
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * Stats numbers that are waiting to be flushed out to shared stats are held
+ * in pgStatLocalHash.
  */
-static HTAB *pgStatFunctions = NULL;
+typedef struct PgStatHashEntry
+{
+    PgStatHashEntryKey key;        /* hash key */
+    dsa_pointer env;            /* pointer to shared stats envelope in DSA */
+}            PgStatHashEntry;
+
+/* struct for shared statistics entry pointed to from shared hash entry. */
+typedef struct PgStatEnvelope
+{
+    PgStatTypes type;            /* statistics entry type */
+    Oid            databaseid;        /* databaseid */
+    Oid            objectid;        /* objectid */
+    size_t        len;            /* length of body, fixed per type. */
+    LWLock        lock;            /* lightweight lock to protect body */
+    int            body[FLEXIBLE_ARRAY_MEMBER];    /* statistics body */
+}            PgStatEnvelope;
+
+#define PgStatEnvelopeSize(bodylen) \
+    (offsetof(PgStatEnvelope, body) + (bodylen))
+
+/* struct for shared SLRU stats */
+typedef struct PgStatSharedSLRUStats
+{
+    PgStat_StatSLRUEntry    entry[SLRU_NUM_ELEMENTS];
+    LWLock                    lock;
+} PgStatSharedSLRUStats;
+
+/* struct for shared statistics local hash entry. */
+typedef struct PgStatLocalHashEntry
+{
+    PgStatHashEntryKey key;        /* hash key */
+    PgStatEnvelope *env;        /* pointer to stats envelope in heap */
+}            PgStatLocalHashEntry;
 
 /*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
+ * A snapshot is a stats entry that is locally copied to offer stable values
+ * for a transaction.
  */
-static bool have_function_stats = false;
+typedef struct PgStatSnapshot
+{
+    PgStatHashEntryKey key;
+    bool        negative;
+    int            body[FLEXIBLE_ARRAY_MEMBER];    /* statistics body */
+}            PgStatSnapshot;
+
+#define PgStatSnapshotSize(bodylen)                \
+    (offsetof(PgStatSnapshot, body) + (bodylen))
+
+/* parameter for shared hashes */
+static const dshash_parameters dsh_rootparams = {
+    sizeof(PgStatHashEntryKey),
+    sizeof(PgStatHashEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+
+/* The shared hash to index activity stats entries. */
+static dshash_table *pgStatSharedHash = NULL;
+
+/* Local stats numbers are stored here */
+static HTAB *pgStatLocalHash = NULL;
+
+#define HAVE_ANY_PENDING_STATS()                        \
+    (pgStatLocalHash != NULL ||                            \
+     pgStatXactCommit != 0 || pgStatXactRollback != 0)
 
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
@@ -256,11 +317,10 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
+static MemoryContext pgStatSnapshotContext = NULL;
+static HTAB *pgStatSnapshotHash = NULL;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -269,20 +329,19 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Cluster wide statistics.
+ *
+ * Contains statistics that are collected neither per database nor per
+ * table.  shared_* points to shared memory and snapshot_* are backend
+ * snapshots.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
+static bool global_snapshot_is_valid = false;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats snapshot_globalStats;
+static PgStatSharedSLRUStats *shared_SLRUStats;
+static PgStat_StatSLRUEntry snapshot_SLRUStats[SLRU_NUM_ELEMENTS];
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -291,526 +350,278 @@ static List *pending_write_requests = NIL;
  */
 static instr_time total_func_time;
 
+/*
+ * Newly created shared stats entries need to be initialized before other
+ * processes get access to them. get_stat_entry() calls this for that purpose.
+ */
+typedef void (*entry_initializer) (PgStatEnvelope * env);
 
 /* ----------
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgstat_beshutdown_hook(int code, Datum arg);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                                                 Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStatEnvelope * get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                                       bool nowait,
+                                       entry_initializer initfunc, bool *found);
+static PgStatEnvelope * *collect_stat_entries(PgStatTypes type, Oid dbid);
+static void create_missing_dbentries(void);
+static void pgstat_write_database_stats(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_db_statsfile(PgStat_StatDBEntry *dbentry);
+
+static bool flush_tabstat(PgStatEnvelope * env, bool nowait);
+static bool flush_funcstat(PgStatEnvelope * env, bool nowait);
+static bool flush_dbstat(PgStatEnvelope * env, bool nowait);
+static bool flush_slrustat(bool nowait);
+
+static void init_tabentry(PgStatEnvelope * env);
+static void init_funcentry(PgStatEnvelope * env);
+static void init_dbentry(PgStatEnvelope * env);
+
+static bool delete_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                              bool nowait);
+static PgStatEnvelope * get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                                             bool create, bool *found);
+static PgStat_StatDBEntry *get_local_dbstat_entry(Oid dbid);
+static PgStat_TableStatus *get_local_tabstat_entry(Oid rel_id, bool isshared);
+
+static PgStat_SubXactStatus *get_tabstat_stack_level(int nest_level);
+static void add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level);
+static void pgstat_snapshot_global_stats(void);
+
 static void pgstat_read_current_status(void);
-
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
-
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
-static void pgstat_send_slru(void);
-static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
-
-static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
-
-static void pgstat_setup_memcxt(void);
-
 static const char *pgstat_get_wait_activity(WaitEventActivity w);
 static const char *pgstat_get_wait_client(WaitEventClient w);
 static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute shared memory space needed for activity statistics
+ */
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+/*
+ * StatsShmemInit - initialize during shared-memory creation
  */
 void
-pgstat_init(void)
+StatsShmemInit(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
+    bool        found;
 
-#define TESTBYTEVAL ((char) 199)
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
 
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
-
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
-    {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
-    }
-
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
+    if (!IsUnderPostmaster)
     {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
-
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
-
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
+        Assert(!found);
 
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+    }
+}
 
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ *    Create pgStatLocalContext and pgStatSnapshotContext, if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+    if (!pgStatLocalContext)
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+
+    if (!pgStatSnapshotContext)
+        pgStatSnapshotContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Database statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+}
 
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
 
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+/* ----------
+ * attach_shared_stats() -
+ *
+ *    Attach to shared stats memory, creating it if needed. The first process
+ *    to use the activity stats system also reads saved statistics files.
+ * ----------
+ */
+static void
+attach_shared_stats(void)
+{
+    MemoryContext oldcontext;
 
-        test_byte++;            /* just make sure variable is changed */
+    /*
+     * Use DSM only in postmaster children, and only when tracking counts.
+     */
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
 
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    pgstat_setup_memcxt();
 
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    if (area)
+        return;
 
-        /* If we get here, we have a working socket */
-        break;
-    }
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
 
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
     /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
+     * The last process to detach is responsible for writing out the stats
+     * files at exit. Maintain a refcount so that an exiting process can tell
+     * whether it is the last one.
      */
-    if (!pg_set_noblock(pgStatSock))
+    if (StatsShmem->refcount > 0)
+        StatsShmem->refcount++;
+    else
     {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
+        /* We're the first process to attach to the shared stats memory */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
+
+        /* Initialize shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatSharedHash = dshash_create(area, &dsh_rootparams, 0);
+
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->global_stats =
+            dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+        StatsShmem->archiver_stats =
+            dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+        StatsShmem->slru_stats =
+            dsa_allocate0(area, sizeof(PgStatSharedSLRUStats));
+        StatsShmem->hash_handle = dshash_get_hash_table_handle(pgStatSharedHash);
+
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
+
+        shared_SLRUStats = (PgStatSharedSLRUStats *)
+            dsa_get_address(area, StatsShmem->slru_stats);
+        LWLockInitialize(&shared_SLRUStats->lock, LWTRANCHE_STATS);
+
+        /* Load saved data if any. */
+        pgstat_read_statsfiles();
+
+        StatsShmem->refcount = 1;
     }
 
+    LWLockRelease(StatsLock);
+
     /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
+     * If we're not the first process, attach to the existing shared stats
+     * area outside the StatsLock section.
      */
+    if (!area)
     {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
+        /* Attach shared area. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatSharedHash = dshash_attach(area, &dsh_rootparams,
+                                         StatsShmem->hash_handle, 0);
+
+        /* Setup local variables */
+        pgStatLocalHash = NULL;
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
+        shared_SLRUStats = (PgStatSharedSLRUStats *)
+            dsa_get_address(area, StatsShmem->slru_stats);
     }
 
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    /* Now that we have a long-lived socket, tell fd.c about it. */
-    ReserveExternalFD();
+    MemoryContextSwitchTo(oldcontext);
 
-    return;
-
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
-
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
-
-    /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
-     */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    /* don't detach automatically */
+    dsa_pin_mapping(area);
+    global_snapshot_is_valid = false;
 }
 
-/*
- * subroutine for pgstat_reset_all
+/* ----------
+ * detach_shared_stats() -
+ *
+ *    Detach shared stats. Write out to file if we're the last process and told
+ *    to do so.
+ * ----------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+detach_shared_stats(bool write_stats)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    /* return immediately if there is nothing to detach */
+    if (!area || !IsUnderPostmaster)
+        return;
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
-    {
-        int            nchars;
-        Oid            tmp_oid;
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
+    if (--StatsShmem->refcount < 1)
+    {
         /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
+         * This process is the last one attached to the shared stats
+         * memory. Write out the stats files if requested.
          */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
-
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
+        if (write_stats)
+            pgstat_write_statsfiles();
 
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+        /* No one is using the area. */
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
     }
-    FreeDir(dir);
+
+    LWLockRelease(StatsLock);
+
+    /*
+     * Detach from the area. It is destroyed automatically when the last
+     * process detaches from it.
+     */
+    dsa_detach(area);
+
+    area = NULL;
+    pgStatSharedHash = NULL;
+    shared_globalStats = NULL;
+    shared_archiverStats = NULL;
+    pgStatLocalHash = NULL;
+    global_snapshot_is_valid = false;
 }
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
-}
+    /* standalone server doesn't use shared stats */
+    if (!IsUnderPostmaster)
+        return;
 
-#ifdef EXEC_BACKEND
+    /* we must have shared stats attached */
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
-    char       *av[10];
-    int            ac = 0;
-
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
-
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
-
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
-
-
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
-
-    /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
-     */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
-
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
+    /* Startup must be the only user of shared stats */
+    Assert(StatsShmem->refcount == 1);
 
     /*
-     * Okay, fork off the collector.
+     * We could directly remove the files and recreate the shared memory
+     * area, but for simplicity just discard the area and create it again.
      */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
+    detach_shared_stats(false); /* Don't write files. */
+    attach_shared_stats();
 }
 
 /* ------------------------------------------------------------
@@ -818,147 +629,491 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *    Updates are applied no more often than once every PGSTAT_MIN_INTERVAL
+ *    milliseconds. If force is false, they are also postponed on lock
+ *    failure, unless some update has been pending for longer than
+ *    PGSTAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
+ *    succeeding calls of this function.
+ *
+ *    Returns the time in milliseconds until the next attempt to apply
+ *    updates if any remain pending or it is too soon to flush; returns 0
+ *    otherwise.
+ *
+ *    Note that this is called only outside a transaction, so it is fine to use
  *    transaction stop time as an approximation of current time.
- * ----------
+ *    ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz next_flush = 0;
+    static TimestampTz pending_since = 0;
+    static long retry_interval = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
+    bool        nowait = !force;    /* don't use force after this point */
+    HASH_SEQ_STATUS scan;
+    PgStatLocalHashEntry *lent;
+    PgStatLocalHashEntry **dbentlist;
+    int            dbentlistlen = 8;
+    int            ndbentries = 0;
+    int            remains = 0;
     int            i;
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (area == NULL || !HAVE_ANY_PENDING_STATS())
+        return 0;
+
+    dbentlist = palloc(sizeof(PgStatLocalHashEntry *) * dbentlistlen);
 
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
     now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
 
-    /*
-     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
-     */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
-    pgStatTabHash = NULL;
+    if (nowait)
+    {
+        /*
+         * Don't flush stats too frequently.  Return the time to the next
+         * flush.
+         */
+        if (now < next_flush)
+        {
+            /* Remember when updates first became pending. */
+            if (pending_since == 0)
+                pending_since = now;
+
+            return (next_flush - now) / 1000;
+        }
+
+        /* But, don't keep pending updates longer than PGSTAT_MAX_INTERVAL. */
+
+        if (pending_since > 0 &&
+            TimestampDifferenceExceeds(pending_since, now, PGSTAT_MAX_INTERVAL))
+            nowait = false;
+    }
 
     /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
+     * flush_tabstat applies some of the counts from flushed entries to the
+     * local database stats, so flush out database stats afterwards.
      */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
-    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+    if (pgStatLocalHash)
     {
-        for (i = 0; i < tsa->tsa_used; i++)
+        /* Step 1: flush out everything other than database stats */
+        hash_seq_init(&scan, pgStatLocalHash);
+        while ((lent = (PgStatLocalHashEntry *) hash_seq_search(&scan)) != NULL)
         {
-            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
+            bool        remove = false;
 
-            /* Shouldn't have any pending transaction-dependent counts */
-            Assert(entry->trans == NULL);
+            switch (lent->env->type)
+            {
+                case PGSTAT_TYPE_DB:
+                    if (ndbentries >= dbentlistlen)
+                    {
+                        dbentlistlen *= 2;
+                        dbentlist = repalloc(dbentlist,
+                                             sizeof(PgStatLocalHashEntry *) *
+                                             dbentlistlen);
+                    }
+                    dbentlist[ndbentries++] = lent;
+                    break;
+                case PGSTAT_TYPE_TABLE:
+                    if (flush_tabstat(lent->env, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_FUNCTION:
+                    if (flush_funcstat(lent->env, nowait))
+                        remove = true;
+                    break;
+                default:
+                    Assert(false);
+            }
 
-            /*
-             * Ignore entries that didn't accumulate any actual counts, such
-             * as indexes that were opened by the planner but not used.
-             */
-            if (memcmp(&entry->t_counts, &all_zeroes,
-                       sizeof(PgStat_TableCounts)) == 0)
+            if (!remove)
+            {
+                remains++;
                 continue;
+            }
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* Remove the successfully flushed entry */
+            pfree(lent->env);
+            hash_search(pgStatLocalHash, &lent->key, HASH_REMOVE, NULL);
+        }
+
+        /* Step 2: flush out database stats */
+        for (i = 0; i < ndbentries; i++)
+        {
+            PgStatLocalHashEntry *lent = dbentlist[i];
+
+            if (flush_dbstat(lent->env, nowait))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                remains--;
+                /* Remove the successfully flushed entry */
+                pfree(lent->env);
+                hash_search(pgStatLocalHash, &lent->key, HASH_REMOVE, NULL);
             }
         }
-        /* zero out PgStat_TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
+        pfree(dbentlist);
+
+        if (remains <= 0)
+        {
+            hash_destroy(pgStatLocalHash);
+            pgStatLocalHash = NULL;
+        }
     }
 
+    /* flush SLRU stats */
+    flush_slrustat(nowait);
+
+    /* Publish the last flush time */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (shared_globalStats->stats_timestamp < now)
+        shared_globalStats->stats_timestamp = now;
+    LWLockRelease(StatsLock);
+
     /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
+     * If we have pending local stats, let the caller know the retry interval.
      */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    if (HAVE_ANY_PENDING_STATS())
+    {
+        /* Remember when updates first became pending. */
+        if (pending_since == 0)
+            pending_since = now;
+
+        /* The interval is doubled at every retry. */
+        if (retry_interval == 0)
+            retry_interval = PGSTAT_RETRY_MIN_INTERVAL * 1000;
+        else
+            retry_interval = retry_interval * 2;
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+        /*
+         * Determine the next retry interval so as not to get shorter than the
+         * previous interval.
+         */
+        if (!TimestampDifferenceExceeds(pending_since,
+                                        now + 2 * retry_interval,
+                                        PGSTAT_MAX_INTERVAL))
+            next_flush = now + retry_interval;
+        else
+        {
+            next_flush = pending_since + PGSTAT_MAX_INTERVAL * 1000;
+            retry_interval = next_flush - now;
+        }
 
-    /* Finally send SLRU statistics */
-    pgstat_send_slru();
+        return retry_interval / 1000;
+    }
+
+    /* Set the next time to update stats */
+    next_flush = now + PGSTAT_MIN_INTERVAL * 1000;
+    retry_interval = 0;
+    pending_since = 0;
+
+    return 0;
 }
 
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * flush_tabstat - flush out a local table stats entry
+ *
+ * Some of the counts are copied to the local database stats entry after a
+ * successful flush.
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_tabstat(PgStatEnvelope *lenv, bool nowait)
+{
+    static const PgStat_TableCounts all_zeroes;
+    Oid            dboid;            /* database OID of the table */
+    PgStat_TableStatus *lstats; /* local stats entry  */
+    PgStatEnvelope *shenv;        /* shared stats envelope */
+    PgStat_StatTabEntry *shtabstats;    /* table entry of shared stats */
+    PgStat_StatDBEntry *ldbstats;    /* local database entry */
+    bool        found;
+
+    Assert(lenv->type == PGSTAT_TYPE_TABLE);
+
+    lstats = (PgStat_TableStatus *) &lenv->body;
+    dboid = lstats->t_shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Ignore entries that didn't accumulate any actual counts, such as
+     * indexes that were opened by the planner but not used.
+     */
+    if (memcmp(&lstats->t_counts, &all_zeroes,
+               sizeof(PgStat_TableCounts)) == 0)
+        return true;
+
+    /* find shared table stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_TABLE, dboid, lstats->t_id,
+                           nowait, init_tabentry, &found);
+
+    /* skip if dshash failed to acquire lock */
+    if (shenv == NULL)
+        return false;
+
+    /* retrieve the shared table stats entry from the envelope */
+    shtabstats = (PgStat_StatTabEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;
+
+    /* add the values to the shared entry. */
+    shtabstats->numscans += lstats->t_counts.t_numscans;
+    shtabstats->tuples_returned += lstats->t_counts.t_tuples_returned;
+    shtabstats->tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    shtabstats->tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    shtabstats->tuples_updated += lstats->t_counts.t_tuples_updated;
+    shtabstats->tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    shtabstats->tuples_hot_updated += lstats->t_counts.t_tuples_hot_updated;
+
+    /*
+     * If the table was truncated or vacuum/analyze has run, first reset the
+     * live/dead counters.
+     */
+    if (lstats->t_counts.t_truncated ||
+        lstats->t_counts.vacuum_count > 0 ||
+        lstats->t_counts.analyze_count > 0 ||
+        lstats->t_counts.autovac_vacuum_count > 0 ||
+        lstats->t_counts.autovac_analyze_count > 0)
+    {
+        shtabstats->n_live_tuples = 0;
+        shtabstats->n_dead_tuples = 0;
+    }
+
+    /* clear the change counter if requested */
+    if (lstats->t_counts.reset_changed_tuples)
+        shtabstats->changes_since_analyze = 0;
+
+    shtabstats->n_live_tuples += lstats->t_counts.t_delta_live_tuples;
+    shtabstats->n_dead_tuples += lstats->t_counts.t_delta_dead_tuples;
+    shtabstats->changes_since_analyze += lstats->t_counts.t_changed_tuples;
+    shtabstats->blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    shtabstats->blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /*
+     * Update vacuum/analyze timestamps and counters, making sure that the
+     * timestamps never go backwards.
+     */
+    if (shtabstats->vacuum_timestamp < lstats->vacuum_timestamp)
+        shtabstats->vacuum_timestamp = lstats->vacuum_timestamp;
+    shtabstats->vacuum_count += lstats->t_counts.vacuum_count;
+
+    if (shtabstats->autovac_vacuum_timestamp < lstats->autovac_vacuum_timestamp)
+        shtabstats->autovac_vacuum_timestamp = lstats->autovac_vacuum_timestamp;
+    shtabstats->autovac_vacuum_count += lstats->t_counts.autovac_vacuum_count;
+
+    if (shtabstats->analyze_timestamp < lstats->analyze_timestamp)
+        shtabstats->analyze_timestamp = lstats->analyze_timestamp;
+    shtabstats->analyze_count += lstats->t_counts.analyze_count;
+
+    if (shtabstats->autovac_analyze_timestamp < lstats->autovac_analyze_timestamp)
+        shtabstats->autovac_analyze_timestamp = lstats->autovac_analyze_timestamp;
+    shtabstats->autovac_analyze_count += lstats->t_counts.autovac_analyze_count;
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    shtabstats->n_live_tuples = Max(shtabstats->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    shtabstats->n_dead_tuples = Max(shtabstats->n_dead_tuples, 0);
+
+    LWLockRelease(&shenv->lock);
+
+    /* The entry was flushed successfully; add the same counts to database stats */
+    ldbstats = get_local_dbstat_entry(dboid);
+    ldbstats->counts.n_tuples_returned += lstats->t_counts.t_tuples_returned;
+    ldbstats->counts.n_tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    ldbstats->counts.n_tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    ldbstats->counts.n_tuples_updated += lstats->t_counts.t_tuples_updated;
+    ldbstats->counts.n_tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    ldbstats->counts.n_blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    ldbstats->counts.n_blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    return true;
+}
+
+/* ----------
+ * init_tabentry() -
+ *
+ * initializes table stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
  */
 static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+init_tabentry(PgStatEnvelope * env)
 {
-    int            n;
-    int            len;
+    PgStat_StatTabEntry *tabent = (PgStat_StatTabEntry *) &env->body;
+
+    /*
+     * Initialize all counters to zero. This is also used to reset an existing
+     * entry, e.g. by pgstat_reset_counters().
+     */
+    Assert(env->type == PGSTAT_TYPE_TABLE);
+    tabent->tableid = env->objectid;
+    tabent->numscans = 0;
+    tabent->tuples_returned = 0;
+    tabent->tuples_fetched = 0;
+    tabent->tuples_inserted = 0;
+    tabent->tuples_updated = 0;
+    tabent->tuples_deleted = 0;
+    tabent->tuples_hot_updated = 0;
+    tabent->n_live_tuples = 0;
+    tabent->n_dead_tuples = 0;
+    tabent->changes_since_analyze = 0;
+    tabent->blocks_fetched = 0;
+    tabent->blocks_hit = 0;
+
+    tabent->vacuum_timestamp = 0;
+    tabent->vacuum_count = 0;
+    tabent->autovac_vacuum_timestamp = 0;
+    tabent->autovac_vacuum_count = 0;
+    tabent->analyze_timestamp = 0;
+    tabent->analyze_count = 0;
+    tabent->autovac_analyze_timestamp = 0;
+    tabent->autovac_analyze_count = 0;
+}
+
+
+/*
+ * flush_funcstat - flush out a local function stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_funcstat(PgStatEnvelope * env, bool nowait)
+{
+    /* we assume this inits to all zeroes: */
+    static const PgStat_FunctionCounts all_zeroes;
+    PgStat_BackendFunctionEntry *localent;    /* local stats entry */
+    PgStatEnvelope *shenv;        /* shared stats envelope */
+    PgStat_StatFuncEntry *sharedent = NULL; /* shared stats entry */
+    bool        found;
+
+    Assert(env->type == PGSTAT_TYPE_FUNCTION);
+    localent = (PgStat_BackendFunctionEntry *) &env->body;
+
+    /* Skip it if no counts accumulated for it so far */
+    if (memcmp(&localent->f_counts, &all_zeroes,
+               sizeof(PgStat_FunctionCounts)) == 0)
+        return true;
+
+    /* find shared function stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId, localent->f_id,
+                           nowait, init_funcentry, &found);
+    /* skip if dshash failed to acquire lock */
+    if (shenv == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    /* retrieve the shared function stats entry from the envelope */
+    sharedent = (PgStat_StatFuncEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
+
+    sharedent->f_numcalls += localent->f_counts.f_numcalls;
+    sharedent->f_total_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_total_time);
+    sharedent->f_self_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_self_time);
+
+    LWLockRelease(&shenv->lock);
+
+    return true;
+}
+
+
+/* ----------
+ * init_funcentry() -
+ *
+ * initializes function stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
+ */
+static void
+init_funcentry(PgStatEnvelope * env)
+{
+    PgStat_StatFuncEntry *shstat = (PgStat_StatFuncEntry *) &env->body;
+
+    Assert(env->type == PGSTAT_TYPE_FUNCTION);
+    shstat->functionid = env->objectid;
+    shstat->f_numcalls = 0;
+    shstat->f_total_time = 0;
+    shstat->f_self_time = 0;
+}
+
+
+/*
+ * flush_dbstat - flush out a local database stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_dbstat(PgStatEnvelope * env, bool nowait)
+{
+    PgStat_StatDBEntry *localent;
+    PgStatEnvelope *shenv;
+    PgStat_StatDBEntry *sharedent;
+
+    Assert(env->type == PGSTAT_TYPE_DB);
+
+    localent = (PgStat_StatDBEntry *) &env->body;
+
+    /* find shared database stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_DB, localent->databaseid, InvalidOid,
+                           nowait, init_dbentry, NULL);
+
+    /* skip if dshash failed to acquire lock */
+    if (!shenv)
+        return false;
+
+    /* retrieve the shared stats entry from the envelope */
+    sharedent = (PgStat_StatDBEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;
+
+    sharedent->counts.n_tuples_returned += localent->counts.n_tuples_returned;
+    sharedent->counts.n_tuples_fetched += localent->counts.n_tuples_fetched;
+    sharedent->counts.n_tuples_inserted += localent->counts.n_tuples_inserted;
+    sharedent->counts.n_tuples_updated += localent->counts.n_tuples_updated;
+    sharedent->counts.n_tuples_deleted += localent->counts.n_tuples_deleted;
+    sharedent->counts.n_blocks_fetched += localent->counts.n_blocks_fetched;
+    sharedent->counts.n_blocks_hit += localent->counts.n_blocks_hit;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    sharedent->counts.n_deadlocks += localent->counts.n_deadlocks;
+    sharedent->counts.n_temp_bytes += localent->counts.n_temp_bytes;
+    sharedent->counts.n_temp_files += localent->counts.n_temp_files;
+    sharedent->counts.n_checksum_failures += localent->counts.n_checksum_failures;
 
     /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
+     * Accumulate xact commit/rollback and I/O timings to stats entry of the
+     * current database.
      */
-    if (OidIsValid(tsmsg->m_databaseid))
+    if (OidIsValid(localent->databaseid))
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
+        sharedent->counts.n_xact_commit += pgStatXactCommit;
+        sharedent->counts.n_xact_rollback += pgStatXactRollback;
+        sharedent->counts.n_block_read_time += pgStatBlockReadTime;
+        sharedent->counts.n_block_write_time += pgStatBlockWriteTime;
         pgStatXactCommit = 0;
         pgStatXactRollback = 0;
         pgStatBlockReadTime = 0;
@@ -966,257 +1121,149 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        sharedent->counts.n_xact_commit = 0;
+        sharedent->counts.n_xact_rollback = 0;
+        sharedent->counts.n_block_read_time = 0;
+        sharedent->counts.n_block_write_time = 0;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
+    LWLockRelease(&shenv->lock);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    return true;
 }
 
-/*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
+
+/* ----------
+ * init_dbentry() -
+ *
+ * initializes database stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
  */
 static void
-pgstat_send_funcstats(void)
+init_dbentry(PgStatEnvelope * env)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
-
-    if (pgStatFunctions == NULL)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
+    PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) &env->body;
+
+    Assert(env->type == PGSTAT_TYPE_DB);
+    dbentry->databaseid = env->databaseid;
+    dbentry->last_autovac_time = 0;
+    dbentry->last_checksum_failure = 0;
+    dbentry->stat_reset_timestamp = 0;
+    dbentry->stats_timestamp = 0;
+    /* initialize the new shared entry */
+    MemSet(&dbentry->counts, 0, sizeof(PgStat_StatDBCounts));
+}
 
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        PgStat_FunctionEntry *m_ent;
 
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
+/*
+ * flush_slrustat - flush out local SLRU stats entries
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if all local SLRU stats are successfully flushed out.
+ */
+static bool
+flush_slrustat(bool nowait)
+{
+    int i;
 
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
+    if (!have_slrustats)
+        return true;
 
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shared_SLRUStats->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shared_SLRUStats->lock, LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
 
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
+    for (i = 0 ; i < SLRU_NUM_ELEMENTS ; i++)
+    {
+        PgStat_StatSLRUEntry *sharedent = &shared_SLRUStats->entry[i];
+        PgStat_StatSLRUEntry *localent = &local_SLRUStats[i];
+
+        sharedent->blocks_zeroed += localent->blocks_zeroed;
+        sharedent->blocks_hit += localent->blocks_hit;
+        sharedent->blocks_read += localent->blocks_read;
+        sharedent->blocks_written += localent->blocks_written;
+        sharedent->blocks_exists += localent->blocks_exists;
+        sharedent->flush += localent->flush;
+        sharedent->truncate += localent->truncate;
+    }
+
+    /* done, clear the local entry */
+    MemSet(local_SLRUStats, 0,
+           sizeof(PgStat_StatSLRUEntry) * SLRU_NUM_ELEMENTS);
+    LWLockRelease(&shared_SLRUStats->lock);
+
+    have_slrustats = false;
+
+    return true;
+}
 
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
 
-    have_function_stats = false;
+/*
+ * Create the filename for a DB stat file; filename is an output parameter that
+ * points to a character buffer of length len.
+ */
+static void
+get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
+{
+    int            printed;
+
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/db_%u.%s",
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
+                       databaseid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
 }
 
 
 /* ----------
- * pgstat_vacuum_stat() -
+ * collect_stat_entries() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *    Collect the shared statistics entries specified by type and dbid. Returns a
+ *  NULL-terminated list of pointers to shared statistics entries in palloc'ed
+ *  memory. If type is PGSTAT_TYPE_ALL, statistics of all types for the
+ *  database are collected. If type is PGSTAT_TYPE_DB, dbid is ignored and all
+ *  PGSTAT_TYPE_DB entries are collected.
  * ----------
  */
-void
-pgstat_vacuum_stat(void)
+static PgStatEnvelope * *collect_stat_entries(PgStatTypes type, Oid dbid)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Read pg_database and make a list of OIDs of all existing databases
-     */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
-
-    /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            dbid = dbentry->databaseid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        /* the DB entry for shared tables (with InvalidOid) is never dropped */
-        if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
-            pgstat_drop_database(dbid);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Lookup our own database entry; if not found, nothing more to do.
-     */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
-        return;
-
-    /*
-     * Similarly to above, make a list of all known relations in this DB.
-     */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
-
-    /*
-     * Check for all tables listed in stats hashtable if they still exist.
-     */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
+    int            listlen = 16;
+    PgStatEnvelope **envlist = palloc(sizeof(PgStatEnvelope *) * listlen);
+    int            n = 0;
+
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
     {
-        Oid            tabid = tabentry->tableid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+        if ((type != PGSTAT_TYPE_ALL && p->key.type != type) ||
+            (type != PGSTAT_TYPE_DB && p->key.databaseid != dbid))
             continue;
 
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
+        if (n >= listlen - 1)
         {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
+            listlen *= 2;
+            envlist = repalloc(envlist, listlen * sizeof(PgStatEnvelope *));
         }
+        envlist[n++] = dsa_get_address(area, p->env);
     }
+    dshash_seq_term(&hstat);
 
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
+    envlist[n] = NULL;
 
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
-        {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
-                continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
-        }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
-    }
+    return envlist;
 }
 
 
 /* ----------
- * pgstat_collect_oids() -
+ * collect_oids() -
  *
  *    Collect the OIDs of all objects listed in the specified system catalog
  *    into a temporary hash table.  Caller should hash_destroy the result
@@ -1225,7 +1272,7 @@ pgstat_vacuum_stat(void)
  * ----------
  */
 static HTAB *
-pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
+collect_oids(Oid catalogid, AttrNumber anum_oid)
 {
     HTAB       *htab;
     HASHCTL        hash_ctl;
@@ -1239,7 +1286,7 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     hash_ctl.entrysize = sizeof(Oid);
     hash_ctl.hcxt = CurrentMemoryContext;
     htab = hash_create("Temporary table of OIDs",
-                       PGSTAT_TAB_HASH_SIZE,
+                       PGSTAT_TABLE_HASH_SIZE,
                        &hash_ctl,
                        HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
 
@@ -1266,65 +1313,185 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
 }
 
 
+/* ----------
+ * pgstat_vacuum_stat() -
+ *
+ *  Delete shared stat entries that are not in system catalogs.
+ *
+ *  To avoid holding an exclusive lock on the dshash for a long time, the
+ *  process is performed in three steps.
+ *
+ *   1: Collect the OIDs of all existing objects of every kind.
+ *   2: Collect victim entries by scanning with a shared lock.
+ *   3: Try removing every nominated entry without waiting for the lock.
+ *
+ *  As a consequence of the last step, some entries may be left behind due to
+ *  lock failure, but they will be deleted by later invocations of this
+ *  function.
+ * ----------
+ */
+void
+pgstat_vacuum_stat(void)
+{
+    HTAB       *dbids;            /* database ids */
+    HTAB       *relids;            /* relation ids in the current database */
+    HTAB       *funcids;        /* function ids in the current database */
+    PgStatEnvelope **victims;    /* victim entry list */
+    int            arraylen = 0;    /* storage size of the above */
+    int            nvictims = 0;    /* # of entries of the above */
+    dshash_seq_status dshstat;
+    PgStatHashEntry *ent;
+    int            i;
+
+    /* we don't collect stats under standalone mode */
+    if (!IsUnderPostmaster)
+        return;
+
+    /* collect oids of existent objects */
+    dbids = collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    relids = collect_oids(RelationRelationId, Anum_pg_class_oid);
+    funcids = collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
+
+    /* collect victims from shared stats */
+    arraylen = 16;
+    victims = palloc(sizeof(PgStatEnvelope *) * arraylen);
+    nvictims = 0;
+
+    dshash_seq_init(&dshstat, pgStatSharedHash, false);
+
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
+    {
+        HTAB       *oidtab;
+        Oid           *key;
+
+        CHECK_FOR_INTERRUPTS();
+
+        /*
+         * Don't drop entries other than database entries unless they belong
+         * to the current database.
+         */
+        if (ent->key.type != PGSTAT_TYPE_DB &&
+            ent->key.databaseid != MyDatabaseId)
+            continue;
+
+        switch (ent->key.type)
+        {
+            case PGSTAT_TYPE_DB:
+                /* don't remove database entry for shared tables */
+                if (ent->key.databaseid == 0)
+                    continue;
+                oidtab = dbids;
+                key = &ent->key.databaseid;
+                break;
+
+            case PGSTAT_TYPE_TABLE:
+                oidtab = relids;
+                key = &ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                oidtab = funcids;
+                key = &ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_ALL:
+                Assert(false);
+                break;
+        }
+
+        /* Skip objects that still exist. */
+        if (hash_search(oidtab, key, HASH_FIND, NULL) != NULL)
+            continue;
+
+        /* extend the list if needed */
+        if (nvictims >= arraylen)
+        {
+            arraylen *= 2;
+            victims = repalloc(victims, sizeof(PgStatEnvelope *) * arraylen);
+        }
+
+        victims[nvictims++] = dsa_get_address(area, ent->env);
+    }
+    dshash_seq_term(&dshstat);
+    hash_destroy(dbids);
+    hash_destroy(relids);
+    hash_destroy(funcids);
+
+    /* Now try removing the victim entries */
+    for (i = 0; i < nvictims; i++)
+    {
+        PgStatEnvelope *p = victims[i];
+
+        delete_stat_entry(p->type, p->databaseid, p->objectid, true);
+    }
+}
+
+
+/* ----------
+ * delete_stat_entry() -
+ *
+ *  Deletes the specified entry from shared stats hash.
+ *
+ *  Returns true when successfully deleted.
+ * ----------
+ */
+static bool
+delete_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait)
+{
+    PgStatHashEntryKey key;
+    PgStatHashEntry *ent;
+
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    ent = dshash_find_extended(pgStatSharedHash, &key,
+                               true, nowait, false, NULL);
+
+    if (!ent)
+        return false;            /* lock failed or not found */
+
+    /* The entry is exclusively locked, so we can free the chunk first. */
+    dsa_free(area, ent->env);
+    dshash_delete_entry(pgStatSharedHash, ent);
+
+    return true;
+}
+
+
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
- * ----------
+ *    Remove the entries for the database that we just dropped.
+ *
+ *    Some entries might be left behind due to lock failure, or some stats might
+ *    be flushed after this, but we will still clean up the dead DB eventually
+ *    via future invocations of pgstat_vacuum_stat().
+ * ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
+    Assert(OidIsValid(databaseid));
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
+    envlist = collect_stat_entries(PGSTAT_TYPE_ALL, databaseid);
 
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
+    for (p = envlist; *p != NULL; p++)
+        delete_stat_entry((*p)->type, (*p)->databaseid, (*p)->objectid, true);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
+    pfree(envlist);
 }
-#endif                            /* NOT_USED */
 
 
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1333,20 +1500,47 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    /* Look up the entries of the current database in the stats hash. */
+    envlist = collect_stat_entries(PGSTAT_TYPE_ALL, MyDatabaseId);
+    for (p = envlist; *p != NULL; p++)
+    {
+        PgStatEnvelope *env = *p;
+        PgStat_StatDBEntry *dbstat;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+        LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+
+        switch (env->type)
+        {
+            case PGSTAT_TYPE_TABLE:
+                init_tabentry(env);
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                init_funcentry(env);
+                break;
+
+            case PGSTAT_TYPE_DB:
+                init_dbentry(env);
+                dbstat = (PgStat_StatDBEntry *) &env->body;
+                dbstat->stat_reset_timestamp = GetCurrentTimestamp();
+                break;
+            default:
+                Assert(false);
+        }
+
+        LWLockRelease(&env->lock);
+    }
+
+    pfree(envlist);
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1355,29 +1549,37 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
+    /* Reset the archiver statistics for the cluster. */
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    /* Reset the bgwriter statistics for the cluster. */
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1386,17 +1588,42 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStatEnvelope *env;
+    PgStat_StatDBEntry *dbentry;
+    PgStatTypes stattype;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    env = get_stat_entry(PGSTAT_TYPE_DB, MyDatabaseId, InvalidOid,
+                         false, NULL, NULL);
+    Assert(env);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* Set the reset timestamp for the whole database */
+    dbentry = (PgStat_StatDBEntry *) &env->body;
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&env->lock);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Reset the counters of the specified object */
+    switch (type)
+    {
+        case RESET_TABLE:
+            stattype = PGSTAT_TYPE_TABLE;
+            break;
+        case RESET_FUNCTION:
+            stattype = PGSTAT_TYPE_FUNCTION;
+    }
+
+    env = get_stat_entry(stattype, MyDatabaseId, objoid, false, NULL, NULL);
+    LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+    if (env->type == PGSTAT_TYPE_TABLE)
+        init_tabentry(env);
+    else
+    {
+        Assert(env->type == PGSTAT_TYPE_FUNCTION);
+        init_funcentry(env);
+    }
+    LWLockRelease(&env->lock);
 }
 
 /* ----------
@@ -1412,15 +1639,27 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_reset_slru_counter(const char *name)
 {
-    PgStat_MsgResetslrucounter msg;
+    int i;
+    TimestampTz    ts = GetCurrentTimestamp();
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSLRUCOUNTER);
-    msg.m_index = (name) ? pgstat_slru_index(name) : -1;
-
-    pgstat_send(&msg, sizeof(msg));
+    if (name)
+    {
+        i = pgstat_slru_index(name);
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(&shared_SLRUStats->entry[i], 0, sizeof(PgStat_StatSLRUEntry));
+        shared_SLRUStats->entry[i].stat_reset_timestamp = ts;
+        LWLockRelease(StatsLock);
+    }
+    else
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
+        {
+            MemSet(&shared_SLRUStats->entry[i], 0, sizeof(PgStat_StatSLRUEntry));
+            shared_SLRUStats->entry[i].stat_reset_timestamp = ts;
+        }
+        LWLockRelease(StatsLock);
+    }
 }
 
 /* ----------
@@ -1434,48 +1673,63 @@ pgstat_reset_slru_counter(const char *name)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if activity stats is not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    ts = GetCurrentTimestamp();
 
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Store the last autovacuum time in the database's hash table entry.
+     */
+    dbentry = get_local_dbstat_entry(dboid);
+    dbentry->last_autovac_time = ts;
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report about the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    PgStat_TableStatus *tabentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    ts = GetCurrentTimestamp();
+    tabentry = get_local_tabstat_entry(tableoid, shared);
+
+    tabentry->t_counts.t_delta_live_tuples = livetuples;
+    tabentry->t_counts.t_delta_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = ts;
+        tabentry->t_counts.autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = ts;
+        tabentry->t_counts.vacuum_count++;
+    }
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report about the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1486,9 +1740,10 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    PgStat_TableStatus *tabentry;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1496,10 +1751,10 @@ pgstat_report_analyze(Relation rel,
      * already inserted and/or deleted rows in the target table. ANALYZE will
      * have counted such rows as live or dead respectively. Because we will
      * report our counts of such rows at transaction end, we should subtract
-     * off these counts from what we send to the collector now, else they'll
-     * be double-counted after commit.  (This approach also ensures that the
-     * collector ends up with the right numbers if we abort instead of
-     * committing.)
+     * off these counts from what we report to the shared stats now, else
+     * they'll be double-counted after commit.  (This approach also ensures
+     * that the shared stats end up with the right numbers if we abort
+     * instead of committing.)
      */
     if (rel->pgstat_info != NULL)
     {
@@ -1517,158 +1772,172 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    tabentry = get_local_tabstat_entry(RelationGetRelid(rel),
+                                       rel->rd_rel->relisshared);
+
+    tabentry->t_counts.t_delta_live_tuples = livetuples;
+    tabentry->t_counts.t_delta_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->t_counts.reset_changed_tuples = true;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->t_counts.autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->t_counts.analyze_count++;
+    }
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            dbent->counts.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            dbent->counts.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            dbent->counts.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            dbent->counts.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            dbent->counts.n_conflict_startup_deadlock++;
+            break;
+    }
 }
 
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-
-/* --------
- * pgstat_report_checksum_failures_in_db() -
- *
- *    Tell the collector about one or more checksum failures.
- * --------
- */
-void
-pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
-{
-    PgStat_MsgChecksumFailure msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
-
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_deadlocks++;
 }
 
 /* --------
  * pgstat_report_checksum_failure() -
  *
- *    Tell the collector about a checksum failure.
+ *    Report a checksum failure.
  * --------
  */
 void
 pgstat_report_checksum_failure(void)
 {
-    pgstat_report_checksum_failures_in_db(MyDatabaseId, 1);
+    PgStat_StatDBEntry *dbent;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_checksum_failures++;
 }
 
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
+    if (filesize == 0)            /* Is there a case where filesize is really 0? */
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_temp_bytes += filesize; /* needs overflow check */
+    dbent->counts.n_temp_files++;
 }
 
 
-/* ----------
- * pgstat_ping() -
+/* --------
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
  *
- *    Send some junk data to the collector to increase traffic.
- * ----------
+ *    Report one or more checksum failures.
+ * --------
  */
 void
-pgstat_ping(void)
+pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
 {
-    PgStat_MsgDummy msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if we are not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dbentry = get_local_dbstat_entry(dboid);
+
+    /* add the reported failures to the accumulated count */
+    dbentry->counts.n_checksum_failures += failurecount;
 }
 
 /* ----------
- * pgstat_send_inquiry() -
+ * pgstat_init_function_usage() -
  *
- *    Notify collector that we need fresh data.
+ *  Initialize function call usage data.
+ *  Called by the executor before invoking a function.
  * ----------
  */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/*
- * Initialize function call usage data.
- * Called by the executor before invoking a function.
- */
 void
 pgstat_init_function_usage(FunctionCallInfo fcinfo,
                            PgStat_FunctionCallUsage *fcu)
 {
+    PgStatEnvelope *env;
     PgStat_BackendFunctionEntry *htabent;
     bool        found;
 
@@ -1679,26 +1948,15 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
         return;
     }
 
-    if (!pgStatFunctions)
-    {
-        /* First time through - initialize function stat table */
-        HASHCTL        hash_ctl;
+    env = get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                               fcinfo->flinfo->fn_oid, true, &found);
+    htabent = (PgStat_BackendFunctionEntry *) &env->body;
 
-        memset(&hash_ctl, 0, sizeof(hash_ctl));
-        hash_ctl.keysize = sizeof(Oid);
-        hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
-        pgStatFunctions = hash_create("Function stat entries",
-                                      PGSTAT_FUNCTION_HASH_SIZE,
-                                      &hash_ctl,
-                                      HASH_ELEM | HASH_BLOBS);
-    }
-
-    /* Get the stats entry for this function, create if necessary */
-    htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
-                          HASH_ENTER, &found);
     if (!found)
         MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
 
+    htabent->f_id = fcinfo->flinfo->fn_oid;
+
     fcu->fs = &htabent->f_counts;
 
     /* save stats for this function, later used to compensate for recursion */
@@ -1711,31 +1969,38 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
     INSTR_TIME_SET_CURRENT(fcu->f_start);
 }
 
-/*
- * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
- *        for specified function
+/* ----------
+ * find_funcstat_entry() -
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_BackendFunctionEntry for the specified function.
+ *
+ *  If there is no entry, return NULL without creating a new one.
+ * ----------
  */
 PgStat_BackendFunctionEntry *
 find_funcstat_entry(Oid func_id)
 {
-    if (pgStatFunctions == NULL)
+    PgStatEnvelope *env;
+
+    env = get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                               func_id, false, NULL);
+    if (!env)
         return NULL;
 
-    return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
-                                                       (void *) &func_id,
-                                                       HASH_FIND, NULL);
+    return (PgStat_BackendFunctionEntry *) &env->body;
 }
 
-/*
- * Calculate function call usage and update stat counters.
- * Called by the executor after invoking a function.
+/* ----------
+ * pgstat_end_function_usage() -
  *
- * In the case of a set-returning function that runs in value-per-call mode,
- * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
- * calls for what the user considers a single call of the function.  The
- * finalize flag should be TRUE on the last call.
+ *  Calculate function call usage and update stat counters.
+ *  Called by the executor after invoking a function.
+ *
+ *  In the case of a set-returning function that runs in value-per-call mode,
+ *  we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
+ *  calls for what the user considers a single call of the function.  The
+ *  finalize flag should be TRUE on the last call.
+ * ----------
  */
 void
 pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
@@ -1776,9 +2041,6 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
         fs->f_numcalls++;
     fs->f_total_time = f_total;
     INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
 }
 
 
@@ -1790,8 +2052,7 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
  *
  *    We assume that a relcache entry's pgstat_info field is zeroed by
  *    relcache.c when the relcache entry is made; thereafter it is long-lived
- *    data.  We can avoid repeated searches of the TabStatus arrays when the
- *    same relation is touched repeatedly within a transaction.
+ *    data.
  * ----------
  */
 void
@@ -1811,7 +2072,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1827,116 +2089,157 @@ pgstat_initstats(Relation rel)
         return;
 
     /* Else find or make the PgStat_TableStatus entry, and update link */
-    rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+    rel->pgstat_info = get_local_tabstat_entry(rel_id, rel->rd_rel->relisshared);
 }
 
-/*
- * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
+
+/* ----------
+ * get_local_stat_entry() -
+ *
+ *  Returns the local stats entry for the given type, dbid and objid.
+ *  If create is true, a new entry is created when none exists yet; found
+ *  must be non-NULL in that case.
+ *
+ *  The caller is responsible for initializing the body part of the
+ *  returned envelope.
+ * ----------
  */
-static PgStat_TableStatus *
-get_tabstat_entry(Oid rel_id, bool isshared)
+static PgStatEnvelope *
+get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                     bool create, bool *found)
 {
-    TabStatHashEntry *hash_entry;
-    PgStat_TableStatus *entry;
-    TabStatusArray *tsa;
-    bool        found;
+    PgStatHashEntryKey key;
+    PgStatLocalHashEntry *entry;
 
-    /*
-     * Create hash table if we don't have it already.
-     */
-    if (pgStatTabHash == NULL)
+    if (pgStatLocalHash == NULL)
     {
         HASHCTL        ctl;
 
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
+        MemSet(&ctl, 0, sizeof(ctl));
+        ctl.keysize = sizeof(PgStatHashEntryKey);
+        ctl.entrysize = sizeof(PgStatLocalHashEntry);
+
+        pgStatLocalHash = hash_create("Local stat entries",
+                                      PGSTAT_TABLE_HASH_SIZE,
+                                      &ctl,
+                                      HASH_ELEM | HASH_BLOBS);
+    }
+
+    /* Find an entry or create a new one. */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    entry = hash_search(pgStatLocalHash, &key,
+                        create ? HASH_ENTER : HASH_FIND, found);
+
+    if (!create && !entry)
+        return NULL;
+
+    if (create && !*found)
+    {
+        int            len = pgstat_localentsize[type];
+
+        entry->env = MemoryContextAlloc(TopMemoryContext,
+                                        PgStatEnvelopeSize(len));
+        entry->env->type = type;
+        entry->env->len = len;
     }
 
+    return entry->env;
+}
+
+/* ----------
+ * get_local_dbstat_entry() -
+ *
+ *  Find or create a local PgStat_StatDBEntry entry for dbid.  A new entry
+ *  is created and initialized if none exists.
+ */
+static PgStat_StatDBEntry *
+get_local_dbstat_entry(Oid dbid)
+{
+    PgStatEnvelope *env;
+    PgStat_StatDBEntry *dbentry;
+    bool        found;
+
     /*
      * Find an entry or create a new one.
      */
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
+    env = get_local_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid,
+                               true, &found);
+    dbentry = (PgStat_StatDBEntry *) &env->body;
+
     if (!found)
     {
-        /* initialize new entry with null pointer */
-        hash_entry->tsa_entry = NULL;
+        dbentry->databaseid = dbid;
+        dbentry->last_autovac_time = 0;
+        dbentry->last_checksum_failure = 0;
+        dbentry->stat_reset_timestamp = 0;
+        dbentry->stats_timestamp = 0;
+        MemSet(&dbentry->counts, 0, sizeof(PgStat_StatDBCounts));
     }
 
-    /*
-     * If entry is already valid, we're done.
-     */
-    if (hash_entry->tsa_entry)
-        return hash_entry->tsa_entry;
-
-    /*
-     * Locate the first pgStatTabList entry with free space, making a new list
-     * entry if needed.  Note that we could get an OOM failure here, but if so
-     * we have left the hashtable and the list in a consistent state.
-     */
-    if (pgStatTabList == NULL)
-    {
-        /* Set up first pgStatTabList entry */
-        pgStatTabList = (TabStatusArray *)
-            MemoryContextAllocZero(TopMemoryContext,
-                                   sizeof(TabStatusArray));
-    }
+    return dbentry;
+}
 
-    tsa = pgStatTabList;
-    while (tsa->tsa_used >= TABSTAT_QUANTUM)
-    {
-        if (tsa->tsa_next == NULL)
-            tsa->tsa_next = (TabStatusArray *)
-                MemoryContextAllocZero(TopMemoryContext,
-                                       sizeof(TabStatusArray));
-        tsa = tsa->tsa_next;
-    }
 
-    /*
-     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
-     * the entry was already zeroed, either at creation or after last use.
-     */
-    entry = &tsa->tsa_entries[tsa->tsa_used++];
-    entry->t_id = rel_id;
-    entry->t_shared = isshared;
+/* ----------
+ * get_local_tabstat_entry() -
+ *  Find or create a PgStat_TableStatus entry for rel.  A new entry is
+ *  created and initialized if none exists.
+ * ----------
+ */
+static PgStat_TableStatus *
+get_local_tabstat_entry(Oid rel_id, bool isshared)
+{
+    PgStatEnvelope *env;
+    PgStat_TableStatus *tabentry;
+    bool        found;
 
-    /*
-     * Now we can fill the entry in pgStatTabHash.
-     */
-    hash_entry->tsa_entry = entry;
+    env = get_local_stat_entry(PGSTAT_TYPE_TABLE,
+                               isshared ? InvalidOid : MyDatabaseId,
+                               rel_id, true, &found);
 
-    return entry;
+    tabentry = (PgStat_TableStatus *) &env->body;
+
+    if (!found)
+    {
+        tabentry->t_id = rel_id;
+        tabentry->t_shared = isshared;
+        tabentry->trans = NULL;
+        MemSet(&tabentry->t_counts, 0, sizeof(PgStat_TableCounts));
+        tabentry->vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+    }
+
+    return tabentry;
 }
 
+
 /*
  * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_TableStatus entry for rel from the current
+ *  database then from shared tables.
  *
- * Note: if we got an error in the most recent execution of pgstat_report_stat,
- * it's possible that an entry exists but there's no hashtable entry for it.
- * That's okay, we'll treat this case as "doesn't exist".
+ *  If no entry, return NULL, don't create a new one
+ * ----------
  */
 PgStat_TableStatus *
 find_tabstat_entry(Oid rel_id)
 {
-    TabStatHashEntry *hash_entry;
+    PgStatEnvelope *env;
 
-    /* If hashtable doesn't exist, there are no entries at all */
-    if (!pgStatTabHash)
-        return NULL;
+    env = get_local_stat_entry(PGSTAT_TYPE_TABLE, MyDatabaseId, rel_id,
+                               false, NULL);
+    if (!env)
+        env = get_local_stat_entry(PGSTAT_TYPE_TABLE, InvalidOid, rel_id,
+                                   false, NULL);
+    if (env)
+        return (PgStat_TableStatus *) &env->body;
 
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
-    if (!hash_entry)
-        return NULL;
-
-    /* Note that this step could also return NULL, but that's correct */
-    return hash_entry->tsa_entry;
+    return NULL;
 }
 
 /*
@@ -2367,8 +2670,8 @@ AtPrepare_PgStat(void)
  *
  * All we need do here is unlink the transaction stats state from the
  * nontransactional state.  The nontransactional action counts will be
- * reported to the stats collector immediately, while the effects on live
- * and dead tuple counts are preserved in the 2PC state file.
+ * reported to the activity stats facility immediately, while the effects on
+ * live and dead tuple counts are preserved in the 2PC state file.
  *
  * Note: AtEOXact_PgStat is not called during PREPARE.
  */
@@ -2413,7 +2716,7 @@ pgstat_twophase_postcommit(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, commit case */
     pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
@@ -2449,7 +2752,7 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, abort case */
     if (rec->t_truncated)
@@ -2466,88 +2769,176 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 }
 
 
+/* ----------
+ * snapshot_statentry() -
+ *
+ *  Common routine for functions pgstat_fetch_stat_*entry()
+ *
+ *  Returns the pointer to the snapshot of the shared entry for the key or NULL
+ *  if not found. Returned snapshots are stable during the current transaction
+ *  or until pgstat_clear_snapshot() is called.
+ *
+ *  Created snapshots are stored in pgStatSnapshotHash.
+ */
+static void *
+snapshot_statentry(const PgStatTypes type, const Oid dbid, const Oid objid)
+{
+    PgStatSnapshot *snap = NULL;
+    bool        found;
+    PgStatHashEntryKey key;
+    size_t        statentsize = pgstat_entsize[type];
+
+    Assert(type != PGSTAT_TYPE_ALL);
+
+    /*
+     * Create new hash, with rather arbitrary initial number of entries since
+     * we don't know how this hash will grow.
+     */
+    if (!pgStatSnapshotHash)
+    {
+        HASHCTL        ctl;
+
+        /*
+         * Create the hash in the stats context
+         *
+         * Each entry is preceded by a common header part represented by
+         * PgStatSnapshot.
+         */
+
+        ctl.keysize = sizeof(PgStatHashEntryKey);
+        ctl.entrysize = PgStatSnapshotSize(statentsize);
+        ctl.hcxt = pgStatSnapshotContext;
+        pgStatSnapshotHash = hash_create("pgstat snapshot hash", 32, &ctl,
+                                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    }
+
+    /* Find a snapshot */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+
+    snap = hash_search(pgStatSnapshotHash, &key, HASH_ENTER, &found);
+
+    /*
+     * Refer to the shared hash if not found in the snapshot hash.
+     *
+     * In transaction state, it is obvious that we should create a snapshot
+     * entry for consistency. If we are not, we return an up-to-date entry.
+     * Having said that, we need a snapshot since a shared stats entry can be
+     * modified anytime. We share the same snapshot entry for the purpose.
+     */
+    if (!found || !IsTransactionState())
+    {
+        PgStatEnvelope *shenv;
+
+        shenv = get_stat_entry(type, dbid, objid, true, NULL, NULL);
+
+        if (shenv)
+            memcpy(&snap->body, &shenv->body, statentsize);
+
+        snap->negative = !shenv;
+    }
+
+    if (snap->negative)
+        return NULL;
+
+    return &snap->body;
+}
+
+
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    Find database stats entry on backends. The returned entries are cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
  * ----------
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    /* If not done for this transaction, take a snapshot of global stats */
+    pgstat_snapshot_global_stats();
+
+    /* the caller has no business with snapshot-local members */
+    return (PgStat_StatDBEntry *)
+        snapshot_statentry(PGSTAT_TYPE_DB, dbid, InvalidOid);
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one table or NULL. NULL doesn't mean
+ *    the activity statistics for one table or NULL. NULL doesn't mean
  *    that the table doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    activity statistics facilities, so the caller is better off to
+ *    report ZERO instead.
  * ----------
  */
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
-    PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(false, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
-
-    return NULL;
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(true, relid);
+    return tabentry;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_tabentry_snapshot() -
+ *
+ *    Find table stats entry on backends. The returned entry is cached
+ *    until transaction end or until pgstat_clear_snapshot() is called.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_snapshot(bool shared, Oid reloid)
+{
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    return (PgStat_StatTabEntry *)
+        snapshot_statentry(PGSTAT_TYPE_TABLE, dboid, reloid);
+}
+
+
+/* ----------
+ * pgstat_copy_index_counters() -
+ *
+ *    Support function for index swapping. Copy a portion of the counters of the
+ *    relation to specified place.
+ * ----------
+ */
+void
+pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst)
+{
+    PgStat_StatTabEntry *tabentry;
+
+    /* No point fetching tabentry when dst is NULL */
+    if (!dst)
+        return;
+
+    tabentry = pgstat_fetch_stat_tabentry(relid);
+
+    if (!tabentry)
+        return;
+
+    dst->t_counts.t_numscans = tabentry->numscans;
+    dst->t_counts.t_tuples_returned = tabentry->tuples_returned;
+    dst->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
+    dst->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
+    dst->t_counts.t_blocks_hit = tabentry->blocks_hit;
 }
 
 
@@ -2561,25 +2952,60 @@ pgstat_fetch_stat_tabentry(Oid relid)
 PgStat_StatFuncEntry *
 pgstat_fetch_stat_funcentry(Oid func_id)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry = NULL;
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
 
-    /* load the stats file if needed */
-    backend_read_statsfile();
+    return (PgStat_StatFuncEntry *)
+        snapshot_statentry(PGSTAT_TYPE_FUNCTION, MyDatabaseId, func_id);
+}
 
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
+/* ----------
+ * pgstat_snapshot_global_stats() -
+ *
+ * Makes a snapshot of global stats if not done yet.  It is kept until a
+ * subsequent call to pgstat_clear_snapshot() or the end of the current
+ * memory context (typically TopTransactionContext).
+ * ----------
+ */
+static void
+pgstat_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext;
+    int    i;
+
+    attach_shared_stats();
+
+    /* Nothing to do if already done */
+    if (global_snapshot_is_valid)
+        return;
+
+    oldcontext = MemoryContextSwitchTo(pgStatSnapshotContext);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+
+    memcpy(&snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+    memcpy(&snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+    memcpy(snapshot_SLRUStats, shared_SLRUStats,
+           sizeof(PgStat_StatSLRUEntry) * SLRU_NUM_ELEMENTS);
+
+    LWLockRelease(StatsLock);
+
+    /* Fill in empty timestamps of SLRU stats */
+    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
     {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
+        if (snapshot_SLRUStats[i].stat_reset_timestamp == 0)
+            snapshot_SLRUStats[i].stat_reset_timestamp =
+                snapshot_globalStats.stat_reset_timestamp;
     }
+    global_snapshot_is_valid = true;
 
-    return funcentry;
+    MemoryContextSwitchTo(oldcontext);
+
+    return;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_beentry() -
  *
@@ -2650,9 +3076,10 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &archiverStats;
+    return &snapshot_archiverStats;
 }
 
 
@@ -2667,9 +3094,10 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &globalStats;
+    return &snapshot_globalStats;
 }
 
 
@@ -2681,12 +3109,12 @@ pgstat_fetch_global(void)
  *    a pointer to the slru statistics struct.
  * ---------
  */
-PgStat_SLRUStats *
+PgStat_StatSLRUEntry *
 pgstat_fetch_slru(void)
 {
-    backend_read_statsfile();
+    pgstat_snapshot_global_stats();
 
-    return slruStats;
+    return snapshot_SLRUStats;
 }
 
 
@@ -2900,8 +3328,8 @@ pgstat_initialize(void)
         MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
     }
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* needs to be called before DSM shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -3077,12 +3505,15 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+    /* attach shared database stats area */
+    attach_shared_stats();
 }
 
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
- * Flush any remaining statistics counts out to the collector.
+ * Flush any remaining statistics counts out to shared stats.
  * Without this, operations triggered during backend exit (such as
  * temp table deletions) won't be counted.
  *
@@ -3095,7 +3526,7 @@ pgstat_beshutdown_hook(int code, Datum arg)
 
     /*
      * If we got as far as discovering our own database ID, we can report what
-     * we did to the collector.  Otherwise, we'd be sending an invalid
+     * we did to the shared stats.  Otherwise, we'd be sending an invalid
      * database ID, so forget it.  (This means that accesses to pg_database
      * during failed backend starts might never get counted.)
      */
@@ -3112,6 +3543,8 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    detach_shared_stats(true);
 }
 
 
@@ -3372,7 +3805,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3667,9 +4101,6 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
-            break;
         case WAIT_EVENT_RECOVERY_WAL_STREAM:
             event_name = "RecoveryWalStream";
             break;
@@ -4304,94 +4735,71 @@ pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
 
 
 /* ----------
- * pgstat_setheader() -
+ * pgstat_report_archiver() -
  *
- *        Set common header fields in a statistics message
- * ----------
- */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
-    {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
- *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
+ *        Report archiver statistics
  * ----------
  */
 void
-pgstat_send_archiver(const char *xlog, bool failed)
+pgstat_report_archiver(const char *xlog, bool failed)
 {
-    PgStat_MsgArchiver msg;
+    TimestampTz now = GetCurrentTimestamp();
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+    if (failed)
+    {
+        /* Failed archival attempt */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->failed_count;
+        strlcpy(shared_archiverStats->last_failed_wal, xlog,
+                sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    else
+    {
+        /* Successful archival operation */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->archived_count;
+        strlcpy(shared_archiverStats->last_archived_wal, xlog,
+                sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
 }
 
 /* ----------
- * pgstat_send_bgwriter() -
+ * pgstat_report_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
-pgstat_send_bgwriter(void)
+pgstat_report_bgwriter(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+
+    PgStat_BgWriter *l = &BgWriterStats;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid taking the lock for completely empty stats.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += l->timed_checkpoints;
+    shared_globalStats->requested_checkpoints += l->requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += l->checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += l->checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += l->buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += l->buf_written_clean;
+    shared_globalStats->maxwritten_clean += l->maxwritten_clean;
+    shared_globalStats->buf_written_backend += l->buf_written_backend;
+    shared_globalStats->buf_fsync_backend += l->buf_fsync_backend;
+    shared_globalStats->buf_alloc += l->buf_alloc;
+    LWLockRelease(StatsLock);
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4399,475 +4807,31 @@ pgstat_send_bgwriter(void)
     MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
 }
 
-/* ----------
- * pgstat_send_slru() -
- *
- *        Send SLRU statistics to the collector
- * ----------
- */
-static void
-pgstat_send_slru(void)
-{
-    int        i;
-
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgSLRU all_zeroes;
-
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /*
-         * This function can be called even if nothing at all has happened. In
-         * this case, avoid sending a completely empty message to the stats
-         * collector.
-         */
-        if (memcmp(&SLRUStats[i], &all_zeroes, sizeof(PgStat_MsgSLRU)) == 0)
-            continue;
-
-        /* set the SLRU type before each send */
-        SLRUStats[i].m_index = i;
-
-        /*
-         * Prepare and send the message
-         */
-        pgstat_setheader(&SLRUStats[i].m_hdr, PGSTAT_MTYPE_SLRU);
-        pgstat_send(&SLRUStats[i], sizeof(PgStat_MsgSLRU));
-
-        /*
-         * Clear out the statistics buffer, so it can be re-used.
-         */
-        MemSet(&SLRUStats[i], 0, sizeof(PgStat_MsgSLRU));
-    }
-}
-
-
-/* ----------
- * PgstatCollectorMain() -
- *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, SignalHandlerForConfigReload);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    MyBackendType = B_STATS_COLLECTOR;
-    init_ps_display(NULL);
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do check ConfigReloadPending
-     * inside the inner loop, which means that such interrupts will get
-     * serviced but the latch won't get cleared until next time there is a
-     * break in the action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (ShutdownRequestPending)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * ShutdownRequestPending becomes set.
-         */
-        while (!ShutdownRequestPending)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (ConfigReloadPending)
-            {
-                ConfigReloadPending = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSLRUCOUNTER:
-                    pgstat_recv_resetslrucounter(&msg.msg_resetslrucounter,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_SLRU:
-                    pgstat_recv_slru(&msg.msg_slru, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr & WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    exit(0);
-}
-
-/*
- * Subroutine to clear stats in a database entry
- *
- * Tables and functions hashes are initialized to empty.
- */
-static void
-reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
-{
-    HASHCTL        hash_ctl;
-
-    dbentry->n_xact_commit = 0;
-    dbentry->n_xact_rollback = 0;
-    dbentry->n_blocks_fetched = 0;
-    dbentry->n_blocks_hit = 0;
-    dbentry->n_tuples_returned = 0;
-    dbentry->n_tuples_fetched = 0;
-    dbentry->n_tuples_inserted = 0;
-    dbentry->n_tuples_updated = 0;
-    dbentry->n_tuples_deleted = 0;
-    dbentry->last_autovac_time = 0;
-    dbentry->n_conflict_tablespace = 0;
-    dbentry->n_conflict_lock = 0;
-    dbentry->n_conflict_snapshot = 0;
-    dbentry->n_conflict_bufferpin = 0;
-    dbentry->n_conflict_startup_deadlock = 0;
-    dbentry->n_temp_files = 0;
-    dbentry->n_temp_bytes = 0;
-    dbentry->n_deadlocks = 0;
-    dbentry->n_checksum_failures = 0;
-    dbentry->last_checksum_failure = 0;
-    dbentry->n_block_read_time = 0;
-    dbentry->n_block_write_time = 0;
-
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
-
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
-}
-
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
-{
-    PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
-     */
-    if (!found)
-        reset_dbentry_counters(result);
-
-    return result;
-}
-
-
-/*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
-{
-    PgStat_StatTabEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /* If not found, initialize the new one. */
-    if (!found)
-    {
-        result->numscans = 0;
-        result->tuples_returned = 0;
-        result->tuples_fetched = 0;
-        result->tuples_inserted = 0;
-        result->tuples_updated = 0;
-        result->tuples_deleted = 0;
-        result->tuples_hot_updated = 0;
-        result->n_live_tuples = 0;
-        result->n_dead_tuples = 0;
-        result->changes_since_analyze = 0;
-        result->inserts_since_vacuum = 0;
-        result->blocks_fetched = 0;
-        result->blocks_hit = 0;
-        result->vacuum_timestamp = 0;
-        result->vacuum_count = 0;
-        result->autovac_vacuum_timestamp = 0;
-        result->autovac_vacuum_count = 0;
-        result->analyze_timestamp = 0;
-        result->analyze_count = 0;
-        result->autovac_analyze_timestamp = 0;
-        result->autovac_analyze_count = 0;
-    }
-
-    return result;
-}
-
 
 /* ----------
  * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ *        Write the global statistics file, as well as DB files.
  * ----------
  */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+void
+pgstat_write_statsfiles(void)
 {
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **penv;
+
+    /* Stats are not initialized yet; just return. */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
+    create_missing_dbentries();
+
     /*
      * Open the statistics temp file to write out the current values.
      */
@@ -4884,7 +4848,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
 
     /*
      * Write the file header --- currently just a format ID.
@@ -4896,38 +4860,31 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Write global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write archiver stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Write SLRU stats struct
-     */
-    rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Walk through the database table.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_DB, InvalidOid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) &(*penv)->body;
+
         /*
          * Write out the table and function stats for this DB into the
          * appropriate per-DB stat file, if required.
          */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
+        pgstat_write_database_stats(dbentry);
 
         /*
          * Write out the DB entry. We don't write the tables or functions
@@ -4938,6 +4895,8 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
         (void) rc;                /* we'll check for error with ferror */
     }
 
+    pfree(envlist);
+
     /*
      * No more output to be done. Close the temp file and replace the old
      * pgstat.stat with it.  The ferror() check replaces testing for error
@@ -4970,55 +4929,19 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
 }
 
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
 
 /* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
+ * pgstat_write_database_stats() -
+ *  Write the stat file for a single database.
  * ----------
  */
 static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
+pgstat_write_database_stats(PgStat_StatDBEntry *dbentry)
 {
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **penv;
     FILE       *fpout;
     int32        format_id;
     Oid            dbid = dbentry->databaseid;
@@ -5026,8 +4949,8 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     char        tmpfile[MAXPGPATH];
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -5054,24 +4977,31 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     /*
      * Walk through the database's access stats per table.
      */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_TABLE, dbentry->databaseid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatTabEntry *tabentry = (PgStat_StatTabEntry *) &(*penv)->body;
+
         fputc('T', fpout);
         rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    pfree(envlist);
 
     /*
      * Walk through the database's function stats table.
      */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_FUNCTION, dbentry->databaseid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatFuncEntry *funcentry =
+        (PgStat_StatFuncEntry *) &(*penv)->body;
+
         fputc('F', fpout);
         rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    pfree(envlist);
 
     /*
      * No more output to be done. Close the temp file and replace the old
@@ -5105,102 +5035,165 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
+}
 
-    if (permanent)
+/* ----------
+ * create_missing_dbentries() -
+ *
+ *  A database entry may be missing for a database in which object stats
+ *  are recorded. This function creates such missing dbentries so that all
+ *  stats entries can be written out to files.
+ * ----------
+ */
+static void
+create_missing_dbentries(void)
+{
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
+    HTAB       *oidhash;
+    HASHCTL        ctl;
+    HASH_SEQ_STATUS scan;
+    Oid           *poid;
+
+    memset(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(Oid);
+    ctl.entrysize = sizeof(Oid);
+    ctl.hcxt = CurrentMemoryContext;
+    oidhash = hash_create("Temporary table of OIDs",
+                          PGSTAT_TABLE_HASH_SIZE,
+                          &ctl,
+                          HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+    /* Collect database OIDs from the shared stats hash */
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+        hash_search(oidhash, &p->key.databaseid, HASH_ENTER, NULL);
+    dshash_seq_term(&hstat);
+
+    /* Create database entries that do not exist yet. */
+    hash_seq_init(&scan, oidhash);
+    while ((poid = (Oid *) hash_seq_search(&scan)) != NULL)
+        (void) get_stat_entry(PGSTAT_TYPE_DB, *poid, InvalidOid,
+                              false, init_dbentry, NULL);
+
+    hash_destroy(oidhash);
+}
+
+
+/* ----------
+ * get_stat_entry() -
+ *
+ *  Get the shared stats entry for the specified type, dbid and objid.
+ *  If nowait is true, returns NULL on lock failure.
+ *
+ *  If initfunc is not NULL, a new entry is created when none exists yet and
+ *  initfunc is called with the new envelope. If found is not NULL, it is set
+ *  to true if an existing entry was found, false otherwise.
+ * ----------
+ */
+static PgStatEnvelope *
+get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+               bool nowait, entry_initializer initfunc, bool *found)
+{
+    bool        create = (initfunc != NULL);
+    PgStatHashEntry *shent;
+    PgStatEnvelope *shenv = NULL;
+    PgStatHashEntryKey key;
+    bool        myfound;
+
+    Assert(type != PGSTAT_TYPE_ALL);
+
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    shent = dshash_find_extended(pgStatSharedHash, &key,
+                                 create, nowait, create, &myfound);
+    if (shent)
     {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
+        if (create && !myfound)
+        {
+            /* Create new stats envelope. */
+            size_t        envsize = PgStatEnvelopeSize(pgstat_entsize[type]);
+            dsa_pointer chunk = dsa_allocate0(area, envsize);
 
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
+            shenv = dsa_get_address(area, chunk);
+            shenv->type = type;
+            shenv->databaseid = dbid;
+            shenv->objectid = objid;
+            shenv->len = pgstat_entsize[type];
+            LWLockInitialize(&shenv->lock, LWTRANCHE_STATS);
+
+            /*
+             * The lock on dshash is released just after this. Call the
+             * initializer callback before the entry is exposed to others.
+             */
+            if (initfunc)
+                initfunc(shenv);
+
+            /* Link the new entry from the hash entry. */
+            shent->env = chunk;
+        }
+        else
+            shenv = dsa_get_address(area, shent->env);
+
+        dshash_release_lock(pgStatSharedHash, shent);
     }
+
+    if (found)
+        *found = myfound;
+
+    return shenv;
 }
 
+
 /* ----------
  * pgstat_read_statsfiles() -
  *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ *    Reads existing activity statistics files into the shared stats hash.
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+void
+pgstat_read_statsfiles(void)
 {
+    PgStatEnvelope *env;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-    int            i;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
 
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
+    /* shouldn't be called from postmaster */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-    memset(&slruStats, 0, sizeof(slruStats));
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Set the same reset timestamp for all SLRU items too.
-     */
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-        slruStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp =
+        shared_globalStats->stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if activity statistics are not running or
+     * have not yet written the stats file the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -5209,7 +5202,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
@@ -5217,49 +5210,30 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     /*
      * Read global stats struct
      */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+        sizeof(*shared_globalStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
         goto done;
     }
 
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
     /*
      * Read archiver stats struct
      */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+        sizeof(*shared_archiverStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
         goto done;
     }
 
     /*
-     * Read SLRU stats struct
-     */
-    if (fread(slruStats, 1, sizeof(slruStats), fpin) != sizeof(slruStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&slruStats, 0, sizeof(slruStats));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
     for (;;)
     {
@@ -5273,7 +5247,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
                           fpin) != offsetof(PgStat_StatDBEntry, tables))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5282,76 +5256,33 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 /*
                  * Add to the DB hash
                  */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
+
+                env = get_stat_entry(PGSTAT_TYPE_DB, dbbuf.databaseid,
+                                     InvalidOid,
+                                     false, init_dbentry, &found);
+                dbentry = (PgStat_StatDBEntry *) &env->body;
+
+                /* don't allow duplicate dbentries */
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
+                memcpy(dbentry, &dbbuf,
+                       offsetof(PgStat_StatDBEntry, tables));
 
+                /* Read the data from the database-specific file. */
+                pgstat_read_db_statsfile(dbentry);
                 break;
 
             case 'E':
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5361,59 +5292,50 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 done:
     FreeFile(fpin);
 
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 
-    return dbhash;
+    return;
 }
 
 
 /* ----------
  * pgstat_read_db_statsfile() -
  *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
+ *    Reads in the at-rest statistics file and creates shared statistics
+ *    tables. The file is removed after reading.
  * ----------
  */
 static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
+pgstat_read_db_statsfile(PgStat_StatDBEntry *dbentry)
 {
+    PgStatEnvelope *env;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatTabEntry tabbuf;
     PgStat_StatFuncEntry funcbuf;
     PgStat_StatFuncEntry *funcentry;
+    dshash_table *tabhash = NULL;
+    dshash_table *funchash = NULL;
     FILE       *fpin;
     int32        format_id;
     bool        found;
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
+    get_dbstat_filename(false, dbentry->databaseid, statfile, MAXPGPATH);
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if activity statistics are not running or
+     * have not yet written the stats file the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
@@ -5426,14 +5348,14 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
 
     /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
     for (;;)
     {
@@ -5446,25 +5368,21 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
                           fpin) != sizeof(PgStat_StatTabEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
+                env = get_stat_entry(PGSTAT_TYPE_TABLE,
+                                     dbentry->databaseid, tabbuf.tableid,
+                                     false, init_tabentry, &found);
+                tabentry = (PgStat_StatTabEntry *) &env->body;
 
+                /* don't allow duplicate entries */
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5480,25 +5398,20 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
                           fpin) != sizeof(PgStat_StatFuncEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
+                env = get_stat_entry(PGSTAT_TYPE_FUNCTION, dbentry->databaseid,
+                                     funcbuf.functionid,
+                                     false, init_funcentry, &found);
+                funcentry = (PgStat_StatFuncEntry *) &env->body;
 
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5514,7 +5427,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5522,304 +5435,15 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     }
 
 done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read SLRU stats struct
-     */
-    if (fread(mySLRUStats, 1, sizeof(mySLRUStats), fpin) != sizeof(mySLRUStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
+    if (tabhash)
+        dshash_detach(tabhash);
+    if (funchash)
+        dshash_detach(funchash);
 
-done:
     FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
-}
-
 
-/* ----------
- * pgstat_setup_memcxt() -
- *
- *    Create pgStatLocalContext, if not already done.
- * ----------
- */
-static void
-pgstat_setup_memcxt(void)
-{
-    if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 }
 
 
@@ -5838,801 +5462,25 @@ pgstat_clear_snapshot(void)
 {
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
+    {
         MemoryContextDelete(pgStatLocalContext);
 
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
     }
 
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
+    if (pgStatSnapshotContext)
     {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-                tabentry->inserts_since_vacuum = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
+        MemoryContextReset(pgStatSnapshotContext);
 
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
+        /* Reset variables that pointed to the context */
+        global_snapshot_is_valid = false;
+        pgStatSnapshotHash = NULL;
     }
 }
 
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_resetslrucounter() -
- *
- *    Reset some SLRU statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len)
-{
-    int            i;
-    TimestampTz    ts = GetCurrentTimestamp();
-
-    memset(&slruStats, 0, sizeof(slruStats));
-
-    elog(LOG, "msg->m_index = %d", msg->m_index);
-
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /* reset entry with the given index, or all entries (index is -1) */
-        if ((msg->m_index == -1) || (msg->m_index == i))
-        {
-            memset(&slruStats[i], 0, sizeof(slruStats[i]));
-            slruStats[i].stat_reset_timestamp = ts;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * It is quite possible that a non-aggressive VACUUM ended up skipping
-     * various pages, however, we'll zero the insert counter here regardless.
-     * It's currently used only to track when we need to perform an
-     * "insert" autovacuum, which are mainly intended to freeze newly inserted
-     * tuples.  Zeroing this may just mean we'll not try to vacuum the table
-     * again until enough tuples have been inserted to trigger another insert
-     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
-     * stragglers.
-     */
-    tabentry->inserts_since_vacuum = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_slru() -
- *
- *    Process a SLRU message.
- * ----------
- */
-static void
-pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
-{
-    slruStats[msg->m_index].blocks_zeroed += msg->m_blocks_zeroed;
-    slruStats[msg->m_index].blocks_hit += msg->m_blocks_hit;
-    slruStats[msg->m_index].blocks_read += msg->m_blocks_read;
-    slruStats[msg->m_index].blocks_written += msg->m_blocks_written;
-    slruStats[msg->m_index].blocks_exists += msg->m_blocks_exists;
-    slruStats[msg->m_index].flush += msg->m_flush;
-    slruStats[msg->m_index].truncate += msg->m_truncate;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
-}
-
 /*
  * Convert a potentially unsafely truncated activity string (see
  * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
@@ -6723,54 +5571,62 @@ pgstat_slru_name(int idx)
  * Returns pointer to entry with counters for given SLRU (based on the name
  * stored in SlruCtl as lwlock tranche name).
  */
-static PgStat_MsgSLRU *
+static PgStat_StatSLRUEntry *
 slru_entry(SlruCtl ctl)
 {
     int        idx = pgstat_slru_index(ctl->shared->lwlock_tranche_name);
 
     Assert((idx >= 0) && (idx < SLRU_NUM_ELEMENTS));
 
-    return &SLRUStats[idx];
+    return &local_SLRUStats[idx];
 }
 
+
 void
 pgstat_count_slru_page_zeroed(SlruCtl ctl)
 {
-    slru_entry(ctl)->m_blocks_zeroed += 1;
+    slru_entry(ctl)->blocks_zeroed += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_hit(SlruCtl ctl)
 {
-    slru_entry(ctl)->m_blocks_hit += 1;
+    slru_entry(ctl)->blocks_hit += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_exists(SlruCtl ctl)
 {
-    slru_entry(ctl)->m_blocks_exists += 1;
+    slru_entry(ctl)->blocks_exists += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_read(SlruCtl ctl)
 {
-    slru_entry(ctl)->m_blocks_read += 1;
+    slru_entry(ctl)->blocks_read += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_written(SlruCtl ctl)
 {
-    slru_entry(ctl)->m_blocks_written += 1;
+    slru_entry(ctl)->blocks_written += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_flush(SlruCtl ctl)
 {
-    slru_entry(ctl)->m_flush += 1;
+    slru_entry(ctl)->flush += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_truncate(SlruCtl ctl)
 {
-    slru_entry(ctl)->m_truncate += 1;
+    slru_entry(ctl)->truncate += 1;
+    have_slrustats = true;
 }
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index e97995f973..0f0f3ece36 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -254,7 +254,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -502,7 +501,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1325,12 +1323,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1779,11 +1771,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2693,8 +2680,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -3057,8 +3042,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3125,13 +3108,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3204,22 +3180,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3680,22 +3640,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3891,8 +3835,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3916,8 +3858,7 @@ PostmasterStateMachine(void)
     {
         /*
          * PM_WAIT_DEAD_END state ends when the BackendList is entirely empty
-         * (ie, no dead_end children remain), and the archiver and stats
-         * collector are gone too.
+         * (ie, no dead_end children remain), and the archiver is gone too.
          *
          * The reason we wait for those two is to protect them against a new
          * postmaster starting conflicting subprocesses; this isn't an
@@ -3927,8 +3868,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4129,8 +4069,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5108,18 +5046,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5238,12 +5164,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6138,7 +6058,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6194,8 +6113,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6430,7 +6347,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index a2e28b064c..261920b961 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1445,8 +1445,8 @@ is_checksummed_file(const char *fullpath, const char *filename)
  *
  * If 'missing_ok' is true, will not throw an error if the file is not found.
  *
- * If dboid is anything other than InvalidOid then any checksum failures detected
- * will get reported to the stats collector.
+ * If dboid is anything other than InvalidOid then any checksum failures
+ * detected will get reported to the activity stats facility.
  *
  * Returns true if the file was successfully sent, false if 'missing_ok',
  * and the file did not exist.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e72d607a23..ea5030452d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1948,7 +1948,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                BgWriterStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2058,7 +2058,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2248,7 +2248,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2256,7 +2256,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
 
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..58a442f482 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -147,6 +147,7 @@ CreateSharedMemoryAndSemaphores(void)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -263,6 +264,7 @@ CreateSharedMemoryAndSemaphores(void)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 4c14e51c67..f61fd3e8ad 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -523,6 +523,7 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
     LWLockRegisterTranche(LWTRANCHE_SXACT, "serializable_xact");
+    LWLockRegisterTranche(LWTRANCHE_STATS, "activity_statistics");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index db47843229..97eccb35d3 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -49,3 +49,4 @@ MultiXactTruncationLock                41
 OldSnapshotTimeMapLock                42
 LogicalRepWorkerLock                43
 CLogTruncationLock                    44
+StatsLock                            45
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 360b5bf5bf..c10d80f035 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -362,10 +362,10 @@ smgrdounlink(SMgrRelation reln, bool isRedo)
     DropRelFileNodesAllBuffers(&rnode, 1);
 
     /*
-     * It'd be nice to tell the stats collector to forget it immediately, too.
-     * But we can't because we don't know the OID (and in cases involving
-     * relfilenode swaps, it's not always clear which table OID to forget,
-     * anyway).
+     * It'd be nice to tell the activity stats facility to forget it
+     * immediately, too.  But we can't because we don't know the OID (and in
+     * cases involving relfilenode swaps, it's not always clear which table OID
+     * to forget, anyway).
      */
 
     /*
@@ -435,8 +435,8 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
     DropRelFileNodesAllBuffers(rnodes, nrels);
 
     /*
-     * It'd be nice to tell the stats collector to forget them immediately,
-     * too. But we can't because we don't know the OIDs.
+     * It'd be nice to tell the activity stats facility to forget them
+     * immediately, too. But we can't because we don't know the OIDs.
      */
 
     /*
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8958ec8103..9932ce4dac 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3209,6 +3209,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3783,6 +3789,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4185,11 +4192,12 @@ PostgresMain(int argc, char *argv[],
          * Note: this includes fflush()'ing the last of the prior output.
          *
          * This is also a good time to send collected statistics to the
-         * collector, and to update the PS stats display.  We avoid doing
-         * those every time through the message loop because it'd slow down
-         * processing of batched messages, and because we don't want to report
-         * uncommitted updates (that confuses autovacuum).  The notification
-         * processor wants a call too, if we are not in a transaction block.
+         * activity stats facility, and to update the PS stats display.  We
+         * avoid doing those every time through the message loop because it'd
+         * slow down processing of batched messages, and because we don't want
+         * to report uncommitted updates (that confuses autovacuum).  The
+         * notification processor wants a call too, if we are not in a
+         * transaction block.
          */
         if (send_ready_for_query)
         {
@@ -4221,6 +4229,8 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long stats_timeout;
+
                 /* Send out notify signals and transmit self-notifies */
                 ProcessCompletedNotifies();
 
@@ -4233,8 +4243,13 @@ PostgresMain(int argc, char *argv[],
                 if (notifyInterruptPending)
                     ProcessNotifyInterrupt();
 
-                pgstat_report_stat(false);
-
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle");
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4269,7 +4284,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4277,6 +4292,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 175f4fd26b..91849e038a 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * pgstatfuncs.c
- *      Functions for accessing the statistics collector data
+ *      Functions for accessing the activity statistics data
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -33,9 +33,6 @@
 
 #define UINT32_ACCESS_ONCE(var)         ((uint32)(*((volatile uint32 *)&(var))))
 
-/* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
-
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
 {
@@ -1260,7 +1257,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_commit);
+        result = (int64) (dbentry->counts.n_xact_commit);
 
     PG_RETURN_INT64(result);
 }
@@ -1276,7 +1273,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_rollback);
+        result = (int64) (dbentry->counts.n_xact_rollback);
 
     PG_RETURN_INT64(result);
 }
@@ -1292,7 +1289,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_fetched);
+        result = (int64) (dbentry->counts.n_blocks_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1308,7 +1305,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_hit);
+        result = (int64) (dbentry->counts.n_blocks_hit);
 
     PG_RETURN_INT64(result);
 }
@@ -1324,7 +1321,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_returned);
+        result = (int64) (dbentry->counts.n_tuples_returned);
 
     PG_RETURN_INT64(result);
 }
@@ -1340,7 +1337,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_fetched);
+        result = (int64) (dbentry->counts.n_tuples_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1356,7 +1353,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_inserted);
+        result = (int64) (dbentry->counts.n_tuples_inserted);
 
     PG_RETURN_INT64(result);
 }
@@ -1372,7 +1369,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_updated);
+        result = (int64) (dbentry->counts.n_tuples_updated);
 
     PG_RETURN_INT64(result);
 }
@@ -1388,7 +1385,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_deleted);
+        result = (int64) (dbentry->counts.n_tuples_deleted);
 
     PG_RETURN_INT64(result);
 }
@@ -1421,7 +1418,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_files;
+        result = dbentry->counts.n_temp_files;
 
     PG_RETURN_INT64(result);
 }
@@ -1437,7 +1434,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_bytes;
+        result = dbentry->counts.n_temp_bytes;
 
     PG_RETURN_INT64(result);
 }
@@ -1452,7 +1449,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace);
+        result = (int64) (dbentry->counts.n_conflict_tablespace);
 
     PG_RETURN_INT64(result);
 }
@@ -1467,7 +1464,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_lock);
+        result = (int64) (dbentry->counts.n_conflict_lock);
 
     PG_RETURN_INT64(result);
 }
@@ -1482,7 +1479,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_snapshot);
+        result = (int64) (dbentry->counts.n_conflict_snapshot);
 
     PG_RETURN_INT64(result);
 }
@@ -1497,7 +1494,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_bufferpin);
+        result = (int64) (dbentry->counts.n_conflict_bufferpin);
 
     PG_RETURN_INT64(result);
 }
@@ -1512,7 +1509,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1527,11 +1524,11 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace +
-                          dbentry->n_conflict_lock +
-                          dbentry->n_conflict_snapshot +
-                          dbentry->n_conflict_bufferpin +
-                          dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_tablespace +
+                          dbentry->counts.n_conflict_lock +
+                          dbentry->counts.n_conflict_snapshot +
+                          dbentry->counts.n_conflict_bufferpin +
+                          dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1546,7 +1543,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_deadlocks);
+        result = (int64) (dbentry->counts.n_deadlocks);
 
     PG_RETURN_INT64(result);
 }
@@ -1564,7 +1561,7 @@ pg_stat_get_db_checksum_failures(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_checksum_failures);
+        result = (int64) (dbentry->counts.n_checksum_failures);
 
     PG_RETURN_INT64(result);
 }
@@ -1601,7 +1598,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_read_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_read_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1617,7 +1614,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_write_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_write_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1703,7 +1700,7 @@ pg_stat_get_slru(PG_FUNCTION_ARGS)
     MemoryContext     per_query_ctx;
     MemoryContext     oldcontext;
     int                i;
-    PgStat_SLRUStats *stats;
+    PgStat_StatSLRUEntry *stats;
 
     /* check to see if caller supports us returning a tuplestore */
     if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
@@ -1737,7 +1734,7 @@ pg_stat_get_slru(PG_FUNCTION_ARGS)
         /* for each row */
         Datum        values[PG_STAT_GET_SLRU_COLS];
         bool        nulls[PG_STAT_GET_SLRU_COLS];
-        PgStat_SLRUStats    stat = stats[i];
+        PgStat_StatSLRUEntry    *stat = &stats[i];
         char       *name;
 
         name = pgstat_slru_name(i);
@@ -1749,14 +1746,14 @@ pg_stat_get_slru(PG_FUNCTION_ARGS)
         MemSet(nulls, 0, sizeof(nulls));
 
         values[0] = PointerGetDatum(cstring_to_text(name));
-        values[1] = Int64GetDatum(stat.blocks_zeroed);
-        values[2] = Int64GetDatum(stat.blocks_hit);
-        values[3] = Int64GetDatum(stat.blocks_read);
-        values[4] = Int64GetDatum(stat.blocks_written);
-        values[5] = Int64GetDatum(stat.blocks_exists);
-        values[6] = Int64GetDatum(stat.flush);
-        values[7] = Int64GetDatum(stat.truncate);
-        values[8] = Int64GetDatum(stat.stat_reset_timestamp);
+        values[1] = Int64GetDatum(stat->blocks_zeroed);
+        values[2] = Int64GetDatum(stat->blocks_hit);
+        values[3] = Int64GetDatum(stat->blocks_read);
+        values[4] = Int64GetDatum(stat->blocks_written);
+        values[5] = Int64GetDatum(stat->blocks_exists);
+        values[6] = Int64GetDatum(stat->flush);
+        values[7] = Int64GetDatum(stat->truncate);
+        values[8] = Int64GetDatum(stat->stat_reset_timestamp);
 
         tuplestore_putvalues(tupstore, tupdesc, values, nulls);
     }
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb19644419..51748c99ad 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index a7b7b12249..e6b6126141 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -232,9 +232,6 @@ GetBackendTypeDesc(BackendType backendType)
         case B_ARCHIVER:
             backendDesc = "archiver";
             break;
-        case B_STATS_COLLECTOR:
-            backendDesc = "stats collector";
-            break;
         case B_LOGGER:
             backendDesc = "logger";
             break;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index f4247ea70d..f65d05c24c 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -631,6 +632,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1241,6 +1244,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 64dc9fbd13..fd45e72b44 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -746,8 +746,8 @@ const char *const config_group_names[] =
     gettext_noop("Statistics"),
     /* STATS_MONITORING */
     gettext_noop("Statistics / Monitoring"),
-    /* STATS_COLLECTOR */
-    gettext_noop("Statistics / Query and Index Statistics Collector"),
+    /* STATS_ACTIVITY */
+    gettext_noop("Statistics / Query and Index Activity Statistics"),
     /* AUTOVACUUM */
     gettext_noop("Autovacuum"),
     /* CLIENT_CONN */
@@ -1468,7 +1468,7 @@ static struct config_bool ConfigureNamesBool[] =
 #endif
 
     {
-        {"track_activities", PGC_SUSET, STATS_COLLECTOR,
+        {"track_activities", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects information about executing commands."),
             gettext_noop("Enables the collection of information on the currently "
                          "executing command of each session, along with "
@@ -1479,7 +1479,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_counts", PGC_SUSET, STATS_COLLECTOR,
+        {"track_counts", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects statistics on database activity."),
             NULL
         },
@@ -1488,7 +1488,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_io_timing", PGC_SUSET, STATS_COLLECTOR,
+        {"track_io_timing", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects timing statistics for database I/O activity."),
             NULL
         },
@@ -4266,7 +4266,7 @@ static struct config_string ConfigureNamesString[] =
     },
 
     {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
+        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
             gettext_noop("Writes temporary statistics files to the specified directory."),
             NULL,
             GUC_SUPERUSER_ONLY
@@ -4602,7 +4602,7 @@ static struct config_enum ConfigureNamesEnum[] =
     },
 
     {
-        {"track_functions", PGC_SUSET, STATS_COLLECTOR,
+        {"track_functions", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects function-level statistics on database activity."),
             NULL
         },
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e904fa7300..03760ca6a4 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -571,7 +571,7 @@
 # STATISTICS
 #------------------------------------------------------------------------------
 
-# - Query and Index Statistics Collector -
+# - Query and Index Activity Statistics -
 
 #track_activities = on
 #track_counts = on
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 3c70499feb..927ae319b1 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -6,7 +6,7 @@ use File::Basename qw(basename dirname);
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 107;
+use Test::More tests => 106;
 
 program_help_ok('pg_basebackup');
 program_version_ok('pg_basebackup');
@@ -123,7 +123,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 619b2f9c71..23984d6e24 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -83,6 +83,8 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
 
@@ -320,7 +322,6 @@ typedef enum BackendType
     B_WAL_SENDER,
     B_WAL_WRITER,
     B_ARCHIVER,
-    B_STATS_COLLECTOR,
     B_LOGGER,
 } BackendType;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index b8041d9988..5a63840dfe 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL activity statistics facility.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -16,9 +16,11 @@
 #include "libpq/pqcomm.h"
 #include "miscadmin.h"
 #include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -42,35 +44,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_RESETSLRUCOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_SLRU,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -81,9 +54,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -114,18 +86,17 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_delta_live_tuples;
     PgStat_Counter t_delta_dead_tuples;
     PgStat_Counter t_changed_tuples;
+    bool        reset_changed_tuples;
 
     PgStat_Counter t_blocks_fetched;
     PgStat_Counter t_blocks_hit;
+
+    PgStat_Counter vacuum_count;
+    PgStat_Counter autovac_vacuum_count;
+    PgStat_Counter analyze_count;
+    PgStat_Counter autovac_analyze_count;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -159,6 +130,10 @@ typedef struct PgStat_TableStatus
     Oid            t_id;            /* table's OID */
     bool        t_shared;        /* is it a shared catalog? */
     struct PgStat_TableXactStatus *trans;    /* lowest subxact's counts */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_TableCounts t_counts;    /* event counts to be sent */
 } PgStat_TableStatus;
 
@@ -184,308 +159,32 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
-/* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
- */
-typedef struct PgStat_MsgHdr
-{
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgResetslrucounter Sent by the backend to tell the collector
- *                                to reset a SLRU counter
- * ----------
- */
-typedef struct PgStat_MsgResetslrucounter
-{
-    PgStat_MsgHdr m_hdr;
-    int            m_index;
-} PgStat_MsgResetslrucounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgSLRU            Sent by the SLRU to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgSLRU
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Counter m_index;
-    PgStat_Counter m_blocks_zeroed;
-    PgStat_Counter m_blocks_hit;
-    PgStat_Counter m_blocks_read;
-    PgStat_Counter m_blocks_written;
-    PgStat_Counter m_blocks_exists;
-    PgStat_Counter m_flush;
-    PgStat_Counter m_truncate;
-} PgStat_MsgSLRU;
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
 /* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgTempFile
+typedef struct PgStat_BgWriter
 {
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter buf_alloc;
+    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+}            PgStat_BgWriter;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -517,98 +216,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgResetslrucounter msg_resetslrucounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgSLRU msg_slru;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Activity statistics data structures in files and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -617,13 +226,9 @@ typedef union PgStat_Msg
 
 #define PGSTAT_FILE_FORMAT_ID    0x01A5BC9D
 
-/* ----------
- * PgStat_StatDBEntry            The collector's data per database
- * ----------
- */
-typedef struct PgStat_StatDBEntry
+
+typedef struct PgStat_StatDBCounts
 {
-    Oid            databaseid;
     PgStat_Counter n_xact_commit;
     PgStat_Counter n_xact_rollback;
     PgStat_Counter n_blocks_fetched;
@@ -633,7 +238,6 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_tuples_inserted;
     PgStat_Counter n_tuples_updated;
     PgStat_Counter n_tuples_deleted;
-    TimestampTz last_autovac_time;
     PgStat_Counter n_conflict_tablespace;
     PgStat_Counter n_conflict_lock;
     PgStat_Counter n_conflict_snapshot;
@@ -643,29 +247,72 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_temp_bytes;
     PgStat_Counter n_deadlocks;
     PgStat_Counter n_checksum_failures;
-    TimestampTz last_checksum_failure;
     PgStat_Counter n_block_read_time;    /* times in microseconds */
     PgStat_Counter n_block_write_time;
+} PgStat_StatDBCounts;
 
+/* ----------
+ * PgStat_StatDBEntry            The statistics per database
+ * ----------
+ */
+typedef struct PgStat_StatDBEntry
+{
+    Oid            databaseid;
+    TimestampTz last_autovac_time;
+    TimestampTz last_checksum_failure;
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;    /* time of db stats update */
+
+    PgStat_StatDBCounts counts;
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following must be last in the struct, because we don't write them
+     * out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables; /* current gen tables hash */
+    dshash_table_handle functions;    /* current gen functions hash */
 } PgStat_StatDBEntry;
 
 
+/*
+ * SLRU statistics kept in the shared stats
+ */
+typedef struct PgStat_StatSLRUEntry
+{
+    PgStat_Counter blocks_zeroed;
+    PgStat_Counter blocks_hit;
+    PgStat_Counter blocks_read;
+    PgStat_Counter blocks_written;
+    PgStat_Counter blocks_exists;
+    PgStat_Counter flush;
+    PgStat_Counter truncate;
+    TimestampTz stat_reset_timestamp;
+} PgStat_StatSLRUEntry;
+
+
 /* ----------
- * PgStat_StatTabEntry            The collector's data per table (or index)
+ * PgStat_HashMountInfo  dshash mount information
+ * ----------
+ */
+typedef struct PgStat_HashMountInfo
+{
+    HTAB       *snapshot_tables;    /* table entry snapshot */
+    HTAB       *snapshot_functions; /* function entry snapshot */
+    dshash_table *dshash_tables;    /* attached tables dshash */
+    dshash_table *dshash_functions; /* attached functions dshash */
+} PgStat_HashMountInfo;
+
+/* ----------
+ * PgStat_StatTabEntry            The statistics per table (or index)
  * ----------
  */
 typedef struct PgStat_StatTabEntry
 {
     Oid            tableid;
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
 
     PgStat_Counter numscans;
 
@@ -685,19 +332,15 @@ typedef struct PgStat_StatTabEntry
     PgStat_Counter blocks_fetched;
     PgStat_Counter blocks_hit;
 
-    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
     PgStat_Counter vacuum_count;
-    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_vacuum_count;
-    TimestampTz analyze_timestamp;    /* user initiated */
     PgStat_Counter analyze_count;
-    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_analyze_count;
 } PgStat_StatTabEntry;
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            per function stats data
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
@@ -710,9 +353,8 @@ typedef struct PgStat_StatFuncEntry
     PgStat_Counter f_self_time;
 } PgStat_StatFuncEntry;
 
-
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -728,7 +370,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -746,21 +388,6 @@ typedef struct PgStat_GlobalStats
     TimestampTz stat_reset_timestamp;
 } PgStat_GlobalStats;
 
-/*
- * SLRU statistics kept in the stats collector
- */
-typedef struct PgStat_SLRUStats
-{
-    PgStat_Counter blocks_zeroed;
-    PgStat_Counter blocks_hit;
-    PgStat_Counter blocks_read;
-    PgStat_Counter blocks_written;
-    PgStat_Counter blocks_exists;
-    PgStat_Counter flush;
-    PgStat_Counter truncate;
-    TimestampTz stat_reset_timestamp;
-} PgStat_SLRUStats;
-
 
 /* ----------
  * Backend states
@@ -809,7 +436,6 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
     WAIT_EVENT_WAL_RECEIVER_MAIN,
@@ -1055,7 +681,7 @@ typedef struct PgBackendGSSStatus
  *
  * Each live backend maintains a PgBackendStatus struct in shared memory
  * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
+ * BackendId, but that is not critical.)  Note that the stats-writing code
  * has no involvement in, or even access to, these structs.
  *
  * Each auxiliary process also maintains a PgBackendStatus struct in shared
@@ -1252,18 +878,20 @@ extern PGDLLIMPORT bool pgstat_track_counts;
 extern PGDLLIMPORT int pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used; to be removed together with the GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
 
 /*
  * SLRU statistics counters are updated directly by slru.
  */
-extern PgStat_MsgSLRU SlruStats[];
+/* extern PgStat_MsgSLRU SlruStats[]; */
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1278,29 +906,26 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
 
+/* File input/output functions */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 extern void pgstat_reset_slru_counter(const char *);
 
@@ -1462,8 +1087,8 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                                       void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
+extern void pgstat_report_archiver(const char *xlog, bool failed);
+extern void pgstat_report_bgwriter(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1472,13 +1097,16 @@ extern void pgstat_send_bgwriter(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_snapshot(bool shared,
+                                                                Oid relid);
+extern void pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
-extern PgStat_SLRUStats *pgstat_fetch_slru(void);
+extern PgStat_StatSLRUEntry *pgstat_fetch_slru(void);
 
 extern void pgstat_count_slru_page_zeroed(SlruCtl ctl);
 extern void pgstat_count_slru_page_hit(SlruCtl ctl);
@@ -1489,5 +1117,6 @@ extern void pgstat_count_slru_flush(SlruCtl ctl);
 extern void pgstat_count_slru_truncate(SlruCtl ctl);
 extern char *pgstat_slru_name(int idx);
 extern int pgstat_slru_index(const char *name);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 8fda8e4f78..13371e8cb7 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -220,6 +220,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_SXACT,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 454c2df487..2c65231b04 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -88,7 +88,7 @@ enum config_group
     PROCESS_TITLE,
     STATS,
     STATS_MONITORING,
-    STATS_COLLECTOR,
+    STATS_ACTIVITY,
     AUTOVACUUM,
     CLIENT_CONN,
     CLIENT_CONN_STATEMENT,
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 83a15f6795..77d1572a99 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.18.2
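
Reader's note on the patch above: IdleStatsUpdateTimeoutHandler only sets
IdleStatsUpdateTimeoutPending and InterruptPending and then sets the latch,
so the actual work has to happen later at a safe point in the main loop.
Below is a minimal standalone sketch of that flag-and-defer pattern. All
names in it (idle_stats_update_pending, process_interrupts, flush_count) are
hypothetical stand-ins and are not taken from the patch, and the "flush" step
is only an assumption about what the backend does once the flag is seen.

```c
#include <signal.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the backend's pending flags. */
static volatile sig_atomic_t idle_stats_update_pending = false;
static volatile sig_atomic_t interrupt_pending = false;

/* Counts deferred flushes; a stand-in for writing out pending stats. */
static int flush_count = 0;

/*
 * Timeout handler: records that work is pending and returns immediately.
 * The handler in the patch additionally calls SetLatch(MyLatch) to wake
 * the main loop; that part is omitted here.
 */
static void
idle_stats_update_timeout_handler(void)
{
    idle_stats_update_pending = true;
    interrupt_pending = true;
}

/*
 * Called from the main loop at a safe point.  Returns true if a deferred
 * stats flush was performed, false if there was nothing to do.
 */
static bool
process_interrupts(void)
{
    if (!interrupt_pending)
        return false;
    interrupt_pending = false;

    if (idle_stats_update_pending)
    {
        idle_stats_update_pending = false;
        flush_count++;          /* assumed: flush pending stats here */
        return true;
    }
    return false;
}
```

Calling process_interrupts() before the handler has fired does nothing; after
one call to idle_stats_update_timeout_handler() it performs exactly one flush
and clears both flags, so a second call is again a no-op.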

From 4efe988a0d40218cb9e523ff49403a25619786f5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:09 +0900
Subject: [PATCH v32 6/7] Doc part of shared-memory based stats collector.

---
 doc/src/sgml/catalogs.sgml          |   6 +-
 doc/src/sgml/config.sgml            |  29 +++---
 doc/src/sgml/high-availability.sgml |  13 +--
 doc/src/sgml/monitoring.sgml        | 133 +++++++++++++---------------
 doc/src/sgml/ref/pg_dump.sgml       |   9 +-
 5 files changed, 91 insertions(+), 99 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 64614b569c..8bd8fc4d5f 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -8151,9 +8151,9 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l
   <para>
    <xref linkend="view-table"/> lists the system views described here.
    More detailed documentation of each view follows below.
-   There are some additional views that provide access to the results of
-   the statistics collector; they are described in <xref
-   linkend="monitoring-stats-views-table"/>.
+   There are some additional views that provide access to the activity
+   statistics; they are described in
+   <xref linkend="monitoring-stats-views-table"/>.
   </para>
 
   <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c4d6ed4bbc..ca737ee1fc 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7035,11 +7035,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
     <title>Run-time Statistics</title>
 
     <sect2 id="runtime-config-statistics-collector">
-     <title>Query and Index Statistics Collector</title>
+     <title>Query and Index Activity Statistics</title>
 
      <para>
-      These parameters control server-wide statistics collection features.
-      When statistics collection is enabled, the data that is produced can be
+      These parameters control server-wide activity statistics features.
+      When activity statistics is enabled, the data that is produced can be
       accessed via the <structname>pg_stat</structname> and
       <structname>pg_statio</structname> family of system views.
       Refer to <xref linkend="monitoring"/> for more information.
@@ -7055,14 +7055,13 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables the collection of information on the currently
-        executing command of each session, along with the time when
-        that command began execution. This parameter is on by
-        default. Note that even when enabled, this information is not
-        visible to all users, only to superusers and the user owning
-        the session being reported on, so it should not represent a
-        security risk.
-        Only superusers can change this setting.
+        Enables activity tracking on the currently executing command of
+        each session, along with the time when that command began
+        execution. This parameter is on by default. Note that even when
+        enabled, this information is not visible to all users, only to
+        superusers and the user owning the session being reported on, so it
+        should not represent a security risk.  Only superusers can change this
+        setting.
        </para>
       </listitem>
      </varlistentry>
@@ -7093,9 +7092,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables collection of statistics on database activity.
+        Enables tracking of database activity.
         This parameter is on by default, because the autovacuum
-        daemon needs the collected information.
+        daemon needs the activity information.
         Only superusers can change this setting.
        </para>
       </listitem>
@@ -8156,7 +8155,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Specifies the fraction of the total number of heap tuples counted in
-        the previous statistics collection that can be inserted without
+        the previous activity statistics that can be inserted without
         incurring an index scan at the <command>VACUUM</command> cleanup stage.
         This setting currently applies to B-tree indexes only.
        </para>
@@ -8169,7 +8168,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         Index statistics are considered to be stale if the number of newly
         inserted tuples exceeds the <varname>vacuum_cleanup_index_scale_factor</varname>
         fraction of the total number of heap tuples detected by the previous
-        statistics collection. The total number of heap tuples is stored in
+        activity statistics. The total number of heap tuples is stored in
         the index meta-page. Note that the meta-page does not include this data
         until <command>VACUUM</command> finds no dead tuples, so B-tree index
         scan at the cleanup stage can only be skipped if the second and
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index b5d32bb720..b9b73e59f6 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2381,12 +2381,13 @@ HINT:  Recovery cannot continue unless the configuration is changed and the serv
    </para>
 
    <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
-    index usage, etc., will be recorded normally on the standby. Replayed
-    actions will not duplicate their effects on primary, so replaying an
-    insert will not increment the Inserts column of pg_stat_user_tables.
-    The stats file is deleted at the start of recovery, so stats from primary
-    and standby will differ; this is considered a feature, not a bug.
+    The activity statistics is collected during recovery. All scans, reads,
+    blocks, index usage, etc., will be recorded normally on the
+    standby. Replayed actions will not duplicate their effects on primary, so
+    replaying an insert will not increment the Inserts column of
+    pg_stat_user_tables.  The activity statistics is reset at the start of
+    recovery, so stats from primary and standby will differ; this is
+    considered a feature, not a bug.
    </para>
 
    <para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index fd8b17ef8f..95b8c3a884 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -22,7 +22,7 @@
   <para>
    Several tools are available for monitoring database activity and
    analyzing performance.  Most of this chapter is devoted to describing
-   <productname>PostgreSQL</productname>'s statistics collector,
+   <productname>PostgreSQL</productname>'s activity statistics,
    but one should not neglect regular Unix monitoring programs such as
    <command>ps</command>, <command>top</command>, <command>iostat</command>, and <command>vmstat</command>.
    Also, once one has identified a
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    master server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   master process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   master process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
@@ -130,20 +128,21 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
  </sect1>
 
  <sect1 id="monitoring-stats">
-  <title>The Statistics Collector</title>
+  <title>The Activity Statistics</title>
 
   <indexterm zone="monitoring-stats">
    <primary>statistics</primary>
   </indexterm>
 
   <para>
-   <productname>PostgreSQL</productname>'s <firstterm>statistics collector</firstterm>
-   is a subsystem that supports collection and reporting of information about
-   server activity.  Presently, the collector can count accesses to tables
-   and indexes in both disk-block and individual-row terms.  It also tracks
-   the total number of rows in each table, and information about vacuum and
-   analyze actions for each table.  It can also count calls to user-defined
-   functions and the total time spent in each one.
+   <productname>PostgreSQL</productname>'s <firstterm>activity
+   statistics</firstterm> is a subsystem that supports tracking and reporting
+   of information about server activity.  Presently, the activity statistics
+   tracks the count of accesses to tables and indexes in both disk-block and
+   individual-row terms.  It also tracks the total number of rows in each
+   table, and information about vacuum and analyze actions for each table.  It
+   can also track calls to user-defined functions and the total time spent in
+   each one.
   </para>
 
   <para>
@@ -151,15 +150,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    information about exactly what is going on in the system right now, such as
    the exact command currently being executed by other server processes, and
    which other connections exist in the system.  This facility is independent
-   of the collector process.
+   of the activity statistics.
   </para>
 
  <sect2 id="monitoring-stats-setup">
-  <title>Statistics Collection Configuration</title>
+  <title>Activity Statistics Configuration</title>
 
   <para>
-   Since collection of statistics adds some overhead to query execution,
-   the system can be configured to collect or not collect information.
+   Since tracking for the activity statistics adds some overhead to query
+   execution, the system can be configured to track or not track activity.
    This is controlled by configuration parameters that are normally set in
    <filename>postgresql.conf</filename>.  (See <xref linkend="runtime-config"/> for
    details about setting configuration parameters.)
@@ -172,7 +171,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The parameter <xref linkend="guc-track-counts"/> controls whether
-   statistics are collected about table and index accesses.
+   to track activity about table and index accesses.
   </para>
 
   <para>
@@ -196,18 +195,12 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
   </para>
 
   <para>
-   The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
-   When the server shuts down cleanly, a permanent copy of the statistics
-   data is stored in the <filename>pg_stat</filename> subdirectory, so that
-   statistics can be retained across server restarts.  When recovery is
-   performed at server start (e.g. after immediate shutdown, server crash,
-   and point-in-time recovery), all statistics counters are reset.
+   The activity statistics data is kept in shared memory. When the server shuts
+   down cleanly, a permanent copy of the statistics data is stored in
+   the <filename>pg_stat</filename> subdirectory, so that statistics can be
+   retained across server restarts.  When recovery is performed at server
+   start (e.g. after immediate shutdown, server crash, and point-in-time
+   recovery), all statistics counters are reset.
   </para>
 
  </sect2>
@@ -220,48 +213,46 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    linkend="monitoring-stats-dynamic-views-table"/>, are available to show
    the current state of the system. There are also several other
    views, listed in <xref
-   linkend="monitoring-stats-views-table"/>, available to show the results
-   of statistics collection.  Alternatively, one can
-   build custom views using the underlying statistics functions, as discussed
-   in <xref linkend="monitoring-stats-functions"/>.
+   linkend="monitoring-stats-views-table"/>, available to show the activity
+   statistics.  Alternatively, one can build custom views using the underlying
+   statistics functions, as discussed in
+   <xref linkend="monitoring-stats-functions"/>.
   </para>
 
   <para>
-   When using the statistics to monitor collected data, it is important
-   to realize that the information does not update instantaneously.
-   Each individual server process transmits new statistical counts to
-   the collector just before going idle; so a query or transaction still in
-   progress does not affect the displayed totals.  Also, the collector itself
-   emits a new report at most once per <varname>PGSTAT_STAT_INTERVAL</varname>
-   milliseconds (500 ms unless altered while building the server).  So the
-   displayed information lags behind actual activity.  However, current-query
-   information collected by <varname>track_activities</varname> is
-   always up-to-date.
+   When using the activity statistics, it is important to realize that the
+   information does not update instantaneously.  Each individual server process
+   writes out new statistical counts just before going idle, but at most once
+   per <varname>PGSTAT_STAT_INTERVAL</varname> milliseconds (1 second unless
+   altered while building the server); so a query or transaction still in
+   progress does not affect the displayed totals.  However, current-query
+   information tracked by <varname>track_activities</varname> is always
+   up-to-date.
   </para>
 
   <para>
    Another important point is that when a server process is asked to display
-   any of these statistics, it first fetches the most recent report emitted by
-   the collector process and then continues to use this snapshot for all
-   statistical views and functions until the end of its current transaction.
-   So the statistics will show static information as long as you continue the
-   current transaction.  Similarly, information about the current queries of
-   all sessions is collected when any such information is first requested
-   within a transaction, and the same information will be displayed throughout
-   the transaction.
-   This is a feature, not a bug, because it allows you to perform several
-   queries on the statistics and correlate the results without worrying that
-   the numbers are changing underneath you.  But if you want to see new
-   results with each query, be sure to do the queries outside any transaction
-   block.  Alternatively, you can invoke
+   any of these statistics, it first reads the current statistics and then
+   continues to use this snapshot for all statistical views and functions
+   until the end of its current transaction.  So the statistics will show
+   static information as long as you continue the current transaction.
+   Similarly, information about the current queries of all sessions is tracked
+   when any such information is first requested within a transaction, and the
+   same information will be displayed throughout the transaction.  This is a
+   feature, not a bug, because it allows you to perform several queries on the
+   statistics and correlate the results without worrying that the numbers are
+   changing underneath you.  But if you want to see new results with each
+   query, be sure to do the queries outside any transaction block.
+   Alternatively, you can invoke
    <function>pg_stat_clear_snapshot</function>(), which will discard the
    current transaction's statistics snapshot (if any).  The next use of
    statistical information will cause a new snapshot to be fetched.
   </para>
-
+  
   <para>
-   A transaction can also see its own statistics (as yet untransmitted to the
-   collector) in the views <structname>pg_stat_xact_all_tables</structname>,
+   A transaction can also see its own statistics (as yet unwritten to the
+   server-wide activity statistics) in the
+   views <structname>pg_stat_xact_all_tables</structname>,
    <structname>pg_stat_xact_sys_tables</structname>,
    <structname>pg_stat_xact_user_tables</structname>, and
    <structname>pg_stat_xact_user_functions</structname>.  These numbers do not act as
@@ -603,7 +594,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    kernel's I/O cache, and might therefore still be fetched without
    requiring a physical read. Users interested in obtaining more
    detailed information on <productname>PostgreSQL</productname> I/O behavior are
-   advised to use the <productname>PostgreSQL</productname> statistics collector
+   advised to use the <productname>PostgreSQL</productname> activity statistics
    in combination with operating system utilities that allow insight
    into the kernel's handling of I/O.
   </para>
@@ -921,7 +912,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
       <tbody>
        <row>
-        <entry morerows="64"><literal>LWLock</literal></entry>
+        <entry morerows="65"><literal>LWLock</literal></entry>
         <entry><literal>ShmemIndexLock</literal></entry>
         <entry>Waiting to find or allocate space in shared memory.</entry>
        </row>
@@ -1204,6 +1195,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting to allocate or exchange a chunk of memory or update
          counters during Parallel Hash plan execution.</entry>
         </row>
+        <row>
+         <entry><literal>activity_statistics</literal></entry>
+         <entry>Waiting to write out activity statistics to shared memory
+         during transaction end or idle time.</entry>
+        </row>
         <row>
          <entry morerows="9"><literal>Lock</literal></entry>
          <entry><literal>relation</literal></entry>
@@ -1251,7 +1247,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting to acquire a pin on a buffer.</entry>
         </row>
         <row>
-         <entry morerows="12"><literal>Activity</literal></entry>
+         <entry morerows="11"><literal>Activity</literal></entry>
          <entry><literal>ArchiverMain</literal></entry>
          <entry>Waiting in main loop of the archiver process.</entry>
         </row>
@@ -1279,10 +1275,6 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry><literal>LogicalLauncherMain</literal></entry>
          <entry>Waiting in main loop of logical launcher process.</entry>
         </row>
-        <row>
-         <entry><literal>PgStatMain</literal></entry>
-         <entry>Waiting in main loop of the statistics collector process.</entry>
-        </row>
         <row>
          <entry><literal>RecoveryWalStream</literal></entry>
          <entry>Waiting for WAL from a stream at recovery.</entry>
@@ -4282,9 +4274,10 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
      <entry><literal>performing final cleanup</literal></entry>
      <entry>
        <command>VACUUM</command> is performing final cleanup.  During this phase,
-       <command>VACUUM</command> will vacuum the free space map, update statistics
-       in <literal>pg_class</literal>, and report statistics to the statistics
-       collector.  When this phase is completed, <command>VACUUM</command> will end.
+       <command>VACUUM</command> will vacuum the free space map, update
+       statistics in <literal>pg_class</literal>, and update the system-wide
+       activity statistics.  When this phase is completed, <command>VACUUM</command>
+       will end.
      </entry>
     </row>
    </tbody>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index a9bc397165..a9289f84b0 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1289,11 +1289,10 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The database activity of <application>pg_dump</application> is
-   normally collected by the statistics collector.  If this is
-   undesirable, you can set parameter <varname>track_counts</varname>
-   to false via <envar>PGOPTIONS</envar> or the <literal>ALTER
-   USER</literal> command.
+   The database activity of <application>pg_dump</application> is normally
+   collected.  If this is undesirable, you can set
+   parameter <varname>track_counts</varname> to false
+   via <envar>PGOPTIONS</envar> or the <literal>ALTER USER</literal> command.
   </para>
 
  </refsect1>
-- 
2.18.2

From a467bb1346a43f6aa6f72f8c6a59be5dfdf8aab3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 17:00:44 +0900
Subject: [PATCH v32 7/7] Remove the GUC stats_temp_directory

The GUC specified the directory in which temporary statistics files
were stored. The directory is no longer needed by the stats collector,
but it is still used by programs in bin and contrib, and possibly by
other extensions. This patch therefore removes the GUC but leaves some
backing variables and macro definitions in place for backward
compatibility.
---
 doc/src/sgml/backup.sgml                      |  2 -
 doc/src/sgml/config.sgml                      | 19 ---------
 doc/src/sgml/storage.sgml                     |  3 +-
 src/backend/postmaster/pgstat.c               | 13 +++---
 src/backend/replication/basebackup.c          | 13 ++----
 src/backend/utils/misc/guc.c                  | 41 -------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/pgstat.h                          |  5 ++-
 src/test/perl/PostgresNode.pm                 |  4 --
 9 files changed, 13 insertions(+), 88 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index bdc9026c62..2885540362 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1146,8 +1146,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ca737ee1fc..e2eb6d630d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7147,25 +7147,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 3234adb639..559f75fb54 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -122,8 +122,7 @@ Item
 
 <row>
  <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
+ <entry>Subdirectory containing ephemeral files for extensions</entry>
 </row>
 
 <row>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 2df9e858df..b89d3eea73 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -95,15 +95,12 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
+/*
+ * This used to be a GUC variable and is no longer used in this file.  It is
+ * kept, fixed at its former default value, only for backward compatibility
+ * with extensions.
  */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+char       *pgstat_stat_directory = PG_STAT_TMP_DIR;
 
 typedef struct StatsShmemStruct
 {
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 261920b961..733533b955 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -254,7 +254,6 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
     backup_total = 0;
@@ -273,8 +272,6 @@ perform_base_backup(basebackup_options *opt)
                                      backup_total);
     }
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -306,13 +303,9 @@ perform_base_backup(basebackup_options *opt)
          * Calculate the relative path of temporary statistics directory in
          * order to skip the files which are located in that directory later.
          */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
+
+        Assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
+        statrelpath = psprintf("./%s", PG_STAT_TMP_DIR);
 
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index fd45e72b44..8977c5e2da 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -197,7 +197,6 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -4265,17 +4264,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11552,35 +11540,6 @@ check_maintenance_io_concurrency(int *newval, void **extra, GucSource source)
     return true;
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 03760ca6a4..c76ea4a1e7 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -578,7 +578,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5a63840dfe..b8d2f1bd32 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -33,7 +33,10 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
+/*
+ * This used to be the directory to store temporary statistics data in but is
+ * no longer used. Defined here for backward compatibility.
+ */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
 
 /* Values for track_functions GUC variable --- order is significant! */
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 1d5450758e..28b39f6b2a 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -455,10 +455,6 @@ sub init
     print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
       if defined $ENV{TEMP_CONFIG};
 
-    # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
-    # concurrently must not share a stats_temp_directory.
-    print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
     if ($params{allows_streaming})
     {
         if ($params{allows_streaming} eq "logical")
-- 
2.18.2


Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
This conflicts with several recent commits. Rebased.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 728179127cdf26b23219aae3c4893e40361cedad Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Mon, 16 Mar 2020 17:15:35 +0900
Subject: [PATCH v33 1/7] Use standard crash handler in archiver.

Commit 8e19a82640 changed the SIGQUIT handler of almost all processes
so that it does not run atexit callbacks, for safety. The archiver
process should behave the same way for the same reason. The exit status
changes from 1 to 2, but that makes no behavioral difference.
---
 src/backend/postmaster/pgarch.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)
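
The difference between the two exit paths can be seen outside the server with a small standalone sketch. It is not PostgreSQL code: exit_status_of_child() and cleanup_callback() are hypothetical names, and the sketch assumes that the crash handler ends the process with _exit(2), which is what the commit message implies (no atexit callbacks, exit status 2). A forked child registers an atexit callback that reports through a pipe and then terminates either with exit(1), as the removed pgarch_exit() did, or with _exit(2).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int report_fd = -1;      /* write end of the pipe, used by the callback */

/* atexit callback: tells the parent that exit-time cleanup ran */
static void cleanup_callback(void)
{
    if (report_fd >= 0 && write(report_fd, "x", 1) != 1)
    {
        /* nothing useful can be done about a failed write here */
    }
}

/*
 * Fork a child that registers cleanup_callback and then terminates with
 * exit(1) (crash_exit == 0) or with _exit(2) (crash_exit != 0).
 * Returns the child's exit status, or -1 on failure; *callback_ran is set
 * to 1 if the atexit callback ran in the child, else 0.
 */
static int exit_status_of_child(int crash_exit, int *callback_ran)
{
    int     fds[2];
    int     status;
    pid_t   pid;
    char    c;

    assert(callback_ran != NULL);
    if (pipe(fds) != 0)
        return -1;
    fflush(NULL);               /* do not let the child re-flush our stdio */
    pid = fork();
    if (pid < 0)
    {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0)
    {
        close(fds[0]);
        report_fd = fds[1];
        atexit(cleanup_callback);
        if (crash_exit)
            _exit(2);           /* skips atexit callbacks */
        exit(1);                /* runs atexit callbacks first */
    }
    close(fds[1]);
    /* one byte arrives only if the callback ran; otherwise read() sees EOF */
    *callback_ran = (read(fds[0], &c, 1) == 1);
    close(fds[0]);
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Called with crash_exit set to 0, the helper reports that the callback ran and the status is 1; with crash_exit set to 1, no callback runs and the status is 2.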

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 01ffd6513c..37be0e2bbb 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -96,7 +96,6 @@ static pid_t pgarch_forkexec(void);
 #endif
 
 NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_exit(SIGNAL_ARGS);
 static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
@@ -229,7 +228,7 @@ PgArchiverMain(int argc, char *argv[])
     pqsignal(SIGHUP, SignalHandlerForConfigReload);
     pqsignal(SIGINT, SIG_IGN);
     pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
-    pqsignal(SIGQUIT, pgarch_exit);
+    pqsignal(SIGQUIT, SignalHandlerForCrashExit);
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
     pqsignal(SIGUSR1, pgarch_waken);
@@ -246,14 +245,6 @@ PgArchiverMain(int argc, char *argv[])
     exit(0);
 }
 
-/* SIGQUIT signal handler for archiver process */
-static void
-pgarch_exit(SIGNAL_ARGS)
-{
-    /* SIGQUIT means curl up and die ... */
-    exit(1);
-}
-
 /* SIGUSR1 signal handler for archiver process */
 static void
 pgarch_waken(SIGNAL_ARGS)
-- 
2.18.2

From cde7db2a80575076951b01736be8c1e7369d9d94 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:03 +0900
Subject: [PATCH v33 2/7] sequential scan for dshash

Dshash did not allow scanning all entries sequentially. This adds that
functionality. The interface is similar to, but slightly different from,
both that of dynahash and the simple dshash search functions. One of the
most significant differences is that the sequential scan interface of
dshash always needs a call to dshash_seq_term when a scan ends. Another
is locking. Dshash holds the partition lock when returning an entry;
dshash_seq_next() also holds the lock when returning an entry, but
callers must not release it, since the lock is essential to continue the
scan. The seqscan interface allows entry deletion during a scan. The
in-scan deletion should be performed by dshash_delete_current().
---
 src/backend/lib/dshash.c | 149 ++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  21 ++++++
 2 files changed, 169 insertions(+), 1 deletion(-)
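
A note on the locking order described above: the scan walks buckets in
index order, and the new PARTITION_FOR_BUCKET_INDEX macro maps consecutive
bucket indexes to non-decreasing partition numbers, so partition locks are
taken in ascending order. The following standalone sketch only illustrates
that arithmetic. It assumes DSHASH_NUM_PARTITIONS_LOG2 is 7, as in dshash.c,
and count_partition_switches() is a hypothetical helper, not part of the
patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: same value as in dshash.c (128 partitions). */
#define DSHASH_NUM_PARTITIONS_LOG2 7

/* Copies of the macros used by the patch, for illustration only. */
#define NUM_SPLITS(size_log2) \
    ((size_log2) - DSHASH_NUM_PARTITIONS_LOG2)
#define NUM_BUCKETS(size_log2) \
    (((size_t) 1) << (size_log2))
#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2) \
    ((bucket_idx) >> NUM_SPLITS(size_log2))

/*
 * Hypothetical helper: visit every bucket in index order, the way a
 * sequential scan does, and count how often the partition changes.
 * Returns -1 if the partition number ever goes backwards.
 */
int
count_partition_switches(size_t size_log2)
{
    size_t      nbuckets;
    size_t      cur;
    size_t      b;
    int         switches = 0;

    /* A table never has fewer buckets than partitions. */
    assert(size_log2 >= DSHASH_NUM_PARTITIONS_LOG2);

    nbuckets = NUM_BUCKETS(size_log2);
    cur = PARTITION_FOR_BUCKET_INDEX((size_t) 0, size_log2);

    for (b = 1; b < nbuckets; b++)
    {
        size_t      next = PARTITION_FOR_BUCKET_INDEX(b, size_log2);

        if (next < cur)
            return -1;          /* would break the ascending lock order */
        if (next != cur)
        {
            switches++;         /* here the scan would swap partition locks */
            cur = next;
        }
    }
    return switches;
}
```

With size_log2 = 9 there are 512 buckets, four per partition, so a full
scan changes partition locks once per partition boundary.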

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 78ccf03217..1ef093e2e9 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -127,6 +127,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +157,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -324,7 +332,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -592,6 +600,145 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should always be called when a scan finishes.
+ * The caller may delete returned elements in the midst of a scan by using
+ * dshash_delete_current(). exclusive must be true to delete elements.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool exclusive)
+{
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->exclusive = exclusive;
+}
+
+/*
+ * Returns the next element.
+ *
+ * Returned elements are locked and the caller must not explicitly release
+ * them. The lock is released by a later dshash_seq_next() or dshash_seq_term().
+ */
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert(status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+        LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                      status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        status->curpartition = partition;
+
+        /* resize doesn't happen from now until seq scan ends */
+        status->nbuckets =
+            NUM_BUCKETS(status->hash_table->control->size_log2);
+        ensure_valid_bucket_pointers(status->hash_table);
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        int next_partition;
+
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            return NULL;
+        }
+
+        /* Check whether we move to the next partition */
+        next_partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+
+        if (status->curpartition != next_partition)
+        {
+            /*
+             * Move to the next partition. Lock the next partition then release
+             * the current one, not in the reverse order, to avoid concurrent
+             * resizing.  Avoid deadlock by taking locks in the same order
+             * as resize().
+             */
+            LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                         next_partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                         status->curpartition));
+            status->curpartition = next_partition;
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * The caller may delete the item. Store the next item in case of deletion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+/*
+ * Terminates the seqscan and releases all locks.
+ *
+ * Should always be called when finishing or exiting a seqscan.
+ */
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+
+    if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+/* Remove the current entry during a seq scan. */
+void
+dshash_delete_current(dshash_seq_status *status)
+{
+    dshash_table       *hash_table    = status->hash_table;
+    dshash_table_item  *item        = status->curitem;
+    size_t                partition    = PARTITION_FOR_HASH(item->hash);
+
+    Assert(status->exclusive);
+    Assert(hash_table->control->magic == DSHASH_MAGIC);
+    Assert(hash_table->find_locked);
+    Assert(hash_table->find_exclusively_locked);
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition),
+                                LW_EXCLUSIVE));
+
+    delete_item(hash_table, item);
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b86df68e77..0ca9514021 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,21 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state. The detail is exposed to let users know the storage
+ * size but it should be considered as an opaque type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;    /* dshash table working on */
+    int                    curbucket;    /* bucket number we are at */
+    int                    nbuckets;    /* total number of buckets in the dshash */
+    dshash_table_item  *curitem;    /* item we are currently at */
+    dsa_pointer            pnextitem;    /* dsa-pointer to the next item */
+    int                    curpartition;    /* partition number we are at */
+    bool                exclusive;    /* locking mode */
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
                                    const dshash_parameters *params,
@@ -80,6 +95,12 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
+extern void dshash_delete_current(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.18.2

From a55eb747dfcd6e314efcd756dde4dcdcdac23089 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:57 +0900
Subject: [PATCH v33 3/7] Add conditional lock feature to dshash

Dshash currently waits for locks unconditionally. This is inconvenient
when we want to avoid being blocked by other processes. This commit
adds dshash_find_extended, an alternative to dshash_find and
dshash_find_or_insert that allows immediate return on lock failure.
---
 src/backend/lib/dshash.c | 98 +++++++++++++++++++++-------------------
 src/include/lib/dshash.h |  3 ++
 2 files changed, 55 insertions(+), 46 deletions(-)
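
For readers unfamiliar with the pattern, the "nowait" behaviour can be
pictured with the toy model below. It is only a sketch: toy_lock, toy_table
and toy_find() are hypothetical names, the lock is a C11 atomic_flag rather
than an LWLock, and the table holds a single entry. As in the patch, a NULL
return can mean either that the lock was busy or that the key was not found.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a partition lock; this is not PostgreSQL's LWLock. */
typedef struct toy_lock
{
    atomic_flag held;
} toy_lock;

/* Conditional acquire: try once, return true only if we got the lock. */
bool
toy_lock_try(toy_lock *l)
{
    return !atomic_flag_test_and_set(&l->held);
}

/* Unconditional acquire: spin until the lock becomes free. */
void
toy_lock_acquire(toy_lock *l)
{
    while (atomic_flag_test_and_set(&l->held))
        ;                       /* busy-wait; a real lock would sleep */
}

/* One-entry "table" protected by a single lock. */
typedef struct toy_table
{
    toy_lock    lock;
    int         key;
    int         value;
} toy_table;

void
toy_table_init(toy_table *t, int key, int value)
{
    atomic_flag_clear(&t->lock.held);
    t->key = key;
    t->value = value;
}

/* Release the lock left held by a successful toy_find(). */
void
toy_release(toy_table *t)
{
    atomic_flag_clear(&t->lock.held);
}

/*
 * Look up "key".  With nowait = true, return NULL at once if the lock is
 * busy instead of waiting.  On success the lock stays held and the caller
 * must call toy_release().
 */
int *
toy_find(toy_table *t, int key, bool nowait)
{
    assert(t != NULL);

    if (!nowait)
        toy_lock_acquire(&t->lock);
    else if (!toy_lock_try(&t->lock))
        return NULL;            /* lock busy: give up immediately */

    if (t->key != key)
    {
        toy_release(t);         /* not found: do not keep the lock */
        return NULL;
    }
    return &t->value;
}
```

A caller that must not block would call toy_find(&t, key, true) and simply
retry later on NULL, which is the use the commit message has in mind.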

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 1ef093e2e9..d7ee6de11e 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -383,6 +383,10 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  * the caller must take care to ensure that the entry is not left corrupted.
  * The lock mode is either shared or exclusive depending on 'exclusive'.
  *
+ * If found is not NULL, *found is set to true if the key is found in the hash
+ * table. If the key is not found, *found is set to false and a pointer to a
+ * newly created entry is returned.
+ *
  * The caller must not lock a lock already.
  *
  * Note that the lock held is in fact an LWLock, so interrupts will be held on
@@ -392,36 +396,7 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
 {
-    dshash_hash hash;
-    size_t        partition;
-    dshash_table_item *item;
-
-    hash = hash_key(hash_table, key);
-    partition = PARTITION_FOR_HASH(hash);
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
-
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
-    ensure_valid_bucket_pointers(hash_table);
-
-    /* Search the active bucket. */
-    item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
-
-    if (!item)
-    {
-        /* Not found. */
-        LWLockRelease(PARTITION_LOCK(hash_table, partition));
-        return NULL;
-    }
-    else
-    {
-        /* The caller will free the lock by calling dshash_release_lock. */
-        hash_table->find_locked = true;
-        hash_table->find_exclusively_locked = exclusive;
-        return ENTRY_FROM_ITEM(item);
-    }
+    return dshash_find_extended(hash_table, key, exclusive, false, false, NULL);
 }
 
 /*
@@ -439,31 +414,60 @@ dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
 {
-    dshash_hash hash;
-    size_t        partition_index;
-    dshash_partition *partition;
+    return dshash_find_extended(hash_table, key, true, false, true, found);
+}
+
+
+/*
+ * Find the key in the hash table.
+ *
+ * "exclusive" is the lock mode in which the partition for the returned item
+ * is locked.  If "nowait" is true, the function returns NULL immediately if
+ * the required lock cannot be acquired.  "insert" indicates insert mode: if
+ * the key is not found, a new entry is inserted and *found is set to false;
+ * otherwise *found is set to true. "found" must be non-null in this mode.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool insert, bool *found)
+{
+    dshash_hash hash = hash_key(hash_table, key);
+    size_t        partidx = PARTITION_FOR_HASH(hash);
+    dshash_partition *partition = &hash_table->control->partitions[partidx];
+    LWLockMode  lockmode = exclusive ? LW_EXCLUSIVE : LW_SHARED;
     dshash_table_item *item;
 
-    hash = hash_key(hash_table, key);
-    partition_index = PARTITION_FOR_HASH(hash);
-    partition = &hash_table->control->partitions[partition_index];
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
+    /* must be exclusive when insert allowed */
+    Assert(!insert || (exclusive && found != NULL));
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (!nowait)
+        LWLockAcquire(PARTITION_LOCK(hash_table, partidx), lockmode);
+    else if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partidx),
+                                       lockmode))
+        return NULL;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
     item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
 
     if (item)
-        *found = true;
+    {
+        if (found)
+            *found = true;
+    }
     else
     {
-        *found = false;
+        if (found)
+            *found = false;
+
+        if (!insert)
+        {
+            /* The caller didn't ask us to add a new entry. */
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+            return NULL;
+        }
 
         /* Check if we are getting too full. */
         if (partition->count > MAX_COUNT_PER_PARTITION(hash_table))
@@ -479,7 +483,8 @@ restart:
              * Give up our existing lock first, because resizing needs to
              * reacquire all the locks in the right order to avoid deadlocks.
              */
-            LWLockRelease(PARTITION_LOCK(hash_table, partition_index));
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+
             resize(hash_table, hash_table->size_log2 + 1);
 
             goto restart;
@@ -493,12 +498,13 @@ restart:
         ++partition->count;
     }
 
-    /* The caller must release the lock with dshash_release_lock. */
+    /* The caller will free the lock by calling dshash_release_lock. */
     hash_table->find_locked = true;
-    hash_table->find_exclusively_locked = true;
+    hash_table->find_exclusively_locked = exclusive;
     return ENTRY_FROM_ITEM(item);
 }
 
+
 /*
  * Remove an entry by key.  Returns true if the key was found and the
  * corresponding entry was removed.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 0ca9514021..5f7a60febd 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -91,6 +91,9 @@ extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                                    const void *key, bool *found);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+                                  bool exclusive, bool nowait, bool insert,
+                                  bool *found);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.18.2

From 145330976799132e9a70422ae73257c747fe885d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:59:38 +0900
Subject: [PATCH v33 4/7] Make archiver process an auxiliary process

This is a preliminary patch for shared-memory based stats collector.

The archiver process must be an auxiliary process, since it uses shared
memory once stats data is moved into shared memory. Make the process
an auxiliary process so that this works.
---
 src/backend/access/transam/xlogarchive.c |   6 +-
 src/backend/bootstrap/bootstrap.c        |  22 ++--
 src/backend/postmaster/pgarch.c          | 130 +++--------------------
 src/backend/postmaster/postmaster.c      |  50 +++++----
 src/backend/storage/lmgr/proc.c          |   1 +
 src/include/access/xlog.h                |   3 +
 src/include/access/xlogarchive.h         |   1 +
 src/include/miscadmin.h                  |   2 +
 src/include/postmaster/pgarch.h          |   4 +-
 src/include/storage/pmsignal.h           |   1 -
 src/include/storage/proc.h               |   3 +
 11 files changed, 69 insertions(+), 154 deletions(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 55becd65d4..2417de72a0 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -29,7 +29,9 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/latch.h"
 #include "storage/pmsignal.h"
+#include "storage/proc.h"
 
 /*
  * Attempt to retrieve the specified file from off-line archival storage.
@@ -489,8 +491,8 @@ XLogArchiveNotify(const char *xlog)
     }
 
     /* Notify archiver that it's got something to do */
-    if (IsUnderPostmaster)
-        SendPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER);
+    if (IsUnderPostmaster && ProcGlobal->archiverLatch)
+        SetLatch(ProcGlobal->archiverLatch);
 }
 
 /*
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 5480a024e0..d398ce6f03 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -319,6 +319,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
         case StartupProcess:
             MyBackendType = B_STARTUP;
             break;
+        case ArchiverProcess:
+            MyBackendType = B_ARCHIVER;
+            break;
         case BgWriterProcess:
             MyBackendType = B_BG_WRITER;
             break;
@@ -439,30 +442,29 @@ AuxiliaryProcessMain(int argc, char *argv[])
             proc_exit(1);        /* should never return */
 
         case StartupProcess:
-            /* don't set signals, startup process has its own agenda */
             StartupProcessMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
+
+        case ArchiverProcess:
+            PgArchiverMain();
+            proc_exit(1);
 
         case BgWriterProcess:
-            /* don't set signals, bgwriter has its own agenda */
             BackgroundWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case CheckpointerProcess:
-            /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalWriterProcess:
-            /* don't set signals, walwriter has its own agenda */
             InitXLOGAccess();
             WalWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalReceiverProcess:
-            /* don't set signals, walreceiver has its own agenda */
             WalReceiverMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         default:
             elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 37be0e2bbb..063d1323ea 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -48,6 +48,7 @@
 #include "storage/latch.h"
 #include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
+#include "storage/procsignal.h"
 #include "utils/guc.h"
 #include "utils/ps_status.h"
 
@@ -78,13 +79,11 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
-static volatile sig_atomic_t wakened = false;
 static volatile sig_atomic_t ready_to_stop = false;
 
 /* ----------
@@ -95,8 +94,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
 static void pgarch_ArchiverCopyLoop(void);
@@ -110,75 +107,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -212,14 +140,9 @@ pgarch_forkexec(void)
 #endif                            /* EXEC_BACKEND */
 
 
-/*
- * PgArchiverMain
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
- *    since we don't use 'em, it hardly matters...
- */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+/* Main entry point for archiver process */
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -231,33 +154,26 @@ PgArchiverMain(int argc, char *argv[])
     pqsignal(SIGQUIT, SignalHandlerForCrashExit);
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, pgarch_waken);
+    pqsignal(SIGUSR1, procsignal_sigusr1_handler);
     pqsignal(SIGUSR2, pgarch_waken_stop);
+
     /* Reset some signals that are accepted by postmaster but not here */
     pqsignal(SIGCHLD, SIG_DFL);
+
+    /* Unblock signals (they were blocked when the postmaster forked us) */
     PG_SETMASK(&UnBlockSig);
 
-    MyBackendType = B_ARCHIVER;
-    init_ps_display(NULL);
+    /*
+     * Advertise our latch that backends can use to wake us up while we're
+     * sleeping.
+     */
+    ProcGlobal->archiverLatch = &MyProc->procLatch;
 
     pgarch_MainLoop();
 
     exit(0);
 }
 
-/* SIGUSR1 signal handler for archiver process */
-static void
-pgarch_waken(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    /* set flag that there is work to be done */
-    wakened = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
 /* SIGUSR2 signal handler for archiver process */
 static void
 pgarch_waken_stop(SIGNAL_ARGS)
@@ -282,14 +198,6 @@ pgarch_MainLoop(void)
     pg_time_t    last_copy_time = 0;
     bool        time_to_stop;
 
-    /*
-     * We run the copy loop immediately upon entry, in case there are
-     * unarchived files left over from a previous database run (or maybe the
-     * archiver died unexpectedly).  After that we wait for a signal or
-     * timeout before doing more.
-     */
-    wakened = true;
-
     /*
      * There shouldn't be anything for the archiver to do except to wait for a
      * signal ... however, the archiver exists to protect our data, so she
@@ -328,12 +236,8 @@ pgarch_MainLoop(void)
         }
 
         /* Do what we're here for */
-        if (wakened || time_to_stop)
-        {
-            wakened = false;
-            pgarch_ArchiverCopyLoop();
-            last_copy_time = time(NULL);
-        }
+        pgarch_ArchiverCopyLoop();
+        last_copy_time = time(NULL);
 
         /*
          * Sleep until a signal is received, or until a poll is forced by
@@ -354,13 +258,9 @@ pgarch_MainLoop(void)
                                WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                                timeout * 1000L,
                                WAIT_EVENT_ARCHIVER_MAIN);
-                if (rc & WL_TIMEOUT)
-                    wakened = true;
                 if (rc & WL_POSTMASTER_DEATH)
                     time_to_stop = true;
             }
-            else
-                wakened = true;
         }
 
         /*
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 0578e92ba9..2687d2af47 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -539,6 +539,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1785,7 +1786,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -3055,7 +3056,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3190,20 +3191,16 @@ reaper(SIGNAL_ARGS)
         }
 
         /*
-         * Was it the archiver?  If so, just try to start a new one; no need
-         * to force reset of the rest of the system.  (If fail, we'll try
-         * again in future cycles of the main loop.).  Unless we were waiting
-         * for it to shut down; don't restart it in that case, and
-         * PostmasterStateMachine() will advance to the next shutdown step.
+         * Was it the archiver?  Normal exit can be ignored; we'll start a new
+         * one at the next iteration of the postmaster's main loop, if
+         * necessary. Any other exit condition is treated as a crash.
          */
         if (pid == PgArchPID)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3451,7 +3448,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3656,6 +3653,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3928,6 +3937,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5203,7 +5213,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5246,16 +5256,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (StartWorkerNeeded || HaveCrashedWorker)
         maybe_start_bgworkers();
 
-    if (CheckPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER) &&
-        PgArchPID != 0)
-    {
-        /*
-         * Send SIGUSR1 to archiver process, to wake it up and begin archiving
-         * next WAL file.
-         */
-        signal_child(PgArchPID, SIGUSR1);
-    }
-
     /* Tell syslogger to rotate logfile if requested */
     if (SysLoggerPID != 0)
     {
@@ -5488,6 +5488,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 5aa19d3f78..7acc48734e 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -187,6 +187,7 @@ InitProcGlobal(void)
     ProcGlobal->startupBufferPinWaitBufId = -1;
     ProcGlobal->walwriterLatch = NULL;
     ProcGlobal->checkpointerLatch = NULL;
+    ProcGlobal->archiverLatch = NULL;
     pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
     pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 0a12afb59e..099da8cd76 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -336,6 +336,9 @@ extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern void XLogArchiveWakeupStart(void);
+extern void XLogArchiveWakeupEnd(void);
+extern void XLogArchiveWakeup(void);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool PromoteIsTriggered(void);
diff --git a/src/include/access/xlogarchive.h b/src/include/access/xlogarchive.h
index 1c67de2ede..54ce0b97d7 100644
--- a/src/include/access/xlogarchive.h
+++ b/src/include/access/xlogarchive.h
@@ -25,6 +25,7 @@ extern void ExecuteRecoveryCommand(const char *command, const char *commandName,
 extern void KeepFileRestoredFromArchive(const char *path, const char *xlogfname);
 extern void XLogArchiveNotify(const char *xlog);
 extern void XLogArchiveNotifySeg(XLogSegNo segno);
+extern void XLogArchiveWakeup(void);
 extern void XLogArchiveForceDone(const char *xlog);
 extern bool XLogArchiveCheckDone(const char *xlog);
 extern bool XLogArchiveIsBusy(const char *xlog);
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14fa127ab1..619b2f9c71 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -417,6 +417,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -429,6 +430,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index b3200874ca..e3ffc63f14 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 56c5ec4481..c691acf8cd 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -34,7 +34,6 @@ typedef enum
 {
     PMSIGNAL_RECOVERY_STARTED,    /* recovery has started */
     PMSIGNAL_BEGIN_HOT_STANDBY, /* begin Hot Standby */
-    PMSIGNAL_WAKEN_ARCHIVER,    /* send a NOTIFY signal to xlog archiver */
     PMSIGNAL_ROTATE_LOGFILE,    /* send SIGUSR1 to syslogger to rotate logfile */
     PMSIGNAL_START_AUTOVAC_LAUNCHER,    /* start an autovacuum launcher */
     PMSIGNAL_START_AUTOVAC_WORKER,    /* start an autovacuum worker */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index ae4f573ab4..1a8a0c2e15 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -272,6 +272,9 @@ typedef struct PROC_HDR
     int            startupProcPid;
     /* Buffer id of the buffer that Startup process waits for pin on, or -1 */
     int            startupBufferPinWaitBufId;
+    /* Archiver process's latch */
+    Latch       *archiverLatch;
+    /* Current shared estimate of appropriate spins_per_delay value */
 } PROC_HDR;
 
 extern PGDLLIMPORT PROC_HDR *ProcGlobal;
-- 
2.18.2

From b031ba7ef781b87da3fa27b43bad902dec54c290 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:14 +0900
Subject: [PATCH v33 5/7] Shared-memory based stats collector

Previously, activity statistics were collected via sockets and
periodically shared among backends through files. Such files can reach
tens of megabytes and are created as often as every 500ms, and that
large data is serialized by the stats collector and then de-serialized
by every backend periodically. To avoid that cost, this patch places
activity statistics data in shared memory. Each backend accumulates
statistics numbers locally and then tries to move them into the shared
statistics at every transaction end, but at intervals not shorter than
500ms. Locks on the shared statistics are acquired per unit, such as a
table or a function, so the expected chance of collision is not high.
Furthermore, until 1 second has elapsed since the last flush to shared
stats, a lock failure postpones the flush so that lock contention
doesn't slow down transactions.  Finally, the stats flush waits for
locks so that the shared statistics don't get stale.
---
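Note (illustration only, not part of the patch): the retry schedule
described in the new pgstat.c header comment can be sketched roughly as
below. The three constants are the values quoted in that comment. The
helper names next_retry_interval() and must_force_flush() are
hypothetical and do not appear in the patch, and the sketch assumes that
the doubling is not capped before PGSTAT_MAX_INTERVAL is reached.

```c
#include <assert.h>
#include <stdbool.h>

/* Values quoted from the pgstat.c header comment, in milliseconds. */
#define PGSTAT_MIN_INTERVAL        1000
#define PGSTAT_RETRY_MIN_INTERVAL   250
#define PGSTAT_MAX_INTERVAL       10000

/*
 * Hypothetical helper: delay before the next flush attempt after a lock
 * failure.  Pass 0 for the first retry; later retries double the delay.
 */
static int
next_retry_interval(int prev_interval)
{
    assert(prev_interval >= 0);

    if (prev_interval == 0)
        return PGSTAT_RETRY_MIN_INTERVAL;
    return prev_interval * 2;
}

/*
 * Hypothetical helper: true once enough time has passed since the first
 * attempt that the flush should wait for locks instead of skipping.
 */
static bool
must_force_flush(int ms_since_first_try)
{
    return ms_since_first_try >= PGSTAT_MAX_INTERVAL;
}
```

A caller would try a no-wait flush, sleep for next_retry_interval() on
lock failure, and switch to a blocking flush once must_force_flush()
returns true.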
 src/backend/access/heap/heapam_handler.c      |    4 +-
 src/backend/access/heap/vacuumlazy.c          |    2 +-
 src/backend/access/transam/xlog.c             |    4 +-
 src/backend/catalog/index.c                   |   24 +-
 src/backend/commands/analyze.c                |    8 +-
 src/backend/commands/dbcommands.c             |    2 +-
 src/backend/commands/matview.c                |    8 +-
 src/backend/commands/vacuum.c                 |    4 +-
 src/backend/postmaster/autovacuum.c           |   74 +-
 src/backend/postmaster/bgwriter.c             |    4 +-
 src/backend/postmaster/checkpointer.c         |   46 +-
 src/backend/postmaster/interrupt.c            |    5 +-
 src/backend/postmaster/pgarch.c               |   12 +-
 src/backend/postmaster/pgstat.c               | 5130 +++++++----------
 src/backend/postmaster/postmaster.c           |   88 +-
 src/backend/replication/basebackup.c          |    4 +-
 src/backend/storage/buffer/bufmgr.c           |    8 +-
 src/backend/storage/ipc/ipci.c                |    2 +
 src/backend/storage/lmgr/lwlock.c             |    1 +
 src/backend/storage/lmgr/lwlocknames.txt      |    1 +
 src/backend/storage/smgr/smgr.c               |   12 +-
 src/backend/tcop/postgres.c                   |   37 +-
 src/backend/utils/adt/pgstatfuncs.c           |   75 +-
 src/backend/utils/init/globals.c              |    1 +
 src/backend/utils/init/miscinit.c             |    3 -
 src/backend/utils/init/postinit.c             |   11 +
 src/backend/utils/misc/guc.c                  |   14 +-
 src/backend/utils/misc/postgresql.conf.sample |    2 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |    2 +-
 src/include/miscadmin.h                       |    3 +-
 src/include/pgstat.h                          |  583 +-
 src/include/storage/lwlock.h                  |    1 +
 src/include/utils/guc_tables.h                |    2 +-
 src/include/utils/timeout.h                   |    1 +
 34 files changed, 2277 insertions(+), 3901 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 56b35622f1..80a3c50994 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1059,8 +1059,8 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
                  * our own.  In this case we should count and sample the row,
                  * to accommodate users who load a table and analyze it in one
                  * transaction.  (pgstat_report_analyze has to adjust the
-                 * numbers we send to the stats collector to make this come
-                 * out right.)
+                 * numbers we report to the activity stats facility to make
+                 * this come out right.)
                  */
                 if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(targtuple->t_data)))
                 {
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 3c18db29f1..980f16c169 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -595,7 +595,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
                         new_min_multi,
                         false);
 
-    /* report results to the stats collector, too */
+    /* report results to the activity stats facility, too */
     pgstat_report_vacuum(RelationGetRelid(onerel),
                          onerel->rd_rel->relisshared,
                          new_live_tuples,
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 065eb275b1..09a11f8220 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8570,9 +8570,9 @@ LogCheckpointEnd(bool restartpoint)
                         &sync_secs, &sync_usecs);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time +=
+    BgWriterStats.checkpoint_write_time +=
         write_secs * 1000 + write_usecs / 1000;
-    BgWriterStats.m_checkpoint_sync_time +=
+    BgWriterStats.checkpoint_sync_time +=
         sync_secs * 1000 + sync_usecs / 1000;
 
     /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 7cfbdd57db..f3779dc4aa 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1698,28 +1698,10 @@ index_concurrently_swap(Oid newIndexId, Oid oldIndexId, const char *oldName)
 
     /*
      * Copy over statistics from old to new index
+     * The data will be flushed by the next pgstat_report_stat()
+     * call.
      */
-    {
-        PgStat_StatTabEntry *tabentry;
-
-        tabentry = pgstat_fetch_stat_tabentry(oldIndexId);
-        if (tabentry)
-        {
-            if (newClassRel->pgstat_info)
-            {
-                newClassRel->pgstat_info->t_counts.t_numscans = tabentry->numscans;
-                newClassRel->pgstat_info->t_counts.t_tuples_returned = tabentry->tuples_returned;
-                newClassRel->pgstat_info->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_hit = tabentry->blocks_hit;
-
-                /*
-                 * The data will be sent by the next pgstat_report_stat()
-                 * call.
-                 */
-            }
-        }
-    }
+    pgstat_copy_index_counters(oldIndexId, newClassRel->pgstat_info);
 
     /* Close relations */
     table_close(pg_class, RowExclusiveLock);
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 924ef37c81..06b03cb8e1 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -655,10 +655,10 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
     }
 
     /*
-     * Report ANALYZE to the stats collector, too.  However, if doing
-     * inherited stats we shouldn't report, because the stats collector only
-     * tracks per-table stats.  Reset the changes_since_analyze counter only
-     * if we analyzed all columns; otherwise, there is still work for
+     * Report ANALYZE to the activity stats facility, too.  However, if doing
+     * inherited stats we shouldn't report, because the activity stats facility
+     * only tracks per-table stats.  Reset the changes_since_analyze counter
+     * only if we analyzed all columns; otherwise, there is still work for
      * auto-analyze to do.
      */
     if (!inh)
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 367c30adb0..a18e7068ae 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -971,7 +971,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
     DropDatabaseBuffers(db_id);
 
     /*
-     * Tell the stats collector to forget it immediately, too.
+     * Tell the activity stats facility to forget it immediately, too.
      */
     pgstat_drop_database(db_id);
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index f80a9e96a9..e7ccc6eba7 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -336,10 +336,10 @@ ExecRefreshMatView(RefreshMatViewStmt *stmt, const char *queryString,
         refresh_by_heap_swap(matviewOid, OIDNewHeap, relpersistence);
 
         /*
-         * Inform stats collector about our activity: basically, we truncated
-         * the matview and inserted some new data.  (The concurrent code path
-         * above doesn't need to worry about this because the inserts and
-         * deletes it issues get counted by lower-level code.)
+         * Inform activity stats facility about our activity: basically, we
+         * truncated the matview and inserted some new data.  (The concurrent
+         * code path above doesn't need to worry about this because the inserts
+         * and deletes it issues get counted by lower-level code.)
          */
         pgstat_count_truncate(matviewRel);
         if (!stmt->skipData)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 5a110edb07..d0015eb411 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -319,8 +319,8 @@ vacuum(List *relations, VacuumParams *params,
                  errmsg("VACUUM option DISABLE_PAGE_SKIPPING cannot be used with FULL")));
 
     /*
-     * Send info about dead objects to the statistics collector, unless we are
-     * in autovacuum --- autovacuum.c does this for itself.
+     * Send info about dead objects to the activity statistics facility, unless
+     * we are in autovacuum --- autovacuum.c does this for itself.
      */
     if ((params->options & VACOPT_VACUUM) && !IsAutoVacuumWorkerProcess())
         pgstat_vacuum_stat();
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 7e97ffab27..19a2357f0d 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -338,9 +338,6 @@ static void autovacuum_do_vac_analyze(autovac_table *tab,
                                       BufferAccessStrategy bstrategy);
 static AutoVacOpts *extract_autovac_opts(HeapTuple tup,
                                          TupleDesc pg_class_desc);
-static PgStat_StatTabEntry *get_pgstat_tabentry_relid(Oid relid, bool isshared,
-                                                      PgStat_StatDBEntry *shared,
-                                                      PgStat_StatDBEntry *dbentry);
 static void perform_work_item(AutoVacuumWorkItem *workitem);
 static void autovac_report_activity(autovac_table *tab);
 static void autovac_report_workitem(AutoVacuumWorkItem *workitem,
@@ -1665,12 +1662,12 @@ AutoVacWorkerMain(int argc, char *argv[])
         char        dbname[NAMEDATALEN];
 
         /*
-         * Report autovac startup to the stats collector.  We deliberately do
-         * this before InitPostgres, so that the last_autovac_time will get
-         * updated even if the connection attempt fails.  This is to prevent
-         * autovac from getting "stuck" repeatedly selecting an unopenable
-         * database, rather than making any progress on stuff it can connect
-         * to.
+         * Report autovac startup to the activity stats facility.  We
+         * deliberately do this before InitPostgres, so that the
+         * last_autovac_time will get updated even if the connection attempt
+         * fails.  This is to prevent autovac from getting "stuck" repeatedly
+         * selecting an unopenable database, rather than making any progress on
+         * stuff it can connect to.
          */
         pgstat_report_autovac(dbid);
 
@@ -1938,8 +1935,6 @@ do_autovacuum(void)
     HASHCTL        ctl;
     HTAB       *table_toast_map;
     ListCell   *volatile cell;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     BufferAccessStrategy bstrategy;
     ScanKeyData key;
     TupleDesc    pg_class_desc;
@@ -1958,17 +1953,11 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /*
-     * may be NULL if we couldn't find an entry (only happens if we are
-     * forcing a vacuum for anti-wrap purposes).
-     */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
 
     /*
-     * Clean up any dead statistics collector entries for this DB. We always
+     * Clean up any dead activity statistics entries for this DB. We always
      * want to do this exactly once per DB-processing cycle, even if we find
      * nothing worth vacuuming in the database.
      */
@@ -2011,9 +2000,6 @@ do_autovacuum(void)
     /* StartTransactionCommand changed elsewhere */
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-
     classRel = table_open(RelationRelationId, AccessShareLock);
 
     /* create a copy so we can use it after closing pg_class */
@@ -2092,8 +2078,8 @@ do_autovacuum(void)
 
         /* Fetch reloptions and the pgstat entry for this table */
         relopts = extract_autovac_opts(tuple, pg_class_desc);
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                       relid);
 
         /* Check if it needs vacuum or analyze */
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
@@ -2176,8 +2162,8 @@ do_autovacuum(void)
         }
 
         /* Fetch the pgstat entry for this table */
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                       relid);
 
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
@@ -2736,29 +2722,6 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
     return av;
 }
 
-/*
- * get_pgstat_tabentry_relid
- *
- * Fetch the pgstat entry of a table, either local to a database or shared.
- */
-static PgStat_StatTabEntry *
-get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
-                          PgStat_StatDBEntry *dbentry)
-{
-    PgStat_StatTabEntry *tabentry = NULL;
-
-    if (isshared)
-    {
-        if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
-    }
-    else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
-
-    return tabentry;
-}
 
 /*
  * table_recheck_autovac
@@ -2779,17 +2742,12 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     bool        doanalyze;
     autovac_table *tab = NULL;
     PgStat_StatTabEntry *tabentry;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     bool        wraparound;
     AutoVacOpts *avopts;
 
     /* use fresh stats */
     autovac_refresh_stats();
 
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* fetch the relation's relcache entry */
     classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
     if (!HeapTupleIsValid(classTup))
@@ -2813,8 +2771,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     }
 
     /* fetch the pgstat table entry */
-    tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                         shared, dbentry);
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                   relid);
 
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
@@ -2936,7 +2894,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
  *
  * For analyze, the analysis done is that the number of tuples inserted,
  * deleted and updated since the last analyze exceeds a threshold calculated
- * in the same fashion as above.  Note that the collector actually stores
+ * in the same fashion as above.  Note that the activity statistics stores
  * the number of tuples (both live and dead) that there were as of the last
  * analyze.  This is asymmetric to the VACUUM case.
  *
@@ -2946,8 +2904,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
  *
  * A table whose autovacuum_enabled option is false is
  * automatically skipped (unless we have to vacuum it due to freeze_max_age).
- * Thus autovacuum can be disabled for specific tables. Also, when the stats
- * collector does not have data about a table, it will be skipped.
+ * Thus autovacuum can be disabled for specific tables. Also, when the activity
+ * statistics does not have data about a table, it will be skipped.
  *
  * A table whose vac_base_thresh value is < 0 takes the base value from the
  * autovacuum_vacuum_threshold GUC variable.  Similarly, a vac_scale_factor
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 069e27e427..4382b1726f 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -234,9 +234,9 @@ BackgroundWriterMain(void)
         can_hibernate = BgBufferSync(&wb_context);
 
         /*
-         * Send off activity statistics to the stats collector
+         * Send off activity statistics to the activity stats facility
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 2b552a7ff9..ebc4bce72c 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -350,7 +350,7 @@ CheckpointerMain(void)
         if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
         {
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            BgWriterStats.requested_checkpoints++;
         }
 
         /*
@@ -364,7 +364,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                BgWriterStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -486,13 +486,13 @@ CheckpointerMain(void)
         CheckArchiveTimeout();
 
         /*
-         * Send off activity statistics to the stats collector.  (The reason
-         * why we re-use bgwriter-related code for this is that the bgwriter
-         * and checkpointer used to be just one process.  It's probably not
-         * worth the trouble to split the stats support into two independent
-         * stats message types.)
+         * Send off activity statistics to the activity stats facility.  (The
+         * reason why we re-use bgwriter-related code for this is that the
+         * bgwriter and checkpointer used to be just one process.  It's
+         * probably not worth the trouble to split the stats support into two
+         * independent stats message types.)
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         /*
          * If any checkpoint flags have been set, redo the loop to handle the
@@ -540,29 +540,29 @@ HandleCheckpointerInterrupts(void)
         ProcessConfigFile(PGC_SIGHUP);
 
         /*
-         * Checkpointer is the last process to shut down, so we ask it to
-         * hold the keys for a range of other tasks required most of which
-         * have nothing to do with checkpointing at all.
+         * Checkpointer is the last process to shut down, so we ask it to hold
+         * the keys for a range of other tasks required most of which have
+         * nothing to do with checkpointing at all.
          *
-         * For various reasons, some config values can change dynamically
-         * so the primary copy of them is held in shared memory to make
-         * sure all backends see the same value.  We make Checkpointer
-         * responsible for updating the shared memory copy if the
-         * parameter setting changes because of SIGHUP.
+         * For various reasons, some config values can change dynamically so
+         * the primary copy of them is held in shared memory to make sure all
+         * backends see the same value.  We make Checkpointer responsible for
+         * updating the shared memory copy if the parameter setting changes
+         * because of SIGHUP.
          */
         UpdateSharedMemoryConfig();
     }
     if (ShutdownRequestPending)
     {
         /*
-         * From here on, elog(ERROR) should end with exit(1), not send
-         * control back to the sigsetjmp block above
+         * From here on, elog(ERROR) should end with exit(1), not send control
+         * back to the sigsetjmp block above
          */
         ExitOnAnyError = true;
         /* Close down the database */
         ShutdownXLOG(0, 0);
         /* Normal exit from the checkpointer is here */
-        proc_exit(0);        /* done */
+        proc_exit(0);            /* done */
     }
 }
 
@@ -698,9 +698,9 @@ CheckpointWriteDelay(int flags, double progress)
         CheckArchiveTimeout();
 
         /*
-         * Report interim activity statistics to the stats collector.
+         * Report interim activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1245,8 +1245,8 @@ AbsorbSyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    BgWriterStats.buf_written_backend += CheckpointerShmem->num_backend_writes;
+    BgWriterStats.buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/interrupt.c b/src/backend/postmaster/interrupt.c
index 3d02439b79..53844eb8bb 100644
--- a/src/backend/postmaster/interrupt.c
+++ b/src/backend/postmaster/interrupt.c
@@ -93,9 +93,8 @@ SignalHandlerForCrashExit(SIGNAL_ARGS)
  * shut down and exit.
  *
  * Typically, this handler would be used for SIGTERM, but some procesess use
- * other signals. In particular, the checkpointer exits on SIGUSR2, the
- * stats collector on SIGQUIT, and the WAL writer exits on either SIGINT
- * or SIGTERM.
+ * other signals. In particular, the checkpointer exits on SIGUSR2, and the WAL
+ * writer exits on either SIGINT or SIGTERM.
  *
  * ShutdownRequestPending should be checked at a convenient place within the
  * main loop, or else the main loop should call HandleMainLoopInterrupts.
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 063d1323ea..08fe87341c 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -372,20 +372,20 @@ pgarch_ArchiverCopyLoop(void)
                 pgarch_archiveDone(xlog);
 
                 /*
-                 * Tell the collector about the WAL file that we successfully
-                 * archived
+                 * Tell the activity statistics facility about the WAL file
+                 * that we successfully archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_report_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
             else
             {
                 /*
-                 * Tell the collector about the WAL file that we failed to
-                 * archive
+                 * Tell the activity statistics facility about the WAL file
+                 * that we failed to archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_report_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 50eea2e8a8..2a0cde993f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,22 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Activity Statistics facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects activity statistics, e.g. per-table access statistics, of
+ *  all backends in shared memory. The activity numbers are first stored
+ *  locally in each process, then flushed to shared memory at commit
+ *  time or by idle-timeout.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ * To avoid congestion on the shared memory, shared stats are updated no more
+ * often than once per PGSTAT_MIN_INTERVAL (1000ms). If some local numbers
+ * remain unflushed due to lock failure, retry at intervals that start at
+ * PGSTAT_RETRY_MIN_INTERVAL (250ms) and double at every retry. Finally we
+ * force an update PGSTAT_MAX_INTERVAL (10000ms) after the first attempt.
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the activity statistics facility creates the
+ *  area, then loads the stored stats file if any; the last process at shutdown
+ *  writes the shared stats to the file, then destroys the area before exit.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -19,18 +26,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -40,68 +35,43 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/fork_process.h"
 #include "postmaster/interrupt.h"
 #include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
-#include "utils/timestamp.h"
 
 /* ----------
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
+#define PGSTAT_MIN_INTERVAL            1000    /* Minimum interval of stats data
+                                             * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_RETRY_MIN_INTERVAL    250 /* Initial retry interval after
+                                         * PGSTAT_MIN_INTERVAL */
 
+#define PGSTAT_MAX_INTERVAL            10000    /* Longest interval of stats data
+                                             * updates */
 
 /* ----------
- * The initial size hints for the hash tables used in the collector.
+ * The initial size hints for the hash tables used in the activity statistics.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
-#define PGSTAT_TAB_HASH_SIZE    512
+#define PGSTAT_TABLE_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
 
@@ -116,7 +86,6 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
-
 /* ----------
  * GUC parameters
  * ----------
@@ -131,19 +100,26 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used, but will be removed with GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
-/*
- * BgWriter global statistics counters (unused in other processes).
- * Stored directly in a stats message structure so it can be sent
- * without needing to copy things around.  We assume this inits to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
+typedef struct StatsShmemStruct
+{
+    dsa_handle    stats_dsa_handle;    /* handle for stats data area */
+    dshash_table_handle hash_handle;    /* shared dbstat hash */
+    dsa_pointer global_stats;    /* DSA pointer to global stats */
+    dsa_pointer archiver_stats; /* Ditto for archiver stats */
+    dsa_pointer slru_stats;        /* Ditto for SLRU stats */
+    int            refcount;        /* # of processes attached to the shared
+                                 * stats memory */
+}            StatsShmemStruct;
+
+/* BgWriter global statistics counters */
+PgStat_BgWriter BgWriterStats = {0};
 
 /*
- * SLRU statistics counters (unused in other processes) stored directly in
- * stats structure so it can be sent without needing to copy things around.
  * We assume this inits to zeroes. There is no central registry of SLRUs,
  * so we use this fixed list instead.
  *
@@ -158,68 +134,153 @@ static char *slru_names[] = {"async", "clog", "commit_timestamp",
 /* number of elemenents of slru_name array */
 #define SLRU_NUM_ELEMENTS    (sizeof(slru_names) / sizeof(char *))
 
-/* entries in the same order as slru_names */
-PgStat_MsgSLRU SLRUStats[SLRU_NUM_ELEMENTS];
-
-/* ----------
- * Local data
- * ----------
+/*
+ *  Changes of SLRU counters are reported within critical sections so we use
+ *  static memory in order to avoid memory allocation.
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
-
-static time_t last_pgstat_start_time;
+static PgStat_StatSLRUEntry local_SLRUStats[SLRU_NUM_ELEMENTS];
+static bool     have_slrustats = false;
 
-static bool pgStatRunningInCollector = false;
+/* backend-lifetime storage */
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
 
 /*
- * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
+ * Enums and types to define shared statistics structure.
+ *
+ * The statistics entry for each object is stored in individually
+ * DSA-allocated memory. Every entry is pointed to from the dshash
+ * pgStatSharedHash via a dsa_pointer. This structure keeps object-stats
+ * entries from being moved by dshash resizing, and lets the dshash release
+ * its lock sooner on stats updates. It also reduces interference among
+ * write-locks on each stat entry by not relying on the dshash partition
+ * lock. PgStatLocalHashEntry is the local-stats equivalent of PgStatHashEntry.
+ *
+ * Each stat entry is enveloped by the type PgStatEnvelope, which stores
+ * attributes common to all kinds of statistics and an LWLock object.
+ *
+ * Shared stats are stored as:
+ *
+ * dshash pgStatSharedHash
+ *    -> PgStatHashEntry                (dshash entry)
+ *      (dsa_pointer)-> PgStatEnvelope    (dsa memory block)
  *
- * NOTE: once allocated, TabStatusArray structures are never moved or deleted
- * for the life of the backend.  Also, we zero out the t_id fields of the
- * contained PgStat_TableStatus structs whenever they are not actively in use.
- * This allows relcache pgstat_info pointers to be treated as long-lived data,
- * avoiding repeated searches in pgstat_initstats() when a relation is
- * repeatedly opened during a transaction.
+ * Local stats are stored as:
+ *
+ * dynahash pgStatLocalHash
+ *    -> PgStatLocalHashEntry            (dynahash entry)
+ *      (direct pointer)-> PgStatEnvelope (palloc'ed memory)
  */
-#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
 
-typedef struct TabStatusArray
+/* The types of statistics entries. */
+typedef enum PgStatTypes
 {
-    struct TabStatusArray *tsa_next;    /* link to next array, if any */
-    int            tsa_used;        /* # entries currently used */
-    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
-} TabStatusArray;
-
-static TabStatusArray *pgStatTabList = NULL;
+    PGSTAT_TYPE_ALL,            /* Not a type, for the parameters of
+                                 * collect_stat_entries */
+    PGSTAT_TYPE_DB,                /* database-wide statistics */
+    PGSTAT_TYPE_TABLE,            /* per-table statistics */
+    PGSTAT_TYPE_FUNCTION        /* per-function statistics */
+}            PgStatTypes;
 
 /*
- * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
+ * entry size lookup table of shared statistics entries corresponding to
+ * PgStatTypes
  */
-typedef struct TabStatHashEntry
+static size_t pgstat_entsize[] =
 {
-    Oid            t_id;
-    PgStat_TableStatus *tsa_entry;
-} TabStatHashEntry;
+    0,                            /* PGSTAT_TYPE_ALL: not an entry */
+    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_StatTabEntry),    /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_StatFuncEntry)    /* PGSTAT_TYPE_FUNCTION */
+};
 
-/*
- * Hash table for O(1) t_id -> tsa_entry lookup
- */
-static HTAB *pgStatTabHash = NULL;
+/* Ditto for local statistics entries */
+static size_t pgstat_localentsize[] =
+{
+    0,                            /* PGSTAT_TYPE_ALL: not an entry */
+    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_TableStatus), /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_BackendFunctionEntry) /* PGSTAT_TYPE_FUNCTION */
+};
+
+/* struct for shared statistics hash entry key. */
+typedef struct PgStatHashEntryKey
+{
+    PgStatTypes type;            /* statistics entry type */
+    Oid            databaseid;        /* database ID. InvalidOid for shared objects. */
+    Oid            objectid;        /* object OID */
+}            PgStatHashEntryKey;
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * Stats numbers that are waiting to be flushed out to shared stats are held in
+ * pgStatLocalHash.
  */
-static HTAB *pgStatFunctions = NULL;
+typedef struct PgStatHashEntry
+{
+    PgStatHashEntryKey key;        /* hash key */
+    dsa_pointer env;            /* pointer to shared stats envelope in DSA */
+}            PgStatHashEntry;
+
+/* struct for shared statistics entry pointed from shared hash entry. */
+typedef struct PgStatEnvelope
+{
+    PgStatTypes type;            /* statistics entry type */
+    Oid            databaseid;        /* databaseid */
+    Oid            objectid;        /* objectid */
+    size_t        len;            /* length of body, fixed per type. */
+    LWLock        lock;            /* lightweight lock to protect body */
+    int            body[FLEXIBLE_ARRAY_MEMBER];    /* statistics body */
+}            PgStatEnvelope;
+
+#define PgStatEnvelopeSize(bodylen) \
+    (offsetof(PgStatEnvelope, body) + (bodylen))
+
+/* struct for shared SLRU stats */
+typedef struct PgStatSharedSLRUStats
+{
+    PgStat_StatSLRUEntry    entry[SLRU_NUM_ELEMENTS];
+    LWLock                    lock;
+} PgStatSharedSLRUStats;
+
+/* struct for shared statistics local hash entry. */
+typedef struct PgStatLocalHashEntry
+{
+    PgStatHashEntryKey key;        /* hash key */
+    PgStatEnvelope *env;        /* pointer to stats envelope in heap */
+}            PgStatLocalHashEntry;
 
 /*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
+ * A snapshot is a stats entry that is copied locally to offer stable values
+ * for a transaction.
  */
-static bool have_function_stats = false;
+typedef struct PgStatSnapshot
+{
+    PgStatHashEntryKey key;
+    bool        negative;
+    int            body[FLEXIBLE_ARRAY_MEMBER];    /* statistics body */
+}            PgStatSnapshot;
+
+#define PgStatSnapshotSize(bodylen)                \
+    (offsetof(PgStatSnapshot, body) + (bodylen))
+
+/* parameter for shared hashes */
+static const dshash_parameters dsh_rootparams = {
+    sizeof(PgStatHashEntryKey),
+    sizeof(PgStatHashEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+
+/* The shared hash to index activity stats entries. */
+static dshash_table *pgStatSharedHash = NULL;
+
+/* Local stats numbers are stored here */
+static HTAB *pgStatLocalHash = NULL;
+
+#define HAVE_ANY_PENDING_STATS()                        \
+    (pgStatLocalHash != NULL ||                            \
+     pgStatXactCommit != 0 || pgStatXactRollback != 0)
 
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
@@ -256,11 +317,10 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
+static MemoryContext pgStatSnapshotContext = NULL;
+static HTAB *pgStatSnapshotHash = NULL;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -269,20 +329,19 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Cluster wide statistics.
+ *
+ * Contains statistics that are collected on neither a per-database nor a
+ * per-table basis.  shared_* point to shared memory and snapshot_* are
+ * backend snapshots.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
+static bool global_snapshot_is_valid = false;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats snapshot_globalStats;
+static PgStatSharedSLRUStats *shared_SLRUStats;
+static PgStat_StatSLRUEntry snapshot_SLRUStats[SLRU_NUM_ELEMENTS];
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -291,526 +350,278 @@ static List *pending_write_requests = NIL;
  */
 static instr_time total_func_time;
 
+/*
+ * Newly created shared stats entries need to be initialized before other
+ * processes can access them. get_stat_entry() calls this for that purpose.
+ */
+typedef void (*entry_initializer) (PgStatEnvelope * env);
 
 /* ----------
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgstat_beshutdown_hook(int code, Datum arg);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                                                 Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStatEnvelope * get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                                       bool nowait,
+                                       entry_initializer initfunc, bool *found);
+static PgStatEnvelope * *collect_stat_entries(PgStatTypes type, Oid dbid);
+static void create_missing_dbentries(void);
+static void pgstat_write_database_stats(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_db_statsfile(PgStat_StatDBEntry *dbentry);
+
+static bool flush_tabstat(PgStatEnvelope * env, bool nowait);
+static bool flush_funcstat(PgStatEnvelope * env, bool nowait);
+static bool flush_dbstat(PgStatEnvelope * env, bool nowait);
+static bool flush_slrustat(bool nowait);
+
+static void init_tabentry(PgStatEnvelope * env);
+static void init_funcentry(PgStatEnvelope * env);
+static void init_dbentry(PgStatEnvelope * env);
+
+static bool delete_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                              bool nowait);
+static PgStatEnvelope * get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                                             bool create, bool *found);
+static PgStat_StatDBEntry *get_local_dbstat_entry(Oid dbid);
+static PgStat_TableStatus *get_local_tabstat_entry(Oid rel_id, bool isshared);
+
+static PgStat_SubXactStatus *get_tabstat_stack_level(int nest_level);
+static void add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level);
+static void pgstat_snapshot_global_stats(void);
+
 static void pgstat_read_current_status(void);
-
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
-
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
-static void pgstat_send_slru(void);
-static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
-
-static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
-
-static void pgstat_setup_memcxt(void);
-
 static const char *pgstat_get_wait_activity(WaitEventActivity w);
 static const char *pgstat_get_wait_client(WaitEventClient w);
 static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute shared memory space needed for activity statistics
+ */
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+/*
+ * StatsShmemInit - initialize during shared-memory creation
  */
 void
-pgstat_init(void)
+StatsShmemInit(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
+    bool        found;
 
-#define TESTBYTEVAL ((char) 199)
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
 
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
-
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
-    {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
-    }
-
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
+    if (!IsUnderPostmaster)
     {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
-
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
-
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
+        Assert(!found);
 
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+    }
+}
 
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ *    Create pgStatLocalContext and pgStatSnapshotContext, if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+    if (!pgStatLocalContext)
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+
+    if (!pgStatSnapshotContext)
+        pgStatSnapshotContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Database statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+}
 
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
 
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+/* ----------
+ * attach_shared_stats() -
+ *
+ *    Attach or create shared stats memory. If we are the first process to use
+ *    the activity stats system, read saved statistics files if any.
+ * ---------
+ */
+static void
+attach_shared_stats(void)
+{
+    MemoryContext oldcontext;
 
-        test_byte++;            /* just make sure variable is changed */
+    /*
+     * Don't use dsm when not under postmaster, or when not tracking counts.
+     */
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
 
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    pgstat_setup_memcxt();
 
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    if (area)
+        return;
 
-        /* If we get here, we have a working socket */
-        break;
-    }
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
 
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
     /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
+     * The last process is responsible for writing out stats files at exit.
+     * Maintain refcount so that a process about to exit can find out whether
+     * it is the last one or not.
      */
-    if (!pg_set_noblock(pgStatSock))
+    if (StatsShmem->refcount > 0)
+        StatsShmem->refcount++;
+    else
     {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
+        /* We're the first process to attach the shared stats memory */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
+
+        /* Initialize shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatSharedHash = dshash_create(area, &dsh_rootparams, 0);
+
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->global_stats =
+            dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+        StatsShmem->archiver_stats =
+            dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+        StatsShmem->slru_stats =
+            dsa_allocate0(area, sizeof(PgStatSharedSLRUStats));
+        StatsShmem->hash_handle = dshash_get_hash_table_handle(pgStatSharedHash);
+
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
+
+        shared_SLRUStats = (PgStatSharedSLRUStats *)
+            dsa_get_address(area, StatsShmem->slru_stats);
+        LWLockInitialize(&shared_SLRUStats->lock, LWTRANCHE_STATS);
+
+        /* Load saved data if any. */
+        pgstat_read_statsfiles();
+
+        StatsShmem->refcount = 1;
     }
 
+    LWLockRelease(StatsLock);
+
     /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
+     * If we're not the first process, attach existing shared stats area
+     * outside the StatsLock section.
      */
+    if (!area)
     {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
+        /* Attach shared area. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatSharedHash = dshash_attach(area, &dsh_rootparams,
+                                         StatsShmem->hash_handle, 0);
+
+        /* Setup local variables */
+        pgStatLocalHash = NULL;
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
+        shared_SLRUStats = (PgStatSharedSLRUStats *)
+            dsa_get_address(area, StatsShmem->slru_stats);
     }
 
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    /* Now that we have a long-lived socket, tell fd.c about it. */
-    ReserveExternalFD();
+    MemoryContextSwitchTo(oldcontext);
 
-    return;
-
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
-
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
-
-    /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
-     */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    /* don't detach automatically */
+    dsa_pin_mapping(area);
+    global_snapshot_is_valid = false;
 }
 
-/*
- * subroutine for pgstat_reset_all
+/* ----------
+ * detach_shared_stats() -
+ *
+ *    Detach shared stats. Write out to file if we're the last process and told
+ *    to do so.
+ * ----------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+detach_shared_stats(bool write_stats)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    /* return immediately if there is nothing to detach */
+    if (!area || !IsUnderPostmaster)
+        return;
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
-    {
-        int            nchars;
-        Oid            tmp_oid;
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
+    if (--StatsShmem->refcount < 1)
+    {
         /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
+         * This process is the last one attached to the shared stats memory.
+         * Write out the stats files if requested.
          */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
-
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
+        if (write_stats)
+            pgstat_write_statsfiles();
 
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+        /* No one is using the area. */
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
     }
-    FreeDir(dir);
+
+    LWLockRelease(StatsLock);
+
+    /*
+     * Detach the area. It is automatically destroyed when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);
+
+    area = NULL;
+    pgStatSharedHash = NULL;
+    shared_globalStats = NULL;
+    shared_archiverStats = NULL;
+    pgStatLocalHash = NULL;
+    global_snapshot_is_valid = false;
 }
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
-}
+    /* standalone server doesn't use shared stats */
+    if (!IsUnderPostmaster)
+        return;
 
-#ifdef EXEC_BACKEND
+    /* we must have shared stats attached */
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
-    char       *av[10];
-    int            ac = 0;
-
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
-
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
-
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
-
-
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
-
-    /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
-     */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
-
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
+    /* Startup must be the only user of shared stats */
+    Assert(StatsShmem->refcount == 1);
 
     /*
-     * Okay, fork off the collector.
+     * We could directly remove the files and recreate the shared memory area,
+     * but for simplicity we just discard the area and create it again.
      */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
+    detach_shared_stats(false); /* Don't write files. */
+    attach_shared_stats();
 }
 
 /* ------------------------------------------------------------
@@ -818,147 +629,491 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *    Updates are applied no more frequently than once every
+ *    PGSTAT_MIN_INTERVAL milliseconds. They are also postponed on lock
+ *    failure if force is false and no updates have been pending for longer
+ *    than PGSTAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
+ *    succeeding calls of this function.
+ *
+ *    Returns the time in milliseconds until the next attempt to apply
+ *    updates if some updates are still pending (postponed or left over
+ *    after a lock failure); otherwise returns 0.
+ *
+ *    Note that this is called only outside a transaction, so it is fine to use
  *    transaction stop time as an approximation of current time.
- * ----------
+ *    ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz next_flush = 0;
+    static TimestampTz pending_since = 0;
+    static long retry_interval = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
+    bool        nowait = !force;    /* don't use force after this point */
+    HASH_SEQ_STATUS scan;
+    PgStatLocalHashEntry *lent;
+    PgStatLocalHashEntry **dbentlist;
+    int            dbentlistlen = 8;
+    int            ndbentries = 0;
+    int            remains = 0;
     int            i;
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (area == NULL || !HAVE_ANY_PENDING_STATS())
+        return 0;
+
+    dbentlist = palloc(sizeof(PgStatLocalHashEntry *) * dbentlistlen);
 
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
     now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
 
-    /*
-     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
-     */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
-    pgStatTabHash = NULL;
+    if (nowait)
+    {
+        /*
+         * Don't flush stats too frequently.  Return the time to the next
+         * flush.
+         */
+        if (now < next_flush)
+        {
+            /* Record the time when updates first became pending. */
+            if (pending_since == 0)
+                pending_since = now;
+
+            return (next_flush - now) / 1000;
+        }
+
+        /* But, don't keep pending updates longer than PGSTAT_MAX_INTERVAL. */
+
+        if (pending_since > 0 &&
+            TimestampDifferenceExceeds(pending_since, now, PGSTAT_MAX_INTERVAL))
+            nowait = false;
+    }
 
     /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
+     * flush_tabstat adds some of the counters of flushed entries to the
+     * local database stats, so flush out database stats afterwards.
      */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
-    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+    if (pgStatLocalHash)
     {
-        for (i = 0; i < tsa->tsa_used; i++)
+        /* Step 1: flush out everything other than database stats */
+        hash_seq_init(&scan, pgStatLocalHash);
+        while ((lent = (PgStatLocalHashEntry *) hash_seq_search(&scan)) != NULL)
         {
-            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
+            bool        remove = false;
 
-            /* Shouldn't have any pending transaction-dependent counts */
-            Assert(entry->trans == NULL);
+            switch (lent->env->type)
+            {
+                case PGSTAT_TYPE_DB:
+                    if (ndbentries >= dbentlistlen)
+                    {
+                        dbentlistlen *= 2;
+                        dbentlist = repalloc(dbentlist,
+                                             sizeof(PgStatLocalHashEntry *) *
+                                             dbentlistlen);
+                    }
+                    dbentlist[ndbentries++] = lent;
+                    break;
+                case PGSTAT_TYPE_TABLE:
+                    if (flush_tabstat(lent->env, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_FUNCTION:
+                    if (flush_funcstat(lent->env, nowait))
+                        remove = true;
+                    break;
+                default:
+                    Assert(false);
+            }
 
-            /*
-             * Ignore entries that didn't accumulate any actual counts, such
-             * as indexes that were opened by the planner but not used.
-             */
-            if (memcmp(&entry->t_counts, &all_zeroes,
-                       sizeof(PgStat_TableCounts)) == 0)
+            if (!remove)
+            {
+                remains++;
                 continue;
+            }
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* Remove the successfully flushed entry */
+            pfree(lent->env);
+            hash_search(pgStatLocalHash, &lent->key, HASH_REMOVE, NULL);
+        }
+
+        /* Step 2: flush out database stats */
+        for (i = 0; i < ndbentries; i++)
+        {
+            PgStatLocalHashEntry *lent = dbentlist[i];
+
+            if (flush_dbstat(lent->env, nowait))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                remains--;
+                /* Remove the successfully flushed entry */
+                pfree(lent->env);
+                hash_search(pgStatLocalHash, &lent->key, HASH_REMOVE, NULL);
             }
         }
-        /* zero out PgStat_TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
+        pfree(dbentlist);
+
+        if (remains <= 0)
+        {
+            hash_destroy(pgStatLocalHash);
+            pgStatLocalHash = NULL;
+        }
     }
 
+    /* flush SLRU stats */
+    flush_slrustat(nowait);
+
+    /* Publish the last flush time */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (shared_globalStats->stats_timestamp < now)
+        shared_globalStats->stats_timestamp = now;
+    LWLockRelease(StatsLock);
+
     /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
+     * If we have pending local stats, let the caller know the retry interval.
      */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    if (HAVE_ANY_PENDING_STATS())
+    {
+        /* Remember when updates first became pending */
+        if (pending_since == 0)
+            pending_since = now;
+
+        /* The interval is doubled at every retry. */
+        if (retry_interval == 0)
+            retry_interval = PGSTAT_RETRY_MIN_INTERVAL * 1000;
+        else
+            retry_interval = retry_interval * 2;
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+        /*
+         * Determine the next retry time so that the last interval before
+         * reaching PGSTAT_MAX_INTERVAL is not shorter than the previous one.
+         */
+        if (!TimestampDifferenceExceeds(pending_since,
+                                        now + 2 * retry_interval,
+                                        PGSTAT_MAX_INTERVAL))
+            next_flush = now + retry_interval;
+        else
+        {
+            next_flush = pending_since + PGSTAT_MAX_INTERVAL * 1000;
+            retry_interval = next_flush - now;
+        }
 
-    /* Finally send SLRU statistics */
-    pgstat_send_slru();
+        return retry_interval / 1000;
+    }
+
+    /* Set the next time to update stats */
+    next_flush = now + PGSTAT_MIN_INTERVAL * 1000;
+    retry_interval = 0;
+    pending_since = 0;
+
+    return 0;
 }
 
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * flush_tabstat - flush out a local table stats entry
+ *
+ * Some of the counters are added to the local database stats entry after a
+ * successful flush.
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_tabstat(PgStatEnvelope * lenv, bool nowait)
+{
+    static const PgStat_TableCounts all_zeroes;
+    Oid            dboid;            /* database OID of the table */
+    PgStat_TableStatus *lstats; /* local stats entry */
+    PgStatEnvelope *shenv;        /* shared stats envelope */
+    PgStat_StatTabEntry *shtabstats;    /* table entry of shared stats */
+    PgStat_StatDBEntry *ldbstats;    /* local database entry */
+    bool        found;
+
+    Assert(lenv->type == PGSTAT_TYPE_TABLE);
+
+    lstats = (PgStat_TableStatus *) &lenv->body;
+    dboid = lstats->t_shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Ignore entries that didn't accumulate any actual counts, such as
+     * indexes that were opened by the planner but not used.
+     */
+    if (memcmp(&lstats->t_counts, &all_zeroes,
+               sizeof(PgStat_TableCounts)) == 0)
+        return true;
+
+    /* find shared table stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_TABLE, dboid, lstats->t_id,
+                           nowait, init_tabentry, &found);
+
+    /* skip if dshash failed to acquire lock */
+    if (shenv == NULL)
+        return false;
+
+    /* retrieve the shared table stats entry from the envelope */
+    shtabstats = (PgStat_StatTabEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;
+
+    /* add the values to the shared entry. */
+    shtabstats->numscans += lstats->t_counts.t_numscans;
+    shtabstats->tuples_returned += lstats->t_counts.t_tuples_returned;
+    shtabstats->tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    shtabstats->tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    shtabstats->tuples_updated += lstats->t_counts.t_tuples_updated;
+    shtabstats->tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    shtabstats->tuples_hot_updated += lstats->t_counts.t_tuples_hot_updated;
+
+    /*
+     * If the table was truncated or vacuum/analyze has run, first reset the
+     * live/dead counters.
+     */
+    if (lstats->t_counts.t_truncated ||
+        lstats->t_counts.vacuum_count > 0 ||
+        lstats->t_counts.analyze_count > 0 ||
+        lstats->t_counts.autovac_vacuum_count > 0 ||
+        lstats->t_counts.autovac_analyze_count > 0)
+    {
+        shtabstats->n_live_tuples = 0;
+        shtabstats->n_dead_tuples = 0;
+    }
+
+    /* clear the change counter if requested */
+    if (lstats->t_counts.reset_changed_tuples)
+        shtabstats->changes_since_analyze = 0;
+
+    shtabstats->n_live_tuples += lstats->t_counts.t_delta_live_tuples;
+    shtabstats->n_dead_tuples += lstats->t_counts.t_delta_dead_tuples;
+    shtabstats->changes_since_analyze += lstats->t_counts.t_changed_tuples;
+    shtabstats->blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    shtabstats->blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /*
+     * Update vacuum/analyze timestamps and counters so that the values never
+     * go backwards.
+     */
+    if (shtabstats->vacuum_timestamp < lstats->vacuum_timestamp)
+        shtabstats->vacuum_timestamp = lstats->vacuum_timestamp;
+    shtabstats->vacuum_count += lstats->t_counts.vacuum_count;
+
+    if (shtabstats->autovac_vacuum_timestamp < lstats->autovac_vacuum_timestamp)
+        shtabstats->autovac_vacuum_timestamp = lstats->autovac_vacuum_timestamp;
+    shtabstats->autovac_vacuum_count += lstats->t_counts.autovac_vacuum_count;
+
+    if (shtabstats->analyze_timestamp < lstats->analyze_timestamp)
+        shtabstats->analyze_timestamp = lstats->analyze_timestamp;
+    shtabstats->analyze_count += lstats->t_counts.analyze_count;
+
+    if (shtabstats->autovac_analyze_timestamp < lstats->autovac_analyze_timestamp)
+        shtabstats->autovac_analyze_timestamp = lstats->autovac_analyze_timestamp;
+    shtabstats->autovac_analyze_count += lstats->t_counts.autovac_analyze_count;
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    shtabstats->n_live_tuples = Max(shtabstats->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    shtabstats->n_dead_tuples = Max(shtabstats->n_dead_tuples, 0);
+
+    LWLockRelease(&shenv->lock);
+
+    /* The entry was flushed successfully; add its counts to database stats */
+    ldbstats = get_local_dbstat_entry(dboid);
+    ldbstats->counts.n_tuples_returned += lstats->t_counts.t_tuples_returned;
+    ldbstats->counts.n_tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    ldbstats->counts.n_tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    ldbstats->counts.n_tuples_updated += lstats->t_counts.t_tuples_updated;
+    ldbstats->counts.n_tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    ldbstats->counts.n_blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    ldbstats->counts.n_blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    return true;
+}
+
+/* ----------
+ * init_tabentry() -
+ *
+ * initializes table stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
  */
 static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+init_tabentry(PgStatEnvelope * env)
 {
-    int            n;
-    int            len;
+    PgStat_StatTabEntry *tabent = (PgStat_StatTabEntry *) &env->body;
+
+    /*
+     * This is a new table entry, so set its ID and initialize all counters
+     * and timestamps to zero.
+     */
+    Assert(env->type == PGSTAT_TYPE_TABLE);
+    tabent->tableid = env->objectid;
+    tabent->numscans = 0;
+    tabent->tuples_returned = 0;
+    tabent->tuples_fetched = 0;
+    tabent->tuples_inserted = 0;
+    tabent->tuples_updated = 0;
+    tabent->tuples_deleted = 0;
+    tabent->tuples_hot_updated = 0;
+    tabent->n_live_tuples = 0;
+    tabent->n_dead_tuples = 0;
+    tabent->changes_since_analyze = 0;
+    tabent->blocks_fetched = 0;
+    tabent->blocks_hit = 0;
+
+    tabent->vacuum_timestamp = 0;
+    tabent->vacuum_count = 0;
+    tabent->autovac_vacuum_timestamp = 0;
+    tabent->autovac_vacuum_count = 0;
+    tabent->analyze_timestamp = 0;
+    tabent->analyze_count = 0;
+    tabent->autovac_analyze_timestamp = 0;
+    tabent->autovac_analyze_count = 0;
+}
+
+
+/*
+ * flush_funcstat - flush out a local function stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_funcstat(PgStatEnvelope * env, bool nowait)
+{
+    /* we assume this inits to all zeroes: */
+    static const PgStat_FunctionCounts all_zeroes;
+    PgStat_BackendFunctionEntry *localent;    /* local stats entry */
+    PgStatEnvelope *shenv;        /* shared stats envelope */
+    PgStat_StatFuncEntry *sharedent = NULL; /* shared stats entry */
+    bool        found;
+
+    Assert(env->type == PGSTAT_TYPE_FUNCTION);
+    localent = (PgStat_BackendFunctionEntry *) &env->body;
+
+    /* Skip it if no counts accumulated for it so far */
+    if (memcmp(&localent->f_counts, &all_zeroes,
+               sizeof(PgStat_FunctionCounts)) == 0)
+        return true;
+
+    /* find shared function stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId, localent->f_id,
+                           nowait, init_funcentry, &found);
+    /* skip if dshash failed to acquire lock */
+    if (shenv == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    /* retrieve the shared function stats entry from the envelope */
+    sharedent = (PgStat_StatFuncEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
+
+    sharedent->f_numcalls += localent->f_counts.f_numcalls;
+    sharedent->f_total_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_total_time);
+    sharedent->f_self_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_self_time);
+
+    LWLockRelease(&shenv->lock);
+
+    return true;
+}
+
+
+/* ----------
+ * init_funcentry() -
+ *
+ * initializes function stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
+ */
+static void
+init_funcentry(PgStatEnvelope * env)
+{
+    PgStat_StatFuncEntry *shstat = (PgStat_StatFuncEntry *) &env->body;
+
+    Assert(env->type == PGSTAT_TYPE_FUNCTION);
+    shstat->functionid = env->objectid;
+    shstat->f_numcalls = 0;
+    shstat->f_total_time = 0;
+    shstat->f_self_time = 0;
+}
+
+
+/*
+ * flush_dbstat - flush out a local database stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_dbstat(PgStatEnvelope * env, bool nowait)
+{
+    PgStat_StatDBEntry *localent;
+    PgStatEnvelope *shenv;
+    PgStat_StatDBEntry *sharedent;
+
+    Assert(env->type == PGSTAT_TYPE_DB);
+
+    localent = (PgStat_StatDBEntry *) &env->body;
+
+    /* find shared database stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_DB, localent->databaseid, InvalidOid,
+                           nowait, init_dbentry, NULL);
+
+    /* skip if dshash failed to acquire lock */
+    if (!shenv)
+        return false;
+
+    /* retrieve the shared stats entry from the envelope */
+    sharedent = (PgStat_StatDBEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;
+
+    sharedent->counts.n_tuples_returned += localent->counts.n_tuples_returned;
+    sharedent->counts.n_tuples_fetched += localent->counts.n_tuples_fetched;
+    sharedent->counts.n_tuples_inserted += localent->counts.n_tuples_inserted;
+    sharedent->counts.n_tuples_updated += localent->counts.n_tuples_updated;
+    sharedent->counts.n_tuples_deleted += localent->counts.n_tuples_deleted;
+    sharedent->counts.n_blocks_fetched += localent->counts.n_blocks_fetched;
+    sharedent->counts.n_blocks_hit += localent->counts.n_blocks_hit;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    sharedent->counts.n_deadlocks += localent->counts.n_deadlocks;
+    sharedent->counts.n_temp_bytes += localent->counts.n_temp_bytes;
+    sharedent->counts.n_temp_files += localent->counts.n_temp_files;
+    sharedent->counts.n_checksum_failures += localent->counts.n_checksum_failures;
 
     /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
+     * Accumulate xact commit/rollback and I/O timings to stats entry of the
+     * current database.
      */
-    if (OidIsValid(tsmsg->m_databaseid))
+    if (OidIsValid(localent->databaseid))
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
+        sharedent->counts.n_xact_commit += pgStatXactCommit;
+        sharedent->counts.n_xact_rollback += pgStatXactRollback;
+        sharedent->counts.n_block_read_time += pgStatBlockReadTime;
+        sharedent->counts.n_block_write_time += pgStatBlockWriteTime;
         pgStatXactCommit = 0;
         pgStatXactRollback = 0;
         pgStatBlockReadTime = 0;
@@ -966,257 +1121,149 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        sharedent->counts.n_xact_commit = 0;
+        sharedent->counts.n_xact_rollback = 0;
+        sharedent->counts.n_block_read_time = 0;
+        sharedent->counts.n_block_write_time = 0;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
+    LWLockRelease(&shenv->lock);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    return true;
 }
 
-/*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
+
+/* ----------
+ * init_dbentry() -
+ *
+ * initializes database stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
  */
 static void
-pgstat_send_funcstats(void)
+init_dbentry(PgStatEnvelope * env)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_FunctionCounts all_zeroes;
+    PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) &env->body;
+
+    Assert(env->type == PGSTAT_TYPE_DB);
+    dbentry->databaseid = env->databaseid;
+    dbentry->last_autovac_time = 0;
+    dbentry->last_checksum_failure = 0;
+    dbentry->stat_reset_timestamp = 0;
+    dbentry->stats_timestamp = 0;
+    /* initialize the new shared entry */
+    MemSet(&dbentry->counts, 0, sizeof(PgStat_StatDBCounts));
+}
+
 
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
+/*
+ * flush_slrustat - flush out local SLRU stats entries
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if all local SLRU stats are successfully flushed out.
+ */
+static bool
+flush_slrustat(bool nowait)
+{
+    int i;
 
-    if (pgStatFunctions == NULL)
-        return;
+    if (!have_slrustats)
+        return true;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shared_SLRUStats->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shared_SLRUStats->lock, LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
 
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    for (i = 0 ; i < SLRU_NUM_ELEMENTS ; i++)
     {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
+        PgStat_StatSLRUEntry *sharedent = &shared_SLRUStats->entry[i];
+        PgStat_StatSLRUEntry *localent = &local_SLRUStats[i];
+
+        sharedent->blocks_zeroed += localent->blocks_zeroed;
+        sharedent->blocks_hit += localent->blocks_hit;
+        sharedent->blocks_read += localent->blocks_read;
+        sharedent->blocks_written += localent->blocks_written;
+        sharedent->blocks_exists += localent->blocks_exists;
+        sharedent->flush += localent->flush;
+        sharedent->truncate += localent->truncate;
+    }
 
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
+    /* done, clear the local entry */
+    MemSet(local_SLRUStats, 0,
+           sizeof(PgStat_StatSLRUEntry) * SLRU_NUM_ELEMENTS);
+    LWLockRelease(&shared_SLRUStats->lock);
 
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
+    have_slrustats = false;
 
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
+    return true;
+}
 
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
 
-    have_function_stats = false;
+/*
+ * Create the filename for a DB stat file; filename is an output parameter
+ * that points to a character buffer of length len.
+ */
+static void
+get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
+{
+    int            printed;
+
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/db_%u.%s",
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
+                       databaseid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
 }
 
 
 /* ----------
- * pgstat_vacuum_stat() -
+ * collect_stat_entries() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *    Collect the shared statistics entries specified by type and dbid. Returns a
+ *  NULL-terminated list of pointers to shared statistics in palloc'ed memory.
+ *  If type is PGSTAT_TYPE_ALL, all types of statistics of the database are
+ *  collected. If type is PGSTAT_TYPE_DB, the parameter dbid is ignored and all
+ *  PGSTAT_TYPE_DB entries are collected.
  * ----------
  */
-void
-pgstat_vacuum_stat(void)
+static PgStatEnvelope * *collect_stat_entries(PgStatTypes type, Oid dbid)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Read pg_database and make a list of OIDs of all existing databases
-     */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
-
-    /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            dbid = dbentry->databaseid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        /* the DB entry for shared tables (with InvalidOid) is never dropped */
-        if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
-            pgstat_drop_database(dbid);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Lookup our own database entry; if not found, nothing more to do.
-     */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
-        return;
-
-    /*
-     * Similarly to above, make a list of all known relations in this DB.
-     */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
-
-    /*
-     * Check for all tables listed in stats hashtable if they still exist.
-     */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
+    int            listlen = 16;
+    PgStatEnvelope **envlist = palloc(sizeof(PgStatEnvelope * *) * listlen);
+    int            n = 0;
+
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
     {
-        Oid            tabid = tabentry->tableid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+        if ((type != PGSTAT_TYPE_ALL && p->key.type != type) ||
+            (type != PGSTAT_TYPE_DB && p->key.databaseid != dbid))
             continue;
 
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
+        if (n >= listlen - 1)
         {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
+            listlen *= 2;
+            envlist = repalloc(envlist, listlen * sizeof(PgStatEnvelope * *));
         }
+        envlist[n++] = dsa_get_address(area, p->env);
     }
+    dshash_seq_term(&hstat);
 
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
+    envlist[n] = NULL;
 
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
-        {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
-                continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
-        }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
-    }
+    return envlist;
 }
 
 
 /* ----------
- * pgstat_collect_oids() -
+ * collect_oids() -
  *
  *    Collect the OIDs of all objects listed in the specified system catalog
  *    into a temporary hash table.  Caller should hash_destroy the result
@@ -1225,7 +1272,7 @@ pgstat_vacuum_stat(void)
  * ----------
  */
 static HTAB *
-pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
+collect_oids(Oid catalogid, AttrNumber anum_oid)
 {
     HTAB       *htab;
     HASHCTL        hash_ctl;
@@ -1239,7 +1286,7 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     hash_ctl.entrysize = sizeof(Oid);
     hash_ctl.hcxt = CurrentMemoryContext;
     htab = hash_create("Temporary table of OIDs",
-                       PGSTAT_TAB_HASH_SIZE,
+                       PGSTAT_TABLE_HASH_SIZE,
                        &hash_ctl,
                        HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
 
@@ -1266,65 +1313,185 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
 }
 
 
+/* ----------
+ * pgstat_vacuum_stat() -
+ *
+ *  Delete shared stat entries that are not in system catalogs.
+ *
+ *  To avoid holding exclusive lock on dshash for a long time, the process is
+ *  performed in three steps.
+ *
+ *   1: Collect existent oids of every kind of object.
+ *   2: Collect victim entries by scanning with shared lock.
+ *   3: Try removing every nominated entry without waiting for lock.
+ *
+ *  As a consequence of the last step, some entries may be left behind due to
+ *  lock failure, but they will be deleted by later invocations of this
+ *  function.
+ * ----------
+ */
+void
+pgstat_vacuum_stat(void)
+{
+    HTAB       *dbids;            /* database ids */
+    HTAB       *relids;            /* relation ids in the current database */
+    HTAB       *funcids;        /* function ids in the current database */
+    PgStatEnvelope **victims;    /* victim entry list */
+    int            arraylen = 0;    /* storage size of the above */
+    int            nvictims = 0;    /* # of entries of the above */
+    dshash_seq_status dshstat;
+    PgStatHashEntry *ent;
+    int            i;
+
+    /* we don't collect stats under standalone mode */
+    if (!IsUnderPostmaster)
+        return;
+
+    /* collect oids of existent objects */
+    dbids = collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    relids = collect_oids(RelationRelationId, Anum_pg_class_oid);
+    funcids = collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
+
+    /* collect victims from shared stats */
+    arraylen = 16;
+    victims = palloc(sizeof(PgStatEnvelope * *) * arraylen);
+    nvictims = 0;
+
+    dshash_seq_init(&dshstat, pgStatSharedHash, false);
+
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
+    {
+        HTAB       *oidtab;
+        Oid           *key;
+
+        CHECK_FOR_INTERRUPTS();
+
+        /*
+         * Don't drop non-database entries that belong to a database other
+         * than the current one.
+         */
+        if (ent->key.type != PGSTAT_TYPE_DB &&
+            ent->key.databaseid != MyDatabaseId)
+            continue;
+
+        switch (ent->key.type)
+        {
+            case PGSTAT_TYPE_DB:
+                /* don't remove database entry for shared tables */
+                if (ent->key.databaseid == 0)
+                    continue;
+                oidtab = dbids;
+                key = &ent->key.databaseid;
+                break;
+
+            case PGSTAT_TYPE_TABLE:
+                oidtab = relids;
+                key = &ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                oidtab = funcids;
+                key = &ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_ALL:
+                Assert(false);
+                break;
+        }
+
+        /* Skip existent objects. */
+        if (hash_search(oidtab, key, HASH_FIND, NULL) != NULL)
+            continue;
+
+        /* extend the list if needed */
+        if (nvictims >= arraylen)
+        {
+            arraylen *= 2;
+            victims = repalloc(victims, sizeof(PgStatEnvelope * *) * arraylen);
+        }
+
+        victims[nvictims++] = dsa_get_address(area, ent->env);
+    }
+    dshash_seq_term(&dshstat);
+    hash_destroy(dbids);
+    hash_destroy(relids);
+    hash_destroy(funcids);
+
+    /* Now try removing the victim entries */
+    for (i = 0; i < nvictims; i++)
+    {
+        PgStatEnvelope *p = victims[i];
+
+        delete_stat_entry(p->type, p->databaseid, p->objectid, true);
+    }
+}
+
+
+/* ----------
+ * delete_stat_entry() -
+ *
+ *  Deletes the specified entry from shared stats hash.
+ *
+ *  Returns true when successfully deleted.
+ * ----------
+ */
+static bool
+delete_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait)
+{
+    PgStatHashEntryKey key;
+    PgStatHashEntry *ent;
+
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    ent = dshash_find_extended(pgStatSharedHash, &key,
+                               true, nowait, false, NULL);
+
+    if (!ent)
+        return false;            /* lock failed or not found */
+
+    /* The entry is exclusively locked, so we can free the chunk first. */
+    dsa_free(area, ent->env);
+    dshash_delete_entry(pgStatSharedHash, ent);
+
+    return true;
+}
+
+
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
- * ----------
+ *    Remove entry for the database that we just dropped.
+ *
+ *    Some entries might be left behind due to lock failure, or some stats
+ *    might be flushed after this, but we will still clean the dead DB
+ *    eventually via future invocations of pgstat_vacuum_stat().
+ * ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
+    Assert(OidIsValid(databaseid));
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
+    envlist = collect_stat_entries(PGSTAT_TYPE_ALL, databaseid);
 
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
+    for (p = envlist; *p != NULL; p++)
+        delete_stat_entry((*p)->type, (*p)->databaseid, (*p)->objectid, true);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
+    pfree(envlist);
 }
-#endif                            /* NOT_USED */
 
 
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1333,20 +1500,47 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    /* Lookup the entries of the current database in the stats hash. */
+    envlist = collect_stat_entries(PGSTAT_TYPE_ALL, MyDatabaseId);
+    for (p = envlist; *p != NULL; p++)
+    {
+        PgStatEnvelope *env = *p;
+        PgStat_StatDBEntry *dbstat;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+        LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+
+        switch (env->type)
+        {
+            case PGSTAT_TYPE_TABLE:
+                init_tabentry(env);
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                init_funcentry(env);
+                break;
+
+            case PGSTAT_TYPE_DB:
+                init_dbentry(env);
+                dbstat = (PgStat_StatDBEntry *) &env->body;
+                dbstat->stat_reset_timestamp = GetCurrentTimestamp();
+                break;
+            default:
+                Assert(false);
+        }
+
+        LWLockRelease(&env->lock);
+    }
+
+    pfree(envlist);
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1355,29 +1549,37 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
+    /* Reset the archiver statistics for the cluster. */
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    /* Reset the bgwriter statistics for the cluster. */
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1386,17 +1588,42 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStatEnvelope *env;
+    PgStat_StatDBEntry *dbentry;
+    PgStatTypes stattype;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    env = get_stat_entry(PGSTAT_TYPE_DB, MyDatabaseId, InvalidOid,
+                         false, NULL, NULL);
+    Assert(env);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* Set the reset timestamp for the whole database */
+    dbentry = (PgStat_StatDBEntry *) &env->body;
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&env->lock);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Remove object if it exists, ignore if not */
+    switch (type)
+    {
+        case RESET_TABLE:
+            stattype = PGSTAT_TYPE_TABLE;
+            break;
+        case RESET_FUNCTION:
+            stattype = PGSTAT_TYPE_FUNCTION;
+    }
+
+    env = get_stat_entry(stattype, MyDatabaseId, objoid, false, NULL, NULL);
+    LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+    if (env->type == PGSTAT_TYPE_TABLE)
+        init_tabentry(env);
+    else
+    {
+        Assert(env->type == PGSTAT_TYPE_FUNCTION);
+        init_funcentry(env);
+    }
+    LWLockRelease(&env->lock);
 }
 
 /* ----------
@@ -1412,15 +1639,27 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_reset_slru_counter(const char *name)
 {
-    PgStat_MsgResetslrucounter msg;
+    int i;
+    TimestampTz    ts = GetCurrentTimestamp();
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSLRUCOUNTER);
-    msg.m_index = (name) ? pgstat_slru_index(name) : -1;
-
-    pgstat_send(&msg, sizeof(msg));
+    if (name)
+    {
+        i = pgstat_slru_index(name);
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(&shared_SLRUStats->entry[i], 0, sizeof(PgStat_StatSLRUEntry));
+        shared_SLRUStats->entry[i].stat_reset_timestamp = ts;
+        LWLockRelease(StatsLock);
+    }
+    else
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        for (i = 0 ; i < SLRU_NUM_ELEMENTS; i++)
+        {
+            MemSet(&shared_SLRUStats->entry[i], 0, sizeof(PgStat_StatSLRUEntry));
+            shared_SLRUStats->entry[i].stat_reset_timestamp = ts;
+        }
+        LWLockRelease(StatsLock);
+    }
 }
 
 /* ----------
@@ -1434,48 +1673,63 @@ pgstat_reset_slru_counter(const char *name)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if activity stats is not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    ts = GetCurrentTimestamp();
 
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Store the last autovacuum time in the database's hash table entry.
+     */
+    dbentry = get_local_dbstat_entry(dboid);
+    dbentry->last_autovac_time = ts;
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    PgStat_TableStatus *tabentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    ts = GetCurrentTimestamp();
+    tabentry = get_local_tabstat_entry(tableoid, shared);
+
+    tabentry->t_counts.t_delta_live_tuples = livetuples;
+    tabentry->t_counts.t_delta_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = ts;
+        tabentry->t_counts.autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = ts;
+        tabentry->t_counts.vacuum_count++;
+    }
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1486,9 +1740,10 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    PgStat_TableStatus *tabentry;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1496,10 +1751,10 @@ pgstat_report_analyze(Relation rel,
      * already inserted and/or deleted rows in the target table. ANALYZE will
      * have counted such rows as live or dead respectively. Because we will
      * report our counts of such rows at transaction end, we should subtract
-     * off these counts from what we send to the collector now, else they'll
-     * be double-counted after commit.  (This approach also ensures that the
-     * collector ends up with the right numbers if we abort instead of
-     * committing.)
+     * off these counts from what we report to shared stats now, else
+     * they'll be double-counted after commit.  (This approach also ensures
+     * that the shared stats end up with the right numbers if we abort
+     * instead of committing.)
      */
     if (rel->pgstat_info != NULL)
     {
@@ -1517,158 +1772,172 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    tabentry = get_local_tabstat_entry(RelationGetRelid(rel),
+                                       rel->rd_rel->relisshared);
+
+    tabentry->t_counts.t_delta_live_tuples = livetuples;
+    tabentry->t_counts.t_delta_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->t_counts.reset_changed_tuples = true;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->t_counts.autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->t_counts.analyze_count++;
+    }
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            dbent->counts.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            dbent->counts.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            dbent->counts.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            dbent->counts.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            dbent->counts.n_conflict_startup_deadlock++;
+            break;
+    }
 }
 
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-
-/* --------
- * pgstat_report_checksum_failures_in_db() -
- *
- *    Tell the collector about one or more checksum failures.
- * --------
- */
-void
-pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
-{
-    PgStat_MsgChecksumFailure msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
-
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_deadlocks++;
 }
 
 /* --------
  * pgstat_report_checksum_failure() -
  *
- *    Tell the collector about a checksum failure.
+ *    Report a checksum failure.
  * --------
  */
 void
 pgstat_report_checksum_failure(void)
 {
-    pgstat_report_checksum_failures_in_db(MyDatabaseId, 1);
+    PgStat_StatDBEntry *dbent;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_checksum_failures++;
 }
 
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
+    if (filesize == 0)            /* Is there a case where filesize is really 0? */
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_temp_bytes += filesize; /* needs overflow check */
+    dbent->counts.n_temp_files++;
 }
 
 
-/* ----------
- * pgstat_ping() -
+/* --------
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
  *
- *    Send some junk data to the collector to increase traffic.
- * ----------
+ *    Report one or more checksum failures.
+ * --------
  */
 void
-pgstat_ping(void)
+pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
 {
-    PgStat_MsgDummy msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if we are not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dbentry = get_local_dbstat_entry(dboid);
+
+    /* add the reported failures to the accumulated count */
+    dbentry->counts.n_checksum_failures += failurecount;
 }
 
 /* ----------
- * pgstat_send_inquiry() -
+ * pgstat_init_function_usage() -
  *
- *    Notify collector that we need fresh data.
+ *  Initialize function call usage data.
+ *  Called by the executor before invoking a function.
  * ----------
  */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/*
- * Initialize function call usage data.
- * Called by the executor before invoking a function.
- */
 void
 pgstat_init_function_usage(FunctionCallInfo fcinfo,
                            PgStat_FunctionCallUsage *fcu)
 {
+    PgStatEnvelope *env;
     PgStat_BackendFunctionEntry *htabent;
     bool        found;
 
@@ -1679,26 +1948,15 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
         return;
     }
 
-    if (!pgStatFunctions)
-    {
-        /* First time through - initialize function stat table */
-        HASHCTL        hash_ctl;
+    env = get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                               fcinfo->flinfo->fn_oid, true, &found);
+    htabent = (PgStat_BackendFunctionEntry *) &env->body;
 
-        memset(&hash_ctl, 0, sizeof(hash_ctl));
-        hash_ctl.keysize = sizeof(Oid);
-        hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
-        pgStatFunctions = hash_create("Function stat entries",
-                                      PGSTAT_FUNCTION_HASH_SIZE,
-                                      &hash_ctl,
-                                      HASH_ELEM | HASH_BLOBS);
-    }
-
-    /* Get the stats entry for this function, create if necessary */
-    htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
-                          HASH_ENTER, &found);
     if (!found)
         MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
 
+    htabent->f_id = fcinfo->flinfo->fn_oid;
+
     fcu->fs = &htabent->f_counts;
 
     /* save stats for this function, later used to compensate for recursion */
@@ -1711,31 +1969,38 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
     INSTR_TIME_SET_CURRENT(fcu->f_start);
 }
 
-/*
- * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
- *        for specified function
+/* ----------
+ * find_funcstat_entry() -
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_BackendFunctionEntry for the specified function.
+ *
+ *  If there is no entry, return NULL without creating a new one.
+ * ----------
  */
 PgStat_BackendFunctionEntry *
 find_funcstat_entry(Oid func_id)
 {
-    if (pgStatFunctions == NULL)
+    PgStatEnvelope *env;
+
+    env = get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                               func_id, false, NULL);
+    if (!env)
         return NULL;
 
-    return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
-                                                       (void *) &func_id,
-                                                       HASH_FIND, NULL);
+    return (PgStat_BackendFunctionEntry *) &env->body;
 }
 
-/*
- * Calculate function call usage and update stat counters.
- * Called by the executor after invoking a function.
+/* ----------
+ * pgstat_end_function_usage() -
  *
- * In the case of a set-returning function that runs in value-per-call mode,
- * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
- * calls for what the user considers a single call of the function.  The
- * finalize flag should be TRUE on the last call.
+ *  Calculate function call usage and update stat counters.
+ *  Called by the executor after invoking a function.
+ *
+ *  In the case of a set-returning function that runs in value-per-call mode,
+ *  we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
+ *  calls for what the user considers a single call of the function.  The
+ *  finalize flag should be TRUE on the last call.
+ * ----------
  */
 void
 pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
@@ -1776,9 +2041,6 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
         fs->f_numcalls++;
     fs->f_total_time = f_total;
     INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
 }
 
 
@@ -1790,8 +2052,7 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
  *
  *    We assume that a relcache entry's pgstat_info field is zeroed by
  *    relcache.c when the relcache entry is made; thereafter it is long-lived
- *    data.  We can avoid repeated searches of the TabStatus arrays when the
- *    same relation is touched repeatedly within a transaction.
+ *    data.
  * ----------
  */
 void
@@ -1811,7 +2072,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1827,116 +2089,157 @@ pgstat_initstats(Relation rel)
         return;
 
     /* Else find or make the PgStat_TableStatus entry, and update link */
-    rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+    rel->pgstat_info = get_local_tabstat_entry(rel_id, rel->rd_rel->relisshared);
 }
 
-/*
- * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
+
+/* ----------
+ * get_local_stat_entry() -
+ *
+ *  Returns the local stats entry for the given type, dbid and objid.
+ *  If create is true, a new entry is created when none exists yet; found
+ *  must be non-null in that case.  Otherwise NULL is returned when no entry
+ *  exists.
+ *
+ *  The caller is responsible for initializing the body of the envelope.
+ * ----------
  */
-static PgStat_TableStatus *
-get_tabstat_entry(Oid rel_id, bool isshared)
+static PgStatEnvelope *
+get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                     bool create, bool *found)
 {
-    TabStatHashEntry *hash_entry;
-    PgStat_TableStatus *entry;
-    TabStatusArray *tsa;
-    bool        found;
+    PgStatHashEntryKey key;
+    PgStatLocalHashEntry *entry;
 
-    /*
-     * Create hash table if we don't have it already.
-     */
-    if (pgStatTabHash == NULL)
+    if (pgStatLocalHash == NULL)
     {
         HASHCTL        ctl;
 
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
+        MemSet(&ctl, 0, sizeof(ctl));
+        ctl.keysize = sizeof(PgStatHashEntryKey);
+        ctl.entrysize = sizeof(PgStatLocalHashEntry);
+
+        pgStatLocalHash = hash_create("Local stat entries",
+                                      PGSTAT_TABLE_HASH_SIZE,
+                                      &ctl,
+                                      HASH_ELEM | HASH_BLOBS);
+    }
+
+    /* Find an entry or create a new one. */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    entry = hash_search(pgStatLocalHash, &key,
+                        create ? HASH_ENTER : HASH_FIND, found);
+
+    if (!create && !entry)
+        return NULL;
+
+    if (create && !*found)
+    {
+        int            len = pgstat_localentsize[type];
+
+        entry->env = MemoryContextAlloc(TopMemoryContext,
+                                        PgStatEnvelopeSize(len));
+        entry->env->type = type;
+        entry->env->len = len;
     }
 
+    return entry->env;
+}
+
+/* ----------
+ * get_local_dbstat_entry() -
+ *
+ *  Find or create a local PgStat_StatDBEntry entry for dbid.  A new entry is
+ *  created and initialized if none exists.
+ */
+static PgStat_StatDBEntry *
+get_local_dbstat_entry(Oid dbid)
+{
+    PgStatEnvelope *env;
+    PgStat_StatDBEntry *dbentry;
+    bool        found;
+
     /*
      * Find an entry or create a new one.
      */
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
+    env = get_local_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid,
+                               true, &found);
+    dbentry = (PgStat_StatDBEntry *) &env->body;
+
     if (!found)
     {
-        /* initialize new entry with null pointer */
-        hash_entry->tsa_entry = NULL;
+        dbentry->databaseid = dbid;
+        dbentry->last_autovac_time = 0;
+        dbentry->last_checksum_failure = 0;
+        dbentry->stat_reset_timestamp = 0;
+        dbentry->stats_timestamp = 0;
+        MemSet(&dbentry->counts, 0, sizeof(PgStat_StatDBCounts));
     }
 
-    /*
-     * If entry is already valid, we're done.
-     */
-    if (hash_entry->tsa_entry)
-        return hash_entry->tsa_entry;
-
-    /*
-     * Locate the first pgStatTabList entry with free space, making a new list
-     * entry if needed.  Note that we could get an OOM failure here, but if so
-     * we have left the hashtable and the list in a consistent state.
-     */
-    if (pgStatTabList == NULL)
-    {
-        /* Set up first pgStatTabList entry */
-        pgStatTabList = (TabStatusArray *)
-            MemoryContextAllocZero(TopMemoryContext,
-                                   sizeof(TabStatusArray));
-    }
+    return dbentry;
+}
 
-    tsa = pgStatTabList;
-    while (tsa->tsa_used >= TABSTAT_QUANTUM)
-    {
-        if (tsa->tsa_next == NULL)
-            tsa->tsa_next = (TabStatusArray *)
-                MemoryContextAllocZero(TopMemoryContext,
-                                       sizeof(TabStatusArray));
-        tsa = tsa->tsa_next;
-    }
 
-    /*
-     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
-     * the entry was already zeroed, either at creation or after last use.
-     */
-    entry = &tsa->tsa_entries[tsa->tsa_used++];
-    entry->t_id = rel_id;
-    entry->t_shared = isshared;
+/* ----------
+ * get_local_tabstat_entry() -
+ *  Find or create a PgStat_TableStatus entry for rel.  A new entry is created
+ *  and initialized if none exists.
+ * ----------
+ */
+static PgStat_TableStatus *
+get_local_tabstat_entry(Oid rel_id, bool isshared)
+{
+    PgStatEnvelope *env;
+    PgStat_TableStatus *tabentry;
+    bool        found;
 
-    /*
-     * Now we can fill the entry in pgStatTabHash.
-     */
-    hash_entry->tsa_entry = entry;
+    env = get_local_stat_entry(PGSTAT_TYPE_TABLE,
+                               isshared ? InvalidOid : MyDatabaseId,
+                               rel_id, true, &found);
 
-    return entry;
+    tabentry = (PgStat_TableStatus *) &env->body;
+
+    if (!found)
+    {
+        tabentry->t_id = rel_id;
+        tabentry->t_shared = isshared;
+        tabentry->trans = NULL;
+        MemSet(&tabentry->t_counts, 0, sizeof(PgStat_TableCounts));
+        tabentry->vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+    }
+
+    return tabentry;
 }
 
+
 /*
  * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_TableStatus entry for rel from the current
+ *  database then from shared tables.
  *
- * Note: if we got an error in the most recent execution of pgstat_report_stat,
- * it's possible that an entry exists but there's no hashtable entry for it.
- * That's okay, we'll treat this case as "doesn't exist".
+ *  If no entry, return NULL, don't create a new one
+ * ----------
  */
 PgStat_TableStatus *
 find_tabstat_entry(Oid rel_id)
 {
-    TabStatHashEntry *hash_entry;
+    PgStatEnvelope *env;
 
-    /* If hashtable doesn't exist, there are no entries at all */
-    if (!pgStatTabHash)
-        return NULL;
+    env = get_local_stat_entry(PGSTAT_TYPE_TABLE, MyDatabaseId, rel_id,
+                               false, NULL);
+    if (!env)
+        env = get_local_stat_entry(PGSTAT_TYPE_TABLE, InvalidOid, rel_id,
+                                   false, NULL);
+    if (env)
+        return (PgStat_TableStatus *) &env->body;
 
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
-    if (!hash_entry)
-        return NULL;
-
-    /* Note that this step could also return NULL, but that's correct */
-    return hash_entry->tsa_entry;
+    return NULL;
 }
 
 /*
@@ -2367,8 +2670,8 @@ AtPrepare_PgStat(void)
  *
  * All we need do here is unlink the transaction stats state from the
  * nontransactional state.  The nontransactional action counts will be
- * reported to the stats collector immediately, while the effects on live
- * and dead tuple counts are preserved in the 2PC state file.
+ * reported to the activity stats facility immediately, while the effects on
+ * live and dead tuple counts are preserved in the 2PC state file.
  *
  * Note: AtEOXact_PgStat is not called during PREPARE.
  */
@@ -2413,7 +2716,7 @@ pgstat_twophase_postcommit(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, commit case */
     pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
@@ -2449,7 +2752,7 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, abort case */
     if (rec->t_truncated)
@@ -2466,88 +2769,176 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 }
 
 
+/* ----------
+ * snapshot_statentry() -
+ *
+ *  Common routine for functions pgstat_fetch_stat_*entry()
+ *
+ *  Returns the pointer to the snapshot of the shared entry for the key or NULL
+ *  if not found. Returned snapshots are stable during the current transaction
+ *  or until pgstat_clear_snapshot() is called.
+ *
+ *  Created snapshots are stored in pgStatSnapshotHash.
+ */
+static void *
+snapshot_statentry(const PgStatTypes type, const Oid dbid, const Oid objid)
+{
+    PgStatSnapshot *snap = NULL;
+    bool        found;
+    PgStatHashEntryKey key;
+    size_t        statentsize = pgstat_entsize[type];
+
+    Assert(type != PGSTAT_TYPE_ALL);
+
+    /*
+     * Create new hash, with rather arbitrary initial number of entries since
+     * we don't know how this hash will grow.
+     */
+    if (!pgStatSnapshotHash)
+    {
+        HASHCTL        ctl;
+
+        /*
+         * Create the hash in the stats context
+         *
+         * Each entry is prefixed with a common header part represented by
+         * PgStatSnapshot.
+         */
+
+        ctl.keysize = sizeof(PgStatHashEntryKey);
+        ctl.entrysize = PgStatSnapshotSize(statentsize);
+        ctl.hcxt = pgStatSnapshotContext;
+        pgStatSnapshotHash = hash_create("pgstat snapshot hash", 32, &ctl,
+                                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    }
+
+    /* Find a snapshot  */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+
+    snap = hash_search(pgStatSnapshotHash, &key, HASH_ENTER, &found);
+
+    /*
+     * Refer to the shared hash if not found in the snapshot hash.
+     *
+     * In transaction state, we obviously should create a snapshot entry for
+     * consistency. Outside a transaction, we return an up-to-date entry, but
+     * still need a copy since the shared stats entry can be modified at any
+     * time. We reuse the same snapshot entry for that purpose.
+     */
+    if (!found || !IsTransactionState())
+    {
+        PgStatEnvelope *shenv;
+
+        shenv = get_stat_entry(type, dbid, objid, true, NULL, NULL);
+
+        if (shenv)
+            memcpy(&snap->body, &shenv->body, statentsize);
+
+        snap->negative = !shenv;
+    }
+
+    if (snap->negative)
+        return NULL;
+
+    return &snap->body;
+}
+
+
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    Find database stats entry on backends. The returned entries are cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
  * ----------
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    /* If not done for this transaction, take a snapshot of global stats */
+    pgstat_snapshot_global_stats();
+
+    /* the caller has no business with snapshot-local members */
+    return (PgStat_StatDBEntry *)
+        snapshot_statentry(PGSTAT_TYPE_DB, dbid, InvalidOid);
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one table or NULL. NULL doesn't mean
+ *    the activity statistics for one table or NULL. NULL doesn't mean
  *    that the table doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    activity statistics facilities, so the caller is better off to
+ *    report ZERO instead.
  * ----------
  */
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
-    PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(false, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
-
-    return NULL;
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(true, relid);
+    return tabentry;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_tabentry_snapshot() -
+ *
+ *    Find the table stats entry, shared or database-local. The returned entry
+ *    is cached until transaction end or pgstat_clear_snapshot() is called.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_snapshot(bool shared, Oid reloid)
+{
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    return (PgStat_StatTabEntry *)
+        snapshot_statentry(PGSTAT_TYPE_TABLE, dboid, reloid);
+}
+
+
+/* ----------
+ * pgstat_copy_index_counters() -
+ *
+ *    Support function for index swapping. Copy a portion of the counters of the
+ *    relation to the specified place.
+ * ----------
+ */
+void
+pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst)
+{
+    PgStat_StatTabEntry *tabentry;
+
+    /* No point fetching tabentry when dst is NULL */
+    if (!dst)
+        return;
+
+    tabentry = pgstat_fetch_stat_tabentry(relid);
+
+    if (!tabentry)
+        return;
+
+    dst->t_counts.t_numscans = tabentry->numscans;
+    dst->t_counts.t_tuples_returned = tabentry->tuples_returned;
+    dst->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
+    dst->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
+    dst->t_counts.t_blocks_hit = tabentry->blocks_hit;
 }
 
 
@@ -2561,25 +2952,60 @@ pgstat_fetch_stat_tabentry(Oid relid)
 PgStat_StatFuncEntry *
 pgstat_fetch_stat_funcentry(Oid func_id)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry = NULL;
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
 
-    /* load the stats file if needed */
-    backend_read_statsfile();
+    return (PgStat_StatFuncEntry *)
+        snapshot_statentry(PGSTAT_TYPE_FUNCTION, MyDatabaseId, func_id);
+}
 
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
+/*
+ * pgstat_snapshot_global_stats() -
+ *
+ * Makes a snapshot of global stats if not done yet.  They will be kept until
+ * subsequent call of pgstat_clear_snapshot() or the end of the current
+ * memory context (typically TopTransactionContext).
+ * ----------
+ */
+static void
+pgstat_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext;
+    int    i;
+
+    attach_shared_stats();
+
+    /* Nothing to do if already done */
+    if (global_snapshot_is_valid)
+        return;
+
+    oldcontext = MemoryContextSwitchTo(pgStatSnapshotContext);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+
+    memcpy(&snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+    memcpy(&snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+    memcpy(snapshot_SLRUStats, shared_SLRUStats,
+           sizeof(PgStat_StatSLRUEntry) * SLRU_NUM_ELEMENTS);
+
+    LWLockRelease(StatsLock);
+
+    /* Fill in empty timestamp of SLRU stats  */
+    for (i = 0 ; i < SLRU_NUM_ELEMENTS ; i++)
     {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
+        if (snapshot_SLRUStats[i].stat_reset_timestamp == 0)
+            snapshot_SLRUStats[i].stat_reset_timestamp =
+                snapshot_globalStats.stat_reset_timestamp;
     }
+    global_snapshot_is_valid = true;
 
-    return funcentry;
+    MemoryContextSwitchTo(oldcontext);
+
+    return;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_beentry() -
  *
@@ -2650,9 +3076,10 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &archiverStats;
+    return &snapshot_archiverStats;
 }
 
 
@@ -2667,9 +3094,10 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &globalStats;
+    return &snapshot_globalStats;
 }
 
 
@@ -2681,12 +3109,12 @@ pgstat_fetch_global(void)
  *    a pointer to the slru statistics struct.
  * ---------
  */
-PgStat_SLRUStats *
+PgStat_StatSLRUEntry *
 pgstat_fetch_slru(void)
 {
-    backend_read_statsfile();
+    pgstat_snapshot_global_stats();
 
-    return slruStats;
+    return snapshot_SLRUStats;
 }
 
 
@@ -2900,8 +3328,8 @@ pgstat_initialize(void)
         MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
     }
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* needs to be called before dsm shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -3077,12 +3505,15 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+    /* attach shared database stats area */
+    attach_shared_stats();
 }
 
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
- * Flush any remaining statistics counts out to the collector.
+ * Flush any remaining statistics counts out to shared stats.
  * Without this, operations triggered during backend exit (such as
  * temp table deletions) won't be counted.
  *
@@ -3095,7 +3526,7 @@ pgstat_beshutdown_hook(int code, Datum arg)
 
     /*
      * If we got as far as discovering our own database ID, we can report what
-     * we did to the collector.  Otherwise, we'd be sending an invalid
+     * we did to the shared stats.  Otherwise, we'd be sending an invalid
      * database ID, so forget it.  (This means that accesses to pg_database
      * during failed backend starts might never get counted.)
      */
@@ -3112,6 +3543,8 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    detach_shared_stats(true);
 }
 
 
@@ -3372,7 +3805,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3667,9 +4101,6 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
-            break;
         case WAIT_EVENT_RECOVERY_WAL_STREAM:
             event_name = "RecoveryWalStream";
             break;
@@ -4304,94 +4735,71 @@ pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
 
 
 /* ----------
- * pgstat_setheader() -
+ * pgstat_report_archiver() -
  *
- *        Set common header fields in a statistics message
- * ----------
- */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
-    {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
- *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
+ *        Report archiver statistics
  * ----------
  */
 void
-pgstat_send_archiver(const char *xlog, bool failed)
+pgstat_report_archiver(const char *xlog, bool failed)
 {
-    PgStat_MsgArchiver msg;
+    TimestampTz now = GetCurrentTimestamp();
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+    if (failed)
+    {
+        /* Failed archival attempt */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->failed_count;
+        memcpy(shared_archiverStats->last_failed_wal, xlog,
+               sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    else
+    {
+        /* Successful archival operation */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->archived_count;
+        memcpy(shared_archiverStats->last_archived_wal, xlog,
+               sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
 }
 
 /* ----------
- * pgstat_send_bgwriter() -
+ * pgstat_report_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
-pgstat_send_bgwriter(void)
+pgstat_report_bgwriter(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+
+    PgStat_BgWriter *l = &BgWriterStats;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid taking the lock for completely empty stats.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += l->timed_checkpoints;
+    shared_globalStats->requested_checkpoints += l->requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += l->checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += l->checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += l->buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += l->buf_written_clean;
+    shared_globalStats->maxwritten_clean += l->maxwritten_clean;
+    shared_globalStats->buf_written_backend += l->buf_written_backend;
+    shared_globalStats->buf_fsync_backend += l->buf_fsync_backend;
+    shared_globalStats->buf_alloc += l->buf_alloc;
+    LWLockRelease(StatsLock);
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4399,475 +4807,31 @@ pgstat_send_bgwriter(void)
     MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
 }
 
-/* ----------
- * pgstat_send_slru() -
- *
- *        Send SLRU statistics to the collector
- * ----------
- */
-static void
-pgstat_send_slru(void)
-{
-    int        i;
-
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgSLRU all_zeroes;
-
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /*
-         * This function can be called even if nothing at all has happened. In
-         * this case, avoid sending a completely empty message to the stats
-         * collector.
-         */
-        if (memcmp(&SLRUStats[i], &all_zeroes, sizeof(PgStat_MsgSLRU)) == 0)
-            continue;
-
-        /* set the SLRU type before each send */
-        SLRUStats[i].m_index = i;
-
-        /*
-         * Prepare and send the message
-         */
-        pgstat_setheader(&SLRUStats[i].m_hdr, PGSTAT_MTYPE_SLRU);
-        pgstat_send(&SLRUStats[i], sizeof(PgStat_MsgSLRU));
-
-        /*
-         * Clear out the statistics buffer, so it can be re-used.
-         */
-        MemSet(&SLRUStats[i], 0, sizeof(PgStat_MsgSLRU));
-    }
-}
-
-
-/* ----------
- * PgstatCollectorMain() -
- *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, SignalHandlerForConfigReload);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    MyBackendType = B_STATS_COLLECTOR;
-    init_ps_display(NULL);
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do check ConfigReloadPending
-     * inside the inner loop, which means that such interrupts will get
-     * serviced but the latch won't get cleared until next time there is a
-     * break in the action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (ShutdownRequestPending)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * ShutdownRequestPending becomes set.
-         */
-        while (!ShutdownRequestPending)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (ConfigReloadPending)
-            {
-                ConfigReloadPending = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSLRUCOUNTER:
-                    pgstat_recv_resetslrucounter(&msg.msg_resetslrucounter,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_SLRU:
-                    pgstat_recv_slru(&msg.msg_slru, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr & WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    exit(0);
-}
-
-/*
- * Subroutine to clear stats in a database entry
- *
- * Tables and functions hashes are initialized to empty.
- */
-static void
-reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
-{
-    HASHCTL        hash_ctl;
-
-    dbentry->n_xact_commit = 0;
-    dbentry->n_xact_rollback = 0;
-    dbentry->n_blocks_fetched = 0;
-    dbentry->n_blocks_hit = 0;
-    dbentry->n_tuples_returned = 0;
-    dbentry->n_tuples_fetched = 0;
-    dbentry->n_tuples_inserted = 0;
-    dbentry->n_tuples_updated = 0;
-    dbentry->n_tuples_deleted = 0;
-    dbentry->last_autovac_time = 0;
-    dbentry->n_conflict_tablespace = 0;
-    dbentry->n_conflict_lock = 0;
-    dbentry->n_conflict_snapshot = 0;
-    dbentry->n_conflict_bufferpin = 0;
-    dbentry->n_conflict_startup_deadlock = 0;
-    dbentry->n_temp_files = 0;
-    dbentry->n_temp_bytes = 0;
-    dbentry->n_deadlocks = 0;
-    dbentry->n_checksum_failures = 0;
-    dbentry->last_checksum_failure = 0;
-    dbentry->n_block_read_time = 0;
-    dbentry->n_block_write_time = 0;
-
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
-
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
-}
-
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
-{
-    PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
-     */
-    if (!found)
-        reset_dbentry_counters(result);
-
-    return result;
-}
-
-
-/*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
-{
-    PgStat_StatTabEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /* If not found, initialize the new one. */
-    if (!found)
-    {
-        result->numscans = 0;
-        result->tuples_returned = 0;
-        result->tuples_fetched = 0;
-        result->tuples_inserted = 0;
-        result->tuples_updated = 0;
-        result->tuples_deleted = 0;
-        result->tuples_hot_updated = 0;
-        result->n_live_tuples = 0;
-        result->n_dead_tuples = 0;
-        result->changes_since_analyze = 0;
-        result->inserts_since_vacuum = 0;
-        result->blocks_fetched = 0;
-        result->blocks_hit = 0;
-        result->vacuum_timestamp = 0;
-        result->vacuum_count = 0;
-        result->autovac_vacuum_timestamp = 0;
-        result->autovac_vacuum_count = 0;
-        result->analyze_timestamp = 0;
-        result->analyze_count = 0;
-        result->autovac_analyze_timestamp = 0;
-        result->autovac_analyze_count = 0;
-    }
-
-    return result;
-}
-
 
 /* ----------
  * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ *        Write the global statistics file, as well as DB files.
  * ----------
  */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+void
+pgstat_write_statsfiles(void)
 {
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **penv;
+
+    /* Stats are not initialized yet; just return. */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
+    create_missing_dbentries();
+
     /*
      * Open the statistics temp file to write out the current values.
      */
@@ -4884,7 +4848,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
 
     /*
      * Write the file header --- currently just a format ID.
@@ -4896,38 +4860,31 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Write global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write archiver stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Write SLRU stats struct
-     */
-    rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Walk through the database table.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_DB, InvalidOid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) &(*penv)->body;
+
         /*
          * Write out the table and function stats for this DB into the
          * appropriate per-DB stat file, if required.
          */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
+        pgstat_write_database_stats(dbentry);
 
         /*
          * Write out the DB entry. We don't write the tables or functions
@@ -4938,6 +4895,8 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
         (void) rc;                /* we'll check for error with ferror */
     }
 
+    pfree(envlist);
+
     /*
      * No more output to be done. Close the temp file and replace the old
      * pgstat.stat with it.  The ferror() check replaces testing for error
@@ -4970,55 +4929,19 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
 }
 
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
 
 /* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
+ * pgstat_write_database_stats() -
+ *  Write the stat file for a single database.
  * ----------
  */
 static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
+pgstat_write_database_stats(PgStat_StatDBEntry *dbentry)
 {
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **penv;
     FILE       *fpout;
     int32        format_id;
     Oid            dbid = dbentry->databaseid;
@@ -5026,8 +4949,8 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     char        tmpfile[MAXPGPATH];
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -5054,24 +4977,31 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     /*
      * Walk through the database's access stats per table.
      */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_TABLE, dbentry->databaseid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatTabEntry *tabentry = (PgStat_StatTabEntry *) &(*penv)->body;
+
         fputc('T', fpout);
         rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    pfree(envlist);
 
     /*
      * Walk through the database's function stats table.
      */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_FUNCTION, dbentry->databaseid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatFuncEntry *funcentry =
+        (PgStat_StatFuncEntry *) &(*penv)->body;
+
         fputc('F', fpout);
         rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    pfree(envlist);
 
     /*
      * No more output to be done. Close the temp file and replace the old
@@ -5105,102 +5035,165 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
+}
 
-    if (permanent)
+/* ----------
+ * create_missing_dbentries() -
+ *
+ *  A database entry may be missing for a database in which object stats
+ *  are recorded. This function creates such missing dbentries so that all
+ *  stats entries can be written out to files.
+ * ----------
+ */
+static void
+create_missing_dbentries(void)
+{
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
+    HTAB       *oidhash;
+    HASHCTL        ctl;
+    HASH_SEQ_STATUS scan;
+    Oid           *poid;
+
+    memset(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(Oid);
+    ctl.entrysize = sizeof(Oid);
+    ctl.hcxt = CurrentMemoryContext;
+    oidhash = hash_create("Temporary table of OIDs",
+                          PGSTAT_TABLE_HASH_SIZE,
+                          &ctl,
+                          HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+    /* Collect database OIDs from the shared stats hash */
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+        hash_search(oidhash, &p->key.databaseid, HASH_ENTER, NULL);
+    dshash_seq_term(&hstat);
+
+    /* Create database entries that do not exist yet. */
+    hash_seq_init(&scan, oidhash);
+    while ((poid = (Oid *) hash_seq_search(&scan)) != NULL)
+        (void) get_stat_entry(PGSTAT_TYPE_DB, *poid, InvalidOid,
+                              false, init_dbentry, NULL);
+
+    hash_destroy(oidhash);
+}
+
+
+/* ----------
+ * get_stat_entry() -
+ *
+ *  Get the shared stats entry for the specified type, dbid and objid.
+ *  If nowait is true, returns NULL on lock failure.
+ *
+ *  If initfunc is not NULL, a new entry is created if none exists yet and
+ *  the function is called with the new envelope. If found is not NULL, it is
+ *  set to true if an existing entry was found, or false if not.
+ * ----------
+ */
+static PgStatEnvelope *
+get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+               bool nowait, entry_initializer initfunc, bool *found)
+{
+    bool        create = (initfunc != NULL);
+    PgStatHashEntry *shent;
+    PgStatEnvelope *shenv = NULL;
+    PgStatHashEntryKey key;
+    bool        myfound;
+
+    Assert(type != PGSTAT_TYPE_ALL);
+
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    shent = dshash_find_extended(pgStatSharedHash, &key,
+                                 create, nowait, create, &myfound);
+    if (shent)
     {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
+        if (create && !myfound)
+        {
+            /* Create new stats envelope. */
+            size_t        envsize = PgStatEnvelopeSize(pgstat_entsize[type]);
+            dsa_pointer chunk = dsa_allocate0(area, envsize);
 
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
+            shenv = dsa_get_address(area, chunk);
+            shenv->type = type;
+            shenv->databaseid = dbid;
+            shenv->objectid = objid;
+            shenv->len = pgstat_entsize[type];
+            LWLockInitialize(&shenv->lock, LWTRANCHE_STATS);
+
+            /*
+             * The dshash lock is released right after this, so call the
+             * initializer before the entry is exposed to other processes.
+             */
+            if (initfunc)
+                initfunc(shenv);
+
+            /* Link the new entry from the hash entry. */
+            shent->env = chunk;
+        }
+        else
+            shenv = dsa_get_address(area, shent->env);
+
+        dshash_release_lock(pgStatSharedHash, shent);
     }
+
+    if (found)
+        *found = myfound;
+
+    return shenv;
 }
 
+
 /* ----------
  * pgstat_read_statsfiles() -
  *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ *    Reads in existing activity statistics files into the shared stats hash.
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+void
+pgstat_read_statsfiles(void)
 {
+    PgStatEnvelope *env;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-    int            i;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
 
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
+    /* shouldn't be called from postmaster */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-    memset(&slruStats, 0, sizeof(slruStats));
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Set the same reset timestamp for all SLRU items too.
-     */
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-        slruStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp =
+        shared_globalStats->stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the activity statistics is not running or
+     * has not yet written the stats file the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -5209,7 +5202,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
@@ -5217,49 +5210,30 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     /*
      * Read global stats struct
      */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+        sizeof(*shared_globalStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
         goto done;
     }
 
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
     /*
      * Read archiver stats struct
      */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+        sizeof(*shared_archiverStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
         goto done;
     }
 
     /*
-     * Read SLRU stats struct
-     */
-    if (fread(slruStats, 1, sizeof(slruStats), fpin) != sizeof(slruStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&slruStats, 0, sizeof(slruStats));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
     for (;;)
     {
@@ -5273,7 +5247,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
                           fpin) != offsetof(PgStat_StatDBEntry, tables))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5282,76 +5256,33 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 /*
                  * Add to the DB hash
                  */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
+
+                env = get_stat_entry(PGSTAT_TYPE_DB, dbbuf.databaseid,
+                                     InvalidOid,
+                                     false, init_dbentry, &found);
+                dbentry = (PgStat_StatDBEntry *) &env->body;
+
+                /* don't allow duplicate dbentries */
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
+                memcpy(dbentry, &dbbuf,
+                       offsetof(PgStat_StatDBEntry, tables));
 
+                /* Read the data from the database-specific file. */
+                pgstat_read_db_statsfile(dbentry);
                 break;
 
             case 'E':
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5361,59 +5292,50 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 done:
     FreeFile(fpin);
 
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 
-    return dbhash;
+    return;
 }
 
 
 /* ----------
  * pgstat_read_db_statsfile() -
  *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
+ *    Reads in the at-rest statistics file and creates shared statistics
+ *    tables. The file is removed after reading.
  * ----------
  */
 static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
+pgstat_read_db_statsfile(PgStat_StatDBEntry *dbentry)
 {
+    PgStatEnvelope *env;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatTabEntry tabbuf;
     PgStat_StatFuncEntry funcbuf;
     PgStat_StatFuncEntry *funcentry;
+    dshash_table *tabhash = NULL;
+    dshash_table *funchash = NULL;
     FILE       *fpin;
     int32        format_id;
     bool        found;
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
+    get_dbstat_filename(false, dbentry->databaseid, statfile, MAXPGPATH);
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the activity statistics facility is not
+     * running or has not yet written the stats file.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
@@ -5426,14 +5348,14 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
 
     /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
     for (;;)
     {
@@ -5446,25 +5368,21 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
                           fpin) != sizeof(PgStat_StatTabEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
+                env = get_stat_entry(PGSTAT_TYPE_TABLE,
+                                     dbentry->databaseid, tabbuf.tableid,
+                                     false, init_tabentry, &found);
+                tabentry = (PgStat_StatTabEntry *) &env->body;
 
+                /* don't allow duplicate entries */
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5480,25 +5398,20 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
                           fpin) != sizeof(PgStat_StatFuncEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
+                env = get_stat_entry(PGSTAT_TYPE_FUNCTION, dbentry->databaseid,
+                                     funcbuf.functionid,
+                                     false, init_funcentry, &found);
+                funcentry = (PgStat_StatFuncEntry *) &env->body;
 
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5514,7 +5427,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5522,304 +5435,15 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     }
 
 done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read SLRU stats struct
-     */
-    if (fread(mySLRUStats, 1, sizeof(mySLRUStats), fpin) != sizeof(mySLRUStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
+    if (tabhash)
+        dshash_detach(tabhash);
+    if (funchash)
+        dshash_detach(funchash);
 
-done:
     FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
-}
-
 
-/* ----------
- * pgstat_setup_memcxt() -
- *
- *    Create pgStatLocalContext, if not already done.
- * ----------
- */
-static void
-pgstat_setup_memcxt(void)
-{
-    if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 }
 
 
@@ -5838,799 +5462,25 @@ pgstat_clear_snapshot(void)
 {
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
+    {
         MemoryContextDelete(pgStatLocalContext);
 
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-                tabentry->inserts_since_vacuum = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_resetslrucounter() -
- *
- *    Reset some SLRU statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len)
-{
-    int            i;
-    TimestampTz    ts = GetCurrentTimestamp();
-
-    memset(&slruStats, 0, sizeof(slruStats));
-
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /* reset entry with the given index, or all entries (index is -1) */
-        if ((msg->m_index == -1) || (msg->m_index == i))
-        {
-            memset(&slruStats[i], 0, sizeof(slruStats[i]));
-            slruStats[i].stat_reset_timestamp = ts;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * It is quite possible that a non-aggressive VACUUM ended up skipping
-     * various pages, however, we'll zero the insert counter here regardless.
-     * It's currently used only to track when we need to perform an
-     * "insert" autovacuum, which are mainly intended to freeze newly inserted
-     * tuples.  Zeroing this may just mean we'll not try to vacuum the table
-     * again until enough tuples have been inserted to trigger another insert
-     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
-     * stragglers.
-     */
-    tabentry->inserts_since_vacuum = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
     }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_slru() -
- *
- *    Process a SLRU message.
- * ----------
- */
-static void
-pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
-{
-    slruStats[msg->m_index].blocks_zeroed += msg->m_blocks_zeroed;
-    slruStats[msg->m_index].blocks_hit += msg->m_blocks_hit;
-    slruStats[msg->m_index].blocks_read += msg->m_blocks_read;
-    slruStats[msg->m_index].blocks_written += msg->m_blocks_written;
-    slruStats[msg->m_index].blocks_exists += msg->m_blocks_exists;
-    slruStats[msg->m_index].flush += msg->m_flush;
-    slruStats[msg->m_index].truncate += msg->m_truncate;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
 
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
+    if (pgStatSnapshotContext)
     {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+        MemoryContextReset(pgStatSnapshotContext);
 
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
+        /* Reset variables that pointed to the context */
+        global_snapshot_is_valid = false;
+        pgStatSnapshotHash = NULL;
     }
 }
 
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
-}
-
 /*
  * Convert a potentially unsafely truncated activity string (see
  * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
@@ -6721,54 +5571,62 @@ pgstat_slru_name(int idx)
  * Returns pointer to entry with counters for given SLRU (based on the name
  * stored in SlruCtl as lwlock tranche name).
  */
-static PgStat_MsgSLRU *
+static PgStat_StatSLRUEntry *
 slru_entry(SlruCtl ctl)
 {
     int        idx = pgstat_slru_index(ctl->shared->lwlock_tranche_name);
 
     Assert((idx >= 0) && (idx < SLRU_NUM_ELEMENTS));
 
-    return &SLRUStats[idx];
+    return &local_SLRUStats[idx];
 }
 
+
 void
 pgstat_count_slru_page_zeroed(SlruCtl ctl)
 {
-    slru_entry(ctl)->m_blocks_zeroed += 1;
+    slru_entry(ctl)->blocks_zeroed += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_hit(SlruCtl ctl)
 {
-    slru_entry(ctl)->m_blocks_hit += 1;
+    slru_entry(ctl)->blocks_hit += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_exists(SlruCtl ctl)
 {
-    slru_entry(ctl)->m_blocks_exists += 1;
+    slru_entry(ctl)->blocks_exists += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_read(SlruCtl ctl)
 {
-    slru_entry(ctl)->m_blocks_read += 1;
+    slru_entry(ctl)->blocks_read += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_written(SlruCtl ctl)
 {
-    slru_entry(ctl)->m_blocks_written += 1;
+    slru_entry(ctl)->blocks_written += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_flush(SlruCtl ctl)
 {
-    slru_entry(ctl)->m_flush += 1;
+    slru_entry(ctl)->flush += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_truncate(SlruCtl ctl)
 {
-    slru_entry(ctl)->m_truncate += 1;
+    slru_entry(ctl)->truncate += 1;
+    have_slrustats = true;
 }
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 2687d2af47..214d941ad7 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -254,7 +254,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -502,7 +501,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1325,12 +1323,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1779,11 +1771,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2693,8 +2680,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -3057,8 +3042,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3125,13 +3108,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3204,22 +3180,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3680,22 +3640,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3891,8 +3835,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3916,8 +3858,7 @@ PostmasterStateMachine(void)
     {
         /*
          * PM_WAIT_DEAD_END state ends when the BackendList is entirely empty
-         * (ie, no dead_end children remain), and the archiver and stats
-         * collector are gone too.
+         * (ie, no dead_end children remain), and the archiver is gone too.
          *
          * The reason we wait for those two is to protect them against a new
          * postmaster starting conflicting subprocesses; this isn't an
@@ -3927,8 +3868,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4129,8 +4069,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5103,18 +5041,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5233,12 +5159,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6133,7 +6053,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6189,8 +6108,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6425,7 +6342,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index fbdc28ec39..984719c166 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1550,8 +1550,8 @@ is_checksummed_file(const char *fullpath, const char *filename)
  *
  * If 'missing_ok' is true, will not throw an error if the file is not found.
  *
- * If dboid is anything other than InvalidOid then any checksum failures detected
- * will get reported to the stats collector.
+ * If dboid is anything other than InvalidOid then any checksum failures
+ * detected will get reported to the activity stats facility.
  *
  * Returns true if the file was successfully sent, false if 'missing_ok',
  * and the file did not exist.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f9980cf80c..3684aa8a33 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2010,7 +2010,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                BgWriterStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2120,7 +2120,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2310,7 +2310,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2318,7 +2318,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
 
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..58a442f482 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -147,6 +147,7 @@ CreateSharedMemoryAndSemaphores(void)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -263,6 +264,7 @@ CreateSharedMemoryAndSemaphores(void)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 4c14e51c67..f61fd3e8ad 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -523,6 +523,7 @@ RegisterLWLockTranches(void)
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
     LWLockRegisterTranche(LWTRANCHE_SXACT, "serializable_xact");
+    LWLockRegisterTranche(LWTRANCHE_STATS, "activity_statistics");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index db47843229..97eccb35d3 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -49,3 +49,4 @@ MultiXactTruncationLock                41
 OldSnapshotTimeMapLock                42
 LogicalRepWorkerLock                43
 CLogTruncationLock                    44
+StatsLock                            45
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index b053a4dc76..4dae9c3938 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -362,10 +362,10 @@ smgrdounlink(SMgrRelation reln, bool isRedo)
     DropRelFileNodesAllBuffers(&rnode, 1);
 
     /*
-     * It'd be nice to tell the stats collector to forget it immediately, too.
-     * But we can't because we don't know the OID (and in cases involving
-     * relfilenode swaps, it's not always clear which table OID to forget,
-     * anyway).
+     * It'd be nice to tell the activity stats facility to forget it
+     * immediately, too.  But we can't because we don't know the OID (and in
+     * cases involving relfilenode swaps, it's not always clear which table OID
+     * to forget, anyway).
      */
 
     /*
@@ -470,8 +470,8 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
     DropRelFileNodesAllBuffers(rnodes, nrels);
 
     /*
-     * It'd be nice to tell the stats collector to forget them immediately,
-     * too. But we can't because we don't know the OIDs.
+     * It'd be nice to tell the activity stats facility to forget them
+     * immediately, too. But we can't because we don't know the OIDs.
      */
 
     /*
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8958ec8103..9932ce4dac 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3209,6 +3209,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3783,6 +3789,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4185,11 +4192,12 @@ PostgresMain(int argc, char *argv[],
          * Note: this includes fflush()'ing the last of the prior output.
          *
          * This is also a good time to send collected statistics to the
-         * collector, and to update the PS stats display.  We avoid doing
-         * those every time through the message loop because it'd slow down
-         * processing of batched messages, and because we don't want to report
-         * uncommitted updates (that confuses autovacuum).  The notification
-         * processor wants a call too, if we are not in a transaction block.
+         * activity statistics, and to update the PS stats display.  We avoid
+         * doing those every time through the message loop because it'd slow
+         * down processing of batched messages, and because we don't want to
+         * report uncommitted updates (that confuses autovacuum).  The
+         * notification processor wants a call too, if we are not in a
+         * transaction block.
          */
         if (send_ready_for_query)
         {
@@ -4221,6 +4229,8 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long stats_timeout;
+
                 /* Send out notify signals and transmit self-notifies */
                 ProcessCompletedNotifies();
 
@@ -4233,8 +4243,13 @@ PostgresMain(int argc, char *argv[],
                 if (notifyInterruptPending)
                     ProcessNotifyInterrupt();
 
-                pgstat_report_stat(false);
-
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle");
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4269,7 +4284,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4277,6 +4292,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 446044609e..7f92190665 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * pgstatfuncs.c
- *      Functions for accessing the statistics collector data
+ *      Functions for accessing the activity statistics data
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -35,9 +35,6 @@
 
 #define HAS_PGSTAT_PERMISSIONS(role)     (is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS) || has_privs_of_role(GetUserId(),role))
 
 
-/* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
-
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
 {
@@ -1261,7 +1258,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_commit);
+        result = (int64) (dbentry->counts.n_xact_commit);
 
     PG_RETURN_INT64(result);
 }
@@ -1277,7 +1274,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_rollback);
+        result = (int64) (dbentry->counts.n_xact_rollback);
 
     PG_RETURN_INT64(result);
 }
@@ -1293,7 +1290,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_fetched);
+        result = (int64) (dbentry->counts.n_blocks_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1309,7 +1306,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_hit);
+        result = (int64) (dbentry->counts.n_blocks_hit);
 
     PG_RETURN_INT64(result);
 }
@@ -1325,7 +1322,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_returned);
+        result = (int64) (dbentry->counts.n_tuples_returned);
 
     PG_RETURN_INT64(result);
 }
@@ -1341,7 +1338,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_fetched);
+        result = (int64) (dbentry->counts.n_tuples_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1357,7 +1354,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_inserted);
+        result = (int64) (dbentry->counts.n_tuples_inserted);
 
     PG_RETURN_INT64(result);
 }
@@ -1373,7 +1370,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_updated);
+        result = (int64) (dbentry->counts.n_tuples_updated);
 
     PG_RETURN_INT64(result);
 }
@@ -1389,7 +1386,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_deleted);
+        result = (int64) (dbentry->counts.n_tuples_deleted);
 
     PG_RETURN_INT64(result);
 }
@@ -1422,7 +1419,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_files;
+        result = dbentry->counts.n_temp_files;
 
     PG_RETURN_INT64(result);
 }
@@ -1438,7 +1435,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_bytes;
+        result = dbentry->counts.n_temp_bytes;
 
     PG_RETURN_INT64(result);
 }
@@ -1453,7 +1450,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace);
+        result = (int64) (dbentry->counts.n_conflict_tablespace);
 
     PG_RETURN_INT64(result);
 }
@@ -1468,7 +1465,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_lock);
+        result = (int64) (dbentry->counts.n_conflict_lock);
 
     PG_RETURN_INT64(result);
 }
@@ -1483,7 +1480,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_snapshot);
+        result = (int64) (dbentry->counts.n_conflict_snapshot);
 
     PG_RETURN_INT64(result);
 }
@@ -1498,7 +1495,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_bufferpin);
+        result = (int64) (dbentry->counts.n_conflict_bufferpin);
 
     PG_RETURN_INT64(result);
 }
@@ -1513,7 +1510,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1528,11 +1525,11 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace +
-                          dbentry->n_conflict_lock +
-                          dbentry->n_conflict_snapshot +
-                          dbentry->n_conflict_bufferpin +
-                          dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_tablespace +
+                          dbentry->counts.n_conflict_lock +
+                          dbentry->counts.n_conflict_snapshot +
+                          dbentry->counts.n_conflict_bufferpin +
+                          dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1547,7 +1544,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_deadlocks);
+        result = (int64) (dbentry->counts.n_deadlocks);
 
     PG_RETURN_INT64(result);
 }
@@ -1565,7 +1562,7 @@ pg_stat_get_db_checksum_failures(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_checksum_failures);
+        result = (int64) (dbentry->counts.n_checksum_failures);
 
     PG_RETURN_INT64(result);
 }
@@ -1602,7 +1599,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_read_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_read_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1618,7 +1615,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_write_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_write_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1704,7 +1701,7 @@ pg_stat_get_slru(PG_FUNCTION_ARGS)
     MemoryContext     per_query_ctx;
     MemoryContext     oldcontext;
     int                i;
-    PgStat_SLRUStats *stats;
+    PgStat_StatSLRUEntry *stats;
 
     /* check to see if caller supports us returning a tuplestore */
     if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
@@ -1738,7 +1735,7 @@ pg_stat_get_slru(PG_FUNCTION_ARGS)
         /* for each row */
         Datum        values[PG_STAT_GET_SLRU_COLS];
         bool        nulls[PG_STAT_GET_SLRU_COLS];
-        PgStat_SLRUStats    stat = stats[i];
+        PgStat_StatSLRUEntry    *stat = &stats[i];
         char       *name;
 
         name = pgstat_slru_name(i);
@@ -1750,14 +1747,14 @@ pg_stat_get_slru(PG_FUNCTION_ARGS)
         MemSet(nulls, 0, sizeof(nulls));
 
         values[0] = PointerGetDatum(cstring_to_text(name));
-        values[1] = Int64GetDatum(stat.blocks_zeroed);
-        values[2] = Int64GetDatum(stat.blocks_hit);
-        values[3] = Int64GetDatum(stat.blocks_read);
-        values[4] = Int64GetDatum(stat.blocks_written);
-        values[5] = Int64GetDatum(stat.blocks_exists);
-        values[6] = Int64GetDatum(stat.flush);
-        values[7] = Int64GetDatum(stat.truncate);
-        values[8] = Int64GetDatum(stat.stat_reset_timestamp);
+        values[1] = Int64GetDatum(stat->blocks_zeroed);
+        values[2] = Int64GetDatum(stat->blocks_hit);
+        values[3] = Int64GetDatum(stat->blocks_read);
+        values[4] = Int64GetDatum(stat->blocks_written);
+        values[5] = Int64GetDatum(stat->blocks_exists);
+        values[6] = Int64GetDatum(stat->flush);
+        values[7] = Int64GetDatum(stat->truncate);
+        values[8] = Int64GetDatum(stat->stat_reset_timestamp);
 
         tuplestore_putvalues(tupstore, tupdesc, values, nulls);
     }
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb19644419..51748c99ad 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 6fe25c023a..4a682665d8 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -241,9 +241,6 @@ GetBackendTypeDesc(BackendType backendType)
         case B_ARCHIVER:
             backendDesc = "archiver";
             break;
-        case B_STATS_COLLECTOR:
-            backendDesc = "stats collector";
-            break;
         case B_LOGGER:
             backendDesc = "logger";
             break;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index f4247ea70d..f65d05c24c 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -631,6 +632,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1241,6 +1244,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 5bdc02fce2..fe2699bb7a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -747,8 +747,8 @@ const char *const config_group_names[] =
     gettext_noop("Statistics"),
     /* STATS_MONITORING */
     gettext_noop("Statistics / Monitoring"),
-    /* STATS_COLLECTOR */
-    gettext_noop("Statistics / Query and Index Statistics Collector"),
+    /* STATS_ACTIVITY */
+    gettext_noop("Statistics / Query and Index Activity Statistics"),
     /* AUTOVACUUM */
     gettext_noop("Autovacuum"),
     /* CLIENT_CONN */
@@ -1478,7 +1478,7 @@ static struct config_bool ConfigureNamesBool[] =
 #endif
 
     {
-        {"track_activities", PGC_SUSET, STATS_COLLECTOR,
+        {"track_activities", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects information about executing commands."),
             gettext_noop("Enables the collection of information on the currently "
                          "executing command of each session, along with "
@@ -1489,7 +1489,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_counts", PGC_SUSET, STATS_COLLECTOR,
+        {"track_counts", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects statistics on database activity."),
             NULL
         },
@@ -1498,7 +1498,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_io_timing", PGC_SUSET, STATS_COLLECTOR,
+        {"track_io_timing", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects timing statistics for database I/O activity."),
             NULL
         },
@@ -4300,7 +4300,7 @@ static struct config_string ConfigureNamesString[] =
     },
 
     {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
+        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
             gettext_noop("Writes temporary statistics files to the specified directory."),
             NULL,
             GUC_SUPERUSER_ONLY
@@ -4636,7 +4636,7 @@ static struct config_enum ConfigureNamesEnum[] =
     },
 
     {
-        {"track_functions", PGC_SUSET, STATS_COLLECTOR,
+        {"track_functions", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects function-level statistics on database activity."),
             NULL
         },
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 995b6ca155..5745ef09ad 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -574,7 +574,7 @@
 # STATISTICS
 #------------------------------------------------------------------------------
 
-# - Query and Index Statistics Collector -
+# - Query and Index Activity Statistics -
 
 #track_activities = on
 #track_counts = on
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 63381764e9..eea4d317a7 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -124,7 +124,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 619b2f9c71..23984d6e24 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -83,6 +83,8 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
 
@@ -320,7 +322,6 @@ typedef enum BackendType
     B_WAL_SENDER,
     B_WAL_WRITER,
     B_ARCHIVER,
-    B_STATS_COLLECTOR,
     B_LOGGER,
 } BackendType;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index b8041d9988..5a63840dfe 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL activity statistics facility.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -16,9 +16,11 @@
 #include "libpq/pqcomm.h"
 #include "miscadmin.h"
 #include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -42,35 +44,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_RESETSLRUCOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_SLRU,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -81,9 +54,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -114,18 +86,17 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_delta_live_tuples;
     PgStat_Counter t_delta_dead_tuples;
     PgStat_Counter t_changed_tuples;
+    bool        reset_changed_tuples;
 
     PgStat_Counter t_blocks_fetched;
     PgStat_Counter t_blocks_hit;
+
+    PgStat_Counter vacuum_count;
+    PgStat_Counter autovac_vacuum_count;
+    PgStat_Counter analyze_count;
+    PgStat_Counter autovac_analyze_count;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -159,6 +130,10 @@ typedef struct PgStat_TableStatus
     Oid            t_id;            /* table's OID */
     bool        t_shared;        /* is it a shared catalog? */
     struct PgStat_TableXactStatus *trans;    /* lowest subxact's counts */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_TableCounts t_counts;    /* event counts to be sent */
 } PgStat_TableStatus;
 
@@ -184,308 +159,32 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
-/* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
- */
-typedef struct PgStat_MsgHdr
-{
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgResetslrucounter Sent by the backend to tell the collector
- *                                to reset a SLRU counter
- * ----------
- */
-typedef struct PgStat_MsgResetslrucounter
-{
-    PgStat_MsgHdr m_hdr;
-    int            m_index;
-} PgStat_MsgResetslrucounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgSLRU            Sent by the SLRU to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgSLRU
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Counter m_index;
-    PgStat_Counter m_blocks_zeroed;
-    PgStat_Counter m_blocks_hit;
-    PgStat_Counter m_blocks_read;
-    PgStat_Counter m_blocks_written;
-    PgStat_Counter m_blocks_exists;
-    PgStat_Counter m_flush;
-    PgStat_Counter m_truncate;
-} PgStat_MsgSLRU;
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
 /* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgTempFile
+typedef struct PgStat_BgWriter
 {
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter buf_alloc;
+    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+} PgStat_BgWriter;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -517,98 +216,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgResetslrucounter msg_resetslrucounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgSLRU msg_slru;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Activity statistics data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -617,13 +226,9 @@ typedef union PgStat_Msg
 
 #define PGSTAT_FILE_FORMAT_ID    0x01A5BC9D
 
-/* ----------
- * PgStat_StatDBEntry            The collector's data per database
- * ----------
- */
-typedef struct PgStat_StatDBEntry
+
+typedef struct PgStat_StatDBCounts
 {
-    Oid            databaseid;
     PgStat_Counter n_xact_commit;
     PgStat_Counter n_xact_rollback;
     PgStat_Counter n_blocks_fetched;
@@ -633,7 +238,6 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_tuples_inserted;
     PgStat_Counter n_tuples_updated;
     PgStat_Counter n_tuples_deleted;
-    TimestampTz last_autovac_time;
     PgStat_Counter n_conflict_tablespace;
     PgStat_Counter n_conflict_lock;
     PgStat_Counter n_conflict_snapshot;
@@ -643,29 +247,72 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_temp_bytes;
     PgStat_Counter n_deadlocks;
     PgStat_Counter n_checksum_failures;
-    TimestampTz last_checksum_failure;
     PgStat_Counter n_block_read_time;    /* times in microseconds */
     PgStat_Counter n_block_write_time;
+} PgStat_StatDBCounts;
 
+/* ----------
+ * PgStat_StatDBEntry            The statistics per database
+ * ----------
+ */
+typedef struct PgStat_StatDBEntry
+{
+    Oid            databaseid;
+    TimestampTz last_autovac_time;
+    TimestampTz last_checksum_failure;
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;    /* time of db stats update */
+
+    PgStat_StatDBCounts counts;
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following fields must be last in the struct, because we don't write
+     * them out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables; /* current gen tables hash */
+    dshash_table_handle functions;    /* current gen functions hash */
 } PgStat_StatDBEntry;
 
 
+/*
+ * SLRU statistics kept in the shared stats
+ */
+typedef struct PgStat_StatSLRUEntry
+{
+    PgStat_Counter blocks_zeroed;
+    PgStat_Counter blocks_hit;
+    PgStat_Counter blocks_read;
+    PgStat_Counter blocks_written;
+    PgStat_Counter blocks_exists;
+    PgStat_Counter flush;
+    PgStat_Counter truncate;
+    TimestampTz stat_reset_timestamp;
+} PgStat_StatSLRUEntry;
+
+
 /* ----------
- * PgStat_StatTabEntry            The collector's data per table (or index)
+ * PgStat_HashMountInfo  dshash mount information
+ * ----------
+ */
+typedef struct PgStat_HashMountInfo
+{
+    HTAB       *snapshot_tables;    /* table entry snapshot */
+    HTAB       *snapshot_functions; /* function entry snapshot */
+    dshash_table *dshash_tables;    /* attached tables dshash */
+    dshash_table *dshash_functions; /* attached functions dshash */
+} PgStat_HashMountInfo;
+
+/* ----------
+ * PgStat_StatTabEntry            The statistics per table (or index)
  * ----------
  */
 typedef struct PgStat_StatTabEntry
 {
     Oid            tableid;
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
 
     PgStat_Counter numscans;
 
@@ -685,19 +332,15 @@ typedef struct PgStat_StatTabEntry
     PgStat_Counter blocks_fetched;
     PgStat_Counter blocks_hit;
 
-    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
     PgStat_Counter vacuum_count;
-    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_vacuum_count;
-    TimestampTz analyze_timestamp;    /* user initiated */
     PgStat_Counter analyze_count;
-    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_analyze_count;
 } PgStat_StatTabEntry;
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            per function stats data
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
@@ -710,9 +353,8 @@ typedef struct PgStat_StatFuncEntry
     PgStat_Counter f_self_time;
 } PgStat_StatFuncEntry;
 
-
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -728,7 +370,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -746,21 +388,6 @@ typedef struct PgStat_GlobalStats
     TimestampTz stat_reset_timestamp;
 } PgStat_GlobalStats;
 
-/*
- * SLRU statistics kept in the stats collector
- */
-typedef struct PgStat_SLRUStats
-{
-    PgStat_Counter blocks_zeroed;
-    PgStat_Counter blocks_hit;
-    PgStat_Counter blocks_read;
-    PgStat_Counter blocks_written;
-    PgStat_Counter blocks_exists;
-    PgStat_Counter flush;
-    PgStat_Counter truncate;
-    TimestampTz stat_reset_timestamp;
-} PgStat_SLRUStats;
-
 
 /* ----------
  * Backend states
@@ -809,7 +436,6 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
     WAIT_EVENT_WAL_RECEIVER_MAIN,
@@ -1055,7 +681,7 @@ typedef struct PgBackendGSSStatus
  *
  * Each live backend maintains a PgBackendStatus struct in shared memory
  * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
+ * BackendId, but that is not critical.)  Note that the stats-writing code
  * has no involvement in, or even access to, these structs.
  *
  * Each auxiliary process also maintains a PgBackendStatus struct in shared
@@ -1252,18 +878,20 @@ extern PGDLLIMPORT bool pgstat_track_counts;
 extern PGDLLIMPORT int pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used; to be removed together with the GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
 
 /*
  * SLRU statistics counters are updated directly by slru.
  */
-extern PgStat_MsgSLRU SlruStats[];
+/* extern PgStat_MsgSLRU SlruStats[]; */
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1278,29 +906,26 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
 
+/* File input/output functions  */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 extern void pgstat_reset_slru_counter(const char *);
 
@@ -1462,8 +1087,8 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                                       void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
+extern void pgstat_report_archiver(const char *xlog, bool failed);
+extern void pgstat_report_bgwriter(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1472,13 +1097,16 @@ extern void pgstat_send_bgwriter(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_snapshot(bool shared,
+                                                                Oid relid);
+extern void pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
-extern PgStat_SLRUStats *pgstat_fetch_slru(void);
+extern PgStat_StatSLRUEntry *pgstat_fetch_slru(void);
 
 extern void pgstat_count_slru_page_zeroed(SlruCtl ctl);
 extern void pgstat_count_slru_page_hit(SlruCtl ctl);
@@ -1489,5 +1117,6 @@ extern void pgstat_count_slru_flush(SlruCtl ctl);
 extern void pgstat_count_slru_truncate(SlruCtl ctl);
 extern char *pgstat_slru_name(int idx);
 extern int pgstat_slru_index(const char *name);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 8fda8e4f78..13371e8cb7 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -220,6 +220,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_SXACT,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 454c2df487..2c65231b04 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -88,7 +88,7 @@ enum config_group
     PROCESS_TITLE,
     STATS,
     STATS_MONITORING,
-    STATS_COLLECTOR,
+    STATS_ACTIVITY,
     AUTOVACUUM,
     CLIENT_CONN,
     CLIENT_CONN_STATEMENT,
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 83a15f6795..77d1572a99 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.18.2
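
A note on the "counters only" rule stated in the PgStat_FunctionCounts comment above: the struct is compared against zeroes with memcmp to decide whether there is anything to write. The following is a minimal standalone sketch of that check, not code from the patch. The names DemoCounter, DemoCounts and demo_has_pending_counts are hypothetical, and int64_t merely stands in for PgStat_Counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for PgStat_Counter; not the patch's typedef. */
typedef int64_t DemoCounter;

/*
 * Hypothetical counters-only struct.  Every member is an event counter, so
 * an all-zero image means "nothing happened since the last flush".
 */
typedef struct DemoCounts
{
    DemoCounter numcalls;
    DemoCounter total_time;
    DemoCounter self_time;
} DemoCounts;

/*
 * Return true if at least one counter is nonzero, i.e. there is something
 * to write out.  A timestamp or an OID stored in the struct would defeat
 * this test, which is why only counters may live here.
 */
static bool
demo_has_pending_counts(const DemoCounts *counts)
{
    /* Static storage duration, hence zero-initialized. */
    static const DemoCounts all_zeroes;

    assert(counts != NULL);
    return memcmp(counts, &all_zeroes, sizeof(DemoCounts)) != 0;
}
```

A caller would clear the struct with memset after each flush and call demo_has_pending_counts() before the next one, skipping the write when it returns false. The same reasoning presumably motivates keeping the timestamps of PgStat_StatDBEntry outside PgStat_StatDBCounts, though that is an inference from the struct layout rather than something the patch states.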

From a5e5f136f9a5121ec4e6251a848c02085051378d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:09 +0900
Subject: [PATCH v33 6/7] Doc part of shared-memory based stats collector.

---
 doc/src/sgml/catalogs.sgml          |   6 +-
 doc/src/sgml/config.sgml            |  29 +++---
 doc/src/sgml/high-availability.sgml |  13 +--
 doc/src/sgml/monitoring.sgml        | 133 +++++++++++++---------------
 doc/src/sgml/ref/pg_dump.sgml       |   9 +-
 5 files changed, 91 insertions(+), 99 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index ce33df9e58..20ee1d5a83 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -8161,9 +8161,9 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l
   <para>
    <xref linkend="view-table"/> lists the system views described here.
    More detailed documentation of each view follows below.
-   There are some additional views that provide access to the results of
-   the statistics collector; they are described in <xref
-   linkend="monitoring-stats-views-table"/>.
+   There are some additional views that provide access to the activity
+   statistics; they are described in
+   <xref linkend="monitoring-stats-views-table"/>.
   </para>
 
   <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a14df06292..d04b2e796c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7096,11 +7096,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
     <title>Run-time Statistics</title>
 
     <sect2 id="runtime-config-statistics-collector">
-     <title>Query and Index Statistics Collector</title>
+     <title>Query and Index Activity Statistics</title>
 
      <para>
-      These parameters control server-wide statistics collection features.
-      When statistics collection is enabled, the data that is produced can be
+      These parameters control server-wide activity statistics features.
+      When activity statistics are enabled, the data that is produced can be
       accessed via the <structname>pg_stat</structname> and
       <structname>pg_statio</structname> family of system views.
       Refer to <xref linkend="monitoring"/> for more information.
@@ -7116,14 +7116,13 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables the collection of information on the currently
-        executing command of each session, along with the time when
-        that command began execution. This parameter is on by
-        default. Note that even when enabled, this information is not
-        visible to all users, only to superusers and the user owning
-        the session being reported on, so it should not represent a
-        security risk.
-        Only superusers can change this setting.
+        Enables activity tracking on the currently executing command of
+        each session, along with the time when that command began
+        execution. This parameter is on by default. Note that even when
+        enabled, this information is not visible to all users, only to
+        superusers and the user owning the session being reported on, so it
+        should not represent a security risk.  Only superusers can change this
+        setting.
        </para>
       </listitem>
      </varlistentry>
@@ -7154,9 +7153,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables collection of statistics on database activity.
+        Enables tracking of database activity.
         This parameter is on by default, because the autovacuum
-        daemon needs the collected information.
+        daemon needs the activity information.
         Only superusers can change this setting.
        </para>
       </listitem>
@@ -8217,7 +8216,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Specifies the fraction of the total number of heap tuples counted in
-        the previous statistics collection that can be inserted without
+        the previous activity statistics that can be inserted without
         incurring an index scan at the <command>VACUUM</command> cleanup stage.
         This setting currently applies to B-tree indexes only.
        </para>
@@ -8230,7 +8229,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         Index statistics are considered to be stale if the number of newly
         inserted tuples exceeds the <varname>vacuum_cleanup_index_scale_factor</varname>
         fraction of the total number of heap tuples detected by the previous
-        statistics collection. The total number of heap tuples is stored in
+        activity statistics. The total number of heap tuples is stored in
         the index meta-page. Note that the meta-page does not include this data
         until <command>VACUUM</command> finds no dead tuples, so B-tree index
         scan at the cleanup stage can only be skipped if the second and
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 52e47379cc..7c8bac2bb2 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2360,12 +2360,13 @@ LOG:  database system is ready to accept read only connections
    </para>
 
    <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
-    index usage, etc., will be recorded normally on the standby. Replayed
-    actions will not duplicate their effects on primary, so replaying an
-    insert will not increment the Inserts column of pg_stat_user_tables.
-    The stats file is deleted at the start of recovery, so stats from primary
-    and standby will differ; this is considered a feature, not a bug.
+    Activity statistics are collected during recovery. All scans, reads,
+    blocks, index usage, etc., will be recorded normally on the
+    standby. Replayed actions will not duplicate their effects on primary, so
+    replaying an insert will not increment the Inserts column of
+    pg_stat_user_tables.  The activity statistics are reset at the start of
+    recovery, so stats from primary and standby will differ; this is
+    considered a feature, not a bug.
    </para>
 
    <para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 6562cc400b..6653de82e8 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -22,7 +22,7 @@
   <para>
    Several tools are available for monitoring database activity and
    analyzing performance.  Most of this chapter is devoted to describing
-   <productname>PostgreSQL</productname>'s statistics collector,
+   <productname>PostgreSQL</productname>'s activity statistics,
    but one should not neglect regular Unix monitoring programs such as
    <command>ps</command>, <command>top</command>, <command>iostat</command>, and <command>vmstat</command>.
    Also, once one has identified a
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in
transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    master server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   master process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   master process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
@@ -130,20 +128,21 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
  </sect1>
 
  <sect1 id="monitoring-stats">
-  <title>The Statistics Collector</title>
+  <title>The Activity Statistics</title>
 
   <indexterm zone="monitoring-stats">
    <primary>statistics</primary>
   </indexterm>
 
   <para>
-   <productname>PostgreSQL</productname>'s <firstterm>statistics collector</firstterm>
-   is a subsystem that supports collection and reporting of information about
-   server activity.  Presently, the collector can count accesses to tables
-   and indexes in both disk-block and individual-row terms.  It also tracks
-   the total number of rows in each table, and information about vacuum and
-   analyze actions for each table.  It can also count calls to user-defined
-   functions and the total time spent in each one.
+   <productname>PostgreSQL</productname>'s <firstterm>activity
+   statistics</firstterm> is a subsystem that supports tracking and reporting
+   of information about server activity.  Presently, this subsystem counts
+   accesses to tables and indexes in both disk-block and
+   individual-row terms.  It also tracks the total number of rows in each
+   table, and information about vacuum and analyze actions for each table.  It
+   can also track calls to user-defined functions and the total time spent in
+   each one.
   </para>
 
   <para>
@@ -151,15 +150,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    information about exactly what is going on in the system right now, such as
    the exact command currently being executed by other server processes, and
    which other connections exist in the system.  This facility is independent
-   of the collector process.
+   of the activity statistics.
   </para>
 
  <sect2 id="monitoring-stats-setup">
-  <title>Statistics Collection Configuration</title>
+  <title>Activity Statistics Configuration</title>
 
   <para>
-   Since collection of statistics adds some overhead to query execution,
-   the system can be configured to collect or not collect information.
+   Since tracking for the activity statistics adds some overhead to query
+   execution, the system can be configured to track or not track activity.
    This is controlled by configuration parameters that are normally set in
    <filename>postgresql.conf</filename>.  (See <xref linkend="runtime-config"/> for
    details about setting configuration parameters.)
@@ -172,7 +171,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The parameter <xref linkend="guc-track-counts"/> controls whether
-   statistics are collected about table and index accesses.
+   table and index accesses are tracked.
   </para>
 
   <para>
@@ -196,18 +195,12 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
   </para>
 
   <para>
-   The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
-   When the server shuts down cleanly, a permanent copy of the statistics
-   data is stored in the <filename>pg_stat</filename> subdirectory, so that
-   statistics can be retained across server restarts.  When recovery is
-   performed at server start (e.g. after immediate shutdown, server crash,
-   and point-in-time recovery), all statistics counters are reset.
+   The activity statistics are kept in shared memory. When the server shuts
+   down cleanly, a permanent copy of the statistics data is stored in
+   the <filename>pg_stat</filename> subdirectory, so that statistics can be
+   retained across server restarts.  When recovery is performed at server
+   start (e.g. after immediate shutdown, server crash, and point-in-time
+   recovery), all statistics counters are reset.
   </para>
 
  </sect2>
@@ -220,48 +213,46 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    linkend="monitoring-stats-dynamic-views-table"/>, are available to show
    the current state of the system. There are also several other
    views, listed in <xref
-   linkend="monitoring-stats-views-table"/>, available to show the results
-   of statistics collection.  Alternatively, one can
-   build custom views using the underlying statistics functions, as discussed
-   in <xref linkend="monitoring-stats-functions"/>.
+   linkend="monitoring-stats-views-table"/>, available to show the activity
+   statistics.  Alternatively, one can build custom views using the underlying
+   statistics functions, as discussed in
+   <xref linkend="monitoring-stats-functions"/>.
   </para>
 
   <para>
-   When using the statistics to monitor collected data, it is important
-   to realize that the information does not update instantaneously.
-   Each individual server process transmits new statistical counts to
-   the collector just before going idle; so a query or transaction still in
-   progress does not affect the displayed totals.  Also, the collector itself
-   emits a new report at most once per <varname>PGSTAT_STAT_INTERVAL</varname>
-   milliseconds (500 ms unless altered while building the server).  So the
-   displayed information lags behind actual activity.  However, current-query
-   information collected by <varname>track_activities</varname> is
-   always up-to-date.
+   When using the activity statistics, it is important to realize that the
+   information does not update instantaneously.  Each individual server
+   process writes out new statistical counts just before going idle, but no
+   more often than once per <varname>PGSTAT_STAT_INTERVAL</varname>
+   milliseconds (1 second unless altered while building the server); so a
+   query or transaction still in progress does not affect the displayed
+   totals.  However, current-query information tracked by
+   <varname>track_activities</varname> is always up-to-date.
   </para>
 
   <para>
    Another important point is that when a server process is asked to display
-   any of these statistics, it first fetches the most recent report emitted by
-   the collector process and then continues to use this snapshot for all
-   statistical views and functions until the end of its current transaction.
-   So the statistics will show static information as long as you continue the
-   current transaction.  Similarly, information about the current queries of
-   all sessions is collected when any such information is first requested
-   within a transaction, and the same information will be displayed throughout
-   the transaction.
-   This is a feature, not a bug, because it allows you to perform several
-   queries on the statistics and correlate the results without worrying that
-   the numbers are changing underneath you.  But if you want to see new
-   results with each query, be sure to do the queries outside any transaction
-   block.  Alternatively, you can invoke
+   any of these statistics, it first reads the current statistics and then
+   continues to use this snapshot for all statistical views and functions
+   until the end of its current transaction.  So the statistics will show
+   static information as long as you continue the current transaction.
+   Similarly, information about the current queries of all sessions is tracked
+   when any such information is first requested within a transaction, and the
+   same information will be displayed throughout the transaction.  This is a
+   feature, not a bug, because it allows you to perform several queries on the
+   statistics and correlate the results without worrying that the numbers are
+   changing underneath you.  But if you want to see new results with each
+   query, be sure to do the queries outside any transaction block.
+   Alternatively, you can invoke
    <function>pg_stat_clear_snapshot</function>(), which will discard the
    current transaction's statistics snapshot (if any).  The next use of
    statistical information will cause a new snapshot to be fetched.
   </para>
-
+  
   <para>
-   A transaction can also see its own statistics (as yet untransmitted to the
-   collector) in the views <structname>pg_stat_xact_all_tables</structname>,
+   A transaction can also see its own statistics (as yet unwritten to the
+   server-wide activity statistics) in the
+   views <structname>pg_stat_xact_all_tables</structname>,
    <structname>pg_stat_xact_sys_tables</structname>,
    <structname>pg_stat_xact_user_tables</structname>, and
    <structname>pg_stat_xact_user_functions</structname>.  These numbers do not act as
@@ -603,7 +594,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    kernel's I/O cache, and might therefore still be fetched without
    requiring a physical read. Users interested in obtaining more
    detailed information on <productname>PostgreSQL</productname> I/O behavior are
-   advised to use the <productname>PostgreSQL</productname> statistics collector
+   advised to use the <productname>PostgreSQL</productname> activity statistics
    in combination with operating system utilities that allow insight
    into the kernel's handling of I/O.
   </para>
@@ -921,7 +912,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
       <tbody>
        <row>
-        <entry morerows="64"><literal>LWLock</literal></entry>
+        <entry morerows="65"><literal>LWLock</literal></entry>
         <entry><literal>ShmemIndexLock</literal></entry>
         <entry>Waiting to find or allocate space in shared memory.</entry>
        </row>
@@ -1204,6 +1195,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting to allocate or exchange a chunk of memory or update
          counters during Parallel Hash plan execution.</entry>
         </row>
+        <row>
+         <entry><literal>activity_statistics</literal></entry>
+         <entry>Waiting to write out activity statistics to shared memory
+         during transaction end or idle time.</entry>
+        </row>
         <row>
          <entry morerows="9"><literal>Lock</literal></entry>
          <entry><literal>relation</literal></entry>
@@ -1251,7 +1247,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting to acquire a pin on a buffer.</entry>
         </row>
         <row>
-         <entry morerows="12"><literal>Activity</literal></entry>
+         <entry morerows="11"><literal>Activity</literal></entry>
          <entry><literal>ArchiverMain</literal></entry>
          <entry>Waiting in main loop of the archiver process.</entry>
         </row>
@@ -1279,10 +1275,6 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry><literal>LogicalLauncherMain</literal></entry>
          <entry>Waiting in main loop of logical launcher process.</entry>
         </row>
-        <row>
-         <entry><literal>PgStatMain</literal></entry>
-         <entry>Waiting in main loop of the statistics collector process.</entry>
-        </row>
         <row>
          <entry><literal>RecoveryWalStream</literal></entry>
          <entry>Waiting for WAL from a stream at recovery.</entry>
@@ -4282,9 +4274,10 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
      <entry><literal>performing final cleanup</literal></entry>
      <entry>
        <command>VACUUM</command> is performing final cleanup.  During this phase,
-       <command>VACUUM</command> will vacuum the free space map, update statistics
-       in <literal>pg_class</literal>, and report statistics to the statistics
-       collector.  When this phase is completed, <command>VACUUM</command> will end.
+       <command>VACUUM</command> will vacuum the free space map, update
+       statistics in <literal>pg_class</literal>, and update the system-wide
+       activity statistics.  When this phase is completed,
+       <command>VACUUM</command> will end.
      </entry>
     </row>
    </tbody>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index a9bc397165..a9289f84b0 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1289,11 +1289,10 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The database activity of <application>pg_dump</application> is
-   normally collected by the statistics collector.  If this is
-   undesirable, you can set parameter <varname>track_counts</varname>
-   to false via <envar>PGOPTIONS</envar> or the <literal>ALTER
-   USER</literal> command.
+   The database activity of <application>pg_dump</application> is normally
+   collected.  If this is undesirable, you can set
+   parameter <varname>track_counts</varname> to false
+   via <envar>PGOPTIONS</envar> or the <literal>ALTER USER</literal> command.
   </para>
 
  </refsect1>
-- 
2.18.2
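
The doc change above says that each server process writes out its counts
just before going idle, but no more often than once per
PGSTAT_STAT_INTERVAL milliseconds. As a rough illustration only, that
rate limit can be modelled as below. This is a toy sketch, not code from
the patch: FlushState, maybe_flush and STAT_INTERVAL_MS are hypothetical
names, and the 1000 ms value is taken from the "1 second" in the doc text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed interval, mirroring the "1 second" mentioned in the doc text. */
#define STAT_INTERVAL_MS 1000

/* Hypothetical per-process state: local pending counts plus a shared total. */
typedef struct FlushState
{
    int64_t     last_flush_ms;  /* time of the last successful flush */
    bool        ever_flushed;   /* false until the first flush happens */
    int64_t     pending;        /* counts not yet written out */
    int64_t     shared_total;   /* stands in for the shared-memory counter */
} FlushState;

/*
 * Called when the process is about to go idle.  Writes pending counts to
 * the shared total unless a flush already happened within the interval.
 * Returns true if a flush was performed.
 */
static bool
maybe_flush(FlushState *st, int64_t now_ms)
{
    if (st->pending == 0)
        return false;           /* nothing to write out */

    if (st->ever_flushed && now_ms - st->last_flush_ms < STAT_INTERVAL_MS)
        return false;           /* too soon; keep the counts local */

    st->shared_total += st->pending;
    st->pending = 0;
    st->last_flush_ms = now_ms;
    st->ever_flushed = true;
    return true;
}
```

In this model a reader of shared_total sees nothing from work that is
still pending, which is the lag the documentation paragraph describes.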

From 9bab43b81f6d95073ce8ae393d0b94c1a5ab2175 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 17:00:44 +0900
Subject: [PATCH v33 7/7] Remove the GUC stats_temp_directory

The GUC used to specify the directory to store temporary statistics
files. It is no longer needed by the stats collector but still used by
the programs in bin and contrib, and maybe other extensions. Thus this
patch removes the GUC but some backing variables and macro definitions
are left alone for backward compatibility.
---
 doc/src/sgml/backup.sgml                      |  2 -
 doc/src/sgml/config.sgml                      | 19 ---------
 doc/src/sgml/storage.sgml                     |  3 +-
 src/backend/postmaster/pgstat.c               | 13 +++---
 src/backend/replication/basebackup.c          | 13 ++----
 src/backend/utils/misc/guc.c                  | 41 -------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/pgstat.h                          |  5 ++-
 src/test/perl/PostgresNode.pm                 |  4 --
 9 files changed, 13 insertions(+), 88 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index bdc9026c62..2885540362 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1146,8 +1146,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d04b2e796c..77ffa6f678 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7208,25 +7208,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 3234adb639..559f75fb54 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -122,8 +122,7 @@ Item
 
 <row>
  <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
+ <entry>Subdirectory containing ephemeral files for extensions</entry>
 </row>
 
 <row>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 2a0cde993f..505ea8eb14 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -95,15 +95,12 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
+/*
+ * This used to be a GUC variable and is no longer used in this file.  It is
+ * kept, fixed at its default value, only for backward compatibility with
+ * extensions.
  */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+char       *pgstat_stat_directory = PG_STAT_TMP_DIR;
 
 typedef struct StatsShmemStruct
 {
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 984719c166..a71577e302 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -269,7 +269,6 @@ perform_base_backup(basebackup_options *opt)
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
     backup_manifest_info manifest;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
     backup_total = 0;
@@ -292,8 +291,6 @@ perform_base_backup(basebackup_options *opt)
     Assert(CurrentResourceOwner == NULL);
     CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -327,13 +324,9 @@ perform_base_backup(basebackup_options *opt)
          * Calculate the relative path of temporary statistics directory in
          * order to skip the files which are located in that directory later.
          */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
+
+        Assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
+        statrelpath = psprintf("./%s", PG_STAT_TMP_DIR);
 
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index fe2699bb7a..ab75fbf2cb 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -198,7 +198,6 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -4299,17 +4298,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11586,35 +11574,6 @@ check_maintenance_io_concurrency(int *newval, void **extra, GucSource source)
     return true;
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5745ef09ad..205a823191 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -581,7 +581,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5a63840dfe..b8d2f1bd32 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -33,7 +33,10 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
+/*
+ * This used to be the directory to store temporary statistics data in but is
+ * no longer used. Defined here for backward compatibility.
+ */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
 
 /* Values for track_functions GUC variable --- order is significant! */
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 1d5450758e..28b39f6b2a 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -455,10 +455,6 @@ sub init
     print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
       if defined $ENV{TEMP_CONFIG};
 
-    # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
-    # concurrently must not share a stats_temp_directory.
-    print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
     if ($params{allows_streaming})
     {
         if ($params{allows_streaming} eq "logical")
-- 
2.18.2


Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
At Tue, 07 Apr 2020 16:38:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Fri, 03 Apr 2020 17:31:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > Conflicted with the commit 28cac71bd3 - SLRU stats.
> > 
> > Rebased and fixed some issues.
> > - A PoC (or a rush work) of refactoring of SLRU stats.
> > 
> >   I tried to combine it into the global stats hash, but the SLRU
> >   report functions are called within critical sections so memory
> >   allocation fails. The current pgstat module removes the local entry
> >   at successful flushing out to shared stats, so allocation at the
> >   first report is inevitable.  In the attached version it is handled
> >   the same way as global stats.  I will keep looking for a way to
> >   combine it into the global stats hash.
> 
> I didn't find a way to consolidate into the general local stats
> hash. The hash size could be large and the chance of allocation
> failure is larger than other places where in-critical-section memory
> allocation is allowed. Instead, in the attached, I separated
> shared-SLRU lock from StatsLock and added logic to avoid useless
> scans of the array.

Maybe 29c3e2dd5a and recent doc change hit this. Rebased.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From b6ed9aac61025d967719208f17d103e351e48cd6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Mon, 16 Mar 2020 17:15:35 +0900
Subject: [PATCH v33 1/7] Use standard crash handler in archiver.

Commit 8e19a82640 changed the SIGQUIT handler of almost all processes
not to run atexit callbacks, for safety. The archiver process should
behave the same way for the same reason. The exit status changes from 1
to 2, but that makes no behavioral difference.
---
 src/backend/postmaster/pgarch.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 01ffd6513c..37be0e2bbb 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -96,7 +96,6 @@ static pid_t pgarch_forkexec(void);
 #endif
 
 NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_exit(SIGNAL_ARGS);
 static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
@@ -229,7 +228,7 @@ PgArchiverMain(int argc, char *argv[])
     pqsignal(SIGHUP, SignalHandlerForConfigReload);
     pqsignal(SIGINT, SIG_IGN);
     pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
-    pqsignal(SIGQUIT, pgarch_exit);
+    pqsignal(SIGQUIT, SignalHandlerForCrashExit);
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
     pqsignal(SIGUSR1, pgarch_waken);
@@ -246,14 +245,6 @@ PgArchiverMain(int argc, char *argv[])
     exit(0);
 }
 
-/* SIGQUIT signal handler for archiver process */
-static void
-pgarch_exit(SIGNAL_ARGS)
-{
-    /* SIGQUIT means curl up and die ... */
-    exit(1);
-}
-
 /* SIGUSR1 signal handler for archiver process */
 static void
 pgarch_waken(SIGNAL_ARGS)
-- 
2.18.2
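
The difference this patch relies on is that exit() runs atexit callbacks
while _exit() does not. The following is a minimal standalone sketch of
that behavior on a POSIX system, not PostgreSQL code: crash_exit_handler
only imitates what a crash-style SIGQUIT handler does, and all names here
(run_crash_exit_demo, report_atexit_ran, callback_fd) are made up for the
illustration.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write end of the pipe the child uses to report that atexit ran. */
static int  callback_fd = -1;

/* atexit callback: writes one byte so the parent can tell that it ran. */
static void
report_atexit_ran(void)
{
    char        c = 'x';
    ssize_t     rc = write(callback_fd, &c, 1);

    (void) rc;
}

/* Crash-style handler: leave at once, without running atexit callbacks. */
static void
crash_exit_handler(int signo)
{
    (void) signo;
    _exit(2);
}

/*
 * Fork a child that registers an atexit callback, installs the handler and
 * sends itself SIGQUIT.  Returns the child's exit status (or -1 on error)
 * and stores in *callback_ran whether the atexit callback wrote to the pipe.
 */
static int
run_crash_exit_demo(int *callback_ran)
{
    int         fds[2];
    int         status;
    pid_t       pid;
    char        c;

    if (pipe(fds) != 0)
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0)
    {
        /* child */
        close(fds[0]);
        callback_fd = fds[1];
        atexit(report_atexit_ran);
        signal(SIGQUIT, crash_exit_handler);
        raise(SIGQUIT);
        _exit(99);              /* only reached if the handler did not run */
    }

    /* parent: read returns 0 (EOF) if the child exited without writing */
    close(fds[1]);
    *callback_ran = (read(fds[0], &c, 1) == 1);
    close(fds[0]);

    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With the handler calling _exit(2), the parent should observe exit status 2
and no byte from the atexit callback. Replacing _exit(2) with exit(1), as
in the removed pgarch_exit(), would let the callback run.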

From f9e7b5ea34b12d4cbaa9b892f703337dd306527c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:03 +0900
Subject: [PATCH v33 2/7] sequential scan for dshash

Dshash did not allow scanning all entries sequentially. This adds that
functionality. The interface is similar to, but slightly different from,
both dynahash and the simple dshash search functions. One of the most
significant differences is that the sequential scan interface of dshash
always needs a call to dshash_seq_term when a scan ends. Another is
locking. Dshash holds a partition lock when returning an entry;
dshash_seq_next() also holds the lock when returning an entry, but
callers must not release it, since the lock is needed to continue the
scan. The seqscan interface allows entry deletion during a scan. In-scan
deletion should be performed with dshash_delete_current().
---
 src/backend/lib/dshash.c | 149 ++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  21 ++++++
 2 files changed, 169 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 78ccf03217..1ef093e2e9 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -127,6 +127,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +157,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -324,7 +332,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -592,6 +600,145 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should always be called when a scan finishes.
+ * The caller may delete returned elements in the middle of a scan by using
+ * dshash_delete_current(). exclusive must be true to delete elements.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool exclusive)
+{
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->exclusive = exclusive;
+}
+
+/*
+ * Returns the next element.
+ *
+ * The returned element is locked and the caller must not release the lock
+ * explicitly.  A later dshash_seq_next() or dshash_seq_term() releases it.
+ */
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert(status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+        LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                      status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        status->curpartition = partition;
+
+        /* resize doesn't happen from now until seq scan ends */
+        status->nbuckets =
+            NUM_BUCKETS(status->hash_table->control->size_log2);
+        ensure_valid_bucket_pointers(status->hash_table);
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        int next_partition;
+
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            return NULL;
+        }
+
+        /* Check whether we are moving to the next partition */
+        next_partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+
+        if (status->curpartition != next_partition)
+        {
+            /*
+             * Move to the next partition. Lock the next partition before
+             * releasing the current one, not in the reverse order, to avoid
+             * concurrent resizing.  Avoid deadlock by taking locks in the
+             * same order as resize().
+             */
+            LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                         next_partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                         status->curpartition));
+            status->curpartition = next_partition;
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * The caller may delete the item. Store the next item in case of deletion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+/*
+ * Terminates the seqscan and releases all locks.
+ *
+ * Should always be called when finishing or exiting a seqscan.
+ */
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+
+    if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+/* Remove the current entry during a seq scan. */
+void
+dshash_delete_current(dshash_seq_status *status)
+{
+    dshash_table       *hash_table    = status->hash_table;
+    dshash_table_item  *item        = status->curitem;
+    size_t                partition    = PARTITION_FOR_HASH(item->hash);
+
+    Assert(status->exclusive);
+    Assert(hash_table->control->magic == DSHASH_MAGIC);
+    Assert(hash_table->find_locked);
+    Assert(hash_table->find_exclusively_locked);
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition),
+                                LW_EXCLUSIVE));
+
+    delete_item(hash_table, item);
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b86df68e77..0ca9514021 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,21 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state. The details are exposed so that users know the
+ * storage size, but callers should treat it as an opaque type.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;    /* dshash table working on */
+    int                    curbucket;    /* bucket number we are at */
+    int                    nbuckets;    /* total number of buckets in the dshash */
+    dshash_table_item  *curitem;    /* item we are currently at */
+    dsa_pointer            pnextitem;    /* dsa-pointer to the next item */
+    int                    curpartition;    /* partition number we are at */
+    bool                exclusive;    /* locking mode */
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
                                    const dshash_parameters *params,
@@ -80,6 +95,12 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
+extern void dshash_delete_current(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.18.2
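
The sequential scan in the patch above walks buckets in index order and
hands over the partition lock whenever the bucket index crosses into
another partition. The standalone sketch below copies the bucket and
partition macros from the patch and checks that walking buckets in order
visits partitions in ascending order, which is what allows locks to be
taken in a fixed order. DSHASH_NUM_PARTITIONS_LOG2 is assumed to be 7 here
(check dshash.c for the real value), count_partition_switches is a
made-up helper, and size_log2 must not be smaller than that constant.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption for this sketch: 2^7 = 128 partitions. */
#define DSHASH_NUM_PARTITIONS_LOG2 7

/* Macros as they appear in the patch. */
#define NUM_SPLITS(size_log2)                    \
    (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
#define NUM_BUCKETS(size_log2)        \
    (((size_t) 1) << (size_log2))
#define BUCKETS_PER_PARTITION(size_log2)        \
    (((size_t) 1) << NUM_SPLITS(size_log2))
#define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
    ((partition) << NUM_SPLITS(size_log2))
#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)    \
    ((bucket_idx) >> NUM_SPLITS(size_log2))

/*
 * Walk all buckets in index order, as the sequential scan does, and count
 * how many times the partition changes.  Each change corresponds to one
 * "lock next, release current" hand-over in dshash_seq_next().  Returns -1
 * if the partition number ever decreases.
 */
static int
count_partition_switches(int size_log2)
{
    size_t      nbuckets = NUM_BUCKETS(size_log2);
    size_t      curpartition = PARTITION_FOR_BUCKET_INDEX((size_t) 0, size_log2);
    int         switches = 0;

    for (size_t bucket = 1; bucket < nbuckets; bucket++)
    {
        size_t      next = PARTITION_FOR_BUCKET_INDEX(bucket, size_log2);

        if (next < curpartition)
            return -1;          /* would break the lock ordering */
        if (next != curpartition)
        {
            switches++;
            curpartition = next;
        }
    }
    return switches;
}
```

For a table with size_log2 = 9 there are 512 buckets, 4 per partition, so a
full scan makes 127 hand-overs regardless of how many entries it returns.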

From 7a8251991ff5109714a9b273baeeccc23939cceb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:57 +0900
Subject: [PATCH v33 3/7] Add conditional lock feature to dshash

Dshash currently waits for locks unconditionally. That is inconvenient
when we want to avoid being blocked by other processes. This commit
adds alternatives to dshash_find and dshash_find_or_insert that
return immediately on lock failure.
---
 src/backend/lib/dshash.c | 98 +++++++++++++++++++++-------------------
 src/include/lib/dshash.h |  3 ++
 2 files changed, 55 insertions(+), 46 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 1ef093e2e9..d7ee6de11e 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -383,6 +383,10 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  * the caller must take care to ensure that the entry is not left corrupted.
  * The lock mode is either shared or exclusive depending on 'exclusive'.
  *
+ * If found is not NULL, *found is set to true if the key is found in the hash
+ * table. If the key is not found, *found is set to false and a pointer to a
+ * newly created entry is returned.
+ *
  * The caller must not lock a lock already.
  *
  * Note that the lock held is in fact an LWLock, so interrupts will be held on
@@ -392,36 +396,7 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
 {
-    dshash_hash hash;
-    size_t        partition;
-    dshash_table_item *item;
-
-    hash = hash_key(hash_table, key);
-    partition = PARTITION_FOR_HASH(hash);
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
-
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
-    ensure_valid_bucket_pointers(hash_table);
-
-    /* Search the active bucket. */
-    item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
-
-    if (!item)
-    {
-        /* Not found. */
-        LWLockRelease(PARTITION_LOCK(hash_table, partition));
-        return NULL;
-    }
-    else
-    {
-        /* The caller will free the lock by calling dshash_release_lock. */
-        hash_table->find_locked = true;
-        hash_table->find_exclusively_locked = exclusive;
-        return ENTRY_FROM_ITEM(item);
-    }
+    return dshash_find_extended(hash_table, key, exclusive, false, false, NULL);
 }
 
 /*
@@ -439,31 +414,60 @@ dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
 {
-    dshash_hash hash;
-    size_t        partition_index;
-    dshash_partition *partition;
+    return dshash_find_extended(hash_table, key, true, false, true, found);
+}
+
+
+/*
+ * Find the key in the hash table.
+ *
+ * "exclusive" is the lock mode in which the partition for the returned item
+ * is locked.  If "nowait" is true, the function returns NULL immediately if
+ * the required lock cannot be acquired.  "insert" selects insert mode: if the
+ * key is not found, a new entry is inserted and *found is set to false;
+ * otherwise *found is set to true.  "found" must be non-null in this mode.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool insert, bool *found)
+{
+    dshash_hash hash = hash_key(hash_table, key);
+    size_t        partidx = PARTITION_FOR_HASH(hash);
+    dshash_partition *partition = &hash_table->control->partitions[partidx];
+    LWLockMode  lockmode = exclusive ? LW_EXCLUSIVE : LW_SHARED;
     dshash_table_item *item;
 
-    hash = hash_key(hash_table, key);
-    partition_index = PARTITION_FOR_HASH(hash);
-    partition = &hash_table->control->partitions[partition_index];
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
+    /* must be exclusive when insert allowed */
+    Assert(!insert || (exclusive && found != NULL));
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (!nowait)
+        LWLockAcquire(PARTITION_LOCK(hash_table, partidx), lockmode);
+    else if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partidx),
+                                       lockmode))
+        return NULL;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
     item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
 
     if (item)
-        *found = true;
+    {
+        if (found)
+            *found = true;
+    }
     else
     {
-        *found = false;
+        if (found)
+            *found = false;
+
+        if (!insert)
+        {
+            /* The caller didn't ask us to add a new entry. */
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+            return NULL;
+        }
 
         /* Check if we are getting too full. */
         if (partition->count > MAX_COUNT_PER_PARTITION(hash_table))
@@ -479,7 +483,8 @@ restart:
              * Give up our existing lock first, because resizing needs to
              * reacquire all the locks in the right order to avoid deadlocks.
              */
-            LWLockRelease(PARTITION_LOCK(hash_table, partition_index));
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+
             resize(hash_table, hash_table->size_log2 + 1);
 
             goto restart;
@@ -493,12 +498,13 @@ restart:
         ++partition->count;
     }
 
-    /* The caller must release the lock with dshash_release_lock. */
+    /* The caller will free the lock by calling dshash_release_lock. */
     hash_table->find_locked = true;
-    hash_table->find_exclusively_locked = true;
+    hash_table->find_exclusively_locked = exclusive;
     return ENTRY_FROM_ITEM(item);
 }
 
+
 /*
  * Remove an entry by key.  Returns true if the key was found and the
  * corresponding entry was removed.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 0ca9514021..5f7a60febd 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -91,6 +91,9 @@ extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                                    const void *key, bool *found);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+                                  bool exclusive, bool nowait, bool insert,
+                                  bool *found);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.18.2

From 5a8e0b3069fdc4d44b761e5fa0e81cb45e6c0366 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:59:38 +0900
Subject: [PATCH v33 4/7] Make archiver process an auxiliary process

This is a preliminary patch for shared-memory based stats collector.

The archiver process must be an auxiliary process because it accesses
shared memory once stats data is moved into shared memory. Make the
process an auxiliary process so that it keeps working.
---
 src/backend/access/transam/xlogarchive.c |   6 +-
 src/backend/bootstrap/bootstrap.c        |  22 ++--
 src/backend/postmaster/pgarch.c          | 130 +++--------------------
 src/backend/postmaster/postmaster.c      |  50 +++++----
 src/backend/storage/lmgr/proc.c          |   1 +
 src/include/access/xlog.h                |   3 +
 src/include/access/xlogarchive.h         |   1 +
 src/include/miscadmin.h                  |   2 +
 src/include/postmaster/pgarch.h          |   4 +-
 src/include/storage/pmsignal.h           |   1 -
 src/include/storage/proc.h               |   3 +
 11 files changed, 69 insertions(+), 154 deletions(-)
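
With this change a backend wakes the archiver by setting a latch that
the archiver advertises in shared memory, instead of asking the
postmaster to relay a signal. The pattern can be sketched as below.
This is a single-process illustration only: the model_* types and
functions are hypothetical and stand in for Latch, PROC_HDR and
SetLatch, and a plain bool replaces the real latch machinery.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a latch: just a "set" flag. */
typedef struct model_latch
{
    bool        is_set;
} model_latch;

/* Hypothetical stand-in for the shared header holding the advertised latch. */
typedef struct model_proc_global
{
    model_latch *archiver_latch;    /* NULL until the archiver advertises it */
} model_proc_global;

/*
 * Notifier side: wake the archiver only if it has advertised a latch.
 * Returns true if a wakeup was delivered.
 */
static bool
model_notify_archiver(model_proc_global *global)
{
    if (global->archiver_latch == NULL)
        return false;
    global->archiver_latch->is_set = true;
    return true;
}

/*
 * Archiver side: report whether a wakeup is pending and reset the latch
 * so that the next notification can be observed.
 */
static bool
model_consume_wakeup(model_latch *latch)
{
    bool        was_set;

    assert(latch != NULL);
    was_set = latch->is_set;
    latch->is_set = false;
    return was_set;
}
```

Checking the pointer before setting the latch mirrors the NULL test
added to XLogArchiveNotify: a notification sent before the archiver has
started is simply dropped.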

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index cdd586fcfb..ee3444284b 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -29,7 +29,9 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/latch.h"
 #include "storage/pmsignal.h"
+#include "storage/proc.h"
 
 /*
  * Attempt to retrieve the specified file from off-line archival storage.
@@ -489,8 +491,8 @@ XLogArchiveNotify(const char *xlog)
     }
 
     /* Notify archiver that it's got something to do */
-    if (IsUnderPostmaster)
-        SendPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER);
+    if (IsUnderPostmaster && ProcGlobal->archiverLatch)
+        SetLatch(ProcGlobal->archiverLatch);
 }
 
 /*
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 5480a024e0..d398ce6f03 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -319,6 +319,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
         case StartupProcess:
             MyBackendType = B_STARTUP;
             break;
+        case ArchiverProcess:
+            MyBackendType = B_ARCHIVER;
+            break;
         case BgWriterProcess:
             MyBackendType = B_BG_WRITER;
             break;
@@ -439,30 +442,29 @@ AuxiliaryProcessMain(int argc, char *argv[])
             proc_exit(1);        /* should never return */
 
         case StartupProcess:
-            /* don't set signals, startup process has its own agenda */
             StartupProcessMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
+
+        case ArchiverProcess:
+            PgArchiverMain();
+            proc_exit(1);
 
         case BgWriterProcess:
-            /* don't set signals, bgwriter has its own agenda */
             BackgroundWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case CheckpointerProcess:
-            /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalWriterProcess:
-            /* don't set signals, walwriter has its own agenda */
             InitXLOGAccess();
             WalWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalReceiverProcess:
-            /* don't set signals, walreceiver has its own agenda */
             WalReceiverMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         default:
             elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 37be0e2bbb..063d1323ea 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -48,6 +48,7 @@
 #include "storage/latch.h"
 #include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
+#include "storage/procsignal.h"
 #include "utils/guc.h"
 #include "utils/ps_status.h"
 
@@ -78,13 +79,11 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
-static volatile sig_atomic_t wakened = false;
 static volatile sig_atomic_t ready_to_stop = false;
 
 /* ----------
@@ -95,8 +94,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
 static void pgarch_ArchiverCopyLoop(void);
@@ -110,75 +107,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -212,14 +140,9 @@ pgarch_forkexec(void)
 #endif                            /* EXEC_BACKEND */
 
 
-/*
- * PgArchiverMain
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
- *    since we don't use 'em, it hardly matters...
- */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+/* Main entry point for archiver process */
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -231,33 +154,26 @@ PgArchiverMain(int argc, char *argv[])
     pqsignal(SIGQUIT, SignalHandlerForCrashExit);
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, pgarch_waken);
+    pqsignal(SIGUSR1, procsignal_sigusr1_handler);
     pqsignal(SIGUSR2, pgarch_waken_stop);
+
     /* Reset some signals that are accepted by postmaster but not here */
     pqsignal(SIGCHLD, SIG_DFL);
+
+    /* Unblock signals (they were blocked when the postmaster forked us) */
     PG_SETMASK(&UnBlockSig);
 
-    MyBackendType = B_ARCHIVER;
-    init_ps_display(NULL);
+    /*
+     * Advertise our latch that backends can use to wake us up while we're
+     * sleeping.
+     */
+    ProcGlobal->archiverLatch = &MyProc->procLatch;
 
     pgarch_MainLoop();
 
     exit(0);
 }
 
-/* SIGUSR1 signal handler for archiver process */
-static void
-pgarch_waken(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    /* set flag that there is work to be done */
-    wakened = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
 /* SIGUSR2 signal handler for archiver process */
 static void
 pgarch_waken_stop(SIGNAL_ARGS)
@@ -282,14 +198,6 @@ pgarch_MainLoop(void)
     pg_time_t    last_copy_time = 0;
     bool        time_to_stop;
 
-    /*
-     * We run the copy loop immediately upon entry, in case there are
-     * unarchived files left over from a previous database run (or maybe the
-     * archiver died unexpectedly).  After that we wait for a signal or
-     * timeout before doing more.
-     */
-    wakened = true;
-
     /*
      * There shouldn't be anything for the archiver to do except to wait for a
      * signal ... however, the archiver exists to protect our data, so she
@@ -328,12 +236,8 @@ pgarch_MainLoop(void)
         }
 
         /* Do what we're here for */
-        if (wakened || time_to_stop)
-        {
-            wakened = false;
-            pgarch_ArchiverCopyLoop();
-            last_copy_time = time(NULL);
-        }
+        pgarch_ArchiverCopyLoop();
+        last_copy_time = time(NULL);
 
         /*
          * Sleep until a signal is received, or until a poll is forced by
@@ -354,13 +258,9 @@ pgarch_MainLoop(void)
                                WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                                timeout * 1000L,
                                WAIT_EVENT_ARCHIVER_MAIN);
-                if (rc & WL_TIMEOUT)
-                    wakened = true;
                 if (rc & WL_POSTMASTER_DEATH)
                     time_to_stop = true;
             }
-            else
-                wakened = true;
         }
 
         /*
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 160afe9f39..9de9396628 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -539,6 +539,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1785,7 +1786,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -3068,7 +3069,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3203,20 +3204,16 @@ reaper(SIGNAL_ARGS)
         }
 
         /*
-         * Was it the archiver?  If so, just try to start a new one; no need
-         * to force reset of the rest of the system.  (If fail, we'll try
-         * again in future cycles of the main loop.).  Unless we were waiting
-         * for it to shut down; don't restart it in that case, and
-         * PostmasterStateMachine() will advance to the next shutdown step.
+         * Was it the archiver?  Normal exit can be ignored; we'll start a new
+         * one at the next iteration of the postmaster's main loop, if
+         * necessary. Any other exit condition is treated as a crash.
          */
         if (pid == PgArchPID)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3464,7 +3461,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3669,6 +3666,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3941,6 +3950,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5216,7 +5226,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5259,16 +5269,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (StartWorkerNeeded || HaveCrashedWorker)
         maybe_start_bgworkers();
 
-    if (CheckPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER) &&
-        PgArchPID != 0)
-    {
-        /*
-         * Send SIGUSR1 to archiver process, to wake it up and begin archiving
-         * next WAL file.
-         */
-        signal_child(PgArchPID, SIGUSR1);
-    }
-
     /* Tell syslogger to rotate logfile if requested */
     if (SysLoggerPID != 0)
     {
@@ -5501,6 +5501,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 5aa19d3f78..7acc48734e 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -187,6 +187,7 @@ InitProcGlobal(void)
     ProcGlobal->startupBufferPinWaitBufId = -1;
     ProcGlobal->walwriterLatch = NULL;
     ProcGlobal->checkpointerLatch = NULL;
+    ProcGlobal->archiverLatch = NULL;
     pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
     pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index e917dfe92d..0783692c83 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -336,6 +336,9 @@ extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern void XLogArchiveWakeupStart(void);
+extern void XLogArchiveWakeupEnd(void);
+extern void XLogArchiveWakeup(void);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool PromoteIsTriggered(void);
diff --git a/src/include/access/xlogarchive.h b/src/include/access/xlogarchive.h
index 1c67de2ede..54ce0b97d7 100644
--- a/src/include/access/xlogarchive.h
+++ b/src/include/access/xlogarchive.h
@@ -25,6 +25,7 @@ extern void ExecuteRecoveryCommand(const char *command, const char *commandName,
 extern void KeepFileRestoredFromArchive(const char *path, const char *xlogfname);
 extern void XLogArchiveNotify(const char *xlog);
 extern void XLogArchiveNotifySeg(XLogSegNo segno);
+extern void XLogArchiveWakeup(void);
 extern void XLogArchiveForceDone(const char *xlog);
 extern bool XLogArchiveCheckDone(const char *xlog);
 extern bool XLogArchiveIsBusy(const char *xlog);
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14fa127ab1..619b2f9c71 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -417,6 +417,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -429,6 +430,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index b3200874ca..e3ffc63f14 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 56c5ec4481..c691acf8cd 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -34,7 +34,6 @@ typedef enum
 {
     PMSIGNAL_RECOVERY_STARTED,    /* recovery has started */
     PMSIGNAL_BEGIN_HOT_STANDBY, /* begin Hot Standby */
-    PMSIGNAL_WAKEN_ARCHIVER,    /* send a NOTIFY signal to xlog archiver */
     PMSIGNAL_ROTATE_LOGFILE,    /* send SIGUSR1 to syslogger to rotate logfile */
     PMSIGNAL_START_AUTOVAC_LAUNCHER,    /* start an autovacuum launcher */
     PMSIGNAL_START_AUTOVAC_WORKER,    /* start an autovacuum worker */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index ae4f573ab4..1a8a0c2e15 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -272,6 +272,9 @@ typedef struct PROC_HDR
     int            startupProcPid;
     /* Buffer id of the buffer that Startup process waits for pin on, or -1 */
     int            startupBufferPinWaitBufId;
+    /* Archiver process's latch */
+    Latch       *archiverLatch;
+    /* Current shared estimate of appropriate spins_per_delay value */
 } PROC_HDR;
 
 extern PGDLLIMPORT PROC_HDR *ProcGlobal;
-- 
2.18.2

From 6ee6802be87094928d38fc5df70a95a59fb31576 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:14 +0900
Subject: [PATCH v33 5/7] Shared-memory based stats collector

Previously, activity statistics were collected via sockets and shared
among backends through files written periodically. Such files can reach
tens of megabytes and are created as often as every 500ms, and that
much data is serialized by the stats collector and then deserialized by
every backend periodically. To avoid that cost, this patch places
activity statistics data in shared memory. Each backend accumulates
statistics locally, then tries to move them into the shared statistics
at every transaction end, but at intervals no shorter than 500ms. Locks
on the shared statistics are acquired per object, such as a table or a
function, so the expected chance of collision is low. Furthermore,
until 1 second has elapsed since the last flush to shared stats, a lock
failure postpones the flush so that lock contention doesn't slow down
transactions. After that, the stats flush waits for locks so that the
shared statistics don't get stale.
---
 src/backend/access/heap/heapam_handler.c      |    4 +-
 src/backend/access/heap/vacuumlazy.c          |    2 +-
 src/backend/access/transam/xlog.c             |    4 +-
 src/backend/catalog/index.c                   |   24 +-
 src/backend/commands/analyze.c                |    8 +-
 src/backend/commands/dbcommands.c             |    2 +-
 src/backend/commands/matview.c                |    8 +-
 src/backend/commands/vacuum.c                 |    4 +-
 src/backend/postmaster/autovacuum.c           |   74 +-
 src/backend/postmaster/bgwriter.c             |    4 +-
 src/backend/postmaster/checkpointer.c         |   24 +-
 src/backend/postmaster/interrupt.c            |    5 +-
 src/backend/postmaster/pgarch.c               |   12 +-
 src/backend/postmaster/pgstat.c               | 5125 +++++++----------
 src/backend/postmaster/postmaster.c           |   88 +-
 src/backend/replication/basebackup.c          |    4 +-
 src/backend/storage/buffer/bufmgr.c           |    8 +-
 src/backend/storage/ipc/ipci.c                |    2 +
 src/backend/storage/lmgr/lwlock.c             |    4 +-
 src/backend/storage/lmgr/lwlocknames.txt      |    1 +
 src/backend/storage/smgr/smgr.c               |    4 +-
 src/backend/tcop/postgres.c                   |   37 +-
 src/backend/utils/adt/pgstatfuncs.c           |   59 +-
 src/backend/utils/init/globals.c              |    1 +
 src/backend/utils/init/miscinit.c             |    3 -
 src/backend/utils/init/postinit.c             |   11 +
 src/backend/utils/misc/guc.c                  |   14 +-
 src/backend/utils/misc/postgresql.conf.sample |    2 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |    2 +-
 src/include/miscadmin.h                       |    3 +-
 src/include/pgstat.h                          |  581 +-
 src/include/storage/lwlock.h                  |    1 +
 src/include/utils/guc_tables.h                |    2 +-
 src/include/utils/timeout.h                   |    1 +
 34 files changed, 2256 insertions(+), 3872 deletions(-)
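
The flush timing described in the commit message can be read as a
three-way decision made at transaction end. The sketch below is one
possible reading, not code from the patch: the enum, the constants and
model_flush_mode are hypothetical names, and treating the 500ms and
1 second boundaries as inclusive lower bounds is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the two intervals come from the commit message. */
#define MODEL_MIN_FLUSH_INTERVAL_MS 500     /* never flush more often */
#define MODEL_MAX_FLUSH_INTERVAL_MS 1000    /* after this, wait for locks */

typedef enum
{
    MODEL_FLUSH_SKIP,       /* too soon, keep accumulating locally */
    MODEL_FLUSH_NOWAIT,     /* try to flush, postpone on lock failure */
    MODEL_FLUSH_WAIT        /* flush and block on locks if needed */
} model_flush_mode_t;

/*
 * Decide how to flush local stats at transaction end, given the current
 * time and the time of the last successful flush, both in milliseconds.
 */
static model_flush_mode_t
model_flush_mode(int64_t now_ms, int64_t last_flush_ms)
{
    int64_t     elapsed_ms;

    assert(now_ms >= last_flush_ms);
    elapsed_ms = now_ms - last_flush_ms;

    if (elapsed_ms < MODEL_MIN_FLUSH_INTERVAL_MS)
        return MODEL_FLUSH_SKIP;
    if (elapsed_ms < MODEL_MAX_FLUSH_INTERVAL_MS)
        return MODEL_FLUSH_NOWAIT;
    return MODEL_FLUSH_WAIT;
}
```

In the nowait window a busy lock costs the transaction nothing, while
the waiting mode bounds how stale the shared numbers can become.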

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 56b35622f1..80a3c50994 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1059,8 +1059,8 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
                  * our own.  In this case we should count and sample the row,
                  * to accommodate users who load a table and analyze it in one
                  * transaction.  (pgstat_report_analyze has to adjust the
-                 * numbers we send to the stats collector to make this come
-                 * out right.)
+                 * numbers we report to the activity stats facility to make
+                 * this come out right.)
                  */
                 if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(targtuple->t_data)))
                 {
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 3bef0e124b..4ac4e7c03d 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -595,7 +595,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
                         new_min_multi,
                         false);
 
-    /* report results to the stats collector, too */
+    /* report results to the activity stats facility, too */
     pgstat_report_vacuum(RelationGetRelid(onerel),
                          onerel->rd_rel->relisshared,
                          new_live_tuples,
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ca09d81b08..474db9fde3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8575,9 +8575,9 @@ LogCheckpointEnd(bool restartpoint)
                         &sync_secs, &sync_usecs);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time +=
+    BgWriterStats.checkpoint_write_time +=
         write_secs * 1000 + write_usecs / 1000;
-    BgWriterStats.m_checkpoint_sync_time +=
+    BgWriterStats.checkpoint_sync_time +=
         sync_secs * 1000 + sync_usecs / 1000;
 
     /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 7cfbdd57db..f3779dc4aa 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1698,28 +1698,10 @@ index_concurrently_swap(Oid newIndexId, Oid oldIndexId, const char *oldName)
 
     /*
      * Copy over statistics from old to new index
+     * The data will be flushed by the next pgstat_report_stat()
+     * call.
      */
-    {
-        PgStat_StatTabEntry *tabentry;
-
-        tabentry = pgstat_fetch_stat_tabentry(oldIndexId);
-        if (tabentry)
-        {
-            if (newClassRel->pgstat_info)
-            {
-                newClassRel->pgstat_info->t_counts.t_numscans = tabentry->numscans;
-                newClassRel->pgstat_info->t_counts.t_tuples_returned = tabentry->tuples_returned;
-                newClassRel->pgstat_info->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_hit = tabentry->blocks_hit;
-
-                /*
-                 * The data will be sent by the next pgstat_report_stat()
-                 * call.
-                 */
-            }
-        }
-    }
+    pgstat_copy_index_counters(oldIndexId, newClassRel->pgstat_info);
 
     /* Close relations */
     table_close(pg_class, RowExclusiveLock);
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 924ef37c81..06b03cb8e1 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -655,10 +655,10 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
     }
 
     /*
-     * Report ANALYZE to the stats collector, too.  However, if doing
-     * inherited stats we shouldn't report, because the stats collector only
-     * tracks per-table stats.  Reset the changes_since_analyze counter only
-     * if we analyzed all columns; otherwise, there is still work for
+     * Report ANALYZE to the activity stats facility, too.  However, if doing
+     * inherited stats we shouldn't report, because the activity stats facility
+     * only tracks per-table stats.  Reset the changes_since_analyze counter
+     * only if we analyzed all columns; otherwise, there is still work for
      * auto-analyze to do.
      */
     if (!inh)
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index f27c3fe8c1..de0c749570 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -971,7 +971,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
     DropDatabaseBuffers(db_id);
 
     /*
-     * Tell the stats collector to forget it immediately, too.
+     * Tell the activity stats facility to forget it immediately, too.
      */
     pgstat_drop_database(db_id);
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index f80a9e96a9..e7ccc6eba7 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -336,10 +336,10 @@ ExecRefreshMatView(RefreshMatViewStmt *stmt, const char *queryString,
         refresh_by_heap_swap(matviewOid, OIDNewHeap, relpersistence);
 
         /*
-         * Inform stats collector about our activity: basically, we truncated
-         * the matview and inserted some new data.  (The concurrent code path
-         * above doesn't need to worry about this because the inserts and
-         * deletes it issues get counted by lower-level code.)
+         * Inform activity stats facility about our activity: basically, we
+         * truncated the matview and inserted some new data.  (The concurrent
+         * code path above doesn't need to worry about this because the inserts
+         * and deletes it issues get counted by lower-level code.)
          */
         pgstat_count_truncate(matviewRel);
         if (!stmt->skipData)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 5a110edb07..d0015eb411 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -319,8 +319,8 @@ vacuum(List *relations, VacuumParams *params,
                  errmsg("VACUUM option DISABLE_PAGE_SKIPPING cannot be used with FULL")));
 
     /*
-     * Send info about dead objects to the statistics collector, unless we are
-     * in autovacuum --- autovacuum.c does this for itself.
+     * Send info about dead objects to the activity statistics facility, unless
+     * we are in autovacuum --- autovacuum.c does this for itself.
      */
     if ((params->options & VACOPT_VACUUM) && !IsAutoVacuumWorkerProcess())
         pgstat_vacuum_stat();
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 6829ff3e8f..4f837a77a5 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -338,9 +338,6 @@ static void autovacuum_do_vac_analyze(autovac_table *tab,
                                       BufferAccessStrategy bstrategy);
 static AutoVacOpts *extract_autovac_opts(HeapTuple tup,
                                          TupleDesc pg_class_desc);
-static PgStat_StatTabEntry *get_pgstat_tabentry_relid(Oid relid, bool isshared,
-                                                      PgStat_StatDBEntry *shared,
-                                                      PgStat_StatDBEntry *dbentry);
 static void perform_work_item(AutoVacuumWorkItem *workitem);
 static void autovac_report_activity(autovac_table *tab);
 static void autovac_report_workitem(AutoVacuumWorkItem *workitem,
@@ -1665,12 +1662,12 @@ AutoVacWorkerMain(int argc, char *argv[])
         char        dbname[NAMEDATALEN];
 
         /*
-         * Report autovac startup to the stats collector.  We deliberately do
-         * this before InitPostgres, so that the last_autovac_time will get
-         * updated even if the connection attempt fails.  This is to prevent
-         * autovac from getting "stuck" repeatedly selecting an unopenable
-         * database, rather than making any progress on stuff it can connect
-         * to.
+         * Report autovac startup to the activity stats facility.  We
+         * deliberately do this before InitPostgres, so that the
+         * last_autovac_time will get updated even if the connection attempt
+         * fails.  This is to prevent autovac from getting "stuck" repeatedly
+         * selecting an unopenable database, rather than making any progress on
+         * stuff it can connect to.
          */
         pgstat_report_autovac(dbid);
 
@@ -1938,8 +1935,6 @@ do_autovacuum(void)
     HASHCTL        ctl;
     HTAB       *table_toast_map;
     ListCell   *volatile cell;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     BufferAccessStrategy bstrategy;
     ScanKeyData key;
     TupleDesc    pg_class_desc;
@@ -1958,17 +1953,11 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /*
-     * may be NULL if we couldn't find an entry (only happens if we are
-     * forcing a vacuum for anti-wrap purposes).
-     */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
 
     /*
-     * Clean up any dead statistics collector entries for this DB. We always
+     * Clean up any dead activity statistics entries for this DB. We always
      * want to do this exactly once per DB-processing cycle, even if we find
      * nothing worth vacuuming in the database.
      */
@@ -2011,9 +2000,6 @@ do_autovacuum(void)
     /* StartTransactionCommand changed elsewhere */
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-
     classRel = table_open(RelationRelationId, AccessShareLock);
 
     /* create a copy so we can use it after closing pg_class */
@@ -2092,8 +2078,8 @@ do_autovacuum(void)
 
         /* Fetch reloptions and the pgstat entry for this table */
         relopts = extract_autovac_opts(tuple, pg_class_desc);
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                       relid);
 
         /* Check if it needs vacuum or analyze */
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
@@ -2176,8 +2162,8 @@ do_autovacuum(void)
         }
 
         /* Fetch the pgstat entry for this table */
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                       relid);
 
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
@@ -2736,29 +2722,6 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
     return av;
 }
 
-/*
- * get_pgstat_tabentry_relid
- *
- * Fetch the pgstat entry of a table, either local to a database or shared.
- */
-static PgStat_StatTabEntry *
-get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
-                          PgStat_StatDBEntry *dbentry)
-{
-    PgStat_StatTabEntry *tabentry = NULL;
-
-    if (isshared)
-    {
-        if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
-    }
-    else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
-
-    return tabentry;
-}
 
 /*
  * table_recheck_autovac
@@ -2779,17 +2742,12 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     bool        doanalyze;
     autovac_table *tab = NULL;
     PgStat_StatTabEntry *tabentry;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     bool        wraparound;
     AutoVacOpts *avopts;
 
     /* use fresh stats */
     autovac_refresh_stats();
 
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* fetch the relation's relcache entry */
     classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
     if (!HeapTupleIsValid(classTup))
@@ -2813,8 +2771,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     }
 
     /* fetch the pgstat table entry */
-    tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                         shared, dbentry);
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                   relid);
 
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
@@ -2936,7 +2894,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
  *
  * For analyze, the analysis done is that the number of tuples inserted,
  * deleted and updated since the last analyze exceeds a threshold calculated
- * in the same fashion as above.  Note that the collector actually stores
+ * in the same fashion as above.  Note that the activity stats facility stores
  * the number of tuples (both live and dead) that there were as of the last
  * analyze.  This is asymmetric to the VACUUM case.
  *
@@ -2946,8 +2904,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
  *
  * A table whose autovacuum_enabled option is false is
  * automatically skipped (unless we have to vacuum it due to freeze_max_age).
- * Thus autovacuum can be disabled for specific tables. Also, when the stats
- * collector does not have data about a table, it will be skipped.
+ * Thus autovacuum can be disabled for specific tables. Also, when the activity
+ * stats facility does not have data about a table, it will be skipped.
  *
  * A table whose vac_base_thresh value is < 0 takes the base value from the
  * autovacuum_vacuum_threshold GUC variable.  Similarly, a vac_scale_factor
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 069e27e427..4382b1726f 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -234,9 +234,9 @@ BackgroundWriterMain(void)
         can_hibernate = BgBufferSync(&wb_context);
 
         /*
-         * Send off activity statistics to the stats collector
+         * Send off activity statistics to the activity stats facility
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 34ed9f7887..ebc4bce72c 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -350,7 +350,7 @@ CheckpointerMain(void)
         if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
         {
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            BgWriterStats.requested_checkpoints++;
         }
 
         /*
@@ -364,7 +364,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                BgWriterStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -486,13 +486,13 @@ CheckpointerMain(void)
         CheckArchiveTimeout();
 
         /*
-         * Send off activity statistics to the stats collector.  (The reason
-         * why we re-use bgwriter-related code for this is that the bgwriter
-         * and checkpointer used to be just one process.  It's probably not
-         * worth the trouble to split the stats support into two independent
-         * stats message types.)
+         * Send off activity statistics to the activity stats facility.  (The
+         * reason why we re-use bgwriter-related code for this is that the
+         * bgwriter and checkpointer used to be just one process.  It's
+         * probably not worth the trouble to split the stats support into two
+         * independent stats message types.)
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         /*
          * If any checkpoint flags have been set, redo the loop to handle the
@@ -698,9 +698,9 @@ CheckpointWriteDelay(int flags, double progress)
         CheckArchiveTimeout();
 
         /*
-         * Report interim activity statistics to the stats collector.
+         * Report interim activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1245,8 +1245,8 @@ AbsorbSyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    BgWriterStats.buf_written_backend += CheckpointerShmem->num_backend_writes;
+    BgWriterStats.buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/interrupt.c b/src/backend/postmaster/interrupt.c
index 3d02439b79..53844eb8bb 100644
--- a/src/backend/postmaster/interrupt.c
+++ b/src/backend/postmaster/interrupt.c
@@ -93,9 +93,8 @@ SignalHandlerForCrashExit(SIGNAL_ARGS)
  * shut down and exit.
  *
  * Typically, this handler would be used for SIGTERM, but some procesess use
- * other signals. In particular, the checkpointer exits on SIGUSR2, the
- * stats collector on SIGQUIT, and the WAL writer exits on either SIGINT
- * or SIGTERM.
+ * other signals. In particular, the checkpointer exits on SIGUSR2, and the WAL
+ * writer exits on either SIGINT or SIGTERM.
  *
  * ShutdownRequestPending should be checked at a convenient place within the
  * main loop, or else the main loop should call HandleMainLoopInterrupts.
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 063d1323ea..08fe87341c 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -372,20 +372,20 @@ pgarch_ArchiverCopyLoop(void)
                 pgarch_archiveDone(xlog);
 
                 /*
-                 * Tell the collector about the WAL file that we successfully
-                 * archived
+                 * Tell the activity statistics facility about the WAL file
+                 * that we successfully archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_report_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
             else
             {
                 /*
-                 * Tell the collector about the WAL file that we failed to
-                 * archive
+                 * Tell the activity statistics facility about the WAL file
+                 * that we failed to archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_report_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index e246be388b..13d1e92f7b 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,22 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Activity Statistics facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects activity statistics, e.g. per-table access statistics, of
+ *  all backends in shared memory. The activity numbers are first stored
+ *  locally in each process, then flushed to shared memory at commit
+ *  time or by idle-timeout.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ * To avoid congestion on the shared memory, shared stats are updated no more
+ * often than once per PGSTAT_MIN_INTERVAL (1000ms). If some local numbers
+ * remain unflushed because a lock could not be acquired, retry at intervals
+ * starting at PGSTAT_RETRY_MIN_INTERVAL (250ms) and doubling at every retry.
+ * Finally, force the update PGSTAT_MAX_INTERVAL (10000ms) after the first try.
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the activity statistics facility creates the
+ *  area and loads the stored stats file, if any. At shutdown, the last process
+ *  writes the shared stats to the file and destroys the area before exiting.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -19,18 +26,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -40,68 +35,43 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/fork_process.h"
 #include "postmaster/interrupt.h"
 #include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
-#include "utils/timestamp.h"
 
 /* ----------
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
+#define PGSTAT_MIN_INTERVAL            1000    /* Minimum interval of stats data
+                                             * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_RETRY_MIN_INTERVAL    250 /* Initial retry interval after
+                                         * PGSTAT_MIN_INTERVAL */
 
+#define PGSTAT_MAX_INTERVAL            10000    /* Longest interval of stats data
+                                             * updates */
 
 /* ----------
- * The initial size hints for the hash tables used in the collector.
+ * The initial size hints for the hash tables used in the activity statistics.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
-#define PGSTAT_TAB_HASH_SIZE    512
+#define PGSTAT_TABLE_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
 
@@ -116,7 +86,6 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
-
 /* ----------
  * GUC parameters
  * ----------
@@ -131,15 +100,24 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used; to be removed together with the GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
-/*
- * BgWriter global statistics counters (unused in other processes).
- * Stored directly in a stats message structure so it can be sent
- * without needing to copy things around.  We assume this inits to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
+typedef struct StatsShmemStruct
+{
+    dsa_handle    stats_dsa_handle;    /* handle for stats data area */
+    dshash_table_handle hash_handle;    /* shared dbstat hash */
+    dsa_pointer global_stats;    /* DSA pointer to global stats */
+    dsa_pointer archiver_stats; /* Ditto for archiver stats */
+    dsa_pointer slru_stats;        /* Ditto for SLRU stats */
+    int            refcount;        /* # of processes attached to the shared
+                                 * stats memory */
+}            StatsShmemStruct;
+
+/* BgWriter global statistics counters */
+PgStat_BgWriter BgWriterStats = {0};
 
 /*
  * List of SLRU names that we keep stats for.  There is no central registry of
@@ -160,72 +138,159 @@ static const char *const slru_names[] = {
 #define SLRU_NUM_ELEMENTS    lengthof(slru_names)
 
 /*
- * SLRU statistics counts waiting to be sent to the collector.  These are
- * stored directly in stats message format so they can be sent without needing
- * to copy things around.  We assume this variable inits to zeroes.  Entries
- * are one-to-one with slru_names[].
+ * SLRU statistics counts waiting to be written to the shared activity
+ * statistics.  We assume this variable inits to zeroes.  Entries are
+ * one-to-one with slru_names[].
+ * Changes of SLRU counters are reported within critical sections so we use
+ * static memory in order to avoid memory allocation.
  */
-static PgStat_MsgSLRU SLRUStats[SLRU_NUM_ELEMENTS];
+static PgStat_StatSLRUEntry local_SLRUStats[SLRU_NUM_ELEMENTS];
+static bool     have_slrustats = false;
 
 /* ----------
  * Local data
  * ----------
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
-
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
+/* backend-lifetime storage */
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
 
 /*
- * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
+ * Enums and types to define shared statistics structure.
+ *
+ * The statistics entry for each object is stored in individually
+ * DSA-allocated memory. Every entry is pointed to from the dshash
+ * pgStatSharedHash via a dsa_pointer. This structure keeps object-stats
+ * entries from being moved by dshash resizing, and lets the dshash release
+ * its lock sooner on stats updates. It also reduces interference among
+ * write-locks on each stat entry by not relying on the dshash partition lock.
+ * PgStatLocalHashEntry is the local-stats equivalent of PgStatHashEntry.
+ *
+ * Each stat entry is enveloped by the type PgStatEnvelope, which stores the
+ * attributes common to all kinds of statistics and an LWLock.
+ *
+ * Shared stats are stored as:
+ *
+ * dshash pgStatSharedHash
+ *    -> PgStatHashEntry                (dshash entry)
+ *      (dsa_pointer)-> PgStatEnvelope    (dsa memory block)
  *
- * NOTE: once allocated, TabStatusArray structures are never moved or deleted
- * for the life of the backend.  Also, we zero out the t_id fields of the
- * contained PgStat_TableStatus structs whenever they are not actively in use.
- * This allows relcache pgstat_info pointers to be treated as long-lived data,
- * avoiding repeated searches in pgstat_initstats() when a relation is
- * repeatedly opened during a transaction.
+ * Local stats are stored as:
+ *
+ * dynahash pgStatLocalHash
+ *    -> PgStatLocalHashEntry            (dynahash entry)
+ *      (direct pointer)-> PgStatEnvelope (palloc'ed memory)
  */
-#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
 
-typedef struct TabStatusArray
+/* The types of statistics entries. */
+typedef enum PgStatTypes
 {
-    struct TabStatusArray *tsa_next;    /* link to next array, if any */
-    int            tsa_used;        /* # entries currently used */
-    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
-} TabStatusArray;
-
-static TabStatusArray *pgStatTabList = NULL;
+    PGSTAT_TYPE_ALL,            /* Not a type, for the parameters of
+                                 * pgstat_collect_stat_entries */
+    PGSTAT_TYPE_DB,                /* database-wide statistics */
+    PGSTAT_TYPE_TABLE,            /* per-table statistics */
+    PGSTAT_TYPE_FUNCTION        /* per-function statistics */
+}            PgStatTypes;
 
 /*
- * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
+ * entry size lookup table of shared statistics entries corresponding to
+ * PgStatTypes
  */
-typedef struct TabStatHashEntry
+static size_t pgstat_entsize[] =
 {
-    Oid            t_id;
-    PgStat_TableStatus *tsa_entry;
-} TabStatHashEntry;
+    0,                            /* PGSTAT_TYPE_ALL: not an entry */
+    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_StatTabEntry),    /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_StatFuncEntry)    /* PGSTAT_TYPE_FUNCTION */
+};
 
-/*
- * Hash table for O(1) t_id -> tsa_entry lookup
- */
-static HTAB *pgStatTabHash = NULL;
+/* Ditto for local statistics entries */
+static size_t pgstat_localentsize[] =
+{
+    0,                            /* PGSTAT_TYPE_ALL: not an entry */
+    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_TableStatus), /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_BackendFunctionEntry) /* PGSTAT_TYPE_FUNCTION */
+};
+
+/* struct for shared statistics hash entry key. */
+typedef struct PgStatHashEntryKey
+{
+    PgStatTypes type;            /* statistics entry type */
+    Oid            databaseid;        /* database ID. InvalidOid for shared objects. */
+    Oid            objectid;        /* object OID */
+}            PgStatHashEntryKey;
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * Stats numbers that are waiting to be flushed out to shared stats are held in
+ * pgStatLocalHash.
  */
-static HTAB *pgStatFunctions = NULL;
+typedef struct PgStatHashEntry
+{
+    PgStatHashEntryKey key;        /* hash key */
+    dsa_pointer env;            /* pointer to shared stats envelope in DSA */
+}            PgStatHashEntry;
+
+/* struct for shared statistics entry pointed from shared hash entry. */
+typedef struct PgStatEnvelope
+{
+    PgStatTypes type;            /* statistics entry type */
+    Oid            databaseid;        /* databaseid */
+    Oid            objectid;        /* objectid */
+    size_t        len;            /* length of body, fixed per type. */
+    LWLock        lock;            /* lightweight lock to protect body */
+    int            body[FLEXIBLE_ARRAY_MEMBER];    /* statistics body */
+}            PgStatEnvelope;
+
+#define PgStatEnvelopeSize(bodylen) \
+    (offsetof(PgStatEnvelope, body) + (bodylen))
+
+/* struct for shared SLRU stats */
+typedef struct PgStatSharedSLRUStats
+{
+    PgStat_StatSLRUEntry    entry[SLRU_NUM_ELEMENTS];
+    LWLock                    lock;
+} PgStatSharedSLRUStats;
+
+/* struct for shared statistics local hash entry. */
+typedef struct PgStatLocalHashEntry
+{
+    PgStatHashEntryKey key;        /* hash key */
+    PgStatEnvelope *env;        /* pointer to stats envelope in heap */
+}            PgStatLocalHashEntry;
 
 /*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
+ * A snapshot is a stats entry that is copied locally to offer stable values
+ * for a transaction.
  */
-static bool have_function_stats = false;
+typedef struct PgStatSnapshot
+{
+    PgStatHashEntryKey key;
+    bool        negative;
+    int            body[FLEXIBLE_ARRAY_MEMBER];    /* statistics body */
+}            PgStatSnapshot;
+
+#define PgStatSnapshotSize(bodylen)                \
+    (offsetof(PgStatSnapshot, body) + (bodylen))
+
+/* parameters for the shared hash */
+static const dshash_parameters dsh_rootparams = {
+    sizeof(PgStatHashEntryKey),
+    sizeof(PgStatHashEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+
+/* The shared hash to index activity stats entries. */
+static dshash_table *pgStatSharedHash = NULL;
+
+/* Local stats numbers are stored here */
+static HTAB *pgStatLocalHash = NULL;
+
+#define HAVE_ANY_PENDING_STATS()                        \
+    (pgStatLocalHash != NULL ||                            \
+     pgStatXactCommit != 0 || pgStatXactRollback != 0)
 
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
@@ -262,11 +327,10 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
+static MemoryContext pgStatSnapshotContext = NULL;
+static HTAB *pgStatSnapshotHash = NULL;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -275,20 +339,19 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Cluster wide statistics.
+ *
+ * Contains statistics that are collected neither per database nor per table.
+ * shared_* point to shared memory and snapshot_* are backend-local
+ * snapshots.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
+static bool global_snapshot_is_valid = false;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats snapshot_globalStats;
+static PgStatSharedSLRUStats *shared_SLRUStats;
+static PgStat_StatSLRUEntry snapshot_SLRUStats[SLRU_NUM_ELEMENTS];
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -297,526 +360,278 @@ static List *pending_write_requests = NIL;
  */
 static instr_time total_func_time;
 
+/*
+ * A newly created shared stats entry needs to be initialized before other
+ * processes can access it. get_stat_entry() calls this for that purpose.
+ */
+typedef void (*entry_initializer) (PgStatEnvelope * env);
 
 /* ----------
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgstat_beshutdown_hook(int code, Datum arg);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                                                 Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStatEnvelope * get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                                       bool nowait,
+                                       entry_initializer initfunc, bool *found);
+static PgStatEnvelope * *collect_stat_entries(PgStatTypes type, Oid dbid);
+static void create_missing_dbentries(void);
+static void pgstat_write_database_stats(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_db_statsfile(PgStat_StatDBEntry *dbentry);
+
+static bool flush_tabstat(PgStatEnvelope * env, bool nowait);
+static bool flush_funcstat(PgStatEnvelope * env, bool nowait);
+static bool flush_dbstat(PgStatEnvelope * env, bool nowait);
+static bool flush_slrustat(bool nowait);
+
+static void init_tabentry(PgStatEnvelope * env);
+static void init_funcentry(PgStatEnvelope * env);
+static void init_dbentry(PgStatEnvelope * env);
+
+static bool delete_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                              bool nowait);
+static PgStatEnvelope * get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                                             bool create, bool *found);
+static PgStat_StatDBEntry *get_local_dbstat_entry(Oid dbid);
+static PgStat_TableStatus *get_local_tabstat_entry(Oid rel_id, bool isshared);
+
+static PgStat_SubXactStatus *get_tabstat_stack_level(int nest_level);
+static void add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level);
+static void pgstat_snapshot_global_stats(void);
+
 static void pgstat_read_current_status(void);
-
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
-
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
-static void pgstat_send_slru(void);
-static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
-
-static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
-
-static void pgstat_setup_memcxt(void);
-
 static const char *pgstat_get_wait_activity(WaitEventActivity w);
 static const char *pgstat_get_wait_client(WaitEventClient w);
 static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute shared memory space needed for activity statistics
+ */
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+/*
+ * StatsShmemInit - initialize during shared-memory creation
  */
 void
-pgstat_init(void)
+StatsShmemInit(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
+    bool        found;
 
-#define TESTBYTEVAL ((char) 199)
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
 
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
-
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
-    {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
-    }
-
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
+    if (!IsUnderPostmaster)
     {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
-
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
-
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
+        Assert(!found);
 
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+    }
+}
 
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ *    Create pgStatLocalContext and pgStatSnapshotContext, if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+    if (!pgStatLocalContext)
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+
+    if (!pgStatSnapshotContext)
+        pgStatSnapshotContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Database statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+}
 
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
 
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+/* ----------
+ * attach_shared_stats() -
+ *
+ *    Attach the shared stats memory, creating it if it doesn't exist yet. If we
+ *    are the first process to use it, read the saved statistics files, if any.
+ * ---------
+ */
+static void
+attach_shared_stats(void)
+{
+    MemoryContext oldcontext;
 
-        test_byte++;            /* just make sure variable is changed */
+    /*
+     * Don't use dsm unless running under postmaster and tracking counts.
+     */
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
 
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    pgstat_setup_memcxt();
 
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    if (area)
+        return;
 
-        /* If we get here, we have a working socket */
-        break;
-    }
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
 
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
     /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
+     * The last process is responsible for writing out the stats files at
+     * exit. Maintain a refcount so that an exiting process can tell whether
+     * it is the last one or not.
      */
-    if (!pg_set_noblock(pgStatSock))
+    if (StatsShmem->refcount > 0)
+        StatsShmem->refcount++;
+    else
     {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
+        /* We're the first process to attach the shared stats memory */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
+
+        /* Initialize shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatSharedHash = dshash_create(area, &dsh_rootparams, 0);
+
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->global_stats =
+            dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+        StatsShmem->archiver_stats =
+            dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+        StatsShmem->slru_stats =
+            dsa_allocate0(area, sizeof(PgStatSharedSLRUStats));
+        StatsShmem->hash_handle = dshash_get_hash_table_handle(pgStatSharedHash);
+
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
+
+        shared_SLRUStats = (PgStatSharedSLRUStats *)
+            dsa_get_address(area, StatsShmem->slru_stats);
+        LWLockInitialize(&shared_SLRUStats->lock, LWTRANCHE_STATS);
+
+        /* Load saved data if any. */
+        pgstat_read_statsfiles();
+
+        StatsShmem->refcount = 1;
     }
 
+    LWLockRelease(StatsLock);
+
     /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
+     * If we're not the first process, attach the existing shared stats area
+     * outside the StatsLock section.
      */
+    if (!area)
     {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
+        /* Attach shared area. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatSharedHash = dshash_attach(area, &dsh_rootparams,
+                                         StatsShmem->hash_handle, 0);
+
+        /* Setup local variables */
+        pgStatLocalHash = NULL;
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
+        shared_SLRUStats = (PgStatSharedSLRUStats *)
+            dsa_get_address(area, StatsShmem->slru_stats);
     }
 
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    /* Now that we have a long-lived socket, tell fd.c about it. */
-    ReserveExternalFD();
+    MemoryContextSwitchTo(oldcontext);
 
-    return;
-
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
-
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
-
-    /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
-     */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    /* don't detach automatically */
+    dsa_pin_mapping(area);
+    global_snapshot_is_valid = false;
 }
 
-/*
- * subroutine for pgstat_reset_all
+/* ----------
+ * detach_shared_stats() -
+ *
+ *    Detach shared stats. Write them out to files if we're the last process
+ *    and were told to do so.
+ * ----------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+detach_shared_stats(bool write_stats)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    /* return immediately if there is nothing to detach */
+    if (!area || !IsUnderPostmaster)
+        return;
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
-    {
-        int            nchars;
-        Oid            tmp_oid;
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
+    if (--StatsShmem->refcount < 1)
+    {
         /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
+         * This process is the last one attached to the shared stats
+         * memory. Write out the stats files if requested.
          */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
-
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
+        if (write_stats)
+            pgstat_write_statsfiles();
 
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+        /* No one is using the area. */
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
     }
-    FreeDir(dir);
+
+    LWLockRelease(StatsLock);
+
+    /*
+     * Detach the area. It is automatically destroyed when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);
+
+    area = NULL;
+    pgStatSharedHash = NULL;
+    shared_globalStats = NULL;
+    shared_archiverStats = NULL;
+    pgStatLocalHash = NULL;
+    global_snapshot_is_valid = false;
 }
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
-}
+    /* standalone server doesn't use shared stats */
+    if (!IsUnderPostmaster)
+        return;
 
-#ifdef EXEC_BACKEND
+    /* we must have shared stats attached */
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
-    char       *av[10];
-    int            ac = 0;
-
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
-
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
-
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
-
-
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
-
-    /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
-     */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
-
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
+    /* Startup must be the only user of shared stats */
+    Assert(StatsShmem->refcount == 1);
 
     /*
-     * Okay, fork off the collector.
+     * We could directly remove files and recreate the shared memory area. But
+     * just discard it and then create it anew for simplicity.
      */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
+    detach_shared_stats(false); /* Don't write files. */
+    attach_shared_stats();
 }
 
 /* ------------------------------------------------------------
@@ -824,147 +639,491 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *    Updates are applied no more frequently than once every
+ *    PGSTAT_MIN_INTERVAL milliseconds. They are also postponed on lock
+ *    failure if force is false and no update has been pending for longer than
+ *    PGSTAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
+ *    succeeding calls of this function.
+ *
+ *    Returns the time in milliseconds until the next time updates can be
+ *    applied, if no updates have been held for more than
+ *    PGSTAT_MIN_INTERVAL milliseconds.
+ *
+ *    Note that this is called only outside a transaction, so it is fine to use
  *    transaction stop time as an approximation of current time.
- * ----------
+ *    ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz next_flush = 0;
+    static TimestampTz pending_since = 0;
+    static long retry_interval = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
+    bool        nowait = !force;    /* Don't use force from here on */
+    HASH_SEQ_STATUS scan;
+    PgStatLocalHashEntry *lent;
+    PgStatLocalHashEntry **dbentlist;
+    int            dbentlistlen = 8;
+    int            ndbentries = 0;
+    int            remains = 0;
     int            i;
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (area == NULL || !HAVE_ANY_PENDING_STATS())
+        return 0;
+
+    dbentlist = palloc(sizeof(PgStatLocalHashEntry *) * dbentlistlen);
 
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
     now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
 
-    /*
-     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
-     */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
-    pgStatTabHash = NULL;
+    if (nowait)
+    {
+        /*
+         * Don't flush stats too frequently.  Return the time to the next
+         * flush.
+         */
+        if (now < next_flush)
+        {
+            /* Record the epoch time if retrying. */
+            if (pending_since == 0)
+                pending_since = now;
+
+            return (next_flush - now) / 1000;
+        }
+
+        /* But, don't keep pending updates longer than PGSTAT_MAX_INTERVAL. */
+
+        if (pending_since > 0 &&
+            TimestampDifferenceExceeds(pending_since, now, PGSTAT_MAX_INTERVAL))
+            nowait = false;
+    }
 
     /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
+     * flush_tabstat applies some of the stats numbers of flushed entries to
+     * the local database stats, so flush out database stats afterwards.
      */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
-    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+    if (pgStatLocalHash)
     {
-        for (i = 0; i < tsa->tsa_used; i++)
+        /* Step 1: flush out stats other than database stats */
+        hash_seq_init(&scan, pgStatLocalHash);
+        while ((lent = (PgStatLocalHashEntry *) hash_seq_search(&scan)) != NULL)
         {
-            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
+            bool        remove = false;
 
-            /* Shouldn't have any pending transaction-dependent counts */
-            Assert(entry->trans == NULL);
+            switch (lent->env->type)
+            {
+                case PGSTAT_TYPE_DB:
+                    if (ndbentries >= dbentlistlen)
+                    {
+                        dbentlistlen *= 2;
+                        dbentlist = repalloc(dbentlist,
+                                             sizeof(PgStatLocalHashEntry *) *
+                                             dbentlistlen);
+                    }
+                    dbentlist[ndbentries++] = lent;
+                    break;
+                case PGSTAT_TYPE_TABLE:
+                    if (flush_tabstat(lent->env, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_FUNCTION:
+                    if (flush_funcstat(lent->env, nowait))
+                        remove = true;
+                    break;
+                default:
+                    Assert(false);
+            }
 
-            /*
-             * Ignore entries that didn't accumulate any actual counts, such
-             * as indexes that were opened by the planner but not used.
-             */
-            if (memcmp(&entry->t_counts, &all_zeroes,
-                       sizeof(PgStat_TableCounts)) == 0)
+            if (!remove)
+            {
+                remains++;
                 continue;
+            }
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* Remove the successfully flushed entry */
+            pfree(lent->env);
+            hash_search(pgStatLocalHash, &lent->key, HASH_REMOVE, NULL);
+        }
+
+        /* Step 2: flush out database stats */
+        for (i = 0; i < ndbentries; i++)
+        {
+            PgStatLocalHashEntry *lent = dbentlist[i];
+
+            if (flush_dbstat(lent->env, nowait))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                remains--;
+                /* Remove the successfully flushed entry */
+                pfree(lent->env);
+                hash_search(pgStatLocalHash, &lent->key, HASH_REMOVE, NULL);
             }
         }
-        /* zero out PgStat_TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
+        pfree(dbentlist);
+
+        if (remains <= 0)
+        {
+            hash_destroy(pgStatLocalHash);
+            pgStatLocalHash = NULL;
+        }
     }
 
+    /* flush SLRU stats */
+    flush_slrustat(nowait);
+
+    /* Publish the last flush time */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (shared_globalStats->stats_timestamp < now)
+        shared_globalStats->stats_timestamp = now;
+    LWLockRelease(StatsLock);
+
     /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
+     * If we have pending local stats, let the caller know the retry interval.
      */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    if (HAVE_ANY_PENDING_STATS())
+    {
+        /* Retain the epoch time */
+        if (pending_since == 0)
+            pending_since = now;
+
+        /* The interval is doubled at every retry. */
+        if (retry_interval == 0)
+            retry_interval = PGSTAT_RETRY_MIN_INTERVAL * 1000;
+        else
+            retry_interval = retry_interval * 2;
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+        /*
+         * Determine the next retry interval so as not to get shorter than the
+         * previous interval.
+         */
+        if (!TimestampDifferenceExceeds(pending_since,
+                                        now + 2 * retry_interval,
+                                        PGSTAT_MAX_INTERVAL))
+            next_flush = now + retry_interval;
+        else
+        {
+            next_flush = pending_since + PGSTAT_MAX_INTERVAL * 1000;
+            retry_interval = next_flush - now;
+        }
 
-    /* Finally send SLRU statistics */
-    pgstat_send_slru();
+        return retry_interval / 1000;
+    }
+
+    /* Set the next time to update stats */
+    next_flush = now + PGSTAT_MIN_INTERVAL * 1000;
+    retry_interval = 0;
+    pending_since = 0;
+
+    return 0;
 }
 
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * flush_tabstat - flush out a local table stats entry
+ *
+ * Some of the counters are added to the local database stats entry after a
+ * successful flush.
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_tabstat(PgStatEnvelope * lenv, bool nowait)
+{
+    static const PgStat_TableCounts all_zeroes;
+    Oid            dboid;            /* database OID of the table */
+    PgStat_TableStatus *lstats; /* local stats entry  */
+    PgStatEnvelope *shenv;        /* shared stats envelope */
+    PgStat_StatTabEntry *shtabstats;    /* table entry of shared stats */
+    PgStat_StatDBEntry *ldbstats;    /* local database entry */
+    bool        found;
+
+    Assert(lenv->type == PGSTAT_TYPE_TABLE);
+
+    lstats = (PgStat_TableStatus *) &lenv->body;
+    dboid = lstats->t_shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Ignore entries that didn't accumulate any actual counts, such as
+     * indexes that were opened by the planner but not used.
+     */
+    if (memcmp(&lstats->t_counts, &all_zeroes,
+               sizeof(PgStat_TableCounts)) == 0)
+        return true;
+
+    /* find shared table stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_TABLE, dboid, lstats->t_id,
+                           nowait, init_tabentry, &found);
+
+    /* skip if dshash failed to acquire lock */
+    if (shenv == NULL)
+        return false;
+
+    /* retrieve the shared table stats entry from the envelope */
+    shtabstats = (PgStat_StatTabEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;
+
+    /* add the values to the shared entry. */
+    shtabstats->numscans += lstats->t_counts.t_numscans;
+    shtabstats->tuples_returned += lstats->t_counts.t_tuples_returned;
+    shtabstats->tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    shtabstats->tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    shtabstats->tuples_updated += lstats->t_counts.t_tuples_updated;
+    shtabstats->tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    shtabstats->tuples_hot_updated += lstats->t_counts.t_tuples_hot_updated;
+
+    /*
+     * If the table was truncated or vacuum/analyze has run, first reset the
+     * live/dead counters.
+     */
+    if (lstats->t_counts.t_truncated ||
+        lstats->t_counts.vacuum_count > 0 ||
+        lstats->t_counts.analyze_count > 0 ||
+        lstats->t_counts.autovac_vacuum_count > 0 ||
+        lstats->t_counts.autovac_analyze_count > 0)
+    {
+        shtabstats->n_live_tuples = 0;
+        shtabstats->n_dead_tuples = 0;
+    }
+
+    /* clear the change counter if requested */
+    if (lstats->t_counts.reset_changed_tuples)
+        shtabstats->changes_since_analyze = 0;
+
+    shtabstats->n_live_tuples += lstats->t_counts.t_delta_live_tuples;
+    shtabstats->n_dead_tuples += lstats->t_counts.t_delta_dead_tuples;
+    shtabstats->changes_since_analyze += lstats->t_counts.t_changed_tuples;
+    shtabstats->blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    shtabstats->blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /*
+     * Update vacuum/analyze timestamps and counters, making sure the
+     * timestamps never go backwards.
+     */
+    if (shtabstats->vacuum_timestamp < lstats->vacuum_timestamp)
+        shtabstats->vacuum_timestamp = lstats->vacuum_timestamp;
+    shtabstats->vacuum_count += lstats->t_counts.vacuum_count;
+
+    if (shtabstats->autovac_vacuum_timestamp < lstats->autovac_vacuum_timestamp)
+        shtabstats->autovac_vacuum_timestamp = lstats->autovac_vacuum_timestamp;
+    shtabstats->autovac_vacuum_count += lstats->t_counts.autovac_vacuum_count;
+
+    if (shtabstats->analyze_timestamp < lstats->analyze_timestamp)
+        shtabstats->analyze_timestamp = lstats->analyze_timestamp;
+    shtabstats->analyze_count += lstats->t_counts.analyze_count;
+
+    if (shtabstats->autovac_analyze_timestamp < lstats->autovac_analyze_timestamp)
+        shtabstats->autovac_analyze_timestamp = lstats->autovac_analyze_timestamp;
+    shtabstats->autovac_analyze_count += lstats->t_counts.autovac_analyze_count;
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    shtabstats->n_live_tuples = Max(shtabstats->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    shtabstats->n_dead_tuples = Max(shtabstats->n_dead_tuples, 0);
+
+    LWLockRelease(&shenv->lock);
+
+    /* Successfully flushed, so add the same counts to local database stats */
+    ldbstats = get_local_dbstat_entry(dboid);
+    ldbstats->counts.n_tuples_returned += lstats->t_counts.t_tuples_returned;
+    ldbstats->counts.n_tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    ldbstats->counts.n_tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    ldbstats->counts.n_tuples_updated += lstats->t_counts.t_tuples_updated;
+    ldbstats->counts.n_tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    ldbstats->counts.n_blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    ldbstats->counts.n_blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    return true;
+}
+
+/* ----------
+ * init_tabentry() -
+ *
+ * initializes table stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
  */
 static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+init_tabentry(PgStatEnvelope * env)
 {
-    int            n;
-    int            len;
+    PgStat_StatTabEntry *tabent = (PgStat_StatTabEntry *) &env->body;
+
+    /*
+     * This is a new table entry; initialize all counters and timestamps
+     * to zero.
+     */
+    Assert(env->type == PGSTAT_TYPE_TABLE);
+    tabent->tableid = env->objectid;
+    tabent->numscans = 0;
+    tabent->tuples_returned = 0;
+    tabent->tuples_fetched = 0;
+    tabent->tuples_inserted = 0;
+    tabent->tuples_updated = 0;
+    tabent->tuples_deleted = 0;
+    tabent->tuples_hot_updated = 0;
+    tabent->n_live_tuples = 0;
+    tabent->n_dead_tuples = 0;
+    tabent->changes_since_analyze = 0;
+    tabent->blocks_fetched = 0;
+    tabent->blocks_hit = 0;
+
+    tabent->vacuum_timestamp = 0;
+    tabent->vacuum_count = 0;
+    tabent->autovac_vacuum_timestamp = 0;
+    tabent->autovac_vacuum_count = 0;
+    tabent->analyze_timestamp = 0;
+    tabent->analyze_count = 0;
+    tabent->autovac_analyze_timestamp = 0;
+    tabent->autovac_analyze_count = 0;
+}
+
+
+/*
+ * flush_funcstat - flush out a local function stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_funcstat(PgStatEnvelope * env, bool nowait)
+{
+    /* we assume this inits to all zeroes: */
+    static const PgStat_FunctionCounts all_zeroes;
+    PgStat_BackendFunctionEntry *localent;    /* local stats entry */
+    PgStatEnvelope *shenv;        /* shared stats envelope */
+    PgStat_StatFuncEntry *sharedent = NULL; /* shared stats entry */
+    bool        found;
+
+    Assert(env->type == PGSTAT_TYPE_FUNCTION);
+    localent = (PgStat_BackendFunctionEntry *) &env->body;
+
+    /* Skip it if no counts accumulated for it so far */
+    if (memcmp(&localent->f_counts, &all_zeroes,
+               sizeof(PgStat_FunctionCounts)) == 0)
+        return true;
+
+    /* find shared function stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId, localent->f_id,
+                           nowait, init_funcentry, &found);
+    /* skip if dshash failed to acquire lock */
+    if (shenv == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    /* retrieve the shared function stats entry from the envelope */
+    sharedent = (PgStat_StatFuncEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
+
+    sharedent->f_numcalls += localent->f_counts.f_numcalls;
+    sharedent->f_total_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_total_time);
+    sharedent->f_self_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_self_time);
+
+    LWLockRelease(&shenv->lock);
+
+    return true;
+}
+
+
+/* ----------
+ * init_funcentry() -
+ *
+ * initializes function stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
+ */
+static void
+init_funcentry(PgStatEnvelope * env)
+{
+    PgStat_StatFuncEntry *shstat = (PgStat_StatFuncEntry *) &env->body;
+
+    Assert(env->type == PGSTAT_TYPE_FUNCTION);
+    shstat->functionid = env->objectid;
+    shstat->f_numcalls = 0;
+    shstat->f_total_time = 0;
+    shstat->f_self_time = 0;
+}
+
+
+/*
+ * flush_dbstat - flush out a local database stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_dbstat(PgStatEnvelope * env, bool nowait)
+{
+    PgStat_StatDBEntry *localent;
+    PgStatEnvelope *shenv;
+    PgStat_StatDBEntry *sharedent;
+
+    Assert(env->type == PGSTAT_TYPE_DB);
+
+    localent = (PgStat_StatDBEntry *) &env->body;
+
+    /* find shared database stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_DB, localent->databaseid, InvalidOid,
+                           nowait, init_dbentry, NULL);
+
+    /* skip if dshash failed to acquire lock */
+    if (!shenv)
+        return false;
+
+    /* retrieve the shared stats entry from the envelope */
+    sharedent = (PgStat_StatDBEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;
+
+    sharedent->counts.n_tuples_returned += localent->counts.n_tuples_returned;
+    sharedent->counts.n_tuples_fetched += localent->counts.n_tuples_fetched;
+    sharedent->counts.n_tuples_inserted += localent->counts.n_tuples_inserted;
+    sharedent->counts.n_tuples_updated += localent->counts.n_tuples_updated;
+    sharedent->counts.n_tuples_deleted += localent->counts.n_tuples_deleted;
+    sharedent->counts.n_blocks_fetched += localent->counts.n_blocks_fetched;
+    sharedent->counts.n_blocks_hit += localent->counts.n_blocks_hit;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    sharedent->counts.n_deadlocks += localent->counts.n_deadlocks;
+    sharedent->counts.n_temp_bytes += localent->counts.n_temp_bytes;
+    sharedent->counts.n_temp_files += localent->counts.n_temp_files;
+    sharedent->counts.n_checksum_failures += localent->counts.n_checksum_failures;
 
     /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
+     * Accumulate xact commit/rollback and I/O timings to stats entry of the
+     * current database.
      */
-    if (OidIsValid(tsmsg->m_databaseid))
+    if (OidIsValid(localent->databaseid))
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
+        sharedent->counts.n_xact_commit += pgStatXactCommit;
+        sharedent->counts.n_xact_rollback += pgStatXactRollback;
+        sharedent->counts.n_block_read_time += pgStatBlockReadTime;
+        sharedent->counts.n_block_write_time += pgStatBlockWriteTime;
         pgStatXactCommit = 0;
         pgStatXactRollback = 0;
         pgStatBlockReadTime = 0;
@@ -972,257 +1131,149 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        sharedent->counts.n_xact_commit = 0;
+        sharedent->counts.n_xact_rollback = 0;
+        sharedent->counts.n_block_read_time = 0;
+        sharedent->counts.n_block_write_time = 0;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
+    LWLockRelease(&shenv->lock);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    return true;
 }
 
-/*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
+
+/* ----------
+ * init_dbentry() -
+ *
+ * initializes database stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
  */
 static void
-pgstat_send_funcstats(void)
+init_dbentry(PgStatEnvelope * env)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_FunctionCounts all_zeroes;
+    PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) &env->body;
+
+    Assert(env->type == PGSTAT_TYPE_DB);
+    dbentry->databaseid = env->databaseid;
+    dbentry->last_autovac_time = 0;
+    dbentry->last_checksum_failure = 0;
+    dbentry->stat_reset_timestamp = 0;
+    dbentry->stats_timestamp = 0;
+    /* initialize the new shared entry */
+    MemSet(&dbentry->counts, 0, sizeof(PgStat_StatDBCounts));
+}
+
 
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
+/*
+ * flush_slrustat - flush out local SLRU stats entries
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if all local SLRU stats are successfully flushed out.
+ */
+static bool
+flush_slrustat(bool nowait)
+{
+    int i;
 
-    if (pgStatFunctions == NULL)
-        return;
+    if (!have_slrustats)
+        return true;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shared_SLRUStats->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shared_SLRUStats->lock, LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
 
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    for (i = 0 ; i < SLRU_NUM_ELEMENTS ; i++)
     {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
+        PgStat_StatSLRUEntry *sharedent = &shared_SLRUStats->entry[i];
+        PgStat_StatSLRUEntry *localent = &local_SLRUStats[i];
+
+        sharedent->blocks_zeroed += localent->blocks_zeroed;
+        sharedent->blocks_hit += localent->blocks_hit;
+        sharedent->blocks_read += localent->blocks_read;
+        sharedent->blocks_written += localent->blocks_written;
+        sharedent->blocks_exists += localent->blocks_exists;
+        sharedent->flush += localent->flush;
+        sharedent->truncate += localent->truncate;
+    }
 
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
+    /* done, clear the local entry */
+    MemSet(local_SLRUStats, 0,
+           sizeof(PgStat_StatSLRUEntry) * SLRU_NUM_ELEMENTS);
+    LWLockRelease(&shared_SLRUStats->lock);
 
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
+    have_slrustats = false;
 
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
+    return true;
+}
 
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
 
-    have_function_stats = false;
+/*
+ * Create the filename for a DB stat file; filename is an output parameter
+ * that points to a character buffer of length len.
+ */
+static void
+get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
+{
+    int            printed;
+
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/db_%u.%s",
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
+                       databaseid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
 }
 
 
 /* ----------
- * pgstat_vacuum_stat() -
+ * collect_stat_entries() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *    Collect the shared statistics entries specified by type and dbid. Returns
+ *  a NULL-terminated list of pointers to shared statistics in palloc'ed
+ *  memory. If type is PGSTAT_TYPE_ALL, all types of statistics of the
+ *  database are collected. If type is PGSTAT_TYPE_DB, the parameter dbid is
+ *  ignored and all PGSTAT_TYPE_DB entries are collected.
  * ----------
  */
-void
-pgstat_vacuum_stat(void)
+static PgStatEnvelope * *collect_stat_entries(PgStatTypes type, Oid dbid)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Read pg_database and make a list of OIDs of all existing databases
-     */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
-
-    /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            dbid = dbentry->databaseid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        /* the DB entry for shared tables (with InvalidOid) is never dropped */
-        if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
-            pgstat_drop_database(dbid);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Lookup our own database entry; if not found, nothing more to do.
-     */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
-        return;
-
-    /*
-     * Similarly to above, make a list of all known relations in this DB.
-     */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
-
-    /*
-     * Check for all tables listed in stats hashtable if they still exist.
-     */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
+    int            listlen = 16;
+    PgStatEnvelope **envlist = palloc(sizeof(PgStatEnvelope * *) * listlen);
+    int            n = 0;
+
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
     {
-        Oid            tabid = tabentry->tableid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+        if ((type != PGSTAT_TYPE_ALL && p->key.type != type) ||
+            (type != PGSTAT_TYPE_DB && p->key.databaseid != dbid))
             continue;
 
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
+        if (n >= listlen - 1)
         {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
+            listlen *= 2;
+            envlist = repalloc(envlist, listlen * sizeof(PgStatEnvelope * *));
         }
+        envlist[n++] = dsa_get_address(area, p->env);
     }
+    dshash_seq_term(&hstat);
 
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
+    envlist[n] = NULL;
 
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
-        {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
-                continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
-        }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
-    }
+    return envlist;
 }
 
 
 /* ----------
- * pgstat_collect_oids() -
+ * collect_oids() -
  *
  *    Collect the OIDs of all objects listed in the specified system catalog
  *    into a temporary hash table.  Caller should hash_destroy the result
@@ -1231,7 +1282,7 @@ pgstat_vacuum_stat(void)
  * ----------
  */
 static HTAB *
-pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
+collect_oids(Oid catalogid, AttrNumber anum_oid)
 {
     HTAB       *htab;
     HASHCTL        hash_ctl;
@@ -1245,7 +1296,7 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     hash_ctl.entrysize = sizeof(Oid);
     hash_ctl.hcxt = CurrentMemoryContext;
     htab = hash_create("Temporary table of OIDs",
-                       PGSTAT_TAB_HASH_SIZE,
+                       PGSTAT_TABLE_HASH_SIZE,
                        &hash_ctl,
                        HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
 
@@ -1272,65 +1323,185 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
 }
 
 
+/* ----------
+ * pgstat_vacuum_stat() -
+ *
+ *  Delete shared stat entries that are not in system catalogs.
+ *
+ *  To avoid holding an exclusive lock on the dshash for a long time, the
+ *  work is performed in three steps.
+ *
+ *   1: Collect the OIDs of all existing objects of every kind.
+ *   2: Collect victim entries by scanning with a shared lock.
+ *   3: Try removing every nominated entry without waiting for the lock.
+ *
+ *  As a consequence of the last step, some entries may be left behind due to
+ *  lock failure; such entries are simply retried and deleted by later
+ *  vacuums.
+ * ----------
+ */
+void
+pgstat_vacuum_stat(void)
+{
+    HTAB       *dbids;            /* database ids */
+    HTAB       *relids;            /* relation ids in the current database */
+    HTAB       *funcids;        /* function ids in the current database */
+    PgStatEnvelope **victims;    /* victim entry list */
+    int            arraylen = 0;    /* storage size of the above */
+    int            nvictims = 0;    /* # of entries of the above */
+    dshash_seq_status dshstat;
+    PgStatHashEntry *ent;
+    int            i;
+
+    /* we don't collect stats in standalone mode */
+    if (!IsUnderPostmaster)
+        return;
+
+    /* collect OIDs of existing objects */
+    dbids = collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    relids = collect_oids(RelationRelationId, Anum_pg_class_oid);
+    funcids = collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
+
+    /* collect victims from shared stats */
+    arraylen = 16;
+    victims = palloc(sizeof(PgStatEnvelope * *) * arraylen);
+    nvictims = 0;
+
+    dshash_seq_init(&dshstat, pgStatSharedHash, false);
+
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
+    {
+        HTAB       *oidtab;
+        Oid           *key;
+
+        CHECK_FOR_INTERRUPTS();
+
+        /*
+         * Skip non-database entries that don't belong to the current
+         * database; we can only check objects of our own database.
+         */
+        if (ent->key.type != PGSTAT_TYPE_DB &&
+            ent->key.databaseid != MyDatabaseId)
+            continue;
+
+        switch (ent->key.type)
+        {
+            case PGSTAT_TYPE_DB:
+                /* don't remove database entry for shared tables */
+                if (ent->key.databaseid == 0)
+                    continue;
+                oidtab = dbids;
+                key = &ent->key.databaseid;
+                break;
+
+            case PGSTAT_TYPE_TABLE:
+                oidtab = relids;
+                key = &ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                oidtab = funcids;
+                key = &ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_ALL:
+                Assert(false);
+                break;
+        }
+
+        /* Skip existent objects. */
+        if (hash_search(oidtab, key, HASH_FIND, NULL) != NULL)
+            continue;
+
+        /* extend the list if needed */
+        if (nvictims >= arraylen)
+        {
+            arraylen *= 2;
+            victims = repalloc(victims, sizeof(PgStatEnvelope *) * arraylen);
+        }
+
+        victims[nvictims++] = dsa_get_address(area, ent->env);
+    }
+    dshash_seq_term(&dshstat);
+    hash_destroy(dbids);
+    hash_destroy(relids);
+    hash_destroy(funcids);
+
+    /* Now try removing the victim entries */
+    for (i = 0; i < nvictims; i++)
+    {
+        PgStatEnvelope *p = victims[i];
+
+        delete_stat_entry(p->type, p->databaseid, p->objectid, true);
+    }
+}
+
+
+/* ----------
+ * delete_stat_entry() -
+ *
+ *  Deletes the specified entry from shared stats hash.
+ *
+ *  Returns true when successfully deleted.
+ * ----------
+ */
+static bool
+delete_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait)
+{
+    PgStatHashEntryKey key;
+    PgStatHashEntry *ent;
+
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    ent = dshash_find_extended(pgStatSharedHash, &key,
+                               true, nowait, false, NULL);
+
+    if (!ent)
+        return false;            /* lock failed or not found */
+
+    /* The entry is exclusively locked, so we can free the chunk first. */
+    dsa_free(area, ent->env);
+    dshash_delete_entry(pgStatSharedHash, ent);
+
+    return true;
+}
+
+
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
- * ----------
+ *    Remove entries for the database that we just dropped.
+ *
+ *    Some entries might be left behind due to lock failure, or some stats
+ *    might be flushed after this, but we will still clean up the dead DB
+ *    eventually via future invocations of pgstat_vacuum_stat().
+ * ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
+    Assert(OidIsValid(databaseid));
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
+    envlist = collect_stat_entries(PGSTAT_TYPE_ALL, databaseid);
 
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
+    for (p = envlist; *p != NULL; p++)
+        delete_stat_entry((*p)->type, (*p)->databaseid, (*p)->objectid, true);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
+    pfree(envlist);
 }
-#endif                            /* NOT_USED */
 
 
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1339,20 +1510,47 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    /* Lookup the entries of the current database in the stats hash. */
+    envlist = collect_stat_entries(PGSTAT_TYPE_ALL, MyDatabaseId);
+    for (p = envlist; *p != NULL; p++)
+    {
+        PgStatEnvelope *env = *p;
+        PgStat_StatDBEntry *dbstat;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+        LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+
+        switch (env->type)
+        {
+            case PGSTAT_TYPE_TABLE:
+                init_tabentry(env);
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                init_funcentry(env);
+                break;
+
+            case PGSTAT_TYPE_DB:
+                init_dbentry(env);
+                dbstat = (PgStat_StatDBEntry *) &env->body;
+                dbstat->stat_reset_timestamp = GetCurrentTimestamp();
+                break;
+            default:
+                Assert(false);
+        }
+
+        LWLockRelease(&env->lock);
+    }
+
+    pfree(envlist);
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1361,29 +1559,37 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
+    /* Reset the archiver statistics for the cluster. */
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    /* Reset the bgwriter statistics for the cluster. */
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1392,17 +1598,42 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStatEnvelope *env;
+    PgStat_StatDBEntry *dbentry;
+    PgStatTypes stattype;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    env = get_stat_entry(PGSTAT_TYPE_DB, MyDatabaseId, InvalidOid,
+                         false, NULL, NULL);
+    Assert(env);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* Set the reset timestamp for the whole database */
+    dbentry = (PgStat_StatDBEntry *) &env->body;
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&env->lock);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Reset the counters of the target object */
+    switch (type)
+    {
+        case RESET_TABLE:
+            stattype = PGSTAT_TYPE_TABLE;
+            break;
+        case RESET_FUNCTION:
+            stattype = PGSTAT_TYPE_FUNCTION;
+    }
+
+    env = get_stat_entry(stattype, MyDatabaseId, objoid, false, NULL, NULL);
+    LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+    if (env->type == PGSTAT_TYPE_TABLE)
+        init_tabentry(env);
+    else
+    {
+        Assert(env->type == PGSTAT_TYPE_FUNCTION);
+        init_funcentry(env);
+    }
+    LWLockRelease(&env->lock);
 }
 
 /* ----------
@@ -1418,15 +1649,27 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_reset_slru_counter(const char *name)
 {
-    PgStat_MsgResetslrucounter msg;
+    int i;
+    TimestampTz    ts = GetCurrentTimestamp();
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSLRUCOUNTER);
-    msg.m_index = (name) ? pgstat_slru_index(name) : -1;
-
-    pgstat_send(&msg, sizeof(msg));
+    if (name)
+    {
+        i = pgstat_slru_index(name);
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(&shared_SLRUStats->entry[i], 0, sizeof(PgStat_StatSLRUEntry));
+        shared_SLRUStats->entry[i].stat_reset_timestamp = ts;
+        LWLockRelease(StatsLock);
+    }
+    else
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        for (i = 0 ; i < SLRU_NUM_ELEMENTS; i++)
+        {
+            MemSet(&shared_SLRUStats->entry[i], 0, sizeof(PgStat_StatSLRUEntry));
+            shared_SLRUStats->entry[i].stat_reset_timestamp = ts;
+        }
+        LWLockRelease(StatsLock);
+    }
 }
 
 /* ----------
@@ -1440,48 +1683,63 @@ pgstat_reset_slru_counter(const char *name)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if activity stats is not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    ts = GetCurrentTimestamp();
 
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Store the last autovacuum time in the database's hash table entry.
+     */
+    dbentry = get_local_dbstat_entry(dboid);
+    dbentry->last_autovac_time = ts;
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report about the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    PgStat_TableStatus *tabentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    ts = GetCurrentTimestamp();
+    tabentry = get_local_tabstat_entry(tableoid, shared);
+
+    tabentry->t_counts.t_delta_live_tuples = livetuples;
+    tabentry->t_counts.t_delta_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = ts;
+        tabentry->t_counts.autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = ts;
+        tabentry->t_counts.vacuum_count++;
+    }
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report about the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1492,9 +1750,10 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    PgStat_TableStatus *tabentry;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1502,10 +1761,10 @@ pgstat_report_analyze(Relation rel,
      * already inserted and/or deleted rows in the target table. ANALYZE will
      * have counted such rows as live or dead respectively. Because we will
      * report our counts of such rows at transaction end, we should subtract
-     * off these counts from what we send to the collector now, else they'll
-     * be double-counted after commit.  (This approach also ensures that the
-     * collector ends up with the right numbers if we abort instead of
-     * committing.)
+     * off these counts from what we write to shared stats now, else
+     * they'll be double-counted after commit.  (This approach also ensures
+     * that the shared stats end up with the right numbers if we abort
+     * instead of committing.)
      */
     if (rel->pgstat_info != NULL)
     {
@@ -1523,158 +1782,172 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    tabentry = get_local_tabstat_entry(RelationGetRelid(rel),
+                                       rel->rd_rel->relisshared);
+
+    tabentry->t_counts.t_delta_live_tuples = livetuples;
+    tabentry->t_counts.t_delta_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->t_counts.reset_changed_tuples = true;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->t_counts.autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->t_counts.analyze_count++;
+    }
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            dbent->counts.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            dbent->counts.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            dbent->counts.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            dbent->counts.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            dbent->counts.n_conflict_startup_deadlock++;
+            break;
+    }
 }
 
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-
-/* --------
- * pgstat_report_checksum_failures_in_db() -
- *
- *    Tell the collector about one or more checksum failures.
- * --------
- */
-void
-pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
-{
-    PgStat_MsgChecksumFailure msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
-
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_deadlocks++;
 }
 
 /* --------
  * pgstat_report_checksum_failure() -
  *
- *    Tell the collector about a checksum failure.
+ *    Report a checksum failure.
  * --------
  */
 void
 pgstat_report_checksum_failure(void)
 {
-    pgstat_report_checksum_failures_in_db(MyDatabaseId, 1);
+    PgStat_StatDBEntry *dbent;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_checksum_failures++;
 }
 
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
+    if (filesize == 0)            /* Is there a case where filesize is really 0? */
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_temp_bytes += filesize; /* needs an overflow check */
+    dbent->counts.n_temp_files++;
 }
 
 
-/* ----------
- * pgstat_ping() -
+/* --------
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
  *
- *    Send some junk data to the collector to increase traffic.
- * ----------
+ *    Report one or more checksum failures.
+ * --------
  */
 void
-pgstat_ping(void)
+pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
 {
-    PgStat_MsgDummy msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if we are not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dbentry = get_local_dbstat_entry(dboid);
+
+    /* add the reported failure count to the accumulated count */
+    dbentry->counts.n_checksum_failures += failurecount;
 }
 
 /* ----------
- * pgstat_send_inquiry() -
+ * pgstat_init_function_usage() -
  *
- *    Notify collector that we need fresh data.
+ *  Initialize function call usage data.
+ *  Called by the executor before invoking a function.
  * ----------
  */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/*
- * Initialize function call usage data.
- * Called by the executor before invoking a function.
- */
 void
 pgstat_init_function_usage(FunctionCallInfo fcinfo,
                            PgStat_FunctionCallUsage *fcu)
 {
+    PgStatEnvelope *env;
     PgStat_BackendFunctionEntry *htabent;
     bool        found;
 
@@ -1685,26 +1958,15 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
         return;
     }
 
-    if (!pgStatFunctions)
-    {
-        /* First time through - initialize function stat table */
-        HASHCTL        hash_ctl;
+    env = get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                               fcinfo->flinfo->fn_oid, true, &found);
+    htabent = (PgStat_BackendFunctionEntry *) &env->body;
 
-        memset(&hash_ctl, 0, sizeof(hash_ctl));
-        hash_ctl.keysize = sizeof(Oid);
-        hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
-        pgStatFunctions = hash_create("Function stat entries",
-                                      PGSTAT_FUNCTION_HASH_SIZE,
-                                      &hash_ctl,
-                                      HASH_ELEM | HASH_BLOBS);
-    }
-
-    /* Get the stats entry for this function, create if necessary */
-    htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
-                          HASH_ENTER, &found);
     if (!found)
         MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
 
+    htabent->f_id = fcinfo->flinfo->fn_oid;
+
     fcu->fs = &htabent->f_counts;
 
     /* save stats for this function, later used to compensate for recursion */
@@ -1717,31 +1979,38 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
     INSTR_TIME_SET_CURRENT(fcu->f_start);
 }
 
-/*
- * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
- *        for specified function
+/* ----------
+ * find_funcstat_entry() -
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_BackendFunctionEntry entry for specified function.
+ *
+ *  If no entry exists, return NULL without creating a new one.
+ * ----------
  */
 PgStat_BackendFunctionEntry *
 find_funcstat_entry(Oid func_id)
 {
-    if (pgStatFunctions == NULL)
+    PgStatEnvelope *env;
+
+    env = get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                               func_id, false, NULL);
+    if (!env)
         return NULL;
 
-    return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
-                                                       (void *) &func_id,
-                                                       HASH_FIND, NULL);
+    return (PgStat_BackendFunctionEntry *) &env->body;
 }
 
-/*
- * Calculate function call usage and update stat counters.
- * Called by the executor after invoking a function.
+/* ----------
+ * pgstat_end_function_usage() -
  *
- * In the case of a set-returning function that runs in value-per-call mode,
- * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
- * calls for what the user considers a single call of the function.  The
- * finalize flag should be TRUE on the last call.
+ *  Calculate function call usage and update stat counters.
+ *  Called by the executor after invoking a function.
+ *
+ *  In the case of a set-returning function that runs in value-per-call mode,
+ *  we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
+ *  calls for what the user considers a single call of the function.  The
+ *  finalize flag should be TRUE on the last call.
+ * ----------
  */
 void
 pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
@@ -1782,9 +2051,6 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
         fs->f_numcalls++;
     fs->f_total_time = f_total;
     INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
 }
 
 
@@ -1796,8 +2062,7 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
  *
  *    We assume that a relcache entry's pgstat_info field is zeroed by
  *    relcache.c when the relcache entry is made; thereafter it is long-lived
- *    data.  We can avoid repeated searches of the TabStatus arrays when the
- *    same relation is touched repeatedly within a transaction.
+ *    data.
  * ----------
  */
 void
@@ -1817,7 +2082,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1833,116 +2099,157 @@ pgstat_initstats(Relation rel)
         return;
 
     /* Else find or make the PgStat_TableStatus entry, and update link */
-    rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+    rel->pgstat_info = get_local_tabstat_entry(rel_id, rel->rd_rel->relisshared);
 }
 
-/*
- * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
+
+/* ----------
+ * get_local_stat_entry() -
+ *
+ *  Returns the local stats entry for the given type, dbid and objid.
+ *  If create is true, a new entry is created if one does not exist yet;
+ *  found must be non-null in that case.  If create is false and no entry
+ *  exists, returns NULL.
+ *
+ *  The caller must initialize the body part of the returned envelope.
+ * ----------
  */
-static PgStat_TableStatus *
-get_tabstat_entry(Oid rel_id, bool isshared)
+static PgStatEnvelope *
+get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                     bool create, bool *found)
 {
-    TabStatHashEntry *hash_entry;
-    PgStat_TableStatus *entry;
-    TabStatusArray *tsa;
-    bool        found;
+    PgStatHashEntryKey key;
+    PgStatLocalHashEntry *entry;
 
-    /*
-     * Create hash table if we don't have it already.
-     */
-    if (pgStatTabHash == NULL)
+    if (pgStatLocalHash == NULL)
     {
         HASHCTL        ctl;
 
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
+        MemSet(&ctl, 0, sizeof(ctl));
+        ctl.keysize = sizeof(PgStatHashEntryKey);
+        ctl.entrysize = sizeof(PgStatLocalHashEntry);
+
+        pgStatLocalHash = hash_create("Local stat entries",
+                                      PGSTAT_TABLE_HASH_SIZE,
+                                      &ctl,
+                                      HASH_ELEM | HASH_BLOBS);
+    }
+
+    /* Find an entry or create a new one. */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    entry = hash_search(pgStatLocalHash, &key,
+                        create ? HASH_ENTER : HASH_FIND, found);
+
+    if (!create && !entry)
+        return NULL;
+
+    if (create && !*found)
+    {
+        int            len = pgstat_localentsize[type];
+
+        entry->env = MemoryContextAlloc(TopMemoryContext,
+                                        PgStatEnvelopeSize(len));
+        entry->env->type = type;
+        entry->env->len = len;
     }
 
+    return entry->env;
+}
+
+/* ----------
+ * get_local_dbstat_entry() -
+ *
+ *  Find or create a local PgStat_StatDBEntry entry for dbid.  A new entry is
+ *  created and initialized if none exists.
+ */
+static PgStat_StatDBEntry *
+get_local_dbstat_entry(Oid dbid)
+{
+    PgStatEnvelope *env;
+    PgStat_StatDBEntry *dbentry;
+    bool        found;
+
     /*
      * Find an entry or create a new one.
      */
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
+    env = get_local_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid,
+                               true, &found);
+    dbentry = (PgStat_StatDBEntry *) &env->body;
+
     if (!found)
     {
-        /* initialize new entry with null pointer */
-        hash_entry->tsa_entry = NULL;
+        dbentry->databaseid = dbid;
+        dbentry->last_autovac_time = 0;
+        dbentry->last_checksum_failure = 0;
+        dbentry->stat_reset_timestamp = 0;
+        dbentry->stats_timestamp = 0;
+        MemSet(&dbentry->counts, 0, sizeof(PgStat_StatDBCounts));
     }
 
-    /*
-     * If entry is already valid, we're done.
-     */
-    if (hash_entry->tsa_entry)
-        return hash_entry->tsa_entry;
-
-    /*
-     * Locate the first pgStatTabList entry with free space, making a new list
-     * entry if needed.  Note that we could get an OOM failure here, but if so
-     * we have left the hashtable and the list in a consistent state.
-     */
-    if (pgStatTabList == NULL)
-    {
-        /* Set up first pgStatTabList entry */
-        pgStatTabList = (TabStatusArray *)
-            MemoryContextAllocZero(TopMemoryContext,
-                                   sizeof(TabStatusArray));
-    }
+    return dbentry;
+}
 
-    tsa = pgStatTabList;
-    while (tsa->tsa_used >= TABSTAT_QUANTUM)
-    {
-        if (tsa->tsa_next == NULL)
-            tsa->tsa_next = (TabStatusArray *)
-                MemoryContextAllocZero(TopMemoryContext,
-                                       sizeof(TabStatusArray));
-        tsa = tsa->tsa_next;
-    }
 
-    /*
-     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
-     * the entry was already zeroed, either at creation or after last use.
-     */
-    entry = &tsa->tsa_entries[tsa->tsa_used++];
-    entry->t_id = rel_id;
-    entry->t_shared = isshared;
+/* ----------
+ * get_local_tabstat_entry() -
+ *  Find or create a PgStat_TableStatus entry for rel.  A new entry is created
+ *  and initialized if none exists.
+ * ----------
+ */
+static PgStat_TableStatus *
+get_local_tabstat_entry(Oid rel_id, bool isshared)
+{
+    PgStatEnvelope *env;
+    PgStat_TableStatus *tabentry;
+    bool        found;
 
-    /*
-     * Now we can fill the entry in pgStatTabHash.
-     */
-    hash_entry->tsa_entry = entry;
+    env = get_local_stat_entry(PGSTAT_TYPE_TABLE,
+                               isshared ? InvalidOid : MyDatabaseId,
+                               rel_id, true, &found);
 
-    return entry;
+    tabentry = (PgStat_TableStatus *) &env->body;
+
+    if (!found)
+    {
+        tabentry->t_id = rel_id;
+        tabentry->t_shared = isshared;
+        tabentry->trans = NULL;
+        MemSet(&tabentry->t_counts, 0, sizeof(PgStat_TableCounts));
+        tabentry->vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+    }
+
+    return tabentry;
 }
 
+
 /*
  * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_TableStatus entry for rel from the current
+ *  database then from shared tables.
  *
- * Note: if we got an error in the most recent execution of pgstat_report_stat,
- * it's possible that an entry exists but there's no hashtable entry for it.
- * That's okay, we'll treat this case as "doesn't exist".
+ *  If no entry, return NULL, don't create a new one
+ * ----------
  */
 PgStat_TableStatus *
 find_tabstat_entry(Oid rel_id)
 {
-    TabStatHashEntry *hash_entry;
+    PgStatEnvelope *env;
 
-    /* If hashtable doesn't exist, there are no entries at all */
-    if (!pgStatTabHash)
-        return NULL;
+    env = get_local_stat_entry(PGSTAT_TYPE_TABLE, MyDatabaseId, rel_id,
+                               false, NULL);
+    if (!env)
+        env = get_local_stat_entry(PGSTAT_TYPE_TABLE, InvalidOid, rel_id,
+                                   false, NULL);
+    if (env)
+        return (PgStat_TableStatus *) &env->body;
 
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
-    if (!hash_entry)
-        return NULL;
-
-    /* Note that this step could also return NULL, but that's correct */
-    return hash_entry->tsa_entry;
+    return NULL;
 }
 
 /*
@@ -2373,8 +2680,8 @@ AtPrepare_PgStat(void)
  *
  * All we need do here is unlink the transaction stats state from the
  * nontransactional state.  The nontransactional action counts will be
- * reported to the stats collector immediately, while the effects on live
- * and dead tuple counts are preserved in the 2PC state file.
+ * reported to the activity stats facility immediately, while the effects on
+ * live and dead tuple counts are preserved in the 2PC state file.
  *
  * Note: AtEOXact_PgStat is not called during PREPARE.
  */
@@ -2419,7 +2726,7 @@ pgstat_twophase_postcommit(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, commit case */
     pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
@@ -2455,7 +2762,7 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, abort case */
     if (rec->t_truncated)
@@ -2472,88 +2779,176 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 }
 
 
+/* ----------
+ * snapshot_statentry() -
+ *
+ *  Common routine for functions pgstat_fetch_stat_*entry()
+ *
+ *  Returns the pointer to the snapshot of the shared entry for the key or NULL
+ *  if not found. Returned snapshots are stable during the current transaction
+ *  or until pgstat_clear_snapshot() is called.
+ *
+ *  Created snapshots are stored in pgStatSnapshotHash.
+ */
+static void *
+snapshot_statentry(const PgStatTypes type, const Oid dbid, const Oid objid)
+{
+    PgStatSnapshot *snap = NULL;
+    bool        found;
+    PgStatHashEntryKey key;
+    size_t        statentsize = pgstat_entsize[type];
+
+    Assert(type != PGSTAT_TYPE_ALL);
+
+    /*
+     * Create new hash, with rather arbitrary initial number of entries since
+     * we don't know how this hash will grow.
+     */
+    if (!pgStatSnapshotHash)
+    {
+        HASHCTL        ctl;
+
+        /*
+         * Create the hash in the stats context
+         *
+         * Each entry is preceded by a common header, represented by
+         * PgStatSnapshot.
+         */
+
+        ctl.keysize = sizeof(PgStatHashEntryKey);
+        ctl.entrysize = PgStatSnapshotSize(statentsize);
+        ctl.hcxt = pgStatSnapshotContext;
+        pgStatSnapshotHash = hash_create("pgstat snapshot hash", 32, &ctl,
+                                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    }
+
+    /* Find the snapshot entry, creating it if it does not exist */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+
+    snap = hash_search(pgStatSnapshotHash, &key, HASH_ENTER, &found);
+
+    /*
+     * Consult the shared hash if the entry is not in the snapshot hash.
+     *
+     * Inside a transaction we must create a snapshot entry for consistency.
+     * Outside a transaction we return up-to-date values, but we still need a
+     * local copy because a shared stats entry can be modified at any time,
+     * so we reuse the same snapshot entry for that purpose.
+     */
+    if (!found || !IsTransactionState())
+    {
+        PgStatEnvelope *shenv;
+
+        shenv = get_stat_entry(type, dbid, objid, true, NULL, NULL);
+
+        if (shenv)
+            memcpy(&snap->body, &shenv->body, statentsize);
+
+        snap->negative = !shenv;
+    }
+
+    if (snap->negative)
+        return NULL;
+
+    return &snap->body;
+}
+
+
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    Find a database's stats entry (backends only). The returned entry is
+ *    cached until transaction end or pgstat_clear_snapshot() is called.
  * ----------
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    /* If not done for this transaction, take a snapshot of global stats */
+    pgstat_snapshot_global_stats();
+
+    /* the caller has no use for the snapshot-local members */
+    return (PgStat_StatDBEntry *)
+        snapshot_statentry(PGSTAT_TYPE_DB, dbid, InvalidOid);
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one table or NULL. NULL doesn't mean
+ *    the activity statistics for one table or NULL. NULL doesn't mean
  *    that the table doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    activity statistics facility, so the caller is better off
+ *    reporting ZERO instead.
  * ----------
  */
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
-    PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(false, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
-
-    return NULL;
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(true, relid);
+    return tabentry;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_tabentry_snapshot() -
+ *
+ *    Find a table's stats entry (backends only). The returned entry is cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_snapshot(bool shared, Oid reloid)
+{
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    return (PgStat_StatTabEntry *)
+        snapshot_statentry(PGSTAT_TYPE_TABLE, dboid, reloid);
+}
+
+
+/* ----------
+ * pgstat_copy_index_counters() -
+ *
+ *    Support function for index swapping. Copy a subset of the relation's
+ *    counters into *dst.
+ * ----------
+ */
+void
+pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst)
+{
+    PgStat_StatTabEntry *tabentry;
+
+    /* No point fetching tabentry when dst is NULL */
+    if (!dst)
+        return;
+
+    tabentry = pgstat_fetch_stat_tabentry(relid);
+
+    if (!tabentry)
+        return;
+
+    dst->t_counts.t_numscans = tabentry->numscans;
+    dst->t_counts.t_tuples_returned = tabentry->tuples_returned;
+    dst->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
+    dst->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
+    dst->t_counts.t_blocks_hit = tabentry->blocks_hit;
 }
 
 
@@ -2567,25 +2962,60 @@ pgstat_fetch_stat_tabentry(Oid relid)
 PgStat_StatFuncEntry *
 pgstat_fetch_stat_funcentry(Oid func_id)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry = NULL;
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
 
-    /* load the stats file if needed */
-    backend_read_statsfile();
+    return (PgStat_StatFuncEntry *)
+        snapshot_statentry(PGSTAT_TYPE_FUNCTION, MyDatabaseId, func_id);
+}
 
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
+/*
+ * pgstat_snapshot_global_stats() -
+ *
+ * Takes a snapshot of global stats if not already done.  It is kept until
+ * the next call to pgstat_clear_snapshot() or the end of the current
+ * memory context (typically TopTransactionContext).
+ * ----------
+ */
+static void
+pgstat_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext;
+    int    i;
+
+    attach_shared_stats();
+
+    /* Nothing to do if already done */
+    if (global_snapshot_is_valid)
+        return;
+
+    oldcontext = MemoryContextSwitchTo(pgStatSnapshotContext);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+
+    memcpy(&snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+    memcpy(&snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+    memcpy(snapshot_SLRUStats, shared_SLRUStats,
+           sizeof(PgStat_StatSLRUEntry) * SLRU_NUM_ELEMENTS);
+
+    LWLockRelease(StatsLock);
+
+    /* Fill in any unset SLRU reset timestamps from the global stats */
+    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
     {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
+        if (snapshot_SLRUStats[i].stat_reset_timestamp == 0)
+            snapshot_SLRUStats[i].stat_reset_timestamp =
+                snapshot_globalStats.stat_reset_timestamp;
     }
+    global_snapshot_is_valid = true;
 
-    return funcentry;
+    MemoryContextSwitchTo(oldcontext);
+
+    return;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_beentry() -
  *
@@ -2656,9 +3086,10 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &archiverStats;
+    return &snapshot_archiverStats;
 }
 
 
@@ -2673,9 +3104,10 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &globalStats;
+    return &snapshot_globalStats;
 }
 
 
@@ -2687,12 +3119,12 @@ pgstat_fetch_global(void)
  *    a pointer to the slru statistics struct.
  * ---------
  */
-PgStat_SLRUStats *
+PgStat_StatSLRUEntry *
 pgstat_fetch_slru(void)
 {
-    backend_read_statsfile();
+    pgstat_snapshot_global_stats();
 
-    return slruStats;
+    return snapshot_SLRUStats;
 }
 
 
@@ -2906,8 +3338,8 @@ pgstat_initialize(void)
         MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
     }
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* The cleanup hook must run before DSM shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -3083,12 +3515,15 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+    /* attach shared database stats area */
+    attach_shared_stats();
 }
 
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
- * Flush any remaining statistics counts out to the collector.
+ * Flush any remaining statistics counts out to shared stats.
  * Without this, operations triggered during backend exit (such as
  * temp table deletions) won't be counted.
  *
@@ -3101,7 +3536,7 @@ pgstat_beshutdown_hook(int code, Datum arg)
 
     /*
      * If we got as far as discovering our own database ID, we can report what
-     * we did to the collector.  Otherwise, we'd be sending an invalid
+     * we did to the shared stats.  Otherwise, we'd be sending an invalid
      * database ID, so forget it.  (This means that accesses to pg_database
      * during failed backend starts might never get counted.)
      */
@@ -3118,6 +3553,8 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    detach_shared_stats(true);
 }
 
 
@@ -3378,7 +3815,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3673,9 +4111,6 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
-            break;
         case WAIT_EVENT_RECOVERY_WAL_STREAM:
             event_name = "RecoveryWalStream";
             break;
@@ -4310,94 +4745,71 @@ pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
 
 
 /* ----------
- * pgstat_setheader() -
+ * pgstat_report_archiver() -
  *
- *        Set common header fields in a statistics message
- * ----------
- */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
-    {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
- *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
+ *        Report archiver statistics
  * ----------
  */
 void
-pgstat_send_archiver(const char *xlog, bool failed)
+pgstat_report_archiver(const char *xlog, bool failed)
 {
-    PgStat_MsgArchiver msg;
+    TimestampTz now = GetCurrentTimestamp();
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+    if (failed)
+    {
+        /* Failed archival attempt */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->failed_count;
+        memcpy(shared_archiverStats->last_failed_wal, xlog,
+               sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    else
+    {
+        /* Successful archival operation */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->archived_count;
+        memcpy(shared_archiverStats->last_archived_wal, xlog,
+               sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
 }
 
 /* ----------
- * pgstat_send_bgwriter() -
+ * pgstat_report_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
-pgstat_send_bgwriter(void)
+pgstat_report_bgwriter(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+
+    PgStat_BgWriter *l = &BgWriterStats;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid taking the lock when the stats are completely empty.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += l->timed_checkpoints;
+    shared_globalStats->requested_checkpoints += l->requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += l->checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += l->checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += l->buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += l->buf_written_clean;
+    shared_globalStats->maxwritten_clean += l->maxwritten_clean;
+    shared_globalStats->buf_written_backend += l->buf_written_backend;
+    shared_globalStats->buf_fsync_backend += l->buf_fsync_backend;
+    shared_globalStats->buf_alloc += l->buf_alloc;
+    LWLockRelease(StatsLock);
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4405,473 +4817,31 @@ pgstat_send_bgwriter(void)
     MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
 }
 
-/* ----------
- * pgstat_send_slru() -
- *
- *        Send SLRU statistics to the collector
- * ----------
- */
-static void
-pgstat_send_slru(void)
-{
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgSLRU all_zeroes;
-
-    for (int i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /*
-         * This function can be called even if nothing at all has happened. In
-         * this case, avoid sending a completely empty message to the stats
-         * collector.
-         */
-        if (memcmp(&SLRUStats[i], &all_zeroes, sizeof(PgStat_MsgSLRU)) == 0)
-            continue;
-
-        /* set the SLRU type before each send */
-        SLRUStats[i].m_index = i;
-
-        /*
-         * Prepare and send the message
-         */
-        pgstat_setheader(&SLRUStats[i].m_hdr, PGSTAT_MTYPE_SLRU);
-        pgstat_send(&SLRUStats[i], sizeof(PgStat_MsgSLRU));
-
-        /*
-         * Clear out the statistics buffer, so it can be re-used.
-         */
-        MemSet(&SLRUStats[i], 0, sizeof(PgStat_MsgSLRU));
-    }
-}
-
-
-/* ----------
- * PgstatCollectorMain() -
- *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, SignalHandlerForConfigReload);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    MyBackendType = B_STATS_COLLECTOR;
-    init_ps_display(NULL);
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do check ConfigReloadPending
-     * inside the inner loop, which means that such interrupts will get
-     * serviced but the latch won't get cleared until next time there is a
-     * break in the action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (ShutdownRequestPending)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * ShutdownRequestPending becomes set.
-         */
-        while (!ShutdownRequestPending)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (ConfigReloadPending)
-            {
-                ConfigReloadPending = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSLRUCOUNTER:
-                    pgstat_recv_resetslrucounter(&msg.msg_resetslrucounter,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_SLRU:
-                    pgstat_recv_slru(&msg.msg_slru, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr & WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    exit(0);
-}
-
-/*
- * Subroutine to clear stats in a database entry
- *
- * Tables and functions hashes are initialized to empty.
- */
-static void
-reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
-{
-    HASHCTL        hash_ctl;
-
-    dbentry->n_xact_commit = 0;
-    dbentry->n_xact_rollback = 0;
-    dbentry->n_blocks_fetched = 0;
-    dbentry->n_blocks_hit = 0;
-    dbentry->n_tuples_returned = 0;
-    dbentry->n_tuples_fetched = 0;
-    dbentry->n_tuples_inserted = 0;
-    dbentry->n_tuples_updated = 0;
-    dbentry->n_tuples_deleted = 0;
-    dbentry->last_autovac_time = 0;
-    dbentry->n_conflict_tablespace = 0;
-    dbentry->n_conflict_lock = 0;
-    dbentry->n_conflict_snapshot = 0;
-    dbentry->n_conflict_bufferpin = 0;
-    dbentry->n_conflict_startup_deadlock = 0;
-    dbentry->n_temp_files = 0;
-    dbentry->n_temp_bytes = 0;
-    dbentry->n_deadlocks = 0;
-    dbentry->n_checksum_failures = 0;
-    dbentry->last_checksum_failure = 0;
-    dbentry->n_block_read_time = 0;
-    dbentry->n_block_write_time = 0;
-
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
-
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
-}
-
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
-{
-    PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
-     */
-    if (!found)
-        reset_dbentry_counters(result);
-
-    return result;
-}
-
-
-/*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
-{
-    PgStat_StatTabEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /* If not found, initialize the new one. */
-    if (!found)
-    {
-        result->numscans = 0;
-        result->tuples_returned = 0;
-        result->tuples_fetched = 0;
-        result->tuples_inserted = 0;
-        result->tuples_updated = 0;
-        result->tuples_deleted = 0;
-        result->tuples_hot_updated = 0;
-        result->n_live_tuples = 0;
-        result->n_dead_tuples = 0;
-        result->changes_since_analyze = 0;
-        result->inserts_since_vacuum = 0;
-        result->blocks_fetched = 0;
-        result->blocks_hit = 0;
-        result->vacuum_timestamp = 0;
-        result->vacuum_count = 0;
-        result->autovac_vacuum_timestamp = 0;
-        result->autovac_vacuum_count = 0;
-        result->analyze_timestamp = 0;
-        result->analyze_count = 0;
-        result->autovac_analyze_timestamp = 0;
-        result->autovac_analyze_count = 0;
-    }
-
-    return result;
-}
-
 
 /* ----------
  * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ *        Write the global statistics file, as well as DB files.
  * ----------
  */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+void
+pgstat_write_statsfiles(void)
 {
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **penv;
+
+    /* Stats are not initialized yet; just return. */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
+    create_missing_dbentries();
+
     /*
      * Open the statistics temp file to write out the current values.
      */
@@ -4888,7 +4858,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
 
     /*
      * Write the file header --- currently just a format ID.
@@ -4900,38 +4870,31 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Write global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write archiver stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Write SLRU stats struct
-     */
-    rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Walk through the database table.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_DB, InvalidOid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) &(*penv)->body;
+
         /*
          * Write out the table and function stats for this DB into the
          * appropriate per-DB stat file, if required.
          */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
+        pgstat_write_database_stats(dbentry);
 
         /*
          * Write out the DB entry. We don't write the tables or functions
@@ -4942,6 +4905,8 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
         (void) rc;                /* we'll check for error with ferror */
     }
 
+    pfree(envlist);
+
     /*
      * No more output to be done. Close the temp file and replace the old
      * pgstat.stat with it.  The ferror() check replaces testing for error
@@ -4974,55 +4939,19 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
 }
 
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
 
 /* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
+ * pgstat_write_database_stats() -
+ *  Write the stat file for a single database.
  * ----------
  */
 static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
+pgstat_write_database_stats(PgStat_StatDBEntry *dbentry)
 {
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **penv;
     FILE       *fpout;
     int32        format_id;
     Oid            dbid = dbentry->databaseid;
@@ -5030,8 +4959,8 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     char        tmpfile[MAXPGPATH];
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -5058,24 +4987,31 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     /*
      * Walk through the database's access stats per table.
      */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_TABLE, dbentry->databaseid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatTabEntry *tabentry = (PgStat_StatTabEntry *) &(*penv)->body;
+
         fputc('T', fpout);
         rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    pfree(envlist);
 
     /*
      * Walk through the database's function stats table.
      */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_FUNCTION, dbentry->databaseid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatFuncEntry *funcentry =
+        (PgStat_StatFuncEntry *) &(*penv)->body;
+
         fputc('F', fpout);
         rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    pfree(envlist);
 
     /*
      * No more output to be done. Close the temp file and replace the old
@@ -5109,102 +5045,165 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
+}
 
-    if (permanent)
+/* ----------
+ * create_missing_dbentries() -
+ *
+ *  A database entry may be missing for a database in which object stats
+ *  are recorded. This function creates such missing dbentries so that all
+ *  stats entries can be written out to files.
+ * ----------
+ */
+static void
+create_missing_dbentries(void)
+{
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
+    HTAB       *oidhash;
+    HASHCTL        ctl;
+    HASH_SEQ_STATUS scan;
+    Oid           *poid;
+
+    memset(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(Oid);
+    ctl.entrysize = sizeof(Oid);
+    ctl.hcxt = CurrentMemoryContext;
+    oidhash = hash_create("Temporary table of OIDs",
+                          PGSTAT_TABLE_HASH_SIZE,
+                          &ctl,
+                          HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+    /* Collect database OIDs from the shared stats hash */
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+        hash_search(oidhash, &p->key.databaseid, HASH_ENTER, NULL);
+    dshash_seq_term(&hstat);
+
+    /* Create database entries that do not exist yet. */
+    hash_seq_init(&scan, oidhash);
+    while ((poid = (Oid *) hash_seq_search(&scan)) != NULL)
+        (void) get_stat_entry(PGSTAT_TYPE_DB, *poid, InvalidOid,
+                              false, init_dbentry, NULL);
+
+    hash_destroy(oidhash);
+}
+
+
+/* ----------
+ * get_stat_entry() -
+ *
+ *    Get the shared stats entry for the specified type, dbid and objid.
+ *  If nowait is true, returns NULL on lock failure.
+ *
+ *  If initfunc is not NULL, a new entry is created if none exists yet, and
+ *  the function is called with the new envelope. If found is not NULL, it is
+ *  set to true if an existing entry is found, or false if not.
+ * ----------
+ */
+static PgStatEnvelope *
+get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+               bool nowait, entry_initializer initfunc, bool *found)
+{
+    bool        create = (initfunc != NULL);
+    PgStatHashEntry *shent;
+    PgStatEnvelope *shenv = NULL;
+    PgStatHashEntryKey key;
+    bool        myfound;
+
+    Assert(type != PGSTAT_TYPE_ALL);
+
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    shent = dshash_find_extended(pgStatSharedHash, &key,
+                                 create, nowait, create, &myfound);
+    if (shent)
     {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
+        if (create && !myfound)
+        {
+            /* Create new stats envelope. */
+            size_t        envsize = PgStatEnvelopeSize(pgstat_entsize[type]);
+            dsa_pointer chunk = dsa_allocate0(area, envsize);
 
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
+            shenv = dsa_get_address(area, chunk);
+            shenv->type = type;
+            shenv->databaseid = dbid;
+            shenv->objectid = objid;
+            shenv->len = pgstat_entsize[type];
+            LWLockInitialize(&shenv->lock, LWTRANCHE_STATS);
+
+            /*
+             * The lock on dshash is released just below. Call the initializer
+             * callback before the entry is exposed to other processes.
+             */
+            if (initfunc)
+                initfunc(shenv);
+
+            /* Link the new entry from the hash entry. */
+            shent->env = chunk;
+        }
+        else
+            shenv = dsa_get_address(area, shent->env);
+
+        dshash_release_lock(pgStatSharedHash, shent);
     }
+
+    if (found)
+        *found = myfound;
+
+    return shenv;
 }
 
+
 /* ----------
  * pgstat_read_statsfiles() -
  *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ *    Reads existing activity statistics files into the shared stats hash.
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+void
+pgstat_read_statsfiles(void)
 {
+    PgStatEnvelope *env;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-    int            i;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
 
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
+    /* shouldn't be called from postmaster */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-    memset(&slruStats, 0, sizeof(slruStats));
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Set the same reset timestamp for all SLRU items too.
-     */
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-        slruStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp =
+        shared_globalStats->stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the activity statistics are not running or
+     * have not yet written the stats file the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -5213,7 +5212,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
@@ -5221,49 +5220,30 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     /*
      * Read global stats struct
      */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+        sizeof(*shared_globalStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
         goto done;
     }
 
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
     /*
      * Read archiver stats struct
      */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+        sizeof(*shared_archiverStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
         goto done;
     }
 
     /*
-     * Read SLRU stats struct
-     */
-    if (fread(slruStats, 1, sizeof(slruStats), fpin) != sizeof(slruStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&slruStats, 0, sizeof(slruStats));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
     for (;;)
     {
@@ -5277,7 +5257,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
                           fpin) != offsetof(PgStat_StatDBEntry, tables))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5286,76 +5266,33 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 /*
                  * Add to the DB hash
                  */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
+
+                env = get_stat_entry(PGSTAT_TYPE_DB, dbbuf.databaseid,
+                                     InvalidOid,
+                                     false, init_dbentry, &found);
+                dbentry = (PgStat_StatDBEntry *) &env->body;
+
+                /* don't allow duplicate dbentries */
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
+                memcpy(dbentry, &dbbuf,
+                       offsetof(PgStat_StatDBEntry, tables));
 
+                /* Read the data from the database-specific file. */
+                pgstat_read_db_statsfile(dbentry);
                 break;
 
             case 'E':
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5365,59 +5302,50 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 done:
     FreeFile(fpin);
 
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 
-    return dbhash;
+    return;
 }
 
 
 /* ----------
  * pgstat_read_db_statsfile() -
  *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
+ *    Reads in the at-rest statistics file and creates shared statistics
+ *    tables. The file is removed after reading.
  * ----------
  */
 static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
+pgstat_read_db_statsfile(PgStat_StatDBEntry *dbentry)
 {
+    PgStatEnvelope *env;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatTabEntry tabbuf;
     PgStat_StatFuncEntry funcbuf;
     PgStat_StatFuncEntry *funcentry;
+    dshash_table *tabhash = NULL;
+    dshash_table *funchash = NULL;
     FILE       *fpin;
     int32        format_id;
     bool        found;
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
+    get_dbstat_filename(false, dbentry->databaseid, statfile, MAXPGPATH);
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the activity statistics are not running or
+     * have not yet written the stats file the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
@@ -5430,14 +5358,14 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
 
     /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
     for (;;)
     {
@@ -5450,25 +5378,21 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
                           fpin) != sizeof(PgStat_StatTabEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
+                env = get_stat_entry(PGSTAT_TYPE_TABLE,
+                                     dbentry->databaseid, tabbuf.tableid,
+                                     false, init_tabentry, &found);
+                tabentry = (PgStat_StatTabEntry *) &env->body;
 
+                /* don't allow duplicate entries */
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5484,25 +5408,20 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
                           fpin) != sizeof(PgStat_StatFuncEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
+                env = get_stat_entry(PGSTAT_TYPE_FUNCTION, dbentry->databaseid,
+                                     funcbuf.functionid,
+                                     false, init_funcentry, &found);
+                funcentry = (PgStat_StatFuncEntry *) &env->body;
 
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5518,7 +5437,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5526,304 +5445,15 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     }
 
 done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read SLRU stats struct
-     */
-    if (fread(mySLRUStats, 1, sizeof(mySLRUStats), fpin) != sizeof(mySLRUStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
+    if (tabhash)
+        dshash_detach(tabhash);
+    if (funchash)
+        dshash_detach(funchash);
 
-done:
     FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
-}
-
 
-/* ----------
- * pgstat_setup_memcxt() -
- *
- *    Create pgStatLocalContext, if not already done.
- * ----------
- */
-static void
-pgstat_setup_memcxt(void)
-{
-    if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 }
 
 
@@ -5842,797 +5472,25 @@ pgstat_clear_snapshot(void)
 {
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
+    {
         MemoryContextDelete(pgStatLocalContext);
 
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-                tabentry->inserts_since_vacuum = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
     }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
 
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
+    if (pgStatSnapshotContext)
     {
-        char        statfile[MAXPGPATH];
+        MemoryContextReset(pgStatSnapshotContext);
 
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
+        /* Reset variables that pointed to the context */
+        global_snapshot_is_valid = false;
+        pgStatSnapshotHash = NULL;
     }
 }
 
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_resetslrucounter() -
- *
- *    Reset some SLRU statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len)
-{
-    int            i;
-    TimestampTz ts = GetCurrentTimestamp();
-
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /* reset entry with the given index, or all entries (index is -1) */
-        if ((msg->m_index == -1) || (msg->m_index == i))
-        {
-            memset(&slruStats[i], 0, sizeof(slruStats[i]));
-            slruStats[i].stat_reset_timestamp = ts;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * It is quite possible that a non-aggressive VACUUM ended up skipping
-     * various pages, however, we'll zero the insert counter here regardless.
-     * It's currently used only to track when we need to perform an "insert"
-     * autovacuum, which are mainly intended to freeze newly inserted tuples.
-     * Zeroing this may just mean we'll not try to vacuum the table again
-     * until enough tuples have been inserted to trigger another insert
-     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
-     * stragglers.
-     */
-    tabentry->inserts_since_vacuum = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_slru() -
- *
- *    Process a SLRU message.
- * ----------
- */
-static void
-pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
-{
-    slruStats[msg->m_index].blocks_zeroed += msg->m_blocks_zeroed;
-    slruStats[msg->m_index].blocks_hit += msg->m_blocks_hit;
-    slruStats[msg->m_index].blocks_read += msg->m_blocks_read;
-    slruStats[msg->m_index].blocks_written += msg->m_blocks_written;
-    slruStats[msg->m_index].blocks_exists += msg->m_blocks_exists;
-    slruStats[msg->m_index].flush += msg->m_flush;
-    slruStats[msg->m_index].truncate += msg->m_truncate;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
-}
-
 /*
  * Convert a potentially unsafely truncated activity string (see
  * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
@@ -6721,7 +5579,7 @@ pgstat_slru_name(int slru_idx)
  * Returns pointer to entry with counters for given SLRU (based on the name
  * stored in SlruCtl as lwlock tranche name).
  */
-static inline PgStat_MsgSLRU *
+static PgStat_StatSLRUEntry *
 slru_entry(int slru_idx)
 {
     /*
@@ -6732,7 +5590,7 @@ slru_entry(int slru_idx)
 
     Assert((slru_idx >= 0) && (slru_idx < SLRU_NUM_ELEMENTS));
 
-    return &SLRUStats[slru_idx];
+    return &local_SLRUStats[slru_idx];
 }
 
 /*
@@ -6742,41 +5600,48 @@ slru_entry(int slru_idx)
 void
 pgstat_count_slru_page_zeroed(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_zeroed += 1;
+    slru_entry(slru_idx)->blocks_zeroed += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_hit(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_hit += 1;
+    slru_entry(slru_idx)->blocks_hit += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_exists(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_exists += 1;
+    slru_entry(slru_idx)->blocks_exists += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_read(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_read += 1;
+    slru_entry(slru_idx)->blocks_read += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_written(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_written += 1;
+    slru_entry(slru_idx)->blocks_written += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_flush(int slru_idx)
 {
-    slru_entry(slru_idx)->m_flush += 1;
+    slru_entry(slru_idx)->flush += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_truncate(int slru_idx)
 {
-    slru_entry(slru_idx)->m_truncate += 1;
+    slru_entry(slru_idx)->truncate += 1;
+    have_slrustats = true;
 }
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 9de9396628..def10c25a9 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -254,7 +254,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -502,7 +501,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1325,12 +1323,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1779,11 +1771,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2706,8 +2693,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -3070,8 +3055,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3138,13 +3121,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3217,22 +3193,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3693,22 +3653,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3904,8 +3848,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3929,8 +3871,7 @@ PostmasterStateMachine(void)
     {
         /*
          * PM_WAIT_DEAD_END state ends when the BackendList is entirely empty
-         * (ie, no dead_end children remain), and the archiver and stats
-         * collector are gone too.
+         * (ie, no dead_end children remain), and the archiver is gone too.
          *
          * The reason we wait for those two is to protect them against a new
          * postmaster starting conflicting subprocesses; this isn't an
@@ -3940,8 +3881,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4142,8 +4082,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5116,18 +5054,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5246,12 +5172,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6146,7 +6066,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6202,8 +6121,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6438,7 +6355,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index fbdc28ec39..984719c166 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1550,8 +1550,8 @@ is_checksummed_file(const char *fullpath, const char *filename)
  *
  * If 'missing_ok' is true, will not throw an error if the file is not found.
  *
- * If dboid is anything other than InvalidOid then any checksum failures detected
- * will get reported to the stats collector.
+ * If dboid is anything other than InvalidOid then any checksum failures
+ * detected will get reported to the activity stats facility.
  *
  * Returns true if the file was successfully sent, false if 'missing_ok',
  * and the file did not exist.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 29c920800a..32c511542d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2010,7 +2010,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                BgWriterStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2120,7 +2120,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2310,7 +2310,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2318,7 +2318,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
 
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..58a442f482 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -147,6 +147,7 @@ CreateSharedMemoryAndSemaphores(void)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -263,6 +264,7 @@ CreateSharedMemoryAndSemaphores(void)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 61bec10b79..ab724eb9c6 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -173,7 +173,9 @@ static const char *const BuiltinTrancheNames[] = {
     /* LWTRANCHE_PARALLEL_APPEND: */
     "parallel_append",
     /* LWTRANCHE_SXACT: */
-    "serializable_xact"
+    "serializable_xact",
+    /* LWTRANCHE_STATS: */
+    "activity_statistics"
 };
 
 StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index db47843229..97eccb35d3 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -49,3 +49,4 @@ MultiXactTruncationLock                41
 OldSnapshotTimeMapLock                42
 LogicalRepWorkerLock                43
 CLogTruncationLock                    44
+StatsLock                            45
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 7d667c6586..3535089755 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -414,8 +414,8 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
     DropRelFileNodesAllBuffers(rnodes, nrels);
 
     /*
-     * It'd be nice to tell the stats collector to forget them immediately,
-     * too. But we can't because we don't know the OIDs.
+     * It'd be nice to tell the activity stats facility to forget them
+     * immediately, too. But we can't because we don't know the OIDs.
      */
 
     /*
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8958ec8103..9932ce4dac 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3209,6 +3209,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3783,6 +3789,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4185,11 +4192,12 @@ PostgresMain(int argc, char *argv[],
          * Note: this includes fflush()'ing the last of the prior output.
          *
          * This is also a good time to send collected statistics to the
-         * collector, and to update the PS stats display.  We avoid doing
-         * those every time through the message loop because it'd slow down
-         * processing of batched messages, and because we don't want to report
-         * uncommitted updates (that confuses autovacuum).  The notification
-         * processor wants a call too, if we are not in a transaction block.
+         * activity stats facility, and to update the PS stats display.  We avoid
+         * doing those every time through the message loop because it'd slow
+         * down processing of batched messages, and because we don't want to
+         * report uncommitted updates (that confuses autovacuum).  The
+         * notification processor wants a call too, if we are not in a
+         * transaction block.
          */
         if (send_ready_for_query)
         {
@@ -4221,6 +4229,8 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long stats_timeout;
+
                 /* Send out notify signals and transmit self-notifies */
                 ProcessCompletedNotifies();
 
@@ -4233,8 +4243,13 @@ PostgresMain(int argc, char *argv[],
                 if (notifyInterruptPending)
                     ProcessNotifyInterrupt();
 
-                pgstat_report_stat(false);
-
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle");
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4269,7 +4284,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4277,6 +4292,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 2aff739466..b9e78dd8e8 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * pgstatfuncs.c
- *      Functions for accessing the statistics collector data
+ *      Functions for accessing the activity statistics data
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -35,9 +35,6 @@
 
 #define HAS_PGSTAT_PERMISSIONS(role)     (is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS) || has_privs_of_role(GetUserId(),role))
 
 
-/* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
-
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
 {
@@ -1261,7 +1258,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_commit);
+        result = (int64) (dbentry->counts.n_xact_commit);
 
     PG_RETURN_INT64(result);
 }
@@ -1277,7 +1274,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_rollback);
+        result = (int64) (dbentry->counts.n_xact_rollback);
 
     PG_RETURN_INT64(result);
 }
@@ -1293,7 +1290,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_fetched);
+        result = (int64) (dbentry->counts.n_blocks_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1309,7 +1306,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_hit);
+        result = (int64) (dbentry->counts.n_blocks_hit);
 
     PG_RETURN_INT64(result);
 }
@@ -1325,7 +1322,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_returned);
+        result = (int64) (dbentry->counts.n_tuples_returned);
 
     PG_RETURN_INT64(result);
 }
@@ -1341,7 +1338,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_fetched);
+        result = (int64) (dbentry->counts.n_tuples_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1357,7 +1354,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_inserted);
+        result = (int64) (dbentry->counts.n_tuples_inserted);
 
     PG_RETURN_INT64(result);
 }
@@ -1373,7 +1370,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_updated);
+        result = (int64) (dbentry->counts.n_tuples_updated);
 
     PG_RETURN_INT64(result);
 }
@@ -1389,7 +1386,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_deleted);
+        result = (int64) (dbentry->counts.n_tuples_deleted);
 
     PG_RETURN_INT64(result);
 }
@@ -1422,7 +1419,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_files;
+        result = dbentry->counts.n_temp_files;
 
     PG_RETURN_INT64(result);
 }
@@ -1438,7 +1435,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_bytes;
+        result = dbentry->counts.n_temp_bytes;
 
     PG_RETURN_INT64(result);
 }
@@ -1453,7 +1450,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace);
+        result = (int64) (dbentry->counts.n_conflict_tablespace);
 
     PG_RETURN_INT64(result);
 }
@@ -1468,7 +1465,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_lock);
+        result = (int64) (dbentry->counts.n_conflict_lock);
 
     PG_RETURN_INT64(result);
 }
@@ -1483,7 +1480,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_snapshot);
+        result = (int64) (dbentry->counts.n_conflict_snapshot);
 
     PG_RETURN_INT64(result);
 }
@@ -1498,7 +1495,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_bufferpin);
+        result = (int64) (dbentry->counts.n_conflict_bufferpin);
 
     PG_RETURN_INT64(result);
 }
@@ -1513,7 +1510,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1528,11 +1525,11 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace +
-                          dbentry->n_conflict_lock +
-                          dbentry->n_conflict_snapshot +
-                          dbentry->n_conflict_bufferpin +
-                          dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_tablespace +
+                          dbentry->counts.n_conflict_lock +
+                          dbentry->counts.n_conflict_snapshot +
+                          dbentry->counts.n_conflict_bufferpin +
+                          dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1547,7 +1544,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_deadlocks);
+        result = (int64) (dbentry->counts.n_deadlocks);
 
     PG_RETURN_INT64(result);
 }
@@ -1565,7 +1562,7 @@ pg_stat_get_db_checksum_failures(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_checksum_failures);
+        result = (int64) (dbentry->counts.n_checksum_failures);
 
     PG_RETURN_INT64(result);
 }
@@ -1602,7 +1599,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_read_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_read_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1618,7 +1615,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_write_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_write_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1704,7 +1701,7 @@ pg_stat_get_slru(PG_FUNCTION_ARGS)
     MemoryContext per_query_ctx;
     MemoryContext oldcontext;
     int            i;
-    PgStat_SLRUStats *stats;
+    PgStat_StatSLRUEntry *stats;
 
     /* check to see if caller supports us returning a tuplestore */
     if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
@@ -1738,7 +1735,7 @@ pg_stat_get_slru(PG_FUNCTION_ARGS)
         /* for each row */
         Datum        values[PG_STAT_GET_SLRU_COLS];
         bool        nulls[PG_STAT_GET_SLRU_COLS];
-        PgStat_SLRUStats stat = stats[i];
+        PgStat_StatSLRUEntry    stat = stats[i];
         const char *name;
 
         name = pgstat_slru_name(i);
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb19644419..51748c99ad 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index cca9704d2d..430e997ca6 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -241,9 +241,6 @@ GetBackendTypeDesc(BackendType backendType)
         case B_ARCHIVER:
             backendDesc = "archiver";
             break;
-        case B_STATS_COLLECTOR:
-            backendDesc = "stats collector";
-            break;
         case B_LOGGER:
             backendDesc = "logger";
             break;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index f4247ea70d..f65d05c24c 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -631,6 +632,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1241,6 +1244,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
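
Note on the postinit.c hunk above: IdleStatsUpdateTimeoutHandler only sets flags and the latch, which is the usual shape for a timeout handler that must stay signal-safe. The real work is presumably done later by whatever code checks IdleStatsUpdateTimeoutPending; that consumer is not visible in this hunk. A minimal standalone sketch of the pattern follows. Every name in it is a hypothetical stand-in, not taken from the patch, and the "flush" is only a counter.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the backend globals the handler touches. */
static volatile sig_atomic_t idle_stats_update_pending = false;
static volatile sig_atomic_t interrupt_pending = false;
static int stats_flush_count = 0;   /* counts simulated stats flushes */

/* Signal-safe part: set flags only, as the handler in the patch does. */
static void
idle_stats_update_timeout_handler(void)
{
    idle_stats_update_pending = true;
    interrupt_pending = true;
    /* The handler in the patch additionally calls SetLatch(MyLatch). */
}

/* Main-loop part: runs outside signal context, where real work is safe. */
static bool
process_interrupts(void)
{
    if (!interrupt_pending)
        return false;
    interrupt_pending = false;

    if (idle_stats_update_pending)
    {
        idle_stats_update_pending = false;
        stats_flush_count++;        /* stand-in for flushing pending stats */
        return true;
    }
    return false;
}

/* Walks through one timeout: nothing pending, handler fires, work is done. */
static void
example_usage(void)
{
    assert(!process_interrupts());
    idle_stats_update_timeout_handler();
    assert(process_interrupts());
    assert(!process_interrupts());
}
```

Keeping the handler down to flag assignments means the deferred work can take locks or allocate memory without worrying about what the interrupted code was doing.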
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 2f3e0a70e0..f073025f69 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -747,8 +747,8 @@ const char *const config_group_names[] =
     gettext_noop("Statistics"),
     /* STATS_MONITORING */
     gettext_noop("Statistics / Monitoring"),
-    /* STATS_COLLECTOR */
-    gettext_noop("Statistics / Query and Index Statistics Collector"),
+    /* STATS_ACTIVITY */
+    gettext_noop("Statistics / Query and Index Activity Statistics"),
     /* AUTOVACUUM */
     gettext_noop("Autovacuum"),
     /* CLIENT_CONN */
@@ -1478,7 +1478,7 @@ static struct config_bool ConfigureNamesBool[] =
 #endif
 
     {
-        {"track_activities", PGC_SUSET, STATS_COLLECTOR,
+        {"track_activities", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects information about executing commands."),
             gettext_noop("Enables the collection of information on the currently "
                          "executing command of each session, along with "
@@ -1489,7 +1489,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_counts", PGC_SUSET, STATS_COLLECTOR,
+        {"track_counts", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects statistics on database activity."),
             NULL
         },
@@ -1498,7 +1498,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_io_timing", PGC_SUSET, STATS_COLLECTOR,
+        {"track_io_timing", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects timing statistics for database I/O activity."),
             NULL
         },
@@ -4300,7 +4300,7 @@ static struct config_string ConfigureNamesString[] =
     },
 
     {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
+        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
             gettext_noop("Writes temporary statistics files to the specified directory."),
             NULL,
             GUC_SUPERUSER_ONLY
@@ -4636,7 +4636,7 @@ static struct config_enum ConfigureNamesEnum[] =
     },
 
     {
-        {"track_functions", PGC_SUSET, STATS_COLLECTOR,
+        {"track_functions", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects function-level statistics on database activity."),
             NULL
         },
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 995b6ca155..5745ef09ad 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -574,7 +574,7 @@
 # STATISTICS
 #------------------------------------------------------------------------------
 
-# - Query and Index Statistics Collector -
+# - Query and Index Activity Statistics -
 
 #track_activities = on
 #track_counts = on
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 208df557b8..4d07829ecd 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -124,7 +124,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 619b2f9c71..23984d6e24 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -83,6 +83,8 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
 
@@ -320,7 +322,6 @@ typedef enum BackendType
     B_WAL_SENDER,
     B_WAL_WRITER,
     B_ARCHIVER,
-    B_STATS_COLLECTOR,
     B_LOGGER,
 } BackendType;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ae9a39573c..6a809c70d6 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL activity statistics facility.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -15,9 +15,11 @@
 #include "libpq/pqcomm.h"
 #include "miscadmin.h"
 #include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -41,35 +43,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_RESETSLRUCOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_SLRU,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -80,9 +53,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -113,18 +85,17 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_delta_live_tuples;
     PgStat_Counter t_delta_dead_tuples;
     PgStat_Counter t_changed_tuples;
+    bool        reset_changed_tuples;
 
     PgStat_Counter t_blocks_fetched;
     PgStat_Counter t_blocks_hit;
+
+    PgStat_Counter vacuum_count;
+    PgStat_Counter autovac_vacuum_count;
+    PgStat_Counter analyze_count;
+    PgStat_Counter autovac_analyze_count;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -158,6 +129,10 @@ typedef struct PgStat_TableStatus
     Oid            t_id;            /* table's OID */
     bool        t_shared;        /* is it a shared catalog? */
     struct PgStat_TableXactStatus *trans;    /* lowest subxact's counts */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_TableCounts t_counts;    /* event counts to be sent */
 } PgStat_TableStatus;
 
@@ -183,308 +158,32 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
-/* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
- */
-typedef struct PgStat_MsgHdr
-{
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgResetslrucounter Sent by the backend to tell the collector
- *                                to reset a SLRU counter
- * ----------
- */
-typedef struct PgStat_MsgResetslrucounter
-{
-    PgStat_MsgHdr m_hdr;
-    int            m_index;
-} PgStat_MsgResetslrucounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgSLRU            Sent by a backend to update SLRU statistics.
- * ----------
- */
-typedef struct PgStat_MsgSLRU
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Counter m_index;
-    PgStat_Counter m_blocks_zeroed;
-    PgStat_Counter m_blocks_hit;
-    PgStat_Counter m_blocks_read;
-    PgStat_Counter m_blocks_written;
-    PgStat_Counter m_blocks_exists;
-    PgStat_Counter m_flush;
-    PgStat_Counter m_truncate;
-} PgStat_MsgSLRU;
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
 /* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgTempFile
+typedef struct PgStat_BgWriter
 {
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter buf_alloc;
+    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+}            PgStat_BgWriter;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -516,98 +215,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgResetslrucounter msg_resetslrucounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgSLRU msg_slru;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Activity statistics data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -616,13 +225,9 @@ typedef union PgStat_Msg
 
 #define PGSTAT_FILE_FORMAT_ID    0x01A5BC9D
 
-/* ----------
- * PgStat_StatDBEntry            The collector's data per database
- * ----------
- */
-typedef struct PgStat_StatDBEntry
+
+typedef struct PgStat_StatDBCounts
 {
-    Oid            databaseid;
     PgStat_Counter n_xact_commit;
     PgStat_Counter n_xact_rollback;
     PgStat_Counter n_blocks_fetched;
@@ -632,7 +237,6 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_tuples_inserted;
     PgStat_Counter n_tuples_updated;
     PgStat_Counter n_tuples_deleted;
-    TimestampTz last_autovac_time;
     PgStat_Counter n_conflict_tablespace;
     PgStat_Counter n_conflict_lock;
     PgStat_Counter n_conflict_snapshot;
@@ -642,29 +246,72 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_temp_bytes;
     PgStat_Counter n_deadlocks;
     PgStat_Counter n_checksum_failures;
-    TimestampTz last_checksum_failure;
     PgStat_Counter n_block_read_time;    /* times in microseconds */
     PgStat_Counter n_block_write_time;
+} PgStat_StatDBCounts;
 
+/* ----------
+ * PgStat_StatDBEntry            The statistics per database
+ * ----------
+ */
+typedef struct PgStat_StatDBEntry
+{
+    Oid            databaseid;
+    TimestampTz last_autovac_time;
+    TimestampTz last_checksum_failure;
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;    /* time of db stats update */
+
+    PgStat_StatDBCounts counts;
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following must be last in the struct, because we don't write them
+     * out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables; /* current gen tables hash */
+    dshash_table_handle functions;    /* current gen functions hash */
 } PgStat_StatDBEntry;
 
 
+/*
+ * SLRU statistics kept in the shared stats
+ */
+typedef struct PgStat_StatSLRUEntry
+{
+    PgStat_Counter blocks_zeroed;
+    PgStat_Counter blocks_hit;
+    PgStat_Counter blocks_read;
+    PgStat_Counter blocks_written;
+    PgStat_Counter blocks_exists;
+    PgStat_Counter flush;
+    PgStat_Counter truncate;
+    TimestampTz stat_reset_timestamp;
+} PgStat_StatSLRUEntry;
+
+
 /* ----------
- * PgStat_StatTabEntry            The collector's data per table (or index)
+ * PgStat_HashMountInfo  dshash mount information
+ * ----------
+ */
+typedef struct PgStat_HashMountInfo
+{
+    HTAB       *snapshot_tables;    /* table entry snapshot */
+    HTAB       *snapshot_functions; /* function entry snapshot */
+    dshash_table *dshash_tables;    /* attached tables dshash */
+    dshash_table *dshash_functions; /* attached functions dshash */
+} PgStat_HashMountInfo;
+
+/* ----------
+ * PgStat_StatTabEntry            The statistics per table (or index)
  * ----------
  */
 typedef struct PgStat_StatTabEntry
 {
     Oid            tableid;
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
 
     PgStat_Counter numscans;
 
@@ -684,19 +331,15 @@ typedef struct PgStat_StatTabEntry
     PgStat_Counter blocks_fetched;
     PgStat_Counter blocks_hit;
 
-    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
     PgStat_Counter vacuum_count;
-    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_vacuum_count;
-    TimestampTz analyze_timestamp;    /* user initiated */
     PgStat_Counter analyze_count;
-    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_analyze_count;
 } PgStat_StatTabEntry;
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            per function stats data
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
@@ -709,9 +352,8 @@ typedef struct PgStat_StatFuncEntry
     PgStat_Counter f_self_time;
 } PgStat_StatFuncEntry;
 
-
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -727,7 +369,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -745,21 +387,6 @@ typedef struct PgStat_GlobalStats
     TimestampTz stat_reset_timestamp;
 } PgStat_GlobalStats;
 
-/*
- * SLRU statistics kept in the stats collector
- */
-typedef struct PgStat_SLRUStats
-{
-    PgStat_Counter blocks_zeroed;
-    PgStat_Counter blocks_hit;
-    PgStat_Counter blocks_read;
-    PgStat_Counter blocks_written;
-    PgStat_Counter blocks_exists;
-    PgStat_Counter flush;
-    PgStat_Counter truncate;
-    TimestampTz stat_reset_timestamp;
-} PgStat_SLRUStats;
-
 
 /* ----------
  * Backend states
@@ -808,7 +435,6 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
     WAIT_EVENT_WAL_RECEIVER_MAIN,
@@ -1054,7 +680,7 @@ typedef struct PgBackendGSSStatus
  *
  * Each live backend maintains a PgBackendStatus struct in shared memory
  * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
+ * BackendId, but that is not critical.)  Note that the stats-writing code
  * has no involvement in, or even access to, these structs.
  *
  * Each auxiliary process also maintains a PgBackendStatus struct in shared
@@ -1251,13 +877,15 @@ extern PGDLLIMPORT bool pgstat_track_counts;
 extern PGDLLIMPORT int pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used; will be removed together with the GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1272,29 +900,26 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
 
+/* File input/output functions */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 extern void pgstat_reset_slru_counter(const char *);
 
@@ -1456,8 +1081,8 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                                       void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
+extern void pgstat_report_archiver(const char *xlog, bool failed);
+extern void pgstat_report_bgwriter(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1466,13 +1091,16 @@ extern void pgstat_send_bgwriter(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_snapshot(bool shared,
+                                                                Oid relid);
+extern void pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
-extern PgStat_SLRUStats *pgstat_fetch_slru(void);
+extern PgStat_StatSLRUEntry *pgstat_fetch_slru(void);
 
 extern void pgstat_count_slru_page_zeroed(int slru_idx);
 extern void pgstat_count_slru_page_hit(int slru_idx);
@@ -1483,5 +1111,6 @@ extern void pgstat_count_slru_flush(int slru_idx);
 extern void pgstat_count_slru_truncate(int slru_idx);
 extern const char *pgstat_slru_name(int slru_idx);
 extern int    pgstat_slru_index(const char *name);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 8fda8e4f78..13371e8cb7 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -220,6 +220,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_SXACT,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 454c2df487..2c65231b04 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -88,7 +88,7 @@ enum config_group
     PROCESS_TITLE,
     STATS,
     STATS_MONITORING,
-    STATS_COLLECTOR,
+    STATS_ACTIVITY,
     AUTOVACUUM,
     CLIENT_CONN,
     CLIENT_CONN_STATEMENT,
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 83a15f6795..77d1572a99 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.18.2

From 12356ae2d835274cb402b60ea73fad07d3d6a6e0 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:09 +0900
Subject: [PATCH v33 6/7] Doc part of shared-memory based stats collector.

---
 doc/src/sgml/catalogs.sgml          |   6 +-
 doc/src/sgml/config.sgml            |  29 +++----
 doc/src/sgml/high-availability.sgml |  13 +--
 doc/src/sgml/monitoring.sgml        | 128 +++++++++++++---------------
 doc/src/sgml/ref/pg_dump.sgml       |   9 +-
 5 files changed, 88 insertions(+), 97 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 9d8fa0bec3..4d93722af5 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -9175,9 +9175,9 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l
   <para>
    <xref linkend="view-table"/> lists the system views described here.
    More detailed documentation of each view follows below.
-   There are some additional views that provide access to the results of
-   the statistics collector; they are described in <xref
-   linkend="monitoring-stats-views-table"/>.
+   There are some additional views that provide access to the activity
+   statistics; they are described in
+   <xref linkend="monitoring-stats-views-table"/>.
   </para>
 
   <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 9f2a4a2470..b84e3f27b2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7110,11 +7110,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
     <title>Run-time Statistics</title>
 
     <sect2 id="runtime-config-statistics-collector">
-     <title>Query and Index Statistics Collector</title>
+     <title>Query and Index Activity Statistics</title>
 
      <para>
-      These parameters control server-wide statistics collection features.
-      When statistics collection is enabled, the data that is produced can be
+      These parameters control server-wide activity statistics features.
+      When activity statistics are enabled, the data that is produced can be
       accessed via the <structname>pg_stat</structname> and
       <structname>pg_statio</structname> family of system views.
       Refer to <xref linkend="monitoring"/> for more information.
@@ -7130,14 +7130,13 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables the collection of information on the currently
-        executing command of each session, along with the time when
-        that command began execution. This parameter is on by
-        default. Note that even when enabled, this information is not
-        visible to all users, only to superusers and the user owning
-        the session being reported on, so it should not represent a
-        security risk.
-        Only superusers can change this setting.
+        Enables activity tracking on the currently executing command of
+        each session, along with the time when that command began
+        execution. This parameter is on by default. Note that even when
+        enabled, this information is not visible to all users, only to
+        superusers and the user owning the session being reported on, so it
+        should not represent a security risk.  Only superusers can change this
+        setting.
        </para>
       </listitem>
      </varlistentry>
@@ -7168,9 +7167,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables collection of statistics on database activity.
+        Enables tracking of database activity.
         This parameter is on by default, because the autovacuum
-        daemon needs the collected information.
+        daemon needs the activity information.
         Only superusers can change this setting.
        </para>
       </listitem>
@@ -8230,7 +8229,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Specifies the fraction of the total number of heap tuples counted in
-        the previous statistics collection that can be inserted without
+        the previous activity statistics that can be inserted without
         incurring an index scan at the <command>VACUUM</command> cleanup stage.
         This setting currently applies to B-tree indexes only.
        </para>
@@ -8243,7 +8242,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         Index statistics are considered to be stale if the number of newly
         inserted tuples exceeds the <varname>vacuum_cleanup_index_scale_factor</varname>
         fraction of the total number of heap tuples detected by the previous
-        statistics collection. The total number of heap tuples is stored in
+        activity statistics. The total number of heap tuples is stored in
         the index meta-page. Note that the meta-page does not include this data
         until <command>VACUUM</command> finds no dead tuples, so B-tree index
         scan at the cleanup stage can only be skipped if the second and
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 44cc5d2116..c958337ac8 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2366,12 +2366,13 @@ LOG:  database system is ready to accept read only connections
    </para>
 
    <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
-    index usage, etc., will be recorded normally on the standby. Replayed
-    actions will not duplicate their effects on primary, so replaying an
-    insert will not increment the Inserts column of pg_stat_user_tables.
-    The stats file is deleted at the start of recovery, so stats from primary
-    and standby will differ; this is considered a feature, not a bug.
+    Activity statistics are collected during recovery. All scans, reads,
+    blocks, index usage, etc., will be recorded normally on the
+    standby. Replayed actions will not duplicate their effects on primary, so
+    replaying an insert will not increment the Inserts column of
+    pg_stat_user_tables.  The activity statistics are reset at the start of
+    recovery, so stats from primary and standby will differ; this is
+    considered a feature, not a bug.
    </para>
 
    <para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 87502a49b6..94f5d21243 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -22,7 +22,7 @@
   <para>
    Several tools are available for monitoring database activity and
    analyzing performance.  Most of this chapter is devoted to describing
-   <productname>PostgreSQL</productname>'s statistics collector,
+   <productname>PostgreSQL</productname>'s activity statistics,
    but one should not neglect regular Unix monitoring programs such as
    <command>ps</command>, <command>top</command>, <command>iostat</command>, and <command>vmstat</command>.
    Also, once one has identified a
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    master server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   master process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   master process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
@@ -130,20 +128,21 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
  </sect1>
 
  <sect1 id="monitoring-stats">
-  <title>The Statistics Collector</title>
+  <title>The Activity Statistics</title>
 
   <indexterm zone="monitoring-stats">
    <primary>statistics</primary>
   </indexterm>
 
   <para>
-   <productname>PostgreSQL</productname>'s <firstterm>statistics collector</firstterm>
-   is a subsystem that supports collection and reporting of information about
-   server activity.  Presently, the collector can count accesses to tables
-   and indexes in both disk-block and individual-row terms.  It also tracks
-   the total number of rows in each table, and information about vacuum and
-   analyze actions for each table.  It can also count calls to user-defined
-   functions and the total time spent in each one.
+   <productname>PostgreSQL</productname>'s <firstterm>activity
+   statistics</firstterm> facility is a subsystem that supports tracking and
+   reporting of information about server activity.  Presently, it counts
+   accesses to tables and indexes in both disk-block and
+   individual-row terms.  It also tracks the total number of rows in each
+   table, and information about vacuum and analyze actions for each table.  It
+   can also track calls to user-defined functions and the total time spent in
+   each one.
   </para>
 
   <para>
@@ -151,15 +150,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    information about exactly what is going on in the system right now, such as
    the exact command currently being executed by other server processes, and
    which other connections exist in the system.  This facility is independent
-   of the collector process.
+   of the activity statistics.
   </para>
 
  <sect2 id="monitoring-stats-setup">
-  <title>Statistics Collection Configuration</title>
+  <title>Activity Statistics Configuration</title>
 
   <para>
-   Since collection of statistics adds some overhead to query execution,
-   the system can be configured to collect or not collect information.
+   Since tracking for the activity statistics adds some overhead to query
+   execution, the system can be configured to track or not track activity.
    This is controlled by configuration parameters that are normally set in
    <filename>postgresql.conf</filename>.  (See <xref linkend="runtime-config"/> for
    details about setting configuration parameters.)
@@ -172,7 +171,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The parameter <xref linkend="guc-track-counts"/> controls whether
-   statistics are collected about table and index accesses.
+   table and index accesses are tracked.
   </para>
 
   <para>
@@ -196,18 +195,12 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
   </para>
 
   <para>
-   The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
-   When the server shuts down cleanly, a permanent copy of the statistics
-   data is stored in the <filename>pg_stat</filename> subdirectory, so that
-   statistics can be retained across server restarts.  When recovery is
-   performed at server start (e.g. after immediate shutdown, server crash,
-   and point-in-time recovery), all statistics counters are reset.
+   The activity statistics are kept in shared memory. When the server shuts
+   down cleanly, a permanent copy of the statistics data is stored in
+   the <filename>pg_stat</filename> subdirectory, so that statistics can be
+   retained across server restarts.  When recovery is performed at server
+   start (e.g. after immediate shutdown, server crash, and point-in-time
+   recovery), all statistics counters are reset.
   </para>
 
  </sect2>
@@ -220,48 +213,46 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    linkend="monitoring-stats-dynamic-views-table"/>, are available to show
    the current state of the system. There are also several other
    views, listed in <xref
-   linkend="monitoring-stats-views-table"/>, available to show the results
-   of statistics collection.  Alternatively, one can
-   build custom views using the underlying statistics functions, as discussed
-   in <xref linkend="monitoring-stats-functions"/>.
+   linkend="monitoring-stats-views-table"/>, available to show the activity
+   statistics.  Alternatively, one can build custom views using the underlying
+   statistics functions, as discussed in
+   <xref linkend="monitoring-stats-functions"/>.
   </para>
 
   <para>
-   When using the statistics to monitor collected data, it is important
-   to realize that the information does not update instantaneously.
-   Each individual server process transmits new statistical counts to
-   the collector just before going idle; so a query or transaction still in
-   progress does not affect the displayed totals.  Also, the collector itself
-   emits a new report at most once per <varname>PGSTAT_STAT_INTERVAL</varname>
-   milliseconds (500 ms unless altered while building the server).  So the
-   displayed information lags behind actual activity.  However, current-query
-   information collected by <varname>track_activities</varname> is
-   always up-to-date.
+   When using the activity statistics, it is important to realize that the
+   information does not update instantaneously.  Each individual server
+   process writes out new statistical counts just before going idle, but no
+   more often than once per <varname>PGSTAT_STAT_INTERVAL</varname>
+   milliseconds (1 second unless altered while building the server); so a
+   query or transaction still in progress does not affect the displayed
+   totals.  However, current-query information tracked by
+   <varname>track_activities</varname> is always up-to-date.
   </para>
 
   <para>
    Another important point is that when a server process is asked to display
-   any of these statistics, it first fetches the most recent report emitted by
-   the collector process and then continues to use this snapshot for all
-   statistical views and functions until the end of its current transaction.
-   So the statistics will show static information as long as you continue the
-   current transaction.  Similarly, information about the current queries of
-   all sessions is collected when any such information is first requested
-   within a transaction, and the same information will be displayed throughout
-   the transaction.
-   This is a feature, not a bug, because it allows you to perform several
-   queries on the statistics and correlate the results without worrying that
-   the numbers are changing underneath you.  But if you want to see new
-   results with each query, be sure to do the queries outside any transaction
-   block.  Alternatively, you can invoke
+   any of these statistics, it first reads the current statistics and then
+   continues to use this snapshot for all statistical views and functions
+   until the end of its current transaction.  So the statistics will show
+   static information as long as you continue the current transaction.
+   Similarly, information about the current queries of all sessions is captured
+   when any such information is first requested within a transaction, and the
+   same information will be displayed throughout the transaction.  This is a
+   feature, not a bug, because it allows you to perform several queries on the
+   statistics and correlate the results without worrying that the numbers are
+   changing underneath you.  But if you want to see new results with each
+   query, be sure to do the queries outside any transaction block.
+   Alternatively, you can invoke
    <function>pg_stat_clear_snapshot</function>(), which will discard the
    current transaction's statistics snapshot (if any).  The next use of
    statistical information will cause a new snapshot to be fetched.
   </para>
-
+  
   <para>
-   A transaction can also see its own statistics (as yet untransmitted to the
-   collector) in the views <structname>pg_stat_xact_all_tables</structname>,
+   A transaction can also see its own statistics (as yet unwritten to the
+   server-wide activity statistics) in the
+   views <structname>pg_stat_xact_all_tables</structname>,
    <structname>pg_stat_xact_sys_tables</structname>,
    <structname>pg_stat_xact_user_tables</structname>, and
    <structname>pg_stat_xact_user_functions</structname>.  These numbers do not act as
@@ -603,7 +594,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    kernel's I/O cache, and might therefore still be fetched without
    requiring a physical read. Users interested in obtaining more
    detailed information on <productname>PostgreSQL</productname> I/O behavior are
-   advised to use the <productname>PostgreSQL</productname> statistics collector
+   advised to use the <productname>PostgreSQL</productname> activity statistics
    in combination with operating system utilities that allow insight
    into the kernel's handling of I/O.
   </para>
@@ -1038,10 +1029,6 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>LogicalLauncherMain</literal></entry>
       <entry>Waiting in main loop of logical launcher process.</entry>
      </row>
-     <row>
-      <entry><literal>PgStatMain</literal></entry>
-      <entry>Waiting in main loop of the statistics collector process.</entry>
-     </row>
      <row>
       <entry><literal>RecoveryWalStream</literal></entry>
       <entry>Waiting for WAL from a stream at recovery.</entry>
@@ -1940,6 +1927,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>XidGenLock</literal></entry>
       <entry>Waiting to allocate or assign a transaction id.</entry>
      </row>
+     <row>
+      <entry><literal>activity_statistics</literal></entry>
+      <entry>Waiting to write out activity statistics to shared memory.</entry>
+     </row>
      <row>
       <entry><literal>async</literal></entry>
       <entry>Waiting for I/O on an async (notify) buffer.</entry>
@@ -5493,9 +5484,10 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
      <entry><literal>performing final cleanup</literal></entry>
      <entry>
        <command>VACUUM</command> is performing final cleanup.  During this phase,
-       <command>VACUUM</command> will vacuum the free space map, update statistics
-       in <literal>pg_class</literal>, and report statistics to the statistics
-       collector.  When this phase is completed, <command>VACUUM</command> will end.
+       <command>VACUUM</command> will vacuum the free space map, update
+       statistics in <literal>pg_class</literal>, and update the system-wide
+       activity statistics.  When this phase is completed,
+       <command>VACUUM</command> will end.
      </entry>
     </row>
    </tbody>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 197b5c0d70..8e205f5b2b 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1289,11 +1289,10 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The database activity of <application>pg_dump</application> is
-   normally collected by the statistics collector.  If this is
-   undesirable, you can set parameter <varname>track_counts</varname>
-   to false via <envar>PGOPTIONS</envar> or the <literal>ALTER
-   USER</literal> command.
+   The database activity of <application>pg_dump</application> is normally
+   collected.  If this is undesirable, you can set
+   parameter <varname>track_counts</varname> to false
+   via <envar>PGOPTIONS</envar> or the <literal>ALTER USER</literal> command.
   </para>
 
  </refsect1>
-- 
2.18.2
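
Note on the documentation patch above: it says that each server process
writes out its counts just before going idle, but no more often than once
per PGSTAT_STAT_INTERVAL milliseconds. The header changes also make
pgstat_report_stat() return a long and add IDLE_STATS_UPDATE_TIMEOUT. The
following is only a minimal standalone sketch of that kind of rate-limited
flush. It assumes that the return value means "milliseconds until the next
flush is allowed", which the hunks shown here do not confirm. All names
prefixed with sketch_ or Sketch are hypothetical and are not part of the
patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for PGSTAT_STAT_INTERVAL, in milliseconds. */
#define SKETCH_STAT_INTERVAL_MS 1000

/* Hypothetical per-process state; not a structure from the patch. */
typedef struct SketchStatsState
{
    long        last_flush_ms;  /* time of the last successful flush */
    long        pending;        /* counts accumulated locally */
    long        shared_total;   /* stand-in for the shared-memory counter */
    bool        flushed_once;   /* false until the first flush happens */
} SketchStatsState;

/*
 * Flush pending counts to the shared total unless the previous flush was
 * too recent.  Returns 0 when nothing is left pending, otherwise the
 * number of milliseconds to wait before a flush is allowed again
 * (an assumed meaning for the return value).
 */
static long
sketch_report_stat(SketchStatsState *st, long now_ms, bool force)
{
    long        elapsed;

    assert(SKETCH_STAT_INTERVAL_MS > 0);

    if (st->pending == 0)
        return 0;               /* nothing to write out */

    elapsed = now_ms - st->last_flush_ms;
    if (!force && st->flushed_once && elapsed < SKETCH_STAT_INTERVAL_MS)
        return SKETCH_STAT_INTERVAL_MS - elapsed;   /* too soon, retry later */

    st->shared_total += st->pending;
    st->pending = 0;
    st->last_flush_ms = now_ms;
    st->flushed_once = true;
    return 0;
}
```

A caller that gets a nonzero result could arm a timer for that many
milliseconds and try again when it fires, which is one plausible use for a
timeout such as IDLE_STATS_UPDATE_TIMEOUT.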

From 59d4536a5664e880c5d68b24fb9ca7adf850881d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 17:00:44 +0900
Subject: [PATCH v33 7/7] Remove the GUC stats_temp_directory

The GUC used to specify the directory in which temporary statistics
files are stored. The stats collector no longer needs that directory, but
programs in bin and contrib, and possibly other extensions, still use it.
This patch therefore removes the GUC but leaves some backing variables and
macro definitions in place for backward compatibility.
---
 doc/src/sgml/backup.sgml                      |  2 -
 doc/src/sgml/config.sgml                      | 19 ---------
 doc/src/sgml/storage.sgml                     |  3 +-
 src/backend/postmaster/pgstat.c               | 13 +++---
 src/backend/replication/basebackup.c          | 13 ++----
 src/backend/utils/misc/guc.c                  | 41 -------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/pgstat.h                          |  5 ++-
 src/test/perl/PostgresNode.pm                 |  4 --
 9 files changed, 13 insertions(+), 88 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index bdc9026c62..2885540362 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1146,8 +1146,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b84e3f27b2..d52e4197f9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7222,25 +7222,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 3234adb639..559f75fb54 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -122,8 +122,7 @@ Item
 
 <row>
  <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
+ <entry>Subdirectory containing ephemeral files for extensions</entry>
 </row>
 
 <row>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 13d1e92f7b..e4edb28a88 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -95,15 +95,12 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
+/*
+ * This used to be a GUC variable and is no longer used in this file.  It is
+ * kept, fixed at the default value, only for backward compatibility with
+ * extensions.
  */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+char       *pgstat_stat_directory = PG_STAT_TMP_DIR;
 
 typedef struct StatsShmemStruct
 {
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 984719c166..a71577e302 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -269,7 +269,6 @@ perform_base_backup(basebackup_options *opt)
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
     backup_manifest_info manifest;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
     backup_total = 0;
@@ -292,8 +291,6 @@ perform_base_backup(basebackup_options *opt)
     Assert(CurrentResourceOwner == NULL);
     CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -327,13 +324,9 @@ perform_base_backup(basebackup_options *opt)
          * Calculate the relative path of temporary statistics directory in
          * order to skip the files which are located in that directory later.
          */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
+
+        Assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
+        statrelpath = psprintf("./%s", PG_STAT_TMP_DIR);
 
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index f073025f69..d19cbb77c5 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -198,7 +198,6 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -4299,17 +4298,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11586,35 +11574,6 @@ check_maintenance_io_concurrency(int *newval, void **extra, GucSource source)
     return true;
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5745ef09ad..205a823191 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -581,7 +581,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 6a809c70d6..29e3e689c4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -32,7 +32,10 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
+/*
+ * This used to be the directory to store temporary statistics data in but is
+ * no longer used. Defined here for backward compatibility.
+ */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
 
 /* Values for track_functions GUC variable --- order is significant! */
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 3f3a1d81f6..68c3a33432 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -455,10 +455,6 @@ sub init
     print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
       if defined $ENV{TEMP_CONFIG};
 
-    # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
-    # concurrently must not share a stats_temp_directory.
-    print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
     if ($params{allows_streaming})
     {
         if ($params{allows_streaming} eq "logical")
-- 
2.18.2


Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
Hi.

Rebased onto the current HEAD; commit 36ac359d36 conflicted with this
patch set. The tranche name is changed from underscore-connected style
to camel case to follow that commit. While fixing that, I also rewrote
the descriptions for vacuum_cleanup_index_scale_factor.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 456c43cad15295279530bf7b9368da640563bd9d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Mon, 16 Mar 2020 17:15:35 +0900
Subject: [PATCH v34 1/7] Use standard crash handler in archiver.

The commit 8e19a82640 changed the SIGQUIT handler of almost all processes
not to run atexit callbacks for safety. The archiver process should behave
the same way for the same reason. The exit status changes from 1 to 2, but
that doesn't make any behavioral change.
---
 src/backend/postmaster/pgarch.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 01ffd6513c..37be0e2bbb 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -96,7 +96,6 @@ static pid_t pgarch_forkexec(void);
 #endif
 
 NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_exit(SIGNAL_ARGS);
 static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
@@ -229,7 +228,7 @@ PgArchiverMain(int argc, char *argv[])
     pqsignal(SIGHUP, SignalHandlerForConfigReload);
     pqsignal(SIGINT, SIG_IGN);
     pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
-    pqsignal(SIGQUIT, pgarch_exit);
+    pqsignal(SIGQUIT, SignalHandlerForCrashExit);
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
     pqsignal(SIGUSR1, pgarch_waken);
@@ -246,14 +245,6 @@ PgArchiverMain(int argc, char *argv[])
     exit(0);
 }
 
-/* SIGQUIT signal handler for archiver process */
-static void
-pgarch_exit(SIGNAL_ARGS)
-{
-    /* SIGQUIT means curl up and die ... */
-    exit(1);
-}
-
 /* SIGUSR1 signal handler for archiver process */
 static void
 pgarch_waken(SIGNAL_ARGS)
-- 
2.18.2
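
As a side note on the commit message above: the difference between the old handler and a crash-style handler can be shown with a small standalone program. This is only a sketch and not PostgreSQL code. It assumes the standard handler ends the process with `_exit(2)`, and `run_child()` and `report_callback_ran()` are hypothetical helpers made up for the illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int callback_fd = -1;

/* atexit callback: reports that it ran by writing one byte to a pipe */
static void
report_callback_ran(void)
{
    if (callback_fd >= 0)
    {
        ssize_t rc = write(callback_fd, "x", 1);

        (void) rc;
    }
}

/*
 * Fork a child that registers an atexit callback and then terminates either
 * with exit(1), like the removed pgarch_exit(), or with _exit(2), which is
 * what a crash-style handler is assumed to do here.  Returns the child's
 * exit status (or -1 on failure) and sets *callback_ran accordingly.
 */
static int
run_child(int crash_style, int *callback_ran)
{
    int         pipefd[2];
    int         status;
    char        buf;
    pid_t       pid;

    assert(callback_ran != NULL);
    if (pipe(pipefd) != 0)
        return -1;

    fflush(NULL);               /* don't duplicate buffered output */
    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0)
    {
        /* child */
        close(pipefd[0]);
        callback_fd = pipefd[1];
        atexit(report_callback_ran);
        if (crash_style)
            _exit(2);           /* skips atexit callbacks */
        exit(1);                /* runs atexit callbacks */
    }

    /* parent: a byte arrives only if the callback ran in the child */
    close(pipefd[1]);
    *callback_ran = (read(pipefd[0], &buf, 1) == 1);
    close(pipefd[0]);

    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Calling `run_child(0, &ran)` models the old behavior (status 1, callback executed), while `run_child(1, &ran)` models the new one (status 2, callback skipped).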

From feca4093f067c45b26c5be237be60f5b9f26cac0 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:03 +0900
Subject: [PATCH v34 2/7] sequential scan for dshash

Dshash did not allow scanning all entries sequentially. This adds that
functionality. The interface is similar to, but a bit different from,
both that of dynahash and the simple dshash search functions. One of the
most significant differences is that the sequential scan interface of
dshash always needs a call to dshash_seq_term when a scan ends. Another
is locking. Dshash holds a partition lock when returning an entry;
dshash_seq_next() also holds a lock when returning an entry, but callers
shouldn't release it, since the lock is essential to continue the
scan. The seqscan interface allows entry deletion during a scan.
In-scan deletion should be performed by dshash_delete_current().
---
 src/backend/lib/dshash.c | 149 ++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  21 ++++++
 2 files changed, 169 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 78ccf03217..1ef093e2e9 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -127,6 +127,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +157,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -324,7 +332,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -592,6 +600,145 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should always be called when a scan has finished.
+ * The caller may delete returned elements in the midst of a scan by using
+ * dshash_delete_current(). exclusive must be true to delete elements.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool exclusive)
+{
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->exclusive = exclusive;
+}
+
+/*
+ * Returns the next element.
+ *
+ * Returned elements are locked and the caller must not explicitly release
+ * the lock. It is released at the next call to dshash_seq_next().
+ */
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert(status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+        LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                      status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        status->curpartition = partition;
+
+        /* resize doesn't happen from now until seq scan ends */
+        status->nbuckets =
+            NUM_BUCKETS(status->hash_table->control->size_log2);
+        ensure_valid_bucket_pointers(status->hash_table);
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        int next_partition;
+
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            return NULL;
+        }
+
+        /* Check whether we are moving to the next partition */
+        next_partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+
+        if (status->curpartition != next_partition)
+        {
+            /*
+             * Move to the next partition. Lock the next partition, then
+             * release the current one (not in the reverse order) to avoid
+             * concurrent resizing.  Avoid deadlock by taking locks in the
+             * same order as resize().
+             */
+            LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                         next_partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                         status->curpartition));
+            status->curpartition = next_partition;
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * The caller may delete the item. Store the next item in case of deletion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+/*
+ * Terminates the seqscan and releases all locks.
+ *
+ * Should always be called when finishing or exiting a seqscan.
+ */
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+
+    if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+/* Remove the current entry during a seq scan. */
+void
+dshash_delete_current(dshash_seq_status *status)
+{
+    dshash_table       *hash_table    = status->hash_table;
+    dshash_table_item  *item        = status->curitem;
+    size_t                partition    = PARTITION_FOR_HASH(item->hash);
+
+    Assert(status->exclusive);
+    Assert(hash_table->control->magic == DSHASH_MAGIC);
+    Assert(hash_table->find_locked);
+    Assert(hash_table->find_exclusively_locked);
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition),
+                                LW_EXCLUSIVE));
+
+    delete_item(hash_table, item);
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b86df68e77..0ca9514021 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,21 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state. The detail is exposed to let users know the storage
+ * size but it should be considered as an opaque type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;    /* dshash table working on */
+    int                    curbucket;    /* bucket number we are at */
+    int                    nbuckets;    /* total number of buckets in the dshash */
+    dshash_table_item  *curitem;    /* item we are currently at */
+    dsa_pointer            pnextitem;    /* dsa-pointer to the next item */
+    int                    curpartition;    /* partition number we are at */
+    bool                exclusive;    /* locking mode */
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
                                    const dshash_parameters *params,
@@ -80,6 +95,12 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
+extern void dshash_delete_current(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.18.2
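
To illustrate why the scan state in the patch above keeps a pointer to the next item: saving the successor before handing out the current item is what makes in-scan deletion safe. The following is a minimal single-process sketch on a toy chained hash table. It has no partitions, locks, or DSA pointers, and every `toy_*` name is hypothetical rather than part of dshash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_NBUCKETS 4          /* hypothetical fixed table size */

typedef struct toy_item
{
    struct toy_item *next;
    int         value;
} toy_item;

typedef struct toy_table
{
    toy_item   *buckets[TOY_NBUCKETS];
} toy_table;

typedef struct toy_seq_status
{
    toy_table  *table;
    int         curbucket;      /* bucket we are at, -1 before the first call */
    toy_item   *curitem;        /* item returned last, NULL once deleted */
    toy_item   *nextitem;       /* successor saved before returning curitem */
} toy_seq_status;

static void
toy_insert(toy_table *table, int value)
{
    toy_item   *item = malloc(sizeof(toy_item));
    unsigned int b = (unsigned int) value % TOY_NBUCKETS;

    assert(item != NULL);
    item->value = value;
    item->next = table->buckets[b];
    table->buckets[b] = item;
}

static void
toy_seq_init(toy_seq_status *status, toy_table *table)
{
    status->table = table;
    status->curbucket = -1;
    status->curitem = NULL;
    status->nextitem = NULL;
}

/* Return the next item, or NULL when every bucket has been visited. */
static toy_item *
toy_seq_next(toy_seq_status *status)
{
    toy_item   *next = status->nextitem;

    /* Advance to the following bucket whenever the current chain ends. */
    while (next == NULL)
    {
        if (status->curbucket + 1 >= TOY_NBUCKETS)
            return NULL;
        status->curbucket++;
        next = status->table->buckets[status->curbucket];
    }

    status->curitem = next;
    status->nextitem = next->next;  /* remember it in case of deletion */
    return next;
}

/* Unlink and free the item most recently returned by toy_seq_next(). */
static void
toy_delete_current(toy_seq_status *status)
{
    toy_item  **link = &status->table->buckets[status->curbucket];

    assert(status->curitem != NULL);
    while (*link != status->curitem)
        link = &(*link)->next;
    *link = status->curitem->next;
    free(status->curitem);
    status->curitem = NULL;
}

/* Example scan: delete even values and return how many items remain. */
static int
toy_delete_even(toy_table *table)
{
    toy_seq_status status;
    toy_item   *item;
    int         remaining = 0;

    toy_seq_init(&status, table);
    while ((item = toy_seq_next(&status)) != NULL)
    {
        if (item->value % 2 == 0)
            toy_delete_current(&status);
        else
            remaining++;
    }
    return remaining;
}
```

The sketch has no counterpart to dshash_seq_term because it holds no locks; in the real interface that call is what releases the partition lock still held when the scan stops.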

From 02779cd11b43c8321d69590856a2a5fd2d7f51c7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:57 +0900
Subject: [PATCH v34 3/7] Add conditional lock feature to dshash

Dshash currently waits for a lock unconditionally. That is inconvenient
when we want to avoid being blocked by other processes. This commit
adds dshash_find_extended, an alternative to dshash_find and
dshash_find_or_insert that allows immediate return on lock failure.
---
 src/backend/lib/dshash.c | 98 +++++++++++++++++++++-------------------
 src/include/lib/dshash.h |  3 ++
 2 files changed, 55 insertions(+), 46 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 1ef093e2e9..d7ee6de11e 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -383,6 +383,10 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  * the caller must take care to ensure that the entry is not left corrupted.
  * The lock mode is either shared or exclusive depending on 'exclusive'.
  *
+ * If found is not NULL, *found is set to true if the key is found in the hash
+ * table. If the key is not found, *found is set to false and a pointer to a
+ * newly created entry is returned.
+ *
  * The caller must not lock a lock already.
  *
  * Note that the lock held is in fact an LWLock, so interrupts will be held on
@@ -392,36 +396,7 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
 {
-    dshash_hash hash;
-    size_t        partition;
-    dshash_table_item *item;
-
-    hash = hash_key(hash_table, key);
-    partition = PARTITION_FOR_HASH(hash);
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
-
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
-    ensure_valid_bucket_pointers(hash_table);
-
-    /* Search the active bucket. */
-    item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
-
-    if (!item)
-    {
-        /* Not found. */
-        LWLockRelease(PARTITION_LOCK(hash_table, partition));
-        return NULL;
-    }
-    else
-    {
-        /* The caller will free the lock by calling dshash_release_lock. */
-        hash_table->find_locked = true;
-        hash_table->find_exclusively_locked = exclusive;
-        return ENTRY_FROM_ITEM(item);
-    }
+    return dshash_find_extended(hash_table, key, exclusive, false, false, NULL);
 }
 
 /*
@@ -439,31 +414,60 @@ dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
 {
-    dshash_hash hash;
-    size_t        partition_index;
-    dshash_partition *partition;
+    return dshash_find_extended(hash_table, key, true, false, true, found);
+}
+
+
+/*
+ * Find the key in the hash table.
+ *
+ * "exclusive" is the lock mode in which the partition for the returned item
+ * is locked.  If "nowait" is true, the function returns immediately if the
+ * required lock cannot be acquired.  "insert" indicates insert mode: if the
+ * key is not found, a new entry is inserted and *found is set to false;
+ * otherwise *found is set to true.  "found" must be non-null in this mode.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool insert, bool *found)
+{
+    dshash_hash hash = hash_key(hash_table, key);
+    size_t        partidx = PARTITION_FOR_HASH(hash);
+    dshash_partition *partition = &hash_table->control->partitions[partidx];
+    LWLockMode  lockmode = exclusive ? LW_EXCLUSIVE : LW_SHARED;
     dshash_table_item *item;
 
-    hash = hash_key(hash_table, key);
-    partition_index = PARTITION_FOR_HASH(hash);
-    partition = &hash_table->control->partitions[partition_index];
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
+    /* must be exclusive when insert allowed */
+    Assert(!insert || (exclusive && found != NULL));
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (!nowait)
+        LWLockAcquire(PARTITION_LOCK(hash_table, partidx), lockmode);
+    else if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partidx),
+                                       lockmode))
+        return NULL;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
     item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
 
     if (item)
-        *found = true;
+    {
+        if (found)
+            *found = true;
+    }
     else
     {
-        *found = false;
+        if (found)
+            *found = false;
+
+        if (!insert)
+        {
+            /* The caller didn't ask us to add a new entry. */
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+            return NULL;
+        }
 
         /* Check if we are getting too full. */
         if (partition->count > MAX_COUNT_PER_PARTITION(hash_table))
@@ -479,7 +483,8 @@ restart:
              * Give up our existing lock first, because resizing needs to
              * reacquire all the locks in the right order to avoid deadlocks.
              */
-            LWLockRelease(PARTITION_LOCK(hash_table, partition_index));
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+
             resize(hash_table, hash_table->size_log2 + 1);
 
             goto restart;
@@ -493,12 +498,13 @@ restart:
         ++partition->count;
     }
 
-    /* The caller must release the lock with dshash_release_lock. */
+    /* The caller will free the lock by calling dshash_release_lock. */
     hash_table->find_locked = true;
-    hash_table->find_exclusively_locked = true;
+    hash_table->find_exclusively_locked = exclusive;
     return ENTRY_FROM_ITEM(item);
 }
 
+
 /*
  * Remove an entry by key.  Returns true if the key was found and the
  * corresponding entry was removed.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 0ca9514021..5f7a60febd 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -91,6 +91,9 @@ extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                                    const void *key, bool *found);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+                                  bool exclusive, bool nowait, bool insert,
+                                  bool *found);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.18.2
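
The control flow of the "nowait" mode in the patch above can be modelled without PostgreSQL at all. Below is a rough sketch that uses a C11 `atomic_flag` as a stand-in for the partition LWLock and a one-slot "partition" instead of a real hash table. All `toy_*` names are hypothetical, and the spin loop in `toy_lock()` is only a placeholder for a proper blocking acquire.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* A one-slot "partition": just enough state to show the locking protocol. */
typedef struct toy_partition
{
    atomic_flag lock;           /* stand-in for the partition LWLock */
    bool        used;
    int         key;
    int         value;
} toy_partition;

static void
toy_partition_init(toy_partition *p)
{
    atomic_flag_clear(&p->lock);
    p->used = false;
    p->key = 0;
    p->value = 0;
}

/* Conditional acquire: returns false at once if the lock is already held. */
static bool
toy_try_lock(toy_partition *p)
{
    return !atomic_flag_test_and_set(&p->lock);
}

/* Unconditional acquire (spinning only for the sake of the sketch). */
static void
toy_lock(toy_partition *p)
{
    while (atomic_flag_test_and_set(&p->lock))
        ;
}

static void
toy_release_lock(toy_partition *p)
{
    atomic_flag_clear(&p->lock);
}

/*
 * Look up "key".  On a non-NULL return the lock is still held and the caller
 * must call toy_release_lock().  NULL means either "lock busy" (nowait) or
 * "not found" (no insert); in both cases no lock is held on return.
 */
static int *
toy_find_extended(toy_partition *p, int key, bool nowait, bool insert,
                  bool *found)
{
    assert(!insert || found != NULL);

    if (!nowait)
        toy_lock(p);
    else if (!toy_try_lock(p))
        return NULL;            /* lock busy: give up immediately */

    if (p->used && p->key == key)
    {
        if (found)
            *found = true;
        return &p->value;
    }

    if (found)
        *found = false;

    if (!insert)
    {
        toy_release_lock(p);
        return NULL;
    }

    /* Insert mode: claim the slot (a toy; any previous key is overwritten). */
    p->used = true;
    p->key = key;
    p->value = 0;
    return &p->value;
}
```

Note that in this sketch, as far as I can tell also in the patch, a NULL result with nowait set does not by itself tell the caller whether the key was missing or the lock was busy, and *found is left untouched in the latter case.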

From fb3f9d43835eb68fb1a3e2f74fb35f8c5569f843 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:59:38 +0900
Subject: [PATCH v34 4/7] Make archiver process an auxiliary process

This is a preliminary patch for shared-memory based stats collector.

The archiver process must be an auxiliary process since it uses shared
memory once stats data is moved into shared memory. Make the process
an auxiliary process in order to make it work.
---
 src/backend/access/transam/xlogarchive.c |   6 +-
 src/backend/bootstrap/bootstrap.c        |  22 ++--
 src/backend/postmaster/pgarch.c          | 130 +++--------------------
 src/backend/postmaster/postmaster.c      |  50 +++++----
 src/backend/storage/lmgr/proc.c          |   1 +
 src/include/access/xlog.h                |   3 +
 src/include/access/xlogarchive.h         |   1 +
 src/include/miscadmin.h                  |   2 +
 src/include/postmaster/pgarch.h          |   4 +-
 src/include/storage/pmsignal.h           |   1 -
 src/include/storage/proc.h               |   3 +
 11 files changed, 69 insertions(+), 154 deletions(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index cdd586fcfb..ee3444284b 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -29,7 +29,9 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/latch.h"
 #include "storage/pmsignal.h"
+#include "storage/proc.h"
 
 /*
  * Attempt to retrieve the specified file from off-line archival storage.
@@ -489,8 +491,8 @@ XLogArchiveNotify(const char *xlog)
     }
 
     /* Notify archiver that it's got something to do */
-    if (IsUnderPostmaster)
-        SendPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER);
+    if (IsUnderPostmaster && ProcGlobal->archiverLatch)
+        SetLatch(ProcGlobal->archiverLatch);
 }
 
 /*
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 5480a024e0..d398ce6f03 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -319,6 +319,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
         case StartupProcess:
             MyBackendType = B_STARTUP;
             break;
+        case ArchiverProcess:
+            MyBackendType = B_ARCHIVER;
+            break;
         case BgWriterProcess:
             MyBackendType = B_BG_WRITER;
             break;
@@ -439,30 +442,29 @@ AuxiliaryProcessMain(int argc, char *argv[])
             proc_exit(1);        /* should never return */
 
         case StartupProcess:
-            /* don't set signals, startup process has its own agenda */
             StartupProcessMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
+
+        case ArchiverProcess:
+            PgArchiverMain();
+            proc_exit(1);
 
         case BgWriterProcess:
-            /* don't set signals, bgwriter has its own agenda */
             BackgroundWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case CheckpointerProcess:
-            /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalWriterProcess:
-            /* don't set signals, walwriter has its own agenda */
             InitXLOGAccess();
             WalWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalReceiverProcess:
-            /* don't set signals, walreceiver has its own agenda */
             WalReceiverMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         default:
             elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 37be0e2bbb..063d1323ea 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -48,6 +48,7 @@
 #include "storage/latch.h"
 #include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
+#include "storage/procsignal.h"
 #include "utils/guc.h"
 #include "utils/ps_status.h"
 
@@ -78,13 +79,11 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
-static volatile sig_atomic_t wakened = false;
 static volatile sig_atomic_t ready_to_stop = false;
 
 /* ----------
@@ -95,8 +94,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
 static void pgarch_ArchiverCopyLoop(void);
@@ -110,75 +107,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -212,14 +140,9 @@ pgarch_forkexec(void)
 #endif                            /* EXEC_BACKEND */
 
 
-/*
- * PgArchiverMain
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
- *    since we don't use 'em, it hardly matters...
- */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+/* Main entry point for archiver process */
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -231,33 +154,26 @@ PgArchiverMain(int argc, char *argv[])
     pqsignal(SIGQUIT, SignalHandlerForCrashExit);
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, pgarch_waken);
+    pqsignal(SIGUSR1, procsignal_sigusr1_handler);
     pqsignal(SIGUSR2, pgarch_waken_stop);
+
     /* Reset some signals that are accepted by postmaster but not here */
     pqsignal(SIGCHLD, SIG_DFL);
+
+    /* Unblock signals (they were blocked when the postmaster forked us) */
     PG_SETMASK(&UnBlockSig);
 
-    MyBackendType = B_ARCHIVER;
-    init_ps_display(NULL);
+    /*
+     * Advertise our latch that backends can use to wake us up while we're
+     * sleeping.
+     */
+    ProcGlobal->archiverLatch = &MyProc->procLatch;
 
     pgarch_MainLoop();
 
     exit(0);
 }
 
-/* SIGUSR1 signal handler for archiver process */
-static void
-pgarch_waken(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    /* set flag that there is work to be done */
-    wakened = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
 /* SIGUSR2 signal handler for archiver process */
 static void
 pgarch_waken_stop(SIGNAL_ARGS)
@@ -282,14 +198,6 @@ pgarch_MainLoop(void)
     pg_time_t    last_copy_time = 0;
     bool        time_to_stop;
 
-    /*
-     * We run the copy loop immediately upon entry, in case there are
-     * unarchived files left over from a previous database run (or maybe the
-     * archiver died unexpectedly).  After that we wait for a signal or
-     * timeout before doing more.
-     */
-    wakened = true;
-
     /*
      * There shouldn't be anything for the archiver to do except to wait for a
      * signal ... however, the archiver exists to protect our data, so she
@@ -328,12 +236,8 @@ pgarch_MainLoop(void)
         }
 
         /* Do what we're here for */
-        if (wakened || time_to_stop)
-        {
-            wakened = false;
-            pgarch_ArchiverCopyLoop();
-            last_copy_time = time(NULL);
-        }
+        pgarch_ArchiverCopyLoop();
+        last_copy_time = time(NULL);
 
         /*
          * Sleep until a signal is received, or until a poll is forced by
@@ -354,13 +258,9 @@ pgarch_MainLoop(void)
                                WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                                timeout * 1000L,
                                WAIT_EVENT_ARCHIVER_MAIN);
-                if (rc & WL_TIMEOUT)
-                    wakened = true;
                 if (rc & WL_POSTMASTER_DEATH)
                     time_to_stop = true;
             }
-            else
-                wakened = true;
         }
 
         /*
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 160afe9f39..9de9396628 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -539,6 +539,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1785,7 +1786,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -3068,7 +3069,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3203,20 +3204,16 @@ reaper(SIGNAL_ARGS)
         }
 
         /*
-         * Was it the archiver?  If so, just try to start a new one; no need
-         * to force reset of the rest of the system.  (If fail, we'll try
-         * again in future cycles of the main loop.).  Unless we were waiting
-         * for it to shut down; don't restart it in that case, and
-         * PostmasterStateMachine() will advance to the next shutdown step.
+         * Was it the archiver?  Normal exit can be ignored; we'll start a new
+         * one at the next iteration of the postmaster's main loop, if
+         * necessary. Any other exit condition is treated as a crash.
          */
         if (pid == PgArchPID)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3464,7 +3461,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3669,6 +3666,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3941,6 +3950,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5216,7 +5226,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5259,16 +5269,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (StartWorkerNeeded || HaveCrashedWorker)
         maybe_start_bgworkers();
 
-    if (CheckPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER) &&
-        PgArchPID != 0)
-    {
-        /*
-         * Send SIGUSR1 to archiver process, to wake it up and begin archiving
-         * next WAL file.
-         */
-        signal_child(PgArchPID, SIGUSR1);
-    }
-
     /* Tell syslogger to rotate logfile if requested */
     if (SysLoggerPID != 0)
     {
@@ -5501,6 +5501,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index f5eef6fa4e..46f0198510 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -187,6 +187,7 @@ InitProcGlobal(void)
     ProcGlobal->startupBufferPinWaitBufId = -1;
     ProcGlobal->walwriterLatch = NULL;
     ProcGlobal->checkpointerLatch = NULL;
+    ProcGlobal->archiverLatch = NULL;
     pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
     pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index e917dfe92d..0783692c83 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -336,6 +336,9 @@ extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern void XLogArchiveWakeupStart(void);
+extern void XLogArchiveWakeupEnd(void);
+extern void XLogArchiveWakeup(void);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool PromoteIsTriggered(void);
diff --git a/src/include/access/xlogarchive.h b/src/include/access/xlogarchive.h
index 1c67de2ede..54ce0b97d7 100644
--- a/src/include/access/xlogarchive.h
+++ b/src/include/access/xlogarchive.h
@@ -25,6 +25,7 @@ extern void ExecuteRecoveryCommand(const char *command, const char *commandName,
 extern void KeepFileRestoredFromArchive(const char *path, const char *xlogfname);
 extern void XLogArchiveNotify(const char *xlog);
 extern void XLogArchiveNotifySeg(XLogSegNo segno);
+extern void XLogArchiveWakeup(void);
 extern void XLogArchiveForceDone(const char *xlog);
 extern bool XLogArchiveCheckDone(const char *xlog);
 extern bool XLogArchiveIsBusy(const char *xlog);
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14fa127ab1..619b2f9c71 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -417,6 +417,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -429,6 +430,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index b3200874ca..e3ffc63f14 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 56c5ec4481..c691acf8cd 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -34,7 +34,6 @@ typedef enum
 {
     PMSIGNAL_RECOVERY_STARTED,    /* recovery has started */
     PMSIGNAL_BEGIN_HOT_STANDBY, /* begin Hot Standby */
-    PMSIGNAL_WAKEN_ARCHIVER,    /* send a NOTIFY signal to xlog archiver */
     PMSIGNAL_ROTATE_LOGFILE,    /* send SIGUSR1 to syslogger to rotate logfile */
     PMSIGNAL_START_AUTOVAC_LAUNCHER,    /* start an autovacuum launcher */
     PMSIGNAL_START_AUTOVAC_WORKER,    /* start an autovacuum worker */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 1ee9000b2b..8ab6d859bb 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -270,6 +270,9 @@ typedef struct PROC_HDR
     int            startupProcPid;
     /* Buffer id of the buffer that Startup process waits for pin on, or -1 */
     int            startupBufferPinWaitBufId;
+    /* Archiver process's latch */
+    Latch       *archiverLatch;
+    /* Current shared estimate of appropriate spins_per_delay value */
 } PROC_HDR;
 
 extern PGDLLIMPORT PROC_HDR *ProcGlobal;
-- 
2.18.2

From 96f9a302630c4ec516492eeb3a4ae40df334e1ad Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:14 +0900
Subject: [PATCH v34 5/7] Shared-memory based stats collector

Previously, activity statistics were collected via sockets and shared
among backends through files written periodically. Such files can reach
tens of megabytes and are created as often as every 500ms; this large
data is serialized by the stats collector and then de-serialized by
every backend periodically. To avoid that cost, this patch places
activity statistics data in shared memory. Each backend accumulates
statistics locally, then tries to move them to the shared statistics at
every transaction end, but at intervals no shorter than 500ms. Locks on
the shared statistics are acquired per object, such as a table or a
function, so the expected chance of collision is not high. Furthermore,
until 1 second has elapsed since the last flush to shared stats, a lock
failure postpones the flush so that lock contention doesn't slow down
transactions.  After that, the stats flush waits for locks so that the
shared statistics don't get stale.
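
The flush timing described above can be sketched roughly as follows.
This is only an illustration of the rule stated in this message, not
code from the patch: the names FLUSH_MIN_INTERVAL_MS,
FLUSH_FORCE_AFTER_MS, FlushMode and choose_flush_mode are hypothetical,
and the two thresholds are simply the 500ms and 1 second figures quoted
above.

```c
#include <assert.h>

/* Hypothetical names; the values are the figures from the message above. */
#define FLUSH_MIN_INTERVAL_MS   500     /* never flush more often than this */
#define FLUSH_FORCE_AFTER_MS    1000    /* past this, wait for the locks */

typedef enum
{
    FLUSH_SKIP,     /* too soon: keep accumulating locally */
    FLUSH_NOWAIT,   /* try the locks; postpone entries whose lock fails */
    FLUSH_WAIT      /* block on the locks so shared stats don't get stale */
} FlushMode;

/*
 * Decide how to flush at transaction end, given the time elapsed since
 * the last successful flush to shared stats.
 */
static FlushMode
choose_flush_mode(long elapsed_ms)
{
    assert(elapsed_ms >= 0);

    if (elapsed_ms < FLUSH_MIN_INTERVAL_MS)
        return FLUSH_SKIP;
    if (elapsed_ms < FLUSH_FORCE_AFTER_MS)
        return FLUSH_NOWAIT;
    return FLUSH_WAIT;
}
```

Under these assumptions, a backend that last flushed 700ms ago would
attempt a non-blocking flush, and one that last flushed 1200ms ago
would wait for the locks.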
---
 src/backend/access/heap/heapam_handler.c      |    4 +-
 src/backend/access/heap/vacuumlazy.c          |    2 +-
 src/backend/access/transam/xlog.c             |    4 +-
 src/backend/catalog/index.c                   |   24 +-
 src/backend/commands/analyze.c                |    8 +-
 src/backend/commands/dbcommands.c             |    2 +-
 src/backend/commands/matview.c                |    8 +-
 src/backend/commands/vacuum.c                 |    4 +-
 src/backend/postmaster/autovacuum.c           |   74 +-
 src/backend/postmaster/bgwriter.c             |    4 +-
 src/backend/postmaster/checkpointer.c         |   24 +-
 src/backend/postmaster/interrupt.c            |    5 +-
 src/backend/postmaster/pgarch.c               |   12 +-
 src/backend/postmaster/pgstat.c               | 5125 +++++++----------
 src/backend/postmaster/postmaster.c           |   88 +-
 src/backend/replication/basebackup.c          |    4 +-
 src/backend/storage/buffer/bufmgr.c           |    8 +-
 src/backend/storage/ipc/ipci.c                |    2 +
 src/backend/storage/lmgr/lwlock.c             |    4 +-
 src/backend/storage/lmgr/lwlocknames.txt      |    1 +
 src/backend/storage/smgr/smgr.c               |    4 +-
 src/backend/tcop/postgres.c                   |   37 +-
 src/backend/utils/adt/pgstatfuncs.c           |   59 +-
 src/backend/utils/init/globals.c              |    1 +
 src/backend/utils/init/miscinit.c             |    3 -
 src/backend/utils/init/postinit.c             |   11 +
 src/backend/utils/misc/guc.c                  |   14 +-
 src/backend/utils/misc/postgresql.conf.sample |    2 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |    2 +-
 src/include/miscadmin.h                       |    3 +-
 src/include/pgstat.h                          |  581 +-
 src/include/storage/lwlock.h                  |    1 +
 src/include/utils/guc_tables.h                |    2 +-
 src/include/utils/timeout.h                   |    1 +
 34 files changed, 2256 insertions(+), 3872 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 56b35622f1..80a3c50994 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1059,8 +1059,8 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
                  * our own.  In this case we should count and sample the row,
                  * to accommodate users who load a table and analyze it in one
                  * transaction.  (pgstat_report_analyze has to adjust the
-                 * numbers we send to the stats collector to make this come
-                 * out right.)
+                 * numbers we report to the activity stats facility to make
+                 * this come out right.)
                  */
                 if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(targtuple->t_data)))
                 {
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 3bef0e124b..4ac4e7c03d 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -595,7 +595,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
                         new_min_multi,
                         false);
 
-    /* report results to the stats collector, too */
+    /* report results to the activity stats facility, too */
     pgstat_report_vacuum(RelationGetRelid(onerel),
                          onerel->rd_rel->relisshared,
                          new_live_tuples,
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ca09d81b08..474db9fde3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8575,9 +8575,9 @@ LogCheckpointEnd(bool restartpoint)
                         &sync_secs, &sync_usecs);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time +=
+    BgWriterStats.checkpoint_write_time +=
         write_secs * 1000 + write_usecs / 1000;
-    BgWriterStats.m_checkpoint_sync_time +=
+    BgWriterStats.checkpoint_sync_time +=
         sync_secs * 1000 + sync_usecs / 1000;
 
     /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 7cfbdd57db..f3779dc4aa 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1698,28 +1698,10 @@ index_concurrently_swap(Oid newIndexId, Oid oldIndexId, const char *oldName)
 
     /*
      * Copy over statistics from old to new index
+     * The data will be flushed by the next pgstat_report_stat()
+     * call.
      */
-    {
-        PgStat_StatTabEntry *tabentry;
-
-        tabentry = pgstat_fetch_stat_tabentry(oldIndexId);
-        if (tabentry)
-        {
-            if (newClassRel->pgstat_info)
-            {
-                newClassRel->pgstat_info->t_counts.t_numscans = tabentry->numscans;
-                newClassRel->pgstat_info->t_counts.t_tuples_returned = tabentry->tuples_returned;
-                newClassRel->pgstat_info->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_hit = tabentry->blocks_hit;
-
-                /*
-                 * The data will be sent by the next pgstat_report_stat()
-                 * call.
-                 */
-            }
-        }
-    }
+    pgstat_copy_index_counters(oldIndexId, newClassRel->pgstat_info);
 
     /* Close relations */
     table_close(pg_class, RowExclusiveLock);
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 924ef37c81..06b03cb8e1 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -655,10 +655,10 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
     }
 
     /*
-     * Report ANALYZE to the stats collector, too.  However, if doing
-     * inherited stats we shouldn't report, because the stats collector only
-     * tracks per-table stats.  Reset the changes_since_analyze counter only
-     * if we analyzed all columns; otherwise, there is still work for
+     * Report ANALYZE to the activity stats facility, too.  However, if doing
+     * inherited stats we shouldn't report, because the activity stats facility
+     * only tracks per-table stats.  Reset the changes_since_analyze counter
+     * only if we analyzed all columns; otherwise, there is still work for
      * auto-analyze to do.
      */
     if (!inh)
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index f27c3fe8c1..de0c749570 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -971,7 +971,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
     DropDatabaseBuffers(db_id);
 
     /*
-     * Tell the stats collector to forget it immediately, too.
+     * Tell the activity stats facility to forget it immediately, too.
      */
     pgstat_drop_database(db_id);
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index f80a9e96a9..e7ccc6eba7 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -336,10 +336,10 @@ ExecRefreshMatView(RefreshMatViewStmt *stmt, const char *queryString,
         refresh_by_heap_swap(matviewOid, OIDNewHeap, relpersistence);
 
         /*
-         * Inform stats collector about our activity: basically, we truncated
-         * the matview and inserted some new data.  (The concurrent code path
-         * above doesn't need to worry about this because the inserts and
-         * deletes it issues get counted by lower-level code.)
+         * Inform activity stats facility about our activity: basically, we
+         * truncated the matview and inserted some new data.  (The concurrent
+         * code path above doesn't need to worry about this because the inserts
+         * and deletes it issues get counted by lower-level code.)
          */
         pgstat_count_truncate(matviewRel);
         if (!stmt->skipData)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 5a110edb07..d0015eb411 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -319,8 +319,8 @@ vacuum(List *relations, VacuumParams *params,
                  errmsg("VACUUM option DISABLE_PAGE_SKIPPING cannot be used with FULL")));
 
     /*
-     * Send info about dead objects to the statistics collector, unless we are
-     * in autovacuum --- autovacuum.c does this for itself.
+     * Send info about dead objects to the activity statistics facility, unless
+     * we are in autovacuum --- autovacuum.c does this for itself.
      */
     if ((params->options & VACOPT_VACUUM) && !IsAutoVacuumWorkerProcess())
         pgstat_vacuum_stat();
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index a8d4dfdd7c..6936acb839 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -338,9 +338,6 @@ static void autovacuum_do_vac_analyze(autovac_table *tab,
                                       BufferAccessStrategy bstrategy);
 static AutoVacOpts *extract_autovac_opts(HeapTuple tup,
                                          TupleDesc pg_class_desc);
-static PgStat_StatTabEntry *get_pgstat_tabentry_relid(Oid relid, bool isshared,
-                                                      PgStat_StatDBEntry *shared,
-                                                      PgStat_StatDBEntry *dbentry);
 static void perform_work_item(AutoVacuumWorkItem *workitem);
 static void autovac_report_activity(autovac_table *tab);
 static void autovac_report_workitem(AutoVacuumWorkItem *workitem,
@@ -1665,12 +1662,12 @@ AutoVacWorkerMain(int argc, char *argv[])
         char        dbname[NAMEDATALEN];
 
         /*
-         * Report autovac startup to the stats collector.  We deliberately do
-         * this before InitPostgres, so that the last_autovac_time will get
-         * updated even if the connection attempt fails.  This is to prevent
-         * autovac from getting "stuck" repeatedly selecting an unopenable
-         * database, rather than making any progress on stuff it can connect
-         * to.
+         * Report autovac startup to the activity stats facility.  We
+         * deliberately do this before InitPostgres, so that the
+         * last_autovac_time will get updated even if the connection attempt
+         * fails.  This is to prevent autovac from getting "stuck" repeatedly
+         * selecting an unopenable database, rather than making any progress on
+         * stuff it can connect to.
          */
         pgstat_report_autovac(dbid);
 
@@ -1938,8 +1935,6 @@ do_autovacuum(void)
     HASHCTL        ctl;
     HTAB       *table_toast_map;
     ListCell   *volatile cell;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     BufferAccessStrategy bstrategy;
     ScanKeyData key;
     TupleDesc    pg_class_desc;
@@ -1958,17 +1953,11 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /*
-     * may be NULL if we couldn't find an entry (only happens if we are
-     * forcing a vacuum for anti-wrap purposes).
-     */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
 
     /*
-     * Clean up any dead statistics collector entries for this DB. We always
+     * Clean up any dead activity statistics entries for this DB. We always
      * want to do this exactly once per DB-processing cycle, even if we find
      * nothing worth vacuuming in the database.
      */
@@ -2011,9 +2000,6 @@ do_autovacuum(void)
     /* StartTransactionCommand changed elsewhere */
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-
     classRel = table_open(RelationRelationId, AccessShareLock);
 
     /* create a copy so we can use it after closing pg_class */
@@ -2092,8 +2078,8 @@ do_autovacuum(void)
 
         /* Fetch reloptions and the pgstat entry for this table */
         relopts = extract_autovac_opts(tuple, pg_class_desc);
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                       relid);
 
         /* Check if it needs vacuum or analyze */
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
@@ -2176,8 +2162,8 @@ do_autovacuum(void)
         }
 
         /* Fetch the pgstat entry for this table */
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                       relid);
 
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
@@ -2736,29 +2722,6 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
     return av;
 }
 
-/*
- * get_pgstat_tabentry_relid
- *
- * Fetch the pgstat entry of a table, either local to a database or shared.
- */
-static PgStat_StatTabEntry *
-get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
-                          PgStat_StatDBEntry *dbentry)
-{
-    PgStat_StatTabEntry *tabentry = NULL;
-
-    if (isshared)
-    {
-        if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
-    }
-    else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
-
-    return tabentry;
-}
 
 /*
  * table_recheck_autovac
@@ -2779,17 +2742,12 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     bool        doanalyze;
     autovac_table *tab = NULL;
     PgStat_StatTabEntry *tabentry;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     bool        wraparound;
     AutoVacOpts *avopts;
 
     /* use fresh stats */
     autovac_refresh_stats();
 
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* fetch the relation's relcache entry */
     classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
     if (!HeapTupleIsValid(classTup))
@@ -2813,8 +2771,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     }
 
     /* fetch the pgstat table entry */
-    tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                         shared, dbentry);
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                   relid);
 
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
@@ -2936,7 +2894,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
  *
  * For analyze, the analysis done is that the number of tuples inserted,
  * deleted and updated since the last analyze exceeds a threshold calculated
- * in the same fashion as above.  Note that the collector actually stores
+ * in the same fashion as above.  Note that the activity statistics store
  * the number of tuples (both live and dead) that there were as of the last
  * analyze.  This is asymmetric to the VACUUM case.
  *
@@ -2946,8 +2904,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
  *
  * A table whose autovacuum_enabled option is false is
  * automatically skipped (unless we have to vacuum it due to freeze_max_age).
- * Thus autovacuum can be disabled for specific tables. Also, when the stats
- * collector does not have data about a table, it will be skipped.
+ * Thus autovacuum can be disabled for specific tables. Also, when the activity
+ * statistics do not have data about a table, it will be skipped.
  *
  * A table whose vac_base_thresh value is < 0 takes the base value from the
  * autovacuum_vacuum_threshold GUC variable.  Similarly, a vac_scale_factor
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 069e27e427..4382b1726f 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -234,9 +234,9 @@ BackgroundWriterMain(void)
         can_hibernate = BgBufferSync(&wb_context);
 
         /*
-         * Send off activity statistics to the stats collector
+         * Send off activity statistics to the activity stats facility
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 34ed9f7887..ebc4bce72c 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -350,7 +350,7 @@ CheckpointerMain(void)
         if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
         {
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            BgWriterStats.requested_checkpoints++;
         }
 
         /*
@@ -364,7 +364,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                BgWriterStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -486,13 +486,13 @@ CheckpointerMain(void)
         CheckArchiveTimeout();
 
         /*
-         * Send off activity statistics to the stats collector.  (The reason
-         * why we re-use bgwriter-related code for this is that the bgwriter
-         * and checkpointer used to be just one process.  It's probably not
-         * worth the trouble to split the stats support into two independent
-         * stats message types.)
+         * Send off activity statistics to the activity stats facility.  (The
+         * reason why we re-use bgwriter-related code for this is that the
+         * bgwriter and checkpointer used to be just one process.  It's
+         * probably not worth the trouble to split the stats support into two
+         * independent stats message types.)
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         /*
          * If any checkpoint flags have been set, redo the loop to handle the
@@ -698,9 +698,9 @@ CheckpointWriteDelay(int flags, double progress)
         CheckArchiveTimeout();
 
         /*
-         * Report interim activity statistics to the stats collector.
+         * Report interim activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1245,8 +1245,8 @@ AbsorbSyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    BgWriterStats.buf_written_backend += CheckpointerShmem->num_backend_writes;
+    BgWriterStats.buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/interrupt.c b/src/backend/postmaster/interrupt.c
index 3d02439b79..53844eb8bb 100644
--- a/src/backend/postmaster/interrupt.c
+++ b/src/backend/postmaster/interrupt.c
@@ -93,9 +93,8 @@ SignalHandlerForCrashExit(SIGNAL_ARGS)
  * shut down and exit.
  *
  * Typically, this handler would be used for SIGTERM, but some procesess use
- * other signals. In particular, the checkpointer exits on SIGUSR2, the
- * stats collector on SIGQUIT, and the WAL writer exits on either SIGINT
- * or SIGTERM.
+ * other signals. In particular, the checkpointer exits on SIGUSR2, and the WAL
+ * writer exits on either SIGINT or SIGTERM.
  *
  * ShutdownRequestPending should be checked at a convenient place within the
  * main loop, or else the main loop should call HandleMainLoopInterrupts.
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 063d1323ea..08fe87341c 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -372,20 +372,20 @@ pgarch_ArchiverCopyLoop(void)
                 pgarch_archiveDone(xlog);
 
                 /*
-                 * Tell the collector about the WAL file that we successfully
-                 * archived
+                 * Tell the activity statistics facility about the WAL file
+                 * that we successfully archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_report_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
             else
             {
                 /*
-                 * Tell the collector about the WAL file that we failed to
-                 * archive
+                 * Tell the activity statistics facility about the WAL file
+                 * that we failed to archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_report_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index d7f99d9944..76b59df408 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,22 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Activity Statistics facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects activity statistics, e.g. per-table access statistics, of
+ *  all backends in shared memory. The activity numbers are first stored
+ *  locally in each process, then flushed to shared memory at commit
+ *  time or by idle-timeout.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ * To avoid congestion on the shared memory, shared stats are updated no more
+ * often than once per PGSTAT_MIN_INTERVAL (1000ms). If some local numbers
+ * remain unflushed because of lock failure, retry at intervals that start at
+ * PGSTAT_RETRY_MIN_INTERVAL (250ms) and double at every retry. Finally we
+ * force an update PGSTAT_MAX_INTERVAL (10000ms) after the first attempt.
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the activity statistics facility creates the
+ *  area and loads the stored stats file if any. The last process at shutdown
+ *  writes the shared stats to the file, then destroys the area before exit.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -19,18 +26,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -40,68 +35,43 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/fork_process.h"
 #include "postmaster/interrupt.h"
 #include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
-#include "utils/timestamp.h"
 
 /* ----------
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
+#define PGSTAT_MIN_INTERVAL            1000    /* Minimum interval of stats data
+                                             * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_RETRY_MIN_INTERVAL    250 /* Initial retry interval after
+                                         * PGSTAT_MIN_INTERVAL */
 
+#define PGSTAT_MAX_INTERVAL            10000    /* Longest interval of stats data
+                                             * updates */
 
 /* ----------
- * The initial size hints for the hash tables used in the collector.
+ * The initial size hints for the hash tables used in the activity statistics.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
-#define PGSTAT_TAB_HASH_SIZE    512
+#define PGSTAT_TABLE_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
 
@@ -116,7 +86,6 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
-
 /* ----------
  * GUC parameters
  * ----------
@@ -131,15 +100,24 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used; kept only until the corresponding GUC is removed */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
-/*
- * BgWriter global statistics counters (unused in other processes).
- * Stored directly in a stats message structure so it can be sent
- * without needing to copy things around.  We assume this inits to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
+typedef struct StatsShmemStruct
+{
+    dsa_handle    stats_dsa_handle;    /* handle for stats data area */
+    dshash_table_handle hash_handle;    /* shared dbstat hash */
+    dsa_pointer global_stats;    /* DSA pointer to global stats */
+    dsa_pointer archiver_stats; /* Ditto for archiver stats */
+    dsa_pointer slru_stats;        /* Ditto for SLRU stats */
+    int            refcount;        /* # of processes attached to the shared
+                                 * stats memory */
+}            StatsShmemStruct;
+
+/* BgWriter global statistics counters */
+PgStat_BgWriter BgWriterStats = {0};
 
 /*
  * List of SLRU names that we keep stats for.  There is no central registry of
@@ -160,72 +138,159 @@ static const char *const slru_names[] = {
 #define SLRU_NUM_ELEMENTS    lengthof(slru_names)
 
 /*
- * SLRU statistics counts waiting to be sent to the collector.  These are
- * stored directly in stats message format so they can be sent without needing
- * to copy things around.  We assume this variable inits to zeroes.  Entries
- * are one-to-one with slru_names[].
+ * SLRU statistics counts waiting to be written to the shared activity
+ * statistics.  We assume this variable inits to zeroes.  Entries are
+ * one-to-one with slru_names[].
+ * Changes of SLRU counters are reported within critical sections so we use
+ * static memory in order to avoid memory allocation.
  */
-static PgStat_MsgSLRU SLRUStats[SLRU_NUM_ELEMENTS];
+static PgStat_StatSLRUEntry local_SLRUStats[SLRU_NUM_ELEMENTS];
+static bool     have_slrustats = false;
 
 /* ----------
  * Local data
  * ----------
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
-
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
+/* backend-lifetime storage */
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
 
 /*
- * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
+ * Enums and types to define shared statistics structure.
+ *
+ * Statistics entries for each object are stored in individual DSA-allocated
+ * memory. Every entry is pointed to from the dshash pgStatSharedHash via
+ * dsa_pointer. This structure keeps object-stats entries from being moved by
+ * dshash resizing, and allows the dshash to release its lock sooner on stats
+ * updates. It also reduces interference among write-locks on each stat entry
+ * by not relying on the partition lock of dshash. PgStatLocalHashEntry is the
+ * local-stats equivalent of PgStatHashEntry for shared stat entries.
+ *
+ * Each stat entry is enveloped by the type PgStatEnvelope, which stores the
+ * common attributes of all kinds of statistics and an LWLock object.
+ *
+ * Shared stats are stored as:
+ *
+ * dshash pgStatSharedHash
+ *    -> PgStatHashEntry                (dshash entry)
+ *      (dsa_pointer)-> PgStatEnvelope    (dsa memory block)
  *
- * NOTE: once allocated, TabStatusArray structures are never moved or deleted
- * for the life of the backend.  Also, we zero out the t_id fields of the
- * contained PgStat_TableStatus structs whenever they are not actively in use.
- * This allows relcache pgstat_info pointers to be treated as long-lived data,
- * avoiding repeated searches in pgstat_initstats() when a relation is
- * repeatedly opened during a transaction.
+ * Local stats are stored as:
+ *
+ * dynahash pgStatLocalHash
+ *    -> PgStatLocalHashEntry            (dynahash entry)
+ *      (direct pointer)-> PgStatEnvelope (palloc'ed memory)
  */
-#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
 
-typedef struct TabStatusArray
+/* The types of statistics entries. */
+typedef enum PgStatTypes
 {
-    struct TabStatusArray *tsa_next;    /* link to next array, if any */
-    int            tsa_used;        /* # entries currently used */
-    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
-} TabStatusArray;
-
-static TabStatusArray *pgStatTabList = NULL;
+    PGSTAT_TYPE_ALL,            /* Not a type, for the parameters of
+                                 * pgstat_collect_stat_entries */
+    PGSTAT_TYPE_DB,                /* database-wide statistics */
+    PGSTAT_TYPE_TABLE,            /* per-table statistics */
+    PGSTAT_TYPE_FUNCTION        /* per-function statistics */
+}            PgStatTypes;
 
 /*
- * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
+ * entry size lookup table of shared statistics entries corresponding to
+ * PgStatTypes
  */
-typedef struct TabStatHashEntry
+static size_t pgstat_entsize[] =
 {
-    Oid            t_id;
-    PgStat_TableStatus *tsa_entry;
-} TabStatHashEntry;
+    0,                            /* PGSTAT_TYPE_ALL: not an entry */
+    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_StatTabEntry),    /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_StatFuncEntry)    /* PGSTAT_TYPE_FUNCTION */
+};
 
-/*
- * Hash table for O(1) t_id -> tsa_entry lookup
- */
-static HTAB *pgStatTabHash = NULL;
+/* Ditto for local statistics entries */
+static size_t pgstat_localentsize[] =
+{
+    0,                            /* PGSTAT_TYPE_ALL: not an entry */
+    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_TableStatus), /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_BackendFunctionEntry) /* PGSTAT_TYPE_FUNCTION */
+};
+
+/* struct for shared statistics hash entry key. */
+typedef struct PgStatHashEntryKey
+{
+    PgStatTypes type;            /* statistics entry type */
+    Oid            databaseid;        /* database ID. InvalidOid for shared objects. */
+    Oid            objectid;        /* object OID */
+}            PgStatHashEntryKey;
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * Entry of the shared hash pgStatSharedHash. (Stats numbers waiting to be
+ * flushed out to shared stats are held in pgStatLocalHash.)
  */
-static HTAB *pgStatFunctions = NULL;
+typedef struct PgStatHashEntry
+{
+    PgStatHashEntryKey key;        /* hash key */
+    dsa_pointer env;            /* pointer to shared stats envelope in DSA */
+}            PgStatHashEntry;
+
+/* struct for shared statistics entry pointed from shared hash entry. */
+typedef struct PgStatEnvelope
+{
+    PgStatTypes type;            /* statistics entry type */
+    Oid            databaseid;        /* databaseid */
+    Oid            objectid;        /* objectid */
+    size_t        len;            /* length of body, fixed per type. */
+    LWLock        lock;            /* lightweight lock to protect body */
+    int            body[FLEXIBLE_ARRAY_MEMBER];    /* statistics body */
+}            PgStatEnvelope;
+
+#define PgStatEnvelopeSize(bodylen) \
+    (offsetof(PgStatEnvelope, body) + (bodylen))
+
+/* struct for shared SLRU stats */
+typedef struct PgStatSharedSLRUStats
+{
+    PgStat_StatSLRUEntry    entry[SLRU_NUM_ELEMENTS];
+    LWLock                    lock;
+} PgStatSharedSLRUStats;
+
+/* struct for shared statistics local hash entry. */
+typedef struct PgStatLocalHashEntry
+{
+    PgStatHashEntryKey key;        /* hash key */
+    PgStatEnvelope *env;        /* pointer to stats envelope in heap */
+}            PgStatLocalHashEntry;
 
 /*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
+ * A snapshot is a stats entry that is locally copied to offer stable values
+ * for a transaction.
  */
-static bool have_function_stats = false;
+typedef struct PgStatSnapshot
+{
+    PgStatHashEntryKey key;
+    bool        negative;
+    int            body[FLEXIBLE_ARRAY_MEMBER];    /* statistics body */
+}            PgStatSnapshot;
+
+#define PgStatSnapshotSize(bodylen)                \
+    (offsetof(PgStatSnapshot, body) + (bodylen))
+
+/* parameter for shared hashes */
+static const dshash_parameters dsh_rootparams = {
+    sizeof(PgStatHashEntryKey),
+    sizeof(PgStatHashEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+
+/* The shared hash to index activity stats entries. */
+static dshash_table *pgStatSharedHash = NULL;
+
+/* Local stats numbers are stored here */
+static HTAB *pgStatLocalHash = NULL;
+
+#define HAVE_ANY_PENDING_STATS()                        \
+    (pgStatLocalHash != NULL ||                            \
+     pgStatXactCommit != 0 || pgStatXactRollback != 0)
 
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
@@ -262,11 +327,10 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
+static MemoryContext pgStatSnapshotContext = NULL;
+static HTAB *pgStatSnapshotHash = NULL;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -275,20 +339,19 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Cluster wide statistics.
+ *
+ * Contains statistics that are collected on neither a per-database nor a
+ * per-table basis.  shared_* point to shared memory and snapshot_* are
+ * backend snapshots.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
+static bool global_snapshot_is_valid = false;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats snapshot_globalStats;
+static PgStatSharedSLRUStats *shared_SLRUStats;
+static PgStat_StatSLRUEntry snapshot_SLRUStats[SLRU_NUM_ELEMENTS];
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -297,526 +360,278 @@ static List *pending_write_requests = NIL;
  */
 static instr_time total_func_time;
 
+/*
+ * Newly created shared stats entries need to be initialized before other
+ * processes can access them. get_stat_entry() calls this for that purpose.
+ */
+typedef void (*entry_initializer) (PgStatEnvelope * env);
 
 /* ----------
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgstat_beshutdown_hook(int code, Datum arg);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                                                 Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStatEnvelope * get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                                       bool nowait,
+                                       entry_initializer initfunc, bool *found);
+static PgStatEnvelope * *collect_stat_entries(PgStatTypes type, Oid dbid);
+static void create_missing_dbentries(void);
+static void pgstat_write_database_stats(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_db_statsfile(PgStat_StatDBEntry *dbentry);
+
+static bool flush_tabstat(PgStatEnvelope * env, bool nowait);
+static bool flush_funcstat(PgStatEnvelope * env, bool nowait);
+static bool flush_dbstat(PgStatEnvelope * env, bool nowait);
+static bool flush_slrustat(bool nowait);
+
+static void init_tabentry(PgStatEnvelope * env);
+static void init_funcentry(PgStatEnvelope * env);
+static void init_dbentry(PgStatEnvelope * env);
+
+static bool delete_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                              bool nowait);
+static PgStatEnvelope * get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                                             bool create, bool *found);
+static PgStat_StatDBEntry *get_local_dbstat_entry(Oid dbid);
+static PgStat_TableStatus *get_local_tabstat_entry(Oid rel_id, bool isshared);
+
+static PgStat_SubXactStatus *get_tabstat_stack_level(int nest_level);
+static void add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level);
+static void pgstat_snapshot_global_stats(void);
+
 static void pgstat_read_current_status(void);
-
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
-
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
-static void pgstat_send_slru(void);
-static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
-
-static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
-
-static void pgstat_setup_memcxt(void);
-
 static const char *pgstat_get_wait_activity(WaitEventActivity w);
 static const char *pgstat_get_wait_client(WaitEventClient w);
 static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute shared memory space needed for activity statistics
+ */
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+/*
+ * StatsShmemInit - initialize during shared-memory creation
  */
 void
-pgstat_init(void)
+StatsShmemInit(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
+    bool        found;
 
-#define TESTBYTEVAL ((char) 199)
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
 
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
-
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
-    {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
-    }
-
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
+    if (!IsUnderPostmaster)
     {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
-
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
-
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
+        Assert(!found);
 
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+    }
+}
 
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ *    Create pgStatLocalContext and pgStatSnapshotContext, if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+    if (!pgStatLocalContext)
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+
+    if (!pgStatSnapshotContext)
+        pgStatSnapshotContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Database statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+}
 
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
 
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+/* ----------
+ * attach_shared_stats() -
+ *
+ *    Attach or create shared stats memory. If we are the first process to use
+ *    the activity stats system, read the saved statistics files, if any.
+ * ---------
+ */
+static void
+attach_shared_stats(void)
+{
+    MemoryContext oldcontext;
 
-        test_byte++;            /* just make sure variable is changed */
+    /*
+     * Don't use dsm outside of postmaster children, or when not tracking counts.
+     */
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
 
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    pgstat_setup_memcxt();
 
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    if (area)
+        return;
 
-        /* If we get here, we have a working socket */
-        break;
-    }
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
 
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
     /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
+     * The last process is responsible for writing out stats files at exit.
+     * Maintain refcount so that an exiting process can tell whether it is
+     * the last one or not.
      */
-    if (!pg_set_noblock(pgStatSock))
+    if (StatsShmem->refcount > 0)
+        StatsShmem->refcount++;
+    else
     {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
+        /* We're the first process to attach the shared stats memory */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
+
+        /* Initialize shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatSharedHash = dshash_create(area, &dsh_rootparams, 0);
+
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->global_stats =
+            dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+        StatsShmem->archiver_stats =
+            dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+        StatsShmem->slru_stats =
+            dsa_allocate0(area, sizeof(PgStatSharedSLRUStats));
+        StatsShmem->hash_handle = dshash_get_hash_table_handle(pgStatSharedHash);
+
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
+
+        shared_SLRUStats = (PgStatSharedSLRUStats *)
+            dsa_get_address(area, StatsShmem->slru_stats);
+        LWLockInitialize(&shared_SLRUStats->lock, LWTRANCHE_STATS);
+
+        /* Load saved data if any. */
+        pgstat_read_statsfiles();
+
+        StatsShmem->refcount = 1;
     }
 
+    LWLockRelease(StatsLock);
+
     /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
+     * If we're not the first process, attach the existing shared stats area
+     * outside the StatsLock section.
      */
+    if (!area)
     {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
+        /* Attach shared area. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatSharedHash = dshash_attach(area, &dsh_rootparams,
+                                         StatsShmem->hash_handle, 0);
+
+        /* Setup local variables */
+        pgStatLocalHash = NULL;
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
+        shared_SLRUStats = (PgStatSharedSLRUStats *)
+            dsa_get_address(area, StatsShmem->slru_stats);
     }
 
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    /* Now that we have a long-lived socket, tell fd.c about it. */
-    ReserveExternalFD();
+    MemoryContextSwitchTo(oldcontext);
 
-    return;
-
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
-
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
-
-    /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
-     */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    /* don't detach automatically */
+    dsa_pin_mapping(area);
+    global_snapshot_is_valid = false;
 }
 
-/*
- * subroutine for pgstat_reset_all
+/* ----------
+ * detach_shared_stats() -
+ *
+ *    Detach shared stats. Write out to file if we're the last process and told
+ *    to do so.
+ * ----------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+detach_shared_stats(bool write_stats)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    /* immediately return if useless */
+    if (!area || !IsUnderPostmaster)
+        return;
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
-    {
-        int            nchars;
-        Oid            tmp_oid;
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
+    if (--StatsShmem->refcount < 1)
+    {
         /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
+         * This process is the last one attached to the shared stats
+         * memory.  Write out the stats files if requested.
          */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
-
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
+        if (write_stats)
+            pgstat_write_statsfiles();
 
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+        /* No one is using the area. */
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
     }
-    FreeDir(dir);
+
+    LWLockRelease(StatsLock);
+
+    /*
+     * Detach the area.  It is automatically destroyed when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);
+
+    area = NULL;
+    pgStatSharedHash = NULL;
+    shared_globalStats = NULL;
+    shared_archiverStats = NULL;
+    pgStatLocalHash = NULL;
+    global_snapshot_is_valid = false;
 }
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
-}
+    /* standalone server doesn't use shared stats */
+    if (!IsUnderPostmaster)
+        return;
 
-#ifdef EXEC_BACKEND
+    /* we must have shared stats attached */
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
-    char       *av[10];
-    int            ac = 0;
-
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
-
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
-
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
-
-
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
-
-    /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
-     */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
-
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
+    /* Startup must be the only user of shared stats */
+    Assert(StatsShmem->refcount == 1);
 
     /*
-     * Okay, fork off the collector.
+     * We could remove the files directly and recreate the shared memory area,
+     * but for simplicity just detach without writing and then attach again.
      */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
+    detach_shared_stats(false); /* Don't write files. */
+    attach_shared_stats();
 }
 
 /* ------------------------------------------------------------
@@ -824,147 +639,491 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *    Updates are applied no more frequently than once every
+ *    PGSTAT_MIN_INTERVAL milliseconds.  If force is false, they are also
+ *    postponed on lock failure unless some update has been pending for longer
+ *    than PGSTAT_MAX_INTERVAL milliseconds.  Postponed updates are retried in
+ *    succeeding calls of this function.
+ *
+ *    Returns the time in milliseconds until the next attempt to apply
+ *    updates if some updates remain pending, or 0 if nothing remains
+ *    pending.
+ *
+ *    Note that this is called only outside a transaction, so it is fine to use
  *    transaction stop time as an approximation of current time.
- * ----------
+ *    ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz next_flush = 0;
+    static TimestampTz pending_since = 0;
+    static long retry_interval = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
+    bool        nowait = !force;    /* Don't use force ever after */
+    HASH_SEQ_STATUS scan;
+    PgStatLocalHashEntry *lent;
+    PgStatLocalHashEntry **dbentlist;
+    int            dbentlistlen = 8;
+    int            ndbentries = 0;
+    int            remains = 0;
     int            i;
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (area == NULL || !HAVE_ANY_PENDING_STATS())
+        return 0;
+
+    dbentlist = palloc(sizeof(PgStatLocalHashEntry *) * dbentlistlen);
 
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
     now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
 
-    /*
-     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
-     */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
-    pgStatTabHash = NULL;
+    if (nowait)
+    {
+        /*
+         * Don't flush stats too frequently.  Return the time to the next
+         * flush.
+         */
+        if (now < next_flush)
+        {
+            /* Record the epoch time if retrying. */
+            if (pending_since == 0)
+                pending_since = now;
+
+            return (next_flush - now) / 1000;
+        }
+
+        /* But, don't keep pending updates longer than PGSTAT_MAX_INTERVAL. */
+
+        if (pending_since > 0 &&
+            TimestampDifferenceExceeds(pending_since, now, PGSTAT_MAX_INTERVAL))
+            nowait = false;
+    }
 
     /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
+     * flush_tabstat adds some counters of each flushed entry to the local
+     * database stats, so flush the database stats afterwards.
      */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
-    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+    if (pgStatLocalHash)
     {
-        for (i = 0; i < tsa->tsa_used; i++)
+        /* Step 1: flush out other than database stats */
+        hash_seq_init(&scan, pgStatLocalHash);
+        while ((lent = (PgStatLocalHashEntry *) hash_seq_search(&scan)) != NULL)
         {
-            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
+            bool        remove = false;
 
-            /* Shouldn't have any pending transaction-dependent counts */
-            Assert(entry->trans == NULL);
+            switch (lent->env->type)
+            {
+                case PGSTAT_TYPE_DB:
+                    if (ndbentries >= dbentlistlen)
+                    {
+                        dbentlistlen *= 2;
+                        dbentlist = repalloc(dbentlist,
+                                             sizeof(PgStatLocalHashEntry *) *
+                                             dbentlistlen);
+                    }
+                    dbentlist[ndbentries++] = lent;
+                    break;
+                case PGSTAT_TYPE_TABLE:
+                    if (flush_tabstat(lent->env, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_FUNCTION:
+                    if (flush_funcstat(lent->env, nowait))
+                        remove = true;
+                    break;
+                default:
+                    Assert(false);
+            }
 
-            /*
-             * Ignore entries that didn't accumulate any actual counts, such
-             * as indexes that were opened by the planner but not used.
-             */
-            if (memcmp(&entry->t_counts, &all_zeroes,
-                       sizeof(PgStat_TableCounts)) == 0)
+            if (!remove)
+            {
+                remains++;
                 continue;
+            }
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* Remove the successfully flushed entry */
+            pfree(lent->env);
+            hash_search(pgStatLocalHash, &lent->key, HASH_REMOVE, NULL);
+        }
+
+        /* Step 2: flush out database stats */
+        for (i = 0; i < ndbentries; i++)
+        {
+            PgStatLocalHashEntry *lent = dbentlist[i];
+
+            if (flush_dbstat(lent->env, nowait))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                remains--;
+                /* Remove the successfully flushed entry */
+                pfree(lent->env);
+                hash_search(pgStatLocalHash, &lent->key, HASH_REMOVE, NULL);
             }
         }
-        /* zero out PgStat_TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
+        pfree(dbentlist);
+
+        if (remains <= 0)
+        {
+            hash_destroy(pgStatLocalHash);
+            pgStatLocalHash = NULL;
+        }
     }
 
+    /* flush SLRU stats */
+    flush_slrustat(nowait);
+
+    /* Publish the last flush time */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (shared_globalStats->stats_timestamp < now)
+        shared_globalStats->stats_timestamp = now;
+    LWLockRelease(StatsLock);
+
     /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
+     * If we have pending local stats, let the caller know the retry interval.
      */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    if (HAVE_ANY_PENDING_STATS())
+    {
+        /* Retain the epoch time */
+        if (pending_since == 0)
+            pending_since = now;
+
+        /* The interval is doubled at every retry. */
+        if (retry_interval == 0)
+            retry_interval = PGSTAT_RETRY_MIN_INTERVAL * 1000;
+        else
+            retry_interval = retry_interval * 2;
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+        /*
+         * Determine the next retry time so that the interval does not get
+         * shorter than the previous one.
+         */
+        if (!TimestampDifferenceExceeds(pending_since,
+                                        now + 2 * retry_interval,
+                                        PGSTAT_MAX_INTERVAL))
+            next_flush = now + retry_interval;
+        else
+        {
+            next_flush = pending_since + PGSTAT_MAX_INTERVAL * 1000;
+            retry_interval = next_flush - now;
+        }
 
-    /* Finally send SLRU statistics */
-    pgstat_send_slru();
+        return retry_interval / 1000;
+    }
+
+    /* Set the next time to update stats */
+    next_flush = now + PGSTAT_MIN_INTERVAL * 1000;
+    retry_interval = 0;
+    pending_since = 0;
+
+    return 0;
 }
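
The retry scheduling at the end of pgstat_report_stat() can be hard to follow through the mix of microsecond timestamps and millisecond constants. The sketch below replays the same arithmetic in plain milliseconds: the interval doubles on each retry, and once one more doubling would pass the maximum pending time, the next flush is pinned to that deadline instead. DemoRetryState, demo_next_retry and both DEMO_* constants are hypothetical; the values are chosen only to keep the numbers small and are not the patch's PGSTAT_RETRY_MIN_INTERVAL or PGSTAT_MAX_INTERVAL.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical values for illustration only, in milliseconds. */
#define DEMO_RETRY_MIN_MS   100
#define DEMO_MAX_MS         1000

/* Hypothetical mirror of the function's static variables. */
typedef struct
{
    long    pending_since;  /* when updates first became pending, 0 if none */
    long    retry_interval; /* current retry interval, 0 if not retrying */
    long    next_flush;     /* earliest time of the next flush attempt */
} DemoRetryState;

/*
 * Called when a flush attempt at time "now" left some stats pending.
 * Returns the delay until the next attempt.
 */
static long
demo_next_retry(DemoRetryState *st, long now)
{
    /* Remember when updates first became pending. */
    if (st->pending_since == 0)
        st->pending_since = now;

    /* Start at the minimum interval, then double at every retry. */
    if (st->retry_interval == 0)
        st->retry_interval = DEMO_RETRY_MIN_MS;
    else
        st->retry_interval *= 2;

    if ((now + 2 * st->retry_interval) - st->pending_since < DEMO_MAX_MS)
        st->next_flush = now + st->retry_interval;
    else
    {
        /* Another doubling would pass the deadline, so aim at the deadline. */
        st->next_flush = st->pending_since + DEMO_MAX_MS;
        st->retry_interval = st->next_flush - now;
    }

    return st->retry_interval;
}
```

Starting from a zeroed state and calling it at times 1000, 1100 and 1300 gives delays of 100, 200 and 700: the third call would have doubled to 400, but a further doubling would overshoot the deadline at 2000, so the remaining 700 ms is used instead.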
 
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * flush_tabstat - flush out a local table stats entry
+ *
+ * Some of the counters are added to the local database stats entry after a
+ * successful flush.
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_tabstat(PgStatEnvelope * lenv, bool nowait)
+{
+    static const PgStat_TableCounts all_zeroes;
+    Oid            dboid;            /* database OID of the table */
+    PgStat_TableStatus *lstats; /* local stats entry  */
+    PgStatEnvelope *shenv;        /* shared stats envelope */
+    PgStat_StatTabEntry *shtabstats;    /* table entry of shared stats */
+    PgStat_StatDBEntry *ldbstats;    /* local database entry */
+    bool        found;
+
+    Assert(lenv->type == PGSTAT_TYPE_TABLE);
+
+    lstats = (PgStat_TableStatus *) &lenv->body;
+    dboid = lstats->t_shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Ignore entries that didn't accumulate any actual counts, such as
+     * indexes that were opened by the planner but not used.
+     */
+    if (memcmp(&lstats->t_counts, &all_zeroes,
+               sizeof(PgStat_TableCounts)) == 0)
+        return true;
+
+    /* find shared table stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_TABLE, dboid, lstats->t_id,
+                           nowait, init_tabentry, &found);
+
+    /* skip if dshash failed to acquire lock */
+    if (shenv == NULL)
+        return false;
+
+    /* retrieve the shared table stats entry from the envelope */
+    shtabstats = (PgStat_StatTabEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;
+
+    /* add the values to the shared entry. */
+    shtabstats->numscans += lstats->t_counts.t_numscans;
+    shtabstats->tuples_returned += lstats->t_counts.t_tuples_returned;
+    shtabstats->tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    shtabstats->tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    shtabstats->tuples_updated += lstats->t_counts.t_tuples_updated;
+    shtabstats->tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    shtabstats->tuples_hot_updated += lstats->t_counts.t_tuples_hot_updated;
+
+    /*
+     * If the table was truncated or vacuum/analyze has run, first reset the
+     * live/dead counters.
+     */
+    if (lstats->t_counts.t_truncated ||
+        lstats->t_counts.vacuum_count > 0 ||
+        lstats->t_counts.analyze_count > 0 ||
+        lstats->t_counts.autovac_vacuum_count > 0 ||
+        lstats->t_counts.autovac_analyze_count > 0)
+    {
+        shtabstats->n_live_tuples = 0;
+        shtabstats->n_dead_tuples = 0;
+    }
+
+    /* clear the change counter if requested */
+    if (lstats->t_counts.reset_changed_tuples)
+        shtabstats->changes_since_analyze = 0;
+
+    shtabstats->n_live_tuples += lstats->t_counts.t_delta_live_tuples;
+    shtabstats->n_dead_tuples += lstats->t_counts.t_delta_dead_tuples;
+    shtabstats->changes_since_analyze += lstats->t_counts.t_changed_tuples;
+    shtabstats->blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    shtabstats->blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /*
+     * Update vacuum/analyze timestamps and counters so that the timestamps
+     * never go backward.
+     */
+    if (shtabstats->vacuum_timestamp < lstats->vacuum_timestamp)
+        shtabstats->vacuum_timestamp = lstats->vacuum_timestamp;
+    shtabstats->vacuum_count += lstats->t_counts.vacuum_count;
+
+    if (shtabstats->autovac_vacuum_timestamp < lstats->autovac_vacuum_timestamp)
+        shtabstats->autovac_vacuum_timestamp = lstats->autovac_vacuum_timestamp;
+    shtabstats->autovac_vacuum_count += lstats->t_counts.autovac_vacuum_count;
+
+    if (shtabstats->analyze_timestamp < lstats->analyze_timestamp)
+        shtabstats->analyze_timestamp = lstats->analyze_timestamp;
+    shtabstats->analyze_count += lstats->t_counts.analyze_count;
+
+    if (shtabstats->autovac_analyze_timestamp < lstats->autovac_analyze_timestamp)
+        shtabstats->autovac_analyze_timestamp = lstats->autovac_analyze_timestamp;
+    shtabstats->autovac_analyze_count += lstats->t_counts.autovac_analyze_count;
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    shtabstats->n_live_tuples = Max(shtabstats->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    shtabstats->n_dead_tuples = Max(shtabstats->n_dead_tuples, 0);
+
+    LWLockRelease(&shenv->lock);
+
+    /* The flush succeeded, so add the same counts to the database stats */
+    ldbstats = get_local_dbstat_entry(dboid);
+    ldbstats->counts.n_tuples_returned += lstats->t_counts.t_tuples_returned;
+    ldbstats->counts.n_tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    ldbstats->counts.n_tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    ldbstats->counts.n_tuples_updated += lstats->t_counts.t_tuples_updated;
+    ldbstats->counts.n_tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    ldbstats->counts.n_blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    ldbstats->counts.n_blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    return true;
+}
+
+/* ----------
+ * init_tabentry() -
+ *
+ * initializes table stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
  */
 static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+init_tabentry(PgStatEnvelope * env)
 {
-    int            n;
-    int            len;
+    PgStat_StatTabEntry *tabent = (PgStat_StatTabEntry *) &env->body;
+
+    /*
+     * If it's a new table entry, initialize all counters to zero; the caller
+     * adds the flushed values afterwards.
+     */
+    Assert(env->type == PGSTAT_TYPE_TABLE);
+    tabent->tableid = env->objectid;
+    tabent->numscans = 0;
+    tabent->tuples_returned = 0;
+    tabent->tuples_fetched = 0;
+    tabent->tuples_inserted = 0;
+    tabent->tuples_updated = 0;
+    tabent->tuples_deleted = 0;
+    tabent->tuples_hot_updated = 0;
+    tabent->n_live_tuples = 0;
+    tabent->n_dead_tuples = 0;
+    tabent->changes_since_analyze = 0;
+    tabent->blocks_fetched = 0;
+    tabent->blocks_hit = 0;
+
+    tabent->vacuum_timestamp = 0;
+    tabent->vacuum_count = 0;
+    tabent->autovac_vacuum_timestamp = 0;
+    tabent->autovac_vacuum_count = 0;
+    tabent->analyze_timestamp = 0;
+    tabent->analyze_count = 0;
+    tabent->autovac_analyze_timestamp = 0;
+    tabent->autovac_analyze_count = 0;
+}
+
+
+/*
+ * flush_funcstat - flush out a local function stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_funcstat(PgStatEnvelope * env, bool nowait)
+{
+    /* we assume this inits to all zeroes: */
+    static const PgStat_FunctionCounts all_zeroes;
+    PgStat_BackendFunctionEntry *localent;    /* local stats entry */
+    PgStatEnvelope *shenv;        /* shared stats envelope */
+    PgStat_StatFuncEntry *sharedent = NULL; /* shared stats entry */
+    bool        found;
+
+    Assert(env->type == PGSTAT_TYPE_FUNCTION);
+    localent = (PgStat_BackendFunctionEntry *) &env->body;
+
+    /* Skip it if no counts accumulated for it so far */
+    if (memcmp(&localent->f_counts, &all_zeroes,
+               sizeof(PgStat_FunctionCounts)) == 0)
+        return true;
+
+    /* find shared function stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId, localent->f_id,
+                           nowait, init_funcentry, &found);
+    /* skip if dshash failed to acquire lock */
+    if (shenv == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    /* retrieve the shared function stats entry from the envelope */
+    sharedent = (PgStat_StatFuncEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
+
+    sharedent->f_numcalls += localent->f_counts.f_numcalls;
+    sharedent->f_total_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_total_time);
+    sharedent->f_self_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_self_time);
+
+    LWLockRelease(&shenv->lock);
+
+    return true;
+}
+
+
+/* ----------
+ * init_funcentry() -
+ *
+ * initializes function stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
+ */
+static void
+init_funcentry(PgStatEnvelope * env)
+{
+    PgStat_StatFuncEntry *shstat = (PgStat_StatFuncEntry *) &env->body;
+
+    Assert(env->type == PGSTAT_TYPE_FUNCTION);
+    shstat->functionid = env->objectid;
+    shstat->f_numcalls = 0;
+    shstat->f_total_time = 0;
+    shstat->f_self_time = 0;
+}
+
+
+/*
+ * flush_dbstat - flush out a local database stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_dbstat(PgStatEnvelope * env, bool nowait)
+{
+    PgStat_StatDBEntry *localent;
+    PgStatEnvelope *shenv;
+    PgStat_StatDBEntry *sharedent;
+
+    Assert(env->type == PGSTAT_TYPE_DB);
+
+    localent = (PgStat_StatDBEntry *) &env->body;
+
+    /* find shared database stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_DB, localent->databaseid, InvalidOid,
+                           nowait, init_dbentry, NULL);
+
+    /* skip if dshash failed to acquire lock */
+    if (!shenv)
+        return false;
+
+    /* retrieve the shared stats entry from the envelope */
+    sharedent = (PgStat_StatDBEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;
+
+    sharedent->counts.n_tuples_returned += localent->counts.n_tuples_returned;
+    sharedent->counts.n_tuples_fetched += localent->counts.n_tuples_fetched;
+    sharedent->counts.n_tuples_inserted += localent->counts.n_tuples_inserted;
+    sharedent->counts.n_tuples_updated += localent->counts.n_tuples_updated;
+    sharedent->counts.n_tuples_deleted += localent->counts.n_tuples_deleted;
+    sharedent->counts.n_blocks_fetched += localent->counts.n_blocks_fetched;
+    sharedent->counts.n_blocks_hit += localent->counts.n_blocks_hit;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    sharedent->counts.n_deadlocks += localent->counts.n_deadlocks;
+    sharedent->counts.n_temp_bytes += localent->counts.n_temp_bytes;
+    sharedent->counts.n_temp_files += localent->counts.n_temp_files;
+    sharedent->counts.n_checksum_failures += localent->counts.n_checksum_failures;
 
     /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
+     * Accumulate xact commit/rollback and I/O timings to stats entry of the
+     * current database.
      */
-    if (OidIsValid(tsmsg->m_databaseid))
+    if (OidIsValid(localent->databaseid))
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
+        sharedent->counts.n_xact_commit += pgStatXactCommit;
+        sharedent->counts.n_xact_rollback += pgStatXactRollback;
+        sharedent->counts.n_block_read_time += pgStatBlockReadTime;
+        sharedent->counts.n_block_write_time += pgStatBlockWriteTime;
         pgStatXactCommit = 0;
         pgStatXactRollback = 0;
         pgStatBlockReadTime = 0;
@@ -972,257 +1131,149 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        sharedent->counts.n_xact_commit = 0;
+        sharedent->counts.n_xact_rollback = 0;
+        sharedent->counts.n_block_read_time = 0;
+        sharedent->counts.n_block_write_time = 0;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
+    LWLockRelease(&shenv->lock);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    return true;
 }
 
-/*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
+
+/* ----------
+ * init_dbentry() -
+ *
+ * initializes database stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
  */
 static void
-pgstat_send_funcstats(void)
+init_dbentry(PgStatEnvelope * env)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_FunctionCounts all_zeroes;
+    PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) &env->body;
+
+    Assert(env->type == PGSTAT_TYPE_DB);
+    dbentry->databaseid = env->databaseid;
+    dbentry->last_autovac_time = 0;
+    dbentry->last_checksum_failure = 0;
+    dbentry->stat_reset_timestamp = 0;
+    dbentry->stats_timestamp = 0;
+    /* initialize the new shared entry */
+    MemSet(&dbentry->counts, 0, sizeof(PgStat_StatDBCounts));
+}
+
 
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
+/*
+ * flush_slrustat - flush out local SLRU stats entries
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if all local SLRU stats are successfully flushed out.
+ */
+static bool
+flush_slrustat(bool nowait)
+{
+    int i;
 
-    if (pgStatFunctions == NULL)
-        return;
+    if (!have_slrustats)
+        return true;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shared_SLRUStats->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shared_SLRUStats->lock, LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
 
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    for (i = 0 ; i < SLRU_NUM_ELEMENTS ; i++)
     {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
+        PgStat_StatSLRUEntry *sharedent = &shared_SLRUStats->entry[i];
+        PgStat_StatSLRUEntry *localent = &local_SLRUStats[i];
+
+        sharedent->blocks_zeroed += localent->blocks_zeroed;
+        sharedent->blocks_hit += localent->blocks_hit;
+        sharedent->blocks_read += localent->blocks_read;
+        sharedent->blocks_written += localent->blocks_written;
+        sharedent->blocks_exists += localent->blocks_exists;
+        sharedent->flush += localent->flush;
+        sharedent->truncate += localent->truncate;
+    }
 
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
+    /* done, clear the local entry */
+    MemSet(local_SLRUStats, 0,
+           sizeof(PgStat_StatSLRUEntry) * SLRU_NUM_ELEMENTS);
+    LWLockRelease(&shared_SLRUStats->lock);
 
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
+    have_slrustats = false;
 
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
+    return true;
+}
 
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
 
-    have_function_stats = false;
+/*
+ * Create the filename for a DB stat file; filename is an output parameter that
+ * points to a character buffer of length len.
+ */
+static void
+get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
+{
+    int            printed;
+
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/db_%u.%s",
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
+                       databaseid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
 }
 
 
 /* ----------
- * pgstat_vacuum_stat() -
+ * collect_stat_entries() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *    Collect the shared statistics entries specified by type and dbid. Returns
+ *  a NULL-terminated list of pointers to shared statistics in palloc'ed
+ *  memory. If type is PGSTAT_TYPE_ALL, all types of statistics of the
+ *  database are collected. If type is PGSTAT_TYPE_DB, the parameter dbid is
+ *  ignored and all PGSTAT_TYPE_DB entries are collected.
  * ----------
  */
-void
-pgstat_vacuum_stat(void)
+static PgStatEnvelope **collect_stat_entries(PgStatTypes type, Oid dbid)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Read pg_database and make a list of OIDs of all existing databases
-     */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
-
-    /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            dbid = dbentry->databaseid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        /* the DB entry for shared tables (with InvalidOid) is never dropped */
-        if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
-            pgstat_drop_database(dbid);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Lookup our own database entry; if not found, nothing more to do.
-     */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
-        return;
-
-    /*
-     * Similarly to above, make a list of all known relations in this DB.
-     */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
-
-    /*
-     * Check for all tables listed in stats hashtable if they still exist.
-     */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
+    int            listlen = 16;
+    PgStatEnvelope **envlist = palloc(sizeof(PgStatEnvelope *) * listlen);
+    int            n = 0;
+
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
     {
-        Oid            tabid = tabentry->tableid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+        if ((type != PGSTAT_TYPE_ALL && p->key.type != type) ||
+            (type != PGSTAT_TYPE_DB && p->key.databaseid != dbid))
             continue;
 
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
+        if (n >= listlen - 1)
         {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
+            listlen *= 2;
+            envlist = repalloc(envlist, listlen * sizeof(PgStatEnvelope *));
         }
+        envlist[n++] = dsa_get_address(area, p->env);
     }
+    dshash_seq_term(&hstat);
 
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
+    envlist[n] = NULL;
 
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
-        {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
-                continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
-        }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
-    }
+    return envlist;
 }
 
 
 /* ----------
- * pgstat_collect_oids() -
+ * collect_oids() -
  *
  *    Collect the OIDs of all objects listed in the specified system catalog
  *    into a temporary hash table.  Caller should hash_destroy the result
@@ -1231,7 +1282,7 @@ pgstat_vacuum_stat(void)
  * ----------
  */
 static HTAB *
-pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
+collect_oids(Oid catalogid, AttrNumber anum_oid)
 {
     HTAB       *htab;
     HASHCTL        hash_ctl;
@@ -1245,7 +1296,7 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     hash_ctl.entrysize = sizeof(Oid);
     hash_ctl.hcxt = CurrentMemoryContext;
     htab = hash_create("Temporary table of OIDs",
-                       PGSTAT_TAB_HASH_SIZE,
+                       PGSTAT_TABLE_HASH_SIZE,
                        &hash_ctl,
                        HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
 
@@ -1272,65 +1323,185 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
 }
 
 
+/* ----------
+ * pgstat_vacuum_stat() -
+ *
+ *  Delete shared stat entries that are not in system catalogs.
+ *
+ *  To avoid holding exclusive lock on dshash for a long time, the process is
+ *  performed in three steps.
+ *
+ *   1: Collect existent oids of every kind of object.
+ *   2: Collect victim entries by scanning with shared lock.
+ *   3: Try removing every nominated entry without waiting for lock.
+ *
+ *  As the consequence of the last step, some entries may be left alone due to
+ *  lock failure, but as explained by the comment of pgstat_vacuum_stat, they
+ *  will be deleted by later vacuums.
+ * ----------
+ */
+void
+pgstat_vacuum_stat(void)
+{
+    HTAB       *dbids;            /* database ids */
+    HTAB       *relids;            /* relation ids in the current database */
+    HTAB       *funcids;        /* function ids in the current database */
+    PgStatEnvelope **victims;    /* victim entry list */
+    int            arraylen = 0;    /* storage size of the above */
+    int            nvictims = 0;    /* # of entries of the above */
+    dshash_seq_status dshstat;
+    PgStatHashEntry *ent;
+    int            i;
+
+    /* we don't collect stats under standalone mode */
+    if (!IsUnderPostmaster)
+        return;
+
+    /* collect oids of existent objects */
+    dbids = collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    relids = collect_oids(RelationRelationId, Anum_pg_class_oid);
+    funcids = collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
+
+    /* collect victims from shared stats */
+    arraylen = 16;
+    victims = palloc(sizeof(PgStatEnvelope *) * arraylen);
+    nvictims = 0;
+
+    dshash_seq_init(&dshstat, pgStatSharedHash, false);
+
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
+    {
+        HTAB       *oidtab;
+        Oid           *key;
+
+        CHECK_FOR_INTERRUPTS();
+
+        /*
+         * Don't drop entries of non-database objects that do not belong to
+         * the current database.
+         */
+        if (ent->key.type != PGSTAT_TYPE_DB &&
+            ent->key.databaseid != MyDatabaseId)
+            continue;
+
+        switch (ent->key.type)
+        {
+            case PGSTAT_TYPE_DB:
+                /* don't remove database entry for shared tables */
+                if (ent->key.databaseid == 0)
+                    continue;
+                oidtab = dbids;
+                key = &ent->key.databaseid;
+                break;
+
+            case PGSTAT_TYPE_TABLE:
+                oidtab = relids;
+                key = &ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                oidtab = funcids;
+                key = &ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_ALL:
+                Assert(false);
+                break;
+        }
+
+        /* Skip existent objects. */
+        if (hash_search(oidtab, key, HASH_FIND, NULL) != NULL)
+            continue;
+
+        /* extend the list if needed */
+        if (nvictims >= arraylen)
+        {
+            arraylen *= 2;
+            victims = repalloc(victims, sizeof(PgStatEnvelope *) * arraylen);
+        }
+
+        victims[nvictims++] = dsa_get_address(area, ent->env);
+    }
+    dshash_seq_term(&dshstat);
+    hash_destroy(dbids);
+    hash_destroy(relids);
+    hash_destroy(funcids);
+
+    /* Now try removing the victim entries */
+    for (i = 0; i < nvictims; i++)
+    {
+        PgStatEnvelope *p = victims[i];
+
+        delete_stat_entry(p->type, p->databaseid, p->objectid, true);
+    }
+}
+
+
+/* ----------
+ * delete_stat_entry() -
+ *
+ *  Deletes the specified entry from shared stats hash.
+ *
+ *  Returns true when successfully deleted.
+ * ----------
+ */
+static bool
+delete_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait)
+{
+    PgStatHashEntryKey key;
+    PgStatHashEntry *ent;
+
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    ent = dshash_find_extended(pgStatSharedHash, &key,
+                               true, nowait, false, NULL);
+
+    if (!ent)
+        return false;            /* lock failed or not found */
+
+    /* The entry is exclusively locked, so we can free the chunk first. */
+    dsa_free(area, ent->env);
+    dshash_delete_entry(pgStatSharedHash, ent);
+
+    return true;
+}
+
+
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
- * ----------
+ *    Remove entries for the database that we just dropped.
+ *
+ *    Some entries might be left behind due to lock failure, or some stats
+ *    might be flushed after this, but we will still clean the dead DB
+ *    eventually via future invocations of pgstat_vacuum_stat().
+ * ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
+    Assert(OidIsValid(databaseid));
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
+    envlist = collect_stat_entries(PGSTAT_TYPE_ALL, databaseid);
 
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
+    for (p = envlist; *p != NULL; p++)
+        delete_stat_entry((*p)->type, (*p)->databaseid, (*p)->objectid, true);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
+    pfree(envlist);
 }
-#endif                            /* NOT_USED */
 
 
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1339,20 +1510,47 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    /* Lookup the entries of the current database in the stats hash. */
+    envlist = collect_stat_entries(PGSTAT_TYPE_ALL, MyDatabaseId);
+    for (p = envlist; *p != NULL; p++)
+    {
+        PgStatEnvelope *env = *p;
+        PgStat_StatDBEntry *dbstat;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+        LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+
+        switch (env->type)
+        {
+            case PGSTAT_TYPE_TABLE:
+                init_tabentry(env);
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                init_funcentry(env);
+                break;
+
+            case PGSTAT_TYPE_DB:
+                init_dbentry(env);
+                dbstat = (PgStat_StatDBEntry *) &env->body;
+                dbstat->stat_reset_timestamp = GetCurrentTimestamp();
+                break;
+            default:
+                Assert(false);
+        }
+
+        LWLockRelease(&env->lock);
+    }
+
+    pfree(envlist);
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1361,29 +1559,37 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
+    /* Reset the archiver statistics for the cluster. */
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    /* Reset the bgwriter statistics for the cluster. */
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1392,17 +1598,42 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStatEnvelope *env;
+    PgStat_StatDBEntry *dbentry;
+    PgStatTypes stattype;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    env = get_stat_entry(PGSTAT_TYPE_DB, MyDatabaseId, InvalidOid,
+                         false, NULL, NULL);
+    Assert(env);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* Set the reset timestamp for the whole database */
+    dbentry = (PgStat_StatDBEntry *) &env->body;
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&env->lock);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Reset the stats entry of the specified object */
+    switch (type)
+    {
+        case RESET_TABLE:
+            stattype = PGSTAT_TYPE_TABLE;
+            break;
+        case RESET_FUNCTION:
+            stattype = PGSTAT_TYPE_FUNCTION;
+    }
+
+    env = get_stat_entry(stattype, MyDatabaseId, objoid, false, NULL, NULL);
+    LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+    if (env->type == PGSTAT_TYPE_TABLE)
+        init_tabentry(env);
+    else
+    {
+        Assert(env->type == PGSTAT_TYPE_FUNCTION);
+        init_funcentry(env);
+    }
+    LWLockRelease(&env->lock);
 }
 
 /* ----------
@@ -1418,15 +1649,27 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_reset_slru_counter(const char *name)
 {
-    PgStat_MsgResetslrucounter msg;
+    int i;
+    TimestampTz    ts = GetCurrentTimestamp();
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSLRUCOUNTER);
-    msg.m_index = (name) ? pgstat_slru_index(name) : -1;
-
-    pgstat_send(&msg, sizeof(msg));
+    if (name)
+    {
+        i = pgstat_slru_index(name);
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(&shared_SLRUStats->entry[i], 0, sizeof(PgStat_StatSLRUEntry));
+        shared_SLRUStats->entry[i].stat_reset_timestamp = ts;
+        LWLockRelease(StatsLock);
+    }
+    else
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        for (i = 0 ; i < SLRU_NUM_ELEMENTS; i++)
+        {
+            MemSet(&shared_SLRUStats->entry[i], 0, sizeof(PgStat_StatSLRUEntry));
+            shared_SLRUStats->entry[i].stat_reset_timestamp = ts;
+        }
+        LWLockRelease(StatsLock);
+    }
 }
 
 /* ----------
@@ -1440,48 +1683,63 @@ pgstat_reset_slru_counter(const char *name)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if activity stats is not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    ts = GetCurrentTimestamp();
 
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Store the last autovacuum time in the database's hash table entry.
+     */
+    dbentry = get_local_dbstat_entry(dboid);
+    dbentry->last_autovac_time = ts;
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report about the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    PgStat_TableStatus *tabentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    ts = GetCurrentTimestamp();
+    tabentry = get_local_tabstat_entry(tableoid, shared);
+
+    tabentry->t_counts.t_delta_live_tuples = livetuples;
+    tabentry->t_counts.t_delta_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = ts;
+        tabentry->t_counts.autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = ts;
+        tabentry->t_counts.vacuum_count++;
+    }
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report about the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1492,9 +1750,10 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    PgStat_TableStatus *tabentry;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1502,10 +1761,10 @@ pgstat_report_analyze(Relation rel,
      * already inserted and/or deleted rows in the target table. ANALYZE will
      * have counted such rows as live or dead respectively. Because we will
      * report our counts of such rows at transaction end, we should subtract
-     * off these counts from what we send to the collector now, else they'll
-     * be double-counted after commit.  (This approach also ensures that the
-     * collector ends up with the right numbers if we abort instead of
-     * committing.)
+     * off these counts from what is already written to shared stats now, else
+     * they'll be double-counted after commit.  (This approach also ensures
+     * that the shared stats end up with the right numbers if we abort
+     * instead of committing.)
      */
     if (rel->pgstat_info != NULL)
     {
@@ -1523,158 +1782,172 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    tabentry = get_local_tabstat_entry(RelationGetRelid(rel),
+                                       rel->rd_rel->relisshared);
+
+    tabentry->t_counts.t_delta_live_tuples = livetuples;
+    tabentry->t_counts.t_delta_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->t_counts.reset_changed_tuples = true;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->t_counts.autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->t_counts.analyze_count++;
+    }
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            dbent->counts.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            dbent->counts.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            dbent->counts.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            dbent->counts.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            dbent->counts.n_conflict_startup_deadlock++;
+            break;
+    }
 }
 
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-
-/* --------
- * pgstat_report_checksum_failures_in_db() -
- *
- *    Tell the collector about one or more checksum failures.
- * --------
- */
-void
-pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
-{
-    PgStat_MsgChecksumFailure msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
-
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_deadlocks++;
 }
 
 /* --------
  * pgstat_report_checksum_failure() -
  *
- *    Tell the collector about a checksum failure.
+ *    Report a checksum failure.
  * --------
  */
 void
 pgstat_report_checksum_failure(void)
 {
-    pgstat_report_checksum_failures_in_db(MyDatabaseId, 1);
+    PgStat_StatDBEntry *dbent;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_checksum_failures++;
 }
 
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
+    if (filesize == 0)            /* Is there a case where filesize is really 0? */
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_temp_bytes += filesize; /* XXX: needs overflow check */
+    dbent->counts.n_temp_files++;
 }
 
 
-/* ----------
- * pgstat_ping() -
+/* --------
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
  *
- *    Send some junk data to the collector to increase traffic.
- * ----------
+ *    Report one or more checksum failures.
+ * --------
  */
 void
-pgstat_ping(void)
+pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
 {
-    PgStat_MsgDummy msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if we are not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dbentry = get_local_dbstat_entry(dboid);
+
+    /* add the reported failures to the accumulated count */
+    dbentry->counts.n_checksum_failures += failurecount;
 }
 
 /* ----------
- * pgstat_send_inquiry() -
+ * pgstat_init_function_usage() -
  *
- *    Notify collector that we need fresh data.
+ *  Initialize function call usage data.
+ *  Called by the executor before invoking a function.
  * ----------
  */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/*
- * Initialize function call usage data.
- * Called by the executor before invoking a function.
- */
 void
 pgstat_init_function_usage(FunctionCallInfo fcinfo,
                            PgStat_FunctionCallUsage *fcu)
 {
+    PgStatEnvelope *env;
     PgStat_BackendFunctionEntry *htabent;
     bool        found;
 
@@ -1685,26 +1958,15 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
         return;
     }
 
-    if (!pgStatFunctions)
-    {
-        /* First time through - initialize function stat table */
-        HASHCTL        hash_ctl;
+    env = get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                               fcinfo->flinfo->fn_oid, true, &found);
+    htabent = (PgStat_BackendFunctionEntry *) &env->body;
 
-        memset(&hash_ctl, 0, sizeof(hash_ctl));
-        hash_ctl.keysize = sizeof(Oid);
-        hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
-        pgStatFunctions = hash_create("Function stat entries",
-                                      PGSTAT_FUNCTION_HASH_SIZE,
-                                      &hash_ctl,
-                                      HASH_ELEM | HASH_BLOBS);
-    }
-
-    /* Get the stats entry for this function, create if necessary */
-    htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
-                          HASH_ENTER, &found);
     if (!found)
         MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
 
+    htabent->f_id = fcinfo->flinfo->fn_oid;
+
     fcu->fs = &htabent->f_counts;
 
     /* save stats for this function, later used to compensate for recursion */
@@ -1717,31 +1979,38 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
     INSTR_TIME_SET_CURRENT(fcu->f_start);
 }
 
-/*
- * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
- *        for specified function
+/* ----------
+ * find_funcstat_entry() -
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_BackendFunctionEntry for the specified function.
+ *
+ *  If there is no entry, return NULL; do not create a new one.
+ * ----------
  */
 PgStat_BackendFunctionEntry *
 find_funcstat_entry(Oid func_id)
 {
-    if (pgStatFunctions == NULL)
+    PgStatEnvelope *env;
+
+    env = get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                               func_id, false, NULL);
+    if (!env)
         return NULL;
 
-    return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
-                                                       (void *) &func_id,
-                                                       HASH_FIND, NULL);
+    return (PgStat_BackendFunctionEntry *) &env->body;
 }
 
-/*
- * Calculate function call usage and update stat counters.
- * Called by the executor after invoking a function.
+/* ----------
+ * pgstat_end_function_usage() -
  *
- * In the case of a set-returning function that runs in value-per-call mode,
- * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
- * calls for what the user considers a single call of the function.  The
- * finalize flag should be TRUE on the last call.
+ *  Calculate function call usage and update stat counters.
+ *  Called by the executor after invoking a function.
+ *
+ *  In the case of a set-returning function that runs in value-per-call mode,
+ *  we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
+ *  calls for what the user considers a single call of the function.  The
+ *  finalize flag should be TRUE on the last call.
+ * ----------
  */
 void
 pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
@@ -1782,9 +2051,6 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
         fs->f_numcalls++;
     fs->f_total_time = f_total;
     INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
 }
 
 
@@ -1796,8 +2062,7 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
  *
  *    We assume that a relcache entry's pgstat_info field is zeroed by
  *    relcache.c when the relcache entry is made; thereafter it is long-lived
- *    data.  We can avoid repeated searches of the TabStatus arrays when the
- *    same relation is touched repeatedly within a transaction.
+ *    data.
  * ----------
  */
 void
@@ -1817,7 +2082,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1833,116 +2099,157 @@ pgstat_initstats(Relation rel)
         return;
 
     /* Else find or make the PgStat_TableStatus entry, and update link */
-    rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+    rel->pgstat_info = get_local_tabstat_entry(rel_id, rel->rd_rel->relisshared);
 }
 
-/*
- * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
+
+/* ----------
+ * get_local_stat_entry() -
+ *
+ *  Returns the local stats entry for the given type, dbid and objid.
+ *  If create is true, a new entry is created when none exists yet; found
+ *  must be non-NULL in that case.
+ *
+ *  The caller is responsible for initializing the body of the returned
+ *  envelope.
+ * ----------
  */
-static PgStat_TableStatus *
-get_tabstat_entry(Oid rel_id, bool isshared)
+static PgStatEnvelope *
+get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                     bool create, bool *found)
 {
-    TabStatHashEntry *hash_entry;
-    PgStat_TableStatus *entry;
-    TabStatusArray *tsa;
-    bool        found;
+    PgStatHashEntryKey key;
+    PgStatLocalHashEntry *entry;
 
-    /*
-     * Create hash table if we don't have it already.
-     */
-    if (pgStatTabHash == NULL)
+    if (pgStatLocalHash == NULL)
     {
         HASHCTL        ctl;
 
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
+        MemSet(&ctl, 0, sizeof(ctl));
+        ctl.keysize = sizeof(PgStatHashEntryKey);
+        ctl.entrysize = sizeof(PgStatLocalHashEntry);
+
+        pgStatLocalHash = hash_create("Local stat entries",
+                                      PGSTAT_TABLE_HASH_SIZE,
+                                      &ctl,
+                                      HASH_ELEM | HASH_BLOBS);
+    }
+
+    /* Find an entry or create a new one. */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    entry = hash_search(pgStatLocalHash, &key,
+                        create ? HASH_ENTER : HASH_FIND, found);
+
+    if (!create && !entry)
+        return NULL;
+
+    if (create && !*found)
+    {
+        int            len = pgstat_localentsize[type];
+
+        entry->env = MemoryContextAlloc(TopMemoryContext,
+                                        PgStatEnvelopeSize(len));
+        entry->env->type = type;
+        entry->env->len = len;
     }
 
+    return entry->env;
+}
+
+/* ----------
+ * get_local_dbstat_entry() -
+ *
+ *  Find or create a local PgStat_StatDBEntry entry for dbid.  A new entry is
+ *  created and initialized if none exists.
+ */
+static PgStat_StatDBEntry *
+get_local_dbstat_entry(Oid dbid)
+{
+    PgStatEnvelope *env;
+    PgStat_StatDBEntry *dbentry;
+    bool        found;
+
     /*
      * Find an entry or create a new one.
      */
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
+    env = get_local_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid,
+                               true, &found);
+    dbentry = (PgStat_StatDBEntry *) &env->body;
+
     if (!found)
     {
-        /* initialize new entry with null pointer */
-        hash_entry->tsa_entry = NULL;
+        dbentry->databaseid = dbid;
+        dbentry->last_autovac_time = 0;
+        dbentry->last_checksum_failure = 0;
+        dbentry->stat_reset_timestamp = 0;
+        dbentry->stats_timestamp = 0;
+        MemSet(&dbentry->counts, 0, sizeof(PgStat_StatDBCounts));
     }
 
-    /*
-     * If entry is already valid, we're done.
-     */
-    if (hash_entry->tsa_entry)
-        return hash_entry->tsa_entry;
-
-    /*
-     * Locate the first pgStatTabList entry with free space, making a new list
-     * entry if needed.  Note that we could get an OOM failure here, but if so
-     * we have left the hashtable and the list in a consistent state.
-     */
-    if (pgStatTabList == NULL)
-    {
-        /* Set up first pgStatTabList entry */
-        pgStatTabList = (TabStatusArray *)
-            MemoryContextAllocZero(TopMemoryContext,
-                                   sizeof(TabStatusArray));
-    }
+    return dbentry;
+}
 
-    tsa = pgStatTabList;
-    while (tsa->tsa_used >= TABSTAT_QUANTUM)
-    {
-        if (tsa->tsa_next == NULL)
-            tsa->tsa_next = (TabStatusArray *)
-                MemoryContextAllocZero(TopMemoryContext,
-                                       sizeof(TabStatusArray));
-        tsa = tsa->tsa_next;
-    }
 
-    /*
-     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
-     * the entry was already zeroed, either at creation or after last use.
-     */
-    entry = &tsa->tsa_entries[tsa->tsa_used++];
-    entry->t_id = rel_id;
-    entry->t_shared = isshared;
+/* ----------
+ * get_local_tabstat_entry() -
+ *  Find or create a PgStat_TableStatus entry for rel.  A new entry is created
+ *  and initialized if none exists.
+ * ----------
+ */
+static PgStat_TableStatus *
+get_local_tabstat_entry(Oid rel_id, bool isshared)
+{
+    PgStatEnvelope *env;
+    PgStat_TableStatus *tabentry;
+    bool        found;
 
-    /*
-     * Now we can fill the entry in pgStatTabHash.
-     */
-    hash_entry->tsa_entry = entry;
+    env = get_local_stat_entry(PGSTAT_TYPE_TABLE,
+                               isshared ? InvalidOid : MyDatabaseId,
+                               rel_id, true, &found);
 
-    return entry;
+    tabentry = (PgStat_TableStatus *) &env->body;
+
+    if (!found)
+    {
+        tabentry->t_id = rel_id;
+        tabentry->t_shared = isshared;
+        tabentry->trans = NULL;
+        MemSet(&tabentry->t_counts, 0, sizeof(PgStat_TableCounts));
+        tabentry->vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+    }
+
+    return tabentry;
 }
 
+
 /*
  * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_TableStatus entry for rel from the current
+ *  database then from shared tables.
  *
- * Note: if we got an error in the most recent execution of pgstat_report_stat,
- * it's possible that an entry exists but there's no hashtable entry for it.
- * That's okay, we'll treat this case as "doesn't exist".
+ *  If no entry, return NULL, don't create a new one
+ * ----------
  */
 PgStat_TableStatus *
 find_tabstat_entry(Oid rel_id)
 {
-    TabStatHashEntry *hash_entry;
+    PgStatEnvelope *env;
 
-    /* If hashtable doesn't exist, there are no entries at all */
-    if (!pgStatTabHash)
-        return NULL;
+    env = get_local_stat_entry(PGSTAT_TYPE_TABLE, MyDatabaseId, rel_id,
+                               false, NULL);
+    if (!env)
+        env = get_local_stat_entry(PGSTAT_TYPE_TABLE, InvalidOid, rel_id,
+                                   false, NULL);
+    if (env)
+        return (PgStat_TableStatus *) &env->body;
 
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
-    if (!hash_entry)
-        return NULL;
-
-    /* Note that this step could also return NULL, but that's correct */
-    return hash_entry->tsa_entry;
+    return NULL;
 }
 
 /*
@@ -2373,8 +2680,8 @@ AtPrepare_PgStat(void)
  *
  * All we need do here is unlink the transaction stats state from the
  * nontransactional state.  The nontransactional action counts will be
- * reported to the stats collector immediately, while the effects on live
- * and dead tuple counts are preserved in the 2PC state file.
+ * reported to the activity stats facility immediately, while the effects on
+ * live and dead tuple counts are preserved in the 2PC state file.
  *
  * Note: AtEOXact_PgStat is not called during PREPARE.
  */
@@ -2419,7 +2726,7 @@ pgstat_twophase_postcommit(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, commit case */
     pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
@@ -2455,7 +2762,7 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, abort case */
     if (rec->t_truncated)
@@ -2472,88 +2779,176 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 }
 
 
+/* ----------
+ * snapshot_statentry() -
+ *
+ *  Common routine for functions pgstat_fetch_stat_*entry()
+ *
+ *  Returns the pointer to the snapshot of the shared entry for the key or NULL
+ *  if not found. Returned snapshots are stable during the current transaction
+ *  or until pgstat_clear_snapshot() is called.
+ *
+ *  Created snapshots are stored in pgStatSnapshotHash.
+ */
+static void *
+snapshot_statentry(const PgStatTypes type, const Oid dbid, const Oid objid)
+{
+    PgStatSnapshot *snap = NULL;
+    bool        found;
+    PgStatHashEntryKey key;
+    size_t        statentsize = pgstat_entsize[type];
+
+    Assert(type != PGSTAT_TYPE_ALL);
+
+    /*
+     * Create new hash, with rather arbitrary initial number of entries since
+     * we don't know how this hash will grow.
+     */
+    if (!pgStatSnapshotHash)
+    {
+        HASHCTL        ctl;
+
+        /*
+         * Create the hash in the stats context
+         *
+         * Each entry is preceded by a common header part, represented by
+         * PgStatSnapshot.
+         */
+
+        ctl.keysize = sizeof(PgStatHashEntryKey);
+        ctl.entrysize = PgStatSnapshotSize(statentsize);
+        ctl.hcxt = pgStatSnapshotContext;
+        pgStatSnapshotHash = hash_create("pgstat snapshot hash", 32, &ctl,
+                                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    }
+
+    /* Find a snapshot  */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+
+    snap = hash_search(pgStatSnapshotHash, &key, HASH_ENTER, &found);
+
+    /*
+     * Refer to the shared hash if not found in the snapshot hash.
+     *
+     * In transaction state, we obviously should create a snapshot entry for
+     * consistency.  Outside a transaction we return an up-to-date entry, but
+     * we still need a copy since a shared stats entry can be modified at any
+     * time.  We reuse the same snapshot entry for that purpose.
+     */
+    if (!found || !IsTransactionState())
+    {
+        PgStatEnvelope *shenv;
+
+        shenv = get_stat_entry(type, dbid, objid, true, NULL, NULL);
+
+        if (shenv)
+            memcpy(&snap->body, &shenv->body, statentsize);
+
+        snap->negative = !shenv;
+    }
+
+    if (snap->negative)
+        return NULL;
+
+    return &snap->body;
+}
+
+
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    Find database stats entry on backends. The returned entries are cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
  * ----------
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    /* If not done for this transaction, take a snapshot of global stats */
+    pgstat_snapshot_global_stats();
+
+    /* the caller has no business with snapshot-local members */
+    return (PgStat_StatDBEntry *)
+        snapshot_statentry(PGSTAT_TYPE_DB, dbid, InvalidOid);
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one table or NULL. NULL doesn't mean
+ *    the activity statistics for one table or NULL. NULL doesn't mean
  *    that the table doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    activity statistics facilities, so the caller is better off to
+ *    report ZERO instead.
  * ----------
  */
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
-    PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(false, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
-
-    return NULL;
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(true, relid);
+    return tabentry;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_tabentry_snapshot() -
+ *
+ *    Find table stats entry on backends.  The returned entry is cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_snapshot(bool shared, Oid reloid)
+{
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    return (PgStat_StatTabEntry *)
+        snapshot_statentry(PGSTAT_TYPE_TABLE, dboid, reloid);
+}
+
+
+/* ----------
+ * pgstat_copy_index_counters() -
+ *
+ *    Support function for index swapping. Copy a portion of the counters of
+ *    the relation to the specified place.
+ * ----------
+ */
+void
+pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst)
+{
+    PgStat_StatTabEntry *tabentry;
+
+    /* No point fetching tabentry when dst is NULL */
+    if (!dst)
+        return;
+
+    tabentry = pgstat_fetch_stat_tabentry(relid);
+
+    if (!tabentry)
+        return;
+
+    dst->t_counts.t_numscans = tabentry->numscans;
+    dst->t_counts.t_tuples_returned = tabentry->tuples_returned;
+    dst->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
+    dst->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
+    dst->t_counts.t_blocks_hit = tabentry->blocks_hit;
 }
 
 
@@ -2567,25 +2962,60 @@ pgstat_fetch_stat_tabentry(Oid relid)
 PgStat_StatFuncEntry *
 pgstat_fetch_stat_funcentry(Oid func_id)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry = NULL;
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
 
-    /* load the stats file if needed */
-    backend_read_statsfile();
+    return (PgStat_StatFuncEntry *)
+        snapshot_statentry(PGSTAT_TYPE_FUNCTION, MyDatabaseId, func_id);
+}
 
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
+/*
+ * pgstat_snapshot_global_stats() -
+ *
+ * Makes a snapshot of global stats if not done yet.  They will be kept until
+ * subsequent call of pgstat_clear_snapshot() or the end of the current
+ * memory context (typically TopTransactionContext).
+ * ----------
+ */
+static void
+pgstat_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext;
+    int    i;
+
+    attach_shared_stats();
+
+    /* Nothing to do if already done */
+    if (global_snapshot_is_valid)
+        return;
+
+    oldcontext = MemoryContextSwitchTo(pgStatSnapshotContext);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+
+    memcpy(&snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+    memcpy(&snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+    memcpy(snapshot_SLRUStats, shared_SLRUStats,
+           sizeof(PgStat_StatSLRUEntry) * SLRU_NUM_ELEMENTS);
+
+    LWLockRelease(StatsLock);
+
+    /* Fill in empty timestamp of SLRU stats  */
+    for (i = 0 ; i < SLRU_NUM_ELEMENTS ; i++)
     {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
+        if (snapshot_SLRUStats[i].stat_reset_timestamp == 0)
+            snapshot_SLRUStats[i].stat_reset_timestamp =
+                snapshot_globalStats.stat_reset_timestamp;
     }
+    global_snapshot_is_valid = true;
 
-    return funcentry;
+    MemoryContextSwitchTo(oldcontext);
+
+    return;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_beentry() -
  *
@@ -2656,9 +3086,10 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &archiverStats;
+    return &snapshot_archiverStats;
 }
 
 
@@ -2673,9 +3104,10 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &globalStats;
+    return &snapshot_globalStats;
 }
 
 
@@ -2687,12 +3119,12 @@ pgstat_fetch_global(void)
  *    a pointer to the slru statistics struct.
  * ---------
  */
-PgStat_SLRUStats *
+PgStat_StatSLRUEntry *
 pgstat_fetch_slru(void)
 {
-    backend_read_statsfile();
+    pgstat_snapshot_global_stats();
 
-    return slruStats;
+    return snapshot_SLRUStats;
 }
 
 
@@ -2906,8 +3338,8 @@ pgstat_initialize(void)
         MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
     }
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* needs to be called before DSM shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -3083,12 +3515,15 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+    /* attach shared database stats area */
+    attach_shared_stats();
 }
 
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
- * Flush any remaining statistics counts out to the collector.
+ * Flush any remaining statistics counts out to shared stats.
  * Without this, operations triggered during backend exit (such as
  * temp table deletions) won't be counted.
  *
@@ -3101,7 +3536,7 @@ pgstat_beshutdown_hook(int code, Datum arg)
 
     /*
      * If we got as far as discovering our own database ID, we can report what
-     * we did to the collector.  Otherwise, we'd be sending an invalid
+     * we did to the shared stats.  Otherwise, we'd be sending an invalid
      * database ID, so forget it.  (This means that accesses to pg_database
      * during failed backend starts might never get counted.)
      */
@@ -3118,6 +3553,8 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    detach_shared_stats(true);
 }
 
 
@@ -3378,7 +3815,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3673,9 +4111,6 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
-            break;
         case WAIT_EVENT_RECOVERY_WAL_STREAM:
             event_name = "RecoveryWalStream";
             break;
@@ -4310,94 +4745,71 @@ pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
 
 
 /* ----------
- * pgstat_setheader() -
+ * pgstat_report_archiver() -
  *
- *        Set common header fields in a statistics message
- * ----------
- */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
-    {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
- *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
+ *        Report archiver statistics
  * ----------
  */
 void
-pgstat_send_archiver(const char *xlog, bool failed)
+pgstat_report_archiver(const char *xlog, bool failed)
 {
-    PgStat_MsgArchiver msg;
+    TimestampTz now = GetCurrentTimestamp();
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+    if (failed)
+    {
+        /* Failed archival attempt */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->failed_count;
+        memcpy(shared_archiverStats->last_failed_wal, xlog,
+               sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    else
+    {
+        /* Successful archival operation */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->archived_count;
+        memcpy(shared_archiverStats->last_archived_wal, xlog,
+               sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
 }
 
 /* ----------
- * pgstat_send_bgwriter() -
+ * pgstat_report_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
-pgstat_send_bgwriter(void)
+pgstat_report_bgwriter(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+
+    PgStat_BgWriter *l = &BgWriterStats;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid taking the lock for completely empty stats.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += l->timed_checkpoints;
+    shared_globalStats->requested_checkpoints += l->requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += l->checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += l->checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += l->buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += l->buf_written_clean;
+    shared_globalStats->maxwritten_clean += l->maxwritten_clean;
+    shared_globalStats->buf_written_backend += l->buf_written_backend;
+    shared_globalStats->buf_fsync_backend += l->buf_fsync_backend;
+    shared_globalStats->buf_alloc += l->buf_alloc;
+    LWLockRelease(StatsLock);
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4405,473 +4817,31 @@ pgstat_send_bgwriter(void)
     MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
 }
 
-/* ----------
- * pgstat_send_slru() -
- *
- *        Send SLRU statistics to the collector
- * ----------
- */
-static void
-pgstat_send_slru(void)
-{
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgSLRU all_zeroes;
-
-    for (int i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /*
-         * This function can be called even if nothing at all has happened. In
-         * this case, avoid sending a completely empty message to the stats
-         * collector.
-         */
-        if (memcmp(&SLRUStats[i], &all_zeroes, sizeof(PgStat_MsgSLRU)) == 0)
-            continue;
-
-        /* set the SLRU type before each send */
-        SLRUStats[i].m_index = i;
-
-        /*
-         * Prepare and send the message
-         */
-        pgstat_setheader(&SLRUStats[i].m_hdr, PGSTAT_MTYPE_SLRU);
-        pgstat_send(&SLRUStats[i], sizeof(PgStat_MsgSLRU));
-
-        /*
-         * Clear out the statistics buffer, so it can be re-used.
-         */
-        MemSet(&SLRUStats[i], 0, sizeof(PgStat_MsgSLRU));
-    }
-}
-
-
-/* ----------
- * PgstatCollectorMain() -
- *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, SignalHandlerForConfigReload);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    MyBackendType = B_STATS_COLLECTOR;
-    init_ps_display(NULL);
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do check ConfigReloadPending
-     * inside the inner loop, which means that such interrupts will get
-     * serviced but the latch won't get cleared until next time there is a
-     * break in the action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (ShutdownRequestPending)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * ShutdownRequestPending becomes set.
-         */
-        while (!ShutdownRequestPending)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (ConfigReloadPending)
-            {
-                ConfigReloadPending = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSLRUCOUNTER:
-                    pgstat_recv_resetslrucounter(&msg.msg_resetslrucounter,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_SLRU:
-                    pgstat_recv_slru(&msg.msg_slru, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr & WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    exit(0);
-}
-
-/*
- * Subroutine to clear stats in a database entry
- *
- * Tables and functions hashes are initialized to empty.
- */
-static void
-reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
-{
-    HASHCTL        hash_ctl;
-
-    dbentry->n_xact_commit = 0;
-    dbentry->n_xact_rollback = 0;
-    dbentry->n_blocks_fetched = 0;
-    dbentry->n_blocks_hit = 0;
-    dbentry->n_tuples_returned = 0;
-    dbentry->n_tuples_fetched = 0;
-    dbentry->n_tuples_inserted = 0;
-    dbentry->n_tuples_updated = 0;
-    dbentry->n_tuples_deleted = 0;
-    dbentry->last_autovac_time = 0;
-    dbentry->n_conflict_tablespace = 0;
-    dbentry->n_conflict_lock = 0;
-    dbentry->n_conflict_snapshot = 0;
-    dbentry->n_conflict_bufferpin = 0;
-    dbentry->n_conflict_startup_deadlock = 0;
-    dbentry->n_temp_files = 0;
-    dbentry->n_temp_bytes = 0;
-    dbentry->n_deadlocks = 0;
-    dbentry->n_checksum_failures = 0;
-    dbentry->last_checksum_failure = 0;
-    dbentry->n_block_read_time = 0;
-    dbentry->n_block_write_time = 0;
-
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
-
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
-}
-
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
-{
-    PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
-     */
-    if (!found)
-        reset_dbentry_counters(result);
-
-    return result;
-}
-
-
-/*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
-{
-    PgStat_StatTabEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /* If not found, initialize the new one. */
-    if (!found)
-    {
-        result->numscans = 0;
-        result->tuples_returned = 0;
-        result->tuples_fetched = 0;
-        result->tuples_inserted = 0;
-        result->tuples_updated = 0;
-        result->tuples_deleted = 0;
-        result->tuples_hot_updated = 0;
-        result->n_live_tuples = 0;
-        result->n_dead_tuples = 0;
-        result->changes_since_analyze = 0;
-        result->inserts_since_vacuum = 0;
-        result->blocks_fetched = 0;
-        result->blocks_hit = 0;
-        result->vacuum_timestamp = 0;
-        result->vacuum_count = 0;
-        result->autovac_vacuum_timestamp = 0;
-        result->autovac_vacuum_count = 0;
-        result->analyze_timestamp = 0;
-        result->analyze_count = 0;
-        result->autovac_analyze_timestamp = 0;
-        result->autovac_analyze_count = 0;
-    }
-
-    return result;
-}
-
 
 /* ----------
  * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ *        Write the global statistics file, as well as DB files.
  * ----------
  */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+void
+pgstat_write_statsfiles(void)
 {
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **penv;
+
+    /* Stats are not initialized yet; just return. */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
+    create_missing_dbentries();
+
     /*
      * Open the statistics temp file to write out the current values.
      */
@@ -4888,7 +4858,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
 
     /*
      * Write the file header --- currently just a format ID.
@@ -4900,38 +4870,31 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Write global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write archiver stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Write SLRU stats struct
-     */
-    rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Walk through the database table.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_DB, InvalidOid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) &(*penv)->body;
+
         /*
          * Write out the table and function stats for this DB into the
          * appropriate per-DB stat file, if required.
          */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
+        pgstat_write_database_stats(dbentry);
 
         /*
          * Write out the DB entry. We don't write the tables or functions
@@ -4942,6 +4905,8 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
         (void) rc;                /* we'll check for error with ferror */
     }
 
+    pfree(envlist);
+
     /*
      * No more output to be done. Close the temp file and replace the old
      * pgstat.stat with it.  The ferror() check replaces testing for error
@@ -4974,55 +4939,19 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
 }
 
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
 
 /* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
+ * pgstat_write_database_stats() -
+ *  Write the stat file for a single database.
  * ----------
  */
 static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
+pgstat_write_database_stats(PgStat_StatDBEntry *dbentry)
 {
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **penv;
     FILE       *fpout;
     int32        format_id;
     Oid            dbid = dbentry->databaseid;
@@ -5030,8 +4959,8 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     char        tmpfile[MAXPGPATH];
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -5058,24 +4987,31 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     /*
      * Walk through the database's access stats per table.
      */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_TABLE, dbentry->databaseid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatTabEntry *tabentry = (PgStat_StatTabEntry *) &(*penv)->body;
+
         fputc('T', fpout);
         rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    pfree(envlist);
 
     /*
      * Walk through the database's function stats table.
      */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_FUNCTION, dbentry->databaseid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatFuncEntry *funcentry =
+        (PgStat_StatFuncEntry *) &(*penv)->body;
+
         fputc('F', fpout);
         rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    pfree(envlist);
 
     /*
      * No more output to be done. Close the temp file and replace the old
@@ -5109,102 +5045,165 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
+}
 
-    if (permanent)
+/* ----------
+ * create_missing_dbentries() -
+ *
+ *  A database entry may be missing for a database in which object stats
+ *  are recorded. This function creates such missing dbentries so that
+ *  all stats entries can be written out to files.
+ * ----------
+ */
+static void
+create_missing_dbentries(void)
+{
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
+    HTAB       *oidhash;
+    HASHCTL        ctl;
+    HASH_SEQ_STATUS scan;
+    Oid           *poid;
+
+    memset(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(Oid);
+    ctl.entrysize = sizeof(Oid);
+    ctl.hcxt = CurrentMemoryContext;
+    oidhash = hash_create("Temporary table of OIDs",
+                          PGSTAT_TABLE_HASH_SIZE,
+                          &ctl,
+                          HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+    /* Collect OID from the shared stats hash */
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+        hash_search(oidhash, &p->key.databaseid, HASH_ENTER, NULL);
+    dshash_seq_term(&hstat);
+
+    /* Create the database entries that are missing. */
+    hash_seq_init(&scan, oidhash);
+    while ((poid = (Oid *) hash_seq_search(&scan)) != NULL)
+        (void) get_stat_entry(PGSTAT_TYPE_DB, *poid, InvalidOid,
+                              false, init_dbentry, NULL);
+
+    hash_destroy(oidhash);
+}
+
+
+/* ----------
+ * get_stat_entry() -
+ *
+ *    Get the shared stats entry for the specified type, dbid and objid.
+ *  If nowait is true, returns NULL on lock failure.
+ *
+ *  If initfunc is not NULL, a new entry is created if none exists yet and
+ *  the function is called with the new envelope. If found is not NULL, it is
+ *  set to true if an existing entry is found, or false if not.
+ * ----------
+ */
+static PgStatEnvelope *
+get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+               bool nowait, entry_initializer initfunc, bool *found)
+{
+    bool        create = (initfunc != NULL);
+    PgStatHashEntry *shent;
+    PgStatEnvelope *shenv = NULL;
+    PgStatHashEntryKey key;
+    bool        myfound;
+
+    Assert(type != PGSTAT_TYPE_ALL);
+
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    shent = dshash_find_extended(pgStatSharedHash, &key,
+                                 create, nowait, create, &myfound);
+    if (shent)
     {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
+        if (create && !myfound)
+        {
+            /* Create new stats envelope. */
+            size_t        envsize = PgStatEnvelopeSize(pgstat_entsize[type]);
+            dsa_pointer chunk = dsa_allocate0(area, envsize);
 
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
+            shenv = dsa_get_address(area, chunk);
+            shenv->type = type;
+            shenv->databaseid = dbid;
+            shenv->objectid = objid;
+            shenv->len = pgstat_entsize[type];
+            LWLockInitialize(&shenv->lock, LWTRANCHE_STATS);
+
+            /*
+             * The lock on dshash is released right after this, so call the
+             * initializer before other processes can see the entry.
+             */
+            if (initfunc)
+                initfunc(shenv);
+
+            /* Link the new entry from the hash entry. */
+            shent->env = chunk;
+        }
+        else
+            shenv = dsa_get_address(area, shent->env);
+
+        dshash_release_lock(pgStatSharedHash, shent);
     }
+
+    if (found)
+        *found = myfound;
+
+    return shenv;
 }
 
+
 /* ----------
  * pgstat_read_statsfiles() -
  *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ *    Reads in existing activity statistics files into the shared stats hash.
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+void
+pgstat_read_statsfiles(void)
 {
+    PgStatEnvelope *env;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-    int            i;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
 
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
+    /* shouldn't be called from postmaster */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-    memset(&slruStats, 0, sizeof(slruStats));
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Set the same reset timestamp for all SLRU items too.
-     */
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-        slruStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp =
+        shared_globalStats->stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply starts
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the activity statistics is not running or
+     * has not yet written the stats file the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -5213,7 +5212,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
@@ -5221,49 +5220,30 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     /*
      * Read global stats struct
      */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+        sizeof(*shared_globalStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
         goto done;
     }
 
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
     /*
      * Read archiver stats struct
      */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+        sizeof(*shared_archiverStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
         goto done;
     }
 
     /*
-     * Read SLRU stats struct
-     */
-    if (fread(slruStats, 1, sizeof(slruStats), fpin) != sizeof(slruStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&slruStats, 0, sizeof(slruStats));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
     for (;;)
     {
@@ -5277,7 +5257,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
                           fpin) != offsetof(PgStat_StatDBEntry, tables))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5286,76 +5266,33 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 /*
                  * Add to the DB hash
                  */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
+
+                env = get_stat_entry(PGSTAT_TYPE_DB, dbbuf.databaseid,
+                                     InvalidOid,
+                                     false, init_dbentry, &found);
+                dbentry = (PgStat_StatDBEntry *) &env->body;
+
+                /* don't allow duplicate dbentries */
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
+                memcpy(dbentry, &dbbuf,
+                       offsetof(PgStat_StatDBEntry, tables));
 
+                /* Read the data from the database-specific file. */
+                pgstat_read_db_statsfile(dbentry);
                 break;
 
             case 'E':
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5365,59 +5302,50 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 done:
     FreeFile(fpin);
 
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 
-    return dbhash;
+    return;
 }
 
 
 /* ----------
  * pgstat_read_db_statsfile() -
  *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
+ *    Reads in the at-rest statistics file and creates shared statistics
+ *    tables. The file is removed after reading.
  * ----------
  */
 static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
+pgstat_read_db_statsfile(PgStat_StatDBEntry *dbentry)
 {
+    PgStatEnvelope *env;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatTabEntry tabbuf;
     PgStat_StatFuncEntry funcbuf;
     PgStat_StatFuncEntry *funcentry;
+    dshash_table *tabhash = NULL;
+    dshash_table *funchash = NULL;
     FILE       *fpin;
     int32        format_id;
     bool        found;
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
+    get_dbstat_filename(false, dbentry->databaseid, statfile, MAXPGPATH);
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply starts
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the activity statistics is not running or
+     * has not yet written the stats file the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
@@ -5430,14 +5358,14 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
 
     /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
     for (;;)
     {
@@ -5450,25 +5378,21 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
                           fpin) != sizeof(PgStat_StatTabEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
+                env = get_stat_entry(PGSTAT_TYPE_TABLE,
+                                     dbentry->databaseid, tabbuf.tableid,
+                                     false, init_tabentry, &found);
+                tabentry = (PgStat_StatTabEntry *) &env->body;
 
+                /* don't allow duplicate entries */
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5484,25 +5408,20 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
                           fpin) != sizeof(PgStat_StatFuncEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
+                env = get_stat_entry(PGSTAT_TYPE_FUNCTION, dbentry->databaseid,
+                                     funcbuf.functionid,
+                                     false, init_funcentry, &found);
+                funcentry = (PgStat_StatFuncEntry *) &env->body;
 
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5518,7 +5437,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5526,304 +5445,15 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     }
 
 done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read SLRU stats struct
-     */
-    if (fread(mySLRUStats, 1, sizeof(mySLRUStats), fpin) != sizeof(mySLRUStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
+    if (tabhash)
+        dshash_detach(tabhash);
+    if (funchash)
+        dshash_detach(funchash);
 
-done:
     FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
-}
-
 
-/* ----------
- * pgstat_setup_memcxt() -
- *
- *    Create pgStatLocalContext, if not already done.
- * ----------
- */
-static void
-pgstat_setup_memcxt(void)
-{
-    if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 }
 
 
@@ -5842,797 +5472,25 @@ pgstat_clear_snapshot(void)
 {
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
+    {
         MemoryContextDelete(pgStatLocalContext);
 
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-                tabentry->inserts_since_vacuum = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
     }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
 
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
+    if (pgStatSnapshotContext)
     {
-        char        statfile[MAXPGPATH];
+        MemoryContextReset(pgStatSnapshotContext);
 
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
+        /* Reset variables that pointed to the context */
+        global_snapshot_is_valid = false;
+        pgStatSnapshotHash = NULL;
     }
 }
 
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_resetslrucounter() -
- *
- *    Reset some SLRU statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len)
-{
-    int            i;
-    TimestampTz ts = GetCurrentTimestamp();
-
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /* reset entry with the given index, or all entries (index is -1) */
-        if ((msg->m_index == -1) || (msg->m_index == i))
-        {
-            memset(&slruStats[i], 0, sizeof(slruStats[i]));
-            slruStats[i].stat_reset_timestamp = ts;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signalling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * It is quite possible that a non-aggressive VACUUM ended up skipping
-     * various pages, however, we'll zero the insert counter here regardless.
-     * It's currently used only to track when we need to perform an "insert"
-     * autovacuum, which are mainly intended to freeze newly inserted tuples.
-     * Zeroing this may just mean we'll not try to vacuum the table again
-     * until enough tuples have been inserted to trigger another insert
-     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
-     * stragglers.
-     */
-    tabentry->inserts_since_vacuum = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_slru() -
- *
- *    Process a SLRU message.
- * ----------
- */
-static void
-pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
-{
-    slruStats[msg->m_index].blocks_zeroed += msg->m_blocks_zeroed;
-    slruStats[msg->m_index].blocks_hit += msg->m_blocks_hit;
-    slruStats[msg->m_index].blocks_read += msg->m_blocks_read;
-    slruStats[msg->m_index].blocks_written += msg->m_blocks_written;
-    slruStats[msg->m_index].blocks_exists += msg->m_blocks_exists;
-    slruStats[msg->m_index].flush += msg->m_flush;
-    slruStats[msg->m_index].truncate += msg->m_truncate;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
-}
-
 /*
  * Convert a potentially unsafely truncated activity string (see
  * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
@@ -6721,7 +5579,7 @@ pgstat_slru_name(int slru_idx)
  * Returns pointer to entry with counters for given SLRU (based on the name
  * stored in SlruCtl as lwlock tranche name).
  */
-static inline PgStat_MsgSLRU *
+static PgStat_StatSLRUEntry *
 slru_entry(int slru_idx)
 {
     /*
@@ -6732,7 +5590,7 @@ slru_entry(int slru_idx)
 
     Assert((slru_idx >= 0) && (slru_idx < SLRU_NUM_ELEMENTS));
 
-    return &SLRUStats[slru_idx];
+    return &local_SLRUStats[slru_idx];
 }
 
 /*
@@ -6742,41 +5600,48 @@ slru_entry(int slru_idx)
 void
 pgstat_count_slru_page_zeroed(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_zeroed += 1;
+    slru_entry(slru_idx)->blocks_zeroed += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_hit(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_hit += 1;
+    slru_entry(slru_idx)->blocks_hit += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_exists(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_exists += 1;
+    slru_entry(slru_idx)->blocks_exists += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_read(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_read += 1;
+    slru_entry(slru_idx)->blocks_read += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_written(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_written += 1;
+    slru_entry(slru_idx)->blocks_written += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_flush(int slru_idx)
 {
-    slru_entry(slru_idx)->m_flush += 1;
+    slru_entry(slru_idx)->flush += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_truncate(int slru_idx)
 {
-    slru_entry(slru_idx)->m_truncate += 1;
+    slru_entry(slru_idx)->truncate += 1;
+    have_slrustats = true;
 }
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 9de9396628..def10c25a9 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -254,7 +254,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -502,7 +501,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1325,12 +1323,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1779,11 +1771,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2706,8 +2693,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -3070,8 +3055,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3138,13 +3121,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3217,22 +3193,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3693,22 +3653,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3904,8 +3848,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3929,8 +3871,7 @@ PostmasterStateMachine(void)
     {
         /*
          * PM_WAIT_DEAD_END state ends when the BackendList is entirely empty
-         * (ie, no dead_end children remain), and the archiver and stats
-         * collector are gone too.
+         * (ie, no dead_end children remain), and the archiver is gone too.
          *
          * The reason we wait for those two is to protect them against a new
          * postmaster starting conflicting subprocesses; this isn't an
@@ -3940,8 +3881,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4142,8 +4082,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5116,18 +5054,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5246,12 +5172,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6146,7 +6066,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6202,8 +6121,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6438,7 +6355,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 3b46bfe9ab..b6c7a8bc3c 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1550,8 +1550,8 @@ is_checksummed_file(const char *fullpath, const char *filename)
  *
  * If 'missing_ok' is true, will not throw an error if the file is not found.
  *
- * If dboid is anything other than InvalidOid then any checksum failures detected
- * will get reported to the stats collector.
+ * If dboid is anything other than InvalidOid then any checksum failures
+ * detected will get reported to the activity stats facility.
  *
  * Returns true if the file was successfully sent, false if 'missing_ok',
  * and the file did not exist.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 29c920800a..32c511542d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2010,7 +2010,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                BgWriterStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2120,7 +2120,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2310,7 +2310,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2318,7 +2318,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..58a442f482 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -147,6 +147,7 @@ CreateSharedMemoryAndSemaphores(void)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -263,6 +264,7 @@ CreateSharedMemoryAndSemaphores(void)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 2fa90cc095..17a46db74b 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -177,7 +177,9 @@ static const char *const BuiltinTrancheNames[] = {
     /* LWTRANCHE_PARALLEL_APPEND: */
     "ParallelAppend",
     /* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
-    "PerXactPredicateList"
+    "PerXactPredicateList",
+    /* LWTRANCHE_STATS: */
+    "ActivityStatistics"
 };
 
 StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index e6985e8eed..cdd7095ebf 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -50,3 +50,4 @@ MultiXactTruncationLock                41
 OldSnapshotTimeMapLock                42
 LogicalRepWorkerLock                43
 XactTruncationLock                    44
+StatsLock                            45
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 7d667c6586..3535089755 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -414,8 +414,8 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
     DropRelFileNodesAllBuffers(rnodes, nrels);
 
     /*
-     * It'd be nice to tell the stats collector to forget them immediately,
-     * too. But we can't because we don't know the OIDs.
+     * It'd be nice to tell the activity stats facility to forget them
+     * immediately, too. But we can't because we don't know the OIDs.
      */
 
     /*
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8958ec8103..9932ce4dac 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3209,6 +3209,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3783,6 +3789,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4185,11 +4192,12 @@ PostgresMain(int argc, char *argv[],
          * Note: this includes fflush()'ing the last of the prior output.
          *
          * This is also a good time to send collected statistics to the
-         * collector, and to update the PS stats display.  We avoid doing
-         * those every time through the message loop because it'd slow down
-         * processing of batched messages, and because we don't want to report
-         * uncommitted updates (that confuses autovacuum).  The notification
-         * processor wants a call too, if we are not in a transaction block.
+         * activity statistics, and to update the PS stats display.  We avoid
+         * doing those every time through the message loop because it'd slow
+         * down processing of batched messages, and because we don't want to
+         * report uncommitted updates (that confuses autovacuum).  The
+         * notification processor wants a call too, if we are not in a
+         * transaction block.
          */
         if (send_ready_for_query)
         {
@@ -4221,6 +4229,8 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long stats_timeout;
+
                 /* Send out notify signals and transmit self-notifies */
                 ProcessCompletedNotifies();
 
@@ -4233,8 +4243,13 @@ PostgresMain(int argc, char *argv[],
                 if (notifyInterruptPending)
                     ProcessNotifyInterrupt();
 
-                pgstat_report_stat(false);
-
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle");
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4269,7 +4284,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4277,6 +4292,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 2aff739466..b9e78dd8e8 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * pgstatfuncs.c
- *      Functions for accessing the statistics collector data
+ *      Functions for accessing the activity statistics data
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -35,9 +35,6 @@
 
 #define HAS_PGSTAT_PERMISSIONS(role)     (is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS) || has_privs_of_role(GetUserId(),role))
 
 
-/* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
-
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
 {
@@ -1261,7 +1258,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_commit);
+        result = (int64) (dbentry->counts.n_xact_commit);
 
     PG_RETURN_INT64(result);
 }
@@ -1277,7 +1274,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_rollback);
+        result = (int64) (dbentry->counts.n_xact_rollback);
 
     PG_RETURN_INT64(result);
 }
@@ -1293,7 +1290,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_fetched);
+        result = (int64) (dbentry->counts.n_blocks_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1309,7 +1306,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_hit);
+        result = (int64) (dbentry->counts.n_blocks_hit);
 
     PG_RETURN_INT64(result);
 }
@@ -1325,7 +1322,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_returned);
+        result = (int64) (dbentry->counts.n_tuples_returned);
 
     PG_RETURN_INT64(result);
 }
@@ -1341,7 +1338,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_fetched);
+        result = (int64) (dbentry->counts.n_tuples_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1357,7 +1354,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_inserted);
+        result = (int64) (dbentry->counts.n_tuples_inserted);
 
     PG_RETURN_INT64(result);
 }
@@ -1373,7 +1370,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_updated);
+        result = (int64) (dbentry->counts.n_tuples_updated);
 
     PG_RETURN_INT64(result);
 }
@@ -1389,7 +1386,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_deleted);
+        result = (int64) (dbentry->counts.n_tuples_deleted);
 
     PG_RETURN_INT64(result);
 }
@@ -1422,7 +1419,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_files;
+        result = dbentry->counts.n_temp_files;
 
     PG_RETURN_INT64(result);
 }
@@ -1438,7 +1435,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_bytes;
+        result = dbentry->counts.n_temp_bytes;
 
     PG_RETURN_INT64(result);
 }
@@ -1453,7 +1450,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace);
+        result = (int64) (dbentry->counts.n_conflict_tablespace);
 
     PG_RETURN_INT64(result);
 }
@@ -1468,7 +1465,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_lock);
+        result = (int64) (dbentry->counts.n_conflict_lock);
 
     PG_RETURN_INT64(result);
 }
@@ -1483,7 +1480,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_snapshot);
+        result = (int64) (dbentry->counts.n_conflict_snapshot);
 
     PG_RETURN_INT64(result);
 }
@@ -1498,7 +1495,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_bufferpin);
+        result = (int64) (dbentry->counts.n_conflict_bufferpin);
 
     PG_RETURN_INT64(result);
 }
@@ -1513,7 +1510,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1528,11 +1525,11 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace +
-                          dbentry->n_conflict_lock +
-                          dbentry->n_conflict_snapshot +
-                          dbentry->n_conflict_bufferpin +
-                          dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_tablespace +
+                          dbentry->counts.n_conflict_lock +
+                          dbentry->counts.n_conflict_snapshot +
+                          dbentry->counts.n_conflict_bufferpin +
+                          dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1547,7 +1544,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_deadlocks);
+        result = (int64) (dbentry->counts.n_deadlocks);
 
     PG_RETURN_INT64(result);
 }
@@ -1565,7 +1562,7 @@ pg_stat_get_db_checksum_failures(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_checksum_failures);
+        result = (int64) (dbentry->counts.n_checksum_failures);
 
     PG_RETURN_INT64(result);
 }
@@ -1602,7 +1599,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_read_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_read_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1618,7 +1615,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_write_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_write_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1704,7 +1701,7 @@ pg_stat_get_slru(PG_FUNCTION_ARGS)
     MemoryContext per_query_ctx;
     MemoryContext oldcontext;
     int            i;
-    PgStat_SLRUStats *stats;
+    PgStat_StatSLRUEntry *stats;
 
     /* check to see if caller supports us returning a tuplestore */
     if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
@@ -1738,7 +1735,7 @@ pg_stat_get_slru(PG_FUNCTION_ARGS)
         /* for each row */
         Datum        values[PG_STAT_GET_SLRU_COLS];
         bool        nulls[PG_STAT_GET_SLRU_COLS];
-        PgStat_SLRUStats stat = stats[i];
+        PgStat_StatSLRUEntry    stat = stats[i];
         const char *name;
 
         name = pgstat_slru_name(i);
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb19644419..51748c99ad 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index cca9704d2d..430e997ca6 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -241,9 +241,6 @@ GetBackendTypeDesc(BackendType backendType)
         case B_ARCHIVER:
             backendDesc = "archiver";
             break;
-        case B_STATS_COLLECTOR:
-            backendDesc = "stats collector";
-            break;
         case B_LOGGER:
             backendDesc = "logger";
             break;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index f4247ea70d..f65d05c24c 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -631,6 +632,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1241,6 +1244,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 2f3e0a70e0..f073025f69 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -747,8 +747,8 @@ const char *const config_group_names[] =
     gettext_noop("Statistics"),
     /* STATS_MONITORING */
     gettext_noop("Statistics / Monitoring"),
-    /* STATS_COLLECTOR */
-    gettext_noop("Statistics / Query and Index Statistics Collector"),
+    /* STATS_ACTIVITY */
+    gettext_noop("Statistics / Query and Index Activity Statistics"),
     /* AUTOVACUUM */
     gettext_noop("Autovacuum"),
     /* CLIENT_CONN */
@@ -1478,7 +1478,7 @@ static struct config_bool ConfigureNamesBool[] =
 #endif
 
     {
-        {"track_activities", PGC_SUSET, STATS_COLLECTOR,
+        {"track_activities", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects information about executing commands."),
             gettext_noop("Enables the collection of information on the currently "
                          "executing command of each session, along with "
@@ -1489,7 +1489,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_counts", PGC_SUSET, STATS_COLLECTOR,
+        {"track_counts", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects statistics on database activity."),
             NULL
         },
@@ -1498,7 +1498,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_io_timing", PGC_SUSET, STATS_COLLECTOR,
+        {"track_io_timing", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects timing statistics for database I/O activity."),
             NULL
         },
@@ -4300,7 +4300,7 @@ static struct config_string ConfigureNamesString[] =
     },
 
     {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
+        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
             gettext_noop("Writes temporary statistics files to the specified directory."),
             NULL,
             GUC_SUPERUSER_ONLY
@@ -4636,7 +4636,7 @@ static struct config_enum ConfigureNamesEnum[] =
     },
 
     {
-        {"track_functions", PGC_SUSET, STATS_COLLECTOR,
+        {"track_functions", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects function-level statistics on database activity."),
             NULL
         },
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 81055edde7..14b3ad4363 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -574,7 +574,7 @@
 # STATISTICS
 #------------------------------------------------------------------------------
 
-# - Query and Index Statistics Collector -
+# - Query and Index Activity Statistics -
 
 #track_activities = on
 #track_counts = on
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 208df557b8..4d07829ecd 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -124,7 +124,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 619b2f9c71..23984d6e24 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -83,6 +83,8 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
 
@@ -320,7 +322,6 @@ typedef enum BackendType
     B_WAL_SENDER,
     B_WAL_WRITER,
     B_ARCHIVER,
-    B_STATS_COLLECTOR,
     B_LOGGER,
 } BackendType;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c55dc1481c..eee9feb8f7 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL activity statistics facility.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -15,9 +15,11 @@
 #include "libpq/pqcomm.h"
 #include "miscadmin.h"
 #include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -41,35 +43,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_RESETSLRUCOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_SLRU,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -80,9 +53,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -113,18 +85,17 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_delta_live_tuples;
     PgStat_Counter t_delta_dead_tuples;
     PgStat_Counter t_changed_tuples;
+    bool        reset_changed_tuples;
 
     PgStat_Counter t_blocks_fetched;
     PgStat_Counter t_blocks_hit;
+
+    PgStat_Counter vacuum_count;
+    PgStat_Counter autovac_vacuum_count;
+    PgStat_Counter analyze_count;
+    PgStat_Counter autovac_analyze_count;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -158,6 +129,10 @@ typedef struct PgStat_TableStatus
     Oid            t_id;            /* table's OID */
     bool        t_shared;        /* is it a shared catalog? */
     struct PgStat_TableXactStatus *trans;    /* lowest subxact's counts */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_TableCounts t_counts;    /* event counts to be sent */
 } PgStat_TableStatus;
 
@@ -183,308 +158,32 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
-/* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
- */
-typedef struct PgStat_MsgHdr
-{
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgResetslrucounter Sent by the backend to tell the collector
- *                                to reset a SLRU counter
- * ----------
- */
-typedef struct PgStat_MsgResetslrucounter
-{
-    PgStat_MsgHdr m_hdr;
-    int            m_index;
-} PgStat_MsgResetslrucounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgSLRU            Sent by a backend to update SLRU statistics.
- * ----------
- */
-typedef struct PgStat_MsgSLRU
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Counter m_index;
-    PgStat_Counter m_blocks_zeroed;
-    PgStat_Counter m_blocks_hit;
-    PgStat_Counter m_blocks_read;
-    PgStat_Counter m_blocks_written;
-    PgStat_Counter m_blocks_exists;
-    PgStat_Counter m_flush;
-    PgStat_Counter m_truncate;
-} PgStat_MsgSLRU;
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
 /* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgTempFile
+typedef struct PgStat_BgWriter
 {
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter buf_alloc;
+    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+} PgStat_BgWriter;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -516,98 +215,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgResetslrucounter msg_resetslrucounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgSLRU msg_slru;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Activity statistics data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -616,13 +225,9 @@ typedef union PgStat_Msg
 
 #define PGSTAT_FILE_FORMAT_ID    0x01A5BC9D
 
-/* ----------
- * PgStat_StatDBEntry            The collector's data per database
- * ----------
- */
-typedef struct PgStat_StatDBEntry
+
+typedef struct PgStat_StatDBCounts
 {
-    Oid            databaseid;
     PgStat_Counter n_xact_commit;
     PgStat_Counter n_xact_rollback;
     PgStat_Counter n_blocks_fetched;
@@ -632,7 +237,6 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_tuples_inserted;
     PgStat_Counter n_tuples_updated;
     PgStat_Counter n_tuples_deleted;
-    TimestampTz last_autovac_time;
     PgStat_Counter n_conflict_tablespace;
     PgStat_Counter n_conflict_lock;
     PgStat_Counter n_conflict_snapshot;
@@ -642,29 +246,72 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_temp_bytes;
     PgStat_Counter n_deadlocks;
     PgStat_Counter n_checksum_failures;
-    TimestampTz last_checksum_failure;
     PgStat_Counter n_block_read_time;    /* times in microseconds */
     PgStat_Counter n_block_write_time;
+} PgStat_StatDBCounts;
 
+/* ----------
+ * PgStat_StatDBEntry            The statistics per database
+ * ----------
+ */
+typedef struct PgStat_StatDBEntry
+{
+    Oid            databaseid;
+    TimestampTz last_autovac_time;
+    TimestampTz last_checksum_failure;
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;    /* time of db stats update */
+
+    PgStat_StatDBCounts counts;
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following members must be last in the struct, because we don't
+     * write them out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables; /* current gen tables hash */
+    dshash_table_handle functions;    /* current gen functions hash */
 } PgStat_StatDBEntry;
 
 
+/*
+ * SLRU statistics kept in the shared stats
+ */
+typedef struct PgStat_StatSLRUEntry
+{
+    PgStat_Counter blocks_zeroed;
+    PgStat_Counter blocks_hit;
+    PgStat_Counter blocks_read;
+    PgStat_Counter blocks_written;
+    PgStat_Counter blocks_exists;
+    PgStat_Counter flush;
+    PgStat_Counter truncate;
+    TimestampTz stat_reset_timestamp;
+} PgStat_StatSLRUEntry;
+
+
 /* ----------
- * PgStat_StatTabEntry            The collector's data per table (or index)
+ * PgStat_HashMountInfo  dshash mount information
+ * ----------
+ */
+typedef struct PgStat_HashMountInfo
+{
+    HTAB       *snapshot_tables;    /* table entry snapshot */
+    HTAB       *snapshot_functions; /* function entry snapshot */
+    dshash_table *dshash_tables;    /* attached tables dshash */
+    dshash_table *dshash_functions; /* attached functions dshash */
+} PgStat_HashMountInfo;
+
+/* ----------
+ * PgStat_StatTabEntry            The statistics per table (or index)
  * ----------
  */
 typedef struct PgStat_StatTabEntry
 {
     Oid            tableid;
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
 
     PgStat_Counter numscans;
 
@@ -684,19 +331,15 @@ typedef struct PgStat_StatTabEntry
     PgStat_Counter blocks_fetched;
     PgStat_Counter blocks_hit;
 
-    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
     PgStat_Counter vacuum_count;
-    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_vacuum_count;
-    TimestampTz analyze_timestamp;    /* user initiated */
     PgStat_Counter analyze_count;
-    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_analyze_count;
 } PgStat_StatTabEntry;
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            Per-function statistics data
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
@@ -709,9 +352,8 @@ typedef struct PgStat_StatFuncEntry
     PgStat_Counter f_self_time;
 } PgStat_StatFuncEntry;
 
-
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -727,7 +369,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -745,21 +387,6 @@ typedef struct PgStat_GlobalStats
     TimestampTz stat_reset_timestamp;
 } PgStat_GlobalStats;
 
-/*
- * SLRU statistics kept in the stats collector
- */
-typedef struct PgStat_SLRUStats
-{
-    PgStat_Counter blocks_zeroed;
-    PgStat_Counter blocks_hit;
-    PgStat_Counter blocks_read;
-    PgStat_Counter blocks_written;
-    PgStat_Counter blocks_exists;
-    PgStat_Counter flush;
-    PgStat_Counter truncate;
-    TimestampTz stat_reset_timestamp;
-} PgStat_SLRUStats;
-
 
 /* ----------
  * Backend states
@@ -808,7 +435,6 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
     WAIT_EVENT_WAL_RECEIVER_MAIN,
@@ -1054,7 +680,7 @@ typedef struct PgBackendGSSStatus
  *
  * Each live backend maintains a PgBackendStatus struct in shared memory
  * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
+ * BackendId, but that is not critical.)  Note that the stats-writing code
  * has no involvement in, or even access to, these structs.
  *
  * Each auxiliary process also maintains a PgBackendStatus struct in shared
@@ -1251,13 +877,15 @@ extern PGDLLIMPORT bool pgstat_track_counts;
 extern PGDLLIMPORT int pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used, but will be removed with GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1272,29 +900,26 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
 
+/* File input/output functions  */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 extern void pgstat_reset_slru_counter(const char *);
 
@@ -1456,8 +1081,8 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                                       void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
+extern void pgstat_report_archiver(const char *xlog, bool failed);
+extern void pgstat_report_bgwriter(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1466,13 +1091,16 @@ extern void pgstat_send_bgwriter(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_snapshot(bool shared,
+                                                                Oid relid);
+extern void pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
-extern PgStat_SLRUStats *pgstat_fetch_slru(void);
+extern PgStat_StatSLRUEntry *pgstat_fetch_slru(void);
 
 extern void pgstat_count_slru_page_zeroed(int slru_idx);
 extern void pgstat_count_slru_page_hit(int slru_idx);
@@ -1483,5 +1111,6 @@ extern void pgstat_count_slru_flush(int slru_idx);
 extern void pgstat_count_slru_truncate(int slru_idx);
 extern const char *pgstat_slru_name(int slru_idx);
 extern int    pgstat_slru_index(const char *name);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index c04ae97148..ce2b79d1bc 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -219,6 +219,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SHARED_TIDBITMAP,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_PER_XACT_PREDICATE_LIST,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 454c2df487..2c65231b04 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -88,7 +88,7 @@ enum config_group
     PROCESS_TITLE,
     STATS,
     STATS_MONITORING,
-    STATS_COLLECTOR,
+    STATS_ACTIVITY,
     AUTOVACUUM,
     CLIENT_CONN,
     CLIENT_CONN_STATEMENT,
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 83a15f6795..77d1572a99 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.18.2
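
The pgstat.h comment in the patch above says that a counts struct should hold only event counters, because it is compared against zeroes with memcmp to decide whether there is anything to write. The following is a minimal sketch of that idea, not code from the patch: DemoCounter, DemoDBCounts, demo_have_pending and demo_flush are hypothetical names, and the real code would also need locking around the shared totals.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for PgStat_Counter. */
typedef int64_t DemoCounter;

/*
 * Hypothetical counts-only struct.  Every member has the same type, so
 * the struct has no padding bytes and no non-counter fields (such as
 * timestamps) that could make a byte-wise comparison misleading.
 */
typedef struct DemoDBCounts
{
    DemoCounter n_xact_commit;
    DemoCounter n_xact_rollback;
    DemoCounter n_deadlocks;
} DemoDBCounts;

/* Return true if at least one pending counter is nonzero. */
static bool
demo_have_pending(const DemoDBCounts *pending)
{
    static const DemoDBCounts all_zeroes = {0};

    assert(pending != NULL);
    return memcmp(pending, &all_zeroes, sizeof(DemoDBCounts)) != 0;
}

/*
 * Add the pending counts to the totals and clear the pending counts.
 * Returns false without touching the totals when nothing is pending.
 * Assumption: the caller already holds whatever lock protects "shared".
 */
static bool
demo_flush(DemoDBCounts *shared, DemoDBCounts *pending)
{
    assert(shared != NULL);

    if (!demo_have_pending(pending))
        return false;

    shared->n_xact_commit += pending->n_xact_commit;
    shared->n_xact_rollback += pending->n_xact_rollback;
    shared->n_deadlocks += pending->n_deadlocks;

    memset(pending, 0, sizeof(*pending));
    return true;
}
```

A caller would bump the members of its local DemoDBCounts during normal work and call demo_flush() when it is about to go idle. Keeping timestamps and hash handles out of the counts struct is what keeps the zero check this cheap.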

From 3c119761d0ce67e8a003972bb6f21e54ee2697da Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:09 +0900
Subject: [PATCH v34 6/7] Doc part of shared-memory based stats collector.

---
 doc/src/sgml/catalogs.sgml          |   6 +-
 doc/src/sgml/config.sgml            |  34 ++++----
 doc/src/sgml/high-availability.sgml |  13 +--
 doc/src/sgml/monitoring.sgml        | 128 +++++++++++++---------------
 doc/src/sgml/ref/pg_dump.sgml       |   9 +-
 5 files changed, 91 insertions(+), 99 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 700271fd40..fad9f94ddb 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -9177,9 +9177,9 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l
   <para>
    <xref linkend="view-table"/> lists the system views described here.
    More detailed documentation of each view follows below.
-   There are some additional views that provide access to the results of
-   the statistics collector; they are described in <xref
-   linkend="monitoring-stats-views-table"/>.
+   There are some additional views that provide access to the activity
+   statistics; they are described in
+   <xref linkend="monitoring-stats-views-table"/>.
   </para>
 
   <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 4eef970d41..8da4b60fe9 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7113,11 +7113,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
     <title>Run-time Statistics</title>
 
     <sect2 id="runtime-config-statistics-collector">
-     <title>Query and Index Statistics Collector</title>
+     <title>Query and Index Activity Statistics</title>
 
      <para>
-      These parameters control server-wide statistics collection features.
-      When statistics collection is enabled, the data that is produced can be
+      These parameters control server-wide activity statistics features.
+      When activity tracking is enabled, the data that is produced can be
       accessed via the <structname>pg_stat</structname> and
       <structname>pg_statio</structname> family of system views.
       Refer to <xref linkend="monitoring"/> for more information.
@@ -7133,14 +7133,13 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables the collection of information on the currently
-        executing command of each session, along with the time when
-        that command began execution. This parameter is on by
-        default. Note that even when enabled, this information is not
-        visible to all users, only to superusers and the user owning
-        the session being reported on, so it should not represent a
-        security risk.
-        Only superusers can change this setting.
+        Enables tracking of the currently executing command of
+        each session, along with the time when that command began
+        execution. This parameter is on by default. Note that even when
+        enabled, this information is not visible to all users, only to
+        superusers and the user owning the session being reported on, so it
+        should not represent a security risk.  Only superusers can change this
+        setting.
        </para>
       </listitem>
      </varlistentry>
@@ -7171,9 +7170,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables collection of statistics on database activity.
+        Enables tracking of database activity.
         This parameter is on by default, because the autovacuum
-        daemon needs the collected information.
+        daemon needs the activity information.
         Only superusers can change this setting.
        </para>
       </listitem>
@@ -8234,7 +8233,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Specifies the fraction of the total number of heap tuples counted in
-        the previous statistics collection that can be inserted without
+        the previously collected statistics that can be inserted without
         incurring an index scan at the <command>VACUUM</command> cleanup stage.
         This setting currently applies to B-tree indexes only.
        </para>
@@ -8246,9 +8245,10 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         the index contains deleted pages that can be recycled during cleanup.
         Index statistics are considered to be stale if the number of newly
         inserted tuples exceeds the <varname>vacuum_cleanup_index_scale_factor</varname>
-        fraction of the total number of heap tuples detected by the previous
-        statistics collection. The total number of heap tuples is stored in
-        the index meta-page. Note that the meta-page does not include this data
+        fraction of the total number of heap tuples in the previously
+        collected statistics. The total number of heap tuples is
+        stored in the index meta-page. Note that the meta-page does
+        not include this data
         until <command>VACUUM</command> finds no dead tuples, so B-tree index
         scan at the cleanup stage can only be skipped if the second and
         subsequent <command>VACUUM</command> cycles detect no dead tuples.
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 65c3fc62a9..56ef2f56f6 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2368,12 +2368,13 @@ LOG:  database system is ready to accept read only connections
    </para>
 
    <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
-    index usage, etc., will be recorded normally on the standby. Replayed
-    actions will not duplicate their effects on primary, so replaying an
-    insert will not increment the Inserts column of pg_stat_user_tables.
-    The stats file is deleted at the start of recovery, so stats from primary
-    and standby will differ; this is considered a feature, not a bug.
+    Activity statistics are collected during recovery. All scans, reads,
+    blocks, index usage, etc., will be recorded normally on the
+    standby. Replayed actions will not duplicate their effects on primary, so
+    replaying an insert will not increment the Inserts column of
+    pg_stat_user_tables.  The activity statistics are reset at the start of
+    recovery, so stats from primary and standby will differ; this is
+    considered a feature, not a bug.
    </para>
 
    <para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 321a0f4bb1..374403a25a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -22,7 +22,7 @@
   <para>
    Several tools are available for monitoring database activity and
    analyzing performance.  Most of this chapter is devoted to describing
-   <productname>PostgreSQL</productname>'s statistics collector,
+   <productname>PostgreSQL</productname>'s activity statistics,
    but one should not neglect regular Unix monitoring programs such as
    <command>ps</command>, <command>top</command>, <command>iostat</command>, and <command>vmstat</command>.
    Also, once one has identified a
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    master server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   master process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   master process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
@@ -130,20 +128,21 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
  </sect1>
 
  <sect1 id="monitoring-stats">
-  <title>The Statistics Collector</title>
+  <title>The Activity Statistics</title>
 
   <indexterm zone="monitoring-stats">
    <primary>statistics</primary>
   </indexterm>
 
   <para>
-   <productname>PostgreSQL</productname>'s <firstterm>statistics collector</firstterm>
-   is a subsystem that supports collection and reporting of information about
-   server activity.  Presently, the collector can count accesses to tables
-   and indexes in both disk-block and individual-row terms.  It also tracks
-   the total number of rows in each table, and information about vacuum and
-   analyze actions for each table.  It can also count calls to user-defined
-   functions and the total time spent in each one.
+   <productname>PostgreSQL</productname>'s <firstterm>activity
+   statistics</firstterm> is a subsystem that supports tracking and reporting
+   of information about server activity.  Presently, the subsystem counts
+   accesses to tables and indexes in both disk-block and individual-row
+   terms.  It also tracks the total number of rows in each table, and
+   information about vacuum and analyze actions for each table.  It can
+   also track calls to user-defined functions and the total time spent in
+   each one.
   </para>
 
   <para>
@@ -151,15 +150,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    information about exactly what is going on in the system right now, such as
    the exact command currently being executed by other server processes, and
    which other connections exist in the system.  This facility is independent
-   of the collector process.
+   of the activity statistics.
   </para>
 
  <sect2 id="monitoring-stats-setup">
-  <title>Statistics Collection Configuration</title>
+  <title>Activity Statistics Configuration</title>
 
   <para>
-   Since collection of statistics adds some overhead to query execution,
-   the system can be configured to collect or not collect information.
+   Since tracking activity statistics adds some overhead to query
+   execution, the system can be configured to track or not track activity.
    This is controlled by configuration parameters that are normally set in
    <filename>postgresql.conf</filename>.  (See <xref linkend="runtime-config"/> for
    details about setting configuration parameters.)
@@ -172,7 +171,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The parameter <xref linkend="guc-track-counts"/> controls whether
-   statistics are collected about table and index accesses.
+   table and index accesses are tracked.
   </para>
 
   <para>
@@ -196,18 +195,12 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
   </para>
 
   <para>
-   The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
-   When the server shuts down cleanly, a permanent copy of the statistics
-   data is stored in the <filename>pg_stat</filename> subdirectory, so that
-   statistics can be retained across server restarts.  When recovery is
-   performed at server start (e.g. after immediate shutdown, server crash,
-   and point-in-time recovery), all statistics counters are reset.
+   Activity statistics are kept in shared memory. When the server shuts
+   down cleanly, a permanent copy of the statistics data is stored in
+   the <filename>pg_stat</filename> subdirectory, so that statistics can be
+   retained across server restarts.  When recovery is performed at server
+   start (e.g. after immediate shutdown, server crash, and point-in-time
+   recovery), all statistics counters are reset.
   </para>
 
  </sect2>
@@ -220,48 +213,46 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    linkend="monitoring-stats-dynamic-views-table"/>, are available to show
    the current state of the system. There are also several other
    views, listed in <xref
-   linkend="monitoring-stats-views-table"/>, available to show the results
-   of statistics collection.  Alternatively, one can
-   build custom views using the underlying statistics functions, as discussed
-   in <xref linkend="monitoring-stats-functions"/>.
+   linkend="monitoring-stats-views-table"/>, available to show the activity
+   statistics.  Alternatively, one can build custom views using the underlying
+   statistics functions, as discussed in
+   <xref linkend="monitoring-stats-functions"/>.
   </para>
 
   <para>
-   When using the statistics to monitor collected data, it is important
-   to realize that the information does not update instantaneously.
-   Each individual server process transmits new statistical counts to
-   the collector just before going idle; so a query or transaction still in
-   progress does not affect the displayed totals.  Also, the collector itself
-   emits a new report at most once per <varname>PGSTAT_STAT_INTERVAL</varname>
-   milliseconds (500 ms unless altered while building the server).  So the
-   displayed information lags behind actual activity.  However, current-query
-   information collected by <varname>track_activities</varname> is
-   always up-to-date.
+   When using the activity statistics, it is important to realize that the
+   information does not update instantaneously.  Each individual server process
+   writes out new statistical counts just before going idle, but no more often than once
+   per <varname>PGSTAT_STAT_INTERVAL</varname> milliseconds (1 second unless
+   altered while building the server); so a query or transaction still in
+   progress does not affect the displayed totals.  However, current-query
+   information tracked by <varname>track_activities</varname> is always
+   up-to-date.
   </para>
 
   <para>
    Another important point is that when a server process is asked to display
-   any of these statistics, it first fetches the most recent report emitted by
-   the collector process and then continues to use this snapshot for all
-   statistical views and functions until the end of its current transaction.
-   So the statistics will show static information as long as you continue the
-   current transaction.  Similarly, information about the current queries of
-   all sessions is collected when any such information is first requested
-   within a transaction, and the same information will be displayed throughout
-   the transaction.
-   This is a feature, not a bug, because it allows you to perform several
-   queries on the statistics and correlate the results without worrying that
-   the numbers are changing underneath you.  But if you want to see new
-   results with each query, be sure to do the queries outside any transaction
-   block.  Alternatively, you can invoke
+   any of these statistics, it first reads the current statistics and then
+   continues to use this snapshot for all statistical views and functions
+   until the end of its current transaction.  So the statistics will show
+   static information as long as you continue the current transaction.
+   Similarly, information about the current queries of all sessions is tracked
+   when any such information is first requested within a transaction, and the
+   same information will be displayed throughout the transaction.  This is a
+   feature, not a bug, because it allows you to perform several queries on the
+   statistics and correlate the results without worrying that the numbers are
+   changing underneath you.  But if you want to see new results with each
+   query, be sure to do the queries outside any transaction block.
+   Alternatively, you can invoke
    <function>pg_stat_clear_snapshot</function>(), which will discard the
    current transaction's statistics snapshot (if any).  The next use of
    statistical information will cause a new snapshot to be fetched.
   </para>
-
+  
   <para>
-   A transaction can also see its own statistics (as yet untransmitted to the
-   collector) in the views <structname>pg_stat_xact_all_tables</structname>,
+   A transaction can also see its own statistics (as yet unwritten to the
+   server-wide activity statistics) in the
+   views <structname>pg_stat_xact_all_tables</structname>,
    <structname>pg_stat_xact_sys_tables</structname>,
    <structname>pg_stat_xact_user_tables</structname>, and
    <structname>pg_stat_xact_user_functions</structname>.  These numbers do not act as
@@ -620,7 +611,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    kernel's I/O cache, and might therefore still be fetched without
    requiring a physical read. Users interested in obtaining more
    detailed information on <productname>PostgreSQL</productname> I/O behavior are
-   advised to use the <productname>PostgreSQL</productname> statistics collector
+   advised to use the <productname>PostgreSQL</productname> activity statistics
    in combination with operating system utilities that allow insight
    into the kernel's handling of I/O.
   </para>
@@ -1060,10 +1051,6 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>LogicalLauncherMain</literal></entry>
       <entry>Waiting in main loop of logical replication launcher process.</entry>
      </row>
-     <row>
-      <entry><literal>PgStatMain</literal></entry>
-      <entry>Waiting in main loop of statistics collector process.</entry>
-     </row>
      <row>
       <entry><literal>RecoveryWalStream</literal></entry>
       <entry>Waiting in main loop of startup process for WAL to arrive, during
@@ -1788,6 +1775,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
     </thead>
 
     <tbody>
+     <row>
+      <entry><literal>ActivityStatistics</literal></entry>
+      <entry>Waiting to write out activity statistics to shared memory.</entry>
+     </row>
      <row>
       <entry><literal>AddinShmemInit</literal></entry>
       <entry>Waiting to manage an extension's space allocation in shared
@@ -5733,9 +5724,10 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
      <entry><literal>performing final cleanup</literal></entry>
      <entry>
        <command>VACUUM</command> is performing final cleanup.  During this phase,
-       <command>VACUUM</command> will vacuum the free space map, update statistics
-       in <literal>pg_class</literal>, and report statistics to the statistics
-       collector.  When this phase is completed, <command>VACUUM</command> will end.
+       <command>VACUUM</command> will vacuum the free space map, update
+       statistics in <literal>pg_class</literal>, and update the system-wide
+       activity statistics.  When this phase is completed, <command>VACUUM</command>
+       will end.
      </entry>
     </row>
    </tbody>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 197b5c0d70..8e205f5b2b 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1289,11 +1289,10 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The database activity of <application>pg_dump</application> is
-   normally collected by the statistics collector.  If this is
-   undesirable, you can set parameter <varname>track_counts</varname>
-   to false via <envar>PGOPTIONS</envar> or the <literal>ALTER
-   USER</literal> command.
+   The database activity of <application>pg_dump</application> is normally
+   collected.  If this is undesirable, you can set
+   parameter <varname>track_counts</varname> to false
+   via <envar>PGOPTIONS</envar> or the <literal>ALTER USER</literal> command.
   </para>
 
  </refsect1>
-- 
2.18.2

From 7e04ecffba77843be286f2afd0dd7cecb49fe0b1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 17:00:44 +0900
Subject: [PATCH v34 7/7] Remove the GUC stats_temp_directory

This GUC was used to specify the directory in which temporary statistics
files are stored. The stats collector no longer needs it, but programs in
bin and contrib, and possibly other extensions, still use it. This patch
therefore removes the GUC but leaves some backing variables and macro
definitions in place for backward compatibility.
---
 doc/src/sgml/backup.sgml                      |  2 -
 doc/src/sgml/config.sgml                      | 19 ---------
 doc/src/sgml/storage.sgml                     |  3 +-
 src/backend/postmaster/pgstat.c               | 13 +++---
 src/backend/replication/basebackup.c          | 13 ++----
 src/backend/utils/misc/guc.c                  | 41 -------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/pgstat.h                          |  5 ++-
 src/test/perl/PostgresNode.pm                 |  4 --
 9 files changed, 13 insertions(+), 88 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index bdc9026c62..2885540362 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1146,8 +1146,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8da4b60fe9..e65cea9c8e 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7226,25 +7226,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index ea08d0b614..71a8b8b11a 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -122,8 +122,7 @@ Item
 
 <row>
  <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
+ <entry>Subdirectory containing ephemeral files for extensions</entry>
 </row>
 
 <row>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 76b59df408..9f24777fb7 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -95,15 +95,12 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
+/*
+ * This used to be a GUC variable and is no longer used in this file.  It is
+ * kept, with its default value, only for backward compatibility with
+ * extensions.
  */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+char       *pgstat_stat_directory = PG_STAT_TMP_DIR;
 
 typedef struct StatsShmemStruct
 {
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index b6c7a8bc3c..8bd01c9047 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -269,7 +269,6 @@ perform_base_backup(basebackup_options *opt)
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
     backup_manifest_info manifest;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
     backup_total = 0;
@@ -292,8 +291,6 @@ perform_base_backup(basebackup_options *opt)
     Assert(CurrentResourceOwner == NULL);
     CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -327,13 +324,9 @@ perform_base_backup(basebackup_options *opt)
          * Calculate the relative path of temporary statistics directory in
          * order to skip the files which are located in that directory later.
          */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
+
+        Assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
+        statrelpath = psprintf("./%s", PG_STAT_TMP_DIR);
 
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index f073025f69..d19cbb77c5 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -198,7 +198,6 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -4299,17 +4298,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11586,35 +11574,6 @@ check_maintenance_io_concurrency(int *newval, void **extra, GucSource source)
     return true;
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 14b3ad4363..18aedbbbdd 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -581,7 +581,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index eee9feb8f7..c50c73ef27 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -32,7 +32,10 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
+/*
+ * This used to be the directory to store temporary statistics data in but is
+ * no longer used. Defined here for backward compatibility.
+ */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
 
 /* Values for track_functions GUC variable --- order is significant! */
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 1407359aef..25fde8a3a4 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -455,10 +455,6 @@ sub init
     print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
       if defined $ENV{TEMP_CONFIG};
 
-    # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
-    # concurrently must not share a stats_temp_directory.
-    print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
     if ($params{allows_streaming})
     {
         if ($params{allows_streaming} eq "logical")
-- 
2.18.2


Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
At Mon, 01 Jun 2020 18:00:01 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> Rebased on the current HEAD. 36ac359d36 conflicts with this. Tranche

Hmm. This conflicts with 0fd2a79a63. Rebased on it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 4926e50e7635548f86dcd0d36cbf56d168a5d242 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Mon, 16 Mar 2020 17:15:35 +0900
Subject: [PATCH v35 1/7] Use standard crash handler in archiver.

Commit 8e19a82640 changed the SIGQUIT handler of almost all processes
so that it does not run atexit callbacks, for safety. The archiver process
should behave the same way for the same reason. The exit status changes
from 1 to 2, but that doesn't make any behavioral change.
---
 src/backend/postmaster/pgarch.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 01ffd6513c..37be0e2bbb 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -96,7 +96,6 @@ static pid_t pgarch_forkexec(void);
 #endif
 
 NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_exit(SIGNAL_ARGS);
 static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
@@ -229,7 +228,7 @@ PgArchiverMain(int argc, char *argv[])
     pqsignal(SIGHUP, SignalHandlerForConfigReload);
     pqsignal(SIGINT, SIG_IGN);
     pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
-    pqsignal(SIGQUIT, pgarch_exit);
+    pqsignal(SIGQUIT, SignalHandlerForCrashExit);
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
     pqsignal(SIGUSR1, pgarch_waken);
@@ -246,14 +245,6 @@ PgArchiverMain(int argc, char *argv[])
     exit(0);
 }
 
-/* SIGQUIT signal handler for archiver process */
-static void
-pgarch_exit(SIGNAL_ARGS)
-{
-    /* SIGQUIT means curl up and die ... */
-    exit(1);
-}
-
 /* SIGUSR1 signal handler for archiver process */
 static void
 pgarch_waken(SIGNAL_ARGS)
-- 
2.18.2

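The reasoning in the commit message above (a SIGQUIT handler should not run atexit callbacks) can be illustrated with a minimal POSIX sketch. This is not PostgreSQL code: crash_exit_handler, on_exit_cb and quit_child are hypothetical names used only here, and the exit status 2 is taken from the commit message's remark that the status changes from 1 to 2. The child registers an atexit callback, then receives SIGQUIT; because the handler calls _exit() rather than exit(), the callback never runs.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write end of a pipe; the atexit callback reports through it (illustrative). */
static int cb_pipe_fd = -1;

/* atexit callback: writes one byte if it is ever run. */
static void
on_exit_cb(void)
{
    if (cb_pipe_fd >= 0)
    {
        ssize_t rc = write(cb_pipe_fd, "x", 1);

        (void) rc;
    }
}

/* Hypothetical crash-style handler: leave at once, skipping atexit callbacks. */
static void
crash_exit_handler(int signo)
{
    (void) signo;
    _exit(2);
}

/*
 * Fork a child that registers the callback and raises SIGQUIT.
 * Returns the child's exit status (or -1 on error) and sets *callback_ran
 * to 1 if the atexit callback wrote to the pipe, 0 otherwise.
 */
static int
quit_child(int *callback_ran)
{
    int     fds[2];
    int     status = 0;
    char    c;
    pid_t   pid;

    if (pipe(fds) != 0)
        return -1;
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
    {
        close(fds[0]);
        cb_pipe_fd = fds[1];
        atexit(on_exit_cb);
        signal(SIGQUIT, crash_exit_handler);
        raise(SIGQUIT);
        _exit(99);              /* not reached: the handler has already exited */
    }
    close(fds[1]);
    /* read() returns 0 (EOF) if the child exited without writing anything */
    *callback_ran = (read(fds[0], &c, 1) == 1);
    close(fds[0]);
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Replacing _exit(2) with exit(1) in the handler would make the callback run from signal context, which is the unsafe behavior the patch removes from the archiver.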
From a9ad439e8b84d6523e2ed8bcf156e66b5a447b50 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:03 +0900
Subject: [PATCH v35 2/7] sequential scan for dshash

Dshash did not allow scanning all entries sequentially. This adds that
functionality. The interface is similar to, but a bit different from,
both that of dynahash and the simple dshash search functions. One of the
most significant differences is that the sequential scan interface of
dshash always needs a call to dshash_seq_term when the scan ends. Another
is locking. Dshash holds a partition lock when returning an entry;
dshash_seq_next() also holds a lock when returning an entry, but callers
shouldn't release it, since the lock is essential to continue the
scan. The seqscan interface allows entry deletion during a scan. The
in-scan deletion should be performed by dshash_delete_current().
---
 src/backend/lib/dshash.c | 149 ++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  21 ++++++
 2 files changed, 169 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 78ccf03217..1ef093e2e9 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -127,6 +127,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +157,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -324,7 +332,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -592,6 +600,145 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should always be called when a scan has finished.
+ * The caller may delete returned elements in the middle of a scan by using
+ * dshash_delete_current(). exclusive must be true to delete elements.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool exclusive)
+{
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->exclusive = exclusive;
+}
+
+/*
+ * Returns the next element.
+ *
+ * Returned elements are locked and the caller must not explicitly release
+ * the lock. It is released by a later dshash_seq_next() or dshash_seq_term().
+ */
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert(status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+        LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                      status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        status->curpartition = partition;
+
+        /* resize doesn't happen from now until seq scan ends */
+        status->nbuckets =
+            NUM_BUCKETS(status->hash_table->control->size_log2);
+        ensure_valid_bucket_pointers(status->hash_table);
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        int next_partition;
+
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            return NULL;
+        }
+
+        /* Check whether we are moving to the next partition */
+        next_partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+
+        if (status->curpartition != next_partition)
+        {
+            /*
+             * Move to the next partition. Lock the next partition and then
+             * release the current one, not in the reverse order, to avoid
+             * concurrent resizing.  Avoid deadlock by taking locks in the same
+             * order as resize().
+             */
+            LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                         next_partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                         status->curpartition));
+            status->curpartition = next_partition;
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * The caller may delete the item. Store the next item in case of deletion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+/*
+ * Terminates the seqscan and releases all locks.
+ *
+ * Should always be called when finishing or exiting a seqscan.
+ */
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+
+    if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+/* Remove the current entry during a seq scan. */
+void
+dshash_delete_current(dshash_seq_status *status)
+{
+    dshash_table       *hash_table    = status->hash_table;
+    dshash_table_item  *item        = status->curitem;
+    size_t                partition    = PARTITION_FOR_HASH(item->hash);
+
+    Assert(status->exclusive);
+    Assert(hash_table->control->magic == DSHASH_MAGIC);
+    Assert(hash_table->find_locked);
+    Assert(hash_table->find_exclusively_locked);
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition),
+                                LW_EXCLUSIVE));
+
+    delete_item(hash_table, item);
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b86df68e77..0ca9514021 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,21 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state. The detail is exposed to let users know the storage
+ * size but it should be considered as an opaque type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;    /* dshash table working on */
+    int                    curbucket;    /* bucket number we are at */
+    int                    nbuckets;    /* total number of buckets in the dshash */
+    dshash_table_item  *curitem;    /* item we are currently at */
+    dsa_pointer            pnextitem;    /* dsa-pointer to the next item */
+    int                    curpartition;    /* partition number we are at */
+    bool                exclusive;    /* locking mode */
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
                                    const dshash_parameters *params,
@@ -80,6 +95,12 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
+extern void dshash_delete_current(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.18.2

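The key idea of the seqscan patch above, remembering the next item before handing out the current one so that the caller may delete the current item mid-scan, can be sketched in a self-contained form. This is a simplified, single-process toy under stated assumptions: it has no partitions, no LWLocks and no dsa pointers, and every toy_* name is hypothetical and not part of dshash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_NBUCKETS 4

typedef struct toy_item
{
    struct toy_item *next;
    int              key;
} toy_item;

typedef struct toy_table
{
    toy_item *buckets[TOY_NBUCKETS];
} toy_table;

/* Scan state, loosely modelled on dshash_seq_status (without locking). */
typedef struct toy_seq_status
{
    toy_table *table;
    int        curbucket;   /* bucket we are at; -1 before the first call */
    toy_item  *curitem;     /* item most recently returned */
    toy_item  *nextitem;    /* saved successor of curitem */
} toy_seq_status;

static void
toy_insert(toy_table *t, int key)
{
    toy_item *item = malloc(sizeof(toy_item));
    unsigned  b = (unsigned) key % TOY_NBUCKETS;

    if (item == NULL)
        abort();
    item->key = key;
    item->next = t->buckets[b];
    t->buckets[b] = item;
}

static void
toy_seq_init(toy_seq_status *s, toy_table *t)
{
    s->table = t;
    s->curbucket = -1;
    s->curitem = NULL;
    s->nextitem = NULL;
}

/* Return the next item, or NULL when all buckets have been scanned. */
static toy_item *
toy_seq_next(toy_seq_status *s)
{
    toy_item *next = s->nextitem;

    /* Move to the next bucket if we finished the current one. */
    while (next == NULL)
    {
        if (++s->curbucket >= TOY_NBUCKETS)
        {
            s->curitem = NULL;
            return NULL;
        }
        next = s->table->buckets[s->curbucket];
    }
    s->curitem = next;
    /* The caller may delete the item, so save its successor now. */
    s->nextitem = next->next;
    return next;
}

/* Unlink and free the item most recently returned by toy_seq_next(). */
static void
toy_delete_current(toy_seq_status *s)
{
    toy_item **pp = &s->table->buckets[s->curbucket];

    assert(s->curitem != NULL);
    while (*pp != s->curitem)
        pp = &(*pp)->next;
    *pp = s->curitem->next;
    free(s->curitem);
    s->curitem = NULL;
}

/* Example scan: delete every even key, return how many were removed. */
static int
toy_delete_even_keys(toy_table *t)
{
    toy_seq_status s;
    toy_item  *item;
    int        removed = 0;

    toy_seq_init(&s, t);
    while ((item = toy_seq_next(&s)) != NULL)
    {
        if (item->key % 2 == 0)
        {
            toy_delete_current(&s);
            removed++;
        }
    }
    return removed;
}

/* Count the items with a plain scan. */
static int
toy_count(toy_table *t)
{
    toy_seq_status s;
    int        n = 0;

    toy_seq_init(&s, t);
    while (toy_seq_next(&s) != NULL)
        n++;
    return n;
}
```

In the real interface the scan additionally holds a partition lock between calls, which is why the patch requires dshash_seq_term() at the end of every scan; the toy has nothing to release and therefore no counterpart to it.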
From c31e2119b78acabdf7128bba21ccf89a16451554 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:57 +0900
Subject: [PATCH v35 3/7] Add conditional lock feature to dshash

Dshash currently waits for locks unconditionally, which is inconvenient
when we want to avoid being blocked by other processes. This commit
adds dshash_find_extended, an alternative to dshash_find and
dshash_find_or_insert that allows immediate return on lock failure.
---
 src/backend/lib/dshash.c | 98 +++++++++++++++++++++-------------------
 src/include/lib/dshash.h |  3 ++
 2 files changed, 55 insertions(+), 46 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 1ef093e2e9..d7ee6de11e 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -383,6 +383,10 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  * the caller must take care to ensure that the entry is not left corrupted.
  * The lock mode is either shared or exclusive depending on 'exclusive'.
  *
+ * If found is not NULL, *found is set to true if the key is found in the hash
+ * table. If the key is not found, *found is set to false and a pointer to a
+ * newly created entry is returned.
+ *
  * The caller must not lock a lock already.
  *
  * Note that the lock held is in fact an LWLock, so interrupts will be held on
@@ -392,36 +396,7 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
 {
-    dshash_hash hash;
-    size_t        partition;
-    dshash_table_item *item;
-
-    hash = hash_key(hash_table, key);
-    partition = PARTITION_FOR_HASH(hash);
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
-
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
-    ensure_valid_bucket_pointers(hash_table);
-
-    /* Search the active bucket. */
-    item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
-
-    if (!item)
-    {
-        /* Not found. */
-        LWLockRelease(PARTITION_LOCK(hash_table, partition));
-        return NULL;
-    }
-    else
-    {
-        /* The caller will free the lock by calling dshash_release_lock. */
-        hash_table->find_locked = true;
-        hash_table->find_exclusively_locked = exclusive;
-        return ENTRY_FROM_ITEM(item);
-    }
+    return dshash_find_extended(hash_table, key, exclusive, false, false, NULL);
 }
 
 /*
@@ -439,31 +414,60 @@ dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
 {
-    dshash_hash hash;
-    size_t        partition_index;
-    dshash_partition *partition;
+    return dshash_find_extended(hash_table, key, true, false, true, found);
+}
+
+
+/*
+ * Find the key in the hash table.
+ *
+ * "exclusive" is the lock mode in which the partition for the returned item
+ * is locked.  If "nowait" is true, the function returns immediately if the
+ * required lock could not be acquired.  "insert" indicates insert mode: if the
+ * key is not found, a new entry is inserted and *found is set to false; *found
+ * is set to true if the key is found. "found" must be non-null in this mode.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool insert, bool *found)
+{
+    dshash_hash hash = hash_key(hash_table, key);
+    size_t        partidx = PARTITION_FOR_HASH(hash);
+    dshash_partition *partition = &hash_table->control->partitions[partidx];
+    LWLockMode  lockmode = exclusive ? LW_EXCLUSIVE : LW_SHARED;
     dshash_table_item *item;
 
-    hash = hash_key(hash_table, key);
-    partition_index = PARTITION_FOR_HASH(hash);
-    partition = &hash_table->control->partitions[partition_index];
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
+    /* insert mode requires an exclusive lock and a non-null "found" */
+    Assert(!insert || (exclusive && found != NULL));
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (!nowait)
+        LWLockAcquire(PARTITION_LOCK(hash_table, partidx), lockmode);
+    else if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partidx),
+                                       lockmode))
+        return NULL;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
     item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
 
     if (item)
-        *found = true;
+    {
+        if (found)
+            *found = true;
+    }
     else
     {
-        *found = false;
+        if (found)
+            *found = false;
+
+        if (!insert)
+        {
+            /* The caller didn't ask us to add a new entry. */
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+            return NULL;
+        }
 
         /* Check if we are getting too full. */
         if (partition->count > MAX_COUNT_PER_PARTITION(hash_table))
@@ -479,7 +483,8 @@ restart:
              * Give up our existing lock first, because resizing needs to
              * reacquire all the locks in the right order to avoid deadlocks.
              */
-            LWLockRelease(PARTITION_LOCK(hash_table, partition_index));
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+
             resize(hash_table, hash_table->size_log2 + 1);
 
             goto restart;
@@ -493,12 +498,13 @@ restart:
         ++partition->count;
     }
 
-    /* The caller must release the lock with dshash_release_lock. */
+    /* The caller must release the lock by calling dshash_release_lock. */
     hash_table->find_locked = true;
-    hash_table->find_exclusively_locked = true;
+    hash_table->find_exclusively_locked = exclusive;
     return ENTRY_FROM_ITEM(item);
 }
 
+
 /*
  * Remove an entry by key.  Returns true if the key was found and the
  * corresponding entry was removed.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 0ca9514021..5f7a60febd 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -91,6 +91,9 @@ extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                                    const void *key, bool *found);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+                                  bool exclusive, bool nowait, bool insert,
+                                  bool *found);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.18.2
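
Note on the lookup modes added above: the following is a minimal,
single-partition toy model of the "nowait" and "insert" behavior described
for dshash_find_extended. It is a sketch only, not the dshash code. All
names in it (toy_table, toy_entry, toy_find_extended, toy_release_lock) are
hypothetical, a C11 atomic_flag stands in for the partition LWLock, and the
shared/exclusive distinction is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_CAPACITY 8

typedef struct
{
    int         key;
    int         value;
    bool        used;
} toy_entry;

typedef struct
{
    atomic_flag lock;           /* stands in for one partition lock */
    toy_entry   slots[TOY_CAPACITY];
} toy_table;

static void
toy_init(toy_table *t)
{
    atomic_flag_clear(&t->lock);
    for (size_t i = 0; i < TOY_CAPACITY; i++)
        t->slots[i].used = false;
}

/* Counterpart of dshash_release_lock in this toy model. */
static void
toy_release_lock(toy_table *t)
{
    atomic_flag_clear(&t->lock);
}

/*
 * Returns the entry with the lock still held, or NULL with the lock not
 * held.  With nowait, a busy lock yields NULL and leaves *found untouched.
 */
static toy_entry *
toy_find_extended(toy_table *t, int key, bool nowait, bool insert, bool *found)
{
    toy_entry  *free_slot = NULL;

    /* insert mode needs somewhere to report whether the key existed */
    assert(!insert || found != NULL);

    if (nowait)
    {
        if (atomic_flag_test_and_set(&t->lock))
            return NULL;        /* lock is busy, give up immediately */
    }
    else
    {
        while (atomic_flag_test_and_set(&t->lock))
            ;                   /* spin until the lock is free */
    }

    for (size_t i = 0; i < TOY_CAPACITY; i++)
    {
        if (t->slots[i].used && t->slots[i].key == key)
        {
            if (found)
                *found = true;
            return &t->slots[i];
        }
        if (!t->slots[i].used && free_slot == NULL)
            free_slot = &t->slots[i];
    }

    if (found)
        *found = false;

    /* Not found: release unless asked to insert (and there is room). */
    if (!insert || free_slot == NULL)
    {
        toy_release_lock(t);
        return NULL;
    }

    free_slot->used = true;
    free_slot->key = key;
    free_slot->value = 0;
    return free_slot;
}
```

In this model a caller that passes nowait cannot tell "lock busy" from "key
not found" by the return value alone; it has to preset *found or retry.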

From 8a9ab412c3f71e81802649234b7dfcdc5bc23120 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:59:38 +0900
Subject: [PATCH v35 4/7] Make archiver process an auxiliary process

This is a preliminary patch for the shared-memory based stats collector.

The archiver process must be an auxiliary process because it accesses
shared memory once the stats data is moved into shared memory. Make it
an auxiliary process so that it keeps working.
---
 src/backend/access/transam/xlogarchive.c |   6 +-
 src/backend/bootstrap/bootstrap.c        |  22 ++--
 src/backend/postmaster/pgarch.c          | 130 +++--------------------
 src/backend/postmaster/postmaster.c      |  50 +++++----
 src/backend/storage/lmgr/proc.c          |   1 +
 src/include/access/xlog.h                |   3 +
 src/include/access/xlogarchive.h         |   1 +
 src/include/miscadmin.h                  |   2 +
 src/include/postmaster/pgarch.h          |   4 +-
 src/include/storage/pmsignal.h           |   1 -
 src/include/storage/proc.h               |   3 +
 11 files changed, 69 insertions(+), 154 deletions(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index cdd586fcfb..ee3444284b 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -29,7 +29,9 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/latch.h"
 #include "storage/pmsignal.h"
+#include "storage/proc.h"
 
 /*
  * Attempt to retrieve the specified file from off-line archival storage.
@@ -489,8 +491,8 @@ XLogArchiveNotify(const char *xlog)
     }
 
     /* Notify archiver that it's got something to do */
-    if (IsUnderPostmaster)
-        SendPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER);
+    if (IsUnderPostmaster && ProcGlobal->archiverLatch)
+        SetLatch(ProcGlobal->archiverLatch);
 }
 
 /*
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 5480a024e0..d398ce6f03 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -319,6 +319,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
         case StartupProcess:
             MyBackendType = B_STARTUP;
             break;
+        case ArchiverProcess:
+            MyBackendType = B_ARCHIVER;
+            break;
         case BgWriterProcess:
             MyBackendType = B_BG_WRITER;
             break;
@@ -439,30 +442,29 @@ AuxiliaryProcessMain(int argc, char *argv[])
             proc_exit(1);        /* should never return */
 
         case StartupProcess:
-            /* don't set signals, startup process has its own agenda */
             StartupProcessMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
+
+        case ArchiverProcess:
+            PgArchiverMain();
+            proc_exit(1);
 
         case BgWriterProcess:
-            /* don't set signals, bgwriter has its own agenda */
             BackgroundWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case CheckpointerProcess:
-            /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalWriterProcess:
-            /* don't set signals, walwriter has its own agenda */
             InitXLOGAccess();
             WalWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalReceiverProcess:
-            /* don't set signals, walreceiver has its own agenda */
             WalReceiverMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         default:
             elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 37be0e2bbb..063d1323ea 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -48,6 +48,7 @@
 #include "storage/latch.h"
 #include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
+#include "storage/procsignal.h"
 #include "utils/guc.h"
 #include "utils/ps_status.h"
 
@@ -78,13 +79,11 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
-static volatile sig_atomic_t wakened = false;
 static volatile sig_atomic_t ready_to_stop = false;
 
 /* ----------
@@ -95,8 +94,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
 static void pgarch_ArchiverCopyLoop(void);
@@ -110,75 +107,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -212,14 +140,9 @@ pgarch_forkexec(void)
 #endif                            /* EXEC_BACKEND */
 
 
-/*
- * PgArchiverMain
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
- *    since we don't use 'em, it hardly matters...
- */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+/* Main entry point for archiver process */
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -231,33 +154,26 @@ PgArchiverMain(int argc, char *argv[])
     pqsignal(SIGQUIT, SignalHandlerForCrashExit);
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, pgarch_waken);
+    pqsignal(SIGUSR1, procsignal_sigusr1_handler);
     pqsignal(SIGUSR2, pgarch_waken_stop);
+
     /* Reset some signals that are accepted by postmaster but not here */
     pqsignal(SIGCHLD, SIG_DFL);
+
+    /* Unblock signals (they were blocked when the postmaster forked us) */
     PG_SETMASK(&UnBlockSig);
 
-    MyBackendType = B_ARCHIVER;
-    init_ps_display(NULL);
+    /*
+     * Advertise our latch that backends can use to wake us up while we're
+     * sleeping.
+     */
+    ProcGlobal->archiverLatch = &MyProc->procLatch;
 
     pgarch_MainLoop();
 
     exit(0);
 }
 
-/* SIGUSR1 signal handler for archiver process */
-static void
-pgarch_waken(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    /* set flag that there is work to be done */
-    wakened = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
 /* SIGUSR2 signal handler for archiver process */
 static void
 pgarch_waken_stop(SIGNAL_ARGS)
@@ -282,14 +198,6 @@ pgarch_MainLoop(void)
     pg_time_t    last_copy_time = 0;
     bool        time_to_stop;
 
-    /*
-     * We run the copy loop immediately upon entry, in case there are
-     * unarchived files left over from a previous database run (or maybe the
-     * archiver died unexpectedly).  After that we wait for a signal or
-     * timeout before doing more.
-     */
-    wakened = true;
-
     /*
      * There shouldn't be anything for the archiver to do except to wait for a
      * signal ... however, the archiver exists to protect our data, so she
@@ -328,12 +236,8 @@ pgarch_MainLoop(void)
         }
 
         /* Do what we're here for */
-        if (wakened || time_to_stop)
-        {
-            wakened = false;
-            pgarch_ArchiverCopyLoop();
-            last_copy_time = time(NULL);
-        }
+        pgarch_ArchiverCopyLoop();
+        last_copy_time = time(NULL);
 
         /*
          * Sleep until a signal is received, or until a poll is forced by
@@ -354,13 +258,9 @@ pgarch_MainLoop(void)
                                WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                                timeout * 1000L,
                                WAIT_EVENT_ARCHIVER_MAIN);
-                if (rc & WL_TIMEOUT)
-                    wakened = true;
                 if (rc & WL_POSTMASTER_DEATH)
                     time_to_stop = true;
             }
-            else
-                wakened = true;
         }
 
         /*
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b4d475bb0b..30f3fc6ba1 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -539,6 +539,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1785,7 +1786,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -3068,7 +3069,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3203,20 +3204,16 @@ reaper(SIGNAL_ARGS)
         }
 
         /*
-         * Was it the archiver?  If so, just try to start a new one; no need
-         * to force reset of the rest of the system.  (If fail, we'll try
-         * again in future cycles of the main loop.).  Unless we were waiting
-         * for it to shut down; don't restart it in that case, and
-         * PostmasterStateMachine() will advance to the next shutdown step.
+         * Was it the archiver?  Normal exit can be ignored; we'll start a new
+         * one at the next iteration of the postmaster's main loop, if
+         * necessary. Any other exit condition is treated as a crash.
          */
         if (pid == PgArchPID)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3464,7 +3461,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3669,6 +3666,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3941,6 +3950,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5216,7 +5226,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5259,16 +5269,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (StartWorkerNeeded || HaveCrashedWorker)
         maybe_start_bgworkers();
 
-    if (CheckPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER) &&
-        PgArchPID != 0)
-    {
-        /*
-         * Send SIGUSR1 to archiver process, to wake it up and begin archiving
-         * next WAL file.
-         */
-        signal_child(PgArchPID, SIGUSR1);
-    }
-
     /* Tell syslogger to rotate logfile if requested */
     if (SysLoggerPID != 0)
     {
@@ -5501,6 +5501,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index f5eef6fa4e..46f0198510 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -187,6 +187,7 @@ InitProcGlobal(void)
     ProcGlobal->startupBufferPinWaitBufId = -1;
     ProcGlobal->walwriterLatch = NULL;
     ProcGlobal->checkpointerLatch = NULL;
+    ProcGlobal->archiverLatch = NULL;
     pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
     pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index e917dfe92d..0783692c83 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -336,6 +336,9 @@ extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern void XLogArchiveWakeupStart(void);
+extern void XLogArchiveWakeupEnd(void);
+extern void XLogArchiveWakeup(void);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool PromoteIsTriggered(void);
diff --git a/src/include/access/xlogarchive.h b/src/include/access/xlogarchive.h
index 1c67de2ede..54ce0b97d7 100644
--- a/src/include/access/xlogarchive.h
+++ b/src/include/access/xlogarchive.h
@@ -25,6 +25,7 @@ extern void ExecuteRecoveryCommand(const char *command, const char *commandName,
 extern void KeepFileRestoredFromArchive(const char *path, const char *xlogfname);
 extern void XLogArchiveNotify(const char *xlog);
 extern void XLogArchiveNotifySeg(XLogSegNo segno);
+extern void XLogArchiveWakeup(void);
 extern void XLogArchiveForceDone(const char *xlog);
 extern bool XLogArchiveCheckDone(const char *xlog);
 extern bool XLogArchiveIsBusy(const char *xlog);
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 18bc8a7b90..cf0b463eba 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -417,6 +417,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -429,6 +430,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index b3200874ca..e3ffc63f14 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 56c5ec4481..c691acf8cd 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -34,7 +34,6 @@ typedef enum
 {
     PMSIGNAL_RECOVERY_STARTED,    /* recovery has started */
     PMSIGNAL_BEGIN_HOT_STANDBY, /* begin Hot Standby */
-    PMSIGNAL_WAKEN_ARCHIVER,    /* send a NOTIFY signal to xlog archiver */
     PMSIGNAL_ROTATE_LOGFILE,    /* send SIGUSR1 to syslogger to rotate logfile */
     PMSIGNAL_START_AUTOVAC_LAUNCHER,    /* start an autovacuum launcher */
     PMSIGNAL_START_AUTOVAC_WORKER,    /* start an autovacuum worker */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 1ee9000b2b..8ab6d859bb 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -270,6 +270,9 @@ typedef struct PROC_HDR
     int            startupProcPid;
     /* Buffer id of the buffer that Startup process waits for pin on, or -1 */
     int            startupBufferPinWaitBufId;
+    /* Archiver process's latch */
+    Latch       *archiverLatch;
+    /* Current shared estimate of appropriate spins_per_delay value */
 } PROC_HDR;
 
 extern PGDLLIMPORT PROC_HDR *ProcGlobal;
-- 
2.18.2

From c7fec2305a49509cf9506e1e09ce5bafece6c662 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:14 +0900
Subject: [PATCH v35 5/7] Shared-memory based stats collector

Previously, activity statistics were collected via sockets and shared
among backends through files written periodically. Such files can reach
tens of megabytes and are created at most once every 500ms, and that
much data is serialized by the stats collector and then de-serialized
by every backend periodically. To avoid that large cost, this patch
places activity statistics data in shared memory. Each backend
accumulates statistics numbers locally, then tries to move them into
the shared statistics at every transaction end, but at intervals not
shorter than 500ms. Locks on the shared statistics are acquired per
unit, such as a table or a function, so the expected chance of
collision is not high. Furthermore, until 1 second has elapsed since
the last flush to shared stats, a lock failure postpones the flush so
that lock contention doesn't slow down transactions.  After that, the
stats flush waits for locks so that the shared statistics don't get
stale.
---
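
Note: the flush timing described in the commit message can be read as a
three-way decision on the time elapsed since the last flush. The sketch
below only illustrates that reading. The function name, the enum, and the
two constants are hypothetical and are not taken from pgstat.c.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed thresholds, taken from the figures in the commit message. */
#define MIN_FLUSH_INTERVAL_MS   500     /* never flush more often than this */
#define MAX_NOWAIT_INTERVAL_MS  1000    /* after this, wait for locks */

typedef enum
{
    FLUSH_SKIP,                 /* too soon, keep accumulating locally */
    FLUSH_NOWAIT,               /* try; postpone entries whose lock is busy */
    FLUSH_WAIT                  /* block on locks so shared stats stay fresh */
} flush_mode;

/* Decide how a backend should flush at transaction end. */
static flush_mode
choose_flush_mode(int64_t now_ms, int64_t last_flush_ms)
{
    int64_t     elapsed;

    assert(now_ms >= last_flush_ms);
    elapsed = now_ms - last_flush_ms;

    if (elapsed < MIN_FLUSH_INTERVAL_MS)
        return FLUSH_SKIP;
    if (elapsed < MAX_NOWAIT_INTERVAL_MS)
        return FLUSH_NOWAIT;
    return FLUSH_WAIT;
}
```

For example, a transaction ending 700ms after the last flush would attempt
a non-blocking flush, while one ending 1200ms after it would wait for the
locks.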
 src/backend/access/heap/heapam_handler.c      |    4 +-
 src/backend/access/heap/vacuumlazy.c          |    2 +-
 src/backend/access/transam/xlog.c             |    4 +-
 src/backend/catalog/index.c                   |   24 +-
 src/backend/commands/analyze.c                |    8 +-
 src/backend/commands/dbcommands.c             |    2 +-
 src/backend/commands/matview.c                |    8 +-
 src/backend/commands/vacuum.c                 |    4 +-
 src/backend/postmaster/autovacuum.c           |   74 +-
 src/backend/postmaster/bgwriter.c             |    4 +-
 src/backend/postmaster/checkpointer.c         |   24 +-
 src/backend/postmaster/interrupt.c            |    5 +-
 src/backend/postmaster/pgarch.c               |   12 +-
 src/backend/postmaster/pgstat.c               | 5125 +++++++----------
 src/backend/postmaster/postmaster.c           |   88 +-
 src/backend/replication/basebackup.c          |    4 +-
 src/backend/storage/buffer/bufmgr.c           |    8 +-
 src/backend/storage/ipc/ipci.c                |    2 +
 src/backend/storage/lmgr/lwlock.c             |    4 +-
 src/backend/storage/lmgr/lwlocknames.txt      |    1 +
 src/backend/storage/smgr/smgr.c               |    4 +-
 src/backend/tcop/postgres.c                   |   37 +-
 src/backend/utils/adt/pgstatfuncs.c           |   59 +-
 src/backend/utils/init/globals.c              |    1 +
 src/backend/utils/init/miscinit.c             |    3 -
 src/backend/utils/init/postinit.c             |   11 +
 src/backend/utils/misc/guc.c                  |   14 +-
 src/backend/utils/misc/postgresql.conf.sample |    2 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |    2 +-
 src/include/miscadmin.h                       |    3 +-
 src/include/pgstat.h                          |  581 +-
 src/include/storage/lwlock.h                  |    1 +
 src/include/utils/guc_tables.h                |    2 +-
 src/include/utils/timeout.h                   |    1 +
 34 files changed, 2256 insertions(+), 3872 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 56b35622f1..80a3c50994 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1059,8 +1059,8 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
                  * our own.  In this case we should count and sample the row,
                  * to accommodate users who load a table and analyze it in one
                  * transaction.  (pgstat_report_analyze has to adjust the
-                 * numbers we send to the stats collector to make this come
-                 * out right.)
+                 * numbers we report to the activity stats facility to make
+                 * this come out right.)
                  */
                 if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(targtuple->t_data)))
                 {
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 3bef0e124b..4ac4e7c03d 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -595,7 +595,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
                         new_min_multi,
                         false);
 
-    /* report results to the stats collector, too */
+    /* report results to the activity stats facility, too */
     pgstat_report_vacuum(RelationGetRelid(onerel),
                          onerel->rd_rel->relisshared,
                          new_live_tuples,
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ca09d81b08..474db9fde3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8575,9 +8575,9 @@ LogCheckpointEnd(bool restartpoint)
                         &sync_secs, &sync_usecs);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time +=
+    BgWriterStats.checkpoint_write_time +=
         write_secs * 1000 + write_usecs / 1000;
-    BgWriterStats.m_checkpoint_sync_time +=
+    BgWriterStats.checkpoint_sync_time +=
         sync_secs * 1000 + sync_usecs / 1000;
 
     /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index cdc01c49c9..d33014f915 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1702,28 +1702,10 @@ index_concurrently_swap(Oid newIndexId, Oid oldIndexId, const char *oldName)
 
     /*
      * Copy over statistics from old to new index
+     * The data will be flushed by the next pgstat_report_stat()
+     * call.
      */
-    {
-        PgStat_StatTabEntry *tabentry;
-
-        tabentry = pgstat_fetch_stat_tabentry(oldIndexId);
-        if (tabentry)
-        {
-            if (newClassRel->pgstat_info)
-            {
-                newClassRel->pgstat_info->t_counts.t_numscans = tabentry->numscans;
-                newClassRel->pgstat_info->t_counts.t_tuples_returned = tabentry->tuples_returned;
-                newClassRel->pgstat_info->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_hit = tabentry->blocks_hit;
-
-                /*
-                 * The data will be sent by the next pgstat_report_stat()
-                 * call.
-                 */
-            }
-        }
-    }
+    pgstat_copy_index_counters(oldIndexId, newClassRel->pgstat_info);
 
     /* Close relations */
     table_close(pg_class, RowExclusiveLock);
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 924ef37c81..06b03cb8e1 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -655,10 +655,10 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
     }
 
     /*
-     * Report ANALYZE to the stats collector, too.  However, if doing
-     * inherited stats we shouldn't report, because the stats collector only
-     * tracks per-table stats.  Reset the changes_since_analyze counter only
-     * if we analyzed all columns; otherwise, there is still work for
+     * Report ANALYZE to the activity stats facility, too.  However, if doing
+     * inherited stats we shouldn't report, because the activity stats facility
+     * only tracks per-table stats.  Reset the changes_since_analyze counter
+     * only if we analyzed all columns; otherwise, there is still work for
      * auto-analyze to do.
      */
     if (!inh)
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index f27c3fe8c1..de0c749570 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -971,7 +971,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
     DropDatabaseBuffers(db_id);
 
     /*
-     * Tell the stats collector to forget it immediately, too.
+     * Tell the activity stats facility to forget it immediately, too.
      */
     pgstat_drop_database(db_id);
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index f80a9e96a9..e7ccc6eba7 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -336,10 +336,10 @@ ExecRefreshMatView(RefreshMatViewStmt *stmt, const char *queryString,
         refresh_by_heap_swap(matviewOid, OIDNewHeap, relpersistence);
 
         /*
-         * Inform stats collector about our activity: basically, we truncated
-         * the matview and inserted some new data.  (The concurrent code path
-         * above doesn't need to worry about this because the inserts and
-         * deletes it issues get counted by lower-level code.)
+         * Inform activity stats facility about our activity: basically, we
+         * truncated the matview and inserted some new data.  (The concurrent
+         * code path above doesn't need to worry about this because the inserts
+         * and deletes it issues get counted by lower-level code.)
          */
         pgstat_count_truncate(matviewRel);
         if (!stmt->skipData)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index d32de23e62..3953964038 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -319,8 +319,8 @@ vacuum(List *relations, VacuumParams *params,
                  errmsg("VACUUM option DISABLE_PAGE_SKIPPING cannot be used with FULL")));
 
     /*
-     * Send info about dead objects to the statistics collector, unless we are
-     * in autovacuum --- autovacuum.c does this for itself.
+     * Send info about dead objects to the activity statistics facility, unless
+     * we are in autovacuum --- autovacuum.c does this for itself.
      */
     if ((params->options & VACOPT_VACUUM) && !IsAutoVacuumWorkerProcess())
         pgstat_vacuum_stat();
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 9c7d4b0c60..374ddb5d60 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -338,9 +338,6 @@ static void autovacuum_do_vac_analyze(autovac_table *tab,
                                       BufferAccessStrategy bstrategy);
 static AutoVacOpts *extract_autovac_opts(HeapTuple tup,
                                          TupleDesc pg_class_desc);
-static PgStat_StatTabEntry *get_pgstat_tabentry_relid(Oid relid, bool isshared,
-                                                      PgStat_StatDBEntry *shared,
-                                                      PgStat_StatDBEntry *dbentry);
 static void perform_work_item(AutoVacuumWorkItem *workitem);
 static void autovac_report_activity(autovac_table *tab);
 static void autovac_report_workitem(AutoVacuumWorkItem *workitem,
@@ -1664,12 +1661,12 @@ AutoVacWorkerMain(int argc, char *argv[])
         char        dbname[NAMEDATALEN];
 
         /*
-         * Report autovac startup to the stats collector.  We deliberately do
-         * this before InitPostgres, so that the last_autovac_time will get
-         * updated even if the connection attempt fails.  This is to prevent
-         * autovac from getting "stuck" repeatedly selecting an unopenable
-         * database, rather than making any progress on stuff it can connect
-         * to.
+         * Report autovac startup to the activity stats facility.  We
+         * deliberately do this before InitPostgres, so that the
+         * last_autovac_time will get updated even if the connection attempt
+         * fails.  This is to prevent autovac from getting "stuck" repeatedly
+         * selecting an unopenable database, rather than making any progress on
+         * stuff it can connect to.
          */
         pgstat_report_autovac(dbid);
 
@@ -1937,8 +1934,6 @@ do_autovacuum(void)
     HASHCTL        ctl;
     HTAB       *table_toast_map;
     ListCell   *volatile cell;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     BufferAccessStrategy bstrategy;
     ScanKeyData key;
     TupleDesc    pg_class_desc;
@@ -1957,17 +1952,11 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /*
-     * may be NULL if we couldn't find an entry (only happens if we are
-     * forcing a vacuum for anti-wrap purposes).
-     */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
 
     /*
-     * Clean up any dead statistics collector entries for this DB. We always
+     * Clean up any dead activity statistics entries for this DB. We always
      * want to do this exactly once per DB-processing cycle, even if we find
      * nothing worth vacuuming in the database.
      */
@@ -2010,9 +1999,6 @@ do_autovacuum(void)
     /* StartTransactionCommand changed elsewhere */
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-
     classRel = table_open(RelationRelationId, AccessShareLock);
 
     /* create a copy so we can use it after closing pg_class */
@@ -2091,8 +2077,8 @@ do_autovacuum(void)
 
         /* Fetch reloptions and the pgstat entry for this table */
         relopts = extract_autovac_opts(tuple, pg_class_desc);
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                       relid);
 
         /* Check if it needs vacuum or analyze */
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
@@ -2175,8 +2161,8 @@ do_autovacuum(void)
         }
 
         /* Fetch the pgstat entry for this table */
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                       relid);
 
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
@@ -2735,29 +2721,6 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
     return av;
 }
 
-/*
- * get_pgstat_tabentry_relid
- *
- * Fetch the pgstat entry of a table, either local to a database or shared.
- */
-static PgStat_StatTabEntry *
-get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
-                          PgStat_StatDBEntry *dbentry)
-{
-    PgStat_StatTabEntry *tabentry = NULL;
-
-    if (isshared)
-    {
-        if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
-    }
-    else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
-
-    return tabentry;
-}
 
 /*
  * table_recheck_autovac
@@ -2778,17 +2741,12 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     bool        doanalyze;
     autovac_table *tab = NULL;
     PgStat_StatTabEntry *tabentry;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     bool        wraparound;
     AutoVacOpts *avopts;
 
     /* use fresh stats */
     autovac_refresh_stats();
 
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* fetch the relation's relcache entry */
     classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
     if (!HeapTupleIsValid(classTup))
@@ -2812,8 +2770,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     }
 
     /* fetch the pgstat table entry */
-    tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                         shared, dbentry);
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                   relid);
 
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
@@ -2935,7 +2893,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
  *
  * For analyze, the analysis done is that the number of tuples inserted,
  * deleted and updated since the last analyze exceeds a threshold calculated
- * in the same fashion as above.  Note that the collector actually stores
+ * in the same fashion as above.  Note that the activity statistics store
  * the number of tuples (both live and dead) that there were as of the last
  * analyze.  This is asymmetric to the VACUUM case.
  *
@@ -2945,8 +2903,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
  *
  * A table whose autovacuum_enabled option is false is
  * automatically skipped (unless we have to vacuum it due to freeze_max_age).
- * Thus autovacuum can be disabled for specific tables. Also, when the stats
- * collector does not have data about a table, it will be skipped.
+ * Thus autovacuum can be disabled for specific tables. Also, when the activity
+ * statistics facility has no data about a table, it will be skipped.
  *
  * A table whose vac_base_thresh value is < 0 takes the base value from the
  * autovacuum_vacuum_threshold GUC variable.  Similarly, a vac_scale_factor
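The autovacuum changes above replace a two-step lookup (fetch the database entry, then search its table hash) with a single call that takes the relation's shared flag and its OID. The standalone sketch below illustrates one way such a single-key lookup can work, modeled on the (type, database OID, object OID) key that this patch adds to pgstat.c. It is not the patch's implementation: every demo_* name, the Oid stand-in, and the linear array used in place of a dshash are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int Oid;               /* stand-in for PostgreSQL's Oid */
#define InvalidOid ((Oid) 0)

/* Hypothetical mirror of the entry types used in the patch's hash key. */
typedef enum DemoStatType
{
    DEMO_TYPE_DB = 1,
    DEMO_TYPE_TABLE,
    DEMO_TYPE_FUNCTION
} DemoStatType;

/* Same shape as the patch's key: type, database OID, object OID. */
typedef struct DemoKey
{
    DemoStatType type;
    Oid         databaseid;     /* InvalidOid for shared objects */
    Oid         objectid;
} DemoKey;

typedef struct DemoEntry
{
    DemoKey     key;
    long        n_live_tuples;  /* illustrative payload only */
} DemoEntry;

/* Assumption: shared relations are keyed under InvalidOid, others under mydb. */
static DemoKey
demo_table_key(bool isshared, Oid mydb, Oid relid)
{
    DemoKey     key;

    key.type = DEMO_TYPE_TABLE;
    key.databaseid = isshared ? InvalidOid : mydb;
    key.objectid = relid;
    return key;
}

static bool
demo_key_equal(const DemoKey *a, const DemoKey *b)
{
    return a->type == b->type &&
        a->databaseid == b->databaseid &&
        a->objectid == b->objectid;
}

/* Linear search stands in for the hash lookup; returns NULL if not found. */
static const DemoEntry *
demo_lookup(const DemoEntry *tab, size_t n, DemoKey key)
{
    size_t      i;

    assert(tab != NULL || n == 0);
    for (i = 0; i < n; i++)
    {
        if (demo_key_equal(&tab[i].key, &key))
            return &tab[i];
    }
    return NULL;
}
```

With this key layout the caller no longer needs to carry separate "shared" and "my database" entry pointers around, which is why the shared and dbentry variables disappear from do_autovacuum() and table_recheck_autovac() in the hunks above.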
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 069e27e427..4382b1726f 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -234,9 +234,9 @@ BackgroundWriterMain(void)
         can_hibernate = BgBufferSync(&wb_context);
 
         /*
-         * Send off activity statistics to the stats collector
+         * Send off activity statistics to the activity stats facility
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 624a3238b8..dd36b4965e 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -350,7 +350,7 @@ CheckpointerMain(void)
         if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
         {
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            BgWriterStats.requested_checkpoints++;
         }
 
         /*
@@ -364,7 +364,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                BgWriterStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -486,13 +486,13 @@ CheckpointerMain(void)
         CheckArchiveTimeout();
 
         /*
-         * Send off activity statistics to the stats collector.  (The reason
-         * why we re-use bgwriter-related code for this is that the bgwriter
-         * and checkpointer used to be just one process.  It's probably not
-         * worth the trouble to split the stats support into two independent
-         * stats message types.)
+         * Send off activity statistics to the activity stats facility.  (The
+         * reason why we re-use bgwriter-related code for this is that the
+         * bgwriter and checkpointer used to be just one process.  It's
+         * probably not worth the trouble to split the stats support into two
+         * independent stats message types.)
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         /*
          * If any checkpoint flags have been set, redo the loop to handle the
@@ -698,9 +698,9 @@ CheckpointWriteDelay(int flags, double progress)
         CheckArchiveTimeout();
 
         /*
-         * Report interim activity statistics to the stats collector.
+         * Report interim activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1245,8 +1245,8 @@ AbsorbSyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    BgWriterStats.buf_written_backend += CheckpointerShmem->num_backend_writes;
+    BgWriterStats.buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/interrupt.c b/src/backend/postmaster/interrupt.c
index 3d02439b79..53844eb8bb 100644
--- a/src/backend/postmaster/interrupt.c
+++ b/src/backend/postmaster/interrupt.c
@@ -93,9 +93,8 @@ SignalHandlerForCrashExit(SIGNAL_ARGS)
  * shut down and exit.
  *
  * Typically, this handler would be used for SIGTERM, but some procesess use
- * other signals. In particular, the checkpointer exits on SIGUSR2, the
- * stats collector on SIGQUIT, and the WAL writer exits on either SIGINT
- * or SIGTERM.
+ * other signals. In particular, the checkpointer exits on SIGUSR2, and the WAL
+ * writer exits on either SIGINT or SIGTERM.
  *
  * ShutdownRequestPending should be checked at a convenient place within the
  * main loop, or else the main loop should call HandleMainLoopInterrupts.
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 063d1323ea..08fe87341c 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -372,20 +372,20 @@ pgarch_ArchiverCopyLoop(void)
                 pgarch_archiveDone(xlog);
 
                 /*
-                 * Tell the collector about the WAL file that we successfully
-                 * archived
+                 * Tell the activity statistics facility about the WAL file
+                 * that we successfully archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_report_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
             else
             {
                 /*
-                 * Tell the collector about the WAL file that we failed to
-                 * archive
+                 * Tell the activity statistics facility about the WAL file
+                 * that we failed to archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_report_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
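The pgstat.c changes that follow describe, in the file header comment, how a backend paces flushes of its local counters: at most one update per PGSTAT_MIN_INTERVAL, retries after a lock failure starting at PGSTAT_RETRY_MIN_INTERVAL and doubling each time, and a forced update once PGSTAT_MAX_INTERVAL has passed since the first attempt. The sketch below works through that arithmetic under one reading of the comment. The helper demo_next_retry_interval() and its parameters are hypothetical and do not appear in the patch, and clamping the last wait to the deadline is an assumption of this sketch.

```c
#include <assert.h>

/* Interval values as quoted in the patch, in milliseconds. */
#define PGSTAT_MIN_INTERVAL         1000
#define PGSTAT_RETRY_MIN_INTERVAL    250
#define PGSTAT_MAX_INTERVAL        10000

/*
 * Hypothetical helper, not part of the patch.
 *
 * prev_interval_ms: the previous retry interval, or 0 if this is the first
 *                   failed flush.
 * elapsed_ms:       time since the first flush attempt.
 *
 * Returns how long to wait before the next retry, or 0 when the deadline
 * has been reached and the flush should be forced.
 */
static int
demo_next_retry_interval(int prev_interval_ms, int elapsed_ms)
{
    int         next;

    assert(prev_interval_ms >= 0 && elapsed_ms >= 0);

    if (elapsed_ms >= PGSTAT_MAX_INTERVAL)
        return 0;               /* deadline reached: force the update */

    /* Start at the minimum retry interval, then double on every retry. */
    if (prev_interval_ms == 0)
        next = PGSTAT_RETRY_MIN_INTERVAL;
    else
        next = prev_interval_ms * 2;

    /* Assumption of this sketch: never wait past the deadline. */
    if (elapsed_ms + next > PGSTAT_MAX_INTERVAL)
        next = PGSTAT_MAX_INTERVAL - elapsed_ms;

    return next;
}
```

Under these assumptions the waits run 250, 500, 1000, 2000, 4000 ms, then a final shortened wait of 2250 ms that lands exactly on the 10000 ms deadline, at which point the update is forced.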
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 309378ae54..76b59df408 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,22 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Activity Statistics facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects activity statistics, e.g. per-table access statistics, of
+ *  all backends in shared memory. The activity numbers are first stored
+ *  locally in each process, then flushed to shared memory at commit
+ *  time or by idle-timeout.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ * To avoid congestion on the shared memory, shared stats are updated no more
+ * often than once per PGSTAT_MIN_INTERVAL (1000ms). If some local numbers
+ * remain unflushed because of a lock failure, we retry at an interval that is
+ * initially PGSTAT_RETRY_MIN_INTERVAL (250ms) and doubles at every retry.
+ * We force an update PGSTAT_MAX_INTERVAL (10000ms) after the first attempt.
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the activity statistics facility creates the
+ *  area and loads the stored stats file, if any.  The last process at shutdown
+ *  writes the shared stats to the file and destroys the area before exiting.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -19,18 +26,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -40,68 +35,43 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/fork_process.h"
 #include "postmaster/interrupt.h"
 #include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
-#include "utils/timestamp.h"
 
 /* ----------
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
+#define PGSTAT_MIN_INTERVAL            1000    /* Minimum interval of stats data
+                                             * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_RETRY_MIN_INTERVAL    250 /* Initial retry interval after
+                                         * PGSTAT_MIN_INTERVAL */
 
+#define PGSTAT_MAX_INTERVAL            10000    /* Longest interval of stats data
+                                             * updates */
 
 /* ----------
- * The initial size hints for the hash tables used in the collector.
+ * The initial size hints for the hash tables used in the activity statistics.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
-#define PGSTAT_TAB_HASH_SIZE    512
+#define PGSTAT_TABLE_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
 
@@ -116,7 +86,6 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
-
 /* ----------
  * GUC parameters
  * ----------
@@ -131,15 +100,24 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used, but will be removed with GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
-/*
- * BgWriter global statistics counters (unused in other processes).
- * Stored directly in a stats message structure so it can be sent
- * without needing to copy things around.  We assume this inits to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
+typedef struct StatsShmemStruct
+{
+    dsa_handle    stats_dsa_handle;    /* handle for stats data area */
+    dshash_table_handle hash_handle;    /* shared dbstat hash */
+    dsa_pointer global_stats;    /* DSA pointer to global stats */
+    dsa_pointer archiver_stats; /* Ditto for archiver stats */
+    dsa_pointer slru_stats;        /* Ditto for SLRU stats */
+    int            refcount;        /* # of processes attached to the shared
+                                 * stats memory */
+}            StatsShmemStruct;
+
+/* BgWriter global statistics counters */
+PgStat_BgWriter BgWriterStats = {0};
 
 /*
  * List of SLRU names that we keep stats for.  There is no central registry of
@@ -160,72 +138,159 @@ static const char *const slru_names[] = {
 #define SLRU_NUM_ELEMENTS    lengthof(slru_names)
 
 /*
- * SLRU statistics counts waiting to be sent to the collector.  These are
- * stored directly in stats message format so they can be sent without needing
- * to copy things around.  We assume this variable inits to zeroes.  Entries
- * are one-to-one with slru_names[].
+ * SLRU statistics counts waiting to be written to the shared activity
+ * statistics.  We assume this variable inits to zeroes.  Entries are
+ * one-to-one with slru_names[].
+ * Changes of SLRU counters are reported within critical sections so we use
+ * static memory in order to avoid memory allocation.
  */
-static PgStat_MsgSLRU SLRUStats[SLRU_NUM_ELEMENTS];
+static PgStat_StatSLRUEntry local_SLRUStats[SLRU_NUM_ELEMENTS];
+static bool     have_slrustats = false;
 
 /* ----------
  * Local data
  * ----------
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
-
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
+/* backend-lifetime storage */
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
 
 /*
- * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
+ * Enums and types to define shared statistics structure.
+ *
+ * The statistics entry for each object is stored in individually
+ * DSA-allocated memory. Every entry is pointed to from the dshash
+ * pgStatSharedHash via a dsa_pointer. This structure keeps object-stats
+ * entries from being moved by dshash resizing, and allows the dshash to
+ * release its lock sooner on stats updates. It also reduces interference among
+ * write locks on each stat entry by not relying on the dshash partition lock.
+ * PgStatLocalHashEntry is the local-stats equivalent of PgStatHashEntry.
+ *
+ * Each stat entry is enveloped by the type PgStatEnvelope, which stores the
+ * attributes common to all kinds of statistics and an LWLock.
+ *
+ * Shared stats are stored as:
+ *
+ * dshash pgStatSharedHash
+ *    -> PgStatHashEntry                (dshash entry)
+ *      (dsa_pointer)-> PgStatEnvelope    (dsa memory block)
  *
- * NOTE: once allocated, TabStatusArray structures are never moved or deleted
- * for the life of the backend.  Also, we zero out the t_id fields of the
- * contained PgStat_TableStatus structs whenever they are not actively in use.
- * This allows relcache pgstat_info pointers to be treated as long-lived data,
- * avoiding repeated searches in pgstat_initstats() when a relation is
- * repeatedly opened during a transaction.
+ * Local stats are stored as:
+ *
+ * dynahash pgStatLocalHash
+ *    -> PgStatLocalHashEntry            (dynahash entry)
+ *      (direct pointer)-> PgStatEnvelope (palloc'ed memory)
  */
-#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
 
-typedef struct TabStatusArray
+/* The types of statistics entries. */
+typedef enum PgStatTypes
 {
-    struct TabStatusArray *tsa_next;    /* link to next array, if any */
-    int            tsa_used;        /* # entries currently used */
-    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
-} TabStatusArray;
-
-static TabStatusArray *pgStatTabList = NULL;
+    PGSTAT_TYPE_ALL,            /* Not a type, for the parameters of
+                                 * pgstat_collect_stat_entries */
+    PGSTAT_TYPE_DB,                /* database-wide statistics */
+    PGSTAT_TYPE_TABLE,            /* per-table statistics */
+    PGSTAT_TYPE_FUNCTION        /* per-function statistics */
+}            PgStatTypes;
 
 /*
- * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
+ * entry size lookup table of shared statistics entries corresponding to
+ * PgStatTypes
  */
-typedef struct TabStatHashEntry
+static size_t pgstat_entsize[] =
 {
-    Oid            t_id;
-    PgStat_TableStatus *tsa_entry;
-} TabStatHashEntry;
+    0,                            /* PGSTAT_TYPE_ALL: not an entry */
+    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_StatTabEntry),    /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_StatFuncEntry)    /* PGSTAT_TYPE_FUNCTION */
+};
 
-/*
- * Hash table for O(1) t_id -> tsa_entry lookup
- */
-static HTAB *pgStatTabHash = NULL;
+/* Ditto for local statistics entries */
+static size_t pgstat_localentsize[] =
+{
+    0,                            /* PGSTAT_TYPE_ALL: not an entry */
+    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_TableStatus), /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_BackendFunctionEntry) /* PGSTAT_TYPE_FUNCTION */
+};
+
+/* struct for shared statistics hash entry key. */
+typedef struct PgStatHashEntryKey
+{
+    PgStatTypes type;            /* statistics entry type */
+    Oid            databaseid;        /* database ID. InvalidOid for shared objects. */
+    Oid            objectid;        /* object OID */
+}            PgStatHashEntryKey;
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * Stats numbers that are waiting to be flushed out to shared stats are held in
+ * pgStatLocalHash.
  */
-static HTAB *pgStatFunctions = NULL;
+typedef struct PgStatHashEntry
+{
+    PgStatHashEntryKey key;        /* hash key */
+    dsa_pointer env;            /* pointer to shared stats envelope in DSA */
+}            PgStatHashEntry;
+
+/* struct for shared statistics entry pointed from shared hash entry. */
+typedef struct PgStatEnvelope
+{
+    PgStatTypes type;            /* statistics entry type */
+    Oid            databaseid;        /* databaseid */
+    Oid            objectid;        /* objectid */
+    size_t        len;            /* length of body, fixed per type. */
+    LWLock        lock;            /* lightweight lock to protect body */
+    int            body[FLEXIBLE_ARRAY_MEMBER];    /* statistics body */
+}            PgStatEnvelope;
+
+#define PgStatEnvelopeSize(bodylen) \
+    (offsetof(PgStatEnvelope, body) + (bodylen))
+
+/* struct for shared SLRU stats */
+typedef struct PgStatSharedSLRUStats
+{
+    PgStat_StatSLRUEntry    entry[SLRU_NUM_ELEMENTS];
+    LWLock                    lock;
+} PgStatSharedSLRUStats;
+
+/* struct for shared statistics local hash entry. */
+typedef struct PgStatLocalHashEntry
+{
+    PgStatHashEntryKey key;        /* hash key */
+    PgStatEnvelope *env;        /* pointer to stats envelope in heap */
+}            PgStatLocalHashEntry;
 
 /*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
+ * A snapshot is a stats entry that is copied locally to offer stable values
+ * for a transaction.
  */
-static bool have_function_stats = false;
+typedef struct PgStatSnapshot
+{
+    PgStatHashEntryKey key;
+    bool        negative;
+    int            body[FLEXIBLE_ARRAY_MEMBER];    /* statistics body */
+}            PgStatSnapshot;
+
+#define PgStatSnapshotSize(bodylen)                \
+    (offsetof(PgStatSnapshot, body) + (bodylen))
+
+/* parameter for shared hashes */
+static const dshash_parameters dsh_rootparams = {
+    sizeof(PgStatHashEntryKey),
+    sizeof(PgStatHashEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+
+/* The shared hash to index activity stats entries. */
+static dshash_table *pgStatSharedHash = NULL;
+
+/* Local stats numbers are stored here */
+static HTAB *pgStatLocalHash = NULL;
+
+#define HAVE_ANY_PENDING_STATS()                        \
+    (pgStatLocalHash != NULL ||                            \
+     pgStatXactCommit != 0 || pgStatXactRollback != 0)
 
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
@@ -262,11 +327,10 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
+static MemoryContext pgStatSnapshotContext = NULL;
+static HTAB *pgStatSnapshotHash = NULL;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -275,20 +339,19 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Cluster wide statistics.
+ *
+ * Contains statistics that are collected neither per database nor per
+ * table.  The shared_* variables point to shared memory and the snapshot_*
+ * variables are backend snapshots.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
+static bool global_snapshot_is_valid = false;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats snapshot_globalStats;
+static PgStatSharedSLRUStats *shared_SLRUStats;
+static PgStat_StatSLRUEntry snapshot_SLRUStats[SLRU_NUM_ELEMENTS];
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -297,526 +360,278 @@ static List *pending_write_requests = NIL;
  */
 static instr_time total_func_time;
 
+/*
+ * A newly created shared stats entry needs to be initialized before other
+ * processes can access it. get_stat_entry() calls this function to do so.
+ */
+typedef void (*entry_initializer) (PgStatEnvelope * env);
 
 /* ----------
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgstat_beshutdown_hook(int code, Datum arg);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                                                 Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStatEnvelope * get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                                       bool nowait,
+                                       entry_initializer initfunc, bool *found);
+static PgStatEnvelope * *collect_stat_entries(PgStatTypes type, Oid dbid);
+static void create_missing_dbentries(void);
+static void pgstat_write_database_stats(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_db_statsfile(PgStat_StatDBEntry *dbentry);
+
+static bool flush_tabstat(PgStatEnvelope * env, bool nowait);
+static bool flush_funcstat(PgStatEnvelope * env, bool nowait);
+static bool flush_dbstat(PgStatEnvelope * env, bool nowait);
+static bool flush_slrustat(bool nowait);
+
+static void init_tabentry(PgStatEnvelope * env);
+static void init_funcentry(PgStatEnvelope * env);
+static void init_dbentry(PgStatEnvelope * env);
+
+static bool delete_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                              bool nowait);
+static PgStatEnvelope * get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                                             bool create, bool *found);
+static PgStat_StatDBEntry *get_local_dbstat_entry(Oid dbid);
+static PgStat_TableStatus *get_local_tabstat_entry(Oid rel_id, bool isshared);
+
+static PgStat_SubXactStatus *get_tabstat_stack_level(int nest_level);
+static void add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level);
+static void pgstat_snapshot_global_stats(void);
+
 static void pgstat_read_current_status(void);
-
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
-
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
-static void pgstat_send_slru(void);
-static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
-
-static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
-
-static void pgstat_setup_memcxt(void);
-
 static const char *pgstat_get_wait_activity(WaitEventActivity w);
 static const char *pgstat_get_wait_client(WaitEventClient w);
 static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute shared memory space needed for activity statistics
+ */
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+/*
+ * StatsShmemInit - initialize during shared-memory creation
  */
 void
-pgstat_init(void)
+StatsShmemInit(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
+    bool        found;
 
-#define TESTBYTEVAL ((char) 199)
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
 
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
-
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
-    {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
-    }
-
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
+    if (!IsUnderPostmaster)
     {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
-
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
-
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
+        Assert(!found);
 
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+    }
+}
 
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ *    Create pgStatLocalContext and pgStatSnapshotContext, if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+    if (!pgStatLocalContext)
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+
+    if (!pgStatSnapshotContext)
+        pgStatSnapshotContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Database statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+}
 
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
 
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+/* ----------
+ * attach_shared_stats() -
+ *
+ *    Attach to or create the shared stats memory. If we are the first process
+ *    to use the activity stats system, read saved statistics files, if any.
+ * ----------
+ */
+static void
+attach_shared_stats(void)
+{
+    MemoryContext oldcontext;
 
-        test_byte++;            /* just make sure variable is changed */
+    /*
+     * Do nothing when not tracking counts or when not running under postmaster.
+     */
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
 
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    pgstat_setup_memcxt();
 
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    if (area)
+        return;
 
-        /* If we get here, we have a working socket */
-        break;
-    }
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
 
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
     /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
+     * The last process is responsible for writing out stats files at exit.
+     * Maintain a refcount so that an exiting process can tell whether it is
+     * the last one.
      */
-    if (!pg_set_noblock(pgStatSock))
+    if (StatsShmem->refcount > 0)
+        StatsShmem->refcount++;
+    else
     {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
+        /* We're the first process to attach to the shared stats memory */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
+
+        /* Initialize shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatSharedHash = dshash_create(area, &dsh_rootparams, 0);
+
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->global_stats =
+            dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+        StatsShmem->archiver_stats =
+            dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+        StatsShmem->slru_stats =
+            dsa_allocate0(area, sizeof(PgStatSharedSLRUStats));
+        StatsShmem->hash_handle = dshash_get_hash_table_handle(pgStatSharedHash);
+
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
+
+        shared_SLRUStats = (PgStatSharedSLRUStats *)
+            dsa_get_address(area, StatsShmem->slru_stats);
+        LWLockInitialize(&shared_SLRUStats->lock, LWTRANCHE_STATS);
+
+        /* Load saved data if any. */
+        pgstat_read_statsfiles();
+
+        StatsShmem->refcount = 1;
     }
 
+    LWLockRelease(StatsLock);
+
     /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
+     * If we're not the first process, attach to the existing shared stats
+     * area outside the StatsLock section.
      */
+    if (!area)
     {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
+        /* Attach shared area. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatSharedHash = dshash_attach(area, &dsh_rootparams,
+                                         StatsShmem->hash_handle, 0);
+
+        /* Setup local variables */
+        pgStatLocalHash = NULL;
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
+        shared_SLRUStats = (PgStatSharedSLRUStats *)
+            dsa_get_address(area, StatsShmem->slru_stats);
     }
 
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    /* Now that we have a long-lived socket, tell fd.c about it. */
-    ReserveExternalFD();
+    MemoryContextSwitchTo(oldcontext);
 
-    return;
-
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
-
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
-
-    /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
-     */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    /* don't detach automatically */
+    dsa_pin_mapping(area);
+    global_snapshot_is_valid = false;
 }
 
-/*
- * subroutine for pgstat_reset_all
+/* ----------
+ * detach_shared_stats() -
+ *
+ *    Detach from shared stats. Write them out to file if we're the last
+ *    process and were told to do so.
+ * ----------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+detach_shared_stats(bool write_stats)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    /* immediately return if useless */
+    if (!area || !IsUnderPostmaster)
+        return;
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
-    {
-        int            nchars;
-        Oid            tmp_oid;
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
+    if (--StatsShmem->refcount < 1)
+    {
         /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
+         * This process is the last one attached to the shared stats memory.
+         * Write out the stats files if requested.
          */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
-
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
+        if (write_stats)
+            pgstat_write_statsfiles();
 
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+        /* No one is using the area. */
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
     }
-    FreeDir(dir);
+
+    LWLockRelease(StatsLock);
+
+    /*
+     * Detach the area. It is automatically destroyed when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);
+
+    area = NULL;
+    pgStatSharedHash = NULL;
+    shared_globalStats = NULL;
+    shared_archiverStats = NULL;
+    pgStatLocalHash = NULL;
+    global_snapshot_is_valid = false;
 }
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
-}
+    /* standalone server doesn't use shared stats */
+    if (!IsUnderPostmaster)
+        return;
 
-#ifdef EXEC_BACKEND
+    /* we must have shared stats attached */
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
-    char       *av[10];
-    int            ac = 0;
-
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
-
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
-
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
-
-
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
-
-    /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
-     */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
-
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
+    /* Startup must be the only user of shared stats */
+    Assert(StatsShmem->refcount == 1);
 
     /*
-     * Okay, fork off the collector.
+     * We could directly remove files and recreate the shared memory area, but
+     * for simplicity we just discard the area and then create it again.
      */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
+    detach_shared_stats(false); /* Don't write files. */
+    attach_shared_stats();
 }
 
 /* ------------------------------------------------------------
@@ -824,147 +639,491 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *    Updates are applied no more frequently than once every
+ *    PGSTAT_MIN_INTERVAL milliseconds. They are also postponed on lock
+ *    failure if force is false and no update has been pending for longer
+ *    than PGSTAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
+ *    subsequent calls of this function.
+ *
+ *    Returns the time in milliseconds until the next point at which updates
+ *    will be applied, if no updates have been held for more than
+ *    PGSTAT_MIN_INTERVAL milliseconds.
+ *
+ *    Note that this is called only outside a transaction, so it is fine to use
  *    transaction stop time as an approximation of current time.
- * ----------
+ *    ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz next_flush = 0;
+    static TimestampTz pending_since = 0;
+    static long retry_interval = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
+    bool        nowait = !force;    /* Don't use force ever after */
+    HASH_SEQ_STATUS scan;
+    PgStatLocalHashEntry *lent;
+    PgStatLocalHashEntry **dbentlist;
+    int            dbentlistlen = 8;
+    int            ndbentries = 0;
+    int            remains = 0;
     int            i;
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (area == NULL || !HAVE_ANY_PENDING_STATS())
+        return 0;
+
+    dbentlist = palloc(sizeof(PgStatLocalHashEntry *) * dbentlistlen);
 
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
     now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
 
-    /*
-     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
-     */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
-    pgStatTabHash = NULL;
+    if (nowait)
+    {
+        /*
+         * Don't flush stats too frequently.  Return the time to the next
+         * flush.
+         */
+        if (now < next_flush)
+        {
+            /* Record the epoch time if retrying. */
+            if (pending_since == 0)
+                pending_since = now;
+
+            return (next_flush - now) / 1000;
+        }
+
+        /* But, don't keep pending updates longer than PGSTAT_MAX_INTERVAL. */
+
+        if (pending_since > 0 &&
+            TimestampDifferenceExceeds(pending_since, now, PGSTAT_MAX_INTERVAL))
+            nowait = false;
+    }
 
     /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
+     * flush_tabstat applies some of the counters of flushed entries to the
+     * local database stats, so flush out database stats afterwards.
      */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
-    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+    if (pgStatLocalHash)
     {
-        for (i = 0; i < tsa->tsa_used; i++)
+        /* Step 1: flush out stats other than database stats */
+        hash_seq_init(&scan, pgStatLocalHash);
+        while ((lent = (PgStatLocalHashEntry *) hash_seq_search(&scan)) != NULL)
         {
-            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
+            bool        remove = false;
 
-            /* Shouldn't have any pending transaction-dependent counts */
-            Assert(entry->trans == NULL);
+            switch (lent->env->type)
+            {
+                case PGSTAT_TYPE_DB:
+                    if (ndbentries >= dbentlistlen)
+                    {
+                        dbentlistlen *= 2;
+                        dbentlist = repalloc(dbentlist,
+                                             sizeof(PgStatLocalHashEntry *) *
+                                             dbentlistlen);
+                    }
+                    dbentlist[ndbentries++] = lent;
+                    break;
+                case PGSTAT_TYPE_TABLE:
+                    if (flush_tabstat(lent->env, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_FUNCTION:
+                    if (flush_funcstat(lent->env, nowait))
+                        remove = true;
+                    break;
+                default:
+                    Assert(false);
+            }
 
-            /*
-             * Ignore entries that didn't accumulate any actual counts, such
-             * as indexes that were opened by the planner but not used.
-             */
-            if (memcmp(&entry->t_counts, &all_zeroes,
-                       sizeof(PgStat_TableCounts)) == 0)
+            if (!remove)
+            {
+                remains++;
                 continue;
+            }
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* Remove the successfully flushed entry */
+            pfree(lent->env);
+            hash_search(pgStatLocalHash, &lent->key, HASH_REMOVE, NULL);
+        }
+
+        /* Step 2: flush out database stats */
+        for (i = 0; i < ndbentries; i++)
+        {
+            PgStatLocalHashEntry *lent = dbentlist[i];
+
+            if (flush_dbstat(lent->env, nowait))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                remains--;
+                /* Remove the successfully flushed entry */
+                pfree(lent->env);
+                hash_search(pgStatLocalHash, &lent->key, HASH_REMOVE, NULL);
             }
         }
-        /* zero out PgStat_TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
+        pfree(dbentlist);
+
+        if (remains <= 0)
+        {
+            hash_destroy(pgStatLocalHash);
+            pgStatLocalHash = NULL;
+        }
     }
 
+    /* flush SLRU stats */
+    flush_slrustat(nowait);
+
+    /* Publish the last flush time */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (shared_globalStats->stats_timestamp < now)
+        shared_globalStats->stats_timestamp = now;
+    LWLockRelease(StatsLock);
+
     /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
+     * If we have pending local stats, let the caller know the retry interval.
      */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    if (HAVE_ANY_PENDING_STATS())
+    {
+        /* Remember when the stats first became pending */
+        if (pending_since == 0)
+            pending_since = now;
+
+        /* The interval is doubled at every retry. */
+        if (retry_interval == 0)
+            retry_interval = PGSTAT_RETRY_MIN_INTERVAL * 1000;
+        else
+            retry_interval = retry_interval * 2;
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+        /*
+         * Determine the next retry interval so as not to get shorter than the
+         * previous interval.
+         */
+        if (!TimestampDifferenceExceeds(pending_since,
+                                        now + 2 * retry_interval,
+                                        PGSTAT_MAX_INTERVAL))
+            next_flush = now + retry_interval;
+        else
+        {
+            next_flush = pending_since + PGSTAT_MAX_INTERVAL * 1000;
+            retry_interval = next_flush - now;
+        }
 
-    /* Finally send SLRU statistics */
-    pgstat_send_slru();
+        return retry_interval / 1000;
+    }
+
+    /* Set the next time to update stats */
+    next_flush = now + PGSTAT_MIN_INTERVAL * 1000;
+    retry_interval = 0;
+    pending_since = 0;
+
+    return 0;
 }
 
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * flush_tabstat - flush out a local table stats entry
+ *
+ * Some of the counters are also added to the local database stats entry
+ * after a successful flush.
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_tabstat(PgStatEnvelope * lenv, bool nowait)
+{
+    static const PgStat_TableCounts all_zeroes;
+    Oid            dboid;            /* database OID of the table */
+    PgStat_TableStatus *lstats; /* local stats entry  */
+    PgStatEnvelope *shenv;        /* shared stats envelope */
+    PgStat_StatTabEntry *shtabstats;    /* table entry of shared stats */
+    PgStat_StatDBEntry *ldbstats;    /* local database entry */
+    bool        found;
+
+    Assert(lenv->type == PGSTAT_TYPE_TABLE);
+
+    lstats = (PgStat_TableStatus *) &lenv->body;
+    dboid = lstats->t_shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Ignore entries that didn't accumulate any actual counts, such as
+     * indexes that were opened by the planner but not used.
+     */
+    if (memcmp(&lstats->t_counts, &all_zeroes,
+               sizeof(PgStat_TableCounts)) == 0)
+        return true;
+
+    /* find shared table stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_TABLE, dboid, lstats->t_id,
+                           nowait, init_tabentry, &found);
+
+    /* skip if dshash failed to acquire lock */
+    if (shenv == NULL)
+        return false;
+
+    /* retrieve the shared table stats entry from the envelope */
+    shtabstats = (PgStat_StatTabEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;
+
+    /* add the values to the shared entry. */
+    shtabstats->numscans += lstats->t_counts.t_numscans;
+    shtabstats->tuples_returned += lstats->t_counts.t_tuples_returned;
+    shtabstats->tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    shtabstats->tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    shtabstats->tuples_updated += lstats->t_counts.t_tuples_updated;
+    shtabstats->tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    shtabstats->tuples_hot_updated += lstats->t_counts.t_tuples_hot_updated;
+
+    /*
+     * If the table was truncated or vacuum/analyze has run, first reset the
+     * live/dead counters.
+     */
+    if (lstats->t_counts.t_truncated ||
+        lstats->t_counts.vacuum_count > 0 ||
+        lstats->t_counts.analyze_count > 0 ||
+        lstats->t_counts.autovac_vacuum_count > 0 ||
+        lstats->t_counts.autovac_analyze_count > 0)
+    {
+        shtabstats->n_live_tuples = 0;
+        shtabstats->n_dead_tuples = 0;
+    }
+
+    /* clear the change counter if requested */
+    if (lstats->t_counts.reset_changed_tuples)
+        shtabstats->changes_since_analyze = 0;
+
+    shtabstats->n_live_tuples += lstats->t_counts.t_delta_live_tuples;
+    shtabstats->n_dead_tuples += lstats->t_counts.t_delta_dead_tuples;
+    shtabstats->changes_since_analyze += lstats->t_counts.t_changed_tuples;
+    shtabstats->blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    shtabstats->blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /*
+     * Update the vacuum/analyze timestamps and counters, making sure the
+     * timestamps never go backward.
+     */
+    if (shtabstats->vacuum_timestamp < lstats->vacuum_timestamp)
+        shtabstats->vacuum_timestamp = lstats->vacuum_timestamp;
+    shtabstats->vacuum_count += lstats->t_counts.vacuum_count;
+
+    if (shtabstats->autovac_vacuum_timestamp < lstats->autovac_vacuum_timestamp)
+        shtabstats->autovac_vacuum_timestamp = lstats->autovac_vacuum_timestamp;
+    shtabstats->autovac_vacuum_count += lstats->t_counts.autovac_vacuum_count;
+
+    if (shtabstats->analyze_timestamp < lstats->analyze_timestamp)
+        shtabstats->analyze_timestamp = lstats->analyze_timestamp;
+    shtabstats->analyze_count += lstats->t_counts.analyze_count;
+
+    if (shtabstats->autovac_analyze_timestamp < lstats->autovac_analyze_timestamp)
+        shtabstats->autovac_analyze_timestamp = lstats->autovac_analyze_timestamp;
+    shtabstats->autovac_analyze_count += lstats->t_counts.autovac_analyze_count;
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    shtabstats->n_live_tuples = Max(shtabstats->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    shtabstats->n_dead_tuples = Max(shtabstats->n_dead_tuples, 0);
+
+    LWLockRelease(&shenv->lock);
+
+    /* Flushed successfully; add the same counts to the local database stats */
+    ldbstats = get_local_dbstat_entry(dboid);
+    ldbstats->counts.n_tuples_returned += lstats->t_counts.t_tuples_returned;
+    ldbstats->counts.n_tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    ldbstats->counts.n_tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    ldbstats->counts.n_tuples_updated += lstats->t_counts.t_tuples_updated;
+    ldbstats->counts.n_tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    ldbstats->counts.n_blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    ldbstats->counts.n_blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    return true;
+}
+
+/* ----------
+ * init_tabentry() -
+ *
+ * initializes table stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
  */
 static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+init_tabentry(PgStatEnvelope * env)
 {
-    int            n;
-    int            len;
+    PgStat_StatTabEntry *tabent = (PgStat_StatTabEntry *) &env->body;
+
+    /*
+     * Initialize all counters and timestamps of the table entry to zero; no
+     * values are carried over from the caller.
+     */
+    Assert(env->type == PGSTAT_TYPE_TABLE);
+    tabent->tableid = env->objectid;
+    tabent->numscans = 0;
+    tabent->tuples_returned = 0;
+    tabent->tuples_fetched = 0;
+    tabent->tuples_inserted = 0;
+    tabent->tuples_updated = 0;
+    tabent->tuples_deleted = 0;
+    tabent->tuples_hot_updated = 0;
+    tabent->n_live_tuples = 0;
+    tabent->n_dead_tuples = 0;
+    tabent->changes_since_analyze = 0;
+    tabent->blocks_fetched = 0;
+    tabent->blocks_hit = 0;
+
+    tabent->vacuum_timestamp = 0;
+    tabent->vacuum_count = 0;
+    tabent->autovac_vacuum_timestamp = 0;
+    tabent->autovac_vacuum_count = 0;
+    tabent->analyze_timestamp = 0;
+    tabent->analyze_count = 0;
+    tabent->autovac_analyze_timestamp = 0;
+    tabent->autovac_analyze_count = 0;
+}
+
+
+/*
+ * flush_funcstat - flush out a local function stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_funcstat(PgStatEnvelope * env, bool nowait)
+{
+    /* we assume this inits to all zeroes: */
+    static const PgStat_FunctionCounts all_zeroes;
+    PgStat_BackendFunctionEntry *localent;    /* local stats entry */
+    PgStatEnvelope *shenv;        /* shared stats envelope */
+    PgStat_StatFuncEntry *sharedent = NULL; /* shared stats entry */
+    bool        found;
+
+    Assert(env->type == PGSTAT_TYPE_FUNCTION);
+    localent = (PgStat_BackendFunctionEntry *) &env->body;
+
+    /* Skip it if no counts accumulated for it so far */
+    if (memcmp(&localent->f_counts, &all_zeroes,
+               sizeof(PgStat_FunctionCounts)) == 0)
+        return true;
+
+    /* find shared function stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId, localent->f_id,
+                           nowait, init_funcentry, &found);
+    /* skip if dshash failed to acquire lock */
+    if (shenv == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    /* retrieve the shared function stats entry from the envelope */
+    sharedent = (PgStat_StatFuncEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
+
+    sharedent->f_numcalls += localent->f_counts.f_numcalls;
+    sharedent->f_total_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_total_time);
+    sharedent->f_self_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_self_time);
+
+    LWLockRelease(&shenv->lock);
+
+    return true;
+}
+
+
+/* ----------
+ * init_funcentry() -
+ *
+ * initializes function stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
+ */
+static void
+init_funcentry(PgStatEnvelope * env)
+{
+    PgStat_StatFuncEntry *shstat = (PgStat_StatFuncEntry *) &env->body;
+
+    Assert(env->type == PGSTAT_TYPE_FUNCTION);
+    shstat->functionid = env->objectid;
+    shstat->f_numcalls = 0;
+    shstat->f_total_time = 0;
+    shstat->f_self_time = 0;
+}
+
+
+/*
+ * flush_dbstat - flush out a local database stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_dbstat(PgStatEnvelope * env, bool nowait)
+{
+    PgStat_StatDBEntry *localent;
+    PgStatEnvelope *shenv;
+    PgStat_StatDBEntry *sharedent;
+
+    Assert(env->type == PGSTAT_TYPE_DB);
+
+    localent = (PgStat_StatDBEntry *) &env->body;
+
+    /* find shared database stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_DB, localent->databaseid, InvalidOid,
+                           nowait, init_dbentry, NULL);
+
+    /* skip if dshash failed to acquire lock */
+    if (!shenv)
+        return false;
+
+    /* retrieve the shared stats entry from the envelope */
+    sharedent = (PgStat_StatDBEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;
+
+    sharedent->counts.n_tuples_returned += localent->counts.n_tuples_returned;
+    sharedent->counts.n_tuples_fetched += localent->counts.n_tuples_fetched;
+    sharedent->counts.n_tuples_inserted += localent->counts.n_tuples_inserted;
+    sharedent->counts.n_tuples_updated += localent->counts.n_tuples_updated;
+    sharedent->counts.n_tuples_deleted += localent->counts.n_tuples_deleted;
+    sharedent->counts.n_blocks_fetched += localent->counts.n_blocks_fetched;
+    sharedent->counts.n_blocks_hit += localent->counts.n_blocks_hit;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    sharedent->counts.n_deadlocks += localent->counts.n_deadlocks;
+    sharedent->counts.n_temp_bytes += localent->counts.n_temp_bytes;
+    sharedent->counts.n_temp_files += localent->counts.n_temp_files;
+    sharedent->counts.n_checksum_failures += localent->counts.n_checksum_failures;
 
     /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
+     * Accumulate xact commit/rollback and I/O timings to stats entry of the
+     * current database.
      */
-    if (OidIsValid(tsmsg->m_databaseid))
+    if (OidIsValid(localent->databaseid))
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
+        sharedent->counts.n_xact_commit += pgStatXactCommit;
+        sharedent->counts.n_xact_rollback += pgStatXactRollback;
+        sharedent->counts.n_block_read_time += pgStatBlockReadTime;
+        sharedent->counts.n_block_write_time += pgStatBlockWriteTime;
         pgStatXactCommit = 0;
         pgStatXactRollback = 0;
         pgStatBlockReadTime = 0;
@@ -972,257 +1131,149 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        sharedent->counts.n_xact_commit = 0;
+        sharedent->counts.n_xact_rollback = 0;
+        sharedent->counts.n_block_read_time = 0;
+        sharedent->counts.n_block_write_time = 0;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
+    LWLockRelease(&shenv->lock);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    return true;
 }
 
-/*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
+
+/* ----------
+ * init_dbentry() -
+ *
+ * initializes database stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
  */
 static void
-pgstat_send_funcstats(void)
+init_dbentry(PgStatEnvelope * env)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_FunctionCounts all_zeroes;
+    PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) &env->body;
+
+    Assert(env->type == PGSTAT_TYPE_DB);
+    dbentry->databaseid = env->databaseid;
+    dbentry->last_autovac_time = 0;
+    dbentry->last_checksum_failure = 0;
+    dbentry->stat_reset_timestamp = 0;
+    dbentry->stats_timestamp = 0;
+    /* initialize the new shared entry */
+    MemSet(&dbentry->counts, 0, sizeof(PgStat_StatDBCounts));
+}
+
 
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
+/*
+ * flush_slrustat - flush out local SLRU stats entries
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if all local SLRU stats are successfully flushed out.
+ */
+static bool
+flush_slrustat(bool nowait)
+{
+    int i;
 
-    if (pgStatFunctions == NULL)
-        return;
+    if (!have_slrustats)
+        return true;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shared_SLRUStats->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shared_SLRUStats->lock, LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
 
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    for (i = 0 ; i < SLRU_NUM_ELEMENTS ; i++)
     {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
+        PgStat_StatSLRUEntry *sharedent = &shared_SLRUStats->entry[i];
+        PgStat_StatSLRUEntry *localent = &local_SLRUStats[i];
+
+        sharedent->blocks_zeroed += localent->blocks_zeroed;
+        sharedent->blocks_hit += localent->blocks_hit;
+        sharedent->blocks_read += localent->blocks_read;
+        sharedent->blocks_written += localent->blocks_written;
+        sharedent->blocks_exists += localent->blocks_exists;
+        sharedent->flush += localent->flush;
+        sharedent->truncate += localent->truncate;
+    }
 
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
+    /* done, clear the local entry */
+    MemSet(local_SLRUStats, 0,
+           sizeof(PgStat_StatSLRUEntry) * SLRU_NUM_ELEMENTS);
+    LWLockRelease(&shared_SLRUStats->lock);
 
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
+    have_slrustats = false;
 
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
+    return true;
+}
 
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
 
-    have_function_stats = false;
+/*
+ * Create the filename for a DB stat file; filename is an output parameter
+ * that points to a character buffer of length len.
+ */
+static void
+get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
+{
+    int            printed;
+
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/db_%u.%s",
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
+                       databaseid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
 }
 
 
 /* ----------
- * pgstat_vacuum_stat() -
+ * collect_stat_entries() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *  Collect the shared statistics entries specified by type and dbid. Returns
+ *  a NULL-terminated list of pointers to shared statistics in palloc'ed
+ *  memory. If type is PGSTAT_TYPE_ALL, all types of statistics of the
+ *  database are collected. If type is PGSTAT_TYPE_DB, dbid is ignored and
+ *  all PGSTAT_TYPE_DB entries are collected.
  * ----------
  */
-void
-pgstat_vacuum_stat(void)
+static PgStatEnvelope **collect_stat_entries(PgStatTypes type, Oid dbid)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Read pg_database and make a list of OIDs of all existing databases
-     */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
-
-    /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            dbid = dbentry->databaseid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        /* the DB entry for shared tables (with InvalidOid) is never dropped */
-        if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
-            pgstat_drop_database(dbid);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Lookup our own database entry; if not found, nothing more to do.
-     */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
-        return;
-
-    /*
-     * Similarly to above, make a list of all known relations in this DB.
-     */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
-
-    /*
-     * Check for all tables listed in stats hashtable if they still exist.
-     */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
+    int            listlen = 16;
+    PgStatEnvelope **envlist = palloc(sizeof(PgStatEnvelope *) * listlen);
+    int            n = 0;
+
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
     {
-        Oid            tabid = tabentry->tableid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+        if ((type != PGSTAT_TYPE_ALL && p->key.type != type) ||
+            (type != PGSTAT_TYPE_DB && p->key.databaseid != dbid))
             continue;
 
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
+        if (n >= listlen - 1)
         {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
+            listlen *= 2;
+            envlist = repalloc(envlist, listlen * sizeof(PgStatEnvelope *));
         }
+        envlist[n++] = dsa_get_address(area, p->env);
     }
+    dshash_seq_term(&hstat);
 
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
+    envlist[n] = NULL;
 
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
-        {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
-                continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
-        }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
-    }
+    return envlist;
 }
 
 
 /* ----------
- * pgstat_collect_oids() -
+ * collect_oids() -
  *
  *    Collect the OIDs of all objects listed in the specified system catalog
  *    into a temporary hash table.  Caller should hash_destroy the result
@@ -1231,7 +1282,7 @@ pgstat_vacuum_stat(void)
  * ----------
  */
 static HTAB *
-pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
+collect_oids(Oid catalogid, AttrNumber anum_oid)
 {
     HTAB       *htab;
     HASHCTL        hash_ctl;
@@ -1245,7 +1296,7 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     hash_ctl.entrysize = sizeof(Oid);
     hash_ctl.hcxt = CurrentMemoryContext;
     htab = hash_create("Temporary table of OIDs",
-                       PGSTAT_TAB_HASH_SIZE,
+                       PGSTAT_TABLE_HASH_SIZE,
                        &hash_ctl,
                        HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
 
@@ -1272,65 +1323,185 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
 }
 
 
+/* ----------
+ * pgstat_vacuum_stat() -
+ *
+ *  Delete shared stat entries that are not in system catalogs.
+ *
+ *  To avoid holding exclusive lock on dshash for a long time, the process is
+ *  performed in three steps.
+ *
+ *   1: Collect the OIDs of all existing objects of every kind.
+ *   2: Collect victim entries by scanning with a shared lock.
+ *   3: Try removing every victim entry without waiting for its lock.
+ *
+ *  As a consequence of the last step, some entries may be left behind due to
+ *  lock failure, but they will be removed by later invocations of
+ *  pgstat_vacuum_stat().
+ * ----------
+ */
+void
+pgstat_vacuum_stat(void)
+{
+    HTAB       *dbids;            /* database ids */
+    HTAB       *relids;            /* relation ids in the current database */
+    HTAB       *funcids;        /* function ids in the current database */
+    PgStatEnvelope **victims;    /* victim entry list */
+    int            arraylen = 0;    /* storage size of the above */
+    int            nvictims = 0;    /* # of entries of the above */
+    dshash_seq_status dshstat;
+    PgStatHashEntry *ent;
+    int            i;
+
+    /* we don't collect stats under standalone mode */
+    if (!IsUnderPostmaster)
+        return;
+
+    /* collect OIDs of existing objects */
+    dbids = collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    relids = collect_oids(RelationRelationId, Anum_pg_class_oid);
+    funcids = collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
+
+    /* collect victims from shared stats */
+    arraylen = 16;
+    victims = palloc(sizeof(PgStatEnvelope *) * arraylen);
+    nvictims = 0;
+
+    dshash_seq_init(&dshstat, pgStatSharedHash, false);
+
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
+    {
+        HTAB       *oidtab;
+        Oid           *key;
+
+        CHECK_FOR_INTERRUPTS();
+
+        /*
+         * Except for database entries, don't drop entries that belong to
+         * databases other than the current one.
+         */
+        if (ent->key.type != PGSTAT_TYPE_DB &&
+            ent->key.databaseid != MyDatabaseId)
+            continue;
+
+        switch (ent->key.type)
+        {
+            case PGSTAT_TYPE_DB:
+                /* don't remove database entry for shared tables */
+                if (ent->key.databaseid == 0)
+                    continue;
+                oidtab = dbids;
+                key = &ent->key.databaseid;
+                break;
+
+            case PGSTAT_TYPE_TABLE:
+                oidtab = relids;
+                key = &ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                oidtab = funcids;
+                key = &ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_ALL:
+                Assert(false);
+                break;
+        }
+
+        /* Skip objects that still exist. */
+        if (hash_search(oidtab, key, HASH_FIND, NULL) != NULL)
+            continue;
+
+        /* extend the list if needed */
+        if (nvictims >= arraylen)
+        {
+            arraylen *= 2;
+            victims = repalloc(victims, sizeof(PgStatEnvelope *) * arraylen);
+        }
+
+        victims[nvictims++] = dsa_get_address(area, ent->env);
+    }
+    dshash_seq_term(&dshstat);
+    hash_destroy(dbids);
+    hash_destroy(relids);
+    hash_destroy(funcids);
+
+    /* Now try removing the victim entries */
+    for (i = 0; i < nvictims; i++)
+    {
+        PgStatEnvelope *p = victims[i];
+
+        delete_stat_entry(p->type, p->databaseid, p->objectid, true);
+    }
+}
+
+
+/* ----------
+ * delete_stat_entry() -
+ *
+ *  Deletes the specified entry from shared stats hash.
+ *
+ *  Returns true when successfully deleted.
+ * ----------
+ */
+static bool
+delete_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait)
+{
+    PgStatHashEntryKey key;
+    PgStatHashEntry *ent;
+
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    ent = dshash_find_extended(pgStatSharedHash, &key,
+                               true, nowait, false, NULL);
+
+    if (!ent)
+        return false;            /* lock failed or not found */
+
+    /* The entry is exclusively locked, so we can free the chunk first. */
+    dsa_free(area, ent->env);
+    dshash_delete_entry(pgStatSharedHash, ent);
+
+    return true;
+}
+
+
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
- * ----------
+ *    Remove the entries for the database that we just dropped.
+ *
+ *    Some entries may survive because of a lock failure, or because stats
+ *    are flushed after this call, but we will still clean up the dead DB
+ *    eventually via future invocations of pgstat_vacuum_stat().
+ * ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
+    Assert(OidIsValid(databaseid));
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
+    envlist = collect_stat_entries(PGSTAT_TYPE_ALL, databaseid);
 
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
+    for (p = envlist; *p != NULL; p++)
+        delete_stat_entry((*p)->type, (*p)->databaseid, (*p)->objectid, true);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
+    pfree(envlist);
 }
-#endif                            /* NOT_USED */
 
 
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1339,20 +1510,47 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    /* Look up the entries of the current database in the stats hash. */
+    envlist = collect_stat_entries(PGSTAT_TYPE_ALL, MyDatabaseId);
+    for (p = envlist; *p != NULL; p++)
+    {
+        PgStatEnvelope *env = *p;
+        PgStat_StatDBEntry *dbstat;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+        LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+
+        switch (env->type)
+        {
+            case PGSTAT_TYPE_TABLE:
+                init_tabentry(env);
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                init_funcentry(env);
+                break;
+
+            case PGSTAT_TYPE_DB:
+                init_dbentry(env);
+                dbstat = (PgStat_StatDBEntry *) &env->body;
+                dbstat->stat_reset_timestamp = GetCurrentTimestamp();
+                break;
+            default:
+                Assert(false);
+        }
+
+        LWLockRelease(&env->lock);
+    }
+
+    pfree(envlist);
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1361,29 +1559,37 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
+    /* Reset the archiver statistics for the cluster. */
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    /* Reset the bgwriter statistics for the cluster. */
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1392,17 +1598,42 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStatEnvelope *env;
+    PgStat_StatDBEntry *dbentry;
+    PgStatTypes stattype;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    env = get_stat_entry(PGSTAT_TYPE_DB, MyDatabaseId, InvalidOid,
+                         false, NULL, NULL);
+    Assert(env);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* Set the reset timestamp for the whole database */
+    dbentry = (PgStat_StatDBEntry *) &env->body;
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&env->lock);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Remove object if it exists, ignore if not */
+    switch (type)
+    {
+        case RESET_TABLE:
+            stattype = PGSTAT_TYPE_TABLE;
+            break;
+        case RESET_FUNCTION:
+            stattype = PGSTAT_TYPE_FUNCTION;
+    }
+
+    env = get_stat_entry(stattype, MyDatabaseId, objoid, false, NULL, NULL);
+    LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+    if (env->type == PGSTAT_TYPE_TABLE)
+        init_tabentry(env);
+    else
+    {
+        Assert(env->type == PGSTAT_TYPE_FUNCTION);
+        init_funcentry(env);
+    }
+    LWLockRelease(&env->lock);
 }
 
 /* ----------
@@ -1418,15 +1649,27 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_reset_slru_counter(const char *name)
 {
-    PgStat_MsgResetslrucounter msg;
+    int i;
+    TimestampTz    ts = GetCurrentTimestamp();
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSLRUCOUNTER);
-    msg.m_index = (name) ? pgstat_slru_index(name) : -1;
-
-    pgstat_send(&msg, sizeof(msg));
+    if (name)
+    {
+        i = pgstat_slru_index(name);
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(&shared_SLRUStats->entry[i], 0, sizeof(PgStat_StatSLRUEntry));
+        shared_SLRUStats->entry[i].stat_reset_timestamp = ts;
+        LWLockRelease(StatsLock);
+    }
+    else
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
+        {
+            MemSet(&shared_SLRUStats->entry[i], 0, sizeof(PgStat_StatSLRUEntry));
+            shared_SLRUStats->entry[i].stat_reset_timestamp = ts;
+        }
+        LWLockRelease(StatsLock);
+    }
 }
 
 /* ----------
@@ -1440,48 +1683,63 @@ pgstat_reset_slru_counter(const char *name)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if activity stats is not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    ts = GetCurrentTimestamp();
 
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Store the last autovacuum time in the database's hash table entry.
+     */
+    dbentry = get_local_dbstat_entry(dboid);
+    dbentry->last_autovac_time = ts;
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    PgStat_TableStatus *tabentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    ts = GetCurrentTimestamp();
+    tabentry = get_local_tabstat_entry(tableoid, shared);
+
+    tabentry->t_counts.t_delta_live_tuples = livetuples;
+    tabentry->t_counts.t_delta_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = ts;
+        tabentry->t_counts.autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = ts;
+        tabentry->t_counts.vacuum_count++;
+    }
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1492,9 +1750,10 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    PgStat_TableStatus *tabentry;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1502,10 +1761,10 @@ pgstat_report_analyze(Relation rel,
      * already inserted and/or deleted rows in the target table. ANALYZE will
      * have counted such rows as live or dead respectively. Because we will
      * report our counts of such rows at transaction end, we should subtract
-     * off these counts from what we send to the collector now, else they'll
-     * be double-counted after commit.  (This approach also ensures that the
-     * collector ends up with the right numbers if we abort instead of
-     * committing.)
+     * off these counts from what we write to the shared stats now, else
+     * they'll be double-counted after commit.  (This approach also ensures
+     * that the shared stats end up with the right numbers if we abort
+     * instead of committing.)
      */
     if (rel->pgstat_info != NULL)
     {
@@ -1523,158 +1782,172 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    tabentry = get_local_tabstat_entry(RelationGetRelid(rel),
+                                       rel->rd_rel->relisshared);
+
+    tabentry->t_counts.t_delta_live_tuples = livetuples;
+    tabentry->t_counts.t_delta_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->t_counts.reset_changed_tuples = true;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->t_counts.autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->t_counts.analyze_count++;
+    }
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            dbent->counts.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            dbent->counts.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            dbent->counts.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            dbent->counts.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            dbent->counts.n_conflict_startup_deadlock++;
+            break;
+    }
 }
 
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-
-/* --------
- * pgstat_report_checksum_failures_in_db() -
- *
- *    Tell the collector about one or more checksum failures.
- * --------
- */
-void
-pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
-{
-    PgStat_MsgChecksumFailure msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
-
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_deadlocks++;
 }
 
 /* --------
  * pgstat_report_checksum_failure() -
  *
- *    Tell the collector about a checksum failure.
+ *    Report a checksum failure.
  * --------
  */
 void
 pgstat_report_checksum_failure(void)
 {
-    pgstat_report_checksum_failures_in_db(MyDatabaseId, 1);
+    PgStat_StatDBEntry *dbent;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_checksum_failures++;
 }
 
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
+    if (filesize == 0)            /* Is there a case where filesize is really 0? */
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_temp_bytes += filesize; /* needs overflow check */
+    dbent->counts.n_temp_files++;
 }
 
 
-/* ----------
- * pgstat_ping() -
+/* --------
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
  *
- *    Send some junk data to the collector to increase traffic.
- * ----------
+ *    Report one or more checksum failures.
+ * --------
  */
 void
-pgstat_ping(void)
+pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
 {
-    PgStat_MsgDummy msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if we are not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dbentry = get_local_dbstat_entry(dboid);
+
+    /* add the reported failures to the accumulated count */
+    dbentry->counts.n_checksum_failures += failurecount;
 }
 
 /* ----------
- * pgstat_send_inquiry() -
+ * pgstat_init_function_usage() -
  *
- *    Notify collector that we need fresh data.
+ *  Initialize function call usage data.
+ *  Called by the executor before invoking a function.
  * ----------
  */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/*
- * Initialize function call usage data.
- * Called by the executor before invoking a function.
- */
 void
 pgstat_init_function_usage(FunctionCallInfo fcinfo,
                            PgStat_FunctionCallUsage *fcu)
 {
+    PgStatEnvelope *env;
     PgStat_BackendFunctionEntry *htabent;
     bool        found;
 
@@ -1685,26 +1958,15 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
         return;
     }
 
-    if (!pgStatFunctions)
-    {
-        /* First time through - initialize function stat table */
-        HASHCTL        hash_ctl;
+    env = get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                               fcinfo->flinfo->fn_oid, true, &found);
+    htabent = (PgStat_BackendFunctionEntry *) &env->body;
 
-        memset(&hash_ctl, 0, sizeof(hash_ctl));
-        hash_ctl.keysize = sizeof(Oid);
-        hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
-        pgStatFunctions = hash_create("Function stat entries",
-                                      PGSTAT_FUNCTION_HASH_SIZE,
-                                      &hash_ctl,
-                                      HASH_ELEM | HASH_BLOBS);
-    }
-
-    /* Get the stats entry for this function, create if necessary */
-    htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
-                          HASH_ENTER, &found);
     if (!found)
         MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
 
+    htabent->f_id = fcinfo->flinfo->fn_oid;
+
     fcu->fs = &htabent->f_counts;
 
     /* save stats for this function, later used to compensate for recursion */
@@ -1717,31 +1979,38 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
     INSTR_TIME_SET_CURRENT(fcu->f_start);
 }
 
-/*
- * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
- *        for specified function
+/* ----------
+ * find_funcstat_entry() -
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_BackendFunctionEntry for the specified function.
+ *
+ *  If there is no entry, return NULL without creating a new one.
+ * ----------
  */
 PgStat_BackendFunctionEntry *
 find_funcstat_entry(Oid func_id)
 {
-    if (pgStatFunctions == NULL)
+    PgStatEnvelope *env;
+
+    env = get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                               func_id, false, NULL);
+    if (!env)
         return NULL;
 
-    return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
-                                                       (void *) &func_id,
-                                                       HASH_FIND, NULL);
+    return (PgStat_BackendFunctionEntry *) &env->body;
 }
 
-/*
- * Calculate function call usage and update stat counters.
- * Called by the executor after invoking a function.
+/* ----------
+ * pgstat_end_function_usage() -
  *
- * In the case of a set-returning function that runs in value-per-call mode,
- * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
- * calls for what the user considers a single call of the function.  The
- * finalize flag should be TRUE on the last call.
+ *  Calculate function call usage and update stat counters.
+ *  Called by the executor after invoking a function.
+ *
+ *  In the case of a set-returning function that runs in value-per-call mode,
+ *  we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
+ *  calls for what the user considers a single call of the function.  The
+ *  finalize flag should be TRUE on the last call.
+ * ----------
  */
 void
 pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
@@ -1782,9 +2051,6 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
         fs->f_numcalls++;
     fs->f_total_time = f_total;
     INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
 }
 
 
@@ -1796,8 +2062,7 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
  *
  *    We assume that a relcache entry's pgstat_info field is zeroed by
  *    relcache.c when the relcache entry is made; thereafter it is long-lived
- *    data.  We can avoid repeated searches of the TabStatus arrays when the
- *    same relation is touched repeatedly within a transaction.
+ *    data.
  * ----------
  */
 void
@@ -1817,7 +2082,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1833,116 +2099,157 @@ pgstat_initstats(Relation rel)
         return;
 
     /* Else find or make the PgStat_TableStatus entry, and update link */
-    rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+    rel->pgstat_info = get_local_tabstat_entry(rel_id, rel->rd_rel->relisshared);
 }
 
-/*
- * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
+
+/* ----------
+ * get_local_stat_entry() -
+ *
+ *  Returns the local stats entry for the given type, dbid and objid.
+ *  If create is true, a new entry is created when none exists yet; found must
+ *  be non-null in that case.
+ *
+ *  The caller is responsible for initializing the body of the returned
+ *  envelope.
+ * ----------
  */
-static PgStat_TableStatus *
-get_tabstat_entry(Oid rel_id, bool isshared)
+static PgStatEnvelope *
+get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                     bool create, bool *found)
 {
-    TabStatHashEntry *hash_entry;
-    PgStat_TableStatus *entry;
-    TabStatusArray *tsa;
-    bool        found;
+    PgStatHashEntryKey key;
+    PgStatLocalHashEntry *entry;
 
-    /*
-     * Create hash table if we don't have it already.
-     */
-    if (pgStatTabHash == NULL)
+    if (pgStatLocalHash == NULL)
     {
         HASHCTL        ctl;
 
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
+        MemSet(&ctl, 0, sizeof(ctl));
+        ctl.keysize = sizeof(PgStatHashEntryKey);
+        ctl.entrysize = sizeof(PgStatLocalHashEntry);
+
+        pgStatLocalHash = hash_create("Local stat entries",
+                                      PGSTAT_TABLE_HASH_SIZE,
+                                      &ctl,
+                                      HASH_ELEM | HASH_BLOBS);
+    }
+
+    /* Find an entry or create a new one. */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    entry = hash_search(pgStatLocalHash, &key,
+                        create ? HASH_ENTER : HASH_FIND, found);
+
+    if (!create && !entry)
+        return NULL;
+
+    if (create && !*found)
+    {
+        int            len = pgstat_localentsize[type];
+
+        entry->env = MemoryContextAlloc(TopMemoryContext,
+                                        PgStatEnvelopeSize(len));
+        entry->env->type = type;
+        entry->env->len = len;
     }
 
+    return entry->env;
+}
+
+/* ----------
+ * get_local_dbstat_entry() -
+ *
+ *  Find or create a local PgStat_StatDBEntry for dbid.  A new entry is
+ *  created and initialized if none exists.
+ */
+static PgStat_StatDBEntry *
+get_local_dbstat_entry(Oid dbid)
+{
+    PgStatEnvelope *env;
+    PgStat_StatDBEntry *dbentry;
+    bool        found;
+
     /*
      * Find an entry or create a new one.
      */
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
+    env = get_local_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid,
+                               true, &found);
+    dbentry = (PgStat_StatDBEntry *) &env->body;
+
     if (!found)
     {
-        /* initialize new entry with null pointer */
-        hash_entry->tsa_entry = NULL;
+        dbentry->databaseid = dbid;
+        dbentry->last_autovac_time = 0;
+        dbentry->last_checksum_failure = 0;
+        dbentry->stat_reset_timestamp = 0;
+        dbentry->stats_timestamp = 0;
+        MemSet(&dbentry->counts, 0, sizeof(PgStat_StatDBCounts));
     }
 
-    /*
-     * If entry is already valid, we're done.
-     */
-    if (hash_entry->tsa_entry)
-        return hash_entry->tsa_entry;
-
-    /*
-     * Locate the first pgStatTabList entry with free space, making a new list
-     * entry if needed.  Note that we could get an OOM failure here, but if so
-     * we have left the hashtable and the list in a consistent state.
-     */
-    if (pgStatTabList == NULL)
-    {
-        /* Set up first pgStatTabList entry */
-        pgStatTabList = (TabStatusArray *)
-            MemoryContextAllocZero(TopMemoryContext,
-                                   sizeof(TabStatusArray));
-    }
+    return dbentry;
+}
 
-    tsa = pgStatTabList;
-    while (tsa->tsa_used >= TABSTAT_QUANTUM)
-    {
-        if (tsa->tsa_next == NULL)
-            tsa->tsa_next = (TabStatusArray *)
-                MemoryContextAllocZero(TopMemoryContext,
-                                       sizeof(TabStatusArray));
-        tsa = tsa->tsa_next;
-    }
 
-    /*
-     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
-     * the entry was already zeroed, either at creation or after last use.
-     */
-    entry = &tsa->tsa_entries[tsa->tsa_used++];
-    entry->t_id = rel_id;
-    entry->t_shared = isshared;
+/* ----------
+ * get_local_tabstat_entry() -
+ *  Find or create a PgStat_TableStatus entry for rel.  A new entry is created
+ *  and initialized if none exists.
+ * ----------
+ */
+static PgStat_TableStatus *
+get_local_tabstat_entry(Oid rel_id, bool isshared)
+{
+    PgStatEnvelope *env;
+    PgStat_TableStatus *tabentry;
+    bool        found;
 
-    /*
-     * Now we can fill the entry in pgStatTabHash.
-     */
-    hash_entry->tsa_entry = entry;
+    env = get_local_stat_entry(PGSTAT_TYPE_TABLE,
+                               isshared ? InvalidOid : MyDatabaseId,
+                               rel_id, true, &found);
 
-    return entry;
+    tabentry = (PgStat_TableStatus *) &env->body;
+
+    if (!found)
+    {
+        tabentry->t_id = rel_id;
+        tabentry->t_shared = isshared;
+        tabentry->trans = NULL;
+        MemSet(&tabentry->t_counts, 0, sizeof(PgStat_TableCounts));
+        tabentry->vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+    }
+
+    return tabentry;
 }
 
+
 /*
  * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
  *
- * If no entry, return NULL, don't create a new one
+ *  Look for an existing PgStat_TableStatus entry for rel, first in the
+ *  current database and then among the shared tables.
  *
- * Note: if we got an error in the most recent execution of pgstat_report_stat,
- * it's possible that an entry exists but there's no hashtable entry for it.
- * That's okay, we'll treat this case as "doesn't exist".
+ *  If no entry, return NULL, don't create a new one
+ * ----------
  */
 PgStat_TableStatus *
 find_tabstat_entry(Oid rel_id)
 {
-    TabStatHashEntry *hash_entry;
+    PgStatEnvelope *env;
 
-    /* If hashtable doesn't exist, there are no entries at all */
-    if (!pgStatTabHash)
-        return NULL;
+    env = get_local_stat_entry(PGSTAT_TYPE_TABLE, MyDatabaseId, rel_id,
+                               false, NULL);
+    if (!env)
+        env = get_local_stat_entry(PGSTAT_TYPE_TABLE, InvalidOid, rel_id,
+                                   false, NULL);
+    if (env)
+        return (PgStat_TableStatus *) &env->body;
 
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
-    if (!hash_entry)
-        return NULL;
-
-    /* Note that this step could also return NULL, but that's correct */
-    return hash_entry->tsa_entry;
+    return NULL;
 }
 
 /*
@@ -2373,8 +2680,8 @@ AtPrepare_PgStat(void)
  *
  * All we need do here is unlink the transaction stats state from the
  * nontransactional state.  The nontransactional action counts will be
- * reported to the stats collector immediately, while the effects on live
- * and dead tuple counts are preserved in the 2PC state file.
+ * reported to the activity stats facility immediately, while the effects on
+ * live and dead tuple counts are preserved in the 2PC state file.
  *
  * Note: AtEOXact_PgStat is not called during PREPARE.
  */
@@ -2419,7 +2726,7 @@ pgstat_twophase_postcommit(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, commit case */
     pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
@@ -2455,7 +2762,7 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, abort case */
     if (rec->t_truncated)
@@ -2472,88 +2779,176 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 }
 
 
+/* ----------
+ * snapshot_statentry() -
+ *
+ *  Common routine for functions pgstat_fetch_stat_*entry()
+ *
+ *  Returns the pointer to the snapshot of the shared entry for the key or NULL
+ *  if not found. Returned snapshots are stable during the current transaction
+ *  or until pgstat_clear_snapshot() is called.
+ *
+ *  Created snapshots are stored in pgStatSnapshotHash.
+ */
+static void *
+snapshot_statentry(const PgStatTypes type, const Oid dbid, const Oid objid)
+{
+    PgStatSnapshot *snap = NULL;
+    bool        found;
+    PgStatHashEntryKey key;
+    size_t        statentsize = pgstat_entsize[type];
+
+    Assert(type != PGSTAT_TYPE_ALL);
+
+    /*
+     * Create a new hash with a rather arbitrary initial number of entries,
+     * since we don't know how this hash will grow.
+     */
+    if (!pgStatSnapshotHash)
+    {
+        HASHCTL        ctl;
+
+        /*
+         * Create the hash in the stats context
+         *
+         * Each entry is preceded by a common header, represented by
+         * PgStatSnapshot.
+         */
+
+        ctl.keysize = sizeof(PgStatHashEntryKey);
+        ctl.entrysize = PgStatSnapshotSize(statentsize);
+        ctl.hcxt = pgStatSnapshotContext;
+        pgStatSnapshotHash = hash_create("pgstat snapshot hash", 32, &ctl,
+                                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    }
+
+    /* Find or create a snapshot entry */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+
+    snap = hash_search(pgStatSnapshotHash, &key, HASH_ENTER, &found);
+
+    /*
+     * Refer to the shared hash if not found in the snapshot hash.
+     *
+     * In transaction state, we obviously need a snapshot entry for
+     * consistency. Outside a transaction, we return an up-to-date entry;
+     * even so, we still need a copy since the shared stats entry can be
+     * modified at any time. We reuse the same snapshot entry for that.
+     */
+    if (!found || !IsTransactionState())
+    {
+        PgStatEnvelope *shenv;
+
+        shenv = get_stat_entry(type, dbid, objid, true, NULL, NULL);
+
+        if (shenv)
+            memcpy(&snap->body, &shenv->body, statentsize);
+
+        snap->negative = !shenv;
+    }
+
+    if (snap->negative)
+        return NULL;
+
+    return &snap->body;
+}
+
+
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    Find database stats entry on backends. The returned entries are cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
  * ----------
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    /* If not done for this transaction, take a snapshot of global stats */
+    pgstat_snapshot_global_stats();
+
+    /* the caller has no business with snapshot-local members */
+    return (PgStat_StatDBEntry *)
+        snapshot_statentry(PGSTAT_TYPE_DB, dbid, InvalidOid);
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one table or NULL. NULL doesn't mean
+ *    the activity statistics for one table or NULL. NULL doesn't mean
  *    that the table doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    activity statistics facilities, so the caller is better off to
+ *    report ZERO instead.
  * ----------
  */
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
-    PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(false, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
-
-    return NULL;
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(true, relid);
+    return tabentry;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_tabentry_snapshot() -
+ *
+ *    Find the table stats entry on backends. The returned entry is cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_snapshot(bool shared, Oid reloid)
+{
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    return (PgStat_StatTabEntry *)
+        snapshot_statentry(PGSTAT_TYPE_TABLE, dboid, reloid);
+}
+
+
+/* ----------
+ * pgstat_copy_index_counters() -
+ *
+ *    Support function for index swapping. Copy a portion of the relation's
+ *    counters to the specified place.
+ * ----------
+ */
+void
+pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst)
+{
+    PgStat_StatTabEntry *tabentry;
+
+    /* No point fetching tabentry when dst is NULL */
+    if (!dst)
+        return;
+
+    tabentry = pgstat_fetch_stat_tabentry(relid);
+
+    if (!tabentry)
+        return;
+
+    dst->t_counts.t_numscans = tabentry->numscans;
+    dst->t_counts.t_tuples_returned = tabentry->tuples_returned;
+    dst->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
+    dst->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
+    dst->t_counts.t_blocks_hit = tabentry->blocks_hit;
 }
 
 
@@ -2567,25 +2962,60 @@ pgstat_fetch_stat_tabentry(Oid relid)
 PgStat_StatFuncEntry *
 pgstat_fetch_stat_funcentry(Oid func_id)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry = NULL;
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
 
-    /* load the stats file if needed */
-    backend_read_statsfile();
+    return (PgStat_StatFuncEntry *)
+        snapshot_statentry(PGSTAT_TYPE_FUNCTION, MyDatabaseId, func_id);
+}
 
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
+/*
+ * pgstat_snapshot_global_stats() -
+ *
+ * Makes a snapshot of global stats if not done yet.  It is kept until a
+ * subsequent call to pgstat_clear_snapshot() or the end of the current
+ * memory context (typically TopTransactionContext).
+ * ----------
+ */
+static void
+pgstat_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext;
+    int    i;
+
+    attach_shared_stats();
+
+    /* Nothing to do if already done */
+    if (global_snapshot_is_valid)
+        return;
+
+    oldcontext = MemoryContextSwitchTo(pgStatSnapshotContext);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+
+    memcpy(&snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+    memcpy(&snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+    memcpy(snapshot_SLRUStats, shared_SLRUStats,
+           sizeof(PgStat_StatSLRUEntry) * SLRU_NUM_ELEMENTS);
+
+    LWLockRelease(StatsLock);
+
+    /* Fill in empty timestamp of SLRU stats */
+    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
     {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
+        if (snapshot_SLRUStats[i].stat_reset_timestamp == 0)
+            snapshot_SLRUStats[i].stat_reset_timestamp =
+                snapshot_globalStats.stat_reset_timestamp;
     }
+    global_snapshot_is_valid = true;
 
-    return funcentry;
+    MemoryContextSwitchTo(oldcontext);
+
+    return;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_beentry() -
  *
@@ -2656,9 +3086,10 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &archiverStats;
+    return &snapshot_archiverStats;
 }
 
 
@@ -2673,9 +3104,10 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &globalStats;
+    return &snapshot_globalStats;
 }
 
 
@@ -2687,12 +3119,12 @@ pgstat_fetch_global(void)
  *    a pointer to the slru statistics struct.
  * ---------
  */
-PgStat_SLRUStats *
+PgStat_StatSLRUEntry *
 pgstat_fetch_slru(void)
 {
-    backend_read_statsfile();
+    pgstat_snapshot_global_stats();
 
-    return slruStats;
+    return snapshot_SLRUStats;
 }
 
 
@@ -2906,8 +3338,8 @@ pgstat_initialize(void)
         MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
     }
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* needs to be called before dsm shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -3083,12 +3515,15 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+    /* attach shared database stats area */
+    attach_shared_stats();
 }
 
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
- * Flush any remaining statistics counts out to the collector.
+ * Flush any remaining statistics counts out to shared stats.
  * Without this, operations triggered during backend exit (such as
  * temp table deletions) won't be counted.
  *
@@ -3101,7 +3536,7 @@ pgstat_beshutdown_hook(int code, Datum arg)
 
     /*
      * If we got as far as discovering our own database ID, we can report what
-     * we did to the collector.  Otherwise, we'd be sending an invalid
+     * we did to the shared stats.  Otherwise, we'd be sending an invalid
      * database ID, so forget it.  (This means that accesses to pg_database
      * during failed backend starts might never get counted.)
      */
@@ -3118,6 +3553,8 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    detach_shared_stats(true);
 }
 
 
@@ -3378,7 +3815,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3673,9 +4111,6 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
-            break;
         case WAIT_EVENT_RECOVERY_WAL_STREAM:
             event_name = "RecoveryWalStream";
             break;
@@ -4310,94 +4745,71 @@ pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
 
 
 /* ----------
- * pgstat_setheader() -
+ * pgstat_report_archiver() -
  *
- *        Set common header fields in a statistics message
- * ----------
- */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
-    {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
- *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
+ *        Report archiver statistics
  * ----------
  */
 void
-pgstat_send_archiver(const char *xlog, bool failed)
+pgstat_report_archiver(const char *xlog, bool failed)
 {
-    PgStat_MsgArchiver msg;
+    TimestampTz now = GetCurrentTimestamp();
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+    if (failed)
+    {
+        /* Failed archival attempt */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->failed_count;
+        strlcpy(shared_archiverStats->last_failed_wal, xlog,
+                sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    else
+    {
+        /* Successful archival operation */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->archived_count;
+        strlcpy(shared_archiverStats->last_archived_wal, xlog,
+                sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
 }
 
 /* ----------
- * pgstat_send_bgwriter() -
+ * pgstat_report_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
-pgstat_send_bgwriter(void)
+pgstat_report_bgwriter(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+
+    PgStat_BgWriter *l = &BgWriterStats;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid taking the lock for completely empty stats.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += l->timed_checkpoints;
+    shared_globalStats->requested_checkpoints += l->requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += l->checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += l->checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += l->buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += l->buf_written_clean;
+    shared_globalStats->maxwritten_clean += l->maxwritten_clean;
+    shared_globalStats->buf_written_backend += l->buf_written_backend;
+    shared_globalStats->buf_fsync_backend += l->buf_fsync_backend;
+    shared_globalStats->buf_alloc += l->buf_alloc;
+    LWLockRelease(StatsLock);
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4405,473 +4817,31 @@ pgstat_send_bgwriter(void)
     MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
 }
 
-/* ----------
- * pgstat_send_slru() -
- *
- *        Send SLRU statistics to the collector
- * ----------
- */
-static void
-pgstat_send_slru(void)
-{
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgSLRU all_zeroes;
-
-    for (int i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /*
-         * This function can be called even if nothing at all has happened. In
-         * this case, avoid sending a completely empty message to the stats
-         * collector.
-         */
-        if (memcmp(&SLRUStats[i], &all_zeroes, sizeof(PgStat_MsgSLRU)) == 0)
-            continue;
-
-        /* set the SLRU type before each send */
-        SLRUStats[i].m_index = i;
-
-        /*
-         * Prepare and send the message
-         */
-        pgstat_setheader(&SLRUStats[i].m_hdr, PGSTAT_MTYPE_SLRU);
-        pgstat_send(&SLRUStats[i], sizeof(PgStat_MsgSLRU));
-
-        /*
-         * Clear out the statistics buffer, so it can be re-used.
-         */
-        MemSet(&SLRUStats[i], 0, sizeof(PgStat_MsgSLRU));
-    }
-}
-
-
-/* ----------
- * PgstatCollectorMain() -
- *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, SignalHandlerForConfigReload);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    MyBackendType = B_STATS_COLLECTOR;
-    init_ps_display(NULL);
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do check ConfigReloadPending
-     * inside the inner loop, which means that such interrupts will get
-     * serviced but the latch won't get cleared until next time there is a
-     * break in the action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (ShutdownRequestPending)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * ShutdownRequestPending becomes set.
-         */
-        while (!ShutdownRequestPending)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (ConfigReloadPending)
-            {
-                ConfigReloadPending = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSLRUCOUNTER:
-                    pgstat_recv_resetslrucounter(&msg.msg_resetslrucounter,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_SLRU:
-                    pgstat_recv_slru(&msg.msg_slru, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE,
-                               pgStatSock, -1L,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitLatchOrSocket(MyLatch,
-                               WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
-                               pgStatSock,
-                               2 * 1000L /* msec */ ,
-                               WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr & WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    exit(0);
-}
-
-/*
- * Subroutine to clear stats in a database entry
- *
- * Tables and functions hashes are initialized to empty.
- */
-static void
-reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
-{
-    HASHCTL        hash_ctl;
-
-    dbentry->n_xact_commit = 0;
-    dbentry->n_xact_rollback = 0;
-    dbentry->n_blocks_fetched = 0;
-    dbentry->n_blocks_hit = 0;
-    dbentry->n_tuples_returned = 0;
-    dbentry->n_tuples_fetched = 0;
-    dbentry->n_tuples_inserted = 0;
-    dbentry->n_tuples_updated = 0;
-    dbentry->n_tuples_deleted = 0;
-    dbentry->last_autovac_time = 0;
-    dbentry->n_conflict_tablespace = 0;
-    dbentry->n_conflict_lock = 0;
-    dbentry->n_conflict_snapshot = 0;
-    dbentry->n_conflict_bufferpin = 0;
-    dbentry->n_conflict_startup_deadlock = 0;
-    dbentry->n_temp_files = 0;
-    dbentry->n_temp_bytes = 0;
-    dbentry->n_deadlocks = 0;
-    dbentry->n_checksum_failures = 0;
-    dbentry->last_checksum_failure = 0;
-    dbentry->n_block_read_time = 0;
-    dbentry->n_block_write_time = 0;
-
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
-
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
-}
-
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
-{
-    PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
-     */
-    if (!found)
-        reset_dbentry_counters(result);
-
-    return result;
-}
-
-
-/*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
-{
-    PgStat_StatTabEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /* If not found, initialize the new one. */
-    if (!found)
-    {
-        result->numscans = 0;
-        result->tuples_returned = 0;
-        result->tuples_fetched = 0;
-        result->tuples_inserted = 0;
-        result->tuples_updated = 0;
-        result->tuples_deleted = 0;
-        result->tuples_hot_updated = 0;
-        result->n_live_tuples = 0;
-        result->n_dead_tuples = 0;
-        result->changes_since_analyze = 0;
-        result->inserts_since_vacuum = 0;
-        result->blocks_fetched = 0;
-        result->blocks_hit = 0;
-        result->vacuum_timestamp = 0;
-        result->vacuum_count = 0;
-        result->autovac_vacuum_timestamp = 0;
-        result->autovac_vacuum_count = 0;
-        result->analyze_timestamp = 0;
-        result->analyze_count = 0;
-        result->autovac_analyze_timestamp = 0;
-        result->autovac_analyze_count = 0;
-    }
-
-    return result;
-}
-
 
 /* ----------
  * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ *        Write the global statistics file, as well as DB files.
  * ----------
  */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+void
+pgstat_write_statsfiles(void)
 {
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **penv;
+
+    /* Stats are not initialized yet; just return. */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
+    create_missing_dbentries();
+
     /*
      * Open the statistics temp file to write out the current values.
      */
@@ -4888,7 +4858,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
 
     /*
      * Write the file header --- currently just a format ID.
@@ -4900,38 +4870,31 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Write global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write archiver stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Write SLRU stats struct
-     */
-    rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Walk through the database table.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_DB, InvalidOid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) &(*penv)->body;
+
         /*
          * Write out the table and function stats for this DB into the
          * appropriate per-DB stat file, if required.
          */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
+        pgstat_write_database_stats(dbentry);
 
         /*
          * Write out the DB entry. We don't write the tables or functions
@@ -4942,6 +4905,8 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
         (void) rc;                /* we'll check for error with ferror */
     }
 
+    pfree(envlist);
+
     /*
      * No more output to be done. Close the temp file and replace the old
      * pgstat.stat with it.  The ferror() check replaces testing for error
@@ -4974,55 +4939,19 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
 }
 
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
 
 /* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
+ * pgstat_write_database_stats() -
+ *  Write the stat file for a single database.
  * ----------
  */
 static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
+pgstat_write_database_stats(PgStat_StatDBEntry *dbentry)
 {
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **penv;
     FILE       *fpout;
     int32        format_id;
     Oid            dbid = dbentry->databaseid;
@@ -5030,8 +4959,8 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     char        tmpfile[MAXPGPATH];
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -5058,24 +4987,31 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     /*
      * Walk through the database's access stats per table.
      */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_TABLE, dbentry->databaseid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatTabEntry *tabentry = (PgStat_StatTabEntry *) &(*penv)->body;
+
         fputc('T', fpout);
         rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    pfree(envlist);
 
     /*
      * Walk through the database's function stats table.
      */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_FUNCTION, dbentry->databaseid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatFuncEntry *funcentry =
+        (PgStat_StatFuncEntry *) &(*penv)->body;
+
         fputc('F', fpout);
         rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    pfree(envlist);
 
     /*
      * No more output to be done. Close the temp file and replace the old
@@ -5109,102 +5045,165 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
+}
 
-    if (permanent)
+/* ----------
+ * create_missing_dbentries() -
+ *
+ *  A database entry may be missing for a database in which object stats
+ *  are recorded. This function creates such missing dbentries so that all
+ *  stats entries can be written out to files.
+ * ----------
+ */
+static void
+create_missing_dbentries(void)
+{
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
+    HTAB       *oidhash;
+    HASHCTL        ctl;
+    HASH_SEQ_STATUS scan;
+    Oid           *poid;
+
+    memset(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(Oid);
+    ctl.entrysize = sizeof(Oid);
+    ctl.hcxt = CurrentMemoryContext;
+    oidhash = hash_create("Temporary table of OIDs",
+                          PGSTAT_TABLE_HASH_SIZE,
+                          &ctl,
+                          HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+    /* Collect database OIDs from the shared stats hash */
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+        hash_search(oidhash, &p->key.databaseid, HASH_ENTER, NULL);
+    dshash_seq_term(&hstat);
+
+    /* Create database entries that do not exist yet. */
+    hash_seq_init(&scan, oidhash);
+    while ((poid = (Oid *) hash_seq_search(&scan)) != NULL)
+        (void) get_stat_entry(PGSTAT_TYPE_DB, *poid, InvalidOid,
+                              false, init_dbentry, NULL);
+
+    hash_destroy(oidhash);
+}
+
+
+/* ----------
+ * get_stat_entry() -
+ *
+ *  Get the shared stats entry for the specified type, dbid and objid.
+ *  If nowait is true, returns NULL on lock failure.
+ *
+ *  If initfunc is not NULL, a new entry is created if none exists yet, and
+ *  initfunc is called with the new envelope. If found is not NULL, it is set
+ *  to true if an existing entry was found, or false otherwise.
+ * ----------
+ */
+static PgStatEnvelope *
+get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+               bool nowait, entry_initializer initfunc, bool *found)
+{
+    bool        create = (initfunc != NULL);
+    PgStatHashEntry *shent;
+    PgStatEnvelope *shenv = NULL;
+    PgStatHashEntryKey key;
+    bool        myfound;
+
+    Assert(type != PGSTAT_TYPE_ALL);
+
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    shent = dshash_find_extended(pgStatSharedHash, &key,
+                                 create, nowait, create, &myfound);
+    if (shent)
     {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
+        if (create && !myfound)
+        {
+            /* Create new stats envelope. */
+            size_t        envsize = PgStatEnvelopeSize(pgstat_entsize[type]);
+            dsa_pointer chunk = dsa_allocate0(area, envsize);
 
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
+            shenv = dsa_get_address(area, chunk);
+            shenv->type = type;
+            shenv->databaseid = dbid;
+            shenv->objectid = objid;
+            shenv->len = pgstat_entsize[type];
+            LWLockInitialize(&shenv->lock, LWTRANCHE_STATS);
+
+            /*
+             * The dshash lock is released right after this, so call the
+             * initializer before the entry is exposed to other processes.
+             */
+            if (initfunc)
+                initfunc(shenv);
+
+            /* Link the new entry from the hash entry. */
+            shent->env = chunk;
+        }
+        else
+            shenv = dsa_get_address(area, shent->env);
+
+        dshash_release_lock(pgStatSharedHash, shent);
     }
+
+    if (found)
+        *found = myfound;
+
+    return shenv;
 }
 
+
 /* ----------
  * pgstat_read_statsfiles() -
  *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ *    Reads in existing activity statistics files into the shared stats hash.
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+void
+pgstat_read_statsfiles(void)
 {
+    PgStatEnvelope *env;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-    int            i;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
 
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
+    /* shouldn't be called from postmaster */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-    memset(&slruStats, 0, sizeof(slruStats));
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Set the same reset timestamp for all SLRU items too.
-     */
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-        slruStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp =
+        shared_globalStats->stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the activity statistics are not running or
+     * have not yet written the stats file the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -5213,7 +5212,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
@@ -5221,49 +5220,30 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     /*
      * Read global stats struct
      */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+        sizeof(*shared_globalStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
         goto done;
     }
 
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
     /*
      * Read archiver stats struct
      */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+        sizeof(*shared_archiverStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
         goto done;
     }
 
     /*
-     * Read SLRU stats struct
-     */
-    if (fread(slruStats, 1, sizeof(slruStats), fpin) != sizeof(slruStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&slruStats, 0, sizeof(slruStats));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
     for (;;)
     {
@@ -5277,7 +5257,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
                           fpin) != offsetof(PgStat_StatDBEntry, tables))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5286,76 +5266,33 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 /*
                  * Add to the DB hash
                  */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
+
+                env = get_stat_entry(PGSTAT_TYPE_DB, dbbuf.databaseid,
+                                     InvalidOid,
+                                     false, init_dbentry, &found);
+                dbentry = (PgStat_StatDBEntry *) &env->body;
+
+                /* don't allow duplicate dbentries */
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
+                memcpy(dbentry, &dbbuf,
+                       offsetof(PgStat_StatDBEntry, tables));
 
+                /* Read the data from the database-specific file. */
+                pgstat_read_db_statsfile(dbentry);
                 break;
 
             case 'E':
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5365,59 +5302,50 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 done:
     FreeFile(fpin);
 
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 
-    return dbhash;
+    return;
 }
 
 
 /* ----------
  * pgstat_read_db_statsfile() -
  *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
+ *    Reads in the at-rest statistics file and creates shared statistics
+ *    tables. The file is removed after reading.
  * ----------
  */
 static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
+pgstat_read_db_statsfile(PgStat_StatDBEntry *dbentry)
 {
+    PgStatEnvelope *env;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatTabEntry tabbuf;
     PgStat_StatFuncEntry funcbuf;
     PgStat_StatFuncEntry *funcentry;
+    dshash_table *tabhash = NULL;
+    dshash_table *funchash = NULL;
     FILE       *fpin;
     int32        format_id;
     bool        found;
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
+    get_dbstat_filename(false, dbentry->databaseid, statfile, MAXPGPATH);
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the activity statistics are not running or
+     * have not yet written the stats file the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
@@ -5430,14 +5358,14 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
 
     /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
     for (;;)
     {
@@ -5450,25 +5378,21 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
                           fpin) != sizeof(PgStat_StatTabEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
+                env = get_stat_entry(PGSTAT_TYPE_TABLE,
+                                     dbentry->databaseid, tabbuf.tableid,
+                                     false, init_tabentry, &found);
+                tabentry = (PgStat_StatTabEntry *) &env->body;
 
+                /* don't allow duplicate entries */
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5484,25 +5408,20 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
                           fpin) != sizeof(PgStat_StatFuncEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
+                env = get_stat_entry(PGSTAT_TYPE_FUNCTION,
+                                     dbentry->databaseid, funcbuf.functionid,
+                                     false, init_funcentry, &found);
+                funcentry = (PgStat_StatFuncEntry *) &env->body;
 
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5518,7 +5437,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5526,304 +5445,15 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     }
 
 done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read SLRU stats struct
-     */
-    if (fread(mySLRUStats, 1, sizeof(mySLRUStats), fpin) != sizeof(mySLRUStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
+    if (tabhash)
+        dshash_detach(tabhash);
+    if (funchash)
+        dshash_detach(funchash);
 
-done:
     FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
-}
-
 
-/* ----------
- * pgstat_setup_memcxt() -
- *
- *    Create pgStatLocalContext, if not already done.
- * ----------
- */
-static void
-pgstat_setup_memcxt(void)
-{
-    if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 }
 
 
@@ -5842,797 +5472,25 @@ pgstat_clear_snapshot(void)
 {
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
+    {
         MemoryContextDelete(pgStatLocalContext);
 
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-                tabentry->inserts_since_vacuum = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
     }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
 
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
+    if (pgStatSnapshotContext)
     {
-        char        statfile[MAXPGPATH];
+        MemoryContextReset(pgStatSnapshotContext);
 
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
+        /* Reset variables that pointed to the context */
+        global_snapshot_is_valid = false;
+        pgStatSnapshotHash = NULL;
     }
 }
 
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_resetslrucounter() -
- *
- *    Reset some SLRU statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len)
-{
-    int            i;
-    TimestampTz ts = GetCurrentTimestamp();
-
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /* reset entry with the given index, or all entries (index is -1) */
-        if ((msg->m_index == -1) || (msg->m_index == i))
-        {
-            memset(&slruStats[i], 0, sizeof(slruStats[i]));
-            slruStats[i].stat_reset_timestamp = ts;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signaling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * It is quite possible that a non-aggressive VACUUM ended up skipping
-     * various pages, however, we'll zero the insert counter here regardless.
-     * It's currently used only to track when we need to perform an "insert"
-     * autovacuum, which are mainly intended to freeze newly inserted tuples.
-     * Zeroing this may just mean we'll not try to vacuum the table again
-     * until enough tuples have been inserted to trigger another insert
-     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
-     * stragglers.
-     */
-    tabentry->inserts_since_vacuum = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_slru() -
- *
- *    Process a SLRU message.
- * ----------
- */
-static void
-pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
-{
-    slruStats[msg->m_index].blocks_zeroed += msg->m_blocks_zeroed;
-    slruStats[msg->m_index].blocks_hit += msg->m_blocks_hit;
-    slruStats[msg->m_index].blocks_read += msg->m_blocks_read;
-    slruStats[msg->m_index].blocks_written += msg->m_blocks_written;
-    slruStats[msg->m_index].blocks_exists += msg->m_blocks_exists;
-    slruStats[msg->m_index].flush += msg->m_flush;
-    slruStats[msg->m_index].truncate += msg->m_truncate;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
-}
-
 /*
  * Convert a potentially unsafely truncated activity string (see
  * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
@@ -6721,7 +5579,7 @@ pgstat_slru_name(int slru_idx)
  * Returns pointer to entry with counters for given SLRU (based on the name
  * stored in SlruCtl as lwlock tranche name).
  */
-static inline PgStat_MsgSLRU *
+static PgStat_StatSLRUEntry *
 slru_entry(int slru_idx)
 {
     /*
@@ -6732,7 +5590,7 @@ slru_entry(int slru_idx)
 
     Assert((slru_idx >= 0) && (slru_idx < SLRU_NUM_ELEMENTS));
 
-    return &SLRUStats[slru_idx];
+    return &local_SLRUStats[slru_idx];
 }
 
 /*
@@ -6742,41 +5600,48 @@ slru_entry(int slru_idx)
 void
 pgstat_count_slru_page_zeroed(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_zeroed += 1;
+    slru_entry(slru_idx)->blocks_zeroed += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_hit(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_hit += 1;
+    slru_entry(slru_idx)->blocks_hit += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_exists(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_exists += 1;
+    slru_entry(slru_idx)->blocks_exists += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_read(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_read += 1;
+    slru_entry(slru_idx)->blocks_read += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_written(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_written += 1;
+    slru_entry(slru_idx)->blocks_written += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_flush(int slru_idx)
 {
-    slru_entry(slru_idx)->m_flush += 1;
+    slru_entry(slru_idx)->flush += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_truncate(int slru_idx)
 {
-    slru_entry(slru_idx)->m_truncate += 1;
+    slru_entry(slru_idx)->truncate += 1;
+    have_slrustats = true;
 }
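
A note on the counter hunks above: each pgstat_count_slru_*() call now bumps a backend-local entry and sets have_slrustats, so that a later flush can skip the work when nothing changed. The following is only a minimal sketch of that accumulate-then-flush idea, not the patch's actual flush code. Every demo_* name is hypothetical, the "shared" array is an ordinary static standing in for shared memory, and the locking a real flush would need is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEMO_SLRU_NUM 2         /* hypothetical stand-in for SLRU_NUM_ELEMENTS */

/* Hypothetical entry; loosely modelled on the counters touched above. */
typedef struct DemoSlruEntry
{
    int64_t     blocks_hit;
    int64_t     blocks_read;
} DemoSlruEntry;

static DemoSlruEntry demo_local[DEMO_SLRU_NUM];     /* pending, backend-local */
static DemoSlruEntry demo_shared[DEMO_SLRU_NUM];    /* stand-in for shared memory */
static bool demo_have_slrustats = false;            /* anything pending? */

/* Count one page hit locally and remember that a flush is needed. */
static void
demo_count_slru_page_hit(int idx)
{
    assert(idx >= 0 && idx < DEMO_SLRU_NUM);
    demo_local[idx].blocks_hit += 1;
    demo_have_slrustats = true;
}

/*
 * Fold pending local counts into the shared copy.  Returns true if anything
 * was flushed.  A real implementation would hold a lock while adding.
 */
static bool
demo_flush_slru(void)
{
    int         i;

    if (!demo_have_slrustats)
        return false;           /* cheap exit when nothing was counted */

    for (i = 0; i < DEMO_SLRU_NUM; i++)
    {
        demo_shared[i].blocks_hit += demo_local[i].blocks_hit;
        demo_shared[i].blocks_read += demo_local[i].blocks_read;
    }

    /* Reset the pending counters so they are not added twice. */
    memset(demo_local, 0, sizeof(demo_local));
    demo_have_slrustats = false;
    return true;
}
```

The flag keeps the common "nothing to report" case down to a single test, which is presumably why the hunks set it in every counter function.
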
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 30f3fc6ba1..6b7baa4ebf 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -254,7 +254,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -502,7 +501,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1325,12 +1323,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1779,11 +1771,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2706,8 +2693,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -3070,8 +3055,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3138,13 +3121,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3217,22 +3193,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3693,22 +3653,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3904,8 +3848,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3929,8 +3871,7 @@ PostmasterStateMachine(void)
     {
         /*
          * PM_WAIT_DEAD_END state ends when the BackendList is entirely empty
-         * (ie, no dead_end children remain), and the archiver and stats
-         * collector are gone too.
+         * (ie, no dead_end children remain), and the archiver is gone too.
          *
          * The reason we wait for those two is to protect them against a new
          * postmaster starting conflicting subprocesses; this isn't an
@@ -3940,8 +3881,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4142,8 +4082,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5116,18 +5054,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5246,12 +5172,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6146,7 +6066,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6202,8 +6121,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6438,7 +6355,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 3b46bfe9ab..b6c7a8bc3c 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1550,8 +1550,8 @@ is_checksummed_file(const char *fullpath, const char *filename)
  *
  * If 'missing_ok' is true, will not throw an error if the file is not found.
  *
- * If dboid is anything other than InvalidOid then any checksum failures detected
- * will get reported to the stats collector.
+ * If dboid is anything other than InvalidOid then any checksum failures
+ * detected will get reported to the activity stats facility.
  *
  * Returns true if the file was successfully sent, false if 'missing_ok',
  * and the file did not exist.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 29c920800a..32c511542d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2010,7 +2010,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                BgWriterStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2120,7 +2120,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2310,7 +2310,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2318,7 +2318,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
 
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..58a442f482 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -147,6 +147,7 @@ CreateSharedMemoryAndSemaphores(void)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -263,6 +264,7 @@ CreateSharedMemoryAndSemaphores(void)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 2fa90cc095..17a46db74b 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -177,7 +177,9 @@ static const char *const BuiltinTrancheNames[] = {
     /* LWTRANCHE_PARALLEL_APPEND: */
     "ParallelAppend",
     /* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
-    "PerXactPredicateList"
+    "PerXactPredicateList",
+    /* LWTRANCHE_STATS: */
+    "ActivityStatistics"
 };
 
 StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index e6985e8eed..cdd7095ebf 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -50,3 +50,4 @@ MultiXactTruncationLock                41
 OldSnapshotTimeMapLock                42
 LogicalRepWorkerLock                43
 XactTruncationLock                    44
+StatsLock                            45
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 7d667c6586..3535089755 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -414,8 +414,8 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
     DropRelFileNodesAllBuffers(rnodes, nrels);
 
     /*
-     * It'd be nice to tell the stats collector to forget them immediately,
-     * too. But we can't because we don't know the OIDs.
+     * It'd be nice to tell the activity stats facility to forget them
+     * immediately, too. But we can't because we don't know the OIDs.
      */
 
     /*
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index c9424f167c..0396aa469f 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3209,6 +3209,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3783,6 +3789,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4185,11 +4192,12 @@ PostgresMain(int argc, char *argv[],
          * Note: this includes fflush()'ing the last of the prior output.
          *
          * This is also a good time to send collected statistics to the
-         * collector, and to update the PS stats display.  We avoid doing
-         * those every time through the message loop because it'd slow down
-         * processing of batched messages, and because we don't want to report
-         * uncommitted updates (that confuses autovacuum).  The notification
-         * processor wants a call too, if we are not in a transaction block.
+         * activity stats facility, and to update the PS stats display.  We
+         * avoid doing those every time through the message loop because it'd
+         * slow down processing of batched messages, and because we don't want
+         * to report uncommitted updates (that confuses autovacuum).  The
+         * notification processor wants a call too, if we are not in a
+         * transaction block.
          */
         if (send_ready_for_query)
         {
@@ -4221,6 +4229,8 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long stats_timeout;
+
                 /* Send out notify signals and transmit self-notifies */
                 ProcessCompletedNotifies();
 
@@ -4233,8 +4243,13 @@ PostgresMain(int argc, char *argv[],
                 if (notifyInterruptPending)
                     ProcessNotifyInterrupt();
 
-                pgstat_report_stat(false);
-
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle");
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4269,7 +4284,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4277,6 +4292,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
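
A note on the PostgresMain() hunks above: pgstat_report_stat(false) now returns a value, and a positive value arms IDLE_STATS_UPDATE_TIMEOUT so that an idle backend still gets its pending counts out later via pgstat_report_stat(true). The body of pgstat_report_stat() is not part of this excerpt, so the sketch below only assumes one plausible contract: return 0 when nothing is pending or the flush happened, otherwise return the number of milliseconds until a flush would be allowed. All demo_* names and the 500 ms interval are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_MIN_INTERVAL_MS 500    /* hypothetical minimum flush interval */

static long demo_last_flush_ms = 0; /* time of the previous flush */
static int  demo_pending = 0;       /* counts not yet flushed */
static int  demo_flushed = 0;       /* counts already flushed */

/*
 * Flush pending counts unless the last flush was too recent.
 * Returns 0 if there is nothing left to do, else the remaining wait in ms,
 * which the caller can use to arm a timer.  "force" bypasses the interval.
 */
static long
demo_report_stat(bool force, long now_ms)
{
    long        elapsed;

    if (demo_pending == 0)
        return 0;               /* nothing to report */

    elapsed = now_ms - demo_last_flush_ms;
    if (!force && elapsed < DEMO_MIN_INTERVAL_MS)
        return DEMO_MIN_INTERVAL_MS - elapsed;  /* too soon; retry later */

    demo_flushed += demo_pending;
    demo_pending = 0;
    demo_last_flush_ms = now_ms;
    return 0;
}
```

Under this assumed contract the caller mirrors the hunk: a positive return arms the timer, and the timer path calls with force set to true so the deferred counts are not lost while the backend sleeps.
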
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 2aff739466..b9e78dd8e8 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * pgstatfuncs.c
- *      Functions for accessing the statistics collector data
+ *      Functions for accessing the activity statistics data
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -35,9 +35,6 @@
 
 #define HAS_PGSTAT_PERMISSIONS(role)     (is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS) || has_privs_of_role(GetUserId(),role))
 
 
-/* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
-
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
 {
@@ -1261,7 +1258,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_commit);
+        result = (int64) (dbentry->counts.n_xact_commit);
 
     PG_RETURN_INT64(result);
 }
@@ -1277,7 +1274,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_rollback);
+        result = (int64) (dbentry->counts.n_xact_rollback);
 
     PG_RETURN_INT64(result);
 }
@@ -1293,7 +1290,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_fetched);
+        result = (int64) (dbentry->counts.n_blocks_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1309,7 +1306,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_hit);
+        result = (int64) (dbentry->counts.n_blocks_hit);
 
     PG_RETURN_INT64(result);
 }
@@ -1325,7 +1322,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_returned);
+        result = (int64) (dbentry->counts.n_tuples_returned);
 
     PG_RETURN_INT64(result);
 }
@@ -1341,7 +1338,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_fetched);
+        result = (int64) (dbentry->counts.n_tuples_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1357,7 +1354,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_inserted);
+        result = (int64) (dbentry->counts.n_tuples_inserted);
 
     PG_RETURN_INT64(result);
 }
@@ -1373,7 +1370,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_updated);
+        result = (int64) (dbentry->counts.n_tuples_updated);
 
     PG_RETURN_INT64(result);
 }
@@ -1389,7 +1386,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_deleted);
+        result = (int64) (dbentry->counts.n_tuples_deleted);
 
     PG_RETURN_INT64(result);
 }
@@ -1422,7 +1419,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_files;
+        result = dbentry->counts.n_temp_files;
 
     PG_RETURN_INT64(result);
 }
@@ -1438,7 +1435,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_bytes;
+        result = dbentry->counts.n_temp_bytes;
 
     PG_RETURN_INT64(result);
 }
@@ -1453,7 +1450,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace);
+        result = (int64) (dbentry->counts.n_conflict_tablespace);
 
     PG_RETURN_INT64(result);
 }
@@ -1468,7 +1465,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_lock);
+        result = (int64) (dbentry->counts.n_conflict_lock);
 
     PG_RETURN_INT64(result);
 }
@@ -1483,7 +1480,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_snapshot);
+        result = (int64) (dbentry->counts.n_conflict_snapshot);
 
     PG_RETURN_INT64(result);
 }
@@ -1498,7 +1495,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_bufferpin);
+        result = (int64) (dbentry->counts.n_conflict_bufferpin);
 
     PG_RETURN_INT64(result);
 }
@@ -1513,7 +1510,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1528,11 +1525,11 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace +
-                          dbentry->n_conflict_lock +
-                          dbentry->n_conflict_snapshot +
-                          dbentry->n_conflict_bufferpin +
-                          dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_tablespace +
+                          dbentry->counts.n_conflict_lock +
+                          dbentry->counts.n_conflict_snapshot +
+                          dbentry->counts.n_conflict_bufferpin +
+                          dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1547,7 +1544,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_deadlocks);
+        result = (int64) (dbentry->counts.n_deadlocks);
 
     PG_RETURN_INT64(result);
 }
@@ -1565,7 +1562,7 @@ pg_stat_get_db_checksum_failures(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_checksum_failures);
+        result = (int64) (dbentry->counts.n_checksum_failures);
 
     PG_RETURN_INT64(result);
 }
@@ -1602,7 +1599,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_read_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_read_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1618,7 +1615,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_write_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_write_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1704,7 +1701,7 @@ pg_stat_get_slru(PG_FUNCTION_ARGS)
     MemoryContext per_query_ctx;
     MemoryContext oldcontext;
     int            i;
-    PgStat_SLRUStats *stats;
+    PgStat_StatSLRUEntry *stats;
 
     /* check to see if caller supports us returning a tuplestore */
     if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
@@ -1738,7 +1735,7 @@ pg_stat_get_slru(PG_FUNCTION_ARGS)
         /* for each row */
         Datum        values[PG_STAT_GET_SLRU_COLS];
         bool        nulls[PG_STAT_GET_SLRU_COLS];
-        PgStat_SLRUStats stat = stats[i];
+        PgStat_StatSLRUEntry    stat = stats[i];
         const char *name;
 
         name = pgstat_slru_name(i);
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb19644419..51748c99ad 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index cca9704d2d..430e997ca6 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -241,9 +241,6 @@ GetBackendTypeDesc(BackendType backendType)
         case B_ARCHIVER:
             backendDesc = "archiver";
             break;
-        case B_STATS_COLLECTOR:
-            backendDesc = "stats collector";
-            break;
         case B_LOGGER:
             backendDesc = "logger";
             break;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index f4247ea70d..f65d05c24c 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -631,6 +632,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1241,6 +1244,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 2f3e0a70e0..f073025f69 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -747,8 +747,8 @@ const char *const config_group_names[] =
     gettext_noop("Statistics"),
     /* STATS_MONITORING */
     gettext_noop("Statistics / Monitoring"),
-    /* STATS_COLLECTOR */
-    gettext_noop("Statistics / Query and Index Statistics Collector"),
+    /* STATS_ACTIVITY */
+    gettext_noop("Statistics / Query and Index Activity Statistics"),
     /* AUTOVACUUM */
     gettext_noop("Autovacuum"),
     /* CLIENT_CONN */
@@ -1478,7 +1478,7 @@ static struct config_bool ConfigureNamesBool[] =
 #endif
 
     {
-        {"track_activities", PGC_SUSET, STATS_COLLECTOR,
+        {"track_activities", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects information about executing commands."),
             gettext_noop("Enables the collection of information on the currently "
                          "executing command of each session, along with "
@@ -1489,7 +1489,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_counts", PGC_SUSET, STATS_COLLECTOR,
+        {"track_counts", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects statistics on database activity."),
             NULL
         },
@@ -1498,7 +1498,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_io_timing", PGC_SUSET, STATS_COLLECTOR,
+        {"track_io_timing", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects timing statistics for database I/O activity."),
             NULL
         },
@@ -4300,7 +4300,7 @@ static struct config_string ConfigureNamesString[] =
     },
 
     {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
+        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
             gettext_noop("Writes temporary statistics files to the specified directory."),
             NULL,
             GUC_SUPERUSER_ONLY
@@ -4636,7 +4636,7 @@ static struct config_enum ConfigureNamesEnum[] =
     },
 
     {
-        {"track_functions", PGC_SUSET, STATS_COLLECTOR,
+        {"track_functions", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects function-level statistics on database activity."),
             NULL
         },
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ac02bd0c00..e813af7676 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -574,7 +574,7 @@
 # STATISTICS
 #------------------------------------------------------------------------------
 
-# - Query and Index Statistics Collector -
+# - Query and Index Activity Statistics -
 
 #track_activities = on
 #track_counts = on
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 208df557b8..4d07829ecd 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -124,7 +124,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index cf0b463eba..de7baafb88 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -83,6 +83,8 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
 
@@ -320,7 +322,6 @@ typedef enum BackendType
     B_WAL_SENDER,
     B_WAL_WRITER,
     B_ARCHIVER,
-    B_STATS_COLLECTOR,
     B_LOGGER,
 } BackendType;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c55dc1481c..eee9feb8f7 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL activity statistics facility.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -15,9 +15,11 @@
 #include "libpq/pqcomm.h"
 #include "miscadmin.h"
 #include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -41,35 +43,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_RESETSLRUCOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_SLRU,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -80,9 +53,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -113,18 +85,17 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_delta_live_tuples;
     PgStat_Counter t_delta_dead_tuples;
     PgStat_Counter t_changed_tuples;
+    bool        reset_changed_tuples;
 
     PgStat_Counter t_blocks_fetched;
     PgStat_Counter t_blocks_hit;
+
+    PgStat_Counter vacuum_count;
+    PgStat_Counter autovac_vacuum_count;
+    PgStat_Counter analyze_count;
+    PgStat_Counter autovac_analyze_count;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -158,6 +129,10 @@ typedef struct PgStat_TableStatus
     Oid            t_id;            /* table's OID */
     bool        t_shared;        /* is it a shared catalog? */
     struct PgStat_TableXactStatus *trans;    /* lowest subxact's counts */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_TableCounts t_counts;    /* event counts to be sent */
 } PgStat_TableStatus;
 
@@ -183,308 +158,32 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
-/* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
- */
-typedef struct PgStat_MsgHdr
-{
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgResetslrucounter Sent by the backend to tell the collector
- *                                to reset a SLRU counter
- * ----------
- */
-typedef struct PgStat_MsgResetslrucounter
-{
-    PgStat_MsgHdr m_hdr;
-    int            m_index;
-} PgStat_MsgResetslrucounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgSLRU            Sent by a backend to update SLRU statistics.
- * ----------
- */
-typedef struct PgStat_MsgSLRU
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Counter m_index;
-    PgStat_Counter m_blocks_zeroed;
-    PgStat_Counter m_blocks_hit;
-    PgStat_Counter m_blocks_read;
-    PgStat_Counter m_blocks_written;
-    PgStat_Counter m_blocks_exists;
-    PgStat_Counter m_flush;
-    PgStat_Counter m_truncate;
-} PgStat_MsgSLRU;
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
 /* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgTempFile
+typedef struct PgStat_BgWriter
 {
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter buf_alloc;
+    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+}            PgStat_BgWriter;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -516,98 +215,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgResetslrucounter msg_resetslrucounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgSLRU msg_slru;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Activity statistics data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -616,13 +225,9 @@ typedef union PgStat_Msg
 
 #define PGSTAT_FILE_FORMAT_ID    0x01A5BC9D
 
-/* ----------
- * PgStat_StatDBEntry            The collector's data per database
- * ----------
- */
-typedef struct PgStat_StatDBEntry
+
+typedef struct PgStat_StatDBCounts
 {
-    Oid            databaseid;
     PgStat_Counter n_xact_commit;
     PgStat_Counter n_xact_rollback;
     PgStat_Counter n_blocks_fetched;
@@ -632,7 +237,6 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_tuples_inserted;
     PgStat_Counter n_tuples_updated;
     PgStat_Counter n_tuples_deleted;
-    TimestampTz last_autovac_time;
     PgStat_Counter n_conflict_tablespace;
     PgStat_Counter n_conflict_lock;
     PgStat_Counter n_conflict_snapshot;
@@ -642,29 +246,72 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_temp_bytes;
     PgStat_Counter n_deadlocks;
     PgStat_Counter n_checksum_failures;
-    TimestampTz last_checksum_failure;
     PgStat_Counter n_block_read_time;    /* times in microseconds */
     PgStat_Counter n_block_write_time;
+} PgStat_StatDBCounts;
 
+/* ----------
+ * PgStat_StatDBEntry            The statistics per database
+ * ----------
+ */
+typedef struct PgStat_StatDBEntry
+{
+    Oid            databaseid;
+    TimestampTz last_autovac_time;
+    TimestampTz last_checksum_failure;
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;    /* time of db stats update */
+
+    PgStat_StatDBCounts counts;
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following must be last in the struct, because we don't write them
+     * out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables; /* current gen tables hash */
+    dshash_table_handle functions;    /* current gen functions hash */
 } PgStat_StatDBEntry;
 
 
+/*
+ * SLRU statistics kept in the shared stats
+ */
+typedef struct PgStat_StatSLRUEntry
+{
+    PgStat_Counter blocks_zeroed;
+    PgStat_Counter blocks_hit;
+    PgStat_Counter blocks_read;
+    PgStat_Counter blocks_written;
+    PgStat_Counter blocks_exists;
+    PgStat_Counter flush;
+    PgStat_Counter truncate;
+    TimestampTz stat_reset_timestamp;
+} PgStat_StatSLRUEntry;
+
+
 /* ----------
- * PgStat_StatTabEntry            The collector's data per table (or index)
+ * PgStat_HashMountInfo  dshash mount information
+ * ----------
+ */
+typedef struct PgStat_HashMountInfo
+{
+    HTAB       *snapshot_tables;    /* table entry snapshot */
+    HTAB       *snapshot_functions; /* function entry snapshot */
+    dshash_table *dshash_tables;    /* attached tables dshash */
+    dshash_table *dshash_functions; /* attached functions dshash */
+} PgStat_HashMountInfo;
+
+/* ----------
+ * PgStat_StatTabEntry            The statistics per table (or index)
  * ----------
  */
 typedef struct PgStat_StatTabEntry
 {
     Oid            tableid;
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
 
     PgStat_Counter numscans;
 
@@ -684,19 +331,15 @@ typedef struct PgStat_StatTabEntry
     PgStat_Counter blocks_fetched;
     PgStat_Counter blocks_hit;
 
-    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
     PgStat_Counter vacuum_count;
-    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_vacuum_count;
-    TimestampTz analyze_timestamp;    /* user initiated */
     PgStat_Counter analyze_count;
-    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_analyze_count;
 } PgStat_StatTabEntry;
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            per function stats data
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
@@ -709,9 +352,8 @@ typedef struct PgStat_StatFuncEntry
     PgStat_Counter f_self_time;
 } PgStat_StatFuncEntry;
 
-
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -727,7 +369,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -745,21 +387,6 @@ typedef struct PgStat_GlobalStats
     TimestampTz stat_reset_timestamp;
 } PgStat_GlobalStats;
 
-/*
- * SLRU statistics kept in the stats collector
- */
-typedef struct PgStat_SLRUStats
-{
-    PgStat_Counter blocks_zeroed;
-    PgStat_Counter blocks_hit;
-    PgStat_Counter blocks_read;
-    PgStat_Counter blocks_written;
-    PgStat_Counter blocks_exists;
-    PgStat_Counter flush;
-    PgStat_Counter truncate;
-    TimestampTz stat_reset_timestamp;
-} PgStat_SLRUStats;
-
 
 /* ----------
  * Backend states
@@ -808,7 +435,6 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
     WAIT_EVENT_WAL_RECEIVER_MAIN,
@@ -1054,7 +680,7 @@ typedef struct PgBackendGSSStatus
  *
  * Each live backend maintains a PgBackendStatus struct in shared memory
  * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
+ * BackendId, but that is not critical.)  Note that the stats-writing code
  * has no involvement in, or even access to, these structs.
  *
  * Each auxiliary process also maintains a PgBackendStatus struct in shared
@@ -1251,13 +877,15 @@ extern PGDLLIMPORT bool pgstat_track_counts;
 extern PGDLLIMPORT int pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used; will be removed along with the GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1272,29 +900,26 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
 
+/* File input/output functions  */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 extern void pgstat_reset_slru_counter(const char *);
 
@@ -1456,8 +1081,8 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                                       void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
+extern void pgstat_report_archiver(const char *xlog, bool failed);
+extern void pgstat_report_bgwriter(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1466,13 +1091,16 @@ extern void pgstat_send_bgwriter(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_snapshot(bool shared,
+                                                                Oid relid);
+extern void pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
-extern PgStat_SLRUStats *pgstat_fetch_slru(void);
+extern PgStat_StatSLRUEntry *pgstat_fetch_slru(void);
 
 extern void pgstat_count_slru_page_zeroed(int slru_idx);
 extern void pgstat_count_slru_page_hit(int slru_idx);
@@ -1483,5 +1111,6 @@ extern void pgstat_count_slru_flush(int slru_idx);
 extern void pgstat_count_slru_truncate(int slru_idx);
 extern const char *pgstat_slru_name(int slru_idx);
 extern int    pgstat_slru_index(const char *name);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index c04ae97148..ce2b79d1bc 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -219,6 +219,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SHARED_TIDBITMAP,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_PER_XACT_PREDICATE_LIST,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 454c2df487..2c65231b04 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -88,7 +88,7 @@ enum config_group
     PROCESS_TITLE,
     STATS,
     STATS_MONITORING,
-    STATS_COLLECTOR,
+    STATS_ACTIVITY,
     AUTOVACUUM,
     CLIENT_CONN,
     CLIENT_CONN_STATEMENT,
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 83a15f6795..77d1572a99 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.18.2

From c3315739a521a15fbf6deed13e2d288aca95bff8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:09 +0900
Subject: [PATCH v35 6/7] Doc part of shared-memory based stats collector.

---
 doc/src/sgml/catalogs.sgml          |   6 +-
 doc/src/sgml/config.sgml            |  34 ++++----
 doc/src/sgml/high-availability.sgml |  13 +--
 doc/src/sgml/monitoring.sgml        | 128 +++++++++++++---------------
 doc/src/sgml/ref/pg_dump.sgml       |   9 +-
 5 files changed, 91 insertions(+), 99 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 700271fd40..fad9f94ddb 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -9177,9 +9177,9 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
   <para>
    <xref linkend="view-table"/> lists the system views described here.
    More detailed documentation of each view follows below.
-   There are some additional views that provide access to the results of
-   the statistics collector; they are described in <xref
-   linkend="monitoring-stats-views-table"/>.
+   There are some additional views that provide access to the activity
+   statistics; they are described in
+   <xref linkend="monitoring-stats-views-table"/>.
   </para>
 
   <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index aca8f73a50..679135c6b6 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7113,11 +7113,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
     <title>Run-time Statistics</title>
 
     <sect2 id="runtime-config-statistics-collector">
-     <title>Query and Index Statistics Collector</title>
+     <title>Query and Index Activity Statistics</title>
 
      <para>
-      These parameters control server-wide statistics collection features.
-      When statistics collection is enabled, the data that is produced can be
+      These parameters control server-wide activity statistics features.
+      When activity statistics is enabled, the data that is produced can be
       accessed via the <structname>pg_stat</structname> and
       <structname>pg_statio</structname> family of system views.
       Refer to <xref linkend="monitoring"/> for more information.
@@ -7133,14 +7133,13 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables the collection of information on the currently
-        executing command of each session, along with the time when
-        that command began execution. This parameter is on by
-        default. Note that even when enabled, this information is not
-        visible to all users, only to superusers and the user owning
-        the session being reported on, so it should not represent a
-        security risk.
-        Only superusers can change this setting.
+        Enables activity tracking on the currently executing command of
+        each session, along with the time when that command began
+        execution. This parameter is on by default. Note that even when
+        enabled, this information is not visible to all users, only to
+        superusers and the user owning the session being reported on, so it
+        should not represent a security risk.  Only superusers can change this
+        setting.
        </para>
       </listitem>
      </varlistentry>
@@ -7171,9 +7170,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables collection of statistics on database activity.
+        Enables tracking of database activity.
         This parameter is on by default, because the autovacuum
-        daemon needs the collected information.
+        daemon needs the activity information.
         Only superusers can change this setting.
        </para>
       </listitem>
@@ -8234,7 +8233,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Specifies the fraction of the total number of heap tuples counted in
-        the previous statistics collection that can be inserted without
+        the previously collected statistics that can be inserted without
         incurring an index scan at the <command>VACUUM</command> cleanup stage.
         This setting currently applies to B-tree indexes only.
        </para>
@@ -8246,9 +8245,10 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         the index contains deleted pages that can be recycled during cleanup.
         Index statistics are considered to be stale if the number of newly
         inserted tuples exceeds the <varname>vacuum_cleanup_index_scale_factor</varname>
-        fraction of the total number of heap tuples detected by the previous
-        statistics collection. The total number of heap tuples is stored in
-        the index meta-page. Note that the meta-page does not include this data
+
+        fraction of the total number of heap tuples in the previously
+        collected statistics. The total number of heap tuples is stored in the
+        index meta-page. Note that the meta-page does not include this data
         until <command>VACUUM</command> finds no dead tuples, so B-tree index
         scan at the cleanup stage can only be skipped if the second and
         subsequent <command>VACUUM</command> cycles detect no dead tuples.
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 65c3fc62a9..56ef2f56f6 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2368,12 +2368,13 @@ LOG:  database system is ready to accept read only connections
    </para>
 
    <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
-    index usage, etc., will be recorded normally on the standby. Replayed
-    actions will not duplicate their effects on primary, so replaying an
-    insert will not increment the Inserts column of pg_stat_user_tables.
-    The stats file is deleted at the start of recovery, so stats from primary
-    and standby will differ; this is considered a feature, not a bug.
+    The activity statistics are collected during recovery. All scans, reads,
+    blocks, index usage, etc., will be recorded normally on the
+    standby. Replayed actions will not duplicate their effects on primary, so
+    replaying an insert will not increment the Inserts column of
+    pg_stat_user_tables.  The activity statistics are reset at the start of
+    recovery, so stats from primary and standby will differ; this is
+    considered a feature, not a bug.
    </para>
 
    <para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 89662cc0a3..4973d69cfc 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -22,7 +22,7 @@
   <para>
    Several tools are available for monitoring database activity and
    analyzing performance.  Most of this chapter is devoted to describing
-   <productname>PostgreSQL</productname>'s statistics collector,
+   <productname>PostgreSQL</productname>'s activity statistics,
    but one should not neglect regular Unix monitoring programs such as
    <command>ps</command>, <command>top</command>, <command>iostat</command>, and <command>vmstat</command>.
    Also, once one has identified a
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    master server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   master process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   master process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
@@ -130,20 +128,21 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
  </sect1>
 
  <sect1 id="monitoring-stats">
-  <title>The Statistics Collector</title>
+  <title>The Activity Statistics</title>
 
   <indexterm zone="monitoring-stats">
    <primary>statistics</primary>
   </indexterm>
 
   <para>
-   <productname>PostgreSQL</productname>'s <firstterm>statistics collector</firstterm>
-   is a subsystem that supports collection and reporting of information about
-   server activity.  Presently, the collector can count accesses to tables
-   and indexes in both disk-block and individual-row terms.  It also tracks
-   the total number of rows in each table, and information about vacuum and
-   analyze actions for each table.  It can also count calls to user-defined
-   functions and the total time spent in each one.
+   <productname>PostgreSQL</productname>'s <firstterm>activity
+   statistics</firstterm> is a subsystem that supports tracking and reporting
+   of information about server activity.  Presently, the activity statistics
+   tracks the count of accesses to tables and indexes in both disk-block and
+   individual-row terms.  It also tracks the total number of rows in each
+   table, and information about vacuum and analyze actions for each table.  It
+   can also track calls to user-defined functions and the total time spent in
+   each one.
   </para>
 
   <para>
@@ -151,15 +150,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    information about exactly what is going on in the system right now, such as
    the exact command currently being executed by other server processes, and
    which other connections exist in the system.  This facility is independent
-   of the collector process.
+   of the activity statistics.
   </para>
 
  <sect2 id="monitoring-stats-setup">
-  <title>Statistics Collection Configuration</title>
+  <title>Activity Statistics Configuration</title>
 
   <para>
-   Since collection of statistics adds some overhead to query execution,
-   the system can be configured to collect or not collect information.
+   Since tracking for the activity statistics adds some overhead to query
+   execution, the system can be configured to track or not track activity.
    This is controlled by configuration parameters that are normally set in
    <filename>postgresql.conf</filename>.  (See <xref linkend="runtime-config"/> for
    details about setting configuration parameters.)
@@ -172,7 +171,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The parameter <xref linkend="guc-track-counts"/> controls whether
-   statistics are collected about table and index accesses.
+   to track activity about table and index accesses.
   </para>
 
   <para>
@@ -196,18 +195,12 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
   </para>
 
   <para>
-   The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
-   When the server shuts down cleanly, a permanent copy of the statistics
-   data is stored in the <filename>pg_stat</filename> subdirectory, so that
-   statistics can be retained across server restarts.  When recovery is
-   performed at server start (e.g. after immediate shutdown, server crash,
-   and point-in-time recovery), all statistics counters are reset.
+   The activity statistics is stored in shared memory. When the server shuts
+   down cleanly, a permanent copy of the statistics data is stored in
+   the <filename>pg_stat</filename> subdirectory, so that statistics can be
+   retained across server restarts.  When recovery is performed at server
+   start (e.g. after immediate shutdown, server crash, and point-in-time
+   recovery), all statistics counters are reset.
   </para>
 
  </sect2>
@@ -220,48 +213,46 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    linkend="monitoring-stats-dynamic-views-table"/>, are available to show
    the current state of the system. There are also several other
    views, listed in <xref
-   linkend="monitoring-stats-views-table"/>, available to show the results
-   of statistics collection.  Alternatively, one can
-   build custom views using the underlying statistics functions, as discussed
-   in <xref linkend="monitoring-stats-functions"/>.
+   linkend="monitoring-stats-views-table"/>, available to show the activity
+   statistics.  Alternatively, one can build custom views using the underlying
+   statistics functions, as discussed in
+   <xref linkend="monitoring-stats-functions"/>.
   </para>
 
   <para>
-   When using the statistics to monitor collected data, it is important
-   to realize that the information does not update instantaneously.
-   Each individual server process transmits new statistical counts to
-   the collector just before going idle; so a query or transaction still in
-   progress does not affect the displayed totals.  Also, the collector itself
-   emits a new report at most once per <varname>PGSTAT_STAT_INTERVAL</varname>
-   milliseconds (500 ms unless altered while building the server).  So the
-   displayed information lags behind actual activity.  However, current-query
-   information collected by <varname>track_activities</varname> is
-   always up-to-date.
+   When using the activity statistics, it is important to realize that the
+   information does not update instantaneously.  Each individual server process
+   writes out new statistical counts just before going idle, but no more often
+   than once per <varname>PGSTAT_STAT_INTERVAL</varname> milliseconds (1 second
+   unless altered while building the server); so a query or transaction still
+   in progress does not affect the displayed totals.  However, current-query
+   information tracked by <varname>track_activities</varname> is always
+   up-to-date.
   </para>
 
   <para>
    Another important point is that when a server process is asked to display
-   any of these statistics, it first fetches the most recent report emitted by
-   the collector process and then continues to use this snapshot for all
-   statistical views and functions until the end of its current transaction.
-   So the statistics will show static information as long as you continue the
-   current transaction.  Similarly, information about the current queries of
-   all sessions is collected when any such information is first requested
-   within a transaction, and the same information will be displayed throughout
-   the transaction.
-   This is a feature, not a bug, because it allows you to perform several
-   queries on the statistics and correlate the results without worrying that
-   the numbers are changing underneath you.  But if you want to see new
-   results with each query, be sure to do the queries outside any transaction
-   block.  Alternatively, you can invoke
+   any of these statistics, it first reads the current statistics and then
+   continues to use this snapshot for all statistical views and functions
+   until the end of its current transaction.  So the statistics will show
+   static information as long as you continue the current transaction.
+   Similarly, information about the current queries of all sessions is tracked
+   when any such information is first requested within a transaction, and the
+   same information will be displayed throughout the transaction.  This is a
+   feature, not a bug, because it allows you to perform several queries on the
+   statistics and correlate the results without worrying that the numbers are
+   changing underneath you.  But if you want to see new results with each
+   query, be sure to do the queries outside any transaction block.
+   Alternatively, you can invoke
    <function>pg_stat_clear_snapshot</function>(), which will discard the
    current transaction's statistics snapshot (if any).  The next use of
    statistical information will cause a new snapshot to be fetched.
   </para>
-
+  
   <para>
-   A transaction can also see its own statistics (as yet untransmitted to the
-   collector) in the views <structname>pg_stat_xact_all_tables</structname>,
+   A transaction can also see its own statistics (as yet unwritten to the
+   server-wide activity statistics) in the
+   views <structname>pg_stat_xact_all_tables</structname>,
    <structname>pg_stat_xact_sys_tables</structname>,
    <structname>pg_stat_xact_user_tables</structname>, and
    <structname>pg_stat_xact_user_functions</structname>.  These numbers do not act as
@@ -620,7 +611,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    kernel's I/O cache, and might therefore still be fetched without
    requiring a physical read. Users interested in obtaining more
    detailed information on <productname>PostgreSQL</productname> I/O behavior are
-   advised to use the <productname>PostgreSQL</productname> statistics collector
+   advised to use the <productname>PostgreSQL</productname> activity statistics
    in combination with operating system utilities that allow insight
    into the kernel's handling of I/O.
   </para>
@@ -1060,10 +1051,6 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>LogicalLauncherMain</literal></entry>
       <entry>Waiting in main loop of logical replication launcher process.</entry>
      </row>
-     <row>
-      <entry><literal>PgStatMain</literal></entry>
-      <entry>Waiting in main loop of statistics collector process.</entry>
-     </row>
      <row>
       <entry><literal>RecoveryWalStream</literal></entry>
       <entry>Waiting in main loop of startup process for WAL to arrive, during
@@ -1788,6 +1775,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
     </thead>
 
     <tbody>
+     <row>
+      <entry><literal>ActivityStatistics</literal></entry>
+      <entry>Waiting to write out activity statistics to shared memory.</entry>
+     </row>
      <row>
       <entry><literal>AddinShmemInit</literal></entry>
       <entry>Waiting to manage an extension's space allocation in shared
@@ -5739,9 +5730,10 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
      <entry><literal>performing final cleanup</literal></entry>
      <entry>
        <command>VACUUM</command> is performing final cleanup.  During this phase,
-       <command>VACUUM</command> will vacuum the free space map, update statistics
-       in <literal>pg_class</literal>, and report statistics to the statistics
-       collector.  When this phase is completed, <command>VACUUM</command> will end.
+       <command>VACUUM</command> will vacuum the free space map, update
+       statistics in <literal>pg_class</literal>, and update the system-wide
+       activity statistics.  When this phase is completed,
+       <command>VACUUM</command> will end.
      </entry>
     </row>
    </tbody>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 2f0807e912..ce5c60de10 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1287,11 +1287,10 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The database activity of <application>pg_dump</application> is
-   normally collected by the statistics collector.  If this is
-   undesirable, you can set parameter <varname>track_counts</varname>
-   to false via <envar>PGOPTIONS</envar> or the <literal>ALTER
-   USER</literal> command.
+   The database activity of <application>pg_dump</application> is normally
+   collected.  If this is undesirable, you can set
+   parameter <varname>track_counts</varname> to false
+   via <envar>PGOPTIONS</envar> or the <literal>ALTER USER</literal> command.
   </para>
 
  </refsect1>
-- 
2.18.2

From 25f3d543ff0efb30a329b9d10d47126304be8642 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 17:00:44 +0900
Subject: [PATCH v35 7/7] Remove the GUC stats_temp_directory

The GUC used to specify the directory to store temporary statistics
files. It is no longer needed by the stats collector but still used by
the programs in bin and contrib, and maybe other extensions. Thus this
patch removes the GUC but some backing variables and macro definitions
are left alone for backward compatibility.
---
 doc/src/sgml/backup.sgml                      |  2 -
 doc/src/sgml/config.sgml                      | 19 ---------
 doc/src/sgml/storage.sgml                     |  3 +-
 src/backend/postmaster/pgstat.c               | 13 +++---
 src/backend/replication/basebackup.c          | 13 ++----
 src/backend/utils/misc/guc.c                  | 41 -------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/pgstat.h                          |  5 ++-
 src/test/perl/PostgresNode.pm                 |  4 --
 9 files changed, 13 insertions(+), 88 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index bdc9026c62..2885540362 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1146,8 +1146,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 679135c6b6..781556625c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7226,25 +7226,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index ea08d0b614..71a8b8b11a 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -122,8 +122,7 @@ Item
 
 <row>
  <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
+ <entry>Subdirectory containing ephemeral files for extensions</entry>
 </row>
 
 <row>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 76b59df408..9f24777fb7 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -95,15 +95,12 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
+/*
+ * This used to be a GUC variable.  It is no longer used in this file, but is
+ * kept, holding the default value, for backward compatibility with
+ * extensions.
  */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+char       *pgstat_stat_directory = PG_STAT_TMP_DIR;
 
 typedef struct StatsShmemStruct
 {
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index b6c7a8bc3c..8bd01c9047 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -269,7 +269,6 @@ perform_base_backup(basebackup_options *opt)
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
     backup_manifest_info manifest;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
     backup_total = 0;
@@ -292,8 +291,6 @@ perform_base_backup(basebackup_options *opt)
     Assert(CurrentResourceOwner == NULL);
     CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -327,13 +324,9 @@ perform_base_backup(basebackup_options *opt)
          * Calculate the relative path of temporary statistics directory in
          * order to skip the files which are located in that directory later.
          */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
+
+        Assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
+        statrelpath = psprintf("./%s", PG_STAT_TMP_DIR);
 
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index f073025f69..d19cbb77c5 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -198,7 +198,6 @@ static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -4299,17 +4298,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11586,35 +11574,6 @@ check_maintenance_io_concurrency(int *newval, void **extra, GucSource source)
     return true;
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e813af7676..21152e11c6 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -581,7 +581,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index eee9feb8f7..c50c73ef27 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -32,7 +32,10 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
+/*
+ * This used to be the directory to store temporary statistics data in but is
+ * no longer used. Defined here for backward compatibility.
+ */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
 
 /* Values for track_functions GUC variable --- order is significant! */
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 1407359aef..25fde8a3a4 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -455,10 +455,6 @@ sub init
     print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
       if defined $ENV{TEMP_CONFIG};
 
-    # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
-    # concurrently must not share a stats_temp_directory.
-    print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
     if ($params{allows_streaming})
     {
         if ($params{allows_streaming} eq "logical")
-- 
2.18.2


Re: shared-memory based stats collector

From
Stephen Frost
Date:
Greetings,

* Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:
> At Mon, 01 Jun 2020 18:00:01 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > Rebased on the current HEAD. 36ac359d36 conflicts with this. Tranche
>
> Hmm. This conflicts with 0fd2a79a63. Rebased on it.

Thanks for working on this and keeping it updated!

I've started taking a look and at least right off...

> >From 4926e50e7635548f86dcd0d36cbf56d168a5d242 Mon Sep 17 00:00:00 2001
> From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
> Date: Mon, 16 Mar 2020 17:15:35 +0900
> Subject: [PATCH v35 1/7] Use standard crash handler in archiver.
>
> Commit 8e19a82640 changed the SIGQUIT handler of almost all processes
> so that it does not run atexit callbacks, for safety. The archiver
> process should behave the same way for the same reason. The exit status
> changes from 1 to 2, but that makes no behavioral difference.

Shouldn't this:

a) be back-patched, as the other change was
b) also include the same change for the stats collector (which I realize
   is removed later on in this patch set, but we're talking about fixing
   existing things..), for the same reason, and because there isn't much
   point in trying to write out the stats after we get a SIGQUIT, since
   we're just going to blow them away again when we go through crash
   recovery..?

Might be good to have a separate thread to address these changes.

I've looked through (some of) this thread and through the patches also
and hope to provide a review of the bits that should be targeting v14
(unlike the above) soon.

Thanks,

Stephen


Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
At Thu, 3 Sep 2020 13:16:59 -0400, Stephen Frost <sfrost@snowman.net> wrote in 
> Greetings,
> 
> * Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:
> > At Mon, 01 Jun 2020 18:00:01 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > > Rebased on the current HEAD. 36ac359d36 conflicts with this. Tranche
> > 
> > Hmm. This conflicts with 0fd2a79a63. Rebased on it.
> 
> Thanks for working on this and keeping it updated!
> 
> I've started taking a look and at least right off...
> 
> > >From 4926e50e7635548f86dcd0d36cbf56d168a5d242 Mon Sep 17 00:00:00 2001
> > From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
> > Date: Mon, 16 Mar 2020 17:15:35 +0900
> > Subject: [PATCH v35 1/7] Use standard crash handler in archiver.
> > 
> > Commit 8e19a82640 changed the SIGQUIT handler of almost all processes
> > so that it does not run atexit callbacks, for safety. The archiver
> > process should behave the same way for the same reason. The exit status
> > changes from 1 to 2, but that makes no behavioral difference.
> 
> Shouldn't this:
> 
> a) be back-patched, as the other change was
> b) also include the same change for the stats collector (which I realize
>    is removed later on in this patch set, but we're talking about fixing
>    existing things..), for the same reason, and because there isn't much
>    point in trying to write out the stats after we get a SIGQUIT, since
>    we're just going to blow them away again when we go through crash
>    recovery..?
> 
> Might be good to have a separate thread to address these changes.

+1

> I've looked through (some of) this thread and through the patches also
> and hope to provide a review of the bits that should be targeting v14
> (unlike the above) soon.

Thanks. The patch currently turns out to cause heavier write contention
than the current stats collector when nearly a thousand backends
write heavily to the same table. I need to address that. Some options:

- Collect stats via sockets (in the same way as the current
  implementation) and write/read to/from shared memory.

- Use our own lock-free (maybe) ring-buffer before a stats write enters
  lock-waiting mode, then a new stats collector(!) process consumes the
  queue.

- Some other measures..
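
As a rough illustration of the ring-buffer option above (not code from the patch set), here is a minimal sketch of a lock-free ring using C11 atomics. It assumes a single producer and a single consumer, so it would need one ring per backend; StatsMsg, StatsRing, ring_push, ring_pop and RING_SIZE are hypothetical names chosen for this example only.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8				/* hypothetical; must be a power of two */

/* Hypothetical stats message; the real payload would differ. */
typedef struct StatsMsg
{
	unsigned	table_id;
	long		tuples_inserted;
} StatsMsg;

typedef struct StatsRing
{
	_Atomic size_t head;		/* next slot to write; producer only */
	_Atomic size_t tail;		/* next slot to read; consumer only */
	StatsMsg	slots[RING_SIZE];
} StatsRing;

/* Producer side: returns false when the ring is full, never blocks. */
static bool
ring_push(StatsRing *r, StatsMsg msg)
{
	size_t		head = atomic_load_explicit(&r->head, memory_order_relaxed);
	size_t		tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (head - tail == RING_SIZE)
		return false;			/* full: caller must fall back */
	r->slots[head & (RING_SIZE - 1)] = msg;
	/* publish the slot only after its contents are written */
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return true;
}

/* Consumer side: returns false when the ring is empty. */
static bool
ring_pop(StatsRing *r, StatsMsg *out)
{
	size_t		tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	size_t		head = atomic_load_explicit(&r->head, memory_order_acquire);

	if (head == tail)
		return false;
	*out = r->slots[tail & (RING_SIZE - 1)];
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return true;
}

/* Push ten messages into an 8-slot ring, then drain it and sum the values. */
long
ring_demo_total(void)
{
	static StatsRing ring;		/* static storage: zero-initialized */
	StatsMsg	m;
	long		total = 0;

	for (int i = 1; i <= 10; i++)
		(void) ring_push(&ring, (StatsMsg) {42, i});	/* last two rejected */
	while (ring_pop(&ring, &m))
		total += m.tuples_inserted;
	return total;
}
```

A full ring makes ring_push return false instead of blocking, so the writer still needs some fallback, such as keeping the counts locally and retrying later.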

Anyway, I'll post a rebased version soon.

regards.


-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
At Tue, 08 Sep 2020 17:01:47 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Thu, 3 Sep 2020 13:16:59 -0400, Stephen Frost <sfrost@snowman.net> wrote in
> > I've looked through (some of) this thread and through the patches also
> > and hope to provide a review of the bits that should be targeting v14
> > (unlike the above) soon.
> 
> Thanks. The patch currently turns out to cause heavier write contention
> than the current stats collector when nearly a thousand backends
> write heavily to the same table. I need to address that. Some options:
> 
> - Collect stats via sockets (in the same way as the current
>   implementation) and write/read to/from shared memory.
> 
> - Use our own lock-free (maybe) ring-buffer before a stats write enters
>   lock-waiting mode, then a new stats collector(!) process consumes the
>   queue.
> 
> - Some other measures..
> 
> Anyway, I'll post a rebased version soon.

Here is the rebased version. I'll keep looking for a way to solve the
contention.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 0eb08bfeea9a20d50c5391edee600f420acfa4b9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Mon, 16 Mar 2020 17:15:35 +0900
Subject: [PATCH v36 1/7] Use standard crash handler in archiver.

Commit 8e19a82640 changed the SIGQUIT handler of almost all processes
so that it does not run atexit callbacks, for safety. The archiver
process should behave the same way for the same reason. The exit status
changes from 1 to 2, but that makes no behavioral difference.
---
 src/backend/postmaster/pgarch.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 01ffd6513c..37be0e2bbb 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -96,7 +96,6 @@ static pid_t pgarch_forkexec(void);
 #endif
 
 NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_exit(SIGNAL_ARGS);
 static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
@@ -229,7 +228,7 @@ PgArchiverMain(int argc, char *argv[])
     pqsignal(SIGHUP, SignalHandlerForConfigReload);
     pqsignal(SIGINT, SIG_IGN);
     pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
-    pqsignal(SIGQUIT, pgarch_exit);
+    pqsignal(SIGQUIT, SignalHandlerForCrashExit);
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
     pqsignal(SIGUSR1, pgarch_waken);
@@ -246,14 +245,6 @@ PgArchiverMain(int argc, char *argv[])
     exit(0);
 }
 
-/* SIGQUIT signal handler for archiver process */
-static void
-pgarch_exit(SIGNAL_ARGS)
-{
-    /* SIGQUIT means curl up and die ... */
-    exit(1);
-}
-
 /* SIGUSR1 signal handler for archiver process */
 static void
 pgarch_waken(SIGNAL_ARGS)
-- 
2.18.4
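
The behavioral point of this patch can be shown outside PostgreSQL. The sketch below assumes a POSIX system and uses hypothetical names (crash_exit_handler, marker_callback, child_status_after_sigquit). It is not the PostgreSQL handler itself; it only illustrates that a SIGQUIT handler calling _exit(2) terminates without running atexit callbacks, whereas the removed pgarch_exit() called exit(1), which does run them.

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Crash-style handler: _exit() is async-signal-safe and skips atexit. */
static void
crash_exit_handler(int signo)
{
	(void) signo;
	_exit(2);
}

/* atexit callback; it would change the exit status to 99 if it ever ran. */
static void
marker_callback(void)
{
	_exit(99);
}

/*
 * Fork a child that registers the atexit callback, installs the handler and
 * sends itself SIGQUIT.  Returns the child's exit status, or -1 on failure.
 */
int
child_status_after_sigquit(void)
{
	pid_t		pid = fork();
	int			status;

	if (pid < 0)
		return -1;
	if (pid == 0)
	{
		atexit(marker_callback);
		signal(SIGQUIT, crash_exit_handler);
		raise(SIGQUIT);
		_exit(0);				/* not reached: the handler already exited */
	}
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The child exits with status 2 rather than 99, which shows that the callback registered with atexit never ran.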

From 4a83b0e858f695ab855c3c1e184b8f235ee65521 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:03 +0900
Subject: [PATCH v36 2/7] sequential scan for dshash

Dshash did not allow scanning all entries sequentially. This adds that
functionality. The interface is similar to, but a bit different from,
both that of dynahash and the simple dshash search functions. One of
the most significant differences is that the sequential scan interface
of dshash always needs a call to dshash_seq_term when the scan ends.
Another is locking. Dshash holds the partition lock when returning an
entry; dshash_seq_next() also holds the lock when returning an entry,
but callers shouldn't release it, since the lock is essential to
continue the scan. The seqscan interface allows entry deletion during a
scan. In-scan deletion should be performed by dshash_delete_current().
---
 src/backend/lib/dshash.c | 149 ++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  21 ++++++
 2 files changed, 169 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 78ccf03217..1ef093e2e9 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -127,6 +127,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +157,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -324,7 +332,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -592,6 +600,145 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should always be called when a scan has finished.
+ * The caller may delete returned elements in the midst of a scan by using
+ * dshash_delete_current(). exclusive must be true to delete elements.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool exclusive)
+{
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->exclusive = exclusive;
+}
+
+/*
+ * Returns the next element.
+ *
+ * Returned elements are locked and the caller must not explicitly release
+ * them. The lock is released by a later dshash_seq_next() or dshash_seq_term().
+ */
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert(status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+        LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                      status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        status->curpartition = partition;
+
+        /* resize doesn't happen from now until seq scan ends */
+        status->nbuckets =
+            NUM_BUCKETS(status->hash_table->control->size_log2);
+        ensure_valid_bucket_pointers(status->hash_table);
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        int next_partition;
+
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            return NULL;
+        }
+
+        /* Check whether we move to the next partition */
+        next_partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+
+        if (status->curpartition != next_partition)
+        {
+            /*
+             * Move to the next partition. Lock the next partition, then
+             * release the current one, not in the reverse order, to avoid
+             * concurrent resizing.  Avoid deadlock by taking locks in the
+             * same order as resize().
+             */
+            LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                         next_partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                         status->curpartition));
+            status->curpartition = next_partition;
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * The caller may delete the item. Store the next item in case of deletion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+/*
+ * Terminates the seqscan and releases all locks.
+ *
+ * Should always be called when finishing or exiting a seqscan.
+ */
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+
+    if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+/* Remove the current entry during a seq scan. */
+void
+dshash_delete_current(dshash_seq_status *status)
+{
+    dshash_table       *hash_table    = status->hash_table;
+    dshash_table_item  *item        = status->curitem;
+    size_t                partition    = PARTITION_FOR_HASH(item->hash);
+
+    Assert(status->exclusive);
+    Assert(hash_table->control->magic == DSHASH_MAGIC);
+    Assert(hash_table->find_locked);
+    Assert(hash_table->find_exclusively_locked);
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition),
+                                LW_EXCLUSIVE));
+
+    delete_item(hash_table, item);
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b86df68e77..0ca9514021 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,21 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state. The detail is exposed to let users know the storage
+ * size but it should be considered as an opaque type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;    /* dshash table working on */
+    int                    curbucket;    /* bucket number we are at */
+    int                    nbuckets;    /* total number of buckets in the dshash */
+    dshash_table_item  *curitem;    /* item we are currently at */
+    dsa_pointer            pnextitem;    /* dsa-pointer to the next item */
+    int                    curpartition;    /* partition number we are at */
+    bool                exclusive;    /* locking mode */
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
                                    const dshash_parameters *params,
@@ -80,6 +95,12 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
+extern void dshash_delete_current(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.18.4
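
The lock hand-off performed by dshash_seq_next() can be modelled in a few lines. This is a toy sketch under stated assumptions, not the patch's code: pthread mutexes stand in for LWLocks, the table has a fixed size (no resizing or deletion), and every Toy* / toy_* name is hypothetical. It shows the init/next/term protocol and the "lock the next partition, then release the current one" ordering.

```c
#include <pthread.h>
#include <stddef.h>

#define NPARTS 4
#define BUCKETS_PER_PART 2
#define NBUCKETS (NPARTS * BUCKETS_PER_PART)

typedef struct Item
{
	int			value;
	struct Item *next;
} Item;

typedef struct ToyTable
{
	pthread_mutex_t part_lock[NPARTS];
	Item	   *buckets[NBUCKETS];
} ToyTable;

typedef struct ToyScan
{
	ToyTable   *table;
	int			curbucket;
	int			curpartition;	/* -1 until the first lock is taken */
	Item	   *nextitem;
} ToyScan;

static void
toy_seq_init(ToyScan *s, ToyTable *t)
{
	s->table = t;
	s->curbucket = 0;
	s->curpartition = -1;
	s->nextitem = NULL;
}

/* Returns the next item with its partition locked, or NULL at the end. */
static Item *
toy_seq_next(ToyScan *s)
{
	Item	   *item;

	if (s->curpartition < 0)
	{
		pthread_mutex_lock(&s->table->part_lock[0]);
		s->curpartition = 0;
		s->nextitem = s->table->buckets[0];
	}
	while (s->nextitem == NULL)
	{
		int			nextpart;

		if (++s->curbucket >= NBUCKETS)
			return NULL;		/* lock stays held until toy_seq_term() */
		nextpart = s->curbucket / BUCKETS_PER_PART;
		if (nextpart != s->curpartition)
		{
			/* take the next lock before dropping the current one */
			pthread_mutex_lock(&s->table->part_lock[nextpart]);
			pthread_mutex_unlock(&s->table->part_lock[s->curpartition]);
			s->curpartition = nextpart;
		}
		s->nextitem = s->table->buckets[s->curbucket];
	}
	item = s->nextitem;
	s->nextitem = item->next;	/* remember the successor up front */
	return item;
}

/* Must always be called, whether or not the scan ran to completion. */
static void
toy_seq_term(ToyScan *s)
{
	if (s->curpartition >= 0)
		pthread_mutex_unlock(&s->table->part_lock[s->curpartition]);
	s->curpartition = -1;
}

/* Build a small table holding 1, 2 and 4, scan it, and return the sum. */
int
toy_scan_sum(void)
{
	Item		c = {4, NULL};
	Item		b = {2, NULL};
	Item		a = {1, &b};
	ToyTable	t;
	ToyScan		s;
	Item	   *it;
	int			sum = 0;

	for (int i = 0; i < NPARTS; i++)
		pthread_mutex_init(&t.part_lock[i], NULL);
	for (int i = 0; i < NBUCKETS; i++)
		t.buckets[i] = NULL;
	t.buckets[1] = &a;			/* a -> b chained in bucket 1 */
	t.buckets[6] = &c;

	toy_seq_init(&s, &t);
	while ((it = toy_seq_next(&s)) != NULL)
		sum += it->value;
	toy_seq_term(&s);

	for (int i = 0; i < NPARTS; i++)
		pthread_mutex_destroy(&t.part_lock[i]);
	return sum;
}
```

Saving the successor before returning an item is what lets the real interface delete the current entry mid-scan without losing its place.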

From 2d7795df24308e52340613dfe3f5106432ff1fee Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:57 +0900
Subject: [PATCH v36 3/7] Add conditional lock feature to dshash

Dshash currently waits for locks unconditionally. That is inconvenient
when we want to avoid being blocked by other processes. This commit
adds an alternative to dshash_find and dshash_find_or_insert,
dshash_find_extended, that allows immediate return on lock failure.
---
 src/backend/lib/dshash.c | 98 +++++++++++++++++++++-------------------
 src/include/lib/dshash.h |  3 ++
 2 files changed, 55 insertions(+), 46 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 1ef093e2e9..d7ee6de11e 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -383,6 +383,10 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  * the caller must take care to ensure that the entry is not left corrupted.
  * The lock mode is either shared or exclusive depending on 'exclusive'.
  *
+ * If found is not NULL, *found is set to true if the key is found in the hash
+ * table. If the key is not found, *found is set to false and a pointer to a
+ * newly created entry is returned.
+ *
  * The caller must not lock a lock already.
  *
  * Note that the lock held is in fact an LWLock, so interrupts will be held on
@@ -392,36 +396,7 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
 {
-    dshash_hash hash;
-    size_t        partition;
-    dshash_table_item *item;
-
-    hash = hash_key(hash_table, key);
-    partition = PARTITION_FOR_HASH(hash);
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
-
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
-    ensure_valid_bucket_pointers(hash_table);
-
-    /* Search the active bucket. */
-    item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
-
-    if (!item)
-    {
-        /* Not found. */
-        LWLockRelease(PARTITION_LOCK(hash_table, partition));
-        return NULL;
-    }
-    else
-    {
-        /* The caller will free the lock by calling dshash_release_lock. */
-        hash_table->find_locked = true;
-        hash_table->find_exclusively_locked = exclusive;
-        return ENTRY_FROM_ITEM(item);
-    }
+    return dshash_find_extended(hash_table, key, exclusive, false, false, NULL);
 }
 
 /*
@@ -439,31 +414,60 @@ dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
 {
-    dshash_hash hash;
-    size_t        partition_index;
-    dshash_partition *partition;
+    return dshash_find_extended(hash_table, key, true, false, true, found);
+}
+
+
+/*
+ * Find the key in the hash table.
+ *
+ * "exclusive" is the lock mode in which the partition for the returned item
+ * is locked.  If "nowait" is true, the function returns NULL immediately if
+ * the required lock cannot be acquired.  "insert" indicates insert mode: if
+ * the key is not found, a new entry is inserted and *found is set to false;
+ * otherwise *found is set to true.  "found" must be non-null in this mode.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool insert, bool *found)
+{
+    dshash_hash hash = hash_key(hash_table, key);
+    size_t        partidx = PARTITION_FOR_HASH(hash);
+    dshash_partition *partition = &hash_table->control->partitions[partidx];
+    LWLockMode  lockmode = exclusive ? LW_EXCLUSIVE : LW_SHARED;
     dshash_table_item *item;
 
-    hash = hash_key(hash_table, key);
-    partition_index = PARTITION_FOR_HASH(hash);
-    partition = &hash_table->control->partitions[partition_index];
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
+    /* must be exclusive when insert allowed */
+    Assert(!insert || (exclusive && found != NULL));
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (!nowait)
+        LWLockAcquire(PARTITION_LOCK(hash_table, partidx), lockmode);
+    else if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partidx),
+                                       lockmode))
+        return NULL;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
     item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
 
     if (item)
-        *found = true;
+    {
+        if (found)
+            *found = true;
+    }
     else
     {
-        *found = false;
+        if (found)
+            *found = false;
+
+        if (!insert)
+        {
+            /* The caller didn't ask us to add a new entry. */
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+            return NULL;
+        }
 
         /* Check if we are getting too full. */
         if (partition->count > MAX_COUNT_PER_PARTITION(hash_table))
@@ -479,7 +483,8 @@ restart:
              * Give up our existing lock first, because resizing needs to
              * reacquire all the locks in the right order to avoid deadlocks.
              */
-            LWLockRelease(PARTITION_LOCK(hash_table, partition_index));
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+
             resize(hash_table, hash_table->size_log2 + 1);
 
             goto restart;
@@ -493,12 +498,13 @@ restart:
         ++partition->count;
     }
 
-    /* The caller must release the lock with dshash_release_lock. */
+    /* The caller will free the lock by calling dshash_release_lock. */
     hash_table->find_locked = true;
-    hash_table->find_exclusively_locked = true;
+    hash_table->find_exclusively_locked = exclusive;
     return ENTRY_FROM_ITEM(item);
 }
 
+
 /*
  * Remove an entry by key.  Returns true if the key was found and the
  * corresponding entry was removed.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 0ca9514021..5f7a60febd 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -91,6 +91,9 @@ extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                                    const void *key, bool *found);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+                                  bool exclusive, bool nowait, bool insert,
+                                  bool *found);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.18.4
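
The "nowait" behavior added here follows the usual try-lock pattern. The sketch below is an illustration only: pthread_mutex_trylock stands in for LWLockConditionalAcquire, and Counter, counter_add_nowait, counter_add_wait and nowait_demo are hypothetical names. It shows one way a caller might react to a busy lock, by keeping the update locally and flushing it later.

```c
#include <pthread.h>
#include <stdbool.h>

typedef struct Counter
{
	pthread_mutex_t lock;
	long		value;
} Counter;

/* Apply delta only if the lock is free; never blocks. */
static bool
counter_add_nowait(Counter *c, long delta)
{
	if (pthread_mutex_trylock(&c->lock) != 0)
		return false;			/* busy: caller decides what to do */
	c->value += delta;
	pthread_mutex_unlock(&c->lock);
	return true;
}

/* Blocking variant, used to flush updates that were deferred earlier. */
static void
counter_add_wait(Counter *c, long delta)
{
	pthread_mutex_lock(&c->lock);
	c->value += delta;
	pthread_mutex_unlock(&c->lock);
}

long
nowait_demo(void)
{
	Counter		shared;
	long		pending = 0;
	long		result;

	pthread_mutex_init(&shared.lock, NULL);
	shared.value = 0;

	/* Lock is free: the update is applied directly. */
	if (!counter_add_nowait(&shared, 5))
		pending += 5;

	/* Simulate contention by holding the lock during the attempt. */
	pthread_mutex_lock(&shared.lock);
	if (!counter_add_nowait(&shared, 7))
		pending += 7;			/* busy: keep the update locally */
	pthread_mutex_unlock(&shared.lock);

	/* Later, flush what could not be applied immediately. */
	counter_add_wait(&shared, pending);

	result = shared.value;
	pthread_mutex_destroy(&shared.lock);
	return result;
}
```

Note that in the patch dshash_find_extended() returns NULL both when the lock is busy and when the key is missing in non-insert mode, so a caller using "nowait" has to be prepared for either case.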

From 14bd15cad98e30d1e269087874f29465bd401a9f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:59:38 +0900
Subject: [PATCH v36 4/7] Make archiver process an auxiliary process

This is a preliminary patch for the shared-memory based stats collector.

The archiver process must be an auxiliary process because it uses
shared memory once stats data is moved into shared memory. Make the
process an auxiliary process so that it works.
---
 src/backend/access/transam/xlogarchive.c |   6 +-
 src/backend/bootstrap/bootstrap.c        |  22 ++--
 src/backend/postmaster/pgarch.c          | 130 +++--------------------
 src/backend/postmaster/postmaster.c      |  50 +++++----
 src/backend/storage/lmgr/proc.c          |   1 +
 src/include/access/xlog.h                |   3 +
 src/include/access/xlogarchive.h         |   1 +
 src/include/miscadmin.h                  |   2 +
 src/include/postmaster/pgarch.h          |   4 +-
 src/include/storage/pmsignal.h           |   1 -
 src/include/storage/proc.h               |   3 +
 11 files changed, 69 insertions(+), 154 deletions(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 8f8734dc1d..2e6c322142 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -29,7 +29,9 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/latch.h"
 #include "storage/pmsignal.h"
+#include "storage/proc.h"
 
 /*
  * Attempt to retrieve the specified file from off-line archival storage.
@@ -489,8 +491,8 @@ XLogArchiveNotify(const char *xlog)
     }
 
     /* Notify archiver that it's got something to do */
-    if (IsUnderPostmaster)
-        SendPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER);
+    if (IsUnderPostmaster && ProcGlobal->archiverLatch)
+        SetLatch(ProcGlobal->archiverLatch);
 }
 
 /*
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 76b2f5066f..81bfaea869 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -317,6 +317,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
         case StartupProcess:
             MyBackendType = B_STARTUP;
             break;
+        case ArchiverProcess:
+            MyBackendType = B_ARCHIVER;
+            break;
         case BgWriterProcess:
             MyBackendType = B_BG_WRITER;
             break;
@@ -437,30 +440,29 @@ AuxiliaryProcessMain(int argc, char *argv[])
             proc_exit(1);        /* should never return */
 
         case StartupProcess:
-            /* don't set signals, startup process has its own agenda */
             StartupProcessMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
+
+        case ArchiverProcess:
+            PgArchiverMain();
+            proc_exit(1);
 
         case BgWriterProcess:
-            /* don't set signals, bgwriter has its own agenda */
             BackgroundWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case CheckpointerProcess:
-            /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalWriterProcess:
-            /* don't set signals, walwriter has its own agenda */
             InitXLOGAccess();
             WalWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalReceiverProcess:
-            /* don't set signals, walreceiver has its own agenda */
             WalReceiverMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         default:
             elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 37be0e2bbb..063d1323ea 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -48,6 +48,7 @@
 #include "storage/latch.h"
 #include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
+#include "storage/procsignal.h"
 #include "utils/guc.h"
 #include "utils/ps_status.h"
 
@@ -78,13 +79,11 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
-static volatile sig_atomic_t wakened = false;
 static volatile sig_atomic_t ready_to_stop = false;
 
 /* ----------
@@ -95,8 +94,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
 static void pgarch_ArchiverCopyLoop(void);
@@ -110,75 +107,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -212,14 +140,9 @@ pgarch_forkexec(void)
 #endif                            /* EXEC_BACKEND */
 
 
-/*
- * PgArchiverMain
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
- *    since we don't use 'em, it hardly matters...
- */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+/* Main entry point for archiver process */
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -231,33 +154,26 @@ PgArchiverMain(int argc, char *argv[])
     pqsignal(SIGQUIT, SignalHandlerForCrashExit);
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, pgarch_waken);
+    pqsignal(SIGUSR1, procsignal_sigusr1_handler);
     pqsignal(SIGUSR2, pgarch_waken_stop);
+
     /* Reset some signals that are accepted by postmaster but not here */
     pqsignal(SIGCHLD, SIG_DFL);
+
+    /* Unblock signals (they were blocked when the postmaster forked us) */
     PG_SETMASK(&UnBlockSig);
 
-    MyBackendType = B_ARCHIVER;
-    init_ps_display(NULL);
+    /*
+     * Advertise our latch that backends can use to wake us up while we're
+     * sleeping.
+     */
+    ProcGlobal->archiverLatch = &MyProc->procLatch;
 
     pgarch_MainLoop();
 
     exit(0);
 }
 
-/* SIGUSR1 signal handler for archiver process */
-static void
-pgarch_waken(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    /* set flag that there is work to be done */
-    wakened = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
 /* SIGUSR2 signal handler for archiver process */
 static void
 pgarch_waken_stop(SIGNAL_ARGS)
@@ -282,14 +198,6 @@ pgarch_MainLoop(void)
     pg_time_t    last_copy_time = 0;
     bool        time_to_stop;
 
-    /*
-     * We run the copy loop immediately upon entry, in case there are
-     * unarchived files left over from a previous database run (or maybe the
-     * archiver died unexpectedly).  After that we wait for a signal or
-     * timeout before doing more.
-     */
-    wakened = true;
-
     /*
      * There shouldn't be anything for the archiver to do except to wait for a
      * signal ... however, the archiver exists to protect our data, so she
@@ -328,12 +236,8 @@ pgarch_MainLoop(void)
         }
 
         /* Do what we're here for */
-        if (wakened || time_to_stop)
-        {
-            wakened = false;
-            pgarch_ArchiverCopyLoop();
-            last_copy_time = time(NULL);
-        }
+        pgarch_ArchiverCopyLoop();
+        last_copy_time = time(NULL);
 
         /*
          * Sleep until a signal is received, or until a poll is forced by
@@ -354,13 +258,9 @@ pgarch_MainLoop(void)
                                WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                                timeout * 1000L,
                                WAIT_EVENT_ARCHIVER_MAIN);
-                if (rc & WL_TIMEOUT)
-                    wakened = true;
                 if (rc & WL_POSTMASTER_DEATH)
                     time_to_stop = true;
             }
-            else
-                wakened = true;
         }
 
         /*
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 42223c0f61..4c296dd214 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -554,6 +554,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1799,7 +1800,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -3051,7 +3052,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3186,20 +3187,16 @@ reaper(SIGNAL_ARGS)
         }
 
         /*
-         * Was it the archiver?  If so, just try to start a new one; no need
-         * to force reset of the rest of the system.  (If fail, we'll try
-         * again in future cycles of the main loop.).  Unless we were waiting
-         * for it to shut down; don't restart it in that case, and
-         * PostmasterStateMachine() will advance to the next shutdown step.
+         * Was it the archiver?  Normal exit can be ignored; we'll start a new
+         * one at the next iteration of the postmaster's main loop, if
+         * necessary. Any other exit condition is treated as a crash.
          */
         if (pid == PgArchPID)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3447,7 +3444,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3652,6 +3649,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3948,6 +3957,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5223,7 +5233,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5268,16 +5278,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (StartWorkerNeeded || HaveCrashedWorker)
         maybe_start_bgworkers();
 
-    if (CheckPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER) &&
-        PgArchPID != 0)
-    {
-        /*
-         * Send SIGUSR1 to archiver process, to wake it up and begin archiving
-         * next WAL file.
-         */
-        signal_child(PgArchPID, SIGUSR1);
-    }
-
     /* Tell syslogger to rotate logfile if requested */
     if (SysLoggerPID != 0)
     {
@@ -5515,6 +5515,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index aa9fbd8054..9cc5226667 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -181,6 +181,7 @@ InitProcGlobal(void)
     ProcGlobal->startupBufferPinWaitBufId = -1;
     ProcGlobal->walwriterLatch = NULL;
     ProcGlobal->checkpointerLatch = NULL;
+    ProcGlobal->archiverLatch = NULL;
     pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
     pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e71..65443063e8 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -339,6 +339,9 @@ extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern void XLogArchiveWakeupStart(void);
+extern void XLogArchiveWakeupEnd(void);
+extern void XLogArchiveWakeup(void);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool PromoteIsTriggered(void);
diff --git a/src/include/access/xlogarchive.h b/src/include/access/xlogarchive.h
index 1c67de2ede..54ce0b97d7 100644
--- a/src/include/access/xlogarchive.h
+++ b/src/include/access/xlogarchive.h
@@ -25,6 +25,7 @@ extern void ExecuteRecoveryCommand(const char *command, const char *commandName,
 extern void KeepFileRestoredFromArchive(const char *path, const char *xlogfname);
 extern void XLogArchiveNotify(const char *xlog);
 extern void XLogArchiveNotifySeg(XLogSegNo segno);
+extern void XLogArchiveWakeup(void);
 extern void XLogArchiveForceDone(const char *xlog);
 extern bool XLogArchiveCheckDone(const char *xlog);
 extern bool XLogArchiveIsBusy(const char *xlog);
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 72e3352398..bba002ad24 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -418,6 +418,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -430,6 +431,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index b3200874ca..e3ffc63f14 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 56c5ec4481..c691acf8cd 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -34,7 +34,6 @@ typedef enum
 {
     PMSIGNAL_RECOVERY_STARTED,    /* recovery has started */
     PMSIGNAL_BEGIN_HOT_STANDBY, /* begin Hot Standby */
-    PMSIGNAL_WAKEN_ARCHIVER,    /* send a NOTIFY signal to xlog archiver */
     PMSIGNAL_ROTATE_LOGFILE,    /* send SIGUSR1 to syslogger to rotate logfile */
     PMSIGNAL_START_AUTOVAC_LAUNCHER,    /* start an autovacuum launcher */
     PMSIGNAL_START_AUTOVAC_WORKER,    /* start an autovacuum worker */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 9c9a50ae45..de20520b8c 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -345,6 +345,9 @@ typedef struct PROC_HDR
     int            startupProcPid;
     /* Buffer id of the buffer that Startup process waits for pin on, or -1 */
     int            startupBufferPinWaitBufId;
+    /* Archiver process's latch */
+    Latch       *archiverLatch;
+    /* Current shared estimate of appropriate spins_per_delay value */
 } PROC_HDR;
 
 extern PGDLLIMPORT PROC_HDR *ProcGlobal;
-- 
2.18.4

From d59c9e2a9a1dba9e4e80e8b443963191c42e8d8c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:14 +0900
Subject: [PATCH v36 5/7] Shared-memory based stats collector

Previously, activity statistics were collected via sockets and shared
among backends through files written periodically. Such files can reach
tens of megabytes and are created at most once every 500ms; this large
data is serialized by the stats collector and then de-serialized by
every backend periodically. To avoid that cost, this patch places
activity statistics data in shared memory. Each backend accumulates
statistics locally, then tries to move them into the shared statistics
at every transaction end, but at intervals not shorter than 500ms.
Locks on the shared statistics are acquired per object, such as a table
or a function, so the expected chance of collision is not high.
Furthermore, until 1 second has elapsed since the last flush to shared
stats, a lock failure postpones the flush so that lock contention
doesn't slow down transactions. After that, the stats flush waits for
locks so that the shared statistics don't get stale.
---
 src/backend/access/heap/heapam_handler.c      |    4 +-
 src/backend/access/heap/vacuumlazy.c          |    2 +-
 src/backend/access/transam/xlog.c             |    4 +-
 src/backend/catalog/index.c                   |   24 +-
 src/backend/commands/analyze.c                |    8 +-
 src/backend/commands/dbcommands.c             |    2 +-
 src/backend/commands/matview.c                |    8 +-
 src/backend/commands/vacuum.c                 |    4 +-
 src/backend/postmaster/autovacuum.c           |   74 +-
 src/backend/postmaster/bgwriter.c             |    4 +-
 src/backend/postmaster/checkpointer.c         |   24 +-
 src/backend/postmaster/interrupt.c            |    5 +-
 src/backend/postmaster/pgarch.c               |   12 +-
 src/backend/postmaster/pgstat.c               | 5128 +++++++----------
 src/backend/postmaster/postmaster.c           |   88 +-
 src/backend/replication/basebackup.c          |    4 +-
 src/backend/storage/buffer/bufmgr.c           |    8 +-
 src/backend/storage/ipc/ipci.c                |    2 +
 src/backend/storage/lmgr/lwlock.c             |    4 +-
 src/backend/storage/lmgr/lwlocknames.txt      |    1 +
 src/backend/storage/smgr/smgr.c               |    4 +-
 src/backend/tcop/postgres.c                   |   37 +-
 src/backend/utils/adt/pgstatfuncs.c           |   59 +-
 src/backend/utils/init/globals.c              |    1 +
 src/backend/utils/init/miscinit.c             |    3 -
 src/backend/utils/init/postinit.c             |   11 +
 src/backend/utils/misc/guc.c                  |   14 +-
 src/backend/utils/misc/postgresql.conf.sample |    2 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |    2 +-
 src/include/miscadmin.h                       |    3 +-
 src/include/pgstat.h                          |  581 +-
 src/include/storage/lwlock.h                  |    1 +
 src/include/utils/guc_tables.h                |    2 +-
 src/include/utils/timeout.h                   |    1 +
 34 files changed, 2256 insertions(+), 3875 deletions(-)
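
The flush policy described in the commit message above can be sketched
as a small decision function. This is only an illustration, not code
from the patch: the names choose_flush_mode, FlushMode,
MIN_FLUSH_INTERVAL_MS and MAX_NOWAIT_INTERVAL_MS are hypothetical, and
the two thresholds are taken directly from the 500ms and 1 second
figures quoted in the message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants mirroring the intervals in the commit message. */
#define MIN_FLUSH_INTERVAL_MS   500     /* never flush more often than this */
#define MAX_NOWAIT_INTERVAL_MS  1000    /* past this, wait for locks */

/* Hypothetical result type: what a backend does at transaction end. */
typedef enum
{
    FLUSH_SKIP,     /* too soon since the last flush; keep accumulating */
    FLUSH_NOWAIT,   /* try the locks; postpone entries whose lock is busy */
    FLUSH_WAIT      /* wait for the locks so shared stats don't get stale */
} FlushMode;

/*
 * Decide how to flush locally accumulated stats, given the current time
 * and the time of the last successful flush (both in milliseconds).
 */
FlushMode
choose_flush_mode(int64_t now_ms, int64_t last_flush_ms)
{
    int64_t     elapsed = now_ms - last_flush_ms;

    assert(elapsed >= 0);       /* the clock is assumed not to go backwards */

    if (elapsed < MIN_FLUSH_INTERVAL_MS)
        return FLUSH_SKIP;
    if (elapsed < MAX_NOWAIT_INTERVAL_MS)
        return FLUSH_NOWAIT;
    return FLUSH_WAIT;
}
```

Under these assumptions a backend would call choose_flush_mode() at
every transaction end. Separating the no-wait window from the waiting
case is what keeps lock contention off the commit path most of the time
while still bounding how stale the shared numbers can get.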

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index dcaea7135f..49df584a9e 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1061,8 +1061,8 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
                  * our own.  In this case we should count and sample the row,
                  * to accommodate users who load a table and analyze it in one
                  * transaction.  (pgstat_report_analyze has to adjust the
-                 * numbers we send to the stats collector to make this come
-                 * out right.)
+                 * numbers we report to the activity stats facility to make
+                 * this come out right.)
                  */
                 if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(targtuple->t_data)))
                 {
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 92389e6666..82f7bbcca7 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -599,7 +599,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
                         new_min_multi,
                         false);
 
-    /* report results to the stats collector, too */
+    /* report results to the activity stats facility, too */
     pgstat_report_vacuum(RelationGetRelid(onerel),
                          onerel->rd_rel->relisshared,
                          Max(new_live_tuples, 0),
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 09c01ed4ae..4b10efee81 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8576,9 +8576,9 @@ LogCheckpointEnd(bool restartpoint)
                         &sync_secs, &sync_usecs);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time +=
+    BgWriterStats.checkpoint_write_time +=
         write_secs * 1000 + write_usecs / 1000;
-    BgWriterStats.m_checkpoint_sync_time +=
+    BgWriterStats.checkpoint_sync_time +=
         sync_secs * 1000 + sync_usecs / 1000;
 
     /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 117e3fdef7..622083b2c0 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1688,28 +1688,10 @@ index_concurrently_swap(Oid newIndexId, Oid oldIndexId, const char *oldName)
 
     /*
      * Copy over statistics from old to new index
+     * The data will be flushed by the next pgstat_report_stat()
+     * call.
      */
-    {
-        PgStat_StatTabEntry *tabentry;
-
-        tabentry = pgstat_fetch_stat_tabentry(oldIndexId);
-        if (tabentry)
-        {
-            if (newClassRel->pgstat_info)
-            {
-                newClassRel->pgstat_info->t_counts.t_numscans = tabentry->numscans;
-                newClassRel->pgstat_info->t_counts.t_tuples_returned = tabentry->tuples_returned;
-                newClassRel->pgstat_info->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_hit = tabentry->blocks_hit;
-
-                /*
-                 * The data will be sent by the next pgstat_report_stat()
-                 * call.
-                 */
-            }
-        }
-    }
+    pgstat_copy_index_counters(oldIndexId, newClassRel->pgstat_info);
 
     /* Close relations */
     table_close(pg_class, RowExclusiveLock);
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8af12b5c6b..a7e787d9d1 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -644,10 +644,10 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
     }
 
     /*
-     * Report ANALYZE to the stats collector, too.  However, if doing
-     * inherited stats we shouldn't report, because the stats collector only
-     * tracks per-table stats.  Reset the changes_since_analyze counter only
-     * if we analyzed all columns; otherwise, there is still work for
+     * Report ANALYZE to the activity stats facility, too.  However, if doing
+     * inherited stats we shouldn't report, because the activity stats facility
+     * only tracks per-table stats.  Reset the changes_since_analyze counter
+     * only if we analyzed all columns; otherwise, there is still work for
      * auto-analyze to do.
      */
     if (!inh)
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index f27c3fe8c1..de0c749570 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -971,7 +971,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
     DropDatabaseBuffers(db_id);
 
     /*
-     * Tell the stats collector to forget it immediately, too.
+     * Tell the activity stats facility to forget it immediately, too.
      */
     pgstat_drop_database(db_id);
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index f80a9e96a9..e7ccc6eba7 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -336,10 +336,10 @@ ExecRefreshMatView(RefreshMatViewStmt *stmt, const char *queryString,
         refresh_by_heap_swap(matviewOid, OIDNewHeap, relpersistence);
 
         /*
-         * Inform stats collector about our activity: basically, we truncated
-         * the matview and inserted some new data.  (The concurrent code path
-         * above doesn't need to worry about this because the inserts and
-         * deletes it issues get counted by lower-level code.)
+         * Inform activity stats facility about our activity: basically, we
+         * truncated the matview and inserted some new data.  (The concurrent
+         * code path above doesn't need to worry about this because the inserts
+         * and deletes it issues get counted by lower-level code.)
          */
         pgstat_count_truncate(matviewRel);
         if (!stmt->skipData)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ddeec870d8..c5477ff567 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -319,8 +319,8 @@ vacuum(List *relations, VacuumParams *params,
                  errmsg("VACUUM option DISABLE_PAGE_SKIPPING cannot be used with FULL")));
 
     /*
-     * Send info about dead objects to the statistics collector, unless we are
-     * in autovacuum --- autovacuum.c does this for itself.
+     * Send info about dead objects to the activity statistics facility, unless
+     * we are in autovacuum --- autovacuum.c does this for itself.
      */
     if ((params->options & VACOPT_VACUUM) && !IsAutoVacuumWorkerProcess())
         pgstat_vacuum_stat();
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 1b8cd7bacd..5abb96d2ba 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -338,9 +338,6 @@ static void autovacuum_do_vac_analyze(autovac_table *tab,
                                       BufferAccessStrategy bstrategy);
 static AutoVacOpts *extract_autovac_opts(HeapTuple tup,
                                          TupleDesc pg_class_desc);
-static PgStat_StatTabEntry *get_pgstat_tabentry_relid(Oid relid, bool isshared,
-                                                      PgStat_StatDBEntry *shared,
-                                                      PgStat_StatDBEntry *dbentry);
 static void perform_work_item(AutoVacuumWorkItem *workitem);
 static void autovac_report_activity(autovac_table *tab);
 static void autovac_report_workitem(AutoVacuumWorkItem *workitem,
@@ -1664,12 +1661,12 @@ AutoVacWorkerMain(int argc, char *argv[])
         char        dbname[NAMEDATALEN];
 
         /*
-         * Report autovac startup to the stats collector.  We deliberately do
-         * this before InitPostgres, so that the last_autovac_time will get
-         * updated even if the connection attempt fails.  This is to prevent
-         * autovac from getting "stuck" repeatedly selecting an unopenable
-         * database, rather than making any progress on stuff it can connect
-         * to.
+         * Report autovac startup to the activity stats facility.  We
+         * deliberately do this before InitPostgres, so that the
+         * last_autovac_time will get updated even if the connection attempt
+         * fails.  This is to prevent autovac from getting "stuck" repeatedly
+         * selecting an unopenable database, rather than making any progress on
+         * stuff it can connect to.
          */
         pgstat_report_autovac(dbid);
 
@@ -1941,8 +1938,6 @@ do_autovacuum(void)
     HASHCTL        ctl;
     HTAB       *table_toast_map;
     ListCell   *volatile cell;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     BufferAccessStrategy bstrategy;
     ScanKeyData key;
     TupleDesc    pg_class_desc;
@@ -1961,17 +1956,11 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /*
-     * may be NULL if we couldn't find an entry (only happens if we are
-     * forcing a vacuum for anti-wrap purposes).
-     */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
 
     /*
-     * Clean up any dead statistics collector entries for this DB. We always
+     * Clean up any dead activity statistics entries for this DB. We always
      * want to do this exactly once per DB-processing cycle, even if we find
      * nothing worth vacuuming in the database.
      */
@@ -2014,9 +2003,6 @@ do_autovacuum(void)
     /* StartTransactionCommand changed elsewhere */
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-
     classRel = table_open(RelationRelationId, AccessShareLock);
 
     /* create a copy so we can use it after closing pg_class */
@@ -2095,8 +2081,8 @@ do_autovacuum(void)
 
         /* Fetch reloptions and the pgstat entry for this table */
         relopts = extract_autovac_opts(tuple, pg_class_desc);
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                       relid);
 
         /* Check if it needs vacuum or analyze */
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
@@ -2179,8 +2165,8 @@ do_autovacuum(void)
         }
 
         /* Fetch the pgstat entry for this table */
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                       relid);
 
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
@@ -2739,29 +2725,6 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
     return av;
 }
 
-/*
- * get_pgstat_tabentry_relid
- *
- * Fetch the pgstat entry of a table, either local to a database or shared.
- */
-static PgStat_StatTabEntry *
-get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
-                          PgStat_StatDBEntry *dbentry)
-{
-    PgStat_StatTabEntry *tabentry = NULL;
-
-    if (isshared)
-    {
-        if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
-    }
-    else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
-
-    return tabentry;
-}
 
 /*
  * table_recheck_autovac
@@ -2782,17 +2745,12 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     bool        doanalyze;
     autovac_table *tab = NULL;
     PgStat_StatTabEntry *tabentry;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     bool        wraparound;
     AutoVacOpts *avopts;
 
     /* use fresh stats */
     autovac_refresh_stats();
 
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* fetch the relation's relcache entry */
     classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
     if (!HeapTupleIsValid(classTup))
@@ -2816,8 +2774,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     }
 
     /* fetch the pgstat table entry */
-    tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                         shared, dbentry);
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                   relid);
 
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
@@ -2939,7 +2897,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
  *
  * For analyze, the analysis done is that the number of tuples inserted,
  * deleted and updated since the last analyze exceeds a threshold calculated
- * in the same fashion as above.  Note that the collector actually stores
+ * in the same fashion as above.  Note that the activity stats facility stores
  * the number of tuples (both live and dead) that there were as of the last
  * analyze.  This is asymmetric to the VACUUM case.
  *
@@ -2949,8 +2907,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
  *
  * A table whose autovacuum_enabled option is false is
  * automatically skipped (unless we have to vacuum it due to freeze_max_age).
- * Thus autovacuum can be disabled for specific tables. Also, when the stats
- * collector does not have data about a table, it will be skipped.
+ * Thus autovacuum can be disabled for specific tables. Also, when the activity
+ * stats facility does not have data about a table, it will be skipped.
  *
  * A table whose vac_base_thresh value is < 0 takes the base value from the
  * autovacuum_vacuum_threshold GUC variable.  Similarly, a vac_scale_factor
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 069e27e427..4382b1726f 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -234,9 +234,9 @@ BackgroundWriterMain(void)
         can_hibernate = BgBufferSync(&wb_context);
 
         /*
-         * Send off activity statistics to the stats collector
+         * Send off activity statistics to the activity stats facility
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 624a3238b8..dd36b4965e 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -350,7 +350,7 @@ CheckpointerMain(void)
         if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
         {
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            BgWriterStats.requested_checkpoints++;
         }
 
         /*
@@ -364,7 +364,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                BgWriterStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -486,13 +486,13 @@ CheckpointerMain(void)
         CheckArchiveTimeout();
 
         /*
-         * Send off activity statistics to the stats collector.  (The reason
-         * why we re-use bgwriter-related code for this is that the bgwriter
-         * and checkpointer used to be just one process.  It's probably not
-         * worth the trouble to split the stats support into two independent
-         * stats message types.)
+         * Send off activity statistics to the activity stats facility.  (The
+         * reason why we re-use bgwriter-related code for this is that the
+         * bgwriter and checkpointer used to be just one process.  It's
+         * probably not worth the trouble to split the stats support into two
+         * independent stats message types.)
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         /*
          * If any checkpoint flags have been set, redo the loop to handle the
@@ -698,9 +698,9 @@ CheckpointWriteDelay(int flags, double progress)
         CheckArchiveTimeout();
 
         /*
-         * Report interim activity statistics to the stats collector.
+         * Report interim activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1245,8 +1245,8 @@ AbsorbSyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    BgWriterStats.buf_written_backend += CheckpointerShmem->num_backend_writes;
+    BgWriterStats.buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/interrupt.c b/src/backend/postmaster/interrupt.c
index 3d02439b79..53844eb8bb 100644
--- a/src/backend/postmaster/interrupt.c
+++ b/src/backend/postmaster/interrupt.c
@@ -93,9 +93,8 @@ SignalHandlerForCrashExit(SIGNAL_ARGS)
  * shut down and exit.
  *
  * Typically, this handler would be used for SIGTERM, but some procesess use
- * other signals. In particular, the checkpointer exits on SIGUSR2, the
- * stats collector on SIGQUIT, and the WAL writer exits on either SIGINT
- * or SIGTERM.
+ * other signals. In particular, the checkpointer exits on SIGUSR2, and the WAL
+ * writer exits on either SIGINT or SIGTERM.
  *
  * ShutdownRequestPending should be checked at a convenient place within the
  * main loop, or else the main loop should call HandleMainLoopInterrupts.
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 063d1323ea..08fe87341c 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -372,20 +372,20 @@ pgarch_ArchiverCopyLoop(void)
                 pgarch_archiveDone(xlog);
 
                 /*
-                 * Tell the collector about the WAL file that we successfully
-                 * archived
+                 * Tell the activity statistics facility about the WAL file
+                 * that we successfully archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_report_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
             else
             {
                 /*
-                 * Tell the collector about the WAL file that we failed to
-                 * archive
+                 * Tell the activity statistics facility about the WAL file
+                 * that we failed to archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_report_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
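The header comment added to pgstat.c in the next diff describes a flush schedule: a failed flush is retried at intervals that start at PGSTAT_RETRY_MIN_INTERVAL and double each time, and an update is forced once PGSTAT_MAX_INTERVAL has passed since the first attempt. A minimal sketch of that timing follows. It assumes that time is measured from the first failed attempt and that retries run back to back; retry_time_ms is a hypothetical helper for illustration, not a function from the patch.

```c
#include <assert.h>

/* Values copied from the macro definitions in the pgstat.c hunk (milliseconds). */
#define PGSTAT_RETRY_MIN_INTERVAL   250
#define PGSTAT_MAX_INTERVAL         10000

/*
 * Hypothetical helper, not part of the patch: return the time, in ms since
 * the first failed flush attempt, at which the n-th retry (n >= 1) would
 * run if every earlier attempt also failed.  The result is capped at
 * PGSTAT_MAX_INTERVAL, the point where the update is forced.
 */
static int
retry_time_ms(int n)
{
    int         interval = PGSTAT_RETRY_MIN_INTERVAL;
    int         t = 0;

    assert(n >= 1);

    for (int i = 0; i < n; i++)
    {
        t += interval;
        if (t >= PGSTAT_MAX_INTERVAL)
            return PGSTAT_MAX_INTERVAL; /* forced update happens here */
        interval *= 2;          /* back off: double the wait each retry */
    }
    return t;
}
```

Under these assumptions the retries fall at 250, 750, 1750, 3750 and 7750 ms, and the sixth attempt is the forced update at 10000 ms.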
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 5f4b168fd1..90c4ea31a0 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,22 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Activity Statistics facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects activity statistics, e.g. per-table access statistics, of
+ *  all backends in shared memory. The activity numbers are first stored
+ *  locally in each process, then flushed to shared memory at commit
+ *  time or by idle-timeout.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ * To avoid congestion on the shared memory, shared stats are updated no more
+ * often than once per PGSTAT_MIN_INTERVAL (1000ms). If some local numbers
+ * remain unflushed due to lock failure, we retry at intervals that start at
+ * PGSTAT_RETRY_MIN_INTERVAL (250ms) and double at every retry. Finally we
+ * force an update PGSTAT_MAX_INTERVAL (10000ms) after the first attempt.
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the activity statistics facility creates the
+ *  area, then loads the stored stats file if any. The last process at
+ *  shutdown writes the shared stats to the file, then destroys the area.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -19,18 +26,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -40,32 +35,23 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/fork_process.h"
 #include "postmaster/interrupt.h"
 #include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
 #include "utils/timestamp.h"
 
@@ -73,35 +59,20 @@
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
+#define PGSTAT_MIN_INTERVAL            1000    /* Minimum interval of stats data
+                                             * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_RETRY_MIN_INTERVAL    250 /* Initial retry interval after
+                                         * PGSTAT_MIN_INTERVAL */
 
+#define PGSTAT_MAX_INTERVAL            10000    /* Longest interval of stats data
+                                             * updates */
 
 /* ----------
- * The initial size hints for the hash tables used in the collector.
+ * The initial size hints for the hash tables used in the activity statistics.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
-#define PGSTAT_TAB_HASH_SIZE    512
+#define PGSTAT_TABLE_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
 
@@ -116,7 +87,6 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
-
 /* ----------
  * GUC parameters
  * ----------
@@ -131,15 +101,24 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used, but will be removed with GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
-/*
- * BgWriter global statistics counters (unused in other processes).
- * Stored directly in a stats message structure so it can be sent
- * without needing to copy things around.  We assume this inits to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
+typedef struct StatsShmemStruct
+{
+    dsa_handle    stats_dsa_handle;    /* handle for stats data area */
+    dshash_table_handle hash_handle;    /* shared dbstat hash */
+    dsa_pointer global_stats;    /* DSA pointer to global stats */
+    dsa_pointer archiver_stats; /* Ditto for archiver stats */
+    dsa_pointer slru_stats;        /* Ditto for SLRU stats */
+    int            refcount;        /* # of processes attached to the shared
+                                 * stats memory */
+}            StatsShmemStruct;
+
+/* BgWriter global statistics counters */
+PgStat_BgWriter BgWriterStats = {0};
 
 /*
  * List of SLRU names that we keep stats for.  There is no central registry of
@@ -160,72 +139,159 @@ static const char *const slru_names[] = {
 #define SLRU_NUM_ELEMENTS    lengthof(slru_names)
 
 /*
- * SLRU statistics counts waiting to be sent to the collector.  These are
- * stored directly in stats message format so they can be sent without needing
- * to copy things around.  We assume this variable inits to zeroes.  Entries
- * are one-to-one with slru_names[].
+ * SLRU statistics counts waiting to be written to the shared activity
+ * statistics.  We assume this variable inits to zeroes.  Entries are
+ * one-to-one with slru_names[].
+ * Changes of SLRU counters are reported within critical sections so we use
+ * static memory in order to avoid memory allocation.
  */
-static PgStat_MsgSLRU SLRUStats[SLRU_NUM_ELEMENTS];
+static PgStat_StatSLRUEntry local_SLRUStats[SLRU_NUM_ELEMENTS];
+static bool     have_slrustats = false;
 
 /* ----------
  * Local data
  * ----------
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
-
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
+/* backend-lifetime storage */
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
 
 /*
- * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
+ * Enums and types to define shared statistics structure.
+ *
+ * Statistics entries for each object are stored in individual DSA-allocated
+ * memory. Every entry is pointed to from the dshash pgStatSharedHash via a
+ * dsa_pointer. This structure keeps object-stats entries from being moved by
+ * dshash resizing, and allows dshash to release its lock sooner on stats
+ * updates. It also reduces interference among write-locks on each stat entry
+ * by not relying on the partition lock of dshash. PgStatLocalHashEntry is the
+ * local-stats equivalent of PgStatHashEntry for shared stat entries.
+ *
+ * Each stat entry is enveloped by the type PgStatEnvelope, which stores common
+ * attributes of all kinds of statistics and an LWLock object.
+ *
+ * Shared stats are stored as:
+ *
+ * dshash pgStatSharedHash
+ *    -> PgStatHashEntry                (dshash entry)
+ *      (dsa_pointer)-> PgStatEnvelope    (dsa memory block)
  *
- * NOTE: once allocated, TabStatusArray structures are never moved or deleted
- * for the life of the backend.  Also, we zero out the t_id fields of the
- * contained PgStat_TableStatus structs whenever they are not actively in use.
- * This allows relcache pgstat_info pointers to be treated as long-lived data,
- * avoiding repeated searches in pgstat_initstats() when a relation is
- * repeatedly opened during a transaction.
+ * Local stats are stored as:
+ *
+ * dynahash pgStatLocalHash
+ *    -> PgStatLocalHashEntry            (dynahash entry)
+ *      (direct pointer)-> PgStatEnvelope (palloc'ed memory)
  */
-#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
 
-typedef struct TabStatusArray
+/* The types of statistics entries. */
+typedef enum PgStatTypes
 {
-    struct TabStatusArray *tsa_next;    /* link to next array, if any */
-    int            tsa_used;        /* # entries currently used */
-    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
-} TabStatusArray;
-
-static TabStatusArray *pgStatTabList = NULL;
+    PGSTAT_TYPE_ALL,            /* Not a type, for the parameters of
+                                 * pgstat_collect_stat_entries */
+    PGSTAT_TYPE_DB,                /* database-wide statistics */
+    PGSTAT_TYPE_TABLE,            /* per-table statistics */
+    PGSTAT_TYPE_FUNCTION        /* per-function statistics */
+}            PgStatTypes;
 
 /*
- * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
+ * entry size lookup table of shared statistics entries corresponding to
+ * PgStatTypes
  */
-typedef struct TabStatHashEntry
+static size_t pgstat_entsize[] =
 {
-    Oid            t_id;
-    PgStat_TableStatus *tsa_entry;
-} TabStatHashEntry;
+    0,                            /* PGSTAT_TYPE_ALL: not an entry */
+    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_StatTabEntry),    /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_StatFuncEntry)    /* PGSTAT_TYPE_FUNCTION */
+};
 
-/*
- * Hash table for O(1) t_id -> tsa_entry lookup
- */
-static HTAB *pgStatTabHash = NULL;
+/* Ditto for local statistics entries */
+static size_t pgstat_localentsize[] =
+{
+    0,                            /* PGSTAT_TYPE_ALL: not an entry */
+    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_TableStatus), /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_BackendFunctionEntry) /* PGSTAT_TYPE_FUNCTION */
+};
+
+/* struct for shared statistics hash entry key. */
+typedef struct PgStatHashEntryKey
+{
+    PgStatTypes type;            /* statistics entry type */
+    Oid            databaseid;        /* database ID. InvalidOid for shared objects. */
+    Oid            objectid;        /* object OID */
+}            PgStatHashEntryKey;
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * Shared hash (pgStatSharedHash) entry.  Stats numbers waiting to be flushed
+ * out to shared stats are held in pgStatLocalHash instead.
  */
-static HTAB *pgStatFunctions = NULL;
+typedef struct PgStatHashEntry
+{
+    PgStatHashEntryKey key;        /* hash key */
+    dsa_pointer env;            /* pointer to shared stats envelope in DSA */
+}            PgStatHashEntry;
+
+/* struct for shared statistics entry pointed from shared hash entry. */
+typedef struct PgStatEnvelope
+{
+    PgStatTypes type;            /* statistics entry type */
+    Oid            databaseid;        /* databaseid */
+    Oid            objectid;        /* objectid */
+    size_t        len;            /* length of body, fixed per type. */
+    LWLock        lock;            /* lightweight lock to protect body */
+    int            body[FLEXIBLE_ARRAY_MEMBER];    /* statistics body */
+}            PgStatEnvelope;
+
+#define PgStatEnvelopeSize(bodylen) \
+    (offsetof(PgStatEnvelope, body) + (bodylen))
+
+/* struct for shared SLRU stats */
+typedef struct PgStatSharedSLRUStats
+{
+    PgStat_StatSLRUEntry    entry[SLRU_NUM_ELEMENTS];
+    LWLock                    lock;
+} PgStatSharedSLRUStats;
+
+/* struct for local statistics hash entry. */
+typedef struct PgStatLocalHashEntry
+{
+    PgStatHashEntryKey key;        /* hash key */
+    PgStatEnvelope *env;        /* pointer to stats envelope in heap */
+}            PgStatLocalHashEntry;
 
 /*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
+ * Snapshot is a stats entry that is locally copied to offer stable values for
+ * a transaction.
  */
-static bool have_function_stats = false;
+typedef struct PgStatSnapshot
+{
+    PgStatHashEntryKey key;
+    bool        negative;
+    int            body[FLEXIBLE_ARRAY_MEMBER];    /* statistics body */
+}            PgStatSnapshot;
+
+#define PgStatSnapshotSize(bodylen)                \
+    (offsetof(PgStatSnapshot, body) + (bodylen))
+
+/* parameter for shared hashes */
+static const dshash_parameters dsh_rootparams = {
+    sizeof(PgStatHashEntryKey),
+    sizeof(PgStatHashEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+
+/* The shared hash to index activity stats entries. */
+static dshash_table *pgStatSharedHash = NULL;
+
+/* Local stats numbers are stored here */
+static HTAB *pgStatLocalHash = NULL;
+
+#define HAVE_ANY_PENDING_STATS()                        \
+    (pgStatLocalHash != NULL ||                            \
+     pgStatXactCommit != 0 || pgStatXactRollback != 0)
 
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
@@ -262,11 +328,10 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
+static MemoryContext pgStatSnapshotContext = NULL;
+static HTAB *pgStatSnapshotHash = NULL;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -275,20 +340,19 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Cluster wide statistics.
+ *
+ * Contains statistics that are collected on neither a per-database nor a
+ * per-table basis.  shared_* point to shared memory and snapshot_* are
+ * backend snapshots.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
+static bool global_snapshot_is_valid = false;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats snapshot_globalStats;
+static PgStatSharedSLRUStats *shared_SLRUStats;
+static PgStat_StatSLRUEntry snapshot_SLRUStats[SLRU_NUM_ELEMENTS];
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -297,526 +361,278 @@ static List *pending_write_requests = NIL;
  */
 static instr_time total_func_time;
 
+/*
+ * Newly created shared stats entries need to be initialized before other
+ * processes get access to them. get_stat_entry() calls this for the purpose.
+ */
+typedef void (*entry_initializer) (PgStatEnvelope * env);
 
 /* ----------
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgstat_beshutdown_hook(int code, Datum arg);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                                                 Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStatEnvelope * get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                                       bool nowait,
+                                       entry_initializer initfunc, bool *found);
+static PgStatEnvelope * *collect_stat_entries(PgStatTypes type, Oid dbid);
+static void create_missing_dbentries(void);
+static void pgstat_write_database_stats(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_db_statsfile(PgStat_StatDBEntry *dbentry);
+
+static bool flush_tabstat(PgStatEnvelope * env, bool nowait);
+static bool flush_funcstat(PgStatEnvelope * env, bool nowait);
+static bool flush_dbstat(PgStatEnvelope * env, bool nowait);
+static bool flush_slrustat(bool nowait);
+
+static void init_tabentry(PgStatEnvelope * env);
+static void init_funcentry(PgStatEnvelope * env);
+static void init_dbentry(PgStatEnvelope * env);
+
+static bool delete_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                              bool nowait);
+static PgStatEnvelope * get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                                             bool create, bool *found);
+static PgStat_StatDBEntry *get_local_dbstat_entry(Oid dbid);
+static PgStat_TableStatus *get_local_tabstat_entry(Oid rel_id, bool isshared);
+
+static PgStat_SubXactStatus *get_tabstat_stack_level(int nest_level);
+static void add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level);
+static void pgstat_snapshot_global_stats(void);
+
 static void pgstat_read_current_status(void);
-
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
-
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
-static void pgstat_send_slru(void);
-static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
-
-static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
-
-static void pgstat_setup_memcxt(void);
-
 static const char *pgstat_get_wait_activity(WaitEventActivity w);
 static const char *pgstat_get_wait_client(WaitEventClient w);
 static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute shared memory space needed for activity statistics
+ */
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+/*
+ * StatsShmemInit - initialize during shared-memory creation
  */
 void
-pgstat_init(void)
+StatsShmemInit(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
+    bool        found;
 
-#define TESTBYTEVAL ((char) 199)
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
 
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
-
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
-    {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
-    }
-
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
+    if (!IsUnderPostmaster)
     {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
-
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
-
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
+        Assert(!found);
 
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+    }
+}
 
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ *    Create pgStatLocalContext and pgStatSnapshotContext, if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+    if (!pgStatLocalContext)
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+
+    if (!pgStatSnapshotContext)
+        pgStatSnapshotContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Database statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+}
 
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
 
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+/* ----------
+ * attach_shared_stats() -
+ *
+ *    Attach or create the shared stats memory. If we are the first process to
+ *    use the activity stats system, read saved statistics files if any.
+ * ---------
+ */
+static void
+attach_shared_stats(void)
+{
+    MemoryContext oldcontext;
 
-        test_byte++;            /* just make sure variable is changed */
+    /*
+     * Don't use dsm when not under postmaster, or when not tracking counts.
+     */
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
 
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    pgstat_setup_memcxt();
 
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    if (area)
+        return;
 
-        /* If we get here, we have a working socket */
-        break;
-    }
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
 
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
     /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
+     * The last process is responsible for writing out stats files at exit.
+     * Maintain a refcount so that an exiting process can tell whether it is
+     * the last one or not.
      */
-    if (!pg_set_noblock(pgStatSock))
+    if (StatsShmem->refcount > 0)
+        StatsShmem->refcount++;
+    else
     {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
+        /* We're the first process to attach the shared stats memory */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
+
+        /* Initialize shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatSharedHash = dshash_create(area, &dsh_rootparams, 0);
+
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->global_stats =
+            dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+        StatsShmem->archiver_stats =
+            dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+        StatsShmem->slru_stats =
+            dsa_allocate0(area, sizeof(PgStatSharedSLRUStats));
+        StatsShmem->hash_handle = dshash_get_hash_table_handle(pgStatSharedHash);
+
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
+
+        shared_SLRUStats = (PgStatSharedSLRUStats *)
+            dsa_get_address(area, StatsShmem->slru_stats);
+        LWLockInitialize(&shared_SLRUStats->lock, LWTRANCHE_STATS);
+
+        /* Load saved data if any. */
+        pgstat_read_statsfiles();
+
+        StatsShmem->refcount = 1;
     }
 
+    LWLockRelease(StatsLock);
+
     /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
+     * If we're not the first process, attach the existing shared stats area
+     * outside the StatsLock section.
      */
+    if (!area)
     {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
+        /* Attach shared area. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatSharedHash = dshash_attach(area, &dsh_rootparams,
+                                         StatsShmem->hash_handle, 0);
+
+        /* Set up local variables */
+        pgStatLocalHash = NULL;
+        shared_globalStats = (PgStat_GlobalStats *)
+            dsa_get_address(area, StatsShmem->global_stats);
+        shared_archiverStats = (PgStat_ArchiverStats *)
+            dsa_get_address(area, StatsShmem->archiver_stats);
+        shared_SLRUStats = (PgStatSharedSLRUStats *)
+            dsa_get_address(area, StatsShmem->slru_stats);
     }
 
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    /* Now that we have a long-lived socket, tell fd.c about it. */
-    ReserveExternalFD();
+    MemoryContextSwitchTo(oldcontext);
 
-    return;
-
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
-
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
-
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
-
-    /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
-     */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    /* don't detach automatically */
+    dsa_pin_mapping(area);
+    global_snapshot_is_valid = false;
 }
 
-/*
- * subroutine for pgstat_reset_all
+/* ----------
+ * detach_shared_stats() -
+ *
+ *    Detach shared stats. Write out to file if we're the last process and told
+ *    to do so.
+ * ----------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+detach_shared_stats(bool write_stats)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    /* return immediately if there is nothing to detach */
+    if (!area || !IsUnderPostmaster)
+        return;
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
-    {
-        int            nchars;
-        Oid            tmp_oid;
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
+    if (--StatsShmem->refcount < 1)
+    {
         /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
+         * This process is the last one attached to the shared stats
+         * memory. Write out the stats files if requested.
          */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
-
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
+        if (write_stats)
+            pgstat_write_statsfiles();
 
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+        /* No one is using the area. */
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
     }
-    FreeDir(dir);
+
+    LWLockRelease(StatsLock);
+
+    /*
+     * Detach the area. It is automatically destroyed when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);
+
+    area = NULL;
+    pgStatSharedHash = NULL;
+    shared_globalStats = NULL;
+    shared_archiverStats = NULL;
+    pgStatLocalHash = NULL;
+    global_snapshot_is_valid = false;
 }
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
-}
+    /* standalone server doesn't use shared stats */
+    if (!IsUnderPostmaster)
+        return;
 
-#ifdef EXEC_BACKEND
+    /* we must have shared stats attached */
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
-    char       *av[10];
-    int            ac = 0;
-
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
-
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
-
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
-
-
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
-
-    /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
-     */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
-
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
+    /* Startup must be the only user of shared stats */
+    Assert(StatsShmem->refcount == 1);
 
     /*
-     * Okay, fork off the collector.
+     * We could directly remove files and recreate the shared memory area, but
+     * just discard and then recreate it for simplicity.
      */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
+    detach_shared_stats(false); /* Don't write files. */
+    attach_shared_stats();
 }
 
 /* ------------------------------------------------------------
@@ -824,147 +640,491 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *    Updates are applied no more frequently than once every
+ *    PGSTAT_MIN_INTERVAL milliseconds. They are also postponed on lock
+ *    failure if force is false and no update has been pending for longer than
+ *    PGSTAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
+ *    subsequent calls of this function.
+ *
+ *    Returns the time in milliseconds until updates should next be applied
+ *    if the flush was skipped or some updates remain pending; returns 0 if
+ *    there was nothing to do or everything was flushed.
+ *
+ *    Note that this is called only outside a transaction, so it is fine to use
  *    transaction stop time as an approximation of current time.
- * ----------
+ *    ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz next_flush = 0;
+    static TimestampTz pending_since = 0;
+    static long retry_interval = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
+    bool        nowait = !force;    /* don't use force after this point */
+    HASH_SEQ_STATUS scan;
+    PgStatLocalHashEntry *lent;
+    PgStatLocalHashEntry **dbentlist;
+    int            dbentlistlen = 8;
+    int            ndbentries = 0;
+    int            remains = 0;
     int            i;
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (area == NULL || !HAVE_ANY_PENDING_STATS())
+        return 0;
+
+    dbentlist = palloc(sizeof(PgStatLocalHashEntry *) * dbentlistlen);
 
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
     now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
 
-    /*
-     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
-     */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
-    pgStatTabHash = NULL;
+    if (nowait)
+    {
+        /*
+         * Don't flush stats too frequently.  Return the time to the next
+         * flush.
+         */
+        if (now < next_flush)
+        {
+            /* Record when updates first became pending. */
+            if (pending_since == 0)
+                pending_since = now;
+
+            return (next_flush - now) / 1000;
+        }
+
+        /* But, don't keep pending updates longer than PGSTAT_MAX_INTERVAL. */
+
+        if (pending_since > 0 &&
+            TimestampDifferenceExceeds(pending_since, now, PGSTAT_MAX_INTERVAL))
+            nowait = false;
+    }
 
     /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
+     * flush_tabstat adds some of the counters of flushed entries to the local
+     * database stats, so flush out database stats afterwards.
      */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
-    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+    if (pgStatLocalHash)
     {
-        for (i = 0; i < tsa->tsa_used; i++)
+        /* Step 1: flush out everything other than database stats */
+        hash_seq_init(&scan, pgStatLocalHash);
+        while ((lent = (PgStatLocalHashEntry *) hash_seq_search(&scan)) != NULL)
         {
-            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
+            bool        remove = false;
 
-            /* Shouldn't have any pending transaction-dependent counts */
-            Assert(entry->trans == NULL);
+            switch (lent->env->type)
+            {
+                case PGSTAT_TYPE_DB:
+                    if (ndbentries >= dbentlistlen)
+                    {
+                        dbentlistlen *= 2;
+                        dbentlist = repalloc(dbentlist,
+                                             sizeof(PgStatLocalHashEntry *) *
+                                             dbentlistlen);
+                    }
+                    dbentlist[ndbentries++] = lent;
+                    break;
+                case PGSTAT_TYPE_TABLE:
+                    if (flush_tabstat(lent->env, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_FUNCTION:
+                    if (flush_funcstat(lent->env, nowait))
+                        remove = true;
+                    break;
+                default:
+                    Assert(false);
+            }
 
-            /*
-             * Ignore entries that didn't accumulate any actual counts, such
-             * as indexes that were opened by the planner but not used.
-             */
-            if (memcmp(&entry->t_counts, &all_zeroes,
-                       sizeof(PgStat_TableCounts)) == 0)
+            if (!remove)
+            {
+                remains++;
                 continue;
+            }
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* Remove the successfully flushed entry */
+            pfree(lent->env);
+            hash_search(pgStatLocalHash, &lent->key, HASH_REMOVE, NULL);
+        }
+
+        /* Step 2: flush out database stats */
+        for (i = 0; i < ndbentries; i++)
+        {
+            PgStatLocalHashEntry *lent = dbentlist[i];
+
+            if (flush_dbstat(lent->env, nowait))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                remains--;
+                /* Remove the successfully flushed entry */
+                pfree(lent->env);
+                hash_search(pgStatLocalHash, &lent->key, HASH_REMOVE, NULL);
             }
         }
-        /* zero out PgStat_TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
+        pfree(dbentlist);
+
+        if (remains <= 0)
+        {
+            hash_destroy(pgStatLocalHash);
+            pgStatLocalHash = NULL;
+        }
     }
 
+    /* flush SLRU stats */
+    flush_slrustat(nowait);
+
+    /* Publish the last flush time */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (shared_globalStats->stats_timestamp < now)
+        shared_globalStats->stats_timestamp = now;
+    LWLockRelease(StatsLock);
+
     /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
+     * If we have pending local stats, let the caller know the retry interval.
      */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    if (HAVE_ANY_PENDING_STATS())
+    {
+        /* Keep the time when updates first became pending */
+        if (pending_since == 0)
+            pending_since = now;
+
+        /* The interval is doubled at every retry. */
+        if (retry_interval == 0)
+            retry_interval = PGSTAT_RETRY_MIN_INTERVAL * 1000;
+        else
+            retry_interval = retry_interval * 2;
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+        /*
+         * Determine the next retry interval so as not to get shorter than the
+         * previous interval.
+         */
+        if (!TimestampDifferenceExceeds(pending_since,
+                                        now + 2 * retry_interval,
+                                        PGSTAT_MAX_INTERVAL))
+            next_flush = now + retry_interval;
+        else
+        {
+            next_flush = pending_since + PGSTAT_MAX_INTERVAL * 1000;
+            retry_interval = next_flush - now;
+        }
 
-    /* Finally send SLRU statistics */
-    pgstat_send_slru();
+        return retry_interval / 1000;
+    }
+
+    /* Set the next time to update stats */
+    next_flush = now + PGSTAT_MIN_INTERVAL * 1000;
+    retry_interval = 0;
+    pending_since = 0;
+
+    return 0;
 }
 
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * flush_tabstat - flush out a local table stats entry
+ *
+ * Some of the stats numbers are copied to local database stats entry after
+ * successful flush-out.
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_tabstat(PgStatEnvelope * lenv, bool nowait)
+{
+    static const PgStat_TableCounts all_zeroes;
+    Oid            dboid;            /* database OID of the table */
+    PgStat_TableStatus *lstats; /* local stats entry  */
+    PgStatEnvelope *shenv;        /* shared stats envelope */
+    PgStat_StatTabEntry *shtabstats;    /* table entry of shared stats */
+    PgStat_StatDBEntry *ldbstats;    /* local database entry */
+    bool        found;
+
+    Assert(lenv->type == PGSTAT_TYPE_TABLE);
+
+    lstats = (PgStat_TableStatus *) &lenv->body;
+    dboid = lstats->t_shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Ignore entries that didn't accumulate any actual counts, such as
+     * indexes that were opened by the planner but not used.
+     */
+    if (memcmp(&lstats->t_counts, &all_zeroes,
+               sizeof(PgStat_TableCounts)) == 0)
+        return true;
+
+    /* find shared table stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_TABLE, dboid, lstats->t_id,
+                           nowait, init_tabentry, &found);
+
+    /* skip if dshash failed to acquire lock */
+    if (shenv == NULL)
+        return false;
+
+    /* retrieve the shared table stats entry from the envelope */
+    shtabstats = (PgStat_StatTabEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;
+
+    /* add the values to the shared entry. */
+    shtabstats->numscans += lstats->t_counts.t_numscans;
+    shtabstats->tuples_returned += lstats->t_counts.t_tuples_returned;
+    shtabstats->tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    shtabstats->tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    shtabstats->tuples_updated += lstats->t_counts.t_tuples_updated;
+    shtabstats->tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    shtabstats->tuples_hot_updated += lstats->t_counts.t_tuples_hot_updated;
+
+    /*
+     * If the table was truncated or vacuum/analyze has run, first reset the
+     * live/dead counters.
+     */
+    if (lstats->t_counts.t_truncated ||
+        lstats->t_counts.vacuum_count > 0 ||
+        lstats->t_counts.analyze_count > 0 ||
+        lstats->t_counts.autovac_vacuum_count > 0 ||
+        lstats->t_counts.autovac_analyze_count > 0)
+    {
+        shtabstats->n_live_tuples = 0;
+        shtabstats->n_dead_tuples = 0;
+    }
+
+    /* clear the change counter if requested */
+    if (lstats->t_counts.reset_changed_tuples)
+        shtabstats->changes_since_analyze = 0;
+
+    shtabstats->n_live_tuples += lstats->t_counts.t_delta_live_tuples;
+    shtabstats->n_dead_tuples += lstats->t_counts.t_delta_dead_tuples;
+    shtabstats->changes_since_analyze += lstats->t_counts.t_changed_tuples;
+    shtabstats->blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    shtabstats->blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /*
+     * Update vacuum/analyze timestamps and counters, so that the values
+     * never go backwards.
+     */
+    if (shtabstats->vacuum_timestamp < lstats->vacuum_timestamp)
+        shtabstats->vacuum_timestamp = lstats->vacuum_timestamp;
+    shtabstats->vacuum_count += lstats->t_counts.vacuum_count;
+
+    if (shtabstats->autovac_vacuum_timestamp < lstats->autovac_vacuum_timestamp)
+        shtabstats->autovac_vacuum_timestamp = lstats->autovac_vacuum_timestamp;
+    shtabstats->autovac_vacuum_count += lstats->t_counts.autovac_vacuum_count;
+
+    if (shtabstats->analyze_timestamp < lstats->analyze_timestamp)
+        shtabstats->analyze_timestamp = lstats->analyze_timestamp;
+    shtabstats->analyze_count += lstats->t_counts.analyze_count;
+
+    if (shtabstats->autovac_analyze_timestamp < lstats->autovac_analyze_timestamp)
+        shtabstats->autovac_analyze_timestamp = lstats->autovac_analyze_timestamp;
+    shtabstats->autovac_analyze_count += lstats->t_counts.autovac_analyze_count;
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    shtabstats->n_live_tuples = Max(shtabstats->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    shtabstats->n_dead_tuples = Max(shtabstats->n_dead_tuples, 0);
+
+    LWLockRelease(&shenv->lock);
+
+    /* The entry was flushed successfully, so add its counts to database stats */
+    ldbstats = get_local_dbstat_entry(dboid);
+    ldbstats->counts.n_tuples_returned += lstats->t_counts.t_tuples_returned;
+    ldbstats->counts.n_tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    ldbstats->counts.n_tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    ldbstats->counts.n_tuples_updated += lstats->t_counts.t_tuples_updated;
+    ldbstats->counts.n_tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    ldbstats->counts.n_blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    ldbstats->counts.n_blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    return true;
+}
+
+/* ----------
+ * init_tabentry() -
+ *
+ * initializes table stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
  */
 static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+init_tabentry(PgStatEnvelope * env)
 {
-    int            n;
-    int            len;
+    PgStat_StatTabEntry *tabent = (PgStat_StatTabEntry *) &env->body;
+
+    /*
+     * This is a new table entry, so set the table OID and zero all counters
+     * and timestamps.
+     */
+    Assert(env->type == PGSTAT_TYPE_TABLE);
+    tabent->tableid = env->objectid;
+    tabent->numscans = 0;
+    tabent->tuples_returned = 0;
+    tabent->tuples_fetched = 0;
+    tabent->tuples_inserted = 0;
+    tabent->tuples_updated = 0;
+    tabent->tuples_deleted = 0;
+    tabent->tuples_hot_updated = 0;
+    tabent->n_live_tuples = 0;
+    tabent->n_dead_tuples = 0;
+    tabent->changes_since_analyze = 0;
+    tabent->blocks_fetched = 0;
+    tabent->blocks_hit = 0;
+
+    tabent->vacuum_timestamp = 0;
+    tabent->vacuum_count = 0;
+    tabent->autovac_vacuum_timestamp = 0;
+    tabent->autovac_vacuum_count = 0;
+    tabent->analyze_timestamp = 0;
+    tabent->analyze_count = 0;
+    tabent->autovac_analyze_timestamp = 0;
+    tabent->autovac_analyze_count = 0;
+}
+
+
+/*
+ * flush_funcstat - flush out a local function stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_funcstat(PgStatEnvelope * env, bool nowait)
+{
+    /* we assume this inits to all zeroes: */
+    static const PgStat_FunctionCounts all_zeroes;
+    PgStat_BackendFunctionEntry *localent;    /* local stats entry */
+    PgStatEnvelope *shenv;        /* shared stats envelope */
+    PgStat_StatFuncEntry *sharedent = NULL; /* shared stats entry */
+    bool        found;
+
+    Assert(env->type == PGSTAT_TYPE_FUNCTION);
+    localent = (PgStat_BackendFunctionEntry *) &env->body;
+
+    /* Skip it if no counts accumulated for it so far */
+    if (memcmp(&localent->f_counts, &all_zeroes,
+               sizeof(PgStat_FunctionCounts)) == 0)
+        return true;
+
+    /* find shared function stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId, localent->f_id,
+                           nowait, init_funcentry, &found);
+    /* skip if dshash failed to acquire lock */
+    if (shenv == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    /* retrieve the shared function stats entry from the envelope */
+    sharedent = (PgStat_StatFuncEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
+
+    sharedent->f_numcalls += localent->f_counts.f_numcalls;
+    sharedent->f_total_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_total_time);
+    sharedent->f_self_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_self_time);
+
+    LWLockRelease(&shenv->lock);
+
+    return true;
+}
+
+
+/* ----------
+ * init_funcentry() -
+ *
+ * initializes function stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
+ */
+static void
+init_funcentry(PgStatEnvelope * env)
+{
+    PgStat_StatFuncEntry *shstat = (PgStat_StatFuncEntry *) &env->body;
+
+    Assert(env->type == PGSTAT_TYPE_FUNCTION);
+    shstat->functionid = env->objectid;
+    shstat->f_numcalls = 0;
+    shstat->f_total_time = 0;
+    shstat->f_self_time = 0;
+}
+
+
+/*
+ * flush_dbstat - flush out a local database stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_dbstat(PgStatEnvelope * env, bool nowait)
+{
+    PgStat_StatDBEntry *localent;
+    PgStatEnvelope *shenv;
+    PgStat_StatDBEntry *sharedent;
+
+    Assert(env->type == PGSTAT_TYPE_DB);
+
+    localent = (PgStat_StatDBEntry *) &env->body;
+
+    /* find shared database stats entry corresponding to the local entry */
+    shenv = get_stat_entry(PGSTAT_TYPE_DB, localent->databaseid, InvalidOid,
+                           nowait, init_dbentry, NULL);
+
+    /* skip if dshash failed to acquire lock */
+    if (!shenv)
+        return false;
+
+    /* retrieve the shared stats entry from the envelope */
+    sharedent = (PgStat_StatDBEntry *) &shenv->body;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+        return false;
+
+    sharedent->counts.n_tuples_returned += localent->counts.n_tuples_returned;
+    sharedent->counts.n_tuples_fetched += localent->counts.n_tuples_fetched;
+    sharedent->counts.n_tuples_inserted += localent->counts.n_tuples_inserted;
+    sharedent->counts.n_tuples_updated += localent->counts.n_tuples_updated;
+    sharedent->counts.n_tuples_deleted += localent->counts.n_tuples_deleted;
+    sharedent->counts.n_blocks_fetched += localent->counts.n_blocks_fetched;
+    sharedent->counts.n_blocks_hit += localent->counts.n_blocks_hit;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    sharedent->counts.n_deadlocks += localent->counts.n_deadlocks;
+    sharedent->counts.n_temp_bytes += localent->counts.n_temp_bytes;
+    sharedent->counts.n_temp_files += localent->counts.n_temp_files;
+    sharedent->counts.n_checksum_failures += localent->counts.n_checksum_failures;
 
     /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
+     * Accumulate xact commit/rollback and I/O timings into the stats entry
+     * of the current database.
      */
-    if (OidIsValid(tsmsg->m_databaseid))
+    if (OidIsValid(localent->databaseid))
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
+        sharedent->counts.n_xact_commit += pgStatXactCommit;
+        sharedent->counts.n_xact_rollback += pgStatXactRollback;
+        sharedent->counts.n_block_read_time += pgStatBlockReadTime;
+        sharedent->counts.n_block_write_time += pgStatBlockWriteTime;
         pgStatXactCommit = 0;
         pgStatXactRollback = 0;
         pgStatBlockReadTime = 0;
@@ -972,257 +1132,149 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        sharedent->counts.n_xact_commit = 0;
+        sharedent->counts.n_xact_rollback = 0;
+        sharedent->counts.n_block_read_time = 0;
+        sharedent->counts.n_block_write_time = 0;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
+    LWLockRelease(&shenv->lock);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    return true;
 }
 
-/*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
+
+/* ----------
+ * init_dbentry() -
+ *
+ * initializes database stats entry
+ * This is also used as initialization callback for get_stat_entry.
+ * ----------
  */
 static void
-pgstat_send_funcstats(void)
+init_dbentry(PgStatEnvelope * env)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_FunctionCounts all_zeroes;
+    PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) &env->body;
+
+    Assert(env->type == PGSTAT_TYPE_DB);
+    dbentry->databaseid = env->databaseid;
+    dbentry->last_autovac_time = 0;
+    dbentry->last_checksum_failure = 0;
+    dbentry->stat_reset_timestamp = 0;
+    dbentry->stats_timestamp = 0;
+    /* initialize the new shared entry */
+    MemSet(&dbentry->counts, 0, sizeof(PgStat_StatDBCounts));
+}
+
 
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
+/*
+ * flush_slrustat - flush out local SLRU stats entries
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if all local SLRU stats are successfully flushed out.
+ */
+static bool
+flush_slrustat(bool nowait)
+{
+    int i;
 
-    if (pgStatFunctions == NULL)
-        return;
+    if (!have_slrustats)
+        return true;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shared_SLRUStats->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shared_SLRUStats->lock, LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
 
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
+    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
     {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
+        PgStat_StatSLRUEntry *sharedent = &shared_SLRUStats->entry[i];
+        PgStat_StatSLRUEntry *localent = &local_SLRUStats[i];
+
+        sharedent->blocks_zeroed += localent->blocks_zeroed;
+        sharedent->blocks_hit += localent->blocks_hit;
+        sharedent->blocks_read += localent->blocks_read;
+        sharedent->blocks_written += localent->blocks_written;
+        sharedent->blocks_exists += localent->blocks_exists;
+        sharedent->flush += localent->flush;
+        sharedent->truncate += localent->truncate;
+    }
 
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
+    /* done, clear the local entry */
+    MemSet(local_SLRUStats, 0,
+           sizeof(PgStat_StatSLRUEntry) * SLRU_NUM_ELEMENTS);
+    LWLockRelease(&shared_SLRUStats->lock);
 
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
+    have_slrustats = false;
 
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
+    return true;
+}
 
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
 
-    have_function_stats = false;
+/*
+ * Create the filename for a DB stat file; filename is an output parameter
+ * that points to a character buffer of length len.
+ */
+static void
+get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
+{
+    int            printed;
+
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/db_%u.%s",
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
+                       databaseid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
 }
 
 
 /* ----------
- * pgstat_vacuum_stat() -
+ * collect_stat_entries() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *    Collect the shared statistics entries specified by type and dbid. Returns
+ *  a NULL-terminated list of pointers to shared statistics in palloc'ed
+ *  memory. If type is PGSTAT_TYPE_ALL, all types of statistics of the
+ *  database are collected. If type is PGSTAT_TYPE_DB, the parameter dbid is
+ *  ignored and all PGSTAT_TYPE_DB entries are collected.
  * ----------
  */
-void
-pgstat_vacuum_stat(void)
+static PgStatEnvelope * *collect_stat_entries(PgStatTypes type, Oid dbid)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Read pg_database and make a list of OIDs of all existing databases
-     */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
-
-    /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            dbid = dbentry->databaseid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        /* the DB entry for shared tables (with InvalidOid) is never dropped */
-        if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
-            pgstat_drop_database(dbid);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Lookup our own database entry; if not found, nothing more to do.
-     */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
-        return;
-
-    /*
-     * Similarly to above, make a list of all known relations in this DB.
-     */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
-
-    /*
-     * Check for all tables listed in stats hashtable if they still exist.
-     */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
+    int            listlen = 16;
+    PgStatEnvelope **envlist = palloc(sizeof(PgStatEnvelope *) * listlen);
+    int            n = 0;
+
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
     {
-        Oid            tabid = tabentry->tableid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+        if ((type != PGSTAT_TYPE_ALL && p->key.type != type) ||
+            (type != PGSTAT_TYPE_DB && p->key.databaseid != dbid))
             continue;
 
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
+        if (n >= listlen - 1)
         {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
+            listlen *= 2;
+            envlist = repalloc(envlist, listlen * sizeof(PgStatEnvelope *));
         }
+        envlist[n++] = dsa_get_address(area, p->env);
     }
+    dshash_seq_term(&hstat);
 
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
+    envlist[n] = NULL;
 
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
-        {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
-                continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
-        }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
-    }
+    return envlist;
 }
 
 
 /* ----------
- * pgstat_collect_oids() -
+ * collect_oids() -
  *
  *    Collect the OIDs of all objects listed in the specified system catalog
  *    into a temporary hash table.  Caller should hash_destroy the result
@@ -1231,7 +1283,7 @@ pgstat_vacuum_stat(void)
  * ----------
  */
 static HTAB *
-pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
+collect_oids(Oid catalogid, AttrNumber anum_oid)
 {
     HTAB       *htab;
     HASHCTL        hash_ctl;
@@ -1245,7 +1297,7 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     hash_ctl.entrysize = sizeof(Oid);
     hash_ctl.hcxt = CurrentMemoryContext;
     htab = hash_create("Temporary table of OIDs",
-                       PGSTAT_TAB_HASH_SIZE,
+                       PGSTAT_TABLE_HASH_SIZE,
                        &hash_ctl,
                        HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
 
@@ -1272,65 +1324,185 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
 }
 
 
+/* ----------
+ * pgstat_vacuum_stat() -
+ *
+ *  Delete shared stat entries that are not in system catalogs.
+ *
+ *  To avoid holding an exclusive lock on the dshash for a long time, the
+ *  process is performed in three steps.
+ *
+ *   1: Collect the OIDs of all existing objects of every kind.
+ *   2: Collect victim entries by scanning with a shared lock.
+ *   3: Try removing every nominated entry without waiting for the lock.
+ *
+ *  As a consequence of the last step, some entries may be left behind due
+ *  to lock failure, but they will be deleted by later invocations of this
+ *  function.
+ * ----------
+ */
+void
+pgstat_vacuum_stat(void)
+{
+    HTAB       *dbids;            /* database ids */
+    HTAB       *relids;            /* relation ids in the current database */
+    HTAB       *funcids;        /* function ids in the current database */
+    PgStatEnvelope **victims;    /* victim entry list */
+    int            arraylen = 0;    /* storage size of the above */
+    int            nvictims = 0;    /* # of entries of the above */
+    dshash_seq_status dshstat;
+    PgStatHashEntry *ent;
+    int            i;
+
+    /* we don't collect stats in standalone mode */
+    if (!IsUnderPostmaster)
+        return;
+
+    /* collect OIDs of existing objects */
+    dbids = collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    relids = collect_oids(RelationRelationId, Anum_pg_class_oid);
+    funcids = collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
+
+    /* collect victims from shared stats */
+    arraylen = 16;
+    victims = palloc(sizeof(PgStatEnvelope *) * arraylen);
+    nvictims = 0;
+
+    dshash_seq_init(&dshstat, pgStatSharedHash, false);
+
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
+    {
+        HTAB       *oidtab;
+        Oid           *key;
+
+        CHECK_FOR_INTERRUPTS();
+
+        /*
+         * Don't drop entries other than database entries unless they belong
+         * to the current database.
+         */
+        if (ent->key.type != PGSTAT_TYPE_DB &&
+            ent->key.databaseid != MyDatabaseId)
+            continue;
+
+        switch (ent->key.type)
+        {
+            case PGSTAT_TYPE_DB:
+                /* don't remove database entry for shared tables */
+                if (ent->key.databaseid == 0)
+                    continue;
+                oidtab = dbids;
+                key = &ent->key.databaseid;
+                break;
+
+            case PGSTAT_TYPE_TABLE:
+                oidtab = relids;
+                key = &ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                oidtab = funcids;
+                key = &ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_ALL:
+                Assert(false);
+                break;
+        }
+
+        /* Skip existing objects. */
+        if (hash_search(oidtab, key, HASH_FIND, NULL) != NULL)
+            continue;
+
+        /* extend the list if needed */
+        if (nvictims >= arraylen)
+        {
+            arraylen *= 2;
+            victims = repalloc(victims, sizeof(PgStatEnvelope *) * arraylen);
+        }
+
+        victims[nvictims++] = dsa_get_address(area, ent->env);
+    }
+    dshash_seq_term(&dshstat);
+    hash_destroy(dbids);
+    hash_destroy(relids);
+    hash_destroy(funcids);
+
+    /* Now try removing the victim entries */
+    for (i = 0; i < nvictims; i++)
+    {
+        PgStatEnvelope *p = victims[i];
+
+        delete_stat_entry(p->type, p->databaseid, p->objectid, true);
+    }
+}
+
+
+/* ----------
+ * delete_stat_entry() -
+ *
+ *  Deletes the specified entry from shared stats hash.
+ *
+ *  Returns true when successfully deleted.
+ * ----------
+ */
+static bool
+delete_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait)
+{
+    PgStatHashEntryKey key;
+    PgStatHashEntry *ent;
+
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    ent = dshash_find_extended(pgStatSharedHash, &key,
+                               true, nowait, false, NULL);
+
+    if (!ent)
+        return false;            /* lock failed or not found */
+
+    /* The entry is exclusively locked, so we can free the chunk first. */
+    dsa_free(area, ent->env);
+    dshash_delete_entry(pgStatSharedHash, ent);
+
+    return true;
+}
+
+
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
- * ----------
+ *    Remove entry for the database that we just dropped.
+ *
+ *    Some entries might be left behind due to lock failure, or some stats
+ *    might be flushed after this, but we will still clean the dead DB
+ *    eventually via future invocations of pgstat_vacuum_stat().
+ * ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
+    Assert(OidIsValid(databaseid));
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
+    envlist = collect_stat_entries(PGSTAT_TYPE_ALL, databaseid);
 
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
+    for (p = envlist; *p != NULL; p++)
+        delete_stat_entry((*p)->type, (*p)->databaseid, (*p)->objectid, true);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
+    pfree(envlist);
 }
-#endif                            /* NOT_USED */
 
 
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1339,20 +1511,47 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    /* Lookup the entries of the current database in the stats hash. */
+    envlist = collect_stat_entries(PGSTAT_TYPE_ALL, MyDatabaseId);
+    for (p = envlist; *p != NULL; p++)
+    {
+        PgStatEnvelope *env = *p;
+        PgStat_StatDBEntry *dbstat;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+        LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+
+        switch (env->type)
+        {
+            case PGSTAT_TYPE_TABLE:
+                init_tabentry(env);
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                init_funcentry(env);
+                break;
+
+            case PGSTAT_TYPE_DB:
+                init_dbentry(env);
+                dbstat = (PgStat_StatDBEntry *) &env->body;
+                dbstat->stat_reset_timestamp = GetCurrentTimestamp();
+                break;
+            default:
+                Assert(false);
+        }
+
+        LWLockRelease(&env->lock);
+    }
+
+    pfree(envlist);
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1361,29 +1560,37 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
+    /* Reset the archiver statistics for the cluster. */
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    /* Reset the bgwriter statistics for the cluster. */
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1392,17 +1599,42 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStatEnvelope *env;
+    PgStat_StatDBEntry *dbentry;
+    PgStatTypes stattype;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    env = get_stat_entry(PGSTAT_TYPE_DB, MyDatabaseId, InvalidOid,
+                         false, NULL, NULL);
+    Assert(env);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* Set the reset timestamp for the whole database */
+    dbentry = (PgStat_StatDBEntry *) &env->body;
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&env->lock);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Reinitialize the stats entry of the specified object */
+    switch (type)
+    {
+        case RESET_TABLE:
+            stattype = PGSTAT_TYPE_TABLE;
+            break;
+        case RESET_FUNCTION:
+            stattype = PGSTAT_TYPE_FUNCTION;
+    }
+
+    env = get_stat_entry(stattype, MyDatabaseId, objoid, false, NULL, NULL);
+    LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+    if (env->type == PGSTAT_TYPE_TABLE)
+        init_tabentry(env);
+    else
+    {
+        Assert(env->type == PGSTAT_TYPE_FUNCTION);
+        init_funcentry(env);
+    }
+    LWLockRelease(&env->lock);
 }
 
 /* ----------
@@ -1418,15 +1650,27 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_reset_slru_counter(const char *name)
 {
-    PgStat_MsgResetslrucounter msg;
+    int i;
+    TimestampTz    ts = GetCurrentTimestamp();
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSLRUCOUNTER);
-    msg.m_index = (name) ? pgstat_slru_index(name) : -1;
-
-    pgstat_send(&msg, sizeof(msg));
+    if (name)
+    {
+        i = pgstat_slru_index(name);
+        LWLockAcquire(&shared_SLRUStats->lock, LW_EXCLUSIVE);
+        MemSet(&shared_SLRUStats->entry[i], 0, sizeof(PgStat_StatSLRUEntry));
+        shared_SLRUStats->entry[i].stat_reset_timestamp = ts;
+        LWLockRelease(&shared_SLRUStats->lock);
+    }
+    else
+    {
+        LWLockAcquire(&shared_SLRUStats->lock, LW_EXCLUSIVE);
+        for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
+        {
+            MemSet(&shared_SLRUStats->entry[i], 0, sizeof(PgStat_StatSLRUEntry));
+            shared_SLRUStats->entry[i].stat_reset_timestamp = ts;
+        }
+        LWLockRelease(&shared_SLRUStats->lock);
+    }
 }
 
 /* ----------
@@ -1440,48 +1684,63 @@ pgstat_reset_slru_counter(const char *name)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    ts = GetCurrentTimestamp();
 
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Store the last autovacuum time in the database's hash table entry.
+     */
+    dbentry = get_local_dbstat_entry(dboid);
+    dbentry->last_autovac_time = ts;
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report about the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    PgStat_TableStatus *tabentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    ts = GetCurrentTimestamp();
+    tabentry = get_local_tabstat_entry(tableoid, shared);
+
+    tabentry->t_counts.t_delta_live_tuples = livetuples;
+    tabentry->t_counts.t_delta_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = ts;
+        tabentry->t_counts.autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = ts;
+        tabentry->t_counts.vacuum_count++;
+    }
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report about the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1492,9 +1751,10 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    PgStat_TableStatus *tabentry;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1502,10 +1762,10 @@ pgstat_report_analyze(Relation rel,
      * already inserted and/or deleted rows in the target table. ANALYZE will
      * have counted such rows as live or dead respectively. Because we will
      * report our counts of such rows at transaction end, we should subtract
-     * off these counts from what we send to the collector now, else they'll
-     * be double-counted after commit.  (This approach also ensures that the
-     * collector ends up with the right numbers if we abort instead of
-     * committing.)
+     * off these counts from what we write to shared stats now, else
+     * they'll be double-counted after commit.  (This approach also ensures
+     * that the shared stats end up with the right numbers if we abort
+     * instead of committing.)
      */
     if (rel->pgstat_info != NULL)
     {
@@ -1523,158 +1783,172 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    tabentry = get_local_tabstat_entry(RelationGetRelid(rel),
+                                       rel->rd_rel->relisshared);
+
+    tabentry->t_counts.t_delta_live_tuples = livetuples;
+    tabentry->t_counts.t_delta_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->t_counts.reset_changed_tuples = true;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->t_counts.autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->t_counts.analyze_count++;
+    }
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            dbent->counts.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            dbent->counts.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            dbent->counts.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            dbent->counts.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            dbent->counts.n_conflict_startup_deadlock++;
+            break;
+    }
 }
 
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-
-/* --------
- * pgstat_report_checksum_failures_in_db() -
- *
- *    Tell the collector about one or more checksum failures.
- * --------
- */
-void
-pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
-{
-    PgStat_MsgChecksumFailure msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
-
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_deadlocks++;
 }
 
 /* --------
  * pgstat_report_checksum_failure() -
  *
- *    Tell the collector about a checksum failure.
+ *    Report a checksum failure.
  * --------
  */
 void
 pgstat_report_checksum_failure(void)
 {
-    pgstat_report_checksum_failures_in_db(MyDatabaseId, 1);
+    PgStat_StatDBEntry *dbent;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_checksum_failures++;
 }
 
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
+    if (filesize == 0)            /* Is there a case where filesize is really 0? */
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_temp_bytes += filesize; /* needs overflow check */
+    dbent->counts.n_temp_files++;
 }
 
 
-/* ----------
- * pgstat_ping() -
+/* --------
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
  *
- *    Send some junk data to the collector to increase traffic.
- * ----------
+ *    Report one or more checksum failures.
+ * --------
  */
 void
-pgstat_ping(void)
+pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
 {
-    PgStat_MsgDummy msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if we are not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dbentry = get_local_dbstat_entry(dboid);
+
+    /* add the reported failures to the accumulated count */
+    dbentry->counts.n_checksum_failures += failurecount;
 }
 
 /* ----------
- * pgstat_send_inquiry() -
+ * pgstat_init_function_usage() -
  *
- *    Notify collector that we need fresh data.
+ *  Initialize function call usage data.
+ *  Called by the executor before invoking a function.
  * ----------
  */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/*
- * Initialize function call usage data.
- * Called by the executor before invoking a function.
- */
 void
 pgstat_init_function_usage(FunctionCallInfo fcinfo,
                            PgStat_FunctionCallUsage *fcu)
 {
+    PgStatEnvelope *env;
     PgStat_BackendFunctionEntry *htabent;
     bool        found;
 
@@ -1685,26 +1959,15 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
         return;
     }
 
-    if (!pgStatFunctions)
-    {
-        /* First time through - initialize function stat table */
-        HASHCTL        hash_ctl;
+    env = get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                               fcinfo->flinfo->fn_oid, true, &found);
+    htabent = (PgStat_BackendFunctionEntry *) &env->body;
 
-        memset(&hash_ctl, 0, sizeof(hash_ctl));
-        hash_ctl.keysize = sizeof(Oid);
-        hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
-        pgStatFunctions = hash_create("Function stat entries",
-                                      PGSTAT_FUNCTION_HASH_SIZE,
-                                      &hash_ctl,
-                                      HASH_ELEM | HASH_BLOBS);
-    }
-
-    /* Get the stats entry for this function, create if necessary */
-    htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
-                          HASH_ENTER, &found);
     if (!found)
         MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
 
+    htabent->f_id = fcinfo->flinfo->fn_oid;
+
     fcu->fs = &htabent->f_counts;
 
     /* save stats for this function, later used to compensate for recursion */
@@ -1717,31 +1980,38 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
     INSTR_TIME_SET_CURRENT(fcu->f_start);
 }
 
-/*
- * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
- *        for specified function
+/* ----------
+ * find_funcstat_entry() -
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_BackendFunctionEntry entry for the specified
+ *  function.
+ *  If there is no entry, return NULL without creating a new one.
+ * ----------
  */
 PgStat_BackendFunctionEntry *
 find_funcstat_entry(Oid func_id)
 {
-    if (pgStatFunctions == NULL)
+    PgStatEnvelope *env;
+
+    env = get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                               func_id, false, NULL);
+    if (!env)
         return NULL;
 
-    return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
-                                                       (void *) &func_id,
-                                                       HASH_FIND, NULL);
+    return (PgStat_BackendFunctionEntry *) &env->body;
 }
 
-/*
- * Calculate function call usage and update stat counters.
- * Called by the executor after invoking a function.
+/* ----------
+ * pgstat_end_function_usage() -
  *
- * In the case of a set-returning function that runs in value-per-call mode,
- * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
- * calls for what the user considers a single call of the function.  The
- * finalize flag should be TRUE on the last call.
+ *  Calculate function call usage and update stat counters.
+ *  Called by the executor after invoking a function.
+ *
+ *  In the case of a set-returning function that runs in value-per-call mode,
+ *  we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
+ *  calls for what the user considers a single call of the function.  The
+ *  finalize flag should be TRUE on the last call.
+ * ----------
  */
 void
 pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
@@ -1782,9 +2052,6 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
         fs->f_numcalls++;
     fs->f_total_time = f_total;
     INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
 }
 
 
@@ -1796,8 +2063,7 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
  *
  *    We assume that a relcache entry's pgstat_info field is zeroed by
  *    relcache.c when the relcache entry is made; thereafter it is long-lived
- *    data.  We can avoid repeated searches of the TabStatus arrays when the
- *    same relation is touched repeatedly within a transaction.
+ *    data.
  * ----------
  */
 void
@@ -1813,7 +2079,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1829,116 +2096,157 @@ pgstat_initstats(Relation rel)
         return;
 
     /* Else find or make the PgStat_TableStatus entry, and update link */
-    rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+    rel->pgstat_info = get_local_tabstat_entry(rel_id, rel->rd_rel->relisshared);
 }
 
-/*
- * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
+
+/* ----------
+ * get_local_stat_entry() -
+ *
+ *  Returns the local stats entry for the given type, dbid and objid.
+ *  If create is true, a new entry is created when none exists yet; found
+ *  must be non-null in that case.  If create is false and no entry exists,
+ *  NULL is returned.
+ *
+ *  The caller is responsible for initializing the body of the returned envelope.
+ * ----------
  */
-static PgStat_TableStatus *
-get_tabstat_entry(Oid rel_id, bool isshared)
+static PgStatEnvelope *
+get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                     bool create, bool *found)
 {
-    TabStatHashEntry *hash_entry;
-    PgStat_TableStatus *entry;
-    TabStatusArray *tsa;
-    bool        found;
+    PgStatHashEntryKey key;
+    PgStatLocalHashEntry *entry;
 
-    /*
-     * Create hash table if we don't have it already.
-     */
-    if (pgStatTabHash == NULL)
+    if (pgStatLocalHash == NULL)
     {
         HASHCTL        ctl;
 
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
+        MemSet(&ctl, 0, sizeof(ctl));
+        ctl.keysize = sizeof(PgStatHashEntryKey);
+        ctl.entrysize = sizeof(PgStatLocalHashEntry);
+
+        pgStatLocalHash = hash_create("Local stat entries",
+                                      PGSTAT_TABLE_HASH_SIZE,
+                                      &ctl,
+                                      HASH_ELEM | HASH_BLOBS);
+    }
+
+    /* Find an entry or create a new one. */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    entry = hash_search(pgStatLocalHash, &key,
+                        create ? HASH_ENTER : HASH_FIND, found);
+
+    if (!create && !entry)
+        return NULL;
+
+    if (create && !*found)
+    {
+        int            len = pgstat_localentsize[type];
+
+        entry->env = MemoryContextAlloc(TopMemoryContext,
+                                        PgStatEnvelopeSize(len));
+        entry->env->type = type;
+        entry->env->len = len;
     }
 
+    return entry->env;
+}
+
+/* ----------
+ * get_local_dbstat_entry() -
+ *
+ *  Find or create a local PgStat_StatDBEntry entry for dbid.  A new entry is
+ *  created and initialized if none exists.
+ */
+static PgStat_StatDBEntry *
+get_local_dbstat_entry(Oid dbid)
+{
+    PgStatEnvelope *env;
+    PgStat_StatDBEntry *dbentry;
+    bool        found;
+
     /*
      * Find an entry or create a new one.
      */
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
+    env = get_local_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid,
+                               true, &found);
+    dbentry = (PgStat_StatDBEntry *) &env->body;
+
     if (!found)
     {
-        /* initialize new entry with null pointer */
-        hash_entry->tsa_entry = NULL;
+        dbentry->databaseid = dbid;
+        dbentry->last_autovac_time = 0;
+        dbentry->last_checksum_failure = 0;
+        dbentry->stat_reset_timestamp = 0;
+        dbentry->stats_timestamp = 0;
+        MemSet(&dbentry->counts, 0, sizeof(PgStat_StatDBCounts));
     }
 
-    /*
-     * If entry is already valid, we're done.
-     */
-    if (hash_entry->tsa_entry)
-        return hash_entry->tsa_entry;
-
-    /*
-     * Locate the first pgStatTabList entry with free space, making a new list
-     * entry if needed.  Note that we could get an OOM failure here, but if so
-     * we have left the hashtable and the list in a consistent state.
-     */
-    if (pgStatTabList == NULL)
-    {
-        /* Set up first pgStatTabList entry */
-        pgStatTabList = (TabStatusArray *)
-            MemoryContextAllocZero(TopMemoryContext,
-                                   sizeof(TabStatusArray));
-    }
+    return dbentry;
+}
 
-    tsa = pgStatTabList;
-    while (tsa->tsa_used >= TABSTAT_QUANTUM)
-    {
-        if (tsa->tsa_next == NULL)
-            tsa->tsa_next = (TabStatusArray *)
-                MemoryContextAllocZero(TopMemoryContext,
-                                       sizeof(TabStatusArray));
-        tsa = tsa->tsa_next;
-    }
 
-    /*
-     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
-     * the entry was already zeroed, either at creation or after last use.
-     */
-    entry = &tsa->tsa_entries[tsa->tsa_used++];
-    entry->t_id = rel_id;
-    entry->t_shared = isshared;
+/* ----------
+ * get_local_tabstat_entry() -
+ *  Find or create a PgStat_TableStatus entry for rel.  A new entry is created
+ *  and initialized if none exists.
+ * ----------
+ */
+static PgStat_TableStatus *
+get_local_tabstat_entry(Oid rel_id, bool isshared)
+{
+    PgStatEnvelope *env;
+    PgStat_TableStatus *tabentry;
+    bool        found;
 
-    /*
-     * Now we can fill the entry in pgStatTabHash.
-     */
-    hash_entry->tsa_entry = entry;
+    env = get_local_stat_entry(PGSTAT_TYPE_TABLE,
+                               isshared ? InvalidOid : MyDatabaseId,
+                               rel_id, true, &found);
 
-    return entry;
+    tabentry = (PgStat_TableStatus *) &env->body;
+
+    if (!found)
+    {
+        tabentry->t_id = rel_id;
+        tabentry->t_shared = isshared;
+        tabentry->trans = NULL;
+        MemSet(&tabentry->t_counts, 0, sizeof(PgStat_TableCounts));
+        tabentry->vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->autovac_analyze_timestamp = 0;
+    }
+
+    return tabentry;
 }
 
+
 /*
  * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_TableStatus entry for rel from the current
+ *  database then from shared tables.
  *
- * Note: if we got an error in the most recent execution of pgstat_report_stat,
- * it's possible that an entry exists but there's no hashtable entry for it.
- * That's okay, we'll treat this case as "doesn't exist".
+ *  If no entry, return NULL, don't create a new one
+ * ----------
  */
 PgStat_TableStatus *
 find_tabstat_entry(Oid rel_id)
 {
-    TabStatHashEntry *hash_entry;
+    PgStatEnvelope *env;
 
-    /* If hashtable doesn't exist, there are no entries at all */
-    if (!pgStatTabHash)
-        return NULL;
+    env = get_local_stat_entry(PGSTAT_TYPE_TABLE, MyDatabaseId, rel_id,
+                               false, NULL);
+    if (!env)
+        env = get_local_stat_entry(PGSTAT_TYPE_TABLE, InvalidOid, rel_id,
+                                   false, NULL);
+    if (env)
+        return (PgStat_TableStatus *) &env->body;
 
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
-    if (!hash_entry)
-        return NULL;
-
-    /* Note that this step could also return NULL, but that's correct */
-    return hash_entry->tsa_entry;
+    return NULL;
 }
 
 /*
@@ -2369,8 +2677,8 @@ AtPrepare_PgStat(void)
  *
  * All we need do here is unlink the transaction stats state from the
  * nontransactional state.  The nontransactional action counts will be
- * reported to the stats collector immediately, while the effects on live
- * and dead tuple counts are preserved in the 2PC state file.
+ * reported to the activity stats facility immediately, while the effects on
+ * live and dead tuple counts are preserved in the 2PC state file.
  *
  * Note: AtEOXact_PgStat is not called during PREPARE.
  */
@@ -2415,7 +2723,7 @@ pgstat_twophase_postcommit(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, commit case */
     pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
@@ -2451,7 +2759,7 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, abort case */
     if (rec->t_truncated)
@@ -2468,88 +2776,176 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 }
 
 
+/* ----------
+ * snapshot_statentry() -
+ *
+ *  Common routine for functions pgstat_fetch_stat_*entry()
+ *
+ *  Returns the pointer to the snapshot of the shared entry for the key or NULL
+ *  if not found. Returned snapshots are stable during the current transaction
+ *  or until pgstat_clear_snapshot() is called.
+ *
+ *  Created snapshots are stored in pgStatSnapshotHash.
+ */
+static void *
+snapshot_statentry(const PgStatTypes type, const Oid dbid, const Oid objid)
+{
+    PgStatSnapshot *snap = NULL;
+    bool        found;
+    PgStatHashEntryKey key;
+    size_t        statentsize = pgstat_entsize[type];
+
+    Assert(type != PGSTAT_TYPE_ALL);
+
+    /*
+     * Create new hash, with rather arbitrary initial number of entries since
+     * we don't know how this hash will grow.
+     */
+    if (!pgStatSnapshotHash)
+    {
+        HASHCTL        ctl;
+
+        /*
+         * Create the hash in the stats context
+         *
+         * Each entry is preceded by a common header part represented by
+         * PgStatSnapshot.
+         */
+
+        ctl.keysize = sizeof(PgStatHashEntryKey);
+        ctl.entrysize = PgStatSnapshotSize(statentsize);
+        ctl.hcxt = pgStatSnapshotContext;
+        pgStatSnapshotHash = hash_create("pgstat snapshot hash", 32, &ctl,
+                                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    }
+
+    /* Find a snapshot  */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+
+    snap = hash_search(pgStatSnapshotHash, &key, HASH_ENTER, &found);
+
+    /*
+     * Refer to the shared hash if not found in the snapshot hash.
+     *
+     * In transaction state, we obviously should create a snapshot entry for
+     * consistency.  Otherwise we return an up-to-date entry.  Even then we
+     * need a copy, since a shared stats entry can be modified at any time,
+     * so we reuse the same snapshot entry for that purpose.
+     */
+    if (!found || !IsTransactionState())
+    {
+        PgStatEnvelope *shenv;
+
+        shenv = get_stat_entry(type, dbid, objid, true, NULL, NULL);
+
+        if (shenv)
+            memcpy(&snap->body, &shenv->body, statentsize);
+
+        snap->negative = !shenv;
+    }
+
+    if (snap->negative)
+        return NULL;
+
+    return &snap->body;
+}
+
+
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    Find database stats entry on backends. The returned entries are cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
  * ----------
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    /* If not done for this transaction, take a snapshot of global stats */
+    pgstat_snapshot_global_stats();
+
+    /* the caller has no business with snapshot-local members */
+    return (PgStat_StatDBEntry *)
+        snapshot_statentry(PGSTAT_TYPE_DB, dbid, InvalidOid);
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one table or NULL. NULL doesn't mean
+ *    the activity statistics for one table or NULL. NULL doesn't mean
  *    that the table doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    activity statistics facilities, so the caller is better off to
+ *    report ZERO instead.
  * ----------
  */
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
-    PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(false, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
-
-    return NULL;
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(true, relid);
+    return tabentry;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_tabentry_snapshot() -
+ *
+ *    Find the table stats entry on backends.  The returned entry is cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_snapshot(bool shared, Oid reloid)
+{
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    return (PgStat_StatTabEntry *)
+        snapshot_statentry(PGSTAT_TYPE_TABLE, dboid, reloid);
+}
+
+
+/* ----------
+ * pgstat_copy_index_counters() -
+ *
+ *    Support function for index swapping. Copy a portion of the counters of the
+ *    relation to the specified place.
+ * ----------
+ */
+void
+pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst)
+{
+    PgStat_StatTabEntry *tabentry;
+
+    /* No point fetching tabentry when dst is NULL */
+    if (!dst)
+        return;
+
+    tabentry = pgstat_fetch_stat_tabentry(relid);
+
+    if (!tabentry)
+        return;
+
+    dst->t_counts.t_numscans = tabentry->numscans;
+    dst->t_counts.t_tuples_returned = tabentry->tuples_returned;
+    dst->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
+    dst->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
+    dst->t_counts.t_blocks_hit = tabentry->blocks_hit;
 }
 
 
@@ -2563,25 +2959,60 @@ pgstat_fetch_stat_tabentry(Oid relid)
 PgStat_StatFuncEntry *
 pgstat_fetch_stat_funcentry(Oid func_id)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry = NULL;
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
 
-    /* load the stats file if needed */
-    backend_read_statsfile();
+    return (PgStat_StatFuncEntry *)
+        snapshot_statentry(PGSTAT_TYPE_FUNCTION, MyDatabaseId, func_id);
+}
 
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
+/*
+ * pgstat_snapshot_global_stats() -
+ *
+ * Makes a snapshot of global stats if not done yet.  They will be kept until
+ * a subsequent call to pgstat_clear_snapshot() or the end of the current
+ * memory context (typically TopTransactionContext).
+ * ----------
+ */
+static void
+pgstat_snapshot_global_stats(void)
+{
+    MemoryContext oldcontext;
+    int    i;
+
+    attach_shared_stats();
+
+    /* Nothing to do if already done */
+    if (global_snapshot_is_valid)
+        return;
+
+    oldcontext = MemoryContextSwitchTo(pgStatSnapshotContext);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+
+    memcpy(&snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
+    memcpy(&snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
+    memcpy(snapshot_SLRUStats, shared_SLRUStats,
+           sizeof(PgStat_StatSLRUEntry) * SLRU_NUM_ELEMENTS);
+
+    LWLockRelease(StatsLock);
+
+    /* Fill in empty timestamp of SLRU stats  */
+    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
     {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
+        if (snapshot_SLRUStats[i].stat_reset_timestamp == 0)
+            snapshot_SLRUStats[i].stat_reset_timestamp =
+                snapshot_globalStats.stat_reset_timestamp;
     }
+    global_snapshot_is_valid = true;
 
-    return funcentry;
+    MemoryContextSwitchTo(oldcontext);
+
+    return;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_beentry() -
  *
@@ -2652,9 +3083,10 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &archiverStats;
+    return &snapshot_archiverStats;
 }
 
 
@@ -2669,9 +3101,10 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    /* If not done for this transaction, take a stats snapshot */
+    pgstat_snapshot_global_stats();
 
-    return &globalStats;
+    return &snapshot_globalStats;
 }
 
 
@@ -2683,12 +3116,12 @@ pgstat_fetch_global(void)
  *    a pointer to the slru statistics struct.
  * ---------
  */
-PgStat_SLRUStats *
+PgStat_StatSLRUEntry *
 pgstat_fetch_slru(void)
 {
-    backend_read_statsfile();
+    pgstat_snapshot_global_stats();
 
-    return slruStats;
+    return snapshot_SLRUStats;
 }
 
 
@@ -2902,8 +3335,8 @@ pgstat_initialize(void)
         MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
     }
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* need to be called before dsm shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -3079,12 +3512,15 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+    /* attach shared database stats area */
+    attach_shared_stats();
 }
 
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
- * Flush any remaining statistics counts out to the collector.
+ * Flush any remaining statistics counts out to shared stats.
  * Without this, operations triggered during backend exit (such as
  * temp table deletions) won't be counted.
  *
@@ -3097,7 +3533,7 @@ pgstat_beshutdown_hook(int code, Datum arg)
 
     /*
      * If we got as far as discovering our own database ID, we can report what
-     * we did to the collector.  Otherwise, we'd be sending an invalid
+     * we did to the shared stats.  Otherwise, we'd be sending an invalid
      * database ID, so forget it.  (This means that accesses to pg_database
      * during failed backend starts might never get counted.)
      */
@@ -3114,6 +3550,8 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    detach_shared_stats(true);
 }
 
 
@@ -3374,7 +3812,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3669,9 +4108,6 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
-            break;
         case WAIT_EVENT_RECOVERY_WAL_STREAM:
             event_name = "RecoveryWalStream";
             break;
@@ -4324,94 +4760,71 @@ pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
 
 
 /* ----------
- * pgstat_setheader() -
+ * pgstat_report_archiver() -
  *
- *        Set common header fields in a statistics message
- * ----------
- */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
-    {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
- *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
+ *        Report archiver statistics
  * ----------
  */
 void
-pgstat_send_archiver(const char *xlog, bool failed)
+pgstat_report_archiver(const char *xlog, bool failed)
 {
-    PgStat_MsgArchiver msg;
+    TimestampTz now = GetCurrentTimestamp();
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    strlcpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+    if (failed)
+    {
+        /* Failed archival attempt */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->failed_count;
+        memcpy(shared_archiverStats->last_failed_wal, xlog,
+               sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
+    else
+    {
+        /* Successful archival operation */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        ++shared_archiverStats->archived_count;
+        memcpy(shared_archiverStats->last_archived_wal, xlog,
+               sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = now;
+        LWLockRelease(StatsLock);
+    }
 }
 
 /* ----------
- * pgstat_send_bgwriter() -
+ * pgstat_report_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
-pgstat_send_bgwriter(void)
+pgstat_report_bgwriter(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+
+    PgStat_BgWriter *l = &BgWriterStats;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid taking the lock when the stats are completely empty.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    shared_globalStats->timed_checkpoints += l->timed_checkpoints;
+    shared_globalStats->requested_checkpoints += l->requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += l->checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += l->checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += l->buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += l->buf_written_clean;
+    shared_globalStats->maxwritten_clean += l->maxwritten_clean;
+    shared_globalStats->buf_written_backend += l->buf_written_backend;
+    shared_globalStats->buf_fsync_backend += l->buf_fsync_backend;
+    shared_globalStats->buf_alloc += l->buf_alloc;
+    LWLockRelease(StatsLock);
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4419,477 +4832,31 @@ pgstat_send_bgwriter(void)
     MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
 }
 
-/* ----------
- * pgstat_send_slru() -
- *
- *        Send SLRU statistics to the collector
- * ----------
- */
-static void
-pgstat_send_slru(void)
-{
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgSLRU all_zeroes;
-
-    for (int i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /*
-         * This function can be called even if nothing at all has happened. In
-         * this case, avoid sending a completely empty message to the stats
-         * collector.
-         */
-        if (memcmp(&SLRUStats[i], &all_zeroes, sizeof(PgStat_MsgSLRU)) == 0)
-            continue;
-
-        /* set the SLRU type before each send */
-        SLRUStats[i].m_index = i;
-
-        /*
-         * Prepare and send the message
-         */
-        pgstat_setheader(&SLRUStats[i].m_hdr, PGSTAT_MTYPE_SLRU);
-        pgstat_send(&SLRUStats[i], sizeof(PgStat_MsgSLRU));
-
-        /*
-         * Clear out the statistics buffer, so it can be re-used.
-         */
-        MemSet(&SLRUStats[i], 0, sizeof(PgStat_MsgSLRU));
-    }
-}
-
-
-/* ----------
- * PgstatCollectorMain() -
- *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-    WaitEvent    event;
-    WaitEventSet *wes;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, SignalHandlerForConfigReload);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    MyBackendType = B_STATS_COLLECTOR;
-    init_ps_display(NULL);
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /* Prepare to wait for our latch or data in our socket. */
-    wes = CreateWaitEventSet(CurrentMemoryContext, 3);
-    AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
-    AddWaitEventToSet(wes, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL, NULL);
-    AddWaitEventToSet(wes, WL_SOCKET_READABLE, pgStatSock, NULL, NULL);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do check ConfigReloadPending
-     * inside the inner loop, which means that such interrupts will get
-     * serviced but the latch won't get cleared until next time there is a
-     * break in the action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (ShutdownRequestPending)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * ShutdownRequestPending becomes set.
-         */
-        while (!ShutdownRequestPending)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (ConfigReloadPending)
-            {
-                ConfigReloadPending = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSLRUCOUNTER:
-                    pgstat_recv_resetslrucounter(&msg.msg_resetslrucounter,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_SLRU:
-                    pgstat_recv_slru(&msg.msg_slru, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitEventSetWait(wes, -1L, &event, 1, WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitEventSetWait(wes, 2 * 1000L /* msec */ , &event, 1,
-                              WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr == 1 && event.events == WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    FreeWaitEventSet(wes);
-
-    exit(0);
-}
-
-/*
- * Subroutine to clear stats in a database entry
- *
- * Tables and functions hashes are initialized to empty.
- */
-static void
-reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
-{
-    HASHCTL        hash_ctl;
-
-    dbentry->n_xact_commit = 0;
-    dbentry->n_xact_rollback = 0;
-    dbentry->n_blocks_fetched = 0;
-    dbentry->n_blocks_hit = 0;
-    dbentry->n_tuples_returned = 0;
-    dbentry->n_tuples_fetched = 0;
-    dbentry->n_tuples_inserted = 0;
-    dbentry->n_tuples_updated = 0;
-    dbentry->n_tuples_deleted = 0;
-    dbentry->last_autovac_time = 0;
-    dbentry->n_conflict_tablespace = 0;
-    dbentry->n_conflict_lock = 0;
-    dbentry->n_conflict_snapshot = 0;
-    dbentry->n_conflict_bufferpin = 0;
-    dbentry->n_conflict_startup_deadlock = 0;
-    dbentry->n_temp_files = 0;
-    dbentry->n_temp_bytes = 0;
-    dbentry->n_deadlocks = 0;
-    dbentry->n_checksum_failures = 0;
-    dbentry->last_checksum_failure = 0;
-    dbentry->n_block_read_time = 0;
-    dbentry->n_block_write_time = 0;
-
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
-
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
-}
-
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
-{
-    PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
-     */
-    if (!found)
-        reset_dbentry_counters(result);
-
-    return result;
-}
-
-
-/*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
-{
-    PgStat_StatTabEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /* If not found, initialize the new one. */
-    if (!found)
-    {
-        result->numscans = 0;
-        result->tuples_returned = 0;
-        result->tuples_fetched = 0;
-        result->tuples_inserted = 0;
-        result->tuples_updated = 0;
-        result->tuples_deleted = 0;
-        result->tuples_hot_updated = 0;
-        result->n_live_tuples = 0;
-        result->n_dead_tuples = 0;
-        result->changes_since_analyze = 0;
-        result->inserts_since_vacuum = 0;
-        result->blocks_fetched = 0;
-        result->blocks_hit = 0;
-        result->vacuum_timestamp = 0;
-        result->vacuum_count = 0;
-        result->autovac_vacuum_timestamp = 0;
-        result->autovac_vacuum_count = 0;
-        result->analyze_timestamp = 0;
-        result->analyze_count = 0;
-        result->autovac_analyze_timestamp = 0;
-        result->autovac_analyze_count = 0;
-    }
-
-    return result;
-}
-
 
 /* ----------
  * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ *        Write the global statistics file, as well as DB files.
  * ----------
  */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+void
+pgstat_write_statsfiles(void)
 {
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **penv;
+
+    /* Stats are not initialized yet; just return. */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
+    create_missing_dbentries();
+
     /*
      * Open the statistics temp file to write out the current values.
      */
@@ -4906,7 +4873,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
 
     /*
      * Write the file header --- currently just a format ID.
@@ -4918,38 +4885,31 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Write global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(shared_globalStats, sizeof(*shared_globalStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write archiver stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Write SLRU stats struct
-     */
-    rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
+    rc = fwrite(shared_archiverStats, sizeof(*shared_archiverStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Walk through the database table.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_DB, InvalidOid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) &(*penv)->body;
+
         /*
          * Write out the table and function stats for this DB into the
          * appropriate per-DB stat file, if required.
          */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
+        pgstat_write_database_stats(dbentry);
 
         /*
          * Write out the DB entry. We don't write the tables or functions
@@ -4960,6 +4920,8 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
         (void) rc;                /* we'll check for error with ferror */
     }
 
+    pfree(envlist);
+
     /*
      * No more output to be done. Close the temp file and replace the old
      * pgstat.stat with it.  The ferror() check replaces testing for error
@@ -4992,55 +4954,19 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
 }
 
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
 
 /* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
+ * pgstat_write_database_stats() -
+ *  Write the stat file for a single database.
  * ----------
  */
 static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
+pgstat_write_database_stats(PgStat_StatDBEntry *dbentry)
 {
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
+    PgStatEnvelope **envlist;
+    PgStatEnvelope **penv;
     FILE       *fpout;
     int32        format_id;
     Oid            dbid = dbentry->databaseid;
@@ -5048,8 +4974,8 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     char        tmpfile[MAXPGPATH];
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -5076,24 +5002,31 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     /*
      * Walk through the database's access stats per table.
      */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_TABLE, dbentry->databaseid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatTabEntry *tabentry = (PgStat_StatTabEntry *) &(*penv)->body;
+
         fputc('T', fpout);
         rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    pfree(envlist);
 
     /*
      * Walk through the database's function stats table.
      */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+    envlist = collect_stat_entries(PGSTAT_TYPE_FUNCTION, dbentry->databaseid);
+    for (penv = envlist; *penv != NULL; penv++)
     {
+        PgStat_StatFuncEntry *funcentry =
+        (PgStat_StatFuncEntry *) &(*penv)->body;
+
         fputc('F', fpout);
         rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    pfree(envlist);
 
     /*
      * No more output to be done. Close the temp file and replace the old
@@ -5127,102 +5060,165 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
+}
 
-    if (permanent)
+/* ----------
+ * create_missing_dbentries() -
+ *
+ *  A database entry may be missing for a database in which object stats
+ *  are recorded. This function creates such missing dbentries so that all
+ *  stats entries can be written out to files.
+ * ----------
+ */
+static void
+create_missing_dbentries(void)
+{
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
+    HTAB       *oidhash;
+    HASHCTL        ctl;
+    HASH_SEQ_STATUS scan;
+    Oid           *poid;
+
+    memset(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(Oid);
+    ctl.entrysize = sizeof(Oid);
+    ctl.hcxt = CurrentMemoryContext;
+    oidhash = hash_create("Temporary table of OIDs",
+                          PGSTAT_TABLE_HASH_SIZE,
+                          &ctl,
+                          HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+    /* Collect database OIDs from the shared stats hash */
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+        hash_search(oidhash, &p->key.databaseid, HASH_ENTER, NULL);
+    dshash_seq_term(&hstat);
+
+    /* Create any missing database entries. */
+    hash_seq_init(&scan, oidhash);
+    while ((poid = (Oid *) hash_seq_search(&scan)) != NULL)
+        (void) get_stat_entry(PGSTAT_TYPE_DB, *poid, InvalidOid,
+                              false, init_dbentry, NULL);
+
+    hash_destroy(oidhash);
+}
+
+
+/* ----------
+ * get_stat_entry() -
+ *
+ *  Get the shared stats entry for the specified type, dbid and objid.
+ *  If nowait is true, returns NULL on lock failure.
+ *
+ *  If initfunc is not NULL, a new entry is created when none exists yet, and
+ *  the function is called with the new envelope. If found is not NULL, it is
+ *  set to true if an existing entry was found, or false if not.
+ * ----------
+ */
+static PgStatEnvelope *
+get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+               bool nowait, entry_initializer initfunc, bool *found)
+{
+    bool        create = (initfunc != NULL);
+    PgStatHashEntry *shent;
+    PgStatEnvelope *shenv = NULL;
+    PgStatHashEntryKey key;
+    bool        myfound;
+
+    Assert(type != PGSTAT_TYPE_ALL);
+
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    shent = dshash_find_extended(pgStatSharedHash, &key,
+                                 create, nowait, create, &myfound);
+    if (shent)
     {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
+        if (create && !myfound)
+        {
+            /* Create new stats envelope. */
+            size_t        envsize = PgStatEnvelopeSize(pgstat_entsize[type]);
+            dsa_pointer chunk = dsa_allocate0(area, envsize);
 
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
+            shenv = dsa_get_address(area, chunk);
+            shenv->type = type;
+            shenv->databaseid = dbid;
+            shenv->objectid = objid;
+            shenv->len = pgstat_entsize[type];
+            LWLockInitialize(&shenv->lock, LWTRANCHE_STATS);
+
+            /*
+             * The lock on dshash is released right after this. Call the
+             * initializer before the entry is exposed to other processes.
+             */
+            if (initfunc)
+                initfunc(shenv);
+
+            /* Link the new entry from the hash entry. */
+            shent->env = chunk;
+        }
+        else
+            shenv = dsa_get_address(area, shent->env);
+
+        dshash_release_lock(pgStatSharedHash, shent);
     }
+
+    if (found)
+        *found = myfound;
+
+    return shenv;
 }
 
+
 /* ----------
  * pgstat_read_statsfiles() -
  *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ *    Reads existing activity statistics files into the shared stats hash.
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+void
+pgstat_read_statsfiles(void)
 {
+    PgStatEnvelope *env;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-    int            i;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
 
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
+    /* shouldn't be called from postmaster */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-    memset(&slruStats, 0, sizeof(slruStats));
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Set the same reset timestamp for all SLRU items too.
-     */
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-        slruStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp =
+        shared_globalStats->stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the activity statistics is not running or
+     * has not yet written the stats file the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -5231,7 +5227,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
@@ -5239,49 +5235,30 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     /*
      * Read global stats struct
      */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
+    if (fread(shared_globalStats, 1, sizeof(*shared_globalStats), fpin) !=
+        sizeof(*shared_globalStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
+        MemSet(shared_globalStats, 0, sizeof(*shared_globalStats));
         goto done;
     }
 
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
     /*
      * Read archiver stats struct
      */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
+    if (fread(shared_archiverStats, 1, sizeof(*shared_archiverStats), fpin) !=
+        sizeof(*shared_archiverStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
+        MemSet(shared_archiverStats, 0, sizeof(*shared_archiverStats));
         goto done;
     }
 
     /*
-     * Read SLRU stats struct
-     */
-    if (fread(slruStats, 1, sizeof(slruStats), fpin) != sizeof(slruStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&slruStats, 0, sizeof(slruStats));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
     for (;;)
     {
@@ -5295,7 +5272,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
                           fpin) != offsetof(PgStat_StatDBEntry, tables))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5304,76 +5281,33 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 /*
                  * Add to the DB hash
                  */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
+
+                env = get_stat_entry(PGSTAT_TYPE_DB, dbbuf.databaseid,
+                                     InvalidOid,
+                                     false, init_dbentry, &found);
+                dbentry = (PgStat_StatDBEntry *) &env->body;
+
+                /* don't allow duplicate dbentries */
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
+                memcpy(dbentry, &dbbuf,
+                       offsetof(PgStat_StatDBEntry, tables));
 
+                /* Read the data from the database-specific file. */
+                pgstat_read_db_statsfile(dbentry);
                 break;
 
             case 'E':
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5383,59 +5317,50 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 done:
     FreeFile(fpin);
 
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 
-    return dbhash;
+    return;
 }
 
 
 /* ----------
  * pgstat_read_db_statsfile() -
  *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
+ *    Reads in the at-rest statistics file and creates shared statistics
+ *    tables. The file is removed after reading.
  * ----------
  */
 static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
+pgstat_read_db_statsfile(PgStat_StatDBEntry *dbentry)
 {
+    PgStatEnvelope *env;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatTabEntry tabbuf;
     PgStat_StatFuncEntry funcbuf;
     PgStat_StatFuncEntry *funcentry;
+    dshash_table *tabhash = NULL;
+    dshash_table *funchash = NULL;
     FILE       *fpin;
     int32        format_id;
     bool        found;
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
+    get_dbstat_filename(false, dbentry->databaseid, statfile, MAXPGPATH);
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the activity statistics is not running or
+     * has not yet written the stats file the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
@@ -5448,14 +5373,14 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
 
     /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
     for (;;)
     {
@@ -5468,25 +5393,21 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
                           fpin) != sizeof(PgStat_StatTabEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
+                env = get_stat_entry(PGSTAT_TYPE_TABLE,
+                                     dbentry->databaseid, tabbuf.tableid,
+                                     false, init_tabentry, &found);
+                tabentry = (PgStat_StatTabEntry *) &env->body;
 
+                /* don't allow duplicate entries */
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5502,25 +5423,20 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
                           fpin) != sizeof(PgStat_StatFuncEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }

-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
+                env = get_stat_entry(PGSTAT_TYPE_FUNCTION, dbentry->databaseid,
+                                     funcbuf.functionid,
+                                     false, init_funcentry, &found);
+                funcentry = (PgStat_StatFuncEntry *) &env->body;
 
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5536,7 +5452,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5544,304 +5460,15 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     }
 
 done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read SLRU stats struct
-     */
-    if (fread(mySLRUStats, 1, sizeof(mySLRUStats), fpin) != sizeof(mySLRUStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
+    if (tabhash)
+        dshash_detach(tabhash);
+    if (funchash)
+        dshash_detach(funchash);
 
-done:
     FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
-}
-
 
-/* ----------
- * pgstat_setup_memcxt() -
- *
- *    Create pgStatLocalContext, if not already done.
- * ----------
- */
-static void
-pgstat_setup_memcxt(void)
-{
-    if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 }
 
 
@@ -5860,797 +5487,25 @@ pgstat_clear_snapshot(void)
 {
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
+    {
         MemoryContextDelete(pgStatLocalContext);
 
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-                tabentry->inserts_since_vacuum = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
     }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
 
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
+    if (pgStatSnapshotContext)
     {
-        char        statfile[MAXPGPATH];
+        MemoryContextReset(pgStatSnapshotContext);
 
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
+        /* Reset variables that pointed to the context */
+        global_snapshot_is_valid = false;
+        pgStatSnapshotHash = NULL;
     }
 }
 
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_resetslrucounter() -
- *
- *    Reset some SLRU statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len)
-{
-    int            i;
-    TimestampTz ts = GetCurrentTimestamp();
-
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /* reset entry with the given index, or all entries (index is -1) */
-        if ((msg->m_index == -1) || (msg->m_index == i))
-        {
-            memset(&slruStats[i], 0, sizeof(slruStats[i]));
-            slruStats[i].stat_reset_timestamp = ts;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signaling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * It is quite possible that a non-aggressive VACUUM ended up skipping
-     * various pages, however, we'll zero the insert counter here regardless.
-     * It's currently used only to track when we need to perform an "insert"
-     * autovacuum, which are mainly intended to freeze newly inserted tuples.
-     * Zeroing this may just mean we'll not try to vacuum the table again
-     * until enough tuples have been inserted to trigger another insert
-     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
-     * stragglers.
-     */
-    tabentry->inserts_since_vacuum = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_slru() -
- *
- *    Process a SLRU message.
- * ----------
- */
-static void
-pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
-{
-    slruStats[msg->m_index].blocks_zeroed += msg->m_blocks_zeroed;
-    slruStats[msg->m_index].blocks_hit += msg->m_blocks_hit;
-    slruStats[msg->m_index].blocks_read += msg->m_blocks_read;
-    slruStats[msg->m_index].blocks_written += msg->m_blocks_written;
-    slruStats[msg->m_index].blocks_exists += msg->m_blocks_exists;
-    slruStats[msg->m_index].flush += msg->m_flush;
-    slruStats[msg->m_index].truncate += msg->m_truncate;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
-}
-
 /*
  * Convert a potentially unsafely truncated activity string (see
  * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
@@ -6739,7 +5594,7 @@ pgstat_slru_name(int slru_idx)
  * Returns pointer to entry with counters for given SLRU (based on the name
  * stored in SlruCtl as lwlock tranche name).
  */
-static inline PgStat_MsgSLRU *
+static PgStat_StatSLRUEntry *
 slru_entry(int slru_idx)
 {
     /*
@@ -6750,7 +5605,7 @@ slru_entry(int slru_idx)
 
     Assert((slru_idx >= 0) && (slru_idx < SLRU_NUM_ELEMENTS));
 
-    return &SLRUStats[slru_idx];
+    return &local_SLRUStats[slru_idx];
 }
 
 /*
@@ -6760,41 +5615,48 @@ slru_entry(int slru_idx)
 void
 pgstat_count_slru_page_zeroed(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_zeroed += 1;
+    slru_entry(slru_idx)->blocks_zeroed += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_hit(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_hit += 1;
+    slru_entry(slru_idx)->blocks_hit += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_exists(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_exists += 1;
+    slru_entry(slru_idx)->blocks_exists += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_read(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_read += 1;
+    slru_entry(slru_idx)->blocks_read += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_written(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_written += 1;
+    slru_entry(slru_idx)->blocks_written += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_flush(int slru_idx)
 {
-    slru_entry(slru_idx)->m_flush += 1;
+    slru_entry(slru_idx)->flush += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_truncate(int slru_idx)
 {
-    slru_entry(slru_idx)->m_truncate += 1;
+    slru_entry(slru_idx)->truncate += 1;
+    have_slrustats = true;
 }
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 4c296dd214..d10bba75f0 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -256,7 +256,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -517,7 +516,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1339,12 +1337,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1793,11 +1785,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2725,8 +2712,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -3053,8 +3038,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3121,13 +3104,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3200,22 +3176,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3676,22 +3636,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3911,8 +3855,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3936,8 +3878,7 @@ PostmasterStateMachine(void)
     {
         /*
          * PM_WAIT_DEAD_END state ends when the BackendList is entirely empty
-         * (ie, no dead_end children remain), and the archiver and stats
-         * collector are gone too.
+         * (ie, no dead_end children remain), and the archiver is gone too.
          *
          * The reason we wait for those two is to protect them against a new
          * postmaster starting conflicting subprocesses; this isn't an
@@ -3947,8 +3888,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4149,8 +4089,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5123,18 +5061,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5253,12 +5179,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6159,7 +6079,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6215,8 +6134,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6451,7 +6368,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 6064384e32..cc6b2bb5de 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1553,8 +1553,8 @@ is_checksummed_file(const char *fullpath, const char *filename)
  *
  * If 'missing_ok' is true, will not throw an error if the file is not found.
  *
- * If dboid is anything other than InvalidOid then any checksum failures detected
- * will get reported to the stats collector.
+ * If dboid is anything other than InvalidOid then any checksum failures
+ * detected will get reported to the activity stats facility.
  *
  * Returns true if the file was successfully sent, false if 'missing_ok',
  * and the file did not exist.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a2a963bd5b..61fa52ed66 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2045,7 +2045,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                BgWriterStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2155,7 +2155,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2345,7 +2345,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2353,7 +2353,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
 
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 96c2aaabbd..55693cfa3a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -149,6 +149,7 @@ CreateSharedMemoryAndSemaphores(void)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -267,6 +268,7 @@ CreateSharedMemoryAndSemaphores(void)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
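The ipci.c hunks above register the new stats area in two places: its size is added to the total shared memory request with add_size(), and it is initialized afterwards with StatsShmemInit(). The sizing half can be sketched as below. This is only an illustration under assumptions: the helper names and the byte counts are hypothetical, and the checked addition returns false on overflow, whereas PostgreSQL's add_size() raises an error.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Overflow-checked addition; returns false instead of raising an error. */
static bool
checked_add_size(size_t a, size_t b, size_t *result)
{
    if (b > SIZE_MAX - a)
        return false;
    *result = a + b;
    return true;
}

/* Hypothetical per-subsystem size callbacks; the numbers are made up. */
static size_t
async_shmem_size(void)
{
    return 4096;
}

static size_t
stats_shmem_size(void)
{
    return 256;
}

/* Accumulate every subsystem's request before allocating anything. */
static bool
total_shmem_size(size_t *total)
{
    size_t      size = 0;

    if (!checked_add_size(size, async_shmem_size(), &size))
        return false;
    if (!checked_add_size(size, stats_shmem_size(), &size))
        return false;

    *total = size;
    return true;
}
```

Summing all requests first lets the segment be allocated once; each subsystem's init routine then carves out its own piece.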
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 2fa90cc095..17a46db74b 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -177,7 +177,9 @@ static const char *const BuiltinTrancheNames[] = {
     /* LWTRANCHE_PARALLEL_APPEND: */
     "ParallelAppend",
     /* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
-    "PerXactPredicateList"
+    "PerXactPredicateList",
+    /* LWTRANCHE_STATS: */
+    "ActivityStatistics"
 };
 
 StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index 774292fd94..91bf8d5b5d 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -53,3 +53,4 @@ XactTruncationLock                    44
 # 45 was XactTruncationLock until removal of BackendRandomLock
 WrapLimitsVacuumLock                46
 NotifyQueueTailLock                    47
+StatsLock                            48
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index dcc09df0c7..0dec4b9145 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -415,8 +415,8 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
     DropRelFileNodesAllBuffers(rnodes, nrels);
 
     /*
-     * It'd be nice to tell the stats collector to forget them immediately,
-     * too. But we can't because we don't know the OIDs.
+     * It'd be nice to tell the activity stats facility to forget them
+     * immediately, too. But we can't because we don't know the OIDs.
      */
 
     /*
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index c9424f167c..0396aa469f 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3209,6 +3209,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3783,6 +3789,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4185,11 +4192,12 @@ PostgresMain(int argc, char *argv[],
          * Note: this includes fflush()'ing the last of the prior output.
          *
          * This is also a good time to send collected statistics to the
-         * collector, and to update the PS stats display.  We avoid doing
-         * those every time through the message loop because it'd slow down
-         * processing of batched messages, and because we don't want to report
-         * uncommitted updates (that confuses autovacuum).  The notification
-         * processor wants a call too, if we are not in a transaction block.
+         * activity stats facility, and to update the PS stats display.  We
+         * avoid doing those every time through the message loop because it'd
+         * slow down processing of batched messages, and because we don't want
+         * to report uncommitted updates (that confuses autovacuum).  The
+         * notification processor wants a call too, if we are not in a
+         * transaction block.
          */
         if (send_ready_for_query)
         {
@@ -4221,6 +4229,8 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long stats_timeout;
+
                 /* Send out notify signals and transmit self-notifies */
                 ProcessCompletedNotifies();
 
@@ -4233,8 +4243,13 @@ PostgresMain(int argc, char *argv[],
                 if (notifyInterruptPending)
                     ProcessNotifyInterrupt();
 
-                pgstat_report_stat(false);
-
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle");
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4269,7 +4284,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4277,6 +4292,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
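The postgres.c hunks above change the idle path. pgstat_report_stat(false) now returns a value, and when that value is positive the backend arms IDLE_STATS_UPDATE_TIMEOUT so that the remaining counts are flushed later through pgstat_report_stat(true). A minimal standalone sketch of that control flow follows. It assumes the return value means "milliseconds to wait before the next flush attempt", which this part of the patch does not spell out. Every name and the 500 ms interval are hypothetical stand-ins, not PostgreSQL APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are PostgreSQL functions. */
#define MIN_FLUSH_INTERVAL_MS 500L  /* assumed minimum gap between flushes */

static long last_flush_ms = -MIN_FLUSH_INTERVAL_MS; /* lets the first flush run */
static int  pending_counts = 0;     /* counts not yet written out */
static int  flushed_counts = 0;     /* counts written to the shared area */

/*
 * Try to flush pending counts at time now_ms.  Returns 0 when nothing is
 * left pending, otherwise the number of milliseconds the caller should wait
 * before retrying.  With force = true the interval check is skipped.
 */
static long
report_stat(long now_ms, bool force)
{
    long        elapsed = now_ms - last_flush_ms;

    if (pending_counts == 0)
        return 0;
    if (!force && elapsed < MIN_FLUSH_INTERVAL_MS)
        return MIN_FLUSH_INTERVAL_MS - elapsed;

    flushed_counts += pending_counts;
    pending_counts = 0;
    last_flush_ms = now_ms;
    return 0;
}

/*
 * Idle-path step: returns true when the caller must arm a timer, and stores
 * the delay in *timeout_ms.
 */
static bool
enter_idle(long now_ms, long *timeout_ms)
{
    assert(timeout_ms != NULL);
    *timeout_ms = report_stat(now_ms, false);
    return *timeout_ms > 0;
}
```

When enter_idle() returns true, the caller would arm a timer for *timeout_ms and, once it fires, call report_stat(now, true). If a new command arrives first, the timer is simply disabled, which matches step (5) in the hunk.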
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 95738a4e34..3e8ce0b3bf 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * pgstatfuncs.c
- *      Functions for accessing the statistics collector data
+ *      Functions for accessing the activity statistics data
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -35,9 +35,6 @@
 
 #define HAS_PGSTAT_PERMISSIONS(role)     (is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS) || has_privs_of_role(GetUserId(),role))
 
 
-/* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
-
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
 {
@@ -1267,7 +1264,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_commit);
+        result = (int64) (dbentry->counts.n_xact_commit);
 
     PG_RETURN_INT64(result);
 }
@@ -1283,7 +1280,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_rollback);
+        result = (int64) (dbentry->counts.n_xact_rollback);
 
     PG_RETURN_INT64(result);
 }
@@ -1299,7 +1296,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_fetched);
+        result = (int64) (dbentry->counts.n_blocks_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1315,7 +1312,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_hit);
+        result = (int64) (dbentry->counts.n_blocks_hit);
 
     PG_RETURN_INT64(result);
 }
@@ -1331,7 +1328,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_returned);
+        result = (int64) (dbentry->counts.n_tuples_returned);
 
     PG_RETURN_INT64(result);
 }
@@ -1347,7 +1344,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_fetched);
+        result = (int64) (dbentry->counts.n_tuples_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1363,7 +1360,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_inserted);
+        result = (int64) (dbentry->counts.n_tuples_inserted);
 
     PG_RETURN_INT64(result);
 }
@@ -1379,7 +1376,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_updated);
+        result = (int64) (dbentry->counts.n_tuples_updated);
 
     PG_RETURN_INT64(result);
 }
@@ -1395,7 +1392,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_deleted);
+        result = (int64) (dbentry->counts.n_tuples_deleted);
 
     PG_RETURN_INT64(result);
 }
@@ -1428,7 +1425,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_files;
+        result = dbentry->counts.n_temp_files;
 
     PG_RETURN_INT64(result);
 }
@@ -1444,7 +1441,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_bytes;
+        result = dbentry->counts.n_temp_bytes;
 
     PG_RETURN_INT64(result);
 }
@@ -1459,7 +1456,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace);
+        result = (int64) (dbentry->counts.n_conflict_tablespace);
 
     PG_RETURN_INT64(result);
 }
@@ -1474,7 +1471,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_lock);
+        result = (int64) (dbentry->counts.n_conflict_lock);
 
     PG_RETURN_INT64(result);
 }
@@ -1489,7 +1486,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_snapshot);
+        result = (int64) (dbentry->counts.n_conflict_snapshot);
 
     PG_RETURN_INT64(result);
 }
@@ -1504,7 +1501,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_bufferpin);
+        result = (int64) (dbentry->counts.n_conflict_bufferpin);
 
     PG_RETURN_INT64(result);
 }
@@ -1519,7 +1516,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1534,11 +1531,11 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace +
-                          dbentry->n_conflict_lock +
-                          dbentry->n_conflict_snapshot +
-                          dbentry->n_conflict_bufferpin +
-                          dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_tablespace +
+                          dbentry->counts.n_conflict_lock +
+                          dbentry->counts.n_conflict_snapshot +
+                          dbentry->counts.n_conflict_bufferpin +
+                          dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1553,7 +1550,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_deadlocks);
+        result = (int64) (dbentry->counts.n_deadlocks);
 
     PG_RETURN_INT64(result);
 }
@@ -1571,7 +1568,7 @@ pg_stat_get_db_checksum_failures(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_checksum_failures);
+        result = (int64) (dbentry->counts.n_checksum_failures);
 
     PG_RETURN_INT64(result);
 }
@@ -1608,7 +1605,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_read_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_read_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1624,7 +1621,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_write_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_write_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1710,7 +1707,7 @@ pg_stat_get_slru(PG_FUNCTION_ARGS)
     MemoryContext per_query_ctx;
     MemoryContext oldcontext;
     int            i;
-    PgStat_SLRUStats *stats;
+    PgStat_StatSLRUEntry *stats;
 
     /* check to see if caller supports us returning a tuplestore */
     if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
@@ -1744,7 +1741,7 @@ pg_stat_get_slru(PG_FUNCTION_ARGS)
         /* for each row */
         Datum        values[PG_STAT_GET_SLRU_COLS];
         bool        nulls[PG_STAT_GET_SLRU_COLS];
-        PgStat_SLRUStats stat = stats[i];
+        PgStat_StatSLRUEntry    stat = stats[i];
         const char *name;
 
         name = pgstat_slru_name(i);
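The pgstatfuncs.c hunks above all make the same mechanical change: per-database counters are now read through a nested member, dbentry->counts, instead of directly from the entry. A small sketch of the accessor shape, using pg_stat_get_db_conflict_all as the model, is shown below. The struct layout is hypothetical; only the nesting under "counts" and the field names are taken from the diff.

```c
#include <stddef.h>
#include <stdint.h>

typedef int64_t Counter;        /* stands in for PgStat_Counter */

/* Hypothetical layout: only the "counts" nesting comes from the diff. */
typedef struct DBCounts
{
    Counter     n_conflict_tablespace;
    Counter     n_conflict_lock;
    Counter     n_conflict_snapshot;
    Counter     n_conflict_bufferpin;
    Counter     n_conflict_startup_deadlock;
} DBCounts;

typedef struct DBEntry
{
    unsigned int databaseid;
    DBCounts    counts;         /* all event counters grouped together */
} DBEntry;

/* Sum of all recovery-conflict counters; 0 when no entry exists. */
static int64_t
db_conflict_all(const DBEntry *dbentry)
{
    if (dbentry == NULL)
        return 0;

    return dbentry->counts.n_conflict_tablespace +
        dbentry->counts.n_conflict_lock +
        dbentry->counts.n_conflict_snapshot +
        dbentry->counts.n_conflict_bufferpin +
        dbentry->counts.n_conflict_startup_deadlock;
}
```

Grouping the counters in one sub-struct also makes it possible to copy or zero them as a unit, although whether the patch relies on that is not visible in this section.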
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 6ab8216839..883c7f8802 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index cf8f9579c3..ecd0cec2c3 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -243,9 +243,6 @@ GetBackendTypeDesc(BackendType backendType)
         case B_ARCHIVER:
             backendDesc = "archiver";
             break;
-        case B_STATS_COLLECTOR:
-            backendDesc = "stats collector";
-            break;
         case B_LOGGER:
             backendDesc = "logger";
             break;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index d4ab4c7e23..4ff4cc33d9 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -631,6 +632,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1245,6 +1248,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
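IdleStatsUpdateTimeoutHandler above only sets flags and the latch. The actual flush happens later in ProcessInterrupts, which checks IdleStatsUpdateTimeoutPending and calls pgstat_report_stat(true). The same "set a flag in the handler, do the work at a safe point" pattern can be sketched in portable C as follows. The names are hypothetical and the latch is left out.

```c
#include <signal.h>
#include <stdbool.h>

/* Hypothetical flags mirroring the pattern; not the PostgreSQL globals. */
static volatile sig_atomic_t stats_update_pending = false;
static volatile sig_atomic_t interrupt_pending = false;
static int  flush_calls = 0;

/* Handler: only sets flags, which is safe inside a signal handler. */
static void
stats_timeout_handler(int signo)
{
    (void) signo;
    stats_update_pending = true;
    interrupt_pending = true;
}

/* Called from normal code at a safe point; does the real work. */
static void
process_interrupts(void)
{
    interrupt_pending = false;

    if (stats_update_pending)
    {
        stats_update_pending = false;
        flush_calls++;          /* stands in for pgstat_report_stat(true) */
    }
}
```

Keeping the handler this small avoids calling anything that is not async-signal-safe. The cost is that the flush is delayed until the next point where interrupts are processed.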
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index de87ad6ef7..36113378f7 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -743,8 +743,8 @@ const char *const config_group_names[] =
     gettext_noop("Statistics"),
     /* STATS_MONITORING */
     gettext_noop("Statistics / Monitoring"),
-    /* STATS_COLLECTOR */
-    gettext_noop("Statistics / Query and Index Statistics Collector"),
+    /* STATS_ACTIVITY */
+    gettext_noop("Statistics / Query and Index Activity Statistics"),
     /* AUTOVACUUM */
     gettext_noop("Autovacuum"),
     /* CLIENT_CONN */
@@ -1454,7 +1454,7 @@ static struct config_bool ConfigureNamesBool[] =
 #endif
 
     {
-        {"track_activities", PGC_SUSET, STATS_COLLECTOR,
+        {"track_activities", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects information about executing commands."),
             gettext_noop("Enables the collection of information on the currently "
                          "executing command of each session, along with "
@@ -1465,7 +1465,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_counts", PGC_SUSET, STATS_COLLECTOR,
+        {"track_counts", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects statistics on database activity."),
             NULL
         },
@@ -1474,7 +1474,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_io_timing", PGC_SUSET, STATS_COLLECTOR,
+        {"track_io_timing", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects timing statistics for database I/O activity."),
             NULL
         },
@@ -4310,7 +4310,7 @@ static struct config_string ConfigureNamesString[] =
     },
 
     {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
+        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
             gettext_noop("Writes temporary statistics files to the specified directory."),
             NULL,
             GUC_SUPERUSER_ONLY
@@ -4646,7 +4646,7 @@ static struct config_enum ConfigureNamesEnum[] =
     },
 
     {
-        {"track_functions", PGC_SUSET, STATS_COLLECTOR,
+        {"track_functions", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects function-level statistics on database activity."),
             NULL
         },
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9cb571f7cc..668a2d033a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -579,7 +579,7 @@
 # STATISTICS
 #------------------------------------------------------------------------------
 
-# - Query and Index Statistics Collector -
+# - Query and Index Activity Statistics -
 
 #track_activities = on
 #track_counts = on
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index f674a7c94e..26603e95e4 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -124,7 +124,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index bba002ad24..b70c495f2a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -83,6 +83,8 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
 
@@ -321,7 +323,6 @@ typedef enum BackendType
     B_WAL_SENDER,
     B_WAL_WRITER,
     B_ARCHIVER,
-    B_STATS_COLLECTOR,
     B_LOGGER,
 } BackendType;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0dfbac46b4..29a8737498 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL activity statistics facility.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -15,9 +15,11 @@
 #include "libpq/pqcomm.h"
 #include "miscadmin.h"
 #include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -41,35 +43,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_RESETSLRUCOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_SLRU,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -80,9 +53,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -113,18 +85,17 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_delta_live_tuples;
     PgStat_Counter t_delta_dead_tuples;
     PgStat_Counter t_changed_tuples;
+    bool        reset_changed_tuples;
 
     PgStat_Counter t_blocks_fetched;
     PgStat_Counter t_blocks_hit;
+
+    PgStat_Counter vacuum_count;
+    PgStat_Counter autovac_vacuum_count;
+    PgStat_Counter analyze_count;
+    PgStat_Counter autovac_analyze_count;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -158,6 +129,10 @@ typedef struct PgStat_TableStatus
     Oid            t_id;            /* table's OID */
     bool        t_shared;        /* is it a shared catalog? */
     struct PgStat_TableXactStatus *trans;    /* lowest subxact's counts */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_TableCounts t_counts;    /* event counts to be sent */
 } PgStat_TableStatus;
 
@@ -183,308 +158,32 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
-/* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
- */
-typedef struct PgStat_MsgHdr
-{
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgResetslrucounter Sent by the backend to tell the collector
- *                                to reset a SLRU counter
- * ----------
- */
-typedef struct PgStat_MsgResetslrucounter
-{
-    PgStat_MsgHdr m_hdr;
-    int            m_index;
-} PgStat_MsgResetslrucounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgSLRU            Sent by a backend to update SLRU statistics.
- * ----------
- */
-typedef struct PgStat_MsgSLRU
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Counter m_index;
-    PgStat_Counter m_blocks_zeroed;
-    PgStat_Counter m_blocks_hit;
-    PgStat_Counter m_blocks_read;
-    PgStat_Counter m_blocks_written;
-    PgStat_Counter m_blocks_exists;
-    PgStat_Counter m_flush;
-    PgStat_Counter m_truncate;
-} PgStat_MsgSLRU;
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
 /* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgTempFile
+typedef struct PgStat_BgWriter
 {
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter buf_alloc;
+    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+} PgStat_BgWriter;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -516,98 +215,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgResetslrucounter msg_resetslrucounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgSLRU msg_slru;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Activity statistics data structures in files and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -616,13 +225,9 @@ typedef union PgStat_Msg
 
 #define PGSTAT_FILE_FORMAT_ID    0x01A5BC9D
 
-/* ----------
- * PgStat_StatDBEntry            The collector's data per database
- * ----------
- */
-typedef struct PgStat_StatDBEntry
+
+typedef struct PgStat_StatDBCounts
 {
-    Oid            databaseid;
     PgStat_Counter n_xact_commit;
     PgStat_Counter n_xact_rollback;
     PgStat_Counter n_blocks_fetched;
@@ -632,7 +237,6 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_tuples_inserted;
     PgStat_Counter n_tuples_updated;
     PgStat_Counter n_tuples_deleted;
-    TimestampTz last_autovac_time;
     PgStat_Counter n_conflict_tablespace;
     PgStat_Counter n_conflict_lock;
     PgStat_Counter n_conflict_snapshot;
@@ -642,29 +246,72 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_temp_bytes;
     PgStat_Counter n_deadlocks;
     PgStat_Counter n_checksum_failures;
-    TimestampTz last_checksum_failure;
     PgStat_Counter n_block_read_time;    /* times in microseconds */
     PgStat_Counter n_block_write_time;
+} PgStat_StatDBCounts;
 
+/* ----------
+ * PgStat_StatDBEntry            The statistics per database
+ * ----------
+ */
+typedef struct PgStat_StatDBEntry
+{
+    Oid            databaseid;
+    TimestampTz last_autovac_time;
+    TimestampTz last_checksum_failure;
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;    /* time of db stats update */
+
+    PgStat_StatDBCounts counts;
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following fields must be last in the struct, because we don't write
+     * them out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables; /* current gen tables hash */
+    dshash_table_handle functions;    /* current gen functions hash */
 } PgStat_StatDBEntry;
 
 
+/*
+ * SLRU statistics kept in the shared stats
+ */
+typedef struct PgStat_StatSLRUEntry
+{
+    PgStat_Counter blocks_zeroed;
+    PgStat_Counter blocks_hit;
+    PgStat_Counter blocks_read;
+    PgStat_Counter blocks_written;
+    PgStat_Counter blocks_exists;
+    PgStat_Counter flush;
+    PgStat_Counter truncate;
+    TimestampTz stat_reset_timestamp;
+} PgStat_StatSLRUEntry;
+
+
 /* ----------
- * PgStat_StatTabEntry            The collector's data per table (or index)
+ * PgStat_HashMountInfo  dshash mount information
+ * ----------
+ */
+typedef struct PgStat_HashMountInfo
+{
+    HTAB       *snapshot_tables;    /* table entry snapshot */
+    HTAB       *snapshot_functions; /* function entry snapshot */
+    dshash_table *dshash_tables;    /* attached tables dshash */
+    dshash_table *dshash_functions; /* attached functions dshash */
+} PgStat_HashMountInfo;
+
+/* ----------
+ * PgStat_StatTabEntry            The statistics per table (or index)
  * ----------
  */
 typedef struct PgStat_StatTabEntry
 {
     Oid            tableid;
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
 
     PgStat_Counter numscans;
 
@@ -684,19 +331,15 @@ typedef struct PgStat_StatTabEntry
     PgStat_Counter blocks_fetched;
     PgStat_Counter blocks_hit;
 
-    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
     PgStat_Counter vacuum_count;
-    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_vacuum_count;
-    TimestampTz analyze_timestamp;    /* user initiated */
     PgStat_Counter analyze_count;
-    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_analyze_count;
 } PgStat_StatTabEntry;
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            The statistics per function
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
@@ -709,9 +352,8 @@ typedef struct PgStat_StatFuncEntry
     PgStat_Counter f_self_time;
 } PgStat_StatFuncEntry;
 
-
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -727,7 +369,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -745,21 +387,6 @@ typedef struct PgStat_GlobalStats
     TimestampTz stat_reset_timestamp;
 } PgStat_GlobalStats;
 
-/*
- * SLRU statistics kept in the stats collector
- */
-typedef struct PgStat_SLRUStats
-{
-    PgStat_Counter blocks_zeroed;
-    PgStat_Counter blocks_hit;
-    PgStat_Counter blocks_read;
-    PgStat_Counter blocks_written;
-    PgStat_Counter blocks_exists;
-    PgStat_Counter flush;
-    PgStat_Counter truncate;
-    TimestampTz stat_reset_timestamp;
-} PgStat_SLRUStats;
-
 
 /* ----------
  * Backend states
@@ -808,7 +435,6 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
     WAIT_EVENT_WAL_RECEIVER_MAIN,
@@ -1060,7 +686,7 @@ typedef struct PgBackendGSSStatus
  *
  * Each live backend maintains a PgBackendStatus struct in shared memory
  * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
+ * BackendId, but that is not critical.)  Note that the stats-writing code
  * has no involvement in, or even access to, these structs.
  *
  * Each auxiliary process also maintains a PgBackendStatus struct in shared
@@ -1257,13 +883,15 @@ extern PGDLLIMPORT bool pgstat_track_counts;
 extern PGDLLIMPORT int pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used; to be removed together with the corresponding GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1278,29 +906,26 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
 
+/* File input/output functions */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 extern void pgstat_reset_slru_counter(const char *);
 
@@ -1462,8 +1087,8 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                                       void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
+extern void pgstat_report_archiver(const char *xlog, bool failed);
+extern void pgstat_report_bgwriter(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1472,13 +1097,16 @@ extern void pgstat_send_bgwriter(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_snapshot(bool shared,
+                                                                Oid relid);
+extern void pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
-extern PgStat_SLRUStats *pgstat_fetch_slru(void);
+extern PgStat_StatSLRUEntry *pgstat_fetch_slru(void);
 
 extern void pgstat_count_slru_page_zeroed(int slru_idx);
 extern void pgstat_count_slru_page_hit(int slru_idx);
@@ -1489,5 +1117,6 @@ extern void pgstat_count_slru_flush(int slru_idx);
 extern void pgstat_count_slru_truncate(int slru_idx);
 extern const char *pgstat_slru_name(int slru_idx);
 extern int    pgstat_slru_index(const char *name);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index af9b41795d..621e074111 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -218,6 +218,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SHARED_TIDBITMAP,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_PER_XACT_PREDICATE_LIST,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 04431d0eb2..3b03464a1a 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -88,7 +88,7 @@ enum config_group
     PROCESS_TITLE,
     STATS,
     STATS_MONITORING,
-    STATS_COLLECTOR,
+    STATS_ACTIVITY,
     AUTOVACUUM,
     CLIENT_CONN,
     CLIENT_CONN_STATEMENT,
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 83a15f6795..77d1572a99 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.18.4

From 9016c2aaadb8c82d3292ad729b8e75c8c5a4d340 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:09 +0900
Subject: [PATCH v36 6/7] Doc part of shared-memory based stats collector.

---
 doc/src/sgml/catalogs.sgml          |   6 +-
 doc/src/sgml/config.sgml            |  34 ++++----
 doc/src/sgml/high-availability.sgml |  13 +--
 doc/src/sgml/monitoring.sgml        | 127 +++++++++++++---------------
 doc/src/sgml/ref/pg_dump.sgml       |   9 +-
 5 files changed, 90 insertions(+), 99 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 508bea3bc6..d30491d4f6 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -9208,9 +9208,9 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l
   <para>
    <xref linkend="view-table"/> lists the system views described here.
    More detailed documentation of each view follows below.
-   There are some additional views that provide access to the results of
-   the statistics collector; they are described in <xref
-   linkend="monitoring-stats-views-table"/>.
+   There are some additional views that provide access to the activity
+   statistics; they are described in
+   <xref linkend="monitoring-stats-views-table"/>.
   </para>
 
   <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c4ba49ffaf..530f41c194 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7190,11 +7190,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
     <title>Run-time Statistics</title>
 
     <sect2 id="runtime-config-statistics-collector">
-     <title>Query and Index Statistics Collector</title>
+     <title>Query and Index Activity Statistics</title>
 
      <para>
-      These parameters control server-wide statistics collection features.
-      When statistics collection is enabled, the data that is produced can be
+      These parameters control server-wide activity statistics features.
+      When activity statistics tracking is enabled, the data that is produced can be
       accessed via the <structname>pg_stat</structname> and
       <structname>pg_statio</structname> family of system views.
       Refer to <xref linkend="monitoring"/> for more information.
@@ -7210,14 +7210,13 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables the collection of information on the currently
-        executing command of each session, along with the time when
-        that command began execution. This parameter is on by
-        default. Note that even when enabled, this information is not
-        visible to all users, only to superusers and the user owning
-        the session being reported on, so it should not represent a
-        security risk.
-        Only superusers can change this setting.
+        Enables tracking of the currently executing command of
+        each session, along with the time when that command began
+        execution. This parameter is on by default. Note that even when
+        enabled, this information is not visible to all users, only to
+        superusers and the user owning the session being reported on, so it
+        should not represent a security risk.  Only superusers can change this
+        setting.
        </para>
       </listitem>
      </varlistentry>
@@ -7248,9 +7247,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables collection of statistics on database activity.
+        Enables tracking of database activity.
         This parameter is on by default, because the autovacuum
-        daemon needs the collected information.
+        daemon needs the activity information.
         Only superusers can change this setting.
        </para>
       </listitem>
@@ -8311,7 +8310,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Specifies the fraction of the total number of heap tuples counted in
-        the previous statistics collection that can be inserted without
+        the previously collected statistics that can be inserted without
         incurring an index scan at the <command>VACUUM</command> cleanup stage.
         This setting currently applies to B-tree indexes only.
        </para>
@@ -8323,9 +8322,10 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         the index contains deleted pages that can be recycled during cleanup.
         Index statistics are considered to be stale if the number of newly
         inserted tuples exceeds the <varname>vacuum_cleanup_index_scale_factor</varname>
-        fraction of the total number of heap tuples detected by the previous
-        statistics collection. The total number of heap tuples is stored in
-        the index meta-page. Note that the meta-page does not include this data
+
+        fraction of the total number of heap tuples in the previously
+        collected statistics. The total number of heap tuples is stored in the
+        index meta-page. Note that the meta-page does not include this data
         until <command>VACUUM</command> finds no dead tuples, so B-tree index
         scan at the cleanup stage can only be skipped if the second and
         subsequent <command>VACUUM</command> cycles detect no dead tuples.
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index beb309e668..45ec7ce68f 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2366,12 +2366,13 @@ LOG:  database system is ready to accept read only connections
    </para>
 
    <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
-    index usage, etc., will be recorded normally on the standby. Replayed
-    actions will not duplicate their effects on primary, so replaying an
-    insert will not increment the Inserts column of pg_stat_user_tables.
-    The stats file is deleted at the start of recovery, so stats from primary
-    and standby will differ; this is considered a feature, not a bug.
+    Activity statistics are collected during recovery. All scans, reads,
+    blocks, index usage, etc., will be recorded normally on the
+    standby. Replayed actions will not duplicate their effects on primary, so
+    replaying an insert will not increment the Inserts column of
+    pg_stat_user_tables.  Activity statistics are reset at the start of
+    recovery, so stats from primary and standby will differ; this is
+    considered a feature, not a bug.
    </para>
 
    <para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 673a0e73e4..dbf439891d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -22,7 +22,7 @@
   <para>
    Several tools are available for monitoring database activity and
    analyzing performance.  Most of this chapter is devoted to describing
-   <productname>PostgreSQL</productname>'s statistics collector,
+   <productname>PostgreSQL</productname>'s activity statistics,
    but one should not neglect regular Unix monitoring programs such as
    <command>ps</command>, <command>top</command>, <command>iostat</command>, and <command>vmstat</command>.
    Also, once one has identified a
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in
transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    primary server process.  The command arguments
-   shown for it are the same ones used when it was launched.  The next five
+   shown for it are the same ones used when it was launched.  The next four
    processes are background worker processes automatically launched by the
-   primary process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   primary process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
@@ -130,20 +128,21 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
  </sect1>
 
  <sect1 id="monitoring-stats">
-  <title>The Statistics Collector</title>
+  <title>The Activity Statistics</title>
 
   <indexterm zone="monitoring-stats">
    <primary>statistics</primary>
   </indexterm>
 
   <para>
-   <productname>PostgreSQL</productname>'s <firstterm>statistics collector</firstterm>
-   is a subsystem that supports collection and reporting of information about
-   server activity.  Presently, the collector can count accesses to tables
-   and indexes in both disk-block and individual-row terms.  It also tracks
-   the total number of rows in each table, and information about vacuum and
-   analyze actions for each table.  It can also count calls to user-defined
-   functions and the total time spent in each one.
+   <productname>PostgreSQL</productname>'s <firstterm>activity
+   statistics</firstterm> subsystem supports tracking and reporting
+   of information about server activity.  Presently, it tracks
+   the count of accesses to tables and indexes in both disk-block and
+   individual-row terms.  It also tracks the total number of rows in each
+   table, and information about vacuum and analyze actions for each table.  It
+   can also track calls to user-defined functions and the total time spent in
+   each one.
   </para>
 
   <para>
@@ -151,15 +150,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    information about exactly what is going on in the system right now, such as
    the exact command currently being executed by other server processes, and
    which other connections exist in the system.  This facility is independent
-   of the collector process.
+   of the activity statistics.
   </para>
 
  <sect2 id="monitoring-stats-setup">
-  <title>Statistics Collection Configuration</title>
+  <title>Activity Statistics Configuration</title>
 
   <para>
-   Since collection of statistics adds some overhead to query execution,
-   the system can be configured to collect or not collect information.
+   Since tracking activity statistics adds some overhead to query
+   execution, the system can be configured to track or not track activity.
    This is controlled by configuration parameters that are normally set in
    <filename>postgresql.conf</filename>.  (See <xref linkend="runtime-config"/> for
    details about setting configuration parameters.)
@@ -172,7 +171,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The parameter <xref linkend="guc-track-counts"/> controls whether
-   statistics are collected about table and index accesses.
+   to track activity about table and index accesses.
   </para>
 
   <para>
@@ -196,18 +195,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
   </para>
 
   <para>
-   The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
-   When the server shuts down cleanly, a permanent copy of the statistics
-   data is stored in the <filename>pg_stat</filename> subdirectory, so that
-   statistics can be retained across server restarts.  When recovery is
-   performed at server start (e.g., after immediate shutdown, server crash,
-   and point-in-time recovery), all statistics counters are reset.
+   When the server shuts down cleanly, a permanent copy of the statistics
+   data is stored in the <filename>pg_stat</filename> subdirectory, so that
+   statistics can be retained across server restarts.  When recovery is
+   performed at server start (e.g., after immediate shutdown, server crash,
+   and point-in-time recovery), all statistics counters are reset.
   </para>
 
  </sect2>
@@ -220,48 +212,46 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    linkend="monitoring-stats-dynamic-views-table"/>, are available to show
    the current state of the system. There are also several other
    views, listed in <xref
-   linkend="monitoring-stats-views-table"/>, available to show the results
-   of statistics collection.  Alternatively, one can
-   build custom views using the underlying statistics functions, as discussed
-   in <xref linkend="monitoring-stats-functions"/>.
+   linkend="monitoring-stats-views-table"/>, available to show the activity
+   statistics.  Alternatively, one can build custom views using the underlying
+   statistics functions, as discussed in
+   <xref linkend="monitoring-stats-functions"/>.
   </para>
 
   <para>
-   When using the statistics to monitor collected data, it is important
-   to realize that the information does not update instantaneously.
-   Each individual server process transmits new statistical counts to
-   the collector just before going idle; so a query or transaction still in
-   progress does not affect the displayed totals.  Also, the collector itself
-   emits a new report at most once per <varname>PGSTAT_STAT_INTERVAL</varname>
-   milliseconds (500 ms unless altered while building the server).  So the
-   displayed information lags behind actual activity.  However, current-query
-   information collected by <varname>track_activities</varname> is
-   always up-to-date.
+   When using the activity statistics, it is important to realize that the
+   information does not update instantaneously.  Each server process writes
+   out new statistical counts just before going idle, but no more often than
+   once per <varname>PGSTAT_STAT_INTERVAL</varname> milliseconds (1 second
+   unless altered while building the server); so a query or transaction still in
+   progress does not affect the displayed totals.  However, current-query
+   information tracked by <varname>track_activities</varname> is always
+   up-to-date.
   </para>
 
   <para>
    Another important point is that when a server process is asked to display
-   any of these statistics, it first fetches the most recent report emitted by
-   the collector process and then continues to use this snapshot for all
-   statistical views and functions until the end of its current transaction.
-   So the statistics will show static information as long as you continue the
-   current transaction.  Similarly, information about the current queries of
-   all sessions is collected when any such information is first requested
-   within a transaction, and the same information will be displayed throughout
-   the transaction.
-   This is a feature, not a bug, because it allows you to perform several
-   queries on the statistics and correlate the results without worrying that
-   the numbers are changing underneath you.  But if you want to see new
-   results with each query, be sure to do the queries outside any transaction
-   block.  Alternatively, you can invoke
+   any of these statistics, it first reads the current statistics and then
+   continues to use this snapshot for all statistical views and functions
+   until the end of its current transaction.  So the statistics will show
+   static information as long as you continue the current transaction.
+   Similarly, information about the current queries of all sessions is tracked
+   when any such information is first requested within a transaction, and the
+   same information will be displayed throughout the transaction.  This is a
+   feature, not a bug, because it allows you to perform several queries on the
+   statistics and correlate the results without worrying that the numbers are
+   changing underneath you.  But if you want to see new results with each
+   query, be sure to do the queries outside any transaction block.
+   Alternatively, you can invoke
    <function>pg_stat_clear_snapshot</function>(), which will discard the
    current transaction's statistics snapshot (if any).  The next use of
    statistical information will cause a new snapshot to be fetched.
   </para>
-
+  
   <para>
-   A transaction can also see its own statistics (as yet untransmitted to the
-   collector) in the views <structname>pg_stat_xact_all_tables</structname>,
+   A transaction can also see its own statistics (as yet unwritten to the
+   server-wide activity statistics) in the
+   views <structname>pg_stat_xact_all_tables</structname>,
    <structname>pg_stat_xact_sys_tables</structname>,
    <structname>pg_stat_xact_user_tables</structname>, and
    <structname>pg_stat_xact_user_functions</structname>.  These numbers do not act as
@@ -620,7 +610,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    kernel's I/O cache, and might therefore still be fetched without
    requiring a physical read. Users interested in obtaining more
    detailed information on <productname>PostgreSQL</productname> I/O behavior are
-   advised to use the <productname>PostgreSQL</productname> statistics collector
+   advised to use the <productname>PostgreSQL</productname> activity statistics
    in combination with operating system utilities that allow insight
    into the kernel's handling of I/O.
   </para>
@@ -1057,10 +1047,6 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>LogicalLauncherMain</literal></entry>
       <entry>Waiting in main loop of logical replication launcher process.</entry>
      </row>
-     <row>
-      <entry><literal>PgStatMain</literal></entry>
-      <entry>Waiting in main loop of statistics collector process.</entry>
-     </row>
      <row>
       <entry><literal>RecoveryWalStream</literal></entry>
       <entry>Waiting in main loop of startup process for WAL to arrive, during
@@ -1815,6 +1801,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
     </thead>
 
     <tbody>
+     <row>
+      <entry><literal>ActivityStatistics</literal></entry>
+      <entry>Waiting to write out activity statistics to shared memory.</entry>
+     </row>
      <row>
       <entry><literal>AddinShmemInit</literal></entry>
       <entry>Waiting to manage an extension's space allocation in shared
@@ -5738,9 +5728,10 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
      <entry><literal>performing final cleanup</literal></entry>
      <entry>
        <command>VACUUM</command> is performing final cleanup.  During this phase,
-       <command>VACUUM</command> will vacuum the free space map, update statistics
-       in <literal>pg_class</literal>, and report statistics to the statistics
-       collector.  When this phase is completed, <command>VACUUM</command> will end.
+       <command>VACUUM</command> will vacuum the free space map, update
+       statistics in <literal>pg_class</literal>, and update the system-wide
+       activity statistics.  When this phase is completed,
+       <command>VACUUM</command> will end.
      </entry>
     </row>
    </tbody>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 0b2e2de87b..ad88efdfee 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1288,11 +1288,10 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The database activity of <application>pg_dump</application> is
-   normally collected by the statistics collector.  If this is
-   undesirable, you can set parameter <varname>track_counts</varname>
-   to false via <envar>PGOPTIONS</envar> or the <literal>ALTER
-   USER</literal> command.
+   The database activity of <application>pg_dump</application> is normally
+   collected.  If this is undesirable, you can set
+   parameter <varname>track_counts</varname> to false
+   via <envar>PGOPTIONS</envar> or the <literal>ALTER USER</literal> command.
   </para>
 
  </refsect1>
-- 
2.18.4

From 980614004a05151ca57eabff61c787300a27ff8b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 17:00:44 +0900
Subject: [PATCH v36 7/7] Remove the GUC stats_temp_directory

The GUC used to specify the directory to store temporary statistics
files. It is no longer needed by the stats collector but still used by
the programs in bin and contrib, and maybe other extensions. Thus this
patch removes the GUC but some backing variables and macro definitions
are left alone for backward compatibility.
---
 doc/src/sgml/backup.sgml                      |  2 -
 doc/src/sgml/config.sgml                      | 19 ---------
 doc/src/sgml/storage.sgml                     |  3 +-
 src/backend/postmaster/pgstat.c               | 13 +++---
 src/backend/replication/basebackup.c          | 13 ++----
 src/backend/utils/misc/guc.c                  | 41 -------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/pgstat.h                          |  5 ++-
 src/test/perl/PostgresNode.pm                 |  4 --
 9 files changed, 13 insertions(+), 88 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index b9331830f7..5096963234 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1146,8 +1146,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 530f41c194..7f2f18f294 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7303,25 +7303,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 3234adb639..559f75fb54 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -122,8 +122,7 @@ Item
 
 <row>
  <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
+ <entry>Subdirectory containing ephemeral files for extensions</entry>
 </row>
 
 <row>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 90c4ea31a0..b07add0a4f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -96,15 +96,12 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
+/*
+ * This used to be a GUC variable and is no longer used in this file, but left
+ * alone just for backward compatibility for extensions, having the default
+ * value.
  */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+char       *pgstat_stat_directory = PG_STAT_TMP_DIR;
 
 typedef struct StatsShmemStruct
 {
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index cc6b2bb5de..357008915d 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -261,7 +261,6 @@ perform_base_backup(basebackup_options *opt)
     StringInfo    labelfile;
     StringInfo    tblspc_map_file;
     backup_manifest_info manifest;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
     backup_total = 0;
@@ -284,8 +283,6 @@ perform_base_backup(basebackup_options *opt)
     Assert(CurrentResourceOwner == NULL);
     CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -318,13 +315,9 @@ perform_base_backup(basebackup_options *opt)
          * Calculate the relative path of temporary statistics directory in
          * order to skip the files which are located in that directory later.
          */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
+
+        Assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
+        statrelpath = psprintf("./%s", PG_STAT_TMP_DIR);
 
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 36113378f7..70c573dcaa 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -202,7 +202,6 @@ static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource sourc
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_huge_page_size(int *newval, void **extra, GucSource source);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -4309,17 +4308,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_PRIMARY,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11608,35 +11596,6 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
     return true;
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 668a2d033a..7183c08305 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -586,7 +586,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 29a8737498..9fa87de887 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -32,7 +32,10 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
+/*
+ * This used to be the directory to store temporary statistics data in but is
+ * no longer used. Defined here for backward compatibility.
+ */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
 
 /* Values for track_functions GUC variable --- order is significant! */
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 1488bffa2b..bb5474b878 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -455,10 +455,6 @@ sub init
     print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
       if defined $ENV{TEMP_CONFIG};
 
-    # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
-    # concurrently must not share a stats_temp_directory.
-    print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
     if ($params{allows_streaming})
     {
         if ($params{allows_streaming} eq "logical")
-- 
2.18.4


Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
At Tue, 08 Sep 2020 17:55:57 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Tue, 08 Sep 2020 17:01:47 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > At Thu, 3 Sep 2020 13:16:59 -0400, Stephen Frost <sfrost@snowman.net> wrote in
> > > I've looked through (some of) this thread and through the patches also
> > > and hope to provide a review of the bits that should be targetting v14
> > > (unlike the above) soon.
> > 
> > Thanks. Currently the patch is found to cause heavier write contention
> > than the current stats collector when nearly a thousand backends
> > write heavily to the same table. I need to address that.
> > 
> > - Collect stats via sockets (in the same way as the current implementation)
> >   and write/read to/from shared memory.
> > 
> > - Use our own lock-free (maybe) ring-buffer before stats-write enters
> >   lock-waiting mode, then new stats collector(!) process consumes the
> >   queue.
> > 
> > - Some other measures..

- Make dshash searches less frequent. To find the actual source of the
  contention, we're going to measure performance with the attached patch
  applied on top of the latest patch set. It lets sessions cache the
  results of dshash searches for the session lifetime. (A vacuum that
  drops table entries clears the local hash.)
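
The session-local caching idea described above can be sketched in isolation as follows. This is only a minimal illustration, not the code in the attached patch: it uses plain C11 atomics instead of pg_atomic_* and a tiny direct-mapped array instead of simplehash, and every name in it (shared_delete_generation, cache_lookup, cache_store, note_entries_deleted) is hypothetical. The point it shows is the invalidation rule: a backend trusts its cached pointers only while a shared "deleted entries" counter still has the value seen when the cache was filled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 64

/* Hypothetical stand-in for a shared counter that is bumped whenever
 * stats entries are dropped (zero-initialized as a static object). */
static _Atomic uint64_t shared_delete_generation;

/* One slot of a direct-mapped, per-session cache. */
typedef struct CacheSlot
{
    bool      used;
    uint32_t  key;      /* e.g. an object OID */
    void     *entry;    /* cached pointer to the shared entry */
} CacheSlot;

static CacheSlot cache_slots[CACHE_SLOTS];
static uint64_t  cache_generation;  /* generation the cache was built at */

/* Throw the whole cache away if any entry was dropped since it was built. */
static void
cache_revalidate(void)
{
    uint64_t now = atomic_load(&shared_delete_generation);

    if (now != cache_generation)
    {
        for (size_t i = 0; i < CACHE_SLOTS; i++)
            cache_slots[i].used = false;
        cache_generation = now;
    }
}

/* Return the cached pointer, or NULL if the caller must search the
 * shared hash (cache miss or cache invalidated). */
void *
cache_lookup(uint32_t key)
{
    CacheSlot *slot;

    cache_revalidate();
    slot = &cache_slots[key % CACHE_SLOTS];
    return (slot->used && slot->key == key) ? slot->entry : NULL;
}

/* Remember the result of a shared-hash search; a colliding key simply
 * overwrites the previous occupant of the slot. */
void
cache_store(uint32_t key, void *entry)
{
    CacheSlot *slot;

    assert(entry != NULL);
    cache_revalidate();
    slot = &cache_slots[key % CACHE_SLOTS];
    slot->used = true;
    slot->key = key;
    slot->entry = entry;
}

/* Called by whoever drops entries from the shared table. */
void
note_entries_deleted(void)
{
    atomic_fetch_add(&shared_delete_generation, 1);
}
```

Discarding the whole cache on any deletion is coarse, but it keeps the fast path down to one atomic read and one local probe, which is what matters when trying to find out whether the shared-hash lookups are the source of the contention.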

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 0ce7b7a891326a3fdcf85b001586f721bef569f9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 9 Sep 2020 16:44:57 +0900
Subject: [PATCH v36 8/8] Experimental local cache for dshash.

This patch set suffers heavier LWLock contention than unpatched code when
very many sessions commit frequently. As an effort to identify the source,
this patch caches dshash information in a local hash to avoid dshash locks.
---
 src/backend/postmaster/pgstat.c | 67 +++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b07add0a4f..caa067e9fe 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -35,6 +35,7 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
+#include "common/hashfn.h"
 #include "libpq/libpq.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -112,6 +113,8 @@ typedef struct StatsShmemStruct
     dsa_pointer slru_stats;        /* Ditto for SLRU stats */
     int            refcount;        /* # of processes that is attaching the shared
                                  * stats memory */
+    pg_atomic_uint64 del_ent_count; /* # of entries deleted. not protected by
+                                       StatsLock */
 }            StatsShmemStruct;
 
 /* BgWriter global statistics counters */
@@ -254,6 +257,7 @@ typedef struct PgStatSharedSLRUStats
 typedef struct PgStatLocalHashEntry
 {
     PgStatHashEntryKey key;        /* hash key */
+    char               status;    /* for simplehash use */
     PgStatEnvelope *env;        /* pointer to stats envelope in heap */
 }            PgStatLocalHashEntry;
 
@@ -280,6 +284,18 @@ static const dshash_parameters dsh_rootparams = {
     LWTRANCHE_STATS
 };
 
+/* define hashtable for dshash caching */
+#define SH_PREFIX pgstat_cache
+#define SH_ELEMENT_TYPE PgStatLocalHashEntry
+#define SH_KEY_TYPE PgStatHashEntryKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) hash_bytes((unsigned char *)&key, sizeof(PgStatHashEntryKey))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(PgStatHashEntryKey)) == 0)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
 /* The shared hash to index activity stats entries. */
 static dshash_table *pgStatSharedHash = NULL;
 
@@ -326,6 +342,8 @@ typedef struct TwoPhasePgStatRecord
 } TwoPhasePgStatRecord;
 
 /* Variables for backend status snapshot */
+static pgstat_cache_hash *pgStatEntHash = NULL;
+static uint64 pgStatEntHashAge = 0;
 static MemoryContext pgStatLocalContext = NULL;
 static MemoryContext pgStatSnapshotContext = NULL;
 static HTAB *pgStatSnapshotHash = NULL;
@@ -437,6 +455,7 @@ StatsShmemInit(void)
         Assert(!found);
 
         StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+        pg_atomic_init_u64(&StatsShmem->del_ent_count, 0);
     }
 }
 
@@ -1432,6 +1451,9 @@ pgstat_vacuum_stat(void)
 
         delete_stat_entry(p->type, p->databaseid, p->objectid, true);
     }
+
+    if (nvictims > 0)
+        pg_atomic_add_fetch_u64(&StatsShmem->del_ent_count, 1);
 }
 
 
@@ -1493,6 +1515,8 @@ pgstat_drop_database(Oid databaseid)
         delete_stat_entry((*p)->type, (*p)->databaseid, (*p)->objectid, true);
 
     pfree(envlist);
+
+    pg_atomic_add_fetch_u64(&StatsShmem->del_ent_count, 1);
 }
 
 
@@ -5119,6 +5143,7 @@ get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
 {
     bool        create = (initfunc != NULL);
     PgStatHashEntry *shent;
+    PgStatLocalHashEntry *loent;
     PgStatEnvelope *shenv = NULL;
     PgStatHashEntryKey key;
     bool        myfound;
@@ -5128,6 +5153,31 @@ get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
     key.type = type;
     key.databaseid = dbid;
     key.objectid = objid;
+
+    if (pgStatEntHash)
+    {
+        uint64 currage;
+
+        currage = pg_atomic_read_u64(&StatsShmem->del_ent_count);
+
+        if (currage == pgStatEntHashAge)
+        {
+            loent = pgstat_cache_lookup(pgStatEntHash, key);
+
+            if (loent)
+            {
+                if (found)
+                    *found = true;
+                return loent->env;
+            }
+        }
+        else
+        {
+            pgstat_cache_destroy(pgStatEntHash);
+            pgStatEntHash = NULL;
+        }
+    }
+
     shent = dshash_find_extended(pgStatSharedHash, &key,
                                  create, nowait, create, &myfound);
     if (shent)
@@ -5159,6 +5209,23 @@ get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
             shenv = dsa_get_address(area, shent->env);
 
         dshash_release_lock(pgStatSharedHash, shent);
+
+        /* register to local hash if possible */
+        if (pgStatEntHash || CacheMemoryContext)
+        {
+            if (pgStatEntHash == NULL)
+            {
+                pgStatEntHash =
+                    pgstat_cache_create(CacheMemoryContext,
+                                        PGSTAT_TABLE_HASH_SIZE, NULL);
+                pgStatEntHashAge =
+                    pg_atomic_read_u64(&StatsShmem->del_ent_count);
+            }
+    
+            loent = pgstat_cache_insert(pgStatEntHash, key, &myfound);
+            Assert(!myfound);
+            loent->env = shenv;
+        }
     }
 
     if (found)
-- 
2.18.4


Re: shared-memory based stats collector

From
Stephen Frost
Date:
Greetings,

* Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:
> > Shouldn't this:
> >
> > a) be back-patched, as the other change was
> > b) also include a change to have the stats collector (which I realize is
> >    removed later on in this patch set, but we're talking about fixing
> >    existing things..) for the same reason, and because there isn't much
> >    point in trying to write out the stats after we get a SIGQUIT, since
> >    we're just going to blow them away again since we're going to go
> >    through crash recovery..?
> >
> > Might be good to have a separate thread to address these changes.
>
> +1

Just FYI, Tom's started a thread which includes this over here-

https://postgr.es/m/1850884.1599601164@sss.pgh.pa.us

Thanks,

Stephen

Re: shared-memory based stats collector

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> Just FYI, Tom's started a thread which includes this over here-
> https://postgr.es/m/1850884.1599601164@sss.pgh.pa.us

Per that discussion, I'm about to go and commit the 0001 patch from
this thread, which will cause the cfbot to not be able to apply the
patchset anymore till you repost it without 0001.  However, before
reposting, you might want to fix the compile errors the cfbot is
showing currently.

On the Windows side:

src/backend/postmaster/postmaster.c(6410): error C2065: 'pgStatSock' : undeclared identifier [C:\projects\postgresql\postgres.vcxproj]

On the Linux side:

pgstat.c: In function ‘pgstat_vacuum_stat’:
pgstat.c:1411:7: error: ‘key’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
   if (hash_search(oidtab, key, HASH_FIND, NULL) != NULL)
       ^
pgstat.c:1373:8: note: ‘key’ was declared here
   Oid     *key;
        ^
pgstat.c:1411:7: error: ‘oidtab’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
   if (hash_search(oidtab, key, HASH_FIND, NULL) != NULL)
       ^
pgstat.c:1372:9: note: ‘oidtab’ was declared here
   HTAB    *oidtab;
         ^
pgstat.c: In function ‘pgstat_reset_single_counter’:
pgstat.c:1625:6: error: ‘stattype’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
  env = get_stat_entry(stattype, MyDatabaseId, objoid, false, NULL, NULL);
      ^
pgstat.c:1601:14: note: ‘stattype’ was declared here
  PgStatTypes stattype;
              ^
cc1: all warnings being treated as errors
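
For illustration only, one common shape of code that draws this kind of warning is a variable that is assigned in some branches of a switch (or if/else chain) and then used afterwards. The sketch below is not taken from pgstat.c; the enum and function names are hypothetical. It shows the usual remedy of giving the variable a defined value on every path.

```c
#include <assert.h>

/* Hypothetical stand-ins for the caller's request and the stats type. */
typedef enum ResetTarget
{
    RESET_TABLE,
    RESET_FUNCTION
} ResetTarget;

typedef enum StatType
{
    STAT_TYPE_INVALID,
    STAT_TYPE_TABLE,
    STAT_TYPE_FUNCTION
} StatType;

/* Zero-filled storage should read back as "invalid", not as a real type. */
static_assert(STAT_TYPE_INVALID == 0, "invalid type must be zero");

StatType
stat_type_for_reset(ResetTarget target)
{
    /*
     * Initializing here gives the variable a value even when no case
     * below matches, which is the path the compiler complains about.
     */
    StatType stattype = STAT_TYPE_INVALID;

    switch (target)
    {
        case RESET_TABLE:
            stattype = STAT_TYPE_TABLE;
            break;
        case RESET_FUNCTION:
            stattype = STAT_TYPE_FUNCTION;
            break;
    }

    return stattype;
}
```

Whether initializing, adding a default branch that raises an error, or restructuring the function is the right fix depends on the real code paths involved.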

            regards, tom lane



Re: shared-memory based stats collector

From
Andres Freund
Date:
Hi,

On 2020-09-08 17:55:57 +0900, Kyotaro Horiguchi wrote:
> Locks on the shared statistics is acquired by the units of such like
> tables, functions so the expected chance of collision are not so high.

I can't really parse that...


> Furthermore, until 1 second has elapsed since the last flushing to
> shared stats, lock failure postpones stats flushing so that lock
> contention doesn't slow down transactions.

I think I commented on that before, but to me 1s seems way too low to
switch to blocking lock acquisition. What's the reason for such a low
limit?
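
To make the behaviour under discussion concrete, here is a rough sketch of "try the lock, postpone on failure, and only block once a time limit has passed". It is an illustration under stated assumptions rather than the patch's code: a C11 atomic_flag stands in for the LWLock, time is passed in as plain milliseconds, and all names (SharedCounter, PendingStats, BLOCKING_AFTER_MS, flush_pending) are hypothetical. The 1000 ms value mirrors the 1 second mentioned in the quoted text.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical threshold after which a flush is allowed to wait. */
#define BLOCKING_AFTER_MS 1000

typedef struct SharedCounter
{
    atomic_flag lock;       /* stand-in for a per-entry lock */
    long        total;      /* shared statistic */
} SharedCounter;

typedef struct PendingStats
{
    long count;             /* locally accumulated, not yet flushed */
    long last_flush_ms;     /* time of the last successful flush */
} PendingStats;

void
shared_counter_init(SharedCounter *s)
{
    assert(s != NULL);
    atomic_flag_clear(&s->lock);
    s->total = 0;
}

/*
 * Try to add the pending count to the shared counter.  Returns false if
 * the lock was busy and the flush was postponed; the pending count is
 * then kept for a later attempt.
 */
bool
flush_pending(PendingStats *p, SharedCounter *s, long now_ms)
{
    bool may_block = (now_ms - p->last_flush_ms) >= BLOCKING_AFTER_MS;

    if (atomic_flag_test_and_set(&s->lock))
    {
        /* Lock is held by someone else. */
        if (!may_block)
            return false;   /* too early to wait: postpone */

        /* Limit reached: wait (here, spin) until the lock is free. */
        while (atomic_flag_test_and_set(&s->lock))
            ;
    }

    s->total += p->count;
    p->count = 0;
    p->last_flush_ms = now_ms;
    atomic_flag_clear(&s->lock);
    return true;
}
```

In this shape, the threshold directly controls how long a backend may keep postponing before it starts to wait on a contended entry, which is why the choice of value matters for the question above.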

>
>      /*
> -     * Clean up any dead statistics collector entries for this DB. We always
> +     * Clean up any dead activity statistics entries for this DB. We always
>       * want to do this exactly once per DB-processing cycle, even if we find
>       * nothing worth vacuuming in the database.
>       */

What is "activity statistics"?


> @@ -2816,8 +2774,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
>      }
>
>      /* fetch the pgstat table entry */
> -    tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
> -                                         shared, dbentry);
> +    tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
> +                                                   relid);

Why do all of these places deal with a snapshot? For most it seems to
make much more sense to just look up the entry and then copy that into
local memory?  There may be some places that need some sort of snapshot
behaviour that's stable until commit / pgstat_clear_snapshot(). But I
can't really see many?


> +#define PGSTAT_MIN_INTERVAL            1000    /* Minimum interval of stats data
>
> +#define PGSTAT_MAX_INTERVAL            10000    /* Longest interval of stats data
> +                                             * updates */

These don't really seem to be in line with the commit message...




>  /*
> - * Structures in which backends store per-table info that's waiting to be
> - * sent to the collector.
> + * Enums and types to define shared statistics structure.
> + *
> + * Statistics entries for each object is stored in individual DSA-allocated
> + * memory. Every entry is pointed from the dshash pgStatSharedHash via
> + * dsa_pointer. The structure makes object-stats entries not moved by dshash
> + * resizing, and allows the dshash can release lock sooner on stats
> + * updates. Also it reduces interfering among write-locks on each stat entry by
> + * not relying on partition lock of dshash. PgStatLocalHashEntry is the
> + * local-stats equivalent of PgStatHashEntry for shared stat entries.
> + *
> + * Each stat entry is enveloped by the type PgStatEnvelope, which stores common
> + * attribute of all kind of statistics and a LWLock lock object.
> + *
> + * Shared stats are stored as:
> + *
> + * dshash pgStatSharedHash
> + *    -> PgStatHashEntry                (dshash entry)
> + *      (dsa_pointer)-> PgStatEnvelope    (dsa memory block)

I don't like 'Envelope' that much. If I understand you correctly that's
a common prefix that's used for all types of stat objects, correct? If
so, how about just naming it PgStatEntryBase or such? I think it'd also
be useful to indicate in the "are stored as" part that PgStatEnvelope is
just the common prefix for an allocation.


>  /*
> - * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
> + * entry size lookup table of shared statistics entries corresponding to
> + * PgStatTypes
>   */
> -typedef struct TabStatHashEntry
> +static size_t pgstat_entsize[] =

> +/* Ditto for local statistics entries */
> +static size_t pgstat_localentsize[] =
> +{
> +    0,                            /* PGSTAT_TYPE_ALL: not an entry */
> +    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
> +    sizeof(PgStat_TableStatus), /* PGSTAT_TYPE_TABLE */
> +    sizeof(PgStat_BackendFunctionEntry) /* PGSTAT_TYPE_FUNCTION */
> +};

These probably should be const as well.


>  /*
> - * Backends store per-function info that's waiting to be sent to the collector
> - * in this hash table (indexed by function OID).
> + * Stats numbers that are waiting for flushing out to shared stats are held in
> + * pgStatLocalHash,
>   */
> -static HTAB *pgStatFunctions = NULL;
> +typedef struct PgStatHashEntry
> +{
> +    PgStatHashEntryKey key;        /* hash key */
> +    dsa_pointer env;            /* pointer to shared stats envelope in DSA */
> +}            PgStatHashEntry;
> +
> +/* struct for shared statistics entry pointed from shared hash entry. */
> +typedef struct PgStatEnvelope
> +{
> +    PgStatTypes type;            /* statistics entry type */
> +    Oid            databaseid;        /* databaseid */
> +    Oid            objectid;        /* objectid */

Do we need this information both here and in PgStatHashEntry? It's
possible that it's worthwhile, but I am not sure it is.


> +    size_t        len;            /* length of body, fixed per type. */

Why do we need this? Isn't that something that can easily be looked up
using the type?
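
For instance, a const table indexed by the type already gives the body
length, so the envelope wouldn't have to carry it. A standalone sketch (the
Demo* types and demo_entry_size() are made up for illustration, they are
not from the patch):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the patch's stat types. */
typedef enum DemoStatType
{
    DEMO_TYPE_DB,
    DEMO_TYPE_TABLE,
    DEMO_TYPE_FUNCTION
} DemoStatType;

#define DEMO_NUM_TYPES (DEMO_TYPE_FUNCTION + 1)

typedef struct DemoDBEntry
{
    long        xact_commit;
    long        xact_rollback;
} DemoDBEntry;

typedef struct DemoTabEntry
{
    long        numscans;
    long        tuples_inserted;
    long        tuples_deleted;
} DemoTabEntry;

typedef struct DemoFuncEntry
{
    long        numcalls;
} DemoFuncEntry;

/* const: the table never changes, so it can live in read-only data. */
static const size_t demo_entsize[DEMO_NUM_TYPES] =
{
    [DEMO_TYPE_DB] = sizeof(DemoDBEntry),
    [DEMO_TYPE_TABLE] = sizeof(DemoTabEntry),
    [DEMO_TYPE_FUNCTION] = sizeof(DemoFuncEntry)
};

/* Body length is derived from the type instead of being stored per entry. */
static size_t
demo_entry_size(DemoStatType type)
{
    assert((int) type < DEMO_NUM_TYPES);
    return demo_entsize[type];
}
```

Designated initializers also keep the table in sync with the enum if the
order of the types ever changes.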


> +    LWLock        lock;            /* lightweight lock to protect body */
> +    int            body[FLEXIBLE_ARRAY_MEMBER];    /* statistics body */
> +}            PgStatEnvelope;

What you're doing here with 'body' doesn't provide enough guarantees
about proper alignment. E.g. if one of the entry types wants to store a
double, this won't portably work, because there are platforms that have 4
byte alignment for ints, but 8 byte alignment for doubles.


Wouldn't it be better to instead embed PgStatEnvelope into the struct
that's actually stored? E.g. something like

struct PgStat_TableStatus
{
    PgStatEnvelope header; /* I'd rename the type */
    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
    ...
}

or if you don't want to do that because it'd require declaring
PgStatEnvelope in the header (not sure that'd really be worth avoiding),
you could just get rid of the body field and just do the calculation
using something like MAXALIGN((char *) envelope + sizeof(PgStatEnvelope))
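
To spell out the second variant, here is a standalone sketch. DEMO_MAXALIGN
is a stand-in for PostgreSQL's MAXALIGN with an assumed maximum alignment of
8, and DemoEntryBase / demo_entry_body() are invented names. It assumes the
allocation itself starts at a maximally aligned address:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for MAXALIGN; the value 8 is an assumption for this sketch. */
#define DEMO_MAXIMUM_ALIGNOF 8
#define DEMO_MAXALIGN(LEN) \
    (((uintptr_t) (LEN) + (DEMO_MAXIMUM_ALIGNOF - 1)) & \
     ~((uintptr_t) (DEMO_MAXIMUM_ALIGNOF - 1)))

/* Hypothetical common header; note there is no 'body' member at all. */
typedef struct DemoEntryBase
{
    int         type;
    unsigned int databaseid;
    unsigned int objectid;
} DemoEntryBase;

/* Bytes to allocate for the header plus a body of bodylen bytes. */
static size_t
demo_entry_alloc_size(size_t bodylen)
{
    return DEMO_MAXALIGN(sizeof(DemoEntryBase)) + bodylen;
}

/*
 * Address of the type-specific body.  The header size is rounded up, so the
 * body is suitably aligned for any type as long as 'base' itself is.
 */
static void *
demo_entry_body(DemoEntryBase *base)
{
    assert(base != NULL);
    return (char *) base + DEMO_MAXALIGN(sizeof(DemoEntryBase));
}
```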


> + * Snapshot is stats entry that is locally copied to offset stable values for a
> + * transaction.
>   */
> -static bool have_function_stats = false;
> +typedef struct PgStatSnapshot
> +{
> +    PgStatHashEntryKey key;
> +    bool        negative;
> +    int            body[FLEXIBLE_ARRAY_MEMBER];    /* statistics body */
> +}            PgStatSnapshot;
> +
> +#define PgStatSnapshotSize(bodylen)                \
> +    (offsetof(PgStatSnapshot, body) + (bodylen))
>
> -/*
> - * Info about current "snapshot" of stats file
> - */
> +/* Variables for backend status snapshot */
>  static MemoryContext pgStatLocalContext = NULL;
> -static HTAB *pgStatDBHash = NULL;
> +static MemoryContext pgStatSnapshotContext = NULL;
> +static HTAB *pgStatSnapshotHash = NULL;

>  /*
> - * Cluster wide statistics, kept in the stats collector.
> - * Contains statistics that are not collected per database
> - * or per table.
> + * Cluster wide statistics.
> + *
> + * Contains statistics that are collected not per database nor per table
> + * basis.  shared_* points to shared memory and snapshot_* are backend
> + * snapshots.
>   */
> -static PgStat_ArchiverStats archiverStats;
> -static PgStat_GlobalStats globalStats;
> -static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
> -
> -/*
> - * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
> - * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
> - * will write both that DB's data and the shared stats.
> - */
> -static List *pending_write_requests = NIL;
> +static bool global_snapshot_is_valid = false;
> +static PgStat_ArchiverStats *shared_archiverStats;
> +static PgStat_ArchiverStats snapshot_archiverStats;
> +static PgStat_GlobalStats *shared_globalStats;
> +static PgStat_GlobalStats snapshot_globalStats;
> +static PgStatSharedSLRUStats *shared_SLRUStats;
> +static PgStat_StatSLRUEntry snapshot_SLRUStats[SLRU_NUM_ELEMENTS];

The amount of code needed for this snapshot stuff seems unreasonable to
me, especially because I don't see why we really need it. Is this just
so that there's no skew between all the columns of pg_stat_all_tables()
etc?

I think this needs a lot more comments explaining what it's trying to
achieve.


> +/*
> + * Newly created shared stats entries needs to be initialized before the other
> + * processes get access it. get_stat_entry() calls it for the purpose.
> + */
> +typedef void (*entry_initializer) (PgStatEnvelope * env);

I think we should try to not need it, instead declaring that all fields
are zero initialized. That fits well together with my suggestion to
avoid duplicating the database / object ids.


> +static void
> +attach_shared_stats(void)
> +{
...
> +        /* We're the first process to attach the shared stats memory */
> +        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
> +
> +        /* Initialize shared memory area */
> +        area = dsa_create(LWTRANCHE_STATS);
> +        pgStatSharedHash = dshash_create(area, &dsh_rootparams, 0);
> +
> +        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
> +        StatsShmem->global_stats =
> +            dsa_allocate0(area, sizeof(PgStat_GlobalStats));
> +        StatsShmem->archiver_stats =
> +            dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
> +        StatsShmem->slru_stats =
> +            dsa_allocate0(area, sizeof(PgStatSharedSLRUStats));
> +        StatsShmem->hash_handle = dshash_get_hash_table_handle(pgStatSharedHash);
> +
> +        shared_globalStats = (PgStat_GlobalStats *)
> +            dsa_get_address(area, StatsShmem->global_stats);
> +        shared_archiverStats = (PgStat_ArchiverStats *)
> +            dsa_get_address(area, StatsShmem->archiver_stats);
> +
> +        shared_SLRUStats = (PgStatSharedSLRUStats *)
> +            dsa_get_address(area, StatsShmem->slru_stats);
> +        LWLockInitialize(&shared_SLRUStats->lock, LWTRANCHE_STATS);

I don't think it makes sense to use dsa allocations for any of the fixed
size stats (global_stats, archiver_stats, ...). They should just be
direct members of StatsShmem? Then we also don't need the shared_*
helper variables


> +        /* Load saved data if any. */
> +        pgstat_read_statsfiles();

Hm. Is it a good idea to do this as part of the shmem init function?
That's a lot more work than we normally do in these.


> +/* ----------
> + * detach_shared_stats() -
> + *
> + *    Detach shared stats. Write out to file if we're the last process and told
> + *    to do so.
> + * ----------
>   */
>  static void
> -pgstat_reset_remove_files(const char *directory)
> +detach_shared_stats(bool write_stats)

I think it'd be better to have an explicit call in the shutdown sequence
somewhere to write out the data, instead of munging detach and writing
stats out together.


>  /* ----------
>   * pgstat_report_stat() -
>   *
>   *    Must be called by processes that performs DML: tcop/postgres.c, logical
> - *    receiver processes, SPI worker, etc. to send the so far collected
> - *    per-table and function usage statistics to the collector.  Note that this
> - *    is called only when not within a transaction, so it is fair to use
> + *    receiver processes, SPI worker, etc. to apply the so far collected
> + *    per-table and function usage statistics to the shared statistics hashes.
> + *
> + *    Updates are applied not more frequent than the interval of
> + *    PGSTAT_MIN_INTERVAL milliseconds. They are also postponed on lock
> + *    failure if force is false and there's no pending updates longer than
> + *    PGSTAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
> + *    succeeding calls of this function.
> + *
> + *    Returns the time until the next timing when updates are applied in
> + *    milliseconds if there are no updates held for more than
> + *    PGSTAT_MIN_INTERVAL milliseconds.
> + *
> + *    Note that this is called only out of a transaction, so it is fine to use
>   *    transaction stop time as an approximation of current time.
> - * ----------
> + *    ----------
>   */
> -void
> +long
>  pgstat_report_stat(bool force)
>  {
> -    /* we assume this inits to all zeroes: */
> -    static const PgStat_TableCounts all_zeroes;
> -    static TimestampTz last_report = 0;
> -
> +    static TimestampTz next_flush = 0;
> +    static TimestampTz pending_since = 0;
> +    static long retry_interval = 0;
>      TimestampTz now;
> -    PgStat_MsgTabstat regular_msg;
> -    PgStat_MsgTabstat shared_msg;
> -    TabStatusArray *tsa;
> +    bool        nowait = !force;    /* Don't use force ever after */

> +    if (nowait)
> +    {
> +        /*
> +         * Don't flush stats too frequently.  Return the time to the next
> +         * flush.
> +         */

I think it's confusing to use nowait in the if when you actually mean
!force.


> -    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
> +    if (pgStatLocalHash)
>      {
> -        for (i = 0; i < tsa->tsa_used; i++)
> +        /* Step 1: flush out other than database stats */
> +        hash_seq_init(&scan, pgStatLocalHash);
> +        while ((lent = (PgStatLocalHashEntry *) hash_seq_search(&scan)) != NULL)
>          {
> -            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
> -            PgStat_MsgTabstat *this_msg;
> -            PgStat_TableEntry *this_ent;
> +            bool        remove = false;
>
> -            /* Shouldn't have any pending transaction-dependent counts */
> -            Assert(entry->trans == NULL);
> +            switch (lent->env->type)
> +            {
> +                case PGSTAT_TYPE_DB:
> +                    if (ndbentries >= dbentlistlen)
> +                    {
> +                        dbentlistlen *= 2;
> +                        dbentlist = repalloc(dbentlist,
> +                                             sizeof(PgStatLocalHashEntry *) *
> +                                             dbentlistlen);
> +                    }
> +                    dbentlist[ndbentries++] = lent;
> +                    break;

Why do we need this special behaviour for database statistics?

If we need it, it'd be better to just use List here rather than open
coding a replacement (List these days basically has the same complexity
as what you do here).


> +                case PGSTAT_TYPE_TABLE:
> +                    if (flush_tabstat(lent->env, nowait))
> +                        remove = true;
> +                    break;
> +                case PGSTAT_TYPE_FUNCTION:
> +                    if (flush_funcstat(lent->env, nowait))
> +                        remove = true;
> +                    break;
> +                default:
> +                    Assert(false);

Adding a default here prevents the compiler from issuing a warning when
new types of stats are added...
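
What I mean is roughly the following (standalone sketch with invented
names): list every enum value, leave out the default, and put the "can't
happen" handling after the switch. With gcc or clang and -Wall (which
includes -Wswitch), adding a new enum value then produces a warning at every
switch that doesn't handle it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for PgStatTypes. */
typedef enum DemoStatType
{
    DEMO_TYPE_DB,
    DEMO_TYPE_TABLE,
    DEMO_TYPE_FUNCTION
} DemoStatType;

/* Returns true if entries of this type are flushed in the first pass. */
static bool
demo_flushed_in_first_pass(DemoStatType type)
{
    switch (type)
    {
        case DEMO_TYPE_DB:
            return false;       /* database stats are handled later */
        case DEMO_TYPE_TABLE:
        case DEMO_TYPE_FUNCTION:
            return true;
            /* deliberately no default, so the compiler checks coverage */
    }

    /* Only reachable if 'type' holds a value outside the enum. */
    assert(false);
    return false;
}
```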


> +            /* Remove the successfully flushed entry */
> +            pfree(lent->env);

Probably worth zeroing the pointer here, to make debugging a little
easier.


> +    /* Publish the last flush time */
> +    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
> +    if (shared_globalStats->stats_timestamp < now)
> +        shared_globalStats->stats_timestamp = now;
> +    LWLockRelease(StatsLock);

Ugh, that seems like a fairly unnecessary global lock acquisition. What
do we need this timestamp for? Not clear to me that it's still
needed. If it is needed, it'd probably be worth making this an atomic and
doing a compare-exchange loop instead.
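
Roughly like this. The sketch uses C11 atomics so that it compiles on its
own; in the backend it would presumably be a pg_atomic_uint64 updated with
pg_atomic_compare_exchange_u64(). The variable and function names are
invented:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for the shared stats_timestamp field. */
static _Atomic int64_t demo_stats_timestamp;

/*
 * Advance the published timestamp to 'now' unless somebody else has already
 * published a newer value.  No lock is taken.
 */
static void
demo_advance_timestamp(int64_t now)
{
    int64_t     cur = atomic_load(&demo_stats_timestamp);

    while (cur < now &&
           !atomic_compare_exchange_weak(&demo_stats_timestamp, &cur, now))
    {
        /*
         * The failed exchange stored the current value into 'cur'; the loop
         * condition re-checks whether we still need to advance.
         */
    }
}
```

The value can only move forward, which is the same guarantee the locked
version gives.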


>      /*
> -     * Send partial messages.  Make sure that any pending xact commit/abort
> -     * gets counted, even if there are no table stats to send.
> +     * If we have pending local stats, let the caller know the retry interval.
>       */
> -    if (regular_msg.m_nentries > 0 ||
> -        pgStatXactCommit > 0 || pgStatXactRollback > 0)
> -        pgstat_send_tabstat(&regular_msg);
> -    if (shared_msg.m_nentries > 0)
> -        pgstat_send_tabstat(&shared_msg);
> +    if (HAVE_ANY_PENDING_STATS())

I think this needs a comment explaining why we still may have pending
stats.



> + * flush_tabstat - flush out a local table stats entry
> + *
> + * Some of the stats numbers are copied to local database stats entry after
> + * successful flush-out.
> + *
> + * If nowait is true, this function returns false on lock failure. Otherwise
> + * this function always returns true.
> + *
> + * Returns true if the entry is successfully flushed out.
> + */
> +static bool
> +flush_tabstat(PgStatEnvelope * lenv, bool nowait)
> +{
> +    static const PgStat_TableCounts all_zeroes;
> +    Oid            dboid;            /* database OID of the table */
> +    PgStat_TableStatus *lstats; /* local stats entry  */
> +    PgStatEnvelope *shenv;        /* shared stats envelope */
> +    PgStat_StatTabEntry *shtabstats;    /* table entry of shared stats */
> +    PgStat_StatDBEntry *ldbstats;    /* local database entry */
> +    bool        found;
> +
> +    Assert(lenv->type == PGSTAT_TYPE_TABLE);
> +
> +    lstats = (PgStat_TableStatus *) &lenv->body;
> +    dboid = lstats->t_shared ? InvalidOid : MyDatabaseId;
> +
> +    /*
> +     * Ignore entries that didn't accumulate any actual counts, such as
> +     * indexes that were opened by the planner but not used.
> +     */
> +    if (memcmp(&lstats->t_counts, &all_zeroes,
> +               sizeof(PgStat_TableCounts)) == 0)
> +        return true;
> +
> +    /* find shared table stats entry corresponding to the local entry */
> +    shenv = get_stat_entry(PGSTAT_TYPE_TABLE, dboid, lstats->t_id,
> +                           nowait, init_tabentry, &found);
> +
> +    /* skip if dshash failed to acquire lock */
> +    if (shenv == NULL)
> +        return false;

Could we cache the address of the shared entry in the local entry for a
while? It seems we have a bunch of contention (that I think you're
trying to address in a prototype patch posted since) just because we
will over and over look up the same address in the shared hash table.

If we instead kept the local hashtable alive for longer and stored a
pointer to the shared entry in it, we could make this a lot
cheaper. There would be some somewhat nasty edge cases probably. Imagine
a table being dropped for which another backend still has pending
stats. But that could e.g. be addressed with a refcount.
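
One possible shape of that, as a standalone single-process sketch. All names
are invented, and in shared memory the refcount and the dropped flag would
of course have to be atomic or protected by the entry's lock:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical shared entry; lives in shared memory in the real thing. */
typedef struct DemoSharedEntry
{
    int         refcount;       /* number of local entries pointing here */
    bool        dropped;        /* the underlying object was dropped */
    long        numscans;
} DemoSharedEntry;

/* Hypothetical local entry that caches the shared entry's address. */
typedef struct DemoLocalEntry
{
    DemoSharedEntry *shared;
    long        pending_numscans;
} DemoLocalEntry;

/* Remember the shared entry so later flushes can skip the hash lookup. */
static void
demo_pin(DemoLocalEntry *local, DemoSharedEntry *shared)
{
    shared->refcount++;
    local->shared = shared;
}

/* Flush pending counts; counts for a dropped object are just discarded. */
static void
demo_flush(DemoLocalEntry *local)
{
    if (!local->shared->dropped)
        local->shared->numscans += local->pending_numscans;
    local->pending_numscans = 0;
}

/* Mark as dropped; free right away only if nobody holds a reference. */
static bool
demo_drop(DemoSharedEntry *shared)
{
    shared->dropped = true;
    if (shared->refcount == 0)
    {
        free(shared);
        return true;
    }
    return false;
}

/* Release the cached pointer; the last holder frees a dropped entry. */
static bool
demo_unpin(DemoLocalEntry *local)
{
    DemoSharedEntry *shared = local->shared;

    assert(shared != NULL && shared->refcount > 0);
    local->shared = NULL;
    if (--shared->refcount == 0 && shared->dropped)
    {
        free(shared);
        return true;
    }
    return false;
}
```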


> +    /* retrieve the shared table stats entry from the envelope */
> +    shtabstats = (PgStat_StatTabEntry *) &shenv->body;
> +
> +    /* lock the shared entry to protect the content, skip if failed */
> +    if (!nowait)
> +        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
> +    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
> +        return false;
> +
> +    /* add the values to the shared entry. */
> +    shtabstats->numscans += lstats->t_counts.t_numscans;
> +    shtabstats->tuples_returned += lstats->t_counts.t_tuples_returned;
> +    shtabstats->tuples_fetched += lstats->t_counts.t_tuples_fetched;
> +    shtabstats->tuples_inserted += lstats->t_counts.t_tuples_inserted;
> +    shtabstats->tuples_updated += lstats->t_counts.t_tuples_updated;
> +    shtabstats->tuples_deleted += lstats->t_counts.t_tuples_deleted;
> +    shtabstats->tuples_hot_updated += lstats->t_counts.t_tuples_hot_updated;
> +
> +    /*
> +     * If table was truncated or vacuum/analyze has ran, first reset the
> +     * live/dead counters.
> +     */
> +    if (lstats->t_counts.t_truncated ||
> +        lstats->t_counts.vacuum_count > 0 ||
> +        lstats->t_counts.analyze_count > 0 ||
> +        lstats->t_counts.autovac_vacuum_count > 0 ||
> +        lstats->t_counts.autovac_analyze_count > 0)
> +    {
> +        shtabstats->n_live_tuples = 0;
> +        shtabstats->n_dead_tuples = 0;
> +    }

> +    /* clear the change counter if requested */
> +    if (lstats->t_counts.reset_changed_tuples)
> +        shtabstats->changes_since_analyze = 0;

I know this is largely old code, but it's not obvious to me that there are
no race conditions here / that the race condition didn't get worse. What
prevents other backends from having since done a lot of inserts into this
table? Especially in case the flushes were delayed due to lock
contention.

> +    /*
> +     * Update vacuum/analyze timestamp and counters, so that the values won't
> +     * goes back.
> +     */
> +    if (shtabstats->vacuum_timestamp < lstats->vacuum_timestamp)
> +        shtabstats->vacuum_timestamp = lstats->vacuum_timestamp;

It seems to me that if these branches are indeed necessary, my concerns
above are well founded...


> +init_tabentry(PgStatEnvelope * env)
>  {
> -    int            n;
> -    int            len;
> +    PgStat_StatTabEntry *tabent = (PgStat_StatTabEntry *) &env->body;
> +
> +    /*
> +     * If it's a new table entry, initialize counters to the values we just
> +     * got.
> +     */
> +    Assert(env->type == PGSTAT_TYPE_TABLE);
> +    tabent->tableid = env->objectid;

It seems over the top to me to have the object id stored in yet another
place. It's now in the hash entry, in the envelope, and the type
specific part.


> +/*
> + * flush_funcstat - flush out a local function stats entry
> + *
> + * If nowait is true, this function returns false on lock failure. Otherwise
> + * this function always returns true.
> + *
> + * Returns true if the entry is successfully flushed out.
> + */
> +static bool
> +flush_funcstat(PgStatEnvelope * env, bool nowait)
> +{
> +    /* we assume this inits to all zeroes: */
> +    static const PgStat_FunctionCounts all_zeroes;
> +    PgStat_BackendFunctionEntry *localent;    /* local stats entry */
> +    PgStatEnvelope *shenv;        /* shared stats envelope */
> +    PgStat_StatFuncEntry *sharedent = NULL; /* shared stats entry */
> +    bool        found;
> +
> +    Assert(env->type == PGSTAT_TYPE_FUNCTION);
> +    localent = (PgStat_BackendFunctionEntry *) &env->body;
> +
> +    /* Skip it if no counts accumulated for it so far */
> +    if (memcmp(&localent->f_counts, &all_zeroes,
> +               sizeof(PgStat_FunctionCounts)) == 0)
> +        return true;

Why would we have an entry in this case?


> +    /* find shared table stats entry corresponding to the local entry */
> +    shenv = get_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId, localent->f_id,
> +                           nowait, init_funcentry, &found);
> +    /* skip if dshash failed to acquire lock */
> +    if (shenv == NULL)
> +        return false;            /* failed to acquire lock, skip */
> +
> +    /* retrieve the shared table stats entry from the envelope */
> +    sharedent = (PgStat_StatFuncEntry *) &shenv->body;
> +
> +    /* lock the shared entry to protect the content, skip if failed */
> +    if (!nowait)
> +        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
> +    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
> +        return false;            /* failed to acquire lock, skip */

It doesn't seem great that we have a separate copy of all of this logic
again. It seems to me that most of the code here is (or should be)
exactly the same as in the table case. I think only the below should be
in here, rather than in common code.

> +/*
> + * flush_dbstat - flush out a local database stats entry
> + *
> + * If nowait is true, this function returns false on lock failure. Otherwise
> + * this function always returns true.
> + *
> + * Returns true if the entry is successfully flushed out.
> + */
> +static bool
> +flush_dbstat(PgStatEnvelope * env, bool nowait)
> +{
> +    PgStat_StatDBEntry *localent;
> +    PgStatEnvelope *shenv;
> +    PgStat_StatDBEntry *sharedent;
> +
> +    Assert(env->type == PGSTAT_TYPE_DB);
> +
> +    localent = (PgStat_StatDBEntry *) &env->body;
> +
> +    /* find shared database stats entry corresponding to the local entry */
> +    shenv = get_stat_entry(PGSTAT_TYPE_DB, localent->databaseid, InvalidOid,
> +                           nowait, init_dbentry, NULL);
> +
> +    /* skip if dshash failed to acquire lock */
> +    if (!shenv)
> +        return false;
> +
> +    /* retrieve the shared stats entry from the envelope */
> +    sharedent = (PgStat_StatDBEntry *) &shenv->body;
> +
> +    /* lock the shared entry to protect the content, skip if failed */
> +    if (!nowait)
> +        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
> +    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
> +        return false;

Ditto re duplicating all of this.
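
To make the suggestion concrete, a standalone sketch of how the common part
could be factored out, with only the "add the counters" step being type
specific. DemoLock and the demo_* functions are simplified stand-ins
invented for this example, not the LWLock API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy lock, only good enough for a single-threaded illustration. */
typedef struct DemoLock
{
    bool        held;
} DemoLock;

static void
demo_lock_acquire(DemoLock *lock)
{
    lock->held = true;          /* a real lock would block here */
}

static bool
demo_lock_conditional_acquire(DemoLock *lock)
{
    if (lock->held)
        return false;
    lock->held = true;
    return true;
}

static void
demo_lock_release(DemoLock *lock)
{
    lock->held = false;
}

typedef struct DemoSharedEntry
{
    DemoLock    lock;
    long        counter;
} DemoSharedEntry;

/* Type-specific part: apply the local counts to the shared entry. */
typedef void (*demo_apply_fn) (DemoSharedEntry *shared, const void *local);

/* Common part: NULL check, blocking or conditional locking, unlock. */
static bool
demo_flush_entry(DemoSharedEntry *shared, const void *local,
                 bool nowait, demo_apply_fn apply)
{
    assert(apply != NULL);

    if (shared == NULL)
        return false;           /* lookup failed to get its lock */

    if (!nowait)
        demo_lock_acquire(&shared->lock);
    else if (!demo_lock_conditional_acquire(&shared->lock))
        return false;           /* retry on a later call */

    apply(shared, local);
    demo_lock_release(&shared->lock);
    return true;
}

/* Example of a type-specific callback. */
static void
demo_apply_table(DemoSharedEntry *shared, const void *local)
{
    shared->counter += *(const long *) local;
}
```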



> +/*
> + * Create the filename for a DB stat file; filename is output parameter points
> + * to a character buffer of length len.
> + */
> +static void
> +get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
> +{
> +    int            printed;
> +
> +    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
> +    printed = snprintf(filename, len, "%s/db_%u.%s",
> +                       PGSTAT_STAT_PERMANENT_DIRECTORY,
> +                       databaseid,
> +                       tempname ? "tmp" : "stat");
> +    if (printed >= len)
> +        elog(ERROR, "overlength pgstat path");
>  }

Do we really want database specific storage after all of these changes?
Seems like there's no point anymore?


> +    dshash_seq_init(&hstat, pgStatSharedHash, false);
> +    while ((p = dshash_seq_next(&hstat)) != NULL)
>      {
> -        Oid            tabid = tabentry->tableid;
> -
> -        CHECK_FOR_INTERRUPTS();
> -

Given that this could take a while on a database with a lot of objects
it might be worth keeping the CHECK_FOR_INTERRUPTS().


>
>  /* ----------
> - * pgstat_vacuum_stat() -
> + * collect_stat_entries() -
>   *
> - *    Will tell the collector about objects he can get rid of.
> + *    Collect the shared statistics entries specified by type and dbid. Returns a
> + *  list of pointer to shared statistics in palloc'ed memory. If type is
> + *  PGSTAT_TYPE_ALL, all types of statistics of the database is collected. If
> + *  type is PGSTAT_TYPE_DB, the parameter dbid is ignored and collect all
> + *  PGSTAT_TYPE_DB entries.
>   * ----------
>   */
> -void
> -pgstat_vacuum_stat(void)
> +static PgStatEnvelope * *collect_stat_entries(PgStatTypes type, Oid dbid)
>  {

> -        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
> +        if ((type != PGSTAT_TYPE_ALL && p->key.type != type) ||
> +            (type != PGSTAT_TYPE_DB && p->key.databaseid != dbid))
>              continue;

I don't like this interface much. Particularly not that it requires
adding a PGSTAT_TYPE_ALL that's otherwise not needed. And the thing
where PGSTAT_TYPE_DB doesn't actually work as one would expect isn't
nice either.

> -        /*
> -         * Not there, so add this table's Oid to the message
> -         */
> -        msg.m_tableid[msg.m_nentries++] = tabid;
> -
> -        /*
> -         * If the message is full, send it out and reinitialize to empty
> -         */
> -        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
> +        if (n >= listlen - 1)
>          {
> -            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
> -                + msg.m_nentries * sizeof(Oid);
> -
> -            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
> -            msg.m_databaseid = MyDatabaseId;
> -            pgstat_send(&msg, len);
> -
> -            msg.m_nentries = 0;
> +            listlen *= 2;
> +            envlist = repalloc(envlist, listlen * sizeof(PgStatEnvelope * *));
>          }
> +        envlist[n++] = dsa_get_address(area, p->env);
>      }

I'd use List here as well.


> +    dshash_seq_term(&hstat);

Hm, I didn't immediately see which locking makes this safe? Is it just
that nobody should be attached at this point?


> +void
> +pgstat_vacuum_stat(void)
> +{
> +    HTAB       *dbids;            /* database ids */
> +    HTAB       *relids;            /* relation ids in the current database */
> +    HTAB       *funcids;        /* function ids in the current database */
> +    PgStatEnvelope **victims;    /* victim entry list */
> +    int            arraylen = 0;    /* storage size of the above */
> +    int            nvictims = 0;    /* # of entries of the above */
> +    dshash_seq_status dshstat;
> +    PgStatHashEntry *ent;
> +    int            i;
> +
> +    /* we don't collect stats under standalone mode */
> +    if (!IsUnderPostmaster)
> +        return;
> +
> +    /* collect oids of existent objects */
> +    dbids = collect_oids(DatabaseRelationId, Anum_pg_database_oid);
> +    relids = collect_oids(RelationRelationId, Anum_pg_class_oid);
> +    funcids = collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
> +
> +    /* collect victims from shared stats */
> +    arraylen = 16;
> +    victims = palloc(sizeof(PgStatEnvelope * *) * arraylen);
> +    nvictims = 0;

Same List comment as before.



>  void
>  pgstat_reset_counters(void)
>  {
> -    PgStat_MsgResetcounter msg;
> +    PgStatEnvelope **envlist;
> +    PgStatEnvelope **p;
>
> -    if (pgStatSock == PGINVALID_SOCKET)
> -        return;
> +    /* Lookup the entries of the current database in the stats hash. */
> +    envlist = collect_stat_entries(PGSTAT_TYPE_ALL, MyDatabaseId);
> +    for (p = envlist; *p != NULL; p++)
> +    {
> +        PgStatEnvelope *env = *p;
> +        PgStat_StatDBEntry *dbstat;
>
> -    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
> -    msg.m_databaseid = MyDatabaseId;
> -    pgstat_send(&msg, sizeof(msg));
> +        LWLockAcquire(&env->lock, LW_EXCLUSIVE);
> +

What locking prevents this entry from being freed between the
collect_stat_entries() and this LWLockAcquire?



>  /* ----------
> @@ -1440,48 +1684,63 @@ pgstat_reset_slru_counter(const char *name)
>  void
>  pgstat_report_autovac(Oid dboid)
>  {
> -    PgStat_MsgAutovacStart msg;
> +    PgStat_StatDBEntry *dbentry;
> +    TimestampTz ts;
>
> -    if (pgStatSock == PGINVALID_SOCKET)
> +    /* return if activity stats is not active */
> +    if (!area)
>          return;
>
> -    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
> -    msg.m_databaseid = dboid;
> -    msg.m_start_time = GetCurrentTimestamp();
> +    ts = GetCurrentTimestamp();
>
> -    pgstat_send(&msg, sizeof(msg));
> +    /*
> +     * Store the last autovacuum time in the database's hash table entry.
> +     */
> +    dbentry = get_local_dbstat_entry(dboid);
> +    dbentry->last_autovac_time = ts;
>  }

Why did you introduce the local ts variable here?


>  /* --------
>   * pgstat_report_analyze() -
>   *
> - *    Tell the collector about the table we just analyzed.
> + *    Report about the table we just analyzed.
>   *
>   * Caller must provide new live- and dead-tuples estimates, as well as a
>   * flag indicating whether to reset the changes_since_analyze counter.
> @@ -1492,9 +1751,10 @@ pgstat_report_analyze(Relation rel,
>                        PgStat_Counter livetuples, PgStat_Counter deadtuples,
>                        bool resetcounter)
>  {
>  }

It seems to me that the analyze / vacuum cases would be much better
dealt with by synchronously operating on the shared entry, instead of
going through the local hash table. ISTM that that'd make it a lot
easier to avoid most of the ordering issues.




> +static PgStat_TableStatus *
> +get_local_tabstat_entry(Oid rel_id, bool isshared)
> +{
> +    PgStatEnvelope *env;
> +    PgStat_TableStatus *tabentry;
> +    bool        found;
>
> -    /*
> -     * Now we can fill the entry in pgStatTabHash.
> -     */
> -    hash_entry->tsa_entry = entry;
> +    env = get_local_stat_entry(PGSTAT_TYPE_TABLE,
> +                               isshared ? InvalidOid : MyDatabaseId,
> +                               rel_id, true, &found);
>
> -    return entry;
> +    tabentry = (PgStat_TableStatus *) &env->body;
> +
> +    if (!found)
> +    {
> +        tabentry->t_id = rel_id;
> +        tabentry->t_shared = isshared;
> +        tabentry->trans = NULL;
> +        MemSet(&tabentry->t_counts, 0, sizeof(PgStat_TableCounts));
> +        tabentry->vacuum_timestamp = 0;
> +        tabentry->autovac_vacuum_timestamp = 0;
> +        tabentry->analyze_timestamp = 0;
> +        tabentry->autovac_analyze_timestamp = 0;
> +    }
> +

As with shared entries, I think this should just be zero initialized
(and we should try to get rid of the duplication of t_id/t_shared).

> +    return tabentry;
>  }
>
> +
>  /*
>   * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
>   *
> - * If no entry, return NULL, don't create a new one
> + *  Find any existing PgStat_TableStatus entry for rel from the current
> + *  database then from shared tables.

What do you mean with "from the current database then from shared
tables"?


>  void
> -pgstat_send_archiver(const char *xlog, bool failed)
> +pgstat_report_archiver(const char *xlog, bool failed)
>  {
> -    PgStat_MsgArchiver msg;
> +    TimestampTz now = GetCurrentTimestamp();
>
> -    /*
> -     * Prepare and send the message
> -     */
> -    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
> -    msg.m_failed = failed;
> -    strlcpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
> -    msg.m_timestamp = GetCurrentTimestamp();
> -    pgstat_send(&msg, sizeof(msg));
> +    if (failed)
> +    {
> +        /* Failed archival attempt */
> +        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
> +        ++shared_archiverStats->failed_count;
> +        memcpy(shared_archiverStats->last_failed_wal, xlog,
> +               sizeof(shared_archiverStats->last_failed_wal));
> +        shared_archiverStats->last_failed_timestamp = now;
> +        LWLockRelease(StatsLock);
> +    }
> +    else
> +    {
> +        /* Successful archival operation */
> +        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
> +        ++shared_archiverStats->archived_count;
> +        memcpy(shared_archiverStats->last_archived_wal, xlog,
> +               sizeof(shared_archiverStats->last_archived_wal));
> +        shared_archiverStats->last_archived_timestamp = now;
> +        LWLockRelease(StatsLock);
> +    }
>  }

Huh, why is this duplicating near equivalent code?
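
The two branches differ only in which fields they touch, so something along
these lines would do. Standalone sketch: DemoArchiverStats and
DEMO_MAX_XFN_CHARS are invented stand-ins, the timestamp is passed in, and
the real code would hold the lock around the three updates. The file name is
copied as a bounded string here:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Assumed maximum WAL file name length for this sketch. */
#define DEMO_MAX_XFN_CHARS 40

/* Hypothetical stand-in for PgStat_ArchiverStats. */
typedef struct DemoArchiverStats
{
    long        archived_count;
    char        last_archived_wal[DEMO_MAX_XFN_CHARS + 1];
    int64_t     last_archived_timestamp;
    long        failed_count;
    char        last_failed_wal[DEMO_MAX_XFN_CHARS + 1];
    int64_t     last_failed_timestamp;
} DemoArchiverStats;

static void
demo_report_archiver(DemoArchiverStats *stats, const char *xlog,
                     bool failed, int64_t now)
{
    /* Pick the set of fields once, then share the update code. */
    long       *count = failed ? &stats->failed_count : &stats->archived_count;
    char       *wal = failed ? stats->last_failed_wal : stats->last_archived_wal;
    int64_t    *ts = failed ? &stats->last_failed_timestamp
                            : &stats->last_archived_timestamp;
    size_t      len;

    assert(xlog != NULL);

    /* The real function would acquire the lock here ... */
    (*count)++;
    len = strlen(xlog);
    if (len > DEMO_MAX_XFN_CHARS)
        len = DEMO_MAX_XFN_CHARS;
    memcpy(wal, xlog, len);
    wal[len] = '\0';
    *ts = now;
    /* ... and release it here. */
}
```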

>  /* ----------
>   * pgstat_write_statsfiles() -
> - *        Write the global statistics file, as well as requested DB files.
> - *
> - *    'permanent' specifies writing to the permanent files not temporary ones.
> - *    When true (happens only when the collector is shutting down), also remove
> - *    the temporary files so that backends starting up under a new postmaster
> - *    can't read old data before the new collector is ready.
> - *
> - *    When 'allDbs' is false, only the requested databases (listed in
> - *    pending_write_requests) will be written; otherwise, all databases
> - *    will be written.
> + *        Write the global statistics file, as well as DB files.
>   * ----------
>   */
> -static void
> -pgstat_write_statsfiles(bool permanent, bool allDbs)
> +void
> +pgstat_write_statsfiles(void)
>  {

What's the locking around this?


> -pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
> +pgstat_write_database_stats(PgStat_StatDBEntry *dbentry)
>  {
> -    HASH_SEQ_STATUS tstat;
> -    HASH_SEQ_STATUS fstat;
> -    PgStat_StatTabEntry *tabentry;
> -    PgStat_StatFuncEntry *funcentry;
> +    PgStatEnvelope **envlist;
> +    PgStatEnvelope **penv;
>      FILE       *fpout;
>      int32        format_id;
>      Oid            dbid = dbentry->databaseid;
> @@ -5048,8 +4974,8 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
>      char        tmpfile[MAXPGPATH];
>      char        statfile[MAXPGPATH];
>
> -    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
> -    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
> +    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
> +    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
>
>      elog(DEBUG2, "writing stats file \"%s\"", statfile);
>
> @@ -5076,24 +5002,31 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
>      /*
>       * Walk through the database's access stats per table.
>       */
> -    hash_seq_init(&tstat, dbentry->tables);
> -    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
> +    envlist = collect_stat_entries(PGSTAT_TYPE_TABLE, dbentry->databaseid);
> +    for (penv = envlist; *penv != NULL; penv++)

In several of these collect_stat_entries() callers it really bothers me
that we basically allocate an array as large as the number of objects
in the database (That's fine for databases, but for tables...). Without
much need as far as I can see.


>      {
> +        PgStat_StatTabEntry *tabentry = (PgStat_StatTabEntry *) &(*penv)->body;
> +
>          fputc('T', fpout);
>          rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
>          (void) rc;                /* we'll check for error with ferror */
>      }
> +    pfree(envlist);
>
>      /*
>       * Walk through the database's function stats table.
>       */
> -    hash_seq_init(&fstat, dbentry->functions);
> -    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
> +    envlist = collect_stat_entries(PGSTAT_TYPE_FUNCTION, dbentry->databaseid);
> +    for (penv = envlist; *penv != NULL; penv++)
>      {
> +        PgStat_StatFuncEntry *funcentry =
> +        (PgStat_StatFuncEntry *) &(*penv)->body;
> +
>          fputc('F', fpout);
>          rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
>          (void) rc;                /* we'll check for error with ferror */
>      }
> +    pfree(envlist);

Why do we need separate loops for every type of object here?


> +/* ----------
> + * create_missing_dbentries() -
> + *
> + *  There may be the case where database entry is missing for the database
> + *  where object stats are recorded. This function creates such missing
> + *  dbentries so that so that all stats entries can be written out to files.
> + * ----------
> + */
> +static void
> +create_missing_dbentries(void)
> +{

In which situation is this necessary?


> +static PgStatEnvelope *
> +get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
> +               bool nowait, entry_initializer initfunc, bool *found)
> +{

> +    bool        create = (initfunc != NULL);
> +    PgStatHashEntry *shent;
> +    PgStatEnvelope *shenv = NULL;
> +    PgStatHashEntryKey key;
> +    bool        myfound;
> +
> +    Assert(type != PGSTAT_TYPE_ALL);
> +
> +    key.type = type;
> +    key.databaseid = dbid;
> +    key.objectid = objid;
> +    shent = dshash_find_extended(pgStatSharedHash, &key,
> +                                 create, nowait, create, &myfound);
> +    if (shent)
>      {
> -        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
> +        if (create && !myfound)
> +        {
> +            /* Create new stats envelope. */
> +            size_t        envsize = PgStatEnvelopeSize(pgstat_entsize[type]);
> +            dsa_pointer chunk = dsa_allocate0(area, envsize);

> +            /*
> +             * The lock on dshsh is released just after. Call initializer
> +             * callback before it is exposed to other process.
> +             */
> +            if (initfunc)
> +                initfunc(shenv);
> +
> +            /* Link the new entry from the hash entry. */
> +            shent->env = chunk;
> +        }
> +        else
> +            shenv = dsa_get_address(area, shent->env);
> +
> +        dshash_release_lock(pgStatSharedHash, shent);

Doesn't this mean that by this time the entry could already have been
removed by a concurrent backend, and the dsa allocation freed?


> Subject: [PATCH v36 7/7] Remove the GUC stats_temp_directory
>
> The GUC used to specify the directory to store temporary statistics
> files. It is no longer needed by the stats collector but still used by
> the programs in bin and contrib, and maybe other extensions. Thus this
> patch removes the GUC but some backing variables and macro definitions
> are left alone for backward compatibility.

I don't see what this achieves? Which use of those variables / macros
would be safe? I think it'd be better to just remove them.

Greetings,

Andres Freund



Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
Thanks for reviewing!

At Mon, 21 Sep 2020 19:47:04 -0700, Andres Freund <andres@anarazel.de> wrote in 
> Hi,
> 
> On 2020-09-08 17:55:57 +0900, Kyotaro Horiguchi wrote:
> > Locks on the shared statistics is acquired by the units of such like
> > tables, functions so the expected chance of collision are not so high.
> 
> I can't really parse that...

Mmm... Is the following readable?

Shared statistics locks are acquired by units such as tables,
functions, etc., so the chances of an expected collision are not so
high.

Anyway, this turned out to be wrong, so I removed it.

> > Furthermore, until 1 second has elapsed since the last flushing to
> > shared stats, lock failure postpones stats flushing so that lock
> > contention doesn't slow down transactions.
> 
> I think I commented on that before, but to me 1s seems way too low to
> switch to blocking lock acquisition. What's the reason for such a low
> limit?

It was 0.5 seconds previously.  I don't have a clear idea of a
reasonable value for it. One possible rationale might be to give each
of 1000 clients a writing time slot of 10ms, which yields 10s as the
minimum interval. I set the maximum interval to 60s and the retry
interval to 1s. (Fixed?)
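
To make the intended policy concrete, here is a minimal standalone sketch of how the three intervals could interact. It is only one reading of the description above, not the patch's code: all names are hypothetical, the "60" is assumed to mean seconds, and the rule "block once updates have been pending for the maximum interval" is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Intervals from the mail above, in milliseconds; names are hypothetical. */
#define FLUSH_MIN_INTERVAL_MS	10000
#define FLUSH_RETRY_INTERVAL_MS	1000
#define FLUSH_MAX_INTERVAL_MS	60000	/* assumes "60" means 60 seconds */

typedef enum FlushAction
{
	FLUSH_SKIP,					/* too early, keep accumulating locally */
	FLUSH_TRY_NOWAIT,			/* try, but give up if the lock is busy */
	FLUSH_BLOCKING				/* pending too long, wait for the lock */
} FlushAction;

/* The stated rationale for the minimum: nclients write slots of slot_ms. */
long
min_interval_for(long nclients, long slot_ms)
{
	return nclients * slot_ms;
}

/*
 * since_last_attempt_ms: time since the previous flush attempt
 * pending_ms:            age of the oldest unflushed local update
 * last_attempt_failed:   the previous nowait attempt lost the lock
 */
FlushAction
choose_flush_action(long since_last_attempt_ms, long pending_ms,
					bool last_attempt_failed)
{
	long		wait_ms = last_attempt_failed ? FLUSH_RETRY_INTERVAL_MS
											  : FLUSH_MIN_INTERVAL_MS;

	assert(since_last_attempt_ms >= 0 && pending_ms >= 0);

	if (pending_ms >= FLUSH_MAX_INTERVAL_MS)
		return FLUSH_BLOCKING;
	if (since_last_attempt_ms < wait_ms)
		return FLUSH_SKIP;
	return FLUSH_TRY_NOWAIT;
}
```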

> >      /*
> > -     * Clean up any dead statistics collector entries for this DB. We always
> > +     * Clean up any dead activity statistics entries for this DB. We always
> >       * want to do this exactly once per DB-processing cycle, even if we find
> >       * nothing worth vacuuming in the database.
> >       */
> 
> What is "activity statistics"?

I don't get your point. It is formally the replacement term for
"statistics collector". The "statistics collector (process)" no longer
exists, so I had to invent a name for the successor mechanism that is
distinguishable from data/column statistics.  If it is not the proper
wording, I'd appreciate it if you could suggest an appropriate one.

> > @@ -2816,8 +2774,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
> >      }
> >
> >      /* fetch the pgstat table entry */
> > -    tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
> > -                                         shared, dbentry);
> > +    tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
> > +                                                   relid);
> 
> Why do all of these places deal with a snapshot? For most it seems to
> make much more sense to just look up the entry and then copy that into
> local memory?  There may be some place that need some sort of snapshot
> behaviour that's stable until commit / pgstat_clear_snapshot(). But I
> can't really see many?

Ok, I reread this thread and agree that there's a (vague) consensus to
remove the snapshot stuff. Backend statistics (bestats) are still
stable during a transaction.


> > +#define PGSTAT_MIN_INTERVAL            1000    /* Minimum interval of stats data
> >
> > +#define PGSTAT_MAX_INTERVAL            10000    /* Longest interval of stats data
> > +                                             * updates */
> 
> These don't really seem to be in line with the commit message...

Oops! Sorry. Fixed both the values and the commit message (and the
file comment).

> > + * dshash pgStatSharedHash
> > + *    -> PgStatHashEntry                (dshash entry)
> > + *      (dsa_pointer)-> PgStatEnvelope    (dsa memory block)
> 
> I don't like 'Envelope' that much. If I understand you correctly that's
> a common prefix that's used for all types of stat objects, correct? If
> so, how about just naming it PgStatEntryBase or such? I think it'd also
> be useful to indicate in the "are stored as" part that PgStatEnvelope is
> just the common prefix for an allocation.

The name makes sense. Thanks! (But the struct is now gone..)

> > -typedef struct TabStatHashEntry
> > +static size_t pgstat_entsize[] =
> 
> > +/* Ditto for local statistics entries */
> > +static size_t pgstat_localentsize[] =
> > +{
> > +    0,                            /* PGSTAT_TYPE_ALL: not an entry */
> > +    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
> > +    sizeof(PgStat_TableStatus), /* PGSTAT_TYPE_TABLE */
> > +    sizeof(PgStat_BackendFunctionEntry) /* PGSTAT_TYPE_FUNCTION */
> > +};
> 
> These probably should be const as well.

Right. Fixed.

> 
> >  /*
> > - * Backends store per-function info that's waiting to be sent to the collector
> > - * in this hash table (indexed by function OID).
> > + * Stats numbers that are waiting for flushing out to shared stats are held in
> > + * pgStatLocalHash,
> >   */
> > -static HTAB *pgStatFunctions = NULL;
> > +typedef struct PgStatHashEntry
> > +{
> > +    PgStatHashEntryKey key;        /* hash key */
> > +    dsa_pointer env;            /* pointer to shared stats envelope in DSA */
> > +}            PgStatHashEntry;
> > +
> > +/* struct for shared statistics entry pointed from shared hash entry. */
> > +typedef struct PgStatEnvelope
> > +{
> > +    PgStatTypes type;            /* statistics entry type */
> > +    Oid            databaseid;        /* databaseid */
> > +    Oid            objectid;        /* objectid */
> 
> Do we need this information both here and in PgStatHashEntry? It's
> possible that it's worthwhile, but I am not sure it is.

The same key values were stored in PgStatEnvelope,
PgStat(Local)HashEntry, and PgStat_Stats*Entry, and I thought the same
while developing. After some thought, I managed to remove the duplicate
values from everything other than PgStat(Local)HashEntry. Fixed.


> > +    size_t        len;            /* length of body, fixed per type. */
> 
> Why do we need this? Isn't that something that can easily be looked up
> using the type?

Not only are they virtually fixed values, but they also turned out to
be write-only variables. Removed.


> > +    LWLock        lock;            /* lightweight lock to protect body */
> > +    int            body[FLEXIBLE_ARRAY_MEMBER];    /* statistics body */
> > +}            PgStatEnvelope;
> 
> What you're doing here with 'body' doesn't provide enough guarantees
> about proper alignment. E.g. if one of the entry types wants to store a
> double, this won't portably work, because there's platforms that have 4
> byte alignment for ints, but 8 byte alignment for doubles.
> 
> 
> Wouldn't it be better to instead embed PgStatEnvelope into the struct
> that's actually stored? E.g. something like
> 
> struct PgStat_TableStatus
> {
>     PgStatEnvelope header; /* I'd rename the type */
>     TimestampTz vacuum_timestamp;    /* user initiated vacuum */
>     ...
> }
> 
> or if you don't want to do that because it'd require declaring
> PgStatEnvelope in the header (not sure that'd really be worth avoiding),
> you could just get rid of the body field and just do the calculation
> using something like MAXALIGN((char *) envelope + sizeof(PgStatEnvelope))

As a result of the modifications so far, only one member, lock, is
left in the PgStatEnvelope (or PgStatEntryBase) struct.  I chose to
embed it in each of the PgStat_Stat*Entry structs as
PgStat_StatEntryHeader.
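
As an illustration of the alignment point, here is a minimal standalone sketch of the two options discussed above. `StatEntryHeader`, `TableStatsEntry` and `align_up` are hypothetical names, and a plain `int` stands in for the lock. Embedding the header as the first member lets the compiler place the following fields correctly; placing a body after a bare header requires rounding the offset up explicitly, which is what MAXALIGN does in PostgreSQL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical common header; a plain int stands in for the lock. */
typedef struct StatEntryHeader
{
	int			lock;
} StatEntryHeader;

/* Option 1: embed the header, so the compiler pads before 'scan_cost'. */
typedef struct TableStatsEntry
{
	StatEntryHeader header;
	double		scan_cost;		/* hypothetical field needing double alignment */
	int64_t		numscans;
} TableStatsEntry;

/* Round 'off' up to a multiple of 'align', which must be a power of two. */
size_t
align_up(size_t off, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (off + align - 1) & ~(align - 1);
}

/* Option 2: compute where a body may start after a bare header. */
size_t
body_offset_after_header(void)
{
	return align_up(sizeof(StatEntryHeader), _Alignof(max_align_t));
}
```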

> > + * Snapshot is stats entry that is locally copied to offset stable values for a
> > + * transaction.
...
> The amount of code needed for this snapshot stuff seems unreasonable to
> me, especially because I don't see why we really need it. Is this just
> so that there's no skew between all the columns of pg_stat_all_tables()
> etc?
> 
> I think this needs a lot more comments explaining what it's trying to
> achieve.

I don't insist on keeping the behavior.  I removed the snapshot stuff
from pgstat only. (The beentry snapshot is left alone.)

> > +/*
> > + * Newly created shared stats entries needs to be initialized before the other
> > + * processes get access it. get_stat_entry() calls it for the purpose.
> > + */
> > +typedef void (*entry_initializer) (PgStatEnvelope * env);
> 
> I think we should try to not need it, instead declaring that all fields
> are zero initialized. That fits well together with my suggestion to
> avoid duplicating the database / object ids.

Now that entries don't have type-specific fields that need special
care, I removed that stuff altogether.

> > +static void
> > +attach_shared_stats(void)
> > +{
...
> > +        shared_globalStats = (PgStat_GlobalStats *)
> > +            dsa_get_address(area, StatsShmem->global_stats);
> > +        shared_archiverStats = (PgStat_ArchiverStats *)
> > +            dsa_get_address(area, StatsShmem->archiver_stats);
> > +
> > +        shared_SLRUStats = (PgStatSharedSLRUStats *)
> > +            dsa_get_address(area, StatsShmem->slru_stats);
> > +        LWLockInitialize(&shared_SLRUStats->lock, LWTRANCHE_STATS);
> 
> I don't think it makes sense to use dsa allocations for any of the fixed
> size stats (global_stats, archiver_stats, ...). They should just be
> direct members of StatsShmem? Then we also don't need the shared_*
> helper variables

I intended to reduce the amount of fixed-allocated shared memory, or
make maximum use of DSA. However, you're right. Now they are members
of StatsShmem.

> 
> > +        /* Load saved data if any. */
> > +        pgstat_read_statsfiles();
> 
> Hm. Is it a good idea to do this as part of the shmem init function?
> That's a lot more work than we normally do in these.
> 
> > +/* ----------
> > + * detach_shared_stats() -
> > + *
> > + *    Detach shared stats. Write out to file if we're the last process and told
> > + *    to do so.
> > + * ----------
> >   */
> >  static void
> > -pgstat_reset_remove_files(const char *directory)
> > +detach_shared_stats(bool write_stats)
> 
> I think it'd be better to have an explicit call in the shutdown sequence
> somewhere to write out the data, instead of munging detach and writing
> stats out together.

It is actually strange that attach_shared_stats reads the file inside
a StatsLock section while it deliberately attaches the existing shared
memory area outside the same lock section. So, as the first step, I
moved the calls to pgstat_read/write_statsfiles() out of the StatsLock
section. But I couldn't move pgstat_write_statsfiles() out of (or
before or after) detach_shared_stats(), because I didn't find a way to
reliably check, from a function separate from detach_shared_stats(),
whether the exiting process is the last detacher.

(continued)
=====

Attached is the updated version, which takes in the comments above. I
will continue to address the rest of the comments. Only 0004 is revised.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

From aacc7d69a46f3d18fb5b7dc2c7cde901e0cfc405 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:03 +0900
Subject: [PATCH v37 1/6] sequential scan for dshash

Dshash did not allow scanning all entries sequentially. This adds that
functionality. The interface is similar to, but a bit different from,
both that of dynahash and the simple dshash search functions. One of
the most significant differences is that the sequential scan interface
of dshash always needs a call to dshash_seq_term when a scan ends.
Another is locking. Dshash holds a partition lock when returning an
entry; dshash_seq_next() also holds the lock when returning an entry,
but callers shouldn't release it, since the lock is essential to
continue the scan. The seqscan interface allows entry deletion during
a scan. In-scan deletion should be performed by dshash_delete_current().
---
 src/backend/lib/dshash.c | 149 ++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  21 ++++++
 2 files changed, 169 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 78ccf03217..1ef093e2e9 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -127,6 +127,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +157,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -324,7 +332,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -592,6 +600,145 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should always be called when a scan finished.
+ * The caller may delete returned elements in the midst of a scan by using
+ * dshash_delete_current(). exclusive must be true to delete elements.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool exclusive)
+{
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->exclusive = exclusive;
+}
+
+/*
+ * Returns the next element.
+ *
+ * Returned elements are locked and the caller must not explicitly release
+ * it. It is released at the next call to dshash_seq_next().
+ */
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert(status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+        LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                      status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        status->curpartition = partition;
+
+        /* resize doesn't happen from now until seq scan ends */
+        status->nbuckets =
+            NUM_BUCKETS(status->hash_table->control->size_log2);
+        ensure_valid_bucket_pointers(status->hash_table);
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        int next_partition;
+
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            return NULL;
+        }
+
+        /* Check if move to the next partition */
+        next_partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+
+        if (status->curpartition != next_partition)
+        {
+            /*
+             * Move to the next partition. Lock the next partition then release
+             * the current, not in the reverse order to avoid concurrent
+             * resizing.  Avoid dead lock by taking lock in the same order
+             * with resize().
+             */
+            LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                         next_partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                         status->curpartition));
+            status->curpartition = next_partition;
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * The caller may delete the item. Store the next item in case of deletion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+/*
+ * Terminates the seqscan and release all locks.
+ *
+ * Should be always called when finishing or exiting a seqscan.
+ */
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+
+    if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+/* Remove the current entry while a seq scan. */
+void
+dshash_delete_current(dshash_seq_status *status)
+{
+    dshash_table       *hash_table    = status->hash_table;
+    dshash_table_item  *item        = status->curitem;
+    size_t                partition    = PARTITION_FOR_HASH(item->hash);
+
+    Assert(status->exclusive);
+    Assert(hash_table->control->magic == DSHASH_MAGIC);
+    Assert(hash_table->find_locked);
+    Assert(hash_table->find_exclusively_locked);
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition),
+                                LW_EXCLUSIVE));
+
+    delete_item(hash_table, item);
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b86df68e77..0ca9514021 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,21 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state. The detail is exposed to let users know the storage
+ * size but it should be considered as an opaque type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;    /* dshash table working on */
+    int                    curbucket;    /* bucket number we are at */
+    int                    nbuckets;    /* total number of buckets in the dshash */
+    dshash_table_item  *curitem;    /* item we are currently at */
+    dsa_pointer            pnextitem;    /* dsa-pointer to the next item */
+    int                    curpartition;    /* partition number we are at */
+    bool                exclusive;    /* locking mode */
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
                                    const dshash_parameters *params,
@@ -80,6 +95,12 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
+extern void dshash_delete_current(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.18.4
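
The deletion-safe iteration used by the patch above (remember the next item before handing out the current one) can be illustrated with a minimal standalone sketch. This is a toy chained hash table, not dshash: there is no shared memory and no partition locking, and all names (`Table`, `SeqStatus`, `seq_next`, `seq_delete_current`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NBUCKETS 4

typedef struct Item
{
	struct Item *next;
	int			key;
} Item;

typedef struct Table
{
	Item	   *buckets[NBUCKETS];
} Table;

typedef struct SeqStatus
{
	Table	   *table;
	int			curbucket;		/* bucket being scanned, -1 before start */
	Item	   *curitem;		/* item most recently returned */
	Item	   *nextitem;		/* saved successor of curitem */
} SeqStatus;

void
table_insert(Table *t, int key)
{
	Item	   *item = malloc(sizeof(Item));
	unsigned	b = (unsigned) key % NBUCKETS;

	assert(item != NULL);
	item->key = key;
	item->next = t->buckets[b];
	t->buckets[b] = item;
}

void
seq_init(SeqStatus *st, Table *t)
{
	st->table = t;
	st->curbucket = -1;
	st->curitem = NULL;
	st->nextitem = NULL;
}

/* Return the next item, or NULL when all buckets have been scanned. */
Item *
seq_next(SeqStatus *st)
{
	Item	   *item = st->nextitem;

	/* Chain exhausted: advance to the next non-empty bucket. */
	while (item == NULL)
	{
		if (st->curbucket + 1 >= NBUCKETS)
		{
			st->curitem = NULL;
			return NULL;
		}
		st->curbucket++;
		item = st->table->buckets[st->curbucket];
	}

	st->curitem = item;
	st->nextitem = item->next;	/* saved so the caller may delete 'item' */
	return item;
}

/* Unlink and free the item most recently returned by seq_next(). */
void
seq_delete_current(SeqStatus *st)
{
	Item	  **link;

	assert(st->curitem != NULL);
	link = &st->table->buckets[st->curbucket];
	while (*link != st->curitem)
		link = &(*link)->next;
	*link = st->curitem->next;
	free(st->curitem);
	st->curitem = NULL;
}
```

In dshash, the partition lock held across calls is what keeps the saved successor valid; the toy version has no concurrency, so that part is omitted.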

From bc5d68b4782f83e707409ef026cb7039b7f127da Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:57 +0900
Subject: [PATCH v37 2/6] Add conditional lock feature to dshash

Dshash currently waits for locks unconditionally, which is inconvenient
when we want to avoid being blocked by other processes. This commit
adds dshash_find_extended, an alternative to dshash_find and
dshash_find_or_insert that allows immediate return on lock failure.
---
 src/backend/lib/dshash.c | 98 +++++++++++++++++++++-------------------
 src/include/lib/dshash.h |  3 ++
 2 files changed, 55 insertions(+), 46 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 1ef093e2e9..d7ee6de11e 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -383,6 +383,10 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  * the caller must take care to ensure that the entry is not left corrupted.
  * The lock mode is either shared or exclusive depending on 'exclusive'.
  *
+ * If found is not NULL, *found is set to true if the key is found in the hash
+ * table. If the key is not found, *found is set to false and a pointer to a
+ * newly created entry is returned.
+ *
  * The caller must not lock a lock already.
  *
  * Note that the lock held is in fact an LWLock, so interrupts will be held on
@@ -392,36 +396,7 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
 {
-    dshash_hash hash;
-    size_t        partition;
-    dshash_table_item *item;
-
-    hash = hash_key(hash_table, key);
-    partition = PARTITION_FOR_HASH(hash);
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
-
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
-    ensure_valid_bucket_pointers(hash_table);
-
-    /* Search the active bucket. */
-    item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
-
-    if (!item)
-    {
-        /* Not found. */
-        LWLockRelease(PARTITION_LOCK(hash_table, partition));
-        return NULL;
-    }
-    else
-    {
-        /* The caller will free the lock by calling dshash_release_lock. */
-        hash_table->find_locked = true;
-        hash_table->find_exclusively_locked = exclusive;
-        return ENTRY_FROM_ITEM(item);
-    }
+    return dshash_find_extended(hash_table, key, exclusive, false, false, NULL);
 }
 
 /*
@@ -439,31 +414,60 @@ dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
 {
-    dshash_hash hash;
-    size_t        partition_index;
-    dshash_partition *partition;
+    return dshash_find_extended(hash_table, key, true, false, true, found);
+}
+
+
+/*
+ * Find the key in the hash table.
+ *
+ * "exclusive" is the lock mode in which the partition for the returned item
+ * is locked.  If "nowait" is true, the function immediately returns if
+ * required lock was not acquired.  "insert" indicates insert mode. In this
+ * mode new entry is inserted and set *found to false. *found is set to true if
+ * found. "found" must be non-null in this mode.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool insert, bool *found)
+{
+    dshash_hash hash = hash_key(hash_table, key);
+    size_t        partidx = PARTITION_FOR_HASH(hash);
+    dshash_partition *partition = &hash_table->control->partitions[partidx];
+    LWLockMode  lockmode = exclusive ? LW_EXCLUSIVE : LW_SHARED;
     dshash_table_item *item;
 
-    hash = hash_key(hash_table, key);
-    partition_index = PARTITION_FOR_HASH(hash);
-    partition = &hash_table->control->partitions[partition_index];
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
+    /* must be exclusive when insert allowed */
+    Assert(!insert || (exclusive && found != NULL));
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (!nowait)
+        LWLockAcquire(PARTITION_LOCK(hash_table, partidx), lockmode);
+    else if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partidx),
+                                       lockmode))
+        return NULL;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
     item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
 
     if (item)
-        *found = true;
+    {
+        if (found)
+            *found = true;
+    }
     else
     {
-        *found = false;
+        if (found)
+            *found = false;
+
+        if (!insert)
+        {
+            /* The caller didn't ask to add a new entry. */
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+            return NULL;
+        }
 
         /* Check if we are getting too full. */
         if (partition->count > MAX_COUNT_PER_PARTITION(hash_table))
@@ -479,7 +483,8 @@ restart:
              * Give up our existing lock first, because resizing needs to
              * reacquire all the locks in the right order to avoid deadlocks.
              */
-            LWLockRelease(PARTITION_LOCK(hash_table, partition_index));
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+
             resize(hash_table, hash_table->size_log2 + 1);
 
             goto restart;
@@ -493,12 +498,13 @@ restart:
         ++partition->count;
     }
 
-    /* The caller must release the lock with dshash_release_lock. */
+    /* The caller will free the lock by calling dshash_release_lock. */
     hash_table->find_locked = true;
-    hash_table->find_exclusively_locked = true;
+    hash_table->find_exclusively_locked = exclusive;
     return ENTRY_FROM_ITEM(item);
 }
 
+
 /*
  * Remove an entry by key.  Returns true if the key was found and the
  * corresponding entry was removed.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 0ca9514021..5f7a60febd 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -91,6 +91,9 @@ extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                                    const void *key, bool *found);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+                                  bool exclusive, bool nowait, bool insert,
+                                  bool *found);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.18.4
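
The nowait/insert/found contract of dshash_find_extended() declared above can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions, not the dshash implementation: ToyTable, ToyEntry and the toy_* functions are hypothetical names, a single C11 atomic_flag stands in for the partition LWLock, and the "exclusive" flag is left out because this stand-in lock has no shared mode.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_CAPACITY 8

typedef struct ToyEntry
{
    int  key;
    int  value;
    bool used;
} ToyEntry;

typedef struct ToyTable
{
    atomic_flag lock;               /* stands in for one partition LWLock */
    ToyEntry    entries[TOY_CAPACITY];
} ToyTable;

static void
toy_init(ToyTable *t)
{
    atomic_flag_clear(&t->lock);
    for (int i = 0; i < TOY_CAPACITY; i++)
        t->entries[i].used = false;
}

/* Counterpart of dshash_release_lock() in this toy model. */
static void
toy_release(ToyTable *t)
{
    atomic_flag_clear(&t->lock);
}

/*
 * Returns the entry with the lock still held, or NULL with the lock not
 * held.  With nowait, a busy lock yields NULL and *found is left untouched.
 * Without insert, a missing key yields NULL after releasing the lock.
 * "found" may be NULL if the caller does not care.
 */
static ToyEntry *
toy_find_extended(ToyTable *t, int key, bool nowait, bool insert, bool *found)
{
    ToyEntry *free_slot = NULL;

    assert(t != NULL);

    if (nowait)
    {
        if (atomic_flag_test_and_set(&t->lock))
            return NULL;            /* lock is busy, give up immediately */
    }
    else
    {
        while (atomic_flag_test_and_set(&t->lock))
            ;                       /* spin until the lock is free */
    }

    for (int i = 0; i < TOY_CAPACITY; i++)
    {
        ToyEntry *e = &t->entries[i];

        if (e->used && e->key == key)
        {
            if (found)
                *found = true;
            return e;               /* caller must call toy_release() */
        }
        if (!e->used && free_slot == NULL)
            free_slot = e;
    }

    if (found)
        *found = false;

    if (!insert || free_slot == NULL)
    {
        /* Not asked to add an entry (or the toy table is full). */
        toy_release(t);
        return NULL;
    }

    free_slot->used = true;
    free_slot->key = key;
    free_slot->value = 0;
    return free_slot;               /* caller must call toy_release() */
}
```

A caller that must not block would pass nowait = true and treat a NULL result as "try again later"; a caller that only wants to read would pass insert = false and check *found.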

From 2d999775b2a6744e4440c6ce87f1d4752860a697 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:59:38 +0900
Subject: [PATCH v37 3/6] Make archiver process an auxiliary process

This is a preliminary patch for the shared-memory based stats collector.

The archiver process must be an auxiliary process because it uses shared
memory once stats data has moved into shared memory. Make the process
an auxiliary process so that it keeps working.
---
 src/backend/access/transam/xlogarchive.c |   6 +-
 src/backend/bootstrap/bootstrap.c        |  22 ++--
 src/backend/postmaster/pgarch.c          | 130 +++--------------------
 src/backend/postmaster/postmaster.c      |  50 +++++----
 src/backend/storage/lmgr/proc.c          |   1 +
 src/include/access/xlog.h                |   3 +
 src/include/access/xlogarchive.h         |   1 +
 src/include/miscadmin.h                  |   2 +
 src/include/postmaster/pgarch.h          |   4 +-
 src/include/storage/pmsignal.h           |   1 -
 src/include/storage/proc.h               |   3 +
 11 files changed, 69 insertions(+), 154 deletions(-)
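
For readers skimming the diff below: the wakeup path changes from "signal the postmaster, which relays SIGUSR1 to the archiver" to "set the archiver's latch directly, if one is advertised". The following is a minimal stand-alone model of that handshake, assuming hypothetical ToyLatch/ToyProcGlobal types and toy_* functions; a real latch lives in shared memory and also wakes the sleeping process, which this sketch does not attempt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyLatch
{
    bool is_set;
} ToyLatch;

typedef struct ToyProcGlobal
{
    ToyLatch *archiverLatch;    /* NULL until the archiver has started */
} ToyProcGlobal;

/* Archiver side: advertise our latch so that other processes can wake us. */
static void
toy_archiver_advertise(ToyProcGlobal *pg, ToyLatch *my_latch)
{
    assert(my_latch != NULL);
    pg->archiverLatch = my_latch;
}

/*
 * Backend side: wake the archiver if it is running.  Returns false when no
 * latch is advertised, in which case there is nobody to wake.
 */
static bool
toy_archive_notify(ToyProcGlobal *pg)
{
    if (pg->archiverLatch == NULL)
        return false;
    pg->archiverLatch->is_set = true;
    return true;
}

/* Archiver side: report whether we were woken, and reset the latch. */
static bool
toy_archiver_consume(ToyLatch *latch)
{
    bool was_set = latch->is_set;

    latch->is_set = false;
    return was_set;
}
```

The NULL check mirrors the guard added to XLogArchiveNotify(): a notification that arrives before the archiver has advertised its latch is simply dropped in this model.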

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 8f8734dc1d..2e6c322142 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -29,7 +29,9 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/latch.h"
 #include "storage/pmsignal.h"
+#include "storage/proc.h"
 
 /*
  * Attempt to retrieve the specified file from off-line archival storage.
@@ -489,8 +491,8 @@ XLogArchiveNotify(const char *xlog)
     }
 
     /* Notify archiver that it's got something to do */
-    if (IsUnderPostmaster)
-        SendPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER);
+    if (IsUnderPostmaster && ProcGlobal->archiverLatch)
+        SetLatch(ProcGlobal->archiverLatch);
 }
 
 /*
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 76b2f5066f..81bfaea869 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -317,6 +317,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
         case StartupProcess:
             MyBackendType = B_STARTUP;
             break;
+        case ArchiverProcess:
+            MyBackendType = B_ARCHIVER;
+            break;
         case BgWriterProcess:
             MyBackendType = B_BG_WRITER;
             break;
@@ -437,30 +440,29 @@ AuxiliaryProcessMain(int argc, char *argv[])
             proc_exit(1);        /* should never return */
 
         case StartupProcess:
-            /* don't set signals, startup process has its own agenda */
             StartupProcessMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
+
+        case ArchiverProcess:
+            PgArchiverMain();
+            proc_exit(1);
 
         case BgWriterProcess:
-            /* don't set signals, bgwriter has its own agenda */
             BackgroundWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case CheckpointerProcess:
-            /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalWriterProcess:
-            /* don't set signals, walwriter has its own agenda */
             InitXLOGAccess();
             WalWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalReceiverProcess:
-            /* don't set signals, walreceiver has its own agenda */
             WalReceiverMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         default:
             elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index ed1b65358d..e3a520def9 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -48,6 +48,7 @@
 #include "storage/latch.h"
 #include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
+#include "storage/procsignal.h"
 #include "utils/guc.h"
 #include "utils/ps_status.h"
 
@@ -78,13 +79,11 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
-static volatile sig_atomic_t wakened = false;
 static volatile sig_atomic_t ready_to_stop = false;
 
 /* ----------
@@ -95,8 +94,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
 static void pgarch_ArchiverCopyLoop(void);
@@ -110,75 +107,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -212,14 +140,9 @@ pgarch_forkexec(void)
 #endif                            /* EXEC_BACKEND */
 
 
-/*
- * PgArchiverMain
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
- *    since we don't use 'em, it hardly matters...
- */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+/* Main entry point for archiver process */
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -231,33 +154,26 @@ PgArchiverMain(int argc, char *argv[])
     /* SIGQUIT handler was already set up by InitPostmasterChild */
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, pgarch_waken);
+    pqsignal(SIGUSR1, procsignal_sigusr1_handler);
     pqsignal(SIGUSR2, pgarch_waken_stop);
+
     /* Reset some signals that are accepted by postmaster but not here */
     pqsignal(SIGCHLD, SIG_DFL);
+
+    /* Unblock signals (they were blocked when the postmaster forked us) */
     PG_SETMASK(&UnBlockSig);
 
-    MyBackendType = B_ARCHIVER;
-    init_ps_display(NULL);
+    /*
+     * Advertise our latch that backends can use to wake us up while we're
+     * sleeping.
+     */
+    ProcGlobal->archiverLatch = &MyProc->procLatch;
 
     pgarch_MainLoop();
 
     exit(0);
 }
 
-/* SIGUSR1 signal handler for archiver process */
-static void
-pgarch_waken(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    /* set flag that there is work to be done */
-    wakened = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
 /* SIGUSR2 signal handler for archiver process */
 static void
 pgarch_waken_stop(SIGNAL_ARGS)
@@ -282,14 +198,6 @@ pgarch_MainLoop(void)
     pg_time_t    last_copy_time = 0;
     bool        time_to_stop;
 
-    /*
-     * We run the copy loop immediately upon entry, in case there are
-     * unarchived files left over from a previous database run (or maybe the
-     * archiver died unexpectedly).  After that we wait for a signal or
-     * timeout before doing more.
-     */
-    wakened = true;
-
     /*
      * There shouldn't be anything for the archiver to do except to wait for a
      * signal ... however, the archiver exists to protect our data, so she
@@ -328,12 +236,8 @@ pgarch_MainLoop(void)
         }
 
         /* Do what we're here for */
-        if (wakened || time_to_stop)
-        {
-            wakened = false;
-            pgarch_ArchiverCopyLoop();
-            last_copy_time = time(NULL);
-        }
+        pgarch_ArchiverCopyLoop();
+        last_copy_time = time(NULL);
 
         /*
          * Sleep until a signal is received, or until a poll is forced by
@@ -354,13 +258,9 @@ pgarch_MainLoop(void)
                                WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                                timeout * 1000L,
                                WAIT_EVENT_ARCHIVER_MAIN);
-                if (rc & WL_TIMEOUT)
-                    wakened = true;
                 if (rc & WL_POSTMASTER_DEATH)
                     time_to_stop = true;
             }
-            else
-                wakened = true;
         }
 
         /*
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 959e3b8873..b811c961a6 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -555,6 +555,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1800,7 +1801,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -3054,7 +3055,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3189,20 +3190,16 @@ reaper(SIGNAL_ARGS)
         }
 
         /*
-         * Was it the archiver?  If so, just try to start a new one; no need
-         * to force reset of the rest of the system.  (If fail, we'll try
-         * again in future cycles of the main loop.).  Unless we were waiting
-         * for it to shut down; don't restart it in that case, and
-         * PostmasterStateMachine() will advance to the next shutdown step.
+         * Was it the archiver?  Normal exit can be ignored; we'll start a new
+         * one at the next iteration of the postmaster's main loop, if
+         * necessary. Any other exit condition is treated as a crash.
          */
         if (pid == PgArchPID)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3450,7 +3447,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3655,6 +3652,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3951,6 +3960,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5230,7 +5240,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5275,16 +5285,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (StartWorkerNeeded || HaveCrashedWorker)
         maybe_start_bgworkers();
 
-    if (CheckPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER) &&
-        PgArchPID != 0)
-    {
-        /*
-         * Send SIGUSR1 to archiver process, to wake it up and begin archiving
-         * next WAL file.
-         */
-        signal_child(PgArchPID, SIGUSR1);
-    }
-
     /* Tell syslogger to rotate logfile if requested */
     if (SysLoggerPID != 0)
     {
@@ -5526,6 +5526,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 88566bd9fa..746bed773e 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -181,6 +181,7 @@ InitProcGlobal(void)
     ProcGlobal->startupBufferPinWaitBufId = -1;
     ProcGlobal->walwriterLatch = NULL;
     ProcGlobal->checkpointerLatch = NULL;
+    ProcGlobal->archiverLatch = NULL;
     pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
     pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e71..65443063e8 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -339,6 +339,9 @@ extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern void XLogArchiveWakeupStart(void);
+extern void XLogArchiveWakeupEnd(void);
+extern void XLogArchiveWakeup(void);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool PromoteIsTriggered(void);
diff --git a/src/include/access/xlogarchive.h b/src/include/access/xlogarchive.h
index 1c67de2ede..54ce0b97d7 100644
--- a/src/include/access/xlogarchive.h
+++ b/src/include/access/xlogarchive.h
@@ -25,6 +25,7 @@ extern void ExecuteRecoveryCommand(const char *command, const char *commandName,
 extern void KeepFileRestoredFromArchive(const char *path, const char *xlogfname);
 extern void XLogArchiveNotify(const char *xlog);
 extern void XLogArchiveNotifySeg(XLogSegNo segno);
+extern void XLogArchiveWakeup(void);
 extern void XLogArchiveForceDone(const char *xlog);
 extern bool XLogArchiveCheckDone(const char *xlog);
 extern bool XLogArchiveIsBusy(const char *xlog);
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 72e3352398..bba002ad24 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -418,6 +418,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -430,6 +431,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index b3200874ca..e3ffc63f14 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 56c5ec4481..c691acf8cd 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -34,7 +34,6 @@ typedef enum
 {
     PMSIGNAL_RECOVERY_STARTED,    /* recovery has started */
     PMSIGNAL_BEGIN_HOT_STANDBY, /* begin Hot Standby */
-    PMSIGNAL_WAKEN_ARCHIVER,    /* send a NOTIFY signal to xlog archiver */
     PMSIGNAL_ROTATE_LOGFILE,    /* send SIGUSR1 to syslogger to rotate logfile */
     PMSIGNAL_START_AUTOVAC_LAUNCHER,    /* start an autovacuum launcher */
     PMSIGNAL_START_AUTOVAC_WORKER,    /* start an autovacuum worker */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 9c9a50ae45..de20520b8c 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -345,6 +345,9 @@ typedef struct PROC_HDR
     int            startupProcPid;
     /* Buffer id of the buffer that Startup process waits for pin on, or -1 */
     int            startupBufferPinWaitBufId;
+    /* Archiver process's latch */
+    Latch       *archiverLatch;
+    /* Current shared estimate of appropriate spins_per_delay value */
 } PROC_HDR;
 
 extern PGDLLIMPORT PROC_HDR *ProcGlobal;
-- 
2.18.4

From caaa1f49820940b6e38cc76458777a1b37acebac Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:14 +0900
Subject: [PATCH v37 4/6] Shared-memory based stats collector

Previously, activity statistics were collected via sockets and shared
among backends through files written periodically. Such files can reach
tens of megabytes and are created as often as once per second; the stats
collector serializes that much data and every backend periodically
de-serializes it. To avoid that cost, this patch places activity
statistics data in shared memory. Each backend accumulates statistics
locally and then tries to move them into the shared statistics at
transaction end, but at intervals no shorter than 10 seconds. Until
60 seconds have elapsed since the last flush to shared stats, a lock
failure postpones the flush, to alleviate lock contention that would
slow down transactions. After that, the flush waits for locks so that
the shared statistics do not get stale.
---
 src/backend/access/heap/heapam_handler.c      |    4 +-
 src/backend/access/heap/vacuumlazy.c          |    2 +-
 src/backend/access/transam/xlog.c             |    4 +-
 src/backend/catalog/index.c                   |   24 +-
 src/backend/commands/analyze.c                |    8 +-
 src/backend/commands/dbcommands.c             |    2 +-
 src/backend/commands/matview.c                |    8 +-
 src/backend/commands/vacuum.c                 |    4 +-
 src/backend/postmaster/autovacuum.c           |   74 +-
 src/backend/postmaster/bgwriter.c             |    4 +-
 src/backend/postmaster/checkpointer.c         |   24 +-
 src/backend/postmaster/interrupt.c            |    5 +-
 src/backend/postmaster/pgarch.c               |   12 +-
 src/backend/postmaster/pgstat.c               | 5106 +++++++----------
 src/backend/postmaster/postmaster.c           |   88 +-
 src/backend/replication/basebackup.c          |    4 +-
 src/backend/storage/buffer/bufmgr.c           |    8 +-
 src/backend/storage/ipc/ipci.c                |    2 +
 src/backend/storage/lmgr/lwlock.c             |    4 +-
 src/backend/storage/lmgr/lwlocknames.txt      |    1 +
 src/backend/storage/smgr/smgr.c               |    4 +-
 src/backend/tcop/postgres.c                   |   37 +-
 src/backend/utils/adt/pgstatfuncs.c           |   59 +-
 src/backend/utils/init/globals.c              |    1 +
 src/backend/utils/init/miscinit.c             |    3 -
 src/backend/utils/init/postinit.c             |   11 +
 src/backend/utils/misc/guc.c                  |   14 +-
 src/backend/utils/misc/postgresql.conf.sample |    2 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |    2 +-
 src/include/miscadmin.h                       |    3 +-
 src/include/pgstat.h                          |  599 +-
 src/include/storage/lwlock.h                  |    1 +
 src/include/utils/guc_tables.h                |    2 +-
 src/include/utils/timeout.h                   |    1 +
 34 files changed, 2232 insertions(+), 3895 deletions(-)
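
The flush timing described in the commit message can be summarized as a three-way decision. The sketch below is an illustration only: FlushAction, choose_flush_action() and the two constants are hypothetical names, and how the exact boundaries are treated (strictly less than versus less than or equal) is an assumption, since the message gives only the 10 second and 60 second figures.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the values come from the commit message. */
#define TOY_MIN_FLUSH_INTERVAL_MS   10000   /* never flush more often */
#define TOY_MAX_NOWAIT_INTERVAL_MS  60000   /* after this, wait for locks */

typedef enum FlushAction
{
    FLUSH_SKIP,         /* too soon since the last flush, keep accumulating */
    FLUSH_TRY_NOWAIT,   /* try to flush, postpone if a lock is busy */
    FLUSH_WAIT          /* flush now, waiting for locks if necessary */
} FlushAction;

/*
 * Decide what to do at transaction end, given the current time and the time
 * of the last successful flush (both in milliseconds).
 */
static FlushAction
choose_flush_action(int64_t now_ms, int64_t last_flush_ms)
{
    int64_t elapsed = now_ms - last_flush_ms;

    assert(elapsed >= 0);

    if (elapsed < TOY_MIN_FLUSH_INTERVAL_MS)
        return FLUSH_SKIP;
    if (elapsed < TOY_MAX_NOWAIT_INTERVAL_MS)
        return FLUSH_TRY_NOWAIT;
    return FLUSH_WAIT;
}
```

In this model a backend that keeps losing the lock race stays in the FLUSH_TRY_NOWAIT band until 60 seconds have passed, after which it blocks once so that its numbers reach the shared statistics.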

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index dcaea7135f..49df584a9e 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1061,8 +1061,8 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
                  * our own.  In this case we should count and sample the row,
                  * to accommodate users who load a table and analyze it in one
                  * transaction.  (pgstat_report_analyze has to adjust the
-                 * numbers we send to the stats collector to make this come
-                 * out right.)
+                 * numbers we report to the activity stats facility to make
+                 * this come out right.)
                  */
                 if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(targtuple->t_data)))
                 {
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 4f2f38168d..3cb6e20ed5 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -599,7 +599,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
                         new_min_multi,
                         false);
 
-    /* report results to the stats collector, too */
+    /* report results to the activity stats facility, too */
     pgstat_report_vacuum(RelationGetRelid(onerel),
                          onerel->rd_rel->relisshared,
                          Max(new_live_tuples, 0),
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 61754312e2..988603c6bb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8581,9 +8581,9 @@ LogCheckpointEnd(bool restartpoint)
                         &sync_secs, &sync_usecs);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time +=
+    BgWriterStats.checkpoint_write_time +=
         write_secs * 1000 + write_usecs / 1000;
-    BgWriterStats.m_checkpoint_sync_time +=
+    BgWriterStats.checkpoint_sync_time +=
         sync_secs * 1000 + sync_usecs / 1000;
 
     /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 0974f3e23a..9507fb8210 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1688,28 +1688,10 @@ index_concurrently_swap(Oid newIndexId, Oid oldIndexId, const char *oldName)
 
     /*
      * Copy over statistics from old to new index
+     * The data will be flushed by the next pgstat_report_stat()
+     * call.
      */
-    {
-        PgStat_StatTabEntry *tabentry;
-
-        tabentry = pgstat_fetch_stat_tabentry(oldIndexId);
-        if (tabentry)
-        {
-            if (newClassRel->pgstat_info)
-            {
-                newClassRel->pgstat_info->t_counts.t_numscans = tabentry->numscans;
-                newClassRel->pgstat_info->t_counts.t_tuples_returned = tabentry->tuples_returned;
-                newClassRel->pgstat_info->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_hit = tabentry->blocks_hit;
-
-                /*
-                 * The data will be sent by the next pgstat_report_stat()
-                 * call.
-                 */
-            }
-        }
-    }
+    pgstat_copy_index_counters(oldIndexId, newClassRel->pgstat_info);
 
     /* Close relations */
     table_close(pg_class, RowExclusiveLock);
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8af12b5c6b..a7e787d9d1 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -644,10 +644,10 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
     }
 
     /*
-     * Report ANALYZE to the stats collector, too.  However, if doing
-     * inherited stats we shouldn't report, because the stats collector only
-     * tracks per-table stats.  Reset the changes_since_analyze counter only
-     * if we analyzed all columns; otherwise, there is still work for
+     * Report ANALYZE to the activity stats facility, too.  However, if doing
+     * inherited stats we shouldn't report, because the activity stats facility
+     * only tracks per-table stats.  Reset the changes_since_analyze counter
+     * only if we analyzed all columns; otherwise, there is still work for
      * auto-analyze to do.
      */
     if (!inh)
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index f27c3fe8c1..de0c749570 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -971,7 +971,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
     DropDatabaseBuffers(db_id);
 
     /*
-     * Tell the stats collector to forget it immediately, too.
+     * Tell the activity stats facility to forget it immediately, too.
      */
     pgstat_drop_database(db_id);
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index f80a9e96a9..e7ccc6eba7 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -336,10 +336,10 @@ ExecRefreshMatView(RefreshMatViewStmt *stmt, const char *queryString,
         refresh_by_heap_swap(matviewOid, OIDNewHeap, relpersistence);
 
         /*
-         * Inform stats collector about our activity: basically, we truncated
-         * the matview and inserted some new data.  (The concurrent code path
-         * above doesn't need to worry about this because the inserts and
-         * deletes it issues get counted by lower-level code.)
+         * Inform activity stats facility about our activity: basically, we
+         * truncated the matview and inserted some new data.  (The concurrent
+         * code path above doesn't need to worry about this because the inserts
+         * and deletes it issues get counted by lower-level code.)
          */
         pgstat_count_truncate(matviewRel);
         if (!stmt->skipData)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ddeec870d8..c5477ff567 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -319,8 +319,8 @@ vacuum(List *relations, VacuumParams *params,
                  errmsg("VACUUM option DISABLE_PAGE_SKIPPING cannot be used with FULL")));
 
     /*
-     * Send info about dead objects to the statistics collector, unless we are
-     * in autovacuum --- autovacuum.c does this for itself.
+     * Send info about dead objects to the activity statistics facility, unless
+     * we are in autovacuum --- autovacuum.c does this for itself.
      */
     if ((params->options & VACOPT_VACUUM) && !IsAutoVacuumWorkerProcess())
         pgstat_vacuum_stat();
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 2cef56f115..9a41474e53 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -338,9 +338,6 @@ static void autovacuum_do_vac_analyze(autovac_table *tab,
                                       BufferAccessStrategy bstrategy);
 static AutoVacOpts *extract_autovac_opts(HeapTuple tup,
                                          TupleDesc pg_class_desc);
-static PgStat_StatTabEntry *get_pgstat_tabentry_relid(Oid relid, bool isshared,
-                                                      PgStat_StatDBEntry *shared,
-                                                      PgStat_StatDBEntry *dbentry);
 static void perform_work_item(AutoVacuumWorkItem *workitem);
 static void autovac_report_activity(autovac_table *tab);
 static void autovac_report_workitem(AutoVacuumWorkItem *workitem,
@@ -1680,12 +1677,12 @@ AutoVacWorkerMain(int argc, char *argv[])
         char        dbname[NAMEDATALEN];
 
         /*
-         * Report autovac startup to the stats collector.  We deliberately do
-         * this before InitPostgres, so that the last_autovac_time will get
-         * updated even if the connection attempt fails.  This is to prevent
-         * autovac from getting "stuck" repeatedly selecting an unopenable
-         * database, rather than making any progress on stuff it can connect
-         * to.
+         * Report autovac startup to the activity stats facility.  We
+         * deliberately do this before InitPostgres, so that the
+         * last_autovac_time will get updated even if the connection attempt
+         * fails.  This is to prevent autovac from getting "stuck" repeatedly
+         * selecting an unopenable database, rather than making any progress on
+         * stuff it can connect to.
          */
         pgstat_report_autovac(dbid);
 
@@ -1957,8 +1954,6 @@ do_autovacuum(void)
     HASHCTL        ctl;
     HTAB       *table_toast_map;
     ListCell   *volatile cell;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     BufferAccessStrategy bstrategy;
     ScanKeyData key;
     TupleDesc    pg_class_desc;
@@ -1977,17 +1972,11 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /*
-     * may be NULL if we couldn't find an entry (only happens if we are
-     * forcing a vacuum for anti-wrap purposes).
-     */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
 
     /*
-     * Clean up any dead statistics collector entries for this DB. We always
+     * Clean up any dead activity statistics entries for this DB. We always
      * want to do this exactly once per DB-processing cycle, even if we find
      * nothing worth vacuuming in the database.
      */
@@ -2030,9 +2019,6 @@ do_autovacuum(void)
     /* StartTransactionCommand changed elsewhere */
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-
     classRel = table_open(RelationRelationId, AccessShareLock);
 
     /* create a copy so we can use it after closing pg_class */
@@ -2111,8 +2097,8 @@ do_autovacuum(void)
 
         /* Fetch reloptions and the pgstat entry for this table */
         relopts = extract_autovac_opts(tuple, pg_class_desc);
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                       relid);
 
         /* Check if it needs vacuum or analyze */
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
@@ -2195,8 +2181,8 @@ do_autovacuum(void)
         }
 
         /* Fetch the pgstat entry for this table */
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                       relid);
 
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
@@ -2755,29 +2741,6 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
     return av;
 }
 
-/*
- * get_pgstat_tabentry_relid
- *
- * Fetch the pgstat entry of a table, either local to a database or shared.
- */
-static PgStat_StatTabEntry *
-get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
-                          PgStat_StatDBEntry *dbentry)
-{
-    PgStat_StatTabEntry *tabentry = NULL;
-
-    if (isshared)
-    {
-        if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
-    }
-    else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
-
-    return tabentry;
-}
 
 /*
  * table_recheck_autovac
@@ -2798,17 +2761,12 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     bool        doanalyze;
     autovac_table *tab = NULL;
     PgStat_StatTabEntry *tabentry;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     bool        wraparound;
     AutoVacOpts *avopts;
 
     /* use fresh stats */
     autovac_refresh_stats();
 
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* fetch the relation's relcache entry */
     classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
     if (!HeapTupleIsValid(classTup))
@@ -2832,8 +2790,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     }
 
     /* fetch the pgstat table entry */
-    tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                         shared, dbentry);
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+                                                   relid);
 
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
@@ -2955,7 +2913,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
  *
  * For analyze, the analysis done is that the number of tuples inserted,
  * deleted and updated since the last analyze exceeds a threshold calculated
- * in the same fashion as above.  Note that the collector actually stores
+ * in the same fashion as above.  Note that the activity statistics store
  * the number of tuples (both live and dead) that there were as of the last
  * analyze.  This is asymmetric to the VACUUM case.
  *
@@ -2965,8 +2923,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
  *
  * A table whose autovacuum_enabled option is false is
  * automatically skipped (unless we have to vacuum it due to freeze_max_age).
- * Thus autovacuum can be disabled for specific tables. Also, when the stats
- * collector does not have data about a table, it will be skipped.
+ * Thus autovacuum can be disabled for specific tables. Also, when the activity
+ * statistics do not have data about a table, it will be skipped.
  *
  * A table whose vac_base_thresh value is < 0 takes the base value from the
  * autovacuum_vacuum_threshold GUC variable.  Similarly, a vac_scale_factor
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index a7afa758b6..b075e85839 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -244,9 +244,9 @@ BackgroundWriterMain(void)
         can_hibernate = BgBufferSync(&wb_context);
 
         /*
-         * Send off activity statistics to the stats collector
+         * Send off activity statistics to the activity stats facility
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 3e7dcd4f76..7c9765a064 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -360,7 +360,7 @@ CheckpointerMain(void)
         if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
         {
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            BgWriterStats.requested_checkpoints++;
         }
 
         /*
@@ -374,7 +374,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                BgWriterStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -496,13 +496,13 @@ CheckpointerMain(void)
         CheckArchiveTimeout();
 
         /*
-         * Send off activity statistics to the stats collector.  (The reason
-         * why we re-use bgwriter-related code for this is that the bgwriter
-         * and checkpointer used to be just one process.  It's probably not
-         * worth the trouble to split the stats support into two independent
-         * stats message types.)
+         * Send off activity statistics to the activity stats facility.  (The
+         * reason why we re-use bgwriter-related code for this is that the
+         * bgwriter and checkpointer used to be just one process.  It's
+         * probably not worth the trouble to split the stats support into two
+         * independent stats message types.)
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         /*
          * If any checkpoint flags have been set, redo the loop to handle the
@@ -708,9 +708,9 @@ CheckpointWriteDelay(int flags, double progress)
         CheckArchiveTimeout();
 
         /*
-         * Report interim activity statistics to the stats collector.
+         * Report interim activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1255,8 +1255,8 @@ AbsorbSyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    BgWriterStats.buf_written_backend += CheckpointerShmem->num_backend_writes;
+    BgWriterStats.buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/interrupt.c b/src/backend/postmaster/interrupt.c
index 3d02439b79..53844eb8bb 100644
--- a/src/backend/postmaster/interrupt.c
+++ b/src/backend/postmaster/interrupt.c
@@ -93,9 +93,8 @@ SignalHandlerForCrashExit(SIGNAL_ARGS)
  * shut down and exit.
  *
  * Typically, this handler would be used for SIGTERM, but some procesess use
- * other signals. In particular, the checkpointer exits on SIGUSR2, the
- * stats collector on SIGQUIT, and the WAL writer exits on either SIGINT
- * or SIGTERM.
+ * other signals. In particular, the checkpointer exits on SIGUSR2, and the WAL
+ * writer exits on either SIGINT or SIGTERM.
  *
  * ShutdownRequestPending should be checked at a convenient place within the
  * main loop, or else the main loop should call HandleMainLoopInterrupts.
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index e3a520def9..6d88c65d5f 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -372,20 +372,20 @@ pgarch_ArchiverCopyLoop(void)
                 pgarch_archiveDone(xlog);
 
                 /*
-                 * Tell the collector about the WAL file that we successfully
-                 * archived
+                 * Tell the activity statistics facility about the WAL file
+                 * that we successfully archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_report_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
             else
             {
                 /*
-                 * Tell the collector about the WAL file that we failed to
-                 * archive
+                 * Tell the activity statistics facility about the WAL file
+                 * that we failed to archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_report_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index e6be2b7836..1cd4cb20b6 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,22 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Activity Statistics facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects activity statistics, e.g. per-table access statistics, of
+ *  all backends in shared memory. The activity numbers are first stored
+ *  locally in each process, then flushed to shared memory at commit
+ *  time or by idle-timeout.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ *  To avoid congestion on the shared memory, shared stats are updated no more
+ *  often than once per PGSTAT_MIN_INTERVAL (10000ms). If some local numbers
+ *  remain unflushed because of a lock failure, retry at intervals that start
+ *  at PGSTAT_RETRY_MIN_INTERVAL (1000ms) and double at every retry. Finally,
+ *  force an update PGSTAT_MAX_INTERVAL (600000ms) after the first attempt.
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the activity statistics facility creates the
+ *  area and loads the stored stats file, if any. At shutdown, the last process
+ *  writes the shared stats to the file and then destroys the area before exit.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -19,18 +26,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -40,32 +35,25 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
+#include "common/hashfn.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/fork_process.h"
 #include "postmaster/interrupt.h"
 #include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
+#include "storage/condition_variable.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
 #include "utils/timestamp.h"
 
@@ -73,35 +61,20 @@
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
+#define PGSTAT_MIN_INTERVAL            10000    /* Minimum interval of stats data
+                                             * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_RETRY_MIN_INTERVAL    1000 /* Initial retry interval after
+                                          * PGSTAT_MIN_INTERVAL */
 
+#define PGSTAT_MAX_INTERVAL            600000    /* Longest interval of stats data
+                                             * updates */
 
 /* ----------
- * The initial size hints for the hash tables used in the collector.
+ * The initial size hints for the hash tables used in the activity statistics.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
-#define PGSTAT_TAB_HASH_SIZE    512
+#define PGSTAT_TABLE_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
 
@@ -116,7 +89,6 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
-
 /* ----------
  * GUC parameters
  * ----------
@@ -131,16 +103,11 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used; to be removed together with the GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
-/*
- * BgWriter global statistics counters (unused in other processes).
- * Stored directly in a stats message structure so it can be sent
- * without needing to copy things around.  We assume this inits to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
-
 /*
  * List of SLRU names that we keep stats for.  There is no central registry of
  * SLRUs, so we use this fixed list instead.  The "other" entry is used for
@@ -159,73 +126,170 @@ static const char *const slru_names[] = {
 
 #define SLRU_NUM_ELEMENTS    lengthof(slru_names)
 
+/* struct for shared SLRU stats */
+typedef struct PgStatSharedSLRUStats
+{
+    PgStat_StatSLRUEntry    entry[SLRU_NUM_ELEMENTS];
+    LWLock                    lock;
+    pg_atomic_uint32        changecount;
+} PgStatSharedSLRUStats;
+
+typedef struct StatsShmemStruct
+{
+    dsa_handle    stats_dsa_handle;    /* handle for stats data area */
+    dshash_table_handle hash_handle;    /* shared dbstat hash */
+    int            refcount;            /* # of processes attached to the
+                                     * shared stats memory */
+
+    /* global stats */
+    PgStat_GlobalStats        global_stats;
+    PgStat_ArchiverStats    archiver_stats;
+    PgStatSharedSLRUStats    slru_stats;
+
+    bool                    attach_holdoff;
+    ConditionVariable        holdoff_cv;
+
+    pg_atomic_uint64 del_ent_count; /* # of entries deleted; not protected by
+                                     * StatsLock */
+}            StatsShmemStruct;
+
+/* BgWriter global statistics counters */
+PgStat_BgWriter BgWriterStats = {0};
+
+
 /*
- * SLRU statistics counts waiting to be sent to the collector.  These are
- * stored directly in stats message format so they can be sent without needing
- * to copy things around.  We assume this variable inits to zeroes.  Entries
- * are one-to-one with slru_names[].
+ * SLRU statistics counts waiting to be written to the shared activity
+ * statistics.  We assume this variable inits to zeroes.  Entries are
+ * one-to-one with slru_names[].
+ * Changes of SLRU counters are reported within critical sections so we use
+ * static memory in order to avoid memory allocation.
  */
-static PgStat_MsgSLRU SLRUStats[SLRU_NUM_ELEMENTS];
+static PgStat_StatSLRUEntry local_SLRUStats[SLRU_NUM_ELEMENTS];
+static bool     have_slrustats = false;
 
 /* ----------
  * Local data
  * ----------
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
-
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
+/* backend-lifetime storage */
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
 
 /*
- * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
+ * Enums and types to define shared statistics structure.
+ *
+ * The statistics entry for each object is stored in individually
+ * DSA-allocated memory. Every entry is pointed to from the dshash
+ * pgStatSharedHash via a dsa_pointer. This keeps object-stats entries from
+ * being moved by dshash resizing, and lets the dshash lock be released sooner
+ * on stats updates. It also reduces interference among write locks on stat
+ * entries by not relying on the partition locks of the dshash.
+ * PgStatLocalHashEntry is the local-stats equivalent of PgStatHashEntry.
+ *
+ * Each stat entry type has a fixed member PgStat_StatEntryHeader as the first
+ * element.
+ *
+ * Shared stats are stored as:
  *
- * NOTE: once allocated, TabStatusArray structures are never moved or deleted
- * for the life of the backend.  Also, we zero out the t_id fields of the
- * contained PgStat_TableStatus structs whenever they are not actively in use.
- * This allows relcache pgstat_info pointers to be treated as long-lived data,
- * avoiding repeated searches in pgstat_initstats() when a relation is
- * repeatedly opened during a transaction.
+ * dshash pgStatSharedHash
+ *    -> PgStatHashEntry                    (dshash entry)
+ *      (dsa_pointer)-> PgStat_Stat*Entry    (dsa memory block)
+ *
+ * Local stats are stored as:
+ *
+ * dynahash pgStatLocalHash
+ *    -> PgStatLocalHashEntry                 (dynahash entry)
+ *      (direct pointer)-> PgStat_Stat*Entry (palloc'ed memory)
  */
-#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
 
-typedef struct TabStatusArray
+/* The types of statistics entries. */
+typedef enum PgStatTypes
 {
-    struct TabStatusArray *tsa_next;    /* link to next array, if any */
-    int            tsa_used;        /* # entries currently used */
-    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
-} TabStatusArray;
-
-static TabStatusArray *pgStatTabList = NULL;
+    PGSTAT_TYPE_ALL,            /* Not a type, for the parameters of
+                                 * collect_stat_entries */
+    PGSTAT_TYPE_DB,                /* database-wide statistics */
+    PGSTAT_TYPE_TABLE,            /* per-table statistics */
+    PGSTAT_TYPE_FUNCTION        /* per-function statistics */
+}            PgStatTypes;
 
 /*
- * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
+ * entry body size lookup table of shared statistics entries corresponding to
+ * PgStatTypes
  */
-typedef struct TabStatHashEntry
+static const size_t pgstat_sharedentsize[] =
 {
-    Oid            t_id;
-    PgStat_TableStatus *tsa_entry;
-} TabStatHashEntry;
+    0,                            /* PGSTAT_TYPE_ALL: not an entry */
+    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_StatTabEntry),/* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_StatFuncEntry)/* PGSTAT_TYPE_FUNCTION */
+};
 
-/*
- * Hash table for O(1) t_id -> tsa_entry lookup
- */
-static HTAB *pgStatTabHash = NULL;
+/* Ditto for local statistics entries */
+static const size_t pgstat_localentsize[] =
+{
+    0,                            /* PGSTAT_TYPE_ALL: not an entry */
+    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_TableStatus), /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_BackendFunctionEntry) /* PGSTAT_TYPE_FUNCTION */
+};
 
-/*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
- */
-static HTAB *pgStatFunctions = NULL;
+/* struct for shared statistics hash entry key. */
+typedef struct PgStatHashEntryKey
+{
+    PgStatTypes type;            /* statistics entry type */
+    Oid            databaseid;        /* database ID. InvalidOid for shared objects. */
+    Oid            objectid;        /* object OID */
+}            PgStatHashEntryKey;
 
 /*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
+ * Shared statistics hash entry.  Stats numbers that are waiting to be flushed
+ * out to shared stats are held separately in pgStatLocalHash.
  */
-static bool have_function_stats = false;
+typedef struct PgStatHashEntry
+{
+    PgStatHashEntryKey key;        /* hash key */
+    dsa_pointer body;            /* pointer to shared stats in
+                                 * PgStat_StatEntryHeader */
+}            PgStatHashEntry;
+
+/* struct for shared statistics local hash entry. */
+typedef struct PgStatLocalHashEntry
+{
+    PgStatHashEntryKey key;        /* hash key */
+    char               status;    /* for simplehash use */
+    PgStat_StatEntryHeader *body; /* pointer to stats body in local heap */
+}            PgStatLocalHashEntry;
+
+/* parameters for the shared hash */
+static const dshash_parameters dsh_rootparams = {
+    sizeof(PgStatHashEntryKey),
+    sizeof(PgStatHashEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+
+/* define hashtable for dshash caching */
+#define SH_PREFIX pgstat_cache
+#define SH_ELEMENT_TYPE PgStatLocalHashEntry
+#define SH_KEY_TYPE PgStatHashEntryKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) hash_bytes((unsigned char *)&key, sizeof(PgStatHashEntryKey))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(PgStatHashEntryKey)) == 0)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/* The shared hash to index activity stats entries. */
+static dshash_table *pgStatSharedHash = NULL;
+
+/* Local stats numbers are stored here */
+static HTAB *pgStatLocalHash = NULL;
+
+#define HAVE_ANY_PENDING_STATS()                        \
+    (pgStatLocalHash != NULL ||                            \
+     pgStatXactCommit != 0 || pgStatXactRollback != 0)
 
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
@@ -262,11 +326,10 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
+static pgstat_cache_hash *pgStatEntHash = NULL;
+static int     pgStatEntHashAge = 0;
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -275,20 +338,13 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Cluster wide statistics.
+ *
+ * Contains statistics that are collected neither per database nor per
+ * table.  shared_* points to shared memory and snapshot_* are backend
+ * snapshots.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
+static pg_atomic_uint32        global_changecount;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -297,526 +353,296 @@ static List *pending_write_requests = NIL;
  */
 static instr_time total_func_time;
 
-
 /* ----------
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgstat_beshutdown_hook(int code, Datum arg);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                                                 Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStat_StatEntryHeader * get_stat_entry(PgStatTypes type, Oid dbid,
+                                               Oid objid, bool nowait,
+                                               bool create, bool *found);
+static PgStatLocalHashEntry *collect_stat_entries(PgStatTypes type, Oid dbid);
+static void create_missing_dbentries(void);
+static void pgstat_write_database_stats(Oid dbid);
+static void pgstat_read_db_statsfile(Oid dbid);
+
+static bool flush_tabstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_funcstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_dbstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_slrustat(bool nowait);
+
+static bool delete_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                              bool nowait);
+static PgStat_StatEntryHeader * get_local_stat_entry(PgStatTypes type, Oid dbid,
+                                                     Oid objid, bool create,
+                                                     bool *found);
+static PgStat_StatDBEntry *get_local_dbstat_entry(Oid dbid);
+static PgStat_TableStatus *get_local_tabstat_entry(Oid rel_id, bool isshared);
+
+static PgStat_SubXactStatus *get_tabstat_stack_level(int nest_level);
+static void add_tabstat_xact_level(PgStat_TableStatus *pgstat_info,
+                                   int nest_level);
+
 static void pgstat_read_current_status(void);
-
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
-
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
-static void pgstat_send_slru(void);
-static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
-
-static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
-
-static void pgstat_setup_memcxt(void);
-
 static const char *pgstat_get_wait_activity(WaitEventActivity w);
 static const char *pgstat_get_wait_client(WaitEventClient w);
 static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute shared memory space needed for activity statistics
+ */
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+/*
+ * StatsShmemInit - initialize during shared-memory creation
  */
 void
-pgstat_init(void)
+StatsShmemInit(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
+    bool        found;
 
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
 
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
+    if (!IsUnderPostmaster)
     {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+        ConditionVariableInit(&StatsShmem->holdoff_cv);
+        pg_atomic_init_u64(&StatsShmem->del_ent_count, 0);
+        pg_atomic_init_u32(&global_changecount, 0);
     }
+}
 
-    /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
-     */
-    for (addr = addrs; addr; addr = addr->ai_next)
+/* ----------
+ * pgstat_setup_memcxt() -
+ *
+ *    Create pgStatLocalContext if not already done.
+ * ----------
+ */
+static void
+pgstat_setup_memcxt(void)
+{
+    if (!pgStatLocalContext)
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
+}
+
+
+
+/* ----------
+ * allow_next_attacher() -
+ *
+ *    Allow another process to attach to the shared stats area.
+ * ----------
+ */
+static void
+allow_next_attacher(void)
+{
+    bool triggerd = false;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (StatsShmem->attach_holdoff)
     {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
-
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
-
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
-
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
+        StatsShmem->attach_holdoff = false;
+        triggerd = true;
     }
+    LWLockRelease(StatsLock);
 
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
+    if (triggerd)
+        ConditionVariableBroadcast(&StatsShmem->holdoff_cv);
+}
+
+/* ----------
+ * attach_shared_stats() -
+ *
+ *    Attach to the shared stats memory, creating it if needed. If we are the
+ *    first process to use the activity stats system, read saved files if any.
+ * ---------
+ */
+static void
+attach_shared_stats(void)
+{
+    MemoryContext oldcontext;
 
     /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
+     * Don't use dsm when not under postmaster, or when not tracking counts.
      */
-    if (!pg_set_noblock(pgStatSock))
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    pgstat_setup_memcxt();
+
+    if (area)
+        return;
+
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+    ConditionVariablePrepareToSleep(&StatsShmem->holdoff_cv);
+    for (;;)
     {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
+        bool hold_off;
+
+        LWLockAcquire(StatsLock, LW_SHARED);
+        hold_off = StatsShmem->attach_holdoff;
+        LWLockRelease(StatsLock);
+
+        if (!hold_off)
+            break;
+
+        ConditionVariableTimedSleep(&StatsShmem->holdoff_cv, 10,
+                                    WAIT_EVENT_READING_STATS_FILE);
     }
+    ConditionVariableCancelSleep();
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
     /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
+     * The last process to detach is responsible for writing out the stats
+     * files at exit.  Maintain a refcount so that an exiting process can tell
+     * whether it is the last one.
      */
+    if (StatsShmem->refcount > 0)
+        StatsShmem->refcount++;
+    else
     {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
-    }
-
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
+        /* We're the first process to attach to the shared stats memory */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
 
-    /* Now that we have a long-lived socket, tell fd.c about it. */
-    ReserveExternalFD();
+        /* Initialize shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatSharedHash = dshash_create(area, &dsh_rootparams, 0);
 
-    return;
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->hash_handle = dshash_get_hash_table_handle(pgStatSharedHash);
+        LWLockInitialize(&StatsShmem->slru_stats.lock, LWTRANCHE_STATS);
+        pg_atomic_init_u32(&StatsShmem->slru_stats.changecount, 0);
 
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
+        /* Inhibit next attacher for a while */
+        StatsShmem->attach_holdoff = true;
 
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
+        StatsShmem->refcount = 1;
+    }
 
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
+    LWLockRelease(StatsLock);
 
     /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
+     * If we're the first process, load the file if any, then allow the next
+     * attachers. If not, attach to the existing area outside the StatsLock section.
      */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    if (area)
+    {
+        Assert(StatsShmem->attach_holdoff);
+        pgstat_read_statsfiles();
+        allow_next_attacher();
+    }
+    else
+    {
+        /* Attach shared area. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatSharedHash = dshash_attach(area, &dsh_rootparams,
+                                         StatsShmem->hash_handle, 0);
+
+    }
+
+    MemoryContextSwitchTo(oldcontext);
+
+    /* don't detach automatically */
+    dsa_pin_mapping(area);
 }
 
-/*
- * subroutine for pgstat_reset_all
+/* ----------
+ * detach_shared_stats() -
+ *
+ *    Detach shared stats. Write out to file if we're the last process and told
+ *    to do so.
+ * ----------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+detach_shared_stats(bool write_file)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    bool is_last_detacher = false;
+
+    /* return immediately if there is nothing to detach */
+    if (!area || !IsUnderPostmaster)
+        return;
+
+    /*
+     * If we are the last detacher, hold off the next attacher until we finish
+     * writing stats file.
+     */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (--StatsShmem->refcount == 0)
+    {
+        /* No one is using the area. */
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+        StatsShmem->attach_holdoff = true;
+        is_last_detacher = true;
+    }
+    LWLockRelease(StatsLock);
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
+    if (is_last_detacher)
     {
-        int            nchars;
-        Oid            tmp_oid;
-
-        /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
-         */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
-
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
-
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+        if (write_file)
+            pgstat_write_statsfiles();
+
+        /* allow the next attacher */
+        allow_next_attacher();
     }
-    FreeDir(dir);
+
+    /*
+     * Detach the area. It is destroyed automatically when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);
+
+    area = NULL;
+    pgStatSharedHash = NULL;
+    pgStatLocalHash = NULL;
 }
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
-}
+    /* standalone server doesn't use shared stats */
+    if (!IsUnderPostmaster)
+        return;
 
-#ifdef EXEC_BACKEND
+    /* we must have shared stats attached */
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-/*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
- */
-static pid_t
-pgstat_forkexec(void)
-{
-    char       *av[10];
-    int            ac = 0;
-
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
-
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
-
-    return postmaster_forkexec(ac, av);
-}
-#endif                            /* EXEC_BACKEND */
-
-
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgstat_start(void)
-{
-    time_t        curtime;
-    pid_t        pgStatPid;
-
-    /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
-     */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
-
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
+    /* Startup must be the only user of shared stats */
+    Assert(StatsShmem->refcount == 1);
 
     /*
-     * Okay, fork off the collector.
+     * We could directly remove files and recreate the shared memory area, but
+     * for simplicity we just detach and then attach again.
      */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgStatPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
-void
-allow_immediate_pgstat_restart(void)
-{
-    last_pgstat_start_time = 0;
+    detach_shared_stats(false);
+    attach_shared_stats();
 }
 
 /* ------------------------------------------------------------
@@ -824,147 +650,426 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *    Updates are applied no more often than once every PGSTAT_MIN_INTERVAL
+ *    milliseconds. They are also postponed on lock failure, provided that
+ *    force is false and no update has been pending for longer than
+ *    PGSTAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
+ *    subsequent calls of this function.
+ *
+ *    Returns the time in milliseconds until the next attempt to apply
+ *    updates, or 0 if there is nothing to do or no pending updates remain
+ *    after the flush.
+ *
+ *    Note that this is called only outside of a transaction, so it is fine to use
  *    transaction stop time as an approximation of current time.
- * ----------
+ *    ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz next_flush = 0;
+    static TimestampTz pending_since = 0;
+    static long retry_interval = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
+    bool        nowait = !force;    /* force is not referenced after this */
+    HASH_SEQ_STATUS scan;
+    PgStatLocalHashEntry *lent;
+    PgStatLocalHashEntry **dbentlist;
+    int            dbentlistlen = 8;
+    int            ndbentries = 0;
+    int            remains = 0;
     int            i;
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (area == NULL || !HAVE_ANY_PENDING_STATS())
+        return 0;
+
+    dbentlist = palloc(sizeof(PgStatLocalHashEntry *) * dbentlistlen);
 
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
     now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
 
-    /*
-     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
-     */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
-    pgStatTabHash = NULL;
+    if (nowait)
+    {
+        /*
+         * Don't flush stats too frequently.  Return the time to the next
+         * flush.
+         */
+        if (now < next_flush)
+        {
+            /* Remember when the updates first became pending. */
+            if (pending_since == 0)
+                pending_since = now;
+
+            return (next_flush - now) / 1000;
+        }
+
+        /* But, don't keep pending updates longer than PGSTAT_MAX_INTERVAL. */
+
+        if (pending_since > 0 &&
+            TimestampDifferenceExceeds(pending_since, now, PGSTAT_MAX_INTERVAL))
+            nowait = false;
+    }
 
     /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
+     * flush_tabstat adds some of the counters of flushed entries to the
+     * local database stats, so flush out the database stats afterwards.
      */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
-    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+    if (pgStatLocalHash)
     {
-        for (i = 0; i < tsa->tsa_used; i++)
+        /* Step 1: flush out everything other than database stats */
+        hash_seq_init(&scan, pgStatLocalHash);
+        while ((lent = (PgStatLocalHashEntry *) hash_seq_search(&scan)) != NULL)
         {
-            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
+            bool        remove = false;
 
-            /* Shouldn't have any pending transaction-dependent counts */
-            Assert(entry->trans == NULL);
+            switch (lent->key.type)
+            {
+                case PGSTAT_TYPE_DB:
+                    if (ndbentries >= dbentlistlen)
+                    {
+                        dbentlistlen *= 2;
+                        dbentlist = repalloc(dbentlist,
+                                             sizeof(PgStatLocalHashEntry *) *
+                                             dbentlistlen);
+                    }
+                    dbentlist[ndbentries++] = lent;
+                    break;
+                case PGSTAT_TYPE_TABLE:
+                    if (flush_tabstat(lent, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_FUNCTION:
+                    if (flush_funcstat(lent, nowait))
+                        remove = true;
+                    break;
+                default:
+                    Assert(false);
+            }
 
-            /*
-             * Ignore entries that didn't accumulate any actual counts, such
-             * as indexes that were opened by the planner but not used.
-             */
-            if (memcmp(&entry->t_counts, &all_zeroes,
-                       sizeof(PgStat_TableCounts)) == 0)
+            if (!remove)
+            {
+                remains++;
                 continue;
+            }
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* Remove the successfully flushed entry */
+            pfree(lent->body);
+            hash_search(pgStatLocalHash, &lent->key, HASH_REMOVE, NULL);
+        }
+
+        /* Step 2: flush out database stats */
+        for (i = 0; i < ndbentries; i++)
+        {
+            PgStatLocalHashEntry *lent = dbentlist[i];
+
+            if (flush_dbstat(lent, nowait))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                remains--;
+                /* Remove the successfully flushed entry */
+                pfree(lent->body);
+                hash_search(pgStatLocalHash, &lent->key, HASH_REMOVE, NULL);
             }
         }
-        /* zero out PgStat_TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
+        pfree(dbentlist);
+
+        if (remains <= 0)
+        {
+            hash_destroy(pgStatLocalHash);
+            pgStatLocalHash = NULL;
+        }
+    }
+
+    /* flush SLRU stats */
+    flush_slrustat(nowait);
+
+    /* Publish the last flush time */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (StatsShmem->global_stats.stats_timestamp < now)
+    {
+        uint32 assert_changecount PG_USED_FOR_ASSERTS_ONLY;
+
+        assert_changecount = pg_atomic_fetch_add_u32(&global_changecount, 1);
+        Assert((assert_changecount & 1) == 0);
+        StatsShmem->global_stats.stats_timestamp = now;
+        pg_atomic_add_fetch_u32(&global_changecount, 1);
+    }
+    LWLockRelease(StatsLock);
+
+    /*
+     * If we have pending local stats, let the caller know the retry interval.
+     */
+    if (HAVE_ANY_PENDING_STATS())
+    {
+        /* Remember when the updates first became pending */
+        if (pending_since == 0)
+            pending_since = now;
+
+        /* The interval is doubled at every retry. */
+        if (retry_interval == 0)
+            retry_interval = PGSTAT_RETRY_MIN_INTERVAL * 1000;
+        else
+            retry_interval = retry_interval * 2;
+
+        /*
+         * Determine the next retry interval so as not to get shorter than the
+         * previous interval.
+         */
+        if (!TimestampDifferenceExceeds(pending_since,
+                                        now + 2 * retry_interval,
+                                        PGSTAT_MAX_INTERVAL))
+            next_flush = now + retry_interval;
+        else
+        {
+            next_flush = pending_since + PGSTAT_MAX_INTERVAL * 1000;
+            retry_interval = next_flush - now;
+        }
+
+        return retry_interval / 1000;
+    }
+
+    /* Set the next time to update stats */
+    next_flush = now + PGSTAT_MIN_INTERVAL * 1000;
+    retry_interval = 0;
+    pending_since = 0;
+
+    return 0;
+}
+
+/*
+ * flush_tabstat - flush out a local table stats entry
+ *
+ * Some of the counters are copied to the local database stats entry after
+ * a successful flush.
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * it waits for the lock and always returns true.
+ *
+ * Returns true if the entry was successfully flushed out.
+ */
+static bool
+flush_tabstat(PgStatLocalHashEntry *ent, bool nowait)
+{
+    static const PgStat_TableCounts all_zeroes;
+    Oid                    dboid;            /* database OID of the table */
+    PgStat_TableStatus *lstats;            /* local stats entry  */
+    PgStat_StatTabEntry *shtabstats;    /* table entry of shared stats */
+    PgStat_StatDBEntry *ldbstats;        /* local database entry */
+    bool        found;
+
+    Assert(ent->key.type == PGSTAT_TYPE_TABLE);
+    lstats = (PgStat_TableStatus *) ent->body;
+    dboid = lstats->t_shared ? InvalidOid : MyDatabaseId;
+
+    /*
+     * Ignore entries that didn't accumulate any actual counts, such as
+     * indexes that were opened by the planner but not used.
+     */
+    if (memcmp(&lstats->t_counts, &all_zeroes,
+               sizeof(PgStat_TableCounts)) == 0)
+        return true;
+
+    /* find shared table stats entry corresponding to the local entry */
+    shtabstats = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, lstats->t_id, nowait, true,
+                       &found);
+
+    /* skip if dshash failed to acquire lock */
+    if (shtabstats == NULL)
+        return false;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shtabstats->header.lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shtabstats->header.lock, LW_EXCLUSIVE))
+        return false;
+
+    /* add the values to the shared entry. */
+    shtabstats->numscans += lstats->t_counts.t_numscans;
+    shtabstats->tuples_returned += lstats->t_counts.t_tuples_returned;
+    shtabstats->tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    shtabstats->tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    shtabstats->tuples_updated += lstats->t_counts.t_tuples_updated;
+    shtabstats->tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    shtabstats->tuples_hot_updated += lstats->t_counts.t_tuples_hot_updated;
+
+    /*
+     * If the table was truncated or vacuum/analyze has run, first reset the
+     * live/dead counters.
+     */
+    if (lstats->t_counts.t_truncated ||
+        lstats->t_counts.vacuum_count > 0 ||
+        lstats->t_counts.analyze_count > 0 ||
+        lstats->t_counts.autovac_vacuum_count > 0 ||
+        lstats->t_counts.autovac_analyze_count > 0)
+    {
+        shtabstats->n_live_tuples = 0;
+        shtabstats->n_dead_tuples = 0;
     }
 
+    /* clear the change counter if requested */
+    if (lstats->t_counts.reset_changed_tuples)
+        shtabstats->changes_since_analyze = 0;
+
+    shtabstats->n_live_tuples += lstats->t_counts.t_delta_live_tuples;
+    shtabstats->n_dead_tuples += lstats->t_counts.t_delta_dead_tuples;
+    shtabstats->changes_since_analyze += lstats->t_counts.t_changed_tuples;
+    shtabstats->blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    shtabstats->blocks_hit += lstats->t_counts.t_blocks_hit;
+
     /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
+     * Update vacuum/analyze timestamps and counters, making sure that the
+     * timestamps never go backwards.
      */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    if (shtabstats->vacuum_timestamp < lstats->vacuum_timestamp)
+        shtabstats->vacuum_timestamp = lstats->vacuum_timestamp;
+    shtabstats->vacuum_count += lstats->t_counts.vacuum_count;
+
+    if (shtabstats->autovac_vacuum_timestamp < lstats->autovac_vacuum_timestamp)
+        shtabstats->autovac_vacuum_timestamp = lstats->autovac_vacuum_timestamp;
+    shtabstats->autovac_vacuum_count += lstats->t_counts.autovac_vacuum_count;
+
+    if (shtabstats->analyze_timestamp < lstats->analyze_timestamp)
+        shtabstats->analyze_timestamp = lstats->analyze_timestamp;
+    shtabstats->analyze_count += lstats->t_counts.analyze_count;
+
+    if (shtabstats->autovac_analyze_timestamp < lstats->autovac_analyze_timestamp)
+        shtabstats->autovac_analyze_timestamp = lstats->autovac_analyze_timestamp;
+    shtabstats->autovac_analyze_count += lstats->t_counts.autovac_analyze_count;
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    shtabstats->n_live_tuples = Max(shtabstats->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    shtabstats->n_dead_tuples = Max(shtabstats->n_dead_tuples, 0);
+
+    LWLockRelease(&shtabstats->header.lock);
+
+    /* The entry was flushed, so add the same counts to the database stats */
+    ldbstats = get_local_dbstat_entry(dboid);
+    ldbstats->counts.n_tuples_returned += lstats->t_counts.t_tuples_returned;
+    ldbstats->counts.n_tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    ldbstats->counts.n_tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    ldbstats->counts.n_tuples_updated += lstats->t_counts.t_tuples_updated;
+    ldbstats->counts.n_tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    ldbstats->counts.n_blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    ldbstats->counts.n_blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    return true;
+}
+
+/*
+ * flush_funcstat - flush out a local function stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_funcstat(PgStatLocalHashEntry *ent, bool nowait)
+{
+    /* we assume this inits to all zeroes: */
+    static const PgStat_FunctionCounts all_zeroes;
+    PgStat_BackendFunctionEntry *localent;    /* local stats entry */
+    PgStat_StatFuncEntry *shfuncent = NULL; /* shared stats entry */
+    bool        found;
+
+    Assert(ent->key.type == PGSTAT_TYPE_FUNCTION);
+    localent = (PgStat_BackendFunctionEntry *) ent->body;
+
+    /* Skip it if no counts accumulated for it so far */
+    if (memcmp(&localent->f_counts, &all_zeroes,
+               sizeof(PgStat_FunctionCounts)) == 0)
+        return true;
+
+    /* find shared function stats entry corresponding to the local entry */
+    shfuncent = (PgStat_StatFuncEntry *)
+        get_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                       localent->f_id, nowait, true, &found);
+    /* skip if dshash failed to acquire lock */
+    if (shfuncent == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&shfuncent->header.lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&shfuncent->header.lock, LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
+
+    shfuncent->f_numcalls += localent->f_counts.f_numcalls;
+    shfuncent->f_total_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_total_time);
+    shfuncent->f_self_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_self_time);
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+    LWLockRelease(&shfuncent->header.lock);
 
-    /* Finally send SLRU statistics */
-    pgstat_send_slru();
+    return true;
 }
 
+
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * flush_dbstat - flush out a local database stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
  */
-static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+static bool
+flush_dbstat(PgStatLocalHashEntry *ent, bool nowait)
 {
-    int            n;
-    int            len;
+    PgStat_StatDBEntry *localent;
+    PgStat_StatDBEntry *sharedent;
+
+    Assert(ent->key.type == PGSTAT_TYPE_DB);
+
+    localent = (PgStat_StatDBEntry *) ent->body;
+
+    /* find shared database stats entry corresponding to the local entry */
+    sharedent = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, ent->key.databaseid, InvalidOid,
+                       nowait, true, NULL);
+
+    /* skip if dshash failed to acquire lock */
+    if (!sharedent)
+        return false;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&sharedent->header.lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&sharedent->header.lock, LW_EXCLUSIVE))
+        return false;
+
+    sharedent->counts.n_tuples_returned += localent->counts.n_tuples_returned;
+    sharedent->counts.n_tuples_fetched += localent->counts.n_tuples_fetched;
+    sharedent->counts.n_tuples_inserted += localent->counts.n_tuples_inserted;
+    sharedent->counts.n_tuples_updated += localent->counts.n_tuples_updated;
+    sharedent->counts.n_tuples_deleted += localent->counts.n_tuples_deleted;
+    sharedent->counts.n_blocks_fetched += localent->counts.n_blocks_fetched;
+    sharedent->counts.n_blocks_hit += localent->counts.n_blocks_hit;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    sharedent->counts.n_deadlocks += localent->counts.n_deadlocks;
+    sharedent->counts.n_temp_bytes += localent->counts.n_temp_bytes;
+    sharedent->counts.n_temp_files += localent->counts.n_temp_files;
+    sharedent->counts.n_checksum_failures += localent->counts.n_checksum_failures;
 
     /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
+     * Accumulate xact commit/rollback and I/O timings to stats entry of the
+     * current database.
      */
-    if (OidIsValid(tsmsg->m_databaseid))
+    if (OidIsValid(ent->key.databaseid))
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
+        sharedent->counts.n_xact_commit += pgStatXactCommit;
+        sharedent->counts.n_xact_rollback += pgStatXactRollback;
+        sharedent->counts.n_block_read_time += pgStatBlockReadTime;
+        sharedent->counts.n_block_write_time += pgStatBlockWriteTime;
         pgStatXactCommit = 0;
         pgStatXactRollback = 0;
         pgStatBlockReadTime = 0;
@@ -972,257 +1077,137 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        sharedent->counts.n_xact_commit = 0;
+        sharedent->counts.n_xact_rollback = 0;
+        sharedent->counts.n_block_read_time = 0;
+        sharedent->counts.n_block_write_time = 0;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
+    LWLockRelease(&sharedent->header.lock);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    return true;
 }
 
 /*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
+ * flush_slrustat - flush out local SLRU stats entries
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if all local SLRU stats are successfully flushed out.
+ */
+static bool
+flush_slrustat(bool nowait)
+{
+    uint32    assert_changecount PG_USED_FOR_ASSERTS_ONLY;
+    int i;
+
+    if (!have_slrustats)
+        return true;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&StatsShmem->slru_stats.lock,
+                                       LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
+
+    assert_changecount =
+        pg_atomic_fetch_add_u32(&StatsShmem->slru_stats.changecount, 1);
+    Assert((assert_changecount & 1) == 0);
+
+    for (i = 0 ; i < SLRU_NUM_ELEMENTS ; i++)
+    {
+        PgStat_StatSLRUEntry *sharedent = &StatsShmem->slru_stats.entry[i];
+        PgStat_StatSLRUEntry *localent = &local_SLRUStats[i];
+
+        sharedent->blocks_zeroed += localent->blocks_zeroed;
+        sharedent->blocks_hit += localent->blocks_hit;
+        sharedent->blocks_read += localent->blocks_read;
+        sharedent->blocks_written += localent->blocks_written;
+        sharedent->blocks_exists += localent->blocks_exists;
+        sharedent->flush += localent->flush;
+        sharedent->truncate += localent->truncate;
+    }
+
+    /* done, clear the local entry */
+    MemSet(local_SLRUStats, 0,
+           sizeof(PgStat_StatSLRUEntry) * SLRU_NUM_ELEMENTS);
+
+    pg_atomic_add_fetch_u32(&StatsShmem->slru_stats.changecount, 1);
+    LWLockRelease(&StatsShmem->slru_stats.lock);
+
+    have_slrustats = false;
+
+    return true;
+}
+
+
+/*
+ * Create the filename for a DB stat file; filename is an output parameter
+ * that points to a character buffer of length len.
  */
 static void
-pgstat_send_funcstats(void)
+get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
-
-    if (pgStatFunctions == NULL)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
-
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
-
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
-
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
-
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
-
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
+    int            printed;
+
+    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+    printed = snprintf(filename, len, "%s/db_%u.%s",
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
+                       databaseid,
+                       tempname ? "tmp" : "stat");
+    if (printed >= len)
+        elog(ERROR, "overlength pgstat path");
 }
 
 
 /* ----------
- * pgstat_vacuum_stat() -
+ * collect_stat_entries() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *  Collect the shared statistics entries specified by type and dbid. Returns
+ *  a palloc'ed array of pointers to shared statistics entries, terminated by
+ *  an element whose body is NULL. If type is PGSTAT_TYPE_ALL, statistics of
+ *  all types for the database are collected. If type is PGSTAT_TYPE_DB, dbid
+ *  is ignored and all PGSTAT_TYPE_DB entries are collected.
  * ----------
  */
-void
-pgstat_vacuum_stat(void)
+static PgStatLocalHashEntry *
+collect_stat_entries(PgStatTypes type, Oid dbid)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Read pg_database and make a list of OIDs of all existing databases
-     */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
-
-    /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            dbid = dbentry->databaseid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        /* the DB entry for shared tables (with InvalidOid) is never dropped */
-        if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
-            pgstat_drop_database(dbid);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Lookup our own database entry; if not found, nothing more to do.
-     */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
-        return;
-
-    /*
-     * Similarly to above, make a list of all known relations in this DB.
-     */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
-
-    /*
-     * Check for all tables listed in stats hashtable if they still exist.
-     */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
+    int                listlen = 16;
+    int                n = 0;
+    PgStatLocalHashEntry *entlist =
+        palloc(sizeof(PgStatLocalHashEntry) * listlen);
+
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
     {
-        Oid            tabid = tabentry->tableid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+        if ((type != PGSTAT_TYPE_ALL && p->key.type != type) ||
+            (type != PGSTAT_TYPE_DB && p->key.databaseid != dbid))
             continue;
 
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
+        if (n >= listlen - 1)
         {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
+            listlen *= 2;
+            entlist = repalloc(entlist,
+                               listlen * sizeof(PgStatLocalHashEntry));
         }
+        entlist[n].key        = p->key;
+        entlist[n++].body    = dsa_get_address(area, p->body);
     }
+    dshash_seq_term(&hstat);
 
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
-        {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
+    entlist[n].body = NULL;
 
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
-                continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
-        }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
-    }
+    return entlist;
 }
 
 
 /* ----------
- * pgstat_collect_oids() -
+ * collect_oids() -
  *
  *    Collect the OIDs of all objects listed in the specified system catalog
  *    into a temporary hash table.  Caller should hash_destroy the result
@@ -1231,7 +1216,7 @@ pgstat_vacuum_stat(void)
  * ----------
  */
 static HTAB *
-pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
+collect_oids(Oid catalogid, AttrNumber anum_oid)
 {
     HTAB       *htab;
     HASHCTL        hash_ctl;
@@ -1245,7 +1230,7 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     hash_ctl.entrysize = sizeof(Oid);
     hash_ctl.hcxt = CurrentMemoryContext;
     htab = hash_create("Temporary table of OIDs",
-                       PGSTAT_TAB_HASH_SIZE,
+                       PGSTAT_TABLE_HASH_SIZE,
                        &hash_ctl,
                        HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
 
@@ -1272,65 +1257,193 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
 }
 
 
+/* ----------
+ * pgstat_vacuum_stat() -
+ *
+ *  Delete shared stat entries that are not in system catalogs.
+ *
+ *  To avoid holding an exclusive lock on dshash for a long time, the process
+ *  is performed in three steps.
+ *
+ *   1: Collect the OIDs of the existing objects of every kind.
+ *   2: Collect victim entries by scanning with a shared lock.
+ *   3: Try removing every nominated entry without waiting for the lock.
+ *
+ *  As a consequence of the last step, some entries may be left behind due to
+ *  lock failure, but they will be deleted by later invocations of this
+ *  function.
+ * ----------
+ */
+void
+pgstat_vacuum_stat(void)
+{
+    HTAB       *dbids;            /* database ids */
+    HTAB       *relids;            /* relation ids in the current database */
+    HTAB       *funcids;        /* function ids in the current database */
+    PgStatHashEntryKey *victims;    /* victim entry list */
+    int            arraylen = 0;    /* storage size of the above */
+    int            nvictims = 0;    /* # of entries of the above */
+    dshash_seq_status dshstat;
+    PgStatHashEntry *ent;
+    int            i;
+
+    /* we don't collect stats under standalone mode */
+    if (!IsUnderPostmaster)
+        return;
+
+    /* collect oids of existent objects */
+    dbids = collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    relids = collect_oids(RelationRelationId, Anum_pg_class_oid);
+    funcids = collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
+
+    /* collect victims from shared stats */
+    arraylen = 16;
+    victims = palloc(sizeof(PgStatHashEntryKey) * arraylen);
+    nvictims = 0;
+
+    dshash_seq_init(&dshstat, pgStatSharedHash, false);
+
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
+    {
+        HTAB       *oidtab;
+        Oid           *key;
+
+        CHECK_FOR_INTERRUPTS();
+
+        /*
+         * Skip entries, other than database entries, that do not belong to
+         * the current database.
+         */
+        if (ent->key.type != PGSTAT_TYPE_DB &&
+            ent->key.databaseid != MyDatabaseId)
+            continue;
+
+        switch (ent->key.type)
+        {
+            case PGSTAT_TYPE_DB:
+                /* don't remove database entry for shared tables */
+                if (ent->key.databaseid == 0)
+                    continue;
+                oidtab = dbids;
+                key = &ent->key.databaseid;
+                break;
+
+            case PGSTAT_TYPE_TABLE:
+                oidtab = relids;
+                key = &ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                oidtab = funcids;
+                key = &ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_ALL:
+                Assert(false);
+                break;
+        }
+
+        /* Skip existent objects. */
+        if (hash_search(oidtab, key, HASH_FIND, NULL) != NULL)
+            continue;
+
+        /* extend the list if needed */
+        if (nvictims >= arraylen)
+        {
+            arraylen *= 2;
+            victims = repalloc(victims, sizeof(PgStatHashEntryKey) * arraylen);
+        }
+
+        victims[nvictims++] = ent->key;
+    }
+    dshash_seq_term(&dshstat);
+    hash_destroy(dbids);
+    hash_destroy(relids);
+    hash_destroy(funcids);
+
+    /* Now try removing the victim entries */
+    for (i = 0; i < nvictims; i++)
+    {
+        PgStatHashEntryKey *k = &victims[i];
+
+        delete_stat_entry(k->type, k->databaseid, k->objectid, true);
+    }
+
+    if (nvictims > 0)
+        pg_atomic_add_fetch_u64(&StatsShmem->del_ent_count, 1);
+}
+
+
+/* ----------
+ * delete_stat_entry() -
+ *
+ *  Deletes the specified entry from shared stats hash.
+ *
+ *  Returns true when successfully deleted.
+ * ----------
+ */
+static bool
+delete_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait)
+{
+    PgStatHashEntryKey key;
+    PgStatHashEntry *ent;
+
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    ent = dshash_find_extended(pgStatSharedHash, &key,
+                               true, nowait, false, NULL);
+
+    if (!ent)
+        return false;            /* lock failed or not found */
+
+    /* The entry is exclusively locked, so we can free the chunk first. */
+    dsa_free(area, ent->body);
+    dshash_delete_entry(pgStatSharedHash, ent);
+
+    return true;
+}
+
+
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
- * ----------
+ *    Remove the entries for the database that we just dropped.
+ *
+ *    Some entries might be left behind due to lock failure, or some stats
+ *    might be flushed after this, but we will still clean up the dead DB
+ *    eventually via future invocations of pgstat_vacuum_stat().
+ * ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
+    PgStatLocalHashEntry *dellist;
+    PgStatLocalHashEntry *p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
+    Assert(OidIsValid(databaseid));
 
-
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
+    dellist = collect_stat_entries(PGSTAT_TYPE_ALL, databaseid);
+
+    for (p = dellist; p->body != NULL; p++)
+    {
+        PgStatHashEntryKey *k = &p->key;
+        delete_stat_entry(k->type, k->databaseid, k->objectid, true);
+    }
 
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
+    pfree(dellist);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
+    pg_atomic_add_fetch_u64(&StatsShmem->del_ent_count, 1);
 }
-#endif                            /* NOT_USED */
 
 
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1339,20 +1452,32 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    PgStatLocalHashEntry *resetlist;
+    PgStatLocalHashEntry *p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    /* Lookup the entries of the current database in the stats hash. */
+    resetlist = collect_stat_entries(PGSTAT_TYPE_ALL, MyDatabaseId);
+    for (p = resetlist; p->body != NULL; p++)
+    {
+        LWLockAcquire(&p->body->lock, LW_EXCLUSIVE);
+        memset(p->body, 0, pgstat_sharedentsize[p->key.type]);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+        if (p->key.type == PGSTAT_TYPE_DB)
+        {
+            PgStat_StatDBEntry *dbstat = (PgStat_StatDBEntry *) p->body;
+            dbstat->stat_reset_timestamp = GetCurrentTimestamp();
+        }
+
+        LWLockRelease(&p->body->lock);
+    }
+
+    pfree(resetlist);
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1361,29 +1486,45 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    uint32    assert_changecount PG_USED_FOR_ASSERTS_ONLY;
 
+    /* Reset the archiver statistics for the cluster. */
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        assert_changecount = pg_atomic_fetch_add_u32(&global_changecount, 1);
+        Assert((assert_changecount & 1) == 0);
+        MemSet(&StatsShmem->archiver_stats, 0, sizeof(PgStat_ArchiverStats));
+        StatsShmem->archiver_stats.stat_reset_timestamp = now;
+        pg_atomic_add_fetch_u32(&global_changecount, 1);
+        LWLockRelease(StatsLock);
+    }
+    /* Reset the bgwriter statistics for the cluster. */
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+    {
+        TimestampTz now = GetCurrentTimestamp();
+
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        assert_changecount = pg_atomic_fetch_add_u32(&global_changecount, 1);
+        Assert((assert_changecount & 1) == 0);
+        MemSet(&StatsShmem->global_stats, 0, sizeof(PgStat_GlobalStats));
+        StatsShmem->global_stats.stat_reset_timestamp = now;
+        pg_atomic_add_fetch_u32(&global_changecount, 1);
+        LWLockRelease(StatsLock);
+    }
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1392,17 +1533,37 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatEntryHeader *header;
+    PgStat_StatDBEntry *dbentry;
+    PgStatTypes stattype;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    dbentry = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, MyDatabaseId, InvalidOid, false, false,
+                       NULL);
+    Assert(dbentry);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* Set the reset timestamp for the whole database */
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->header.lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&dbentry->header.lock);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Remove object if it exists, ignore if not */
+    switch (type)
+    {
+        case RESET_TABLE:
+            stattype = PGSTAT_TYPE_TABLE;
+            break;
+        case RESET_FUNCTION:
+            stattype = PGSTAT_TYPE_FUNCTION;
+    }
+
+    header = get_stat_entry(stattype, MyDatabaseId, objoid, false, false, NULL);
+
+    LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+    memset(header, 0, pgstat_sharedentsize[stattype]);
+    LWLockRelease(&header->lock);
 }
 
 /* ----------
@@ -1418,15 +1579,36 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_reset_slru_counter(const char *name)
 {
-    PgStat_MsgResetslrucounter msg;
+    int i;
+    TimestampTz    ts = GetCurrentTimestamp();
+    uint32    assert_changecount PG_USED_FOR_ASSERTS_ONLY;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSLRUCOUNTER);
-    msg.m_index = (name) ? pgstat_slru_index(name) : -1;
-
-    pgstat_send(&msg, sizeof(msg));
+    if (name)
+    {
+        i = pgstat_slru_index(name);
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        assert_changecount = pg_atomic_fetch_add_u32(&global_changecount, 1);
+        Assert((assert_changecount & 1) == 0);
+        MemSet(&StatsShmem->slru_stats.entry[i], 0,
+               sizeof(PgStat_StatSLRUEntry));
+        StatsShmem->slru_stats.entry[i].stat_reset_timestamp = ts;
+        pg_atomic_add_fetch_u32(&global_changecount, 1);
+        LWLockRelease(StatsLock);
+    }
+    else
+    {
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        assert_changecount = pg_atomic_fetch_add_u32(&global_changecount, 1);
+        Assert((assert_changecount & 1) == 0);
+        for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
+        {
+            MemSet(&StatsShmem->slru_stats.entry[i], 0,
+                   sizeof(PgStat_StatSLRUEntry));
+            StatsShmem->slru_stats.entry[i].stat_reset_timestamp = ts;
+        }
+        pg_atomic_add_fetch_u32(&global_changecount, 1);
+        LWLockRelease(StatsLock);
+    }
 }
 
 /* ----------
@@ -1440,48 +1622,63 @@ pgstat_reset_slru_counter(const char *name)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if activity stats are not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    ts = GetCurrentTimestamp();
 
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Store the last autovacuum time in the database's hash table entry.
+     */
+    dbentry = get_local_dbstat_entry(dboid);
+    dbentry->last_autovac_time = ts;
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    PgStat_TableStatus *tabentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    ts = GetCurrentTimestamp();
+    tabentry = get_local_tabstat_entry(tableoid, shared);
+
+    tabentry->t_counts.t_delta_live_tuples = livetuples;
+    tabentry->t_counts.t_delta_dead_tuples = deadtuples;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = ts;
+        tabentry->t_counts.autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = ts;
+        tabentry->t_counts.vacuum_count++;
+    }
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1492,9 +1689,10 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    PgStat_TableStatus *tabentry;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1502,10 +1700,10 @@ pgstat_report_analyze(Relation rel,
      * already inserted and/or deleted rows in the target table. ANALYZE will
      * have counted such rows as live or dead respectively. Because we will
      * report our counts of such rows at transaction end, we should subtract
-     * off these counts from what we send to the collector now, else they'll
-     * be double-counted after commit.  (This approach also ensures that the
-     * collector ends up with the right numbers if we abort instead of
-     * committing.)
+     * off these counts from what we write to the shared stats now, else
+     * they'll be double-counted after commit.  (This approach also ensures
+     * that the shared stats end up with the right numbers if we abort
+     * instead of committing.)
      */
     if (rel->pgstat_info != NULL)
     {
@@ -1523,154 +1721,167 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    tabentry = get_local_tabstat_entry(RelationGetRelid(rel),
+                                       rel->rd_rel->relisshared);
+
+    tabentry->t_counts.t_delta_live_tuples = livetuples;
+    tabentry->t_counts.t_delta_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->t_counts.reset_changed_tuples = true;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->t_counts.autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->t_counts.analyze_count++;
+    }
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            dbent->counts.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            dbent->counts.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            dbent->counts.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            dbent->counts.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            dbent->counts.n_conflict_startup_deadlock++;
+            break;
+    }
 }
 
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-
-/* --------
- * pgstat_report_checksum_failures_in_db() -
- *
- *    Tell the collector about one or more checksum failures.
- * --------
- */
-void
-pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
-{
-    PgStat_MsgChecksumFailure msg;
-
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
-
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_deadlocks++;
 }
 
 /* --------
  * pgstat_report_checksum_failure() -
  *
- *    Tell the collector about a checksum failure.
+ *    Report a checksum failure.
  * --------
  */
 void
 pgstat_report_checksum_failure(void)
 {
-    pgstat_report_checksum_failures_in_db(MyDatabaseId, 1);
+    PgStat_StatDBEntry *dbent;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_checksum_failures++;
 }
 
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
+    if (filesize == 0)            /* Is there a case where filesize is really 0? */
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_temp_bytes += filesize; /* needs overflow check */
+    dbent->counts.n_temp_files++;
 }
 
 
-/* ----------
- * pgstat_ping() -
+/* --------
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
  *
- *    Send some junk data to the collector to increase traffic.
- * ----------
+ *    Report one or more checksum failures.
+ * --------
  */
 void
-pgstat_ping(void)
+pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
 {
-    PgStat_MsgDummy msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dbentry = get_local_dbstat_entry(dboid);
+
+    /* add the reported failure count to the accumulated count */
+    dbentry->counts.n_checksum_failures += failurecount;
 }
 
 /* ----------
- * pgstat_send_inquiry() -
+ * pgstat_init_function_usage() -
  *
- *    Notify collector that we need fresh data.
+ *  Initialize function call usage data.
+ *  Called by the executor before invoking a function.
  * ----------
  */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/*
- * Initialize function call usage data.
- * Called by the executor before invoking a function.
- */
 void
 pgstat_init_function_usage(FunctionCallInfo fcinfo,
                            PgStat_FunctionCallUsage *fcu)
@@ -1685,26 +1896,15 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
         return;
     }
 
-    if (!pgStatFunctions)
-    {
-        /* First time through - initialize function stat table */
-        HASHCTL        hash_ctl;
+    htabent = (PgStat_BackendFunctionEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             fcinfo->flinfo->fn_oid, true, &found);
 
-        memset(&hash_ctl, 0, sizeof(hash_ctl));
-        hash_ctl.keysize = sizeof(Oid);
-        hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
-        pgStatFunctions = hash_create("Function stat entries",
-                                      PGSTAT_FUNCTION_HASH_SIZE,
-                                      &hash_ctl,
-                                      HASH_ELEM | HASH_BLOBS);
-    }
-
-    /* Get the stats entry for this function, create if necessary */
-    htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
-                          HASH_ENTER, &found);
     if (!found)
         MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
 
+    htabent->f_id = fcinfo->flinfo->fn_oid;
+
     fcu->fs = &htabent->f_counts;
 
     /* save stats for this function, later used to compensate for recursion */
@@ -1717,31 +1917,37 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
     INSTR_TIME_SET_CURRENT(fcu->f_start);
 }
 
-/*
- * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
- *        for specified function
+/* ----------
+ * find_funcstat_entry() -
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_BackendFunctionEntry for the specified function.
+ *
+ *  If there is no entry, return NULL without creating a new one.
+ * ----------
  */
 PgStat_BackendFunctionEntry *
 find_funcstat_entry(Oid func_id)
 {
-    if (pgStatFunctions == NULL)
-        return NULL;
+    PgStat_BackendFunctionEntry *ent;
 
-    return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
-                                                       (void *) &func_id,
-                                                       HASH_FIND, NULL);
+    ent = (PgStat_BackendFunctionEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             func_id, false, NULL);
+
+    return ent;
 }
 
-/*
- * Calculate function call usage and update stat counters.
- * Called by the executor after invoking a function.
+/* ----------
+ * pgstat_end_function_usage() -
  *
- * In the case of a set-returning function that runs in value-per-call mode,
- * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
- * calls for what the user considers a single call of the function.  The
- * finalize flag should be TRUE on the last call.
+ *  Calculate function call usage and update stat counters.
+ *  Called by the executor after invoking a function.
+ *
+ *  In the case of a set-returning function that runs in value-per-call mode,
+ *  we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
+ *  calls for what the user considers a single call of the function.  The
+ *  finalize flag should be TRUE on the last call.
+ * ----------
  */
 void
 pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
@@ -1782,9 +1988,6 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
         fs->f_numcalls++;
     fs->f_total_time = f_total;
     INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
 }
 
 
@@ -1796,8 +1999,7 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
  *
  *    We assume that a relcache entry's pgstat_info field is zeroed by
  *    relcache.c when the relcache entry is made; thereafter it is long-lived
- *    data.  We can avoid repeated searches of the TabStatus arrays when the
- *    same relation is touched repeatedly within a transaction.
+ *    data.
  * ----------
  */
 void
@@ -1813,7 +2015,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1829,116 +2032,148 @@ pgstat_initstats(Relation rel)
         return;
 
     /* Else find or make the PgStat_TableStatus entry, and update link */
-    rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+    rel->pgstat_info = get_local_tabstat_entry(rel_id, rel->rd_rel->relisshared);
 }
 
-/*
- * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
+
+/* ----------
+ * get_local_stat_entry() -
+ *
+ *  Returns the local stats entry for the given type, dbid and objid.
+ *  If create is true, a new entry is created if one does not exist yet;
+ *  found must be non-null in that case.
+ *
+ *
+ *  The caller is responsible for initializing the statsbody part of the
+ *  returned memory.
+ * ----------
  */
-static PgStat_TableStatus *
-get_tabstat_entry(Oid rel_id, bool isshared)
+static PgStat_StatEntryHeader *
+get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                     bool create, bool *found)
 {
-    TabStatHashEntry *hash_entry;
-    PgStat_TableStatus *entry;
-    TabStatusArray *tsa;
-    bool        found;
+    PgStatHashEntryKey key;
+    PgStatLocalHashEntry *entry;
 
-    /*
-     * Create hash table if we don't have it already.
-     */
-    if (pgStatTabHash == NULL)
+    if (pgStatLocalHash == NULL)
     {
         HASHCTL        ctl;
 
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
+        MemSet(&ctl, 0, sizeof(ctl));
+        ctl.keysize = sizeof(PgStatHashEntryKey);
+        ctl.entrysize = sizeof(PgStatLocalHashEntry);
 
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
+        pgStatLocalHash = hash_create("Local stat entries",
+                                      PGSTAT_TABLE_HASH_SIZE,
+                                      &ctl,
+                                      HASH_ELEM | HASH_BLOBS);
     }
 
+    /* Find an entry or create a new one. */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    entry = hash_search(pgStatLocalHash, &key,
+                        create ? HASH_ENTER : HASH_FIND, found);
+
+    if (!create && !entry)
+        return NULL;
+
+    if (create && !*found)
+        entry->body = MemoryContextAlloc(TopMemoryContext,
+                                         pgstat_localentsize[type]);
+
+    return entry->body;
+}
+
+/* ----------
+ * get_local_dbstat_entry() -
+ *
+ *  Find or create a local PgStat_StatDBEntry entry for dbid.  A new entry
+ *  is created and initialized if none exists.
+ */
+static PgStat_StatDBEntry *
+get_local_dbstat_entry(Oid dbid)
+{
+    PgStat_StatDBEntry *dbentry;
+    bool        found;
+
     /*
      * Find an entry or create a new one.
      */
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
+    dbentry = (PgStat_StatDBEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid,
+                             true, &found);
+
     if (!found)
     {
-        /* initialize new entry with null pointer */
-        hash_entry->tsa_entry = NULL;
+        dbentry->last_autovac_time = 0;
+        dbentry->last_checksum_failure = 0;
+        dbentry->stat_reset_timestamp = 0;
+        dbentry->stats_timestamp = 0;
+        MemSet(&dbentry->counts, 0, sizeof(PgStat_StatDBCounts));
     }
 
-    /*
-     * If entry is already valid, we're done.
-     */
-    if (hash_entry->tsa_entry)
-        return hash_entry->tsa_entry;
+    return dbentry;
+}
 
-    /*
-     * Locate the first pgStatTabList entry with free space, making a new list
-     * entry if needed.  Note that we could get an OOM failure here, but if so
-     * we have left the hashtable and the list in a consistent state.
-     */
-    if (pgStatTabList == NULL)
-    {
-        /* Set up first pgStatTabList entry */
-        pgStatTabList = (TabStatusArray *)
-            MemoryContextAllocZero(TopMemoryContext,
-                                   sizeof(TabStatusArray));
-    }
 
-    tsa = pgStatTabList;
-    while (tsa->tsa_used >= TABSTAT_QUANTUM)
+/* ----------
+ * get_local_tabstat_entry() -
+ *  Find or create a PgStat_TableStatus entry for rel.  A new entry is created
+ *  and initialized if none exists.
+ * ----------
+ */
+static PgStat_TableStatus *
+get_local_tabstat_entry(Oid rel_id, bool isshared)
+{
+    PgStat_TableStatus *tabentry;
+    bool        found;
+
+    tabentry = (PgStat_TableStatus *)
+        get_local_stat_entry(PGSTAT_TYPE_TABLE,
+                             isshared ? InvalidOid : MyDatabaseId,
+                             rel_id, true, &found);
+
+    if (!found)
     {
-        if (tsa->tsa_next == NULL)
-            tsa->tsa_next = (TabStatusArray *)
-                MemoryContextAllocZero(TopMemoryContext,
-                                       sizeof(TabStatusArray));
-        tsa = tsa->tsa_next;
+        tabentry->t_id = rel_id;
+        tabentry->t_shared = isshared;
+        tabentry->trans = NULL;
+        MemSet(&tabentry->t_counts, 0, sizeof(PgStat_TableCounts));
+        tabentry->vacuum_timestamp = 0;
+        tabentry->autovac_vacuum_timestamp = 0;
+        tabentry->analyze_timestamp = 0;
+        tabentry->autovac_analyze_timestamp = 0;
     }
 
-    /*
-     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
-     * the entry was already zeroed, either at creation or after last use.
-     */
-    entry = &tsa->tsa_entries[tsa->tsa_used++];
-    entry->t_id = rel_id;
-    entry->t_shared = isshared;
-
-    /*
-     * Now we can fill the entry in pgStatTabHash.
-     */
-    hash_entry->tsa_entry = entry;
-
-    return entry;
+    return tabentry;
 }
 
+
 /*
  * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_TableStatus entry for rel from the current
+ *  database then from shared tables.
  *
- * Note: if we got an error in the most recent execution of pgstat_report_stat,
- * it's possible that an entry exists but there's no hashtable entry for it.
- * That's okay, we'll treat this case as "doesn't exist".
+ *  If no entry, return NULL, don't create a new one
+ * ----------
  */
 PgStat_TableStatus *
 find_tabstat_entry(Oid rel_id)
 {
-    TabStatHashEntry *hash_entry;
+    PgStat_TableStatus *ent;
 
-    /* If hashtable doesn't exist, there are no entries at all */
-    if (!pgStatTabHash)
-        return NULL;
+    ent = (PgStat_TableStatus *)
+        get_local_stat_entry(PGSTAT_TYPE_TABLE, MyDatabaseId, rel_id,
+                             false, NULL);
+    if (!ent)
+        ent = (PgStat_TableStatus *)
+            get_local_stat_entry(PGSTAT_TYPE_TABLE, InvalidOid, rel_id,
+                                 false, NULL);
 
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
-    if (!hash_entry)
-        return NULL;
-
-    /* Note that this step could also return NULL, but that's correct */
-    return hash_entry->tsa_entry;
+    return ent;
 }
 
 /*
@@ -2369,8 +2604,8 @@ AtPrepare_PgStat(void)
  *
  * All we need do here is unlink the transaction stats state from the
  * nontransactional state.  The nontransactional action counts will be
- * reported to the stats collector immediately, while the effects on live
- * and dead tuple counts are preserved in the 2PC state file.
+ * reported to the activity stats facility immediately, while the effects on
+ * live and dead tuple counts are preserved in the 2PC state file.
  *
  * Note: AtEOXact_PgStat is not called during PREPARE.
  */
@@ -2415,7 +2650,7 @@ pgstat_twophase_postcommit(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, commit case */
     pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
@@ -2451,7 +2686,7 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, abort case */
     if (rec->t_truncated)
@@ -2471,85 +2706,112 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    Find the database stats entry and return a copy in palloc'ed memory.
  * ----------
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    PgStat_StatDBEntry *snap;
+    PgStat_StatDBEntry *shent;
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    shent = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid, true, false, NULL);
+
+    if (!shent)
+        return NULL;
+
+    snap = palloc(sizeof(PgStat_StatDBEntry));
+    memcpy(snap, shent, sizeof(PgStat_StatDBEntry));
+
+    return snap;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one table or NULL. NULL doesn't mean
+ *    the activity statistics for one table or NULL. NULL doesn't mean
  *    that the table doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    activity statistics facility, so the caller is better off to
+ *    report ZERO instead.
  * ----------
  */
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
-    PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(false, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
-
-    return NULL;
+    tabentry = pgstat_fetch_stat_tabentry_snapshot(true, relid);
+    return tabentry;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_tabentry_snapshot() -
+ *
+ *    Find the table stats entry on backends. The returned entry is cached
+ *    until transaction end or pgstat_clear_snapshot() is called.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_snapshot(bool shared, Oid reloid)
+{
+    PgStat_StatTabEntry *snap;
+    PgStat_StatTabEntry *shent;
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    shent = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, reloid, true, false, NULL);
+
+    if (!shent)
+        return NULL;
+
+    snap = palloc(sizeof(PgStat_StatTabEntry));
+    memcpy(snap, shent, sizeof(PgStat_StatTabEntry));
+
+    return snap;
+}
+
+
+/* ----------
+ * pgstat_copy_index_counters() -
+ *
+ *    Support function for index swapping. Copy a portion of the counters of the
+ *    relation to the specified place.
+ * ----------
+ */
+void
+pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst)
+{
+    PgStat_StatTabEntry *tabentry;
+
+    /* No point fetching tabentry when dst is NULL */
+    if (!dst)
+        return;
+
+    tabentry = pgstat_fetch_stat_tabentry(relid);
+
+    if (!tabentry)
+        return;
+
+    dst->t_counts.t_numscans = tabentry->numscans;
+    dst->t_counts.t_tuples_returned = tabentry->tuples_returned;
+    dst->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
+    dst->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
+    dst->t_counts.t_blocks_hit = tabentry->blocks_hit;
 }
 
 
@@ -2563,25 +2825,25 @@ pgstat_fetch_stat_tabentry(Oid relid)
 PgStat_StatFuncEntry *
 pgstat_fetch_stat_funcentry(Oid func_id)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry = NULL;
+    PgStat_StatFuncEntry *snap;
+    PgStat_StatFuncEntry *shent;
 
-    /* load the stats file if needed */
-    backend_read_statsfile();
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
 
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
+    shent = (PgStat_StatFuncEntry *)
+        get_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId, func_id, true, false,
+                       NULL);
 
-    return funcentry;
+    if (!shent)
+        return NULL;
+
+    snap = palloc(sizeof(PgStat_StatFuncEntry));
+    memcpy(snap, shent, sizeof(PgStat_StatFuncEntry));
+
+    return snap;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_beentry() -
  *
@@ -2646,15 +2908,31 @@ pgstat_fetch_stat_numbackends(void)
  * pgstat_fetch_stat_archiver() -
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the archiver statistics struct.
+ *    a pointer to a palloc'ed copy of the archiver statistics struct.
  * ---------
  */
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    PgStat_ArchiverStats *ret = palloc(sizeof(PgStat_ArchiverStats));
 
-    return &archiverStats;
+    for (;;)
+    {
+        uint32 before_count;
+        uint32 after_count;
+
+        pg_read_barrier();
+        before_count = pg_atomic_read_u32(&global_changecount);
+        memcpy(ret, &StatsShmem->archiver_stats, sizeof(PgStat_ArchiverStats));
+        after_count = pg_atomic_read_u32(&global_changecount);
+
+        if (before_count == after_count && (before_count & 1) == 0)
+            break;
+
+        CHECK_FOR_INTERRUPTS();
+    }
+
+    return ret;
 }
 
 
@@ -2663,15 +2941,31 @@ pgstat_fetch_stat_archiver(void)
  * pgstat_fetch_global() -
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the global statistics struct.
+ *    a pointer to a palloc'ed copy of the global statistics struct.
  * ---------
  */
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    PgStat_GlobalStats *ret = palloc(sizeof(PgStat_GlobalStats));
 
-    return &globalStats;
+    for (;;)
+    {
+        uint32 before_count;
+        uint32 after_count;
+
+        pg_read_barrier();
+        before_count = pg_atomic_read_u32(&global_changecount);
+        memcpy(ret, &StatsShmem->global_stats, sizeof(PgStat_GlobalStats));
+        after_count = pg_atomic_read_u32(&global_changecount);
+
+        if (before_count == after_count && (before_count & 1) == 0)
+            break;
+
+        CHECK_FOR_INTERRUPTS();
+    }
+
+    return ret;
 }
 
 
@@ -2683,12 +2977,31 @@ pgstat_fetch_global(void)
  *    a pointer to the slru statistics struct.
  * ---------
  */
-PgStat_SLRUStats *
+PgStat_StatSLRUEntry *
 pgstat_fetch_slru(void)
 {
-    backend_read_statsfile();
+    PgStat_StatSLRUEntry *ret;
+    size_t size = sizeof(PgStat_StatSLRUEntry) * SLRU_NUM_ELEMENTS;
 
-    return slruStats;
+    ret = palloc(size);
+
+    for (;;)
+    {
+        uint32 before_count;
+        uint32 after_count;
+
+        pg_read_barrier();
+        before_count = pg_atomic_read_u32(&global_changecount);
+        memcpy(ret, &StatsShmem->slru_stats, size);
+        after_count = pg_atomic_read_u32(&global_changecount);
+
+        if (before_count == after_count && (before_count & 1) == 0)
+            break;
+
+        CHECK_FOR_INTERRUPTS();
+    }
+
+    return ret;
 }
 
 
@@ -2902,8 +3215,8 @@ pgstat_initialize(void)
         MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
     }
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* Set up a process-exit hook; it must run before DSM shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -3079,12 +3392,15 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+    /* attach shared database stats area */
+    attach_shared_stats();
 }
 
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
- * Flush any remaining statistics counts out to the collector.
+ * Flush any remaining statistics counts out to shared stats.
  * Without this, operations triggered during backend exit (such as
  * temp table deletions) won't be counted.
  *
@@ -3097,7 +3413,7 @@ pgstat_beshutdown_hook(int code, Datum arg)
 
     /*
      * If we got as far as discovering our own database ID, we can report what
-     * we did to the collector.  Otherwise, we'd be sending an invalid
+     * we did to the shared stats.  Otherwise, we'd be sending an invalid
      * database ID, so forget it.  (This means that accesses to pg_database
      * during failed backend starts might never get counted.)
      */
@@ -3114,6 +3430,8 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    detach_shared_stats(true);
 }
 
 
@@ -3374,7 +3692,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3669,8 +3988,8 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
+        case WAIT_EVENT_READING_STATS_FILE:
+            event_name = "ReadingStatsFile";
             break;
         case WAIT_EVENT_RECOVERY_WAL_STREAM:
             event_name = "RecoveryWalStream";
@@ -4324,94 +4643,82 @@ pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
 
 
 /* ----------
- * pgstat_setheader() -
+ * pgstat_report_archiver() -
  *
- *        Set common header fields in a statistics message
- * ----------
- */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
-{
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    ((PgStat_MsgHdr *) msg)->m_size = len;
-
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
-    {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
- *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
+ *        Report archiver statistics
  * ----------
  */
 void
-pgstat_send_archiver(const char *xlog, bool failed)
+pgstat_report_archiver(const char *xlog, bool failed)
 {
-    PgStat_MsgArchiver msg;
+    uint32    assert_changecount PG_USED_FOR_ASSERTS_ONLY;
+    TimestampTz now = GetCurrentTimestamp();
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    strlcpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+    if (failed)
+    {
+        /* Failed archival attempt */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        assert_changecount = pg_atomic_fetch_add_u32(&global_changecount, 1);
+        Assert((assert_changecount & 1) == 0);
+        ++StatsShmem->archiver_stats.failed_count;
+        strlcpy(StatsShmem->archiver_stats.last_failed_wal, xlog,
+                sizeof(StatsShmem->archiver_stats.last_failed_wal));
+        StatsShmem->archiver_stats.last_failed_timestamp = now;
+        pg_atomic_add_fetch_u32(&global_changecount, 1);
+        LWLockRelease(StatsLock);
+    }
+    else
+    {
+        /* Successful archival operation */
+        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+        assert_changecount = pg_atomic_fetch_add_u32(&global_changecount, 1);
+        Assert((assert_changecount & 1) == 0);
+        ++StatsShmem->archiver_stats.archived_count;
+        strlcpy(StatsShmem->archiver_stats.last_archived_wal, xlog,
+                sizeof(StatsShmem->archiver_stats.last_archived_wal));
+        StatsShmem->archiver_stats.last_archived_timestamp = now;
+        pg_atomic_add_fetch_u32(&global_changecount, 1);
+        LWLockRelease(StatsLock);
+    }
 }
 
 /* ----------
- * pgstat_send_bgwriter() -
+ * pgstat_report_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
-pgstat_send_bgwriter(void)
+pgstat_report_bgwriter(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+    uint32    assert_changecount PG_USED_FOR_ASSERTS_ONLY;
+    PgStat_GlobalStats *s = &StatsShmem->global_stats;
+    PgStat_BgWriter       *l = &BgWriterStats;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid taking the lock when there is nothing to report.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    assert_changecount = pg_atomic_fetch_add_u32(&global_changecount, 1);
+    Assert((assert_changecount & 1) == 0);
+    s->timed_checkpoints += l->timed_checkpoints;
+    s->requested_checkpoints += l->requested_checkpoints;
+    s->checkpoint_write_time += l->checkpoint_write_time;
+    s->checkpoint_sync_time += l->checkpoint_sync_time;
+    s->buf_written_checkpoints += l->buf_written_checkpoints;
+    s->buf_written_clean += l->buf_written_clean;
+    s->maxwritten_clean += l->maxwritten_clean;
+    s->buf_written_backend += l->buf_written_backend;
+    s->buf_fsync_backend += l->buf_fsync_backend;
+    s->buf_alloc += l->buf_alloc;
+    pg_atomic_add_fetch_u32(&global_changecount, 1);
+    LWLockRelease(StatsLock);
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4419,477 +4726,31 @@ pgstat_send_bgwriter(void)
     MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
 }
 
-/* ----------
- * pgstat_send_slru() -
- *
- *        Send SLRU statistics to the collector
- * ----------
- */
-static void
-pgstat_send_slru(void)
-{
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgSLRU all_zeroes;
-
-    for (int i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /*
-         * This function can be called even if nothing at all has happened. In
-         * this case, avoid sending a completely empty message to the stats
-         * collector.
-         */
-        if (memcmp(&SLRUStats[i], &all_zeroes, sizeof(PgStat_MsgSLRU)) == 0)
-            continue;
-
-        /* set the SLRU type before each send */
-        SLRUStats[i].m_index = i;
-
-        /*
-         * Prepare and send the message
-         */
-        pgstat_setheader(&SLRUStats[i].m_hdr, PGSTAT_MTYPE_SLRU);
-        pgstat_send(&SLRUStats[i], sizeof(PgStat_MsgSLRU));
-
-        /*
-         * Clear out the statistics buffer, so it can be re-used.
-         */
-        MemSet(&SLRUStats[i], 0, sizeof(PgStat_MsgSLRU));
-    }
-}
-
-
-/* ----------
- * PgstatCollectorMain() -
- *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-    WaitEvent    event;
-    WaitEventSet *wes;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, SignalHandlerForConfigReload);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    MyBackendType = B_STATS_COLLECTOR;
-    init_ps_display(NULL);
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /* Prepare to wait for our latch or data in our socket. */
-    wes = CreateWaitEventSet(CurrentMemoryContext, 3);
-    AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
-    AddWaitEventToSet(wes, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL, NULL);
-    AddWaitEventToSet(wes, WL_SOCKET_READABLE, pgStatSock, NULL, NULL);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do check ConfigReloadPending
-     * inside the inner loop, which means that such interrupts will get
-     * serviced but the latch won't get cleared until next time there is a
-     * break in the action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (ShutdownRequestPending)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * ShutdownRequestPending becomes set.
-         */
-        while (!ShutdownRequestPending)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (ConfigReloadPending)
-            {
-                ConfigReloadPending = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSLRUCOUNTER:
-                    pgstat_recv_resetslrucounter(&msg.msg_resetslrucounter,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_SLRU:
-                    pgstat_recv_slru(&msg.msg_slru, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitEventSetWait(wes, -1L, &event, 1, WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitEventSetWait(wes, 2 * 1000L /* msec */ , &event, 1,
-                              WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr == 1 && event.events == WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    FreeWaitEventSet(wes);
-
-    exit(0);
-}
-
-/*
- * Subroutine to clear stats in a database entry
- *
- * Tables and functions hashes are initialized to empty.
- */
-static void
-reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
-{
-    HASHCTL        hash_ctl;
-
-    dbentry->n_xact_commit = 0;
-    dbentry->n_xact_rollback = 0;
-    dbentry->n_blocks_fetched = 0;
-    dbentry->n_blocks_hit = 0;
-    dbentry->n_tuples_returned = 0;
-    dbentry->n_tuples_fetched = 0;
-    dbentry->n_tuples_inserted = 0;
-    dbentry->n_tuples_updated = 0;
-    dbentry->n_tuples_deleted = 0;
-    dbentry->last_autovac_time = 0;
-    dbentry->n_conflict_tablespace = 0;
-    dbentry->n_conflict_lock = 0;
-    dbentry->n_conflict_snapshot = 0;
-    dbentry->n_conflict_bufferpin = 0;
-    dbentry->n_conflict_startup_deadlock = 0;
-    dbentry->n_temp_files = 0;
-    dbentry->n_temp_bytes = 0;
-    dbentry->n_deadlocks = 0;
-    dbentry->n_checksum_failures = 0;
-    dbentry->last_checksum_failure = 0;
-    dbentry->n_block_read_time = 0;
-    dbentry->n_block_write_time = 0;
-
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
-
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
-}
-
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
-{
-    PgStat_StatDBEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
-     */
-    if (!found)
-        reset_dbentry_counters(result);
-
-    return result;
-}
-
-
-/*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
- */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
-{
-    PgStat_StatTabEntry *result;
-    bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
-
-    if (!create && !found)
-        return NULL;
-
-    /* If not found, initialize the new one. */
-    if (!found)
-    {
-        result->numscans = 0;
-        result->tuples_returned = 0;
-        result->tuples_fetched = 0;
-        result->tuples_inserted = 0;
-        result->tuples_updated = 0;
-        result->tuples_deleted = 0;
-        result->tuples_hot_updated = 0;
-        result->n_live_tuples = 0;
-        result->n_dead_tuples = 0;
-        result->changes_since_analyze = 0;
-        result->inserts_since_vacuum = 0;
-        result->blocks_fetched = 0;
-        result->blocks_hit = 0;
-        result->vacuum_timestamp = 0;
-        result->vacuum_count = 0;
-        result->autovac_vacuum_timestamp = 0;
-        result->autovac_vacuum_count = 0;
-        result->analyze_timestamp = 0;
-        result->analyze_count = 0;
-        result->autovac_analyze_timestamp = 0;
-        result->autovac_analyze_count = 0;
-    }
-
-    return result;
-}
-
 
 /* ----------
  * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
- *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ *        Write the global statistics file, as well as DB files.
  * ----------
  */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+void
+pgstat_write_statsfiles(void)
 {
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
+    PgStatLocalHashEntry *entlist;
+    PgStatLocalHashEntry *p;
+
+    /* Shared stats are not initialized yet; nothing to write. */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
+    create_missing_dbentries();
+
     /*
      * Open the statistics temp file to write out the current values.
      */
@@ -4906,7 +4767,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    StatsShmem->global_stats.stats_timestamp = GetCurrentTimestamp();
 
     /*
      * Write the file header --- currently just a format ID.
@@ -4918,48 +4779,52 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Write global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(&StatsShmem->global_stats, sizeof(PgStat_GlobalStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write archiver stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+    rc = fwrite(&StatsShmem->archiver_stats, sizeof(PgStat_ArchiverStats), 1,
+                fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write SLRU stats struct
      */
-    rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
+    rc = fwrite(&StatsShmem->slru_stats, sizeof(PgStatSharedSLRUStats), 1,
+                fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Walk through the database table.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    entlist = collect_stat_entries(PGSTAT_TYPE_DB, InvalidOid);
+    for (p = entlist; p->body != NULL; p++)
     {
+        PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) p->body;
+
         /*
          * Write out the table and function stats for this DB into the
          * appropriate per-DB stat file, if required.
          */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = StatsShmem->global_stats.stats_timestamp;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
+        pgstat_write_database_stats(p->key.databaseid);
 
         /*
          * Write out the DB entry. We don't write the tables or functions
          * pointers, since they're of no use to any other process.
          */
         fputc('D', fpout);
+        rc = fwrite(&p->key.databaseid, sizeof(Oid), 1, fpout);
         rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
 
+    pfree(entlist);
+
     /*
      * No more output to be done. Close the temp file and replace the old
      * pgstat.stat with it.  The ferror() check replaces testing for error
@@ -4992,64 +4857,27 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
 }
 
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
 
 /* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
+ * pgstat_write_database_stats() -
+ *  Write the stat file for a single database.
  * ----------
  */
 static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
+pgstat_write_database_stats(Oid dbid)
 {
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
+    PgStatLocalHashEntry *entlist;
+    PgStatLocalHashEntry *p;
     FILE       *fpout;
     int32        format_id;
-    Oid            dbid = dbentry->databaseid;
     int            rc;
     char        tmpfile[MAXPGPATH];
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -5076,24 +4904,32 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     /*
      * Walk through the database's access stats per table.
      */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+    entlist = collect_stat_entries(PGSTAT_TYPE_TABLE, dbid);
+    for (p = entlist; p->body != NULL; p++)
     {
+        PgStat_StatTabEntry *tabentry = (PgStat_StatTabEntry *) p->body;
+
         fputc('T', fpout);
+        rc = fwrite(&p->key.objectid, sizeof(Oid), 1, fpout);
         rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    pfree(entlist);
 
     /*
      * Walk through the database's function stats table.
      */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+    entlist = collect_stat_entries(PGSTAT_TYPE_FUNCTION, dbid);
+    for (p = entlist; p->body != NULL; p++)
     {
+        PgStat_StatFuncEntry *funcentry = (PgStat_StatFuncEntry *) p->body;
+
         fputc('F', fpout);
+        rc = fwrite(&p->key.objectid, sizeof(Oid), 1, fpout);
         rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    pfree(entlist);
 
     /*
      * No more output to be done. Close the temp file and replace the old
@@ -5127,102 +4963,196 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
+}
+
+/* ----------
+ * create_missing_dbentries() -
+ *
+ *  A database entry may be missing for a database in which object stats
+ *  are recorded. This function creates such missing dbentries so that
+ *  all stats entries can be written out to files.
+ * ----------
+ */
+static void
+create_missing_dbentries(void)
+{
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
+    HTAB       *oidhash;
+    HASHCTL        ctl;
+    HASH_SEQ_STATUS scan;
+    Oid           *poid;
+
+    memset(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(Oid);
+    ctl.entrysize = sizeof(Oid);
+    ctl.hcxt = CurrentMemoryContext;
+    oidhash = hash_create("Temporary table of OIDs",
+                          PGSTAT_TABLE_HASH_SIZE,
+                          &ctl,
+                          HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+    /* Collect database OIDs from the shared stats hash */
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+        hash_search(oidhash, &p->key.databaseid, HASH_ENTER, NULL);
+    dshash_seq_term(&hstat);
+
+    /* Create database entries that do not exist yet. */
+    hash_seq_init(&scan, oidhash);
+    while ((poid = (Oid *) hash_seq_search(&scan)) != NULL)
+        (void) get_stat_entry(PGSTAT_TYPE_DB, *poid, InvalidOid,
+                              false, true, NULL);
+
+    hash_destroy(oidhash);
+}
+
+
+/* ----------
+ * get_stat_entry() -
+ *
+ *  Get the shared stats entry for the specified type, dbid and objid.
+ *  If nowait is true, returns NULL on lock failure.
+ *
+ *  If create is true, a new entry is created if it does not exist yet.
+ *  If found is not NULL, it is set to true if an existing entry is found
+ *  or false if not.
+ *  ----------
+ */
+static PgStat_StatEntryHeader *
+get_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait, bool create,
+               bool *found)
+{
+    PgStatHashEntry           *shhashent;
+    PgStatLocalHashEntry   *lohashent;
+    PgStat_StatEntryHeader *shheader = NULL;
+    PgStatHashEntryKey        key;
+    bool                    myfound;
 
-    if (permanent)
+    Assert(type != PGSTAT_TYPE_ALL);
+
+    key.type        = type;
+    key.databaseid     = dbid;
+    key.objectid    = objid;
+
+    if (pgStatEntHash)
     {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
+        uint64 currage;
+
+        currage = pg_atomic_read_u64(&StatsShmem->del_ent_count);
 
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
+        if (currage == pgStatEntHashAge)
+        {
+            lohashent = pgstat_cache_lookup(pgStatEntHash, key);
+
+            if (lohashent)
+            {
+                if (found)
+                    *found = true;
+                return lohashent->body;
+            }
+        }
+        else
+        {
+            pgstat_cache_destroy(pgStatEntHash);
+            pgStatEntHash = NULL;
+        }
     }
+
+    shhashent = dshash_find_extended(pgStatSharedHash, &key,
+                                     create, nowait, create, &myfound);
+    if (shhashent)
+    {
+        if (create && !myfound)
+        {
+            /* Create new stats entry. */
+            dsa_pointer chunk = dsa_allocate0(area,
+                                              pgstat_sharedentsize[type]);
+
+            shheader = dsa_get_address(area, chunk);
+            LWLockInitialize(&shheader->lock, LWTRANCHE_STATS);
+
+            /* Link the new entry from the hash entry. */
+            shhashent->body = chunk;
+        }
+        else
+            shheader = dsa_get_address(area, shhashent->body);
+
+        dshash_release_lock(pgStatSharedHash, shhashent);
+
+        /* register to local hash if possible */
+        if (pgStatEntHash || CacheMemoryContext)
+        {
+            bool        lofound;    /* keep myfound intact for the caller */
+
+            if (pgStatEntHash == NULL)
+            {
+                pgStatEntHash =
+                    pgstat_cache_create(CacheMemoryContext,
+                                        PGSTAT_TABLE_HASH_SIZE, NULL);
+                pgStatEntHashAge =
+                    pg_atomic_read_u64(&StatsShmem->del_ent_count);
+            }
+            lohashent = pgstat_cache_insert(pgStatEntHash, key, &lofound);
+            Assert(!lofound);
+            lohashent->body = shheader;
+        }
+    }
+
+    if (found)
+        *found = myfound;
+
+    return shheader;
 }
 
+
 /* ----------
  * pgstat_read_statsfiles() -
  *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ *    Reads in existing activity statistics files into the shared stats hash.
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
- *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+void
+pgstat_read_statsfiles(void)
 {
     PgStat_StatDBEntry *dbentry;
     PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-    int            i;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
 
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
+    /* shouldn't be called from postmaster */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-    memset(&slruStats, 0, sizeof(slruStats));
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Set the same reset timestamp for all SLRU items too.
-     */
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-        slruStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    StatsShmem->global_stats.stat_reset_timestamp = GetCurrentTimestamp();
+    StatsShmem->archiver_stats.stat_reset_timestamp =
+        StatsShmem->global_stats.stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the activity statistics is not running or
+     * has not yet written the stats file the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -5231,7 +5161,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
@@ -5239,52 +5169,46 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     /*
      * Read global stats struct
      */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
+    if (fread(&StatsShmem->global_stats, 1, sizeof(PgStat_GlobalStats), fpin) !=
+        sizeof(PgStat_GlobalStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
+        MemSet(&StatsShmem->global_stats, 0, sizeof(PgStat_GlobalStats));
         goto done;
     }
 
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
     /*
      * Read archiver stats struct
      */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
+    if (fread(&StatsShmem->archiver_stats, 1, sizeof(PgStat_ArchiverStats),
+              fpin) != sizeof(PgStat_ArchiverStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
+        MemSet(&StatsShmem->archiver_stats, 0, sizeof(PgStat_ArchiverStats));
         goto done;
     }
 
     /*
      * Read SLRU stats struct
      */
-    if (fread(slruStats, 1, sizeof(slruStats), fpin) != sizeof(slruStats))
+    if (fread(&StatsShmem->slru_stats, 1, sizeof(PgStatSharedSLRUStats),
+              fpin) != sizeof(PgStatSharedSLRUStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&slruStats, 0, sizeof(slruStats));
         goto done;
     }
 
     /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
     for (;;)
     {
+        Oid dbid;
+
         switch (fgetc(fpin))
         {
                 /*
@@ -5292,10 +5216,17 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                  * follows.
                  */
             case 'D':
+                if (fread(&dbid, 1, sizeof(Oid), fpin) != sizeof(Oid))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
                 if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
                           fpin) != offsetof(PgStat_StatDBEntry, tables))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5304,76 +5235,32 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 /*
                  * Add to the DB hash
                  */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
+
+                dbentry = (PgStat_StatDBEntry *)
+                    get_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid,
+                                   false, true, &found);
+
+                /* don't allow duplicate dbentries */
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
+                memcpy(dbentry, &dbbuf,
+                       offsetof(PgStat_StatDBEntry, tables));
 
+                /* Read the data from the database-specific file. */
+                pgstat_read_db_statsfile(dbid);
                 break;
 
             case 'E':
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5383,34 +5270,22 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 done:
     FreeFile(fpin);
 
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 
-    return dbhash;
+    return;
 }
 
 
 /* ----------
  * pgstat_read_db_statsfile() -
  *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
+ *    Reads in the at-rest statistics file and creates shared statistics
+ *    tables. The file is removed after reading.
  * ----------
  */
 static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
+pgstat_read_db_statsfile(Oid dbid)
 {
     PgStat_StatTabEntry *tabentry;
     PgStat_StatTabEntry tabbuf;
@@ -5421,21 +5296,21 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     bool        found;
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the activity statistics is not running or
+     * has not yet written the stats file the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
@@ -5448,45 +5323,49 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
 
     /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
     for (;;)
     {
+        Oid objid;
+
         switch (fgetc(fpin))
         {
                 /*
                  * 'T'    A PgStat_StatTabEntry follows.
                  */
             case 'T':
+                if (fread(&objid, 1, sizeof(Oid), fpin) != sizeof(Oid))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
                 if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
                           fpin) != sizeof(PgStat_StatTabEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
+                tabentry = (PgStat_StatTabEntry *)
+                    get_stat_entry(PGSTAT_TYPE_TABLE, dbid, objid,
+                                   false, true, &found);
 
+                /* don't allow duplicate entries */
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5499,28 +5378,29 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                  * 'F'    A PgStat_StatFuncEntry follows.
                  */
             case 'F':
+                if (fread(&objid, 1, sizeof(Oid), fpin) != sizeof(Oid))
+                {
+                    ereport(LOG,
+                            (errmsg("corrupted statistics file \"%s\"",
+                                    statfile)));
+                    goto done;
+                }
                 if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
                           fpin) != sizeof(PgStat_StatFuncEntry))
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
                 }
 
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
+                funcentry = (PgStat_StatFuncEntry *)
+                    get_stat_entry(PGSTAT_TYPE_FUNCTION, dbid, objid,
+                                   false, true, &found);
 
                 if (found)
                 {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
+                    ereport(LOG,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
                     goto done;
@@ -5536,7 +5416,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 goto done;
 
             default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
+                ereport(LOG,
                         (errmsg("corrupted statistics file \"%s\"",
                                 statfile)));
                 goto done;
@@ -5545,308 +5425,8 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
 
 done:
     FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts. The caller must
- *    rely on timestamp stored in *ts iff the function returns true.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read SLRU stats struct
-     */
-    if (fread(mySLRUStats, 1, sizeof(mySLRUStats), fpin) != sizeof(mySLRUStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-        }
-    }
-
-done:
-    FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
-    {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
-
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
-
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
-
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
-    }
-
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
-
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
-}
-
-
-/* ----------
- * pgstat_setup_memcxt() -
- *
- *    Create pgStatLocalContext, if not already done.
- * ----------
- */
-static void
-pgstat_setup_memcxt(void)
-{
-    if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 }
 
 
@@ -5865,795 +5445,14 @@ pgstat_clear_snapshot(void)
 {
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
+    {
         MemoryContextDelete(pgStatLocalContext);
 
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
     }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-                tabentry->inserts_since_vacuum = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_resetslrucounter() -
- *
- *    Reset some SLRU statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len)
-{
-    int            i;
-    TimestampTz ts = GetCurrentTimestamp();
-
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /* reset entry with the given index, or all entries (index is -1) */
-        if ((msg->m_index == -1) || (msg->m_index == i))
-        {
-            memset(&slruStats[i], 0, sizeof(slruStats[i]));
-            slruStats[i].stat_reset_timestamp = ts;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signaling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * It is quite possible that a non-aggressive VACUUM ended up skipping
-     * various pages, however, we'll zero the insert counter here regardless.
-     * It's currently used only to track when we need to perform an "insert"
-     * autovacuum, which are mainly intended to freeze newly inserted tuples.
-     * Zeroing this may just mean we'll not try to vacuum the table again
-     * until enough tuples have been inserted to trigger another insert
-     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
-     * stragglers.
-     */
-    tabentry->inserts_since_vacuum = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_slru() -
- *
- *    Process a SLRU message.
- * ----------
- */
-static void
-pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
-{
-    slruStats[msg->m_index].blocks_zeroed += msg->m_blocks_zeroed;
-    slruStats[msg->m_index].blocks_hit += msg->m_blocks_hit;
-    slruStats[msg->m_index].blocks_read += msg->m_blocks_read;
-    slruStats[msg->m_index].blocks_written += msg->m_blocks_written;
-    slruStats[msg->m_index].blocks_exists += msg->m_blocks_exists;
-    slruStats[msg->m_index].flush += msg->m_flush;
-    slruStats[msg->m_index].truncate += msg->m_truncate;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
 }
 
 /*
@@ -6744,7 +5543,7 @@ pgstat_slru_name(int slru_idx)
  * Returns pointer to entry with counters for given SLRU (based on the name
  * stored in SlruCtl as lwlock tranche name).
  */
-static inline PgStat_MsgSLRU *
+static PgStat_StatSLRUEntry *
 slru_entry(int slru_idx)
 {
     /*
@@ -6755,7 +5554,7 @@ slru_entry(int slru_idx)
 
     Assert((slru_idx >= 0) && (slru_idx < SLRU_NUM_ELEMENTS));
 
-    return &SLRUStats[slru_idx];
+    return &local_SLRUStats[slru_idx];
 }
 
 /*
@@ -6765,41 +5564,48 @@ slru_entry(int slru_idx)
 void
 pgstat_count_slru_page_zeroed(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_zeroed += 1;
+    slru_entry(slru_idx)->blocks_zeroed += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_hit(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_hit += 1;
+    slru_entry(slru_idx)->blocks_hit += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_exists(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_exists += 1;
+    slru_entry(slru_idx)->blocks_exists += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_read(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_read += 1;
+    slru_entry(slru_idx)->blocks_read += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_written(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_written += 1;
+    slru_entry(slru_idx)->blocks_written += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_flush(int slru_idx)
 {
-    slru_entry(slru_idx)->m_flush += 1;
+    slru_entry(slru_idx)->flush += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_truncate(int slru_idx)
 {
-    slru_entry(slru_idx)->m_truncate += 1;
+    slru_entry(slru_idx)->truncate += 1;
+    have_slrustats = true;
 }
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b811c961a6..526021def2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -257,7 +257,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -518,7 +517,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1340,12 +1338,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1794,11 +1786,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2728,8 +2715,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -3056,8 +3041,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3124,13 +3107,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3203,22 +3179,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3679,22 +3639,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3914,8 +3858,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3939,8 +3881,7 @@ PostmasterStateMachine(void)
     {
         /*
          * PM_WAIT_DEAD_END state ends when the BackendList is entirely empty
-         * (ie, no dead_end children remain), and the archiver and stats
-         * collector are gone too.
+         * (ie, no dead_end children remain), and the archiver is gone too.
          *
          * The reason we wait for those two is to protect them against a new
          * postmaster starting conflicting subprocesses; this isn't an
@@ -3950,8 +3891,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4152,8 +4092,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5130,18 +5068,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5260,12 +5186,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6170,7 +6090,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6226,8 +6145,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6462,7 +6379,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index b89df01fa7..57531d7d48 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1556,8 +1556,8 @@ is_checksummed_file(const char *fullpath, const char *filename)
  *
  * If 'missing_ok' is true, will not throw an error if the file is not found.
  *
- * If dboid is anything other than InvalidOid then any checksum failures detected
- * will get reported to the stats collector.
+ * If dboid is anything other than InvalidOid then any checksum failures
+ * detected will get reported to the activity stats facility.
  *
  * Returns true if the file was successfully sent, false if 'missing_ok',
  * and the file did not exist.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a2a963bd5b..61fa52ed66 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2045,7 +2045,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                BgWriterStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2155,7 +2155,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2345,7 +2345,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2353,7 +2353,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
 
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 96c2aaabbd..55693cfa3a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -149,6 +149,7 @@ CreateSharedMemoryAndSemaphores(void)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -267,6 +268,7 @@ CreateSharedMemoryAndSemaphores(void)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 2fa90cc095..17a46db74b 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -177,7 +177,9 @@ static const char *const BuiltinTrancheNames[] = {
     /* LWTRANCHE_PARALLEL_APPEND: */
     "ParallelAppend",
     /* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
-    "PerXactPredicateList"
+    "PerXactPredicateList",
+    /* LWTRANCHE_STATS: */
+    "ActivityStatistics"
 };
 
 StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index 774292fd94..91bf8d5b5d 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -53,3 +53,4 @@ XactTruncationLock                    44
 # 45 was XactTruncationLock until removal of BackendRandomLock
 WrapLimitsVacuumLock                46
 NotifyQueueTailLock                    47
+StatsLock                            48
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index dcc09df0c7..0dec4b9145 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -415,8 +415,8 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
     DropRelFileNodesAllBuffers(rnodes, nrels);
 
     /*
-     * It'd be nice to tell the stats collector to forget them immediately,
-     * too. But we can't because we don't know the OIDs.
+     * It'd be nice to tell the activity stats facility to forget them
+     * immediately, too. But we can't because we don't know the OIDs.
      */
 
     /*
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 411cfadbff..5043736f1f 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3209,6 +3209,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3783,6 +3789,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4179,11 +4186,12 @@ PostgresMain(int argc, char *argv[],
          * Note: this includes fflush()'ing the last of the prior output.
          *
          * This is also a good time to send collected statistics to the
-         * collector, and to update the PS stats display.  We avoid doing
-         * those every time through the message loop because it'd slow down
-         * processing of batched messages, and because we don't want to report
-         * uncommitted updates (that confuses autovacuum).  The notification
-         * processor wants a call too, if we are not in a transaction block.
+         * activity statistics, and to update the PS stats display.  We avoid
+         * doing those every time through the message loop because it'd slow
+         * down processing of batched messages, and because we don't want to
+         * report uncommitted updates (that confuses autovacuum).  The
+         * notification processor wants a call too, if we are not in a
+         * transaction block.
          */
         if (send_ready_for_query)
         {
@@ -4215,6 +4223,8 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long stats_timeout;
+
                 /* Send out notify signals and transmit self-notifies */
                 ProcessCompletedNotifies();
 
@@ -4227,8 +4237,13 @@ PostgresMain(int argc, char *argv[],
                 if (notifyInterruptPending)
                     ProcessNotifyInterrupt();
 
-                pgstat_report_stat(false);
-
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle");
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4263,7 +4278,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4271,6 +4286,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 95738a4e34..3e8ce0b3bf 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * pgstatfuncs.c
- *      Functions for accessing the statistics collector data
+ *      Functions for accessing the activity statistics data
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -35,9 +35,6 @@
 
 #define HAS_PGSTAT_PERMISSIONS(role)     (is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS) || has_privs_of_role(GetUserId(),role))
 
 
-/* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
-
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
 {
@@ -1267,7 +1264,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_commit);
+        result = (int64) (dbentry->counts.n_xact_commit);
 
     PG_RETURN_INT64(result);
 }
@@ -1283,7 +1280,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_rollback);
+        result = (int64) (dbentry->counts.n_xact_rollback);
 
     PG_RETURN_INT64(result);
 }
@@ -1299,7 +1296,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_fetched);
+        result = (int64) (dbentry->counts.n_blocks_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1315,7 +1312,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_hit);
+        result = (int64) (dbentry->counts.n_blocks_hit);
 
     PG_RETURN_INT64(result);
 }
@@ -1331,7 +1328,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_returned);
+        result = (int64) (dbentry->counts.n_tuples_returned);
 
     PG_RETURN_INT64(result);
 }
@@ -1347,7 +1344,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_fetched);
+        result = (int64) (dbentry->counts.n_tuples_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1363,7 +1360,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_inserted);
+        result = (int64) (dbentry->counts.n_tuples_inserted);
 
     PG_RETURN_INT64(result);
 }
@@ -1379,7 +1376,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_updated);
+        result = (int64) (dbentry->counts.n_tuples_updated);
 
     PG_RETURN_INT64(result);
 }
@@ -1395,7 +1392,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_deleted);
+        result = (int64) (dbentry->counts.n_tuples_deleted);
 
     PG_RETURN_INT64(result);
 }
@@ -1428,7 +1425,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_files;
+        result = dbentry->counts.n_temp_files;
 
     PG_RETURN_INT64(result);
 }
@@ -1444,7 +1441,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_bytes;
+        result = dbentry->counts.n_temp_bytes;
 
     PG_RETURN_INT64(result);
 }
@@ -1459,7 +1456,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace);
+        result = (int64) (dbentry->counts.n_conflict_tablespace);
 
     PG_RETURN_INT64(result);
 }
@@ -1474,7 +1471,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_lock);
+        result = (int64) (dbentry->counts.n_conflict_lock);
 
     PG_RETURN_INT64(result);
 }
@@ -1489,7 +1486,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_snapshot);
+        result = (int64) (dbentry->counts.n_conflict_snapshot);
 
     PG_RETURN_INT64(result);
 }
@@ -1504,7 +1501,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_bufferpin);
+        result = (int64) (dbentry->counts.n_conflict_bufferpin);
 
     PG_RETURN_INT64(result);
 }
@@ -1519,7 +1516,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1534,11 +1531,11 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace +
-                          dbentry->n_conflict_lock +
-                          dbentry->n_conflict_snapshot +
-                          dbentry->n_conflict_bufferpin +
-                          dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_tablespace +
+                          dbentry->counts.n_conflict_lock +
+                          dbentry->counts.n_conflict_snapshot +
+                          dbentry->counts.n_conflict_bufferpin +
+                          dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1553,7 +1550,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_deadlocks);
+        result = (int64) (dbentry->counts.n_deadlocks);
 
     PG_RETURN_INT64(result);
 }
@@ -1571,7 +1568,7 @@ pg_stat_get_db_checksum_failures(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_checksum_failures);
+        result = (int64) (dbentry->counts.n_checksum_failures);
 
     PG_RETURN_INT64(result);
 }
@@ -1608,7 +1605,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_read_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_read_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1624,7 +1621,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_write_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_write_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1710,7 +1707,7 @@ pg_stat_get_slru(PG_FUNCTION_ARGS)
     MemoryContext per_query_ctx;
     MemoryContext oldcontext;
     int            i;
-    PgStat_SLRUStats *stats;
+    PgStat_StatSLRUEntry *stats;
 
     /* check to see if caller supports us returning a tuplestore */
     if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
@@ -1744,7 +1741,7 @@ pg_stat_get_slru(PG_FUNCTION_ARGS)
         /* for each row */
         Datum        values[PG_STAT_GET_SLRU_COLS];
         bool        nulls[PG_STAT_GET_SLRU_COLS];
-        PgStat_SLRUStats stat = stats[i];
+        PgStat_StatSLRUEntry    stat = stats[i];
         const char *name;
 
         name = pgstat_slru_name(i);
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 6ab8216839..883c7f8802 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index ed2ab4b5b2..74fb22f216 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -269,9 +269,6 @@ GetBackendTypeDesc(BackendType backendType)
         case B_ARCHIVER:
             backendDesc = "archiver";
             break;
-        case B_STATS_COLLECTOR:
-            backendDesc = "stats collector";
-            break;
         case B_LOGGER:
             backendDesc = "logger";
             break;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index d4ab4c7e23..4ff4cc33d9 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -631,6 +632,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1245,6 +1248,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 596bcb7b84..29eb459e35 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -743,8 +743,8 @@ const char *const config_group_names[] =
     gettext_noop("Statistics"),
     /* STATS_MONITORING */
     gettext_noop("Statistics / Monitoring"),
-    /* STATS_COLLECTOR */
-    gettext_noop("Statistics / Query and Index Statistics Collector"),
+    /* STATS_ACTIVITY */
+    gettext_noop("Statistics / Query and Index Activity Statistics"),
     /* AUTOVACUUM */
     gettext_noop("Autovacuum"),
     /* CLIENT_CONN */
@@ -1454,7 +1454,7 @@ static struct config_bool ConfigureNamesBool[] =
 #endif
 
     {
-        {"track_activities", PGC_SUSET, STATS_COLLECTOR,
+        {"track_activities", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects information about executing commands."),
             gettext_noop("Enables the collection of information on the currently "
                          "executing command of each session, along with "
@@ -1465,7 +1465,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_counts", PGC_SUSET, STATS_COLLECTOR,
+        {"track_counts", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects statistics on database activity."),
             NULL
         },
@@ -1474,7 +1474,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_io_timing", PGC_SUSET, STATS_COLLECTOR,
+        {"track_io_timing", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects timing statistics for database I/O activity."),
             NULL
         },
@@ -4310,7 +4310,7 @@ static struct config_string ConfigureNamesString[] =
     },
 
     {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
+        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
             gettext_noop("Writes temporary statistics files to the specified directory."),
             NULL,
             GUC_SUPERUSER_ONLY
@@ -4646,7 +4646,7 @@ static struct config_enum ConfigureNamesEnum[] =
     },
 
     {
-        {"track_functions", PGC_SUSET, STATS_COLLECTOR,
+        {"track_functions", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects function-level statistics on database activity."),
             NULL
         },
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9cb571f7cc..668a2d033a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -579,7 +579,7 @@
 # STATISTICS
 #------------------------------------------------------------------------------
 
-# - Query and Index Statistics Collector -
+# - Query and Index Activity Statistics -
 
 #track_activities = on
 #track_counts = on
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index f674a7c94e..26603e95e4 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -124,7 +124,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index bba002ad24..b70c495f2a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -83,6 +83,8 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
 
@@ -321,7 +323,6 @@ typedef enum BackendType
     B_WAL_SENDER,
     B_WAL_WRITER,
     B_ARCHIVER,
-    B_STATS_COLLECTOR,
     B_LOGGER,
 } BackendType;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0dfbac46b4..396ecdb53f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL activity statistics facility.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -15,9 +15,11 @@
 #include "libpq/pqcomm.h"
 #include "miscadmin.h"
 #include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -41,35 +43,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_RESETSLRUCOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_SLRU,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -80,9 +53,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -113,18 +85,17 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_delta_live_tuples;
     PgStat_Counter t_delta_dead_tuples;
     PgStat_Counter t_changed_tuples;
+    bool        reset_changed_tuples;
 
     PgStat_Counter t_blocks_fetched;
     PgStat_Counter t_blocks_hit;
+
+    PgStat_Counter vacuum_count;
+    PgStat_Counter autovac_vacuum_count;
+    PgStat_Counter analyze_count;
+    PgStat_Counter autovac_analyze_count;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -158,6 +129,10 @@ typedef struct PgStat_TableStatus
     Oid            t_id;            /* table's OID */
     bool        t_shared;        /* is it a shared catalog? */
     struct PgStat_TableXactStatus *trans;    /* lowest subxact's counts */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_TableCounts t_counts;    /* event counts to be sent */
 } PgStat_TableStatus;
 
@@ -183,308 +158,32 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
-/* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
- */
-typedef struct PgStat_MsgHdr
-{
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgResetslrucounter Sent by the backend to tell the collector
- *                                to reset a SLRU counter
- * ----------
- */
-typedef struct PgStat_MsgResetslrucounter
-{
-    PgStat_MsgHdr m_hdr;
-    int            m_index;
-} PgStat_MsgResetslrucounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgSLRU            Sent by a backend to update SLRU statistics.
- * ----------
- */
-typedef struct PgStat_MsgSLRU
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Counter m_index;
-    PgStat_Counter m_blocks_zeroed;
-    PgStat_Counter m_blocks_hit;
-    PgStat_Counter m_blocks_read;
-    PgStat_Counter m_blocks_written;
-    PgStat_Counter m_blocks_exists;
-    PgStat_Counter m_flush;
-    PgStat_Counter m_truncate;
-} PgStat_MsgSLRU;
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
 /* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgTempFile
+typedef struct PgStat_BgWriter
 {
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter buf_alloc;
+    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+}            PgStat_BgWriter;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -516,98 +215,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgResetslrucounter msg_resetslrucounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgSLRU msg_slru;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Activity statistics data structures, on file and in shared memory, follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -616,13 +225,9 @@ typedef union PgStat_Msg
 
 #define PGSTAT_FILE_FORMAT_ID    0x01A5BC9D
 
-/* ----------
- * PgStat_StatDBEntry            The collector's data per database
- * ----------
- */
-typedef struct PgStat_StatDBEntry
+
+typedef struct PgStat_StatDBCounts
 {
-    Oid            databaseid;
     PgStat_Counter n_xact_commit;
     PgStat_Counter n_xact_rollback;
     PgStat_Counter n_blocks_fetched;
@@ -632,7 +237,6 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_tuples_inserted;
     PgStat_Counter n_tuples_updated;
     PgStat_Counter n_tuples_deleted;
-    TimestampTz last_autovac_time;
     PgStat_Counter n_conflict_tablespace;
     PgStat_Counter n_conflict_lock;
     PgStat_Counter n_conflict_snapshot;
@@ -642,29 +246,83 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_temp_bytes;
     PgStat_Counter n_deadlocks;
     PgStat_Counter n_checksum_failures;
-    TimestampTz last_checksum_failure;
     PgStat_Counter n_block_read_time;    /* times in microseconds */
     PgStat_Counter n_block_write_time;
+} PgStat_StatDBCounts;
 
+/* ----------
+ * PgStat_StatEntryHeader        common header struct for PgStat_Stat*Entry
+ * ----------
+ */
+typedef struct PgStat_StatEntryHeader
+{
+    LWLock        lock;
+} PgStat_StatEntryHeader;
+
+/* ----------
+ * PgStat_StatDBEntry            The statistics per database
+ * ----------
+ */
+typedef struct PgStat_StatDBEntry
+{
+    PgStat_StatEntryHeader header;    /* must be the first member;
+                                     * used only in shared memory */
+    TimestampTz last_autovac_time;
+    TimestampTz last_checksum_failure;
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;    /* time of db stats update */
+
+    PgStat_StatDBCounts counts;
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following members must be last in the struct, because we don't write
+     * them out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables; /* current gen tables hash */
+    dshash_table_handle functions;    /* current gen functions hash */
 } PgStat_StatDBEntry;
 
 
+/*
+ * SLRU statistics kept in the shared stats
+ */
+typedef struct PgStat_StatSLRUEntry
+{
+    PgStat_Counter blocks_zeroed;
+    PgStat_Counter blocks_hit;
+    PgStat_Counter blocks_read;
+    PgStat_Counter blocks_written;
+    PgStat_Counter blocks_exists;
+    PgStat_Counter flush;
+    PgStat_Counter truncate;
+    TimestampTz stat_reset_timestamp;
+} PgStat_StatSLRUEntry;
+
+
 /* ----------
- * PgStat_StatTabEntry            The collector's data per table (or index)
+ * PgStat_HashMountInfo  dshash mount information
+ * ----------
+ */
+typedef struct PgStat_HashMountInfo
+{
+    HTAB       *snapshot_tables;    /* table entry snapshot */
+    HTAB       *snapshot_functions; /* function entry snapshot */
+    dshash_table *dshash_tables;    /* attached tables dshash */
+    dshash_table *dshash_functions; /* attached functions dshash */
+} PgStat_HashMountInfo;
+
+/* ----------
+ * PgStat_StatTabEntry            The statistics per table (or index)
  * ----------
  */
 typedef struct PgStat_StatTabEntry
 {
-    Oid            tableid;
+    PgStat_StatEntryHeader header;    /* must be the first member;
+                                     * used only in shared memory */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
 
     PgStat_Counter numscans;
 
@@ -684,34 +342,29 @@ typedef struct PgStat_StatTabEntry
     PgStat_Counter blocks_fetched;
     PgStat_Counter blocks_hit;
 
-    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
     PgStat_Counter vacuum_count;
-    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_vacuum_count;
-    TimestampTz analyze_timestamp;    /* user initiated */
     PgStat_Counter analyze_count;
-    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_analyze_count;
 } PgStat_StatTabEntry;
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            per function stats data
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
 {
-    Oid            functionid;
-
+    PgStat_StatEntryHeader header;    /* must be the first member;
+                                     * used only in shared memory */
     PgStat_Counter f_numcalls;
 
     PgStat_Counter f_total_time;    /* times in microseconds */
     PgStat_Counter f_self_time;
 } PgStat_StatFuncEntry;
 
-
 /*
- * Archiver statistics kept in the stats collector
+ * Archiver statistics kept in the shared stats
  */
 typedef struct PgStat_ArchiverStats
 {
@@ -727,7 +380,7 @@ typedef struct PgStat_ArchiverStats
 } PgStat_ArchiverStats;
 
 /*
- * Global statistics kept in the stats collector
+ * Global statistics kept in the shared stats
  */
 typedef struct PgStat_GlobalStats
 {
@@ -745,21 +398,6 @@ typedef struct PgStat_GlobalStats
     TimestampTz stat_reset_timestamp;
 } PgStat_GlobalStats;
 
-/*
- * SLRU statistics kept in the stats collector
- */
-typedef struct PgStat_SLRUStats
-{
-    PgStat_Counter blocks_zeroed;
-    PgStat_Counter blocks_hit;
-    PgStat_Counter blocks_read;
-    PgStat_Counter blocks_written;
-    PgStat_Counter blocks_exists;
-    PgStat_Counter flush;
-    PgStat_Counter truncate;
-    TimestampTz stat_reset_timestamp;
-} PgStat_SLRUStats;
-
 
 /* ----------
  * Backend states
@@ -808,7 +446,7 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
+    WAIT_EVENT_READING_STATS_FILE,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
     WAIT_EVENT_WAL_RECEIVER_MAIN,
@@ -1060,7 +698,7 @@ typedef struct PgBackendGSSStatus
  *
  * Each live backend maintains a PgBackendStatus struct in shared memory
  * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
+ * BackendId, but that is not critical.)  Note that the stats-writing code
  * has no involvement in, or even access to, these structs.
  *
  * Each auxiliary process also maintains a PgBackendStatus struct in shared
@@ -1257,13 +895,15 @@ extern PGDLLIMPORT bool pgstat_track_counts;
 extern PGDLLIMPORT int pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used; to be removed together with the corresponding GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1278,29 +918,26 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
 
+/* File input/output functions  */
+extern void pgstat_read_statsfiles(void);
+extern void pgstat_write_statsfiles(void);
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 extern void pgstat_reset_slru_counter(const char *);
 
@@ -1462,8 +1099,8 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                                       void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
+extern void pgstat_report_archiver(const char *xlog, bool failed);
+extern void pgstat_report_bgwriter(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1472,13 +1109,16 @@ extern void pgstat_send_bgwriter(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_snapshot(bool shared,
+                                                                Oid relid);
+extern void pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
 extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
 extern PgStat_GlobalStats *pgstat_fetch_global(void);
-extern PgStat_SLRUStats *pgstat_fetch_slru(void);
+extern PgStat_StatSLRUEntry *pgstat_fetch_slru(void);
 
 extern void pgstat_count_slru_page_zeroed(int slru_idx);
 extern void pgstat_count_slru_page_hit(int slru_idx);
@@ -1489,5 +1129,6 @@ extern void pgstat_count_slru_flush(int slru_idx);
 extern void pgstat_count_slru_truncate(int slru_idx);
 extern const char *pgstat_slru_name(int slru_idx);
 extern int    pgstat_slru_index(const char *name);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index af9b41795d..621e074111 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -218,6 +218,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SHARED_TIDBITMAP,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_PER_XACT_PREDICATE_LIST,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 04431d0eb2..3b03464a1a 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -88,7 +88,7 @@ enum config_group
     PROCESS_TITLE,
     STATS,
     STATS_MONITORING,
-    STATS_COLLECTOR,
+    STATS_ACTIVITY,
     AUTOVACUUM,
     CLIENT_CONN,
     CLIENT_CONN_STATEMENT,
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 83a15f6795..77d1572a99 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.18.4
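
A note on the pgstat.h hunks above: the comment on PgStat_TableCounts says
the struct is compared against zeroes with memcmp to detect whether there are
any counts to write. Below is a minimal sketch of that pattern. The Demo*
names are hypothetical stand-ins, not the patch's definitions, and the flush
step is simplified to plain addition with no locking, which is an assumption
made only to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, cut-down stand-in for PgStat_TableCounts: counters only. */
typedef int64_t DemoCounter;

typedef struct DemoTableCounts
{
    DemoCounter numscans;
    DemoCounter tuples_inserted;
    DemoCounter blocks_hit;
} DemoTableCounts;

/*
 * Return true when at least one byte of the struct is nonzero, i.e. there
 * is something to write out.  The whole struct is compared, so every member
 * (and any padding) takes part in the check.
 */
static bool
demo_has_pending_counts(const DemoTableCounts *counts)
{
    static const DemoTableCounts all_zeroes = {0};

    assert(counts != NULL);
    return memcmp(counts, &all_zeroes, sizeof(DemoTableCounts)) != 0;
}

/*
 * Add the backend-local counts to the shared totals, then clear the local
 * struct with memset so that a later memcmp against zeroes succeeds.
 * (Assumption: the caller already holds whatever lock protects *shared.)
 */
static void
demo_flush_counts(DemoTableCounts *shared, DemoTableCounts *local)
{
    assert(shared != NULL && local != NULL);

    shared->numscans += local->numscans;
    shared->tuples_inserted += local->tuples_inserted;
    shared->blocks_hit += local->blocks_hit;

    memset(local, 0, sizeof(*local));
}
```

Because memcmp looks at every byte, a non-counter member or padding inside
the struct also takes part in the comparison. That is why the sketch clears
the local struct with memset rather than resetting it field by field.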

From 871fae3b2b190c12ec623ec8215eb52d13d9b3ff Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:09 +0900
Subject: [PATCH v37 5/6] Doc part of shared-memory based stats collector.

---
 doc/src/sgml/catalogs.sgml          |   6 +-
 doc/src/sgml/config.sgml            |  34 ++++----
 doc/src/sgml/high-availability.sgml |  13 +--
 doc/src/sgml/monitoring.sgml        | 127 +++++++++++++---------------
 doc/src/sgml/ref/pg_dump.sgml       |   9 +-
 5 files changed, 90 insertions(+), 99 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index de9bacd34f..69db5afc94 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -9209,9 +9209,9 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l
   <para>
    <xref linkend="view-table"/> lists the system views described here.
    More detailed documentation of each view follows below.
-   There are some additional views that provide access to the results of
-   the statistics collector; they are described in <xref
-   linkend="monitoring-stats-views-table"/>.
+   There are some additional views that provide access to the activity
+   statistics; they are described in
+   <xref linkend="monitoring-stats-views-table"/>.
   </para>
 
   <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8eabf93834..cc5dc1173f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7192,11 +7192,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
     <title>Run-time Statistics</title>
 
     <sect2 id="runtime-config-statistics-collector">
-     <title>Query and Index Statistics Collector</title>
+     <title>Query and Index Activity Statistics</title>
 
      <para>
-      These parameters control server-wide statistics collection features.
-      When statistics collection is enabled, the data that is produced can be
+      These parameters control server-wide activity statistics features.
+      When activity statistics are enabled, the data that is produced can be
       accessed via the <structname>pg_stat</structname> and
       <structname>pg_statio</structname> family of system views.
       Refer to <xref linkend="monitoring"/> for more information.
@@ -7212,14 +7212,13 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables the collection of information on the currently
-        executing command of each session, along with the time when
-        that command began execution. This parameter is on by
-        default. Note that even when enabled, this information is not
-        visible to all users, only to superusers and the user owning
-        the session being reported on, so it should not represent a
-        security risk.
-        Only superusers can change this setting.
+        Enables activity tracking on the currently executing command of
+        each session, along with the time when that command began
+        execution. This parameter is on by default. Note that even when
+        enabled, this information is not visible to all users, only to
+        superusers and the user owning the session being reported on, so it
+        should not represent a security risk.  Only superusers can change this
+        setting.
        </para>
       </listitem>
      </varlistentry>
@@ -7250,9 +7249,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables collection of statistics on database activity.
+        Enables tracking of database activity.
         This parameter is on by default, because the autovacuum
-        daemon needs the collected information.
+        daemon needs the activity information.
         Only superusers can change this setting.
        </para>
       </listitem>
@@ -8313,7 +8312,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Specifies the fraction of the total number of heap tuples counted in
-        the previous statistics collection that can be inserted without
+        the previously collected statistics that can be inserted without
         incurring an index scan at the <command>VACUUM</command> cleanup stage.
         This setting currently applies to B-tree indexes only.
        </para>
@@ -8325,9 +8324,10 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         the index contains deleted pages that can be recycled during cleanup.
         Index statistics are considered to be stale if the number of newly
         inserted tuples exceeds the <varname>vacuum_cleanup_index_scale_factor</varname>
-        fraction of the total number of heap tuples detected by the previous
-        statistics collection. The total number of heap tuples is stored in
-        the index meta-page. Note that the meta-page does not include this data
+
+        fraction of the total number of heap tuples in the previously
+        collected statistics. The total number of heap tuples is stored in the
+        index meta-page. Note that the meta-page does not include this data
         until <command>VACUUM</command> finds no dead tuples, so B-tree index
         scan at the cleanup stage can only be skipped if the second and
         subsequent <command>VACUUM</command> cycles detect no dead tuples.
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index beb309e668..45ec7ce68f 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2366,12 +2366,13 @@ LOG:  database system is ready to accept read only connections
    </para>
 
    <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
-    index usage, etc., will be recorded normally on the standby. Replayed
-    actions will not duplicate their effects on primary, so replaying an
-    insert will not increment the Inserts column of pg_stat_user_tables.
-    The stats file is deleted at the start of recovery, so stats from primary
-    and standby will differ; this is considered a feature, not a bug.
+    The activity statistics is collected during recovery. All scans, reads,
+    blocks, index usage, etc., will be recorded normally on the
+    standby. Replayed actions will not duplicate their effects on primary, so
+    replaying an insert will not increment the Inserts column of
+    pg_stat_user_tables.  The activity statistics is reset at the start of
+    recovery, so stats from primary and standby will differ; this is
+    considered a feature, not a bug.
    </para>
 
    <para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4e0193a967..7a04d58a1a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -22,7 +22,7 @@
   <para>
    Several tools are available for monitoring database activity and
    analyzing performance.  Most of this chapter is devoted to describing
-   <productname>PostgreSQL</productname>'s statistics collector,
+   <productname>PostgreSQL</productname>'s activity statistics,
    but one should not neglect regular Unix monitoring programs such as
    <command>ps</command>, <command>top</command>, <command>iostat</command>, and <command>vmstat</command>.
    Also, once one has identified a
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    primary server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   primary process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   primary process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
@@ -130,20 +128,21 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
  </sect1>
 
  <sect1 id="monitoring-stats">
-  <title>The Statistics Collector</title>
+  <title>The Activity Statistics</title>
 
   <indexterm zone="monitoring-stats">
    <primary>statistics</primary>
   </indexterm>
 
   <para>
-   <productname>PostgreSQL</productname>'s <firstterm>statistics collector</firstterm>
-   is a subsystem that supports collection and reporting of information about
-   server activity.  Presently, the collector can count accesses to tables
-   and indexes in both disk-block and individual-row terms.  It also tracks
-   the total number of rows in each table, and information about vacuum and
-   analyze actions for each table.  It can also count calls to user-defined
-   functions and the total time spent in each one.
+   <productname>PostgreSQL</productname>'s <firstterm>activity
+   statistics</firstterm> is a subsystem that supports tracking and reporting
+   of information about server activity.  Presently, the activity statistics
+   tracks the count of accesses to tables and indexes in both disk-block and
+   individual-row terms.  It also tracks the total number of rows in each
+   table, and information about vacuum and analyze actions for each table.  It
+   can also track calls to user-defined functions and the total time spent in
+   each one.
   </para>
 
   <para>
@@ -151,15 +150,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    information about exactly what is going on in the system right now, such as
    the exact command currently being executed by other server processes, and
    which other connections exist in the system.  This facility is independent
-   of the collector process.
+   of the activity statistics.
   </para>
 
  <sect2 id="monitoring-stats-setup">
-  <title>Statistics Collection Configuration</title>
+  <title>Activity Statistics Configuration</title>
 
   <para>
-   Since collection of statistics adds some overhead to query execution,
-   the system can be configured to collect or not collect information.
+   Since tracking for the activity statistics adds some overhead to query
+   execution, the system can be configured to track or not track activity.
    This is controlled by configuration parameters that are normally set in
    <filename>postgresql.conf</filename>.  (See <xref linkend="runtime-config"/> for
    details about setting configuration parameters.)
@@ -172,7 +171,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The parameter <xref linkend="guc-track-counts"/> controls whether
-   statistics are collected about table and index accesses.
+   to track activity about table and index accesses.
   </para>
 
   <para>
@@ -196,18 +195,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
   </para>
 
   <para>
-   The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
-   When the server shuts down cleanly, a permanent copy of the statistics
-   data is stored in the <filename>pg_stat</filename> subdirectory, so that
-   statistics can be retained across server restarts.  When recovery is
-   performed at server start (e.g., after immediate shutdown, server crash,
-   and point-in-time recovery), all statistics counters are reset.
+   When the server shuts down cleanly, a permanent copy of the statistics
+   data is stored in the <filename>pg_stat</filename> subdirectory, so that
+   statistics can be retained across server restarts.  When recovery is
+   performed at server start (e.g. after immediate shutdown, server crash,
+   and point-in-time recovery), all statistics counters are reset.
   </para>
 
  </sect2>
@@ -220,48 +212,46 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    linkend="monitoring-stats-dynamic-views-table"/>, are available to show
    the current state of the system. There are also several other
    views, listed in <xref
-   linkend="monitoring-stats-views-table"/>, available to show the results
-   of statistics collection.  Alternatively, one can
-   build custom views using the underlying statistics functions, as discussed
-   in <xref linkend="monitoring-stats-functions"/>.
+   linkend="monitoring-stats-views-table"/>, available to show the activity
+   statistics.  Alternatively, one can build custom views using the underlying
+   statistics functions, as discussed in
+   <xref linkend="monitoring-stats-functions"/>.
   </para>
 
   <para>
-   When using the statistics to monitor collected data, it is important
-   to realize that the information does not update instantaneously.
-   Each individual server process transmits new statistical counts to
-   the collector just before going idle; so a query or transaction still in
-   progress does not affect the displayed totals.  Also, the collector itself
-   emits a new report at most once per <varname>PGSTAT_STAT_INTERVAL</varname>
-   milliseconds (500 ms unless altered while building the server).  So the
-   displayed information lags behind actual activity.  However, current-query
-   information collected by <varname>track_activities</varname> is
-   always up-to-date.
+   When using the activity statistics, it is important to realize that the
+   information does not update instantaneously.  Each individual server
+   process writes out new statistical counts just before going idle, but not
+   more frequently than once per <varname>PGSTAT_STAT_INTERVAL</varname>
+   milliseconds (1 second unless altered while building the server); so a
+   query or transaction still in progress does not affect the displayed
+   totals.  However, current-query information tracked by
+   <varname>track_activities</varname> is always up-to-date.
   </para>
 
   <para>
    Another important point is that when a server process is asked to display
-   any of these statistics, it first fetches the most recent report emitted by
-   the collector process and then continues to use this snapshot for all
-   statistical views and functions until the end of its current transaction.
-   So the statistics will show static information as long as you continue the
-   current transaction.  Similarly, information about the current queries of
-   all sessions is collected when any such information is first requested
-   within a transaction, and the same information will be displayed throughout
-   the transaction.
-   This is a feature, not a bug, because it allows you to perform several
-   queries on the statistics and correlate the results without worrying that
-   the numbers are changing underneath you.  But if you want to see new
-   results with each query, be sure to do the queries outside any transaction
-   block.  Alternatively, you can invoke
+   any of these statistics, it first reads the current statistics and then
+   continues to use this snapshot for all statistical views and functions
+   until the end of its current transaction.  So the statistics will show
+   static information as long as you continue the current transaction.
+   Similarly, information about the current queries of all sessions is tracked
+   when any such information is first requested within a transaction, and the
+   same information will be displayed throughout the transaction.  This is a
+   feature, not a bug, because it allows you to perform several queries on the
+   statistics and correlate the results without worrying that the numbers are
+   changing underneath you.  But if you want to see new results with each
+   query, be sure to do the queries outside any transaction block.
+   Alternatively, you can invoke
    <function>pg_stat_clear_snapshot</function>(), which will discard the
    current transaction's statistics snapshot (if any).  The next use of
    statistical information will cause a new snapshot to be fetched.
   </para>
-
+  
   <para>
-   A transaction can also see its own statistics (as yet untransmitted to the
-   collector) in the views <structname>pg_stat_xact_all_tables</structname>,
+   A transaction can also see its own statistics (as yet unwritten to the
+   server-wide activity statistics) in the
+   views <structname>pg_stat_xact_all_tables</structname>,
    <structname>pg_stat_xact_sys_tables</structname>,
    <structname>pg_stat_xact_user_tables</structname>, and
    <structname>pg_stat_xact_user_functions</structname>.  These numbers do not act as
@@ -620,7 +610,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    kernel's I/O cache, and might therefore still be fetched without
    requiring a physical read. Users interested in obtaining more
    detailed information on <productname>PostgreSQL</productname> I/O behavior are
-   advised to use the <productname>PostgreSQL</productname> statistics collector
+   advised to use the <productname>PostgreSQL</productname> activity statistics
    in combination with operating system utilities that allow insight
    into the kernel's handling of I/O.
   </para>
@@ -1057,10 +1047,6 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>LogicalLauncherMain</literal></entry>
       <entry>Waiting in main loop of logical replication launcher process.</entry>
      </row>
-     <row>
-      <entry><literal>PgStatMain</literal></entry>
-      <entry>Waiting in main loop of statistics collector process.</entry>
-     </row>
      <row>
       <entry><literal>RecoveryWalStream</literal></entry>
       <entry>Waiting in main loop of startup process for WAL to arrive, during
@@ -1815,6 +1801,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
     </thead>
 
     <tbody>
+     <row>
+      <entry><literal>ActivityStatistics</literal></entry>
+      <entry>Waiting to write out activity statistics to shared memory.</entry>
+     </row>
      <row>
       <entry><literal>AddinShmemInit</literal></entry>
       <entry>Waiting to manage an extension's space allocation in shared
@@ -5738,9 +5728,10 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
      <entry><literal>performing final cleanup</literal></entry>
      <entry>
        <command>VACUUM</command> is performing final cleanup.  During this phase,
-       <command>VACUUM</command> will vacuum the free space map, update statistics
-       in <literal>pg_class</literal>, and report statistics to the statistics
-       collector.  When this phase is completed, <command>VACUUM</command> will end.
+       <command>VACUUM</command> will vacuum the free space map, update
+       statistics in <literal>pg_class</literal>, and update the system-wide
+       activity statistics.  When this phase is completed,
+       <command>VACUUM</command> will end.
      </entry>
     </row>
    </tbody>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index e09ed0a4c3..71bb24accf 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1290,11 +1290,10 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The database activity of <application>pg_dump</application> is
-   normally collected by the statistics collector.  If this is
-   undesirable, you can set parameter <varname>track_counts</varname>
-   to false via <envar>PGOPTIONS</envar> or the <literal>ALTER
-   USER</literal> command.
+   The database activity of <application>pg_dump</application> is normally
+   collected.  If this is undesirable, you can set
+   parameter <varname>track_counts</varname> to false
+   via <envar>PGOPTIONS</envar> or the <literal>ALTER USER</literal> command.
   </para>
 
  </refsect1>
-- 
2.18.4

From c146335c48d9f9d0740fc3137dc99477b16dc77e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 17:00:44 +0900
Subject: [PATCH v37 6/6] Remove the GUC stats_temp_directory

The GUC used to specify the directory to store temporary statistics
files. It is no longer needed by the stats collector but still used by
the programs in bin and contrib, and maybe other extensions. Thus this
patch removes the GUC but some backing variables and macro definitions
are left alone for backward compatibility.
---
 doc/src/sgml/backup.sgml                      |  2 -
 doc/src/sgml/config.sgml                      | 19 ---------
 doc/src/sgml/storage.sgml                     |  3 +-
 src/backend/postmaster/pgstat.c               | 13 +++---
 src/backend/replication/basebackup.c          | 13 ++----
 src/backend/utils/misc/guc.c                  | 41 -------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/pgstat.h                          |  5 ++-
 src/test/perl/PostgresNode.pm                 |  4 --
 9 files changed, 13 insertions(+), 88 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index b9331830f7..5096963234 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1146,8 +1146,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index cc5dc1173f..d8d99bb546 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7305,25 +7305,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 3234adb639..559f75fb54 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -122,8 +122,7 @@ Item
 
 <row>
  <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
+ <entry>Subdirectory containing ephemeral files for extensions</entry>
 </row>
 
 <row>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 1cd4cb20b6..68b5745bf9 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -98,15 +98,12 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
+/*
+ * This used to be a GUC variable and is no longer used in this file, but left
+ * alone just for backward compatibility for extensions, having the default
+ * value.
  */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+char       *pgstat_stat_directory = PG_STAT_TMP_DIR;
 
 /*
  * List of SLRU names that we keep stats for.  There is no central registry of
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 57531d7d48..f03720fa48 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -261,7 +261,6 @@ perform_base_backup(basebackup_options *opt)
     StringInfo    labelfile;
     StringInfo    tblspc_map_file;
     backup_manifest_info manifest;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
     backup_total = 0;
@@ -284,8 +283,6 @@ perform_base_backup(basebackup_options *opt)
     Assert(CurrentResourceOwner == NULL);
     CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -318,13 +315,9 @@ perform_base_backup(basebackup_options *opt)
          * Calculate the relative path of temporary statistics directory in
          * order to skip the files which are located in that directory later.
          */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
+
+        Assert(strchr(PG_STAT_TMP_DIR, '/') == NULL);
+        statrelpath = psprintf("./%s", PG_STAT_TMP_DIR);
 
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 29eb459e35..467f9299b3 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -202,7 +202,6 @@ static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource sourc
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_huge_page_size(int *newval, void **extra, GucSource source);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -4309,17 +4308,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_PRIMARY,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11608,35 +11596,6 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
     return true;
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 668a2d033a..7183c08305 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -586,7 +586,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 396ecdb53f..0b41156b3c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -32,7 +32,10 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
+/*
+ * This used to be the directory to store temporary statistics data in but is
+ * no longer used. Defined here for backward compatibility.
+ */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
 
 /* Values for track_functions GUC variable --- order is significant! */
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 1488bffa2b..bb5474b878 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -455,10 +455,6 @@ sub init
     print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
       if defined $ENV{TEMP_CONFIG};
 
-    # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
-    # concurrently must not share a stats_temp_directory.
-    print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
     if ($params{allows_streaming})
     {
         if ($params{allows_streaming} eq "logical")
-- 
2.18.4


Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
At Fri, 25 Sep 2020 09:27:26 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> Thanks for reviewing!
> 
> At Mon, 21 Sep 2020 19:47:04 -0700, Andres Freund <andres@anarazel.de> wrote in 
> > Hi,
> > 
> > On 2020-09-08 17:55:57 +0900, Kyotaro Horiguchi wrote:
> > > Locks on the shared statistics is acquired by the units of such like
> > > tables, functions so the expected chance of collision are not so high.
> > 
> > I can't really parse that...
> 
> Mmm... Is the following readable?
> 
> Shared statistics locks are acquired by units such as tables,
> functions, etc., so the chances of an expected collision are not so
> high.
> 
> Anyway, this is found to be wrong, so I removed it.

01: (Fixed?)
> > > Furthermore, until 1 second has elapsed since the last flushing to
> > > shared stats, lock failure postpones stats flushing so that lock
> > > contention doesn't slow down transactions.
> > 
> > I think I commented on that before, but to me 1s seems way too low to
> > switch to blocking lock acquisition. What's the reason for such a low
> > limit?
> 
> It was 0.5 seconds previously.  I don't have a clear idea of a
> reasonable value for it. One possible rationale might be to have 1000
> clients each have a writing time slot of 10ms.. So, 10s as the minimum
> interval. I set maximum interval to 60, and retry interval to
> 1s. (Fixed?)
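
The interval scheme discussed above can be sketched as follows. This is only an illustration and not code from the patch: all names are hypothetical, and the constants assume the figures quoted above (10 s minimum, 1 s retry, and a maximum of 60 taken to mean seconds). Flushing is skipped until the minimum interval has passed, then attempted without waiting for the lock, and waits for the lock only once the maximum interval is exceeded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constants in milliseconds, based on the values quoted above. */
#define SKETCH_MIN_INTERVAL_MS    10000  /* never flush more often than this */
#define SKETCH_RETRY_INTERVAL_MS   1000  /* delay after a failed no-wait try */
#define SKETCH_MAX_INTERVAL_MS    60000  /* past this, wait for the lock */

typedef enum
{
    SKETCH_FLUSH_SKIP,          /* too early: keep the counts local */
    SKETCH_FLUSH_NOWAIT,        /* try the lock, give up on contention */
    SKETCH_FLUSH_BLOCKING       /* overdue: block until the lock is taken */
} SketchFlushMode;

/* Decide how to flush, given the time of the last successful flush. */
SketchFlushMode
sketch_flush_mode(int64_t now_ms, int64_t last_flush_ms)
{
    int64_t     elapsed = now_ms - last_flush_ms;

    assert(elapsed >= 0);

    if (elapsed < SKETCH_MIN_INTERVAL_MS)
        return SKETCH_FLUSH_SKIP;
    if (elapsed < SKETCH_MAX_INTERVAL_MS)
        return SKETCH_FLUSH_NOWAIT;
    return SKETCH_FLUSH_BLOCKING;
}

/* Earliest time of the next attempt, depending on whether we flushed. */
int64_t
sketch_next_attempt(int64_t now_ms, bool flushed)
{
    return now_ms + (flushed ? SKETCH_MIN_INTERVAL_MS
                             : SKETCH_RETRY_INTERVAL_MS);
}
```

With these assumed values, a backend that keeps losing the lock retries once per second and starts waiting for the lock 60 seconds after its last successful flush.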

02: (I'd appreciate it if you could suggest the appropriate one.)
> > >      /*
> > > -     * Clean up any dead statistics collector entries for this DB. We always
> > > +     * Clean up any dead activity statistics entries for this DB. We always
> > >       * want to do this exactly once per DB-processing cycle, even if we find
> > >       * nothing worth vacuuming in the database.
> > >       */
> > 
> > What is "activity statistics"?
> 
> I don't get your point. It is formally the replacement word for
> "statistics collector". The "statistics collector (process)" no longer
> exists, so I had to invent a name for the successor mechanism that is
> distinguishable with data/column statistics.  If it is not the proper
> wording, I'd appreciate it if you could suggest the appropriate one.

03: (Fixed. Replaced with a far simpler cache implementation.)
> > > @@ -2816,8 +2774,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
> > >      }
> > >
> > >      /* fetch the pgstat table entry */
> > > -    tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
> > > -                                         shared, dbentry);
> > > +    tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
> > > +                                                   relid);
> > 
> > Why do all of these places deal with a snapshot? For most it seems to
> > make much more sense to just look up the entry and then copy that into
> > local memory?  There may be some place that need some sort of snapshot
> > behaviour that's stable until commit / pgstat_clear_snapshot(). But I
> > can't reallly see many?
> 
> Ok, I reread this thread and agree that there's a (vague) consensus to
> remove the snapshot stuff. Backend-statistics (bestats) still are
> stable during a transaction.

If we nuked the snapshot stuff completely, pgstatfuncs.c would need
many additional pfree()s, since it calls pgstat_fetch* many times for
the same object.  I chose to make the pgstat_fetch_stat_*() functions
return a result stored in static variables. It doesn't work in a
transactional way as before, but keeps the last result for a while;
the result is invalidated at transaction end at the latest.
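
For illustration, a minimal sketch of that idea, not the code in the
patch. All names here (TabEntry, lookup_shared, fetch_tabentry_cached,
at_xact_end) are hypothetical stand-ins: the fetch function hands out a
pointer to a static copy, so callers have nothing to pfree, and the
copy is invalidated at transaction end.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a table stats entry. */
typedef struct TabEntry
{
    uint32_t    relid;
    int64_t     numscans;
} TabEntry;

/* Hypothetical stand-in for the shared-memory lookup; counts its calls. */
static int  shared_lookups = 0;

static TabEntry
lookup_shared(uint32_t relid)
{
    TabEntry    e = {relid, 42};

    shared_lookups++;
    return e;
}

/* Last result, kept in static storage so callers never need to free it. */
static TabEntry cached_entry;
static bool cached_valid = false;

static const TabEntry *
fetch_tabentry_cached(uint32_t relid)
{
    assert(relid != 0);         /* 0 plays the role of InvalidOid here */

    /* Re-read shared memory only on a miss or after invalidation. */
    if (!cached_valid || cached_entry.relid != relid)
    {
        cached_entry = lookup_shared(relid);
        cached_valid = true;
    }
    return &cached_entry;
}

/* To be called at transaction end: the cached copy must not outlive it. */
static void
at_xact_end(void)
{
    cached_valid = false;
}
```

Repeated fetches of the same object within a transaction then cost one
shared lookup, at the price of the result being only "recent" rather
than a consistent snapshot.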

04: (Fixed.)
> > > +#define PGSTAT_MIN_INTERVAL            1000    /* Minimum interval of stats data
> > >
> > > +#define PGSTAT_MAX_INTERVAL            10000    /* Longest interval of stats data
> > > +                                             * updates */
> > 
> > These don't really seem to be in line with the commit message...
> 
> Oops! Sorry. Fixed both of this value and the commit message (and the
> file comment).

05: (The struct is gone.)
> > > + * dshash pgStatSharedHash
> > > + *    -> PgStatHashEntry                (dshash entry)
> > > + *      (dsa_pointer)-> PgStatEnvelope    (dsa memory block)
> > 
> > I don't like 'Envelope' that much. If I understand you correctly that's
> > a common prefix that's used for all types of stat objects, correct? If
> > so, how about just naming it PgStatEntryBase or such? I think it'd also
> > be useful to indicate in the "are stored as" part that PgStatEnvelope is
> > just the common prefix for an allocation.
> 
> The name makes sense. Thanks! (But the struct is now gone..)

06: (Fixed.)
> > > -typedef struct TabStatHashEntry
> > > +static size_t pgstat_entsize[] =
> > 
> > > +/* Ditto for local statistics entries */
> > > +static size_t pgstat_localentsize[] =
> > > +{
> > > +    0,                            /* PGSTAT_TYPE_ALL: not an entry */
> > > +    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
> > > +    sizeof(PgStat_TableStatus), /* PGSTAT_TYPE_TABLE */
> > > +    sizeof(PgStat_BackendFunctionEntry) /* PGSTAT_TYPE_FUNCTION */
> > > +};
> > 
> > These probably should be const as well.
> 
> Right. Fixed.

07: (Fixed.)
> > >  /*
> > > - * Backends store per-function info that's waiting to be sent to the collector
> > > - * in this hash table (indexed by function OID).
> > > + * Stats numbers that are waiting for flushing out to shared stats are held in
> > > + * pgStatLocalHash,
> > >   */
> > > -static HTAB *pgStatFunctions = NULL;
> > > +typedef struct PgStatHashEntry
> > > +{
> > > +    PgStatHashEntryKey key;        /* hash key */
> > > +    dsa_pointer env;            /* pointer to shared stats envelope in DSA */
> > > +}            PgStatHashEntry;
> > > +
> > > +/* struct for shared statistics entry pointed from shared hash entry. */
> > > +typedef struct PgStatEnvelope
> > > +{
> > > +    PgStatTypes type;            /* statistics entry type */
> > > +    Oid            databaseid;        /* databaseid */
> > > +    Oid            objectid;        /* objectid */
> > 
> > Do we need this information both here and in PgStatHashEntry? It's
> > possible that it's worthwhile, but I am not sure it is.
> 
> Same key values were stored in PgStatEnvelope, PgStat(Local)HashEntry,
> and PgStat_Stats*Entry. And I thought the same while developing. After
> some thoughts, I managed to remove the duplicate values other than
> PgStat(Local)HashEntry. Fixed.

08: (Fixed.)
> > > +    size_t        len;            /* length of body, fixed per type. */
> > 
> > Why do we need this? Isn't that something that can easily be looked up
> > using the type?
> 
> Not only they are virtually fixed values, but they were found to be
> write-only variables. Removed.

09: (Fixed. "Envelope" is embedded in stats entry structs.)
> > > +    LWLock        lock;            /* lightweight lock to protect body */
> > > +    int            body[FLEXIBLE_ARRAY_MEMBER];    /* statistics body */
> > > +}            PgStatEnvelope;
> > 
> > What you're doing here with 'body' doesn't provide enough guarantees
> > about proper alignment. E.g. if one of the entry types wants to store a
> > double, this won't portably work, because there's platforms that have 4
> > byte alignment for ints, but 8 byte alignment for doubles.
> > 
> > 
> > Wouldn't it be better to instead embed PgStatEnvelope into the struct
> > that's actually stored? E.g. something like
> > 
> > struct PgStat_TableStatus
> > {
> >     PgStatEnvelope header; /* I'd rename the type */
> >     TimestampTz vacuum_timestamp;    /* user initiated vacuum */
> >     ...
> > }
> > 
> > or if you don't want to do that because it'd require declaring
> > PgStatEnvelope in the header (not sure that'd really be worth avoiding),
> > you could just get rid of the body field and just do the calculation
> > using something like MAXALIGN((char *) envelope + sizeof(PgStatEnvelope))
> 
> As the result of the modification so far, there is only one member,
> lock, left in the PgStatEnvelope (or PgStatEntryBase) struct.  I chose
> to embed it to each PgStat_Stat*Entry structs as
> PgStat_StatEntryHeader.
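
A minimal sketch of why embedding the header settles the alignment
concern, assuming hypothetical stand-in types (StatEntryHeader,
StatTabEntry) rather than the real PgStat_* definitions: when the
header is the first member of the concrete struct, the compiler pads
the following members itself, so no int-typed flexible array is needed.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the common header; the real one holds an LWLock. */
typedef struct StatEntryHeader
{
    int         lock;
} StatEntryHeader;

/* Hypothetical entry type with 8-byte members following the header. */
typedef struct StatTabEntry
{
    StatEntryHeader header;     /* must be the first member */
    double      some_ratio;
    int64_t     numscans;
} StatTabEntry;

/* The header sits at offset 0, so entry and header share one address. */
static_assert(offsetof(StatTabEntry, header) == 0,
              "header must be the first member");

static StatEntryHeader *
entry_header(StatTabEntry *ent)
{
    return &ent->header;
}

/* Nonzero if 'some_ratio' landed on an offset suitable for a double. */
static int
body_is_aligned(void)
{
    return offsetof(StatTabEntry, some_ratio) % alignof(double) == 0;
}
```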


10: (Fixed. Same as #03)
> > > + * Snapshot is stats entry that is locally copied to offset stable values for a
> > > + * transaction.
> ...
> > The amount of code needed for this snapshot stuff seems unreasonable to
> > me, especially because I don't see why we really need it. Is this just
> > so that there's no skew between all the columns of pg_stat_all_tables()
> > etc?
> > 
> > I think this needs a lot more comments explaining what it's trying to
> > achieve.
> 
> I don't insist on keeping the behavior.  Removed snapshot stuff only
> of pgstat stuff. (beentry snapshot is left alone.)

11: (Fixed. Per-entry-type initialize is gone.)
> > > +/*
> > > + * Newly created shared stats entries needs to be initialized before the other
> > > + * processes get access it. get_stat_entry() calls it for the purpose.
> > > + */
> > > +typedef void (*entry_initializer) (PgStatEnvelope * env);
> > 
> > I think we should try to not need it, instead declaring that all fields
> > are zero initialized. That fits well together with my suggestion to
> > avoid duplicating the database / object ids.
> 
> Now that entries don't have type-specific fields that need a special
> care, I removed that stuff altogether.

12: (Fixed. Global stats memories are merged.)
> > > +static void
> > > +attach_shared_stats(void)
> > > +{
> ...
> > > +        shared_globalStats = (PgStat_GlobalStats *)
> > > +            dsa_get_address(area, StatsShmem->global_stats);
> > > +        shared_archiverStats = (PgStat_ArchiverStats *)
> > > +            dsa_get_address(area, StatsShmem->archiver_stats);
> > > +
> > > +        shared_SLRUStats = (PgStatSharedSLRUStats *)
> > > +            dsa_get_address(area, StatsShmem->slru_stats);
> > > +        LWLockInitialize(&shared_SLRUStats->lock, LWTRANCHE_STATS);
> > 
> > I don't think it makes sense to use dsa allocations for any of the fixed
> > size stats (global_stats, archiver_stats, ...). They should just be
> > direct members of StatsShmem? Then we also don't need the shared_*
> > helper variables
> 
> I intended to reduce the amount of fixed-allocated shared memory, or
> make maximum use of DSA. However, you're right. Now they are members
> of StatsShmem.


13: (I couldn't address this fully..)
> > > +        /* Load saved data if any. */
> > > +        pgstat_read_statsfiles();
> > 
> > Hm. Is it a good idea to do this as part of the shmem init function?
> > That's a lot more work than we normally do in these.
> > 
> > > +/* ----------
> > > + * detach_shared_stats() -
> > > + *
> > > + *    Detach shared stats. Write out to file if we're the last process and told
> > > + *    to do so.
> > > + * ----------
> > >   */
> > >  static void
> > > -pgstat_reset_remove_files(const char *directory)
> > > +detach_shared_stats(bool write_stats)
> > 
> > I think it'd be better to have an explicit call in the shutdown sequence
> > somewhere to write out the data, instead of munging detach and writing
> > stats out together.
> 
> It is actually strange that attach_shared_stats reads file in a
> StatsLock section while it attaches existing shared memory area
> deliberately outside the same lock section. So I moved the call to
> pg_stat_read/write_statsfiles() out of StatsLock section as the first
> step. But I couldn't move pgstat_write_stats_files() out of (or,
> before or after) detach_shared_stats(), because I didn't find a way to
> reliably check if the exiting process is the last detacher by a
> separate function from detach_shared_stats().
> 
> (continued)
> =====

14: (I believe it is addressed.)
> > +    if (nowait)
> > +    {
> > +        /*
> > +         * Don't flush stats too frequently.  Return the time to the next
> > +         * flush.
> > +         */
> 
> I think it's confusing to use nowait in the if when you actually mean
> !force.

Agreed.  I was wavering between passing !force as the "nowait"
parameter of flush_tabstat() and using the relabeled variable
nowait.  I chose to use nowait in the attached.

15: (Not addressed.)
> > -    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
> > +    if (pgStatLocalHash)
> >      {
> > -        for (i = 0; i < tsa->tsa_used; i++)
> > +        /* Step 1: flush out other than database stats */
...
> > +                case PGSTAT_TYPE_DB:
> > +                    if (ndbentries >= dbentlistlen)
> > +                    {
> > +                        dbentlistlen *= 2;
> > +                        dbentlist = repalloc(dbentlist,
> > +                                             sizeof(PgStatLocalHashEntry *) *
> > +                                             dbentlistlen);
> > +                    }
> > +                    dbentlist[ndbentries++] = lent;
> > +                    break;
> 
> Why do we need this special behaviour for database statistics?

Some of the table stats numbers are also counted as database stats
numbers. They are currently added at stats-sending time (in
pgstat_recv_tabstat()) and this follows that design.  If we added such
table stats numbers to database stats before flushing out table stats,
we would need to remember whether those numbers have already been
added to database stats or not.
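
A minimal sketch of the two-step ordering, under the assumption that a
table's numbers are folded into the database-level counts only when
that table entry is actually flushed. The types and functions here
(LocalTabStat, DbStat, flush_one_table, flush_all) are hypothetical,
and lock_busy merely simulates a failed conditional lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct LocalTabStat
{
    long        tuples_inserted;    /* pending count, not yet flushed */
    bool        lock_busy;          /* simulates a failed conditional lock */
} LocalTabStat;

typedef struct DbStat
{
    long        tuples_inserted;
} DbStat;

/* Step 1 for one table; returns false if the entry stays pending. */
static bool
flush_one_table(LocalTabStat *lt, long *shared_tab, DbStat *local_db)
{
    assert(lt->tuples_inserted >= 0);

    if (lt->lock_busy)
        return false;

    *shared_tab += lt->tuples_inserted;
    /* Only now does the number join the database-level pending counts. */
    local_db->tuples_inserted += lt->tuples_inserted;
    lt->tuples_inserted = 0;
    return true;
}

/* Flush tables first, then the database entry; returns tables left pending. */
static size_t
flush_all(LocalTabStat *tabs, long *shared_tabs, size_t ntabs,
          DbStat *local_db, DbStat *shared_db)
{
    size_t      npending = 0;

    for (size_t i = 0; i < ntabs; i++)
    {
        if (!flush_one_table(&tabs[i], &shared_tabs[i], local_db))
            npending++;
    }

    /* Step 2: the database entry now holds exactly the flushed numbers. */
    shared_db->tuples_inserted += local_db->tuples_inserted;
    local_db->tuples_inserted = 0;
    return npending;
}
```

With this ordering a table entry that is skipped because of lock
contention contributes nothing to the database counts until it is
flushed on a later attempt, so no "already added" flag is needed.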

16: (Fixed. Used List.)
> If we need it,it'd be better to just use List here rather than open
> coding a replacement (List these days basically has the same complexity
> as what you do here).

Agreed. (I noticed that lappend is faster than lcons now.) Fixed.

17: (Fixed. case-default is removed, and PGSTAT_TYPE_ALL is removed by #28)
> > +                case PGSTAT_TYPE_TABLE:
> > +                    if (flush_tabstat(lent->env, nowait))
> > +                        remove = true;
> > +                    break;
> > +                case PGSTAT_TYPE_FUNCTION:
> > +                    if (flush_funcstat(lent->env, nowait))
> > +                        remove = true;
> > +                    break;
> > +                default:
> > +                    Assert(false);
> 
> Adding a default here prevents the compiler from issuing a warning when
> new types of stats are added...

Agreed. Another instance of switch on the same enum doesn't have
default:. (Fixed.)

18: (Fixed.)
> > +            /* Remove the successfully flushed entry */
> > +            pfree(lent->env);
> 
> Probably worth zeroing the pointer here, to make debugging a little
> easier.

Agreed. I did the same for another instance of freeing a memory chunk
pointed to by non-block-local pointers.

19: (Fixed. The LWLock is replaced with an atomic update.)
> > +    /* Publish the last flush time */
> > +    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
> > +    if (shared_globalStats->stats_timestamp < now)
> > +        shared_globalStats->stats_timestamp = now;
> > +    LWLockRelease(StatsLock);
> 
> Ugh, that seems like a fairly unnecessary global lock acquisition. What
> do we need this timestamp for? Not clear to me that it's still
> needed. If it is needed, it'd probably worth making this an atomic and
> doing a compare-exchange loop instead.

The value is exposed via a system view. I used pg_atomic but I didn't
find a clean way to store TimestampTz into pg_atomic_u64.
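
One possible shape for this, sketched with C11 <stdatomic.h> instead of
PostgreSQL's pg_atomic_* API so that it stands alone. The names
stats_timestamp, publish_flush_time and read_flush_time are
hypothetical. The sketch relies on TimestampTz being a signed 64-bit
integer, stores it in an unsigned 64-bit slot through casts, and
assumes the usual two's-complement round trip for those casts.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef int64_t TimestampTz;    /* signed 64-bit, as in PostgreSQL */

static_assert(sizeof(TimestampTz) == sizeof(uint64_t),
              "timestamp must fit the atomic slot exactly");

/* Hypothetical shared slot; the real one would live in shared memory. */
static _Atomic uint64_t stats_timestamp;

/* Advance the published time to 'now' unless a newer value is present. */
static void
publish_flush_time(TimestampTz now)
{
    uint64_t    old = atomic_load(&stats_timestamp);

    /* Compare in the signed domain; the unsigned slot is only a container. */
    while ((TimestampTz) old < now)
    {
        /* On failure 'old' is refreshed with the current value; retry. */
        if (atomic_compare_exchange_weak(&stats_timestamp, &old,
                                         (uint64_t) now))
            break;
    }
}

static TimestampTz
read_flush_time(void)
{
    return (TimestampTz) atomic_load(&stats_timestamp);
}
```

The loop never moves the value backwards: a backend that loses the race
to a later timestamp simply leaves the newer value in place.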

20: (Wrote a comment to explain the reason.)
> >      /*
> > -     * Send partial messages.  Make sure that any pending xact commit/abort
> > -     * gets counted, even if there are no table stats to send.
> > +     * If we have pending local stats, let the caller know the retry interval.
> >       */
> > -    if (regular_msg.m_nentries > 0 ||
> > -        pgStatXactCommit > 0 || pgStatXactRollback > 0)
> > -        pgstat_send_tabstat(&regular_msg);
> > -    if (shared_msg.m_nentries > 0)
> > -        pgstat_send_tabstat(&shared_msg);
> > +    if (HAVE_ANY_PENDING_STATS())
> 
> I think this needs a comment explaining why we still may have pending
> stats.

Does the following work?

| * Some of the local stats may not have been flushed due to lock
| * contention.  If we have such pending local stats here, let the caller
| * know the retry interval.

21: (Fixed. Local cache of shared stats entry is added.)
> > + * flush_tabstat - flush out a local table stats entry
> > + *
...
> Could we cache the address of the shared entry in the local entry for a
> while? It seems we have a bunch of contention (that I think you're
> trying to address in a prototoype patch posted since) just because we
> will over and over look up the same address in the shared hash table.
> 
> If we instead kept the local hashtable alive for longer and stored a
> pointer to the shared entry in it, we could make this a lot
> cheaper. There would be some somewhat nasty edge cases probably. Imagine
> a table being dropped for which another backend still has pending
> stats. But that could e.g. be addressed with a refcount.

Yeah, I noticed that and did that in the previous version (with a
silly bug..)  The cache is based on the simple hash. In the previous
version, all the entries were dropped after a vacuum removed at least
one shared stats entry. This version uses a refcount instead and drops
only the entries that actually need to be dropped.

22: (vacuum/analyze immediately writes to shared stats according to #34)
> > +    /* retrieve the shared table stats entry from the envelope */
> > +    shtabstats = (PgStat_StatTabEntry *) &shenv->body;
> > +
> > +    /* lock the shared entry to protect the content, skip if failed */
> > +    if (!nowait)
> > +        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
> > +    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
> > +        return false;
> > +
> > +    /* add the values to the shared entry. */
> > +    shtabstats->numscans += lstats->t_counts.t_numscans;
> > +    shtabstats->tuples_returned += lstats->t_counts.t_tuples_returned;
> > +    shtabstats->tuples_fetched += lstats->t_counts.t_tuples_fetched;
> > +    shtabstats->tuples_inserted += lstats->t_counts.t_tuples_inserted;
> > +    shtabstats->tuples_updated += lstats->t_counts.t_tuples_updated;
> > +    shtabstats->tuples_deleted += lstats->t_counts.t_tuples_deleted;
> > +    shtabstats->tuples_hot_updated += lstats->t_counts.t_tuples_hot_updated;
> > +
> > +    /*
> > +     * If table was truncated or vacuum/analyze has ran, first reset the
> > +     * live/dead counters.
> > +     */
> > +    if (lstats->t_counts.t_truncated ||
> > +        lstats->t_counts.vacuum_count > 0 ||
> > +        lstats->t_counts.analyze_count > 0 ||
> > +        lstats->t_counts.autovac_vacuum_count > 0 ||
> > +        lstats->t_counts.autovac_analyze_count > 0)
> > +    {
> > +        shtabstats->n_live_tuples = 0;
> > +        shtabstats->n_dead_tuples = 0;
> > +    }
> 
> > +    /* clear the change counter if requested */
> > +    if (lstats->t_counts.reset_changed_tuples)
> > +        shtabstats->changes_since_analyze = 0;
> 
> I know this is largely old code, but it's not obvious to me that there's
> no race conditions here / that the race condition didn't get worse. What
> prevents other backends to since have done a lot of inserts into this
> table? Especially in case the flushes were delayed due to lock
> contention.

# I noticed that I carelessly dropped the inserts_since_vacuum code.

Well, if a vacuum report is delayed until after a massive insert
commits, the massive insert would be omitted. It seems to me that your
suggestion in #34 below gets the point.

> > +    /*
> > +     * Update vacuum/analyze timestamp and counters, so that the values won't
> > +     * goes back.
> > +     */
> > +    if (shtabstats->vacuum_timestamp < lstats->vacuum_timestamp)
> > +        shtabstats->vacuum_timestamp = lstats->vacuum_timestamp;
> 
> It seems to me that if these branches are indeed a necessary branches,
> my concerns above are well founded...

I'm not sure whether it is simply a talisman against evil or based on
an actual problem, but I don't believe it's possible that a vacuum ends
after another vacuum that started later ends...

23: (ids are no longer stored in duplicate.)
> > +init_tabentry(PgStatEnvelope * env)
> >  {
> > -    int            n;
> > -    int            len;
> > +    PgStat_StatTabEntry *tabent = (PgStat_StatTabEntry *) &env->body;
> > +
> > +    /*
> > +     * If it's a new table entry, initialize counters to the values we just
> > +     * got.
> > +     */
> > +    Assert(env->type == PGSTAT_TYPE_TABLE);
> > +    tabent->tableid = env->objectid;
> 
> It seems over the top to me to have the object id stored in yet another
> place. It's now in the hash entry, in the envelope, and the type
> specific part.

Agreed, and fixed. (See #11 above)

24: (Fixed. Don't check for all-zero of a function stats entry at flush.)
> > +/*
> > + * flush_funcstat - flush out a local function stats entry
> > + *
> > + * If nowait is true, this function returns false on lock failure. Otherwise
> > + * this function always returns true.
> > + *
> > + * Returns true if the entry is successfully flushed out.
> > + */
> > +static bool
> > +flush_funcstat(PgStatEnvelope * env, bool nowait)
> > +{
> > +    /* we assume this inits to all zeroes: */
> > +    static const PgStat_FunctionCounts all_zeroes;
> > +    PgStat_BackendFunctionEntry *localent;    /* local stats entry */
> > +    PgStatEnvelope *shenv;        /* shared stats envelope */
> > +    PgStat_StatFuncEntry *sharedent = NULL; /* shared stats entry */
> > +    bool        found;
> > +
> > +    Assert(env->type == PGSTAT_TYPE_FUNCTION);
> > +    localent = (PgStat_BackendFunctionEntry *) &env->body;
> > +
> > +    /* Skip it if no counts accumulated for it so far */
> > +    if (memcmp(&localent->f_counts, &all_zeroes,
> > +               sizeof(PgStat_FunctionCounts)) == 0)
> > +        return true;
> 
> Why would we have an entry in this case?

Right. A function entry was zeroed out in master but the entry is not
created in that case with this patch. Removed it. (Fixed)

25: (Perhaps fixed. I'm not confident, though.)
> > +    /* find shared table stats entry corresponding to the local entry */
> > +    shenv = get_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId, localent->f_id,
> > +                           nowait, init_funcentry, &found);
> > +    /* skip if dshash failed to acquire lock */
> > +    if (shenv == NULL)
> > +        return false;            /* failed to acquire lock, skip */
> > +
> > +    /* retrieve the shared table stats entry from the envelope */
> > +    sharedent = (PgStat_StatFuncEntry *) &shenv->body;
> > +
> > +    /* lock the shared entry to protect the content, skip if failed */
> > +    if (!nowait)
> > +        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
> > +    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
> > +        return false;            /* failed to acquire lock, skip */
> 
> It doesn't seem great that we have a separate copy of all of this logic
> again. It seems to me that most of the code here is (or should be)
> exactly the same as in table case. I think only the the below should be
> in here, rather than in common code.

I failed to get the last phrase, but I guess you suggested that I
should factor out the common code.

> > +/*
> > + * flush_dbstat - flush out a local database stats entry
> > + *
> > + * If nowait is true, this function returns false on lock failure. Otherwise
...
> > +    /* lock the shared entry to protect the content, skip if failed */
> > +    if (!nowait)
> > +        LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
> > +    else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
> > +        return false;
> 
> Dito re duplicating all of this.


26: (Fixed. Now all stats are saved in one file.)
> > +/*
> > + * Create the filename for a DB stat file; filename is output parameter points
> > + * to a character buffer of length len.
> > + */
> > +static void
> > +get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
> > +{
> > +    int            printed;
> > +
> > +    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
> > +    printed = snprintf(filename, len, "%s/db_%u.%s",
> > +                       PGSTAT_STAT_PERMANENT_DIRECTORY,
> > +                       databaseid,
> > +                       tempname ? "tmp" : "stat");
> > +    if (printed >= len)
> > +        elog(ERROR, "overlength pgstat path");
> >  }
> 
> Do we really want database specific storage after all of these changes?
> Seems like there's no point anymore?

Sounds reasonable. Since we no longer keep the file format,
pgstat_read/write_statsfiles() gets far simpler. (Fixed)

27: (Fixed. added CFI to the same kind of loops.)
> > +    dshash_seq_init(&hstat, pgStatSharedHash, false);
> > +    while ((p = dshash_seq_next(&hstat)) != NULL)
> >      {
> > -        Oid            tabid = tabentry->tableid;
> > -
> > -        CHECK_FOR_INTERRUPTS();
> > -
> 
> Given that this could take a while on a database with a lot of objects
> it might worth keeping the CHECK_FOR_INTERRUPTS().

Agreed. It seems like a mistake. (Fixed pgstat_read/write_statsfile()).

28: (Fixed. collect_stat_entries is removed along with PGSTAT_TYPE_ALL.)
> >  /* ----------
> > - * pgstat_vacuum_stat() -
> > + * collect_stat_entries() -
> >   *
> > - *    Will tell the collector about objects he can get rid of.
> > + *    Collect the shared statistics entries specified by type and dbid. Returns a
> > + *  list of pointer to shared statistics in palloc'ed memory. If type is
> > + *  PGSTAT_TYPE_ALL, all types of statistics of the database is collected. If
> > + *  type is PGSTAT_TYPE_DB, the parameter dbid is ignored and collect all
> > + *  PGSTAT_TYPE_DB entries.
> >   * ----------
> >   */
> > -void
> > -pgstat_vacuum_stat(void)
> > +static PgStatEnvelope * *collect_stat_entries(PgStatTypes type, Oid dbid)
> >  {
> 
> > -        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
> > +        if ((type != PGSTAT_TYPE_ALL && p->key.type != type) ||
> > +            (type != PGSTAT_TYPE_DB && p->key.databaseid != dbid))
> >              continue;
> 
> I don't like this interface much. Particularly not that it requires
> adding a PGSTAT_TYPE_ALL that's otherwise not needed. And the thing
> where PGSTAT_TYPE_DB doesn't actually works as one would expect isn't
> nice either.

Sounds reasonable. It was annoying that dbid=InvalidOid is a valid
value for this interface. But now the function is called from only
two places, and it is simpler to use a dshash seqscan directly. The
function and the enum item PGSTAT_TYPE_ALL are gone.
(Fixed)

29: (Fixed. collect_stat_entries is gone.)
> > +        if (n >= listlen - 1)
> > +            listlen *= 2;
> > +            envlist = repalloc(envlist, listlen * sizeof(PgStatEnvelope * *));
> > +        envlist[n++] = dsa_get_address(area, p->env);
> >      }
> 
> I'd use List here as well.

So the function no longer exists. (Fixed)

30: (...)
> > +    dshash_seq_term(&hstat);
> 
> Hm, I didn't immediately see which locking makes this safe? Is it just
> that nobody should be attached at this point?

I'm not sure I get your point, but I'll try to elaborate.

All the callers of collect_stat_entries have been replaced with a bare
loop of dshash_seq_next.

There are two levels of lock here. One is dshash partition lock that
is needed to continue in-partition scan safely. Another is a lock of
stats entry that is pointed from a dshash entry.

---
((PgStatHashEntry) shent).body -(dsa_get_address)-+-> PgStat_StatEntryHeader
                                                  |
((PgStatLocalHashEntry) lent).body ---------------^
---

Dshash scans are used for dropping and resetting stats entries. Entry
dropping is performed in the following steps.

(delete_current_stats_entry())
- Drop the dshash entry (needs exlock of dshash partition).

- If the refcount of the stats entry body is already zero, free the
  memory immediately.

- If not, set the "dropped" flag of the body. No lock is required
  because the "dropped" flag won't be even referred to by other
  backends until the next step is done.

- Increment the deletion count of the shared hash. (This is used as the
  "age" of the local pointer cache hash, pgstat_cache.)

(get_stat_entry())

- If dshash deletion count is different from the local cache age, scan
  over the local cache hash to find "dropped" entries.

- Decrement the refcount of each dropped entry and free the shared
  entry if it is no longer referenced. Apparently no lock is required.

pgstat_drop_database() and pgstat_vacuum_stat() run concurrently with
other backends, so the locks above are required.
pgstat_write_statsfile() is guaranteed to run alone, so it doesn't
matter whether it takes locks or not.

pgstat_reset_counters() doesn't drop or modify dshash entries, so its
dshash scan requires only a shared lock. The stats entry body is
updated, so it needs an exclusive lock.
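
The dropping steps above can be sketched as follows. This is a
single-process model with hypothetical names (SharedEntry, LocalCache,
drop_entry, cache_cleanup, cache_attach), a one-slot cache instead of a
hash, and plain integers where the real code would need atomics or
locks. It only illustrates how the "dropped" flag, the refcount and the
deletion count interact.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical shared entry body: refcount plus a "dropped" flag. */
typedef struct SharedEntry
{
    int         refcount;
    bool        dropped;
} SharedEntry;

/* Hypothetical shared state: bumped whenever any entry is dropped. */
static unsigned long shared_deletion_count = 0;
static int  entries_freed = 0;

/* Hypothetical per-backend cache: one slot plus the "age" it was checked at. */
typedef struct LocalCache
{
    SharedEntry *ent;
    unsigned long age;
} LocalCache;

static void
free_entry(SharedEntry *ent)
{
    free(ent);
    entries_freed++;
}

/* Dropper side: runs after the hash entry itself has been removed. */
static void
drop_entry(SharedEntry *ent)
{
    if (ent->refcount == 0)
        free_entry(ent);        /* nobody caches it: free right away */
    else
        ent->dropped = true;    /* defer to the last backend holding it */
    shared_deletion_count++;    /* makes every local cache look stale */
}

/* Backend side: start caching a pointer to a live shared entry. */
static void
cache_attach(LocalCache *cache, SharedEntry *ent)
{
    assert(!ent->dropped);
    ent->refcount++;
    cache->ent = ent;
    cache->age = shared_deletion_count;
}

/* Backend side: called before the cache is used. */
static void
cache_cleanup(LocalCache *cache)
{
    if (cache->age == shared_deletion_count)
        return;                 /* nothing was dropped since the last check */

    if (cache->ent != NULL && cache->ent->dropped)
    {
        if (--cache->ent->refcount == 0)
            free_entry(cache->ent);
        cache->ent = NULL;
    }
    cache->age = shared_deletion_count;
}
```

The age comparison keeps the common path cheap: a backend scans its
cache only when some entry has been dropped since its last check.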


31: (Fixed. Use List instead of the open coding.)
> > +void
> > +pgstat_vacuum_stat(void)
> > +{
...
> > +    /* collect victims from shared stats */
> > +    arraylen = 16;
> > +    victims = palloc(sizeof(PgStatEnvelope * *) * arraylen);
> > +    nvictims = 0;
> 
> Same List comment as before.

The function uses a list now. (Fixed)

32: (Fixed.)
> >  void
> >  pgstat_reset_counters(void)
> >  {
> > -    PgStat_MsgResetcounter msg;
> > +    PgStatEnvelope **envlist;
> > +    PgStatEnvelope **p;
> >
> > -    if (pgStatSock == PGINVALID_SOCKET)
> > -        return;
> > +    /* Lookup the entries of the current database in the stats hash. */
> > +    envlist = collect_stat_entries(PGSTAT_TYPE_ALL, MyDatabaseId);
> > +    for (p = envlist; *p != NULL; p++)
> > +    {
> > +        PgStatEnvelope *env = *p;
> > +        PgStat_StatDBEntry *dbstat;
> >
> > -    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
> > -    msg.m_databaseid = MyDatabaseId;
> > -    pgstat_send(&msg, sizeof(msg));
> > +        LWLockAcquire(&env->lock, LW_EXCLUSIVE);
> > +
> 
> What locking prevents this entry from being freed between the
> collect_stat_entries() and this LWLockAcquire?

Mmm. They're not protected.  The attached version no longer uses the
intermediate list and the fetched dshash entry is protected by dshash
partition lock.  (Fixed)


33: (Will keep the current code.)
> >  /* ----------
> > @@ -1440,48 +1684,63 @@ pgstat_reset_slru_counter(const char *name)
> >  void
> >  pgstat_report_autovac(Oid dboid)
> >  {
> > -    PgStat_MsgAutovacStart msg;
> > +    PgStat_StatDBEntry *dbentry;
> > +    TimestampTz ts;
> >
> > -    if (pgStatSock == PGINVALID_SOCKET)
> > +    /* return if activity stats is not active */
> > +    if (!area)
> >          return;
> >
> > -    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
> > -    msg.m_databaseid = dboid;
> > -    msg.m_start_time = GetCurrentTimestamp();
> > +    ts = GetCurrentTimestamp();
> >
> > -    pgstat_send(&msg, sizeof(msg));
> > +    /*
> > +     * Store the last autovacuum time in the database's hash table entry.
> > +     */
> > +    dbentry = get_local_dbstat_entry(dboid);
> > +    dbentry->last_autovac_time = ts;
> >  }
> 
> Why did you introduce the local ts variable here?

The function used to assign the timestamp within an LWLock section. In
the last version it wrote to the local entry, so the lock was useless,
but the amendment following comment #34 just below introduces LWLocks
again.

34: (Fixed. Vacuum/analyze write shared stats instantly.)
> >  /* --------
> >   * pgstat_report_analyze() -
> >   *
> > - *    Tell the collector about the table we just analyzed.
> > + *    Report about the table we just analyzed.
> >   *
> >   * Caller must provide new live- and dead-tuples estimates, as well as a
> >   * flag indicating whether to reset the changes_since_analyze counter.
> > @@ -1492,9 +1751,10 @@ pgstat_report_analyze(Relation rel,
> >                        PgStat_Counter livetuples, PgStat_Counter deadtuples,
> >                        bool resetcounter)
> >  {
> >  }
> 
> It seems to me that the analyze / vacuum cases would be much better
> dealth with by synchronously operating on the shared entry, instead of
> going through the local hash table. ISTM that that'd make it a lot

Blocking at the beginning and end of such operations doesn't
matter. Sounds reasonable.

> going through the local hash table. ISTM that that'd make it a lot
> easier to avoid most of the ordering issues.

Agreed. That avoids at least the case of a delayed vacuum report (#22).


35: (Fixed, needing a change of how relcache uses local stats.)
> > +static PgStat_TableStatus *
> > +get_local_tabstat_entry(Oid rel_id, bool isshared)
> > +{
> > +    PgStatEnvelope *env;
> > +    PgStat_TableStatus *tabentry;
> > +    bool        found;
> >
> > -    /*
> > -     * Now we can fill the entry in pgStatTabHash.
> > -     */
> > -    hash_entry->tsa_entry = entry;
> > +    env = get_local_stat_entry(PGSTAT_TYPE_TABLE,
> > +                               isshared ? InvalidOid : MyDatabaseId,
> > +                               rel_id, true, &found);
> >
> > -    return entry;
> > +    tabentry = (PgStat_TableStatus *) &env->body;
> > +
> > +    if (!found)
> > +    {
> > +        tabentry->t_id = rel_id;
> > +        tabentry->t_shared = isshared;
> > +        tabentry->trans = NULL;
> > +        MemSet(&tabentry->t_counts, 0, sizeof(PgStat_TableCounts));
> > +        tabentry->vacuum_timestamp = 0;
> > +        tabentry->autovac_vacuum_timestamp = 0;
> > +        tabentry->analyze_timestamp = 0;
> > +        tabentry->autovac_analyze_timestamp = 0;
> > +    }
> > +
> 
> As with shared entries, I think this should just be zero initialized
> (and we should try to get rid of the duplication of t_id/t_shared).

Ah! Yeah, they are removable since we already converted them into the
key of the hash entry.  Removed the oids and the initialization code
from all types of local stats entries.

One annoyance in doing that was pgstat_initstats, which assumes the
pgstat_info linked from a relation won't be freed.  In the end I
tightened up the management of the pgstat_info link. The link between
relcache and table stats entry is now a bidirectional link and is
explicitly de-linked by a new function pgstat_delinkstats().


36: (Perhaps fixed. I'm not confident, though.)
> > +    return tabentry;
> >  }
> >
> > +
> >  /*
> >   * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
> >   *
> > - * If no entry, return NULL, don't create a new one
> > + *  Find any existing PgStat_TableStatus entry for rel from the current
> > + *  database then from shared tables.
> 
> What do you mean with "from the current database then from shared
> tables"?

It is rewritten as follows. Is this readable?

| *  Find any existing PgStat_TableStatus entry for rel_id in the current
| *  database. If not found, try finding from shared tables.

37: (Maybe fixed.)
> >  void
> > -pgstat_send_archiver(const char *xlog, bool failed)
> > +pgstat_report_archiver(const char *xlog, bool failed)
> >  {
..
> > +    if (failed)
> > +    {
> > +        /* Failed archival attempt */
> > +        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
> > +        ++shared_archiverStats->failed_count;
> > +        memcpy(shared_archiverStats->last_failed_wal, xlog,
> > +               sizeof(shared_archiverStats->last_failed_wal));
> > +        shared_archiverStats->last_failed_timestamp = now;
> > +        LWLockRelease(StatsLock);
> > +    }
> > +    else
> > +    {
> > +        /* Successful archival operation */
> > +        LWLockAcquire(StatsLock, LW_EXCLUSIVE);
> > +        ++shared_archiverStats->archived_count;
> > +        memcpy(shared_archiverStats->last_archived_wal, xlog,
> > +               sizeof(shared_archiverStats->last_archived_wal));
> > +        shared_archiverStats->last_archived_timestamp = now;
> > +        LWLockRelease(StatsLock);
> > +    }
> >  }
> 
> Huh, why is this duplicating near equivalent code?

It was to avoid branches within a locked section, or simply because it
was expanded from master. The counters can be reset by backends, so I
couldn't change it to use the changecount protocol. So it still uses an
LWLock, but the common code is factored out in the attached version.
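
A minimal sketch of that kind of factoring, not the code in the
attached patch: the targets are chosen before the lock is taken, so the
locked section is branch-free and exists only once. ArchiverStats,
WAL_NAME_LEN, stats_lock_acquire() and stats_lock_release() are
hypothetical stand-ins for the shared struct and for
LWLockAcquire/LWLockRelease on StatsLock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define WAL_NAME_LEN 41         /* hypothetical size of a WAL name buffer */

/* Hypothetical stand-in for the shared archiver stats. */
typedef struct ArchiverStats
{
    int64_t     archived_count;
    char        last_archived_wal[WAL_NAME_LEN];
    int64_t     last_archived_timestamp;
    int64_t     failed_count;
    char        last_failed_wal[WAL_NAME_LEN];
    int64_t     last_failed_timestamp;
} ArchiverStats;

/* Toy lock: only tracks whether it is held (assumption, not an LWLock). */
static bool stats_lock_held = false;

static void
stats_lock_acquire(void)
{
    assert(!stats_lock_held);
    stats_lock_held = true;
}

static void
stats_lock_release(void)
{
    assert(stats_lock_held);
    stats_lock_held = false;
}

static void
report_archiver(ArchiverStats *stats, const char *xlog, bool failed,
                int64_t now)
{
    /* Pick the fields to update before entering the locked section. */
    int64_t    *count = failed ? &stats->failed_count : &stats->archived_count;
    char       *wal = failed ? stats->last_failed_wal : stats->last_archived_wal;
    int64_t    *ts = failed ? &stats->last_failed_timestamp
                            : &stats->last_archived_timestamp;

    stats_lock_acquire();
    ++*count;
    strncpy(wal, xlog, WAL_NAME_LEN - 1);
    wal[WAL_NAME_LEN - 1] = '\0';   /* always terminate the copied name */
    *ts = now;
    stats_lock_release();
}
```

The sketch copies with strncpy plus an explicit terminator so that it
never reads past the end of a short xlog string; that is a choice made
for the example only.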

In connection with this, while I was looking at bgwriter and
checkpointer to see if the statistics of the two could be split, I
found the following comment in checkpointer.c.

| * Send off activity statistics to the activity stats facility.  (The
| * reason why we re-use bgwriter-related code for this is that the
| * bgwriter and checkpointer used to be just one process.  It's
| * probably not worth the trouble to split the stats support into two
| * independent stats message types.)

So I split the two to try getting rid of the LWLock for the global
stats, but counter resets prevented me from doing that. In the attached
version I left the split in place, since it was already done.


38: (Haven't addressed.)
> >  /* ----------
> >   * pgstat_write_statsfiles() -
> > - *        Write the global statistics file, as well as requested DB files.
> > - *
> > - *    'permanent' specifies writing to the permanent files not temporary ones.
> > - *    When true (happens only when the collector is shutting down), also remove
> > - *    the temporary files so that backends starting up under a new postmaster
> > - *    can't read old data before the new collector is ready.
> > - *
> > - *    When 'allDbs' is false, only the requested databases (listed in
> > - *    pending_write_requests) will be written; otherwise, all databases
> > - *    will be written.
> > + *        Write the global statistics file, as well as DB files.
> >   * ----------
> >   */
> > -static void
> > -pgstat_write_statsfiles(bool permanent, bool allDbs)
> > +void
> > +pgstat_write_statsfiles(void)
> >  {
> 
> Whats the locking around this?

No locking is used there. The code is (currently) guaranteed to be the
only process that reads it.  Added a comment and an assertion.  I did
the same to pgstat_read_statsfile().


39: (Fixed.)
> > -pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
> > +pgstat_write_database_stats(PgStat_StatDBEntry *dbentry)
> >  {
> > -    HASH_SEQ_STATUS tstat;
> > -    HASH_SEQ_STATUS fstat;
> > -    PgStat_StatTabEntry *tabentry;
> > -    PgStat_StatFuncEntry *funcentry;
> > +    PgStatEnvelope **envlist;
> > +    PgStatEnvelope **penv;
> >      FILE       *fpout;
> >      int32        format_id;
> >      Oid            dbid = dbentry->databaseid;
> > @@ -5048,8 +4974,8 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
> >      char        tmpfile[MAXPGPATH];
> >      char        statfile[MAXPGPATH];
> >
> > -    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
> > -    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
> > +    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
> > +    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
> >
> >      elog(DEBUG2, "writing stats file \"%s\"", statfile);
> >
> > @@ -5076,24 +5002,31 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
> >      /*
> >       * Walk through the database's access stats per table.
> >       */
> > -    hash_seq_init(&tstat, dbentry->tables);
> > -    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
> > +    envlist = collect_stat_entries(PGSTAT_TYPE_TABLE, dbentry->databaseid);
> > +    for (penv = envlist; *penv != NULL; penv++)
> 
> In several of these collect_stat_entries() callers it really bothers me
> that we basically allocate an array as large as the number of objects
> in the database (That's fine for databases, but for tables...). Without
> much need as far as I can see.

collect_stat_entries() is removed (#28) and the callers now handle
entries directly in the dshash_seq_next loop.
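
As an illustration of the difference, here is a toy sketch (not dshash
itself) of a caller that filters and processes entries inside the scan
loop instead of first collecting them into an array sized by the number
of objects. Table, Entry, Scan, scan_init() and scan_next() are
hypothetical names for a small chained hash table; locking is omitted.

```c
#include <assert.h>
#include <stddef.h>

#define NBUCKETS 4              /* hypothetical, tiny on purpose */

typedef enum { TYPE_TABLE, TYPE_FUNCTION } EntryType;

typedef struct Entry
{
    EntryType   type;
    unsigned    dbid;
    long        counter;
    struct Entry *next;         /* next entry in the same bucket */
} Entry;

typedef struct Table
{
    Entry      *buckets[NBUCKETS];
} Table;

typedef struct Scan
{
    Table      *table;
    int         bucket;         /* bucket currently being walked */
    Entry      *next;           /* next entry to return, or NULL */
} Scan;

static void
table_insert(Table *t, Entry *e)
{
    unsigned    b = e->dbid % NBUCKETS;

    e->next = t->buckets[b];
    t->buckets[b] = e;
}

static void
scan_init(Scan *s, Table *t)
{
    s->table = t;
    s->bucket = 0;
    s->next = t->buckets[0];
}

/* Return the next entry, or NULL when every bucket has been visited. */
static Entry *
scan_next(Scan *s)
{
    Entry      *cur;

    while (s->next == NULL)
    {
        if (++s->bucket >= NBUCKETS)
            return NULL;
        s->next = s->table->buckets[s->bucket];
    }
    cur = s->next;
    s->next = cur->next;
    return cur;
}

/* Process matching entries in the loop itself; nothing is allocated. */
static long
sum_counters(Table *t, EntryType type, unsigned dbid)
{
    Scan        s;
    Entry      *e;
    long        total = 0;

    scan_init(&s, t);
    while ((e = scan_next(&s)) != NULL)
    {
        if (e->type == type && e->dbid == dbid)
            total += e->counter;
    }
    return total;
}
```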

40: (Fixed.)
> >      {
> > +        PgStat_StatTabEntry *tabentry = (PgStat_StatTabEntry *) &(*penv)->body;
> > +
> >          fputc('T', fpout);
> >          rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
> >          (void) rc;                /* we'll check for error with ferror */
> >      }
> > +    pfree(envlist);
> >
> >      /*
> >       * Walk through the database's function stats table.
> >       */
> > -    hash_seq_init(&fstat, dbentry->functions);
> > -    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
> > +    envlist = collect_stat_entries(PGSTAT_TYPE_FUNCTION, dbentry->databaseid);
> > +    for (penv = envlist; *penv != NULL; penv++)
> >      {
> > +        PgStat_StatFuncEntry *funcentry =
> > +        (PgStat_StatFuncEntry *) &(*penv)->body;
> > +
> >          fputc('F', fpout);
> >          rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
> >          (void) rc;                /* we'll check for error with ferror */
> >      }
> > +    pfree(envlist);
> 
> Why do we need separate loops for every type of object here?

Just to keep the file format. But we decided to change it (#26) and it
is now a jumble of all kinds of stats
entries. pgstat_write/read_statsfile() became far simpler.
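
To show why a mixed file is simpler to handle, here is a rough sketch
of one writer loop and one reader loop over tagged records. It is not
the file format used by the patch: the tags 'T', 'F' and 'E', the
TabStats and FuncStats structs, and all function names are assumptions
made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

typedef struct { unsigned id; long tuples_inserted; } TabStats;
typedef struct { unsigned id; long calls; } FuncStats;

typedef struct StatEntry
{
    char        kind;           /* 'T' = table, 'F' = function */
    union
    {
        TabStats    tab;
        FuncStats   func;
    }           body;
} StatEntry;

/* Payload size for a tag, or 0 if the tag is unknown. */
static size_t
body_size(char kind)
{
    switch (kind)
    {
        case 'T':
            return sizeof(TabStats);
        case 'F':
            return sizeof(FuncStats);
        default:
            return 0;
    }
}

/* Write every entry in one loop, whatever its kind. Returns 0 or -1. */
static int
write_entries(FILE *fp, const StatEntry *entries, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        size_t      len = body_size(entries[i].kind);

        if (len == 0)
            return -1;
        fputc(entries[i].kind, fp);
        fwrite(&entries[i].body, len, 1, fp);
    }
    fputc('E', fp);             /* end-of-file marker */
    return ferror(fp) ? -1 : 0;
}

/* Read entries until the 'E' marker. Returns the count, or -1 on error. */
static int
read_entries(FILE *fp, StatEntry *out, size_t max)
{
    size_t      n = 0;
    int         c;

    while ((c = fgetc(fp)) != EOF)
    {
        size_t      len;

        if (c == 'E')
            return (int) n;
        len = body_size((char) c);
        if (len == 0 || n >= max)
            return -1;
        out[n].kind = (char) c;
        if (fread(&out[n].body, len, 1, fp) != 1)
            return -1;
        n++;
    }
    return -1;                  /* missing end marker */
}
```

With per-type sections, both functions would need one loop per object
type and the writer would need the entries grouped beforehand.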


41: (Fixed.)
> > +/* ----------
> > + * create_missing_dbentries() -
> > + *
> > + *  There may be the case where database entry is missing for the database
> > + *  where object stats are recorded. This function creates such missing
> > + *  dbentries so that so that all stats entries can be written out to files.
> > + * ----------
> > + */
> > +static void
> > +create_missing_dbentries(void)
> > +{
> 
> In which situation is this necessary?

It is because the old file format required those entries. It is no
longer needed and was removed in #26.


42: (Sorry, but I didn't get your point..)
> > +static PgStatEnvelope *
> > +get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
> > +               bool nowait, entry_initializer initfunc, bool *found)
> > +{
> 
> > +    bool        create = (initfunc != NULL);
> > +    PgStatHashEntry *shent;
> > +    PgStatEnvelope *shenv = NULL;
> > +    PgStatHashEntryKey key;
> > +    bool        myfound;
> > +
> > +    Assert(type != PGSTAT_TYPE_ALL);
> > +
> > +    key.type = type;
> > +    key.databaseid = dbid;
> > +    key.objectid = objid;
> > +    shent = dshash_find_extended(pgStatSharedHash, &key,
> > +                                 create, nowait, create, &myfound);
> > +    if (shent)
> >      {
> > -        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
> > +        if (create && !myfound)
> > +        {
> > +            /* Create new stats envelope. */
> > +            size_t        envsize = PgStatEnvelopeSize(pgstat_entsize[type]);
> > +            dsa_pointer chunk = dsa_allocate0(area, envsize);
> 
> > +            /*
> > +             * The lock on dshsh is released just after. Call initializer
> > +             * callback before it is exposed to other process.
> > +             */
> > +            if (initfunc)
> > +                initfunc(shenv);
> > +
> > +            /* Link the new entry from the hash entry. */
> > +            shent->env = chunk;
> > +        }
> > +        else
> > +            shenv = dsa_get_address(area, shent->env);
> > +
> > +        dshash_release_lock(pgStatSharedHash, shent);
> 
> Doesn't this mean that by this time the entry could already have been
> removed by a concurrent backend, and the dsa allocation freed?

Does "by this time" mean before the dshash_find_extended, or after it
and until dshash_release_lock?

We can create an entry for a just-dropped object, but it should be
removed again by the next vacuum.

The newly created entry (or its partition) is exclusively locked, so no
concurrent backend can find it until the dshash_release_lock.

The shenv could be removed before the caller accesses it. But since the
function is called for an existing object, the entry cannot be removed
until the first vacuum after the transaction ends. I added a comment
just before the dshash_release_lock in get_stat_entry().


43: (Fixed. But has a side effect.)
> > Subject: [PATCH v36 7/7] Remove the GUC stats_temp_directory
> >
> > The GUC used to specify the directory to store temporary statistics
> > files. It is no longer needed by the stats collector but still used by
> > the programs in bin and contrib, and maybe other extensions. Thus this
> > patch removes the GUC but some backing variables and macro definitions
> > are left alone for backward compatibility.
> 
> I don't see what this achieves? Which use of those variables / macros
> would would be safe? I think it'd be better to just remove them.

pg_stat_statements used the PG_STAT_TMP directory to store a temporary
file. I just replaced it with PGSTAT_STAT_PERMANENT_DIRECTORY.  As a
result, basebackup copies the temporary file of pg_stat_statements.

By the way, basebackup excludes the pg_stat_tmp directory but sends the
pg_stat directory. On the other hand, when we start a server from a base
backup, it starts crash recovery first and removes stats files anyway.
Why does basebackup send the pg_stat directory then? (Added as 0007.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 1ba12492bec139dffff2b1aa61468af7f2eca8e8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:03 +0900
Subject: [PATCH v38 1/7] sequential scan for dshash

Dshash did not allow scanning all entries sequentially. This adds the
functionality. The interface is similar to, but a bit different from,
both that of dynahash and the simple dshash search functions. One of
the most significant differences is that the sequential scan interface
of dshash always needs a call to dshash_seq_term when the scan ends.
Another is locking. Dshash holds a partition lock when returning an
entry; dshash_seq_next() also holds the lock when returning an entry,
but callers shouldn't release it, since the lock is essential to
continue the scan. The seqscan interface allows entry deletion during a
scan. The in-scan deletion should be performed by dshash_delete_current().
---
 src/backend/lib/dshash.c | 160 ++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  22 ++++++
 2 files changed, 181 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 78ccf03217..b829167872 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -127,6 +127,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +157,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -324,7 +332,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -592,6 +600,156 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should always be called when a scan finishes.
+ * The caller may delete returned elements in the midst of a scan by using
+ * dshash_delete_current(). exclusive must be true to delete elements.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool exclusive)
+{
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->exclusive = exclusive;
+}
+
+/*
+ * Returns the next element.
+ *
+ * Returned elements are locked and the caller must not explicitly release
+ * the lock. It is released by a later dshash_seq_next() or dshash_seq_term().
+ */
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert(status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+        LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                      status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        status->curpartition = partition;
+
+        /* resize doesn't happen from now until seq scan ends */
+        status->nbuckets =
+            NUM_BUCKETS(status->hash_table->control->size_log2);
+        ensure_valid_bucket_pointers(status->hash_table);
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(status->hash_table,
+                                               status->curpartition),
+                                status->exclusive ? LW_EXCLUSIVE : LW_SHARED));
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        int next_partition;
+
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            return NULL;
+        }
+
+        /* Check if move to the next partition */
+        next_partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+
+        if (status->curpartition != next_partition)
+        {
+            /*
+             * Move to the next partition. Lock the next partition, then
+             * release the current one, not in the reverse order, to avoid
+             * concurrent resizing.  Avoid deadlock by taking locks in the
+             * same order as resize().
+             */
+            LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                         next_partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                         status->curpartition));
+            status->curpartition = next_partition;
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * The caller may delete the item. Store the next item in case of deletion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+/*
+ * Terminates the seqscan and releases all locks.
+ *
+ * Should always be called when finishing or exiting a seqscan.
+ */
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+
+    if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+/* Remove the current entry during a seq scan. */
+void
+dshash_delete_current(dshash_seq_status *status)
+{
+    dshash_table       *hash_table    = status->hash_table;
+    dshash_table_item  *item        = status->curitem;
+    size_t                partition    = PARTITION_FOR_HASH(item->hash);
+
+    Assert(status->exclusive);
+    Assert(hash_table->control->magic == DSHASH_MAGIC);
+    Assert(hash_table->find_locked);
+    Assert(hash_table->find_exclusively_locked);
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition),
+                                LW_EXCLUSIVE));
+
+    delete_item(hash_table, item);
+}
+
+/* Get the current entry during a seq scan. */
+void *
+dshash_get_current(dshash_seq_status *status)
+{
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b86df68e77..c337099061 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,21 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state. The detail is exposed to let users know the storage
+ * size but it should be considered as an opaque type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;    /* dshash table working on */
+    int                    curbucket;    /* bucket number we are at */
+    int                    nbuckets;    /* total number of buckets in the dshash */
+    dshash_table_item  *curitem;    /* item we are currently at */
+    dsa_pointer            pnextitem;    /* dsa-pointer to the next item */
+    int                    curpartition;    /* partition number we are at */
+    bool                exclusive;    /* locking mode */
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
                                    const dshash_parameters *params,
@@ -80,6 +95,13 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
+extern void dshash_delete_current(dshash_seq_status *status);
+extern void *dshash_get_current(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.18.4
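
The bucket-to-partition arithmetic and the lock handover used by
dshash_seq_next() in the patch above can be illustrated with a small
standalone sketch. The macros mirror the ones in the patch, but
NUM_PARTITIONS_LOG2 is set to a hypothetical 2 (four partitions), and
lock_partition()/unlock_partition() are toy stand-ins for
LWLockAcquire/LWLockRelease that only record which locks are held.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_PARTITIONS_LOG2 2   /* hypothetical: 4 partitions */
#define NUM_PARTITIONS (1 << NUM_PARTITIONS_LOG2)

#define NUM_SPLITS(size_log2)  ((size_log2) - NUM_PARTITIONS_LOG2)
#define NUM_BUCKETS(size_log2) (((size_t) 1) << (size_log2))
#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2) \
    ((bucket_idx) >> NUM_SPLITS(size_log2))

static bool lock_held[NUM_PARTITIONS];
static int  max_locks_held = 0; /* high-water mark, for illustration */

static void
lock_partition(int p)
{
    int         held = 0;

    assert(!lock_held[p]);
    lock_held[p] = true;
    for (int i = 0; i < NUM_PARTITIONS; i++)
        held += lock_held[i] ? 1 : 0;
    if (held > max_locks_held)
        max_locks_held = held;
}

static void
unlock_partition(int p)
{
    assert(lock_held[p]);
    lock_held[p] = false;
}

/* Visit every bucket in order; return how many lock handovers happened. */
static int
walk_buckets(int size_log2)
{
    size_t      nbuckets = NUM_BUCKETS(size_log2);
    int         cur = (int) PARTITION_FOR_BUCKET_INDEX((size_t) 0, size_log2);
    int         handovers = 0;

    assert(size_log2 >= NUM_PARTITIONS_LOG2);

    lock_partition(cur);
    for (size_t b = 1; b < nbuckets; b++)
    {
        int         next = (int) PARTITION_FOR_BUCKET_INDEX(b, size_log2);

        if (next != cur)
        {
            /* Take the next lock first so some lock is held at all times. */
            lock_partition(next);
            unlock_partition(cur);
            cur = next;
            handovers++;
        }
    }
    unlock_partition(cur);      /* corresponds to the dshash_seq_term step */
    return handovers;
}
```

With size_log2 = 4 there are 16 buckets, 4 per partition, so the walk
crosses three partition boundaries and never holds more than two locks
at once.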

From 6225db225affd612a27e2c4dac95135ba1d7484e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:57 +0900
Subject: [PATCH v38 2/7] Add conditional lock feature to dshash

Dshash currently waits for locks unconditionally, which is inconvenient
when we want to avoid being blocked by other processes. This commit
adds dshash_find_extended, an alternative to dshash_find and
dshash_find_or_insert that allows immediate return on lock failure.
---
 src/backend/lib/dshash.c | 98 +++++++++++++++++++++-------------------
 src/include/lib/dshash.h |  3 ++
 2 files changed, 55 insertions(+), 46 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index b829167872..9c90096f3d 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -383,6 +383,10 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  * the caller must take care to ensure that the entry is not left corrupted.
  * The lock mode is either shared or exclusive depending on 'exclusive'.
  *
+ * If found is not NULL, *found is set to true if the key is found in the hash
+ * table. If the key is not found, *found is set to false and a pointer to a
+ * newly created entry is returned.
+ *
  * The caller must not lock a lock already.
  *
  * Note that the lock held is in fact an LWLock, so interrupts will be held on
@@ -392,36 +396,7 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
 {
-    dshash_hash hash;
-    size_t        partition;
-    dshash_table_item *item;
-
-    hash = hash_key(hash_table, key);
-    partition = PARTITION_FOR_HASH(hash);
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
-
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
-    ensure_valid_bucket_pointers(hash_table);
-
-    /* Search the active bucket. */
-    item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
-
-    if (!item)
-    {
-        /* Not found. */
-        LWLockRelease(PARTITION_LOCK(hash_table, partition));
-        return NULL;
-    }
-    else
-    {
-        /* The caller will free the lock by calling dshash_release_lock. */
-        hash_table->find_locked = true;
-        hash_table->find_exclusively_locked = exclusive;
-        return ENTRY_FROM_ITEM(item);
-    }
+    return dshash_find_extended(hash_table, key, exclusive, false, false, NULL);
 }
 
 /*
@@ -439,31 +414,60 @@ dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
 {
-    dshash_hash hash;
-    size_t        partition_index;
-    dshash_partition *partition;
+    return dshash_find_extended(hash_table, key, true, false, true, found);
+}
+
+
+/*
+ * Find the key in the hash table.
+ *
+ * "exclusive" is the lock mode in which the partition for the returned item
+ * is locked.  If "nowait" is true, the function returns NULL immediately if
+ * the required lock cannot be acquired.  "insert" indicates insert mode: when
+ * the key is not found, a new entry is inserted and *found is set to false;
+ * otherwise *found is set to true. "found" must be non-null in this mode.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool insert, bool *found)
+{
+    dshash_hash hash = hash_key(hash_table, key);
+    size_t        partidx = PARTITION_FOR_HASH(hash);
+    dshash_partition *partition = &hash_table->control->partitions[partidx];
+    LWLockMode  lockmode = exclusive ? LW_EXCLUSIVE : LW_SHARED;
     dshash_table_item *item;
 
-    hash = hash_key(hash_table, key);
-    partition_index = PARTITION_FOR_HASH(hash);
-    partition = &hash_table->control->partitions[partition_index];
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
+    /* must be exclusive when insert allowed */
+    Assert(!insert || (exclusive && found != NULL));
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (!nowait)
+        LWLockAcquire(PARTITION_LOCK(hash_table, partidx), lockmode);
+    else if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partidx),
+                                       lockmode))
+        return NULL;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
     item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
 
     if (item)
-        *found = true;
+    {
+        if (found)
+            *found = true;
+    }
     else
     {
-        *found = false;
+        if (found)
+            *found = false;
+
+        if (!insert)
+        {
+            /* The caller didn't ask us to add a new entry. */
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+            return NULL;
+        }
 
         /* Check if we are getting too full. */
         if (partition->count > MAX_COUNT_PER_PARTITION(hash_table))
@@ -479,7 +483,8 @@ restart:
              * Give up our existing lock first, because resizing needs to
              * reacquire all the locks in the right order to avoid deadlocks.
              */
-            LWLockRelease(PARTITION_LOCK(hash_table, partition_index));
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+
             resize(hash_table, hash_table->size_log2 + 1);
 
             goto restart;
@@ -493,12 +498,13 @@ restart:
         ++partition->count;
     }
 
-    /* The caller must release the lock with dshash_release_lock. */
+    /* The caller will free the lock by calling dshash_release_lock. */
     hash_table->find_locked = true;
-    hash_table->find_exclusively_locked = true;
+    hash_table->find_exclusively_locked = exclusive;
     return ENTRY_FROM_ITEM(item);
 }
 
+
 /*
  * Remove an entry by key.  Returns true if the key was found and the
  * corresponding entry was removed.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index c337099061..493e974832 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -91,6 +91,9 @@ extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                                    const void *key, bool *found);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+                                  bool exclusive, bool nowait, bool insert,
+                                  bool *found);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.18.4
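
The calling convention of dshash_find_extended() in the patch above
(return with the lock held on success, return NULL without it
otherwise, and give up at once when "nowait" is set and the lock is
busy) can be sketched with a toy single-lock table. This is only an
illustration under assumptions: Table, Slot, try_lock(), unlock() and
find_extended() are hypothetical, there is one lock instead of
partition locks, and a blocking acquire cannot be modelled in a
single-threaded example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ENTRIES 8           /* hypothetical fixed capacity */

typedef struct Slot
{
    int         key;
    int         value;
    bool        used;
} Slot;

typedef struct Table
{
    bool        locked;         /* toy lock flag, not an LWLock */
    Slot        slots[MAX_ENTRIES];
} Table;

static bool
try_lock(Table *t)
{
    if (t->locked)
        return false;
    t->locked = true;
    return true;
}

static void
unlock(Table *t)
{
    assert(t->locked);
    t->locked = false;
}

/*
 * Returns the slot with the lock still held, or NULL with the lock not
 * held.  On lock failure *found is left untouched.
 */
static Slot *
find_extended(Table *t, int key, bool nowait, bool insert, bool *found)
{
    Slot       *free_slot = NULL;

    assert(!insert || found != NULL);

    if (!try_lock(t))
    {
        /* A blocking caller would wait here; this toy cannot model that. */
        assert(nowait);
        return NULL;
    }

    for (int i = 0; i < MAX_ENTRIES; i++)
    {
        Slot       *s = &t->slots[i];

        if (s->used && s->key == key)
        {
            if (found)
                *found = true;
            return s;           /* caller releases the lock later */
        }
        if (!s->used && free_slot == NULL)
            free_slot = s;
    }

    if (found)
        *found = false;

    if (!insert || free_slot == NULL)
    {
        unlock(t);
        return NULL;
    }

    free_slot->used = true;
    free_slot->key = key;
    free_slot->value = 0;
    return free_slot;           /* new entry, lock still held */
}
```

One consequence visible in the sketch: without "insert", a NULL result
from a nowait call does not say by itself whether the lock was busy or
the key was absent, unless the caller initializes and inspects *found.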

From 6db81f15b246ab3ff5bcb2f1855108e09d3b73be Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:59:38 +0900
Subject: [PATCH v38 3/7] Make archiver process an auxiliary process

This is a preliminary patch for shared-memory based stats collector.

The archiver process must be an auxiliary process since it uses shared
memory once stats data is moved into shared memory. Make the process
an auxiliary process in order to make it work.
---
 src/backend/access/transam/xlogarchive.c |   6 +-
 src/backend/bootstrap/bootstrap.c        |  22 ++--
 src/backend/postmaster/pgarch.c          | 130 +++--------------------
 src/backend/postmaster/postmaster.c      |  50 +++++----
 src/backend/storage/lmgr/proc.c          |   1 +
 src/include/access/xlog.h                |   3 +
 src/include/access/xlogarchive.h         |   1 +
 src/include/miscadmin.h                  |   2 +
 src/include/postmaster/pgarch.h          |   4 +-
 src/include/storage/pmsignal.h           |   1 -
 src/include/storage/proc.h               |   3 +
 11 files changed, 69 insertions(+), 154 deletions(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index cae93ab69d..6908bec2f9 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -29,7 +29,9 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/latch.h"
 #include "storage/pmsignal.h"
+#include "storage/proc.h"
 
 /*
  * Attempt to retrieve the specified file from off-line archival storage.
@@ -489,8 +491,8 @@ XLogArchiveNotify(const char *xlog)
     }
 
     /* Notify archiver that it's got something to do */
-    if (IsUnderPostmaster)
-        SendPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER);
+    if (IsUnderPostmaster && ProcGlobal->archiverLatch)
+        SetLatch(ProcGlobal->archiverLatch);
 }
 
 /*
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 76b2f5066f..81bfaea869 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -317,6 +317,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
         case StartupProcess:
             MyBackendType = B_STARTUP;
             break;
+        case ArchiverProcess:
+            MyBackendType = B_ARCHIVER;
+            break;
         case BgWriterProcess:
             MyBackendType = B_BG_WRITER;
             break;
@@ -437,30 +440,29 @@ AuxiliaryProcessMain(int argc, char *argv[])
             proc_exit(1);        /* should never return */
 
         case StartupProcess:
-            /* don't set signals, startup process has its own agenda */
             StartupProcessMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
+
+        case ArchiverProcess:
+            PgArchiverMain();
+            proc_exit(1);
 
         case BgWriterProcess:
-            /* don't set signals, bgwriter has its own agenda */
             BackgroundWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case CheckpointerProcess:
-            /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalWriterProcess:
-            /* don't set signals, walwriter has its own agenda */
             InitXLOGAccess();
             WalWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalReceiverProcess:
-            /* don't set signals, walreceiver has its own agenda */
             WalReceiverMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         default:
             elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index ed1b65358d..e3a520def9 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -48,6 +48,7 @@
 #include "storage/latch.h"
 #include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
+#include "storage/procsignal.h"
 #include "utils/guc.h"
 #include "utils/ps_status.h"
 
@@ -78,13 +79,11 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
-static volatile sig_atomic_t wakened = false;
 static volatile sig_atomic_t ready_to_stop = false;
 
 /* ----------
@@ -95,8 +94,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
 static void pgarch_ArchiverCopyLoop(void);
@@ -110,75 +107,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -212,14 +140,9 @@ pgarch_forkexec(void)
 #endif                            /* EXEC_BACKEND */
 
 
-/*
- * PgArchiverMain
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
- *    since we don't use 'em, it hardly matters...
- */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+/* Main entry point for archiver process */
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -231,33 +154,26 @@ PgArchiverMain(int argc, char *argv[])
     /* SIGQUIT handler was already set up by InitPostmasterChild */
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, pgarch_waken);
+    pqsignal(SIGUSR1, procsignal_sigusr1_handler);
     pqsignal(SIGUSR2, pgarch_waken_stop);
+
     /* Reset some signals that are accepted by postmaster but not here */
     pqsignal(SIGCHLD, SIG_DFL);
+
+    /* Unblock signals (they were blocked when the postmaster forked us) */
     PG_SETMASK(&UnBlockSig);
 
-    MyBackendType = B_ARCHIVER;
-    init_ps_display(NULL);
+    /*
+     * Advertise our latch that backends can use to wake us up while we're
+     * sleeping.
+     */
+    ProcGlobal->archiverLatch = &MyProc->procLatch;
 
     pgarch_MainLoop();
 
     exit(0);
 }
 
-/* SIGUSR1 signal handler for archiver process */
-static void
-pgarch_waken(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    /* set flag that there is work to be done */
-    wakened = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
 /* SIGUSR2 signal handler for archiver process */
 static void
 pgarch_waken_stop(SIGNAL_ARGS)
@@ -282,14 +198,6 @@ pgarch_MainLoop(void)
     pg_time_t    last_copy_time = 0;
     bool        time_to_stop;
 
-    /*
-     * We run the copy loop immediately upon entry, in case there are
-     * unarchived files left over from a previous database run (or maybe the
-     * archiver died unexpectedly).  After that we wait for a signal or
-     * timeout before doing more.
-     */
-    wakened = true;
-
     /*
      * There shouldn't be anything for the archiver to do except to wait for a
      * signal ... however, the archiver exists to protect our data, so she
@@ -328,12 +236,8 @@ pgarch_MainLoop(void)
         }
 
         /* Do what we're here for */
-        if (wakened || time_to_stop)
-        {
-            wakened = false;
-            pgarch_ArchiverCopyLoop();
-            last_copy_time = time(NULL);
-        }
+        pgarch_ArchiverCopyLoop();
+        last_copy_time = time(NULL);
 
         /*
          * Sleep until a signal is received, or until a poll is forced by
@@ -354,13 +258,9 @@ pgarch_MainLoop(void)
                                WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                                timeout * 1000L,
                                WAIT_EVENT_ARCHIVER_MAIN);
-                if (rc & WL_TIMEOUT)
-                    wakened = true;
                 if (rc & WL_POSTMASTER_DEATH)
                     time_to_stop = true;
             }
-            else
-                wakened = true;
         }
 
         /*
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 959e3b8873..b811c961a6 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -555,6 +555,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1800,7 +1801,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -3054,7 +3055,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3189,20 +3190,16 @@ reaper(SIGNAL_ARGS)
         }
 
         /*
-         * Was it the archiver?  If so, just try to start a new one; no need
-         * to force reset of the rest of the system.  (If fail, we'll try
-         * again in future cycles of the main loop.).  Unless we were waiting
-         * for it to shut down; don't restart it in that case, and
-         * PostmasterStateMachine() will advance to the next shutdown step.
+         * Was it the archiver?  Normal exit can be ignored; we'll start a new
+         * one at the next iteration of the postmaster's main loop, if
+         * necessary. Any other exit condition is treated as a crash.
          */
         if (pid == PgArchPID)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3450,7 +3447,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3655,6 +3652,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3951,6 +3960,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5230,7 +5240,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5275,16 +5285,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (StartWorkerNeeded || HaveCrashedWorker)
         maybe_start_bgworkers();
 
-    if (CheckPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER) &&
-        PgArchPID != 0)
-    {
-        /*
-         * Send SIGUSR1 to archiver process, to wake it up and begin archiving
-         * next WAL file.
-         */
-        signal_child(PgArchPID, SIGUSR1);
-    }
-
     /* Tell syslogger to rotate logfile if requested */
     if (SysLoggerPID != 0)
     {
@@ -5526,6 +5526,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 88566bd9fa..746bed773e 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -181,6 +181,7 @@ InitProcGlobal(void)
     ProcGlobal->startupBufferPinWaitBufId = -1;
     ProcGlobal->walwriterLatch = NULL;
     ProcGlobal->checkpointerLatch = NULL;
+    ProcGlobal->archiverLatch = NULL;
     pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
     pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e71..65443063e8 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -339,6 +339,9 @@ extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern void XLogArchiveWakeupStart(void);
+extern void XLogArchiveWakeupEnd(void);
+extern void XLogArchiveWakeup(void);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool PromoteIsTriggered(void);
diff --git a/src/include/access/xlogarchive.h b/src/include/access/xlogarchive.h
index 1c67de2ede..54ce0b97d7 100644
--- a/src/include/access/xlogarchive.h
+++ b/src/include/access/xlogarchive.h
@@ -25,6 +25,7 @@ extern void ExecuteRecoveryCommand(const char *command, const char *commandName,
 extern void KeepFileRestoredFromArchive(const char *path, const char *xlogfname);
 extern void XLogArchiveNotify(const char *xlog);
 extern void XLogArchiveNotifySeg(XLogSegNo segno);
+extern void XLogArchiveWakeup(void);
 extern void XLogArchiveForceDone(const char *xlog);
 extern bool XLogArchiveCheckDone(const char *xlog);
 extern bool XLogArchiveIsBusy(const char *xlog);
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 72e3352398..bba002ad24 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -418,6 +418,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -430,6 +431,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index b3200874ca..e3ffc63f14 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 56c5ec4481..c691acf8cd 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -34,7 +34,6 @@ typedef enum
 {
     PMSIGNAL_RECOVERY_STARTED,    /* recovery has started */
     PMSIGNAL_BEGIN_HOT_STANDBY, /* begin Hot Standby */
-    PMSIGNAL_WAKEN_ARCHIVER,    /* send a NOTIFY signal to xlog archiver */
     PMSIGNAL_ROTATE_LOGFILE,    /* send SIGUSR1 to syslogger to rotate logfile */
     PMSIGNAL_START_AUTOVAC_LAUNCHER,    /* start an autovacuum launcher */
     PMSIGNAL_START_AUTOVAC_WORKER,    /* start an autovacuum worker */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 9c9a50ae45..de20520b8c 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -345,6 +345,8 @@ typedef struct PROC_HDR
     int            startupProcPid;
     /* Buffer id of the buffer that Startup process waits for pin on, or -1 */
     int            startupBufferPinWaitBufId;
+    /* Archiver process's latch */
+    Latch       *archiverLatch;
 } PROC_HDR;
 
 extern PGDLLIMPORT PROC_HDR *ProcGlobal;
-- 
2.18.4

From 525bdc9aaf73795afbbea1dc64e80591a73fedbb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:14 +0900
Subject: [PATCH v38 4/7] Shared-memory based stats collector

Previously, activity statistics were collected via sockets and shared
among backends through files written periodically. Such files can reach
tens of megabytes and are created as often as once per second; the
stats collector serializes that data and every backend then
de-serializes it periodically. To avoid that cost, this patch places
activity statistics in shared memory. Each backend accumulates
statistics locally, then tries to move them into the shared statistics
at the end of each transaction, but at intervals no shorter than 10
seconds. Until 60 seconds have elapsed since the last flush to shared
stats, a lock failure postpones the flush, to alleviate lock contention
that would slow down transactions. After that, the flush waits for the
locks so that the shared statistics do not become stale.
---
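Note for reviewers: the flush timing described in the commit message
can be read as a three-way decision made at transaction end. The sketch
below is only an illustration of that reading, not code from this
patch. The names choose_flush_action, FlushAction and the two interval
macros are hypothetical, and the handling of the exact 10 s and 60 s
boundaries is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; only the two intervals come from the commit message. */
#define MIN_FLUSH_INTERVAL_MS   10000   /* do not flush more often than this */
#define MAX_FLUSH_DELAY_MS      60000   /* past this, stop postponing */

typedef enum
{
    FLUSH_SKIP,         /* too soon since the last flush; keep accumulating */
    FLUSH_TRY_NOWAIT,   /* try the lock conditionally; postpone on failure */
    FLUSH_WAIT          /* block on the lock so shared stats do not go stale */
} FlushAction;

/*
 * Decide what to do at transaction end, given the current time and the
 * time of the last successful flush (both in milliseconds).
 */
static FlushAction
choose_flush_action(int64_t now_ms, int64_t last_flush_ms)
{
    int64_t     elapsed;

    assert(now_ms >= last_flush_ms);
    elapsed = now_ms - last_flush_ms;

    if (elapsed < MIN_FLUSH_INTERVAL_MS)
        return FLUSH_SKIP;
    if (elapsed < MAX_FLUSH_DELAY_MS)
        return FLUSH_TRY_NOWAIT;
    return FLUSH_WAIT;
}
```

Under this reading, a backend whose conditional lock attempt fails
leaves last_flush_ms unchanged, so repeated failures eventually push it
into the waiting case.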
 src/backend/access/heap/heapam_handler.c      |    4 +-
 src/backend/access/heap/vacuumlazy.c          |    2 +-
 src/backend/access/transam/xlog.c             |    4 +-
 src/backend/catalog/index.c                   |   24 +-
 src/backend/commands/analyze.c                |    8 +-
 src/backend/commands/dbcommands.c             |    2 +-
 src/backend/commands/matview.c                |    8 +-
 src/backend/commands/vacuum.c                 |    4 +-
 src/backend/postmaster/autovacuum.c           |   74 +-
 src/backend/postmaster/bgwriter.c             |    4 +-
 src/backend/postmaster/checkpointer.c         |   24 +-
 src/backend/postmaster/interrupt.c            |    5 +-
 src/backend/postmaster/pgarch.c               |   12 +-
 src/backend/postmaster/pgstat.c               | 5379 +++++++----------
 src/backend/postmaster/postmaster.c           |   88 +-
 src/backend/replication/basebackup.c          |    4 +-
 src/backend/storage/buffer/bufmgr.c           |    8 +-
 src/backend/storage/ipc/ipci.c                |    2 +
 src/backend/storage/lmgr/lwlock.c             |    4 +-
 src/backend/storage/lmgr/lwlocknames.txt      |    1 +
 src/backend/storage/smgr/smgr.c               |    4 +-
 src/backend/tcop/postgres.c                   |   37 +-
 src/backend/utils/adt/pgstatfuncs.c           |   83 +-
 src/backend/utils/cache/relcache.c            |    5 +
 src/backend/utils/init/globals.c              |    1 +
 src/backend/utils/init/miscinit.c             |    3 -
 src/backend/utils/init/postinit.c             |   11 +
 src/backend/utils/misc/guc.c                  |   14 +-
 src/backend/utils/misc/postgresql.conf.sample |    2 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |    2 +-
 src/include/miscadmin.h                       |    3 +-
 src/include/pgstat.h                          |  655 +-
 src/include/storage/lwlock.h                  |    1 +
 src/include/utils/guc_tables.h                |    2 +-
 src/include/utils/timeout.h                   |    1 +
 35 files changed, 2349 insertions(+), 4136 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index dcaea7135f..49df584a9e 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1061,8 +1061,8 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
                  * our own.  In this case we should count and sample the row,
                  * to accommodate users who load a table and analyze it in one
                  * transaction.  (pgstat_report_analyze has to adjust the
-                 * numbers we send to the stats collector to make this come
-                 * out right.)
+                 * numbers we report to the activity stats facility to make
+                 * this come out right.)
                  */
                 if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(targtuple->t_data)))
                 {
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 4f2f38168d..3cb6e20ed5 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -599,7 +599,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
                         new_min_multi,
                         false);
 
-    /* report results to the stats collector, too */
+    /* report results to the activity stats facility, too */
     pgstat_report_vacuum(RelationGetRelid(onerel),
                          onerel->rd_rel->relisshared,
                          Max(new_live_tuples, 0),
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 79a77ebbfe..b40f85e635 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8577,9 +8577,9 @@ LogCheckpointEnd(bool restartpoint)
                         &sync_secs, &sync_usecs);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time +=
+    CheckPointerStats.checkpoint_write_time +=
         write_secs * 1000 + write_usecs / 1000;
-    BgWriterStats.m_checkpoint_sync_time +=
+    CheckPointerStats.checkpoint_sync_time +=
         sync_secs * 1000 + sync_usecs / 1000;
 
     /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 0974f3e23a..9507fb8210 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1688,28 +1688,10 @@ index_concurrently_swap(Oid newIndexId, Oid oldIndexId, const char *oldName)
 
     /*
      * Copy over statistics from old to new index
+     * The data will be flushed by the next pgstat_report_stat()
+     * call.
      */
-    {
-        PgStat_StatTabEntry *tabentry;
-
-        tabentry = pgstat_fetch_stat_tabentry(oldIndexId);
-        if (tabentry)
-        {
-            if (newClassRel->pgstat_info)
-            {
-                newClassRel->pgstat_info->t_counts.t_numscans = tabentry->numscans;
-                newClassRel->pgstat_info->t_counts.t_tuples_returned = tabentry->tuples_returned;
-                newClassRel->pgstat_info->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_hit = tabentry->blocks_hit;
-
-                /*
-                 * The data will be sent by the next pgstat_report_stat()
-                 * call.
-                 */
-            }
-        }
-    }
+    pgstat_copy_index_counters(oldIndexId, newClassRel->pgstat_info);
 
     /* Close relations */
     table_close(pg_class, RowExclusiveLock);
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8af12b5c6b..a7e787d9d1 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -644,10 +644,10 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
     }
 
     /*
-     * Report ANALYZE to the stats collector, too.  However, if doing
-     * inherited stats we shouldn't report, because the stats collector only
-     * tracks per-table stats.  Reset the changes_since_analyze counter only
-     * if we analyzed all columns; otherwise, there is still work for
+     * Report ANALYZE to the activity stats facility, too.  However, if doing
+     * inherited stats we shouldn't report, because the activity stats facility
+     * only tracks per-table stats.  Reset the changes_since_analyze counter
+     * only if we analyzed all columns; otherwise, there is still work for
      * auto-analyze to do.
      */
     if (!inh)
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index f27c3fe8c1..de0c749570 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -971,7 +971,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
     DropDatabaseBuffers(db_id);
 
     /*
-     * Tell the stats collector to forget it immediately, too.
+     * Tell the activity stats facility to forget it immediately, too.
      */
     pgstat_drop_database(db_id);
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index f80a9e96a9..e7ccc6eba7 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -336,10 +336,10 @@ ExecRefreshMatView(RefreshMatViewStmt *stmt, const char *queryString,
         refresh_by_heap_swap(matviewOid, OIDNewHeap, relpersistence);
 
         /*
-         * Inform stats collector about our activity: basically, we truncated
-         * the matview and inserted some new data.  (The concurrent code path
-         * above doesn't need to worry about this because the inserts and
-         * deletes it issues get counted by lower-level code.)
+         * Inform activity stats facility about our activity: basically, we
+         * truncated the matview and inserted some new data.  (The concurrent
+         * code path above doesn't need to worry about this because the inserts
+         * and deletes it issues get counted by lower-level code.)
          */
         pgstat_count_truncate(matviewRel);
         if (!stmt->skipData)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ddeec870d8..c5477ff567 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -319,8 +319,8 @@ vacuum(List *relations, VacuumParams *params,
                  errmsg("VACUUM option DISABLE_PAGE_SKIPPING cannot be used with FULL")));
 
     /*
-     * Send info about dead objects to the statistics collector, unless we are
-     * in autovacuum --- autovacuum.c does this for itself.
+     * Send info about dead objects to the activity statistics facility, unless
+     * we are in autovacuum --- autovacuum.c does this for itself.
      */
     if ((params->options & VACOPT_VACUUM) && !IsAutoVacuumWorkerProcess())
         pgstat_vacuum_stat();
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 2cef56f115..773b82be3b 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -338,9 +338,6 @@ static void autovacuum_do_vac_analyze(autovac_table *tab,
                                       BufferAccessStrategy bstrategy);
 static AutoVacOpts *extract_autovac_opts(HeapTuple tup,
                                          TupleDesc pg_class_desc);
-static PgStat_StatTabEntry *get_pgstat_tabentry_relid(Oid relid, bool isshared,
-                                                      PgStat_StatDBEntry *shared,
-                                                      PgStat_StatDBEntry *dbentry);
 static void perform_work_item(AutoVacuumWorkItem *workitem);
 static void autovac_report_activity(autovac_table *tab);
 static void autovac_report_workitem(AutoVacuumWorkItem *workitem,
@@ -1680,12 +1677,12 @@ AutoVacWorkerMain(int argc, char *argv[])
         char        dbname[NAMEDATALEN];
 
         /*
-         * Report autovac startup to the stats collector.  We deliberately do
-         * this before InitPostgres, so that the last_autovac_time will get
-         * updated even if the connection attempt fails.  This is to prevent
-         * autovac from getting "stuck" repeatedly selecting an unopenable
-         * database, rather than making any progress on stuff it can connect
-         * to.
+         * Report autovac startup to the activity stats facility.  We
+         * deliberately do this before InitPostgres, so that the
+         * last_autovac_time will get updated even if the connection attempt
+         * fails.  This is to prevent autovac from getting "stuck" repeatedly
+         * selecting an unopenable database, rather than making any progress on
+         * stuff it can connect to.
          */
         pgstat_report_autovac(dbid);
 
@@ -1957,8 +1954,6 @@ do_autovacuum(void)
     HASHCTL        ctl;
     HTAB       *table_toast_map;
     ListCell   *volatile cell;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     BufferAccessStrategy bstrategy;
     ScanKeyData key;
     TupleDesc    pg_class_desc;
@@ -1977,17 +1972,11 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /*
-     * may be NULL if we couldn't find an entry (only happens if we are
-     * forcing a vacuum for anti-wrap purposes).
-     */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
 
     /*
-     * Clean up any dead statistics collector entries for this DB. We always
+     * Clean up any dead activity statistics entries for this DB. We always
      * want to do this exactly once per DB-processing cycle, even if we find
      * nothing worth vacuuming in the database.
      */
@@ -2030,9 +2019,6 @@ do_autovacuum(void)
     /* StartTransactionCommand changed elsewhere */
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-
     classRel = table_open(RelationRelationId, AccessShareLock);
 
     /* create a copy so we can use it after closing pg_class */
@@ -2111,8 +2097,8 @@ do_autovacuum(void)
 
         /* Fetch reloptions and the pgstat entry for this table */
         relopts = extract_autovac_opts(tuple, pg_class_desc);
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                       relid);
 
         /* Check if it needs vacuum or analyze */
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
@@ -2195,8 +2181,8 @@ do_autovacuum(void)
         }
 
         /* Fetch the pgstat entry for this table */
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                       relid);
 
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
@@ -2755,29 +2741,6 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
     return av;
 }
 
-/*
- * get_pgstat_tabentry_relid
- *
- * Fetch the pgstat entry of a table, either local to a database or shared.
- */
-static PgStat_StatTabEntry *
-get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
-                          PgStat_StatDBEntry *dbentry)
-{
-    PgStat_StatTabEntry *tabentry = NULL;
-
-    if (isshared)
-    {
-        if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
-    }
-    else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
-
-    return tabentry;
-}
 
 /*
  * table_recheck_autovac
@@ -2798,17 +2761,12 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     bool        doanalyze;
     autovac_table *tab = NULL;
     PgStat_StatTabEntry *tabentry;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     bool        wraparound;
     AutoVacOpts *avopts;
 
     /* use fresh stats */
     autovac_refresh_stats();
 
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* fetch the relation's relcache entry */
     classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
     if (!HeapTupleIsValid(classTup))
@@ -2832,8 +2790,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     }
 
     /* fetch the pgstat table entry */
-    tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                         shared, dbentry);
+    tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                   relid);
 
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
@@ -2955,7 +2913,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
  *
  * For analyze, the analysis done is that the number of tuples inserted,
  * deleted and updated since the last analyze exceeds a threshold calculated
- * in the same fashion as above.  Note that the collector actually stores
+ * in the same fashion as above.  Note that the activity stats facility stores
  * the number of tuples (both live and dead) that there were as of the last
  * analyze.  This is asymmetric to the VACUUM case.
  *
@@ -2965,8 +2923,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
  *
  * A table whose autovacuum_enabled option is false is
  * automatically skipped (unless we have to vacuum it due to freeze_max_age).
- * Thus autovacuum can be disabled for specific tables. Also, when the stats
- * collector does not have data about a table, it will be skipped.
+ * Thus autovacuum can be disabled for specific tables. Also, when the activity
+ * statistics does not have data about a table, it will be skipped.
  *
  * A table whose vac_base_thresh value is < 0 takes the base value from the
  * autovacuum_vacuum_threshold GUC variable.  Similarly, a vac_scale_factor
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index a7afa758b6..b075e85839 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -244,9 +244,9 @@ BackgroundWriterMain(void)
         can_hibernate = BgBufferSync(&wb_context);
 
         /*
-         * Send off activity statistics to the stats collector
+         * Send off activity statistics to the activity stats facility
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 3e7dcd4f76..957537b6a2 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -360,7 +360,7 @@ CheckpointerMain(void)
         if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
         {
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            CheckPointerStats.requested_checkpoints++;
         }
 
         /*
@@ -374,7 +374,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                CheckPointerStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -495,14 +495,8 @@ CheckpointerMain(void)
         /* Check for archive_timeout and switch xlog files if necessary. */
         CheckArchiveTimeout();
 
-        /*
-         * Send off activity statistics to the stats collector.  (The reason
-         * why we re-use bgwriter-related code for this is that the bgwriter
-         * and checkpointer used to be just one process.  It's probably not
-         * worth the trouble to split the stats support into two independent
-         * stats message types.)
-         */
-        pgstat_send_bgwriter();
+        /* Send off activity statistics to the activity stats facility. */
+        pgstat_report_checkpointer();
 
         /*
          * If any checkpoint flags have been set, redo the loop to handle the
@@ -708,9 +702,9 @@ CheckpointWriteDelay(int flags, double progress)
         CheckArchiveTimeout();
 
         /*
-         * Report interim activity statistics to the stats collector.
+         * Report interim activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_report_checkpointer();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1255,8 +1249,10 @@ AbsorbSyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    CheckPointerStats.buf_written_backend
+        += CheckpointerShmem->num_backend_writes;
+    CheckPointerStats.buf_fsync_backend
+        += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/interrupt.c b/src/backend/postmaster/interrupt.c
index 3d02439b79..53844eb8bb 100644
--- a/src/backend/postmaster/interrupt.c
+++ b/src/backend/postmaster/interrupt.c
@@ -93,9 +93,8 @@ SignalHandlerForCrashExit(SIGNAL_ARGS)
  * shut down and exit.
  *
  * Typically, this handler would be used for SIGTERM, but some procesess use
- * other signals. In particular, the checkpointer exits on SIGUSR2, the
- * stats collector on SIGQUIT, and the WAL writer exits on either SIGINT
- * or SIGTERM.
+ * other signals. In particular, the checkpointer exits on SIGUSR2, and the WAL
+ * writer exits on either SIGINT or SIGTERM.
  *
  * ShutdownRequestPending should be checked at a convenient place within the
  * main loop, or else the main loop should call HandleMainLoopInterrupts.
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index e3a520def9..6d88c65d5f 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -372,20 +372,20 @@ pgarch_ArchiverCopyLoop(void)
                 pgarch_archiveDone(xlog);
 
                 /*
-                 * Tell the collector about the WAL file that we successfully
-                 * archived
+                 * Tell the activity statistics facility about the WAL file
+                 * that we successfully archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_report_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
             else
             {
                 /*
-                 * Tell the collector about the WAL file that we failed to
-                 * archive
+                 * Tell the activity statistics facility about the WAL file
+                 * that we failed to archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_report_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index e6be2b7836..c7f0503d81 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,22 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Activity Statistics facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects activity statistics, e.g. per-table access statistics, of
+ *  all backends in shared memory. The activity numbers are first stored
+ *  locally in each process, then flushed to shared memory at commit
+ *  time or by idle-timeout.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ * To avoid congestion on the shared memory, shared stats are updated no more
+ * often than once per PGSTAT_MIN_INTERVAL (10000ms). If some local numbers
+ * remain unflushed because of lock failure, retry at intervals that start at
+ * PGSTAT_RETRY_MIN_INTERVAL (1000ms) and double at every retry. Finally, we
+ * force an update PGSTAT_MAX_INTERVAL (600000ms) after the first attempt.
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the activity statistics facility creates the
+ *  area and loads the stored stats file, if any. At shutdown, the last process
+ *  writes the shared stats to the file and destroys the area before exiting.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -19,18 +26,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -40,32 +35,25 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
+#include "common/hashfn.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/fork_process.h"
 #include "postmaster/interrupt.h"
 #include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
+#include "storage/condition_variable.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
 #include "utils/timestamp.h"
 
@@ -73,35 +61,20 @@
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
+#define PGSTAT_MIN_INTERVAL            10000    /* Minimum interval of stats data
+                                             * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_RETRY_MIN_INTERVAL    1000 /* Initial retry interval after
+                                          * PGSTAT_MIN_INTERVAL */
 
+#define PGSTAT_MAX_INTERVAL            600000    /* Longest interval of stats data
+                                             * updates */
 
 /* ----------
- * The initial size hints for the hash tables used in the collector.
+ * The initial size hints for the hash tables used in the activity statistics.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
-#define PGSTAT_TAB_HASH_SIZE    512
+#define PGSTAT_TABLE_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
 
@@ -116,7 +89,6 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
-
 /* ----------
  * GUC parameters
  * ----------
@@ -131,16 +103,11 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used, but will be removed with GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
-/*
- * BgWriter global statistics counters (unused in other processes).
- * Stored directly in a stats message structure so it can be sent
- * without needing to copy things around.  We assume this inits to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
-
 /*
  * List of SLRU names that we keep stats for.  There is no central registry of
  * SLRUs, so we use this fixed list instead.  The "other" entry is used for
@@ -159,73 +126,216 @@ static const char *const slru_names[] = {
 
 #define SLRU_NUM_ELEMENTS    lengthof(slru_names)
 
+/* struct for shared SLRU stats */
+typedef struct PgStatSharedSLRUStats
+{
+    PgStat_SLRUStats        entry[SLRU_NUM_ELEMENTS];
+    LWLock                    lock;
+    pg_atomic_uint32        changecount;
+} PgStatSharedSLRUStats;
+
+StaticAssertDecl(sizeof(TimestampTz) == sizeof(pg_atomic_uint64),
+                 "size of pg_atomic_uint64 doesn't match TimestampTz");
+
+typedef struct StatsShmemStruct
+{
+    dsa_handle    stats_dsa_handle;    /* handle for stats data area */
+    dshash_table_handle hash_handle;    /* shared dbstat hash */
+    int            refcount;            /* # of processes attached to the
+                                     * shared stats memory */
+
+    /* stats */
+    PgStat_Archiver            archiver_stats;
+    PgStat_BgWriter            bgwriter_stats;
+    PgStat_CheckPointer        checkpointer_stats;
+    PgStatSharedSLRUStats    slru_stats;
+    pg_atomic_uint32        slru_changecount;
+    pg_atomic_uint64        stats_timestamp;
+
+    bool                    attach_holdoff;
+    ConditionVariable        holdoff_cv;
+
+    pg_atomic_uint64        gc_count; /* # of entries deleted. not
+                                            * protected by StatsLock */
+} StatsShmemStruct;
+
+/* BgWriter global statistics counters */
+PgStat_BgWriter BgWriterStats = {0};
+
+/* CheckPointer global statistics counters */
+PgStat_CheckPointer CheckPointerStats = {0};
+
 /*
- * SLRU statistics counts waiting to be sent to the collector.  These are
- * stored directly in stats message format so they can be sent without needing
- * to copy things around.  We assume this variable inits to zeroes.  Entries
- * are one-to-one with slru_names[].
+ * SLRU statistics counts waiting to be written to the shared activity
+ * statistics.  We assume this variable inits to zeroes.  Entries are
+ * one-to-one with slru_names[].
+ * Changes of SLRU counters are reported within critical sections so we use
+ * static memory in order to avoid memory allocation.
  */
-static PgStat_MsgSLRU SLRUStats[SLRU_NUM_ELEMENTS];
+static PgStat_SLRUStats local_SLRUStats[SLRU_NUM_ELEMENTS];
+static bool     have_slrustats = false;
 
 /* ----------
  * Local data
  * ----------
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
-
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
+/* backend-lifetime storage */
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
 
 /*
- * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
- *
- * NOTE: once allocated, TabStatusArray structures are never moved or deleted
- * for the life of the backend.  Also, we zero out the t_id fields of the
- * contained PgStat_TableStatus structs whenever they are not actively in use.
- * This allows relcache pgstat_info pointers to be treated as long-lived data,
- * avoiding repeated searches in pgstat_initstats() when a relation is
- * repeatedly opened during a transaction.
+ * Types to define shared statistics structure.
+ *
+ * Per-object statistics are stored in "shared stats" entries: structs in
+ * DSA-allocated memory that begin with a header common to all object types.
+ * Each shared stats entry is pointed to from a dshash entry via a dsa_pointer.
+ * This structure keeps the shared stats immovable across dshash resizing,
+ * lets a backend point to shared stats entries via a native pointer, and
+ * allows locking at the stats-entry level. Per-entry locking reduces lock
+ * contention compared to the dshash partition lock. A backend accumulates
+ * stats numbers in an entry in local memory, then flushes them to the shared
+ * stats entries, basically at transaction end.
+ *
+ * Each stat entry type has a fixed member PgStat_StatEntryHeader as the first
+ * element.
+ *
+ * Shared stats are stored as:
+ *
+ * dshash pgStatSharedHash
+ *    -> PgStatHashEntry                    (dshash entry)
+ *      (dsa_pointer)-> PgStat_Stat*Entry    (dsa memory block)
+ *
+ * Shared stats entries are directly pointed from pgstat_localhash hash:
+ *
+ * pgstat_localhash pgStatEntHash
+ *    -> PgStatLocalHashEntry                (equivalent of PgStatHashEntry)
+ *      (native pointer)-> PgStat_Stat*Entry (dsa memory block)
+ *
+ * Local stats waiting to be flushed to shared stats are stored as:
+ *
+ * pgstat_localhash pgStatLocalHash
+ *    -> PgStatLocalHashEntry                 (local hash entry)
+ *      (native pointer)-> PgStat_Stat*Entry/TableStatus (palloc'ed memory)
  */
-#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
 
-typedef struct TabStatusArray
+/* The types of statistics entries */
+typedef enum PgStatTypes
 {
-    struct TabStatusArray *tsa_next;    /* link to next array, if any */
-    int            tsa_used;        /* # entries currently used */
-    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
-} TabStatusArray;
-
-static TabStatusArray *pgStatTabList = NULL;
+    PGSTAT_TYPE_DB,            /* database-wide statistics */
+    PGSTAT_TYPE_TABLE,        /* per-table statistics */
+    PGSTAT_TYPE_FUNCTION    /* per-function statistics */
+}            PgStatTypes;
 
 /*
- * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
+ * entry body size lookup table of shared statistics entries corresponding to
+ * PgStatTypes
  */
-typedef struct TabStatHashEntry
+static const size_t pgstat_sharedentsize[] =
 {
-    Oid            t_id;
-    PgStat_TableStatus *tsa_entry;
-} TabStatHashEntry;
+    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_StatTabEntry),/* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_StatFuncEntry)/* PGSTAT_TYPE_FUNCTION */
+};
 
-/*
- * Hash table for O(1) t_id -> tsa_entry lookup
- */
-static HTAB *pgStatTabHash = NULL;
+/* Ditto for local statistics entries */
+static const size_t pgstat_localentsize[] =
+{
+    sizeof(PgStat_StatDBEntry),            /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_TableStatus),            /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_BackendFunctionEntry) /* PGSTAT_TYPE_FUNCTION */
+};
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * We should avoid overwriting the header part of a shared entry. Use these
+ * macros to know what portion of the struct is to be written or read.
+ * PGSTAT_SHENT_BODY returns a slightly smaller address than the actual address
+ * of the next member, but that doesn't matter.
  */
-static HTAB *pgStatFunctions = NULL;
+#define PGSTAT_SHENT_BODY(e) (((char *)(e)) + sizeof(PgStat_StatEntryHeader))
+#define PGSTAT_SHENT_BODY_LEN(t) \
+    (pgstat_sharedentsize[t] - sizeof(PgStat_StatEntryHeader))
+
+/* struct for shared statistics hash entry key. */
+typedef struct PgStatHashKey
+{
+    PgStatTypes type;        /* statistics entry type */
+    Oid            databaseid;    /* database ID. InvalidOid for shared objects. */
+    Oid            objectid;    /* object ID, either table or function. */
+} PgStatHashKey;
+
+/* struct for shared statistics hash entry */
+typedef struct PgStatHashEntry
+{
+    PgStatHashKey    key; /* hash key */
+    dsa_pointer        body;/* pointer to shared stats in PgStat_StatEntryHeader */
+} PgStatHashEntry;
+
+/* struct for shared statistics local hash entry. */
+typedef struct PgStatLocalHashEntry
+{
+    PgStatHashKey            key;    /* hash key */
+    char                    status;    /* for simplehash use */
+    PgStat_StatEntryHeader *body;    /* pointer to stats body in local heap */
+    dsa_pointer                dsapointer; /* dsa pointer of body */
+} PgStatLocalHashEntry;
+
+/* parameter for the shared hash */
+static const dshash_parameters dsh_params = {
+    sizeof(PgStatHashKey),
+    sizeof(PgStatHashEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+
+/* define hashtable for local hashes */
+#define SH_PREFIX pgstat_localhash
+#define SH_ELEMENT_TYPE PgStatLocalHashEntry
+#define SH_KEY_TYPE PgStatHashKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+    hash_bytes((unsigned char *)&key, sizeof(PgStatHashKey))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(PgStatHashKey)) == 0)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/* The shared hash to index activity stats entries. */
+static dshash_table *pgStatSharedHash = NULL;
 
 /*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
+ * The local cache to index shared stats entries.
+ *
+ * This is a local hash to store native pointers to shared hash
+ * entries. pgStatEntHashAge is copied from StatsShmem->gc_count at creation
+ * and garbage collection.
  */
-static bool have_function_stats = false;
+static pgstat_localhash_hash *pgStatEntHash = NULL;
+static int     pgStatEntHashAge = 0;        /* cache age of pgStatEntHash */
+
+/* Local stats numbers are stored here. */
+static pgstat_localhash_hash *pgStatLocalHash = NULL;
+
+/* entry type for oid hash */
+typedef struct pgstat_oident
+{
+    Oid oid;
+    char status;
+} pgstat_oident;
+
+/* Define hashtable for OID hashes. */
+StaticAssertDecl(sizeof(Oid) == 4, "oid is not compatible with uint32");
+#define SH_PREFIX pgstat_oid
+#define SH_ELEMENT_TYPE pgstat_oident
+#define SH_KEY_TYPE Oid
+#define SH_KEY oid
+#define SH_HASH_KEY(tb, key) hash_bytes_uint32(key)
+#define SH_EQUAL(tb, a, b) (a == b)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
 
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
@@ -262,11 +372,8 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -275,20 +382,9 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Make our own memory context to make it easy to track memory usage.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
+MemoryContext    pgStatCacheContext = NULL;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -297,37 +393,52 @@ static List *pending_write_requests = NIL;
  */
 static instr_time total_func_time;
 
+/* Simple caching feature for pgstatfuncs */
+static PgStatHashKey    stathashkey_zero = {0};
+static PgStatHashKey        cached_dbent_key = {0};
+static PgStat_StatDBEntry    cached_dbent;
+static PgStatHashKey        cached_tabent_key = {0};
+static PgStat_StatTabEntry    cached_tabent;
+static PgStatHashKey        cached_funcent_key = {0};
+static PgStat_StatFuncEntry    cached_funcent;
+
+static PgStat_Archiver        cached_archiverstats;
+static bool                    cached_archiverstats_is_valid = false;
+static PgStat_BgWriter        cached_bgwriterstats;
+static bool                    cached_bgwriterstats_is_valid = false;
+static PgStat_CheckPointer    cached_checkpointerstats;
+static bool                    cached_checkpointerstats_is_valid = false;
+static PgStat_SLRUStats        cached_slrustats;
+static bool                    cached_slrustats_is_valid = false;
 
 /* ----------
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgstat_beshutdown_hook(int code, Datum arg);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                                                 Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStat_StatDBEntry *get_local_dbstat_entry(Oid dbid);
+static PgStat_TableStatus *get_local_tabstat_entry(Oid rel_id, bool isshared);
+
+static void pgstat_write_statsfile(void);
+
+static void pgstat_read_statsfile(void);
 static void pgstat_read_current_status(void);
 
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
+static PgStat_StatEntryHeader * get_stat_entry(PgStatTypes type, Oid dbid,
+                                               Oid objid, bool nowait,
+                                               bool create, bool *found);
 
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
-static void pgstat_send_slru(void);
-static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
+static bool flush_tabstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_funcstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_dbstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_slrustat(bool nowait);
+static void delete_current_stats_entry(dshash_seq_status *hstat);
+static PgStat_StatEntryHeader * get_local_stat_entry(PgStatTypes type, Oid dbid,
+                                                     Oid objid, bool create,
+                                                     bool *found);
 
-static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
+static pgstat_oid_hash *collect_oids(Oid catalogid, AttrNumber anum_oid);
 
 static void pgstat_setup_memcxt(void);
 
@@ -337,486 +448,582 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute shared memory space needed for activity statistics
+ */
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+/*
+ * StatsShmemInit - initialize during shared-memory creation
  */
 void
-pgstat_init(void)
+StatsShmemInit(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
+    bool        found;
+
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),    &found);
+
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+        ConditionVariableInit(&StatsShmem->holdoff_cv);
+        pg_atomic_init_u64(&StatsShmem->gc_count, 0);
+    }
+}
+
+/* ----------
+ * allow_next_attacher() -
+ *
+ *  Let other processes go ahead and attach to the shared stats area.
+ * ----------
+ */
+static void
+allow_next_attacher(void)
+{
+    bool triggerd = false;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (StatsShmem->attach_holdoff)
+    {
+        StatsShmem->attach_holdoff = false;
+        triggerd = true;
+    }
+    LWLockRelease(StatsLock);
+
+    if (triggerd)
+        ConditionVariableBroadcast(&StatsShmem->holdoff_cv);
+}
+
+/* ----------
+ * attach_shared_stats() -
+ *
+ *    Attach to or create the shared stats memory. If we are the first process
+ *    to use the activity stats system, read the saved statistics file, if any.
+ * ---------
+ */
+static void
+attach_shared_stats(void)
+{
+    MemoryContext oldcontext;
 
     /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
+     * Don't use dsm under postmaster, or when not tracking counts.
      */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    pgstat_setup_memcxt();
+
+    if (area)
+        return;
+
+    /* stats shared memory persists for the backend lifetime */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
 
     /*
-     * Create the UDP socket for sending and receiving statistic messages
+     * The first attacher backend may still be reading the stats file, or the
+     * last detacher may be writing it. Wait for the work to finish.
      */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
+    ConditionVariablePrepareToSleep(&StatsShmem->holdoff_cv);
+    for (;;)
     {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
+        bool hold_off;
+
+        LWLockAcquire(StatsLock, LW_SHARED);
+        hold_off = StatsShmem->attach_holdoff;
+        LWLockRelease(StatsLock);
+
+        if (!hold_off)
+            break;
+
+        ConditionVariableTimedSleep(&StatsShmem->holdoff_cv, 10,
+                                    WAIT_EVENT_READING_STATS_FILE);
     }
+    ConditionVariableCancelSleep();
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
     /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
+     * The last process is responsible for writing out the stats file at exit.
+     * Maintain refcount so that a process that is about to exit can tell
+     * whether it is the last one.
      */
-    for (addr = addrs; addr; addr = addr->ai_next)
+    if (StatsShmem->refcount > 0)
+        StatsShmem->refcount++;
+    else
     {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
+        /* We're the first process to attach to the shared stats memory */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
 
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
+        /* Initialize shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatSharedHash = dshash_create(area, &dsh_params, 0);
 
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->hash_handle = dshash_get_hash_table_handle(pgStatSharedHash);
+        LWLockInitialize(&StatsShmem->slru_stats.lock, LWTRANCHE_STATS);
+        pg_atomic_init_u32(&StatsShmem->slru_stats.changecount, 0);
 
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        /* Block the next attacher for a while, see the comment above. */
+        StatsShmem->attach_holdoff = true;
 
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        StatsShmem->refcount = 1;
+    }
 
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    LWLockRelease(StatsLock);
 
+    if (area)
+    {
         /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
+         * We're the first attacher process, read stats file while blocking
+         * successors.
          */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
+        Assert(StatsShmem->attach_holdoff);
+        pgstat_read_statsfile();
+        allow_next_attacher();
     }
-
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
-
-    /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
-     */
-    if (!pg_set_noblock(pgStatSock))
+    else
     {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
-    }
+        /* We're not the first one, attach existing shared area. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatSharedHash = dshash_attach(area, &dsh_params,
+                                         StatsShmem->hash_handle, 0);
 
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
     }
 
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-    /* Now that we have a long-lived socket, tell fd.c about it. */
-    ReserveExternalFD();
+    MemoryContextSwitchTo(oldcontext);
 
-    return;
+    /* don't detach automatically */
+    dsa_pin_mapping(area);
+}
 
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
+/* ----------
+ * cleanup_dropped_stats_entries() -
+ *              Clean up shared stats entries no longer used.
+ *
+ *  Shared stats entries for dropped objects may be left referenced. Clean up
+ *  our reference and drop the shared entry if needed.
+ * ----------
+ */
+static void
+cleanup_dropped_stats_entries(void)
+{
+    pgstat_localhash_iterator i;
+    PgStatLocalHashEntry   *ent;
 
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
+    if (pgStatEntHash == NULL)
+        return;
 
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
+    pgstat_localhash_start_iterate(pgStatEntHash, &i);
+    while ((ent = pgstat_localhash_iterate(pgStatEntHash, &i))
+           != NULL)
+    {
+        /*
+         * Free the shared memory chunk for the entry if we were the last
+         * referrer to a dropped entry.
+         */
+        if (pg_atomic_sub_fetch_u32(&ent->body->refcount, 1) < 1 &&
+            ent->body->dropped)
+            dsa_free(area, ent->dsapointer);
+    }
 
     /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
+     * This function is expected to be called during backend exit. So we don't
+     * bother destroying pgStatEntHash.
      */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    pgStatEntHash = NULL;
 }
 
-/*
- * subroutine for pgstat_reset_all
+/* ----------
+ * detach_shared_stats() -
+ *
+ *    Detach shared stats. Write out to file if we're the last process and told
+ *    to do so.
+ * ----------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+detach_shared_stats(bool write_file)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    bool is_last_detacher = false;
+
+    /* immediately return if useless */
+    if (!area || !IsUnderPostmaster)
+        return;
+
+    /* We shouldn't leave a reference to shared stats. */
+    Assert(pgStatEntHash == NULL);
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
+    /*
+     * If we are the last detacher, hold off the next attacher (if possible)
+     * until we finish writing the stats file.
+     */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (--StatsShmem->refcount == 0)
     {
-        int            nchars;
-        Oid            tmp_oid;
+        StatsShmem->attach_holdoff = true;
+        is_last_detacher = true;
+    }
+    LWLockRelease(StatsLock);
 
-        /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
-         */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
-
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
-
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+    if (is_last_detacher)
+    {
+        if (write_file)
+            pgstat_write_statsfile();
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+        /* allow the next attacher, if any */
+        allow_next_attacher();
     }
-    FreeDir(dir);
+
+    /*
+     * Detach the area. It is automatically destroyed when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);
+
+    area = NULL;
+    pgStatSharedHash = NULL;
+
+    /* We are going to exit. Don't bother destroying local hashes. */
+    pgStatLocalHash = NULL;
 }
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    /* standalone server doesn't use shared stats */
+    if (!IsUnderPostmaster)
+        return;
+
+    /* we must have shared stats attached */
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
+
+    /* Startup must be the only user of shared stats */
+    Assert(StatsShmem->refcount == 1);
+
+    /*
+     * We could directly remove files and recreate the shared memory area, but
+     * for simplicity just discard it and then create it again.
+     */
+    detach_shared_stats(false);
+    attach_shared_stats();
 }
 
-#ifdef EXEC_BACKEND
 
 /*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
+ * fetch_lock_statentry - common helper function to fetch and lock a stats
+ * entry for flush_tabstat, flush_funcstat and flush_dbstat.
  */
-static pid_t
-pgstat_forkexec(void)
+static PgStat_StatEntryHeader *
+fetch_lock_statentry(PgStatTypes type, Oid dboid, Oid objid, bool nowait)
 {
-    char       *av[10];
-    int            ac = 0;
+    PgStat_StatEntryHeader *header;
+
+    /* find shared table stats entry corresponding to the local entry */
+    header = (PgStat_StatEntryHeader *)
+        get_stat_entry(type, dboid, objid, nowait, true, NULL);
 
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
+    /* skip if dshash failed to acquire lock */
+    if (header == NULL)
+        return NULL;
 
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&header->lock, LW_EXCLUSIVE))
+        return NULL;
 
-    return postmaster_forkexec(ac, av);
+    return header;
 }
-#endif                            /* EXEC_BACKEND */
 
 
-/*
- * pgstat_start() -
- *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
+/* ----------
+ * get_stat_entry() -
  *
- *    Returns PID of child process, or 0 if fail.
+ *    Get the shared stats entry for the specified type, dbid and objid.
+ *  If nowait is true, returns NULL on lock failure.
  *
- *    Note: if fail, we will be called again from the postmaster main loop.
+ *  If create is true, a new entry is created if one does not exist yet. If
+ *  found is not NULL, it is set to true if an existing entry is found or
+ *  false if not.
+ *  ----------
  */
-int
-pgstat_start(void)
+static PgStat_StatEntryHeader *
+get_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait, bool create,
+               bool *found)
 {
-    time_t        curtime;
-    pid_t        pgStatPid;
+    PgStatHashEntry           *shhashent;
+    PgStatLocalHashEntry   *lohashent;
+    PgStat_StatEntryHeader *shheader = NULL;
+    PgStatHashKey        key;
+    bool                    myfound = false;
 
-    /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
-     */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
+    key.type        = type;
+    key.databaseid     = dbid;
+    key.objectid    = objid;
 
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
+    if (pgStatEntHash)
+    {
+        uint64 currage;
 
-    /*
-     * Okay, fork off the collector.
-     */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
+        /*
+         * pgStatEntHashAge advances much more slowly than the following loop
+         * runs, so this is expected to iterate no more than twice.
+         */
+        while (unlikely
+               (pgStatEntHashAge !=
+                (currage = pg_atomic_read_u64(&StatsShmem->gc_count))))
+        {
+            pgstat_localhash_iterator i;
+
+            /*
+             * Some entries have been dropped. Invalidate cache pointer to
+             * them.
+             */
+            pgstat_localhash_start_iterate(pgStatEntHash, &i);
+            while ((lohashent = pgstat_localhash_iterate(pgStatEntHash, &i))
+                   != NULL)
+            {
+                PgStat_StatEntryHeader *header = lohashent->body;
+                if (header->dropped)
+                {
+                    if (pg_atomic_sub_fetch_u32(&header->refcount, 1) < 1)
+                    {
+                        /*
+                         * We're the last referrer to this entry, drop the
+                         * shared entry.
+                         */
+                        dsa_free(area, lohashent->dsapointer);
+                    }
+                    /* Delete last: lohashent must not be used afterwards. */
+                    pgstat_localhash_delete(pgStatEntHash, lohashent->key);
+                }
+            }
+
+            pgStatEntHashAge = currage;
+        }
+
+        lohashent = pgstat_localhash_lookup(pgStatEntHash, key);
+
+        if (lohashent)
+        {
+            if (found)
+                *found = true;
+            return lohashent->body;
+        }
+    }
+
+    shhashent = dshash_find_extended(pgStatSharedHash, &key,
+                                     create, nowait, create, &myfound);
+    if (shhashent)
     {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
+        if (create && !myfound)
+        {
+            /* Create new stats entry. */
+            dsa_pointer chunk = dsa_allocate0(area,
+                                              pgstat_sharedentsize[type]);
+
+            shheader = dsa_get_address(area, chunk);
+            LWLockInitialize(&shheader->lock, LWTRANCHE_STATS);
+            pg_atomic_init_u32(&shheader->refcount, 0);
+
+            /* Link the new entry from the hash entry. */
+            shhashent->body = chunk;
+        }
+        else
+            shheader = dsa_get_address(area, shhashent->body);
+
+        /*
+         * We expose this shared entry now.  You might think that the entry can
+         * be removed by a concurrent backend, but since we are creating a
+         * stats entry, the object actually exists and is in use in the upper
+         * layer. Such an object cannot be dropped until the first vacuum after
+         * the current transaction ends.
+         */
+        dshash_release_lock(pgStatSharedHash, shhashent);
+
+        /* register to local hash if possible */
+        if (pgStatEntHash || pgStatCacheContext)
+        {
+            if (pgStatEntHash == NULL)
+            {
+                pgStatEntHash =
+                    pgstat_localhash_create(pgStatCacheContext,
+                                        PGSTAT_TABLE_HASH_SIZE, NULL);
+                pgStatEntHashAge =
+                    pg_atomic_read_u64(&StatsShmem->gc_count);
+            }
+
+            lohashent =
+                pgstat_localhash_insert(pgStatEntHash, key, &myfound);
+
+            Assert(!myfound);
+            lohashent->body = shheader;
+            lohashent->dsapointer = shhashent->body;
+
+            pg_atomic_add_fetch_u32(&shheader->refcount, 1);
+        }
+    }
+
+    if (found)
+        *found = myfound;
+
+    return shheader;
+}
 
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
+/*
+ * flush_slrustat - flush out local SLRU stats entries
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true. Writer processes are mutually excluded
+ * using an LWLock, but readers are expected to use the change-count protocol.
+ *
+ * Returns true if all local SLRU stats are successfully flushed out.
+ */
+static bool
+flush_slrustat(bool nowait)
+{
+    uint32    assert_changecount PG_USED_FOR_ASSERTS_ONLY;
+    int i;
+
+    if (!have_slrustats)
+        return true;
 
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&StatsShmem->slru_stats.lock,
+                                       LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
 
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
+    assert_changecount =
+        pg_atomic_fetch_add_u32(&StatsShmem->slru_stats.changecount, 1);
+    Assert((assert_changecount & 1) == 0);
 
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
+    for (i = 0 ; i < SLRU_NUM_ELEMENTS ; i++)
+    {
+        PgStat_SLRUStats *sharedent = &StatsShmem->slru_stats.entry[i];
+        PgStat_SLRUStats *localent = &local_SLRUStats[i];
 
-        default:
-            return (int) pgStatPid;
+        sharedent->blocks_zeroed += localent->blocks_zeroed;
+        sharedent->blocks_hit += localent->blocks_hit;
+        sharedent->blocks_read += localent->blocks_read;
+        sharedent->blocks_written += localent->blocks_written;
+        sharedent->blocks_exists += localent->blocks_exists;
+        sharedent->flush += localent->flush;
+        sharedent->truncate += localent->truncate;
     }
 
-    /* shouldn't get here */
-    return 0;
+    /* done, clear the local entry */
+    MemSet(local_SLRUStats, 0,
+           sizeof(PgStat_SLRUStats) * SLRU_NUM_ELEMENTS);
+
+    pg_atomic_add_fetch_u32(&StatsShmem->slru_stats.changecount, 1);
+    LWLockRelease(&StatsShmem->slru_stats.lock);
+
+    have_slrustats = false;
+
+    return true;
 }
 
-void
-allow_immediate_pgstat_restart(void)
+/* ----------
+ * delete_current_stats_entry()
+ *
+ *  Deletes the given shared entry from shared stats hash. The entry must be
+ *  exclusively locked.
+ * ----------
+ */
+static void
+delete_current_stats_entry(dshash_seq_status *hstat)
 {
-    last_pgstat_start_time = 0;
+    dsa_pointer                pdsa;
+    PgStat_StatEntryHeader *header;
+    PgStatHashEntry *ent;
+
+    ent = dshash_get_current(hstat);
+    pdsa = ent->body;
+    header = dsa_get_address(area, pdsa);
+
+    /* No one can find this entry from now on. */
+    dshash_delete_current(hstat);
+
+    /*
+     * Let the referrers drop the entry if any.  Refcount won't be decremented
+     * until "dropped" is set true and StatsShmem->gc_count is incremented
+     * later. So we can check refcount to set dropped without holding a
+     * lock. If no one is referring to this entry, free it immediately.
+     */
+
+    if (pg_atomic_read_u32(&header->refcount) > 0)
+        header->dropped = true;
+    else
+        dsa_free(area, pdsa);
+
+    return;
+}
+
+
+/* ----------
+ * get_local_stat_entry() -
+ *
+ *  Returns local stats entry for the type, dbid and objid.
+ *  If create is true, a new entry is created if one does not exist yet; found
+ *  must be non-NULL in that case.
+ *
+ *
+ *  The caller is responsible for initializing the statsbody part of the
+ *  returned memory.
+ * ----------
+ */
+static PgStat_StatEntryHeader *
+get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                     bool create, bool *found)
+{
+    PgStatHashKey key;
+    PgStatLocalHashEntry *entry;
+
+    if (pgStatLocalHash == NULL)
+        pgStatLocalHash = pgstat_localhash_create(pgStatCacheContext,
+                                                  PGSTAT_TABLE_HASH_SIZE, NULL);
+
+    /* Find an entry or create a new one. */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    if (create)
+        entry = pgstat_localhash_insert(pgStatLocalHash, key, found);
+    else
+        entry = pgstat_localhash_lookup(pgStatLocalHash, key);
+
+    if (!create && !entry)
+        return NULL;
+
+    if (create && !*found)
+        entry->body = MemoryContextAllocZero(TopMemoryContext,
+                                             pgstat_localentsize[type]);
+
+    return entry->body;
 }
 
 /* ------------------------------------------------------------
@@ -824,147 +1031,386 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *    Updates are applied no more often than once per PGSTAT_MIN_INTERVAL
+ *    milliseconds. They are also postponed on lock failure if force is false
+ *    and no update has been pending for longer than PGSTAT_MAX_INTERVAL
+ *    milliseconds. Postponed updates are retried in succeeding calls of this
+ *    function.
+ *
+ *    Returns the time in milliseconds until updates can next be applied, if
+ *    no update has been held for more than PGSTAT_MIN_INTERVAL
+ *    milliseconds.
+ *
+ *    Note that this is called only out of a transaction, so it is fine to use
  *    transaction stop time as an approximation of current time.
- * ----------
+ *    ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz next_flush = 0;
+    static TimestampTz pending_since = 0;
+    static long retry_interval = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
+    bool        nowait;
     int            i;
+    uint64        oldval;
+
+#define HAVE_ANY_PENDING_STATS()                        \
+    (pgStatLocalHash != NULL ||                            \
+     pgStatXactCommit != 0 || pgStatXactRollback != 0)
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (area == NULL || !HAVE_ANY_PENDING_STATS())
+        return 0;
 
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
     now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
-
-    /*
-     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
-     */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
-    pgStatTabHash = NULL;
-
-    /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
-     */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
-    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+
+    if (!force)
     {
-        for (i = 0; i < tsa->tsa_used; i++)
+        /*
+         * Don't flush stats too frequently.  Return the time to the next
+         * flush.
+         */
+        if (now < next_flush)
         {
-            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
-
-            /* Shouldn't have any pending transaction-dependent counts */
-            Assert(entry->trans == NULL);
-
-            /*
-             * Ignore entries that didn't accumulate any actual counts, such
-             * as indexes that were opened by the planner but not used.
-             */
-            if (memcmp(&entry->t_counts, &all_zeroes,
-                       sizeof(PgStat_TableCounts)) == 0)
+            /* Record the epoch time if retrying. */
+            if (pending_since == 0)
+                pending_since = now;
+
+            return (next_flush - now) / 1000;
+        }
+
+        /* But, don't keep pending updates longer than PGSTAT_MAX_INTERVAL. */
+
+        if (pending_since > 0 &&
+            TimestampDifferenceExceeds(pending_since, now, PGSTAT_MAX_INTERVAL))
+            force = true;
+    }
+
+    /* don't wait for lock acquisition when !force */
+    nowait = !force;
+
+    if (pgStatLocalHash)
+    {
+        pgstat_localhash_iterator i;
+        List                   *dbentlist = NIL;
+        PgStatLocalHashEntry   *lent;
+        ListCell               *lc;
+        int                        remains = 0;
+
+        /* Step 1: flush out other than database stats */
+        pgstat_localhash_start_iterate(pgStatLocalHash, &i);
+        while ((lent = pgstat_localhash_iterate(pgStatLocalHash, &i)) != NULL)
+        {
+            bool        remove = false;
+
+            switch (lent->key.type)
+            {
+                case PGSTAT_TYPE_DB:
+                    /*
+                     * flush_tabstat adds some of the counters of flushed
+                     * entries to the local database stats.  Just remember
+                     * the database entries for now and flush them later.
+                     */
+                    dbentlist = lappend(dbentlist, lent);
+                    break;
+                case PGSTAT_TYPE_TABLE:
+                    if (flush_tabstat(lent, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_FUNCTION:
+                    if (flush_funcstat(lent, nowait))
+                        remove = true;
+                    break;
+            }
+
+            if (!remove)
+            {
+                remains++;
                 continue;
+            }
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* Remove the successfully flushed entry */
+            pfree(lent->body);
+            lent->body = NULL;
+            pgstat_localhash_delete(pgStatLocalHash, lent->key);
+        }
+
+        /* Step 2: flush out database stats */
+        foreach(lc, dbentlist)
+        {
+            PgStatLocalHashEntry *lent = (PgStatLocalHashEntry *) lfirst(lc);
+
+            if (flush_dbstat(lent, nowait))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                remains--;
+                /* Remove the successfully flushed entry */
+                pfree(lent->body);
+                lent->body = NULL;
+                pgstat_localhash_delete(pgStatLocalHash, lent->key);
             }
         }
-        /* zero out PgStat_TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
+        list_free(dbentlist);
+        dbentlist = NULL;
+
+        if (remains <= 0)
+        {
+            pgstat_localhash_destroy(pgStatLocalHash);
+            pgStatLocalHash = NULL;
+        }
+    }
+
+    /* flush SLRU stats */
+    flush_slrustat(nowait);
+
+    /*
+     * Publish the time of the last flush, but we don't notify the change of
+     * the timestamp itself.  Readers will get a sufficiently recent timestamp.
+     * If we fail to update the value, a concurrent process should have
+     * updated it to a sufficiently recent time.
+     *
+     * XXX: The loop might be unnecessary for the reason above.
+     */
+    oldval = pg_atomic_read_u64(&StatsShmem->stats_timestamp);
+
+    for (i = 0 ; i < 10 ; i++)
+    {
+        if (oldval >= now ||
+            pg_atomic_compare_exchange_u64(&StatsShmem->stats_timestamp,
+                                           &oldval, (uint64) now))
+            break;
     }
 
     /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
+     * Some of the local stats may not have been flushed due to lock
+     * contention.  If we have such pending local stats here, let the caller
+     * know the retry interval.
      */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    if (HAVE_ANY_PENDING_STATS())
+    {
+        /* Retain the epoch time */
+        if (pending_since == 0)
+            pending_since = now;
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+        /* The interval is doubled at every retry. */
+        if (retry_interval == 0)
+            retry_interval = PGSTAT_RETRY_MIN_INTERVAL * 1000;
+        else
+            retry_interval = retry_interval * 2;
 
-    /* Finally send SLRU statistics */
-    pgstat_send_slru();
+        /*
+         * Determine the next retry interval so as not to get shorter than the
+         * previous interval.
+         */
+        if (!TimestampDifferenceExceeds(pending_since,
+                                        now + 2 * retry_interval,
+                                        PGSTAT_MAX_INTERVAL))
+            next_flush = now + retry_interval;
+        else
+        {
+            next_flush = pending_since + PGSTAT_MAX_INTERVAL * 1000;
+            retry_interval = next_flush - now;
+        }
+
+        return retry_interval / 1000;
+    }
+
+    /* Set the next time to update stats */
+    next_flush = now + PGSTAT_MIN_INTERVAL * 1000;
+    retry_interval = 0;
+    pending_since = 0;
+
+    return 0;
 }
 
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * flush_tabstat - flush out a local table stats entry
+ *
+ * Some of the stats numbers are copied to local database stats entry after
+ * successful flush-out.
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
  */
-static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+static bool
+flush_tabstat(PgStatLocalHashEntry *ent, bool nowait)
 {
-    int            n;
-    int            len;
+    static const PgStat_TableCounts all_zeroes;
+    Oid                    dboid;            /* database OID of the table */
+    PgStat_TableStatus *lstats;            /* local stats entry  */
+    PgStat_StatTabEntry *shtabstats;    /* table entry of shared stats */
+    PgStat_StatDBEntry *ldbstats;        /* local database entry */
+
+    Assert(ent->key.type == PGSTAT_TYPE_TABLE);
+    lstats = (PgStat_TableStatus *) ent->body;
+    dboid = ent->key.databaseid;
+
+    /*
+     * Ignore entries that didn't accumulate any actual counts, such as
+     * indexes that were opened by the planner but not used.
+     */
+    if (memcmp(&lstats->t_counts, &all_zeroes,
+               sizeof(PgStat_TableCounts)) == 0)
+    {
+        /* This local entry is going to be dropped, delink from relcache. */
+        pgstat_delinkstats(lstats->relation);
+        return true;
+    }
+
+    /* find shared table stats entry corresponding to the local entry */
+    shtabstats = (PgStat_StatTabEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_TABLE, dboid, ent->key.objectid,
+                             nowait);
+
+    if (shtabstats == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    /* add the values to the shared entry. */
+    shtabstats->numscans += lstats->t_counts.t_numscans;
+    shtabstats->tuples_returned += lstats->t_counts.t_tuples_returned;
+    shtabstats->tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    shtabstats->tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    shtabstats->tuples_updated += lstats->t_counts.t_tuples_updated;
+    shtabstats->tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    shtabstats->tuples_hot_updated += lstats->t_counts.t_tuples_hot_updated;
+
+    /*
+     * If table was truncated or vacuum/analyze has run, first reset the
+     * live/dead counters.
+     */
+    if (lstats->t_counts.t_truncated)
+    {
+        shtabstats->n_live_tuples = 0;
+        shtabstats->n_dead_tuples = 0;
+    }
+
+    shtabstats->n_live_tuples += lstats->t_counts.t_delta_live_tuples;
+    shtabstats->n_dead_tuples += lstats->t_counts.t_delta_dead_tuples;
+    shtabstats->changes_since_analyze += lstats->t_counts.t_changed_tuples;
+    shtabstats->inserts_since_vacuum += lstats->t_counts.t_tuples_inserted;
+    shtabstats->blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    shtabstats->blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    shtabstats->n_live_tuples = Max(shtabstats->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    shtabstats->n_dead_tuples = Max(shtabstats->n_dead_tuples, 0);
+
+    LWLockRelease(&shtabstats->header.lock);
+
+    /* The entry was flushed successfully; add the same counts to database stats */
+    ldbstats = get_local_dbstat_entry(dboid);
+    ldbstats->counts.n_tuples_returned += lstats->t_counts.t_tuples_returned;
+    ldbstats->counts.n_tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    ldbstats->counts.n_tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    ldbstats->counts.n_tuples_updated += lstats->t_counts.t_tuples_updated;
+    ldbstats->counts.n_tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    ldbstats->counts.n_blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    ldbstats->counts.n_blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /* This local entry is going to be dropped, delink from relcache. */
+    pgstat_delinkstats(lstats->relation);
+
+    return true;
+}
+
+/*
+ * flush_funcstat - flush out a local function stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_funcstat(PgStatLocalHashEntry *ent, bool nowait)
+{
+    PgStat_BackendFunctionEntry *localent;    /* local stats entry */
+    PgStat_StatFuncEntry *shfuncent = NULL; /* shared stats entry */
+
+    Assert(ent->key.type == PGSTAT_TYPE_FUNCTION);
+    localent = (PgStat_BackendFunctionEntry *) ent->body;
+
+    /* localent always has non-zero content */
+
+    /* find shared function stats entry corresponding to the local entry */
+    shfuncent = (PgStat_StatFuncEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             ent->key.objectid, nowait);
+    if (shfuncent == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    shfuncent->f_numcalls += localent->f_counts.f_numcalls;
+    shfuncent->f_total_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_total_time);
+    shfuncent->f_self_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_self_time);
+
+    LWLockRelease(&shfuncent->header.lock);
+
+    return true;
+}
+
+
+/*
+ * flush_dbstat - flush out a local database stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_dbstat(PgStatLocalHashEntry *ent, bool nowait)
+{
+    PgStat_StatDBEntry *localent;
+    PgStat_StatDBEntry *sharedent;
+
+    Assert(ent->key.type == PGSTAT_TYPE_DB);
+
+    localent = (PgStat_StatDBEntry *) ent->body;
+
+    /* find shared database stats entry corresponding to the local entry */
+    sharedent = (PgStat_StatDBEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_DB, ent->key.databaseid, InvalidOid,
+                             nowait);
+
+    if (!sharedent)
+        return false;            /* failed to acquire lock, skip */
+
+    sharedent->counts.n_tuples_returned += localent->counts.n_tuples_returned;
+    sharedent->counts.n_tuples_fetched += localent->counts.n_tuples_fetched;
+    sharedent->counts.n_tuples_inserted += localent->counts.n_tuples_inserted;
+    sharedent->counts.n_tuples_updated += localent->counts.n_tuples_updated;
+    sharedent->counts.n_tuples_deleted += localent->counts.n_tuples_deleted;
+    sharedent->counts.n_blocks_fetched += localent->counts.n_blocks_fetched;
+    sharedent->counts.n_blocks_hit += localent->counts.n_blocks_hit;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    sharedent->counts.n_deadlocks += localent->counts.n_deadlocks;
+    sharedent->counts.n_temp_bytes += localent->counts.n_temp_bytes;
+    sharedent->counts.n_temp_files += localent->counts.n_temp_files;
+    sharedent->counts.n_checksum_failures += localent->counts.n_checksum_failures;
 
     /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
+     * Accumulate xact commit/rollback and I/O timings to stats entry of the
+     * current database.
      */
-    if (OidIsValid(tsmsg->m_databaseid))
+    if (OidIsValid(ent->key.databaseid))
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
+        sharedent->counts.n_xact_commit += pgStatXactCommit;
+        sharedent->counts.n_xact_rollback += pgStatXactRollback;
+        sharedent->counts.n_block_read_time += pgStatBlockReadTime;
+        sharedent->counts.n_block_write_time += pgStatBlockWriteTime;
         pgStatXactCommit = 0;
         pgStatXactRollback = 0;
         pgStatBlockReadTime = 0;
@@ -972,282 +1418,130 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        sharedent->counts.n_xact_commit = 0;
+        sharedent->counts.n_xact_rollback = 0;
+        sharedent->counts.n_block_read_time = 0;
+        sharedent->counts.n_block_write_time = 0;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
+    LWLockRelease(&sharedent->header.lock);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    return true;
 }
 
-/*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
- */
-static void
-pgstat_send_funcstats(void)
-{
-    /* we assume this inits to all zeroes: */
-    static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
-
-    if (pgStatFunctions == NULL)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
-
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
-
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
-
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
-
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
-
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
-}
-
-
 /* ----------
  * pgstat_vacuum_stat() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *  Delete shared stat entries that are not in system catalogs.
+ *
+ *  To avoid holding exclusive lock on dshash for a long time, the process is
+ *  performed in three steps.
+ *
+ *   1: Collect existent oids of every kind of object.
+ *   2: Collect victim entries by scanning with shared lock.
+ *   3: Try removing every nominated entry without waiting for lock.
+ *
+ *  As a consequence of the last step, some entries may be left behind due
+ *  to lock failure, but that is harmless because they will be deleted by
+ *  later vacuums.
  * ----------
  */
 void
 pgstat_vacuum_stat(void)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    pgstat_oid_hash       *dbids;    /* database ids */
+    pgstat_oid_hash       *relids;    /* relation ids in the current database */
+    pgstat_oid_hash       *funcids;/* function ids in the current database */
+    int            nvictims = 0;    /* # of entries dropped */
+    dshash_seq_status dshstat;
+    PgStatHashEntry *ent;
+
+    /* we don't collect stats under standalone mode */
+    if (!IsUnderPostmaster)
         return;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* collect oids of existent objects */
+    dbids = collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    relids = collect_oids(RelationRelationId, Anum_pg_class_oid);
+    funcids = collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
 
-    /*
-     * Read pg_database and make a list of OIDs of all existing databases
-     */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    nvictims = 0;
 
-    /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    /* some of the dshash entries are to be removed, take exclusive lock. */
+    dshash_seq_init(&dshstat, pgStatSharedHash, true);
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
     {
-        Oid            dbid = dbentry->databaseid;
+        pgstat_oid_hash *oidhash;
+        Oid           key;
 
         CHECK_FOR_INTERRUPTS();
 
-        /* the DB entry for shared tables (with InvalidOid) is never dropped */
-        if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
-            pgstat_drop_database(dbid);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Lookup our own database entry; if not found, nothing more to do.
-     */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
-        return;
-
-    /*
-     * Similarly to above, make a list of all known relations in this DB.
-     */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
-
-    /*
-     * Check for all tables listed in stats hashtable if they still exist.
-     */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            tabid = tabentry->tableid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
-            continue;
-
         /*
-         * Not there, so add this table's Oid to the message
+         * Don't drop non-database entries that belong to a database other
+         * than the current one.
          */
-        msg.m_tableid[msg.m_nentries++] = tabid;
+        if (ent->key.type != PGSTAT_TYPE_DB &&
+            ent->key.databaseid != MyDatabaseId)
+            continue;
 
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
+        switch (ent->key.type)
         {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
+            case PGSTAT_TYPE_DB:
+                /* don't remove database entry for shared tables */
+                if (ent->key.databaseid == 0)
+                    continue;
+                oidhash = dbids;
+                key = ent->key.databaseid;
+                break;
+
+            case PGSTAT_TYPE_TABLE:
+                oidhash = relids;
+                key = ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                oidhash = funcids;
+                key = ent->key.objectid;
+                break;
         }
-    }
 
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
+        /* Skip existent objects. */
+        if (pgstat_oid_lookup(oidhash, key) != NULL)
+            continue;
 
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
+        /* drop this entry */
+        delete_current_stats_entry(&dshstat);
+        nvictims++;
     }
+    dshash_seq_term(&dshstat);
+    pgstat_oid_destroy(dbids);
+    pgstat_oid_destroy(relids);
+    pgstat_oid_destroy(funcids);
 
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
-        {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
-                continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
-        }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
-    }
+    if (nvictims > 0)
+        pg_atomic_add_fetch_u64(&StatsShmem->gc_count, 1);
 }
 
-
 /* ----------
- * pgstat_collect_oids() -
+ * collect_oids() -
  *
  *    Collect the OIDs of all objects listed in the specified system catalog
- *    into a temporary hash table.  Caller should hash_destroy the result
+ *    into a temporary hash table.  Caller should pgstat_oid_destroy the result
  *    when done with it.  (However, we make the table in CurrentMemoryContext
  *    so that it will be freed properly in event of an error.)
  * ----------
  */
-static HTAB *
-pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
+static pgstat_oid_hash *
+collect_oids(Oid catalogid, AttrNumber anum_oid)
 {
-    HTAB       *htab;
-    HASHCTL        hash_ctl;
+    pgstat_oid_hash *rethash;
     Relation    rel;
     TableScanDesc scan;
     HeapTuple    tup;
     Snapshot    snapshot;
 
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(Oid);
-    hash_ctl.hcxt = CurrentMemoryContext;
-    htab = hash_create("Temporary table of OIDs",
-                       PGSTAT_TAB_HASH_SIZE,
-                       &hash_ctl,
-                       HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    rethash = pgstat_oid_create(CurrentMemoryContext,
+                                PGSTAT_TABLE_HASH_SIZE, NULL);
 
     rel = table_open(catalogid, AccessShareLock);
     snapshot = RegisterSnapshot(GetLatestSnapshot());
@@ -1256,81 +1550,61 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     {
         Oid            thisoid;
         bool        isnull;
+        bool        found;
 
         thisoid = heap_getattr(tup, anum_oid, RelationGetDescr(rel), &isnull);
         Assert(!isnull);
 
         CHECK_FOR_INTERRUPTS();
 
-        (void) hash_search(htab, (void *) &thisoid, HASH_ENTER, NULL);
+        pgstat_oid_insert(rethash, thisoid, &found);
     }
     table_endscan(scan);
     UnregisterSnapshot(snapshot);
     table_close(rel, AccessShareLock);
 
-    return htab;
+    return rethash;
 }
 
-
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
- * ----------
+ *    Remove entry for the database that we just dropped.
+ *
+ *    Some entries might be left behind due to lock failure, or some stats
+ *    might be flushed after this, but we will still clean up the dead DB
+ *    eventually via future invocations of pgstat_vacuum_stat().
+ *    ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
 
+    Assert(OidIsValid(databaseid));
 
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
-
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
+    /* some of the dshash entries are to be removed, take exclusive lock. */
+    dshash_seq_init(&hstat, pgStatSharedHash, true);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+    {
+        if (p->key.databaseid == databaseid)
+            delete_current_stats_entry(&hstat);
+    }
+    dshash_seq_term(&hstat);
+
+    /* Let readers run a garbage collection of local hashes */
+    pg_atomic_add_fetch_u64(&StatsShmem->gc_count, 1);
 }
-#endif                            /* NOT_USED */
 
 
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1339,20 +1613,46 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    /* dshash entry is not modified, take shared lock */
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+    {
+        PgStat_StatEntryHeader *header;
+
+        if (p->key.databaseid != MyDatabaseId)
+            continue;
+
+        header = dsa_get_address(area, p->body);
+
+        LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+        memset(PGSTAT_SHENT_BODY(header), 0,
+               PGSTAT_SHENT_BODY_LEN(p->key.type));
+
+        if (p->key.type == PGSTAT_TYPE_DB)
+        {
+            PgStat_StatDBEntry *dbstat = (PgStat_StatDBEntry *) header;
+            dbstat->stat_reset_timestamp = GetCurrentTimestamp();
+        }
+        LWLockRelease(&header->lock);
+    }
+    dshash_seq_term(&hstat);
+
+    /* Invalidate the simple cache keys */
+    cached_dbent_key = stathashkey_zero;
+    cached_tabent_key = stathashkey_zero;
+    cached_funcent_key = stathashkey_zero;
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1361,29 +1661,42 @@ pgstat_reset_counters(void)
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
+    TimestampTz now = GetCurrentTimestamp();
+    bool        is_archiver;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    is_archiver = (strcmp(target, "archiver") == 0);
 
-    if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
-    else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
-    else
+    if (!is_archiver && strcmp(target, "bgwriter") != 0)
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\" or \"bgwriter\".")));
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
+    /* Reset statistics for the cluster. */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+    if (is_archiver)
+    {
+        MemSet(&StatsShmem->archiver_stats, 0, sizeof(PgStat_Archiver));
+        StatsShmem->archiver_stats.stat_reset_timestamp = now;
+        cached_archiverstats_is_valid = false;
+    }
+    else
+    {
+        MemSet(&StatsShmem->bgwriter_stats, 0, sizeof(PgStat_BgWriter));
+        MemSet(&StatsShmem->checkpointer_stats, 0, sizeof(PgStat_CheckPointer));
+        StatsShmem->bgwriter_stats.stat_reset_timestamp = now;
+        cached_bgwriterstats_is_valid = false;
+        cached_checkpointerstats_is_valid = false;
+    }
+
+    LWLockRelease(StatsLock);
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1392,17 +1705,37 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatEntryHeader *header;
+    PgStat_StatDBEntry *dbentry;
+    PgStatTypes stattype;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    dbentry = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, MyDatabaseId, InvalidOid, false, false,
+                       NULL);
+    Assert(dbentry);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* Set the reset timestamp for the whole database */
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->header.lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&dbentry->header.lock);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Remove object if it exists, ignore if not */
+    switch (type)
+    {
+        case RESET_TABLE:
+            stattype = PGSTAT_TYPE_TABLE;
+            break;
+        case RESET_FUNCTION:
+            stattype = PGSTAT_TYPE_FUNCTION;
+    }
+
+    header = get_stat_entry(stattype, MyDatabaseId, objoid, false, false, NULL);
+
+    LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+    memset(PGSTAT_SHENT_BODY(header), 0, PGSTAT_SHENT_BODY_LEN(stattype));
+    LWLockRelease(&header->lock);
 }
 
 /* ----------
@@ -1418,15 +1751,40 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_reset_slru_counter(const char *name)
 {
-    PgStat_MsgResetslrucounter msg;
+    int i;
+    TimestampTz    ts = GetCurrentTimestamp();
+    uint32    assert_changecount PG_USED_FOR_ASSERTS_ONLY;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    if (name)
+    {
+        i = pgstat_slru_index(name);
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+        assert_changecount =
+            pg_atomic_fetch_add_u32(&StatsShmem->slru_changecount, 1);
+        Assert((assert_changecount & 1) == 0);
+        MemSet(&StatsShmem->slru_stats.entry[i], 0,
+               sizeof(PgStat_SLRUStats));
+        StatsShmem->slru_stats.entry[i].stat_reset_timestamp = ts;
+        pg_atomic_add_fetch_u32(&StatsShmem->slru_changecount, 1);
+        LWLockRelease(&StatsShmem->slru_stats.lock);
+    }
+    else
+    {
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+        assert_changecount =
+            pg_atomic_fetch_add_u32(&StatsShmem->slru_changecount, 1);
+        Assert((assert_changecount & 1) == 0);
+        for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
+        {
+            MemSet(&StatsShmem->slru_stats.entry[i], 0,
+                   sizeof(PgStat_SLRUStats));
+            StatsShmem->slru_stats.entry[i].stat_reset_timestamp = ts;
+        }
+        pg_atomic_add_fetch_u32(&StatsShmem->slru_changecount, 1);
+        LWLockRelease(&StatsShmem->slru_stats.lock);
+    }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSLRUCOUNTER);
-    msg.m_index = (name) ? pgstat_slru_index(name) : -1;
-
-    pgstat_send(&msg, sizeof(msg));
+    cached_slrustats_is_valid = false;
 }
 
 /* ----------
@@ -1440,48 +1798,93 @@ pgstat_reset_slru_counter(const char *name)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if activity stats is not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    /*
+     * End-of-vacuum is reported instantly. Report the start the same way for
+     * consistency. Vacuum doesn't run frequently and is a long-lasting
+     * operation so it doesn't matter if we get blocked here a little.
+     */
+    dbentry = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, dboid, InvalidOid, false, true, NULL);
 
-    pgstat_send(&msg, sizeof(msg));
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->header.lock, LW_EXCLUSIVE);
+    dbentry->last_autovac_time = ts;
+    LWLockRelease(&dbentry->header.lock);
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report on the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    PgStat_StatTabEntry *tabentry;
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    ts = GetCurrentTimestamp();
+
+    /*
+     * Unlike ordinary operations, maintenance commands take a long time, so
+     * getting blocked at the end of the work doesn't matter. Furthermore,
+     * this prevents the stats updates made by transactions that end after
+     * this vacuum from being canceled by a delayed vacuum report. Update the
+     * shared stats entry directly for the above reasons.
+     */
+    tabentry = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, tableoid, false, true, NULL);
+
+    LWLockAcquire(&tabentry->header.lock, LW_EXCLUSIVE);
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * It is quite possible that a non-aggressive VACUUM ended up skipping
+     * various pages, however, we'll zero the insert counter here regardless.
+     * It's currently used only to track when we need to perform an "insert"
+     * autovacuum, which are mainly intended to freeze newly inserted tuples.
+     * Zeroing this may just mean we'll not try to vacuum the table again
+     * until enough tuples have been inserted to trigger another insert
+     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
+     * stragglers.
+     */
+    tabentry->inserts_since_vacuum = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = ts;
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = ts;
+        tabentry->vacuum_count++;
+    }
+
+    LWLockRelease(&tabentry->header.lock);
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report on the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1492,9 +1895,11 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    PgStat_StatTabEntry *tabentry;
+    Oid        dboid = (rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId);
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1502,10 +1907,10 @@ pgstat_report_analyze(Relation rel,
      * already inserted and/or deleted rows in the target table. ANALYZE will
      * have counted such rows as live or dead respectively. Because we will
      * report our counts of such rows at transaction end, we should subtract
-     * off these counts from what we send to the collector now, else they'll
-     * be double-counted after commit.  (This approach also ensures that the
-     * collector ends up with the right numbers if we abort instead of
-     * committing.)
+     * off these counts from what we write to the shared stats now, else
+     * they'll be double-counted after commit.  (This approach also ensures
+     * that the shared stats end up with the right numbers if we abort
+     * instead of committing.)
      */
     if (rel->pgstat_info != NULL)
     {
@@ -1523,154 +1928,176 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Unlike ordinary operations, maintenance commands take a long time, so
+     * getting blocked at the end of the work doesn't matter. Furthermore,
+     * this prevents the stats updates made by transactions that end after
+     * this analyze from being canceled by a delayed analyze report. Update
+     * the shared stats entry directly for the above reasons.
+     */
+    tabentry = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, RelationGetRelid(rel),
+                       false, true, NULL);
+
+    LWLockAcquire(&tabentry->header.lock, LW_EXCLUSIVE);
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    LWLockRelease(&tabentry->header.lock);
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            dbent->counts.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            dbent->counts.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            dbent->counts.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            dbent->counts.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            dbent->counts.n_conflict_startup_deadlock++;
+            break;
+    }
 }
 
+
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_deadlocks++;
 }
 
-
-
 /* --------
- * pgstat_report_checksum_failures_in_db() -
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
  *
- *    Tell the collector about one or more checksum failures.
+ *    Report one or more checksum failures.
  * --------
  */
 void
 pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
 {
-    PgStat_MsgChecksumFailure msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
+    dbentry = get_local_dbstat_entry(dboid);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* add the reported failure count to the accumulated count */
+    dbentry->counts.n_checksum_failures += failurecount;
 }
 
 /* --------
  * pgstat_report_checksum_failure() -
  *
- *    Tell the collector about a checksum failure.
+ *    Report a checksum failure.
  * --------
  */
 void
 pgstat_report_checksum_failure(void)
 {
-    pgstat_report_checksum_failures_in_db(MyDatabaseId, 1);
+    PgStat_StatDBEntry *dbent;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_checksum_failures++;
 }
 
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/* ----------
- * pgstat_ping() -
- *
- *    Send some junk data to the collector to increase traffic.
- * ----------
- */
-void
-pgstat_ping(void)
-{
-    PgStat_MsgDummy msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (filesize == 0)            /* Is there a case where filesize is really 0? */
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_temp_bytes += filesize; /* needs an overflow check */
+    dbent->counts.n_temp_files++;
 }
 
 /* ----------
- * pgstat_send_inquiry() -
+ * pgstat_init_function_usage() -
  *
- *    Notify collector that we need fresh data.
+ *  Initialize function call usage data.
+ *  Called by the executor before invoking a function.
  * ----------
  */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/*
- * Initialize function call usage data.
- * Called by the executor before invoking a function.
- */
 void
 pgstat_init_function_usage(FunctionCallInfo fcinfo,
                            PgStat_FunctionCallUsage *fcu)
@@ -1685,25 +2112,9 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
         return;
     }
 
-    if (!pgStatFunctions)
-    {
-        /* First time through - initialize function stat table */
-        HASHCTL        hash_ctl;
-
-        memset(&hash_ctl, 0, sizeof(hash_ctl));
-        hash_ctl.keysize = sizeof(Oid);
-        hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
-        pgStatFunctions = hash_create("Function stat entries",
-                                      PGSTAT_FUNCTION_HASH_SIZE,
-                                      &hash_ctl,
-                                      HASH_ELEM | HASH_BLOBS);
-    }
-
-    /* Get the stats entry for this function, create if necessary */
-    htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
-                          HASH_ENTER, &found);
-    if (!found)
-        MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
+    htabent = (PgStat_BackendFunctionEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             fcinfo->flinfo->fn_oid, true, &found);
 
     fcu->fs = &htabent->f_counts;
 
@@ -1717,31 +2128,37 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
     INSTR_TIME_SET_CURRENT(fcu->f_start);
 }
 
-/*
- * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
- *        for specified function
+/* ----------
+ * find_funcstat_entry() -
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_BackendFunctionEntry for the specified function.
+ *
+ *  If there is no entry, return NULL without creating a new one.
+ * ----------
  */
 PgStat_BackendFunctionEntry *
 find_funcstat_entry(Oid func_id)
 {
-    if (pgStatFunctions == NULL)
-        return NULL;
+    PgStat_BackendFunctionEntry *ent;
 
-    return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
-                                                       (void *) &func_id,
-                                                       HASH_FIND, NULL);
+    ent = (PgStat_BackendFunctionEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             func_id, false, NULL);
+
+    return ent;
 }
 
-/*
- * Calculate function call usage and update stat counters.
- * Called by the executor after invoking a function.
+/* ----------
+ * pgstat_end_function_usage() -
  *
- * In the case of a set-returning function that runs in value-per-call mode,
- * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
- * calls for what the user considers a single call of the function.  The
- * finalize flag should be TRUE on the last call.
+ *  Calculate function call usage and update stat counters.
+ *  Called by the executor after invoking a function.
+ *
+ *  In the case of a set-returning function that runs in value-per-call mode,
+ *  we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
+ *  calls for what the user considers a single call of the function.  The
+ *  finalize flag should be TRUE on the last call.
+ * ----------
  */
 void
 pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
@@ -1782,9 +2199,6 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
         fs->f_numcalls++;
     fs->f_total_time = f_total;
     INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
 }
 
 
@@ -1796,8 +2210,7 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
  *
  *    We assume that a relcache entry's pgstat_info field is zeroed by
  *    relcache.c when the relcache entry is made; thereafter it is long-lived
- *    data.  We can avoid repeated searches of the TabStatus arrays when the
- *    same relation is touched repeatedly within a transaction.
+ *    data.
  * ----------
  */
 void
@@ -1813,7 +2226,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1824,121 +2238,60 @@ pgstat_initstats(Relation rel)
      * If we already set up this relation in the current transaction, nothing
      * to do.
      */
-    if (rel->pgstat_info != NULL &&
-        rel->pgstat_info->t_id == rel_id)
+    if (rel->pgstat_info != NULL)
         return;
 
     /* Else find or make the PgStat_TableStatus entry, and update link */
-    rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+    rel->pgstat_info = get_local_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+
+    /* don't allow linking a stats entry to multiple relcache entries */
+    Assert(rel->pgstat_info->relation == NULL);
+    /* mark this relation as the owner */
+    rel->pgstat_info->relation = rel;
 }
 
 /*
- * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
+ * pgstat_delinkstats() -
+ *
+ *  Break the mutual link between a relcache entry and a local stats entry.
+ *  This must always be called when one end of the link is removed.
  */
-static PgStat_TableStatus *
-get_tabstat_entry(Oid rel_id, bool isshared)
+void
+pgstat_delinkstats(Relation rel)
 {
-    TabStatHashEntry *hash_entry;
-    PgStat_TableStatus *entry;
-    TabStatusArray *tsa;
-    bool        found;
-
-    /*
-     * Create hash table if we don't have it already.
-     */
-    if (pgStatTabHash == NULL)
+    /* remove the link to stats info if any */
+    if (rel && rel->pgstat_info)
     {
-        HASHCTL        ctl;
-
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
+        /* link sanity check */
+        Assert(rel->pgstat_info->relation == rel);
+        rel->pgstat_info->relation = NULL;
+        rel->pgstat_info = NULL;
     }
-
-    /*
-     * Find an entry or create a new one.
-     */
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
-    if (!found)
-    {
-        /* initialize new entry with null pointer */
-        hash_entry->tsa_entry = NULL;
-    }
-
-    /*
-     * If entry is already valid, we're done.
-     */
-    if (hash_entry->tsa_entry)
-        return hash_entry->tsa_entry;
-
-    /*
-     * Locate the first pgStatTabList entry with free space, making a new list
-     * entry if needed.  Note that we could get an OOM failure here, but if so
-     * we have left the hashtable and the list in a consistent state.
-     */
-    if (pgStatTabList == NULL)
-    {
-        /* Set up first pgStatTabList entry */
-        pgStatTabList = (TabStatusArray *)
-            MemoryContextAllocZero(TopMemoryContext,
-                                   sizeof(TabStatusArray));
-    }
-
-    tsa = pgStatTabList;
-    while (tsa->tsa_used >= TABSTAT_QUANTUM)
-    {
-        if (tsa->tsa_next == NULL)
-            tsa->tsa_next = (TabStatusArray *)
-                MemoryContextAllocZero(TopMemoryContext,
-                                       sizeof(TabStatusArray));
-        tsa = tsa->tsa_next;
-    }
-
-    /*
-     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
-     * the entry was already zeroed, either at creation or after last use.
-     */
-    entry = &tsa->tsa_entries[tsa->tsa_used++];
-    entry->t_id = rel_id;
-    entry->t_shared = isshared;
-
-    /*
-     * Now we can fill the entry in pgStatTabHash.
-     */
-    hash_entry->tsa_entry = entry;
-
-    return entry;
 }
 
 /*
  * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_TableStatus entry for rel_id in the current
+ *  database. If not found, look it up among the shared tables.
  *
- * Note: if we got an error in the most recent execution of pgstat_report_stat,
- * it's possible that an entry exists but there's no hashtable entry for it.
- * That's okay, we'll treat this case as "doesn't exist".
+ *  If no entry is found, return NULL; don't create a new one.
+ * ----------
  */
 PgStat_TableStatus *
 find_tabstat_entry(Oid rel_id)
 {
-    TabStatHashEntry *hash_entry;
+    PgStat_TableStatus *ent;
 
-    /* If hashtable doesn't exist, there are no entries at all */
-    if (!pgStatTabHash)
-        return NULL;
+    ent = (PgStat_TableStatus *)
+        get_local_stat_entry(PGSTAT_TYPE_TABLE, MyDatabaseId, rel_id,
+                             false, NULL);
+    if (!ent)
+        ent = (PgStat_TableStatus *)
+            get_local_stat_entry(PGSTAT_TYPE_TABLE, InvalidOid, rel_id,
+                                 false, NULL);
 
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
-    if (!hash_entry)
-        return NULL;
-
-    /* Note that this step could also return NULL, but that's correct */
-    return hash_entry->tsa_entry;
+    return ent;
 }
 
 /*
@@ -2353,8 +2706,6 @@ AtPrepare_PgStat(void)
             record.inserted_pre_trunc = trans->inserted_pre_trunc;
             record.updated_pre_trunc = trans->updated_pre_trunc;
             record.deleted_pre_trunc = trans->deleted_pre_trunc;
-            record.t_id = tabstat->t_id;
-            record.t_shared = tabstat->t_shared;
             record.t_truncated = trans->truncated;
 
             RegisterTwoPhaseRecord(TWOPHASE_RM_PGSTAT_ID, 0,
@@ -2369,8 +2720,8 @@ AtPrepare_PgStat(void)
  *
  * All we need do here is unlink the transaction stats state from the
  * nontransactional state.  The nontransactional action counts will be
- * reported to the stats collector immediately, while the effects on live
- * and dead tuple counts are preserved in the 2PC state file.
+ * reported to the activity stats facility immediately, while the effects on
+ * live and dead tuple counts are preserved in the 2PC state file.
  *
  * Note: AtEOXact_PgStat is not called during PREPARE.
  */
@@ -2415,7 +2766,7 @@ pgstat_twophase_postcommit(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, commit case */
     pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
@@ -2451,7 +2802,7 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, abort case */
     if (rec->t_truncated)
@@ -2471,85 +2822,138 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    Find the stats entry for a database, for use on backends.
+ *
+ *    The returned entry is stored in static memory, so its content is valid
+ *    only until the next call of this function for a different database.
  * ----------
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    PgStat_StatDBEntry *shent;
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for InvalidOid */
+    Assert(dbid != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_dbent_key.databaseid == dbid)
+        return &cached_dbent;
+
+    shent = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid, true, false, NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_dbent, shent, sizeof(PgStat_StatDBEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_dbent_key.databaseid = dbid;
+
+    return &cached_dbent;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one table or NULL. NULL doesn't mean
+ *    the activity statistics for one table or NULL. NULL doesn't mean
  *    that the table doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    activity statistics facility, so the caller is better off to
+ *    report ZERO instead.
  * ----------
  */
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
-    PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_extended(false, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
-
-    return NULL;
+    tabentry = pgstat_fetch_stat_tabentry_extended(true, relid);
+    return tabentry;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_tabentry_extended() -
+ *
+ *    Find the table stats entry for the given relation. The returned entry
+ *    is stored in static memory, so its content is valid only until the next
+ *    call of this function for a different table.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_extended(bool shared, Oid reloid)
+{
+    PgStat_StatTabEntry *shent;
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for the InvalidOid */
+    Assert(reloid != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_tabent_key.databaseid == dboid &&
+        cached_tabent_key.objectid == reloid)
+        return &cached_tabent;
+
+    shent = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, reloid, true, false, NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_tabent, shent, sizeof(PgStat_StatTabEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_tabent_key.databaseid = dboid;
+    cached_tabent_key.objectid = reloid;
+
+    return &cached_tabent;
+}
+
+
+/* ----------
+ * pgstat_copy_index_counters() -
+ *
+ *    Support function for index swapping. Copy a subset of the relation's
+ *    counters to the specified destination.
+ * ----------
+ */
+void
+pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst)
+{
+    PgStat_StatTabEntry *tabentry;
+
+    /* No point fetching tabentry when dst is NULL */
+    if (!dst)
+        return;
+
+    tabentry = pgstat_fetch_stat_tabentry(relid);
+
+    if (!tabentry)
+        return;
+
+    dst->t_counts.t_numscans = tabentry->numscans;
+    dst->t_counts.t_tuples_returned = tabentry->tuples_returned;
+    dst->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
+    dst->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
+    dst->t_counts.t_blocks_hit = tabentry->blocks_hit;
 }
 
 
@@ -2558,30 +2962,46 @@ pgstat_fetch_stat_tabentry(Oid relid)
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
  *    the collected statistics for one function or NULL.
+ *
+ *    The returned entry is stored in static memory, so its content is valid
+ *    only until the next call of this function for a different function id.
  * ----------
  */
 PgStat_StatFuncEntry *
 pgstat_fetch_stat_funcentry(Oid func_id)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry = NULL;
-
-    /* load the stats file if needed */
-    backend_read_statsfile();
-
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
-
-    return funcentry;
+    PgStat_StatFuncEntry *shent;
+    Oid    dboid = MyDatabaseId;
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for the InvalidOid */
+    Assert(func_id != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_funcent_key.databaseid == dboid &&
+        cached_funcent_key.objectid == func_id)
+        return &cached_funcent;
+
+    shent = (PgStat_StatFuncEntry *)
+        get_stat_entry(PGSTAT_TYPE_FUNCTION, dboid, func_id, true, false,
+                       NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_funcent, shent, sizeof(PgStat_StatFuncEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_funcent_key.databaseid = dboid;
+    cached_funcent_key.objectid = func_id;
+
+    return &cached_funcent;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_beentry() -
  *
@@ -2641,39 +3061,84 @@ pgstat_fetch_stat_numbackends(void)
     return localNumBackends;
 }
 
+/*
+ * ---------
+ * pgstat_get_stat_timestamp() -
+ *
+ *  Returns the last update timestamp of global statistics.
+ */
+TimestampTz
+pgstat_get_stat_timestamp(void)
+{
+    return (TimestampTz) pg_atomic_read_u64(&StatsShmem->stats_timestamp);
+}
+
 /*
  * ---------
  * pgstat_fetch_stat_archiver() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the archiver statistics struct.
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_ArchiverStats *
+PgStat_Archiver *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&cached_archiverstats, &StatsShmem->archiver_stats,
+           sizeof(PgStat_Archiver));
+    LWLockRelease(StatsLock);
 
-    return &archiverStats;
+    cached_archiverstats_is_valid = true;
+
+    return &cached_archiverstats;
 }
 
 
 /*
  * ---------
- * pgstat_fetch_global() -
+ * pgstat_fetch_stat_bgwriter() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the global statistics struct.
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_GlobalStats *
-pgstat_fetch_global(void)
+PgStat_BgWriter *
+pgstat_fetch_stat_bgwriter(void)
 {
-    backend_read_statsfile();
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&cached_bgwriterstats, &StatsShmem->bgwriter_stats,
+           sizeof(PgStat_BgWriter));
+    LWLockRelease(StatsLock);
+
+    cached_bgwriterstats_is_valid = true;
 
-    return &globalStats;
+    return &cached_bgwriterstats;
 }
 
+/*
+ * ---------
+ * pgstat_fetch_stat_checkpointer() -
+ *
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
+ * ---------
+ */
+PgStat_CheckPointer *
+pgstat_fetch_stat_checkpointer(void)
+{
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&cached_checkpointerstats, &StatsShmem->checkpointer_stats,
+           sizeof(PgStat_CheckPointer));
+    LWLockRelease(StatsLock);
+
+    cached_checkpointerstats_is_valid = true;
+
+    return &cached_checkpointerstats;
+}
 
 /*
  * ---------
@@ -2686,9 +3151,27 @@ pgstat_fetch_global(void)
 PgStat_SLRUStats *
 pgstat_fetch_slru(void)
 {
-    backend_read_statsfile();
+    size_t size = sizeof(PgStat_SLRUStats) * SLRU_NUM_ELEMENTS;
 
-    return slruStats;
+    for (;;)
+    {
+        uint32 before_count;
+        uint32 after_count;
+
+        pg_read_barrier();
+        before_count = pg_atomic_read_u32(&StatsShmem->slru_changecount);
+        memcpy(&cached_slrustats, &StatsShmem->slru_stats, size);
+        after_count = pg_atomic_read_u32(&StatsShmem->slru_changecount);
+
+        if (before_count == after_count && (before_count & 1) == 0)
+            break;
+
+        CHECK_FOR_INTERRUPTS();
+    }
+
+    cached_slrustats_is_valid = true;
+
+    return &cached_slrustats;
 }
 
 
@@ -2902,8 +3385,8 @@ pgstat_initialize(void)
         MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
     }
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* needs to be called before DSM shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -3079,12 +3562,15 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+    /* attach shared database stats area */
+    attach_shared_stats();
 }
 
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
- * Flush any remaining statistics counts out to the collector.
+ * Flush any remaining statistics counts out to shared stats.
  * Without this, operations triggered during backend exit (such as
  * temp table deletions) won't be counted.
  *
@@ -3097,7 +3583,7 @@ pgstat_beshutdown_hook(int code, Datum arg)
 
     /*
      * If we got as far as discovering our own database ID, we can report what
-     * we did to the collector.  Otherwise, we'd be sending an invalid
+     * we did to the shared stats.  Otherwise, we'd be sending an invalid
      * database ID, so forget it.  (This means that accesses to pg_database
      * during failed backend starts might never get counted.)
      */
@@ -3114,6 +3600,10 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    cleanup_dropped_stats_entries();
+
+    detach_shared_stats(true);
 }
 
 
@@ -3374,7 +3864,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3669,8 +4160,8 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
+        case WAIT_EVENT_READING_STATS_FILE:
+            event_name = "ReadingStatsFile";
             break;
         case WAIT_EVENT_RECOVERY_WAL_STREAM:
             event_name = "RecoveryWalStream";
@@ -4324,94 +4815,62 @@ pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
 
 
 /* ----------
- * pgstat_setheader() -
+ * pgstat_report_archiver() -
  *
- *        Set common header fields in a statistics message
+ *        Report archiver statistics
  * ----------
  */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
+void
+pgstat_report_archiver(const char *xlog, bool failed)
 {
-    int            rc;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    TimestampTz now = GetCurrentTimestamp();
 
-    ((PgStat_MsgHdr *) msg)->m_size = len;
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
+    if (failed)
     {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
- *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
- * ----------
- */
-void
-pgstat_send_archiver(const char *xlog, bool failed)
-{
-    PgStat_MsgArchiver msg;
+        ++StatsShmem->archiver_stats.failed_count;
+        strlcpy(StatsShmem->archiver_stats.last_failed_wal, xlog,
+                sizeof(StatsShmem->archiver_stats.last_failed_wal));
+        StatsShmem->archiver_stats.last_failed_timestamp = now;
+    }
+    else
+    {
+        ++StatsShmem->archiver_stats.archived_count;
+        strlcpy(StatsShmem->archiver_stats.last_archived_wal, xlog,
+                sizeof(StatsShmem->archiver_stats.last_archived_wal));
+        StatsShmem->archiver_stats.last_archived_timestamp = now;
+    }
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    strlcpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+    LWLockRelease(StatsLock);
 }
 
 /* ----------
- * pgstat_send_bgwriter() -
+ * pgstat_report_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
-pgstat_send_bgwriter(void)
+pgstat_report_bgwriter(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+    PgStat_BgWriter *s = &StatsShmem->bgwriter_stats;
+    PgStat_BgWriter *l = &BgWriterStats;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid taking the lock when there is nothing to report.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    s->buf_written_clean += l->buf_written_clean;
+    s->maxwritten_clean += l->maxwritten_clean;
+    s->buf_alloc += l->buf_alloc;
+    LWLockRelease(StatsLock);
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4420,473 +4879,113 @@ pgstat_send_bgwriter(void)
 }
 
 /* ----------
- * pgstat_send_slru() -
+ * pgstat_report_checkpointer() -
  *
- *        Send SLRU statistics to the collector
+ *        Report checkpointer statistics
  * ----------
  */
-static void
-pgstat_send_slru(void)
+void
+pgstat_report_checkpointer(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgSLRU all_zeroes;
-
-    for (int i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /*
-         * This function can be called even if nothing at all has happened. In
-         * this case, avoid sending a completely empty message to the stats
-         * collector.
-         */
-        if (memcmp(&SLRUStats[i], &all_zeroes, sizeof(PgStat_MsgSLRU)) == 0)
-            continue;
-
-        /* set the SLRU type before each send */
-        SLRUStats[i].m_index = i;
-
-        /*
-         * Prepare and send the message
-         */
-        pgstat_setheader(&SLRUStats[i].m_hdr, PGSTAT_MTYPE_SLRU);
-        pgstat_send(&SLRUStats[i], sizeof(PgStat_MsgSLRU));
-
-        /*
-         * Clear out the statistics buffer, so it can be re-used.
-         */
-        MemSet(&SLRUStats[i], 0, sizeof(PgStat_MsgSLRU));
-    }
-}
-
-
-/* ----------
- * PgstatCollectorMain() -
- *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-    WaitEvent    event;
-    WaitEventSet *wes;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, SignalHandlerForConfigReload);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    MyBackendType = B_STATS_COLLECTOR;
-    init_ps_display(NULL);
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /* Prepare to wait for our latch or data in our socket. */
-    wes = CreateWaitEventSet(CurrentMemoryContext, 3);
-    AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
-    AddWaitEventToSet(wes, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL, NULL);
-    AddWaitEventToSet(wes, WL_SOCKET_READABLE, pgStatSock, NULL, NULL);
+    static const PgStat_CheckPointer all_zeroes;
+    PgStat_CheckPointer *s = &StatsShmem->checkpointer_stats;
+    PgStat_CheckPointer *l = &CheckPointerStats;
 
     /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do check ConfigReloadPending
-     * inside the inner loop, which means that such interrupts will get
-     * serviced but the latch won't get cleared until next time there is a
-     * break in the action.
+     * This function can be called even if nothing at all has happened. In
+     * this case, avoid taking the lock when there is nothing to report.
      */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (ShutdownRequestPending)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * ShutdownRequestPending becomes set.
-         */
-        while (!ShutdownRequestPending)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (ConfigReloadPending)
-            {
-                ConfigReloadPending = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSLRUCOUNTER:
-                    pgstat_recv_resetslrucounter(&msg.msg_resetslrucounter,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_SLRU:
-                    pgstat_recv_slru(&msg.msg_slru, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitEventSetWait(wes, -1L, &event, 1, WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitEventSetWait(wes, 2 * 1000L /* msec */ , &event, 1,
-                              WAIT_EVENT_PGSTAT_MAIN);
-#endif
+    if (memcmp(&CheckPointerStats, &all_zeroes, sizeof(PgStat_CheckPointer)) == 0)
+        return;
 
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr == 1 && event.events == WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    s->timed_checkpoints += l->timed_checkpoints;
+    s->requested_checkpoints += l->requested_checkpoints;
+    s->checkpoint_write_time += l->checkpoint_write_time;
+    s->checkpoint_sync_time += l->checkpoint_sync_time;
+    s->buf_written_checkpoints += l->buf_written_checkpoints;
+    s->buf_written_backend += l->buf_written_backend;
+    s->buf_fsync_backend += l->buf_fsync_backend;
+    LWLockRelease(StatsLock);
 
     /*
-     * Save the final stats to reuse at next startup.
+     * Clear out the statistics buffer, so it can be re-used.
      */
-    pgstat_write_statsfiles(true, true);
-
-    FreeWaitEventSet(wes);
-
-    exit(0);
+    MemSet(&CheckPointerStats, 0, sizeof(CheckPointerStats));
 }
 
-/*
- * Subroutine to clear stats in a database entry
+/* ----------
+ * get_local_dbstat_entry() -
  *
- * Tables and functions hashes are initialized to empty.
- */
-static void
-reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
-{
-    HASHCTL        hash_ctl;
-
-    dbentry->n_xact_commit = 0;
-    dbentry->n_xact_rollback = 0;
-    dbentry->n_blocks_fetched = 0;
-    dbentry->n_blocks_hit = 0;
-    dbentry->n_tuples_returned = 0;
-    dbentry->n_tuples_fetched = 0;
-    dbentry->n_tuples_inserted = 0;
-    dbentry->n_tuples_updated = 0;
-    dbentry->n_tuples_deleted = 0;
-    dbentry->last_autovac_time = 0;
-    dbentry->n_conflict_tablespace = 0;
-    dbentry->n_conflict_lock = 0;
-    dbentry->n_conflict_snapshot = 0;
-    dbentry->n_conflict_bufferpin = 0;
-    dbentry->n_conflict_startup_deadlock = 0;
-    dbentry->n_temp_files = 0;
-    dbentry->n_temp_bytes = 0;
-    dbentry->n_deadlocks = 0;
-    dbentry->n_checksum_failures = 0;
-    dbentry->last_checksum_failure = 0;
-    dbentry->n_block_read_time = 0;
-    dbentry->n_block_write_time = 0;
-
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
-
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
-}
-
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+ *  Find or create a local PgStat_StatDBEntry entry for dbid.  A new entry is
+ *  created and initialized if one does not exist.
  */
 static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
+get_local_dbstat_entry(Oid dbid)
 {
-    PgStat_StatDBEntry *result;
+    PgStat_StatDBEntry *dbentry;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
-        return NULL;
 
     /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
+     * Find an entry or create a new one.
      */
-    if (!found)
-        reset_dbentry_counters(result);
+    dbentry = (PgStat_StatDBEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid,
+                             true, &found);
 
-    return result;
+    return dbentry;
 }
 
-
-/*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+/* ----------
+ * get_local_tabstat_entry() -
+ *  Find or create a PgStat_TableStatus entry for rel_id. A new entry is
+ *  created and initialized if one does not exist.
+ * ----------
  */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+static PgStat_TableStatus *
+get_local_tabstat_entry(Oid rel_id, bool isshared)
 {
-    PgStat_StatTabEntry *result;
+    PgStat_TableStatus *tabentry;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
 
-    /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
+    tabentry = (PgStat_TableStatus *)
+        get_local_stat_entry(PGSTAT_TYPE_TABLE,
+                             isshared ? InvalidOid : MyDatabaseId,
+                             rel_id, true, &found);
 
-    if (!create && !found)
-        return NULL;
-
-    /* If not found, initialize the new one. */
-    if (!found)
-    {
-        result->numscans = 0;
-        result->tuples_returned = 0;
-        result->tuples_fetched = 0;
-        result->tuples_inserted = 0;
-        result->tuples_updated = 0;
-        result->tuples_deleted = 0;
-        result->tuples_hot_updated = 0;
-        result->n_live_tuples = 0;
-        result->n_dead_tuples = 0;
-        result->changes_since_analyze = 0;
-        result->inserts_since_vacuum = 0;
-        result->blocks_fetched = 0;
-        result->blocks_hit = 0;
-        result->vacuum_timestamp = 0;
-        result->vacuum_count = 0;
-        result->autovac_vacuum_timestamp = 0;
-        result->autovac_vacuum_count = 0;
-        result->analyze_timestamp = 0;
-        result->analyze_count = 0;
-        result->autovac_analyze_timestamp = 0;
-        result->autovac_analyze_count = 0;
-    }
-
-    return result;
+    return tabentry;
 }
 
-
 /* ----------
- * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
+ * pgstat_write_statsfile() -
+ *        Write the global statistics file, as well as DB files.
  *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ * This function is called in the last process that is accessing the shared
+ * stats so locking is not required.
  * ----------
  */
 static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+pgstat_write_statsfile(void)
 {
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
+    dshash_seq_status hstat;
+    PgStatHashEntry *ps;
+
+    /* The stats are not initialized yet; just return. */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
+
+    /* this is the last process that is accessing the shared stats */
+#ifdef USE_ASSERT_CHECKING
+    LWLockAcquire(StatsLock, LW_SHARED);
+    Assert(StatsShmem->refcount == 0);
+    LWLockRelease(StatsLock);
+#endif
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -4906,7 +5005,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    pg_atomic_write_u64(&StatsShmem->stats_timestamp, GetCurrentTimestamp());
 
     /*
      * Write the file header --- currently just a format ID.
@@ -4918,182 +5017,61 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Write global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(&StatsShmem->bgwriter_stats, sizeof(PgStat_BgWriter), 1, fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write checkpointer global stats struct
+     */
+    rc = fwrite(&StatsShmem->checkpointer_stats, sizeof(PgStat_CheckPointer), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write archiver stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+    rc = fwrite(&StatsShmem->archiver_stats, sizeof(PgStat_Archiver), 1,
+                fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write SLRU stats struct
      */
-    rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
+    rc = fwrite(&StatsShmem->slru_stats, sizeof(PgStatSharedSLRUStats), 1,
+                fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Walk through the database table.
+     * Walk through the stats entries
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((ps = dshash_seq_next(&hstat)) != NULL)
     {
-        /*
-         * Write out the table and function stats for this DB into the
-         * appropriate per-DB stat file, if required.
-         */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
+        PgStat_StatDBEntry *dbentry;
+        void               *pent;
+        size_t                len;
+
+        CHECK_FOR_INTERRUPTS();
+
+        pent = dsa_get_address(area, ps->body);
+
+        /* Make DB's timestamp consistent with the global stats */
+        if (ps->key.type == PGSTAT_TYPE_DB)
         {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+            dbentry = (PgStat_StatDBEntry *) pent;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
+            dbentry->stats_timestamp =
+                (TimestampTz) pg_atomic_read_u64(&StatsShmem->stats_timestamp);
         }
 
-        /*
-         * Write out the DB entry. We don't write the tables or functions
-         * pointers, since they're of no use to any other process.
-         */
-        fputc('D', fpout);
-        rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
-}
-
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
+        fputc('S', fpout);
+        rc = fwrite(&ps->key, sizeof(PgStatHashKey), 1, fpout);
 
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
-
-/* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
- * ----------
- */
-static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
-{
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpout;
-    int32        format_id;
-    Oid            dbid = dbentry->databaseid;
-    int            rc;
-    char        tmpfile[MAXPGPATH];
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database's access stats per table.
-     */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
-    {
-        fputc('T', fpout);
-        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * Walk through the database's function stats table.
-     */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+        /* Exclude header part of the entry */
+        len = PGSTAT_SHENT_BODY_LEN(ps->key.type);
+        rc = fwrite(PGSTAT_SHENT_BODY(pent), len, 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_seq_term(&hstat);
 
     /*
      * No more output to be done. Close the temp file and replace the old
@@ -5127,102 +5105,63 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
 }
 
 /* ----------
- * pgstat_read_statsfiles() -
- *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ * pgstat_read_statsfile() -
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
+ *    Reads in existing activity statistics file into the shared stats hash.
  *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
+ * This function is called in the only process that is accessing the shared
+ * stats so locking is not required.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+static void
+pgstat_read_statsfile(void)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-    int            i;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    char        tag;
 
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
+    /* shouldn't be called from postmaster */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    /* this is the only process that is accessing the shared stats */
+#ifdef USE_ASSERT_CHECKING
+    LWLockAcquire(StatsLock, LW_SHARED);
+    Assert(StatsShmem->refcount == 1);
+    LWLockRelease(StatsLock);
+#endif
 
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-    memset(&slruStats, 0, sizeof(slruStats));
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Set the same reset timestamp for all SLRU items too.
-     */
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-        slruStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    StatsShmem->bgwriter_stats.stat_reset_timestamp = GetCurrentTimestamp();
+    StatsShmem->archiver_stats.stat_reset_timestamp =
+        StatsShmem->bgwriter_stats.stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the activity statistics is not running or
+     * has not yet written the stats file the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -5231,624 +5170,137 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
         goto done;
     }
 
     /*
-     * Read global stats struct
+     * Read bgwriter stats struct
      */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
+    if (fread(&StatsShmem->bgwriter_stats, 1, sizeof(PgStat_BgWriter), fpin) !=
+        sizeof(PgStat_BgWriter))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
+        MemSet(&StatsShmem->bgwriter_stats, 0, sizeof(PgStat_BgWriter));
         goto done;
     }
 
     /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
+     * Read checkpointer stats struct
      */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
+    if (fread(&StatsShmem->checkpointer_stats, 1, sizeof(PgStat_CheckPointer), fpin) !=
+        sizeof(PgStat_CheckPointer))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
+        MemSet(&StatsShmem->checkpointer_stats, 0, sizeof(PgStat_CheckPointer));
         goto done;
     }
 
-    /*
-     * Read SLRU stats struct
-     */
-    if (fread(slruStats, 1, sizeof(slruStats), fpin) != sizeof(slruStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&slruStats, 0, sizeof(slruStats));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Add to the DB hash
-                 */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-
-    return dbhash;
-}
-
-
-/* ----------
- * pgstat_read_db_statsfile() -
- *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
- * ----------
- */
-static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
-{
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatTabEntry tabbuf;
-    PgStat_StatFuncEntry funcbuf;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'T'    A PgStat_StatTabEntry follows.
-                 */
-            case 'T':
-                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
-                          fpin) != sizeof(PgStat_StatTabEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
-                break;
-
-                /*
-                 * 'F'    A PgStat_StatFuncEntry follows.
-                 */
-            case 'F':
-                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
-                          fpin) != sizeof(PgStat_StatFuncEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
-                break;
-
-                /*
-                 * 'E'    The EOF marker of a complete stats file.
-                 */
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts. The caller must
- *    rely on timestamp stored in *ts iff the function returns true.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
     /*
      * Read archiver stats struct
      */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
+    if (fread(&StatsShmem->archiver_stats, 1, sizeof(PgStat_Archiver),
+              fpin) != sizeof(PgStat_Archiver))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
+        MemSet(&StatsShmem->archiver_stats, 0, sizeof(PgStat_Archiver));
+        goto done;
     }
 
     /*
      * Read SLRU stats struct
      */
-    if (fread(mySLRUStats, 1, sizeof(mySLRUStats), fpin) != sizeof(mySLRUStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
+    if (fread(&StatsShmem->slru_stats, 1, sizeof(PgStatSharedSLRUStats),
+              fpin) != sizeof(PgStatSharedSLRUStats))
     {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-        }
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        goto done;
     }
 
-done:
-    FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
     /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
+    while ((tag = fgetc(fpin)) == 'S')
     {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
+        PgStatHashKey        key;
+        PgStat_StatEntryHeader *p;
+        size_t                len;
 
         CHECK_FOR_INTERRUPTS();
 
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
+        if (fread(&key, 1, sizeof(key), fpin) != sizeof(key))
         {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"", statfile)));
+            goto done;
         }
 
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
+        p = get_stat_entry(key.type, key.databaseid, key.objectid,
+                              false, true, &found);
 
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
+        /* don't allow duplicate entries */
+        if (found)
+        {
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"",
+                            statfile)));
+            goto done;
         }
 
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
+        /* Avoid overwriting header part */
+        len = PGSTAT_SHENT_BODY_LEN(key.type);
+        if (fread(PGSTAT_SHENT_BODY(p), 1, len, fpin) != len)
+        {
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"", statfile)));
+            goto done;
+        }
     }
 
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
+    if (tag != 'E')
+    {
         ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
+                (errmsg("corrupted statistics file \"%s\"",
+                        statfile)));
+        goto done;
+    }
 
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
+done:
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+
+    return;
 }
 
-
 /* ----------
  * pgstat_setup_memcxt() -
  *
- *    Create pgStatLocalContext, if not already done.
+ *    Create pgStatLocalContext if not already done.
  * ----------
  */
 static void
 pgstat_setup_memcxt(void)
 {
     if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
-}
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
 
+    if (!pgStatCacheContext)
+        pgStatCacheContext =
+            AllocSetContextCreate(CacheMemoryContext,
+                                  "Activity statistics",
+                                  ALLOCSET_SMALL_SIZES);
+}
 
 /* ----------
  * pgstat_clear_snapshot() -
@@ -5865,795 +5317,23 @@ pgstat_clear_snapshot(void)
 {
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
+    {
         MemoryContextDelete(pgStatLocalContext);
 
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-                tabentry->inserts_since_vacuum = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_resetslrucounter() -
- *
- *    Reset some SLRU statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len)
-{
-    int            i;
-    TimestampTz ts = GetCurrentTimestamp();
-
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /* reset entry with the given index, or all entries (index is -1) */
-        if ((msg->m_index == -1) || (msg->m_index == i))
-        {
-            memset(&slruStats[i], 0, sizeof(slruStats[i]));
-            slruStats[i].stat_reset_timestamp = ts;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signaling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * It is quite possible that a non-aggressive VACUUM ended up skipping
-     * various pages, however, we'll zero the insert counter here regardless.
-     * It's currently used only to track when we need to perform an "insert"
-     * autovacuum, which are mainly intended to freeze newly inserted tuples.
-     * Zeroing this may just mean we'll not try to vacuum the table again
-     * until enough tuples have been inserted to trigger another insert
-     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
-     * stragglers.
-     */
-    tabentry->inserts_since_vacuum = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_slru() -
- *
- *    Process a SLRU message.
- * ----------
- */
-static void
-pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
-{
-    slruStats[msg->m_index].blocks_zeroed += msg->m_blocks_zeroed;
-    slruStats[msg->m_index].blocks_hit += msg->m_blocks_hit;
-    slruStats[msg->m_index].blocks_read += msg->m_blocks_read;
-    slruStats[msg->m_index].blocks_written += msg->m_blocks_written;
-    slruStats[msg->m_index].blocks_exists += msg->m_blocks_exists;
-    slruStats[msg->m_index].flush += msg->m_flush;
-    slruStats[msg->m_index].truncate += msg->m_truncate;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
+    }
+
+    /* Invalidate the simple cache keys */
+    cached_dbent_key = stathashkey_zero;
+    cached_tabent_key = stathashkey_zero;
+    cached_funcent_key = stathashkey_zero;
+    cached_archiverstats_is_valid = false;
+    cached_bgwriterstats_is_valid = false;
+    cached_checkpointerstats_is_valid = false;
+    cached_slrustats_is_valid = false;
 }
 
 /*
@@ -6744,7 +5424,7 @@ pgstat_slru_name(int slru_idx)
  * Returns pointer to entry with counters for given SLRU (based on the name
  * stored in SlruCtl as lwlock tranche name).
  */
-static inline PgStat_MsgSLRU *
+static PgStat_SLRUStats *
 slru_entry(int slru_idx)
 {
     /*
@@ -6755,7 +5435,7 @@ slru_entry(int slru_idx)
 
     Assert((slru_idx >= 0) && (slru_idx < SLRU_NUM_ELEMENTS));
 
-    return &SLRUStats[slru_idx];
+    return &local_SLRUStats[slru_idx];
 }
 
 /*
@@ -6765,41 +5445,48 @@ slru_entry(int slru_idx)
 void
 pgstat_count_slru_page_zeroed(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_zeroed += 1;
+    slru_entry(slru_idx)->blocks_zeroed += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_hit(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_hit += 1;
+    slru_entry(slru_idx)->blocks_hit += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_exists(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_exists += 1;
+    slru_entry(slru_idx)->blocks_exists += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_read(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_read += 1;
+    slru_entry(slru_idx)->blocks_read += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_written(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_written += 1;
+    slru_entry(slru_idx)->blocks_written += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_flush(int slru_idx)
 {
-    slru_entry(slru_idx)->m_flush += 1;
+    slru_entry(slru_idx)->flush += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_truncate(int slru_idx)
 {
-    slru_entry(slru_idx)->m_truncate += 1;
+    slru_entry(slru_idx)->truncate += 1;
+    have_slrustats = true;
 }
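
Note on the SLRU hunks above: each counter function now bumps a process-local entry and sets have_slrustats, so the flush step can presumably skip the shared-memory update when nothing changed. Below is a minimal standalone sketch of that "local counters plus dirty flag" pattern. Every demo_* name, the array size, and the flush function are hypothetical illustrations, not code from the patch, and the locking a real flush would need is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEMO_SLRU_NUM_ELEMENTS 4	/* hypothetical size, not the real SLRU count */

/* Hypothetical, trimmed-down counter set. */
typedef struct DemoSLRUStats
{
	int64_t		blocks_zeroed;
	int64_t		blocks_hit;
} DemoSLRUStats;

/* Pending counts private to this process. */
DemoSLRUStats demo_local[DEMO_SLRU_NUM_ELEMENTS];

/* Stand-in for the area that would live in shared memory. */
DemoSLRUStats demo_shared[DEMO_SLRU_NUM_ELEMENTS];

/* True when demo_local holds counts that have not been flushed yet. */
bool		demo_have_stats = false;

static DemoSLRUStats *
demo_slru_entry(int idx)
{
	assert(idx >= 0 && idx < DEMO_SLRU_NUM_ELEMENTS);
	return &demo_local[idx];
}

void
demo_count_page_zeroed(int idx)
{
	demo_slru_entry(idx)->blocks_zeroed += 1;
	demo_have_stats = true;
}

void
demo_count_page_hit(int idx)
{
	demo_slru_entry(idx)->blocks_hit += 1;
	demo_have_stats = true;
}

/*
 * Add the pending local counts to the shared area and reset them.
 * Returns true if anything was flushed.  A real implementation would
 * hold a lock around the loop; this sketch is single-process.
 */
bool
demo_flush(void)
{
	int			i;

	if (!demo_have_stats)
		return false;			/* cheap exit: nothing pending */

	for (i = 0; i < DEMO_SLRU_NUM_ELEMENTS; i++)
	{
		demo_shared[i].blocks_zeroed += demo_local[i].blocks_zeroed;
		demo_shared[i].blocks_hit += demo_local[i].blocks_hit;
	}

	memset(demo_local, 0, sizeof(demo_local));
	demo_have_stats = false;
	return true;
}
```

The flag keeps the common "nothing to report" case down to a single boolean test instead of a scan over every counter.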
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b811c961a6..526021def2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -257,7 +257,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -518,7 +517,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1340,12 +1338,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1794,11 +1786,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2728,8 +2715,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -3056,8 +3041,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3124,13 +3107,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3203,22 +3179,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3679,22 +3639,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3914,8 +3858,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3939,8 +3881,7 @@ PostmasterStateMachine(void)
     {
         /*
          * PM_WAIT_DEAD_END state ends when the BackendList is entirely empty
-         * (ie, no dead_end children remain), and the archiver and stats
-         * collector are gone too.
+         * (ie, no dead_end children remain), and the archiver is gone too.
          *
          * The reason we wait for those two is to protect them against a new
          * postmaster starting conflicting subprocesses; this isn't an
@@ -3950,8 +3891,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4152,8 +4092,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5130,18 +5068,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5260,12 +5186,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6170,7 +6090,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6226,8 +6145,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6462,7 +6379,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index b89df01fa7..57531d7d48 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1556,8 +1556,8 @@ is_checksummed_file(const char *fullpath, const char *filename)
  *
  * If 'missing_ok' is true, will not throw an error if the file is not found.
  *
- * If dboid is anything other than InvalidOid then any checksum failures detected
- * will get reported to the stats collector.
+ * If dboid is anything other than InvalidOid then any checksum failures
+ * detected will get reported to the activity stats facility.
  *
  * Returns true if the file was successfully sent, false if 'missing_ok',
  * and the file did not exist.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e549fa1d30..5ee7110444 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2045,7 +2045,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                CheckPointerStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2155,7 +2155,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2345,7 +2345,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2353,7 +2353,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
 
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 96c2aaabbd..55693cfa3a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -149,6 +149,7 @@ CreateSharedMemoryAndSemaphores(void)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -267,6 +268,7 @@ CreateSharedMemoryAndSemaphores(void)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 2fa90cc095..17a46db74b 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -177,7 +177,9 @@ static const char *const BuiltinTrancheNames[] = {
     /* LWTRANCHE_PARALLEL_APPEND: */
     "ParallelAppend",
     /* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
-    "PerXactPredicateList"
+    "PerXactPredicateList",
+    /* LWTRANCHE_STATS: */
+    "ActivityStatistics"
 };
 
 StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index 774292fd94..91bf8d5b5d 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -53,3 +53,4 @@ XactTruncationLock                    44
 # 45 was XactTruncationLock until removal of BackendRandomLock
 WrapLimitsVacuumLock                46
 NotifyQueueTailLock                    47
+StatsLock                            48
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index dcc09df0c7..0dec4b9145 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -415,8 +415,8 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
     DropRelFileNodesAllBuffers(rnodes, nrels);
 
     /*
-     * It'd be nice to tell the stats collector to forget them immediately,
-     * too. But we can't because we don't know the OIDs.
+     * It'd be nice to tell the activity stats facility to forget them
+     * immediately, too. But we can't because we don't know the OIDs.
      */
 
     /*
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 411cfadbff..5043736f1f 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3209,6 +3209,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3783,6 +3789,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4179,11 +4186,12 @@ PostgresMain(int argc, char *argv[],
          * Note: this includes fflush()'ing the last of the prior output.
          *
          * This is also a good time to send collected statistics to the
-         * collector, and to update the PS stats display.  We avoid doing
-         * those every time through the message loop because it'd slow down
-         * processing of batched messages, and because we don't want to report
-         * uncommitted updates (that confuses autovacuum).  The notification
-         * processor wants a call too, if we are not in a transaction block.
+         * activity statistics, and to update the PS stats display.  We avoid
+         * doing those every time through the message loop because it'd slow
+         * down processing of batched messages, and because we don't want to
+         * report uncommitted updates (that confuses autovacuum).  The
+         * notification processor wants a call too, if we are not in a
+         * transaction block.
          */
         if (send_ready_for_query)
         {
@@ -4215,6 +4223,8 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long stats_timeout;
+
                 /* Send out notify signals and transmit self-notifies */
                 ProcessCompletedNotifies();
 
@@ -4227,8 +4237,13 @@ PostgresMain(int argc, char *argv[],
                 if (notifyInterruptPending)
                     ProcessNotifyInterrupt();
 
-                pgstat_report_stat(false);
-
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle");
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4263,7 +4278,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4271,6 +4286,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
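
Note on the postgres.c hunks above: pgstat_report_stat(false) now returns a long, and a positive value arms IDLE_STATS_UPDATE_TIMEOUT so that a backend going idle still flushes later. The sketch below assumes the return value means "milliseconds until the next flush is allowed, or 0 if nothing remains pending". That reading is inferred from the call site only. All demo_* names and the interval are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_MIN_INTERVAL_MS 1000	/* hypothetical minimum gap between flushes */

long		demo_last_flush_ms = 0; /* time of the previous flush */
int			demo_pending = 0;		/* counts not yet flushed */
int			demo_flushed_total = 0; /* stand-in for the shared counters */

/*
 * Flush pending counts unless the previous flush was too recent.
 * Returns 0 when nothing is left pending, otherwise the number of
 * milliseconds the caller should wait before trying again.
 * "force" bypasses the rate limit, as a timeout-driven call would.
 */
long
demo_report_stat(bool force, long now_ms)
{
	long		elapsed;

	if (demo_pending == 0)
		return 0;				/* nothing to do, no timer needed */

	elapsed = now_ms - demo_last_flush_ms;
	if (!force && elapsed < DEMO_MIN_INTERVAL_MS)
		return DEMO_MIN_INTERVAL_MS - elapsed;	/* caller arms a timer */

	demo_flushed_total += demo_pending;
	demo_pending = 0;
	demo_last_flush_ms = now_ms;
	return 0;
}
```

A caller following the pattern in the hunk would arm a one-shot timer only for a positive return value, then call again with force set to true when the timer fires.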
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 95738a4e34..f6dc875a25 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * pgstatfuncs.c
- *      Functions for accessing the statistics collector data
+ *      Functions for accessing the activity statistics data
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -35,9 +35,6 @@
 
 #define HAS_PGSTAT_PERMISSIONS(role)     (is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS) || has_privs_of_role(GetUserId(),role))
 
 
-/* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
-
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
 {
@@ -1267,7 +1264,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_commit);
+        result = (int64) (dbentry->counts.n_xact_commit);
 
     PG_RETURN_INT64(result);
 }
@@ -1283,7 +1280,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_rollback);
+        result = (int64) (dbentry->counts.n_xact_rollback);
 
     PG_RETURN_INT64(result);
 }
@@ -1299,7 +1296,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_fetched);
+        result = (int64) (dbentry->counts.n_blocks_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1315,7 +1312,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_hit);
+        result = (int64) (dbentry->counts.n_blocks_hit);
 
     PG_RETURN_INT64(result);
 }
@@ -1331,7 +1328,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_returned);
+        result = (int64) (dbentry->counts.n_tuples_returned);
 
     PG_RETURN_INT64(result);
 }
@@ -1347,7 +1344,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_fetched);
+        result = (int64) (dbentry->counts.n_tuples_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1363,7 +1360,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_inserted);
+        result = (int64) (dbentry->counts.n_tuples_inserted);
 
     PG_RETURN_INT64(result);
 }
@@ -1379,7 +1376,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_updated);
+        result = (int64) (dbentry->counts.n_tuples_updated);
 
     PG_RETURN_INT64(result);
 }
@@ -1395,7 +1392,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_deleted);
+        result = (int64) (dbentry->counts.n_tuples_deleted);
 
     PG_RETURN_INT64(result);
 }
@@ -1428,7 +1425,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_files;
+        result = dbentry->counts.n_temp_files;
 
     PG_RETURN_INT64(result);
 }
@@ -1444,7 +1441,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_bytes;
+        result = dbentry->counts.n_temp_bytes;
 
     PG_RETURN_INT64(result);
 }
@@ -1459,7 +1456,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace);
+        result = (int64) (dbentry->counts.n_conflict_tablespace);
 
     PG_RETURN_INT64(result);
 }
@@ -1474,7 +1471,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_lock);
+        result = (int64) (dbentry->counts.n_conflict_lock);
 
     PG_RETURN_INT64(result);
 }
@@ -1489,7 +1486,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_snapshot);
+        result = (int64) (dbentry->counts.n_conflict_snapshot);
 
     PG_RETURN_INT64(result);
 }
@@ -1504,7 +1501,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_bufferpin);
+        result = (int64) (dbentry->counts.n_conflict_bufferpin);
 
     PG_RETURN_INT64(result);
 }
@@ -1519,7 +1516,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1534,11 +1531,11 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace +
-                          dbentry->n_conflict_lock +
-                          dbentry->n_conflict_snapshot +
-                          dbentry->n_conflict_bufferpin +
-                          dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_tablespace +
+                          dbentry->counts.n_conflict_lock +
+                          dbentry->counts.n_conflict_snapshot +
+                          dbentry->counts.n_conflict_bufferpin +
+                          dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1553,7 +1550,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_deadlocks);
+        result = (int64) (dbentry->counts.n_deadlocks);
 
     PG_RETURN_INT64(result);
 }
@@ -1571,7 +1568,7 @@ pg_stat_get_db_checksum_failures(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_checksum_failures);
+        result = (int64) (dbentry->counts.n_checksum_failures);
 
     PG_RETURN_INT64(result);
 }
@@ -1608,7 +1605,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_read_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_read_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1624,7 +1621,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_write_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_write_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1632,69 +1629,71 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_bgwriter_timed_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->timed_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->timed_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_requested_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->requested_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->requested_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_clean);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_written_clean);
 }
 
 Datum
 pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->maxwritten_clean);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->maxwritten_clean);
 }
 
 Datum
 pg_stat_get_checkpoint_write_time(PG_FUNCTION_ARGS)
 {
     /* time is already in msec, just convert to double for presentation */
-    PG_RETURN_FLOAT8((double) pgstat_fetch_global()->checkpoint_write_time);
+    PG_RETURN_FLOAT8((double)
+                     pgstat_fetch_stat_checkpointer()->checkpoint_write_time);
 }
 
 Datum
 pg_stat_get_checkpoint_sync_time(PG_FUNCTION_ARGS)
 {
     /* time is already in msec, just convert to double for presentation */
-    PG_RETURN_FLOAT8((double) pgstat_fetch_global()->checkpoint_sync_time);
+    PG_RETURN_FLOAT8((double)
+                     pgstat_fetch_stat_checkpointer()->checkpoint_sync_time);
 }
 
 Datum
 pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_global()->stat_reset_timestamp);
+    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_stat_bgwriter()->stat_reset_timestamp);
 }
 
 Datum
 pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_backend);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_backend);
 }
 
 Datum
 pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_fsync_backend);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_fsync_backend);
 }
 
 Datum
 pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_alloc);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
 }
 
 /*
@@ -1965,7 +1964,7 @@ pg_stat_get_xact_function_self_time(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_snapshot_timestamp(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_global()->stats_timestamp);
+    PG_RETURN_TIMESTAMPTZ(pgstat_get_stat_timestamp());
 }
 
 /* Discard the active statistics snapshot */
@@ -2039,7 +2038,7 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     TupleDesc    tupdesc;
     Datum        values[7];
     bool        nulls[7];
-    PgStat_ArchiverStats *archiver_stats;
+    PgStat_Archiver *archiver_stats;
 
     /* Initialise values and NULL flags arrays */
     MemSet(values, 0, sizeof(values));
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 9061af81a3..d23cc2d0a9 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -71,6 +71,7 @@
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/optimizer.h"
+#include "pgstat.h"
 #include "rewrite/rewriteDefine.h"
 #include "rewrite/rowsecurity.h"
 #include "storage/lmgr.h"
@@ -2353,6 +2354,9 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
      */
     RelationCloseSmgr(relation);
 
+    /* break mutual link with stats entry */
+    pgstat_delinkstats(relation);
+
     /*
      * Free all the subsidiary data structures of the relcache entry, then the
      * entry itself.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 6ab8216839..883c7f8802 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index ed2ab4b5b2..74fb22f216 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -269,9 +269,6 @@ GetBackendTypeDesc(BackendType backendType)
         case B_ARCHIVER:
             backendDesc = "archiver";
             break;
-        case B_STATS_COLLECTOR:
-            backendDesc = "stats collector";
-            break;
         case B_LOGGER:
             backendDesc = "logger";
             break;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index d4ab4c7e23..4ff4cc33d9 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -631,6 +632,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1245,6 +1248,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
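
The IdleStatsUpdateTimeoutHandler hunk above follows the usual deferred-interrupt pattern: the handler only sets flags, and the real work runs later from ordinary code at a safe point. Below is a minimal standalone sketch of that pattern, not the patch's code. The names `idle_stats_update_pending`, `interrupt_pending`, `process_interrupts` and `flush_count` are hypothetical, the latch wakeup is omitted, and the counter merely stands in for whatever the backend does when it sees the flag.

```c
#include <signal.h>

/* Hypothetical stand-ins for IdleStatsUpdateTimeoutPending / InterruptPending. */
static volatile sig_atomic_t idle_stats_update_pending = 0;
static volatile sig_atomic_t interrupt_pending = 0;

/* Counts how many times the deferred work ran (illustration only). */
static int flush_count = 0;

/* Timeout handler: does nothing except set flags, so it stays signal-safe. */
static void
idle_stats_update_timeout_handler(void)
{
    idle_stats_update_pending = 1;
    interrupt_pending = 1;
}

/* Called from normal (non-handler) code at a point where work is safe. */
static void
process_interrupts(void)
{
    if (!interrupt_pending)
        return;
    interrupt_pending = 0;

    if (idle_stats_update_pending)
    {
        idle_stats_update_pending = 0;
        flush_count++;      /* stands in for flushing pending stats */
    }
}
```

Keeping the handler down to flag assignments means the deferred work never runs in signal context, and a flag that is already set simply stays set if the timeout fires again before it is processed.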
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 596bcb7b84..29eb459e35 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -743,8 +743,8 @@ const char *const config_group_names[] =
     gettext_noop("Statistics"),
     /* STATS_MONITORING */
     gettext_noop("Statistics / Monitoring"),
-    /* STATS_COLLECTOR */
-    gettext_noop("Statistics / Query and Index Statistics Collector"),
+    /* STATS_ACTIVITY */
+    gettext_noop("Statistics / Query and Index Activity Statistics"),
     /* AUTOVACUUM */
     gettext_noop("Autovacuum"),
     /* CLIENT_CONN */
@@ -1454,7 +1454,7 @@ static struct config_bool ConfigureNamesBool[] =
 #endif
 
     {
-        {"track_activities", PGC_SUSET, STATS_COLLECTOR,
+        {"track_activities", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects information about executing commands."),
             gettext_noop("Enables the collection of information on the currently "
                          "executing command of each session, along with "
@@ -1465,7 +1465,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_counts", PGC_SUSET, STATS_COLLECTOR,
+        {"track_counts", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects statistics on database activity."),
             NULL
         },
@@ -1474,7 +1474,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_io_timing", PGC_SUSET, STATS_COLLECTOR,
+        {"track_io_timing", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects timing statistics for database I/O activity."),
             NULL
         },
@@ -4310,7 +4310,7 @@ static struct config_string ConfigureNamesString[] =
     },
 
     {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
+        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
             gettext_noop("Writes temporary statistics files to the specified directory."),
             NULL,
             GUC_SUPERUSER_ONLY
@@ -4646,7 +4646,7 @@ static struct config_enum ConfigureNamesEnum[] =
     },
 
     {
-        {"track_functions", PGC_SUSET, STATS_COLLECTOR,
+        {"track_functions", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects function-level statistics on database activity."),
             NULL
         },
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9cb571f7cc..668a2d033a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -579,7 +579,7 @@
 # STATISTICS
 #------------------------------------------------------------------------------
 
-# - Query and Index Statistics Collector -
+# - Query and Index Activity Statistics -
 
 #track_activities = on
 #track_counts = on
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index f674a7c94e..26603e95e4 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -124,7 +124,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index bba002ad24..b70c495f2a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -83,6 +83,8 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
 
@@ -321,7 +323,6 @@ typedef enum BackendType
     B_WAL_SENDER,
     B_WAL_WRITER,
     B_ARCHIVER,
-    B_STATS_COLLECTOR,
     B_LOGGER,
 } BackendType;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0dfbac46b4..046bf21485 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL activity statistics facility.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -15,9 +15,11 @@
 #include "libpq/pqcomm.h"
 #include "miscadmin.h"
 #include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -27,8 +29,8 @@
  * ----------
  */
 #define PGSTAT_STAT_PERMANENT_DIRECTORY        "pg_stat"
-#define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
-#define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
+#define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/saved_stats"
+#define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/saved_stats.tmp"
 
 /* Default directory to store temporary statistics data in */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
@@ -41,35 +43,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_RESETSLRUCOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_SLRU,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -80,9 +53,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -118,13 +90,6 @@ typedef struct PgStat_TableCounts
     PgStat_Counter t_blocks_hit;
 } PgStat_TableCounts;
 
-/* Possible targets for resetting cluster-wide shared values */
-typedef enum PgStat_Shared_Reset_Target
-{
-    RESET_ARCHIVER,
-    RESET_BGWRITER
-} PgStat_Shared_Reset_Target;
-
 /* Possible object types for resetting single counters */
 typedef enum PgStat_Single_Reset_Type
 {
@@ -155,10 +120,13 @@ typedef enum PgStat_Single_Reset_Type
  */
 typedef struct PgStat_TableStatus
 {
-    Oid            t_id;            /* table's OID */
-    bool        t_shared;        /* is it a shared catalog? */
     struct PgStat_TableXactStatus *trans;    /* lowest subxact's counts */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_TableCounts t_counts;    /* event counts to be sent */
+    Relation    relation;            /* rel that is using this entry */
 } PgStat_TableStatus;
 
 /* ----------
@@ -183,308 +151,57 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
+/*
+ * Archiver statistics kept in the shared stats
  */
-
-
-/* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
- */
-typedef struct PgStat_MsgHdr
-{
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgResetslrucounter Sent by the backend to tell the collector
- *                                to reset a SLRU counter
- * ----------
- */
-typedef struct PgStat_MsgResetslrucounter
-{
-    PgStat_MsgHdr m_hdr;
-    int            m_index;
-} PgStat_MsgResetslrucounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
+typedef struct PgStat_Archiver
 {
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
+    PgStat_Counter archived_count;    /* archival successes */
+    char        last_archived_wal[MAX_XFN_CHARS + 1];    /* last WAL file
+                                                         * archived */
+    TimestampTz last_archived_timestamp;    /* last archival success time */
+    PgStat_Counter failed_count;    /* failed archival attempts */
+    char        last_failed_wal[MAX_XFN_CHARS + 1]; /* WAL file involved in
+                                                     * last failure */
+    TimestampTz last_failed_timestamp;    /* last archival failure time */
+    TimestampTz stat_reset_timestamp;
+} PgStat_Archiver;
 
 /* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgBgWriter
+typedef struct PgStat_BgWriter
 {
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_alloc;
+    TimestampTz       stat_reset_timestamp;
+} PgStat_BgWriter;
 
 /* ----------
- * PgStat_MsgSLRU            Sent by a backend to update SLRU statistics.
+ * PgStat_CheckPointer        checkpointer statistics
  * ----------
  */
-typedef struct PgStat_MsgSLRU
+typedef struct PgStat_CheckPointer
 {
-    PgStat_MsgHdr m_hdr;
-    PgStat_Counter m_index;
-    PgStat_Counter m_blocks_zeroed;
-    PgStat_Counter m_blocks_hit;
-    PgStat_Counter m_blocks_read;
-    PgStat_Counter m_blocks_written;
-    PgStat_Counter m_blocks_exists;
-    PgStat_Counter m_flush;
-    PgStat_Counter m_truncate;
-} PgStat_MsgSLRU;
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+} PgStat_CheckPointer;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -500,7 +217,6 @@ typedef struct PgStat_FunctionCounts
  */
 typedef struct PgStat_BackendFunctionEntry
 {
-    Oid            f_id;
     PgStat_FunctionCounts f_counts;
 } PgStat_BackendFunctionEntry;
 
@@ -516,98 +232,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgResetslrucounter msg_resetslrucounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgSLRU msg_slru;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Activity statistics data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -616,13 +242,9 @@ typedef union PgStat_Msg
 
 #define PGSTAT_FILE_FORMAT_ID    0x01A5BC9D
 
-/* ----------
- * PgStat_StatDBEntry            The collector's data per database
- * ----------
- */
-typedef struct PgStat_StatDBEntry
+
+typedef struct PgStat_StatDBCounts
 {
-    Oid            databaseid;
     PgStat_Counter n_xact_commit;
     PgStat_Counter n_xact_rollback;
     PgStat_Counter n_blocks_fetched;
@@ -632,7 +254,6 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_tuples_inserted;
     PgStat_Counter n_tuples_updated;
     PgStat_Counter n_tuples_deleted;
-    TimestampTz last_autovac_time;
     PgStat_Counter n_conflict_tablespace;
     PgStat_Counter n_conflict_lock;
     PgStat_Counter n_conflict_snapshot;
@@ -642,29 +263,87 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_temp_bytes;
     PgStat_Counter n_deadlocks;
     PgStat_Counter n_checksum_failures;
-    TimestampTz last_checksum_failure;
     PgStat_Counter n_block_read_time;    /* times in microseconds */
     PgStat_Counter n_block_write_time;
+} PgStat_StatDBCounts;
 
+/* ----------
+ * PgStat_StatEntryHeader          common header struct for PgStat_Stat*Entry
+ * ----------
+ */
+typedef struct PgStat_StatEntryHeader
+{
+    LWLock        lock;
+    bool        dropped;            /* This entry is being dropped and should
+                                     * be removed when refcount goes to
+                                     * zero. */
+    pg_atomic_uint32  refcount;        /* number of backends referencing this entry */
+} PgStat_StatEntryHeader;
+
+/* ----------
+ * PgStat_StatDBEntry            The statistics per database
+ * ----------
+ */
+typedef struct PgStat_StatDBEntry
+{
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    TimestampTz last_autovac_time;
+    TimestampTz last_checksum_failure;
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;    /* time of db stats update */
+
+    PgStat_StatDBCounts counts;
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following members must be last in the struct, because we don't write
+     * them out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables; /* current gen tables hash */
+    dshash_table_handle functions;    /* current gen functions hash */
 } PgStat_StatDBEntry;
 
 
+/*
+ * SLRU statistics kept in the shared stats
+ */
+typedef struct PgStat_SLRUStats
+{
+    PgStat_Counter blocks_zeroed;
+    PgStat_Counter blocks_hit;
+    PgStat_Counter blocks_read;
+    PgStat_Counter blocks_written;
+    PgStat_Counter blocks_exists;
+    PgStat_Counter flush;
+    PgStat_Counter truncate;
+    TimestampTz stat_reset_timestamp;
+} PgStat_SLRUStats;
+
+
 /* ----------
- * PgStat_StatTabEntry            The collector's data per table (or index)
+ * PgStat_HashMountInfo  dshash mount information
+ * ----------
+ */
+typedef struct PgStat_HashMountInfo
+{
+    HTAB       *snapshot_tables;    /* table entry snapshot */
+    HTAB       *snapshot_functions; /* function entry snapshot */
+    dshash_table *dshash_tables;    /* attached tables dshash */
+    dshash_table *dshash_functions; /* attached functions dshash */
+} PgStat_HashMountInfo;
+
+/* ----------
+ * PgStat_StatTabEntry            The statistics per table (or index)
  * ----------
  */
 typedef struct PgStat_StatTabEntry
 {
-    Oid            tableid;
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
 
     PgStat_Counter numscans;
 
@@ -684,25 +363,21 @@ typedef struct PgStat_StatTabEntry
     PgStat_Counter blocks_fetched;
     PgStat_Counter blocks_hit;
 
-    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
     PgStat_Counter vacuum_count;
-    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_vacuum_count;
-    TimestampTz analyze_timestamp;    /* user initiated */
     PgStat_Counter analyze_count;
-    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_analyze_count;
 } PgStat_StatTabEntry;
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            per function stats data
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
 {
-    Oid            functionid;
-
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
     PgStat_Counter f_numcalls;
 
     PgStat_Counter f_total_time;    /* times in microseconds */
@@ -710,57 +385,6 @@ typedef struct PgStat_StatFuncEntry
 } PgStat_StatFuncEntry;
 
 
-/*
- * Archiver statistics kept in the stats collector
- */
-typedef struct PgStat_ArchiverStats
-{
-    PgStat_Counter archived_count;    /* archival successes */
-    char        last_archived_wal[MAX_XFN_CHARS + 1];    /* last WAL file
-                                                         * archived */
-    TimestampTz last_archived_timestamp;    /* last archival success time */
-    PgStat_Counter failed_count;    /* failed archival attempts */
-    char        last_failed_wal[MAX_XFN_CHARS + 1]; /* WAL file involved in
-                                                     * last failure */
-    TimestampTz last_failed_timestamp;    /* last archival failure time */
-    TimestampTz stat_reset_timestamp;
-} PgStat_ArchiverStats;
-
-/*
- * Global statistics kept in the stats collector
- */
-typedef struct PgStat_GlobalStats
-{
-    TimestampTz stats_timestamp;    /* time of stats file update */
-    PgStat_Counter timed_checkpoints;
-    PgStat_Counter requested_checkpoints;
-    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
-    PgStat_Counter checkpoint_sync_time;
-    PgStat_Counter buf_written_checkpoints;
-    PgStat_Counter buf_written_clean;
-    PgStat_Counter maxwritten_clean;
-    PgStat_Counter buf_written_backend;
-    PgStat_Counter buf_fsync_backend;
-    PgStat_Counter buf_alloc;
-    TimestampTz stat_reset_timestamp;
-} PgStat_GlobalStats;
-
-/*
- * SLRU statistics kept in the stats collector
- */
-typedef struct PgStat_SLRUStats
-{
-    PgStat_Counter blocks_zeroed;
-    PgStat_Counter blocks_hit;
-    PgStat_Counter blocks_read;
-    PgStat_Counter blocks_written;
-    PgStat_Counter blocks_exists;
-    PgStat_Counter flush;
-    PgStat_Counter truncate;
-    TimestampTz stat_reset_timestamp;
-} PgStat_SLRUStats;
-
-
 /* ----------
  * Backend states
  * ----------
@@ -808,7 +432,7 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
+    WAIT_EVENT_READING_STATS_FILE,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
     WAIT_EVENT_WAL_RECEIVER_MAIN,
@@ -1060,7 +684,7 @@ typedef struct PgBackendGSSStatus
  *
  * Each live backend maintains a PgBackendStatus struct in shared memory
  * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
+ * BackendId, but that is not critical.)  Note that the stats-writing code
  * has no involvement in, or even access to, these structs.
  *
  * Each auxiliary process also maintains a PgBackendStatus struct in shared
@@ -1257,13 +881,21 @@ extern PGDLLIMPORT bool pgstat_track_counts;
 extern PGDLLIMPORT int pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used; kept only until the corresponding GUC is removed */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
+
+/*
+ * CheckPointer statistics counters are updated directly by checkpointer and
+ * bufmgr
+ */
+extern PgStat_CheckPointer CheckPointerStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1278,29 +910,22 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
-
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 extern void pgstat_reset_slru_counter(const char *);
 
@@ -1340,6 +965,7 @@ extern PgStat_TableStatus *find_tabstat_entry(Oid rel_id);
 extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 
 extern void pgstat_initstats(Relation rel);
+extern void pgstat_delinkstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
 
@@ -1462,8 +1088,9 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                                       void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
+extern void pgstat_report_archiver(const char *xlog, bool failed);
+extern void pgstat_report_bgwriter(void);
+extern void pgstat_report_checkpointer(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1472,12 +1099,17 @@ extern void pgstat_send_bgwriter(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_extended(bool shared,
+                                                                Oid relid);
+extern void pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
-extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
-extern PgStat_GlobalStats *pgstat_fetch_global(void);
+extern TimestampTz pgstat_get_stat_timestamp(void);
+extern PgStat_Archiver *pgstat_fetch_stat_archiver(void);
+extern PgStat_BgWriter *pgstat_fetch_stat_bgwriter(void);
+extern PgStat_CheckPointer *pgstat_fetch_stat_checkpointer(void);
 extern PgStat_SLRUStats *pgstat_fetch_slru(void);
 
 extern void pgstat_count_slru_page_zeroed(int slru_idx);
@@ -1489,5 +1121,6 @@ extern void pgstat_count_slru_flush(int slru_idx);
 extern void pgstat_count_slru_truncate(int slru_idx);
 extern const char *pgstat_slru_name(int slru_idx);
 extern int    pgstat_slru_index(const char *name);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index af9b41795d..621e074111 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -218,6 +218,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SHARED_TIDBITMAP,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_PER_XACT_PREDICATE_LIST,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 04431d0eb2..3b03464a1a 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -88,7 +88,7 @@ enum config_group
     PROCESS_TITLE,
     STATS,
     STATS_MONITORING,
-    STATS_COLLECTOR,
+    STATS_ACTIVITY,
     AUTOVACUUM,
     CLIENT_CONN,
     CLIENT_CONN_STATEMENT,
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 83a15f6795..77d1572a99 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.18.4
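
The header diff above keeps pgstat_clear_snapshot() alongside the
pgstat_fetch_stat_*() accessors, and the documentation in this series says a
backend reads the current statistics once and reuses that snapshot until it is
cleared or the transaction ends. A minimal sketch of that read-once behavior
follows. It is not the patch's implementation: StatsSnapshot,
shared_tuples_inserted, fetch_tuples_inserted() and clear_snapshot() are
hypothetical names, and a plain static variable stands in for shared memory.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-backend snapshot of one shared counter. */
typedef struct StatsSnapshot
{
    bool    valid;              /* has the snapshot been taken yet? */
    int64_t tuples_inserted;    /* value copied at snapshot time */
} StatsSnapshot;

/* Stand-in for a counter living in shared memory (assumption). */
static int64_t shared_tuples_inserted = 0;

static StatsSnapshot snapshot = {false, 0};

/*
 * Return the counter as of the first read since the last clear.
 * Later changes to the shared value are not visible until
 * clear_snapshot() is called.
 */
static int64_t
fetch_tuples_inserted(void)
{
    if (!snapshot.valid)
    {
        snapshot.tuples_inserted = shared_tuples_inserted;
        snapshot.valid = true;
    }
    return snapshot.tuples_inserted;
}

/* Discard the snapshot so the next fetch re-reads the shared value. */
static void
clear_snapshot(void)
{
    snapshot.valid = false;
}
```

Repeated fetches inside one "transaction" return the same value even if the
shared counter moves, which is what lets several statistics queries be
correlated with each other.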

From 0fa7e49eddda07ef0a1f6d3903ff16081fc2d4fd Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:09 +0900
Subject: [PATCH v38 5/7] Doc part of shared-memory based stats collector.

---
 doc/src/sgml/catalogs.sgml          |   6 +-
 doc/src/sgml/config.sgml            |  34 ++++----
 doc/src/sgml/high-availability.sgml |  13 +--
 doc/src/sgml/monitoring.sgml        | 127 +++++++++++++---------------
 doc/src/sgml/ref/pg_dump.sgml       |   9 +-
 5 files changed, 90 insertions(+), 99 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index de9bacd34f..69db5afc94 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -9209,9 +9209,9 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l
   <para>
    <xref linkend="view-table"/> lists the system views described here.
    More detailed documentation of each view follows below.
-   There are some additional views that provide access to the results of
-   the statistics collector; they are described in <xref
-   linkend="monitoring-stats-views-table"/>.
+   There are some additional views that provide access to the activity
+   statistics; they are described in
+   <xref linkend="monitoring-stats-views-table"/>.
   </para>
 
   <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8eabf93834..cc5dc1173f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7192,11 +7192,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
     <title>Run-time Statistics</title>
 
     <sect2 id="runtime-config-statistics-collector">
-     <title>Query and Index Statistics Collector</title>
+     <title>Query and Index Activity Statistics</title>
 
      <para>
-      These parameters control server-wide statistics collection features.
-      When statistics collection is enabled, the data that is produced can be
+      These parameters control server-wide activity statistics features.
+      When activity tracking is enabled, the data that is produced can be
       accessed via the <structname>pg_stat</structname> and
       <structname>pg_statio</structname> family of system views.
       Refer to <xref linkend="monitoring"/> for more information.
@@ -7212,14 +7212,13 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables the collection of information on the currently
-        executing command of each session, along with the time when
-        that command began execution. This parameter is on by
-        default. Note that even when enabled, this information is not
-        visible to all users, only to superusers and the user owning
-        the session being reported on, so it should not represent a
-        security risk.
-        Only superusers can change this setting.
+        Enables tracking of the currently executing command of
+        each session, along with the time when that command began
+        execution. This parameter is on by default. Note that even when
+        enabled, this information is not visible to all users, only to
+        superusers and the user owning the session being reported on, so it
+        should not represent a security risk.  Only superusers can change this
+        setting.
        </para>
       </listitem>
      </varlistentry>
@@ -7250,9 +7249,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables collection of statistics on database activity.
+        Enables tracking of database activity.
         This parameter is on by default, because the autovacuum
-        daemon needs the collected information.
+        daemon needs the activity information.
         Only superusers can change this setting.
        </para>
       </listitem>
@@ -8313,7 +8312,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Specifies the fraction of the total number of heap tuples counted in
-        the previous statistics collection that can be inserted without
+        the previously collected statistics that can be inserted without
         incurring an index scan at the <command>VACUUM</command> cleanup stage.
         This setting currently applies to B-tree indexes only.
        </para>
@@ -8325,9 +8324,10 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         the index contains deleted pages that can be recycled during cleanup.
         Index statistics are considered to be stale if the number of newly
         inserted tuples exceeds the <varname>vacuum_cleanup_index_scale_factor</varname>
-        fraction of the total number of heap tuples detected by the previous
-        statistics collection. The total number of heap tuples is stored in
-        the index meta-page. Note that the meta-page does not include this data
+
+        fraction of the total number of heap tuples in the previously
+        collected statistics. The total number of heap tuples is stored in the
+        index meta-page. Note that the meta-page does not include this data
         until <command>VACUUM</command> finds no dead tuples, so B-tree index
         scan at the cleanup stage can only be skipped if the second and
         subsequent <command>VACUUM</command> cycles detect no dead tuples.
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 42f01c515f..ec02e72dc0 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2367,12 +2367,13 @@ LOG:  database system is ready to accept read only connections
    </para>
 
    <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
-    index usage, etc., will be recorded normally on the standby. Replayed
-    actions will not duplicate their effects on primary, so replaying an
-    insert will not increment the Inserts column of pg_stat_user_tables.
-    The stats file is deleted at the start of recovery, so stats from primary
-    and standby will differ; this is considered a feature, not a bug.
+    Activity statistics are collected during recovery. All scans, reads,
+    blocks, index usage, etc., will be recorded normally on the
+    standby. Replayed actions will not duplicate their effects on primary, so
+    replaying an insert will not increment the Inserts column of
+    pg_stat_user_tables.  The activity statistics are reset at the start of
+    recovery, so stats from primary and standby will differ; this is
+    considered a feature, not a bug.
    </para>
 
    <para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4e0193a967..7a04d58a1a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -22,7 +22,7 @@
   <para>
    Several tools are available for monitoring database activity and
    analyzing performance.  Most of this chapter is devoted to describing
-   <productname>PostgreSQL</productname>'s statistics collector,
+   <productname>PostgreSQL</productname>'s activity statistics,
    but one should not neglect regular Unix monitoring programs such as
    <command>ps</command>, <command>top</command>, <command>iostat</command>, and <command>vmstat</command>.
    Also, once one has identified a
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in
transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    primary server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   primary process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   primary process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
@@ -130,20 +128,21 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
  </sect1>
 
  <sect1 id="monitoring-stats">
-  <title>The Statistics Collector</title>
+  <title>The Activity Statistics</title>
 
   <indexterm zone="monitoring-stats">
    <primary>statistics</primary>
   </indexterm>
 
   <para>
-   <productname>PostgreSQL</productname>'s <firstterm>statistics collector</firstterm>
-   is a subsystem that supports collection and reporting of information about
-   server activity.  Presently, the collector can count accesses to tables
-   and indexes in both disk-block and individual-row terms.  It also tracks
-   the total number of rows in each table, and information about vacuum and
-   analyze actions for each table.  It can also count calls to user-defined
-   functions and the total time spent in each one.
+   <productname>PostgreSQL</productname>'s <firstterm>activity
+   statistics</firstterm> is a subsystem that supports tracking and reporting
+   of information about server activity.  Presently, it counts accesses
+   to tables and indexes in both disk-block and individual-row terms.
+   It also tracks the total number of rows in each table, and
+   information about vacuum and analyze actions for each table.  It can
+   also track calls to user-defined functions and the total time spent
+   in each one.
   </para>
 
   <para>
@@ -151,15 +150,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    information about exactly what is going on in the system right now, such as
    the exact command currently being executed by other server processes, and
    which other connections exist in the system.  This facility is independent
-   of the collector process.
+   of the activity statistics.
   </para>
 
  <sect2 id="monitoring-stats-setup">
-  <title>Statistics Collection Configuration</title>
+  <title>Activity Statistics Configuration</title>
 
   <para>
-   Since collection of statistics adds some overhead to query execution,
-   the system can be configured to collect or not collect information.
+   Since tracking for the activity statistics adds some overhead to query
+   execution, the system can be configured to track or not track activity.
    This is controlled by configuration parameters that are normally set in
    <filename>postgresql.conf</filename>.  (See <xref linkend="runtime-config"/> for
    details about setting configuration parameters.)
@@ -172,7 +171,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The parameter <xref linkend="guc-track-counts"/> controls whether
-   statistics are collected about table and index accesses.
+   table and index accesses are tracked.
   </para>
 
   <para>
@@ -196,18 +195,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
   </para>
 
   <para>
-   The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
-   When the server shuts down cleanly, a permanent copy of the statistics
-   data is stored in the <filename>pg_stat</filename> subdirectory, so that
-   statistics can be retained across server restarts.  When recovery is
-   performed at server start (e.g., after immediate shutdown, server crash,
-   and point-in-time recovery), all statistics counters are reset.
+   When the server shuts down cleanly, a permanent copy of the statistics
+   data is stored in the <filename>pg_stat</filename> subdirectory, so that
+   statistics can be retained across server restarts.  When recovery is
+   performed at server start (e.g., after immediate shutdown, server crash,
+   and point-in-time recovery), all statistics counters are reset.
   </para>
 
  </sect2>
@@ -220,48 +212,46 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    linkend="monitoring-stats-dynamic-views-table"/>, are available to show
    the current state of the system. There are also several other
    views, listed in <xref
-   linkend="monitoring-stats-views-table"/>, available to show the results
-   of statistics collection.  Alternatively, one can
-   build custom views using the underlying statistics functions, as discussed
-   in <xref linkend="monitoring-stats-functions"/>.
+   linkend="monitoring-stats-views-table"/>, available to show the activity
+   statistics.  Alternatively, one can build custom views using the underlying
+   statistics functions, as discussed in
+   <xref linkend="monitoring-stats-functions"/>.
   </para>
 
   <para>
-   When using the statistics to monitor collected data, it is important
-   to realize that the information does not update instantaneously.
-   Each individual server process transmits new statistical counts to
-   the collector just before going idle; so a query or transaction still in
-   progress does not affect the displayed totals.  Also, the collector itself
-   emits a new report at most once per <varname>PGSTAT_STAT_INTERVAL</varname>
-   milliseconds (500 ms unless altered while building the server).  So the
-   displayed information lags behind actual activity.  However, current-query
-   information collected by <varname>track_activities</varname> is
-   always up-to-date.
+   When using the activity statistics, it is important to realize that the
+   information does not update instantaneously.  Each individual server
+   process writes out new statistical counts just before going idle, but not
+   more frequently than once per <varname>PGSTAT_STAT_INTERVAL</varname>
+   milliseconds (1 second unless altered while building the server); so a
+   query or transaction still in progress does not affect the displayed
+   totals.  However, current-query information tracked by
+   <varname>track_activities</varname> is always up-to-date.
   </para>
 
   <para>
    Another important point is that when a server process is asked to display
-   any of these statistics, it first fetches the most recent report emitted by
-   the collector process and then continues to use this snapshot for all
-   statistical views and functions until the end of its current transaction.
-   So the statistics will show static information as long as you continue the
-   current transaction.  Similarly, information about the current queries of
-   all sessions is collected when any such information is first requested
-   within a transaction, and the same information will be displayed throughout
-   the transaction.
-   This is a feature, not a bug, because it allows you to perform several
-   queries on the statistics and correlate the results without worrying that
-   the numbers are changing underneath you.  But if you want to see new
-   results with each query, be sure to do the queries outside any transaction
-   block.  Alternatively, you can invoke
+   any of these statistics, it first reads the current statistics and then
+   continues to use this snapshot for all statistical views and functions
+   until the end of its current transaction.  So the statistics will show
+   static information as long as you continue the current transaction.
+   Similarly, information about the current queries of all sessions is tracked
+   when any such information is first requested within a transaction, and the
+   same information will be displayed throughout the transaction.  This is a
+   feature, not a bug, because it allows you to perform several queries on the
+   statistics and correlate the results without worrying that the numbers are
+   changing underneath you.  But if you want to see new results with each
+   query, be sure to do the queries outside any transaction block.
+   Alternatively, you can invoke
    <function>pg_stat_clear_snapshot</function>(), which will discard the
    current transaction's statistics snapshot (if any).  The next use of
    statistical information will cause a new snapshot to be fetched.
   </para>
-
+  
   <para>
-   A transaction can also see its own statistics (as yet untransmitted to the
-   collector) in the views <structname>pg_stat_xact_all_tables</structname>,
+   A transaction can also see its own statistics (as yet unwritten to the
+   server-wide activity statistics) in the
+   views <structname>pg_stat_xact_all_tables</structname>,
    <structname>pg_stat_xact_sys_tables</structname>,
    <structname>pg_stat_xact_user_tables</structname>, and
    <structname>pg_stat_xact_user_functions</structname>.  These numbers do not act as
@@ -620,7 +610,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    kernel's I/O cache, and might therefore still be fetched without
    requiring a physical read. Users interested in obtaining more
    detailed information on <productname>PostgreSQL</productname> I/O behavior are
-   advised to use the <productname>PostgreSQL</productname> statistics collector
+   advised to use the <productname>PostgreSQL</productname> activity statistics
    in combination with operating system utilities that allow insight
    into the kernel's handling of I/O.
   </para>
@@ -1057,10 +1047,6 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>LogicalLauncherMain</literal></entry>
       <entry>Waiting in main loop of logical replication launcher process.</entry>
      </row>
-     <row>
-      <entry><literal>PgStatMain</literal></entry>
-      <entry>Waiting in main loop of statistics collector process.</entry>
-     </row>
      <row>
       <entry><literal>RecoveryWalStream</literal></entry>
       <entry>Waiting in main loop of startup process for WAL to arrive, during
@@ -1815,6 +1801,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
     </thead>
 
     <tbody>
+     <row>
+      <entry><literal>ActivityStatistics</literal></entry>
+      <entry>Waiting to write out activity statistics to shared memory.</entry>
+     </row>
      <row>
       <entry><literal>AddinShmemInit</literal></entry>
       <entry>Waiting to manage an extension's space allocation in shared
@@ -5738,9 +5728,10 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
      <entry><literal>performing final cleanup</literal></entry>
      <entry>
        <command>VACUUM</command> is performing final cleanup.  During this phase,
-       <command>VACUUM</command> will vacuum the free space map, update statistics
-       in <literal>pg_class</literal>, and report statistics to the statistics
-       collector.  When this phase is completed, <command>VACUUM</command> will end.
+       <command>VACUUM</command> will vacuum the free space map, update
+       statistics in <literal>pg_class</literal>, and update the system-wide
+       activity statistics.  When this phase is completed,
+       <command>VACUUM</command> will end.
      </entry>
     </row>
    </tbody>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index e09ed0a4c3..71bb24accf 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1290,11 +1290,10 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The database activity of <application>pg_dump</application> is
-   normally collected by the statistics collector.  If this is
-   undesirable, you can set parameter <varname>track_counts</varname>
-   to false via <envar>PGOPTIONS</envar> or the <literal>ALTER
-   USER</literal> command.
+   The database activity of <application>pg_dump</application> is normally
+   collected.  If this is undesirable, you can set
+   parameter <varname>track_counts</varname> to false
+   via <envar>PGOPTIONS</envar> or the <literal>ALTER USER</literal> command.
   </para>
 
  </refsect1>
-- 
2.18.4
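
The monitoring.sgml change above says each server process writes out its
counts just before going idle, but not more often than once per
PGSTAT_STAT_INTERVAL, and the header diff changes pgstat_report_stat() to
return a long. A minimal sketch of such a rate-limited flush follows. It is an
illustration under assumptions, not the patch's code: PendingStats,
report_stat() and MIN_FLUSH_INTERVAL_MS are hypothetical names, the meaning
given to the return value (milliseconds until a retry is worthwhile) is a
guess, and the caller supplies the clock.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for PGSTAT_STAT_INTERVAL (1 second per the docs). */
#define MIN_FLUSH_INTERVAL_MS 1000

typedef struct PendingStats
{
    int64_t pending_count;  /* counts accumulated locally, not yet shared */
    int64_t shared_count;   /* stand-in for the total kept in shared memory */
    int64_t last_flush_ms;  /* time of the last successful flush */
} PendingStats;

/*
 * Move pending counts into the shared total if the minimum interval has
 * elapsed since the last flush, or unconditionally when force is true.
 *
 * Returns 0 if nothing remains pending, otherwise the number of
 * milliseconds the caller should wait before trying again.
 */
static long
report_stat(PendingStats *st, int64_t now_ms, bool force)
{
    int64_t elapsed = now_ms - st->last_flush_ms;

    if (st->pending_count == 0)
        return 0;

    if (!force && elapsed < MIN_FLUSH_INTERVAL_MS)
        return (long) (MIN_FLUSH_INTERVAL_MS - elapsed);

    st->shared_count += st->pending_count;
    st->pending_count = 0;
    st->last_flush_ms = now_ms;
    return 0;
}
```

A caller going idle could use a nonzero return value to arm a timer (the
series adds IDLE_STATS_UPDATE_TIMEOUT, which may serve that purpose) so that
leftover counts are not held back indefinitely.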

From 9e3c966f9221b724d805b125481064be46b564d4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 29 Sep 2020 22:59:33 +0900
Subject: [PATCH v38 6/7] Remove the GUC stats_temp_directory

The new stats collection system doesn't need a temporary directory, so
just remove it. pg_stat_statements is modified to use the pg_stat directory
to store its temporary files.  As a result, basebackup copies
pg_stat_statements' temporary file if it exists.
---
 .../pg_stat_statements/pg_stat_statements.c   | 13 +++---
 doc/src/sgml/backup.sgml                      |  2 -
 doc/src/sgml/config.sgml                      | 19 --------
 doc/src/sgml/storage.sgml                     |  6 ---
 src/backend/postmaster/pgstat.c               | 10 -----
 src/backend/replication/basebackup.c          | 36 ----------------
 src/backend/utils/misc/guc.c                  | 43 -------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/bin/initdb/initdb.c                       |  1 -
 src/bin/pg_rewind/filemap.c                   |  7 ---
 src/include/pgstat.h                          |  3 --
 src/test/perl/PostgresNode.pm                 |  4 --
 12 files changed, 6 insertions(+), 139 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 1eac9edaee..5eaceb60a7 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -88,14 +88,13 @@ PG_MODULE_MAGIC;
 #define PGSS_DUMP_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pg_stat_statements.stat"
 
 /*
- * Location of external query text file.  We don't keep it in the core
- * system's stats_temp_directory.  The core system can safely use that GUC
- * setting, because the statistics collector temp file paths are set only once
- * as part of changing the GUC, but pg_stat_statements has no way of avoiding
- * race conditions.  Besides, we only expect modest, infrequent I/O for query
- * strings, so placing the file on a faster filesystem is not compelling.
+ * Location of external query text file.  We keep it in the core system's
+ * pg_stat directory rather than one specified by a GUC, because
+ * pg_stat_statements has no way of avoiding race conditions in that case.
+ * Besides, we only expect modest, infrequent I/O for query strings, so
+ * placing the file on a faster filesystem is not compelling.
  */
-#define PGSS_TEXT_FILE    PG_STAT_TMP_DIR "/pgss_query_texts.stat"
+#define PGSS_TEXT_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pgss_query_texts.stat"
 
 /* Magic number identifying the stats file format */
 static const uint32 PGSS_FILE_HEADER = 0x20171004;
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index b9331830f7..5096963234 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1146,8 +1146,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index cc5dc1173f..d8d99bb546 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7305,25 +7305,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 3234adb639..6bac5e075e 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -120,12 +120,6 @@ Item
   subsystem</entry>
 </row>
 
-<row>
- <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
-</row>
-
 <row>
  <entry><filename>pg_subtrans</filename></entry>
  <entry>Subdirectory containing subtransaction status data</entry>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index c7f0503d81..6e2053e73d 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -98,16 +98,6 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
- */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
-
 /*
  * List of SLRU names that we keep stats for.  There is no central registry of
  * SLRUs, so we use this fixed list instead.  The "other" entry is used for
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 57531d7d48..25eabbb1ad 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -87,9 +87,6 @@ static int    basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
 /* Was the backup currently in-progress initiated in recovery mode? */
 static bool backup_started_in_recovery = false;
 
-/* Relative path of temporary statistics directory */
-static char *statrelpath = NULL;
-
 /*
  * Size of each block sent into the tar stream for larger files.
  */
@@ -152,13 +149,6 @@ struct exclude_list_item
  */
 static const char *const excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    PG_STAT_TMP_DIR,
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
@@ -261,7 +251,6 @@ perform_base_backup(basebackup_options *opt)
     StringInfo    labelfile;
     StringInfo    tblspc_map_file;
     backup_manifest_info manifest;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
     backup_total = 0;
@@ -284,8 +273,6 @@ perform_base_backup(basebackup_options *opt)
     Assert(CurrentResourceOwner == NULL);
     CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -314,18 +301,6 @@ perform_base_backup(basebackup_options *opt)
         tablespaceinfo *ti;
         int            tblspc_streamed = 0;
 
-        /*
-         * Calculate the relative path of temporary statistics directory in
-         * order to skip the files which are located in that directory later.
-         */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
-
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
         ti->size = -1;
@@ -1365,17 +1340,6 @@ sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces,
         if (excludeFound)
             continue;
 
-        /*
-         * Exclude contents of directory specified by statrelpath if not set
-         * to the default (pg_stat_tmp) which is caught in the loop above.
-         */
-        if (statrelpath != NULL && strcmp(pathbuf, statrelpath) == 0)
-        {
-            elog(DEBUG1, "contents of directory \"%s\" excluded from backup", statrelpath);
-            size += _tarWriteDir(pathbuf, basepathlen, &statbuf, sizeonly);
-            continue;
-        }
-
         /*
          * We can skip pg_wal, the WAL segments need to be fetched from the
          * WAL archive anyway. But include it as an empty directory anyway, so
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 29eb459e35..87296bf2aa 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -202,7 +202,6 @@ static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource sourc
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_huge_page_size(int *newval, void **extra, GucSource source);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -558,8 +557,6 @@ char       *HbaFileName;
 char       *IdentFileName;
 char       *external_pid_file;
 
-char       *pgstat_temp_directory;
-
 char       *application_name;
 
 int            tcp_keepalives_idle;
@@ -4309,17 +4306,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_PRIMARY,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11608,35 +11594,6 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
     return true;
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 668a2d033a..7183c08305 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -586,7 +586,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 118b282d1c..9e5a3a01ed 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -217,7 +217,6 @@ static const char *const subdirs[] = {
     "pg_replslot",
     "pg_tblspc",
     "pg_stat",
-    "pg_stat_tmp",
     "pg_xact",
     "pg_logical",
     "pg_logical/snapshots",
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 1abc257177..d2192429bc 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -53,13 +53,6 @@ struct exclude_list_item
  */
 static const char *excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    "pg_stat_tmp",                /* defined as PG_STAT_TMP_DIR */
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 046bf21485..44738d4aed 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -32,9 +32,6 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/saved_stats"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/saved_stats.tmp"
 
-/* Default directory to store temporary statistics data in */
-#define PG_STAT_TMP_DIR        "pg_stat_tmp"
-
 /* Values for track_functions GUC variable --- order is significant! */
 typedef enum TrackFunctionsLevel
 {
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 1488bffa2b..bb5474b878 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -455,10 +455,6 @@ sub init
     print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
       if defined $ENV{TEMP_CONFIG};
 
-    # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
-    # concurrently must not share a stats_temp_directory.
-    print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
     if ($params{allows_streaming})
     {
         if ($params{allows_streaming} eq "logical")
-- 
2.18.4

From b9092abc52bc0839f0a12e03f05581178aa08cef Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 29 Sep 2020 23:19:58 +0900
Subject: [PATCH v38 7/7] Exclude pg_stat directory from base backup

basebackup sends the contents of the pg_stat directory, which are
removed at startup from the backup anyway. Now that pg_stat_statements
saves a temporary file into that directory, exclude the pg_stat
directory from base backups.
---
 src/backend/replication/basebackup.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 25eabbb1ad..dd35920e82 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -149,6 +149,13 @@ struct exclude_list_item
  */
 static const char *const excludeDirContents[] =
 {
+    /*
+     * Skip statistics files. PGSTAT_STAT_PERMANENT_DIRECTORY must be skipped
+     * because the files in the directory will be removed at startup from the
+     * backup.
+     */
+    PGSTAT_STAT_PERMANENT_DIRECTORY,
+
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
-- 
2.18.4


Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
Rebased on a previously committed WAL-stats patch.

I found a bug where the maximum interval was wrongly set to 600s
instead of 60s.

The previous version failed to flush local database stats under certain
conditions. That behavior caused useless retries and finally a forced
flush that led to contention. I fixed that and will measure
performance with this version.

Now that global stats are split into bgwriter stats and checkpointer
stats, each set of stats is updated by only one process. However, they
are reset by client backends, so an LWLock is still needed to protect
them.  To get rid of the LWLocks, pgstat_reset_shared_counters() is
changed so that it avoids scribbling on the shared structs.

As a result, archiver, bgwriter and checkpointer stats no longer need
an LWLock to update, read or reset.  Reader-reader conflicts on
StatsLock still occur, but they don't affect writer processes.

WAL stats are written by many backends, so they still require an LWLock
to reset, update and read.
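
As an illustration only: the message above does not spell out how
pgstat_reset_shared_counters() avoids scribbling on the shared structs.
One way to get a lock-free reset for a single-writer counter, assumed
here and not taken from the patch, is to let the resetter record the
current value as an offset that readers subtract. All names below
(toy_shared_counter, toy_counter_*) are hypothetical, and the sketch
covers a single counter rather than a multi-field stats struct.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical single-writer counter; not a struct from the patch. */
typedef struct toy_shared_counter
{
    _Atomic uint64_t value;         /* only the owning process adds to this */
    _Atomic uint64_t reset_offset;  /* only resetting processes store to this */
} toy_shared_counter;

static void
toy_counter_init(toy_shared_counter *c)
{
    atomic_init(&c->value, 0);
    atomic_init(&c->reset_offset, 0);
}

/* Writer side: a plain atomic add, no lock taken. */
static void
toy_counter_add(toy_shared_counter *c, uint64_t n)
{
    atomic_fetch_add(&c->value, n);
}

/* Reset: remember the current value instead of zeroing the counter. */
static void
toy_counter_reset(toy_shared_counter *c)
{
    atomic_store(&c->reset_offset, atomic_load(&c->value));
}

/*
 * Reader: load the offset first, then the value.  The value only grows,
 * so a value loaded later is never smaller than an offset loaded earlier.
 */
static uint64_t
toy_counter_read(toy_shared_counter *c)
{
    uint64_t    off = atomic_load(&c->reset_offset);
    uint64_t    val = atomic_load(&c->value);

    assert(val >= off);
    return val - off;
}
```

With this layout the writer never touches reset_offset and a resetter
never touches value, so neither side overwrites the other's data. A
struct with several related counters would additionally need some way
to read them consistently, which this sketch does not attempt.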

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 3411e131c6d0e4f0012e16ce41aa10eced69f99c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:03 +0900
Subject: [PATCH v39 1/7] sequential scan for dshash

Dshash did not allow scanning all entries sequentially. This adds that
functionality. The interface is similar to, but a bit different from,
both that of dynahash and the simple dshash search functions. One of
the most significant differences is that the sequential scan interface
of dshash always needs a call to dshash_seq_term when the scan ends.
Another is locking. Dshash holds a partition lock when returning an
entry; dshash_seq_next() also holds the lock when returning an entry,
but callers must not release it, since the lock is essential to
continue the scan. The seqscan interface allows entry deletion during
a scan. In-scan deletion should be performed with dshash_delete_current().
---
 src/backend/lib/dshash.c | 160 ++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  22 ++++++
 2 files changed, 181 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 78ccf03217..b829167872 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -127,6 +127,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +157,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -324,7 +332,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -592,6 +600,156 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should always be called when a scan has finished.
+ * The caller may delete returned elements in the middle of a scan by using
+ * dshash_delete_current(). exclusive must be true to delete elements.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool exclusive)
+{
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->exclusive = exclusive;
+}
+
+/*
+ * Returns the next element.
+ *
+ * Returned elements are locked and the caller must not explicitly release
+ * it. It is released at the next call to dshash_seq_next().
+ */
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert(status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+        LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                      status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        status->curpartition = partition;
+
+        /* resize doesn't happen from now until seq scan ends */
+        status->nbuckets =
+            NUM_BUCKETS(status->hash_table->control->size_log2);
+        ensure_valid_bucket_pointers(status->hash_table);
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(status->hash_table,
+                                               status->curpartition),
+                                status->exclusive ? LW_EXCLUSIVE : LW_SHARED));
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        int next_partition;
+
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            return NULL;
+        }
+
+        /* Check whether we move to the next partition */
+        next_partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+
+        if (status->curpartition != next_partition)
+        {
+            /*
+             * Move to the next partition. Lock the next partition then release
+             * the current, not in the reverse order, to avoid concurrent
+             * resizing.  Avoid deadlock by taking locks in the same order
+             * as resize().
+             */
+            LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                         next_partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                         status->curpartition));
+            status->curpartition = next_partition;
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * The caller may delete the item. Store the next item in case of deletion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+/*
+ * Terminates the seqscan and release all locks.
+ *
+ * Should be always called when finishing or exiting a seqscan.
+ */
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+
+    if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+/* Remove the current entry during a seq scan. */
+void
+dshash_delete_current(dshash_seq_status *status)
+{
+    dshash_table       *hash_table    = status->hash_table;
+    dshash_table_item  *item        = status->curitem;
+    size_t                partition    = PARTITION_FOR_HASH(item->hash);
+
+    Assert(status->exclusive);
+    Assert(hash_table->control->magic == DSHASH_MAGIC);
+    Assert(hash_table->find_locked);
+    Assert(hash_table->find_exclusively_locked);
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition),
+                                LW_EXCLUSIVE));
+
+    delete_item(hash_table, item);
+}
+
+/* Get the current entry during a seq scan. */
+void *
+dshash_get_current(dshash_seq_status *status)
+{
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b86df68e77..c337099061 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,21 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state. The detail is exposed to let users know the storage
+ * size but it should be considered as an opaque type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;    /* dshash table working on */
+    int                    curbucket;    /* bucket number we are at */
+    int                    nbuckets;    /* total number of buckets in the dshash */
+    dshash_table_item  *curitem;    /* item we are currently at */
+    dsa_pointer            pnextitem;    /* dsa-pointer to the next item */
+    int                    curpartition;    /* partition number we are at */
+    bool                exclusive;    /* locking mode */
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
                                    const dshash_parameters *params,
@@ -80,6 +95,13 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
+extern void dshash_delete_current(dshash_seq_status *status);
+extern void *dshash_get_current(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.18.4
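
The bucket/partition arithmetic used by the sequential scan above can
be checked in isolation. The following is a minimal sketch that copies
the macros from the patch into a standalone file. It assumes
DSHASH_NUM_PARTITIONS_LOG2 is 7 (the constant is defined elsewhere in
dshash.c and is not shown in this patch), and
count_partition_switches() is a hypothetical helper added only to show
how often a full scan would hand over from one partition lock to the
next.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 2^7 = 128 partitions; the real value lives in dshash.c. */
#define DSHASH_NUM_PARTITIONS_LOG2 7

/* How many times the partitions have been split at a given size. */
#define NUM_SPLITS(size_log2) \
    ((size_log2) - DSHASH_NUM_PARTITIONS_LOG2)

/* How many buckets are there in a given size? */
#define NUM_BUCKETS(size_log2) \
    (((size_t) 1) << (size_log2))

/* How many buckets are there in each partition at a given size? */
#define BUCKETS_PER_PARTITION(size_log2) \
    (((size_t) 1) << NUM_SPLITS(size_log2))

/* First bucket index belonging to a partition. */
#define BUCKET_INDEX_FOR_PARTITION(partition, size_log2) \
    ((partition) << NUM_SPLITS(size_log2))

/* Choose partition based on bucket index. */
#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2) \
    ((bucket_idx) >> NUM_SPLITS(size_log2))

/*
 * Hypothetical helper: walk all buckets in index order, as the scan does,
 * and count how many times the owning partition changes.
 */
static size_t
count_partition_switches(size_t size_log2)
{
    size_t      nbuckets = NUM_BUCKETS(size_log2);
    size_t      cur = PARTITION_FOR_BUCKET_INDEX((size_t) 0, size_log2);
    size_t      switches = 0;

    assert(size_log2 >= DSHASH_NUM_PARTITIONS_LOG2);

    for (size_t b = 1; b < nbuckets; b++)
    {
        size_t      next = PARTITION_FOR_BUCKET_INDEX(b, size_log2);

        if (next != cur)
        {
            switches++;
            cur = next;
        }
    }
    return switches;
}
```

Because buckets of one partition are contiguous in index order, the
scan only has to swap locks when it crosses a partition boundary, no
matter how many buckets each partition holds.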

From e866144f21881035533269921fc23618d155f0ee Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:57 +0900
Subject: [PATCH v39 2/7] Add conditional lock feature to dshash

Dshash currently waits for a lock unconditionally. That is inconvenient
when we want to avoid being blocked by other processes. This commit
adds dshash_find_extended, an alternative to dshash_find and
dshash_find_or_insert that allows immediate return on lock failure.
---
 src/backend/lib/dshash.c | 98 +++++++++++++++++++++-------------------
 src/include/lib/dshash.h |  3 ++
 2 files changed, 55 insertions(+), 46 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index b829167872..9c90096f3d 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -383,6 +383,10 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  * the caller must take care to ensure that the entry is not left corrupted.
  * The lock mode is either shared or exclusive depending on 'exclusive'.
  *
+ * If found is not NULL, *found is set to true if the key is found in the hash
+ * table. If the key is not found, *found is set to false and a pointer to a
+ * newly created entry is returned.
+ *
  * The caller must not lock a lock already.
  *
  * Note that the lock held is in fact an LWLock, so interrupts will be held on
@@ -392,36 +396,7 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
 {
-    dshash_hash hash;
-    size_t        partition;
-    dshash_table_item *item;
-
-    hash = hash_key(hash_table, key);
-    partition = PARTITION_FOR_HASH(hash);
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
-
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
-    ensure_valid_bucket_pointers(hash_table);
-
-    /* Search the active bucket. */
-    item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
-
-    if (!item)
-    {
-        /* Not found. */
-        LWLockRelease(PARTITION_LOCK(hash_table, partition));
-        return NULL;
-    }
-    else
-    {
-        /* The caller will free the lock by calling dshash_release_lock. */
-        hash_table->find_locked = true;
-        hash_table->find_exclusively_locked = exclusive;
-        return ENTRY_FROM_ITEM(item);
-    }
+    return dshash_find_extended(hash_table, key, exclusive, false, false, NULL);
 }
 
 /*
@@ -439,31 +414,60 @@ dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
 {
-    dshash_hash hash;
-    size_t        partition_index;
-    dshash_partition *partition;
+    return dshash_find_extended(hash_table, key, true, false, true, found);
+}
+
+
+/*
+ * Find the key in the hash table.
+ *
+ * "exclusive" is the lock mode in which the partition for the returned item
+ * is locked.  If "nowait" is true, the function returns immediately when the
+ * required lock cannot be acquired.  "insert" selects insert mode: a missing
+ * key gets a new entry and *found is set to false; *found is set to true if
+ * the key exists. "found" must be non-null in this mode.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool insert, bool *found)
+{
+    dshash_hash hash = hash_key(hash_table, key);
+    size_t        partidx = PARTITION_FOR_HASH(hash);
+    dshash_partition *partition = &hash_table->control->partitions[partidx];
+    LWLockMode  lockmode = exclusive ? LW_EXCLUSIVE : LW_SHARED;
     dshash_table_item *item;
 
-    hash = hash_key(hash_table, key);
-    partition_index = PARTITION_FOR_HASH(hash);
-    partition = &hash_table->control->partitions[partition_index];
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
+    /* must be exclusive when insert allowed */
+    Assert(!insert || (exclusive && found != NULL));
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (!nowait)
+        LWLockAcquire(PARTITION_LOCK(hash_table, partidx), lockmode);
+    else if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partidx),
+                                       lockmode))
+        return NULL;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
     item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
 
     if (item)
-        *found = true;
+    {
+        if (found)
+            *found = true;
+    }
     else
     {
-        *found = false;
+        if (found)
+            *found = false;
+
+        if (!insert)
+        {
+            /* The caller didn't ask us to add a new entry. */
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+            return NULL;
+        }
 
         /* Check if we are getting too full. */
         if (partition->count > MAX_COUNT_PER_PARTITION(hash_table))
@@ -479,7 +483,8 @@ restart:
              * Give up our existing lock first, because resizing needs to
              * reacquire all the locks in the right order to avoid deadlocks.
              */
-            LWLockRelease(PARTITION_LOCK(hash_table, partition_index));
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+
             resize(hash_table, hash_table->size_log2 + 1);
 
             goto restart;
@@ -493,12 +498,13 @@ restart:
         ++partition->count;
     }
 
-    /* The caller must release the lock with dshash_release_lock. */
+    /* The caller will free the lock by calling dshash_release_lock. */
     hash_table->find_locked = true;
-    hash_table->find_exclusively_locked = true;
+    hash_table->find_exclusively_locked = exclusive;
     return ENTRY_FROM_ITEM(item);
 }
 
+
 /*
  * Remove an entry by key.  Returns true if the key was found and the
  * corresponding entry was removed.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index c337099061..493e974832 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -91,6 +91,9 @@ extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                                    const void *key, bool *found);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+                                  bool exclusive, bool nowait, bool insert,
+                                  bool *found);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.18.4
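
The control flow of dshash_find_extended() can be modelled without
PostgreSQL. The sketch below is a toy stand-in, not the patch's code:
a single C11 atomic_flag plays the role of one partition LWLock, a
small fixed array replaces the hash buckets, and every toy_* name is
hypothetical. It only aims to show the three outcomes a caller sees:
an entry returned with the lock held, NULL because the key is absent,
and NULL because "nowait" was set and the lock was busy.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NSLOTS 8

typedef struct toy_entry
{
    bool        used;
    int         key;
    int         value;
} toy_entry;

typedef struct toy_table
{
    atomic_flag lock;               /* stands in for one partition LWLock */
    toy_entry   slots[TOY_NSLOTS];
} toy_table;

static void
toy_init(toy_table *t)
{
    atomic_flag_clear(&t->lock);
    for (size_t i = 0; i < TOY_NSLOTS; i++)
        t->slots[i].used = false;
}

/* Counterpart of dshash_release_lock() in this toy model. */
static void
toy_release_lock(toy_table *t)
{
    atomic_flag_clear(&t->lock);
}

/* Returns an entry with the lock held, or NULL with the lock not held. */
static toy_entry *
toy_find_extended(toy_table *t, int key, bool nowait, bool insert, bool *found)
{
    toy_entry  *free_slot = NULL;

    assert(!insert || found != NULL);

    if (!nowait)
    {
        while (atomic_flag_test_and_set(&t->lock))
            ;                       /* spin until the lock is free */
    }
    else if (atomic_flag_test_and_set(&t->lock))
        return NULL;                /* lock busy: give up, *found untouched */

    for (size_t i = 0; i < TOY_NSLOTS; i++)
    {
        if (t->slots[i].used && t->slots[i].key == key)
        {
            if (found)
                *found = true;
            return &t->slots[i];    /* lock stays held for the caller */
        }
        if (!t->slots[i].used && free_slot == NULL)
            free_slot = &t->slots[i];
    }

    if (found)
        *found = false;

    /* Not found: leave unless asked to insert (a full toy table also fails). */
    if (!insert || free_slot == NULL)
    {
        toy_release_lock(t);
        return NULL;
    }

    free_slot->used = true;
    free_slot->key = key;
    free_slot->value = 0;
    return free_slot;               /* lock stays held for the caller */
}
```

As in the diff above, a NULL result in nowait mode is ambiguous on its
own: when the lock is busy the function returns before *found is
written, so a caller that needs to tell "busy" from "absent" has to
arrange for that itself.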

From 3a65fb6fc1b8aa0b82c4b883a5491e2acccde5c2 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:59:38 +0900
Subject: [PATCH v39 3/7] Make archiver process an auxiliary process

This is a preliminary patch for shared-memory based stats collector.

The archiver process must be an auxiliary process because it uses
shared memory once stats data is moved into shared memory. Make the
process an auxiliary process so that it keeps working.
---
 src/backend/access/transam/xlogarchive.c |   6 +-
 src/backend/bootstrap/bootstrap.c        |  22 ++--
 src/backend/postmaster/pgarch.c          | 130 +++--------------------
 src/backend/postmaster/postmaster.c      |  50 +++++----
 src/backend/storage/lmgr/proc.c          |   1 +
 src/include/access/xlog.h                |   3 +
 src/include/access/xlogarchive.h         |   1 +
 src/include/miscadmin.h                  |   2 +
 src/include/postmaster/pgarch.h          |   4 +-
 src/include/storage/pmsignal.h           |   1 -
 src/include/storage/proc.h               |   3 +
 11 files changed, 69 insertions(+), 154 deletions(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index cae93ab69d..6908bec2f9 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -29,7 +29,9 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/latch.h"
 #include "storage/pmsignal.h"
+#include "storage/proc.h"
 
 /*
  * Attempt to retrieve the specified file from off-line archival storage.
@@ -489,8 +491,8 @@ XLogArchiveNotify(const char *xlog)
     }
 
     /* Notify archiver that it's got something to do */
-    if (IsUnderPostmaster)
-        SendPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER);
+    if (IsUnderPostmaster && ProcGlobal->archiverLatch)
+        SetLatch(ProcGlobal->archiverLatch);
 }
 
 /*
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 76b2f5066f..81bfaea869 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -317,6 +317,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
         case StartupProcess:
             MyBackendType = B_STARTUP;
             break;
+        case ArchiverProcess:
+            MyBackendType = B_ARCHIVER;
+            break;
         case BgWriterProcess:
             MyBackendType = B_BG_WRITER;
             break;
@@ -437,30 +440,29 @@ AuxiliaryProcessMain(int argc, char *argv[])
             proc_exit(1);        /* should never return */
 
         case StartupProcess:
-            /* don't set signals, startup process has its own agenda */
             StartupProcessMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
+
+        case ArchiverProcess:
+            PgArchiverMain();
+            proc_exit(1);
 
         case BgWriterProcess:
-            /* don't set signals, bgwriter has its own agenda */
             BackgroundWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case CheckpointerProcess:
-            /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalWriterProcess:
-            /* don't set signals, walwriter has its own agenda */
             InitXLOGAccess();
             WalWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalReceiverProcess:
-            /* don't set signals, walreceiver has its own agenda */
             WalReceiverMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         default:
             elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index ed1b65358d..e3a520def9 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -48,6 +48,7 @@
 #include "storage/latch.h"
 #include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
+#include "storage/procsignal.h"
 #include "utils/guc.h"
 #include "utils/ps_status.h"
 
@@ -78,13 +79,11 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
-static volatile sig_atomic_t wakened = false;
 static volatile sig_atomic_t ready_to_stop = false;
 
 /* ----------
@@ -95,8 +94,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
 static void pgarch_ArchiverCopyLoop(void);
@@ -110,75 +107,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -212,14 +140,9 @@ pgarch_forkexec(void)
 #endif                            /* EXEC_BACKEND */
 
 
-/*
- * PgArchiverMain
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
- *    since we don't use 'em, it hardly matters...
- */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+/* Main entry point for archiver process */
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -231,33 +154,26 @@ PgArchiverMain(int argc, char *argv[])
     /* SIGQUIT handler was already set up by InitPostmasterChild */
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, pgarch_waken);
+    pqsignal(SIGUSR1, procsignal_sigusr1_handler);
     pqsignal(SIGUSR2, pgarch_waken_stop);
+
     /* Reset some signals that are accepted by postmaster but not here */
     pqsignal(SIGCHLD, SIG_DFL);
+
+    /* Unblock signals (they were blocked when the postmaster forked us) */
     PG_SETMASK(&UnBlockSig);
 
-    MyBackendType = B_ARCHIVER;
-    init_ps_display(NULL);
+    /*
+     * Advertise our latch that backends can use to wake us up while we're
+     * sleeping.
+     */
+    ProcGlobal->archiverLatch = &MyProc->procLatch;
 
     pgarch_MainLoop();
 
     exit(0);
 }
 
-/* SIGUSR1 signal handler for archiver process */
-static void
-pgarch_waken(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    /* set flag that there is work to be done */
-    wakened = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
 /* SIGUSR2 signal handler for archiver process */
 static void
 pgarch_waken_stop(SIGNAL_ARGS)
@@ -282,14 +198,6 @@ pgarch_MainLoop(void)
     pg_time_t    last_copy_time = 0;
     bool        time_to_stop;
 
-    /*
-     * We run the copy loop immediately upon entry, in case there are
-     * unarchived files left over from a previous database run (or maybe the
-     * archiver died unexpectedly).  After that we wait for a signal or
-     * timeout before doing more.
-     */
-    wakened = true;
-
     /*
      * There shouldn't be anything for the archiver to do except to wait for a
      * signal ... however, the archiver exists to protect our data, so she
@@ -328,12 +236,8 @@ pgarch_MainLoop(void)
         }
 
         /* Do what we're here for */
-        if (wakened || time_to_stop)
-        {
-            wakened = false;
-            pgarch_ArchiverCopyLoop();
-            last_copy_time = time(NULL);
-        }
+        pgarch_ArchiverCopyLoop();
+        last_copy_time = time(NULL);
 
         /*
          * Sleep until a signal is received, or until a poll is forced by
@@ -354,13 +258,9 @@ pgarch_MainLoop(void)
                                WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                                timeout * 1000L,
                                WAIT_EVENT_ARCHIVER_MAIN);
-                if (rc & WL_TIMEOUT)
-                    wakened = true;
                 if (rc & WL_POSTMASTER_DEATH)
                     time_to_stop = true;
             }
-            else
-                wakened = true;
         }
 
         /*
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 959e3b8873..b811c961a6 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -555,6 +555,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1800,7 +1801,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -3054,7 +3055,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3189,20 +3190,16 @@ reaper(SIGNAL_ARGS)
         }
 
         /*
-         * Was it the archiver?  If so, just try to start a new one; no need
-         * to force reset of the rest of the system.  (If fail, we'll try
-         * again in future cycles of the main loop.).  Unless we were waiting
-         * for it to shut down; don't restart it in that case, and
-         * PostmasterStateMachine() will advance to the next shutdown step.
+         * Was it the archiver?  Normal exit can be ignored; we'll start a new
+         * one at the next iteration of the postmaster's main loop, if
+         * necessary. Any other exit condition is treated as a crash.
          */
         if (pid == PgArchPID)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3450,7 +3447,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3655,6 +3652,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3951,6 +3960,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5230,7 +5240,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5275,16 +5285,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (StartWorkerNeeded || HaveCrashedWorker)
         maybe_start_bgworkers();
 
-    if (CheckPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER) &&
-        PgArchPID != 0)
-    {
-        /*
-         * Send SIGUSR1 to archiver process, to wake it up and begin archiving
-         * next WAL file.
-         */
-        signal_child(PgArchPID, SIGUSR1);
-    }
-
     /* Tell syslogger to rotate logfile if requested */
     if (SysLoggerPID != 0)
     {
@@ -5526,6 +5526,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 88566bd9fa..746bed773e 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -181,6 +181,7 @@ InitProcGlobal(void)
     ProcGlobal->startupBufferPinWaitBufId = -1;
     ProcGlobal->walwriterLatch = NULL;
     ProcGlobal->checkpointerLatch = NULL;
+    ProcGlobal->archiverLatch = NULL;
     pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
     pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e71..65443063e8 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -339,6 +339,9 @@ extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern void XLogArchiveWakeupStart(void);
+extern void XLogArchiveWakeupEnd(void);
+extern void XLogArchiveWakeup(void);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool PromoteIsTriggered(void);
diff --git a/src/include/access/xlogarchive.h b/src/include/access/xlogarchive.h
index 1c67de2ede..54ce0b97d7 100644
--- a/src/include/access/xlogarchive.h
+++ b/src/include/access/xlogarchive.h
@@ -25,6 +25,7 @@ extern void ExecuteRecoveryCommand(const char *command, const char *commandName,
 extern void KeepFileRestoredFromArchive(const char *path, const char *xlogfname);
 extern void XLogArchiveNotify(const char *xlog);
 extern void XLogArchiveNotifySeg(XLogSegNo segno);
+extern void XLogArchiveWakeup(void);
 extern void XLogArchiveForceDone(const char *xlog);
 extern bool XLogArchiveCheckDone(const char *xlog);
 extern bool XLogArchiveIsBusy(const char *xlog);
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 72e3352398..bba002ad24 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -418,6 +418,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -430,6 +431,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index b3200874ca..e3ffc63f14 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 56c5ec4481..c691acf8cd 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -34,7 +34,6 @@ typedef enum
 {
     PMSIGNAL_RECOVERY_STARTED,    /* recovery has started */
     PMSIGNAL_BEGIN_HOT_STANDBY, /* begin Hot Standby */
-    PMSIGNAL_WAKEN_ARCHIVER,    /* send a NOTIFY signal to xlog archiver */
     PMSIGNAL_ROTATE_LOGFILE,    /* send SIGUSR1 to syslogger to rotate logfile */
     PMSIGNAL_START_AUTOVAC_LAUNCHER,    /* start an autovacuum launcher */
     PMSIGNAL_START_AUTOVAC_WORKER,    /* start an autovacuum worker */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 9c9a50ae45..de20520b8c 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -345,6 +345,8 @@ typedef struct PROC_HDR
     int            startupProcPid;
     /* Buffer id of the buffer that Startup process waits for pin on, or -1 */
     int            startupBufferPinWaitBufId;
+    /* Archiver process's latch */
+    Latch       *archiverLatch;
 } PROC_HDR;
 
 extern PGDLLIMPORT PROC_HDR *ProcGlobal;
-- 
2.18.4

From ca5fdad4f8e883c89d3b58878aeb60ebfc91b59e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:14 +0900
Subject: [PATCH v39 4/7] Shared-memory based stats collector

Previously, activity statistics were collected via sockets and shared
among backends through files. Such files can reach tens of megabytes
and are rewritten as often as once per second; the stats collector
serializes that data and every backend then deserializes it
periodically. To avoid that cost, this patch places activity
statistics in shared memory. Each backend accumulates statistics
locally and tries to move them into the shared statistics at every
transaction end, but at intervals no shorter than 10 seconds. Until
60 seconds have elapsed since the last flush to shared stats, a lock
failure postpones the flush, to alleviate lock contention that would
slow down transactions. After that, the flush waits for locks so that
the shared statistics do not get stale.
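
The flush timing described above can be sketched as follows. This is a
minimal illustration and not code from the patch: the names
(choose_flush_mode, flush_at_xact_end, StatsState, the two interval
macros) are hypothetical, and the lock_available flag merely stands in
for the result of a conditional lock acquisition.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constants mirroring the intervals in the commit message. */
#define MIN_FLUSH_INTERVAL_MS   10000   /* never flush more often than this */
#define MAX_FLUSH_INTERVAL_MS   60000   /* past this, wait for the locks */

typedef enum
{
    FLUSH_SKIP,                 /* too soon; keep accumulating locally */
    FLUSH_NOWAIT,               /* try the lock; postpone on failure */
    FLUSH_WAIT                  /* block on the lock to avoid stale stats */
} FlushMode;

/* Hypothetical per-backend state: local counters and the shared total. */
typedef struct
{
    int64_t     last_flush_ms;  /* time of the last successful flush */
    long        pending;        /* locally accumulated, not yet shared */
    long        shared;         /* stand-in for the shared-memory counter */
} StatsState;

/* Decide how hard to try, based on time elapsed since the last flush. */
static FlushMode
choose_flush_mode(int64_t now_ms, int64_t last_flush_ms)
{
    int64_t     elapsed = now_ms - last_flush_ms;

    if (elapsed < MIN_FLUSH_INTERVAL_MS)
        return FLUSH_SKIP;
    if (elapsed < MAX_FLUSH_INTERVAL_MS)
        return FLUSH_NOWAIT;
    return FLUSH_WAIT;
}

/*
 * Called at transaction end.  lock_available simulates whether a
 * conditional lock attempt would succeed.  Returns true if the pending
 * counters were moved to the shared total.
 */
static bool
flush_at_xact_end(StatsState *st, int64_t now_ms, bool lock_available)
{
    switch (choose_flush_mode(now_ms, st->last_flush_ms))
    {
        case FLUSH_SKIP:
            return false;
        case FLUSH_NOWAIT:
            if (!lock_available)
                return false;   /* postpone; retry at a later xact end */
            break;
        case FLUSH_WAIT:
            break;              /* real code would block for the lock here */
    }

    st->shared += st->pending;
    st->pending = 0;
    st->last_flush_ms = now_ms;
    return true;
}
```

In this sketch a busy lock only delays the flush while the data is less
than 60 seconds old; beyond that the backend is assumed to wait, which
is what bounds the staleness of the shared counters.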
---
 src/backend/access/heap/heapam_handler.c      |    4 +-
 src/backend/access/heap/vacuumlazy.c          |    2 +-
 src/backend/access/transam/xlog.c             |    6 +-
 src/backend/catalog/index.c                   |   24 +-
 src/backend/commands/analyze.c                |    8 +-
 src/backend/commands/dbcommands.c             |    2 +-
 src/backend/commands/matview.c                |    8 +-
 src/backend/commands/vacuum.c                 |    4 +-
 src/backend/postmaster/autovacuum.c           |   74 +-
 src/backend/postmaster/bgwriter.c             |    4 +-
 src/backend/postmaster/checkpointer.c         |   26 +-
 src/backend/postmaster/interrupt.c            |    5 +-
 src/backend/postmaster/pgarch.c               |   12 +-
 src/backend/postmaster/pgstat.c               | 5683 +++++++----------
 src/backend/postmaster/postmaster.c           |   88 +-
 src/backend/replication/basebackup.c          |    4 +-
 src/backend/storage/buffer/bufmgr.c           |    8 +-
 src/backend/storage/ipc/ipci.c                |    2 +
 src/backend/storage/lmgr/lwlock.c             |    4 +-
 src/backend/storage/lmgr/lwlocknames.txt      |    1 +
 src/backend/storage/smgr/smgr.c               |    4 +-
 src/backend/tcop/postgres.c                   |   37 +-
 src/backend/utils/adt/pgstatfuncs.c           |   85 +-
 src/backend/utils/cache/relcache.c            |    5 +
 src/backend/utils/init/globals.c              |    1 +
 src/backend/utils/init/miscinit.c             |    3 -
 src/backend/utils/init/postinit.c             |   11 +
 src/backend/utils/misc/guc.c                  |   14 +-
 src/backend/utils/misc/postgresql.conf.sample |    2 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |    2 +-
 src/include/miscadmin.h                       |    3 +-
 src/include/pgstat.h                          |  656 +-
 src/include/storage/lwlock.h                  |    1 +
 src/include/utils/guc_tables.h                |    2 +-
 src/include/utils/timeout.h                   |    1 +
 35 files changed, 2600 insertions(+), 4196 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index dcaea7135f..49df584a9e 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1061,8 +1061,8 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
                  * our own.  In this case we should count and sample the row,
                  * to accommodate users who load a table and analyze it in one
                  * transaction.  (pgstat_report_analyze has to adjust the
-                 * numbers we send to the stats collector to make this come
-                 * out right.)
+                 * numbers we report to the activity stats facility to make
+                 * this come out right.)
                  */
                 if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(targtuple->t_data)))
                 {
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 4f2f38168d..3cb6e20ed5 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -599,7 +599,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
                         new_min_multi,
                         false);
 
-    /* report results to the stats collector, too */
+    /* report results to the activity stats facility, too */
     pgstat_report_vacuum(RelationGetRelid(onerel),
                          onerel->rd_rel->relisshared,
                          Max(new_live_tuples, 0),
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8f11b1b9de..6c120873ff 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2196,7 +2196,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
                     WriteRqst.Flush = 0;
                     XLogWrite(WriteRqst, false);
                     LWLockRelease(WALWriteLock);
-                    WalStats.m_wal_buffers_full++;
+                    WalStats.wal_buffers_full++;
                     TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
                 }
                 /* Re-acquire WALBufMappingLock and retry */
@@ -8579,9 +8579,9 @@ LogCheckpointEnd(bool restartpoint)
                         &sync_secs, &sync_usecs);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time +=
+    CheckPointerStats.checkpoint_write_time +=
         write_secs * 1000 + write_usecs / 1000;
-    BgWriterStats.m_checkpoint_sync_time +=
+    CheckPointerStats.checkpoint_sync_time +=
         sync_secs * 1000 + sync_usecs / 1000;
 
     /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 0974f3e23a..9507fb8210 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1688,28 +1688,10 @@ index_concurrently_swap(Oid newIndexId, Oid oldIndexId, const char *oldName)
 
     /*
      * Copy over statistics from old to new index
+     * The data will be flushed by the next pgstat_report_stat()
+     * call.
      */
-    {
-        PgStat_StatTabEntry *tabentry;
-
-        tabentry = pgstat_fetch_stat_tabentry(oldIndexId);
-        if (tabentry)
-        {
-            if (newClassRel->pgstat_info)
-            {
-                newClassRel->pgstat_info->t_counts.t_numscans = tabentry->numscans;
-                newClassRel->pgstat_info->t_counts.t_tuples_returned = tabentry->tuples_returned;
-                newClassRel->pgstat_info->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_hit = tabentry->blocks_hit;
-
-                /*
-                 * The data will be sent by the next pgstat_report_stat()
-                 * call.
-                 */
-            }
-        }
-    }
+    pgstat_copy_index_counters(oldIndexId, newClassRel->pgstat_info);
 
     /* Close relations */
     table_close(pg_class, RowExclusiveLock);
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8af12b5c6b..a7e787d9d1 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -644,10 +644,10 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
     }
 
     /*
-     * Report ANALYZE to the stats collector, too.  However, if doing
-     * inherited stats we shouldn't report, because the stats collector only
-     * tracks per-table stats.  Reset the changes_since_analyze counter only
-     * if we analyzed all columns; otherwise, there is still work for
+     * Report ANALYZE to the activity stats facility, too.  However, if doing
+     * inherited stats we shouldn't report, because the activity stats facility
+     * only tracks per-table stats.  Reset the changes_since_analyze counter
+     * only if we analyzed all columns; otherwise, there is still work for
      * auto-analyze to do.
      */
     if (!inh)
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index f27c3fe8c1..de0c749570 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -971,7 +971,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
     DropDatabaseBuffers(db_id);
 
     /*
-     * Tell the stats collector to forget it immediately, too.
+     * Tell the activity stats facility to forget it immediately, too.
      */
     pgstat_drop_database(db_id);
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index f80a9e96a9..e7ccc6eba7 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -336,10 +336,10 @@ ExecRefreshMatView(RefreshMatViewStmt *stmt, const char *queryString,
         refresh_by_heap_swap(matviewOid, OIDNewHeap, relpersistence);
 
         /*
-         * Inform stats collector about our activity: basically, we truncated
-         * the matview and inserted some new data.  (The concurrent code path
-         * above doesn't need to worry about this because the inserts and
-         * deletes it issues get counted by lower-level code.)
+         * Inform the activity stats facility about our activity: basically,
+         * we truncated the matview and inserted some new data.  (The
+         * concurrent code path above doesn't need to worry about this because
+         * the inserts and deletes it issues get counted by lower-level code.)
          */
         pgstat_count_truncate(matviewRel);
         if (!stmt->skipData)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ddeec870d8..c5477ff567 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -319,8 +319,8 @@ vacuum(List *relations, VacuumParams *params,
                  errmsg("VACUUM option DISABLE_PAGE_SKIPPING cannot be used with FULL")));
 
     /*
-     * Send info about dead objects to the statistics collector, unless we are
-     * in autovacuum --- autovacuum.c does this for itself.
+     * Send info about dead objects to the activity statistics facility, unless
+     * we are in autovacuum --- autovacuum.c does this for itself.
      */
     if ((params->options & VACOPT_VACUUM) && !IsAutoVacuumWorkerProcess())
         pgstat_vacuum_stat();
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 2cef56f115..773b82be3b 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -338,9 +338,6 @@ static void autovacuum_do_vac_analyze(autovac_table *tab,
                                       BufferAccessStrategy bstrategy);
 static AutoVacOpts *extract_autovac_opts(HeapTuple tup,
                                          TupleDesc pg_class_desc);
-static PgStat_StatTabEntry *get_pgstat_tabentry_relid(Oid relid, bool isshared,
-                                                      PgStat_StatDBEntry *shared,
-                                                      PgStat_StatDBEntry *dbentry);
 static void perform_work_item(AutoVacuumWorkItem *workitem);
 static void autovac_report_activity(autovac_table *tab);
 static void autovac_report_workitem(AutoVacuumWorkItem *workitem,
@@ -1680,12 +1677,12 @@ AutoVacWorkerMain(int argc, char *argv[])
         char        dbname[NAMEDATALEN];
 
         /*
-         * Report autovac startup to the stats collector.  We deliberately do
-         * this before InitPostgres, so that the last_autovac_time will get
-         * updated even if the connection attempt fails.  This is to prevent
-         * autovac from getting "stuck" repeatedly selecting an unopenable
-         * database, rather than making any progress on stuff it can connect
-         * to.
+         * Report autovac startup to the activity stats facility.  We
+         * deliberately do this before InitPostgres, so that the
+         * last_autovac_time will get updated even if the connection attempt
+         * fails.  This is to prevent autovac from getting "stuck" repeatedly
+         * selecting an unopenable database, rather than making any progress on
+         * stuff it can connect to.
          */
         pgstat_report_autovac(dbid);
 
@@ -1957,8 +1954,6 @@ do_autovacuum(void)
     HASHCTL        ctl;
     HTAB       *table_toast_map;
     ListCell   *volatile cell;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     BufferAccessStrategy bstrategy;
     ScanKeyData key;
     TupleDesc    pg_class_desc;
@@ -1977,17 +1972,11 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /*
-     * may be NULL if we couldn't find an entry (only happens if we are
-     * forcing a vacuum for anti-wrap purposes).
-     */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
 
     /*
-     * Clean up any dead statistics collector entries for this DB. We always
+     * Clean up any dead activity statistics entries for this DB. We always
      * want to do this exactly once per DB-processing cycle, even if we find
      * nothing worth vacuuming in the database.
      */
@@ -2030,9 +2019,6 @@ do_autovacuum(void)
     /* StartTransactionCommand changed elsewhere */
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-
     classRel = table_open(RelationRelationId, AccessShareLock);
 
     /* create a copy so we can use it after closing pg_class */
@@ -2111,8 +2097,8 @@ do_autovacuum(void)
 
         /* Fetch reloptions and the pgstat entry for this table */
         relopts = extract_autovac_opts(tuple, pg_class_desc);
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                       relid);
 
         /* Check if it needs vacuum or analyze */
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
@@ -2195,8 +2181,8 @@ do_autovacuum(void)
         }
 
         /* Fetch the pgstat entry for this table */
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                       relid);
 
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
@@ -2755,29 +2741,6 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
     return av;
 }
 
-/*
- * get_pgstat_tabentry_relid
- *
- * Fetch the pgstat entry of a table, either local to a database or shared.
- */
-static PgStat_StatTabEntry *
-get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
-                          PgStat_StatDBEntry *dbentry)
-{
-    PgStat_StatTabEntry *tabentry = NULL;
-
-    if (isshared)
-    {
-        if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
-    }
-    else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
-
-    return tabentry;
-}
 
 /*
  * table_recheck_autovac
@@ -2798,17 +2761,12 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     bool        doanalyze;
     autovac_table *tab = NULL;
     PgStat_StatTabEntry *tabentry;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     bool        wraparound;
     AutoVacOpts *avopts;
 
     /* use fresh stats */
     autovac_refresh_stats();
 
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* fetch the relation's relcache entry */
     classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
     if (!HeapTupleIsValid(classTup))
@@ -2832,8 +2790,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     }
 
     /* fetch the pgstat table entry */
-    tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                         shared, dbentry);
+    tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                   relid);
 
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
@@ -2955,7 +2913,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
  *
  * For analyze, the analysis done is that the number of tuples inserted,
  * deleted and updated since the last analyze exceeds a threshold calculated
- * in the same fashion as above.  Note that the collector actually stores
+ * in the same fashion as above.  Note that the activity statistics stores
  * the number of tuples (both live and dead) that there were as of the last
  * analyze.  This is asymmetric to the VACUUM case.
  *
@@ -2965,8 +2923,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
  *
  * A table whose autovacuum_enabled option is false is
  * automatically skipped (unless we have to vacuum it due to freeze_max_age).
- * Thus autovacuum can be disabled for specific tables. Also, when the stats
- * collector does not have data about a table, it will be skipped.
+ * Thus autovacuum can be disabled for specific tables. Also, when the activity
+ * statistics facility has no data about a table, it will be skipped.
  *
  * A table whose vac_base_thresh value is < 0 takes the base value from the
  * autovacuum_vacuum_threshold GUC variable.  Similarly, a vac_scale_factor
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index a7afa758b6..b075e85839 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -244,9 +244,9 @@ BackgroundWriterMain(void)
         can_hibernate = BgBufferSync(&wb_context);
 
         /*
-         * Send off activity statistics to the stats collector
+         * Send off activity statistics to the activity stats facility
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 429c8010ef..fe6937f8af 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -360,7 +360,7 @@ CheckpointerMain(void)
         if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
         {
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            CheckPointerStats.requested_checkpoints++;
         }
 
         /*
@@ -374,7 +374,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                CheckPointerStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -495,17 +495,11 @@ CheckpointerMain(void)
         /* Check for archive_timeout and switch xlog files if necessary. */
         CheckArchiveTimeout();
 
-        /*
-         * Send off activity statistics to the stats collector.  (The reason
-         * why we re-use bgwriter-related code for this is that the bgwriter
-         * and checkpointer used to be just one process.  It's probably not
-         * worth the trouble to split the stats support into two independent
-         * stats message types.)
-         */
-        pgstat_send_bgwriter();
+        /* Send off activity statistics to the activity stats facility. */
+        pgstat_report_checkpointer();
 
         /* Send WAL statistics to the stats collector. */
-        pgstat_send_wal();
+        pgstat_report_wal();
 
         /*
          * If any checkpoint flags have been set, redo the loop to handle the
@@ -711,9 +705,9 @@ CheckpointWriteDelay(int flags, double progress)
         CheckArchiveTimeout();
 
         /*
-         * Report interim activity statistics to the stats collector.
+         * Report interim activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_report_checkpointer();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1258,8 +1252,10 @@ AbsorbSyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    CheckPointerStats.buf_written_backend
+        += CheckpointerShmem->num_backend_writes;
+    CheckPointerStats.buf_fsync_backend
+        += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/interrupt.c b/src/backend/postmaster/interrupt.c
index 3d02439b79..53844eb8bb 100644
--- a/src/backend/postmaster/interrupt.c
+++ b/src/backend/postmaster/interrupt.c
@@ -93,9 +93,8 @@ SignalHandlerForCrashExit(SIGNAL_ARGS)
  * shut down and exit.
  *
  * Typically, this handler would be used for SIGTERM, but some procesess use
- * other signals. In particular, the checkpointer exits on SIGUSR2, the
- * stats collector on SIGQUIT, and the WAL writer exits on either SIGINT
- * or SIGTERM.
+ * other signals. In particular, the checkpointer exits on SIGUSR2, and the WAL
+ * writer exits on either SIGINT or SIGTERM.
  *
  * ShutdownRequestPending should be checked at a convenient place within the
  * main loop, or else the main loop should call HandleMainLoopInterrupts.
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index e3a520def9..6d88c65d5f 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -372,20 +372,20 @@ pgarch_ArchiverCopyLoop(void)
                 pgarch_archiveDone(xlog);
 
                 /*
-                 * Tell the collector about the WAL file that we successfully
-                 * archived
+                 * Tell the activity statistics facility about the WAL file
+                 * that we successfully archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_report_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
             else
             {
                 /*
-                 * Tell the collector about the WAL file that we failed to
-                 * archive
+                 * Tell the activity statistics facility about the WAL file
+                 * that we failed to archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_report_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 5294c78549..1a5c1ca24c 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,22 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Activity Statistics facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects activity statistics, e.g. per-table access statistics, of
+ *  all backends in shared memory. The activity numbers are first stored
+ *  locally in each process, then flushed to shared memory at commit
+ *  time or by idle-timeout.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ * To avoid congestion on the shared memory, shared stats are updated no more
+ * often than once per PGSTAT_MIN_INTERVAL (10000ms). If a lock failure leaves
+ * local numbers unflushed, retry at intervals that start at
+ * PGSTAT_RETRY_MIN_INTERVAL (1000ms) and double at every retry. Finally, force
+ * an update PGSTAT_MAX_INTERVAL (60000ms) after the first attempt.
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the activity statistics facility creates the
+ *  area and loads the stored stats file, if any. At shutdown, the last process
+ *  writes the shared stats to the file and destroys the area before exiting.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -19,18 +26,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -40,32 +35,25 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
+#include "common/hashfn.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/fork_process.h"
 #include "postmaster/interrupt.h"
 #include "postmaster/postmaster.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
+#include "storage/condition_variable.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
 #include "utils/timestamp.h"
 
@@ -73,35 +61,20 @@
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
+#define PGSTAT_MIN_INTERVAL            10000    /* Minimum interval of stats data
+                                             * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_RETRY_MIN_INTERVAL    1000 /* Initial retry interval after
+                                          * PGSTAT_MIN_INTERVAL */
 
+#define PGSTAT_MAX_INTERVAL            60000    /* Longest interval of stats data
+                                             * updates */
 
 /* ----------
- * The initial size hints for the hash tables used in the collector.
+ * The initial size hints for the hash tables used in the activity statistics.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
-#define PGSTAT_TAB_HASH_SIZE    512
+#define PGSTAT_TABLE_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
 
@@ -116,7 +89,6 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
-
 /* ----------
  * GUC parameters
  * ----------
@@ -131,17 +103,11 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used, but will be removed with GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
-/*
- * BgWriter and WAL global statistics counters.
- * Stored directly in a stats message structure so they can be sent
- * without needing to copy things around.  We assume these init to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
-PgStat_MsgWal WalStats;
-
 /*
  * List of SLRU names that we keep stats for.  There is no central registry of
  * SLRUs, so we use this fixed list instead.  The "other" entry is used for
@@ -160,73 +126,242 @@ static const char *const slru_names[] = {
 
 #define SLRU_NUM_ELEMENTS    lengthof(slru_names)
 
+/* struct for shared SLRU stats */
+typedef struct PgStatSharedSLRUStats
+{
+    PgStat_SLRUStats        entry[SLRU_NUM_ELEMENTS];
+    LWLock                    lock;
+    pg_atomic_uint32        changecount;
+} PgStatSharedSLRUStats;
+
+StaticAssertDecl(sizeof(TimestampTz) == sizeof(pg_atomic_uint64),
+                 "size of pg_atomic_uint64 doesn't match TimestampTz");
+
+typedef struct StatsShmemStruct
+{
+    dsa_handle    stats_dsa_handle;    /* handle for stats data area */
+    dshash_table_handle hash_handle;    /* shared dbstat hash */
+    int            refcount;            /* # of processes attached to the
+                                     * shared stats memory */
+    /* Global stats structs */
+    PgStat_Archiver            archiver_stats;
+    pg_atomic_uint32        archiver_changecount;
+    PgStat_BgWriter            bgwriter_stats;
+    pg_atomic_uint32        bgwriter_changecount;
+    PgStat_CheckPointer        checkpointer_stats;
+    pg_atomic_uint32        checkpointer_changecount;
+    PgStat_Wal                wal_stats;
+    LWLock                    wal_stats_lock;
+    PgStatSharedSLRUStats    slru_stats;
+    pg_atomic_uint32        slru_changecount;
+    pg_atomic_uint64        stats_timestamp;
+
+    /* Reset offsets, protected by StatsLock */
+    PgStat_Archiver            archiver_reset_offset;
+    PgStat_BgWriter            bgwriter_reset_offset;
+    PgStat_CheckPointer        checkpointer_reset_offset;
+
+    /* file read/write protection */
+    bool                    attach_holdoff;
+    ConditionVariable        holdoff_cv;
+
+    pg_atomic_uint64        gc_count; /* # of entries deleted. not
+                                            * protected by StatsLock */
+} StatsShmemStruct;
+
+/* BgWriter global statistics counters */
+PgStat_BgWriter BgWriterStats = {0};
+
+/* CheckPointer global statistics counters */
+PgStat_CheckPointer CheckPointerStats = {0};
+
+/* WAL global statistics counters */
+PgStat_Wal WalStats = {0};
+
 /*
- * SLRU statistics counts waiting to be sent to the collector.  These are
- * stored directly in stats message format so they can be sent without needing
- * to copy things around.  We assume this variable inits to zeroes.  Entries
- * are one-to-one with slru_names[].
+ * XXXX: always try to flush WAL stats. We don't want to manipulate another
+ * counter during XLogInsert so we don't have an efficient shortcut to know
+ * whether any counter gets incremented.
  */
-static PgStat_MsgSLRU SLRUStats[SLRU_NUM_ELEMENTS];
+static inline bool
+walstats_pending(void)
+{
+    static const PgStat_Wal all_zeroes;
+
+    return memcmp(&WalStats, &all_zeroes, sizeof(PgStat_Wal)) != 0;
+}
+
+/*
+ * SLRU statistics counts waiting to be written to the shared activity
+ * statistics.  We assume this variable inits to zeroes.  Entries are
+ * one-to-one with slru_names[].
+ * Changes of SLRU counters are reported within critical sections so we use
+ * static memory in order to avoid memory allocation.
+ */
+static PgStat_SLRUStats local_SLRUStats[SLRU_NUM_ELEMENTS];
+static bool     have_slrustats = false;
 
 /* ----------
  * Local data
  * ----------
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
-
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
+/* backend-lifetime storage */
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
 
 /*
- * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
- *
- * NOTE: once allocated, TabStatusArray structures are never moved or deleted
- * for the life of the backend.  Also, we zero out the t_id fields of the
- * contained PgStat_TableStatus structs whenever they are not actively in use.
- * This allows relcache pgstat_info pointers to be treated as long-lived data,
- * avoiding repeated searches in pgstat_initstats() when a relation is
- * repeatedly opened during a transaction.
+ * Types to define shared statistics structure.
+ *
+ * Per-object statistics are stored in a "shared stats" entry, a struct in
+ * DSA-allocated memory that begins with a header common to all object types.
+ * Each shared stats entry is referenced from a dshash via a dsa_pointer. This
+ * layout keeps the shared stats from moving when the dshash is resized, lets
+ * a backend point to shared stats entries via a native pointer, and allows
+ * locking at stats-entry level. The per-entry locking reduces lock contention
+ * compared to the partition locks of dshash. A backend accumulates stats
+ * numbers in a stats entry in local memory, then flushes the numbers to the
+ * shared stats entries, basically at transaction end.
+ *
+ * Each stat entry type has a fixed member PgStat_StatEntryHeader as the first
+ * element.
+ *
+ * Shared stats are stored as:
+ *
+ * dshash pgStatSharedHash
+ *    -> PgStatHashEntry                    (dshash entry)
+ *      (dsa_pointer)-> PgStat_Stat*Entry    (dsa memory block)
+ *
+ * Shared stats entries are directly pointed from pgstat_localhash hash:
+ *
+ * pgstat_localhash pgStatEntHash
+ *    -> PgStatLocalHashEntry                (equivalent of PgStatHashEntry)
+ *      (native pointer)-> PgStat_Stat*Entry (dsa memory block)
+ *
+ * Local stats that are waiting to be flushed to shared stats are stored as:
+ *
+ * pgstat_localhash pgStatLocalHash
+ *    -> PgStatLocalHashEntry                 (local hash entry)
+ *      (native pointer)-> PgStat_Stat*Entry/TableStatus (palloc'ed memory)
  */
-#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
 
-typedef struct TabStatusArray
+/* The types of statistics entries */
+typedef enum PgStatTypes
 {
-    struct TabStatusArray *tsa_next;    /* link to next array, if any */
-    int            tsa_used;        /* # entries currently used */
-    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
-} TabStatusArray;
-
-static TabStatusArray *pgStatTabList = NULL;
+    PGSTAT_TYPE_DB,            /* database-wide statistics */
+    PGSTAT_TYPE_TABLE,        /* per-table statistics */
+    PGSTAT_TYPE_FUNCTION    /* per-function statistics */
+}            PgStatTypes;
 
 /*
- * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
+ * entry body size lookup table of shared statistics entries corresponding to
+ * PgStatTypes
  */
-typedef struct TabStatHashEntry
+static const size_t pgstat_sharedentsize[] =
 {
-    Oid            t_id;
-    PgStat_TableStatus *tsa_entry;
-} TabStatHashEntry;
+    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_StatTabEntry),/* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_StatFuncEntry)/* PGSTAT_TYPE_FUNCTION */
+};
 
-/*
- * Hash table for O(1) t_id -> tsa_entry lookup
- */
-static HTAB *pgStatTabHash = NULL;
+/* Ditto for local statistics entries */
+static const size_t pgstat_localentsize[] =
+{
+    sizeof(PgStat_StatDBEntry),            /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_TableStatus),            /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_BackendFunctionEntry) /* PGSTAT_TYPE_FUNCTION */
+};
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * We should avoid overwriting the header part of a shared entry. Use these
+ * macros to know what portion of the struct is to be written or read.
+ * PGSTAT_SHENT_BODY returns a slightly smaller address than the actual address
+ * of the next member, but that doesn't matter.
  */
-static HTAB *pgStatFunctions = NULL;
+#define PGSTAT_SHENT_BODY(e) (((char *)(e)) + sizeof(PgStat_StatEntryHeader))
+#define PGSTAT_SHENT_BODY_LEN(t) \
+    (pgstat_sharedentsize[t] - sizeof(PgStat_StatEntryHeader))
+
+/* struct for shared statistics hash entry key. */
+typedef struct PgStatHashKey
+{
+    PgStatTypes type;        /* statistics entry type */
+    Oid            databaseid;    /* database ID. InvalidOid for shared objects. */
+    Oid            objectid;    /* object ID, either table or function. */
+} PgStatHashKey;
+
+/* struct for shared statistics hash entry */
+typedef struct PgStatHashEntry
+{
+    PgStatHashKey    key; /* hash key */
+    dsa_pointer        body;/* pointer to shared stats in PgStat_StatEntryHeader */
+} PgStatHashEntry;
+
+/* struct for shared statistics local hash entry. */
+typedef struct PgStatLocalHashEntry
+{
+    PgStatHashKey            key;    /* hash key */
+    char                    status;    /* for simplehash use */
+    PgStat_StatEntryHeader *body;    /* pointer to stats body in local heap */
+    dsa_pointer                dsapointer; /* dsa pointer of body */
+} PgStatLocalHashEntry;
+
+/* parameter for the shared hash */
+static const dshash_parameters dsh_params = {
+    sizeof(PgStatHashKey),
+    sizeof(PgStatHashEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+
+/* define hashtable for local hashes */
+#define SH_PREFIX pgstat_localhash
+#define SH_ELEMENT_TYPE PgStatLocalHashEntry
+#define SH_KEY_TYPE PgStatHashKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+    hash_bytes((unsigned char *)&key, sizeof(PgStatHashKey))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(PgStatHashKey)) == 0)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/* The shared hash to index activity stats entries. */
+static dshash_table *pgStatSharedHash = NULL;
 
 /*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
+ * The local cache to index shared stats entries.
+ *
+ * This is a local hash to store native pointers to shared hash
+ * entries. pgStatEntHashAge is copied from StatsShmem->gc_count at creation
+ * and garbage collection.
  */
-static bool have_function_stats = false;
+static pgstat_localhash_hash *pgStatEntHash = NULL;
+static int     pgStatEntHashAge = 0;        /* cache age of pgStatEntHash */
+
+/* Local stats numbers are stored here. */
+static pgstat_localhash_hash *pgStatLocalHash = NULL;
+
+/* entry type for oid hash */
+typedef struct pgstat_oident
+{
+    Oid oid;
+    char status;
+} pgstat_oident;
+
+/* Define hashtable for OID hashes. */
+StaticAssertDecl(sizeof(Oid) == 4, "oid is not compatible with uint32");
+#define SH_PREFIX pgstat_oid
+#define SH_ELEMENT_TYPE pgstat_oident
+#define SH_KEY_TYPE Oid
+#define SH_KEY oid
+#define SH_HASH_KEY(tb, key) hash_bytes_uint32(key)
+#define SH_EQUAL(tb, a, b) (a == b)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
 
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
@@ -263,11 +398,8 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -276,21 +408,9 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Make our own memory context to make it easy to track memory usage.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-static PgStat_WalStats walStats;
-static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
+MemoryContext    pgStatCacheContext = NULL;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -299,37 +419,55 @@ static List *pending_write_requests = NIL;
  */
 static instr_time total_func_time;
 
+/* Simple caching feature for pgstatfuncs */
+static PgStatHashKey    stathashkey_zero = {0};
+static PgStatHashKey        cached_dbent_key = {0};
+static PgStat_StatDBEntry    cached_dbent;
+static PgStatHashKey        cached_tabent_key = {0};
+static PgStat_StatTabEntry    cached_tabent;
+static PgStatHashKey        cached_funcent_key = {0};
+static PgStat_StatFuncEntry    cached_funcent;
+
+static PgStat_Archiver        cached_archiverstats;
+static bool                    cached_archiverstats_is_valid = false;
+static PgStat_BgWriter        cached_bgwriterstats;
+static bool                    cached_bgwriterstats_is_valid = false;
+static PgStat_CheckPointer    cached_checkpointerstats;
+static bool                    cached_checkpointerstats_is_valid = false;
+static PgStat_Wal            cached_walstats;
+static bool                    cached_walstats_is_valid = false;
+static PgStat_SLRUStats        cached_slrustats;
+static bool                    cached_slrustats_is_valid = false;
 
 /* ----------
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgstat_beshutdown_hook(int code, Datum arg);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                                                 Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStat_StatDBEntry *get_local_dbstat_entry(Oid dbid);
+static PgStat_TableStatus *get_local_tabstat_entry(Oid rel_id, bool isshared);
+
+static void pgstat_write_statsfile(void);
+
+static void pgstat_read_statsfile(void);
 static void pgstat_read_current_status(void);
 
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
+static PgStat_StatEntryHeader * get_stat_entry(PgStatTypes type, Oid dbid,
+                                               Oid objid, bool nowait,
+                                               bool create, bool *found);
 
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
-static void pgstat_send_slru(void);
-static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
+static bool flush_tabstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_funcstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_dbstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_walstat(bool nowait);
+static bool flush_slrustat(bool nowait);
+static void delete_current_stats_entry(dshash_seq_status *hstat);
+static PgStat_StatEntryHeader * get_local_stat_entry(PgStatTypes type, Oid dbid,
+                                                     Oid objid, bool create,
+                                                     bool *found);
 
-static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
+static pgstat_oid_hash *collect_oids(Oid catalogid, AttrNumber anum_oid);
 
 static void pgstat_setup_memcxt(void);
 
@@ -339,487 +477,629 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
-static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute shared memory space needed for activity statistics
+ */
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+/*
+ * StatsShmemInit - initialize during shared-memory creation
  */
 void
-pgstat_init(void)
+StatsShmemInit(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
+    bool        found;
+
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),    &found);
+
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+        ConditionVariableInit(&StatsShmem->holdoff_cv);
+        pg_atomic_init_u32(&StatsShmem->archiver_changecount, 0);
+        pg_atomic_init_u32(&StatsShmem->bgwriter_changecount, 0);
+        pg_atomic_init_u32(&StatsShmem->checkpointer_changecount, 0);
+
+        pg_atomic_init_u64(&StatsShmem->gc_count, 0);
+
+        LWLockInitialize(&StatsShmem->wal_stats_lock, LWTRANCHE_STATS);
+    }
+}
+
+/* ----------
+ * allow_next_attacher() -
+ *
+ *  Let other processes go ahead and attach to the shared stats area.
+ * ----------
+ */
+static void
+allow_next_attacher(void)
+{
+    bool triggered = false;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (StatsShmem->attach_holdoff)
+    {
+        StatsShmem->attach_holdoff = false;
+        triggered = true;
+    }
+    LWLockRelease(StatsLock);
+
+    if (triggered)
+        ConditionVariableBroadcast(&StatsShmem->holdoff_cv);
+}
+
+/* ----------
+ * attach_shared_stats() -
+ *
+ *    Attach to the shared stats memory, creating it if it does not exist. The
+ *    first process to attach also reads the saved statistics file, if any.
+ * ---------
+ */
+static void
+attach_shared_stats(void)
+{
+    MemoryContext oldcontext;
 
     /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
+     * Don't use dsm under postmaster, or when not tracking counts.
      */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    pgstat_setup_memcxt();
+
+    if (area)
+        return;
+
+    /* stats shared memory persists for the backend lifetime */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
 
     /*
-     * Create the UDP socket for sending and receiving statistic messages
+     * The first attacher backend may still be reading the stats file, or the
+     * last detacher may be writing it. Wait for the work to finish.
      */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
+    ConditionVariablePrepareToSleep(&StatsShmem->holdoff_cv);
+    for (;;)
     {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
+        bool hold_off;
+
+        LWLockAcquire(StatsLock, LW_SHARED);
+        hold_off = StatsShmem->attach_holdoff;
+        LWLockRelease(StatsLock);
+
+        if (!hold_off)
+            break;
+
+        ConditionVariableTimedSleep(&StatsShmem->holdoff_cv, 10,
+                                    WAIT_EVENT_READING_STATS_FILE);
     }
+    ConditionVariableCancelSleep();
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
     /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
+     * The last process is responsible for writing out the stats file at exit.
+     * Maintain refcount so that an exiting process can tell whether it is
+     * the last one or not.
      */
-    for (addr = addrs; addr; addr = addr->ai_next)
+    if (StatsShmem->refcount > 0)
+        StatsShmem->refcount++;
+    else
     {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
+        /* We're the first process to attach the shared stats memory */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
 
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
+        /* Initialize shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatSharedHash = dshash_create(area, &dsh_params, 0);
 
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->hash_handle = dshash_get_hash_table_handle(pgStatSharedHash);
+        LWLockInitialize(&StatsShmem->slru_stats.lock, LWTRANCHE_STATS);
+        pg_atomic_init_u32(&StatsShmem->slru_stats.changecount, 0);
 
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        /* Block the next attacher for a while, see the comment above. */
+        StatsShmem->attach_holdoff = true;
 
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        StatsShmem->refcount = 1;
+    }
 
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    LWLockRelease(StatsLock);
 
+    if (area)
+    {
         /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
+         * We're the first attacher process; read the stats file while
+         * blocking successors.
          */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
+        Assert(StatsShmem->attach_holdoff);
+        pgstat_read_statsfile();
+        allow_next_attacher();
     }
-
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
-
-    /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
-     */
-    if (!pg_set_noblock(pgStatSock))
+    else
     {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
-    }
+        /* We're not the first one, attach existing shared area. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatSharedHash = dshash_attach(area, &dsh_params,
+                                         StatsShmem->hash_handle, 0);
 
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
     }
 
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-    /* Now that we have a long-lived socket, tell fd.c about it. */
-    ReserveExternalFD();
+    MemoryContextSwitchTo(oldcontext);
 
-    return;
+    /* don't detach automatically */
+    dsa_pin_mapping(area);
+}
 
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
+/* ----------
+ * cleanup_dropped_stats_entries() -
+ *              Clean up shared stats entries no longer used.
+ *
+ *  Shared stats entries for dropped objects may be left referenced. Clean up
+ *  our reference and drop the shared entry if needed.
+ * ----------
+ */
+static void
+cleanup_dropped_stats_entries(void)
+{
+    pgstat_localhash_iterator i;
+    PgStatLocalHashEntry   *ent;
 
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
+    if (pgStatEntHash == NULL)
+        return;
 
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
+    pgstat_localhash_start_iterate(pgStatEntHash, &i);
+    while ((ent = pgstat_localhash_iterate(pgStatEntHash, &i))
+           != NULL)
+    {
+        /*
+         * Free the shared memory chunk for the entry if we were the last
+         * referrer to a dropped entry.
+         */
+        if (pg_atomic_sub_fetch_u32(&ent->body->refcount, 1) < 1 &&
+            ent->body->dropped)
+            dsa_free(area, ent->dsapointer);
+    }
 
     /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
+     * This function is expected to be called during backend exit. So we don't
+     * bother destroying pgStatEntHash.
      */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    pgStatEntHash = NULL;
 }
 
-/*
- * subroutine for pgstat_reset_all
+/* ----------
+ * detach_shared_stats() -
+ *
+ *    Detach shared stats. Write out to file if we're the last process and told
+ *    to do so.
+ * ----------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+detach_shared_stats(bool write_file)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    bool is_last_detacher = false;
+
+    /* immediately return if useless */
+    if (!area || !IsUnderPostmaster)
+        return;
+
+    /* We shouldn't leave a reference to shared stats. */
+    Assert(pgStatEntHash == NULL);
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
+    /*
+     * If we are the last detacher, hold off the next attacher (if possible)
+     * until we finish writing the stats file.
+     */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (--StatsShmem->refcount == 0)
     {
-        int            nchars;
-        Oid            tmp_oid;
+        StatsShmem->attach_holdoff = true;
+        is_last_detacher = true;
+    }
+    LWLockRelease(StatsLock);
 
-        /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
-         */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
-
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
-
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+    if (is_last_detacher)
+    {
+        if (write_file)
+            pgstat_write_statsfile();
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+        /* allow the next attacher, if any */
+        allow_next_attacher();
     }
-    FreeDir(dir);
+
+    /*
+     * Detach the area. It is automatically destroyed when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);
+
+    area = NULL;
+    pgStatSharedHash = NULL;
+
+    /* We are going to exit. Don't bother destroying local hashes. */
+    pgStatLocalHash = NULL;
 }
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    /* standalone server doesn't use shared stats */
+    if (!IsUnderPostmaster)
+        return;
+
+    /* we must have shared stats attached */
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
+
+    /* Startup must be the only user of shared stats */
+    Assert(StatsShmem->refcount == 1);
+
+    /*
+     * We could remove the files and recreate the shared memory area directly,
+     * but for simplicity we just detach and then attach again.
+     */
+    detach_shared_stats(false);
+    attach_shared_stats();
 }
 
-#ifdef EXEC_BACKEND
 
 /*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
+ * fetch_lock_statentry - common helper function to fetch and lock a stats
+ * entry for flush_tabstat, flush_funcstat and flush_dbstat.
  */
-static pid_t
-pgstat_forkexec(void)
+static PgStat_StatEntryHeader *
+fetch_lock_statentry(PgStatTypes type, Oid dboid, Oid objid, bool nowait)
 {
-    char       *av[10];
-    int            ac = 0;
+    PgStat_StatEntryHeader *header;
+
+    /* find shared table stats entry corresponding to the local entry */
+    header = (PgStat_StatEntryHeader *)
+        get_stat_entry(type, dboid, objid, nowait, true, NULL);
 
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
+    /* skip if dshash failed to acquire lock */
+    if (header == NULL)
+        return NULL;
 
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&header->lock, LW_EXCLUSIVE))
+        return NULL;
 
-    return postmaster_forkexec(ac, av);
+    return header;
 }
-#endif                            /* EXEC_BACKEND */
 
 
-/*
- * pgstat_start() -
+/* ----------
+ * get_stat_entry() -
  *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
+ *    Get the shared stats entry for the specified type, dbid and objid.
+ *  If nowait is true, returns NULL on lock failure.
  *
- *    Returns PID of child process, or 0 if fail.
+ *  If create is true, a new entry is created if one does not exist yet.
+ *  If found is not NULL, it is set to true if an existing entry is found or
+ *  false if not.
+ *  ----------
+ */
+static PgStat_StatEntryHeader *
+get_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait, bool create,
+               bool *found)
+{
+    PgStatHashEntry           *shhashent;
+    PgStatLocalHashEntry   *lohashent;
+    PgStat_StatEntryHeader *shheader = NULL;
+    PgStatHashKey        key;
+    bool                    myfound;
+
+    key.type        = type;
+    key.databaseid     = dbid;
+    key.objectid    = objid;
+
+    if (pgStatEntHash)
+    {
+        uint64 currage;
+
+        /*
+         * pgStatEntHashAge advances much more slowly than the following loop
+         * runs, so this is expected to iterate no more than twice.
+         */
+        while (unlikely
+               (pgStatEntHashAge !=
+                (currage = pg_atomic_read_u64(&StatsShmem->gc_count))))
+        {
+            pgstat_localhash_iterator i;
+
+            /*
+             * Some entries have been dropped. Invalidate cache pointer to
+             * them.
+             */
+            pgstat_localhash_start_iterate(pgStatEntHash, &i);
+            while ((lohashent = pgstat_localhash_iterate(pgStatEntHash, &i))
+                   != NULL)
+            {
+                PgStat_StatEntryHeader *header = lohashent->body;
+                if (header->dropped)
+                {
+                    pgstat_localhash_delete(pgStatEntHash, key);
+
+                    if (pg_atomic_sub_fetch_u32(&header->refcount, 1) < 1)
+                    {
+                        /*
+                         * We're the last referrer to this entry, drop the
+                         * shared entry.
+                         */
+                        dsa_free(area, lohashent->dsapointer);
+                    }
+                }
+            }
+
+            pgStatEntHashAge = currage;
+        }
+
+        lohashent = pgstat_localhash_lookup(pgStatEntHash, key);
+
+        if (lohashent)
+        {
+            if (found)
+                *found = true;
+            return lohashent->body;
+        }
+    }
+
+    shhashent = dshash_find_extended(pgStatSharedHash, &key,
+                                     create, nowait, create, &myfound);
+    if (shhashent)
+    {
+        if (create && !myfound)
+        {
+            /* Create new stats entry. */
+            dsa_pointer chunk = dsa_allocate0(area,
+                                              pgstat_sharedentsize[type]);
+
+            shheader = dsa_get_address(area, chunk);
+            LWLockInitialize(&shheader->lock, LWTRANCHE_STATS);
+            pg_atomic_init_u32(&shheader->refcount, 0);
+
+            /* Link the new entry from the hash entry. */
+            shhashent->body = chunk;
+        }
+        else
+            shheader = dsa_get_address(area, shhashent->body);
+
+        /*
+         * We expose this shared entry now.  You might think that the entry can
+         * be removed by a concurrent backend, but since we are creating a
+         * stats entry, the object actually exists and is in use in the upper
+         * layer. Such an object cannot be dropped until the first vacuum after
+         * the current transaction ends.
+         */
+        dshash_release_lock(pgStatSharedHash, shhashent);
+
+        /* register to local hash if possible */
+        if (pgStatEntHash || pgStatCacheContext)
+        {
+            if (pgStatEntHash == NULL)
+            {
+                pgStatEntHash =
+                    pgstat_localhash_create(pgStatCacheContext,
+                                        PGSTAT_TABLE_HASH_SIZE, NULL);
+                pgStatEntHashAge =
+                    pg_atomic_read_u64(&StatsShmem->gc_count);
+            }
+
+            lohashent =
+                pgstat_localhash_insert(pgStatEntHash, key, &myfound);
+
+            Assert(!myfound);
+            lohashent->body = shheader;
+            lohashent->dsapointer = shhashent->body;
+
+            pg_atomic_add_fetch_u32(&shheader->refcount, 1);
+        }
+    }
+
+    if (found)
+        *found = myfound;
+
+    return shheader;
+}
+
+/*
+ * flush_walstat - flush out local WAL stats entries
  *
- *    Note: if fail, we will be called again from the postmaster main loop.
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if all local WAL stats are successfully flushed out.
  */
-int
-pgstat_start(void)
+static bool
+flush_walstat(bool nowait)
 {
-    time_t        curtime;
-    pid_t        pgStatPid;
+    /* We assume this initializes to zeroes */
+    PgStat_Wal *s = &StatsShmem->wal_stats;
+    PgStat_Wal *l = &WalStats;
 
     /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
+     * This function can be called even if nothing at all has happened. In
+     * that case, avoid taking the lock for completely empty stats.
      */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
+    if (!walstats_pending())
+        return true;
 
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&StatsShmem->wal_stats_lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&StatsShmem->wal_stats_lock,
+                                       LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
+
+    s->wal_buffers_full += l->wal_buffers_full;
+    LWLockRelease(&StatsShmem->wal_stats_lock);
 
     /*
-     * Okay, fork off the collector.
+     * Clear out the statistics buffer, so it can be re-used.
      */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
+    MemSet(&WalStats, 0, sizeof(WalStats));
+
+    return true;
+}
 
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
+/*
+ * flush_slrustat - flush out local SLRU stats entries
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true. Writer processes are mutually excluded
+ * using LWLock, but readers are expected to use change-count protocol to avoid
+ * interference with writers.
+ *
+ * Returns true if all local SLRU stats are successfully flushed out.
+ */
+static bool
+flush_slrustat(bool nowait)
+{
+    uint32    assert_changecount PG_USED_FOR_ASSERTS_ONLY;
+    int i;
 
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
+    if (!have_slrustats)
+        return true;
 
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&StatsShmem->slru_stats.lock,
+                                       LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
 
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
+    assert_changecount =
+        pg_atomic_fetch_add_u32(&StatsShmem->slru_stats.changecount, 1);
+    Assert((assert_changecount & 1) == 0);
 
-        default:
-            return (int) pgStatPid;
+    for (i = 0 ; i < SLRU_NUM_ELEMENTS ; i++)
+    {
+        PgStat_SLRUStats *sharedent = &StatsShmem->slru_stats.entry[i];
+        PgStat_SLRUStats *localent = &local_SLRUStats[i];
+
+        sharedent->blocks_zeroed += localent->blocks_zeroed;
+        sharedent->blocks_hit += localent->blocks_hit;
+        sharedent->blocks_read += localent->blocks_read;
+        sharedent->blocks_written += localent->blocks_written;
+        sharedent->blocks_exists += localent->blocks_exists;
+        sharedent->flush += localent->flush;
+        sharedent->truncate += localent->truncate;
     }
 
-    /* shouldn't get here */
-    return 0;
+    /* done, clear the local entry */
+    MemSet(local_SLRUStats, 0,
+           sizeof(PgStat_SLRUStats) * SLRU_NUM_ELEMENTS);
+
+    pg_atomic_add_fetch_u32(&StatsShmem->slru_stats.changecount, 1);
+    LWLockRelease(&StatsShmem->slru_stats.lock);
+
+    have_slrustats = false;
+
+    return true;
 }
 
-void
-allow_immediate_pgstat_restart(void)
+/* ----------
+ * delete_current_stats_entry()
+ *
+ *  Deletes the given shared entry from shared stats hash. The entry must be
+ *  exclusively locked.
+ * ----------
+ */
+static void
+delete_current_stats_entry(dshash_seq_status *hstat)
 {
-    last_pgstat_start_time = 0;
+    dsa_pointer                pdsa;
+    PgStat_StatEntryHeader *header;
+    PgStatHashEntry *ent;
+
+    ent = dshash_get_current(hstat);
+    pdsa = ent->body;
+    header = dsa_get_address(area, pdsa);
+
+    /* No one can find this entry from now on. */
+    dshash_delete_current(hstat);
+
+    /*
+     * Let the referrers drop the entry if any.  Refcount won't be decremented
+     * until "dropped" is set true and StatsShmem->gc_count is incremented
+     * later. So we can check refcount to set dropped without holding a
+     * lock. If no one is referring to this entry, free it immediately.
+     */
+
+    if (pg_atomic_read_u32(&header->refcount) > 0)
+        header->dropped = true;
+    else
+        dsa_free(area, pdsa);
+
+    return;
+}
+
+
+/* ----------
+ * get_local_stat_entry() -
+ *
+ *  Returns the local stats entry for the type, dbid and objid.
+ *  If create is true, a new entry is created if it does not exist yet; found
+ *  must be non-NULL in that case.
+ *
+ *  The caller is responsible for initializing the statsbody part of the
+ *  returned memory.
+ * ----------
+ */
+static PgStat_StatEntryHeader *
+get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                     bool create, bool *found)
+{
+    PgStatHashKey key;
+    PgStatLocalHashEntry *entry;
+
+    if (pgStatLocalHash == NULL)
+        pgStatLocalHash = pgstat_localhash_create(pgStatCacheContext,
+                                                  PGSTAT_TABLE_HASH_SIZE, NULL);
+
+    /* Find an entry or create a new one. */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    if (create)
+        entry = pgstat_localhash_insert(pgStatLocalHash, key, found);
+    else
+        entry = pgstat_localhash_lookup(pgStatLocalHash, key);
+
+    if (!create && !entry)
+        return NULL;
+
+    if (create && !*found)
+        entry->body = MemoryContextAllocZero(TopMemoryContext,
+                                             pgstat_localentsize[type]);
+
+    return entry->body;
 }
 
 /* ------------------------------------------------------------
@@ -827,150 +1107,396 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *    Updates are applied no more frequently than once every
+ *    PGSTAT_MIN_INTERVAL milliseconds.  They are also postponed on lock
+ *    failure if force is false and no update has been pending for longer
+ *    than PGSTAT_MAX_INTERVAL milliseconds.  Postponed updates are retried
+ *    in succeeding calls of this function.
+ *
+ *    Returns the time in milliseconds until the next attempt to apply
+ *    updates if some updates were postponed, or 0 if nothing remains
+ *    pending.
+ *
+ *    Note that this is called only outside a transaction, so it is fine to use
  *    transaction stop time as an approximation of current time.
- * ----------
+ *    ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz    next_flush = 0;
+    static TimestampTz    pending_since = 0;
+    static long            retry_interval = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
+    bool        nowait;
     int            i;
+    uint64        oldval;
+
+    /* Return if not active */
+    if (area == NULL)
+        return 0;
+
+    /*
+     * We need a database entry if any of the following stats exist.
+     */
+    if (pgStatXactCommit > 0 || pgStatXactRollback > 0 ||
+        pgStatBlockReadTime > 0 || pgStatBlockWriteTime > 0)
+        get_local_dbstat_entry(MyDatabaseId);
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (pgStatLocalHash == NULL && !have_slrustats && !walstats_pending())
+        return 0;
 
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
     now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
-
-    /*
-     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
-     */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
-    pgStatTabHash = NULL;
-
-    /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
-     */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
-    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+
+    if (!force)
     {
-        for (i = 0; i < tsa->tsa_used; i++)
+        /*
+         * Don't flush stats too frequently.  Return the time to the next
+         * flush.
+         */
+        if (now < next_flush)
         {
-            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
-
-            /* Shouldn't have any pending transaction-dependent counts */
-            Assert(entry->trans == NULL);
-
-            /*
-             * Ignore entries that didn't accumulate any actual counts, such
-             * as indexes that were opened by the planner but not used.
-             */
-            if (memcmp(&entry->t_counts, &all_zeroes,
-                       sizeof(PgStat_TableCounts)) == 0)
+            /* Record the epoch time if retrying. */
+            if (pending_since == 0)
+                pending_since = now;
+
+            return (next_flush - now) / 1000;
+        }
+
+        /* But, don't keep pending updates longer than PGSTAT_MAX_INTERVAL. */
+
+        if (pending_since > 0 &&
+            TimestampDifferenceExceeds(pending_since, now, PGSTAT_MAX_INTERVAL))
+            force = true;
+    }
+
+    /* don't wait for lock acquisition when !force */
+    nowait = !force;
+
+    if (pgStatLocalHash)
+    {
+        int            remains = 0;
+        pgstat_localhash_iterator i;
+        List                   *dbentlist = NIL;
+        ListCell               *lc;
+        PgStatLocalHashEntry   *lent;
+
+        /* Step 1: flush out other than database stats */
+        pgstat_localhash_start_iterate(pgStatLocalHash, &i);
+        while ((lent = pgstat_localhash_iterate(pgStatLocalHash, &i)) != NULL)
+        {
+            bool        remove = false;
+
+            switch (lent->key.type)
+            {
+                case PGSTAT_TYPE_DB:
+                    /*
+                     * flush_tabstat adds some counters of flushed entries
+                     * to the local database stats.  Just remember the
+                     * database entries for now and flush them out later.
+                     */
+                    dbentlist = lappend(dbentlist, lent);
+                    break;
+                case PGSTAT_TYPE_TABLE:
+                    if (flush_tabstat(lent, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_FUNCTION:
+                    if (flush_funcstat(lent, nowait))
+                        remove = true;
+                    break;
+            }
+
+            if (!remove)
+            {
+                remains++;
                 continue;
+            }
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* Remove the successfully flushed entry */
+            pfree(lent->body);
+            lent->body = NULL;
+            pgstat_localhash_delete(pgStatLocalHash, lent->key);
+        }
+
+        /* Step 2: flush out database stats */
+        foreach(lc, dbentlist)
+        {
+            PgStatLocalHashEntry *lent = (PgStatLocalHashEntry *) lfirst(lc);
+
+            if (flush_dbstat(lent, nowait))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                remains--;
+                /* Remove the successfully flushed entry */
+                pfree(lent->body);
+                lent->body = NULL;
+                pgstat_localhash_delete(pgStatLocalHash, lent->key);
             }
         }
-        /* zero out PgStat_TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
+        list_free(dbentlist);
+        dbentlist = NULL;
+
+        if (remains <= 0)
+        {
+            pgstat_localhash_destroy(pgStatLocalHash);
+            pgStatLocalHash = NULL;
+        }
+    }
+
+    /* flush wal stats */
+    flush_walstat(nowait);
+
+    /* flush SLRU stats */
+    flush_slrustat(nowait);
+
+    /*
+     * Publish the time of the last flush, but don't notify anyone of the
+     * change.  Readers will get a sufficiently recent timestamp.  If we fail
+     * to update the value, concurrent processes should have updated it to a
+     * sufficiently recent time.
+     *
+     * XXX: The loop might be unnecessary for the reason above.
+     */
+    oldval = pg_atomic_read_u64(&StatsShmem->stats_timestamp);
+
+    for (i = 0 ; i < 10 ; i++)
+    {
+        if (oldval >= now ||
+            pg_atomic_compare_exchange_u64(&StatsShmem->stats_timestamp,
+                                           &oldval, (uint64) now))
+            break;
     }
 
     /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
+     * Some of the local stats may not have been flushed due to lock
+     * contention.  If we have such pending local stats here, let the caller
+     * know the retry interval.
      */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    if (pgStatLocalHash != NULL || have_slrustats || walstats_pending())
+    {
+        /* Retain the epoch time */
+        if (pending_since == 0)
+            pending_since = now;
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+        /* The interval is doubled at every retry. */
+        if (retry_interval == 0)
+            retry_interval = PGSTAT_RETRY_MIN_INTERVAL * 1000;
+        else
+            retry_interval = retry_interval * 2;
 
-    /* Send WAL statistics */
-    pgstat_send_wal();
+        /*
+         * Schedule the next retry so that the final interval before
+         * PGSTAT_MAX_INTERVAL is not shorter than the previous one.
+         */
+        if (!TimestampDifferenceExceeds(pending_since,
+                                        now + 2 * retry_interval,
+                                        PGSTAT_MAX_INTERVAL))
+            next_flush = now + retry_interval;
+        else
+        {
+            next_flush = pending_since + PGSTAT_MAX_INTERVAL * 1000;
+            retry_interval = next_flush - now;
+        }
 
-    /* Finally send SLRU statistics */
-    pgstat_send_slru();
+        return retry_interval / 1000;
+    }
+
+    /* Set the next time to update stats */
+    next_flush = now + PGSTAT_MIN_INTERVAL * 1000;
+    retry_interval = 0;
+    pending_since = 0;
+
+    return 0;
 }
 
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * flush_tabstat - flush out a local table stats entry
+ *
+ * Some of the stats numbers are copied to local database stats entry after
+ * successful flush-out.
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
  */
-static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+static bool
+flush_tabstat(PgStatLocalHashEntry *ent, bool nowait)
 {
-    int            n;
-    int            len;
+    static const PgStat_TableCounts all_zeroes;
+    Oid                    dboid;            /* database OID of the table */
+    PgStat_TableStatus *lstats;            /* local stats entry  */
+    PgStat_StatTabEntry *shtabstats;    /* table entry of shared stats */
+    PgStat_StatDBEntry *ldbstats;        /* local database entry */
+
+    Assert(ent->key.type == PGSTAT_TYPE_TABLE);
+    lstats = (PgStat_TableStatus *) ent->body;
+    dboid = ent->key.databaseid;
+
+    /*
+     * Ignore entries that didn't accumulate any actual counts, such as
+     * indexes that were opened by the planner but not used.
+     */
+    if (memcmp(&lstats->t_counts, &all_zeroes,
+               sizeof(PgStat_TableCounts)) == 0)
+    {
+        /* This local entry is going to be dropped, delink from relcache. */
+        pgstat_delinkstats(lstats->relation);
+        return true;
+    }
+
+    /* find shared table stats entry corresponding to the local entry */
+    shtabstats = (PgStat_StatTabEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_TABLE, dboid, ent->key.objectid,
+                             nowait);
+
+    if (shtabstats == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    /* add the values to the shared entry. */
+    shtabstats->numscans += lstats->t_counts.t_numscans;
+    shtabstats->tuples_returned += lstats->t_counts.t_tuples_returned;
+    shtabstats->tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    shtabstats->tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    shtabstats->tuples_updated += lstats->t_counts.t_tuples_updated;
+    shtabstats->tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    shtabstats->tuples_hot_updated += lstats->t_counts.t_tuples_hot_updated;
+
+    /*
+     * If the table was truncated or vacuum/analyze has run, first reset the
+     * live/dead counters.
+     */
+    if (lstats->t_counts.t_truncated)
+    {
+        shtabstats->n_live_tuples = 0;
+        shtabstats->n_dead_tuples = 0;
+    }
+
+    shtabstats->n_live_tuples += lstats->t_counts.t_delta_live_tuples;
+    shtabstats->n_dead_tuples += lstats->t_counts.t_delta_dead_tuples;
+    shtabstats->changes_since_analyze += lstats->t_counts.t_changed_tuples;
+    shtabstats->inserts_since_vacuum += lstats->t_counts.t_tuples_inserted;
+    shtabstats->blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    shtabstats->blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    shtabstats->n_live_tuples = Max(shtabstats->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    shtabstats->n_dead_tuples = Max(shtabstats->n_dead_tuples, 0);
+
+    LWLockRelease(&shtabstats->header.lock);
+
+    /* The entry was flushed successfully, so add the same to database stats */
+    ldbstats = get_local_dbstat_entry(dboid);
+    ldbstats->counts.n_tuples_returned += lstats->t_counts.t_tuples_returned;
+    ldbstats->counts.n_tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    ldbstats->counts.n_tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    ldbstats->counts.n_tuples_updated += lstats->t_counts.t_tuples_updated;
+    ldbstats->counts.n_tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    ldbstats->counts.n_blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    ldbstats->counts.n_blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /* This local entry is going to be dropped, delink from relcache. */
+    pgstat_delinkstats(lstats->relation);
+
+    return true;
+}
+
+/*
+ * flush_funcstat - flush out a local function stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_funcstat(PgStatLocalHashEntry *ent, bool nowait)
+{
+    PgStat_BackendFunctionEntry *localent;    /* local stats entry */
+    PgStat_StatFuncEntry *shfuncent = NULL; /* shared stats entry */
+
+    Assert(ent->key.type == PGSTAT_TYPE_FUNCTION);
+    localent = (PgStat_BackendFunctionEntry *) ent->body;
+
+    /* localent always has non-zero content */
+
+    /* find shared function stats entry corresponding to the local entry */
+    shfuncent = (PgStat_StatFuncEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             ent->key.objectid, nowait);
+    if (shfuncent == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    shfuncent->f_numcalls += localent->f_counts.f_numcalls;
+    shfuncent->f_total_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_total_time);
+    shfuncent->f_self_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_self_time);
+
+    LWLockRelease(&shfuncent->header.lock);
+
+    return true;
+}
+
+
+/*
+ * flush_dbstat - flush out a local database stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_dbstat(PgStatLocalHashEntry *ent, bool nowait)
+{
+    PgStat_StatDBEntry *localent;
+    PgStat_StatDBEntry *sharedent;
+
+    Assert(ent->key.type == PGSTAT_TYPE_DB);
+
+    localent = (PgStat_StatDBEntry *) ent->body;
+
+    /* find shared database stats entry corresponding to the local entry */
+    sharedent = (PgStat_StatDBEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_DB, ent->key.databaseid, InvalidOid,
+                             nowait);
+
+    if (!sharedent)
+        return false;            /* failed to acquire lock, skip */
+
+    sharedent->counts.n_tuples_returned += localent->counts.n_tuples_returned;
+    sharedent->counts.n_tuples_fetched += localent->counts.n_tuples_fetched;
+    sharedent->counts.n_tuples_inserted += localent->counts.n_tuples_inserted;
+    sharedent->counts.n_tuples_updated += localent->counts.n_tuples_updated;
+    sharedent->counts.n_tuples_deleted += localent->counts.n_tuples_deleted;
+    sharedent->counts.n_blocks_fetched += localent->counts.n_blocks_fetched;
+    sharedent->counts.n_blocks_hit += localent->counts.n_blocks_hit;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    sharedent->counts.n_deadlocks += localent->counts.n_deadlocks;
+    sharedent->counts.n_temp_bytes += localent->counts.n_temp_bytes;
+    sharedent->counts.n_temp_files += localent->counts.n_temp_files;
+    sharedent->counts.n_checksum_failures += localent->counts.n_checksum_failures;
 
     /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
+     * Accumulate xact commit/rollback and I/O timings to stats entry of the
+     * current database.
      */
-    if (OidIsValid(tsmsg->m_databaseid))
+    if (OidIsValid(ent->key.databaseid))
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
+        sharedent->counts.n_xact_commit += pgStatXactCommit;
+        sharedent->counts.n_xact_rollback += pgStatXactRollback;
+        sharedent->counts.n_block_read_time += pgStatBlockReadTime;
+        sharedent->counts.n_block_write_time += pgStatBlockWriteTime;
         pgStatXactCommit = 0;
         pgStatXactRollback = 0;
         pgStatBlockReadTime = 0;
@@ -978,282 +1504,130 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        sharedent->counts.n_xact_commit = 0;
+        sharedent->counts.n_xact_rollback = 0;
+        sharedent->counts.n_block_read_time = 0;
+        sharedent->counts.n_block_write_time = 0;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
+    LWLockRelease(&sharedent->header.lock);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    return true;
 }
 
-/*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
- */
-static void
-pgstat_send_funcstats(void)
-{
-    /* we assume this inits to all zeroes: */
-    static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
-
-    if (pgStatFunctions == NULL)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
-
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
-
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
-
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
-
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
-
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
-}
-
-
 /* ----------
  * pgstat_vacuum_stat() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *  Delete shared stat entries that are not in system catalogs.
+ *
+ *  The process is performed in two steps.
+ *
+ *   1: Collect the OIDs of all existing objects of every kind.
+ *   2: Scan the shared hash with exclusive lock and drop every entry whose
+ *      object no longer exists.
+ *
+ *  Dropped entries that are still referenced by other backends are only
+ *  marked as dropped; StatsShmem->gc_count is incremented so that the
+ *  referrers release them later.  Entries missed here will be deleted by
+ *  later vacuums.
  * ----------
  */
 void
 pgstat_vacuum_stat(void)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    pgstat_oid_hash       *dbids;    /* database ids */
+    pgstat_oid_hash       *relids;    /* relation ids in the current database */
+    pgstat_oid_hash       *funcids;/* function ids in the current database */
+    int            nvictims = 0;    /* # of entries of the above */
+    dshash_seq_status dshstat;
+    PgStatHashEntry *ent;
+
+    /* we don't collect stats under standalone mode */
+    if (!IsUnderPostmaster)
         return;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    /* collect oids of existent objects */
+    dbids = collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    relids = collect_oids(RelationRelationId, Anum_pg_class_oid);
+    funcids = collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
 
-    /*
-     * Read pg_database and make a list of OIDs of all existing databases
-     */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    nvictims = 0;
 
-    /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    /* some of the dshash entries are to be removed, take exclusive lock. */
+    dshash_seq_init(&dshstat, pgStatSharedHash, true);
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
     {
-        Oid            dbid = dbentry->databaseid;
+        pgstat_oid_hash *oidhash;
+        Oid           key;
 
         CHECK_FOR_INTERRUPTS();
 
-        /* the DB entry for shared tables (with InvalidOid) is never dropped */
-        if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
-            pgstat_drop_database(dbid);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Lookup our own database entry; if not found, nothing more to do.
-     */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
-        return;
-
-    /*
-     * Similarly to above, make a list of all known relations in this DB.
-     */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
-
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
-
-    /*
-     * Check for all tables listed in stats hashtable if they still exist.
-     */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            tabid = tabentry->tableid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
-            continue;
-
         /*
-         * Not there, so add this table's Oid to the message
+         * Don't drop entries of other databases, except for database
+         * entries themselves.
          */
-        msg.m_tableid[msg.m_nentries++] = tabid;
+        if (ent->key.type != PGSTAT_TYPE_DB &&
+            ent->key.databaseid != MyDatabaseId)
+            continue;
 
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
+        switch (ent->key.type)
         {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
+            case PGSTAT_TYPE_DB:
+                /* don't remove database entry for shared tables */
+                if (ent->key.databaseid == 0)
+                    continue;
+                oidhash = dbids;
+                key = ent->key.databaseid;
+                break;
+
+            case PGSTAT_TYPE_TABLE:
+                oidhash = relids;
+                key = ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                oidhash = funcids;
+                key = ent->key.objectid;
+                break;
         }
-    }
 
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
+        /* Skip existent objects. */
+        if (pgstat_oid_lookup(oidhash, key) != NULL)
+            continue;
 
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
+        /* drop this entry */
+        delete_current_stats_entry(&dshstat);
+        nvictims++;
     }
+    dshash_seq_term(&dshstat);
+    pgstat_oid_destroy(dbids);
+    pgstat_oid_destroy(relids);
+    pgstat_oid_destroy(funcids);
 
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
-        {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
-                continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
-        }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
-
-        hash_destroy(htab);
-    }
+    if (nvictims > 0)
+        pg_atomic_add_fetch_u64(&StatsShmem->gc_count, 1);
 }
 
-
 /* ----------
- * pgstat_collect_oids() -
+ * collect_oids() -
  *
  *    Collect the OIDs of all objects listed in the specified system catalog
- *    into a temporary hash table.  Caller should hash_destroy the result
+ *    into a temporary hash table.  Caller should pgstat_oid_destroy the result
  *    when done with it.  (However, we make the table in CurrentMemoryContext
  *    so that it will be freed properly in event of an error.)
  * ----------
  */
-static HTAB *
-pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
+static pgstat_oid_hash *
+collect_oids(Oid catalogid, AttrNumber anum_oid)
 {
-    HTAB       *htab;
-    HASHCTL        hash_ctl;
+    pgstat_oid_hash *rethash;
     Relation    rel;
     TableScanDesc scan;
     HeapTuple    tup;
     Snapshot    snapshot;
 
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(Oid);
-    hash_ctl.hcxt = CurrentMemoryContext;
-    htab = hash_create("Temporary table of OIDs",
-                       PGSTAT_TAB_HASH_SIZE,
-                       &hash_ctl,
-                       HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    rethash = pgstat_oid_create(CurrentMemoryContext,
+                                PGSTAT_TABLE_HASH_SIZE, NULL);
 
     rel = table_open(catalogid, AccessShareLock);
     snapshot = RegisterSnapshot(GetLatestSnapshot());
@@ -1262,81 +1636,61 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     {
         Oid            thisoid;
         bool        isnull;
+        bool        found;
 
         thisoid = heap_getattr(tup, anum_oid, RelationGetDescr(rel), &isnull);
         Assert(!isnull);
 
         CHECK_FOR_INTERRUPTS();
 
-        (void) hash_search(htab, (void *) &thisoid, HASH_ENTER, NULL);
+        pgstat_oid_insert(rethash, thisoid, &found);
     }
     table_endscan(scan);
     UnregisterSnapshot(snapshot);
     table_close(rel, AccessShareLock);
 
-    return htab;
+    return rethash;
 }
 
-
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
- * ----------
+ *    Remove entries for the database that we just dropped.
+ *
+ *    Some entries might be left behind due to lock failure, or some stats
+ *    might be flushed after this, but we will still clean the dead DB
+ *    eventually via future invocations of pgstat_vacuum_stat().
+ *    ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
 
+    Assert(OidIsValid(databaseid));
 
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
-
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
+    /* some of the dshash entries are to be removed, take exclusive lock. */
+    dshash_seq_init(&hstat, pgStatSharedHash, true);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+    {
+        if (p->key.databaseid == databaseid)
+            delete_current_stats_entry(&hstat);
+    }
+    dshash_seq_term(&hstat);
+
+    /* Let readers run a garbage collection of local hashes */
+    pg_atomic_add_fetch_u64(&StatsShmem->gc_count, 1);
 }
-#endif                            /* NOT_USED */
 
 
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1345,53 +1699,144 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    /* dshash entry is not modified, take shared lock */
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+    {
+        PgStat_StatEntryHeader *header;
+
+        if (p->key.databaseid != MyDatabaseId)
+            continue;
+
+        header = dsa_get_address(area, p->body);
+
+        LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+        memset(PGSTAT_SHENT_BODY(header), 0,
+               PGSTAT_SHENT_BODY_LEN(p->key.type));
+
+        if (p->key.type == PGSTAT_TYPE_DB)
+        {
+            PgStat_StatDBEntry *dbstat = (PgStat_StatDBEntry *) header;
+            dbstat->stat_reset_timestamp = GetCurrentTimestamp();
+        }
+        LWLockRelease(&header->lock);
+    }
+    dshash_seq_term(&hstat);
+
+    /* Invalidate the simple cache keys */
+    cached_dbent_key = stathashkey_zero;
+    cached_tabent_key = stathashkey_zero;
+    cached_funcent_key = stathashkey_zero;
+}
+
+/*
+ * pgstat_copy_global_stats - helper function for functions
+ *           pgstat_fetch_stat_*() and pgstat_reset_shared_counters().
+ */
+static inline void
+pgstat_copy_global_stats(void *dst, void *src, size_t len,
+                         pg_atomic_uint32 *count)
+{
+    uint32        before_changecount;
+    uint32        after_changecount;
+
+    after_changecount = pg_atomic_read_u32(count);
+
+    do
+    {
+        before_changecount = after_changecount;
+        memcpy(dst, src, len);
+        after_changecount = pg_atomic_read_u32(count);
+    } while ((before_changecount & 1) == 1 ||
+             after_changecount != before_changecount);
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
+ *
+ *  We don't scribble on shared stats while resetting to avoid locking on
+ *  shared stats struct. Instead, just record the current counters in another
+ *  shared struct, which is protected by StatsLock. See
+ *  pgstat_fetch_stat_(archiver|bgwriter|checkpointer) for the reader side.
  * ----------
  */
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    TimestampTz now = GetCurrentTimestamp();
+    PgStat_Shared_Reset_Target t;
 
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+        t = RESET_ARCHIVER;
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+        t = RESET_BGWRITER;
     else if (strcmp(target, "wal") == 0)
-        msg.m_resettarget = RESET_WAL;
+        t = RESET_WAL;
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\", \"bgwriter\" or \"wal\".")));
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
+    /* Reset statistics for the cluster. */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+    switch (t)
+    {
+        case RESET_ARCHIVER:
+            pgstat_copy_global_stats(&StatsShmem->archiver_reset_offset,
+                                     &StatsShmem->archiver_stats,
+                                     sizeof(PgStat_Archiver),
+                                     &StatsShmem->archiver_changecount);
+            StatsShmem->archiver_reset_offset.stat_reset_timestamp = now;
+            cached_archiverstats_is_valid = false;
+            break;
+
+        case RESET_BGWRITER:
+            pgstat_copy_global_stats(&StatsShmem->bgwriter_reset_offset,
+                                     &StatsShmem->bgwriter_stats,
+                                     sizeof(PgStat_BgWriter),
+                                     &StatsShmem->bgwriter_changecount);
+            pgstat_copy_global_stats(&StatsShmem->checkpointer_reset_offset,
+                                     &StatsShmem->checkpointer_stats,
+                                     sizeof(PgStat_CheckPointer),
+                                     &StatsShmem->checkpointer_changecount);
+            StatsShmem->bgwriter_reset_offset.stat_reset_timestamp = now;
+            cached_bgwriterstats_is_valid = false;
+            cached_checkpointerstats_is_valid = false;
+            break;
+
+        case RESET_WAL:
+            /*
+             * Unlike the two above, WAL statistics have many writer
+             * processes and are protected by wal_stats_lock.
+             */
+            LWLockAcquire(&StatsShmem->wal_stats_lock, LW_EXCLUSIVE);
+            MemSet(&StatsShmem->wal_stats, 0, sizeof(PgStat_Wal));
+            StatsShmem->wal_stats.stat_reset_timestamp = now;
+            LWLockRelease(&StatsShmem->wal_stats_lock);
+            cached_walstats_is_valid = false;
+            break;
+    }
+
+    LWLockRelease(StatsLock);
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1400,17 +1845,37 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatEntryHeader *header;
+    PgStat_StatDBEntry *dbentry;
+    PgStatTypes stattype;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    dbentry = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, MyDatabaseId, InvalidOid, false, false,
+                       NULL);
+    Assert(dbentry);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* Set the reset timestamp for the whole database */
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->header.lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&dbentry->header.lock);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Map the reset type to the stats type of the target object */
+    switch (type)
+    {
+        case RESET_TABLE:
+            stattype = PGSTAT_TYPE_TABLE;
+            break;
+        case RESET_FUNCTION:
+            stattype = PGSTAT_TYPE_FUNCTION;
+    }
+
+    header = get_stat_entry(stattype, MyDatabaseId, objoid, false, false, NULL);
+
+    LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+    memset(PGSTAT_SHENT_BODY(header), 0, PGSTAT_SHENT_BODY_LEN(stattype));
+    LWLockRelease(&header->lock);
 }
 
 /* ----------
@@ -1426,15 +1891,40 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_reset_slru_counter(const char *name)
 {
-    PgStat_MsgResetslrucounter msg;
+    int i;
+    TimestampTz    ts = GetCurrentTimestamp();
+    uint32    assert_changecount PG_USED_FOR_ASSERTS_ONLY;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    if (name)
+    {
+        i = pgstat_slru_index(name);
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+        assert_changecount =
+            pg_atomic_fetch_add_u32(&StatsShmem->slru_changecount, 1);
+        Assert((assert_changecount & 1) == 0);
+        MemSet(&StatsShmem->slru_stats.entry[i], 0,
+               sizeof(PgStat_SLRUStats));
+        StatsShmem->slru_stats.entry[i].stat_reset_timestamp = ts;
+        pg_atomic_add_fetch_u32(&StatsShmem->slru_changecount, 1);
+        LWLockRelease(&StatsShmem->slru_stats.lock);
+    }
+    else
+    {
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+        assert_changecount =
+            pg_atomic_fetch_add_u32(&StatsShmem->slru_changecount, 1);
+        Assert((assert_changecount & 1) == 0);
+        for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
+        {
+            MemSet(&StatsShmem->slru_stats.entry[i], 0,
+                   sizeof(PgStat_SLRUStats));
+            StatsShmem->slru_stats.entry[i].stat_reset_timestamp = ts;
+        }
+        pg_atomic_add_fetch_u32(&StatsShmem->slru_changecount, 1);
+        LWLockRelease(&StatsShmem->slru_stats.lock);
+    }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSLRUCOUNTER);
-    msg.m_index = (name) ? pgstat_slru_index(name) : -1;
-
-    pgstat_send(&msg, sizeof(msg));
+    cached_slrustats_is_valid = false;
 }
 
 /* ----------
@@ -1448,48 +1938,93 @@ pgstat_reset_slru_counter(const char *name)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if activity stats is not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    /*
+     * End-of-vacuum is reported instantly. Report the start the same way for
+     * consistency. Vacuum doesn't run frequently and is a long-lasting
+     * operation so it doesn't matter if we get blocked here a little.
+     */
+    dbentry = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, dboid, InvalidOid, false, true, NULL);
 
-    pgstat_send(&msg, sizeof(msg));
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->header.lock, LW_EXCLUSIVE);
+    dbentry->last_autovac_time = ts;
+    LWLockRelease(&dbentry->header.lock);
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report about the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    PgStat_StatTabEntry *tabentry;
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    ts = GetCurrentTimestamp();
+
+    /*
+     * Unlike ordinary operations, maintenance commands take a long time, so
+     * getting blocked at the end of the work doesn't matter. Furthermore,
+     * this prevents stats updates made by transactions that end after this
+     * vacuum from being canceled by a delayed vacuum report. Update the
+     * shared stats entry directly for the above reasons.
+     */
+    tabentry = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, tableoid, false, true, NULL);
+
+    LWLockAcquire(&tabentry->header.lock, LW_EXCLUSIVE);
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * It is quite possible that a non-aggressive VACUUM ended up skipping
+     * various pages, however, we'll zero the insert counter here regardless.
+     * It's currently used only to track when we need to perform an "insert"
+     * autovacuum, which are mainly intended to freeze newly inserted tuples.
+     * Zeroing this may just mean we'll not try to vacuum the table again
+     * until enough tuples have been inserted to trigger another insert
+     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
+     * stragglers.
+     */
+    tabentry->inserts_since_vacuum = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = ts;
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = ts;
+        tabentry->vacuum_count++;
+    }
+
+    LWLockRelease(&tabentry->header.lock);
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report about the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1500,9 +2035,11 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    PgStat_StatTabEntry *tabentry;
+    Oid        dboid = (rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId);
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1510,10 +2047,10 @@ pgstat_report_analyze(Relation rel,
      * already inserted and/or deleted rows in the target table. ANALYZE will
      * have counted such rows as live or dead respectively. Because we will
      * report our counts of such rows at transaction end, we should subtract
-     * off these counts from what we send to the collector now, else they'll
-     * be double-counted after commit.  (This approach also ensures that the
-     * collector ends up with the right numbers if we abort instead of
-     * committing.)
+     * off these counts from what we write to shared stats now, else
+     * they'll be double-counted after commit.  (This approach also ensures
+     * that the shared stats end up with the right numbers if we abort
+     * instead of committing.)
      */
     if (rel->pgstat_info != NULL)
     {
@@ -1531,154 +2068,176 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Unlike ordinary operations, maintenance commands take a long time, so
+     * getting blocked at the end of the work doesn't matter. Furthermore,
+     * this prevents stats updates made by transactions that end after this
+     * analyze from being canceled by a delayed analyze report. Update the
+     * shared stats entry directly for the above reasons.
+     */
+    tabentry = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, RelationGetRelid(rel),
+                       false, true, NULL);
+
+    LWLockAcquire(&tabentry->header.lock, LW_EXCLUSIVE);
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    LWLockRelease(&tabentry->header.lock);
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            dbent->counts.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            dbent->counts.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            dbent->counts.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            dbent->counts.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            dbent->counts.n_conflict_startup_deadlock++;
+            break;
+    }
 }
 
+
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_deadlocks++;
 }
 
-
-
 /* --------
- * pgstat_report_checksum_failures_in_db() -
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
  *
- *    Tell the collector about one or more checksum failures.
+ *    Report one or more checksum failures.
  * --------
  */
 void
 pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
 {
-    PgStat_MsgChecksumFailure msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
+    dbentry = get_local_dbstat_entry(dboid);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* add the reported failure count to the accumulated count */
+    dbentry->counts.n_checksum_failures += failurecount;
 }
 
 /* --------
  * pgstat_report_checksum_failure() -
  *
- *    Tell the collector about a checksum failure.
+ *    Report a checksum failure.
  * --------
  */
 void
 pgstat_report_checksum_failure(void)
 {
-    pgstat_report_checksum_failures_in_db(MyDatabaseId, 1);
+    PgStat_StatDBEntry *dbent;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_checksum_failures++;
 }
 
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/* ----------
- * pgstat_ping() -
- *
- *    Send some junk data to the collector to increase traffic.
- * ----------
- */
-void
-pgstat_ping(void)
-{
-    PgStat_MsgDummy msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (filesize == 0)            /* Is there a case where filesize is really 0? */
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_temp_bytes += filesize; /* needs an overflow check */
+    dbent->counts.n_temp_files++;
 }
 
 /* ----------
- * pgstat_send_inquiry() -
+ * pgstat_init_function_usage() -
  *
- *    Notify collector that we need fresh data.
+ *  Initialize function call usage data.
+ *  Called by the executor before invoking a function.
  * ----------
  */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/*
- * Initialize function call usage data.
- * Called by the executor before invoking a function.
- */
 void
 pgstat_init_function_usage(FunctionCallInfo fcinfo,
                            PgStat_FunctionCallUsage *fcu)
@@ -1693,25 +2252,9 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
         return;
     }
 
-    if (!pgStatFunctions)
-    {
-        /* First time through - initialize function stat table */
-        HASHCTL        hash_ctl;
-
-        memset(&hash_ctl, 0, sizeof(hash_ctl));
-        hash_ctl.keysize = sizeof(Oid);
-        hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
-        pgStatFunctions = hash_create("Function stat entries",
-                                      PGSTAT_FUNCTION_HASH_SIZE,
-                                      &hash_ctl,
-                                      HASH_ELEM | HASH_BLOBS);
-    }
-
-    /* Get the stats entry for this function, create if necessary */
-    htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
-                          HASH_ENTER, &found);
-    if (!found)
-        MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
+    htabent = (PgStat_BackendFunctionEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             fcinfo->flinfo->fn_oid, true, &found);
 
     fcu->fs = &htabent->f_counts;
 
@@ -1725,31 +2268,37 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
     INSTR_TIME_SET_CURRENT(fcu->f_start);
 }
 
-/*
- * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
- *        for specified function
+/* ----------
+ * find_funcstat_entry() -
  *
- * If no entry, return NULL, don't create a new one
+ *  find any existing PgStat_BackendFunctionEntry entry for specified function
+ *
+ *  If no entry, return NULL, not creating a new one.
+ * ----------
  */
 PgStat_BackendFunctionEntry *
 find_funcstat_entry(Oid func_id)
 {
-    if (pgStatFunctions == NULL)
-        return NULL;
+    PgStat_BackendFunctionEntry *ent;
 
-    return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
-                                                       (void *) &func_id,
-                                                       HASH_FIND, NULL);
+    ent = (PgStat_BackendFunctionEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             func_id, false, NULL);
+
+    return ent;
 }
 
-/*
- * Calculate function call usage and update stat counters.
- * Called by the executor after invoking a function.
+/* ----------
+ * pgstat_end_function_usage() -
  *
- * In the case of a set-returning function that runs in value-per-call mode,
- * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
- * calls for what the user considers a single call of the function.  The
- * finalize flag should be TRUE on the last call.
+ *  Calculate function call usage and update stat counters.
+ *  Called by the executor after invoking a function.
+ *
+ *  In the case of a set-returning function that runs in value-per-call mode,
+ *  we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
+ *  calls for what the user considers a single call of the function.  The
+ *  finalize flag should be TRUE on the last call.
+ * ----------
  */
 void
 pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
@@ -1790,9 +2339,6 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
         fs->f_numcalls++;
     fs->f_total_time = f_total;
     INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
 }
 
 
@@ -1804,8 +2350,7 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
  *
  *    We assume that a relcache entry's pgstat_info field is zeroed by
  *    relcache.c when the relcache entry is made; thereafter it is long-lived
- *    data.  We can avoid repeated searches of the TabStatus arrays when the
- *    same relation is touched repeatedly within a transaction.
+ *    data.
  * ----------
  */
 void
@@ -1821,7 +2366,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1832,121 +2378,60 @@ pgstat_initstats(Relation rel)
      * If we already set up this relation in the current transaction, nothing
      * to do.
      */
-    if (rel->pgstat_info != NULL &&
-        rel->pgstat_info->t_id == rel_id)
+    if (rel->pgstat_info != NULL)
         return;
 
     /* Else find or make the PgStat_TableStatus entry, and update link */
-    rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+    rel->pgstat_info = get_local_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+    /* mark this relation as the owner */
+
+    /* don't allow linking a stats entry to multiple relcache entries */
+    Assert(rel->pgstat_info->relation == NULL);
+    rel->pgstat_info->relation = rel;
 }
 
 /*
- * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
+ * pgstat_delinkstats() -
+ *
+ *  Break the mutual link between a relcache entry and a local stats entry.
+ *  This must always be called when one end of the link is removed.
  */
-static PgStat_TableStatus *
-get_tabstat_entry(Oid rel_id, bool isshared)
+void
+pgstat_delinkstats(Relation rel)
 {
-    TabStatHashEntry *hash_entry;
-    PgStat_TableStatus *entry;
-    TabStatusArray *tsa;
-    bool        found;
-
-    /*
-     * Create hash table if we don't have it already.
-     */
-    if (pgStatTabHash == NULL)
+    /* remove the link to stats info if any */
+    if (rel && rel->pgstat_info)
     {
-        HASHCTL        ctl;
-
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
+        /* link sanity check */
+        Assert(rel->pgstat_info->relation == rel);
+        rel->pgstat_info->relation = NULL;
+        rel->pgstat_info = NULL;
     }
-
-    /*
-     * Find an entry or create a new one.
-     */
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
-    if (!found)
-    {
-        /* initialize new entry with null pointer */
-        hash_entry->tsa_entry = NULL;
-    }
-
-    /*
-     * If entry is already valid, we're done.
-     */
-    if (hash_entry->tsa_entry)
-        return hash_entry->tsa_entry;
-
-    /*
-     * Locate the first pgStatTabList entry with free space, making a new list
-     * entry if needed.  Note that we could get an OOM failure here, but if so
-     * we have left the hashtable and the list in a consistent state.
-     */
-    if (pgStatTabList == NULL)
-    {
-        /* Set up first pgStatTabList entry */
-        pgStatTabList = (TabStatusArray *)
-            MemoryContextAllocZero(TopMemoryContext,
-                                   sizeof(TabStatusArray));
-    }
-
-    tsa = pgStatTabList;
-    while (tsa->tsa_used >= TABSTAT_QUANTUM)
-    {
-        if (tsa->tsa_next == NULL)
-            tsa->tsa_next = (TabStatusArray *)
-                MemoryContextAllocZero(TopMemoryContext,
-                                       sizeof(TabStatusArray));
-        tsa = tsa->tsa_next;
-    }
-
-    /*
-     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
-     * the entry was already zeroed, either at creation or after last use.
-     */
-    entry = &tsa->tsa_entries[tsa->tsa_used++];
-    entry->t_id = rel_id;
-    entry->t_shared = isshared;
-
-    /*
-     * Now we can fill the entry in pgStatTabHash.
-     */
-    hash_entry->tsa_entry = entry;
-
-    return entry;
 }
 
 /*
  * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_TableStatus entry for rel_id in the current
+ *  database. If not found, look among the shared tables.
  *
- * Note: if we got an error in the most recent execution of pgstat_report_stat,
- * it's possible that an entry exists but there's no hashtable entry for it.
- * That's okay, we'll treat this case as "doesn't exist".
+ *  If no entry is found, return NULL; don't create a new one.
+ * ----------
  */
 PgStat_TableStatus *
 find_tabstat_entry(Oid rel_id)
 {
-    TabStatHashEntry *hash_entry;
+    PgStat_TableStatus *ent;
 
-    /* If hashtable doesn't exist, there are no entries at all */
-    if (!pgStatTabHash)
-        return NULL;
+    ent = (PgStat_TableStatus *)
+        get_local_stat_entry(PGSTAT_TYPE_TABLE, MyDatabaseId, rel_id,
+                             false, NULL);
+    if (!ent)
+        ent = (PgStat_TableStatus *)
+            get_local_stat_entry(PGSTAT_TYPE_TABLE, InvalidOid, rel_id,
+                                 false, NULL);
 
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
-    if (!hash_entry)
-        return NULL;
-
-    /* Note that this step could also return NULL, but that's correct */
-    return hash_entry->tsa_entry;
+    return ent;
 }
 
 /*
@@ -2361,8 +2846,6 @@ AtPrepare_PgStat(void)
             record.inserted_pre_trunc = trans->inserted_pre_trunc;
             record.updated_pre_trunc = trans->updated_pre_trunc;
             record.deleted_pre_trunc = trans->deleted_pre_trunc;
-            record.t_id = tabstat->t_id;
-            record.t_shared = tabstat->t_shared;
             record.t_truncated = trans->truncated;
 
             RegisterTwoPhaseRecord(TWOPHASE_RM_PGSTAT_ID, 0,
@@ -2377,8 +2860,8 @@ AtPrepare_PgStat(void)
  *
  * All we need do here is unlink the transaction stats state from the
  * nontransactional state.  The nontransactional action counts will be
- * reported to the stats collector immediately, while the effects on live
- * and dead tuple counts are preserved in the 2PC state file.
+ * reported to the activity stats facility immediately, while the effects on
+ * live and dead tuple counts are preserved in the 2PC state file.
  *
  * Note: AtEOXact_PgStat is not called during PREPARE.
  */
@@ -2423,7 +2906,7 @@ pgstat_twophase_postcommit(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, commit case */
     pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
@@ -2459,7 +2942,7 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, abort case */
     if (rec->t_truncated)
@@ -2479,85 +2962,138 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    Find the database stats entry from a backend.
+ *
+ *    The returned entry is stored in static memory, so the content is valid
+ *    until the next call of the same function for a different database.
  * ----------
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    PgStat_StatDBEntry *shent;
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for InvalidOid */
+    Assert(dbid != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_dbent_key.databaseid == dbid)
+        return &cached_dbent;
+
+    shent = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid, true, false, NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_dbent, shent, sizeof(PgStat_StatDBEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_dbent_key.databaseid = dbid;
+
+    return &cached_dbent;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one table or NULL. NULL doesn't mean
+ *    the activity statistics for one table or NULL. NULL doesn't mean
  *    that the table doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    activity statistics facility, so the caller is better off reporting
+ *    ZERO instead.
  * ----------
  */
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
-    PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_extended(false, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
-
-    return NULL;
+    tabentry = pgstat_fetch_stat_tabentry_extended(true, relid);
+    return tabentry;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_tabentry_extended() -
+ *
+ *    Find the table stats entry from a backend. The returned entry is stored
+ *    in static memory, so the content is valid until the next call of the same
+ *    function for a different table.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_extended(bool shared, Oid reloid)
+{
+    PgStat_StatTabEntry *shent;
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for InvalidOid */
+    Assert(reloid != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_tabent_key.databaseid == dboid &&
+        cached_tabent_key.objectid == reloid)
+        return &cached_tabent;
+
+    shent = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, reloid, true, false, NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_tabent, shent, sizeof(PgStat_StatTabEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_tabent_key.databaseid = dboid;
+    cached_tabent_key.objectid = reloid;
+
+    return &cached_tabent;
+}
+
+
+/* ----------
+ * pgstat_copy_index_counters() -
+ *
+ *    Support function for index swapping. Copy a subset of the relation's
+ *    counters to the specified place.
+ * ----------
+ */
+void
+pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst)
+{
+    PgStat_StatTabEntry *tabentry;
+
+    /* No point fetching tabentry when dst is NULL */
+    if (!dst)
+        return;
+
+    tabentry = pgstat_fetch_stat_tabentry(relid);
+
+    if (!tabentry)
+        return;
+
+    dst->t_counts.t_numscans = tabentry->numscans;
+    dst->t_counts.t_tuples_returned = tabentry->tuples_returned;
+    dst->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
+    dst->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
+    dst->t_counts.t_blocks_hit = tabentry->blocks_hit;
 }
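Reviewer note on the fetch functions in this hunk: each one keeps a single-entry cache in static memory, keyed by (database OID, object OID), and consults the shared hash only when the key differs. Below is a minimal sketch of that pattern in plain C. Every `cache_demo_*` name, the small array standing in for the shared hash, and the lookup counter are hypothetical and not part of the patch. The per-entry LWLock that the patch holds around the copy is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical key and entry types; the patch uses its own structs. */
typedef struct CacheDemoKey
{
    uint32_t    databaseid;
    uint32_t    objectid;
} CacheDemoKey;

typedef struct CacheDemoEntry
{
    CacheDemoKey key;
    uint64_t    numscans;
} CacheDemoEntry;

/* Stand-in for the shared stats hash (assumption: a tiny fixed array). */
static CacheDemoEntry cache_demo_shared[] = {
    {{1, 100}, 7},
    {{1, 200}, 9},
};
static int  cache_demo_shared_lookups = 0;

/* Single-entry cache; an all-zero key means "nothing cached yet". */
static CacheDemoKey cache_demo_cached_key;
static CacheDemoEntry cache_demo_cached_entry;

/* Linear search playing the role of the shared hash lookup. */
static CacheDemoEntry *
cache_demo_lookup_shared(uint32_t dbid, uint32_t objid)
{
    size_t      n = sizeof(cache_demo_shared) / sizeof(cache_demo_shared[0]);

    cache_demo_shared_lookups++;
    for (size_t i = 0; i < n; i++)
    {
        if (cache_demo_shared[i].key.databaseid == dbid &&
            cache_demo_shared[i].key.objectid == objid)
            return &cache_demo_shared[i];
    }
    return NULL;
}

/* Return a pointer to the cached copy, refreshing it on a key mismatch. */
static CacheDemoEntry *
cache_demo_fetch(uint32_t dbid, uint32_t objid)
{
    CacheDemoEntry *shent;

    /* 0 plays the role of InvalidOid, which the empty cache key also uses */
    assert(objid != 0);

    if (cache_demo_cached_key.databaseid == dbid &&
        cache_demo_cached_key.objectid == objid)
        return &cache_demo_cached_entry;

    shent = cache_demo_lookup_shared(dbid, objid);
    if (shent == NULL)
        return NULL;

    /* the patch takes the entry's lock in shared mode around this copy */
    memcpy(&cache_demo_cached_entry, shent, sizeof(CacheDemoEntry));

    /* remember the key for the cached entry */
    cache_demo_cached_key.databaseid = dbid;
    cache_demo_cached_key.objectid = objid;

    return &cache_demo_cached_entry;
}
```

Two consequences are visible in the sketch: the returned pointer is overwritten by the next call for a different key, and a repeated call for the same key never sees later updates to the shared entry unless something invalidates the cached key.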
 
 
@@ -2566,30 +3102,46 @@ pgstat_fetch_stat_tabentry(Oid relid)
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
  *    the collected statistics for one function or NULL.
+ *
+ *    The returned entry is stored in static memory, so the content is valid
+ *    until the next call of the same function for a different function id.
  * ----------
  */
 PgStat_StatFuncEntry *
 pgstat_fetch_stat_funcentry(Oid func_id)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry = NULL;
-
-    /* load the stats file if needed */
-    backend_read_statsfile();
-
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
-
-    return funcentry;
+    PgStat_StatFuncEntry *shent;
+    Oid    dboid = MyDatabaseId;
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for InvalidOid */
+    Assert(func_id != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_funcent_key.databaseid == dboid &&
+        cached_funcent_key.objectid == func_id)
+        return &cached_funcent;
+
+    shent = (PgStat_StatFuncEntry *)
+        get_stat_entry(PGSTAT_TYPE_FUNCTION, dboid, func_id, true, false,
+                       NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_funcent, shent, sizeof(PgStat_StatFuncEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_funcent_key.databaseid = dboid;
+    cached_funcent_key.objectid = func_id;
+
+    return &cached_funcent;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_beentry() -
  *
@@ -2649,53 +3201,157 @@ pgstat_fetch_stat_numbackends(void)
     return localNumBackends;
 }
 
+/*
+ * ---------
+ * pgstat_get_stat_timestamp() -
+ *
+ *  Returns the last update timestamp of global statistics.
+ */
+TimestampTz
+pgstat_get_stat_timestamp(void)
+{
+    return (TimestampTz) pg_atomic_read_u64(&StatsShmem->stats_timestamp);
+}
+
 /*
  * ---------
  * pgstat_fetch_stat_archiver() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the archiver statistics struct.
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_ArchiverStats *
+PgStat_Archiver *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    PgStat_Archiver reset;
+    PgStat_Archiver *reset_shared = &StatsShmem->archiver_reset_offset;
+    PgStat_Archiver *shared = &StatsShmem->archiver_stats;
+    PgStat_Archiver *cached = &cached_archiverstats;
 
-    return &archiverStats;
+    pgstat_copy_global_stats(cached, shared, sizeof(PgStat_Archiver),
+                             &StatsShmem->archiver_changecount);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&reset, reset_shared, sizeof(PgStat_Archiver));
+    LWLockRelease(StatsLock);
+
+    /* compensate by reset offsets */
+    if (cached->archived_count == reset.archived_count)
+    {
+        cached->last_archived_wal[0] = 0;
+        cached->last_archived_timestamp = 0;
+    }
+    cached->archived_count -= reset.archived_count;
+
+    if (cached->failed_count == reset.failed_count)
+    {
+        cached->last_failed_wal[0] = 0;
+        cached->last_failed_timestamp = 0;
+    }
+    cached->failed_count -= reset.failed_count;
+
+    cached->stat_reset_timestamp = reset.stat_reset_timestamp;
+
+    cached_archiverstats_is_valid = true;
+
+    return &cached_archiverstats;
 }
 
 
 /*
  * ---------
- * pgstat_fetch_global() -
+ * pgstat_fetch_stat_bgwriter() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the global statistics struct.
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_GlobalStats *
-pgstat_fetch_global(void)
+PgStat_BgWriter *
+pgstat_fetch_stat_bgwriter(void)
 {
-    backend_read_statsfile();
+    PgStat_BgWriter reset;
+    PgStat_BgWriter *reset_shared = &StatsShmem->bgwriter_reset_offset;
+    PgStat_BgWriter *shared = &StatsShmem->bgwriter_stats;
+    PgStat_BgWriter *cached = &cached_bgwriterstats;
+
+    pgstat_copy_global_stats(cached, shared, sizeof(PgStat_BgWriter),
+                             &StatsShmem->bgwriter_changecount);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&reset, reset_shared, sizeof(PgStat_BgWriter));
+    LWLockRelease(StatsLock);
+
+    /* compensate by reset offsets */
+    cached->buf_written_clean -= reset.buf_written_clean;
+    cached->maxwritten_clean  -= reset.maxwritten_clean;
+    cached->buf_alloc          -= reset.buf_alloc;
+    cached->stat_reset_timestamp = reset.stat_reset_timestamp;
+
+    cached_bgwriterstats_is_valid = true;
+
+    return &cached_bgwriterstats;
+}
+
+/*
+ * ---------
+ * pgstat_fetch_stat_checkpointer() -
+ *
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
+ * ---------
+ */
+PgStat_CheckPointer *
+pgstat_fetch_stat_checkpointer(void)
+{
+    PgStat_CheckPointer reset;
+    PgStat_CheckPointer *reset_shared = &StatsShmem->checkpointer_reset_offset;
+    PgStat_CheckPointer *shared = &StatsShmem->checkpointer_stats;
+    PgStat_CheckPointer *cached = &cached_checkpointerstats;
+
+    pgstat_copy_global_stats(cached, shared, sizeof(PgStat_CheckPointer),
+                             &StatsShmem->checkpointer_changecount);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&reset, reset_shared, sizeof(PgStat_CheckPointer));
+    LWLockRelease(StatsLock);
+
+    /* compensate by reset offsets */
+    cached->timed_checkpoints       -= reset.timed_checkpoints;
+    cached->requested_checkpoints   -= reset.requested_checkpoints;
+    cached->buf_written_checkpoints -= reset.buf_written_checkpoints;
+    cached->buf_written_backend     -= reset.buf_written_backend;
+    cached->buf_fsync_backend       -= reset.buf_fsync_backend;
+    cached->checkpoint_write_time   -= reset.checkpoint_write_time;
+    cached->checkpoint_sync_time    -= reset.checkpoint_sync_time;
+
+    cached_checkpointerstats_is_valid = true;
 
-    return &globalStats;
+    return &cached_checkpointerstats;
 }
 
 /*
  * ---------
  * pgstat_fetch_stat_wal() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the WAL statistics struct.
+ *    Support function for the SQL-callable pgstat* functions. The returned entry
+ *  is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_WalStats *
+PgStat_Wal *
 pgstat_fetch_stat_wal(void)
 {
-    backend_read_statsfile();
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&cached_walstats, &StatsShmem->wal_stats, sizeof(PgStat_Wal));
+    LWLockRelease(StatsLock);
 
-    return &walStats;
+    cached_walstats_is_valid = true;
+
+    return &cached_walstats;
 }
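Reviewer note on the archiver, bgwriter and checkpointer fetch functions above: they never zero the shared counters. Instead they subtract a "reset offset" from a local copy. The sketch below illustrates that idea in plain C. It assumes that a reset records the current shared values as the offset; the reset path is not shown in this part of the patch, so that is an assumption. All `offset_demo_*` names are hypothetical; only the field names are borrowed from the diff. Locking and the changecount protocol are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cut-down counterpart of the bgwriter counters. */
typedef struct OffsetDemoStats
{
    uint64_t    buf_written_clean;
    uint64_t    buf_alloc;
} OffsetDemoStats;

static OffsetDemoStats offset_demo_shared;        /* only ever grows */
static OffsetDemoStats offset_demo_reset_offset;  /* values at last reset */

/* Writer side: accumulate local deltas into the shared counters. */
static void
offset_demo_report(uint64_t written, uint64_t alloc)
{
    offset_demo_shared.buf_written_clean += written;
    offset_demo_shared.buf_alloc += alloc;
}

/* Assumed reset behaviour: remember the current totals as the offset. */
static void
offset_demo_reset(void)
{
    offset_demo_reset_offset = offset_demo_shared;
}

/* Reader side: copy the totals, then compensate by the reset offset. */
static OffsetDemoStats
offset_demo_fetch(void)
{
    OffsetDemoStats cached = offset_demo_shared;

    cached.buf_written_clean -= offset_demo_reset_offset.buf_written_clean;
    cached.buf_alloc -= offset_demo_reset_offset.buf_alloc;
    return cached;
}
```

With this arrangement the process that owns the counters only ever adds to them, so a reset requested by another backend does not have to write into the counters themselves.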
 
 /*
@@ -2709,9 +3365,27 @@ pgstat_fetch_stat_wal(void)
 PgStat_SLRUStats *
 pgstat_fetch_slru(void)
 {
-    backend_read_statsfile();
+    size_t size = sizeof(PgStat_SLRUStats) * SLRU_NUM_ELEMENTS;
 
-    return slruStats;
+    for (;;)
+    {
+        uint32 before_count;
+        uint32 after_count;
+
+        pg_read_barrier();
+        before_count = pg_atomic_read_u32(&StatsShmem->slru_changecount);
+        memcpy(&cached_slrustats, &StatsShmem->slru_stats, size);
+        after_count = pg_atomic_read_u32(&StatsShmem->slru_changecount);
+
+        if (before_count == after_count && (before_count & 1) == 0)
+            break;
+
+        CHECK_FOR_INTERRUPTS();
+    }
+
+    cached_slrustats_is_valid = true;
+
+    return &cached_slrustats;
 }
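Reviewer note on the retry loop in pgstat_fetch_slru(): it is the reader half of a changecount (seqlock-style) protocol, where a single writer makes the counter odd while it modifies the data and even again afterwards. Below is a minimal sketch using C11 atomics instead of the pg_atomic_* API. All `cc_demo_*` names are hypothetical, and the placement of the acquire fences on both sides of the copy is my assumption about what a portable version needs, not a description of the patch. Note that the plain memcpy racing with a concurrent writer is formally a data race in C11; the sketch accepts that, as seqlock-style code commonly does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one fixed-size stats struct. */
typedef struct CcDemoStats
{
    uint64_t    archived_count;
    uint64_t    failed_count;
} CcDemoStats;

typedef struct CcDemoShared
{
    atomic_uint changecount;    /* odd while a write is in progress */
    CcDemoStats stats;
} CcDemoShared;

/* Single writer: bump to odd, modify, bump back to even. */
static void
cc_demo_report(CcDemoShared *sh, int failed)
{
    unsigned    before = atomic_fetch_add(&sh->changecount, 1);

    assert((before & 1) == 0);
    (void) before;              /* quiet unused warning under NDEBUG */

    if (failed)
        sh->stats.failed_count++;
    else
        sh->stats.archived_count++;

    atomic_fetch_add(&sh->changecount, 1);
}

/* Reader: retry until the counter is even and unchanged across the copy. */
static void
cc_demo_read(CcDemoShared *sh, CcDemoStats *dst)
{
    for (;;)
    {
        unsigned    before;
        unsigned    after;

        before = atomic_load(&sh->changecount);
        atomic_thread_fence(memory_order_acquire);
        memcpy(dst, &sh->stats, sizeof(CcDemoStats));
        atomic_thread_fence(memory_order_acquire);
        after = atomic_load(&sh->changecount);

        if (before == after && (before & 1) == 0)
            break;
    }
}
```

The reader never blocks the writer; it pays for that by possibly copying more than once when a write overlaps the copy.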
 
 
@@ -2925,8 +3599,8 @@ pgstat_initialize(void)
         MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
     }
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* needs to be called before dsm shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -3102,12 +3776,15 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+    /* attach shared database stats area */
+    attach_shared_stats();
 }
 
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
- * Flush any remaining statistics counts out to the collector.
+ * Flush any remaining statistics counts out to shared stats.
  * Without this, operations triggered during backend exit (such as
  * temp table deletions) won't be counted.
  *
@@ -3120,7 +3797,7 @@ pgstat_beshutdown_hook(int code, Datum arg)
 
     /*
      * If we got as far as discovering our own database ID, we can report what
-     * we did to the collector.  Otherwise, we'd be sending an invalid
+     * we did to the shared stats.  Otherwise, we'd be sending an invalid
      * database ID, so forget it.  (This means that accesses to pg_database
      * during failed backend starts might never get counted.)
      */
@@ -3137,6 +3814,10 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    cleanup_dropped_stats_entries();
+
+    detach_shared_stats(true);
 }
 
 
@@ -3397,7 +4078,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3692,8 +4374,8 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
+        case WAIT_EVENT_READING_STATS_FILE:
+            event_name = "ReadingStatsFile";
             break;
         case WAIT_EVENT_RECOVERY_WAL_STREAM:
             event_name = "RecoveryWalStream";
@@ -4347,94 +5029,80 @@ pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
 
 
 /* ----------
- * pgstat_setheader() -
+ * pgstat_report_archiver() -
  *
- *        Set common header fields in a statistics message
+ *        Report archiver statistics
  * ----------
  */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
+void
+pgstat_report_archiver(const char *xlog, bool failed)
 {
-    int            rc;
+    TimestampTz now = GetCurrentTimestamp();
+    uint32    before_count PG_USED_FOR_ASSERTS_ONLY;
+    uint32    after_count PG_USED_FOR_ASSERTS_ONLY;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
 
-    ((PgStat_MsgHdr *) msg)->m_size = len;
+    START_CRIT_SECTION();
+    before_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->archiver_changecount, 1);
+    Assert((before_count & 1) == 0);
 
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
+    if (failed)
     {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
- *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
- * ----------
- */
-void
-pgstat_send_archiver(const char *xlog, bool failed)
-{
-    PgStat_MsgArchiver msg;
+        ++StatsShmem->archiver_stats.failed_count;
+        memcpy(&StatsShmem->archiver_stats.last_failed_wal, xlog,
+               sizeof(StatsShmem->archiver_stats.last_failed_wal));
+        StatsShmem->archiver_stats.last_failed_timestamp = now;
+    }
+    else
+    {
+        ++StatsShmem->archiver_stats.archived_count;
+        memcpy(&StatsShmem->archiver_stats.last_archived_wal, xlog,
+               sizeof(StatsShmem->archiver_stats.last_archived_wal));
+        StatsShmem->archiver_stats.last_archived_timestamp = now;
+    }
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    strlcpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+    after_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->archiver_changecount, 1);
+    Assert(after_count == before_count + 1);
+    END_CRIT_SECTION();
 }
 
 /* ----------
- * pgstat_send_bgwriter() -
+ * pgstat_report_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
-pgstat_send_bgwriter(void)
+pgstat_report_bgwriter(void)
 {
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+    PgStat_BgWriter *s = &StatsShmem->bgwriter_stats;
+    PgStat_BgWriter *l = &BgWriterStats;
+    uint32    before_count PG_USED_FOR_ASSERTS_ONLY;
+    uint32    after_count PG_USED_FOR_ASSERTS_ONLY;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid updating shared stats when there is nothing to report.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    START_CRIT_SECTION();
+    before_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->bgwriter_changecount, 1);
+    Assert((before_count & 1) == 0);
+
+    s->buf_written_clean += l->buf_written_clean;
+    s->maxwritten_clean += l->maxwritten_clean;
+    s->buf_alloc += l->buf_alloc;
+
+    after_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->bgwriter_changecount, 1);
+    Assert(after_count == before_count + 1);
+    END_CRIT_SECTION();
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4443,509 +5111,134 @@ pgstat_send_bgwriter(void)
 }
 
 /* ----------
- * pgstat_send_wal() -
+ * pgstat_report_checkpointer() -
  *
- *        Send WAL statistics to the collector
+ *        Report checkpointer statistics
  * ----------
  */
 void
-pgstat_send_wal(void)
+pgstat_report_checkpointer(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgWal all_zeroes;
+    static const PgStat_CheckPointer all_zeroes;
+    PgStat_CheckPointer *s = &StatsShmem->checkpointer_stats;
+    PgStat_CheckPointer *l = &CheckPointerStats;
+    uint32    before_count PG_USED_FOR_ASSERTS_ONLY;
+    uint32    after_count PG_USED_FOR_ASSERTS_ONLY;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid updating shared stats when there is nothing to report.
      */
-    if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
+    if (memcmp(&CheckPointerStats, &all_zeroes, sizeof(PgStat_CheckPointer)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&WalStats.m_hdr, PGSTAT_MTYPE_WAL);
-    pgstat_send(&WalStats, sizeof(WalStats));
-
+    START_CRIT_SECTION();
+    before_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->checkpointer_changecount, 1);
+    Assert((before_count & 1) == 0);
+
+    s->timed_checkpoints += l->timed_checkpoints;
+    s->requested_checkpoints += l->requested_checkpoints;
+    s->checkpoint_write_time += l->checkpoint_write_time;
+    s->checkpoint_sync_time += l->checkpoint_sync_time;
+    s->buf_written_checkpoints += l->buf_written_checkpoints;
+    s->buf_written_backend += l->buf_written_backend;
+    s->buf_fsync_backend += l->buf_fsync_backend;
+
+    after_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->checkpointer_changecount, 1);
+    Assert(after_count == before_count + 1);
+    END_CRIT_SECTION();
     /*
      * Clear out the statistics buffer, so it can be re-used.
      */
-    MemSet(&WalStats, 0, sizeof(WalStats));
+    MemSet(&CheckPointerStats, 0, sizeof(CheckPointerStats));
 }
 
 /* ----------
- * pgstat_send_slru() -
+ * pgstat_report_wal() -
  *
- *        Send SLRU statistics to the collector
+ *        Report WAL statistics
  * ----------
  */
-static void
-pgstat_send_slru(void)
+void
+pgstat_report_wal(void)
 {
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgSLRU all_zeroes;
-
-    for (int i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /*
-         * This function can be called even if nothing at all has happened. In
-         * this case, avoid sending a completely empty message to the stats
-         * collector.
-         */
-        if (memcmp(&SLRUStats[i], &all_zeroes, sizeof(PgStat_MsgSLRU)) == 0)
-            continue;
-
-        /* set the SLRU type before each send */
-        SLRUStats[i].m_index = i;
-
-        /*
-         * Prepare and send the message
-         */
-        pgstat_setheader(&SLRUStats[i].m_hdr, PGSTAT_MTYPE_SLRU);
-        pgstat_send(&SLRUStats[i], sizeof(PgStat_MsgSLRU));
-
-        /*
-         * Clear out the statistics buffer, so it can be re-used.
-         */
-        MemSet(&SLRUStats[i], 0, sizeof(PgStat_MsgSLRU));
-    }
+    flush_walstat(false);
 }
 
-
 /* ----------
- * PgstatCollectorMain() -
+ * get_local_dbstat_entry() -
  *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-    WaitEvent    event;
-    WaitEventSet *wes;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, SignalHandlerForConfigReload);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    MyBackendType = B_STATS_COLLECTOR;
-    init_ps_display(NULL);
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /* Prepare to wait for our latch or data in our socket. */
-    wes = CreateWaitEventSet(CurrentMemoryContext, 3);
-    AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
-    AddWaitEventToSet(wes, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL, NULL);
-    AddWaitEventToSet(wes, WL_SOCKET_READABLE, pgStatSock, NULL, NULL);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do check ConfigReloadPending
-     * inside the inner loop, which means that such interrupts will get
-     * serviced but the latch won't get cleared until next time there is a
-     * break in the action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (ShutdownRequestPending)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * ShutdownRequestPending becomes set.
-         */
-        while (!ShutdownRequestPending)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (ConfigReloadPending)
-            {
-                ConfigReloadPending = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSLRUCOUNTER:
-                    pgstat_recv_resetslrucounter(&msg.msg_resetslrucounter,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_WAL:
-                    pgstat_recv_wal(&msg.msg_wal, len);
-                    break;
-
-                case PGSTAT_MTYPE_SLRU:
-                    pgstat_recv_slru(&msg.msg_slru, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitEventSetWait(wes, -1L, &event, 1, WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitEventSetWait(wes, 2 * 1000L /* msec */ , &event, 1,
-                              WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr == 1 && event.events == WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    FreeWaitEventSet(wes);
-
-    exit(0);
-}
-
-/*
- * Subroutine to clear stats in a database entry
- *
- * Tables and functions hashes are initialized to empty.
- */
-static void
-reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
-{
-    HASHCTL        hash_ctl;
-
-    dbentry->n_xact_commit = 0;
-    dbentry->n_xact_rollback = 0;
-    dbentry->n_blocks_fetched = 0;
-    dbentry->n_blocks_hit = 0;
-    dbentry->n_tuples_returned = 0;
-    dbentry->n_tuples_fetched = 0;
-    dbentry->n_tuples_inserted = 0;
-    dbentry->n_tuples_updated = 0;
-    dbentry->n_tuples_deleted = 0;
-    dbentry->last_autovac_time = 0;
-    dbentry->n_conflict_tablespace = 0;
-    dbentry->n_conflict_lock = 0;
-    dbentry->n_conflict_snapshot = 0;
-    dbentry->n_conflict_bufferpin = 0;
-    dbentry->n_conflict_startup_deadlock = 0;
-    dbentry->n_temp_files = 0;
-    dbentry->n_temp_bytes = 0;
-    dbentry->n_deadlocks = 0;
-    dbentry->n_checksum_failures = 0;
-    dbentry->last_checksum_failure = 0;
-    dbentry->n_block_read_time = 0;
-    dbentry->n_block_write_time = 0;
-
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
-
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
-}
-
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+ *  Find or create a local PgStat_StatDBEntry entry for dbid.  A new entry is
+ *  created and initialized if none exists.
  */
 static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
+get_local_dbstat_entry(Oid dbid)
 {
-    PgStat_StatDBEntry *result;
+    PgStat_StatDBEntry *dbentry;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
-        return NULL;
 
     /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
+     * Find an entry or create a new one.
      */
-    if (!found)
-        reset_dbentry_counters(result);
+    dbentry = (PgStat_StatDBEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid,
+                             true, &found);
 
-    return result;
+    return dbentry;
 }
 
-
-/*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+/* ----------
+ * get_local_tabstat_entry() -
+ *  Find or create a PgStat_TableStatus entry for rel.  A new entry is created
+ *  and initialized if none exists.
+ * ----------
  */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+static PgStat_TableStatus *
+get_local_tabstat_entry(Oid rel_id, bool isshared)
 {
-    PgStat_StatTabEntry *result;
+    PgStat_TableStatus *tabentry;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
 
-    /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
+    tabentry = (PgStat_TableStatus *)
+        get_local_stat_entry(PGSTAT_TYPE_TABLE,
+                             isshared ? InvalidOid : MyDatabaseId,
+                             rel_id, true, &found);
 
-    if (!create && !found)
-        return NULL;
-
-    /* If not found, initialize the new one. */
-    if (!found)
-    {
-        result->numscans = 0;
-        result->tuples_returned = 0;
-        result->tuples_fetched = 0;
-        result->tuples_inserted = 0;
-        result->tuples_updated = 0;
-        result->tuples_deleted = 0;
-        result->tuples_hot_updated = 0;
-        result->n_live_tuples = 0;
-        result->n_dead_tuples = 0;
-        result->changes_since_analyze = 0;
-        result->inserts_since_vacuum = 0;
-        result->blocks_fetched = 0;
-        result->blocks_hit = 0;
-        result->vacuum_timestamp = 0;
-        result->vacuum_count = 0;
-        result->autovac_vacuum_timestamp = 0;
-        result->autovac_vacuum_count = 0;
-        result->analyze_timestamp = 0;
-        result->analyze_count = 0;
-        result->autovac_analyze_timestamp = 0;
-        result->autovac_analyze_count = 0;
-    }
-
-    return result;
+    return tabentry;
 }
 
-
 /* ----------
- * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
+ * pgstat_write_statsfile() -
+ *        Write the global statistics file, as well as DB files.
  *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ * This function is called in the last process that is accessing the shared
+ * stats so locking is not required.
  * ----------
  */
 static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+pgstat_write_statsfile(void)
 {
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
+    dshash_seq_status hstat;
+    PgStatHashEntry *ps;
+
+    /* Stats are not initialized yet; just return. */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
+
+    /* this is the last process that is accessing the shared stats */
+#ifdef USE_ASSERT_CHECKING
+    LWLockAcquire(StatsLock, LW_SHARED);
+    Assert(StatsShmem->refcount == 0);
+    LWLockRelease(StatsLock);
+#endif
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -4965,7 +5258,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    pg_atomic_write_u64(&StatsShmem->stats_timestamp, GetCurrentTimestamp());
 
     /*
      * Write the file header --- currently just a format ID.
@@ -4975,190 +5268,69 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Write global stats struct
+     * Write bgwriter global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(&StatsShmem->bgwriter_stats, sizeof(PgStat_BgWriter), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Write archiver stats struct
+     * Write checkpointer global stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+    rc = fwrite(&StatsShmem->checkpointer_stats, sizeof(PgStat_CheckPointer), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Write WAL stats struct
+     * Write archiver global stats struct
      */
-    rc = fwrite(&walStats, sizeof(walStats), 1, fpout);
+    rc = fwrite(&StatsShmem->archiver_stats, sizeof(PgStat_Archiver), 1,
+                fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write WAL global stats struct
+     */
+    rc = fwrite(&StatsShmem->wal_stats, sizeof(PgStat_Wal), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write SLRU stats struct
      */
-    rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
+    rc = fwrite(&StatsShmem->slru_stats, sizeof(PgStatSharedSLRUStats), 1,
+                fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Walk through the database table.
+     * Walk through the stats entries
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((ps = dshash_seq_next(&hstat)) != NULL)
     {
-        /*
-         * Write out the table and function stats for this DB into the
-         * appropriate per-DB stat file, if required.
-         */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
+        PgStat_StatDBEntry *dbentry;
+        void               *pent;
+        size_t                len;
+
+        CHECK_FOR_INTERRUPTS();
+
+        pent = dsa_get_address(area, ps->body);
+
+        /* Make DB's timestamp consistent with the global stats */
+        if (ps->key.type == PGSTAT_TYPE_DB)
         {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+            dbentry = (PgStat_StatDBEntry *) pent;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
+            dbentry->stats_timestamp =
+                (TimestampTz) pg_atomic_read_u64(&StatsShmem->stats_timestamp);
         }
 
-        /*
-         * Write out the DB entry. We don't write the tables or functions
-         * pointers, since they're of no use to any other process.
-         */
-        fputc('D', fpout);
-        rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
-}
-
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
+        fputc('S', fpout);
+        rc = fwrite(&ps->key, sizeof(PgStatHashKey), 1, fpout);
 
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
-
-/* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
- * ----------
- */
-static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
-{
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpout;
-    int32        format_id;
-    Oid            dbid = dbentry->databaseid;
-    int            rc;
-    char        tmpfile[MAXPGPATH];
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database's access stats per table.
-     */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
-    {
-        fputc('T', fpout);
-        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * Walk through the database's function stats table.
-     */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+        /* Exclude header part of the entry */
+        len = PGSTAT_SHENT_BODY_LEN(ps->key.type);
+        rc = fwrite(PGSTAT_SHENT_BODY(pent), len, 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_seq_term(&hstat);
 
     /*
      * No more output to be done. Close the temp file and replace the old
@@ -5192,104 +5364,63 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
 }
 
 /* ----------
- * pgstat_read_statsfiles() -
- *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ * pgstat_read_statsfile() -
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
+ *    Reads in existing activity statistics file into the shared stats hash.
  *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
+ * This function is called in the only process that is accessing the shared
+ * stats so locking is not required.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+static void
+pgstat_read_statsfile(void)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-    int            i;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    char        tag;
 
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
+    /* shouldn't be called from postmaster */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    /* this is the only process that is accessing the shared stats */
+#ifdef USE_ASSERT_CHECKING
+    LWLockAcquire(StatsLock, LW_SHARED);
+    Assert(StatsShmem->refcount == 1);
+    LWLockRelease(StatsLock);
+#endif
 
-    /*
-     * Clear out global, archiver, WAL and SLRU statistics so they start from
-     * zero in case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-    memset(&walStats, 0, sizeof(walStats));
-    memset(&slruStats, 0, sizeof(slruStats));
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-    walStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Set the same reset timestamp for all SLRU items too.
-     */
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-        slruStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    StatsShmem->bgwriter_stats.stat_reset_timestamp = GetCurrentTimestamp();
+    StatsShmem->archiver_stats.stat_reset_timestamp =
+        StatsShmem->bgwriter_stats.stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the activity statistics is not running or
+     * has not yet written the stats file the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -5298,647 +5429,149 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
         goto done;
     }
 
     /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
+     * Read bgwriter stats struct
      */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
+    if (fread(&StatsShmem->bgwriter_stats, 1, sizeof(PgStat_BgWriter), fpin) !=
+        sizeof(PgStat_BgWriter))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
+        MemSet(&StatsShmem->bgwriter_stats, 0, sizeof(PgStat_BgWriter));
         goto done;
     }
 
     /*
-     * Read WAL stats struct
-     */
-    if (fread(&walStats, 1, sizeof(walStats), fpin) != sizeof(walStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&walStats, 0, sizeof(walStats));
-        goto done;
-    }
-
-    /*
-     * Read SLRU stats struct
-     */
-    if (fread(slruStats, 1, sizeof(slruStats), fpin) != sizeof(slruStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&slruStats, 0, sizeof(slruStats));
-        goto done;
-    }
-
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Add to the DB hash
-                 */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-
-    return dbhash;
-}
-
-
-/* ----------
- * pgstat_read_db_statsfile() -
- *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
- * ----------
- */
-static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
-{
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatTabEntry tabbuf;
-    PgStat_StatFuncEntry funcbuf;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return;
-    }
-
-    /*
-     * Verify it's of the expected format.
+     * Read checkpointer stats struct
      */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
+    if (fread(&StatsShmem->checkpointer_stats, 1, sizeof(PgStat_CheckPointer),
+              fpin) != sizeof(PgStat_CheckPointer))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
+        MemSet(&StatsShmem->checkpointer_stats, 0, sizeof(PgStat_CheckPointer));
         goto done;
     }
 
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'T'    A PgStat_StatTabEntry follows.
-                 */
-            case 'T':
-                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
-                          fpin) != sizeof(PgStat_StatTabEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
-                break;
-
-                /*
-                 * 'F'    A PgStat_StatFuncEntry follows.
-                 */
-            case 'F':
-                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
-                          fpin) != sizeof(PgStat_StatFuncEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
-                break;
-
-                /*
-                 * 'E'    The EOF marker of a complete stats file.
-                 */
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts. The caller must
- *    rely on timestamp stored in *ts iff the function returns true.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    PgStat_WalStats myWalStats;
-    PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
     /*
      * Read archiver stats struct
      */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
+    if (fread(&StatsShmem->archiver_stats, 1, sizeof(PgStat_Archiver),
+              fpin) != sizeof(PgStat_Archiver))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
+        MemSet(&StatsShmem->archiver_stats, 0, sizeof(PgStat_Archiver));
+        goto done;
     }
 
     /*
      * Read WAL stats struct
      */
-    if (fread(&myWalStats, 1, sizeof(myWalStats), fpin) != sizeof(myWalStats))
+    if (fread(&StatsShmem->wal_stats, 1, sizeof(PgStat_Wal), fpin)
+        != sizeof(PgStat_Wal))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
+        MemSet(&StatsShmem->wal_stats, 0, sizeof(PgStat_Wal));
+        goto done;
     }
 
     /*
      * Read SLRU stats struct
      */
-    if (fread(mySLRUStats, 1, sizeof(mySLRUStats), fpin) != sizeof(mySLRUStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
+    if (fread(&StatsShmem->slru_stats, 1, sizeof(PgStatSharedSLRUStats),
+              fpin) != sizeof(PgStatSharedSLRUStats))
     {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-        }
+        ereport(LOG,
+                (errmsg("corrupted statistics file \"%s\"", statfile)));
+        goto done;
     }
 
-done:
-    FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
     /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
+    while ((tag = fgetc(fpin)) == 'S')
     {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
+        PgStatHashKey        key;
+        PgStat_StatEntryHeader *p;
+        size_t                len;
 
         CHECK_FOR_INTERRUPTS();
 
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
+        if (fread(&key, 1, sizeof(key), fpin) != sizeof(key))
         {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"", statfile)));
+            goto done;
         }
 
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
+        p = get_stat_entry(key.type, key.databaseid, key.objectid,
+                              false, true, &found);
 
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
+        /* don't allow duplicate entries */
+        if (found)
+        {
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"",
+                            statfile)));
+            goto done;
         }
 
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
-
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
+        /* Avoid overwriting header part */
+        len = PGSTAT_SHENT_BODY_LEN(key.type);
+        if (fread(PGSTAT_SHENT_BODY(p), 1, len, fpin) != len)
+        {
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"", statfile)));
+            goto done;
+        }
     }
 
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
+    if (tag != 'E')
+    {
         ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
+                (errmsg("corrupted statistics file \"%s\"",
+                        statfile)));
+        goto done;
+    }
 
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
+done:
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+
+    return;
 }
 
-
 /* ----------
  * pgstat_setup_memcxt() -
  *
- *    Create pgStatLocalContext, if not already done.
+ *    Create pgStatLocalContext if not already done.
  * ----------
  */
 static void
 pgstat_setup_memcxt(void)
 {
     if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
-}
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
 
+    if (!pgStatCacheContext)
+        pgStatCacheContext =
+            AllocSetContextCreate(CacheMemoryContext,
+                                  "Activity statistics",
+                                  ALLOCSET_SMALL_SIZES);
+}
 
 /* ----------
  * pgstat_clear_snapshot() -
@@ -5955,813 +5588,24 @@ pgstat_clear_snapshot(void)
 {
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
+    {
         MemoryContextDelete(pgStatLocalContext);
 
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-                tabentry->inserts_since_vacuum = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_WAL)
-    {
-        /* Reset the WAL statistics for the cluster. */
-        memset(&walStats, 0, sizeof(walStats));
-        walStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_resetslrucounter() -
- *
- *    Reset some SLRU statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len)
-{
-    int            i;
-    TimestampTz ts = GetCurrentTimestamp();
-
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /* reset entry with the given index, or all entries (index is -1) */
-        if ((msg->m_index == -1) || (msg->m_index == i))
-        {
-            memset(&slruStats[i], 0, sizeof(slruStats[i]));
-            slruStats[i].stat_reset_timestamp = ts;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signaling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * It is quite possible that a non-aggressive VACUUM ended up skipping
-     * various pages, however, we'll zero the insert counter here regardless.
-     * It's currently used only to track when we need to perform an "insert"
-     * autovacuum, which are mainly intended to freeze newly inserted tuples.
-     * Zeroing this may just mean we'll not try to vacuum the table again
-     * until enough tuples have been inserted to trigger another insert
-     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
-     * stragglers.
-     */
-    tabentry->inserts_since_vacuum = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_wal() -
- *
- *    Process a WAL message.
- * ----------
- */
-static void
-pgstat_recv_wal(PgStat_MsgWal *msg, int len)
-{
-    walStats.wal_buffers_full += msg->m_wal_buffers_full;
-}
-
-/* ----------
- * pgstat_recv_slru() -
- *
- *    Process a SLRU message.
- * ----------
- */
-static void
-pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
-{
-    slruStats[msg->m_index].blocks_zeroed += msg->m_blocks_zeroed;
-    slruStats[msg->m_index].blocks_hit += msg->m_blocks_hit;
-    slruStats[msg->m_index].blocks_read += msg->m_blocks_read;
-    slruStats[msg->m_index].blocks_written += msg->m_blocks_written;
-    slruStats[msg->m_index].blocks_exists += msg->m_blocks_exists;
-    slruStats[msg->m_index].flush += msg->m_flush;
-    slruStats[msg->m_index].truncate += msg->m_truncate;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
+    }
+
+    /* Invalidate the simple cache keys */
+    cached_dbent_key = stathashkey_zero;
+    cached_tabent_key = stathashkey_zero;
+    cached_funcent_key = stathashkey_zero;
+    cached_archiverstats_is_valid = false;
+    cached_bgwriterstats_is_valid = false;
+    cached_checkpointerstats_is_valid = false;
+    cached_walstats_is_valid = false;
+    cached_slrustats_is_valid = false;
 }
 
 /*
@@ -6852,7 +5696,7 @@ pgstat_slru_name(int slru_idx)
  * Returns pointer to entry with counters for given SLRU (based on the name
  * stored in SlruCtl as lwlock tranche name).
  */
-static inline PgStat_MsgSLRU *
+static PgStat_SLRUStats *
 slru_entry(int slru_idx)
 {
     /*
@@ -6863,7 +5707,7 @@ slru_entry(int slru_idx)
 
     Assert((slru_idx >= 0) && (slru_idx < SLRU_NUM_ELEMENTS));
 
-    return &SLRUStats[slru_idx];
+    return &local_SLRUStats[slru_idx];
 }
 
 /*
@@ -6873,41 +5717,48 @@ slru_entry(int slru_idx)
 void
 pgstat_count_slru_page_zeroed(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_zeroed += 1;
+    slru_entry(slru_idx)->blocks_zeroed += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_hit(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_hit += 1;
+    slru_entry(slru_idx)->blocks_hit += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_exists(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_exists += 1;
+    slru_entry(slru_idx)->blocks_exists += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_read(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_read += 1;
+    slru_entry(slru_idx)->blocks_read += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_written(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_written += 1;
+    slru_entry(slru_idx)->blocks_written += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_flush(int slru_idx)
 {
-    slru_entry(slru_idx)->m_flush += 1;
+    slru_entry(slru_idx)->flush += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_truncate(int slru_idx)
 {
-    slru_entry(slru_idx)->m_truncate += 1;
+    slru_entry(slru_idx)->truncate += 1;
+    have_slrustats = true;
 }
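
The pgstat.c hunks above change the SLRU counters from fields of an outgoing message into a backend-local array, and raise have_slrustats on every increment. A minimal sketch of that pattern follows, assuming the local counters are later folded into a shared copy by a flush step that this part of the diff does not show. Every sketch_* name, the trimmed struct, and the flush function are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NUM_SLRU 4		/* hypothetical stand-in for SLRU_NUM_ELEMENTS */

/* Hypothetical, trimmed stand-in for PgStat_SLRUStats. */
typedef struct SketchSLRUStats
{
	int64_t		blocks_zeroed;
	int64_t		blocks_hit;
} SketchSLRUStats;

/* Backend-local counters, updated without any locking. */
SketchSLRUStats sketch_local[SKETCH_NUM_SLRU];

/* Plain array standing in for the copy that would live in shared memory. */
SketchSLRUStats sketch_shared[SKETCH_NUM_SLRU];

/* True when sketch_local holds increments that were not flushed yet. */
bool		sketch_have_stats = false;

static SketchSLRUStats *
sketch_slru_entry(int idx)
{
	assert(idx >= 0 && idx < SKETCH_NUM_SLRU);
	return &sketch_local[idx];
}

void
sketch_count_page_zeroed(int idx)
{
	sketch_slru_entry(idx)->blocks_zeroed += 1;
	sketch_have_stats = true;
}

void
sketch_count_page_hit(int idx)
{
	sketch_slru_entry(idx)->blocks_hit += 1;
	sketch_have_stats = true;
}

/*
 * Fold the local counters into the shared copy.  Returns false when there
 * was nothing to do.  Assumption: real code would hold a lock around the
 * loop; the sketch is single-process and takes none.
 */
bool
sketch_flush(void)
{
	int			i;

	if (!sketch_have_stats)
		return false;			/* cheap exit, nothing was counted */

	for (i = 0; i < SKETCH_NUM_SLRU; i++)
	{
		sketch_shared[i].blocks_zeroed += sketch_local[i].blocks_zeroed;
		sketch_shared[i].blocks_hit += sketch_local[i].blocks_hit;
	}

	memset(sketch_local, 0, sizeof(sketch_local));
	sketch_have_stats = false;
	return true;
}
```

The flag lets the flush step return immediately when nothing was counted, which is presumably why each counting function in the patch sets it.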
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b811c961a6..526021def2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -257,7 +257,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -518,7 +517,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1340,12 +1338,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1794,11 +1786,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2728,8 +2715,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -3056,8 +3041,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3124,13 +3107,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3203,22 +3179,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3679,22 +3639,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3914,8 +3858,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3939,8 +3881,7 @@ PostmasterStateMachine(void)
     {
         /*
          * PM_WAIT_DEAD_END state ends when the BackendList is entirely empty
-         * (ie, no dead_end children remain), and the archiver and stats
-         * collector are gone too.
+         * (ie, no dead_end children remain), and the archiver is gone too.
          *
          * The reason we wait for those two is to protect them against a new
          * postmaster starting conflicting subprocesses; this isn't an
@@ -3950,8 +3891,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4152,8 +4092,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5130,18 +5068,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5260,12 +5186,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6170,7 +6090,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6226,8 +6145,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6462,7 +6379,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index b89df01fa7..57531d7d48 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1556,8 +1556,8 @@ is_checksummed_file(const char *fullpath, const char *filename)
  *
  * If 'missing_ok' is true, will not throw an error if the file is not found.
  *
- * If dboid is anything other than InvalidOid then any checksum failures detected
- * will get reported to the stats collector.
+ * If dboid is anything other than InvalidOid then any checksum failures
+ * detected will get reported to the activity stats facility.
  *
  * Returns true if the file was successfully sent, false if 'missing_ok',
  * and the file did not exist.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e549fa1d30..5ee7110444 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2045,7 +2045,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                CheckPointerStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2155,7 +2155,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2345,7 +2345,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2353,7 +2353,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
 
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 96c2aaabbd..55693cfa3a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -149,6 +149,7 @@ CreateSharedMemoryAndSemaphores(void)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -267,6 +268,7 @@ CreateSharedMemoryAndSemaphores(void)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 2fa90cc095..17a46db74b 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -177,7 +177,9 @@ static const char *const BuiltinTrancheNames[] = {
     /* LWTRANCHE_PARALLEL_APPEND: */
     "ParallelAppend",
     /* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
-    "PerXactPredicateList"
+    "PerXactPredicateList",
+    /* LWTRANCHE_STATS: */
+    "ActivityStatistics"
 };
 
 StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index 774292fd94..91bf8d5b5d 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -53,3 +53,4 @@ XactTruncationLock                    44
 # 45 was XactTruncationLock until removal of BackendRandomLock
 WrapLimitsVacuumLock                46
 NotifyQueueTailLock                    47
+StatsLock                            48
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index dcc09df0c7..0dec4b9145 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -415,8 +415,8 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
     DropRelFileNodesAllBuffers(rnodes, nrels);
 
     /*
-     * It'd be nice to tell the stats collector to forget them immediately,
-     * too. But we can't because we don't know the OIDs.
+     * It'd be nice to tell the activity stats facility to forget them
+     * immediately, too. But we can't because we don't know the OIDs.
      */
 
     /*
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 411cfadbff..5043736f1f 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3209,6 +3209,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3783,6 +3789,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4179,11 +4186,12 @@ PostgresMain(int argc, char *argv[],
          * Note: this includes fflush()'ing the last of the prior output.
          *
          * This is also a good time to send collected statistics to the
-         * collector, and to update the PS stats display.  We avoid doing
-         * those every time through the message loop because it'd slow down
-         * processing of batched messages, and because we don't want to report
-         * uncommitted updates (that confuses autovacuum).  The notification
-         * processor wants a call too, if we are not in a transaction block.
+         * activity statistics facility, and to update the PS stats display.
+         * We avoid doing those every time through the message loop because
+         * it'd slow down processing of batched messages, and because we don't
+         * want to report uncommitted updates (that confuses autovacuum).  The
+         * notification processor wants a call too, if we are not in a
+         * transaction block.
          */
         if (send_ready_for_query)
         {
@@ -4215,6 +4223,8 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long stats_timeout;
+
                 /* Send out notify signals and transmit self-notifies */
                 ProcessCompletedNotifies();
 
@@ -4227,8 +4237,13 @@ PostgresMain(int argc, char *argv[],
                 if (notifyInterruptPending)
                     ProcessNotifyInterrupt();
 
-                pgstat_report_stat(false);
-
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle");
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4263,7 +4278,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4271,6 +4286,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 24e191ea30..de0b5f3a97 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * pgstatfuncs.c
- *      Functions for accessing the statistics collector data
+ *      Functions for accessing the activity statistics data
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -35,9 +35,6 @@
 
 #define HAS_PGSTAT_PERMISSIONS(role)     (is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS) ||
has_privs_of_role(GetUserId(),role))
 
 
-/* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
-
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
 {
@@ -1267,7 +1264,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_commit);
+        result = (int64) (dbentry->counts.n_xact_commit);
 
     PG_RETURN_INT64(result);
 }
@@ -1283,7 +1280,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_rollback);
+        result = (int64) (dbentry->counts.n_xact_rollback);
 
     PG_RETURN_INT64(result);
 }
@@ -1299,7 +1296,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_fetched);
+        result = (int64) (dbentry->counts.n_blocks_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1315,7 +1312,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_hit);
+        result = (int64) (dbentry->counts.n_blocks_hit);
 
     PG_RETURN_INT64(result);
 }
@@ -1331,7 +1328,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_returned);
+        result = (int64) (dbentry->counts.n_tuples_returned);
 
     PG_RETURN_INT64(result);
 }
@@ -1347,7 +1344,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_fetched);
+        result = (int64) (dbentry->counts.n_tuples_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1363,7 +1360,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_inserted);
+        result = (int64) (dbentry->counts.n_tuples_inserted);
 
     PG_RETURN_INT64(result);
 }
@@ -1379,7 +1376,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_updated);
+        result = (int64) (dbentry->counts.n_tuples_updated);
 
     PG_RETURN_INT64(result);
 }
@@ -1395,7 +1392,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_deleted);
+        result = (int64) (dbentry->counts.n_tuples_deleted);
 
     PG_RETURN_INT64(result);
 }
@@ -1428,7 +1425,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_files;
+        result = dbentry->counts.n_temp_files;
 
     PG_RETURN_INT64(result);
 }
@@ -1444,7 +1441,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_bytes;
+        result = dbentry->counts.n_temp_bytes;
 
     PG_RETURN_INT64(result);
 }
@@ -1459,7 +1456,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace);
+        result = (int64) (dbentry->counts.n_conflict_tablespace);
 
     PG_RETURN_INT64(result);
 }
@@ -1474,7 +1471,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_lock);
+        result = (int64) (dbentry->counts.n_conflict_lock);
 
     PG_RETURN_INT64(result);
 }
@@ -1489,7 +1486,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_snapshot);
+        result = (int64) (dbentry->counts.n_conflict_snapshot);
 
     PG_RETURN_INT64(result);
 }
@@ -1504,7 +1501,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_bufferpin);
+        result = (int64) (dbentry->counts.n_conflict_bufferpin);
 
     PG_RETURN_INT64(result);
 }
@@ -1519,7 +1516,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1534,11 +1531,11 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace +
-                          dbentry->n_conflict_lock +
-                          dbentry->n_conflict_snapshot +
-                          dbentry->n_conflict_bufferpin +
-                          dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_tablespace +
+                          dbentry->counts.n_conflict_lock +
+                          dbentry->counts.n_conflict_snapshot +
+                          dbentry->counts.n_conflict_bufferpin +
+                          dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1553,7 +1550,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_deadlocks);
+        result = (int64) (dbentry->counts.n_deadlocks);
 
     PG_RETURN_INT64(result);
 }
@@ -1571,7 +1568,7 @@ pg_stat_get_db_checksum_failures(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_checksum_failures);
+        result = (int64) (dbentry->counts.n_checksum_failures);
 
     PG_RETURN_INT64(result);
 }
@@ -1608,7 +1605,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_read_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_read_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1624,7 +1621,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_write_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_write_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1632,69 +1629,71 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_bgwriter_timed_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->timed_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->timed_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_requested_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->requested_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->requested_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_clean);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_written_clean);
 }
 
 Datum
 pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->maxwritten_clean);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->maxwritten_clean);
 }
 
 Datum
 pg_stat_get_checkpoint_write_time(PG_FUNCTION_ARGS)
 {
     /* time is already in msec, just convert to double for presentation */
-    PG_RETURN_FLOAT8((double) pgstat_fetch_global()->checkpoint_write_time);
+    PG_RETURN_FLOAT8((double)
+                     pgstat_fetch_stat_checkpointer()->checkpoint_write_time);
 }
 
 Datum
 pg_stat_get_checkpoint_sync_time(PG_FUNCTION_ARGS)
 {
     /* time is already in msec, just convert to double for presentation */
-    PG_RETURN_FLOAT8((double) pgstat_fetch_global()->checkpoint_sync_time);
+    PG_RETURN_FLOAT8((double)
+                     pgstat_fetch_stat_checkpointer()->checkpoint_sync_time);
 }
 
 Datum
 pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_global()->stat_reset_timestamp);
+    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_stat_bgwriter()->stat_reset_timestamp);
 }
 
 Datum
 pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_backend);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_backend);
 }
 
 Datum
 pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_fsync_backend);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_fsync_backend);
 }
 
 Datum
 pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_alloc);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
 }
 
 /*
@@ -1707,7 +1706,7 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
     TupleDesc    tupdesc;
     Datum        values[PG_STAT_GET_WAL_COLS];
     bool        nulls[PG_STAT_GET_WAL_COLS];
-    PgStat_WalStats *wal_stats;
+    PgStat_Wal *wal_stats;
 
     /* Initialise values and NULL flags arrays */
     MemSet(values, 0, sizeof(values));
@@ -2001,7 +2000,7 @@ pg_stat_get_xact_function_self_time(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_snapshot_timestamp(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_global()->stats_timestamp);
+    PG_RETURN_TIMESTAMPTZ(pgstat_get_stat_timestamp());
 }
 
 /* Discard the active statistics snapshot */
@@ -2075,7 +2074,7 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     TupleDesc    tupdesc;
     Datum        values[7];
     bool        nulls[7];
-    PgStat_ArchiverStats *archiver_stats;
+    PgStat_Archiver *archiver_stats;
 
     /* Initialise values and NULL flags arrays */
     MemSet(values, 0, sizeof(values));
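
Note on the pgstatfuncs.c hunks above: every per-database accessor follows the same shape after this change. It fetches the entry, returns 0 when none exists, and otherwise reads the field through the new counts sub-struct. The following is only a minimal standalone sketch of that shape. DemoCounter, DemoDBCounts, DemoDBEntry and demo_get_db_xact_commit are hypothetical names invented for illustration and are not the definitions in pgstat.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; NOT the real pgstat.h definitions. */
typedef int64_t DemoCounter;

typedef struct DemoDBCounts
{
    DemoCounter n_xact_commit;
    DemoCounter n_xact_rollback;
} DemoDBCounts;

typedef struct DemoDBEntry
{
    uint32_t     databaseid;
    DemoDBCounts counts;    /* plain event counters grouped in one member */
} DemoDBEntry;

/*
 * Mirrors the accessor shape in the diff: a missing entry reads as zero,
 * otherwise the counter is read via the nested counts member.
 */
static int64_t
demo_get_db_xact_commit(const DemoDBEntry *dbentry)
{
    if (dbentry == NULL)
        return 0;
    return (int64_t) dbentry->counts.n_xact_commit;
}
```

Grouping the counters in one member means callers change only the access path (dbentry->counts.x instead of dbentry->x), which is why the hunks above are mechanical one-line substitutions.
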
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 9061af81a3..d23cc2d0a9 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -71,6 +71,7 @@
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/optimizer.h"
+#include "pgstat.h"
 #include "rewrite/rewriteDefine.h"
 #include "rewrite/rowsecurity.h"
 #include "storage/lmgr.h"
@@ -2353,6 +2354,10 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
      */
     RelationCloseSmgr(relation);
 
+    /* break mutual link with stats entry */
+    pgstat_delinkstats(relation);
+
+    if (relation->rd_rel)
     /*
      * Free all the subsidiary data structures of the relcache entry, then the
      * entry itself.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 6ab8216839..883c7f8802 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index ed2ab4b5b2..74fb22f216 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -269,9 +269,6 @@ GetBackendTypeDesc(BackendType backendType)
         case B_ARCHIVER:
             backendDesc = "archiver";
             break;
-        case B_STATS_COLLECTOR:
-            backendDesc = "stats collector";
-            break;
         case B_LOGGER:
             backendDesc = "logger";
             break;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index d4ab4c7e23..4ff4cc33d9 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -631,6 +632,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1245,6 +1248,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
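
Note on the idle stats timeout: taken together, the PostgresMain() and postinit.c hunks suggest the following flow. When the backend goes idle, pgstat_report_stat(false) returns a positive delay in milliseconds if a later retry is wanted, the backend arms IDLE_STATS_UPDATE_TIMEOUT for that delay, the handler only sets a pending flag, and the timer is disarmed when the next command arrives. The sketch below models that flow in standalone C under those assumptions. Every demo_* name is hypothetical, the 1000 ms retry value is made up, and no real PostgreSQL timer or latch API is called.

```c
#include <signal.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the backend's flag and timer state. */
static volatile sig_atomic_t demo_stats_timeout_pending = false;
static bool demo_timer_armed = false;
static long demo_timer_ms = 0;

/*
 * Stand-in for pgstat_report_stat(): assumed to return 0 when nothing is
 * left to flush, otherwise the delay in ms after which to try again.
 */
static long
demo_report_stat(bool flush_succeeded, long retry_ms)
{
    return flush_succeeded ? 0 : retry_ms;
}

/* Called when the backend becomes idle: arm the timer only if needed. */
static void
demo_on_idle(bool flush_succeeded)
{
    long timeout = demo_report_stat(flush_succeeded, 1000);

    if (timeout > 0)
    {
        demo_timer_armed = true;
        demo_timer_ms = timeout;
    }
}

/* Timer handler: does no real work, it only records that the timer fired. */
static void
demo_timeout_handler(void)
{
    demo_stats_timeout_pending = true;
}

/* Called when a new command arrives: disarm any outstanding timer. */
static void
demo_on_command(void)
{
    if (demo_timer_armed)
    {
        demo_timer_armed = false;
        demo_timer_ms = 0;
    }
}
```

Keeping the handler down to setting flags matches the existing IdleInTransactionSessionTimeoutHandler: the actual flush is left to the main loop, where doing non-trivial work is safe.
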
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 596bcb7b84..29eb459e35 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -743,8 +743,8 @@ const char *const config_group_names[] =
     gettext_noop("Statistics"),
     /* STATS_MONITORING */
     gettext_noop("Statistics / Monitoring"),
-    /* STATS_COLLECTOR */
-    gettext_noop("Statistics / Query and Index Statistics Collector"),
+    /* STATS_ACTIVITY */
+    gettext_noop("Statistics / Query and Index Activity Statistics"),
     /* AUTOVACUUM */
     gettext_noop("Autovacuum"),
     /* CLIENT_CONN */
@@ -1454,7 +1454,7 @@ static struct config_bool ConfigureNamesBool[] =
 #endif
 
     {
-        {"track_activities", PGC_SUSET, STATS_COLLECTOR,
+        {"track_activities", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects information about executing commands."),
             gettext_noop("Enables the collection of information on the currently "
                          "executing command of each session, along with "
@@ -1465,7 +1465,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_counts", PGC_SUSET, STATS_COLLECTOR,
+        {"track_counts", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects statistics on database activity."),
             NULL
         },
@@ -1474,7 +1474,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_io_timing", PGC_SUSET, STATS_COLLECTOR,
+        {"track_io_timing", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects timing statistics for database I/O activity."),
             NULL
         },
@@ -4310,7 +4310,7 @@ static struct config_string ConfigureNamesString[] =
     },
 
     {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
+        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
             gettext_noop("Writes temporary statistics files to the specified directory."),
             NULL,
             GUC_SUPERUSER_ONLY
@@ -4646,7 +4646,7 @@ static struct config_enum ConfigureNamesEnum[] =
     },
 
     {
-        {"track_functions", PGC_SUSET, STATS_COLLECTOR,
+        {"track_functions", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects function-level statistics on database activity."),
             NULL
         },
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9cb571f7cc..668a2d033a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -579,7 +579,7 @@
 # STATISTICS
 #------------------------------------------------------------------------------
 
-# - Query and Index Statistics Collector -
+# - Query and Index Activity Statistics -
 
 #track_activities = on
 #track_counts = on
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index f674a7c94e..26603e95e4 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -124,7 +124,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index bba002ad24..b70c495f2a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -83,6 +83,8 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
 
@@ -321,7 +323,6 @@ typedef enum BackendType
     B_WAL_SENDER,
     B_WAL_WRITER,
     B_ARCHIVER,
-    B_STATS_COLLECTOR,
     B_LOGGER,
 } BackendType;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 343eef507e..0bc0e76f46 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL activity statistics facility.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -15,9 +15,11 @@
 #include "libpq/pqcomm.h"
 #include "miscadmin.h"
 #include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -27,8 +29,8 @@
  * ----------
  */
 #define PGSTAT_STAT_PERMANENT_DIRECTORY        "pg_stat"
-#define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
-#define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
+#define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/saved_stats"
+#define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/saved_stats.tmp"
 
 /* Default directory to store temporary statistics data in */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
@@ -81,9 +83,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -157,10 +158,13 @@ typedef enum PgStat_Single_Reset_Type
  */
 typedef struct PgStat_TableStatus
 {
-    Oid            t_id;            /* table's OID */
-    bool        t_shared;        /* is it a shared catalog? */
     struct PgStat_TableXactStatus *trans;    /* lowest subxact's counts */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_TableCounts t_counts;    /* event counts to be sent */
+    Relation    relation;            /* rel that is using this entry */
 } PgStat_TableStatus;
 
 /* ----------
@@ -185,318 +189,57 @@ typedef struct PgStat_TableXactStatus
 } PgStat_TableXactStatus;
 
 
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
+/*
+ * Archiver statistics kept in the shared stats
  */
-
-
-/* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
- */
-typedef struct PgStat_MsgHdr
-{
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
-
-/* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
- * ----------
- */
-typedef struct PgStat_MsgDummy
-{
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
+typedef struct PgStat_Archiver
 {
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
+    PgStat_Counter archived_count;    /* archival successes */
+    char        last_archived_wal[MAX_XFN_CHARS + 1];    /* last WAL file
+                                                         * archived */
+    TimestampTz last_archived_timestamp;    /* last archival success time */
+    PgStat_Counter failed_count;    /* failed archival attempts */
+    char        last_failed_wal[MAX_XFN_CHARS + 1]; /* WAL file involved in
+                                                     * last failure */
+    TimestampTz last_failed_timestamp;    /* last archival failure time */
+    TimestampTz stat_reset_timestamp;
+} PgStat_Archiver;
 
 /* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_TableEntry
+typedef struct PgStat_BgWriter
 {
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_alloc;
+    TimestampTz       stat_reset_timestamp;
+} PgStat_BgWriter;
 
 /* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
+ * PgStat_CheckPointer        checkpointer statistics
  * ----------
  */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
+typedef struct PgStat_CheckPointer
 {
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgResetslrucounter Sent by the backend to tell the collector
- *                                to reset a SLRU counter
- * ----------
- */
-typedef struct PgStat_MsgResetslrucounter
-{
-    PgStat_MsgHdr m_hdr;
-    int            m_index;
-} PgStat_MsgResetslrucounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgWal            Sent by backends and background processes to update WAL statistics.
- * ----------
- */
-typedef struct PgStat_MsgWal
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Counter m_wal_buffers_full;
-} PgStat_MsgWal;
-
-/* ----------
- * PgStat_MsgSLRU            Sent by a backend to update SLRU statistics.
- * ----------
- */
-typedef struct PgStat_MsgSLRU
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Counter m_index;
-    PgStat_Counter m_blocks_zeroed;
-    PgStat_Counter m_blocks_hit;
-    PgStat_Counter m_blocks_read;
-    PgStat_Counter m_blocks_written;
-    PgStat_Counter m_blocks_exists;
-    PgStat_Counter m_flush;
-    PgStat_Counter m_truncate;
-} PgStat_MsgSLRU;
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+} PgStat_CheckPointer;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -512,7 +255,6 @@ typedef struct PgStat_FunctionCounts
  */
 typedef struct PgStat_BackendFunctionEntry
 {
-    Oid            f_id;
     PgStat_FunctionCounts f_counts;
 } PgStat_BackendFunctionEntry;
 
@@ -528,99 +270,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgResetslrucounter msg_resetslrucounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgWal msg_wal;
-    PgStat_MsgSLRU msg_slru;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Activity statistics data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -629,13 +280,9 @@ typedef union PgStat_Msg
 
 #define PGSTAT_FILE_FORMAT_ID    0x01A5BC9E
 
-/* ----------
- * PgStat_StatDBEntry            The collector's data per database
- * ----------
- */
-typedef struct PgStat_StatDBEntry
+
+typedef struct PgStat_StatDBCounts
 {
-    Oid            databaseid;
     PgStat_Counter n_xact_commit;
     PgStat_Counter n_xact_rollback;
     PgStat_Counter n_blocks_fetched;
@@ -645,7 +292,6 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_tuples_inserted;
     PgStat_Counter n_tuples_updated;
     PgStat_Counter n_tuples_deleted;
-    TimestampTz last_autovac_time;
     PgStat_Counter n_conflict_tablespace;
     PgStat_Counter n_conflict_lock;
     PgStat_Counter n_conflict_snapshot;
@@ -655,29 +301,97 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_temp_bytes;
     PgStat_Counter n_deadlocks;
     PgStat_Counter n_checksum_failures;
-    TimestampTz last_checksum_failure;
     PgStat_Counter n_block_read_time;    /* times in microseconds */
     PgStat_Counter n_block_write_time;
+} PgStat_StatDBCounts;
 
+/* ----------
+ * PgStat_StatEntryHeader        common header struct for PgStat_Stat*Entry
+ * ----------
+ */
+typedef struct PgStat_StatEntryHeader
+{
+    LWLock        lock;
+    bool        dropped;            /* This entry is being dropped and should
+                                     * be removed when refcount goes to
+                                     * zero. */
+    pg_atomic_uint32  refcount;        /* How many backends are referencing */
+} PgStat_StatEntryHeader;
+
+/* ----------
+ * PgStat_StatDBEntry            The statistics per database
+ * ----------
+ */
+typedef struct PgStat_StatDBEntry
+{
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    TimestampTz last_autovac_time;
+    TimestampTz last_checksum_failure;
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;    /* time of db stats update */
+
+    PgStat_StatDBCounts counts;
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following members must be last in the struct, because we don't
+     * write them out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables; /* current gen tables hash */
+    dshash_table_handle functions;    /* current gen functions hash */
 } PgStat_StatDBEntry;
 
+/* ----------
+ * PgStat_Wal   Sent by backends and background processes to
+ *                update WAL statistics.
+ * ----------
+ */
+typedef struct PgStat_Wal
+{
+    PgStat_Counter wal_buffers_full;
+    TimestampTz stat_reset_timestamp;
+} PgStat_Wal;
+
+/*
+ * SLRU statistics kept in shared memory
+ */
+typedef struct PgStat_SLRUStats
+{
+    PgStat_Counter blocks_zeroed;
+    PgStat_Counter blocks_hit;
+    PgStat_Counter blocks_read;
+    PgStat_Counter blocks_written;
+    PgStat_Counter blocks_exists;
+    PgStat_Counter flush;
+    PgStat_Counter truncate;
+    TimestampTz stat_reset_timestamp;
+} PgStat_SLRUStats;
+
+
+/* ----------
+ * PgStat_HashMountInfo  dshash mount information
+ * ----------
+ */
+typedef struct PgStat_HashMountInfo
+{
+    HTAB       *snapshot_tables;    /* table entry snapshot */
+    HTAB       *snapshot_functions; /* function entry snapshot */
+    dshash_table *dshash_tables;    /* attached tables dshash */
+    dshash_table *dshash_functions; /* attached functions dshash */
+} PgStat_HashMountInfo;
 
 /* ----------
- * PgStat_StatTabEntry            The collector's data per table (or index)
+ * PgStat_StatTabEntry            The statistics per table (or index)
  * ----------
  */
 typedef struct PgStat_StatTabEntry
 {
-    Oid            tableid;
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
 
     PgStat_Counter numscans;
 
@@ -697,25 +411,21 @@ typedef struct PgStat_StatTabEntry
     PgStat_Counter blocks_fetched;
     PgStat_Counter blocks_hit;
 
-    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
     PgStat_Counter vacuum_count;
-    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_vacuum_count;
-    TimestampTz analyze_timestamp;    /* user initiated */
     PgStat_Counter analyze_count;
-    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_analyze_count;
 } PgStat_StatTabEntry;
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            per function stats data
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
 {
-    Oid            functionid;
-
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
     PgStat_Counter f_numcalls;
 
     PgStat_Counter f_total_time;    /* times in microseconds */
@@ -723,66 +433,6 @@ typedef struct PgStat_StatFuncEntry
 } PgStat_StatFuncEntry;
 
 
-/*
- * Archiver statistics kept in the stats collector
- */
-typedef struct PgStat_ArchiverStats
-{
-    PgStat_Counter archived_count;    /* archival successes */
-    char        last_archived_wal[MAX_XFN_CHARS + 1];    /* last WAL file
-                                                         * archived */
-    TimestampTz last_archived_timestamp;    /* last archival success time */
-    PgStat_Counter failed_count;    /* failed archival attempts */
-    char        last_failed_wal[MAX_XFN_CHARS + 1]; /* WAL file involved in
-                                                     * last failure */
-    TimestampTz last_failed_timestamp;    /* last archival failure time */
-    TimestampTz stat_reset_timestamp;
-} PgStat_ArchiverStats;
-
-/*
- * Global statistics kept in the stats collector
- */
-typedef struct PgStat_GlobalStats
-{
-    TimestampTz stats_timestamp;    /* time of stats file update */
-    PgStat_Counter timed_checkpoints;
-    PgStat_Counter requested_checkpoints;
-    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
-    PgStat_Counter checkpoint_sync_time;
-    PgStat_Counter buf_written_checkpoints;
-    PgStat_Counter buf_written_clean;
-    PgStat_Counter maxwritten_clean;
-    PgStat_Counter buf_written_backend;
-    PgStat_Counter buf_fsync_backend;
-    PgStat_Counter buf_alloc;
-    TimestampTz stat_reset_timestamp;
-} PgStat_GlobalStats;
-
-/*
- * WAL statistics kept in the stats collector
- */
-typedef struct PgStat_WalStats
-{
-    PgStat_Counter wal_buffers_full;
-    TimestampTz stat_reset_timestamp;
-} PgStat_WalStats;
-
-/*
- * SLRU statistics kept in the stats collector
- */
-typedef struct PgStat_SLRUStats
-{
-    PgStat_Counter blocks_zeroed;
-    PgStat_Counter blocks_hit;
-    PgStat_Counter blocks_read;
-    PgStat_Counter blocks_written;
-    PgStat_Counter blocks_exists;
-    PgStat_Counter flush;
-    PgStat_Counter truncate;
-    TimestampTz stat_reset_timestamp;
-} PgStat_SLRUStats;
-
-
 /* ----------
  * Backend states
  * ----------
@@ -830,7 +480,7 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
+    WAIT_EVENT_READING_STATS_FILE,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
     WAIT_EVENT_WAL_RECEIVER_MAIN,
@@ -1082,7 +732,7 @@ typedef struct PgBackendGSSStatus
  *
  * Each live backend maintains a PgBackendStatus struct in shared memory
  * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
+ * BackendId, but that is not critical.)  Note that the stats-writing code
  * has no involvement in, or even access to, these structs.
  *
  * Each auxiliary process also maintains a PgBackendStatus struct in shared
@@ -1279,18 +929,26 @@ extern PGDLLIMPORT bool pgstat_track_counts;
 extern PGDLLIMPORT int pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used, but will be removed with GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
+
+/*
+ * CheckPointer statistics counters are updated directly by checkpointer and
+ * bufmgr
+ */
+extern PgStat_CheckPointer CheckPointerStats;
 
 /*
  * WAL statistics counter is updated by backends and background processes
  */
-extern PgStat_MsgWal WalStats;
+extern PgStat_Wal WalStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1305,32 +963,26 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
-
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 extern void pgstat_reset_slru_counter(const char *);
 
+extern void pgstat_report_wal(void);
 extern void pgstat_report_autovac(Oid dboid);
 extern void pgstat_report_vacuum(Oid tableoid, bool shared,
                                  PgStat_Counter livetuples, PgStat_Counter deadtuples);
@@ -1367,6 +1019,7 @@ extern PgStat_TableStatus *find_tabstat_entry(Oid rel_id);
 extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 
 extern void pgstat_initstats(Relation rel);
+extern void pgstat_delinkstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
 
@@ -1489,9 +1142,10 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                                       void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
-extern void pgstat_send_wal(void);
+extern void pgstat_report_archiver(const char *xlog, bool failed);
+extern void pgstat_report_bgwriter(void);
+extern void pgstat_report_checkpointer(void);
+extern void pgstat_report_wal(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1500,13 +1154,18 @@ extern void pgstat_send_wal(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_extended(bool shared,
+                                                                Oid relid);
+extern void pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
-extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
-extern PgStat_GlobalStats *pgstat_fetch_global(void);
-extern PgStat_WalStats *pgstat_fetch_stat_wal(void);
+extern TimestampTz pgstat_get_stat_timestamp(void);
+extern PgStat_Archiver *pgstat_fetch_stat_archiver(void);
+extern PgStat_BgWriter *pgstat_fetch_stat_bgwriter(void);
+extern PgStat_CheckPointer *pgstat_fetch_stat_checkpointer(void);
+extern PgStat_Wal *pgstat_fetch_stat_wal(void);
 extern PgStat_SLRUStats *pgstat_fetch_slru(void);
 
 extern void pgstat_count_slru_page_zeroed(int slru_idx);
@@ -1518,5 +1177,6 @@ extern void pgstat_count_slru_flush(int slru_idx);
 extern void pgstat_count_slru_truncate(int slru_idx);
 extern const char *pgstat_slru_name(int slru_idx);
 extern int    pgstat_slru_index(const char *name);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index af9b41795d..621e074111 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -218,6 +218,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SHARED_TIDBITMAP,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_PER_XACT_PREDICATE_LIST,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 04431d0eb2..3b03464a1a 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -88,7 +88,7 @@ enum config_group
     PROCESS_TITLE,
     STATS,
     STATS_MONITORING,
-    STATS_COLLECTOR,
+    STATS_ACTIVITY,
     AUTOVACUUM,
     CLIENT_CONN,
     CLIENT_CONN_STATEMENT,
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 83a15f6795..77d1572a99 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.18.4
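
For readers following the header changes above: the new PgStat_StatEntryHeader pairs a "dropped" flag with a refcount, and its comment says a dropped entry should be removed when the refcount goes to zero. Below is a minimal, single-threaded sketch of that idea. It assumes C11 atomics in place of pg_atomic_uint32 and leaves out the LWLock entirely. DemoEntryHeader and the demo_* functions are hypothetical names for illustration only, not part of the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for PgStat_StatEntryHeader.  The real struct also
 * carries an LWLock; this sketch omits it and is not safe for concurrent
 * drop/release without such a lock.
 */
typedef struct DemoEntryHeader
{
    bool        dropped;    /* entry is being dropped */
    atomic_uint refcount;   /* number of holders referencing the entry */
} DemoEntryHeader;

void
demo_entry_init(DemoEntryHeader *h)
{
    h->dropped = false;
    atomic_init(&h->refcount, 0);
}

/* A holder starts referencing the entry. */
void
demo_entry_acquire(DemoEntryHeader *h)
{
    atomic_fetch_add(&h->refcount, 1);
}

/*
 * A holder stops referencing the entry.  Returns true when this was the
 * last reference to an entry already marked dropped, i.e. the caller
 * should now remove it.
 */
bool
demo_entry_release(DemoEntryHeader *h)
{
    unsigned int prev = atomic_fetch_sub(&h->refcount, 1);

    assert(prev > 0);       /* releasing without a reference is a bug */
    return prev == 1 && h->dropped;
}

/*
 * Mark the entry dropped.  Returns true when nobody references it, so the
 * caller can remove it immediately; otherwise removal is deferred to the
 * last demo_entry_release().
 */
bool
demo_entry_drop(DemoEntryHeader *h)
{
    h->dropped = true;
    return atomic_load(&h->refcount) == 0;
}
```

With two holders attached, demo_entry_drop() reports that removal must wait, the first demo_entry_release() returns false, and the second returns true. That is the deferred removal the header comment describes.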

From d837e09bb719a23f43975686458e600de04f362d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:09 +0900
Subject: [PATCH v39 5/7] Doc part of shared-memory based stats collector.

---
 doc/src/sgml/catalogs.sgml          |   6 +-
 doc/src/sgml/config.sgml            |  34 ++++----
 doc/src/sgml/high-availability.sgml |  13 +--
 doc/src/sgml/monitoring.sgml        | 127 +++++++++++++---------------
 doc/src/sgml/ref/pg_dump.sgml       |   9 +-
 5 files changed, 90 insertions(+), 99 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 3927b1030d..87336e81ff 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -9210,9 +9210,9 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l
   <para>
    <xref linkend="view-table"/> lists the system views described here.
    More detailed documentation of each view follows below.
-   There are some additional views that provide access to the results of
-   the statistics collector; they are described in <xref
-   linkend="monitoring-stats-views-table"/>.
+   There are some additional views that provide access to the activity
+   statistics; they are described in
+   <xref linkend="monitoring-stats-views-table"/>.
   </para>
 
   <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ee914740cc..b4537fc460 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7192,11 +7192,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
     <title>Run-time Statistics</title>
 
     <sect2 id="runtime-config-statistics-collector">
-     <title>Query and Index Statistics Collector</title>
+     <title>Query and Index Activity Statistics</title>
 
      <para>
-      These parameters control server-wide statistics collection features.
-      When statistics collection is enabled, the data that is produced can be
+      These parameters control server-wide activity statistics features.
+      When activity statistics tracking is enabled, the data that is produced can be
       accessed via the <structname>pg_stat</structname> and
       <structname>pg_statio</structname> family of system views.
       Refer to <xref linkend="monitoring"/> for more information.
@@ -7212,14 +7212,13 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables the collection of information on the currently
-        executing command of each session, along with the time when
-        that command began execution. This parameter is on by
-        default. Note that even when enabled, this information is not
-        visible to all users, only to superusers and the user owning
-        the session being reported on, so it should not represent a
-        security risk.
-        Only superusers can change this setting.
+        Enables activity tracking on the currently executing command of
+        each session, along with the time when that command began
+        execution. This parameter is on by default. Note that even when
+        enabled, this information is not visible to all users, only to
+        superusers and the user owning the session being reported on, so it
+        should not represent a security risk.  Only superusers can change this
+        setting.
        </para>
       </listitem>
      </varlistentry>
@@ -7250,9 +7249,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables collection of statistics on database activity.
+        Enables tracking of database activity.
         This parameter is on by default, because the autovacuum
-        daemon needs the collected information.
+        daemon needs the activity information.
         Only superusers can change this setting.
        </para>
       </listitem>
@@ -8313,7 +8312,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Specifies the fraction of the total number of heap tuples counted in
-        the previous statistics collection that can be inserted without
+        the previously collected statistics that can be inserted without
         incurring an index scan at the <command>VACUUM</command> cleanup stage.
         This setting currently applies to B-tree indexes only.
        </para>
@@ -8325,9 +8324,10 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         the index contains deleted pages that can be recycled during cleanup.
         Index statistics are considered to be stale if the number of newly
         inserted tuples exceeds the <varname>vacuum_cleanup_index_scale_factor</varname>
-        fraction of the total number of heap tuples detected by the previous
-        statistics collection. The total number of heap tuples is stored in
-        the index meta-page. Note that the meta-page does not include this data
+
+        fraction of the total number of heap tuples in the previously
+        collected statistics. The total number of heap tuples is stored in the
+        index meta-page. Note that the meta-page does not include this data
         until <command>VACUUM</command> finds no dead tuples, so B-tree index
         scan at the cleanup stage can only be skipped if the second and
         subsequent <command>VACUUM</command> cycles detect no dead tuples.
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 86da84fce7..24a42d1f44 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2365,12 +2365,13 @@ LOG:  database system is ready to accept read only connections
    </para>
 
    <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
-    index usage, etc., will be recorded normally on the standby. Replayed
-    actions will not duplicate their effects on primary, so replaying an
-    insert will not increment the Inserts column of pg_stat_user_tables.
-    The stats file is deleted at the start of recovery, so stats from primary
-    and standby will differ; this is considered a feature, not a bug.
+    Activity statistics are collected during recovery. All scans, reads,
+    blocks, index usage, etc., will be recorded normally on the
+    standby. Replayed actions will not duplicate their effects on primary, so
+    replaying an insert will not increment the Inserts column of
+    pg_stat_user_tables.  Activity statistics are reset at the start of
+    recovery, so stats from primary and standby will differ; this is
+    considered a feature, not a bug.
    </para>
 
    <para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 171ba7049c..7c64cbc667 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -22,7 +22,7 @@
   <para>
    Several tools are available for monitoring database activity and
    analyzing performance.  Most of this chapter is devoted to describing
-   <productname>PostgreSQL</productname>'s statistics collector,
+   <productname>PostgreSQL</productname>'s activity statistics,
    but one should not neglect regular Unix monitoring programs such as
    <command>ps</command>, <command>top</command>, <command>iostat</command>, and <command>vmstat</command>.
    Also, once one has identified a
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    primary server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   primary process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   primary process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
@@ -130,20 +128,21 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
  </sect1>
 
  <sect1 id="monitoring-stats">
-  <title>The Statistics Collector</title>
+  <title>The Activity Statistics</title>
 
   <indexterm zone="monitoring-stats">
    <primary>statistics</primary>
   </indexterm>
 
   <para>
-   <productname>PostgreSQL</productname>'s <firstterm>statistics collector</firstterm>
-   is a subsystem that supports collection and reporting of information about
-   server activity.  Presently, the collector can count accesses to tables
-   and indexes in both disk-block and individual-row terms.  It also tracks
-   the total number of rows in each table, and information about vacuum and
-   analyze actions for each table.  It can also count calls to user-defined
-   functions and the total time spent in each one.
+   <productname>PostgreSQL</productname>'s <firstterm>activity
+   statistics</firstterm> is a subsystem that supports tracking and reporting
+   of information about server activity.  Presently, it counts accesses to
+   tables and indexes in both disk-block and individual-row terms.  It also
+   tracks the total number of rows in each table, and information about
+   vacuum and analyze actions for each table.  It can also track calls to
+   user-defined functions and the total time spent in
+   each one.
   </para>
 
   <para>
@@ -151,15 +150,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    information about exactly what is going on in the system right now, such as
    the exact command currently being executed by other server processes, and
    which other connections exist in the system.  This facility is independent
-   of the collector process.
+   of the activity statistics.
   </para>
 
  <sect2 id="monitoring-stats-setup">
-  <title>Statistics Collection Configuration</title>
+  <title>Activity Statistics Configuration</title>
 
   <para>
-   Since collection of statistics adds some overhead to query execution,
-   the system can be configured to collect or not collect information.
+   Since tracking for the activity statistics adds some overhead to query
+   execution, the system can be configured to track or not track activity.
    This is controlled by configuration parameters that are normally set in
    <filename>postgresql.conf</filename>.  (See <xref linkend="runtime-config"/> for
    details about setting configuration parameters.)
@@ -172,7 +171,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The parameter <xref linkend="guc-track-counts"/> controls whether
-   statistics are collected about table and index accesses.
+   table and index accesses are tracked.
   </para>
 
   <para>
@@ -196,18 +195,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
   </para>
 
   <para>
-   The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
-   When the server shuts down cleanly, a permanent copy of the statistics
-   data is stored in the <filename>pg_stat</filename> subdirectory, so that
-   statistics can be retained across server restarts.  When recovery is
-   performed at server start (e.g., after immediate shutdown, server crash,
-   and point-in-time recovery), all statistics counters are reset.
+   When the server shuts down cleanly, a permanent copy of the statistics
+   data is stored in the <filename>pg_stat</filename> subdirectory, so that
+   statistics can be retained across server restarts.  When recovery is
+   performed at server start (e.g., after immediate shutdown, server crash,
+   and point-in-time recovery), all statistics counters are reset.
   </para>
 
  </sect2>
@@ -220,48 +212,46 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    linkend="monitoring-stats-dynamic-views-table"/>, are available to show
    the current state of the system. There are also several other
    views, listed in <xref
-   linkend="monitoring-stats-views-table"/>, available to show the results
-   of statistics collection.  Alternatively, one can
-   build custom views using the underlying statistics functions, as discussed
-   in <xref linkend="monitoring-stats-functions"/>.
+   linkend="monitoring-stats-views-table"/>, available to show the activity
+   statistics.  Alternatively, one can build custom views using the underlying
+   statistics functions, as discussed in
+   <xref linkend="monitoring-stats-functions"/>.
   </para>
 
   <para>
-   When using the statistics to monitor collected data, it is important
-   to realize that the information does not update instantaneously.
-   Each individual server process transmits new statistical counts to
-   the collector just before going idle; so a query or transaction still in
-   progress does not affect the displayed totals.  Also, the collector itself
-   emits a new report at most once per <varname>PGSTAT_STAT_INTERVAL</varname>
-   milliseconds (500 ms unless altered while building the server).  So the
-   displayed information lags behind actual activity.  However, current-query
-   information collected by <varname>track_activities</varname> is
-   always up-to-date.
+   When using the activity statistics, it is important to realize that the
+   information does not update instantaneously.  Each individual server
+   process writes out new statistical counts just before going idle, but no
+   more often than once per <varname>PGSTAT_STAT_INTERVAL</varname>
+   milliseconds (1 second unless altered while building the server); so a
+   query or transaction still in progress does not affect the displayed
+   totals.  However, current-query information tracked by
+   <varname>track_activities</varname> is always up-to-date.
   </para>
 
   <para>
    Another important point is that when a server process is asked to display
-   any of these statistics, it first fetches the most recent report emitted by
-   the collector process and then continues to use this snapshot for all
-   statistical views and functions until the end of its current transaction.
-   So the statistics will show static information as long as you continue the
-   current transaction.  Similarly, information about the current queries of
-   all sessions is collected when any such information is first requested
-   within a transaction, and the same information will be displayed throughout
-   the transaction.
-   This is a feature, not a bug, because it allows you to perform several
-   queries on the statistics and correlate the results without worrying that
-   the numbers are changing underneath you.  But if you want to see new
-   results with each query, be sure to do the queries outside any transaction
-   block.  Alternatively, you can invoke
+   any of these statistics, it first reads the current statistics and then
+   continues to use this snapshot for all statistical views and functions
+   until the end of its current transaction.  So the statistics will show
+   static information as long as you continue the current transaction.
+   Similarly, information about the current queries of all sessions is tracked
+   when any such information is first requested within a transaction, and the
+   same information will be displayed throughout the transaction.  This is a
+   feature, not a bug, because it allows you to perform several queries on the
+   statistics and correlate the results without worrying that the numbers are
+   changing underneath you.  But if you want to see new results with each
+   query, be sure to do the queries outside any transaction block.
+   Alternatively, you can invoke
    <function>pg_stat_clear_snapshot</function>(), which will discard the
    current transaction's statistics snapshot (if any).  The next use of
    statistical information will cause a new snapshot to be fetched.
   </para>
-
+  
   <para>
-   A transaction can also see its own statistics (as yet untransmitted to the
-   collector) in the views <structname>pg_stat_xact_all_tables</structname>,
+   A transaction can also see its own statistics (as yet unwritten to the
+   server-wide activity statistics) in the
+   views <structname>pg_stat_xact_all_tables</structname>,
    <structname>pg_stat_xact_sys_tables</structname>,
    <structname>pg_stat_xact_user_tables</structname>, and
    <structname>pg_stat_xact_user_functions</structname>.  These numbers do not act as
@@ -628,7 +618,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    kernel's I/O cache, and might therefore still be fetched without
    requiring a physical read. Users interested in obtaining more
    detailed information on <productname>PostgreSQL</productname> I/O behavior are
-   advised to use the <productname>PostgreSQL</productname> statistics collector
+   advised to use the <productname>PostgreSQL</productname> activity statistics
    in combination with operating system utilities that allow insight
    into the kernel's handling of I/O.
   </para>
@@ -1065,10 +1055,6 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>LogicalLauncherMain</literal></entry>
       <entry>Waiting in main loop of logical replication launcher process.</entry>
      </row>
-     <row>
-      <entry><literal>PgStatMain</literal></entry>
-      <entry>Waiting in main loop of statistics collector process.</entry>
-     </row>
      <row>
       <entry><literal>RecoveryWalStream</literal></entry>
       <entry>Waiting in main loop of startup process for WAL to arrive, during
@@ -1823,6 +1809,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
     </thead>
 
     <tbody>
+     <row>
+      <entry><literal>ActivityStatistics</literal></entry>
+      <entry>Waiting to write out activity statistics to shared memory.</entry>
+     </row>
      <row>
       <entry><literal>AddinShmemInit</literal></entry>
       <entry>Waiting to manage an extension's space allocation in shared
@@ -5797,9 +5787,10 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
      <entry><literal>performing final cleanup</literal></entry>
      <entry>
        <command>VACUUM</command> is performing final cleanup.  During this phase,
-       <command>VACUUM</command> will vacuum the free space map, update statistics
-       in <literal>pg_class</literal>, and report statistics to the statistics
-       collector.  When this phase is completed, <command>VACUUM</command> will end.
+       <command>VACUUM</command> will vacuum the free space map, update
+       statistics in <literal>pg_class</literal>, and update the system-wide
+       activity statistics.  When this phase is completed,
+       <command>VACUUM</command> will end.
      </entry>
     </row>
    </tbody>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 1400cf8775..fd6c92b347 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1286,11 +1286,10 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The database activity of <application>pg_dump</application> is
-   normally collected by the statistics collector.  If this is
-   undesirable, you can set parameter <varname>track_counts</varname>
-   to false via <envar>PGOPTIONS</envar> or the <literal>ALTER
-   USER</literal> command.
+   The database activity of <application>pg_dump</application> is normally
+   collected.  If this is undesirable, you can set
+   parameter <varname>track_counts</varname> to false
+   via <envar>PGOPTIONS</envar> or the <literal>ALTER USER</literal> command.
   </para>
 
  </refsect1>
-- 
2.18.4

From 9fab6a1c2e7deb48122bcd6c7baff425d1a5a6c4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 29 Sep 2020 22:59:33 +0900
Subject: [PATCH v39 6/7] Remove the GUC stats_temp_directory

The new stats collection system doesn't need a temporary directory, so
just remove it. pg_stat_statements is modified to use the pg_stat
directory to store its temporary files.  As a result, basebackup copies
pg_stat_statements' temporary file if it exists.
---
 .../pg_stat_statements/pg_stat_statements.c   | 13 +++---
 doc/src/sgml/backup.sgml                      |  2 -
 doc/src/sgml/config.sgml                      | 19 --------
 doc/src/sgml/storage.sgml                     |  6 ---
 src/backend/postmaster/pgstat.c               | 10 -----
 src/backend/replication/basebackup.c          | 36 ----------------
 src/backend/utils/misc/guc.c                  | 43 -------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/bin/initdb/initdb.c                       |  1 -
 src/bin/pg_rewind/filemap.c                   |  7 ---
 src/include/pgstat.h                          |  3 --
 src/test/perl/PostgresNode.pm                 |  4 --
 12 files changed, 6 insertions(+), 139 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 1eac9edaee..5eaceb60a7 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -88,14 +88,13 @@ PG_MODULE_MAGIC;
 #define PGSS_DUMP_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pg_stat_statements.stat"
 
 /*
- * Location of external query text file.  We don't keep it in the core
- * system's stats_temp_directory.  The core system can safely use that GUC
- * setting, because the statistics collector temp file paths are set only once
- * as part of changing the GUC, but pg_stat_statements has no way of avoiding
- * race conditions.  Besides, we only expect modest, infrequent I/O for query
- * strings, so placing the file on a faster filesystem is not compelling.
+ * Location of external query text file.  We don't keep it in the core system's
+ * pg_stats.  pg_stat_statements has no way of avoiding race conditions even if
+ * the directory were specified by a GUC.  Besides, we only expect modest,
+ * infrequent I/O for query strings, so placing the file on a faster filesystem
+ * is not compelling.
  */
-#define PGSS_TEXT_FILE    PG_STAT_TMP_DIR "/pgss_query_texts.stat"
+#define PGSS_TEXT_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pgss_query_texts.stat"
 
 /* Magic number identifying the stats file format */
 static const uint32 PGSS_FILE_HEADER = 0x20171004;
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 42a8ed328d..dd3d8892d8 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1146,8 +1146,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b4537fc460..d4b5258b63 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7305,25 +7305,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 3234adb639..6bac5e075e 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -120,12 +120,6 @@ Item
   subsystem</entry>
 </row>
 
-<row>
- <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
-</row>
-
 <row>
  <entry><filename>pg_subtrans</filename></entry>
  <entry>Subdirectory containing subtransaction status data</entry>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 1a5c1ca24c..a71bfc17b5 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -98,16 +98,6 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
- */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
-
 /*
  * List of SLRU names that we keep stats for.  There is no central registry of
  * SLRUs, so we use this fixed list instead.  The "other" entry is used for
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 57531d7d48..25eabbb1ad 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -87,9 +87,6 @@ static int    basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
 /* Was the backup currently in-progress initiated in recovery mode? */
 static bool backup_started_in_recovery = false;
 
-/* Relative path of temporary statistics directory */
-static char *statrelpath = NULL;
-
 /*
  * Size of each block sent into the tar stream for larger files.
  */
@@ -152,13 +149,6 @@ struct exclude_list_item
  */
 static const char *const excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    PG_STAT_TMP_DIR,
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
@@ -261,7 +251,6 @@ perform_base_backup(basebackup_options *opt)
     StringInfo    labelfile;
     StringInfo    tblspc_map_file;
     backup_manifest_info manifest;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
     backup_total = 0;
@@ -284,8 +273,6 @@ perform_base_backup(basebackup_options *opt)
     Assert(CurrentResourceOwner == NULL);
     CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -314,18 +301,6 @@ perform_base_backup(basebackup_options *opt)
         tablespaceinfo *ti;
         int            tblspc_streamed = 0;
 
-        /*
-         * Calculate the relative path of temporary statistics directory in
-         * order to skip the files which are located in that directory later.
-         */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
-
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
         ti->size = -1;
@@ -1365,17 +1340,6 @@ sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces,
         if (excludeFound)
             continue;
 
-        /*
-         * Exclude contents of directory specified by statrelpath if not set
-         * to the default (pg_stat_tmp) which is caught in the loop above.
-         */
-        if (statrelpath != NULL && strcmp(pathbuf, statrelpath) == 0)
-        {
-            elog(DEBUG1, "contents of directory \"%s\" excluded from backup", statrelpath);
-            size += _tarWriteDir(pathbuf, basepathlen, &statbuf, sizeonly);
-            continue;
-        }
-
         /*
          * We can skip pg_wal, the WAL segments need to be fetched from the
          * WAL archive anyway. But include it as an empty directory anyway, so
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 29eb459e35..87296bf2aa 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -202,7 +202,6 @@ static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource sourc
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_huge_page_size(int *newval, void **extra, GucSource source);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -558,8 +557,6 @@ char       *HbaFileName;
 char       *IdentFileName;
 char       *external_pid_file;
 
-char       *pgstat_temp_directory;
-
 char       *application_name;
 
 int            tcp_keepalives_idle;
@@ -4309,17 +4306,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_PRIMARY,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11608,35 +11594,6 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
     return true;
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 668a2d033a..7183c08305 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -586,7 +586,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index ee3bfa82f4..5b7eb30f14 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -217,7 +217,6 @@ static const char *const subdirs[] = {
     "pg_replslot",
     "pg_tblspc",
     "pg_stat",
-    "pg_stat_tmp",
     "pg_xact",
     "pg_logical",
     "pg_logical/snapshots",
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 1abc257177..d2192429bc 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -53,13 +53,6 @@ struct exclude_list_item
  */
 static const char *excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    "pg_stat_tmp",                /* defined as PG_STAT_TMP_DIR */
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0bc0e76f46..2f3522be5e 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -32,9 +32,6 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/saved_stats"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/saved_stats.tmp"
 
-/* Default directory to store temporary statistics data in */
-#define PG_STAT_TMP_DIR        "pg_stat_tmp"
-
 /* Values for track_functions GUC variable --- order is significant! */
 typedef enum TrackFunctionsLevel
 {
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 97e9d932ce..7f43e9872a 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -455,10 +455,6 @@ sub init
     print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
       if defined $ENV{TEMP_CONFIG};
 
-    # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
-    # concurrently must not share a stats_temp_directory.
-    print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
     if ($params{allows_streaming})
     {
         if ($params{allows_streaming} eq "logical")
-- 
2.18.4

From 2df15f001f0ae01716c3c6c557b28def4df552ee Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 29 Sep 2020 23:19:58 +0900
Subject: [PATCH v39 7/7] Exclude pg_stat directory from base backup

basebackup sends the contents of the pg_stat directory, which are doomed
to be removed at startup from the backup. Now that pg_stat_statements
saves a temporary file into the directory, let's exclude the pg_stat
directory from a base backup.
---
 src/backend/replication/basebackup.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 25eabbb1ad..dd35920e82 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -149,6 +149,13 @@ struct exclude_list_item
  */
 static const char *const excludeDirContents[] =
 {
+    /*
+     * Skip statistics files. PGSTAT_STAT_PERMANENT_DIRECTORY must be skipped
+     * because the files in the directory will be removed at startup from the
+     * backup.
+     */
+    PGSTAT_STAT_PERMANENT_DIRECTORY,
+
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
-- 
2.18.4


Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
At Tue, 06 Oct 2020 10:06:44 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> The previous version failed to flush local database stats for certain
> condition. That behavior caused useless retries and finally a forced
> flush that leads to contention. I fixed that and will measure
> performance with this version.

I (we) got some performance numbers.

- Fetching 1 tuple from 1 of 100 tables from 100 to 800 clients.
- Fetching 1 tuple from 1 of 10 tables from 100 to 800 clients.

Those showed a speed of over 400,000 TPS at maximum, and no significant
difference is seen between patched and unpatched across the whole range
of the test. I tried 5 seconds as PGSTAT_MIN_INTERVAL (10s in the patch)
but that made no difference.

- Fetching 1 tuple from 1 table from 800 clients.

No graph is attached for this, but this test shows a speed of over 42
TPS with or without the v39 patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Attachments

Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
It occurred to me that I forgot to mention the most significant
outcome of this patch.

At Thu, 08 Oct 2020 16:03:26 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Tue, 06 Oct 2020 10:06:44 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > The previous version failed to flush local database stats for certain
> > condition. That behavior caused useless retries and finally a forced
> > flush that leads to contention. I fixed that and will measure
> > performance with this version.
> 
> I (we) got some performance numbers.
> 
> - Fetching 1 tuple from 1 of 100 tables from 100 to 800 clients.
> - Fetching 1 tuple from 1 of 10 tables from 100 to 800 clients.
> 
> Those showed a speed of over 400,000 TPS at maximum, and no significant
> difference is seen between patched and unpatched across the whole range
> of the test. I tried 5 seconds as PGSTAT_MIN_INTERVAL (10s in the patch)
> but that made no difference.
> 
> - Fetching 1 tuple from 1 table from 800 clients.
> 
> No graph is attached for this, but this test shows a speed of over 42
> TPS with or without the v39 patch.

Under a heavy load with many tables, the *reader* side takes
seconds or more to read the stats table.  With this patch, the same
operation takes almost no time (perhaps on the order of milliseconds?).

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: shared-memory based stats collector

From
Georgios Kokolatos
Date:
Hi,

I noticed that according to the cfbot this patch no longer applies.

As it is registered in the upcoming commitfest, it would be appreciated
if you could rebase it.

Cheers,
//Georgios

Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
At Fri, 30 Oct 2020 15:00:55 +0000, Georgios Kokolatos <gkokolatos@protonmail.com> wrote in 
> Hi,
> 
> I noticed that according to the cfbot this patch no longer applies.
> 
> As it is registered in the upcoming commitfest, it would be appreciated
> if you could rebase it.

Thanks!  The replication slot stats patch (9868167500) hit this.

- Fixed a bug in the original code.

 get_stat_entry() returned a wrong result in "found" when a shared entry
 exists but is not locally cached.

- Moved replication slot stats into shared memory stats.

 Unlike wal_stats and slru_stats, it can be implemented as a part of
 the unified stats entry.  I'm tempted to remove the entry for a
 dropped slot immediately, but I didn't do that since the number of
 slots should be under 10 or so and dropping an entry requires an
 exclusive lock on dshash.  Instead, dropped entries are removed at
 file-write time, which happens only at the end of a process.
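
The get_stat_entry() fix in the first item can be pictured with a small
standalone model. This is only a sketch under assumptions: the names
(ModelStatEntry, model_get_stat_entry and so on) are hypothetical and are
not the patch's real structures or signatures, and a fixed array stands in
for the shared dshash. The point it illustrates is that "found" should
report whether the entry already existed in the shared table, so a miss in
the backend-local cache alone must not turn it into false.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_ENTRIES 8

/* Hypothetical stand-in for a stats entry; not the patch's real struct. */
typedef struct ModelStatEntry
{
    int         key;
    long        counter;
    bool        used;
} ModelStatEntry;

/* Models the shared hash table (a plain array in this sketch). */
static ModelStatEntry model_shared[MODEL_MAX_ENTRIES];

/* Models the backend-local cache of pointers into the shared table. */
static ModelStatEntry *model_local[MODEL_MAX_ENTRIES];

/* Forget everything cached locally, as a freshly started backend would. */
void
model_reset_local_cache(void)
{
    for (int i = 0; i < MODEL_MAX_ENTRIES; i++)
        model_local[i] = NULL;
}

static ModelStatEntry *
model_shared_lookup(int key)
{
    for (int i = 0; i < MODEL_MAX_ENTRIES; i++)
        if (model_shared[i].used && model_shared[i].key == key)
            return &model_shared[i];
    return NULL;
}

static ModelStatEntry *
model_shared_insert(int key)
{
    for (int i = 0; i < MODEL_MAX_ENTRIES; i++)
    {
        if (!model_shared[i].used)
        {
            model_shared[i].key = key;
            model_shared[i].counter = 0;
            model_shared[i].used = true;
            return &model_shared[i];
        }
    }
    return NULL;                /* shared table is full */
}

/*
 * Look up an entry, creating it in the shared table if requested.
 * *found is set to true whenever the entry existed before this call,
 * whether it was reached through the local cache or the shared table.
 */
ModelStatEntry *
model_get_stat_entry(int key, bool create, bool *found)
{
    ModelStatEntry *entry = NULL;
    bool        existed = false;
    int         free_slot = -1;

    /* Fast path: consult the local cache first. */
    for (int i = 0; i < MODEL_MAX_ENTRIES; i++)
    {
        if (model_local[i] == NULL)
        {
            if (free_slot < 0)
                free_slot = i;
        }
        else if (model_local[i]->key == key)
        {
            entry = model_local[i];
            existed = true;
            break;
        }
    }

    if (entry == NULL)
    {
        /* A local cache miss is not the same thing as "not found". */
        entry = model_shared_lookup(key);
        existed = (entry != NULL);

        if (entry == NULL && create)
            entry = model_shared_insert(key);

        /* Remember the entry locally if there is room. */
        if (entry != NULL && free_slot >= 0)
            model_local[free_slot] = entry;
    }

    if (found != NULL)
        *found = existed;
    return entry;
}
```

In this model, creating key 42, clearing the local cache, and looking the
key up again returns the same entry with found set to true, which is the
case the item above describes as having been reported wrongly.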


I had to clean up replication slots in pgstat_beshutdown_hook(). Even
though we have exactly the same code in several other places, the
function must be called before disabling DSA because we cannot update
statistics after detaching the shared-memory stats.  Perhaps we can
remove some of the existing calls to ReplicationSlotCleanup(), but I
haven't done that in this version.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 893a40cb787b4103ad35f241a54ee4aedff09b37 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:03 +0900
Subject: [PATCH v40 1/7] sequential scan for dshash

Dshash did not allow scanning all entries sequentially. This adds the
functionality. The interface is similar to, but a bit different from,
both that of dynahash and the simple dshash search functions. One of the
most significant differences is that the sequential scan interface of
dshash always needs a call to dshash_seq_term when a scan ends. Another
is locking. Dshash holds a partition lock when returning an entry;
dshash_seq_next() also holds a lock when returning an entry, but callers
shouldn't release it, since the lock is essential to continue the
scan. The seqscan interface allows entry deletion during a scan. The
in-scan deletion should be performed by dshash_delete_current().
---
 src/backend/lib/dshash.c | 160 ++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  22 ++++++
 2 files changed, 181 insertions(+), 1 deletion(-)
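
The scan protocol described in the commit message above can be sketched
with a toy, single-process model. Everything here is an assumption made
for illustration: the names (ToyTable, toy_seq_next and so on) are
hypothetical, counters stand in for the partition LWLocks, and shared
versus exclusive modes, resize protection and dshash_delete_current() are
left out. It only shows the hand-over rule (take the next partition's lock
before dropping the current one) and why a terminating call is needed to
release the last lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PARTITIONS        4
#define TOY_BUCKETS_PER_PART  2
#define TOY_BUCKETS           (TOY_PARTITIONS * TOY_BUCKETS_PER_PART)

typedef struct ToyItem
{
    int             value;
    struct ToyItem *next;
} ToyItem;

typedef struct ToyTable
{
    ToyItem    *buckets[TOY_BUCKETS];
    int         lock_count[TOY_PARTITIONS]; /* stand-in for partition locks */
} ToyTable;

typedef struct ToySeqStatus
{
    ToyTable   *table;
    int         curbucket;
    int         curpartition;   /* -1 while no partition lock is held */
    ToyItem    *next_item;      /* successor of the last returned item */
    bool        started;
} ToySeqStatus;

/* Push an item onto the chain of the given bucket. */
void
toy_insert(ToyTable *t, int bucket, ToyItem *item)
{
    item->next = t->buckets[bucket];
    t->buckets[bucket] = item;
}

/* Total number of partition locks currently held in the model. */
int
toy_locks_held(const ToyTable *t)
{
    int         n = 0;

    for (int i = 0; i < TOY_PARTITIONS; i++)
        n += t->lock_count[i];
    return n;
}

void
toy_seq_init(ToySeqStatus *st, ToyTable *t)
{
    st->table = t;
    st->curbucket = 0;
    st->curpartition = -1;
    st->next_item = NULL;
    st->started = false;
}

/* Return the next item, or NULL once every bucket has been visited. */
ToyItem *
toy_seq_next(ToySeqStatus *st)
{
    ToyItem    *item;

    if (!st->started)
    {
        /* First call: lock the first partition and start at bucket 0. */
        st->started = true;
        st->curpartition = 0;
        st->table->lock_count[0]++;
        st->next_item = st->table->buckets[0];
    }

    /* Advance to the next bucket whenever the current chain is exhausted. */
    while (st->next_item == NULL)
    {
        int         next_partition;

        if (++st->curbucket >= TOY_BUCKETS)
            return NULL;        /* the last lock stays held until toy_seq_term */

        next_partition = st->curbucket / TOY_BUCKETS_PER_PART;
        if (next_partition != st->curpartition)
        {
            /* Lock the next partition first, then release the previous one. */
            st->table->lock_count[next_partition]++;
            st->table->lock_count[st->curpartition]--;
            st->curpartition = next_partition;
        }
        st->next_item = st->table->buckets[st->curbucket];
    }

    assert(st->table->lock_count[st->curpartition] == 1);

    item = st->next_item;
    st->next_item = item->next; /* remember the successor before returning */
    return item;
}

/* Must be called when the scan ends, to drop the remaining lock. */
void
toy_seq_term(ToySeqStatus *st)
{
    if (st->curpartition >= 0)
    {
        st->table->lock_count[st->curpartition]--;
        st->curpartition = -1;
    }
}
```

A caller would loop on toy_seq_next() until it returns NULL and then call
toy_seq_term(). In this model exactly one partition lock is held while an
item is being returned, and none remain after termination.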

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 78ccf03217..b829167872 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -127,6 +127,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +157,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -324,7 +332,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -592,6 +600,156 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should always be called when a scan finishes.
+ * The caller may delete returned elements in the middle of a scan by using
+ * dshash_delete_current(). exclusive must be true to delete elements.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool exclusive)
+{
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->exclusive = exclusive;
+}
+
+/*
+ * Returns the next element.
+ *
+ * Returned elements are locked and the caller must not explicitly release
+ * them. They are released at the next call to dshash_seq_next().
+ */
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert(status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+        LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                      status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        status->curpartition = partition;
+
+        /* resize doesn't happen from now until seq scan ends */
+        status->nbuckets =
+            NUM_BUCKETS(status->hash_table->control->size_log2);
+        ensure_valid_bucket_pointers(status->hash_table);
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(status->hash_table,
+                                               status->curpartition),
+                                status->exclusive ? LW_EXCLUSIVE : LW_SHARED));
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        int next_partition;
+
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            return NULL;
+        }
+
+        /* Check whether we are moving to the next partition */
+        next_partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+
+        if (status->curpartition != next_partition)
+        {
+            /*
+             * Move to the next partition. Lock the next partition before
+             * releasing the current one, not the other way around, so that a
+             * concurrent resize cannot slip in.  Avoid deadlock by taking the
+             * locks in the same order as resize().
+             */
+            LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                         next_partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                         status->curpartition));
+            status->curpartition = next_partition;
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * The caller may delete the item. Store the next item in case of deletion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+/*
+ * Terminates the seqscan and releases all locks.
+ *
+ * Should always be called when finishing or exiting a seqscan.
+ */
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+
+    if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+/* Remove the current entry during a seq scan. */
+void
+dshash_delete_current(dshash_seq_status *status)
+{
+    dshash_table       *hash_table    = status->hash_table;
+    dshash_table_item  *item        = status->curitem;
+    size_t                partition    = PARTITION_FOR_HASH(item->hash);
+
+    Assert(status->exclusive);
+    Assert(hash_table->control->magic == DSHASH_MAGIC);
+    Assert(hash_table->find_locked);
+    Assert(hash_table->find_exclusively_locked);
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition),
+                                LW_EXCLUSIVE));
+
+    delete_item(hash_table, item);
+}
+
+/* Get the current entry during a seq scan. */
+void *
+dshash_get_current(dshash_seq_status *status)
+{
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b86df68e77..c337099061 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,21 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state. The details are exposed to let users know the
+ * storage size, but callers should treat it as an opaque type.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;    /* dshash table working on */
+    int                    curbucket;    /* bucket number we are at */
+    int                    nbuckets;    /* total number of buckets in the dshash */
+    dshash_table_item  *curitem;    /* item we are currently at */
+    dsa_pointer            pnextitem;    /* dsa-pointer to the next item */
+    int                    curpartition;    /* partition number we are at */
+    bool                exclusive;    /* locking mode */
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
                                    const dshash_parameters *params,
@@ -80,6 +95,13 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
+extern void dshash_delete_current(dshash_seq_status *status);
+extern void *dshash_get_current(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.18.4
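
A minimal standalone sketch of the bucket/partition arithmetic that the sequential scan in the patch above relies on. NUM_BUCKETS, BUCKET_INDEX_FOR_PARTITION and PARTITION_FOR_BUCKET_INDEX mirror the macros in the patch; TOY_PARTITIONS_LOG2, the NUM_SPLITS definition shown here, and the toy_* helpers are assumptions made for illustration only and are not part of dshash.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption for this sketch only: 4 partitions (log2 = 2). */
#define TOY_PARTITIONS_LOG2 2

/* Assumed definition: how many times each partition's bucket range doubled. */
#define NUM_SPLITS(size_log2) ((size_log2) - TOY_PARTITIONS_LOG2)

/* These three mirror the macros in the patch. */
#define NUM_BUCKETS(size_log2) \
    (((size_t) 1) << (size_log2))
#define BUCKET_INDEX_FOR_PARTITION(partition, size_log2) \
    ((partition) << NUM_SPLITS(size_log2))
#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2) \
    ((bucket_idx) >> NUM_SPLITS(size_log2))

/*
 * Check that the first bucket of every partition maps back to that
 * partition.  size_log2 must be >= TOY_PARTITIONS_LOG2.
 */
static bool
toy_round_trip_ok(size_t size_log2)
{
    size_t      npartitions = ((size_t) 1) << TOY_PARTITIONS_LOG2;

    for (size_t p = 0; p < npartitions; p++)
    {
        size_t      first = BUCKET_INDEX_FOR_PARTITION(p, size_log2);

        if (PARTITION_FOR_BUCKET_INDEX(first, size_log2) != p)
            return false;
    }
    return true;
}

/*
 * Walk the buckets in index order, as the seq scan does, and count how
 * often the scan would have to hand its lock over to the next partition.
 */
static int
toy_count_lock_handovers(size_t size_log2)
{
    size_t      nbuckets = NUM_BUCKETS(size_log2);
    size_t      curpartition = PARTITION_FOR_BUCKET_INDEX((size_t) 0, size_log2);
    int         handovers = 0;

    for (size_t b = 1; b < nbuckets; b++)
    {
        size_t      next = PARTITION_FOR_BUCKET_INDEX(b, size_log2);

        if (next != curpartition)
        {
            handovers++;
            curpartition = next;
        }
    }
    return handovers;
}
```

With size_log2 = 5 the toy table has 32 buckets in 4 partitions of 8, so a full scan in bucket order changes partition three times. Because buckets of one partition are contiguous, the scan only ever needs to hold the current partition lock plus, briefly, the next one.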

From 084bbaf8b74ec1a7e29f7e54ba10c5a8da62f62e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:57 +0900
Subject: [PATCH v40 2/7] Add conditional lock feature to dshash

Dshash currently waits for locks unconditionally, which is inconvenient
when we want to avoid being blocked by other processes. This commit
adds dshash_find_extended, a variant of dshash_find and
dshash_find_or_insert that can return immediately on lock failure.
---
 src/backend/lib/dshash.c | 98 +++++++++++++++++++++-------------------
 src/include/lib/dshash.h |  3 ++
 2 files changed, 55 insertions(+), 46 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index b829167872..9c90096f3d 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -383,6 +383,10 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  * the caller must take care to ensure that the entry is not left corrupted.
  * The lock mode is either shared or exclusive depending on 'exclusive'.
  *
+ * If found is not NULL, *found is set to true if the key is found in the hash
+ * table. If the key is not found, *found is set to false and a pointer to a
+ * newly created entry is returned.
+ *
  * The caller must not lock a lock already.
  *
  * Note that the lock held is in fact an LWLock, so interrupts will be held on
@@ -392,36 +396,7 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
 {
-    dshash_hash hash;
-    size_t        partition;
-    dshash_table_item *item;
-
-    hash = hash_key(hash_table, key);
-    partition = PARTITION_FOR_HASH(hash);
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
-
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
-    ensure_valid_bucket_pointers(hash_table);
-
-    /* Search the active bucket. */
-    item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
-
-    if (!item)
-    {
-        /* Not found. */
-        LWLockRelease(PARTITION_LOCK(hash_table, partition));
-        return NULL;
-    }
-    else
-    {
-        /* The caller will free the lock by calling dshash_release_lock. */
-        hash_table->find_locked = true;
-        hash_table->find_exclusively_locked = exclusive;
-        return ENTRY_FROM_ITEM(item);
-    }
+    return dshash_find_extended(hash_table, key, exclusive, false, false, NULL);
 }
 
 /*
@@ -439,31 +414,60 @@ dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
 {
-    dshash_hash hash;
-    size_t        partition_index;
-    dshash_partition *partition;
+    return dshash_find_extended(hash_table, key, true, false, true, found);
+}
+
+
+/*
+ * Find the key in the hash table.
+ *
+ * "exclusive" is the lock mode in which the partition for the returned item
+ * is locked.  If "nowait" is true, the function returns NULL immediately if
+ * the required lock cannot be acquired.  "insert" selects insert mode: if the
+ * key is not found, a new entry is inserted and *found is set to false;
+ * otherwise *found is set to true.  "found" must be non-null in this mode.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool insert, bool *found)
+{
+    dshash_hash hash = hash_key(hash_table, key);
+    size_t        partidx = PARTITION_FOR_HASH(hash);
+    dshash_partition *partition = &hash_table->control->partitions[partidx];
+    LWLockMode  lockmode = exclusive ? LW_EXCLUSIVE : LW_SHARED;
     dshash_table_item *item;
 
-    hash = hash_key(hash_table, key);
-    partition_index = PARTITION_FOR_HASH(hash);
-    partition = &hash_table->control->partitions[partition_index];
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
+    /* must be exclusive when insert allowed */
+    Assert(!insert || (exclusive && found != NULL));
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (!nowait)
+        LWLockAcquire(PARTITION_LOCK(hash_table, partidx), lockmode);
+    else if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partidx),
+                                       lockmode))
+        return NULL;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
     item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
 
     if (item)
-        *found = true;
+    {
+        if (found)
+            *found = true;
+    }
     else
     {
-        *found = false;
+        if (found)
+            *found = false;
+
+        if (!insert)
+        {
+            /* The caller didn't ask us to add a new entry. */
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+            return NULL;
+        }
 
         /* Check if we are getting too full. */
         if (partition->count > MAX_COUNT_PER_PARTITION(hash_table))
@@ -479,7 +483,8 @@ restart:
              * Give up our existing lock first, because resizing needs to
              * reacquire all the locks in the right order to avoid deadlocks.
              */
-            LWLockRelease(PARTITION_LOCK(hash_table, partition_index));
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+
             resize(hash_table, hash_table->size_log2 + 1);
 
             goto restart;
@@ -493,12 +498,13 @@ restart:
         ++partition->count;
     }
 
-    /* The caller must release the lock with dshash_release_lock. */
+    /* The caller will free the lock by calling dshash_release_lock. */
     hash_table->find_locked = true;
-    hash_table->find_exclusively_locked = true;
+    hash_table->find_exclusively_locked = exclusive;
     return ENTRY_FROM_ITEM(item);
 }
 
+
 /*
  * Remove an entry by key.  Returns true if the key was found and the
  * corresponding entry was removed.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index c337099061..493e974832 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -91,6 +91,9 @@ extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                                    const void *key, bool *found);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+                                  bool exclusive, bool nowait, bool insert,
+                                  bool *found);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.18.4
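
A rough standalone model of the "nowait" behaviour described in the patch above, for readers who want to see the control flow without the rest of dshash. It is a sketch under stated assumptions: a C11 atomic_flag stands in for a partition LWLock, the table is a tiny fixed array, and every toy_* name is hypothetical. Only the shape of the logic (conditional acquire, return NULL on failure, lock kept on success) follows dshash_find_extended.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_CAPACITY 4

typedef struct toy_table
{
    atomic_flag lock;           /* stands in for one partition LWLock */
    int         nused;
    int         keys[TOY_CAPACITY];
    int         values[TOY_CAPACITY];
} toy_table;

static void
toy_init(toy_table *t)
{
    atomic_flag_clear(&t->lock);
    t->nused = 0;
}

static void
toy_release_lock(toy_table *t)
{
    atomic_flag_clear(&t->lock);
}

/* Returns a pointer to the value with the lock held, or NULL with it released. */
static int *
toy_find_extended(toy_table *t, int key, bool nowait, bool insert, bool *found)
{
    if (!nowait)
    {
        while (atomic_flag_test_and_set(&t->lock))
            ;                   /* blocking path: spin until the lock is free */
    }
    else if (atomic_flag_test_and_set(&t->lock))
        return NULL;            /* lock busy: give up, *found is left untouched */

    for (int i = 0; i < t->nused; i++)
    {
        if (t->keys[i] == key)
        {
            if (found)
                *found = true;
            return &t->values[i];   /* caller releases the lock later */
        }
    }

    if (found)
        *found = false;
    if (!insert || t->nused == TOY_CAPACITY)
    {
        toy_release_lock(t);
        return NULL;
    }
    t->keys[t->nused] = key;
    t->values[t->nused] = 0;
    return &t->values[t->nused++];
}

/* Single-threaded walk-through; returns true if every step behaves as expected. */
static bool
toy_nowait_scenario(void)
{
    toy_table   t;
    bool        found = true;
    int        *v;

    toy_init(&t);
    v = toy_find_extended(&t, 42, false, true, &found);    /* inserts, keeps lock */
    if (v == NULL || found)
        return false;
    *v = 7;
    if (toy_find_extended(&t, 42, true, false, NULL) != NULL)
        return false;           /* lock is still held, so nowait must give up */
    toy_release_lock(&t);
    v = toy_find_extended(&t, 42, true, false, &found);
    if (v == NULL || !found || *v != 7)
        return false;
    toy_release_lock(&t);
    return true;
}
```

One thing the toy makes visible: with nowait set and insert unset, a NULL result can mean either "lock not acquired" or "key not found". The patch as posted appears to behave the same way, so a caller that needs to tell the two cases apart would have to handle that itself.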

From 2e8c808683581c976aa4c7b523878b18c90a2894 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:59:38 +0900
Subject: [PATCH v40 3/7] Make archiver process an auxiliary process

This is a preliminary patch for the shared-memory based stats collector.

The archiver process must be an auxiliary process because it uses shared
memory once stats data is moved into shared memory. Make the process
an auxiliary process so that it can work.
---
 src/backend/access/transam/xlogarchive.c |   6 +-
 src/backend/bootstrap/bootstrap.c        |  22 ++--
 src/backend/postmaster/pgarch.c          | 130 +++--------------------
 src/backend/postmaster/postmaster.c      |  50 +++++----
 src/backend/storage/lmgr/proc.c          |   1 +
 src/include/access/xlog.h                |   3 +
 src/include/access/xlogarchive.h         |   1 +
 src/include/miscadmin.h                  |   2 +
 src/include/postmaster/pgarch.h          |   4 +-
 src/include/storage/pmsignal.h           |   1 -
 src/include/storage/proc.h               |   3 +
 11 files changed, 69 insertions(+), 154 deletions(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index cae93ab69d..6908bec2f9 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -29,7 +29,9 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/latch.h"
 #include "storage/pmsignal.h"
+#include "storage/proc.h"
 
 /*
  * Attempt to retrieve the specified file from off-line archival storage.
@@ -489,8 +491,8 @@ XLogArchiveNotify(const char *xlog)
     }
 
     /* Notify archiver that it's got something to do */
-    if (IsUnderPostmaster)
-        SendPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER);
+    if (IsUnderPostmaster && ProcGlobal->archiverLatch)
+        SetLatch(ProcGlobal->archiverLatch);
 }
 
 /*
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index a7ed93fdc1..edda899554 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -317,6 +317,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
         case StartupProcess:
             MyBackendType = B_STARTUP;
             break;
+        case ArchiverProcess:
+            MyBackendType = B_ARCHIVER;
+            break;
         case BgWriterProcess:
             MyBackendType = B_BG_WRITER;
             break;
@@ -437,30 +440,29 @@ AuxiliaryProcessMain(int argc, char *argv[])
             proc_exit(1);        /* should never return */
 
         case StartupProcess:
-            /* don't set signals, startup process has its own agenda */
             StartupProcessMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
+
+        case ArchiverProcess:
+            PgArchiverMain();
+            proc_exit(1);
 
         case BgWriterProcess:
-            /* don't set signals, bgwriter has its own agenda */
             BackgroundWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case CheckpointerProcess:
-            /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalWriterProcess:
-            /* don't set signals, walwriter has its own agenda */
             InitXLOGAccess();
             WalWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalReceiverProcess:
-            /* don't set signals, walreceiver has its own agenda */
             WalReceiverMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         default:
             elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index ed1b65358d..e3a520def9 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -48,6 +48,7 @@
 #include "storage/latch.h"
 #include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
+#include "storage/procsignal.h"
 #include "utils/guc.h"
 #include "utils/ps_status.h"
 
@@ -78,13 +79,11 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
-static volatile sig_atomic_t wakened = false;
 static volatile sig_atomic_t ready_to_stop = false;
 
 /* ----------
@@ -95,8 +94,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
 static void pgarch_ArchiverCopyLoop(void);
@@ -110,75 +107,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -212,14 +140,9 @@ pgarch_forkexec(void)
 #endif                            /* EXEC_BACKEND */
 
 
-/*
- * PgArchiverMain
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
- *    since we don't use 'em, it hardly matters...
- */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+/* Main entry point for archiver process */
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -231,33 +154,26 @@ PgArchiverMain(int argc, char *argv[])
     /* SIGQUIT handler was already set up by InitPostmasterChild */
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, pgarch_waken);
+    pqsignal(SIGUSR1, procsignal_sigusr1_handler);
     pqsignal(SIGUSR2, pgarch_waken_stop);
+
     /* Reset some signals that are accepted by postmaster but not here */
     pqsignal(SIGCHLD, SIG_DFL);
+
+    /* Unblock signals (they were blocked when the postmaster forked us) */
     PG_SETMASK(&UnBlockSig);
 
-    MyBackendType = B_ARCHIVER;
-    init_ps_display(NULL);
+    /*
+     * Advertise our latch that backends can use to wake us up while we're
+     * sleeping.
+     */
+    ProcGlobal->archiverLatch = &MyProc->procLatch;
 
     pgarch_MainLoop();
 
     exit(0);
 }
 
-/* SIGUSR1 signal handler for archiver process */
-static void
-pgarch_waken(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    /* set flag that there is work to be done */
-    wakened = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
 /* SIGUSR2 signal handler for archiver process */
 static void
 pgarch_waken_stop(SIGNAL_ARGS)
@@ -282,14 +198,6 @@ pgarch_MainLoop(void)
     pg_time_t    last_copy_time = 0;
     bool        time_to_stop;
 
-    /*
-     * We run the copy loop immediately upon entry, in case there are
-     * unarchived files left over from a previous database run (or maybe the
-     * archiver died unexpectedly).  After that we wait for a signal or
-     * timeout before doing more.
-     */
-    wakened = true;
-
     /*
      * There shouldn't be anything for the archiver to do except to wait for a
      * signal ... however, the archiver exists to protect our data, so she
@@ -328,12 +236,8 @@ pgarch_MainLoop(void)
         }
 
         /* Do what we're here for */
-        if (wakened || time_to_stop)
-        {
-            wakened = false;
-            pgarch_ArchiverCopyLoop();
-            last_copy_time = time(NULL);
-        }
+        pgarch_ArchiverCopyLoop();
+        last_copy_time = time(NULL);
 
         /*
          * Sleep until a signal is received, or until a poll is forced by
@@ -354,13 +258,9 @@ pgarch_MainLoop(void)
                                WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                                timeout * 1000L,
                                WAIT_EVENT_ARCHIVER_MAIN);
-                if (rc & WL_TIMEOUT)
-                    wakened = true;
                 if (rc & WL_POSTMASTER_DEATH)
                     time_to_stop = true;
             }
-            else
-                wakened = true;
         }
 
         /*
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 959e3b8873..b811c961a6 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -555,6 +555,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1800,7 +1801,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -3054,7 +3055,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3189,20 +3190,16 @@ reaper(SIGNAL_ARGS)
         }
 
         /*
-         * Was it the archiver?  If so, just try to start a new one; no need
-         * to force reset of the rest of the system.  (If fail, we'll try
-         * again in future cycles of the main loop.).  Unless we were waiting
-         * for it to shut down; don't restart it in that case, and
-         * PostmasterStateMachine() will advance to the next shutdown step.
+         * Was it the archiver?  Normal exit can be ignored; we'll start a new
+         * one at the next iteration of the postmaster's main loop, if
+         * necessary. Any other exit condition is treated as a crash.
          */
         if (pid == PgArchPID)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3450,7 +3447,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3655,6 +3652,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3951,6 +3960,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5230,7 +5240,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5275,16 +5285,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (StartWorkerNeeded || HaveCrashedWorker)
         maybe_start_bgworkers();
 
-    if (CheckPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER) &&
-        PgArchPID != 0)
-    {
-        /*
-         * Send SIGUSR1 to archiver process, to wake it up and begin archiving
-         * next WAL file.
-         */
-        signal_child(PgArchPID, SIGUSR1);
-    }
-
     /* Tell syslogger to rotate logfile if requested */
     if (SysLoggerPID != 0)
     {
@@ -5526,6 +5526,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 88566bd9fa..746bed773e 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -181,6 +181,7 @@ InitProcGlobal(void)
     ProcGlobal->startupBufferPinWaitBufId = -1;
     ProcGlobal->walwriterLatch = NULL;
     ProcGlobal->checkpointerLatch = NULL;
+    ProcGlobal->archiverLatch = NULL;
     pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
     pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e71..65443063e8 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -339,6 +339,9 @@ extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern void XLogArchiveWakeupStart(void);
+extern void XLogArchiveWakeupEnd(void);
+extern void XLogArchiveWakeup(void);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool PromoteIsTriggered(void);
diff --git a/src/include/access/xlogarchive.h b/src/include/access/xlogarchive.h
index 1c67de2ede..54ce0b97d7 100644
--- a/src/include/access/xlogarchive.h
+++ b/src/include/access/xlogarchive.h
@@ -25,6 +25,7 @@ extern void ExecuteRecoveryCommand(const char *command, const char *commandName,
 extern void KeepFileRestoredFromArchive(const char *path, const char *xlogfname);
 extern void XLogArchiveNotify(const char *xlog);
 extern void XLogArchiveNotifySeg(XLogSegNo segno);
+extern void XLogArchiveWakeup(void);
 extern void XLogArchiveForceDone(const char *xlog);
 extern bool XLogArchiveCheckDone(const char *xlog);
 extern bool XLogArchiveIsBusy(const char *xlog);
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 72e3352398..bba002ad24 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -418,6 +418,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -430,6 +431,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index b3200874ca..e3ffc63f14 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 56c5ec4481..c691acf8cd 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -34,7 +34,6 @@ typedef enum
 {
     PMSIGNAL_RECOVERY_STARTED,    /* recovery has started */
     PMSIGNAL_BEGIN_HOT_STANDBY, /* begin Hot Standby */
-    PMSIGNAL_WAKEN_ARCHIVER,    /* send a NOTIFY signal to xlog archiver */
     PMSIGNAL_ROTATE_LOGFILE,    /* send SIGUSR1 to syslogger to rotate logfile */
     PMSIGNAL_START_AUTOVAC_LAUNCHER,    /* start an autovacuum launcher */
     PMSIGNAL_START_AUTOVAC_WORKER,    /* start an autovacuum worker */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 9c9a50ae45..de20520b8c 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -345,6 +345,9 @@ typedef struct PROC_HDR
     int            startupProcPid;
     /* Buffer id of the buffer that Startup process waits for pin on, or -1 */
     int            startupBufferPinWaitBufId;
+    /* Archiver process's latch */
+    Latch       *archiverLatch;
+    /* Current shared estimate of appropriate spins_per_delay value */
 } PROC_HDR;
 
 extern PGDLLIMPORT PROC_HDR *ProcGlobal;
-- 
2.18.4

From cdfba4a0f79849b78c0f66af1aae0f16e0ebaaae Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:14 +0900
Subject: [PATCH v40 4/7] Shared-memory based stats collector

Previously, activity statistics were collected via sockets and shared
among backends through files written periodically. Such files can reach
tens of megabytes and can be rewritten up to once per second; the stats
collector serializes that much data and every backend then
de-serializes it periodically. To avoid that cost, this patch places
activity statistics data in shared memory. Each backend accumulates
statistics locally, then tries to move them into the shared statistics
at every transaction end, but at intervals no shorter than 10 seconds.
Until 60 seconds have elapsed since the last flush to shared stats, a
lock failure postpones the flush, to alleviate lock contention that
would slow down transactions.  After that, the flush waits for the
locks so that the shared statistics don't get stale.
---
 src/backend/access/heap/heapam_handler.c      |    4 +-
 src/backend/access/heap/vacuumlazy.c          |    2 +-
 src/backend/access/transam/xlog.c             |    6 +-
 src/backend/catalog/index.c                   |   24 +-
 src/backend/commands/analyze.c                |    8 +-
 src/backend/commands/dbcommands.c             |    2 +-
 src/backend/commands/matview.c                |    8 +-
 src/backend/commands/vacuum.c                 |    4 +-
 src/backend/postmaster/autovacuum.c           |   74 +-
 src/backend/postmaster/bgwriter.c             |    4 +-
 src/backend/postmaster/checkpointer.c         |   26 +-
 src/backend/postmaster/pgarch.c               |   12 +-
 src/backend/postmaster/pgstat.c               | 6088 +++++++----------
 src/backend/postmaster/postmaster.c           |   88 +-
 src/backend/replication/basebackup.c          |    4 +-
 src/backend/storage/buffer/bufmgr.c           |    8 +-
 src/backend/storage/ipc/ipci.c                |    2 +
 src/backend/storage/lmgr/lwlock.c             |    4 +-
 src/backend/storage/lmgr/lwlocknames.txt      |    1 +
 src/backend/storage/smgr/smgr.c               |    4 +-
 src/backend/tcop/postgres.c                   |   37 +-
 src/backend/utils/adt/pgstatfuncs.c           |   89 +-
 src/backend/utils/cache/relcache.c            |    5 +
 src/backend/utils/init/globals.c              |    1 +
 src/backend/utils/init/miscinit.c             |    3 -
 src/backend/utils/init/postinit.c             |   11 +
 src/backend/utils/misc/guc.c                  |   14 +-
 src/backend/utils/misc/postgresql.conf.sample |    2 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |    2 +-
 src/include/miscadmin.h                       |    3 +-
 src/include/pgstat.h                          |  731 +-
 src/include/storage/lwlock.h                  |    1 +
 src/include/utils/guc_tables.h                |    2 +-
 src/include/utils/timeout.h                   |    1 +
 34 files changed, 2769 insertions(+), 4506 deletions(-)
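
The flush timing described in the commit message can be sketched as a
small standalone state machine. This is only an illustration, not code
from the patch: the three interval values are the constants defined in
pgstat.c, while FlushState, flush_state_init(), flush_decide() and
flush_result() are hypothetical names. Measuring the 60-second limit
from the first failed attempt follows the pgstat.c header comment, and
capping the retry time at that limit is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Interval values as defined in pgstat.c by this patch (milliseconds). */
#define PGSTAT_MIN_INTERVAL        10000
#define PGSTAT_RETRY_MIN_INTERVAL   1000
#define PGSTAT_MAX_INTERVAL        60000

/* Hypothetical per-backend bookkeeping; not a struct from the patch. */
typedef struct FlushState
{
    int64_t next_attempt;   /* earliest time of the next flush attempt */
    int64_t first_failure;  /* time of the first failed attempt, or -1 */
    int64_t retry_interval; /* current retry interval, 0 if not retrying */
} FlushState;

typedef enum
{
    FLUSH_SKIP,     /* too early, keep accumulating locally */
    FLUSH_NOWAIT,   /* try to flush, give up on lock failure */
    FLUSH_FORCE     /* flush and wait for the locks */
} FlushAction;

static FlushState
flush_state_init(void)
{
    FlushState st = {.next_attempt = 0, .first_failure = -1, .retry_interval = 0};

    return st;
}

/* Decide what to do at transaction end; "now" is in milliseconds. */
static FlushAction
flush_decide(const FlushState *st, int64_t now)
{
    if (now < st->next_attempt)
        return FLUSH_SKIP;
    if (st->first_failure >= 0 &&
        now - st->first_failure >= PGSTAT_MAX_INTERVAL)
        return FLUSH_FORCE;
    return FLUSH_NOWAIT;
}

/* Record the outcome of a flush attempt made at time "now". */
static void
flush_result(FlushState *st, int64_t now, bool all_flushed)
{
    int64_t deadline;

    if (all_flushed)
    {
        /* Nothing pending: wait a full minimum interval. */
        st->first_failure = -1;
        st->retry_interval = 0;
        st->next_attempt = now + PGSTAT_MIN_INTERVAL;
        return;
    }

    if (st->first_failure < 0)
    {
        /* First lock failure: start with the minimum retry interval. */
        st->first_failure = now;
        st->retry_interval = PGSTAT_RETRY_MIN_INTERVAL;
    }
    else
        st->retry_interval *= 2;    /* double at every retry */

    /* Assumption: never schedule a retry past the forced-flush deadline. */
    deadline = st->first_failure + PGSTAT_MAX_INTERVAL;
    st->next_attempt = now + st->retry_interval;
    if (st->next_attempt > deadline)
        st->next_attempt = deadline;
}
```

A caller would invoke flush_decide() at transaction end, use conditional
lock acquisition on FLUSH_NOWAIT, block on the locks on FLUSH_FORCE, and
then report the outcome through flush_result().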

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index dcaea7135f..49df584a9e 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1061,8 +1061,8 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
                  * our own.  In this case we should count and sample the row,
                  * to accommodate users who load a table and analyze it in one
                  * transaction.  (pgstat_report_analyze has to adjust the
-                 * numbers we send to the stats collector to make this come
-                 * out right.)
+                 * numbers we report to the activity stats facility to make
+                 * this come out right.)
                  */
                 if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(targtuple->t_data)))
                 {
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 2174fccb1e..403ab341d9 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -598,7 +598,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
                         new_min_multi,
                         false);
 
-    /* report results to the stats collector, too */
+    /* report results to the activity stats facility, too */
     pgstat_report_vacuum(RelationGetRelid(onerel),
                          onerel->rd_rel->relisshared,
                          Max(new_live_tuples, 0),
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 52a67b1170..75fdd6b607 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2196,7 +2196,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
                     WriteRqst.Flush = 0;
                     XLogWrite(WriteRqst, false);
                     LWLockRelease(WALWriteLock);
-                    WalStats.m_wal_buffers_full++;
+                    WalStats.wal_buffers_full++;
                     TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
                 }
                 /* Re-acquire WALBufMappingLock and retry */
@@ -8579,9 +8579,9 @@ LogCheckpointEnd(bool restartpoint)
                         &sync_secs, &sync_usecs);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time +=
+    CheckPointerStats.checkpoint_write_time +=
         write_secs * 1000 + write_usecs / 1000;
-    BgWriterStats.m_checkpoint_sync_time +=
+    CheckPointerStats.checkpoint_sync_time +=
         sync_secs * 1000 + sync_usecs / 1000;
 
     /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index b88b4a1f12..66249682d6 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1854,28 +1854,10 @@ index_concurrently_swap(Oid newIndexId, Oid oldIndexId, const char *oldName)
 
     /*
      * Copy over statistics from old to new index
+     * The data will be flushed by the next pgstat_report_stat()
+     * call.
      */
-    {
-        PgStat_StatTabEntry *tabentry;
-
-        tabentry = pgstat_fetch_stat_tabentry(oldIndexId);
-        if (tabentry)
-        {
-            if (newClassRel->pgstat_info)
-            {
-                newClassRel->pgstat_info->t_counts.t_numscans = tabentry->numscans;
-                newClassRel->pgstat_info->t_counts.t_tuples_returned = tabentry->tuples_returned;
-                newClassRel->pgstat_info->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_hit = tabentry->blocks_hit;
-
-                /*
-                 * The data will be sent by the next pgstat_report_stat()
-                 * call.
-                 */
-            }
-        }
-    }
+    pgstat_copy_index_counters(oldIndexId, newClassRel->pgstat_info);
 
     /* Copy data of pg_statistic from the old index to the new one */
     CopyStatistics(oldIndexId, newIndexId);
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8af12b5c6b..a7e787d9d1 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -644,10 +644,10 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
     }
 
     /*
-     * Report ANALYZE to the stats collector, too.  However, if doing
-     * inherited stats we shouldn't report, because the stats collector only
-     * tracks per-table stats.  Reset the changes_since_analyze counter only
-     * if we analyzed all columns; otherwise, there is still work for
+     * Report ANALYZE to the activity stats facility, too.  However, if doing
+     * inherited stats we shouldn't report, because the activity stats facility
+     * only tracks per-table stats.  Reset the changes_since_analyze counter
+     * only if we analyzed all columns; otherwise, there is still work for
      * auto-analyze to do.
      */
     if (!inh)
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index f27c3fe8c1..de0c749570 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -971,7 +971,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
     DropDatabaseBuffers(db_id);
 
     /*
-     * Tell the stats collector to forget it immediately, too.
+     * Tell the activity stats facility to forget it immediately, too.
      */
     pgstat_drop_database(db_id);
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index f80a9e96a9..e7ccc6eba7 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -336,10 +336,10 @@ ExecRefreshMatView(RefreshMatViewStmt *stmt, const char *queryString,
         refresh_by_heap_swap(matviewOid, OIDNewHeap, relpersistence);
 
         /*
-         * Inform stats collector about our activity: basically, we truncated
-         * the matview and inserted some new data.  (The concurrent code path
-         * above doesn't need to worry about this because the inserts and
-         * deletes it issues get counted by lower-level code.)
+         * Inform activity stats facility about our activity: basically, we
+         * truncated the matview and inserted some new data.  (The concurrent
+         * code path above doesn't need to worry about this because the inserts
+         * and deletes it issues get counted by lower-level code.)
          */
         pgstat_count_truncate(matviewRel);
         if (!stmt->skipData)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 1b6717f727..070bcd1ead 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -319,8 +319,8 @@ vacuum(List *relations, VacuumParams *params,
                  errmsg("VACUUM option DISABLE_PAGE_SKIPPING cannot be used with FULL")));
 
     /*
-     * Send info about dead objects to the statistics collector, unless we are
-     * in autovacuum --- autovacuum.c does this for itself.
+     * Send info about dead objects to the activity statistics facility, unless
+     * we are in autovacuum --- autovacuum.c does this for itself.
      */
     if ((params->options & VACOPT_VACUUM) && !IsAutoVacuumWorkerProcess())
         pgstat_vacuum_stat();
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 2cef56f115..773b82be3b 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -338,9 +338,6 @@ static void autovacuum_do_vac_analyze(autovac_table *tab,
                                       BufferAccessStrategy bstrategy);
 static AutoVacOpts *extract_autovac_opts(HeapTuple tup,
                                          TupleDesc pg_class_desc);
-static PgStat_StatTabEntry *get_pgstat_tabentry_relid(Oid relid, bool isshared,
-                                                      PgStat_StatDBEntry *shared,
-                                                      PgStat_StatDBEntry *dbentry);
 static void perform_work_item(AutoVacuumWorkItem *workitem);
 static void autovac_report_activity(autovac_table *tab);
 static void autovac_report_workitem(AutoVacuumWorkItem *workitem,
@@ -1680,12 +1677,12 @@ AutoVacWorkerMain(int argc, char *argv[])
         char        dbname[NAMEDATALEN];
 
         /*
-         * Report autovac startup to the stats collector.  We deliberately do
-         * this before InitPostgres, so that the last_autovac_time will get
-         * updated even if the connection attempt fails.  This is to prevent
-         * autovac from getting "stuck" repeatedly selecting an unopenable
-         * database, rather than making any progress on stuff it can connect
-         * to.
+         * Report autovac startup to the activity stats facility.  We
+         * deliberately do this before InitPostgres, so that the
+         * last_autovac_time will get updated even if the connection attempt
+         * fails.  This is to prevent autovac from getting "stuck" repeatedly
+         * selecting an unopenable database, rather than making any progress on
+         * stuff it can connect to.
          */
         pgstat_report_autovac(dbid);
 
@@ -1957,8 +1954,6 @@ do_autovacuum(void)
     HASHCTL        ctl;
     HTAB       *table_toast_map;
     ListCell   *volatile cell;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     BufferAccessStrategy bstrategy;
     ScanKeyData key;
     TupleDesc    pg_class_desc;
@@ -1977,17 +1972,11 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /*
-     * may be NULL if we couldn't find an entry (only happens if we are
-     * forcing a vacuum for anti-wrap purposes).
-     */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
 
     /*
-     * Clean up any dead statistics collector entries for this DB. We always
+     * Clean up any dead activity statistics entries for this DB. We always
      * want to do this exactly once per DB-processing cycle, even if we find
      * nothing worth vacuuming in the database.
      */
@@ -2030,9 +2019,6 @@ do_autovacuum(void)
     /* StartTransactionCommand changed elsewhere */
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-
     classRel = table_open(RelationRelationId, AccessShareLock);
 
     /* create a copy so we can use it after closing pg_class */
@@ -2111,8 +2097,8 @@ do_autovacuum(void)
 
         /* Fetch reloptions and the pgstat entry for this table */
         relopts = extract_autovac_opts(tuple, pg_class_desc);
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                       relid);
 
         /* Check if it needs vacuum or analyze */
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
@@ -2195,8 +2181,8 @@ do_autovacuum(void)
         }
 
         /* Fetch the pgstat entry for this table */
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                       relid);
 
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
@@ -2755,29 +2741,6 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
     return av;
 }
 
-/*
- * get_pgstat_tabentry_relid
- *
- * Fetch the pgstat entry of a table, either local to a database or shared.
- */
-static PgStat_StatTabEntry *
-get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
-                          PgStat_StatDBEntry *dbentry)
-{
-    PgStat_StatTabEntry *tabentry = NULL;
-
-    if (isshared)
-    {
-        if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
-    }
-    else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
-
-    return tabentry;
-}
 
 /*
  * table_recheck_autovac
@@ -2798,17 +2761,12 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     bool        doanalyze;
     autovac_table *tab = NULL;
     PgStat_StatTabEntry *tabentry;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     bool        wraparound;
     AutoVacOpts *avopts;
 
     /* use fresh stats */
     autovac_refresh_stats();
 
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* fetch the relation's relcache entry */
     classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
     if (!HeapTupleIsValid(classTup))
@@ -2832,8 +2790,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     }
 
     /* fetch the pgstat table entry */
-    tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                         shared, dbentry);
+    tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                   relid);
 
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
@@ -2955,7 +2913,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
  *
  * For analyze, the analysis done is that the number of tuples inserted,
  * deleted and updated since the last analyze exceeds a threshold calculated
- * in the same fashion as above.  Note that the collector actually stores
+ * in the same fashion as above.  Note that the activity stats facility stores
  * the number of tuples (both live and dead) that there were as of the last
  * analyze.  This is asymmetric to the VACUUM case.
  *
@@ -2965,8 +2923,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
  *
  * A table whose autovacuum_enabled option is false is
  * automatically skipped (unless we have to vacuum it due to freeze_max_age).
- * Thus autovacuum can be disabled for specific tables. Also, when the stats
- * collector does not have data about a table, it will be skipped.
+ * Thus autovacuum can be disabled for specific tables. Also, when the activity
+ * stats facility does not have data about a table, it will be skipped.
  *
  * A table whose vac_base_thresh value is < 0 takes the base value from the
  * autovacuum_vacuum_threshold GUC variable.  Similarly, a vac_scale_factor
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index a7afa758b6..b075e85839 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -244,9 +244,9 @@ BackgroundWriterMain(void)
         can_hibernate = BgBufferSync(&wb_context);
 
         /*
-         * Send off activity statistics to the stats collector
+         * Send off activity statistics to the activity stats facility
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 429c8010ef..fe6937f8af 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -360,7 +360,7 @@ CheckpointerMain(void)
         if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
         {
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            CheckPointerStats.requested_checkpoints++;
         }
 
         /*
@@ -374,7 +374,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                CheckPointerStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -495,17 +495,11 @@ CheckpointerMain(void)
         /* Check for archive_timeout and switch xlog files if necessary. */
         CheckArchiveTimeout();
 
-        /*
-         * Send off activity statistics to the stats collector.  (The reason
-         * why we re-use bgwriter-related code for this is that the bgwriter
-         * and checkpointer used to be just one process.  It's probably not
-         * worth the trouble to split the stats support into two independent
-         * stats message types.)
-         */
-        pgstat_send_bgwriter();
+        /* Send off activity statistics to the activity stats facility. */
+        pgstat_report_checkpointer();
 
         /* Send WAL statistics to the stats collector. */
-        pgstat_send_wal();
+        pgstat_report_wal();
 
         /*
          * If any checkpoint flags have been set, redo the loop to handle the
@@ -711,9 +705,9 @@ CheckpointWriteDelay(int flags, double progress)
         CheckArchiveTimeout();
 
         /*
-         * Report interim activity statistics to the stats collector.
+         * Report interim activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_report_checkpointer();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1258,8 +1252,10 @@ AbsorbSyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    CheckPointerStats.buf_written_backend
+        += CheckpointerShmem->num_backend_writes;
+    CheckPointerStats.buf_fsync_backend
+        += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index e3a520def9..6d88c65d5f 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -372,20 +372,20 @@ pgarch_ArchiverCopyLoop(void)
                 pgarch_archiveDone(xlog);
 
                 /*
-                 * Tell the collector about the WAL file that we successfully
-                 * archived
+                 * Tell the activity statistics facility about the WAL file
+                 * that we successfully archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_report_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
             else
             {
                 /*
-                 * Tell the collector about the WAL file that we failed to
-                 * archive
+                 * Tell the activity statistics facility about the WAL file
+                 * that we failed to archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_report_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f1dca2f25b..a72fc7dddd 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,22 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Activity Statistics facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects activity statistics, e.g. per-table access statistics, of
+ *  all backends in shared memory. The activity numbers are first stored
+ *  locally in each process, then flushed to shared memory at commit
+ *  time or by idle-timeout.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ * To avoid congestion on the shared memory, shared stats are updated no more
+ * often than once per PGSTAT_MIN_INTERVAL (10000ms). If some local numbers
+ * remain unflushed due to lock failure, we retry at an interval that starts
+ * at PGSTAT_RETRY_MIN_INTERVAL (1000ms) and doubles at every retry. Finally,
+ * we force the update PGSTAT_MAX_INTERVAL (60000ms) after the first attempt.
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the activity statistics facility creates the
+ *  area and loads the stored stats file, if any; the last process at shutdown
+ *  writes the shared stats to the file, then destroys the area before exit.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -19,18 +26,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -40,12 +35,9 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
+#include "common/hashfn.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/fork_process.h"
@@ -53,20 +45,16 @@
 #include "postmaster/postmaster.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
+#include "storage/condition_variable.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
 #include "utils/timestamp.h"
 
@@ -74,35 +62,20 @@
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
+#define PGSTAT_MIN_INTERVAL            10000    /* Minimum interval of stats data
+                                             * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_RETRY_MIN_INTERVAL    1000 /* Initial retry interval after
+                                          * PGSTAT_MIN_INTERVAL */
 
+#define PGSTAT_MAX_INTERVAL            60000    /* Longest interval of stats data
+                                             * updates */
 
 /* ----------
- * The initial size hints for the hash tables used in the collector.
+ * The initial size hints for the hash tables used in the activity statistics.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
-#define PGSTAT_TAB_HASH_SIZE    512
+#define PGSTAT_TABLE_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
 
@@ -117,7 +90,6 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
-
 /* ----------
  * GUC parameters
  * ----------
@@ -132,17 +104,11 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used, but will be removed with GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
-/*
- * BgWriter and WAL global statistics counters.
- * Stored directly in a stats message structure so they can be sent
- * without needing to copy things around.  We assume these init to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
-PgStat_MsgWal WalStats;
-
 /*
  * List of SLRU names that we keep stats for.  There is no central registry of
  * SLRUs, so we use this fixed list instead.  The "other" entry is used for
@@ -161,73 +127,245 @@ static const char *const slru_names[] = {
 
 #define SLRU_NUM_ELEMENTS    lengthof(slru_names)
 
+/* struct for shared SLRU stats */
+typedef struct PgStatSharedSLRUStats
+{
+    PgStat_SLRUStats        entry[SLRU_NUM_ELEMENTS];
+    LWLock                    lock;
+    pg_atomic_uint32        changecount;
+} PgStatSharedSLRUStats;
+
+StaticAssertDecl(sizeof(TimestampTz) == sizeof(pg_atomic_uint64),
+                 "size of pg_atomic_uint64 doesn't match TimestampTz");
+
+typedef struct StatsShmemStruct
+{
+    dsa_handle    stats_dsa_handle;    /* handle for stats data area */
+    dshash_table_handle hash_handle;    /* shared dbstat hash */
+    int            refcount;            /* # of processes attached to the
+                                     * shared stats memory */
+    /* Global stats structs */
+    PgStat_Archiver            archiver_stats;
+    pg_atomic_uint32        archiver_changecount;
+    PgStat_BgWriter            bgwriter_stats;
+    pg_atomic_uint32        bgwriter_changecount;
+    PgStat_CheckPointer        checkpointer_stats;
+    pg_atomic_uint32        checkpointer_changecount;
+    PgStat_Wal                wal_stats;
+    LWLock                    wal_stats_lock;
+    PgStatSharedSLRUStats    slru_stats;
+    pg_atomic_uint32        slru_changecount;
+    pg_atomic_uint64        stats_timestamp;
+
+    /* Reset offsets, protected by StatsLock */
+    PgStat_Archiver            archiver_reset_offset;
+    PgStat_BgWriter            bgwriter_reset_offset;
+    PgStat_CheckPointer        checkpointer_reset_offset;
+
+    /* file read/write protection */
+    bool                    attach_holdoff;
+    ConditionVariable        holdoff_cv;
+
+    pg_atomic_uint64        gc_count; /* # of entries deleted. not
+                                            * protected by StatsLock */
+} StatsShmemStruct;
+
+/* BgWriter global statistics counters */
+PgStat_BgWriter BgWriterStats = {0};
+
+/* CheckPointer global statistics counters */
+PgStat_CheckPointer CheckPointerStats = {0};
+
+/* WAL global statistics counters */
+PgStat_Wal WalStats = {0};
+
 /*
- * SLRU statistics counts waiting to be sent to the collector.  These are
- * stored directly in stats message format so they can be sent without needing
- * to copy things around.  We assume this variable inits to zeroes.  Entries
- * are one-to-one with slru_names[].
+ * XXXX: always try to flush WAL stats. We don't want to manipulate another
+ * counter during XLogInsert so we don't have an efficient shortcut to know
+ * whether any counter gets incremented.
  */
-static PgStat_MsgSLRU SLRUStats[SLRU_NUM_ELEMENTS];
+static inline bool
+walstats_pending(void)
+{
+    static const PgStat_Wal all_zeroes;
+
+    return memcmp(&WalStats, &all_zeroes, sizeof(PgStat_Wal)) != 0;
+}
+
+/*
+ * SLRU statistics counts waiting to be written to the shared activity
+ * statistics.  We assume this variable inits to zeroes.  Entries are
+ * one-to-one with slru_names[].
+ * Changes of SLRU counters are reported within critical sections so we use
+ * static memory in order to avoid memory allocation.
+ */
+static PgStat_SLRUStats local_SLRUStats[SLRU_NUM_ELEMENTS];
+static bool     have_slrustats = false;
 
 /* ----------
  * Local data
  * ----------
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
-
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
+/* backend-lifetime storage */
+static StatsShmemStruct *StatsShmem = NULL;
+static dsa_area *area = NULL;
 
 /*
- * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
- *
- * NOTE: once allocated, TabStatusArray structures are never moved or deleted
- * for the life of the backend.  Also, we zero out the t_id fields of the
- * contained PgStat_TableStatus structs whenever they are not actively in use.
- * This allows relcache pgstat_info pointers to be treated as long-lived data,
- * avoiding repeated searches in pgstat_initstats() when a relation is
- * repeatedly opened during a transaction.
+ * Types to define shared statistics structure.
+ *
+ * Per-object statistics are stored in "shared stats" entries: structs in
+ * DSA-allocated memory that begin with a header common to all object types.
+ * Every shared stats entry is pointed to from a dshash via a dsa_pointer. This
+ * structure makes the shared stats immovable against dshash resizing, allows a
+ * backend to point to shared stats entries via a native pointer, and allows
+ * locking at stats-entry level. The per-entry locking reduces lock contention
+ * compared to the partition lock of dshash. A backend accumulates stats
+ * numbers in a stats entry in local memory, then flushes the numbers to the
+ * shared stats entries, basically at transaction end.
+ *
+ * Each stats entry type has a fixed member PgStat_StatEntryHeader as the first
+ * element.
+ *
+ * Shared stats are stored as:
+ *
+ * dshash pgStatSharedHash
+ *    -> PgStatHashEntry                    (dshash entry)
+ *      (dsa_pointer)-> PgStat_Stat*Entry    (dsa memory block)
+ *
+ * Shared stats entries are directly pointed from pgstat_localhash hash:
+ *
+ * pgstat_localhash pgStatEntHash
+ *    -> PgStatLocalHashEntry                (equivalent of PgStatHashEntry)
+ *      (native pointer)-> PgStat_Stat*Entry (dsa memory block)
+ *
+ * Local stats waiting to be flushed to shared stats are stored as:
+ *
+ * pgstat_localhash pgStatLocalHash
+ *    -> PgStatLocalHashEntry                 (local hash entry)
+ *      (native pointer)-> PgStat_Stat*Entry/TableStatus (palloc'ed memory)
  */
-#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
 
-typedef struct TabStatusArray
+/* The types of statistics entries */
+typedef enum PgStatTypes
 {
-    struct TabStatusArray *tsa_next;    /* link to next array, if any */
-    int            tsa_used;        /* # entries currently used */
-    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
-} TabStatusArray;
-
-static TabStatusArray *pgStatTabList = NULL;
+    PGSTAT_TYPE_DB,            /* database-wide statistics */
+    PGSTAT_TYPE_TABLE,        /* per-table statistics */
+    PGSTAT_TYPE_FUNCTION,    /* per-function statistics */
+    PGSTAT_TYPE_REPLSLOT    /* per-replication-slot statistics */
+} PgStatTypes;
 
 /*
- * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
+ * entry body size lookup table of shared statistics entries corresponding to
+ * PgStatTypes
  */
-typedef struct TabStatHashEntry
+static const size_t pgstat_sharedentsize[] =
 {
-    Oid            t_id;
-    PgStat_TableStatus *tsa_entry;
-} TabStatHashEntry;
+    sizeof(PgStat_StatDBEntry),     /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_StatTabEntry),    /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_StatFuncEntry),    /* PGSTAT_TYPE_FUNCTION */
+    sizeof(PgStat_ReplSlot)            /* PGSTAT_TYPE_REPLSLOT */
+};
 
-/*
- * Hash table for O(1) t_id -> tsa_entry lookup
- */
-static HTAB *pgStatTabHash = NULL;
+/* Ditto for local statistics entries */
+static const size_t pgstat_localentsize[] =
+{
+    sizeof(PgStat_StatDBEntry),            /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_TableStatus),            /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_BackendFunctionEntry),/* PGSTAT_TYPE_FUNCTION */
+    sizeof(PgStat_ReplSlot)                /* PGSTAT_TYPE_REPLSLOT */
+};
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * We should avoid overwriting the header part of a shared entry. Use these
+ * macros to know what portion of the struct is to be written or read.
+ * PGSTAT_SHENT_BODY returns a slightly smaller address than the actual address
+ * of the next member, but that doesn't matter.
  */
-static HTAB *pgStatFunctions = NULL;
+#define PGSTAT_SHENT_BODY(e) (((char *)(e)) + sizeof(PgStat_StatEntryHeader))
+#define PGSTAT_SHENT_BODY_LEN(t) \
+    (pgstat_sharedentsize[t] - sizeof(PgStat_StatEntryHeader))
+
+/* struct for shared statistics hash entry key. */
+typedef struct PgStatHashKey
+{
+    PgStatTypes type;        /* statistics entry type */
+    Oid            databaseid;    /* database ID. InvalidOid for shared objects. */
+    Oid            objectid;    /* object ID, either table or function. */
+} PgStatHashKey;
+
+/* struct for shared statistics hash entry */
+typedef struct PgStatHashEntry
+{
+    PgStatHashKey    key; /* hash key */
+    dsa_pointer        body;/* pointer to shared stats in PgStat_StatEntryHeader */
+} PgStatHashEntry;
+
+/* struct for shared statistics local hash entry. */
+typedef struct PgStatLocalHashEntry
+{
+    PgStatHashKey            key;    /* hash key */
+    char                    status;    /* for simplehash use */
+    PgStat_StatEntryHeader *body;    /* pointer to stats body in local heap */
+    dsa_pointer                dsapointer; /* dsa pointer of body */
+} PgStatLocalHashEntry;
+
+/* parameter for the shared hash */
+static const dshash_parameters dsh_params = {
+    sizeof(PgStatHashKey),
+    sizeof(PgStatHashEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+
+/* define hashtable for local hashes */
+#define SH_PREFIX pgstat_localhash
+#define SH_ELEMENT_TYPE PgStatLocalHashEntry
+#define SH_KEY_TYPE PgStatHashKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+    hash_bytes((unsigned char *)&key, sizeof(PgStatHashKey))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(PgStatHashKey)) == 0)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/* The shared hash to index activity stats entries. */
+static dshash_table *pgStatSharedHash = NULL;
 
 /*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
+ * The local cache to index shared stats entries.
+ *
+ * This is a local hash to store native pointers to shared hash
+ * entries. pgStatEntHashAge is copied from StatsShmem->gc_count at creation
+ * and garbage collection.
  */
-static bool have_function_stats = false;
+static pgstat_localhash_hash *pgStatEntHash = NULL;
+static int     pgStatEntHashAge = 0;        /* cache age of pgStatEntHash */
+
+/* Local stats numbers are stored here. */
+static pgstat_localhash_hash *pgStatLocalHash = NULL;
+
+/* entry type for oid hash */
+typedef struct pgstat_oident
+{
+    Oid oid;
+    char status;
+} pgstat_oident;
+
+/* Define hashtable for OID hashes. */
+StaticAssertDecl(sizeof(Oid) == 4, "oid is not compatible with uint32");
+#define SH_PREFIX pgstat_oid
+#define SH_ELEMENT_TYPE pgstat_oident
+#define SH_KEY_TYPE Oid
+#define SH_KEY oid
+#define SH_HASH_KEY(tb, key) hash_bytes_uint32(key)
+#define SH_EQUAL(tb, a, b) (a == b)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
 
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
@@ -264,11 +402,8 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -277,23 +412,9 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Make our own memory context to make it easy to track memory usage.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-static PgStat_WalStats walStats;
-static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
-static PgStat_ReplSlotStats *replSlotStats;
-static int    nReplSlotStats;
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
+MemoryContext    pgStatCacheContext = NULL;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -302,40 +423,57 @@ static List *pending_write_requests = NIL;
  */
 static instr_time total_func_time;
 
+/* Simple caching feature for pgstatfuncs */
+static PgStatHashKey    stathashkey_zero = {0};
+static PgStatHashKey        cached_dbent_key = {0};
+static PgStat_StatDBEntry    cached_dbent;
+static PgStatHashKey        cached_tabent_key = {0};
+static PgStat_StatTabEntry    cached_tabent;
+static PgStatHashKey        cached_funcent_key = {0};
+static PgStat_StatFuncEntry    cached_funcent;
+
+static PgStat_Archiver        cached_archiverstats;
+static bool                    cached_archiverstats_is_valid = false;
+static PgStat_BgWriter        cached_bgwriterstats;
+static bool                    cached_bgwriterstats_is_valid = false;
+static PgStat_CheckPointer    cached_checkpointerstats;
+static bool                    cached_checkpointerstats_is_valid = false;
+static PgStat_Wal            cached_walstats;
+static bool                    cached_walstats_is_valid = false;
+static PgStat_SLRUStats        cached_slrustats;
+static bool                    cached_slrustats_is_valid = false;
+static PgStat_ReplSlot       *cached_replslotstats = NULL;
+static int                    n_cached_replslotstats = -1;
 
 /* ----------
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgstat_beshutdown_hook(int code, Datum arg);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                                                 Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStat_StatDBEntry *get_local_dbstat_entry(Oid dbid);
+static PgStat_TableStatus *get_local_tabstat_entry(Oid rel_id, bool isshared);
+
+static void pgstat_write_statsfile(void);
+
+static void pgstat_read_statsfile(void);
 static void pgstat_read_current_status(void);
 
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
+static PgStat_StatEntryHeader * get_stat_entry(PgStatTypes type, Oid dbid,
+                                               Oid objid, bool nowait,
+                                               bool create, bool *found);
 
-static int    pgstat_replslot_index(const char *name, bool create_it);
-static void pgstat_reset_replslot(int i, TimestampTz ts);
+static bool flush_tabstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_funcstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_dbstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_walstat(bool nowait);
+static bool flush_slrustat(bool nowait);
+static void delete_current_stats_entry(dshash_seq_status *hstat);
+static PgStat_StatEntryHeader * get_local_stat_entry(PgStatTypes type, Oid dbid,
+                                                     Oid objid, bool create,
+                                                     bool *found);
 
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
-static void pgstat_send_slru(void);
-static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
-
-static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
+static pgstat_oid_hash *collect_oids(Oid catalogid, AttrNumber anum_oid);
 
 static void pgstat_setup_memcxt(void);
 
@@ -345,489 +483,630 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len);
-static void pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
-static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_replslot(PgStat_MsgReplSlot *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute shared memory space needed for activity statistics
+ */
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+/*
+ * StatsShmemInit - initialize during shared-memory creation
  */
 void
-pgstat_init(void)
+StatsShmemInit(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
+    bool        found;
+
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(), &found);
+
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+        ConditionVariableInit(&StatsShmem->holdoff_cv);
+        pg_atomic_init_u32(&StatsShmem->archiver_changecount, 0);
+        pg_atomic_init_u32(&StatsShmem->bgwriter_changecount, 0);
+        pg_atomic_init_u32(&StatsShmem->checkpointer_changecount, 0);
+
+        pg_atomic_init_u64(&StatsShmem->gc_count, 0);
+
+        LWLockInitialize(&StatsShmem->wal_stats_lock, LWTRANCHE_STATS);
+    }
+}
+
+/* ----------
+ * allow_next_attacher() -
+ *
+ *  Let other processes go ahead and attach the shared stats area.
+ * ----------
+ */
+static void
+allow_next_attacher(void)
+{
+    bool triggered = false;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (StatsShmem->attach_holdoff)
+    {
+        StatsShmem->attach_holdoff = false;
+        triggered = true;
+    }
+    LWLockRelease(StatsLock);
+
+    if (triggered)
+        ConditionVariableBroadcast(&StatsShmem->holdoff_cv);
+}
+
+/* ----------
+ * attach_shared_stats() -
+ *
+ *    Attach or create the shared stats memory. If we are the first process to
+ *    use the activity stats system, read the saved statistics file if any.
+ * ---------
+ */
+static void
+attach_shared_stats(void)
+{
+    MemoryContext oldcontext;
 
     /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
+     * Don't use DSM when not under postmaster, or when not tracking counts.
      */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    pgstat_setup_memcxt();
+
+    if (area)
+        return;
+
+    /* stats shared memory persists for the backend lifetime */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
 
     /*
-     * Create the UDP socket for sending and receiving statistic messages
+     * The first attacher backend may still be reading the stats file, or the
+     * last detacher may be writing it. Wait for the work to finish.
      */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
+    ConditionVariablePrepareToSleep(&StatsShmem->holdoff_cv);
+    for (;;)
     {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
+        bool hold_off;
+
+        LWLockAcquire(StatsLock, LW_SHARED);
+        hold_off = StatsShmem->attach_holdoff;
+        LWLockRelease(StatsLock);
+
+        if (!hold_off)
+            break;
+
+        ConditionVariableTimedSleep(&StatsShmem->holdoff_cv, 10,
+                                    WAIT_EVENT_READING_STATS_FILE);
     }
+    ConditionVariableCancelSleep();
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
     /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
+     * The last process is responsible for writing out the stats file at exit.
+     * Maintain refcount so that a process going to exit can find whether it is
+     * the last one or not.
      */
-    for (addr = addrs; addr; addr = addr->ai_next)
+    if (StatsShmem->refcount > 0)
+        StatsShmem->refcount++;
+    else
     {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
+        /* We're the first process to attach the shared stats memory */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
 
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
+        /* Initialize shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatSharedHash = dshash_create(area, &dsh_params, 0);
 
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->hash_handle = dshash_get_hash_table_handle(pgStatSharedHash);
+        LWLockInitialize(&StatsShmem->slru_stats.lock, LWTRANCHE_STATS);
+        pg_atomic_init_u32(&StatsShmem->slru_stats.changecount, 0);
 
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        /* Block the next attacher for a while, see the comment above. */
+        StatsShmem->attach_holdoff = true;
 
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        StatsShmem->refcount = 1;
+    }
 
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    LWLockRelease(StatsLock);
 
+    if (area)
+    {
         /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
+         * We're the first attacher process, read stats file while blocking
+         * successors.
          */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
+        Assert(StatsShmem->attach_holdoff);
+        pgstat_read_statsfile();
+        allow_next_attacher();
     }
-
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
-
-    /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
-     */
-    if (!pg_set_noblock(pgStatSock))
+    else
     {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
-    }
+        /* We're not the first one, attach existing shared area. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatSharedHash = dshash_attach(area, &dsh_params,
+                                         StatsShmem->hash_handle, 0);
 
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
     }
 
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-    /* Now that we have a long-lived socket, tell fd.c about it. */
-    ReserveExternalFD();
+    MemoryContextSwitchTo(oldcontext);
 
-    return;
+    /* don't detach automatically */
+    dsa_pin_mapping(area);
+}
 
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
+/* ----------
+ * cleanup_dropped_stats_entries() -
+ *              Clean up shared stats entries no longer used.
+ *
+ *  Shared stats entries for dropped objects may be left referenced. Clean up
+ *  our reference and drop the shared entry if needed.
+ * ----------
+ */
+static void
+cleanup_dropped_stats_entries(void)
+{
+    pgstat_localhash_iterator i;
+    PgStatLocalHashEntry   *ent;
 
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
+    if (pgStatEntHash == NULL)
+        return;
 
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
+    pgstat_localhash_start_iterate(pgStatEntHash, &i);
+    while ((ent = pgstat_localhash_iterate(pgStatEntHash, &i))
+           != NULL)
+    {
+        /*
+         * Free the shared memory chunk for the entry if we were the last
+         * referrer to a dropped entry.
+         */
+        if (pg_atomic_sub_fetch_u32(&ent->body->refcount, 1) < 1 &&
+            ent->body->dropped)
+            dsa_free(area, ent->dsapointer);
+    }
 
     /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
+     * This function is expected to be called during backend exit. So we don't
+     * bother destroying pgStatEntHash.
      */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    pgStatEntHash = NULL;
 }
 
-/*
- * subroutine for pgstat_reset_all
+/* ----------
+ * detach_shared_stats() -
+ *
+ *    Detach shared stats. Write out to file if we're the last process and told
+ *    to do so.
+ * ----------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+detach_shared_stats(bool write_file)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    bool is_last_detacher = false;
+
+    /* immediately return if useless */
+    if (!area || !IsUnderPostmaster)
+        return;
+
+    /* We shouldn't leave a reference to shared stats. */
+    Assert(pgStatEntHash == NULL);
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
+    /*
+     * If we are the last detacher, hold off the next attacher (if possible)
+     * until we finish writing the stats file.
+     */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (--StatsShmem->refcount == 0)
     {
-        int            nchars;
-        Oid            tmp_oid;
+        StatsShmem->attach_holdoff = true;
+        is_last_detacher = true;
+    }
+    LWLockRelease(StatsLock);
 
-        /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
-         */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
-
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
-
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+    if (is_last_detacher)
+    {
+        if (write_file)
+            pgstat_write_statsfile();
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+        /* allow the next attacher, if any */
+        allow_next_attacher();
     }
-    FreeDir(dir);
+
+    /*
+     * Detach the area. It is automatically destroyed when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);
+
+    area = NULL;
+    pgStatSharedHash = NULL;
+
+    /* We are going to exit. Don't bother destroying local hashes. */
+    pgStatLocalHash = NULL;
 }
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    /* standalone server doesn't use shared stats */
+    if (!IsUnderPostmaster)
+        return;
+
+    /* we must have shared stats attached */
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
+
+    /* Startup must be the only user of shared stats */
+    Assert(StatsShmem->refcount == 1);
+
+    /*
+     * We could remove the files directly and recreate the shared memory area,
+     * but for simplicity we just detach and then re-attach.
+     */
+    detach_shared_stats(false);
+    attach_shared_stats();
 }
 
-#ifdef EXEC_BACKEND
 
 /*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
+ * fetch_lock_statentry - common helper function to fetch and lock a stats
+ * entry for flush_tabstat, flush_funcstat and flush_dbstat.
  */
-static pid_t
-pgstat_forkexec(void)
+static PgStat_StatEntryHeader *
+fetch_lock_statentry(PgStatTypes type, Oid dboid, Oid objid, bool nowait)
 {
-    char       *av[10];
-    int            ac = 0;
+    PgStat_StatEntryHeader *header;
+
+    /* find the shared stats entry for the given type, database and object */
+    header = (PgStat_StatEntryHeader *)
+        get_stat_entry(type, dboid, objid, nowait, true, NULL);
 
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
+    /* skip if dshash failed to acquire lock */
+    if (header == NULL)
+        return NULL;
 
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&header->lock, LW_EXCLUSIVE))
+        return NULL;
 
-    return postmaster_forkexec(ac, av);
+    return header;
 }
-#endif                            /* EXEC_BACKEND */
 
 
-/*
- * pgstat_start() -
+/* ----------
+ * get_stat_entry() -
  *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
+ *    Get the shared stats entry for the specified type, dbid and objid.
+ *  If nowait is true, returns NULL on lock failure.
+ *
+ *  If create is true, a new entry is created if it does not exist yet. If
+ *  found is not NULL, it is set to true if an existing entry is found or
+ *  false if not.
+ *  ----------
+ */
+static PgStat_StatEntryHeader *
+get_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait, bool create,
+               bool *found)
+{
+    PgStatHashEntry           *shhashent;
+    PgStatLocalHashEntry   *lohashent;
+    PgStat_StatEntryHeader *shheader = NULL;
+    PgStatHashKey            key;
+    bool                    shfound;
+
+    key.type        = type;
+    key.databaseid     = dbid;
+    key.objectid    = objid;
+
+    if (pgStatEntHash)
+    {
+        uint64 currage;
+
+        /*
+         * gc_count advances far more slowly than the following loop runs, so
+         * the loop is expected to iterate no more than twice.
+         */
+        while (unlikely
+               (pgStatEntHashAge !=
+                (currage = pg_atomic_read_u64(&StatsShmem->gc_count))))
+        {
+            pgstat_localhash_iterator i;
+
+            /*
+             * Some entries have been dropped. Invalidate cache pointer to
+             * them.
+             */
+            pgstat_localhash_start_iterate(pgStatEntHash, &i);
+            while ((lohashent = pgstat_localhash_iterate(pgStatEntHash, &i))
+                   != NULL)
+            {
+                PgStat_StatEntryHeader *header = lohashent->body;
+                if (header->dropped)
+                {
+                    pgstat_localhash_delete(pgStatEntHash, lohashent->key);
+
+                    if (pg_atomic_sub_fetch_u32(&header->refcount, 1) < 1)
+                    {
+                        /*
+                         * We're the last referrer to this entry, drop the
+                         * shared entry.
+                         */
+                        dsa_free(area, lohashent->dsapointer);
+                    }
+                }
+            }
+
+            pgStatEntHashAge = currage;
+        }
+
+        lohashent = pgstat_localhash_lookup(pgStatEntHash, key);
+
+        if (lohashent)
+        {
+            if (found)
+                *found = true;
+            return lohashent->body;
+        }
+    }
+
+    shhashent = dshash_find_extended(pgStatSharedHash, &key,
+                                     create, nowait, create, &shfound);
+    if (shhashent)
+    {
+        if (create && !shfound)
+        {
+            /* Create new stats entry. */
+            dsa_pointer chunk = dsa_allocate0(area,
+                                              pgstat_sharedentsize[type]);
+
+            shheader = dsa_get_address(area, chunk);
+            LWLockInitialize(&shheader->lock, LWTRANCHE_STATS);
+            pg_atomic_init_u32(&shheader->refcount, 0);
+
+            /* Link the new entry from the hash entry. */
+            shhashent->body = chunk;
+        }
+        else
+            shheader = dsa_get_address(area, shhashent->body);
+
+        /*
+         * We expose this shared entry now.  You might think that the entry can
+         * be removed by a concurrent backend, but since we are creating a
+         * stats entry, the object actually exists and is in use by the upper
+         * layer. Such an object cannot be dropped until the first vacuum after
+         * the current transaction ends.
+         */
+        dshash_release_lock(pgStatSharedHash, shhashent);
+
+        /* register to local hash if possible */
+        if (pgStatEntHash || pgStatCacheContext)
+        {
+            bool                    lofound;
+
+            if (pgStatEntHash == NULL)
+            {
+                pgStatEntHash =
+                    pgstat_localhash_create(pgStatCacheContext,
+                                        PGSTAT_TABLE_HASH_SIZE, NULL);
+                pgStatEntHashAge =
+                    pg_atomic_read_u64(&StatsShmem->gc_count);
+            }
+
+            lohashent =
+                pgstat_localhash_insert(pgStatEntHash, key, &lofound);
+
+            Assert(!lofound);
+            lohashent->body = shheader;
+            lohashent->dsapointer = shhashent->body;
+
+            pg_atomic_add_fetch_u32(&shheader->refcount, 1);
+        }
+    }
+
+    if (found)
+        *found = shfound;
+
+    return shheader;
+}
+
+/*
+ * flush_walstat - flush out local WAL stats entries
  *
- *    Returns PID of child process, or 0 if fail.
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
  *
- *    Note: if fail, we will be called again from the postmaster main loop.
+ * Returns true if all local WAL stats are successfully flushed out.
  */
-int
-pgstat_start(void)
+static bool
+flush_walstat(bool nowait)
 {
-    time_t        curtime;
-    pid_t        pgStatPid;
+    /* We assume this initializes to zeroes */
+    PgStat_Wal *s = &StatsShmem->wal_stats;
+    PgStat_Wal *l = &WalStats;
 
     /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
+     * This function can be called even if nothing at all has happened. In
+     * that case, avoid taking the lock for completely empty stats.
      */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
+    if (!walstats_pending())
+        return true;
 
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&StatsShmem->wal_stats_lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&StatsShmem->wal_stats_lock,
+                                       LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
+
+    s->wal_buffers_full += l->wal_buffers_full;
+    LWLockRelease(&StatsShmem->wal_stats_lock);
 
     /*
-     * Okay, fork off the collector.
+     * Clear out the statistics buffer, so it can be re-used.
      */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
+    MemSet(&WalStats, 0, sizeof(WalStats));
 
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
+    return true;
+}
 
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
+/*
+ * flush_slrustat - flush out local SLRU stats entries
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true. Writer processes are mutually excluded
+ * using an LWLock, but readers are expected to use the change-count protocol
+ * to avoid interference with writers.
+ *
+ * Returns true if all local SLRU stats are successfully flushed out.
+ */
+static bool
+flush_slrustat(bool nowait)
+{
+    uint32    assert_changecount PG_USED_FOR_ASSERTS_ONLY;
+    int i;
 
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
+    if (!have_slrustats)
+        return true;
 
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&StatsShmem->slru_stats.lock,
+                                       LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
 
-        default:
-            return (int) pgStatPid;
+    assert_changecount =
+        pg_atomic_fetch_add_u32(&StatsShmem->slru_stats.changecount, 1);
+    Assert((assert_changecount & 1) == 0);
+
+    for (i = 0 ; i < SLRU_NUM_ELEMENTS ; i++)
+    {
+        PgStat_SLRUStats *sharedent = &StatsShmem->slru_stats.entry[i];
+        PgStat_SLRUStats *localent = &local_SLRUStats[i];
+
+        sharedent->blocks_zeroed += localent->blocks_zeroed;
+        sharedent->blocks_hit += localent->blocks_hit;
+        sharedent->blocks_read += localent->blocks_read;
+        sharedent->blocks_written += localent->blocks_written;
+        sharedent->blocks_exists += localent->blocks_exists;
+        sharedent->flush += localent->flush;
+        sharedent->truncate += localent->truncate;
     }
 
-    /* shouldn't get here */
-    return 0;
+    /* done, clear the local entry */
+    MemSet(local_SLRUStats, 0,
+           sizeof(PgStat_SLRUStats) * SLRU_NUM_ELEMENTS);
+
+    pg_atomic_add_fetch_u32(&StatsShmem->slru_stats.changecount, 1);
+    LWLockRelease(&StatsShmem->slru_stats.lock);
+
+    have_slrustats = false;
+
+    return true;
 }
 
-void
-allow_immediate_pgstat_restart(void)
+/* ----------
+ * delete_current_stats_entry()
+ *
+ *  Deletes the given shared entry from shared stats hash. The entry must be
+ *  exclusively locked.
+ * ----------
+ */
+static void
+delete_current_stats_entry(dshash_seq_status *hstat)
 {
-    last_pgstat_start_time = 0;
+    dsa_pointer                pdsa;
+    PgStat_StatEntryHeader *header;
+    PgStatHashEntry *ent;
+
+    ent = dshash_get_current(hstat);
+    pdsa = ent->body;
+    header = dsa_get_address(area, pdsa);
+
+    /* No one can find this entry from now on. */
+    dshash_delete_current(hstat);
+
+    /*
+     * Let the referrers drop the entry if any.  Refcount won't be decremented
+     * until "dropped" is set true and StatsShmem->gc_count is incremented
+     * later. So we can check refcount to set dropped without holding a
+     * lock. If no one is referring to this entry, free it immediately.
+     */
+
+    if (pg_atomic_read_u32(&header->refcount) > 0)
+        header->dropped = true;
+    else
+        dsa_free(area, pdsa);
+
+    return;
+}
+
+/* ----------
+ * get_local_stat_entry() -
+ *
+ *  Returns the local stats entry for the type, dbid and objid.
+ *  If create is true, a new entry is created if it does not exist yet; found
+ *  must be non-NULL in that case.
+ *
+ *
+ *  The caller is responsible for initializing the statsbody part of the
+ *  returned memory.
+ * ----------
+ */
+static PgStat_StatEntryHeader *
+get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                     bool create, bool *found)
+{
+    PgStatHashKey key;
+    PgStatLocalHashEntry *entry;
+
+    if (pgStatLocalHash == NULL)
+        pgStatLocalHash = pgstat_localhash_create(pgStatCacheContext,
+                                                  PGSTAT_TABLE_HASH_SIZE, NULL);
+
+    /* Find an entry or create a new one. */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    if (create)
+        entry = pgstat_localhash_insert(pgStatLocalHash, key, found);
+    else
+        entry = pgstat_localhash_lookup(pgStatLocalHash, key);
+
+    if (!create && !entry)
+        return NULL;
+
+    if (create && !*found)
+        entry->body = MemoryContextAllocZero(TopMemoryContext,
+                                             pgstat_localentsize[type]);
+
+    return entry->body;
 }
 
 /* ------------------------------------------------------------
@@ -835,150 +1114,399 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *    Updates are applied no more frequently than once every
+ *    PGSTAT_MIN_INTERVAL milliseconds. They are also postponed on lock
+ *    failure if force is false and no update has been pending for longer than
+ *    PGSTAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
+ *    succeeding calls of this function.
+ *
+ *    Returns the time in milliseconds until the next flush is due when
+ *    updates are postponed or remain pending, or 0 if nothing is left to
+ *    flush.
+ *
+ *    Note that this is called only out of a transaction, so it is fine to use
  *    transaction stop time as an approximation of current time.
- * ----------
+ *    ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz    next_flush = 0;
+    static TimestampTz    pending_since = 0;
+    static long            retry_interval = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
+    bool        nowait;
     int            i;
+    uint64        oldval;
+
+    /* Return if not active */
+    if (area == NULL)
+        return 0;
+
+    /*
+     * We need a database entry if any of the following stats exist.
+     */
+    if (pgStatXactCommit > 0 || pgStatXactRollback > 0 ||
+        pgStatBlockReadTime > 0 || pgStatBlockWriteTime > 0)
+        get_local_dbstat_entry(MyDatabaseId);
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (pgStatLocalHash == NULL && !have_slrustats && !walstats_pending())
+        return 0;
 
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
     now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
-
-    /*
-     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
-     */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
-    pgStatTabHash = NULL;
-
-    /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
-     */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
-    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+
+    if (!force)
     {
-        for (i = 0; i < tsa->tsa_used; i++)
+        /*
+         * Don't flush stats too frequently.  Return the time to the next
+         * flush.
+         */
+        if (now < next_flush)
         {
-            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
-
-            /* Shouldn't have any pending transaction-dependent counts */
-            Assert(entry->trans == NULL);
-
-            /*
-             * Ignore entries that didn't accumulate any actual counts, such
-             * as indexes that were opened by the planner but not used.
-             */
-            if (memcmp(&entry->t_counts, &all_zeroes,
-                       sizeof(PgStat_TableCounts)) == 0)
+            /* Record the epoch time if retrying. */
+            if (pending_since == 0)
+                pending_since = now;
+
+            return (next_flush - now) / 1000;
+        }
+
+        /* But, don't keep pending updates longer than PGSTAT_MAX_INTERVAL. */
+
+        if (pending_since > 0 &&
+            TimestampDifferenceExceeds(pending_since, now, PGSTAT_MAX_INTERVAL))
+            force = true;
+    }
+
+    /* don't wait for lock acquisition when !force */
+    nowait = !force;
+
+    if (pgStatLocalHash)
+    {
+        int            remains = 0;
+        pgstat_localhash_iterator i;
+        List                   *dbentlist = NIL;
+        ListCell               *lc;
+        PgStatLocalHashEntry   *lent;
+
+        /* Step 1: flush out everything other than database stats */
+        pgstat_localhash_start_iterate(pgStatLocalHash, &i);
+        while ((lent = pgstat_localhash_iterate(pgStatLocalHash, &i)) != NULL)
+        {
+            bool        remove = false;
+
+            switch (lent->key.type)
+            {
+                case PGSTAT_TYPE_DB:
+                    /*
+                     * flush_tabstat adds some counters of the flushed
+                     * entries to the local database stats. Just remember the
+                     * database entries for now and flush them out later.
+                     */
+                    dbentlist = lappend(dbentlist, lent);
+                    break;
+                case PGSTAT_TYPE_TABLE:
+                    if (flush_tabstat(lent, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_FUNCTION:
+                    if (flush_funcstat(lent, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_REPLSLOT:
+                    /* We don't have that kind of local entry */
+                    Assert(false);
+            }
+
+            if (!remove)
+            {
+                remains++;
                 continue;
+            }
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* Remove the successfully flushed entry */
+            pfree(lent->body);
+            lent->body = NULL;
+            pgstat_localhash_delete(pgStatLocalHash, lent->key);
+        }
+
+        /* Step 2: flush out database stats */
+        foreach(lc, dbentlist)
+        {
+            PgStatLocalHashEntry *lent = (PgStatLocalHashEntry *) lfirst(lc);
+
+            if (flush_dbstat(lent, nowait))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                remains--;
+                /* Remove the successfully flushed entry */
+                pfree(lent->body);
+                lent->body = NULL;
+                pgstat_localhash_delete(pgStatLocalHash, lent->key);
             }
         }
-        /* zero out PgStat_TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
+        list_free(dbentlist);
+        dbentlist = NULL;
+
+        if (remains <= 0)
+        {
+            pgstat_localhash_destroy(pgStatLocalHash);
+            pgStatLocalHash = NULL;
+        }
+    }
+
+    /* flush wal stats */
+    flush_walstat(nowait);
+
+    /* flush SLRU stats */
+    flush_slrustat(nowait);
+
+    /*
+     * Publish the time of the last flush, but don't notify anyone of the
+     * timestamp change itself. Readers will get a sufficiently recent
+     * timestamp. If we fail to update the value, a concurrent process must
+     * have already updated it to a sufficiently recent time.
+     *
+     * XXX: The loop might be unnecessary for the reason above.
+     */
+    oldval = pg_atomic_read_u64(&StatsShmem->stats_timestamp);
+
+    for (i = 0 ; i < 10 ; i++)
+    {
+        if (oldval >= now ||
+            pg_atomic_compare_exchange_u64(&StatsShmem->stats_timestamp,
+                                           &oldval, (uint64) now))
+            break;
     }
 
     /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
+     * Some of the local stats may not have been flushed due to lock
+     * contention.  If we have such pending local stats here, let the caller
+     * know the retry interval.
      */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    if (pgStatLocalHash != NULL || have_slrustats || walstats_pending())
+    {
+        /* Retain the epoch time */
+        if (pending_since == 0)
+            pending_since = now;
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+        /* The interval is doubled at every retry. */
+        if (retry_interval == 0)
+            retry_interval = PGSTAT_RETRY_MIN_INTERVAL * 1000;
+        else
+            retry_interval = retry_interval * 2;
 
-    /* Send WAL statistics */
-    pgstat_send_wal();
+        /*
+         * Determine the next retry interval so as not to get shorter than the
+         * previous interval.
+         */
+        if (!TimestampDifferenceExceeds(pending_since,
+                                        now + 2 * retry_interval,
+                                        PGSTAT_MAX_INTERVAL))
+            next_flush = now + retry_interval;
+        else
+        {
+            next_flush = pending_since + PGSTAT_MAX_INTERVAL * 1000;
+            retry_interval = next_flush - now;
+        }
 
-    /* Finally send SLRU statistics */
-    pgstat_send_slru();
+        return retry_interval / 1000;
+    }
+
+    /* Set the next time to update stats */
+    next_flush = now + PGSTAT_MIN_INTERVAL * 1000;
+    retry_interval = 0;
+    pending_since = 0;
+
+    return 0;
 }
 
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * flush_tabstat - flush out a local table stats entry
+ *
+ * Some of the stats numbers are copied to local database stats entry after
+ * successful flush-out.
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
  */
-static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+static bool
+flush_tabstat(PgStatLocalHashEntry *ent, bool nowait)
 {
-    int            n;
-    int            len;
+    static const PgStat_TableCounts all_zeroes;
+    Oid                    dboid;            /* database OID of the table */
+    PgStat_TableStatus *lstats;            /* local stats entry  */
+    PgStat_StatTabEntry *shtabstats;    /* table entry of shared stats */
+    PgStat_StatDBEntry *ldbstats;        /* local database entry */
+
+    Assert(ent->key.type == PGSTAT_TYPE_TABLE);
+    lstats = (PgStat_TableStatus *) ent->body;
+    dboid = ent->key.databaseid;
+
+    /*
+     * Ignore entries that didn't accumulate any actual counts, such as
+     * indexes that were opened by the planner but not used.
+     */
+    if (memcmp(&lstats->t_counts, &all_zeroes,
+               sizeof(PgStat_TableCounts)) == 0)
+    {
+        /* This local entry is going to be dropped, delink from relcache. */
+        pgstat_delinkstats(lstats->relation);
+        return true;
+    }
+
+    /* find shared table stats entry corresponding to the local entry */
+    shtabstats = (PgStat_StatTabEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_TABLE, dboid, ent->key.objectid,
+                             nowait);
+
+    if (shtabstats == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    /* add the values to the shared entry. */
+    shtabstats->numscans += lstats->t_counts.t_numscans;
+    shtabstats->tuples_returned += lstats->t_counts.t_tuples_returned;
+    shtabstats->tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    shtabstats->tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    shtabstats->tuples_updated += lstats->t_counts.t_tuples_updated;
+    shtabstats->tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    shtabstats->tuples_hot_updated += lstats->t_counts.t_tuples_hot_updated;
+
+    /*
+     * If the table was truncated or vacuum/analyze has run, first reset the
+     * live/dead counters.
+     */
+    if (lstats->t_counts.t_truncated)
+    {
+        shtabstats->n_live_tuples = 0;
+        shtabstats->n_dead_tuples = 0;
+    }
+
+    shtabstats->n_live_tuples += lstats->t_counts.t_delta_live_tuples;
+    shtabstats->n_dead_tuples += lstats->t_counts.t_delta_dead_tuples;
+    shtabstats->changes_since_analyze += lstats->t_counts.t_changed_tuples;
+    shtabstats->inserts_since_vacuum += lstats->t_counts.t_tuples_inserted;
+    shtabstats->blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    shtabstats->blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    shtabstats->n_live_tuples = Max(shtabstats->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    shtabstats->n_dead_tuples = Max(shtabstats->n_dead_tuples, 0);
+
+    LWLockRelease(&shtabstats->header.lock);
+
+    /* The entry was flushed successfully; add the same to database stats */
+    ldbstats = get_local_dbstat_entry(dboid);
+    ldbstats->counts.n_tuples_returned += lstats->t_counts.t_tuples_returned;
+    ldbstats->counts.n_tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    ldbstats->counts.n_tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    ldbstats->counts.n_tuples_updated += lstats->t_counts.t_tuples_updated;
+    ldbstats->counts.n_tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    ldbstats->counts.n_blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    ldbstats->counts.n_blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /* This local entry is going to be dropped, delink from relcache. */
+    pgstat_delinkstats(lstats->relation);
+
+    return true;
+}
+
+/*
+ * flush_funcstat - flush out a local function stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_funcstat(PgStatLocalHashEntry *ent, bool nowait)
+{
+    PgStat_BackendFunctionEntry *localent;    /* local stats entry */
+    PgStat_StatFuncEntry *shfuncent = NULL; /* shared stats entry */
+
+    Assert(ent->key.type == PGSTAT_TYPE_FUNCTION);
+    localent = (PgStat_BackendFunctionEntry *) ent->body;
+
+    /* localent always has non-zero content */
+
+    /* find shared function stats entry corresponding to the local entry */
+    shfuncent = (PgStat_StatFuncEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             ent->key.objectid, nowait);
+    if (shfuncent == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    shfuncent->f_numcalls += localent->f_counts.f_numcalls;
+    shfuncent->f_total_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_total_time);
+    shfuncent->f_self_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_self_time);
+
+    LWLockRelease(&shfuncent->header.lock);
+
+    return true;
+}
+
+
+/*
+ * flush_dbstat - flush out a local database stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_dbstat(PgStatLocalHashEntry *ent, bool nowait)
+{
+    PgStat_StatDBEntry *localent;
+    PgStat_StatDBEntry *sharedent;
+
+    Assert(ent->key.type == PGSTAT_TYPE_DB);
+
+    localent = (PgStat_StatDBEntry *) &ent->body;
+
+    /* find shared database stats entry corresponding to the local entry */
+    sharedent = (PgStat_StatDBEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_DB, ent->key.databaseid, InvalidOid,
+                             nowait);
+
+    if (!sharedent)
+        return false;            /* failed to acquire lock, skip */
+
+    sharedent->counts.n_tuples_returned += localent->counts.n_tuples_returned;
+    sharedent->counts.n_tuples_fetched += localent->counts.n_tuples_fetched;
+    sharedent->counts.n_tuples_inserted += localent->counts.n_tuples_inserted;
+    sharedent->counts.n_tuples_updated += localent->counts.n_tuples_updated;
+    sharedent->counts.n_tuples_deleted += localent->counts.n_tuples_deleted;
+    sharedent->counts.n_blocks_fetched += localent->counts.n_blocks_fetched;
+    sharedent->counts.n_blocks_hit += localent->counts.n_blocks_hit;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    sharedent->counts.n_deadlocks += localent->counts.n_deadlocks;
+    sharedent->counts.n_temp_bytes += localent->counts.n_temp_bytes;
+    sharedent->counts.n_temp_files += localent->counts.n_temp_files;
+    sharedent->counts.n_checksum_failures += localent->counts.n_checksum_failures;
 
     /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
+     * Accumulate xact commit/rollback and I/O timings to stats entry of the
+     * current database.
      */
-    if (OidIsValid(tsmsg->m_databaseid))
+    if (OidIsValid(ent->key.databaseid))
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
+        sharedent->counts.n_xact_commit += pgStatXactCommit;
+        sharedent->counts.n_xact_rollback += pgStatXactRollback;
+        sharedent->counts.n_block_read_time += pgStatBlockReadTime;
+        sharedent->counts.n_block_write_time += pgStatBlockWriteTime;
         pgStatXactCommit = 0;
         pgStatXactRollback = 0;
         pgStatBlockReadTime = 0;
@@ -986,282 +1514,134 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        sharedent->counts.n_xact_commit = 0;
+        sharedent->counts.n_xact_rollback = 0;
+        sharedent->counts.n_block_read_time = 0;
+        sharedent->counts.n_block_write_time = 0;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
+    LWLockRelease(&sharedent->header.lock);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    return true;
 }
 
-/*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
- */
-static void
-pgstat_send_funcstats(void)
-{
-    /* we assume this inits to all zeroes: */
-    static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
-
-    if (pgStatFunctions == NULL)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
-
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
-
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
-
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
-
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
-
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
-}
-
-
 /* ----------
  * pgstat_vacuum_stat() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *  Delete shared stat entries that are not in system catalogs.
+ *
+ *  To avoid holding exclusive lock on dshash for a long time, the process is
+ *  performed in three steps.
+ *
+ *   1: Collect existent oids of every kind of object.
+ *   2: Collect victim entries by scanning with shared lock.
+ *   3: Try removing every nominated entry without waiting for lock.
+ *
+ *  As a consequence of the last step, some entries may be left alone due to
+ *  lock failure, but they will be deleted by later invocations of this
+ *  function.
  * ----------
  */
 void
 pgstat_vacuum_stat(void)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Read pg_database and make a list of OIDs of all existing databases
-     */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
-
-    /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            dbid = dbentry->databaseid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        /* the DB entry for shared tables (with InvalidOid) is never dropped */
-        if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
-            pgstat_drop_database(dbid);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Lookup our own database entry; if not found, nothing more to do.
-     */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
+    pgstat_oid_hash       *dbids;    /* database ids */
+    pgstat_oid_hash       *relids;    /* relation ids in the current database */
+    pgstat_oid_hash       *funcids;    /* function ids in the current database */
+    int            nvictims = 0;    /* # of entries of the above */
+    dshash_seq_status dshstat;
+    PgStatHashEntry *ent;
+
+    /* we don't collect stats under standalone mode */
+    if (!IsUnderPostmaster)
         return;
 
-    /*
-     * Similarly to above, make a list of all known relations in this DB.
-     */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
+    /* collect oids of existent objects */
+    dbids = collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    relids = collect_oids(RelationRelationId, Anum_pg_class_oid);
+    funcids = collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
 
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
+    nvictims = 0;
 
-    /*
-     * Check for all tables listed in stats hashtable if they still exist.
-     */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+    /* some of the dshash entries are to be removed, take exclusive lock. */
+    dshash_seq_init(&dshstat, pgStatSharedHash, true);
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
     {
-        Oid            tabid = tabentry->tableid;
+        pgstat_oid_hash *oidhash;
+        Oid           key;
 
         CHECK_FOR_INTERRUPTS();
 
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+        /*
+         * Don't drop non-database entries that belong to databases other
+         * than the current one.
+         */
+        if (ent->key.type != PGSTAT_TYPE_DB &&
+            ent->key.databaseid != MyDatabaseId)
             continue;
 
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
-    }
-
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
+        switch (ent->key.type)
         {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
+            case PGSTAT_TYPE_DB:
+                /* don't remove database entry for shared tables */
+                if (ent->key.databaseid == 0)
+                    continue;
+                oidhash = dbids;
+                key = ent->key.databaseid;
+                break;
+
+            case PGSTAT_TYPE_TABLE:
+                oidhash = relids;
+                key = ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                oidhash = funcids;
+                key = ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_REPLSLOT:
+                /* We don't vacuum entries of this kind */
                 continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
         }
 
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
+        /* Skip existent objects. */
+        if (pgstat_oid_lookup(oidhash, key) != NULL)
+            continue;
 
-        hash_destroy(htab);
+        /* drop this entry */
+        delete_current_stats_entry(&dshstat);
+        nvictims++;
     }
+    dshash_seq_term(&dshstat);
+    pgstat_oid_destroy(dbids);
+    pgstat_oid_destroy(relids);
+    pgstat_oid_destroy(funcids);
+
+    if (nvictims > 0)
+        pg_atomic_add_fetch_u64(&StatsShmem->gc_count, 1);
 }
 
-
 /* ----------
- * pgstat_collect_oids() -
+ * collect_oids() -
  *
  *    Collect the OIDs of all objects listed in the specified system catalog
- *    into a temporary hash table.  Caller should hash_destroy the result
+ *    into a temporary hash table.  Caller should pgstat_oid_destroy the result
  *    when done with it.  (However, we make the table in CurrentMemoryContext
  *    so that it will be freed properly in event of an error.)
  * ----------
  */
-static HTAB *
-pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
+static pgstat_oid_hash *
+collect_oids(Oid catalogid, AttrNumber anum_oid)
 {
-    HTAB       *htab;
-    HASHCTL        hash_ctl;
+    pgstat_oid_hash *rethash;
     Relation    rel;
     TableScanDesc scan;
     HeapTuple    tup;
     Snapshot    snapshot;
 
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(Oid);
-    hash_ctl.hcxt = CurrentMemoryContext;
-    htab = hash_create("Temporary table of OIDs",
-                       PGSTAT_TAB_HASH_SIZE,
-                       &hash_ctl,
-                       HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    rethash = pgstat_oid_create(CurrentMemoryContext,
+                                PGSTAT_TABLE_HASH_SIZE, NULL);
 
     rel = table_open(catalogid, AccessShareLock);
     snapshot = RegisterSnapshot(GetLatestSnapshot());
@@ -1270,81 +1650,61 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     {
         Oid            thisoid;
         bool        isnull;
+        bool        found;
 
         thisoid = heap_getattr(tup, anum_oid, RelationGetDescr(rel), &isnull);
         Assert(!isnull);
 
         CHECK_FOR_INTERRUPTS();
 
-        (void) hash_search(htab, (void *) &thisoid, HASH_ENTER, NULL);
+        pgstat_oid_insert(rethash, thisoid, &found);
     }
     table_endscan(scan);
     UnregisterSnapshot(snapshot);
     table_close(rel, AccessShareLock);
 
-    return htab;
+    return rethash;
 }
 
-
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
- * ----------
+ *    Remove entry for the database that we just dropped.
+ *
+ *    Some entries might be left alone due to lock failure, or some stats
+ *    might be flushed after this, but we will still clean the dead DB
+ *    eventually via future invocations of pgstat_vacuum_stat().
+ * ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
 
+    Assert(OidIsValid(databaseid));
 
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
-
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
+    /* some of the dshash entries are to be removed, take exclusive lock. */
+    dshash_seq_init(&hstat, pgStatSharedHash, true);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+    {
+        if (p->key.databaseid == databaseid)
+            delete_current_stats_entry(&hstat);
+    }
+    dshash_seq_term(&hstat);
+
+    /* Let readers run a garbage collection of local hashes */
+    pg_atomic_add_fetch_u64(&StatsShmem->gc_count, 1);
 }
-#endif                            /* NOT_USED */
 
 
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1353,53 +1713,144 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    /* dshash entry is not modified, take shared lock */
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+    {
+        PgStat_StatEntryHeader *header;
+
+        if (p->key.databaseid != MyDatabaseId)
+            continue;
+
+        header = dsa_get_address(area, p->body);
+
+        LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+        memset(PGSTAT_SHENT_BODY(header), 0,
+               PGSTAT_SHENT_BODY_LEN(p->key.type));
+
+        if (p->key.type == PGSTAT_TYPE_DB)
+        {
+            PgStat_StatDBEntry *dbstat = (PgStat_StatDBEntry *) header;
+            dbstat->stat_reset_timestamp = GetCurrentTimestamp();
+        }
+        LWLockRelease(&header->lock);
+    }
+    dshash_seq_term(&hstat);
+
+    /* Invalidate the simple cache keys */
+    cached_dbent_key = stathashkey_zero;
+    cached_tabent_key = stathashkey_zero;
+    cached_funcent_key = stathashkey_zero;
+}
+
+/*
+ * pgstat_copy_global_stats - helper function for functions
+ *           pgstat_fetch_stat_*() and pgstat_reset_shared_counters().
+ */
+static inline void
+pgstat_copy_global_stats(void *dst, void *src, size_t len,
+                         pg_atomic_uint32 *count)
+{
+    uint32 before_changecount;
+    uint32 after_changecount;
+
+    after_changecount = pg_atomic_read_u32(count);
+
+    do
+    {
+        before_changecount = after_changecount;
+        memcpy(dst, src, len);
+        after_changecount = pg_atomic_read_u32(count);
+    } while ((before_changecount & 1) == 1 ||
+             after_changecount != before_changecount);
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
+ *
+ *  We don't scribble on shared stats while resetting to avoid locking on
+ *  shared stats struct. Instead, just record the current counters in another
+ *  shared struct, which is protected by StatsLock. See
+ *  pgstat_fetch_stat_(archiver|bgwriter|checkpointer) for the reader side.
  * ----------
  */
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    TimestampTz now = GetCurrentTimestamp();
+    PgStat_Shared_Reset_Target t;
 
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+        t = RESET_ARCHIVER;
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+        t = RESET_BGWRITER;
     else if (strcmp(target, "wal") == 0)
-        msg.m_resettarget = RESET_WAL;
+        t = RESET_WAL;
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\", \"bgwriter\" or \"wal\".")));
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
+    /* Reset statistics for the cluster. */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+    switch (t)
+    {
+        case RESET_ARCHIVER:
+            pgstat_copy_global_stats(&StatsShmem->archiver_reset_offset,
+                                     &StatsShmem->archiver_stats,
+                                     sizeof(PgStat_Archiver),
+                                     &StatsShmem->archiver_changecount);
+            StatsShmem->archiver_reset_offset.stat_reset_timestamp = now;
+            cached_archiverstats_is_valid = false;
+            break;
+
+        case RESET_BGWRITER:
+            pgstat_copy_global_stats(&StatsShmem->bgwriter_reset_offset,
+                                     &StatsShmem->bgwriter_stats,
+                                     sizeof(PgStat_BgWriter),
+                                     &StatsShmem->bgwriter_changecount);
+            pgstat_copy_global_stats(&StatsShmem->checkpointer_reset_offset,
+                                     &StatsShmem->checkpointer_stats,
+                                     sizeof(PgStat_CheckPointer),
+                                     &StatsShmem->checkpointer_changecount);
+            StatsShmem->bgwriter_reset_offset.stat_reset_timestamp = now;
+            cached_bgwriterstats_is_valid = false;
+            cached_checkpointerstats_is_valid = false;
+            break;
+
+        case RESET_WAL:
+            /*
+             * Unlike the two above, WAL statistics have many writer
+             * processes and are protected by wal_stats_lock.
+             */
+            LWLockAcquire(&StatsShmem->wal_stats_lock, LW_EXCLUSIVE);
+            MemSet(&StatsShmem->wal_stats, 0, sizeof(PgStat_Wal));
+            StatsShmem->wal_stats.stat_reset_timestamp = now;
+            LWLockRelease(&StatsShmem->wal_stats_lock);
+            cached_walstats_is_valid = false;
+            break;
+    }
+
+    LWLockRelease(StatsLock);
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1408,17 +1859,37 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatEntryHeader *header;
+    PgStat_StatDBEntry *dbentry;
+    PgStatTypes stattype;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    dbentry = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, MyDatabaseId, InvalidOid, false, false,
+                       NULL);
+    Assert(dbentry);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* Set the reset timestamp for the whole database */
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->header.lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&dbentry->header.lock);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Reset the object's counters if it exists, ignore if not */
+    switch (type)
+    {
+        case RESET_TABLE:
+            stattype = PGSTAT_TYPE_TABLE;
+            break;
+        case RESET_FUNCTION:
+            stattype = PGSTAT_TYPE_FUNCTION;
+    }
+    header = get_stat_entry(stattype, MyDatabaseId, objoid, false, false, NULL);
+    if (header == NULL)
+        return;
+    LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+    memset(PGSTAT_SHENT_BODY(header), 0, PGSTAT_SHENT_BODY_LEN(stattype));
+    LWLockRelease(&header->lock);
 }
 
 /* ----------
@@ -1434,15 +1905,40 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_reset_slru_counter(const char *name)
 {
-    PgStat_MsgResetslrucounter msg;
+    int i;
+    TimestampTz    ts = GetCurrentTimestamp();
+    uint32    assert_changecount PG_USED_FOR_ASSERTS_ONLY;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    if (name)
+    {
+        i = pgstat_slru_index(name);
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+        assert_changecount =
+            pg_atomic_fetch_add_u32(&StatsShmem->slru_changecount, 1);
+        Assert((assert_changecount & 1) == 0);
+        MemSet(&StatsShmem->slru_stats.entry[i], 0,
+               sizeof(PgStat_SLRUStats));
+        StatsShmem->slru_stats.entry[i].stat_reset_timestamp = ts;
+        pg_atomic_add_fetch_u32(&StatsShmem->slru_changecount, 1);
+        LWLockRelease(&StatsShmem->slru_stats.lock);
+    }
+    else
+    {
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+        assert_changecount =
+            pg_atomic_fetch_add_u32(&StatsShmem->slru_changecount, 1);
+        Assert((assert_changecount & 1) == 0);
+        for (i = 0 ; i < SLRU_NUM_ELEMENTS; i++)
+        {
+            MemSet(&StatsShmem->slru_stats.entry[i], 0,
+                   sizeof(PgStat_SLRUStats));
+            StatsShmem->slru_stats.entry[i].stat_reset_timestamp = ts;
+        }
+        pg_atomic_add_fetch_u32(&StatsShmem->slru_changecount, 1);
+        LWLockRelease(&StatsShmem->slru_stats.lock);
+    }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSLRUCOUNTER);
-    msg.m_index = (name) ? pgstat_slru_index(name) : -1;
-
-    pgstat_send(&msg, sizeof(msg));
+    cached_slrustats_is_valid = false;
 }
 
 /* ----------
@@ -1458,15 +1954,18 @@ pgstat_reset_slru_counter(const char *name)
 void
 pgstat_reset_replslot_counter(const char *name)
 {
-    PgStat_MsgResetreplslotcounter msg;
+    int            startidx;
+    int            endidx;
+    int            i;
+    TimestampTz    ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
     if (name)
     {
         ReplicationSlot *slot;
-
+
         /*
          * Check if the slot exits with the given name. It is possible that by
          * the time this message is executed the slot is dropped but at least
@@ -1489,15 +1988,35 @@ pgstat_reset_replslot_counter(const char *name)
         if (SlotIsPhysical(slot))
             return;
 
-        memcpy(&msg.m_slotname, name, NAMEDATALEN);
-        msg.clearall = false;
+        /* reset this one entry */
+        startidx = endidx = slot - ReplicationSlotCtl->replication_slots;
     }
     else
-        msg.clearall = true;
+    {
+        /* reset all existent entries */
+        startidx = 0;
+        endidx = max_replication_slots - 1;
+    }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETREPLSLOTCOUNTER);
+    ts = GetCurrentTimestamp();
+    for (i = startidx ; i <= endidx ; i++)
+    {
+        PgStat_ReplSlot *shent;
 
-    pgstat_send(&msg, sizeof(msg));
+        shent = (PgStat_ReplSlot *)
+            get_stat_entry(PGSTAT_TYPE_REPLSLOT,
+                           MyDatabaseId, i, false, false, NULL);
+        /* Skip non-existent entries */
+        if (!shent)
+            continue;
+
+        LWLockAcquire(&shent->header.lock, LW_EXCLUSIVE);
+        memset(&shent->spill_txns, 0,
+               offsetof(PgStat_ReplSlot, stat_reset_timestamp) -
+               offsetof(PgStat_ReplSlot, spill_txns));
+        shent->stat_reset_timestamp = ts;
+        LWLockRelease(&shent->header.lock);
+    }
 }
 
 /* ----------
@@ -1511,48 +2030,93 @@ pgstat_reset_replslot_counter(const char *name)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if activity stats is not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    /*
+     * End-of-vacuum is reported instantly. Report the start the same way for
+     * consistency. Vacuum doesn't run frequently and is a long-lasting
+     * operation so it doesn't matter if we get blocked here a little.
+     */
+    dbentry = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, dboid, InvalidOid, false, true, NULL);
 
-    pgstat_send(&msg, sizeof(msg));
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->header.lock, LW_EXCLUSIVE);
+    dbentry->last_autovac_time = ts;
+    LWLockRelease(&dbentry->header.lock);
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report about the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    PgStat_StatTabEntry *tabentry;
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    ts = GetCurrentTimestamp();
+
+    /*
+     * Unlike ordinary operations, maintenance commands take a long time, so
+     * getting blocked at the end of the work doesn't matter. Furthermore,
+     * this prevents stats updates made by transactions that end after this
+     * vacuum from being canceled by a delayed vacuum report.  For these
+     * reasons, update the shared stats entry directly.
+     */
+    tabentry = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, tableoid, false, true, NULL);
+
+    LWLockAcquire(&tabentry->header.lock, LW_EXCLUSIVE);
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * It is quite possible that a non-aggressive VACUUM ended up skipping
+     * various pages, however, we'll zero the insert counter here regardless.
+     * It's currently used only to track when we need to perform an "insert"
+     * autovacuum, which are mainly intended to freeze newly inserted tuples.
+     * Zeroing this may just mean we'll not try to vacuum the table again
+     * until enough tuples have been inserted to trigger another insert
+     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
+     * stragglers.
+     */
+    tabentry->inserts_since_vacuum = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = ts;
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = ts;
+        tabentry->vacuum_count++;
+    }
+
+    LWLockRelease(&tabentry->header.lock);
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report about the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1563,9 +2127,11 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    PgStat_StatTabEntry *tabentry;
+    Oid        dboid = (rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId);
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1573,10 +2139,10 @@ pgstat_report_analyze(Relation rel,
      * already inserted and/or deleted rows in the target table. ANALYZE will
      * have counted such rows as live or dead respectively. Because we will
      * report our counts of such rows at transaction end, we should subtract
-     * off these counts from what we send to the collector now, else they'll
-     * be double-counted after commit.  (This approach also ensures that the
-     * collector ends up with the right numbers if we abort instead of
-     * committing.)
+     * off these counts from what we write to shared stats now, else they'll
+     * be double-counted after commit.  (This approach also ensures that the
+     * shared stats end up with the right numbers if we abort instead of
+     * committing.)
      */
     if (rel->pgstat_info != NULL)
     {
@@ -1594,110 +2160,167 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Unlike ordinary operations, maintenance commands take a long time, so
+     * getting blocked at the end of the work doesn't matter. Furthermore,
+     * this prevents stats updates made by transactions that end after this
+     * analyze from being canceled by a delayed analyze report.  For these
+     * reasons, update the shared stats entry directly.
+     */
+    tabentry = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, RelationGetRelid(rel),
+                       false, true, NULL);
+
+    LWLockAcquire(&tabentry->header.lock, LW_EXCLUSIVE);
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    LWLockRelease(&tabentry->header.lock);
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            dbent->counts.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            dbent->counts.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            dbent->counts.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            dbent->counts.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            dbent->counts.n_conflict_startup_deadlock++;
+            break;
+    }
 }
 
+
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_deadlocks++;
 }
 
-
-
 /* --------
- * pgstat_report_checksum_failures_in_db() -
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
  *
- *    Tell the collector about one or more checksum failures.
+ *    Reports about one or more checksum failures.
  * --------
  */
 void
 pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
 {
-    PgStat_MsgChecksumFailure msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
+    dbentry = get_local_dbstat_entry(dboid);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* add the reported failure count to the accumulated count */
+    dbentry->counts.n_checksum_failures += failurecount;
 }
 
 /* --------
  * pgstat_report_checksum_failure() -
  *
- *    Tell the collector about a checksum failure.
+ *    Reports about a checksum failure.
  * --------
  */
 void
 pgstat_report_checksum_failure(void)
 {
-    pgstat_report_checksum_failures_in_db(MyDatabaseId, 1);
+    PgStat_StatDBEntry *dbent;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_checksum_failures++;
 }
 
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
+    if (filesize == 0)            /* Is there a case where filesize is really 0? */
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_temp_bytes += filesize; /* XXX: needs overflow check */
+    dbent->counts.n_temp_files++;
 }
 
 /* ----------
@@ -1707,24 +2330,49 @@ pgstat_report_tempfile(size_t filesize)
  * ----------
  */
 void
-pgstat_report_replslot(const char *slotname, int spilltxns, int spillcount,
-                       int spillbytes, int streamtxns, int streamcount, int streambytes)
+pgstat_report_replslot(const char *slotname,
+                       int spilltxns, int spillcount, int spillbytes,
+                       int streamtxns, int streamcount, int streambytes)
 {
-    PgStat_MsgReplSlot msg;
-
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_REPLSLOT);
-    memcpy(&msg.m_slotname, slotname, NAMEDATALEN);
-    msg.m_drop = false;
-    msg.m_spill_txns = spilltxns;
-    msg.m_spill_count = spillcount;
-    msg.m_spill_bytes = spillbytes;
-    msg.m_stream_txns = streamtxns;
-    msg.m_stream_count = streamcount;
-    msg.m_stream_bytes = streambytes;
-    pgstat_send(&msg, sizeof(PgStat_MsgReplSlot));
+    PgStat_ReplSlot *shent;
+    int                 i;
+    bool             found;
+
+    if (!area)
+        return;
+
+    for (i = 0 ; i < max_replication_slots ; i++)
+    {
+        if (strcmp(NameStr(ReplicationSlotCtl->replication_slots[i].data.name),
+                   slotname) == 0)
+            break;
+
+    }
+
+    /*  XXX: maybe the slot has been removed. just ignore it. */
+    if (i == max_replication_slots)
+        return;
+
+    shent = (PgStat_ReplSlot *)
+        get_stat_entry(PGSTAT_TYPE_REPLSLOT,
+                       MyDatabaseId, i, false, true, &found);
+
+    /* Clear the counters including dropped when we reuse it */
+    LWLockAcquire(&shent->header.lock, LW_EXCLUSIVE);
+    if (shent->dropped || !found)
+    {
+        strlcpy(shent->slotname, slotname, NAMEDATALEN);
+        memset(&shent->spill_txns, 0,
+               sizeof(PgStat_ReplSlot) - offsetof(PgStat_ReplSlot, spill_txns));
+    }
+
+    shent->spill_txns += spilltxns;
+    shent->spill_count += spillcount;
+    shent->spill_bytes += spillbytes;
+    shent->stream_txns += streamtxns;
+    shent->stream_count += streamcount;
+    shent->stream_bytes += streambytes;
+    LWLockRelease(&shent->header.lock);
 }
 
 /* ----------
@@ -1736,55 +2384,44 @@ pgstat_report_replslot(const char *slotname, int spilltxns, int spillcount,
 void
 pgstat_report_replslot_drop(const char *slotname)
 {
-    PgStat_MsgReplSlot msg;
+    int i;
+    PgStat_ReplSlot *shent;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_REPLSLOT);
-    memcpy(&msg.m_slotname, slotname, NAMEDATALEN);
-    msg.m_drop = true;
-    pgstat_send(&msg, sizeof(PgStat_MsgReplSlot));
-}
+    Assert(area);
+    if (!area)
+        return;
 
-/* ----------
- * pgstat_ping() -
- *
- *    Send some junk data to the collector to increase traffic.
- * ----------
- */
-void
-pgstat_ping(void)
-{
-    PgStat_MsgDummy msg;
+    for (i = 0 ; i < max_replication_slots ; i++)
+    {
+        if (strcmp(NameStr(ReplicationSlotCtl->replication_slots[i].data.name),
+                   slotname) == 0)
+            break;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    /*  XXX: maybe the slot has been removed. just ignore it. */
+    if (i == max_replication_slots)
+        return;
+
+    shent = (PgStat_ReplSlot *)
+        get_stat_entry(PGSTAT_TYPE_REPLSLOT,
+                       MyDatabaseId, i, false, false, NULL);
+
+    if (shent && !shent->dropped)
+    {
+        LWLockAcquire(&shent->header.lock, LW_EXCLUSIVE);
+        shent->dropped = true;
+        LWLockRelease(&shent->header.lock);
+    }
 }
 
 /* ----------
- * pgstat_send_inquiry() -
+ * pgstat_init_function_usage() -
  *
- *    Notify collector that we need fresh data.
+ *  Initialize function call usage data.
+ *  Called by the executor before invoking a function.
  * ----------
  */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/*
- * Initialize function call usage data.
- * Called by the executor before invoking a function.
- */
 void
 pgstat_init_function_usage(FunctionCallInfo fcinfo,
                            PgStat_FunctionCallUsage *fcu)
@@ -1799,25 +2436,9 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
         return;
     }
 
-    if (!pgStatFunctions)
-    {
-        /* First time through - initialize function stat table */
-        HASHCTL        hash_ctl;
-
-        memset(&hash_ctl, 0, sizeof(hash_ctl));
-        hash_ctl.keysize = sizeof(Oid);
-        hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
-        pgStatFunctions = hash_create("Function stat entries",
-                                      PGSTAT_FUNCTION_HASH_SIZE,
-                                      &hash_ctl,
-                                      HASH_ELEM | HASH_BLOBS);
-    }
-
-    /* Get the stats entry for this function, create if necessary */
-    htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
-                          HASH_ENTER, &found);
-    if (!found)
-        MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
+    htabent = (PgStat_BackendFunctionEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             fcinfo->flinfo->fn_oid, true, &found);
 
     fcu->fs = &htabent->f_counts;
 
@@ -1831,31 +2452,37 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
     INSTR_TIME_SET_CURRENT(fcu->f_start);
 }
 
-/*
- * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
- *        for specified function
+/* ----------
+ * find_funcstat_entry() -
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_BackendFunctionEntry for the specified function.
+ *
+ *  If there is no entry, return NULL without creating a new one.
+ * ----------
  */
 PgStat_BackendFunctionEntry *
 find_funcstat_entry(Oid func_id)
 {
-    if (pgStatFunctions == NULL)
-        return NULL;
+    PgStat_BackendFunctionEntry *ent;
 
-    return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
-                                                       (void *) &func_id,
-                                                       HASH_FIND, NULL);
+    ent = (PgStat_BackendFunctionEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             func_id, false, NULL);
+
+    return ent;
 }
 
-/*
- * Calculate function call usage and update stat counters.
- * Called by the executor after invoking a function.
+/* ----------
+ * pgstat_end_function_usage() -
  *
- * In the case of a set-returning function that runs in value-per-call mode,
- * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
- * calls for what the user considers a single call of the function.  The
- * finalize flag should be TRUE on the last call.
+ *  Calculate function call usage and update stat counters.
+ *  Called by the executor after invoking a function.
+ *
+ *  In the case of a set-returning function that runs in value-per-call mode,
+ *  we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
+ *  calls for what the user considers a single call of the function.  The
+ *  finalize flag should be TRUE on the last call.
+ * ----------
  */
 void
 pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
@@ -1896,9 +2523,6 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
         fs->f_numcalls++;
     fs->f_total_time = f_total;
     INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
 }
 
 
@@ -1910,8 +2534,7 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
  *
  *    We assume that a relcache entry's pgstat_info field is zeroed by
  *    relcache.c when the relcache entry is made; thereafter it is long-lived
- *    data.  We can avoid repeated searches of the TabStatus arrays when the
- *    same relation is touched repeatedly within a transaction.
+ *    data.
  * ----------
  */
 void
@@ -1927,7 +2550,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1938,121 +2562,60 @@ pgstat_initstats(Relation rel)
      * If we already set up this relation in the current transaction, nothing
      * to do.
      */
-    if (rel->pgstat_info != NULL &&
-        rel->pgstat_info->t_id == rel_id)
+    if (rel->pgstat_info != NULL)
         return;
 
     /* Else find or make the PgStat_TableStatus entry, and update link */
-    rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+    rel->pgstat_info = get_local_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+
+    /* don't allow linking a stats entry to multiple relcache entries */
+    Assert(rel->pgstat_info->relation == NULL);
+    /* mark this relation as the owner */
+    rel->pgstat_info->relation = rel;
 }
 
 /*
- * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
+ * pgstat_delinkstats() -
+ *
+ *  Break the mutual link between a relcache entry and a local stats entry.
+ *  This must always be called when one end of the link is removed.
  */
-static PgStat_TableStatus *
-get_tabstat_entry(Oid rel_id, bool isshared)
+void
+pgstat_delinkstats(Relation rel)
 {
-    TabStatHashEntry *hash_entry;
-    PgStat_TableStatus *entry;
-    TabStatusArray *tsa;
-    bool        found;
-
-    /*
-     * Create hash table if we don't have it already.
-     */
-    if (pgStatTabHash == NULL)
+    /* remove the link to stats info if any */
+    if (rel && rel->pgstat_info)
     {
-        HASHCTL        ctl;
-
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
+        /* link sanity check */
+        Assert(rel->pgstat_info->relation == rel);
+        rel->pgstat_info->relation = NULL;
+        rel->pgstat_info = NULL;
     }
-
-    /*
-     * Find an entry or create a new one.
-     */
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
-    if (!found)
-    {
-        /* initialize new entry with null pointer */
-        hash_entry->tsa_entry = NULL;
-    }
-
-    /*
-     * If entry is already valid, we're done.
-     */
-    if (hash_entry->tsa_entry)
-        return hash_entry->tsa_entry;
-
-    /*
-     * Locate the first pgStatTabList entry with free space, making a new list
-     * entry if needed.  Note that we could get an OOM failure here, but if so
-     * we have left the hashtable and the list in a consistent state.
-     */
-    if (pgStatTabList == NULL)
-    {
-        /* Set up first pgStatTabList entry */
-        pgStatTabList = (TabStatusArray *)
-            MemoryContextAllocZero(TopMemoryContext,
-                                   sizeof(TabStatusArray));
-    }
-
-    tsa = pgStatTabList;
-    while (tsa->tsa_used >= TABSTAT_QUANTUM)
-    {
-        if (tsa->tsa_next == NULL)
-            tsa->tsa_next = (TabStatusArray *)
-                MemoryContextAllocZero(TopMemoryContext,
-                                       sizeof(TabStatusArray));
-        tsa = tsa->tsa_next;
-    }
-
-    /*
-     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
-     * the entry was already zeroed, either at creation or after last use.
-     */
-    entry = &tsa->tsa_entries[tsa->tsa_used++];
-    entry->t_id = rel_id;
-    entry->t_shared = isshared;
-
-    /*
-     * Now we can fill the entry in pgStatTabHash.
-     */
-    hash_entry->tsa_entry = entry;
-
-    return entry;
 }
 
 /*
  * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_TableStatus entry for rel_id in the current
+ *  database. If not found, look for it among the shared tables.
  *
- * Note: if we got an error in the most recent execution of pgstat_report_stat,
- * it's possible that an entry exists but there's no hashtable entry for it.
- * That's okay, we'll treat this case as "doesn't exist".
+ *  If no entry found, return NULL, don't create a new one
+ * ----------
  */
 PgStat_TableStatus *
 find_tabstat_entry(Oid rel_id)
 {
-    TabStatHashEntry *hash_entry;
+    PgStat_TableStatus *ent;
 
-    /* If hashtable doesn't exist, there are no entries at all */
-    if (!pgStatTabHash)
-        return NULL;
+    ent = (PgStat_TableStatus *)
+        get_local_stat_entry(PGSTAT_TYPE_TABLE, MyDatabaseId, rel_id,
+                             false, NULL);
+    if (!ent)
+        ent = (PgStat_TableStatus *)
+            get_local_stat_entry(PGSTAT_TYPE_TABLE, InvalidOid, rel_id,
+                                 false, NULL);
 
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
-    if (!hash_entry)
-        return NULL;
-
-    /* Note that this step could also return NULL, but that's correct */
-    return hash_entry->tsa_entry;
+    return ent;
 }
 
 /*
@@ -2467,8 +3030,6 @@ AtPrepare_PgStat(void)
             record.inserted_pre_trunc = trans->inserted_pre_trunc;
             record.updated_pre_trunc = trans->updated_pre_trunc;
             record.deleted_pre_trunc = trans->deleted_pre_trunc;
-            record.t_id = tabstat->t_id;
-            record.t_shared = tabstat->t_shared;
             record.t_truncated = trans->truncated;
 
             RegisterTwoPhaseRecord(TWOPHASE_RM_PGSTAT_ID, 0,
@@ -2483,8 +3044,8 @@ AtPrepare_PgStat(void)
  *
  * All we need do here is unlink the transaction stats state from the
  * nontransactional state.  The nontransactional action counts will be
- * reported to the stats collector immediately, while the effects on live
- * and dead tuple counts are preserved in the 2PC state file.
+ * reported to the activity stats facility immediately, while the effects on
+ * live and dead tuple counts are preserved in the 2PC state file.
  *
  * Note: AtEOXact_PgStat is not called during PREPARE.
  */
@@ -2529,7 +3090,7 @@ pgstat_twophase_postcommit(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, commit case */
     pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
@@ -2565,7 +3126,7 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, abort case */
     if (rec->t_truncated)
@@ -2585,85 +3146,138 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    Find the database stats entry for the given database from a backend.
+ *
+ *    The returned entry is stored in static memory, so its content is valid
+ *    only until the next call of this function for a different database.
  * ----------
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    PgStat_StatDBEntry *shent;
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for InvalidOid */
+    Assert(dbid != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_dbent_key.databaseid == dbid)
+        return &cached_dbent;
+
+    shent = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid, true, false, NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_dbent, shent, sizeof(PgStat_StatDBEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_dbent_key.databaseid = dbid;
+
+    return &cached_dbent;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one table or NULL. NULL doesn't mean
+ *    the activity statistics for one table or NULL. NULL doesn't mean
  *    that the table doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    activity statistics facility, so the caller is better off to
+ *    report ZERO instead.
  * ----------
  */
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
-    PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_extended(false, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
-
-    return NULL;
+    tabentry = pgstat_fetch_stat_tabentry_extended(true, relid);
+    return tabentry;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_tabentry_extended() -
+ *
+ *    Find the table stats entry from a backend. The returned entry is stored
+ *    in static memory, so its content is valid only until the next call of
+ *    this function for a different table.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_extended(bool shared, Oid reloid)
+{
+    PgStat_StatTabEntry *shent;
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for the InvalidOid */
+    Assert(reloid != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_tabent_key.databaseid == dboid &&
+        cached_tabent_key.objectid == reloid)
+        return &cached_tabent;
+
+    shent = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, reloid, true, false, NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_tabent, shent, sizeof(PgStat_StatTabEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_tabent_key.databaseid = dboid;
+    cached_tabent_key.objectid = reloid;
+
+    return &cached_tabent;
+}
+
+
+/* ----------
+ * pgstat_copy_index_counters() -
+ *
+ *    Support function for index swapping. Copy a portion of the counters of the
+ *    relation to the specified place.
+ * ----------
+ */
+void
+pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst)
+{
+    PgStat_StatTabEntry *tabentry;
+
+    /* No point fetching tabentry when dst is NULL */
+    if (!dst)
+        return;
+
+    tabentry = pgstat_fetch_stat_tabentry(relid);
+
+    if (!tabentry)
+        return;
+
+    dst->t_counts.t_numscans = tabentry->numscans;
+    dst->t_counts.t_tuples_returned = tabentry->tuples_returned;
+    dst->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
+    dst->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
+    dst->t_counts.t_blocks_hit = tabentry->blocks_hit;
 }
 
 
@@ -2672,30 +3286,46 @@ pgstat_fetch_stat_tabentry(Oid relid)
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
  *    the collected statistics for one function or NULL.
+ *
+ *    The returned entry is stored in static memory, so its content is valid
+ *    only until the next call of this function for a different function id.
  * ----------
  */
 PgStat_StatFuncEntry *
 pgstat_fetch_stat_funcentry(Oid func_id)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry = NULL;
-
-    /* load the stats file if needed */
-    backend_read_statsfile();
-
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
-
-    return funcentry;
+    PgStat_StatFuncEntry *shent;
+    Oid    dboid = MyDatabaseId;
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for InvalidOid */
+    Assert(func_id != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_funcent_key.databaseid == dboid &&
+        cached_funcent_key.objectid == func_id)
+        return &cached_funcent;
+
+    shent = (PgStat_StatFuncEntry *)
+        get_stat_entry(PGSTAT_TYPE_FUNCTION, dboid, func_id, true, false,
+                       NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_funcent, shent, sizeof(PgStat_StatFuncEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_funcent_key.databaseid = dboid;
+    cached_funcent_key.objectid = func_id;
+
+    return &cached_funcent;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_beentry() -
  *
@@ -2755,53 +3385,160 @@ pgstat_fetch_stat_numbackends(void)
     return localNumBackends;
 }
 
+/*
+ * ---------
+ * pgstat_get_stat_timestamp() -
+ *
+ *  Returns the last update timestamp of global statistics.
+ */
+TimestampTz
+pgstat_get_stat_timestamp(void)
+{
+    return (TimestampTz) pg_atomic_read_u64(&StatsShmem->stats_timestamp);
+}
+
 /*
  * ---------
  * pgstat_fetch_stat_archiver() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the archiver statistics struct.
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_ArchiverStats *
+PgStat_Archiver *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    PgStat_Archiver reset;
+    PgStat_Archiver *reset_shared = &StatsShmem->archiver_reset_offset;
+    PgStat_Archiver *shared = &StatsShmem->archiver_stats;
+    PgStat_Archiver *cached = &cached_archiverstats;
 
-    return &archiverStats;
+    pgstat_copy_global_stats(cached, shared, sizeof(PgStat_Archiver),
+                             &StatsShmem->archiver_changecount);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&reset, reset_shared, sizeof(PgStat_Archiver));
+    LWLockRelease(StatsLock);
+
+    /* compensate by reset offsets */
+    if (cached->archived_count == reset.archived_count)
+    {
+        cached->last_archived_wal[0] = 0;
+        cached->last_archived_timestamp = 0;
+    }
+    cached->archived_count -= reset.archived_count;
+
+    if (cached->failed_count == reset.failed_count)
+    {
+        cached->last_failed_wal[0] = 0;
+        cached->last_failed_timestamp = 0;
+    }
+    cached->failed_count -= reset.failed_count;
+
+    cached->stat_reset_timestamp = reset.stat_reset_timestamp;
+
+    cached_archiverstats_is_valid = true;
+
+    return &cached_archiverstats;
 }
 
 
 /*
  * ---------
- * pgstat_fetch_global() -
+ * pgstat_fetch_stat_bgwriter() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the global statistics struct.
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_GlobalStats *
-pgstat_fetch_global(void)
+PgStat_BgWriter *
+pgstat_fetch_stat_bgwriter(void)
 {
-    backend_read_statsfile();
+    PgStat_BgWriter reset;
+    PgStat_BgWriter *reset_shared = &StatsShmem->bgwriter_reset_offset;
+    PgStat_BgWriter *shared = &StatsShmem->bgwriter_stats;
+    PgStat_BgWriter *cached = &cached_bgwriterstats;
+
+    pgstat_copy_global_stats(cached, shared, sizeof(PgStat_BgWriter),
+                             &StatsShmem->bgwriter_changecount);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&reset, reset_shared, sizeof(PgStat_BgWriter));
+    LWLockRelease(StatsLock);
+
+    /* compensate by reset offsets */
+    cached->buf_written_clean -= reset.buf_written_clean;
+    cached->maxwritten_clean  -= reset.maxwritten_clean;
+    cached->buf_alloc          -= reset.buf_alloc;
+    cached->stat_reset_timestamp = reset.stat_reset_timestamp;
+
+    cached_bgwriterstats_is_valid = true;
+
+    return &cached_bgwriterstats;
+}
+
+/*
+ * ---------
+ * pgstat_fetch_stat_checkpointer() -
+ *
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
+ * ---------
+ */
+PgStat_CheckPointer *
+pgstat_fetch_stat_checkpointer(void)
+{
+    PgStat_CheckPointer reset;
+    PgStat_CheckPointer *reset_shared = &StatsShmem->checkpointer_reset_offset;
+    PgStat_CheckPointer *shared = &StatsShmem->checkpointer_stats;
+    PgStat_CheckPointer *cached = &cached_checkpointerstats;
+
+    pgstat_copy_global_stats(cached, shared, sizeof(PgStat_CheckPointer),
+                             &StatsShmem->checkpointer_changecount);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&reset, reset_shared, sizeof(PgStat_CheckPointer));
+    LWLockRelease(StatsLock);
+
+    /* compensate by reset offsets */
+    cached->timed_checkpoints       -= reset.timed_checkpoints;
+    cached->requested_checkpoints   -= reset.requested_checkpoints;
+    cached->buf_written_checkpoints -= reset.buf_written_checkpoints;
+    cached->buf_written_backend     -= reset.buf_written_backend;
+    cached->buf_fsync_backend       -= reset.buf_fsync_backend;
+    cached->checkpoint_write_time   -= reset.checkpoint_write_time;
+    cached->checkpoint_sync_time    -= reset.checkpoint_sync_time;
+
+    cached_checkpointerstats_is_valid = true;
 
-    return &globalStats;
+    return &cached_checkpointerstats;
 }
 
 /*
  * ---------
  * pgstat_fetch_stat_wal() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the WAL statistics struct.
+ *    Support function for the SQL-callable pgstat* functions. The returned entry
+ *  is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_WalStats *
+PgStat_Wal *
 pgstat_fetch_stat_wal(void)
 {
-    backend_read_statsfile();
+    if (!cached_walstats_is_valid)
+    {
+        LWLockAcquire(StatsLock, LW_SHARED);
+        memcpy(&cached_walstats, &StatsShmem->wal_stats, sizeof(PgStat_Wal));
+        LWLockRelease(StatsLock);
+    }
 
-    return &walStats;
+    cached_walstats_is_valid = true;
+
+    return &cached_walstats;
 }
 
 /*
@@ -2815,9 +3552,27 @@ pgstat_fetch_stat_wal(void)
 PgStat_SLRUStats *
 pgstat_fetch_slru(void)
 {
-    backend_read_statsfile();
+    size_t size = sizeof(PgStat_SLRUStats) * SLRU_NUM_ELEMENTS;
 
-    return slruStats;
+    for (;;)
+    {
+        uint32 before_count;
+        uint32 after_count;
+
+        pg_read_barrier();
+        before_count = pg_atomic_read_u32(&StatsShmem->slru_changecount);
+        memcpy(&cached_slrustats, &StatsShmem->slru_stats, size);
+        after_count = pg_atomic_read_u32(&StatsShmem->slru_changecount);
+
+        if (before_count == after_count && (before_count & 1) == 0)
+            break;
+
+        CHECK_FOR_INTERRUPTS();
+    }
+
+    cached_slrustats_is_valid = true;
+
+    return &cached_slrustats;
 }
 
 /*
@@ -2829,13 +3584,41 @@ pgstat_fetch_slru(void)
  *    number of entries in nslots_p.
  * ---------
  */
-PgStat_ReplSlotStats *
+PgStat_ReplSlot *
 pgstat_fetch_replslot(int *nslots_p)
 {
-    backend_read_statsfile();
 
-    *nslots_p = nReplSlotStats;
-    return replSlotStats;
+    if (cached_replslotstats == NULL)
+    {
+        cached_replslotstats = (PgStat_ReplSlot *)
+            MemoryContextAlloc(pgStatCacheContext,
+                               sizeof(PgStat_ReplSlot) * max_replication_slots);
+    }
+
+    if (n_cached_replslotstats < 0)
+    {
+        int n = 0;
+        int i;
+
+        for (i = 0 ; i < max_replication_slots ; i++)
+        {
+            PgStat_ReplSlot *shent = (PgStat_ReplSlot *)
+                get_stat_entry(PGSTAT_TYPE_REPLSLOT, MyDatabaseId, i,
+                               false, false, NULL);
+            if (shent && !shent->dropped)
+            {
+                memcpy(cached_replslotstats[n++].slotname,
+                       shent->slotname,
+                       sizeof(PgStat_ReplSlot) -
+                       offsetof(PgStat_ReplSlot, slotname));
+            }
+        }
+
+        n_cached_replslotstats = n;
+    }
+
+    *nslots_p = n_cached_replslotstats;
+    return cached_replslotstats;
 }
 
 /* ------------------------------------------------------------
@@ -3048,8 +3831,8 @@ pgstat_initialize(void)
         MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
     }
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* This hook needs to run before DSM shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -3225,12 +4008,15 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+    /* attach shared database stats area */
+    attach_shared_stats();
 }
 
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
- * Flush any remaining statistics counts out to the collector.
+ * Flush any remaining statistics counts out to shared stats.
  * Without this, operations triggered during backend exit (such as
  * temp table deletions) won't be counted.
  *
@@ -3243,13 +4029,22 @@ pgstat_beshutdown_hook(int code, Datum arg)
 
     /*
      * If we got as far as discovering our own database ID, we can report what
-     * we did to the collector.  Otherwise, we'd be sending an invalid
+     * we did to the shared stats.  Otherwise, we'd be sending an invalid
      * database ID, so forget it.  (This means that accesses to pg_database
      * during failed backend starts might never get counted.)
      */
     if (OidIsValid(MyDatabaseId))
         pgstat_report_stat(true);
 
+    /*
+     * We need to clean up temporary slots before detaching shared statistics
+     * so that the statistics for temporary slots are properly removed.
+     */
+    if (MyReplicationSlot != NULL)
+        ReplicationSlotRelease();
+
+    ReplicationSlotCleanup();
+
     /*
      * Clear my status entry, following the protocol of bumping st_changecount
      * before and after.  We use a volatile pointer here to ensure the
@@ -3260,6 +4055,10 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    cleanup_dropped_stats_entries();
+
+    detach_shared_stats(true);
 }
 
 
@@ -3520,7 +4319,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3815,8 +4615,8 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
+        case WAIT_EVENT_READING_STATS_FILE:
+            event_name = "ReadingStatsFile";
             break;
         case WAIT_EVENT_RECOVERY_WAL_STREAM:
             event_name = "RecoveryWalStream";
@@ -4470,94 +5270,80 @@ pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
 
 
 /* ----------
- * pgstat_setheader() -
+ * pgstat_report_archiver() -
  *
- *        Set common header fields in a statistics message
+ *        Report archiver statistics
  * ----------
  */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
+void
+pgstat_report_archiver(const char *xlog, bool failed)
 {
-    int            rc;
+    TimestampTz now = GetCurrentTimestamp();
+    uint32    before_count PG_USED_FOR_ASSERTS_ONLY;
+    uint32    after_count PG_USED_FOR_ASSERTS_ONLY;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
 
-    ((PgStat_MsgHdr *) msg)->m_size = len;
+    START_CRIT_SECTION();
+    before_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->archiver_changecount, 1);
+    Assert((before_count & 1) == 0);
 
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
+    if (failed)
     {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
- *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
- * ----------
- */
-void
-pgstat_send_archiver(const char *xlog, bool failed)
-{
-    PgStat_MsgArchiver msg;
+        ++StatsShmem->archiver_stats.failed_count;
+        strlcpy(StatsShmem->archiver_stats.last_failed_wal, xlog,
+                sizeof(StatsShmem->archiver_stats.last_failed_wal));
+        StatsShmem->archiver_stats.last_failed_timestamp = now;
+    }
+    else
+    {
+        ++StatsShmem->archiver_stats.archived_count;
+        strlcpy(StatsShmem->archiver_stats.last_archived_wal, xlog,
+                sizeof(StatsShmem->archiver_stats.last_archived_wal));
+        StatsShmem->archiver_stats.last_archived_timestamp = now;
+    }
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    strlcpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+    after_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->archiver_changecount, 1);
+    Assert(after_count == before_count + 1);
+    END_CRIT_SECTION();
 }
 
 /* ----------
- * pgstat_send_bgwriter() -
+ * pgstat_report_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
-pgstat_send_bgwriter(void)
+pgstat_report_bgwriter(void)
 {
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+    PgStat_BgWriter *s = &StatsShmem->bgwriter_stats;
+    PgStat_BgWriter *l = &BgWriterStats;
+    uint32    before_count PG_USED_FOR_ASSERTS_ONLY;
+    uint32    after_count PG_USED_FOR_ASSERTS_ONLY;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid updating shared stats when there is nothing to report.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    START_CRIT_SECTION();
+    before_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->bgwriter_changecount, 1);
+    Assert((before_count & 1) == 0);
+
+    s->buf_written_clean += l->buf_written_clean;
+    s->maxwritten_clean += l->maxwritten_clean;
+    s->buf_alloc += l->buf_alloc;
+
+    after_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->bgwriter_changecount, 1);
+    Assert(after_count == before_count + 1);
+    END_CRIT_SECTION();
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4566,519 +5352,134 @@ pgstat_send_bgwriter(void)
 }
 
 /* ----------
- * pgstat_send_wal() -
+ * pgstat_report_checkpointer() -
  *
- *        Send WAL statistics to the collector
+ *        Report checkpointer statistics
  * ----------
  */
 void
-pgstat_send_wal(void)
+pgstat_report_checkpointer(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgWal all_zeroes;
+    static const PgStat_CheckPointer all_zeroes;
+    PgStat_CheckPointer *s = &StatsShmem->checkpointer_stats;
+    PgStat_CheckPointer *l = &CheckPointerStats;
+    uint32    before_count PG_USED_FOR_ASSERTS_ONLY;
+    uint32    after_count PG_USED_FOR_ASSERTS_ONLY;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid updating shared stats when there is nothing to report.
      */
-    if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
+    if (memcmp(&CheckPointerStats, &all_zeroes, sizeof(PgStat_CheckPointer)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&WalStats.m_hdr, PGSTAT_MTYPE_WAL);
-    pgstat_send(&WalStats, sizeof(WalStats));
-
+    START_CRIT_SECTION();
+    before_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->checkpointer_changecount, 1);
+    Assert((before_count & 1) == 0);
+
+    s->timed_checkpoints += l->timed_checkpoints;
+    s->requested_checkpoints += l->requested_checkpoints;
+    s->checkpoint_write_time += l->checkpoint_write_time;
+    s->checkpoint_sync_time += l->checkpoint_sync_time;
+    s->buf_written_checkpoints += l->buf_written_checkpoints;
+    s->buf_written_backend += l->buf_written_backend;
+    s->buf_fsync_backend += l->buf_fsync_backend;
+
+    after_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->checkpointer_changecount, 1);
+    Assert(after_count == before_count + 1);
+    END_CRIT_SECTION();
     /*
      * Clear out the statistics buffer, so it can be re-used.
      */
-    MemSet(&WalStats, 0, sizeof(WalStats));
+    MemSet(&CheckPointerStats, 0, sizeof(CheckPointerStats));
 }
 
 /* ----------
- * pgstat_send_slru() -
+ * pgstat_report_wal() -
  *
- *        Send SLRU statistics to the collector
+ *        Report WAL statistics
  * ----------
  */
-static void
-pgstat_send_slru(void)
+void
+pgstat_report_wal(void)
 {
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgSLRU all_zeroes;
-
-    for (int i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /*
-         * This function can be called even if nothing at all has happened. In
-         * this case, avoid sending a completely empty message to the stats
-         * collector.
-         */
-        if (memcmp(&SLRUStats[i], &all_zeroes, sizeof(PgStat_MsgSLRU)) == 0)
-            continue;
-
-        /* set the SLRU type before each send */
-        SLRUStats[i].m_index = i;
-
-        /*
-         * Prepare and send the message
-         */
-        pgstat_setheader(&SLRUStats[i].m_hdr, PGSTAT_MTYPE_SLRU);
-        pgstat_send(&SLRUStats[i], sizeof(PgStat_MsgSLRU));
-
-        /*
-         * Clear out the statistics buffer, so it can be re-used.
-         */
-        MemSet(&SLRUStats[i], 0, sizeof(PgStat_MsgSLRU));
-    }
+    flush_walstat(false);
 }
 
-
 /* ----------
- * PgstatCollectorMain() -
+ * get_local_dbstat_entry() -
  *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-    WaitEvent    event;
-    WaitEventSet *wes;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, SignalHandlerForConfigReload);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    MyBackendType = B_STATS_COLLECTOR;
-    init_ps_display(NULL);
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /* Prepare to wait for our latch or data in our socket. */
-    wes = CreateWaitEventSet(CurrentMemoryContext, 3);
-    AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
-    AddWaitEventToSet(wes, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL, NULL);
-    AddWaitEventToSet(wes, WL_SOCKET_READABLE, pgStatSock, NULL, NULL);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do check ConfigReloadPending
-     * inside the inner loop, which means that such interrupts will get
-     * serviced but the latch won't get cleared until next time there is a
-     * break in the action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (ShutdownRequestPending)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * ShutdownRequestPending becomes set.
-         */
-        while (!ShutdownRequestPending)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (ConfigReloadPending)
-            {
-                ConfigReloadPending = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSLRUCOUNTER:
-                    pgstat_recv_resetslrucounter(&msg.msg_resetslrucounter,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETREPLSLOTCOUNTER:
-                    pgstat_recv_resetreplslotcounter(&msg.msg_resetreplslotcounter,
-                                                     len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_WAL:
-                    pgstat_recv_wal(&msg.msg_wal, len);
-                    break;
-
-                case PGSTAT_MTYPE_SLRU:
-                    pgstat_recv_slru(&msg.msg_slru, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_REPLSLOT:
-                    pgstat_recv_replslot(&msg.msg_replslot, len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitEventSetWait(wes, -1L, &event, 1, WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitEventSetWait(wes, 2 * 1000L /* msec */ , &event, 1,
-                              WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr == 1 && event.events == WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    FreeWaitEventSet(wes);
-
-    exit(0);
-}
-
-/*
- * Subroutine to clear stats in a database entry
- *
- * Tables and functions hashes are initialized to empty.
- */
-static void
-reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
-{
-    HASHCTL        hash_ctl;
-
-    dbentry->n_xact_commit = 0;
-    dbentry->n_xact_rollback = 0;
-    dbentry->n_blocks_fetched = 0;
-    dbentry->n_blocks_hit = 0;
-    dbentry->n_tuples_returned = 0;
-    dbentry->n_tuples_fetched = 0;
-    dbentry->n_tuples_inserted = 0;
-    dbentry->n_tuples_updated = 0;
-    dbentry->n_tuples_deleted = 0;
-    dbentry->last_autovac_time = 0;
-    dbentry->n_conflict_tablespace = 0;
-    dbentry->n_conflict_lock = 0;
-    dbentry->n_conflict_snapshot = 0;
-    dbentry->n_conflict_bufferpin = 0;
-    dbentry->n_conflict_startup_deadlock = 0;
-    dbentry->n_temp_files = 0;
-    dbentry->n_temp_bytes = 0;
-    dbentry->n_deadlocks = 0;
-    dbentry->n_checksum_failures = 0;
-    dbentry->last_checksum_failure = 0;
-    dbentry->n_block_read_time = 0;
-    dbentry->n_block_write_time = 0;
-
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
-
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
-}
-
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+ *  Find or create a local PgStat_StatDBEntry entry for dbid.  A new entry is
+ *  created and initialized if none exists.
  */
 static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
+get_local_dbstat_entry(Oid dbid)
 {
-    PgStat_StatDBEntry *result;
+    PgStat_StatDBEntry *dbentry;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
-        return NULL;
 
     /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
+     * Find an entry or create a new one.
      */
-    if (!found)
-        reset_dbentry_counters(result);
+    dbentry = (PgStat_StatDBEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid,
+                             true, &found);
 
-    return result;
+    return dbentry;
 }
 
-
-/*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+/* ----------
+ * get_local_tabstat_entry() -
+ *  Find or create a PgStat_TableStatus entry for rel. A new entry is created
+ *  and initialized if none exists.
+ * ----------
  */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+static PgStat_TableStatus *
+get_local_tabstat_entry(Oid rel_id, bool isshared)
 {
-    PgStat_StatTabEntry *result;
+    PgStat_TableStatus *tabentry;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
 
-    /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
+    tabentry = (PgStat_TableStatus *)
+        get_local_stat_entry(PGSTAT_TYPE_TABLE,
+                             isshared ? InvalidOid : MyDatabaseId,
+                             rel_id, true, &found);
 
-    if (!create && !found)
-        return NULL;
-
-    /* If not found, initialize the new one. */
-    if (!found)
-    {
-        result->numscans = 0;
-        result->tuples_returned = 0;
-        result->tuples_fetched = 0;
-        result->tuples_inserted = 0;
-        result->tuples_updated = 0;
-        result->tuples_deleted = 0;
-        result->tuples_hot_updated = 0;
-        result->n_live_tuples = 0;
-        result->n_dead_tuples = 0;
-        result->changes_since_analyze = 0;
-        result->inserts_since_vacuum = 0;
-        result->blocks_fetched = 0;
-        result->blocks_hit = 0;
-        result->vacuum_timestamp = 0;
-        result->vacuum_count = 0;
-        result->autovac_vacuum_timestamp = 0;
-        result->autovac_vacuum_count = 0;
-        result->analyze_timestamp = 0;
-        result->analyze_count = 0;
-        result->autovac_analyze_timestamp = 0;
-        result->autovac_analyze_count = 0;
-    }
-
-    return result;
+    return tabentry;
 }
 
-
 /* ----------
- * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
+ * pgstat_write_statsfile() -
+ *        Write the global statistics file, as well as DB files.
  *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ * This function is called in the last process that is accessing the shared
+ * stats so locking is not required.
  * ----------
  */
 static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+pgstat_write_statsfile(void)
 {
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
-    int            i;
+    dshash_seq_status hstat;
+    PgStatHashEntry *ps;
+
+    /* Stats are not initialized yet; just return. */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
+
+    /* this is the last process that is accessing the shared stats */
+#ifdef USE_ASSERT_CHECKING
+    LWLockAcquire(StatsLock, LW_SHARED);
+    Assert(StatsShmem->refcount == 0);
+    LWLockRelease(StatsLock);
+#endif
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -5098,7 +5499,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    pg_atomic_write_u64(&StatsShmem->stats_timestamp, GetCurrentTimestamp());
 
     /*
      * Write the file header --- currently just a format ID.
@@ -5108,200 +5509,76 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Write global stats struct
+     * Write bgwriter global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(&StatsShmem->bgwriter_stats, sizeof(PgStat_BgWriter), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Write archiver stats struct
+     * Write checkpointer global stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+    rc = fwrite(&StatsShmem->checkpointer_stats, sizeof(PgStat_CheckPointer), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Write WAL stats struct
+     * Write archiver global stats struct
      */
-    rc = fwrite(&walStats, sizeof(walStats), 1, fpout);
+    rc = fwrite(&StatsShmem->archiver_stats, sizeof(PgStat_Archiver), 1,
+                fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write WAL global stats struct
+     */
+    rc = fwrite(&StatsShmem->wal_stats, sizeof(PgStat_Wal), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write SLRU stats struct
      */
-    rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
+    rc = fwrite(&StatsShmem->slru_stats, sizeof(PgStatSharedSLRUStats), 1,
+                fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Walk through the database table.
+     * Walk through the stats entries
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((ps = dshash_seq_next(&hstat)) != NULL)
     {
-        /*
-         * Write out the table and function stats for this DB into the
-         * appropriate per-DB stat file, if required.
-         */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
-
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
-
-        /*
-         * Write out the DB entry. We don't write the tables or functions
-         * pointers, since they're of no use to any other process.
-         */
-        fputc('D', fpout);
-        rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * Write replication slot stats struct
-     */
-    for (i = 0; i < nReplSlotStats; i++)
-    {
-        fputc('R', fpout);
-        rc = fwrite(&replSlotStats[i], sizeof(PgStat_ReplSlotStats), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
+        void               *pent;
+        size_t                len;
 
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
+        CHECK_FOR_INTERRUPTS();
 
-    if (permanent)
-        unlink(pgstat_stat_filename);
+        pent = dsa_get_address(area, ps->body);
 
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
-}
-
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
-
-/* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
- * ----------
- */
-static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
-{
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpout;
-    int32        format_id;
-    Oid            dbid = dbentry->databaseid;
-    int            rc;
-    char        tmpfile[MAXPGPATH];
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
+        /* Make DB's timestamp consistent with the global stats */
+        if (ps->key.type == PGSTAT_TYPE_DB)
+        {
+            PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) pent;
 
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
+            dbentry->stats_timestamp =
+                (TimestampTz) pg_atomic_read_u64(&StatsShmem->stats_timestamp);
+        }
+        else if (ps->key.type == PGSTAT_TYPE_REPLSLOT)
+        {
+            PgStat_ReplSlot *replslotent = (PgStat_ReplSlot *) pent;
 
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
+            /* Don't write the unused entry */
+            if (replslotent->dropped)
+                continue;
+        }
 
-    /*
-     * Walk through the database's access stats per table.
-     */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
-    {
-        fputc('T', fpout);
-        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
+        fputc('S', fpout);
+        rc = fwrite(&ps->key, sizeof(PgStatHashKey), 1, fpout);
 
-    /*
-     * Walk through the database's function stats table.
-     */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+        /* Write the entry, excluding its header part */
+        len = PGSTAT_SHENT_BODY_LEN(ps->key.type);
+        rc = fwrite(PGSTAT_SHENT_BODY(pent), len, 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_seq_term(&hstat);
 
     /*
      * No more output to be done. Close the temp file and replace the old
@@ -5335,114 +5612,63 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
 }
 
 /* ----------
- * pgstat_read_statsfiles() -
- *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ * pgstat_read_statsfile() -
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
+ *    Reads the existing activity statistics file into the shared stats hash.
  *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
+ * This function is called in the only process that is accessing the shared
+ * stats so locking is not required.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+static void
+pgstat_read_statsfile(void)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-    int            i;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    char        tag;
 
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
+    /* shouldn't be called from postmaster */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    /* this is the only process that is accessing the shared stats */
+#ifdef USE_ASSERT_CHECKING
+    LWLockAcquire(StatsLock, LW_SHARED);
+    Assert(StatsShmem->refcount == 1);
+    LWLockRelease(StatsLock);
+#endif
 
-    /* Allocate the space for replication slot statistics */
-    replSlotStats = palloc0(max_replication_slots * sizeof(PgStat_ReplSlotStats));
-    nReplSlotStats = 0;
-
-    /*
-     * Clear out global, archiver, WAL and SLRU statistics so they start from
-     * zero in case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-    memset(&walStats, 0, sizeof(walStats));
-    memset(&slruStats, 0, sizeof(slruStats));
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-    walStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Set the same reset timestamp for all SLRU items too.
-     */
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-        slruStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Set the same reset timestamp for all replication slots too.
-     */
-    for (i = 0; i < max_replication_slots; i++)
-        replSlotStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    StatsShmem->bgwriter_stats.stat_reset_timestamp = GetCurrentTimestamp();
+    StatsShmem->archiver_stats.stat_reset_timestamp =
+        StatsShmem->bgwriter_stats.stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the activity statistics is not running or
+     * has not yet written the stats file the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -5451,681 +5677,150 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
-        goto done;
-    }
-
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        goto done;
-    }
-
-    /*
-     * Read WAL stats struct
-     */
-    if (fread(&walStats, 1, sizeof(walStats), fpin) != sizeof(walStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&walStats, 0, sizeof(walStats));
         goto done;
     }
 
     /*
-     * Read SLRU stats struct
+     * Read bgwriter stats struct
      */
-    if (fread(slruStats, 1, sizeof(slruStats), fpin) != sizeof(slruStats))
+    if (fread(&StatsShmem->bgwriter_stats, 1, sizeof(PgStat_BgWriter), fpin) !=
+        sizeof(PgStat_BgWriter))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&slruStats, 0, sizeof(slruStats));
+        MemSet(&StatsShmem->bgwriter_stats, 0, sizeof(PgStat_BgWriter));
         goto done;
     }
 
     /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Add to the DB hash
-                 */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
-                break;
-
-                /*
-                 * 'R'    A PgStat_ReplSlotStats struct describing a replication
-                 * slot follows.
-                 */
-            case 'R':
-                if (fread(&replSlotStats[nReplSlotStats], 1, sizeof(PgStat_ReplSlotStats), fpin)
-                    != sizeof(PgStat_ReplSlotStats))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    memset(&replSlotStats[nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
-                    goto done;
-                }
-                nReplSlotStats++;
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-
-    return dbhash;
-}
-
-
-/* ----------
- * pgstat_read_db_statsfile() -
- *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
- * ----------
- */
-static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
-{
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatTabEntry tabbuf;
-    PgStat_StatFuncEntry funcbuf;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return;
-    }
-
-    /*
-     * Verify it's of the expected format.
+     * Read checkpointer stats struct
      */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
+    if (fread(&StatsShmem->checkpointer_stats, 1, sizeof(PgStat_CheckPointer), fpin) !=
+        sizeof(PgStat_CheckPointer))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
+        MemSet(&StatsShmem->checkpointer_stats, 0, sizeof(PgStat_CheckPointer));
         goto done;
     }
 
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'T'    A PgStat_StatTabEntry follows.
-                 */
-            case 'T':
-                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
-                          fpin) != sizeof(PgStat_StatTabEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
-                break;
-
-                /*
-                 * 'F'    A PgStat_StatFuncEntry follows.
-                 */
-            case 'F':
-                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
-                          fpin) != sizeof(PgStat_StatFuncEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
-                break;
-
-                /*
-                 * 'E'    The EOF marker of a complete stats file.
-                 */
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts. The caller must
- *    rely on timestamp stored in *ts iff the function returns true.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    PgStat_WalStats myWalStats;
-    PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
-    PgStat_ReplSlotStats myReplSlotStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
     /*
      * Read archiver stats struct
      */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
+    if (fread(&StatsShmem->archiver_stats, 1, sizeof(PgStat_Archiver),
+              fpin) != sizeof(PgStat_Archiver))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
+        MemSet(&StatsShmem->archiver_stats, 0, sizeof(PgStat_Archiver));
+        goto done;
     }
 
     /*
      * Read WAL stats struct
      */
-    if (fread(&myWalStats, 1, sizeof(myWalStats), fpin) != sizeof(myWalStats))
+    if (fread(&StatsShmem->wal_stats, 1, sizeof(PgStat_Wal), fpin)
+        != sizeof(PgStat_Wal))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
+        MemSet(&StatsShmem->wal_stats, 0, sizeof(PgStat_Wal));
+        goto done;
     }
 
     /*
      * Read SLRU stats struct
      */
-    if (fread(mySLRUStats, 1, sizeof(mySLRUStats), fpin) != sizeof(mySLRUStats))
+    if (fread(&StatsShmem->slru_stats, 1, sizeof(PgStatSharedSLRUStats),
+              fpin) != sizeof(PgStatSharedSLRUStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-                /*
-                 * 'R'    A PgStat_ReplSlotStats struct describing a replication
-                 * slot follows.
-                 */
-            case 'R':
-                if (fread(&myReplSlotStats, 1, sizeof(PgStat_ReplSlotStats), fpin)
-                    != sizeof(PgStat_ReplSlotStats))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-        }
+        goto done;
     }
 
-done:
-    FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
     /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
+    while ((tag = fgetc(fpin)) == 'S')
     {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
+        PgStatHashKey        key;
+        PgStat_StatEntryHeader *p;
+        size_t                len;
 
         CHECK_FOR_INTERRUPTS();
 
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
+        if (fread(&key, 1, sizeof(key), fpin) != sizeof(key))
         {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"", statfile)));
+            goto done;
         }
 
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
+        p = get_stat_entry(key.type, key.databaseid, key.objectid,
+                           false, true, &found);
 
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
+        /* don't allow duplicate entries */
+        if (found)
+        {
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"",
+                            statfile)));
+            goto done;
         }
 
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
+        /* Read only the body so that the entry header stays intact */
+        len = PGSTAT_SHENT_BODY_LEN(key.type);
 
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
+        if (fread(PGSTAT_SHENT_BODY(p), 1, len, fpin) != len)
+        {
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"", statfile)));
+            goto done;
+        }
     }
 
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
+    if (tag != 'E')
+    {
         ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
+                (errmsg("corrupted statistics file \"%s\"",
+                        statfile)));
+        goto done;
+    }
 
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
+done:
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+
+    return;
 }
 
-
 /* ----------
  * pgstat_setup_memcxt() -
  *
- *    Create pgStatLocalContext, if not already done.
+ *    Create pgStatLocalContext if not already done.
  * ----------
  */
 static void
 pgstat_setup_memcxt(void)
 {
     if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
-}
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
 
+    if (!pgStatCacheContext)
+        pgStatCacheContext =
+            AllocSetContextCreate(CacheMemoryContext,
+                                  "Activity statistics",
+                                  ALLOCSET_SMALL_SIZES);
+}
 
 /* ----------
  * pgstat_clear_snapshot() -
@@ -6142,901 +5837,25 @@ pgstat_clear_snapshot(void)
 {
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
+    {
         MemoryContextDelete(pgStatLocalContext);
 
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-                tabentry->inserts_since_vacuum = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_WAL)
-    {
-        /* Reset the WAL statistics for the cluster. */
-        memset(&walStats, 0, sizeof(walStats));
-        walStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_resetslrucounter() -
- *
- *    Reset some SLRU statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len)
-{
-    int            i;
-    TimestampTz ts = GetCurrentTimestamp();
-
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /* reset entry with the given index, or all entries (index is -1) */
-        if ((msg->m_index == -1) || (msg->m_index == i))
-        {
-            memset(&slruStats[i], 0, sizeof(slruStats[i]));
-            slruStats[i].stat_reset_timestamp = ts;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_resetreplslotcounter() -
- *
- *    Reset some replication slot statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg,
-                                 int len)
-{
-    int            i;
-    int            idx = -1;
-    TimestampTz ts;
-
-    ts = GetCurrentTimestamp();
-    if (msg->clearall)
-    {
-        for (i = 0; i < nReplSlotStats; i++)
-            pgstat_reset_replslot(i, ts);
-    }
-    else
-    {
-        /* Get the index of replication slot statistics to reset */
-        idx = pgstat_replslot_index(msg->m_slotname, false);
-
-        /*
-         * Nothing to do if the given slot entry is not found.  This could
-         * happen when the slot with the given name is removed and the
-         * corresponding statistics entry is also removed before receiving the
-         * reset message.
-         */
-        if (idx < 0)
-            return;
-
-        /* Reset the stats for the requested replication slot */
-        pgstat_reset_replslot(idx, ts);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signaling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * It is quite possible that a non-aggressive VACUUM ended up skipping
-     * various pages, however, we'll zero the insert counter here regardless.
-     * It's currently used only to track when we need to perform an "insert"
-     * autovacuum, which are mainly intended to freeze newly inserted tuples.
-     * Zeroing this may just mean we'll not try to vacuum the table again
-     * until enough tuples have been inserted to trigger another insert
-     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
-     * stragglers.
-     */
-    tabentry->inserts_since_vacuum = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_wal() -
- *
- *    Process a WAL message.
- * ----------
- */
-static void
-pgstat_recv_wal(PgStat_MsgWal *msg, int len)
-{
-    walStats.wal_buffers_full += msg->m_wal_buffers_full;
-}
-
-/* ----------
- * pgstat_recv_slru() -
- *
- *    Process a SLRU message.
- * ----------
- */
-static void
-pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
-{
-    slruStats[msg->m_index].blocks_zeroed += msg->m_blocks_zeroed;
-    slruStats[msg->m_index].blocks_hit += msg->m_blocks_hit;
-    slruStats[msg->m_index].blocks_read += msg->m_blocks_read;
-    slruStats[msg->m_index].blocks_written += msg->m_blocks_written;
-    slruStats[msg->m_index].blocks_exists += msg->m_blocks_exists;
-    slruStats[msg->m_index].flush += msg->m_flush;
-    slruStats[msg->m_index].truncate += msg->m_truncate;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_replslot() -
- *
- *    Process a REPLSLOT message.
- * ----------
- */
-static void
-pgstat_recv_replslot(PgStat_MsgReplSlot *msg, int len)
-{
-    int            idx;
-
-    /*
-     * Get the index of replication slot statistics.  On dropping, we don't
-     * create the new statistics.
-     */
-    idx = pgstat_replslot_index(msg->m_slotname, !msg->m_drop);
-
-    /*
-     * The slot entry is not found or there is no space to accommodate the new
-     * entry.  This could happen when the message for the creation of a slot
-     * reached before the drop message even though the actual operations
-     * happen in reverse order.  In such a case, the next update of the
-     * statistics for the same slot will create the required entry.
-     */
-    if (idx < 0)
-        return;
-
-    Assert(idx >= 0 && idx <= max_replication_slots);
-    if (msg->m_drop)
-    {
-        /* Remove the replication slot statistics with the given name */
-        memcpy(&replSlotStats[idx], &replSlotStats[nReplSlotStats - 1],
-               sizeof(PgStat_ReplSlotStats));
-        nReplSlotStats--;
-        Assert(nReplSlotStats >= 0);
-    }
-    else
-    {
-        /* Update the replication slot statistics */
-        replSlotStats[idx].spill_txns += msg->m_spill_txns;
-        replSlotStats[idx].spill_count += msg->m_spill_count;
-        replSlotStats[idx].spill_bytes += msg->m_spill_bytes;
-        replSlotStats[idx].stream_txns += msg->m_stream_txns;
-        replSlotStats[idx].stream_count += msg->m_stream_count;
-        replSlotStats[idx].stream_bytes += msg->m_stream_bytes;
-    }
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
+    }
+
+    /* Invalidate the simple cache keys */
+    cached_dbent_key = stathashkey_zero;
+    cached_tabent_key = stathashkey_zero;
+    cached_funcent_key = stathashkey_zero;
+    cached_archiverstats_is_valid = false;
+    cached_bgwriterstats_is_valid = false;
+    cached_checkpointerstats_is_valid = false;
+    cached_walstats_is_valid = false;
+    cached_slrustats_is_valid = false;
+    n_cached_replslotstats = -1;
 }
 
 /*
@@ -7083,60 +5902,6 @@ pgstat_clip_activity(const char *raw_activity)
     return activity;
 }
 
-/* ----------
- * pgstat_replslot_index
- *
- * Return the index of entry of a replication slot with the given name, or
- * -1 if the slot is not found.
- *
- * create_it tells whether to create the new slot entry if it is not found.
- * ----------
- */
-static int
-pgstat_replslot_index(const char *name, bool create_it)
-{
-    int            i;
-
-    Assert(nReplSlotStats <= max_replication_slots);
-    for (i = 0; i < nReplSlotStats; i++)
-    {
-        if (strcmp(replSlotStats[i].slotname, name) == 0)
-            return i;            /* found */
-    }
-
-    /*
-     * The slot is not found.  We don't want to register the new statistics if
-     * the list is already full or the caller didn't request.
-     */
-    if (i == max_replication_slots || !create_it)
-        return -1;
-
-    /* Register new slot */
-    memset(&replSlotStats[nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
-    memcpy(&replSlotStats[nReplSlotStats].slotname, name, NAMEDATALEN);
-
-    return nReplSlotStats++;
-}
-
-/* ----------
- * pgstat_reset_replslot
- *
- * Reset the replication slot stats at index 'i'.
- * ----------
- */
-static void
-pgstat_reset_replslot(int i, TimestampTz ts)
-{
-    /* reset only counters. Don't clear slot name */
-    replSlotStats[i].spill_txns = 0;
-    replSlotStats[i].spill_count = 0;
-    replSlotStats[i].spill_bytes = 0;
-    replSlotStats[i].stream_txns = 0;
-    replSlotStats[i].stream_count = 0;
-    replSlotStats[i].stream_bytes = 0;
-    replSlotStats[i].stat_reset_timestamp = ts;
-}
-
 /*
  * pgstat_slru_index
  *
@@ -7181,7 +5946,7 @@ pgstat_slru_name(int slru_idx)
  * Returns pointer to entry with counters for given SLRU (based on the name
  * stored in SlruCtl as lwlock tranche name).
  */
-static inline PgStat_MsgSLRU *
+static PgStat_SLRUStats *
 slru_entry(int slru_idx)
 {
     /*
@@ -7192,7 +5957,7 @@ slru_entry(int slru_idx)
 
     Assert((slru_idx >= 0) && (slru_idx < SLRU_NUM_ELEMENTS));
 
-    return &SLRUStats[slru_idx];
+    return &local_SLRUStats[slru_idx];
 }
 
 /*
@@ -7202,41 +5967,48 @@ slru_entry(int slru_idx)
 void
 pgstat_count_slru_page_zeroed(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_zeroed += 1;
+    slru_entry(slru_idx)->blocks_zeroed += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_hit(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_hit += 1;
+    slru_entry(slru_idx)->blocks_hit += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_exists(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_exists += 1;
+    slru_entry(slru_idx)->blocks_exists += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_read(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_read += 1;
+    slru_entry(slru_idx)->blocks_read += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_written(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_written += 1;
+    slru_entry(slru_idx)->blocks_written += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_flush(int slru_idx)
 {
-    slru_entry(slru_idx)->m_flush += 1;
+    slru_entry(slru_idx)->flush += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_truncate(int slru_idx)
 {
-    slru_entry(slru_idx)->m_truncate += 1;
+    slru_entry(slru_idx)->truncate += 1;
+    have_slrustats = true;
 }
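The SLRU hunks above replace message-struct counters with backend-local counters plus a `have_slrustats` dirty flag. The following is a minimal, self-contained sketch of that pattern, not the patch's implementation: every `sketch_`-prefixed name, the array size, and the plain static array standing in for shared memory are assumptions made for illustration. A real flush would have to take a lock before touching the shared area.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_SLRU_NUM_ELEMENTS 8	/* hypothetical; not the real constant */

typedef struct SketchSLRUStats
{
	int64_t		blocks_zeroed;
	int64_t		blocks_hit;
} SketchSLRUStats;

/* Backend-local counters, bumped without any locking. */
static SketchSLRUStats sketch_local[SKETCH_SLRU_NUM_ELEMENTS];

/* Stand-in for the shared-memory area (assumption: real code locks it). */
static SketchSLRUStats sketch_shared[SKETCH_SLRU_NUM_ELEMENTS];

/* Dirty flag: true when sketch_local holds counts not yet flushed. */
static bool sketch_have_stats = false;

static SketchSLRUStats *
sketch_slru_entry(int idx)
{
	assert(idx >= 0 && idx < SKETCH_SLRU_NUM_ELEMENTS);
	return &sketch_local[idx];
}

void
sketch_count_page_zeroed(int idx)
{
	sketch_slru_entry(idx)->blocks_zeroed += 1;
	sketch_have_stats = true;
}

void
sketch_count_page_hit(int idx)
{
	sketch_slru_entry(idx)->blocks_hit += 1;
	sketch_have_stats = true;
}

/*
 * Add the local counters to the shared ones and clear the local copy.
 * Returns false without doing any work when nothing is pending.
 */
bool
sketch_flush_slru(void)
{
	int			i;

	if (!sketch_have_stats)
		return false;

	for (i = 0; i < SKETCH_SLRU_NUM_ELEMENTS; i++)
	{
		sketch_shared[i].blocks_zeroed += sketch_local[i].blocks_zeroed;
		sketch_shared[i].blocks_hit += sketch_local[i].blocks_hit;
	}

	memset(sketch_local, 0, sizeof(sketch_local));
	sketch_have_stats = false;
	return true;
}
```

The flag lets the flush path skip the scan of the whole array when no counter has changed since the last flush.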
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b811c961a6..526021def2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -257,7 +257,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -518,7 +517,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1340,12 +1338,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1794,11 +1786,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2728,8 +2715,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -3056,8 +3041,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3124,13 +3107,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3203,22 +3179,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3679,22 +3639,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3914,8 +3858,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3939,8 +3881,7 @@ PostmasterStateMachine(void)
     {
         /*
          * PM_WAIT_DEAD_END state ends when the BackendList is entirely empty
-         * (ie, no dead_end children remain), and the archiver and stats
-         * collector are gone too.
+         * (ie, no dead_end children remain), and the archiver is gone too.
          *
          * The reason we wait for those two is to protect them against a new
          * postmaster starting conflicting subprocesses; this isn't an
@@ -3950,8 +3891,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4152,8 +4092,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5130,18 +5068,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5260,12 +5186,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6170,7 +6090,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6226,8 +6145,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6462,7 +6379,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index b89df01fa7..57531d7d48 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1556,8 +1556,8 @@ is_checksummed_file(const char *fullpath, const char *filename)
  *
  * If 'missing_ok' is true, will not throw an error if the file is not found.
  *
- * If dboid is anything other than InvalidOid then any checksum failures detected
- * will get reported to the stats collector.
+ * If dboid is anything other than InvalidOid then any checksum failures
+ * detected will get reported to the activity stats facility.
  *
  * Returns true if the file was successfully sent, false if 'missing_ok',
  * and the file did not exist.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0adf04814c..7940f237ea 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2047,7 +2047,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                CheckPointerStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2157,7 +2157,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2347,7 +2347,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2355,7 +2355,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
 
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 96c2aaabbd..55693cfa3a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -149,6 +149,7 @@ CreateSharedMemoryAndSemaphores(void)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -267,6 +268,7 @@ CreateSharedMemoryAndSemaphores(void)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 2fa90cc095..17a46db74b 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -177,7 +177,9 @@ static const char *const BuiltinTrancheNames[] = {
     /* LWTRANCHE_PARALLEL_APPEND: */
     "ParallelAppend",
     /* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
-    "PerXactPredicateList"
+    "PerXactPredicateList",
+    /* LWTRANCHE_STATS: */
+    "ActivityStatistics"
 };
 
 StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index 774292fd94..91bf8d5b5d 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -53,3 +53,4 @@ XactTruncationLock                    44
 # 45 was XactTruncationLock until removal of BackendRandomLock
 WrapLimitsVacuumLock                46
 NotifyQueueTailLock                    47
+StatsLock                            48
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index dcc09df0c7..0dec4b9145 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -415,8 +415,8 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
     DropRelFileNodesAllBuffers(rnodes, nrels);
 
     /*
-     * It'd be nice to tell the stats collector to forget them immediately,
-     * too. But we can't because we don't know the OIDs.
+     * It'd be nice to tell the activity stats facility to forget them
+     * immediately, too. But we can't because we don't know the OIDs.
      */
 
     /*
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 411cfadbff..5043736f1f 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3209,6 +3209,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3783,6 +3789,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4179,11 +4186,12 @@ PostgresMain(int argc, char *argv[],
          * Note: this includes fflush()'ing the last of the prior output.
          *
          * This is also a good time to send collected statistics to the
-         * collector, and to update the PS stats display.  We avoid doing
-         * those every time through the message loop because it'd slow down
-         * processing of batched messages, and because we don't want to report
-         * uncommitted updates (that confuses autovacuum).  The notification
-         * processor wants a call too, if we are not in a transaction block.
+         * activity statistics, and to update the PS stats display.  We avoid
+         * doing those every time through the message loop because it'd slow
+         * down processing of batched messages, and because we don't want to
+         * report uncommitted updates (that confuses autovacuum).  The
+         * notification processor wants a call too, if we are not in a
+         * transaction block.
          */
         if (send_ready_for_query)
         {
@@ -4215,6 +4223,8 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long stats_timeout;
+
                 /* Send out notify signals and transmit self-notifies */
                 ProcessCompletedNotifies();
 
@@ -4227,8 +4237,13 @@ PostgresMain(int argc, char *argv[],
                 if (notifyInterruptPending)
                     ProcessNotifyInterrupt();
 
-                pgstat_report_stat(false);
-
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle");
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4263,7 +4278,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4271,6 +4286,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
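
Reviewer note on the postgres.c hunks above: they arm IDLE_STATS_UPDATE_TIMEOUT when pgstat_report_stat(false) returns a positive value, and disarm it once the next command arrives. Below is a minimal standalone sketch of that control flow. It assumes that a positive return value means "some counts could not be flushed yet, retry after this many milliseconds"; the patch text shown here does not state this explicitly. IdleTimer, report_stat_sketch, on_become_idle and on_command_received are hypothetical names for illustration, not PostgreSQL functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the timeout machinery: just records state. */
typedef struct IdleTimer
{
    bool        armed;          /* is a retry timer pending? */
    long        delay_ms;       /* requested delay while armed */
} IdleTimer;

/*
 * Hypothetical stand-in for pgstat_report_stat().  Assumption: it returns 0
 * when nothing is left to flush, otherwise the number of milliseconds the
 * caller should wait before retrying.
 */
static long
report_stat_sketch(int *pending, bool lock_available, long retry_ms)
{
    if (*pending == 0)
        return 0;               /* nothing to do */
    if (!lock_available)
        return retry_ms;        /* flush deferred, ask for a retry */
    *pending = 0;               /* flushed everything */
    return 0;
}

/* Mirrors the "about to go idle" branch: arm the timer only if needed. */
static void
on_become_idle(IdleTimer *timer, int *pending, bool lock_available)
{
    long        timeout = report_stat_sketch(pending, lock_available, 500);

    assert(timer != NULL);
    if (timeout > 0)
    {
        timer->armed = true;
        timer->delay_ms = timeout;
    }
}

/* Mirrors step (5): a new command arrived, so the retry timer is moot. */
static void
on_command_received(IdleTimer *timer)
{
    assert(timer != NULL);
    if (timer->armed)
    {
        timer->armed = false;
        timer->delay_ms = 0;
    }
}
```

Keeping a separate "armed" flag, as the patch does with disable_idle_stats_update_timeout, avoids calling into the timeout machinery on every command when no flush was deferred.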
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index a210fc93b4..c72042d99e 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * pgstatfuncs.c
- *      Functions for accessing the statistics collector data
+ *      Functions for accessing the activity statistics data
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -35,9 +35,6 @@
 
 #define HAS_PGSTAT_PERMISSIONS(role)     (is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS) || has_privs_of_role(GetUserId(),role))
 
 
-/* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
-
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
 {
@@ -1267,7 +1264,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_commit);
+        result = (int64) (dbentry->counts.n_xact_commit);
 
     PG_RETURN_INT64(result);
 }
@@ -1283,7 +1280,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_rollback);
+        result = (int64) (dbentry->counts.n_xact_rollback);
 
     PG_RETURN_INT64(result);
 }
@@ -1299,7 +1296,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_fetched);
+        result = (int64) (dbentry->counts.n_blocks_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1315,7 +1312,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_hit);
+        result = (int64) (dbentry->counts.n_blocks_hit);
 
     PG_RETURN_INT64(result);
 }
@@ -1331,7 +1328,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_returned);
+        result = (int64) (dbentry->counts.n_tuples_returned);
 
     PG_RETURN_INT64(result);
 }
@@ -1347,7 +1344,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_fetched);
+        result = (int64) (dbentry->counts.n_tuples_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1363,7 +1360,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_inserted);
+        result = (int64) (dbentry->counts.n_tuples_inserted);
 
     PG_RETURN_INT64(result);
 }
@@ -1379,7 +1376,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_updated);
+        result = (int64) (dbentry->counts.n_tuples_updated);
 
     PG_RETURN_INT64(result);
 }
@@ -1395,7 +1392,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_deleted);
+        result = (int64) (dbentry->counts.n_tuples_deleted);
 
     PG_RETURN_INT64(result);
 }
@@ -1428,7 +1425,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_files;
+        result = dbentry->counts.n_temp_files;
 
     PG_RETURN_INT64(result);
 }
@@ -1444,7 +1441,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_bytes;
+        result = dbentry->counts.n_temp_bytes;
 
     PG_RETURN_INT64(result);
 }
@@ -1459,7 +1456,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace);
+        result = (int64) (dbentry->counts.n_conflict_tablespace);
 
     PG_RETURN_INT64(result);
 }
@@ -1474,7 +1471,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_lock);
+        result = (int64) (dbentry->counts.n_conflict_lock);
 
     PG_RETURN_INT64(result);
 }
@@ -1489,7 +1486,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_snapshot);
+        result = (int64) (dbentry->counts.n_conflict_snapshot);
 
     PG_RETURN_INT64(result);
 }
@@ -1504,7 +1501,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_bufferpin);
+        result = (int64) (dbentry->counts.n_conflict_bufferpin);
 
     PG_RETURN_INT64(result);
 }
@@ -1519,7 +1516,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1534,11 +1531,11 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace +
-                          dbentry->n_conflict_lock +
-                          dbentry->n_conflict_snapshot +
-                          dbentry->n_conflict_bufferpin +
-                          dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_tablespace +
+                          dbentry->counts.n_conflict_lock +
+                          dbentry->counts.n_conflict_snapshot +
+                          dbentry->counts.n_conflict_bufferpin +
+                          dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1553,7 +1550,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_deadlocks);
+        result = (int64) (dbentry->counts.n_deadlocks);
 
     PG_RETURN_INT64(result);
 }
@@ -1571,7 +1568,7 @@ pg_stat_get_db_checksum_failures(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_checksum_failures);
+        result = (int64) (dbentry->counts.n_checksum_failures);
 
     PG_RETURN_INT64(result);
 }
@@ -1608,7 +1605,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_read_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_read_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1624,7 +1621,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_write_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_write_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1632,69 +1629,71 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_bgwriter_timed_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->timed_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->timed_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_requested_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->requested_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->requested_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_clean);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_written_clean);
 }
 
 Datum
 pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->maxwritten_clean);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->maxwritten_clean);
 }
 
 Datum
 pg_stat_get_checkpoint_write_time(PG_FUNCTION_ARGS)
 {
     /* time is already in msec, just convert to double for presentation */
-    PG_RETURN_FLOAT8((double) pgstat_fetch_global()->checkpoint_write_time);
+    PG_RETURN_FLOAT8((double)
+                     pgstat_fetch_stat_checkpointer()->checkpoint_write_time);
 }
 
 Datum
 pg_stat_get_checkpoint_sync_time(PG_FUNCTION_ARGS)
 {
     /* time is already in msec, just convert to double for presentation */
-    PG_RETURN_FLOAT8((double) pgstat_fetch_global()->checkpoint_sync_time);
+    PG_RETURN_FLOAT8((double)
+                     pgstat_fetch_stat_checkpointer()->checkpoint_sync_time);
 }
 
 Datum
 pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_global()->stat_reset_timestamp);
+    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_stat_bgwriter()->stat_reset_timestamp);
 }
 
 Datum
 pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_backend);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_backend);
 }
 
 Datum
 pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_fsync_backend);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_fsync_backend);
 }
 
 Datum
 pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_alloc);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
 }
 
 /*
@@ -1707,7 +1706,7 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
     TupleDesc    tupdesc;
     Datum        values[PG_STAT_GET_WAL_COLS];
     bool        nulls[PG_STAT_GET_WAL_COLS];
-    PgStat_WalStats *wal_stats;
+    PgStat_Wal *wal_stats;
 
     /* Initialise values and NULL flags arrays */
     MemSet(values, 0, sizeof(values));
@@ -2001,7 +2000,7 @@ pg_stat_get_xact_function_self_time(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_snapshot_timestamp(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_global()->stats_timestamp);
+    PG_RETURN_TIMESTAMPTZ(pgstat_get_stat_timestamp());
 }
 
 /* Discard the active statistics snapshot */
@@ -2089,7 +2088,7 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     TupleDesc    tupdesc;
     Datum        values[7];
     bool        nulls[7];
-    PgStat_ArchiverStats *archiver_stats;
+    PgStat_Archiver *archiver_stats;
 
     /* Initialise values and NULL flags arrays */
     MemSet(values, 0, sizeof(values));
@@ -2159,7 +2158,7 @@ pg_stat_get_replication_slots(PG_FUNCTION_ARGS)
     Tuplestorestate *tupstore;
     MemoryContext per_query_ctx;
     MemoryContext oldcontext;
-    PgStat_ReplSlotStats *slotstats;
+    PgStat_ReplSlot *slotstats;
     int            nstats;
     int            i;
 
@@ -2192,7 +2191,7 @@ pg_stat_get_replication_slots(PG_FUNCTION_ARGS)
     {
         Datum        values[PG_STAT_GET_REPLICATION_SLOT_COLS];
         bool        nulls[PG_STAT_GET_REPLICATION_SLOT_COLS];
-        PgStat_ReplSlotStats *s = &(slotstats[i]);
+        PgStat_ReplSlot *s = &(slotstats[i]);
 
         MemSet(values, 0, sizeof(values));
         MemSet(nulls, 0, sizeof(nulls));
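
Reviewer note on the pgstatfuncs.c hunks above: the per-database counters move into a nested "counts" member, and each SQL-callable accessor returns zero when no entry exists for the database. The following sketch shows that access pattern in isolation. DbCountsSketch, DbEntrySketch, fetch_dbentry_sketch and db_conflict_all_sketch are hypothetical names, and the linear lookup is only a placeholder for whatever pgstat_fetch_stat_dbentry() does internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: pure event counters, grouped so they can be handled as a unit. */
typedef struct DbCountsSketch
{
    int64_t     n_conflict_tablespace;
    int64_t     n_conflict_lock;
    int64_t     n_conflict_snapshot;
    int64_t     n_conflict_bufferpin;
    int64_t     n_conflict_startup_deadlock;
} DbCountsSketch;

/* Hypothetical: a database entry holding identity plus the nested counters. */
typedef struct DbEntrySketch
{
    uint32_t    dbid;
    DbCountsSketch counts;
} DbEntrySketch;

/* Placeholder lookup: returns NULL when the database has no entry. */
static const DbEntrySketch *
fetch_dbentry_sketch(const DbEntrySketch *entries, size_t n, uint32_t dbid)
{
    assert(entries != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
    {
        if (entries[i].dbid == dbid)
            return &entries[i];
    }
    return NULL;
}

/* Same shape as pg_stat_get_db_conflict_all(): 0 if missing, else the sum. */
static int64_t
db_conflict_all_sketch(const DbEntrySketch *entries, size_t n, uint32_t dbid)
{
    const DbEntrySketch *e = fetch_dbentry_sketch(entries, n, dbid);

    if (e == NULL)
        return 0;
    return e->counts.n_conflict_tablespace +
        e->counts.n_conflict_lock +
        e->counts.n_conflict_snapshot +
        e->counts.n_conflict_bufferpin +
        e->counts.n_conflict_startup_deadlock;
}
```

Grouping the counters in one struct keeps non-counter fields out of it, which matters for code that wants to copy or compare the counters as a block.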
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 66393becfb..92a518433e 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -72,6 +72,7 @@
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/optimizer.h"
+#include "pgstat.h"
 #include "rewrite/rewriteDefine.h"
 #include "rewrite/rowsecurity.h"
 #include "storage/lmgr.h"
@@ -2354,6 +2355,9 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
      */
     RelationCloseSmgr(relation);
 
+    /* break mutual link with stats entry */
+    pgstat_delinkstats(relation);
+
     /*
      * Free all the subsidiary data structures of the relcache entry, then the
      * entry itself.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 6ab8216839..883c7f8802 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index ed2ab4b5b2..74fb22f216 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -269,9 +269,6 @@ GetBackendTypeDesc(BackendType backendType)
         case B_ARCHIVER:
             backendDesc = "archiver";
             break;
-        case B_STATS_COLLECTOR:
-            backendDesc = "stats collector";
-            break;
         case B_LOGGER:
             backendDesc = "logger";
             break;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index d4ab4c7e23..4ff4cc33d9 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -631,6 +632,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1245,6 +1248,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
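
Reviewer note on IdleStatsUpdateTimeoutHandler above: the handler only sets flags and the latch; the real work is left to the normal interrupt-processing path. Here is a small standalone sketch of that "set a flag in the handler, act on it later" pattern using plain C signals. The flag names, process_interrupts_sketch and the use of SIGUSR1 are assumptions for illustration; the latch is omitted, and the patch's actual handling code is not part of this excerpt.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

/* Flags written by the handler; sig_atomic_t keeps the writes signal-safe. */
static volatile sig_atomic_t idle_stats_update_pending = 0;
static volatile sig_atomic_t interrupt_pending = 0;

/* Handler does the minimum: record that something needs attention. */
static void
idle_stats_timeout_handler(int signo)
{
    (void) signo;
    idle_stats_update_pending = 1;
    interrupt_pending = 1;
}

/*
 * Hypothetical counterpart of the interrupt-processing step: runs in normal
 * (non-signal) context, clears the flags, and performs the deferred work.
 * Here the "work" is just bumping a counter.  Returns true if anything was
 * pending.
 */
static bool
process_interrupts_sketch(int *flush_count)
{
    assert(flush_count != NULL);

    if (!interrupt_pending)
        return false;
    interrupt_pending = 0;

    if (idle_stats_update_pending)
    {
        idle_stats_update_pending = 0;
        (*flush_count)++;       /* stand-in for flushing pending stats */
    }
    return true;
}
```

Doing the flush outside the handler matters because the flush takes locks and touches shared memory, neither of which is safe from inside a signal handler.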
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index a62d64eaa4..d46ae0ca8d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -743,8 +743,8 @@ const char *const config_group_names[] =
     gettext_noop("Statistics"),
     /* STATS_MONITORING */
     gettext_noop("Statistics / Monitoring"),
-    /* STATS_COLLECTOR */
-    gettext_noop("Statistics / Query and Index Statistics Collector"),
+    /* STATS_ACTIVITY */
+    gettext_noop("Statistics / Query and Index Activity Statistics"),
     /* AUTOVACUUM */
     gettext_noop("Autovacuum"),
     /* CLIENT_CONN */
@@ -1454,7 +1454,7 @@ static struct config_bool ConfigureNamesBool[] =
 #endif
 
     {
-        {"track_activities", PGC_SUSET, STATS_COLLECTOR,
+        {"track_activities", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects information about executing commands."),
             gettext_noop("Enables the collection of information on the currently "
                          "executing command of each session, along with "
@@ -1465,7 +1465,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_counts", PGC_SUSET, STATS_COLLECTOR,
+        {"track_counts", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects statistics on database activity."),
             NULL
         },
@@ -1474,7 +1474,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_io_timing", PGC_SUSET, STATS_COLLECTOR,
+        {"track_io_timing", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects timing statistics for database I/O activity."),
             NULL
         },
@@ -4310,7 +4310,7 @@ static struct config_string ConfigureNamesString[] =
     },
 
     {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
+        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
             gettext_noop("Writes temporary statistics files to the specified directory."),
             NULL,
             GUC_SUPERUSER_ONLY
@@ -4646,7 +4646,7 @@ static struct config_enum ConfigureNamesEnum[] =
     },
 
     {
-        {"track_functions", PGC_SUSET, STATS_COLLECTOR,
+        {"track_functions", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects function-level statistics on database activity."),
             NULL
         },
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9cb571f7cc..668a2d033a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -579,7 +579,7 @@
 # STATISTICS
 #------------------------------------------------------------------------------
 
-# - Query and Index Statistics Collector -
+# - Query and Index Activity Statistics -
 
 #track_activities = on
 #track_counts = on
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index f674a7c94e..26603e95e4 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -124,7 +124,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index bba002ad24..b70c495f2a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -83,6 +83,8 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
 
@@ -321,7 +323,6 @@ typedef enum BackendType
     B_WAL_SENDER,
     B_WAL_WRITER,
     B_ARCHIVER,
-    B_STATS_COLLECTOR,
     B_LOGGER,
 } BackendType;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 257e515bfe..d5722a2e87 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL activity statistics facility.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -15,9 +15,11 @@
 #include "libpq/pqcomm.h"
 #include "miscadmin.h"
 #include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -27,8 +29,8 @@
  * ----------
  */
 #define PGSTAT_STAT_PERMANENT_DIRECTORY        "pg_stat"
-#define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
-#define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
+#define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/saved_stats"
+#define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/saved_stats.tmp"
 
 /* Default directory to store temporary statistics data in */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
@@ -41,38 +43,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_RESETSLRUCOUNTER,
-    PGSTAT_MTYPE_RESETREPLSLOTCOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_WAL,
-    PGSTAT_MTYPE_SLRU,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE,
-    PGSTAT_MTYPE_REPLSLOT,
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -83,9 +53,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -159,10 +128,13 @@ typedef enum PgStat_Single_Reset_Type
  */
 typedef struct PgStat_TableStatus
 {
-    Oid            t_id;            /* table's OID */
-    bool        t_shared;        /* is it a shared catalog? */
     struct PgStat_TableXactStatus *trans;    /* lowest subxact's counts */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_TableCounts t_counts;    /* event counts to be sent */
+    Relation    relation;            /* rel that is using this entry */
 } PgStat_TableStatus;
 
 /* ----------
@@ -186,350 +158,57 @@ typedef struct PgStat_TableXactStatus
     struct PgStat_TableXactStatus *next;    /* next of same subxact */
 } PgStat_TableXactStatus;
 
-
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
-/* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
+/*
+ * Archiver statistics kept in the shared stats
  */
-typedef struct PgStat_MsgHdr
+typedef struct PgStat_Archiver
 {
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
+    PgStat_Counter archived_count;    /* archival successes */
+    char        last_archived_wal[MAX_XFN_CHARS + 1];    /* last WAL file
+                                                         * archived */
+    TimestampTz last_archived_timestamp;    /* last archival success time */
+    PgStat_Counter failed_count;    /* failed archival attempts */
+    char        last_failed_wal[MAX_XFN_CHARS + 1]; /* WAL file involved in
+                                                     * last failure */
+    TimestampTz last_failed_timestamp;    /* last archival failure time */
+    TimestampTz stat_reset_timestamp;
+} PgStat_Archiver;
 
 /* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgDummy
+typedef struct PgStat_BgWriter
 {
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_alloc;
+    TimestampTz stat_reset_timestamp;
+} PgStat_BgWriter;
 
 /* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
+ * PgStat_CheckPointer        checkpointer statistics
  * ----------
  */
-
-typedef struct PgStat_MsgInquiry
+typedef struct PgStat_CheckPointer
 {
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgResetslrucounter Sent by the backend to tell the collector
- *                                to reset a SLRU counter
- * ----------
- */
-typedef struct PgStat_MsgResetslrucounter
-{
-    PgStat_MsgHdr m_hdr;
-    int            m_index;
-} PgStat_MsgResetslrucounter;
-
-/* ----------
- * PgStat_MsgResetreplslotcounter Sent by the backend to tell the collector
- *                                to reset replication slot counter(s)
- * ----------
- */
-typedef struct PgStat_MsgResetreplslotcounter
-{
-    PgStat_MsgHdr m_hdr;
-    char        m_slotname[NAMEDATALEN];
-    bool        clearall;
-} PgStat_MsgResetreplslotcounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgWal            Sent by backends and background processes to update WAL statistics.
- * ----------
- */
-typedef struct PgStat_MsgWal
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Counter m_wal_buffers_full;
-} PgStat_MsgWal;
-
-/* ----------
- * PgStat_MsgSLRU            Sent by a backend to update SLRU statistics.
- * ----------
- */
-typedef struct PgStat_MsgSLRU
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Counter m_index;
-    PgStat_Counter m_blocks_zeroed;
-    PgStat_Counter m_blocks_hit;
-    PgStat_Counter m_blocks_read;
-    PgStat_Counter m_blocks_written;
-    PgStat_Counter m_blocks_exists;
-    PgStat_Counter m_flush;
-    PgStat_Counter m_truncate;
-} PgStat_MsgSLRU;
-
-/* ----------
- * PgStat_MsgReplSlot    Sent by a backend or a wal sender to update replication
- *                        slot statistics.
- * ----------
- */
-typedef struct PgStat_MsgReplSlot
-{
-    PgStat_MsgHdr m_hdr;
-    char        m_slotname[NAMEDATALEN];
-    bool        m_drop;
-    PgStat_Counter m_spill_txns;
-    PgStat_Counter m_spill_count;
-    PgStat_Counter m_spill_bytes;
-    PgStat_Counter m_stream_txns;
-    PgStat_Counter m_stream_count;
-    PgStat_Counter m_stream_bytes;
-} PgStat_MsgReplSlot;
-
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+} PgStat_CheckPointer;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -545,7 +224,6 @@ typedef struct PgStat_FunctionCounts
  */
 typedef struct PgStat_BackendFunctionEntry
 {
-    Oid            f_id;
     PgStat_FunctionCounts f_counts;
 } PgStat_BackendFunctionEntry;
 
@@ -561,101 +239,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgResetslrucounter msg_resetslrucounter;
-    PgStat_MsgResetreplslotcounter msg_resetreplslotcounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgWal msg_wal;
-    PgStat_MsgSLRU msg_slru;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-    PgStat_MsgReplSlot msg_replslot;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Activity statistics data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -664,13 +249,9 @@ typedef union PgStat_Msg
 
 #define PGSTAT_FILE_FORMAT_ID    0x01A5BC9F
 
-/* ----------
- * PgStat_StatDBEntry            The collector's data per database
- * ----------
- */
-typedef struct PgStat_StatDBEntry
+
+typedef struct PgStat_StatDBCounts
 {
-    Oid            databaseid;
     PgStat_Counter n_xact_commit;
     PgStat_Counter n_xact_rollback;
     PgStat_Counter n_blocks_fetched;
@@ -680,7 +261,6 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_tuples_inserted;
     PgStat_Counter n_tuples_updated;
     PgStat_Counter n_tuples_deleted;
-    TimestampTz last_autovac_time;
     PgStat_Counter n_conflict_tablespace;
     PgStat_Counter n_conflict_lock;
     PgStat_Counter n_conflict_snapshot;
@@ -690,29 +270,97 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_temp_bytes;
     PgStat_Counter n_deadlocks;
     PgStat_Counter n_checksum_failures;
-    TimestampTz last_checksum_failure;
     PgStat_Counter n_block_read_time;    /* times in microseconds */
     PgStat_Counter n_block_write_time;
+} PgStat_StatDBCounts;
 
+/* ----------
+ * PgStat_StatEntryHeader        common header struct for PgStat_Stat*Entry
+ * ----------
+ */
+typedef struct PgStat_StatEntryHeader
+{
+    LWLock        lock;
+    bool        dropped;            /* This entry is being dropped and should
+                                     * be removed when refcount goes to
+                                     * zero. */
+    pg_atomic_uint32  refcount;        /* How many backends are referencing */
+} PgStat_StatEntryHeader;
+
+/* ----------
+ * PgStat_StatDBEntry            The statistics per database
+ * ----------
+ */
+typedef struct PgStat_StatDBEntry
+{
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    TimestampTz last_autovac_time;
+    TimestampTz last_checksum_failure;
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;    /* time of db stats update */
+
+    PgStat_StatDBCounts counts;
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following members must be last in the struct, because we don't
+     * write them out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables; /* current gen tables hash */
+    dshash_table_handle functions;    /* current gen functions hash */
 } PgStat_StatDBEntry;
 
+/* ----------
+ * PgStat_Wal   Sent by backends and background processes to
+ *                update WAL statistics.
+ * ----------
+ */
+typedef struct PgStat_Wal
+{
+    PgStat_Counter wal_buffers_full;
+    TimestampTz stat_reset_timestamp;
+} PgStat_Wal;
+
+/*
+ * SLRU statistics kept in the stats collector
+ */
+typedef struct PgStat_SLRUStats
+{
+    PgStat_Counter blocks_zeroed;
+    PgStat_Counter blocks_hit;
+    PgStat_Counter blocks_read;
+    PgStat_Counter blocks_written;
+    PgStat_Counter blocks_exists;
+    PgStat_Counter flush;
+    PgStat_Counter truncate;
+    TimestampTz stat_reset_timestamp;
+} PgStat_SLRUStats;
+
+
+/* ----------
+ * PgStat_HashMountInfo  dshash mount information
+ * ----------
+ */
+typedef struct PgStat_HashMountInfo
+{
+    HTAB       *snapshot_tables;    /* table entry snapshot */
+    HTAB       *snapshot_functions; /* function entry snapshot */
+    dshash_table *dshash_tables;    /* attached tables dshash */
+    dshash_table *dshash_functions; /* attached functions dshash */
+} PgStat_HashMountInfo;
 
 /* ----------
- * PgStat_StatTabEntry            The collector's data per table (or index)
+ * PgStat_StatTabEntry            The statistics per table (or index)
  * ----------
  */
 typedef struct PgStat_StatTabEntry
 {
-    Oid            tableid;
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
 
     PgStat_Counter numscans;
 
@@ -732,96 +380,35 @@ typedef struct PgStat_StatTabEntry
     PgStat_Counter blocks_fetched;
     PgStat_Counter blocks_hit;
 
-    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
     PgStat_Counter vacuum_count;
-    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_vacuum_count;
-    TimestampTz analyze_timestamp;    /* user initiated */
     PgStat_Counter analyze_count;
-    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_analyze_count;
 } PgStat_StatTabEntry;
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            per function stats data
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
 {
-    Oid            functionid;
-
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
     PgStat_Counter f_numcalls;
 
     PgStat_Counter f_total_time;    /* times in microseconds */
     PgStat_Counter f_self_time;
 } PgStat_StatFuncEntry;
 
-
-/*
- * Archiver statistics kept in the stats collector
- */
-typedef struct PgStat_ArchiverStats
-{
-    PgStat_Counter archived_count;    /* archival successes */
-    char        last_archived_wal[MAX_XFN_CHARS + 1];    /* last WAL file
-                                                         * archived */
-    TimestampTz last_archived_timestamp;    /* last archival success time */
-    PgStat_Counter failed_count;    /* failed archival attempts */
-    char        last_failed_wal[MAX_XFN_CHARS + 1]; /* WAL file involved in
-                                                     * last failure */
-    TimestampTz last_failed_timestamp;    /* last archival failure time */
-    TimestampTz stat_reset_timestamp;
-} PgStat_ArchiverStats;
-
-/*
- * Global statistics kept in the stats collector
- */
-typedef struct PgStat_GlobalStats
-{
-    TimestampTz stats_timestamp;    /* time of stats file update */
-    PgStat_Counter timed_checkpoints;
-    PgStat_Counter requested_checkpoints;
-    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
-    PgStat_Counter checkpoint_sync_time;
-    PgStat_Counter buf_written_checkpoints;
-    PgStat_Counter buf_written_clean;
-    PgStat_Counter maxwritten_clean;
-    PgStat_Counter buf_written_backend;
-    PgStat_Counter buf_fsync_backend;
-    PgStat_Counter buf_alloc;
-    TimestampTz stat_reset_timestamp;
-} PgStat_GlobalStats;
-
-/*
- * WAL statistics kept in the stats collector
- */
-typedef struct PgStat_WalStats
-{
-    PgStat_Counter wal_buffers_full;
-    TimestampTz stat_reset_timestamp;
-} PgStat_WalStats;
-
-/*
- * SLRU statistics kept in the stats collector
- */
-typedef struct PgStat_SLRUStats
-{
-    PgStat_Counter blocks_zeroed;
-    PgStat_Counter blocks_hit;
-    PgStat_Counter blocks_read;
-    PgStat_Counter blocks_written;
-    PgStat_Counter blocks_exists;
-    PgStat_Counter flush;
-    PgStat_Counter truncate;
-    TimestampTz stat_reset_timestamp;
-} PgStat_SLRUStats;
-
 /*
  * Replication slot statistics kept in the stats collector
  */
-typedef struct PgStat_ReplSlotStats
+typedef struct PgStat_ReplSlot
 {
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    bool        dropped;
     char        slotname[NAMEDATALEN];
     PgStat_Counter spill_txns;
     PgStat_Counter spill_count;
@@ -830,7 +417,7 @@ typedef struct PgStat_ReplSlotStats
     PgStat_Counter stream_count;
     PgStat_Counter stream_bytes;
     TimestampTz stat_reset_timestamp;
-} PgStat_ReplSlotStats;
+} PgStat_ReplSlot;
 
 /* ----------
  * Backend states
@@ -879,7 +466,7 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
+    WAIT_EVENT_READING_STATS_FILE,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
     WAIT_EVENT_WAL_RECEIVER_MAIN,
@@ -1131,7 +718,7 @@ typedef struct PgBackendGSSStatus
  *
  * Each live backend maintains a PgBackendStatus struct in shared memory
  * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
+ * BackendId, but that is not critical.)  Note that the stats-writing code
  * has no involvement in, or even access to, these structs.
  *
  * Each auxiliary process also maintains a PgBackendStatus struct in shared
@@ -1328,18 +915,26 @@ extern PGDLLIMPORT bool pgstat_track_counts;
 extern PGDLLIMPORT int pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used, but will be removed with GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
+
+/*
+ * CheckPointer statistics counters are updated directly by checkpointer and
+ * bufmgr
+ */
+extern PgStat_CheckPointer CheckPointerStats;
 
 /*
  * WAL statistics counter is updated by backends and background processes
  */
-extern PgStat_MsgWal WalStats;
+extern PgStat_Wal WalStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1354,33 +949,27 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
-
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 extern void pgstat_reset_slru_counter(const char *);
 extern void pgstat_reset_replslot_counter(const char *name);
 
+extern void pgstat_report_wal(void);
 extern void pgstat_report_autovac(Oid dboid);
 extern void pgstat_report_vacuum(Oid tableoid, bool shared,
                                  PgStat_Counter livetuples, PgStat_Counter deadtuples);
@@ -1420,6 +1009,7 @@ extern PgStat_TableStatus *find_tabstat_entry(Oid rel_id);
 extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 
 extern void pgstat_initstats(Relation rel);
+extern void pgstat_delinkstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
 
@@ -1542,9 +1132,10 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                                       void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
-extern void pgstat_send_wal(void);
+extern void pgstat_report_archiver(const char *xlog, bool failed);
+extern void pgstat_report_bgwriter(void);
+extern void pgstat_report_checkpointer(void);
+extern void pgstat_report_wal(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1553,15 +1144,20 @@ extern void pgstat_send_wal(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_extended(bool shared,
+                                                                Oid relid);
+extern void pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
-extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
-extern PgStat_GlobalStats *pgstat_fetch_global(void);
-extern PgStat_WalStats *pgstat_fetch_stat_wal(void);
+extern TimestampTz pgstat_get_stat_timestamp(void);
+extern PgStat_Archiver *pgstat_fetch_stat_archiver(void);
+extern PgStat_BgWriter *pgstat_fetch_stat_bgwriter(void);
+extern PgStat_CheckPointer *pgstat_fetch_stat_checkpointer(void);
+extern PgStat_Wal *pgstat_fetch_stat_wal(void);
 extern PgStat_SLRUStats *pgstat_fetch_slru(void);
-extern PgStat_ReplSlotStats *pgstat_fetch_replslot(int *nslots_p);
+extern PgStat_ReplSlot *pgstat_fetch_replslot(int *nslots_p);
 
 extern void pgstat_count_slru_page_zeroed(int slru_idx);
 extern void pgstat_count_slru_page_hit(int slru_idx);
@@ -1572,5 +1168,6 @@ extern void pgstat_count_slru_flush(int slru_idx);
 extern void pgstat_count_slru_truncate(int slru_idx);
 extern const char *pgstat_slru_name(int slru_idx);
 extern int    pgstat_slru_index(const char *name);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index af9b41795d..621e074111 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -218,6 +218,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SHARED_TIDBITMAP,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_PER_XACT_PREDICATE_LIST,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 04431d0eb2..3b03464a1a 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -88,7 +88,7 @@ enum config_group
     PROCESS_TITLE,
     STATS,
     STATS_MONITORING,
-    STATS_COLLECTOR,
+    STATS_ACTIVITY,
     AUTOVACUUM,
     CLIENT_CONN,
     CLIENT_CONN_STATEMENT,
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 83a15f6795..77d1572a99 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.18.4
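
Reader's note on the pgstat.h hunks above. Two comments in that header describe
small mechanisms that are easier to follow with a concrete sketch: the
PgStat_StatEntryHeader comment says an entry marked "dropped" should be removed
when its refcount goes to zero, and the PgStat_FunctionCounts comment says the
struct is memcmp'ed against zeroes to detect whether there is anything to
write. The following is only an illustration under stated assumptions, not the
patch's implementation: every Demo*/demo_* name is hypothetical, C11 atomics
stand in for pg_atomic_uint32, and the LWLock that would protect the "dropped"
flag in shared memory is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for PgStat_StatEntryHeader (LWLock omitted). */
typedef struct DemoEntryHeader
{
    bool        dropped;    /* entry is marked for removal */
    atomic_uint refcount;   /* number of holders referencing the entry */
} DemoEntryHeader;

static void
demo_entry_init(DemoEntryHeader *h)
{
    h->dropped = false;
    atomic_init(&h->refcount, 0);
}

/* Take one reference to the entry. */
static void
demo_entry_acquire(DemoEntryHeader *h)
{
    atomic_fetch_add(&h->refcount, 1);
}

/*
 * Give up one reference.  Returns true when the caller released the last
 * reference to an entry already marked as dropped, i.e. when the caller
 * would be the one to remove it.  Reading "dropped" without a lock is a
 * simplification that is only safe in this single-threaded sketch.
 */
static bool
demo_entry_release(DemoEntryHeader *h)
{
    unsigned int prev = atomic_fetch_sub(&h->refcount, 1);

    assert(prev > 0);       /* releasing an unreferenced entry is a bug */
    return h->dropped && prev == 1;
}

/* Hypothetical counters-only struct, shaped like PgStat_FunctionCounts. */
typedef struct DemoFunctionCounts
{
    int64_t     numcalls;
    int64_t     total_time;
    int64_t     self_time;
} DemoFunctionCounts;

/*
 * Report whether any counter is non-zero by comparing the whole struct with
 * an all-zero instance.  This works only while the struct holds nothing but
 * counters and has no padding bytes.
 */
static bool
demo_counts_pending(const DemoFunctionCounts *c)
{
    static const DemoFunctionCounts all_zeroes = {0};

    return memcmp(c, &all_zeroes, sizeof(*c)) != 0;
}
```

In this sketch a holder that gets true back from demo_entry_release() is the
one expected to free the entry; whether the patch splits responsibility the
same way is not something the header alone shows.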

From 3b33cb5ff31dcfbf6d4f98aa83539d63a88d1902 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:09 +0900
Subject: [PATCH v40 5/7] Doc part of shared-memory based stats collector.

---
 doc/src/sgml/catalogs.sgml          |   6 +-
 doc/src/sgml/config.sgml            |  34 ++++----
 doc/src/sgml/high-availability.sgml |  13 +--
 doc/src/sgml/monitoring.sgml        | 127 +++++++++++++---------------
 doc/src/sgml/ref/pg_dump.sgml       |   9 +-
 5 files changed, 90 insertions(+), 99 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 5fb9dca425..884dc12340 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -9211,9 +9211,9 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l
   <para>
    <xref linkend="view-table"/> lists the system views described here.
    More detailed documentation of each view follows below.
-   There are some additional views that provide access to the results of
-   the statistics collector; they are described in <xref
-   linkend="monitoring-stats-views-table"/>.
+   There are some additional views that provide access to the activity
+   statistics; they are described in
+   <xref linkend="monitoring-stats-views-table"/>.
   </para>
 
   <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f043433e31..b98d47189f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7276,11 +7276,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
     <title>Run-time Statistics</title>
 
     <sect2 id="runtime-config-statistics-collector">
-     <title>Query and Index Statistics Collector</title>
+     <title>Query and Index Activity Statistics</title>
 
      <para>
-      These parameters control server-wide statistics collection features.
-      When statistics collection is enabled, the data that is produced can be
+      These parameters control server-wide activity statistics features.
+      When activity tracking is enabled, the data that is produced can be
       accessed via the <structname>pg_stat</structname> and
       <structname>pg_statio</structname> family of system views.
       Refer to <xref linkend="monitoring"/> for more information.
@@ -7296,14 +7296,13 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables the collection of information on the currently
-        executing command of each session, along with the time when
-        that command began execution. This parameter is on by
-        default. Note that even when enabled, this information is not
-        visible to all users, only to superusers and the user owning
-        the session being reported on, so it should not represent a
-        security risk.
-        Only superusers can change this setting.
+        Enables activity tracking on the currently executing command of
+        each session, along with the time when that command began
+        execution. This parameter is on by default. Note that even when
+        enabled, this information is not visible to all users, only to
+        superusers and the user owning the session being reported on, so it
+        should not represent a security risk.  Only superusers can change this
+        setting.
        </para>
       </listitem>
      </varlistentry>
@@ -7334,9 +7333,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables collection of statistics on database activity.
+        Enables tracking of database activity.
         This parameter is on by default, because the autovacuum
-        daemon needs the collected information.
+        daemon needs the activity information.
         Only superusers can change this setting.
        </para>
       </listitem>
@@ -8397,7 +8396,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Specifies the fraction of the total number of heap tuples counted in
-        the previous statistics collection that can be inserted without
+        the previously collected statistics that can be inserted without
         incurring an index scan at the <command>VACUUM</command> cleanup stage.
         This setting currently applies to B-tree indexes only.
        </para>
@@ -8409,9 +8408,10 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         the index contains deleted pages that can be recycled during cleanup.
         Index statistics are considered to be stale if the number of newly
         inserted tuples exceeds the <varname>vacuum_cleanup_index_scale_factor</varname>
-        fraction of the total number of heap tuples detected by the previous
-        statistics collection. The total number of heap tuples is stored in
-        the index meta-page. Note that the meta-page does not include this data
+
+        fraction of the total number of heap tuples in the previously
+        collected statistics. The total number of heap tuples is stored in the
+        index meta-page. Note that the meta-page does not include this data
         until <command>VACUUM</command> finds no dead tuples, so B-tree index
         scan at the cleanup stage can only be skipped if the second and
         subsequent <command>VACUUM</command> cycles detect no dead tuples.
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 19d7bd2b28..3a1dc17057 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2329,12 +2329,13 @@ LOG:  database system is ready to accept read only connections
    </para>
 
    <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
-    index usage, etc., will be recorded normally on the standby. Replayed
-    actions will not duplicate their effects on primary, so replaying an
-    insert will not increment the Inserts column of pg_stat_user_tables.
-    The stats file is deleted at the start of recovery, so stats from primary
-    and standby will differ; this is considered a feature, not a bug.
+    Activity statistics are collected during recovery. All scans, reads,
+    blocks, index usage, etc., will be recorded normally on the
+    standby. Replayed actions will not duplicate their effects on the primary,
+    so replaying an insert will not increment the Inserts column of
+    pg_stat_user_tables.  Activity statistics are reset at the start of
+    recovery, so stats from primary and standby will differ; this is
+    considered a feature, not a bug.
    </para>
 
    <para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 98e1995453..4c1880bbc0 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -22,7 +22,7 @@
   <para>
    Several tools are available for monitoring database activity and
    analyzing performance.  Most of this chapter is devoted to describing
-   <productname>PostgreSQL</productname>'s statistics collector,
+   <productname>PostgreSQL</productname>'s activity statistics,
    but one should not neglect regular Unix monitoring programs such as
    <command>ps</command>, <command>top</command>, <command>iostat</command>, and <command>vmstat</command>.
    Also, once one has identified a
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    primary server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   primary process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   primary process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
@@ -130,20 +128,21 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
  </sect1>
 
  <sect1 id="monitoring-stats">
-  <title>The Statistics Collector</title>
+  <title>The Activity Statistics</title>
 
   <indexterm zone="monitoring-stats">
    <primary>statistics</primary>
   </indexterm>
 
   <para>
-   <productname>PostgreSQL</productname>'s <firstterm>statistics collector</firstterm>
-   is a subsystem that supports collection and reporting of information about
-   server activity.  Presently, the collector can count accesses to tables
-   and indexes in both disk-block and individual-row terms.  It also tracks
-   the total number of rows in each table, and information about vacuum and
-   analyze actions for each table.  It can also count calls to user-defined
-   functions and the total time spent in each one.
+   <productname>PostgreSQL</productname>'s <firstterm>activity
+   statistics</firstterm> is a subsystem that supports tracking and reporting
+   of information about server activity.  Presently, the activity statistics
+   tracks the count of accesses to tables and indexes in both disk-block and
+   individual-row terms.  It also tracks the total number of rows in each
+   table, and information about vacuum and analyze actions for each table.  It
+   can also track calls to user-defined functions and the total time spent in
+   each one.
   </para>
 
   <para>
@@ -151,15 +150,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    information about exactly what is going on in the system right now, such as
    the exact command currently being executed by other server processes, and
    which other connections exist in the system.  This facility is independent
-   of the collector process.
+   of the activity statistics.
   </para>
 
  <sect2 id="monitoring-stats-setup">
-  <title>Statistics Collection Configuration</title>
+  <title>Activity Statistics Configuration</title>
 
   <para>
-   Since collection of statistics adds some overhead to query execution,
-   the system can be configured to collect or not collect information.
+   Since tracking for the activity statistics adds some overhead to query
+   execution, the system can be configured to track or not track activity.
    This is controlled by configuration parameters that are normally set in
    <filename>postgresql.conf</filename>.  (See <xref linkend="runtime-config"/> for
    details about setting configuration parameters.)
@@ -172,7 +171,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The parameter <xref linkend="guc-track-counts"/> controls whether
-   statistics are collected about table and index accesses.
+   to track activity about table and index accesses.
   </para>
 
   <para>
@@ -196,18 +195,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
   </para>
 
   <para>
-   The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
-   When the server shuts down cleanly, a permanent copy of the statistics
-   data is stored in the <filename>pg_stat</filename> subdirectory, so that
-   statistics can be retained across server restarts.  When recovery is
-   performed at server start (e.g., after immediate shutdown, server crash,
-   and point-in-time recovery), all statistics counters are reset.
+   When the server shuts down cleanly, a permanent copy of the statistics data
+   is stored in the <filename>pg_stat</filename> subdirectory, so that
+   statistics can be retained across server restarts.  When recovery is
+   performed at server start (e.g. after immediate shutdown, server crash, and
+   point-in-time recovery), all statistics counters are reset.
   </para>
 
  </sect2>
@@ -220,48 +212,46 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    linkend="monitoring-stats-dynamic-views-table"/>, are available to show
    the current state of the system. There are also several other
    views, listed in <xref
-   linkend="monitoring-stats-views-table"/>, available to show the results
-   of statistics collection.  Alternatively, one can
-   build custom views using the underlying statistics functions, as discussed
-   in <xref linkend="monitoring-stats-functions"/>.
+   linkend="monitoring-stats-views-table"/>, available to show the activity
+   statistics.  Alternatively, one can build custom views using the underlying
+   statistics functions, as discussed in
+   <xref linkend="monitoring-stats-functions"/>.
   </para>
 
   <para>
-   When using the statistics to monitor collected data, it is important
-   to realize that the information does not update instantaneously.
-   Each individual server process transmits new statistical counts to
-   the collector just before going idle; so a query or transaction still in
-   progress does not affect the displayed totals.  Also, the collector itself
-   emits a new report at most once per <varname>PGSTAT_STAT_INTERVAL</varname>
-   milliseconds (500 ms unless altered while building the server).  So the
-   displayed information lags behind actual activity.  However, current-query
-   information collected by <varname>track_activities</varname> is
-   always up-to-date.
+   When using the activity statistics, it is important to realize that the
+   information does not update instantaneously.  Each individual server
+   process writes out new statistical counts just before going idle, but no
+   more often than once per <varname>PGSTAT_STAT_INTERVAL</varname>
+   milliseconds (1 second unless altered while building the server); so a
+   query or transaction still in progress does not affect the displayed
+   totals.  However, current-query information tracked by
+   <varname>track_activities</varname> is always up-to-date.
   </para>
 
   <para>
    Another important point is that when a server process is asked to display
-   any of these statistics, it first fetches the most recent report emitted by
-   the collector process and then continues to use this snapshot for all
-   statistical views and functions until the end of its current transaction.
-   So the statistics will show static information as long as you continue the
-   current transaction.  Similarly, information about the current queries of
-   all sessions is collected when any such information is first requested
-   within a transaction, and the same information will be displayed throughout
-   the transaction.
-   This is a feature, not a bug, because it allows you to perform several
-   queries on the statistics and correlate the results without worrying that
-   the numbers are changing underneath you.  But if you want to see new
-   results with each query, be sure to do the queries outside any transaction
-   block.  Alternatively, you can invoke
+   any of these statistics, it first reads the current statistics and then
+   continues to use this snapshot for all statistical views and functions
+   until the end of its current transaction.  So the statistics will show
+   static information as long as you continue the current transaction.
+   Similarly, information about the current queries of all sessions is tracked
+   when any such information is first requested within a transaction, and the
+   same information will be displayed throughout the transaction.  This is a
+   feature, not a bug, because it allows you to perform several queries on the
+   statistics and correlate the results without worrying that the numbers are
+   changing underneath you.  But if you want to see new results with each
+   query, be sure to do the queries outside any transaction block.
+   Alternatively, you can invoke
    <function>pg_stat_clear_snapshot</function>(), which will discard the
    current transaction's statistics snapshot (if any).  The next use of
    statistical information will cause a new snapshot to be fetched.
   </para>
-
+  
   <para>
-   A transaction can also see its own statistics (as yet untransmitted to the
-   collector) in the views <structname>pg_stat_xact_all_tables</structname>,
+   A transaction can also see its own statistics (as yet unwritten to the
+   server-wide activity statistics) in the
+   views <structname>pg_stat_xact_all_tables</structname>,
    <structname>pg_stat_xact_sys_tables</structname>,
    <structname>pg_stat_xact_user_tables</structname>, and
    <structname>pg_stat_xact_user_functions</structname>.  These numbers do not act as
@@ -637,7 +627,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    kernel's I/O cache, and might therefore still be fetched without
    requiring a physical read. Users interested in obtaining more
    detailed information on <productname>PostgreSQL</productname> I/O behavior are
-   advised to use the <productname>PostgreSQL</productname> statistics collector
+   advised to use the <productname>PostgreSQL</productname> activity statistics
    in combination with operating system utilities that allow insight
    into the kernel's handling of I/O.
   </para>
@@ -1074,10 +1064,6 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>LogicalLauncherMain</literal></entry>
       <entry>Waiting in main loop of logical replication launcher process.</entry>
      </row>
-     <row>
-      <entry><literal>PgStatMain</literal></entry>
-      <entry>Waiting in main loop of statistics collector process.</entry>
-     </row>
      <row>
       <entry><literal>RecoveryWalStream</literal></entry>
       <entry>Waiting in main loop of startup process for WAL to arrive, during
@@ -1832,6 +1818,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
     </thead>
 
     <tbody>
+     <row>
+      <entry><literal>ActivityStatistics</literal></entry>
+      <entry>Waiting to write out activity statistics to shared memory.</entry>
+     </row>
      <row>
       <entry><literal>AddinShmemInit</literal></entry>
       <entry>Waiting to manage an extension's space allocation in shared
@@ -5950,9 +5940,10 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
      <entry><literal>performing final cleanup</literal></entry>
      <entry>
        <command>VACUUM</command> is performing final cleanup.  During this phase,
-       <command>VACUUM</command> will vacuum the free space map, update statistics
-       in <literal>pg_class</literal>, and report statistics to the statistics
-       collector.  When this phase is completed, <command>VACUUM</command> will end.
+       <command>VACUUM</command> will vacuum the free space map, update
+       statistics in <literal>pg_class</literal>, and update the system-wide
+       activity statistics.  When this phase is completed,
+       <command>VACUUM</command> will end.
      </entry>
     </row>
    </tbody>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 0aa35cf0c3..ad105cb2a6 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1280,11 +1280,10 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The database activity of <application>pg_dump</application> is
-   normally collected by the statistics collector.  If this is
-   undesirable, you can set parameter <varname>track_counts</varname>
-   to false via <envar>PGOPTIONS</envar> or the <literal>ALTER
-   USER</literal> command.
+   The database activity of <application>pg_dump</application> is normally
+   collected.  If this is undesirable, you can set
+   parameter <varname>track_counts</varname> to false
+   via <envar>PGOPTIONS</envar> or the <literal>ALTER USER</literal> command.
   </para>
 
  </refsect1>
-- 
2.18.4

From 196d867aafb3459f6b85730affec0f45783174d8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 29 Sep 2020 22:59:33 +0900
Subject: [PATCH v40 6/7] Remove the GUC stats_temp_directory

The new stats collection system doesn't need a temporary directory, so
just remove it. pg_stat_statements is modified to use the pg_stat directory
to store its temporary files.  As a result, basebackup copies
pg_stat_statements' temporary file if it exists.
---
 .../pg_stat_statements/pg_stat_statements.c   | 13 +++---
 doc/src/sgml/backup.sgml                      |  2 -
 doc/src/sgml/config.sgml                      | 19 --------
 doc/src/sgml/storage.sgml                     |  6 ---
 src/backend/postmaster/pgstat.c               | 10 -----
 src/backend/replication/basebackup.c          | 36 ----------------
 src/backend/utils/misc/guc.c                  | 43 -------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/bin/initdb/initdb.c                       |  1 -
 src/bin/pg_rewind/filemap.c                   |  7 ---
 src/include/pgstat.h                          |  3 --
 src/test/perl/PostgresNode.pm                 |  4 --
 12 files changed, 6 insertions(+), 139 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 1eac9edaee..5eaceb60a7 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -88,14 +88,13 @@ PG_MODULE_MAGIC;
 #define PGSS_DUMP_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pg_stat_statements.stat"
 
 /*
- * Location of external query text file.  We don't keep it in the core
- * system's stats_temp_directory.  The core system can safely use that GUC
- * setting, because the statistics collector temp file paths are set only once
- * as part of changing the GUC, but pg_stat_statements has no way of avoiding
- * race conditions.  Besides, we only expect modest, infrequent I/O for query
- * strings, so placing the file on a faster filesystem is not compelling.
+ * Location of external query text file.  We don't keep it in the core system's
+ * pg_stats.  pg_stat_statements has no way of avoiding race conditions even if
+ * the directory were specified by a GUC.  Besides, we only expect modest,
+ * infrequent I/O for query strings, so placing the file on a faster filesystem
+ * is not compelling.
  */
-#define PGSS_TEXT_FILE    PG_STAT_TMP_DIR "/pgss_query_texts.stat"
+#define PGSS_TEXT_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pgss_query_texts.stat"
 
 /* Magic number identifying the stats file format */
 static const uint32 PGSS_FILE_HEADER = 0x20171004;
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 42a8ed328d..dd3d8892d8 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1146,8 +1146,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b98d47189f..5082298919 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7389,25 +7389,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 3234adb639..6bac5e075e 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -120,12 +120,6 @@ Item
   subsystem</entry>
 </row>
 
-<row>
- <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
-</row>
-
 <row>
  <entry><filename>pg_subtrans</filename></entry>
  <entry>Subdirectory containing subtransaction status data</entry>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index a72fc7dddd..634f72fb6a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -99,16 +99,6 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
- */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
-
 /*
  * List of SLRU names that we keep stats for.  There is no central registry of
  * SLRUs, so we use this fixed list instead.  The "other" entry is used for
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 57531d7d48..25eabbb1ad 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -87,9 +87,6 @@ static int    basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
 /* Was the backup currently in-progress initiated in recovery mode? */
 static bool backup_started_in_recovery = false;
 
-/* Relative path of temporary statistics directory */
-static char *statrelpath = NULL;
-
 /*
  * Size of each block sent into the tar stream for larger files.
  */
@@ -152,13 +149,6 @@ struct exclude_list_item
  */
 static const char *const excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    PG_STAT_TMP_DIR,
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
@@ -261,7 +251,6 @@ perform_base_backup(basebackup_options *opt)
     StringInfo    labelfile;
     StringInfo    tblspc_map_file;
     backup_manifest_info manifest;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
     backup_total = 0;
@@ -284,8 +273,6 @@ perform_base_backup(basebackup_options *opt)
     Assert(CurrentResourceOwner == NULL);
     CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -314,18 +301,6 @@ perform_base_backup(basebackup_options *opt)
         tablespaceinfo *ti;
         int            tblspc_streamed = 0;
 
-        /*
-         * Calculate the relative path of temporary statistics directory in
-         * order to skip the files which are located in that directory later.
-         */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
-
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
         ti->size = -1;
@@ -1365,17 +1340,6 @@ sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces,
         if (excludeFound)
             continue;
 
-        /*
-         * Exclude contents of directory specified by statrelpath if not set
-         * to the default (pg_stat_tmp) which is caught in the loop above.
-         */
-        if (statrelpath != NULL && strcmp(pathbuf, statrelpath) == 0)
-        {
-            elog(DEBUG1, "contents of directory \"%s\" excluded from backup", statrelpath);
-            size += _tarWriteDir(pathbuf, basepathlen, &statbuf, sizeonly);
-            continue;
-        }
-
         /*
          * We can skip pg_wal, the WAL segments need to be fetched from the
          * WAL archive anyway. But include it as an empty directory anyway, so
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index d46ae0ca8d..ca6340dcb5 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -202,7 +202,6 @@ static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource sourc
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_huge_page_size(int *newval, void **extra, GucSource source);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -558,8 +557,6 @@ char       *HbaFileName;
 char       *IdentFileName;
 char       *external_pid_file;
 
-char       *pgstat_temp_directory;
-
 char       *application_name;
 
 int            tcp_keepalives_idle;
@@ -4309,17 +4306,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_PRIMARY,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11636,35 +11622,6 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
     return true;
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 668a2d033a..7183c08305 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -586,7 +586,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index ee3bfa82f4..5b7eb30f14 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -217,7 +217,6 @@ static const char *const subdirs[] = {
     "pg_replslot",
     "pg_tblspc",
     "pg_stat",
-    "pg_stat_tmp",
     "pg_xact",
     "pg_logical",
     "pg_logical/snapshots",
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 1abc257177..d2192429bc 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -53,13 +53,6 @@ struct exclude_list_item
  */
 static const char *excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    "pg_stat_tmp",                /* defined as PG_STAT_TMP_DIR */
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index d5722a2e87..feece220db 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -32,9 +32,6 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/saved_stats"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/saved_stats.tmp"
 
-/* Default directory to store temporary statistics data in */
-#define PG_STAT_TMP_DIR        "pg_stat_tmp"
-
 /* Values for track_functions GUC variable --- order is significant! */
 typedef enum TrackFunctionsLevel
 {
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index ebcaeb44fe..8772fcc970 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -455,10 +455,6 @@ sub init
     print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
       if defined $ENV{TEMP_CONFIG};
 
-    # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
-    # concurrently must not share a stats_temp_directory.
-    print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
     if ($params{allows_streaming})
     {
         if ($params{allows_streaming} eq "logical")
-- 
2.18.4

From d81b25fab373a0cbb0bba620c6351e0cea5a463c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 29 Sep 2020 23:19:58 +0900
Subject: [PATCH v40 7/7] Exclude pg_stat directory from base backup

basebackup sends the contents of the pg_stat directory, which are doomed
to be removed at startup from the backup. Now that pg_stat_statements
saves a temporary file into the directory, let's exclude the pg_stat
directory from base backups.
---
 src/backend/replication/basebackup.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 25eabbb1ad..dd35920e82 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -149,6 +149,13 @@ struct exclude_list_item
  */
 static const char *const excludeDirContents[] =
 {
+    /*
+     * Skip statistics files. PGSTAT_STAT_PERMANENT_DIRECTORY must be skipped
+     * because the files in the directory will be removed at startup from the
+     * backup.
+     */
+    PGSTAT_STAT_PERMANENT_DIRECTORY,
+
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
-- 
2.18.4


Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
At Wed, 04 Nov 2020 17:39:10 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Fri, 30 Oct 2020 15:00:55 +0000, Georgios Kokolatos <gkokolatos@protonmail.com> wrote in 
> > Hi,
> > 
> > I noticed that according to the cfbot this patch no longer applies.
> > 
> > As it is registered in the upcoming commitfest, it would be appreciated
> > if you could rebase it.
> 
> Thanks!  The replication slot stats patch (9868167500) hit this.
> 
> - Fixed a bug in the original code.
> 
>  get_stat_entry() returned a wrong result in "found" when a shared
>  entry exists but is not locally cached.
> 
> - Moved replication slot stats into shared memory stats.
> 
>  Unlike wal_stats and slru_stats, it can be implemented as a
>  part of the unified stats entry.  I'm tempted to remove the entry for
>  a dropped slot immediately, but I didn't do that since the number of
>  slots should be under 10 or so and dropping an entry requires an
>  exclusive lock on dshash.  Instead, dropped entries are removed at
>  file-write time, which happens only at the end of a process.
> 
> 
> I had to clean up replication slots in pgstat_beshutdown_hook(). Even
> though we have exactly the same code in several other places, the
> function must be called before disabling DSA because we cannot update
> statistics after detaching the shared-memory stats.  Perhaps we can
> remove some of the existing calls to ReplicationSlotCleanup() but I
> haven't done that in this version.

Fixed a bug where pgstat_report_replslot() failed to reuse entries that
are marked as "dropped".

Fixed comments at the call sites of pgstat_report_replslot(_drop)
in ReplicationSlotCreate() and ReplicationSlotDropPtr().
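
For anyone reviewing patch 1/7 of the attached series: the new
PARTITION_FOR_BUCKET_INDEX macro is meant to be the inverse of the existing
BUCKET_INDEX_FOR_PARTITION. The arithmetic can be checked in isolation with
a small stand-alone sketch like the one below. It is only an illustration:
DSHASH_NUM_PARTITIONS_LOG2 = 7 is an assumption about dshash.c that should be
checked against your tree, NUM_SPLITS has its argument parenthesized here for
safety, and partition_mapping_is_consistent() is a hypothetical helper that
is not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: dshash.c uses 2^7 = 128 partitions; adjust if your tree differs. */
#define DSHASH_NUM_PARTITIONS_LOG2 7

/* Macros as they appear in the patch context (NUM_SPLITS argument parenthesized). */
#define NUM_SPLITS(size_log2) \
    ((size_log2) - DSHASH_NUM_PARTITIONS_LOG2)
#define NUM_BUCKETS(size_log2) \
    (((size_t) 1) << (size_log2))
#define BUCKETS_PER_PARTITION(size_log2) \
    (((size_t) 1) << NUM_SPLITS(size_log2))
#define BUCKET_INDEX_FOR_PARTITION(partition, size_log2) \
    ((partition) << NUM_SPLITS(size_log2))
#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2) \
    ((bucket_idx) >> NUM_SPLITS(size_log2))

/*
 * Hypothetical helper (not in the patch): returns 1 if every bucket of a
 * table with 2^size_log2 buckets maps to the partition whose bucket range
 * contains it, and 0 otherwise.
 */
static int
partition_mapping_is_consistent(int size_log2)
{
    size_t      nbuckets = NUM_BUCKETS(size_log2);
    size_t      per_part = BUCKETS_PER_PARTITION(size_log2);
    size_t      bucket;

    for (bucket = 0; bucket < nbuckets; bucket++)
    {
        size_t      part = PARTITION_FOR_BUCKET_INDEX(bucket, size_log2);
        size_t      first = BUCKET_INDEX_FOR_PARTITION(part, size_log2);

        /* the bucket must lie in [first, first + per_part) */
        if (bucket < first || bucket >= first + per_part)
            return 0;
    }
    return 1;
}
```

With size_log2 = 9 under that assumption there are 512 buckets, 4 per
partition, so buckets 4..7 belong to partition 1.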

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From f98df0a7238f2da4713a50c590bace42a1c917d6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:03 +0900
Subject: [PATCH v41 1/7] sequential scan for dshash

Dshash did not allow scanning all entries sequentially. This adds that
functionality. The interface is similar to, but slightly different from,
both dynahash's and the simple dshash search functions. One of the most
significant differences is that the sequential scan interface of dshash
always needs a call to dshash_seq_term when a scan ends. Another is
locking. Dshash holds the partition lock when returning an entry;
dshash_seq_next() also holds the lock when returning an entry, but callers
shouldn't release it, since the lock is essential to continue the
scan. The seqscan interface allows entry deletion during a scan. In-scan
deletion should be performed with dshash_delete_current().
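
The calling protocol described above (init, repeated next, and a mandatory
term, with the scan itself owning the partition lock) can be illustrated
with a toy model. This is a sketch under stated assumptions, not the dshash
code: every toy_* name is hypothetical, plain int flags stand in for the
partition LWLocks, the table never resizes, and in-scan deletion is not
modeled.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NPARTITIONS      4  /* hypothetical sizes for the toy model */
#define TOY_BUCKETS_PER_PART 2
#define TOY_NBUCKETS (TOY_NPARTITIONS * TOY_BUCKETS_PER_PART)

typedef struct toy_item
{
    int         key;
    struct toy_item *next;      /* bucket chain */
} toy_item;

typedef struct toy_table
{
    toy_item   *buckets[TOY_NBUCKETS];
    int         lock_held[TOY_NPARTITIONS]; /* stand-in for partition locks */
} toy_table;

typedef struct toy_seq_status
{
    toy_table  *table;
    int         curbucket;
    int         curpartition;   /* -1 while the scan holds no lock */
    toy_item   *nextitem;
} toy_seq_status;

static void
toy_insert(toy_table *t, toy_item *item)
{
    int         b = item->key % TOY_NBUCKETS;   /* keys assumed non-negative */

    item->next = t->buckets[b];
    t->buckets[b] = item;
}

static int
toy_locks_held(const toy_table *t)
{
    int         i, n = 0;

    for (i = 0; i < TOY_NPARTITIONS; i++)
        n += t->lock_held[i];
    return n;
}

static void
toy_seq_init(toy_seq_status *st, toy_table *t)
{
    st->table = t;
    st->curbucket = 0;
    st->curpartition = -1;
    st->nextitem = NULL;
}

/* Returns the next item, or NULL at the end; the caller never unlocks. */
static toy_item *
toy_seq_next(toy_seq_status *st)
{
    toy_item   *item;

    if (st->curpartition == -1)
    {
        /* first call: take the lock of partition 0, start at bucket 0 */
        st->table->lock_held[0] = 1;
        st->curpartition = 0;
        st->nextitem = st->table->buckets[0];
    }
    while (st->nextitem == NULL)
    {
        int         next_partition;

        if (++st->curbucket >= TOY_NBUCKETS)
            return NULL;        /* lock stays held until toy_seq_term() */
        next_partition = st->curbucket / TOY_BUCKETS_PER_PART;
        if (next_partition != st->curpartition)
        {
            /* hand the lock over to the partition of the new bucket */
            st->table->lock_held[next_partition] = 1;
            st->table->lock_held[st->curpartition] = 0;
            st->curpartition = next_partition;
        }
        st->nextitem = st->table->buckets[st->curbucket];
    }
    item = st->nextitem;
    st->nextitem = item->next;
    return item;
}

/* Must always be called when a scan ends: releases the remaining lock. */
static void
toy_seq_term(toy_seq_status *st)
{
    if (st->curpartition >= 0)
        st->table->lock_held[st->curpartition] = 0;
    st->curpartition = -1;
}
```

A caller loops on toy_seq_next() until it returns NULL and then calls
toy_seq_term(). In this model, skipping the final call leaves the last
partition's flag set, which is the toy equivalent of leaking a partition
lock.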
---
 src/backend/lib/dshash.c | 160 ++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  22 ++++++
 2 files changed, 181 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 78ccf03217..b829167872 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -127,6 +127,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +157,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -324,7 +332,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -592,6 +600,156 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term must always be called when a scan finishes.
+ * The caller may delete returned elements in the middle of a scan by using
+ * dshash_delete_current(). exclusive must be true to delete elements.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool exclusive)
+{
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->exclusive = exclusive;
+}
+
+/*
+ * Returns the next element.
+ *
+ * Returned elements are locked; the caller must not release the lock itself.
+ * A later call to dshash_seq_next() or dshash_seq_term() releases it.
+ */
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert(status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+        LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                      status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        status->curpartition = partition;
+
+        /* resize doesn't happen from now until seq scan ends */
+        status->nbuckets =
+            NUM_BUCKETS(status->hash_table->control->size_log2);
+        ensure_valid_bucket_pointers(status->hash_table);
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(status->hash_table,
+                                               status->curpartition),
+                                status->exclusive ? LW_EXCLUSIVE : LW_SHARED));
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        int next_partition;
+
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            return NULL;
+        }
+
+        /* Check whether we have moved into the next partition */
+        next_partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+
+        if (status->curpartition != next_partition)
+        {
+            /*
+             * Move to the next partition. Lock the next partition before
+             * releasing the current one, not in the reverse order, to avoid
+             * concurrent resizing.  Take locks in the same order as
+             * resize() to avoid deadlock.
+             */
+            LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                         next_partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                         status->curpartition));
+            status->curpartition = next_partition;
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * The caller may delete the item. Store the next item in case of deletion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+/*
+ * Terminates the seqscan and releases all locks.
+ *
+ * Should always be called when finishing or exiting a seqscan.
+ */
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+
+    if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+/* Remove the current entry during a seq scan. */
+void
+dshash_delete_current(dshash_seq_status *status)
+{
+    dshash_table       *hash_table    = status->hash_table;
+    dshash_table_item  *item        = status->curitem;
+    size_t                partition    = PARTITION_FOR_HASH(item->hash);
+
+    Assert(status->exclusive);
+    Assert(hash_table->control->magic == DSHASH_MAGIC);
+    Assert(hash_table->find_locked);
+    Assert(hash_table->find_exclusively_locked);
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition),
+                                LW_EXCLUSIVE));
+
+    delete_item(hash_table, item);
+}
+
+/* Get the current entry during a seq scan. */
+void *
+dshash_get_current(dshash_seq_status *status)
+{
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b86df68e77..c337099061 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,21 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state. The detail is exposed to let users know the storage
+ * size but it should be considered as an opaque type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;    /* dshash table working on */
+    int                    curbucket;    /* bucket number we are at */
+    int                    nbuckets;    /* total number of buckets in the dshash */
+    dshash_table_item  *curitem;    /* item we are currently at */
+    dsa_pointer            pnextitem;    /* dsa-pointer to the next item */
+    int                    curpartition;    /* partition number we are at */
+    bool                exclusive;    /* locking mode */
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
                                    const dshash_parameters *params,
@@ -80,6 +95,13 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
+extern void dshash_delete_current(dshash_seq_status *status);
+extern void *dshash_get_current(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.18.4

From d421a7619d1e9adb18d394aa96c6143f6f0e5cea Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:57 +0900
Subject: [PATCH v41 2/7] Add conditional lock feature to dshash

Dshash currently waits for a lock unconditionally. This is inconvenient
when we want to avoid being blocked by other processes. This commit
adds dshash_find_extended, an alternative to dshash_find and
dshash_find_or_insert that allows immediate return on lock failure.
---
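The "nowait" behaviour can be illustrated with a stand-alone model. The
sketch below is not dshash: the toy_* names are hypothetical, there is a
single lock modelled as a flag, and the table is a fixed array. It only
shows the return contract: NULL with no lock held when the lock is busy
or the key is absent, and an entry with the lock held otherwise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_CAPACITY 8

typedef struct toy_entry { int key; int value; bool used; } toy_entry;

typedef struct toy_table
{
    toy_entry   entries[TOY_CAPACITY];
    bool        locked;     /* models the partition lock */
} toy_table;

/* Models a conditional acquire: never waits, reports success or failure. */
static bool toy_try_lock(toy_table *t)
{
    if (t->locked)
        return false;
    t->locked = true;
    return true;
}

void toy_release_lock(toy_table *t)
{
    assert(t->locked);
    t->locked = false;
}

/*
 * Returns the entry with the lock held, or NULL with no lock held.
 * Without "insert", NULL means either "not found" or "lock busy"; with
 * "insert", NULL can only mean that the lock was busy.
 */
toy_entry *toy_find_extended(toy_table *t, int key,
                             bool nowait, bool insert, bool *found)
{
    toy_entry  *freeslot = NULL;
    int         i;

    assert(!insert || found != NULL);

    if (!toy_try_lock(t))
    {
        /* A blocking wait cannot be modelled in a single process. */
        assert(nowait);
        return NULL;
    }

    for (i = 0; i < TOY_CAPACITY; i++)
    {
        toy_entry  *e = &t->entries[i];

        if (e->used && e->key == key)
        {
            if (found)
                *found = true;
            return e;           /* caller releases the lock */
        }
        if (!e->used && freeslot == NULL)
            freeslot = e;
    }

    if (found)
        *found = false;
    if (!insert)
    {
        toy_release_lock(t);
        return NULL;
    }

    assert(freeslot != NULL);   /* the toy table never grows */
    freeslot->used = true;
    freeslot->key = key;
    freeslot->value = 0;
    return freeslot;            /* caller releases the lock */
}
```

One consequence visible in the model: on lock failure *found is left
untouched, so a caller that passes nowait without insert cannot tell
"not found" from "lock busy" by the return value alone.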
 src/backend/lib/dshash.c | 98 +++++++++++++++++++++-------------------
 src/include/lib/dshash.h |  3 ++
 2 files changed, 55 insertions(+), 46 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index b829167872..9c90096f3d 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -383,6 +383,10 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  * the caller must take care to ensure that the entry is not left corrupted.
  * The lock mode is either shared or exclusive depending on 'exclusive'.
  *
+ * If found is not NULL, *found is set to true if the key is found in the hash
+ * table. If the key is not found, *found is set to false and a pointer to a
+ * newly created entry is returned.
+ *
  * The caller must not lock a lock already.
  *
  * Note that the lock held is in fact an LWLock, so interrupts will be held on
@@ -392,36 +396,7 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
 {
-    dshash_hash hash;
-    size_t        partition;
-    dshash_table_item *item;
-
-    hash = hash_key(hash_table, key);
-    partition = PARTITION_FOR_HASH(hash);
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
-
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
-    ensure_valid_bucket_pointers(hash_table);
-
-    /* Search the active bucket. */
-    item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
-
-    if (!item)
-    {
-        /* Not found. */
-        LWLockRelease(PARTITION_LOCK(hash_table, partition));
-        return NULL;
-    }
-    else
-    {
-        /* The caller will free the lock by calling dshash_release_lock. */
-        hash_table->find_locked = true;
-        hash_table->find_exclusively_locked = exclusive;
-        return ENTRY_FROM_ITEM(item);
-    }
+    return dshash_find_extended(hash_table, key, exclusive, false, false, NULL);
 }
 
 /*
@@ -439,31 +414,60 @@ dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
 {
-    dshash_hash hash;
-    size_t        partition_index;
-    dshash_partition *partition;
+    return dshash_find_extended(hash_table, key, true, false, true, found);
+}
+
+
+/*
+ * Find the key in the hash table.
+ *
+ * "exclusive" is the lock mode in which the partition for the returned item
+ * is locked.  If "nowait" is true, the function returns NULL immediately if
+ * the required lock cannot be acquired.  "insert" indicates insert mode: if
+ * the key is not found, a new entry is inserted and *found is set to false;
+ * otherwise *found is set to true.  "found" must be non-null in this mode.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool insert, bool *found)
+{
+    dshash_hash hash = hash_key(hash_table, key);
+    size_t        partidx = PARTITION_FOR_HASH(hash);
+    dshash_partition *partition = &hash_table->control->partitions[partidx];
+    LWLockMode  lockmode = exclusive ? LW_EXCLUSIVE : LW_SHARED;
     dshash_table_item *item;
 
-    hash = hash_key(hash_table, key);
-    partition_index = PARTITION_FOR_HASH(hash);
-    partition = &hash_table->control->partitions[partition_index];
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
+    /* must be exclusive when insert allowed */
+    Assert(!insert || (exclusive && found != NULL));
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (!nowait)
+        LWLockAcquire(PARTITION_LOCK(hash_table, partidx), lockmode);
+    else if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partidx),
+                                       lockmode))
+        return NULL;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
     item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
 
     if (item)
-        *found = true;
+    {
+        if (found)
+            *found = true;
+    }
     else
     {
-        *found = false;
+        if (found)
+            *found = false;
+
+        if (!insert)
+        {
+            /* The caller didn't ask to add a new entry. */
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+            return NULL;
+        }
 
         /* Check if we are getting too full. */
         if (partition->count > MAX_COUNT_PER_PARTITION(hash_table))
@@ -479,7 +483,8 @@ restart:
              * Give up our existing lock first, because resizing needs to
              * reacquire all the locks in the right order to avoid deadlocks.
              */
-            LWLockRelease(PARTITION_LOCK(hash_table, partition_index));
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+
             resize(hash_table, hash_table->size_log2 + 1);
 
             goto restart;
@@ -493,12 +498,13 @@ restart:
         ++partition->count;
     }
 
-    /* The caller must release the lock with dshash_release_lock. */
+    /* The caller will free the lock by calling dshash_release_lock. */
     hash_table->find_locked = true;
-    hash_table->find_exclusively_locked = true;
+    hash_table->find_exclusively_locked = exclusive;
     return ENTRY_FROM_ITEM(item);
 }
 
+
 /*
  * Remove an entry by key.  Returns true if the key was found and the
  * corresponding entry was removed.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index c337099061..493e974832 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -91,6 +91,9 @@ extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                                    const void *key, bool *found);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+                                  bool exclusive, bool nowait, bool insert,
+                                  bool *found);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.18.4

From d3cd05cde82db3b83cb9072991e3b95801f19086 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:59:38 +0900
Subject: [PATCH v41 3/7] Make archiver process an auxiliary process

This is a preliminary patch for shared-memory based stats collector.

The archiver process must be an auxiliary process because it will use
shared memory once stats data is moved into shared memory. Make the
process an auxiliary process so that it keeps working.
---
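As a rough illustration of the wakeup path after this change, the model
below replaces the postmaster signal relay with a latch pointer that the
archiver advertises in shared state. It is a single-process sketch and
not PostgreSQL code: all toy_* names are hypothetical, and the latch is
just a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct toy_latch { bool is_set; } toy_latch;

/* Models the shared structure that holds the advertised latch. */
typedef struct toy_proc_global { toy_latch *archiver_latch; } toy_proc_global;

/* Archiver side: advertise our latch so that backends can wake us. */
void toy_archiver_start(toy_proc_global *g, toy_latch *my_latch)
{
    g->archiver_latch = my_latch;
}

/*
 * Backend side: wake the archiver directly if one has advertised a latch.
 * Returns true if a wakeup was delivered.  When no archiver is running the
 * work is not lost in this model; it is simply picked up on the next
 * archiver iteration.
 */
bool toy_notify_archiver(toy_proc_global *g)
{
    if (g->archiver_latch == NULL)
        return false;
    g->archiver_latch->is_set = true;
    return true;
}

/*
 * One iteration of the archiver loop.  The latch is reset before looking
 * for work, so a notification arriving while work is in progress sets the
 * latch again and is not lost.  Returns the number of files archived.
 */
int toy_archiver_iteration(toy_latch *my_latch, int *pending_files)
{
    int         archived;

    my_latch->is_set = false;
    archived = *pending_files;
    *pending_files = 0;
    return archived;
}
```

The archiver in this model does its work on every iteration rather than
only when a "wakened" flag is set, which mirrors the removal of that flag
from the main loop in this patch.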
 src/backend/access/transam/xlogarchive.c |   6 +-
 src/backend/bootstrap/bootstrap.c        |  22 ++--
 src/backend/postmaster/pgarch.c          | 130 +++--------------------
 src/backend/postmaster/postmaster.c      |  50 +++++----
 src/backend/storage/lmgr/proc.c          |   1 +
 src/include/access/xlog.h                |   3 +
 src/include/access/xlogarchive.h         |   1 +
 src/include/miscadmin.h                  |   2 +
 src/include/postmaster/pgarch.h          |   4 +-
 src/include/storage/pmsignal.h           |   1 -
 src/include/storage/proc.h               |   3 +
 11 files changed, 69 insertions(+), 154 deletions(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index cae93ab69d..6908bec2f9 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -29,7 +29,9 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/latch.h"
 #include "storage/pmsignal.h"
+#include "storage/proc.h"
 
 /*
  * Attempt to retrieve the specified file from off-line archival storage.
@@ -489,8 +491,8 @@ XLogArchiveNotify(const char *xlog)
     }
 
     /* Notify archiver that it's got something to do */
-    if (IsUnderPostmaster)
-        SendPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER);
+    if (IsUnderPostmaster && ProcGlobal->archiverLatch)
+        SetLatch(ProcGlobal->archiverLatch);
 }
 
 /*
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index a7ed93fdc1..edda899554 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -317,6 +317,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
         case StartupProcess:
             MyBackendType = B_STARTUP;
             break;
+        case ArchiverProcess:
+            MyBackendType = B_ARCHIVER;
+            break;
         case BgWriterProcess:
             MyBackendType = B_BG_WRITER;
             break;
@@ -437,30 +440,29 @@ AuxiliaryProcessMain(int argc, char *argv[])
             proc_exit(1);        /* should never return */
 
         case StartupProcess:
-            /* don't set signals, startup process has its own agenda */
             StartupProcessMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
+
+        case ArchiverProcess:
+            PgArchiverMain();
+            proc_exit(1);
 
         case BgWriterProcess:
-            /* don't set signals, bgwriter has its own agenda */
             BackgroundWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case CheckpointerProcess:
-            /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalWriterProcess:
-            /* don't set signals, walwriter has its own agenda */
             InitXLOGAccess();
             WalWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalReceiverProcess:
-            /* don't set signals, walreceiver has its own agenda */
             WalReceiverMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         default:
             elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index ed1b65358d..e3a520def9 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -48,6 +48,7 @@
 #include "storage/latch.h"
 #include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
+#include "storage/procsignal.h"
 #include "utils/guc.h"
 #include "utils/ps_status.h"
 
@@ -78,13 +79,11 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
-static volatile sig_atomic_t wakened = false;
 static volatile sig_atomic_t ready_to_stop = false;
 
 /* ----------
@@ -95,8 +94,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
 static void pgarch_ArchiverCopyLoop(void);
@@ -110,75 +107,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -212,14 +140,9 @@ pgarch_forkexec(void)
 #endif                            /* EXEC_BACKEND */
 
 
-/*
- * PgArchiverMain
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
- *    since we don't use 'em, it hardly matters...
- */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+/* Main entry point for archiver process */
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -231,33 +154,26 @@ PgArchiverMain(int argc, char *argv[])
     /* SIGQUIT handler was already set up by InitPostmasterChild */
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, pgarch_waken);
+    pqsignal(SIGUSR1, procsignal_sigusr1_handler);
     pqsignal(SIGUSR2, pgarch_waken_stop);
+
     /* Reset some signals that are accepted by postmaster but not here */
     pqsignal(SIGCHLD, SIG_DFL);
+
+    /* Unblock signals (they were blocked when the postmaster forked us) */
     PG_SETMASK(&UnBlockSig);
 
-    MyBackendType = B_ARCHIVER;
-    init_ps_display(NULL);
+    /*
+     * Advertise our latch that backends can use to wake us up while we're
+     * sleeping.
+     */
+    ProcGlobal->archiverLatch = &MyProc->procLatch;
 
     pgarch_MainLoop();
 
     exit(0);
 }
 
-/* SIGUSR1 signal handler for archiver process */
-static void
-pgarch_waken(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    /* set flag that there is work to be done */
-    wakened = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
 /* SIGUSR2 signal handler for archiver process */
 static void
 pgarch_waken_stop(SIGNAL_ARGS)
@@ -282,14 +198,6 @@ pgarch_MainLoop(void)
     pg_time_t    last_copy_time = 0;
     bool        time_to_stop;
 
-    /*
-     * We run the copy loop immediately upon entry, in case there are
-     * unarchived files left over from a previous database run (or maybe the
-     * archiver died unexpectedly).  After that we wait for a signal or
-     * timeout before doing more.
-     */
-    wakened = true;
-
     /*
      * There shouldn't be anything for the archiver to do except to wait for a
      * signal ... however, the archiver exists to protect our data, so she
@@ -328,12 +236,8 @@ pgarch_MainLoop(void)
         }
 
         /* Do what we're here for */
-        if (wakened || time_to_stop)
-        {
-            wakened = false;
-            pgarch_ArchiverCopyLoop();
-            last_copy_time = time(NULL);
-        }
+        pgarch_ArchiverCopyLoop();
+        last_copy_time = time(NULL);
 
         /*
          * Sleep until a signal is received, or until a poll is forced by
@@ -354,13 +258,9 @@ pgarch_MainLoop(void)
                                WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                                timeout * 1000L,
                                WAIT_EVENT_ARCHIVER_MAIN);
-                if (rc & WL_TIMEOUT)
-                    wakened = true;
                 if (rc & WL_POSTMASTER_DEATH)
                     time_to_stop = true;
             }
-            else
-                wakened = true;
         }
 
         /*
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 959e3b8873..b811c961a6 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -555,6 +555,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1800,7 +1801,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -3054,7 +3055,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3189,20 +3190,16 @@ reaper(SIGNAL_ARGS)
         }
 
         /*
-         * Was it the archiver?  If so, just try to start a new one; no need
-         * to force reset of the rest of the system.  (If fail, we'll try
-         * again in future cycles of the main loop.).  Unless we were waiting
-         * for it to shut down; don't restart it in that case, and
-         * PostmasterStateMachine() will advance to the next shutdown step.
+         * Was it the archiver?  Normal exit can be ignored; we'll start a new
+         * one at the next iteration of the postmaster's main loop, if
+         * necessary. Any other exit condition is treated as a crash.
          */
         if (pid == PgArchPID)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3450,7 +3447,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3655,6 +3652,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3951,6 +3960,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5230,7 +5240,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5275,16 +5285,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (StartWorkerNeeded || HaveCrashedWorker)
         maybe_start_bgworkers();
 
-    if (CheckPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER) &&
-        PgArchPID != 0)
-    {
-        /*
-         * Send SIGUSR1 to archiver process, to wake it up and begin archiving
-         * next WAL file.
-         */
-        signal_child(PgArchPID, SIGUSR1);
-    }
-
     /* Tell syslogger to rotate logfile if requested */
     if (SysLoggerPID != 0)
     {
@@ -5526,6 +5526,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 88566bd9fa..746bed773e 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -181,6 +181,7 @@ InitProcGlobal(void)
     ProcGlobal->startupBufferPinWaitBufId = -1;
     ProcGlobal->walwriterLatch = NULL;
     ProcGlobal->checkpointerLatch = NULL;
+    ProcGlobal->archiverLatch = NULL;
     pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
     pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e71..65443063e8 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -339,6 +339,9 @@ extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern void XLogArchiveWakeupStart(void);
+extern void XLogArchiveWakeupEnd(void);
+extern void XLogArchiveWakeup(void);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool PromoteIsTriggered(void);
diff --git a/src/include/access/xlogarchive.h b/src/include/access/xlogarchive.h
index 1c67de2ede..54ce0b97d7 100644
--- a/src/include/access/xlogarchive.h
+++ b/src/include/access/xlogarchive.h
@@ -25,6 +25,7 @@ extern void ExecuteRecoveryCommand(const char *command, const char *commandName,
 extern void KeepFileRestoredFromArchive(const char *path, const char *xlogfname);
 extern void XLogArchiveNotify(const char *xlog);
 extern void XLogArchiveNotifySeg(XLogSegNo segno);
+extern void XLogArchiveWakeup(void);
 extern void XLogArchiveForceDone(const char *xlog);
 extern bool XLogArchiveCheckDone(const char *xlog);
 extern bool XLogArchiveIsBusy(const char *xlog);
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 72e3352398..bba002ad24 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -418,6 +418,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -430,6 +431,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index b3200874ca..e3ffc63f14 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 56c5ec4481..c691acf8cd 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -34,7 +34,6 @@ typedef enum
 {
     PMSIGNAL_RECOVERY_STARTED,    /* recovery has started */
     PMSIGNAL_BEGIN_HOT_STANDBY, /* begin Hot Standby */
-    PMSIGNAL_WAKEN_ARCHIVER,    /* send a NOTIFY signal to xlog archiver */
     PMSIGNAL_ROTATE_LOGFILE,    /* send SIGUSR1 to syslogger to rotate logfile */
     PMSIGNAL_START_AUTOVAC_LAUNCHER,    /* start an autovacuum launcher */
     PMSIGNAL_START_AUTOVAC_WORKER,    /* start an autovacuum worker */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 9c9a50ae45..de20520b8c 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -345,6 +345,9 @@ typedef struct PROC_HDR
     int            startupProcPid;
     /* Buffer id of the buffer that Startup process waits for pin on, or -1 */
     int            startupBufferPinWaitBufId;
+    /* Archiver process's latch */
+    Latch       *archiverLatch;
+    /* Current shared estimate of appropriate spins_per_delay value */
 } PROC_HDR;
 
 extern PGDLLIMPORT PROC_HDR *ProcGlobal;
-- 
2.18.4

From 3982937159c833626d995cf1477cbed20200125d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:14 +0900
Subject: [PATCH v41 4/7] Shared-memory based stats collector

Previously, activity statistics were collected via sockets and shared
among backends through files written periodically. Such files can
reach tens of megabytes and are created as often as once per second;
this large data set is serialized by the stats collector and then
de-serialized by every backend periodically. To avoid that cost, this
patch places activity statistics data in shared memory. Each backend
accumulates statistics locally, then tries to move them into the
shared statistics at every transaction end, but at intervals no
shorter than 10 seconds. Until 60 seconds have elapsed since the last
flush to shared stats, a lock failure postpones the flush, to
alleviate lock contention that would slow down transactions. After
that, the flush waits for locks so that the shared statistics do not
get stale.
---
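Note: the flush timing described in the commit message can be sketched
roughly as below. This is only an illustration under stated assumptions,
not code from the patch. FlushState, FlushMode, flush_state_init,
flush_decide and flush_record are hypothetical names; the three constants
mirror PGSTAT_MIN_INTERVAL, PGSTAT_RETRY_MIN_INTERVAL and
PGSTAT_MAX_INTERVAL from pgstat.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the PGSTAT_* timer definitions in the patch (ms). */
#define MIN_INTERVAL_MS        10000
#define RETRY_MIN_INTERVAL_MS   1000
#define MAX_INTERVAL_MS        60000

typedef enum FlushMode
{
    FLUSH_SKIP,     /* too early, keep accumulating locally */
    FLUSH_NOWAIT,   /* try to flush, give up on lock failure */
    FLUSH_WAIT      /* flush and wait for locks */
} FlushMode;

/* Hypothetical per-backend bookkeeping; all times are in milliseconds. */
typedef struct FlushState
{
    int64_t next_flush_ms;      /* earliest time of the next attempt */
    int64_t pending_since_ms;   /* first failed attempt, or -1 if none */
    int64_t retry_interval_ms;  /* current retry back-off */
} FlushState;

static void
flush_state_init(FlushState *st)
{
    st->next_flush_ms = 0;
    st->pending_since_ms = -1;
    st->retry_interval_ms = 0;
}

/* Decide what to do at transaction end. */
static FlushMode
flush_decide(const FlushState *st, int64_t now_ms)
{
    assert(now_ms >= 0);
    if (now_ms < st->next_flush_ms)
        return FLUSH_SKIP;
    if (st->pending_since_ms >= 0 &&
        now_ms - st->pending_since_ms >= MAX_INTERVAL_MS)
        return FLUSH_WAIT;
    return FLUSH_NOWAIT;
}

/* Record the outcome of a flush attempt made at now_ms. */
static void
flush_record(FlushState *st, int64_t now_ms, bool all_flushed)
{
    if (all_flushed)
    {
        /* Nothing left over: go back to the normal minimum interval. */
        st->pending_since_ms = -1;
        st->retry_interval_ms = 0;
        st->next_flush_ms = now_ms + MIN_INTERVAL_MS;
    }
    else if (st->pending_since_ms < 0)
    {
        /* First lock failure: start the retry back-off. */
        st->pending_since_ms = now_ms;
        st->retry_interval_ms = RETRY_MIN_INTERVAL_MS;
        st->next_flush_ms = now_ms + st->retry_interval_ms;
    }
    else
    {
        /* Repeated lock failure: double the retry interval. */
        st->retry_interval_ms *= 2;
        st->next_flush_ms = now_ms + st->retry_interval_ms;
    }
}
```

A caller would invoke flush_decide() at each transaction end and pass the
result of the flush attempt to flush_record(). In this sketch the forced
(waiting) flush happens at the first attempt made at or after the 60
second mark; the patch itself may schedule it differently.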
 src/backend/access/heap/heapam_handler.c      |    4 +-
 src/backend/access/heap/vacuumlazy.c          |    2 +-
 src/backend/access/transam/xlog.c             |    6 +-
 src/backend/catalog/index.c                   |   24 +-
 src/backend/commands/analyze.c                |    8 +-
 src/backend/commands/dbcommands.c             |    2 +-
 src/backend/commands/matview.c                |    8 +-
 src/backend/commands/vacuum.c                 |    4 +-
 src/backend/postmaster/autovacuum.c           |   74 +-
 src/backend/postmaster/bgwriter.c             |    4 +-
 src/backend/postmaster/checkpointer.c         |   26 +-
 src/backend/postmaster/pgarch.c               |   12 +-
 src/backend/postmaster/pgstat.c               | 6100 +++++++----------
 src/backend/postmaster/postmaster.c           |   88 +-
 src/backend/replication/basebackup.c          |    4 +-
 src/backend/replication/slot.c                |   12 +-
 src/backend/storage/buffer/bufmgr.c           |    8 +-
 src/backend/storage/ipc/ipci.c                |    2 +
 src/backend/storage/lmgr/lwlock.c             |    4 +-
 src/backend/storage/lmgr/lwlocknames.txt      |    1 +
 src/backend/storage/smgr/smgr.c               |    4 +-
 src/backend/tcop/postgres.c                   |   37 +-
 src/backend/utils/adt/pgstatfuncs.c           |   89 +-
 src/backend/utils/cache/relcache.c            |    5 +
 src/backend/utils/init/globals.c              |    1 +
 src/backend/utils/init/miscinit.c             |    3 -
 src/backend/utils/init/postinit.c             |   11 +
 src/backend/utils/misc/guc.c                  |   14 +-
 src/backend/utils/misc/postgresql.conf.sample |    2 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |    2 +-
 src/include/miscadmin.h                       |    3 +-
 src/include/pgstat.h                          |  734 +-
 src/include/storage/lwlock.h                  |    1 +
 src/include/utils/guc_tables.h                |    2 +-
 src/include/utils/timeout.h                   |    1 +
 35 files changed, 2782 insertions(+), 4520 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index dcaea7135f..49df584a9e 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1061,8 +1061,8 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
                  * our own.  In this case we should count and sample the row,
                  * to accommodate users who load a table and analyze it in one
                  * transaction.  (pgstat_report_analyze has to adjust the
-                 * numbers we send to the stats collector to make this come
-                 * out right.)
+                 * numbers we report to the activity stats facility to make
+                 * this come out right.)
                  */
                 if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(targtuple->t_data)))
                 {
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 2174fccb1e..403ab341d9 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -598,7 +598,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
                         new_min_multi,
                         false);
 
-    /* report results to the stats collector, too */
+    /* report results to the activity stats facility, too */
     pgstat_report_vacuum(RelationGetRelid(onerel),
                          onerel->rd_rel->relisshared,
                          Max(new_live_tuples, 0),
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a1078a7cfc..a0c9bfc7c2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2196,7 +2196,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
                     WriteRqst.Flush = 0;
                     XLogWrite(WriteRqst, false);
                     LWLockRelease(WALWriteLock);
-                    WalStats.m_wal_buffers_full++;
+                    WalStats.wal_buffers_full++;
                     TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
                 }
                 /* Re-acquire WALBufMappingLock and retry */
@@ -8578,9 +8578,9 @@ LogCheckpointEnd(bool restartpoint)
                         &sync_secs, &sync_usecs);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time +=
+    CheckPointerStats.checkpoint_write_time +=
         write_secs * 1000 + write_usecs / 1000;
-    BgWriterStats.m_checkpoint_sync_time +=
+    CheckPointerStats.checkpoint_sync_time +=
         sync_secs * 1000 + sync_usecs / 1000;
 
     /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index b88b4a1f12..66249682d6 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1854,28 +1854,10 @@ index_concurrently_swap(Oid newIndexId, Oid oldIndexId, const char *oldName)
 
     /*
      * Copy over statistics from old to new index
+     * The data will be flushed by the next pgstat_report_stat()
+     * call.
      */
-    {
-        PgStat_StatTabEntry *tabentry;
-
-        tabentry = pgstat_fetch_stat_tabentry(oldIndexId);
-        if (tabentry)
-        {
-            if (newClassRel->pgstat_info)
-            {
-                newClassRel->pgstat_info->t_counts.t_numscans = tabentry->numscans;
-                newClassRel->pgstat_info->t_counts.t_tuples_returned = tabentry->tuples_returned;
-                newClassRel->pgstat_info->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_hit = tabentry->blocks_hit;
-
-                /*
-                 * The data will be sent by the next pgstat_report_stat()
-                 * call.
-                 */
-            }
-        }
-    }
+    pgstat_copy_index_counters(oldIndexId, newClassRel->pgstat_info);
 
     /* Copy data of pg_statistic from the old index to the new one */
     CopyStatistics(oldIndexId, newIndexId);
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8af12b5c6b..a7e787d9d1 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -644,10 +644,10 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
     }
 
     /*
-     * Report ANALYZE to the stats collector, too.  However, if doing
-     * inherited stats we shouldn't report, because the stats collector only
-     * tracks per-table stats.  Reset the changes_since_analyze counter only
-     * if we analyzed all columns; otherwise, there is still work for
+     * Report ANALYZE to the activity stats facility, too.  However, if doing
+     * inherited stats we shouldn't report, because the activity stats facility
+     * only tracks per-table stats.  Reset the changes_since_analyze counter
+     * only if we analyzed all columns; otherwise, there is still work for
      * auto-analyze to do.
      */
     if (!inh)
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index f27c3fe8c1..de0c749570 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -971,7 +971,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
     DropDatabaseBuffers(db_id);
 
     /*
-     * Tell the stats collector to forget it immediately, too.
+     * Tell the activity stats facility to forget it immediately, too.
      */
     pgstat_drop_database(db_id);
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index f80a9e96a9..e7ccc6eba7 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -336,10 +336,10 @@ ExecRefreshMatView(RefreshMatViewStmt *stmt, const char *queryString,
         refresh_by_heap_swap(matviewOid, OIDNewHeap, relpersistence);
 
         /*
-         * Inform stats collector about our activity: basically, we truncated
-         * the matview and inserted some new data.  (The concurrent code path
-         * above doesn't need to worry about this because the inserts and
-         * deletes it issues get counted by lower-level code.)
+         * Inform activity stats facility about our activity: basically, we
+         * truncated the matview and inserted some new data.  (The concurrent
+         * code path above doesn't need to worry about this because the inserts
+         * and deletes it issues get counted by lower-level code.)
          */
         pgstat_count_truncate(matviewRel);
         if (!stmt->skipData)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 1b6717f727..070bcd1ead 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -319,8 +319,8 @@ vacuum(List *relations, VacuumParams *params,
                  errmsg("VACUUM option DISABLE_PAGE_SKIPPING cannot be used with FULL")));
 
     /*
-     * Send info about dead objects to the statistics collector, unless we are
-     * in autovacuum --- autovacuum.c does this for itself.
+     * Send info about dead objects to the activity statistics facility, unless
+     * we are in autovacuum --- autovacuum.c does this for itself.
      */
     if ((params->options & VACOPT_VACUUM) && !IsAutoVacuumWorkerProcess())
         pgstat_vacuum_stat();
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 2cef56f115..773b82be3b 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -338,9 +338,6 @@ static void autovacuum_do_vac_analyze(autovac_table *tab,
                                       BufferAccessStrategy bstrategy);
 static AutoVacOpts *extract_autovac_opts(HeapTuple tup,
                                          TupleDesc pg_class_desc);
-static PgStat_StatTabEntry *get_pgstat_tabentry_relid(Oid relid, bool isshared,
-                                                      PgStat_StatDBEntry *shared,
-                                                      PgStat_StatDBEntry *dbentry);
 static void perform_work_item(AutoVacuumWorkItem *workitem);
 static void autovac_report_activity(autovac_table *tab);
 static void autovac_report_workitem(AutoVacuumWorkItem *workitem,
@@ -1680,12 +1677,12 @@ AutoVacWorkerMain(int argc, char *argv[])
         char        dbname[NAMEDATALEN];
 
         /*
-         * Report autovac startup to the stats collector.  We deliberately do
-         * this before InitPostgres, so that the last_autovac_time will get
-         * updated even if the connection attempt fails.  This is to prevent
-         * autovac from getting "stuck" repeatedly selecting an unopenable
-         * database, rather than making any progress on stuff it can connect
-         * to.
+         * Report autovac startup to the activity stats facility.  We
+         * deliberately do this before InitPostgres, so that the
+         * last_autovac_time will get updated even if the connection attempt
+         * fails.  This is to prevent autovac from getting "stuck" repeatedly
+         * selecting an unopenable database, rather than making any progress on
+         * stuff it can connect to.
          */
         pgstat_report_autovac(dbid);
 
@@ -1957,8 +1954,6 @@ do_autovacuum(void)
     HASHCTL        ctl;
     HTAB       *table_toast_map;
     ListCell   *volatile cell;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     BufferAccessStrategy bstrategy;
     ScanKeyData key;
     TupleDesc    pg_class_desc;
@@ -1977,17 +1972,11 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /*
-     * may be NULL if we couldn't find an entry (only happens if we are
-     * forcing a vacuum for anti-wrap purposes).
-     */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
 
     /*
-     * Clean up any dead statistics collector entries for this DB. We always
+     * Clean up any dead activity statistics entries for this DB. We always
      * want to do this exactly once per DB-processing cycle, even if we find
      * nothing worth vacuuming in the database.
      */
@@ -2030,9 +2019,6 @@ do_autovacuum(void)
     /* StartTransactionCommand changed elsewhere */
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-
     classRel = table_open(RelationRelationId, AccessShareLock);
 
     /* create a copy so we can use it after closing pg_class */
@@ -2111,8 +2097,8 @@ do_autovacuum(void)
 
         /* Fetch reloptions and the pgstat entry for this table */
         relopts = extract_autovac_opts(tuple, pg_class_desc);
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                       relid);
 
         /* Check if it needs vacuum or analyze */
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
@@ -2195,8 +2181,8 @@ do_autovacuum(void)
         }
 
         /* Fetch the pgstat entry for this table */
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                       relid);
 
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
@@ -2755,29 +2741,6 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
     return av;
 }
 
-/*
- * get_pgstat_tabentry_relid
- *
- * Fetch the pgstat entry of a table, either local to a database or shared.
- */
-static PgStat_StatTabEntry *
-get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
-                          PgStat_StatDBEntry *dbentry)
-{
-    PgStat_StatTabEntry *tabentry = NULL;
-
-    if (isshared)
-    {
-        if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
-    }
-    else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
-
-    return tabentry;
-}
 
 /*
  * table_recheck_autovac
@@ -2798,17 +2761,12 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     bool        doanalyze;
     autovac_table *tab = NULL;
     PgStat_StatTabEntry *tabentry;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     bool        wraparound;
     AutoVacOpts *avopts;
 
     /* use fresh stats */
     autovac_refresh_stats();
 
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* fetch the relation's relcache entry */
     classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
     if (!HeapTupleIsValid(classTup))
@@ -2832,8 +2790,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     }
 
     /* fetch the pgstat table entry */
-    tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                         shared, dbentry);
+    tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                   relid);
 
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
@@ -2955,7 +2913,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
  *
  * For analyze, the analysis done is that the number of tuples inserted,
  * deleted and updated since the last analyze exceeds a threshold calculated
- * in the same fashion as above.  Note that the collector actually stores
+ * in the same fashion as above.  Note that the activity stats facility stores
  * the number of tuples (both live and dead) that there were as of the last
  * analyze.  This is asymmetric to the VACUUM case.
  *
@@ -2965,8 +2923,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
  *
  * A table whose autovacuum_enabled option is false is
  * automatically skipped (unless we have to vacuum it due to freeze_max_age).
- * Thus autovacuum can be disabled for specific tables. Also, when the stats
- * collector does not have data about a table, it will be skipped.
+ * Thus autovacuum can be disabled for specific tables. Also, when the activity
+ * stats facility does not have data about a table, it will be skipped.
  *
  * A table whose vac_base_thresh value is < 0 takes the base value from the
  * autovacuum_vacuum_threshold GUC variable.  Similarly, a vac_scale_factor
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index a7afa758b6..b075e85839 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -244,9 +244,9 @@ BackgroundWriterMain(void)
         can_hibernate = BgBufferSync(&wb_context);
 
         /*
-         * Send off activity statistics to the stats collector
+         * Send off activity statistics to the activity stats facility
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 429c8010ef..fe6937f8af 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -360,7 +360,7 @@ CheckpointerMain(void)
         if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
         {
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            CheckPointerStats.requested_checkpoints++;
         }
 
         /*
@@ -374,7 +374,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                CheckPointerStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -495,17 +495,11 @@ CheckpointerMain(void)
         /* Check for archive_timeout and switch xlog files if necessary. */
         CheckArchiveTimeout();
 
-        /*
-         * Send off activity statistics to the stats collector.  (The reason
-         * why we re-use bgwriter-related code for this is that the bgwriter
-         * and checkpointer used to be just one process.  It's probably not
-         * worth the trouble to split the stats support into two independent
-         * stats message types.)
-         */
-        pgstat_send_bgwriter();
+        /* Send off activity statistics to the activity stats facility. */
+        pgstat_report_checkpointer();
 
         /* Send WAL statistics to the stats collector. */
-        pgstat_send_wal();
+        pgstat_report_wal();
 
         /*
          * If any checkpoint flags have been set, redo the loop to handle the
@@ -711,9 +705,9 @@ CheckpointWriteDelay(int flags, double progress)
         CheckArchiveTimeout();
 
         /*
-         * Report interim activity statistics to the stats collector.
+         * Report interim activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_report_checkpointer();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1258,8 +1252,10 @@ AbsorbSyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    CheckPointerStats.buf_written_backend
+        += CheckpointerShmem->num_backend_writes;
+    CheckPointerStats.buf_fsync_backend
+        += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index e3a520def9..6d88c65d5f 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -372,20 +372,20 @@ pgarch_ArchiverCopyLoop(void)
                 pgarch_archiveDone(xlog);
 
                 /*
-                 * Tell the collector about the WAL file that we successfully
-                 * archived
+                 * Tell the activity statistics facility about the WAL file
+                 * that we successfully archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_report_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
             else
             {
                 /*
-                 * Tell the collector about the WAL file that we failed to
-                 * archive
+                 * Tell the activity statistics facility about the WAL file
+                 * that we failed to archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_report_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f1dca2f25b..45bb19fe1e 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,22 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Activity Statistics facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects activity statistics, e.g. per-table access statistics, of
+ *  all backends in shared memory. The activity numbers are first stored
+ *  locally in each process, then flushed to shared memory at commit
+ *  time or by idle-timeout.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ * To avoid congestion on the shared memory, shared stats are updated no more
+ * often than once per PGSTAT_MIN_INTERVAL (10000ms). If some local numbers
+ * remain unflushed due to lock failure, retry at intervals that start at
+ * PGSTAT_RETRY_MIN_INTERVAL (1000ms) and double at every retry. Finally we
+ * force an update PGSTAT_MAX_INTERVAL (60000ms) after the first attempt.
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the activity statistics facility creates the
+ *  area and loads the stored stats file, if any; the last process at shutdown
+ *  writes the shared stats to the file and destroys the area before exiting.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -19,18 +26,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -40,12 +35,9 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
+#include "common/hashfn.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/fork_process.h"
@@ -53,20 +45,16 @@
 #include "postmaster/postmaster.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
+#include "storage/condition_variable.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
 #include "utils/timestamp.h"
 
@@ -74,35 +62,20 @@
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
+#define PGSTAT_MIN_INTERVAL            10000    /* Minimum interval of stats data
+                                             * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_RETRY_MIN_INTERVAL    1000 /* Initial retry interval after
+                                          * PGSTAT_MIN_INTERVAL */
 
+#define PGSTAT_MAX_INTERVAL            60000    /* Longest interval of stats data
+                                             * updates */
 
 /* ----------
- * The initial size hints for the hash tables used in the collector.
+ * The initial size hints for the hash tables used in the activity statistics.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
-#define PGSTAT_TAB_HASH_SIZE    512
+#define PGSTAT_TABLE_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
 
@@ -117,7 +90,6 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
-
 /* ----------
  * GUC parameters
  * ----------
@@ -132,17 +104,11 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used; to be removed together with the GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
-/*
- * BgWriter and WAL global statistics counters.
- * Stored directly in a stats message structure so they can be sent
- * without needing to copy things around.  We assume these init to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
-PgStat_MsgWal WalStats;
-
 /*
  * List of SLRU names that we keep stats for.  There is no central registry of
  * SLRUs, so we use this fixed list instead.  The "other" entry is used for
@@ -161,73 +127,245 @@ static const char *const slru_names[] = {
 
 #define SLRU_NUM_ELEMENTS    lengthof(slru_names)
 
+/* struct for shared SLRU stats */
+typedef struct PgStatSharedSLRUStats
+{
+    PgStat_SLRUStats        entry[SLRU_NUM_ELEMENTS];
+    LWLock                    lock;
+    pg_atomic_uint32        changecount;
+} PgStatSharedSLRUStats;
+
+StaticAssertDecl(sizeof(TimestampTz) == sizeof(pg_atomic_uint64),
+                 "size of pg_atomic_uint64 doesn't match TimestampTz");
+
+typedef struct StatsShmemStruct
+{
+    dsa_handle    stats_dsa_handle;    /* handle for stats data area */
+    dshash_table_handle hash_handle;    /* shared dbstat hash */
+    int            refcount;            /* # of processes attached to the
+                                     * shared stats memory */
+    /* Global stats structs */
+    PgStat_Archiver            archiver_stats;
+    pg_atomic_uint32        archiver_changecount;
+    PgStat_BgWriter            bgwriter_stats;
+    pg_atomic_uint32        bgwriter_changecount;
+    PgStat_CheckPointer        checkpointer_stats;
+    pg_atomic_uint32        checkpointer_changecount;
+    PgStat_Wal                wal_stats;
+    LWLock                    wal_stats_lock;
+    PgStatSharedSLRUStats    slru_stats;
+    pg_atomic_uint32        slru_changecount;
+    pg_atomic_uint64        stats_timestamp;
+
+    /* Reset offsets, protected by StatsLock */
+    PgStat_Archiver            archiver_reset_offset;
+    PgStat_BgWriter            bgwriter_reset_offset;
+    PgStat_CheckPointer        checkpointer_reset_offset;
+
+    /* file read/write protection */
+    bool                    attach_holdoff;
+    ConditionVariable        holdoff_cv;
+
+    pg_atomic_uint64        gc_count; /* # of entries deleted. not
+                                            * protected by StatsLock */
+} StatsShmemStruct;
+
+/* BgWriter global statistics counters */
+PgStat_BgWriter BgWriterStats = {0};
+
+/* CheckPointer global statistics counters */
+PgStat_CheckPointer CheckPointerStats = {0};
+
+/* WAL global statistics counters */
+PgStat_Wal WalStats = {0};
+
 /*
- * SLRU statistics counts waiting to be sent to the collector.  These are
- * stored directly in stats message format so they can be sent without needing
- * to copy things around.  We assume this variable inits to zeroes.  Entries
- * are one-to-one with slru_names[].
+ * XXXX: always try to flush WAL stats. We don't want to manipulate another
+ * counter during XLogInsert so we don't have an efficient shortcut to know
+ * whether any counter gets incremented.
  */
-static PgStat_MsgSLRU SLRUStats[SLRU_NUM_ELEMENTS];
+static inline bool
+walstats_pending(void)
+{
+    static const PgStat_Wal all_zeroes;
+
+    return memcmp(&WalStats, &all_zeroes, sizeof(PgStat_Wal)) != 0;
+}
+
+/*
+ * SLRU statistics counts waiting to be written to the shared activity
+ * statistics.  We assume this variable inits to zeroes.  Entries are
+ * one-to-one with slru_names[].
+ * Changes of SLRU counters are reported within critical sections so we use
+ * static memory in order to avoid memory allocation.
+ */
+static PgStat_SLRUStats local_SLRUStats[SLRU_NUM_ELEMENTS];
+static bool     have_slrustats = false;
 
 /* ----------
  * Local data
  * ----------
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
-
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
+/* backend-lifetime storage */
+static StatsShmemStruct *StatsShmem = NULL;
+static dsa_area *area = NULL;
 
 /*
- * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
- *
- * NOTE: once allocated, TabStatusArray structures are never moved or deleted
- * for the life of the backend.  Also, we zero out the t_id fields of the
- * contained PgStat_TableStatus structs whenever they are not actively in use.
- * This allows relcache pgstat_info pointers to be treated as long-lived data,
- * avoiding repeated searches in pgstat_initstats() when a relation is
- * repeatedly opened during a transaction.
+ * Types to define shared statistics structure.
+ *
+ * Per-object statistics are stored in "shared stats": a type-specific struct,
+ * allocated in DSA memory, that begins with a header common to all object
+ * types. Every shared stats entry is referenced from a dshash entry via a
+ * dsa_pointer. This structure makes the shared stats immovable against dshash
+ * resizing, allows a backend to point to shared stats entries via a native
+ * pointer, and allows locking at stats-entry level. The per-entry locking
+ * reduces lock contention compared to the partition lock of dshash. A backend
+ * accumulates stats numbers in a stats entry in local memory, then flushes
+ * the numbers to the shared stats entries, basically at transaction end.
+ *
+ * Each stat entry type has a fixed member PgStat_StatEntryHeader as the first
+ * element.
+ *
+ * Shared stats are stored as:
+ *
+ * dshash pgStatSharedHash
+ *    -> PgStatHashEntry                    (dshash entry)
+ *      (dsa_pointer)-> PgStat_Stat*Entry    (dsa memory block)
+ *
+ * Shared stats entries are directly pointed to from a pgstat_localhash hash:
+ *
+ * pgstat_localhash pgStatEntHash
+ *    -> PgStatLocalHashEntry                (equivalent of PgStatHashEntry)
+ *      (native pointer)-> PgStat_Stat*Entry (dsa memory block)
+ *
+ * Local stats that are waiting to be flushed to shared stats are stored as:
+ *
+ * pgstat_localhash pgStatLocalHash
+ *    -> PgStatLocalHashEntry                 (local hash entry)
+ *      (native pointer)-> PgStat_Stat*Entry/TableStatus (palloc'ed memory)
  */
-#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
 
-typedef struct TabStatusArray
+/* The types of statistics entries */
+typedef enum PgStatTypes
 {
-    struct TabStatusArray *tsa_next;    /* link to next array, if any */
-    int            tsa_used;        /* # entries currently used */
-    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
-} TabStatusArray;
-
-static TabStatusArray *pgStatTabList = NULL;
+    PGSTAT_TYPE_DB,            /* database-wide statistics */
+    PGSTAT_TYPE_TABLE,        /* per-table statistics */
+    PGSTAT_TYPE_FUNCTION,    /* per-function statistics */
+    PGSTAT_TYPE_REPLSLOT    /* per-replication-slot statistics */
+} PgStatTypes;
 
 /*
- * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
+ * entry body size lookup table of shared statistics entries corresponding to
+ * PgStatTypes
  */
-typedef struct TabStatHashEntry
+static const size_t pgstat_sharedentsize[] =
 {
-    Oid            t_id;
-    PgStat_TableStatus *tsa_entry;
-} TabStatHashEntry;
+    sizeof(PgStat_StatDBEntry),     /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_StatTabEntry),    /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_StatFuncEntry),    /* PGSTAT_TYPE_FUNCTION */
+    sizeof(PgStat_ReplSlot)            /* PGSTAT_TYPE_REPLSLOT */
+};
 
-/*
- * Hash table for O(1) t_id -> tsa_entry lookup
- */
-static HTAB *pgStatTabHash = NULL;
+/* Ditto for local statistics entries */
+static const size_t pgstat_localentsize[] =
+{
+    sizeof(PgStat_StatDBEntry),            /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_TableStatus),            /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_BackendFunctionEntry),/* PGSTAT_TYPE_FUNCTION */
+    sizeof(PgStat_ReplSlot)                /* PGSTAT_TYPE_REPLSLOT */
+};
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * We should avoid overwriting the header part of a shared entry. Use these
+ * macros to know what portion of the struct is to be written or read.
+ * PGSTAT_SHENT_BODY returns a slightly smaller address than the actual
+ * address of the next member, but that doesn't matter.
  */
-static HTAB *pgStatFunctions = NULL;
+#define PGSTAT_SHENT_BODY(e) (((char *)(e)) + sizeof(PgStat_StatEntryHeader))
+#define PGSTAT_SHENT_BODY_LEN(t) \
+    (pgstat_sharedentsize[t] - sizeof(PgStat_StatEntryHeader))
+
+/* struct for shared statistics hash entry key. */
+typedef struct PgStatHashKey
+{
+    PgStatTypes type;        /* statistics entry type */
+    Oid            databaseid;    /* database ID. InvalidOid for shared objects. */
+    Oid            objectid;    /* object ID, either table or function. */
+} PgStatHashKey;
+
+/* struct for shared statistics hash entry */
+typedef struct PgStatHashEntry
+{
+    PgStatHashKey    key; /* hash key */
+    dsa_pointer        body;/* pointer to shared stats in PgStat_StatEntryHeader */
+} PgStatHashEntry;
+
+/* struct for shared statistics local hash entry. */
+typedef struct PgStatLocalHashEntry
+{
+    PgStatHashKey            key;    /* hash key */
+    char                    status;    /* for simplehash use */
+    PgStat_StatEntryHeader *body;    /* pointer to stats body in local heap */
+    dsa_pointer                dsapointer; /* dsa pointer of body */
+} PgStatLocalHashEntry;
+
+/* parameter for the shared hash */
+static const dshash_parameters dsh_params = {
+    sizeof(PgStatHashKey),
+    sizeof(PgStatHashEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+
+/* define hashtable for local hashes */
+#define SH_PREFIX pgstat_localhash
+#define SH_ELEMENT_TYPE PgStatLocalHashEntry
+#define SH_KEY_TYPE PgStatHashKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+    hash_bytes((unsigned char *)&key, sizeof(PgStatHashKey))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(PgStatHashKey)) == 0)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/* The shared hash to index activity stats entries. */
+static dshash_table *pgStatSharedHash = NULL;
 
 /*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
+ * The local cache to index shared stats entries.
+ *
+ * This is a local hash to store native pointers to shared hash
+ * entries. pgStatEntHashAge is copied from StatsShmem->gc_count at creation
+ * and garbage collection.
  */
-static bool have_function_stats = false;
+static pgstat_localhash_hash *pgStatEntHash = NULL;
+static int     pgStatEntHashAge = 0;        /* cache age of pgStatEntHash */
+
+/* Local stats numbers are stored here. */
+static pgstat_localhash_hash *pgStatLocalHash = NULL;
+
+/* entry type for oid hash */
+typedef struct pgstat_oident
+{
+    Oid oid;
+    char status;
+} pgstat_oident;
+
+/* Define hashtable for OID hashes. */
+StaticAssertDecl(sizeof(Oid) == 4, "oid is not compatible with uint32");
+#define SH_PREFIX pgstat_oid
+#define SH_ELEMENT_TYPE pgstat_oident
+#define SH_KEY_TYPE Oid
+#define SH_KEY oid
+#define SH_HASH_KEY(tb, key) hash_bytes_uint32(key)
+#define SH_EQUAL(tb, a, b) (a == b)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
 
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
@@ -264,11 +402,8 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -277,23 +412,9 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Make our own memory context to make it easy to track memory usage.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-static PgStat_WalStats walStats;
-static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
-static PgStat_ReplSlotStats *replSlotStats;
-static int    nReplSlotStats;
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
+MemoryContext    pgStatCacheContext = NULL;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -302,40 +423,57 @@ static List *pending_write_requests = NIL;
  */
 static instr_time total_func_time;
 
+/* Simple caching feature for pgstatfuncs */
+static PgStatHashKey    stathashkey_zero = {0};
+static PgStatHashKey        cached_dbent_key = {0};
+static PgStat_StatDBEntry    cached_dbent;
+static PgStatHashKey        cached_tabent_key = {0};
+static PgStat_StatTabEntry    cached_tabent;
+static PgStatHashKey        cached_funcent_key = {0};
+static PgStat_StatFuncEntry    cached_funcent;
+
+static PgStat_Archiver        cached_archiverstats;
+static bool                    cached_archiverstats_is_valid = false;
+static PgStat_BgWriter        cached_bgwriterstats;
+static bool                    cached_bgwriterstats_is_valid = false;
+static PgStat_CheckPointer    cached_checkpointerstats;
+static bool                    cached_checkpointerstats_is_valid = false;
+static PgStat_Wal            cached_walstats;
+static bool                    cached_walstats_is_valid = false;
+static PgStat_SLRUStats        cached_slrustats;
+static bool                    cached_slrustats_is_valid = false;
+static PgStat_ReplSlot       *cached_replslotstats = NULL;
+static int                    n_cached_replslotstats = -1;
 
 /* ----------
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgstat_beshutdown_hook(int code, Datum arg);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                                                 Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStat_StatDBEntry *get_local_dbstat_entry(Oid dbid);
+static PgStat_TableStatus *get_local_tabstat_entry(Oid rel_id, bool isshared);
+
+static void pgstat_write_statsfile(void);
+
+static void pgstat_read_statsfile(void);
 static void pgstat_read_current_status(void);
 
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
+static PgStat_StatEntryHeader * get_stat_entry(PgStatTypes type, Oid dbid,
+                                               Oid objid, bool nowait,
+                                               bool create, bool *found);
 
-static int    pgstat_replslot_index(const char *name, bool create_it);
-static void pgstat_reset_replslot(int i, TimestampTz ts);
+static bool flush_tabstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_funcstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_dbstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_walstat(bool nowait);
+static bool flush_slrustat(bool nowait);
+static void delete_current_stats_entry(dshash_seq_status *hstat);
+static PgStat_StatEntryHeader * get_local_stat_entry(PgStatTypes type, Oid dbid,
+                                                     Oid objid, bool create,
+                                                     bool *found);
 
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
-static void pgstat_send_slru(void);
-static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
-
-static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
+static pgstat_oid_hash *collect_oids(Oid catalogid, AttrNumber anum_oid);
 
 static void pgstat_setup_memcxt(void);
 
@@ -345,489 +483,630 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len);
-static void pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
-static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_replslot(PgStat_MsgReplSlot *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute shared memory space needed for activity statistics
+ */
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+/*
+ * StatsShmemInit - initialize during shared-memory creation
  */
 void
-pgstat_init(void)
+StatsShmemInit(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
+    bool        found;
+
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(), &found);
+
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+        ConditionVariableInit(&StatsShmem->holdoff_cv);
+        pg_atomic_init_u32(&StatsShmem->archiver_changecount, 0);
+        pg_atomic_init_u32(&StatsShmem->bgwriter_changecount, 0);
+        pg_atomic_init_u32(&StatsShmem->checkpointer_changecount, 0);
+
+        pg_atomic_init_u64(&StatsShmem->gc_count, 0);
+
+        LWLockInitialize(&StatsShmem->wal_stats_lock, LWTRANCHE_STATS);
+    }
+}
+
+/* ----------
+ * allow_next_attacher() -
+ *
+ *  Let other processes go ahead and attach to the shared stats area.
+ * ----------
+ */
+static void
+allow_next_attacher(void)
+{
+    bool triggered = false;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (StatsShmem->attach_holdoff)
+    {
+        StatsShmem->attach_holdoff = false;
+        triggered = true;
+    }
+    LWLockRelease(StatsLock);
+
+    if (triggered)
+        ConditionVariableBroadcast(&StatsShmem->holdoff_cv);
+}
+
+/* ----------
+ * attach_shared_stats() -
+ *
+ *    Attach to, or create, the shared stats memory. If we are the first process
+ *    to use the activity stats system, read the saved statistics file, if any.
+ * ----------
+ */
+static void
+attach_shared_stats(void)
+{
+    MemoryContext oldcontext;
 
     /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
+     * Don't use dsm under postmaster, or when not tracking counts.
      */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    pgstat_setup_memcxt();
+
+    if (area)
+        return;
+
+    /* stats shared memory persists for the backend lifetime */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
 
     /*
-     * Create the UDP socket for sending and receiving statistic messages
+     * The first attacher backend may still be reading the stats file, or the
+     * last detacher may be writing it. Wait for the work to finish.
      */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
+    ConditionVariablePrepareToSleep(&StatsShmem->holdoff_cv);
+    for (;;)
     {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
+        bool hold_off;
+
+        LWLockAcquire(StatsLock, LW_SHARED);
+        hold_off = StatsShmem->attach_holdoff;
+        LWLockRelease(StatsLock);
+
+        if (!hold_off)
+            break;
+
+        ConditionVariableTimedSleep(&StatsShmem->holdoff_cv, 10,
+                                    WAIT_EVENT_READING_STATS_FILE);
     }
+    ConditionVariableCancelSleep();
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
     /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
+     * The last process is responsible for writing out the stats file at exit.
+     * Maintain refcount so that an exiting process can tell whether it is the
+     * last one or not.
      */
-    for (addr = addrs; addr; addr = addr->ai_next)
+    if (StatsShmem->refcount > 0)
+        StatsShmem->refcount++;
+    else
     {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
+        /* We're the first process to attach to the shared stats memory */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
 
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
+        /* Initialize shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatSharedHash = dshash_create(area, &dsh_params, 0);
 
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->hash_handle = dshash_get_hash_table_handle(pgStatSharedHash);
+        LWLockInitialize(&StatsShmem->slru_stats.lock, LWTRANCHE_STATS);
+        pg_atomic_init_u32(&StatsShmem->slru_stats.changecount, 0);
 
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        /* Block the next attacher for a while, see the comment above. */
+        StatsShmem->attach_holdoff = true;
 
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        StatsShmem->refcount = 1;
+    }
 
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    LWLockRelease(StatsLock);
 
+    if (area)
+    {
         /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
+         * We're the first attacher process, read stats file while blocking
+         * successors.
          */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
+        Assert(StatsShmem->attach_holdoff);
+        pgstat_read_statsfile();
+        allow_next_attacher();
     }
-
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
-
-    /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
-     */
-    if (!pg_set_noblock(pgStatSock))
+    else
     {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
-    }
+        /* We're not the first one; attach to the existing shared area. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatSharedHash = dshash_attach(area, &dsh_params,
+                                         StatsShmem->hash_handle, 0);
 
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
     }
 
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-    /* Now that we have a long-lived socket, tell fd.c about it. */
-    ReserveExternalFD();
+    MemoryContextSwitchTo(oldcontext);
 
-    return;
+    /* don't detach automatically */
+    dsa_pin_mapping(area);
+}
 
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
+/* ----------
+ * cleanup_dropped_stats_entries() -
+ *              Clean up shared stats entries no longer used.
+ *
+ *  Shared stats entries for dropped objects may be left referenced. Clean up
+ *  our reference and drop the shared entry if needed.
+ * ----------
+ */
+static void
+cleanup_dropped_stats_entries(void)
+{
+    pgstat_localhash_iterator i;
+    PgStatLocalHashEntry   *ent;
 
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
+    if (pgStatEntHash == NULL)
+        return;
 
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
+    pgstat_localhash_start_iterate(pgStatEntHash, &i);
+    while ((ent = pgstat_localhash_iterate(pgStatEntHash, &i))
+           != NULL)
+    {
+        /*
+         * Free the shared memory chunk for the entry if we were the last
+         * referrer to a dropped entry.
+         */
+        if (pg_atomic_sub_fetch_u32(&ent->body->refcount, 1) < 1 &&
+            ent->body->dropped)
+            dsa_free(area, ent->dsapointer);
+    }
 
     /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
+     * This function is expected to be called during backend exit. So we don't
+     * bother destroying pgStatEntHash.
      */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    pgStatEntHash = NULL;
 }
 
-/*
- * subroutine for pgstat_reset_all
+/* ----------
+ * detach_shared_stats() -
+ *
+ *    Detach shared stats. Write out to file if we're the last process and told
+ *    to do so.
+ * ----------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+detach_shared_stats(bool write_file)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    bool        is_last_detacher = false;
+
+    /* immediately return if useless */
+    if (!area || !IsUnderPostmaster)
+        return;
+
+    /* We shouldn't leave a reference to shared stats. */
+    Assert(pgStatEntHash == NULL);
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
+    /*
+     * If we are the last detacher, hold off the next attacher (if possible)
+     * until we finish writing stats file.
+     */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (--StatsShmem->refcount == 0)
     {
-        int            nchars;
-        Oid            tmp_oid;
+        StatsShmem->attach_holdoff = true;
+        is_last_detacher = true;
+    }
+    LWLockRelease(StatsLock);
 
-        /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
-         */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
-
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
-
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+    if (is_last_detacher)
+    {
+        if (write_file)
+            pgstat_write_statsfile();
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+        /* allow the next attacher, if any */
+        allow_next_attacher();
     }
-    FreeDir(dir);
+
+    /*
+     * Detach the area. It is automatically destroyed when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);
+
+    area = NULL;
+    pgStatSharedHash = NULL;
+
+    /* We are going to exit. Don't bother destroying local hashes. */
+    pgStatLocalHash = NULL;
 }
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    /* standalone server doesn't use shared stats */
+    if (!IsUnderPostmaster)
+        return;
+
+    /* we must have shared stats attached */
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
+
+    /* Startup must be the only user of shared stats */
+    Assert(StatsShmem->refcount == 1);
+
+    /*
+     * We could remove the stats file and recreate the shared memory area
+     * directly, but for simplicity we just detach and then attach again.
+     */
+    detach_shared_stats(false);
+    attach_shared_stats();
 }
 
-#ifdef EXEC_BACKEND
 
 /*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
+ * fetch_lock_statentry - common helper function to fetch and lock a stats
+ * entry for flush_tabstat, flush_funcstat and flush_dbstat.
  */
-static pid_t
-pgstat_forkexec(void)
+static PgStat_StatEntryHeader *
+fetch_lock_statentry(PgStatTypes type, Oid dboid, Oid objid, bool nowait)
 {
-    char       *av[10];
-    int            ac = 0;
+    PgStat_StatEntryHeader *header;
+
+    /* find shared table stats entry corresponding to the local entry */
+    header = (PgStat_StatEntryHeader *)
+        get_stat_entry(type, dboid, objid, nowait, true, NULL);
 
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
+    /* skip if dshash failed to acquire lock */
+    if (header == NULL)
+        return NULL;
 
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&header->lock, LW_EXCLUSIVE))
+        return NULL;
 
-    return postmaster_forkexec(ac, av);
+    return header;
 }
-#endif                            /* EXEC_BACKEND */
 
 
-/*
- * pgstat_start() -
+/* ----------
+ * get_stat_entry() -
  *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
+ *    get shared stats entry for specified type, dbid and objid.
+ *  If nowait is true, returns NULL on lock failure.
+ *
+ *  If create is true, a new entry is created when none exists yet. If found
+ *  is not NULL, it is set to true if an existing entry was found, or false
+ *  otherwise.
+ *  ----------
+ */
+static PgStat_StatEntryHeader *
+get_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait, bool create,
+               bool *found)
+{
+    PgStatHashEntry           *shhashent;
+    PgStatLocalHashEntry   *lohashent;
+    PgStat_StatEntryHeader *shheader = NULL;
+    PgStatHashKey            key;
+    bool                    shfound;
+
+    key.type        = type;
+    key.databaseid     = dbid;
+    key.objectid    = objid;
+
+    if (pgStatEntHash)
+    {
+        uint64 currage;
+
+        /*
+         * pgStatEntHashAge advances much more slowly than the following loop
+         * takes to run, so this is expected to iterate no more than twice.
+         */
+        while (unlikely
+               (pgStatEntHashAge !=
+                (currage = pg_atomic_read_u64(&StatsShmem->gc_count))))
+        {
+            pgstat_localhash_iterator i;
+
+            /*
+             * Some entries have been dropped. Invalidate cache pointer to
+             * them.
+             */
+            pgstat_localhash_start_iterate(pgStatEntHash, &i);
+            while ((lohashent = pgstat_localhash_iterate(pgStatEntHash, &i))
+                   != NULL)
+            {
+                PgStat_StatEntryHeader *header = lohashent->body;
+                if (header->dropped)
+                {
+                    pgstat_localhash_delete(pgStatEntHash, lohashent->key);
+
+                    if (pg_atomic_sub_fetch_u32(&header->refcount, 1) < 1)
+                    {
+                        /*
+                         * We're the last referrer to this entry, drop the
+                         * shared entry.
+                         */
+                        dsa_free(area, lohashent->dsapointer);
+                    }
+                }
+            }
+
+            pgStatEntHashAge = currage;
+        }
+
+        lohashent = pgstat_localhash_lookup(pgStatEntHash, key);
+
+        if (lohashent)
+        {
+            if (found)
+                *found = true;
+            return lohashent->body;
+        }
+    }
+
+    shhashent = dshash_find_extended(pgStatSharedHash, &key,
+                                     create, nowait, create, &shfound);
+    if (shhashent)
+    {
+        if (create && !shfound)
+        {
+            /* Create new stats entry. */
+            dsa_pointer chunk = dsa_allocate0(area,
+                                              pgstat_sharedentsize[type]);
+
+            shheader = dsa_get_address(area, chunk);
+            LWLockInitialize(&shheader->lock, LWTRANCHE_STATS);
+            pg_atomic_init_u32(&shheader->refcount, 0);
+
+            /* Link the new entry from the hash entry. */
+            shhashent->body = chunk;
+        }
+        else
+            shheader = dsa_get_address(area, shhashent->body);
+
+        /*
+         * We expose this shared entry now.  You might think that the entry can
+         * be removed by a concurrent backend, but since we are creating a
+         * stats entry, the object actually exists and is in use by the upper
+         * layer. Such an object cannot be dropped until the first vacuum after
+         * the current transaction ends.
+         */
+        dshash_release_lock(pgStatSharedHash, shhashent);
+
+        /* register to local hash if possible */
+        if (pgStatEntHash || pgStatCacheContext)
+        {
+            bool                    lofound;
+
+            if (pgStatEntHash == NULL)
+            {
+                pgStatEntHash =
+                    pgstat_localhash_create(pgStatCacheContext,
+                                        PGSTAT_TABLE_HASH_SIZE, NULL);
+                pgStatEntHashAge =
+                    pg_atomic_read_u64(&StatsShmem->gc_count);
+            }
+
+            lohashent =
+                pgstat_localhash_insert(pgStatEntHash, key, &lofound);
+
+            Assert(!lofound);
+            lohashent->body = shheader;
+            lohashent->dsapointer = shhashent->body;
+
+            pg_atomic_add_fetch_u32(&shheader->refcount, 1);
+        }
+    }
+
+    if (found)
+        *found = shfound;
+
+    return shheader;
+}
+
+/*
+ * flush_walstat - flush out local WAL stats entries
  *
- *    Returns PID of child process, or 0 if fail.
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
  *
- *    Note: if fail, we will be called again from the postmaster main loop.
+ * Returns true if all local WAL stats are successfully flushed out.
  */
-int
-pgstat_start(void)
+static bool
+flush_walstat(bool nowait)
 {
-    time_t        curtime;
-    pid_t        pgStatPid;
+    /* We assume this initializes to zeroes */
+    PgStat_Wal *s = &StatsShmem->wal_stats;
+    PgStat_Wal *l = &WalStats;
 
     /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
+     * This function can be called even if nothing at all has happened. In
+     * this case, avoid taking lock for a completely empty stats.
      */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
+    if (!walstats_pending())
+        return true;
 
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&StatsShmem->wal_stats_lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&StatsShmem->wal_stats_lock,
+                                       LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
+
+    s->wal_buffers_full += l->wal_buffers_full;
+    LWLockRelease(&StatsShmem->wal_stats_lock);
 
     /*
-     * Okay, fork off the collector.
+     * Clear out the statistics buffer, so it can be re-used.
      */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
+    MemSet(&WalStats, 0, sizeof(WalStats));
 
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
+    return true;
+}
 
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
+/*
+ * flush_slrustat - flush out local SLRU stats entries
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true. Writers are serialized by an LWLock,
+ * while readers are expected to use the change-count protocol to avoid
+ * interfering with writers.
+ *
+ * Returns true if all local SLRU stats are successfully flushed out.
+ */
+static bool
+flush_slrustat(bool nowait)
+{
+    uint32    assert_changecount PG_USED_FOR_ASSERTS_ONLY;
+    int i;
 
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
+    if (!have_slrustats)
+        return true;
 
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&StatsShmem->slru_stats.lock,
+                                       LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
 
-        default:
-            return (int) pgStatPid;
+    assert_changecount =
+        pg_atomic_fetch_add_u32(&StatsShmem->slru_stats.changecount, 1);
+    Assert((assert_changecount & 1) == 0);
+
+    for (i = 0 ; i < SLRU_NUM_ELEMENTS ; i++)
+    {
+        PgStat_SLRUStats *sharedent = &StatsShmem->slru_stats.entry[i];
+        PgStat_SLRUStats *localent = &local_SLRUStats[i];
+
+        sharedent->blocks_zeroed += localent->blocks_zeroed;
+        sharedent->blocks_hit += localent->blocks_hit;
+        sharedent->blocks_read += localent->blocks_read;
+        sharedent->blocks_written += localent->blocks_written;
+        sharedent->blocks_exists += localent->blocks_exists;
+        sharedent->flush += localent->flush;
+        sharedent->truncate += localent->truncate;
     }
 
-    /* shouldn't get here */
-    return 0;
+    /* done, clear the local entry */
+    MemSet(local_SLRUStats, 0,
+           sizeof(PgStat_SLRUStats) * SLRU_NUM_ELEMENTS);
+
+    pg_atomic_add_fetch_u32(&StatsShmem->slru_stats.changecount, 1);
+    LWLockRelease(&StatsShmem->slru_stats.lock);
+
+    have_slrustats = false;
+
+    return true;
 }
 
-void
-allow_immediate_pgstat_restart(void)
+/* ----------
+ * delete_current_stats_entry()
+ *
+ *  Deletes the given shared entry from shared stats hash. The entry must be
+ *  exclusively locked.
+ * ----------
+ */
+static void
+delete_current_stats_entry(dshash_seq_status *hstat)
 {
-    last_pgstat_start_time = 0;
+    dsa_pointer                pdsa;
+    PgStat_StatEntryHeader *header;
+    PgStatHashEntry *ent;
+
+    ent = dshash_get_current(hstat);
+    pdsa = ent->body;
+    header = dsa_get_address(area, pdsa);
+
+    /* No one can find this entry from now on. */
+    dshash_delete_current(hstat);
+
+    /*
+     * Let the referrers, if any, drop the entry.  Refcount won't be
+     * decremented until "dropped" is set to true and StatsShmem->gc_count is
+     * incremented later, so we can check refcount and set dropped without
+     * holding a lock. If no one refers to this entry, free it immediately.
+     */
+
+    if (pg_atomic_read_u32(&header->refcount) > 0)
+        header->dropped = true;
+    else
+        dsa_free(area, pdsa);
+
+    return;
+}
+
+/* ----------
+ * get_local_stat_entry() -
+ *
+ *  Returns local stats entry for the type, dbid and objid.
+ *  If create is true, a new entry is created when none exists yet; found
+ *  must be non-NULL in that case.
+ *
+ *
+ *  The caller is responsible for initializing the statsbody part of the
+ *  returned memory.
+ * ----------
+ */
+static PgStat_StatEntryHeader *
+get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                     bool create, bool *found)
+{
+    PgStatHashKey key;
+    PgStatLocalHashEntry *entry;
+
+    if (pgStatLocalHash == NULL)
+        pgStatLocalHash = pgstat_localhash_create(pgStatCacheContext,
+                                                  PGSTAT_TABLE_HASH_SIZE, NULL);
+
+    /* Find an entry or create a new one. */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    if (create)
+        entry = pgstat_localhash_insert(pgStatLocalHash, key, found);
+    else
+        entry = pgstat_localhash_lookup(pgStatLocalHash, key);
+
+    if (!create && !entry)
+        return NULL;
+
+    if (create && !*found)
+        entry->body = MemoryContextAllocZero(TopMemoryContext,
+                                             pgstat_localentsize[type]);
+
+    return entry->body;
 }
 
 /* ------------------------------------------------------------
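The comment on flush_slrustat above says that readers are expected to use a change-count protocol, but the reader side is not part of this hunk. The following is only a minimal sketch of that idea in standalone C11, not the patch's code. DemoSharedStats, DemoSnapshot, demo_init, demo_write and demo_read are hypothetical names, and C11 atomics stand in for PostgreSQL's pg_atomic_* API and for the LWLock that serializes writers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the shared SLRU stats area (not from the patch). */
typedef struct DemoSharedStats
{
    atomic_uint          changecount;   /* odd while an update is in progress */
    atomic_uint_fast64_t blocks_hit;
    atomic_uint_fast64_t blocks_read;
} DemoSharedStats;

/* Private copy handed back to a reader. */
typedef struct DemoSnapshot
{
    uint64_t    blocks_hit;
    uint64_t    blocks_read;
} DemoSnapshot;

static void
demo_init(DemoSharedStats *s)
{
    atomic_init(&s->changecount, 0);
    atomic_init(&s->blocks_hit, 0);
    atomic_init(&s->blocks_read, 0);
}

/* Writer side; the caller is assumed to already hold the writers' lock. */
static void
demo_write(DemoSharedStats *s, uint64_t nhit, uint64_t nread)
{
    unsigned    prev = atomic_fetch_add(&s->changecount, 1);   /* now odd */

    assert((prev & 1) == 0);    /* no other update was in progress */
    (void) prev;                /* avoid an unused warning under NDEBUG */

    atomic_fetch_add(&s->blocks_hit, nhit);
    atomic_fetch_add(&s->blocks_read, nread);

    atomic_fetch_add(&s->changecount, 1);   /* even again: update complete */
}

/* Reader side: retry until the copy is bracketed by the same even count. */
static DemoSnapshot
demo_read(DemoSharedStats *s)
{
    DemoSnapshot snap;
    unsigned    before;
    unsigned    after;

    do
    {
        before = atomic_load(&s->changecount);
        snap.blocks_hit = atomic_load(&s->blocks_hit);
        snap.blocks_read = atomic_load(&s->blocks_read);
        after = atomic_load(&s->changecount);
    } while ((before & 1) != 0 || before != after);

    return snap;
}
```

A reader that sees an odd count, or a count that changed while it was copying, simply retries, so readers never block the writer. Calling demo_init once, then demo_write from the lock holder and demo_read from anywhere, is the intended usage of this sketch.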
@@ -835,150 +1114,399 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *    Updates are applied no more frequently than once every
+ *    PGSTAT_MIN_INTERVAL milliseconds. They are also postponed on lock
+ *    failure if force is false and no update has been pending for longer
+ *    than PGSTAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
+ *    subsequent calls of this function.
+ *
+ *    Returns the time in milliseconds until the next attempt to apply
+ *    updates when some updates remain pending or it is too soon to flush,
+ *    or 0 if there is nothing left to flush.
+ *
+ *    Note that this is called only outside a transaction, so it is fine to use
  *    transaction stop time as an approximation of current time.
- * ----------
+ *    ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz    next_flush = 0;
+    static TimestampTz    pending_since = 0;
+    static long            retry_interval = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
+    bool        nowait;
     int            i;
+    uint64        oldval;
+
+    /* Return if not active */
+    if (area == NULL)
+        return 0;
+
+    /*
+     * We need a database entry if any of the following stats exist.
+     */
+    if (pgStatXactCommit > 0 || pgStatXactRollback > 0 ||
+        pgStatBlockReadTime > 0 || pgStatBlockWriteTime > 0)
+        get_local_dbstat_entry(MyDatabaseId);
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (pgStatLocalHash == NULL && !have_slrustats && !walstats_pending())
+        return 0;
 
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
     now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
-
-    /*
-     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
-     */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
-    pgStatTabHash = NULL;
-
-    /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
-     */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
-    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+
+    if (!force)
     {
-        for (i = 0; i < tsa->tsa_used; i++)
+        /*
+         * Don't flush stats too frequently.  Return the time to the next
+         * flush.
+         */
+        if (now < next_flush)
         {
-            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
-
-            /* Shouldn't have any pending transaction-dependent counts */
-            Assert(entry->trans == NULL);
-
-            /*
-             * Ignore entries that didn't accumulate any actual counts, such
-             * as indexes that were opened by the planner but not used.
-             */
-            if (memcmp(&entry->t_counts, &all_zeroes,
-                       sizeof(PgStat_TableCounts)) == 0)
+            /* Record the epoch time if retrying. */
+            if (pending_since == 0)
+                pending_since = now;
+
+            return (next_flush - now) / 1000;
+        }
+
+        /* But, don't keep pending updates longer than PGSTAT_MAX_INTERVAL. */
+
+        if (pending_since > 0 &&
+            TimestampDifferenceExceeds(pending_since, now, PGSTAT_MAX_INTERVAL))
+            force = true;
+    }
+
+    /* don't wait for lock acquisition when !force */
+    nowait = !force;
+
+    if (pgStatLocalHash)
+    {
+        int            remains = 0;
+        pgstat_localhash_iterator i;
+        List                   *dbentlist = NIL;
+        ListCell               *lc;
+        PgStatLocalHashEntry   *lent;
+
+        /* Step 1: flush out other than database stats */
+        pgstat_localhash_start_iterate(pgStatLocalHash, &i);
+        while ((lent = pgstat_localhash_iterate(pgStatLocalHash, &i)) != NULL)
+        {
+            bool        remove = false;
+
+            switch (lent->key.type)
+            {
+                case PGSTAT_TYPE_DB:
+                    /*
+                     * flush_tabstat adds some counters of the flushed
+                     * entries to the local database stats. Just remember the
+                     * database entries for now and flush them out later.
+                     */
+                    dbentlist = lappend(dbentlist, lent);
+                    break;
+                case PGSTAT_TYPE_TABLE:
+                    if (flush_tabstat(lent, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_FUNCTION:
+                    if (flush_funcstat(lent, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_REPLSLOT:
+                    /* We don't have that kind of local entry */
+                    Assert(false);
+            }
+
+            if (!remove)
+            {
+                remains++;
                 continue;
+            }
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* Remove the successfully flushed entry */
+            pfree(lent->body);
+            lent->body = NULL;
+            pgstat_localhash_delete(pgStatLocalHash, lent->key);
+        }
+
+        /* Step 2: flush out database stats */
+        foreach(lc, dbentlist)
+        {
+            PgStatLocalHashEntry *lent = (PgStatLocalHashEntry *) lfirst(lc);
+
+            if (flush_dbstat(lent, nowait))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                remains--;
+                /* Remove the successfully flushed entry */
+                pfree(lent->body);
+                lent->body = NULL;
+                pgstat_localhash_delete(pgStatLocalHash, lent->key);
             }
         }
-        /* zero out PgStat_TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
+        list_free(dbentlist);
+        dbentlist = NIL;
+
+        if (remains <= 0)
+        {
+            pgstat_localhash_destroy(pgStatLocalHash);
+            pgStatLocalHash = NULL;
+        }
+    }
+
+    /* flush wal stats */
+    flush_walstat(nowait);
+
+    /* flush SLRU stats */
+    flush_slrustat(nowait);
+
+    /*
+     * Publish the time of the last flush, but don't notify anyone of the
+     * change to the timestamp itself. Readers will get a sufficiently recent
+     * timestamp. If we fail to update the value, a concurrent process must
+     * have updated it to a sufficiently recent time.
+     *
+     * XXX: The loop might be unnecessary for the reason above.
+     */
+    oldval = pg_atomic_read_u64(&StatsShmem->stats_timestamp);
+
+    for (i = 0 ; i < 10 ; i++)
+    {
+        if (oldval >= now ||
+            pg_atomic_compare_exchange_u64(&StatsShmem->stats_timestamp,
+                                           &oldval, (uint64) now))
+            break;
     }
 
     /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
+     * Some of the local stats may not have been flushed due to lock
+     * contention.  If we have such pending local stats here, let the caller
+     * know the retry interval.
      */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    if (pgStatLocalHash != NULL || have_slrustats || walstats_pending())
+    {
+        /* Retain the epoch time */
+        if (pending_since == 0)
+            pending_since = now;
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+        /* The interval is doubled at every retry. */
+        if (retry_interval == 0)
+            retry_interval = PGSTAT_RETRY_MIN_INTERVAL * 1000;
+        else
+            retry_interval = retry_interval * 2;
 
-    /* Send WAL statistics */
-    pgstat_send_wal();
+        /*
+         * Determine the next retry interval so as not to get shorter than the
+         * previous interval.
+         */
+        if (!TimestampDifferenceExceeds(pending_since,
+                                        now + 2 * retry_interval,
+                                        PGSTAT_MAX_INTERVAL))
+            next_flush = now + retry_interval;
+        else
+        {
+            next_flush = pending_since + PGSTAT_MAX_INTERVAL * 1000;
+            retry_interval = next_flush - now;
+        }
 
-    /* Finally send SLRU statistics */
-    pgstat_send_slru();
+        return retry_interval / 1000;
+    }
+
+    /* Set the next time to update stats */
+    next_flush = now + PGSTAT_MIN_INTERVAL * 1000;
+    retry_interval = 0;
+    pending_since = 0;
+
+    return 0;
 }
 
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * flush_tabstat - flush out a local table stats entry
+ *
+ * Some of the stats numbers are copied to local database stats entry after
+ * successful flush-out.
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
  */
-static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+static bool
+flush_tabstat(PgStatLocalHashEntry *ent, bool nowait)
 {
-    int            n;
-    int            len;
+    static const PgStat_TableCounts all_zeroes;
+    Oid                    dboid;            /* database OID of the table */
+    PgStat_TableStatus *lstats;            /* local stats entry  */
+    PgStat_StatTabEntry *shtabstats;    /* table entry of shared stats */
+    PgStat_StatDBEntry *ldbstats;        /* local database entry */
+
+    Assert(ent->key.type == PGSTAT_TYPE_TABLE);
+    lstats = (PgStat_TableStatus *) ent->body;
+    dboid = ent->key.databaseid;
+
+    /*
+     * Ignore entries that didn't accumulate any actual counts, such as
+     * indexes that were opened by the planner but not used.
+     */
+    if (memcmp(&lstats->t_counts, &all_zeroes,
+               sizeof(PgStat_TableCounts)) == 0)
+    {
+        /* This local entry is going to be dropped, delink from relcache. */
+        pgstat_delinkstats(lstats->relation);
+        return true;
+    }
+
+    /* find shared table stats entry corresponding to the local entry */
+    shtabstats = (PgStat_StatTabEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_TABLE, dboid, ent->key.objectid,
+                             nowait);
+
+    if (shtabstats == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    /* add the values to the shared entry. */
+    shtabstats->numscans += lstats->t_counts.t_numscans;
+    shtabstats->tuples_returned += lstats->t_counts.t_tuples_returned;
+    shtabstats->tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    shtabstats->tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    shtabstats->tuples_updated += lstats->t_counts.t_tuples_updated;
+    shtabstats->tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    shtabstats->tuples_hot_updated += lstats->t_counts.t_tuples_hot_updated;
+
+    /*
+     * If the table was truncated or vacuum/analyze has run, first reset the
+     * live/dead counters.
+     */
+    if (lstats->t_counts.t_truncated)
+    {
+        shtabstats->n_live_tuples = 0;
+        shtabstats->n_dead_tuples = 0;
+    }
+
+    shtabstats->n_live_tuples += lstats->t_counts.t_delta_live_tuples;
+    shtabstats->n_dead_tuples += lstats->t_counts.t_delta_dead_tuples;
+    shtabstats->changes_since_analyze += lstats->t_counts.t_changed_tuples;
+    shtabstats->inserts_since_vacuum += lstats->t_counts.t_tuples_inserted;
+    shtabstats->blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    shtabstats->blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    shtabstats->n_live_tuples = Max(shtabstats->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    shtabstats->n_dead_tuples = Max(shtabstats->n_dead_tuples, 0);
+
+    LWLockRelease(&shtabstats->header.lock);
+
+    /* The entry was flushed, so add the same counts to the database stats */
+    ldbstats = get_local_dbstat_entry(dboid);
+    ldbstats->counts.n_tuples_returned += lstats->t_counts.t_tuples_returned;
+    ldbstats->counts.n_tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    ldbstats->counts.n_tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    ldbstats->counts.n_tuples_updated += lstats->t_counts.t_tuples_updated;
+    ldbstats->counts.n_tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    ldbstats->counts.n_blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    ldbstats->counts.n_blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /* This local entry is going to be dropped, delink from relcache. */
+    pgstat_delinkstats(lstats->relation);
+
+    return true;
+}
+
+/*
+ * flush_funcstat - flush out a local function stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_funcstat(PgStatLocalHashEntry *ent, bool nowait)
+{
+    PgStat_BackendFunctionEntry *localent;    /* local stats entry */
+    PgStat_StatFuncEntry *shfuncent = NULL; /* shared stats entry */
+
+    Assert(ent->key.type == PGSTAT_TYPE_FUNCTION);
+    localent = (PgStat_BackendFunctionEntry *) ent->body;
+
+    /* localent always has non-zero content */
+
+    /* find shared function stats entry corresponding to the local entry */
+    shfuncent = (PgStat_StatFuncEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             ent->key.objectid, nowait);
+    if (shfuncent == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    shfuncent->f_numcalls += localent->f_counts.f_numcalls;
+    shfuncent->f_total_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_total_time);
+    shfuncent->f_self_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_self_time);
+
+    LWLockRelease(&shfuncent->header.lock);
+
+    return true;
+}
+
+
+/*
+ * flush_dbstat - flush out a local database stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_dbstat(PgStatLocalHashEntry *ent, bool nowait)
+{
+    PgStat_StatDBEntry *localent;
+    PgStat_StatDBEntry *sharedent;
+
+    Assert(ent->key.type == PGSTAT_TYPE_DB);
+
+    localent = (PgStat_StatDBEntry *) ent->body;
+
+    /* find shared database stats entry corresponding to the local entry */
+    sharedent = (PgStat_StatDBEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_DB, ent->key.databaseid, InvalidOid,
+                             nowait);
+
+    if (!sharedent)
+        return false;            /* failed to acquire lock, skip */
+
+    sharedent->counts.n_tuples_returned += localent->counts.n_tuples_returned;
+    sharedent->counts.n_tuples_fetched += localent->counts.n_tuples_fetched;
+    sharedent->counts.n_tuples_inserted += localent->counts.n_tuples_inserted;
+    sharedent->counts.n_tuples_updated += localent->counts.n_tuples_updated;
+    sharedent->counts.n_tuples_deleted += localent->counts.n_tuples_deleted;
+    sharedent->counts.n_blocks_fetched += localent->counts.n_blocks_fetched;
+    sharedent->counts.n_blocks_hit += localent->counts.n_blocks_hit;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    sharedent->counts.n_deadlocks += localent->counts.n_deadlocks;
+    sharedent->counts.n_temp_bytes += localent->counts.n_temp_bytes;
+    sharedent->counts.n_temp_files += localent->counts.n_temp_files;
+    sharedent->counts.n_checksum_failures += localent->counts.n_checksum_failures;
 
     /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
+     * Accumulate xact commit/rollback and I/O timings to stats entry of the
+     * current database.
      */
-    if (OidIsValid(tsmsg->m_databaseid))
+    if (OidIsValid(ent->key.databaseid))
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
+        sharedent->counts.n_xact_commit += pgStatXactCommit;
+        sharedent->counts.n_xact_rollback += pgStatXactRollback;
+        sharedent->counts.n_block_read_time += pgStatBlockReadTime;
+        sharedent->counts.n_block_write_time += pgStatBlockWriteTime;
         pgStatXactCommit = 0;
         pgStatXactRollback = 0;
         pgStatBlockReadTime = 0;
@@ -986,282 +1514,134 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        sharedent->counts.n_xact_commit = 0;
+        sharedent->counts.n_xact_rollback = 0;
+        sharedent->counts.n_block_read_time = 0;
+        sharedent->counts.n_block_write_time = 0;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
+    LWLockRelease(&sharedent->header.lock);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    return true;
 }
 
-/*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
- */
-static void
-pgstat_send_funcstats(void)
-{
-    /* we assume this inits to all zeroes: */
-    static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
-
-    if (pgStatFunctions == NULL)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
-
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
-
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
-
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
-
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
-
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
-}
-
-
 /* ----------
  * pgstat_vacuum_stat() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *  Delete shared stat entries that are not in system catalogs.
+ *
+ *  To avoid holding exclusive lock on dshash for a long time, the process is
+ *  performed in three steps.
+ *
+ *   1: Collect existent oids of every kind of object.
+ *   2: Collect victim entries by scanning with shared lock.
+ *   3: Try removing every nominated entry without waiting for lock.
+ *
+ *  As the consequence of the last step, some entries may be left alone due to
+ *  lock failure, but they will be deleted by later invocations of
+ *  pgstat_vacuum_stat().
  * ----------
  */
 void
 pgstat_vacuum_stat(void)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Read pg_database and make a list of OIDs of all existing databases
-     */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
-
-    /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            dbid = dbentry->databaseid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        /* the DB entry for shared tables (with InvalidOid) is never dropped */
-        if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
-            pgstat_drop_database(dbid);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Lookup our own database entry; if not found, nothing more to do.
-     */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
+    pgstat_oid_hash       *dbids;    /* database ids */
+    pgstat_oid_hash       *relids;    /* relation ids in the current database */
+    pgstat_oid_hash       *funcids;    /* function ids in the current database */
+    int            nvictims = 0;    /* # of entries dropped in this run */
+    dshash_seq_status dshstat;
+    PgStatHashEntry *ent;
+
+    /* we don't collect stats under standalone mode */
+    if (!IsUnderPostmaster)
         return;
 
-    /*
-     * Similarly to above, make a list of all known relations in this DB.
-     */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
+    /* collect oids of existent objects */
+    dbids = collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    relids = collect_oids(RelationRelationId, Anum_pg_class_oid);
+    funcids = collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
 
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
+    nvictims = 0;
 
-    /*
-     * Check for all tables listed in stats hashtable if they still exist.
-     */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+    /* some of the dshash entries are to be removed, take exclusive lock. */
+    dshash_seq_init(&dshstat, pgStatSharedHash, true);
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
     {
-        Oid            tabid = tabentry->tableid;
+        pgstat_oid_hash *oidhash;
+        Oid           key;
 
         CHECK_FOR_INTERRUPTS();
 
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+        /*
+         * Skip entries other than database entries that don't belong to the
+         * current database.
+         */
+        if (ent->key.type != PGSTAT_TYPE_DB &&
+            ent->key.databaseid != MyDatabaseId)
             continue;
 
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
-    }
-
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
+        switch (ent->key.type)
         {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
+            case PGSTAT_TYPE_DB:
+                /* don't remove database entry for shared tables */
+                if (ent->key.databaseid == 0)
+                    continue;
+                oidhash = dbids;
+                key = ent->key.databaseid;
+                break;
+
+            case PGSTAT_TYPE_TABLE:
+                oidhash = relids;
+                key = ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                oidhash = funcids;
+                key = ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_REPLSLOT:
+                /* We don't vacuum entries of this kind */
                 continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
         }
 
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
+        /* Skip existent objects. */
+        if (pgstat_oid_lookup(oidhash, key) != NULL)
+            continue;
 
-        hash_destroy(htab);
+        /* drop this entry */
+        delete_current_stats_entry(&dshstat);
+        nvictims++;
     }
+    dshash_seq_term(&dshstat);
+    pgstat_oid_destroy(dbids);
+    pgstat_oid_destroy(relids);
+    pgstat_oid_destroy(funcids);
+
+    if (nvictims > 0)
+        pg_atomic_add_fetch_u64(&StatsShmem->gc_count, 1);
 }
 
-
 /* ----------
- * pgstat_collect_oids() -
+ * collect_oids() -
  *
  *    Collect the OIDs of all objects listed in the specified system catalog
- *    into a temporary hash table.  Caller should hash_destroy the result
+ *    into a temporary hash table.  Caller should pgstat_oid_destroy the result
  *    when done with it.  (However, we make the table in CurrentMemoryContext
  *    so that it will be freed properly in event of an error.)
  * ----------
  */
-static HTAB *
-pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
+static pgstat_oid_hash *
+collect_oids(Oid catalogid, AttrNumber anum_oid)
 {
-    HTAB       *htab;
-    HASHCTL        hash_ctl;
+    pgstat_oid_hash *rethash;
     Relation    rel;
     TableScanDesc scan;
     HeapTuple    tup;
     Snapshot    snapshot;
 
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(Oid);
-    hash_ctl.hcxt = CurrentMemoryContext;
-    htab = hash_create("Temporary table of OIDs",
-                       PGSTAT_TAB_HASH_SIZE,
-                       &hash_ctl,
-                       HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    rethash = pgstat_oid_create(CurrentMemoryContext,
+                                PGSTAT_TABLE_HASH_SIZE, NULL);
 
     rel = table_open(catalogid, AccessShareLock);
     snapshot = RegisterSnapshot(GetLatestSnapshot());
@@ -1270,81 +1650,61 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     {
         Oid            thisoid;
         bool        isnull;
+        bool        found;
 
         thisoid = heap_getattr(tup, anum_oid, RelationGetDescr(rel), &isnull);
         Assert(!isnull);
 
         CHECK_FOR_INTERRUPTS();
 
-        (void) hash_search(htab, (void *) &thisoid, HASH_ENTER, NULL);
+        pgstat_oid_insert(rethash, thisoid, &found);
     }
     table_endscan(scan);
     UnregisterSnapshot(snapshot);
     table_close(rel, AccessShareLock);
 
-    return htab;
+    return rethash;
 }
 
-
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
- * ----------
+ *    Remove entry for the database that we just dropped.
+ *
+ *    Some entries might be left behind due to lock failure, or some stats
+ *    might be flushed after this, but we will still clean up the dead DB
+ *    eventually via future invocations of pgstat_vacuum_stat().
+ * ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
 
+    Assert(OidIsValid(databaseid));
 
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
-
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
+    /* some of the dshash entries are to be removed, take exclusive lock. */
+    dshash_seq_init(&hstat, pgStatSharedHash, true);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+    {
+        if (p->key.databaseid == databaseid)
+            delete_current_stats_entry(&hstat);
+    }
+    dshash_seq_term(&hstat);
+
+    /* Let readers run a garbage collection of local hashes */
+    pg_atomic_add_fetch_u64(&StatsShmem->gc_count, 1);
 }
-#endif                            /* NOT_USED */
 
 
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1353,53 +1713,146 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    /* dshash entry is not modified, take shared lock */
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+    {
+        PgStat_StatEntryHeader *header;
+
+        if (p->key.databaseid != MyDatabaseId)
+            continue;
+
+        header = dsa_get_address(area, p->body);
+
+        LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+        memset(PGSTAT_SHENT_BODY(header), 0,
+               PGSTAT_SHENT_BODY_LEN(p->key.type));
+
+        if (p->key.type == PGSTAT_TYPE_DB)
+        {
+            PgStat_StatDBEntry *dbstat = (PgStat_StatDBEntry *) header;
+            dbstat->stat_reset_timestamp = GetCurrentTimestamp();
+        }
+        LWLockRelease(&header->lock);
+    }
+    dshash_seq_term(&hstat);
+
+    /* Invalidate the simple cache keys */
+    cached_dbent_key = stathashkey_zero;
+    cached_tabent_key = stathashkey_zero;
+    cached_funcent_key = stathashkey_zero;
+}
+
+/*
+ * pgstat_copy_global_stats - helper function for functions
+ *           pgstat_fetch_stat_*() and pgstat_reset_shared_counters().
+ *
+ * Copies out the specified memory area following change-count protocol.
+ */
+static inline void
+pgstat_copy_global_stats(void *dst, void *src, size_t len,
+                         pg_atomic_uint32 *count)
+{
+    uint32 before_changecount;
+    uint32 after_changecount;
+
+    after_changecount = pg_atomic_read_u32(count);
+
+    do
+    {
+        before_changecount = after_changecount;
+        memcpy(dst, src, len);
+        after_changecount = pg_atomic_read_u32(count);
+    } while ((before_changecount & 1) == 1 ||
+             after_changecount != before_changecount);
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
+ *
+ *  We don't scribble on shared stats while resetting to avoid locking on
+ *  shared stats struct. Instead, just record the current counters in another
+ *  shared struct, which is protected by StatsLock. See
+ *  pgstat_fetch_stat_(archiver|bgwriter|checkpointer) for the reader side.
  * ----------
  */
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    TimestampTz now = GetCurrentTimestamp();
+    PgStat_Shared_Reset_Target t;
 
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+        t = RESET_ARCHIVER;
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+        t = RESET_BGWRITER;
     else if (strcmp(target, "wal") == 0)
-        msg.m_resettarget = RESET_WAL;
+        t = RESET_WAL;
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\", \"bgwriter\" or \"wal\".")));
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
+    /* Reset statistics for the cluster. */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+    switch (t)
+    {
+        case RESET_ARCHIVER:
+            pgstat_copy_global_stats(&StatsShmem->archiver_reset_offset,
+                                     &StatsShmem->archiver_stats,
+                                     sizeof(PgStat_Archiver),
+                                     &StatsShmem->archiver_changecount);
+            StatsShmem->archiver_reset_offset.stat_reset_timestamp = now;
+            cached_archiverstats_is_valid = false;
+            break;
+
+        case RESET_BGWRITER:
+            pgstat_copy_global_stats(&StatsShmem->bgwriter_reset_offset,
+                                     &StatsShmem->bgwriter_stats,
+                                     sizeof(PgStat_BgWriter),
+                                     &StatsShmem->bgwriter_changecount);
+            pgstat_copy_global_stats(&StatsShmem->checkpointer_reset_offset,
+                                     &StatsShmem->checkpointer_stats,
+                                     sizeof(PgStat_CheckPointer),
+                                     &StatsShmem->checkpointer_changecount);
+            StatsShmem->bgwriter_reset_offset.stat_reset_timestamp = now;
+            cached_bgwriterstats_is_valid = false;
+            cached_checkpointerstats_is_valid = false;
+            break;
+
+        case RESET_WAL:
+            /*
+             * Unlike the two above, WAL statistics have many writer
+             * processes and are protected by wal_stats_lock.
+             */
+            LWLockAcquire(&StatsShmem->wal_stats_lock, LW_EXCLUSIVE);
+            MemSet(&StatsShmem->wal_stats, 0, sizeof(PgStat_Wal));
+            StatsShmem->wal_stats.stat_reset_timestamp = now;
+            LWLockRelease(&StatsShmem->wal_stats_lock);
+            cached_walstats_is_valid = false;
+            break;
+    }
+
+    LWLockRelease(StatsLock);
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1408,17 +1861,37 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatEntryHeader *header;
+    PgStat_StatDBEntry *dbentry;
+    PgStatTypes stattype;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    dbentry = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, MyDatabaseId, InvalidOid, false, false,
+                       NULL);
+    Assert(dbentry);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* Set the reset timestamp for the whole database */
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->header.lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&dbentry->header.lock);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Reset the stats entry of the specified object */
+    switch (type)
+    {
+        case RESET_TABLE:
+            stattype = PGSTAT_TYPE_TABLE;
+            break;
+        case RESET_FUNCTION:
+            stattype = PGSTAT_TYPE_FUNCTION;
+    }
+
+    header = get_stat_entry(stattype, MyDatabaseId, objoid, false, false, NULL);
+
+    LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+    memset(PGSTAT_SHENT_BODY(header), 0, PGSTAT_SHENT_BODY_LEN(stattype));
+    LWLockRelease(&header->lock);
 }
 
 /* ----------
@@ -1434,15 +1907,40 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_reset_slru_counter(const char *name)
 {
-    PgStat_MsgResetslrucounter msg;
+    int i;
+    TimestampTz    ts = GetCurrentTimestamp();
+    uint32    assert_changecount PG_USED_FOR_ASSERTS_ONLY;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    if (name)
+    {
+        i = pgstat_slru_index(name);
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+        assert_changecount =
+            pg_atomic_fetch_add_u32(&StatsShmem->slru_changecount, 1);
+        Assert((assert_changecount & 1) == 0);
+        MemSet(&StatsShmem->slru_stats.entry[i], 0,
+               sizeof(PgStat_SLRUStats));
+        StatsShmem->slru_stats.entry[i].stat_reset_timestamp = ts;
+        pg_atomic_add_fetch_u32(&StatsShmem->slru_changecount, 1);
+        LWLockRelease(&StatsShmem->slru_stats.lock);
+    }
+    else
+    {
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+        assert_changecount =
+            pg_atomic_fetch_add_u32(&StatsShmem->slru_changecount, 1);
+        Assert((assert_changecount & 1) == 0);
+        for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
+        {
+            MemSet(&StatsShmem->slru_stats.entry[i], 0,
+                   sizeof(PgStat_SLRUStats));
+            StatsShmem->slru_stats.entry[i].stat_reset_timestamp = ts;
+        }
+        pg_atomic_add_fetch_u32(&StatsShmem->slru_changecount, 1);
+        LWLockRelease(&StatsShmem->slru_stats.lock);
+    }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSLRUCOUNTER);
-    msg.m_index = (name) ? pgstat_slru_index(name) : -1;
-
-    pgstat_send(&msg, sizeof(msg));
+    cached_slrustats_is_valid = false;
 }
 
 /* ----------
@@ -1458,20 +1956,19 @@ pgstat_reset_slru_counter(const char *name)
 void
 pgstat_reset_replslot_counter(const char *name)
 {
-    PgStat_MsgResetreplslotcounter msg;
+    int            startidx;
+    int            endidx;
+    int            i;
+    TimestampTz    ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
     if (name)
     {
         ReplicationSlot *slot;
-
-        /*
-         * Check if the slot exits with the given name. It is possible that by
-         * the time this message is executed the slot is dropped but at least
-         * this check will ensure that the given name is for a valid slot.
-         */
+
+        /* Check if a slot exists with the given name. */
         LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
         slot = SearchNamedReplicationSlot(name);
         LWLockRelease(ReplicationSlotControlLock);
@@ -1489,15 +1986,36 @@ pgstat_reset_replslot_counter(const char *name)
         if (SlotIsPhysical(slot))
             return;
 
-        memcpy(&msg.m_slotname, name, NAMEDATALEN);
-        msg.clearall = false;
+        /* reset this one entry */
+        startidx = endidx = slot - ReplicationSlotCtl->replication_slots;
     }
     else
-        msg.clearall = true;
+    {
+        /* reset all existent entries */
+        startidx = 0;
+        endidx = max_replication_slots - 1;
+    }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETREPLSLOTCOUNTER);
+    ts = GetCurrentTimestamp();
+    for (i = startidx; i <= endidx; i++)
+    {
+        PgStat_ReplSlot *shent;
 
-    pgstat_send(&msg, sizeof(msg));
+        shent = (PgStat_ReplSlot *)
+            get_stat_entry(PGSTAT_TYPE_REPLSLOT,
+                           MyDatabaseId, i, false, false, NULL);
+
+        /* Skip non-existent entries */
+        if (!shent)
+            continue;
+
+        LWLockAcquire(&shent->header.lock, LW_EXCLUSIVE);
+        memset(&shent->spill_txns, 0,
+               offsetof(PgStat_ReplSlot, stat_reset_timestamp) -
+               offsetof(PgStat_ReplSlot, spill_txns));
+        shent->stat_reset_timestamp = ts;
+        LWLockRelease(&shent->header.lock);
+    }
 }
 
 /* ----------
@@ -1511,48 +2029,93 @@ pgstat_reset_replslot_counter(const char *name)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if activity stats is not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    /*
+     * End-of-vacuum is reported instantly. Report the start the same way for
+     * consistency. Vacuum doesn't run frequently and is a long-lasting
+     * operation so it doesn't matter if we get blocked here a little.
+     */
+    dbentry = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, dboid, InvalidOid, false, true, NULL);
 
-    pgstat_send(&msg, sizeof(msg));
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->header.lock, LW_EXCLUSIVE);
+    dbentry->last_autovac_time = ts;
+    LWLockRelease(&dbentry->header.lock);
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report about the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    PgStat_StatTabEntry *tabentry;
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    ts = GetCurrentTimestamp();
+
+    /*
+     * Unlike ordinary operations, maintenance commands take a long time, so
+     * getting blocked at the end of the work doesn't matter. Furthermore,
+     * this prevents the stats updates made by transactions that end after
+     * this vacuum from being canceled by a delayed vacuum report. Update the
+     * shared stats entry directly for the above reasons.
+     */
+    tabentry = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, tableoid, false, true, NULL);
+
+    LWLockAcquire(&tabentry->header.lock, LW_EXCLUSIVE);
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * It is quite possible that a non-aggressive VACUUM ended up skipping
+     * various pages, however, we'll zero the insert counter here regardless.
+     * It's currently used only to track when we need to perform an "insert"
+     * autovacuum, which are mainly intended to freeze newly inserted tuples.
+     * Zeroing this may just mean we'll not try to vacuum the table again
+     * until enough tuples have been inserted to trigger another insert
+     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
+     * stragglers.
+     */
+    tabentry->inserts_since_vacuum = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = ts;
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = ts;
+        tabentry->vacuum_count++;
+    }
+
+    LWLockRelease(&tabentry->header.lock);
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report about the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1563,9 +2126,11 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    PgStat_StatTabEntry *tabentry;
+    Oid        dboid = (rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId);
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1573,10 +2138,10 @@ pgstat_report_analyze(Relation rel,
      * already inserted and/or deleted rows in the target table. ANALYZE will
      * have counted such rows as live or dead respectively. Because we will
      * report our counts of such rows at transaction end, we should subtract
-     * off these counts from what we send to the collector now, else they'll
-     * be double-counted after commit.  (This approach also ensures that the
-     * collector ends up with the right numbers if we abort instead of
-     * committing.)
+     * off these counts from what we write to shared stats now, else they'll
+     * be double-counted after commit.  (This approach also ensures that the
+     * shared stats end up with the right numbers if we abort instead of
+     * committing.)
      */
     if (rel->pgstat_info != NULL)
     {
@@ -1594,137 +2159,220 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Unlike ordinary operations, maintenance commands take a long time, so
+     * getting blocked at the end of the work doesn't matter. Furthermore,
+     * this prevents the stats updates made by transactions that end after
+     * this analyze from being canceled by a delayed analyze report. Update
+     * the shared stats entry directly for the above reasons.
+     */
+    tabentry = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, RelationGetRelid(rel),
+                       false, true, NULL);
+
+    LWLockAcquire(&tabentry->header.lock, LW_EXCLUSIVE);
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    LWLockRelease(&tabentry->header.lock);
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            dbent->counts.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            dbent->counts.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            dbent->counts.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            dbent->counts.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            dbent->counts.n_conflict_startup_deadlock++;
+            break;
+    }
 }
 
+
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_deadlocks++;
 }
 
-
-
 /* --------
- * pgstat_report_checksum_failures_in_db() -
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
  *
- *    Tell the collector about one or more checksum failures.
+ *    Report one or more checksum failures.
  * --------
  */
 void
 pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
 {
-    PgStat_MsgChecksumFailure msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
+    dbentry = get_local_dbstat_entry(dboid);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* add the given failure count to the accumulated count */
+    dbentry->counts.n_checksum_failures += failurecount;
 }
 
 /* --------
  * pgstat_report_checksum_failure() -
  *
- *    Tell the collector about a checksum failure.
+ *    Report a checksum failure.
  * --------
  */
 void
 pgstat_report_checksum_failure(void)
 {
-    pgstat_report_checksum_failures_in_db(MyDatabaseId, 1);
+    PgStat_StatDBEntry *dbent;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_checksum_failures++;
 }
 
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
+    if (filesize == 0)            /* Is there a case where filesize is really 0? */
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_temp_bytes += filesize; /* needs overflow check */
+    dbent->counts.n_temp_files++;
 }
 
 /* ----------
  * pgstat_report_replslot() -
  *
- *    Tell the collector about replication slot statistics.
+ *    Report replication slot activity.
  * ----------
  */
 void
-pgstat_report_replslot(const char *slotname, int spilltxns, int spillcount,
-                       int spillbytes, int streamtxns, int streamcount, int streambytes)
+pgstat_report_replslot(const char *slotname,
+                       int spilltxns, int spillcount, int spillbytes,
+                       int streamtxns, int streamcount, int streambytes)
 {
-    PgStat_MsgReplSlot msg;
-
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_REPLSLOT);
-    memcpy(&msg.m_slotname, slotname, NAMEDATALEN);
-    msg.m_drop = false;
-    msg.m_spill_txns = spilltxns;
-    msg.m_spill_count = spillcount;
-    msg.m_spill_bytes = spillbytes;
-    msg.m_stream_txns = streamtxns;
-    msg.m_stream_count = streamcount;
-    msg.m_stream_bytes = streambytes;
-    pgstat_send(&msg, sizeof(PgStat_MsgReplSlot));
+    PgStat_ReplSlot *shent;
+    int            i;
+    bool        found;
+
+    if (!area)
+        return;
+
+    for (i = 0 ; i < max_replication_slots ; i++)
+    {
+        if (strcmp(NameStr(ReplicationSlotCtl->replication_slots[i].data.name),
+                   slotname) == 0)
+            break;
+
+    }
+
+    /* XXX: maybe the slot has been removed; just ignore it. */
+    if (i == max_replication_slots)
+        return;
+
+    shent = (PgStat_ReplSlot *)
+        get_stat_entry(PGSTAT_TYPE_REPLSLOT,
+                       MyDatabaseId, i, false, true, &found);
+
+    /* Clear the counters and reset dropped when we reuse it */
+    LWLockAcquire(&shent->header.lock, LW_EXCLUSIVE);
+    if (shent->dropped || !found)
+    {
+        strlcpy(shent->slotname, slotname, NAMEDATALEN);
+        memset(&shent->spill_txns, 0,
+               sizeof(PgStat_ReplSlot) - offsetof(PgStat_ReplSlot, spill_txns));
+        shent->dropped = false;
+    }
+
+    shent->spill_txns += spilltxns;
+    shent->spill_count += spillcount;
+    shent->spill_bytes += spillbytes;
+    shent->stream_txns += streamtxns;
+    shent->stream_count += streamcount;
+    shent->stream_bytes += streambytes;
+    LWLockRelease(&shent->header.lock);
 }
 
 /* ----------
@@ -1736,55 +2384,44 @@ pgstat_report_replslot(const char *slotname, int spilltxns, int spillcount,
 void
 pgstat_report_replslot_drop(const char *slotname)
 {
-    PgStat_MsgReplSlot msg;
+    int i;
+    PgStat_ReplSlot *shent;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_REPLSLOT);
-    memcpy(&msg.m_slotname, slotname, NAMEDATALEN);
-    msg.m_drop = true;
-    pgstat_send(&msg, sizeof(PgStat_MsgReplSlot));
-}
+    Assert(area);
+    if (!area)
+        return;
 
-/* ----------
- * pgstat_ping() -
- *
- *    Send some junk data to the collector to increase traffic.
- * ----------
- */
-void
-pgstat_ping(void)
-{
-    PgStat_MsgDummy msg;
+    for (i = 0 ; i < max_replication_slots ; i++)
+    {
+        if (strcmp(NameStr(ReplicationSlotCtl->replication_slots[i].data.name),
+                   slotname) == 0)
+            break;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    /* XXX: maybe the slot has been removed; just ignore it. */
+    if (i == max_replication_slots)
+        return;
+
+    shent = (PgStat_ReplSlot *)
+        get_stat_entry(PGSTAT_TYPE_REPLSLOT,
+                       MyDatabaseId, i, false, false, NULL);
+
+    if (shent && !shent->dropped)
+    {
+        LWLockAcquire(&shent->header.lock, LW_EXCLUSIVE);
+        shent->dropped = true;
+        LWLockRelease(&shent->header.lock);
+    }
 }
 
 /* ----------
- * pgstat_send_inquiry() -
+ * pgstat_init_function_usage() -
  *
- *    Notify collector that we need fresh data.
+ *  Initialize function call usage data.
+ *  Called by the executor before invoking a function.
  * ----------
  */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/*
- * Initialize function call usage data.
- * Called by the executor before invoking a function.
- */
 void
 pgstat_init_function_usage(FunctionCallInfo fcinfo,
                            PgStat_FunctionCallUsage *fcu)
@@ -1799,25 +2436,9 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
         return;
     }
 
-    if (!pgStatFunctions)
-    {
-        /* First time through - initialize function stat table */
-        HASHCTL        hash_ctl;
-
-        memset(&hash_ctl, 0, sizeof(hash_ctl));
-        hash_ctl.keysize = sizeof(Oid);
-        hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
-        pgStatFunctions = hash_create("Function stat entries",
-                                      PGSTAT_FUNCTION_HASH_SIZE,
-                                      &hash_ctl,
-                                      HASH_ELEM | HASH_BLOBS);
-    }
-
-    /* Get the stats entry for this function, create if necessary */
-    htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
-                          HASH_ENTER, &found);
-    if (!found)
-        MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
+    htabent = (PgStat_BackendFunctionEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             fcinfo->flinfo->fn_oid, true, &found);
 
     fcu->fs = &htabent->f_counts;
 
@@ -1831,31 +2452,37 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
     INSTR_TIME_SET_CURRENT(fcu->f_start);
 }
 
-/*
- * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
- *        for specified function
+/* ----------
+ * find_funcstat_entry() -
  *
- * If no entry, return NULL, don't create a new one
+ *  Find the existing PgStat_BackendFunctionEntry for the specified function.
+ *
+ *  If there is no entry, return NULL; don't create a new one.
+ * ----------
  */
 PgStat_BackendFunctionEntry *
 find_funcstat_entry(Oid func_id)
 {
-    if (pgStatFunctions == NULL)
-        return NULL;
+    PgStat_BackendFunctionEntry *ent;
 
-    return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
-                                                       (void *) &func_id,
-                                                       HASH_FIND, NULL);
+    ent = (PgStat_BackendFunctionEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             func_id, false, NULL);
+
+    return ent;
 }
 
-/*
- * Calculate function call usage and update stat counters.
- * Called by the executor after invoking a function.
+/* ----------
+ * pgstat_end_function_usage() -
  *
- * In the case of a set-returning function that runs in value-per-call mode,
- * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
- * calls for what the user considers a single call of the function.  The
- * finalize flag should be TRUE on the last call.
+ *  Calculate function call usage and update stat counters.
+ *  Called by the executor after invoking a function.
+ *
+ *  In the case of a set-returning function that runs in value-per-call mode,
+ *  we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
+ *  calls for what the user considers a single call of the function.  The
+ *  finalize flag should be TRUE on the last call.
+ * ----------
  */
 void
 pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
@@ -1896,9 +2523,6 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
         fs->f_numcalls++;
     fs->f_total_time = f_total;
     INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
 }
 
 
@@ -1910,8 +2534,7 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
  *
  *    We assume that a relcache entry's pgstat_info field is zeroed by
  *    relcache.c when the relcache entry is made; thereafter it is long-lived
- *    data.  We can avoid repeated searches of the TabStatus arrays when the
- *    same relation is touched repeatedly within a transaction.
+ *    data.
  * ----------
  */
 void
@@ -1927,7 +2550,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1938,121 +2562,60 @@ pgstat_initstats(Relation rel)
      * If we already set up this relation in the current transaction, nothing
      * to do.
      */
-    if (rel->pgstat_info != NULL &&
-        rel->pgstat_info->t_id == rel_id)
+    if (rel->pgstat_info != NULL)
         return;
 
     /* Else find or make the PgStat_TableStatus entry, and update link */
-    rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+    rel->pgstat_info = get_local_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+
+    /* don't allow linking a stats entry to multiple relcache entries */
+    Assert(rel->pgstat_info->relation == NULL);
+    /* mark this relation as the owner */
+    rel->pgstat_info->relation = rel;
 }
 
 /*
- * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
+ * pgstat_delinkstats() -
+ *
+ *  Break the mutual link between a relcache entry and a local stats entry.
+ *  This must always be called when one end of the link is removed.
  */
-static PgStat_TableStatus *
-get_tabstat_entry(Oid rel_id, bool isshared)
+void
+pgstat_delinkstats(Relation rel)
 {
-    TabStatHashEntry *hash_entry;
-    PgStat_TableStatus *entry;
-    TabStatusArray *tsa;
-    bool        found;
-
-    /*
-     * Create hash table if we don't have it already.
-     */
-    if (pgStatTabHash == NULL)
+    /* remove the link to stats info if any */
+    if (rel && rel->pgstat_info)
     {
-        HASHCTL        ctl;
-
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
+        /* link sanity check */
+        Assert(rel->pgstat_info->relation == rel);
+        rel->pgstat_info->relation = NULL;
+        rel->pgstat_info = NULL;
     }
-
-    /*
-     * Find an entry or create a new one.
-     */
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
-    if (!found)
-    {
-        /* initialize new entry with null pointer */
-        hash_entry->tsa_entry = NULL;
-    }
-
-    /*
-     * If entry is already valid, we're done.
-     */
-    if (hash_entry->tsa_entry)
-        return hash_entry->tsa_entry;
-
-    /*
-     * Locate the first pgStatTabList entry with free space, making a new list
-     * entry if needed.  Note that we could get an OOM failure here, but if so
-     * we have left the hashtable and the list in a consistent state.
-     */
-    if (pgStatTabList == NULL)
-    {
-        /* Set up first pgStatTabList entry */
-        pgStatTabList = (TabStatusArray *)
-            MemoryContextAllocZero(TopMemoryContext,
-                                   sizeof(TabStatusArray));
-    }
-
-    tsa = pgStatTabList;
-    while (tsa->tsa_used >= TABSTAT_QUANTUM)
-    {
-        if (tsa->tsa_next == NULL)
-            tsa->tsa_next = (TabStatusArray *)
-                MemoryContextAllocZero(TopMemoryContext,
-                                       sizeof(TabStatusArray));
-        tsa = tsa->tsa_next;
-    }
-
-    /*
-     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
-     * the entry was already zeroed, either at creation or after last use.
-     */
-    entry = &tsa->tsa_entries[tsa->tsa_used++];
-    entry->t_id = rel_id;
-    entry->t_shared = isshared;
-
-    /*
-     * Now we can fill the entry in pgStatTabHash.
-     */
-    hash_entry->tsa_entry = entry;
-
-    return entry;
 }
 
 /*
  * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_TableStatus entry for rel_id in the current
+ *  database. If not found, look it up among the shared tables.
  *
- * Note: if we got an error in the most recent execution of pgstat_report_stat,
- * it's possible that an entry exists but there's no hashtable entry for it.
- * That's okay, we'll treat this case as "doesn't exist".
+ *  If no entry is found, return NULL; don't create a new one.
+ * ----------
  */
 PgStat_TableStatus *
 find_tabstat_entry(Oid rel_id)
 {
-    TabStatHashEntry *hash_entry;
+    PgStat_TableStatus *ent;
 
-    /* If hashtable doesn't exist, there are no entries at all */
-    if (!pgStatTabHash)
-        return NULL;
+    ent = (PgStat_TableStatus *)
+        get_local_stat_entry(PGSTAT_TYPE_TABLE, MyDatabaseId, rel_id,
+                             false, NULL);
+    if (!ent)
+        ent = (PgStat_TableStatus *)
+            get_local_stat_entry(PGSTAT_TYPE_TABLE, InvalidOid, rel_id,
+                                 false, NULL);
 
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
-    if (!hash_entry)
-        return NULL;
-
-    /* Note that this step could also return NULL, but that's correct */
-    return hash_entry->tsa_entry;
+    return ent;
 }
 
 /*
@@ -2467,8 +3030,6 @@ AtPrepare_PgStat(void)
             record.inserted_pre_trunc = trans->inserted_pre_trunc;
             record.updated_pre_trunc = trans->updated_pre_trunc;
             record.deleted_pre_trunc = trans->deleted_pre_trunc;
-            record.t_id = tabstat->t_id;
-            record.t_shared = tabstat->t_shared;
             record.t_truncated = trans->truncated;
 
             RegisterTwoPhaseRecord(TWOPHASE_RM_PGSTAT_ID, 0,
@@ -2483,8 +3044,8 @@ AtPrepare_PgStat(void)
  *
  * All we need do here is unlink the transaction stats state from the
  * nontransactional state.  The nontransactional action counts will be
- * reported to the stats collector immediately, while the effects on live
- * and dead tuple counts are preserved in the 2PC state file.
+ * reported to the activity stats facility immediately, while the effects on
+ * live and dead tuple counts are preserved in the 2PC state file.
  *
  * Note: AtEOXact_PgStat is not called during PREPARE.
  */
@@ -2529,7 +3090,7 @@ pgstat_twophase_postcommit(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, commit case */
     pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
@@ -2565,7 +3126,7 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, abort case */
     if (rec->t_truncated)
@@ -2585,85 +3146,138 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    Find the database stats entry on backends.
+ *
+ *    The returned entry is stored in static memory, so its content is valid
+ *    until the next call of this function for a different database.
  * ----------
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    PgStat_StatDBEntry *shent;
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for InvalidOid */
+    Assert(dbid != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_dbent_key.databaseid == dbid)
+        return &cached_dbent;
+
+    shent = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid, true, false, NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_dbent, shent, sizeof(PgStat_StatDBEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_dbent_key.databaseid = dbid;
+
+    return &cached_dbent;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one table or NULL. NULL doesn't mean
+ *    the activity statistics for one table or NULL. NULL doesn't mean
  *    that the table doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    activity statistics facilities, so the caller is better off to
+ *    report ZERO instead.
  * ----------
  */
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
-    PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_extended(false, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
-
-    return NULL;
+    tabentry = pgstat_fetch_stat_tabentry_extended(true, relid);
+    return tabentry;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_tabentry_extended() -
+ *
+ *    Find the table stats entry on backends. The returned entry is stored
+ *    in static memory, so its content is valid until the next call of this
+ *    function for a different table.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_extended(bool shared, Oid reloid)
+{
+    PgStat_StatTabEntry *shent;
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for InvalidOid */
+    Assert(reloid != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_tabent_key.databaseid == dboid &&
+        cached_tabent_key.objectid == reloid)
+        return &cached_tabent;
+
+    shent = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, reloid, true, false, NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_tabent, shent, sizeof(PgStat_StatTabEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_tabent_key.databaseid = dboid;
+    cached_tabent_key.objectid = reloid;
+
+    return &cached_tabent;
+}
+
+
+/* ----------
+ * pgstat_copy_index_counters() -
+ *
+ *    Support function for index swapping. Copy a portion of the counters of the
+ *    relation to the specified place.
+ * ----------
+ */
+void
+pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst)
+{
+    PgStat_StatTabEntry *tabentry;
+
+    /* No point fetching tabentry when dst is NULL */
+    if (!dst)
+        return;
+
+    tabentry = pgstat_fetch_stat_tabentry(relid);
+
+    if (!tabentry)
+        return;
+
+    dst->t_counts.t_numscans = tabentry->numscans;
+    dst->t_counts.t_tuples_returned = tabentry->tuples_returned;
+    dst->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
+    dst->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
+    dst->t_counts.t_blocks_hit = tabentry->blocks_hit;
 }
 
 
@@ -2672,30 +3286,46 @@ pgstat_fetch_stat_tabentry(Oid relid)
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
  *    the collected statistics for one function or NULL.
+ *
+ *    The returned entry is stored in static memory, so its content is valid
+ *    only until the next call of this function for a different function id.
  * ----------
  */
 PgStat_StatFuncEntry *
 pgstat_fetch_stat_funcentry(Oid func_id)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry = NULL;
-
-    /* load the stats file if needed */
-    backend_read_statsfile();
-
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
-
-    return funcentry;
+    PgStat_StatFuncEntry *shent;
+    Oid    dboid = MyDatabaseId;
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for InvalidOid */
+    Assert(func_id != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_funcent_key.databaseid == dboid &&
+        cached_funcent_key.objectid == func_id)
+        return &cached_funcent;
+
+    shent = (PgStat_StatFuncEntry *)
+        get_stat_entry(PGSTAT_TYPE_FUNCTION, dboid, func_id, true, false,
+                       NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_funcent, shent, sizeof(PgStat_StatFuncEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_funcent_key.databaseid = dboid;
+    cached_funcent_key.objectid = func_id;
+
+    return &cached_funcent;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_beentry() -
  *
@@ -2755,53 +3385,160 @@ pgstat_fetch_stat_numbackends(void)
     return localNumBackends;
 }
 
+/*
+ * ---------
+ * pgstat_get_stat_timestamp() -
+ *
+ *  Returns the last update timestamp of global statistics.
+ */
+TimestampTz
+pgstat_get_stat_timestamp(void)
+{
+    return (TimestampTz) pg_atomic_read_u64(&StatsShmem->stats_timestamp);
+}
+
 /*
  * ---------
  * pgstat_fetch_stat_archiver() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the archiver statistics struct.
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_ArchiverStats *
+PgStat_Archiver *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    PgStat_Archiver reset;
+    PgStat_Archiver *reset_shared = &StatsShmem->archiver_reset_offset;
+    PgStat_Archiver *shared = &StatsShmem->archiver_stats;
+    PgStat_Archiver *cached = &cached_archiverstats;
 
-    return &archiverStats;
+    pgstat_copy_global_stats(cached, shared, sizeof(PgStat_Archiver),
+                             &StatsShmem->archiver_changecount);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&reset, reset_shared, sizeof(PgStat_Archiver));
+    LWLockRelease(StatsLock);
+
+    /* compensate by reset offsets */
+    if (cached->archived_count == reset.archived_count)
+    {
+        cached->last_archived_wal[0] = 0;
+        cached->last_archived_timestamp = 0;
+    }
+    cached->archived_count -= reset.archived_count;
+
+    if (cached->failed_count == reset.failed_count)
+    {
+        cached->last_failed_wal[0] = 0;
+        cached->last_failed_timestamp = 0;
+    }
+    cached->failed_count -= reset.failed_count;
+
+    cached->stat_reset_timestamp = reset.stat_reset_timestamp;
+
+    cached_archiverstats_is_valid = true;
+
+    return &cached_archiverstats;
 }
 
 
 /*
  * ---------
- * pgstat_fetch_global() -
+ * pgstat_fetch_stat_bgwriter() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the global statistics struct.
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_GlobalStats *
-pgstat_fetch_global(void)
+PgStat_BgWriter *
+pgstat_fetch_stat_bgwriter(void)
 {
-    backend_read_statsfile();
+    PgStat_BgWriter reset;
+    PgStat_BgWriter *reset_shared = &StatsShmem->bgwriter_reset_offset;
+    PgStat_BgWriter *shared = &StatsShmem->bgwriter_stats;
+    PgStat_BgWriter *cached = &cached_bgwriterstats;
+
+    pgstat_copy_global_stats(cached, shared, sizeof(PgStat_BgWriter),
+                             &StatsShmem->bgwriter_changecount);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&reset, reset_shared, sizeof(PgStat_BgWriter));
+    LWLockRelease(StatsLock);
+
+    /* compensate by reset offsets */
+    cached->buf_written_clean -= reset.buf_written_clean;
+    cached->maxwritten_clean  -= reset.maxwritten_clean;
+    cached->buf_alloc          -= reset.buf_alloc;
+    cached->stat_reset_timestamp = reset.stat_reset_timestamp;
+
+    cached_bgwriterstats_is_valid = true;
+
+    return &cached_bgwriterstats;
+}
+
+/*
+ * ---------
+ * pgstat_fetch_stat_checkpointer() -
+ *
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
+ * ---------
+ */
+PgStat_CheckPointer *
+pgstat_fetch_stat_checkpointer(void)
+{
+    PgStat_CheckPointer reset;
+    PgStat_CheckPointer *reset_shared = &StatsShmem->checkpointer_reset_offset;
+    PgStat_CheckPointer *shared = &StatsShmem->checkpointer_stats;
+    PgStat_CheckPointer *cached = &cached_checkpointerstats;
+
+    pgstat_copy_global_stats(cached, shared, sizeof(PgStat_CheckPointer),
+                             &StatsShmem->checkpointer_changecount);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&reset, reset_shared, sizeof(PgStat_CheckPointer));
+    LWLockRelease(StatsLock);
+
+    /* compensate by reset offsets */
+    cached->timed_checkpoints       -= reset.timed_checkpoints;
+    cached->requested_checkpoints   -= reset.requested_checkpoints;
+    cached->buf_written_checkpoints -= reset.buf_written_checkpoints;
+    cached->buf_written_backend     -= reset.buf_written_backend;
+    cached->buf_fsync_backend       -= reset.buf_fsync_backend;
+    cached->checkpoint_write_time   -= reset.checkpoint_write_time;
+    cached->checkpoint_sync_time    -= reset.checkpoint_sync_time;
+
+    cached_checkpointerstats_is_valid = true;
 
-    return &globalStats;
+    return &cached_checkpointerstats;
 }
 
 /*
  * ---------
  * pgstat_fetch_stat_wal() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the WAL statistics struct.
+ *    Support function for the SQL-callable pgstat* functions. The returned entry
+ *  is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_WalStats *
+PgStat_Wal *
 pgstat_fetch_stat_wal(void)
 {
-    backend_read_statsfile();
+    if (!cached_walstats_is_valid)
+    {
+        LWLockAcquire(StatsLock, LW_SHARED);
+        memcpy(&cached_walstats, &StatsShmem->wal_stats, sizeof(PgStat_Wal));
+        LWLockRelease(StatsLock);
+    }
 
-    return &walStats;
+    cached_walstats_is_valid = true;
+
+    return &cached_walstats;
 }
 
 /*
@@ -2815,9 +3552,27 @@ pgstat_fetch_stat_wal(void)
 PgStat_SLRUStats *
 pgstat_fetch_slru(void)
 {
-    backend_read_statsfile();
+    size_t size = sizeof(PgStat_SLRUStats) * SLRU_NUM_ELEMENTS;
 
-    return slruStats;
+    for (;;)
+    {
+        uint32 before_count;
+        uint32 after_count;
+
+        before_count = pg_atomic_read_u32(&StatsShmem->slru_changecount);
+        pg_read_barrier();
+        memcpy(&cached_slrustats, &StatsShmem->slru_stats, size);
+        pg_read_barrier();
+        after_count = pg_atomic_read_u32(&StatsShmem->slru_changecount);
+        if (before_count == after_count && (before_count & 1) == 0)
+            break;
+
+        CHECK_FOR_INTERRUPTS();
+    }
+
+    cached_slrustats_is_valid = true;
+
+    return &cached_slrustats;
 }
 
 /*
@@ -2829,13 +3584,41 @@ pgstat_fetch_slru(void)
  *    number of entries in nslots_p.
  * ---------
  */
-PgStat_ReplSlotStats *
+PgStat_ReplSlot *
 pgstat_fetch_replslot(int *nslots_p)
 {
-    backend_read_statsfile();
 
-    *nslots_p = nReplSlotStats;
-    return replSlotStats;
+    if (cached_replslotstats == NULL)
+    {
+        cached_replslotstats = (PgStat_ReplSlot *)
+            MemoryContextAlloc(pgStatCacheContext,
+                               sizeof(PgStat_ReplSlot) * max_replication_slots);
+    }
+
+    if (n_cached_replslotstats < 0)
+    {
+        int n = 0;
+        int i;
+
+        for (i = 0; i < max_replication_slots; i++)
+        {
+            PgStat_ReplSlot *shent = (PgStat_ReplSlot *)
+                get_stat_entry(PGSTAT_TYPE_REPLSLOT, MyDatabaseId, i,
+                               false, false, NULL);
+            if (shent && !shent->dropped)
+            {
+                memcpy(cached_replslotstats[n++].slotname,
+                       shent->slotname,
+                       sizeof(PgStat_ReplSlot) -
+                       offsetof(PgStat_ReplSlot, slotname));
+            }
+        }
+
+        n_cached_replslotstats = n;
+    }
+
+    *nslots_p = n_cached_replslotstats;
+    return cached_replslotstats;
 }
 
 /* ------------------------------------------------------------
@@ -3048,8 +3831,8 @@ pgstat_initialize(void)
         MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
     }
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* This hook needs to be called before DSM shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -3225,12 +4008,15 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+    /* attach shared database stats area */
+    attach_shared_stats();
 }
 
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
- * Flush any remaining statistics counts out to the collector.
+ * Flush any remaining statistics counts out to shared stats.
  * Without this, operations triggered during backend exit (such as
  * temp table deletions) won't be counted.
  *
@@ -3243,13 +4029,22 @@ pgstat_beshutdown_hook(int code, Datum arg)
 
     /*
      * If we got as far as discovering our own database ID, we can report what
-     * we did to the collector.  Otherwise, we'd be sending an invalid
+     * we did to the shared stats.  Otherwise, we'd be sending an invalid
      * database ID, so forget it.  (This means that accesses to pg_database
      * during failed backend starts might never get counted.)
      */
     if (OidIsValid(MyDatabaseId))
         pgstat_report_stat(true);
 
+    /*
+     * We need to clean up temporary slots before detaching shared statistics
+     * so that the statistics for temporary slots are properly removed.
+     */
+    if (MyReplicationSlot != NULL)
+        ReplicationSlotRelease();
+
+    ReplicationSlotCleanup();
+
     /*
      * Clear my status entry, following the protocol of bumping st_changecount
      * before and after.  We use a volatile pointer here to ensure the
@@ -3260,6 +4055,10 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    cleanup_dropped_stats_entries();
+
+    detach_shared_stats(true);
 }
 
 
@@ -3520,7 +4319,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3815,8 +4615,8 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
+        case WAIT_EVENT_READING_STATS_FILE:
+            event_name = "ReadingStatsFile";
             break;
         case WAIT_EVENT_RECOVERY_WAL_STREAM:
             event_name = "RecoveryWalStream";
@@ -4470,94 +5270,80 @@ pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
 
 
 /* ----------
- * pgstat_setheader() -
+ * pgstat_report_archiver() -
  *
- *        Set common header fields in a statistics message
+ *        Report archiver statistics
  * ----------
  */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
+void
+pgstat_report_archiver(const char *xlog, bool failed)
 {
-    int            rc;
+    TimestampTz now = GetCurrentTimestamp();
+    uint32    before_count PG_USED_FOR_ASSERTS_ONLY;
+    uint32    after_count PG_USED_FOR_ASSERTS_ONLY;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
 
-    ((PgStat_MsgHdr *) msg)->m_size = len;
+    START_CRIT_SECTION();
+    before_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->archiver_changecount, 1);
+    Assert((before_count & 1) == 0);
 
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
+    if (failed)
     {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
- *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
- * ----------
- */
-void
-pgstat_send_archiver(const char *xlog, bool failed)
-{
-    PgStat_MsgArchiver msg;
+        ++StatsShmem->archiver_stats.failed_count;
+        strlcpy(StatsShmem->archiver_stats.last_failed_wal, xlog,
+                sizeof(StatsShmem->archiver_stats.last_failed_wal));
+        StatsShmem->archiver_stats.last_failed_timestamp = now;
+    }
+    else
+    {
+        ++StatsShmem->archiver_stats.archived_count;
+        strlcpy(StatsShmem->archiver_stats.last_archived_wal, xlog,
+                sizeof(StatsShmem->archiver_stats.last_archived_wal));
+        StatsShmem->archiver_stats.last_archived_timestamp = now;
+    }
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    strlcpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+    after_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->archiver_changecount, 1);
+    Assert(after_count == before_count + 1);
+    END_CRIT_SECTION();
 }
 
 /* ----------
- * pgstat_send_bgwriter() -
+ * pgstat_report_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
-pgstat_send_bgwriter(void)
+pgstat_report_bgwriter(void)
 {
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+    PgStat_BgWriter *s = &StatsShmem->bgwriter_stats;
+    PgStat_BgWriter *l = &BgWriterStats;
+    uint32    before_count PG_USED_FOR_ASSERTS_ONLY;
+    uint32    after_count PG_USED_FOR_ASSERTS_ONLY;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid taking the lock for completely empty stats.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    START_CRIT_SECTION();
+    before_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->bgwriter_changecount, 1);
+    Assert((before_count & 1) == 0);
+
+    s->buf_written_clean += l->buf_written_clean;
+    s->maxwritten_clean += l->maxwritten_clean;
+    s->buf_alloc += l->buf_alloc;
+
+    after_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->bgwriter_changecount, 1);
+    Assert(after_count == before_count + 1);
+    END_CRIT_SECTION();
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4566,519 +5352,134 @@ pgstat_send_bgwriter(void)
 }
 
 /* ----------
- * pgstat_send_wal() -
+ * pgstat_report_checkpointer() -
  *
- *        Send WAL statistics to the collector
+ *        Report checkpointer statistics
  * ----------
  */
 void
-pgstat_send_wal(void)
+pgstat_report_checkpointer(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgWal all_zeroes;
+    static const PgStat_CheckPointer all_zeroes;
+    PgStat_CheckPointer *s = &StatsShmem->checkpointer_stats;
+    PgStat_CheckPointer *l = &CheckPointerStats;
+    uint32    before_count PG_USED_FOR_ASSERTS_ONLY;
+    uint32    after_count PG_USED_FOR_ASSERTS_ONLY;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid taking the lock for completely empty stats.
      */
-    if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
+    if (memcmp(&CheckPointerStats, &all_zeroes, sizeof(PgStat_CheckPointer)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&WalStats.m_hdr, PGSTAT_MTYPE_WAL);
-    pgstat_send(&WalStats, sizeof(WalStats));
-
+    START_CRIT_SECTION();
+    before_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->checkpointer_changecount, 1);
+    Assert((before_count & 1) == 0);
+
+    s->timed_checkpoints += l->timed_checkpoints;
+    s->requested_checkpoints += l->requested_checkpoints;
+    s->checkpoint_write_time += l->checkpoint_write_time;
+    s->checkpoint_sync_time += l->checkpoint_sync_time;
+    s->buf_written_checkpoints += l->buf_written_checkpoints;
+    s->buf_written_backend += l->buf_written_backend;
+    s->buf_fsync_backend += l->buf_fsync_backend;
+
+    after_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->checkpointer_changecount, 1);
+    Assert(after_count == before_count + 1);
+    END_CRIT_SECTION();
     /*
      * Clear out the statistics buffer, so it can be re-used.
      */
-    MemSet(&WalStats, 0, sizeof(WalStats));
+    MemSet(&CheckPointerStats, 0, sizeof(CheckPointerStats));
 }
 
 /* ----------
- * pgstat_send_slru() -
+ * pgstat_report_wal() -
  *
- *        Send SLRU statistics to the collector
+ *        Report WAL statistics
  * ----------
  */
-static void
-pgstat_send_slru(void)
+void
+pgstat_report_wal(void)
 {
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgSLRU all_zeroes;
-
-    for (int i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /*
-         * This function can be called even if nothing at all has happened. In
-         * this case, avoid sending a completely empty message to the stats
-         * collector.
-         */
-        if (memcmp(&SLRUStats[i], &all_zeroes, sizeof(PgStat_MsgSLRU)) == 0)
-            continue;
-
-        /* set the SLRU type before each send */
-        SLRUStats[i].m_index = i;
-
-        /*
-         * Prepare and send the message
-         */
-        pgstat_setheader(&SLRUStats[i].m_hdr, PGSTAT_MTYPE_SLRU);
-        pgstat_send(&SLRUStats[i], sizeof(PgStat_MsgSLRU));
-
-        /*
-         * Clear out the statistics buffer, so it can be re-used.
-         */
-        MemSet(&SLRUStats[i], 0, sizeof(PgStat_MsgSLRU));
-    }
+    flush_walstat(false);
 }
 
-
 /* ----------
- * PgstatCollectorMain() -
+ * get_local_dbstat_entry() -
  *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-    WaitEvent    event;
-    WaitEventSet *wes;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, SignalHandlerForConfigReload);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    MyBackendType = B_STATS_COLLECTOR;
-    init_ps_display(NULL);
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /* Prepare to wait for our latch or data in our socket. */
-    wes = CreateWaitEventSet(CurrentMemoryContext, 3);
-    AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
-    AddWaitEventToSet(wes, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL, NULL);
-    AddWaitEventToSet(wes, WL_SOCKET_READABLE, pgStatSock, NULL, NULL);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do check ConfigReloadPending
-     * inside the inner loop, which means that such interrupts will get
-     * serviced but the latch won't get cleared until next time there is a
-     * break in the action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (ShutdownRequestPending)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * ShutdownRequestPending becomes set.
-         */
-        while (!ShutdownRequestPending)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (ConfigReloadPending)
-            {
-                ConfigReloadPending = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSLRUCOUNTER:
-                    pgstat_recv_resetslrucounter(&msg.msg_resetslrucounter,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETREPLSLOTCOUNTER:
-                    pgstat_recv_resetreplslotcounter(&msg.msg_resetreplslotcounter,
-                                                     len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_WAL:
-                    pgstat_recv_wal(&msg.msg_wal, len);
-                    break;
-
-                case PGSTAT_MTYPE_SLRU:
-                    pgstat_recv_slru(&msg.msg_slru, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_REPLSLOT:
-                    pgstat_recv_replslot(&msg.msg_replslot, len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitEventSetWait(wes, -1L, &event, 1, WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitEventSetWait(wes, 2 * 1000L /* msec */ , &event, 1,
-                              WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr == 1 && event.events == WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    FreeWaitEventSet(wes);
-
-    exit(0);
-}
-
-/*
- * Subroutine to clear stats in a database entry
- *
- * Tables and functions hashes are initialized to empty.
- */
-static void
-reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
-{
-    HASHCTL        hash_ctl;
-
-    dbentry->n_xact_commit = 0;
-    dbentry->n_xact_rollback = 0;
-    dbentry->n_blocks_fetched = 0;
-    dbentry->n_blocks_hit = 0;
-    dbentry->n_tuples_returned = 0;
-    dbentry->n_tuples_fetched = 0;
-    dbentry->n_tuples_inserted = 0;
-    dbentry->n_tuples_updated = 0;
-    dbentry->n_tuples_deleted = 0;
-    dbentry->last_autovac_time = 0;
-    dbentry->n_conflict_tablespace = 0;
-    dbentry->n_conflict_lock = 0;
-    dbentry->n_conflict_snapshot = 0;
-    dbentry->n_conflict_bufferpin = 0;
-    dbentry->n_conflict_startup_deadlock = 0;
-    dbentry->n_temp_files = 0;
-    dbentry->n_temp_bytes = 0;
-    dbentry->n_deadlocks = 0;
-    dbentry->n_checksum_failures = 0;
-    dbentry->last_checksum_failure = 0;
-    dbentry->n_block_read_time = 0;
-    dbentry->n_block_write_time = 0;
-
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
-
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
-}
-
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+ *  Find or create a local PgStat_StatDBEntry entry for dbid.  A new entry is
+ *  created and initialized if none exists.
  */
 static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
+get_local_dbstat_entry(Oid dbid)
 {
-    PgStat_StatDBEntry *result;
+    PgStat_StatDBEntry *dbentry;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
-        return NULL;
 
     /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
+     * Find an entry or create a new one.
      */
-    if (!found)
-        reset_dbentry_counters(result);
+    dbentry = (PgStat_StatDBEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid,
+                             true, &found);
 
-    return result;
+    return dbentry;
 }
 
-
-/*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+/* ----------
+ * get_local_tabstat_entry() -
+ *  Find or create a PgStat_TableStatus entry for rel. A new entry is created
+ *  and initialized if none exists.
+ * ----------
  */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+static PgStat_TableStatus *
+get_local_tabstat_entry(Oid rel_id, bool isshared)
 {
-    PgStat_StatTabEntry *result;
+    PgStat_TableStatus *tabentry;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
 
-    /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
+    tabentry = (PgStat_TableStatus *)
+        get_local_stat_entry(PGSTAT_TYPE_TABLE,
+                             isshared ? InvalidOid : MyDatabaseId,
+                             rel_id, true, &found);
 
-    if (!create && !found)
-        return NULL;
-
-    /* If not found, initialize the new one. */
-    if (!found)
-    {
-        result->numscans = 0;
-        result->tuples_returned = 0;
-        result->tuples_fetched = 0;
-        result->tuples_inserted = 0;
-        result->tuples_updated = 0;
-        result->tuples_deleted = 0;
-        result->tuples_hot_updated = 0;
-        result->n_live_tuples = 0;
-        result->n_dead_tuples = 0;
-        result->changes_since_analyze = 0;
-        result->inserts_since_vacuum = 0;
-        result->blocks_fetched = 0;
-        result->blocks_hit = 0;
-        result->vacuum_timestamp = 0;
-        result->vacuum_count = 0;
-        result->autovac_vacuum_timestamp = 0;
-        result->autovac_vacuum_count = 0;
-        result->analyze_timestamp = 0;
-        result->analyze_count = 0;
-        result->autovac_analyze_timestamp = 0;
-        result->autovac_analyze_count = 0;
-    }
-
-    return result;
+    return tabentry;
 }
 
-
 /* ----------
- * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
+ * pgstat_write_statsfile() -
+ *        Write the global statistics file, as well as DB files.
  *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ * This function is called in the last process that is accessing the shared
+ * stats so locking is not required.
  * ----------
  */
 static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+pgstat_write_statsfile(void)
 {
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
-    int            i;
+    dshash_seq_status hstat;
+    PgStatHashEntry *ps;
+
+    /* Stats are not initialized yet; just return. */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
+
+    /* this is the last process that is accessing the shared stats */
+#ifdef USE_ASSERT_CHECKING
+    LWLockAcquire(StatsLock, LW_SHARED);
+    Assert(StatsShmem->refcount == 0);
+    LWLockRelease(StatsLock);
+#endif
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -5098,7 +5499,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    pg_atomic_write_u64(&StatsShmem->stats_timestamp, GetCurrentTimestamp());
 
     /*
      * Write the file header --- currently just a format ID.
@@ -5108,200 +5509,76 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Write global stats struct
+     * Write bgwriter global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(&StatsShmem->bgwriter_stats, sizeof(PgStat_BgWriter), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Write archiver stats struct
+     * Write checkpointer global stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+    rc = fwrite(&StatsShmem->checkpointer_stats, sizeof(PgStat_CheckPointer), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Write WAL stats struct
+     * Write archiver global stats struct
      */
-    rc = fwrite(&walStats, sizeof(walStats), 1, fpout);
+    rc = fwrite(&StatsShmem->archiver_stats, sizeof(PgStat_Archiver), 1,
+                fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write WAL global stats struct
+     */
+    rc = fwrite(&StatsShmem->wal_stats, sizeof(PgStat_Wal), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write SLRU stats struct
      */
-    rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
+    rc = fwrite(&StatsShmem->slru_stats, sizeof(PgStatSharedSLRUStats), 1,
+                fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Walk through the database table.
+     * Walk through the stats entries
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((ps = dshash_seq_next(&hstat)) != NULL)
     {
-        /*
-         * Write out the table and function stats for this DB into the
-         * appropriate per-DB stat file, if required.
-         */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
-
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
-
-        /*
-         * Write out the DB entry. We don't write the tables or functions
-         * pointers, since they're of no use to any other process.
-         */
-        fputc('D', fpout);
-        rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * Write replication slot stats struct
-     */
-    for (i = 0; i < nReplSlotStats; i++)
-    {
-        fputc('R', fpout);
-        rc = fwrite(&replSlotStats[i], sizeof(PgStat_ReplSlotStats), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
+        void               *pent;
+        size_t                len;
 
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
+        CHECK_FOR_INTERRUPTS();
 
-    if (permanent)
-        unlink(pgstat_stat_filename);
+        pent = dsa_get_address(area, ps->body);
 
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
-}
-
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
-
-/* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
- * ----------
- */
-static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
-{
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpout;
-    int32        format_id;
-    Oid            dbid = dbentry->databaseid;
-    int            rc;
-    char        tmpfile[MAXPGPATH];
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
+        /* Make DB's timestamp consistent with the global stats */
+        if (ps->key.type == PGSTAT_TYPE_DB)
+        {
+            PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) pent;
 
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
+            dbentry->stats_timestamp =
+                (TimestampTz) pg_atomic_read_u64(&StatsShmem->stats_timestamp);
+        }
+        else if (ps->key.type == PGSTAT_TYPE_REPLSLOT)
+        {
+            PgStat_ReplSlot *replslotent = (PgStat_ReplSlot *) pent;
 
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
+            /* Don't write the unused entry */
+            if (replslotent->dropped)
+                continue;
+        }
 
-    /*
-     * Walk through the database's access stats per table.
-     */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
-    {
-        fputc('T', fpout);
-        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
+        fputc('S', fpout);
+        rc = fwrite(&ps->key, sizeof(PgStatHashKey), 1, fpout);
 
-    /*
-     * Walk through the database's function stats table.
-     */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+        /* Write the entry, excluding its header part */
+        len = PGSTAT_SHENT_BODY_LEN(ps->key.type);
+        rc = fwrite(PGSTAT_SHENT_BODY(pent), len, 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_seq_term(&hstat);
 
     /*
      * No more output to be done. Close the temp file and replace the old
@@ -5335,114 +5612,63 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
 }
 
 /* ----------
- * pgstat_read_statsfiles() -
- *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ * pgstat_read_statsfile() -
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
+ *    Reads in existing activity statistics file into the shared stats hash.
  *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
+ * This function is called in the only process that is accessing the shared
+ * stats so locking is not required.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+static void
+pgstat_read_statsfile(void)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-    int            i;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    char        tag;
 
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
+    /* shouldn't be called from postmaster */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    /* this is the only process that is accessing the shared stats */
+#ifdef USE_ASSERT_CHECKING
+    LWLockAcquire(StatsLock, LW_SHARED);
+    Assert(StatsShmem->refcount == 1);
+    LWLockRelease(StatsLock);
+#endif
 
-    /* Allocate the space for replication slot statistics */
-    replSlotStats = palloc0(max_replication_slots * sizeof(PgStat_ReplSlotStats));
-    nReplSlotStats = 0;
-
-    /*
-     * Clear out global, archiver, WAL and SLRU statistics so they start from
-     * zero in case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-    memset(&walStats, 0, sizeof(walStats));
-    memset(&slruStats, 0, sizeof(slruStats));
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-    walStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Set the same reset timestamp for all SLRU items too.
-     */
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-        slruStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Set the same reset timestamp for all replication slots too.
-     */
-    for (i = 0; i < max_replication_slots; i++)
-        replSlotStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    StatsShmem->bgwriter_stats.stat_reset_timestamp = GetCurrentTimestamp();
+    StatsShmem->archiver_stats.stat_reset_timestamp =
+        StatsShmem->bgwriter_stats.stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the activity statistics are not running or
+     * have not yet written the stats file the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -5451,681 +5677,150 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
-        goto done;
-    }
-
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        goto done;
-    }
-
-    /*
-     * Read WAL stats struct
-     */
-    if (fread(&walStats, 1, sizeof(walStats), fpin) != sizeof(walStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&walStats, 0, sizeof(walStats));
         goto done;
     }
 
     /*
-     * Read SLRU stats struct
+     * Read bgwriter stats struct
      */
-    if (fread(slruStats, 1, sizeof(slruStats), fpin) != sizeof(slruStats))
+    if (fread(&StatsShmem->bgwriter_stats, 1, sizeof(PgStat_BgWriter), fpin) !=
+        sizeof(PgStat_BgWriter))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&slruStats, 0, sizeof(slruStats));
+        MemSet(&StatsShmem->bgwriter_stats, 0, sizeof(PgStat_BgWriter));
         goto done;
     }
 
     /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Add to the DB hash
-                 */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
-                break;
-
-                /*
-                 * 'R'    A PgStat_ReplSlotStats struct describing a replication
-                 * slot follows.
-                 */
-            case 'R':
-                if (fread(&replSlotStats[nReplSlotStats], 1, sizeof(PgStat_ReplSlotStats), fpin)
-                    != sizeof(PgStat_ReplSlotStats))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    memset(&replSlotStats[nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
-                    goto done;
-                }
-                nReplSlotStats++;
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-
-    return dbhash;
-}
-
-
-/* ----------
- * pgstat_read_db_statsfile() -
- *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
- * ----------
- */
-static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
-{
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatTabEntry tabbuf;
-    PgStat_StatFuncEntry funcbuf;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return;
-    }
-
-    /*
-     * Verify it's of the expected format.
+     * Read checkpointer stats struct
      */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
+    if (fread(&StatsShmem->checkpointer_stats, 1, sizeof(PgStat_CheckPointer), fpin) !=
+        sizeof(PgStat_CheckPointer))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
+        MemSet(&StatsShmem->checkpointer_stats, 0, sizeof(PgStat_CheckPointer));
         goto done;
     }
 
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'T'    A PgStat_StatTabEntry follows.
-                 */
-            case 'T':
-                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
-                          fpin) != sizeof(PgStat_StatTabEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
-                break;
-
-                /*
-                 * 'F'    A PgStat_StatFuncEntry follows.
-                 */
-            case 'F':
-                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
-                          fpin) != sizeof(PgStat_StatFuncEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
-                break;
-
-                /*
-                 * 'E'    The EOF marker of a complete stats file.
-                 */
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts. The caller must
- *    rely on timestamp stored in *ts iff the function returns true.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    PgStat_WalStats myWalStats;
-    PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
-    PgStat_ReplSlotStats myReplSlotStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
     /*
      * Read archiver stats struct
      */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
+    if (fread(&StatsShmem->archiver_stats, 1, sizeof(PgStat_Archiver),
+              fpin) != sizeof(PgStat_Archiver))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
+        MemSet(&StatsShmem->archiver_stats, 0, sizeof(PgStat_Archiver));
+        goto done;
     }
 
     /*
      * Read WAL stats struct
      */
-    if (fread(&myWalStats, 1, sizeof(myWalStats), fpin) != sizeof(myWalStats))
+    if (fread(&StatsShmem->wal_stats, 1, sizeof(PgStat_Wal), fpin)
+        != sizeof(PgStat_Wal))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
+        MemSet(&StatsShmem->wal_stats, 0, sizeof(PgStat_Wal));
+        goto done;
     }
 
     /*
      * Read SLRU stats struct
      */
-    if (fread(mySLRUStats, 1, sizeof(mySLRUStats), fpin) != sizeof(mySLRUStats))
+    if (fread(&StatsShmem->slru_stats, 1, sizeof(PgStatSharedSLRUStats),
+              fpin) != sizeof(PgStatSharedSLRUStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-                /*
-                 * 'R'    A PgStat_ReplSlotStats struct describing a replication
-                 * slot follows.
-                 */
-            case 'R':
-                if (fread(&myReplSlotStats, 1, sizeof(PgStat_ReplSlotStats), fpin)
-                    != sizeof(PgStat_ReplSlotStats))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-        }
+        goto done;
     }
 
-done:
-    FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
     /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
+    while ((tag = fgetc(fpin)) == 'S')
     {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
+        PgStatHashKey        key;
+        PgStat_StatEntryHeader *p;
+        size_t                len;
 
         CHECK_FOR_INTERRUPTS();
 
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
+        if (fread(&key, 1, sizeof(key), fpin) != sizeof(key))
         {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"", statfile)));
+            goto done;
         }
 
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
+        p = get_stat_entry(key.type, key.databaseid, key.objectid,
+                           false, true, &found);
 
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
+        /* don't allow duplicate entries */
+        if (found)
+        {
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"",
+                            statfile)));
+            goto done;
         }
 
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
+        /* Read only the entry body so that the header part is not overwritten */
+        len = PGSTAT_SHENT_BODY_LEN(key.type);
 
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
+        if (fread(PGSTAT_SHENT_BODY(p), 1, len, fpin) != len)
+        {
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"", statfile)));
+            goto done;
+        }
     }
 
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
+    if (tag != 'E')
+    {
         ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
+                (errmsg("corrupted statistics file \"%s\"",
+                        statfile)));
+        goto done;
+    }
 
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
+done:
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+
+    return;
 }
 
-
 /* ----------
  * pgstat_setup_memcxt() -
  *
- *    Create pgStatLocalContext, if not already done.
+ *    Create pgStatLocalContext if not already done.
  * ----------
  */
 static void
 pgstat_setup_memcxt(void)
 {
     if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
-}
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
 
+    if (!pgStatCacheContext)
+        pgStatCacheContext =
+            AllocSetContextCreate(CacheMemoryContext,
+                                  "Activity statistics",
+                                  ALLOCSET_SMALL_SIZES);
+}
 
 /* ----------
  * pgstat_clear_snapshot() -
@@ -6142,901 +5837,25 @@ pgstat_clear_snapshot(void)
 {
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
+    {
         MemoryContextDelete(pgStatLocalContext);
 
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-                tabentry->inserts_since_vacuum = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_WAL)
-    {
-        /* Reset the WAL statistics for the cluster. */
-        memset(&walStats, 0, sizeof(walStats));
-        walStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_resetslrucounter() -
- *
- *    Reset some SLRU statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len)
-{
-    int            i;
-    TimestampTz ts = GetCurrentTimestamp();
-
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /* reset entry with the given index, or all entries (index is -1) */
-        if ((msg->m_index == -1) || (msg->m_index == i))
-        {
-            memset(&slruStats[i], 0, sizeof(slruStats[i]));
-            slruStats[i].stat_reset_timestamp = ts;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_resetreplslotcounter() -
- *
- *    Reset some replication slot statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg,
-                                 int len)
-{
-    int            i;
-    int            idx = -1;
-    TimestampTz ts;
-
-    ts = GetCurrentTimestamp();
-    if (msg->clearall)
-    {
-        for (i = 0; i < nReplSlotStats; i++)
-            pgstat_reset_replslot(i, ts);
-    }
-    else
-    {
-        /* Get the index of replication slot statistics to reset */
-        idx = pgstat_replslot_index(msg->m_slotname, false);
-
-        /*
-         * Nothing to do if the given slot entry is not found.  This could
-         * happen when the slot with the given name is removed and the
-         * corresponding statistics entry is also removed before receiving the
-         * reset message.
-         */
-        if (idx < 0)
-            return;
-
-        /* Reset the stats for the requested replication slot */
-        pgstat_reset_replslot(idx, ts);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signaling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * It is quite possible that a non-aggressive VACUUM ended up skipping
-     * various pages, however, we'll zero the insert counter here regardless.
-     * It's currently used only to track when we need to perform an "insert"
-     * autovacuum, which are mainly intended to freeze newly inserted tuples.
-     * Zeroing this may just mean we'll not try to vacuum the table again
-     * until enough tuples have been inserted to trigger another insert
-     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
-     * stragglers.
-     */
-    tabentry->inserts_since_vacuum = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_wal() -
- *
- *    Process a WAL message.
- * ----------
- */
-static void
-pgstat_recv_wal(PgStat_MsgWal *msg, int len)
-{
-    walStats.wal_buffers_full += msg->m_wal_buffers_full;
-}
-
-/* ----------
- * pgstat_recv_slru() -
- *
- *    Process a SLRU message.
- * ----------
- */
-static void
-pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
-{
-    slruStats[msg->m_index].blocks_zeroed += msg->m_blocks_zeroed;
-    slruStats[msg->m_index].blocks_hit += msg->m_blocks_hit;
-    slruStats[msg->m_index].blocks_read += msg->m_blocks_read;
-    slruStats[msg->m_index].blocks_written += msg->m_blocks_written;
-    slruStats[msg->m_index].blocks_exists += msg->m_blocks_exists;
-    slruStats[msg->m_index].flush += msg->m_flush;
-    slruStats[msg->m_index].truncate += msg->m_truncate;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_replslot() -
- *
- *    Process a REPLSLOT message.
- * ----------
- */
-static void
-pgstat_recv_replslot(PgStat_MsgReplSlot *msg, int len)
-{
-    int            idx;
-
-    /*
-     * Get the index of replication slot statistics.  On dropping, we don't
-     * create the new statistics.
-     */
-    idx = pgstat_replslot_index(msg->m_slotname, !msg->m_drop);
-
-    /*
-     * The slot entry is not found or there is no space to accommodate the new
-     * entry.  This could happen when the message for the creation of a slot
-     * reached before the drop message even though the actual operations
-     * happen in reverse order.  In such a case, the next update of the
-     * statistics for the same slot will create the required entry.
-     */
-    if (idx < 0)
-        return;
-
-    Assert(idx >= 0 && idx <= max_replication_slots);
-    if (msg->m_drop)
-    {
-        /* Remove the replication slot statistics with the given name */
-        memcpy(&replSlotStats[idx], &replSlotStats[nReplSlotStats - 1],
-               sizeof(PgStat_ReplSlotStats));
-        nReplSlotStats--;
-        Assert(nReplSlotStats >= 0);
-    }
-    else
-    {
-        /* Update the replication slot statistics */
-        replSlotStats[idx].spill_txns += msg->m_spill_txns;
-        replSlotStats[idx].spill_count += msg->m_spill_count;
-        replSlotStats[idx].spill_bytes += msg->m_spill_bytes;
-        replSlotStats[idx].stream_txns += msg->m_stream_txns;
-        replSlotStats[idx].stream_count += msg->m_stream_count;
-        replSlotStats[idx].stream_bytes += msg->m_stream_bytes;
-    }
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
+    }
+
+    /* Invalidate the simple cache keys */
+    cached_dbent_key = stathashkey_zero;
+    cached_tabent_key = stathashkey_zero;
+    cached_funcent_key = stathashkey_zero;
+    cached_archiverstats_is_valid = false;
+    cached_bgwriterstats_is_valid = false;
+    cached_checkpointerstats_is_valid = false;
+    cached_walstats_is_valid = false;
+    cached_slrustats_is_valid = false;
+    n_cached_replslotstats = -1;
 }
 
 /*
@@ -7083,60 +5902,6 @@ pgstat_clip_activity(const char *raw_activity)
     return activity;
 }
 
-/* ----------
- * pgstat_replslot_index
- *
- * Return the index of entry of a replication slot with the given name, or
- * -1 if the slot is not found.
- *
- * create_it tells whether to create the new slot entry if it is not found.
- * ----------
- */
-static int
-pgstat_replslot_index(const char *name, bool create_it)
-{
-    int            i;
-
-    Assert(nReplSlotStats <= max_replication_slots);
-    for (i = 0; i < nReplSlotStats; i++)
-    {
-        if (strcmp(replSlotStats[i].slotname, name) == 0)
-            return i;            /* found */
-    }
-
-    /*
-     * The slot is not found.  We don't want to register the new statistics if
-     * the list is already full or the caller didn't request.
-     */
-    if (i == max_replication_slots || !create_it)
-        return -1;
-
-    /* Register new slot */
-    memset(&replSlotStats[nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
-    memcpy(&replSlotStats[nReplSlotStats].slotname, name, NAMEDATALEN);
-
-    return nReplSlotStats++;
-}
-
-/* ----------
- * pgstat_reset_replslot
- *
- * Reset the replication slot stats at index 'i'.
- * ----------
- */
-static void
-pgstat_reset_replslot(int i, TimestampTz ts)
-{
-    /* reset only counters. Don't clear slot name */
-    replSlotStats[i].spill_txns = 0;
-    replSlotStats[i].spill_count = 0;
-    replSlotStats[i].spill_bytes = 0;
-    replSlotStats[i].stream_txns = 0;
-    replSlotStats[i].stream_count = 0;
-    replSlotStats[i].stream_bytes = 0;
-    replSlotStats[i].stat_reset_timestamp = ts;
-}
-
 /*
  * pgstat_slru_index
  *
@@ -7181,7 +5946,7 @@ pgstat_slru_name(int slru_idx)
  * Returns pointer to entry with counters for given SLRU (based on the name
  * stored in SlruCtl as lwlock tranche name).
  */
-static inline PgStat_MsgSLRU *
+static PgStat_SLRUStats *
 slru_entry(int slru_idx)
 {
     /*
@@ -7192,7 +5957,7 @@ slru_entry(int slru_idx)
 
     Assert((slru_idx >= 0) && (slru_idx < SLRU_NUM_ELEMENTS));
 
-    return &SLRUStats[slru_idx];
+    return &local_SLRUStats[slru_idx];
 }
 
 /*
@@ -7202,41 +5967,48 @@ slru_entry(int slru_idx)
 void
 pgstat_count_slru_page_zeroed(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_zeroed += 1;
+    slru_entry(slru_idx)->blocks_zeroed += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_hit(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_hit += 1;
+    slru_entry(slru_idx)->blocks_hit += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_exists(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_exists += 1;
+    slru_entry(slru_idx)->blocks_exists += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_read(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_read += 1;
+    slru_entry(slru_idx)->blocks_read += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_written(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_written += 1;
+    slru_entry(slru_idx)->blocks_written += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_flush(int slru_idx)
 {
-    slru_entry(slru_idx)->m_flush += 1;
+    slru_entry(slru_idx)->flush += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_truncate(int slru_idx)
 {
-    slru_entry(slru_idx)->m_truncate += 1;
+    slru_entry(slru_idx)->truncate += 1;
+    have_slrustats = true;
 }
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b811c961a6..526021def2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -257,7 +257,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -518,7 +517,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1340,12 +1338,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1794,11 +1786,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2728,8 +2715,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -3056,8 +3041,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3124,13 +3107,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3203,22 +3179,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3679,22 +3639,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3914,8 +3858,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3939,8 +3881,7 @@ PostmasterStateMachine(void)
     {
         /*
          * PM_WAIT_DEAD_END state ends when the BackendList is entirely empty
-         * (ie, no dead_end children remain), and the archiver and stats
-         * collector are gone too.
+         * (ie, no dead_end children remain), and the archiver is gone too.
          *
          * The reason we wait for those two is to protect them against a new
          * postmaster starting conflicting subprocesses; this isn't an
@@ -3950,8 +3891,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4152,8 +4092,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5130,18 +5068,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5260,12 +5186,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6170,7 +6090,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6226,8 +6145,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6462,7 +6379,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index b89df01fa7..57531d7d48 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1556,8 +1556,8 @@ is_checksummed_file(const char *fullpath, const char *filename)
  *
  * If 'missing_ok' is true, will not throw an error if the file is not found.
  *
- * If dboid is anything other than InvalidOid then any checksum failures detected
- * will get reported to the stats collector.
+ * If dboid is anything other than InvalidOid then any checksum failures
+ * detected will get reported to the activity stats facility.
  *
  * Returns true if the file was successfully sent, false if 'missing_ok',
  * and the file did not exist.
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 09be1d8c48..163fc3da2a 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -692,14 +692,10 @@ ReplicationSlotDropPtr(ReplicationSlot *slot)
                 (errmsg("could not remove directory \"%s\"", tmppath)));
 
     /*
-     * Send a message to drop the replication slot to the stats collector.
-     * Since there is no guarantee of the order of message transfer on a UDP
-     * connection, it's possible that a message for creating a new slot
-     * reaches before a message for removing the old slot. We send the drop
-     * and create messages while holding ReplicationSlotAllocationLock to
-     * reduce that possibility. If the messages reached in reverse, we would
-     * lose one statistics update message. But the next update message will
-     * create the statistics for the replication slot.
+     * Drop the statistics entry for the replication slot.  Do this while
+     * holding ReplicationSlotAllocationLock so that we don't drop a statistics
+     * entry for another slot with the same name just created in another
+     * session.
      */
     if (SlotIsLogical(slot))
         pgstat_report_replslot_drop(NameStr(slot->data.name));
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ad0d1a9abc..aef6be94a3 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2047,7 +2047,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                CheckPointerStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2157,7 +2157,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2347,7 +2347,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2355,7 +2355,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
 
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 96c2aaabbd..55693cfa3a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -149,6 +149,7 @@ CreateSharedMemoryAndSemaphores(void)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -267,6 +268,7 @@ CreateSharedMemoryAndSemaphores(void)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 2fa90cc095..17a46db74b 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -177,7 +177,9 @@ static const char *const BuiltinTrancheNames[] = {
     /* LWTRANCHE_PARALLEL_APPEND: */
     "ParallelAppend",
     /* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
-    "PerXactPredicateList"
+    "PerXactPredicateList",
+    /* LWTRANCHE_STATS: */
+    "ActivityStatistics"
 };
 
 StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index 774292fd94..91bf8d5b5d 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -53,3 +53,4 @@ XactTruncationLock                    44
 # 45 was XactTruncationLock until removal of BackendRandomLock
 WrapLimitsVacuumLock                46
 NotifyQueueTailLock                    47
+StatsLock                            48
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index dcc09df0c7..0dec4b9145 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -415,8 +415,8 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
     DropRelFileNodesAllBuffers(rnodes, nrels);
 
     /*
-     * It'd be nice to tell the stats collector to forget them immediately,
-     * too. But we can't because we don't know the OIDs.
+     * It'd be nice to tell the activity stats facility to forget them
+     * immediately, too. But we can't because we don't know the OIDs.
      */
 
     /*
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 411cfadbff..5043736f1f 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3209,6 +3209,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3783,6 +3789,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4179,11 +4186,12 @@ PostgresMain(int argc, char *argv[],
          * Note: this includes fflush()'ing the last of the prior output.
          *
          * This is also a good time to send collected statistics to the
-         * collector, and to update the PS stats display.  We avoid doing
-         * those every time through the message loop because it'd slow down
-         * processing of batched messages, and because we don't want to report
-         * uncommitted updates (that confuses autovacuum).  The notification
-         * processor wants a call too, if we are not in a transaction block.
+         * activity statistics, and to update the PS stats display.  We avoid
+         * doing those every time through the message loop because it'd slow
+         * down processing of batched messages, and because we don't want to
+         * report uncommitted updates (that confuses autovacuum).  The
+         * notification processor wants a call too, if we are not in a
+         * transaction block.
          */
         if (send_ready_for_query)
         {
@@ -4215,6 +4223,8 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long stats_timeout;
+
                 /* Send out notify signals and transmit self-notifies */
                 ProcessCompletedNotifies();
 
@@ -4227,8 +4237,13 @@ PostgresMain(int argc, char *argv[],
                 if (notifyInterruptPending)
                     ProcessNotifyInterrupt();
 
-                pgstat_report_stat(false);
-
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle");
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4263,7 +4278,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4271,6 +4286,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index a210fc93b4..c72042d99e 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * pgstatfuncs.c
- *      Functions for accessing the statistics collector data
+ *      Functions for accessing the activity statistics data
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -35,9 +35,6 @@
 
 #define HAS_PGSTAT_PERMISSIONS(role)     (is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS) || has_privs_of_role(GetUserId(),role))
 
 
-/* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
-
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
 {
@@ -1267,7 +1264,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_commit);
+        result = (int64) (dbentry->counts.n_xact_commit);
 
     PG_RETURN_INT64(result);
 }
@@ -1283,7 +1280,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_rollback);
+        result = (int64) (dbentry->counts.n_xact_rollback);
 
     PG_RETURN_INT64(result);
 }
@@ -1299,7 +1296,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_fetched);
+        result = (int64) (dbentry->counts.n_blocks_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1315,7 +1312,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_hit);
+        result = (int64) (dbentry->counts.n_blocks_hit);
 
     PG_RETURN_INT64(result);
 }
@@ -1331,7 +1328,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_returned);
+        result = (int64) (dbentry->counts.n_tuples_returned);
 
     PG_RETURN_INT64(result);
 }
@@ -1347,7 +1344,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_fetched);
+        result = (int64) (dbentry->counts.n_tuples_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1363,7 +1360,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_inserted);
+        result = (int64) (dbentry->counts.n_tuples_inserted);
 
     PG_RETURN_INT64(result);
 }
@@ -1379,7 +1376,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_updated);
+        result = (int64) (dbentry->counts.n_tuples_updated);
 
     PG_RETURN_INT64(result);
 }
@@ -1395,7 +1392,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_deleted);
+        result = (int64) (dbentry->counts.n_tuples_deleted);
 
     PG_RETURN_INT64(result);
 }
@@ -1428,7 +1425,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_files;
+        result = dbentry->counts.n_temp_files;
 
     PG_RETURN_INT64(result);
 }
@@ -1444,7 +1441,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_bytes;
+        result = dbentry->counts.n_temp_bytes;
 
     PG_RETURN_INT64(result);
 }
@@ -1459,7 +1456,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace);
+        result = (int64) (dbentry->counts.n_conflict_tablespace);
 
     PG_RETURN_INT64(result);
 }
@@ -1474,7 +1471,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_lock);
+        result = (int64) (dbentry->counts.n_conflict_lock);
 
     PG_RETURN_INT64(result);
 }
@@ -1489,7 +1486,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_snapshot);
+        result = (int64) (dbentry->counts.n_conflict_snapshot);
 
     PG_RETURN_INT64(result);
 }
@@ -1504,7 +1501,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_bufferpin);
+        result = (int64) (dbentry->counts.n_conflict_bufferpin);
 
     PG_RETURN_INT64(result);
 }
@@ -1519,7 +1516,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1534,11 +1531,11 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace +
-                          dbentry->n_conflict_lock +
-                          dbentry->n_conflict_snapshot +
-                          dbentry->n_conflict_bufferpin +
-                          dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_tablespace +
+                          dbentry->counts.n_conflict_lock +
+                          dbentry->counts.n_conflict_snapshot +
+                          dbentry->counts.n_conflict_bufferpin +
+                          dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1553,7 +1550,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_deadlocks);
+        result = (int64) (dbentry->counts.n_deadlocks);
 
     PG_RETURN_INT64(result);
 }
@@ -1571,7 +1568,7 @@ pg_stat_get_db_checksum_failures(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_checksum_failures);
+        result = (int64) (dbentry->counts.n_checksum_failures);
 
     PG_RETURN_INT64(result);
 }
@@ -1608,7 +1605,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_read_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_read_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1624,7 +1621,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_write_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_write_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1632,69 +1629,71 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_bgwriter_timed_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->timed_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->timed_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_requested_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->requested_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->requested_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_clean);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_written_clean);
 }
 
 Datum
 pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->maxwritten_clean);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->maxwritten_clean);
 }
 
 Datum
 pg_stat_get_checkpoint_write_time(PG_FUNCTION_ARGS)
 {
     /* time is already in msec, just convert to double for presentation */
-    PG_RETURN_FLOAT8((double) pgstat_fetch_global()->checkpoint_write_time);
+    PG_RETURN_FLOAT8((double)
+                     pgstat_fetch_stat_checkpointer()->checkpoint_write_time);
 }
 
 Datum
 pg_stat_get_checkpoint_sync_time(PG_FUNCTION_ARGS)
 {
     /* time is already in msec, just convert to double for presentation */
-    PG_RETURN_FLOAT8((double) pgstat_fetch_global()->checkpoint_sync_time);
+    PG_RETURN_FLOAT8((double)
+                     pgstat_fetch_stat_checkpointer()->checkpoint_sync_time);
 }
 
 Datum
 pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_global()->stat_reset_timestamp);
+    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_stat_bgwriter()->stat_reset_timestamp);
 }
 
 Datum
 pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_backend);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_backend);
 }
 
 Datum
 pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_fsync_backend);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_fsync_backend);
 }
 
 Datum
 pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_alloc);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
 }
 
 /*
@@ -1707,7 +1706,7 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
     TupleDesc    tupdesc;
     Datum        values[PG_STAT_GET_WAL_COLS];
     bool        nulls[PG_STAT_GET_WAL_COLS];
-    PgStat_WalStats *wal_stats;
+    PgStat_Wal *wal_stats;
 
     /* Initialise values and NULL flags arrays */
     MemSet(values, 0, sizeof(values));
@@ -2001,7 +2000,7 @@ pg_stat_get_xact_function_self_time(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_snapshot_timestamp(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_global()->stats_timestamp);
+    PG_RETURN_TIMESTAMPTZ(pgstat_get_stat_timestamp());
 }
 
 /* Discard the active statistics snapshot */
@@ -2089,7 +2088,7 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     TupleDesc    tupdesc;
     Datum        values[7];
     bool        nulls[7];
-    PgStat_ArchiverStats *archiver_stats;
+    PgStat_Archiver *archiver_stats;
 
     /* Initialise values and NULL flags arrays */
     MemSet(values, 0, sizeof(values));
@@ -2159,7 +2158,7 @@ pg_stat_get_replication_slots(PG_FUNCTION_ARGS)
     Tuplestorestate *tupstore;
     MemoryContext per_query_ctx;
     MemoryContext oldcontext;
-    PgStat_ReplSlotStats *slotstats;
+    PgStat_ReplSlot *slotstats;
     int            nstats;
     int            i;
 
@@ -2192,7 +2191,7 @@ pg_stat_get_replication_slots(PG_FUNCTION_ARGS)
     {
         Datum        values[PG_STAT_GET_REPLICATION_SLOT_COLS];
         bool        nulls[PG_STAT_GET_REPLICATION_SLOT_COLS];
-        PgStat_ReplSlotStats *s = &(slotstats[i]);
+        PgStat_ReplSlot *s = &(slotstats[i]);
 
         MemSet(values, 0, sizeof(values));
         MemSet(nulls, 0, sizeof(nulls));
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 66393becfb..92a518433e 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -72,6 +72,7 @@
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/optimizer.h"
+#include "pgstat.h"
 #include "rewrite/rewriteDefine.h"
 #include "rewrite/rowsecurity.h"
 #include "storage/lmgr.h"
@@ -2354,6 +2355,10 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
      */
     RelationCloseSmgr(relation);
 
+    /* break mutual link with stats entry */
+    pgstat_delinkstats(relation);
+
+    if (relation->rd_rel)
     /*
      * Free all the subsidiary data structures of the relcache entry, then the
      * entry itself.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 6ab8216839..883c7f8802 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index ed2ab4b5b2..74fb22f216 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -269,9 +269,6 @@ GetBackendTypeDesc(BackendType backendType)
         case B_ARCHIVER:
             backendDesc = "archiver";
             break;
-        case B_STATS_COLLECTOR:
-            backendDesc = "stats collector";
-            break;
         case B_LOGGER:
             backendDesc = "logger";
             break;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index d4ab4c7e23..4ff4cc33d9 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -631,6 +632,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1245,6 +1248,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index a62d64eaa4..d46ae0ca8d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -743,8 +743,8 @@ const char *const config_group_names[] =
     gettext_noop("Statistics"),
     /* STATS_MONITORING */
     gettext_noop("Statistics / Monitoring"),
-    /* STATS_COLLECTOR */
-    gettext_noop("Statistics / Query and Index Statistics Collector"),
+    /* STATS_ACTIVITY */
+    gettext_noop("Statistics / Query and Index Activity Statistics"),
     /* AUTOVACUUM */
     gettext_noop("Autovacuum"),
     /* CLIENT_CONN */
@@ -1454,7 +1454,7 @@ static struct config_bool ConfigureNamesBool[] =
 #endif
 
     {
-        {"track_activities", PGC_SUSET, STATS_COLLECTOR,
+        {"track_activities", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects information about executing commands."),
             gettext_noop("Enables the collection of information on the currently "
                          "executing command of each session, along with "
@@ -1465,7 +1465,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_counts", PGC_SUSET, STATS_COLLECTOR,
+        {"track_counts", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects statistics on database activity."),
             NULL
         },
@@ -1474,7 +1474,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_io_timing", PGC_SUSET, STATS_COLLECTOR,
+        {"track_io_timing", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects timing statistics for database I/O activity."),
             NULL
         },
@@ -4310,7 +4310,7 @@ static struct config_string ConfigureNamesString[] =
     },
 
     {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
+        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
             gettext_noop("Writes temporary statistics files to the specified directory."),
             NULL,
             GUC_SUPERUSER_ONLY
@@ -4646,7 +4646,7 @@ static struct config_enum ConfigureNamesEnum[] =
     },
 
     {
-        {"track_functions", PGC_SUSET, STATS_COLLECTOR,
+        {"track_functions", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects function-level statistics on database activity."),
             NULL
         },
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9cb571f7cc..668a2d033a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -579,7 +579,7 @@
 # STATISTICS
 #------------------------------------------------------------------------------
 
-# - Query and Index Statistics Collector -
+# - Query and Index Activity Statistics -
 
 #track_activities = on
 #track_counts = on
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index f674a7c94e..26603e95e4 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -124,7 +124,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index bba002ad24..b70c495f2a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -83,6 +83,8 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
 
@@ -321,7 +323,6 @@ typedef enum BackendType
     B_WAL_SENDER,
     B_WAL_WRITER,
     B_ARCHIVER,
-    B_STATS_COLLECTOR,
     B_LOGGER,
 } BackendType;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 257e515bfe..ac9846b9c1 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL activity statistics facility.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -15,9 +15,11 @@
 #include "libpq/pqcomm.h"
 #include "miscadmin.h"
 #include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -27,8 +29,8 @@
  * ----------
  */
 #define PGSTAT_STAT_PERMANENT_DIRECTORY        "pg_stat"
-#define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
-#define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
+#define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/saved_stats"
+#define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/saved_stats.tmp"
 
 /* Default directory to store temporary statistics data in */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
@@ -41,38 +43,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_RESETSLRUCOUNTER,
-    PGSTAT_MTYPE_RESETREPLSLOTCOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_WAL,
-    PGSTAT_MTYPE_SLRU,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE,
-    PGSTAT_MTYPE_REPLSLOT,
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -83,9 +53,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -159,10 +128,13 @@ typedef enum PgStat_Single_Reset_Type
  */
 typedef struct PgStat_TableStatus
 {
-    Oid            t_id;            /* table's OID */
-    bool        t_shared;        /* is it a shared catalog? */
     struct PgStat_TableXactStatus *trans;    /* lowest subxact's counts */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_TableCounts t_counts;    /* event counts to be sent */
+    Relation    relation;            /* rel that is using this entry */
 } PgStat_TableStatus;
 
 /* ----------
@@ -186,350 +158,57 @@ typedef struct PgStat_TableXactStatus
     struct PgStat_TableXactStatus *next;    /* next of same subxact */
 } PgStat_TableXactStatus;
 
-
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
-/* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
+/*
+ * Archiver statistics kept in the shared stats
  */
-typedef struct PgStat_MsgHdr
+typedef struct PgStat_Archiver
 {
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
+    PgStat_Counter archived_count;    /* archival successes */
+    char        last_archived_wal[MAX_XFN_CHARS + 1];    /* last WAL file
+                                                         * archived */
+    TimestampTz last_archived_timestamp;    /* last archival success time */
+    PgStat_Counter failed_count;    /* failed archival attempts */
+    char        last_failed_wal[MAX_XFN_CHARS + 1]; /* WAL file involved in
+                                                     * last failure */
+    TimestampTz last_failed_timestamp;    /* last archival failure time */
+    TimestampTz stat_reset_timestamp;
+} PgStat_Archiver;
 
 /* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgDummy
+typedef struct PgStat_BgWriter
 {
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_alloc;
+    TimestampTz stat_reset_timestamp;
+} PgStat_BgWriter;
 
 /* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
+ * PgStat_CheckPointer        checkpointer statistics
  * ----------
  */
-
-typedef struct PgStat_MsgInquiry
+typedef struct PgStat_CheckPointer
 {
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgResetslrucounter Sent by the backend to tell the collector
- *                                to reset a SLRU counter
- * ----------
- */
-typedef struct PgStat_MsgResetslrucounter
-{
-    PgStat_MsgHdr m_hdr;
-    int            m_index;
-} PgStat_MsgResetslrucounter;
-
-/* ----------
- * PgStat_MsgResetreplslotcounter Sent by the backend to tell the collector
- *                                to reset replication slot counter(s)
- * ----------
- */
-typedef struct PgStat_MsgResetreplslotcounter
-{
-    PgStat_MsgHdr m_hdr;
-    char        m_slotname[NAMEDATALEN];
-    bool        clearall;
-} PgStat_MsgResetreplslotcounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgWal            Sent by backends and background processes to update WAL statistics.
- * ----------
- */
-typedef struct PgStat_MsgWal
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Counter m_wal_buffers_full;
-} PgStat_MsgWal;
-
-/* ----------
- * PgStat_MsgSLRU            Sent by a backend to update SLRU statistics.
- * ----------
- */
-typedef struct PgStat_MsgSLRU
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Counter m_index;
-    PgStat_Counter m_blocks_zeroed;
-    PgStat_Counter m_blocks_hit;
-    PgStat_Counter m_blocks_read;
-    PgStat_Counter m_blocks_written;
-    PgStat_Counter m_blocks_exists;
-    PgStat_Counter m_flush;
-    PgStat_Counter m_truncate;
-} PgStat_MsgSLRU;
-
-/* ----------
- * PgStat_MsgReplSlot    Sent by a backend or a wal sender to update replication
- *                        slot statistics.
- * ----------
- */
-typedef struct PgStat_MsgReplSlot
-{
-    PgStat_MsgHdr m_hdr;
-    char        m_slotname[NAMEDATALEN];
-    bool        m_drop;
-    PgStat_Counter m_spill_txns;
-    PgStat_Counter m_spill_count;
-    PgStat_Counter m_spill_bytes;
-    PgStat_Counter m_stream_txns;
-    PgStat_Counter m_stream_count;
-    PgStat_Counter m_stream_bytes;
-} PgStat_MsgReplSlot;
-
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+} PgStat_CheckPointer;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -545,7 +224,6 @@ typedef struct PgStat_FunctionCounts
  */
 typedef struct PgStat_BackendFunctionEntry
 {
-    Oid            f_id;
     PgStat_FunctionCounts f_counts;
 } PgStat_BackendFunctionEntry;
 
@@ -561,101 +239,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgResetslrucounter msg_resetslrucounter;
-    PgStat_MsgResetreplslotcounter msg_resetreplslotcounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgWal msg_wal;
-    PgStat_MsgSLRU msg_slru;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-    PgStat_MsgReplSlot msg_replslot;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Activity statistics data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -664,13 +249,9 @@ typedef union PgStat_Msg
 
 #define PGSTAT_FILE_FORMAT_ID    0x01A5BC9F
 
-/* ----------
- * PgStat_StatDBEntry            The collector's data per database
- * ----------
- */
-typedef struct PgStat_StatDBEntry
+
+typedef struct PgStat_StatDBCounts
 {
-    Oid            databaseid;
     PgStat_Counter n_xact_commit;
     PgStat_Counter n_xact_rollback;
     PgStat_Counter n_blocks_fetched;
@@ -680,7 +261,6 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_tuples_inserted;
     PgStat_Counter n_tuples_updated;
     PgStat_Counter n_tuples_deleted;
-    TimestampTz last_autovac_time;
     PgStat_Counter n_conflict_tablespace;
     PgStat_Counter n_conflict_lock;
     PgStat_Counter n_conflict_snapshot;
@@ -690,29 +270,98 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_temp_bytes;
     PgStat_Counter n_deadlocks;
     PgStat_Counter n_checksum_failures;
-    TimestampTz last_checksum_failure;
     PgStat_Counter n_block_read_time;    /* times in microseconds */
     PgStat_Counter n_block_write_time;
+} PgStat_StatDBCounts;
 
+/* ----------
+ * PgStat_StatEntryHeader        common header struct for PgStat_Stat*Entry
+ * ----------
+ */
+typedef struct PgStat_StatEntryHeader
+{
+    LWLock        lock;
+    bool        dropped;            /* This entry is being dropped and should
+                                     * be removed when refcount goes to
+                                     * zero. */
+    pg_atomic_uint32  refcount;        /* How many backends are referencing */
+} PgStat_StatEntryHeader;
+
+/* ----------
+ * PgStat_StatDBEntry            The statistics per database
+ * ----------
+ */
+typedef struct PgStat_StatDBEntry
+{
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    TimestampTz last_autovac_time;
+    TimestampTz last_checksum_failure;
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;    /* time of db stats update */
+
+    PgStat_StatDBCounts counts;
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following must be last in the struct, because we don't write them
+     * out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables; /* current gen tables hash */
+    dshash_table_handle functions;    /* current gen functions hash */
 } PgStat_StatDBEntry;
 
+/* ----------
+ * PgStat_Wal   Sent by backends and background processes to
+ *                update WAL statistics.
+ * ----------
+ */
+typedef struct PgStat_Wal
+{
+    PgStat_Counter wal_buffers_full;
+    TimestampTz stat_reset_timestamp;
+} PgStat_Wal;
+
+/*
+ * SLRU statistics kept in the stats collector
+ */
+typedef struct PgStat_SLRUStats
+{
+    PgStat_Counter blocks_zeroed;
+    PgStat_Counter blocks_hit;
+    PgStat_Counter blocks_read;
+    PgStat_Counter blocks_written;
+    PgStat_Counter blocks_exists;
+    PgStat_Counter flush;
+    PgStat_Counter truncate;
+    TimestampTz stat_reset_timestamp;
+} PgStat_SLRUStats;
+
+
+/* ----------
+ * PgStat_HashMountInfo  dshash mount information
+ * ----------
+ */
+typedef struct PgStat_HashMountInfo
+{
+    HTAB       *snapshot_tables;    /* table entry snapshot */
+    HTAB       *snapshot_functions; /* function entry snapshot */
+    dshash_table *dshash_tables;    /* attached tables dshash */
+    dshash_table *dshash_functions; /* attached functions dshash */
+} PgStat_HashMountInfo;
 
 /* ----------
- * PgStat_StatTabEntry            The collector's data per table (or index)
+ * PgStat_StatTabEntry            The statistics per table (or index)
  * ----------
  */
 typedef struct PgStat_StatTabEntry
 {
-    Oid            tableid;
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    /* Persistent data follow */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
 
     PgStat_Counter numscans;
 
@@ -732,96 +381,37 @@ typedef struct PgStat_StatTabEntry
     PgStat_Counter blocks_fetched;
     PgStat_Counter blocks_hit;
 
-    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
     PgStat_Counter vacuum_count;
-    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_vacuum_count;
-    TimestampTz analyze_timestamp;    /* user initiated */
     PgStat_Counter analyze_count;
-    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_analyze_count;
 } PgStat_StatTabEntry;
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            per function stats data
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
 {
-    Oid            functionid;
-
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    /* Persistent data follow */
     PgStat_Counter f_numcalls;
 
     PgStat_Counter f_total_time;    /* times in microseconds */
     PgStat_Counter f_self_time;
 } PgStat_StatFuncEntry;
 
-
-/*
- * Archiver statistics kept in the stats collector
- */
-typedef struct PgStat_ArchiverStats
-{
-    PgStat_Counter archived_count;    /* archival successes */
-    char        last_archived_wal[MAX_XFN_CHARS + 1];    /* last WAL file
-                                                         * archived */
-    TimestampTz last_archived_timestamp;    /* last archival success time */
-    PgStat_Counter failed_count;    /* failed archival attempts */
-    char        last_failed_wal[MAX_XFN_CHARS + 1]; /* WAL file involved in
-                                                     * last failure */
-    TimestampTz last_failed_timestamp;    /* last archival failure time */
-    TimestampTz stat_reset_timestamp;
-} PgStat_ArchiverStats;
-
-/*
- * Global statistics kept in the stats collector
- */
-typedef struct PgStat_GlobalStats
-{
-    TimestampTz stats_timestamp;    /* time of stats file update */
-    PgStat_Counter timed_checkpoints;
-    PgStat_Counter requested_checkpoints;
-    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
-    PgStat_Counter checkpoint_sync_time;
-    PgStat_Counter buf_written_checkpoints;
-    PgStat_Counter buf_written_clean;
-    PgStat_Counter maxwritten_clean;
-    PgStat_Counter buf_written_backend;
-    PgStat_Counter buf_fsync_backend;
-    PgStat_Counter buf_alloc;
-    TimestampTz stat_reset_timestamp;
-} PgStat_GlobalStats;
-
-/*
- * WAL statistics kept in the stats collector
- */
-typedef struct PgStat_WalStats
-{
-    PgStat_Counter wal_buffers_full;
-    TimestampTz stat_reset_timestamp;
-} PgStat_WalStats;
-
-/*
- * SLRU statistics kept in the stats collector
- */
-typedef struct PgStat_SLRUStats
-{
-    PgStat_Counter blocks_zeroed;
-    PgStat_Counter blocks_hit;
-    PgStat_Counter blocks_read;
-    PgStat_Counter blocks_written;
-    PgStat_Counter blocks_exists;
-    PgStat_Counter flush;
-    PgStat_Counter truncate;
-    TimestampTz stat_reset_timestamp;
-} PgStat_SLRUStats;
-
 /*
  * Replication slot statistics kept in the stats collector
  */
-typedef struct PgStat_ReplSlotStats
+typedef struct PgStat_ReplSlot
 {
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    bool        dropped;
+    /* Persistent data follow */
     char        slotname[NAMEDATALEN];
     PgStat_Counter spill_txns;
     PgStat_Counter spill_count;
@@ -830,7 +420,7 @@ typedef struct PgStat_ReplSlotStats
     PgStat_Counter stream_count;
     PgStat_Counter stream_bytes;
     TimestampTz stat_reset_timestamp;
-} PgStat_ReplSlotStats;
+} PgStat_ReplSlot;
 
 /* ----------
  * Backend states
@@ -879,7 +469,7 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
+    WAIT_EVENT_READING_STATS_FILE,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
     WAIT_EVENT_WAL_RECEIVER_MAIN,
@@ -1131,7 +721,7 @@ typedef struct PgBackendGSSStatus
  *
  * Each live backend maintains a PgBackendStatus struct in shared memory
  * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
+ * BackendId, but that is not critical.)  Note that the stats-writing functions
  * has no involvement in, or even access to, these structs.
  *
  * Each auxiliary process also maintains a PgBackendStatus struct in shared
@@ -1328,18 +918,26 @@ extern PGDLLIMPORT bool pgstat_track_counts;
 extern PGDLLIMPORT int pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used, but will be removed with GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
+
+/*
+ * CheckPointer statistics counters are updated directly by checkpointer and
+ * bufmgr
+ */
+extern PgStat_CheckPointer CheckPointerStats;
 
 /*
  * WAL statistics counter is updated by backends and background processes
  */
-extern PgStat_MsgWal WalStats;
+extern PgStat_Wal WalStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1354,33 +952,27 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
-
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 extern void pgstat_reset_slru_counter(const char *);
 extern void pgstat_reset_replslot_counter(const char *name);
 
+extern void pgstat_report_wal(void);
 extern void pgstat_report_autovac(Oid dboid);
 extern void pgstat_report_vacuum(Oid tableoid, bool shared,
                                  PgStat_Counter livetuples, PgStat_Counter deadtuples);
@@ -1420,6 +1012,7 @@ extern PgStat_TableStatus *find_tabstat_entry(Oid rel_id);
 extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 
 extern void pgstat_initstats(Relation rel);
+extern void pgstat_delinkstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
 
@@ -1542,9 +1135,10 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                                       void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
-extern void pgstat_send_wal(void);
+extern void pgstat_report_archiver(const char *xlog, bool failed);
+extern void pgstat_report_bgwriter(void);
+extern void pgstat_report_checkpointer(void);
+extern void pgstat_report_wal(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1553,15 +1147,20 @@ extern void pgstat_send_wal(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_extended(bool shared,
+                                                                Oid relid);
+extern void pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
-extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
-extern PgStat_GlobalStats *pgstat_fetch_global(void);
-extern PgStat_WalStats *pgstat_fetch_stat_wal(void);
+extern TimestampTz pgstat_get_stat_timestamp(void);
+extern PgStat_Archiver *pgstat_fetch_stat_archiver(void);
+extern PgStat_BgWriter *pgstat_fetch_stat_bgwriter(void);
+extern PgStat_CheckPointer *pgstat_fetch_stat_checkpointer(void);
+extern PgStat_Wal *pgstat_fetch_stat_wal(void);
 extern PgStat_SLRUStats *pgstat_fetch_slru(void);
-extern PgStat_ReplSlotStats *pgstat_fetch_replslot(int *nslots_p);
+extern PgStat_ReplSlot *pgstat_fetch_replslot(int *nslots_p);
 
 extern void pgstat_count_slru_page_zeroed(int slru_idx);
 extern void pgstat_count_slru_page_hit(int slru_idx);
@@ -1572,5 +1171,6 @@ extern void pgstat_count_slru_flush(int slru_idx);
 extern void pgstat_count_slru_truncate(int slru_idx);
 extern const char *pgstat_slru_name(int slru_idx);
 extern int    pgstat_slru_index(const char *name);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index af9b41795d..621e074111 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -218,6 +218,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SHARED_TIDBITMAP,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_PER_XACT_PREDICATE_LIST,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 04431d0eb2..3b03464a1a 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -88,7 +88,7 @@ enum config_group
     PROCESS_TITLE,
     STATS,
     STATS_MONITORING,
-    STATS_COLLECTOR,
+    STATS_ACTIVITY,
     AUTOVACUUM,
     CLIENT_CONN,
     CLIENT_CONN_STATEMENT,
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 83a15f6795..77d1572a99 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.18.4
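
The PgStat_StatEntryHeader added to pgstat.h above pairs a "dropped" flag
with a reference count, and its comment says a dropped entry is removed once
the count reaches zero. The following is only a minimal sketch of that
lifecycle, not code from the patch: it uses C11 atomics instead of
pg_atomic_uint32, leaves out the LWLock, and DemoEntryHeader and the demo_*
functions are hypothetical names invented for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for PgStat_StatEntryHeader (the LWLock is omitted). */
typedef struct DemoEntryHeader
{
    atomic_bool dropped;    /* entry has been marked for removal */
    atomic_uint refcount;   /* number of holders currently referencing it */
} DemoEntryHeader;

static void
demo_entry_init(DemoEntryHeader *h)
{
    atomic_init(&h->dropped, false);
    atomic_init(&h->refcount, 0);
}

/*
 * Take a reference.  Returns false if the entry is already marked dropped.
 * Assumption: a real implementation would have to serialize this check with
 * demo_entry_drop(), for example under the entry's lock.
 */
static bool
demo_entry_acquire(DemoEntryHeader *h)
{
    if (atomic_load(&h->dropped))
        return false;
    atomic_fetch_add(&h->refcount, 1);
    return true;
}

/*
 * Release a reference.  Returns true when the caller was the last holder of
 * a dropped entry, meaning the caller should now free the entry.
 */
static bool
demo_entry_release(DemoEntryHeader *h)
{
    unsigned int prev = atomic_fetch_sub(&h->refcount, 1);

    assert(prev > 0);
    return prev == 1 && atomic_load(&h->dropped);
}

/*
 * Mark the entry dropped.  Returns true if nobody holds a reference, so the
 * entry can be freed immediately; otherwise the last releaser frees it.
 */
static bool
demo_entry_drop(DemoEntryHeader *h)
{
    atomic_store(&h->dropped, true);
    return atomic_load(&h->refcount) == 0;
}
```

In this sketch the actual removal is deferred to whichever caller sees a
"true" return value, which is one way to read the "removed when refcount goes
to zero" comment; the patch itself may arrange this differently.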

From 433138e85f3b6e98a4a4a31d013764196f354fd7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:09 +0900
Subject: [PATCH v41 5/7] Doc part of shared-memory based stats collector.

---
 doc/src/sgml/catalogs.sgml          |   6 +-
 doc/src/sgml/config.sgml            |  34 ++++----
 doc/src/sgml/high-availability.sgml |  13 +--
 doc/src/sgml/monitoring.sgml        | 127 +++++++++++++---------------
 doc/src/sgml/ref/pg_dump.sgml       |   9 +-
 5 files changed, 90 insertions(+), 99 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 5fb9dca425..884dc12340 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -9211,9 +9211,9 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l
   <para>
    <xref linkend="view-table"/> lists the system views described here.
    More detailed documentation of each view follows below.
-   There are some additional views that provide access to the results of
-   the statistics collector; they are described in <xref
-   linkend="monitoring-stats-views-table"/>.
+   There are some additional views that provide access to the activity
+   statistics; they are described in
+   <xref linkend="monitoring-stats-views-table"/>.
   </para>
 
   <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f043433e31..b98d47189f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7276,11 +7276,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
     <title>Run-time Statistics</title>
 
     <sect2 id="runtime-config-statistics-collector">
-     <title>Query and Index Statistics Collector</title>
+     <title>Query and Index Activity Statistics</title>
 
      <para>
-      These parameters control server-wide statistics collection features.
-      When statistics collection is enabled, the data that is produced can be
+      These parameters control server-wide activity statistics features.
+      When activity statistics are enabled, the data that is produced can be
       accessed via the <structname>pg_stat</structname> and
       <structname>pg_statio</structname> family of system views.
       Refer to <xref linkend="monitoring"/> for more information.
@@ -7296,14 +7296,13 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables the collection of information on the currently
-        executing command of each session, along with the time when
-        that command began execution. This parameter is on by
-        default. Note that even when enabled, this information is not
-        visible to all users, only to superusers and the user owning
-        the session being reported on, so it should not represent a
-        security risk.
-        Only superusers can change this setting.
+        Enables activity tracking on the currently executing command of
+        each session, along with the time when that command began
+        execution. This parameter is on by default. Note that even when
+        enabled, this information is not visible to all users, only to
+        superusers and the user owning the session being reported on, so it
+        should not represent a security risk.  Only superusers can change this
+        setting.
        </para>
       </listitem>
      </varlistentry>
@@ -7334,9 +7333,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables collection of statistics on database activity.
+        Enables tracking of database activity.
         This parameter is on by default, because the autovacuum
-        daemon needs the collected information.
+        daemon needs the activity information.
         Only superusers can change this setting.
        </para>
       </listitem>
@@ -8397,7 +8396,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Specifies the fraction of the total number of heap tuples counted in
-        the previous statistics collection that can be inserted without
+        the previously collected statistics that can be inserted without
         incurring an index scan at the <command>VACUUM</command> cleanup stage.
         This setting currently applies to B-tree indexes only.
        </para>
@@ -8409,9 +8408,10 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         the index contains deleted pages that can be recycled during cleanup.
         Index statistics are considered to be stale if the number of newly
         inserted tuples exceeds the <varname>vacuum_cleanup_index_scale_factor</varname>
-        fraction of the total number of heap tuples detected by the previous
-        statistics collection. The total number of heap tuples is stored in
-        the index meta-page. Note that the meta-page does not include this data
+
+        fraction of the total number of heap tuples in the previously
+        collected statistics. The total number of heap tuples is stored in the
+        index meta-page. Note that the meta-page does not include this data
         until <command>VACUUM</command> finds no dead tuples, so B-tree index
         scan at the cleanup stage can only be skipped if the second and
         subsequent <command>VACUUM</command> cycles detect no dead tuples.
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 19d7bd2b28..3a1dc17057 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2329,12 +2329,13 @@ LOG:  database system is ready to accept read only connections
    </para>
 
    <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
-    index usage, etc., will be recorded normally on the standby. Replayed
-    actions will not duplicate their effects on primary, so replaying an
-    insert will not increment the Inserts column of pg_stat_user_tables.
-    The stats file is deleted at the start of recovery, so stats from primary
-    and standby will differ; this is considered a feature, not a bug.
+    Activity statistics are collected during recovery. All scans, reads,
+    blocks, index usage, etc., will be recorded normally on the
+    standby. Replayed actions will not duplicate their effects on primary, so
+    replaying an insert will not increment the Inserts column of
+    pg_stat_user_tables.  The activity statistics are reset at the start of
+    recovery, so stats from primary and standby will differ; this is
+    considered a feature, not a bug.
    </para>
 
    <para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 98e1995453..4c1880bbc0 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -22,7 +22,7 @@
   <para>
    Several tools are available for monitoring database activity and
    analyzing performance.  Most of this chapter is devoted to describing
-   <productname>PostgreSQL</productname>'s statistics collector,
+   <productname>PostgreSQL</productname>'s activity statistics,
    but one should not neglect regular Unix monitoring programs such as
    <command>ps</command>, <command>top</command>, <command>iostat</command>, and <command>vmstat</command>.
    Also, once one has identified a
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    primary server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   primary process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   primary process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
@@ -130,20 +128,21 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
  </sect1>
 
  <sect1 id="monitoring-stats">
-  <title>The Statistics Collector</title>
+  <title>The Activity Statistics</title>
 
   <indexterm zone="monitoring-stats">
    <primary>statistics</primary>
   </indexterm>
 
   <para>
-   <productname>PostgreSQL</productname>'s <firstterm>statistics collector</firstterm>
-   is a subsystem that supports collection and reporting of information about
-   server activity.  Presently, the collector can count accesses to tables
-   and indexes in both disk-block and individual-row terms.  It also tracks
-   the total number of rows in each table, and information about vacuum and
-   analyze actions for each table.  It can also count calls to user-defined
-   functions and the total time spent in each one.
+   <productname>PostgreSQL</productname>'s <firstterm>activity
+   statistics</firstterm> is a subsystem that supports tracking and reporting
+   of information about server activity.  Presently, the activity statistics
+   tracks the count of accesses to tables and indexes in both disk-block and
+   individual-row terms.  It also tracks the total number of rows in each
+   table, and information about vacuum and analyze actions for each table.  It
+   can also track calls to user-defined functions and the total time spent in
+   each one.
   </para>
 
   <para>
@@ -151,15 +150,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    information about exactly what is going on in the system right now, such as
    the exact command currently being executed by other server processes, and
    which other connections exist in the system.  This facility is independent
-   of the collector process.
+   of the activity statistics.
   </para>
 
  <sect2 id="monitoring-stats-setup">
-  <title>Statistics Collection Configuration</title>
+  <title>Activity Statistics Configuration</title>
 
   <para>
-   Since collection of statistics adds some overhead to query execution,
-   the system can be configured to collect or not collect information.
+   Since tracking for the activity statistics adds some overhead to query
+   execution, the system can be configured to track or not track activity.
    This is controlled by configuration parameters that are normally set in
    <filename>postgresql.conf</filename>.  (See <xref linkend="runtime-config"/> for
    details about setting configuration parameters.)
@@ -172,7 +171,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The parameter <xref linkend="guc-track-counts"/> controls whether
-   statistics are collected about table and index accesses.
+   table and index accesses are tracked.
   </para>
 
   <para>
@@ -196,18 +195,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
   </para>
 
   <para>
-   The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
-   When the server shuts down cleanly, a permanent copy of the statistics
-   data is stored in the <filename>pg_stat</filename> subdirectory, so that
-   statistics can be retained across server restarts.  When recovery is
-   performed at server start (e.g., after immediate shutdown, server crash,
-   and point-in-time recovery), all statistics counters are reset.
+   When the server shuts down cleanly, a permanent copy of the statistics
+   data is stored in the <filename>pg_stat</filename> subdirectory, so that
+   statistics can be retained across server restarts.  When recovery is
+   performed at server start (e.g., after immediate shutdown, server crash,
+   and point-in-time recovery), all statistics counters are reset.
   </para>
 
  </sect2>
@@ -220,48 +212,46 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    linkend="monitoring-stats-dynamic-views-table"/>, are available to show
    the current state of the system. There are also several other
    views, listed in <xref
-   linkend="monitoring-stats-views-table"/>, available to show the results
-   of statistics collection.  Alternatively, one can
-   build custom views using the underlying statistics functions, as discussed
-   in <xref linkend="monitoring-stats-functions"/>.
+   linkend="monitoring-stats-views-table"/>, available to show the activity
+   statistics.  Alternatively, one can build custom views using the underlying
+   statistics functions, as discussed in
+   <xref linkend="monitoring-stats-functions"/>.
   </para>
 
   <para>
-   When using the statistics to monitor collected data, it is important
-   to realize that the information does not update instantaneously.
-   Each individual server process transmits new statistical counts to
-   the collector just before going idle; so a query or transaction still in
-   progress does not affect the displayed totals.  Also, the collector itself
-   emits a new report at most once per <varname>PGSTAT_STAT_INTERVAL</varname>
-   milliseconds (500 ms unless altered while building the server).  So the
-   displayed information lags behind actual activity.  However, current-query
-   information collected by <varname>track_activities</varname> is
-   always up-to-date.
+   When using the activity statistics, it is important to realize that the
+   information does not update instantaneously.  Each individual server
+   process writes out new statistical counts just before going idle, but no
+   more often than once per <varname>PGSTAT_STAT_INTERVAL</varname>
+   milliseconds (1 second unless altered while building the server); so a
+   query or transaction still in progress does not affect the displayed
+   totals.  However, current-query information tracked by
+   <varname>track_activities</varname> is always up-to-date.
   </para>
 
   <para>
    Another important point is that when a server process is asked to display
-   any of these statistics, it first fetches the most recent report emitted by
-   the collector process and then continues to use this snapshot for all
-   statistical views and functions until the end of its current transaction.
-   So the statistics will show static information as long as you continue the
-   current transaction.  Similarly, information about the current queries of
-   all sessions is collected when any such information is first requested
-   within a transaction, and the same information will be displayed throughout
-   the transaction.
-   This is a feature, not a bug, because it allows you to perform several
-   queries on the statistics and correlate the results without worrying that
-   the numbers are changing underneath you.  But if you want to see new
-   results with each query, be sure to do the queries outside any transaction
-   block.  Alternatively, you can invoke
+   any of these statistics, it first reads the current statistics and then
+   continues to use this snapshot for all statistical views and functions
+   until the end of its current transaction.  So the statistics will show
+   static information as long as you continue the current transaction.
+   Similarly, information about the current queries of all sessions is tracked
+   when any such information is first requested within a transaction, and the
+   same information will be displayed throughout the transaction.  This is a
+   feature, not a bug, because it allows you to perform several queries on the
+   statistics and correlate the results without worrying that the numbers are
+   changing underneath you.  But if you want to see new results with each
+   query, be sure to do the queries outside any transaction block.
+   Alternatively, you can invoke
    <function>pg_stat_clear_snapshot</function>(), which will discard the
    current transaction's statistics snapshot (if any).  The next use of
    statistical information will cause a new snapshot to be fetched.
   </para>
-
+  
   <para>
-   A transaction can also see its own statistics (as yet untransmitted to the
-   collector) in the views <structname>pg_stat_xact_all_tables</structname>,
+   A transaction can also see its own statistics (as yet unwritten to the
+   server-wide activity statistics) in the
+   views <structname>pg_stat_xact_all_tables</structname>,
    <structname>pg_stat_xact_sys_tables</structname>,
    <structname>pg_stat_xact_user_tables</structname>, and
    <structname>pg_stat_xact_user_functions</structname>.  These numbers do not act as
@@ -637,7 +627,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    kernel's I/O cache, and might therefore still be fetched without
    requiring a physical read. Users interested in obtaining more
    detailed information on <productname>PostgreSQL</productname> I/O behavior are
-   advised to use the <productname>PostgreSQL</productname> statistics collector
+   advised to use the <productname>PostgreSQL</productname> activity statistics
    in combination with operating system utilities that allow insight
    into the kernel's handling of I/O.
   </para>
@@ -1074,10 +1064,6 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>LogicalLauncherMain</literal></entry>
       <entry>Waiting in main loop of logical replication launcher process.</entry>
      </row>
-     <row>
-      <entry><literal>PgStatMain</literal></entry>
-      <entry>Waiting in main loop of statistics collector process.</entry>
-     </row>
      <row>
       <entry><literal>RecoveryWalStream</literal></entry>
       <entry>Waiting in main loop of startup process for WAL to arrive, during
@@ -1832,6 +1818,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
     </thead>
 
     <tbody>
+     <row>
+      <entry><literal>ActivityStatistics</literal></entry>
+      <entry>Waiting to write out activity statistics to shared memory.</entry>
+     </row>
      <row>
       <entry><literal>AddinShmemInit</literal></entry>
       <entry>Waiting to manage an extension's space allocation in shared
@@ -5950,9 +5940,10 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
      <entry><literal>performing final cleanup</literal></entry>
      <entry>
        <command>VACUUM</command> is performing final cleanup.  During this phase,
-       <command>VACUUM</command> will vacuum the free space map, update statistics
-       in <literal>pg_class</literal>, and report statistics to the statistics
-       collector.  When this phase is completed, <command>VACUUM</command> will end.
+       <command>VACUUM</command> will vacuum the free space map, update
+       statistics in <literal>pg_class</literal>, and update the system-wide
+       activity statistics.  When this phase is completed,
+       <command>VACUUM</command> will end.
      </entry>
     </row>
    </tbody>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 0aa35cf0c3..ad105cb2a6 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1280,11 +1280,10 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The database activity of <application>pg_dump</application> is
-   normally collected by the statistics collector.  If this is
-   undesirable, you can set parameter <varname>track_counts</varname>
-   to false via <envar>PGOPTIONS</envar> or the <literal>ALTER
-   USER</literal> command.
+   The database activity of <application>pg_dump</application> is normally
+   collected.  If this is undesirable, you can set
+   parameter <varname>track_counts</varname> to false
+   via <envar>PGOPTIONS</envar> or the <literal>ALTER USER</literal> command.
   </para>
 
  </refsect1>
-- 
2.18.4

From 940730c60899fd5c14851c4e193c7e3708e99b4b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 29 Sep 2020 22:59:33 +0900
Subject: [PATCH v41 6/7] Remove the GUC stats_temp_directory

The new stats collection system doesn't need a temporary directory, so
just remove it. pg_stat_statements is modified to use the pg_stat
directory to store its temporary files.  As a result, basebackup copies
pg_stat_statements' temporary file if it exists.
---
 .../pg_stat_statements/pg_stat_statements.c   | 13 +++---
 doc/src/sgml/backup.sgml                      |  2 -
 doc/src/sgml/config.sgml                      | 19 --------
 doc/src/sgml/storage.sgml                     |  6 ---
 src/backend/postmaster/pgstat.c               | 10 -----
 src/backend/replication/basebackup.c          | 36 ----------------
 src/backend/utils/misc/guc.c                  | 43 -------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/bin/initdb/initdb.c                       |  1 -
 src/bin/pg_rewind/filemap.c                   |  7 ---
 src/include/pgstat.h                          |  3 --
 src/test/perl/PostgresNode.pm                 |  4 --
 12 files changed, 6 insertions(+), 139 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 1eac9edaee..5eaceb60a7 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -88,14 +88,13 @@ PG_MODULE_MAGIC;
 #define PGSS_DUMP_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pg_stat_statements.stat"
 
 /*
- * Location of external query text file.  We don't keep it in the core
- * system's stats_temp_directory.  The core system can safely use that GUC
- * setting, because the statistics collector temp file paths are set only once
- * as part of changing the GUC, but pg_stat_statements has no way of avoiding
- * race conditions.  Besides, we only expect modest, infrequent I/O for query
- * strings, so placing the file on a faster filesystem is not compelling.
+ * Location of external query text file.  We don't keep it in the core system's
+ * pg_stats.  pg_stat_statements has no way of avoiding race conditions even if
+ * the directory were specified by a GUC.  Besides, we only expect modest,
+ * infrequent I/O for query strings, so placing the file on a faster filesystem
+ * is not compelling.
  */
-#define PGSS_TEXT_FILE    PG_STAT_TMP_DIR "/pgss_query_texts.stat"
+#define PGSS_TEXT_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pgss_query_texts.stat"
 
 /* Magic number identifying the stats file format */
 static const uint32 PGSS_FILE_HEADER = 0x20171004;
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 42a8ed328d..dd3d8892d8 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1146,8 +1146,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b98d47189f..5082298919 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7389,25 +7389,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 3234adb639..6bac5e075e 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -120,12 +120,6 @@ Item
   subsystem</entry>
 </row>
 
-<row>
- <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
-</row>
-
 <row>
  <entry><filename>pg_subtrans</filename></entry>
  <entry>Subdirectory containing subtransaction status data</entry>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 45bb19fe1e..728cd92609 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -99,16 +99,6 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
- */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
-
 /*
  * List of SLRU names that we keep stats for.  There is no central registry of
  * SLRUs, so we use this fixed list instead.  The "other" entry is used for
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 57531d7d48..25eabbb1ad 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -87,9 +87,6 @@ static int    basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
 /* Was the backup currently in-progress initiated in recovery mode? */
 static bool backup_started_in_recovery = false;
 
-/* Relative path of temporary statistics directory */
-static char *statrelpath = NULL;
-
 /*
  * Size of each block sent into the tar stream for larger files.
  */
@@ -152,13 +149,6 @@ struct exclude_list_item
  */
 static const char *const excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    PG_STAT_TMP_DIR,
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
@@ -261,7 +251,6 @@ perform_base_backup(basebackup_options *opt)
     StringInfo    labelfile;
     StringInfo    tblspc_map_file;
     backup_manifest_info manifest;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
     backup_total = 0;
@@ -284,8 +273,6 @@ perform_base_backup(basebackup_options *opt)
     Assert(CurrentResourceOwner == NULL);
     CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -314,18 +301,6 @@ perform_base_backup(basebackup_options *opt)
         tablespaceinfo *ti;
         int            tblspc_streamed = 0;
 
-        /*
-         * Calculate the relative path of temporary statistics directory in
-         * order to skip the files which are located in that directory later.
-         */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
-
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
         ti->size = -1;
@@ -1365,17 +1340,6 @@ sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces,
         if (excludeFound)
             continue;
 
-        /*
-         * Exclude contents of directory specified by statrelpath if not set
-         * to the default (pg_stat_tmp) which is caught in the loop above.
-         */
-        if (statrelpath != NULL && strcmp(pathbuf, statrelpath) == 0)
-        {
-            elog(DEBUG1, "contents of directory \"%s\" excluded from backup", statrelpath);
-            size += _tarWriteDir(pathbuf, basepathlen, &statbuf, sizeonly);
-            continue;
-        }
-
         /*
          * We can skip pg_wal, the WAL segments need to be fetched from the
          * WAL archive anyway. But include it as an empty directory anyway, so
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index d46ae0ca8d..ca6340dcb5 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -202,7 +202,6 @@ static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource sourc
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_huge_page_size(int *newval, void **extra, GucSource source);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -558,8 +557,6 @@ char       *HbaFileName;
 char       *IdentFileName;
 char       *external_pid_file;
 
-char       *pgstat_temp_directory;
-
 char       *application_name;
 
 int            tcp_keepalives_idle;
@@ -4309,17 +4306,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_PRIMARY,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11636,35 +11622,6 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
     return true;
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 668a2d033a..7183c08305 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -586,7 +586,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index ee3bfa82f4..5b7eb30f14 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -217,7 +217,6 @@ static const char *const subdirs[] = {
     "pg_replslot",
     "pg_tblspc",
     "pg_stat",
-    "pg_stat_tmp",
     "pg_xact",
     "pg_logical",
     "pg_logical/snapshots",
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 314b064b22..77dbf8e1b8 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -87,13 +87,6 @@ struct exclude_list_item
  */
 static const char *excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    "pg_stat_tmp",                /* defined as PG_STAT_TMP_DIR */
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ac9846b9c1..3dbdf9d844 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -32,9 +32,6 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/saved_stats"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/saved_stats.tmp"
 
-/* Default directory to store temporary statistics data in */
-#define PG_STAT_TMP_DIR        "pg_stat_tmp"
-
 /* Values for track_functions GUC variable --- order is significant! */
 typedef enum TrackFunctionsLevel
 {
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index ebcaeb44fe..8772fcc970 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -455,10 +455,6 @@ sub init
     print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
       if defined $ENV{TEMP_CONFIG};
 
-    # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
-    # concurrently must not share a stats_temp_directory.
-    print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
     if ($params{allows_streaming})
     {
         if ($params{allows_streaming} eq "logical")
-- 
2.18.4

From 8e65b5d934de1b8bf52022c1be7776c5508e3d6c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 29 Sep 2020 23:19:58 +0900
Subject: [PATCH v41 7/7] Exclude pg_stat directory from base backup

basebackup sends the contents of the pg_stat directory, which are
removed at startup from the backup anyway. Now that pg_stat_statements
saves a temporary file into that directory, exclude the pg_stat
directory from base backups.
---
 src/backend/replication/basebackup.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 25eabbb1ad..dd35920e82 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -149,6 +149,13 @@ struct exclude_list_item
  */
 static const char *const excludeDirContents[] =
 {
+    /*
+     * Skip statistics files. PGSTAT_STAT_PERMANENT_DIRECTORY must be skipped
+     * because the files in the directory will be removed at startup from the
+     * backup.
+     */
+    PGSTAT_STAT_PERMANENT_DIRECTORY,
+
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
-- 
2.18.4


Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
Commit 4f841ce3f7 conflicted with this patch set, so I rebased it.

At Fri, 06 Nov 2020 09:27:56 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> Fixed a bug that pgstat_report_replslot fails to reuse entries that
> are marked as "dropped".
> 
> Fixed comments for the caller sites to pgstat_report_replslot(_drop)
> in ReplicationSlotCreate() and ReplicationSlotDropPtr().

The following changes were made along with the rebasing.

- Removed a useless struct PgStat_HashMountInfo.

- Removed a duplicate member "dropped" from PgStat_ReplSlot.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From deb93b7c5c0ec41717256530c02db5e4199d3877 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:03 +0900
Subject: [PATCH v42 1/7] sequential scan for dshash

dshash did not allow scanning all entries sequentially. This patch adds
that functionality. The interface is similar to, but slightly different
from, both that of dynahash and the simple dshash search functions. One
of the most significant differences is that the sequential scan
interface of dshash always needs a call to dshash_seq_term when the
scan ends. Another is locking. dshash holds a partition lock when
returning an entry; dshash_seq_next() also holds a lock when returning
an entry, but callers shouldn't release it, since the lock is essential
to continue the scan. The seqscan interface allows entry deletion
during a scan. In-scan deletion should be performed by
dshash_delete_current().
---
 src/backend/lib/dshash.c | 160 ++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  22 ++++++
 2 files changed, 181 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 78ccf03217..b829167872 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -127,6 +127,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there at a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +157,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -324,7 +332,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -592,6 +600,156 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through a dshash table and return all the
+ *           elements one by one; return NULL when there are no more.
+ *
+ * dshash_seq_term should always be called when a scan has finished.
+ * The caller may delete returned elements in the middle of a scan by using
+ * dshash_delete_current(). exclusive must be true to delete elements.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool exclusive)
+{
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->exclusive = exclusive;
+}
+
+/*
+ * Returns the next element.
+ *
+ * Returned elements are locked and the caller must not explicitly release
+ * them. A later dshash_seq_next() or dshash_seq_term() releases the lock.
+ */
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert(status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+        LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                      status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        status->curpartition = partition;
+
+        /* resize doesn't happen from now until seq scan ends */
+        status->nbuckets =
+            NUM_BUCKETS(status->hash_table->control->size_log2);
+        ensure_valid_bucket_pointers(status->hash_table);
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(status->hash_table,
+                                               status->curpartition),
+                                status->exclusive ? LW_EXCLUSIVE : LW_SHARED));
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        int next_partition;
+
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            return NULL;
+        }
+
+        /* Check whether we are moving to the next partition */
+        next_partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+
+        if (status->curpartition != next_partition)
+        {
+            /*
+             * Move to the next partition. Lock the next partition, then release
+             * the current one (not in the reverse order) to prevent concurrent
+             * resizing.  Avoid deadlock by taking locks in the same order as
+             * resize().
+             */
+            LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                         next_partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                         status->curpartition));
+            status->curpartition = next_partition;
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * The caller may delete the item. Store the next item in case of deletion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+/*
+ * Terminates the seqscan and releases all locks.
+ *
+ * Should always be called when finishing or exiting a seqscan.
+ */
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+
+    if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+/* Remove the current entry during a seq scan. */
+void
+dshash_delete_current(dshash_seq_status *status)
+{
+    dshash_table       *hash_table    = status->hash_table;
+    dshash_table_item  *item        = status->curitem;
+    size_t                partition    = PARTITION_FOR_HASH(item->hash);
+
+    Assert(status->exclusive);
+    Assert(hash_table->control->magic == DSHASH_MAGIC);
+    Assert(hash_table->find_locked);
+    Assert(hash_table->find_exclusively_locked);
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition),
+                                LW_EXCLUSIVE));
+
+    delete_item(hash_table, item);
+}
+
+/* Get the current entry during a seq scan. */
+void *
+dshash_get_current(dshash_seq_status *status)
+{
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b86df68e77..c337099061 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,21 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state. The detail is exposed to let users know the storage
+ * size but it should be considered as an opaque type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;    /* dshash table working on */
+    int                    curbucket;    /* bucket number we are at */
+    int                    nbuckets;    /* total number of buckets in the dshash */
+    dshash_table_item  *curitem;    /* item we are currently at */
+    dsa_pointer            pnextitem;    /* dsa-pointer to the next item */
+    int                    curpartition;    /* partition number we are at */
+    bool                exclusive;    /* locking mode */
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
                                    const dshash_parameters *params,
@@ -80,6 +95,13 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
+extern void dshash_delete_current(dshash_seq_status *status);
+extern void *dshash_get_current(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.18.4
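
The scan protocol introduced by the patch above (init, repeated next, mandatory term, with the next partition locked before the current one is released) can be illustrated with a small standalone model. This is only a sketch and not PostgreSQL code: every toy_* name is hypothetical, the partition "locks" are plain counters rather than LWLocks, and the table has a fixed size of 8 buckets in 4 partitions.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model (assumption, not dshash): 8 buckets split over 4 partitions. */
#define TOY_SIZE_LOG2        3
#define TOY_PARTITIONS_LOG2  2
#define TOY_NUM_PARTITIONS   (1 << TOY_PARTITIONS_LOG2)
#define TOY_NUM_SPLITS       (TOY_SIZE_LOG2 - TOY_PARTITIONS_LOG2)
#define TOY_NUM_BUCKETS      (1 << TOY_SIZE_LOG2)
/* Same shift as PARTITION_FOR_BUCKET_INDEX in the patch. */
#define TOY_PARTITION_FOR_BUCKET(b)  ((b) >> TOY_NUM_SPLITS)

typedef struct toy_item { int key; struct toy_item *next; } toy_item;

typedef struct toy_table
{
    toy_item *buckets[TOY_NUM_BUCKETS];
    int       lock_held[TOY_NUM_PARTITIONS];  /* stand-in for LWLocks */
} toy_table;

typedef struct toy_seq_status
{
    toy_table *table;
    int        curbucket;
    toy_item  *nextitem;      /* saved so the current item could be deleted */
    int        curpartition;  /* -1 until the first toy_seq_next() call */
} toy_seq_status;

static void toy_seq_init(toy_seq_status *st, toy_table *t)
{
    st->table = t;
    st->curbucket = 0;
    st->nextitem = NULL;
    st->curpartition = -1;
}

static toy_item *toy_seq_next(toy_seq_status *st)
{
    toy_item *next;

    if (st->curpartition < 0)
    {
        /* First call: lock the partition that owns bucket 0. */
        st->curpartition = TOY_PARTITION_FOR_BUCKET(st->curbucket);
        st->table->lock_held[st->curpartition]++;
        next = st->table->buckets[st->curbucket];
    }
    else
        next = st->nextitem;

    while (next == NULL)
    {
        int np;

        if (++st->curbucket >= TOY_NUM_BUCKETS)
            return NULL;      /* the lock stays held until toy_seq_term() */
        np = TOY_PARTITION_FOR_BUCKET(st->curbucket);
        if (np != st->curpartition)
        {
            st->table->lock_held[np]++;                /* take next first */
            st->table->lock_held[st->curpartition]--;  /* then drop current */
            st->curpartition = np;
        }
        next = st->table->buckets[st->curbucket];
    }
    st->nextitem = next->next;
    return next;
}

static void toy_seq_term(toy_seq_status *st)
{
    if (st->curpartition >= 0)
        st->table->lock_held[st->curpartition]--;
}

/* Scans a 4-item table; returns the item count, or -1 on a lock anomaly. */
int toy_demo_scan(void)
{
    toy_item items[4] = {{10, NULL}, {11, NULL}, {12, NULL}, {13, NULL}};
    toy_table t = {{NULL}, {0}};
    toy_seq_status st;
    toy_item *it;
    int count = 0, p;

    t.buckets[0] = &items[0];
    t.buckets[3] = &items[1];
    items[1].next = &items[2];   /* two items chained in bucket 3 */
    t.buckets[7] = &items[3];

    toy_seq_init(&st, &t);
    while ((it = toy_seq_next(&st)) != NULL)
    {
        /* The partition of the returned item must be the only one locked. */
        for (p = 0; p < TOY_NUM_PARTITIONS; p++)
            if (t.lock_held[p] != (p == st.curpartition ? 1 : 0))
                return -1;
        count++;
    }
    toy_seq_term(&st);           /* always required, even after NULL */
    for (p = 0; p < TOY_NUM_PARTITIONS; p++)
        if (t.lock_held[p] != 0)
            return -1;
    return count;
}
```

In this model toy_demo_scan() visits the items in buckets 0, 3 and 7 and leaves no lock held after toy_seq_term(). The real interface additionally distinguishes shared from exclusive locking and pins the table size for the duration of the scan, which the model leaves out.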

From 8c06b1f7f565635b25494721ec961aee1e7f0826 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:57 +0900
Subject: [PATCH v42 2/7] Add conditional lock feature to dshash

Dshash currently waits for a lock unconditionally. That is inconvenient
when we want to avoid being blocked by other processes. This commit
adds dshash_find_extended, an alternative to dshash_find and
dshash_find_or_insert that allows immediate return on lock failure.
---
 src/backend/lib/dshash.c | 98 +++++++++++++++++++++-------------------
 src/include/lib/dshash.h |  3 ++
 2 files changed, 55 insertions(+), 46 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index b829167872..9c90096f3d 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -383,6 +383,10 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  * the caller must take care to ensure that the entry is not left corrupted.
  * The lock mode is either shared or exclusive depending on 'exclusive'.
  *
+ * If found is not NULL, *found is set to true if the key is found in the hash
+ * table. If the key is not found, *found is set to false and a pointer to a
+ * newly created entry is returned.
+ *
  * The caller must not lock a lock already.
  *
  * Note that the lock held is in fact an LWLock, so interrupts will be held on
@@ -392,36 +396,7 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
 {
-    dshash_hash hash;
-    size_t        partition;
-    dshash_table_item *item;
-
-    hash = hash_key(hash_table, key);
-    partition = PARTITION_FOR_HASH(hash);
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
-
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
-    ensure_valid_bucket_pointers(hash_table);
-
-    /* Search the active bucket. */
-    item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
-
-    if (!item)
-    {
-        /* Not found. */
-        LWLockRelease(PARTITION_LOCK(hash_table, partition));
-        return NULL;
-    }
-    else
-    {
-        /* The caller will free the lock by calling dshash_release_lock. */
-        hash_table->find_locked = true;
-        hash_table->find_exclusively_locked = exclusive;
-        return ENTRY_FROM_ITEM(item);
-    }
+    return dshash_find_extended(hash_table, key, exclusive, false, false, NULL);
 }
 
 /*
@@ -439,31 +414,60 @@ dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
 {
-    dshash_hash hash;
-    size_t        partition_index;
-    dshash_partition *partition;
+    return dshash_find_extended(hash_table, key, true, false, true, found);
+}
+
+
+/*
+ * Find the key in the hash table.
+ *
+ * "exclusive" is the lock mode in which the partition for the returned item
+ * is locked.  If "nowait" is true, the function returns immediately if the
+ * required lock cannot be acquired.  "insert" indicates insert mode: if the
+ * key is not found, a new entry is inserted and *found is set to false;
+ * otherwise *found is set to true. "found" must be non-null in this mode.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool insert, bool *found)
+{
+    dshash_hash hash = hash_key(hash_table, key);
+    size_t        partidx = PARTITION_FOR_HASH(hash);
+    dshash_partition *partition = &hash_table->control->partitions[partidx];
+    LWLockMode  lockmode = exclusive ? LW_EXCLUSIVE : LW_SHARED;
     dshash_table_item *item;
 
-    hash = hash_key(hash_table, key);
-    partition_index = PARTITION_FOR_HASH(hash);
-    partition = &hash_table->control->partitions[partition_index];
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
+    /* must be exclusive when insert allowed */
+    Assert(!insert || (exclusive && found != NULL));
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (!nowait)
+        LWLockAcquire(PARTITION_LOCK(hash_table, partidx), lockmode);
+    else if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partidx),
+                                       lockmode))
+        return NULL;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
     item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
 
     if (item)
-        *found = true;
+    {
+        if (found)
+            *found = true;
+    }
     else
     {
-        *found = false;
+        if (found)
+            *found = false;
+
+        if (!insert)
+        {
+            /* The caller didn't ask us to add a new entry. */
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+            return NULL;
+        }
 
         /* Check if we are getting too full. */
         if (partition->count > MAX_COUNT_PER_PARTITION(hash_table))
@@ -479,7 +483,8 @@ restart:
              * Give up our existing lock first, because resizing needs to
              * reacquire all the locks in the right order to avoid deadlocks.
              */
-            LWLockRelease(PARTITION_LOCK(hash_table, partition_index));
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+
             resize(hash_table, hash_table->size_log2 + 1);
 
             goto restart;
@@ -493,12 +498,13 @@ restart:
         ++partition->count;
     }
 
-    /* The caller must release the lock with dshash_release_lock. */
+    /* The caller will free the lock by calling dshash_release_lock. */
     hash_table->find_locked = true;
-    hash_table->find_exclusively_locked = true;
+    hash_table->find_exclusively_locked = exclusive;
     return ENTRY_FROM_ITEM(item);
 }
 
+
 /*
  * Remove an entry by key.  Returns true if the key was found and the
  * corresponding entry was removed.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index c337099061..493e974832 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -91,6 +91,9 @@ extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                                    const void *key, bool *found);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+                                  bool exclusive, bool nowait, bool insert,
+                                  bool *found);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.18.4
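
The calling convention of the nowait lookup in the patch above can be sketched with a standalone model. This is not PostgreSQL code: the cl_* names are hypothetical, a C11 atomic_flag stands in for a single partition LWLock (so the shared/exclusive distinction is ignored), and the "hash table" is a small linear array. The point it illustrates is that a NULL result with nowait can mean either "lock busy" or "key not found", and that a non-NULL result is returned with the lock still held.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define CL_CAPACITY 8

typedef struct cl_entry { int key; int value; bool used; } cl_entry;

typedef struct cl_table
{
    atomic_flag lock;                  /* stand-in for one partition lock */
    cl_entry    entries[CL_CAPACITY];
} cl_table;

/*
 * Toy counterpart of a find-or-insert with an optional non-blocking mode.
 * On success the entry is returned with the lock held; the caller must call
 * cl_release_lock().  On NULL the lock is not held.
 */
static cl_entry *
cl_find_extended(cl_table *t, int key, bool nowait, bool insert, bool *found)
{
    cl_entry *free_slot = NULL;
    size_t    i;

    assert(!insert || found != NULL);

    if (nowait)
    {
        if (atomic_flag_test_and_set(&t->lock))
            return NULL;               /* lock busy: give up immediately */
    }
    else
    {
        while (atomic_flag_test_and_set(&t->lock))
            ;                          /* spin: models a blocking acquire */
    }

    for (i = 0; i < CL_CAPACITY; i++)
    {
        if (t->entries[i].used && t->entries[i].key == key)
        {
            if (found)
                *found = true;
            return &t->entries[i];
        }
        if (!t->entries[i].used && free_slot == NULL)
            free_slot = &t->entries[i];
    }

    if (found)
        *found = false;
    if (!insert || free_slot == NULL)
    {
        atomic_flag_clear(&t->lock);   /* nothing to return: drop the lock */
        return NULL;
    }
    free_slot->used = true;
    free_slot->key = key;
    free_slot->value = 0;
    return free_slot;
}

static void cl_release_lock(cl_table *t)
{
    atomic_flag_clear(&t->lock);
}

/* Runs five checks and returns how many of them passed. */
int cl_demo(void)
{
    cl_table  t = { ATOMIC_FLAG_INIT, {{0, 0, false}} };
    bool      found = true;
    cl_entry *e;
    int       passed = 0;

    /* Insert key 42; the lock is held when the call returns. */
    e = cl_find_extended(&t, 42, false, true, &found);
    if (e != NULL && !found)
        passed++;
    /* While the lock is held, a nowait lookup fails without blocking. */
    if (cl_find_extended(&t, 42, true, false, NULL) == NULL)
        passed++;
    cl_release_lock(&t);
    /* After release, the same nowait lookup succeeds. */
    e = cl_find_extended(&t, 42, true, false, &found);
    if (e != NULL && found)
        passed++;
    cl_release_lock(&t);
    /* Missing key without insert: NULL, *found false, lock released. */
    e = cl_find_extended(&t, 7, true, false, &found);
    if (e == NULL && !found)
        passed++;
    if (!atomic_flag_test_and_set(&t.lock))
        passed++;                      /* the lock was indeed free */
    return passed;
}
```

Because *found is left untouched when the lock cannot be taken, a caller that needs to tell "busy" from "absent" in this model would have to pre-set *found or retry with nowait disabled; whether callers of the real function need that distinction depends on how the later patches in the series use it.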

From 51270ec2307f771e1982d380fe789701f19af10a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:59:38 +0900
Subject: [PATCH v42 3/7] Make archiver process an auxiliary process

This is a preliminary patch for shared-memory based stats collector.

The archiver process must be an auxiliary process, since it uses shared
memory once stats data is moved into shared memory. Make the process
an auxiliary process so that it works.
---
 src/backend/access/transam/xlogarchive.c |   6 +-
 src/backend/bootstrap/bootstrap.c        |  22 ++--
 src/backend/postmaster/pgarch.c          | 130 +++--------------------
 src/backend/postmaster/postmaster.c      |  50 +++++----
 src/backend/storage/lmgr/proc.c          |   1 +
 src/include/access/xlog.h                |   3 +
 src/include/access/xlogarchive.h         |   1 +
 src/include/miscadmin.h                  |   2 +
 src/include/postmaster/pgarch.h          |   4 +-
 src/include/storage/pmsignal.h           |   1 -
 src/include/storage/proc.h               |   3 +
 11 files changed, 69 insertions(+), 154 deletions(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index cae93ab69d..6908bec2f9 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -29,7 +29,9 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/latch.h"
 #include "storage/pmsignal.h"
+#include "storage/proc.h"
 
 /*
  * Attempt to retrieve the specified file from off-line archival storage.
@@ -489,8 +491,8 @@ XLogArchiveNotify(const char *xlog)
     }
 
     /* Notify archiver that it's got something to do */
-    if (IsUnderPostmaster)
-        SendPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER);
+    if (IsUnderPostmaster && ProcGlobal->archiverLatch)
+        SetLatch(ProcGlobal->archiverLatch);
 }
 
 /*
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index a7ed93fdc1..edda899554 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -317,6 +317,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
         case StartupProcess:
             MyBackendType = B_STARTUP;
             break;
+        case ArchiverProcess:
+            MyBackendType = B_ARCHIVER;
+            break;
         case BgWriterProcess:
             MyBackendType = B_BG_WRITER;
             break;
@@ -437,30 +440,29 @@ AuxiliaryProcessMain(int argc, char *argv[])
             proc_exit(1);        /* should never return */
 
         case StartupProcess:
-            /* don't set signals, startup process has its own agenda */
             StartupProcessMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
+
+        case ArchiverProcess:
+            PgArchiverMain();
+            proc_exit(1);
 
         case BgWriterProcess:
-            /* don't set signals, bgwriter has its own agenda */
             BackgroundWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case CheckpointerProcess:
-            /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalWriterProcess:
-            /* don't set signals, walwriter has its own agenda */
             InitXLOGAccess();
             WalWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalReceiverProcess:
-            /* don't set signals, walreceiver has its own agenda */
             WalReceiverMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         default:
             elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index ed1b65358d..e3a520def9 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -48,6 +48,7 @@
 #include "storage/latch.h"
 #include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
+#include "storage/procsignal.h"
 #include "utils/guc.h"
 #include "utils/ps_status.h"
 
@@ -78,13 +79,11 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
-static volatile sig_atomic_t wakened = false;
 static volatile sig_atomic_t ready_to_stop = false;
 
 /* ----------
@@ -95,8 +94,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
 static void pgarch_ArchiverCopyLoop(void);
@@ -110,75 +107,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -212,14 +140,9 @@ pgarch_forkexec(void)
 #endif                            /* EXEC_BACKEND */
 
 
-/*
- * PgArchiverMain
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
- *    since we don't use 'em, it hardly matters...
- */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+/* Main entry point for archiver process */
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -231,33 +154,26 @@ PgArchiverMain(int argc, char *argv[])
     /* SIGQUIT handler was already set up by InitPostmasterChild */
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, pgarch_waken);
+    pqsignal(SIGUSR1, procsignal_sigusr1_handler);
     pqsignal(SIGUSR2, pgarch_waken_stop);
+
     /* Reset some signals that are accepted by postmaster but not here */
     pqsignal(SIGCHLD, SIG_DFL);
+
+    /* Unblock signals (they were blocked when the postmaster forked us) */
     PG_SETMASK(&UnBlockSig);
 
-    MyBackendType = B_ARCHIVER;
-    init_ps_display(NULL);
+    /*
+     * Advertise our latch that backends can use to wake us up while we're
+     * sleeping.
+     */
+    ProcGlobal->archiverLatch = &MyProc->procLatch;
 
     pgarch_MainLoop();
 
     exit(0);
 }
 
-/* SIGUSR1 signal handler for archiver process */
-static void
-pgarch_waken(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    /* set flag that there is work to be done */
-    wakened = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
 /* SIGUSR2 signal handler for archiver process */
 static void
 pgarch_waken_stop(SIGNAL_ARGS)
@@ -282,14 +198,6 @@ pgarch_MainLoop(void)
     pg_time_t    last_copy_time = 0;
     bool        time_to_stop;
 
-    /*
-     * We run the copy loop immediately upon entry, in case there are
-     * unarchived files left over from a previous database run (or maybe the
-     * archiver died unexpectedly).  After that we wait for a signal or
-     * timeout before doing more.
-     */
-    wakened = true;
-
     /*
      * There shouldn't be anything for the archiver to do except to wait for a
      * signal ... however, the archiver exists to protect our data, so she
@@ -328,12 +236,8 @@ pgarch_MainLoop(void)
         }
 
         /* Do what we're here for */
-        if (wakened || time_to_stop)
-        {
-            wakened = false;
-            pgarch_ArchiverCopyLoop();
-            last_copy_time = time(NULL);
-        }
+        pgarch_ArchiverCopyLoop();
+        last_copy_time = time(NULL);
 
         /*
          * Sleep until a signal is received, or until a poll is forced by
@@ -354,13 +258,9 @@ pgarch_MainLoop(void)
                                WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                                timeout * 1000L,
                                WAIT_EVENT_ARCHIVER_MAIN);
-                if (rc & WL_TIMEOUT)
-                    wakened = true;
                 if (rc & WL_POSTMASTER_DEATH)
                     time_to_stop = true;
             }
-            else
-                wakened = true;
         }
 
         /*
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b7799ed1d2..a1dd6964e9 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -547,6 +547,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1785,7 +1786,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -3039,7 +3040,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3174,20 +3175,16 @@ reaper(SIGNAL_ARGS)
         }
 
         /*
-         * Was it the archiver?  If so, just try to start a new one; no need
-         * to force reset of the rest of the system.  (If fail, we'll try
-         * again in future cycles of the main loop.).  Unless we were waiting
-         * for it to shut down; don't restart it in that case, and
-         * PostmasterStateMachine() will advance to the next shutdown step.
+         * Was it the archiver?  Normal exit can be ignored; we'll start a new
+         * one at the next iteration of the postmaster's main loop, if
+         * necessary. Any other exit condition is treated as a crash.
          */
         if (pid == PgArchPID)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3435,7 +3432,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3640,6 +3637,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3936,6 +3945,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5177,7 +5187,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5222,16 +5232,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (StartWorkerNeeded || HaveCrashedWorker)
         maybe_start_bgworkers();
 
-    if (CheckPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER) &&
-        PgArchPID != 0)
-    {
-        /*
-         * Send SIGUSR1 to archiver process, to wake it up and begin archiving
-         * next WAL file.
-         */
-        signal_child(PgArchPID, SIGUSR1);
-    }
-
     /* Tell syslogger to rotate logfile if requested */
     if (SysLoggerPID != 0)
     {
@@ -5473,6 +5473,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 88566bd9fa..746bed773e 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -181,6 +181,7 @@ InitProcGlobal(void)
     ProcGlobal->startupBufferPinWaitBufId = -1;
     ProcGlobal->walwriterLatch = NULL;
     ProcGlobal->checkpointerLatch = NULL;
+    ProcGlobal->archiverLatch = NULL;
     pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
     pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e71..65443063e8 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -339,6 +339,9 @@ extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern void XLogArchiveWakeupStart(void);
+extern void XLogArchiveWakeupEnd(void);
+extern void XLogArchiveWakeup(void);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool PromoteIsTriggered(void);
diff --git a/src/include/access/xlogarchive.h b/src/include/access/xlogarchive.h
index 1c67de2ede..54ce0b97d7 100644
--- a/src/include/access/xlogarchive.h
+++ b/src/include/access/xlogarchive.h
@@ -25,6 +25,7 @@ extern void ExecuteRecoveryCommand(const char *command, const char *commandName,
 extern void KeepFileRestoredFromArchive(const char *path, const char *xlogfname);
 extern void XLogArchiveNotify(const char *xlog);
 extern void XLogArchiveNotifySeg(XLogSegNo segno);
+extern void XLogArchiveWakeup(void);
 extern void XLogArchiveForceDone(const char *xlog);
 extern bool XLogArchiveCheckDone(const char *xlog);
 extern bool XLogArchiveIsBusy(const char *xlog);
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 72e3352398..bba002ad24 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -418,6 +418,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -430,6 +431,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index b3200874ca..e3ffc63f14 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 56c5ec4481..c691acf8cd 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -34,7 +34,6 @@ typedef enum
 {
     PMSIGNAL_RECOVERY_STARTED,    /* recovery has started */
     PMSIGNAL_BEGIN_HOT_STANDBY, /* begin Hot Standby */
-    PMSIGNAL_WAKEN_ARCHIVER,    /* send a NOTIFY signal to xlog archiver */
     PMSIGNAL_ROTATE_LOGFILE,    /* send SIGUSR1 to syslogger to rotate logfile */
     PMSIGNAL_START_AUTOVAC_LAUNCHER,    /* start an autovacuum launcher */
     PMSIGNAL_START_AUTOVAC_WORKER,    /* start an autovacuum worker */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 9c9a50ae45..de20520b8c 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -345,6 +345,9 @@ typedef struct PROC_HDR
     int            startupProcPid;
     /* Buffer id of the buffer that Startup process waits for pin on, or -1 */
     int            startupBufferPinWaitBufId;
+    /* Archiver process's latch */
+    Latch       *archiverLatch;
+    /* Current shared estimate of appropriate spins_per_delay value */
 } PROC_HDR;
 
 extern PGDLLIMPORT PROC_HDR *ProcGlobal;
-- 
2.18.4

From 7ff0c9250c1691450a3fb939f6913193233664ec Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:14 +0900
Subject: [PATCH v42 4/7] Shared-memory based stats collector

Previously, activity statistics were collected via sockets and shared
among backends through files periodically. Such files can reach tens of
megabytes and are created up to once per second, and that large data
is serialized by the stats collector and then de-serialized by every
backend periodically. To avoid that cost, this patch places activity
statistics data in shared memory. Each backend accumulates statistics
numbers locally and then tries to move them into the shared statistics
at every transaction end, but at intervals no shorter than 10 seconds.
Until 60 seconds have elapsed since the last flush to shared stats, a
lock failure postpones the flush, to alleviate lock contention that
would slow down transactions.  After that, the stats flush waits for
locks so that the shared statistics do not get stale.
---
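Note: the flush timing described in the commit message can be sketched
roughly as below. This is only an illustration under stated assumptions,
not the code in this patch: the three interval values are the ones defined
in the patch's pgstat.c, but FlushState, FlushMode, flush_mode() and
flush_done() are hypothetical names, and measuring the 60-second limit from
the first failed attempt follows the pgstat.c header comment rather than
anything verified against the implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Interval values as defined in the patch's pgstat.c, in milliseconds. */
#define PGSTAT_MIN_INTERVAL        10000
#define PGSTAT_RETRY_MIN_INTERVAL   1000
#define PGSTAT_MAX_INTERVAL        60000

/* Hypothetical per-backend bookkeeping; not the patch's actual variables. */
typedef struct FlushState
{
    int64_t next_flush_ms;      /* earliest time of the next flush attempt */
    bool    has_pending;        /* did an earlier attempt leave data behind? */
    int64_t pending_since_ms;   /* time of the first failed attempt */
    int64_t retry_interval_ms;  /* current retry backoff */
} FlushState;

typedef enum
{
    FLUSH_SKIP,     /* too early, keep accumulating locally */
    FLUSH_NOWAIT,   /* try to flush, but give up on lock failure */
    FLUSH_FORCE     /* flush and wait for locks */
} FlushMode;

/* Decide what to do at transaction end. */
static FlushMode
flush_mode(const FlushState *st, int64_t now_ms)
{
    /* Pending data has waited long enough: stop tolerating lock failures. */
    if (st->has_pending &&
        now_ms - st->pending_since_ms >= PGSTAT_MAX_INTERVAL)
        return FLUSH_FORCE;

    if (now_ms < st->next_flush_ms)
        return FLUSH_SKIP;

    return FLUSH_NOWAIT;
}

/* Record the outcome of a flush attempt and schedule the next one. */
static void
flush_done(FlushState *st, int64_t now_ms, bool all_flushed)
{
    if (all_flushed)
    {
        st->has_pending = false;
        st->retry_interval_ms = 0;
        st->next_flush_ms = now_ms + PGSTAT_MIN_INTERVAL;
        return;
    }

    if (!st->has_pending)
    {
        /* First failure: start the backoff at the minimum retry interval. */
        st->has_pending = true;
        st->pending_since_ms = now_ms;
        st->retry_interval_ms = PGSTAT_RETRY_MIN_INTERVAL;
    }
    else
        st->retry_interval_ms *= 2;     /* doubled at every retry */

    st->next_flush_ms = now_ms + st->retry_interval_ms;
}
```

In this sketch a backend would call flush_mode() at every transaction end
and, unless told to skip, attempt the flush and report the result through
flush_done(). Keeping the decision separate from the flush itself makes the
timing rules easy to check in isolation.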
 src/backend/access/heap/heapam_handler.c      |    4 +-
 src/backend/access/heap/vacuumlazy.c          |    2 +-
 src/backend/access/transam/xlog.c             |    6 +-
 src/backend/catalog/index.c                   |   24 +-
 src/backend/commands/analyze.c                |    8 +-
 src/backend/commands/dbcommands.c             |    2 +-
 src/backend/commands/matview.c                |    8 +-
 src/backend/commands/vacuum.c                 |    4 +-
 src/backend/postmaster/autovacuum.c           |   74 +-
 src/backend/postmaster/bgwriter.c             |    4 +-
 src/backend/postmaster/checkpointer.c         |   26 +-
 src/backend/postmaster/pgarch.c               |   12 +-
 src/backend/postmaster/pgstat.c               | 6103 +++++++----------
 src/backend/postmaster/postmaster.c           |   88 +-
 src/backend/replication/basebackup.c          |    4 +-
 src/backend/replication/slot.c                |   12 +-
 src/backend/storage/buffer/bufmgr.c           |    8 +-
 src/backend/storage/ipc/ipci.c                |    2 +
 src/backend/storage/lmgr/lwlock.c             |    4 +-
 src/backend/storage/lmgr/lwlocknames.txt      |    1 +
 src/backend/storage/smgr/smgr.c               |    4 +-
 src/backend/tcop/postgres.c                   |   37 +-
 src/backend/utils/adt/pgstatfuncs.c           |   89 +-
 src/backend/utils/cache/relcache.c            |    5 +
 src/backend/utils/init/globals.c              |    1 +
 src/backend/utils/init/miscinit.c             |    3 -
 src/backend/utils/init/postinit.c             |   11 +
 src/backend/utils/misc/guc.c                  |   14 +-
 src/backend/utils/misc/postgresql.conf.sample |    2 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |    2 +-
 src/include/miscadmin.h                       |    3 +-
 src/include/pgstat.h                          |  719 +-
 src/include/storage/lwlock.h                  |    1 +
 src/include/utils/guc_tables.h                |    2 +-
 src/include/utils/timeout.h                   |    1 +
 35 files changed, 2769 insertions(+), 4521 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index dcaea7135f..49df584a9e 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1061,8 +1061,8 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
                  * our own.  In this case we should count and sample the row,
                  * to accommodate users who load a table and analyze it in one
                  * transaction.  (pgstat_report_analyze has to adjust the
-                 * numbers we send to the stats collector to make this come
-                 * out right.)
+                 * numbers we report to the activity stats facility to make
+                 * this come out right.)
                  */
                 if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(targtuple->t_data)))
                 {
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 25f2d5df1b..9ce56b54b6 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -598,7 +598,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
                         new_min_multi,
                         false);
 
-    /* report results to the stats collector, too */
+    /* report results to the activity stats facility, too */
     pgstat_report_vacuum(RelationGetRelid(onerel),
                          onerel->rd_rel->relisshared,
                          Max(new_live_tuples, 0),
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a1078a7cfc..a0c9bfc7c2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2196,7 +2196,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
                     WriteRqst.Flush = 0;
                     XLogWrite(WriteRqst, false);
                     LWLockRelease(WALWriteLock);
-                    WalStats.m_wal_buffers_full++;
+                    WalStats.wal_buffers_full++;
                     TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
                 }
                 /* Re-acquire WALBufMappingLock and retry */
@@ -8578,9 +8578,9 @@ LogCheckpointEnd(bool restartpoint)
                         &sync_secs, &sync_usecs);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time +=
+    CheckPointerStats.checkpoint_write_time +=
         write_secs * 1000 + write_usecs / 1000;
-    BgWriterStats.m_checkpoint_sync_time +=
+    CheckPointerStats.checkpoint_sync_time +=
         sync_secs * 1000 + sync_usecs / 1000;
 
     /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index b88b4a1f12..66249682d6 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1854,28 +1854,10 @@ index_concurrently_swap(Oid newIndexId, Oid oldIndexId, const char *oldName)
 
     /*
      * Copy over statistics from old to new index
+     * The data will be flushed by the next pgstat_report_stat()
+     * call.
      */
-    {
-        PgStat_StatTabEntry *tabentry;
-
-        tabentry = pgstat_fetch_stat_tabentry(oldIndexId);
-        if (tabentry)
-        {
-            if (newClassRel->pgstat_info)
-            {
-                newClassRel->pgstat_info->t_counts.t_numscans = tabentry->numscans;
-                newClassRel->pgstat_info->t_counts.t_tuples_returned = tabentry->tuples_returned;
-                newClassRel->pgstat_info->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_hit = tabentry->blocks_hit;
-
-                /*
-                 * The data will be sent by the next pgstat_report_stat()
-                 * call.
-                 */
-            }
-        }
-    }
+    pgstat_copy_index_counters(oldIndexId, newClassRel->pgstat_info);
 
     /* Copy data of pg_statistic from the old index to the new one */
     CopyStatistics(oldIndexId, newIndexId);
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8af12b5c6b..a7e787d9d1 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -644,10 +644,10 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
     }
 
     /*
-     * Report ANALYZE to the stats collector, too.  However, if doing
-     * inherited stats we shouldn't report, because the stats collector only
-     * tracks per-table stats.  Reset the changes_since_analyze counter only
-     * if we analyzed all columns; otherwise, there is still work for
+     * Report ANALYZE to the activity stats facility, too.  However, if doing
+     * inherited stats we shouldn't report, because the activity stats facility
+     * only tracks per-table stats.  Reset the changes_since_analyze counter
+     * only if we analyzed all columns; otherwise, there is still work for
      * auto-analyze to do.
      */
     if (!inh)
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index f27c3fe8c1..de0c749570 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -971,7 +971,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
     DropDatabaseBuffers(db_id);
 
     /*
-     * Tell the stats collector to forget it immediately, too.
+     * Tell the activity stats facility to forget it immediately, too.
      */
     pgstat_drop_database(db_id);
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index f80a9e96a9..e7ccc6eba7 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -336,10 +336,10 @@ ExecRefreshMatView(RefreshMatViewStmt *stmt, const char *queryString,
         refresh_by_heap_swap(matviewOid, OIDNewHeap, relpersistence);
 
         /*
-         * Inform stats collector about our activity: basically, we truncated
-         * the matview and inserted some new data.  (The concurrent code path
-         * above doesn't need to worry about this because the inserts and
-         * deletes it issues get counted by lower-level code.)
+         * Inform activity stats facility about our activity: basically, we
+         * truncated the matview and inserted some new data.  (The concurrent
+         * code path above doesn't need to worry about this because the inserts
+         * and deletes it issues get counted by lower-level code.)
          */
         pgstat_count_truncate(matviewRel);
         if (!stmt->skipData)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 1b6717f727..070bcd1ead 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -319,8 +319,8 @@ vacuum(List *relations, VacuumParams *params,
                  errmsg("VACUUM option DISABLE_PAGE_SKIPPING cannot be used with FULL")));
 
     /*
-     * Send info about dead objects to the statistics collector, unless we are
-     * in autovacuum --- autovacuum.c does this for itself.
+     * Send info about dead objects to the activity statistics facility, unless
+     * we are in autovacuum --- autovacuum.c does this for itself.
      */
     if ((params->options & VACOPT_VACUUM) && !IsAutoVacuumWorkerProcess())
         pgstat_vacuum_stat();
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 2cef56f115..773b82be3b 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -338,9 +338,6 @@ static void autovacuum_do_vac_analyze(autovac_table *tab,
                                       BufferAccessStrategy bstrategy);
 static AutoVacOpts *extract_autovac_opts(HeapTuple tup,
                                          TupleDesc pg_class_desc);
-static PgStat_StatTabEntry *get_pgstat_tabentry_relid(Oid relid, bool isshared,
-                                                      PgStat_StatDBEntry *shared,
-                                                      PgStat_StatDBEntry *dbentry);
 static void perform_work_item(AutoVacuumWorkItem *workitem);
 static void autovac_report_activity(autovac_table *tab);
 static void autovac_report_workitem(AutoVacuumWorkItem *workitem,
@@ -1680,12 +1677,12 @@ AutoVacWorkerMain(int argc, char *argv[])
         char        dbname[NAMEDATALEN];
 
         /*
-         * Report autovac startup to the stats collector.  We deliberately do
-         * this before InitPostgres, so that the last_autovac_time will get
-         * updated even if the connection attempt fails.  This is to prevent
-         * autovac from getting "stuck" repeatedly selecting an unopenable
-         * database, rather than making any progress on stuff it can connect
-         * to.
+         * Report autovac startup to the activity stats facility.  We
+         * deliberately do this before InitPostgres, so that the
+         * last_autovac_time will get updated even if the connection attempt
+         * fails.  This is to prevent autovac from getting "stuck" repeatedly
+         * selecting an unopenable database, rather than making any progress on
+         * stuff it can connect to.
          */
         pgstat_report_autovac(dbid);
 
@@ -1957,8 +1954,6 @@ do_autovacuum(void)
     HASHCTL        ctl;
     HTAB       *table_toast_map;
     ListCell   *volatile cell;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     BufferAccessStrategy bstrategy;
     ScanKeyData key;
     TupleDesc    pg_class_desc;
@@ -1977,17 +1972,11 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /*
-     * may be NULL if we couldn't find an entry (only happens if we are
-     * forcing a vacuum for anti-wrap purposes).
-     */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
 
     /*
-     * Clean up any dead statistics collector entries for this DB. We always
+     * Clean up any dead activity statistics entries for this DB. We always
      * want to do this exactly once per DB-processing cycle, even if we find
      * nothing worth vacuuming in the database.
      */
@@ -2030,9 +2019,6 @@ do_autovacuum(void)
     /* StartTransactionCommand changed elsewhere */
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-
     classRel = table_open(RelationRelationId, AccessShareLock);
 
     /* create a copy so we can use it after closing pg_class */
@@ -2111,8 +2097,8 @@ do_autovacuum(void)
 
         /* Fetch reloptions and the pgstat entry for this table */
         relopts = extract_autovac_opts(tuple, pg_class_desc);
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                       relid);
 
         /* Check if it needs vacuum or analyze */
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
@@ -2195,8 +2181,8 @@ do_autovacuum(void)
         }
 
         /* Fetch the pgstat entry for this table */
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                       relid);
 
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
@@ -2755,29 +2741,6 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
     return av;
 }
 
-/*
- * get_pgstat_tabentry_relid
- *
- * Fetch the pgstat entry of a table, either local to a database or shared.
- */
-static PgStat_StatTabEntry *
-get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
-                          PgStat_StatDBEntry *dbentry)
-{
-    PgStat_StatTabEntry *tabentry = NULL;
-
-    if (isshared)
-    {
-        if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
-    }
-    else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
-
-    return tabentry;
-}
 
 /*
  * table_recheck_autovac
@@ -2798,17 +2761,12 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     bool        doanalyze;
     autovac_table *tab = NULL;
     PgStat_StatTabEntry *tabentry;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     bool        wraparound;
     AutoVacOpts *avopts;
 
     /* use fresh stats */
     autovac_refresh_stats();
 
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* fetch the relation's relcache entry */
     classTup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
     if (!HeapTupleIsValid(classTup))
@@ -2832,8 +2790,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
     }
 
     /* fetch the pgstat table entry */
-    tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                         shared, dbentry);
+    tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                   relid);
 
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
@@ -2955,7 +2913,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
  *
  * For analyze, the analysis done is that the number of tuples inserted,
  * deleted and updated since the last analyze exceeds a threshold calculated
- * in the same fashion as above.  Note that the collector actually stores
+ * in the same fashion as above.  Note that the activity statistics stores
  * the number of tuples (both live and dead) that there were as of the last
  * analyze.  This is asymmetric to the VACUUM case.
  *
@@ -2965,8 +2923,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
  *
  * A table whose autovacuum_enabled option is false is
  * automatically skipped (unless we have to vacuum it due to freeze_max_age).
- * Thus autovacuum can be disabled for specific tables. Also, when the stats
- * collector does not have data about a table, it will be skipped.
+ * Thus autovacuum can be disabled for specific tables. Also, when the activity
+ * statistics does not have data about a table, it will be skipped.
  *
  * A table whose vac_base_thresh value is < 0 takes the base value from the
  * autovacuum_vacuum_threshold GUC variable.  Similarly, a vac_scale_factor
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index a7afa758b6..b075e85839 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -244,9 +244,9 @@ BackgroundWriterMain(void)
         can_hibernate = BgBufferSync(&wb_context);
 
         /*
-         * Send off activity statistics to the stats collector
+         * Send off activity statistics to the activity stats facility
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 429c8010ef..fe6937f8af 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -360,7 +360,7 @@ CheckpointerMain(void)
         if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
         {
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            CheckPointerStats.requested_checkpoints++;
         }
 
         /*
@@ -374,7 +374,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                CheckPointerStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -495,17 +495,11 @@ CheckpointerMain(void)
         /* Check for archive_timeout and switch xlog files if necessary. */
         CheckArchiveTimeout();
 
-        /*
-         * Send off activity statistics to the stats collector.  (The reason
-         * why we re-use bgwriter-related code for this is that the bgwriter
-         * and checkpointer used to be just one process.  It's probably not
-         * worth the trouble to split the stats support into two independent
-         * stats message types.)
-         */
-        pgstat_send_bgwriter();
+        /* Send off activity statistics to the activity stats facility. */
+        pgstat_report_checkpointer();
 
         /* Send WAL statistics to the stats collector. */
-        pgstat_send_wal();
+        pgstat_report_wal();
 
         /*
          * If any checkpoint flags have been set, redo the loop to handle the
@@ -711,9 +705,9 @@ CheckpointWriteDelay(int flags, double progress)
         CheckArchiveTimeout();
 
         /*
-         * Report interim activity statistics to the stats collector.
+         * Report interim activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_report_checkpointer();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1258,8 +1252,10 @@ AbsorbSyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    CheckPointerStats.buf_written_backend
+        += CheckpointerShmem->num_backend_writes;
+    CheckPointerStats.buf_fsync_backend
+        += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index e3a520def9..6d88c65d5f 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -372,20 +372,20 @@ pgarch_ArchiverCopyLoop(void)
                 pgarch_archiveDone(xlog);
 
                 /*
-                 * Tell the collector about the WAL file that we successfully
-                 * archived
+                 * Tell the activity statistics facility about the WAL file
+                 * that we successfully archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_report_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
             else
             {
                 /*
-                 * Tell the collector about the WAL file that we failed to
-                 * archive
+                 * Tell the activity statistics facility about the WAL file
+                 * that we failed to archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_report_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index e76e627c6b..6f0e3e2e4e 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,22 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Activity Statistics facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects activity statistics, e.g. per-table access statistics, of
+ *  all backends in shared memory. The activity numbers are first stored
+ *  locally in each process, then flushed to shared memory at commit
+ *  time or by idle-timeout.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ * To avoid congestion on the shared memory, shared stats are updated no more
+ * often than once per PGSTAT_MIN_INTERVAL (10000ms). If some local numbers
+ * remain unflushed due to lock failure, retry at intervals starting at
+ * PGSTAT_RETRY_MIN_INTERVAL (1000ms) and doubled at every retry. Finally we
+ * force the update PGSTAT_MAX_INTERVAL (60000ms) after the first attempt.
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the activity statistics facility creates the
+ *  area and loads the stored stats file, if any; the last process at shutdown
+ *  writes the shared stats to the file and destroys the area before exit.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -19,18 +26,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -40,12 +35,9 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
+#include "common/hashfn.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/fork_process.h"
@@ -53,20 +45,16 @@
 #include "postmaster/postmaster.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
+#include "storage/condition_variable.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
 #include "utils/timestamp.h"
 
@@ -74,35 +62,20 @@
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
+#define PGSTAT_MIN_INTERVAL            10000    /* Minimum interval of stats data
+                                             * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_RETRY_MIN_INTERVAL    1000 /* Initial retry interval after
+                                          * PGSTAT_MIN_INTERVAL */
 
+#define PGSTAT_MAX_INTERVAL            60000    /* Longest interval of stats data
+                                             * updates */
 
 /* ----------
- * The initial size hints for the hash tables used in the collector.
+ * The initial size hints for the hash tables used in the activity statistics.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
-#define PGSTAT_TAB_HASH_SIZE    512
+#define PGSTAT_TABLE_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
 
@@ -117,7 +90,6 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
-
 /* ----------
  * GUC parameters
  * ----------
@@ -132,17 +104,11 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used, but will be removed with GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
-/*
- * BgWriter and WAL global statistics counters.
- * Stored directly in a stats message structure so they can be sent
- * without needing to copy things around.  We assume these init to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
-PgStat_MsgWal WalStats;
-
 /*
  * List of SLRU names that we keep stats for.  There is no central registry of
  * SLRUs, so we use this fixed list instead.  The "other" entry is used for
@@ -161,73 +127,245 @@ static const char *const slru_names[] = {
 
 #define SLRU_NUM_ELEMENTS    lengthof(slru_names)
 
+/* struct for shared SLRU stats */
+typedef struct PgStatSharedSLRUStats
+{
+    PgStat_SLRUStats        entry[SLRU_NUM_ELEMENTS];
+    LWLock                    lock;
+    pg_atomic_uint32        changecount;
+} PgStatSharedSLRUStats;
+
+StaticAssertDecl(sizeof(TimestampTz) == sizeof(pg_atomic_uint64),
+                 "size of pg_atomic_uint64 doesn't match TimestampTz");
+
+typedef struct StatsShmemStruct
+{
+    dsa_handle    stats_dsa_handle;    /* handle for stats data area */
+    dshash_table_handle hash_handle;    /* shared dbstat hash */
+    int            refcount;            /* # of processes attached to the
+                                     * shared stats memory */
+    /* Global stats structs */
+    PgStat_Archiver            archiver_stats;
+    pg_atomic_uint32        archiver_changecount;
+    PgStat_BgWriter            bgwriter_stats;
+    pg_atomic_uint32        bgwriter_changecount;
+    PgStat_CheckPointer        checkpointer_stats;
+    pg_atomic_uint32        checkpointer_changecount;
+    PgStat_Wal                wal_stats;
+    LWLock                    wal_stats_lock;
+    PgStatSharedSLRUStats    slru_stats;
+    pg_atomic_uint32        slru_changecount;
+    pg_atomic_uint64        stats_timestamp;
+
+    /* Reset offsets, protected by StatsLock */
+    PgStat_Archiver            archiver_reset_offset;
+    PgStat_BgWriter            bgwriter_reset_offset;
+    PgStat_CheckPointer        checkpointer_reset_offset;
+
+    /* file read/write protection */
+    bool                    attach_holdoff;
+    ConditionVariable        holdoff_cv;
+
+    pg_atomic_uint64        gc_count; /* # of entries deleted. not
+                                            * protected by StatsLock */
+} StatsShmemStruct;
+
+/* BgWriter global statistics counters */
+PgStat_BgWriter BgWriterStats = {0};
+
+/* CheckPointer global statistics counters */
+PgStat_CheckPointer CheckPointerStats = {0};
+
+/* WAL global statistics counters */
+PgStat_Wal WalStats = {0};
+
 /*
- * SLRU statistics counts waiting to be sent to the collector.  These are
- * stored directly in stats message format so they can be sent without needing
- * to copy things around.  We assume this variable inits to zeroes.  Entries
- * are one-to-one with slru_names[].
+ * XXXX: always try to flush WAL stats. We don't want to manipulate another
+ * counter during XLogInsert, so we don't have an efficient shortcut to know
+ * whether any counter has been incremented.
  */
-static PgStat_MsgSLRU SLRUStats[SLRU_NUM_ELEMENTS];
+static inline bool
+walstats_pending(void)
+{
+    static const PgStat_Wal all_zeroes;
+
+    return memcmp(&WalStats, &all_zeroes, sizeof(PgStat_Wal)) != 0;
+}
+
+/*
+ * SLRU statistics counts waiting to be written to the shared activity
+ * statistics.  We assume this variable inits to zeroes.  Entries are
+ * one-to-one with slru_names[].
+ * Changes of SLRU counters are reported within critical sections so we use
+ * static memory in order to avoid memory allocation.
+ */
+static PgStat_SLRUStats local_SLRUStats[SLRU_NUM_ELEMENTS];
+static bool     have_slrustats = false;
 
 /* ----------
  * Local data
  * ----------
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
-
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
+/* backend-lifetime storage */
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
 
 /*
- * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
- *
- * NOTE: once allocated, TabStatusArray structures are never moved or deleted
- * for the life of the backend.  Also, we zero out the t_id fields of the
- * contained PgStat_TableStatus structs whenever they are not actively in use.
- * This allows relcache pgstat_info pointers to be treated as long-lived data,
- * avoiding repeated searches in pgstat_initstats() when a relation is
- * repeatedly opened during a transaction.
+ * Types to define shared statistics structure.
+ *
+ * Per-object statistics are stored in "shared stats" entries, structs in
+ * DSA-allocated memory that begin with a header common to all object types.
+ * All shared stats are pointed to from a dshash via a dsa_pointer. This
+ * structure makes the shared stats immovable against dshash resizing, allows a
+ * backend to point to shared stats entries via a native pointer and allows
+ * locking at stats-entry level. The per-entry locking reduces lock contention
+ * compared to the partition lock of dshash. A backend accumulates stats numbers
+ * in a stats entry in local memory, then flushes the numbers to shared
+ * stats entries, basically at transaction end.
+ *
+ * Each stat entry type has a fixed member PgStat_StatEntryHeader as the first
+ * element.
+ *
+ * Shared stats are stored as:
+ *
+ * dshash pgStatSharedHash
+ *    -> PgStatHashEntry                    (dshash entry)
+ *      (dsa_pointer)-> PgStat_Stat*Entry    (dsa memory block)
+ *
+ * Shared stats entries are pointed to directly from a pgstat_localhash hash:
+ *
+ * pgstat_localhash pgStatEntHash
+ *    -> PgStatLocalHashEntry                (equivalent of PgStatHashEntry)
+ *      (native pointer)-> PgStat_Stat*Entry (dsa memory block)
+ *
+ * Local stats waiting to be flushed to shared stats are stored as:
+ *
+ * pgstat_localhash pgStatLocalHash
+ *    -> PgStatLocalHashEntry                 (local hash entry)
+ *      (native pointer)-> PgStat_Stat*Entry/TableStatus (palloc'ed memory)
  */
-#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
 
-typedef struct TabStatusArray
+/* The types of statistics entries */
+typedef enum PgStatTypes
 {
-    struct TabStatusArray *tsa_next;    /* link to next array, if any */
-    int            tsa_used;        /* # entries currently used */
-    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
-} TabStatusArray;
-
-static TabStatusArray *pgStatTabList = NULL;
+    PGSTAT_TYPE_DB,            /* database-wide statistics */
+    PGSTAT_TYPE_TABLE,        /* per-table statistics */
+    PGSTAT_TYPE_FUNCTION,    /* per-function statistics */
+    PGSTAT_TYPE_REPLSLOT    /* per-replication-slot statistics */
+} PgStatTypes;
 
 /*
- * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
+ * entry body size lookup table of shared statistics entries corresponding to
+ * PgStatTypes
  */
-typedef struct TabStatHashEntry
+static const size_t pgstat_sharedentsize[] =
 {
-    Oid            t_id;
-    PgStat_TableStatus *tsa_entry;
-} TabStatHashEntry;
+    sizeof(PgStat_StatDBEntry),     /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_StatTabEntry),    /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_StatFuncEntry),    /* PGSTAT_TYPE_FUNCTION */
+    sizeof(PgStat_ReplSlot)            /* PGSTAT_TYPE_REPLSLOT */
+};
 
-/*
- * Hash table for O(1) t_id -> tsa_entry lookup
- */
-static HTAB *pgStatTabHash = NULL;
+/* Ditto for local statistics entries */
+static const size_t pgstat_localentsize[] =
+{
+    sizeof(PgStat_StatDBEntry),            /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_TableStatus),            /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_BackendFunctionEntry),/* PGSTAT_TYPE_FUNCTION */
+    sizeof(PgStat_ReplSlot)                /* PGSTAT_TYPE_REPLSLOT */
+};
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * We should avoid overwriting the header part of a shared entry. Use these
+ * macros to find which portion of the struct is to be written or read.
+ * PGSTAT_SHENT_BODY returns a slightly smaller address than the actual address
+ * of the next member, but that doesn't matter.
  */
-static HTAB *pgStatFunctions = NULL;
+#define PGSTAT_SHENT_BODY(e) (((char *)(e)) + sizeof(PgStat_StatEntryHeader))
+#define PGSTAT_SHENT_BODY_LEN(t) \
+    (pgstat_sharedentsize[t] - sizeof(PgStat_StatEntryHeader))
+
+/* struct for shared statistics hash entry key. */
+typedef struct PgStatHashKey
+{
+    PgStatTypes type;        /* statistics entry type */
+    Oid            databaseid;    /* database ID. InvalidOid for shared objects. */
+    Oid            objectid;    /* object ID, either table or function. */
+} PgStatHashKey;
+
+/* struct for shared statistics hash entry */
+typedef struct PgStatHashEntry
+{
+    PgStatHashKey    key; /* hash key */
+    dsa_pointer        body;/* pointer to shared stats in PgStat_StatEntryHeader */
+} PgStatHashEntry;
+
+/* struct for shared statistics local hash entry. */
+typedef struct PgStatLocalHashEntry
+{
+    PgStatHashKey            key;    /* hash key */
+    char                    status;    /* for simplehash use */
+    PgStat_StatEntryHeader *body;    /* pointer to stats body in local heap */
+    dsa_pointer                dsapointer; /* dsa pointer of body */
+} PgStatLocalHashEntry;
+
+/* parameters for the shared hash */
+static const dshash_parameters dsh_params = {
+    sizeof(PgStatHashKey),
+    sizeof(PgStatHashEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+
+/* define hashtable for local hashes */
+#define SH_PREFIX pgstat_localhash
+#define SH_ELEMENT_TYPE PgStatLocalHashEntry
+#define SH_KEY_TYPE PgStatHashKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+    hash_bytes((unsigned char *)&key, sizeof(PgStatHashKey))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(PgStatHashKey)) == 0)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/* The shared hash to index activity stats entries. */
+static dshash_table *pgStatSharedHash = NULL;
 
 /*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
+ * The local cache to index shared stats entries.
+ *
+ * This is a local hash to store native pointers to shared hash
+ * entries. pgStatEntHashAge is copied from StatsShmem->gc_count at creation
+ * and garbage collection.
  */
-static bool have_function_stats = false;
+static pgstat_localhash_hash *pgStatEntHash = NULL;
+static int     pgStatEntHashAge = 0;        /* cache age of pgStatEntHash */
+
+/* Local stats numbers are stored here. */
+static pgstat_localhash_hash *pgStatLocalHash = NULL;
+
+/* entry type for oid hash */
+typedef struct pgstat_oident
+{
+    Oid oid;
+    char status;
+} pgstat_oident;
+
+/* Define hashtable for OID hashes. */
+StaticAssertDecl(sizeof(Oid) == 4, "oid is not compatible with uint32");
+#define SH_PREFIX pgstat_oid
+#define SH_ELEMENT_TYPE pgstat_oident
+#define SH_KEY_TYPE Oid
+#define SH_KEY oid
+#define SH_HASH_KEY(tb, key) hash_bytes_uint32(key)
+#define SH_EQUAL(tb, a, b) (a == b)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
 
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
@@ -264,11 +402,8 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -277,23 +412,9 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Make our own memory context to make it easy to track memory usage.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-static PgStat_WalStats walStats;
-static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
-static PgStat_ReplSlotStats *replSlotStats;
-static int    nReplSlotStats;
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
+MemoryContext    pgStatCacheContext = NULL;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -302,40 +423,57 @@ static List *pending_write_requests = NIL;
  */
 static instr_time total_func_time;
 
+/* Simple caching feature for pgstatfuncs */
+static PgStatHashKey    stathashkey_zero = {0};
+static PgStatHashKey        cached_dbent_key = {0};
+static PgStat_StatDBEntry    cached_dbent;
+static PgStatHashKey        cached_tabent_key = {0};
+static PgStat_StatTabEntry    cached_tabent;
+static PgStatHashKey        cached_funcent_key = {0};
+static PgStat_StatFuncEntry    cached_funcent;
+
+static PgStat_Archiver        cached_archiverstats;
+static bool                    cached_archiverstats_is_valid = false;
+static PgStat_BgWriter        cached_bgwriterstats;
+static bool                    cached_bgwriterstats_is_valid = false;
+static PgStat_CheckPointer    cached_checkpointerstats;
+static bool                    cached_checkpointerstats_is_valid = false;
+static PgStat_Wal            cached_walstats;
+static bool                    cached_walstats_is_valid = false;
+static PgStat_SLRUStats        cached_slrustats;
+static bool                    cached_slrustats_is_valid = false;
+static PgStat_ReplSlot       *cached_replslotstats = NULL;
+static int                    n_cached_replslotstats = -1;
 
 /* ----------
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgstat_beshutdown_hook(int code, Datum arg);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                                                 Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStat_StatDBEntry *get_local_dbstat_entry(Oid dbid);
+static PgStat_TableStatus *get_local_tabstat_entry(Oid rel_id, bool isshared);
+
+static void pgstat_write_statsfile(void);
+
+static void pgstat_read_statsfile(void);
 static void pgstat_read_current_status(void);
 
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
+static PgStat_StatEntryHeader * get_stat_entry(PgStatTypes type, Oid dbid,
+                                               Oid objid, bool nowait,
+                                               bool create, bool *found);
 
-static int    pgstat_replslot_index(const char *name, bool create_it);
-static void pgstat_reset_replslot(int i, TimestampTz ts);
+static bool flush_tabstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_funcstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_dbstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_walstat(bool nowait);
+static bool flush_slrustat(bool nowait);
+static void delete_current_stats_entry(dshash_seq_status *hstat);
+static PgStat_StatEntryHeader * get_local_stat_entry(PgStatTypes type, Oid dbid,
+                                                     Oid objid, bool create,
+                                                     bool *found);
 
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
-static void pgstat_send_slru(void);
-static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
-
-static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
+static pgstat_oid_hash *collect_oids(Oid catalogid, AttrNumber anum_oid);
 
 static void pgstat_setup_memcxt(void);
 
@@ -345,489 +483,630 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len);
-static void pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
-static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_replslot(PgStat_MsgReplSlot *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute shared memory space needed for activity statistics
+ */
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+/*
+ * StatsShmemInit - initialize during shared-memory creation
  */
 void
-pgstat_init(void)
+StatsShmemInit(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
+    bool        found;
+
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(), &found);
+
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+        ConditionVariableInit(&StatsShmem->holdoff_cv);
+        pg_atomic_init_u32(&StatsShmem->archiver_changecount, 0);
+        pg_atomic_init_u32(&StatsShmem->bgwriter_changecount, 0);
+        pg_atomic_init_u32(&StatsShmem->checkpointer_changecount, 0);
+
+        pg_atomic_init_u64(&StatsShmem->gc_count, 0);
+
+        LWLockInitialize(&StatsShmem->wal_stats_lock, LWTRANCHE_STATS);
+    }
+}
+
+/* ----------
+ * allow_next_attacher() -
+ *
+ *  Let other processes go ahead and attach the shared stats area.
+ * ----------
+ */
+static void
+allow_next_attacher(void)
+{
+    bool triggered = false;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (StatsShmem->attach_holdoff)
+    {
+        StatsShmem->attach_holdoff = false;
+        triggered = true;
+    }
+    LWLockRelease(StatsLock);
+
+    if (triggered)
+        ConditionVariableBroadcast(&StatsShmem->holdoff_cv);
+}
+
+/* ----------
+ * attach_shared_stats() -
+ *
+ *    Attach or create the shared stats memory. If we are the first process to
+ *    use the activity stats system, read the saved statistics file, if any.
+ * ---------
+ */
+static void
+attach_shared_stats(void)
+{
+    MemoryContext oldcontext;
 
     /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
+     * Don't use dsm under postmaster, or when not tracking counts.
      */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
+
+    pgstat_setup_memcxt();
+
+    if (area)
+        return;
+
+    /* stats shared memory persists for the backend lifetime */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
 
     /*
-     * Create the UDP socket for sending and receiving statistic messages
+     * The first attacher backend may still be reading the stats file, or the
+     * last detacher may be writing it. Wait for the work to finish.
      */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
+    ConditionVariablePrepareToSleep(&StatsShmem->holdoff_cv);
+    for (;;)
     {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
+        bool hold_off;
+
+        LWLockAcquire(StatsLock, LW_SHARED);
+        hold_off = StatsShmem->attach_holdoff;
+        LWLockRelease(StatsLock);
+
+        if (!hold_off)
+            break;
+
+        ConditionVariableTimedSleep(&StatsShmem->holdoff_cv, 10,
+                                    WAIT_EVENT_READING_STATS_FILE);
     }
+    ConditionVariableCancelSleep();
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
     /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
+     * The last process is responsible for writing out the stats file at exit.
+     * Maintain refcount so that a process about to exit can find out whether
+     * it is the last one or not.
      */
-    for (addr = addrs; addr; addr = addr->ai_next)
+    if (StatsShmem->refcount > 0)
+        StatsShmem->refcount++;
+    else
     {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
+        /* We're the first process to attach the shared stats memory */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
 
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
+        /* Initialize shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatSharedHash = dshash_create(area, &dsh_params, 0);
 
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->hash_handle = dshash_get_hash_table_handle(pgStatSharedHash);
+        LWLockInitialize(&StatsShmem->slru_stats.lock, LWTRANCHE_STATS);
+        pg_atomic_init_u32(&StatsShmem->slru_stats.changecount, 0);
 
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        /* Block the next attacher for a while, see the comment above. */
+        StatsShmem->attach_holdoff = true;
 
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        StatsShmem->refcount = 1;
+    }
 
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    LWLockRelease(StatsLock);
 
+    if (area)
+    {
         /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
+         * We're the first attacher process; read the stats file while blocking
+         * successors.
          */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
+        Assert(StatsShmem->attach_holdoff);
+        pgstat_read_statsfile();
+        allow_next_attacher();
     }
-
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
-
-    /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
-     */
-    if (!pg_set_noblock(pgStatSock))
+    else
     {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
-    }
+        /* We're not the first one, attach existing shared area. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatSharedHash = dshash_attach(area, &dsh_params,
+                                         StatsShmem->hash_handle, 0);
 
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
-    {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            elog(LOG, "getsockopt(SO_RCVBUF) failed: %m");
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                elog(LOG, "setsockopt(SO_RCVBUF) failed: %m");
-        }
     }
 
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-    /* Now that we have a long-lived socket, tell fd.c about it. */
-    ReserveExternalFD();
+    MemoryContextSwitchTo(oldcontext);
 
-    return;
+    /* don't detach automatically */
+    dsa_pin_mapping(area);
+}
 
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
+/* ----------
+ * cleanup_dropped_stats_entries() -
+ *              Clean up shared stats entries no longer used.
+ *
+ *  Shared stats entries for dropped objects may be left referenced. Clean up
+ *  our reference and drop the shared entry if needed.
+ * ----------
+ */
+static void
+cleanup_dropped_stats_entries(void)
+{
+    pgstat_localhash_iterator i;
+    PgStatLocalHashEntry   *ent;
 
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
+    if (pgStatEntHash == NULL)
+        return;
 
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
+    pgstat_localhash_start_iterate(pgStatEntHash, &i);
+    while ((ent = pgstat_localhash_iterate(pgStatEntHash, &i))
+           != NULL)
+    {
+        /*
+         * Free the shared memory chunk for the entry if we were the last
+         * referrer to a dropped entry.
+         */
+        if (pg_atomic_sub_fetch_u32(&ent->body->refcount, 1) < 1 &&
+            ent->body->dropped)
+            dsa_free(area, ent->dsapointer);
+    }
 
     /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
+     * This function is expected to be called during backend exit. So we don't
+     * bother destroying pgStatEntHash.
      */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    pgStatEntHash = NULL;
 }
 
-/*
- * subroutine for pgstat_reset_all
+/* ----------
+ * detach_shared_stats() -
+ *
+ *    Detach shared stats. Write out to file if we're the last process and told
+ *    to do so.
+ * ----------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+detach_shared_stats(bool write_file)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    bool is_last_detacher = false;
+
+    /* immediately return if useless */
+    if (!area || !IsUnderPostmaster)
+        return;
+
+    /* We shouldn't leave a reference to shared stats. */
+    Assert(pgStatEntHash == NULL);
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
+    /*
+     * If we are the last detacher, hold off the next attacher (if possible)
+     * until we finish writing stats file.
+     */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (--StatsShmem->refcount == 0)
     {
-        int            nchars;
-        Oid            tmp_oid;
+        StatsShmem->attach_holdoff = true;
+        is_last_detacher = true;
+    }
+    LWLockRelease(StatsLock);
 
-        /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
-         */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
-
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
-
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+    if (is_last_detacher)
+    {
+        if (write_file)
+            pgstat_write_statsfile();
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+        /* allow the next attacher, if any */
+        allow_next_attacher();
     }
-    FreeDir(dir);
+
+    /*
+     * Detach the area. It is automatically destroyed when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);
+
+    area = NULL;
+    pgStatSharedHash = NULL;
+
+    /* We are going to exit. Don't bother destroying local hashes. */
+    pgStatLocalHash = NULL;
 }
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    /* standalone server doesn't use shared stats */
+    if (!IsUnderPostmaster)
+        return;
+
+    /* we must have shared stats attached */
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
+
+    /* Startup must be the only user of shared stats */
+    Assert(StatsShmem->refcount == 1);
+
+    /*
+     * We could directly remove files and recreate the shared memory area, but
+     * for simplicity we just discard it and then create it again.
+     */
+    detach_shared_stats(false);
+    attach_shared_stats();
 }
 
-#ifdef EXEC_BACKEND
 
 /*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
+ * fetch_lock_statentry - common helper function to fetch and lock a stats
+ * entry for flush_tabstat, flush_funcstat and flush_dbstat.
  */
-static pid_t
-pgstat_forkexec(void)
+static PgStat_StatEntryHeader *
+fetch_lock_statentry(PgStatTypes type, Oid dboid, Oid objid, bool nowait)
 {
-    char       *av[10];
-    int            ac = 0;
+    PgStat_StatEntryHeader *header;
+
+    /* find shared table stats entry corresponding to the local entry */
+    header = (PgStat_StatEntryHeader *)
+        get_stat_entry(type, dboid, objid, nowait, true, NULL);
 
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
+    /* skip if dshash failed to acquire lock */
+    if (header == NULL)
+        return NULL;
 
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&header->lock, LW_EXCLUSIVE))
+        return NULL;
 
-    return postmaster_forkexec(ac, av);
+    return header;
 }
-#endif                            /* EXEC_BACKEND */
 
 
-/*
- * pgstat_start() -
+/* ----------
+ * get_stat_entry() -
  *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
+ *    Get the shared stats entry for the specified type, dbid and objid.
+ *  If nowait is true, returns NULL on lock failure.
+ *
+ *  If create is true, a new entry is created if one does not exist yet.
+ *  If found is not NULL, it is set to true if an existing entry is found or
+ *  false if not.
+ *  ----------
+ */
+static PgStat_StatEntryHeader *
+get_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait, bool create,
+               bool *found)
+{
+    PgStatHashEntry           *shhashent;
+    PgStatLocalHashEntry   *lohashent;
+    PgStat_StatEntryHeader *shheader = NULL;
+    PgStatHashKey            key;
+    bool                    shfound;
+
+    key.type        = type;
+    key.databaseid     = dbid;
+    key.objectid    = objid;
+
+    if (pgStatEntHash)
+    {
+        uint64 currage;
+
+        /*
+         * pgStatEntHashAge advances far more slowly than the following loop
+         * runs, so this is expected to iterate no more than twice.
+         */
+        while (unlikely
+               (pgStatEntHashAge !=
+                (currage = pg_atomic_read_u64(&StatsShmem->gc_count))))
+        {
+            pgstat_localhash_iterator i;
+
+            /*
+             * Some entries have been dropped. Invalidate cache pointer to
+             * them.
+             */
+            pgstat_localhash_start_iterate(pgStatEntHash, &i);
+            while ((lohashent = pgstat_localhash_iterate(pgStatEntHash, &i))
+                   != NULL)
+            {
+                PgStat_StatEntryHeader *header = lohashent->body;
+                if (header->dropped)
+                {
+                    pgstat_localhash_delete(pgStatEntHash, lohashent->key);
+
+                    if (pg_atomic_sub_fetch_u32(&header->refcount, 1) < 1)
+                    {
+                        /*
+                         * We're the last referrer to this entry, drop the
+                         * shared entry.
+                         */
+                        dsa_free(area, lohashent->dsapointer);
+                    }
+                }
+            }
+
+            pgStatEntHashAge = currage;
+        }
+
+        lohashent = pgstat_localhash_lookup(pgStatEntHash, key);
+
+        if (lohashent)
+        {
+            if (found)
+                *found = true;
+            return lohashent->body;
+        }
+    }
+
+    shhashent = dshash_find_extended(pgStatSharedHash, &key,
+                                     create, nowait, create, &shfound);
+    if (shhashent)
+    {
+        if (create && !shfound)
+        {
+            /* Create new stats entry. */
+            dsa_pointer chunk = dsa_allocate0(area,
+                                              pgstat_sharedentsize[type]);
+
+            shheader = dsa_get_address(area, chunk);
+            LWLockInitialize(&shheader->lock, LWTRANCHE_STATS);
+            pg_atomic_init_u32(&shheader->refcount, 0);
+
+            /* Link the new entry from the hash entry. */
+            shhashent->body = chunk;
+        }
+        else
+            shheader = dsa_get_address(area, shhashent->body);
+
+        /*
+         * We expose this shared entry now.  You might think that the entry can
+         * be removed by a concurrent backend, but since we are creating a
+         * stats entry, the object actually exists and is in use by the upper
+         * layer. Such an object cannot be dropped until the first vacuum after
+         * the current transaction ends.
+         */
+        dshash_release_lock(pgStatSharedHash, shhashent);
+
+        /* register to local hash if possible */
+        if (pgStatEntHash || pgStatCacheContext)
+        {
+            bool                    lofound;
+
+            if (pgStatEntHash == NULL)
+            {
+                pgStatEntHash =
+                    pgstat_localhash_create(pgStatCacheContext,
+                                        PGSTAT_TABLE_HASH_SIZE, NULL);
+                pgStatEntHashAge =
+                    pg_atomic_read_u64(&StatsShmem->gc_count);
+            }
+
+            lohashent =
+                pgstat_localhash_insert(pgStatEntHash, key, &lofound);
+
+            Assert(!lofound);
+            lohashent->body = shheader;
+            lohashent->dsapointer = shhashent->body;
+
+            pg_atomic_add_fetch_u32(&shheader->refcount, 1);
+        }
+    }
+
+    if (found)
+        *found = shfound;
+
+    return shheader;
+}
+
+/*
+ * flush_walstat - flush out local WAL stats entries
  *
- *    Returns PID of child process, or 0 if fail.
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
  *
- *    Note: if fail, we will be called again from the postmaster main loop.
+ * Returns true if all local WAL stats are successfully flushed out.
  */
-int
-pgstat_start(void)
+static bool
+flush_walstat(bool nowait)
 {
-    time_t        curtime;
-    pid_t        pgStatPid;
+    /* We assume this initializes to zeroes */
+    PgStat_Wal *s = &StatsShmem->wal_stats;
+    PgStat_Wal *l = &WalStats;
 
     /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
+     * This function can be called even if nothing at all has happened. In
+     * this case, avoid taking lock for a completely empty stats.
      */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
+    if (!walstats_pending())
+        return true;
 
-    /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&StatsShmem->wal_stats_lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&StatsShmem->wal_stats_lock,
+                                       LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
+
+    s->wal_buffers_full += l->wal_buffers_full;
+    LWLockRelease(&StatsShmem->wal_stats_lock);
 
     /*
-     * Okay, fork off the collector.
+     * Clear out the statistics buffer, so it can be re-used.
      */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
+    MemSet(&WalStats, 0, sizeof(WalStats));
 
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
+    return true;
+}
 
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
+/*
+ * flush_slrustat - flush out local SLRU stats entries
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true. Writer processes are mutually excluded
+ * using an LWLock, but readers are expected to use the change-count protocol
+ * to avoid interference with writers.
+ *
+ * Returns true if all local SLRU stats are successfully flushed out.
+ */
+static bool
+flush_slrustat(bool nowait)
+{
+    uint32    assert_changecount PG_USED_FOR_ASSERTS_ONLY;
+    int i;
 
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
+    if (!have_slrustats)
+        return true;
 
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&StatsShmem->slru_stats.lock,
+                                       LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
 
-        default:
-            return (int) pgStatPid;
+    assert_changecount =
+        pg_atomic_fetch_add_u32(&StatsShmem->slru_stats.changecount, 1);
+    Assert((assert_changecount & 1) == 0);
+
+    for (i = 0 ; i < SLRU_NUM_ELEMENTS ; i++)
+    {
+        PgStat_SLRUStats *sharedent = &StatsShmem->slru_stats.entry[i];
+        PgStat_SLRUStats *localent = &local_SLRUStats[i];
+
+        sharedent->blocks_zeroed += localent->blocks_zeroed;
+        sharedent->blocks_hit += localent->blocks_hit;
+        sharedent->blocks_read += localent->blocks_read;
+        sharedent->blocks_written += localent->blocks_written;
+        sharedent->blocks_exists += localent->blocks_exists;
+        sharedent->flush += localent->flush;
+        sharedent->truncate += localent->truncate;
     }
 
-    /* shouldn't get here */
-    return 0;
+    /* done, clear the local entry */
+    MemSet(local_SLRUStats, 0,
+           sizeof(PgStat_SLRUStats) * SLRU_NUM_ELEMENTS);
+
+    pg_atomic_add_fetch_u32(&StatsShmem->slru_stats.changecount, 1);
+    LWLockRelease(&StatsShmem->slru_stats.lock);
+
+    have_slrustats = false;
+
+    return true;
 }
 
-void
-allow_immediate_pgstat_restart(void)
+/* ----------
+ * delete_current_stats_entry()
+ *
+ *  Deletes the given shared entry from shared stats hash. The entry must be
+ *  exclusively locked.
+ * ----------
+ */
+static void
+delete_current_stats_entry(dshash_seq_status *hstat)
 {
-    last_pgstat_start_time = 0;
+    dsa_pointer                pdsa;
+    PgStat_StatEntryHeader *header;
+    PgStatHashEntry *ent;
+
+    ent = dshash_get_current(hstat);
+    pdsa = ent->body;
+    header = dsa_get_address(area, pdsa);
+
+    /* No one can find this entry from now on. */
+    dshash_delete_current(hstat);
+
+    /*
+     * Let the referrers drop the entry if any.  Refcount won't be decremented
+     * until "dropped" is set true and StatsShmem->gc_count is incremented
+     * later. So we can check refcount to set dropped without holding a
+     * lock. If no one is referring to this entry, free it immediately.
+     */
+
+    if (pg_atomic_read_u32(&header->refcount) > 0)
+        header->dropped = true;
+    else
+        dsa_free(area, pdsa);
+
+    return;
+}
+
+/* ----------
+ * get_local_stat_entry() -
+ *
+ *  Returns local stats entry for the type, dbid and objid.
+ *  If create is true, a new entry is created if one does not exist yet; found
+ *  must be non-null in that case.
+ *
+ *
+ *  The caller is responsible for initializing the statsbody part of the
+ *  returned memory.
+ * ----------
+ */
+static PgStat_StatEntryHeader *
+get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                     bool create, bool *found)
+{
+    PgStatHashKey key;
+    PgStatLocalHashEntry *entry;
+
+    if (pgStatLocalHash == NULL)
+        pgStatLocalHash = pgstat_localhash_create(pgStatCacheContext,
+                                                  PGSTAT_TABLE_HASH_SIZE, NULL);
+
+    /* Find an entry or create a new one. */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    if (create)
+        entry = pgstat_localhash_insert(pgStatLocalHash, key, found);
+    else
+        entry = pgstat_localhash_lookup(pgStatLocalHash, key);
+
+    if (!create && !entry)
+        return NULL;
+
+    if (create && !*found)
+        entry->body = MemoryContextAllocZero(TopMemoryContext,
+                                             pgstat_localentsize[type]);
+
+    return entry->body;
 }
 
 /* ------------------------------------------------------------
@@ -835,150 +1114,399 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *    Updates are applied no more frequently than once every
+ *    PGSTAT_MIN_INTERVAL milliseconds. They are also postponed on lock
+ *    failure if force is false and no update has been pending for longer
+ *    than PGSTAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
+ *    succeeding calls of this function.
+ *
+ *    Returns the time in milliseconds until the next attempt to apply
+ *    updates if some updates remain pending after this call, or 0
+ *    otherwise.
+ *
+ *    Note that this is called only out of a transaction, so it is fine to use
  *    transaction stop time as an approximation of current time.
- * ----------
+ *    ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz    next_flush = 0;
+    static TimestampTz    pending_since = 0;
+    static long            retry_interval = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
+    bool        nowait;
     int            i;
+    uint64        oldval;
+
+    /* Return if not active */
+    if (area == NULL)
+        return 0;
+
+    /*
+     * We need a database entry if any of the following stats exist.
+     */
+    if (pgStatXactCommit > 0 || pgStatXactRollback > 0 ||
+        pgStatBlockReadTime > 0 || pgStatBlockWriteTime > 0)
+        get_local_dbstat_entry(MyDatabaseId);
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (pgStatLocalHash == NULL && !have_slrustats && !walstats_pending())
+        return 0;
 
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
     now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
-
-    /*
-     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
-     */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
-    pgStatTabHash = NULL;
-
-    /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
-     */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
-    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+
+    if (!force)
     {
-        for (i = 0; i < tsa->tsa_used; i++)
+        /*
+         * Don't flush stats too frequently.  Return the time to the next
+         * flush.
+         */
+        if (now < next_flush)
         {
-            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
-
-            /* Shouldn't have any pending transaction-dependent counts */
-            Assert(entry->trans == NULL);
-
-            /*
-             * Ignore entries that didn't accumulate any actual counts, such
-             * as indexes that were opened by the planner but not used.
-             */
-            if (memcmp(&entry->t_counts, &all_zeroes,
-                       sizeof(PgStat_TableCounts)) == 0)
+            /* Record the epoch time if retrying. */
+            if (pending_since == 0)
+                pending_since = now;
+
+            return (next_flush - now) / 1000;
+        }
+
+        /* But, don't keep pending updates longer than PGSTAT_MAX_INTERVAL. */
+
+        if (pending_since > 0 &&
+            TimestampDifferenceExceeds(pending_since, now, PGSTAT_MAX_INTERVAL))
+            force = true;
+    }
+
+    /* don't wait for lock acquisition when !force */
+    nowait = !force;
+
+    if (pgStatLocalHash)
+    {
+        int            remains = 0;
+        pgstat_localhash_iterator i;
+        List                   *dbentlist = NIL;
+        ListCell               *lc;
+        PgStatLocalHashEntry   *lent;
+
+        /* Step 1: flush out other than database stats */
+        pgstat_localhash_start_iterate(pgStatLocalHash, &i);
+        while ((lent = pgstat_localhash_iterate(pgStatLocalHash, &i)) != NULL)
+        {
+            bool        remove = false;
+
+            switch (lent->key.type)
+            {
+                case PGSTAT_TYPE_DB:
+                    /*
+                     * flush_tabstat applies some of the numbers from flushed
+                     * entries to the local database stats. Just remember the
+                     * database entries for now and flush them out later.
+                     */
+                    dbentlist = lappend(dbentlist, lent);
+                    break;
+                case PGSTAT_TYPE_TABLE:
+                    if (flush_tabstat(lent, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_FUNCTION:
+                    if (flush_funcstat(lent, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_REPLSLOT:
+                    /* We don't have that kind of local entry */
+                    Assert(false);
+            }
+
+            if (!remove)
+            {
+                remains++;
                 continue;
+            }
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* Remove the successfully flushed entry */
+            pfree(lent->body);
+            lent->body = NULL;
+            pgstat_localhash_delete(pgStatLocalHash, lent->key);
+        }
+
+        /* Step 2: flush out database stats */
+        foreach(lc, dbentlist)
+        {
+            PgStatLocalHashEntry *lent = (PgStatLocalHashEntry *) lfirst(lc);
+
+            if (flush_dbstat(lent, nowait))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                remains--;
+                /* Remove the successfully flushed entry */
+                pfree(lent->body);
+                lent->body = NULL;
+                pgstat_localhash_delete(pgStatLocalHash, lent->key);
             }
         }
-        /* zero out PgStat_TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
+        list_free(dbentlist);
+        dbentlist = NIL;
+
+        if (remains <= 0)
+        {
+            pgstat_localhash_destroy(pgStatLocalHash);
+            pgStatLocalHash = NULL;
+        }
+    }
+
+    /* flush wal stats */
+    flush_walstat(nowait);
+
+    /* flush SLRU stats */
+    flush_slrustat(nowait);
+
+    /*
+     * Publish the time of the last flush, but we don't notify the change of
+     * the timestamp itself. Readers will get sufficiently recent timestamp.
+     * If we failed to update the value, concurrent processes should have
+     * updated it to sufficiently recent time.
+     *
+     * XXX: The loop might be unnecessary for the reason above.
+     */
+    oldval = pg_atomic_read_u64(&StatsShmem->stats_timestamp);
+
+    for (i = 0 ; i < 10 ; i++)
+    {
+        if (oldval >= now ||
+            pg_atomic_compare_exchange_u64(&StatsShmem->stats_timestamp,
+                                           &oldval, (uint64) now))
+            break;
     }
 
     /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
+     * Some of the local stats may not have been flushed due to lock
+     * contention.  If we have such pending local stats here, let the caller
+     * know the retry interval.
      */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    if (pgStatLocalHash != NULL || have_slrustats || walstats_pending())
+    {
+        /* Retain the epoch time */
+        if (pending_since == 0)
+            pending_since = now;
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+        /* The interval is doubled at every retry. */
+        if (retry_interval == 0)
+            retry_interval = PGSTAT_RETRY_MIN_INTERVAL * 1000;
+        else
+            retry_interval = retry_interval * 2;
 
-    /* Send WAL statistics */
-    pgstat_send_wal();
+        /*
+         * Determine the next retry interval so as not to get shorter than the
+         * previous interval.
+         */
+        if (!TimestampDifferenceExceeds(pending_since,
+                                        now + 2 * retry_interval,
+                                        PGSTAT_MAX_INTERVAL))
+            next_flush = now + retry_interval;
+        else
+        {
+            next_flush = pending_since + PGSTAT_MAX_INTERVAL * 1000;
+            retry_interval = next_flush - now;
+        }
 
-    /* Finally send SLRU statistics */
-    pgstat_send_slru();
+        return retry_interval / 1000;
+    }
+
+    /* Set the next time to update stats */
+    next_flush = now + PGSTAT_MIN_INTERVAL * 1000;
+    retry_interval = 0;
+    pending_since = 0;
+
+    return 0;
 }
 
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * flush_tabstat - flush out a local table stats entry
+ *
+ * Some of the stats numbers are copied to the local database stats entry
+ * after a successful flush-out.
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
  */
-static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+static bool
+flush_tabstat(PgStatLocalHashEntry *ent, bool nowait)
 {
-    int            n;
-    int            len;
+    static const PgStat_TableCounts all_zeroes;
+    Oid                    dboid;            /* database OID of the table */
+    PgStat_TableStatus *lstats;            /* local stats entry  */
+    PgStat_StatTabEntry *shtabstats;    /* table entry of shared stats */
+    PgStat_StatDBEntry *ldbstats;        /* local database entry */
+
+    Assert(ent->key.type == PGSTAT_TYPE_TABLE);
+    lstats = (PgStat_TableStatus *) ent->body;
+    dboid = ent->key.databaseid;
+
+    /*
+     * Ignore entries that didn't accumulate any actual counts, such as
+     * indexes that were opened by the planner but not used.
+     */
+    if (memcmp(&lstats->t_counts, &all_zeroes,
+               sizeof(PgStat_TableCounts)) == 0)
+    {
+        /* This local entry is going to be dropped, delink from relcache. */
+        pgstat_delinkstats(lstats->relation);
+        return true;
+    }
+
+    /* find shared table stats entry corresponding to the local entry */
+    shtabstats = (PgStat_StatTabEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_TABLE, dboid, ent->key.objectid,
+                             nowait);
+
+    if (shtabstats == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    /* add the values to the shared entry. */
+    shtabstats->numscans += lstats->t_counts.t_numscans;
+    shtabstats->tuples_returned += lstats->t_counts.t_tuples_returned;
+    shtabstats->tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    shtabstats->tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    shtabstats->tuples_updated += lstats->t_counts.t_tuples_updated;
+    shtabstats->tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    shtabstats->tuples_hot_updated += lstats->t_counts.t_tuples_hot_updated;
+
+    /*
+     * If the table was truncated or vacuum/analyze has run, first reset the
+     * live/dead counters.
+     */
+    if (lstats->t_counts.t_truncated)
+    {
+        shtabstats->n_live_tuples = 0;
+        shtabstats->n_dead_tuples = 0;
+    }
+
+    shtabstats->n_live_tuples += lstats->t_counts.t_delta_live_tuples;
+    shtabstats->n_dead_tuples += lstats->t_counts.t_delta_dead_tuples;
+    shtabstats->changes_since_analyze += lstats->t_counts.t_changed_tuples;
+    shtabstats->inserts_since_vacuum += lstats->t_counts.t_tuples_inserted;
+    shtabstats->blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    shtabstats->blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    shtabstats->n_live_tuples = Max(shtabstats->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    shtabstats->n_dead_tuples = Max(shtabstats->n_dead_tuples, 0);
+
+    LWLockRelease(&shtabstats->header.lock);
+
+    /* Flush succeeded, so add the same counts to the local database stats */
+    ldbstats = get_local_dbstat_entry(dboid);
+    ldbstats->counts.n_tuples_returned += lstats->t_counts.t_tuples_returned;
+    ldbstats->counts.n_tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    ldbstats->counts.n_tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    ldbstats->counts.n_tuples_updated += lstats->t_counts.t_tuples_updated;
+    ldbstats->counts.n_tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    ldbstats->counts.n_blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    ldbstats->counts.n_blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /* This local entry is going to be dropped, delink from relcache. */
+    pgstat_delinkstats(lstats->relation);
+
+    return true;
+}
+
+/*
+ * flush_funcstat - flush out a local function stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_funcstat(PgStatLocalHashEntry *ent, bool nowait)
+{
+    PgStat_BackendFunctionEntry *localent;    /* local stats entry */
+    PgStat_StatFuncEntry *shfuncent = NULL; /* shared stats entry */
+
+    Assert(ent->key.type == PGSTAT_TYPE_FUNCTION);
+    localent = (PgStat_BackendFunctionEntry *) ent->body;
+
+    /* localent always has non-zero content */
+
+    /* find shared function stats entry corresponding to the local entry */
+    shfuncent = (PgStat_StatFuncEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             ent->key.objectid, nowait);
+    if (shfuncent == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    shfuncent->f_numcalls += localent->f_counts.f_numcalls;
+    shfuncent->f_total_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_total_time);
+    shfuncent->f_self_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_self_time);
+
+    LWLockRelease(&shfuncent->header.lock);
+
+    return true;
+}
+
+
+/*
+ * flush_dbstat - flush out a local database stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_dbstat(PgStatLocalHashEntry *ent, bool nowait)
+{
+    PgStat_StatDBEntry *localent;
+    PgStat_StatDBEntry *sharedent;
+
+    Assert(ent->key.type == PGSTAT_TYPE_DB);
+
+    localent = (PgStat_StatDBEntry *) ent->body;
+
+    /* find shared database stats entry corresponding to the local entry */
+    sharedent = (PgStat_StatDBEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_DB, ent->key.databaseid, InvalidOid,
+                             nowait);
+
+    if (!sharedent)
+        return false;            /* failed to acquire lock, skip */
+
+    sharedent->counts.n_tuples_returned += localent->counts.n_tuples_returned;
+    sharedent->counts.n_tuples_fetched += localent->counts.n_tuples_fetched;
+    sharedent->counts.n_tuples_inserted += localent->counts.n_tuples_inserted;
+    sharedent->counts.n_tuples_updated += localent->counts.n_tuples_updated;
+    sharedent->counts.n_tuples_deleted += localent->counts.n_tuples_deleted;
+    sharedent->counts.n_blocks_fetched += localent->counts.n_blocks_fetched;
+    sharedent->counts.n_blocks_hit += localent->counts.n_blocks_hit;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    sharedent->counts.n_deadlocks += localent->counts.n_deadlocks;
+    sharedent->counts.n_temp_bytes += localent->counts.n_temp_bytes;
+    sharedent->counts.n_temp_files += localent->counts.n_temp_files;
+    sharedent->counts.n_checksum_failures += localent->counts.n_checksum_failures;
 
     /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
+     * Accumulate xact commit/rollback and I/O timings to stats entry of the
+     * current database.
      */
-    if (OidIsValid(tsmsg->m_databaseid))
+    if (OidIsValid(ent->key.databaseid))
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
+        sharedent->counts.n_xact_commit += pgStatXactCommit;
+        sharedent->counts.n_xact_rollback += pgStatXactRollback;
+        sharedent->counts.n_block_read_time += pgStatBlockReadTime;
+        sharedent->counts.n_block_write_time += pgStatBlockWriteTime;
         pgStatXactCommit = 0;
         pgStatXactRollback = 0;
         pgStatBlockReadTime = 0;
@@ -986,282 +1514,138 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        sharedent->counts.n_xact_commit = 0;
+        sharedent->counts.n_xact_rollback = 0;
+        sharedent->counts.n_block_read_time = 0;
+        sharedent->counts.n_block_write_time = 0;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
+    LWLockRelease(&sharedent->header.lock);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    return true;
 }
 
-/*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
- */
-static void
-pgstat_send_funcstats(void)
-{
-    /* we assume this inits to all zeroes: */
-    static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
-
-    if (pgStatFunctions == NULL)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
-
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
-
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
-
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
-
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
-
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
-}
-
-
 /* ----------
  * pgstat_vacuum_stat() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *  Delete shared stat entries that are not in system catalogs.
+ *
+ *  To avoid holding exclusive lock on dshash for a long time, the process is
+ *  performed in three steps.
+ *
+ *   1: Collect existent oids of every kind of object.
+ *   2: Collect victim entries by scanning with shared lock.
+ *   3: Try removing every nominated entry without waiting for lock.
+ *
+ *  As the consequence of the last step, some entries may be left alone due to
+ *  lock failure, but as explained by the comment of pgstat_vacuum_stat, they
+ *  will be deleted by later vacuums.
  * ----------
  */
 void
 pgstat_vacuum_stat(void)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Read pg_database and make a list of OIDs of all existing databases
-     */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
-
-    /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            dbid = dbentry->databaseid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        /* the DB entry for shared tables (with InvalidOid) is never dropped */
-        if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
-            pgstat_drop_database(dbid);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Lookup our own database entry; if not found, nothing more to do.
-     */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
+    pgstat_oid_hash       *dbids;    /* database ids */
+    pgstat_oid_hash       *relids;    /* relation ids in the current database */
+    pgstat_oid_hash       *funcids;/* function ids in the current database */
+    int            nvictims = 0;    /* # of entries removed */
+    dshash_seq_status dshstat;
+    PgStatHashEntry *ent;
+
+    /* we don't collect stats under standalone mode */
+    if (!IsUnderPostmaster)
         return;
 
-    /*
-     * Similarly to above, make a list of all known relations in this DB.
-     */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
+    /* collect oids of existent objects */
+    dbids = collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    relids = collect_oids(RelationRelationId, Anum_pg_class_oid);
+    funcids = collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
 
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
+    nvictims = 0;
 
-    /*
-     * Check for all tables listed in stats hashtable if they still exist.
-     */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+    /* some of the dshash entries are to be removed, take exclusive lock. */
+    dshash_seq_init(&dshstat, pgStatSharedHash, true);
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
     {
-        Oid            tabid = tabentry->tableid;
+        pgstat_oid_hash *oidhash;
+        Oid           key;
 
         CHECK_FOR_INTERRUPTS();
 
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+        /*
+         * Don't drop non-database entries that belong to a database other
+         * than the current one.
+         */
+        if (ent->key.type != PGSTAT_TYPE_DB &&
+            ent->key.databaseid != MyDatabaseId)
             continue;
 
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
-    }
-
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
+        switch (ent->key.type)
         {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
+            case PGSTAT_TYPE_DB:
+                /* don't remove database entry for shared tables */
+                if (ent->key.databaseid == 0)
+                    continue;
+                oidhash = dbids;
+                key = ent->key.databaseid;
+                break;
+
+            case PGSTAT_TYPE_TABLE:
+                oidhash = relids;
+                key = ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                oidhash = funcids;
+                key = ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_REPLSLOT:
+                /*
+                 * We don't bother vacuuming this kind of entry because the
+                 * number of entries is quite small and entries are likely to
+                 * be reused soon.
+                 */
                 continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
         }
 
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
+        /* Skip existent objects. */
+        if (pgstat_oid_lookup(oidhash, key) != NULL)
+            continue;
 
-        hash_destroy(htab);
+        /* drop this entry */
+        delete_current_stats_entry(&dshstat);
+        nvictims++;
     }
+    dshash_seq_term(&dshstat);
+    pgstat_oid_destroy(dbids);
+    pgstat_oid_destroy(relids);
+    pgstat_oid_destroy(funcids);
+
+    if (nvictims > 0)
+        pg_atomic_add_fetch_u64(&StatsShmem->gc_count, 1);
 }
 
-
 /* ----------
- * pgstat_collect_oids() -
+ * collect_oids() -
  *
  *    Collect the OIDs of all objects listed in the specified system catalog
- *    into a temporary hash table.  Caller should hash_destroy the result
+ *    into a temporary hash table.  Caller should pgstat_oid_destroy the result
  *    when done with it.  (However, we make the table in CurrentMemoryContext
  *    so that it will be freed properly in event of an error.)
  * ----------
  */
-static HTAB *
-pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
+static pgstat_oid_hash *
+collect_oids(Oid catalogid, AttrNumber anum_oid)
 {
-    HTAB       *htab;
-    HASHCTL        hash_ctl;
+    pgstat_oid_hash *rethash;
     Relation    rel;
     TableScanDesc scan;
     HeapTuple    tup;
     Snapshot    snapshot;
 
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(Oid);
-    hash_ctl.hcxt = CurrentMemoryContext;
-    htab = hash_create("Temporary table of OIDs",
-                       PGSTAT_TAB_HASH_SIZE,
-                       &hash_ctl,
-                       HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    rethash = pgstat_oid_create(CurrentMemoryContext,
+                                PGSTAT_TABLE_HASH_SIZE, NULL);
 
     rel = table_open(catalogid, AccessShareLock);
     snapshot = RegisterSnapshot(GetLatestSnapshot());
@@ -1270,81 +1654,61 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     {
         Oid            thisoid;
         bool        isnull;
+        bool        found;
 
         thisoid = heap_getattr(tup, anum_oid, RelationGetDescr(rel), &isnull);
         Assert(!isnull);
 
         CHECK_FOR_INTERRUPTS();
 
-        (void) hash_search(htab, (void *) &thisoid, HASH_ENTER, NULL);
+        pgstat_oid_insert(rethash, thisoid, &found);
     }
     table_endscan(scan);
     UnregisterSnapshot(snapshot);
     table_close(rel, AccessShareLock);
 
-    return htab;
+    return rethash;
 }
 
-
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
- * ----------
+ *    Remove the entries for the database that we just dropped.
+ *
+ *    Some entries might be left behind due to lock failure, or some stats
+ *    might be flushed after this, but we will still clean up the dead DB
+ *    eventually via future invocations of pgstat_vacuum_stat().
+ * ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
 
+    Assert(OidIsValid(databaseid));
 
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
-
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
+    /* some of the dshash entries are to be removed, take exclusive lock. */
+    dshash_seq_init(&hstat, pgStatSharedHash, true);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+    {
+        if (p->key.databaseid == databaseid)
+            delete_current_stats_entry(&hstat);
+    }
+    dshash_seq_term(&hstat);
+
+    /* Let readers run a garbage collection of local hashes */
+    pg_atomic_add_fetch_u64(&StatsShmem->gc_count, 1);
 }
-#endif                            /* NOT_USED */
 
 
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1353,53 +1717,146 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    /* dshash entry is not modified, take shared lock */
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+    {
+        PgStat_StatEntryHeader *header;
+
+        if (p->key.databaseid != MyDatabaseId)
+            continue;
+
+        header = dsa_get_address(area, p->body);
+
+        LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+        memset(PGSTAT_SHENT_BODY(header), 0,
+               PGSTAT_SHENT_BODY_LEN(p->key.type));
+
+        if (p->key.type == PGSTAT_TYPE_DB)
+        {
+            PgStat_StatDBEntry *dbstat = (PgStat_StatDBEntry *) header;
+            dbstat->stat_reset_timestamp = GetCurrentTimestamp();
+        }
+        LWLockRelease(&header->lock);
+    }
+    dshash_seq_term(&hstat);
+
+    /* Invalidate the simple cache keys */
+    cached_dbent_key = stathashkey_zero;
+    cached_tabent_key = stathashkey_zero;
+    cached_funcent_key = stathashkey_zero;
+}
+
+/*
+ * pgstat_copy_global_stats - helper function for functions
+ *           pgstat_fetch_stat_*() and pgstat_reset_shared_counters().
+ *
+ * Copies out the specified memory area following change-count protocol.
+ */
+static inline void
+pgstat_copy_global_stats(void *dst, void *src, size_t len,
+                         pg_atomic_uint32 *count)
+{
+    uint32 before_changecount;
+    uint32 after_changecount;
+
+    after_changecount = pg_atomic_read_u32(count);
+
+    do
+    {
+        before_changecount = after_changecount;
+        memcpy(dst, src, len);
+        after_changecount = pg_atomic_read_u32(count);
+    } while ((before_changecount & 1) == 1 ||
+             after_changecount != before_changecount);
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
+ *
+ *  We don't scribble on shared stats while resetting to avoid locking on
+ *  shared stats struct. Instead, just record the current counters in another
+ *  shared struct, which is protected by StatsLock. See
+ *  pgstat_fetch_stat_(archiver|bgwriter|checkpointer) for the reader side.
  * ----------
  */
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    TimestampTz now = GetCurrentTimestamp();
+    PgStat_Shared_Reset_Target t;
 
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+        t = RESET_ARCHIVER;
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+        t = RESET_BGWRITER;
     else if (strcmp(target, "wal") == 0)
-        msg.m_resettarget = RESET_WAL;
+        t = RESET_WAL;
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\", \"bgwriter\" or \"wal\".")));
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
+    /* Reset statistics for the cluster. */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+    switch (t)
+    {
+        case RESET_ARCHIVER:
+            pgstat_copy_global_stats(&StatsShmem->archiver_reset_offset,
+                                     &StatsShmem->archiver_stats,
+                                     sizeof(PgStat_Archiver),
+                                     &StatsShmem->archiver_changecount);
+            StatsShmem->archiver_reset_offset.stat_reset_timestamp = now;
+            cached_archiverstats_is_valid = false;
+            break;
+
+        case RESET_BGWRITER:
+            pgstat_copy_global_stats(&StatsShmem->bgwriter_reset_offset,
+                                     &StatsShmem->bgwriter_stats,
+                                     sizeof(PgStat_BgWriter),
+                                     &StatsShmem->bgwriter_changecount);
+            pgstat_copy_global_stats(&StatsShmem->checkpointer_reset_offset,
+                                     &StatsShmem->checkpointer_stats,
+                                     sizeof(PgStat_CheckPointer),
+                                     &StatsShmem->checkpointer_changecount);
+            StatsShmem->bgwriter_reset_offset.stat_reset_timestamp = now;
+            cached_bgwriterstats_is_valid = false;
+            cached_checkpointerstats_is_valid = false;
+            break;
+
+        case RESET_WAL:
+            /*
+             * Unlike the two above, WAL statistics have many writer
+             * processes and are protected by wal_stats_lock.
+             */
+            LWLockAcquire(&StatsShmem->wal_stats_lock, LW_EXCLUSIVE);
+            MemSet(&StatsShmem->wal_stats, 0, sizeof(PgStat_Wal));
+            StatsShmem->wal_stats.stat_reset_timestamp = now;
+            LWLockRelease(&StatsShmem->wal_stats_lock);
+            cached_walstats_is_valid = false;
+            break;
+    }
+
+    LWLockRelease(StatsLock);
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1408,17 +1865,37 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatEntryHeader *header;
+    PgStat_StatDBEntry *dbentry;
+    PgStatTypes stattype;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    dbentry = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, MyDatabaseId, InvalidOid, false, false,
+                       NULL);
+    Assert(dbentry);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* Set the reset timestamp for the whole database */
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->header.lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&dbentry->header.lock);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Reset the object's counters if its entry exists, ignore if not */
+    switch (type)
+    {
+        case RESET_TABLE:
+            stattype = PGSTAT_TYPE_TABLE;
+            break;
+        case RESET_FUNCTION:
+            stattype = PGSTAT_TYPE_FUNCTION;
+            break;
+    }
+
+    header = get_stat_entry(stattype, MyDatabaseId, objoid, false, false, NULL);
+    if (header == NULL)
+        return;                    /* no such entry, nothing to reset */
+
+    LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+    memset(PGSTAT_SHENT_BODY(header), 0, PGSTAT_SHENT_BODY_LEN(stattype));
+    LWLockRelease(&header->lock);
 }
 
 /* ----------
@@ -1434,15 +1911,40 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_reset_slru_counter(const char *name)
 {
-    PgStat_MsgResetslrucounter msg;
+    int i;
+    TimestampTz    ts = GetCurrentTimestamp();
+    uint32    assert_changecount PG_USED_FOR_ASSERTS_ONLY;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    if (name)
+    {
+        i = pgstat_slru_index(name);
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+        assert_changecount =
+            pg_atomic_fetch_add_u32(&StatsShmem->slru_changecount, 1);
+        Assert((assert_changecount & 1) == 0);
+        MemSet(&StatsShmem->slru_stats.entry[i], 0,
+               sizeof(PgStat_SLRUStats));
+        StatsShmem->slru_stats.entry[i].stat_reset_timestamp = ts;
+        pg_atomic_add_fetch_u32(&StatsShmem->slru_changecount, 1);
+        LWLockRelease(&StatsShmem->slru_stats.lock);
+    }
+    else
+    {
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+        assert_changecount =
+            pg_atomic_fetch_add_u32(&StatsShmem->slru_changecount, 1);
+        Assert((assert_changecount & 1) == 0);
+        for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
+        {
+            MemSet(&StatsShmem->slru_stats.entry[i], 0,
+                   sizeof(PgStat_SLRUStats));
+            StatsShmem->slru_stats.entry[i].stat_reset_timestamp = ts;
+        }
+        pg_atomic_add_fetch_u32(&StatsShmem->slru_changecount, 1);
+        LWLockRelease(&StatsShmem->slru_stats.lock);
+    }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSLRUCOUNTER);
-    msg.m_index = (name) ? pgstat_slru_index(name) : -1;
-
-    pgstat_send(&msg, sizeof(msg));
+    cached_slrustats_is_valid = false;
 }
 
 /* ----------
@@ -1458,20 +1960,19 @@ pgstat_reset_slru_counter(const char *name)
 void
 pgstat_reset_replslot_counter(const char *name)
 {
-    PgStat_MsgResetreplslotcounter msg;
+    int            startidx;
+    int            endidx;
+    int            i;
+    TimestampTz    ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
     if (name)
     {
         ReplicationSlot *slot;
-
-        /*
-         * Check if the slot exits with the given name. It is possible that by
-         * the time this message is executed the slot is dropped but at least
-         * this check will ensure that the given name is for a valid slot.
-         */
+
+        /* Check whether a slot with the given name exists. */
         LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
         slot = SearchNamedReplicationSlot(name);
         LWLockRelease(ReplicationSlotControlLock);
@@ -1489,15 +1990,36 @@ pgstat_reset_replslot_counter(const char *name)
         if (SlotIsPhysical(slot))
             return;
 
-        strlcpy(msg.m_slotname, name, NAMEDATALEN);
-        msg.clearall = false;
+        /* reset this one entry */
+        startidx = endidx = slot - ReplicationSlotCtl->replication_slots;
     }
     else
-        msg.clearall = true;
+    {
+        /* reset all existing entries */
+        startidx = 0;
+        endidx = max_replication_slots - 1;
+    }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETREPLSLOTCOUNTER);
+    ts = GetCurrentTimestamp();
+    for (i = startidx ; i <= endidx ; i++)
+    {
+        PgStat_ReplSlot *shent;
 
-    pgstat_send(&msg, sizeof(msg));
+        shent = (PgStat_ReplSlot *)
+            get_stat_entry(PGSTAT_TYPE_REPLSLOT,
+                           MyDatabaseId, i, false, false, NULL);
+
+        /* Skip non-existent entries */
+        if (!shent)
+            continue;
+
+        LWLockAcquire(&shent->header.lock, LW_EXCLUSIVE);
+        memset(&shent->spill_txns, 0,
+               offsetof(PgStat_ReplSlot, stat_reset_timestamp) -
+               offsetof(PgStat_ReplSlot, spill_txns));
+        shent->stat_reset_timestamp = ts;
+        LWLockRelease(&shent->header.lock);
+    }
 }
 
 /* ----------
@@ -1511,48 +2033,93 @@ pgstat_reset_replslot_counter(const char *name)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if activity stats is not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    /*
+     * End-of-vacuum is reported instantly. Report the start the same way for
+     * consistency. Vacuum doesn't run frequently and is a long-lasting
+     * operation so it doesn't matter if we get blocked here a little.
+     */
+    dbentry = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, dboid, InvalidOid, false, true, NULL);
 
-    pgstat_send(&msg, sizeof(msg));
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->header.lock, LW_EXCLUSIVE);
+    dbentry->last_autovac_time = ts;
+    LWLockRelease(&dbentry->header.lock);
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    PgStat_StatTabEntry *tabentry;
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    ts = GetCurrentTimestamp();
+
+    /*
+     * Unlike ordinary operations, maintenance commands take a long time, so
+     * getting blocked at the end of the work doesn't matter. Furthermore,
+     * this prevents stats updates made by transactions that end after this
+     * vacuum from being canceled by a delayed vacuum report.  Update the
+     * shared stats entry directly for these reasons.
+     */
+    tabentry = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, tableoid, false, true, NULL);
+
+    LWLockAcquire(&tabentry->header.lock, LW_EXCLUSIVE);
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * It is quite possible that a non-aggressive VACUUM ended up skipping
+     * various pages, however, we'll zero the insert counter here regardless.
+     * It's currently used only to track when we need to perform an "insert"
+     * autovacuum, which are mainly intended to freeze newly inserted tuples.
+     * Zeroing this may just mean we'll not try to vacuum the table again
+     * until enough tuples have been inserted to trigger another insert
+     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
+     * stragglers.
+     */
+    tabentry->inserts_since_vacuum = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = ts;
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = ts;
+        tabentry->vacuum_count++;
+    }
+
+    LWLockRelease(&tabentry->header.lock);
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1563,9 +2130,11 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    PgStat_StatTabEntry *tabentry;
+    Oid        dboid = (rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId);
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1573,10 +2142,10 @@ pgstat_report_analyze(Relation rel,
      * already inserted and/or deleted rows in the target table. ANALYZE will
      * have counted such rows as live or dead respectively. Because we will
      * report our counts of such rows at transaction end, we should subtract
-     * off these counts from what we send to the collector now, else they'll
-     * be double-counted after commit.  (This approach also ensures that the
-     * collector ends up with the right numbers if we abort instead of
-     * committing.)
+     * off these counts from what we write to the shared stats now, else
+     * they'll be double-counted after commit.  (This approach also ensures
+     * that the shared stats end up with the right numbers if we abort
+     * instead of committing.)
      */
     if (rel->pgstat_info != NULL)
     {
@@ -1594,137 +2163,223 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Unlike ordinary operations, maintenance commands take a long time, so
+     * getting blocked at the end of the work doesn't matter. Furthermore,
+     * this prevents stats updates made by transactions that end after this
+     * analyze from being canceled by a delayed analyze report.  Update the
+     * shared stats entry directly for these reasons.
+     */
+    tabentry = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, RelationGetRelid(rel),
+                       false, true, NULL);
+
+    LWLockAcquire(&tabentry->header.lock, LW_EXCLUSIVE);
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    LWLockRelease(&tabentry->header.lock);
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            dbent->counts.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            dbent->counts.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            dbent->counts.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            dbent->counts.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            dbent->counts.n_conflict_startup_deadlock++;
+            break;
+    }
 }
 
+
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_deadlocks++;
 }
 
-
-
 /* --------
- * pgstat_report_checksum_failures_in_db() -
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
  *
- *    Tell the collector about one or more checksum failures.
+ *    Report one or more checksum failures.
  * --------
  */
 void
 pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
 {
-    PgStat_MsgChecksumFailure msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
+    dbentry = get_local_dbstat_entry(dboid);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* add the reported failures to the accumulated count */
+    dbentry->counts.n_checksum_failures += failurecount;
 }
 
 /* --------
  * pgstat_report_checksum_failure() -
  *
- *    Tell the collector about a checksum failure.
+ *    Report a checksum failure.
  * --------
  */
 void
 pgstat_report_checksum_failure(void)
 {
-    pgstat_report_checksum_failures_in_db(MyDatabaseId, 1);
+    PgStat_StatDBEntry *dbent;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_checksum_failures++;
 }
 
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
+    if (filesize == 0)            /* Is there a case where filesize is really 0? */
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_temp_bytes += filesize; /* XXX: needs an overflow check */
+    dbent->counts.n_temp_files++;
 }
 
 /* ----------
  * pgstat_report_replslot() -
  *
- *    Tell the collector about replication slot statistics.
+ *    Report replication slot activity.
  * ----------
  */
 void
-pgstat_report_replslot(const char *slotname, int spilltxns, int spillcount,
-                       int spillbytes, int streamtxns, int streamcount, int streambytes)
+pgstat_report_replslot(const char *slotname,
+                       int spilltxns, int spillcount, int spillbytes,
+                       int streamtxns, int streamcount, int streambytes)
 {
-    PgStat_MsgReplSlot msg;
+    PgStat_ReplSlot *shent;
+    int                 i;
+    bool             found;
+
+    if (!area)
+        return;
+
+    for (i = 0 ; i < max_replication_slots ; i++)
+    {
+        if (strcmp(NameStr(ReplicationSlotCtl->replication_slots[i].data.name),
+                   slotname) == 0)
+            break;
+
+    }
 
     /*
-     * Prepare and send the message
+     * The slot must have been removed; just ignore it.  The entry will be
+     * created the next time a slot with this name is reported.
      */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_REPLSLOT);
-    strlcpy(msg.m_slotname, slotname, NAMEDATALEN);
-    msg.m_drop = false;
-    msg.m_spill_txns = spilltxns;
-    msg.m_spill_count = spillcount;
-    msg.m_spill_bytes = spillbytes;
-    msg.m_stream_txns = streamtxns;
-    msg.m_stream_count = streamcount;
-    msg.m_stream_bytes = streambytes;
-    pgstat_send(&msg, sizeof(PgStat_MsgReplSlot));
+    if (i == max_replication_slots)
+        return;
+
+    shent = (PgStat_ReplSlot *)
+        get_stat_entry(PGSTAT_TYPE_REPLSLOT,
+                       MyDatabaseId, i, false, true, &found);
+
+    /* Clear the counters and the dropped flag when we reuse the entry */
+    LWLockAcquire(&shent->header.lock, LW_EXCLUSIVE);
+    if (shent->header.dropped || !found)
+    {
+        memset(&shent->spill_txns, 0,
+               sizeof(PgStat_ReplSlot) - offsetof(PgStat_ReplSlot, spill_txns));
+        strlcpy(shent->slotname, slotname, NAMEDATALEN);
+        shent->header.dropped = false;
+    }
+
+    shent->spill_txns += spilltxns;
+    shent->spill_count += spillcount;
+    shent->spill_bytes += spillbytes;
+    shent->stream_txns += streamtxns;
+    shent->stream_count += streamcount;
+    shent->stream_bytes += streambytes;
+    LWLockRelease(&shent->header.lock);
 }
 
 /* ----------
@@ -1736,55 +2391,44 @@ pgstat_report_replslot(const char *slotname, int spilltxns, int spillcount,
 void
 pgstat_report_replslot_drop(const char *slotname)
 {
-    PgStat_MsgReplSlot msg;
+    int i;
+    PgStat_ReplSlot *shent;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_REPLSLOT);
-    strlcpy(msg.m_slotname, slotname, NAMEDATALEN);
-    msg.m_drop = true;
-    pgstat_send(&msg, sizeof(PgStat_MsgReplSlot));
-}
+    Assert(area);
+    if (!area)
+        return;
 
-/* ----------
- * pgstat_ping() -
- *
- *    Send some junk data to the collector to increase traffic.
- * ----------
- */
-void
-pgstat_ping(void)
-{
-    PgStat_MsgDummy msg;
+    for (i = 0 ; i < max_replication_slots ; i++)
+    {
+        if (strcmp(NameStr(ReplicationSlotCtl->replication_slots[i].data.name),
+                   slotname) == 0)
+            break;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    /* XXX: maybe the slot has been removed; just ignore it. */
+    if (i == max_replication_slots)
+        return;
+
+    shent = (PgStat_ReplSlot *)
+        get_stat_entry(PGSTAT_TYPE_REPLSLOT,
+                       MyDatabaseId, i, false, false, NULL);
+
+    if (shent && !shent->header.dropped)
+    {
+        LWLockAcquire(&shent->header.lock, LW_EXCLUSIVE);
+        shent->header.dropped = true;
+        LWLockRelease(&shent->header.lock);
+    }
 }
 
 /* ----------
- * pgstat_send_inquiry() -
+ * pgstat_init_function_usage() -
  *
- *    Notify collector that we need fresh data.
+ *  Initialize function call usage data.
+ *  Called by the executor before invoking a function.
  * ----------
  */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/*
- * Initialize function call usage data.
- * Called by the executor before invoking a function.
- */
 void
 pgstat_init_function_usage(FunctionCallInfo fcinfo,
                            PgStat_FunctionCallUsage *fcu)
@@ -1799,25 +2443,9 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
         return;
     }
 
-    if (!pgStatFunctions)
-    {
-        /* First time through - initialize function stat table */
-        HASHCTL        hash_ctl;
-
-        memset(&hash_ctl, 0, sizeof(hash_ctl));
-        hash_ctl.keysize = sizeof(Oid);
-        hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
-        pgStatFunctions = hash_create("Function stat entries",
-                                      PGSTAT_FUNCTION_HASH_SIZE,
-                                      &hash_ctl,
-                                      HASH_ELEM | HASH_BLOBS);
-    }
-
-    /* Get the stats entry for this function, create if necessary */
-    htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
-                          HASH_ENTER, &found);
-    if (!found)
-        MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
+    htabent = (PgStat_BackendFunctionEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             fcinfo->flinfo->fn_oid, true, &found);
 
     fcu->fs = &htabent->f_counts;
 
@@ -1831,31 +2459,37 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
     INSTR_TIME_SET_CURRENT(fcu->f_start);
 }
 
-/*
- * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
- *        for specified function
+/* ----------
+ * find_funcstat_entry() -
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_BackendFunctionEntry entry for the given function.
+ *
+ *  If there is no entry, return NULL; don't create a new one.
+ * ----------
  */
 PgStat_BackendFunctionEntry *
 find_funcstat_entry(Oid func_id)
 {
-    if (pgStatFunctions == NULL)
-        return NULL;
+    PgStat_BackendFunctionEntry *ent;
 
-    return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
-                                                       (void *) &func_id,
-                                                       HASH_FIND, NULL);
+    ent = (PgStat_BackendFunctionEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             func_id, false, NULL);
+
+    return ent;
 }
 
-/*
- * Calculate function call usage and update stat counters.
- * Called by the executor after invoking a function.
+/* ----------
+ * pgstat_end_function_usage() -
  *
- * In the case of a set-returning function that runs in value-per-call mode,
- * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
- * calls for what the user considers a single call of the function.  The
- * finalize flag should be TRUE on the last call.
+ *  Calculate function call usage and update stat counters.
+ *  Called by the executor after invoking a function.
+ *
+ *  In the case of a set-returning function that runs in value-per-call mode,
+ *  we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
+ *  calls for what the user considers a single call of the function.  The
+ *  finalize flag should be TRUE on the last call.
+ * ----------
  */
 void
 pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
@@ -1896,9 +2530,6 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
         fs->f_numcalls++;
     fs->f_total_time = f_total;
     INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
 }
 
 
@@ -1910,8 +2541,7 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
  *
  *    We assume that a relcache entry's pgstat_info field is zeroed by
  *    relcache.c when the relcache entry is made; thereafter it is long-lived
- *    data.  We can avoid repeated searches of the TabStatus arrays when the
- *    same relation is touched repeatedly within a transaction.
+ *    data.
  * ----------
  */
 void
@@ -1927,7 +2557,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1938,121 +2569,60 @@ pgstat_initstats(Relation rel)
      * If we already set up this relation in the current transaction, nothing
      * to do.
      */
-    if (rel->pgstat_info != NULL &&
-        rel->pgstat_info->t_id == rel_id)
+    if (rel->pgstat_info != NULL)
         return;
 
     /* Else find or make the PgStat_TableStatus entry, and update link */
-    rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+    rel->pgstat_info = get_local_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+
+    /* don't allow linking a stats entry to multiple relcache entries */
+    Assert(rel->pgstat_info->relation == NULL);
+    /* mark this relation as the owner */
+    rel->pgstat_info->relation = rel;
 }
 
 /*
- * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
+ * pgstat_delinkstats() -
+ *
+ *  Break the mutual link between a relcache entry and a local stats entry.
+ *  This must always be called when one end of the link is removed.
  */
-static PgStat_TableStatus *
-get_tabstat_entry(Oid rel_id, bool isshared)
+void
+pgstat_delinkstats(Relation rel)
 {
-    TabStatHashEntry *hash_entry;
-    PgStat_TableStatus *entry;
-    TabStatusArray *tsa;
-    bool        found;
-
-    /*
-     * Create hash table if we don't have it already.
-     */
-    if (pgStatTabHash == NULL)
+    /* remove the link to stats info if any */
+    if (rel && rel->pgstat_info)
     {
-        HASHCTL        ctl;
-
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
+        /* link sanity check */
+        Assert(rel->pgstat_info->relation == rel);
+        rel->pgstat_info->relation = NULL;
+        rel->pgstat_info = NULL;
     }
-
-    /*
-     * Find an entry or create a new one.
-     */
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
-    if (!found)
-    {
-        /* initialize new entry with null pointer */
-        hash_entry->tsa_entry = NULL;
-    }
-
-    /*
-     * If entry is already valid, we're done.
-     */
-    if (hash_entry->tsa_entry)
-        return hash_entry->tsa_entry;
-
-    /*
-     * Locate the first pgStatTabList entry with free space, making a new list
-     * entry if needed.  Note that we could get an OOM failure here, but if so
-     * we have left the hashtable and the list in a consistent state.
-     */
-    if (pgStatTabList == NULL)
-    {
-        /* Set up first pgStatTabList entry */
-        pgStatTabList = (TabStatusArray *)
-            MemoryContextAllocZero(TopMemoryContext,
-                                   sizeof(TabStatusArray));
-    }
-
-    tsa = pgStatTabList;
-    while (tsa->tsa_used >= TABSTAT_QUANTUM)
-    {
-        if (tsa->tsa_next == NULL)
-            tsa->tsa_next = (TabStatusArray *)
-                MemoryContextAllocZero(TopMemoryContext,
-                                       sizeof(TabStatusArray));
-        tsa = tsa->tsa_next;
-    }
-
-    /*
-     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
-     * the entry was already zeroed, either at creation or after last use.
-     */
-    entry = &tsa->tsa_entries[tsa->tsa_used++];
-    entry->t_id = rel_id;
-    entry->t_shared = isshared;
-
-    /*
-     * Now we can fill the entry in pgStatTabHash.
-     */
-    hash_entry->tsa_entry = entry;
-
-    return entry;
 }
 
 /*
  * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_TableStatus entry for rel_id in the current
+ *  database. If not found, look among the shared tables.
  *
- * Note: if we got an error in the most recent execution of pgstat_report_stat,
- * it's possible that an entry exists but there's no hashtable entry for it.
- * That's okay, we'll treat this case as "doesn't exist".
+ *  If no entry is found, return NULL; don't create a new one.
+ * ----------
  */
 PgStat_TableStatus *
 find_tabstat_entry(Oid rel_id)
 {
-    TabStatHashEntry *hash_entry;
+    PgStat_TableStatus *ent;
 
-    /* If hashtable doesn't exist, there are no entries at all */
-    if (!pgStatTabHash)
-        return NULL;
+    ent = (PgStat_TableStatus *)
+        get_local_stat_entry(PGSTAT_TYPE_TABLE, MyDatabaseId, rel_id,
+                             false, NULL);
+    if (!ent)
+        ent = (PgStat_TableStatus *)
+            get_local_stat_entry(PGSTAT_TYPE_TABLE, InvalidOid, rel_id,
+                                 false, NULL);
 
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
-    if (!hash_entry)
-        return NULL;
-
-    /* Note that this step could also return NULL, but that's correct */
-    return hash_entry->tsa_entry;
+    return ent;
 }
 
 /*
@@ -2467,8 +3037,6 @@ AtPrepare_PgStat(void)
             record.inserted_pre_trunc = trans->inserted_pre_trunc;
             record.updated_pre_trunc = trans->updated_pre_trunc;
             record.deleted_pre_trunc = trans->deleted_pre_trunc;
-            record.t_id = tabstat->t_id;
-            record.t_shared = tabstat->t_shared;
             record.t_truncated = trans->truncated;
 
             RegisterTwoPhaseRecord(TWOPHASE_RM_PGSTAT_ID, 0,
@@ -2483,8 +3051,8 @@ AtPrepare_PgStat(void)
  *
  * All we need do here is unlink the transaction stats state from the
  * nontransactional state.  The nontransactional action counts will be
- * reported to the stats collector immediately, while the effects on live
- * and dead tuple counts are preserved in the 2PC state file.
+ * reported to the activity stats facility immediately, while the effects on
+ * live and dead tuple counts are preserved in the 2PC state file.
  *
  * Note: AtEOXact_PgStat is not called during PREPARE.
  */
@@ -2529,7 +3097,7 @@ pgstat_twophase_postcommit(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, commit case */
     pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
@@ -2565,7 +3133,7 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, abort case */
     if (rec->t_truncated)
@@ -2585,85 +3153,138 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    Find the stats entry for a database, for use on backends.
+ *
+ *    The returned entry is stored in static memory, so its content is valid
+ *    only until the next call of this function for a different database.
  * ----------
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    PgStat_StatDBEntry *shent;
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for InvalidOid */
+    Assert(dbid != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_dbent_key.databaseid == dbid)
+        return &cached_dbent;
+
+    shent = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid, true, false, NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_dbent, shent, sizeof(PgStat_StatDBEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_dbent_key.databaseid = dbid;
+
+    return &cached_dbent;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one table or NULL. NULL doesn't mean
+ *    the activity statistics for one table or NULL. NULL doesn't mean
  *    that the table doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    activity statistics facility, so the caller is better off reporting
+ *    ZERO instead.
  * ----------
  */
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
-    PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_extended(false, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
-
-    return NULL;
+    tabentry = pgstat_fetch_stat_tabentry_extended(true, relid);
+    return tabentry;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_tabentry_extended() -
+ *
+ *    Find the stats entry for a table. Must be called from a backend. The
+ *    returned entry is stored in static memory, so its content is valid only
+ *    until the next call of this function for a different table.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_extended(bool shared, Oid reloid)
+{
+    PgStat_StatTabEntry *shent;
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for InvalidOid */
+    Assert(reloid != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_tabent_key.databaseid == dboid &&
+        cached_tabent_key.objectid == reloid)
+        return &cached_tabent;
+
+    shent = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, reloid, true, false, NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_tabent, shent, sizeof(PgStat_StatTabEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_tabent_key.databaseid = dboid;
+    cached_tabent_key.objectid = reloid;
+
+    return &cached_tabent;
+}
+
+
+/* ----------
+ * pgstat_copy_index_counters() -
+ *
+ *    Support function for index swapping. Copy a subset of the relation's
+ *    counters to the specified destination.
+ * ----------
+ */
+void
+pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst)
+{
+    PgStat_StatTabEntry *tabentry;
+
+    /* No point fetching tabentry when dst is NULL */
+    if (!dst)
+        return;
+
+    tabentry = pgstat_fetch_stat_tabentry(relid);
+
+    if (!tabentry)
+        return;
+
+    dst->t_counts.t_numscans = tabentry->numscans;
+    dst->t_counts.t_tuples_returned = tabentry->tuples_returned;
+    dst->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
+    dst->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
+    dst->t_counts.t_blocks_hit = tabentry->blocks_hit;
 }
 
 
@@ -2672,30 +3293,46 @@ pgstat_fetch_stat_tabentry(Oid relid)
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
  *    the collected statistics for one function or NULL.
+ *
+ *    The returned entry is stored in static memory, so its content is valid
+ *    only until the next call of this function for a different function id.
  * ----------
  */
 PgStat_StatFuncEntry *
 pgstat_fetch_stat_funcentry(Oid func_id)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry = NULL;
-
-    /* load the stats file if needed */
-    backend_read_statsfile();
-
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
-
-    return funcentry;
+    PgStat_StatFuncEntry *shent;
+    Oid    dboid = MyDatabaseId;
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for InvalidOid */
+    Assert(func_id != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_funcent_key.databaseid == dboid &&
+        cached_funcent_key.objectid == func_id)
+        return &cached_funcent;
+
+    shent = (PgStat_StatFuncEntry *)
+        get_stat_entry(PGSTAT_TYPE_FUNCTION, dboid, func_id, true, false,
+                       NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_funcent, shent, sizeof(PgStat_StatFuncEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_funcent_key.databaseid = dboid;
+    cached_funcent_key.objectid = func_id;
+
+    return &cached_funcent;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_beentry() -
  *
@@ -2755,53 +3392,160 @@ pgstat_fetch_stat_numbackends(void)
     return localNumBackends;
 }
 
+/*
+ * ---------
+ * pgstat_get_stat_timestamp() -
+ *
+ *    Returns the last update timestamp of global statistics.
+ */
+TimestampTz
+pgstat_get_stat_timestamp(void)
+{
+    return (TimestampTz) pg_atomic_read_u64(&StatsShmem->stats_timestamp);
+}
+
 /*
  * ---------
  * pgstat_fetch_stat_archiver() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the archiver statistics struct.
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_ArchiverStats *
+PgStat_Archiver *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    PgStat_Archiver reset;
+    PgStat_Archiver *reset_shared = &StatsShmem->archiver_reset_offset;
+    PgStat_Archiver *shared = &StatsShmem->archiver_stats;
+    PgStat_Archiver *cached = &cached_archiverstats;
 
-    return &archiverStats;
+    pgstat_copy_global_stats(cached, shared, sizeof(PgStat_Archiver),
+                             &StatsShmem->archiver_changecount);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&reset, reset_shared, sizeof(PgStat_Archiver));
+    LWLockRelease(StatsLock);
+
+    /* compensate by reset offsets */
+    if (cached->archived_count == reset.archived_count)
+    {
+        cached->last_archived_wal[0] = 0;
+        cached->last_archived_timestamp = 0;
+    }
+    cached->archived_count -= reset.archived_count;
+
+    if (cached->failed_count == reset.failed_count)
+    {
+        cached->last_failed_wal[0] = 0;
+        cached->last_failed_timestamp = 0;
+    }
+    cached->failed_count -= reset.failed_count;
+
+    cached->stat_reset_timestamp = reset.stat_reset_timestamp;
+
+    cached_archiverstats_is_valid = true;
+
+    return &cached_archiverstats;
 }
 
 
 /*
  * ---------
- * pgstat_fetch_global() -
+ * pgstat_fetch_stat_bgwriter() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the global statistics struct.
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_GlobalStats *
-pgstat_fetch_global(void)
+PgStat_BgWriter *
+pgstat_fetch_stat_bgwriter(void)
 {
-    backend_read_statsfile();
+    PgStat_BgWriter reset;
+    PgStat_BgWriter *reset_shared = &StatsShmem->bgwriter_reset_offset;
+    PgStat_BgWriter *shared = &StatsShmem->bgwriter_stats;
+    PgStat_BgWriter *cached = &cached_bgwriterstats;
+
+    pgstat_copy_global_stats(cached, shared, sizeof(PgStat_BgWriter),
+                             &StatsShmem->bgwriter_changecount);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&reset, reset_shared, sizeof(PgStat_BgWriter));
+    LWLockRelease(StatsLock);
+
+    /* compensate by reset offsets */
+    cached->buf_written_clean -= reset.buf_written_clean;
+    cached->maxwritten_clean  -= reset.maxwritten_clean;
+    cached->buf_alloc          -= reset.buf_alloc;
+    cached->stat_reset_timestamp = reset.stat_reset_timestamp;
+
+    cached_bgwriterstats_is_valid = true;
+
+    return &cached_bgwriterstats;
+}
+
+/*
+ * ---------
+ * pgstat_fetch_stat_checkpointer() -
+ *
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
+ * ---------
+ */
+PgStat_CheckPointer *
+pgstat_fetch_stat_checkpointer(void)
+{
+    PgStat_CheckPointer reset;
+    PgStat_CheckPointer *reset_shared = &StatsShmem->checkpointer_reset_offset;
+    PgStat_CheckPointer *shared = &StatsShmem->checkpointer_stats;
+    PgStat_CheckPointer *cached = &cached_checkpointerstats;
+
+    pgstat_copy_global_stats(cached, shared, sizeof(PgStat_CheckPointer),
+                             &StatsShmem->checkpointer_changecount);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&reset, reset_shared, sizeof(PgStat_CheckPointer));
+    LWLockRelease(StatsLock);
+
+    /* compensate by reset offsets */
+    cached->timed_checkpoints       -= reset.timed_checkpoints;
+    cached->requested_checkpoints   -= reset.requested_checkpoints;
+    cached->buf_written_checkpoints -= reset.buf_written_checkpoints;
+    cached->buf_written_backend     -= reset.buf_written_backend;
+    cached->buf_fsync_backend       -= reset.buf_fsync_backend;
+    cached->checkpoint_write_time   -= reset.checkpoint_write_time;
+    cached->checkpoint_sync_time    -= reset.checkpoint_sync_time;
+
+    cached_checkpointerstats_is_valid = true;
 
-    return &globalStats;
+    return &cached_checkpointerstats;
 }
 
 /*
  * ---------
  * pgstat_fetch_stat_wal() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the WAL statistics struct.
+ *    Support function for the SQL-callable pgstat* functions. The returned entry
+ *  is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_WalStats *
+PgStat_Wal *
 pgstat_fetch_stat_wal(void)
 {
-    backend_read_statsfile();
+    if (!cached_walstats_is_valid)
+    {
+        LWLockAcquire(StatsLock, LW_SHARED);
+        memcpy(&cached_walstats, &StatsShmem->wal_stats, sizeof(PgStat_Wal));
+        LWLockRelease(StatsLock);
+    }
 
-    return &walStats;
+    cached_walstats_is_valid = true;
+
+    return &cached_walstats;
 }
 
 /*
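Review note on the reset-offset handling in pgstat_fetch_stat_archiver(), pgstat_fetch_stat_bgwriter() and pgstat_fetch_stat_checkpointer() in the hunk above: as far as I can tell, the shared counters are cumulative and never zeroed, and readers subtract an offset recorded at reset time. The sketch below shows that idea in isolation. The struct and function names are hypothetical, locking and changecount handling are left out, and the reset side is an assumption because it is not part of this hunk.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for PgStat_BgWriter; not the patch's struct. */
typedef struct OffsetCounters
{
    uint64_t    buf_written_clean;
    uint64_t    buf_alloc;
} OffsetCounters;

/* Cumulative values: only ever incremented, never zeroed. */
static OffsetCounters offset_shared;

/* Snapshot of the cumulative values taken at the last reset. */
static OffsetCounters offset_reset;

/* Writer side: accumulate local counts into the cumulative counters. */
static void
offset_report(uint64_t written, uint64_t alloc)
{
    offset_shared.buf_written_clean += written;
    offset_shared.buf_alloc += alloc;
}

/* ASSUMPTION: a reset only remembers the current cumulative values. */
static void
offset_do_reset(void)
{
    offset_reset = offset_shared;
}

/* Reader side: copy first, then subtract the offsets from the copy. */
static OffsetCounters
offset_fetch(void)
{
    OffsetCounters cached = offset_shared;

    /* Cumulative values can never fall behind the recorded offsets. */
    assert(cached.buf_written_clean >= offset_reset.buf_written_clean);
    assert(cached.buf_alloc >= offset_reset.buf_alloc);

    cached.buf_written_clean -= offset_reset.buf_written_clean;
    cached.buf_alloc -= offset_reset.buf_alloc;
    return cached;
}
```

With this layout the writer never has to coordinate with a reset; only the reader and the resetter touch the offset, which seems to be why the patch protects the offsets with StatsLock and the counters with a changecount.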
@@ -2815,9 +3559,27 @@ pgstat_fetch_stat_wal(void)
 PgStat_SLRUStats *
 pgstat_fetch_slru(void)
 {
-    backend_read_statsfile();
+    size_t size = sizeof(PgStat_SLRUStats) * SLRU_NUM_ELEMENTS;
 
-    return slruStats;
+    for (;;)
+    {
+        uint32 before_count;
+        uint32 after_count;
+
+        pg_read_barrier();
+        before_count = pg_atomic_read_u32(&StatsShmem->slru_changecount);
+        memcpy(&cached_slrustats, &StatsShmem->slru_stats, size);
+        after_count = pg_atomic_read_u32(&StatsShmem->slru_changecount);
+
+        if (before_count == after_count && (before_count & 1) == 0)
+            break;
+
+        CHECK_FOR_INTERRUPTS();
+    }
+
+    cached_slrustats_is_valid = true;
+
+    return &cached_slrustats;
 }
 
 /*
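Review note on the changecount loop in pgstat_fetch_slru() above, which pairs with the writers such as pgstat_report_archiver(): the writer increments the counter before and after touching the data, and a reader retries until it sees the same even value on both sides of its copy. Below is a minimal single-writer sketch of that protocol using C11 atomics instead of the pg_atomic_* API. All names are hypothetical, and the plain (non-atomic) copy of the payload is the usual seqlock simplification, not something the C standard blesses under concurrent writes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a fixed-size stats struct in shared memory. */
typedef struct SeqStats
{
    uint64_t    archived_count;
    uint64_t    failed_count;
} SeqStats;

typedef struct SeqShared
{
    atomic_uint changecount;    /* odd while a write is in progress */
    SeqStats    stats;
} SeqShared;

/* Single writer: bump the counter before and after modifying the data. */
static void
seq_report(SeqShared *shm, int failed)
{
    unsigned    before = atomic_fetch_add(&shm->changecount, 1);

    assert((before & 1) == 0);  /* nobody else may be writing */

    if (failed)
        shm->stats.failed_count++;
    else
        shm->stats.archived_count++;

    atomic_fetch_add(&shm->changecount, 1);
}

/* Reader: retry until the counter is even and unchanged across the copy. */
static void
seq_fetch(SeqShared *shm, SeqStats *dst)
{
    for (;;)
    {
        unsigned    before = atomic_load(&shm->changecount);
        unsigned    after;

        memcpy(dst, &shm->stats, sizeof(SeqStats));

        /* Keep the copy from being reordered past the second read. */
        atomic_thread_fence(memory_order_acquire);
        after = atomic_load(&shm->changecount);

        if (before == after && (before & 1) == 0)
            break;
    }
}
```

Note that in this sketch the fence sits between the copy and the second counter read, whereas the loop in the patch issues pg_read_barrier() before the first read; it may be worth double-checking the barrier placement there.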
@@ -2829,13 +3591,41 @@ pgstat_fetch_slru(void)
  *    number of entries in nslots_p.
  * ---------
  */
-PgStat_ReplSlotStats *
+PgStat_ReplSlot *
 pgstat_fetch_replslot(int *nslots_p)
 {
-    backend_read_statsfile();
 
-    *nslots_p = nReplSlotStats;
-    return replSlotStats;
+    if (cached_replslotstats == NULL)
+    {
+        cached_replslotstats = (PgStat_ReplSlot *)
+            MemoryContextAlloc(pgStatCacheContext,
+                               sizeof(PgStat_ReplSlot) * max_replication_slots);
+    }
+
+    if (n_cached_replslotstats < 0)
+    {
+        int n = 0;
+        int i;
+
+        for (i = 0 ; i < max_replication_slots ; i++)
+        {
+            PgStat_ReplSlot *shent = (PgStat_ReplSlot *)
+                get_stat_entry(PGSTAT_TYPE_REPLSLOT, MyDatabaseId, i,
+                               false, false, NULL);
+            if (shent && !shent->header.dropped)
+            {
+                memcpy(cached_replslotstats[n++].slotname,
+                       shent->slotname,
+                       sizeof(PgStat_ReplSlot) -
+                       offsetof(PgStat_ReplSlot, slotname));
+            }
+        }
+
+        n_cached_replslotstats = n;
+    }
+
+    *nslots_p = n_cached_replslotstats;
+    return cached_replslotstats;
 }
 
 /* ------------------------------------------------------------
@@ -3048,8 +3838,8 @@ pgstat_initialize(void)
         MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
     }
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* must be called before DSM shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -3225,12 +4015,15 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+    /* attach shared database stats area */
+    attach_shared_stats();
 }
 
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
- * Flush any remaining statistics counts out to the collector.
+ * Flush any remaining statistics counts out to shared stats.
  * Without this, operations triggered during backend exit (such as
  * temp table deletions) won't be counted.
  *
@@ -3243,13 +4036,22 @@ pgstat_beshutdown_hook(int code, Datum arg)
 
     /*
      * If we got as far as discovering our own database ID, we can report what
-     * we did to the collector.  Otherwise, we'd be sending an invalid
+     * we did to the shared stats.  Otherwise, we'd be sending an invalid
      * database ID, so forget it.  (This means that accesses to pg_database
      * during failed backend starts might never get counted.)
      */
     if (OidIsValid(MyDatabaseId))
         pgstat_report_stat(true);
 
+    /*
+     * We need to clean up temporary slots before detaching shared statistics
+     * so that the statistics for temporary slots are properly removed.
+     */
+    if (MyReplicationSlot != NULL)
+        ReplicationSlotRelease();
+
+    ReplicationSlotCleanup();
+
     /*
      * Clear my status entry, following the protocol of bumping st_changecount
      * before and after.  We use a volatile pointer here to ensure the
@@ -3260,6 +4062,10 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    cleanup_dropped_stats_entries();
+
+    detach_shared_stats(true);
 }
 
 
@@ -3520,7 +4326,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3815,8 +4622,8 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
+        case WAIT_EVENT_READING_STATS_FILE:
+            event_name = "ReadingStatsFile";
             break;
         case WAIT_EVENT_RECOVERY_WAL_STREAM:
             event_name = "RecoveryWalStream";
@@ -4470,94 +5277,80 @@ pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
 
 
 /* ----------
- * pgstat_setheader() -
+ * pgstat_report_archiver() -
  *
- *        Set common header fields in a statistics message
+ *        Report archiver statistics
  * ----------
  */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
+void
+pgstat_report_archiver(const char *xlog, bool failed)
 {
-    int            rc;
+    TimestampTz now = GetCurrentTimestamp();
+    uint32    before_count PG_USED_FOR_ASSERTS_ONLY;
+    uint32    after_count PG_USED_FOR_ASSERTS_ONLY;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
 
-    ((PgStat_MsgHdr *) msg)->m_size = len;
+    START_CRIT_SECTION();
+    before_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->archiver_changecount, 1);
+    Assert((before_count & 1) == 0);
 
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
+    if (failed)
     {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
- *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
- * ----------
- */
-void
-pgstat_send_archiver(const char *xlog, bool failed)
-{
-    PgStat_MsgArchiver msg;
+        ++StatsShmem->archiver_stats.failed_count;
+        memcpy(&StatsShmem->archiver_stats.last_failed_wal, xlog,
+               sizeof(StatsShmem->archiver_stats.last_failed_wal));
+        StatsShmem->archiver_stats.last_failed_timestamp = now;
+    }
+    else
+    {
+        ++StatsShmem->archiver_stats.archived_count;
+        memcpy(&StatsShmem->archiver_stats.last_archived_wal, xlog,
+               sizeof(StatsShmem->archiver_stats.last_archived_wal));
+        StatsShmem->archiver_stats.last_archived_timestamp = now;
+    }
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    strlcpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+    after_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->archiver_changecount, 1);
+    Assert(after_count == before_count + 1);
+    END_CRIT_SECTION();
 }
 
 /* ----------
- * pgstat_send_bgwriter() -
+ * pgstat_report_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
-pgstat_send_bgwriter(void)
+pgstat_report_bgwriter(void)
 {
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+    PgStat_BgWriter *s = &StatsShmem->bgwriter_stats;
+    PgStat_BgWriter *l = &BgWriterStats;
+    uint32    before_count PG_USED_FOR_ASSERTS_ONLY;
+    uint32    after_count PG_USED_FOR_ASSERTS_ONLY;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid touching shared memory for completely empty stats.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    START_CRIT_SECTION();
+    before_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->bgwriter_changecount, 1);
+    Assert((before_count & 1) == 0);
+
+    s->buf_written_clean += l->buf_written_clean;
+    s->maxwritten_clean += l->maxwritten_clean;
+    s->buf_alloc += l->buf_alloc;
+
+    after_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->bgwriter_changecount, 1);
+    Assert(after_count == before_count + 1);
+    END_CRIT_SECTION();
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4566,519 +5359,134 @@ pgstat_send_bgwriter(void)
 }
 
 /* ----------
- * pgstat_send_wal() -
+ * pgstat_report_checkpointer() -
  *
- *        Send WAL statistics to the collector
+ *        Report checkpointer statistics
  * ----------
  */
 void
-pgstat_send_wal(void)
+pgstat_report_checkpointer(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgWal all_zeroes;
+    static const PgStat_CheckPointer all_zeroes;
+    PgStat_CheckPointer *s = &StatsShmem->checkpointer_stats;
+    PgStat_CheckPointer *l = &CheckPointerStats;
+    uint32    before_count PG_USED_FOR_ASSERTS_ONLY;
+    uint32    after_count PG_USED_FOR_ASSERTS_ONLY;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid touching shared memory for completely empty stats.
      */
-    if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
+    if (memcmp(&CheckPointerStats, &all_zeroes, sizeof(PgStat_CheckPointer)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&WalStats.m_hdr, PGSTAT_MTYPE_WAL);
-    pgstat_send(&WalStats, sizeof(WalStats));
-
+    START_CRIT_SECTION();
+    before_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->checkpointer_changecount, 1);
+    Assert((before_count & 1) == 0);
+
+    s->timed_checkpoints += l->timed_checkpoints;
+    s->requested_checkpoints += l->requested_checkpoints;
+    s->checkpoint_write_time += l->checkpoint_write_time;
+    s->checkpoint_sync_time += l->checkpoint_sync_time;
+    s->buf_written_checkpoints += l->buf_written_checkpoints;
+    s->buf_written_backend += l->buf_written_backend;
+    s->buf_fsync_backend += l->buf_fsync_backend;
+
+    after_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->checkpointer_changecount, 1);
+    Assert(after_count == before_count + 1);
+    END_CRIT_SECTION();
     /*
      * Clear out the statistics buffer, so it can be re-used.
      */
-    MemSet(&WalStats, 0, sizeof(WalStats));
+    MemSet(&CheckPointerStats, 0, sizeof(CheckPointerStats));
 }
 
 /* ----------
- * pgstat_send_slru() -
+ * pgstat_report_wal() -
  *
- *        Send SLRU statistics to the collector
+ *        Report WAL statistics
  * ----------
  */
-static void
-pgstat_send_slru(void)
+void
+pgstat_report_wal(void)
 {
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgSLRU all_zeroes;
-
-    for (int i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /*
-         * This function can be called even if nothing at all has happened. In
-         * this case, avoid sending a completely empty message to the stats
-         * collector.
-         */
-        if (memcmp(&SLRUStats[i], &all_zeroes, sizeof(PgStat_MsgSLRU)) == 0)
-            continue;
-
-        /* set the SLRU type before each send */
-        SLRUStats[i].m_index = i;
-
-        /*
-         * Prepare and send the message
-         */
-        pgstat_setheader(&SLRUStats[i].m_hdr, PGSTAT_MTYPE_SLRU);
-        pgstat_send(&SLRUStats[i], sizeof(PgStat_MsgSLRU));
-
-        /*
-         * Clear out the statistics buffer, so it can be re-used.
-         */
-        MemSet(&SLRUStats[i], 0, sizeof(PgStat_MsgSLRU));
-    }
+    flush_walstat(false);
 }
 
-
 /* ----------
- * PgstatCollectorMain() -
+ * get_local_dbstat_entry() -
  *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-    WaitEvent    event;
-    WaitEventSet *wes;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, SignalHandlerForConfigReload);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    MyBackendType = B_STATS_COLLECTOR;
-    init_ps_display(NULL);
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /* Prepare to wait for our latch or data in our socket. */
-    wes = CreateWaitEventSet(CurrentMemoryContext, 3);
-    AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
-    AddWaitEventToSet(wes, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL, NULL);
-    AddWaitEventToSet(wes, WL_SOCKET_READABLE, pgStatSock, NULL, NULL);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do check ConfigReloadPending
-     * inside the inner loop, which means that such interrupts will get
-     * serviced but the latch won't get cleared until next time there is a
-     * break in the action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (ShutdownRequestPending)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * ShutdownRequestPending becomes set.
-         */
-        while (!ShutdownRequestPending)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (ConfigReloadPending)
-            {
-                ConfigReloadPending = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSLRUCOUNTER:
-                    pgstat_recv_resetslrucounter(&msg.msg_resetslrucounter,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETREPLSLOTCOUNTER:
-                    pgstat_recv_resetreplslotcounter(&msg.msg_resetreplslotcounter,
-                                                     len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_WAL:
-                    pgstat_recv_wal(&msg.msg_wal, len);
-                    break;
-
-                case PGSTAT_MTYPE_SLRU:
-                    pgstat_recv_slru(&msg.msg_slru, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_REPLSLOT:
-                    pgstat_recv_replslot(&msg.msg_replslot, len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitEventSetWait(wes, -1L, &event, 1, WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitEventSetWait(wes, 2 * 1000L /* msec */ , &event, 1,
-                              WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr == 1 && event.events == WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    FreeWaitEventSet(wes);
-
-    exit(0);
-}
-
-/*
- * Subroutine to clear stats in a database entry
- *
- * Tables and functions hashes are initialized to empty.
- */
-static void
-reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
-{
-    HASHCTL        hash_ctl;
-
-    dbentry->n_xact_commit = 0;
-    dbentry->n_xact_rollback = 0;
-    dbentry->n_blocks_fetched = 0;
-    dbentry->n_blocks_hit = 0;
-    dbentry->n_tuples_returned = 0;
-    dbentry->n_tuples_fetched = 0;
-    dbentry->n_tuples_inserted = 0;
-    dbentry->n_tuples_updated = 0;
-    dbentry->n_tuples_deleted = 0;
-    dbentry->last_autovac_time = 0;
-    dbentry->n_conflict_tablespace = 0;
-    dbentry->n_conflict_lock = 0;
-    dbentry->n_conflict_snapshot = 0;
-    dbentry->n_conflict_bufferpin = 0;
-    dbentry->n_conflict_startup_deadlock = 0;
-    dbentry->n_temp_files = 0;
-    dbentry->n_temp_bytes = 0;
-    dbentry->n_deadlocks = 0;
-    dbentry->n_checksum_failures = 0;
-    dbentry->last_checksum_failure = 0;
-    dbentry->n_block_read_time = 0;
-    dbentry->n_block_write_time = 0;
-
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
-
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
-}
-
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+ *  Find or create a local PgStat_StatDBEntry entry for dbid.  A new entry is
+ *  created and initialized if none exists.
  */
 static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
+get_local_dbstat_entry(Oid dbid)
 {
-    PgStat_StatDBEntry *result;
+    PgStat_StatDBEntry *dbentry;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
-        return NULL;
 
     /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
+     * Find an entry or create a new one.
      */
-    if (!found)
-        reset_dbentry_counters(result);
+    dbentry = (PgStat_StatDBEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid,
+                             true, &found);
 
-    return result;
+    return dbentry;
 }
 
-
-/*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+/* ----------
+ * get_local_tabstat_entry() -
+ *  Find or create a PgStat_TableStatus entry for rel. A new entry is created
+ *  and initialized if none exists.
+ * ----------
  */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+static PgStat_TableStatus *
+get_local_tabstat_entry(Oid rel_id, bool isshared)
 {
-    PgStat_StatTabEntry *result;
+    PgStat_TableStatus *tabentry;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
 
-    /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
+    tabentry = (PgStat_TableStatus *)
+        get_local_stat_entry(PGSTAT_TYPE_TABLE,
+                             isshared ? InvalidOid : MyDatabaseId,
+                             rel_id, true, &found);
 
-    if (!create && !found)
-        return NULL;
-
-    /* If not found, initialize the new one. */
-    if (!found)
-    {
-        result->numscans = 0;
-        result->tuples_returned = 0;
-        result->tuples_fetched = 0;
-        result->tuples_inserted = 0;
-        result->tuples_updated = 0;
-        result->tuples_deleted = 0;
-        result->tuples_hot_updated = 0;
-        result->n_live_tuples = 0;
-        result->n_dead_tuples = 0;
-        result->changes_since_analyze = 0;
-        result->inserts_since_vacuum = 0;
-        result->blocks_fetched = 0;
-        result->blocks_hit = 0;
-        result->vacuum_timestamp = 0;
-        result->vacuum_count = 0;
-        result->autovac_vacuum_timestamp = 0;
-        result->autovac_vacuum_count = 0;
-        result->analyze_timestamp = 0;
-        result->analyze_count = 0;
-        result->autovac_analyze_timestamp = 0;
-        result->autovac_analyze_count = 0;
-    }
-
-    return result;
+    return tabentry;
 }
 
-
 /* ----------
- * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
+ * pgstat_write_statsfile() -
+ *        Write the global statistics file, as well as DB files.
  *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ * This function is called by the last process that accesses the shared
+ * stats, so locking is not required.
  * ----------
  */
 static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+pgstat_write_statsfile(void)
 {
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
-    int            i;
+    dshash_seq_status hstat;
+    PgStatHashEntry *ps;
+
+    /* Stats are not initialized yet; just return. */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
+
+    /* this is the last process that is accessing the shared stats */
+#ifdef USE_ASSERT_CHECKING
+    LWLockAcquire(StatsLock, LW_SHARED);
+    Assert(StatsShmem->refcount == 0);
+    LWLockRelease(StatsLock);
+#endif
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -5098,7 +5506,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    pg_atomic_write_u64(&StatsShmem->stats_timestamp, GetCurrentTimestamp());
 
     /*
      * Write the file header --- currently just a format ID.
@@ -5108,200 +5516,72 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Write global stats struct
+     * Write bgwriter global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(&StatsShmem->bgwriter_stats, sizeof(PgStat_BgWriter), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Write archiver stats struct
+     * Write checkpointer global stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+    rc = fwrite(&StatsShmem->checkpointer_stats, sizeof(PgStat_CheckPointer), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Write WAL stats struct
+     * Write archiver global stats struct
      */
-    rc = fwrite(&walStats, sizeof(walStats), 1, fpout);
+    rc = fwrite(&StatsShmem->archiver_stats, sizeof(PgStat_Archiver), 1,
+                fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write WAL global stats struct
+     */
+    rc = fwrite(&StatsShmem->wal_stats, sizeof(PgStat_Wal), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write SLRU stats struct
      */
-    rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
+    rc = fwrite(&StatsShmem->slru_stats, sizeof(PgStatSharedSLRUStats), 1,
+                fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Walk through the database table.
+     * Walk through the stats entries
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((ps = dshash_seq_next(&hstat)) != NULL)
     {
-        /*
-         * Write out the table and function stats for this DB into the
-         * appropriate per-DB stat file, if required.
-         */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
+        PgStat_StatEntryHeader *shent;
+        size_t                    len;
+
+        CHECK_FOR_INTERRUPTS();
+
+        shent = (PgStat_StatEntryHeader *)dsa_get_address(area, ps->body);
+
+        /* we may have some "dropped" entries not yet removed, skip them */
+        if (shent->dropped)
+            continue;
+
+        /* Make DB's timestamp consistent with the global stats */
+        if (ps->key.type == PGSTAT_TYPE_DB)
         {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+            PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) shent;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
+            dbentry->stats_timestamp =
+                (TimestampTz) pg_atomic_read_u64(&StatsShmem->stats_timestamp);
         }
 
-        /*
-         * Write out the DB entry. We don't write the tables or functions
-         * pointers, since they're of no use to any other process.
-         */
-        fputc('D', fpout);
-        rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * Write replication slot stats struct
-     */
-    for (i = 0; i < nReplSlotStats; i++)
-    {
-        fputc('R', fpout);
-        rc = fwrite(&replSlotStats[i], sizeof(PgStat_ReplSlotStats), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
-}
-
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
-
-/* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
- * ----------
- */
-static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
-{
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpout;
-    int32        format_id;
-    Oid            dbid = dbentry->databaseid;
-    int            rc;
-    char        tmpfile[MAXPGPATH];
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database's access stats per table.
-     */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
-    {
-        fputc('T', fpout);
-        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
+        fputc('S', fpout);
+        rc = fwrite(&ps->key, sizeof(PgStatHashKey), 1, fpout);
 
-    /*
-     * Walk through the database's function stats table.
-     */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+        /* Write the entry body, excluding the header part */
+        len = PGSTAT_SHENT_BODY_LEN(ps->key.type);
+        rc = fwrite(PGSTAT_SHENT_BODY(shent), len, 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_seq_term(&hstat);
 
     /*
      * No more output to be done. Close the temp file and replace the old
@@ -5335,114 +5615,63 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
 }
 
 /* ----------
- * pgstat_read_statsfiles() -
- *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ * pgstat_read_statsfile() -
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
+ *    Reads the existing activity statistics file into the shared stats hash.
  *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
+ * This function is called by the only process that accesses the shared
+ * stats, so locking is not required.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+static void
+pgstat_read_statsfile(void)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-    int            i;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    char        tag;
 
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
+    /* shouldn't be called from postmaster */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    /* this is the only process that is accessing the shared stats */
+#ifdef USE_ASSERT_CHECKING
+    LWLockAcquire(StatsLock, LW_SHARED);
+    Assert(StatsShmem->refcount == 1);
+    LWLockRelease(StatsLock);
+#endif
 
-    /* Allocate the space for replication slot statistics */
-    replSlotStats = palloc0(max_replication_slots * sizeof(PgStat_ReplSlotStats));
-    nReplSlotStats = 0;
-
-    /*
-     * Clear out global, archiver, WAL and SLRU statistics so they start from
-     * zero in case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-    memset(&walStats, 0, sizeof(walStats));
-    memset(&slruStats, 0, sizeof(slruStats));
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-    walStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Set the same reset timestamp for all SLRU items too.
-     */
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-        slruStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Set the same reset timestamp for all replication slots too.
-     */
-    for (i = 0; i < max_replication_slots; i++)
-        replSlotStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    StatsShmem->bgwriter_stats.stat_reset_timestamp = GetCurrentTimestamp();
+    StatsShmem->archiver_stats.stat_reset_timestamp =
+        StatsShmem->bgwriter_stats.stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the stats file has not yet been written for
+     * the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -5451,681 +5680,150 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
-        goto done;
-    }
-
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        goto done;
-    }
-
-    /*
-     * Read WAL stats struct
-     */
-    if (fread(&walStats, 1, sizeof(walStats), fpin) != sizeof(walStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&walStats, 0, sizeof(walStats));
         goto done;
     }
 
     /*
-     * Read SLRU stats struct
+     * Read bgwriter stats struct
      */
-    if (fread(slruStats, 1, sizeof(slruStats), fpin) != sizeof(slruStats))
+    if (fread(&StatsShmem->bgwriter_stats, 1, sizeof(PgStat_BgWriter), fpin) !=
+        sizeof(PgStat_BgWriter))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&slruStats, 0, sizeof(slruStats));
+        MemSet(&StatsShmem->bgwriter_stats, 0, sizeof(PgStat_BgWriter));
         goto done;
     }
 
     /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Add to the DB hash
-                 */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
-                break;
-
-                /*
-                 * 'R'    A PgStat_ReplSlotStats struct describing a replication
-                 * slot follows.
-                 */
-            case 'R':
-                if (fread(&replSlotStats[nReplSlotStats], 1, sizeof(PgStat_ReplSlotStats), fpin)
-                    != sizeof(PgStat_ReplSlotStats))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    memset(&replSlotStats[nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
-                    goto done;
-                }
-                nReplSlotStats++;
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-
-    return dbhash;
-}
-
-
-/* ----------
- * pgstat_read_db_statsfile() -
- *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
- * ----------
- */
-static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
-{
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatTabEntry tabbuf;
-    PgStat_StatFuncEntry funcbuf;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return;
-    }
-
-    /*
-     * Verify it's of the expected format.
+     * Read checkpointer stats struct
      */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
+    if (fread(&StatsShmem->checkpointer_stats, 1, sizeof(PgStat_CheckPointer), fpin) !=
+        sizeof(PgStat_CheckPointer))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
+        MemSet(&StatsShmem->checkpointer_stats, 0, sizeof(PgStat_CheckPointer));
         goto done;
     }
 
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'T'    A PgStat_StatTabEntry follows.
-                 */
-            case 'T':
-                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
-                          fpin) != sizeof(PgStat_StatTabEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
-                break;
-
-                /*
-                 * 'F'    A PgStat_StatFuncEntry follows.
-                 */
-            case 'F':
-                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
-                          fpin) != sizeof(PgStat_StatFuncEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
-                break;
-
-                /*
-                 * 'E'    The EOF marker of a complete stats file.
-                 */
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts. The caller must
- *    rely on timestamp stored in *ts iff the function returns true.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    PgStat_WalStats myWalStats;
-    PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
-    PgStat_ReplSlotStats myReplSlotStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
     /*
      * Read archiver stats struct
      */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
+    if (fread(&StatsShmem->archiver_stats, 1, sizeof(PgStat_Archiver),
+              fpin) != sizeof(PgStat_Archiver))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
+        MemSet(&StatsShmem->archiver_stats, 0, sizeof(PgStat_Archiver));
+        goto done;
     }
 
     /*
      * Read WAL stats struct
      */
-    if (fread(&myWalStats, 1, sizeof(myWalStats), fpin) != sizeof(myWalStats))
+    if (fread(&StatsShmem->wal_stats, 1, sizeof(PgStat_Wal), fpin)
+        != sizeof(PgStat_Wal))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
+        MemSet(&StatsShmem->wal_stats, 0, sizeof(PgStat_Wal));
+        goto done;
     }
 
     /*
      * Read SLRU stats struct
      */
-    if (fread(mySLRUStats, 1, sizeof(mySLRUStats), fpin) != sizeof(mySLRUStats))
+    if (fread(&StatsShmem->slru_stats, 1, sizeof(PgStatSharedSLRUStats),
+              fpin) != sizeof(PgStatSharedSLRUStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-                /*
-                 * 'R'    A PgStat_ReplSlotStats struct describing a replication
-                 * slot follows.
-                 */
-            case 'R':
-                if (fread(&myReplSlotStats, 1, sizeof(PgStat_ReplSlotStats), fpin)
-                    != sizeof(PgStat_ReplSlotStats))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-        }
+        goto done;
     }
 
-done:
-    FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
     /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
+    while ((tag = fgetc(fpin)) == 'S')
     {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
+        PgStatHashKey        key;
+        PgStat_StatEntryHeader *p;
+        size_t                len;
 
         CHECK_FOR_INTERRUPTS();
 
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
+        if (fread(&key, 1, sizeof(key), fpin) != sizeof(key))
         {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"", statfile)));
+            goto done;
         }
 
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
+        p = get_stat_entry(key.type, key.databaseid, key.objectid,
+                              false, true, &found);
 
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
+        /* don't allow duplicate entries */
+        if (found)
+        {
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"",
+                            statfile)));
+            goto done;
         }
 
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
+        /* Avoid overwriting header part */
+        len = PGSTAT_SHENT_BODY_LEN(key.type);
 
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
+        if (fread(PGSTAT_SHENT_BODY(p), 1, len, fpin) != len)
+        {
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"", statfile)));
+            goto done;
+        }
     }
 
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
+    if (tag != 'E')
+    {
         ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
+                (errmsg("corrupted statistics file \"%s\"",
+                        statfile)));
+        goto done;
+    }
 
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
+done:
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+
+    return;
 }
 
-
 /* ----------
  * pgstat_setup_memcxt() -
  *
- *    Create pgStatLocalContext, if not already done.
+ *    Create pgStatLocalContext if not already done.
  * ----------
  */
 static void
 pgstat_setup_memcxt(void)
 {
     if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
-}
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
 
+    if (!pgStatCacheContext)
+        pgStatCacheContext =
+            AllocSetContextCreate(CacheMemoryContext,
+                                  "Activity statistics",
+                                  ALLOCSET_SMALL_SIZES);
+}
 
 /* ----------
  * pgstat_clear_snapshot() -
@@ -6142,903 +5840,25 @@ pgstat_clear_snapshot(void)
 {
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
+    {
         MemoryContextDelete(pgStatLocalContext);
 
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-                tabentry->inserts_since_vacuum = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_WAL)
-    {
-        /* Reset the WAL statistics for the cluster. */
-        memset(&walStats, 0, sizeof(walStats));
-        walStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_resetslrucounter() -
- *
- *    Reset some SLRU statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len)
-{
-    int            i;
-    TimestampTz ts = GetCurrentTimestamp();
-
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /* reset entry with the given index, or all entries (index is -1) */
-        if ((msg->m_index == -1) || (msg->m_index == i))
-        {
-            memset(&slruStats[i], 0, sizeof(slruStats[i]));
-            slruStats[i].stat_reset_timestamp = ts;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_resetreplslotcounter() -
- *
- *    Reset some replication slot statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg,
-                                 int len)
-{
-    int            i;
-    int            idx = -1;
-    TimestampTz ts;
-
-    ts = GetCurrentTimestamp();
-    if (msg->clearall)
-    {
-        for (i = 0; i < nReplSlotStats; i++)
-            pgstat_reset_replslot(i, ts);
-    }
-    else
-    {
-        /* Get the index of replication slot statistics to reset */
-        idx = pgstat_replslot_index(msg->m_slotname, false);
-
-        /*
-         * Nothing to do if the given slot entry is not found.  This could
-         * happen when the slot with the given name is removed and the
-         * corresponding statistics entry is also removed before receiving the
-         * reset message.
-         */
-        if (idx < 0)
-            return;
-
-        /* Reset the stats for the requested replication slot */
-        pgstat_reset_replslot(idx, ts);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signaling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * It is quite possible that a non-aggressive VACUUM ended up skipping
-     * various pages, however, we'll zero the insert counter here regardless.
-     * It's currently used only to track when we need to perform an "insert"
-     * autovacuum, which are mainly intended to freeze newly inserted tuples.
-     * Zeroing this may just mean we'll not try to vacuum the table again
-     * until enough tuples have been inserted to trigger another insert
-     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
-     * stragglers.
-     */
-    tabentry->inserts_since_vacuum = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_wal() -
- *
- *    Process a WAL message.
- * ----------
- */
-static void
-pgstat_recv_wal(PgStat_MsgWal *msg, int len)
-{
-    walStats.wal_buffers_full += msg->m_wal_buffers_full;
-}
-
-/* ----------
- * pgstat_recv_slru() -
- *
- *    Process a SLRU message.
- * ----------
- */
-static void
-pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
-{
-    slruStats[msg->m_index].blocks_zeroed += msg->m_blocks_zeroed;
-    slruStats[msg->m_index].blocks_hit += msg->m_blocks_hit;
-    slruStats[msg->m_index].blocks_read += msg->m_blocks_read;
-    slruStats[msg->m_index].blocks_written += msg->m_blocks_written;
-    slruStats[msg->m_index].blocks_exists += msg->m_blocks_exists;
-    slruStats[msg->m_index].flush += msg->m_flush;
-    slruStats[msg->m_index].truncate += msg->m_truncate;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_replslot() -
- *
- *    Process a REPLSLOT message.
- * ----------
- */
-static void
-pgstat_recv_replslot(PgStat_MsgReplSlot *msg, int len)
-{
-    int            idx;
-
-    /*
-     * Get the index of replication slot statistics.  On dropping, we don't
-     * create the new statistics.
-     */
-    idx = pgstat_replslot_index(msg->m_slotname, !msg->m_drop);
-
-    /*
-     * The slot entry is not found or there is no space to accommodate the new
-     * entry.  This could happen when the message for the creation of a slot
-     * reached before the drop message even though the actual operations
-     * happen in reverse order.  In such a case, the next update of the
-     * statistics for the same slot will create the required entry.
-     */
-    if (idx < 0)
-        return;
-
-    /* it must be a valid replication slot index */
-    Assert(idx >= 0 && idx < max_replication_slots);
-
-    if (msg->m_drop)
-    {
-        /* Remove the replication slot statistics with the given name */
-        memcpy(&replSlotStats[idx], &replSlotStats[nReplSlotStats - 1],
-               sizeof(PgStat_ReplSlotStats));
-        nReplSlotStats--;
-        Assert(nReplSlotStats >= 0);
-    }
-    else
-    {
-        /* Update the replication slot statistics */
-        replSlotStats[idx].spill_txns += msg->m_spill_txns;
-        replSlotStats[idx].spill_count += msg->m_spill_count;
-        replSlotStats[idx].spill_bytes += msg->m_spill_bytes;
-        replSlotStats[idx].stream_txns += msg->m_stream_txns;
-        replSlotStats[idx].stream_count += msg->m_stream_count;
-        replSlotStats[idx].stream_bytes += msg->m_stream_bytes;
-    }
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
+    }
+
+    /* Invalidate the simple cache keys */
+    cached_dbent_key = stathashkey_zero;
+    cached_tabent_key = stathashkey_zero;
+    cached_funcent_key = stathashkey_zero;
+    cached_archiverstats_is_valid = false;
+    cached_bgwriterstats_is_valid = false;
+    cached_checkpointerstats_is_valid = false;
+    cached_walstats_is_valid = false;
+    cached_slrustats_is_valid = false;
+    n_cached_replslotstats = -1;
 }
 
 /*
@@ -7085,60 +5905,6 @@ pgstat_clip_activity(const char *raw_activity)
     return activity;
 }
 
-/* ----------
- * pgstat_replslot_index
- *
- * Return the index of entry of a replication slot with the given name, or
- * -1 if the slot is not found.
- *
- * create_it tells whether to create the new slot entry if it is not found.
- * ----------
- */
-static int
-pgstat_replslot_index(const char *name, bool create_it)
-{
-    int            i;
-
-    Assert(nReplSlotStats <= max_replication_slots);
-    for (i = 0; i < nReplSlotStats; i++)
-    {
-        if (strcmp(replSlotStats[i].slotname, name) == 0)
-            return i;            /* found */
-    }
-
-    /*
-     * The slot is not found.  We don't want to register the new statistics if
-     * the list is already full or the caller didn't request.
-     */
-    if (i == max_replication_slots || !create_it)
-        return -1;
-
-    /* Register new slot */
-    memset(&replSlotStats[nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
-    strlcpy(replSlotStats[nReplSlotStats].slotname, name, NAMEDATALEN);
-
-    return nReplSlotStats++;
-}
-
-/* ----------
- * pgstat_reset_replslot
- *
- * Reset the replication slot stats at index 'i'.
- * ----------
- */
-static void
-pgstat_reset_replslot(int i, TimestampTz ts)
-{
-    /* reset only counters. Don't clear slot name */
-    replSlotStats[i].spill_txns = 0;
-    replSlotStats[i].spill_count = 0;
-    replSlotStats[i].spill_bytes = 0;
-    replSlotStats[i].stream_txns = 0;
-    replSlotStats[i].stream_count = 0;
-    replSlotStats[i].stream_bytes = 0;
-    replSlotStats[i].stat_reset_timestamp = ts;
-}
-
 /*
  * pgstat_slru_index
  *
@@ -7183,7 +5949,7 @@ pgstat_slru_name(int slru_idx)
  * Returns pointer to entry with counters for given SLRU (based on the name
  * stored in SlruCtl as lwlock tranche name).
  */
-static inline PgStat_MsgSLRU *
+static PgStat_SLRUStats *
 slru_entry(int slru_idx)
 {
     /*
@@ -7194,7 +5960,7 @@ slru_entry(int slru_idx)
 
     Assert((slru_idx >= 0) && (slru_idx < SLRU_NUM_ELEMENTS));
 
-    return &SLRUStats[slru_idx];
+    return &local_SLRUStats[slru_idx];
 }
 
 /*
@@ -7204,41 +5970,48 @@ slru_entry(int slru_idx)
 void
 pgstat_count_slru_page_zeroed(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_zeroed += 1;
+    slru_entry(slru_idx)->blocks_zeroed += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_hit(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_hit += 1;
+    slru_entry(slru_idx)->blocks_hit += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_exists(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_exists += 1;
+    slru_entry(slru_idx)->blocks_exists += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_read(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_read += 1;
+    slru_entry(slru_idx)->blocks_read += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_written(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_written += 1;
+    slru_entry(slru_idx)->blocks_written += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_flush(int slru_idx)
 {
-    slru_entry(slru_idx)->m_flush += 1;
+    slru_entry(slru_idx)->flush += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_truncate(int slru_idx)
 {
-    slru_entry(slru_idx)->m_truncate += 1;
+    slru_entry(slru_idx)->truncate += 1;
+    have_slrustats = true;
 }
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a1dd6964e9..13be045c99 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -250,7 +250,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -511,7 +510,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1325,12 +1323,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1779,11 +1771,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2713,8 +2700,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -3041,8 +3026,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3109,13 +3092,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3188,22 +3164,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3664,22 +3624,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3899,8 +3843,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3924,8 +3866,7 @@ PostmasterStateMachine(void)
     {
         /*
          * PM_WAIT_DEAD_END state ends when the BackendList is entirely empty
-         * (ie, no dead_end children remain), and the archiver and stats
-         * collector are gone too.
+         * (ie, no dead_end children remain), and the archiver is gone too.
          *
          * The reason we wait for those two is to protect them against a new
          * postmaster starting conflicting subprocesses; this isn't an
@@ -3935,8 +3876,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4137,8 +4077,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5077,18 +5015,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5207,12 +5133,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6117,7 +6037,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6173,8 +6092,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6407,7 +6324,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index b89df01fa7..57531d7d48 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1556,8 +1556,8 @@ is_checksummed_file(const char *fullpath, const char *filename)
  *
  * If 'missing_ok' is true, will not throw an error if the file is not found.
  *
- * If dboid is anything other than InvalidOid then any checksum failures detected
- * will get reported to the stats collector.
+ * If dboid is anything other than InvalidOid then any checksum failures
+ * detected will get reported to the activity stats facility.
  *
  * Returns true if the file was successfully sent, false if 'missing_ok',
  * and the file did not exist.
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 09be1d8c48..163fc3da2a 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -692,14 +692,10 @@ ReplicationSlotDropPtr(ReplicationSlot *slot)
                 (errmsg("could not remove directory \"%s\"", tmppath)));
 
     /*
-     * Send a message to drop the replication slot to the stats collector.
-     * Since there is no guarantee of the order of message transfer on a UDP
-     * connection, it's possible that a message for creating a new slot
-     * reaches before a message for removing the old slot. We send the drop
-     * and create messages while holding ReplicationSlotAllocationLock to
-     * reduce that possibility. If the messages reached in reverse, we would
-     * lose one statistics update message. But the next update message will
-     * create the statistics for the replication slot.
+     * Drop the statistics entry for the replication slot.  Do this while
+     * holding ReplicationSlotAllocationLock so that we don't drop a statistics
+     * entry for another slot with the same name just created on another
+     * session.
      */
     if (SlotIsLogical(slot))
         pgstat_report_replslot_drop(NameStr(slot->data.name));
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ad0d1a9abc..aef6be94a3 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2047,7 +2047,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                CheckPointerStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2157,7 +2157,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2347,7 +2347,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2355,7 +2355,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
 
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 96c2aaabbd..55693cfa3a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -149,6 +149,7 @@ CreateSharedMemoryAndSemaphores(void)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -267,6 +268,7 @@ CreateSharedMemoryAndSemaphores(void)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
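Note on the ipci.c hunks above: the new facility follows the same two-phase pattern as its neighbours, first adding its requirement to the total segment size and later initializing its area. The sketch below illustrates that pattern in isolation. It is not the patch's code: StatsShared, add_size_checked, stats_shmem_size and stats_shmem_init are hypothetical stand-ins, and the overflow flag is an assumption made so the example needs no error-reporting machinery.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical shared area; the real layout is defined by the patch. */
typedef struct StatsShared
{
    int64_t     buf_alloc;
} StatsShared;

/* Add two sizes, flagging overflow instead of wrapping around silently. */
static size_t
add_size_checked(size_t a, size_t b, int *overflow)
{
    if (b > SIZE_MAX - a)
    {
        *overflow = 1;
        return SIZE_MAX;
    }
    return a + b;
}

/* Phase 1: report how many bytes this facility needs. */
static size_t
stats_shmem_size(void)
{
    return sizeof(StatsShared);
}

/* Phase 2: initialize the facility's area once the segment exists. */
static StatsShared *
stats_shmem_init(void *base)
{
    memset(base, 0, sizeof(StatsShared));
    return (StatsShared *) base;
}
```

Keeping the size request and the initialization as separate calls lets the caller allocate one segment for all facilities before any of them touches memory.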
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 2fa90cc095..17a46db74b 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -177,7 +177,9 @@ static const char *const BuiltinTrancheNames[] = {
     /* LWTRANCHE_PARALLEL_APPEND: */
     "ParallelAppend",
     /* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
-    "PerXactPredicateList"
+    "PerXactPredicateList",
+    /* LWTRANCHE_STATS: */
+    "ActivityStatistics"
 };
 
 StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index 774292fd94..91bf8d5b5d 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -53,3 +53,4 @@ XactTruncationLock                    44
 # 45 was XactTruncationLock until removal of BackendRandomLock
 WrapLimitsVacuumLock                46
 NotifyQueueTailLock                    47
+StatsLock                            48
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index dcc09df0c7..0dec4b9145 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -415,8 +415,8 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
     DropRelFileNodesAllBuffers(rnodes, nrels);
 
     /*
-     * It'd be nice to tell the stats collector to forget them immediately,
-     * too. But we can't because we don't know the OIDs.
+     * It'd be nice to tell the activity stats facility to forget them
+     * immediately, too. But we can't because we don't know the OIDs.
      */
 
     /*
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 7c5f7c775b..e2b248a87c 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3209,6 +3209,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3779,6 +3785,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4175,11 +4182,12 @@ PostgresMain(int argc, char *argv[],
          * Note: this includes fflush()'ing the last of the prior output.
          *
          * This is also a good time to send collected statistics to the
-         * collector, and to update the PS stats display.  We avoid doing
-         * those every time through the message loop because it'd slow down
-         * processing of batched messages, and because we don't want to report
-         * uncommitted updates (that confuses autovacuum).  The notification
-         * processor wants a call too, if we are not in a transaction block.
+         * activity statistics facility, and to update the PS stats display.
+         * We avoid doing those every time through the message loop because
+         * it'd slow down processing of batched messages, and because we don't
+         * want to report uncommitted updates (that confuses autovacuum).  The
+         * notification processor wants a call too, if we are not in a
+         * transaction block.
          */
         if (send_ready_for_query)
         {
@@ -4211,6 +4219,8 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long stats_timeout;
+
                 /* Send out notify signals and transmit self-notifies */
                 ProcessCompletedNotifies();
 
@@ -4223,8 +4233,13 @@ PostgresMain(int argc, char *argv[],
                 if (notifyInterruptPending)
                     ProcessNotifyInterrupt();
 
-                pgstat_report_stat(false);
-
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle");
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4259,7 +4274,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4267,6 +4282,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
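Note on the postgres.c hunks above: a backend that could not flush all of its pending statistics before going idle arms a timer, and the timer's handler only sets a flag that is acted on later at a safe point. The sketch below shows that deferred-flush pattern on its own. It rests on assumptions: the hunks show only that a positive return value from pgstat_report_stat(false) arms IDLE_STATS_UPDATE_TIMEOUT, so the reason for deferring (modelled here as a busy shared area) and the 1000 ms delay are illustrative, and every name in the block is a hypothetical stand-in rather than the PostgreSQL API.

```c
#include <signal.h>
#include <stdbool.h>

/* Set by the timer handler, consumed by the main loop. */
static volatile sig_atomic_t flush_pending = 0;

static int  pending_counts = 0;     /* accumulated locally, not yet shared */
static int  shared_counts = 0;      /* stand-in for the shared statistics */
static bool shared_busy = false;    /* simulates contention on shared area */

/*
 * Try to flush local counts.  Returns 0 when nothing is left to flush,
 * or a retry delay in milliseconds when the flush had to be deferred.
 */
static long
report_stat(bool force)
{
    if (pending_counts == 0)
        return 0;
    if (shared_busy && !force)
        return 1000;            /* caller should arm a timer */
    shared_counts += pending_counts;
    pending_counts = 0;
    return 0;
}

/* Timer handler: does nothing except record that the timer fired. */
static void
flush_timeout_handler(void)
{
    flush_pending = 1;
}

/* Called from the main loop at a point where flushing is safe. */
static void
process_interrupts(void)
{
    if (flush_pending)
    {
        flush_pending = 0;
        (void) report_stat(true);
    }
}
```

The handler stays trivial because it may run at an arbitrary point; the actual work happens in process_interrupts, mirroring how the hunk adds the check to ProcessInterrupts.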
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index a210fc93b4..c72042d99e 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * pgstatfuncs.c
- *      Functions for accessing the statistics collector data
+ *      Functions for accessing the activity statistics data
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -35,9 +35,6 @@
 
 #define HAS_PGSTAT_PERMISSIONS(role)     (is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS) || has_privs_of_role(GetUserId(),role))
 
-/* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
-
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
 {
@@ -1267,7 +1264,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_commit);
+        result = (int64) (dbentry->counts.n_xact_commit);
 
     PG_RETURN_INT64(result);
 }
@@ -1283,7 +1280,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_rollback);
+        result = (int64) (dbentry->counts.n_xact_rollback);
 
     PG_RETURN_INT64(result);
 }
@@ -1299,7 +1296,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_fetched);
+        result = (int64) (dbentry->counts.n_blocks_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1315,7 +1312,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_hit);
+        result = (int64) (dbentry->counts.n_blocks_hit);
 
     PG_RETURN_INT64(result);
 }
@@ -1331,7 +1328,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_returned);
+        result = (int64) (dbentry->counts.n_tuples_returned);
 
     PG_RETURN_INT64(result);
 }
@@ -1347,7 +1344,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_fetched);
+        result = (int64) (dbentry->counts.n_tuples_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1363,7 +1360,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_inserted);
+        result = (int64) (dbentry->counts.n_tuples_inserted);
 
     PG_RETURN_INT64(result);
 }
@@ -1379,7 +1376,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_updated);
+        result = (int64) (dbentry->counts.n_tuples_updated);
 
     PG_RETURN_INT64(result);
 }
@@ -1395,7 +1392,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_deleted);
+        result = (int64) (dbentry->counts.n_tuples_deleted);
 
     PG_RETURN_INT64(result);
 }
@@ -1428,7 +1425,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_files;
+        result = dbentry->counts.n_temp_files;
 
     PG_RETURN_INT64(result);
 }
@@ -1444,7 +1441,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_bytes;
+        result = dbentry->counts.n_temp_bytes;
 
     PG_RETURN_INT64(result);
 }
@@ -1459,7 +1456,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace);
+        result = (int64) (dbentry->counts.n_conflict_tablespace);
 
     PG_RETURN_INT64(result);
 }
@@ -1474,7 +1471,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_lock);
+        result = (int64) (dbentry->counts.n_conflict_lock);
 
     PG_RETURN_INT64(result);
 }
@@ -1489,7 +1486,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_snapshot);
+        result = (int64) (dbentry->counts.n_conflict_snapshot);
 
     PG_RETURN_INT64(result);
 }
@@ -1504,7 +1501,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_bufferpin);
+        result = (int64) (dbentry->counts.n_conflict_bufferpin);
 
     PG_RETURN_INT64(result);
 }
@@ -1519,7 +1516,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1534,11 +1531,11 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace +
-                          dbentry->n_conflict_lock +
-                          dbentry->n_conflict_snapshot +
-                          dbentry->n_conflict_bufferpin +
-                          dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_tablespace +
+                          dbentry->counts.n_conflict_lock +
+                          dbentry->counts.n_conflict_snapshot +
+                          dbentry->counts.n_conflict_bufferpin +
+                          dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1553,7 +1550,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_deadlocks);
+        result = (int64) (dbentry->counts.n_deadlocks);
 
     PG_RETURN_INT64(result);
 }
@@ -1571,7 +1568,7 @@ pg_stat_get_db_checksum_failures(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_checksum_failures);
+        result = (int64) (dbentry->counts.n_checksum_failures);
 
     PG_RETURN_INT64(result);
 }
@@ -1608,7 +1605,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_read_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_read_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1624,7 +1621,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_write_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_write_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1632,69 +1629,71 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_bgwriter_timed_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->timed_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->timed_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_requested_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->requested_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->requested_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_clean);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_written_clean);
 }
 
 Datum
 pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->maxwritten_clean);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->maxwritten_clean);
 }
 
 Datum
 pg_stat_get_checkpoint_write_time(PG_FUNCTION_ARGS)
 {
     /* time is already in msec, just convert to double for presentation */
-    PG_RETURN_FLOAT8((double) pgstat_fetch_global()->checkpoint_write_time);
+    PG_RETURN_FLOAT8((double)
+                     pgstat_fetch_stat_checkpointer()->checkpoint_write_time);
 }
 
 Datum
 pg_stat_get_checkpoint_sync_time(PG_FUNCTION_ARGS)
 {
     /* time is already in msec, just convert to double for presentation */
-    PG_RETURN_FLOAT8((double) pgstat_fetch_global()->checkpoint_sync_time);
+    PG_RETURN_FLOAT8((double)
+                     pgstat_fetch_stat_checkpointer()->checkpoint_sync_time);
 }
 
 Datum
 pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_global()->stat_reset_timestamp);
+    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_stat_bgwriter()->stat_reset_timestamp);
 }
 
 Datum
 pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_backend);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_backend);
 }
 
 Datum
 pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_fsync_backend);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_fsync_backend);
 }
 
 Datum
 pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_alloc);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
 }
 
 /*
@@ -1707,7 +1706,7 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
     TupleDesc    tupdesc;
     Datum        values[PG_STAT_GET_WAL_COLS];
     bool        nulls[PG_STAT_GET_WAL_COLS];
-    PgStat_WalStats *wal_stats;
+    PgStat_Wal *wal_stats;
 
     /* Initialise values and NULL flags arrays */
     MemSet(values, 0, sizeof(values));
@@ -2001,7 +2000,7 @@ pg_stat_get_xact_function_self_time(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_snapshot_timestamp(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_global()->stats_timestamp);
+    PG_RETURN_TIMESTAMPTZ(pgstat_get_stat_timestamp());
 }
 
 /* Discard the active statistics snapshot */
@@ -2089,7 +2088,7 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     TupleDesc    tupdesc;
     Datum        values[7];
     bool        nulls[7];
-    PgStat_ArchiverStats *archiver_stats;
+    PgStat_Archiver *archiver_stats;
 
     /* Initialise values and NULL flags arrays */
     MemSet(values, 0, sizeof(values));
@@ -2159,7 +2158,7 @@ pg_stat_get_replication_slots(PG_FUNCTION_ARGS)
     Tuplestorestate *tupstore;
     MemoryContext per_query_ctx;
     MemoryContext oldcontext;
-    PgStat_ReplSlotStats *slotstats;
+    PgStat_ReplSlot *slotstats;
     int            nstats;
     int            i;
 
@@ -2192,7 +2191,7 @@ pg_stat_get_replication_slots(PG_FUNCTION_ARGS)
     {
         Datum        values[PG_STAT_GET_REPLICATION_SLOT_COLS];
         bool        nulls[PG_STAT_GET_REPLICATION_SLOT_COLS];
-        PgStat_ReplSlotStats *s = &(slotstats[i]);
+        PgStat_ReplSlot *s = &(slotstats[i]);
 
         MemSet(values, 0, sizeof(values));
         MemSet(nulls, 0, sizeof(nulls));
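Note on the pgstatfuncs.c hunks above: every per-database accessor keeps the same shape (fetch the entry, return 0 if it is missing) and only the field path changes, from dbentry->n_xxx to dbentry->counts.n_xxx. The sketch below reproduces that shape with hypothetical types. DbCounts, DbEntry, fetch_db_entry and the sample values are assumptions made for illustration; the real structures are defined in pgstat.h by the patch.

```c
#include <stddef.h>
#include <stdint.h>

typedef int64_t Counter;

/* Pure event counters grouped in one sub-struct. */
typedef struct DbCounts
{
    Counter     n_xact_commit;
    Counter     n_xact_rollback;
} DbCounts;

/* Per-database entry; non-counter fields would live beside "counts". */
typedef struct DbEntry
{
    DbCounts    counts;
} DbEntry;

static DbEntry sample_entry = {{42, 7}};

/* Hypothetical lookup: only database 1 has statistics. */
static DbEntry *
fetch_db_entry(unsigned dbid)
{
    return (dbid == 1) ? &sample_entry : NULL;
}

/* Accessor in the style of pg_stat_get_db_xact_commit(). */
static int64_t
get_db_xact_commit(unsigned dbid)
{
    DbEntry    *entry = fetch_db_entry(dbid);

    if (entry == NULL)
        return 0;               /* no stats yet: report zero, not an error */
    return (int64_t) entry->counts.n_xact_commit;
}
```

One plausible benefit of the grouping, stated here as an assumption rather than as the author's rationale, is that the counters can be copied or zeroed as a single block.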
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 66393becfb..92a518433e 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -72,6 +72,7 @@
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/optimizer.h"
+#include "pgstat.h"
 #include "rewrite/rewriteDefine.h"
 #include "rewrite/rowsecurity.h"
 #include "storage/lmgr.h"
@@ -2354,6 +2355,10 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
      */
     RelationCloseSmgr(relation);
 
+    /* break mutual link with stats entry */
+    pgstat_delinkstats(relation);
+
+    if (relation->rd_rel)
     /*
      * Free all the subsidiary data structures of the relcache entry, then the
      * entry itself.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 6ab8216839..883c7f8802 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index ed2ab4b5b2..74fb22f216 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -269,9 +269,6 @@ GetBackendTypeDesc(BackendType backendType)
         case B_ARCHIVER:
             backendDesc = "archiver";
             break;
-        case B_STATS_COLLECTOR:
-            backendDesc = "stats collector";
-            break;
         case B_LOGGER:
             backendDesc = "logger";
             break;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index f2dd8e4914..6d6b50eada 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -72,6 +72,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -630,6 +631,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1244,6 +1247,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index bb34630e8e..4ee9fee7c9 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -743,8 +743,8 @@ const char *const config_group_names[] =
     gettext_noop("Statistics"),
     /* STATS_MONITORING */
     gettext_noop("Statistics / Monitoring"),
-    /* STATS_COLLECTOR */
-    gettext_noop("Statistics / Query and Index Statistics Collector"),
+    /* STATS_ACTIVITY */
+    gettext_noop("Statistics / Query and Index Activity Statistics"),
     /* AUTOVACUUM */
     gettext_noop("Autovacuum"),
     /* CLIENT_CONN */
@@ -1454,7 +1454,7 @@ static struct config_bool ConfigureNamesBool[] =
 #endif
 
     {
-        {"track_activities", PGC_SUSET, STATS_COLLECTOR,
+        {"track_activities", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects information about executing commands."),
             gettext_noop("Enables the collection of information on the currently "
                          "executing command of each session, along with "
@@ -1465,7 +1465,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_counts", PGC_SUSET, STATS_COLLECTOR,
+        {"track_counts", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects statistics on database activity."),
             NULL
         },
@@ -1474,7 +1474,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_io_timing", PGC_SUSET, STATS_COLLECTOR,
+        {"track_io_timing", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects timing statistics for database I/O activity."),
             NULL
         },
@@ -4310,7 +4310,7 @@ static struct config_string ConfigureNamesString[] =
     },
 
     {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
+        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
             gettext_noop("Writes temporary statistics files to the specified directory."),
             NULL,
             GUC_SUPERUSER_ONLY
@@ -4646,7 +4646,7 @@ static struct config_enum ConfigureNamesEnum[] =
     },
 
     {
-        {"track_functions", PGC_SUSET, STATS_COLLECTOR,
+        {"track_functions", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects function-level statistics on database activity."),
             NULL
         },
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9cb571f7cc..668a2d033a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -579,7 +579,7 @@
 # STATISTICS
 #------------------------------------------------------------------------------
 
-# - Query and Index Statistics Collector -
+# - Query and Index Activity Statistics -
 
 #track_activities = on
 #track_counts = on
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index f674a7c94e..26603e95e4 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -124,7 +124,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index bba002ad24..b70c495f2a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -83,6 +83,8 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
 
@@ -321,7 +323,6 @@ typedef enum BackendType
     B_WAL_SENDER,
     B_WAL_WRITER,
     B_ARCHIVER,
-    B_STATS_COLLECTOR,
     B_LOGGER,
 } BackendType;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 257e515bfe..03dcb367ce 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL activity statistics facility.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -15,9 +15,11 @@
 #include "libpq/pqcomm.h"
 #include "miscadmin.h"
 #include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -27,8 +29,8 @@
  * ----------
  */
 #define PGSTAT_STAT_PERMANENT_DIRECTORY        "pg_stat"
-#define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
-#define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
+#define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/saved_stats"
+#define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/saved_stats.tmp"
 
 /* Default directory to store temporary statistics data in */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
@@ -41,38 +43,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_RESETSLRUCOUNTER,
-    PGSTAT_MTYPE_RESETREPLSLOTCOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_WAL,
-    PGSTAT_MTYPE_SLRU,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE,
-    PGSTAT_MTYPE_REPLSLOT,
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -83,9 +53,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -159,10 +128,13 @@ typedef enum PgStat_Single_Reset_Type
  */
 typedef struct PgStat_TableStatus
 {
-    Oid            t_id;            /* table's OID */
-    bool        t_shared;        /* is it a shared catalog? */
     struct PgStat_TableXactStatus *trans;    /* lowest subxact's counts */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_TableCounts t_counts;    /* event counts to be sent */
+    Relation    relation;            /* rel that is using this entry */
 } PgStat_TableStatus;
 
 /* ----------
@@ -186,350 +158,57 @@ typedef struct PgStat_TableXactStatus
     struct PgStat_TableXactStatus *next;    /* next of same subxact */
 } PgStat_TableXactStatus;
 
-
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
-/* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
+/*
+ * Archiver statistics kept in the shared stats
  */
-typedef struct PgStat_MsgHdr
+typedef struct PgStat_Archiver
 {
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
+    PgStat_Counter archived_count;    /* archival successes */
+    char        last_archived_wal[MAX_XFN_CHARS + 1];    /* last WAL file
+                                                         * archived */
+    TimestampTz last_archived_timestamp;    /* last archival success time */
+    PgStat_Counter failed_count;    /* failed archival attempts */
+    char        last_failed_wal[MAX_XFN_CHARS + 1]; /* WAL file involved in
+                                                     * last failure */
+    TimestampTz last_failed_timestamp;    /* last archival failure time */
+    TimestampTz stat_reset_timestamp;
+} PgStat_Archiver;
 
 /* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgDummy
+typedef struct PgStat_BgWriter
 {
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_alloc;
+    TimestampTz       stat_reset_timestamp;
+} PgStat_BgWriter;
 
 /* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
+ * PgStat_CheckPointer        checkpointer statistics
  * ----------
  */
-
-typedef struct PgStat_MsgInquiry
+typedef struct PgStat_CheckPointer
 {
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgResetslrucounter Sent by the backend to tell the collector
- *                                to reset a SLRU counter
- * ----------
- */
-typedef struct PgStat_MsgResetslrucounter
-{
-    PgStat_MsgHdr m_hdr;
-    int            m_index;
-} PgStat_MsgResetslrucounter;
-
-/* ----------
- * PgStat_MsgResetreplslotcounter Sent by the backend to tell the collector
- *                                to reset replication slot counter(s)
- * ----------
- */
-typedef struct PgStat_MsgResetreplslotcounter
-{
-    PgStat_MsgHdr m_hdr;
-    char        m_slotname[NAMEDATALEN];
-    bool        clearall;
-} PgStat_MsgResetreplslotcounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgWal            Sent by backends and background processes to update WAL statistics.
- * ----------
- */
-typedef struct PgStat_MsgWal
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Counter m_wal_buffers_full;
-} PgStat_MsgWal;
-
-/* ----------
- * PgStat_MsgSLRU            Sent by a backend to update SLRU statistics.
- * ----------
- */
-typedef struct PgStat_MsgSLRU
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Counter m_index;
-    PgStat_Counter m_blocks_zeroed;
-    PgStat_Counter m_blocks_hit;
-    PgStat_Counter m_blocks_read;
-    PgStat_Counter m_blocks_written;
-    PgStat_Counter m_blocks_exists;
-    PgStat_Counter m_flush;
-    PgStat_Counter m_truncate;
-} PgStat_MsgSLRU;
-
-/* ----------
- * PgStat_MsgReplSlot    Sent by a backend or a wal sender to update replication
- *                        slot statistics.
- * ----------
- */
-typedef struct PgStat_MsgReplSlot
-{
-    PgStat_MsgHdr m_hdr;
-    char        m_slotname[NAMEDATALEN];
-    bool        m_drop;
-    PgStat_Counter m_spill_txns;
-    PgStat_Counter m_spill_count;
-    PgStat_Counter m_spill_bytes;
-    PgStat_Counter m_stream_txns;
-    PgStat_Counter m_stream_count;
-    PgStat_Counter m_stream_bytes;
-} PgStat_MsgReplSlot;
-
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+} PgStat_CheckPointer;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -545,7 +224,6 @@ typedef struct PgStat_FunctionCounts
  */
 typedef struct PgStat_BackendFunctionEntry
 {
-    Oid            f_id;
     PgStat_FunctionCounts f_counts;
 } PgStat_BackendFunctionEntry;
 
@@ -561,101 +239,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgResetslrucounter msg_resetslrucounter;
-    PgStat_MsgResetreplslotcounter msg_resetreplslotcounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgWal msg_wal;
-    PgStat_MsgSLRU msg_slru;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-    PgStat_MsgReplSlot msg_replslot;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Activity statistics data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -664,13 +249,9 @@ typedef union PgStat_Msg
 
 #define PGSTAT_FILE_FORMAT_ID    0x01A5BC9F
 
-/* ----------
- * PgStat_StatDBEntry            The collector's data per database
- * ----------
- */
-typedef struct PgStat_StatDBEntry
+
+typedef struct PgStat_StatDBCounts
 {
-    Oid            databaseid;
     PgStat_Counter n_xact_commit;
     PgStat_Counter n_xact_rollback;
     PgStat_Counter n_blocks_fetched;
@@ -680,7 +261,6 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_tuples_inserted;
     PgStat_Counter n_tuples_updated;
     PgStat_Counter n_tuples_deleted;
-    TimestampTz last_autovac_time;
     PgStat_Counter n_conflict_tablespace;
     PgStat_Counter n_conflict_lock;
     PgStat_Counter n_conflict_snapshot;
@@ -690,29 +270,85 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_temp_bytes;
     PgStat_Counter n_deadlocks;
     PgStat_Counter n_checksum_failures;
-    TimestampTz last_checksum_failure;
     PgStat_Counter n_block_read_time;    /* times in microseconds */
     PgStat_Counter n_block_write_time;
+} PgStat_StatDBCounts;
 
+/* ----------
+ * PgStat_StatEntryHeader        common header struct for PgStat_Stat*Entry
+ * ----------
+ */
+typedef struct PgStat_StatEntryHeader
+{
+    LWLock        lock;
+    bool        dropped;            /* This entry is being dropped and should
+                                     * be removed when refcount goes to
+                                     * zero. */
+    pg_atomic_uint32  refcount;        /* How many backends are referencing */
+} PgStat_StatEntryHeader;
+
+/* ----------
+ * PgStat_StatDBEntry            The statistics per database
+ * ----------
+ */
+typedef struct PgStat_StatDBEntry
+{
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    TimestampTz last_autovac_time;
+    TimestampTz last_checksum_failure;
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;    /* time of db stats update */
+
+    PgStat_StatDBCounts counts;
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following members must be last in the struct, because we don't write them
+     * out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables; /* current gen tables hash */
+    dshash_table_handle functions;    /* current gen functions hash */
 } PgStat_StatDBEntry;
 
+/* ----------
+ * PgStat_Wal   WAL statistics, updated by backends and background
+ *                processes.
+ * ----------
+ */
+typedef struct PgStat_Wal
+{
+    PgStat_Counter wal_buffers_full;
+    TimestampTz stat_reset_timestamp;
+} PgStat_Wal;
+
+/*
+ * SLRU statistics kept in the stats collector
+ */
+typedef struct PgStat_SLRUStats
+{
+    PgStat_Counter blocks_zeroed;
+    PgStat_Counter blocks_hit;
+    PgStat_Counter blocks_read;
+    PgStat_Counter blocks_written;
+    PgStat_Counter blocks_exists;
+    PgStat_Counter flush;
+    PgStat_Counter truncate;
+    TimestampTz stat_reset_timestamp;
+} PgStat_SLRUStats;
 
 /* ----------
- * PgStat_StatTabEntry            The collector's data per table (or index)
+ * PgStat_StatTabEntry            The statistics per table (or index)
  * ----------
  */
 typedef struct PgStat_StatTabEntry
 {
-    Oid            tableid;
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    /* Persistent data follow */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
 
     PgStat_Counter numscans;
 
@@ -732,96 +368,35 @@ typedef struct PgStat_StatTabEntry
     PgStat_Counter blocks_fetched;
     PgStat_Counter blocks_hit;
 
-    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
     PgStat_Counter vacuum_count;
-    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_vacuum_count;
-    TimestampTz analyze_timestamp;    /* user initiated */
     PgStat_Counter analyze_count;
-    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_analyze_count;
 } PgStat_StatTabEntry;
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            per function stats data
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
 {
-    Oid            functionid;
-
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    /* Persistent data follow */
     PgStat_Counter f_numcalls;
 
     PgStat_Counter f_total_time;    /* times in microseconds */
     PgStat_Counter f_self_time;
 } PgStat_StatFuncEntry;
 
-
-/*
- * Archiver statistics kept in the stats collector
- */
-typedef struct PgStat_ArchiverStats
-{
-    PgStat_Counter archived_count;    /* archival successes */
-    char        last_archived_wal[MAX_XFN_CHARS + 1];    /* last WAL file
-                                                         * archived */
-    TimestampTz last_archived_timestamp;    /* last archival success time */
-    PgStat_Counter failed_count;    /* failed archival attempts */
-    char        last_failed_wal[MAX_XFN_CHARS + 1]; /* WAL file involved in
-                                                     * last failure */
-    TimestampTz last_failed_timestamp;    /* last archival failure time */
-    TimestampTz stat_reset_timestamp;
-} PgStat_ArchiverStats;
-
-/*
- * Global statistics kept in the stats collector
- */
-typedef struct PgStat_GlobalStats
-{
-    TimestampTz stats_timestamp;    /* time of stats file update */
-    PgStat_Counter timed_checkpoints;
-    PgStat_Counter requested_checkpoints;
-    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
-    PgStat_Counter checkpoint_sync_time;
-    PgStat_Counter buf_written_checkpoints;
-    PgStat_Counter buf_written_clean;
-    PgStat_Counter maxwritten_clean;
-    PgStat_Counter buf_written_backend;
-    PgStat_Counter buf_fsync_backend;
-    PgStat_Counter buf_alloc;
-    TimestampTz stat_reset_timestamp;
-} PgStat_GlobalStats;
-
-/*
- * WAL statistics kept in the stats collector
- */
-typedef struct PgStat_WalStats
-{
-    PgStat_Counter wal_buffers_full;
-    TimestampTz stat_reset_timestamp;
-} PgStat_WalStats;
-
-/*
- * SLRU statistics kept in the stats collector
- */
-typedef struct PgStat_SLRUStats
-{
-    PgStat_Counter blocks_zeroed;
-    PgStat_Counter blocks_hit;
-    PgStat_Counter blocks_read;
-    PgStat_Counter blocks_written;
-    PgStat_Counter blocks_exists;
-    PgStat_Counter flush;
-    PgStat_Counter truncate;
-    TimestampTz stat_reset_timestamp;
-} PgStat_SLRUStats;
-
 /*
  * Replication slot statistics kept in the stats collector
  */
-typedef struct PgStat_ReplSlotStats
+typedef struct PgStat_ReplSlot
 {
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
     char        slotname[NAMEDATALEN];
     PgStat_Counter spill_txns;
     PgStat_Counter spill_count;
@@ -830,7 +405,7 @@ typedef struct PgStat_ReplSlotStats
     PgStat_Counter stream_count;
     PgStat_Counter stream_bytes;
     TimestampTz stat_reset_timestamp;
-} PgStat_ReplSlotStats;
+} PgStat_ReplSlot;
 
 /* ----------
  * Backend states
@@ -879,7 +454,7 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
+    WAIT_EVENT_READING_STATS_FILE,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
     WAIT_EVENT_WAL_RECEIVER_MAIN,
@@ -1131,7 +706,7 @@ typedef struct PgBackendGSSStatus
  *
  * Each live backend maintains a PgBackendStatus struct in shared memory
  * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
+ * BackendId, but that is not critical.)  Note that the stats-writing code
  * has no involvement in, or even access to, these structs.
  *
  * Each auxiliary process also maintains a PgBackendStatus struct in shared
@@ -1328,18 +903,26 @@ extern PGDLLIMPORT bool pgstat_track_counts;
 extern PGDLLIMPORT int pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used, but will be removed with GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
+
+/*
+ * CheckPointer statistics counters are updated directly by checkpointer and
+ * bufmgr
+ */
+extern PgStat_CheckPointer CheckPointerStats;
 
 /*
  * WAL statistics counter is updated by backends and background processes
  */
-extern PgStat_MsgWal WalStats;
+extern PgStat_Wal WalStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1354,33 +937,27 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
-
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 extern void pgstat_reset_slru_counter(const char *);
 extern void pgstat_reset_replslot_counter(const char *name);
 
+extern void pgstat_report_wal(void);
 extern void pgstat_report_autovac(Oid dboid);
 extern void pgstat_report_vacuum(Oid tableoid, bool shared,
                                  PgStat_Counter livetuples, PgStat_Counter deadtuples);
@@ -1420,6 +997,7 @@ extern PgStat_TableStatus *find_tabstat_entry(Oid rel_id);
 extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 
 extern void pgstat_initstats(Relation rel);
+extern void pgstat_delinkstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
 
@@ -1542,9 +1120,10 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                                       void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
-extern void pgstat_send_wal(void);
+extern void pgstat_report_archiver(const char *xlog, bool failed);
+extern void pgstat_report_bgwriter(void);
+extern void pgstat_report_checkpointer(void);
+extern void pgstat_report_wal(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1553,15 +1132,20 @@ extern void pgstat_send_wal(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_extended(bool shared,
+                                                                Oid relid);
+extern void pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
-extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
-extern PgStat_GlobalStats *pgstat_fetch_global(void);
-extern PgStat_WalStats *pgstat_fetch_stat_wal(void);
+extern TimestampTz pgstat_get_stat_timestamp(void);
+extern PgStat_Archiver *pgstat_fetch_stat_archiver(void);
+extern PgStat_BgWriter *pgstat_fetch_stat_bgwriter(void);
+extern PgStat_CheckPointer *pgstat_fetch_stat_checkpointer(void);
+extern PgStat_Wal *pgstat_fetch_stat_wal(void);
 extern PgStat_SLRUStats *pgstat_fetch_slru(void);
-extern PgStat_ReplSlotStats *pgstat_fetch_replslot(int *nslots_p);
+extern PgStat_ReplSlot *pgstat_fetch_replslot(int *nslots_p);
 
 extern void pgstat_count_slru_page_zeroed(int slru_idx);
 extern void pgstat_count_slru_page_hit(int slru_idx);
@@ -1572,5 +1156,6 @@ extern void pgstat_count_slru_flush(int slru_idx);
 extern void pgstat_count_slru_truncate(int slru_idx);
 extern const char *pgstat_slru_name(int slru_idx);
 extern int    pgstat_slru_index(const char *name);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index af9b41795d..621e074111 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -218,6 +218,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SHARED_TIDBITMAP,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_PER_XACT_PREDICATE_LIST,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 04431d0eb2..3b03464a1a 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -88,7 +88,7 @@ enum config_group
     PROCESS_TITLE,
     STATS,
     STATS_MONITORING,
-    STATS_COLLECTOR,
+    STATS_ACTIVITY,
     AUTOVACUUM,
     CLIENT_CONN,
     CLIENT_CONN_STATEMENT,
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 83a15f6795..77d1572a99 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.18.4

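For readers following the header changes above: the comment on PgStat_StatEntryHeader says a dropped entry "should be removed when refcount goes to zero". The following is only a minimal sketch of that lifecycle, not code from the patch. It uses C11 atomics instead of pg_atomic_uint32, leaves out the LWLock entirely (so it is single-threaded as written), and every Demo*/demo_* name is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for PgStat_StatEntryHeader, without the LWLock. */
typedef struct DemoEntryHeader
{
    bool        dropped;    /* entry is being dropped */
    atomic_uint refcount;   /* how many users still reference the entry */
} DemoEntryHeader;

static void
demo_entry_init(DemoEntryHeader *h)
{
    h->dropped = false;
    atomic_init(&h->refcount, 0);
}

/* Take a reference; refuses once the entry has been marked dropped. */
static bool
demo_entry_pin(DemoEntryHeader *h)
{
    if (h->dropped)
        return false;
    atomic_fetch_add(&h->refcount, 1);
    return true;
}

/*
 * Release a reference.  Returns true when the caller released the last
 * reference to a dropped entry and is therefore the one to remove it.
 */
static bool
demo_entry_unpin(DemoEntryHeader *h)
{
    unsigned int prev = atomic_fetch_sub(&h->refcount, 1);

    assert(prev > 0);
    return h->dropped && prev == 1;
}

/*
 * Mark the entry dropped.  Returns true if nobody references it, so it
 * can be removed immediately; otherwise removal is deferred to the last
 * demo_entry_unpin() call.
 */
static bool
demo_entry_drop(DemoEntryHeader *h)
{
    h->dropped = true;
    return atomic_load(&h->refcount) == 0;
}
```

In this sketch the removal is deferred to whoever releases the last reference, which is how I read the header comment. The real patch presumably sets and checks the dropped flag under the entry's lock; that part is not modelled here.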
From 54fb5f0c9f77b1e928d6c2b997150b89b89ada3f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:09 +0900
Subject: [PATCH v42 5/7] Doc part of shared-memory based stats collector.

---
 doc/src/sgml/catalogs.sgml          |   6 +-
 doc/src/sgml/config.sgml            |  34 ++++----
 doc/src/sgml/high-availability.sgml |  13 +--
 doc/src/sgml/monitoring.sgml        | 127 +++++++++++++---------------
 doc/src/sgml/ref/pg_dump.sgml       |   9 +-
 5 files changed, 90 insertions(+), 99 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 569841398b..d5b26a51df 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -9211,9 +9211,9 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l
   <para>
    <xref linkend="view-table"/> lists the system views described here.
    More detailed documentation of each view follows below.
-   There are some additional views that provide access to the results of
-   the statistics collector; they are described in <xref
-   linkend="monitoring-stats-views-table"/>.
+   There are some additional views that provide access to the activity
+   statistics; they are described in
+   <xref linkend="monitoring-stats-views-table"/>.
   </para>
 
   <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f043433e31..b98d47189f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7276,11 +7276,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
     <title>Run-time Statistics</title>
 
     <sect2 id="runtime-config-statistics-collector">
-     <title>Query and Index Statistics Collector</title>
+     <title>Query and Index Activity Statistics</title>
 
      <para>
-      These parameters control server-wide statistics collection features.
-      When statistics collection is enabled, the data that is produced can be
+      These parameters control server-wide activity statistics features.
+      When activity statistics are enabled, the data that is produced can be
       accessed via the <structname>pg_stat</structname> and
       <structname>pg_statio</structname> family of system views.
       Refer to <xref linkend="monitoring"/> for more information.
@@ -7296,14 +7296,13 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables the collection of information on the currently
-        executing command of each session, along with the time when
-        that command began execution. This parameter is on by
-        default. Note that even when enabled, this information is not
-        visible to all users, only to superusers and the user owning
-        the session being reported on, so it should not represent a
-        security risk.
-        Only superusers can change this setting.
+        Enables activity tracking on the currently executing command of
+        each session, along with the time when that command began
+        execution. This parameter is on by default. Note that even when
+        enabled, this information is not visible to all users, only to
+        superusers and the user owning the session being reported on, so it
+        should not represent a security risk.  Only superusers can change this
+        setting.
        </para>
       </listitem>
      </varlistentry>
@@ -7334,9 +7333,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables collection of statistics on database activity.
+        Enables tracking of database activity.
         This parameter is on by default, because the autovacuum
-        daemon needs the collected information.
+        daemon needs the activity information.
         Only superusers can change this setting.
        </para>
       </listitem>
@@ -8397,7 +8396,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Specifies the fraction of the total number of heap tuples counted in
-        the previous statistics collection that can be inserted without
+        the previously collected statistics that can be inserted without
         incurring an index scan at the <command>VACUUM</command> cleanup stage.
         This setting currently applies to B-tree indexes only.
        </para>
@@ -8409,9 +8408,10 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         the index contains deleted pages that can be recycled during cleanup.
         Index statistics are considered to be stale if the number of newly
         inserted tuples exceeds the <varname>vacuum_cleanup_index_scale_factor</varname>
-        fraction of the total number of heap tuples detected by the previous
-        statistics collection. The total number of heap tuples is stored in
-        the index meta-page. Note that the meta-page does not include this data
+
+        fraction of the total number of heap tuples in the previously
+        collected statistics. The total number of heap tuples is stored in the
+        index meta-page. Note that the meta-page does not include this data
         until <command>VACUUM</command> finds no dead tuples, so B-tree index
         scan at the cleanup stage can only be skipped if the second and
         subsequent <command>VACUUM</command> cycles detect no dead tuples.
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 19d7bd2b28..3a1dc17057 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2329,12 +2329,13 @@ LOG:  database system is ready to accept read only connections
    </para>
 
    <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
-    index usage, etc., will be recorded normally on the standby. Replayed
-    actions will not duplicate their effects on primary, so replaying an
-    insert will not increment the Inserts column of pg_stat_user_tables.
-    The stats file is deleted at the start of recovery, so stats from primary
-    and standby will differ; this is considered a feature, not a bug.
+    Activity statistics are collected during recovery. All scans, reads,
+    blocks, index usage, etc., will be recorded normally on the
+    standby. Replayed actions will not duplicate their effects on primary, so
+    replaying an insert will not increment the Inserts column of
+    pg_stat_user_tables.  Activity statistics are reset at the start of
+    recovery, so stats from primary and standby will differ; this is
+    considered a feature, not a bug.
    </para>
 
    <para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 98e1995453..4c1880bbc0 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -22,7 +22,7 @@
   <para>
    Several tools are available for monitoring database activity and
    analyzing performance.  Most of this chapter is devoted to describing
-   <productname>PostgreSQL</productname>'s statistics collector,
+   <productname>PostgreSQL</productname>'s activity statistics,
    but one should not neglect regular Unix monitoring programs such as
    <command>ps</command>, <command>top</command>, <command>iostat</command>, and <command>vmstat</command>.
    Also, once one has identified a
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    primary server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   primary process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   primary process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
@@ -130,20 +128,21 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
  </sect1>
 
  <sect1 id="monitoring-stats">
-  <title>The Statistics Collector</title>
+  <title>The Activity Statistics</title>
 
   <indexterm zone="monitoring-stats">
    <primary>statistics</primary>
   </indexterm>
 
   <para>
-   <productname>PostgreSQL</productname>'s <firstterm>statistics collector</firstterm>
-   is a subsystem that supports collection and reporting of information about
-   server activity.  Presently, the collector can count accesses to tables
-   and indexes in both disk-block and individual-row terms.  It also tracks
-   the total number of rows in each table, and information about vacuum and
-   analyze actions for each table.  It can also count calls to user-defined
-   functions and the total time spent in each one.
+   <productname>PostgreSQL</productname>'s <firstterm>activity
+   statistics</firstterm> is a subsystem that supports tracking and reporting
+   of information about server activity.  Presently, the activity statistics
+   tracks the count of accesses to tables and indexes in both disk-block and
+   individual-row terms.  It also tracks the total number of rows in each
+   table, and information about vacuum and analyze actions for each table.  It
+   can also track calls to user-defined functions and the total time spent in
+   each one.
   </para>
 
   <para>
@@ -151,15 +150,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    information about exactly what is going on in the system right now, such as
    the exact command currently being executed by other server processes, and
    which other connections exist in the system.  This facility is independent
-   of the collector process.
+   of the activity statistics.
   </para>
 
  <sect2 id="monitoring-stats-setup">
-  <title>Statistics Collection Configuration</title>
+  <title>Activity Statistics Configuration</title>
 
   <para>
-   Since collection of statistics adds some overhead to query execution,
-   the system can be configured to collect or not collect information.
+   Since tracking for the activity statistics adds some overhead to query
+   execution, the system can be configured to track or not track activity.
    This is controlled by configuration parameters that are normally set in
    <filename>postgresql.conf</filename>.  (See <xref linkend="runtime-config"/> for
    details about setting configuration parameters.)
@@ -172,7 +171,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The parameter <xref linkend="guc-track-counts"/> controls whether
-   statistics are collected about table and index accesses.
+   to track activity about table and index accesses.
   </para>
 
   <para>
@@ -196,18 +195,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
   </para>
 
   <para>
-   The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
-   When the server shuts down cleanly, a permanent copy of the statistics
-   data is stored in the <filename>pg_stat</filename> subdirectory, so that
-   statistics can be retained across server restarts.  When recovery is
-   performed at server start (e.g., after immediate shutdown, server crash,
-   and point-in-time recovery), all statistics counters are reset.
+   When the server shuts down cleanly, a permanent copy of the statistics data
+   is stored in the <filename>pg_stat</filename> subdirectory, so that
+   statistics can be retained across server restarts.  When recovery is
+   performed at server start (e.g., after immediate shutdown, server crash,
+   and point-in-time recovery), all statistics counters are reset.
   </para>
 
  </sect2>
@@ -220,48 +212,46 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    linkend="monitoring-stats-dynamic-views-table"/>, are available to show
    the current state of the system. There are also several other
    views, listed in <xref
-   linkend="monitoring-stats-views-table"/>, available to show the results
-   of statistics collection.  Alternatively, one can
-   build custom views using the underlying statistics functions, as discussed
-   in <xref linkend="monitoring-stats-functions"/>.
+   linkend="monitoring-stats-views-table"/>, available to show the activity
+   statistics.  Alternatively, one can build custom views using the underlying
+   statistics functions, as discussed in
+   <xref linkend="monitoring-stats-functions"/>.
   </para>
 
   <para>
-   When using the statistics to monitor collected data, it is important
-   to realize that the information does not update instantaneously.
-   Each individual server process transmits new statistical counts to
-   the collector just before going idle; so a query or transaction still in
-   progress does not affect the displayed totals.  Also, the collector itself
-   emits a new report at most once per <varname>PGSTAT_STAT_INTERVAL</varname>
-   milliseconds (500 ms unless altered while building the server).  So the
-   displayed information lags behind actual activity.  However, current-query
-   information collected by <varname>track_activities</varname> is
-   always up-to-date.
+   When using the activity statistics, it is important to realize that the
+   information does not update instantaneously.  Each individual server
+   process writes out new statistical counts just before going idle, but not
+   more frequently than once per <varname>PGSTAT_STAT_INTERVAL</varname>
+   milliseconds (1 second unless altered while building the server); so a
+   query or transaction still in progress does not affect the displayed
+   totals.  However, current-query information tracked by
+   <varname>track_activities</varname> is always up-to-date.
   </para>
 
   <para>
    Another important point is that when a server process is asked to display
-   any of these statistics, it first fetches the most recent report emitted by
-   the collector process and then continues to use this snapshot for all
-   statistical views and functions until the end of its current transaction.
-   So the statistics will show static information as long as you continue the
-   current transaction.  Similarly, information about the current queries of
-   all sessions is collected when any such information is first requested
-   within a transaction, and the same information will be displayed throughout
-   the transaction.
-   This is a feature, not a bug, because it allows you to perform several
-   queries on the statistics and correlate the results without worrying that
-   the numbers are changing underneath you.  But if you want to see new
-   results with each query, be sure to do the queries outside any transaction
-   block.  Alternatively, you can invoke
+   any of these statistics, it first reads the current statistics and then
+   continues to use this snapshot for all statistical views and functions
+   until the end of its current transaction.  So the statistics will show
+   static information as long as you continue the current transaction.
+   Similarly, information about the current queries of all sessions is tracked
+   when any such information is first requested within a transaction, and the
+   same information will be displayed throughout the transaction.  This is a
+   feature, not a bug, because it allows you to perform several queries on the
+   statistics and correlate the results without worrying that the numbers are
+   changing underneath you.  But if you want to see new results with each
+   query, be sure to do the queries outside any transaction block.
+   Alternatively, you can invoke
    <function>pg_stat_clear_snapshot</function>(), which will discard the
    current transaction's statistics snapshot (if any).  The next use of
    statistical information will cause a new snapshot to be fetched.
   </para>
-
+  
   <para>
-   A transaction can also see its own statistics (as yet untransmitted to the
-   collector) in the views <structname>pg_stat_xact_all_tables</structname>,
+   A transaction can also see its own statistics (as yet unwritten to the
+   server-wide activity statistics) in the
+   views <structname>pg_stat_xact_all_tables</structname>,
    <structname>pg_stat_xact_sys_tables</structname>,
    <structname>pg_stat_xact_user_tables</structname>, and
    <structname>pg_stat_xact_user_functions</structname>.  These numbers do not act as
@@ -637,7 +627,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    kernel's I/O cache, and might therefore still be fetched without
    requiring a physical read. Users interested in obtaining more
    detailed information on <productname>PostgreSQL</productname> I/O behavior are
-   advised to use the <productname>PostgreSQL</productname> statistics collector
+   advised to use the <productname>PostgreSQL</productname> activity statistics
    in combination with operating system utilities that allow insight
    into the kernel's handling of I/O.
   </para>
@@ -1074,10 +1064,6 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>LogicalLauncherMain</literal></entry>
       <entry>Waiting in main loop of logical replication launcher process.</entry>
      </row>
-     <row>
-      <entry><literal>PgStatMain</literal></entry>
-      <entry>Waiting in main loop of statistics collector process.</entry>
-     </row>
      <row>
       <entry><literal>RecoveryWalStream</literal></entry>
       <entry>Waiting in main loop of startup process for WAL to arrive, during
@@ -1832,6 +1818,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
     </thead>
 
     <tbody>
+     <row>
+      <entry><literal>ActivityStatistics</literal></entry>
+      <entry>Waiting to write out activity statistics to shared memory.</entry>
+     </row>
      <row>
       <entry><literal>AddinShmemInit</literal></entry>
       <entry>Waiting to manage an extension's space allocation in shared
@@ -5950,9 +5940,10 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
      <entry><literal>performing final cleanup</literal></entry>
      <entry>
        <command>VACUUM</command> is performing final cleanup.  During this phase,
-       <command>VACUUM</command> will vacuum the free space map, update statistics
-       in <literal>pg_class</literal>, and report statistics to the statistics
-       collector.  When this phase is completed, <command>VACUUM</command> will end.
+       <command>VACUUM</command> will vacuum the free space map, update
+       statistics in <literal>pg_class</literal>, and update the system-wide
+       activity statistics.  When this phase is completed,
+       <command>VACUUM</command> will end.
      </entry>
     </row>
    </tbody>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 0aa35cf0c3..ad105cb2a6 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1280,11 +1280,10 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The database activity of <application>pg_dump</application> is
-   normally collected by the statistics collector.  If this is
-   undesirable, you can set parameter <varname>track_counts</varname>
-   to false via <envar>PGOPTIONS</envar> or the <literal>ALTER
-   USER</literal> command.
+   The database activity of <application>pg_dump</application> is normally
+   collected.  If this is undesirable, you can set
+   parameter <varname>track_counts</varname> to false
+   via <envar>PGOPTIONS</envar> or the <literal>ALTER USER</literal> command.
   </para>
 
  </refsect1>
-- 
2.18.4

From ee5172d422406f36ac9102e5ec914e4089474153 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 29 Sep 2020 22:59:33 +0900
Subject: [PATCH v42 6/7] Remove the GUC stats_temp_directory

The new stats collection system doesn't need a temporary directory, so
just remove the GUC. pg_stat_statements is modified to use the pg_stat
directory to store its temporary files.  As a result, basebackup copies
pg_stat_statements' temporary file if it exists.
---
 .../pg_stat_statements/pg_stat_statements.c   | 13 +++---
 doc/src/sgml/backup.sgml                      |  2 -
 doc/src/sgml/config.sgml                      | 19 --------
 doc/src/sgml/storage.sgml                     |  6 ---
 src/backend/postmaster/pgstat.c               | 10 -----
 src/backend/replication/basebackup.c          | 36 ----------------
 src/backend/utils/misc/guc.c                  | 43 -------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/bin/initdb/initdb.c                       |  1 -
 src/bin/pg_rewind/filemap.c                   |  7 ---
 src/include/pgstat.h                          |  3 --
 src/test/perl/PostgresNode.pm                 |  4 --
 12 files changed, 6 insertions(+), 139 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 1eac9edaee..5eaceb60a7 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -88,14 +88,13 @@ PG_MODULE_MAGIC;
 #define PGSS_DUMP_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pg_stat_statements.stat"
 
 /*
- * Location of external query text file.  We don't keep it in the core
- * system's stats_temp_directory.  The core system can safely use that GUC
- * setting, because the statistics collector temp file paths are set only once
- * as part of changing the GUC, but pg_stat_statements has no way of avoiding
- * race conditions.  Besides, we only expect modest, infrequent I/O for query
- * strings, so placing the file on a faster filesystem is not compelling.
+ * Location of external query text file.  We don't keep it in the core system's
+ * pg_stats.  pg_stat_statements has no way of avoiding race conditions even if
+ * the directory were specified by a GUC.  Besides, we only expect modest,
+ * infrequent I/O for query strings, so placing the file on a faster filesystem
+ * is not compelling.
  */
-#define PGSS_TEXT_FILE    PG_STAT_TMP_DIR "/pgss_query_texts.stat"
+#define PGSS_TEXT_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pgss_query_texts.stat"
 
 /* Magic number identifying the stats file format */
 static const uint32 PGSS_FILE_HEADER = 0x20171004;
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 42a8ed328d..dd3d8892d8 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1146,8 +1146,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b98d47189f..5082298919 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7389,25 +7389,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 3234adb639..6bac5e075e 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -120,12 +120,6 @@ Item
   subsystem</entry>
 </row>
 
-<row>
- <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
-</row>
-
 <row>
  <entry><filename>pg_subtrans</filename></entry>
  <entry>Subdirectory containing subtransaction status data</entry>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 6f0e3e2e4e..0eb6d49a87 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -99,16 +99,6 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
- */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
-
 /*
  * List of SLRU names that we keep stats for.  There is no central registry of
  * SLRUs, so we use this fixed list instead.  The "other" entry is used for
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 57531d7d48..25eabbb1ad 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -87,9 +87,6 @@ static int    basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
 /* Was the backup currently in-progress initiated in recovery mode? */
 static bool backup_started_in_recovery = false;
 
-/* Relative path of temporary statistics directory */
-static char *statrelpath = NULL;
-
 /*
  * Size of each block sent into the tar stream for larger files.
  */
@@ -152,13 +149,6 @@ struct exclude_list_item
  */
 static const char *const excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    PG_STAT_TMP_DIR,
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
@@ -261,7 +251,6 @@ perform_base_backup(basebackup_options *opt)
     StringInfo    labelfile;
     StringInfo    tblspc_map_file;
     backup_manifest_info manifest;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
     backup_total = 0;
@@ -284,8 +273,6 @@ perform_base_backup(basebackup_options *opt)
     Assert(CurrentResourceOwner == NULL);
     CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -314,18 +301,6 @@ perform_base_backup(basebackup_options *opt)
         tablespaceinfo *ti;
         int            tblspc_streamed = 0;
 
-        /*
-         * Calculate the relative path of temporary statistics directory in
-         * order to skip the files which are located in that directory later.
-         */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
-
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
         ti->size = -1;
@@ -1365,17 +1340,6 @@ sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces,
         if (excludeFound)
             continue;
 
-        /*
-         * Exclude contents of directory specified by statrelpath if not set
-         * to the default (pg_stat_tmp) which is caught in the loop above.
-         */
-        if (statrelpath != NULL && strcmp(pathbuf, statrelpath) == 0)
-        {
-            elog(DEBUG1, "contents of directory \"%s\" excluded from backup", statrelpath);
-            size += _tarWriteDir(pathbuf, basepathlen, &statbuf, sizeonly);
-            continue;
-        }
-
         /*
          * We can skip pg_wal, the WAL segments need to be fetched from the
          * WAL archive anyway. But include it as an empty directory anyway, so
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 4ee9fee7c9..3c2a4f515c 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -202,7 +202,6 @@ static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource sourc
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_huge_page_size(int *newval, void **extra, GucSource source);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -558,8 +557,6 @@ char       *HbaFileName;
 char       *IdentFileName;
 char       *external_pid_file;
 
-char       *pgstat_temp_directory;
-
 char       *application_name;
 
 int            tcp_keepalives_idle;
@@ -4309,17 +4306,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_PRIMARY,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11636,35 +11622,6 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
     return true;
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 668a2d033a..7183c08305 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -586,7 +586,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index ee3bfa82f4..5b7eb30f14 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -217,7 +217,6 @@ static const char *const subdirs[] = {
     "pg_replslot",
     "pg_tblspc",
     "pg_stat",
-    "pg_stat_tmp",
     "pg_xact",
     "pg_logical",
     "pg_logical/snapshots",
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index ba34dbac14..00aed706bb 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -87,13 +87,6 @@ struct exclude_list_item
  */
 static const char *excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    "pg_stat_tmp",                /* defined as PG_STAT_TMP_DIR */
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 03dcb367ce..c837e86744 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -32,9 +32,6 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/saved_stats"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/saved_stats.tmp"
 
-/* Default directory to store temporary statistics data in */
-#define PG_STAT_TMP_DIR        "pg_stat_tmp"
-
 /* Values for track_functions GUC variable --- order is significant! */
 typedef enum TrackFunctionsLevel
 {
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index ebcaeb44fe..8772fcc970 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -455,10 +455,6 @@ sub init
     print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
       if defined $ENV{TEMP_CONFIG};
 
-    # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
-    # concurrently must not share a stats_temp_directory.
-    print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
     if ($params{allows_streaming})
     {
         if ($params{allows_streaming} eq "logical")
-- 
2.18.4

From 2df7028caaf5ca6018f466b35093608103c8116d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 29 Sep 2020 23:19:58 +0900
Subject: [PATCH v42 7/7] Exclude pg_stat directory from base backup

basebackup sends the contents of the pg_stat directory, which are doomed
to be removed at startup from the backup. Now that pg_stat_statements
saves a temporary file into that directory, let's exclude the pg_stat
directory from base backups.
---
 src/backend/replication/basebackup.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 25eabbb1ad..dd35920e82 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -149,6 +149,13 @@ struct exclude_list_item
  */
 static const char *const excludeDirContents[] =
 {
+    /*
+     * Skip statistics files. PGSTAT_STAT_PERMANENT_DIRECTORY must be skipped
+     * because the files in the directory will be removed at startup from the
+     * backup.
+     */
+    PGSTAT_STAT_PERMANENT_DIRECTORY,
+
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
-- 
2.18.4


Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
At Wed, 11 Nov 2020 10:07:22 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> 4f841ce3f7 hit this. Rebased.

- 01469241b2 and e2ac3fed3b (I believe that's all) conflicted with this patch set. Rebased.

- Fixed some silly bugs in WAL statistics.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 5fc56e600b47b19dfc7e70c4f4caa553d99f9478 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:03 +0900
Subject: [PATCH v43 1/7] sequential scan for dshash

Dshash did not allow scanning all entries sequentially. This adds that
functionality. The interface is similar to, but slightly different
from, both that of dynahash and the simple dshash search functions. One
of the most significant differences is that the sequential scan
interface of dshash always needs a call to dshash_seq_term when the
scan ends. Another is locking. Dshash holds the partition lock when
returning an entry; dshash_seq_next() also holds the lock when
returning an entry, but callers must not release it, since the lock is
needed to continue the scan. The seqscan interface allows entries to be
deleted during a scan. In-scan deletion must be done with
dshash_delete_current().
---
 src/backend/lib/dshash.c | 160 ++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  22 ++++++
 2 files changed, 181 insertions(+), 1 deletion(-)
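
For reviewers, the scan order and the lock hand-over can be modelled
outside PostgreSQL. The sketch below is not dshash code. It uses plain
arrays and booleans in place of DSA memory and LWLocks, and every toy_*
name is hypothetical. Only the bucket-to-partition arithmetic and the
"lock next, then release current" order are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NUM_PARTITIONS_LOG2 2
#define TOY_NUM_PARTITIONS      (1 << TOY_NUM_PARTITIONS_LOG2)
#define TOY_SIZE_LOG2           4
#define TOY_NUM_BUCKETS         (1 << TOY_SIZE_LOG2)

/* Same arithmetic as PARTITION_FOR_BUCKET_INDEX in the patch. */
#define TOY_PARTITION_FOR_BUCKET(idx) \
    ((idx) >> (TOY_SIZE_LOG2 - TOY_NUM_PARTITIONS_LOG2))

typedef struct toy_item
{
    int         value;
    struct toy_item *next;
} toy_item;

typedef struct toy_table
{
    toy_item   *buckets[TOY_NUM_BUCKETS];
    bool        locked[TOY_NUM_PARTITIONS];  /* stands in for LWLocks */
} toy_table;

typedef struct toy_seq_status
{
    toy_table  *table;
    int         curbucket;
    int         curpartition;   /* -1 until the first toy_seq_next() */
    toy_item   *nextitem;       /* saved so the current item may go away */
} toy_seq_status;

static void
toy_seq_init(toy_seq_status *st, toy_table *table)
{
    st->table = table;
    st->curbucket = 0;
    st->curpartition = -1;
    st->nextitem = NULL;
}

static toy_item *
toy_seq_next(toy_seq_status *st)
{
    toy_item   *next;

    if (st->curpartition < 0)
    {
        /* First call: lock the partition of bucket 0. */
        st->curpartition = TOY_PARTITION_FOR_BUCKET(0);
        st->table->locked[st->curpartition] = true;
        next = st->table->buckets[0];
    }
    else
        next = st->nextitem;

    while (next == NULL)
    {
        int         p;

        if (++st->curbucket >= TOY_NUM_BUCKETS)
            return NULL;        /* the lock stays held until toy_seq_term */

        p = TOY_PARTITION_FOR_BUCKET(st->curbucket);
        if (p != st->curpartition)
        {
            /* Lock the next partition before releasing the current one. */
            st->table->locked[p] = true;
            st->table->locked[st->curpartition] = false;
            st->curpartition = p;
        }
        next = st->table->buckets[st->curbucket];
    }

    st->nextitem = next->next;
    return next;
}

static void
toy_seq_term(toy_seq_status *st)
{
    if (st->curpartition >= 0)
        st->table->locked[st->curpartition] = false;
    st->curpartition = -1;
}

/* Number of partition "locks" currently held. */
static int
toy_locks_held(const toy_table *table)
{
    int         n = 0;

    for (int i = 0; i < TOY_NUM_PARTITIONS; i++)
        n += table->locked[i] ? 1 : 0;
    return n;
}

/* Usage: visit every item, then always terminate the scan. */
static int
toy_scan_sum(toy_table *table)
{
    toy_seq_status st;
    toy_item   *item;
    int         sum = 0;

    toy_seq_init(&st, table);
    while ((item = toy_seq_next(&st)) != NULL)
        sum += item->value;
    toy_seq_term(&st);
    return sum;
}
```

In this model exactly one partition is locked between calls, and
forgetting toy_seq_term() leaves the last partition locked, which is the
reason the real interface requires dshash_seq_term().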

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 78ccf03217..b829167872 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -127,6 +127,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +157,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -324,7 +332,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -592,6 +600,156 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through a dshash table and return all the
+ *           elements one by one; return NULL when there are no more.
+ *
+ * dshash_seq_term must always be called when a scan finishes.
+ * The caller may delete returned elements in the middle of a scan by using
+ * dshash_delete_current(). exclusive must be true to delete elements.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool exclusive)
+{
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->exclusive = exclusive;
+}
+
+/*
+ * Returns the next element.
+ *
+ * The returned element is locked and the caller must not release the lock
+ * explicitly. A later dshash_seq_next() or dshash_seq_term() releases it.
+ */
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert(status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+        LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                      status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        status->curpartition = partition;
+
+        /* resize doesn't happen from now until seq scan ends */
+        status->nbuckets =
+            NUM_BUCKETS(status->hash_table->control->size_log2);
+        ensure_valid_bucket_pointers(status->hash_table);
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(status->hash_table,
+                                               status->curpartition),
+                                status->exclusive ? LW_EXCLUSIVE : LW_SHARED));
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        int next_partition;
+
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            return NULL;
+        }
+
+        /* Check whether we need to move to the next partition */
+        next_partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+
+        if (status->curpartition != next_partition)
+        {
+            /*
+             * Move to the next partition. Lock the next partition before
+             * releasing the current one, not in the reverse order, so that a
+             * concurrent resize cannot slip in.  Avoid deadlock by taking
+             * locks in the same order as resize().
+             */
+            LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                         next_partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                         status->curpartition));
+            status->curpartition = next_partition;
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * The caller may delete the item. Store the next item in case of deletion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+/*
+ * Terminates the seqscan and releases all locks.
+ *
+ * Must always be called when finishing or exiting a seqscan.
+ */
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+
+    if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+/* Remove the current entry during a seq scan. */
+void
+dshash_delete_current(dshash_seq_status *status)
+{
+    dshash_table       *hash_table    = status->hash_table;
+    dshash_table_item  *item        = status->curitem;
+    size_t                partition    = PARTITION_FOR_HASH(item->hash);
+
+    Assert(status->exclusive);
+    Assert(hash_table->control->magic == DSHASH_MAGIC);
+    Assert(hash_table->find_locked);
+    Assert(hash_table->find_exclusively_locked);
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition),
+                                LW_EXCLUSIVE));
+
+    delete_item(hash_table, item);
+}
+
+/* Get the current entry during a seq scan. */
+void *
+dshash_get_current(dshash_seq_status *status)
+{
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b86df68e77..c337099061 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,21 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state. The detail is exposed to let users know the storage
+ * size but it should be considered as an opaque type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;    /* dshash table working on */
+    int                    curbucket;    /* bucket number we are at */
+    int                    nbuckets;    /* total number of buckets in the dshash */
+    dshash_table_item  *curitem;    /* item we are currently at */
+    dsa_pointer            pnextitem;    /* dsa-pointer to the next item */
+    int                    curpartition;    /* partition number we are at */
+    bool                exclusive;    /* locking mode */
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
                                    const dshash_parameters *params,
@@ -80,6 +95,13 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
+extern void dshash_delete_current(dshash_seq_status *status);
+extern void *dshash_get_current(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.27.0

From c174a95bbaa6d6509ee7289c12ad04184eb3e46c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:57 +0900
Subject: [PATCH v43 2/7] Add conditional lock feature to dshash

Dshash currently waits for locks unconditionally, which is inconvenient
when we want to avoid being blocked by other processes. This commit
adds dshash_find_extended, a variant of dshash_find and
dshash_find_or_insert that can return immediately on lock failure.
---
 src/backend/lib/dshash.c | 98 +++++++++++++++++++++-------------------
 src/include/lib/dshash.h |  3 ++
 2 files changed, 55 insertions(+), 46 deletions(-)
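
The nowait/insert combinations are easier to follow in a small model.
The sketch below is not the dshash implementation. It replaces LWLocks
with C11 atomic_flag spinlocks and the hash buckets with a fixed slot
array, and all toy_* names are hypothetical. It only mirrors the control
flow of dshash_find_extended as it appears in this patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NPARTS          4
#define TOY_SLOTS_PER_PART  4

typedef struct toy_entry
{
    bool        used;
    int         key;
    int         value;
} toy_entry;

typedef struct toy_table
{
    atomic_flag lock[TOY_NPARTS];   /* stand-in for partition LWLocks */
    toy_entry   slots[TOY_NPARTS][TOY_SLOTS_PER_PART];
} toy_table;

static int
toy_partition_for_key(int key)
{
    return (int) ((unsigned int) key % TOY_NPARTS);
}

static void
toy_init(toy_table *t)
{
    for (int p = 0; p < TOY_NPARTS; p++)
    {
        atomic_flag_clear(&t->lock[p]);
        for (int s = 0; s < TOY_SLOTS_PER_PART; s++)
            t->slots[p][s].used = false;
    }
}

/* Release the partition lock taken by a successful toy_find_extended(). */
static void
toy_release(toy_table *t, int key)
{
    atomic_flag_clear(&t->lock[toy_partition_for_key(key)]);
}

/*
 * Returns the entry with its partition locked, or NULL.  NULL means
 * either "lock busy" (nowait only; *found is left untouched) or
 * "not found and nothing inserted" (*found is set to false).
 */
static toy_entry *
toy_find_extended(toy_table *t, int key, bool nowait, bool insert,
                  bool *found)
{
    int         part = toy_partition_for_key(key);
    toy_entry  *free_slot = NULL;

    assert(!insert || found != NULL);

    if (!nowait)
    {
        while (atomic_flag_test_and_set(&t->lock[part]))
            ;                   /* spin: unconditional acquire */
    }
    else if (atomic_flag_test_and_set(&t->lock[part]))
        return NULL;            /* conditional acquire failed */

    for (int s = 0; s < TOY_SLOTS_PER_PART; s++)
    {
        toy_entry  *e = &t->slots[part][s];

        if (e->used && e->key == key)
        {
            if (found)
                *found = true;
            return e;           /* caller releases the lock */
        }
        if (!e->used && free_slot == NULL)
            free_slot = e;
    }

    if (found)
        *found = false;

    if (!insert || free_slot == NULL)
    {
        atomic_flag_clear(&t->lock[part]);
        return NULL;
    }

    free_slot->used = true;
    free_slot->key = key;
    free_slot->value = 0;
    return free_slot;           /* caller releases the lock */
}
```

One thing the model makes visible: with nowait set, a NULL return can
mean either "lock busy" or "not found", and *found is left untouched in
the first case, as in the patch. A caller that needs to tell the two
apart would have to initialize *found itself or use some other
convention.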

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index b829167872..9c90096f3d 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -383,6 +383,10 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  * the caller must take care to ensure that the entry is not left corrupted.
  * The lock mode is either shared or exclusive depending on 'exclusive'.
  *
+ * If found is not NULL, *found is set to true if the key is found in the hash
+ * table. If the key is not found, *found is set to false and a pointer to a
+ * newly created entry is returned.
+ *
  * The caller must not lock a lock already.
  *
  * Note that the lock held is in fact an LWLock, so interrupts will be held on
@@ -392,36 +396,7 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
 {
-    dshash_hash hash;
-    size_t        partition;
-    dshash_table_item *item;
-
-    hash = hash_key(hash_table, key);
-    partition = PARTITION_FOR_HASH(hash);
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
-
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
-    ensure_valid_bucket_pointers(hash_table);
-
-    /* Search the active bucket. */
-    item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
-
-    if (!item)
-    {
-        /* Not found. */
-        LWLockRelease(PARTITION_LOCK(hash_table, partition));
-        return NULL;
-    }
-    else
-    {
-        /* The caller will free the lock by calling dshash_release_lock. */
-        hash_table->find_locked = true;
-        hash_table->find_exclusively_locked = exclusive;
-        return ENTRY_FROM_ITEM(item);
-    }
+    return dshash_find_extended(hash_table, key, exclusive, false, false, NULL);
 }
 
 /*
@@ -439,31 +414,60 @@ dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
 {
-    dshash_hash hash;
-    size_t        partition_index;
-    dshash_partition *partition;
+    return dshash_find_extended(hash_table, key, true, false, true, found);
+}
+
+
+/*
+ * Find the key in the hash table.
+ *
+ * "exclusive" is the lock mode in which the partition for the returned item
+ * is locked.  If "nowait" is true, the function returns NULL immediately if
+ * the required lock cannot be acquired.  "insert" indicates insert mode: if
+ * the key is not found, a new entry is inserted and *found is set to false;
+ * otherwise *found is set to true.  "found" must be non-null in this mode.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool insert, bool *found)
+{
+    dshash_hash hash = hash_key(hash_table, key);
+    size_t        partidx = PARTITION_FOR_HASH(hash);
+    dshash_partition *partition = &hash_table->control->partitions[partidx];
+    LWLockMode  lockmode = exclusive ? LW_EXCLUSIVE : LW_SHARED;
     dshash_table_item *item;
 
-    hash = hash_key(hash_table, key);
-    partition_index = PARTITION_FOR_HASH(hash);
-    partition = &hash_table->control->partitions[partition_index];
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
+    /* must be exclusive when insert allowed */
+    Assert(!insert || (exclusive && found != NULL));
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (!nowait)
+        LWLockAcquire(PARTITION_LOCK(hash_table, partidx), lockmode);
+    else if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partidx),
+                                       lockmode))
+        return NULL;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
     item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
 
     if (item)
-        *found = true;
+    {
+        if (found)
+            *found = true;
+    }
     else
     {
-        *found = false;
+        if (found)
+            *found = false;
+
+        if (!insert)
+        {
+            /* The caller didn't ask us to add a new entry. */
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+            return NULL;
+        }
 
         /* Check if we are getting too full. */
         if (partition->count > MAX_COUNT_PER_PARTITION(hash_table))
@@ -479,7 +483,8 @@ restart:
              * Give up our existing lock first, because resizing needs to
              * reacquire all the locks in the right order to avoid deadlocks.
              */
-            LWLockRelease(PARTITION_LOCK(hash_table, partition_index));
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+
             resize(hash_table, hash_table->size_log2 + 1);
 
             goto restart;
@@ -493,12 +498,13 @@ restart:
         ++partition->count;
     }
 
-    /* The caller must release the lock with dshash_release_lock. */
+    /* The caller will free the lock by calling dshash_release_lock. */
     hash_table->find_locked = true;
-    hash_table->find_exclusively_locked = true;
+    hash_table->find_exclusively_locked = exclusive;
     return ENTRY_FROM_ITEM(item);
 }
 
+
 /*
  * Remove an entry by key.  Returns true if the key was found and the
  * corresponding entry was removed.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index c337099061..493e974832 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -91,6 +91,9 @@ extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                                    const void *key, bool *found);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+                                  bool exclusive, bool nowait, bool insert,
+                                  bool *found);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.27.0

From 28a0bc2e60e7e96a991e67925f8332c00f663283 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:59:38 +0900
Subject: [PATCH v43 3/7] Make archiver process an auxiliary process

This is a preliminary patch for the shared-memory based stats collector.

The archiver process must be an auxiliary process because it accesses
shared memory once stats data is moved into shared memory. Make it an
auxiliary process so that it keeps working.
---
 src/backend/access/transam/xlogarchive.c |   6 +-
 src/backend/bootstrap/bootstrap.c        |  22 ++--
 src/backend/postmaster/pgarch.c          | 130 +++--------------------
 src/backend/postmaster/postmaster.c      |  50 +++++----
 src/backend/storage/lmgr/proc.c          |   1 +
 src/include/access/xlog.h                |   3 +
 src/include/access/xlogarchive.h         |   1 +
 src/include/miscadmin.h                  |   2 +
 src/include/postmaster/pgarch.h          |   4 +-
 src/include/storage/pmsignal.h           |   1 -
 src/include/storage/proc.h               |   3 +
 11 files changed, 69 insertions(+), 154 deletions(-)
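
The wakeup path changes from "backend signals postmaster, postmaster
signals archiver" to "backend sets the archiver's latch directly". A
single-process sketch of that protocol follows. It is not PostgreSQL
code: toy_latch is a plain flag rather than a real Latch, the pending
counter stands in for .ready files, and all toy_* names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct toy_latch
{
    bool        is_set;
} toy_latch;

typedef struct toy_proc_global
{
    toy_latch  *archiver_latch;     /* NULL until the archiver advertises it */
} toy_proc_global;

/* Archiver startup: advertise our latch so that others can wake us. */
static void
toy_archiver_start(toy_proc_global *global, toy_latch *my_latch)
{
    my_latch->is_set = false;
    global->archiver_latch = my_latch;
}

/*
 * Models the notify side: record one more file to archive, then wake the
 * archiver if it has advertised a latch.  Returns true if a wakeup was
 * delivered.
 */
static bool
toy_archive_notify(toy_proc_global *global, int *pending)
{
    (*pending)++;

    if (global->archiver_latch == NULL)
        return false;           /* no archiver yet; it will catch up later */

    global->archiver_latch->is_set = true;
    return true;
}

/*
 * One iteration of the archiver loop: reset the latch, then archive
 * everything that is pending, whether or not a wakeup was received.
 * Returns the number of files archived.
 */
static int
toy_archiver_cycle(toy_latch *my_latch, int *pending)
{
    int         archived;

    my_latch->is_set = false;
    archived = *pending;
    *pending = 0;
    return archived;
}
```

Because each cycle archives whatever is pending regardless of the latch
state, a notification that arrives before the archiver has advertised
its latch is not lost in this model; it is picked up on the first cycle.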

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index f39dc4ddf1..d69e20961d 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -29,7 +29,9 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/latch.h"
 #include "storage/pmsignal.h"
+#include "storage/proc.h"
 
 /*
  * Attempt to retrieve the specified file from off-line archival storage.
@@ -490,8 +492,8 @@ XLogArchiveNotify(const char *xlog)
     }
 
     /* Notify archiver that it's got something to do */
-    if (IsUnderPostmaster)
-        SendPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER);
+    if (IsUnderPostmaster && ProcGlobal->archiverLatch)
+        SetLatch(ProcGlobal->archiverLatch);
 }
 
 /*
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index a7ed93fdc1..edda899554 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -317,6 +317,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
         case StartupProcess:
             MyBackendType = B_STARTUP;
             break;
+        case ArchiverProcess:
+            MyBackendType = B_ARCHIVER;
+            break;
         case BgWriterProcess:
             MyBackendType = B_BG_WRITER;
             break;
@@ -437,30 +440,29 @@ AuxiliaryProcessMain(int argc, char *argv[])
             proc_exit(1);        /* should never return */
 
         case StartupProcess:
-            /* don't set signals, startup process has its own agenda */
             StartupProcessMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
+
+        case ArchiverProcess:
+            PgArchiverMain();
+            proc_exit(1);
 
         case BgWriterProcess:
-            /* don't set signals, bgwriter has its own agenda */
             BackgroundWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case CheckpointerProcess:
-            /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalWriterProcess:
-            /* don't set signals, walwriter has its own agenda */
             InitXLOGAccess();
             WalWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalReceiverProcess:
-            /* don't set signals, walreceiver has its own agenda */
             WalReceiverMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         default:
             elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index ed1b65358d..e3a520def9 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -48,6 +48,7 @@
 #include "storage/latch.h"
 #include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
+#include "storage/procsignal.h"
 #include "utils/guc.h"
 #include "utils/ps_status.h"
 
@@ -78,13 +79,11 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
-static volatile sig_atomic_t wakened = false;
 static volatile sig_atomic_t ready_to_stop = false;
 
 /* ----------
@@ -95,8 +94,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
 static void pgarch_ArchiverCopyLoop(void);
@@ -110,75 +107,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -212,14 +140,9 @@ pgarch_forkexec(void)
 #endif                            /* EXEC_BACKEND */
 
 
-/*
- * PgArchiverMain
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
- *    since we don't use 'em, it hardly matters...
- */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+/* Main entry point for archiver process */
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -231,33 +154,26 @@ PgArchiverMain(int argc, char *argv[])
     /* SIGQUIT handler was already set up by InitPostmasterChild */
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, pgarch_waken);
+    pqsignal(SIGUSR1, procsignal_sigusr1_handler);
     pqsignal(SIGUSR2, pgarch_waken_stop);
+
     /* Reset some signals that are accepted by postmaster but not here */
     pqsignal(SIGCHLD, SIG_DFL);
+
+    /* Unblock signals (they were blocked when the postmaster forked us) */
     PG_SETMASK(&UnBlockSig);
 
-    MyBackendType = B_ARCHIVER;
-    init_ps_display(NULL);
+    /*
+     * Advertise our latch that backends can use to wake us up while we're
+     * sleeping.
+     */
+    ProcGlobal->archiverLatch = &MyProc->procLatch;
 
     pgarch_MainLoop();
 
     exit(0);
 }
 
-/* SIGUSR1 signal handler for archiver process */
-static void
-pgarch_waken(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    /* set flag that there is work to be done */
-    wakened = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
 /* SIGUSR2 signal handler for archiver process */
 static void
 pgarch_waken_stop(SIGNAL_ARGS)
@@ -282,14 +198,6 @@ pgarch_MainLoop(void)
     pg_time_t    last_copy_time = 0;
     bool        time_to_stop;
 
-    /*
-     * We run the copy loop immediately upon entry, in case there are
-     * unarchived files left over from a previous database run (or maybe the
-     * archiver died unexpectedly).  After that we wait for a signal or
-     * timeout before doing more.
-     */
-    wakened = true;
-
     /*
      * There shouldn't be anything for the archiver to do except to wait for a
      * signal ... however, the archiver exists to protect our data, so she
@@ -328,12 +236,8 @@ pgarch_MainLoop(void)
         }
 
         /* Do what we're here for */
-        if (wakened || time_to_stop)
-        {
-            wakened = false;
-            pgarch_ArchiverCopyLoop();
-            last_copy_time = time(NULL);
-        }
+        pgarch_ArchiverCopyLoop();
+        last_copy_time = time(NULL);
 
         /*
          * Sleep until a signal is received, or until a poll is forced by
@@ -354,13 +258,9 @@ pgarch_MainLoop(void)
                                WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                                timeout * 1000L,
                                WAIT_EVENT_ARCHIVER_MAIN);
-                if (rc & WL_TIMEOUT)
-                    wakened = true;
                 if (rc & WL_POSTMASTER_DEATH)
                     time_to_stop = true;
             }
-            else
-                wakened = true;
         }
 
         /*
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 5d09822c81..8a4706a870 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -547,6 +547,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1787,7 +1788,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -3041,7 +3042,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3176,20 +3177,16 @@ reaper(SIGNAL_ARGS)
         }
 
         /*
-         * Was it the archiver?  If so, just try to start a new one; no need
-         * to force reset of the rest of the system.  (If fail, we'll try
-         * again in future cycles of the main loop.).  Unless we were waiting
-         * for it to shut down; don't restart it in that case, and
-         * PostmasterStateMachine() will advance to the next shutdown step.
+         * Was it the archiver?  Normal exit can be ignored; we'll start a new
+         * one at the next iteration of the postmaster's main loop, if
+         * necessary. Any other exit condition is treated as a crash.
          */
         if (pid == PgArchPID)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3437,7 +3434,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3642,6 +3639,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3938,6 +3947,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5185,7 +5195,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5230,16 +5240,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (StartWorkerNeeded || HaveCrashedWorker)
         maybe_start_bgworkers();
 
-    if (CheckPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER) &&
-        PgArchPID != 0)
-    {
-        /*
-         * Send SIGUSR1 to archiver process, to wake it up and begin archiving
-         * next WAL file.
-         */
-        signal_child(PgArchPID, SIGUSR1);
-    }
-
     /* Tell syslogger to rotate logfile if requested */
     if (SysLoggerPID != 0)
     {
@@ -5481,6 +5481,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 7dc3911590..fc23539137 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -181,6 +181,7 @@ InitProcGlobal(void)
     ProcGlobal->startupBufferPinWaitBufId = -1;
     ProcGlobal->walwriterLatch = NULL;
     ProcGlobal->checkpointerLatch = NULL;
+    ProcGlobal->archiverLatch = NULL;
     pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
     pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e71..65443063e8 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -339,6 +339,9 @@ extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern void XLogArchiveWakeupStart(void);
+extern void XLogArchiveWakeupEnd(void);
+extern void XLogArchiveWakeup(void);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool PromoteIsTriggered(void);
diff --git a/src/include/access/xlogarchive.h b/src/include/access/xlogarchive.h
index 1c67de2ede..54ce0b97d7 100644
--- a/src/include/access/xlogarchive.h
+++ b/src/include/access/xlogarchive.h
@@ -25,6 +25,7 @@ extern void ExecuteRecoveryCommand(const char *command, const char *commandName,
 extern void KeepFileRestoredFromArchive(const char *path, const char *xlogfname);
 extern void XLogArchiveNotify(const char *xlog);
 extern void XLogArchiveNotifySeg(XLogSegNo segno);
+extern void XLogArchiveWakeup(void);
 extern void XLogArchiveForceDone(const char *xlog);
 extern bool XLogArchiveCheckDone(const char *xlog);
 extern bool XLogArchiveIsBusy(const char *xlog);
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 72e3352398..bba002ad24 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -418,6 +418,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -430,6 +431,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index b3200874ca..e3ffc63f14 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 56c5ec4481..c691acf8cd 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -34,7 +34,6 @@ typedef enum
 {
     PMSIGNAL_RECOVERY_STARTED,    /* recovery has started */
     PMSIGNAL_BEGIN_HOT_STANDBY, /* begin Hot Standby */
-    PMSIGNAL_WAKEN_ARCHIVER,    /* send a NOTIFY signal to xlog archiver */
     PMSIGNAL_ROTATE_LOGFILE,    /* send SIGUSR1 to syslogger to rotate logfile */
     PMSIGNAL_START_AUTOVAC_LAUNCHER,    /* start an autovacuum launcher */
     PMSIGNAL_START_AUTOVAC_WORKER,    /* start an autovacuum worker */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index e77f76ae8a..a656910d02 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -354,6 +354,9 @@ typedef struct PROC_HDR
     int            startupProcPid;
     /* Buffer id of the buffer that Startup process waits for pin on, or -1 */
     int            startupBufferPinWaitBufId;
+    /* Archiver process's latch */
+    Latch       *archiverLatch;
+    /* Current shared estimate of appropriate spins_per_delay value */
 } PROC_HDR;
 
 extern PGDLLIMPORT PROC_HDR *ProcGlobal;
-- 
2.27.0

From 6806a1ef79c8f501223c7e80f94d098ada27a620 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:14 +0900
Subject: [PATCH v43 4/7] Shared-memory based stats collector

Previously, activity statistics were collected via sockets and shared
among backends through files written periodically. Such files can reach
tens of megabytes and are created at most once per second; that large
amount of data is serialized by the stats collector and then
de-serialized periodically by every backend. To avoid that cost, this
patch places activity statistics data in shared memory. Each backend
accumulates statistics locally, then tries to move them into the shared
statistics at every transaction end, but at intervals no shorter than
10 seconds. Until 60 seconds have elapsed since the last flush to
shared stats, a lock failure postpones the flush, to alleviate lock
contention that would slow down transactions. After that, the flush
waits for the locks so that the shared statistics don't get stale.
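
The flush timing described above can be pictured with the following
minimal sketch. It is an illustration only, not code from this patch:
the FlushState struct and the flush_* functions are hypothetical names,
and it assumes the 60-second limit is counted from the first failed
attempt. Only the three interval constants come from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Interval constants as defined by the patch, in milliseconds. */
#define PGSTAT_MIN_INTERVAL        10000
#define PGSTAT_RETRY_MIN_INTERVAL   1000
#define PGSTAT_MAX_INTERVAL        60000

/* Hypothetical result of the per-transaction-end decision. */
typedef enum
{
    FLUSH_SKIP,                 /* too early, keep counters local */
    FLUSH_NOWAIT,               /* try to flush, give up on lock failure */
    FLUSH_WAIT                  /* flush and wait for the locks */
} FlushMode;

/* Hypothetical per-backend bookkeeping; all times in milliseconds. */
typedef struct FlushState
{
    int64_t     first_failure_ms;   /* first failed attempt, or -1 */
    int64_t     next_try_ms;        /* earliest time of the next attempt */
    int64_t     retry_interval_ms;  /* delay applied after the next failure */
} FlushState;

void
flush_state_init(FlushState *st, int64_t now_ms)
{
    assert(st != NULL);
    st->first_failure_ms = -1;
    st->next_try_ms = now_ms + PGSTAT_MIN_INTERVAL;
    st->retry_interval_ms = PGSTAT_RETRY_MIN_INTERVAL;
}

/* Decide what to do at transaction end. */
FlushMode
flush_decide(const FlushState *st, int64_t now_ms)
{
    assert(st != NULL);

    /* Counters have been pending too long: stop tolerating lock failure. */
    if (st->first_failure_ms >= 0 &&
        now_ms - st->first_failure_ms >= PGSTAT_MAX_INTERVAL)
        return FLUSH_WAIT;

    if (now_ms < st->next_try_ms)
        return FLUSH_SKIP;

    return FLUSH_NOWAIT;
}

/* Record the outcome of a flush attempt made at now_ms. */
void
flush_record(FlushState *st, int64_t now_ms, bool all_flushed)
{
    assert(st != NULL);

    if (all_flushed)
    {
        /* Start over: the next flush is at least MIN_INTERVAL away. */
        flush_state_init(st, now_ms);
        return;
    }

    /* Lock failure: retry sooner, doubling the delay on each failure. */
    if (st->first_failure_ms < 0)
        st->first_failure_ms = now_ms;
    st->next_try_ms = now_ms + st->retry_interval_ms;
    st->retry_interval_ms *= 2;
}
```

A caller would invoke flush_decide() at each transaction end, attempt
the flush in the indicated mode, and pass the result to flush_record().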
---
 src/backend/access/heap/heapam_handler.c      |    4 +-
 src/backend/access/heap/vacuumlazy.c          |    2 +-
 src/backend/access/transam/xlog.c             |    6 +-
 src/backend/catalog/index.c                   |   24 +-
 src/backend/commands/analyze.c                |    8 +-
 src/backend/commands/dbcommands.c             |    2 +-
 src/backend/commands/matview.c                |    8 +-
 src/backend/commands/vacuum.c                 |    4 +-
 src/backend/postmaster/autovacuum.c           |   76 +-
 src/backend/postmaster/bgwriter.c             |    4 +-
 src/backend/postmaster/checkpointer.c         |   26 +-
 src/backend/postmaster/pgarch.c               |   12 +-
 src/backend/postmaster/pgstat.c               | 6143 +++++++----------
 src/backend/postmaster/postmaster.c           |   88 +-
 src/backend/replication/basebackup.c          |    4 +-
 src/backend/replication/slot.c                |   12 +-
 src/backend/storage/buffer/bufmgr.c           |    8 +-
 src/backend/storage/ipc/ipci.c                |    2 +
 src/backend/storage/lmgr/lwlock.c             |    4 +-
 src/backend/storage/lmgr/lwlocknames.txt      |    1 +
 src/backend/storage/smgr/smgr.c               |    4 +-
 src/backend/tcop/postgres.c                   |   37 +-
 src/backend/utils/adt/pgstatfuncs.c           |   95 +-
 src/backend/utils/cache/relcache.c            |    5 +
 src/backend/utils/init/globals.c              |    1 +
 src/backend/utils/init/miscinit.c             |    3 -
 src/backend/utils/init/postinit.c             |   11 +
 src/backend/utils/misc/guc.c                  |   14 +-
 src/backend/utils/misc/postgresql.conf.sample |    2 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |    2 +-
 src/include/miscadmin.h                       |    3 +-
 src/include/pgstat.h                          |  727 +-
 src/include/storage/lwlock.h                  |    1 +
 src/include/utils/guc_tables.h                |    2 +-
 src/include/utils/timeout.h                   |    1 +
 35 files changed, 2790 insertions(+), 4556 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 3eea215b85..a0eeafe524 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1086,8 +1086,8 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
                  * our own.  In this case we should count and sample the row,
                  * to accommodate users who load a table and analyze it in one
                  * transaction.  (pgstat_report_analyze has to adjust the
-                 * numbers we send to the stats collector to make this come
-                 * out right.)
+                 * numbers we report to the activity stats facility to make
+                 * this come out right.)
                  */
                 if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(targtuple->t_data)))
                 {
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 25f2d5df1b..9ce56b54b6 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -598,7 +598,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
                         new_min_multi,
                         false);
 
-    /* report results to the stats collector, too */
+    /* report results to the activity stats facility, too */
     pgstat_report_vacuum(RelationGetRelid(onerel),
                          onerel->rd_rel->relisshared,
                          Max(new_live_tuples, 0),
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7e81ce4f17..3249f8210b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2197,7 +2197,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
                     WriteRqst.Flush = 0;
                     XLogWrite(WriteRqst, false);
                     LWLockRelease(WALWriteLock);
-                    WalStats.m_wal_buffers_full++;
+                    WalStats.wal_buffers_full++;
                     TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
                 }
                 /* Re-acquire WALBufMappingLock and retry */
@@ -8581,8 +8581,8 @@ LogCheckpointEnd(bool restartpoint)
                                                  CheckpointStats.ckpt_sync_end_t);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time += write_msecs;
-    BgWriterStats.m_checkpoint_sync_time += sync_msecs;
+    CheckPointerStats.checkpoint_write_time += write_msecs;
+    CheckPointerStats.checkpoint_sync_time += sync_msecs;
 
     /*
      * All of the published timing statistics are accounted for.  Only
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 731610c701..f8e6784975 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1854,28 +1854,10 @@ index_concurrently_swap(Oid newIndexId, Oid oldIndexId, const char *oldName)
 
     /*
      * Copy over statistics from old to new index
+     * The data will be flushed by the next pgstat_report_stat()
+     * call.
      */
-    {
-        PgStat_StatTabEntry *tabentry;
-
-        tabentry = pgstat_fetch_stat_tabentry(oldIndexId);
-        if (tabentry)
-        {
-            if (newClassRel->pgstat_info)
-            {
-                newClassRel->pgstat_info->t_counts.t_numscans = tabentry->numscans;
-                newClassRel->pgstat_info->t_counts.t_tuples_returned = tabentry->tuples_returned;
-                newClassRel->pgstat_info->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_hit = tabentry->blocks_hit;
-
-                /*
-                 * The data will be sent by the next pgstat_report_stat()
-                 * call.
-                 */
-            }
-        }
-    }
+    pgstat_copy_index_counters(oldIndexId, newClassRel->pgstat_info);
 
     /* Copy data of pg_statistic from the old index to the new one */
     CopyStatistics(oldIndexId, newIndexId);
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8af12b5c6b..a7e787d9d1 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -644,10 +644,10 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
     }
 
     /*
-     * Report ANALYZE to the stats collector, too.  However, if doing
-     * inherited stats we shouldn't report, because the stats collector only
-     * tracks per-table stats.  Reset the changes_since_analyze counter only
-     * if we analyzed all columns; otherwise, there is still work for
+     * Report ANALYZE to the activity stats facility, too.  However, if doing
+     * inherited stats we shouldn't report, because the activity stats facility
+     * only tracks per-table stats.  Reset the changes_since_analyze counter
+     * only if we analyzed all columns; otherwise, there is still work for
      * auto-analyze to do.
      */
     if (!inh)
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index f27c3fe8c1..de0c749570 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -971,7 +971,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
     DropDatabaseBuffers(db_id);
 
     /*
-     * Tell the stats collector to forget it immediately, too.
+     * Tell the activity stats facility to forget it immediately, too.
      */
     pgstat_drop_database(db_id);
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index cfc63915f3..2eef6c1654 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -336,10 +336,10 @@ ExecRefreshMatView(RefreshMatViewStmt *stmt, const char *queryString,
         refresh_by_heap_swap(matviewOid, OIDNewHeap, relpersistence);
 
         /*
-         * Inform stats collector about our activity: basically, we truncated
-         * the matview and inserted some new data.  (The concurrent code path
-         * above doesn't need to worry about this because the inserts and
-         * deletes it issues get counted by lower-level code.)
+         * Inform the activity stats facility about our activity: basically, we
+         * truncated the matview and inserted some new data.  (The concurrent
+         * code path above doesn't need to worry about this because the inserts
+         * and deletes it issues get counted by lower-level code.)
          */
         pgstat_count_truncate(matviewRel);
         if (!stmt->skipData)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 98270a1049..925c75e296 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -319,8 +319,8 @@ vacuum(List *relations, VacuumParams *params,
                  errmsg("VACUUM option DISABLE_PAGE_SKIPPING cannot be used with FULL")));
 
     /*
-     * Send info about dead objects to the statistics collector, unless we are
-     * in autovacuum --- autovacuum.c does this for itself.
+     * Send info about dead objects to the activity statistics facility, unless
+     * we are in autovacuum --- autovacuum.c does this for itself.
      */
     if ((params->options & VACOPT_VACUUM) && !IsAutoVacuumWorkerProcess())
         pgstat_vacuum_stat();
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 7e28944d2f..3c546618aa 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -342,9 +342,6 @@ static void autovacuum_do_vac_analyze(autovac_table *tab,
                                       BufferAccessStrategy bstrategy);
 static AutoVacOpts *extract_autovac_opts(HeapTuple tup,
                                          TupleDesc pg_class_desc);
-static PgStat_StatTabEntry *get_pgstat_tabentry_relid(Oid relid, bool isshared,
-                                                      PgStat_StatDBEntry *shared,
-                                                      PgStat_StatDBEntry *dbentry);
 static void perform_work_item(AutoVacuumWorkItem *workitem);
 static void autovac_report_activity(autovac_table *tab);
 static void autovac_report_workitem(AutoVacuumWorkItem *workitem,
@@ -1684,12 +1681,12 @@ AutoVacWorkerMain(int argc, char *argv[])
         char        dbname[NAMEDATALEN];
 
         /*
-         * Report autovac startup to the stats collector.  We deliberately do
-         * this before InitPostgres, so that the last_autovac_time will get
-         * updated even if the connection attempt fails.  This is to prevent
-         * autovac from getting "stuck" repeatedly selecting an unopenable
-         * database, rather than making any progress on stuff it can connect
-         * to.
+         * Report autovac startup to the activity stats facility.  We
+         * deliberately do this before InitPostgres, so that the
+         * last_autovac_time will get updated even if the connection attempt
+         * fails.  This is to prevent autovac from getting "stuck" repeatedly
+         * selecting an unopenable database, rather than making any progress on
+         * stuff it can connect to.
          */
         pgstat_report_autovac(dbid);
 
@@ -1961,8 +1958,6 @@ do_autovacuum(void)
     HASHCTL        ctl;
     HTAB       *table_toast_map;
     ListCell   *volatile cell;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     BufferAccessStrategy bstrategy;
     ScanKeyData key;
     TupleDesc    pg_class_desc;
@@ -1981,17 +1976,11 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /*
-     * may be NULL if we couldn't find an entry (only happens if we are
-     * forcing a vacuum for anti-wrap purposes).
-     */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
 
     /*
-     * Clean up any dead statistics collector entries for this DB. We always
+     * Clean up any dead activity statistics entries for this DB. We always
      * want to do this exactly once per DB-processing cycle, even if we find
      * nothing worth vacuuming in the database.
      */
@@ -2034,9 +2023,6 @@ do_autovacuum(void)
     /* StartTransactionCommand changed elsewhere */
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-
     classRel = table_open(RelationRelationId, AccessShareLock);
 
     /* create a copy so we can use it after closing pg_class */
@@ -2115,8 +2101,8 @@ do_autovacuum(void)
 
         /* Fetch reloptions and the pgstat entry for this table */
         relopts = extract_autovac_opts(tuple, pg_class_desc);
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                       relid);
 
         /* Check if it needs vacuum or analyze */
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
@@ -2199,8 +2185,8 @@ do_autovacuum(void)
         }
 
         /* Fetch the pgstat entry for this table */
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                       relid);
 
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
@@ -2759,29 +2745,6 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
     return av;
 }
 
-/*
- * get_pgstat_tabentry_relid
- *
- * Fetch the pgstat entry of a table, either local to a database or shared.
- */
-static PgStat_StatTabEntry *
-get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
-                          PgStat_StatDBEntry *dbentry)
-{
-    PgStat_StatTabEntry *tabentry = NULL;
-
-    if (isshared)
-    {
-        if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
-    }
-    else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
-
-    return tabentry;
-}
 
 /*
  * table_recheck_autovac
@@ -2985,17 +2948,10 @@ recheck_relation_needs_vacanalyze(Oid relid,
                                   bool *wraparound)
 {
     PgStat_StatTabEntry *tabentry;
-    PgStat_StatDBEntry *shared = NULL;
-    PgStat_StatDBEntry *dbentry = NULL;
-
-    if (classForm->relisshared)
-        shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    else
-        dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
 
     /* fetch the pgstat table entry */
-    tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                         shared, dbentry);
+    tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                   relid);
 
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
@@ -3025,7 +2981,7 @@ recheck_relation_needs_vacanalyze(Oid relid,
  *
  * For analyze, the analysis done is that the number of tuples inserted,
  * deleted and updated since the last analyze exceeds a threshold calculated
- * in the same fashion as above.  Note that the collector actually stores
+ * in the same fashion as above.  Note that the activity stats facility stores
  * the number of tuples (both live and dead) that there were as of the last
  * analyze.  This is asymmetric to the VACUUM case.
  *
@@ -3035,8 +2991,8 @@ recheck_relation_needs_vacanalyze(Oid relid,
  *
  * A table whose autovacuum_enabled option is false is
  * automatically skipped (unless we have to vacuum it due to freeze_max_age).
- * Thus autovacuum can be disabled for specific tables. Also, when the stats
- * collector does not have data about a table, it will be skipped.
+ * Thus autovacuum can be disabled for specific tables. Also, when the activity
+ * statistics facility has no data about a table, it will be skipped.
  *
  * A table whose vac_base_thresh value is < 0 takes the base value from the
  * autovacuum_vacuum_threshold GUC variable.  Similarly, a vac_scale_factor
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index a7afa758b6..b075e85839 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -244,9 +244,9 @@ BackgroundWriterMain(void)
         can_hibernate = BgBufferSync(&wb_context);
 
         /*
-         * Send off activity statistics to the stats collector
+         * Send off activity statistics to the activity stats facility
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 429c8010ef..fe6937f8af 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -360,7 +360,7 @@ CheckpointerMain(void)
         if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
         {
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            CheckPointerStats.requested_checkpoints++;
         }
 
         /*
@@ -374,7 +374,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                CheckPointerStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -495,17 +495,11 @@ CheckpointerMain(void)
         /* Check for archive_timeout and switch xlog files if necessary. */
         CheckArchiveTimeout();
 
-        /*
-         * Send off activity statistics to the stats collector.  (The reason
-         * why we re-use bgwriter-related code for this is that the bgwriter
-         * and checkpointer used to be just one process.  It's probably not
-         * worth the trouble to split the stats support into two independent
-         * stats message types.)
-         */
-        pgstat_send_bgwriter();
+        /* Send off activity statistics to the activity stats facility. */
+        pgstat_report_checkpointer();
 
         /* Send WAL statistics to the stats collector. */
-        pgstat_send_wal();
+        pgstat_report_wal();
 
         /*
          * If any checkpoint flags have been set, redo the loop to handle the
@@ -711,9 +705,9 @@ CheckpointWriteDelay(int flags, double progress)
         CheckArchiveTimeout();
 
         /*
-         * Report interim activity statistics to the stats collector.
+         * Report interim activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_report_checkpointer();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1258,8 +1252,10 @@ AbsorbSyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    CheckPointerStats.buf_written_backend
+        += CheckpointerShmem->num_backend_writes;
+    CheckPointerStats.buf_fsync_backend
+        += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index e3a520def9..6d88c65d5f 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -372,20 +372,20 @@ pgarch_ArchiverCopyLoop(void)
                 pgarch_archiveDone(xlog);
 
                 /*
-                 * Tell the collector about the WAL file that we successfully
-                 * archived
+                 * Tell the activity statistics facility about the WAL file
+                 * that we successfully archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_report_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
             else
             {
                 /*
-                 * Tell the collector about the WAL file that we failed to
-                 * archive
+                 * Tell the activity statistics facility about the WAL file
+                 * that we failed to archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_report_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7c75a25d21..891118883c 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,22 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Activity Statistics facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects activity statistics, e.g. per-table access statistics, of
+ *  all backends in shared memory. The activity numbers are first stored
+ *  locally in each process, then flushed to shared memory at commit
+ *  time or by idle-timeout.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ * To avoid congestion on the shared memory, shared stats are updated no more
+ * often than once per PGSTAT_MIN_INTERVAL (10000ms). If some local numbers
+ * remain unflushed due to lock failure, we retry at intervals that start at
+ * PGSTAT_RETRY_MIN_INTERVAL (1000ms) and double at every retry. Finally, we
+ * force an update PGSTAT_MAX_INTERVAL (60000ms) after the first attempt.
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the activity statistics facility creates the
+ *  area and loads the stored stats file, if any. At shutdown, the last process
+ *  writes the shared stats to the file and then destroys the area before exit.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -19,18 +26,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -40,13 +35,9 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
-#include "executor/instrument.h"
+#include "common/hashfn.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/fork_process.h"
@@ -54,20 +45,16 @@
 #include "postmaster/postmaster.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
+#include "storage/condition_variable.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
 #include "utils/timestamp.h"
 
@@ -75,35 +62,20 @@
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
+#define PGSTAT_MIN_INTERVAL            10000    /* Minimum interval of stats data
+                                             * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_RETRY_MIN_INTERVAL    1000 /* Initial retry interval after
+                                          * PGSTAT_MIN_INTERVAL */
 
+#define PGSTAT_MAX_INTERVAL            60000    /* Longest interval of stats data
+                                             * updates */
 
 /* ----------
- * The initial size hints for the hash tables used in the collector.
+ * The initial size hints for the hash tables used in the activity statistics.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
-#define PGSTAT_TAB_HASH_SIZE    512
+#define PGSTAT_TABLE_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
 
@@ -118,7 +90,6 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
-
 /* ----------
  * GUC parameters
  * ----------
@@ -133,17 +104,11 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used, but will be removed with GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
-/*
- * BgWriter and WAL global statistics counters.
- * Stored directly in a stats message structure so they can be sent
- * without needing to copy things around.  We assume these init to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
-PgStat_MsgWal WalStats;
-
 /*
  * WAL usage counters saved from pgWALUsage at the previous call to
  * pgstat_send_wal(). This is used to calculate how much WAL usage
@@ -170,73 +135,246 @@ static const char *const slru_names[] = {
 
 #define SLRU_NUM_ELEMENTS    lengthof(slru_names)
 
+/* struct for shared SLRU stats */
+typedef struct PgStatSharedSLRUStats
+{
+    PgStat_SLRUStats        entry[SLRU_NUM_ELEMENTS];
+    LWLock                    lock;
+    pg_atomic_uint32        changecount;
+} PgStatSharedSLRUStats;
+
+StaticAssertDecl(sizeof(TimestampTz) == sizeof(pg_atomic_uint64),
+                 "size of pg_atomic_uint64 doesn't match TimestampTz");
+
+typedef struct StatsShmemStruct
+{
+    dsa_handle    stats_dsa_handle;    /* handle for stats data area */
+    dshash_table_handle hash_handle;    /* shared dbstat hash */
+    int            refcount;            /* # of processes attached to the
+                                     * shared stats memory */
+    /* Global stats structs */
+    PgStat_Archiver            archiver_stats;
+    pg_atomic_uint32        archiver_changecount;
+    PgStat_BgWriter            bgwriter_stats;
+    pg_atomic_uint32        bgwriter_changecount;
+    PgStat_CheckPointer        checkpointer_stats;
+    pg_atomic_uint32        checkpointer_changecount;
+    PgStat_Wal                wal_stats;
+    LWLock                    wal_stats_lock;
+    PgStatSharedSLRUStats    slru_stats;
+    pg_atomic_uint32        slru_changecount;
+    pg_atomic_uint64        stats_timestamp;
+
+    /* Reset offsets, protected by StatsLock */
+    PgStat_Archiver            archiver_reset_offset;
+    PgStat_BgWriter            bgwriter_reset_offset;
+    PgStat_CheckPointer        checkpointer_reset_offset;
+
+    /* file read/write protection */
+    bool                    attach_holdoff;
+    ConditionVariable        holdoff_cv;
+
+    pg_atomic_uint64        gc_count; /* # of entries deleted. not
+                                            * protected by StatsLock */
+} StatsShmemStruct;
+
+/* BgWriter global statistics counters */
+PgStat_BgWriter BgWriterStats = {0};
+
+/* CheckPointer global statistics counters */
+PgStat_CheckPointer CheckPointerStats = {0};
+
+/* WAL global statistics counters */
+PgStat_Wal WalStats = {0};
+
 /*
- * SLRU statistics counts waiting to be sent to the collector.  These are
- * stored directly in stats message format so they can be sent without needing
- * to copy things around.  We assume this variable inits to zeroes.  Entries
- * are one-to-one with slru_names[].
+ * XXXX: always try to flush WAL stats. We don't want to manipulate another
+ * counter during XLogInsert so we don't have an efficient shortcut to know
+ * whether any counter gets incremented.
  */
-static PgStat_MsgSLRU SLRUStats[SLRU_NUM_ELEMENTS];
+static inline bool
+walstats_pending(void)
+{
+    static const PgStat_Wal all_zeroes;
+
+    return memcmp(&WalStats, &all_zeroes,
+                  offsetof(PgStat_Wal, stat_reset_timestamp)) != 0;
+}
+
+/*
+ * SLRU statistics counts waiting to be written to the shared activity
+ * statistics.  We assume this variable inits to zeroes.  Entries are
+ * one-to-one with slru_names[].
+ * Changes of SLRU counters are reported within critical sections so we use
+ * static memory in order to avoid memory allocation.
+ */
+static PgStat_SLRUStats local_SLRUStats[SLRU_NUM_ELEMENTS];
+static bool     have_slrustats = false;
 
 /* ----------
  * Local data
  * ----------
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
-
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
+/* backend-lifetime storage */
+static StatsShmemStruct *StatsShmem = NULL;
+static dsa_area *area = NULL;
 
 /*
- * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
- *
- * NOTE: once allocated, TabStatusArray structures are never moved or deleted
- * for the life of the backend.  Also, we zero out the t_id fields of the
- * contained PgStat_TableStatus structs whenever they are not actively in use.
- * This allows relcache pgstat_info pointers to be treated as long-lived data,
- * avoiding repeated searches in pgstat_initstats() when a relation is
- * repeatedly opened during a transaction.
+ * Types to define shared statistics structure.
+ *
+ * Per-object statistics are stored in "shared stats" entries: structs in
+ * DSA-allocated memory that begin with a header common to all object types.
+ * Each shared stats entry is pointed to from a dshash entry via a
+ * dsa_pointer. This layout keeps shared stats immovable across dshash
+ * resizing, lets a backend point to shared stats entries via native pointers,
+ * and allows locking at the stats-entry level. Per-entry locking reduces lock
+ * contention compared to the partition locks of dshash. A backend accumulates
+ * stats numbers in a stats entry in local memory, then flushes the numbers to
+ * shared stats entries, basically at transaction end.
+ *
+ * Each stat entry type has a fixed member PgStat_StatEntryHeader as the first
+ * element.
+ *
+ * Shared stats are stored as:
+ *
+ * dshash pgStatSharedHash
+ *    -> PgStatHashEntry                    (dshash entry)
+ *      (dsa_pointer)-> PgStat_Stat*Entry    (dsa memory block)
+ *
+ * Shared stats entries are pointed to directly from a pgstat_localhash hash:
+ *
+ * pgstat_localhash pgStatEntHash
+ *    -> PgStatLocalHashEntry                (equivalent of PgStatHashEntry)
+ *      (native pointer)-> PgStat_Stat*Entry (dsa memory block)
+ *
+ * Local stats waiting to be flushed to shared stats are stored as:
+ *
+ * pgstat_localhash pgStatLocalHash
+ *    -> PgStatLocalHashEntry                 (local hash entry)
+ *      (native pointer)-> PgStat_Stat*Entry/TableStatus (palloc'ed memory)
  */
-#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
 
-typedef struct TabStatusArray
+/* The types of statistics entries */
+typedef enum PgStatTypes
 {
-    struct TabStatusArray *tsa_next;    /* link to next array, if any */
-    int            tsa_used;        /* # entries currently used */
-    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
-} TabStatusArray;
-
-static TabStatusArray *pgStatTabList = NULL;
+    PGSTAT_TYPE_DB,            /* database-wide statistics */
+    PGSTAT_TYPE_TABLE,        /* per-table statistics */
+    PGSTAT_TYPE_FUNCTION,    /* per-function statistics */
+    PGSTAT_TYPE_REPLSLOT    /* per-replication-slot statistics */
+} PgStatTypes;
 
 /*
- * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
+ * Entry size lookup table for shared statistics entries, indexed by
+ * PgStatTypes
  */
-typedef struct TabStatHashEntry
+static const size_t pgstat_sharedentsize[] =
 {
-    Oid            t_id;
-    PgStat_TableStatus *tsa_entry;
-} TabStatHashEntry;
+    sizeof(PgStat_StatDBEntry),     /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_StatTabEntry),    /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_StatFuncEntry),    /* PGSTAT_TYPE_FUNCTION */
+    sizeof(PgStat_ReplSlot)            /* PGSTAT_TYPE_REPLSLOT */
+};
 
-/*
- * Hash table for O(1) t_id -> tsa_entry lookup
- */
-static HTAB *pgStatTabHash = NULL;
+/* Ditto for local statistics entries */
+static const size_t pgstat_localentsize[] =
+{
+    sizeof(PgStat_StatDBEntry),            /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_TableStatus),            /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_BackendFunctionEntry),/* PGSTAT_TYPE_FUNCTION */
+    sizeof(PgStat_ReplSlot)                /* PGSTAT_TYPE_REPLSLOT */
+};
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * We should avoid overwriting the header part of a shared entry. Use these
+ * macros to find the portion of the struct to be written or read.
+ * PGSTAT_SHENT_BODY returns a slightly smaller address than the actual
+ * address of the next member, but that doesn't matter.
  */
-static HTAB *pgStatFunctions = NULL;
+#define PGSTAT_SHENT_BODY(e) (((char *)(e)) + sizeof(PgStat_StatEntryHeader))
+#define PGSTAT_SHENT_BODY_LEN(t) \
+    (pgstat_sharedentsize[t] - sizeof(PgStat_StatEntryHeader))
+
+/* struct for shared statistics hash entry key. */
+typedef struct PgStatHashKey
+{
+    PgStatTypes type;        /* statistics entry type */
+    Oid            databaseid;    /* database ID. InvalidOid for shared objects. */
+    Oid            objectid;    /* object ID, either table or function. */
+} PgStatHashKey;
+
+/* struct for shared statistics hash entry */
+typedef struct PgStatHashEntry
+{
+    PgStatHashKey    key; /* hash key */
+    dsa_pointer        body;/* pointer to shared stats in PgStat_StatEntryHeader */
+} PgStatHashEntry;
+
+/* struct for shared statistics local hash entry. */
+typedef struct PgStatLocalHashEntry
+{
+    PgStatHashKey            key;    /* hash key */
+    char                    status;    /* for simplehash use */
+    PgStat_StatEntryHeader *body;    /* pointer to stats body in local heap */
+    dsa_pointer                dsapointer; /* dsa pointer of body */
+} PgStatLocalHashEntry;
+
+/* parameter for the shared hash */
+static const dshash_parameters dsh_params = {
+    sizeof(PgStatHashKey),
+    sizeof(PgStatHashEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+
+/* define hashtable for local hashes */
+#define SH_PREFIX pgstat_localhash
+#define SH_ELEMENT_TYPE PgStatLocalHashEntry
+#define SH_KEY_TYPE PgStatHashKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+    hash_bytes((unsigned char *)&key, sizeof(PgStatHashKey))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(PgStatHashKey)) == 0)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/* The shared hash to index activity stats entries. */
+static dshash_table *pgStatSharedHash = NULL;
 
 /*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
+ * The local cache to index shared stats entries.
+ *
+ * This is a local hash to store native pointers to shared hash
+ * entries. pgStatEntHashAge is copied from StatsShmem->gc_count at creation
+ * and garbage collection.
  */
-static bool have_function_stats = false;
+static pgstat_localhash_hash *pgStatEntHash = NULL;
+static int     pgStatEntHashAge = 0;        /* cache age of pgStatEntHash */
+
+/* Local stats numbers are stored here. */
+static pgstat_localhash_hash *pgStatLocalHash = NULL;
+
+/* entry type for oid hash */
+typedef struct pgstat_oident
+{
+    Oid oid;
+    char status;
+} pgstat_oident;
+
+/* Define hashtable for OID hashes. */
+StaticAssertDecl(sizeof(Oid) == 4, "oid is not compatible with uint32");
+#define SH_PREFIX pgstat_oid
+#define SH_ELEMENT_TYPE pgstat_oident
+#define SH_KEY_TYPE Oid
+#define SH_KEY oid
+#define SH_HASH_KEY(tb, key) hash_bytes_uint32(key)
+#define SH_EQUAL(tb, a, b) (a == b)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
 
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
@@ -273,11 +411,8 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -286,23 +421,9 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Make our own memory context to make it easy to track memory usage.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-static PgStat_WalStats walStats;
-static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
-static PgStat_ReplSlotStats *replSlotStats;
-static int    nReplSlotStats;
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
+MemoryContext    pgStatCacheContext = NULL;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -311,40 +432,57 @@ static List *pending_write_requests = NIL;
  */
 static instr_time total_func_time;
 
+/* Simple caching feature for pgstatfuncs */
+static PgStatHashKey    stathashkey_zero = {0};
+static PgStatHashKey        cached_dbent_key = {0};
+static PgStat_StatDBEntry    cached_dbent;
+static PgStatHashKey        cached_tabent_key = {0};
+static PgStat_StatTabEntry    cached_tabent;
+static PgStatHashKey        cached_funcent_key = {0};
+static PgStat_StatFuncEntry    cached_funcent;
+
+static PgStat_Archiver        cached_archiverstats;
+static bool                    cached_archiverstats_is_valid = false;
+static PgStat_BgWriter        cached_bgwriterstats;
+static bool                    cached_bgwriterstats_is_valid = false;
+static PgStat_CheckPointer    cached_checkpointerstats;
+static bool                    cached_checkpointerstats_is_valid = false;
+static PgStat_Wal            cached_walstats;
+static bool                    cached_walstats_is_valid = false;
+static PgStat_SLRUStats        cached_slrustats;
+static bool                    cached_slrustats_is_valid = false;
+static PgStat_ReplSlot       *cached_replslotstats = NULL;
+static int                    n_cached_replslotstats = -1;
 
 /* ----------
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgstat_beshutdown_hook(int code, Datum arg);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                                                 Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStat_StatDBEntry *get_local_dbstat_entry(Oid dbid);
+static PgStat_TableStatus *get_local_tabstat_entry(Oid rel_id, bool isshared);
+
+static void pgstat_write_statsfile(void);
+
+static void pgstat_read_statsfile(void);
 static void pgstat_read_current_status(void);
 
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
+static PgStat_StatEntryHeader * get_stat_entry(PgStatTypes type, Oid dbid,
+                                               Oid objid, bool nowait,
+                                               bool create, bool *found);
 
-static int    pgstat_replslot_index(const char *name, bool create_it);
-static void pgstat_reset_replslot(int i, TimestampTz ts);
+static bool flush_tabstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_funcstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_dbstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_walstat(bool nowait);
+static bool flush_slrustat(bool nowait);
+static void delete_current_stats_entry(dshash_seq_status *hstat);
+static PgStat_StatEntryHeader * get_local_stat_entry(PgStatTypes type, Oid dbid,
+                                                     Oid objid, bool create,
+                                                     bool *found);
 
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
-static void pgstat_send_slru(void);
-static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
-
-static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
+static pgstat_oid_hash *collect_oids(Oid catalogid, AttrNumber anum_oid);
 
 static void pgstat_setup_memcxt(void);
 
@@ -354,491 +492,645 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len);
-static void pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
-static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_replslot(PgStat_MsgReplSlot *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute shared memory space needed for activity statistics
+ */
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+/*
+ * StatsShmemInit - initialize during shared-memory creation
  */
 void
-pgstat_init(void)
+StatsShmemInit(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
+    bool        found;
 
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(), &found);
 
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+        ConditionVariableInit(&StatsShmem->holdoff_cv);
+        pg_atomic_init_u32(&StatsShmem->archiver_changecount, 0);
+        pg_atomic_init_u32(&StatsShmem->bgwriter_changecount, 0);
+        pg_atomic_init_u32(&StatsShmem->checkpointer_changecount, 0);
+
+        pg_atomic_init_u64(&StatsShmem->gc_count, 0);
+
+        LWLockInitialize(&StatsShmem->wal_stats_lock, LWTRANCHE_STATS);
+    }
+}
+
+/* ----------
+ * allow_next_attacher() -
+ *
+ *  Let other processes go ahead and attach to the shared stats area.
+ * ----------
+ */
+static void
+allow_next_attacher(void)
+{
+    bool triggered = false;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (StatsShmem->attach_holdoff)
     {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
+        StatsShmem->attach_holdoff = false;
+        triggered = true;
     }
+    LWLockRelease(StatsLock);
+
+    if (triggered)
+        ConditionVariableBroadcast(&StatsShmem->holdoff_cv);
+}
+
+/* ----------
+ * attach_shared_stats() -
+ *
+ *    Attach to or create the shared stats memory. If we are the first process
+ *    to use the activity stats system, read the saved statistics file, if any.
+ * ---------
+ */
+static void
+attach_shared_stats(void)
+{
+    MemoryContext oldcontext;
 
     /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
+     * Don't use dsm in the postmaster, or when not tracking counts.
      */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
 
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
+    pgstat_setup_memcxt();
 
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
+    if (area)
+        return;
 
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    /* stats shared memory persists for the backend lifetime */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
 
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    /*
+     * The first attacher backend may still be reading the stats file, or the
+     * last detacher may be writing it. Wait for the work to finish.
+     */
+    ConditionVariablePrepareToSleep(&StatsShmem->holdoff_cv);
+    for (;;)
+    {
+        bool hold_off;
 
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        LWLockAcquire(StatsLock, LW_SHARED);
+        hold_off = StatsShmem->attach_holdoff;
+        LWLockRelease(StatsLock);
 
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
+        if (!hold_off)
+            break;
+
+        ConditionVariableTimedSleep(&StatsShmem->holdoff_cv, 10,
+                                    WAIT_EVENT_READING_STATS_FILE);
     }
+    ConditionVariableCancelSleep();
 
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
     /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
+     * The last process is responsible for writing out the stats file at exit.
+     * Maintain refcount so that an exiting process can tell whether it is the
+     * last one or not.
      */
-    if (!pg_set_noblock(pgStatSock))
+    if (StatsShmem->refcount > 0)
+        StatsShmem->refcount++;
+    else
     {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
+        /* We're the first process to attach the shared stats memory */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
+
+        /* Initialize shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatSharedHash = dshash_create(area, &dsh_params, 0);
+
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->hash_handle = dshash_get_hash_table_handle(pgStatSharedHash);
+        LWLockInitialize(&StatsShmem->slru_stats.lock, LWTRANCHE_STATS);
+        pg_atomic_init_u32(&StatsShmem->slru_stats.changecount, 0);
+
+        /* Block the next attacher for a while, see the comment above. */
+        StatsShmem->attach_holdoff = true;
+
+        StatsShmem->refcount = 1;
     }
 
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
+    LWLockRelease(StatsLock);
+
+    if (area)
     {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            ereport(LOG,
-                    (errmsg("getsockopt(%s) failed: %m", "SO_RCVBUF")));
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                ereport(LOG,
-                        (errmsg("setsockopt(%s) failed: %m", "SO_RCVBUF")));
-        }
+        /*
+         * We're the first attacher process, read stats file while blocking
+         * successors.
+         */
+        Assert(StatsShmem->attach_holdoff);
+        pgstat_read_statsfile();
+        allow_next_attacher();
+    }
+    else
+    {
+        /* We're not the first one, attach existing shared area. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatSharedHash = dshash_attach(area, &dsh_params,
+                                         StatsShmem->hash_handle, 0);
     }
 
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-    /* Now that we have a long-lived socket, tell fd.c about it. */
-    ReserveExternalFD();
+    MemoryContextSwitchTo(oldcontext);
 
-    return;
+    /* don't detach automatically */
+    dsa_pin_mapping(area);
+}
 
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
+/* ----------
+ * cleanup_dropped_stats_entries() -
+ *              Clean up shared stats entries no longer used.
+ *
+ *  Shared stats entries for dropped objects may be left referenced. Clean up
+ *  our reference and drop the shared entry if needed.
+ * ----------
+ */
+static void
+cleanup_dropped_stats_entries(void)
+{
+    pgstat_localhash_iterator i;
+    PgStatLocalHashEntry   *ent;
 
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
+    if (pgStatEntHash == NULL)
+        return;
 
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
+    pgstat_localhash_start_iterate(pgStatEntHash, &i);
+    while ((ent = pgstat_localhash_iterate(pgStatEntHash, &i))
+           != NULL)
+    {
+        /*
+         * Free the shared memory chunk for the entry if we were the last
+         * referrer to a dropped entry.
+         */
+        if (pg_atomic_sub_fetch_u32(&ent->body->refcount, 1) < 1 &&
+            ent->body->dropped)
+            dsa_free(area, ent->dsapointer);
+    }
 
     /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
+     * This function is expected to be called during backend exit. So we don't
+     * bother destroying pgStatEntHash.
      */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    pgStatEntHash = NULL;
 }
 
-/*
- * subroutine for pgstat_reset_all
+/* ----------
+ * detach_shared_stats() -
+ *
+ *    Detach shared stats. Write out to file if we're the last process and told
+ *    to do so.
+ * ----------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+detach_shared_stats(bool write_file)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    bool is_last_detacher = false;
+
+    /* immediately return if useless */
+    if (!area || !IsUnderPostmaster)
+        return;
+
+    /* We shouldn't leave a reference to shared stats. */
+    Assert(pgStatEntHash == NULL);
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
+    /*
+     * If we are the last detacher, hold off the next attacher (if possible)
+     * until we finish writing stats file.
+     */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (--StatsShmem->refcount == 0)
     {
-        int            nchars;
-        Oid            tmp_oid;
+        StatsShmem->attach_holdoff = true;
+        is_last_detacher = true;
+    }
+    LWLockRelease(StatsLock);
 
-        /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
-         */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
-
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
-
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+    if (is_last_detacher)
+    {
+        if (write_file)
+            pgstat_write_statsfile();
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+        /* allow the next attacher, if any */
+        allow_next_attacher();
     }
-    FreeDir(dir);
+
+    /*
+     * Detach the area. It is automatically destroyed when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);
+
+    area = NULL;
+    pgStatSharedHash = NULL;
+
+    /* We are going to exit. Don't bother destroying local hashes. */
+    pgStatLocalHash = NULL;
 }
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    /* standalone server doesn't use shared stats */
+    if (!IsUnderPostmaster)
+        return;
+
+    /* we must have shared stats attached */
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
+
+    /* Startup must be the only user of shared stats */
+    Assert(StatsShmem->refcount == 1);
+
+    /*
+     * We could directly remove the files and recreate the shared memory
+     * area, but for simplicity we just detach and re-attach.
+     */
+    detach_shared_stats(false);
+    attach_shared_stats();
 }
 
-#ifdef EXEC_BACKEND
 
 /*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
+ * fetch_lock_statentry - common helper function to fetch and lock a stats
+ * entry for flush_tabstat, flush_funcstat and flush_dbstat.
  */
-static pid_t
-pgstat_forkexec(void)
+static PgStat_StatEntryHeader *
+fetch_lock_statentry(PgStatTypes type, Oid dboid, Oid objid, bool nowait)
 {
-    char       *av[10];
-    int            ac = 0;
+    PgStat_StatEntryHeader *header;
+
+    /* find shared table stats entry corresponding to the local entry */
+    header = (PgStat_StatEntryHeader *)
+        get_stat_entry(type, dboid, objid, nowait, true, NULL);
 
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
+    /* skip if dshash failed to acquire lock */
+    if (header == NULL)
+        return NULL;
 
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&header->lock, LW_EXCLUSIVE))
+        return NULL;
 
-    return postmaster_forkexec(ac, av);
+    return header;
 }
-#endif                            /* EXEC_BACKEND */
 
 
-/*
- * pgstat_start() -
+/* ----------
+ * get_stat_entry() -
  *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
+ *    Get the shared stats entry for the specified type, dbid and objid.
+ *  If nowait is true, returns NULL on lock failure.
+ *
+ *  If create is true, a new entry is created when none exists yet. If found
+ *  is not NULL, it is set to true if an existing entry is found or false if
+ *  not.
+ *  ----------
+ */
+static PgStat_StatEntryHeader *
+get_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait, bool create,
+               bool *found)
+{
+    PgStatHashEntry           *shhashent;
+    PgStatLocalHashEntry   *lohashent;
+    PgStat_StatEntryHeader *shheader = NULL;
+    PgStatHashKey            key;
+    bool                    shfound;
+
+    key.type        = type;
+    key.databaseid     = dbid;
+    key.objectid    = objid;
+
+    if (pgStatEntHash)
+    {
+        uint64 currage;
+
+        /*
+         * pgStatEntHashAge advances much more slowly than the following loop
+         * runs, so this is expected to iterate no more than twice.
+         */
+        while (unlikely
+               (pgStatEntHashAge !=
+                (currage = pg_atomic_read_u64(&StatsShmem->gc_count))))
+        {
+            pgstat_localhash_iterator i;
+
+            /*
+             * Some entries have been dropped. Invalidate cache pointer to
+             * them.
+             */
+            pgstat_localhash_start_iterate(pgStatEntHash, &i);
+            while ((lohashent = pgstat_localhash_iterate(pgStatEntHash, &i))
+                   != NULL)
+            {
+                PgStat_StatEntryHeader *header = lohashent->body;
+                if (header->dropped)
+                {
+                    pgstat_localhash_delete(pgStatEntHash, lohashent->key);
+
+                    if (pg_atomic_sub_fetch_u32(&header->refcount, 1) < 1)
+                    {
+                        /*
+                         * We're the last referrer to this entry, drop the
+                         * shared entry.
+                         */
+                        dsa_free(area, lohashent->dsapointer);
+                    }
+                }
+            }
+
+            pgStatEntHashAge = currage;
+        }
+
+        lohashent = pgstat_localhash_lookup(pgStatEntHash, key);
+
+        if (lohashent)
+        {
+            if (found)
+                *found = true;
+            return lohashent->body;
+        }
+    }
+
+    shhashent = dshash_find_extended(pgStatSharedHash, &key,
+                                     create, nowait, create, &shfound);
+    if (shhashent)
+    {
+        if (create && !shfound)
+        {
+            /* Create new stats entry. */
+            dsa_pointer chunk = dsa_allocate0(area,
+                                              pgstat_sharedentsize[type]);
+
+            shheader = dsa_get_address(area, chunk);
+            LWLockInitialize(&shheader->lock, LWTRANCHE_STATS);
+            pg_atomic_init_u32(&shheader->refcount, 0);
+
+            /* Link the new entry from the hash entry. */
+            shhashent->body = chunk;
+        }
+        else
+            shheader = dsa_get_address(area, shhashent->body);
+
+        /*
+         * We expose this shared entry now.  You might think that the entry can
+         * be removed by a concurrent backend, but since we are creating a
+         * stats entry, the object actually exists and is in use by the upper
+         * layer. Such an object cannot be dropped until the first vacuum after
+         * the current transaction ends.
+         */
+        dshash_release_lock(pgStatSharedHash, shhashent);
+
+        /* register to local hash if possible */
+        if (pgStatEntHash || pgStatCacheContext)
+        {
+            bool                    lofound;
+
+            if (pgStatEntHash == NULL)
+            {
+                pgStatEntHash =
+                    pgstat_localhash_create(pgStatCacheContext,
+                                        PGSTAT_TABLE_HASH_SIZE, NULL);
+                pgStatEntHashAge =
+                    pg_atomic_read_u64(&StatsShmem->gc_count);
+            }
+
+            lohashent =
+                pgstat_localhash_insert(pgStatEntHash, key, &lofound);
+
+            Assert(!lofound);
+            lohashent->body = shheader;
+            lohashent->dsapointer = shhashent->body;
+
+            pg_atomic_add_fetch_u32(&shheader->refcount, 1);
+        }
+    }
+
+    if (found)
+        *found = shfound;
+
+    return shheader;
+}
+
+/*
+ * flush_walstat - flush out local WAL stats entries
  *
- *    Returns PID of child process, or 0 if fail.
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
  *
- *    Note: if fail, we will be called again from the postmaster main loop.
+ * Returns true if all local WAL stats are successfully flushed out.
  */
-int
-pgstat_start(void)
+static bool
+flush_walstat(bool nowait)
 {
-    time_t        curtime;
-    pid_t        pgStatPid;
+    PgStat_Wal *s = &StatsShmem->wal_stats;
+    PgStat_Wal *l = &WalStats;
+    WalUsage    all_zeroes = {0} PG_USED_FOR_ASSERTS_ONLY;
+
+    /*
+     * We don't update the WAL usage portion of the local WalStats
+     * elsewhere. Instead, fill in that portion with the difference of
+     * pgWalUsage since the previous call.
+     */
+    Assert(memcmp(&l->wal_usage, &all_zeroes, sizeof(WalUsage)) == 0);
+    WalUsageAccumDiff(&l->wal_usage, &pgWalUsage, &prevWalUsage);
 
     /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
+     * This function can be called even if nothing at all has happened. Avoid
+     * taking lock for nothing in that case.
      */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
+    if (!walstats_pending())
+        return true;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&StatsShmem->wal_stats_lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&StatsShmem->wal_stats_lock,
+                                       LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
+
+    s->wal_usage.wal_records += l->wal_usage.wal_records;
+    s->wal_usage.wal_fpi += l->wal_usage.wal_fpi;
+    s->wal_usage.wal_bytes += l->wal_usage.wal_bytes;
+    s->wal_buffers_full += l->wal_buffers_full;
+    LWLockRelease(&StatsShmem->wal_stats_lock);
 
     /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
+     * Save the current counters for the subsequent calculation of WAL usage.
      */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
+    prevWalUsage = pgWalUsage;
 
     /*
-     * Okay, fork off the collector.
+     * Clear out the statistics buffer, so it can be re-used.
      */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
+    MemSet(&WalStats, 0, sizeof(WalStats));
 
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
+    return true;
+}
 
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
+/*
+ * flush_slrustat - flush out local SLRU stats entries
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true. Writer processes are mutually excluded
+ * using an LWLock, but readers are expected to use the change-count protocol
+ * to avoid interference with writers.
+ *
+ * Returns true if all local SLRU stats are successfully flushed out.
+ */
+static bool
+flush_slrustat(bool nowait)
+{
+    uint32    assert_changecount PG_USED_FOR_ASSERTS_ONLY;
+    int i;
+
+    if (!have_slrustats)
+        return true;
 
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&StatsShmem->slru_stats.lock,
+                                       LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
 
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
+    assert_changecount =
+        pg_atomic_fetch_add_u32(&StatsShmem->slru_stats.changecount, 1);
+    Assert((assert_changecount & 1) == 0);
 
-        default:
-            return (int) pgStatPid;
+    for (i = 0 ; i < SLRU_NUM_ELEMENTS ; i++)
+    {
+        PgStat_SLRUStats *sharedent = &StatsShmem->slru_stats.entry[i];
+        PgStat_SLRUStats *localent = &local_SLRUStats[i];
+
+        sharedent->blocks_zeroed += localent->blocks_zeroed;
+        sharedent->blocks_hit += localent->blocks_hit;
+        sharedent->blocks_read += localent->blocks_read;
+        sharedent->blocks_written += localent->blocks_written;
+        sharedent->blocks_exists += localent->blocks_exists;
+        sharedent->flush += localent->flush;
+        sharedent->truncate += localent->truncate;
     }
 
-    /* shouldn't get here */
-    return 0;
+    /* done, clear the local entry */
+    MemSet(local_SLRUStats, 0,
+           sizeof(PgStat_SLRUStats) * SLRU_NUM_ELEMENTS);
+
+    pg_atomic_add_fetch_u32(&StatsShmem->slru_stats.changecount, 1);
+    LWLockRelease(&StatsShmem->slru_stats.lock);
+
+    have_slrustats = false;
+
+    return true;
+}
+
+/* ----------
+ * delete_current_stats_entry()
+ *
+ *  Deletes the given shared entry from shared stats hash. The entry must be
+ *  exclusively locked.
+ * ----------
+ */
+static void
+delete_current_stats_entry(dshash_seq_status *hstat)
+{
+    dsa_pointer                pdsa;
+    PgStat_StatEntryHeader *header;
+    PgStatHashEntry *ent;
+
+    ent = dshash_get_current(hstat);
+    pdsa = ent->body;
+    header = dsa_get_address(area, pdsa);
+
+    /* No one can find this entry from now on. */
+    dshash_delete_current(hstat);
+
+    /*
+     * Let the referrers drop the entry if any.  Refcount won't be decremented
+     * until "dropped" is set true and StatsShmem->gc_count is incremented
+     * later. So we can check refcount to set dropped without holding a
+     * lock. If no one is referring to this entry, free it immediately.
+     */
+
+    if (pg_atomic_read_u32(&header->refcount) > 0)
+        header->dropped = true;
+    else
+        dsa_free(area, pdsa);
+
+    return;
 }
 
-void
-allow_immediate_pgstat_restart(void)
+/* ----------
+ * get_local_stat_entry() -
+ *
+ *  Returns the local stats entry for the type, dbid and objid.
+ *  If create is true, a new entry is created when none exists yet; found
+ *  must be non-NULL in that case.
+ *
+ *
+ *  The caller is responsible for initializing the statsbody part of the
+ *  returned memory.
+ * ----------
+ */
+static PgStat_StatEntryHeader *
+get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                     bool create, bool *found)
 {
-    last_pgstat_start_time = 0;
+    PgStatHashKey key;
+    PgStatLocalHashEntry *entry;
+
+    if (pgStatLocalHash == NULL)
+        pgStatLocalHash = pgstat_localhash_create(pgStatCacheContext,
+                                                  PGSTAT_TABLE_HASH_SIZE, NULL);
+
+    /* Find an entry or create a new one. */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    if (create)
+        entry = pgstat_localhash_insert(pgStatLocalHash, key, found);
+    else
+        entry = pgstat_localhash_lookup(pgStatLocalHash, key);
+
+    if (!create && !entry)
+        return NULL;
+
+    if (create && !*found)
+        entry->body = MemoryContextAllocZero(TopMemoryContext,
+                                             pgstat_localentsize[type]);
+
+    return entry->body;
 }
 
 /* ------------------------------------------------------------
@@ -846,150 +1138,399 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *    Updates are applied no more frequently than once every
+ *    PGSTAT_MIN_INTERVAL milliseconds. They are also postponed on lock
+ *    failure if force is false and no update has been pending for longer
+ *    than PGSTAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
+ *    succeeding calls of this function.
+ *
+ *    Returns the time in milliseconds until updates can next be applied if
+ *    there are updates still pending (called too early or postponed on
+ *    lock failure), or 0 if nothing remains pending.
+ *
+ *    Note that this is called only outside a transaction, so it is fine to use
  *    transaction stop time as an approximation of current time.
- * ----------
+ *    ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz    next_flush = 0;
+    static TimestampTz    pending_since = 0;
+    static long            retry_interval = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
+    bool        nowait;
     int            i;
+    uint64        oldval;
+
+    /* Return if not active */
+    if (area == NULL)
+        return 0;
+
+    /*
+     * We need a database entry if any of the following stats exist.
+     */
+    if (pgStatXactCommit > 0 || pgStatXactRollback > 0 ||
+        pgStatBlockReadTime > 0 || pgStatBlockWriteTime > 0)
+        get_local_dbstat_entry(MyDatabaseId);
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (pgStatLocalHash == NULL && !have_slrustats && !walstats_pending())
+        return 0;
 
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
     now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
-
-    /*
-     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
-     */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
-    pgStatTabHash = NULL;
-
-    /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
-     */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
-    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+
+    if (!force)
     {
-        for (i = 0; i < tsa->tsa_used; i++)
+        /*
+         * Don't flush stats too frequently.  Return the time to the next
+         * flush.
+         */
+        if (now < next_flush)
         {
-            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
-
-            /* Shouldn't have any pending transaction-dependent counts */
-            Assert(entry->trans == NULL);
-
-            /*
-             * Ignore entries that didn't accumulate any actual counts, such
-             * as indexes that were opened by the planner but not used.
-             */
-            if (memcmp(&entry->t_counts, &all_zeroes,
-                       sizeof(PgStat_TableCounts)) == 0)
+            /* Record the epoch time if retrying. */
+            if (pending_since == 0)
+                pending_since = now;
+
+            return (next_flush - now) / 1000;
+        }
+
+        /* But, don't keep pending updates longer than PGSTAT_MAX_INTERVAL. */
+
+        if (pending_since > 0 &&
+            TimestampDifferenceExceeds(pending_since, now, PGSTAT_MAX_INTERVAL))
+            force = true;
+    }
+
+    /* don't wait for lock acquisition when !force */
+    nowait = !force;
+
+    if (pgStatLocalHash)
+    {
+        int            remains = 0;
+        pgstat_localhash_iterator i;
+        List                   *dbentlist = NIL;
+        ListCell               *lc;
+        PgStatLocalHashEntry   *lent;
+
+        /* Step 1: flush out stats other than database stats */
+        pgstat_localhash_start_iterate(pgStatLocalHash, &i);
+        while ((lent = pgstat_localhash_iterate(pgStatLocalHash, &i)) != NULL)
+        {
+            bool        remove = false;
+
+            switch (lent->key.type)
+            {
+                case PGSTAT_TYPE_DB:
+                    /*
+                     * flush_tabstat adds some counters of flushed entries
+                     * to the local database stats. Just remember the
+                     * database entries for now and flush them out later.
+                     */
+                    dbentlist = lappend(dbentlist, lent);
+                    break;
+                case PGSTAT_TYPE_TABLE:
+                    if (flush_tabstat(lent, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_FUNCTION:
+                    if (flush_funcstat(lent, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_REPLSLOT:
+                    /* We don't have that kind of local entry */
+                    Assert(false);
+            }
+
+            if (!remove)
+            {
+                remains++;
                 continue;
+            }
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* Remove the successfully flushed entry */
+            pfree(lent->body);
+            lent->body = NULL;
+            pgstat_localhash_delete(pgStatLocalHash, lent->key);
+        }
+
+        /* Step 2: flush out database stats */
+        foreach(lc, dbentlist)
+        {
+            PgStatLocalHashEntry *lent = (PgStatLocalHashEntry *) lfirst(lc);
+
+            if (flush_dbstat(lent, nowait))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                remains--;
+                /* Remove the successfully flushed entry */
+                pfree(lent->body);
+                lent->body = NULL;
+                pgstat_localhash_delete(pgStatLocalHash, lent->key);
             }
         }
-        /* zero out PgStat_TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
+        list_free(dbentlist);
+        dbentlist = NULL;
+
+        if (remains <= 0)
+        {
+            pgstat_localhash_destroy(pgStatLocalHash);
+            pgStatLocalHash = NULL;
+        }
+    }
+
+    /* flush wal stats */
+    flush_walstat(nowait);
+
+    /* flush SLRU stats */
+    flush_slrustat(nowait);
+
+    /*
+     * Publish the time of the last flush, but we don't notify the change of
+     * the timestamp itself. Readers will get a sufficiently recent timestamp.
+     * If we fail to update the value, concurrent processes should have
+     * updated it to a sufficiently recent time.
+     *
+     * XXX: The loop might be unnecessary for the reason above.
+     */
+    oldval = pg_atomic_read_u64(&StatsShmem->stats_timestamp);
+
+    for (i = 0 ; i < 10 ; i++)
+    {
+        if (oldval >= now ||
+            pg_atomic_compare_exchange_u64(&StatsShmem->stats_timestamp,
+                                           &oldval, (uint64) now))
+            break;
     }
 
     /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
+     * Some of the local stats may not have been flushed due to lock
+     * contention.  If we have such pending local stats here, let the caller
+     * know the retry interval.
      */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    if (pgStatLocalHash != NULL || have_slrustats || walstats_pending())
+    {
+        /* Retain the epoch time */
+        if (pending_since == 0)
+            pending_since = now;
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+        /* The interval is doubled at every retry. */
+        if (retry_interval == 0)
+            retry_interval = PGSTAT_RETRY_MIN_INTERVAL * 1000;
+        else
+            retry_interval = retry_interval * 2;
 
-    /* Send WAL statistics */
-    pgstat_send_wal();
+        /*
+         * Determine the next retry interval so as not to get shorter than the
+         * previous interval.
+         */
+        if (!TimestampDifferenceExceeds(pending_since,
+                                        now + 2 * retry_interval,
+                                        PGSTAT_MAX_INTERVAL))
+            next_flush = now + retry_interval;
+        else
+        {
+            next_flush = pending_since + PGSTAT_MAX_INTERVAL * 1000;
+            retry_interval = next_flush - now;
+        }
 
-    /* Finally send SLRU statistics */
-    pgstat_send_slru();
+        return retry_interval / 1000;
+    }
+
+    /* Set the next time to update stats */
+    next_flush = now + PGSTAT_MIN_INTERVAL * 1000;
+    retry_interval = 0;
+    pending_since = 0;
+
+    return 0;
 }
 
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * flush_tabstat - flush out a local table stats entry
+ *
+ * Some of the stats numbers are copied to the local database stats entry
+ * after a successful flush-out.
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
  */
-static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+static bool
+flush_tabstat(PgStatLocalHashEntry *ent, bool nowait)
 {
-    int            n;
-    int            len;
+    static const PgStat_TableCounts all_zeroes;
+    Oid                    dboid;            /* database OID of the table */
+    PgStat_TableStatus *lstats;            /* local stats entry  */
+    PgStat_StatTabEntry *shtabstats;    /* table entry of shared stats */
+    PgStat_StatDBEntry *ldbstats;        /* local database entry */
+
+    Assert(ent->key.type == PGSTAT_TYPE_TABLE);
+    lstats = (PgStat_TableStatus *) ent->body;
+    dboid = ent->key.databaseid;
+
+    /*
+     * Ignore entries that didn't accumulate any actual counts, such as
+     * indexes that were opened by the planner but not used.
+     */
+    if (memcmp(&lstats->t_counts, &all_zeroes,
+               sizeof(PgStat_TableCounts)) == 0)
+    {
+        /* This local entry is going to be dropped, delink from relcache. */
+        pgstat_delinkstats(lstats->relation);
+        return true;
+    }
+
+    /* find shared table stats entry corresponding to the local entry */
+    shtabstats = (PgStat_StatTabEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_TABLE, dboid, ent->key.objectid,
+                             nowait);
+
+    if (shtabstats == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    /* add the values to the shared entry. */
+    shtabstats->numscans += lstats->t_counts.t_numscans;
+    shtabstats->tuples_returned += lstats->t_counts.t_tuples_returned;
+    shtabstats->tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    shtabstats->tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    shtabstats->tuples_updated += lstats->t_counts.t_tuples_updated;
+    shtabstats->tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    shtabstats->tuples_hot_updated += lstats->t_counts.t_tuples_hot_updated;
+
+    /*
+     * If table was truncated or vacuum/analyze has run, first reset the
+     * live/dead counters.
+     */
+    if (lstats->t_counts.t_truncated)
+    {
+        shtabstats->n_live_tuples = 0;
+        shtabstats->n_dead_tuples = 0;
+    }
+
+    shtabstats->n_live_tuples += lstats->t_counts.t_delta_live_tuples;
+    shtabstats->n_dead_tuples += lstats->t_counts.t_delta_dead_tuples;
+    shtabstats->changes_since_analyze += lstats->t_counts.t_changed_tuples;
+    shtabstats->inserts_since_vacuum += lstats->t_counts.t_tuples_inserted;
+    shtabstats->blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    shtabstats->blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    shtabstats->n_live_tuples = Max(shtabstats->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    shtabstats->n_dead_tuples = Max(shtabstats->n_dead_tuples, 0);
+
+    LWLockRelease(&shtabstats->header.lock);
+
+    /* The entry was flushed successfully, so add the same to database stats */
+    ldbstats = get_local_dbstat_entry(dboid);
+    ldbstats->counts.n_tuples_returned += lstats->t_counts.t_tuples_returned;
+    ldbstats->counts.n_tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    ldbstats->counts.n_tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    ldbstats->counts.n_tuples_updated += lstats->t_counts.t_tuples_updated;
+    ldbstats->counts.n_tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    ldbstats->counts.n_blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    ldbstats->counts.n_blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /* This local entry is going to be dropped, delink from relcache. */
+    pgstat_delinkstats(lstats->relation);
+
+    return true;
+}
+
+/*
+ * flush_funcstat - flush out a local function stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_funcstat(PgStatLocalHashEntry *ent, bool nowait)
+{
+    PgStat_BackendFunctionEntry *localent;    /* local stats entry */
+    PgStat_StatFuncEntry *shfuncent = NULL; /* shared stats entry */
+
+    Assert(ent->key.type == PGSTAT_TYPE_FUNCTION);
+    localent = (PgStat_BackendFunctionEntry *) ent->body;
+
+    /* localent always has non-zero content */
+
+    /* find shared function stats entry corresponding to the local entry */
+    shfuncent = (PgStat_StatFuncEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             ent->key.objectid, nowait);
+    if (shfuncent == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    shfuncent->f_numcalls += localent->f_counts.f_numcalls;
+    shfuncent->f_total_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_total_time);
+    shfuncent->f_self_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_self_time);
+
+    LWLockRelease(&shfuncent->header.lock);
+
+    return true;
+}
+
+
+/*
+ * flush_dbstat - flush out a local database stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_dbstat(PgStatLocalHashEntry *ent, bool nowait)
+{
+    PgStat_StatDBEntry *localent;
+    PgStat_StatDBEntry *sharedent;
+
+    Assert(ent->key.type == PGSTAT_TYPE_DB);
+
+    localent = (PgStat_StatDBEntry *) &ent->body;
+
+    /* find shared database stats entry corresponding to the local entry */
+    sharedent = (PgStat_StatDBEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_DB, ent->key.databaseid, InvalidOid,
+                             nowait);
+
+    if (!sharedent)
+        return false;            /* failed to acquire lock, skip */
+
+    sharedent->counts.n_tuples_returned += localent->counts.n_tuples_returned;
+    sharedent->counts.n_tuples_fetched += localent->counts.n_tuples_fetched;
+    sharedent->counts.n_tuples_inserted += localent->counts.n_tuples_inserted;
+    sharedent->counts.n_tuples_updated += localent->counts.n_tuples_updated;
+    sharedent->counts.n_tuples_deleted += localent->counts.n_tuples_deleted;
+    sharedent->counts.n_blocks_fetched += localent->counts.n_blocks_fetched;
+    sharedent->counts.n_blocks_hit += localent->counts.n_blocks_hit;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    sharedent->counts.n_deadlocks += localent->counts.n_deadlocks;
+    sharedent->counts.n_temp_bytes += localent->counts.n_temp_bytes;
+    sharedent->counts.n_temp_files += localent->counts.n_temp_files;
+    sharedent->counts.n_checksum_failures += localent->counts.n_checksum_failures;
 
     /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
+     * Accumulate xact commit/rollback and I/O timings to stats entry of the
+     * current database.
      */
-    if (OidIsValid(tsmsg->m_databaseid))
+    if (OidIsValid(ent->key.databaseid))
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
+        sharedent->counts.n_xact_commit += pgStatXactCommit;
+        sharedent->counts.n_xact_rollback += pgStatXactRollback;
+        sharedent->counts.n_block_read_time += pgStatBlockReadTime;
+        sharedent->counts.n_block_write_time += pgStatBlockWriteTime;
         pgStatXactCommit = 0;
         pgStatXactRollback = 0;
         pgStatBlockReadTime = 0;
@@ -997,282 +1538,138 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        sharedent->counts.n_xact_commit = 0;
+        sharedent->counts.n_xact_rollback = 0;
+        sharedent->counts.n_block_read_time = 0;
+        sharedent->counts.n_block_write_time = 0;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
+    LWLockRelease(&sharedent->header.lock);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    return true;
 }
 
-/*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
- */
-static void
-pgstat_send_funcstats(void)
-{
-    /* we assume this inits to all zeroes: */
-    static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
-
-    if (pgStatFunctions == NULL)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
-
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
-
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
-
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
-
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
-
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
-}
-
-
 /* ----------
  * pgstat_vacuum_stat() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *  Delete shared stat entries that are not in system catalogs.
+ *
+ *  To avoid holding exclusive lock on dshash for a long time, the process is
+ *  performed in three steps.
+ *
+ *   1: Collect existent oids of every kind of object.
+ *   2: Collect victim entries by scanning with shared lock.
+ *   3: Try removing every nominated entry without waiting for lock.
+ *
+ *  As a consequence of the last step, some entries may be left alone due to
+ *  lock failure, but they will be deleted by later invocations of
+ *  pgstat_vacuum_stat().
  * ----------
  */
 void
 pgstat_vacuum_stat(void)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Read pg_database and make a list of OIDs of all existing databases
-     */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
-
-    /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            dbid = dbentry->databaseid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        /* the DB entry for shared tables (with InvalidOid) is never dropped */
-        if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
-            pgstat_drop_database(dbid);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Lookup our own database entry; if not found, nothing more to do.
-     */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
+    pgstat_oid_hash       *dbids;    /* database ids */
+    pgstat_oid_hash       *relids;    /* relation ids in the current database */
+    pgstat_oid_hash       *funcids;/* function ids in the current database */
+    int            nvictims = 0;    /* # of entries of the above */
+    dshash_seq_status dshstat;
+    PgStatHashEntry *ent;
+
+    /* we don't collect stats under standalone mode */
+    if (!IsUnderPostmaster)
         return;
 
-    /*
-     * Similarly to above, make a list of all known relations in this DB.
-     */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
+    /* collect oids of existent objects */
+    dbids = collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    relids = collect_oids(RelationRelationId, Anum_pg_class_oid);
+    funcids = collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
 
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
+    nvictims = 0;
 
-    /*
-     * Check for all tables listed in stats hashtable if they still exist.
-     */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+    /* some of the dshash entries are to be removed, take exclusive lock. */
+    dshash_seq_init(&dshstat, pgStatSharedHash, true);
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
     {
-        Oid            tabid = tabentry->tableid;
+        pgstat_oid_hash *oidhash;
+        Oid           key;
 
         CHECK_FOR_INTERRUPTS();
 
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+        /*
+         * Don't drop entries other than database entries unless they belong
+         * to the current database.
+         */
+        if (ent->key.type != PGSTAT_TYPE_DB &&
+            ent->key.databaseid != MyDatabaseId)
             continue;
 
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
-    }
-
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
+        switch (ent->key.type)
         {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
+            case PGSTAT_TYPE_DB:
+                /* don't remove database entry for shared tables */
+                if (ent->key.databaseid == 0)
+                    continue;
+                oidhash = dbids;
+                key = ent->key.databaseid;
+                break;
+
+            case PGSTAT_TYPE_TABLE:
+                oidhash = relids;
+                key = ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                oidhash = funcids;
+                key = ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_REPLSLOT:
+                /*
+                 * We don't bother vacuuming this kind of entry because the
+                 * number of entries is quite small and entries are likely to
+                 * be reused soon.
+                 */
                 continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
         }
 
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
+        /* Skip existent objects. */
+        if (pgstat_oid_lookup(oidhash, key) != NULL)
+            continue;
 
-        hash_destroy(htab);
+        /* drop this entry */
+        delete_current_stats_entry(&dshstat);
+        nvictims++;
     }
+    dshash_seq_term(&dshstat);
+    pgstat_oid_destroy(dbids);
+    pgstat_oid_destroy(relids);
+    pgstat_oid_destroy(funcids);
+
+    if (nvictims > 0)
+        pg_atomic_add_fetch_u64(&StatsShmem->gc_count, 1);
 }
 
-
 /* ----------
- * pgstat_collect_oids() -
+ * collect_oids() -
  *
  *    Collect the OIDs of all objects listed in the specified system catalog
- *    into a temporary hash table.  Caller should hash_destroy the result
+ *    into a temporary hash table.  Caller should pgstat_oid_destroy the result
  *    when done with it.  (However, we make the table in CurrentMemoryContext
  *    so that it will be freed properly in event of an error.)
  * ----------
  */
-static HTAB *
-pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
+static pgstat_oid_hash *
+collect_oids(Oid catalogid, AttrNumber anum_oid)
 {
-    HTAB       *htab;
-    HASHCTL        hash_ctl;
+    pgstat_oid_hash *rethash;
     Relation    rel;
     TableScanDesc scan;
     HeapTuple    tup;
     Snapshot    snapshot;
 
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(Oid);
-    hash_ctl.hcxt = CurrentMemoryContext;
-    htab = hash_create("Temporary table of OIDs",
-                       PGSTAT_TAB_HASH_SIZE,
-                       &hash_ctl,
-                       HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    rethash = pgstat_oid_create(CurrentMemoryContext,
+                                PGSTAT_TABLE_HASH_SIZE, NULL);
 
     rel = table_open(catalogid, AccessShareLock);
     snapshot = RegisterSnapshot(GetLatestSnapshot());
@@ -1281,81 +1678,61 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     {
         Oid            thisoid;
         bool        isnull;
+        bool        found;
 
         thisoid = heap_getattr(tup, anum_oid, RelationGetDescr(rel), &isnull);
         Assert(!isnull);
 
         CHECK_FOR_INTERRUPTS();
 
-        (void) hash_search(htab, (void *) &thisoid, HASH_ENTER, NULL);
+        pgstat_oid_insert(rethash, thisoid, &found);
     }
     table_endscan(scan);
     UnregisterSnapshot(snapshot);
     table_close(rel, AccessShareLock);
 
-    return htab;
+    return rethash;
 }
 
-
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
- * ----------
+ *    Remove entry for the database that we just dropped.
+ *
+ *    Some entries might be left alone due to lock failure, or some stats may
+ *    be flushed after this, but we will still clean the dead DB eventually
+ *    via future invocations of pgstat_vacuum_stat().
+ * ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
 
+    Assert(OidIsValid(databaseid));
 
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
-
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
+    /* some of the dshash entries are to be removed, take exclusive lock. */
+    dshash_seq_init(&hstat, pgStatSharedHash, true);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+    {
+        if (p->key.databaseid == databaseid)
+            delete_current_stats_entry(&hstat);
+    }
+    dshash_seq_term(&hstat);
+
+    /* Let readers run a garbage collection of local hashes */
+    pg_atomic_add_fetch_u64(&StatsShmem->gc_count, 1);
 }
-#endif                            /* NOT_USED */
 
 
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1364,53 +1741,146 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    /* dshash entry is not modified, take shared lock */
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+    {
+        PgStat_StatEntryHeader *header;
+
+        if (p->key.databaseid != MyDatabaseId)
+            continue;
+
+        header = dsa_get_address(area, p->body);
+
+        LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+        memset(PGSTAT_SHENT_BODY(header), 0,
+               PGSTAT_SHENT_BODY_LEN(p->key.type));
+
+        if (p->key.type == PGSTAT_TYPE_DB)
+        {
+            PgStat_StatDBEntry *dbstat = (PgStat_StatDBEntry *) header;
+            dbstat->stat_reset_timestamp = GetCurrentTimestamp();
+        }
+        LWLockRelease(&header->lock);
+    }
+    dshash_seq_term(&hstat);
+
+    /* Invalidate the simple cache keys */
+    cached_dbent_key = stathashkey_zero;
+    cached_tabent_key = stathashkey_zero;
+    cached_funcent_key = stathashkey_zero;
+}
+
+/*
+ * pgstat_copy_global_stats - helper function for functions
+ *           pgstat_fetch_stat_*() and pgstat_reset_shared_counters().
+ *
+ * Copies out the specified memory area following change-count protocol.
+ */
+static inline void
+pgstat_copy_global_stats(void *dst, void *src, size_t len,
+                         pg_atomic_uint32 *count)
+{
+    int before_changecount;
+    int after_changecount;
+
+    after_changecount = pg_atomic_read_u32(count);
+
+    do
+    {
+        before_changecount = after_changecount;
+        memcpy(dst, src, len);
+        after_changecount = pg_atomic_read_u32(count);
+    } while ((before_changecount & 1) == 1 ||
+             after_changecount != before_changecount);
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
+ *
+ *  We don't scribble on shared stats while resetting to avoid locking on
+ *  shared stats struct. Instead, just record the current counters in another
+ *  shared struct, which is protected by StatsLock. See
+ *  pgstat_fetch_stat_(archiver|bgwriter|checkpointer) for the reader side.
  * ----------
  */
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    TimestampTz now = GetCurrentTimestamp();
+    PgStat_Shared_Reset_Target t;
 
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+        t = RESET_ARCHIVER;
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+        t = RESET_BGWRITER;
     else if (strcmp(target, "wal") == 0)
-        msg.m_resettarget = RESET_WAL;
+        t = RESET_WAL;
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\", \"bgwriter\" or \"wal\".")));
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
+    /* Reset statistics for the cluster. */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+    switch (t)
+    {
+        case RESET_ARCHIVER:
+            pgstat_copy_global_stats(&StatsShmem->archiver_reset_offset,
+                                     &StatsShmem->archiver_stats,
+                                     sizeof(PgStat_Archiver),
+                                     &StatsShmem->archiver_changecount);
+            StatsShmem->archiver_reset_offset.stat_reset_timestamp = now;
+            cached_archiverstats_is_valid = false;
+            break;
+
+        case RESET_BGWRITER:
+            pgstat_copy_global_stats(&StatsShmem->bgwriter_reset_offset,
+                                     &StatsShmem->bgwriter_stats,
+                                     sizeof(PgStat_BgWriter),
+                                     &StatsShmem->bgwriter_changecount);
+            pgstat_copy_global_stats(&StatsShmem->checkpointer_reset_offset,
+                                     &StatsShmem->checkpointer_stats,
+                                     sizeof(PgStat_CheckPointer),
+                                     &StatsShmem->checkpointer_changecount);
+            StatsShmem->bgwriter_reset_offset.stat_reset_timestamp = now;
+            cached_bgwriterstats_is_valid = false;
+            cached_checkpointerstats_is_valid = false;
+            break;
+
+        case RESET_WAL:
+            /*
+             * Unlike the two above, WAL statistics have many writer
+             * processes and are protected by wal_stats_lock.
+             */
+            LWLockAcquire(&StatsShmem->wal_stats_lock, LW_EXCLUSIVE);
+            MemSet(&StatsShmem->wal_stats, 0, sizeof(PgStat_Wal));
+            StatsShmem->wal_stats.stat_reset_timestamp = now;
+            LWLockRelease(&StatsShmem->wal_stats_lock);
+            cached_walstats_is_valid = false;
+            break;
+    }
+
+    LWLockRelease(StatsLock);
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1419,17 +1889,37 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatEntryHeader *header;
+    PgStat_StatDBEntry *dbentry;
+    PgStatTypes stattype;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    dbentry = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, MyDatabaseId, InvalidOid, false, false,
+                       NULL);
+    Assert(dbentry);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* Set the reset timestamp for the whole database */
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->header.lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&dbentry->header.lock);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Remove object if it exists, ignore if not */
+    switch (type)
+    {
+        case RESET_TABLE:
+            stattype = PGSTAT_TYPE_TABLE;
+            break;
+        case RESET_FUNCTION:
+            stattype = PGSTAT_TYPE_FUNCTION;
+    }
+
+    header = get_stat_entry(stattype, MyDatabaseId, objoid, false, false, NULL);
+
+    LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+    memset(PGSTAT_SHENT_BODY(header), 0, PGSTAT_SHENT_BODY_LEN(stattype));
+    LWLockRelease(&header->lock);
 }
 
 /* ----------
@@ -1445,15 +1935,40 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_reset_slru_counter(const char *name)
 {
-    PgStat_MsgResetslrucounter msg;
+    int i;
+    TimestampTz    ts = GetCurrentTimestamp();
+    uint32    assert_changecount PG_USED_FOR_ASSERTS_ONLY;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    if (name)
+    {
+        i = pgstat_slru_index(name);
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+        assert_changecount =
+            pg_atomic_fetch_add_u32(&StatsShmem->slru_changecount, 1);
+        Assert((assert_changecount & 1) == 0);
+        MemSet(&StatsShmem->slru_stats.entry[i], 0,
+               sizeof(PgStat_SLRUStats));
+        StatsShmem->slru_stats.entry[i].stat_reset_timestamp = ts;
+        pg_atomic_add_fetch_u32(&StatsShmem->slru_changecount, 1);
+        LWLockRelease(&StatsShmem->slru_stats.lock);
+    }
+    else
+    {
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+        assert_changecount =
+            pg_atomic_fetch_add_u32(&StatsShmem->slru_changecount, 1);
+        Assert((assert_changecount & 1) == 0);
+        for (i = 0 ; i < SLRU_NUM_ELEMENTS; i++)
+        {
+            MemSet(&StatsShmem->slru_stats.entry[i], 0,
+                   sizeof(PgStat_SLRUStats));
+            StatsShmem->slru_stats.entry[i].stat_reset_timestamp = ts;
+        }
+        pg_atomic_add_fetch_u32(&StatsShmem->slru_changecount, 1);
+        LWLockRelease(&StatsShmem->slru_stats.lock);
+    }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSLRUCOUNTER);
-    msg.m_index = (name) ? pgstat_slru_index(name) : -1;
-
-    pgstat_send(&msg, sizeof(msg));
+    cached_slrustats_is_valid = false;
 }
 
 /* ----------
@@ -1469,20 +1984,19 @@ pgstat_reset_slru_counter(const char *name)
 void
 pgstat_reset_replslot_counter(const char *name)
 {
-    PgStat_MsgResetreplslotcounter msg;
+    int            startidx;
+    int            endidx;
+    int            i;
+    TimestampTz    ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
     if (name)
     {
         ReplicationSlot *slot;
-
-        /*
-         * Check if the slot exits with the given name. It is possible that by
-         * the time this message is executed the slot is dropped but at least
-         * this check will ensure that the given name is for a valid slot.
-         */
+
+        /* Check if a slot with the given name exists. */
         LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
         slot = SearchNamedReplicationSlot(name);
         LWLockRelease(ReplicationSlotControlLock);
@@ -1500,15 +2014,36 @@ pgstat_reset_replslot_counter(const char *name)
         if (SlotIsPhysical(slot))
             return;
 
-        strlcpy(msg.m_slotname, name, NAMEDATALEN);
-        msg.clearall = false;
+        /* reset this one entry */
+        startidx = endidx = slot - ReplicationSlotCtl->replication_slots;
     }
     else
-        msg.clearall = true;
+    {
+        /* reset all existing entries */
+        startidx = 0;
+        endidx = max_replication_slots - 1;
+    }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETREPLSLOTCOUNTER);
+    ts = GetCurrentTimestamp();
+    for (i = startidx ; i <= endidx ; i++)
+    {
+        PgStat_ReplSlot *shent;
 
-    pgstat_send(&msg, sizeof(msg));
+        shent = (PgStat_ReplSlot *)
+            get_stat_entry(PGSTAT_TYPE_REPLSLOT,
+                           MyDatabaseId, i, false, false, NULL);
+
+        /* Skip non-existent entries */
+        if (!shent)
+            continue;
+
+        LWLockAcquire(&shent->header.lock, LW_EXCLUSIVE);
+        memset(&shent->spill_txns, 0,
+               offsetof(PgStat_ReplSlot, stat_reset_timestamp) -
+               offsetof(PgStat_ReplSlot, spill_txns));
+        shent->stat_reset_timestamp = ts;
+        LWLockRelease(&shent->header.lock);
+    }
 }
 
 /* ----------
@@ -1522,48 +2057,93 @@ pgstat_reset_replslot_counter(const char *name)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if activity stats is not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    /*
+     * End-of-vacuum is reported instantly. Report the start the same way for
+     * consistency. Vacuum doesn't run frequently and is a long-lasting
+     * operation so it doesn't matter if we get blocked here a little.
+     */
+    dbentry = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, dboid, InvalidOid, false, true, NULL);
 
-    pgstat_send(&msg, sizeof(msg));
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->header.lock, LW_EXCLUSIVE);
+    dbentry->last_autovac_time = ts;
+    LWLockRelease(&dbentry->header.lock);
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report about the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    PgStat_StatTabEntry *tabentry;
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    ts = GetCurrentTimestamp();
+
+    /*
+     * Unlike ordinary operations, maintenance commands take a long time, so
+     * getting blocked at the end of the work doesn't matter. Furthermore,
+     * this prevents stats updates made by transactions that end after this
+     * vacuum from being overwritten by a delayed vacuum report.  Update the
+     * shared stats entry directly for these reasons.
+     */
+    tabentry = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, tableoid, false, true, NULL);
+
+    LWLockAcquire(&tabentry->header.lock, LW_EXCLUSIVE);
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * It is quite possible that a non-aggressive VACUUM ended up skipping
+     * various pages, however, we'll zero the insert counter here regardless.
+     * It's currently used only to track when we need to perform an "insert"
+     * autovacuum, which are mainly intended to freeze newly inserted tuples.
+     * Zeroing this may just mean we'll not try to vacuum the table again
+     * until enough tuples have been inserted to trigger another insert
+     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
+     * stragglers.
+     */
+    tabentry->inserts_since_vacuum = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = ts;
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = ts;
+        tabentry->vacuum_count++;
+    }
+
+    LWLockRelease(&tabentry->header.lock);
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report about the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1574,9 +2154,11 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    PgStat_StatTabEntry *tabentry;
+    Oid        dboid = (rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId);
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1584,10 +2166,10 @@ pgstat_report_analyze(Relation rel,
      * already inserted and/or deleted rows in the target table. ANALYZE will
      * have counted such rows as live or dead respectively. Because we will
      * report our counts of such rows at transaction end, we should subtract
-     * off these counts from what we send to the collector now, else they'll
-     * be double-counted after commit.  (This approach also ensures that the
-     * collector ends up with the right numbers if we abort instead of
-     * committing.)
+     * off these counts from what we write to shared stats now, else they'll
+     * be double-counted after commit.  (This approach also ensures that the
+     * shared stats end up with the right numbers if we abort instead of
+     * committing.)
      */
     if (rel->pgstat_info != NULL)
     {
@@ -1605,137 +2187,223 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Unlike ordinary operations, maintenance commands take a long time, so
+     * getting blocked at the end of the work doesn't matter. Furthermore,
+     * this prevents stats updates made by transactions that end after this
+     * analyze from being overwritten by a delayed analyze report.  Update the
+     * shared stats entry directly for these reasons.
+     */
+    tabentry = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, RelationGetRelid(rel),
+                       false, true, NULL);
+
+    LWLockAcquire(&tabentry->header.lock, LW_EXCLUSIVE);
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    LWLockRelease(&tabentry->header.lock);
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            dbent->counts.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            dbent->counts.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            dbent->counts.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            dbent->counts.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            dbent->counts.n_conflict_startup_deadlock++;
+            break;
+    }
 }
 
+
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_deadlocks++;
 }
 
-
-
 /* --------
- * pgstat_report_checksum_failures_in_db() -
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
  *
- *    Tell the collector about one or more checksum failures.
+ *    Report one or more checksum failures.
  * --------
  */
 void
 pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
 {
-    PgStat_MsgChecksumFailure msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
+    dbentry = get_local_dbstat_entry(dboid);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* add the reported failures to the accumulated count */
+    dbentry->counts.n_checksum_failures += failurecount;
 }
 
 /* --------
  * pgstat_report_checksum_failure() -
  *
- *    Tell the collector about a checksum failure.
+ *    Report a checksum failure.
  * --------
  */
 void
 pgstat_report_checksum_failure(void)
 {
-    pgstat_report_checksum_failures_in_db(MyDatabaseId, 1);
+    PgStat_StatDBEntry *dbent;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_checksum_failures++;
 }
 
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
+    if (filesize == 0)            /* Is there a case where filesize is really 0? */
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_temp_bytes += filesize; /* needs check overflow */
+    dbent->counts.n_temp_files++;
 }
 
 /* ----------
  * pgstat_report_replslot() -
  *
- *    Tell the collector about replication slot statistics.
+ *    Report replication slot activity.
  * ----------
  */
 void
-pgstat_report_replslot(const char *slotname, int spilltxns, int spillcount,
-                       int spillbytes, int streamtxns, int streamcount, int streambytes)
+pgstat_report_replslot(const char *slotname,
+                       int spilltxns, int spillcount, int spillbytes,
+                       int streamtxns, int streamcount, int streambytes)
 {
-    PgStat_MsgReplSlot msg;
+    PgStat_ReplSlot *shent;
+    int                 i;
+    bool             found;
+
+    if (!area)
+        return;
+
+    for (i = 0 ; i < max_replication_slots ; i++)
+    {
+        if (strcmp(NameStr(ReplicationSlotCtl->replication_slots[i].data.name),
+                   slotname) == 0)
+            break;
+
+    }
 
     /*
-     * Prepare and send the message
+     * The slot must have been removed, so just ignore it.  An entry for a
+     * slot with this name will be created the next time it is reported.
      */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_REPLSLOT);
-    strlcpy(msg.m_slotname, slotname, NAMEDATALEN);
-    msg.m_drop = false;
-    msg.m_spill_txns = spilltxns;
-    msg.m_spill_count = spillcount;
-    msg.m_spill_bytes = spillbytes;
-    msg.m_stream_txns = streamtxns;
-    msg.m_stream_count = streamcount;
-    msg.m_stream_bytes = streambytes;
-    pgstat_send(&msg, sizeof(PgStat_MsgReplSlot));
+    if (i == max_replication_slots)
+        return;
+
+    shent = (PgStat_ReplSlot *)
+        get_stat_entry(PGSTAT_TYPE_REPLSLOT,
+                       MyDatabaseId, i, false, true, &found);
+
+    /* Clear the counters and reset dropped when we reuse it */
+    LWLockAcquire(&shent->header.lock, LW_EXCLUSIVE);
+    if (shent->header.dropped || !found)
+    {
+        memset(&shent->spill_txns, 0,
+               sizeof(PgStat_ReplSlot) - offsetof(PgStat_ReplSlot, spill_txns));
+        strlcpy(shent->slotname, slotname, NAMEDATALEN);
+        shent->header.dropped = false;
+    }
+
+    shent->spill_txns += spilltxns;
+    shent->spill_count += spillcount;
+    shent->spill_bytes += spillbytes;
+    shent->stream_txns += streamtxns;
+    shent->stream_count += streamcount;
+    shent->stream_bytes += streambytes;
+    LWLockRelease(&shent->header.lock);
 }
 
 /* ----------
@@ -1747,55 +2415,44 @@ pgstat_report_replslot(const char *slotname, int spilltxns, int spillcount,
 void
 pgstat_report_replslot_drop(const char *slotname)
 {
-    PgStat_MsgReplSlot msg;
+    int i;
+    PgStat_ReplSlot *shent;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_REPLSLOT);
-    strlcpy(msg.m_slotname, slotname, NAMEDATALEN);
-    msg.m_drop = true;
-    pgstat_send(&msg, sizeof(PgStat_MsgReplSlot));
-}
+    Assert(area);
+    if (!area)
+        return;
 
-/* ----------
- * pgstat_ping() -
- *
- *    Send some junk data to the collector to increase traffic.
- * ----------
- */
-void
-pgstat_ping(void)
-{
-    PgStat_MsgDummy msg;
+    for (i = 0 ; i < max_replication_slots ; i++)
+    {
+        if (strcmp(NameStr(ReplicationSlotCtl->replication_slots[i].data.name),
+                   slotname) == 0)
+            break;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    /* XXX: maybe the slot has been removed. Just ignore it. */
+    if (i == max_replication_slots)
+        return;
+
+    shent = (PgStat_ReplSlot *)
+        get_stat_entry(PGSTAT_TYPE_REPLSLOT,
+                       MyDatabaseId, i, false, false, NULL);
+
+    if (shent && !shent->header.dropped)
+    {
+        LWLockAcquire(&shent->header.lock, LW_EXCLUSIVE);
+        shent->header.dropped = true;
+        LWLockRelease(&shent->header.lock);
+    }
 }
 
 /* ----------
- * pgstat_send_inquiry() -
+ * pgstat_init_function_usage() -
  *
- *    Notify collector that we need fresh data.
+ *  Initialize function call usage data.
+ *  Called by the executor before invoking a function.
  * ----------
  */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/*
- * Initialize function call usage data.
- * Called by the executor before invoking a function.
- */
 void
 pgstat_init_function_usage(FunctionCallInfo fcinfo,
                            PgStat_FunctionCallUsage *fcu)
@@ -1810,25 +2467,9 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
         return;
     }
 
-    if (!pgStatFunctions)
-    {
-        /* First time through - initialize function stat table */
-        HASHCTL        hash_ctl;
-
-        memset(&hash_ctl, 0, sizeof(hash_ctl));
-        hash_ctl.keysize = sizeof(Oid);
-        hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
-        pgStatFunctions = hash_create("Function stat entries",
-                                      PGSTAT_FUNCTION_HASH_SIZE,
-                                      &hash_ctl,
-                                      HASH_ELEM | HASH_BLOBS);
-    }
-
-    /* Get the stats entry for this function, create if necessary */
-    htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
-                          HASH_ENTER, &found);
-    if (!found)
-        MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
+    htabent = (PgStat_BackendFunctionEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             fcinfo->flinfo->fn_oid, true, &found);
 
     fcu->fs = &htabent->f_counts;
 
@@ -1842,31 +2483,37 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
     INSTR_TIME_SET_CURRENT(fcu->f_start);
 }
 
-/*
- * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
- *        for specified function
+/* ----------
+ * find_funcstat_entry() -
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_BackendFunctionEntry entry for specified function.
+ *
+ *  If there is no entry, return NULL without creating a new one.
+ * ----------
  */
 PgStat_BackendFunctionEntry *
 find_funcstat_entry(Oid func_id)
 {
-    if (pgStatFunctions == NULL)
-        return NULL;
+    PgStat_BackendFunctionEntry *ent;
 
-    return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
-                                                       (void *) &func_id,
-                                                       HASH_FIND, NULL);
+    ent = (PgStat_BackendFunctionEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             func_id, false, NULL);
+
+    return ent;
 }
 
-/*
- * Calculate function call usage and update stat counters.
- * Called by the executor after invoking a function.
+/* ----------
+ * pgstat_end_function_usage() -
  *
- * In the case of a set-returning function that runs in value-per-call mode,
- * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
- * calls for what the user considers a single call of the function.  The
- * finalize flag should be TRUE on the last call.
+ *  Calculate function call usage and update stat counters.
+ *  Called by the executor after invoking a function.
+ *
+ *  In the case of a set-returning function that runs in value-per-call mode,
+ *  we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
+ *  calls for what the user considers a single call of the function.  The
+ *  finalize flag should be TRUE on the last call.
+ * ----------
  */
 void
 pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
@@ -1907,9 +2554,6 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
         fs->f_numcalls++;
     fs->f_total_time = f_total;
     INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
 }
 
 
@@ -1921,8 +2565,7 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
  *
  *    We assume that a relcache entry's pgstat_info field is zeroed by
  *    relcache.c when the relcache entry is made; thereafter it is long-lived
- *    data.  We can avoid repeated searches of the TabStatus arrays when the
- *    same relation is touched repeatedly within a transaction.
+ *    data.
  * ----------
  */
 void
@@ -1938,7 +2581,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1949,121 +2593,60 @@ pgstat_initstats(Relation rel)
      * If we already set up this relation in the current transaction, nothing
      * to do.
      */
-    if (rel->pgstat_info != NULL &&
-        rel->pgstat_info->t_id == rel_id)
+    if (rel->pgstat_info != NULL)
         return;
 
     /* Else find or make the PgStat_TableStatus entry, and update link */
-    rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+    rel->pgstat_info = get_local_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+
+    /* don't allow linking a stats entry to multiple relcache entries */
+    Assert(rel->pgstat_info->relation == NULL);
+    /* mark this relation as the owner */
+    rel->pgstat_info->relation = rel;
 }
 
 /*
- * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
+ * pgstat_delinkstats() -
+ *
+ *  Break the mutual link between a relcache entry and a local stats entry.
+ *  This must be called whenever one end of the link is removed.
  */
-static PgStat_TableStatus *
-get_tabstat_entry(Oid rel_id, bool isshared)
+void
+pgstat_delinkstats(Relation rel)
 {
-    TabStatHashEntry *hash_entry;
-    PgStat_TableStatus *entry;
-    TabStatusArray *tsa;
-    bool        found;
-
-    /*
-     * Create hash table if we don't have it already.
-     */
-    if (pgStatTabHash == NULL)
+    /* remove the link to stats info if any */
+    if (rel && rel->pgstat_info)
     {
-        HASHCTL        ctl;
-
-        memset(&ctl, 0, sizeof(ctl));
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
+        /* link sanity check */
+        Assert(rel->pgstat_info->relation == rel);
+        rel->pgstat_info->relation = NULL;
+        rel->pgstat_info = NULL;
     }
-
-    /*
-     * Find an entry or create a new one.
-     */
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
-    if (!found)
-    {
-        /* initialize new entry with null pointer */
-        hash_entry->tsa_entry = NULL;
-    }
-
-    /*
-     * If entry is already valid, we're done.
-     */
-    if (hash_entry->tsa_entry)
-        return hash_entry->tsa_entry;
-
-    /*
-     * Locate the first pgStatTabList entry with free space, making a new list
-     * entry if needed.  Note that we could get an OOM failure here, but if so
-     * we have left the hashtable and the list in a consistent state.
-     */
-    if (pgStatTabList == NULL)
-    {
-        /* Set up first pgStatTabList entry */
-        pgStatTabList = (TabStatusArray *)
-            MemoryContextAllocZero(TopMemoryContext,
-                                   sizeof(TabStatusArray));
-    }
-
-    tsa = pgStatTabList;
-    while (tsa->tsa_used >= TABSTAT_QUANTUM)
-    {
-        if (tsa->tsa_next == NULL)
-            tsa->tsa_next = (TabStatusArray *)
-                MemoryContextAllocZero(TopMemoryContext,
-                                       sizeof(TabStatusArray));
-        tsa = tsa->tsa_next;
-    }
-
-    /*
-     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
-     * the entry was already zeroed, either at creation or after last use.
-     */
-    entry = &tsa->tsa_entries[tsa->tsa_used++];
-    entry->t_id = rel_id;
-    entry->t_shared = isshared;
-
-    /*
-     * Now we can fill the entry in pgStatTabHash.
-     */
-    hash_entry->tsa_entry = entry;
-
-    return entry;
 }
 
 /*
  * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_TableStatus entry for rel_id in the current
+ *  database. If not found, look among the shared tables.
  *
- * Note: if we got an error in the most recent execution of pgstat_report_stat,
- * it's possible that an entry exists but there's no hashtable entry for it.
- * That's okay, we'll treat this case as "doesn't exist".
+ *  If no entry found, return NULL, don't create a new one
+ * ----------
  */
 PgStat_TableStatus *
 find_tabstat_entry(Oid rel_id)
 {
-    TabStatHashEntry *hash_entry;
+    PgStat_TableStatus *ent;
 
-    /* If hashtable doesn't exist, there are no entries at all */
-    if (!pgStatTabHash)
-        return NULL;
+    ent = (PgStat_TableStatus *)
+        get_local_stat_entry(PGSTAT_TYPE_TABLE, MyDatabaseId, rel_id,
+                             false, NULL);
+    if (!ent)
+        ent = (PgStat_TableStatus *)
+            get_local_stat_entry(PGSTAT_TYPE_TABLE, InvalidOid, rel_id,
+                                 false, NULL);
 
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
-    if (!hash_entry)
-        return NULL;
-
-    /* Note that this step could also return NULL, but that's correct */
-    return hash_entry->tsa_entry;
+    return ent;
 }
 
 /*
@@ -2478,8 +3061,6 @@ AtPrepare_PgStat(void)
             record.inserted_pre_trunc = trans->inserted_pre_trunc;
             record.updated_pre_trunc = trans->updated_pre_trunc;
             record.deleted_pre_trunc = trans->deleted_pre_trunc;
-            record.t_id = tabstat->t_id;
-            record.t_shared = tabstat->t_shared;
             record.t_truncated = trans->truncated;
 
             RegisterTwoPhaseRecord(TWOPHASE_RM_PGSTAT_ID, 0,
@@ -2494,8 +3075,8 @@ AtPrepare_PgStat(void)
  *
  * All we need do here is unlink the transaction stats state from the
  * nontransactional state.  The nontransactional action counts will be
- * reported to the stats collector immediately, while the effects on live
- * and dead tuple counts are preserved in the 2PC state file.
+ * reported to the activity stats facility immediately, while the effects on
+ * live and dead tuple counts are preserved in the 2PC state file.
  *
  * Note: AtEOXact_PgStat is not called during PREPARE.
  */
@@ -2540,7 +3121,7 @@ pgstat_twophase_postcommit(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, commit case */
     pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
@@ -2576,7 +3157,7 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, abort case */
     if (rec->t_truncated)
@@ -2596,85 +3177,138 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    Find the stats entry for one database on behalf of a backend.
+ *
+ *    The returned entry is stored in static memory, so its content is valid
+ *    until the next call of this function for a different database.
  * ----------
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    PgStat_StatDBEntry *shent;
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for InvalidOid */
+    Assert(dbid != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_dbent_key.databaseid == dbid)
+        return &cached_dbent;
+
+    shent = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid, true, false, NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_dbent, shent, sizeof(PgStat_StatDBEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_dbent_key.databaseid = dbid;
+
+    return &cached_dbent;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one table or NULL. NULL doesn't mean
+ *    the activity statistics for one table or NULL. NULL doesn't mean
  *    that the table doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    activity statistics facility, so the caller is better off to
+ *    report ZERO instead.
  * ----------
  */
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
-    PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_extended(false, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
-
-    return NULL;
+    tabentry = pgstat_fetch_stat_tabentry_extended(true, relid);
+    return tabentry;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_tabentry_extended() -
+ *
+ *    Find the table stats entry for a backend. The returned entry is stored
+ *    in static memory, so its content is valid until the next call of this
+ *    function for a different table.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_extended(bool shared, Oid reloid)
+{
+    PgStat_StatTabEntry *shent;
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for InvalidOid */
+    Assert(reloid != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_tabent_key.databaseid == dboid &&
+        cached_tabent_key.objectid == reloid)
+        return &cached_tabent;
+
+    shent = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, reloid, true, false, NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_tabent, shent, sizeof(PgStat_StatTabEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_tabent_key.databaseid = dboid;
+    cached_tabent_key.objectid = reloid;
+
+    return &cached_tabent;
+}
+
+
+/* ----------
+ * pgstat_copy_index_counters() -
+ *
+ *    Support function for index swapping. Copy a portion of the counters of the
+ *    relation to the specified place.
+ * ----------
+ */
+void
+pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst)
+{
+    PgStat_StatTabEntry *tabentry;
+
+    /* No point fetching tabentry when dst is NULL */
+    if (!dst)
+        return;
+
+    tabentry = pgstat_fetch_stat_tabentry(relid);
+
+    if (!tabentry)
+        return;
+
+    dst->t_counts.t_numscans = tabentry->numscans;
+    dst->t_counts.t_tuples_returned = tabentry->tuples_returned;
+    dst->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
+    dst->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
+    dst->t_counts.t_blocks_hit = tabentry->blocks_hit;
 }
 
 
@@ -2683,30 +3317,46 @@ pgstat_fetch_stat_tabentry(Oid relid)
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
  *    the collected statistics for one function or NULL.
+ *
+ *    The returned entry is stored in static memory, so its content is valid
+ *    until the next call of this function for a different function id.
  * ----------
  */
 PgStat_StatFuncEntry *
 pgstat_fetch_stat_funcentry(Oid func_id)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry = NULL;
-
-    /* load the stats file if needed */
-    backend_read_statsfile();
-
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
-
-    return funcentry;
+    PgStat_StatFuncEntry *shent;
+    Oid    dboid = MyDatabaseId;
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for InvalidOid */
+    Assert(func_id != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_funcent_key.databaseid == dboid &&
+        cached_funcent_key.objectid == func_id)
+        return &cached_funcent;
+
+    shent = (PgStat_StatFuncEntry *)
+        get_stat_entry(PGSTAT_TYPE_FUNCTION, dboid, func_id, true, false,
+                       NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_funcent, shent, sizeof(PgStat_StatFuncEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_funcent_key.databaseid = dboid;
+    cached_funcent_key.objectid = func_id;
+
+    return &cached_funcent;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_beentry() -
  *
@@ -2766,53 +3416,160 @@ pgstat_fetch_stat_numbackends(void)
     return localNumBackends;
 }
 
+/*
+ * ---------
+ * pgstat_get_stat_timestamp() -
+ *
+ *  Returns the last update timestamp of global statistics.
+ */
+TimestampTz
+pgstat_get_stat_timestamp(void)
+{
+    return (TimestampTz) pg_atomic_read_u64(&StatsShmem->stats_timestamp);
+}
+
 /*
  * ---------
  * pgstat_fetch_stat_archiver() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the archiver statistics struct.
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_ArchiverStats *
+PgStat_Archiver *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    PgStat_Archiver reset;
+    PgStat_Archiver *reset_shared = &StatsShmem->archiver_reset_offset;
+    PgStat_Archiver *shared = &StatsShmem->archiver_stats;
+    PgStat_Archiver *cached = &cached_archiverstats;
 
-    return &archiverStats;
+    pgstat_copy_global_stats(cached, shared, sizeof(PgStat_Archiver),
+                             &StatsShmem->archiver_changecount);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&reset, reset_shared, sizeof(PgStat_Archiver));
+    LWLockRelease(StatsLock);
+
+    /* compensate by reset offsets */
+    if (cached->archived_count == reset.archived_count)
+    {
+        cached->last_archived_wal[0] = 0;
+        cached->last_archived_timestamp = 0;
+    }
+    cached->archived_count -= reset.archived_count;
+
+    if (cached->failed_count == reset.failed_count)
+    {
+        cached->last_failed_wal[0] = 0;
+        cached->last_failed_timestamp = 0;
+    }
+    cached->failed_count -= reset.failed_count;
+
+    cached->stat_reset_timestamp = reset.stat_reset_timestamp;
+
+    cached_archiverstats_is_valid = true;
+
+    return &cached_archiverstats;
 }
 
 
 /*
  * ---------
- * pgstat_fetch_global() -
+ * pgstat_fetch_stat_bgwriter() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the global statistics struct.
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_GlobalStats *
-pgstat_fetch_global(void)
+PgStat_BgWriter *
+pgstat_fetch_stat_bgwriter(void)
 {
-    backend_read_statsfile();
+    PgStat_BgWriter reset;
+    PgStat_BgWriter *reset_shared = &StatsShmem->bgwriter_reset_offset;
+    PgStat_BgWriter *shared = &StatsShmem->bgwriter_stats;
+    PgStat_BgWriter *cached = &cached_bgwriterstats;
+
+    pgstat_copy_global_stats(cached, shared, sizeof(PgStat_BgWriter),
+                             &StatsShmem->bgwriter_changecount);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&reset, reset_shared, sizeof(PgStat_BgWriter));
+    LWLockRelease(StatsLock);
+
+    /* compensate by reset offsets */
+    cached->buf_written_clean -= reset.buf_written_clean;
+    cached->maxwritten_clean  -= reset.maxwritten_clean;
+    cached->buf_alloc          -= reset.buf_alloc;
+    cached->stat_reset_timestamp = reset.stat_reset_timestamp;
+
+    cached_bgwriterstats_is_valid = true;
+
+    return &cached_bgwriterstats;
+}
+
+/*
+ * ---------
+ * pgstat_fetch_stat_checkpointer() -
+ *
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
+ * ---------
+ */
+PgStat_CheckPointer *
+pgstat_fetch_stat_checkpointer(void)
+{
+    PgStat_CheckPointer reset;
+    PgStat_CheckPointer *reset_shared = &StatsShmem->checkpointer_reset_offset;
+    PgStat_CheckPointer *shared = &StatsShmem->checkpointer_stats;
+    PgStat_CheckPointer *cached = &cached_checkpointerstats;
+
+    pgstat_copy_global_stats(cached, shared, sizeof(PgStat_CheckPointer),
+                             &StatsShmem->checkpointer_changecount);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&reset, reset_shared, sizeof(PgStat_CheckPointer));
+    LWLockRelease(StatsLock);
+
+    /* compensate by reset offsets */
+    cached->timed_checkpoints       -= reset.timed_checkpoints;
+    cached->requested_checkpoints   -= reset.requested_checkpoints;
+    cached->buf_written_checkpoints -= reset.buf_written_checkpoints;
+    cached->buf_written_backend     -= reset.buf_written_backend;
+    cached->buf_fsync_backend       -= reset.buf_fsync_backend;
+    cached->checkpoint_write_time   -= reset.checkpoint_write_time;
+    cached->checkpoint_sync_time    -= reset.checkpoint_sync_time;
+
+    cached_checkpointerstats_is_valid = true;
 
-    return &globalStats;
+    return &cached_checkpointerstats;
 }
 
 /*
  * ---------
  * pgstat_fetch_stat_wal() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the WAL statistics struct.
+ *    Support function for the SQL-callable pgstat* functions. The returned entry
+ *  is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_WalStats *
+PgStat_Wal *
 pgstat_fetch_stat_wal(void)
 {
-    backend_read_statsfile();
+    if (!cached_walstats_is_valid)
+    {
+        LWLockAcquire(StatsLock, LW_SHARED);
+        memcpy(&cached_walstats, &StatsShmem->wal_stats, sizeof(PgStat_Wal));
+        LWLockRelease(StatsLock);
+    }
 
-    return &walStats;
+    cached_walstats_is_valid = true;
+
+    return &cached_walstats;
 }
 
 /*
@@ -2826,9 +3583,27 @@ pgstat_fetch_stat_wal(void)
 PgStat_SLRUStats *
 pgstat_fetch_slru(void)
 {
-    backend_read_statsfile();
+    size_t size = sizeof(PgStat_SLRUStats) * SLRU_NUM_ELEMENTS;
 
-    return slruStats;
+    for (;;)
+    {
+        uint32 before_count;
+        uint32 after_count;
+
+        pg_read_barrier();
+        before_count = pg_atomic_read_u32(&StatsShmem->slru_changecount);
+        memcpy(&cached_slrustats, &StatsShmem->slru_stats, size);
+        after_count = pg_atomic_read_u32(&StatsShmem->slru_changecount);
+
+        if (before_count == after_count && (before_count & 1) == 0)
+            break;
+
+        CHECK_FOR_INTERRUPTS();
+    }
+
+    cached_slrustats_is_valid = true;
+
+    return &cached_slrustats;
 }
 
 /*
@@ -2840,13 +3615,41 @@ pgstat_fetch_slru(void)
  *    number of entries in nslots_p.
  * ---------
  */
-PgStat_ReplSlotStats *
+PgStat_ReplSlot *
 pgstat_fetch_replslot(int *nslots_p)
 {
-    backend_read_statsfile();
 
-    *nslots_p = nReplSlotStats;
-    return replSlotStats;
+    if (cached_replslotstats == NULL)
+    {
+        cached_replslotstats = (PgStat_ReplSlot *)
+            MemoryContextAlloc(pgStatCacheContext,
+                               sizeof(PgStat_ReplSlot) * max_replication_slots);
+    }
+
+    if (n_cached_replslotstats < 0)
+    {
+        int n = 0;
+        int i;
+
+        for (i = 0 ; i < max_replication_slots ; i++)
+        {
+            PgStat_ReplSlot *shent = (PgStat_ReplSlot *)
+                get_stat_entry(PGSTAT_TYPE_REPLSLOT, MyDatabaseId, i,
+                               false, false, NULL);
+            if (shent && !shent->header.dropped)
+            {
+                memcpy(cached_replslotstats[n++].slotname,
+                       shent->slotname,
+                       sizeof(PgStat_ReplSlot) -
+                       offsetof(PgStat_ReplSlot, slotname));
+            }
+        }
+
+        n_cached_replslotstats = n;
+    }
+
+    *nslots_p = n_cached_replslotstats;
+    return cached_replslotstats;
 }
 
 /* ------------------------------------------------------------
@@ -3066,8 +3869,8 @@ pgstat_initialize(void)
      */
     prevWalUsage = pgWalUsage;
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* needs to be called before DSM shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -3243,12 +4046,15 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+    /* attach shared database stats area */
+    attach_shared_stats();
 }
 
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
- * Flush any remaining statistics counts out to the collector.
+ * Flush any remaining statistics counts out to shared stats.
  * Without this, operations triggered during backend exit (such as
  * temp table deletions) won't be counted.
  *
@@ -3261,13 +4067,22 @@ pgstat_beshutdown_hook(int code, Datum arg)
 
     /*
      * If we got as far as discovering our own database ID, we can report what
-     * we did to the collector.  Otherwise, we'd be sending an invalid
+     * we did to the shared stats.  Otherwise, we'd be sending an invalid
      * database ID, so forget it.  (This means that accesses to pg_database
      * during failed backend starts might never get counted.)
      */
     if (OidIsValid(MyDatabaseId))
         pgstat_report_stat(true);
 
+    /*
+     * We need to clean up temporary slots before detaching shared statistics
+     * so that the statistics for temporary slots are properly removed.
+     */
+    if (MyReplicationSlot != NULL)
+        ReplicationSlotRelease();
+
+    ReplicationSlotCleanup();
+
     /*
      * Clear my status entry, following the protocol of bumping st_changecount
      * before and after.  We use a volatile pointer here to ensure the
@@ -3278,6 +4093,10 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    cleanup_dropped_stats_entries();
+
+    detach_shared_stats(true);
 }
 
 
@@ -3538,7 +4357,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3833,8 +4653,8 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
+        case WAIT_EVENT_READING_STATS_FILE:
+            event_name = "ReadingStatsFile";
             break;
         case WAIT_EVENT_RECOVERY_WAL_STREAM:
             event_name = "RecoveryWalStream";
@@ -4488,94 +5308,80 @@ pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
 
 
 /* ----------
- * pgstat_setheader() -
+ * pgstat_report_archiver() -
  *
- *        Set common header fields in a statistics message
+ *        Report archiver statistics
  * ----------
  */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
+void
+pgstat_report_archiver(const char *xlog, bool failed)
 {
-    int            rc;
+    TimestampTz now = GetCurrentTimestamp();
+    uint32    before_count PG_USED_FOR_ASSERTS_ONLY;
+    uint32    after_count PG_USED_FOR_ASSERTS_ONLY;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
 
-    ((PgStat_MsgHdr *) msg)->m_size = len;
+    START_CRIT_SECTION();
+    before_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->archiver_changecount, 1);
+    Assert((before_count & 1) == 0);
 
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
+    if (failed)
     {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
- *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
- * ----------
- */
-void
-pgstat_send_archiver(const char *xlog, bool failed)
-{
-    PgStat_MsgArchiver msg;
-
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    strlcpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+        ++StatsShmem->archiver_stats.failed_count;
+        memcpy(&StatsShmem->archiver_stats.last_failed_wal, xlog,
+               sizeof(StatsShmem->archiver_stats.last_failed_wal));
+        StatsShmem->archiver_stats.last_failed_timestamp = now;
+    }
+    else
+    {
+        ++StatsShmem->archiver_stats.archived_count;
+        memcpy(&StatsShmem->archiver_stats.last_archived_wal, xlog,
+               sizeof(StatsShmem->archiver_stats.last_archived_wal));
+        StatsShmem->archiver_stats.last_archived_timestamp = now;
+    }
+
+    after_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->archiver_changecount, 1);
+    Assert(after_count == before_count + 1);
+    END_CRIT_SECTION();
 }
 
 /* ----------
- * pgstat_send_bgwriter() -
+ * pgstat_report_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
-pgstat_send_bgwriter(void)
+pgstat_report_bgwriter(void)
 {
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+    PgStat_BgWriter *s = &StatsShmem->bgwriter_stats;
+    PgStat_BgWriter *l = &BgWriterStats;
+    uint32    before_count PG_USED_FOR_ASSERTS_ONLY;
+    uint32    after_count PG_USED_FOR_ASSERTS_ONLY;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid taking a lock for completely empty stats.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    START_CRIT_SECTION();
+    before_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->bgwriter_changecount, 1);
+    Assert((before_count & 1) == 0);
+
+    s->buf_written_clean += l->buf_written_clean;
+    s->maxwritten_clean += l->maxwritten_clean;
+    s->buf_alloc += l->buf_alloc;
+
+    after_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->bgwriter_changecount, 1);
+    Assert(after_count == before_count + 1);
+    END_CRIT_SECTION();
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4584,538 +5390,136 @@ pgstat_send_bgwriter(void)
 }
 
 /* ----------
- * pgstat_send_wal() -
+ * pgstat_report_checkpointer() -
  *
- *        Send WAL statistics to the collector
+ *        Report checkpointer statistics
  * ----------
  */
 void
-pgstat_send_wal(void)
+pgstat_report_checkpointer(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgWal all_zeroes;
-
-    WalUsage    walusage;
-
-    /*
-     * Calculate how much WAL usage counters are increased by substracting the
-     * previous counters from the current ones. Fill the results in WAL stats
-     * message.
-     */
-    MemSet(&walusage, 0, sizeof(WalUsage));
-    WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);
-
-    WalStats.m_wal_records = walusage.wal_records;
-    WalStats.m_wal_fpi = walusage.wal_fpi;
-    WalStats.m_wal_bytes = walusage.wal_bytes;
+    static const PgStat_CheckPointer all_zeroes;
+    PgStat_CheckPointer *s = &StatsShmem->checkpointer_stats;
+    PgStat_CheckPointer *l = &CheckPointerStats;
+    uint32    before_count PG_USED_FOR_ASSERTS_ONLY;
+    uint32    after_count PG_USED_FOR_ASSERTS_ONLY;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid taking a lock for completely empty stats.
      */
-    if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
+    if (memcmp(&CheckPointerStats, &all_zeroes,
+               sizeof(PgStat_CheckPointer)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&WalStats.m_hdr, PGSTAT_MTYPE_WAL);
-    pgstat_send(&WalStats, sizeof(WalStats));
+    START_CRIT_SECTION();
+    before_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->checkpointer_changecount, 1);
+    Assert((before_count & 1) == 0);
 
-    /*
-     * Save the current counters for the subsequent calculation of WAL usage.
-     */
-    prevWalUsage = pgWalUsage;
+    s->timed_checkpoints += l->timed_checkpoints;
+    s->requested_checkpoints += l->requested_checkpoints;
+    s->checkpoint_write_time += l->checkpoint_write_time;
+    s->checkpoint_sync_time += l->checkpoint_sync_time;
+    s->buf_written_checkpoints += l->buf_written_checkpoints;
+    s->buf_written_backend += l->buf_written_backend;
+    s->buf_fsync_backend += l->buf_fsync_backend;
+
+    after_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->checkpointer_changecount, 1);
+    Assert(after_count == before_count + 1);
+    END_CRIT_SECTION();
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
      */
-    MemSet(&WalStats, 0, sizeof(WalStats));
+    MemSet(&CheckPointerStats, 0, sizeof(CheckPointerStats));
 }
 
 /* ----------
- * pgstat_send_slru() -
+ * pgstat_report_wal() -
  *
- *        Send SLRU statistics to the collector
+ *        Report WAL statistics
  * ----------
  */
-static void
-pgstat_send_slru(void)
+void
+pgstat_report_wal(void)
 {
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgSLRU all_zeroes;
-
-    for (int i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /*
-         * This function can be called even if nothing at all has happened. In
-         * this case, avoid sending a completely empty message to the stats
-         * collector.
-         */
-        if (memcmp(&SLRUStats[i], &all_zeroes, sizeof(PgStat_MsgSLRU)) == 0)
-            continue;
-
-        /* set the SLRU type before each send */
-        SLRUStats[i].m_index = i;
-
-        /*
-         * Prepare and send the message
-         */
-        pgstat_setheader(&SLRUStats[i].m_hdr, PGSTAT_MTYPE_SLRU);
-        pgstat_send(&SLRUStats[i], sizeof(PgStat_MsgSLRU));
-
-        /*
-         * Clear out the statistics buffer, so it can be re-used.
-         */
-        MemSet(&SLRUStats[i], 0, sizeof(PgStat_MsgSLRU));
-    }
+    flush_walstat(false);
 }
 
-
 /* ----------
- * PgstatCollectorMain() -
+ * get_local_dbstat_entry() -
  *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-    WaitEvent    event;
-    WaitEventSet *wes;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, SignalHandlerForConfigReload);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    MyBackendType = B_STATS_COLLECTOR;
-    init_ps_display(NULL);
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /* Prepare to wait for our latch or data in our socket. */
-    wes = CreateWaitEventSet(CurrentMemoryContext, 3);
-    AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
-    AddWaitEventToSet(wes, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL, NULL);
-    AddWaitEventToSet(wes, WL_SOCKET_READABLE, pgStatSock, NULL, NULL);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do check ConfigReloadPending
-     * inside the inner loop, which means that such interrupts will get
-     * serviced but the latch won't get cleared until next time there is a
-     * break in the action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (ShutdownRequestPending)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * ShutdownRequestPending becomes set.
-         */
-        while (!ShutdownRequestPending)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (ConfigReloadPending)
-            {
-                ConfigReloadPending = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSLRUCOUNTER:
-                    pgstat_recv_resetslrucounter(&msg.msg_resetslrucounter,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETREPLSLOTCOUNTER:
-                    pgstat_recv_resetreplslotcounter(&msg.msg_resetreplslotcounter,
-                                                     len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_WAL:
-                    pgstat_recv_wal(&msg.msg_wal, len);
-                    break;
-
-                case PGSTAT_MTYPE_SLRU:
-                    pgstat_recv_slru(&msg.msg_slru, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_REPLSLOT:
-                    pgstat_recv_replslot(&msg.msg_replslot, len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitEventSetWait(wes, -1L, &event, 1, WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitEventSetWait(wes, 2 * 1000L /* msec */ , &event, 1,
-                              WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr == 1 && event.events == WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    FreeWaitEventSet(wes);
-
-    exit(0);
-}
-
-/*
- * Subroutine to clear stats in a database entry
- *
- * Tables and functions hashes are initialized to empty.
- */
-static void
-reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
-{
-    HASHCTL        hash_ctl;
-
-    dbentry->n_xact_commit = 0;
-    dbentry->n_xact_rollback = 0;
-    dbentry->n_blocks_fetched = 0;
-    dbentry->n_blocks_hit = 0;
-    dbentry->n_tuples_returned = 0;
-    dbentry->n_tuples_fetched = 0;
-    dbentry->n_tuples_inserted = 0;
-    dbentry->n_tuples_updated = 0;
-    dbentry->n_tuples_deleted = 0;
-    dbentry->last_autovac_time = 0;
-    dbentry->n_conflict_tablespace = 0;
-    dbentry->n_conflict_lock = 0;
-    dbentry->n_conflict_snapshot = 0;
-    dbentry->n_conflict_bufferpin = 0;
-    dbentry->n_conflict_startup_deadlock = 0;
-    dbentry->n_temp_files = 0;
-    dbentry->n_temp_bytes = 0;
-    dbentry->n_deadlocks = 0;
-    dbentry->n_checksum_failures = 0;
-    dbentry->last_checksum_failure = 0;
-    dbentry->n_block_read_time = 0;
-    dbentry->n_block_write_time = 0;
-
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
-
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
-}
-
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+ *  Find or create a local PgStat_StatDBEntry entry for dbid.  A new entry is
+ *  created and initialized if none exists.
  */
 static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
+get_local_dbstat_entry(Oid dbid)
 {
-    PgStat_StatDBEntry *result;
+    PgStat_StatDBEntry *dbentry;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
-        return NULL;
 
     /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
+     * Find an entry or create a new one.
      */
-    if (!found)
-        reset_dbentry_counters(result);
+    dbentry = (PgStat_StatDBEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid,
+                             true, &found);
 
-    return result;
+    return dbentry;
 }
 
-
-/*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+/* ----------
+ * get_local_tabstat_entry() -
+ *  Find or create a PgStat_TableStatus entry for rel. A new entry is created
+ *  and initialized if none exists.
+ * ----------
  */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+static PgStat_TableStatus *
+get_local_tabstat_entry(Oid rel_id, bool isshared)
 {
-    PgStat_StatTabEntry *result;
+    PgStat_TableStatus *tabentry;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
 
-    /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
+    tabentry = (PgStat_TableStatus *)
+        get_local_stat_entry(PGSTAT_TYPE_TABLE,
+                             isshared ? InvalidOid : MyDatabaseId,
+                             rel_id, true, &found);
 
-    if (!create && !found)
-        return NULL;
-
-    /* If not found, initialize the new one. */
-    if (!found)
-    {
-        result->numscans = 0;
-        result->tuples_returned = 0;
-        result->tuples_fetched = 0;
-        result->tuples_inserted = 0;
-        result->tuples_updated = 0;
-        result->tuples_deleted = 0;
-        result->tuples_hot_updated = 0;
-        result->n_live_tuples = 0;
-        result->n_dead_tuples = 0;
-        result->changes_since_analyze = 0;
-        result->inserts_since_vacuum = 0;
-        result->blocks_fetched = 0;
-        result->blocks_hit = 0;
-        result->vacuum_timestamp = 0;
-        result->vacuum_count = 0;
-        result->autovac_vacuum_timestamp = 0;
-        result->autovac_vacuum_count = 0;
-        result->analyze_timestamp = 0;
-        result->analyze_count = 0;
-        result->autovac_analyze_timestamp = 0;
-        result->autovac_analyze_count = 0;
-    }
-
-    return result;
+    return tabentry;
 }
 
-
 /* ----------
- * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
+ * pgstat_write_statsfile() -
+ *        Write the global statistics file, as well as DB files.
  *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ * This function is called in the last process that is accessing the shared
+ * stats so locking is not required.
  * ----------
  */
 static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+pgstat_write_statsfile(void)
 {
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
-    int            i;
+    dshash_seq_status hstat;
+    PgStatHashEntry *ps;
+
+    /* Stats are not initialized yet; just return. */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
+
+    /* this is the last process that is accessing the shared stats */
+#ifdef USE_ASSERT_CHECKING
+    LWLockAcquire(StatsLock, LW_SHARED);
+    Assert(StatsShmem->refcount == 0);
+    LWLockRelease(StatsLock);
+#endif
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -5135,7 +5539,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    pg_atomic_write_u64(&StatsShmem->stats_timestamp, GetCurrentTimestamp());
 
     /*
      * Write the file header --- currently just a format ID.
@@ -5145,200 +5549,72 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Write global stats struct
+     * Write bgwriter global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(&StatsShmem->bgwriter_stats, sizeof(PgStat_BgWriter), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Write archiver stats struct
+     * Write checkpointer global stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+    rc = fwrite(&StatsShmem->checkpointer_stats, sizeof(PgStat_CheckPointer), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Write WAL stats struct
+     * Write archiver global stats struct
      */
-    rc = fwrite(&walStats, sizeof(walStats), 1, fpout);
+    rc = fwrite(&StatsShmem->archiver_stats, sizeof(PgStat_Archiver), 1,
+                fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write WAL global stats struct
+     */
+    rc = fwrite(&StatsShmem->wal_stats, sizeof(PgStat_Wal), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write SLRU stats struct
      */
-    rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
+    rc = fwrite(&StatsShmem->slru_stats, sizeof(PgStatSharedSLRUStats), 1,
+                fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Walk through the database table.
+     * Walk through the stats entries
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((ps = dshash_seq_next(&hstat)) != NULL)
     {
-        /*
-         * Write out the table and function stats for this DB into the
-         * appropriate per-DB stat file, if required.
-         */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
+        PgStat_StatEntryHeader *shent;
+        size_t                    len;
+
+        CHECK_FOR_INTERRUPTS();
+
+        shent = (PgStat_StatEntryHeader *)dsa_get_address(area, ps->body);
+
+        /* we may have some "dropped" entries not yet removed, skip them */
+        if (shent->dropped)
+            continue;
+
+        /* Make DB's timestamp consistent with the global stats */
+        if (ps->key.type == PGSTAT_TYPE_DB)
         {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+            PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) shent;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
+            dbentry->stats_timestamp =
+                (TimestampTz) pg_atomic_read_u64(&StatsShmem->stats_timestamp);
         }
 
-        /*
-         * Write out the DB entry. We don't write the tables or functions
-         * pointers, since they're of no use to any other process.
-         */
-        fputc('D', fpout);
-        rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * Write replication slot stats struct
-     */
-    for (i = 0; i < nReplSlotStats; i++)
-    {
-        fputc('R', fpout);
-        rc = fwrite(&replSlotStats[i], sizeof(PgStat_ReplSlotStats), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
-}
-
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
-
-/* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
- * ----------
- */
-static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
-{
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpout;
-    int32        format_id;
-    Oid            dbid = dbentry->databaseid;
-    int            rc;
-    char        tmpfile[MAXPGPATH];
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database's access stats per table.
-     */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
-    {
-        fputc('T', fpout);
-        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
+        fputc('S', fpout);
+        rc = fwrite(&ps->key, sizeof(PgStatHashKey), 1, fpout);
 
-    /*
-     * Walk through the database's function stats table.
-     */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+        /* Write the entry, excluding its header part */
+        len = PGSTAT_SHENT_BODY_LEN(ps->key.type);
+        rc = fwrite(PGSTAT_SHENT_BODY(shent), len, 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_seq_term(&hstat);
 
     /*
      * No more output to be done. Close the temp file and replace the old
@@ -5372,114 +5648,63 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
 }
 
 /* ----------
- * pgstat_read_statsfiles() -
- *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ * pgstat_read_statsfile() -
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
+ *    Reads the existing activity statistics file into the shared stats hash.
  *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
+ * This function is called in the only process that is accessing the shared
+ * stats so locking is not required.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+static void
+pgstat_read_statsfile(void)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-    int            i;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    char        tag;
 
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
+    /* shouldn't be called from postmaster */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Create the DB hashtable
-     */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    /* this is the only process that is accessing the shared stats */
+#ifdef USE_ASSERT_CHECKING
+    LWLockAcquire(StatsLock, LW_SHARED);
+    Assert(StatsShmem->refcount == 1);
+    LWLockRelease(StatsLock);
+#endif
 
-    /* Allocate the space for replication slot statistics */
-    replSlotStats = palloc0(max_replication_slots * sizeof(PgStat_ReplSlotStats));
-    nReplSlotStats = 0;
-
-    /*
-     * Clear out global, archiver, WAL and SLRU statistics so they start from
-     * zero in case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-    memset(&walStats, 0, sizeof(walStats));
-    memset(&slruStats, 0, sizeof(slruStats));
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-    walStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Set the same reset timestamp for all SLRU items too.
-     */
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-        slruStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Set the same reset timestamp for all replication slots too.
-     */
-    for (i = 0; i < max_replication_slots; i++)
-        replSlotStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    StatsShmem->bgwriter_stats.stat_reset_timestamp = GetCurrentTimestamp();
+    StatsShmem->archiver_stats.stat_reset_timestamp =
+        StatsShmem->bgwriter_stats.stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the activity statistics facility is not
+     * running or has not yet written the stats file.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -5488,682 +5713,150 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
-        goto done;
-    }
-
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        goto done;
-    }
-
-    /*
-     * Read WAL stats struct
-     */
-    if (fread(&walStats, 1, sizeof(walStats), fpin) != sizeof(walStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&walStats, 0, sizeof(walStats));
         goto done;
     }
 
     /*
-     * Read SLRU stats struct
+     * Read bgwriter stats struct
      */
-    if (fread(slruStats, 1, sizeof(slruStats), fpin) != sizeof(slruStats))
+    if (fread(&StatsShmem->bgwriter_stats, 1, sizeof(PgStat_BgWriter), fpin) !=
+        sizeof(PgStat_BgWriter))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&slruStats, 0, sizeof(slruStats));
+        MemSet(&StatsShmem->bgwriter_stats, 0, sizeof(PgStat_BgWriter));
         goto done;
     }
 
     /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Add to the DB hash
-                 */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
-                break;
-
-                /*
-                 * 'R'    A PgStat_ReplSlotStats struct describing a replication
-                 * slot follows.
-                 */
-            case 'R':
-                if (fread(&replSlotStats[nReplSlotStats], 1, sizeof(PgStat_ReplSlotStats), fpin)
-                    != sizeof(PgStat_ReplSlotStats))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    memset(&replSlotStats[nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
-                    goto done;
-                }
-                nReplSlotStats++;
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-
-    return dbhash;
-}
-
-
-/* ----------
- * pgstat_read_db_statsfile() -
- *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
- * ----------
- */
-static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
-{
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatTabEntry tabbuf;
-    PgStat_StatFuncEntry funcbuf;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return;
-    }
-
-    /*
-     * Verify it's of the expected format.
+     * Read checkpointer stats struct
      */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
+    if (fread(&StatsShmem->checkpointer_stats, 1, sizeof(PgStat_CheckPointer), fpin) !=
+        sizeof(PgStat_CheckPointer))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
+        MemSet(&StatsShmem->checkpointer_stats, 0, sizeof(PgStat_CheckPointer));
         goto done;
     }
 
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'T'    A PgStat_StatTabEntry follows.
-                 */
-            case 'T':
-                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
-                          fpin) != sizeof(PgStat_StatTabEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
-                break;
-
-                /*
-                 * 'F'    A PgStat_StatFuncEntry follows.
-                 */
-            case 'F':
-                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
-                          fpin) != sizeof(PgStat_StatFuncEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
-                break;
-
-                /*
-                 * 'E'    The EOF marker of a complete stats file.
-                 */
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts. The caller must
- *    rely on timestamp stored in *ts iff the function returns true.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    PgStat_WalStats myWalStats;
-    PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
-    PgStat_ReplSlotStats myReplSlotStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
     /*
      * Read archiver stats struct
      */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
+    if (fread(&StatsShmem->archiver_stats, 1, sizeof(PgStat_Archiver),
+              fpin) != sizeof(PgStat_Archiver))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
+        MemSet(&StatsShmem->archiver_stats, 0, sizeof(PgStat_Archiver));
+        goto done;
     }
 
     /*
      * Read WAL stats struct
      */
-    if (fread(&myWalStats, 1, sizeof(myWalStats), fpin) != sizeof(myWalStats))
+    if (fread(&StatsShmem->wal_stats, 1, sizeof(PgStat_Wal), fpin)
+        != sizeof(PgStat_Wal))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
+        MemSet(&StatsShmem->wal_stats, 0, sizeof(PgStat_Wal));
+        goto done;
     }
 
     /*
      * Read SLRU stats struct
      */
-    if (fread(mySLRUStats, 1, sizeof(mySLRUStats), fpin) != sizeof(mySLRUStats))
+    if (fread(&StatsShmem->slru_stats, 1, sizeof(PgStatSharedSLRUStats),
+              fpin) != sizeof(PgStatSharedSLRUStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-                /*
-                 * 'R'    A PgStat_ReplSlotStats struct describing a replication
-                 * slot follows.
-                 */
-            case 'R':
-                if (fread(&myReplSlotStats, 1, sizeof(PgStat_ReplSlotStats), fpin)
-                    != sizeof(PgStat_ReplSlotStats))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-        }
+        goto done;
     }
 
-done:
-    FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
     /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
+    while ((tag = fgetc(fpin)) == 'S')
     {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
+        PgStatHashKey        key;
+        PgStat_StatEntryHeader *p;
+        size_t                len;
 
         CHECK_FOR_INTERRUPTS();
 
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
+        if (fread(&key, 1, sizeof(key), fpin) != sizeof(key))
         {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"", statfile)));
+            goto done;
         }
 
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                ereport(LOG,
-                        (errmsg("statistics collector's time %s is later than backend local time %s",
-                                filetime, mytime)));
-                pfree(filetime);
-                pfree(mytime);
-            }
+        p = get_stat_entry(key.type, key.databaseid, key.objectid,
+                              false, true, &found);
 
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
+        /* don't allow duplicate entries */
+        if (found)
+        {
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"",
+                            statfile)));
+            goto done;
         }
 
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
+        /* Avoid overwriting header part */
+        len = PGSTAT_SHENT_BODY_LEN(key.type);
 
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
+        if (fread(PGSTAT_SHENT_BODY(p), 1, len, fpin) != len)
+        {
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"", statfile)));
+            goto done;
+        }
     }
 
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
+    if (tag != 'E')
+    {
         ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
+                (errmsg("corrupted statistics file \"%s\"",
+                        statfile)));
+        goto done;
+    }
 
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
+done:
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+
+    return;
 }
 
-
 /* ----------
  * pgstat_setup_memcxt() -
  *
- *    Create pgStatLocalContext, if not already done.
+ *    Create pgStatLocalContext if not already done.
  * ----------
  */
 static void
 pgstat_setup_memcxt(void)
 {
     if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
-}
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
 
+    if (!pgStatCacheContext)
+        pgStatCacheContext =
+            AllocSetContextCreate(CacheMemoryContext,
+                                  "Activity statistics",
+                                  ALLOCSET_SMALL_SIZES);
+}
 
 /* ----------
  * pgstat_clear_snapshot() -
@@ -6180,906 +5873,25 @@ pgstat_clear_snapshot(void)
 {
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
+    {
         MemoryContextDelete(pgStatLocalContext);
 
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            ereport(LOG,
-                    (errmsg("stats_timestamp %s is later than collector's time %s for database %u",
-                            writetime, mytime, dbentry->databaseid)));
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-                tabentry->inserts_since_vacuum = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_WAL)
-    {
-        /* Reset the WAL statistics for the cluster. */
-        memset(&walStats, 0, sizeof(walStats));
-        walStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_resetslrucounter() -
- *
- *    Reset some SLRU statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len)
-{
-    int            i;
-    TimestampTz ts = GetCurrentTimestamp();
-
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /* reset entry with the given index, or all entries (index is -1) */
-        if ((msg->m_index == -1) || (msg->m_index == i))
-        {
-            memset(&slruStats[i], 0, sizeof(slruStats[i]));
-            slruStats[i].stat_reset_timestamp = ts;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_resetreplslotcounter() -
- *
- *    Reset some replication slot statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg,
-                                 int len)
-{
-    int            i;
-    int            idx = -1;
-    TimestampTz ts;
-
-    ts = GetCurrentTimestamp();
-    if (msg->clearall)
-    {
-        for (i = 0; i < nReplSlotStats; i++)
-            pgstat_reset_replslot(i, ts);
-    }
-    else
-    {
-        /* Get the index of replication slot statistics to reset */
-        idx = pgstat_replslot_index(msg->m_slotname, false);
-
-        /*
-         * Nothing to do if the given slot entry is not found.  This could
-         * happen when the slot with the given name is removed and the
-         * corresponding statistics entry is also removed before receiving the
-         * reset message.
-         */
-        if (idx < 0)
-            return;
-
-        /* Reset the stats for the requested replication slot */
-        pgstat_reset_replslot(idx, ts);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signaling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * It is quite possible that a non-aggressive VACUUM ended up skipping
-     * various pages, however, we'll zero the insert counter here regardless.
-     * It's currently used only to track when we need to perform an "insert"
-     * autovacuum, which are mainly intended to freeze newly inserted tuples.
-     * Zeroing this may just mean we'll not try to vacuum the table again
-     * until enough tuples have been inserted to trigger another insert
-     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
-     * stragglers.
-     */
-    tabentry->inserts_since_vacuum = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_wal() -
- *
- *    Process a WAL message.
- * ----------
- */
-static void
-pgstat_recv_wal(PgStat_MsgWal *msg, int len)
-{
-    walStats.wal_records += msg->m_wal_records;
-    walStats.wal_fpi += msg->m_wal_fpi;
-    walStats.wal_bytes += msg->m_wal_bytes;
-    walStats.wal_buffers_full += msg->m_wal_buffers_full;
-}
-
-/* ----------
- * pgstat_recv_slru() -
- *
- *    Process a SLRU message.
- * ----------
- */
-static void
-pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
-{
-    slruStats[msg->m_index].blocks_zeroed += msg->m_blocks_zeroed;
-    slruStats[msg->m_index].blocks_hit += msg->m_blocks_hit;
-    slruStats[msg->m_index].blocks_read += msg->m_blocks_read;
-    slruStats[msg->m_index].blocks_written += msg->m_blocks_written;
-    slruStats[msg->m_index].blocks_exists += msg->m_blocks_exists;
-    slruStats[msg->m_index].flush += msg->m_flush;
-    slruStats[msg->m_index].truncate += msg->m_truncate;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_replslot() -
- *
- *    Process a REPLSLOT message.
- * ----------
- */
-static void
-pgstat_recv_replslot(PgStat_MsgReplSlot *msg, int len)
-{
-    int            idx;
-
-    /*
-     * Get the index of replication slot statistics.  On dropping, we don't
-     * create the new statistics.
-     */
-    idx = pgstat_replslot_index(msg->m_slotname, !msg->m_drop);
-
-    /*
-     * The slot entry is not found or there is no space to accommodate the new
-     * entry.  This could happen when the message for the creation of a slot
-     * reached before the drop message even though the actual operations
-     * happen in reverse order.  In such a case, the next update of the
-     * statistics for the same slot will create the required entry.
-     */
-    if (idx < 0)
-        return;
-
-    /* it must be a valid replication slot index */
-    Assert(idx >= 0 && idx < max_replication_slots);
-
-    if (msg->m_drop)
-    {
-        /* Remove the replication slot statistics with the given name */
-        memcpy(&replSlotStats[idx], &replSlotStats[nReplSlotStats - 1],
-               sizeof(PgStat_ReplSlotStats));
-        nReplSlotStats--;
-        Assert(nReplSlotStats >= 0);
-    }
-    else
-    {
-        /* Update the replication slot statistics */
-        replSlotStats[idx].spill_txns += msg->m_spill_txns;
-        replSlotStats[idx].spill_count += msg->m_spill_count;
-        replSlotStats[idx].spill_bytes += msg->m_spill_bytes;
-        replSlotStats[idx].stream_txns += msg->m_stream_txns;
-        replSlotStats[idx].stream_count += msg->m_stream_count;
-        replSlotStats[idx].stream_bytes += msg->m_stream_bytes;
-    }
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
+    }
+
+    /* Invalidate the simple cache keys */
+    cached_dbent_key = stathashkey_zero;
+    cached_tabent_key = stathashkey_zero;
+    cached_funcent_key = stathashkey_zero;
+    cached_archiverstats_is_valid = false;
+    cached_bgwriterstats_is_valid = false;
+    cached_checkpointerstats_is_valid = false;
+    cached_walstats_is_valid = false;
+    cached_slrustats_is_valid = false;
+    n_cached_replslotstats = -1;
 }
 
 /*
@@ -7126,60 +5938,6 @@ pgstat_clip_activity(const char *raw_activity)
     return activity;
 }
 
-/* ----------
- * pgstat_replslot_index
- *
- * Return the index of entry of a replication slot with the given name, or
- * -1 if the slot is not found.
- *
- * create_it tells whether to create the new slot entry if it is not found.
- * ----------
- */
-static int
-pgstat_replslot_index(const char *name, bool create_it)
-{
-    int            i;
-
-    Assert(nReplSlotStats <= max_replication_slots);
-    for (i = 0; i < nReplSlotStats; i++)
-    {
-        if (strcmp(replSlotStats[i].slotname, name) == 0)
-            return i;            /* found */
-    }
-
-    /*
-     * The slot is not found.  We don't want to register the new statistics if
-     * the list is already full or the caller didn't request.
-     */
-    if (i == max_replication_slots || !create_it)
-        return -1;
-
-    /* Register new slot */
-    memset(&replSlotStats[nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
-    strlcpy(replSlotStats[nReplSlotStats].slotname, name, NAMEDATALEN);
-
-    return nReplSlotStats++;
-}
-
-/* ----------
- * pgstat_reset_replslot
- *
- * Reset the replication slot stats at index 'i'.
- * ----------
- */
-static void
-pgstat_reset_replslot(int i, TimestampTz ts)
-{
-    /* reset only counters. Don't clear slot name */
-    replSlotStats[i].spill_txns = 0;
-    replSlotStats[i].spill_count = 0;
-    replSlotStats[i].spill_bytes = 0;
-    replSlotStats[i].stream_txns = 0;
-    replSlotStats[i].stream_count = 0;
-    replSlotStats[i].stream_bytes = 0;
-    replSlotStats[i].stat_reset_timestamp = ts;
-}
-
 /*
  * pgstat_slru_index
  *
@@ -7224,7 +5982,7 @@ pgstat_slru_name(int slru_idx)
  * Returns pointer to entry with counters for given SLRU (based on the name
  * stored in SlruCtl as lwlock tranche name).
  */
-static inline PgStat_MsgSLRU *
+static PgStat_SLRUStats *
 slru_entry(int slru_idx)
 {
     /*
@@ -7235,7 +5993,7 @@ slru_entry(int slru_idx)
 
     Assert((slru_idx >= 0) && (slru_idx < SLRU_NUM_ELEMENTS));
 
-    return &SLRUStats[slru_idx];
+    return &local_SLRUStats[slru_idx];
 }
 
 /*
@@ -7245,41 +6003,48 @@ slru_entry(int slru_idx)
 void
 pgstat_count_slru_page_zeroed(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_zeroed += 1;
+    slru_entry(slru_idx)->blocks_zeroed += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_hit(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_hit += 1;
+    slru_entry(slru_idx)->blocks_hit += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_exists(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_exists += 1;
+    slru_entry(slru_idx)->blocks_exists += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_read(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_read += 1;
+    slru_entry(slru_idx)->blocks_read += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_written(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_written += 1;
+    slru_entry(slru_idx)->blocks_written += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_flush(int slru_idx)
 {
-    slru_entry(slru_idx)->m_flush += 1;
+    slru_entry(slru_idx)->flush += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_truncate(int slru_idx)
 {
-    slru_entry(slru_idx)->m_truncate += 1;
+    slru_entry(slru_idx)->truncate += 1;
+    have_slrustats = true;
 }
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 8a4706a870..65801817e7 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -250,7 +250,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -511,7 +510,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1326,12 +1324,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1781,11 +1773,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2715,8 +2702,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -3043,8 +3028,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3111,13 +3094,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3190,22 +3166,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3666,22 +3626,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3901,8 +3845,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3926,8 +3868,7 @@ PostmasterStateMachine(void)
     {
         /*
          * PM_WAIT_DEAD_END state ends when the BackendList is entirely empty
-         * (ie, no dead_end children remain), and the archiver and stats
-         * collector are gone too.
+         * (ie, no dead_end children remain), and the archiver is gone too.
          *
          * The reason we wait for those two is to protect them against a new
          * postmaster starting conflicting subprocesses; this isn't an
@@ -3937,8 +3878,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4139,8 +4079,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5085,18 +5023,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5215,12 +5141,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6129,7 +6049,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6185,8 +6104,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6419,7 +6336,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 1d8d1742a7..4e5d63b30e 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1568,8 +1568,8 @@ is_checksummed_file(const char *fullpath, const char *filename)
  *
  * If 'missing_ok' is true, will not throw an error if the file is not found.
  *
- * If dboid is anything other than InvalidOid then any checksum failures detected
- * will get reported to the stats collector.
+ * If dboid is anything other than InvalidOid then any checksum failures
+ * detected will get reported to the activity stats facility.
  *
  * Returns true if the file was successfully sent, false if 'missing_ok',
  * and the file did not exist.
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 9c7cf13d4d..0d978ba2f2 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -692,14 +692,10 @@ ReplicationSlotDropPtr(ReplicationSlot *slot)
                 (errmsg("could not remove directory \"%s\"", tmppath)));
 
     /*
-     * Send a message to drop the replication slot to the stats collector.
-     * Since there is no guarantee of the order of message transfer on a UDP
-     * connection, it's possible that a message for creating a new slot
-     * reaches before a message for removing the old slot. We send the drop
-     * and create messages while holding ReplicationSlotAllocationLock to
-     * reduce that possibility. If the messages reached in reverse, we would
-     * lose one statistics update message. But the next update message will
-     * create the statistics for the replication slot.
+     * Drop the statistics entry for the replication slot.  Do this while
+     * holding ReplicationSlotAllocationLock so that we don't drop a statistics
+     * entry for another slot with the same name just created in another
+     * session.
      */
     if (SlotIsLogical(slot))
         pgstat_report_replslot_drop(NameStr(slot->data.name));
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ad0d1a9abc..aef6be94a3 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2047,7 +2047,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                CheckPointerStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2157,7 +2157,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2347,7 +2347,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2355,7 +2355,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
 
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 96c2aaabbd..55693cfa3a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -149,6 +149,7 @@ CreateSharedMemoryAndSemaphores(void)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -267,6 +268,7 @@ CreateSharedMemoryAndSemaphores(void)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 108e652179..33b19034d5 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -177,7 +177,9 @@ static const char *const BuiltinTrancheNames[] = {
     /* LWTRANCHE_PARALLEL_APPEND: */
     "ParallelAppend",
     /* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
-    "PerXactPredicateList"
+    "PerXactPredicateList",
+    /* LWTRANCHE_STATS: */
+    "ActivityStatistics"
 };
 
 StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index 774292fd94..91bf8d5b5d 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -53,3 +53,4 @@ XactTruncationLock                    44
 # 45 was XactTruncationLock until removal of BackendRandomLock
 WrapLimitsVacuumLock                46
 NotifyQueueTailLock                    47
+StatsLock                            48
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index dcc09df0c7..0dec4b9145 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -415,8 +415,8 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
     DropRelFileNodesAllBuffers(rnodes, nrels);
 
     /*
-     * It'd be nice to tell the stats collector to forget them immediately,
-     * too. But we can't because we don't know the OIDs.
+     * It'd be nice to tell the activity stats facility to forget them
+     * immediately, too. But we can't because we don't know the OIDs.
      */
 
     /*
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 3679799e50..891fe67a41 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3214,6 +3214,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3784,6 +3790,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4180,11 +4187,12 @@ PostgresMain(int argc, char *argv[],
          * Note: this includes fflush()'ing the last of the prior output.
          *
          * This is also a good time to send collected statistics to the
-         * collector, and to update the PS stats display.  We avoid doing
-         * those every time through the message loop because it'd slow down
-         * processing of batched messages, and because we don't want to report
-         * uncommitted updates (that confuses autovacuum).  The notification
-         * processor wants a call too, if we are not in a transaction block.
+         * activity statistics, and to update the PS stats display.  We avoid
+         * doing those every time through the message loop because it'd slow
+         * down processing of batched messages, and because we don't want to
+         * report uncommitted updates (that confuses autovacuum).  The
+         * notification processor wants a call too, if we are not in a
+         * transaction block.
          */
         if (send_ready_for_query)
         {
@@ -4216,6 +4224,8 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long stats_timeout;
+
                 /* Send out notify signals and transmit self-notifies */
                 ProcessCompletedNotifies();
 
@@ -4228,8 +4238,13 @@ PostgresMain(int argc, char *argv[],
                 if (notifyInterruptPending)
                     ProcessNotifyInterrupt();
 
-                pgstat_report_stat(false);
-
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle");
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4267,7 +4282,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4275,6 +4290,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 6afe1b6f56..658f4d432e 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * pgstatfuncs.c
- *      Functions for accessing the statistics collector data
+ *      Functions for accessing the activity statistics data
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -35,9 +35,6 @@
 
 #define HAS_PGSTAT_PERMISSIONS(role)     (is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS) || has_privs_of_role(GetUserId(),role))
 
 
-/* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
-
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
 {
@@ -1267,7 +1264,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_commit);
+        result = (int64) (dbentry->counts.n_xact_commit);
 
     PG_RETURN_INT64(result);
 }
@@ -1283,7 +1280,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_rollback);
+        result = (int64) (dbentry->counts.n_xact_rollback);
 
     PG_RETURN_INT64(result);
 }
@@ -1299,7 +1296,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_fetched);
+        result = (int64) (dbentry->counts.n_blocks_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1315,7 +1312,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_hit);
+        result = (int64) (dbentry->counts.n_blocks_hit);
 
     PG_RETURN_INT64(result);
 }
@@ -1331,7 +1328,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_returned);
+        result = (int64) (dbentry->counts.n_tuples_returned);
 
     PG_RETURN_INT64(result);
 }
@@ -1347,7 +1344,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_fetched);
+        result = (int64) (dbentry->counts.n_tuples_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1363,7 +1360,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_inserted);
+        result = (int64) (dbentry->counts.n_tuples_inserted);
 
     PG_RETURN_INT64(result);
 }
@@ -1379,7 +1376,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_updated);
+        result = (int64) (dbentry->counts.n_tuples_updated);
 
     PG_RETURN_INT64(result);
 }
@@ -1395,7 +1392,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_deleted);
+        result = (int64) (dbentry->counts.n_tuples_deleted);
 
     PG_RETURN_INT64(result);
 }
@@ -1428,7 +1425,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_files;
+        result = dbentry->counts.n_temp_files;
 
     PG_RETURN_INT64(result);
 }
@@ -1444,7 +1441,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_bytes;
+        result = dbentry->counts.n_temp_bytes;
 
     PG_RETURN_INT64(result);
 }
@@ -1459,7 +1456,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace);
+        result = (int64) (dbentry->counts.n_conflict_tablespace);
 
     PG_RETURN_INT64(result);
 }
@@ -1474,7 +1471,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_lock);
+        result = (int64) (dbentry->counts.n_conflict_lock);
 
     PG_RETURN_INT64(result);
 }
@@ -1489,7 +1486,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_snapshot);
+        result = (int64) (dbentry->counts.n_conflict_snapshot);
 
     PG_RETURN_INT64(result);
 }
@@ -1504,7 +1501,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_bufferpin);
+        result = (int64) (dbentry->counts.n_conflict_bufferpin);
 
     PG_RETURN_INT64(result);
 }
@@ -1519,7 +1516,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1534,11 +1531,11 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace +
-                          dbentry->n_conflict_lock +
-                          dbentry->n_conflict_snapshot +
-                          dbentry->n_conflict_bufferpin +
-                          dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_tablespace +
+                          dbentry->counts.n_conflict_lock +
+                          dbentry->counts.n_conflict_snapshot +
+                          dbentry->counts.n_conflict_bufferpin +
+                          dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1553,7 +1550,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_deadlocks);
+        result = (int64) (dbentry->counts.n_deadlocks);
 
     PG_RETURN_INT64(result);
 }
@@ -1571,7 +1568,7 @@ pg_stat_get_db_checksum_failures(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_checksum_failures);
+        result = (int64) (dbentry->counts.n_checksum_failures);
 
     PG_RETURN_INT64(result);
 }
@@ -1608,7 +1605,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_read_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_read_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1624,7 +1621,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_write_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_write_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1632,69 +1629,71 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_bgwriter_timed_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->timed_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->timed_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_requested_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->requested_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->requested_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_clean);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_written_clean);
 }
 
 Datum
 pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->maxwritten_clean);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->maxwritten_clean);
 }
 
 Datum
 pg_stat_get_checkpoint_write_time(PG_FUNCTION_ARGS)
 {
     /* time is already in msec, just convert to double for presentation */
-    PG_RETURN_FLOAT8((double) pgstat_fetch_global()->checkpoint_write_time);
+    PG_RETURN_FLOAT8((double)
+                     pgstat_fetch_stat_checkpointer()->checkpoint_write_time);
 }
 
 Datum
 pg_stat_get_checkpoint_sync_time(PG_FUNCTION_ARGS)
 {
     /* time is already in msec, just convert to double for presentation */
-    PG_RETURN_FLOAT8((double) pgstat_fetch_global()->checkpoint_sync_time);
+    PG_RETURN_FLOAT8((double)
+                     pgstat_fetch_stat_checkpointer()->checkpoint_sync_time);
 }
 
 Datum
 pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_global()->stat_reset_timestamp);
+    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_stat_bgwriter()->stat_reset_timestamp);
 }
 
 Datum
 pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_backend);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_backend);
 }
 
 Datum
 pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_fsync_backend);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_fsync_backend);
 }
 
 Datum
 pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_alloc);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
 }
 
 /*
@@ -1708,7 +1707,7 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
     Datum        values[PG_STAT_GET_WAL_COLS];
     bool        nulls[PG_STAT_GET_WAL_COLS];
     char        buf[256];
-    PgStat_WalStats *wal_stats;
+    PgStat_Wal *wal_stats;
 
     /* Initialise values and NULL flags arrays */
     MemSet(values, 0, sizeof(values));
@@ -1733,11 +1732,11 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
     wal_stats = pgstat_fetch_stat_wal();
 
     /* Fill values and NULLs */
-    values[0] = Int64GetDatum(wal_stats->wal_records);
-    values[1] = Int64GetDatum(wal_stats->wal_fpi);
+    values[0] = Int64GetDatum(wal_stats->wal_usage.wal_records);
+    values[1] = Int64GetDatum(wal_stats->wal_usage.wal_fpi);
 
     /* Convert to numeric. */
-    snprintf(buf, sizeof buf, UINT64_FORMAT, wal_stats->wal_bytes);
+    snprintf(buf, sizeof buf, UINT64_FORMAT, wal_stats->wal_usage.wal_bytes);
     values[2] = DirectFunctionCall3(numeric_in,
                                     CStringGetDatum(buf),
                                     ObjectIdGetDatum(0),
@@ -2018,7 +2017,7 @@ pg_stat_get_xact_function_self_time(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_snapshot_timestamp(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_global()->stats_timestamp);
+    PG_RETURN_TIMESTAMPTZ(pgstat_get_stat_timestamp());
 }
 
 /* Discard the active statistics snapshot */
@@ -2106,7 +2105,7 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     TupleDesc    tupdesc;
     Datum        values[7];
     bool        nulls[7];
-    PgStat_ArchiverStats *archiver_stats;
+    PgStat_Archiver *archiver_stats;
 
     /* Initialise values and NULL flags arrays */
     MemSet(values, 0, sizeof(values));
@@ -2176,7 +2175,7 @@ pg_stat_get_replication_slots(PG_FUNCTION_ARGS)
     Tuplestorestate *tupstore;
     MemoryContext per_query_ctx;
     MemoryContext oldcontext;
-    PgStat_ReplSlotStats *slotstats;
+    PgStat_ReplSlot *slotstats;
     int            nstats;
     int            i;
 
@@ -2209,7 +2208,7 @@ pg_stat_get_replication_slots(PG_FUNCTION_ARGS)
     {
         Datum        values[PG_STAT_GET_REPLICATION_SLOT_COLS];
         bool        nulls[PG_STAT_GET_REPLICATION_SLOT_COLS];
-        PgStat_ReplSlotStats *s = &(slotstats[i]);
+        PgStat_ReplSlot *s = &(slotstats[i]);
 
         MemSet(values, 0, sizeof(values));
         MemSet(nulls, 0, sizeof(nulls));
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 66393becfb..92a518433e 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -72,6 +72,7 @@
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/optimizer.h"
+#include "pgstat.h"
 #include "rewrite/rewriteDefine.h"
 #include "rewrite/rowsecurity.h"
 #include "storage/lmgr.h"
@@ -2354,6 +2355,10 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
      */
     RelationCloseSmgr(relation);
 
+    /* break mutual link with stats entry */
+    pgstat_delinkstats(relation);
+
+    if (relation->rd_rel)
     /*
      * Free all the subsidiary data structures of the relcache entry, then the
      * entry itself.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 6ab8216839..883c7f8802 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index ed2ab4b5b2..74fb22f216 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -269,9 +269,6 @@ GetBackendTypeDesc(BackendType backendType)
         case B_ARCHIVER:
             backendDesc = "archiver";
             break;
-        case B_STATS_COLLECTOR:
-            backendDesc = "stats collector";
-            break;
         case B_LOGGER:
             backendDesc = "logger";
             break;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 82d451569d..eb41aec4d5 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -72,6 +72,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -608,6 +609,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1222,6 +1225,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
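
Editorial note (not part of the patch): the hunks above add a timeout handler that only sets IdleStatsUpdateTimeoutPending and InterruptPending, while the flush itself (pgstat_report_stat(true)) runs later from ProcessInterrupts(). The following is a minimal standalone sketch of that "set a flag in the handler, do the work in the main loop" pattern. It does not use any PostgreSQL code: all names in it (idle_stats_update_pending, flush_pending_stats, demo_idle_stats_timeout, and so on) are hypothetical stand-ins, POSIX SIGALRM is assumed, and raise() merely simulates the timer expiring.

```c
#include <assert.h>
#include <signal.h>

/* Hypothetical stand-ins for the pending flags declared in globals.c. */
static volatile sig_atomic_t idle_stats_update_pending = 0;
static volatile sig_atomic_t interrupt_pending = 0;

/* Counts how many forced flushes happened (stand-in for real stats work). */
static int flush_count = 0;

/* Stand-in for pgstat_report_stat(true); here it only counts calls. */
static void
flush_pending_stats(void)
{
    flush_count++;
}

/* The handler does nothing but set flags, which is async-signal-safe. */
static void
idle_stats_timeout_handler(int signo)
{
    (void) signo;
    idle_stats_update_pending = 1;
    interrupt_pending = 1;
}

/* Rough analogue of ProcessInterrupts(): runs in normal (non-signal) context. */
static void
process_interrupts(void)
{
    interrupt_pending = 0;
    if (idle_stats_update_pending)
    {
        idle_stats_update_pending = 0;
        flush_pending_stats();
    }
}

/* Simulates one timeout expiry and returns the total number of flushes so far. */
static int
demo_idle_stats_timeout(void)
{
    signal(SIGALRM, idle_stats_timeout_handler);
    raise(SIGALRM);             /* simulate the timer firing */
    assert(interrupt_pending);  /* the handler has run by the time raise() returns */

    if (interrupt_pending)
        process_interrupts();   /* performs exactly one flush */
    process_interrupts();       /* flag already cleared, so this is a no-op */

    return flush_count;
}
```

Calling demo_idle_stats_timeout() once from a main() should report a single flush. The sketch only shows why the handler can stay trivial: the potentially expensive flush is deferred to a point where it is safe to take locks and touch shared memory.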
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index dabcbb0736..6dbb61a99d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -743,8 +743,8 @@ const char *const config_group_names[] =
     gettext_noop("Statistics"),
     /* STATS_MONITORING */
     gettext_noop("Statistics / Monitoring"),
-    /* STATS_COLLECTOR */
-    gettext_noop("Statistics / Query and Index Statistics Collector"),
+    /* STATS_ACTIVITY */
+    gettext_noop("Statistics / Query and Index Activity Statistics"),
     /* AUTOVACUUM */
     gettext_noop("Autovacuum"),
     /* CLIENT_CONN */
@@ -1454,7 +1454,7 @@ static struct config_bool ConfigureNamesBool[] =
 #endif
 
     {
-        {"track_activities", PGC_SUSET, STATS_COLLECTOR,
+        {"track_activities", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects information about executing commands."),
             gettext_noop("Enables the collection of information on the currently "
                          "executing command of each session, along with "
@@ -1465,7 +1465,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_counts", PGC_SUSET, STATS_COLLECTOR,
+        {"track_counts", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects statistics on database activity."),
             NULL
         },
@@ -1474,7 +1474,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_io_timing", PGC_SUSET, STATS_COLLECTOR,
+        {"track_io_timing", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects timing statistics for database I/O activity."),
             NULL
         },
@@ -4300,7 +4300,7 @@ static struct config_string ConfigureNamesString[] =
     },
 
     {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
+        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
             gettext_noop("Writes temporary statistics files to the specified directory."),
             NULL,
             GUC_SUPERUSER_ONLY
@@ -4636,7 +4636,7 @@ static struct config_enum ConfigureNamesEnum[] =
     },
 
     {
-        {"track_functions", PGC_SUSET, STATS_COLLECTOR,
+        {"track_functions", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects function-level statistics on database activity."),
             NULL
         },
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index b7fb2ec1fe..5b16c09ccc 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -578,7 +578,7 @@
 # STATISTICS
 #------------------------------------------------------------------------------
 
-# - Query and Index Statistics Collector -
+# - Query and Index Activity Statistics -
 
 #track_activities = on
 #track_counts = on
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index f674a7c94e..26603e95e4 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -124,7 +124,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index bba002ad24..b70c495f2a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -83,6 +83,8 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
 
@@ -321,7 +323,6 @@ typedef enum BackendType
     B_WAL_SENDER,
     B_WAL_WRITER,
     B_ARCHIVER,
-    B_STATS_COLLECTOR,
     B_LOGGER,
 } BackendType;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5954068dec..9bba4785d0 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL activity statistics facility.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -12,12 +12,15 @@
 #define PGSTAT_H
 
 #include "datatype/timestamp.h"
+#include "executor/instrument.h"
 #include "libpq/pqcomm.h"
 #include "miscadmin.h"
 #include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -27,8 +30,8 @@
  * ----------
  */
 #define PGSTAT_STAT_PERMANENT_DIRECTORY        "pg_stat"
-#define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
-#define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
+#define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/saved_stats"
+#define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/saved_stats.tmp"
 
 /* Default directory to store temporary statistics data in */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
@@ -41,38 +44,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_RESETSLRUCOUNTER,
-    PGSTAT_MTYPE_RESETREPLSLOTCOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_WAL,
-    PGSTAT_MTYPE_SLRU,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE,
-    PGSTAT_MTYPE_REPLSLOT,
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -83,9 +54,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -159,10 +129,13 @@ typedef enum PgStat_Single_Reset_Type
  */
 typedef struct PgStat_TableStatus
 {
-    Oid            t_id;            /* table's OID */
-    bool        t_shared;        /* is it a shared catalog? */
     struct PgStat_TableXactStatus *trans;    /* lowest subxact's counts */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_TableCounts t_counts;    /* event counts to be sent */
+    Relation    relation;            /* rel that is using this entry */
 } PgStat_TableStatus;
 
 /* ----------
@@ -186,353 +159,57 @@ typedef struct PgStat_TableXactStatus
     struct PgStat_TableXactStatus *next;    /* next of same subxact */
 } PgStat_TableXactStatus;
 
-
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
-/* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
+/*
+ * Archiver statistics kept in the shared stats
  */
-typedef struct PgStat_MsgHdr
+typedef struct PgStat_Archiver
 {
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
+    PgStat_Counter archived_count;    /* archival successes */
+    char        last_archived_wal[MAX_XFN_CHARS + 1];    /* last WAL file
+                                                         * archived */
+    TimestampTz last_archived_timestamp;    /* last archival success time */
+    PgStat_Counter failed_count;    /* failed archival attempts */
+    char        last_failed_wal[MAX_XFN_CHARS + 1]; /* WAL file involved in
+                                                     * last failure */
+    TimestampTz last_failed_timestamp;    /* last archival failure time */
+    TimestampTz stat_reset_timestamp;
+} PgStat_Archiver;
 
 /* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgDummy
+typedef struct PgStat_BgWriter
 {
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_alloc;
+    TimestampTz       stat_reset_timestamp;
+} PgStat_BgWriter;
 
 /* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
+ * PgStat_CheckPointer        checkpointer statistics
  * ----------
  */
-
-typedef struct PgStat_MsgInquiry
+typedef struct PgStat_CheckPointer
 {
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgResetslrucounter Sent by the backend to tell the collector
- *                                to reset a SLRU counter
- * ----------
- */
-typedef struct PgStat_MsgResetslrucounter
-{
-    PgStat_MsgHdr m_hdr;
-    int            m_index;
-} PgStat_MsgResetslrucounter;
-
-/* ----------
- * PgStat_MsgResetreplslotcounter Sent by the backend to tell the collector
- *                                to reset replication slot counter(s)
- * ----------
- */
-typedef struct PgStat_MsgResetreplslotcounter
-{
-    PgStat_MsgHdr m_hdr;
-    char        m_slotname[NAMEDATALEN];
-    bool        clearall;
-} PgStat_MsgResetreplslotcounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgWal            Sent by backends and background processes to update WAL statistics.
- * ----------
- */
-typedef struct PgStat_MsgWal
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Counter m_wal_records;
-    PgStat_Counter m_wal_fpi;
-    uint64        m_wal_bytes;
-    PgStat_Counter m_wal_buffers_full;
-} PgStat_MsgWal;
-
-/* ----------
- * PgStat_MsgSLRU            Sent by a backend to update SLRU statistics.
- * ----------
- */
-typedef struct PgStat_MsgSLRU
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Counter m_index;
-    PgStat_Counter m_blocks_zeroed;
-    PgStat_Counter m_blocks_hit;
-    PgStat_Counter m_blocks_read;
-    PgStat_Counter m_blocks_written;
-    PgStat_Counter m_blocks_exists;
-    PgStat_Counter m_flush;
-    PgStat_Counter m_truncate;
-} PgStat_MsgSLRU;
-
-/* ----------
- * PgStat_MsgReplSlot    Sent by a backend or a wal sender to update replication
- *                        slot statistics.
- * ----------
- */
-typedef struct PgStat_MsgReplSlot
-{
-    PgStat_MsgHdr m_hdr;
-    char        m_slotname[NAMEDATALEN];
-    bool        m_drop;
-    PgStat_Counter m_spill_txns;
-    PgStat_Counter m_spill_count;
-    PgStat_Counter m_spill_bytes;
-    PgStat_Counter m_stream_txns;
-    PgStat_Counter m_stream_count;
-    PgStat_Counter m_stream_bytes;
-} PgStat_MsgReplSlot;
-
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+} PgStat_CheckPointer;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -548,7 +225,6 @@ typedef struct PgStat_FunctionCounts
  */
 typedef struct PgStat_BackendFunctionEntry
 {
-    Oid            f_id;
     PgStat_FunctionCounts f_counts;
 } PgStat_BackendFunctionEntry;
 
@@ -564,101 +240,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgResetslrucounter msg_resetslrucounter;
-    PgStat_MsgResetreplslotcounter msg_resetreplslotcounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgWal msg_wal;
-    PgStat_MsgSLRU msg_slru;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-    PgStat_MsgReplSlot msg_replslot;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Activity statistics data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -667,13 +250,9 @@ typedef union PgStat_Msg
 
 #define PGSTAT_FILE_FORMAT_ID    0x01A5BC9F
 
-/* ----------
- * PgStat_StatDBEntry            The collector's data per database
- * ----------
- */
-typedef struct PgStat_StatDBEntry
+
+typedef struct PgStat_StatDBCounts
 {
-    Oid            databaseid;
     PgStat_Counter n_xact_commit;
     PgStat_Counter n_xact_rollback;
     PgStat_Counter n_blocks_fetched;
@@ -683,7 +262,6 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_tuples_inserted;
     PgStat_Counter n_tuples_updated;
     PgStat_Counter n_tuples_deleted;
-    TimestampTz last_autovac_time;
     PgStat_Counter n_conflict_tablespace;
     PgStat_Counter n_conflict_lock;
     PgStat_Counter n_conflict_snapshot;
@@ -693,29 +271,86 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_temp_bytes;
     PgStat_Counter n_deadlocks;
     PgStat_Counter n_checksum_failures;
-    TimestampTz last_checksum_failure;
     PgStat_Counter n_block_read_time;    /* times in microseconds */
     PgStat_Counter n_block_write_time;
+} PgStat_StatDBCounts;
 
+/* ----------
+ * PgStat_StatEntryHeader        common header struct for PgStat_Stat*Entry
+ * ----------
+ */
+typedef struct PgStat_StatEntryHeader
+{
+    LWLock        lock;
+    bool        dropped;            /* This entry is being dropped and should
+                                     * be removed when refcount goes to
+                                     * zero. */
+    pg_atomic_uint32  refcount;        /* How many backends are referencing */
+} PgStat_StatEntryHeader;
+
+/* ----------
+ * PgStat_StatDBEntry            The statistics per database
+ * ----------
+ */
+typedef struct PgStat_StatDBEntry
+{
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    TimestampTz last_autovac_time;
+    TimestampTz last_checksum_failure;
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;    /* time of db stats update */
+
+    PgStat_StatDBCounts counts;
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following members must be last in the struct, because we don't write
+     * them out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables; /* current gen tables hash */
+    dshash_table_handle functions;    /* current gen functions hash */
 } PgStat_StatDBEntry;
 
+/* ----------
+ * PgStat_Wal   Sent by backends and background processes to
+ *                update WAL statistics.
+ * ----------
+ */
+typedef struct PgStat_Wal
+{
+    WalUsage       wal_usage;
+    PgStat_Counter wal_buffers_full;
+    TimestampTz stat_reset_timestamp;
+} PgStat_Wal;
+
+/*
+ * SLRU statistics kept in the shared stats
+ */
+typedef struct PgStat_SLRUStats
+{
+    PgStat_Counter blocks_zeroed;
+    PgStat_Counter blocks_hit;
+    PgStat_Counter blocks_read;
+    PgStat_Counter blocks_written;
+    PgStat_Counter blocks_exists;
+    PgStat_Counter flush;
+    PgStat_Counter truncate;
+    TimestampTz stat_reset_timestamp;
+} PgStat_SLRUStats;
 
 /* ----------
- * PgStat_StatTabEntry            The collector's data per table (or index)
+ * PgStat_StatTabEntry            The statistics per table (or index)
  * ----------
  */
 typedef struct PgStat_StatTabEntry
 {
-    Oid            tableid;
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    /* Persistent data follow */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
 
     PgStat_Counter numscans;
 
@@ -735,99 +370,35 @@ typedef struct PgStat_StatTabEntry
     PgStat_Counter blocks_fetched;
     PgStat_Counter blocks_hit;
 
-    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
     PgStat_Counter vacuum_count;
-    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_vacuum_count;
-    TimestampTz analyze_timestamp;    /* user initiated */
     PgStat_Counter analyze_count;
-    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_analyze_count;
 } PgStat_StatTabEntry;
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            per function stats data
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
 {
-    Oid            functionid;
-
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    /* Persistent data follow */
     PgStat_Counter f_numcalls;
 
     PgStat_Counter f_total_time;    /* times in microseconds */
     PgStat_Counter f_self_time;
 } PgStat_StatFuncEntry;
 
-
-/*
- * Archiver statistics kept in the stats collector
- */
-typedef struct PgStat_ArchiverStats
-{
-    PgStat_Counter archived_count;    /* archival successes */
-    char        last_archived_wal[MAX_XFN_CHARS + 1];    /* last WAL file
-                                                         * archived */
-    TimestampTz last_archived_timestamp;    /* last archival success time */
-    PgStat_Counter failed_count;    /* failed archival attempts */
-    char        last_failed_wal[MAX_XFN_CHARS + 1]; /* WAL file involved in
-                                                     * last failure */
-    TimestampTz last_failed_timestamp;    /* last archival failure time */
-    TimestampTz stat_reset_timestamp;
-} PgStat_ArchiverStats;
-
-/*
- * Global statistics kept in the stats collector
- */
-typedef struct PgStat_GlobalStats
-{
-    TimestampTz stats_timestamp;    /* time of stats file update */
-    PgStat_Counter timed_checkpoints;
-    PgStat_Counter requested_checkpoints;
-    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
-    PgStat_Counter checkpoint_sync_time;
-    PgStat_Counter buf_written_checkpoints;
-    PgStat_Counter buf_written_clean;
-    PgStat_Counter maxwritten_clean;
-    PgStat_Counter buf_written_backend;
-    PgStat_Counter buf_fsync_backend;
-    PgStat_Counter buf_alloc;
-    TimestampTz stat_reset_timestamp;
-} PgStat_GlobalStats;
-
-/*
- * WAL statistics kept in the stats collector
- */
-typedef struct PgStat_WalStats
-{
-    PgStat_Counter wal_records;
-    PgStat_Counter wal_fpi;
-    uint64        wal_bytes;
-    PgStat_Counter wal_buffers_full;
-    TimestampTz stat_reset_timestamp;
-} PgStat_WalStats;
-
-/*
- * SLRU statistics kept in the stats collector
- */
-typedef struct PgStat_SLRUStats
-{
-    PgStat_Counter blocks_zeroed;
-    PgStat_Counter blocks_hit;
-    PgStat_Counter blocks_read;
-    PgStat_Counter blocks_written;
-    PgStat_Counter blocks_exists;
-    PgStat_Counter flush;
-    PgStat_Counter truncate;
-    TimestampTz stat_reset_timestamp;
-} PgStat_SLRUStats;
-
 /*
  * Replication slot statistics kept in the stats collector
  */
-typedef struct PgStat_ReplSlotStats
+typedef struct PgStat_ReplSlot
 {
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
     char        slotname[NAMEDATALEN];
     PgStat_Counter spill_txns;
     PgStat_Counter spill_count;
@@ -836,7 +407,7 @@ typedef struct PgStat_ReplSlotStats
     PgStat_Counter stream_count;
     PgStat_Counter stream_bytes;
     TimestampTz stat_reset_timestamp;
-} PgStat_ReplSlotStats;
+} PgStat_ReplSlot;
 
 /* ----------
  * Backend states
@@ -885,7 +456,7 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
+    WAIT_EVENT_READING_STATS_FILE,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
     WAIT_EVENT_WAL_RECEIVER_MAIN,
@@ -1137,7 +708,7 @@ typedef struct PgBackendGSSStatus
  *
  * Each live backend maintains a PgBackendStatus struct in shared memory
  * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
+ * BackendId, but that is not critical.)  Note that the stats-writing code
  * has no involvement in, or even access to, these structs.
  *
  * Each auxiliary process also maintains a PgBackendStatus struct in shared
@@ -1334,18 +905,26 @@ extern PGDLLIMPORT bool pgstat_track_counts;
 extern PGDLLIMPORT int pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used, but will be removed with GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
+
+/*
+ * CheckPointer statistics counters are updated directly by checkpointer and
+ * bufmgr
+ */
+extern PgStat_CheckPointer CheckPointerStats;
 
 /*
  * WAL statistics counter is updated by backends and background processes
  */
-extern PgStat_MsgWal WalStats;
+extern PgStat_Wal WalStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1360,33 +939,27 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
-
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 extern void pgstat_reset_slru_counter(const char *);
 extern void pgstat_reset_replslot_counter(const char *name);
 
+extern void pgstat_report_wal(void);
 extern void pgstat_report_autovac(Oid dboid);
 extern void pgstat_report_vacuum(Oid tableoid, bool shared,
                                  PgStat_Counter livetuples, PgStat_Counter deadtuples);
@@ -1426,6 +999,7 @@ extern PgStat_TableStatus *find_tabstat_entry(Oid rel_id);
 extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 
 extern void pgstat_initstats(Relation rel);
+extern void pgstat_delinkstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
 
@@ -1548,9 +1122,10 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                                       void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
-extern void pgstat_send_wal(void);
+extern void pgstat_report_archiver(const char *xlog, bool failed);
+extern void pgstat_report_bgwriter(void);
+extern void pgstat_report_checkpointer(void);
+extern void pgstat_report_wal(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1559,15 +1134,20 @@ extern void pgstat_send_wal(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_extended(bool shared,
+                                                                Oid relid);
+extern void pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
-extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
-extern PgStat_GlobalStats *pgstat_fetch_global(void);
-extern PgStat_WalStats *pgstat_fetch_stat_wal(void);
+extern TimestampTz pgstat_get_stat_timestamp(void);
+extern PgStat_Archiver *pgstat_fetch_stat_archiver(void);
+extern PgStat_BgWriter *pgstat_fetch_stat_bgwriter(void);
+extern PgStat_CheckPointer *pgstat_fetch_stat_checkpointer(void);
+extern PgStat_Wal *pgstat_fetch_stat_wal(void);
 extern PgStat_SLRUStats *pgstat_fetch_slru(void);
-extern PgStat_ReplSlotStats *pgstat_fetch_replslot(int *nslots_p);
+extern PgStat_ReplSlot *pgstat_fetch_replslot(int *nslots_p);
 
 extern void pgstat_count_slru_page_zeroed(int slru_idx);
 extern void pgstat_count_slru_page_hit(int slru_idx);
@@ -1578,5 +1158,6 @@ extern void pgstat_count_slru_flush(int slru_idx);
 extern void pgstat_count_slru_truncate(int slru_idx);
 extern const char *pgstat_slru_name(int slru_idx);
 extern int    pgstat_slru_index(const char *name);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index af9b41795d..621e074111 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -218,6 +218,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SHARED_TIDBITMAP,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_PER_XACT_PREDICATE_LIST,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 7f36e1146f..508c869d4c 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -88,7 +88,7 @@ enum config_group
     PROCESS_TITLE,
     STATS,
     STATS_MONITORING,
-    STATS_COLLECTOR,
+    STATS_ACTIVITY,
     AUTOVACUUM,
     CLIENT_CONN,
     CLIENT_CONN_STATEMENT,
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 83a15f6795..77d1572a99 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.27.0
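
The PgStat_StatEntryHeader comment in the patch above says that a dropped
entry "should be removed when refcount goes to zero". A minimal,
single-threaded sketch of that deferred-removal idea follows. It is not the
patch's implementation: EntryHeader, Entry, entry_create, entry_acquire,
entry_release and entry_drop are hypothetical names, C11 atomics stand in for
pg_atomic_uint32, plain malloc/free stands in for shared memory, and the
LWLock that would protect the dropped flag in real concurrent use is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for PgStat_StatEntryHeader (LWLock omitted). */
typedef struct EntryHeader
{
    bool        dropped;    /* entry is logically gone, free it when unused */
    atomic_uint refcount;   /* number of holders currently referencing it */
} EntryHeader;

/* Hypothetical stats entry; the header is the first member, as in the patch. */
typedef struct Entry
{
    EntryHeader header;
    long        numcalls;
} Entry;

/* Allocate an entry with no holders. Returns NULL on allocation failure. */
static Entry *
entry_create(void)
{
    Entry *e = calloc(1, sizeof(Entry));

    if (e != NULL)
        atomic_init(&e->header.refcount, 0);
    return e;
}

/* A holder starts referencing the entry. */
static void
entry_acquire(Entry *e)
{
    atomic_fetch_add(&e->header.refcount, 1);
}

/*
 * A holder stops referencing the entry. If it was the last holder and the
 * entry has been marked dropped, free it. Returns true if the entry was freed.
 */
static bool
entry_release(Entry *e)
{
    unsigned int prev = atomic_fetch_sub(&e->header.refcount, 1);

    if (prev == 1 && e->header.dropped)
    {
        free(e);
        return true;
    }
    return false;
}

/*
 * Mark the entry dropped. It is freed immediately only when nobody holds it;
 * otherwise removal is deferred to the last entry_release().
 * Returns true if the entry was freed here.
 */
static bool
entry_drop(Entry *e)
{
    e->header.dropped = true;
    if (atomic_load(&e->header.refcount) == 0)
    {
        free(e);
        return true;
    }
    return false;
}
```

In this sketch, dropping an entry that is still held only sets the flag, and
the last entry_release() performs the actual free. An entry with no holders
is freed by entry_drop() directly.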

From f715cb0c407e92dcdcb449dede52e5d725d22c04 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:09 +0900
Subject: [PATCH v43 5/7] Doc part of shared-memory based stats collector.

---
 doc/src/sgml/catalogs.sgml          |   6 +-
 doc/src/sgml/config.sgml            |  34 ++++----
 doc/src/sgml/high-availability.sgml |  13 +--
 doc/src/sgml/monitoring.sgml        | 127 +++++++++++++---------------
 doc/src/sgml/ref/pg_dump.sgml       |   9 +-
 5 files changed, 90 insertions(+), 99 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 62711ee83f..fff553c6a5 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -9223,9 +9223,9 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l
   <para>
    <xref linkend="view-table"/> lists the system views described here.
    More detailed documentation of each view follows below.
-   There are some additional views that provide access to the results of
-   the statistics collector; they are described in <xref
-   linkend="monitoring-stats-views-table"/>.
+   There are some additional views that provide access to the activity
+   statistics; they are described in
+   <xref linkend="monitoring-stats-views-table"/>.
   </para>
 
   <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 4b60382778..e6bf21b450 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7299,11 +7299,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
     <title>Run-time Statistics</title>
 
     <sect2 id="runtime-config-statistics-collector">
-     <title>Query and Index Statistics Collector</title>
+     <title>Query and Index Activity Statistics</title>
 
      <para>
-      These parameters control server-wide statistics collection features.
-      When statistics collection is enabled, the data that is produced can be
+      These parameters control server-wide activity statistics features.
+      When activity statistics is enabled, the data that is produced can be
       accessed via the <structname>pg_stat</structname> and
       <structname>pg_statio</structname> family of system views.
       Refer to <xref linkend="monitoring"/> for more information.
@@ -7319,14 +7319,13 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables the collection of information on the currently
-        executing command of each session, along with the time when
-        that command began execution. This parameter is on by
-        default. Note that even when enabled, this information is not
-        visible to all users, only to superusers and the user owning
-        the session being reported on, so it should not represent a
-        security risk.
-        Only superusers can change this setting.
+        Enables activity tracking on the currently executing command of
+        each session, along with the time when that command began
+        execution. This parameter is on by default. Note that even when
+        enabled, this information is not visible to all users, only to
+        superusers and the user owning the session being reported on, so it
+        should not represent a security risk.  Only superusers can change this
+        setting.
        </para>
       </listitem>
      </varlistentry>
@@ -7357,9 +7356,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables collection of statistics on database activity.
+        Enables tracking of database activity.
         This parameter is on by default, because the autovacuum
-        daemon needs the collected information.
+        daemon needs the activity information.
         Only superusers can change this setting.
        </para>
       </listitem>
@@ -8420,7 +8419,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Specifies the fraction of the total number of heap tuples counted in
-        the previous statistics collection that can be inserted without
+        the previously collected statistics that can be inserted without
         incurring an index scan at the <command>VACUUM</command> cleanup stage.
         This setting currently applies to B-tree indexes only.
        </para>
@@ -8432,9 +8431,10 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         the index contains deleted pages that can be recycled during cleanup.
         Index statistics are considered to be stale if the number of newly
         inserted tuples exceeds the <varname>vacuum_cleanup_index_scale_factor</varname>
-        fraction of the total number of heap tuples detected by the previous
-        statistics collection. The total number of heap tuples is stored in
-        the index meta-page. Note that the meta-page does not include this data
+
+        fraction of the total number of heap tuples in the previously
+        collected statistics. The total number of heap tuples is stored in the
+        index meta-page. Note that the meta-page does not include this data
         until <command>VACUUM</command> finds no dead tuples, so B-tree index
         scan at the cleanup stage can only be skipped if the second and
         subsequent <command>VACUUM</command> cycles detect no dead tuples.
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 19d7bd2b28..3a1dc17057 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2329,12 +2329,13 @@ LOG:  database system is ready to accept read only connections
    </para>
 
    <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
-    index usage, etc., will be recorded normally on the standby. Replayed
-    actions will not duplicate their effects on primary, so replaying an
-    insert will not increment the Inserts column of pg_stat_user_tables.
-    The stats file is deleted at the start of recovery, so stats from primary
-    and standby will differ; this is considered a feature, not a bug.
+    The activity statistics is collected during recovery. All scans, reads,
+    blocks, index usage, etc., will be recorded normally on the
+    standby. Replayed actions will not duplicate their effects on primary, so
+    replaying an insert will not increment the Inserts column of
+    pg_stat_user_tables.  The activity statistics is reset at the start of
+    recovery, so stats from primary and standby will differ; this is
+    considered a feature, not a bug.
    </para>
 
    <para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 52a69a5366..a9cb25e3af 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -22,7 +22,7 @@
   <para>
    Several tools are available for monitoring database activity and
    analyzing performance.  Most of this chapter is devoted to describing
-   <productname>PostgreSQL</productname>'s statistics collector,
+   <productname>PostgreSQL</productname>'s activity statistics,
    but one should not neglect regular Unix monitoring programs such as
    <command>ps</command>, <command>top</command>, <command>iostat</command>, and <command>vmstat</command>.
    Also, once one has identified a
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    primary server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   primary process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   primary process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
@@ -130,20 +128,21 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
  </sect1>
 
  <sect1 id="monitoring-stats">
-  <title>The Statistics Collector</title>
+  <title>The Activity Statistics</title>
 
   <indexterm zone="monitoring-stats">
    <primary>statistics</primary>
   </indexterm>
 
   <para>
-   <productname>PostgreSQL</productname>'s <firstterm>statistics collector</firstterm>
-   is a subsystem that supports collection and reporting of information about
-   server activity.  Presently, the collector can count accesses to tables
-   and indexes in both disk-block and individual-row terms.  It also tracks
-   the total number of rows in each table, and information about vacuum and
-   analyze actions for each table.  It can also count calls to user-defined
-   functions and the total time spent in each one.
+   <productname>PostgreSQL</productname>'s <firstterm>activity
+   statistics</firstterm> is a subsystem that supports tracking and reporting
+   of information about server activity.  Presently, the activity statistics
+   tracks the count of accesses to tables and indexes in both disk-block and
+   individual-row terms.  It also tracks the total number of rows in each
+   table, and information about vacuum and analyze actions for each table.  It
+   can also track calls to user-defined functions and the total time spent in
+   each one.
   </para>
 
   <para>
@@ -151,15 +150,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    information about exactly what is going on in the system right now, such as
    the exact command currently being executed by other server processes, and
    which other connections exist in the system.  This facility is independent
-   of the collector process.
+   of the activity statistics.
   </para>
 
  <sect2 id="monitoring-stats-setup">
-  <title>Statistics Collection Configuration</title>
+  <title>Activity Statistics Configuration</title>
 
   <para>
-   Since collection of statistics adds some overhead to query execution,
-   the system can be configured to collect or not collect information.
+   Since tracking for the activity statistics adds some overhead to query
+   execution, the system can be configured to track or not track activity.
    This is controlled by configuration parameters that are normally set in
    <filename>postgresql.conf</filename>.  (See <xref linkend="runtime-config"/> for
    details about setting configuration parameters.)
@@ -172,7 +171,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The parameter <xref linkend="guc-track-counts"/> controls whether
-   statistics are collected about table and index accesses.
+   to track activity about table and index accesses.
   </para>
 
   <para>
@@ -196,18 +195,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
   </para>
 
   <para>
-   The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
-   When the server shuts down cleanly, a permanent copy of the statistics
-   data is stored in the <filename>pg_stat</filename> subdirectory, so that
-   statistics can be retained across server restarts.  When recovery is
-   performed at server start (e.g., after immediate shutdown, server crash,
-   and point-in-time recovery), all statistics counters are reset.
+   When the server shuts down cleanly, a permanent copy of the statistics
+   data is stored in the <filename>pg_stat</filename> subdirectory, so that
+   statistics can be retained across server restarts.  When recovery is
+   performed at server start (e.g. after immediate shutdown, server crash,
+   and point-in-time recovery), all statistics counters are reset.
   </para>
 
  </sect2>
@@ -220,48 +212,46 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    linkend="monitoring-stats-dynamic-views-table"/>, are available to show
    the current state of the system. There are also several other
    views, listed in <xref
-   linkend="monitoring-stats-views-table"/>, available to show the results
-   of statistics collection.  Alternatively, one can
-   build custom views using the underlying statistics functions, as discussed
-   in <xref linkend="monitoring-stats-functions"/>.
+   linkend="monitoring-stats-views-table"/>, available to show the activity
+   statistics.  Alternatively, one can build custom views using the underlying
+   statistics functions, as discussed in
+   <xref linkend="monitoring-stats-functions"/>.
   </para>
 
   <para>
-   When using the statistics to monitor collected data, it is important
-   to realize that the information does not update instantaneously.
-   Each individual server process transmits new statistical counts to
-   the collector just before going idle; so a query or transaction still in
-   progress does not affect the displayed totals.  Also, the collector itself
-   emits a new report at most once per <varname>PGSTAT_STAT_INTERVAL</varname>
-   milliseconds (500 ms unless altered while building the server).  So the
-   displayed information lags behind actual activity.  However, current-query
-   information collected by <varname>track_activities</varname> is
-   always up-to-date.
+   When using the activity statistics, it is important to realize that the
+   information does not update instantaneously.  Each individual server
+   process writes out new statistical counts just before going idle, but not
+   more frequently than once per <varname>PGSTAT_STAT_INTERVAL</varname>
+   milliseconds (1 second unless altered while building the server); so a
+   query or transaction still in progress does not affect the displayed
+   totals.  However, current-query information tracked by
+   <varname>track_activities</varname> is always up-to-date.
   </para>
 
   <para>
    Another important point is that when a server process is asked to display
-   any of these statistics, it first fetches the most recent report emitted by
-   the collector process and then continues to use this snapshot for all
-   statistical views and functions until the end of its current transaction.
-   So the statistics will show static information as long as you continue the
-   current transaction.  Similarly, information about the current queries of
-   all sessions is collected when any such information is first requested
-   within a transaction, and the same information will be displayed throughout
-   the transaction.
-   This is a feature, not a bug, because it allows you to perform several
-   queries on the statistics and correlate the results without worrying that
-   the numbers are changing underneath you.  But if you want to see new
-   results with each query, be sure to do the queries outside any transaction
-   block.  Alternatively, you can invoke
+   any of these statistics, it first reads the current statistics and then
+   continues to use this snapshot for all statistical views and functions
+   until the end of its current transaction.  So the statistics will show
+   static information as long as you continue the current transaction.
+   Similarly, information about the current queries of all sessions is tracked
+   when any such information is first requested within a transaction, and the
+   same information will be displayed throughout the transaction.  This is a
+   feature, not a bug, because it allows you to perform several queries on the
+   statistics and correlate the results without worrying that the numbers are
+   changing underneath you.  But if you want to see new results with each
+   query, be sure to do the queries outside any transaction block.
+   Alternatively, you can invoke
    <function>pg_stat_clear_snapshot</function>(), which will discard the
    current transaction's statistics snapshot (if any).  The next use of
    statistical information will cause a new snapshot to be fetched.
   </para>
-
+  
   <para>
-   A transaction can also see its own statistics (as yet untransmitted to the
-   collector) in the views <structname>pg_stat_xact_all_tables</structname>,
+   A transaction can also see its own statistics (as yet unwritten to the
+   server-wide activity statistics) in the
+   views <structname>pg_stat_xact_all_tables</structname>,
    <structname>pg_stat_xact_sys_tables</structname>,
    <structname>pg_stat_xact_user_tables</structname>, and
    <structname>pg_stat_xact_user_functions</structname>.  These numbers do not act as
@@ -637,7 +627,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    kernel's I/O cache, and might therefore still be fetched without
    requiring a physical read. Users interested in obtaining more
    detailed information on <productname>PostgreSQL</productname> I/O behavior are
-   advised to use the <productname>PostgreSQL</productname> statistics collector
+   advised to use the <productname>PostgreSQL</productname> activity statistics
    in combination with operating system utilities that allow insight
    into the kernel's handling of I/O.
   </para>
@@ -1074,10 +1064,6 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>LogicalLauncherMain</literal></entry>
       <entry>Waiting in main loop of logical replication launcher process.</entry>
      </row>
-     <row>
-      <entry><literal>PgStatMain</literal></entry>
-      <entry>Waiting in main loop of statistics collector process.</entry>
-     </row>
      <row>
       <entry><literal>RecoveryWalStream</literal></entry>
       <entry>Waiting in main loop of startup process for WAL to arrive, during
@@ -1832,6 +1818,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
     </thead>
 
     <tbody>
+     <row>
+      <entry><literal>ActivityStatistics</literal></entry>
+      <entry>Waiting to write out activity statistics to shared memory.</entry>
+     </row>
      <row>
       <entry><literal>AddinShmemInit</literal></entry>
       <entry>Waiting to manage an extension's space allocation in shared
@@ -5989,9 +5979,10 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
      <entry><literal>performing final cleanup</literal></entry>
      <entry>
        <command>VACUUM</command> is performing final cleanup.  During this phase,
-       <command>VACUUM</command> will vacuum the free space map, update statistics
-       in <literal>pg_class</literal>, and report statistics to the statistics
-       collector.  When this phase is completed, <command>VACUUM</command> will end.
+       <command>VACUUM</command> will vacuum the free space map, update
+       statistics in <literal>pg_class</literal>, and update the system-wide
+       activity statistics.  When this phase is completed,
+       <command>VACUUM</command> will end.
      </entry>
     </row>
    </tbody>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 0aa35cf0c3..ad105cb2a6 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1280,11 +1280,10 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The database activity of <application>pg_dump</application> is
-   normally collected by the statistics collector.  If this is
-   undesirable, you can set parameter <varname>track_counts</varname>
-   to false via <envar>PGOPTIONS</envar> or the <literal>ALTER
-   USER</literal> command.
+   The database activity of <application>pg_dump</application> is normally
+   collected.  If this is undesirable, you can set
+   parameter <varname>track_counts</varname> to false
+   via <envar>PGOPTIONS</envar> or the <literal>ALTER USER</literal> command.
   </para>
 
  </refsect1>
-- 
2.27.0

From 0b7b66d32749dfe9833755d102016272df02d816 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 29 Sep 2020 22:59:33 +0900
Subject: [PATCH v43 6/7] Remove the GUC stats_temp_directory

The new stats collection system doesn't need a temporary directory, so
just remove it. pg_stat_statements is modified to store its temporary
files in the pg_stat directory.  As a result, basebackup copies the
pg_stat_statements temporary file if it exists.
---
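Note for readers (not part of the commit message): the change to
PGSS_TEXT_FILE relies on C's compile-time concatenation of adjacent string
literals. The sketch below shows what the new definition expands to and why
the file is no longer caught by the pg_stat_tmp exclusion. The value "pg_stat"
for PGSTAT_STAT_PERMANENT_DIRECTORY is an assumption inferred from the
"pg_stat/saved_stats" paths in pgstat.h, and the sketch_* helpers are
hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <string.h>

/* Assumed value; repeated here so the sketch compiles on its own. */
#define PGSTAT_STAT_PERMANENT_DIRECTORY "pg_stat"

/* Same shape as the new definition in pg_stat_statements.c. */
#define PGSS_TEXT_FILE PGSTAT_STAT_PERMANENT_DIRECTORY "/pgss_query_texts.stat"

/* Hypothetical accessor: returns the concatenated literal. */
static const char *
sketch_pgss_text_file(void)
{
    return PGSS_TEXT_FILE;
}

/*
 * Hypothetical helper: reports whether a relative path lies directly
 * under the permanent stats directory.  The trailing '/' check keeps
 * "pg_stat_tmp/..." from matching the "pg_stat" prefix.
 */
static int
sketch_in_permanent_dir(const char *path)
{
    size_t      dirlen = strlen(PGSTAT_STAT_PERMANENT_DIRECTORY);

    assert(path != NULL);
    return strncmp(path, PGSTAT_STAT_PERMANENT_DIRECTORY, dirlen) == 0 &&
        path[dirlen] == '/';
}
```

Under these assumptions the macro expands to "pg_stat/pgss_query_texts.stat".
Since this patch also drops PG_STAT_TMP_DIR from the excludeDirContents lists,
nothing excludes the query text file any more, which matches the commit
message's remark that basebackup now copies it.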
 .../pg_stat_statements/pg_stat_statements.c   | 13 +++---
 doc/src/sgml/backup.sgml                      |  2 -
 doc/src/sgml/config.sgml                      | 19 --------
 doc/src/sgml/storage.sgml                     |  6 ---
 src/backend/postmaster/pgstat.c               | 10 -----
 src/backend/replication/basebackup.c          | 36 ----------------
 src/backend/utils/misc/guc.c                  | 43 -------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/bin/initdb/initdb.c                       |  1 -
 src/bin/pg_rewind/filemap.c                   |  7 ---
 src/include/pgstat.h                          |  3 --
 src/test/perl/PostgresNode.pm                 |  4 --
 12 files changed, 6 insertions(+), 139 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 70cfdb2c9d..4860f1c33b 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -88,14 +88,13 @@ PG_MODULE_MAGIC;
 #define PGSS_DUMP_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pg_stat_statements.stat"
 
 /*
- * Location of external query text file.  We don't keep it in the core
- * system's stats_temp_directory.  The core system can safely use that GUC
- * setting, because the statistics collector temp file paths are set only once
- * as part of changing the GUC, but pg_stat_statements has no way of avoiding
- * race conditions.  Besides, we only expect modest, infrequent I/O for query
- * strings, so placing the file on a faster filesystem is not compelling.
+ * Location of external query text file.  We don't keep it in the core system's
+ * pg_stats.  pg_stat_statements has no way of avoiding race conditions even if
+ * the directory were specified by a GUC.  Besides, we only expect modest,
+ * infrequent I/O for query strings, so placing the file on a faster filesystem
+ * is not compelling.
  */
-#define PGSS_TEXT_FILE    PG_STAT_TMP_DIR "/pgss_query_texts.stat"
+#define PGSS_TEXT_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pgss_query_texts.stat"
 
 /* Magic number identifying the stats file format */
 static const uint32 PGSS_FILE_HEADER = 0x20201126;
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 42a8ed328d..dd3d8892d8 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1146,8 +1146,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e6bf21b450..9c86ecac15 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7412,25 +7412,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 3234adb639..6bac5e075e 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -120,12 +120,6 @@ Item
   subsystem</entry>
 </row>
 
-<row>
- <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
-</row>
-
 <row>
  <entry><filename>pg_subtrans</filename></entry>
  <entry>Subdirectory containing subtransaction status data</entry>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 891118883c..97b8f3d132 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -99,16 +99,6 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
- */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
-
 /*
  * WAL usage counters saved from pgWALUsage at the previous call to
  * pgstat_send_wal(). This is used to calculate how much WAL usage
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 4e5d63b30e..2b2761a588 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -87,9 +87,6 @@ static int    basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
 /* Was the backup currently in-progress initiated in recovery mode? */
 static bool backup_started_in_recovery = false;
 
-/* Relative path of temporary statistics directory */
-static char *statrelpath = NULL;
-
 /*
  * Size of each block sent into the tar stream for larger files.
  */
@@ -152,13 +149,6 @@ struct exclude_list_item
  */
 static const char *const excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    PG_STAT_TMP_DIR,
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
@@ -261,7 +251,6 @@ perform_base_backup(basebackup_options *opt)
     StringInfo    labelfile;
     StringInfo    tblspc_map_file;
     backup_manifest_info manifest;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
     backup_total = 0;
@@ -284,8 +273,6 @@ perform_base_backup(basebackup_options *opt)
     Assert(CurrentResourceOwner == NULL);
     CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -314,18 +301,6 @@ perform_base_backup(basebackup_options *opt)
         tablespaceinfo *ti;
         int            tblspc_streamed = 0;
 
-        /*
-         * Calculate the relative path of temporary statistics directory in
-         * order to skip the files which are located in that directory later.
-         */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
-
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
         ti->size = -1;
@@ -1377,17 +1352,6 @@ sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces,
         if (excludeFound)
             continue;
 
-        /*
-         * Exclude contents of directory specified by statrelpath if not set
-         * to the default (pg_stat_tmp) which is caught in the loop above.
-         */
-        if (statrelpath != NULL && strcmp(pathbuf, statrelpath) == 0)
-        {
-            elog(DEBUG1, "contents of directory \"%s\" excluded from backup", statrelpath);
-            size += _tarWriteDir(pathbuf, basepathlen, &statbuf, sizeonly);
-            continue;
-        }
-
         /*
          * We can skip pg_wal, the WAL segments need to be fetched from the
          * WAL archive anyway. But include it as an empty directory anyway, so
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 6dbb61a99d..e3c84f30e7 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -202,7 +202,6 @@ static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource sourc
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_huge_page_size(int *newval, void **extra, GucSource source);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -558,8 +557,6 @@ char       *HbaFileName;
 char       *IdentFileName;
 char       *external_pid_file;
 
-char       *pgstat_temp_directory;
-
 char       *application_name;
 
 int            tcp_keepalives_idle;
@@ -4299,17 +4296,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_PRIMARY,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11692,35 +11678,6 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
     return true;
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5b16c09ccc..91b8013b1e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -585,7 +585,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index ee3bfa82f4..5b7eb30f14 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -217,7 +217,6 @@ static const char *const subdirs[] = {
     "pg_replslot",
     "pg_tblspc",
     "pg_stat",
-    "pg_stat_tmp",
     "pg_xact",
     "pg_logical",
     "pg_logical/snapshots",
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index ba34dbac14..00aed706bb 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -87,13 +87,6 @@ struct exclude_list_item
  */
 static const char *excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    "pg_stat_tmp",                /* defined as PG_STAT_TMP_DIR */
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 9bba4785d0..17f8afaf50 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -33,9 +33,6 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/saved_stats"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/saved_stats.tmp"
 
-/* Default directory to store temporary statistics data in */
-#define PG_STAT_TMP_DIR        "pg_stat_tmp"
-
 /* Values for track_functions GUC variable --- order is significant! */
 typedef enum TrackFunctionsLevel
 {
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index ebcaeb44fe..8772fcc970 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -455,10 +455,6 @@ sub init
     print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
       if defined $ENV{TEMP_CONFIG};
 
-    # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
-    # concurrently must not share a stats_temp_directory.
-    print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
     if ($params{allows_streaming})
     {
         if ($params{allows_streaming} eq "logical")
-- 
2.27.0

From 7e1a30179a85da7162a44ef23e0459a8f5b0d817 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 29 Sep 2020 23:19:58 +0900
Subject: [PATCH v43 7/7] Exclude pg_stat directory from base backup

basebackup sends the contents of the pg_stat directory, which are
removed anyway when a server starts up from the backup. Now that
pg_stat_statements saves a temporary file into that directory, exclude
the pg_stat directory from base backups.
---
 src/backend/replication/basebackup.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 2b2761a588..eaf4448943 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -149,6 +149,13 @@ struct exclude_list_item
  */
 static const char *const excludeDirContents[] =
 {
+    /*
+     * Skip statistics files. PGSTAT_STAT_PERMANENT_DIRECTORY must be skipped
+     * because the files in the directory will be removed at startup from the
+     * backup.
+     */
+    PGSTAT_STAT_PERMANENT_DIRECTORY,
+
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
-- 
2.27.0
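
The effect of adding a name to excludeDirContents can be pictured with a small standalone model. This is only a sketch, not the real basebackup.c logic: the array contents, the toy_ names, and the assumption that PGSTAT_STAT_PERMANENT_DIRECTORY expands to "pg_stat" are all illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for the excludeDirContents[] list.  "pg_stat" is
 * the assumed value of PGSTAT_STAT_PERMANENT_DIRECTORY.
 */
static const char *const toy_exclude_dir_contents[] = {
    "pg_stat",
    "pg_replslot",
    NULL
};

/*
 * Returns true when the directory itself should appear in the backup but
 * its contents should be skipped.
 */
static bool
toy_skip_dir_contents(const char *dirname)
{
    for (int i = 0; toy_exclude_dir_contents[i] != NULL; i++)
    {
        if (strcmp(dirname, toy_exclude_dir_contents[i]) == 0)
            return true;
    }
    return false;
}
```

With this model, toy_skip_dir_contents("pg_stat") is true while toy_skip_dir_contents("base") is false, which mirrors the intent of the hunk above: the directory entry survives, the files in it do not.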


Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
At Fri, 11 Dec 2020 16:50:03 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> - Fixed some silly bugs of WAL statistics.

- Conflicted with b3817f5f77. Rebased.

- Make sure to clean up local reference hash before detaching shared
  stats memory. Forgetting this caused an assertion failure.

- Reduced the planned number of pg_basebackup tests to match the
  earlier reduction of the directory list in the test script.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From fb4cec37e03b921a9babeeb35c8141b19aa10dee Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:03 +0900
Subject: [PATCH v44 1/7] sequential scan for dshash

Dshash did not allow scanning all entries sequentially. This adds that
functionality. The interface is similar to, but slightly different
from, both that of dynahash and the simple dshash search functions. One
of the most significant differences is that the sequential scan
interface of dshash always needs a call to dshash_seq_term when a scan
ends. Another is locking. Dshash holds the partition lock when
returning an entry; dshash_seq_next() also holds the lock when
returning an entry, but callers must not release it, since the lock is
essential to continue the scan. The seqscan interface allows entry
deletion during a scan. In-scan deletion should be performed with
dshash_delete_current().
---
 src/backend/lib/dshash.c | 160 ++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  22 ++++++
 2 files changed, 181 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 78ccf03217..b829167872 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -127,6 +127,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +157,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -324,7 +332,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -592,6 +600,156 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should always be called when a scan finishes.
+ * The caller may delete returned elements in the midst of a scan by using
+ * dshash_delete_current(). exclusive must be true to delete elements.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool exclusive)
+{
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->exclusive = exclusive;
+}
+
+/*
+ * Returns the next element.
+ *
+ * Returned elements are locked and the caller must not explicitly release
+ * the lock. It is released at the next call to dshash_seq_next().
+ */
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert(status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+        LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                      status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        status->curpartition = partition;
+
+        /* resize doesn't happen from now until seq scan ends */
+        status->nbuckets =
+            NUM_BUCKETS(status->hash_table->control->size_log2);
+        ensure_valid_bucket_pointers(status->hash_table);
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(status->hash_table,
+                                               status->curpartition),
+                                status->exclusive ? LW_EXCLUSIVE : LW_SHARED));
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        int next_partition;
+
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            return NULL;
+        }
+
+        /* Check if move to the next partition */
+        next_partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+
+        if (status->curpartition != next_partition)
+        {
+            /*
+             * Move to the next partition. Lock the next partition first, then
+             * release the current one, rather than the reverse, to avoid
+             * concurrent resizing.  Avoid deadlock by taking locks in the same
+             * order as resize().
+             */
+            LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                         next_partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                         status->curpartition));
+            status->curpartition = next_partition;
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * The caller may delete the item. Store the next item in case of deletion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+/*
+ * Terminates the seqscan and releases all locks.
+ *
+ * Should always be called when finishing or exiting a seqscan.
+ */
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+
+    if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+/* Remove the current entry during a seq scan. */
+void
+dshash_delete_current(dshash_seq_status *status)
+{
+    dshash_table       *hash_table    = status->hash_table;
+    dshash_table_item  *item        = status->curitem;
+    size_t                partition    = PARTITION_FOR_HASH(item->hash);
+
+    Assert(status->exclusive);
+    Assert(hash_table->control->magic == DSHASH_MAGIC);
+    Assert(hash_table->find_locked);
+    Assert(hash_table->find_exclusively_locked);
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition),
+                                LW_EXCLUSIVE));
+
+    delete_item(hash_table, item);
+}
+
+/* Get the current entry during a seq scan. */
+void *
+dshash_get_current(dshash_seq_status *status)
+{
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index b86df68e77..c337099061 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,21 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state. The detail is exposed to let users know the storage
+ * size but it should be considered as an opaque type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;    /* dshash table working on */
+    int                    curbucket;    /* bucket number we are at */
+    int                    nbuckets;    /* total number of buckets in the dshash */
+    dshash_table_item  *curitem;    /* item we are currently at */
+    dsa_pointer            pnextitem;    /* dsa-pointer to the next item */
+    int                    curpartition;    /* partition number we are at */
+    bool                exclusive;    /* locking mode */
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
                                    const dshash_parameters *params,
@@ -80,6 +95,13 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
+extern void dshash_delete_current(dshash_seq_status *status);
+extern void *dshash_get_current(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.27.0
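
The scan-with-deletion pattern in the patch above relies on remembering the next item before handing the current one to the caller. The following standalone sketch shows only that idea. It assumes a plain chained table with no partitions and no locking, and every toy_ name is hypothetical rather than part of dshash.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_NBUCKETS 4

typedef struct toy_item
{
    struct toy_item *next;
    int         key;
} toy_item;

typedef struct toy_table
{
    toy_item   *buckets[TOY_NBUCKETS];
} toy_table;

typedef struct toy_seq_status
{
    toy_table  *table;
    int         curbucket;  /* bucket we are at */
    bool        started;    /* has the first item been fetched? */
    toy_item   *curitem;    /* item last returned, NULL once deleted */
    toy_item   *nextitem;   /* saved so that curitem may be deleted */
} toy_seq_status;

static void
toy_insert(toy_table *t, int key)
{
    toy_item   *item = malloc(sizeof(toy_item));
    int         b = (int) ((unsigned) key % TOY_NBUCKETS);

    if (item == NULL)
        abort();
    item->key = key;
    item->next = t->buckets[b];
    t->buckets[b] = item;
}

static void
toy_seq_init(toy_seq_status *s, toy_table *t)
{
    s->table = t;
    s->curbucket = 0;
    s->started = false;
    s->curitem = NULL;
    s->nextitem = NULL;
}

static toy_item *
toy_seq_next(toy_seq_status *s)
{
    toy_item   *next;

    if (!s->started)
    {
        s->started = true;
        next = s->table->buckets[0];
    }
    else
        next = s->nextitem;

    /* Move to the next bucket once the current chain is exhausted. */
    while (next == NULL)
    {
        if (++s->curbucket >= TOY_NBUCKETS)
        {
            s->curitem = NULL;
            return NULL;
        }
        next = s->table->buckets[s->curbucket];
    }

    s->curitem = next;
    s->nextitem = next->next;   /* survives deletion of curitem */
    return next;
}

static void
toy_delete_current(toy_seq_status *s)
{
    toy_item  **link = &s->table->buckets[s->curbucket];

    assert(s->curitem != NULL);
    while (*link != s->curitem)
        link = &(*link)->next;
    *link = s->curitem->next;
    free(s->curitem);
    s->curitem = NULL;
}

/* Insert keys 0..nkeys-1, delete the even ones in a scan, count the rest. */
static int
toy_count_after_deleting_even(int nkeys)
{
    toy_table   t = {{NULL}};
    toy_seq_status s;
    toy_item   *item;
    int         remaining = 0;

    for (int k = 0; k < nkeys; k++)
        toy_insert(&t, k);

    toy_seq_init(&s, &t);
    while ((item = toy_seq_next(&s)) != NULL)
    {
        if (item->key % 2 == 0)
            toy_delete_current(&s);
    }

    toy_seq_init(&s, &t);
    while ((item = toy_seq_next(&s)) != NULL)
    {
        remaining++;
        toy_delete_current(&s); /* free everything on the way out */
    }
    return remaining;
}
```

The sketch has no counterpart to dshash_seq_term because it takes no locks. In the patch that call is what releases the partition lock still held when the scan ends or is abandoned.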

From fda5f1bd717f9b751f94b39b3c8feffdee147e04 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:57 +0900
Subject: [PATCH v44 2/7] Add conditional lock feature to dshash

Dshash currently waits for locks unconditionally. This is inconvenient
when we want to avoid being blocked by other processes. This commit
adds dshash_find_extended, an alternative to dshash_find and
dshash_find_or_insert that allows immediate return on lock failure.
---
 src/backend/lib/dshash.c | 98 +++++++++++++++++++++-------------------
 src/include/lib/dshash.h |  3 ++
 2 files changed, 55 insertions(+), 46 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index b829167872..9c90096f3d 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -383,6 +383,10 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  * the caller must take care to ensure that the entry is not left corrupted.
  * The lock mode is either shared or exclusive depending on 'exclusive'.
  *
+ * If found is not NULL, *found is set to true if the key is found in the hash
+ * table. If the key is not found, *found is set to false and a pointer to a
+ * newly created entry is returned.
+ *
  * The caller must not lock a lock already.
  *
  * Note that the lock held is in fact an LWLock, so interrupts will be held on
@@ -392,36 +396,7 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
 {
-    dshash_hash hash;
-    size_t        partition;
-    dshash_table_item *item;
-
-    hash = hash_key(hash_table, key);
-    partition = PARTITION_FOR_HASH(hash);
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
-
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
-    ensure_valid_bucket_pointers(hash_table);
-
-    /* Search the active bucket. */
-    item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
-
-    if (!item)
-    {
-        /* Not found. */
-        LWLockRelease(PARTITION_LOCK(hash_table, partition));
-        return NULL;
-    }
-    else
-    {
-        /* The caller will free the lock by calling dshash_release_lock. */
-        hash_table->find_locked = true;
-        hash_table->find_exclusively_locked = exclusive;
-        return ENTRY_FROM_ITEM(item);
-    }
+    return dshash_find_extended(hash_table, key, exclusive, false, false, NULL);
 }
 
 /*
@@ -439,31 +414,60 @@ dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
 {
-    dshash_hash hash;
-    size_t        partition_index;
-    dshash_partition *partition;
+    return dshash_find_extended(hash_table, key, true, false, true, found);
+}
+
+
+/*
+ * Find the key in the hash table.
+ *
+ * "exclusive" is the lock mode in which the partition for the returned item
+ * is locked.  If "nowait" is true, the function returns NULL immediately if
+ * the required lock cannot be acquired.  "insert" indicates insert mode: if
+ * the key is not found, a new entry is inserted and *found is set to false;
+ * otherwise *found is set to true. "found" must be non-null in this mode.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool insert, bool *found)
+{
+    dshash_hash hash = hash_key(hash_table, key);
+    size_t        partidx = PARTITION_FOR_HASH(hash);
+    dshash_partition *partition = &hash_table->control->partitions[partidx];
+    LWLockMode  lockmode = exclusive ? LW_EXCLUSIVE : LW_SHARED;
     dshash_table_item *item;
 
-    hash = hash_key(hash_table, key);
-    partition_index = PARTITION_FOR_HASH(hash);
-    partition = &hash_table->control->partitions[partition_index];
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
+    /* must be exclusive when insert allowed */
+    Assert(!insert || (exclusive && found != NULL));
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (!nowait)
+        LWLockAcquire(PARTITION_LOCK(hash_table, partidx), lockmode);
+    else if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partidx),
+                                       lockmode))
+        return NULL;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
     item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
 
     if (item)
-        *found = true;
+    {
+        if (found)
+            *found = true;
+    }
     else
     {
-        *found = false;
+        if (found)
+            *found = false;
+
+        if (!insert)
+        {
+            /* The caller didn't ask us to add a new entry. */
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+            return NULL;
+        }
 
         /* Check if we are getting too full. */
         if (partition->count > MAX_COUNT_PER_PARTITION(hash_table))
@@ -479,7 +483,8 @@ restart:
              * Give up our existing lock first, because resizing needs to
              * reacquire all the locks in the right order to avoid deadlocks.
              */
-            LWLockRelease(PARTITION_LOCK(hash_table, partition_index));
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+
             resize(hash_table, hash_table->size_log2 + 1);
 
             goto restart;
@@ -493,12 +498,13 @@ restart:
         ++partition->count;
     }
 
-    /* The caller must release the lock with dshash_release_lock. */
+    /* The caller will free the lock by calling dshash_release_lock. */
     hash_table->find_locked = true;
-    hash_table->find_exclusively_locked = true;
+    hash_table->find_exclusively_locked = exclusive;
     return ENTRY_FROM_ITEM(item);
 }
 
+
 /*
  * Remove an entry by key.  Returns true if the key was found and the
  * corresponding entry was removed.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index c337099061..493e974832 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -91,6 +91,9 @@ extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                                    const void *key, bool *found);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+                                  bool exclusive, bool nowait, bool insert,
+                                  bool *found);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.27.0
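
The nowait behaviour added above can be modelled outside PostgreSQL with a try-lock. The sketch below is an assumption-laden toy: it uses a C11 atomic_flag as a spinlock in place of LWLockAcquire/LWLockConditionalAcquire, a fixed-size slot array in place of the real bucket chains, and hypothetical toy_ names throughout.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NPARTITIONS 4
#define TOY_NSLOTS 8

typedef struct toy_entry
{
    bool        used;
    int         key;
    int         value;
} toy_entry;

typedef struct toy_table
{
    atomic_flag locks[TOY_NPARTITIONS];     /* one toy lock per partition */
    toy_entry   slots[TOY_NPARTITIONS][TOY_NSLOTS];
} toy_table;

static void
toy_init(toy_table *t)
{
    for (int p = 0; p < TOY_NPARTITIONS; p++)
    {
        atomic_flag_clear(&t->locks[p]);
        for (int s = 0; s < TOY_NSLOTS; s++)
            t->slots[p][s].used = false;
    }
}

static int
toy_partition_for_key(int key)
{
    return (int) ((unsigned) key % TOY_NPARTITIONS);
}

static void
toy_release_lock(toy_table *t, int partition)
{
    atomic_flag_clear(&t->locks[partition]);
}

/*
 * Returns the entry with its partition lock held, or NULL with no lock
 * held.  *lock_failed tells a failed try-lock apart from "not found".
 */
static toy_entry *
toy_find_extended(toy_table *t, int key, bool nowait, bool insert,
                  bool *found, bool *lock_failed)
{
    int         p = toy_partition_for_key(key);
    toy_entry  *free_slot = NULL;

    *lock_failed = false;
    if (!nowait)
    {
        while (atomic_flag_test_and_set(&t->locks[p]))
            ;                   /* spin until the lock is free */
    }
    else if (atomic_flag_test_and_set(&t->locks[p]))
    {
        *lock_failed = true;    /* somebody holds it: give up at once */
        return NULL;
    }

    for (int s = 0; s < TOY_NSLOTS; s++)
    {
        toy_entry  *e = &t->slots[p][s];

        if (e->used && e->key == key)
        {
            if (found)
                *found = true;
            return e;
        }
        if (!e->used && free_slot == NULL)
            free_slot = e;
    }

    if (found)
        *found = false;
    if (!insert || free_slot == NULL)
    {
        toy_release_lock(t, p);
        return NULL;
    }
    free_slot->used = true;
    free_slot->key = key;
    free_slot->value = 0;
    return free_slot;
}

/* Counts how many of three lookups returned an entry. */
static int
toy_nowait_demo(void)
{
    toy_table   t;
    bool        found;
    bool        failed;
    int         hits = 0;

    toy_init(&t);

    /* 1. Blocking insert: succeeds and leaves the partition locked. */
    if (toy_find_extended(&t, 5, false, true, &found, &failed) != NULL)
        hits++;

    /* 2. nowait lookup while the lock is still held: returns NULL. */
    if (toy_find_extended(&t, 5, true, false, &found, &failed) != NULL)
        hits++;
    assert(failed);

    /* 3. After releasing the lock, the nowait lookup succeeds. */
    toy_release_lock(&t, toy_partition_for_key(5));
    if (toy_find_extended(&t, 5, true, false, &found, &failed) != NULL)
        hits++;
    toy_release_lock(&t, toy_partition_for_key(5));

    return hits;
}
```

In dshash_find_extended as posted, a NULL return with nowait set can mean either that the lock was not acquired or, without insert, that the key was absent. The lock_failed out-parameter is an addition of this sketch to tell the two cases apart and is not part of the patch.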

From ae704546432ce5283e8d18954baf675c25a75d91 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:59:38 +0900
Subject: [PATCH v44 3/7] Make archiver process an auxiliary process

This is a preliminary patch for shared-memory based stats collector.

The archiver process must be an auxiliary process, since it uses shared
memory once stats data is moved into shared memory. Make the process
an auxiliary process so that it keeps working.
---
 src/backend/access/transam/xlogarchive.c |   6 +-
 src/backend/bootstrap/bootstrap.c        |  22 ++--
 src/backend/postmaster/pgarch.c          | 130 +++--------------------
 src/backend/postmaster/postmaster.c      |  50 +++++----
 src/backend/storage/lmgr/proc.c          |   1 +
 src/include/access/xlog.h                |   3 +
 src/include/access/xlogarchive.h         |   1 +
 src/include/miscadmin.h                  |   2 +
 src/include/postmaster/pgarch.h          |   4 +-
 src/include/storage/pmsignal.h           |   1 -
 src/include/storage/proc.h               |   3 +
 11 files changed, 69 insertions(+), 154 deletions(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index f39dc4ddf1..d69e20961d 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -29,7 +29,9 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/latch.h"
 #include "storage/pmsignal.h"
+#include "storage/proc.h"
 
 /*
  * Attempt to retrieve the specified file from off-line archival storage.
@@ -490,8 +492,8 @@ XLogArchiveNotify(const char *xlog)
     }
 
     /* Notify archiver that it's got something to do */
-    if (IsUnderPostmaster)
-        SendPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER);
+    if (IsUnderPostmaster && ProcGlobal->archiverLatch)
+        SetLatch(ProcGlobal->archiverLatch);
 }
 
 /*
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index a7ed93fdc1..edda899554 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -317,6 +317,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
         case StartupProcess:
             MyBackendType = B_STARTUP;
             break;
+        case ArchiverProcess:
+            MyBackendType = B_ARCHIVER;
+            break;
         case BgWriterProcess:
             MyBackendType = B_BG_WRITER;
             break;
@@ -437,30 +440,29 @@ AuxiliaryProcessMain(int argc, char *argv[])
             proc_exit(1);        /* should never return */
 
         case StartupProcess:
-            /* don't set signals, startup process has its own agenda */
             StartupProcessMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
+
+        case ArchiverProcess:
+            PgArchiverMain();
+            proc_exit(1);
 
         case BgWriterProcess:
-            /* don't set signals, bgwriter has its own agenda */
             BackgroundWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case CheckpointerProcess:
-            /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalWriterProcess:
-            /* don't set signals, walwriter has its own agenda */
             InitXLOGAccess();
             WalWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalReceiverProcess:
-            /* don't set signals, walreceiver has its own agenda */
             WalReceiverMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         default:
             elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index ed1b65358d..e3a520def9 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -48,6 +48,7 @@
 #include "storage/latch.h"
 #include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
+#include "storage/procsignal.h"
 #include "utils/guc.h"
 #include "utils/ps_status.h"
 
@@ -78,13 +79,11 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
-static volatile sig_atomic_t wakened = false;
 static volatile sig_atomic_t ready_to_stop = false;
 
 /* ----------
@@ -95,8 +94,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
 static void pgarch_ArchiverCopyLoop(void);
@@ -110,75 +107,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -212,14 +140,9 @@ pgarch_forkexec(void)
 #endif                            /* EXEC_BACKEND */
 
 
-/*
- * PgArchiverMain
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
- *    since we don't use 'em, it hardly matters...
- */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+/* Main entry point for archiver process */
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -231,33 +154,26 @@ PgArchiverMain(int argc, char *argv[])
     /* SIGQUIT handler was already set up by InitPostmasterChild */
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, pgarch_waken);
+    pqsignal(SIGUSR1, procsignal_sigusr1_handler);
     pqsignal(SIGUSR2, pgarch_waken_stop);
+
     /* Reset some signals that are accepted by postmaster but not here */
     pqsignal(SIGCHLD, SIG_DFL);
+
+    /* Unblock signals (they were blocked when the postmaster forked us) */
     PG_SETMASK(&UnBlockSig);
 
-    MyBackendType = B_ARCHIVER;
-    init_ps_display(NULL);
+    /*
+     * Advertise our latch that backends can use to wake us up while we're
+     * sleeping.
+     */
+    ProcGlobal->archiverLatch = &MyProc->procLatch;
 
     pgarch_MainLoop();
 
     exit(0);
 }
 
-/* SIGUSR1 signal handler for archiver process */
-static void
-pgarch_waken(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    /* set flag that there is work to be done */
-    wakened = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
 /* SIGUSR2 signal handler for archiver process */
 static void
 pgarch_waken_stop(SIGNAL_ARGS)
@@ -282,14 +198,6 @@ pgarch_MainLoop(void)
     pg_time_t    last_copy_time = 0;
     bool        time_to_stop;
 
-    /*
-     * We run the copy loop immediately upon entry, in case there are
-     * unarchived files left over from a previous database run (or maybe the
-     * archiver died unexpectedly).  After that we wait for a signal or
-     * timeout before doing more.
-     */
-    wakened = true;
-
     /*
      * There shouldn't be anything for the archiver to do except to wait for a
      * signal ... however, the archiver exists to protect our data, so she
@@ -328,12 +236,8 @@ pgarch_MainLoop(void)
         }
 
         /* Do what we're here for */
-        if (wakened || time_to_stop)
-        {
-            wakened = false;
-            pgarch_ArchiverCopyLoop();
-            last_copy_time = time(NULL);
-        }
+        pgarch_ArchiverCopyLoop();
+        last_copy_time = time(NULL);
 
         /*
          * Sleep until a signal is received, or until a poll is forced by
@@ -354,13 +258,9 @@ pgarch_MainLoop(void)
                                WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                                timeout * 1000L,
                                WAIT_EVENT_ARCHIVER_MAIN);
-                if (rc & WL_TIMEOUT)
-                    wakened = true;
                 if (rc & WL_POSTMASTER_DEATH)
                     time_to_stop = true;
             }
-            else
-                wakened = true;
         }
 
         /*
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 5d09822c81..8a4706a870 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -547,6 +547,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1787,7 +1788,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -3041,7 +3042,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3176,20 +3177,16 @@ reaper(SIGNAL_ARGS)
         }
 
         /*
-         * Was it the archiver?  If so, just try to start a new one; no need
-         * to force reset of the rest of the system.  (If fail, we'll try
-         * again in future cycles of the main loop.).  Unless we were waiting
-         * for it to shut down; don't restart it in that case, and
-         * PostmasterStateMachine() will advance to the next shutdown step.
+         * Was it the archiver?  Normal exit can be ignored; we'll start a new
+         * one at the next iteration of the postmaster's main loop, if
+         * necessary. Any other exit condition is treated as a crash.
          */
         if (pid == PgArchPID)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3437,7 +3434,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3642,6 +3639,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3938,6 +3947,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5185,7 +5195,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5230,16 +5240,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (StartWorkerNeeded || HaveCrashedWorker)
         maybe_start_bgworkers();
 
-    if (CheckPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER) &&
-        PgArchPID != 0)
-    {
-        /*
-         * Send SIGUSR1 to archiver process, to wake it up and begin archiving
-         * next WAL file.
-         */
-        signal_child(PgArchPID, SIGUSR1);
-    }
-
     /* Tell syslogger to rotate logfile if requested */
     if (SysLoggerPID != 0)
     {
@@ -5481,6 +5481,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 7dc3911590..fc23539137 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -181,6 +181,7 @@ InitProcGlobal(void)
     ProcGlobal->startupBufferPinWaitBufId = -1;
     ProcGlobal->walwriterLatch = NULL;
     ProcGlobal->checkpointerLatch = NULL;
+    ProcGlobal->archiverLatch = NULL;
     pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
     pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e71..65443063e8 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -339,6 +339,9 @@ extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern void XLogArchiveWakeupStart(void);
+extern void XLogArchiveWakeupEnd(void);
+extern void XLogArchiveWakeup(void);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool PromoteIsTriggered(void);
diff --git a/src/include/access/xlogarchive.h b/src/include/access/xlogarchive.h
index 1c67de2ede..54ce0b97d7 100644
--- a/src/include/access/xlogarchive.h
+++ b/src/include/access/xlogarchive.h
@@ -25,6 +25,7 @@ extern void ExecuteRecoveryCommand(const char *command, const char *commandName,
 extern void KeepFileRestoredFromArchive(const char *path, const char *xlogfname);
 extern void XLogArchiveNotify(const char *xlog);
 extern void XLogArchiveNotifySeg(XLogSegNo segno);
+extern void XLogArchiveWakeup(void);
 extern void XLogArchiveForceDone(const char *xlog);
 extern bool XLogArchiveCheckDone(const char *xlog);
 extern bool XLogArchiveIsBusy(const char *xlog);
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 72e3352398..bba002ad24 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -418,6 +418,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -430,6 +431,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index b3200874ca..e3ffc63f14 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 56c5ec4481..c691acf8cd 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -34,7 +34,6 @@ typedef enum
 {
     PMSIGNAL_RECOVERY_STARTED,    /* recovery has started */
     PMSIGNAL_BEGIN_HOT_STANDBY, /* begin Hot Standby */
-    PMSIGNAL_WAKEN_ARCHIVER,    /* send a NOTIFY signal to xlog archiver */
     PMSIGNAL_ROTATE_LOGFILE,    /* send SIGUSR1 to syslogger to rotate logfile */
     PMSIGNAL_START_AUTOVAC_LAUNCHER,    /* start an autovacuum launcher */
     PMSIGNAL_START_AUTOVAC_WORKER,    /* start an autovacuum worker */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index e77f76ae8a..a656910d02 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -354,6 +354,9 @@ typedef struct PROC_HDR
     int            startupProcPid;
     /* Buffer id of the buffer that Startup process waits for pin on, or -1 */
     int            startupBufferPinWaitBufId;
+    /* Archiver process's latch */
+    Latch       *archiverLatch;
+    /* Current shared estimate of appropriate spins_per_delay value */
 } PROC_HDR;
 
 extern PGDLLIMPORT PROC_HDR *ProcGlobal;
-- 
2.27.0

From dc5dd0d75f834a83df26b98d745cb2c55f9ca9b7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:14 +0900
Subject: [PATCH v44 4/7] Shared-memory based stats collector

Previously, activity statistics were collected via sockets and shared
among backends through files written periodically. Such files can reach
tens of megabytes and are created at most once per second; the stats
collector serializes that much data and every backend then
de-serializes it periodically. To avoid that cost, this patch places
activity statistics data in shared memory. Each backend accumulates
statistics numbers locally, then tries to move them into the shared
statistics at every transaction end, but at intervals no shorter than
10 seconds. Until 60 seconds have elapsed since the last flush to
shared stats, a lock failure postpones the flush, to alleviate lock
contention that would slow down transactions.  After that, the flush
waits for the locks so that the shared statistics do not get stale.
---
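Note for reviewers (below the three-dash line, so not part of the commit
message): the flush timing described above can be pictured with the small
standalone sketch that follows. It is only an illustration under stated
assumptions, not code from the patch. The three PGSTAT_* constants are the
ones named in the new pgstat.c header comment; FlushState, FlushAction,
flush_state_init(), flush_decide() and flush_done() are hypothetical names
invented for this note. The sketch measures the 60-second limit from the
first failed attempt, following the pgstat.c header comment, and it assumes
that the next retry is capped at that deadline.

```c
#include <stdbool.h>
#include <stdint.h>

/* Interval constants named in the pgstat.c header comment (milliseconds). */
#define PGSTAT_MIN_INTERVAL        10000
#define PGSTAT_RETRY_MIN_INTERVAL   1000
#define PGSTAT_MAX_INTERVAL        60000

/* Everything below is hypothetical and exists only for illustration. */
typedef enum
{
    FLUSH_SKIP,                 /* too early, keep accumulating locally */
    FLUSH_NOWAIT,               /* try to flush, give up on lock failure */
    FLUSH_WAIT                  /* flush and wait for the locks */
} FlushAction;

typedef struct FlushState
{
    int64_t     next_flush_ms;      /* earliest time of the next attempt */
    int64_t     first_pending_ms;   /* first failed attempt, or -1 if none */
    int64_t     retry_interval_ms;  /* current retry backoff */
} FlushState;

/* Start with no pending failure; the first attempt is allowed at once. */
static void
flush_state_init(FlushState *st, int64_t now_ms)
{
    st->next_flush_ms = now_ms;
    st->first_pending_ms = -1;
    st->retry_interval_ms = 0;
}

/* Decide what to do at transaction end. */
static FlushAction
flush_decide(const FlushState *st, int64_t now_ms)
{
    if (now_ms < st->next_flush_ms)
        return FLUSH_SKIP;
    if (st->first_pending_ms >= 0 &&
        now_ms - st->first_pending_ms >= PGSTAT_MAX_INTERVAL)
        return FLUSH_WAIT;
    return FLUSH_NOWAIT;
}

/* Record the outcome of an attempt and schedule the next one. */
static void
flush_done(FlushState *st, int64_t now_ms, bool all_flushed)
{
    int64_t     deadline;

    if (all_flushed)
    {
        /* Nothing left locally; go back to the regular interval. */
        st->first_pending_ms = -1;
        st->retry_interval_ms = 0;
        st->next_flush_ms = now_ms + PGSTAT_MIN_INTERVAL;
        return;
    }

    if (st->first_pending_ms < 0)
    {
        /* First lock failure: start the backoff. */
        st->first_pending_ms = now_ms;
        st->retry_interval_ms = PGSTAT_RETRY_MIN_INTERVAL;
    }
    else
        st->retry_interval_ms *= 2;     /* double at every retry */

    /* Never schedule the retry later than the forced-flush deadline. */
    deadline = st->first_pending_ms + PGSTAT_MAX_INTERVAL;
    st->next_flush_ms = now_ms + st->retry_interval_ms;
    if (st->next_flush_ms > deadline)
        st->next_flush_ms = deadline;
}
```

A caller would invoke flush_decide() at transaction end, attempt the flush
with or without waiting as indicated, and then report the outcome through
flush_done().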
 src/backend/access/heap/heapam_handler.c      |    4 +-
 src/backend/access/heap/vacuumlazy.c          |    2 +-
 src/backend/access/transam/xlog.c             |    6 +-
 src/backend/catalog/index.c                   |   24 +-
 src/backend/commands/analyze.c                |    8 +-
 src/backend/commands/dbcommands.c             |    2 +-
 src/backend/commands/matview.c                |    8 +-
 src/backend/commands/vacuum.c                 |    4 +-
 src/backend/postmaster/autovacuum.c           |   76 +-
 src/backend/postmaster/bgwriter.c             |    4 +-
 src/backend/postmaster/checkpointer.c         |   26 +-
 src/backend/postmaster/pgarch.c               |   12 +-
 src/backend/postmaster/pgstat.c               | 6135 +++++++----------
 src/backend/postmaster/postmaster.c           |   88 +-
 src/backend/replication/basebackup.c          |    4 +-
 src/backend/replication/slot.c                |   12 +-
 src/backend/storage/buffer/bufmgr.c           |    8 +-
 src/backend/storage/ipc/ipci.c                |    2 +
 src/backend/storage/lmgr/lwlock.c             |    4 +-
 src/backend/storage/lmgr/lwlocknames.txt      |    1 +
 src/backend/storage/smgr/smgr.c               |    4 +-
 src/backend/tcop/postgres.c                   |   37 +-
 src/backend/utils/adt/pgstatfuncs.c           |   95 +-
 src/backend/utils/cache/relcache.c            |    5 +
 src/backend/utils/init/globals.c              |    1 +
 src/backend/utils/init/miscinit.c             |    3 -
 src/backend/utils/init/postinit.c             |   11 +
 src/backend/utils/misc/guc.c                  |   14 +-
 src/backend/utils/misc/postgresql.conf.sample |    2 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |    4 +-
 src/include/miscadmin.h                       |    3 +-
 src/include/pgstat.h                          |  727 +-
 src/include/storage/lwlock.h                  |    1 +
 src/include/utils/guc_tables.h                |    2 +-
 src/include/utils/timeout.h                   |    1 +
 35 files changed, 2789 insertions(+), 4551 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 3eea215b85..a0eeafe524 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1086,8 +1086,8 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
                  * our own.  In this case we should count and sample the row,
                  * to accommodate users who load a table and analyze it in one
                  * transaction.  (pgstat_report_analyze has to adjust the
-                 * numbers we send to the stats collector to make this come
-                 * out right.)
+                 * numbers we report to the activity stats facility to make
+                 * this come out right.)
                  */
                 if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(targtuple->t_data)))
                 {
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 25f2d5df1b..9ce56b54b6 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -598,7 +598,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
                         new_min_multi,
                         false);
 
-    /* report results to the stats collector, too */
+    /* report results to the activity stats facility, too */
     pgstat_report_vacuum(RelationGetRelid(onerel),
                          onerel->rd_rel->relisshared,
                          Max(new_live_tuples, 0),
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b1e5d2dbff..d2fb0c6bb7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2207,7 +2207,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
                     WriteRqst.Flush = 0;
                     XLogWrite(WriteRqst, false);
                     LWLockRelease(WALWriteLock);
-                    WalStats.m_wal_buffers_full++;
+                    WalStats.wal_buffers_full++;
                     TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
                 }
                 /* Re-acquire WALBufMappingLock and retry */
@@ -8592,8 +8592,8 @@ LogCheckpointEnd(bool restartpoint)
                                                  CheckpointStats.ckpt_sync_end_t);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time += write_msecs;
-    BgWriterStats.m_checkpoint_sync_time += sync_msecs;
+    CheckPointerStats.checkpoint_write_time += write_msecs;
+    CheckPointerStats.checkpoint_sync_time += sync_msecs;
 
     /*
      * All of the published timing statistics are accounted for.  Only
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 731610c701..f8e6784975 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1854,28 +1854,10 @@ index_concurrently_swap(Oid newIndexId, Oid oldIndexId, const char *oldName)
 
     /*
      * Copy over statistics from old to new index
+     * The data will be flushed by the next pgstat_report_stat()
+     * call.
      */
-    {
-        PgStat_StatTabEntry *tabentry;
-
-        tabentry = pgstat_fetch_stat_tabentry(oldIndexId);
-        if (tabentry)
-        {
-            if (newClassRel->pgstat_info)
-            {
-                newClassRel->pgstat_info->t_counts.t_numscans = tabentry->numscans;
-                newClassRel->pgstat_info->t_counts.t_tuples_returned = tabentry->tuples_returned;
-                newClassRel->pgstat_info->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_hit = tabentry->blocks_hit;
-
-                /*
-                 * The data will be sent by the next pgstat_report_stat()
-                 * call.
-                 */
-            }
-        }
-    }
+    pgstat_copy_index_counters(oldIndexId, newClassRel->pgstat_info);
 
     /* Copy data of pg_statistic from the old index to the new one */
     CopyStatistics(oldIndexId, newIndexId);
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8af12b5c6b..a7e787d9d1 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -644,10 +644,10 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
     }
 
     /*
-     * Report ANALYZE to the stats collector, too.  However, if doing
-     * inherited stats we shouldn't report, because the stats collector only
-     * tracks per-table stats.  Reset the changes_since_analyze counter only
-     * if we analyzed all columns; otherwise, there is still work for
+     * Report ANALYZE to the activity stats facility, too.  However, if doing
+     * inherited stats we shouldn't report, because the activity stats facility
+     * only tracks per-table stats.  Reset the changes_since_analyze counter
+     * only if we analyzed all columns; otherwise, there is still work for
      * auto-analyze to do.
      */
     if (!inh)
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index f27c3fe8c1..de0c749570 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -971,7 +971,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
     DropDatabaseBuffers(db_id);
 
     /*
-     * Tell the stats collector to forget it immediately, too.
+     * Tell the activity stats facility to forget it immediately, too.
      */
     pgstat_drop_database(db_id);
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index cfc63915f3..2eef6c1654 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -336,10 +336,10 @@ ExecRefreshMatView(RefreshMatViewStmt *stmt, const char *queryString,
         refresh_by_heap_swap(matviewOid, OIDNewHeap, relpersistence);
 
         /*
-         * Inform stats collector about our activity: basically, we truncated
-         * the matview and inserted some new data.  (The concurrent code path
-         * above doesn't need to worry about this because the inserts and
-         * deletes it issues get counted by lower-level code.)
+         * Inform activity stats facility about our activity: basically, we
+         * truncated the matview and inserted some new data.  (The concurrent
+         * code path above doesn't need to worry about this because the inserts
+         * and deletes it issues get counted by lower-level code.)
          */
         pgstat_count_truncate(matviewRel);
         if (!stmt->skipData)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 98270a1049..925c75e296 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -319,8 +319,8 @@ vacuum(List *relations, VacuumParams *params,
                  errmsg("VACUUM option DISABLE_PAGE_SKIPPING cannot be used with FULL")));
 
     /*
-     * Send info about dead objects to the statistics collector, unless we are
-     * in autovacuum --- autovacuum.c does this for itself.
+     * Send info about dead objects to the activity statistics facility, unless
+     * we are in autovacuum --- autovacuum.c does this for itself.
      */
     if ((params->options & VACOPT_VACUUM) && !IsAutoVacuumWorkerProcess())
         pgstat_vacuum_stat();
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index ed127a1032..71f6494889 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -342,9 +342,6 @@ static void autovacuum_do_vac_analyze(autovac_table *tab,
                                       BufferAccessStrategy bstrategy);
 static AutoVacOpts *extract_autovac_opts(HeapTuple tup,
                                          TupleDesc pg_class_desc);
-static PgStat_StatTabEntry *get_pgstat_tabentry_relid(Oid relid, bool isshared,
-                                                      PgStat_StatDBEntry *shared,
-                                                      PgStat_StatDBEntry *dbentry);
 static void perform_work_item(AutoVacuumWorkItem *workitem);
 static void autovac_report_activity(autovac_table *tab);
 static void autovac_report_workitem(AutoVacuumWorkItem *workitem,
@@ -1684,12 +1681,12 @@ AutoVacWorkerMain(int argc, char *argv[])
         char        dbname[NAMEDATALEN];
 
         /*
-         * Report autovac startup to the stats collector.  We deliberately do
-         * this before InitPostgres, so that the last_autovac_time will get
-         * updated even if the connection attempt fails.  This is to prevent
-         * autovac from getting "stuck" repeatedly selecting an unopenable
-         * database, rather than making any progress on stuff it can connect
-         * to.
+         * Report autovac startup to the activity stats facility.  We
+         * deliberately do this before InitPostgres, so that the
+         * last_autovac_time will get updated even if the connection attempt
+         * fails.  This is to prevent autovac from getting "stuck" repeatedly
+         * selecting an unopenable database, rather than making any progress on
+         * stuff it can connect to.
          */
         pgstat_report_autovac(dbid);
 
@@ -1961,8 +1958,6 @@ do_autovacuum(void)
     HASHCTL        ctl;
     HTAB       *table_toast_map;
     ListCell   *volatile cell;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     BufferAccessStrategy bstrategy;
     ScanKeyData key;
     TupleDesc    pg_class_desc;
@@ -1981,17 +1976,11 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /*
-     * may be NULL if we couldn't find an entry (only happens if we are
-     * forcing a vacuum for anti-wrap purposes).
-     */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
 
     /*
-     * Clean up any dead statistics collector entries for this DB. We always
+     * Clean up any dead activity statistics entries for this DB. We always
      * want to do this exactly once per DB-processing cycle, even if we find
      * nothing worth vacuuming in the database.
      */
@@ -2034,9 +2023,6 @@ do_autovacuum(void)
     /* StartTransactionCommand changed elsewhere */
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-
     classRel = table_open(RelationRelationId, AccessShareLock);
 
     /* create a copy so we can use it after closing pg_class */
@@ -2114,8 +2100,8 @@ do_autovacuum(void)
 
         /* Fetch reloptions and the pgstat entry for this table */
         relopts = extract_autovac_opts(tuple, pg_class_desc);
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                       relid);
 
         /* Check if it needs vacuum or analyze */
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
@@ -2198,8 +2184,8 @@ do_autovacuum(void)
         }
 
         /* Fetch the pgstat entry for this table */
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                       relid);
 
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
@@ -2758,29 +2744,6 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
     return av;
 }
 
-/*
- * get_pgstat_tabentry_relid
- *
- * Fetch the pgstat entry of a table, either local to a database or shared.
- */
-static PgStat_StatTabEntry *
-get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
-                          PgStat_StatDBEntry *dbentry)
-{
-    PgStat_StatTabEntry *tabentry = NULL;
-
-    if (isshared)
-    {
-        if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
-    }
-    else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
-
-    return tabentry;
-}
 
 /*
  * table_recheck_autovac
@@ -2984,17 +2947,10 @@ recheck_relation_needs_vacanalyze(Oid relid,
                                   bool *wraparound)
 {
     PgStat_StatTabEntry *tabentry;
-    PgStat_StatDBEntry *shared = NULL;
-    PgStat_StatDBEntry *dbentry = NULL;
-
-    if (classForm->relisshared)
-        shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    else
-        dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
 
     /* fetch the pgstat table entry */
-    tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                         shared, dbentry);
+    tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                   relid);
 
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
@@ -3024,7 +2980,7 @@ recheck_relation_needs_vacanalyze(Oid relid,
  *
  * For analyze, the analysis done is that the number of tuples inserted,
  * deleted and updated since the last analyze exceeds a threshold calculated
- * in the same fashion as above.  Note that the collector actually stores
+ * in the same fashion as above.  Note that the activity stats facility stores
  * the number of tuples (both live and dead) that there were as of the last
  * analyze.  This is asymmetric to the VACUUM case.
  *
@@ -3034,8 +2990,8 @@ recheck_relation_needs_vacanalyze(Oid relid,
  *
  * A table whose autovacuum_enabled option is false is
  * automatically skipped (unless we have to vacuum it due to freeze_max_age).
- * Thus autovacuum can be disabled for specific tables. Also, when the stats
- * collector does not have data about a table, it will be skipped.
+ * Thus autovacuum can be disabled for specific tables. Also, when the activity
+ * statistics facility does not have data about a table, it will be skipped.
  *
  * A table whose vac_base_thresh value is < 0 takes the base value from the
  * autovacuum_vacuum_threshold GUC variable.  Similarly, a vac_scale_factor
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index a7afa758b6..b075e85839 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -244,9 +244,9 @@ BackgroundWriterMain(void)
         can_hibernate = BgBufferSync(&wb_context);
 
         /*
-         * Send off activity statistics to the stats collector
+         * Send off activity statistics to the activity stats facility
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index a62c6d4d0a..e2e440569a 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -360,7 +360,7 @@ CheckpointerMain(void)
         if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
         {
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            CheckPointerStats.requested_checkpoints++;
         }
 
         /*
@@ -374,7 +374,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                CheckPointerStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -495,17 +495,11 @@ CheckpointerMain(void)
         /* Check for archive_timeout and switch xlog files if necessary. */
         CheckArchiveTimeout();
 
-        /*
-         * Send off activity statistics to the stats collector.  (The reason
-         * why we re-use bgwriter-related code for this is that the bgwriter
-         * and checkpointer used to be just one process.  It's probably not
-         * worth the trouble to split the stats support into two independent
-         * stats message types.)
-         */
-        pgstat_send_bgwriter();
+        /* Send off activity statistics to the activity stats facility. */
+        pgstat_report_checkpointer();
 
         /* Send WAL statistics to the stats collector. */
-        pgstat_send_wal();
+        pgstat_report_wal();
 
         /*
          * If any checkpoint flags have been set, redo the loop to handle the
@@ -711,9 +705,9 @@ CheckpointWriteDelay(int flags, double progress)
         CheckArchiveTimeout();
 
         /*
-         * Report interim activity statistics to the stats collector.
+         * Report interim activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_report_checkpointer();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1257,8 +1251,10 @@ AbsorbSyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    CheckPointerStats.buf_written_backend
+        += CheckpointerShmem->num_backend_writes;
+    CheckPointerStats.buf_fsync_backend
+        += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index e3a520def9..6d88c65d5f 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -372,20 +372,20 @@ pgarch_ArchiverCopyLoop(void)
                 pgarch_archiveDone(xlog);
 
                 /*
-                 * Tell the collector about the WAL file that we successfully
-                 * archived
+                 * Tell the activity statistics facility about the WAL file
+                 * that we successfully archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_report_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
             else
             {
                 /*
-                 * Tell the collector about the WAL file that we failed to
-                 * archive
+                 * Tell the activity statistics facility about the WAL file
+                 * that we failed to archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_report_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index d87d9d06ee..ce4c6988f3 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,22 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Activity Statistics facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects activity statistics, e.g. per-table access statistics, of
+ *  all backends in shared memory. The activity numbers are first stored
+ *  locally in each process, then flushed to shared memory at commit
+ *  time or by idle-timeout.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ * To avoid congestion on the shared memory, shared stats are updated no more
+ * often than once per PGSTAT_MIN_INTERVAL (10000ms). If some local numbers
+ * remain unflushed because a lock could not be acquired, retry at an interval
+ * that starts at PGSTAT_RETRY_MIN_INTERVAL (1000ms) and doubles at every retry.
+ * Finally, force the update PGSTAT_MAX_INTERVAL (60000ms) after the first try.
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the activity statistics facility creates the
+ *  area and loads the stored stats file, if any. At shutdown, the last process
+ *  writes the shared stats to the file and then destroys the area before exit.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -19,18 +26,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -40,13 +35,9 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
-#include "executor/instrument.h"
+#include "common/hashfn.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/fork_process.h"
@@ -54,20 +45,16 @@
 #include "postmaster/postmaster.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
+#include "storage/condition_variable.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
 #include "utils/timestamp.h"
 
@@ -75,35 +62,20 @@
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
+#define PGSTAT_MIN_INTERVAL            10000    /* Minimum interval of stats data
+                                             * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_RETRY_MIN_INTERVAL    1000 /* Initial retry interval after
+                                          * PGSTAT_MIN_INTERVAL */
 
+#define PGSTAT_MAX_INTERVAL            60000    /* Longest interval of stats data
+                                             * updates */
 
 /* ----------
- * The initial size hints for the hash tables used in the collector.
+ * The initial size hints for the hash tables used in the activity statistics.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
-#define PGSTAT_TAB_HASH_SIZE    512
+#define PGSTAT_TABLE_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
 
@@ -118,7 +90,6 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
-
 /* ----------
  * GUC parameters
  * ----------
@@ -133,17 +104,11 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used, but will be removed with GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
-/*
- * BgWriter and WAL global statistics counters.
- * Stored directly in a stats message structure so they can be sent
- * without needing to copy things around.  We assume these init to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
-PgStat_MsgWal WalStats;
-
 /*
  * WAL usage counters saved from pgWALUsage at the previous call to
  * pgstat_send_wal(). This is used to calculate how much WAL usage
@@ -170,73 +135,246 @@ static const char *const slru_names[] = {
 
 #define SLRU_NUM_ELEMENTS    lengthof(slru_names)
 
+/* struct for shared SLRU stats */
+typedef struct PgStatSharedSLRUStats
+{
+    PgStat_SLRUStats        entry[SLRU_NUM_ELEMENTS];
+    LWLock                    lock;
+    pg_atomic_uint32        changecount;
+} PgStatSharedSLRUStats;
+
+StaticAssertDecl(sizeof(TimestampTz) == sizeof(pg_atomic_uint64),
+                 "size of pg_atomic_uint64 doesn't match TimestampTz");
+
+typedef struct StatsShmemStruct
+{
+    dsa_handle    stats_dsa_handle;    /* handle for stats data area */
+    dshash_table_handle hash_handle;    /* shared dbstat hash */
+    int            refcount;            /* # of processes attached to the
+                                     * shared stats memory */
+    /* Global stats structs */
+    PgStat_Archiver            archiver_stats;
+    pg_atomic_uint32        archiver_changecount;
+    PgStat_BgWriter            bgwriter_stats;
+    pg_atomic_uint32        bgwriter_changecount;
+    PgStat_CheckPointer        checkpointer_stats;
+    pg_atomic_uint32        checkpointer_changecount;
+    PgStat_Wal                wal_stats;
+    LWLock                    wal_stats_lock;
+    PgStatSharedSLRUStats    slru_stats;
+    pg_atomic_uint32        slru_changecount;
+    pg_atomic_uint64        stats_timestamp;
+
+    /* Reset offsets, protected by StatsLock */
+    PgStat_Archiver            archiver_reset_offset;
+    PgStat_BgWriter            bgwriter_reset_offset;
+    PgStat_CheckPointer        checkpointer_reset_offset;
+
+    /* file read/write protection */
+    bool                    attach_holdoff;
+    ConditionVariable        holdoff_cv;
+
+    pg_atomic_uint64        gc_count; /* # of entries deleted. not
+                                            * protected by StatsLock */
+} StatsShmemStruct;
+
+/* BgWriter global statistics counters */
+PgStat_BgWriter BgWriterStats = {0};
+
+/* CheckPointer global statistics counters */
+PgStat_CheckPointer CheckPointerStats = {0};
+
+/* WAL global statistics counters */
+PgStat_Wal WalStats = {0};
+
 /*
- * SLRU statistics counts waiting to be sent to the collector.  These are
- * stored directly in stats message format so they can be sent without needing
- * to copy things around.  We assume this variable inits to zeroes.  Entries
- * are one-to-one with slru_names[].
+ * XXXX: always try to flush WAL stats. We don't want to manipulate another
+ * counter during XLogInsert so we don't have an efficient shortcut to know
+ * whether any counter has been incremented.
  */
-static PgStat_MsgSLRU SLRUStats[SLRU_NUM_ELEMENTS];
+static inline bool
+walstats_pending(void)
+{
+    static const PgStat_Wal all_zeroes;
+
+    return memcmp(&WalStats, &all_zeroes,
+                  offsetof(PgStat_Wal, stat_reset_timestamp)) != 0;
+}
+
+/*
+ * SLRU statistics counts waiting to be written to the shared activity
+ * statistics.  We assume this variable inits to zeroes.  Entries are
+ * one-to-one with slru_names[].
+ * Changes of SLRU counters are reported within critical sections so we use
+ * static memory in order to avoid memory allocation.
+ */
+static PgStat_SLRUStats local_SLRUStats[SLRU_NUM_ELEMENTS];
+static bool     have_slrustats = false;
 
 /* ----------
  * Local data
  * ----------
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
-
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
+/* backend-lifetime storage */
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
 
 /*
- * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
- *
- * NOTE: once allocated, TabStatusArray structures are never moved or deleted
- * for the life of the backend.  Also, we zero out the t_id fields of the
- * contained PgStat_TableStatus structs whenever they are not actively in use.
- * This allows relcache pgstat_info pointers to be treated as long-lived data,
- * avoiding repeated searches in pgstat_initstats() when a relation is
- * repeatedly opened during a transaction.
+ * Types to define shared statistics structure.
+ *
+ * Per-object statistics are stored in "shared stats" entries: structs in
+ * DSA-allocated memory that begin with a header common to all object types.
+ * Each shared stats entry is pointed to from a dshash via a dsa_pointer. This
+ * structure keeps shared stats from moving when the dshash is resized, lets a
+ * backend point to shared stats entries via a native pointer, and allows
+ * locking at the stats-entry level. Per-entry locking reduces lock contention
+ * compared to the partition locks of dshash. A backend accumulates stats in a
+ * stats entry in local memory, then flushes the numbers to the shared stats
+ * entries, basically at transaction end.
+ *
+ * Each stat entry type has a fixed member PgStat_StatEntryHeader as the first
+ * element.
+ *
+ * Shared stats are stored as:
+ *
+ * dshash pgStatSharedHash
+ *    -> PgStatHashEntry                    (dshash entry)
+ *      (dsa_pointer)-> PgStat_Stat*Entry    (dsa memory block)
+ *
+ * Shared stats entries are directly pointed from pgstat_localhash hash:
+ *
+ * pgstat_localhash pgStatEntHash
+ *    -> PgStatLocalHashEntry                (equivalent of PgStatHashEntry)
+ *      (native pointer)-> PgStat_Stat*Entry (dsa memory block)
+ *
+ * Local stats waiting to be flushed to shared stats are stored as:
+ *
+ * pgstat_localhash pgStatLocalHash
+ *    -> PgStatLocalHashEntry                 (local hash entry)
+ *      (native pointer)-> PgStat_Stat*Entry/TableStatus (palloc'ed memory)
  */
-#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
 
-typedef struct TabStatusArray
+/* The types of statistics entries */
+typedef enum PgStatTypes
 {
-    struct TabStatusArray *tsa_next;    /* link to next array, if any */
-    int            tsa_used;        /* # entries currently used */
-    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
-} TabStatusArray;
-
-static TabStatusArray *pgStatTabList = NULL;
+    PGSTAT_TYPE_DB,            /* database-wide statistics */
+    PGSTAT_TYPE_TABLE,        /* per-table statistics */
+    PGSTAT_TYPE_FUNCTION,    /* per-function statistics */
+    PGSTAT_TYPE_REPLSLOT    /* per-replication-slot statistics */
+} PgStatTypes;
 
 /*
- * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
+ * entry body size lookup table of shared statistics entries corresponding to
+ * PgStatTypes
  */
-typedef struct TabStatHashEntry
+static const size_t pgstat_sharedentsize[] =
 {
-    Oid            t_id;
-    PgStat_TableStatus *tsa_entry;
-} TabStatHashEntry;
+    sizeof(PgStat_StatDBEntry),     /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_StatTabEntry),    /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_StatFuncEntry),    /* PGSTAT_TYPE_FUNCTION */
+    sizeof(PgStat_ReplSlot)            /* PGSTAT_TYPE_REPLSLOT */
+};
 
-/*
- * Hash table for O(1) t_id -> tsa_entry lookup
- */
-static HTAB *pgStatTabHash = NULL;
+/* Ditto for local statistics entries */
+static const size_t pgstat_localentsize[] =
+{
+    sizeof(PgStat_StatDBEntry),            /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_TableStatus),            /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_BackendFunctionEntry),/* PGSTAT_TYPE_FUNCTION */
+    sizeof(PgStat_ReplSlot)                /* PGSTAT_TYPE_REPLSLOT */
+};
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * We should avoid overwriting the header part of a shared entry. Use these
+ * macros to find the portion of the struct to be written or read.
+ * PGSTAT_SHENT_BODY returns a slightly smaller address than the actual address
+ * of the next member, but that doesn't matter.
  */
-static HTAB *pgStatFunctions = NULL;
+#define PGSTAT_SHENT_BODY(e) (((char *)(e)) + sizeof(PgStat_StatEntryHeader))
+#define PGSTAT_SHENT_BODY_LEN(t) \
+    (pgstat_sharedentsize[t] - sizeof(PgStat_StatEntryHeader))
+
+/* struct for shared statistics hash entry key. */
+typedef struct PgStatHashKey
+{
+    PgStatTypes type;        /* statistics entry type */
+    Oid            databaseid;    /* database ID. InvalidOid for shared objects. */
+    Oid            objectid;    /* object ID, either table or function. */
+} PgStatHashKey;
+
+/* struct for shared statistics hash entry */
+typedef struct PgStatHashEntry
+{
+    PgStatHashKey    key; /* hash key */
+    dsa_pointer        body;/* pointer to shared stats in PgStat_StatEntryHeader */
+} PgStatHashEntry;
+
+/* struct for shared statistics local hash entry. */
+typedef struct PgStatLocalHashEntry
+{
+    PgStatHashKey            key;    /* hash key */
+    char                    status;    /* for simplehash use */
+    PgStat_StatEntryHeader *body;    /* pointer to stats body in local heap */
+    dsa_pointer                dsapointer; /* dsa pointer of body */
+} PgStatLocalHashEntry;
+
+/* parameter for the shared hash */
+static const dshash_parameters dsh_params = {
+    sizeof(PgStatHashKey),
+    sizeof(PgStatHashEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+
+/* define hashtable for local hashes */
+#define SH_PREFIX pgstat_localhash
+#define SH_ELEMENT_TYPE PgStatLocalHashEntry
+#define SH_KEY_TYPE PgStatHashKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+    hash_bytes((unsigned char *)&key, sizeof(PgStatHashKey))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(PgStatHashKey)) == 0)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/* The shared hash to index activity stats entries. */
+static dshash_table *pgStatSharedHash = NULL;
 
 /*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
+ * The local cache to index shared stats entries.
+ *
+ * This is a local hash to store native pointers to shared hash
+ * entries. pgStatEntHashAge is copied from StatsShmem->gc_count at creation
+ * and garbage collection.
  */
-static bool have_function_stats = false;
+static pgstat_localhash_hash *pgStatEntHash = NULL;
+static int     pgStatEntHashAge = 0;        /* cache age of pgStatEntHash */
+
+/* Local stats numbers are stored here. */
+static pgstat_localhash_hash *pgStatLocalHash = NULL;
+
+/* entry type for oid hash */
+typedef struct pgstat_oident
+{
+    Oid oid;
+    char status;
+} pgstat_oident;
+
+/* Define hashtable for OID hashes. */
+StaticAssertDecl(sizeof(Oid) == 4, "oid is not compatible with uint32");
+#define SH_PREFIX pgstat_oid
+#define SH_ELEMENT_TYPE pgstat_oident
+#define SH_KEY_TYPE Oid
+#define SH_KEY oid
+#define SH_HASH_KEY(tb, key) hash_bytes_uint32(key)
+#define SH_EQUAL(tb, a, b) (a == b)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
 
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
@@ -273,11 +411,8 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -286,23 +421,9 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Make our own memory context to make it easy to track memory usage.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-static PgStat_WalStats walStats;
-static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
-static PgStat_ReplSlotStats *replSlotStats;
-static int    nReplSlotStats;
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
+MemoryContext    pgStatCacheContext = NULL;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -311,40 +432,57 @@ static List *pending_write_requests = NIL;
  */
 static instr_time total_func_time;
 
+/* Simple caching feature for pgstatfuncs */
+static PgStatHashKey    stathashkey_zero = {0};
+static PgStatHashKey        cached_dbent_key = {0};
+static PgStat_StatDBEntry    cached_dbent;
+static PgStatHashKey        cached_tabent_key = {0};
+static PgStat_StatTabEntry    cached_tabent;
+static PgStatHashKey        cached_funcent_key = {0};
+static PgStat_StatFuncEntry    cached_funcent;
+
+static PgStat_Archiver        cached_archiverstats;
+static bool                    cached_archiverstats_is_valid = false;
+static PgStat_BgWriter        cached_bgwriterstats;
+static bool                    cached_bgwriterstats_is_valid = false;
+static PgStat_CheckPointer    cached_checkpointerstats;
+static bool                    cached_checkpointerstats_is_valid = false;
+static PgStat_Wal            cached_walstats;
+static bool                    cached_walstats_is_valid = false;
+static PgStat_SLRUStats        cached_slrustats;
+static bool                    cached_slrustats_is_valid = false;
+static PgStat_ReplSlot       *cached_replslotstats = NULL;
+static int                    n_cached_replslotstats = -1;
 
 /* ----------
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgstat_beshutdown_hook(int code, Datum arg);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                                                 Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStat_StatDBEntry *get_local_dbstat_entry(Oid dbid);
+static PgStat_TableStatus *get_local_tabstat_entry(Oid rel_id, bool isshared);
+
+static void pgstat_write_statsfile(void);
+
+static void pgstat_read_statsfile(void);
 static void pgstat_read_current_status(void);
 
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
+static PgStat_StatEntryHeader * get_stat_entry(PgStatTypes type, Oid dbid,
+                                               Oid objid, bool nowait,
+                                               bool create, bool *found);
 
-static int    pgstat_replslot_index(const char *name, bool create_it);
-static void pgstat_reset_replslot(int i, TimestampTz ts);
+static bool flush_tabstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_funcstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_dbstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_walstat(bool nowait);
+static bool flush_slrustat(bool nowait);
+static void delete_current_stats_entry(dshash_seq_status *hstat);
+static PgStat_StatEntryHeader * get_local_stat_entry(PgStatTypes type, Oid dbid,
+                                                     Oid objid, bool create,
+                                                     bool *found);
 
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
-static void pgstat_send_slru(void);
-static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
-
-static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
+static pgstat_oid_hash *collect_oids(Oid catalogid, AttrNumber anum_oid);
 
 static void pgstat_setup_memcxt(void);
 
@@ -354,491 +492,645 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len);
-static void pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
-static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_replslot(PgStat_MsgReplSlot *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute shared memory space needed for activity statistics
+ */
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+/*
+ * StatsShmemInit - initialize during shared-memory creation
  */
 void
-pgstat_init(void)
+StatsShmemInit(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
+    bool        found;
 
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),    &found);
 
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+        ConditionVariableInit(&StatsShmem->holdoff_cv);
+        pg_atomic_init_u32(&StatsShmem->archiver_changecount, 0);
+        pg_atomic_init_u32(&StatsShmem->bgwriter_changecount, 0);
+        pg_atomic_init_u32(&StatsShmem->checkpointer_changecount, 0);
+
+        pg_atomic_init_u64(&StatsShmem->gc_count, 0);
+
+        LWLockInitialize(&StatsShmem->wal_stats_lock, LWTRANCHE_STATS);
+    }
+}
+
+/* ----------
+ * allow_next_attacher() -
+ *
+ *  Let other processes go ahead and attach the shared stats area.
+ * ----------
+ */
+static void
+allow_next_attacher(void)
+{
+    bool triggerd = false;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (StatsShmem->attach_holdoff)
     {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
+        StatsShmem->attach_holdoff = false;
+        triggerd = true;
     }
+    LWLockRelease(StatsLock);
+
+    if (triggerd)
+        ConditionVariableBroadcast(&StatsShmem->holdoff_cv);
+}
+
+/* ----------
+ * attach_shared_stats() -
+ *
+ *    Attach to or create the shared stats memory. If we are the first process
+ *    to use the activity stats system, read the saved statistics file, if any.
+ * ---------
+ */
+static void
+attach_shared_stats(void)
+{
+    MemoryContext oldcontext;
 
     /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
+     * Don't use dsm under postmaster, or when not tracking counts.
      */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
 
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
+    pgstat_setup_memcxt();
 
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
+    if (area)
+        return;
 
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    /* stats shared memory persists for the backend lifetime */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
 
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    /*
+     * The first attaching backend may still be reading the stats file, or the
+     * last detacher may be writing it. Wait for the work to finish.
+     */
+    ConditionVariablePrepareToSleep(&StatsShmem->holdoff_cv);
+    for (;;)
+    {
+        bool hold_off;
 
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        LWLockAcquire(StatsLock, LW_SHARED);
+        hold_off = StatsShmem->attach_holdoff;
+        LWLockRelease(StatsLock);
 
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
+        if (!hold_off)
+            break;
+
+        ConditionVariableTimedSleep(&StatsShmem->holdoff_cv, 10,
+                                    WAIT_EVENT_READING_STATS_FILE);
     }
+    ConditionVariableCancelSleep();
 
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
     /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
+     * The last process is responsible for writing out the stats file at exit.
+     * Maintain refcount so that an exiting process can tell whether it is the
+     * last one or not.
      */
-    if (!pg_set_noblock(pgStatSock))
+    if (StatsShmem->refcount > 0)
+        StatsShmem->refcount++;
+    else
     {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
+        /* We're the first process to attach the shared stats memory */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
+
+        /* Initialize shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatSharedHash = dshash_create(area, &dsh_params, 0);
+
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->hash_handle = dshash_get_hash_table_handle(pgStatSharedHash);
+        LWLockInitialize(&StatsShmem->slru_stats.lock, LWTRANCHE_STATS);
+        pg_atomic_init_u32(&StatsShmem->slru_stats.changecount, 0);
+
+        /* Block the next attacher for a while, see the comment above. */
+        StatsShmem->attach_holdoff = true;
+
+        StatsShmem->refcount = 1;
     }
 
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
+    LWLockRelease(StatsLock);
+
+    if (area)
     {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            ereport(LOG,
-                    (errmsg("getsockopt(%s) failed: %m", "SO_RCVBUF")));
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                ereport(LOG,
-                        (errmsg("setsockopt(%s) failed: %m", "SO_RCVBUF")));
-        }
+        /*
+         * We're the first attacher process, read stats file while blocking
+         * successors.
+         */
+        Assert(StatsShmem->attach_holdoff);
+        pgstat_read_statsfile();
+        allow_next_attacher();
+    }
+    else
+    {
+        /* We're not the first one, attach existing shared area. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatSharedHash = dshash_attach(area, &dsh_params,
+                                         StatsShmem->hash_handle, 0);
     }
 
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-    /* Now that we have a long-lived socket, tell fd.c about it. */
-    ReserveExternalFD();
+    MemoryContextSwitchTo(oldcontext);
 
-    return;
+    /* don't detach automatically */
+    dsa_pin_mapping(area);
+}
 
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
+/* ----------
+ * cleanup_dropped_stats_entries() -
+ *              Clean up shared stats entries no longer used.
+ *
+ *  Shared stats entries for dropped objects may be left referenced. Clean up
+ *  our reference and drop the shared entry if needed.
+ * ----------
+ */
+static void
+cleanup_dropped_stats_entries(void)
+{
+    pgstat_localhash_iterator i;
+    PgStatLocalHashEntry   *ent;
 
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
+    if (pgStatEntHash == NULL)
+        return;
 
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
+    pgstat_localhash_start_iterate(pgStatEntHash, &i);
+    while ((ent = pgstat_localhash_iterate(pgStatEntHash, &i))
+           != NULL)
+    {
+        /*
+         * Free the shared memory chunk for the entry if we were the last
+         * referrer to a dropped entry.
+         */
+        if (pg_atomic_sub_fetch_u32(&ent->body->refcount, 1) < 1 &&
+            ent->body->dropped)
+            dsa_free(area, ent->dsapointer);
+    }
 
     /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
+     * This function is expected to be called during backend exit. So we don't
+     * bother destroying pgStatEntHash.
      */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    pgStatEntHash = NULL;
 }
 
-/*
- * subroutine for pgstat_reset_all
+/* ----------
+ * detach_shared_stats() -
+ *
+ *    Detach shared stats. Write out to file if we're the last process and told
+ *    to do so.
+ * ----------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+detach_shared_stats(bool write_file)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    bool is_last_detacher = false;
+
+    /* nothing to do if not attached or running standalone */
+    if (!area || !IsUnderPostmaster)
+        return;
+
+    /* We shouldn't leave a reference to shared stats. */
+    cleanup_dropped_stats_entries();
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
+    /*
+     * If we are the last detacher, hold off the next attacher (if possible)
+     * until we finish writing stats file.
+     */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (--StatsShmem->refcount == 0)
     {
-        int            nchars;
-        Oid            tmp_oid;
+        StatsShmem->attach_holdoff = true;
+        is_last_detacher = true;
+    }
+    LWLockRelease(StatsLock);
 
-        /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
-         */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
-
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
-
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+    if (is_last_detacher)
+    {
+        if (write_file)
+            pgstat_write_statsfile();
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+        /* allow the next attacher, if any */
+        allow_next_attacher();
     }
-    FreeDir(dir);
+
+    /*
+     * Detach the area. It is automatically destroyed when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);
+
+    area = NULL;
+    pgStatSharedHash = NULL;
+
+    /* We are going to exit. Don't bother destroying local hashes. */
+    pgStatLocalHash = NULL;
 }
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    /* standalone server doesn't use shared stats */
+    if (!IsUnderPostmaster)
+        return;
+
+    /* we must have shared stats attached */
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
+
+    /* Startup must be the only user of shared stats */
+    Assert(StatsShmem->refcount == 1);
+
+    /*
+     * We could directly remove files and recreate the shared memory area. But
+     * just discard it and then create it again, for simplicity.
+     */
+    detach_shared_stats(false);
+    attach_shared_stats();
 }
 
-#ifdef EXEC_BACKEND
 
 /*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
+ * fetch_lock_statentry - common helper function to fetch and lock a stats
+ * entry for flush_tabstat, flush_funcstat and flush_dbstat.
  */
-static pid_t
-pgstat_forkexec(void)
+static PgStat_StatEntryHeader *
+fetch_lock_statentry(PgStatTypes type, Oid dboid, Oid objid, bool nowait)
 {
-    char       *av[10];
-    int            ac = 0;
+    PgStat_StatEntryHeader *header;
+
+    /* find shared table stats entry corresponding to the local entry */
+    header = (PgStat_StatEntryHeader *)
+        get_stat_entry(type, dboid, objid, nowait, true, NULL);
 
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
+    /* skip if dshash failed to acquire lock */
+    if (header == NULL)
+        return NULL;
 
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&header->lock, LW_EXCLUSIVE))
+        return NULL;
 
-    return postmaster_forkexec(ac, av);
+    return header;
 }
-#endif                            /* EXEC_BACKEND */
 
 
-/*
- * pgstat_start() -
+/* ----------
+ * get_stat_entry() -
  *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
+ *    Get the shared stats entry for the specified type, dbid and objid.
+ *  If nowait is true, returns NULL on lock failure.
+ *
+ *  If create is true, a new entry is created if it does not exist yet. If
+ *  found is not NULL, it is set to true if an existing entry is found or
+ *  false if not.
+ *  ----------
+ */
+static PgStat_StatEntryHeader *
+get_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait, bool create,
+               bool *found)
+{
+    PgStatHashEntry           *shhashent;
+    PgStatLocalHashEntry   *lohashent;
+    PgStat_StatEntryHeader *shheader = NULL;
+    PgStatHashKey            key;
+    bool                    shfound;
+
+    key.type        = type;
+    key.databaseid     = dbid;
+    key.objectid    = objid;
+
+    if (pgStatEntHash)
+    {
+        uint64 currage;
+
+        /*
+         * pgStatEntHashAge advances much more slowly than the following loop
+         * runs, so this is expected to iterate no more than twice.
+         */
+        while (unlikely
+               (pgStatEntHashAge !=
+                (currage = pg_atomic_read_u64(&StatsShmem->gc_count))))
+        {
+            pgstat_localhash_iterator i;
+
+            /*
+             * Some entries have been dropped. Invalidate the cached pointers
+             * to them.
+             */
+            pgstat_localhash_start_iterate(pgStatEntHash, &i);
+            while ((lohashent = pgstat_localhash_iterate(pgStatEntHash, &i))
+                   != NULL)
+            {
+                PgStat_StatEntryHeader *header = lohashent->body;
+                if (header->dropped)
+                {
+                    pgstat_localhash_delete(pgStatEntHash, lohashent->key);
+
+                    if (pg_atomic_sub_fetch_u32(&header->refcount, 1) < 1)
+                    {
+                        /*
+                         * We're the last referrer to this entry, drop the
+                         * shared entry.
+                         */
+                        dsa_free(area, lohashent->dsapointer);
+                    }
+                }
+            }
+
+            pgStatEntHashAge = currage;
+        }
+
+        lohashent = pgstat_localhash_lookup(pgStatEntHash, key);
+
+        if (lohashent)
+        {
+            if (found)
+                *found = true;
+            return lohashent->body;
+        }
+    }
+
+    shhashent = dshash_find_extended(pgStatSharedHash, &key,
+                                     create, nowait, create, &shfound);
+    if (shhashent)
+    {
+        if (create && !shfound)
+        {
+            /* Create new stats entry. */
+            dsa_pointer chunk = dsa_allocate0(area,
+                                              pgstat_sharedentsize[type]);
+
+            shheader = dsa_get_address(area, chunk);
+            LWLockInitialize(&shheader->lock, LWTRANCHE_STATS);
+            pg_atomic_init_u32(&shheader->refcount, 0);
+
+            /* Link the new entry from the hash entry. */
+            shhashent->body = chunk;
+        }
+        else
+            shheader = dsa_get_address(area, shhashent->body);
+
+        /*
+         * We expose this shared entry now.  You might think that the entry can
+         * be removed by a concurrent backend, but since we are creating a
+         * stats entry, the object actually exists and is in use in the upper
+         * layer. Such an object cannot be dropped until the first vacuum after
+         * the current transaction ends.
+         */
+        dshash_release_lock(pgStatSharedHash, shhashent);
+
+        /* register to local hash if possible */
+        if (pgStatEntHash || pgStatCacheContext)
+        {
+            bool                    lofound;
+
+            if (pgStatEntHash == NULL)
+            {
+                pgStatEntHash =
+                    pgstat_localhash_create(pgStatCacheContext,
+                                        PGSTAT_TABLE_HASH_SIZE, NULL);
+                pgStatEntHashAge =
+                    pg_atomic_read_u64(&StatsShmem->gc_count);
+            }
+
+            lohashent =
+                pgstat_localhash_insert(pgStatEntHash, key, &lofound);
+
+            Assert(!lofound);
+            lohashent->body = shheader;
+            lohashent->dsapointer = shhashent->body;
+
+            pg_atomic_add_fetch_u32(&shheader->refcount, 1);
+        }
+    }
+
+    if (found)
+        *found = shfound;
+
+    return shheader;
+}
+
+/*
+ * flush_walstat - flush out local WAL stats entries
  *
- *    Returns PID of child process, or 0 if fail.
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
  *
- *    Note: if fail, we will be called again from the postmaster main loop.
+ * Returns true if all local WAL stats are successfully flushed out.
  */
-int
-pgstat_start(void)
+static bool
+flush_walstat(bool nowait)
 {
-    time_t        curtime;
-    pid_t        pgStatPid;
+    PgStat_Wal *s = &StatsShmem->wal_stats;
+    PgStat_Wal *l = &WalStats;
+    WalUsage    all_zeroes = {0} PG_USED_FOR_ASSERTS_ONLY;
+
+    /*
+     * We don't update the WAL usage portion of the local WalStats
+     * elsewhere. Instead, fill in that portion with the difference of
+     * pgWalUsage since the previous call.
+     */
+    Assert(memcmp(&l->wal_usage, &all_zeroes, sizeof(WalUsage)) == 0);
+    WalUsageAccumDiff(&l->wal_usage, &pgWalUsage, &prevWalUsage);
 
     /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
+     * This function can be called even if nothing at all has happened. Avoid
+     * taking lock for nothing in that case.
      */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
+    if (!walstats_pending())
+        return true;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&StatsShmem->wal_stats_lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&StatsShmem->wal_stats_lock,
+                                       LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
+
+    s->wal_usage.wal_records += l->wal_usage.wal_records;
+    s->wal_usage.wal_fpi += l->wal_usage.wal_fpi;
+    s->wal_usage.wal_bytes += l->wal_usage.wal_bytes;
+    s->wal_buffers_full += l->wal_buffers_full;
+    LWLockRelease(&StatsShmem->wal_stats_lock);
 
     /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
+     * Save the current counters for the subsequent calculation of WAL usage.
      */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
+    prevWalUsage = pgWalUsage;
 
     /*
-     * Okay, fork off the collector.
+     * Clear out the statistics buffer, so it can be re-used.
      */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
+    MemSet(&WalStats, 0, sizeof(WalStats));
 
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
+    return true;
+}
 
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
+/*
+ * flush_slrustat - flush out local SLRU stats entries
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true. Writer processes are mutually excluded
+ * using an LWLock, but readers are expected to use the change-count protocol
+ * to avoid interference with writers.
+ *
+ * Returns true if all local SLRU stats are successfully flushed out.
+ */
+static bool
+flush_slrustat(bool nowait)
+{
+    uint32    assert_changecount PG_USED_FOR_ASSERTS_ONLY;
+    int i;
+
+    if (!have_slrustats)
+        return true;
 
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&StatsShmem->slru_stats.lock,
+                                       LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
 
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
+    assert_changecount =
+        pg_atomic_fetch_add_u32(&StatsShmem->slru_stats.changecount, 1);
+    Assert((assert_changecount & 1) == 0);
 
-        default:
-            return (int) pgStatPid;
+    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
+    {
+        PgStat_SLRUStats *sharedent = &StatsShmem->slru_stats.entry[i];
+        PgStat_SLRUStats *localent = &local_SLRUStats[i];
+
+        sharedent->blocks_zeroed += localent->blocks_zeroed;
+        sharedent->blocks_hit += localent->blocks_hit;
+        sharedent->blocks_read += localent->blocks_read;
+        sharedent->blocks_written += localent->blocks_written;
+        sharedent->blocks_exists += localent->blocks_exists;
+        sharedent->flush += localent->flush;
+        sharedent->truncate += localent->truncate;
     }
 
-    /* shouldn't get here */
-    return 0;
+    /* done, clear the local entry */
+    MemSet(local_SLRUStats, 0,
+           sizeof(PgStat_SLRUStats) * SLRU_NUM_ELEMENTS);
+
+    pg_atomic_add_fetch_u32(&StatsShmem->slru_stats.changecount, 1);
+    LWLockRelease(&StatsShmem->slru_stats.lock);
+
+    have_slrustats = false;
+
+    return true;
+}
+
+/* ----------
+ * delete_current_stats_entry()
+ *
+ *  Deletes the given shared entry from shared stats hash. The entry must be
+ *  exclusively locked.
+ * ----------
+ */
+static void
+delete_current_stats_entry(dshash_seq_status *hstat)
+{
+    dsa_pointer                pdsa;
+    PgStat_StatEntryHeader *header;
+    PgStatHashEntry *ent;
+
+    ent = dshash_get_current(hstat);
+    pdsa = ent->body;
+    header = dsa_get_address(area, pdsa);
+
+    /* No one can find this entry from now on. */
+    dshash_delete_current(hstat);
+
+    /*
+     * Let the referrers drop the entry if any.  Refcount won't be decremented
+     * until "dropped" is set true and StatsShmem->gc_count is incremented
+     * later. So we can check refcount to set dropped without holding a
+     * lock. If no one is referring to this entry, free it immediately.
+     */
+
+    if (pg_atomic_read_u32(&header->refcount) > 0)
+        header->dropped = true;
+    else
+        dsa_free(area, pdsa);
+
+    return;
 }
 
-void
-allow_immediate_pgstat_restart(void)
+/* ----------
+ * get_local_stat_entry() -
+ *
+ *  Returns local stats entry for the type, dbid and objid.
+ *  If create is true, a new entry is created if it does not exist yet.  found
+ *  must be non-NULL in that case.
+ *
+ *
+ *  The caller is responsible for initializing the statsbody part of the
+ *  returned memory.
+ * ----------
+ */
+static PgStat_StatEntryHeader *
+get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                     bool create, bool *found)
 {
-    last_pgstat_start_time = 0;
+    PgStatHashKey key;
+    PgStatLocalHashEntry *entry;
+
+    if (pgStatLocalHash == NULL)
+        pgStatLocalHash = pgstat_localhash_create(pgStatCacheContext,
+                                                  PGSTAT_TABLE_HASH_SIZE, NULL);
+
+    /* Find an entry or create a new one. */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    if (create)
+        entry = pgstat_localhash_insert(pgStatLocalHash, key, found);
+    else
+        entry = pgstat_localhash_lookup(pgStatLocalHash, key);
+
+    if (!create && !entry)
+        return NULL;
+
+    if (create && !*found)
+        entry->body = MemoryContextAllocZero(TopMemoryContext,
+                                             pgstat_localentsize[type]);
+
+    return entry->body;
 }
 
 /* ------------------------------------------------------------
@@ -846,150 +1138,399 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *    Updates are applied no more frequently than once every
+ *    PGSTAT_MIN_INTERVAL milliseconds. They are also postponed on lock
+ *    failure if force is false and no update has been pending for longer than
+ *    PGSTAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
+ *    subsequent calls of this function.
+ *
+ *    Returns the time in milliseconds until the next time updates will be
+ *    applied, if there are no updates held for more than
+ *    PGSTAT_MIN_INTERVAL milliseconds.
+ *
+ *    Note that this is called only outside a transaction, so it is fine to use
  *    transaction stop time as an approximation of current time.
- * ----------
+ *    ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz    next_flush = 0;
+    static TimestampTz    pending_since = 0;
+    static long            retry_interval = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
+    bool        nowait;
     int            i;
+    uint64        oldval;
+
+    /* Return if not active */
+    if (area == NULL)
+        return 0;
+
+    /*
+     * We need a database entry if any of the following stats exist.
+     */
+    if (pgStatXactCommit > 0 || pgStatXactRollback > 0 ||
+        pgStatBlockReadTime > 0 || pgStatBlockWriteTime > 0)
+        get_local_dbstat_entry(MyDatabaseId);
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (pgStatLocalHash == NULL && !have_slrustats && !walstats_pending())
+        return 0;
 
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
     now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
-
-    /*
-     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
-     */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
-    pgStatTabHash = NULL;
-
-    /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
-     */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
-    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+
+    if (!force)
     {
-        for (i = 0; i < tsa->tsa_used; i++)
+        /*
+         * Don't flush stats too frequently.  Return the time to the next
+         * flush.
+         */
+        if (now < next_flush)
         {
-            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
-
-            /* Shouldn't have any pending transaction-dependent counts */
-            Assert(entry->trans == NULL);
-
-            /*
-             * Ignore entries that didn't accumulate any actual counts, such
-             * as indexes that were opened by the planner but not used.
-             */
-            if (memcmp(&entry->t_counts, &all_zeroes,
-                       sizeof(PgStat_TableCounts)) == 0)
+            /* Remember when updates first became pending. */
+            if (pending_since == 0)
+                pending_since = now;
+
+            return (next_flush - now) / 1000;
+        }
+
+        /* But, don't keep pending updates longer than PGSTAT_MAX_INTERVAL. */
+
+        if (pending_since > 0 &&
+            TimestampDifferenceExceeds(pending_since, now, PGSTAT_MAX_INTERVAL))
+            force = true;
+    }
+
+    /* don't wait for lock acquisition when !force */
+    nowait = !force;
+
+    if (pgStatLocalHash)
+    {
+        int            remains = 0;
+        pgstat_localhash_iterator i;
+        List                   *dbentlist = NIL;
+        ListCell               *lc;
+        PgStatLocalHashEntry   *lent;
+
+        /* Step 1: flush out stats other than database stats */
+        pgstat_localhash_start_iterate(pgStatLocalHash, &i);
+        while ((lent = pgstat_localhash_iterate(pgStatLocalHash, &i)) != NULL)
+        {
+            bool        remove = false;
+
+            switch (lent->key.type)
+            {
+                case PGSTAT_TYPE_DB:
+                    /*
+                     * flush_tabstat adds some of the counters of flushed
+                     * entries to the local database stats. Just remember the
+                     * database entries for now, then flush them out later.
+                     */
+                    dbentlist = lappend(dbentlist, lent);
+                    break;
+                case PGSTAT_TYPE_TABLE:
+                    if (flush_tabstat(lent, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_FUNCTION:
+                    if (flush_funcstat(lent, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_REPLSLOT:
+                    /* We don't have that kind of local entry */
+                    Assert(false);
+            }
+
+            if (!remove)
+            {
+                remains++;
                 continue;
+            }
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* Remove the successfully flushed entry */
+            pfree(lent->body);
+            lent->body = NULL;
+            pgstat_localhash_delete(pgStatLocalHash, lent->key);
+        }
+
+        /* Step 2: flush out database stats */
+        foreach(lc, dbentlist)
+        {
+            PgStatLocalHashEntry *lent = (PgStatLocalHashEntry *) lfirst(lc);
+
+            if (flush_dbstat(lent, nowait))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                remains--;
+                /* Remove the successfully flushed entry */
+                pfree(lent->body);
+                lent->body = NULL;
+                pgstat_localhash_delete(pgStatLocalHash, lent->key);
             }
         }
-        /* zero out PgStat_TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
+        list_free(dbentlist);
+        dbentlist = NULL;
+
+        if (remains <= 0)
+        {
+            pgstat_localhash_destroy(pgStatLocalHash);
+            pgStatLocalHash = NULL;
+        }
+    }
+
+    /* flush wal stats */
+    flush_walstat(nowait);
+
+    /* flush SLRU stats */
+    flush_slrustat(nowait);
+
+    /*
+     * Publish the time of the last flush.  We don't notify readers of the
+     * change; they will still see a sufficiently recent timestamp.  If we
+     * fail to update the value, a concurrent process must have updated it
+     * to a sufficiently recent time.
+     *
+     * XXX: The loop might be unnecessary for the reason above.
+     */
+    oldval = pg_atomic_read_u64(&StatsShmem->stats_timestamp);
+
+    for (i = 0 ; i < 10 ; i++)
+    {
+        if (oldval >= now ||
+            pg_atomic_compare_exchange_u64(&StatsShmem->stats_timestamp,
+                                           &oldval, (uint64) now))
+            break;
     }
 
     /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
+     * Some of the local stats may not have been flushed due to lock
+     * contention.  If we have such pending local stats here, let the caller
+     * know the retry interval.
      */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    if (pgStatLocalHash != NULL || have_slrustats || walstats_pending())
+    {
+        /* Retain the epoch time */
+        if (pending_since == 0)
+            pending_since = now;
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+        /* The interval is doubled at every retry. */
+        if (retry_interval == 0)
+            retry_interval = PGSTAT_RETRY_MIN_INTERVAL * 1000;
+        else
+            retry_interval = retry_interval * 2;
 
-    /* Send WAL statistics */
-    pgstat_send_wal();
+        /*
+         * Determine the next retry interval so as not to get shorter than the
+         * previous interval.
+         */
+        if (!TimestampDifferenceExceeds(pending_since,
+                                        now + 2 * retry_interval,
+                                        PGSTAT_MAX_INTERVAL))
+            next_flush = now + retry_interval;
+        else
+        {
+            next_flush = pending_since + PGSTAT_MAX_INTERVAL * 1000;
+            retry_interval = next_flush - now;
+        }
 
-    /* Finally send SLRU statistics */
-    pgstat_send_slru();
+        return retry_interval / 1000;
+    }
+
+    /* Set the next time to update stats */
+    next_flush = now + PGSTAT_MIN_INTERVAL * 1000;
+    retry_interval = 0;
+    pending_since = 0;
+
+    return 0;
 }
 
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * flush_tabstat - flush out a local table stats entry
+ *
+ * Some of the stats numbers are copied to local database stats entry after
+ * successful flush-out.
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
  */
-static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+static bool
+flush_tabstat(PgStatLocalHashEntry *ent, bool nowait)
 {
-    int            n;
-    int            len;
+    static const PgStat_TableCounts all_zeroes;
+    Oid                    dboid;            /* database OID of the table */
+    PgStat_TableStatus *lstats;            /* local stats entry  */
+    PgStat_StatTabEntry *shtabstats;    /* table entry of shared stats */
+    PgStat_StatDBEntry *ldbstats;        /* local database entry */
+
+    Assert(ent->key.type == PGSTAT_TYPE_TABLE);
+    lstats = (PgStat_TableStatus *) ent->body;
+    dboid = ent->key.databaseid;
+
+    /*
+     * Ignore entries that didn't accumulate any actual counts, such as
+     * indexes that were opened by the planner but not used.
+     */
+    if (memcmp(&lstats->t_counts, &all_zeroes,
+               sizeof(PgStat_TableCounts)) == 0)
+    {
+        /* This local entry is going to be dropped, delink from relcache. */
+        pgstat_delinkstats(lstats->relation);
+        return true;
+    }
+
+    /* find shared table stats entry corresponding to the local entry */
+    shtabstats = (PgStat_StatTabEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_TABLE, dboid, ent->key.objectid,
+                             nowait);
+
+    if (shtabstats == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    /* add the values to the shared entry. */
+    shtabstats->numscans += lstats->t_counts.t_numscans;
+    shtabstats->tuples_returned += lstats->t_counts.t_tuples_returned;
+    shtabstats->tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    shtabstats->tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    shtabstats->tuples_updated += lstats->t_counts.t_tuples_updated;
+    shtabstats->tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    shtabstats->tuples_hot_updated += lstats->t_counts.t_tuples_hot_updated;
+
+    /*
+     * If the table was truncated or vacuum/analyze has run, first reset the
+     * live/dead counters.
+     */
+    if (lstats->t_counts.t_truncated)
+    {
+        shtabstats->n_live_tuples = 0;
+        shtabstats->n_dead_tuples = 0;
+    }
+
+    shtabstats->n_live_tuples += lstats->t_counts.t_delta_live_tuples;
+    shtabstats->n_dead_tuples += lstats->t_counts.t_delta_dead_tuples;
+    shtabstats->changes_since_analyze += lstats->t_counts.t_changed_tuples;
+    shtabstats->inserts_since_vacuum += lstats->t_counts.t_tuples_inserted;
+    shtabstats->blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    shtabstats->blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    shtabstats->n_live_tuples = Max(shtabstats->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    shtabstats->n_dead_tuples = Max(shtabstats->n_dead_tuples, 0);
+
+    LWLockRelease(&shtabstats->header.lock);
+
+    /* The entry was flushed, so add the same counts to the database stats */
+    ldbstats = get_local_dbstat_entry(dboid);
+    ldbstats->counts.n_tuples_returned += lstats->t_counts.t_tuples_returned;
+    ldbstats->counts.n_tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    ldbstats->counts.n_tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    ldbstats->counts.n_tuples_updated += lstats->t_counts.t_tuples_updated;
+    ldbstats->counts.n_tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    ldbstats->counts.n_blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    ldbstats->counts.n_blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /* This local entry is going to be dropped, delink from relcache. */
+    pgstat_delinkstats(lstats->relation);
+
+    return true;
+}
+
+/*
+ * flush_funcstat - flush out a local function stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_funcstat(PgStatLocalHashEntry *ent, bool nowait)
+{
+    PgStat_BackendFunctionEntry *localent;    /* local stats entry */
+    PgStat_StatFuncEntry *shfuncent = NULL; /* shared stats entry */
+
+    Assert(ent->key.type == PGSTAT_TYPE_FUNCTION);
+    localent = (PgStat_BackendFunctionEntry *) ent->body;
+
+    /* localent always has non-zero content */
+
+    /* find shared function stats entry corresponding to the local entry */
+    shfuncent = (PgStat_StatFuncEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             ent->key.objectid, nowait);
+    if (shfuncent == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    shfuncent->f_numcalls += localent->f_counts.f_numcalls;
+    shfuncent->f_total_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_total_time);
+    shfuncent->f_self_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_self_time);
+
+    LWLockRelease(&shfuncent->header.lock);
+
+    return true;
+}
+
+
+/*
+ * flush_dbstat - flush out a local database stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_dbstat(PgStatLocalHashEntry *ent, bool nowait)
+{
+    PgStat_StatDBEntry *localent;
+    PgStat_StatDBEntry *sharedent;
+
+    Assert(ent->key.type == PGSTAT_TYPE_DB);
+
+    localent = (PgStat_StatDBEntry *) ent->body;
+
+    /* find shared database stats entry corresponding to the local entry */
+    sharedent = (PgStat_StatDBEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_DB, ent->key.databaseid, InvalidOid,
+                             nowait);
+
+    if (!sharedent)
+        return false;            /* failed to acquire lock, skip */
+
+    sharedent->counts.n_tuples_returned += localent->counts.n_tuples_returned;
+    sharedent->counts.n_tuples_fetched += localent->counts.n_tuples_fetched;
+    sharedent->counts.n_tuples_inserted += localent->counts.n_tuples_inserted;
+    sharedent->counts.n_tuples_updated += localent->counts.n_tuples_updated;
+    sharedent->counts.n_tuples_deleted += localent->counts.n_tuples_deleted;
+    sharedent->counts.n_blocks_fetched += localent->counts.n_blocks_fetched;
+    sharedent->counts.n_blocks_hit += localent->counts.n_blocks_hit;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    sharedent->counts.n_deadlocks += localent->counts.n_deadlocks;
+    sharedent->counts.n_temp_bytes += localent->counts.n_temp_bytes;
+    sharedent->counts.n_temp_files += localent->counts.n_temp_files;
+    sharedent->counts.n_checksum_failures += localent->counts.n_checksum_failures;
 
     /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
+     * Accumulate xact commit/rollback and I/O timings to stats entry of the
+     * current database.
      */
-    if (OidIsValid(tsmsg->m_databaseid))
+    if (OidIsValid(ent->key.databaseid))
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
+        sharedent->counts.n_xact_commit += pgStatXactCommit;
+        sharedent->counts.n_xact_rollback += pgStatXactRollback;
+        sharedent->counts.n_block_read_time += pgStatBlockReadTime;
+        sharedent->counts.n_block_write_time += pgStatBlockWriteTime;
         pgStatXactCommit = 0;
         pgStatXactRollback = 0;
         pgStatBlockReadTime = 0;
@@ -997,281 +1538,138 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        sharedent->counts.n_xact_commit = 0;
+        sharedent->counts.n_xact_rollback = 0;
+        sharedent->counts.n_block_read_time = 0;
+        sharedent->counts.n_block_write_time = 0;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
+    LWLockRelease(&sharedent->header.lock);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    return true;
 }
 
-/*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
- */
-static void
-pgstat_send_funcstats(void)
-{
-    /* we assume this inits to all zeroes: */
-    static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
-
-    if (pgStatFunctions == NULL)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
-
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
-
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
-
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
-
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
-
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
-}
-
-
 /* ----------
  * pgstat_vacuum_stat() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *  Delete shared stat entries that are not in system catalogs.
+ *
+ *  To avoid holding exclusive lock on dshash for a long time, the process is
+ *  performed in three steps.
+ *
+ *   1: Collect existent oids of every kind of object.
+ *   2: Collect victim entries by scanning with shared lock.
+ *   3: Try removing every nominated entry without waiting for lock.
+ *
+ *  As a consequence of the last step, some entries may be left behind due
+ *  to lock failure, but they will be deleted by later invocations of
+ *  pgstat_vacuum_stat.
  * ----------
  */
 void
 pgstat_vacuum_stat(void)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Read pg_database and make a list of OIDs of all existing databases
-     */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
-
-    /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            dbid = dbentry->databaseid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        /* the DB entry for shared tables (with InvalidOid) is never dropped */
-        if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
-            pgstat_drop_database(dbid);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Lookup our own database entry; if not found, nothing more to do.
-     */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
+    pgstat_oid_hash       *dbids;    /* database ids */
+    pgstat_oid_hash       *relids;    /* relation ids in the current database */
+    pgstat_oid_hash       *funcids;    /* function ids in the current database */
+    int            nvictims = 0;    /* # of entries removed */
+    dshash_seq_status dshstat;
+    PgStatHashEntry *ent;
+
+    /* we don't collect stats under standalone mode */
+    if (!IsUnderPostmaster)
         return;
 
-    /*
-     * Similarly to above, make a list of all known relations in this DB.
-     */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
+    /* collect oids of existent objects */
+    dbids = collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    relids = collect_oids(RelationRelationId, Anum_pg_class_oid);
+    funcids = collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
 
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
+    nvictims = 0;
 
-    /*
-     * Check for all tables listed in stats hashtable if they still exist.
-     */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+    /* some of the dshash entries are to be removed, take exclusive lock. */
+    dshash_seq_init(&dshstat, pgStatSharedHash, true);
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
     {
-        Oid            tabid = tabentry->tableid;
+        pgstat_oid_hash *oidhash;
+        Oid           key;
 
         CHECK_FOR_INTERRUPTS();
 
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+        /*
+         * Don't drop non-database entries that don't belong to the
+         * current database.
+         */
+        if (ent->key.type != PGSTAT_TYPE_DB &&
+            ent->key.databaseid != MyDatabaseId)
             continue;
 
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
-    }
-
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
+        switch (ent->key.type)
         {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
+            case PGSTAT_TYPE_DB:
+                /* don't remove database entry for shared tables */
+                if (ent->key.databaseid == 0)
+                    continue;
+                oidhash = dbids;
+                key = ent->key.databaseid;
+                break;
+
+            case PGSTAT_TYPE_TABLE:
+                oidhash = relids;
+                key = ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                oidhash = funcids;
+                key = ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_REPLSLOT:
+                /*
+                 * We don't bother vacuuming this kind of entry because the
+                 * number of entries is quite small and entries are likely to
+                 * be reused soon.
+                 */
                 continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
         }
 
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
+        /* Skip existent objects. */
+        if (pgstat_oid_lookup(oidhash, key) != NULL)
+            continue;
 
-        hash_destroy(htab);
+        /* drop this entry */
+        delete_current_stats_entry(&dshstat);
+        nvictims++;
     }
+    dshash_seq_term(&dshstat);
+    pgstat_oid_destroy(dbids);
+    pgstat_oid_destroy(relids);
+    pgstat_oid_destroy(funcids);
+
+    if (nvictims > 0)
+        pg_atomic_add_fetch_u64(&StatsShmem->gc_count, 1);
 }
 
-
 /* ----------
- * pgstat_collect_oids() -
+ * collect_oids() -
  *
  *    Collect the OIDs of all objects listed in the specified system catalog
- *    into a temporary hash table.  Caller should hash_destroy the result
+ *    into a temporary hash table.  Caller should pgstat_oid_destroy the result
  *    when done with it.  (However, we make the table in CurrentMemoryContext
  *    so that it will be freed properly in event of an error.)
  * ----------
  */
-static HTAB *
-pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
+static pgstat_oid_hash *
+collect_oids(Oid catalogid, AttrNumber anum_oid)
 {
-    HTAB       *htab;
-    HASHCTL        hash_ctl;
+    pgstat_oid_hash *rethash;
     Relation    rel;
     TableScanDesc scan;
     HeapTuple    tup;
     Snapshot    snapshot;
 
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(Oid);
-    hash_ctl.hcxt = CurrentMemoryContext;
-    htab = hash_create("Temporary table of OIDs",
-                       PGSTAT_TAB_HASH_SIZE,
-                       &hash_ctl,
-                       HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    rethash = pgstat_oid_create(CurrentMemoryContext,
+                                PGSTAT_TABLE_HASH_SIZE, NULL);
 
     rel = table_open(catalogid, AccessShareLock);
     snapshot = RegisterSnapshot(GetLatestSnapshot());
@@ -1280,81 +1678,61 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     {
         Oid            thisoid;
         bool        isnull;
+        bool        found;
 
         thisoid = heap_getattr(tup, anum_oid, RelationGetDescr(rel), &isnull);
         Assert(!isnull);
 
         CHECK_FOR_INTERRUPTS();
 
-        (void) hash_search(htab, (void *) &thisoid, HASH_ENTER, NULL);
+        pgstat_oid_insert(rethash, thisoid, &found);
     }
     table_endscan(scan);
     UnregisterSnapshot(snapshot);
     table_close(rel, AccessShareLock);
 
-    return htab;
+    return rethash;
 }
 
-
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
- * ----------
+ *    Remove entries for the database that we just dropped.
+ *
+ *    Some entries might be left behind due to lock failure, or some stats
+ *    might be flushed after this, but we will still clean up the dead DB
+ *    eventually via future invocations of pgstat_vacuum_stat().
+ * ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
 
+    Assert(OidIsValid(databaseid));
 
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
-
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
+    /* some of the dshash entries are to be removed, take exclusive lock. */
+    dshash_seq_init(&hstat, pgStatSharedHash, true);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+    {
+        if (p->key.databaseid == databaseid)
+            delete_current_stats_entry(&hstat);
+    }
+    dshash_seq_term(&hstat);
+
+    /* Let readers run a garbage collection of local hashes */
+    pg_atomic_add_fetch_u64(&StatsShmem->gc_count, 1);
 }
-#endif                            /* NOT_USED */
 
 
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1363,53 +1741,146 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    /* dshash entry is not modified, take shared lock */
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+    {
+        PgStat_StatEntryHeader *header;
+
+        if (p->key.databaseid != MyDatabaseId)
+            continue;
+
+        header = dsa_get_address(area, p->body);
+
+        LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+        memset(PGSTAT_SHENT_BODY(header), 0,
+               PGSTAT_SHENT_BODY_LEN(p->key.type));
+
+        if (p->key.type == PGSTAT_TYPE_DB)
+        {
+            PgStat_StatDBEntry *dbstat = (PgStat_StatDBEntry *) header;
+            dbstat->stat_reset_timestamp = GetCurrentTimestamp();
+        }
+        LWLockRelease(&header->lock);
+    }
+    dshash_seq_term(&hstat);
+
+    /* Invalidate the simple cache keys */
+    cached_dbent_key = stathashkey_zero;
+    cached_tabent_key = stathashkey_zero;
+    cached_funcent_key = stathashkey_zero;
+}
+
+/*
+ * pgstat_copy_global_stats - helper function for functions
+ *           pgstat_fetch_stat_*() and pgstat_reset_shared_counters().
+ *
+ * Copies out the specified memory area following the change-count protocol.
+ */
+static inline void
+pgstat_copy_global_stats(void *dst, void *src, size_t len,
+                         pg_atomic_uint32 *count)
+{
+    int before_changecount;
+    int after_changecount;
+
+    after_changecount = pg_atomic_read_u32(count);
+
+    do
+    {
+        before_changecount = after_changecount;
+        memcpy(dst, src, len);
+        after_changecount = pg_atomic_read_u32(count);
+    } while ((before_changecount & 1) == 1 ||
+             after_changecount != before_changecount);
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
+ *
+ *  We don't scribble on shared stats while resetting to avoid locking on
+ *  shared stats struct. Instead, just record the current counters in another
+ *  shared struct, which is protected by StatsLock. See
+ *  pgstat_fetch_stat_(archiver|bgwriter|checkpointer) for the reader side.
  * ----------
  */
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    TimestampTz now = GetCurrentTimestamp();
+    PgStat_Shared_Reset_Target t;
 
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+        t = RESET_ARCHIVER;
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+        t = RESET_BGWRITER;
     else if (strcmp(target, "wal") == 0)
-        msg.m_resettarget = RESET_WAL;
+        t = RESET_WAL;
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\", \"bgwriter\" or \"wal\".")));
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
+    /* Reset statistics for the cluster. */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+    switch (t)
+    {
+        case RESET_ARCHIVER:
+            pgstat_copy_global_stats(&StatsShmem->archiver_reset_offset,
+                                     &StatsShmem->archiver_stats,
+                                     sizeof(PgStat_Archiver),
+                                     &StatsShmem->archiver_changecount);
+            StatsShmem->archiver_reset_offset.stat_reset_timestamp = now;
+            cached_archiverstats_is_valid = false;
+            break;
+
+        case RESET_BGWRITER:
+            pgstat_copy_global_stats(&StatsShmem->bgwriter_reset_offset,
+                                     &StatsShmem->bgwriter_stats,
+                                     sizeof(PgStat_BgWriter),
+                                     &StatsShmem->bgwriter_changecount);
+            pgstat_copy_global_stats(&StatsShmem->checkpointer_reset_offset,
+                                     &StatsShmem->checkpointer_stats,
+                                     sizeof(PgStat_CheckPointer),
+                                     &StatsShmem->checkpointer_changecount);
+            StatsShmem->bgwriter_reset_offset.stat_reset_timestamp = now;
+            cached_bgwriterstats_is_valid = false;
+            cached_checkpointerstats_is_valid = false;
+            break;
+
+        case RESET_WAL:
+            /*
+             * Unlike the two above, WAL statistics have many writer
+             * processes and are protected by wal_stats_lock.
+             */
+            LWLockAcquire(&StatsShmem->wal_stats_lock, LW_EXCLUSIVE);
+            MemSet(&StatsShmem->wal_stats, 0, sizeof(PgStat_Wal));
+            StatsShmem->wal_stats.stat_reset_timestamp = now;
+            LWLockRelease(&StatsShmem->wal_stats_lock);
+            cached_walstats_is_valid = false;
+            break;
+    }
+
+    LWLockRelease(StatsLock);
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1418,17 +1889,37 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatEntryHeader *header;
+    PgStat_StatDBEntry *dbentry;
+    PgStatTypes stattype;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    dbentry = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, MyDatabaseId, InvalidOid, false, false,
+                       NULL);
+    Assert(dbentry);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* Set the reset timestamp for the whole database */
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->header.lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&dbentry->header.lock);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Remove object if it exists, ignore if not */
+    switch (type)
+    {
+        case RESET_TABLE:
+            stattype = PGSTAT_TYPE_TABLE;
+            break;
+        case RESET_FUNCTION:
+            stattype = PGSTAT_TYPE_FUNCTION;
+    }
+
+    header = get_stat_entry(stattype, MyDatabaseId, objoid, false, false, NULL);
+
+    LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+    memset(PGSTAT_SHENT_BODY(header), 0, PGSTAT_SHENT_BODY_LEN(stattype));
+    LWLockRelease(&header->lock);
 }
 
 /* ----------
@@ -1444,15 +1935,40 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_reset_slru_counter(const char *name)
 {
-    PgStat_MsgResetslrucounter msg;
+    int i;
+    TimestampTz    ts = GetCurrentTimestamp();
+    uint32    assert_changecount PG_USED_FOR_ASSERTS_ONLY;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    if (name)
+    {
+        i = pgstat_slru_index(name);
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+        assert_changecount =
+            pg_atomic_fetch_add_u32(&StatsShmem->slru_changecount, 1);
+        Assert((assert_changecount & 1) == 0);
+        MemSet(&StatsShmem->slru_stats.entry[i], 0,
+               sizeof(PgStat_SLRUStats));
+        StatsShmem->slru_stats.entry[i].stat_reset_timestamp = ts;
+        pg_atomic_add_fetch_u32(&StatsShmem->slru_changecount, 1);
+        LWLockRelease(&StatsShmem->slru_stats.lock);
+    }
+    else
+    {
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+        assert_changecount =
+            pg_atomic_fetch_add_u32(&StatsShmem->slru_changecount, 1);
+        Assert((assert_changecount & 1) == 0);
+        for (i = 0 ; i < SLRU_NUM_ELEMENTS; i++)
+        {
+            MemSet(&StatsShmem->slru_stats.entry[i], 0,
+                   sizeof(PgStat_SLRUStats));
+            StatsShmem->slru_stats.entry[i].stat_reset_timestamp = ts;
+        }
+        pg_atomic_add_fetch_u32(&StatsShmem->slru_changecount, 1);
+        LWLockRelease(&StatsShmem->slru_stats.lock);
+    }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSLRUCOUNTER);
-    msg.m_index = (name) ? pgstat_slru_index(name) : -1;
-
-    pgstat_send(&msg, sizeof(msg));
+    cached_slrustats_is_valid = false;
 }
 
 /* ----------
@@ -1468,20 +1984,19 @@ pgstat_reset_slru_counter(const char *name)
 void
 pgstat_reset_replslot_counter(const char *name)
 {
-    PgStat_MsgResetreplslotcounter msg;
+    int            startidx;
+    int            endidx;
+    int            i;
+    TimestampTz    ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
     if (name)
     {
         ReplicationSlot *slot;
-
-        /*
-         * Check if the slot exits with the given name. It is possible that by
-         * the time this message is executed the slot is dropped but at least
-         * this check will ensure that the given name is for a valid slot.
-         */
+
+        /* Check that a slot with the given name exists. */
         LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
         slot = SearchNamedReplicationSlot(name);
         LWLockRelease(ReplicationSlotControlLock);
@@ -1499,15 +2014,36 @@ pgstat_reset_replslot_counter(const char *name)
         if (SlotIsPhysical(slot))
             return;
 
-        strlcpy(msg.m_slotname, name, NAMEDATALEN);
-        msg.clearall = false;
+        /* reset this one entry */
+        startidx = endidx = slot - ReplicationSlotCtl->replication_slots;
     }
     else
-        msg.clearall = true;
+    {
+        /* reset all existing entries */
+        startidx = 0;
+        endidx = max_replication_slots - 1;
+    }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETREPLSLOTCOUNTER);
+    ts = GetCurrentTimestamp();
+    for (i = startidx ; i <= endidx ; i++)
+    {
+        PgStat_ReplSlot *shent;
 
-    pgstat_send(&msg, sizeof(msg));
+        shent = (PgStat_ReplSlot *)
+            get_stat_entry(PGSTAT_TYPE_REPLSLOT,
+                           MyDatabaseId, i, false, false, NULL);
+
+        /* Skip non-existent entries */
+        if (!shent)
+            continue;
+
+        LWLockAcquire(&shent->header.lock, LW_EXCLUSIVE);
+        memset(&shent->spill_txns, 0,
+               offsetof(PgStat_ReplSlot, stat_reset_timestamp) -
+               offsetof(PgStat_ReplSlot, spill_txns));
+        shent->stat_reset_timestamp = ts;
+        LWLockRelease(&shent->header.lock);
+    }
 }
 
 /* ----------
@@ -1521,48 +2057,93 @@ pgstat_reset_replslot_counter(const char *name)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if activity stats is not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    /*
+     * End-of-vacuum is reported instantly. Report the start the same way for
+     * consistency. Vacuum doesn't run frequently and is a long-lasting
+     * operation so it doesn't matter if we get blocked here a little.
+     */
+    dbentry = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, dboid, InvalidOid, false, true, NULL);
 
-    pgstat_send(&msg, sizeof(msg));
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->header.lock, LW_EXCLUSIVE);
+    dbentry->last_autovac_time = ts;
+    LWLockRelease(&dbentry->header.lock);
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    PgStat_StatTabEntry *tabentry;
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    ts = GetCurrentTimestamp();
+
+    /*
+     * Unlike ordinary operations, maintenance commands take a long time,
+     * so getting blocked at the end of the work doesn't matter. Furthermore,
+     * this prevents stats updates made by transactions that end after this
+     * vacuum from being canceled by a delayed vacuum report. For these
+     * reasons, update the shared stats entry directly.
+     */
+    tabentry = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, tableoid, false, true, NULL);
+
+    LWLockAcquire(&tabentry->header.lock, LW_EXCLUSIVE);
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * It is quite possible that a non-aggressive VACUUM ended up skipping
+     * various pages, however, we'll zero the insert counter here regardless.
+     * It's currently used only to track when we need to perform an "insert"
+     * autovacuum, which are mainly intended to freeze newly inserted tuples.
+     * Zeroing this may just mean we'll not try to vacuum the table again
+     * until enough tuples have been inserted to trigger another insert
+     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
+     * stragglers.
+     */
+    tabentry->inserts_since_vacuum = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = ts;
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = ts;
+        tabentry->vacuum_count++;
+    }
+
+    LWLockRelease(&tabentry->header.lock);
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1573,9 +2154,11 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    PgStat_StatTabEntry *tabentry;
+    Oid        dboid = (rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId);
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1583,10 +2166,10 @@ pgstat_report_analyze(Relation rel,
      * already inserted and/or deleted rows in the target table. ANALYZE will
      * have counted such rows as live or dead respectively. Because we will
      * report our counts of such rows at transaction end, we should subtract
-     * off these counts from what we send to the collector now, else they'll
-     * be double-counted after commit.  (This approach also ensures that the
-     * collector ends up with the right numbers if we abort instead of
-     * committing.)
+     * off these counts from what we write to shared stats now, else
+     * they'll be double-counted after commit.  (This approach also ensures
+     * that the shared stats end up with the right numbers if we abort
+     * instead of committing.)
      */
     if (rel->pgstat_info != NULL)
     {
@@ -1604,137 +2187,223 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Unlike ordinary operations, maintenance commands take a long time,
+     * so getting blocked at the end of the work doesn't matter. Furthermore,
+     * this prevents stats updates made by transactions that end after this
+     * analyze from being canceled by a delayed analyze report. For these
+     * reasons, update the shared stats entry directly.
+     */
+    tabentry = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, RelationGetRelid(rel),
+                       false, true, NULL);
+
+    LWLockAcquire(&tabentry->header.lock, LW_EXCLUSIVE);
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    LWLockRelease(&tabentry->header.lock);
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            dbent->counts.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            dbent->counts.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            dbent->counts.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            dbent->counts.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            dbent->counts.n_conflict_startup_deadlock++;
+            break;
+    }
 }
 
+
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_deadlocks++;
 }
 
-
-
 /* --------
- * pgstat_report_checksum_failures_in_db() -
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
  *
- *    Tell the collector about one or more checksum failures.
+ *    Report one or more checksum failures.
  * --------
  */
 void
 pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
 {
-    PgStat_MsgChecksumFailure msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
+    dbentry = get_local_dbstat_entry(dboid);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* add the reported failures to the accumulated count */
+    dbentry->counts.n_checksum_failures += failurecount;
 }
 
 /* --------
  * pgstat_report_checksum_failure() -
  *
- *    Tell the collector about a checksum failure.
+ *    Report a checksum failure.
  * --------
  */
 void
 pgstat_report_checksum_failure(void)
 {
-    pgstat_report_checksum_failures_in_db(MyDatabaseId, 1);
+    PgStat_StatDBEntry *dbent;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_checksum_failures++;
 }
 
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
+    if (filesize == 0)            /* Is there a case where filesize is really 0? */
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_temp_bytes += filesize; /* needs overflow check */
+    dbent->counts.n_temp_files++;
 }
 
 /* ----------
  * pgstat_report_replslot() -
  *
- *    Tell the collector about replication slot statistics.
+ *    Report replication slot activity.
  * ----------
  */
 void
-pgstat_report_replslot(const char *slotname, int spilltxns, int spillcount,
-                       int spillbytes, int streamtxns, int streamcount, int streambytes)
+pgstat_report_replslot(const char *slotname,
+                       int spilltxns, int spillcount, int spillbytes,
+                       int streamtxns, int streamcount, int streambytes)
 {
-    PgStat_MsgReplSlot msg;
+    PgStat_ReplSlot *shent;
+    int                 i;
+    bool             found;
+
+    if (!area)
+        return;
+
+    for (i = 0 ; i < max_replication_slots ; i++)
+    {
+        if (strcmp(NameStr(ReplicationSlotCtl->replication_slots[i].data.name),
+                   slotname) == 0)
+            break;
+
+    }
 
     /*
-     * Prepare and send the message
+     * The slot must have been removed; just ignore it.  We will create the
+     * entry for a slot with this name next time.
      */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_REPLSLOT);
-    strlcpy(msg.m_slotname, slotname, NAMEDATALEN);
-    msg.m_drop = false;
-    msg.m_spill_txns = spilltxns;
-    msg.m_spill_count = spillcount;
-    msg.m_spill_bytes = spillbytes;
-    msg.m_stream_txns = streamtxns;
-    msg.m_stream_count = streamcount;
-    msg.m_stream_bytes = streambytes;
-    pgstat_send(&msg, sizeof(PgStat_MsgReplSlot));
+    if (i == max_replication_slots)
+        return;
+
+    shent = (PgStat_ReplSlot *)
+        get_stat_entry(PGSTAT_TYPE_REPLSLOT,
+                       MyDatabaseId, i, false, true, &found);
+
+    /* Clear the counters and the dropped flag when we reuse the entry */
+    LWLockAcquire(&shent->header.lock, LW_EXCLUSIVE);
+    if (shent->header.dropped || !found)
+    {
+        memset(&shent->spill_txns, 0,
+               sizeof(PgStat_ReplSlot) - offsetof(PgStat_ReplSlot, spill_txns));
+        strlcpy(shent->slotname, slotname, NAMEDATALEN);
+        shent->header.dropped = false;
+    }
+
+    shent->spill_txns += spilltxns;
+    shent->spill_count += spillcount;
+    shent->spill_bytes += spillbytes;
+    shent->stream_txns += streamtxns;
+    shent->stream_count += streamcount;
+    shent->stream_bytes += streambytes;
+    LWLockRelease(&shent->header.lock);
 }
 
 /* ----------
@@ -1746,55 +2415,44 @@ pgstat_report_replslot(const char *slotname, int spilltxns, int spillcount,
 void
 pgstat_report_replslot_drop(const char *slotname)
 {
-    PgStat_MsgReplSlot msg;
+    int i;
+    PgStat_ReplSlot *shent;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_REPLSLOT);
-    strlcpy(msg.m_slotname, slotname, NAMEDATALEN);
-    msg.m_drop = true;
-    pgstat_send(&msg, sizeof(PgStat_MsgReplSlot));
-}
+    Assert(area);
+    if (!area)
+        return;
 
-/* ----------
- * pgstat_ping() -
- *
- *    Send some junk data to the collector to increase traffic.
- * ----------
- */
-void
-pgstat_ping(void)
-{
-    PgStat_MsgDummy msg;
+    for (i = 0 ; i < max_replication_slots ; i++)
+    {
+        if (strcmp(NameStr(ReplicationSlotCtl->replication_slots[i].data.name),
+                   slotname) == 0)
+            break;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    /* XXX: maybe the slot has been removed; just ignore it. */
+    if (i == max_replication_slots)
+        return;
+
+    shent = (PgStat_ReplSlot *)
+        get_stat_entry(PGSTAT_TYPE_REPLSLOT,
+                       MyDatabaseId, i, false, false, NULL);
+
+    if (shent && !shent->header.dropped)
+    {
+        LWLockAcquire(&shent->header.lock, LW_EXCLUSIVE);
+        shent->header.dropped = true;
+        LWLockRelease(&shent->header.lock);
+    }
 }
 
 /* ----------
- * pgstat_send_inquiry() -
+ * pgstat_init_function_usage() -
  *
- *    Notify collector that we need fresh data.
+ *  Initialize function call usage data.
+ *  Called by the executor before invoking a function.
  * ----------
  */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/*
- * Initialize function call usage data.
- * Called by the executor before invoking a function.
- */
 void
 pgstat_init_function_usage(FunctionCallInfo fcinfo,
                            PgStat_FunctionCallUsage *fcu)
@@ -1809,24 +2467,9 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
         return;
     }
 
-    if (!pgStatFunctions)
-    {
-        /* First time through - initialize function stat table */
-        HASHCTL        hash_ctl;
-
-        hash_ctl.keysize = sizeof(Oid);
-        hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
-        pgStatFunctions = hash_create("Function stat entries",
-                                      PGSTAT_FUNCTION_HASH_SIZE,
-                                      &hash_ctl,
-                                      HASH_ELEM | HASH_BLOBS);
-    }
-
-    /* Get the stats entry for this function, create if necessary */
-    htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
-                          HASH_ENTER, &found);
-    if (!found)
-        MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
+    htabent = (PgStat_BackendFunctionEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             fcinfo->flinfo->fn_oid, true, &found);
 
     fcu->fs = &htabent->f_counts;
 
@@ -1840,31 +2483,37 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
     INSTR_TIME_SET_CURRENT(fcu->f_start);
 }
 
-/*
- * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
- *        for specified function
+/* ----------
+ * find_funcstat_entry() -
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_BackendFunctionEntry entry for specified function
+ *
+ *  If there is no entry, return NULL without creating a new one.
+ * ----------
  */
 PgStat_BackendFunctionEntry *
 find_funcstat_entry(Oid func_id)
 {
-    if (pgStatFunctions == NULL)
-        return NULL;
+    PgStat_BackendFunctionEntry *ent;
 
-    return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
-                                                       (void *) &func_id,
-                                                       HASH_FIND, NULL);
+    ent = (PgStat_BackendFunctionEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             func_id, false, NULL);
+
+    return ent;
 }
 
-/*
- * Calculate function call usage and update stat counters.
- * Called by the executor after invoking a function.
+/* ----------
+ * pgstat_end_function_usage() -
  *
- * In the case of a set-returning function that runs in value-per-call mode,
- * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
- * calls for what the user considers a single call of the function.  The
- * finalize flag should be TRUE on the last call.
+ *  Calculate function call usage and update stat counters.
+ *  Called by the executor after invoking a function.
+ *
+ *  In the case of a set-returning function that runs in value-per-call mode,
+ *  we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
+ *  calls for what the user considers a single call of the function.  The
+ *  finalize flag should be TRUE on the last call.
+ * ----------
  */
 void
 pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
@@ -1905,9 +2554,6 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
         fs->f_numcalls++;
     fs->f_total_time = f_total;
     INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
 }
 
 
@@ -1919,8 +2565,7 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
  *
  *    We assume that a relcache entry's pgstat_info field is zeroed by
  *    relcache.c when the relcache entry is made; thereafter it is long-lived
- *    data.  We can avoid repeated searches of the TabStatus arrays when the
- *    same relation is touched repeatedly within a transaction.
+ *    data.
  * ----------
  */
 void
@@ -1936,7 +2581,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1947,120 +2593,60 @@ pgstat_initstats(Relation rel)
      * If we already set up this relation in the current transaction, nothing
      * to do.
      */
-    if (rel->pgstat_info != NULL &&
-        rel->pgstat_info->t_id == rel_id)
+    if (rel->pgstat_info != NULL)
         return;
 
     /* Else find or make the PgStat_TableStatus entry, and update link */
-    rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+    rel->pgstat_info = get_local_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+    /* mark this relation as the owner */
+
+    /* don't allow linking a stats entry to multiple relcache entries */
+    Assert(rel->pgstat_info->relation == NULL);
+    rel->pgstat_info->relation = rel;
 }
 
 /*
- * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
+ * pgstat_delinkstats() -
+ *
+ *  Break the mutual link between a relcache entry and a local stats entry.
+ *  This must always be called when one end of the link is removed.
  */
-static PgStat_TableStatus *
-get_tabstat_entry(Oid rel_id, bool isshared)
+void
+pgstat_delinkstats(Relation rel)
 {
-    TabStatHashEntry *hash_entry;
-    PgStat_TableStatus *entry;
-    TabStatusArray *tsa;
-    bool        found;
-
-    /*
-     * Create hash table if we don't have it already.
-     */
-    if (pgStatTabHash == NULL)
+    /* remove the link to stats info if any */
+    if (rel && rel->pgstat_info)
     {
-        HASHCTL        ctl;
-
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
+        /* link sanity check */
+        Assert(rel->pgstat_info->relation == rel);
+        rel->pgstat_info->relation = NULL;
+        rel->pgstat_info = NULL;
     }
-
-    /*
-     * Find an entry or create a new one.
-     */
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
-    if (!found)
-    {
-        /* initialize new entry with null pointer */
-        hash_entry->tsa_entry = NULL;
-    }
-
-    /*
-     * If entry is already valid, we're done.
-     */
-    if (hash_entry->tsa_entry)
-        return hash_entry->tsa_entry;
-
-    /*
-     * Locate the first pgStatTabList entry with free space, making a new list
-     * entry if needed.  Note that we could get an OOM failure here, but if so
-     * we have left the hashtable and the list in a consistent state.
-     */
-    if (pgStatTabList == NULL)
-    {
-        /* Set up first pgStatTabList entry */
-        pgStatTabList = (TabStatusArray *)
-            MemoryContextAllocZero(TopMemoryContext,
-                                   sizeof(TabStatusArray));
-    }
-
-    tsa = pgStatTabList;
-    while (tsa->tsa_used >= TABSTAT_QUANTUM)
-    {
-        if (tsa->tsa_next == NULL)
-            tsa->tsa_next = (TabStatusArray *)
-                MemoryContextAllocZero(TopMemoryContext,
-                                       sizeof(TabStatusArray));
-        tsa = tsa->tsa_next;
-    }
-
-    /*
-     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
-     * the entry was already zeroed, either at creation or after last use.
-     */
-    entry = &tsa->tsa_entries[tsa->tsa_used++];
-    entry->t_id = rel_id;
-    entry->t_shared = isshared;
-
-    /*
-     * Now we can fill the entry in pgStatTabHash.
-     */
-    hash_entry->tsa_entry = entry;
-
-    return entry;
 }
 
 /*
  * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_TableStatus entry for rel_id in the current
+ *  database. If not found, look for it among the shared tables.
  *
- * Note: if we got an error in the most recent execution of pgstat_report_stat,
- * it's possible that an entry exists but there's no hashtable entry for it.
- * That's okay, we'll treat this case as "doesn't exist".
+ *  If no entry found, return NULL, don't create a new one
+ * ----------
  */
 PgStat_TableStatus *
 find_tabstat_entry(Oid rel_id)
 {
-    TabStatHashEntry *hash_entry;
+    PgStat_TableStatus *ent;
 
-    /* If hashtable doesn't exist, there are no entries at all */
-    if (!pgStatTabHash)
-        return NULL;
+    ent = (PgStat_TableStatus *)
+        get_local_stat_entry(PGSTAT_TYPE_TABLE, MyDatabaseId, rel_id,
+                             false, NULL);
+    if (!ent)
+        ent = (PgStat_TableStatus *)
+            get_local_stat_entry(PGSTAT_TYPE_TABLE, InvalidOid, rel_id,
+                                 false, NULL);
 
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
-    if (!hash_entry)
-        return NULL;
-
-    /* Note that this step could also return NULL, but that's correct */
-    return hash_entry->tsa_entry;
+    return ent;
 }
 
 /*
@@ -2475,8 +3061,6 @@ AtPrepare_PgStat(void)
             record.inserted_pre_trunc = trans->inserted_pre_trunc;
             record.updated_pre_trunc = trans->updated_pre_trunc;
             record.deleted_pre_trunc = trans->deleted_pre_trunc;
-            record.t_id = tabstat->t_id;
-            record.t_shared = tabstat->t_shared;
             record.t_truncated = trans->truncated;
 
             RegisterTwoPhaseRecord(TWOPHASE_RM_PGSTAT_ID, 0,
@@ -2491,8 +3075,8 @@ AtPrepare_PgStat(void)
  *
  * All we need do here is unlink the transaction stats state from the
  * nontransactional state.  The nontransactional action counts will be
- * reported to the stats collector immediately, while the effects on live
- * and dead tuple counts are preserved in the 2PC state file.
+ * reported to the activity stats facility immediately, while the effects on
+ * live and dead tuple counts are preserved in the 2PC state file.
  *
  * Note: AtEOXact_PgStat is not called during PREPARE.
  */
@@ -2537,7 +3121,7 @@ pgstat_twophase_postcommit(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, commit case */
     pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
@@ -2573,7 +3157,7 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, abort case */
     if (rec->t_truncated)
@@ -2593,85 +3177,138 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    Find the database stats entry for a backend.
+ *
+ *    The returned entry is stored in static memory, so the content is valid
+ *    until the next call of this function for a different database.
  * ----------
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    PgStat_StatDBEntry *shent;
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for InvalidOid */
+    Assert(dbid != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_dbent_key.databaseid == dbid)
+        return &cached_dbent;
+
+    shent = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid, true, false, NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_dbent, shent, sizeof(PgStat_StatDBEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_dbent_key.databaseid = dbid;
+
+    return &cached_dbent;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one table or NULL. NULL doesn't mean
+ *    the activity statistics for one table or NULL. NULL doesn't mean
  *    that the table doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    activity statistics facilities, so the caller is better off to
+ *    report ZERO instead.
  * ----------
  */
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
-    PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_extended(false, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
-
-    return NULL;
+    tabentry = pgstat_fetch_stat_tabentry_extended(true, relid);
+    return tabentry;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_tabentry_extended() -
+ *
+ *    Find the table stats entry for a backend. The returned entry is stored
+ *    in static memory, so the content is valid until the next call of this
+ *    function for a different table.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_extended(bool shared, Oid reloid)
+{
+    PgStat_StatTabEntry *shent;
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for InvalidOid */
+    Assert(reloid != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_tabent_key.databaseid == dboid &&
+        cached_tabent_key.objectid == reloid)
+        return &cached_tabent;
+
+    shent = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, reloid, true, false, NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_tabent, shent, sizeof(PgStat_StatTabEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_tabent_key.databaseid = dboid;
+    cached_tabent_key.objectid = reloid;
+
+    return &cached_tabent;
+}
+
+
+/* ----------
+ * pgstat_copy_index_counters() -
+ *
+ *    Support function for index swapping. Copy a portion of the counters of the
+ *    relation to the specified place.
+ * ----------
+ */
+void
+pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst)
+{
+    PgStat_StatTabEntry *tabentry;
+
+    /* No point fetching tabentry when dst is NULL */
+    if (!dst)
+        return;
+
+    tabentry = pgstat_fetch_stat_tabentry(relid);
+
+    if (!tabentry)
+        return;
+
+    dst->t_counts.t_numscans = tabentry->numscans;
+    dst->t_counts.t_tuples_returned = tabentry->tuples_returned;
+    dst->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
+    dst->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
+    dst->t_counts.t_blocks_hit = tabentry->blocks_hit;
 }
 
 
@@ -2680,30 +3317,46 @@ pgstat_fetch_stat_tabentry(Oid relid)
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
  *    the collected statistics for one function or NULL.
+ *
+ *    The returned entry is stored in static memory, so the content is valid
+ *    until the next call of this function for a different function id.
  * ----------
  */
 PgStat_StatFuncEntry *
 pgstat_fetch_stat_funcentry(Oid func_id)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry = NULL;
-
-    /* load the stats file if needed */
-    backend_read_statsfile();
-
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
-
-    return funcentry;
+    PgStat_StatFuncEntry *shent;
+    Oid    dboid = MyDatabaseId;
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for InvalidOid */
+    Assert(func_id != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_funcent_key.databaseid == dboid &&
+        cached_funcent_key.objectid == func_id)
+        return &cached_funcent;
+
+    shent = (PgStat_StatFuncEntry *)
+        get_stat_entry(PGSTAT_TYPE_FUNCTION, dboid, func_id, true, false,
+                       NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_funcent, shent, sizeof(PgStat_StatFuncEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_funcent_key.databaseid = dboid;
+    cached_funcent_key.objectid = func_id;
+
+    return &cached_funcent;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_beentry() -
  *
@@ -2763,53 +3416,160 @@ pgstat_fetch_stat_numbackends(void)
     return localNumBackends;
 }
 
+/*
+ * ---------
+ * pgstat_get_stat_timestamp() -
+ *
+ *  Returns the last update timestamp of global statistics.
+ */
+TimestampTz
+pgstat_get_stat_timestamp(void)
+{
+    return (TimestampTz) pg_atomic_read_u64(&StatsShmem->stats_timestamp);
+}
+
 /*
  * ---------
  * pgstat_fetch_stat_archiver() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the archiver statistics struct.
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_ArchiverStats *
+PgStat_Archiver *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    PgStat_Archiver reset;
+    PgStat_Archiver *reset_shared = &StatsShmem->archiver_reset_offset;
+    PgStat_Archiver *shared = &StatsShmem->archiver_stats;
+    PgStat_Archiver *cached = &cached_archiverstats;
 
-    return &archiverStats;
+    pgstat_copy_global_stats(cached, shared, sizeof(PgStat_Archiver),
+                             &StatsShmem->archiver_changecount);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&reset, reset_shared, sizeof(PgStat_Archiver));
+    LWLockRelease(StatsLock);
+
+    /* compensate by reset offsets */
+    if (cached->archived_count == reset.archived_count)
+    {
+        cached->last_archived_wal[0] = 0;
+        cached->last_archived_timestamp = 0;
+    }
+    cached->archived_count -= reset.archived_count;
+
+    if (cached->failed_count == reset.failed_count)
+    {
+        cached->last_failed_wal[0] = 0;
+        cached->last_failed_timestamp = 0;
+    }
+    cached->failed_count -= reset.failed_count;
+
+    cached->stat_reset_timestamp = reset.stat_reset_timestamp;
+
+    cached_archiverstats_is_valid = true;
+
+    return &cached_archiverstats;
 }
 
 
 /*
  * ---------
- * pgstat_fetch_global() -
+ * pgstat_fetch_stat_bgwriter() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the global statistics struct.
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_GlobalStats *
-pgstat_fetch_global(void)
+PgStat_BgWriter *
+pgstat_fetch_stat_bgwriter(void)
 {
-    backend_read_statsfile();
+    PgStat_BgWriter reset;
+    PgStat_BgWriter *reset_shared = &StatsShmem->bgwriter_reset_offset;
+    PgStat_BgWriter *shared = &StatsShmem->bgwriter_stats;
+    PgStat_BgWriter *cached = &cached_bgwriterstats;
+
+    pgstat_copy_global_stats(cached, shared, sizeof(PgStat_BgWriter),
+                             &StatsShmem->bgwriter_changecount);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&reset, reset_shared, sizeof(PgStat_BgWriter));
+    LWLockRelease(StatsLock);
+
+    /* compensate by reset offsets */
+    cached->buf_written_clean -= reset.buf_written_clean;
+    cached->maxwritten_clean  -= reset.maxwritten_clean;
+    cached->buf_alloc          -= reset.buf_alloc;
+    cached->stat_reset_timestamp = reset.stat_reset_timestamp;
+
+    cached_bgwriterstats_is_valid = true;
+
+    return &cached_bgwriterstats;
+}
+
+/*
+ * ---------
+ * pgstat_fetch_stat_checkpointer() -
+ *
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
+ * ---------
+ */
+PgStat_CheckPointer *
+pgstat_fetch_stat_checkpointer(void)
+{
+    PgStat_CheckPointer reset;
+    PgStat_CheckPointer *reset_shared = &StatsShmem->checkpointer_reset_offset;
+    PgStat_CheckPointer *shared = &StatsShmem->checkpointer_stats;
+    PgStat_CheckPointer *cached = &cached_checkpointerstats;
+
+    pgstat_copy_global_stats(cached, shared, sizeof(PgStat_CheckPointer),
+                             &StatsShmem->checkpointer_changecount);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&reset, reset_shared, sizeof(PgStat_CheckPointer));
+    LWLockRelease(StatsLock);
+
+    /* compensate by reset offsets */
+    cached->timed_checkpoints       -= reset.timed_checkpoints;
+    cached->requested_checkpoints   -= reset.requested_checkpoints;
+    cached->buf_written_checkpoints -= reset.buf_written_checkpoints;
+    cached->buf_written_backend     -= reset.buf_written_backend;
+    cached->buf_fsync_backend       -= reset.buf_fsync_backend;
+    cached->checkpoint_write_time   -= reset.checkpoint_write_time;
+    cached->checkpoint_sync_time    -= reset.checkpoint_sync_time;
+
+    cached_checkpointerstats_is_valid = true;
 
-    return &globalStats;
+    return &cached_checkpointerstats;
 }
 
 /*
  * ---------
  * pgstat_fetch_stat_wal() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the WAL statistics struct.
+ *    Support function for the SQL-callable pgstat* functions. The returned entry
+ *  is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_WalStats *
+PgStat_Wal *
 pgstat_fetch_stat_wal(void)
 {
-    backend_read_statsfile();
+    if (!cached_walstats_is_valid)
+    {
+        LWLockAcquire(StatsLock, LW_SHARED);
+        memcpy(&cached_walstats, &StatsShmem->wal_stats, sizeof(PgStat_Wal));
+        LWLockRelease(StatsLock);
+    }
 
-    return &walStats;
+    cached_walstats_is_valid = true;
+
+    return &cached_walstats;
 }
 
 /*
@@ -2823,9 +3583,27 @@ pgstat_fetch_stat_wal(void)
 PgStat_SLRUStats *
 pgstat_fetch_slru(void)
 {
-    backend_read_statsfile();
+    size_t size = sizeof(PgStat_SLRUStats) * SLRU_NUM_ELEMENTS;
 
-    return slruStats;
+    for (;;)
+    {
+        uint32 before_count;
+        uint32 after_count;
+
+        pg_read_barrier();
+        before_count = pg_atomic_read_u32(&StatsShmem->slru_changecount);
+        memcpy(&cached_slrustats, &StatsShmem->slru_stats, size);
+        after_count = pg_atomic_read_u32(&StatsShmem->slru_changecount);
+
+        if (before_count == after_count && (before_count & 1) == 0)
+            break;
+
+        CHECK_FOR_INTERRUPTS();
+    }
+
+    cached_slrustats_is_valid = true;
+
+    return &cached_slrustats;
 }
 
 /*
@@ -2837,13 +3615,41 @@ pgstat_fetch_slru(void)
  *    number of entries in nslots_p.
  * ---------
  */
-PgStat_ReplSlotStats *
+PgStat_ReplSlot *
 pgstat_fetch_replslot(int *nslots_p)
 {
-    backend_read_statsfile();
 
-    *nslots_p = nReplSlotStats;
-    return replSlotStats;
+    if (cached_replslotstats == NULL)
+    {
+        cached_replslotstats = (PgStat_ReplSlot *)
+            MemoryContextAlloc(pgStatCacheContext,
+                               sizeof(PgStat_ReplSlot) * max_replication_slots);
+    }
+
+    if (n_cached_replslotstats < 0)
+    {
+        int n = 0;
+        int i;
+
+        for (i = 0 ; i < max_replication_slots ; i++)
+        {
+            PgStat_ReplSlot *shent = (PgStat_ReplSlot *)
+                get_stat_entry(PGSTAT_TYPE_REPLSLOT, MyDatabaseId, i,
+                               false, false, NULL);
+            if (shent && !shent->header.dropped)
+            {
+                memcpy(cached_replslotstats[n++].slotname,
+                       shent->slotname,
+                       sizeof(PgStat_ReplSlot) -
+                       offsetof(PgStat_ReplSlot, slotname));
+            }
+        }
+
+        n_cached_replslotstats = n;
+    }
+
+    *nslots_p = n_cached_replslotstats;
+    return cached_replslotstats;
 }
 
 /* ------------------------------------------------------------
@@ -3063,8 +3869,8 @@ pgstat_initialize(void)
      */
     prevWalUsage = pgWalUsage;
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* needs to be called before dsm shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -3240,12 +4046,15 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+    /* attach shared database stats area */
+    attach_shared_stats();
 }
 
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
- * Flush any remaining statistics counts out to the collector.
+ * Flush any remaining statistics counts out to shared stats.
  * Without this, operations triggered during backend exit (such as
  * temp table deletions) won't be counted.
  *
@@ -3258,13 +4067,22 @@ pgstat_beshutdown_hook(int code, Datum arg)
 
     /*
      * If we got as far as discovering our own database ID, we can report what
-     * we did to the collector.  Otherwise, we'd be sending an invalid
+     * we did to the shared stats.  Otherwise, we'd be sending an invalid
      * database ID, so forget it.  (This means that accesses to pg_database
      * during failed backend starts might never get counted.)
      */
     if (OidIsValid(MyDatabaseId))
         pgstat_report_stat(true);
 
+    /*
+     * We need to clean up temporary slots before detaching shared statistics
+     * so that the statistics for temporary slots are properly removed.
+     */
+    if (MyReplicationSlot != NULL)
+        ReplicationSlotRelease();
+
+    ReplicationSlotCleanup();
+
     /*
      * Clear my status entry, following the protocol of bumping st_changecount
      * before and after.  We use a volatile pointer here to ensure the
@@ -3275,6 +4093,8 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    detach_shared_stats(true);
 }
 
 
@@ -3535,7 +4355,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3830,8 +4651,8 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
+        case WAIT_EVENT_READING_STATS_FILE:
+            event_name = "ReadingStatsFile";
             break;
         case WAIT_EVENT_RECOVERY_WAL_STREAM:
             event_name = "RecoveryWalStream";
@@ -4485,94 +5306,80 @@ pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
 
 
 /* ----------
- * pgstat_setheader() -
+ * pgstat_report_archiver() -
  *
- *        Set common header fields in a statistics message
+ *        Report archiver statistics
  * ----------
  */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
+void
+pgstat_report_archiver(const char *xlog, bool failed)
 {
-    int            rc;
+    TimestampTz now = GetCurrentTimestamp();
+    uint32    before_count PG_USED_FOR_ASSERTS_ONLY;
+    uint32    after_count PG_USED_FOR_ASSERTS_ONLY;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
 
-    ((PgStat_MsgHdr *) msg)->m_size = len;
+    START_CRIT_SECTION();
+    before_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->archiver_changecount, 1);
+    Assert((before_count & 1) == 0);
 
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
+    if (failed)
     {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
- *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
- * ----------
- */
-void
-pgstat_send_archiver(const char *xlog, bool failed)
-{
-    PgStat_MsgArchiver msg;
-
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    strlcpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+        ++StatsShmem->archiver_stats.failed_count;
+        memcpy(&StatsShmem->archiver_stats.last_failed_wal, xlog,
+               sizeof(StatsShmem->archiver_stats.last_failed_wal));
+        StatsShmem->archiver_stats.last_failed_timestamp = now;
+    }
+    else
+    {
+        ++StatsShmem->archiver_stats.archived_count;
+        memcpy(&StatsShmem->archiver_stats.last_archived_wal, xlog,
+               sizeof(StatsShmem->archiver_stats.last_archived_wal));
+        StatsShmem->archiver_stats.last_archived_timestamp = now;
+    }
+
+    after_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->archiver_changecount, 1);
+    Assert(after_count == before_count + 1);
+    END_CRIT_SECTION();
 }
 
 /* ----------
- * pgstat_send_bgwriter() -
+ * pgstat_report_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
-pgstat_send_bgwriter(void)
+pgstat_report_bgwriter(void)
 {
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+    PgStat_BgWriter *s = &StatsShmem->bgwriter_stats;
+    PgStat_BgWriter *l = &BgWriterStats;
+    uint32    before_count PG_USED_FOR_ASSERTS_ONLY;
+    uint32    after_count PG_USED_FOR_ASSERTS_ONLY;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid taking a lock for completely empty stats.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    START_CRIT_SECTION();
+    before_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->bgwriter_changecount, 1);
+    Assert((before_count & 1) == 0);
+
+    s->buf_written_clean += l->buf_written_clean;
+    s->maxwritten_clean += l->maxwritten_clean;
+    s->buf_alloc += l->buf_alloc;
+
+    after_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->bgwriter_changecount, 1);
+    Assert(after_count == before_count + 1);
+    END_CRIT_SECTION();
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4581,537 +5388,136 @@ pgstat_send_bgwriter(void)
 }
 
 /* ----------
- * pgstat_send_wal() -
+ * pgstat_report_checkpointer() -
  *
- *        Send WAL statistics to the collector
+ *        Report checkpointer statistics
  * ----------
  */
 void
-pgstat_send_wal(void)
+pgstat_report_checkpointer(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgWal all_zeroes;
-
-    WalUsage    walusage;
-
-    /*
-     * Calculate how much WAL usage counters are increased by substracting the
-     * previous counters from the current ones. Fill the results in WAL stats
-     * message.
-     */
-    MemSet(&walusage, 0, sizeof(WalUsage));
-    WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);
-
-    WalStats.m_wal_records = walusage.wal_records;
-    WalStats.m_wal_fpi = walusage.wal_fpi;
-    WalStats.m_wal_bytes = walusage.wal_bytes;
+    static const PgStat_CheckPointer all_zeroes;
+    PgStat_CheckPointer *s = &StatsShmem->checkpointer_stats;
+    PgStat_CheckPointer *l = &CheckPointerStats;
+    uint32    before_count PG_USED_FOR_ASSERTS_ONLY;
+    uint32    after_count PG_USED_FOR_ASSERTS_ONLY;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid taking a lock for completely empty stats.
      */
-    if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
+    if (memcmp(&CheckPointerStats, &all_zeroes,
+               sizeof(PgStat_CheckPointer)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&WalStats.m_hdr, PGSTAT_MTYPE_WAL);
-    pgstat_send(&WalStats, sizeof(WalStats));
+    START_CRIT_SECTION();
+    before_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->checkpointer_changecount, 1);
+    Assert((before_count & 1) == 0);
 
-    /*
-     * Save the current counters for the subsequent calculation of WAL usage.
-     */
-    prevWalUsage = pgWalUsage;
+    s->timed_checkpoints += l->timed_checkpoints;
+    s->requested_checkpoints += l->requested_checkpoints;
+    s->checkpoint_write_time += l->checkpoint_write_time;
+    s->checkpoint_sync_time += l->checkpoint_sync_time;
+    s->buf_written_checkpoints += l->buf_written_checkpoints;
+    s->buf_written_backend += l->buf_written_backend;
+    s->buf_fsync_backend += l->buf_fsync_backend;
+
+    after_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->checkpointer_changecount, 1);
+    Assert(after_count == before_count + 1);
+    END_CRIT_SECTION();
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
      */
-    MemSet(&WalStats, 0, sizeof(WalStats));
+    MemSet(&CheckPointerStats, 0, sizeof(CheckPointerStats));
 }
 
 /* ----------
- * pgstat_send_slru() -
+ * pgstat_report_wal() -
  *
- *        Send SLRU statistics to the collector
+ *        Report WAL statistics
  * ----------
  */
-static void
-pgstat_send_slru(void)
+void
+pgstat_report_wal(void)
 {
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgSLRU all_zeroes;
-
-    for (int i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /*
-         * This function can be called even if nothing at all has happened. In
-         * this case, avoid sending a completely empty message to the stats
-         * collector.
-         */
-        if (memcmp(&SLRUStats[i], &all_zeroes, sizeof(PgStat_MsgSLRU)) == 0)
-            continue;
-
-        /* set the SLRU type before each send */
-        SLRUStats[i].m_index = i;
-
-        /*
-         * Prepare and send the message
-         */
-        pgstat_setheader(&SLRUStats[i].m_hdr, PGSTAT_MTYPE_SLRU);
-        pgstat_send(&SLRUStats[i], sizeof(PgStat_MsgSLRU));
-
-        /*
-         * Clear out the statistics buffer, so it can be re-used.
-         */
-        MemSet(&SLRUStats[i], 0, sizeof(PgStat_MsgSLRU));
-    }
+    flush_walstat(false);
 }
 
-
 /* ----------
- * PgstatCollectorMain() -
+ * get_local_dbstat_entry() -
  *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-    WaitEvent    event;
-    WaitEventSet *wes;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, SignalHandlerForConfigReload);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    MyBackendType = B_STATS_COLLECTOR;
-    init_ps_display(NULL);
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /* Prepare to wait for our latch or data in our socket. */
-    wes = CreateWaitEventSet(CurrentMemoryContext, 3);
-    AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
-    AddWaitEventToSet(wes, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL, NULL);
-    AddWaitEventToSet(wes, WL_SOCKET_READABLE, pgStatSock, NULL, NULL);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do check ConfigReloadPending
-     * inside the inner loop, which means that such interrupts will get
-     * serviced but the latch won't get cleared until next time there is a
-     * break in the action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (ShutdownRequestPending)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * ShutdownRequestPending becomes set.
-         */
-        while (!ShutdownRequestPending)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (ConfigReloadPending)
-            {
-                ConfigReloadPending = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSLRUCOUNTER:
-                    pgstat_recv_resetslrucounter(&msg.msg_resetslrucounter,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETREPLSLOTCOUNTER:
-                    pgstat_recv_resetreplslotcounter(&msg.msg_resetreplslotcounter,
-                                                     len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_WAL:
-                    pgstat_recv_wal(&msg.msg_wal, len);
-                    break;
-
-                case PGSTAT_MTYPE_SLRU:
-                    pgstat_recv_slru(&msg.msg_slru, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_REPLSLOT:
-                    pgstat_recv_replslot(&msg.msg_replslot, len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitEventSetWait(wes, -1L, &event, 1, WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitEventSetWait(wes, 2 * 1000L /* msec */ , &event, 1,
-                              WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr == 1 && event.events == WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    FreeWaitEventSet(wes);
-
-    exit(0);
-}
-
-/*
- * Subroutine to clear stats in a database entry
- *
- * Tables and functions hashes are initialized to empty.
- */
-static void
-reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
-{
-    HASHCTL        hash_ctl;
-
-    dbentry->n_xact_commit = 0;
-    dbentry->n_xact_rollback = 0;
-    dbentry->n_blocks_fetched = 0;
-    dbentry->n_blocks_hit = 0;
-    dbentry->n_tuples_returned = 0;
-    dbentry->n_tuples_fetched = 0;
-    dbentry->n_tuples_inserted = 0;
-    dbentry->n_tuples_updated = 0;
-    dbentry->n_tuples_deleted = 0;
-    dbentry->last_autovac_time = 0;
-    dbentry->n_conflict_tablespace = 0;
-    dbentry->n_conflict_lock = 0;
-    dbentry->n_conflict_snapshot = 0;
-    dbentry->n_conflict_bufferpin = 0;
-    dbentry->n_conflict_startup_deadlock = 0;
-    dbentry->n_temp_files = 0;
-    dbentry->n_temp_bytes = 0;
-    dbentry->n_deadlocks = 0;
-    dbentry->n_checksum_failures = 0;
-    dbentry->last_checksum_failure = 0;
-    dbentry->n_block_read_time = 0;
-    dbentry->n_block_write_time = 0;
-
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
-}
-
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+ *  Find or create a local PgStat_StatDBEntry entry for dbid.  A new entry
+ *  is created and initialized if none exists.
  */
 static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
+get_local_dbstat_entry(Oid dbid)
 {
-    PgStat_StatDBEntry *result;
+    PgStat_StatDBEntry *dbentry;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
-        return NULL;
 
     /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
+     * Find an entry or create a new one.
      */
-    if (!found)
-        reset_dbentry_counters(result);
+    dbentry = (PgStat_StatDBEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid,
+                             true, &found);
 
-    return result;
+    return dbentry;
 }
 
-
-/*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+/* ----------
+ * get_local_tabstat_entry() -
+ *  Find or create a PgStat_TableStatus entry for rel.  A new entry is
+ *  created and initialized if none exists.
+ * ----------
  */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+static PgStat_TableStatus *
+get_local_tabstat_entry(Oid rel_id, bool isshared)
 {
-    PgStat_StatTabEntry *result;
+    PgStat_TableStatus *tabentry;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
 
-    /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
+    tabentry = (PgStat_TableStatus *)
+        get_local_stat_entry(PGSTAT_TYPE_TABLE,
+                             isshared ? InvalidOid : MyDatabaseId,
+                             rel_id, true, &found);
 
-    if (!create && !found)
-        return NULL;
-
-    /* If not found, initialize the new one. */
-    if (!found)
-    {
-        result->numscans = 0;
-        result->tuples_returned = 0;
-        result->tuples_fetched = 0;
-        result->tuples_inserted = 0;
-        result->tuples_updated = 0;
-        result->tuples_deleted = 0;
-        result->tuples_hot_updated = 0;
-        result->n_live_tuples = 0;
-        result->n_dead_tuples = 0;
-        result->changes_since_analyze = 0;
-        result->inserts_since_vacuum = 0;
-        result->blocks_fetched = 0;
-        result->blocks_hit = 0;
-        result->vacuum_timestamp = 0;
-        result->vacuum_count = 0;
-        result->autovac_vacuum_timestamp = 0;
-        result->autovac_vacuum_count = 0;
-        result->analyze_timestamp = 0;
-        result->analyze_count = 0;
-        result->autovac_analyze_timestamp = 0;
-        result->autovac_analyze_count = 0;
-    }
-
-    return result;
+    return tabentry;
 }
 
-
 /* ----------
- * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
+ * pgstat_write_statsfile() -
+ *        Write the global statistics file, as well as DB files.
  *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ * This function is called in the last process that is accessing the shared
+ * stats so locking is not required.
  * ----------
  */
 static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+pgstat_write_statsfile(void)
 {
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
-    int            i;
+    dshash_seq_status hstat;
+    PgStatHashEntry *ps;
+
+    /* Stats are not initialized yet; just return. */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
+
+    /* this is the last process that is accessing the shared stats */
+#ifdef USE_ASSERT_CHECKING
+    LWLockAcquire(StatsLock, LW_SHARED);
+    Assert(StatsShmem->refcount == 0);
+    LWLockRelease(StatsLock);
+#endif
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -5131,7 +5537,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    pg_atomic_write_u64(&StatsShmem->stats_timestamp, GetCurrentTimestamp());
 
     /*
      * Write the file header --- currently just a format ID.
@@ -5141,200 +5547,72 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Write global stats struct
+     * Write bgwriter global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(&StatsShmem->bgwriter_stats, sizeof(PgStat_BgWriter), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Write archiver stats struct
+     * Write checkpointer global stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+    rc = fwrite(&StatsShmem->checkpointer_stats, sizeof(PgStat_CheckPointer), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Write WAL stats struct
+     * Write archiver global stats struct
      */
-    rc = fwrite(&walStats, sizeof(walStats), 1, fpout);
+    rc = fwrite(&StatsShmem->archiver_stats, sizeof(PgStat_Archiver), 1,
+                fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write WAL global stats struct
+     */
+    rc = fwrite(&StatsShmem->wal_stats, sizeof(PgStat_Wal), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write SLRU stats struct
      */
-    rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
+    rc = fwrite(&StatsShmem->slru_stats, sizeof(PgStatSharedSLRUStats), 1,
+                fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Walk through the database table.
+     * Walk through the stats entries
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((ps = dshash_seq_next(&hstat)) != NULL)
     {
-        /*
-         * Write out the table and function stats for this DB into the
-         * appropriate per-DB stat file, if required.
-         */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
+        PgStat_StatEntryHeader *shent;
+        size_t                    len;
+
+        CHECK_FOR_INTERRUPTS();
+
+        shent = (PgStat_StatEntryHeader *)dsa_get_address(area, ps->body);
+
+        /* we may have some "dropped" entries not yet removed, skip them */
+        if (shent->dropped)
+            continue;
+
+        /* Make DB's timestamp consistent with the global stats */
+        if (ps->key.type == PGSTAT_TYPE_DB)
         {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+            PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) shent;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
+            dbentry->stats_timestamp =
+                (TimestampTz) pg_atomic_read_u64(&StatsShmem->stats_timestamp);
         }
 
-        /*
-         * Write out the DB entry. We don't write the tables or functions
-         * pointers, since they're of no use to any other process.
-         */
-        fputc('D', fpout);
-        rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * Write replication slot stats struct
-     */
-    for (i = 0; i < nReplSlotStats; i++)
-    {
-        fputc('R', fpout);
-        rc = fwrite(&replSlotStats[i], sizeof(PgStat_ReplSlotStats), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
-}
-
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
-
-/* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
- * ----------
- */
-static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
-{
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpout;
-    int32        format_id;
-    Oid            dbid = dbentry->databaseid;
-    int            rc;
-    char        tmpfile[MAXPGPATH];
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database's access stats per table.
-     */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
-    {
-        fputc('T', fpout);
-        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
+        fputc('S', fpout);
+        rc = fwrite(&ps->key, sizeof(PgStatHashKey), 1, fpout);
 
-    /*
-     * Walk through the database's function stats table.
-     */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+        /* Write the entry, excluding its header part */
+        len = PGSTAT_SHENT_BODY_LEN(ps->key.type);
+        rc = fwrite(PGSTAT_SHENT_BODY(shent), len, 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_seq_term(&hstat);
 
     /*
      * No more output to be done. Close the temp file and replace the old
@@ -5368,113 +5646,63 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
 }
 
 /* ----------
- * pgstat_read_statsfiles() -
- *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ * pgstat_read_statsfile() -
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
+ *    Reads in existing activity statistics file into the shared stats hash.
  *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
+ * This function is called in the only process that is accessing the shared
+ * stats so locking is not required.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+static void
+pgstat_read_statsfile(void)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-    int            i;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    char        tag;
 
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
+    /* shouldn't be called from postmaster */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Create the DB hashtable
-     */
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    /* this is the only process that is accessing the shared stats */
+#ifdef USE_ASSERT_CHECKING
+    LWLockAcquire(StatsLock, LW_SHARED);
+    Assert(StatsShmem->refcount == 1);
+    LWLockRelease(StatsLock);
+#endif
 
-    /* Allocate the space for replication slot statistics */
-    replSlotStats = palloc0(max_replication_slots * sizeof(PgStat_ReplSlotStats));
-    nReplSlotStats = 0;
-
-    /*
-     * Clear out global, archiver, WAL and SLRU statistics so they start from
-     * zero in case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-    memset(&walStats, 0, sizeof(walStats));
-    memset(&slruStats, 0, sizeof(slruStats));
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-    walStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Set the same reset timestamp for all SLRU items too.
-     */
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-        slruStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Set the same reset timestamp for all replication slots too.
-     */
-    for (i = 0; i < max_replication_slots; i++)
-        replSlotStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    StatsShmem->bgwriter_stats.stat_reset_timestamp = GetCurrentTimestamp();
+    StatsShmem->archiver_stats.stat_reset_timestamp =
+        StatsShmem->bgwriter_stats.stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the activity statistics is not running or
+     * has not yet written the stats file the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -5483,681 +5711,150 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
-        goto done;
-    }
-
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        goto done;
-    }
-
-    /*
-     * Read WAL stats struct
-     */
-    if (fread(&walStats, 1, sizeof(walStats), fpin) != sizeof(walStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&walStats, 0, sizeof(walStats));
         goto done;
     }
 
     /*
-     * Read SLRU stats struct
+     * Read bgwriter stats struct
      */
-    if (fread(slruStats, 1, sizeof(slruStats), fpin) != sizeof(slruStats))
+    if (fread(&StatsShmem->bgwriter_stats, 1, sizeof(PgStat_BgWriter), fpin) !=
+        sizeof(PgStat_BgWriter))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&slruStats, 0, sizeof(slruStats));
+        MemSet(&StatsShmem->bgwriter_stats, 0, sizeof(PgStat_BgWriter));
         goto done;
     }
 
     /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Add to the DB hash
-                 */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
-                break;
-
-                /*
-                 * 'R'    A PgStat_ReplSlotStats struct describing a replication
-                 * slot follows.
-                 */
-            case 'R':
-                if (fread(&replSlotStats[nReplSlotStats], 1, sizeof(PgStat_ReplSlotStats), fpin)
-                    != sizeof(PgStat_ReplSlotStats))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    memset(&replSlotStats[nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
-                    goto done;
-                }
-                nReplSlotStats++;
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-
-    return dbhash;
-}
-
-
-/* ----------
- * pgstat_read_db_statsfile() -
- *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
- * ----------
- */
-static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
-{
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatTabEntry tabbuf;
-    PgStat_StatFuncEntry funcbuf;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return;
-    }
-
-    /*
-     * Verify it's of the expected format.
+     * Read checkpointer stats struct
      */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
+    if (fread(&StatsShmem->checkpointer_stats, 1, sizeof(PgStat_CheckPointer), fpin) !=
+        sizeof(PgStat_CheckPointer))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
+        MemSet(&StatsShmem->checkpointer_stats, 0, sizeof(PgStat_CheckPointer));
         goto done;
     }
 
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'T'    A PgStat_StatTabEntry follows.
-                 */
-            case 'T':
-                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
-                          fpin) != sizeof(PgStat_StatTabEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
-                break;
-
-                /*
-                 * 'F'    A PgStat_StatFuncEntry follows.
-                 */
-            case 'F':
-                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
-                          fpin) != sizeof(PgStat_StatFuncEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
-                break;
-
-                /*
-                 * 'E'    The EOF marker of a complete stats file.
-                 */
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts. The caller must
- *    rely on timestamp stored in *ts iff the function returns true.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    PgStat_WalStats myWalStats;
-    PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
-    PgStat_ReplSlotStats myReplSlotStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
     /*
      * Read archiver stats struct
      */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
+    if (fread(&StatsShmem->archiver_stats, 1, sizeof(PgStat_Archiver),
+              fpin) != sizeof(PgStat_Archiver))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
+        MemSet(&StatsShmem->archiver_stats, 0, sizeof(PgStat_Archiver));
+        goto done;
     }
 
     /*
      * Read WAL stats struct
      */
-    if (fread(&myWalStats, 1, sizeof(myWalStats), fpin) != sizeof(myWalStats))
+    if (fread(&StatsShmem->wal_stats, 1, sizeof(PgStat_Wal), fpin)
+        != sizeof(PgStat_Wal))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
+        MemSet(&StatsShmem->wal_stats, 0, sizeof(PgStat_Wal));
+        goto done;
     }
 
     /*
      * Read SLRU stats struct
      */
-    if (fread(mySLRUStats, 1, sizeof(mySLRUStats), fpin) != sizeof(mySLRUStats))
+    if (fread(&StatsShmem->slru_stats, 1, sizeof(PgStatSharedSLRUStats),
+              fpin) != sizeof(PgStatSharedSLRUStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-                /*
-                 * 'R'    A PgStat_ReplSlotStats struct describing a replication
-                 * slot follows.
-                 */
-            case 'R':
-                if (fread(&myReplSlotStats, 1, sizeof(PgStat_ReplSlotStats), fpin)
-                    != sizeof(PgStat_ReplSlotStats))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-        }
+        goto done;
     }
 
-done:
-    FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
     /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
+    while ((tag = fgetc(fpin)) == 'S')
     {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
+        PgStatHashKey        key;
+        PgStat_StatEntryHeader *p;
+        size_t                len;
 
         CHECK_FOR_INTERRUPTS();
 
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
+        if (fread(&key, 1, sizeof(key), fpin) != sizeof(key))
         {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"", statfile)));
+            goto done;
         }
 
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                ereport(LOG,
-                        (errmsg("statistics collector's time %s is later than backend local time %s",
-                                filetime, mytime)));
-                pfree(filetime);
-                pfree(mytime);
-            }
+        p = get_stat_entry(key.type, key.databaseid, key.objectid,
+                              false, true, &found);
 
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
+        /* don't allow duplicate entries */
+        if (found)
+        {
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"",
+                            statfile)));
+            goto done;
         }
 
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
+        /* Avoid overwriting header part */
+        len = PGSTAT_SHENT_BODY_LEN(key.type);
 
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
+        if (fread(PGSTAT_SHENT_BODY(p), 1, len, fpin) != len)
+        {
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"", statfile)));
+            goto done;
+        }
     }
 
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
+    if (tag != 'E')
+    {
         ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
+                (errmsg("corrupted statistics file \"%s\"",
+                        statfile)));
+        goto done;
+    }
 
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
+done:
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+
+    return;
 }
 
-
 /* ----------
  * pgstat_setup_memcxt() -
  *
- *    Create pgStatLocalContext, if not already done.
+ *    Create pgStatLocalContext if not already done.
  * ----------
  */
 static void
 pgstat_setup_memcxt(void)
 {
     if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
-}
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
 
+    if (!pgStatCacheContext)
+        pgStatCacheContext =
+            AllocSetContextCreate(CacheMemoryContext,
+                                  "Activity statistics",
+                                  ALLOCSET_SMALL_SIZES);
+}
 
 /* ----------
  * pgstat_clear_snapshot() -
@@ -6174,906 +5871,25 @@ pgstat_clear_snapshot(void)
 {
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
+    {
         MemoryContextDelete(pgStatLocalContext);
 
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            ereport(LOG,
-                    (errmsg("stats_timestamp %s is later than collector's time %s for database %u",
-                            writetime, mytime, dbentry->databaseid)));
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-                tabentry->inserts_since_vacuum = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_WAL)
-    {
-        /* Reset the WAL statistics for the cluster. */
-        memset(&walStats, 0, sizeof(walStats));
-        walStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_resetslrucounter() -
- *
- *    Reset some SLRU statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len)
-{
-    int            i;
-    TimestampTz ts = GetCurrentTimestamp();
-
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /* reset entry with the given index, or all entries (index is -1) */
-        if ((msg->m_index == -1) || (msg->m_index == i))
-        {
-            memset(&slruStats[i], 0, sizeof(slruStats[i]));
-            slruStats[i].stat_reset_timestamp = ts;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_resetreplslotcounter() -
- *
- *    Reset some replication slot statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg,
-                                 int len)
-{
-    int            i;
-    int            idx = -1;
-    TimestampTz ts;
-
-    ts = GetCurrentTimestamp();
-    if (msg->clearall)
-    {
-        for (i = 0; i < nReplSlotStats; i++)
-            pgstat_reset_replslot(i, ts);
-    }
-    else
-    {
-        /* Get the index of replication slot statistics to reset */
-        idx = pgstat_replslot_index(msg->m_slotname, false);
-
-        /*
-         * Nothing to do if the given slot entry is not found.  This could
-         * happen when the slot with the given name is removed and the
-         * corresponding statistics entry is also removed before receiving the
-         * reset message.
-         */
-        if (idx < 0)
-            return;
-
-        /* Reset the stats for the requested replication slot */
-        pgstat_reset_replslot(idx, ts);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signaling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * It is quite possible that a non-aggressive VACUUM ended up skipping
-     * various pages, however, we'll zero the insert counter here regardless.
-     * It's currently used only to track when we need to perform an "insert"
-     * autovacuum, which are mainly intended to freeze newly inserted tuples.
-     * Zeroing this may just mean we'll not try to vacuum the table again
-     * until enough tuples have been inserted to trigger another insert
-     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
-     * stragglers.
-     */
-    tabentry->inserts_since_vacuum = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_wal() -
- *
- *    Process a WAL message.
- * ----------
- */
-static void
-pgstat_recv_wal(PgStat_MsgWal *msg, int len)
-{
-    walStats.wal_records += msg->m_wal_records;
-    walStats.wal_fpi += msg->m_wal_fpi;
-    walStats.wal_bytes += msg->m_wal_bytes;
-    walStats.wal_buffers_full += msg->m_wal_buffers_full;
-}
-
-/* ----------
- * pgstat_recv_slru() -
- *
- *    Process a SLRU message.
- * ----------
- */
-static void
-pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
-{
-    slruStats[msg->m_index].blocks_zeroed += msg->m_blocks_zeroed;
-    slruStats[msg->m_index].blocks_hit += msg->m_blocks_hit;
-    slruStats[msg->m_index].blocks_read += msg->m_blocks_read;
-    slruStats[msg->m_index].blocks_written += msg->m_blocks_written;
-    slruStats[msg->m_index].blocks_exists += msg->m_blocks_exists;
-    slruStats[msg->m_index].flush += msg->m_flush;
-    slruStats[msg->m_index].truncate += msg->m_truncate;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_replslot() -
- *
- *    Process a REPLSLOT message.
- * ----------
- */
-static void
-pgstat_recv_replslot(PgStat_MsgReplSlot *msg, int len)
-{
-    int            idx;
-
-    /*
-     * Get the index of replication slot statistics.  On dropping, we don't
-     * create the new statistics.
-     */
-    idx = pgstat_replslot_index(msg->m_slotname, !msg->m_drop);
-
-    /*
-     * The slot entry is not found or there is no space to accommodate the new
-     * entry.  This could happen when the message for the creation of a slot
-     * reached before the drop message even though the actual operations
-     * happen in reverse order.  In such a case, the next update of the
-     * statistics for the same slot will create the required entry.
-     */
-    if (idx < 0)
-        return;
-
-    /* it must be a valid replication slot index */
-    Assert(idx < nReplSlotStats);
-
-    if (msg->m_drop)
-    {
-        /* Remove the replication slot statistics with the given name */
-        if (idx < nReplSlotStats - 1)
-            memcpy(&replSlotStats[idx], &replSlotStats[nReplSlotStats - 1],
-                   sizeof(PgStat_ReplSlotStats));
-        nReplSlotStats--;
-    }
-    else
-    {
-        /* Update the replication slot statistics */
-        replSlotStats[idx].spill_txns += msg->m_spill_txns;
-        replSlotStats[idx].spill_count += msg->m_spill_count;
-        replSlotStats[idx].spill_bytes += msg->m_spill_bytes;
-        replSlotStats[idx].stream_txns += msg->m_stream_txns;
-        replSlotStats[idx].stream_count += msg->m_stream_count;
-        replSlotStats[idx].stream_bytes += msg->m_stream_bytes;
-    }
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
+    }
+
+    /* Invalidate the simple cache keys */
+    cached_dbent_key = stathashkey_zero;
+    cached_tabent_key = stathashkey_zero;
+    cached_funcent_key = stathashkey_zero;
+    cached_archiverstats_is_valid = false;
+    cached_bgwriterstats_is_valid = false;
+    cached_checkpointerstats_is_valid = false;
+    cached_walstats_is_valid = false;
+    cached_slrustats_is_valid = false;
+    n_cached_replslotstats = -1;
 }
 
 /*
@@ -7120,60 +5936,6 @@ pgstat_clip_activity(const char *raw_activity)
     return activity;
 }
 
-/* ----------
- * pgstat_replslot_index
- *
- * Return the index of entry of a replication slot with the given name, or
- * -1 if the slot is not found.
- *
- * create_it tells whether to create the new slot entry if it is not found.
- * ----------
- */
-static int
-pgstat_replslot_index(const char *name, bool create_it)
-{
-    int            i;
-
-    Assert(nReplSlotStats <= max_replication_slots);
-    for (i = 0; i < nReplSlotStats; i++)
-    {
-        if (strcmp(replSlotStats[i].slotname, name) == 0)
-            return i;            /* found */
-    }
-
-    /*
-     * The slot is not found.  We don't want to register the new statistics if
-     * the list is already full or the caller didn't request.
-     */
-    if (i == max_replication_slots || !create_it)
-        return -1;
-
-    /* Register new slot */
-    memset(&replSlotStats[nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
-    strlcpy(replSlotStats[nReplSlotStats].slotname, name, NAMEDATALEN);
-
-    return nReplSlotStats++;
-}
-
-/* ----------
- * pgstat_reset_replslot
- *
- * Reset the replication slot stats at index 'i'.
- * ----------
- */
-static void
-pgstat_reset_replslot(int i, TimestampTz ts)
-{
-    /* reset only counters. Don't clear slot name */
-    replSlotStats[i].spill_txns = 0;
-    replSlotStats[i].spill_count = 0;
-    replSlotStats[i].spill_bytes = 0;
-    replSlotStats[i].stream_txns = 0;
-    replSlotStats[i].stream_count = 0;
-    replSlotStats[i].stream_bytes = 0;
-    replSlotStats[i].stat_reset_timestamp = ts;
-}
-
 /*
  * pgstat_slru_index
  *
@@ -7218,7 +5980,7 @@ pgstat_slru_name(int slru_idx)
  * Returns pointer to entry with counters for given SLRU (based on the name
  * stored in SlruCtl as lwlock tranche name).
  */
-static inline PgStat_MsgSLRU *
+static PgStat_SLRUStats *
 slru_entry(int slru_idx)
 {
     /*
@@ -7229,7 +5991,7 @@ slru_entry(int slru_idx)
 
     Assert((slru_idx >= 0) && (slru_idx < SLRU_NUM_ELEMENTS));
 
-    return &SLRUStats[slru_idx];
+    return &local_SLRUStats[slru_idx];
 }
 
 /*
@@ -7239,41 +6001,48 @@ slru_entry(int slru_idx)
 void
 pgstat_count_slru_page_zeroed(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_zeroed += 1;
+    slru_entry(slru_idx)->blocks_zeroed += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_hit(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_hit += 1;
+    slru_entry(slru_idx)->blocks_hit += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_exists(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_exists += 1;
+    slru_entry(slru_idx)->blocks_exists += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_read(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_read += 1;
+    slru_entry(slru_idx)->blocks_read += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_written(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_written += 1;
+    slru_entry(slru_idx)->blocks_written += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_flush(int slru_idx)
 {
-    slru_entry(slru_idx)->m_flush += 1;
+    slru_entry(slru_idx)->flush += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_truncate(int slru_idx)
 {
-    slru_entry(slru_idx)->m_truncate += 1;
+    slru_entry(slru_idx)->truncate += 1;
+    have_slrustats = true;
 }
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 8a4706a870..65801817e7 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -250,7 +250,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -511,7 +510,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1326,12 +1324,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1781,11 +1773,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2715,8 +2702,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -3043,8 +3028,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3111,13 +3094,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3190,22 +3166,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3666,22 +3626,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3901,8 +3845,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3926,8 +3868,7 @@ PostmasterStateMachine(void)
     {
         /*
          * PM_WAIT_DEAD_END state ends when the BackendList is entirely empty
-         * (ie, no dead_end children remain), and the archiver and stats
-         * collector are gone too.
+         * (ie, no dead_end children remain), and the archiver is gone too.
          *
          * The reason we wait for those two is to protect them against a new
          * postmaster starting conflicting subprocesses; this isn't an
@@ -3937,8 +3878,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4139,8 +4079,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5085,18 +5023,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5215,12 +5141,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6129,7 +6049,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6185,8 +6104,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6419,7 +6336,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 1d8d1742a7..4e5d63b30e 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1568,8 +1568,8 @@ is_checksummed_file(const char *fullpath, const char *filename)
  *
  * If 'missing_ok' is true, will not throw an error if the file is not found.
  *
- * If dboid is anything other than InvalidOid then any checksum failures detected
- * will get reported to the stats collector.
+ * If dboid is anything other than InvalidOid then any checksum failures
+ * detected will get reported to the activity stats facility.
  *
  * Returns true if the file was successfully sent, false if 'missing_ok',
  * and the file did not exist.
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 9c7cf13d4d..0d978ba2f2 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -692,14 +692,10 @@ ReplicationSlotDropPtr(ReplicationSlot *slot)
                 (errmsg("could not remove directory \"%s\"", tmppath)));
 
     /*
-     * Send a message to drop the replication slot to the stats collector.
-     * Since there is no guarantee of the order of message transfer on a UDP
-     * connection, it's possible that a message for creating a new slot
-     * reaches before a message for removing the old slot. We send the drop
-     * and create messages while holding ReplicationSlotAllocationLock to
-     * reduce that possibility. If the messages reached in reverse, we would
-     * lose one statistics update message. But the next update message will
-     * create the statistics for the replication slot.
+     * Drop the statistics entry for the replication slot.  Do this while
+     * holding ReplicationSlotAllocationLock so that we don't drop a statistics
+     * entry for another slot with the same name just created in another
+     * session.
      */
     if (SlotIsLogical(slot))
         pgstat_report_replslot_drop(NameStr(slot->data.name));
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index c5e8707151..2e3504ac69 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2047,7 +2047,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                CheckPointerStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2157,7 +2157,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2347,7 +2347,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2355,7 +2355,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
 
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 96c2aaabbd..55693cfa3a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -149,6 +149,7 @@ CreateSharedMemoryAndSemaphores(void)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -267,6 +268,7 @@ CreateSharedMemoryAndSemaphores(void)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 26bcce9735..cc515d433f 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -177,7 +177,9 @@ static const char *const BuiltinTrancheNames[] = {
     /* LWTRANCHE_PARALLEL_APPEND: */
     "ParallelAppend",
     /* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
-    "PerXactPredicateList"
+    "PerXactPredicateList",
+    /* LWTRANCHE_STATS: */
+    "ActivityStatistics"
 };
 
 StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index 774292fd94..91bf8d5b5d 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -53,3 +53,4 @@ XactTruncationLock                    44
 # 45 was XactTruncationLock until removal of BackendRandomLock
 WrapLimitsVacuumLock                46
 NotifyQueueTailLock                    47
+StatsLock                            48
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 072bdd118f..4493c863b6 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -414,8 +414,8 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
     DropRelFileNodesAllBuffers(rnodes, nrels);
 
     /*
-     * It'd be nice to tell the stats collector to forget them immediately,
-     * too. But we can't because we don't know the OIDs.
+     * It'd be nice to tell the activity stats facility to forget them
+     * immediately, too. But we can't because we don't know the OIDs.
      */
 
     /*
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 3679799e50..891fe67a41 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3214,6 +3214,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3784,6 +3790,7 @@ PostgresMain(int argc, char *argv[],
     sigjmp_buf    local_sigjmp_buf;
     volatile bool send_ready_for_query = true;
     bool        disable_idle_in_transaction_timeout = false;
+    bool        disable_idle_stats_update_timeout = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4180,11 +4187,12 @@ PostgresMain(int argc, char *argv[],
          * Note: this includes fflush()'ing the last of the prior output.
          *
          * This is also a good time to send collected statistics to the
-         * collector, and to update the PS stats display.  We avoid doing
-         * those every time through the message loop because it'd slow down
-         * processing of batched messages, and because we don't want to report
-         * uncommitted updates (that confuses autovacuum).  The notification
-         * processor wants a call too, if we are not in a transaction block.
+         * activity stats facility, and to update the PS stats display.  We
+         * avoid doing those every time through the message loop because it'd
+         * slow down processing of batched messages, and because we don't want
+         * to report uncommitted updates (that confuses autovacuum).  The
+         * notification processor wants a call too, if we are not in a
+         * transaction block.
          */
         if (send_ready_for_query)
         {
@@ -4216,6 +4224,8 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long stats_timeout;
+
                 /* Send out notify signals and transmit self-notifies */
                 ProcessCompletedNotifies();
 
@@ -4228,8 +4238,13 @@ PostgresMain(int argc, char *argv[],
                 if (notifyInterruptPending)
                     ProcessNotifyInterrupt();
 
-                pgstat_report_stat(false);
-
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    disable_idle_stats_update_timeout = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle");
                 pgstat_report_activity(STATE_IDLE, NULL);
             }
@@ -4267,7 +4282,7 @@ PostgresMain(int argc, char *argv[],
         DoingCommandRead = false;
 
         /*
-         * (5) turn off the idle-in-transaction timeout
+         * (5) turn off the idle-in-transaction timeout and stats update timeout
          */
         if (disable_idle_in_transaction_timeout)
         {
@@ -4275,6 +4290,12 @@ PostgresMain(int argc, char *argv[],
             disable_idle_in_transaction_timeout = false;
         }
 
+        if (disable_idle_stats_update_timeout)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            disable_idle_stats_update_timeout = false;
+        }
+
         /*
          * (6) check for any other interesting events that happened while we
          * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 6afe1b6f56..658f4d432e 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * pgstatfuncs.c
- *      Functions for accessing the statistics collector data
+ *      Functions for accessing the activity statistics data
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -35,9 +35,6 @@
 
 #define HAS_PGSTAT_PERMISSIONS(role)     (is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS) || has_privs_of_role(GetUserId(),role))
 
 
-/* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
-
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
 {
@@ -1267,7 +1264,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_commit);
+        result = (int64) (dbentry->counts.n_xact_commit);
 
     PG_RETURN_INT64(result);
 }
@@ -1283,7 +1280,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_rollback);
+        result = (int64) (dbentry->counts.n_xact_rollback);
 
     PG_RETURN_INT64(result);
 }
@@ -1299,7 +1296,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_fetched);
+        result = (int64) (dbentry->counts.n_blocks_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1315,7 +1312,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_hit);
+        result = (int64) (dbentry->counts.n_blocks_hit);
 
     PG_RETURN_INT64(result);
 }
@@ -1331,7 +1328,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_returned);
+        result = (int64) (dbentry->counts.n_tuples_returned);
 
     PG_RETURN_INT64(result);
 }
@@ -1347,7 +1344,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_fetched);
+        result = (int64) (dbentry->counts.n_tuples_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1363,7 +1360,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_inserted);
+        result = (int64) (dbentry->counts.n_tuples_inserted);
 
     PG_RETURN_INT64(result);
 }
@@ -1379,7 +1376,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_updated);
+        result = (int64) (dbentry->counts.n_tuples_updated);
 
     PG_RETURN_INT64(result);
 }
@@ -1395,7 +1392,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_deleted);
+        result = (int64) (dbentry->counts.n_tuples_deleted);
 
     PG_RETURN_INT64(result);
 }
@@ -1428,7 +1425,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_files;
+        result = dbentry->counts.n_temp_files;
 
     PG_RETURN_INT64(result);
 }
@@ -1444,7 +1441,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_bytes;
+        result = dbentry->counts.n_temp_bytes;
 
     PG_RETURN_INT64(result);
 }
@@ -1459,7 +1456,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace);
+        result = (int64) (dbentry->counts.n_conflict_tablespace);
 
     PG_RETURN_INT64(result);
 }
@@ -1474,7 +1471,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_lock);
+        result = (int64) (dbentry->counts.n_conflict_lock);
 
     PG_RETURN_INT64(result);
 }
@@ -1489,7 +1486,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_snapshot);
+        result = (int64) (dbentry->counts.n_conflict_snapshot);
 
     PG_RETURN_INT64(result);
 }
@@ -1504,7 +1501,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_bufferpin);
+        result = (int64) (dbentry->counts.n_conflict_bufferpin);
 
     PG_RETURN_INT64(result);
 }
@@ -1519,7 +1516,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1534,11 +1531,11 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace +
-                          dbentry->n_conflict_lock +
-                          dbentry->n_conflict_snapshot +
-                          dbentry->n_conflict_bufferpin +
-                          dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_tablespace +
+                          dbentry->counts.n_conflict_lock +
+                          dbentry->counts.n_conflict_snapshot +
+                          dbentry->counts.n_conflict_bufferpin +
+                          dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1553,7 +1550,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_deadlocks);
+        result = (int64) (dbentry->counts.n_deadlocks);
 
     PG_RETURN_INT64(result);
 }
@@ -1571,7 +1568,7 @@ pg_stat_get_db_checksum_failures(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_checksum_failures);
+        result = (int64) (dbentry->counts.n_checksum_failures);
 
     PG_RETURN_INT64(result);
 }
@@ -1608,7 +1605,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_read_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_read_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1624,7 +1621,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_write_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_write_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1632,69 +1629,71 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_bgwriter_timed_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->timed_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->timed_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_requested_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->requested_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->requested_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_clean);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_written_clean);
 }
 
 Datum
 pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->maxwritten_clean);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->maxwritten_clean);
 }
 
 Datum
 pg_stat_get_checkpoint_write_time(PG_FUNCTION_ARGS)
 {
     /* time is already in msec, just convert to double for presentation */
-    PG_RETURN_FLOAT8((double) pgstat_fetch_global()->checkpoint_write_time);
+    PG_RETURN_FLOAT8((double)
+                     pgstat_fetch_stat_checkpointer()->checkpoint_write_time);
 }
 
 Datum
 pg_stat_get_checkpoint_sync_time(PG_FUNCTION_ARGS)
 {
     /* time is already in msec, just convert to double for presentation */
-    PG_RETURN_FLOAT8((double) pgstat_fetch_global()->checkpoint_sync_time);
+    PG_RETURN_FLOAT8((double)
+                     pgstat_fetch_stat_checkpointer()->checkpoint_sync_time);
 }
 
 Datum
 pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_global()->stat_reset_timestamp);
+    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_stat_bgwriter()->stat_reset_timestamp);
 }
 
 Datum
 pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_backend);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_backend);
 }
 
 Datum
 pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_fsync_backend);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_fsync_backend);
 }
 
 Datum
 pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_alloc);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
 }
 
 /*
@@ -1708,7 +1707,7 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
     Datum        values[PG_STAT_GET_WAL_COLS];
     bool        nulls[PG_STAT_GET_WAL_COLS];
     char        buf[256];
-    PgStat_WalStats *wal_stats;
+    PgStat_Wal *wal_stats;
 
     /* Initialise values and NULL flags arrays */
     MemSet(values, 0, sizeof(values));
@@ -1733,11 +1732,11 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
     wal_stats = pgstat_fetch_stat_wal();
 
     /* Fill values and NULLs */
-    values[0] = Int64GetDatum(wal_stats->wal_records);
-    values[1] = Int64GetDatum(wal_stats->wal_fpi);
+    values[0] = Int64GetDatum(wal_stats->wal_usage.wal_records);
+    values[1] = Int64GetDatum(wal_stats->wal_usage.wal_fpi);
 
     /* Convert to numeric. */
-    snprintf(buf, sizeof buf, UINT64_FORMAT, wal_stats->wal_bytes);
+    snprintf(buf, sizeof buf, UINT64_FORMAT, wal_stats->wal_usage.wal_bytes);
     values[2] = DirectFunctionCall3(numeric_in,
                                     CStringGetDatum(buf),
                                     ObjectIdGetDatum(0),
@@ -2018,7 +2017,7 @@ pg_stat_get_xact_function_self_time(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_snapshot_timestamp(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_global()->stats_timestamp);
+    PG_RETURN_TIMESTAMPTZ(pgstat_get_stat_timestamp());
 }
 
 /* Discard the active statistics snapshot */
@@ -2106,7 +2105,7 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     TupleDesc    tupdesc;
     Datum        values[7];
     bool        nulls[7];
-    PgStat_ArchiverStats *archiver_stats;
+    PgStat_Archiver *archiver_stats;
 
     /* Initialise values and NULL flags arrays */
     MemSet(values, 0, sizeof(values));
@@ -2176,7 +2175,7 @@ pg_stat_get_replication_slots(PG_FUNCTION_ARGS)
     Tuplestorestate *tupstore;
     MemoryContext per_query_ctx;
     MemoryContext oldcontext;
-    PgStat_ReplSlotStats *slotstats;
+    PgStat_ReplSlot *slotstats;
     int            nstats;
     int            i;
 
@@ -2209,7 +2208,7 @@ pg_stat_get_replication_slots(PG_FUNCTION_ARGS)
     {
         Datum        values[PG_STAT_GET_REPLICATION_SLOT_COLS];
         bool        nulls[PG_STAT_GET_REPLICATION_SLOT_COLS];
-        PgStat_ReplSlotStats *s = &(slotstats[i]);
+        PgStat_ReplSlot *s = &(slotstats[i]);
 
         MemSet(values, 0, sizeof(values));
         MemSet(nulls, 0, sizeof(nulls));
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 3bd5e18042..b647bae7db 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -72,6 +72,7 @@
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/optimizer.h"
+#include "pgstat.h"
 #include "rewrite/rewriteDefine.h"
 #include "rewrite/rowsecurity.h"
 #include "storage/lmgr.h"
@@ -2353,6 +2354,10 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
      */
     RelationCloseSmgr(relation);
 
+    /* break mutual link with stats entry */
+    pgstat_delinkstats(relation);
+
+    if (relation->rd_rel)
     /*
      * Free all the subsidiary data structures of the relcache entry, then the
      * entry itself.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 6ab8216839..883c7f8802 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index ed2ab4b5b2..74fb22f216 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -269,9 +269,6 @@ GetBackendTypeDesc(BackendType backendType)
         case B_ARCHIVER:
             backendDesc = "archiver";
             break;
-        case B_STATS_COLLECTOR:
-            backendDesc = "stats collector";
-            break;
         case B_LOGGER:
             backendDesc = "logger";
             break;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 82d451569d..eb41aec4d5 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -72,6 +72,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -608,6 +609,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1222,6 +1225,14 @@ IdleInTransactionSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
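
Note on the idle stats update timeout: the hunks above in postgres.c, globals.c and postinit.c appear to follow the usual deferred-interrupt pattern. The timeout handler only sets flags, and the real flush happens later at a safe point in ProcessInterrupts(). The following is a minimal standalone sketch of that flow, not the patch's code. The names report_stat, pending_updates, flushed_updates, shared_lock_busy and RETRY_INTERVAL_MS are hypothetical stand-ins, and the meaning assumed here for the return value (0 when nothing is pending, otherwise a retry delay in milliseconds) and for the boolean argument (force the flush) is an assumption inferred from how the caller uses pgstat_report_stat().

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Flags set by the timeout handler and examined later by the main loop. */
static volatile sig_atomic_t idle_stats_update_pending = false;
static volatile sig_atomic_t interrupt_pending = false;

/* Hypothetical stand-ins for the real stats machinery. */
static int  pending_updates = 0;       /* locally buffered counter updates */
static int  flushed_updates = 0;       /* updates applied to "shared memory" */
static bool shared_lock_busy = false;  /* simulates lock contention */

#define RETRY_INTERVAL_MS 1000L        /* assumed retry delay */

/*
 * Try to flush local stats.  Returns 0 when nothing is left pending,
 * otherwise the delay in ms after which the caller should retry.
 * With force = true the flush happens even under contention.
 */
static long
report_stat(bool force)
{
    if (pending_updates == 0)
        return 0;
    if (shared_lock_busy && !force)
        return RETRY_INTERVAL_MS;

    flushed_updates += pending_updates;
    pending_updates = 0;
    return 0;
}

/* Timeout handler: sets flags only, the real work is deferred. */
static void
idle_stats_update_timeout_handler(void)
{
    idle_stats_update_pending = true;
    interrupt_pending = true;
}

/* Called from the main loop at a point where flushing is safe. */
static void
process_interrupts(void)
{
    interrupt_pending = false;

    if (idle_stats_update_pending)
    {
        idle_stats_update_pending = false;
        (void) report_stat(true);
        assert(pending_updates == 0);  /* a forced flush leaves nothing behind */
    }
}
```

In this sketch, a caller that gets a positive value back from report_stat(false) would arm a timer for that many milliseconds and cancel it again when the next command arrives. That mirrors how the hunk in PostgresMain() pairs enable_timeout_after() with disable_timeout().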
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index dabcbb0736..6dbb61a99d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -743,8 +743,8 @@ const char *const config_group_names[] =
     gettext_noop("Statistics"),
     /* STATS_MONITORING */
     gettext_noop("Statistics / Monitoring"),
-    /* STATS_COLLECTOR */
-    gettext_noop("Statistics / Query and Index Statistics Collector"),
+    /* STATS_ACTIVITY */
+    gettext_noop("Statistics / Query and Index Activity Statistics"),
     /* AUTOVACUUM */
     gettext_noop("Autovacuum"),
     /* CLIENT_CONN */
@@ -1454,7 +1454,7 @@ static struct config_bool ConfigureNamesBool[] =
 #endif
 
     {
-        {"track_activities", PGC_SUSET, STATS_COLLECTOR,
+        {"track_activities", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects information about executing commands."),
             gettext_noop("Enables the collection of information on the currently "
                          "executing command of each session, along with "
@@ -1465,7 +1465,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_counts", PGC_SUSET, STATS_COLLECTOR,
+        {"track_counts", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects statistics on database activity."),
             NULL
         },
@@ -1474,7 +1474,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_io_timing", PGC_SUSET, STATS_COLLECTOR,
+        {"track_io_timing", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects timing statistics for database I/O activity."),
             NULL
         },
@@ -4300,7 +4300,7 @@ static struct config_string ConfigureNamesString[] =
     },
 
     {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
+        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
             gettext_noop("Writes temporary statistics files to the specified directory."),
             NULL,
             GUC_SUPERUSER_ONLY
@@ -4636,7 +4636,7 @@ static struct config_enum ConfigureNamesEnum[] =
     },
 
     {
-        {"track_functions", PGC_SUSET, STATS_COLLECTOR,
+        {"track_functions", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects function-level statistics on database activity."),
             NULL
         },
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index b7fb2ec1fe..5b16c09ccc 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -578,7 +578,7 @@
 # STATISTICS
 #------------------------------------------------------------------------------
 
-# - Query and Index Statistics Collector -
+# - Query and Index Activity Statistics -
 
 #track_activities = on
 #track_counts = on
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index f674a7c94e..c61b828bf3 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -6,7 +6,7 @@ use File::Basename qw(basename dirname);
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 109;
+use Test::More tests => 108;
 
 program_help_ok('pg_basebackup');
 program_version_ok('pg_basebackup');
@@ -124,7 +124,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index bba002ad24..b70c495f2a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -83,6 +83,8 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
 
@@ -321,7 +323,6 @@ typedef enum BackendType
     B_WAL_SENDER,
     B_WAL_WRITER,
     B_ARCHIVER,
-    B_STATS_COLLECTOR,
     B_LOGGER,
 } BackendType;
 
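(Reviewer's aside, not part of the patch.) The miscadmin.h hunk above adds two more "pending" flags declared as volatile sig_atomic_t. The usual pattern behind such flags is that a signal or timeout handler only sets the flag, and the main loop does the real work later at a safe point. The sketch below illustrates that pattern in plain C under stated assumptions: every demo_* name is hypothetical, SIGINT merely stands in for whatever event the real code uses, and nothing here is PostgreSQL code.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Hypothetical stand-in for a flag such as IdleStatsUpdateTimeoutPending. */
static volatile sig_atomic_t demo_stats_update_pending = 0;

/* Work counter; only ever touched outside the signal handler. */
static int demo_flush_count = 0;

/* Handler: do nothing except record that work is pending. */
static void
demo_timeout_handler(int signo)
{
    (void) signo;
    demo_stats_update_pending = 1;
}

/*
 * Called from the main loop at a safe point.  Clears the flag first and
 * then performs the deferred work.  Returns true if any work was done.
 */
static bool
demo_process_pending(void)
{
    if (!demo_stats_update_pending)
        return false;

    demo_stats_update_pending = 0;
    demo_flush_count++;         /* placeholder for the real stats flush */
    return true;
}

/*
 * Install the handler, deliver one signal synchronously with raise(), and
 * process it.  Returns the number of flushes done so far, or -1 on error.
 */
static int
demo_run_once(void)
{
    if (signal(SIGINT, demo_timeout_handler) == SIG_ERR)
        return -1;

    raise(SIGINT);              /* handler has run by the time this returns */
    assert(demo_stats_update_pending == 1);

    demo_process_pending();
    return demo_flush_count;
}
```

Calling demo_run_once() performs exactly one deferred flush; a subsequent demo_process_pending() finds nothing to do because the flag was cleared before the work started.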
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5954068dec..9bba4785d0 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL activity statistics facility.
  *
  *    Copyright (c) 2001-2020, PostgreSQL Global Development Group
  *
@@ -12,12 +12,15 @@
 #define PGSTAT_H
 
 #include "datatype/timestamp.h"
+#include "executor/instrument.h"
 #include "libpq/pqcomm.h"
 #include "miscadmin.h"
 #include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -27,8 +30,8 @@
  * ----------
  */
 #define PGSTAT_STAT_PERMANENT_DIRECTORY        "pg_stat"
-#define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
-#define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
+#define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/saved_stats"
+#define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/saved_stats.tmp"
 
 /* Default directory to store temporary statistics data in */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
@@ -41,38 +44,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_RESETSLRUCOUNTER,
-    PGSTAT_MTYPE_RESETREPLSLOTCOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_WAL,
-    PGSTAT_MTYPE_SLRU,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE,
-    PGSTAT_MTYPE_REPLSLOT,
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -83,9 +54,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -159,10 +129,13 @@ typedef enum PgStat_Single_Reset_Type
  */
 typedef struct PgStat_TableStatus
 {
-    Oid            t_id;            /* table's OID */
-    bool        t_shared;        /* is it a shared catalog? */
     struct PgStat_TableXactStatus *trans;    /* lowest subxact's counts */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_TableCounts t_counts;    /* event counts to be sent */
+    Relation    relation;            /* rel that is using this entry */
 } PgStat_TableStatus;
 
 /* ----------
@@ -186,353 +159,57 @@ typedef struct PgStat_TableXactStatus
     struct PgStat_TableXactStatus *next;    /* next of same subxact */
 } PgStat_TableXactStatus;
 
-
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
-/* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
+/*
+ * Archiver statistics kept in the shared stats
  */
-typedef struct PgStat_MsgHdr
+typedef struct PgStat_Archiver
 {
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
+    PgStat_Counter archived_count;    /* archival successes */
+    char        last_archived_wal[MAX_XFN_CHARS + 1];    /* last WAL file
+                                                         * archived */
+    TimestampTz last_archived_timestamp;    /* last archival success time */
+    PgStat_Counter failed_count;    /* failed archival attempts */
+    char        last_failed_wal[MAX_XFN_CHARS + 1]; /* WAL file involved in
+                                                     * last failure */
+    TimestampTz last_failed_timestamp;    /* last archival failure time */
+    TimestampTz stat_reset_timestamp;
+} PgStat_Archiver;
 
 /* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgDummy
+typedef struct PgStat_BgWriter
 {
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_alloc;
+    TimestampTz       stat_reset_timestamp;
+} PgStat_BgWriter;
 
 /* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
+ * PgStat_CheckPointer        checkpointer statistics
  * ----------
  */
-
-typedef struct PgStat_MsgInquiry
+typedef struct PgStat_CheckPointer
 {
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgResetslrucounter Sent by the backend to tell the collector
- *                                to reset a SLRU counter
- * ----------
- */
-typedef struct PgStat_MsgResetslrucounter
-{
-    PgStat_MsgHdr m_hdr;
-    int            m_index;
-} PgStat_MsgResetslrucounter;
-
-/* ----------
- * PgStat_MsgResetreplslotcounter Sent by the backend to tell the collector
- *                                to reset replication slot counter(s)
- * ----------
- */
-typedef struct PgStat_MsgResetreplslotcounter
-{
-    PgStat_MsgHdr m_hdr;
-    char        m_slotname[NAMEDATALEN];
-    bool        clearall;
-} PgStat_MsgResetreplslotcounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgWal            Sent by backends and background processes to update WAL statistics.
- * ----------
- */
-typedef struct PgStat_MsgWal
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Counter m_wal_records;
-    PgStat_Counter m_wal_fpi;
-    uint64        m_wal_bytes;
-    PgStat_Counter m_wal_buffers_full;
-} PgStat_MsgWal;
-
-/* ----------
- * PgStat_MsgSLRU            Sent by a backend to update SLRU statistics.
- * ----------
- */
-typedef struct PgStat_MsgSLRU
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Counter m_index;
-    PgStat_Counter m_blocks_zeroed;
-    PgStat_Counter m_blocks_hit;
-    PgStat_Counter m_blocks_read;
-    PgStat_Counter m_blocks_written;
-    PgStat_Counter m_blocks_exists;
-    PgStat_Counter m_flush;
-    PgStat_Counter m_truncate;
-} PgStat_MsgSLRU;
-
-/* ----------
- * PgStat_MsgReplSlot    Sent by a backend or a wal sender to update replication
- *                        slot statistics.
- * ----------
- */
-typedef struct PgStat_MsgReplSlot
-{
-    PgStat_MsgHdr m_hdr;
-    char        m_slotname[NAMEDATALEN];
-    bool        m_drop;
-    PgStat_Counter m_spill_txns;
-    PgStat_Counter m_spill_count;
-    PgStat_Counter m_spill_bytes;
-    PgStat_Counter m_stream_txns;
-    PgStat_Counter m_stream_count;
-    PgStat_Counter m_stream_bytes;
-} PgStat_MsgReplSlot;
-
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+} PgStat_CheckPointer;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -548,7 +225,6 @@ typedef struct PgStat_FunctionCounts
  */
 typedef struct PgStat_BackendFunctionEntry
 {
-    Oid            f_id;
     PgStat_FunctionCounts f_counts;
 } PgStat_BackendFunctionEntry;
 
@@ -564,101 +240,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgResetslrucounter msg_resetslrucounter;
-    PgStat_MsgResetreplslotcounter msg_resetreplslotcounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgWal msg_wal;
-    PgStat_MsgSLRU msg_slru;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-    PgStat_MsgReplSlot msg_replslot;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Activity statistics data structures on file and in shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -667,13 +250,9 @@ typedef union PgStat_Msg
 
 #define PGSTAT_FILE_FORMAT_ID    0x01A5BC9F
 
-/* ----------
- * PgStat_StatDBEntry            The collector's data per database
- * ----------
- */
-typedef struct PgStat_StatDBEntry
+
+typedef struct PgStat_StatDBCounts
 {
-    Oid            databaseid;
     PgStat_Counter n_xact_commit;
     PgStat_Counter n_xact_rollback;
     PgStat_Counter n_blocks_fetched;
@@ -683,7 +262,6 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_tuples_inserted;
     PgStat_Counter n_tuples_updated;
     PgStat_Counter n_tuples_deleted;
-    TimestampTz last_autovac_time;
     PgStat_Counter n_conflict_tablespace;
     PgStat_Counter n_conflict_lock;
     PgStat_Counter n_conflict_snapshot;
@@ -693,29 +271,86 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_temp_bytes;
     PgStat_Counter n_deadlocks;
     PgStat_Counter n_checksum_failures;
-    TimestampTz last_checksum_failure;
     PgStat_Counter n_block_read_time;    /* times in microseconds */
     PgStat_Counter n_block_write_time;
+} PgStat_StatDBCounts;
 
+/* ----------
+ * PgStat_StatEntryHeader        common header struct for PgStat_Stat*Entry
+ * ----------
+ */
+typedef struct PgStat_StatEntryHeader
+{
+    LWLock        lock;
+    bool        dropped;            /* This entry is being dropped and should
+                                     * be removed when refcount goes to
+                                     * zero. */
+    pg_atomic_uint32  refcount;        /* How many backends are referencing */
+} PgStat_StatEntryHeader;
+
+/* ----------
+ * PgStat_StatDBEntry            The statistics per database
+ * ----------
+ */
+typedef struct PgStat_StatDBEntry
+{
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    TimestampTz last_autovac_time;
+    TimestampTz last_checksum_failure;
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;    /* time of db stats update */
+
+    PgStat_StatDBCounts counts;
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following members must be last in the struct, because we don't
+     * write them out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables; /* current gen tables hash */
+    dshash_table_handle functions;    /* current gen functions hash */
 } PgStat_StatDBEntry;
 
+/* ----------
+ * PgStat_Wal   WAL statistics, updated by backends and background
+ *                processes.
+ * ----------
+ */
+typedef struct PgStat_Wal
+{
+    WalUsage       wal_usage;
+    PgStat_Counter wal_buffers_full;
+    TimestampTz stat_reset_timestamp;
+} PgStat_Wal;
+
+/*
+ * SLRU statistics kept in the shared stats
+ */
+typedef struct PgStat_SLRUStats
+{
+    PgStat_Counter blocks_zeroed;
+    PgStat_Counter blocks_hit;
+    PgStat_Counter blocks_read;
+    PgStat_Counter blocks_written;
+    PgStat_Counter blocks_exists;
+    PgStat_Counter flush;
+    PgStat_Counter truncate;
+    TimestampTz stat_reset_timestamp;
+} PgStat_SLRUStats;
 
 /* ----------
- * PgStat_StatTabEntry            The collector's data per table (or index)
+ * PgStat_StatTabEntry            The statistics per table (or index)
  * ----------
  */
 typedef struct PgStat_StatTabEntry
 {
-    Oid            tableid;
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    /* Persistent data follow */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
 
     PgStat_Counter numscans;
 
@@ -735,99 +370,35 @@ typedef struct PgStat_StatTabEntry
     PgStat_Counter blocks_fetched;
     PgStat_Counter blocks_hit;
 
-    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
     PgStat_Counter vacuum_count;
-    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_vacuum_count;
-    TimestampTz analyze_timestamp;    /* user initiated */
     PgStat_Counter analyze_count;
-    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_analyze_count;
 } PgStat_StatTabEntry;
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            per function stats data
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
 {
-    Oid            functionid;
-
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    /* Persistent data follow */
     PgStat_Counter f_numcalls;
 
     PgStat_Counter f_total_time;    /* times in microseconds */
     PgStat_Counter f_self_time;
 } PgStat_StatFuncEntry;
 
-
-/*
- * Archiver statistics kept in the stats collector
- */
-typedef struct PgStat_ArchiverStats
-{
-    PgStat_Counter archived_count;    /* archival successes */
-    char        last_archived_wal[MAX_XFN_CHARS + 1];    /* last WAL file
-                                                         * archived */
-    TimestampTz last_archived_timestamp;    /* last archival success time */
-    PgStat_Counter failed_count;    /* failed archival attempts */
-    char        last_failed_wal[MAX_XFN_CHARS + 1]; /* WAL file involved in
-                                                     * last failure */
-    TimestampTz last_failed_timestamp;    /* last archival failure time */
-    TimestampTz stat_reset_timestamp;
-} PgStat_ArchiverStats;
-
-/*
- * Global statistics kept in the stats collector
- */
-typedef struct PgStat_GlobalStats
-{
-    TimestampTz stats_timestamp;    /* time of stats file update */
-    PgStat_Counter timed_checkpoints;
-    PgStat_Counter requested_checkpoints;
-    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
-    PgStat_Counter checkpoint_sync_time;
-    PgStat_Counter buf_written_checkpoints;
-    PgStat_Counter buf_written_clean;
-    PgStat_Counter maxwritten_clean;
-    PgStat_Counter buf_written_backend;
-    PgStat_Counter buf_fsync_backend;
-    PgStat_Counter buf_alloc;
-    TimestampTz stat_reset_timestamp;
-} PgStat_GlobalStats;
-
-/*
- * WAL statistics kept in the stats collector
- */
-typedef struct PgStat_WalStats
-{
-    PgStat_Counter wal_records;
-    PgStat_Counter wal_fpi;
-    uint64        wal_bytes;
-    PgStat_Counter wal_buffers_full;
-    TimestampTz stat_reset_timestamp;
-} PgStat_WalStats;
-
-/*
- * SLRU statistics kept in the stats collector
- */
-typedef struct PgStat_SLRUStats
-{
-    PgStat_Counter blocks_zeroed;
-    PgStat_Counter blocks_hit;
-    PgStat_Counter blocks_read;
-    PgStat_Counter blocks_written;
-    PgStat_Counter blocks_exists;
-    PgStat_Counter flush;
-    PgStat_Counter truncate;
-    TimestampTz stat_reset_timestamp;
-} PgStat_SLRUStats;
-
 /*
  * Replication slot statistics kept in the stats collector
  */
-typedef struct PgStat_ReplSlotStats
+typedef struct PgStat_ReplSlot
 {
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
     char        slotname[NAMEDATALEN];
     PgStat_Counter spill_txns;
     PgStat_Counter spill_count;
@@ -836,7 +407,7 @@ typedef struct PgStat_ReplSlotStats
     PgStat_Counter stream_count;
     PgStat_Counter stream_bytes;
     TimestampTz stat_reset_timestamp;
-} PgStat_ReplSlotStats;
+} PgStat_ReplSlot;
 
 /* ----------
  * Backend states
@@ -885,7 +456,7 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
+    WAIT_EVENT_READING_STATS_FILE,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
     WAIT_EVENT_WAL_RECEIVER_MAIN,
@@ -1137,7 +708,7 @@ typedef struct PgBackendGSSStatus
  *
  * Each live backend maintains a PgBackendStatus struct in shared memory
  * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
+ * BackendId, but that is not critical.)  Note that the stats-writing code
  * has no involvement in, or even access to, these structs.
  *
  * Each auxiliary process also maintains a PgBackendStatus struct in shared
@@ -1334,18 +905,26 @@ extern PGDLLIMPORT bool pgstat_track_counts;
 extern PGDLLIMPORT int pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used; to be removed together with the GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
+
+/*
+ * CheckPointer statistics counters are updated directly by checkpointer and
+ * bufmgr
+ */
+extern PgStat_CheckPointer CheckPointerStats;
 
 /*
  * WAL statistics counter is updated by backends and background processes
  */
-extern PgStat_MsgWal WalStats;
+extern PgStat_Wal WalStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1360,33 +939,27 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
-
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 extern void pgstat_reset_slru_counter(const char *);
 extern void pgstat_reset_replslot_counter(const char *name);
 
+extern void pgstat_report_wal(void);
 extern void pgstat_report_autovac(Oid dboid);
 extern void pgstat_report_vacuum(Oid tableoid, bool shared,
                                  PgStat_Counter livetuples, PgStat_Counter deadtuples);
@@ -1426,6 +999,7 @@ extern PgStat_TableStatus *find_tabstat_entry(Oid rel_id);
 extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 
 extern void pgstat_initstats(Relation rel);
+extern void pgstat_delinkstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
 
@@ -1548,9 +1122,10 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                                       void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
-extern void pgstat_send_wal(void);
+extern void pgstat_report_archiver(const char *xlog, bool failed);
+extern void pgstat_report_bgwriter(void);
+extern void pgstat_report_checkpointer(void);
+extern void pgstat_report_wal(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1559,15 +1134,20 @@ extern void pgstat_send_wal(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_extended(bool shared,
+                                                                Oid relid);
+extern void pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
-extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
-extern PgStat_GlobalStats *pgstat_fetch_global(void);
-extern PgStat_WalStats *pgstat_fetch_stat_wal(void);
+extern TimestampTz pgstat_get_stat_timestamp(void);
+extern PgStat_Archiver *pgstat_fetch_stat_archiver(void);
+extern PgStat_BgWriter *pgstat_fetch_stat_bgwriter(void);
+extern PgStat_CheckPointer *pgstat_fetch_stat_checkpointer(void);
+extern PgStat_Wal *pgstat_fetch_stat_wal(void);
 extern PgStat_SLRUStats *pgstat_fetch_slru(void);
-extern PgStat_ReplSlotStats *pgstat_fetch_replslot(int *nslots_p);
+extern PgStat_ReplSlot *pgstat_fetch_replslot(int *nslots_p);
 
 extern void pgstat_count_slru_page_zeroed(int slru_idx);
 extern void pgstat_count_slru_page_hit(int slru_idx);
@@ -1578,5 +1158,6 @@ extern void pgstat_count_slru_flush(int slru_idx);
 extern void pgstat_count_slru_truncate(int slru_idx);
 extern const char *pgstat_slru_name(int slru_idx);
 extern int    pgstat_slru_index(const char *name);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index af9b41795d..621e074111 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -218,6 +218,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SHARED_TIDBITMAP,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_PER_XACT_PREDICATE_LIST,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 7f36e1146f..508c869d4c 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -88,7 +88,7 @@ enum config_group
     PROCESS_TITLE,
     STATS,
     STATS_MONITORING,
-    STATS_COLLECTOR,
+    STATS_ACTIVITY,
     AUTOVACUUM,
     CLIENT_CONN,
     CLIENT_CONN_STATEMENT,
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 83a15f6795..77d1572a99 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
     STANDBY_TIMEOUT,
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.27.0

From 4fc92610ca5a766cf5f571af23525d3515d452fb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:09 +0900
Subject: [PATCH v44 5/7] Doc part of shared-memory based stats collector.

---
 doc/src/sgml/catalogs.sgml          |   6 +-
 doc/src/sgml/config.sgml            |  34 ++++----
 doc/src/sgml/high-availability.sgml |  13 +--
 doc/src/sgml/monitoring.sgml        | 127 +++++++++++++---------------
 doc/src/sgml/ref/pg_dump.sgml       |   9 +-
 5 files changed, 90 insertions(+), 99 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index d988636046..b9e4eb1040 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -9234,9 +9234,9 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l
   <para>
    <xref linkend="view-table"/> lists the system views described here.
    More detailed documentation of each view follows below.
-   There are some additional views that provide access to the results of
-   the statistics collector; they are described in <xref
-   linkend="monitoring-stats-views-table"/>.
+   There are some additional views that provide access to the activity
+   statistics; they are described in
+   <xref linkend="monitoring-stats-views-table"/>.
   </para>
 
   <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 4b60382778..e6bf21b450 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7299,11 +7299,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
     <title>Run-time Statistics</title>
 
     <sect2 id="runtime-config-statistics-collector">
-     <title>Query and Index Statistics Collector</title>
+     <title>Query and Index Activity Statistics</title>
 
      <para>
-      These parameters control server-wide statistics collection features.
-      When statistics collection is enabled, the data that is produced can be
+      These parameters control server-wide activity statistics features.
+      When activity statistics are enabled, the data that is produced can be
       accessed via the <structname>pg_stat</structname> and
       <structname>pg_statio</structname> family of system views.
       Refer to <xref linkend="monitoring"/> for more information.
@@ -7319,14 +7319,13 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables the collection of information on the currently
-        executing command of each session, along with the time when
-        that command began execution. This parameter is on by
-        default. Note that even when enabled, this information is not
-        visible to all users, only to superusers and the user owning
-        the session being reported on, so it should not represent a
-        security risk.
-        Only superusers can change this setting.
+        Enables activity tracking on the currently executing command of
+        each session, along with the time when that command began
+        execution. This parameter is on by default. Note that even when
+        enabled, this information is not visible to all users, only to
+        superusers and the user owning the session being reported on, so it
+        should not represent a security risk.  Only superusers can change this
+        setting.
        </para>
       </listitem>
      </varlistentry>
@@ -7357,9 +7356,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables collection of statistics on database activity.
+        Enables tracking of database activity.
         This parameter is on by default, because the autovacuum
-        daemon needs the collected information.
+        daemon needs the activity information.
         Only superusers can change this setting.
        </para>
       </listitem>
@@ -8420,7 +8419,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Specifies the fraction of the total number of heap tuples counted in
-        the previous statistics collection that can be inserted without
+        the previously collected statistics that can be inserted without
         incurring an index scan at the <command>VACUUM</command> cleanup stage.
         This setting currently applies to B-tree indexes only.
        </para>
@@ -8432,9 +8431,10 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         the index contains deleted pages that can be recycled during cleanup.
         Index statistics are considered to be stale if the number of newly
         inserted tuples exceeds the <varname>vacuum_cleanup_index_scale_factor</varname>
-        fraction of the total number of heap tuples detected by the previous
-        statistics collection. The total number of heap tuples is stored in
-        the index meta-page. Note that the meta-page does not include this data
+
+        fraction of the total number of heap tuples in the previously
+        collected statistics. The total number of heap tuples is stored in the
+        index meta-page. Note that the meta-page does not include this data
         until <command>VACUUM</command> finds no dead tuples, so B-tree index
         scan at the cleanup stage can only be skipped if the second and
         subsequent <command>VACUUM</command> cycles detect no dead tuples.
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 19d7bd2b28..3a1dc17057 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2329,12 +2329,13 @@ LOG:  database system is ready to accept read only connections
    </para>
 
    <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
-    index usage, etc., will be recorded normally on the standby. Replayed
-    actions will not duplicate their effects on primary, so replaying an
-    insert will not increment the Inserts column of pg_stat_user_tables.
-    The stats file is deleted at the start of recovery, so stats from primary
-    and standby will differ; this is considered a feature, not a bug.
+    Activity statistics are collected during recovery. All scans, reads,
+    blocks, index usage, etc., will be recorded normally on the
+    standby. Replayed actions will not duplicate their effects on primary, so
+    replaying an insert will not increment the Inserts column of
+    pg_stat_user_tables.  The activity statistics are reset at the start of
+    recovery, so stats from primary and standby will differ; this is
+    considered a feature, not a bug.
    </para>
 
    <para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 52a69a5366..a9cb25e3af 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -22,7 +22,7 @@
   <para>
    Several tools are available for monitoring database activity and
    analyzing performance.  Most of this chapter is devoted to describing
-   <productname>PostgreSQL</productname>'s statistics collector,
+   <productname>PostgreSQL</productname>'s activity statistics,
    but one should not neglect regular Unix monitoring programs such as
    <command>ps</command>, <command>top</command>, <command>iostat</command>, and <command>vmstat</command>.
    Also, once one has identified a
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    primary server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   primary process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   primary process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
@@ -130,20 +128,21 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
  </sect1>
 
  <sect1 id="monitoring-stats">
-  <title>The Statistics Collector</title>
+  <title>The Activity Statistics</title>
 
   <indexterm zone="monitoring-stats">
    <primary>statistics</primary>
   </indexterm>
 
   <para>
-   <productname>PostgreSQL</productname>'s <firstterm>statistics collector</firstterm>
-   is a subsystem that supports collection and reporting of information about
-   server activity.  Presently, the collector can count accesses to tables
-   and indexes in both disk-block and individual-row terms.  It also tracks
-   the total number of rows in each table, and information about vacuum and
-   analyze actions for each table.  It can also count calls to user-defined
-   functions and the total time spent in each one.
+   <productname>PostgreSQL</productname>'s <firstterm>activity
+   statistics</firstterm> is a subsystem that supports tracking and reporting
+   of information about server activity.  Presently, the activity statistics
+   tracks the count of accesses to tables and indexes in both disk-block and
+   individual-row terms.  It also tracks the total number of rows in each
+   table, and information about vacuum and analyze actions for each table.  It
+   can also track calls to user-defined functions and the total time spent in
+   each one.
   </para>
 
   <para>
@@ -151,15 +150,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    information about exactly what is going on in the system right now, such as
    the exact command currently being executed by other server processes, and
    which other connections exist in the system.  This facility is independent
-   of the collector process.
+   of the activity statistics.
   </para>
 
  <sect2 id="monitoring-stats-setup">
-  <title>Statistics Collection Configuration</title>
+  <title>Activity Statistics Configuration</title>
 
   <para>
-   Since collection of statistics adds some overhead to query execution,
-   the system can be configured to collect or not collect information.
+   Since tracking for the activity statistics adds some overhead to query
+   execution, the system can be configured to track or not track activity.
    This is controlled by configuration parameters that are normally set in
    <filename>postgresql.conf</filename>.  (See <xref linkend="runtime-config"/> for
    details about setting configuration parameters.)
@@ -172,7 +171,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The parameter <xref linkend="guc-track-counts"/> controls whether
-   statistics are collected about table and index accesses.
+   to track activity about table and index accesses.
   </para>
 
   <para>
@@ -196,18 +195,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
   </para>
 
   <para>
-   The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
-   When the server shuts down cleanly, a permanent copy of the statistics
-   data is stored in the <filename>pg_stat</filename> subdirectory, so that
-   statistics can be retained across server restarts.  When recovery is
-   performed at server start (e.g., after immediate shutdown, server crash,
-   and point-in-time recovery), all statistics counters are reset.
+   When the server shuts down cleanly, a permanent copy of the statistics
+   data is stored in the <filename>pg_stat</filename> subdirectory, so that
+   statistics can be retained across server restarts.  When recovery is
+   performed at server start (e.g., after immediate shutdown, server crash,
+   and point-in-time recovery), all statistics counters are reset.
   </para>
 
  </sect2>
@@ -220,48 +212,46 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    linkend="monitoring-stats-dynamic-views-table"/>, are available to show
    the current state of the system. There are also several other
    views, listed in <xref
-   linkend="monitoring-stats-views-table"/>, available to show the results
-   of statistics collection.  Alternatively, one can
-   build custom views using the underlying statistics functions, as discussed
-   in <xref linkend="monitoring-stats-functions"/>.
+   linkend="monitoring-stats-views-table"/>, available to show the activity
+   statistics.  Alternatively, one can build custom views using the underlying
+   statistics functions, as discussed in
+   <xref linkend="monitoring-stats-functions"/>.
   </para>
 
   <para>
-   When using the statistics to monitor collected data, it is important
-   to realize that the information does not update instantaneously.
-   Each individual server process transmits new statistical counts to
-   the collector just before going idle; so a query or transaction still in
-   progress does not affect the displayed totals.  Also, the collector itself
-   emits a new report at most once per <varname>PGSTAT_STAT_INTERVAL</varname>
-   milliseconds (500 ms unless altered while building the server).  So the
-   displayed information lags behind actual activity.  However, current-query
-   information collected by <varname>track_activities</varname> is
-   always up-to-date.
+   When using the activity statistics, it is important to realize that the
+   information does not update instantaneously.  Each individual server
+   process writes out new statistical counts just before going idle, but not
+   more often than once per <varname>PGSTAT_STAT_INTERVAL</varname>
+   milliseconds (1 second unless altered while building the server); so a
+   query or transaction still in progress does not affect the displayed
+   totals.  However, current-query information tracked by
+   <varname>track_activities</varname> is always up-to-date.
   </para>
 
   <para>
    Another important point is that when a server process is asked to display
-   any of these statistics, it first fetches the most recent report emitted by
-   the collector process and then continues to use this snapshot for all
-   statistical views and functions until the end of its current transaction.
-   So the statistics will show static information as long as you continue the
-   current transaction.  Similarly, information about the current queries of
-   all sessions is collected when any such information is first requested
-   within a transaction, and the same information will be displayed throughout
-   the transaction.
-   This is a feature, not a bug, because it allows you to perform several
-   queries on the statistics and correlate the results without worrying that
-   the numbers are changing underneath you.  But if you want to see new
-   results with each query, be sure to do the queries outside any transaction
-   block.  Alternatively, you can invoke
+   any of these statistics, it first reads the current statistics and then
+   continues to use this snapshot for all statistical views and functions
+   until the end of its current transaction.  So the statistics will show
+   static information as long as you continue the current transaction.
+   Similarly, information about the current queries of all sessions is tracked
+   when any such information is first requested within a transaction, and the
+   same information will be displayed throughout the transaction.  This is a
+   feature, not a bug, because it allows you to perform several queries on the
+   statistics and correlate the results without worrying that the numbers are
+   changing underneath you.  But if you want to see new results with each
+   query, be sure to do the queries outside any transaction block.
+   Alternatively, you can invoke
    <function>pg_stat_clear_snapshot</function>(), which will discard the
    current transaction's statistics snapshot (if any).  The next use of
    statistical information will cause a new snapshot to be fetched.
   </para>
-
+  
   <para>
-   A transaction can also see its own statistics (as yet untransmitted to the
-   collector) in the views <structname>pg_stat_xact_all_tables</structname>,
+   A transaction can also see its own statistics (as yet unwritten to the
+   server-wide activity statistics) in the
+   views <structname>pg_stat_xact_all_tables</structname>,
    <structname>pg_stat_xact_sys_tables</structname>,
    <structname>pg_stat_xact_user_tables</structname>, and
    <structname>pg_stat_xact_user_functions</structname>.  These numbers do not act as
@@ -637,7 +627,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    kernel's I/O cache, and might therefore still be fetched without
    requiring a physical read. Users interested in obtaining more
    detailed information on <productname>PostgreSQL</productname> I/O behavior are
-   advised to use the <productname>PostgreSQL</productname> statistics collector
+   advised to use the <productname>PostgreSQL</productname> activity statistics
    in combination with operating system utilities that allow insight
    into the kernel's handling of I/O.
   </para>
@@ -1074,10 +1064,6 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>LogicalLauncherMain</literal></entry>
       <entry>Waiting in main loop of logical replication launcher process.</entry>
      </row>
-     <row>
-      <entry><literal>PgStatMain</literal></entry>
-      <entry>Waiting in main loop of statistics collector process.</entry>
-     </row>
      <row>
       <entry><literal>RecoveryWalStream</literal></entry>
       <entry>Waiting in main loop of startup process for WAL to arrive, during
@@ -1832,6 +1818,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
     </thead>
 
     <tbody>
+     <row>
+      <entry><literal>ActivityStatistics</literal></entry>
+      <entry>Waiting to write out activity statistics to shared memory.</entry>
+     </row>
      <row>
       <entry><literal>AddinShmemInit</literal></entry>
       <entry>Waiting to manage an extension's space allocation in shared
@@ -5989,9 +5979,10 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
      <entry><literal>performing final cleanup</literal></entry>
      <entry>
        <command>VACUUM</command> is performing final cleanup.  During this phase,
-       <command>VACUUM</command> will vacuum the free space map, update statistics
-       in <literal>pg_class</literal>, and report statistics to the statistics
-       collector.  When this phase is completed, <command>VACUUM</command> will end.
+       <command>VACUUM</command> will vacuum the free space map, update
+       statistics in <literal>pg_class</literal>, and update the system-wide
+       activity statistics.  When this phase is completed,
+       <command>VACUUM</command> will end.
      </entry>
     </row>
    </tbody>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 0aa35cf0c3..ad105cb2a6 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1280,11 +1280,10 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The database activity of <application>pg_dump</application> is
-   normally collected by the statistics collector.  If this is
-   undesirable, you can set parameter <varname>track_counts</varname>
-   to false via <envar>PGOPTIONS</envar> or the <literal>ALTER
-   USER</literal> command.
+   The database activity of <application>pg_dump</application> is normally
+   collected.  If this is undesirable, you can set
+   parameter <varname>track_counts</varname> to false
+   via <envar>PGOPTIONS</envar> or the <literal>ALTER USER</literal> command.
   </para>
 
  </refsect1>
-- 
2.27.0
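
The monitoring.sgml text in the patch above says that each server process writes out its counts just before going idle, at most once per PGSTAT_STAT_INTERVAL milliseconds. A minimal sketch of that rate-limiting rule follows. It is not code from the patch: SketchStats, sketch_count, sketch_report_on_idle and SKETCH_STAT_INTERVAL_MS are hypothetical names, the "shared" field merely stands in for a shared-memory counter, and locking is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical interval; the doc text mentions 1 second as the default. */
#define SKETCH_STAT_INTERVAL_MS 1000

typedef struct SketchStats
{
    int64_t pending;        /* counts accumulated locally, not yet published */
    int64_t shared;         /* stand-in for the shared-memory counter */
    int64_t last_flush_ms;  /* time of the last successful flush */
    bool    ever_flushed;   /* false until the first flush has happened */
} SketchStats;

/* Accumulate activity locally; nothing becomes visible to others yet. */
static void
sketch_count(SketchStats *s, int64_t n)
{
    assert(n >= 0);
    s->pending += n;
}

/*
 * Called when the process is about to go idle.  Publishes pending counts
 * unless the previous flush happened less than one interval ago.
 * Returns true if a flush was performed.
 */
static bool
sketch_report_on_idle(SketchStats *s, int64_t now_ms)
{
    if (s->pending == 0)
        return false;       /* nothing to publish */

    if (s->ever_flushed &&
        now_ms - s->last_flush_ms < SKETCH_STAT_INTERVAL_MS)
        return false;       /* too soon; keep the counts pending */

    s->shared += s->pending;
    s->pending = 0;
    s->last_flush_ms = now_ms;
    s->ever_flushed = true;
    return true;
}
```

In this sketch, counts that are held back stay in "pending" until a later idle point. How the real patch gets deferred counts written out (for example, whether IDLE_STATS_UPDATE_TIMEOUT is involved) is not shown in this part of the thread, so the sketch does not model it.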

From 31293ea1e2434a8d77660da9ae9d9ebb99f0771c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 29 Sep 2020 22:59:33 +0900
Subject: [PATCH v44 6/7] Remove the GUC stats_temp_directory

The new stats collection system doesn't need a temporary directory, so
just remove it. pg_stat_statements is modified to use the pg_stat
directory to store its temporary files.  As a result, basebackup copies
pg_stat_statements' temporary file if it exists.
---
 .../pg_stat_statements/pg_stat_statements.c   | 13 +++---
 doc/src/sgml/backup.sgml                      |  2 -
 doc/src/sgml/config.sgml                      | 19 --------
 doc/src/sgml/storage.sgml                     |  6 ---
 src/backend/postmaster/pgstat.c               | 10 -----
 src/backend/replication/basebackup.c          | 36 ----------------
 src/backend/utils/misc/guc.c                  | 43 -------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/bin/initdb/initdb.c                       |  1 -
 src/bin/pg_rewind/filemap.c                   |  7 ---
 src/include/pgstat.h                          |  3 --
 src/test/perl/PostgresNode.pm                 |  4 --
 12 files changed, 6 insertions(+), 139 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 00eeaf1057..8a45d662ba 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -89,14 +89,13 @@ PG_MODULE_MAGIC;
 #define PGSS_DUMP_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pg_stat_statements.stat"
 
 /*
- * Location of external query text file.  We don't keep it in the core
- * system's stats_temp_directory.  The core system can safely use that GUC
- * setting, because the statistics collector temp file paths are set only once
- * as part of changing the GUC, but pg_stat_statements has no way of avoiding
- * race conditions.  Besides, we only expect modest, infrequent I/O for query
- * strings, so placing the file on a faster filesystem is not compelling.
+ * Location of external query text file.  We don't keep it in the core system's
+ * pg_stats.  pg_stat_statements has no way of avoiding race conditions even if
+ * the directory were specified by a GUC.  Besides, we only expect modest,
+ * infrequent I/O for query strings, so placing the file on a faster filesystem
+ * is not compelling.
  */
-#define PGSS_TEXT_FILE    PG_STAT_TMP_DIR "/pgss_query_texts.stat"
+#define PGSS_TEXT_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pgss_query_texts.stat"
 
 /* Magic number identifying the stats file format */
 static const uint32 PGSS_FILE_HEADER = 0x20201218;
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 42a8ed328d..dd3d8892d8 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1146,8 +1146,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e6bf21b450..9c86ecac15 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7412,25 +7412,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 3234adb639..6bac5e075e 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -120,12 +120,6 @@ Item
   subsystem</entry>
 </row>
 
-<row>
- <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
-</row>
-
 <row>
  <entry><filename>pg_subtrans</filename></entry>
  <entry>Subdirectory containing subtransaction status data</entry>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ce4c6988f3..ad3babffa0 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -99,16 +99,6 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
- */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
-
 /*
  * WAL usage counters saved from pgWALUsage at the previous call to
  * pgstat_send_wal(). This is used to calculate how much WAL usage
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 4e5d63b30e..2b2761a588 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -87,9 +87,6 @@ static int    basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
 /* Was the backup currently in-progress initiated in recovery mode? */
 static bool backup_started_in_recovery = false;
 
-/* Relative path of temporary statistics directory */
-static char *statrelpath = NULL;
-
 /*
  * Size of each block sent into the tar stream for larger files.
  */
@@ -152,13 +149,6 @@ struct exclude_list_item
  */
 static const char *const excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    PG_STAT_TMP_DIR,
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
@@ -261,7 +251,6 @@ perform_base_backup(basebackup_options *opt)
     StringInfo    labelfile;
     StringInfo    tblspc_map_file;
     backup_manifest_info manifest;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
     backup_total = 0;
@@ -284,8 +273,6 @@ perform_base_backup(basebackup_options *opt)
     Assert(CurrentResourceOwner == NULL);
     CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -314,18 +301,6 @@ perform_base_backup(basebackup_options *opt)
         tablespaceinfo *ti;
         int            tblspc_streamed = 0;
 
-        /*
-         * Calculate the relative path of temporary statistics directory in
-         * order to skip the files which are located in that directory later.
-         */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
-
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
         ti->size = -1;
@@ -1377,17 +1352,6 @@ sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces,
         if (excludeFound)
             continue;
 
-        /*
-         * Exclude contents of directory specified by statrelpath if not set
-         * to the default (pg_stat_tmp) which is caught in the loop above.
-         */
-        if (statrelpath != NULL && strcmp(pathbuf, statrelpath) == 0)
-        {
-            elog(DEBUG1, "contents of directory \"%s\" excluded from backup", statrelpath);
-            size += _tarWriteDir(pathbuf, basepathlen, &statbuf, sizeonly);
-            continue;
-        }
-
         /*
          * We can skip pg_wal, the WAL segments need to be fetched from the
          * WAL archive anyway. But include it as an empty directory anyway, so
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 6dbb61a99d..e3c84f30e7 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -202,7 +202,6 @@ static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource sourc
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_huge_page_size(int *newval, void **extra, GucSource source);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -558,8 +557,6 @@ char       *HbaFileName;
 char       *IdentFileName;
 char       *external_pid_file;
 
-char       *pgstat_temp_directory;
-
 char       *application_name;
 
 int            tcp_keepalives_idle;
@@ -4299,17 +4296,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_PRIMARY,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11692,35 +11678,6 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
     return true;
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5b16c09ccc..91b8013b1e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -585,7 +585,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index f994c4216b..8b7c798287 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -217,7 +217,6 @@ static const char *const subdirs[] = {
     "pg_replslot",
     "pg_tblspc",
     "pg_stat",
-    "pg_stat_tmp",
     "pg_xact",
     "pg_logical",
     "pg_logical/snapshots",
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index ba34dbac14..00aed706bb 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -87,13 +87,6 @@ struct exclude_list_item
  */
 static const char *excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    "pg_stat_tmp",                /* defined as PG_STAT_TMP_DIR */
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 9bba4785d0..17f8afaf50 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -33,9 +33,6 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/saved_stats"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/saved_stats.tmp"
 
-/* Default directory to store temporary statistics data in */
-#define PG_STAT_TMP_DIR        "pg_stat_tmp"
-
 /* Values for track_functions GUC variable --- order is significant! */
 typedef enum TrackFunctionsLevel
 {
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 9667f7667e..dd41a43b4e 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -455,10 +455,6 @@ sub init
     print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
       if defined $ENV{TEMP_CONFIG};
 
-    # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
-    # concurrently must not share a stats_temp_directory.
-    print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
     if ($params{allows_streaming})
     {
         if ($params{allows_streaming} eq "logical")
-- 
2.27.0

From 6b79aa6988dd4550771fad9c28d603a2d39d0f1e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 29 Sep 2020 23:19:58 +0900
Subject: [PATCH v44 7/7] Exclude pg_stat directory from base backup

basebackup sends the contents of the pg_stat directory, which are doomed
to be removed at startup from the backup. Now that pg_stat_statements
saves a temporary file into the directory, let's exclude the pg_stat
directory from base backups.
---
 src/backend/replication/basebackup.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 2b2761a588..eaf4448943 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -149,6 +149,13 @@ struct exclude_list_item
  */
 static const char *const excludeDirContents[] =
 {
+    /*
+     * Skip statistics files. PGSTAT_STAT_PERMANENT_DIRECTORY must be skipped
+     * because the files in the directory will be removed at startup from the
+     * backup.
+     */
+    PGSTAT_STAT_PERMANENT_DIRECTORY,
+
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
-- 
2.27.0


Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
At Mon, 21 Dec 2020 17:16:20 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> - Conflicted with b3817f5f77. Rebased.

Conflicted with 9877374bef. Rebased.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 94af7bbd848ac14582307807e216b8b36788f7a9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:03 +0900
Subject: [PATCH v45 1/7] sequential scan for dshash

Dshash did not allow scanning all entries sequentially. This adds that
functionality. The interface is similar to, but a bit different from,
both that of dynahash and the simple dshash search functions. One of
the most significant differences is that the sequential scan interface
of dshash always needs a call to dshash_seq_term when a scan ends.
Another is locking. Dshash holds a partition lock when returning an
entry; dshash_seq_next() also holds a lock when returning an entry, but
callers shouldn't release it, since the lock is essential to continue
the scan. The seqscan interface allows entry deletion during a scan.
In-scan deletion should be performed by dshash_delete_current().
---
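A rough, self-contained model of the scan protocol described above may
help when reading the diff below. It is only a sketch, not the patch's
code: every toy_* name, the table sizes, and the boolean "locks" are
hypothetical stand-ins for dshash_table, the partition LWLocks and
dshash_seq_status, and it detects the first call via curpartition
rather than curitem. What it does mirror is the hand-over-hand locking
(lock the next partition, then release the current one) and the
requirement to call the term function when the scan ends.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy sizes (hypothetical); the real table sizes live in dshash.c. */
#define TOY_PARTITIONS_LOG2 2                 /* 4 partitions */
#define TOY_SIZE_LOG2       4                 /* 16 buckets */
#define TOY_NUM_PARTITIONS  (1 << TOY_PARTITIONS_LOG2)
#define TOY_NUM_BUCKETS     (1 << TOY_SIZE_LOG2)
/* Same shape as PARTITION_FOR_BUCKET_INDEX in the patch. */
#define TOY_PARTITION_FOR_BUCKET(b) \
    ((b) >> (TOY_SIZE_LOG2 - TOY_PARTITIONS_LOG2))

typedef struct toy_item { int value; struct toy_item *next; } toy_item;

typedef struct toy_table
{
    toy_item *buckets[TOY_NUM_BUCKETS];
    bool      locked[TOY_NUM_PARTITIONS];   /* stand-in for partition LWLocks */
    int       max_locks_held;               /* high-water mark, for checking */
} toy_table;

typedef struct toy_seq_status
{
    toy_table *table;
    int        curbucket;
    int        curpartition;                /* -1 until the first next() */
    toy_item  *nextitem;                    /* saved before returning an item */
} toy_seq_status;

static int toy_locks_held(const toy_table *t)
{
    int n = 0;
    for (int i = 0; i < TOY_NUM_PARTITIONS; i++)
        n += t->locked[i] ? 1 : 0;
    return n;
}

static void toy_lock(toy_table *t, int partition)
{
    assert(!t->locked[partition]);
    t->locked[partition] = true;
    if (toy_locks_held(t) > t->max_locks_held)
        t->max_locks_held = toy_locks_held(t);
}

static void toy_unlock(toy_table *t, int partition)
{
    assert(t->locked[partition]);
    t->locked[partition] = false;
}

/* Push an item onto the head of the given bucket's chain. */
static void toy_insert(toy_table *t, toy_item *item, int bucket)
{
    item->next = t->buckets[bucket];
    t->buckets[bucket] = item;
}

static void toy_seq_init(toy_seq_status *s, toy_table *t)
{
    s->table = t;
    s->curbucket = 0;
    s->curpartition = -1;
    s->nextitem = NULL;
}

/* Return the next item, or NULL when every bucket has been visited. */
static toy_item *toy_seq_next(toy_seq_status *s)
{
    toy_item *next;

    if (s->curpartition < 0)
    {
        /* First call: lock the partition that owns bucket 0. */
        s->curpartition = TOY_PARTITION_FOR_BUCKET(s->curbucket);
        toy_lock(s->table, s->curpartition);
        next = s->table->buckets[s->curbucket];
    }
    else
        next = s->nextitem;

    while (next == NULL)
    {
        int next_partition;

        if (++s->curbucket >= TOY_NUM_BUCKETS)
            return NULL;        /* lock stays held until toy_seq_term() */
        next_partition = TOY_PARTITION_FOR_BUCKET(s->curbucket);
        if (next_partition != s->curpartition)
        {
            /* Lock the next partition first, then release the current. */
            toy_lock(s->table, next_partition);
            toy_unlock(s->table, s->curpartition);
            s->curpartition = next_partition;
        }
        next = s->table->buckets[s->curbucket];
    }
    s->nextitem = next->next;   /* remembered in case the caller unlinks it */
    return next;
}

/* Must be called when the scan finishes or is abandoned. */
static void toy_seq_term(toy_seq_status *s)
{
    if (s->curpartition >= 0)
        toy_unlock(s->table, s->curpartition);
    s->curpartition = -1;
}

/* Example caller: scan the whole table and add up the values. */
static int toy_sum_all(toy_table *t)
{
    toy_seq_status s;
    toy_item   *item;
    int         sum = 0;

    toy_seq_init(&s, t);
    while ((item = toy_seq_next(&s)) != NULL)
        sum += item->value;
    toy_seq_term(&s);
    return sum;
}
```

In this model at most two partition locks are held at any moment
(during the hand-over), and none remain after toy_seq_term(). A caller
that leaves the loop early still has to call the term function, which
is the point the commit message stresses.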
 src/backend/lib/dshash.c | 160 ++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  22 ++++++
 2 files changed, 181 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index e0c763be32..520bfa0979 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -127,6 +127,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +157,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -324,7 +332,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -592,6 +600,156 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should always be called when a scan finishes.
+ * The caller may delete returned elements in the midst of a scan by using
+ * dshash_delete_current(). exclusive must be true to delete elements.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool exclusive)
+{
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->exclusive = exclusive;
+}
+
+/*
+ * Returns the next element.
+ *
+ * Returned elements are locked and the caller must not explicitly release
+ * them. The lock is released by a later dshash_seq_next() or dshash_seq_term().
+ */
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert(status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+        LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                      status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        status->curpartition = partition;
+
+        /* resize doesn't happen from now until seq scan ends */
+        status->nbuckets =
+            NUM_BUCKETS(status->hash_table->control->size_log2);
+        ensure_valid_bucket_pointers(status->hash_table);
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(status->hash_table,
+                                               status->curpartition),
+                                status->exclusive ? LW_EXCLUSIVE : LW_SHARED));
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        int next_partition;
+
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            return NULL;
+        }
+
+        /* Check whether we move to the next partition */
+        next_partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+
+        if (status->curpartition != next_partition)
+        {
+            /*
+             * Move to the next partition. Lock the next partition, then release
+             * the current one, not in the reverse order, to avoid concurrent
+             * resizing.  Avoid deadlock by taking locks in the same order
+             * as resize().
+             */
+            LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                         next_partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                         status->curpartition));
+            status->curpartition = next_partition;
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * The caller may delete the item. Store the next item in case of deletion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+/*
+ * Terminates the seqscan and releases all locks.
+ *
+ * Should always be called when finishing or exiting a seqscan.
+ */
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+
+    if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+/* Remove the current entry during a seq scan. */
+void
+dshash_delete_current(dshash_seq_status *status)
+{
+    dshash_table       *hash_table    = status->hash_table;
+    dshash_table_item  *item        = status->curitem;
+    size_t                partition    = PARTITION_FOR_HASH(item->hash);
+
+    Assert(status->exclusive);
+    Assert(hash_table->control->magic == DSHASH_MAGIC);
+    Assert(hash_table->find_locked);
+    Assert(hash_table->find_exclusively_locked);
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition),
+                                LW_EXCLUSIVE));
+
+    delete_item(hash_table, item);
+}
+
+/* Get the current entry during a seq scan. */
+void *
+dshash_get_current(dshash_seq_status *status)
+{
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index c069ec9de7..a6ea377173 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,21 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state. The details are exposed to let users know the storage
+ * size, but callers should treat it as an opaque type.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;    /* dshash table working on */
+    int                    curbucket;    /* bucket number we are at */
+    int                    nbuckets;    /* total number of buckets in the dshash */
+    dshash_table_item  *curitem;    /* item we are currently at */
+    dsa_pointer            pnextitem;    /* dsa-pointer to the next item */
+    int                    curpartition;    /* partition number we are at */
+    bool                exclusive;    /* locking mode */
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
                                    const dshash_parameters *params,
@@ -80,6 +95,13 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
+extern void dshash_delete_current(dshash_seq_status *status);
+extern void *dshash_get_current(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.27.0

From c0acc58bff709bd113e9d44daf4e2b33ed9aac18 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:57 +0900
Subject: [PATCH v45 2/7] Add conditional lock feature to dshash

Dshash currently waits for locks unconditionally. That is inconvenient
when we want to avoid being blocked by other processes. This commit
adds dshash_find_extended, an alternative to dshash_find and
dshash_find_or_insert that allows immediate return on lock failure.
---
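To make the nowait/insert control flow easier to follow, here is a
small stand-alone model of it. This is a sketch under loud
assumptions: the "table" is a single slot, the partition lock is a C11
atomic_flag instead of an LWLock, and all toy_* names are hypothetical.
It mirrors one detail visible in the diff below: when the conditional
acquire fails, the function returns NULL before *found is written.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* One-slot "partition" (hypothetical); real dshash has chained buckets. */
typedef struct toy_partition
{
    atomic_flag lock;           /* stand-in for the partition LWLock */
    int         key;
    int         value;
    bool        used;
} toy_partition;

#define TOY_PARTITION_INIT { ATOMIC_FLAG_INIT, 0, 0, false }

/*
 * Look up "key".  On success the entry is returned with the lock held and
 * the caller must call toy_release().  NULL means either "lock busy" (only
 * possible with nowait) or "not found and not inserted".
 */
static int *toy_find_extended(toy_partition *p, int key,
                              bool nowait, bool insert, bool *found)
{
    assert(!insert || found != NULL);

    if (!nowait)
    {
        while (atomic_flag_test_and_set(&p->lock))
            ;                   /* spin: stand-in for a blocking acquire */
    }
    else if (atomic_flag_test_and_set(&p->lock))
        return NULL;            /* lock busy: *found is left untouched */

    if (p->used && p->key == key)
    {
        if (found)
            *found = true;
        return &p->value;
    }

    if (found)
        *found = false;

    /* Toy limitation: the single slot cannot hold a second key. */
    if (!insert || p->used)
    {
        atomic_flag_clear(&p->lock);
        return NULL;
    }

    p->used = true;
    p->key = key;
    p->value = 0;
    return &p->value;
}

/* Counterpart of dshash_release_lock() in this model. */
static void toy_release(toy_partition *p)
{
    atomic_flag_clear(&p->lock);
}
```

Because a NULL result under nowait does not say whether the lock was
busy or the key was absent, a caller in this model that needs to tell
the two apart has to initialize *found beforehand or retry without
nowait.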
 src/backend/lib/dshash.c | 98 +++++++++++++++++++++-------------------
 src/include/lib/dshash.h |  3 ++
 2 files changed, 55 insertions(+), 46 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 520bfa0979..853d78b528 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -383,6 +383,10 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  * the caller must take care to ensure that the entry is not left corrupted.
  * The lock mode is either shared or exclusive depending on 'exclusive'.
  *
+ * If found is not NULL, *found is set to true if the key is found in the hash
+ * table. If the key is not found, *found is set to false and a pointer to a
+ * newly created entry is returned.
+ *
  * The caller must not lock a lock already.
  *
  * Note that the lock held is in fact an LWLock, so interrupts will be held on
@@ -392,36 +396,7 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
 {
-    dshash_hash hash;
-    size_t        partition;
-    dshash_table_item *item;
-
-    hash = hash_key(hash_table, key);
-    partition = PARTITION_FOR_HASH(hash);
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
-
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
-    ensure_valid_bucket_pointers(hash_table);
-
-    /* Search the active bucket. */
-    item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
-
-    if (!item)
-    {
-        /* Not found. */
-        LWLockRelease(PARTITION_LOCK(hash_table, partition));
-        return NULL;
-    }
-    else
-    {
-        /* The caller will free the lock by calling dshash_release_lock. */
-        hash_table->find_locked = true;
-        hash_table->find_exclusively_locked = exclusive;
-        return ENTRY_FROM_ITEM(item);
-    }
+    return dshash_find_extended(hash_table, key, exclusive, false, false, NULL);
 }
 
 /*
@@ -439,31 +414,60 @@ dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
 {
-    dshash_hash hash;
-    size_t        partition_index;
-    dshash_partition *partition;
+    return dshash_find_extended(hash_table, key, true, false, true, found);
+}
+
+
+/*
+ * Find the key in the hash table.
+ *
+ * "exclusive" is the lock mode in which the partition for the returned item
+ * is locked.  If "nowait" is true, the function returns NULL immediately if
+ * the required lock cannot be acquired.  "insert" indicates insert mode: if
+ * the key is not found, a new entry is inserted and *found is set to false;
+ * *found is set to true if found. "found" must be non-null in this mode.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool insert, bool *found)
+{
+    dshash_hash hash = hash_key(hash_table, key);
+    size_t        partidx = PARTITION_FOR_HASH(hash);
+    dshash_partition *partition = &hash_table->control->partitions[partidx];
+    LWLockMode  lockmode = exclusive ? LW_EXCLUSIVE : LW_SHARED;
     dshash_table_item *item;
 
-    hash = hash_key(hash_table, key);
-    partition_index = PARTITION_FOR_HASH(hash);
-    partition = &hash_table->control->partitions[partition_index];
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
+    /* must be exclusive when insert allowed */
+    Assert(!insert || (exclusive && found != NULL));
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (!nowait)
+        LWLockAcquire(PARTITION_LOCK(hash_table, partidx), lockmode);
+    else if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partidx),
+                                       lockmode))
+        return NULL;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
     item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
 
     if (item)
-        *found = true;
+    {
+        if (found)
+            *found = true;
+    }
     else
     {
-        *found = false;
+        if (found)
+            *found = false;
+
+        if (!insert)
+        {
+            /* The caller didn't ask to add a new entry. */
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+            return NULL;
+        }
 
         /* Check if we are getting too full. */
         if (partition->count > MAX_COUNT_PER_PARTITION(hash_table))
@@ -479,7 +483,8 @@ restart:
              * Give up our existing lock first, because resizing needs to
              * reacquire all the locks in the right order to avoid deadlocks.
              */
-            LWLockRelease(PARTITION_LOCK(hash_table, partition_index));
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+
             resize(hash_table, hash_table->size_log2 + 1);
 
             goto restart;
@@ -493,12 +498,13 @@ restart:
         ++partition->count;
     }
 
-    /* The caller must release the lock with dshash_release_lock. */
+    /* The caller will free the lock by calling dshash_release_lock. */
     hash_table->find_locked = true;
-    hash_table->find_exclusively_locked = true;
+    hash_table->find_exclusively_locked = exclusive;
     return ENTRY_FROM_ITEM(item);
 }
 
+
 /*
  * Remove an entry by key.  Returns true if the key was found and the
  * corresponding entry was removed.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index a6ea377173..5b8114d041 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -91,6 +91,9 @@ extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                                    const void *key, bool *found);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+                                  bool exclusive, bool nowait, bool insert,
+                                  bool *found);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.27.0

From 7d81265d3bcf3cbdb55f296cc548c4e691cfade1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:59:38 +0900
Subject: [PATCH v45 3/7] Make archiver process an auxiliary process

This is a preliminary patch for shared-memory based stats collector.

The archiver process must be an auxiliary process since it uses shared
memory once stats data has been moved into shared memory. Make the
process an auxiliary process in order to make it work.
---
 src/backend/access/transam/xlogarchive.c |   6 +-
 src/backend/bootstrap/bootstrap.c        |  22 ++--
 src/backend/postmaster/pgarch.c          | 130 +++--------------------
 src/backend/postmaster/postmaster.c      |  50 +++++----
 src/backend/storage/lmgr/proc.c          |   1 +
 src/include/access/xlog.h                |   3 +
 src/include/access/xlogarchive.h         |   1 +
 src/include/miscadmin.h                  |   2 +
 src/include/postmaster/pgarch.h          |   4 +-
 src/include/storage/pmsignal.h           |   1 -
 src/include/storage/proc.h               |   3 +
 11 files changed, 69 insertions(+), 154 deletions(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 1c5a4f8b5a..d01859bde5 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -29,7 +29,9 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/latch.h"
 #include "storage/pmsignal.h"
+#include "storage/proc.h"
 
 /*
  * Attempt to retrieve the specified file from off-line archival storage.
@@ -490,8 +492,8 @@ XLogArchiveNotify(const char *xlog)
     }
 
     /* Notify archiver that it's got something to do */
-    if (IsUnderPostmaster)
-        SendPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER);
+    if (IsUnderPostmaster && ProcGlobal->archiverLatch)
+        SetLatch(ProcGlobal->archiverLatch);
 }
 
 /*
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 6f615e6622..41da0c5059 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -317,6 +317,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
         case StartupProcess:
             MyBackendType = B_STARTUP;
             break;
+        case ArchiverProcess:
+            MyBackendType = B_ARCHIVER;
+            break;
         case BgWriterProcess:
             MyBackendType = B_BG_WRITER;
             break;
@@ -437,30 +440,29 @@ AuxiliaryProcessMain(int argc, char *argv[])
             proc_exit(1);        /* should never return */
 
         case StartupProcess:
-            /* don't set signals, startup process has its own agenda */
             StartupProcessMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
+
+        case ArchiverProcess:
+            PgArchiverMain();
+            proc_exit(1);
 
         case BgWriterProcess:
-            /* don't set signals, bgwriter has its own agenda */
             BackgroundWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case CheckpointerProcess:
-            /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalWriterProcess:
-            /* don't set signals, walwriter has its own agenda */
             InitXLOGAccess();
             WalWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalReceiverProcess:
-            /* don't set signals, walreceiver has its own agenda */
             WalReceiverMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         default:
             elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index edec311f12..9a2e21bf86 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -48,6 +48,7 @@
 #include "storage/latch.h"
 #include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
+#include "storage/procsignal.h"
 #include "utils/guc.h"
 #include "utils/ps_status.h"
 
@@ -78,13 +79,11 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
-static volatile sig_atomic_t wakened = false;
 static volatile sig_atomic_t ready_to_stop = false;
 
 /* ----------
@@ -95,8 +94,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
 static void pgarch_ArchiverCopyLoop(void);
@@ -110,75 +107,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -212,14 +140,9 @@ pgarch_forkexec(void)
 #endif                            /* EXEC_BACKEND */
 
 
-/*
- * PgArchiverMain
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
- *    since we don't use 'em, it hardly matters...
- */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+/* Main entry point for archiver process */
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -231,33 +154,26 @@ PgArchiverMain(int argc, char *argv[])
     /* SIGQUIT handler was already set up by InitPostmasterChild */
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, pgarch_waken);
+    pqsignal(SIGUSR1, procsignal_sigusr1_handler);
     pqsignal(SIGUSR2, pgarch_waken_stop);
+
     /* Reset some signals that are accepted by postmaster but not here */
     pqsignal(SIGCHLD, SIG_DFL);
+
+    /* Unblock signals (they were blocked when the postmaster forked us) */
     PG_SETMASK(&UnBlockSig);
 
-    MyBackendType = B_ARCHIVER;
-    init_ps_display(NULL);
+    /*
+     * Advertise our latch that backends can use to wake us up while we're
+     * sleeping.
+     */
+    ProcGlobal->archiverLatch = &MyProc->procLatch;
 
     pgarch_MainLoop();
 
     exit(0);
 }
 
-/* SIGUSR1 signal handler for archiver process */
-static void
-pgarch_waken(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    /* set flag that there is work to be done */
-    wakened = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
 /* SIGUSR2 signal handler for archiver process */
 static void
 pgarch_waken_stop(SIGNAL_ARGS)
@@ -282,14 +198,6 @@ pgarch_MainLoop(void)
     pg_time_t    last_copy_time = 0;
     bool        time_to_stop;
 
-    /*
-     * We run the copy loop immediately upon entry, in case there are
-     * unarchived files left over from a previous database run (or maybe the
-     * archiver died unexpectedly).  After that we wait for a signal or
-     * timeout before doing more.
-     */
-    wakened = true;
-
     /*
      * There shouldn't be anything for the archiver to do except to wait for a
      * signal ... however, the archiver exists to protect our data, so she
@@ -328,12 +236,8 @@ pgarch_MainLoop(void)
         }
 
         /* Do what we're here for */
-        if (wakened || time_to_stop)
-        {
-            wakened = false;
-            pgarch_ArchiverCopyLoop();
-            last_copy_time = time(NULL);
-        }
+        pgarch_ArchiverCopyLoop();
+        last_copy_time = time(NULL);
 
         /*
          * Sleep until a signal is received, or until a poll is forced by
@@ -354,13 +258,9 @@ pgarch_MainLoop(void)
                                WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                                timeout * 1000L,
                                WAIT_EVENT_ARCHIVER_MAIN);
-                if (rc & WL_TIMEOUT)
-                    wakened = true;
                 if (rc & WL_POSTMASTER_DEATH)
                     time_to_stop = true;
             }
-            else
-                wakened = true;
         }
 
         /*
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 7de27ee4e0..af91c313e2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -548,6 +548,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1788,7 +1789,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -3027,7 +3028,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3162,20 +3163,16 @@ reaper(SIGNAL_ARGS)
         }
 
         /*
-         * Was it the archiver?  If so, just try to start a new one; no need
-         * to force reset of the rest of the system.  (If fail, we'll try
-         * again in future cycles of the main loop.).  Unless we were waiting
-         * for it to shut down; don't restart it in that case, and
-         * PostmasterStateMachine() will advance to the next shutdown step.
+         * Was it the archiver?  Normal exit can be ignored; we'll start a new
+         * one at the next iteration of the postmaster's main loop, if
+         * necessary. Any other exit condition is treated as a crash.
          */
         if (pid == PgArchPID)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3423,7 +3420,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3629,6 +3626,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3932,6 +3941,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5160,7 +5170,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5214,16 +5224,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (StartWorkerNeeded || HaveCrashedWorker)
         maybe_start_bgworkers();
 
-    if (CheckPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER) &&
-        PgArchPID != 0)
-    {
-        /*
-         * Send SIGUSR1 to archiver process, to wake it up and begin archiving
-         * next WAL file.
-         */
-        signal_child(PgArchPID, SIGUSR1);
-    }
-
     /* Tell syslogger to rotate logfile if requested */
     if (SysLoggerPID != 0)
     {
@@ -5465,6 +5465,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index db0cfaa360..aabf9d73eb 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -182,6 +182,7 @@ InitProcGlobal(void)
     ProcGlobal->startupBufferPinWaitBufId = -1;
     ProcGlobal->walwriterLatch = NULL;
     ProcGlobal->checkpointerLatch = NULL;
+    ProcGlobal->archiverLatch = NULL;
     pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
     pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75ec1073bd..551f518cc2 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -339,6 +339,9 @@ extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern void XLogArchiveWakeupStart(void);
+extern void XLogArchiveWakeupEnd(void);
+extern void XLogArchiveWakeup(void);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool PromoteIsTriggered(void);
diff --git a/src/include/access/xlogarchive.h b/src/include/access/xlogarchive.h
index 3edd1a976c..1a59181cf9 100644
--- a/src/include/access/xlogarchive.h
+++ b/src/include/access/xlogarchive.h
@@ -25,6 +25,7 @@ extern void ExecuteRecoveryCommand(const char *command, const char *commandName,
 extern void KeepFileRestoredFromArchive(const char *path, const char *xlogfname);
 extern void XLogArchiveNotify(const char *xlog);
 extern void XLogArchiveNotifySeg(XLogSegNo segno);
+extern void XLogArchiveWakeup(void);
 extern void XLogArchiveForceDone(const char *xlog);
 extern bool XLogArchiveCheckDone(const char *xlog);
 extern bool XLogArchiveIsBusy(const char *xlog);
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bdc97e308..adb9f819bb 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -419,6 +419,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -431,6 +432,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index d102a21ab7..385b002dfe 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index dbbed18f61..8ed4d87ae6 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -34,7 +34,6 @@ typedef enum
 {
     PMSIGNAL_RECOVERY_STARTED,    /* recovery has started */
     PMSIGNAL_BEGIN_HOT_STANDBY, /* begin Hot Standby */
-    PMSIGNAL_WAKEN_ARCHIVER,    /* send a NOTIFY signal to xlog archiver */
     PMSIGNAL_ROTATE_LOGFILE,    /* send SIGUSR1 to syslogger to rotate logfile */
     PMSIGNAL_START_AUTOVAC_LAUNCHER,    /* start an autovacuum launcher */
     PMSIGNAL_START_AUTOVAC_WORKER,    /* start an autovacuum worker */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 0786fcf103..430d438303 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -354,6 +354,9 @@ typedef struct PROC_HDR
     int            startupProcPid;
     /* Buffer id of the buffer that Startup process waits for pin on, or -1 */
     int            startupBufferPinWaitBufId;
+    /* Archiver process's latch */
+    Latch       *archiverLatch;
+    /* Current shared estimate of appropriate spins_per_delay value */
 } PROC_HDR;
 
 extern PGDLLIMPORT PROC_HDR *ProcGlobal;
-- 
2.27.0

From cff6ae8866128ffd32054c3eb6f557d341c079fc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:14 +0900
Subject: [PATCH v45 4/7] Shared-memory based stats collector

Previously, activity statistics were collected via sockets and shared
among backends through files written periodically. Such files can reach
tens of megabytes and are created as often as once per second, and that
large volume of data is serialized by the stats collector and then
de-serialized by every backend periodically. To avoid that cost, this
patch places activity statistics data in shared memory. Each backend
accumulates statistics locally and then tries to move them into the
shared statistics at every transaction end, but at intervals no shorter
than 10 seconds. Until 60 seconds have elapsed since the last flush to
shared stats, a lock failure postpones the flush, to alleviate lock
contention that would slow down transactions.  After that, the flush
waits for locks so that the shared statistics don't get stale.
---
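Note (illustration only, not part of the patch): the flush timing
described above can be sketched as a small decision function. This is a
minimal sketch under stated assumptions; the names decide_flush,
FlushAction, MIN_FLUSH_INTERVAL_MS and MAX_FLUSH_DELAY_MS are
hypothetical and do not appear in the patch. Only the 10-second and
60-second values come from the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants; only the values come from the commit message. */
#define MIN_FLUSH_INTERVAL_MS 10000		/* never flush more often than this */
#define MAX_FLUSH_DELAY_MS    60000		/* after this, wait for the lock */

/* Hypothetical result type: what a backend does at transaction end. */
typedef enum
{
	FLUSH_SKIP,					/* too soon since the last flush */
	FLUSH_TRY_NOWAIT,			/* try the lock; postpone if it is busy */
	FLUSH_FORCE_WAIT			/* block on the lock so stats don't go stale */
} FlushAction;

/*
 * Decide how to flush locally accumulated stats, given the current time
 * and the time of the last successful flush (both in milliseconds).
 */
static FlushAction
decide_flush(int64_t now_ms, int64_t last_flush_ms)
{
	int64_t		elapsed;

	/* This sketch assumes a monotonic clock. */
	assert(now_ms >= last_flush_ms);
	elapsed = now_ms - last_flush_ms;

	if (elapsed < MIN_FLUSH_INTERVAL_MS)
		return FLUSH_SKIP;
	if (elapsed < MAX_FLUSH_DELAY_MS)
		return FLUSH_TRY_NOWAIT;
	return FLUSH_FORCE_WAIT;
}
```

In this sketch a caller would treat FLUSH_TRY_NOWAIT as "attempt a
conditional lock and leave the pending counters in place on failure",
so that a contended lock never stalls the committing transaction until
the 60-second limit is reached.
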
 src/backend/access/heap/heapam_handler.c      |    4 +-
 src/backend/access/heap/vacuumlazy.c          |    2 +-
 src/backend/access/transam/xlog.c             |    6 +-
 src/backend/catalog/index.c                   |   24 +-
 src/backend/commands/analyze.c                |    8 +-
 src/backend/commands/dbcommands.c             |    2 +-
 src/backend/commands/matview.c                |    8 +-
 src/backend/commands/vacuum.c                 |    4 +-
 src/backend/postmaster/autovacuum.c           |   76 +-
 src/backend/postmaster/bgwriter.c             |    4 +-
 src/backend/postmaster/checkpointer.c         |   26 +-
 src/backend/postmaster/pgarch.c               |   12 +-
 src/backend/postmaster/pgstat.c               | 6135 +++++++----------
 src/backend/postmaster/postmaster.c           |   88 +-
 src/backend/replication/basebackup.c          |    4 +-
 src/backend/replication/slot.c                |   12 +-
 src/backend/storage/buffer/bufmgr.c           |    8 +-
 src/backend/storage/ipc/ipci.c                |    2 +
 src/backend/storage/lmgr/lwlock.c             |    4 +-
 src/backend/storage/lmgr/lwlocknames.txt      |    1 +
 src/backend/storage/smgr/smgr.c               |    4 +-
 src/backend/tcop/postgres.c                   |   41 +-
 src/backend/utils/adt/pgstatfuncs.c           |   95 +-
 src/backend/utils/cache/relcache.c            |    5 +
 src/backend/utils/init/globals.c              |    1 +
 src/backend/utils/init/miscinit.c             |    3 -
 src/backend/utils/init/postinit.c             |   11 +
 src/backend/utils/misc/guc.c                  |   14 +-
 src/backend/utils/misc/postgresql.conf.sample |    2 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |    4 +-
 src/include/miscadmin.h                       |    3 +-
 src/include/pgstat.h                          |  727 +-
 src/include/storage/lwlock.h                  |    1 +
 src/include/utils/guc_tables.h                |    2 +-
 src/include/utils/timeout.h                   |    1 +
 35 files changed, 2791 insertions(+), 4553 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 10ddde4ecf..98df20ce96 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1086,8 +1086,8 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
                  * our own.  In this case we should count and sample the row,
                  * to accommodate users who load a table and analyze it in one
                  * transaction.  (pgstat_report_analyze has to adjust the
-                 * numbers we send to the stats collector to make this come
-                 * out right.)
+                 * numbers we report to the activity stats facility to make
+                 * this come out right.)
                  */
                 if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(targtuple->t_data)))
                 {
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f3d2265fad..ceccb7e19e 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -598,7 +598,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
                         new_min_multi,
                         false);
 
-    /* report results to the stats collector, too */
+    /* report results to the activity stats facility, too */
     pgstat_report_vacuum(RelationGetRelid(onerel),
                          onerel->rd_rel->relisshared,
                          Max(new_live_tuples, 0),
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ede93ad7fd..5d775ba84c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2207,7 +2207,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
                     WriteRqst.Flush = 0;
                     XLogWrite(WriteRqst, false);
                     LWLockRelease(WALWriteLock);
-                    WalStats.m_wal_buffers_full++;
+                    WalStats.wal_buffers_full++;
                     TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
                 }
                 /* Re-acquire WALBufMappingLock and retry */
@@ -8592,8 +8592,8 @@ LogCheckpointEnd(bool restartpoint)
                                                  CheckpointStats.ckpt_sync_end_t);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time += write_msecs;
-    BgWriterStats.m_checkpoint_sync_time += sync_msecs;
+    CheckPointerStats.checkpoint_write_time += write_msecs;
+    CheckPointerStats.checkpoint_sync_time += sync_msecs;
 
     /*
      * All of the published timing statistics are accounted for.  Only
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index cffbc0ac38..3f55c34909 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1854,28 +1854,10 @@ index_concurrently_swap(Oid newIndexId, Oid oldIndexId, const char *oldName)
 
     /*
      * Copy over statistics from old to new index
+     * The data will be flushed by the next pgstat_report_stat()
+     * call.
      */
-    {
-        PgStat_StatTabEntry *tabentry;
-
-        tabentry = pgstat_fetch_stat_tabentry(oldIndexId);
-        if (tabentry)
-        {
-            if (newClassRel->pgstat_info)
-            {
-                newClassRel->pgstat_info->t_counts.t_numscans = tabentry->numscans;
-                newClassRel->pgstat_info->t_counts.t_tuples_returned = tabentry->tuples_returned;
-                newClassRel->pgstat_info->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_hit = tabentry->blocks_hit;
-
-                /*
-                 * The data will be sent by the next pgstat_report_stat()
-                 * call.
-                 */
-            }
-        }
-    }
+    pgstat_copy_index_counters(oldIndexId, newClassRel->pgstat_info);
 
     /* Copy data of pg_statistic from the old index to the new one */
     CopyStatistics(oldIndexId, newIndexId);
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 7295cf0215..308b4ab034 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -644,10 +644,10 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
     }
 
     /*
-     * Report ANALYZE to the stats collector, too.  However, if doing
-     * inherited stats we shouldn't report, because the stats collector only
-     * tracks per-table stats.  Reset the changes_since_analyze counter only
-     * if we analyzed all columns; otherwise, there is still work for
+     * Report ANALYZE to the activity stats facility, too.  However, if doing
+     * inherited stats we shouldn't report, because the activity stats facility
+     * only tracks per-table stats.  Reset the changes_since_analyze counter
+     * only if we analyzed all columns; otherwise, there is still work for
      * auto-analyze to do.
      */
     if (!inh)
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 2b159b60eb..acea4de382 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -971,7 +971,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
     DropDatabaseBuffers(db_id);
 
     /*
-     * Tell the stats collector to forget it immediately, too.
+     * Tell the activity stats facility to forget it immediately, too.
      */
     pgstat_drop_database(db_id);
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index c5c25ce11d..1464b97c7f 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -336,10 +336,10 @@ ExecRefreshMatView(RefreshMatViewStmt *stmt, const char *queryString,
         refresh_by_heap_swap(matviewOid, OIDNewHeap, relpersistence);
 
         /*
-         * Inform stats collector about our activity: basically, we truncated
-         * the matview and inserted some new data.  (The concurrent code path
-         * above doesn't need to worry about this because the inserts and
-         * deletes it issues get counted by lower-level code.)
+         * Inform activity stats facility about our activity: basically, we
+         * truncated the matview and inserted some new data.  (The concurrent
+         * code path above doesn't need to worry about this because the inserts
+         * and deletes it issues get counted by lower-level code.)
          */
         pgstat_count_truncate(matviewRel);
         if (!stmt->skipData)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index b97d48ee01..e80ab5edd0 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -319,8 +319,8 @@ vacuum(List *relations, VacuumParams *params,
                  errmsg("VACUUM option DISABLE_PAGE_SKIPPING cannot be used with FULL")));
 
     /*
-     * Send info about dead objects to the statistics collector, unless we are
-     * in autovacuum --- autovacuum.c does this for itself.
+     * Send info about dead objects to the activity statistics facility, unless
+     * we are in autovacuum --- autovacuum.c does this for itself.
      */
     if ((params->options & VACOPT_VACUUM) && !IsAutoVacuumWorkerProcess())
         pgstat_vacuum_stat();
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 47e60ca561..194a02024b 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -342,9 +342,6 @@ static void autovacuum_do_vac_analyze(autovac_table *tab,
                                       BufferAccessStrategy bstrategy);
 static AutoVacOpts *extract_autovac_opts(HeapTuple tup,
                                          TupleDesc pg_class_desc);
-static PgStat_StatTabEntry *get_pgstat_tabentry_relid(Oid relid, bool isshared,
-                                                      PgStat_StatDBEntry *shared,
-                                                      PgStat_StatDBEntry *dbentry);
 static void perform_work_item(AutoVacuumWorkItem *workitem);
 static void autovac_report_activity(autovac_table *tab);
 static void autovac_report_workitem(AutoVacuumWorkItem *workitem,
@@ -1684,12 +1681,12 @@ AutoVacWorkerMain(int argc, char *argv[])
         char        dbname[NAMEDATALEN];
 
         /*
-         * Report autovac startup to the stats collector.  We deliberately do
-         * this before InitPostgres, so that the last_autovac_time will get
-         * updated even if the connection attempt fails.  This is to prevent
-         * autovac from getting "stuck" repeatedly selecting an unopenable
-         * database, rather than making any progress on stuff it can connect
-         * to.
+         * Report autovac startup to the activity stats facility.  We
+         * deliberately do this before InitPostgres, so that the
+         * last_autovac_time will get updated even if the connection attempt
+         * fails.  This is to prevent autovac from getting "stuck" repeatedly
+         * selecting an unopenable database, rather than making any progress on
+         * stuff it can connect to.
          */
         pgstat_report_autovac(dbid);
 
@@ -1961,8 +1958,6 @@ do_autovacuum(void)
     HASHCTL        ctl;
     HTAB       *table_toast_map;
     ListCell   *volatile cell;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     BufferAccessStrategy bstrategy;
     ScanKeyData key;
     TupleDesc    pg_class_desc;
@@ -1981,17 +1976,11 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /*
-     * may be NULL if we couldn't find an entry (only happens if we are
-     * forcing a vacuum for anti-wrap purposes).
-     */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
 
     /*
-     * Clean up any dead statistics collector entries for this DB. We always
+     * Clean up any dead activity statistics entries for this DB. We always
      * want to do this exactly once per DB-processing cycle, even if we find
      * nothing worth vacuuming in the database.
      */
@@ -2034,9 +2023,6 @@ do_autovacuum(void)
     /* StartTransactionCommand changed elsewhere */
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-
     classRel = table_open(RelationRelationId, AccessShareLock);
 
     /* create a copy so we can use it after closing pg_class */
@@ -2114,8 +2100,8 @@ do_autovacuum(void)
 
         /* Fetch reloptions and the pgstat entry for this table */
         relopts = extract_autovac_opts(tuple, pg_class_desc);
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                       relid);
 
         /* Check if it needs vacuum or analyze */
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
@@ -2198,8 +2184,8 @@ do_autovacuum(void)
         }
 
         /* Fetch the pgstat entry for this table */
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                       relid);
 
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
@@ -2758,29 +2744,6 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
     return av;
 }
 
-/*
- * get_pgstat_tabentry_relid
- *
- * Fetch the pgstat entry of a table, either local to a database or shared.
- */
-static PgStat_StatTabEntry *
-get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
-                          PgStat_StatDBEntry *dbentry)
-{
-    PgStat_StatTabEntry *tabentry = NULL;
-
-    if (isshared)
-    {
-        if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
-    }
-    else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
-
-    return tabentry;
-}
 
 /*
  * table_recheck_autovac
@@ -2984,17 +2947,10 @@ recheck_relation_needs_vacanalyze(Oid relid,
                                   bool *wraparound)
 {
     PgStat_StatTabEntry *tabentry;
-    PgStat_StatDBEntry *shared = NULL;
-    PgStat_StatDBEntry *dbentry = NULL;
-
-    if (classForm->relisshared)
-        shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    else
-        dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
 
     /* fetch the pgstat table entry */
-    tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                         shared, dbentry);
+    tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                   relid);
 
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
@@ -3024,7 +2980,7 @@ recheck_relation_needs_vacanalyze(Oid relid,
  *
  * For analyze, the analysis done is that the number of tuples inserted,
  * deleted and updated since the last analyze exceeds a threshold calculated
- * in the same fashion as above.  Note that the collector actually stores
+ * in the same fashion as above.  Note that the activity stats facility stores
  * the number of tuples (both live and dead) that there were as of the last
  * analyze.  This is asymmetric to the VACUUM case.
  *
@@ -3034,8 +2990,8 @@ recheck_relation_needs_vacanalyze(Oid relid,
  *
  * A table whose autovacuum_enabled option is false is
  * automatically skipped (unless we have to vacuum it due to freeze_max_age).
- * Thus autovacuum can be disabled for specific tables. Also, when the stats
- * collector does not have data about a table, it will be skipped.
+ * Thus autovacuum can be disabled for specific tables. Also, when the activity
+ * stats facility does not have data about a table, it will be skipped.
  *
  * A table whose vac_base_thresh value is < 0 takes the base value from the
  * autovacuum_vacuum_threshold GUC variable.  Similarly, a vac_scale_factor
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 715d5195bb..679992dc89 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -244,9 +244,9 @@ BackgroundWriterMain(void)
         can_hibernate = BgBufferSync(&wb_context);
 
         /*
-         * Send off activity statistics to the stats collector
+         * Send off activity statistics to the activity stats facility
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 54a818bf61..309a9997e1 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -360,7 +360,7 @@ CheckpointerMain(void)
         if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
         {
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            CheckPointerStats.requested_checkpoints++;
         }
 
         /*
@@ -374,7 +374,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                CheckPointerStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -495,17 +495,11 @@ CheckpointerMain(void)
         /* Check for archive_timeout and switch xlog files if necessary. */
         CheckArchiveTimeout();
 
-        /*
-         * Send off activity statistics to the stats collector.  (The reason
-         * why we re-use bgwriter-related code for this is that the bgwriter
-         * and checkpointer used to be just one process.  It's probably not
-         * worth the trouble to split the stats support into two independent
-         * stats message types.)
-         */
-        pgstat_send_bgwriter();
+        /* Send off activity statistics to the activity stats facility. */
+        pgstat_report_checkpointer();
 
-        /* Send WAL statistics to the stats collector. */
+        /* Send WAL statistics to the activity stats facility. */
-        pgstat_send_wal();
+        pgstat_report_wal();
 
         /*
          * If any checkpoint flags have been set, redo the loop to handle the
@@ -711,9 +705,9 @@ CheckpointWriteDelay(int flags, double progress)
         CheckArchiveTimeout();
 
         /*
-         * Report interim activity statistics to the stats collector.
+         * Report interim activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_report_checkpointer();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1257,8 +1251,10 @@ AbsorbSyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    CheckPointerStats.buf_written_backend
+        += CheckpointerShmem->num_backend_writes;
+    CheckPointerStats.buf_fsync_backend
+        += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 9a2e21bf86..504443dcc0 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -372,20 +372,20 @@ pgarch_ArchiverCopyLoop(void)
                 pgarch_archiveDone(xlog);
 
                 /*
-                 * Tell the collector about the WAL file that we successfully
-                 * archived
+                 * Tell the activity statistics facility about the WAL file
+                 * that we successfully archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_report_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
             else
             {
                 /*
-                 * Tell the collector about the WAL file that we failed to
-                 * archive
+                 * Tell the activity statistics facility about the WAL file
+                 * that we failed to archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_report_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3f24a33ef1..ecf9d9adcc 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,22 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Activity Statistics facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects activity statistics, e.g. per-table access statistics, of
+ *  all backends in shared memory. The activity numbers are first stored
+ *  locally in each process, then flushed to shared memory at commit
+ *  time or by idle-timeout.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ * To avoid congestion on the shared memory, shared stats are updated no more
+ * often than once per PGSTAT_MIN_INTERVAL (10000ms). If local numbers remain
+ * unflushed due to lock failure, we retry at an interval that starts at
+ * PGSTAT_RETRY_MIN_INTERVAL (1000ms) and doubles at every retry. We force the
+ * update once PGSTAT_MAX_INTERVAL (60000ms) has passed since the first try.
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the activity statistics facility creates the
+ *  area and loads the stored stats file, if any. At shutdown, the last process
+ *  writes the shared stats to the file and destroys the area before exiting.
  *
  *    Copyright (c) 2001-2021, PostgreSQL Global Development Group
  *
@@ -19,18 +26,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -40,13 +35,9 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
-#include "executor/instrument.h"
+#include "common/hashfn.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/fork_process.h"
@@ -54,20 +45,16 @@
 #include "postmaster/postmaster.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
+#include "storage/condition_variable.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
 #include "utils/timestamp.h"
 
@@ -75,35 +62,20 @@
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
+#define PGSTAT_MIN_INTERVAL            10000    /* Minimum interval of stats data
+                                             * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_RETRY_MIN_INTERVAL    1000 /* Initial retry interval after
+                                          * PGSTAT_MIN_INTERVAL */
 
+#define PGSTAT_MAX_INTERVAL            60000    /* Longest interval of stats data
+                                             * updates */
 
 /* ----------
- * The initial size hints for the hash tables used in the collector.
+ * The initial size hints for the hash tables used in the activity statistics.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
-#define PGSTAT_TAB_HASH_SIZE    512
+#define PGSTAT_TABLE_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
 
@@ -118,7 +90,6 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
-
 /* ----------
  * GUC parameters
  * ----------
@@ -133,17 +104,11 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used; to be removed together with the GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
-/*
- * BgWriter and WAL global statistics counters.
- * Stored directly in a stats message structure so they can be sent
- * without needing to copy things around.  We assume these init to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
-PgStat_MsgWal WalStats;
-
 /*
  * WAL usage counters saved from pgWALUsage at the previous call to
  * pgstat_send_wal(). This is used to calculate how much WAL usage
@@ -170,73 +135,246 @@ static const char *const slru_names[] = {
 
 #define SLRU_NUM_ELEMENTS    lengthof(slru_names)
 
+/* struct for shared SLRU stats */
+typedef struct PgStatSharedSLRUStats
+{
+    PgStat_SLRUStats        entry[SLRU_NUM_ELEMENTS];
+    LWLock                    lock;
+    pg_atomic_uint32        changecount;
+} PgStatSharedSLRUStats;
+
+StaticAssertDecl(sizeof(TimestampTz) == sizeof(pg_atomic_uint64),
+                 "size of pg_atomic_uint64 doesn't match TimestampTz");
+
+typedef struct StatsShmemStruct
+{
+    dsa_handle    stats_dsa_handle;    /* handle for stats data area */
+    dshash_table_handle hash_handle;    /* shared dbstat hash */
+    int            refcount;            /* # of processes attached to the
+                                     * shared stats memory */
+    /* Global stats structs */
+    PgStat_Archiver            archiver_stats;
+    pg_atomic_uint32        archiver_changecount;
+    PgStat_BgWriter            bgwriter_stats;
+    pg_atomic_uint32        bgwriter_changecount;
+    PgStat_CheckPointer        checkpointer_stats;
+    pg_atomic_uint32        checkpointer_changecount;
+    PgStat_Wal                wal_stats;
+    LWLock                    wal_stats_lock;
+    PgStatSharedSLRUStats    slru_stats;
+    pg_atomic_uint32        slru_changecount;
+    pg_atomic_uint64        stats_timestamp;
+
+    /* Reset offsets, protected by StatsLock */
+    PgStat_Archiver            archiver_reset_offset;
+    PgStat_BgWriter            bgwriter_reset_offset;
+    PgStat_CheckPointer        checkpointer_reset_offset;
+
+    /* file read/write protection */
+    bool                    attach_holdoff;
+    ConditionVariable        holdoff_cv;
+
+    pg_atomic_uint64        gc_count; /* # of entries deleted. not
+                                            * protected by StatsLock */
+} StatsShmemStruct;
+
+/* BgWriter global statistics counters */
+PgStat_BgWriter BgWriterStats = {0};
+
+/* CheckPointer global statistics counters */
+PgStat_CheckPointer CheckPointerStats = {0};
+
+/* WAL global statistics counters */
+PgStat_Wal WalStats = {0};
+
 /*
- * SLRU statistics counts waiting to be sent to the collector.  These are
- * stored directly in stats message format so they can be sent without needing
- * to copy things around.  We assume this variable inits to zeroes.  Entries
- * are one-to-one with slru_names[].
+ * XXXX: always try to flush WAL stats. We don't want to manipulate another
+ * counter during XLogInsert so we don't have an efficient shortcut to know
+ * whether any counter gets incremented.
  */
-static PgStat_MsgSLRU SLRUStats[SLRU_NUM_ELEMENTS];
+static inline bool
+walstats_pending(void)
+{
+    static const PgStat_Wal all_zeroes;
+
+    return memcmp(&WalStats, &all_zeroes,
+                  offsetof(PgStat_Wal, stat_reset_timestamp)) != 0;
+}
+
+/*
+ * SLRU statistics counts waiting to be written to the shared activity
+ * statistics.  We assume this variable inits to zeroes.  Entries are
+ * one-to-one with slru_names[].
+ * Changes of SLRU counters are reported within critical sections so we use
+ * static memory in order to avoid memory allocation.
+ */
+static PgStat_SLRUStats local_SLRUStats[SLRU_NUM_ELEMENTS];
+static bool     have_slrustats = false;
 
 /* ----------
  * Local data
  * ----------
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
-
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
+/* backend-lifetime storage */
+static StatsShmemStruct *StatsShmem = NULL;
+static dsa_area *area = NULL;
 
 /*
- * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
- *
- * NOTE: once allocated, TabStatusArray structures are never moved or deleted
- * for the life of the backend.  Also, we zero out the t_id fields of the
- * contained PgStat_TableStatus structs whenever they are not actively in use.
- * This allows relcache pgstat_info pointers to be treated as long-lived data,
- * avoiding repeated searches in pgstat_initstats() when a relation is
- * repeatedly opened during a transaction.
+ * Types that define the shared statistics structure.
+ *
+ * Per-object statistics are stored in "shared stats" entries: structs in
+ * DSA-allocated memory that begin with a header common to all object types.
+ * Each shared stats entry is pointed to from a dshash entry via a dsa_pointer.
+ * This structure keeps shared stats entries from moving when the dshash is
+ * resized, lets a backend point to them via a native pointer, and allows
+ * locking at the stats-entry level. Per-entry locking reduces lock contention
+ * compared to the partition locks of dshash. A backend accumulates stats
+ * numbers in a stats entry in local memory, then flushes the numbers to the
+ * shared stats entries, basically at transaction end.
+ *
+ * Each stats entry type has a fixed member PgStat_StatEntryHeader as its first
+ * element.
+ *
+ * Shared stats are stored as:
+ *
+ * dshash pgStatSharedHash
+ *    -> PgStatHashEntry                    (dshash entry)
+ *      (dsa_pointer)-> PgStat_Stat*Entry    (dsa memory block)
+ *
+ * Shared stats entries are pointed to directly from the pgstat_localhash hash:
+ *
+ * pgstat_localhash pgStatEntHash
+ *    -> PgStatLocalHashEntry                (equivalent of PgStatHashEntry)
+ *      (native pointer)-> PgStat_Stat*Entry (dsa memory block)
+ *
+ * Local stats waiting to be flushed to shared stats are stored as:
+ *
+ * pgstat_localhash pgStatLocalHash
+ *    -> PgStatLocalHashEntry                 (local hash entry)
+ *      (native pointer)-> PgStat_Stat*Entry/TableStatus (palloc'ed memory)
  */
-#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
 
-typedef struct TabStatusArray
+/* The types of statistics entries */
+typedef enum PgStatTypes
 {
-    struct TabStatusArray *tsa_next;    /* link to next array, if any */
-    int            tsa_used;        /* # entries currently used */
-    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
-} TabStatusArray;
-
-static TabStatusArray *pgStatTabList = NULL;
+    PGSTAT_TYPE_DB,            /* database-wide statistics */
+    PGSTAT_TYPE_TABLE,        /* per-table statistics */
+    PGSTAT_TYPE_FUNCTION,    /* per-function statistics */
+    PGSTAT_TYPE_REPLSLOT    /* per-replication-slot statistics */
+} PgStatTypes;
 
 /*
- * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
+ * Lookup table of the sizes of shared statistics entries, indexed by
+ * PgStatTypes
  */
-typedef struct TabStatHashEntry
+static const size_t pgstat_sharedentsize[] =
 {
-    Oid            t_id;
-    PgStat_TableStatus *tsa_entry;
-} TabStatHashEntry;
+    sizeof(PgStat_StatDBEntry),     /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_StatTabEntry),    /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_StatFuncEntry),    /* PGSTAT_TYPE_FUNCTION */
+    sizeof(PgStat_ReplSlot)            /* PGSTAT_TYPE_REPLSLOT */
+};
 
-/*
- * Hash table for O(1) t_id -> tsa_entry lookup
- */
-static HTAB *pgStatTabHash = NULL;
+/* Ditto for local statistics entries */
+static const size_t pgstat_localentsize[] =
+{
+    sizeof(PgStat_StatDBEntry),            /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_TableStatus),            /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_BackendFunctionEntry),/* PGSTAT_TYPE_FUNCTION */
+    sizeof(PgStat_ReplSlot)                /* PGSTAT_TYPE_REPLSLOT */
+};
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * We should avoid overwriting the header part of a shared entry. Use these
+ * macros to find the portion of the struct to be written or read.
+ * PGSTAT_SHENT_BODY may return an address slightly smaller than the actual
+ * address of the next member, but that doesn't matter.
  */
-static HTAB *pgStatFunctions = NULL;
+#define PGSTAT_SHENT_BODY(e) (((char *)(e)) + sizeof(PgStat_StatEntryHeader))
+#define PGSTAT_SHENT_BODY_LEN(t) \
+    (pgstat_sharedentsize[t] - sizeof(PgStat_StatEntryHeader))
+
+/* struct for shared statistics hash entry key. */
+typedef struct PgStatHashKey
+{
+    PgStatTypes type;        /* statistics entry type */
+    Oid            databaseid;    /* database ID. InvalidOid for shared objects. */
+    Oid            objectid;    /* object ID, either table or function. */
+} PgStatHashKey;
+
+/* struct for shared statistics hash entry */
+typedef struct PgStatHashEntry
+{
+    PgStatHashKey    key; /* hash key */
+    dsa_pointer        body;/* pointer to shared stats in PgStat_StatEntryHeader */
+} PgStatHashEntry;
+
+/* struct for shared statistics local hash entry. */
+typedef struct PgStatLocalHashEntry
+{
+    PgStatHashKey            key;    /* hash key */
+    char                    status;    /* for simplehash use */
+    PgStat_StatEntryHeader *body;    /* pointer to stats body in local heap */
+    dsa_pointer                dsapointer; /* dsa pointer of body */
+} PgStatLocalHashEntry;
+
+/* parameters for the shared hash */
+static const dshash_parameters dsh_params = {
+    sizeof(PgStatHashKey),
+    sizeof(PgStatHashEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+
+/* define hashtable for local hashes */
+#define SH_PREFIX pgstat_localhash
+#define SH_ELEMENT_TYPE PgStatLocalHashEntry
+#define SH_KEY_TYPE PgStatHashKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+    hash_bytes((unsigned char *)&key, sizeof(PgStatHashKey))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(PgStatHashKey)) == 0)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/* The shared hash to index activity stats entries. */
+static dshash_table *pgStatSharedHash = NULL;
 
 /*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
+ * The local cache to index shared stats entries.
+ *
+ * This is a local hash to store native pointers to shared hash
+ * entries. pgStatEntHashAge is copied from StatsShmem->gc_count at creation
+ * and garbage collection.
  */
-static bool have_function_stats = false;
+static pgstat_localhash_hash *pgStatEntHash = NULL;
+static int     pgStatEntHashAge = 0;        /* cache age of pgStatEntHash */
+
+/* Local stats numbers are stored here. */
+static pgstat_localhash_hash *pgStatLocalHash = NULL;
+
+/* entry type for oid hash */
+typedef struct pgstat_oident
+{
+    Oid oid;
+    char status;
+} pgstat_oident;
+
+/* Define hashtable for OID hashes. */
+StaticAssertDecl(sizeof(Oid) == 4, "oid is not compatible with uint32");
+#define SH_PREFIX pgstat_oid
+#define SH_ELEMENT_TYPE pgstat_oident
+#define SH_KEY_TYPE Oid
+#define SH_KEY oid
+#define SH_HASH_KEY(tb, key) hash_bytes_uint32(key)
+#define SH_EQUAL(tb, a, b) (a == b)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
 
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
@@ -273,11 +411,8 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -286,23 +421,9 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Make our own memory context to make it easy to track memory usage.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-static PgStat_WalStats walStats;
-static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
-static PgStat_ReplSlotStats *replSlotStats;
-static int    nReplSlotStats;
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
+MemoryContext    pgStatCacheContext = NULL;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -311,40 +432,57 @@ static List *pending_write_requests = NIL;
  */
 static instr_time total_func_time;
 
+/* Simple caching feature for pgstatfuncs */
+static PgStatHashKey    stathashkey_zero = {0};
+static PgStatHashKey        cached_dbent_key = {0};
+static PgStat_StatDBEntry    cached_dbent;
+static PgStatHashKey        cached_tabent_key = {0};
+static PgStat_StatTabEntry    cached_tabent;
+static PgStatHashKey        cached_funcent_key = {0};
+static PgStat_StatFuncEntry    cached_funcent;
+
+static PgStat_Archiver        cached_archiverstats;
+static bool                    cached_archiverstats_is_valid = false;
+static PgStat_BgWriter        cached_bgwriterstats;
+static bool                    cached_bgwriterstats_is_valid = false;
+static PgStat_CheckPointer    cached_checkpointerstats;
+static bool                    cached_checkpointerstats_is_valid = false;
+static PgStat_Wal            cached_walstats;
+static bool                    cached_walstats_is_valid = false;
+static PgStat_SLRUStats        cached_slrustats;
+static bool                    cached_slrustats_is_valid = false;
+static PgStat_ReplSlot       *cached_replslotstats = NULL;
+static int                    n_cached_replslotstats = -1;
 
 /* ----------
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgstat_beshutdown_hook(int code, Datum arg);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                                                 Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStat_StatDBEntry *get_local_dbstat_entry(Oid dbid);
+static PgStat_TableStatus *get_local_tabstat_entry(Oid rel_id, bool isshared);
+
+static void pgstat_write_statsfile(void);
+
+static void pgstat_read_statsfile(void);
 static void pgstat_read_current_status(void);
 
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
+static PgStat_StatEntryHeader * get_stat_entry(PgStatTypes type, Oid dbid,
+                                               Oid objid, bool nowait,
+                                               bool create, bool *found);
 
-static int    pgstat_replslot_index(const char *name, bool create_it);
-static void pgstat_reset_replslot(int i, TimestampTz ts);
+static bool flush_tabstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_funcstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_dbstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_walstat(bool nowait);
+static bool flush_slrustat(bool nowait);
+static void delete_current_stats_entry(dshash_seq_status *hstat);
+static PgStat_StatEntryHeader * get_local_stat_entry(PgStatTypes type, Oid dbid,
+                                                     Oid objid, bool create,
+                                                     bool *found);
 
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
-static void pgstat_send_slru(void);
-static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
-
-static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
+static pgstat_oid_hash *collect_oids(Oid catalogid, AttrNumber anum_oid);
 
 static void pgstat_setup_memcxt(void);
 
@@ -354,491 +492,645 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len);
-static void pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
-static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_replslot(PgStat_MsgReplSlot *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute shared memory space needed for activity statistics
+ */
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+/*
+ * StatsShmemInit - initialize during shared-memory creation
  */
 void
-pgstat_init(void)
+StatsShmemInit(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
+    bool        found;
 
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(), &found);
 
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+        ConditionVariableInit(&StatsShmem->holdoff_cv);
+        pg_atomic_init_u32(&StatsShmem->archiver_changecount, 0);
+        pg_atomic_init_u32(&StatsShmem->bgwriter_changecount, 0);
+        pg_atomic_init_u32(&StatsShmem->checkpointer_changecount, 0);
+
+        pg_atomic_init_u64(&StatsShmem->gc_count, 0);
+
+        LWLockInitialize(&StatsShmem->wal_stats_lock, LWTRANCHE_STATS);
+    }
+}
+
+/* ----------
+ * allow_next_attacher() -
+ *
+ *  Let other processes go ahead and attach to the shared stats area.
+ * ----------
+ */
+static void
+allow_next_attacher(void)
+{
+    bool triggered = false;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (StatsShmem->attach_holdoff)
     {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
+        StatsShmem->attach_holdoff = false;
+        triggered = true;
     }
+    LWLockRelease(StatsLock);
+
+    if (triggered)
+        ConditionVariableBroadcast(&StatsShmem->holdoff_cv);
+}
+
+/* ----------
+ * attach_shared_stats() -
+ *
+ *    Attach to the shared stats memory, creating it if needed. If we are the
+ *    first to use the activity stats system, read the saved stats file if any.
+ * ----------
+ */
+static void
+attach_shared_stats(void)
+{
+    MemoryContext oldcontext;
 
     /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
+     * Don't use DSM outside a postmaster child, or when not tracking counts.
      */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
 
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
+    pgstat_setup_memcxt();
 
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
+    if (area)
+        return;
 
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    /* stats shared memory persists for the backend lifetime */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
 
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    /*
+     * The first attacher backend may still be reading the stats file, or the
+     * last detacher may be writing it. Wait for the work to finish.
+     */
+    ConditionVariablePrepareToSleep(&StatsShmem->holdoff_cv);
+    for (;;)
+    {
+        bool hold_off;
 
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        LWLockAcquire(StatsLock, LW_SHARED);
+        hold_off = StatsShmem->attach_holdoff;
+        LWLockRelease(StatsLock);
 
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
+        if (!hold_off)
+            break;
+
+        ConditionVariableTimedSleep(&StatsShmem->holdoff_cv, 10,
+                                    WAIT_EVENT_READING_STATS_FILE);
     }
+    ConditionVariableCancelSleep();
 
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
     /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
+     * The last process is responsible for writing out the stats file at exit.
+     * Maintain refcount so that an exiting process can tell whether it is
+     * the last one or not.
      */
-    if (!pg_set_noblock(pgStatSock))
+    if (StatsShmem->refcount > 0)
+        StatsShmem->refcount++;
+    else
     {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
+        /* We're the first process to attach the shared stats memory */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
+
+        /* Initialize shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatSharedHash = dshash_create(area, &dsh_params, 0);
+
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->hash_handle = dshash_get_hash_table_handle(pgStatSharedHash);
+        LWLockInitialize(&StatsShmem->slru_stats.lock, LWTRANCHE_STATS);
+        pg_atomic_init_u32(&StatsShmem->slru_stats.changecount, 0);
+
+        /* Block the next attacher for a while, see the comment above. */
+        StatsShmem->attach_holdoff = true;
+
+        StatsShmem->refcount = 1;
     }
 
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
+    LWLockRelease(StatsLock);
+
+    if (area)
     {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            ereport(LOG,
-                    (errmsg("getsockopt(%s) failed: %m", "SO_RCVBUF")));
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                ereport(LOG,
-                        (errmsg("setsockopt(%s) failed: %m", "SO_RCVBUF")));
-        }
+        /*
+         * We're the first attacher process; read the stats file while
+         * blocking successors.
+         */
+        Assert(StatsShmem->attach_holdoff);
+        pgstat_read_statsfile();
+        allow_next_attacher();
+    }
+    else
+    {
+        /* We're not the first one, attach existing shared area. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatSharedHash = dshash_attach(area, &dsh_params,
+                                         StatsShmem->hash_handle, 0);
     }
 
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-    /* Now that we have a long-lived socket, tell fd.c about it. */
-    ReserveExternalFD();
+    MemoryContextSwitchTo(oldcontext);
 
-    return;
+    /* don't detach automatically */
+    dsa_pin_mapping(area);
+}
 
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
+/* ----------
+ * cleanup_dropped_stats_entries() -
+ *              Clean up shared stats entries no longer used.
+ *
+ *  Shared stats entries for dropped objects may be left referenced. Clean up
+ *  our reference and drop the shared entry if needed.
+ * ----------
+ */
+static void
+cleanup_dropped_stats_entries(void)
+{
+    pgstat_localhash_iterator i;
+    PgStatLocalHashEntry   *ent;
 
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
+    if (pgStatEntHash == NULL)
+        return;
 
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
+    pgstat_localhash_start_iterate(pgStatEntHash, &i);
+    while ((ent = pgstat_localhash_iterate(pgStatEntHash, &i))
+           != NULL)
+    {
+        /*
+         * Free the shared memory chunk for the entry if we were the last
+         * referrer to a dropped entry.
+         */
+        if (pg_atomic_sub_fetch_u32(&ent->body->refcount, 1) < 1 &&
+            ent->body->dropped)
+            dsa_free(area, ent->dsapointer);
+    }
 
     /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
+     * This function is expected to be called during backend exit. So we don't
+     * bother destroying pgStatEntHash.
      */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    pgStatEntHash = NULL;
 }
 
-/*
- * subroutine for pgstat_reset_all
+/* ----------
+ * detach_shared_stats() -
+ *
+ *    Detach shared stats. Write out to file if we're the last process and told
+ *    to do so.
+ * ----------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+detach_shared_stats(bool write_file)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    bool is_last_detacher = false;
+
+    /* return immediately if there is nothing to detach */
+    if (!area || !IsUnderPostmaster)
+        return;
+
+    /* We shouldn't leave a reference to shared stats. */
+    cleanup_dropped_stats_entries();
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
+    /*
+     * If we are the last detacher, hold off the next attacher (if possible)
+     * until we finish writing the stats file.
+     */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (--StatsShmem->refcount == 0)
     {
-        int            nchars;
-        Oid            tmp_oid;
+        StatsShmem->attach_holdoff = true;
+        is_last_detacher = true;
+    }
+    LWLockRelease(StatsLock);
 
-        /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
-         */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
-
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
-
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+    if (is_last_detacher)
+    {
+        if (write_file)
+            pgstat_write_statsfile();
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+        /* allow the next attacher, if any */
+        allow_next_attacher();
     }
-    FreeDir(dir);
+
+    /*
+     * Detach the area. It is automatically destroyed when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);
+
+    area = NULL;
+    pgStatSharedHash = NULL;
+
+    /* We are going to exit. Don't bother destroying local hashes. */
+    pgStatLocalHash = NULL;
 }
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    /* standalone server doesn't use shared stats */
+    if (!IsUnderPostmaster)
+        return;
+
+    /* we must have shared stats attached */
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
+
+    /* Startup must be the only user of shared stats */
+    Assert(StatsShmem->refcount == 1);
+
+    /*
+     * We could directly remove files and recreate the shared memory area. But
+     * just detach and then re-attach for simplicity.
+     */
+    detach_shared_stats(false);
+    attach_shared_stats();
 }
 
-#ifdef EXEC_BACKEND
 
 /*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
+ * fetch_lock_statentry - common helper function to fetch and lock a stats
+ * entry for flush_tabstat, flush_funcstat and flush_dbstat.
  */
-static pid_t
-pgstat_forkexec(void)
+static PgStat_StatEntryHeader *
+fetch_lock_statentry(PgStatTypes type, Oid dboid, Oid objid, bool nowait)
 {
-    char       *av[10];
-    int            ac = 0;
+    PgStat_StatEntryHeader *header;
+
+    /* find shared table stats entry corresponding to the local entry */
+    header = (PgStat_StatEntryHeader *)
+        get_stat_entry(type, dboid, objid, nowait, true, NULL);
 
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
+    /* skip if dshash failed to acquire lock */
+    if (header == NULL)
+        return NULL;
 
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&header->lock, LW_EXCLUSIVE))
+        return NULL;
 
-    return postmaster_forkexec(ac, av);
+    return header;
 }
-#endif                            /* EXEC_BACKEND */
 
 
-/*
- * pgstat_start() -
+/* ----------
+ * get_stat_entry() -
  *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
+ *    Get the shared stats entry for the specified type, dbid and objid.
+ *  If nowait is true, returns NULL on lock failure.
+ *
+ *  If create is true, a new entry is created if it does not exist yet. If
+ *  found is not NULL, it is set to true if an existing entry is found or
+ *  false if not.
+ * ----------
+ */
+static PgStat_StatEntryHeader *
+get_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait, bool create,
+               bool *found)
+{
+    PgStatHashEntry           *shhashent;
+    PgStatLocalHashEntry   *lohashent;
+    PgStat_StatEntryHeader *shheader = NULL;
+    PgStatHashKey            key;
+    bool                    shfound;
+
+    key.type        = type;
+    key.databaseid     = dbid;
+    key.objectid    = objid;
+
+    if (pgStatEntHash)
+    {
+        uint64 currage;
+
+        /*
+         * StatsShmem->gc_count advances far more slowly than the following
+         * loop runs, so this is expected to iterate no more than twice.
+         */
+        while (unlikely
+               (pgStatEntHashAge !=
+                (currage = pg_atomic_read_u64(&StatsShmem->gc_count))))
+        {
+            pgstat_localhash_iterator i;
+
+            /*
+             * Some entries have been dropped. Invalidate cached pointers to
+             * them.
+             */
+            pgstat_localhash_start_iterate(pgStatEntHash, &i);
+            while ((lohashent = pgstat_localhash_iterate(pgStatEntHash, &i))
+                   != NULL)
+            {
+                PgStat_StatEntryHeader *header = lohashent->body;
+                if (header->dropped)
+                {
+                    pgstat_localhash_delete(pgStatEntHash, key);
+
+                    if (pg_atomic_sub_fetch_u32(&header->refcount, 1) < 1)
+                    {
+                        /*
+                         * We're the last referrer to this entry, drop the
+                         * shared entry.
+                         */
+                        dsa_free(area, lohashent->dsapointer);
+                    }
+                }
+            }
+
+            pgStatEntHashAge = currage;
+        }
+
+        lohashent = pgstat_localhash_lookup(pgStatEntHash, key);
+
+        if (lohashent)
+        {
+            if (found)
+                *found = true;
+            return lohashent->body;
+        }
+    }
+
+    shhashent = dshash_find_extended(pgStatSharedHash, &key,
+                                     create, nowait, create, &shfound);
+    if (shhashent)
+    {
+        if (create && !shfound)
+        {
+            /* Create new stats entry. */
+            dsa_pointer chunk = dsa_allocate0(area,
+                                              pgstat_sharedentsize[type]);
+
+            shheader = dsa_get_address(area, chunk);
+            LWLockInitialize(&shheader->lock, LWTRANCHE_STATS);
+            pg_atomic_init_u32(&shheader->refcount, 0);
+
+            /* Link the new entry from the hash entry. */
+            shhashent->body = chunk;
+        }
+        else
+            shheader = dsa_get_address(area, shhashent->body);
+
+        /*
+         * We expose this shared entry now.  You might think that the entry can
+         * be removed by a concurrent backend, but since we are creating a
+         * stats entry, the object actually exists and is in use in the upper
+         * layer. Such an object cannot be dropped until the first vacuum after
+         * the current transaction ends.
+         */
+        dshash_release_lock(pgStatSharedHash, shhashent);
+
+        /* register to local hash if possible */
+        if (pgStatEntHash || pgStatCacheContext)
+        {
+            bool                    lofound;
+
+            if (pgStatEntHash == NULL)
+            {
+                pgStatEntHash =
+                    pgstat_localhash_create(pgStatCacheContext,
+                                        PGSTAT_TABLE_HASH_SIZE, NULL);
+                pgStatEntHashAge =
+                    pg_atomic_read_u64(&StatsShmem->gc_count);
+            }
+
+            lohashent =
+                pgstat_localhash_insert(pgStatEntHash, key, &lofound);
+
+            Assert(!lofound);
+            lohashent->body = shheader;
+            lohashent->dsapointer = shhashent->body;
+
+            pg_atomic_add_fetch_u32(&shheader->refcount, 1);
+        }
+    }
+
+    if (found)
+        *found = shfound;
+
+    return shheader;
+}
+
+/*
+ * flush_walstat - flush out local WAL stats entries
  *
- *    Returns PID of child process, or 0 if fail.
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
  *
- *    Note: if fail, we will be called again from the postmaster main loop.
+ * Returns true if all local WAL stats are successfully flushed out.
  */
-int
-pgstat_start(void)
+static bool
+flush_walstat(bool nowait)
 {
-    time_t        curtime;
-    pid_t        pgStatPid;
+    PgStat_Wal *s = &StatsShmem->wal_stats;
+    PgStat_Wal *l = &WalStats;
+    WalUsage    all_zeroes = {0} PG_USED_FOR_ASSERTS_ONLY;
+
+    /*
+     * We don't update the WAL usage portion of the local WalStats
+     * elsewhere. Instead, fill in that portion with the difference of
+     * pgWalUsage since the previous call.
+     */
+    Assert(memcmp(&l->wal_usage, &all_zeroes, sizeof(WalUsage)) == 0);
+    WalUsageAccumDiff(&l->wal_usage, &pgWalUsage, &prevWalUsage);
 
     /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
+     * This function can be called even if nothing at all has happened. Avoid
+     * taking lock for nothing in that case.
      */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
+    if (!walstats_pending())
+        return true;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&StatsShmem->wal_stats_lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&StatsShmem->wal_stats_lock,
+                                       LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
+
+    s->wal_usage.wal_records += l->wal_usage.wal_records;
+    s->wal_usage.wal_fpi += l->wal_usage.wal_fpi;
+    s->wal_usage.wal_bytes += l->wal_usage.wal_bytes;
+    s->wal_buffers_full += l->wal_buffers_full;
+    LWLockRelease(&StatsShmem->wal_stats_lock);
 
     /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
+     * Save the current counters for the subsequent calculation of WAL usage.
      */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
+    prevWalUsage = pgWalUsage;
 
     /*
-     * Okay, fork off the collector.
+     * Clear out the statistics buffer, so it can be re-used.
      */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
+    MemSet(&WalStats, 0, sizeof(WalStats));
 
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
+    return true;
+}
 
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
+/*
+ * flush_slrustat - flush out local SLRU stats entries
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true. Writer processes are mutually excluded
+ * using an LWLock, but readers are expected to use the change-count protocol
+ * to avoid interference with writers.
+ *
+ * Returns true if all local SLRU stats are successfully flushed out.
+ */
+static bool
+flush_slrustat(bool nowait)
+{
+    uint32    assert_changecount PG_USED_FOR_ASSERTS_ONLY;
+    int i;
+
+    if (!have_slrustats)
+        return true;
 
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&StatsShmem->slru_stats.lock,
+                                       LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
 
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
+    assert_changecount =
+        pg_atomic_fetch_add_u32(&StatsShmem->slru_stats.changecount, 1);
+    Assert((assert_changecount & 1) == 0);
 
-        default:
-            return (int) pgStatPid;
+    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
+    {
+        PgStat_SLRUStats *sharedent = &StatsShmem->slru_stats.entry[i];
+        PgStat_SLRUStats *localent = &local_SLRUStats[i];
+
+        sharedent->blocks_zeroed += localent->blocks_zeroed;
+        sharedent->blocks_hit += localent->blocks_hit;
+        sharedent->blocks_read += localent->blocks_read;
+        sharedent->blocks_written += localent->blocks_written;
+        sharedent->blocks_exists += localent->blocks_exists;
+        sharedent->flush += localent->flush;
+        sharedent->truncate += localent->truncate;
     }
 
-    /* shouldn't get here */
-    return 0;
+    /* done, clear the local entry */
+    MemSet(local_SLRUStats, 0,
+           sizeof(PgStat_SLRUStats) * SLRU_NUM_ELEMENTS);
+
+    pg_atomic_add_fetch_u32(&StatsShmem->slru_stats.changecount, 1);
+    LWLockRelease(&StatsShmem->slru_stats.lock);
+
+    have_slrustats = false;
+
+    return true;
+}
+
+/* ----------
+ * delete_current_stats_entry()
+ *
+ *  Deletes the given shared entry from shared stats hash. The entry must be
+ *  exclusively locked.
+ * ----------
+ */
+static void
+delete_current_stats_entry(dshash_seq_status *hstat)
+{
+    dsa_pointer                pdsa;
+    PgStat_StatEntryHeader *header;
+    PgStatHashEntry *ent;
+
+    ent = dshash_get_current(hstat);
+    pdsa = ent->body;
+    header = dsa_get_address(area, pdsa);
+
+    /* No one can find this entry from now on. */
+    dshash_delete_current(hstat);
+
+    /*
+     * Let the referrers, if any, drop the entry.  Refcount won't be decremented
+     * until "dropped" is set true and StatsShmem->gc_count is incremented
+     * later. So we can check refcount to set dropped without holding a
+     * lock. If no one is referring to this entry, free it immediately.
+     */
+
+    if (pg_atomic_read_u32(&header->refcount) > 0)
+        header->dropped = true;
+    else
+        dsa_free(area, pdsa);
+
+    return;
 }
 
-void
-allow_immediate_pgstat_restart(void)
+/* ----------
+ * get_local_stat_entry() -
+ *
+ *  Returns the local stats entry for the type, dbid and objid.
+ *  If create is true, a new entry is created if it does not exist yet; found
+ *  must be non-NULL in that case.
+ *
+ *
+ *  The caller is responsible for initializing the statsbody part of the
+ *  returned memory.
+ * ----------
+ */
+static PgStat_StatEntryHeader *
+get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                     bool create, bool *found)
 {
-    last_pgstat_start_time = 0;
+    PgStatHashKey key;
+    PgStatLocalHashEntry *entry;
+
+    if (pgStatLocalHash == NULL)
+        pgStatLocalHash = pgstat_localhash_create(pgStatCacheContext,
+                                                  PGSTAT_TABLE_HASH_SIZE, NULL);
+
+    /* Find an entry or create a new one. */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    if (create)
+        entry = pgstat_localhash_insert(pgStatLocalHash, key, found);
+    else
+        entry = pgstat_localhash_lookup(pgStatLocalHash, key);
+
+    if (!create && !entry)
+        return NULL;
+
+    if (create && !*found)
+        entry->body = MemoryContextAllocZero(TopMemoryContext,
+                                             pgstat_localentsize[type]);
+
+    return entry->body;
 }
 
 /* ------------------------------------------------------------
@@ -846,150 +1138,399 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *    Updates are applied no more frequently than once every
+ *    PGSTAT_MIN_INTERVAL milliseconds. They are also postponed on lock
+ *    failure if force is false and no update has been pending for longer than
+ *    PGSTAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
+ *    succeeding calls of this function.
+ *
+ *    Returns the time in milliseconds until the next point at which updates
+ *    will be applied, if no updates have been held for more than
+ *    PGSTAT_MIN_INTERVAL milliseconds.
+ *
+ *    Note that this is called only outside a transaction, so it is fine to use
  *    transaction stop time as an approximation of current time.
- * ----------
+ *    ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz    next_flush = 0;
+    static TimestampTz    pending_since = 0;
+    static long            retry_interval = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
+    bool        nowait;
     int            i;
+    uint64        oldval;
+
+    /* Return if not active */
+    if (area == NULL)
+        return 0;
+
+    /*
+     * We need a database entry if any of the following stats exist.
+     */
+    if (pgStatXactCommit > 0 || pgStatXactRollback > 0 ||
+        pgStatBlockReadTime > 0 || pgStatBlockWriteTime > 0)
+        get_local_dbstat_entry(MyDatabaseId);
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (pgStatLocalHash == NULL && !have_slrustats && !walstats_pending())
+        return 0;
 
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
     now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
-
-    /*
-     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
-     */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
-    pgStatTabHash = NULL;
-
-    /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
-     */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
-    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+
+    if (!force)
     {
-        for (i = 0; i < tsa->tsa_used; i++)
+        /*
+         * Don't flush stats too frequently.  Return the time to the next
+         * flush.
+         */
+        if (now < next_flush)
         {
-            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
-
-            /* Shouldn't have any pending transaction-dependent counts */
-            Assert(entry->trans == NULL);
-
-            /*
-             * Ignore entries that didn't accumulate any actual counts, such
-             * as indexes that were opened by the planner but not used.
-             */
-            if (memcmp(&entry->t_counts, &all_zeroes,
-                       sizeof(PgStat_TableCounts)) == 0)
+            /* Remember when updates first became pending. */
+            if (pending_since == 0)
+                pending_since = now;
+
+            return (next_flush - now) / 1000;
+        }
+
+        /* But, don't keep pending updates longer than PGSTAT_MAX_INTERVAL. */
+
+        if (pending_since > 0 &&
+            TimestampDifferenceExceeds(pending_since, now, PGSTAT_MAX_INTERVAL))
+            force = true;
+    }
+
+    /* don't wait for lock acquisition when !force */
+    nowait = !force;
+
+    if (pgStatLocalHash)
+    {
+        int            remains = 0;
+        pgstat_localhash_iterator i;
+        List                   *dbentlist = NIL;
+        ListCell               *lc;
+        PgStatLocalHashEntry   *lent;
+
+        /* Step 1: flush out everything other than database stats */
+        pgstat_localhash_start_iterate(pgStatLocalHash, &i);
+        while ((lent = pgstat_localhash_iterate(pgStatLocalHash, &i)) != NULL)
+        {
+            bool        remove = false;
+
+            switch (lent->key.type)
+            {
+                case PGSTAT_TYPE_DB:
+                    /*
+                     * flush_tabstat adds some of the counters of flushed
+                     * entries to the local database stats. Just remember the
+                     * database entries for now and flush them out later.
+                     */
+                    dbentlist = lappend(dbentlist, lent);
+                    break;
+                case PGSTAT_TYPE_TABLE:
+                    if (flush_tabstat(lent, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_FUNCTION:
+                    if (flush_funcstat(lent, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_REPLSLOT:
+                    /* We don't have that kind of local entry */
+                    Assert(false);
+            }
+
+            if (!remove)
+            {
+                remains++;
                 continue;
+            }
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* Remove the successfully flushed entry */
+            pfree(lent->body);
+            lent->body = NULL;
+            pgstat_localhash_delete(pgStatLocalHash, lent->key);
+        }
+
+        /* Step 2: flush out database stats */
+        foreach(lc, dbentlist)
+        {
+            PgStatLocalHashEntry *lent = (PgStatLocalHashEntry *) lfirst(lc);
+
+            if (flush_dbstat(lent, nowait))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                remains--;
+                /* Remove the successfully flushed entry */
+                pfree(lent->body);
+                lent->body = NULL;
+                pgstat_localhash_delete(pgStatLocalHash, lent->key);
             }
         }
-        /* zero out PgStat_TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
+        list_free(dbentlist);
+        dbentlist = NULL;
+
+        if (remains <= 0)
+        {
+            pgstat_localhash_destroy(pgStatLocalHash);
+            pgStatLocalHash = NULL;
+        }
+    }
+
+    /* flush wal stats */
+    flush_walstat(nowait);
+
+    /* flush SLRU stats */
+    flush_slrustat(nowait);
+
+    /*
+     * Publish the time of the last flush, but don't notify readers of the
+     * change of the timestamp itself. Readers will get a sufficiently recent
+     * timestamp. If we fail to update the value, concurrent processes should
+     * have updated it to a sufficiently recent time.
+     *
+     * XXX: The loop might be unnecessary for the reason above.
+     */
+    oldval = pg_atomic_read_u64(&StatsShmem->stats_timestamp);
+
+    for (i = 0 ; i < 10 ; i++)
+    {
+        if (oldval >= now ||
+            pg_atomic_compare_exchange_u64(&StatsShmem->stats_timestamp,
+                                           &oldval, (uint64) now))
+            break;
     }
 
     /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
+     * Some of the local stats may not have been flushed due to lock
+     * contention.  If we have such pending local stats here, let the caller
+     * know the retry interval.
      */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    if (pgStatLocalHash != NULL || have_slrustats || walstats_pending())
+    {
+        /* Remember when updates first became pending */
+        if (pending_since == 0)
+            pending_since = now;
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+        /* The interval is doubled at every retry. */
+        if (retry_interval == 0)
+            retry_interval = PGSTAT_RETRY_MIN_INTERVAL * 1000;
+        else
+            retry_interval = retry_interval * 2;
 
-    /* Send WAL statistics */
-    pgstat_send_wal();
+        /*
+         * Determine the next retry interval so as not to get shorter than the
+         * previous interval.
+         */
+        if (!TimestampDifferenceExceeds(pending_since,
+                                        now + 2 * retry_interval,
+                                        PGSTAT_MAX_INTERVAL))
+            next_flush = now + retry_interval;
+        else
+        {
+            next_flush = pending_since + PGSTAT_MAX_INTERVAL * 1000;
+            retry_interval = next_flush - now;
+        }
 
-    /* Finally send SLRU statistics */
-    pgstat_send_slru();
+        return retry_interval / 1000;
+    }
+
+    /* Set the next time to update stats */
+    next_flush = now + PGSTAT_MIN_INTERVAL * 1000;
+    retry_interval = 0;
+    pending_since = 0;
+
+    return 0;
 }
 
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * flush_tabstat - flush out a local table stats entry
+ *
+ * Some of the counters are added to the local database stats entry after a
+ * successful flush.
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
  */
-static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+static bool
+flush_tabstat(PgStatLocalHashEntry *ent, bool nowait)
 {
-    int            n;
-    int            len;
+    static const PgStat_TableCounts all_zeroes;
+    Oid                    dboid;            /* database OID of the table */
+    PgStat_TableStatus *lstats;            /* local stats entry  */
+    PgStat_StatTabEntry *shtabstats;    /* table entry of shared stats */
+    PgStat_StatDBEntry *ldbstats;        /* local database entry */
+
+    Assert(ent->key.type == PGSTAT_TYPE_TABLE);
+    lstats = (PgStat_TableStatus *) ent->body;
+    dboid = ent->key.databaseid;
+
+    /*
+     * Ignore entries that didn't accumulate any actual counts, such as
+     * indexes that were opened by the planner but not used.
+     */
+    if (memcmp(&lstats->t_counts, &all_zeroes,
+               sizeof(PgStat_TableCounts)) == 0)
+    {
+        /* This local entry is going to be dropped, delink from relcache. */
+        pgstat_delinkstats(lstats->relation);
+        return true;
+    }
+
+    /* find shared table stats entry corresponding to the local entry */
+    shtabstats = (PgStat_StatTabEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_TABLE, dboid, ent->key.objectid,
+                             nowait);
+
+    if (shtabstats == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    /* add the values to the shared entry. */
+    shtabstats->numscans += lstats->t_counts.t_numscans;
+    shtabstats->tuples_returned += lstats->t_counts.t_tuples_returned;
+    shtabstats->tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    shtabstats->tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    shtabstats->tuples_updated += lstats->t_counts.t_tuples_updated;
+    shtabstats->tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    shtabstats->tuples_hot_updated += lstats->t_counts.t_tuples_hot_updated;
+
+    /*
+     * If the table was truncated or vacuum/analyze has run, first reset the
+     * live/dead counters.
+     */
+    if (lstats->t_counts.t_truncated)
+    {
+        shtabstats->n_live_tuples = 0;
+        shtabstats->n_dead_tuples = 0;
+    }
+
+    shtabstats->n_live_tuples += lstats->t_counts.t_delta_live_tuples;
+    shtabstats->n_dead_tuples += lstats->t_counts.t_delta_dead_tuples;
+    shtabstats->changes_since_analyze += lstats->t_counts.t_changed_tuples;
+    shtabstats->inserts_since_vacuum += lstats->t_counts.t_tuples_inserted;
+    shtabstats->blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    shtabstats->blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    shtabstats->n_live_tuples = Max(shtabstats->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    shtabstats->n_dead_tuples = Max(shtabstats->n_dead_tuples, 0);
+
+    LWLockRelease(&shtabstats->header.lock);
+
+    /* The entry was flushed successfully; add the same to database stats */
+    ldbstats = get_local_dbstat_entry(dboid);
+    ldbstats->counts.n_tuples_returned += lstats->t_counts.t_tuples_returned;
+    ldbstats->counts.n_tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    ldbstats->counts.n_tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    ldbstats->counts.n_tuples_updated += lstats->t_counts.t_tuples_updated;
+    ldbstats->counts.n_tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    ldbstats->counts.n_blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    ldbstats->counts.n_blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /* This local entry is going to be dropped, delink from relcache. */
+    pgstat_delinkstats(lstats->relation);
+
+    return true;
+}
+
+/*
+ * flush_funcstat - flush out a local function stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_funcstat(PgStatLocalHashEntry *ent, bool nowait)
+{
+    PgStat_BackendFunctionEntry *localent;    /* local stats entry */
+    PgStat_StatFuncEntry *shfuncent = NULL; /* shared stats entry */
+
+    Assert(ent->key.type == PGSTAT_TYPE_FUNCTION);
+    localent = (PgStat_BackendFunctionEntry *) ent->body;
+
+    /* localent always has non-zero content */
+
+    /* find shared function stats entry corresponding to the local entry */
+    shfuncent = (PgStat_StatFuncEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             ent->key.objectid, nowait);
+    if (shfuncent == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    shfuncent->f_numcalls += localent->f_counts.f_numcalls;
+    shfuncent->f_total_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_total_time);
+    shfuncent->f_self_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_self_time);
+
+    LWLockRelease(&shfuncent->header.lock);
+
+    return true;
+}
+
+
+/*
+ * flush_dbstat - flush out a local database stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_dbstat(PgStatLocalHashEntry *ent, bool nowait)
+{
+    PgStat_StatDBEntry *localent;
+    PgStat_StatDBEntry *sharedent;
+
+    Assert(ent->key.type == PGSTAT_TYPE_DB);
+
+    localent = (PgStat_StatDBEntry *) ent->body;
+
+    /* find shared database stats entry corresponding to the local entry */
+    sharedent = (PgStat_StatDBEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_DB, ent->key.databaseid, InvalidOid,
+                             nowait);
+
+    if (!sharedent)
+        return false;            /* failed to acquire lock, skip */
+
+    sharedent->counts.n_tuples_returned += localent->counts.n_tuples_returned;
+    sharedent->counts.n_tuples_fetched += localent->counts.n_tuples_fetched;
+    sharedent->counts.n_tuples_inserted += localent->counts.n_tuples_inserted;
+    sharedent->counts.n_tuples_updated += localent->counts.n_tuples_updated;
+    sharedent->counts.n_tuples_deleted += localent->counts.n_tuples_deleted;
+    sharedent->counts.n_blocks_fetched += localent->counts.n_blocks_fetched;
+    sharedent->counts.n_blocks_hit += localent->counts.n_blocks_hit;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    sharedent->counts.n_deadlocks += localent->counts.n_deadlocks;
+    sharedent->counts.n_temp_bytes += localent->counts.n_temp_bytes;
+    sharedent->counts.n_temp_files += localent->counts.n_temp_files;
+    sharedent->counts.n_checksum_failures += localent->counts.n_checksum_failures;
 
     /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
+     * Accumulate xact commit/rollback and I/O timings to stats entry of the
+     * current database.
      */
-    if (OidIsValid(tsmsg->m_databaseid))
+    if (OidIsValid(ent->key.databaseid))
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
+        sharedent->counts.n_xact_commit += pgStatXactCommit;
+        sharedent->counts.n_xact_rollback += pgStatXactRollback;
+        sharedent->counts.n_block_read_time += pgStatBlockReadTime;
+        sharedent->counts.n_block_write_time += pgStatBlockWriteTime;
         pgStatXactCommit = 0;
         pgStatXactRollback = 0;
         pgStatBlockReadTime = 0;
@@ -997,281 +1538,138 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        sharedent->counts.n_xact_commit = 0;
+        sharedent->counts.n_xact_rollback = 0;
+        sharedent->counts.n_block_read_time = 0;
+        sharedent->counts.n_block_write_time = 0;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
+    LWLockRelease(&sharedent->header.lock);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    return true;
 }
 
-/*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
- */
-static void
-pgstat_send_funcstats(void)
-{
-    /* we assume this inits to all zeroes: */
-    static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
-
-    if (pgStatFunctions == NULL)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
-
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
-
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
-
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
-
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
-
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
-}
-
-
 /* ----------
  * pgstat_vacuum_stat() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *  Delete shared stat entries that are not in system catalogs.
+ *
+ *  To avoid holding exclusive lock on dshash for a long time, the process is
+ *  performed in three steps.
+ *
+ *   1: Collect existent oids of every kind of object.
+ *   2: Collect victim entries by scanning with shared lock.
+ *   3: Try removing every nominated entry without waiting for lock.
+ *
+ *  As a consequence of the last step, some entries may be left behind due
+ *  to lock failure, but that is harmless because they will be deleted by
+ *  later vacuums.
  * ----------
  */
 void
 pgstat_vacuum_stat(void)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Read pg_database and make a list of OIDs of all existing databases
-     */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
-
-    /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            dbid = dbentry->databaseid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        /* the DB entry for shared tables (with InvalidOid) is never dropped */
-        if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
-            pgstat_drop_database(dbid);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Lookup our own database entry; if not found, nothing more to do.
-     */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
+    pgstat_oid_hash       *dbids;    /* database ids */
+    pgstat_oid_hash       *relids;    /* relation ids in the current database */
+    pgstat_oid_hash       *funcids;/* function ids in the current database */
+    int            nvictims = 0;    /* # of removed entries */
+    dshash_seq_status dshstat;
+    PgStatHashEntry *ent;
+
+    /* we don't collect stats under standalone mode */
+    if (!IsUnderPostmaster)
         return;
 
-    /*
-     * Similarly to above, make a list of all known relations in this DB.
-     */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
+    /* collect oids of existent objects */
+    dbids = collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    relids = collect_oids(RelationRelationId, Anum_pg_class_oid);
+    funcids = collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
 
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
+    nvictims = 0;
 
-    /*
-     * Check for all tables listed in stats hashtable if they still exist.
-     */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+    /* some of the dshash entries are to be removed, take exclusive lock. */
+    dshash_seq_init(&dshstat, pgStatSharedHash, true);
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
     {
-        Oid            tabid = tabentry->tableid;
+        pgstat_oid_hash *oidhash;
+        Oid           key;
 
         CHECK_FOR_INTERRUPTS();
 
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+        /*
+         * Don't drop entries for non-database objects that don't belong to
+         * the current database.
+         */
+        if (ent->key.type != PGSTAT_TYPE_DB &&
+            ent->key.databaseid != MyDatabaseId)
             continue;
 
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
-    }
-
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
+        switch (ent->key.type)
         {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
+            case PGSTAT_TYPE_DB:
+                /* don't remove database entry for shared tables */
+                if (ent->key.databaseid == 0)
+                    continue;
+                oidhash = dbids;
+                key = ent->key.databaseid;
+                break;
+
+            case PGSTAT_TYPE_TABLE:
+                oidhash = relids;
+                key = ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                oidhash = funcids;
+                key = ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_REPLSLOT:
+                /*
+                 * We don't bother vacuuming this kind of entry because the
+                 * number of entries is quite small and entries are likely to
+                 * be reused soon.
+                 */
                 continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
         }
 
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
+        /* Skip existent objects. */
+        if (pgstat_oid_lookup(oidhash, key) != NULL)
+            continue;
 
-        hash_destroy(htab);
+        /* drop this entry */
+        delete_current_stats_entry(&dshstat);
+        nvictims++;
     }
+    dshash_seq_term(&dshstat);
+    pgstat_oid_destroy(dbids);
+    pgstat_oid_destroy(relids);
+    pgstat_oid_destroy(funcids);
+
+    if (nvictims > 0)
+        pg_atomic_add_fetch_u64(&StatsShmem->gc_count, 1);
 }
 
-
 /* ----------
- * pgstat_collect_oids() -
+ * collect_oids() -
  *
  *    Collect the OIDs of all objects listed in the specified system catalog
- *    into a temporary hash table.  Caller should hash_destroy the result
+ *    into a temporary hash table.  Caller should pgstat_oid_destroy the result
  *    when done with it.  (However, we make the table in CurrentMemoryContext
  *    so that it will be freed properly in event of an error.)
  * ----------
  */
-static HTAB *
-pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
+static pgstat_oid_hash *
+collect_oids(Oid catalogid, AttrNumber anum_oid)
 {
-    HTAB       *htab;
-    HASHCTL        hash_ctl;
+    pgstat_oid_hash *rethash;
     Relation    rel;
     TableScanDesc scan;
     HeapTuple    tup;
     Snapshot    snapshot;
 
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(Oid);
-    hash_ctl.hcxt = CurrentMemoryContext;
-    htab = hash_create("Temporary table of OIDs",
-                       PGSTAT_TAB_HASH_SIZE,
-                       &hash_ctl,
-                       HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    rethash = pgstat_oid_create(CurrentMemoryContext,
+                                PGSTAT_TABLE_HASH_SIZE, NULL);
 
     rel = table_open(catalogid, AccessShareLock);
     snapshot = RegisterSnapshot(GetLatestSnapshot());
@@ -1280,81 +1678,61 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     {
         Oid            thisoid;
         bool        isnull;
+        bool        found;
 
         thisoid = heap_getattr(tup, anum_oid, RelationGetDescr(rel), &isnull);
         Assert(!isnull);
 
         CHECK_FOR_INTERRUPTS();
 
-        (void) hash_search(htab, (void *) &thisoid, HASH_ENTER, NULL);
+        pgstat_oid_insert(rethash, thisoid, &found);
     }
     table_endscan(scan);
     UnregisterSnapshot(snapshot);
     table_close(rel, AccessShareLock);
 
-    return htab;
+    return rethash;
 }
 
-
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
- * ----------
+ *    Remove entry for the database that we just dropped.
+ *
+ *    Some entries might be left behind due to lock failure, or some stats
+ *    might be flushed after this, but we will still clean the dead DB
+ *    eventually via future invocations of pgstat_vacuum_stat().
+ *    ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
 
+    Assert(OidIsValid(databaseid));
 
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
-
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
+    /* some of the dshash entries are to be removed, take exclusive lock. */
+    dshash_seq_init(&hstat, pgStatSharedHash, true);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+    {
+        if (p->key.databaseid == databaseid)
+            delete_current_stats_entry(&hstat);
+    }
+    dshash_seq_term(&hstat);
+
+    /* Let readers run a garbage collection of local hashes */
+    pg_atomic_add_fetch_u64(&StatsShmem->gc_count, 1);
 }
-#endif                            /* NOT_USED */
 
 
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1363,53 +1741,146 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    /* dshash entry is not modified, take shared lock */
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+    {
+        PgStat_StatEntryHeader *header;
+
+        if (p->key.databaseid != MyDatabaseId)
+            continue;
+
+        header = dsa_get_address(area, p->body);
+
+        LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+        memset(PGSTAT_SHENT_BODY(header), 0,
+               PGSTAT_SHENT_BODY_LEN(p->key.type));
+
+        if (p->key.type == PGSTAT_TYPE_DB)
+        {
+            PgStat_StatDBEntry *dbstat = (PgStat_StatDBEntry *) header;
+            dbstat->stat_reset_timestamp = GetCurrentTimestamp();
+        }
+        LWLockRelease(&header->lock);
+    }
+    dshash_seq_term(&hstat);
+
+    /* Invalidate the simple cache keys */
+    cached_dbent_key = stathashkey_zero;
+    cached_tabent_key = stathashkey_zero;
+    cached_funcent_key = stathashkey_zero;
+}
+
+/*
+ * pgstat_copy_global_stats - helper function for functions
+ *           pgstat_fetch_stat_*() and pgstat_reset_shared_counters().
+ *
+ * Copies out the specified memory area following change-count protocol.
+ */
+static inline void
+pgstat_copy_global_stats(void *dst, void *src, size_t len,
+                         pg_atomic_uint32 *count)
+{
+    uint32 before_changecount;
+    uint32 after_changecount;
+
+    after_changecount = pg_atomic_read_u32(count);
+
+    do
+    {
+        before_changecount = after_changecount;
+        memcpy(dst, src, len);
+        after_changecount = pg_atomic_read_u32(count);
+    } while ((before_changecount & 1) == 1 ||
+             after_changecount != before_changecount);
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
+ *
+ *  We don't scribble on shared stats while resetting to avoid locking on
+ *  shared stats struct. Instead, just record the current counters in another
+ *  shared struct, which is protected by StatsLock. See
+ *  pgstat_fetch_stat_(archiver|bgwriter|checkpointer) for the reader side.
  * ----------
  */
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    TimestampTz now = GetCurrentTimestamp();
+    PgStat_Shared_Reset_Target t;
 
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+        t = RESET_ARCHIVER;
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+        t = RESET_BGWRITER;
     else if (strcmp(target, "wal") == 0)
-        msg.m_resettarget = RESET_WAL;
+        t = RESET_WAL;
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\", \"bgwriter\" or \"wal\".")));
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
+    /* Reset statistics for the cluster. */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+    switch (t)
+    {
+        case RESET_ARCHIVER:
+            pgstat_copy_global_stats(&StatsShmem->archiver_reset_offset,
+                                     &StatsShmem->archiver_stats,
+                                     sizeof(PgStat_Archiver),
+                                     &StatsShmem->archiver_changecount);
+            StatsShmem->archiver_reset_offset.stat_reset_timestamp = now;
+            cached_archiverstats_is_valid = false;
+            break;
+
+        case RESET_BGWRITER:
+            pgstat_copy_global_stats(&StatsShmem->bgwriter_reset_offset,
+                                     &StatsShmem->bgwriter_stats,
+                                     sizeof(PgStat_BgWriter),
+                                     &StatsShmem->bgwriter_changecount);
+            pgstat_copy_global_stats(&StatsShmem->checkpointer_reset_offset,
+                                     &StatsShmem->checkpointer_stats,
+                                     sizeof(PgStat_CheckPointer),
+                                     &StatsShmem->checkpointer_changecount);
+            StatsShmem->bgwriter_reset_offset.stat_reset_timestamp = now;
+            cached_bgwriterstats_is_valid = false;
+            cached_checkpointerstats_is_valid = false;
+            break;
+
+        case RESET_WAL:
+            /*
+             * Unlike the two above, WAL statistics have many writer
+             * processes and are protected by wal_stats_lock.
+             */
+            LWLockAcquire(&StatsShmem->wal_stats_lock, LW_EXCLUSIVE);
+            MemSet(&StatsShmem->wal_stats, 0, sizeof(PgStat_Wal));
+            StatsShmem->wal_stats.stat_reset_timestamp = now;
+            LWLockRelease(&StatsShmem->wal_stats_lock);
+            cached_walstats_is_valid = false;
+            break;
+    }
+
+    LWLockRelease(StatsLock);
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1418,17 +1889,37 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatEntryHeader *header;
+    PgStat_StatDBEntry *dbentry;
+    PgStatTypes stattype;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    dbentry = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, MyDatabaseId, InvalidOid, false, false,
+                       NULL);
+    Assert(dbentry);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* Set the reset timestamp for the whole database */
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->header.lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&dbentry->header.lock);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Reset the counters of the specified object */
+    switch (type)
+    {
+        case RESET_TABLE:
+            stattype = PGSTAT_TYPE_TABLE;
+            break;
+        case RESET_FUNCTION:
+            stattype = PGSTAT_TYPE_FUNCTION;
+    }
+
+    header = get_stat_entry(stattype, MyDatabaseId, objoid, false, false, NULL);
+
+    LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+    memset(PGSTAT_SHENT_BODY(header), 0, PGSTAT_SHENT_BODY_LEN(stattype));
+    LWLockRelease(&header->lock);
 }
 
 /* ----------
@@ -1444,15 +1935,40 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_reset_slru_counter(const char *name)
 {
-    PgStat_MsgResetslrucounter msg;
+    int i;
+    TimestampTz    ts = GetCurrentTimestamp();
+    uint32    assert_changecount PG_USED_FOR_ASSERTS_ONLY;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    if (name)
+    {
+        i = pgstat_slru_index(name);
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+        assert_changecount =
+            pg_atomic_fetch_add_u32(&StatsShmem->slru_changecount, 1);
+        Assert((assert_changecount & 1) == 0);
+        MemSet(&StatsShmem->slru_stats.entry[i], 0,
+               sizeof(PgStat_SLRUStats));
+        StatsShmem->slru_stats.entry[i].stat_reset_timestamp = ts;
+        pg_atomic_add_fetch_u32(&StatsShmem->slru_changecount, 1);
+        LWLockRelease(&StatsShmem->slru_stats.lock);
+    }
+    else
+    {
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+        assert_changecount =
+            pg_atomic_fetch_add_u32(&StatsShmem->slru_changecount, 1);
+        Assert((assert_changecount & 1) == 0);
+        for (i = 0 ; i < SLRU_NUM_ELEMENTS; i++)
+        {
+            MemSet(&StatsShmem->slru_stats.entry[i], 0,
+                   sizeof(PgStat_SLRUStats));
+            StatsShmem->slru_stats.entry[i].stat_reset_timestamp = ts;
+        }
+        pg_atomic_add_fetch_u32(&StatsShmem->slru_changecount, 1);
+        LWLockRelease(&StatsShmem->slru_stats.lock);
+    }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSLRUCOUNTER);
-    msg.m_index = (name) ? pgstat_slru_index(name) : -1;
-
-    pgstat_send(&msg, sizeof(msg));
+    cached_slrustats_is_valid = false;
 }
 
 /* ----------
@@ -1468,20 +1984,19 @@ pgstat_reset_slru_counter(const char *name)
 void
 pgstat_reset_replslot_counter(const char *name)
 {
-    PgStat_MsgResetreplslotcounter msg;
+    int            startidx;
+    int            endidx;
+    int            i;
+    TimestampTz    ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
     if (name)
     {
         ReplicationSlot *slot;
-
-        /*
-         * Check if the slot exits with the given name. It is possible that by
-         * the time this message is executed the slot is dropped but at least
-         * this check will ensure that the given name is for a valid slot.
-         */
+
+        /* Check if a slot exists with the given name. */
         LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
         slot = SearchNamedReplicationSlot(name);
         LWLockRelease(ReplicationSlotControlLock);
@@ -1499,15 +2014,36 @@ pgstat_reset_replslot_counter(const char *name)
         if (SlotIsPhysical(slot))
             return;
 
-        strlcpy(msg.m_slotname, name, NAMEDATALEN);
-        msg.clearall = false;
+        /* reset this one entry */
+        startidx = endidx = slot - ReplicationSlotCtl->replication_slots;
     }
     else
-        msg.clearall = true;
+    {
+        /* reset all existing entries */
+        startidx = 0;
+        endidx = max_replication_slots - 1;
+    }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETREPLSLOTCOUNTER);
+    ts = GetCurrentTimestamp();
+    for (i = startidx ; i <= endidx ; i++)
+    {
+        PgStat_ReplSlot *shent;
 
-    pgstat_send(&msg, sizeof(msg));
+        shent = (PgStat_ReplSlot *)
+            get_stat_entry(PGSTAT_TYPE_REPLSLOT,
+                           MyDatabaseId, i, false, false, NULL);
+
+        /* Skip non-existent entries */
+        if (!shent)
+            continue;
+
+        LWLockAcquire(&shent->header.lock, LW_EXCLUSIVE);
+        memset(&shent->spill_txns, 0,
+               offsetof(PgStat_ReplSlot, stat_reset_timestamp) -
+               offsetof(PgStat_ReplSlot, spill_txns));
+        shent->stat_reset_timestamp = ts;
+        LWLockRelease(&shent->header.lock);
+    }
 }
 
 /* ----------
@@ -1521,48 +2057,93 @@ pgstat_reset_replslot_counter(const char *name)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if activity stats is not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    /*
+     * End-of-vacuum is reported instantly. Report the start the same way for
+     * consistency. Vacuum doesn't run frequently and is a long-lasting
+     * operation so it doesn't matter if we get blocked here a little.
+     */
+    dbentry = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, dboid, InvalidOid, false, true, NULL);
 
-    pgstat_send(&msg, sizeof(msg));
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->header.lock, LW_EXCLUSIVE);
+    dbentry->last_autovac_time = ts;
+    LWLockRelease(&dbentry->header.lock);
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    PgStat_StatTabEntry *tabentry;
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    ts = GetCurrentTimestamp();
+
+    /*
+     * Unlike ordinary operations, maintenance commands run for a long time,
+     * so getting blocked briefly at the end of the work doesn't matter.
+     * Furthermore, this prevents stats updates made by transactions that end
+     * after this vacuum from being overwritten by a delayed vacuum report.
+     * Update the shared stats entry directly for the above reasons.
+     */
+    tabentry = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, tableoid, false, true, NULL);
+
+    LWLockAcquire(&tabentry->header.lock, LW_EXCLUSIVE);
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * It is quite possible that a non-aggressive VACUUM ended up skipping
+     * various pages, however, we'll zero the insert counter here regardless.
+     * It's currently used only to track when we need to perform an "insert"
+     * autovacuum, which are mainly intended to freeze newly inserted tuples.
+     * Zeroing this may just mean we'll not try to vacuum the table again
+     * until enough tuples have been inserted to trigger another insert
+     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
+     * stragglers.
+     */
+    tabentry->inserts_since_vacuum = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = ts;
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = ts;
+        tabentry->vacuum_count++;
+    }
+
+    LWLockRelease(&tabentry->header.lock);
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1573,9 +2154,11 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    PgStat_StatTabEntry *tabentry;
+    Oid        dboid = (rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId);
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1583,10 +2166,10 @@ pgstat_report_analyze(Relation rel,
      * already inserted and/or deleted rows in the target table. ANALYZE will
      * have counted such rows as live or dead respectively. Because we will
      * report our counts of such rows at transaction end, we should subtract
-     * off these counts from what we send to the collector now, else they'll
-     * be double-counted after commit.  (This approach also ensures that the
-     * collector ends up with the right numbers if we abort instead of
-     * committing.)
+     * off these counts from what we write to shared stats now, else they'll
+     * be double-counted after commit.  (This approach also ensures that the
+     * shared stats end up with the right numbers if we abort instead of
+     * committing.)
      */
     if (rel->pgstat_info != NULL)
     {
@@ -1604,137 +2187,223 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Unlike ordinary operations, maintenance commands run for a long time,
+     * so getting blocked briefly at the end of the work doesn't matter.
+     * Furthermore, this prevents stats updates made by transactions that end
+     * after this analyze from being overwritten by a delayed analyze report.
+     * Update the shared stats entry directly for the above reasons.
+     */
+    tabentry = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, RelationGetRelid(rel),
+                       false, true, NULL);
+
+    LWLockAcquire(&tabentry->header.lock, LW_EXCLUSIVE);
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    LWLockRelease(&tabentry->header.lock);
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            dbent->counts.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            dbent->counts.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            dbent->counts.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            dbent->counts.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            dbent->counts.n_conflict_startup_deadlock++;
+            break;
+    }
 }
 
+
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_deadlocks++;
 }
 
-
-
 /* --------
- * pgstat_report_checksum_failures_in_db() -
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
  *
- *    Tell the collector about one or more checksum failures.
+ *    Report one or more checksum failures.
  * --------
  */
 void
 pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
 {
-    PgStat_MsgChecksumFailure msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
+    dbentry = get_local_dbstat_entry(dboid);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* add the reported count to the accumulated counter */
+    dbentry->counts.n_checksum_failures += failurecount;
 }
 
 /* --------
  * pgstat_report_checksum_failure() -
  *
- *    Tell the collector about a checksum failure.
+ *    Report a checksum failure.
  * --------
  */
 void
 pgstat_report_checksum_failure(void)
 {
-    pgstat_report_checksum_failures_in_db(MyDatabaseId, 1);
+    PgStat_StatDBEntry *dbent;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_checksum_failures++;
 }
 
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
+    if (filesize == 0)            /* Is there a case where filesize is really 0? */
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_temp_bytes += filesize; /* XXX: needs an overflow check */
+    dbent->counts.n_temp_files++;
 }
 
 /* ----------
  * pgstat_report_replslot() -
  *
- *    Tell the collector about replication slot statistics.
+ *    Report replication slot activity.
  * ----------
  */
 void
-pgstat_report_replslot(const char *slotname, int spilltxns, int spillcount,
-                       int spillbytes, int streamtxns, int streamcount, int streambytes)
+pgstat_report_replslot(const char *slotname,
+                       int spilltxns, int spillcount, int spillbytes,
+                       int streamtxns, int streamcount, int streambytes)
 {
-    PgStat_MsgReplSlot msg;
+    PgStat_ReplSlot *shent;
+    int                 i;
+    bool             found;
+
+    if (!area)
+        return;
+
+    for (i = 0 ; i < max_replication_slots ; i++)
+    {
+        if (strcmp(NameStr(ReplicationSlotCtl->replication_slots[i].data.name),
+                   slotname) == 0)
+            break;
+
+    }
 
     /*
-     * Prepare and send the message
+     * The slot must have been removed; just ignore it.  We will create the
+     * entry for a slot with this name next time.
      */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_REPLSLOT);
-    strlcpy(msg.m_slotname, slotname, NAMEDATALEN);
-    msg.m_drop = false;
-    msg.m_spill_txns = spilltxns;
-    msg.m_spill_count = spillcount;
-    msg.m_spill_bytes = spillbytes;
-    msg.m_stream_txns = streamtxns;
-    msg.m_stream_count = streamcount;
-    msg.m_stream_bytes = streambytes;
-    pgstat_send(&msg, sizeof(PgStat_MsgReplSlot));
+    if (i == max_replication_slots)
+        return;
+
+    shent = (PgStat_ReplSlot *)
+        get_stat_entry(PGSTAT_TYPE_REPLSLOT,
+                       MyDatabaseId, i, false, true, &found);
+
+    /* Clear the counters and reset dropped when we reuse it */
+    LWLockAcquire(&shent->header.lock, LW_EXCLUSIVE);
+    if (shent->header.dropped || !found)
+    {
+        memset(&shent->spill_txns, 0,
+               sizeof(PgStat_ReplSlot) - offsetof(PgStat_ReplSlot, spill_txns));
+        strlcpy(shent->slotname, slotname, NAMEDATALEN);
+        shent->header.dropped = false;
+    }
+
+    shent->spill_txns += spilltxns;
+    shent->spill_count += spillcount;
+    shent->spill_bytes += spillbytes;
+    shent->stream_txns += streamtxns;
+    shent->stream_count += streamcount;
+    shent->stream_bytes += streambytes;
+    LWLockRelease(&shent->header.lock);
 }
 
 /* ----------
@@ -1746,55 +2415,44 @@ pgstat_report_replslot(const char *slotname, int spilltxns, int spillcount,
 void
 pgstat_report_replslot_drop(const char *slotname)
 {
-    PgStat_MsgReplSlot msg;
+    int i;
+    PgStat_ReplSlot *shent;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_REPLSLOT);
-    strlcpy(msg.m_slotname, slotname, NAMEDATALEN);
-    msg.m_drop = true;
-    pgstat_send(&msg, sizeof(PgStat_MsgReplSlot));
-}
+    Assert(area);
+    if (!area)
+        return;
 
-/* ----------
- * pgstat_ping() -
- *
- *    Send some junk data to the collector to increase traffic.
- * ----------
- */
-void
-pgstat_ping(void)
-{
-    PgStat_MsgDummy msg;
+    for (i = 0 ; i < max_replication_slots ; i++)
+    {
+        if (strcmp(NameStr(ReplicationSlotCtl->replication_slots[i].data.name),
+                   slotname) == 0)
+            break;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    /* XXX: maybe the slot has been removed. Just ignore it. */
+    if (i == max_replication_slots)
+        return;
+
+    shent = (PgStat_ReplSlot *)
+        get_stat_entry(PGSTAT_TYPE_REPLSLOT,
+                       MyDatabaseId, i, false, false, NULL);
+
+    if (shent && !shent->header.dropped)
+    {
+        LWLockAcquire(&shent->header.lock, LW_EXCLUSIVE);
+        shent->header.dropped = true;
+        LWLockRelease(&shent->header.lock);
+    }
 }
 
 /* ----------
- * pgstat_send_inquiry() -
+ * pgstat_init_function_usage() -
  *
- *    Notify collector that we need fresh data.
+ *  Initialize function call usage data.
+ *  Called by the executor before invoking a function.
  * ----------
  */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/*
- * Initialize function call usage data.
- * Called by the executor before invoking a function.
- */
 void
 pgstat_init_function_usage(FunctionCallInfo fcinfo,
                            PgStat_FunctionCallUsage *fcu)
@@ -1809,24 +2467,9 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
         return;
     }
 
-    if (!pgStatFunctions)
-    {
-        /* First time through - initialize function stat table */
-        HASHCTL        hash_ctl;
-
-        hash_ctl.keysize = sizeof(Oid);
-        hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
-        pgStatFunctions = hash_create("Function stat entries",
-                                      PGSTAT_FUNCTION_HASH_SIZE,
-                                      &hash_ctl,
-                                      HASH_ELEM | HASH_BLOBS);
-    }
-
-    /* Get the stats entry for this function, create if necessary */
-    htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
-                          HASH_ENTER, &found);
-    if (!found)
-        MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
+    htabent = (PgStat_BackendFunctionEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             fcinfo->flinfo->fn_oid, true, &found);
 
     fcu->fs = &htabent->f_counts;
 
@@ -1840,31 +2483,37 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
     INSTR_TIME_SET_CURRENT(fcu->f_start);
 }
 
-/*
- * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
- *        for specified function
+/* ----------
+ * find_funcstat_entry() -
  *
- * If no entry, return NULL, don't create a new one
+ *  find any existing PgStat_BackendFunctionEntry entry for specified function
+ *
+ *  If there is no entry, return NULL without creating a new one.
+ * ----------
  */
 PgStat_BackendFunctionEntry *
 find_funcstat_entry(Oid func_id)
 {
-    if (pgStatFunctions == NULL)
-        return NULL;
+    PgStat_BackendFunctionEntry *ent;
 
-    return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
-                                                       (void *) &func_id,
-                                                       HASH_FIND, NULL);
+    ent = (PgStat_BackendFunctionEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             func_id, false, NULL);
+
+    return ent;
 }
 
-/*
- * Calculate function call usage and update stat counters.
- * Called by the executor after invoking a function.
+/* ----------
+ * pgstat_end_function_usage() -
  *
- * In the case of a set-returning function that runs in value-per-call mode,
- * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
- * calls for what the user considers a single call of the function.  The
- * finalize flag should be TRUE on the last call.
+ *  Calculate function call usage and update stat counters.
+ *  Called by the executor after invoking a function.
+ *
+ *  In the case of a set-returning function that runs in value-per-call mode,
+ *  we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
+ *  calls for what the user considers a single call of the function.  The
+ *  finalize flag should be TRUE on the last call.
+ * ----------
  */
 void
 pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
@@ -1905,9 +2554,6 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
         fs->f_numcalls++;
     fs->f_total_time = f_total;
     INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
 }
 
 
@@ -1919,8 +2565,7 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
  *
  *    We assume that a relcache entry's pgstat_info field is zeroed by
  *    relcache.c when the relcache entry is made; thereafter it is long-lived
- *    data.  We can avoid repeated searches of the TabStatus arrays when the
- *    same relation is touched repeatedly within a transaction.
+ *    data.
  * ----------
  */
 void
@@ -1936,7 +2581,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1947,120 +2593,60 @@ pgstat_initstats(Relation rel)
      * If we already set up this relation in the current transaction, nothing
      * to do.
      */
-    if (rel->pgstat_info != NULL &&
-        rel->pgstat_info->t_id == rel_id)
+    if (rel->pgstat_info != NULL)
         return;
 
     /* Else find or make the PgStat_TableStatus entry, and update link */
-    rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+    rel->pgstat_info = get_local_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+    /* mark this relation as the owner */
+
+    /* don't allow linking a stats entry to multiple relcache entries */
+    Assert (rel->pgstat_info->relation == NULL);
+    rel->pgstat_info->relation = rel;
 }
 
 /*
- * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
+ * pgstat_delinkstats() -
+ *
+ *  Break the mutual link between a relcache entry and a local stats entry.
+ *  This must always be called when one end of the link is removed.
  */
-static PgStat_TableStatus *
-get_tabstat_entry(Oid rel_id, bool isshared)
+void
+pgstat_delinkstats(Relation rel)
 {
-    TabStatHashEntry *hash_entry;
-    PgStat_TableStatus *entry;
-    TabStatusArray *tsa;
-    bool        found;
-
-    /*
-     * Create hash table if we don't have it already.
-     */
-    if (pgStatTabHash == NULL)
+    /* remove the link to stats info if any */
+    if (rel && rel->pgstat_info)
     {
-        HASHCTL        ctl;
-
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
+        /* link sanity check */
+        Assert (rel->pgstat_info->relation == rel);
+        rel->pgstat_info->relation = NULL;
+        rel->pgstat_info = NULL;
     }
-
-    /*
-     * Find an entry or create a new one.
-     */
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
-    if (!found)
-    {
-        /* initialize new entry with null pointer */
-        hash_entry->tsa_entry = NULL;
-    }
-
-    /*
-     * If entry is already valid, we're done.
-     */
-    if (hash_entry->tsa_entry)
-        return hash_entry->tsa_entry;
-
-    /*
-     * Locate the first pgStatTabList entry with free space, making a new list
-     * entry if needed.  Note that we could get an OOM failure here, but if so
-     * we have left the hashtable and the list in a consistent state.
-     */
-    if (pgStatTabList == NULL)
-    {
-        /* Set up first pgStatTabList entry */
-        pgStatTabList = (TabStatusArray *)
-            MemoryContextAllocZero(TopMemoryContext,
-                                   sizeof(TabStatusArray));
-    }
-
-    tsa = pgStatTabList;
-    while (tsa->tsa_used >= TABSTAT_QUANTUM)
-    {
-        if (tsa->tsa_next == NULL)
-            tsa->tsa_next = (TabStatusArray *)
-                MemoryContextAllocZero(TopMemoryContext,
-                                       sizeof(TabStatusArray));
-        tsa = tsa->tsa_next;
-    }
-
-    /*
-     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
-     * the entry was already zeroed, either at creation or after last use.
-     */
-    entry = &tsa->tsa_entries[tsa->tsa_used++];
-    entry->t_id = rel_id;
-    entry->t_shared = isshared;
-
-    /*
-     * Now we can fill the entry in pgStatTabHash.
-     */
-    hash_entry->tsa_entry = entry;
-
-    return entry;
 }
 
 /*
  * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_TableStatus entry for rel_id in the current
+ *  database. If not found, look it up among the shared tables.
  *
- * Note: if we got an error in the most recent execution of pgstat_report_stat,
- * it's possible that an entry exists but there's no hashtable entry for it.
- * That's okay, we'll treat this case as "doesn't exist".
+ *  If no entry found, return NULL, don't create a new one
+ * ----------
  */
 PgStat_TableStatus *
 find_tabstat_entry(Oid rel_id)
 {
-    TabStatHashEntry *hash_entry;
+    PgStat_TableStatus *ent;
 
-    /* If hashtable doesn't exist, there are no entries at all */
-    if (!pgStatTabHash)
-        return NULL;
+    ent = (PgStat_TableStatus *)
+        get_local_stat_entry(PGSTAT_TYPE_TABLE, MyDatabaseId, rel_id,
+                             false, NULL);
+    if (!ent)
+        ent = (PgStat_TableStatus *)
+            get_local_stat_entry(PGSTAT_TYPE_TABLE, InvalidOid, rel_id,
+                                 false, NULL);
 
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
-    if (!hash_entry)
-        return NULL;
-
-    /* Note that this step could also return NULL, but that's correct */
-    return hash_entry->tsa_entry;
+    return ent;
 }
 
 /*
@@ -2475,8 +3061,6 @@ AtPrepare_PgStat(void)
             record.inserted_pre_trunc = trans->inserted_pre_trunc;
             record.updated_pre_trunc = trans->updated_pre_trunc;
             record.deleted_pre_trunc = trans->deleted_pre_trunc;
-            record.t_id = tabstat->t_id;
-            record.t_shared = tabstat->t_shared;
             record.t_truncated = trans->truncated;
 
             RegisterTwoPhaseRecord(TWOPHASE_RM_PGSTAT_ID, 0,
@@ -2491,8 +3075,8 @@ AtPrepare_PgStat(void)
  *
  * All we need do here is unlink the transaction stats state from the
  * nontransactional state.  The nontransactional action counts will be
- * reported to the stats collector immediately, while the effects on live
- * and dead tuple counts are preserved in the 2PC state file.
+ * reported to the activity stats facility immediately, while the effects on
+ * live and dead tuple counts are preserved in the 2PC state file.
  *
  * Note: AtEOXact_PgStat is not called during PREPARE.
  */
@@ -2537,7 +3121,7 @@ pgstat_twophase_postcommit(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, commit case */
     pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
@@ -2573,7 +3157,7 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, abort case */
     if (rec->t_truncated)
@@ -2593,85 +3177,138 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    Find the database stats entry; for use on backends.
+ *
+ *  The returned entry is stored in static memory so the content is valid until
+ *    the next call of the same function for a different database.
  * ----------
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    PgStat_StatDBEntry *shent;
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for InvalidOid */
+    Assert(dbid != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_dbent_key.databaseid == dbid)
+        return &cached_dbent;
+
+    shent = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid, true, false, NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_dbent, shent, sizeof(PgStat_StatDBEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_dbent_key.databaseid = dbid;
+
+    return &cached_dbent;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one table or NULL. NULL doesn't mean
+ *    the activity statistics for one table or NULL. NULL doesn't mean
  *    that the table doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    activity statistics facilities, so the caller is better off to
+ *    report ZERO instead.
  * ----------
  */
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
-    PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_extended(false, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
-
-    return NULL;
+    tabentry = pgstat_fetch_stat_tabentry_extended(true, relid);
+    return tabentry;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_tabentry_extended() -
+ *
+ *    Find a table stats entry on backends. The returned entry is stored in
+ *    static memory so the content is valid until the next call of the same
+ *    function for a different table.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_extended(bool shared, Oid reloid)
+{
+    PgStat_StatTabEntry *shent;
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for InvalidOid */
+    Assert(reloid != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_tabent_key.databaseid == dboid &&
+        cached_tabent_key.objectid == reloid)
+        return &cached_tabent;
+
+    shent = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, reloid, true, false, NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_tabent, shent, sizeof(PgStat_StatTabEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_tabent_key.databaseid = dboid;
+    cached_tabent_key.objectid = reloid;
+
+    return &cached_tabent;
+}
+
+
+/* ----------
+ * pgstat_copy_index_counters() -
+ *
+ *    Support function for index swapping. Copy a portion of the counters of the
+ *    relation to the specified place.
+ * ----------
+ */
+void
+pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst)
+{
+    PgStat_StatTabEntry *tabentry;
+
+    /* No point fetching tabentry when dst is NULL */
+    if (!dst)
+        return;
+
+    tabentry = pgstat_fetch_stat_tabentry(relid);
+
+    if (!tabentry)
+        return;
+
+    dst->t_counts.t_numscans = tabentry->numscans;
+    dst->t_counts.t_tuples_returned = tabentry->tuples_returned;
+    dst->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
+    dst->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
+    dst->t_counts.t_blocks_hit = tabentry->blocks_hit;
 }
 
 
@@ -2680,30 +3317,46 @@ pgstat_fetch_stat_tabentry(Oid relid)
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
  *    the collected statistics for one function or NULL.
+ *
+ *  The returned entry is stored in static memory so the content is valid until
+ *    the next call of the same function for a different function id.
  * ----------
  */
 PgStat_StatFuncEntry *
 pgstat_fetch_stat_funcentry(Oid func_id)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry = NULL;
-
-    /* load the stats file if needed */
-    backend_read_statsfile();
-
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
-
-    return funcentry;
+    PgStat_StatFuncEntry *shent;
+    Oid    dboid = MyDatabaseId;
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for InvalidOid */
+    Assert(func_id != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_funcent_key.databaseid == dboid &&
+        cached_funcent_key.objectid == func_id)
+        return &cached_funcent;
+
+    shent = (PgStat_StatFuncEntry *)
+        get_stat_entry(PGSTAT_TYPE_FUNCTION, dboid, func_id, true, false,
+                       NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_funcent, shent, sizeof(PgStat_StatFuncEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_funcent_key.databaseid = dboid;
+    cached_funcent_key.objectid = func_id;
+
+    return &cached_funcent;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_beentry() -
  *
@@ -2763,53 +3416,160 @@ pgstat_fetch_stat_numbackends(void)
     return localNumBackends;
 }
 
+/*
+ * ---------
+ * pgstat_get_stat_timestamp() -
+ *
+ *  Returns the last update timestamp of global statistics.
+ */
+TimestampTz
+pgstat_get_stat_timestamp(void)
+{
+    return (TimestampTz) pg_atomic_read_u64(&StatsShmem->stats_timestamp);
+}
+
 /*
  * ---------
  * pgstat_fetch_stat_archiver() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the archiver statistics struct.
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_ArchiverStats *
+PgStat_Archiver *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    PgStat_Archiver reset;
+    PgStat_Archiver *reset_shared = &StatsShmem->archiver_reset_offset;
+    PgStat_Archiver *shared = &StatsShmem->archiver_stats;
+    PgStat_Archiver *cached = &cached_archiverstats;
 
-    return &archiverStats;
+    pgstat_copy_global_stats(cached, shared, sizeof(PgStat_Archiver),
+                             &StatsShmem->archiver_changecount);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&reset, reset_shared, sizeof(PgStat_Archiver));
+    LWLockRelease(StatsLock);
+
+    /* compensate by reset offsets */
+    if (cached->archived_count == reset.archived_count)
+    {
+        cached->last_archived_wal[0] = 0;
+        cached->last_archived_timestamp = 0;
+    }
+    cached->archived_count -= reset.archived_count;
+
+    if (cached->failed_count == reset.failed_count)
+    {
+        cached->last_failed_wal[0] = 0;
+        cached->last_failed_timestamp = 0;
+    }
+    cached->failed_count -= reset.failed_count;
+
+    cached->stat_reset_timestamp = reset.stat_reset_timestamp;
+
+    cached_archiverstats_is_valid = true;
+
+    return &cached_archiverstats;
 }
 
 
 /*
  * ---------
- * pgstat_fetch_global() -
+ * pgstat_fetch_stat_bgwriter() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the global statistics struct.
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_GlobalStats *
-pgstat_fetch_global(void)
+PgStat_BgWriter *
+pgstat_fetch_stat_bgwriter(void)
 {
-    backend_read_statsfile();
+    PgStat_BgWriter reset;
+    PgStat_BgWriter *reset_shared = &StatsShmem->bgwriter_reset_offset;
+    PgStat_BgWriter *shared = &StatsShmem->bgwriter_stats;
+    PgStat_BgWriter *cached = &cached_bgwriterstats;
+
+    pgstat_copy_global_stats(cached, shared, sizeof(PgStat_BgWriter),
+                             &StatsShmem->bgwriter_changecount);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&reset, reset_shared, sizeof(PgStat_BgWriter));
+    LWLockRelease(StatsLock);
+
+    /* compensate by reset offsets */
+    cached->buf_written_clean -= reset.buf_written_clean;
+    cached->maxwritten_clean  -= reset.maxwritten_clean;
+    cached->buf_alloc          -= reset.buf_alloc;
+    cached->stat_reset_timestamp = reset.stat_reset_timestamp;
+
+    cached_bgwriterstats_is_valid = true;
+
+    return &cached_bgwriterstats;
+}
+
+/*
+ * ---------
+ * pgstat_fetch_stat_checkpointer() -
+ *
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
+ * ---------
+ */
+PgStat_CheckPointer *
+pgstat_fetch_stat_checkpointer(void)
+{
+    PgStat_CheckPointer reset;
+    PgStat_CheckPointer *reset_shared = &StatsShmem->checkpointer_reset_offset;
+    PgStat_CheckPointer *shared = &StatsShmem->checkpointer_stats;
+    PgStat_CheckPointer *cached = &cached_checkpointerstats;
+
+    pgstat_copy_global_stats(cached, shared, sizeof(PgStat_CheckPointer),
+                             &StatsShmem->checkpointer_changecount);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&reset, reset_shared, sizeof(PgStat_CheckPointer));
+    LWLockRelease(StatsLock);
+
+    /* compensate by reset offsets */
+    cached->timed_checkpoints       -= reset.timed_checkpoints;
+    cached->requested_checkpoints   -= reset.requested_checkpoints;
+    cached->buf_written_checkpoints -= reset.buf_written_checkpoints;
+    cached->buf_written_backend     -= reset.buf_written_backend;
+    cached->buf_fsync_backend       -= reset.buf_fsync_backend;
+    cached->checkpoint_write_time   -= reset.checkpoint_write_time;
+    cached->checkpoint_sync_time    -= reset.checkpoint_sync_time;
+
+    cached_checkpointerstats_is_valid = true;
 
-    return &globalStats;
+    return &cached_checkpointerstats;
 }
 
 /*
  * ---------
  * pgstat_fetch_stat_wal() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the WAL statistics struct.
+ *    Support function for the SQL-callable pgstat* functions. The returned entry
+ *  is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_WalStats *
+PgStat_Wal *
 pgstat_fetch_stat_wal(void)
 {
-    backend_read_statsfile();
+    if (!cached_walstats_is_valid)
+    {
+        LWLockAcquire(StatsLock, LW_SHARED);
+        memcpy(&cached_walstats, &StatsShmem->wal_stats, sizeof(PgStat_Wal));
+        LWLockRelease(StatsLock);
+    }
 
-    return &walStats;
+    cached_walstats_is_valid = true;
+
+    return &cached_walstats;
 }
 
 /*
@@ -2823,9 +3583,27 @@ pgstat_fetch_stat_wal(void)
 PgStat_SLRUStats *
 pgstat_fetch_slru(void)
 {
-    backend_read_statsfile();
+    size_t size = sizeof(PgStat_SLRUStats) * SLRU_NUM_ELEMENTS;
 
-    return slruStats;
+    for (;;)
+    {
+        uint32 before_count;
+        uint32 after_count;
+
+        pg_read_barrier();
+        before_count = pg_atomic_read_u32(&StatsShmem->slru_changecount);
+        memcpy(&cached_slrustats, &StatsShmem->slru_stats, size);
+        after_count = pg_atomic_read_u32(&StatsShmem->slru_changecount);
+
+        if (before_count == after_count && (before_count & 1) == 0)
+            break;
+
+        CHECK_FOR_INTERRUPTS();
+    }
+
+    cached_slrustats_is_valid = true;
+
+    return &cached_slrustats;
 }
 
 /*
@@ -2837,13 +3615,41 @@ pgstat_fetch_slru(void)
  *    number of entries in nslots_p.
  * ---------
  */
-PgStat_ReplSlotStats *
+PgStat_ReplSlot *
 pgstat_fetch_replslot(int *nslots_p)
 {
-    backend_read_statsfile();
 
-    *nslots_p = nReplSlotStats;
-    return replSlotStats;
+    if (cached_replslotstats == NULL)
+    {
+        cached_replslotstats = (PgStat_ReplSlot *)
+            MemoryContextAlloc(pgStatCacheContext,
+                               sizeof(PgStat_ReplSlot) * max_replication_slots);
+    }
+
+    if (n_cached_replslotstats < 0)
+    {
+        int n = 0;
+        int i;
+
+        for (i = 0 ; i < max_replication_slots ; i++)
+        {
+            PgStat_ReplSlot *shent = (PgStat_ReplSlot *)
+                get_stat_entry(PGSTAT_TYPE_REPLSLOT, MyDatabaseId, i,
+                               false, false, NULL);
+            if (shent && !shent->header.dropped)
+            {
+                memcpy(cached_replslotstats[n++].slotname,
+                       shent->slotname,
+                       sizeof(PgStat_ReplSlot) -
+                       offsetof(PgStat_ReplSlot, slotname));
+            }
+        }
+
+        n_cached_replslotstats = n;
+    }
+
+    *nslots_p = n_cached_replslotstats;
+    return cached_replslotstats;
 }
 
 /* ------------------------------------------------------------
@@ -3068,8 +3874,8 @@ pgstat_initialize(void)
      */
     prevWalUsage = pgWalUsage;
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* needs to be called before dsm shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -3246,12 +4052,15 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+    /* attach shared database stats area */
+    attach_shared_stats();
 }
 
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
- * Flush any remaining statistics counts out to the collector.
+ * Flush any remaining statistics counts out to shared stats.
  * Without this, operations triggered during backend exit (such as
  * temp table deletions) won't be counted.
  *
@@ -3264,13 +4073,22 @@ pgstat_beshutdown_hook(int code, Datum arg)
 
     /*
      * If we got as far as discovering our own database ID, we can report what
-     * we did to the collector.  Otherwise, we'd be sending an invalid
+     * we did to the shared stats.  Otherwise, we'd be sending an invalid
      * database ID, so forget it.  (This means that accesses to pg_database
      * during failed backend starts might never get counted.)
      */
     if (OidIsValid(MyDatabaseId))
         pgstat_report_stat(true);
 
+    /*
+     * We need to clean up temporary slots before detaching shared statistics
+     * so that the statistics for temporary slots are properly removed.
+     */
+    if (MyReplicationSlot != NULL)
+        ReplicationSlotRelease();
+
+    ReplicationSlotCleanup();
+
     /*
      * Clear my status entry, following the protocol of bumping st_changecount
      * before and after.  We use a volatile pointer here to ensure the
@@ -3281,6 +4099,8 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    detach_shared_stats(true);
 }
 
 
@@ -3541,7 +4361,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3836,8 +4657,8 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
+        case WAIT_EVENT_READING_STATS_FILE:
+            event_name = "ReadingStatsFile";
             break;
         case WAIT_EVENT_RECOVERY_WAL_STREAM:
             event_name = "RecoveryWalStream";
@@ -4491,94 +5312,80 @@ pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
 
 
 /* ----------
- * pgstat_setheader() -
+ * pgstat_report_archiver() -
  *
- *        Set common header fields in a statistics message
+ *        Report archiver statistics
  * ----------
  */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
+void
+pgstat_report_archiver(const char *xlog, bool failed)
 {
-    int            rc;
+    TimestampTz now = GetCurrentTimestamp();
+    uint32    before_count PG_USED_FOR_ASSERTS_ONLY;
+    uint32    after_count PG_USED_FOR_ASSERTS_ONLY;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
 
-    ((PgStat_MsgHdr *) msg)->m_size = len;
+    START_CRIT_SECTION();
+    before_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->archiver_changecount, 1);
+    Assert((before_count & 1) == 0);
 
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
+    if (failed)
     {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
- *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
- * ----------
- */
-void
-pgstat_send_archiver(const char *xlog, bool failed)
-{
-    PgStat_MsgArchiver msg;
-
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    strlcpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+        ++StatsShmem->archiver_stats.failed_count;
+        memcpy(&StatsShmem->archiver_stats.last_failed_wal, xlog,
+               sizeof(StatsShmem->archiver_stats.last_failed_wal));
+        StatsShmem->archiver_stats.last_failed_timestamp = now;
+    }
+    else
+    {
+        ++StatsShmem->archiver_stats.archived_count;
+        memcpy(&StatsShmem->archiver_stats.last_archived_wal, xlog,
+               sizeof(StatsShmem->archiver_stats.last_archived_wal));
+        StatsShmem->archiver_stats.last_archived_timestamp = now;
+    }
+
+    after_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->archiver_changecount, 1);
+    Assert(after_count == before_count + 1);
+    END_CRIT_SECTION();
 }
 
 /* ----------
- * pgstat_send_bgwriter() -
+ * pgstat_report_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
-pgstat_send_bgwriter(void)
+pgstat_report_bgwriter(void)
 {
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+    PgStat_BgWriter *s = &StatsShmem->bgwriter_stats;
+    PgStat_BgWriter *l = &BgWriterStats;
+    uint32    before_count PG_USED_FOR_ASSERTS_ONLY;
+    uint32    after_count PG_USED_FOR_ASSERTS_ONLY;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid taking the lock for completely empty stats.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    START_CRIT_SECTION();
+    before_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->bgwriter_changecount, 1);
+    Assert((before_count & 1) == 0);
+
+    s->buf_written_clean += l->buf_written_clean;
+    s->maxwritten_clean += l->maxwritten_clean;
+    s->buf_alloc += l->buf_alloc;
+
+    after_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->bgwriter_changecount, 1);
+    Assert(after_count == before_count + 1);
+    END_CRIT_SECTION();
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4587,537 +5394,136 @@ pgstat_send_bgwriter(void)
 }
 
 /* ----------
- * pgstat_send_wal() -
+ * pgstat_report_checkpointer() -
  *
- *        Send WAL statistics to the collector
+ *        Report checkpointer statistics
  * ----------
  */
 void
-pgstat_send_wal(void)
+pgstat_report_checkpointer(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgWal all_zeroes;
-
-    WalUsage    walusage;
-
-    /*
-     * Calculate how much WAL usage counters are increased by substracting the
-     * previous counters from the current ones. Fill the results in WAL stats
-     * message.
-     */
-    MemSet(&walusage, 0, sizeof(WalUsage));
-    WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);
-
-    WalStats.m_wal_records = walusage.wal_records;
-    WalStats.m_wal_fpi = walusage.wal_fpi;
-    WalStats.m_wal_bytes = walusage.wal_bytes;
+    static const PgStat_CheckPointer all_zeroes;
+    PgStat_CheckPointer *s = &StatsShmem->checkpointer_stats;
+    PgStat_CheckPointer *l = &CheckPointerStats;
+    uint32    before_count PG_USED_FOR_ASSERTS_ONLY;
+    uint32    after_count PG_USED_FOR_ASSERTS_ONLY;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid taking the lock for completely empty stats.
      */
-    if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
+    if (memcmp(&CheckPointerStats, &all_zeroes,
+               sizeof(PgStat_CheckPointer)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&WalStats.m_hdr, PGSTAT_MTYPE_WAL);
-    pgstat_send(&WalStats, sizeof(WalStats));
+    START_CRIT_SECTION();
+    before_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->checkpointer_changecount, 1);
+    Assert((before_count & 1) == 0);
 
-    /*
-     * Save the current counters for the subsequent calculation of WAL usage.
-     */
-    prevWalUsage = pgWalUsage;
+    s->timed_checkpoints += l->timed_checkpoints;
+    s->requested_checkpoints += l->requested_checkpoints;
+    s->checkpoint_write_time += l->checkpoint_write_time;
+    s->checkpoint_sync_time += l->checkpoint_sync_time;
+    s->buf_written_checkpoints += l->buf_written_checkpoints;
+    s->buf_written_backend += l->buf_written_backend;
+    s->buf_fsync_backend += l->buf_fsync_backend;
+
+    after_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->checkpointer_changecount, 1);
+    Assert(after_count == before_count + 1);
+    END_CRIT_SECTION();
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
      */
-    MemSet(&WalStats, 0, sizeof(WalStats));
+    MemSet(&CheckPointerStats, 0, sizeof(CheckPointerStats));
 }
 
 /* ----------
- * pgstat_send_slru() -
+ * pgstat_report_wal() -
  *
- *        Send SLRU statistics to the collector
+ *        Report WAL statistics
  * ----------
  */
-static void
-pgstat_send_slru(void)
+void
+pgstat_report_wal(void)
 {
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgSLRU all_zeroes;
-
-    for (int i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /*
-         * This function can be called even if nothing at all has happened. In
-         * this case, avoid sending a completely empty message to the stats
-         * collector.
-         */
-        if (memcmp(&SLRUStats[i], &all_zeroes, sizeof(PgStat_MsgSLRU)) == 0)
-            continue;
-
-        /* set the SLRU type before each send */
-        SLRUStats[i].m_index = i;
-
-        /*
-         * Prepare and send the message
-         */
-        pgstat_setheader(&SLRUStats[i].m_hdr, PGSTAT_MTYPE_SLRU);
-        pgstat_send(&SLRUStats[i], sizeof(PgStat_MsgSLRU));
-
-        /*
-         * Clear out the statistics buffer, so it can be re-used.
-         */
-        MemSet(&SLRUStats[i], 0, sizeof(PgStat_MsgSLRU));
-    }
+    flush_walstat(false);
 }
 
-
 /* ----------
- * PgstatCollectorMain() -
+ * get_local_dbstat_entry() -
  *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-    WaitEvent    event;
-    WaitEventSet *wes;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, SignalHandlerForConfigReload);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    MyBackendType = B_STATS_COLLECTOR;
-    init_ps_display(NULL);
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /* Prepare to wait for our latch or data in our socket. */
-    wes = CreateWaitEventSet(CurrentMemoryContext, 3);
-    AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
-    AddWaitEventToSet(wes, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL, NULL);
-    AddWaitEventToSet(wes, WL_SOCKET_READABLE, pgStatSock, NULL, NULL);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do check ConfigReloadPending
-     * inside the inner loop, which means that such interrupts will get
-     * serviced but the latch won't get cleared until next time there is a
-     * break in the action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (ShutdownRequestPending)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * ShutdownRequestPending becomes set.
-         */
-        while (!ShutdownRequestPending)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (ConfigReloadPending)
-            {
-                ConfigReloadPending = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSLRUCOUNTER:
-                    pgstat_recv_resetslrucounter(&msg.msg_resetslrucounter,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETREPLSLOTCOUNTER:
-                    pgstat_recv_resetreplslotcounter(&msg.msg_resetreplslotcounter,
-                                                     len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_WAL:
-                    pgstat_recv_wal(&msg.msg_wal, len);
-                    break;
-
-                case PGSTAT_MTYPE_SLRU:
-                    pgstat_recv_slru(&msg.msg_slru, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_REPLSLOT:
-                    pgstat_recv_replslot(&msg.msg_replslot, len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitEventSetWait(wes, -1L, &event, 1, WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitEventSetWait(wes, 2 * 1000L /* msec */ , &event, 1,
-                              WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr == 1 && event.events == WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    FreeWaitEventSet(wes);
-
-    exit(0);
-}
-
-/*
- * Subroutine to clear stats in a database entry
- *
- * Tables and functions hashes are initialized to empty.
- */
-static void
-reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
-{
-    HASHCTL        hash_ctl;
-
-    dbentry->n_xact_commit = 0;
-    dbentry->n_xact_rollback = 0;
-    dbentry->n_blocks_fetched = 0;
-    dbentry->n_blocks_hit = 0;
-    dbentry->n_tuples_returned = 0;
-    dbentry->n_tuples_fetched = 0;
-    dbentry->n_tuples_inserted = 0;
-    dbentry->n_tuples_updated = 0;
-    dbentry->n_tuples_deleted = 0;
-    dbentry->last_autovac_time = 0;
-    dbentry->n_conflict_tablespace = 0;
-    dbentry->n_conflict_lock = 0;
-    dbentry->n_conflict_snapshot = 0;
-    dbentry->n_conflict_bufferpin = 0;
-    dbentry->n_conflict_startup_deadlock = 0;
-    dbentry->n_temp_files = 0;
-    dbentry->n_temp_bytes = 0;
-    dbentry->n_deadlocks = 0;
-    dbentry->n_checksum_failures = 0;
-    dbentry->last_checksum_failure = 0;
-    dbentry->n_block_read_time = 0;
-    dbentry->n_block_write_time = 0;
-
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
-}
-
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+ *  Find or create a local PgStat_StatDBEntry entry for dbid.  A new entry
+ *  is created and initialized if none exists.
  */
 static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
+get_local_dbstat_entry(Oid dbid)
 {
-    PgStat_StatDBEntry *result;
+    PgStat_StatDBEntry *dbentry;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
-        return NULL;
 
     /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
+     * Find an entry or create a new one.
      */
-    if (!found)
-        reset_dbentry_counters(result);
+    dbentry = (PgStat_StatDBEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid,
+                             true, &found);
 
-    return result;
+    return dbentry;
 }
 
-
-/*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+/* ----------
+ * get_local_tabstat_entry() -
+ *  Find or create a PgStat_TableStatus entry for rel. A new entry is created
+ *  and initialized if none exists.
+ * ----------
  */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+static PgStat_TableStatus *
+get_local_tabstat_entry(Oid rel_id, bool isshared)
 {
-    PgStat_StatTabEntry *result;
+    PgStat_TableStatus *tabentry;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
 
-    /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
+    tabentry = (PgStat_TableStatus *)
+        get_local_stat_entry(PGSTAT_TYPE_TABLE,
+                             isshared ? InvalidOid : MyDatabaseId,
+                             rel_id, true, &found);
 
-    if (!create && !found)
-        return NULL;
-
-    /* If not found, initialize the new one. */
-    if (!found)
-    {
-        result->numscans = 0;
-        result->tuples_returned = 0;
-        result->tuples_fetched = 0;
-        result->tuples_inserted = 0;
-        result->tuples_updated = 0;
-        result->tuples_deleted = 0;
-        result->tuples_hot_updated = 0;
-        result->n_live_tuples = 0;
-        result->n_dead_tuples = 0;
-        result->changes_since_analyze = 0;
-        result->inserts_since_vacuum = 0;
-        result->blocks_fetched = 0;
-        result->blocks_hit = 0;
-        result->vacuum_timestamp = 0;
-        result->vacuum_count = 0;
-        result->autovac_vacuum_timestamp = 0;
-        result->autovac_vacuum_count = 0;
-        result->analyze_timestamp = 0;
-        result->analyze_count = 0;
-        result->autovac_analyze_timestamp = 0;
-        result->autovac_analyze_count = 0;
-    }
-
-    return result;
+    return tabentry;
 }
 
-
 /* ----------
- * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
+ * pgstat_write_statsfile() -
+ *        Write the global statistics file, as well as DB files.
  *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ * This function is called in the last process that is accessing the shared
+ * stats so locking is not required.
  * ----------
  */
 static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+pgstat_write_statsfile(void)
 {
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
-    int            i;
+    dshash_seq_status hstat;
+    PgStatHashEntry *ps;
+
+    /* Stats are not initialized yet; just return. */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
+
+    /* this is the last process that is accessing the shared stats */
+#ifdef USE_ASSERT_CHECKING
+    LWLockAcquire(StatsLock, LW_SHARED);
+    Assert(StatsShmem->refcount == 0);
+    LWLockRelease(StatsLock);
+#endif
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -5137,7 +5543,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    pg_atomic_write_u64(&StatsShmem->stats_timestamp, GetCurrentTimestamp());
 
     /*
      * Write the file header --- currently just a format ID.
@@ -5147,200 +5553,72 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Write global stats struct
+     * Write bgwriter global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(&StatsShmem->bgwriter_stats, sizeof(PgStat_BgWriter), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Write archiver stats struct
+     * Write checkpointer global stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+    rc = fwrite(&StatsShmem->checkpointer_stats, sizeof(PgStat_CheckPointer), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Write WAL stats struct
+     * Write archiver global stats struct
      */
-    rc = fwrite(&walStats, sizeof(walStats), 1, fpout);
+    rc = fwrite(&StatsShmem->archiver_stats, sizeof(PgStat_Archiver), 1,
+                fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write WAL global stats struct
+     */
+    rc = fwrite(&StatsShmem->wal_stats, sizeof(PgStat_Wal), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write SLRU stats struct
      */
-    rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
+    rc = fwrite(&StatsShmem->slru_stats, sizeof(PgStatSharedSLRUStats), 1,
+                fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Walk through the database table.
+     * Walk through the stats entries
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((ps = dshash_seq_next(&hstat)) != NULL)
     {
-        /*
-         * Write out the table and function stats for this DB into the
-         * appropriate per-DB stat file, if required.
-         */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
+        PgStat_StatEntryHeader *shent;
+        size_t                    len;
+
+        CHECK_FOR_INTERRUPTS();
+
+        shent = (PgStat_StatEntryHeader *)dsa_get_address(area, ps->body);
+
+        /* we may have some "dropped" entries not yet removed, skip them */
+        if (shent->dropped)
+            continue;
+
+        /* Make DB's timestamp consistent with the global stats */
+        if (ps->key.type == PGSTAT_TYPE_DB)
         {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+            PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) shent;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
+            dbentry->stats_timestamp =
+                (TimestampTz) pg_atomic_read_u64(&StatsShmem->stats_timestamp);
         }
 
-        /*
-         * Write out the DB entry. We don't write the tables or functions
-         * pointers, since they're of no use to any other process.
-         */
-        fputc('D', fpout);
-        rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * Write replication slot stats struct
-     */
-    for (i = 0; i < nReplSlotStats; i++)
-    {
-        fputc('R', fpout);
-        rc = fwrite(&replSlotStats[i], sizeof(PgStat_ReplSlotStats), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
-}
-
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
-
-/* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
- * ----------
- */
-static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
-{
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpout;
-    int32        format_id;
-    Oid            dbid = dbentry->databaseid;
-    int            rc;
-    char        tmpfile[MAXPGPATH];
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database's access stats per table.
-     */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
-    {
-        fputc('T', fpout);
-        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
+        fputc('S', fpout);
+        rc = fwrite(&ps->key, sizeof(PgStatHashKey), 1, fpout);
 
-    /*
-     * Walk through the database's function stats table.
-     */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+        /* Write the entry body, excluding the header part */
+        len = PGSTAT_SHENT_BODY_LEN(ps->key.type);
+        rc = fwrite(PGSTAT_SHENT_BODY(shent), len, 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_seq_term(&hstat);
 
     /*
      * No more output to be done. Close the temp file and replace the old
@@ -5374,113 +5652,63 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
 }
 
 /* ----------
- * pgstat_read_statsfiles() -
- *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ * pgstat_read_statsfile() -
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
+ *    Reads the existing activity statistics file into the shared stats hash.
  *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
+ * This function is called in the only process that is accessing the shared
+ * stats so locking is not required.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+static void
+pgstat_read_statsfile(void)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-    int            i;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    char        tag;
 
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
+    /* shouldn't be called from postmaster */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Create the DB hashtable
-     */
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    /* this is the only process that is accessing the shared stats */
+#ifdef USE_ASSERT_CHECKING
+    LWLockAcquire(StatsLock, LW_SHARED);
+    Assert(StatsShmem->refcount == 1);
+    LWLockRelease(StatsLock);
+#endif
 
-    /* Allocate the space for replication slot statistics */
-    replSlotStats = palloc0(max_replication_slots * sizeof(PgStat_ReplSlotStats));
-    nReplSlotStats = 0;
-
-    /*
-     * Clear out global, archiver, WAL and SLRU statistics so they start from
-     * zero in case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-    memset(&walStats, 0, sizeof(walStats));
-    memset(&slruStats, 0, sizeof(slruStats));
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-    walStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Set the same reset timestamp for all SLRU items too.
-     */
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-        slruStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Set the same reset timestamp for all replication slots too.
-     */
-    for (i = 0; i < max_replication_slots; i++)
-        replSlotStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    StatsShmem->bgwriter_stats.stat_reset_timestamp = GetCurrentTimestamp();
+    StatsShmem->archiver_stats.stat_reset_timestamp =
+        StatsShmem->bgwriter_stats.stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply starts
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the activity statistics is not running or
+     * has not yet written the stats file the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -5489,681 +5717,150 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
-        goto done;
-    }
-
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        goto done;
-    }
-
-    /*
-     * Read WAL stats struct
-     */
-    if (fread(&walStats, 1, sizeof(walStats), fpin) != sizeof(walStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&walStats, 0, sizeof(walStats));
         goto done;
     }
 
     /*
-     * Read SLRU stats struct
+     * Read bgwriter stats struct
      */
-    if (fread(slruStats, 1, sizeof(slruStats), fpin) != sizeof(slruStats))
+    if (fread(&StatsShmem->bgwriter_stats, 1, sizeof(PgStat_BgWriter), fpin) !=
+        sizeof(PgStat_BgWriter))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&slruStats, 0, sizeof(slruStats));
+        MemSet(&StatsShmem->bgwriter_stats, 0, sizeof(PgStat_BgWriter));
         goto done;
     }
 
     /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Add to the DB hash
-                 */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
-                break;
-
-                /*
-                 * 'R'    A PgStat_ReplSlotStats struct describing a replication
-                 * slot follows.
-                 */
-            case 'R':
-                if (fread(&replSlotStats[nReplSlotStats], 1, sizeof(PgStat_ReplSlotStats), fpin)
-                    != sizeof(PgStat_ReplSlotStats))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    memset(&replSlotStats[nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
-                    goto done;
-                }
-                nReplSlotStats++;
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-
-    return dbhash;
-}
-
-
-/* ----------
- * pgstat_read_db_statsfile() -
- *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
- * ----------
- */
-static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
-{
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatTabEntry tabbuf;
-    PgStat_StatFuncEntry funcbuf;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return;
-    }
-
-    /*
-     * Verify it's of the expected format.
+     * Read checkpointer stats struct
      */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
+    if (fread(&StatsShmem->checkpointer_stats, 1, sizeof(PgStat_CheckPointer), fpin) !=
+        sizeof(PgStat_CheckPointer))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
+        MemSet(&StatsShmem->checkpointer_stats, 0, sizeof(PgStat_CheckPointer));
         goto done;
     }
 
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'T'    A PgStat_StatTabEntry follows.
-                 */
-            case 'T':
-                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
-                          fpin) != sizeof(PgStat_StatTabEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
-                break;
-
-                /*
-                 * 'F'    A PgStat_StatFuncEntry follows.
-                 */
-            case 'F':
-                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
-                          fpin) != sizeof(PgStat_StatFuncEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
-                break;
-
-                /*
-                 * 'E'    The EOF marker of a complete stats file.
-                 */
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts. The caller must
- *    rely on timestamp stored in *ts iff the function returns true.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    PgStat_WalStats myWalStats;
-    PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
-    PgStat_ReplSlotStats myReplSlotStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
     /*
      * Read archiver stats struct
      */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
+    if (fread(&StatsShmem->archiver_stats, 1, sizeof(PgStat_Archiver),
+              fpin) != sizeof(PgStat_Archiver))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
+        MemSet(&StatsShmem->archiver_stats, 0, sizeof(PgStat_Archiver));
+        goto done;
     }
 
     /*
      * Read WAL stats struct
      */
-    if (fread(&myWalStats, 1, sizeof(myWalStats), fpin) != sizeof(myWalStats))
+    if (fread(&StatsShmem->wal_stats, 1, sizeof(PgStat_Wal), fpin)
+        != sizeof(PgStat_Wal))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
+        MemSet(&StatsShmem->wal_stats, 0, sizeof(PgStat_Wal));
+        goto done;
     }
 
     /*
      * Read SLRU stats struct
      */
-    if (fread(mySLRUStats, 1, sizeof(mySLRUStats), fpin) != sizeof(mySLRUStats))
+    if (fread(&StatsShmem->slru_stats, 1, sizeof(PgStatSharedSLRUStats),
+              fpin) != sizeof(PgStatSharedSLRUStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-                /*
-                 * 'R'    A PgStat_ReplSlotStats struct describing a replication
-                 * slot follows.
-                 */
-            case 'R':
-                if (fread(&myReplSlotStats, 1, sizeof(PgStat_ReplSlotStats), fpin)
-                    != sizeof(PgStat_ReplSlotStats))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-        }
+        goto done;
     }
 
-done:
-    FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
     /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
+    while ((tag = fgetc(fpin)) == 'S')
     {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
+        PgStatHashKey        key;
+        PgStat_StatEntryHeader *p;
+        size_t                len;
 
         CHECK_FOR_INTERRUPTS();
 
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
+        if (fread(&key, 1, sizeof(key), fpin) != sizeof(key))
         {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"", statfile)));
+            goto done;
         }
 
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                ereport(LOG,
-                        (errmsg("statistics collector's time %s is later than backend local time %s",
-                                filetime, mytime)));
-                pfree(filetime);
-                pfree(mytime);
-            }
+        p = get_stat_entry(key.type, key.databaseid, key.objectid,
+                              false, true, &found);
 
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
+        /* don't allow duplicate entries */
+        if (found)
+        {
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"",
+                            statfile)));
+            goto done;
         }
 
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
+        /* Avoid overwriting header part */
+        len = PGSTAT_SHENT_BODY_LEN(key.type);
 
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
+        if (fread(PGSTAT_SHENT_BODY(p), 1, len, fpin) != len)
+        {
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"", statfile)));
+            goto done;
+        }
     }
 
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
+    if (tag != 'E')
+    {
         ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
+                (errmsg("corrupted statistics file \"%s\"",
+                        statfile)));
+        goto done;
+    }
 
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
+done:
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+
+    return;
 }
 
-
 /* ----------
  * pgstat_setup_memcxt() -
  *
- *    Create pgStatLocalContext, if not already done.
+ *    Create pgStatLocalContext if not already done.
  * ----------
  */
 static void
 pgstat_setup_memcxt(void)
 {
     if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
-}
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
 
+    if (!pgStatCacheContext)
+        pgStatCacheContext =
+            AllocSetContextCreate(CacheMemoryContext,
+                                  "Activity statistics",
+                                  ALLOCSET_SMALL_SIZES);
+}
 
 /* ----------
  * pgstat_clear_snapshot() -
@@ -6180,906 +5877,25 @@ pgstat_clear_snapshot(void)
 {
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
+    {
         MemoryContextDelete(pgStatLocalContext);
 
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            ereport(LOG,
-                    (errmsg("stats_timestamp %s is later than collector's time %s for database %u",
-                            writetime, mytime, dbentry->databaseid)));
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-                tabentry->inserts_since_vacuum = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_WAL)
-    {
-        /* Reset the WAL statistics for the cluster. */
-        memset(&walStats, 0, sizeof(walStats));
-        walStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_resetslrucounter() -
- *
- *    Reset some SLRU statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len)
-{
-    int            i;
-    TimestampTz ts = GetCurrentTimestamp();
-
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /* reset entry with the given index, or all entries (index is -1) */
-        if ((msg->m_index == -1) || (msg->m_index == i))
-        {
-            memset(&slruStats[i], 0, sizeof(slruStats[i]));
-            slruStats[i].stat_reset_timestamp = ts;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_resetreplslotcounter() -
- *
- *    Reset some replication slot statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg,
-                                 int len)
-{
-    int            i;
-    int            idx = -1;
-    TimestampTz ts;
-
-    ts = GetCurrentTimestamp();
-    if (msg->clearall)
-    {
-        for (i = 0; i < nReplSlotStats; i++)
-            pgstat_reset_replslot(i, ts);
-    }
-    else
-    {
-        /* Get the index of replication slot statistics to reset */
-        idx = pgstat_replslot_index(msg->m_slotname, false);
-
-        /*
-         * Nothing to do if the given slot entry is not found.  This could
-         * happen when the slot with the given name is removed and the
-         * corresponding statistics entry is also removed before receiving the
-         * reset message.
-         */
-        if (idx < 0)
-            return;
-
-        /* Reset the stats for the requested replication slot */
-        pgstat_reset_replslot(idx, ts);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signaling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * It is quite possible that a non-aggressive VACUUM ended up skipping
-     * various pages, however, we'll zero the insert counter here regardless.
-     * It's currently used only to track when we need to perform an "insert"
-     * autovacuum, which are mainly intended to freeze newly inserted tuples.
-     * Zeroing this may just mean we'll not try to vacuum the table again
-     * until enough tuples have been inserted to trigger another insert
-     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
-     * stragglers.
-     */
-    tabentry->inserts_since_vacuum = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_wal() -
- *
- *    Process a WAL message.
- * ----------
- */
-static void
-pgstat_recv_wal(PgStat_MsgWal *msg, int len)
-{
-    walStats.wal_records += msg->m_wal_records;
-    walStats.wal_fpi += msg->m_wal_fpi;
-    walStats.wal_bytes += msg->m_wal_bytes;
-    walStats.wal_buffers_full += msg->m_wal_buffers_full;
-}
-
-/* ----------
- * pgstat_recv_slru() -
- *
- *    Process a SLRU message.
- * ----------
- */
-static void
-pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
-{
-    slruStats[msg->m_index].blocks_zeroed += msg->m_blocks_zeroed;
-    slruStats[msg->m_index].blocks_hit += msg->m_blocks_hit;
-    slruStats[msg->m_index].blocks_read += msg->m_blocks_read;
-    slruStats[msg->m_index].blocks_written += msg->m_blocks_written;
-    slruStats[msg->m_index].blocks_exists += msg->m_blocks_exists;
-    slruStats[msg->m_index].flush += msg->m_flush;
-    slruStats[msg->m_index].truncate += msg->m_truncate;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_replslot() -
- *
- *    Process a REPLSLOT message.
- * ----------
- */
-static void
-pgstat_recv_replslot(PgStat_MsgReplSlot *msg, int len)
-{
-    int            idx;
-
-    /*
-     * Get the index of replication slot statistics.  On dropping, we don't
-     * create the new statistics.
-     */
-    idx = pgstat_replslot_index(msg->m_slotname, !msg->m_drop);
-
-    /*
-     * The slot entry is not found or there is no space to accommodate the new
-     * entry.  This could happen when the message for the creation of a slot
-     * reached before the drop message even though the actual operations
-     * happen in reverse order.  In such a case, the next update of the
-     * statistics for the same slot will create the required entry.
-     */
-    if (idx < 0)
-        return;
-
-    /* it must be a valid replication slot index */
-    Assert(idx < nReplSlotStats);
-
-    if (msg->m_drop)
-    {
-        /* Remove the replication slot statistics with the given name */
-        if (idx < nReplSlotStats - 1)
-            memcpy(&replSlotStats[idx], &replSlotStats[nReplSlotStats - 1],
-                   sizeof(PgStat_ReplSlotStats));
-        nReplSlotStats--;
-    }
-    else
-    {
-        /* Update the replication slot statistics */
-        replSlotStats[idx].spill_txns += msg->m_spill_txns;
-        replSlotStats[idx].spill_count += msg->m_spill_count;
-        replSlotStats[idx].spill_bytes += msg->m_spill_bytes;
-        replSlotStats[idx].stream_txns += msg->m_stream_txns;
-        replSlotStats[idx].stream_count += msg->m_stream_count;
-        replSlotStats[idx].stream_bytes += msg->m_stream_bytes;
-    }
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
+    }
+
+    /* Invalidate the simple cache keys */
+    cached_dbent_key = stathashkey_zero;
+    cached_tabent_key = stathashkey_zero;
+    cached_funcent_key = stathashkey_zero;
+    cached_archiverstats_is_valid = false;
+    cached_bgwriterstats_is_valid = false;
+    cached_checkpointerstats_is_valid = false;
+    cached_walstats_is_valid = false;
+    cached_slrustats_is_valid = false;
+    n_cached_replslotstats = -1;
 }
 
 /*
@@ -7126,60 +5942,6 @@ pgstat_clip_activity(const char *raw_activity)
     return activity;
 }
 
-/* ----------
- * pgstat_replslot_index
- *
- * Return the index of entry of a replication slot with the given name, or
- * -1 if the slot is not found.
- *
- * create_it tells whether to create the new slot entry if it is not found.
- * ----------
- */
-static int
-pgstat_replslot_index(const char *name, bool create_it)
-{
-    int            i;
-
-    Assert(nReplSlotStats <= max_replication_slots);
-    for (i = 0; i < nReplSlotStats; i++)
-    {
-        if (strcmp(replSlotStats[i].slotname, name) == 0)
-            return i;            /* found */
-    }
-
-    /*
-     * The slot is not found.  We don't want to register the new statistics if
-     * the list is already full or the caller didn't request.
-     */
-    if (i == max_replication_slots || !create_it)
-        return -1;
-
-    /* Register new slot */
-    memset(&replSlotStats[nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
-    strlcpy(replSlotStats[nReplSlotStats].slotname, name, NAMEDATALEN);
-
-    return nReplSlotStats++;
-}
-
-/* ----------
- * pgstat_reset_replslot
- *
- * Reset the replication slot stats at index 'i'.
- * ----------
- */
-static void
-pgstat_reset_replslot(int i, TimestampTz ts)
-{
-    /* reset only counters. Don't clear slot name */
-    replSlotStats[i].spill_txns = 0;
-    replSlotStats[i].spill_count = 0;
-    replSlotStats[i].spill_bytes = 0;
-    replSlotStats[i].stream_txns = 0;
-    replSlotStats[i].stream_count = 0;
-    replSlotStats[i].stream_bytes = 0;
-    replSlotStats[i].stat_reset_timestamp = ts;
-}
-
 /*
  * pgstat_slru_index
  *
@@ -7224,7 +5986,7 @@ pgstat_slru_name(int slru_idx)
  * Returns pointer to entry with counters for given SLRU (based on the name
  * stored in SlruCtl as lwlock tranche name).
  */
-static inline PgStat_MsgSLRU *
+static PgStat_SLRUStats *
 slru_entry(int slru_idx)
 {
     /*
@@ -7235,7 +5997,7 @@ slru_entry(int slru_idx)
 
     Assert((slru_idx >= 0) && (slru_idx < SLRU_NUM_ELEMENTS));
 
-    return &SLRUStats[slru_idx];
+    return &local_SLRUStats[slru_idx];
 }
 
 /*
@@ -7245,41 +6007,48 @@ slru_entry(int slru_idx)
 void
 pgstat_count_slru_page_zeroed(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_zeroed += 1;
+    slru_entry(slru_idx)->blocks_zeroed += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_hit(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_hit += 1;
+    slru_entry(slru_idx)->blocks_hit += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_exists(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_exists += 1;
+    slru_entry(slru_idx)->blocks_exists += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_read(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_read += 1;
+    slru_entry(slru_idx)->blocks_read += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_written(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_written += 1;
+    slru_entry(slru_idx)->blocks_written += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_flush(int slru_idx)
 {
-    slru_entry(slru_idx)->m_flush += 1;
+    slru_entry(slru_idx)->flush += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_truncate(int slru_idx)
 {
-    slru_entry(slru_idx)->m_truncate += 1;
+    slru_entry(slru_idx)->truncate += 1;
+    have_slrustats = true;
 }
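
Note on the pgstat.c hunk above: the SLRU counters now bump a backend-local PgStat_SLRUStats entry and set have_slrustats, instead of filling in a PgStat_MsgSLRU message. The code that consumes the flag is not part of this hunk, so the following is only a minimal sketch of the general pattern (accumulate locally, mark dirty, fold into a shared area later), not the patch's actual implementation. Every demo_* name is hypothetical, and the "shared" array here is an ordinary static array standing in for shared memory; real code would presumably need a lock around the shared update.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for SLRU_NUM_ELEMENTS. */
#define DEMO_NUM_SLRU 4

/* Hypothetical, trimmed-down counterpart of PgStat_SLRUStats. */
typedef struct DemoSLRUStats
{
    int64_t     blocks_zeroed;
    int64_t     blocks_hit;
} DemoSLRUStats;

/* Per-process pending counts (plays the role of local_SLRUStats). */
static DemoSLRUStats demo_local[DEMO_NUM_SLRU];

/* Plain array standing in for the shared-memory copy. */
static DemoSLRUStats demo_shared[DEMO_NUM_SLRU];

/* Dirty flag (plays the role of have_slrustats). */
static bool demo_have_stats = false;

/* Return the local entry for an index, checking bounds like slru_entry(). */
static DemoSLRUStats *
demo_entry(int idx)
{
    assert(idx >= 0 && idx < DEMO_NUM_SLRU);
    return &demo_local[idx];
}

/* Counting touches only local memory and marks the local state dirty. */
static void
demo_count_page_hit(int idx)
{
    demo_entry(idx)->blocks_hit += 1;
    demo_have_stats = true;
}

static void
demo_count_page_zeroed(int idx)
{
    demo_entry(idx)->blocks_zeroed += 1;
    demo_have_stats = true;
}

/*
 * Fold pending local counts into the shared copy.  Returns false without
 * touching the shared copy when nothing is pending, which is the point of
 * keeping the dirty flag.  Assumption: a real implementation would hold a
 * lock while updating shared memory; none is modelled here.
 */
static bool
demo_flush(void)
{
    int         i;

    if (!demo_have_stats)
        return false;

    for (i = 0; i < DEMO_NUM_SLRU; i++)
    {
        demo_shared[i].blocks_zeroed += demo_local[i].blocks_zeroed;
        demo_shared[i].blocks_hit += demo_local[i].blocks_hit;
    }

    /* Start the next accumulation round from zero. */
    memset(demo_local, 0, sizeof(demo_local));
    demo_have_stats = false;
    return true;
}
```

The flag keeps the hot path cheap: each counter call is a local increment plus a store, and a flush with nothing pending returns immediately.
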
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index af91c313e2..9b9c9b1c11 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -251,7 +251,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -512,7 +511,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1327,12 +1325,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1782,11 +1774,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2699,8 +2686,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -3029,8 +3014,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3097,13 +3080,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3176,22 +3152,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3653,22 +3613,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3895,8 +3839,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3920,8 +3862,7 @@ PostmasterStateMachine(void)
     {
         /*
          * PM_WAIT_DEAD_END state ends when the BackendList is entirely empty
-         * (ie, no dead_end children remain), and the archiver and stats
-         * collector are gone too.
+         * (ie, no dead_end children remain), and the archiver is gone too.
          *
          * The reason we wait for those two is to protect them against a new
          * postmaster starting conflicting subprocesses; this isn't an
@@ -3931,8 +3872,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4133,8 +4073,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5067,18 +5005,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5191,12 +5117,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6113,7 +6033,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6169,8 +6088,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6403,7 +6320,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 0f54635550..d21801cf90 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1568,8 +1568,8 @@ is_checksummed_file(const char *fullpath, const char *filename)
  *
  * If 'missing_ok' is true, will not throw an error if the file is not found.
  *
- * If dboid is anything other than InvalidOid then any checksum failures detected
- * will get reported to the stats collector.
+ * If dboid is anything other than InvalidOid then any checksum failures
+ * detected will get reported to the activity stats facility.
  *
  * Returns true if the file was successfully sent, false if 'missing_ok',
  * and the file did not exist.
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index e00c7ffc01..608818beea 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -692,14 +692,10 @@ ReplicationSlotDropPtr(ReplicationSlot *slot)
                 (errmsg("could not remove directory \"%s\"", tmppath)));
 
     /*
-     * Send a message to drop the replication slot to the stats collector.
-     * Since there is no guarantee of the order of message transfer on a UDP
-     * connection, it's possible that a message for creating a new slot
-     * reaches before a message for removing the old slot. We send the drop
-     * and create messages while holding ReplicationSlotAllocationLock to
-     * reduce that possibility. If the messages reached in reverse, we would
-     * lose one statistics update message. But the next update message will
-     * create the statistics for the replication slot.
+     * Drop the statistics entry for the replication slot.  Do this while
+     * holding ReplicationSlotAllocationLock so that we don't drop a statistics
+     * entry for another slot with the same name just created in another
+     * session.
      */
     if (SlotIsLogical(slot))
         pgstat_report_replslot_drop(NameStr(slot->data.name));
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 71b5852224..160977d50d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2047,7 +2047,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                CheckPointerStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2157,7 +2157,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2347,7 +2347,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2355,7 +2355,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
 
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index f9bbe97b50..78cfe91eab 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -149,6 +149,7 @@ CreateSharedMemoryAndSemaphores(void)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -267,6 +268,7 @@ CreateSharedMemoryAndSemaphores(void)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index db7e59f8b7..e6e4c0fb04 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -177,7 +177,9 @@ static const char *const BuiltinTrancheNames[] = {
     /* LWTRANCHE_PARALLEL_APPEND: */
     "ParallelAppend",
     /* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
-    "PerXactPredicateList"
+    "PerXactPredicateList",
+    /* LWTRANCHE_STATS: */
+    "ActivityStatistics"
 };
 
 StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index 774292fd94..91bf8d5b5d 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -53,3 +53,4 @@ XactTruncationLock                    44
 # 45 was XactTruncationLock until removal of BackendRandomLock
 WrapLimitsVacuumLock                46
 NotifyQueueTailLock                    47
+StatsLock                            48
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 0f31ff3822..fc96f21519 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -414,8 +414,8 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
     DropRelFileNodesAllBuffers(rnodes, nrels);
 
     /*
-     * It'd be nice to tell the stats collector to forget them immediately,
-     * too. But we can't because we don't know the OIDs.
+     * It'd be nice to tell the activity stats facility to forget them
+     * immediately, too. But we can't because we don't know the OIDs.
      */
 
     /*
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 28055680aa..8b51852505 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3271,6 +3271,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3842,6 +3848,7 @@ PostgresMain(int argc, char *argv[],
     volatile bool send_ready_for_query = true;
     bool        idle_in_transaction_timeout_enabled = false;
     bool        idle_session_timeout_enabled = false;
+    bool        idle_stats_update_timeout_enabled = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4238,11 +4245,12 @@ PostgresMain(int argc, char *argv[],
          * Note: this includes fflush()'ing the last of the prior output.
          *
          * This is also a good time to send collected statistics to the
-         * collector, and to update the PS stats display.  We avoid doing
-         * those every time through the message loop because it'd slow down
-         * processing of batched messages, and because we don't want to report
-         * uncommitted updates (that confuses autovacuum).  The notification
-         * processor wants a call too, if we are not in a transaction block.
+         * activity stats facility, and to update the PS stats display.  We
+         * avoid doing those every time through the message loop because it'd
+         * slow down processing of batched messages, and because we don't want
+         * to report uncommitted updates (that confuses autovacuum).  The
+         * notification processor wants a call too, if we are not in a
+         * transaction block.
          *
          * Also, if an idle timeout is enabled, start the timer for that.
          */
@@ -4276,6 +4284,8 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long stats_timeout;
+
                 /* Send out notify signals and transmit self-notifies */
                 ProcessCompletedNotifies();
 
@@ -4288,8 +4298,14 @@ PostgresMain(int argc, char *argv[],
                 if (notifyInterruptPending)
                     ProcessNotifyInterrupt();
 
-                pgstat_report_stat(false);
-
+                /* Start the idle-stats-update timer */
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    idle_stats_update_timeout_enabled = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle");
                 pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -4323,9 +4339,9 @@ PostgresMain(int argc, char *argv[],
         firstchar = ReadCommand(&input_message);
 
         /*
-         * (4) turn off the idle-in-transaction and idle-session timeouts, if
-         * active.  We do this before step (5) so that any last-moment timeout
-         * is certain to be detected in step (5).
+         * (4) turn off the idle-in-transaction, idle-session and
+         * idle-stats-update timeouts if active.  We do this before step (5) so
+         * that any last-moment timeout is certain to be detected in step (5).
          *
          * At most one of these timeouts will be active, so there's no need to
          * worry about combining the timeout.c calls into one.
@@ -4340,6 +4356,11 @@ PostgresMain(int argc, char *argv[],
             disable_timeout(IDLE_SESSION_TIMEOUT, false);
             idle_session_timeout_enabled = false;
         }
+        if (idle_stats_update_timeout_enabled)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            idle_stats_update_timeout_enabled = false;
+        }
 
         /*
          * (5) disable async signal conditions again.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 5c12a165a1..fd6465dfb9 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * pgstatfuncs.c
- *      Functions for accessing the statistics collector data
+ *      Functions for accessing the activity statistics data
  *
  * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -35,9 +35,6 @@
 
 #define HAS_PGSTAT_PERMISSIONS(role)     (is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS) || has_privs_of_role(GetUserId(),role))
 
 
-/* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
-
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
 {
@@ -1269,7 +1266,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_commit);
+        result = (int64) (dbentry->counts.n_xact_commit);
 
     PG_RETURN_INT64(result);
 }
@@ -1285,7 +1282,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_rollback);
+        result = (int64) (dbentry->counts.n_xact_rollback);
 
     PG_RETURN_INT64(result);
 }
@@ -1301,7 +1298,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_fetched);
+        result = (int64) (dbentry->counts.n_blocks_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1317,7 +1314,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_hit);
+        result = (int64) (dbentry->counts.n_blocks_hit);
 
     PG_RETURN_INT64(result);
 }
@@ -1333,7 +1330,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_returned);
+        result = (int64) (dbentry->counts.n_tuples_returned);
 
     PG_RETURN_INT64(result);
 }
@@ -1349,7 +1346,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_fetched);
+        result = (int64) (dbentry->counts.n_tuples_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1365,7 +1362,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_inserted);
+        result = (int64) (dbentry->counts.n_tuples_inserted);
 
     PG_RETURN_INT64(result);
 }
@@ -1381,7 +1378,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_updated);
+        result = (int64) (dbentry->counts.n_tuples_updated);
 
     PG_RETURN_INT64(result);
 }
@@ -1397,7 +1394,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_deleted);
+        result = (int64) (dbentry->counts.n_tuples_deleted);
 
     PG_RETURN_INT64(result);
 }
@@ -1430,7 +1427,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_files;
+        result = dbentry->counts.n_temp_files;
 
     PG_RETURN_INT64(result);
 }
@@ -1446,7 +1443,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_bytes;
+        result = dbentry->counts.n_temp_bytes;
 
     PG_RETURN_INT64(result);
 }
@@ -1461,7 +1458,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace);
+        result = (int64) (dbentry->counts.n_conflict_tablespace);
 
     PG_RETURN_INT64(result);
 }
@@ -1476,7 +1473,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_lock);
+        result = (int64) (dbentry->counts.n_conflict_lock);
 
     PG_RETURN_INT64(result);
 }
@@ -1491,7 +1488,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_snapshot);
+        result = (int64) (dbentry->counts.n_conflict_snapshot);
 
     PG_RETURN_INT64(result);
 }
@@ -1506,7 +1503,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_bufferpin);
+        result = (int64) (dbentry->counts.n_conflict_bufferpin);
 
     PG_RETURN_INT64(result);
 }
@@ -1521,7 +1518,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1536,11 +1533,11 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace +
-                          dbentry->n_conflict_lock +
-                          dbentry->n_conflict_snapshot +
-                          dbentry->n_conflict_bufferpin +
-                          dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_tablespace +
+                          dbentry->counts.n_conflict_lock +
+                          dbentry->counts.n_conflict_snapshot +
+                          dbentry->counts.n_conflict_bufferpin +
+                          dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1555,7 +1552,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_deadlocks);
+        result = (int64) (dbentry->counts.n_deadlocks);
 
     PG_RETURN_INT64(result);
 }
@@ -1573,7 +1570,7 @@ pg_stat_get_db_checksum_failures(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_checksum_failures);
+        result = (int64) (dbentry->counts.n_checksum_failures);
 
     PG_RETURN_INT64(result);
 }
@@ -1610,7 +1607,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_read_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_read_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1626,7 +1623,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_write_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_write_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1634,69 +1631,71 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_bgwriter_timed_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->timed_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->timed_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_requested_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->requested_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->requested_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_clean);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_written_clean);
 }
 
 Datum
 pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->maxwritten_clean);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->maxwritten_clean);
 }
 
 Datum
 pg_stat_get_checkpoint_write_time(PG_FUNCTION_ARGS)
 {
     /* time is already in msec, just convert to double for presentation */
-    PG_RETURN_FLOAT8((double) pgstat_fetch_global()->checkpoint_write_time);
+    PG_RETURN_FLOAT8((double)
+                     pgstat_fetch_stat_checkpointer()->checkpoint_write_time);
 }
 
 Datum
 pg_stat_get_checkpoint_sync_time(PG_FUNCTION_ARGS)
 {
     /* time is already in msec, just convert to double for presentation */
-    PG_RETURN_FLOAT8((double) pgstat_fetch_global()->checkpoint_sync_time);
+    PG_RETURN_FLOAT8((double)
+                     pgstat_fetch_stat_checkpointer()->checkpoint_sync_time);
 }
 
 Datum
 pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_global()->stat_reset_timestamp);
+    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_stat_bgwriter()->stat_reset_timestamp);
 }
 
 Datum
 pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_backend);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_backend);
 }
 
 Datum
 pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_fsync_backend);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_fsync_backend);
 }
 
 Datum
 pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_alloc);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
 }
 
 /*
@@ -1710,7 +1709,7 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
     Datum        values[PG_STAT_GET_WAL_COLS];
     bool        nulls[PG_STAT_GET_WAL_COLS];
     char        buf[256];
-    PgStat_WalStats *wal_stats;
+    PgStat_Wal *wal_stats;
 
     /* Initialise values and NULL flags arrays */
     MemSet(values, 0, sizeof(values));
@@ -1735,11 +1734,11 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
     wal_stats = pgstat_fetch_stat_wal();
 
     /* Fill values and NULLs */
-    values[0] = Int64GetDatum(wal_stats->wal_records);
-    values[1] = Int64GetDatum(wal_stats->wal_fpi);
+    values[0] = Int64GetDatum(wal_stats->wal_usage.wal_records);
+    values[1] = Int64GetDatum(wal_stats->wal_usage.wal_fpi);
 
     /* Convert to numeric. */
-    snprintf(buf, sizeof buf, UINT64_FORMAT, wal_stats->wal_bytes);
+    snprintf(buf, sizeof buf, UINT64_FORMAT, wal_stats->wal_usage.wal_bytes);
     values[2] = DirectFunctionCall3(numeric_in,
                                     CStringGetDatum(buf),
                                     ObjectIdGetDatum(0),
@@ -2020,7 +2019,7 @@ pg_stat_get_xact_function_self_time(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_snapshot_timestamp(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_global()->stats_timestamp);
+    PG_RETURN_TIMESTAMPTZ(pgstat_get_stat_timestamp());
 }
 
 /* Discard the active statistics snapshot */
@@ -2108,7 +2107,7 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     TupleDesc    tupdesc;
     Datum        values[7];
     bool        nulls[7];
-    PgStat_ArchiverStats *archiver_stats;
+    PgStat_Archiver *archiver_stats;
 
     /* Initialise values and NULL flags arrays */
     MemSet(values, 0, sizeof(values));
@@ -2178,7 +2177,7 @@ pg_stat_get_replication_slots(PG_FUNCTION_ARGS)
     Tuplestorestate *tupstore;
     MemoryContext per_query_ctx;
     MemoryContext oldcontext;
-    PgStat_ReplSlotStats *slotstats;
+    PgStat_ReplSlot *slotstats;
     int            nstats;
     int            i;
 
@@ -2211,7 +2210,7 @@ pg_stat_get_replication_slots(PG_FUNCTION_ARGS)
     {
         Datum        values[PG_STAT_GET_REPLICATION_SLOT_COLS];
         bool        nulls[PG_STAT_GET_REPLICATION_SLOT_COLS];
-        PgStat_ReplSlotStats *s = &(slotstats[i]);
+        PgStat_ReplSlot *s = &(slotstats[i]);
 
         MemSet(values, 0, sizeof(values));
         MemSet(nulls, 0, sizeof(nulls));
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 7ef510cd01..0762c2034c 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -72,6 +72,7 @@
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/optimizer.h"
+#include "pgstat.h"
 #include "rewrite/rewriteDefine.h"
 #include "rewrite/rowsecurity.h"
 #include "storage/lmgr.h"
@@ -2366,6 +2367,9 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
      */
     RelationCloseSmgr(relation);
 
+    /* break mutual link with stats entry */
+    pgstat_delinkstats(relation);
+
     /*
      * Free all the subsidiary data structures of the relcache entry, then the
      * entry itself.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index ea28769d6a..997afcab6d 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -34,6 +34,7 @@ volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t IdleSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 0f67b99cc5..2567668b6c 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -269,9 +269,6 @@ GetBackendTypeDesc(BackendType backendType)
         case B_ARCHIVER:
             backendDesc = "archiver";
             break;
-        case B_STATS_COLLECTOR:
-            backendDesc = "stats collector";
-            break;
         case B_LOGGER:
             backendDesc = "logger";
             break;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index e5965bc517..d4c17fd7ab 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
 static void IdleSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -621,6 +622,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
         RegisterTimeout(IDLE_SESSION_TIMEOUT, IdleSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1243,6 +1246,14 @@ IdleSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 17579eeaca..85299e2138 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -746,8 +746,8 @@ const char *const config_group_names[] =
     gettext_noop("Statistics"),
     /* STATS_MONITORING */
     gettext_noop("Statistics / Monitoring"),
-    /* STATS_COLLECTOR */
-    gettext_noop("Statistics / Query and Index Statistics Collector"),
+    /* STATS_ACTIVITY */
+    gettext_noop("Statistics / Query and Index Activity Statistics"),
     /* AUTOVACUUM */
     gettext_noop("Autovacuum"),
     /* CLIENT_CONN */
@@ -1457,7 +1457,7 @@ static struct config_bool ConfigureNamesBool[] =
 #endif
 
     {
-        {"track_activities", PGC_SUSET, STATS_COLLECTOR,
+        {"track_activities", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects information about executing commands."),
             gettext_noop("Enables the collection of information on the currently "
                          "executing command of each session, along with "
@@ -1468,7 +1468,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_counts", PGC_SUSET, STATS_COLLECTOR,
+        {"track_counts", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects statistics on database activity."),
             NULL
         },
@@ -1477,7 +1477,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_io_timing", PGC_SUSET, STATS_COLLECTOR,
+        {"track_io_timing", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects timing statistics for database I/O activity."),
             NULL
         },
@@ -4356,7 +4356,7 @@ static struct config_string ConfigureNamesString[] =
     },
 
     {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
+        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
             gettext_noop("Writes temporary statistics files to the specified directory."),
             NULL,
             GUC_SUPERUSER_ONLY
@@ -4692,7 +4692,7 @@ static struct config_enum ConfigureNamesEnum[] =
     },
 
     {
-        {"track_functions", PGC_SUSET, STATS_COLLECTOR,
+        {"track_functions", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects function-level statistics on database activity."),
             NULL
         },
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 8930a94fff..4f5b6bdb12 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -580,7 +580,7 @@
 # STATISTICS
 #------------------------------------------------------------------------------
 
-# - Query and Index Statistics Collector -
+# - Query and Index Activity Statistics -
 
 #track_activities = on
 #track_counts = on
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index f674a7c94e..c61b828bf3 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -6,7 +6,7 @@ use File::Basename qw(basename dirname);
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 109;
+use Test::More tests => 108;
 
 program_help_ok('pg_basebackup');
 program_version_ok('pg_basebackup');
@@ -124,7 +124,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index adb9f819bb..12708b9470 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -84,6 +84,8 @@ extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
 
@@ -322,7 +324,6 @@ typedef enum BackendType
     B_WAL_SENDER,
     B_WAL_WRITER,
     B_ARCHIVER,
-    B_STATS_COLLECTOR,
     B_LOGGER,
 } BackendType;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c38b689710..0472b728bf 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL activity statistics facility.
  *
  *    Copyright (c) 2001-2021, PostgreSQL Global Development Group
  *
@@ -12,12 +12,15 @@
 #define PGSTAT_H
 
 #include "datatype/timestamp.h"
+#include "executor/instrument.h"
 #include "libpq/pqcomm.h"
 #include "miscadmin.h"
 #include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -27,8 +30,8 @@
  * ----------
  */
 #define PGSTAT_STAT_PERMANENT_DIRECTORY        "pg_stat"
-#define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
-#define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
+#define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/saved_stats"
+#define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/saved_stats.tmp"
 
 /* Default directory to store temporary statistics data in */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
@@ -41,38 +44,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_RESETSLRUCOUNTER,
-    PGSTAT_MTYPE_RESETREPLSLOTCOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_WAL,
-    PGSTAT_MTYPE_SLRU,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE,
-    PGSTAT_MTYPE_REPLSLOT,
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -83,9 +54,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -159,10 +129,13 @@ typedef enum PgStat_Single_Reset_Type
  */
 typedef struct PgStat_TableStatus
 {
-    Oid            t_id;            /* table's OID */
-    bool        t_shared;        /* is it a shared catalog? */
     struct PgStat_TableXactStatus *trans;    /* lowest subxact's counts */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_TableCounts t_counts;    /* event counts to be sent */
+    Relation    relation;            /* rel that is using this entry */
 } PgStat_TableStatus;
 
 /* ----------
@@ -186,353 +159,57 @@ typedef struct PgStat_TableXactStatus
     struct PgStat_TableXactStatus *next;    /* next of same subxact */
 } PgStat_TableXactStatus;
 
-
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
-/* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
+/*
+ * Archiver statistics kept in the shared stats
  */
-typedef struct PgStat_MsgHdr
+typedef struct PgStat_Archiver
 {
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
+    PgStat_Counter archived_count;    /* archival successes */
+    char        last_archived_wal[MAX_XFN_CHARS + 1];    /* last WAL file
+                                                         * archived */
+    TimestampTz last_archived_timestamp;    /* last archival success time */
+    PgStat_Counter failed_count;    /* failed archival attempts */
+    char        last_failed_wal[MAX_XFN_CHARS + 1]; /* WAL file involved in
+                                                     * last failure */
+    TimestampTz last_failed_timestamp;    /* last archival failure time */
+    TimestampTz stat_reset_timestamp;
+} PgStat_Archiver;
 
 /* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgDummy
+typedef struct PgStat_BgWriter
 {
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_alloc;
+    TimestampTz       stat_reset_timestamp;
+} PgStat_BgWriter;
 
 /* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
+ * PgStat_CheckPointer        checkpointer statistics
  * ----------
  */
-
-typedef struct PgStat_MsgInquiry
+typedef struct PgStat_CheckPointer
 {
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgResetslrucounter Sent by the backend to tell the collector
- *                                to reset a SLRU counter
- * ----------
- */
-typedef struct PgStat_MsgResetslrucounter
-{
-    PgStat_MsgHdr m_hdr;
-    int            m_index;
-} PgStat_MsgResetslrucounter;
-
-/* ----------
- * PgStat_MsgResetreplslotcounter Sent by the backend to tell the collector
- *                                to reset replication slot counter(s)
- * ----------
- */
-typedef struct PgStat_MsgResetreplslotcounter
-{
-    PgStat_MsgHdr m_hdr;
-    char        m_slotname[NAMEDATALEN];
-    bool        clearall;
-} PgStat_MsgResetreplslotcounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgWal            Sent by backends and background processes to update WAL statistics.
- * ----------
- */
-typedef struct PgStat_MsgWal
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Counter m_wal_records;
-    PgStat_Counter m_wal_fpi;
-    uint64        m_wal_bytes;
-    PgStat_Counter m_wal_buffers_full;
-} PgStat_MsgWal;
-
-/* ----------
- * PgStat_MsgSLRU            Sent by a backend to update SLRU statistics.
- * ----------
- */
-typedef struct PgStat_MsgSLRU
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Counter m_index;
-    PgStat_Counter m_blocks_zeroed;
-    PgStat_Counter m_blocks_hit;
-    PgStat_Counter m_blocks_read;
-    PgStat_Counter m_blocks_written;
-    PgStat_Counter m_blocks_exists;
-    PgStat_Counter m_flush;
-    PgStat_Counter m_truncate;
-} PgStat_MsgSLRU;
-
-/* ----------
- * PgStat_MsgReplSlot    Sent by a backend or a wal sender to update replication
- *                        slot statistics.
- * ----------
- */
-typedef struct PgStat_MsgReplSlot
-{
-    PgStat_MsgHdr m_hdr;
-    char        m_slotname[NAMEDATALEN];
-    bool        m_drop;
-    PgStat_Counter m_spill_txns;
-    PgStat_Counter m_spill_count;
-    PgStat_Counter m_spill_bytes;
-    PgStat_Counter m_stream_txns;
-    PgStat_Counter m_stream_count;
-    PgStat_Counter m_stream_bytes;
-} PgStat_MsgReplSlot;
-
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+} PgStat_CheckPointer;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -548,7 +225,6 @@ typedef struct PgStat_FunctionCounts
  */
 typedef struct PgStat_BackendFunctionEntry
 {
-    Oid            f_id;
     PgStat_FunctionCounts f_counts;
 } PgStat_BackendFunctionEntry;
 
@@ -564,101 +240,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgResetslrucounter msg_resetslrucounter;
-    PgStat_MsgResetreplslotcounter msg_resetreplslotcounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgWal msg_wal;
-    PgStat_MsgSLRU msg_slru;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-    PgStat_MsgReplSlot msg_replslot;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Activity statistics data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -667,13 +250,9 @@ typedef union PgStat_Msg
 
 #define PGSTAT_FILE_FORMAT_ID    0x01A5BC9F
 
-/* ----------
- * PgStat_StatDBEntry            The collector's data per database
- * ----------
- */
-typedef struct PgStat_StatDBEntry
+
+typedef struct PgStat_StatDBCounts
 {
-    Oid            databaseid;
     PgStat_Counter n_xact_commit;
     PgStat_Counter n_xact_rollback;
     PgStat_Counter n_blocks_fetched;
@@ -683,7 +262,6 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_tuples_inserted;
     PgStat_Counter n_tuples_updated;
     PgStat_Counter n_tuples_deleted;
-    TimestampTz last_autovac_time;
     PgStat_Counter n_conflict_tablespace;
     PgStat_Counter n_conflict_lock;
     PgStat_Counter n_conflict_snapshot;
@@ -693,29 +271,86 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_temp_bytes;
     PgStat_Counter n_deadlocks;
     PgStat_Counter n_checksum_failures;
-    TimestampTz last_checksum_failure;
     PgStat_Counter n_block_read_time;    /* times in microseconds */
     PgStat_Counter n_block_write_time;
+} PgStat_StatDBCounts;
 
+/* ----------
+ * PgStat_StatEntryHeader        common header struct for PgStat_Stat*Entry
+ * ----------
+ */
+typedef struct PgStat_StatEntryHeader
+{
+    LWLock        lock;
+    bool        dropped;            /* This entry is being dropped and should
+                                     * be removed when refcount goes to
+                                     * zero. */
+    pg_atomic_uint32  refcount;        /* How many backends are referencing */
+} PgStat_StatEntryHeader;
+
+/* ----------
+ * PgStat_StatDBEntry            The statistics per database
+ * ----------
+ */
+typedef struct PgStat_StatDBEntry
+{
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    TimestampTz last_autovac_time;
+    TimestampTz last_checksum_failure;
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;    /* time of db stats update */
+
+    PgStat_StatDBCounts counts;
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following members must be last in the struct, because we don't write them
+     * out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables; /* current gen tables hash */
+    dshash_table_handle functions;    /* current gen functions hash */
 } PgStat_StatDBEntry;
 
+/* ----------
+ * PgStat_Wal   WAL statistics, updated by backends and background
+ *                processes.
+ * ----------
+ */
+typedef struct PgStat_Wal
+{
+    WalUsage       wal_usage;
+    PgStat_Counter wal_buffers_full;
+    TimestampTz stat_reset_timestamp;
+} PgStat_Wal;
+
+/*
+ * SLRU statistics kept in the shared stats
+ */
+typedef struct PgStat_SLRUStats
+{
+    PgStat_Counter blocks_zeroed;
+    PgStat_Counter blocks_hit;
+    PgStat_Counter blocks_read;
+    PgStat_Counter blocks_written;
+    PgStat_Counter blocks_exists;
+    PgStat_Counter flush;
+    PgStat_Counter truncate;
+    TimestampTz stat_reset_timestamp;
+} PgStat_SLRUStats;
 
 /* ----------
- * PgStat_StatTabEntry            The collector's data per table (or index)
+ * PgStat_StatTabEntry            The statistics per table (or index)
  * ----------
  */
 typedef struct PgStat_StatTabEntry
 {
-    Oid            tableid;
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    /* Persistent data follow */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
 
     PgStat_Counter numscans;
 
@@ -735,99 +370,35 @@ typedef struct PgStat_StatTabEntry
     PgStat_Counter blocks_fetched;
     PgStat_Counter blocks_hit;
 
-    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
     PgStat_Counter vacuum_count;
-    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_vacuum_count;
-    TimestampTz analyze_timestamp;    /* user initiated */
     PgStat_Counter analyze_count;
-    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_analyze_count;
 } PgStat_StatTabEntry;
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            per function stats data
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
 {
-    Oid            functionid;
-
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    /* Persistent data follow */
     PgStat_Counter f_numcalls;
 
     PgStat_Counter f_total_time;    /* times in microseconds */
     PgStat_Counter f_self_time;
 } PgStat_StatFuncEntry;
 
-
-/*
- * Archiver statistics kept in the stats collector
- */
-typedef struct PgStat_ArchiverStats
-{
-    PgStat_Counter archived_count;    /* archival successes */
-    char        last_archived_wal[MAX_XFN_CHARS + 1];    /* last WAL file
-                                                         * archived */
-    TimestampTz last_archived_timestamp;    /* last archival success time */
-    PgStat_Counter failed_count;    /* failed archival attempts */
-    char        last_failed_wal[MAX_XFN_CHARS + 1]; /* WAL file involved in
-                                                     * last failure */
-    TimestampTz last_failed_timestamp;    /* last archival failure time */
-    TimestampTz stat_reset_timestamp;
-} PgStat_ArchiverStats;
-
-/*
- * Global statistics kept in the stats collector
- */
-typedef struct PgStat_GlobalStats
-{
-    TimestampTz stats_timestamp;    /* time of stats file update */
-    PgStat_Counter timed_checkpoints;
-    PgStat_Counter requested_checkpoints;
-    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
-    PgStat_Counter checkpoint_sync_time;
-    PgStat_Counter buf_written_checkpoints;
-    PgStat_Counter buf_written_clean;
-    PgStat_Counter maxwritten_clean;
-    PgStat_Counter buf_written_backend;
-    PgStat_Counter buf_fsync_backend;
-    PgStat_Counter buf_alloc;
-    TimestampTz stat_reset_timestamp;
-} PgStat_GlobalStats;
-
-/*
- * WAL statistics kept in the stats collector
- */
-typedef struct PgStat_WalStats
-{
-    PgStat_Counter wal_records;
-    PgStat_Counter wal_fpi;
-    uint64        wal_bytes;
-    PgStat_Counter wal_buffers_full;
-    TimestampTz stat_reset_timestamp;
-} PgStat_WalStats;
-
-/*
- * SLRU statistics kept in the stats collector
- */
-typedef struct PgStat_SLRUStats
-{
-    PgStat_Counter blocks_zeroed;
-    PgStat_Counter blocks_hit;
-    PgStat_Counter blocks_read;
-    PgStat_Counter blocks_written;
-    PgStat_Counter blocks_exists;
-    PgStat_Counter flush;
-    PgStat_Counter truncate;
-    TimestampTz stat_reset_timestamp;
-} PgStat_SLRUStats;
-
 /*
  * Replication slot statistics kept in the stats collector
  */
-typedef struct PgStat_ReplSlotStats
+typedef struct PgStat_ReplSlot
 {
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
     char        slotname[NAMEDATALEN];
     PgStat_Counter spill_txns;
     PgStat_Counter spill_count;
@@ -836,7 +407,7 @@ typedef struct PgStat_ReplSlotStats
     PgStat_Counter stream_count;
     PgStat_Counter stream_bytes;
     TimestampTz stat_reset_timestamp;
-} PgStat_ReplSlotStats;
+} PgStat_ReplSlot;
 
 /* ----------
  * Backend states
@@ -885,7 +456,7 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
+    WAIT_EVENT_READING_STATS_FILE,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
     WAIT_EVENT_WAL_RECEIVER_MAIN,
@@ -1138,7 +709,7 @@ typedef struct PgBackendGSSStatus
  *
  * Each live backend maintains a PgBackendStatus struct in shared memory
  * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
+ * BackendId, but that is not critical.)  Note that the stats-writing code
  * has no involvement in, or even access to, these structs.
  *
  * Each auxiliary process also maintains a PgBackendStatus struct in shared
@@ -1335,18 +906,26 @@ extern PGDLLIMPORT bool pgstat_track_counts;
 extern PGDLLIMPORT int pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used, but will be removed with GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
+
+/*
+ * CheckPointer statistics counters are updated directly by checkpointer and
+ * bufmgr
+ */
+extern PgStat_CheckPointer CheckPointerStats;
 
 /*
  * WAL statistics counter is updated by backends and background processes
  */
-extern PgStat_MsgWal WalStats;
+extern PgStat_Wal WalStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1361,33 +940,27 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
-
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 extern void pgstat_reset_slru_counter(const char *);
 extern void pgstat_reset_replslot_counter(const char *name);
 
+extern void pgstat_report_wal(void);
 extern void pgstat_report_autovac(Oid dboid);
 extern void pgstat_report_vacuum(Oid tableoid, bool shared,
                                  PgStat_Counter livetuples, PgStat_Counter deadtuples);
@@ -1427,6 +1000,7 @@ extern PgStat_TableStatus *find_tabstat_entry(Oid rel_id);
 extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);

 extern void pgstat_initstats(Relation rel);
+extern void pgstat_delinkstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
 
@@ -1549,9 +1123,10 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                                       void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
-extern void pgstat_send_wal(void);
+extern void pgstat_report_archiver(const char *xlog, bool failed);
+extern void pgstat_report_bgwriter(void);
+extern void pgstat_report_checkpointer(void);
+extern void pgstat_report_wal(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1560,15 +1135,20 @@ extern void pgstat_send_wal(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_extended(bool shared,
+                                                                Oid relid);
+extern void pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
-extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
-extern PgStat_GlobalStats *pgstat_fetch_global(void);
-extern PgStat_WalStats *pgstat_fetch_stat_wal(void);
+extern TimestampTz pgstat_get_stat_timestamp(void);
+extern PgStat_Archiver *pgstat_fetch_stat_archiver(void);
+extern PgStat_BgWriter *pgstat_fetch_stat_bgwriter(void);
+extern PgStat_CheckPointer *pgstat_fetch_stat_checkpointer(void);
+extern PgStat_Wal *pgstat_fetch_stat_wal(void);
 extern PgStat_SLRUStats *pgstat_fetch_slru(void);
-extern PgStat_ReplSlotStats *pgstat_fetch_replslot(int *nslots_p);
+extern PgStat_ReplSlot *pgstat_fetch_replslot(int *nslots_p);
 
 extern void pgstat_count_slru_page_zeroed(int slru_idx);
 extern void pgstat_count_slru_page_hit(int slru_idx);
@@ -1579,5 +1159,6 @@ extern void pgstat_count_slru_flush(int slru_idx);
 extern void pgstat_count_slru_truncate(int slru_idx);
 extern const char *pgstat_slru_name(int slru_idx);
 extern int    pgstat_slru_index(const char *name);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index cbf2510fbf..9ed6b54428 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -218,6 +218,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SHARED_TIDBITMAP,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_PER_XACT_PREDICATE_LIST,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index b9b5c1adda..add9c53ee3 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -88,7 +88,7 @@ enum config_group
     PROCESS_TITLE,
     STATS,
     STATS_MONITORING,
-    STATS_COLLECTOR,
+    STATS_ACTIVITY,
     AUTOVACUUM,
     CLIENT_CONN,
     CLIENT_CONN_STATEMENT,
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index ecb2a366a5..f090f7372a 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -32,6 +32,7 @@ typedef enum TimeoutId
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
     IDLE_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.27.0

From 6ce27f6ba6d436d2eb5447c6b97703b28e58d6f1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:09 +0900
Subject: [PATCH v45 5/7] Doc part of shared-memory based stats collector.

---
 doc/src/sgml/catalogs.sgml          |   6 +-
 doc/src/sgml/config.sgml            |  34 ++++----
 doc/src/sgml/high-availability.sgml |  13 +--
 doc/src/sgml/monitoring.sgml        | 127 +++++++++++++---------------
 doc/src/sgml/ref/pg_dump.sgml       |   9 +-
 5 files changed, 90 insertions(+), 99 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 3a2266526c..4d8b92df72 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -9234,9 +9234,9 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l
   <para>
    <xref linkend="view-table"/> lists the system views described here.
    More detailed documentation of each view follows below.
-   There are some additional views that provide access to the results of
-   the statistics collector; they are described in <xref
-   linkend="monitoring-stats-views-table"/>.
+   There are some additional views that provide access to the activity
+   statistics; they are described in
+   <xref linkend="monitoring-stats-views-table"/>.
   </para>
 
   <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 7c0a673a8d..f6c80df988 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7327,11 +7327,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
     <title>Run-time Statistics</title>
 
     <sect2 id="runtime-config-statistics-collector">
-     <title>Query and Index Statistics Collector</title>
+     <title>Query and Index Activity Statistics</title>
 
      <para>
-      These parameters control server-wide statistics collection features.
-      When statistics collection is enabled, the data that is produced can be
+      These parameters control server-wide activity statistics features.
+      When activity statistics are enabled, the data that is produced can be
       accessed via the <structname>pg_stat</structname> and
       <structname>pg_statio</structname> family of system views.
       Refer to <xref linkend="monitoring"/> for more information.
@@ -7347,14 +7347,13 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables the collection of information on the currently
-        executing command of each session, along with the time when
-        that command began execution. This parameter is on by
-        default. Note that even when enabled, this information is not
-        visible to all users, only to superusers and the user owning
-        the session being reported on, so it should not represent a
-        security risk.
-        Only superusers can change this setting.
+        Enables activity tracking on the currently executing command of
+        each session, along with the time when that command began
+        execution. This parameter is on by default. Note that even when
+        enabled, this information is not visible to all users, only to
+        superusers and the user owning the session being reported on, so it
+        should not represent a security risk.  Only superusers can change this
+        setting.
        </para>
       </listitem>
      </varlistentry>
@@ -7385,9 +7384,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables collection of statistics on database activity.
+        Enables tracking of database activity.
         This parameter is on by default, because the autovacuum
-        daemon needs the collected information.
+        daemon needs the activity information.
         Only superusers can change this setting.
        </para>
       </listitem>
@@ -8485,7 +8484,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Specifies the fraction of the total number of heap tuples counted in
-        the previous statistics collection that can be inserted without
+        the previously collected statistics that can be inserted without
         incurring an index scan at the <command>VACUUM</command> cleanup stage.
         This setting currently applies to B-tree indexes only.
        </para>
@@ -8497,9 +8496,10 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         the index contains deleted pages that can be recycled during cleanup.
         Index statistics are considered to be stale if the number of newly
         inserted tuples exceeds the <varname>vacuum_cleanup_index_scale_factor</varname>
-        fraction of the total number of heap tuples detected by the previous
-        statistics collection. The total number of heap tuples is stored in
-        the index meta-page. Note that the meta-page does not include this data
+
+        fraction of the total number of heap tuples in the previously
+        collected statistics. The total number of heap tuples is stored in the
+        index meta-page. Note that the meta-page does not include this data
         until <command>VACUUM</command> finds no dead tuples, so B-tree index
         scan at the cleanup stage can only be skipped if the second and
         subsequent <command>VACUUM</command> cycles detect no dead tuples.
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index efc382cb8d..6c620469eb 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2338,12 +2338,13 @@ LOG:  database system is ready to accept read only connections
    </para>
 
    <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
-    index usage, etc., will be recorded normally on the standby. Replayed
-    actions will not duplicate their effects on primary, so replaying an
-    insert will not increment the Inserts column of pg_stat_user_tables.
-    The stats file is deleted at the start of recovery, so stats from primary
-    and standby will differ; this is considered a feature, not a bug.
+    Activity statistics are collected during recovery. All scans, reads,
+    blocks, index usage, etc., will be recorded normally on the
+    standby. Replayed actions will not duplicate their effects on the primary,
+    so replaying an insert will not increment the Inserts column of
+    pg_stat_user_tables.  Activity statistics are reset at the start of
+    recovery, so stats from primary and standby will differ; this is
+    considered a feature, not a bug.
    </para>
 
    <para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3cdb1aff3c..afa8c35127 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -22,7 +22,7 @@
   <para>
    Several tools are available for monitoring database activity and
    analyzing performance.  Most of this chapter is devoted to describing
-   <productname>PostgreSQL</productname>'s statistics collector,
+   <productname>PostgreSQL</productname>'s activity statistics,
    but one should not neglect regular Unix monitoring programs such as
    <command>ps</command>, <command>top</command>, <command>iostat</command>, and <command>vmstat</command>.
    Also, once one has identified a
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    primary server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   primary process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   primary process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
@@ -130,20 +128,21 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
  </sect1>
 
  <sect1 id="monitoring-stats">
-  <title>The Statistics Collector</title>
+  <title>The Activity Statistics</title>
 
   <indexterm zone="monitoring-stats">
    <primary>statistics</primary>
   </indexterm>
 
   <para>
-   <productname>PostgreSQL</productname>'s <firstterm>statistics collector</firstterm>
-   is a subsystem that supports collection and reporting of information about
-   server activity.  Presently, the collector can count accesses to tables
-   and indexes in both disk-block and individual-row terms.  It also tracks
-   the total number of rows in each table, and information about vacuum and
-   analyze actions for each table.  It can also count calls to user-defined
-   functions and the total time spent in each one.
+   <productname>PostgreSQL</productname>'s <firstterm>activity
+   statistics</firstterm> is a subsystem that supports tracking and reporting
+   of information about server activity.  Presently, this subsystem counts
+   accesses to tables and indexes in both disk-block and
+   individual-row terms.  It also tracks the total number of rows in each
+   table, and information about vacuum and analyze actions for each table.  It
+   can also track calls to user-defined functions and the total time spent in
+   each one.
   </para>
 
   <para>
@@ -151,15 +150,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    information about exactly what is going on in the system right now, such as
    the exact command currently being executed by other server processes, and
    which other connections exist in the system.  This facility is independent
-   of the collector process.
+   of the activity statistics.
   </para>
 
  <sect2 id="monitoring-stats-setup">
-  <title>Statistics Collection Configuration</title>
+  <title>Activity Statistics Configuration</title>
 
   <para>
-   Since collection of statistics adds some overhead to query execution,
-   the system can be configured to collect or not collect information.
+   Since tracking for the activity statistics adds some overhead to query
+   execution, the system can be configured to track or not track activity.
    This is controlled by configuration parameters that are normally set in
    <filename>postgresql.conf</filename>.  (See <xref linkend="runtime-config"/> for
    details about setting configuration parameters.)
@@ -172,7 +171,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The parameter <xref linkend="guc-track-counts"/> controls whether
-   statistics are collected about table and index accesses.
+   table and index accesses are tracked.
   </para>
 
   <para>
@@ -196,18 +195,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
   </para>
 
   <para>
-   The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
-   When the server shuts down cleanly, a permanent copy of the statistics
-   data is stored in the <filename>pg_stat</filename> subdirectory, so that
-   statistics can be retained across server restarts.  When recovery is
-   performed at server start (e.g., after immediate shutdown, server crash,
-   and point-in-time recovery), all statistics counters are reset.
+   When the server shuts down cleanly, a permanent copy of the statistics
+   data is stored in the <filename>pg_stat</filename> subdirectory, so that
+   statistics can be retained across server restarts.  When recovery is
+   performed at server start (e.g., after immediate shutdown, server crash,
+   or point-in-time recovery), all statistics counters are reset.
   </para>
 
  </sect2>
@@ -220,48 +212,46 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    linkend="monitoring-stats-dynamic-views-table"/>, are available to show
    the current state of the system. There are also several other
    views, listed in <xref
-   linkend="monitoring-stats-views-table"/>, available to show the results
-   of statistics collection.  Alternatively, one can
-   build custom views using the underlying statistics functions, as discussed
-   in <xref linkend="monitoring-stats-functions"/>.
+   linkend="monitoring-stats-views-table"/>, available to show the activity
+   statistics.  Alternatively, one can build custom views using the underlying
+   statistics functions, as discussed in
+   <xref linkend="monitoring-stats-functions"/>.
   </para>
 
   <para>
-   When using the statistics to monitor collected data, it is important
-   to realize that the information does not update instantaneously.
-   Each individual server process transmits new statistical counts to
-   the collector just before going idle; so a query or transaction still in
-   progress does not affect the displayed totals.  Also, the collector itself
-   emits a new report at most once per <varname>PGSTAT_STAT_INTERVAL</varname>
-   milliseconds (500 ms unless altered while building the server).  So the
-   displayed information lags behind actual activity.  However, current-query
-   information collected by <varname>track_activities</varname> is
-   always up-to-date.
+   When using the activity statistics, it is important to realize that the
+   information does not update instantaneously.  Each individual server
+   process writes out new statistical counts just before going idle, but no
+   more often than once per <varname>PGSTAT_STAT_INTERVAL</varname>
+   milliseconds (1 second unless altered while building the server); so a
+   query or transaction still in progress does not affect the displayed
+   totals.  However, current-query information tracked by
+   <varname>track_activities</varname> is always up-to-date.
   </para>
 
   <para>
    Another important point is that when a server process is asked to display
-   any of these statistics, it first fetches the most recent report emitted by
-   the collector process and then continues to use this snapshot for all
-   statistical views and functions until the end of its current transaction.
-   So the statistics will show static information as long as you continue the
-   current transaction.  Similarly, information about the current queries of
-   all sessions is collected when any such information is first requested
-   within a transaction, and the same information will be displayed throughout
-   the transaction.
-   This is a feature, not a bug, because it allows you to perform several
-   queries on the statistics and correlate the results without worrying that
-   the numbers are changing underneath you.  But if you want to see new
-   results with each query, be sure to do the queries outside any transaction
-   block.  Alternatively, you can invoke
+   any of these statistics, it first reads the current statistics and then
+   continues to use this snapshot for all statistical views and functions
+   until the end of its current transaction.  So the statistics will show
+   static information as long as you continue the current transaction.
+   Similarly, information about the current queries of all sessions is fetched
+   when any such information is first requested within a transaction, and the
+   same information will be displayed throughout the transaction.  This is a
+   feature, not a bug, because it allows you to perform several queries on the
+   statistics and correlate the results without worrying that the numbers are
+   changing underneath you.  But if you want to see new results with each
+   query, be sure to do the queries outside any transaction block.
+   Alternatively, you can invoke
    <function>pg_stat_clear_snapshot</function>(), which will discard the
    current transaction's statistics snapshot (if any).  The next use of
    statistical information will cause a new snapshot to be fetched.
   </para>
-
+
   <para>
-   A transaction can also see its own statistics (as yet untransmitted to the
-   collector) in the views <structname>pg_stat_xact_all_tables</structname>,
+   A transaction can also see its own statistics (as yet unwritten to the
+   server-wide activity statistics) in the
+   views <structname>pg_stat_xact_all_tables</structname>,
    <structname>pg_stat_xact_sys_tables</structname>,
    <structname>pg_stat_xact_user_tables</structname>, and
    <structname>pg_stat_xact_user_functions</structname>.  These numbers do not act as
@@ -643,7 +633,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    kernel's I/O cache, and might therefore still be fetched without
    requiring a physical read. Users interested in obtaining more
    detailed information on <productname>PostgreSQL</productname> I/O behavior are
-   advised to use the <productname>PostgreSQL</productname> statistics collector
+   advised to use the <productname>PostgreSQL</productname> activity statistics
    in combination with operating system utilities that allow insight
    into the kernel's handling of I/O.
   </para>
@@ -1080,10 +1070,6 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>LogicalLauncherMain</literal></entry>
       <entry>Waiting in main loop of logical replication launcher process.</entry>
      </row>
-     <row>
-      <entry><literal>PgStatMain</literal></entry>
-      <entry>Waiting in main loop of statistics collector process.</entry>
-     </row>
      <row>
       <entry><literal>RecoveryWalStream</literal></entry>
       <entry>Waiting in main loop of startup process for WAL to arrive, during
@@ -1838,6 +1824,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
     </thead>
 
     <tbody>
+     <row>
+      <entry><literal>ActivityStatistics</literal></entry>
+      <entry>Waiting to write out activity statistics to shared memory.</entry>
+     </row>
      <row>
       <entry><literal>AddinShmemInit</literal></entry>
       <entry>Waiting to manage an extension's space allocation in shared
@@ -5996,9 +5986,10 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
      <entry><literal>performing final cleanup</literal></entry>
      <entry>
        <command>VACUUM</command> is performing final cleanup.  During this phase,
-       <command>VACUUM</command> will vacuum the free space map, update statistics
-       in <literal>pg_class</literal>, and report statistics to the statistics
-       collector.  When this phase is completed, <command>VACUUM</command> will end.
+       <command>VACUUM</command> will vacuum the free space map, update
+       statistics in <literal>pg_class</literal>, and update the system-wide
+       activity statistics.  When this phase is completed,
+       <command>VACUUM</command> will end.
      </entry>
     </row>
    </tbody>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index dcb25dc3cd..7507783eaa 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1280,11 +1280,10 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The database activity of <application>pg_dump</application> is
-   normally collected by the statistics collector.  If this is
-   undesirable, you can set parameter <varname>track_counts</varname>
-   to false via <envar>PGOPTIONS</envar> or the <literal>ALTER
-   USER</literal> command.
+   The database activity of <application>pg_dump</application> is normally
+   included in the activity statistics.  If this is undesirable, you can set
+   parameter <varname>track_counts</varname> to false
+   via <envar>PGOPTIONS</envar> or the <literal>ALTER USER</literal> command.
   </para>
 
  </refsect1>
-- 
2.27.0

From af6d0c11a5af7d4a0861fca0ea2fba248390ae00 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 29 Sep 2020 22:59:33 +0900
Subject: [PATCH v45 6/7] Remove the GUC stats_temp_directory

The new stats collection system doesn't need a temporary directory, so
just remove it. pg_stat_statements is modified to use the pg_stat
directory to store its temporary files.  As a result, basebackup copies
pg_stat_statements' temporary file if it exists.
---
 .../pg_stat_statements/pg_stat_statements.c   | 13 +++---
 doc/src/sgml/backup.sgml                      |  2 -
 doc/src/sgml/config.sgml                      | 19 --------
 doc/src/sgml/storage.sgml                     |  6 ---
 src/backend/postmaster/pgstat.c               | 10 -----
 src/backend/replication/basebackup.c          | 36 ----------------
 src/backend/utils/misc/guc.c                  | 43 -------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/bin/initdb/initdb.c                       |  1 -
 src/bin/pg_rewind/filemap.c                   |  7 ---
 src/include/pgstat.h                          |  3 --
 src/test/perl/PostgresNode.pm                 |  4 --
 12 files changed, 6 insertions(+), 139 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 72a117fc19..0a98b2f2c0 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -89,14 +89,13 @@ PG_MODULE_MAGIC;
 #define PGSS_DUMP_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pg_stat_statements.stat"
 
 /*
- * Location of external query text file.  We don't keep it in the core
- * system's stats_temp_directory.  The core system can safely use that GUC
- * setting, because the statistics collector temp file paths are set only once
- * as part of changing the GUC, but pg_stat_statements has no way of avoiding
- * race conditions.  Besides, we only expect modest, infrequent I/O for query
- * strings, so placing the file on a faster filesystem is not compelling.
+ * Location of external query text file.  We keep it in the core system's
+ * pg_stat directory.  pg_stat_statements would have no way of avoiding race
+ * conditions if the directory were specified by a GUC.  Besides, we only
+ * expect modest, infrequent I/O for query strings, so placing the file on a
+ * faster filesystem is not compelling.
  */
-#define PGSS_TEXT_FILE    PG_STAT_TMP_DIR "/pgss_query_texts.stat"
+#define PGSS_TEXT_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pgss_query_texts.stat"
 
 /* Magic number identifying the stats file format */
 static const uint32 PGSS_FILE_HEADER = 0x20201218;
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 3c8aaed0b6..7557a375f0 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1146,8 +1146,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f6c80df988..906f893891 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7440,25 +7440,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 3234adb639..6bac5e075e 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -120,12 +120,6 @@ Item
   subsystem</entry>
 </row>
 
-<row>
- <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
-</row>
-
 <row>
  <entry><filename>pg_subtrans</filename></entry>
  <entry>Subdirectory containing subtransaction status data</entry>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ecf9d9adcc..73b44a2652 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -99,16 +99,6 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
- */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
-
 /*
  * WAL usage counters saved from pgWALUsage at the previous call to
  * pgstat_send_wal(). This is used to calculate how much WAL usage
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index d21801cf90..d2c3064678 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -87,9 +87,6 @@ static int    basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
 /* Was the backup currently in-progress initiated in recovery mode? */
 static bool backup_started_in_recovery = false;
 
-/* Relative path of temporary statistics directory */
-static char *statrelpath = NULL;
-
 /*
  * Size of each block sent into the tar stream for larger files.
  */
@@ -152,13 +149,6 @@ struct exclude_list_item
  */
 static const char *const excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    PG_STAT_TMP_DIR,
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
@@ -261,7 +251,6 @@ perform_base_backup(basebackup_options *opt)
     StringInfo    labelfile;
     StringInfo    tblspc_map_file;
     backup_manifest_info manifest;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
     backup_total = 0;
@@ -284,8 +273,6 @@ perform_base_backup(basebackup_options *opt)
     Assert(CurrentResourceOwner == NULL);
     CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -314,18 +301,6 @@ perform_base_backup(basebackup_options *opt)
         tablespaceinfo *ti;
         int            tblspc_streamed = 0;
 
-        /*
-         * Calculate the relative path of temporary statistics directory in
-         * order to skip the files which are located in that directory later.
-         */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
-
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
         ti->size = -1;
@@ -1377,17 +1352,6 @@ sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces,
         if (excludeFound)
             continue;
 
-        /*
-         * Exclude contents of directory specified by statrelpath if not set
-         * to the default (pg_stat_tmp) which is caught in the loop above.
-         */
-        if (statrelpath != NULL && strcmp(pathbuf, statrelpath) == 0)
-        {
-            elog(DEBUG1, "contents of directory \"%s\" excluded from backup", statrelpath);
-            size += _tarWriteDir(pathbuf, basepathlen, &statbuf, sizeonly);
-            continue;
-        }
-
         /*
          * We can skip pg_wal, the WAL segments need to be fetched from the
          * WAL archive anyway. But include it as an empty directory anyway, so
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 85299e2138..16e430fb28 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -203,7 +203,6 @@ static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource sourc
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_huge_page_size(int *newval, void **extra, GucSource source);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -560,8 +559,6 @@ char       *HbaFileName;
 char       *IdentFileName;
 char       *external_pid_file;
 
-char       *pgstat_temp_directory;
-
 char       *application_name;
 
 int            tcp_keepalives_idle;
@@ -4355,17 +4352,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_PRIMARY,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11773,35 +11759,6 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
     return true;
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 4f5b6bdb12..20c24a9d78 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -587,7 +587,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index c854221a30..0f42e78d19 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -217,7 +217,6 @@ static const char *const subdirs[] = {
     "pg_replslot",
     "pg_tblspc",
     "pg_stat",
-    "pg_stat_tmp",
     "pg_xact",
     "pg_logical",
     "pg_logical/snapshots",
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 2618b4c957..ab5cb51de7 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -87,13 +87,6 @@ struct exclude_list_item
  */
 static const char *excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    "pg_stat_tmp",                /* defined as PG_STAT_TMP_DIR */
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0472b728bf..d7c50eb4f9 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -33,9 +33,6 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/saved_stats"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/saved_stats.tmp"
 
-/* Default directory to store temporary statistics data in */
-#define PG_STAT_TMP_DIR        "pg_stat_tmp"
-
 /* Values for track_functions GUC variable --- order is significant! */
 typedef enum TrackFunctionsLevel
 {
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 9667f7667e..dd41a43b4e 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -455,10 +455,6 @@ sub init
     print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
       if defined $ENV{TEMP_CONFIG};
 
-    # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
-    # concurrently must not share a stats_temp_directory.
-    print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
     if ($params{allows_streaming})
     {
         if ($params{allows_streaming} eq "logical")
-- 
2.27.0

From 2b8f25a0131bc6077d8ed4ddc5a66b80917686cd Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 29 Sep 2020 23:19:58 +0900
Subject: [PATCH v45 7/7] Exclude pg_stat directory from base backup

basebackup sends the contents of the pg_stat directory, which are doomed
to be removed at startup from the backup. Now that pg_stat_statements
saves a temporary file into the directory, let's exclude the pg_stat
directory from a base backup.
---
 src/backend/replication/basebackup.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index d2c3064678..25677c5c6e 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -149,6 +149,13 @@ struct exclude_list_item
  */
 static const char *const excludeDirContents[] =
 {
+    /*
+     * Skip statistics files. PGSTAT_STAT_PERMANENT_DIRECTORY must be skipped
+     * because the files in the directory will be removed at startup from the
+     * backup.
+     */
+    PGSTAT_STAT_PERMANENT_DIRECTORY,
+
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
-- 
2.27.0
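
The exclusion mechanism this patch extends can be pictured with a small stand-alone sketch. It is not the basebackup.c code: the array contents, the `toy_` names, and the helper are illustrative assumptions, and "pg_stat" is assumed to be what PGSTAT_STAT_PERMANENT_DIRECTORY expands to (consistent with the "pg_stat/saved_stats" filename shown in pgstat.h above). The idea is only that a directory whose name appears in the list is still sent, but empty.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Illustrative stand-in for excludeDirContents[].  The entries here are an
 * assumption for the sketch, not the full list in basebackup.c.
 */
static const char *const toy_exclude_dir_contents[] = {
    "pg_stat",                  /* assumed value of PGSTAT_STAT_PERMANENT_DIRECTORY */
    "pg_notify",
    NULL                        /* end-of-list marker */
};

/*
 * Hypothetical helper: returns true if the contents of the directory named
 * "dirname" should be skipped, leaving only the empty directory in the backup.
 */
static bool
toy_contents_excluded(const char *dirname)
{
    for (size_t i = 0; toy_exclude_dir_contents[i] != NULL; i++)
    {
        if (strcmp(dirname, toy_exclude_dir_contents[i]) == 0)
            return true;
    }
    return false;
}
```

With this list, toy_contents_excluded("pg_stat") is true while "base" is not matched, which is the behaviour the added array entry is meant to achieve for the real backup code.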


Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
At Fri, 08 Jan 2021 10:24:34 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Mon, 21 Dec 2020 17:16:20 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > - Conflicted with b3817f5f77. Rebased.
> 
> Conflicted with 9877374bef. Rebased.

bea449c635 conflicted with this (on a comment change). Rebased.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From e2a26f5aa7f7d26e894aa998d7d0afa5374c2e76 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:03 +0900
Subject: [PATCH v46 1/7] sequential scan for dshash

Dshash did not allow scanning all entries sequentially. This adds that
functionality. The interface is similar to, but a bit different from,
those of dynahash and the simple dshash search functions. One of the
most significant differences is that the sequential scan interface of
dshash always needs a call to dshash_seq_term when the scan ends.
Another is locking. Dshash holds a partition lock when returning an
entry; dshash_seq_next() also holds the lock when returning an entry,
but callers shouldn't release it, since the lock is essential to
continue the scan. The seqscan interface allows entry deletion during a
scan. The in-scan deletion should be performed by dshash_delete_current().
---
 src/backend/lib/dshash.c | 160 ++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  22 ++++++
 2 files changed, 181 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index e0c763be32..520bfa0979 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -127,6 +127,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +157,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -324,7 +332,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -592,6 +600,156 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should always be called when a scan finishes.
+ * The caller may delete returned elements in the middle of a scan by using
+ * dshash_delete_current(). exclusive must be true to delete elements.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool exclusive)
+{
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->exclusive = exclusive;
+}
+
+/*
+ * Returns the next element.
+ *
+ * Returned elements are locked and the caller must not explicitly release
+ * the lock. It is released by a later dshash_seq_next() or dshash_seq_term().
+ */
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert(status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+        LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                      status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        status->curpartition = partition;
+
+        /* resize doesn't happen from now until seq scan ends */
+        status->nbuckets =
+            NUM_BUCKETS(status->hash_table->control->size_log2);
+        ensure_valid_bucket_pointers(status->hash_table);
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(status->hash_table,
+                                               status->curpartition),
+                                status->exclusive ? LW_EXCLUSIVE : LW_SHARED));
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        int next_partition;
+
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            return NULL;
+        }
+
+        /* Check if move to the next partition */
+        next_partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+
+        if (status->curpartition != next_partition)
+        {
+            /*
+             * Move to the next partition. Lock the next partition before
+             * releasing the current one, not in the reverse order, to avoid
+             * concurrent resizing.  Avoid deadlock by taking locks in the same
+             * order as resize().
+             */
+            LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                         next_partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                         status->curpartition));
+            status->curpartition = next_partition;
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * The caller may delete the item. Store the next item in case of deletion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+/*
+ * Terminates the seqscan and releases all locks.
+ *
+ * Should always be called when finishing or exiting a seqscan.
+ */
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+
+    if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+/* Remove the current entry during a seq scan. */
+void
+dshash_delete_current(dshash_seq_status *status)
+{
+    dshash_table       *hash_table    = status->hash_table;
+    dshash_table_item  *item        = status->curitem;
+    size_t                partition    = PARTITION_FOR_HASH(item->hash);
+
+    Assert(status->exclusive);
+    Assert(hash_table->control->magic == DSHASH_MAGIC);
+    Assert(hash_table->find_locked);
+    Assert(hash_table->find_exclusively_locked);
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition),
+                                LW_EXCLUSIVE));
+
+    delete_item(hash_table, item);
+}
+
+/* Get the current entry during a seq scan. */
+void *
+dshash_get_current(dshash_seq_status *status)
+{
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index c069ec9de7..a6ea377173 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,21 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state. The detail is exposed to let users know the storage
+ * size but it should be considered as an opaque type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;    /* dshash table working on */
+    int                    curbucket;    /* bucket number we are at */
+    int                    nbuckets;    /* total number of buckets in the dshash */
+    dshash_table_item  *curitem;    /* item we are currently at */
+    dsa_pointer            pnextitem;    /* dsa-pointer to the next item */
+    int                    curpartition;    /* partition number we are at */
+    bool                exclusive;    /* locking mode */
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
                                    const dshash_parameters *params,
@@ -80,6 +95,13 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
+extern void dshash_delete_current(dshash_seq_status *status);
+extern void *dshash_get_current(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.27.0
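
The bucket-to-partition arithmetic that the sequential scan relies on can be checked in isolation. The following is a stand-alone sketch, not part of the patch: the three macros are copied from the diff above, DSHASH_NUM_PARTITIONS_LOG2 is assumed to be 7 as in dshash.c, and count_lock_handoffs() is a hypothetical helper that only counts how often a full scan would cross a partition boundary (and therefore hand the lock over from one partition to the next).

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: same value as in dshash.c, i.e. 128 partitions. */
#define DSHASH_NUM_PARTITIONS_LOG2 7

/* Copied from the patch above. */
#define NUM_SPLITS(size_log2) \
    ((size_log2) - DSHASH_NUM_PARTITIONS_LOG2)
#define NUM_BUCKETS(size_log2) \
    (((size_t) 1) << (size_log2))
#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2) \
    ((bucket_idx) >> NUM_SPLITS(size_log2))

/*
 * Hypothetical helper: walk bucket indexes in scan order and count how many
 * times the owning partition changes.  Each change corresponds to one
 * "acquire next lock, then release current lock" step in dshash_seq_next().
 */
static size_t
count_lock_handoffs(int size_log2)
{
    size_t      nbuckets = NUM_BUCKETS(size_log2);
    size_t      cur = PARTITION_FOR_BUCKET_INDEX((size_t) 0, size_log2);
    size_t      handoffs = 0;

    for (size_t bucket = 1; bucket < nbuckets; bucket++)
    {
        size_t      next = PARTITION_FOR_BUCKET_INDEX(bucket, size_log2);

        if (next != cur)
        {
            handoffs++;
            cur = next;
        }
    }
    return handoffs;
}
```

Because consecutive bucket indexes map to non-decreasing partition numbers, a scan visits the partitions in ascending order and needs exactly one handoff per partition boundary, regardless of how many buckets each partition holds.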

From 4415d5736f486cdcfcb2a7e3c995eaba913274ab Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:57 +0900
Subject: [PATCH v46 2/7] Add conditional lock feature to dshash

Dshash currently waits for locks unconditionally. This is inconvenient
when we want to avoid being blocked by other processes. This commit
adds dshash_find_extended, an alternative to dshash_find and
dshash_find_or_insert that allows an immediate return on lock failure.
---
 src/backend/lib/dshash.c | 98 +++++++++++++++++++++-------------------
 src/include/lib/dshash.h |  3 ++
 2 files changed, 55 insertions(+), 46 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 520bfa0979..853d78b528 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -383,6 +383,10 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  * the caller must take care to ensure that the entry is not left corrupted.
  * The lock mode is either shared or exclusive depending on 'exclusive'.
  *
+ * If found is not NULL, *found is set to true if the key is found in the hash
+ * table. If the key is not found, *found is set to false and a pointer to a
+ * newly created entry is returned.
+ *
  * The caller must not lock a lock already.
  *
  * Note that the lock held is in fact an LWLock, so interrupts will be held on
@@ -392,36 +396,7 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
 {
-    dshash_hash hash;
-    size_t        partition;
-    dshash_table_item *item;
-
-    hash = hash_key(hash_table, key);
-    partition = PARTITION_FOR_HASH(hash);
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
-
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
-    ensure_valid_bucket_pointers(hash_table);
-
-    /* Search the active bucket. */
-    item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
-
-    if (!item)
-    {
-        /* Not found. */
-        LWLockRelease(PARTITION_LOCK(hash_table, partition));
-        return NULL;
-    }
-    else
-    {
-        /* The caller will free the lock by calling dshash_release_lock. */
-        hash_table->find_locked = true;
-        hash_table->find_exclusively_locked = exclusive;
-        return ENTRY_FROM_ITEM(item);
-    }
+    return dshash_find_extended(hash_table, key, exclusive, false, false, NULL);
 }
 
 /*
@@ -439,31 +414,60 @@ dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
 {
-    dshash_hash hash;
-    size_t        partition_index;
-    dshash_partition *partition;
+    return dshash_find_extended(hash_table, key, true, false, true, found);
+}
+
+
+/*
+ * Find the key in the hash table.
+ *
+ * "exclusive" is the lock mode in which the partition for the returned item
+ * is locked.  If "nowait" is true, the function returns NULL immediately if
+ * the required lock cannot be acquired.  "insert" selects insert mode: if the
+ * key is missing, a new entry is inserted and *found is set to false;
+ * otherwise *found is set to true. "found" must be non-null in this mode.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool insert, bool *found)
+{
+    dshash_hash hash = hash_key(hash_table, key);
+    size_t        partidx = PARTITION_FOR_HASH(hash);
+    dshash_partition *partition = &hash_table->control->partitions[partidx];
+    LWLockMode  lockmode = exclusive ? LW_EXCLUSIVE : LW_SHARED;
     dshash_table_item *item;
 
-    hash = hash_key(hash_table, key);
-    partition_index = PARTITION_FOR_HASH(hash);
-    partition = &hash_table->control->partitions[partition_index];
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
+    /* must be exclusive when insert allowed */
+    Assert(!insert || (exclusive && found != NULL));
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (!nowait)
+        LWLockAcquire(PARTITION_LOCK(hash_table, partidx), lockmode);
+    else if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partidx),
+                                       lockmode))
+        return NULL;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
     item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
 
     if (item)
-        *found = true;
+    {
+        if (found)
+            *found = true;
+    }
     else
     {
-        *found = false;
+        if (found)
+            *found = false;
+
+        if (!insert)
+        {
+            /* The caller didn't ask us to add a new entry. */
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+            return NULL;
+        }
 
         /* Check if we are getting too full. */
         if (partition->count > MAX_COUNT_PER_PARTITION(hash_table))
@@ -479,7 +483,8 @@ restart:
              * Give up our existing lock first, because resizing needs to
              * reacquire all the locks in the right order to avoid deadlocks.
              */
-            LWLockRelease(PARTITION_LOCK(hash_table, partition_index));
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+
             resize(hash_table, hash_table->size_log2 + 1);
 
             goto restart;
@@ -493,12 +498,13 @@ restart:
         ++partition->count;
     }
 
-    /* The caller must release the lock with dshash_release_lock. */
+    /* The caller will free the lock by calling dshash_release_lock. */
     hash_table->find_locked = true;
-    hash_table->find_exclusively_locked = true;
+    hash_table->find_exclusively_locked = exclusive;
     return ENTRY_FROM_ITEM(item);
 }
 
+
 /*
  * Remove an entry by key.  Returns true if the key was found and the
  * corresponding entry was removed.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index a6ea377173..5b8114d041 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -91,6 +91,9 @@ extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                                    const void *key, bool *found);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+                                  bool exclusive, bool nowait, bool insert,
+                                  bool *found);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.27.0
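
The nowait control flow added here can be illustrated without PostgreSQL. The sketch below is a toy model under stated assumptions, not the dshash implementation: a C11 atomic_flag stands in for the partition LWLock, the "table" is a single slot, and every `toy_` name is hypothetical. It mirrors only the shape of dshash_find_extended: try the lock, give up with NULL if it is busy, release the lock on a miss without insert, and return with the lock held on success.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for one partition: a spin flag plus a single key/value slot. */
typedef struct toy_partition
{
    atomic_flag locked;
    bool        used;
    int         key;
    int         value;
} toy_partition;

static void
toy_partition_init(toy_partition *p)
{
    atomic_flag_clear(&p->locked);
    p->used = false;
    p->key = 0;
    p->value = 0;
}

/* Returns true if the lock was obtained without waiting. */
static bool
toy_try_lock(toy_partition *p)
{
    return !atomic_flag_test_and_set(&p->locked);
}

static void
toy_lock(toy_partition *p)
{
    while (atomic_flag_test_and_set(&p->locked))
        ;                       /* spin; the real code sleeps on an LWLock */
}

static void
toy_unlock(toy_partition *p)
{
    atomic_flag_clear(&p->locked);
}

/*
 * Returns a pointer to the value with the lock still held, or NULL with the
 * lock not held.  On lock failure in nowait mode *found is left untouched.
 */
static int *
toy_find_extended(toy_partition *p, int key,
                  bool nowait, bool insert, bool *found)
{
    if (!nowait)
        toy_lock(p);
    else if (!toy_try_lock(p))
        return NULL;            /* lock busy: return immediately */

    if (p->used && p->key == key)
    {
        if (found)
            *found = true;
        return &p->value;
    }

    if (found)
        *found = false;

    /* Miss without insert, or the single toy slot is taken by another key. */
    if (!insert || p->used)
    {
        toy_unlock(p);
        return NULL;
    }

    p->used = true;
    p->key = key;
    p->value = 0;
    return &p->value;
}
```

One consequence visible in the toy, and in the patch as posted, is that a NULL result in nowait mode can mean either "lock busy" or "not found"; a caller that needs to tell them apart has to rely on whether *found was written.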

From 0ae3df9f2544ebc76bf2e3b36f9e0f79b81ac68f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:59:38 +0900
Subject: [PATCH v46 3/7] Make archiver process an auxiliary process

This is a preliminary patch for shared-memory based stats collector.

The archiver process must be an auxiliary process since it uses shared
memory once stats data has been moved into shared memory. Make the
process an auxiliary process in order to make it work.
---
 src/backend/access/transam/xlogarchive.c |   6 +-
 src/backend/bootstrap/bootstrap.c        |  22 ++--
 src/backend/postmaster/pgarch.c          | 130 +++--------------------
 src/backend/postmaster/postmaster.c      |  50 +++++----
 src/backend/storage/lmgr/proc.c          |   1 +
 src/include/access/xlog.h                |   3 +
 src/include/access/xlogarchive.h         |   1 +
 src/include/miscadmin.h                  |   2 +
 src/include/postmaster/pgarch.h          |   4 +-
 src/include/storage/pmsignal.h           |   1 -
 src/include/storage/proc.h               |   3 +
 11 files changed, 69 insertions(+), 154 deletions(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 1c5a4f8b5a..d01859bde5 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -29,7 +29,9 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/latch.h"
 #include "storage/pmsignal.h"
+#include "storage/proc.h"
 
 /*
  * Attempt to retrieve the specified file from off-line archival storage.
@@ -490,8 +492,8 @@ XLogArchiveNotify(const char *xlog)
     }
 
     /* Notify archiver that it's got something to do */
-    if (IsUnderPostmaster)
-        SendPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER);
+    if (IsUnderPostmaster && ProcGlobal->archiverLatch)
+        SetLatch(ProcGlobal->archiverLatch);
 }
 
 /*
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 6f615e6622..41da0c5059 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -317,6 +317,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
         case StartupProcess:
             MyBackendType = B_STARTUP;
             break;
+        case ArchiverProcess:
+            MyBackendType = B_ARCHIVER;
+            break;
         case BgWriterProcess:
             MyBackendType = B_BG_WRITER;
             break;
@@ -437,30 +440,29 @@ AuxiliaryProcessMain(int argc, char *argv[])
             proc_exit(1);        /* should never return */
 
         case StartupProcess:
-            /* don't set signals, startup process has its own agenda */
             StartupProcessMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
+
+        case ArchiverProcess:
+            PgArchiverMain();
+            proc_exit(1);
 
         case BgWriterProcess:
-            /* don't set signals, bgwriter has its own agenda */
             BackgroundWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case CheckpointerProcess:
-            /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalWriterProcess:
-            /* don't set signals, walwriter has its own agenda */
             InitXLOGAccess();
             WalWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalReceiverProcess:
-            /* don't set signals, walreceiver has its own agenda */
             WalReceiverMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         default:
             elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index edec311f12..9a2e21bf86 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -48,6 +48,7 @@
 #include "storage/latch.h"
 #include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
+#include "storage/procsignal.h"
 #include "utils/guc.h"
 #include "utils/ps_status.h"
 
@@ -78,13 +79,11 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
-static volatile sig_atomic_t wakened = false;
 static volatile sig_atomic_t ready_to_stop = false;
 
 /* ----------
@@ -95,8 +94,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
 static void pgarch_ArchiverCopyLoop(void);
@@ -110,75 +107,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -212,14 +140,9 @@ pgarch_forkexec(void)
 #endif                            /* EXEC_BACKEND */
 
 
-/*
- * PgArchiverMain
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
- *    since we don't use 'em, it hardly matters...
- */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+/* Main entry point for archiver process */
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -231,33 +154,26 @@ PgArchiverMain(int argc, char *argv[])
     /* SIGQUIT handler was already set up by InitPostmasterChild */
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, pgarch_waken);
+    pqsignal(SIGUSR1, procsignal_sigusr1_handler);
     pqsignal(SIGUSR2, pgarch_waken_stop);
+
     /* Reset some signals that are accepted by postmaster but not here */
     pqsignal(SIGCHLD, SIG_DFL);
+
+    /* Unblock signals (they were blocked when the postmaster forked us) */
     PG_SETMASK(&UnBlockSig);
 
-    MyBackendType = B_ARCHIVER;
-    init_ps_display(NULL);
+    /*
+     * Advertise our latch that backends can use to wake us up while we're
+     * sleeping.
+     */
+    ProcGlobal->archiverLatch = &MyProc->procLatch;
 
     pgarch_MainLoop();
 
     exit(0);
 }
 
-/* SIGUSR1 signal handler for archiver process */
-static void
-pgarch_waken(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    /* set flag that there is work to be done */
-    wakened = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
 /* SIGUSR2 signal handler for archiver process */
 static void
 pgarch_waken_stop(SIGNAL_ARGS)
@@ -282,14 +198,6 @@ pgarch_MainLoop(void)
     pg_time_t    last_copy_time = 0;
     bool        time_to_stop;
 
-    /*
-     * We run the copy loop immediately upon entry, in case there are
-     * unarchived files left over from a previous database run (or maybe the
-     * archiver died unexpectedly).  After that we wait for a signal or
-     * timeout before doing more.
-     */
-    wakened = true;
-
     /*
      * There shouldn't be anything for the archiver to do except to wait for a
      * signal ... however, the archiver exists to protect our data, so she
@@ -328,12 +236,8 @@ pgarch_MainLoop(void)
         }
 
         /* Do what we're here for */
-        if (wakened || time_to_stop)
-        {
-            wakened = false;
-            pgarch_ArchiverCopyLoop();
-            last_copy_time = time(NULL);
-        }
+        pgarch_ArchiverCopyLoop();
+        last_copy_time = time(NULL);
 
         /*
          * Sleep until a signal is received, or until a poll is forced by
@@ -354,13 +258,9 @@ pgarch_MainLoop(void)
                                WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                                timeout * 1000L,
                                WAIT_EVENT_ARCHIVER_MAIN);
-                if (rc & WL_TIMEOUT)
-                    wakened = true;
                 if (rc & WL_POSTMASTER_DEATH)
                     time_to_stop = true;
             }
-            else
-                wakened = true;
         }
 
         /*
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 7de27ee4e0..af91c313e2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -548,6 +548,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1788,7 +1789,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -3027,7 +3028,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3162,20 +3163,16 @@ reaper(SIGNAL_ARGS)
         }
 
         /*
-         * Was it the archiver?  If so, just try to start a new one; no need
-         * to force reset of the rest of the system.  (If fail, we'll try
-         * again in future cycles of the main loop.).  Unless we were waiting
-         * for it to shut down; don't restart it in that case, and
-         * PostmasterStateMachine() will advance to the next shutdown step.
+         * Was it the archiver?  Normal exit can be ignored; we'll start a new
+         * one at the next iteration of the postmaster's main loop, if
+         * necessary. Any other exit condition is treated as a crash.
          */
         if (pid == PgArchPID)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3423,7 +3420,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3629,6 +3626,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3932,6 +3941,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5160,7 +5170,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5214,16 +5224,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (StartWorkerNeeded || HaveCrashedWorker)
         maybe_start_bgworkers();
 
-    if (CheckPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER) &&
-        PgArchPID != 0)
-    {
-        /*
-         * Send SIGUSR1 to archiver process, to wake it up and begin archiving
-         * next WAL file.
-         */
-        signal_child(PgArchPID, SIGUSR1);
-    }
-
     /* Tell syslogger to rotate logfile if requested */
     if (SysLoggerPID != 0)
     {
@@ -5465,6 +5465,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index c87ffc6549..a1e51c5b99 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -182,6 +182,7 @@ InitProcGlobal(void)
     ProcGlobal->startupBufferPinWaitBufId = -1;
     ProcGlobal->walwriterLatch = NULL;
     ProcGlobal->checkpointerLatch = NULL;
+    ProcGlobal->archiverLatch = NULL;
     pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
     pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75ec1073bd..551f518cc2 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -339,6 +339,9 @@ extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern void XLogArchiveWakeupStart(void);
+extern void XLogArchiveWakeupEnd(void);
+extern void XLogArchiveWakeup(void);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool PromoteIsTriggered(void);
diff --git a/src/include/access/xlogarchive.h b/src/include/access/xlogarchive.h
index 3edd1a976c..1a59181cf9 100644
--- a/src/include/access/xlogarchive.h
+++ b/src/include/access/xlogarchive.h
@@ -25,6 +25,7 @@ extern void ExecuteRecoveryCommand(const char *command, const char *commandName,
 extern void KeepFileRestoredFromArchive(const char *path, const char *xlogfname);
 extern void XLogArchiveNotify(const char *xlog);
 extern void XLogArchiveNotifySeg(XLogSegNo segno);
+extern void XLogArchiveWakeup(void);
 extern void XLogArchiveForceDone(const char *xlog);
 extern bool XLogArchiveCheckDone(const char *xlog);
 extern bool XLogArchiveIsBusy(const char *xlog);
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bdc97e308..adb9f819bb 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -419,6 +419,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -431,6 +432,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index d102a21ab7..385b002dfe 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index dbbed18f61..8ed4d87ae6 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -34,7 +34,6 @@ typedef enum
 {
     PMSIGNAL_RECOVERY_STARTED,    /* recovery has started */
     PMSIGNAL_BEGIN_HOT_STANDBY, /* begin Hot Standby */
-    PMSIGNAL_WAKEN_ARCHIVER,    /* send a NOTIFY signal to xlog archiver */
     PMSIGNAL_ROTATE_LOGFILE,    /* send SIGUSR1 to syslogger to rotate logfile */
     PMSIGNAL_START_AUTOVAC_LAUNCHER,    /* start an autovacuum launcher */
     PMSIGNAL_START_AUTOVAC_WORKER,    /* start an autovacuum worker */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 0786fcf103..430d438303 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -354,6 +354,9 @@ typedef struct PROC_HDR
     int            startupProcPid;
     /* Buffer id of the buffer that Startup process waits for pin on, or -1 */
     int            startupBufferPinWaitBufId;
+    /* Archiver process's latch */
+    Latch       *archiverLatch;
+    /* Current shared estimate of appropriate spins_per_delay value */
 } PROC_HDR;
 
 extern PGDLLIMPORT PROC_HDR *ProcGlobal;
-- 
2.27.0

From c9a89402b269250551258964ccc80e95ce438151 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:14 +0900
Subject: [PATCH v46 4/7] Shared-memory based stats collector

Previously, activity statistics were collected via sockets and shared
among backends through files written periodically. Such files can reach
tens of megabytes and are created at most once per second; the stats
collector serializes that data and every backend then de-serializes it
periodically. To avoid that cost, this patch places activity
statistics data in shared memory. Each backend accumulates statistics
locally, then tries to move them into the shared statistics at every
transaction end, but at intervals no shorter than 10 seconds. Until 60
seconds have elapsed since the last flush to shared stats, a failure to
acquire a lock postpones the flush, to alleviate lock contention that
would slow down transactions.  After that, the flush waits for the
locks so that the shared statistics do not get stale.
---
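Note: the flush timing described in the commit message can be read as a
three-way decision made at each transaction end. The following is only a
minimal sketch of that reading, not code from this patch; every name in
it (decide_flush, FlushAction, the FLUSH_* constants) is hypothetical,
and only the 10-second and 60-second values come from the message above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Intervals taken from the commit message; all names are hypothetical. */
#define FLUSH_MIN_INTERVAL_MS 10000	/* never flush more often than this */
#define FLUSH_MAX_INTERVAL_MS 60000	/* from here on, wait for the lock */

typedef enum
{
	FLUSH_SKIP,					/* too soon; keep accumulating locally */
	FLUSH_TRY,					/* try the lock, postpone if it is busy */
	FLUSH_WAIT					/* flush now, blocking on the lock */
} FlushAction;

/*
 * Decide what to do at transaction end, given the current time and the
 * time of the last successful flush, both in milliseconds.
 */
static FlushAction
decide_flush(int64_t now_ms, int64_t last_flush_ms)
{
	int64_t		elapsed;

	assert(now_ms >= last_flush_ms);
	elapsed = now_ms - last_flush_ms;

	if (elapsed < FLUSH_MIN_INTERVAL_MS)
		return FLUSH_SKIP;
	if (elapsed < FLUSH_MAX_INTERVAL_MS)
		return FLUSH_TRY;
	return FLUSH_WAIT;
}
```

In this reading a backend that keeps losing the lock race gets its
pending counts into shared memory no later than roughly 60 seconds
after its previous flush, which is what bounds staleness.
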
 src/backend/access/heap/heapam_handler.c      |    4 +-
 src/backend/access/heap/vacuumlazy.c          |    2 +-
 src/backend/access/transam/xlog.c             |    6 +-
 src/backend/catalog/index.c                   |   24 +-
 src/backend/commands/analyze.c                |    8 +-
 src/backend/commands/dbcommands.c             |    2 +-
 src/backend/commands/matview.c                |    8 +-
 src/backend/commands/vacuum.c                 |    4 +-
 src/backend/postmaster/autovacuum.c           |   76 +-
 src/backend/postmaster/bgwriter.c             |    4 +-
 src/backend/postmaster/checkpointer.c         |   26 +-
 src/backend/postmaster/pgarch.c               |   12 +-
 src/backend/postmaster/pgstat.c               | 6135 +++++++----------
 src/backend/postmaster/postmaster.c           |   88 +-
 src/backend/replication/basebackup.c          |    4 +-
 src/backend/replication/slot.c                |   12 +-
 src/backend/storage/buffer/bufmgr.c           |    8 +-
 src/backend/storage/ipc/ipci.c                |    2 +
 src/backend/storage/lmgr/lwlock.c             |    4 +-
 src/backend/storage/lmgr/lwlocknames.txt      |    1 +
 src/backend/storage/smgr/smgr.c               |    4 +-
 src/backend/tcop/postgres.c                   |   41 +-
 src/backend/utils/adt/pgstatfuncs.c           |   95 +-
 src/backend/utils/cache/relcache.c            |    5 +
 src/backend/utils/init/globals.c              |    1 +
 src/backend/utils/init/miscinit.c             |    3 -
 src/backend/utils/init/postinit.c             |   11 +
 src/backend/utils/misc/guc.c                  |   14 +-
 src/backend/utils/misc/postgresql.conf.sample |    2 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |    4 +-
 src/include/miscadmin.h                       |    3 +-
 src/include/pgstat.h                          |  727 +-
 src/include/storage/lwlock.h                  |    1 +
 src/include/utils/guc_tables.h                |    2 +-
 src/include/utils/timeout.h                   |    1 +
 35 files changed, 2791 insertions(+), 4553 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 4a70e20a14..a8c882d315 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1086,8 +1086,8 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
                  * our own.  In this case we should count and sample the row,
                  * to accommodate users who load a table and analyze it in one
                  * transaction.  (pgstat_report_analyze has to adjust the
-                 * numbers we send to the stats collector to make this come
-                 * out right.)
+                 * numbers we report to the activity stats facility to make
+                 * this come out right.)
                  */
                 if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(targtuple->t_data)))
                 {
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f3d2265fad..ceccb7e19e 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -598,7 +598,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
                         new_min_multi,
                         false);
 
-    /* report results to the stats collector, too */
+    /* report results to the activity stats facility, too */
     pgstat_report_vacuum(RelationGetRelid(onerel),
                          onerel->rd_rel->relisshared,
                          Max(new_live_tuples, 0),
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b18257c198..a4ab63823c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2208,7 +2208,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
                     WriteRqst.Flush = 0;
                     XLogWrite(WriteRqst, false);
                     LWLockRelease(WALWriteLock);
-                    WalStats.m_wal_buffers_full++;
+                    WalStats.wal_buffers_full++;
                     TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
                 }
                 /* Re-acquire WALBufMappingLock and retry */
@@ -8608,8 +8608,8 @@ LogCheckpointEnd(bool restartpoint)
                                                  CheckpointStats.ckpt_sync_end_t);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time += write_msecs;
-    BgWriterStats.m_checkpoint_sync_time += sync_msecs;
+    CheckPointerStats.checkpoint_write_time += write_msecs;
+    CheckPointerStats.checkpoint_sync_time += sync_msecs;
 
     /*
      * All of the published timing statistics are accounted for.  Only
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index cffbc0ac38..3f55c34909 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1854,28 +1854,10 @@ index_concurrently_swap(Oid newIndexId, Oid oldIndexId, const char *oldName)
 
     /*
      * Copy over statistics from old to new index
+     * The data will be flushed by the next pgstat_report_stat()
+     * call.
      */
-    {
-        PgStat_StatTabEntry *tabentry;
-
-        tabentry = pgstat_fetch_stat_tabentry(oldIndexId);
-        if (tabentry)
-        {
-            if (newClassRel->pgstat_info)
-            {
-                newClassRel->pgstat_info->t_counts.t_numscans = tabentry->numscans;
-                newClassRel->pgstat_info->t_counts.t_tuples_returned = tabentry->tuples_returned;
-                newClassRel->pgstat_info->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_hit = tabentry->blocks_hit;
-
-                /*
-                 * The data will be sent by the next pgstat_report_stat()
-                 * call.
-                 */
-            }
-        }
-    }
+    pgstat_copy_index_counters(oldIndexId, newClassRel->pgstat_info);
 
     /* Copy data of pg_statistic from the old index to the new one */
     CopyStatistics(oldIndexId, newIndexId);
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 7295cf0215..308b4ab034 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -644,10 +644,10 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
     }
 
     /*
-     * Report ANALYZE to the stats collector, too.  However, if doing
-     * inherited stats we shouldn't report, because the stats collector only
-     * tracks per-table stats.  Reset the changes_since_analyze counter only
-     * if we analyzed all columns; otherwise, there is still work for
+     * Report ANALYZE to the activity stats facility, too.  However, if doing
+     * inherited stats we shouldn't report, because the activity stats facility
+     * only tracks per-table stats.  Reset the changes_since_analyze counter
+     * only if we analyzed all columns; otherwise, there is still work for
      * auto-analyze to do.
      */
     if (!inh)
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 2b159b60eb..acea4de382 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -971,7 +971,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
     DropDatabaseBuffers(db_id);
 
     /*
-     * Tell the stats collector to forget it immediately, too.
+     * Tell the activity stats facility to forget it immediately, too.
      */
     pgstat_drop_database(db_id);
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index c5c25ce11d..1464b97c7f 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -336,10 +336,10 @@ ExecRefreshMatView(RefreshMatViewStmt *stmt, const char *queryString,
         refresh_by_heap_swap(matviewOid, OIDNewHeap, relpersistence);
 
         /*
-         * Inform stats collector about our activity: basically, we truncated
-         * the matview and inserted some new data.  (The concurrent code path
-         * above doesn't need to worry about this because the inserts and
-         * deletes it issues get counted by lower-level code.)
+         * Inform activity stats facility about our activity: basically, we
+         * truncated the matview and inserted some new data.  (The concurrent
+         * code path above doesn't need to worry about this because the inserts
+         * and deletes it issues get counted by lower-level code.)
          */
         pgstat_count_truncate(matviewRel);
         if (!stmt->skipData)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index b97d48ee01..e80ab5edd0 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -319,8 +319,8 @@ vacuum(List *relations, VacuumParams *params,
                  errmsg("VACUUM option DISABLE_PAGE_SKIPPING cannot be used with FULL")));
 
     /*
-     * Send info about dead objects to the statistics collector, unless we are
-     * in autovacuum --- autovacuum.c does this for itself.
+     * Send info about dead objects to the activity statistics facility, unless
+     * we are in autovacuum --- autovacuum.c does this for itself.
      */
     if ((params->options & VACOPT_VACUUM) && !IsAutoVacuumWorkerProcess())
         pgstat_vacuum_stat();
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 47e60ca561..194a02024b 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -342,9 +342,6 @@ static void autovacuum_do_vac_analyze(autovac_table *tab,
                                       BufferAccessStrategy bstrategy);
 static AutoVacOpts *extract_autovac_opts(HeapTuple tup,
                                          TupleDesc pg_class_desc);
-static PgStat_StatTabEntry *get_pgstat_tabentry_relid(Oid relid, bool isshared,
-                                                      PgStat_StatDBEntry *shared,
-                                                      PgStat_StatDBEntry *dbentry);
 static void perform_work_item(AutoVacuumWorkItem *workitem);
 static void autovac_report_activity(autovac_table *tab);
 static void autovac_report_workitem(AutoVacuumWorkItem *workitem,
@@ -1684,12 +1681,12 @@ AutoVacWorkerMain(int argc, char *argv[])
         char        dbname[NAMEDATALEN];
 
         /*
-         * Report autovac startup to the stats collector.  We deliberately do
-         * this before InitPostgres, so that the last_autovac_time will get
-         * updated even if the connection attempt fails.  This is to prevent
-         * autovac from getting "stuck" repeatedly selecting an unopenable
-         * database, rather than making any progress on stuff it can connect
-         * to.
+         * Report autovac startup to the activity stats facility.  We
+         * deliberately do this before InitPostgres, so that the
+         * last_autovac_time will get updated even if the connection attempt
+         * fails.  This is to prevent autovac from getting "stuck" repeatedly
+         * selecting an unopenable database, rather than making any progress on
+         * stuff it can connect to.
          */
         pgstat_report_autovac(dbid);
 
@@ -1961,8 +1958,6 @@ do_autovacuum(void)
     HASHCTL        ctl;
     HTAB       *table_toast_map;
     ListCell   *volatile cell;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     BufferAccessStrategy bstrategy;
     ScanKeyData key;
     TupleDesc    pg_class_desc;
@@ -1981,17 +1976,11 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /*
-     * may be NULL if we couldn't find an entry (only happens if we are
-     * forcing a vacuum for anti-wrap purposes).
-     */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
 
     /*
-     * Clean up any dead statistics collector entries for this DB. We always
+     * Clean up any dead activity statistics entries for this DB. We always
      * want to do this exactly once per DB-processing cycle, even if we find
      * nothing worth vacuuming in the database.
      */
@@ -2034,9 +2023,6 @@ do_autovacuum(void)
     /* StartTransactionCommand changed elsewhere */
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-
     classRel = table_open(RelationRelationId, AccessShareLock);
 
     /* create a copy so we can use it after closing pg_class */
@@ -2114,8 +2100,8 @@ do_autovacuum(void)
 
         /* Fetch reloptions and the pgstat entry for this table */
         relopts = extract_autovac_opts(tuple, pg_class_desc);
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                       relid);
 
         /* Check if it needs vacuum or analyze */
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
@@ -2198,8 +2184,8 @@ do_autovacuum(void)
         }
 
         /* Fetch the pgstat entry for this table */
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                       relid);
 
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
@@ -2758,29 +2744,6 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
     return av;
 }
 
-/*
- * get_pgstat_tabentry_relid
- *
- * Fetch the pgstat entry of a table, either local to a database or shared.
- */
-static PgStat_StatTabEntry *
-get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
-                          PgStat_StatDBEntry *dbentry)
-{
-    PgStat_StatTabEntry *tabentry = NULL;
-
-    if (isshared)
-    {
-        if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
-    }
-    else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
-
-    return tabentry;
-}
 
 /*
  * table_recheck_autovac
@@ -2984,17 +2947,10 @@ recheck_relation_needs_vacanalyze(Oid relid,
                                   bool *wraparound)
 {
     PgStat_StatTabEntry *tabentry;
-    PgStat_StatDBEntry *shared = NULL;
-    PgStat_StatDBEntry *dbentry = NULL;
-
-    if (classForm->relisshared)
-        shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    else
-        dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
 
     /* fetch the pgstat table entry */
-    tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                         shared, dbentry);
+    tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                   relid);
 
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
@@ -3024,7 +2980,7 @@ recheck_relation_needs_vacanalyze(Oid relid,
  *
  * For analyze, the analysis done is that the number of tuples inserted,
  * deleted and updated since the last analyze exceeds a threshold calculated
- * in the same fashion as above.  Note that the collector actually stores
+ * in the same fashion as above.  Note that the activity statistics stores
  * the number of tuples (both live and dead) that there were as of the last
  * analyze.  This is asymmetric to the VACUUM case.
  *
@@ -3034,8 +2990,8 @@ recheck_relation_needs_vacanalyze(Oid relid,
  *
  * A table whose autovacuum_enabled option is false is
  * automatically skipped (unless we have to vacuum it due to freeze_max_age).
- * Thus autovacuum can be disabled for specific tables. Also, when the stats
- * collector does not have data about a table, it will be skipped.
+ * Thus autovacuum can be disabled for specific tables. Also, when the activity
+ * statistics does not have data about a table, it will be skipped.
  *
  * A table whose vac_base_thresh value is < 0 takes the base value from the
  * autovacuum_vacuum_threshold GUC variable.  Similarly, a vac_scale_factor
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 715d5195bb..679992dc89 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -244,9 +244,9 @@ BackgroundWriterMain(void)
         can_hibernate = BgBufferSync(&wb_context);
 
         /*
-         * Send off activity statistics to the stats collector
+         * Send off activity statistics to the activity stats facility
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 54a818bf61..309a9997e1 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -360,7 +360,7 @@ CheckpointerMain(void)
         if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
         {
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            CheckPointerStats.requested_checkpoints++;
         }
 
         /*
@@ -374,7 +374,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                CheckPointerStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -495,17 +495,11 @@ CheckpointerMain(void)
         /* Check for archive_timeout and switch xlog files if necessary. */
         CheckArchiveTimeout();
 
-        /*
-         * Send off activity statistics to the stats collector.  (The reason
-         * why we re-use bgwriter-related code for this is that the bgwriter
-         * and checkpointer used to be just one process.  It's probably not
-         * worth the trouble to split the stats support into two independent
-         * stats message types.)
-         */
-        pgstat_send_bgwriter();
+        /* Send off activity statistics to the activity stats facility. */
+        pgstat_report_checkpointer();
 
         /* Send WAL statistics to the stats collector. */
-        pgstat_send_wal();
+        pgstat_report_wal();
 
         /*
          * If any checkpoint flags have been set, redo the loop to handle the
@@ -711,9 +705,9 @@ CheckpointWriteDelay(int flags, double progress)
         CheckArchiveTimeout();
 
         /*
-         * Report interim activity statistics to the stats collector.
+         * Report interim activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_report_checkpointer();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1257,8 +1251,10 @@ AbsorbSyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    CheckPointerStats.buf_written_backend
+        += CheckpointerShmem->num_backend_writes;
+    CheckPointerStats.buf_fsync_backend
+        += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 9a2e21bf86..504443dcc0 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -372,20 +372,20 @@ pgarch_ArchiverCopyLoop(void)
                 pgarch_archiveDone(xlog);
 
                 /*
-                 * Tell the collector about the WAL file that we successfully
-                 * archived
+                 * Tell the activity statistics facility about the WAL file
+                 * that we successfully archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_report_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
             else
             {
                 /*
-                 * Tell the collector about the WAL file that we failed to
-                 * archive
+                 * Tell the activity statistics facility about the WAL file
+                 * that we failed to archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_report_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3f24a33ef1..ecf9d9adcc 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,22 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Activity Statistics facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects activity statistics, e.g. per-table access statistics, of
+ *  all backends in shared memory. The activity numbers are first stored
+ *  locally in each process, then flushed to shared memory at commit
+ *  time or by idle-timeout.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ * To avoid congestion on the shared memory, shared stats are updated no more
+ * often than once per PGSTAT_MIN_INTERVAL (10000ms). If some local numbers
+ * remain unflushed due to lock failure, retry at an interval that is initially
+ * PGSTAT_RETRY_MIN_INTERVAL (1000ms) and doubles at every retry. Finally, we
+ * force the update PGSTAT_MAX_INTERVAL (60000ms) after the first attempt.
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the activity statistics facility creates the
+ *  area and loads the stored stats file, if any. The last process at shutdown
+ *  writes the shared stats to the file and then destroys the area before exit.
  *
  *    Copyright (c) 2001-2021, PostgreSQL Global Development Group
  *
@@ -19,18 +26,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -40,13 +35,9 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
-#include "executor/instrument.h"
+#include "common/hashfn.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/fork_process.h"
@@ -54,20 +45,16 @@
 #include "postmaster/postmaster.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
+#include "storage/condition_variable.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
 #include "utils/timestamp.h"
 
@@ -75,35 +62,20 @@
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
+#define PGSTAT_MIN_INTERVAL            10000    /* Minimum interval of stats data
+                                             * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_RETRY_MIN_INTERVAL    1000 /* Initial retry interval after
+                                          * PGSTAT_MIN_INTERVAL */
 
+#define PGSTAT_MAX_INTERVAL            60000    /* Longest interval of stats data
+                                             * updates */
 
 /* ----------
- * The initial size hints for the hash tables used in the collector.
+ * The initial size hints for the hash tables used in the activity statistics.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
-#define PGSTAT_TAB_HASH_SIZE    512
+#define PGSTAT_TABLE_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
 
@@ -118,7 +90,6 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
-
 /* ----------
  * GUC parameters
  * ----------
@@ -133,17 +104,11 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used, but will be removed with GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
-/*
- * BgWriter and WAL global statistics counters.
- * Stored directly in a stats message structure so they can be sent
- * without needing to copy things around.  We assume these init to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
-PgStat_MsgWal WalStats;
-
 /*
  * WAL usage counters saved from pgWALUsage at the previous call to
  * pgstat_send_wal(). This is used to calculate how much WAL usage
@@ -170,73 +135,246 @@ static const char *const slru_names[] = {
 
 #define SLRU_NUM_ELEMENTS    lengthof(slru_names)
 
+/* struct for shared SLRU stats */
+typedef struct PgStatSharedSLRUStats
+{
+    PgStat_SLRUStats        entry[SLRU_NUM_ELEMENTS];
+    LWLock                    lock;
+    pg_atomic_uint32        changecount;
+} PgStatSharedSLRUStats;
+
+StaticAssertDecl(sizeof(TimestampTz) == sizeof(pg_atomic_uint64),
+                 "size of pg_atomic_uint64 doesn't match TimestampTz");
+
+typedef struct StatsShmemStruct
+{
+    dsa_handle    stats_dsa_handle;    /* handle for stats data area */
+    dshash_table_handle hash_handle;    /* shared dbstat hash */
+    int            refcount;            /* # of processes attached to the
+                                     * shared stats memory */
+    /* Global stats structs */
+    PgStat_Archiver            archiver_stats;
+    pg_atomic_uint32        archiver_changecount;
+    PgStat_BgWriter            bgwriter_stats;
+    pg_atomic_uint32        bgwriter_changecount;
+    PgStat_CheckPointer        checkpointer_stats;
+    pg_atomic_uint32        checkpointer_changecount;
+    PgStat_Wal                wal_stats;
+    LWLock                    wal_stats_lock;
+    PgStatSharedSLRUStats    slru_stats;
+    pg_atomic_uint32        slru_changecount;
+    pg_atomic_uint64        stats_timestamp;
+
+    /* Reset offsets, protected by StatsLock */
+    PgStat_Archiver            archiver_reset_offset;
+    PgStat_BgWriter            bgwriter_reset_offset;
+    PgStat_CheckPointer        checkpointer_reset_offset;
+
+    /* file read/write protection */
+    bool                    attach_holdoff;
+    ConditionVariable        holdoff_cv;
+
+    pg_atomic_uint64        gc_count; /* # of entries deleted. not
+                                            * protected by StatsLock */
+} StatsShmemStruct;
+
+/* BgWriter global statistics counters */
+PgStat_BgWriter BgWriterStats = {0};
+
+/* CheckPointer global statistics counters */
+PgStat_CheckPointer CheckPointerStats = {0};
+
+/* WAL global statistics counters */
+PgStat_Wal WalStats = {0};
+
 /*
- * SLRU statistics counts waiting to be sent to the collector.  These are
- * stored directly in stats message format so they can be sent without needing
- * to copy things around.  We assume this variable inits to zeroes.  Entries
- * are one-to-one with slru_names[].
+ * XXXX: always try to flush WAL stats. We don't want to manipulate another
+ * counter during XLogInsert so we don't have an efficient shortcut to know
+ * whether any counter gets incremented.
  */
-static PgStat_MsgSLRU SLRUStats[SLRU_NUM_ELEMENTS];
+static inline bool
+walstats_pending(void)
+{
+    static const PgStat_Wal all_zeroes;
+
+    return memcmp(&WalStats, &all_zeroes,
+                  offsetof(PgStat_Wal, stat_reset_timestamp)) != 0;
+}
+
+/*
+ * SLRU statistics counts waiting to be written to the shared activity
+ * statistics.  We assume this variable inits to zeroes.  Entries are
+ * one-to-one with slru_names[].
+ * Changes of SLRU counters are reported within critical sections so we use
+ * static memory in order to avoid memory allocation.
+ */
+static PgStat_SLRUStats local_SLRUStats[SLRU_NUM_ELEMENTS];
+static bool     have_slrustats = false;
 
 /* ----------
  * Local data
  * ----------
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
-
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
+/* backend-lifetime storage */
+static StatsShmemStruct *StatsShmem = NULL;
+static dsa_area *area = NULL;
 
 /*
- * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
- *
- * NOTE: once allocated, TabStatusArray structures are never moved or deleted
- * for the life of the backend.  Also, we zero out the t_id fields of the
- * contained PgStat_TableStatus structs whenever they are not actively in use.
- * This allows relcache pgstat_info pointers to be treated as long-lived data,
- * avoiding repeated searches in pgstat_initstats() when a relation is
- * repeatedly opened during a transaction.
+ * Types to define shared statistics structure.
+ *
+ * Per-object statistics are stored in "shared stats" entries: structs in
+ * DSA-allocated memory that begin with a header common to all object types.
+ * Every shared stats entry is pointed to from a dshash entry via a
+ * dsa_pointer. This structure keeps shared stats entries immovable across
+ * dshash resizing, lets a backend point to them via a native pointer, and
+ * allows locking at the stats-entry level. Per-entry locking reduces lock
+ * contention compared to the partition locks of dshash. A backend accumulates
+ * stats numbers in an entry in local memory, then flushes them to the shared
+ * stats entries, basically at transaction end.
+ *
+ * Each stat entry type has a fixed member PgStat_StatEntryHeader as the first
+ * element.
+ *
+ * Shared stats are stored as:
+ *
+ * dshash pgStatSharedHash
+ *    -> PgStatHashEntry                    (dshash entry)
+ *      (dsa_pointer)-> PgStat_Stat*Entry    (dsa memory block)
+ *
+ * Shared stats entries are directly pointed from pgstat_localhash hash:
+ *
+ * pgstat_localhash pgStatEntHash
+ *    -> PgStatLocalHashEntry                (equivalent of PgStatHashEntry)
+ *      (native pointer)-> PgStat_Stat*Entry (dsa memory block)
+ *
+ * Local stats that are waiting to be flushed to shared stats are stored as:
+ *
+ * pgstat_localhash pgStatLocalHash
+ *    -> PgStatLocalHashEntry                 (local hash entry)
+ *      (native pointer)-> PgStat_Stat*Entry/TableStatus (palloc'ed memory)
  */
-#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
 
-typedef struct TabStatusArray
+/* The types of statistics entries */
+typedef enum PgStatTypes
 {
-    struct TabStatusArray *tsa_next;    /* link to next array, if any */
-    int            tsa_used;        /* # entries currently used */
-    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
-} TabStatusArray;
-
-static TabStatusArray *pgStatTabList = NULL;
+    PGSTAT_TYPE_DB,            /* database-wide statistics */
+    PGSTAT_TYPE_TABLE,        /* per-table statistics */
+    PGSTAT_TYPE_FUNCTION,    /* per-function statistics */
+    PGSTAT_TYPE_REPLSLOT    /* per-replication-slot statistics */
+} PgStatTypes;
 
 /*
- * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
+ * entry body size lookup table of shared statistics entries corresponding to
+ * PgStatTypes
  */
-typedef struct TabStatHashEntry
+static const size_t pgstat_sharedentsize[] =
 {
-    Oid            t_id;
-    PgStat_TableStatus *tsa_entry;
-} TabStatHashEntry;
+    sizeof(PgStat_StatDBEntry),     /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_StatTabEntry),    /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_StatFuncEntry),    /* PGSTAT_TYPE_FUNCTION */
+    sizeof(PgStat_ReplSlot)            /* PGSTAT_TYPE_REPLSLOT */
+};
 
-/*
- * Hash table for O(1) t_id -> tsa_entry lookup
- */
-static HTAB *pgStatTabHash = NULL;
+/* Ditto for local statistics entries */
+static const size_t pgstat_localentsize[] =
+{
+    sizeof(PgStat_StatDBEntry),            /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_TableStatus),            /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_BackendFunctionEntry),/* PGSTAT_TYPE_FUNCTION */
+    sizeof(PgStat_ReplSlot)                /* PGSTAT_TYPE_REPLSLOT */
+};
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * We should avoid overwriting the header part of a shared entry. Use these
+ * macros to know what portion of the struct is to be written or read.
+ * PGSTAT_SHENT_BODY returns a slightly smaller address than the actual
+ * address of the next member, but that doesn't matter.
  */
-static HTAB *pgStatFunctions = NULL;
+#define PGSTAT_SHENT_BODY(e) (((char *)(e)) + sizeof(PgStat_StatEntryHeader))
+#define PGSTAT_SHENT_BODY_LEN(t) \
+    (pgstat_sharedentsize[t] - sizeof(PgStat_StatEntryHeader))
+
+/* struct for shared statistics hash entry key. */
+typedef struct PgStatHashKey
+{
+    PgStatTypes type;        /* statistics entry type */
+    Oid            databaseid;    /* database ID. InvalidOid for shared objects. */
+    Oid            objectid;    /* object ID, either table or function. */
+} PgStatHashKey;
+
+/* struct for shared statistics hash entry */
+typedef struct PgStatHashEntry
+{
+    PgStatHashKey    key; /* hash key */
+    dsa_pointer        body;/* pointer to shared stats in PgStat_StatEntryHeader */
+} PgStatHashEntry;
+
+/* struct for shared statistics local hash entry. */
+typedef struct PgStatLocalHashEntry
+{
+    PgStatHashKey            key;    /* hash key */
+    char                    status;    /* for simplehash use */
+    PgStat_StatEntryHeader *body;    /* pointer to stats body in local heap */
+    dsa_pointer                dsapointer; /* dsa pointer of body */
+} PgStatLocalHashEntry;
+
+/* parameter for the shared hash */
+static const dshash_parameters dsh_params = {
+    sizeof(PgStatHashKey),
+    sizeof(PgStatHashEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+
+/* define hashtable for local hashes */
+#define SH_PREFIX pgstat_localhash
+#define SH_ELEMENT_TYPE PgStatLocalHashEntry
+#define SH_KEY_TYPE PgStatHashKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+    hash_bytes((unsigned char *)&key, sizeof(PgStatHashKey))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(PgStatHashKey)) == 0)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/* The shared hash to index activity stats entries. */
+static dshash_table *pgStatSharedHash = NULL;
 
 /*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
+ * The local cache to index shared stats entries.
+ *
+ * This is a local hash to store native pointers to shared hash
+ * entries. pgStatEntHashAge is copied from StatsShmem->gc_count at creation
+ * and garbage collection.
  */
-static bool have_function_stats = false;
+static pgstat_localhash_hash *pgStatEntHash = NULL;
+static int     pgStatEntHashAge = 0;        /* cache age of pgStatEntHash */
+
+/* Local stats numbers are stored here. */
+static pgstat_localhash_hash *pgStatLocalHash = NULL;
+
+/* entry type for oid hash */
+typedef struct pgstat_oident
+{
+    Oid oid;
+    char status;
+} pgstat_oident;
+
+/* Define hashtable for OID hashes. */
+StaticAssertDecl(sizeof(Oid) == 4, "oid is not compatible with uint32");
+#define SH_PREFIX pgstat_oid
+#define SH_ELEMENT_TYPE pgstat_oident
+#define SH_KEY_TYPE Oid
+#define SH_KEY oid
+#define SH_HASH_KEY(tb, key) hash_bytes_uint32(key)
+#define SH_EQUAL(tb, a, b) (a == b)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
 
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
@@ -273,11 +411,8 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -286,23 +421,9 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Make our own memory context to make it easy to track memory usage.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-static PgStat_WalStats walStats;
-static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
-static PgStat_ReplSlotStats *replSlotStats;
-static int    nReplSlotStats;
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
+MemoryContext    pgStatCacheContext = NULL;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -311,40 +432,57 @@ static List *pending_write_requests = NIL;
  */
 static instr_time total_func_time;
 
+/* Simple caching feature for pgstatfuncs */
+static PgStatHashKey    stathashkey_zero = {0};
+static PgStatHashKey        cached_dbent_key = {0};
+static PgStat_StatDBEntry    cached_dbent;
+static PgStatHashKey        cached_tabent_key = {0};
+static PgStat_StatTabEntry    cached_tabent;
+static PgStatHashKey        cached_funcent_key = {0};
+static PgStat_StatFuncEntry    cached_funcent;
+
+static PgStat_Archiver        cached_archiverstats;
+static bool                    cached_archiverstats_is_valid = false;
+static PgStat_BgWriter        cached_bgwriterstats;
+static bool                    cached_bgwriterstats_is_valid = false;
+static PgStat_CheckPointer    cached_checkpointerstats;
+static bool                    cached_checkpointerstats_is_valid = false;
+static PgStat_Wal            cached_walstats;
+static bool                    cached_walstats_is_valid = false;
+static PgStat_SLRUStats        cached_slrustats;
+static bool                    cached_slrustats_is_valid = false;
+static PgStat_ReplSlot       *cached_replslotstats = NULL;
+static int                    n_cached_replslotstats = -1;
 
 /* ----------
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgstat_beshutdown_hook(int code, Datum arg);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                                                 Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStat_StatDBEntry *get_local_dbstat_entry(Oid dbid);
+static PgStat_TableStatus *get_local_tabstat_entry(Oid rel_id, bool isshared);
+
+static void pgstat_write_statsfile(void);
+
+static void pgstat_read_statsfile(void);
 static void pgstat_read_current_status(void);
 
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
+static PgStat_StatEntryHeader * get_stat_entry(PgStatTypes type, Oid dbid,
+                                               Oid objid, bool nowait,
+                                               bool create, bool *found);
 
-static int    pgstat_replslot_index(const char *name, bool create_it);
-static void pgstat_reset_replslot(int i, TimestampTz ts);
+static bool flush_tabstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_funcstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_dbstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_walstat(bool nowait);
+static bool flush_slrustat(bool nowait);
+static void delete_current_stats_entry(dshash_seq_status *hstat);
+static PgStat_StatEntryHeader * get_local_stat_entry(PgStatTypes type, Oid dbid,
+                                                     Oid objid, bool create,
+                                                     bool *found);
 
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
-static void pgstat_send_slru(void);
-static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
-
-static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
+static pgstat_oid_hash *collect_oids(Oid catalogid, AttrNumber anum_oid);
 
 static void pgstat_setup_memcxt(void);
 
@@ -354,491 +492,645 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len);
-static void pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
-static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_replslot(PgStat_MsgReplSlot *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute shared memory space needed for activity statistics
+ */
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+/*
+ * StatsShmemInit - initialize during shared-memory creation
  */
 void
-pgstat_init(void)
+StatsShmemInit(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
+    bool        found;
 
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(), &found);
 
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+        ConditionVariableInit(&StatsShmem->holdoff_cv);
+        pg_atomic_init_u32(&StatsShmem->archiver_changecount, 0);
+        pg_atomic_init_u32(&StatsShmem->bgwriter_changecount, 0);
+        pg_atomic_init_u32(&StatsShmem->checkpointer_changecount, 0);
+
+        pg_atomic_init_u64(&StatsShmem->gc_count, 0);
+
+        LWLockInitialize(&StatsShmem->wal_stats_lock, LWTRANCHE_STATS);
+    }
+}
+
+/* ----------
+ * allow_next_attacher() -
+ *
+ *  Let other processes go ahead and attach to the shared stats area.
+ * ----------
+ */
+static void
+allow_next_attacher(void)
+{
+    bool triggered = false;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (StatsShmem->attach_holdoff)
     {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
+        StatsShmem->attach_holdoff = false;
+        triggered = true;
     }
+    LWLockRelease(StatsLock);
+
+    if (triggered)
+        ConditionVariableBroadcast(&StatsShmem->holdoff_cv);
+}
+
+/* ----------
+ * attach_shared_stats() -
+ *
+ *    Attach to or create the shared stats memory. If we are the first process
+ *    to use the activity stats system, read the saved statistics file, if any.
+ * ---------
+ */
+static void
+attach_shared_stats(void)
+{
+    MemoryContext oldcontext;
 
     /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
+     * Use dsm only in postmaster children, and only when tracking counts.
      */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
 
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
+    pgstat_setup_memcxt();
 
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
+    if (area)
+        return;
 
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    /* stats shared memory persists for the backend lifetime */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
 
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    /*
+     * The first attacher backend may still be reading the stats file, or the
+     * last detacher may be writing it. Wait for the work to finish.
+     */
+    ConditionVariablePrepareToSleep(&StatsShmem->holdoff_cv);
+    for (;;)
+    {
+        bool hold_off;
 
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        LWLockAcquire(StatsLock, LW_SHARED);
+        hold_off = StatsShmem->attach_holdoff;
+        LWLockRelease(StatsLock);
 
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
+        if (!hold_off)
+            break;
+
+        ConditionVariableTimedSleep(&StatsShmem->holdoff_cv, 10,
+                                    WAIT_EVENT_READING_STATS_FILE);
     }
+    ConditionVariableCancelSleep();
 
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
     /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
+     * The last process is responsible for writing out the stats file at exit.
+     * Maintain refcount so that an exiting process can tell whether it is the
+     * last one or not.
      */
-    if (!pg_set_noblock(pgStatSock))
+    if (StatsShmem->refcount > 0)
+        StatsShmem->refcount++;
+    else
     {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
+        /* We're the first process to attach to the shared stats memory */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
+
+        /* Initialize shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatSharedHash = dshash_create(area, &dsh_params, 0);
+
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->hash_handle = dshash_get_hash_table_handle(pgStatSharedHash);
+        LWLockInitialize(&StatsShmem->slru_stats.lock, LWTRANCHE_STATS);
+        pg_atomic_init_u32(&StatsShmem->slru_stats.changecount, 0);
+
+        /* Block the next attacher for a while, see the comment above. */
+        StatsShmem->attach_holdoff = true;
+
+        StatsShmem->refcount = 1;
     }
 
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
+    LWLockRelease(StatsLock);
+
+    if (area)
     {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            ereport(LOG,
-                    (errmsg("getsockopt(%s) failed: %m", "SO_RCVBUF")));
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                ereport(LOG,
-                        (errmsg("setsockopt(%s) failed: %m", "SO_RCVBUF")));
-        }
+        /*
+         * We're the first attacher process, read stats file while blocking
+         * successors.
+         */
+        Assert(StatsShmem->attach_holdoff);
+        pgstat_read_statsfile();
+        allow_next_attacher();
+    }
+    else
+    {
+        /* We're not the first one, attach existing shared area. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatSharedHash = dshash_attach(area, &dsh_params,
+                                         StatsShmem->hash_handle, 0);
     }
 
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-    /* Now that we have a long-lived socket, tell fd.c about it. */
-    ReserveExternalFD();
+    MemoryContextSwitchTo(oldcontext);
 
-    return;
+    /* don't detach automatically */
+    dsa_pin_mapping(area);
+}
 
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
+/* ----------
+ * cleanup_dropped_stats_entries() -
+ *              Clean up shared stats entries no longer used.
+ *
+ *  Shared stats entries for dropped objects may be left referenced. Clean up
+ *  our reference and drop the shared entry if needed.
+ * ----------
+ */
+static void
+cleanup_dropped_stats_entries(void)
+{
+    pgstat_localhash_iterator i;
+    PgStatLocalHashEntry   *ent;
 
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
+    if (pgStatEntHash == NULL)
+        return;
 
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
+    pgstat_localhash_start_iterate(pgStatEntHash, &i);
+    while ((ent = pgstat_localhash_iterate(pgStatEntHash, &i))
+           != NULL)
+    {
+        /*
+         * Free the shared memory chunk for the entry if we were the last
+         * referrer to a dropped entry.
+         */
+        if (pg_atomic_sub_fetch_u32(&ent->body->refcount, 1) < 1 &&
+            ent->body->dropped)
+            dsa_free(area, ent->dsapointer);
+    }
 
     /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
+     * This function is expected to be called during backend exit. So we don't
+     * bother destroying pgStatEntHash.
      */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    pgStatEntHash = NULL;
 }
 
-/*
- * subroutine for pgstat_reset_all
+/* ----------
+ * detach_shared_stats() -
+ *
+ *    Detach shared stats. Write out to file if we're the last process and told
+ *    to do so.
+ * ----------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+detach_shared_stats(bool write_file)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    bool is_last_detacher = false;
+
+    /* return immediately if there is nothing to detach */
+    if (!area || !IsUnderPostmaster)
+        return;
+
+    /* We shouldn't leave a reference to shared stats. */
+    cleanup_dropped_stats_entries();
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
+    /*
+     * If we are the last detacher, hold off the next attacher (if possible)
+     * until we finish writing the stats file.
+     */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (--StatsShmem->refcount == 0)
     {
-        int            nchars;
-        Oid            tmp_oid;
+        StatsShmem->attach_holdoff = true;
+        is_last_detacher = true;
+    }
+    LWLockRelease(StatsLock);
 
-        /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
-         */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
-
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
-
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+    if (is_last_detacher)
+    {
+        if (write_file)
+            pgstat_write_statsfile();
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+        /* allow the next attacher, if any */
+        allow_next_attacher();
     }
-    FreeDir(dir);
+
+    /*
+     * Detach the area. It is automatically destroyed when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);
+
+    area = NULL;
+    pgStatSharedHash = NULL;
+
+    /* We are going to exit. Don't bother destroying local hashes. */
+    pgStatLocalHash = NULL;
 }
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    /* standalone server doesn't use shared stats */
+    if (!IsUnderPostmaster)
+        return;
+
+    /* we must have shared stats attached */
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
+
+    /* The startup process must be the only user of shared stats */
+    Assert(StatsShmem->refcount == 1);
+
+    /*
+     * We could remove the files directly and recreate the shared memory area,
+     * but for simplicity we just discard it and then create it again.
+     */
+    detach_shared_stats(false);
+    attach_shared_stats();
 }
 
-#ifdef EXEC_BACKEND
 
 /*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
+ * fetch_lock_statentry - common helper function to fetch and lock a stats
+ * entry for flush_tabstat, flush_funcstat and flush_dbstat.
  */
-static pid_t
-pgstat_forkexec(void)
+static PgStat_StatEntryHeader *
+fetch_lock_statentry(PgStatTypes type, Oid dboid, Oid objid, bool nowait)
 {
-    char       *av[10];
-    int            ac = 0;
+    PgStat_StatEntryHeader *header;
+
+    /* find shared table stats entry corresponding to the local entry */
+    header = (PgStat_StatEntryHeader *)
+        get_stat_entry(type, dboid, objid, nowait, true, NULL);
 
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
+    /* skip if dshash failed to acquire lock */
+    if (header == NULL)
+        return NULL;
 
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&header->lock, LW_EXCLUSIVE))
+        return NULL;
 
-    return postmaster_forkexec(ac, av);
+    return header;
 }
-#endif                            /* EXEC_BACKEND */
 
 
-/*
- * pgstat_start() -
+/* ----------
+ * get_stat_entry() -
  *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
+ *    Get the shared stats entry for the specified type, dbid and objid.
+ *    If nowait is true, returns NULL on lock failure.
+ *
+ *    If create is true, a new entry is created if none exists yet. If found
+ *    is not NULL, it is set to true if an existing entry was found and to
+ *    false otherwise.
+ * ----------
+ */
+static PgStat_StatEntryHeader *
+get_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait, bool create,
+               bool *found)
+{
+    PgStatHashEntry           *shhashent;
+    PgStatLocalHashEntry   *lohashent;
+    PgStat_StatEntryHeader *shheader = NULL;
+    PgStatHashKey            key;
+    bool                    shfound;
+
+    key.type        = type;
+    key.databaseid     = dbid;
+    key.objectid    = objid;
+
+    if (pgStatEntHash)
+    {
+        uint64 currage;
+
+        /*
+         * pgStatEntHashAge advances much more slowly than the following loop
+         * runs, so this is expected to iterate no more than twice.
+         */
+        while (unlikely
+               (pgStatEntHashAge !=
+                (currage = pg_atomic_read_u64(&StatsShmem->gc_count))))
+        {
+            pgstat_localhash_iterator i;
+
+            /*
+             * Some entries have been dropped. Invalidate the cached pointers
+             * to them.
+             */
+            pgstat_localhash_start_iterate(pgStatEntHash, &i);
+            while ((lohashent = pgstat_localhash_iterate(pgStatEntHash, &i))
+                   != NULL)
+            {
+                PgStat_StatEntryHeader *header = lohashent->body;
+                if (header->dropped)
+                {
+                    pgstat_localhash_delete(pgStatEntHash, key);
+
+                    if (pg_atomic_sub_fetch_u32(&header->refcount, 1) < 1)
+                    {
+                        /*
+                         * We're the last referrer to this entry, drop the
+                         * shared entry.
+                         */
+                        dsa_free(area, lohashent->dsapointer);
+                    }
+                }
+            }
+
+            pgStatEntHashAge = currage;
+        }
+
+        lohashent = pgstat_localhash_lookup(pgStatEntHash, key);
+
+        if (lohashent)
+        {
+            if (found)
+                *found = true;
+            return lohashent->body;
+        }
+    }
+
+    shhashent = dshash_find_extended(pgStatSharedHash, &key,
+                                     create, nowait, create, &shfound);
+    if (shhashent)
+    {
+        if (create && !shfound)
+        {
+            /* Create new stats entry. */
+            dsa_pointer chunk = dsa_allocate0(area,
+                                              pgstat_sharedentsize[type]);
+
+            shheader = dsa_get_address(area, chunk);
+            LWLockInitialize(&shheader->lock, LWTRANCHE_STATS);
+            pg_atomic_init_u32(&shheader->refcount, 0);
+
+            /* Link the new entry from the hash entry. */
+            shhashent->body = chunk;
+        }
+        else
+            shheader = dsa_get_address(area, shhashent->body);
+
+        /*
+         * We expose this shared entry now.  You might think that the entry can
+         * be removed by a concurrent backend, but since we are creating a
+         * stats entry, the object actually exists and is in use in the upper
+         * layer. Such an object cannot be dropped until the first vacuum after
+         * the current transaction ends.
+         */
+        dshash_release_lock(pgStatSharedHash, shhashent);
+
+        /* register to local hash if possible */
+        if (pgStatEntHash || pgStatCacheContext)
+        {
+            bool                    lofound;
+
+            if (pgStatEntHash == NULL)
+            {
+                pgStatEntHash =
+                    pgstat_localhash_create(pgStatCacheContext,
+                                        PGSTAT_TABLE_HASH_SIZE, NULL);
+                pgStatEntHashAge =
+                    pg_atomic_read_u64(&StatsShmem->gc_count);
+            }
+
+            lohashent =
+                pgstat_localhash_insert(pgStatEntHash, key, &lofound);
+
+            Assert(!lofound);
+            lohashent->body = shheader;
+            lohashent->dsapointer = shhashent->body;
+
+            pg_atomic_add_fetch_u32(&shheader->refcount, 1);
+        }
+    }
+
+    if (found)
+        *found = shfound;
+
+    return shheader;
+}
+
+/*
+ * flush_walstat - flush out local WAL stats entries
  *
- *    Returns PID of child process, or 0 if fail.
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
  *
- *    Note: if fail, we will be called again from the postmaster main loop.
+ * Returns true if all local WAL stats are successfully flushed out.
  */
-int
-pgstat_start(void)
+static bool
+flush_walstat(bool nowait)
 {
-    time_t        curtime;
-    pid_t        pgStatPid;
+    PgStat_Wal *s = &StatsShmem->wal_stats;
+    PgStat_Wal *l = &WalStats;
+    WalUsage    all_zeroes = {0} PG_USED_FOR_ASSERTS_ONLY;
+
+    /*
+     * We don't update the WAL usage portion of the local WalStats
+     * elsewhere. Instead, fill in that portion with the difference of
+     * pgWalUsage since the previous call.
+     */
+    Assert(memcmp(&l->wal_usage, &all_zeroes, sizeof(WalUsage)) == 0);
+    WalUsageAccumDiff(&l->wal_usage, &pgWalUsage, &prevWalUsage);
 
     /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
+     * This function can be called even if nothing at all has happened. Avoid
+     * taking lock for nothing in that case.
      */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
+    if (!walstats_pending())
+        return true;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&StatsShmem->wal_stats_lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&StatsShmem->wal_stats_lock,
+                                       LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
+
+    s->wal_usage.wal_records += l->wal_usage.wal_records;
+    s->wal_usage.wal_fpi += l->wal_usage.wal_fpi;
+    s->wal_usage.wal_bytes += l->wal_usage.wal_bytes;
+    s->wal_buffers_full += l->wal_buffers_full;
+    LWLockRelease(&StatsShmem->wal_stats_lock);
 
     /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
+     * Save the current counters for the subsequent calculation of WAL usage.
      */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
+    prevWalUsage = pgWalUsage;
 
     /*
-     * Okay, fork off the collector.
+     * Clear out the statistics buffer, so it can be re-used.
      */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
+    MemSet(&WalStats, 0, sizeof(WalStats));
 
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
+    return true;
+}
 
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
+/*
+ * flush_slrustat - flush out local SLRU stats entries
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true. Writer processes are mutually excluded
+ * using an LWLock, but readers are expected to use the change-count protocol
+ * to avoid interference with writers.
+ *
+ * Returns true if all local SLRU stats are successfully flushed out.
+ */
+static bool
+flush_slrustat(bool nowait)
+{
+    uint32    assert_changecount PG_USED_FOR_ASSERTS_ONLY;
+    int i;
+
+    if (!have_slrustats)
+        return true;
 
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&StatsShmem->slru_stats.lock,
+                                       LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
 
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
+    assert_changecount =
+        pg_atomic_fetch_add_u32(&StatsShmem->slru_stats.changecount, 1);
+    Assert((assert_changecount & 1) == 0);
 
-        default:
-            return (int) pgStatPid;
+    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
+    {
+        PgStat_SLRUStats *sharedent = &StatsShmem->slru_stats.entry[i];
+        PgStat_SLRUStats *localent = &local_SLRUStats[i];
+
+        sharedent->blocks_zeroed += localent->blocks_zeroed;
+        sharedent->blocks_hit += localent->blocks_hit;
+        sharedent->blocks_read += localent->blocks_read;
+        sharedent->blocks_written += localent->blocks_written;
+        sharedent->blocks_exists += localent->blocks_exists;
+        sharedent->flush += localent->flush;
+        sharedent->truncate += localent->truncate;
     }
 
-    /* shouldn't get here */
-    return 0;
+    /* done, clear the local entry */
+    MemSet(local_SLRUStats, 0,
+           sizeof(PgStat_SLRUStats) * SLRU_NUM_ELEMENTS);
+
+    pg_atomic_add_fetch_u32(&StatsShmem->slru_stats.changecount, 1);
+    LWLockRelease(&StatsShmem->slru_stats.lock);
+
+    have_slrustats = false;
+
+    return true;
+}
+
+/* ----------
+ * delete_current_stats_entry()
+ *
+ *  Deletes the given shared entry from shared stats hash. The entry must be
+ *  exclusively locked.
+ * ----------
+ */
+static void
+delete_current_stats_entry(dshash_seq_status *hstat)
+{
+    dsa_pointer                pdsa;
+    PgStat_StatEntryHeader *header;
+    PgStatHashEntry *ent;
+
+    ent = dshash_get_current(hstat);
+    pdsa = ent->body;
+    header = dsa_get_address(area, pdsa);
+
+    /* No one can find this entry from now on. */
+    dshash_delete_current(hstat);
+
+    /*
+     * Let the referrers, if any, drop the entry.  Refcount won't be
+     * decremented until "dropped" is set to true and StatsShmem->gc_count is
+     * incremented later, so we can check refcount to set dropped without
+     * holding a lock. If no one is referring to this entry, free it now.
+     */
+
+    if (pg_atomic_read_u32(&header->refcount) > 0)
+        header->dropped = true;
+    else
+        dsa_free(area, pdsa);
+
+    return;
 }
 
-void
-allow_immediate_pgstat_restart(void)
+/* ----------
+ * get_local_stat_entry() -
+ *
+ *  Returns local stats entry for the type, dbid and objid.
+ *  If create is true, a new entry is created when none exists yet; found
+ *  must be non-null in that case.
+ *
+ *
+ *  The caller is responsible for initializing the statsbody part of the
+ *  returned memory.
+ * ----------
+ */
+static PgStat_StatEntryHeader *
+get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                     bool create, bool *found)
 {
-    last_pgstat_start_time = 0;
+    PgStatHashKey key;
+    PgStatLocalHashEntry *entry;
+
+    if (pgStatLocalHash == NULL)
+        pgStatLocalHash = pgstat_localhash_create(pgStatCacheContext,
+                                                  PGSTAT_TABLE_HASH_SIZE, NULL);
+
+    /* Find an entry or create a new one. */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    if (create)
+        entry = pgstat_localhash_insert(pgStatLocalHash, key, found);
+    else
+        entry = pgstat_localhash_lookup(pgStatLocalHash, key);
+
+    if (!create && !entry)
+        return NULL;
+
+    if (create && !*found)
+        entry->body = MemoryContextAllocZero(TopMemoryContext,
+                                             pgstat_localentsize[type]);
+
+    return entry->body;
 }
 
 /* ------------------------------------------------------------
@@ -846,150 +1138,399 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *    Updates are applied no more frequently than once per
+ *    PGSTAT_MIN_INTERVAL milliseconds. They are also postponed on lock
+ *    failure if force is false and no update has been pending for longer
+ *    than PGSTAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
+ *    succeeding calls of this function.
+ *
+ *    Returns the time in milliseconds until the next attempt to apply
+ *    updates if some updates are still pending, or 0 if nothing remains
+ *    to be applied.
+ *
+ *    Note that this is called only outside a transaction, so it is fine to use
  *    transaction stop time as an approximation of current time.
- * ----------
+ *    ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz    next_flush = 0;
+    static TimestampTz    pending_since = 0;
+    static long            retry_interval = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
+    bool        nowait;
     int            i;
+    uint64        oldval;
+
+    /* Return if not active */
+    if (area == NULL)
+        return 0;
+
+    /*
+     * We need a database entry if any of the following stats exist.
+     */
+    if (pgStatXactCommit > 0 || pgStatXactRollback > 0 ||
+        pgStatBlockReadTime > 0 || pgStatBlockWriteTime > 0)
+        get_local_dbstat_entry(MyDatabaseId);
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats)
-        return;
+    if (pgStatLocalHash == NULL && !have_slrustats && !walstats_pending())
+        return 0;
 
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the caller wants to force stats out.
-     */
     now = GetCurrentTransactionStopTimestamp();
-    if (!force &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
-    last_report = now;
-
-    /*
-     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
-     */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
-    pgStatTabHash = NULL;
-
-    /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
-     */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
-    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+
+    if (!force)
     {
-        for (i = 0; i < tsa->tsa_used; i++)
+        /*
+         * Don't flush stats too frequently.  Return the time to the next
+         * flush.
+         */
+        if (now < next_flush)
         {
-            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
-
-            /* Shouldn't have any pending transaction-dependent counts */
-            Assert(entry->trans == NULL);
-
-            /*
-             * Ignore entries that didn't accumulate any actual counts, such
-             * as indexes that were opened by the planner but not used.
-             */
-            if (memcmp(&entry->t_counts, &all_zeroes,
-                       sizeof(PgStat_TableCounts)) == 0)
+            /* Record the epoch time if retrying. */
+            if (pending_since == 0)
+                pending_since = now;
+
+            return (next_flush - now) / 1000;
+        }
+
+        /* But, don't keep pending updates longer than PGSTAT_MAX_INTERVAL. */
+
+        if (pending_since > 0 &&
+            TimestampDifferenceExceeds(pending_since, now, PGSTAT_MAX_INTERVAL))
+            force = true;
+    }
+
+    /* don't wait for lock acquisition when !force */
+    nowait = !force;
+
+    if (pgStatLocalHash)
+    {
+        int            remains = 0;
+        pgstat_localhash_iterator i;
+        List                   *dbentlist = NIL;
+        ListCell               *lc;
+        PgStatLocalHashEntry   *lent;
+
+        /* Step 1: flush out other than database stats */
+        pgstat_localhash_start_iterate(pgStatLocalHash, &i);
+        while ((lent = pgstat_localhash_iterate(pgStatLocalHash, &i)) != NULL)
+        {
+            bool        remove = false;
+
+            switch (lent->key.type)
+            {
+                case PGSTAT_TYPE_DB:
+                    /*
+                     * flush_tabstat adds some counters of flushed entries
+                     * to the local database stats. Just remember the
+                     * database entries for now and flush them out later.
+                     */
+                    dbentlist = lappend(dbentlist, lent);
+                    break;
+                case PGSTAT_TYPE_TABLE:
+                    if (flush_tabstat(lent, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_FUNCTION:
+                    if (flush_funcstat(lent, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_REPLSLOT:
+                    /* We don't have that kind of local entry */
+                    Assert(false);
+            }
+
+            if (!remove)
+            {
+                remains++;
                 continue;
+            }
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* Remove the successfully flushed entry */
+            pfree(lent->body);
+            lent->body = NULL;
+            pgstat_localhash_delete(pgStatLocalHash, lent->key);
+        }
+
+        /* Step 2: flush out database stats */
+        foreach(lc, dbentlist)
+        {
+            PgStatLocalHashEntry *lent = (PgStatLocalHashEntry *) lfirst(lc);
+
+            if (flush_dbstat(lent, nowait))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                remains--;
+                /* Remove the successfully flushed entry */
+                pfree(lent->body);
+                lent->body = NULL;
+                pgstat_localhash_delete(pgStatLocalHash, lent->key);
             }
         }
-        /* zero out PgStat_TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
+        list_free(dbentlist);
+        dbentlist = NULL;
+
+        if (remains <= 0)
+        {
+            pgstat_localhash_destroy(pgStatLocalHash);
+            pgStatLocalHash = NULL;
+        }
+    }
+
+    /* flush wal stats */
+    flush_walstat(nowait);
+
+    /* flush SLRU stats */
+    flush_slrustat(nowait);
+
+    /*
+     * Publish the time of the last flush, but don't notify anyone of the
+     * timestamp change itself. Readers will get a sufficiently recent
+     * timestamp. If we fail to update the value, a concurrent process must
+     * have updated it to a sufficiently recent time.
+     *
+     * XXX: The loop might be unnecessary for the reason above.
+     */
+    oldval = pg_atomic_read_u64(&StatsShmem->stats_timestamp);
+
+    for (i = 0 ; i < 10 ; i++)
+    {
+        if (oldval >= now ||
+            pg_atomic_compare_exchange_u64(&StatsShmem->stats_timestamp,
+                                           &oldval, (uint64) now))
+            break;
     }
 
     /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
+     * Some of the local stats may not have been flushed due to lock
+     * contention.  If we have such pending local stats here, let the caller
+     * know the retry interval.
      */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    if (pgStatLocalHash != NULL || have_slrustats || walstats_pending())
+    {
+        /* Retain the epoch time */
+        if (pending_since == 0)
+            pending_since = now;
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+        /* The interval is doubled at every retry. */
+        if (retry_interval == 0)
+            retry_interval = PGSTAT_RETRY_MIN_INTERVAL * 1000;
+        else
+            retry_interval = retry_interval * 2;
 
-    /* Send WAL statistics */
-    pgstat_send_wal();
+        /*
+         * Determine the next retry interval so as not to get shorter than the
+         * previous interval.
+         */
+        if (!TimestampDifferenceExceeds(pending_since,
+                                        now + 2 * retry_interval,
+                                        PGSTAT_MAX_INTERVAL))
+            next_flush = now + retry_interval;
+        else
+        {
+            next_flush = pending_since + PGSTAT_MAX_INTERVAL * 1000;
+            retry_interval = next_flush - now;
+        }
 
-    /* Finally send SLRU statistics */
-    pgstat_send_slru();
+        return retry_interval / 1000;
+    }
+
+    /* Set the next time to update stats */
+    next_flush = now + PGSTAT_MIN_INTERVAL * 1000;
+    retry_interval = 0;
+    pending_since = 0;
+
+    return 0;
 }
 
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * flush_tabstat - flush out a local table stats entry
+ *
+ * Some of the stats numbers are copied to local database stats entry after
+ * successful flush-out.
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
  */
-static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+static bool
+flush_tabstat(PgStatLocalHashEntry *ent, bool nowait)
 {
-    int            n;
-    int            len;
+    static const PgStat_TableCounts all_zeroes;
+    Oid                    dboid;            /* database OID of the table */
+    PgStat_TableStatus *lstats;            /* local stats entry  */
+    PgStat_StatTabEntry *shtabstats;    /* table entry of shared stats */
+    PgStat_StatDBEntry *ldbstats;        /* local database entry */
+
+    Assert(ent->key.type == PGSTAT_TYPE_TABLE);
+    lstats = (PgStat_TableStatus *) ent->body;
+    dboid = ent->key.databaseid;
+
+    /*
+     * Ignore entries that didn't accumulate any actual counts, such as
+     * indexes that were opened by the planner but not used.
+     */
+    if (memcmp(&lstats->t_counts, &all_zeroes,
+               sizeof(PgStat_TableCounts)) == 0)
+    {
+        /* This local entry is going to be dropped, delink from relcache. */
+        pgstat_delinkstats(lstats->relation);
+        return true;
+    }
+
+    /* find shared table stats entry corresponding to the local entry */
+    shtabstats = (PgStat_StatTabEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_TABLE, dboid, ent->key.objectid,
+                             nowait);
+
+    if (shtabstats == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    /* add the values to the shared entry. */
+    shtabstats->numscans += lstats->t_counts.t_numscans;
+    shtabstats->tuples_returned += lstats->t_counts.t_tuples_returned;
+    shtabstats->tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    shtabstats->tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    shtabstats->tuples_updated += lstats->t_counts.t_tuples_updated;
+    shtabstats->tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    shtabstats->tuples_hot_updated += lstats->t_counts.t_tuples_hot_updated;
+
+    /*
+     * If table was truncated or vacuum/analyze has run, first reset the
+     * live/dead counters.
+     */
+    if (lstats->t_counts.t_truncated)
+    {
+        shtabstats->n_live_tuples = 0;
+        shtabstats->n_dead_tuples = 0;
+    }
+
+    shtabstats->n_live_tuples += lstats->t_counts.t_delta_live_tuples;
+    shtabstats->n_dead_tuples += lstats->t_counts.t_delta_dead_tuples;
+    shtabstats->changes_since_analyze += lstats->t_counts.t_changed_tuples;
+    shtabstats->inserts_since_vacuum += lstats->t_counts.t_tuples_inserted;
+    shtabstats->blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    shtabstats->blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    shtabstats->n_live_tuples = Max(shtabstats->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    shtabstats->n_dead_tuples = Max(shtabstats->n_dead_tuples, 0);
+
+    LWLockRelease(&shtabstats->header.lock);
+
+    /* Successfully flushed, so add the same counts to the database stats */
+    ldbstats = get_local_dbstat_entry(dboid);
+    ldbstats->counts.n_tuples_returned += lstats->t_counts.t_tuples_returned;
+    ldbstats->counts.n_tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    ldbstats->counts.n_tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    ldbstats->counts.n_tuples_updated += lstats->t_counts.t_tuples_updated;
+    ldbstats->counts.n_tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    ldbstats->counts.n_blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    ldbstats->counts.n_blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /* This local entry is going to be dropped, delink from relcache. */
+    pgstat_delinkstats(lstats->relation);
+
+    return true;
+}
+
+/*
+ * flush_funcstat - flush out a local function stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_funcstat(PgStatLocalHashEntry *ent, bool nowait)
+{
+    PgStat_BackendFunctionEntry *localent;    /* local stats entry */
+    PgStat_StatFuncEntry *shfuncent = NULL; /* shared stats entry */
+
+    Assert(ent->key.type == PGSTAT_TYPE_FUNCTION);
+    localent = (PgStat_BackendFunctionEntry *) ent->body;
+
+    /* localent always has non-zero content */
+
+    /* find shared function stats entry corresponding to the local entry */
+    shfuncent = (PgStat_StatFuncEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             ent->key.objectid, nowait);
+    if (shfuncent == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    shfuncent->f_numcalls += localent->f_counts.f_numcalls;
+    shfuncent->f_total_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_total_time);
+    shfuncent->f_self_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_self_time);
+
+    LWLockRelease(&shfuncent->header.lock);
+
+    return true;
+}
+
+
+/*
+ * flush_dbstat - flush out a local database stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_dbstat(PgStatLocalHashEntry *ent, bool nowait)
+{
+    PgStat_StatDBEntry *localent;
+    PgStat_StatDBEntry *sharedent;
+
+    Assert(ent->key.type == PGSTAT_TYPE_DB);
+
+    localent = (PgStat_StatDBEntry *) ent->body;
+
+    /* find shared database stats entry corresponding to the local entry */
+    sharedent = (PgStat_StatDBEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_DB, ent->key.databaseid, InvalidOid,
+                             nowait);
+
+    if (!sharedent)
+        return false;            /* failed to acquire lock, skip */
+
+    sharedent->counts.n_tuples_returned += localent->counts.n_tuples_returned;
+    sharedent->counts.n_tuples_fetched += localent->counts.n_tuples_fetched;
+    sharedent->counts.n_tuples_inserted += localent->counts.n_tuples_inserted;
+    sharedent->counts.n_tuples_updated += localent->counts.n_tuples_updated;
+    sharedent->counts.n_tuples_deleted += localent->counts.n_tuples_deleted;
+    sharedent->counts.n_blocks_fetched += localent->counts.n_blocks_fetched;
+    sharedent->counts.n_blocks_hit += localent->counts.n_blocks_hit;
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    sharedent->counts.n_deadlocks += localent->counts.n_deadlocks;
+    sharedent->counts.n_temp_bytes += localent->counts.n_temp_bytes;
+    sharedent->counts.n_temp_files += localent->counts.n_temp_files;
+    sharedent->counts.n_checksum_failures += localent->counts.n_checksum_failures;
 
     /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
+     * Accumulate xact commit/rollback and I/O timings to stats entry of the
+     * current database.
      */
-    if (OidIsValid(tsmsg->m_databaseid))
+    if (OidIsValid(ent->key.databaseid))
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
+        sharedent->counts.n_xact_commit += pgStatXactCommit;
+        sharedent->counts.n_xact_rollback += pgStatXactRollback;
+        sharedent->counts.n_block_read_time += pgStatBlockReadTime;
+        sharedent->counts.n_block_write_time += pgStatBlockWriteTime;
         pgStatXactCommit = 0;
         pgStatXactRollback = 0;
         pgStatBlockReadTime = 0;
@@ -997,281 +1538,138 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        sharedent->counts.n_xact_commit = 0;
+        sharedent->counts.n_xact_rollback = 0;
+        sharedent->counts.n_block_read_time = 0;
+        sharedent->counts.n_block_write_time = 0;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
+    LWLockRelease(&sharedent->header.lock);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    return true;
 }
 
-/*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
- */
-static void
-pgstat_send_funcstats(void)
-{
-    /* we assume this inits to all zeroes: */
-    static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
-
-    if (pgStatFunctions == NULL)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
-
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
-
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
-
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
-
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
-
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
-}
-
-
 /* ----------
  * pgstat_vacuum_stat() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *  Delete shared stat entries that are not in system catalogs.
+ *
+ *  To avoid holding exclusive lock on dshash for a long time, the process is
+ *  performed in three steps.
+ *
+ *   1: Collect existent oids of every kind of object.
+ *   2: Collect victim entries by scanning with shared lock.
+ *   3: Try removing every nominated entry without waiting for lock.
+ *
+ *  As a consequence of the last step, some entries may be left alone due to
+ *  lock failure, but as explained by the comment of pgstat_vacuum_stat, they
+ *  will be deleted by later vacuums.
  * ----------
  */
 void
 pgstat_vacuum_stat(void)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Read pg_database and make a list of OIDs of all existing databases
-     */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
-
-    /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            dbid = dbentry->databaseid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        /* the DB entry for shared tables (with InvalidOid) is never dropped */
-        if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
-            pgstat_drop_database(dbid);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Lookup our own database entry; if not found, nothing more to do.
-     */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
+    pgstat_oid_hash       *dbids;    /* database ids */
+    pgstat_oid_hash       *relids;    /* relation ids in the current database */
+    pgstat_oid_hash       *funcids;/* function ids in the current database */
+    int            nvictims = 0;    /* # of entries of the above */
+    dshash_seq_status dshstat;
+    PgStatHashEntry *ent;
+
+    /* we don't collect stats under standalone mode */
+    if (!IsUnderPostmaster)
         return;
 
-    /*
-     * Similarly to above, make a list of all known relations in this DB.
-     */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
+    /* collect oids of existent objects */
+    dbids = collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    relids = collect_oids(RelationRelationId, Anum_pg_class_oid);
+    funcids = collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
 
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
+    nvictims = 0;
 
-    /*
-     * Check for all tables listed in stats hashtable if they still exist.
-     */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+    /* some of the dshash entries are to be removed, take exclusive lock. */
+    dshash_seq_init(&dshstat, pgStatSharedHash, true);
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
     {
-        Oid            tabid = tabentry->tableid;
+        pgstat_oid_hash *oidhash;
+        Oid           key;
 
         CHECK_FOR_INTERRUPTS();
 
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+        /*
+         * Don't drop entries of non-database objects that do not belong to
+         * the current database.
+         */
+        if (ent->key.type != PGSTAT_TYPE_DB &&
+            ent->key.databaseid != MyDatabaseId)
             continue;
 
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
-    }
-
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
+        switch (ent->key.type)
         {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
+            case PGSTAT_TYPE_DB:
+                /* don't remove database entry for shared tables */
+                if (ent->key.databaseid == 0)
+                    continue;
+                oidhash = dbids;
+                key = ent->key.databaseid;
+                break;
+
+            case PGSTAT_TYPE_TABLE:
+                oidhash = relids;
+                key = ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                oidhash = funcids;
+                key = ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_REPLSLOT:
+                /*
+                 * We don't bother vacuuming this kind of entry because the
+                 * number of entries is quite small and entries are likely to
+                 * be reused soon.
+                 */
                 continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
         }
 
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
+        /* Skip existent objects. */
+        if (pgstat_oid_lookup(oidhash, key) != NULL)
+            continue;
 
-        hash_destroy(htab);
+        /* drop this entry */
+        delete_current_stats_entry(&dshstat);
+        nvictims++;
     }
+    dshash_seq_term(&dshstat);
+    pgstat_oid_destroy(dbids);
+    pgstat_oid_destroy(relids);
+    pgstat_oid_destroy(funcids);
+
+    if (nvictims > 0)
+        pg_atomic_add_fetch_u64(&StatsShmem->gc_count, 1);
 }
 
-
 /* ----------
- * pgstat_collect_oids() -
+ * collect_oids() -
  *
  *    Collect the OIDs of all objects listed in the specified system catalog
- *    into a temporary hash table.  Caller should hash_destroy the result
+ *    into a temporary hash table.  Caller should pgstat_oid_destroy the result
  *    when done with it.  (However, we make the table in CurrentMemoryContext
  *    so that it will be freed properly in event of an error.)
  * ----------
  */
-static HTAB *
-pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
+static pgstat_oid_hash *
+collect_oids(Oid catalogid, AttrNumber anum_oid)
 {
-    HTAB       *htab;
-    HASHCTL        hash_ctl;
+    pgstat_oid_hash *rethash;
     Relation    rel;
     TableScanDesc scan;
     HeapTuple    tup;
     Snapshot    snapshot;
 
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(Oid);
-    hash_ctl.hcxt = CurrentMemoryContext;
-    htab = hash_create("Temporary table of OIDs",
-                       PGSTAT_TAB_HASH_SIZE,
-                       &hash_ctl,
-                       HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    rethash = pgstat_oid_create(CurrentMemoryContext,
+                                PGSTAT_TABLE_HASH_SIZE, NULL);
 
     rel = table_open(catalogid, AccessShareLock);
     snapshot = RegisterSnapshot(GetLatestSnapshot());
@@ -1280,81 +1678,61 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     {
         Oid            thisoid;
         bool        isnull;
+        bool        found;
 
         thisoid = heap_getattr(tup, anum_oid, RelationGetDescr(rel), &isnull);
         Assert(!isnull);
 
         CHECK_FOR_INTERRUPTS();
 
-        (void) hash_search(htab, (void *) &thisoid, HASH_ENTER, NULL);
+        pgstat_oid_insert(rethash, thisoid, &found);
     }
     table_endscan(scan);
     UnregisterSnapshot(snapshot);
     table_close(rel, AccessShareLock);
 
-    return htab;
+    return rethash;
 }
 
-
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
- * ----------
+ *    Remove entries for the database that we just dropped.
+ *
+ *    Some entries might be left behind because of a lock failure, or because
+ *    stats are flushed after this point, but we will still clean up the dead
+ *    DB eventually via future invocations of pgstat_vacuum_stat().
+ * ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
 
+    Assert(OidIsValid(databaseid));
 
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
-
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
+    /* some of the dshash entries are to be removed, take exclusive lock. */
+    dshash_seq_init(&hstat, pgStatSharedHash, true);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+    {
+        if (p->key.databaseid == databaseid)
+            delete_current_stats_entry(&hstat);
+    }
+    dshash_seq_term(&hstat);
+
+    /* Let readers run a garbage collection of local hashes */
+    pg_atomic_add_fetch_u64(&StatsShmem->gc_count, 1);
 }
-#endif                            /* NOT_USED */
 
 
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1363,53 +1741,146 @@ pgstat_drop_relation(Oid relid)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    /* dshash entry is not modified, take shared lock */
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+    {
+        PgStat_StatEntryHeader *header;
+
+        if (p->key.databaseid != MyDatabaseId)
+            continue;
+
+        header = dsa_get_address(area, p->body);
+
+        LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+        memset(PGSTAT_SHENT_BODY(header), 0,
+               PGSTAT_SHENT_BODY_LEN(p->key.type));
+
+        if (p->key.type == PGSTAT_TYPE_DB)
+        {
+            PgStat_StatDBEntry *dbstat = (PgStat_StatDBEntry *) header;
+            dbstat->stat_reset_timestamp = GetCurrentTimestamp();
+        }
+        LWLockRelease(&header->lock);
+    }
+    dshash_seq_term(&hstat);
+
+    /* Invalidate the simple cache keys */
+    cached_dbent_key = stathashkey_zero;
+    cached_tabent_key = stathashkey_zero;
+    cached_funcent_key = stathashkey_zero;
+}
+
+/*
+ * pgstat_copy_global_stats - helper function for functions
+ *           pgstat_fetch_stat_*() and pgstat_reset_shared_counters().
+ *
+ * Copies out the specified memory area following change-count protocol.
+ */
+static inline void
+pgstat_copy_global_stats(void *dst, void *src, size_t len,
+                         pg_atomic_uint32 *count)
+{
+    uint32        before_changecount;
+    uint32        after_changecount;
+
+    after_changecount = pg_atomic_read_u32(count);
+
+    do
+    {
+        before_changecount = after_changecount;
+        memcpy(dst, src, len);
+        after_changecount = pg_atomic_read_u32(count);
+    } while ((before_changecount & 1) == 1 ||
+             after_changecount != before_changecount);
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
+ *
+ *  We don't scribble on shared stats while resetting to avoid locking on
+ *  shared stats struct. Instead, just record the current counters in another
+ *  shared struct, which is protected by StatsLock. See
+ *  pgstat_fetch_stat_(archiver|bgwriter|checkpointer) for the reader side.
  * ----------
  */
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    TimestampTz now = GetCurrentTimestamp();
+    PgStat_Shared_Reset_Target t;
 
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+        t = RESET_ARCHIVER;
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+        t = RESET_BGWRITER;
     else if (strcmp(target, "wal") == 0)
-        msg.m_resettarget = RESET_WAL;
+        t = RESET_WAL;
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\", \"bgwriter\" or \"wal\".")));
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
+    /* Reset statistics for the cluster. */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+    switch (t)
+    {
+        case RESET_ARCHIVER:
+            pgstat_copy_global_stats(&StatsShmem->archiver_reset_offset,
+                                     &StatsShmem->archiver_stats,
+                                     sizeof(PgStat_Archiver),
+                                     &StatsShmem->archiver_changecount);
+            StatsShmem->archiver_reset_offset.stat_reset_timestamp = now;
+            cached_archiverstats_is_valid = false;
+            break;
+
+        case RESET_BGWRITER:
+            pgstat_copy_global_stats(&StatsShmem->bgwriter_reset_offset,
+                                     &StatsShmem->bgwriter_stats,
+                                     sizeof(PgStat_BgWriter),
+                                     &StatsShmem->bgwriter_changecount);
+            pgstat_copy_global_stats(&StatsShmem->checkpointer_reset_offset,
+                                     &StatsShmem->checkpointer_stats,
+                                     sizeof(PgStat_CheckPointer),
+                                     &StatsShmem->checkpointer_changecount);
+            StatsShmem->bgwriter_reset_offset.stat_reset_timestamp = now;
+            cached_bgwriterstats_is_valid = false;
+            cached_checkpointerstats_is_valid = false;
+            break;
+
+        case RESET_WAL:
+            /*
+             * Unlike the two above, WAL statistics have many writer
+             * processes and are protected by wal_stats_lock.
+             */
+            LWLockAcquire(&StatsShmem->wal_stats_lock, LW_EXCLUSIVE);
+            MemSet(&StatsShmem->wal_stats, 0, sizeof(PgStat_Wal));
+            StatsShmem->wal_stats.stat_reset_timestamp = now;
+            LWLockRelease(&StatsShmem->wal_stats_lock);
+            cached_walstats_is_valid = false;
+            break;
+    }
+
+    LWLockRelease(StatsLock);
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1418,17 +1889,37 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatEntryHeader *header;
+    PgStat_StatDBEntry *dbentry;
+    PgStatTypes stattype;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    dbentry = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, MyDatabaseId, InvalidOid, false, false,
+                       NULL);
+    Assert(dbentry);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* Set the reset timestamp for the whole database */
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->header.lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&dbentry->header.lock);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Determine the stats type of the object to reset */
+    switch (type)
+    {
+        case RESET_TABLE:
+            stattype = PGSTAT_TYPE_TABLE;
+            break;
+        case RESET_FUNCTION:
+            stattype = PGSTAT_TYPE_FUNCTION;
+    }
+
+    header = get_stat_entry(stattype, MyDatabaseId, objoid, false, false, NULL);
+
+    LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+    memset(PGSTAT_SHENT_BODY(header), 0, PGSTAT_SHENT_BODY_LEN(stattype));
+    LWLockRelease(&header->lock);
 }
 
 /* ----------
@@ -1444,15 +1935,40 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_reset_slru_counter(const char *name)
 {
-    PgStat_MsgResetslrucounter msg;
+    int i;
+    TimestampTz    ts = GetCurrentTimestamp();
+    uint32    assert_changecount PG_USED_FOR_ASSERTS_ONLY;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    if (name)
+    {
+        i = pgstat_slru_index(name);
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+        assert_changecount =
+            pg_atomic_fetch_add_u32(&StatsShmem->slru_changecount, 1);
+        Assert((assert_changecount & 1) == 0);
+        MemSet(&StatsShmem->slru_stats.entry[i], 0,
+               sizeof(PgStat_SLRUStats));
+        StatsShmem->slru_stats.entry[i].stat_reset_timestamp = ts;
+        pg_atomic_add_fetch_u32(&StatsShmem->slru_changecount, 1);
+        LWLockRelease(&StatsShmem->slru_stats.lock);
+    }
+    else
+    {
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+        assert_changecount =
+            pg_atomic_fetch_add_u32(&StatsShmem->slru_changecount, 1);
+        Assert((assert_changecount & 1) == 0);
+        for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
+        {
+            MemSet(&StatsShmem->slru_stats.entry[i], 0,
+                   sizeof(PgStat_SLRUStats));
+            StatsShmem->slru_stats.entry[i].stat_reset_timestamp = ts;
+        }
+        pg_atomic_add_fetch_u32(&StatsShmem->slru_changecount, 1);
+        LWLockRelease(&StatsShmem->slru_stats.lock);
+    }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSLRUCOUNTER);
-    msg.m_index = (name) ? pgstat_slru_index(name) : -1;
-
-    pgstat_send(&msg, sizeof(msg));
+    cached_slrustats_is_valid = false;
 }
 
 /* ----------
@@ -1468,20 +1984,19 @@ pgstat_reset_slru_counter(const char *name)
 void
 pgstat_reset_replslot_counter(const char *name)
 {
-    PgStat_MsgResetreplslotcounter msg;
+    int            startidx;
+    int            endidx;
+    int            i;
+    TimestampTz    ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
     if (name)
     {
         ReplicationSlot *slot;
-
-        /*
-         * Check if the slot exits with the given name. It is possible that by
-         * the time this message is executed the slot is dropped but at least
-         * this check will ensure that the given name is for a valid slot.
-         */
+
+        /* Check whether a slot with the given name exists. */
         LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
         slot = SearchNamedReplicationSlot(name);
         LWLockRelease(ReplicationSlotControlLock);
@@ -1499,15 +2014,36 @@ pgstat_reset_replslot_counter(const char *name)
         if (SlotIsPhysical(slot))
             return;
 
-        strlcpy(msg.m_slotname, name, NAMEDATALEN);
-        msg.clearall = false;
+        /* reset this one entry */
+        startidx = endidx = slot - ReplicationSlotCtl->replication_slots;
     }
     else
-        msg.clearall = true;
+    {
+        /* reset all existing entries */
+        startidx = 0;
+        endidx = max_replication_slots - 1;
+    }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETREPLSLOTCOUNTER);
+    ts = GetCurrentTimestamp();
+    for (i = startidx ; i <= endidx ; i++)
+    {
+        PgStat_ReplSlot *shent;
 
-    pgstat_send(&msg, sizeof(msg));
+        shent = (PgStat_ReplSlot *)
+            get_stat_entry(PGSTAT_TYPE_REPLSLOT,
+                           MyDatabaseId, i, false, false, NULL);
+
+        /* Skip non-existent entries */
+        if (!shent)
+            continue;
+
+        LWLockAcquire(&shent->header.lock, LW_EXCLUSIVE);
+        memset(&shent->spill_txns, 0,
+               offsetof(PgStat_ReplSlot, stat_reset_timestamp) -
+               offsetof(PgStat_ReplSlot, spill_txns));
+        shent->stat_reset_timestamp = ts;
+        LWLockRelease(&shent->header.lock);
+    }
 }
 
 /* ----------
@@ -1521,48 +2057,93 @@ pgstat_reset_replslot_counter(const char *name)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if activity stats is not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    /*
+     * End-of-vacuum is reported instantly. Report the start the same way for
+     * consistency. Vacuum doesn't run frequently and is a long-lasting
+     * operation so it doesn't matter if we get blocked here a little.
+     */
+    dbentry = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, dboid, InvalidOid, false, true, NULL);
 
-    pgstat_send(&msg, sizeof(msg));
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->header.lock, LW_EXCLUSIVE);
+    dbentry->last_autovac_time = ts;
+    LWLockRelease(&dbentry->header.lock);
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report about the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    PgStat_StatTabEntry *tabentry;
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    ts = GetCurrentTimestamp();
+
+    /*
+     * Unlike ordinary operations, maintenance commands run for a long time,
+     * so getting blocked briefly at the end of the work doesn't matter.
+     * Furthermore, updating the shared entry directly prevents a delayed
+     * vacuum report from overwriting stats updates made by transactions
+     * that end after this vacuum.
+     */
+    tabentry = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, tableoid, false, true, NULL);
+
+    LWLockAcquire(&tabentry->header.lock, LW_EXCLUSIVE);
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * It is quite possible that a non-aggressive VACUUM ended up skipping
+     * various pages, however, we'll zero the insert counter here regardless.
+     * It's currently used only to track when we need to perform an "insert"
+     * autovacuum, which are mainly intended to freeze newly inserted tuples.
+     * Zeroing this may just mean we'll not try to vacuum the table again
+     * until enough tuples have been inserted to trigger another insert
+     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
+     * stragglers.
+     */
+    tabentry->inserts_since_vacuum = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = ts;
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = ts;
+        tabentry->vacuum_count++;
+    }
+
+    LWLockRelease(&tabentry->header.lock);
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report about the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1573,9 +2154,11 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    PgStat_StatTabEntry *tabentry;
+    Oid        dboid = (rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId);
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1583,10 +2166,10 @@ pgstat_report_analyze(Relation rel,
      * already inserted and/or deleted rows in the target table. ANALYZE will
      * have counted such rows as live or dead respectively. Because we will
      * report our counts of such rows at transaction end, we should subtract
-     * off these counts from what we send to the collector now, else they'll
-     * be double-counted after commit.  (This approach also ensures that the
-     * collector ends up with the right numbers if we abort instead of
-     * committing.)
+     * off these counts from what we write to shared stats now, else they'll
+     * be double-counted after commit.  (This approach also ensures that the
+     * shared stats end up with the right numbers if we abort instead of
+     * committing.)
      */
     if (rel->pgstat_info != NULL)
     {
@@ -1604,137 +2187,223 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Unlike ordinary operations, maintenance commands run for a long time,
+     * so getting blocked briefly at the end of the work doesn't matter.
+     * Furthermore, updating the shared entry directly prevents a delayed
+     * analyze report from overwriting stats updates made by transactions
+     * that end after this analyze.
+     */
+    tabentry = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, RelationGetRelid(rel),
+                       false, true, NULL);
+
+    LWLockAcquire(&tabentry->header.lock, LW_EXCLUSIVE);
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    LWLockRelease(&tabentry->header.lock);
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            dbent->counts.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            dbent->counts.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            dbent->counts.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            dbent->counts.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            dbent->counts.n_conflict_startup_deadlock++;
+            break;
+    }
 }
 
+
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a deadlock detected.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_deadlocks++;
 }
 
-
-
 /* --------
- * pgstat_report_checksum_failures_in_db() -
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
  *
- *    Tell the collector about one or more checksum failures.
+ *    Report one or more checksum failures.
  * --------
  */
 void
 pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
 {
-    PgStat_MsgChecksumFailure msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
+    dbentry = get_local_dbstat_entry(dboid);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* add the reported failures to the locally accumulated count */
+    dbentry->counts.n_checksum_failures += failurecount;
 }
 
 /* --------
  * pgstat_report_checksum_failure() -
  *
- *    Tell the collector about a checksum failure.
+ *    Report a checksum failure.
  * --------
  */
 void
 pgstat_report_checksum_failure(void)
 {
-    pgstat_report_checksum_failures_in_db(MyDatabaseId, 1);
+    PgStat_StatDBEntry *dbent;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_checksum_failures++;
 }
 
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
+    if (filesize == 0)            /* Is there a case where filesize is really 0? */
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_temp_bytes += filesize; /* XXX: needs an overflow check */
+    dbent->counts.n_temp_files++;
 }
 
 /* ----------
  * pgstat_report_replslot() -
  *
- *    Tell the collector about replication slot statistics.
+ *    Report replication slot activity.
  * ----------
  */
 void
-pgstat_report_replslot(const char *slotname, int spilltxns, int spillcount,
-                       int spillbytes, int streamtxns, int streamcount, int streambytes)
+pgstat_report_replslot(const char *slotname,
+                       int spilltxns, int spillcount, int spillbytes,
+                       int streamtxns, int streamcount, int streambytes)
 {
-    PgStat_MsgReplSlot msg;
+    PgStat_ReplSlot *shent;
+    int            i;
+    bool        found;
+
+    if (!area)
+        return;
+
+    for (i = 0 ; i < max_replication_slots ; i++)
+    {
+        if (strcmp(NameStr(ReplicationSlotCtl->replication_slots[i].data.name),
+                   slotname) == 0)
+            break;
+
+    }
 
     /*
-     * Prepare and send the message
+     * The slot must have been removed, so just ignore it.  We will create the
+     * entry for a slot with this name the next time it is reported.
      */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_REPLSLOT);
-    strlcpy(msg.m_slotname, slotname, NAMEDATALEN);
-    msg.m_drop = false;
-    msg.m_spill_txns = spilltxns;
-    msg.m_spill_count = spillcount;
-    msg.m_spill_bytes = spillbytes;
-    msg.m_stream_txns = streamtxns;
-    msg.m_stream_count = streamcount;
-    msg.m_stream_bytes = streambytes;
-    pgstat_send(&msg, sizeof(PgStat_MsgReplSlot));
+    if (i == max_replication_slots)
+        return;
+
+    shent = (PgStat_ReplSlot *)
+        get_stat_entry(PGSTAT_TYPE_REPLSLOT,
+                       MyDatabaseId, i, false, true, &found);
+
+    /* Clear the counters and reset dropped when we reuse it */
+    LWLockAcquire(&shent->header.lock, LW_EXCLUSIVE);
+    if (shent->header.dropped || !found)
+    {
+        memset(&shent->spill_txns, 0,
+               sizeof(PgStat_ReplSlot) - offsetof(PgStat_ReplSlot, spill_txns));
+        strlcpy(shent->slotname, slotname, NAMEDATALEN);
+        shent->header.dropped = false;
+    }
+
+    shent->spill_txns += spilltxns;
+    shent->spill_count += spillcount;
+    shent->spill_bytes += spillbytes;
+    shent->stream_txns += streamtxns;
+    shent->stream_count += streamcount;
+    shent->stream_bytes += streambytes;
+    LWLockRelease(&shent->header.lock);
 }
 
 /* ----------
@@ -1746,55 +2415,44 @@ pgstat_report_replslot(const char *slotname, int spilltxns, int spillcount,
 void
 pgstat_report_replslot_drop(const char *slotname)
 {
-    PgStat_MsgReplSlot msg;
+    int i;
+    PgStat_ReplSlot *shent;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_REPLSLOT);
-    strlcpy(msg.m_slotname, slotname, NAMEDATALEN);
-    msg.m_drop = true;
-    pgstat_send(&msg, sizeof(PgStat_MsgReplSlot));
-}
+    Assert(area);
+    if (!area)
+        return;
 
-/* ----------
- * pgstat_ping() -
- *
- *    Send some junk data to the collector to increase traffic.
- * ----------
- */
-void
-pgstat_ping(void)
-{
-    PgStat_MsgDummy msg;
+    for (i = 0 ; i < max_replication_slots ; i++)
+    {
+        if (strcmp(NameStr(ReplicationSlotCtl->replication_slots[i].data.name),
+                   slotname) == 0)
+            break;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    /* XXX: the slot may have been removed; just ignore it. */
+    if (i == max_replication_slots)
+        return;
+
+    shent = (PgStat_ReplSlot *)
+        get_stat_entry(PGSTAT_TYPE_REPLSLOT,
+                       MyDatabaseId, i, false, false, NULL);
+
+    if (shent && !shent->header.dropped)
+    {
+        LWLockAcquire(&shent->header.lock, LW_EXCLUSIVE);
+        shent->header.dropped = true;
+        LWLockRelease(&shent->header.lock);
+    }
 }
 
 /* ----------
- * pgstat_send_inquiry() -
+ * pgstat_init_function_usage() -
  *
- *    Notify collector that we need fresh data.
+ *    Initialize function call usage data.
+ *    Called by the executor before invoking a function.
  * ----------
  */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/*
- * Initialize function call usage data.
- * Called by the executor before invoking a function.
- */
 void
 pgstat_init_function_usage(FunctionCallInfo fcinfo,
                            PgStat_FunctionCallUsage *fcu)
@@ -1809,24 +2467,9 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
         return;
     }
 
-    if (!pgStatFunctions)
-    {
-        /* First time through - initialize function stat table */
-        HASHCTL        hash_ctl;
-
-        hash_ctl.keysize = sizeof(Oid);
-        hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
-        pgStatFunctions = hash_create("Function stat entries",
-                                      PGSTAT_FUNCTION_HASH_SIZE,
-                                      &hash_ctl,
-                                      HASH_ELEM | HASH_BLOBS);
-    }
-
-    /* Get the stats entry for this function, create if necessary */
-    htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
-                          HASH_ENTER, &found);
-    if (!found)
-        MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
+    htabent = (PgStat_BackendFunctionEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             fcinfo->flinfo->fn_oid, true, &found);
 
     fcu->fs = &htabent->f_counts;
 
@@ -1840,31 +2483,37 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
     INSTR_TIME_SET_CURRENT(fcu->f_start);
 }
 
-/*
- * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
- *        for specified function
+/* ----------
+ * find_funcstat_entry() -
  *
- * If no entry, return NULL, don't create a new one
+ *  Find the existing PgStat_BackendFunctionEntry for the specified function.
+ *
+ *  If there is no entry, return NULL without creating a new one.
+ * ----------
  */
 PgStat_BackendFunctionEntry *
 find_funcstat_entry(Oid func_id)
 {
-    if (pgStatFunctions == NULL)
-        return NULL;
+    PgStat_BackendFunctionEntry *ent;
 
-    return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
-                                                       (void *) &func_id,
-                                                       HASH_FIND, NULL);
+    ent = (PgStat_BackendFunctionEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             func_id, false, NULL);
+
+    return ent;
 }
 
-/*
- * Calculate function call usage and update stat counters.
- * Called by the executor after invoking a function.
+/* ----------
+ * pgstat_end_function_usage() -
  *
- * In the case of a set-returning function that runs in value-per-call mode,
- * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
- * calls for what the user considers a single call of the function.  The
- * finalize flag should be TRUE on the last call.
+ *  Calculate function call usage and update stat counters.
+ *  Called by the executor after invoking a function.
+ *
+ *  In the case of a set-returning function that runs in value-per-call mode,
+ *  we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
+ *  calls for what the user considers a single call of the function.  The
+ *  finalize flag should be TRUE on the last call.
+ * ----------
  */
 void
 pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
@@ -1905,9 +2554,6 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
         fs->f_numcalls++;
     fs->f_total_time = f_total;
     INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
 }
 
 
@@ -1919,8 +2565,7 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
  *
  *    We assume that a relcache entry's pgstat_info field is zeroed by
  *    relcache.c when the relcache entry is made; thereafter it is long-lived
- *    data.  We can avoid repeated searches of the TabStatus arrays when the
- *    same relation is touched repeatedly within a transaction.
+ *    data.
  * ----------
  */
 void
@@ -1936,7 +2581,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -1947,120 +2593,60 @@ pgstat_initstats(Relation rel)
      * If we already set up this relation in the current transaction, nothing
      * to do.
      */
-    if (rel->pgstat_info != NULL &&
-        rel->pgstat_info->t_id == rel_id)
+    if (rel->pgstat_info != NULL)
         return;
 
     /* Else find or make the PgStat_TableStatus entry, and update link */
-    rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+    rel->pgstat_info = get_local_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+
+    /* don't allow linking a stats entry to multiple relcache entries */
+    Assert(rel->pgstat_info->relation == NULL);
+    /* mark this relation as the owner */
+    rel->pgstat_info->relation = rel;
 }
 
 /*
- * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
+ * pgstat_delinkstats() -
+ *
+ *  Break the mutual link between a relcache entry and a local stats entry.
+ *  This must always be called when one end of the link is removed.
  */
-static PgStat_TableStatus *
-get_tabstat_entry(Oid rel_id, bool isshared)
+void
+pgstat_delinkstats(Relation rel)
 {
-    TabStatHashEntry *hash_entry;
-    PgStat_TableStatus *entry;
-    TabStatusArray *tsa;
-    bool        found;
-
-    /*
-     * Create hash table if we don't have it already.
-     */
-    if (pgStatTabHash == NULL)
+    /* remove the link to stats info if any */
+    if (rel && rel->pgstat_info)
     {
-        HASHCTL        ctl;
-
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
+        /* link sanity check */
+        Assert(rel->pgstat_info->relation == rel);
+        rel->pgstat_info->relation = NULL;
+        rel->pgstat_info = NULL;
     }
-
-    /*
-     * Find an entry or create a new one.
-     */
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
-    if (!found)
-    {
-        /* initialize new entry with null pointer */
-        hash_entry->tsa_entry = NULL;
-    }
-
-    /*
-     * If entry is already valid, we're done.
-     */
-    if (hash_entry->tsa_entry)
-        return hash_entry->tsa_entry;
-
-    /*
-     * Locate the first pgStatTabList entry with free space, making a new list
-     * entry if needed.  Note that we could get an OOM failure here, but if so
-     * we have left the hashtable and the list in a consistent state.
-     */
-    if (pgStatTabList == NULL)
-    {
-        /* Set up first pgStatTabList entry */
-        pgStatTabList = (TabStatusArray *)
-            MemoryContextAllocZero(TopMemoryContext,
-                                   sizeof(TabStatusArray));
-    }
-
-    tsa = pgStatTabList;
-    while (tsa->tsa_used >= TABSTAT_QUANTUM)
-    {
-        if (tsa->tsa_next == NULL)
-            tsa->tsa_next = (TabStatusArray *)
-                MemoryContextAllocZero(TopMemoryContext,
-                                       sizeof(TabStatusArray));
-        tsa = tsa->tsa_next;
-    }
-
-    /*
-     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
-     * the entry was already zeroed, either at creation or after last use.
-     */
-    entry = &tsa->tsa_entries[tsa->tsa_used++];
-    entry->t_id = rel_id;
-    entry->t_shared = isshared;
-
-    /*
-     * Now we can fill the entry in pgStatTabHash.
-     */
-    hash_entry->tsa_entry = entry;
-
-    return entry;
 }
 
 /*
  * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_TableStatus entry for rel_id in the current
+ *  database.  If not found, look it up among the shared tables.
  *
- * Note: if we got an error in the most recent execution of pgstat_report_stat,
- * it's possible that an entry exists but there's no hashtable entry for it.
- * That's okay, we'll treat this case as "doesn't exist".
+ *  If no entry is found, return NULL; don't create a new one.
+ * ----------
  */
 PgStat_TableStatus *
 find_tabstat_entry(Oid rel_id)
 {
-    TabStatHashEntry *hash_entry;
+    PgStat_TableStatus *ent;
 
-    /* If hashtable doesn't exist, there are no entries at all */
-    if (!pgStatTabHash)
-        return NULL;
+    ent = (PgStat_TableStatus *)
+        get_local_stat_entry(PGSTAT_TYPE_TABLE, MyDatabaseId, rel_id,
+                             false, NULL);
+    if (!ent)
+        ent = (PgStat_TableStatus *)
+            get_local_stat_entry(PGSTAT_TYPE_TABLE, InvalidOid, rel_id,
+                                 false, NULL);
 
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
-    if (!hash_entry)
-        return NULL;
-
-    /* Note that this step could also return NULL, but that's correct */
-    return hash_entry->tsa_entry;
+    return ent;
 }
 
 /*
@@ -2475,8 +3061,6 @@ AtPrepare_PgStat(void)
             record.inserted_pre_trunc = trans->inserted_pre_trunc;
             record.updated_pre_trunc = trans->updated_pre_trunc;
             record.deleted_pre_trunc = trans->deleted_pre_trunc;
-            record.t_id = tabstat->t_id;
-            record.t_shared = tabstat->t_shared;
             record.t_truncated = trans->truncated;
 
             RegisterTwoPhaseRecord(TWOPHASE_RM_PGSTAT_ID, 0,
@@ -2491,8 +3075,8 @@ AtPrepare_PgStat(void)
  *
  * All we need do here is unlink the transaction stats state from the
  * nontransactional state.  The nontransactional action counts will be
- * reported to the stats collector immediately, while the effects on live
- * and dead tuple counts are preserved in the 2PC state file.
+ * reported to the activity stats facility immediately, while the effects on
+ * live and dead tuple counts are preserved in the 2PC state file.
  *
  * Note: AtEOXact_PgStat is not called during PREPARE.
  */
@@ -2537,7 +3121,7 @@ pgstat_twophase_postcommit(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, commit case */
     pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
@@ -2573,7 +3157,7 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, abort case */
     if (rec->t_truncated)
@@ -2593,85 +3177,138 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    Find the database stats entry for the given database on a backend.
+ *
+ *    The returned entry is stored in static memory, so its content is valid
+ *    only until the next call of this function for a different database.
  * ----------
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    PgStat_StatDBEntry *shent;
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for InvalidOid */
+    Assert(dbid != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_dbent_key.databaseid == dbid)
+        return &cached_dbent;
+
+    shent = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid, true, false, NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_dbent, shent, sizeof(PgStat_StatDBEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_dbent_key.databaseid = dbid;
+
+    return &cached_dbent;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one table or NULL. NULL doesn't mean
+ *    the activity statistics for one table or NULL. NULL doesn't mean
  *    that the table doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    activity statistics facility, so the caller is better off reporting
+ *    ZERO instead.
  * ----------
  */
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
-    PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_extended(false, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
-
-    return NULL;
+    tabentry = pgstat_fetch_stat_tabentry_extended(true, relid);
+    return tabentry;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_tabentry_extended() -
+ *
+ *    Find the table stats entry on a backend.  The returned entry is stored
+ *    in static memory, so its content is valid only until the next call of
+ *    this function for a different table.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_extended(bool shared, Oid reloid)
+{
+    PgStat_StatTabEntry *shent;
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for InvalidOid */
+    Assert(reloid != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_tabent_key.databaseid == dboid &&
+        cached_tabent_key.objectid == reloid)
+        return &cached_tabent;
+
+    shent = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, reloid, true, false, NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_tabent, shent, sizeof(PgStat_StatTabEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_tabent_key.databaseid = dboid;
+    cached_tabent_key.objectid = reloid;
+
+    return &cached_tabent;
+}
+
+
+/* ----------
+ * pgstat_copy_index_counters() -
+ *
+ *    Support function for index swapping.  Copy a portion of the counters of
+ *    the relation to the specified place.
+ * ----------
+ */
+void
+pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst)
+{
+    PgStat_StatTabEntry *tabentry;
+
+    /* No point fetching tabentry when dst is NULL */
+    if (!dst)
+        return;
+
+    tabentry = pgstat_fetch_stat_tabentry(relid);
+
+    if (!tabentry)
+        return;
+
+    dst->t_counts.t_numscans = tabentry->numscans;
+    dst->t_counts.t_tuples_returned = tabentry->tuples_returned;
+    dst->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
+    dst->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
+    dst->t_counts.t_blocks_hit = tabentry->blocks_hit;
 }
 
 
@@ -2680,30 +3317,46 @@ pgstat_fetch_stat_tabentry(Oid relid)
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
  *    the collected statistics for one function or NULL.
+ *
+ *    The returned entry is stored in static memory, so its content is valid
+ *    only until the next call of this function for a different function ID.
  * ----------
  */
 PgStat_StatFuncEntry *
 pgstat_fetch_stat_funcentry(Oid func_id)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry = NULL;
-
-    /* load the stats file if needed */
-    backend_read_statsfile();
-
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
-
-    return funcentry;
+    PgStat_StatFuncEntry *shent;
+    Oid    dboid = MyDatabaseId;
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for InvalidOid */
+    Assert(func_id != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_funcent_key.databaseid == dboid &&
+        cached_funcent_key.objectid == func_id)
+        return &cached_funcent;
+
+    shent = (PgStat_StatFuncEntry *)
+        get_stat_entry(PGSTAT_TYPE_FUNCTION, dboid, func_id, true, false,
+                       NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_funcent, shent, sizeof(PgStat_StatFuncEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_funcent_key.databaseid = dboid;
+    cached_funcent_key.objectid = func_id;
+
+    return &cached_funcent;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_beentry() -
  *
@@ -2763,53 +3416,160 @@ pgstat_fetch_stat_numbackends(void)
     return localNumBackends;
 }
 
+/*
+ * ---------
+ * pgstat_get_stat_timestamp() -
+ *
+ *  Returns the last update timestamp of global statistics.
+ */
+TimestampTz
+pgstat_get_stat_timestamp(void)
+{
+    return (TimestampTz) pg_atomic_read_u64(&StatsShmem->stats_timestamp);
+}
+
 /*
  * ---------
  * pgstat_fetch_stat_archiver() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the archiver statistics struct.
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_ArchiverStats *
+PgStat_Archiver *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    PgStat_Archiver reset;
+    PgStat_Archiver *reset_shared = &StatsShmem->archiver_reset_offset;
+    PgStat_Archiver *shared = &StatsShmem->archiver_stats;
+    PgStat_Archiver *cached = &cached_archiverstats;
 
-    return &archiverStats;
+    pgstat_copy_global_stats(cached, shared, sizeof(PgStat_Archiver),
+                             &StatsShmem->archiver_changecount);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&reset, reset_shared, sizeof(PgStat_Archiver));
+    LWLockRelease(StatsLock);
+
+    /* compensate for the reset offsets */
+    if (cached->archived_count == reset.archived_count)
+    {
+        cached->last_archived_wal[0] = 0;
+        cached->last_archived_timestamp = 0;
+    }
+    cached->archived_count -= reset.archived_count;
+
+    if (cached->failed_count == reset.failed_count)
+    {
+        cached->last_failed_wal[0] = 0;
+        cached->last_failed_timestamp = 0;
+    }
+    cached->failed_count -= reset.failed_count;
+
+    cached->stat_reset_timestamp = reset.stat_reset_timestamp;
+
+    cached_archiverstats_is_valid = true;
+
+    return &cached_archiverstats;
 }
 
 
 /*
  * ---------
- * pgstat_fetch_global() -
+ * pgstat_fetch_stat_bgwriter() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the global statistics struct.
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_GlobalStats *
-pgstat_fetch_global(void)
+PgStat_BgWriter *
+pgstat_fetch_stat_bgwriter(void)
 {
-    backend_read_statsfile();
+    PgStat_BgWriter reset;
+    PgStat_BgWriter *reset_shared = &StatsShmem->bgwriter_reset_offset;
+    PgStat_BgWriter *shared = &StatsShmem->bgwriter_stats;
+    PgStat_BgWriter *cached = &cached_bgwriterstats;
+
+    pgstat_copy_global_stats(cached, shared, sizeof(PgStat_BgWriter),
+                             &StatsShmem->bgwriter_changecount);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&reset, reset_shared, sizeof(PgStat_BgWriter));
+    LWLockRelease(StatsLock);
+
+    /* compensate for the reset offsets */
+    cached->buf_written_clean -= reset.buf_written_clean;
+    cached->maxwritten_clean  -= reset.maxwritten_clean;
+    cached->buf_alloc          -= reset.buf_alloc;
+    cached->stat_reset_timestamp = reset.stat_reset_timestamp;
+
+    cached_bgwriterstats_is_valid = true;
+
+    return &cached_bgwriterstats;
+}
+
+/*
+ * ---------
+ * pgstat_fetch_stat_checkpointer() -
+ *
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
+ * ---------
+ */
+PgStat_CheckPointer *
+pgstat_fetch_stat_checkpointer(void)
+{
+    PgStat_CheckPointer reset;
+    PgStat_CheckPointer *reset_shared = &StatsShmem->checkpointer_reset_offset;
+    PgStat_CheckPointer *shared = &StatsShmem->checkpointer_stats;
+    PgStat_CheckPointer *cached = &cached_checkpointerstats;
+
+    pgstat_copy_global_stats(cached, shared, sizeof(PgStat_CheckPointer),
+                             &StatsShmem->checkpointer_changecount);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&reset, reset_shared, sizeof(PgStat_CheckPointer));
+    LWLockRelease(StatsLock);
+
+    /* compensate for the reset offsets */
+    cached->timed_checkpoints       -= reset.timed_checkpoints;
+    cached->requested_checkpoints   -= reset.requested_checkpoints;
+    cached->buf_written_checkpoints -= reset.buf_written_checkpoints;
+    cached->buf_written_backend     -= reset.buf_written_backend;
+    cached->buf_fsync_backend       -= reset.buf_fsync_backend;
+    cached->checkpoint_write_time   -= reset.checkpoint_write_time;
+    cached->checkpoint_sync_time    -= reset.checkpoint_sync_time;
+
+    cached_checkpointerstats_is_valid = true;
 
-    return &globalStats;
+    return &cached_checkpointerstats;
 }
 
 /*
  * ---------
  * pgstat_fetch_stat_wal() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the WAL statistics struct.
+ *    Support function for the SQL-callable pgstat* functions. The returned entry
+ *  is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_WalStats *
+PgStat_Wal *
 pgstat_fetch_stat_wal(void)
 {
-    backend_read_statsfile();
+    if (!cached_walstats_is_valid)
+    {
+        LWLockAcquire(StatsLock, LW_SHARED);
+        memcpy(&cached_walstats, &StatsShmem->wal_stats, sizeof(PgStat_Wal));
+        LWLockRelease(StatsLock);
+    }
 
-    return &walStats;
+    cached_walstats_is_valid = true;
+
+    return &cached_walstats;
 }
 
 /*
@@ -2823,9 +3583,27 @@ pgstat_fetch_stat_wal(void)
 PgStat_SLRUStats *
 pgstat_fetch_slru(void)
 {
-    backend_read_statsfile();
+    size_t size = sizeof(PgStat_SLRUStats) * SLRU_NUM_ELEMENTS;
 
-    return slruStats;
+    for (;;)
+    {
+        uint32 before_count;
+        uint32 after_count;
+
+        pg_read_barrier();
+        before_count = pg_atomic_read_u32(&StatsShmem->slru_changecount);
+        memcpy(&cached_slrustats, &StatsShmem->slru_stats, size);
+        after_count = pg_atomic_read_u32(&StatsShmem->slru_changecount);
+
+        if (before_count == after_count && (before_count & 1) == 0)
+            break;
+
+        CHECK_FOR_INTERRUPTS();
+    }
+
+    cached_slrustats_is_valid = true;
+
+    return &cached_slrustats;
 }
 
 /*
@@ -2837,13 +3615,41 @@ pgstat_fetch_slru(void)
  *    number of entries in nslots_p.
  * ---------
  */
-PgStat_ReplSlotStats *
+PgStat_ReplSlot *
 pgstat_fetch_replslot(int *nslots_p)
 {
-    backend_read_statsfile();
 
-    *nslots_p = nReplSlotStats;
-    return replSlotStats;
+    if (cached_replslotstats == NULL)
+    {
+        cached_replslotstats = (PgStat_ReplSlot *)
+            MemoryContextAlloc(pgStatCacheContext,
+                               sizeof(PgStat_ReplSlot) * max_replication_slots);
+    }
+
+    if (n_cached_replslotstats < 0)
+    {
+        int n = 0;
+        int i;
+
+        for (i = 0; i < max_replication_slots; i++)
+        {
+            PgStat_ReplSlot *shent = (PgStat_ReplSlot *)
+                get_stat_entry(PGSTAT_TYPE_REPLSLOT, MyDatabaseId, i,
+                               false, false, NULL);
+            if (shent && !shent->header.dropped)
+            {
+                memcpy(cached_replslotstats[n++].slotname,
+                       shent->slotname,
+                       sizeof(PgStat_ReplSlot) -
+                       offsetof(PgStat_ReplSlot, slotname));
+            }
+        }
+
+        n_cached_replslotstats = n;
+    }
+
+    *nslots_p = n_cached_replslotstats;
+    return cached_replslotstats;
 }
 
 /* ------------------------------------------------------------
@@ -3068,8 +3874,8 @@ pgstat_initialize(void)
      */
     prevWalUsage = pgWalUsage;
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* this needs to be called before dsm shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -3246,12 +4052,15 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+    /* attach shared database stats area */
+    attach_shared_stats();
 }
 
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
- * Flush any remaining statistics counts out to the collector.
+ * Flush any remaining statistics counts out to shared stats.
  * Without this, operations triggered during backend exit (such as
  * temp table deletions) won't be counted.
  *
@@ -3264,13 +4073,22 @@ pgstat_beshutdown_hook(int code, Datum arg)
 
     /*
      * If we got as far as discovering our own database ID, we can report what
-     * we did to the collector.  Otherwise, we'd be sending an invalid
+     * we did to the shared stats.  Otherwise, we'd be sending an invalid
      * database ID, so forget it.  (This means that accesses to pg_database
      * during failed backend starts might never get counted.)
      */
     if (OidIsValid(MyDatabaseId))
         pgstat_report_stat(true);
 
+    /*
+     * We need to clean up temporary slots before detaching shared statistics
+     * so that the statistics for temporary slots are properly removed.
+     */
+    if (MyReplicationSlot != NULL)
+        ReplicationSlotRelease();
+
+    ReplicationSlotCleanup();
+
     /*
      * Clear my status entry, following the protocol of bumping st_changecount
      * before and after.  We use a volatile pointer here to ensure the
@@ -3281,6 +4099,8 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    detach_shared_stats(true);
 }
 
 
@@ -3541,7 +4361,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3836,8 +4657,8 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
+        case WAIT_EVENT_READING_STATS_FILE:
+            event_name = "ReadingStatsFile";
             break;
         case WAIT_EVENT_RECOVERY_WAL_STREAM:
             event_name = "RecoveryWalStream";
@@ -4491,94 +5312,80 @@ pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
 
 
 /* ----------
- * pgstat_setheader() -
+ * pgstat_report_archiver() -
  *
- *        Set common header fields in a statistics message
+ *        Report archiver statistics
  * ----------
  */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
+void
+pgstat_report_archiver(const char *xlog, bool failed)
 {
-    int            rc;
+    TimestampTz now = GetCurrentTimestamp();
+    uint32    before_count PG_USED_FOR_ASSERTS_ONLY;
+    uint32    after_count PG_USED_FOR_ASSERTS_ONLY;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
 
-    ((PgStat_MsgHdr *) msg)->m_size = len;
+    START_CRIT_SECTION();
+    before_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->archiver_changecount, 1);
+    Assert((before_count & 1) == 0);
 
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
+    if (failed)
     {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
- *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
- * ----------
- */
-void
-pgstat_send_archiver(const char *xlog, bool failed)
-{
-    PgStat_MsgArchiver msg;
-
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    strlcpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+        ++StatsShmem->archiver_stats.failed_count;
+        strlcpy(StatsShmem->archiver_stats.last_failed_wal, xlog,
+                sizeof(StatsShmem->archiver_stats.last_failed_wal));
+        StatsShmem->archiver_stats.last_failed_timestamp = now;
+    }
+    else
+    {
+        ++StatsShmem->archiver_stats.archived_count;
+        strlcpy(StatsShmem->archiver_stats.last_archived_wal, xlog,
+                sizeof(StatsShmem->archiver_stats.last_archived_wal));
+        StatsShmem->archiver_stats.last_archived_timestamp = now;
+    }
+
+    after_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->archiver_changecount, 1);
+    Assert(after_count == before_count + 1);
+    END_CRIT_SECTION();
 }
 
 /* ----------
- * pgstat_send_bgwriter() -
+ * pgstat_report_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
-pgstat_send_bgwriter(void)
+pgstat_report_bgwriter(void)
 {
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+    PgStat_BgWriter *s = &StatsShmem->bgwriter_stats;
+    PgStat_BgWriter *l = &BgWriterStats;
+    uint32    before_count PG_USED_FOR_ASSERTS_ONLY;
+    uint32    after_count PG_USED_FOR_ASSERTS_ONLY;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid taking a lock for completely empty stats.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    START_CRIT_SECTION();
+    before_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->bgwriter_changecount, 1);
+    Assert((before_count & 1) == 0);
+
+    s->buf_written_clean += l->buf_written_clean;
+    s->maxwritten_clean += l->maxwritten_clean;
+    s->buf_alloc += l->buf_alloc;
+
+    after_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->bgwriter_changecount, 1);
+    Assert(after_count == before_count + 1);
+    END_CRIT_SECTION();
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4587,537 +5394,136 @@ pgstat_send_bgwriter(void)
 }
 
 /* ----------
- * pgstat_send_wal() -
+ * pgstat_report_checkpointer() -
  *
- *        Send WAL statistics to the collector
+ *        Report checkpointer statistics
  * ----------
  */
 void
-pgstat_send_wal(void)
+pgstat_report_checkpointer(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgWal all_zeroes;
-
-    WalUsage    walusage;
-
-    /*
-     * Calculate how much WAL usage counters are increased by substracting the
-     * previous counters from the current ones. Fill the results in WAL stats
-     * message.
-     */
-    MemSet(&walusage, 0, sizeof(WalUsage));
-    WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);
-
-    WalStats.m_wal_records = walusage.wal_records;
-    WalStats.m_wal_fpi = walusage.wal_fpi;
-    WalStats.m_wal_bytes = walusage.wal_bytes;
+    static const PgStat_CheckPointer all_zeroes;
+    PgStat_CheckPointer *s = &StatsShmem->checkpointer_stats;
+    PgStat_CheckPointer *l = &CheckPointerStats;
+    uint32    before_count PG_USED_FOR_ASSERTS_ONLY;
+    uint32    after_count PG_USED_FOR_ASSERTS_ONLY;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid taking a lock for completely empty stats.
      */
-    if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
+    if (memcmp(&CheckPointerStats, &all_zeroes,
+               sizeof(PgStat_CheckPointer)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&WalStats.m_hdr, PGSTAT_MTYPE_WAL);
-    pgstat_send(&WalStats, sizeof(WalStats));
+    START_CRIT_SECTION();
+    before_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->checkpointer_changecount, 1);
+    Assert((before_count & 1) == 0);
 
-    /*
-     * Save the current counters for the subsequent calculation of WAL usage.
-     */
-    prevWalUsage = pgWalUsage;
+    s->timed_checkpoints += l->timed_checkpoints;
+    s->requested_checkpoints += l->requested_checkpoints;
+    s->checkpoint_write_time += l->checkpoint_write_time;
+    s->checkpoint_sync_time += l->checkpoint_sync_time;
+    s->buf_written_checkpoints += l->buf_written_checkpoints;
+    s->buf_written_backend += l->buf_written_backend;
+    s->buf_fsync_backend += l->buf_fsync_backend;
+
+    after_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->checkpointer_changecount, 1);
+    Assert(after_count == before_count + 1);
+    END_CRIT_SECTION();
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
      */
-    MemSet(&WalStats, 0, sizeof(WalStats));
+    MemSet(&CheckPointerStats, 0, sizeof(CheckPointerStats));
 }
 
 /* ----------
- * pgstat_send_slru() -
+ * pgstat_report_wal() -
  *
- *        Send SLRU statistics to the collector
+ *        Report WAL statistics
  * ----------
  */
-static void
-pgstat_send_slru(void)
+void
+pgstat_report_wal(void)
 {
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgSLRU all_zeroes;
-
-    for (int i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /*
-         * This function can be called even if nothing at all has happened. In
-         * this case, avoid sending a completely empty message to the stats
-         * collector.
-         */
-        if (memcmp(&SLRUStats[i], &all_zeroes, sizeof(PgStat_MsgSLRU)) == 0)
-            continue;
-
-        /* set the SLRU type before each send */
-        SLRUStats[i].m_index = i;
-
-        /*
-         * Prepare and send the message
-         */
-        pgstat_setheader(&SLRUStats[i].m_hdr, PGSTAT_MTYPE_SLRU);
-        pgstat_send(&SLRUStats[i], sizeof(PgStat_MsgSLRU));
-
-        /*
-         * Clear out the statistics buffer, so it can be re-used.
-         */
-        MemSet(&SLRUStats[i], 0, sizeof(PgStat_MsgSLRU));
-    }
+    flush_walstat(false);
 }
 
-
 /* ----------
- * PgstatCollectorMain() -
+ * get_local_dbstat_entry() -
  *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-    WaitEvent    event;
-    WaitEventSet *wes;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, SignalHandlerForConfigReload);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    MyBackendType = B_STATS_COLLECTOR;
-    init_ps_display(NULL);
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /* Prepare to wait for our latch or data in our socket. */
-    wes = CreateWaitEventSet(CurrentMemoryContext, 3);
-    AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
-    AddWaitEventToSet(wes, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL, NULL);
-    AddWaitEventToSet(wes, WL_SOCKET_READABLE, pgStatSock, NULL, NULL);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do check ConfigReloadPending
-     * inside the inner loop, which means that such interrupts will get
-     * serviced but the latch won't get cleared until next time there is a
-     * break in the action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (ShutdownRequestPending)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * ShutdownRequestPending becomes set.
-         */
-        while (!ShutdownRequestPending)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (ConfigReloadPending)
-            {
-                ConfigReloadPending = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSLRUCOUNTER:
-                    pgstat_recv_resetslrucounter(&msg.msg_resetslrucounter,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETREPLSLOTCOUNTER:
-                    pgstat_recv_resetreplslotcounter(&msg.msg_resetreplslotcounter,
-                                                     len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_WAL:
-                    pgstat_recv_wal(&msg.msg_wal, len);
-                    break;
-
-                case PGSTAT_MTYPE_SLRU:
-                    pgstat_recv_slru(&msg.msg_slru, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_REPLSLOT:
-                    pgstat_recv_replslot(&msg.msg_replslot, len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitEventSetWait(wes, -1L, &event, 1, WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitEventSetWait(wes, 2 * 1000L /* msec */ , &event, 1,
-                              WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr == 1 && event.events == WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    FreeWaitEventSet(wes);
-
-    exit(0);
-}
-
-/*
- * Subroutine to clear stats in a database entry
- *
- * Tables and functions hashes are initialized to empty.
- */
-static void
-reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
-{
-    HASHCTL        hash_ctl;
-
-    dbentry->n_xact_commit = 0;
-    dbentry->n_xact_rollback = 0;
-    dbentry->n_blocks_fetched = 0;
-    dbentry->n_blocks_hit = 0;
-    dbentry->n_tuples_returned = 0;
-    dbentry->n_tuples_fetched = 0;
-    dbentry->n_tuples_inserted = 0;
-    dbentry->n_tuples_updated = 0;
-    dbentry->n_tuples_deleted = 0;
-    dbentry->last_autovac_time = 0;
-    dbentry->n_conflict_tablespace = 0;
-    dbentry->n_conflict_lock = 0;
-    dbentry->n_conflict_snapshot = 0;
-    dbentry->n_conflict_bufferpin = 0;
-    dbentry->n_conflict_startup_deadlock = 0;
-    dbentry->n_temp_files = 0;
-    dbentry->n_temp_bytes = 0;
-    dbentry->n_deadlocks = 0;
-    dbentry->n_checksum_failures = 0;
-    dbentry->last_checksum_failure = 0;
-    dbentry->n_block_read_time = 0;
-    dbentry->n_block_write_time = 0;
-
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
-}
-
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+ *  Find or create a local PgStat_StatDBEntry entry for dbid.  A new entry is
+ *  created and initialized if one does not exist.
  */
 static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
+get_local_dbstat_entry(Oid dbid)
 {
-    PgStat_StatDBEntry *result;
+    PgStat_StatDBEntry *dbentry;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
-        return NULL;
 
     /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
+     * Find an entry or create a new one.
      */
-    if (!found)
-        reset_dbentry_counters(result);
+    dbentry = (PgStat_StatDBEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid,
+                             true, &found);
 
-    return result;
+    return dbentry;
 }
 
-
-/*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+/* ----------
+ * get_local_tabstat_entry() -
+ *  Find or create a PgStat_TableStatus entry for rel. A new entry is created
+ *  and initialized if one does not exist.
+ * ----------
  */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+static PgStat_TableStatus *
+get_local_tabstat_entry(Oid rel_id, bool isshared)
 {
-    PgStat_StatTabEntry *result;
+    PgStat_TableStatus *tabentry;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
 
-    /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
+    tabentry = (PgStat_TableStatus *)
+        get_local_stat_entry(PGSTAT_TYPE_TABLE,
+                             isshared ? InvalidOid : MyDatabaseId,
+                             rel_id, true, &found);
 
-    if (!create && !found)
-        return NULL;
-
-    /* If not found, initialize the new one. */
-    if (!found)
-    {
-        result->numscans = 0;
-        result->tuples_returned = 0;
-        result->tuples_fetched = 0;
-        result->tuples_inserted = 0;
-        result->tuples_updated = 0;
-        result->tuples_deleted = 0;
-        result->tuples_hot_updated = 0;
-        result->n_live_tuples = 0;
-        result->n_dead_tuples = 0;
-        result->changes_since_analyze = 0;
-        result->inserts_since_vacuum = 0;
-        result->blocks_fetched = 0;
-        result->blocks_hit = 0;
-        result->vacuum_timestamp = 0;
-        result->vacuum_count = 0;
-        result->autovac_vacuum_timestamp = 0;
-        result->autovac_vacuum_count = 0;
-        result->analyze_timestamp = 0;
-        result->analyze_count = 0;
-        result->autovac_analyze_timestamp = 0;
-        result->autovac_analyze_count = 0;
-    }
-
-    return result;
+    return tabentry;
 }
 
-
 /* ----------
- * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
+ * pgstat_write_statsfile() -
+ *        Write the global statistics file, as well as DB files.
  *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ * This function is called in the last process that is accessing the shared
+ * stats so locking is not required.
  * ----------
  */
 static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+pgstat_write_statsfile(void)
 {
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
-    int            i;
+    dshash_seq_status hstat;
+    PgStatHashEntry *ps;
+
+    /* Stats are not initialized yet; just return. */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
+
+    /* this is the last process that is accessing the shared stats */
+#ifdef USE_ASSERT_CHECKING
+    LWLockAcquire(StatsLock, LW_SHARED);
+    Assert(StatsShmem->refcount == 0);
+    LWLockRelease(StatsLock);
+#endif
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -5137,7 +5543,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    pg_atomic_write_u64(&StatsShmem->stats_timestamp, GetCurrentTimestamp());
 
     /*
      * Write the file header --- currently just a format ID.
@@ -5147,200 +5553,72 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Write global stats struct
+     * Write bgwriter global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(&StatsShmem->bgwriter_stats, sizeof(PgStat_BgWriter), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Write archiver stats struct
+     * Write checkpointer global stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+    rc = fwrite(&StatsShmem->checkpointer_stats, sizeof(PgStat_CheckPointer), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Write WAL stats struct
+     * Write archiver global stats struct
      */
-    rc = fwrite(&walStats, sizeof(walStats), 1, fpout);
+    rc = fwrite(&StatsShmem->archiver_stats, sizeof(PgStat_Archiver), 1,
+                fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write WAL global stats struct
+     */
+    rc = fwrite(&StatsShmem->wal_stats, sizeof(PgStat_Wal), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write SLRU stats struct
      */
-    rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
+    rc = fwrite(&StatsShmem->slru_stats, sizeof(PgStatSharedSLRUStats), 1,
+                fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Walk through the database table.
+     * Walk through the stats entries
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((ps = dshash_seq_next(&hstat)) != NULL)
     {
-        /*
-         * Write out the table and function stats for this DB into the
-         * appropriate per-DB stat file, if required.
-         */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
+        PgStat_StatEntryHeader *shent;
+        size_t                    len;
+
+        CHECK_FOR_INTERRUPTS();
+
+        shent = (PgStat_StatEntryHeader *)dsa_get_address(area, ps->body);
+
+        /* we may have some "dropped" entries not yet removed, skip them */
+        if (shent->dropped)
+            continue;
+
+        /* Make DB's timestamp consistent with the global stats */
+        if (ps->key.type == PGSTAT_TYPE_DB)
         {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+            PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) shent;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
+            dbentry->stats_timestamp =
+                (TimestampTz) pg_atomic_read_u64(&StatsShmem->stats_timestamp);
         }
 
-        /*
-         * Write out the DB entry. We don't write the tables or functions
-         * pointers, since they're of no use to any other process.
-         */
-        fputc('D', fpout);
-        rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * Write replication slot stats struct
-     */
-    for (i = 0; i < nReplSlotStats; i++)
-    {
-        fputc('R', fpout);
-        rc = fwrite(&replSlotStats[i], sizeof(PgStat_ReplSlotStats), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
-}
-
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
-
-/* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
- * ----------
- */
-static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
-{
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpout;
-    int32        format_id;
-    Oid            dbid = dbentry->databaseid;
-    int            rc;
-    char        tmpfile[MAXPGPATH];
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database's access stats per table.
-     */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
-    {
-        fputc('T', fpout);
-        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
+        fputc('S', fpout);
+        rc = fwrite(&ps->key, sizeof(PgStatHashKey), 1, fpout);
 
-    /*
-     * Walk through the database's function stats table.
-     */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+        /* Write the entry, excluding its header part */
+        len = PGSTAT_SHENT_BODY_LEN(ps->key.type);
+        rc = fwrite(PGSTAT_SHENT_BODY(shent), len, 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_seq_term(&hstat);
 
     /*
      * No more output to be done. Close the temp file and replace the old
@@ -5374,113 +5652,63 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
 }
 
 /* ----------
- * pgstat_read_statsfiles() -
- *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ * pgstat_read_statsfile() -
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
+ *    Reads the existing activity statistics file into the shared stats hash.
  *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
+ * This function is called in the only process that is accessing the shared
+ * stats so locking is not required.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+static void
+pgstat_read_statsfile(void)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-    int            i;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    char        tag;
 
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
+    /* shouldn't be called from postmaster */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Create the DB hashtable
-     */
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    /* this is the only process that is accessing the shared stats */
+#ifdef USE_ASSERT_CHECKING
+    LWLockAcquire(StatsLock, LW_SHARED);
+    Assert(StatsShmem->refcount == 1);
+    LWLockRelease(StatsLock);
+#endif
 
-    /* Allocate the space for replication slot statistics */
-    replSlotStats = palloc0(max_replication_slots * sizeof(PgStat_ReplSlotStats));
-    nReplSlotStats = 0;
-
-    /*
-     * Clear out global, archiver, WAL and SLRU statistics so they start from
-     * zero in case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-    memset(&walStats, 0, sizeof(walStats));
-    memset(&slruStats, 0, sizeof(slruStats));
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-    walStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Set the same reset timestamp for all SLRU items too.
-     */
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-        slruStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Set the same reset timestamp for all replication slots too.
-     */
-    for (i = 0; i < max_replication_slots; i++)
-        replSlotStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    StatsShmem->bgwriter_stats.stat_reset_timestamp = GetCurrentTimestamp();
+    StatsShmem->archiver_stats.stat_reset_timestamp =
+        StatsShmem->bgwriter_stats.stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the activity statistics is not running or
+     * has not yet written the stats file the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -5489,681 +5717,150 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
-        goto done;
-    }
-
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        goto done;
-    }
-
-    /*
-     * Read WAL stats struct
-     */
-    if (fread(&walStats, 1, sizeof(walStats), fpin) != sizeof(walStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&walStats, 0, sizeof(walStats));
         goto done;
     }
 
     /*
-     * Read SLRU stats struct
+     * Read bgwriter stats struct
      */
-    if (fread(slruStats, 1, sizeof(slruStats), fpin) != sizeof(slruStats))
+    if (fread(&StatsShmem->bgwriter_stats, 1, sizeof(PgStat_BgWriter), fpin) !=
+        sizeof(PgStat_BgWriter))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&slruStats, 0, sizeof(slruStats));
+        MemSet(&StatsShmem->bgwriter_stats, 0, sizeof(PgStat_BgWriter));
         goto done;
     }
 
     /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Add to the DB hash
-                 */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
-                break;
-
-                /*
-                 * 'R'    A PgStat_ReplSlotStats struct describing a replication
-                 * slot follows.
-                 */
-            case 'R':
-                if (fread(&replSlotStats[nReplSlotStats], 1, sizeof(PgStat_ReplSlotStats), fpin)
-                    != sizeof(PgStat_ReplSlotStats))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    memset(&replSlotStats[nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
-                    goto done;
-                }
-                nReplSlotStats++;
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-
-    return dbhash;
-}
-
-
-/* ----------
- * pgstat_read_db_statsfile() -
- *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
- * ----------
- */
-static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
-{
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatTabEntry tabbuf;
-    PgStat_StatFuncEntry funcbuf;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return;
-    }
-
-    /*
-     * Verify it's of the expected format.
+     * Read checkpointer stats struct
      */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
+    if (fread(&StatsShmem->checkpointer_stats, 1, sizeof(PgStat_CheckPointer), fpin) !=
+        sizeof(PgStat_CheckPointer))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
+        MemSet(&StatsShmem->checkpointer_stats, 0, sizeof(PgStat_CheckPointer));
         goto done;
     }
 
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'T'    A PgStat_StatTabEntry follows.
-                 */
-            case 'T':
-                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
-                          fpin) != sizeof(PgStat_StatTabEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
-                break;
-
-                /*
-                 * 'F'    A PgStat_StatFuncEntry follows.
-                 */
-            case 'F':
-                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
-                          fpin) != sizeof(PgStat_StatFuncEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
-                break;
-
-                /*
-                 * 'E'    The EOF marker of a complete stats file.
-                 */
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts. The caller must
- *    rely on timestamp stored in *ts iff the function returns true.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    PgStat_WalStats myWalStats;
-    PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
-    PgStat_ReplSlotStats myReplSlotStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
     /*
      * Read archiver stats struct
      */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
+    if (fread(&StatsShmem->archiver_stats, 1, sizeof(PgStat_Archiver),
+              fpin) != sizeof(PgStat_Archiver))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
+        MemSet(&StatsShmem->archiver_stats, 0, sizeof(PgStat_Archiver));
+        goto done;
     }
 
     /*
      * Read WAL stats struct
      */
-    if (fread(&myWalStats, 1, sizeof(myWalStats), fpin) != sizeof(myWalStats))
+    if (fread(&StatsShmem->wal_stats, 1, sizeof(PgStat_Wal), fpin)
+        != sizeof(PgStat_Wal))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
+        MemSet(&StatsShmem->wal_stats, 0, sizeof(PgStat_Wal));
+        goto done;
     }
 
     /*
      * Read SLRU stats struct
      */
-    if (fread(mySLRUStats, 1, sizeof(mySLRUStats), fpin) != sizeof(mySLRUStats))
+    if (fread(&StatsShmem->slru_stats, 1, sizeof(PgStatSharedSLRUStats),
+              fpin) != sizeof(PgStatSharedSLRUStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-                /*
-                 * 'R'    A PgStat_ReplSlotStats struct describing a replication
-                 * slot follows.
-                 */
-            case 'R':
-                if (fread(&myReplSlotStats, 1, sizeof(PgStat_ReplSlotStats), fpin)
-                    != sizeof(PgStat_ReplSlotStats))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-        }
+        goto done;
     }
 
-done:
-    FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
     /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
+    while ((tag = fgetc(fpin)) == 'S')
     {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
+        PgStatHashKey        key;
+        PgStat_StatEntryHeader *p;
+        size_t                len;
 
         CHECK_FOR_INTERRUPTS();
 
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
+        if (fread(&key, 1, sizeof(key), fpin) != sizeof(key))
         {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"", statfile)));
+            goto done;
         }
 
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                ereport(LOG,
-                        (errmsg("statistics collector's time %s is later than backend local time %s",
-                                filetime, mytime)));
-                pfree(filetime);
-                pfree(mytime);
-            }
+        p = get_stat_entry(key.type, key.databaseid, key.objectid,
+                              false, true, &found);
 
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
+        /* don't allow duplicate entries */
+        if (found)
+        {
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"",
+                            statfile)));
+            goto done;
         }
 
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
+        /* Avoid overwriting header part */
+        len = PGSTAT_SHENT_BODY_LEN(key.type);
 
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
+        if (fread(PGSTAT_SHENT_BODY(p), 1, len, fpin) != len)
+        {
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"", statfile)));
+            goto done;
+        }
     }
 
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
+    if (tag != 'E')
+    {
         ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
+                (errmsg("corrupted statistics file \"%s\"",
+                        statfile)));
+        goto done;
+    }
 
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
+done:
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+
+    return;
 }
 
-
 /* ----------
  * pgstat_setup_memcxt() -
  *
- *    Create pgStatLocalContext, if not already done.
+ *    Create pgStatLocalContext if not already done.
  * ----------
  */
 static void
 pgstat_setup_memcxt(void)
 {
     if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
-}
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
 
+    if (!pgStatCacheContext)
+        pgStatCacheContext =
+            AllocSetContextCreate(CacheMemoryContext,
+                                  "Activity statistics",
+                                  ALLOCSET_SMALL_SIZES);
+}
 
 /* ----------
  * pgstat_clear_snapshot() -
@@ -6180,906 +5877,25 @@ pgstat_clear_snapshot(void)
 {
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
+    {
         MemoryContextDelete(pgStatLocalContext);
 
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            ereport(LOG,
-                    (errmsg("stats_timestamp %s is later than collector's time %s for database %u",
-                            writetime, mytime, dbentry->databaseid)));
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-                tabentry->inserts_since_vacuum = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_WAL)
-    {
-        /* Reset the WAL statistics for the cluster. */
-        memset(&walStats, 0, sizeof(walStats));
-        walStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_resetslrucounter() -
- *
- *    Reset some SLRU statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len)
-{
-    int            i;
-    TimestampTz ts = GetCurrentTimestamp();
-
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /* reset entry with the given index, or all entries (index is -1) */
-        if ((msg->m_index == -1) || (msg->m_index == i))
-        {
-            memset(&slruStats[i], 0, sizeof(slruStats[i]));
-            slruStats[i].stat_reset_timestamp = ts;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_resetreplslotcounter() -
- *
- *    Reset some replication slot statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg,
-                                 int len)
-{
-    int            i;
-    int            idx = -1;
-    TimestampTz ts;
-
-    ts = GetCurrentTimestamp();
-    if (msg->clearall)
-    {
-        for (i = 0; i < nReplSlotStats; i++)
-            pgstat_reset_replslot(i, ts);
-    }
-    else
-    {
-        /* Get the index of replication slot statistics to reset */
-        idx = pgstat_replslot_index(msg->m_slotname, false);
-
-        /*
-         * Nothing to do if the given slot entry is not found.  This could
-         * happen when the slot with the given name is removed and the
-         * corresponding statistics entry is also removed before receiving the
-         * reset message.
-         */
-        if (idx < 0)
-            return;
-
-        /* Reset the stats for the requested replication slot */
-        pgstat_reset_replslot(idx, ts);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signaling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * It is quite possible that a non-aggressive VACUUM ended up skipping
-     * various pages, however, we'll zero the insert counter here regardless.
-     * It's currently used only to track when we need to perform an "insert"
-     * autovacuum, which are mainly intended to freeze newly inserted tuples.
-     * Zeroing this may just mean we'll not try to vacuum the table again
-     * until enough tuples have been inserted to trigger another insert
-     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
-     * stragglers.
-     */
-    tabentry->inserts_since_vacuum = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_wal() -
- *
- *    Process a WAL message.
- * ----------
- */
-static void
-pgstat_recv_wal(PgStat_MsgWal *msg, int len)
-{
-    walStats.wal_records += msg->m_wal_records;
-    walStats.wal_fpi += msg->m_wal_fpi;
-    walStats.wal_bytes += msg->m_wal_bytes;
-    walStats.wal_buffers_full += msg->m_wal_buffers_full;
-}
-
-/* ----------
- * pgstat_recv_slru() -
- *
- *    Process a SLRU message.
- * ----------
- */
-static void
-pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
-{
-    slruStats[msg->m_index].blocks_zeroed += msg->m_blocks_zeroed;
-    slruStats[msg->m_index].blocks_hit += msg->m_blocks_hit;
-    slruStats[msg->m_index].blocks_read += msg->m_blocks_read;
-    slruStats[msg->m_index].blocks_written += msg->m_blocks_written;
-    slruStats[msg->m_index].blocks_exists += msg->m_blocks_exists;
-    slruStats[msg->m_index].flush += msg->m_flush;
-    slruStats[msg->m_index].truncate += msg->m_truncate;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_replslot() -
- *
- *    Process a REPLSLOT message.
- * ----------
- */
-static void
-pgstat_recv_replslot(PgStat_MsgReplSlot *msg, int len)
-{
-    int            idx;
-
-    /*
-     * Get the index of replication slot statistics.  On dropping, we don't
-     * create the new statistics.
-     */
-    idx = pgstat_replslot_index(msg->m_slotname, !msg->m_drop);
-
-    /*
-     * The slot entry is not found or there is no space to accommodate the new
-     * entry.  This could happen when the message for the creation of a slot
-     * reached before the drop message even though the actual operations
-     * happen in reverse order.  In such a case, the next update of the
-     * statistics for the same slot will create the required entry.
-     */
-    if (idx < 0)
-        return;
-
-    /* it must be a valid replication slot index */
-    Assert(idx < nReplSlotStats);
-
-    if (msg->m_drop)
-    {
-        /* Remove the replication slot statistics with the given name */
-        if (idx < nReplSlotStats - 1)
-            memcpy(&replSlotStats[idx], &replSlotStats[nReplSlotStats - 1],
-                   sizeof(PgStat_ReplSlotStats));
-        nReplSlotStats--;
-    }
-    else
-    {
-        /* Update the replication slot statistics */
-        replSlotStats[idx].spill_txns += msg->m_spill_txns;
-        replSlotStats[idx].spill_count += msg->m_spill_count;
-        replSlotStats[idx].spill_bytes += msg->m_spill_bytes;
-        replSlotStats[idx].stream_txns += msg->m_stream_txns;
-        replSlotStats[idx].stream_count += msg->m_stream_count;
-        replSlotStats[idx].stream_bytes += msg->m_stream_bytes;
-    }
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
+    }
+
+    /* Invalidate the simple cache keys */
+    cached_dbent_key = stathashkey_zero;
+    cached_tabent_key = stathashkey_zero;
+    cached_funcent_key = stathashkey_zero;
+    cached_archiverstats_is_valid = false;
+    cached_bgwriterstats_is_valid = false;
+    cached_checkpointerstats_is_valid = false;
+    cached_walstats_is_valid = false;
+    cached_slrustats_is_valid = false;
+    n_cached_replslotstats = -1;
 }
 
 /*
@@ -7126,60 +5942,6 @@ pgstat_clip_activity(const char *raw_activity)
     return activity;
 }
 
-/* ----------
- * pgstat_replslot_index
- *
- * Return the index of entry of a replication slot with the given name, or
- * -1 if the slot is not found.
- *
- * create_it tells whether to create the new slot entry if it is not found.
- * ----------
- */
-static int
-pgstat_replslot_index(const char *name, bool create_it)
-{
-    int            i;
-
-    Assert(nReplSlotStats <= max_replication_slots);
-    for (i = 0; i < nReplSlotStats; i++)
-    {
-        if (strcmp(replSlotStats[i].slotname, name) == 0)
-            return i;            /* found */
-    }
-
-    /*
-     * The slot is not found.  We don't want to register the new statistics if
-     * the list is already full or the caller didn't request.
-     */
-    if (i == max_replication_slots || !create_it)
-        return -1;
-
-    /* Register new slot */
-    memset(&replSlotStats[nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
-    strlcpy(replSlotStats[nReplSlotStats].slotname, name, NAMEDATALEN);
-
-    return nReplSlotStats++;
-}
-
-/* ----------
- * pgstat_reset_replslot
- *
- * Reset the replication slot stats at index 'i'.
- * ----------
- */
-static void
-pgstat_reset_replslot(int i, TimestampTz ts)
-{
-    /* reset only counters. Don't clear slot name */
-    replSlotStats[i].spill_txns = 0;
-    replSlotStats[i].spill_count = 0;
-    replSlotStats[i].spill_bytes = 0;
-    replSlotStats[i].stream_txns = 0;
-    replSlotStats[i].stream_count = 0;
-    replSlotStats[i].stream_bytes = 0;
-    replSlotStats[i].stat_reset_timestamp = ts;
-}
-
 /*
  * pgstat_slru_index
  *
@@ -7224,7 +5986,7 @@ pgstat_slru_name(int slru_idx)
  * Returns pointer to entry with counters for given SLRU (based on the name
  * stored in SlruCtl as lwlock tranche name).
  */
-static inline PgStat_MsgSLRU *
+static PgStat_SLRUStats *
 slru_entry(int slru_idx)
 {
     /*
@@ -7235,7 +5997,7 @@ slru_entry(int slru_idx)
 
     Assert((slru_idx >= 0) && (slru_idx < SLRU_NUM_ELEMENTS));
 
-    return &SLRUStats[slru_idx];
+    return &local_SLRUStats[slru_idx];
 }
 
 /*
@@ -7245,41 +6007,48 @@ slru_entry(int slru_idx)
 void
 pgstat_count_slru_page_zeroed(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_zeroed += 1;
+    slru_entry(slru_idx)->blocks_zeroed += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_hit(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_hit += 1;
+    slru_entry(slru_idx)->blocks_hit += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_exists(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_exists += 1;
+    slru_entry(slru_idx)->blocks_exists += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_read(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_read += 1;
+    slru_entry(slru_idx)->blocks_read += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_written(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_written += 1;
+    slru_entry(slru_idx)->blocks_written += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_flush(int slru_idx)
 {
-    slru_entry(slru_idx)->m_flush += 1;
+    slru_entry(slru_idx)->flush += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_truncate(int slru_idx)
 {
-    slru_entry(slru_idx)->m_truncate += 1;
+    slru_entry(slru_idx)->truncate += 1;
+    have_slrustats = true;
 }
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index af91c313e2..9b9c9b1c11 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -251,7 +251,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -512,7 +511,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1327,12 +1325,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1782,11 +1774,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2699,8 +2686,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -3029,8 +3014,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3097,13 +3080,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3176,22 +3152,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3653,22 +3613,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3895,8 +3839,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3920,8 +3862,7 @@ PostmasterStateMachine(void)
     {
         /*
          * PM_WAIT_DEAD_END state ends when the BackendList is entirely empty
-         * (ie, no dead_end children remain), and the archiver and stats
-         * collector are gone too.
+         * (ie, no dead_end children remain), and the archiver is gone too.
          *
          * The reason we wait for those two is to protect them against a new
          * postmaster starting conflicting subprocesses; this isn't an
@@ -3931,8 +3872,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4133,8 +4073,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5067,18 +5005,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5191,12 +5117,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6113,7 +6033,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6169,8 +6088,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6403,7 +6320,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 0f54635550..d21801cf90 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1568,8 +1568,8 @@ is_checksummed_file(const char *fullpath, const char *filename)
  *
  * If 'missing_ok' is true, will not throw an error if the file is not found.
  *
- * If dboid is anything other than InvalidOid then any checksum failures detected
- * will get reported to the stats collector.
+ * If dboid is anything other than InvalidOid then any checksum failures
+ * detected will get reported to the activity stats facility.
  *
  * Returns true if the file was successfully sent, false if 'missing_ok',
  * and the file did not exist.
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index e00c7ffc01..608818beea 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -692,14 +692,10 @@ ReplicationSlotDropPtr(ReplicationSlot *slot)
                 (errmsg("could not remove directory \"%s\"", tmppath)));
 
     /*
-     * Send a message to drop the replication slot to the stats collector.
-     * Since there is no guarantee of the order of message transfer on a UDP
-     * connection, it's possible that a message for creating a new slot
-     * reaches before a message for removing the old slot. We send the drop
-     * and create messages while holding ReplicationSlotAllocationLock to
-     * reduce that possibility. If the messages reached in reverse, we would
-     * lose one statistics update message. But the next update message will
-     * create the statistics for the replication slot.
+     * Drop the statistics entry for the replication slot.  Do this while
+     * holding ReplicationSlotAllocationLock so that we don't drop a statistics
+     * entry for another slot with the same name just created in another
+     * session.
      */
     if (SlotIsLogical(slot))
         pgstat_report_replslot_drop(NameStr(slot->data.name));
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 561c212092..517354bed2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2059,7 +2059,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                CheckPointerStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2169,7 +2169,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2359,7 +2359,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2367,7 +2367,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
 
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index f9bbe97b50..78cfe91eab 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -149,6 +149,7 @@ CreateSharedMemoryAndSemaphores(void)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -267,6 +268,7 @@ CreateSharedMemoryAndSemaphores(void)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index db7e59f8b7..e6e4c0fb04 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -177,7 +177,9 @@ static const char *const BuiltinTrancheNames[] = {
     /* LWTRANCHE_PARALLEL_APPEND: */
     "ParallelAppend",
     /* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
-    "PerXactPredicateList"
+    "PerXactPredicateList",
+    /* LWTRANCHE_STATS: */
+    "ActivityStatistics"
 };
 
 StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index 774292fd94..91bf8d5b5d 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -53,3 +53,4 @@ XactTruncationLock                    44
 # 45 was XactTruncationLock until removal of BackendRandomLock
 WrapLimitsVacuumLock                46
 NotifyQueueTailLock                    47
+StatsLock                            48
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 4dc24649df..75d1695576 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -414,8 +414,8 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
     }
 
     /*
-     * It'd be nice to tell the stats collector to forget them immediately,
-     * too. But we can't because we don't know the OIDs.
+     * It'd be nice to tell the activity stats facility to forget them
+     * immediately, too. But we can't because we don't know the OIDs.
      */
 
     /*
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 28055680aa..8b51852505 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3271,6 +3271,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3842,6 +3848,7 @@ PostgresMain(int argc, char *argv[],
     volatile bool send_ready_for_query = true;
     bool        idle_in_transaction_timeout_enabled = false;
     bool        idle_session_timeout_enabled = false;
+    bool        idle_stats_update_timeout_enabled = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4238,11 +4245,12 @@ PostgresMain(int argc, char *argv[],
          * Note: this includes fflush()'ing the last of the prior output.
          *
          * This is also a good time to send collected statistics to the
-         * collector, and to update the PS stats display.  We avoid doing
-         * those every time through the message loop because it'd slow down
-         * processing of batched messages, and because we don't want to report
-         * uncommitted updates (that confuses autovacuum).  The notification
-         * processor wants a call too, if we are not in a transaction block.
+         * activity stats facility, and to update the PS stats display.  We avoid
+         * doing those every time through the message loop because it'd slow
+         * down processing of batched messages, and because we don't want to
+         * report uncommitted updates (that confuses autovacuum).  The
+         * notification processor wants a call too, if we are not in a
+         * transaction block.
          *
          * Also, if an idle timeout is enabled, start the timer for that.
          */
@@ -4276,6 +4284,8 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long stats_timeout;
+
                 /* Send out notify signals and transmit self-notifies */
                 ProcessCompletedNotifies();
 
@@ -4288,8 +4298,14 @@ PostgresMain(int argc, char *argv[],
                 if (notifyInterruptPending)
                     ProcessNotifyInterrupt();
 
-                pgstat_report_stat(false);
-
+                /* Start the idle-stats-update timer */
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    idle_stats_update_timeout_enabled = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle");
                 pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -4323,9 +4339,9 @@ PostgresMain(int argc, char *argv[],
         firstchar = ReadCommand(&input_message);
 
         /*
-         * (4) turn off the idle-in-transaction and idle-session timeouts, if
-         * active.  We do this before step (5) so that any last-moment timeout
-         * is certain to be detected in step (5).
+         * (4) turn off the idle-in-transaction, idle-session and
+         * idle-stats-update timeouts if active.  We do this before step (5) so
+         * that any last-moment timeout is certain to be detected in step (5).
          *
          * At most one of these timeouts will be active, so there's no need to
          * worry about combining the timeout.c calls into one.
@@ -4340,6 +4356,11 @@ PostgresMain(int argc, char *argv[],
             disable_timeout(IDLE_SESSION_TIMEOUT, false);
             idle_session_timeout_enabled = false;
         }
+        if (idle_stats_update_timeout_enabled)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            idle_stats_update_timeout_enabled = false;
+        }
 
         /*
          * (5) disable async signal conditions again.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 5c12a165a1..fd6465dfb9 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * pgstatfuncs.c
- *      Functions for accessing the statistics collector data
+ *      Functions for accessing the activity statistics data
  *
  * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -35,9 +35,6 @@
 
 #define HAS_PGSTAT_PERMISSIONS(role)     (is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS) || has_privs_of_role(GetUserId(),role))
 
 
-/* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
-
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
 {
@@ -1269,7 +1266,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_commit);
+        result = (int64) (dbentry->counts.n_xact_commit);
 
     PG_RETURN_INT64(result);
 }
@@ -1285,7 +1282,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_rollback);
+        result = (int64) (dbentry->counts.n_xact_rollback);
 
     PG_RETURN_INT64(result);
 }
@@ -1301,7 +1298,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_fetched);
+        result = (int64) (dbentry->counts.n_blocks_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1317,7 +1314,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_hit);
+        result = (int64) (dbentry->counts.n_blocks_hit);
 
     PG_RETURN_INT64(result);
 }
@@ -1333,7 +1330,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_returned);
+        result = (int64) (dbentry->counts.n_tuples_returned);
 
     PG_RETURN_INT64(result);
 }
@@ -1349,7 +1346,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_fetched);
+        result = (int64) (dbentry->counts.n_tuples_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1365,7 +1362,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_inserted);
+        result = (int64) (dbentry->counts.n_tuples_inserted);
 
     PG_RETURN_INT64(result);
 }
@@ -1381,7 +1378,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_updated);
+        result = (int64) (dbentry->counts.n_tuples_updated);
 
     PG_RETURN_INT64(result);
 }
@@ -1397,7 +1394,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_deleted);
+        result = (int64) (dbentry->counts.n_tuples_deleted);
 
     PG_RETURN_INT64(result);
 }
@@ -1430,7 +1427,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_files;
+        result = dbentry->counts.n_temp_files;
 
     PG_RETURN_INT64(result);
 }
@@ -1446,7 +1443,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_bytes;
+        result = dbentry->counts.n_temp_bytes;
 
     PG_RETURN_INT64(result);
 }
@@ -1461,7 +1458,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace);
+        result = (int64) (dbentry->counts.n_conflict_tablespace);
 
     PG_RETURN_INT64(result);
 }
@@ -1476,7 +1473,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_lock);
+        result = (int64) (dbentry->counts.n_conflict_lock);
 
     PG_RETURN_INT64(result);
 }
@@ -1491,7 +1488,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_snapshot);
+        result = (int64) (dbentry->counts.n_conflict_snapshot);
 
     PG_RETURN_INT64(result);
 }
@@ -1506,7 +1503,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_bufferpin);
+        result = (int64) (dbentry->counts.n_conflict_bufferpin);
 
     PG_RETURN_INT64(result);
 }
@@ -1521,7 +1518,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1536,11 +1533,11 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace +
-                          dbentry->n_conflict_lock +
-                          dbentry->n_conflict_snapshot +
-                          dbentry->n_conflict_bufferpin +
-                          dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_tablespace +
+                          dbentry->counts.n_conflict_lock +
+                          dbentry->counts.n_conflict_snapshot +
+                          dbentry->counts.n_conflict_bufferpin +
+                          dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1555,7 +1552,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_deadlocks);
+        result = (int64) (dbentry->counts.n_deadlocks);
 
     PG_RETURN_INT64(result);
 }
@@ -1573,7 +1570,7 @@ pg_stat_get_db_checksum_failures(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_checksum_failures);
+        result = (int64) (dbentry->counts.n_checksum_failures);
 
     PG_RETURN_INT64(result);
 }
@@ -1610,7 +1607,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_read_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_read_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1626,7 +1623,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_write_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_write_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1634,69 +1631,71 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_bgwriter_timed_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->timed_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->timed_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_requested_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->requested_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->requested_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_clean);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_written_clean);
 }
 
 Datum
 pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->maxwritten_clean);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->maxwritten_clean);
 }
 
 Datum
 pg_stat_get_checkpoint_write_time(PG_FUNCTION_ARGS)
 {
     /* time is already in msec, just convert to double for presentation */
-    PG_RETURN_FLOAT8((double) pgstat_fetch_global()->checkpoint_write_time);
+    PG_RETURN_FLOAT8((double)
+                     pgstat_fetch_stat_checkpointer()->checkpoint_write_time);
 }
 
 Datum
 pg_stat_get_checkpoint_sync_time(PG_FUNCTION_ARGS)
 {
     /* time is already in msec, just convert to double for presentation */
-    PG_RETURN_FLOAT8((double) pgstat_fetch_global()->checkpoint_sync_time);
+    PG_RETURN_FLOAT8((double)
+                     pgstat_fetch_stat_checkpointer()->checkpoint_sync_time);
 }
 
 Datum
 pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_global()->stat_reset_timestamp);
+    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_stat_bgwriter()->stat_reset_timestamp);
 }
 
 Datum
 pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_backend);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_backend);
 }
 
 Datum
 pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_fsync_backend);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_fsync_backend);
 }
 
 Datum
 pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_alloc);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
 }
 
 /*
@@ -1710,7 +1709,7 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
     Datum        values[PG_STAT_GET_WAL_COLS];
     bool        nulls[PG_STAT_GET_WAL_COLS];
     char        buf[256];
-    PgStat_WalStats *wal_stats;
+    PgStat_Wal *wal_stats;
 
     /* Initialise values and NULL flags arrays */
     MemSet(values, 0, sizeof(values));
@@ -1735,11 +1734,11 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
     wal_stats = pgstat_fetch_stat_wal();
 
     /* Fill values and NULLs */
-    values[0] = Int64GetDatum(wal_stats->wal_records);
-    values[1] = Int64GetDatum(wal_stats->wal_fpi);
+    values[0] = Int64GetDatum(wal_stats->wal_usage.wal_records);
+    values[1] = Int64GetDatum(wal_stats->wal_usage.wal_fpi);
 
     /* Convert to numeric. */
-    snprintf(buf, sizeof buf, UINT64_FORMAT, wal_stats->wal_bytes);
+    snprintf(buf, sizeof buf, UINT64_FORMAT, wal_stats->wal_usage.wal_bytes);
     values[2] = DirectFunctionCall3(numeric_in,
                                     CStringGetDatum(buf),
                                     ObjectIdGetDatum(0),
@@ -2020,7 +2019,7 @@ pg_stat_get_xact_function_self_time(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_snapshot_timestamp(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_global()->stats_timestamp);
+    PG_RETURN_TIMESTAMPTZ(pgstat_get_stat_timestamp());
 }
 
 /* Discard the active statistics snapshot */
@@ -2108,7 +2107,7 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     TupleDesc    tupdesc;
     Datum        values[7];
     bool        nulls[7];
-    PgStat_ArchiverStats *archiver_stats;
+    PgStat_Archiver *archiver_stats;
 
     /* Initialise values and NULL flags arrays */
     MemSet(values, 0, sizeof(values));
@@ -2178,7 +2177,7 @@ pg_stat_get_replication_slots(PG_FUNCTION_ARGS)
     Tuplestorestate *tupstore;
     MemoryContext per_query_ctx;
     MemoryContext oldcontext;
-    PgStat_ReplSlotStats *slotstats;
+    PgStat_ReplSlot *slotstats;
     int            nstats;
     int            i;
 
@@ -2211,7 +2210,7 @@ pg_stat_get_replication_slots(PG_FUNCTION_ARGS)
     {
         Datum        values[PG_STAT_GET_REPLICATION_SLOT_COLS];
         bool        nulls[PG_STAT_GET_REPLICATION_SLOT_COLS];
-        PgStat_ReplSlotStats *s = &(slotstats[i]);
+        PgStat_ReplSlot *s = &(slotstats[i]);
 
         MemSet(values, 0, sizeof(values));
         MemSet(nulls, 0, sizeof(nulls));
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 7ef510cd01..0762c2034c 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -72,6 +72,7 @@
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/optimizer.h"
+#include "pgstat.h"
 #include "rewrite/rewriteDefine.h"
 #include "rewrite/rowsecurity.h"
 #include "storage/lmgr.h"
@@ -2366,6 +2367,9 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
      */
     RelationCloseSmgr(relation);
 
+    /* break mutual link with stats entry */
+    pgstat_delinkstats(relation);
+
     /*
      * Free all the subsidiary data structures of the relcache entry, then the
      * entry itself.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index ea28769d6a..997afcab6d 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -34,6 +34,7 @@ volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t IdleSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 0f67b99cc5..2567668b6c 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -269,9 +269,6 @@ GetBackendTypeDesc(BackendType backendType)
         case B_ARCHIVER:
             backendDesc = "archiver";
             break;
-        case B_STATS_COLLECTOR:
-            backendDesc = "stats collector";
-            break;
         case B_LOGGER:
             backendDesc = "logger";
             break;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index e5965bc517..d4c17fd7ab 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
 static void IdleSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -621,6 +622,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
         RegisterTimeout(IDLE_SESSION_TIMEOUT, IdleSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1243,6 +1246,14 @@ IdleSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 17579eeaca..85299e2138 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -746,8 +746,8 @@ const char *const config_group_names[] =
     gettext_noop("Statistics"),
     /* STATS_MONITORING */
     gettext_noop("Statistics / Monitoring"),
-    /* STATS_COLLECTOR */
-    gettext_noop("Statistics / Query and Index Statistics Collector"),
+    /* STATS_ACTIVITY */
+    gettext_noop("Statistics / Query and Index Activity Statistics"),
     /* AUTOVACUUM */
     gettext_noop("Autovacuum"),
     /* CLIENT_CONN */
@@ -1457,7 +1457,7 @@ static struct config_bool ConfigureNamesBool[] =
 #endif
 
     {
-        {"track_activities", PGC_SUSET, STATS_COLLECTOR,
+        {"track_activities", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects information about executing commands."),
             gettext_noop("Enables the collection of information on the currently "
                          "executing command of each session, along with "
@@ -1468,7 +1468,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_counts", PGC_SUSET, STATS_COLLECTOR,
+        {"track_counts", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects statistics on database activity."),
             NULL
         },
@@ -1477,7 +1477,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_io_timing", PGC_SUSET, STATS_COLLECTOR,
+        {"track_io_timing", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects timing statistics for database I/O activity."),
             NULL
         },
@@ -4356,7 +4356,7 @@ static struct config_string ConfigureNamesString[] =
     },
 
     {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
+        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
             gettext_noop("Writes temporary statistics files to the specified directory."),
             NULL,
             GUC_SUPERUSER_ONLY
@@ -4692,7 +4692,7 @@ static struct config_enum ConfigureNamesEnum[] =
     },
 
     {
-        {"track_functions", PGC_SUSET, STATS_COLLECTOR,
+        {"track_functions", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects function-level statistics on database activity."),
             NULL
         },
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 8930a94fff..4f5b6bdb12 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -580,7 +580,7 @@
 # STATISTICS
 #------------------------------------------------------------------------------
 
-# - Query and Index Statistics Collector -
+# - Query and Index Activity Statistics -
 
 #track_activities = on
 #track_counts = on
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index f674a7c94e..c61b828bf3 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -6,7 +6,7 @@ use File::Basename qw(basename dirname);
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 109;
+use Test::More tests => 108;
 
 program_help_ok('pg_basebackup');
 program_version_ok('pg_basebackup');
@@ -124,7 +124,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index adb9f819bb..12708b9470 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -84,6 +84,8 @@ extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
 
@@ -322,7 +324,6 @@ typedef enum BackendType
     B_WAL_SENDER,
     B_WAL_WRITER,
     B_ARCHIVER,
-    B_STATS_COLLECTOR,
     B_LOGGER,
 } BackendType;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c38b689710..0472b728bf 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL activity statistics facility.
  *
  *    Copyright (c) 2001-2021, PostgreSQL Global Development Group
  *
@@ -12,12 +12,15 @@
 #define PGSTAT_H
 
 #include "datatype/timestamp.h"
+#include "executor/instrument.h"
 #include "libpq/pqcomm.h"
 #include "miscadmin.h"
 #include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -27,8 +30,8 @@
  * ----------
  */
 #define PGSTAT_STAT_PERMANENT_DIRECTORY        "pg_stat"
-#define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
-#define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
+#define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/saved_stats"
+#define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/saved_stats.tmp"
 
 /* Default directory to store temporary statistics data in */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
@@ -41,38 +44,6 @@ typedef enum TrackFunctionsLevel
     TRACK_FUNC_ALL
 }            TrackFunctionsLevel;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_RESETSLRUCOUNTER,
-    PGSTAT_MTYPE_RESETREPLSLOTCOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_WAL,
-    PGSTAT_MTYPE_SLRU,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE,
-    PGSTAT_MTYPE_REPLSLOT,
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -83,9 +54,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -159,10 +129,13 @@ typedef enum PgStat_Single_Reset_Type
  */
 typedef struct PgStat_TableStatus
 {
-    Oid            t_id;            /* table's OID */
-    bool        t_shared;        /* is it a shared catalog? */
     struct PgStat_TableXactStatus *trans;    /* lowest subxact's counts */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_TableCounts t_counts;    /* event counts to be sent */
+    Relation    relation;            /* rel that is using this entry */
 } PgStat_TableStatus;
 
 /* ----------
@@ -186,353 +159,57 @@ typedef struct PgStat_TableXactStatus
     struct PgStat_TableXactStatus *next;    /* next of same subxact */
 } PgStat_TableXactStatus;
 
-
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
-/* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
+/*
+ * Archiver statistics kept in the shared stats
  */
-typedef struct PgStat_MsgHdr
+typedef struct PgStat_Archiver
 {
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
+    PgStat_Counter archived_count;    /* archival successes */
+    char        last_archived_wal[MAX_XFN_CHARS + 1];    /* last WAL file
+                                                         * archived */
+    TimestampTz last_archived_timestamp;    /* last archival success time */
+    PgStat_Counter failed_count;    /* failed archival attempts */
+    char        last_failed_wal[MAX_XFN_CHARS + 1]; /* WAL file involved in
+                                                     * last failure */
+    TimestampTz last_failed_timestamp;    /* last archival failure time */
+    TimestampTz stat_reset_timestamp;
+} PgStat_Archiver;
 
 /* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgDummy
+typedef struct PgStat_BgWriter
 {
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_alloc;
+    TimestampTz stat_reset_timestamp;
+} PgStat_BgWriter;
 
 /* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
+ * PgStat_CheckPointer        checkpointer statistics
  * ----------
  */
-
-typedef struct PgStat_MsgInquiry
+typedef struct PgStat_CheckPointer
 {
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgResetslrucounter Sent by the backend to tell the collector
- *                                to reset a SLRU counter
- * ----------
- */
-typedef struct PgStat_MsgResetslrucounter
-{
-    PgStat_MsgHdr m_hdr;
-    int            m_index;
-} PgStat_MsgResetslrucounter;
-
-/* ----------
- * PgStat_MsgResetreplslotcounter Sent by the backend to tell the collector
- *                                to reset replication slot counter(s)
- * ----------
- */
-typedef struct PgStat_MsgResetreplslotcounter
-{
-    PgStat_MsgHdr m_hdr;
-    char        m_slotname[NAMEDATALEN];
-    bool        clearall;
-} PgStat_MsgResetreplslotcounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgWal            Sent by backends and background processes to update WAL statistics.
- * ----------
- */
-typedef struct PgStat_MsgWal
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Counter m_wal_records;
-    PgStat_Counter m_wal_fpi;
-    uint64        m_wal_bytes;
-    PgStat_Counter m_wal_buffers_full;
-} PgStat_MsgWal;
-
-/* ----------
- * PgStat_MsgSLRU            Sent by a backend to update SLRU statistics.
- * ----------
- */
-typedef struct PgStat_MsgSLRU
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Counter m_index;
-    PgStat_Counter m_blocks_zeroed;
-    PgStat_Counter m_blocks_hit;
-    PgStat_Counter m_blocks_read;
-    PgStat_Counter m_blocks_written;
-    PgStat_Counter m_blocks_exists;
-    PgStat_Counter m_flush;
-    PgStat_Counter m_truncate;
-} PgStat_MsgSLRU;
-
-/* ----------
- * PgStat_MsgReplSlot    Sent by a backend or a wal sender to update replication
- *                        slot statistics.
- * ----------
- */
-typedef struct PgStat_MsgReplSlot
-{
-    PgStat_MsgHdr m_hdr;
-    char        m_slotname[NAMEDATALEN];
-    bool        m_drop;
-    PgStat_Counter m_spill_txns;
-    PgStat_Counter m_spill_count;
-    PgStat_Counter m_spill_bytes;
-    PgStat_Counter m_stream_txns;
-    PgStat_Counter m_stream_count;
-    PgStat_Counter m_stream_bytes;
-} PgStat_MsgReplSlot;
-
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+} PgStat_CheckPointer;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -548,7 +225,6 @@ typedef struct PgStat_FunctionCounts
  */
 typedef struct PgStat_BackendFunctionEntry
 {
-    Oid            f_id;
     PgStat_FunctionCounts f_counts;
 } PgStat_BackendFunctionEntry;
 
@@ -564,101 +240,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgResetslrucounter msg_resetslrucounter;
-    PgStat_MsgResetreplslotcounter msg_resetreplslotcounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgWal msg_wal;
-    PgStat_MsgSLRU msg_slru;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-    PgStat_MsgReplSlot msg_replslot;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Activity statistics data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -667,13 +250,9 @@ typedef union PgStat_Msg
 
 #define PGSTAT_FILE_FORMAT_ID    0x01A5BC9F
 
-/* ----------
- * PgStat_StatDBEntry            The collector's data per database
- * ----------
- */
-typedef struct PgStat_StatDBEntry
+
+typedef struct PgStat_StatDBCounts
 {
-    Oid            databaseid;
     PgStat_Counter n_xact_commit;
     PgStat_Counter n_xact_rollback;
     PgStat_Counter n_blocks_fetched;
@@ -683,7 +262,6 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_tuples_inserted;
     PgStat_Counter n_tuples_updated;
     PgStat_Counter n_tuples_deleted;
-    TimestampTz last_autovac_time;
     PgStat_Counter n_conflict_tablespace;
     PgStat_Counter n_conflict_lock;
     PgStat_Counter n_conflict_snapshot;
@@ -693,29 +271,86 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_temp_bytes;
     PgStat_Counter n_deadlocks;
     PgStat_Counter n_checksum_failures;
-    TimestampTz last_checksum_failure;
     PgStat_Counter n_block_read_time;    /* times in microseconds */
     PgStat_Counter n_block_write_time;
+} PgStat_StatDBCounts;
 
+/* ----------
+ * PgStat_StatEntryHeader        common header struct for PgStat_Stat*Entry
+ * ----------
+ */
+typedef struct PgStat_StatEntryHeader
+{
+    LWLock        lock;
+    bool        dropped;            /* This entry is being dropped and should
+                                     * be removed when refcount goes to
+                                     * zero. */
+    pg_atomic_uint32  refcount;        /* How many backends reference this entry */
+} PgStat_StatEntryHeader;
+
+/* ----------
+ * PgStat_StatDBEntry            The statistics per database
+ * ----------
+ */
+typedef struct PgStat_StatDBEntry
+{
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    TimestampTz last_autovac_time;
+    TimestampTz last_checksum_failure;
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;    /* time of db stats update */
+
+    PgStat_StatDBCounts counts;
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following members must be last in the struct, because we don't
+     * write them out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables; /* current gen tables hash */
+    dshash_table_handle functions;    /* current gen functions hash */
 } PgStat_StatDBEntry;
 
+/* ----------
+ * PgStat_Wal   WAL statistics kept in the shared stats, updated by
+ *                backends and background processes.
+ * ----------
+ */
+typedef struct PgStat_Wal
+{
+    WalUsage       wal_usage;
+    PgStat_Counter wal_buffers_full;
+    TimestampTz stat_reset_timestamp;
+} PgStat_Wal;
+
+/*
+ * SLRU statistics kept in the shared stats
+ */
+typedef struct PgStat_SLRUStats
+{
+    PgStat_Counter blocks_zeroed;
+    PgStat_Counter blocks_hit;
+    PgStat_Counter blocks_read;
+    PgStat_Counter blocks_written;
+    PgStat_Counter blocks_exists;
+    PgStat_Counter flush;
+    PgStat_Counter truncate;
+    TimestampTz stat_reset_timestamp;
+} PgStat_SLRUStats;
 
 /* ----------
- * PgStat_StatTabEntry            The collector's data per table (or index)
+ * PgStat_StatTabEntry            The statistics per table (or index)
  * ----------
  */
 typedef struct PgStat_StatTabEntry
 {
-    Oid            tableid;
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    /* Persistent data follow */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
 
     PgStat_Counter numscans;
 
@@ -735,99 +370,35 @@ typedef struct PgStat_StatTabEntry
     PgStat_Counter blocks_fetched;
     PgStat_Counter blocks_hit;
 
-    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
     PgStat_Counter vacuum_count;
-    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_vacuum_count;
-    TimestampTz analyze_timestamp;    /* user initiated */
     PgStat_Counter analyze_count;
-    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_analyze_count;
 } PgStat_StatTabEntry;
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            The statistics per function
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
 {
-    Oid            functionid;
-
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    /* Persistent data follow */
     PgStat_Counter f_numcalls;
 
     PgStat_Counter f_total_time;    /* times in microseconds */
     PgStat_Counter f_self_time;
 } PgStat_StatFuncEntry;
 
-
-/*
- * Archiver statistics kept in the stats collector
- */
-typedef struct PgStat_ArchiverStats
-{
-    PgStat_Counter archived_count;    /* archival successes */
-    char        last_archived_wal[MAX_XFN_CHARS + 1];    /* last WAL file
-                                                         * archived */
-    TimestampTz last_archived_timestamp;    /* last archival success time */
-    PgStat_Counter failed_count;    /* failed archival attempts */
-    char        last_failed_wal[MAX_XFN_CHARS + 1]; /* WAL file involved in
-                                                     * last failure */
-    TimestampTz last_failed_timestamp;    /* last archival failure time */
-    TimestampTz stat_reset_timestamp;
-} PgStat_ArchiverStats;
-
-/*
- * Global statistics kept in the stats collector
- */
-typedef struct PgStat_GlobalStats
-{
-    TimestampTz stats_timestamp;    /* time of stats file update */
-    PgStat_Counter timed_checkpoints;
-    PgStat_Counter requested_checkpoints;
-    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
-    PgStat_Counter checkpoint_sync_time;
-    PgStat_Counter buf_written_checkpoints;
-    PgStat_Counter buf_written_clean;
-    PgStat_Counter maxwritten_clean;
-    PgStat_Counter buf_written_backend;
-    PgStat_Counter buf_fsync_backend;
-    PgStat_Counter buf_alloc;
-    TimestampTz stat_reset_timestamp;
-} PgStat_GlobalStats;
-
-/*
- * WAL statistics kept in the stats collector
- */
-typedef struct PgStat_WalStats
-{
-    PgStat_Counter wal_records;
-    PgStat_Counter wal_fpi;
-    uint64        wal_bytes;
-    PgStat_Counter wal_buffers_full;
-    TimestampTz stat_reset_timestamp;
-} PgStat_WalStats;
-
-/*
- * SLRU statistics kept in the stats collector
- */
-typedef struct PgStat_SLRUStats
-{
-    PgStat_Counter blocks_zeroed;
-    PgStat_Counter blocks_hit;
-    PgStat_Counter blocks_read;
-    PgStat_Counter blocks_written;
-    PgStat_Counter blocks_exists;
-    PgStat_Counter flush;
-    PgStat_Counter truncate;
-    TimestampTz stat_reset_timestamp;
-} PgStat_SLRUStats;
-
 /*
  * Replication slot statistics kept in the stats collector
  */
-typedef struct PgStat_ReplSlotStats
+typedef struct PgStat_ReplSlot
 {
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
     char        slotname[NAMEDATALEN];
     PgStat_Counter spill_txns;
     PgStat_Counter spill_count;
@@ -836,7 +407,7 @@ typedef struct PgStat_ReplSlotStats
     PgStat_Counter stream_count;
     PgStat_Counter stream_bytes;
     TimestampTz stat_reset_timestamp;
-} PgStat_ReplSlotStats;
+} PgStat_ReplSlot;
 
 /* ----------
  * Backend states
@@ -885,7 +456,7 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
+    WAIT_EVENT_READING_STATS_FILE,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
     WAIT_EVENT_WAL_RECEIVER_MAIN,
@@ -1138,7 +709,7 @@ typedef struct PgBackendGSSStatus
  *
  * Each live backend maintains a PgBackendStatus struct in shared memory
  * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
+ * BackendId, but that is not critical.)  Note that the stats-writing code
  * has no involvement in, or even access to, these structs.
  *
  * Each auxiliary process also maintains a PgBackendStatus struct in shared
@@ -1335,18 +906,26 @@ extern PGDLLIMPORT bool pgstat_track_counts;
 extern PGDLLIMPORT int pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used; to be removed together with the GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
+
+/*
+ * CheckPointer statistics counters are updated directly by checkpointer and
+ * bufmgr
+ */
+extern PgStat_CheckPointer CheckPointerStats;
 
 /*
  * WAL statistics counter is updated by backends and background processes
  */
-extern PgStat_MsgWal WalStats;
+extern PgStat_Wal WalStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1361,33 +940,27 @@ extern PgStat_Counter pgStatBlockWriteTime;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
-
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 extern void pgstat_reset_slru_counter(const char *);
 extern void pgstat_reset_replslot_counter(const char *name);
 
+extern void pgstat_report_wal(void);
 extern void pgstat_report_autovac(Oid dboid);
 extern void pgstat_report_vacuum(Oid tableoid, bool shared,
                                  PgStat_Counter livetuples, PgStat_Counter deadtuples);
@@ -1427,6 +1000,7 @@ extern PgStat_TableStatus *find_tabstat_entry(Oid rel_id);
 extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 
 extern void pgstat_initstats(Relation rel);
+extern void pgstat_delinkstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
 
@@ -1549,9 +1123,10 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                                       void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
-extern void pgstat_send_wal(void);
+extern void pgstat_report_archiver(const char *xlog, bool failed);
+extern void pgstat_report_bgwriter(void);
+extern void pgstat_report_checkpointer(void);
+extern void pgstat_report_wal(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1560,15 +1135,20 @@ extern void pgstat_send_wal(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_extended(bool shared,
+                                                                Oid relid);
+extern void pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
-extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
-extern PgStat_GlobalStats *pgstat_fetch_global(void);
-extern PgStat_WalStats *pgstat_fetch_stat_wal(void);
+extern TimestampTz pgstat_get_stat_timestamp(void);
+extern PgStat_Archiver *pgstat_fetch_stat_archiver(void);
+extern PgStat_BgWriter *pgstat_fetch_stat_bgwriter(void);
+extern PgStat_CheckPointer *pgstat_fetch_stat_checkpointer(void);
+extern PgStat_Wal *pgstat_fetch_stat_wal(void);
 extern PgStat_SLRUStats *pgstat_fetch_slru(void);
-extern PgStat_ReplSlotStats *pgstat_fetch_replslot(int *nslots_p);
+extern PgStat_ReplSlot *pgstat_fetch_replslot(int *nslots_p);
 
 extern void pgstat_count_slru_page_zeroed(int slru_idx);
 extern void pgstat_count_slru_page_hit(int slru_idx);
@@ -1579,5 +1159,6 @@ extern void pgstat_count_slru_flush(int slru_idx);
 extern void pgstat_count_slru_truncate(int slru_idx);
 extern const char *pgstat_slru_name(int slru_idx);
 extern int    pgstat_slru_index(const char *name);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index cbf2510fbf..9ed6b54428 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -218,6 +218,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SHARED_TIDBITMAP,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_PER_XACT_PREDICATE_LIST,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index b9b5c1adda..add9c53ee3 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -88,7 +88,7 @@ enum config_group
     PROCESS_TITLE,
     STATS,
     STATS_MONITORING,
-    STATS_COLLECTOR,
+    STATS_ACTIVITY,
     AUTOVACUUM,
     CLIENT_CONN,
     CLIENT_CONN_STATEMENT,
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index ecb2a366a5..f090f7372a 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -32,6 +32,7 @@ typedef enum TimeoutId
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
     IDLE_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.27.0
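
A note on the "must be the first member" comments in the pgstat.h changes above. The following is only a minimal sketch of why that placement matters, not code from the patch. DemoEntryHeader and its fields, the Demo* entry types, and the demo_* helpers are all hypothetical stand-ins; the real layout of PgStat_StatEntryHeader is not shown in this part of the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for PgStat_StatEntryHeader (real layout not shown). */
typedef struct DemoEntryHeader
{
    uint32_t    magic;      /* hypothetical bookkeeping field */
    int         dropped;    /* hypothetical "entry was dropped" flag */
} DemoEntryHeader;

/* Two unrelated entry kinds that share the same leading header. */
typedef struct DemoFuncEntry
{
    DemoEntryHeader header; /* must be the first member */
    int64_t     f_numcalls;
    int64_t     f_total_time;
} DemoFuncEntry;

typedef struct DemoTabEntry
{
    DemoEntryHeader header; /* must be the first member */
    int64_t     numscans;
} DemoTabEntry;

/* The generic helpers below are only valid if the header is at offset 0. */
static_assert(offsetof(DemoFuncEntry, header) == 0, "header must come first");
static_assert(offsetof(DemoTabEntry, header) == 0, "header must come first");

/* Works on any entry kind: a pointer to the entry is a pointer to its header. */
static void
demo_mark_dropped(void *entry)
{
    DemoEntryHeader *h = (DemoEntryHeader *) entry;

    h->dropped = 1;
}

static int
demo_is_dropped(const void *entry)
{
    return ((const DemoEntryHeader *) entry)->dropped;
}
```

With this layout, generic code can handle function, table and replication-slot entries through one header pointer without knowing the concrete type, which is presumably what the common header in the patch is for.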

From 9f8d6e98cdf95837460d49d78f7943b2200bfa80 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:09 +0900
Subject: [PATCH v46 5/7] Doc part of shared-memory based stats collector.

---
 doc/src/sgml/catalogs.sgml          |   6 +-
 doc/src/sgml/config.sgml            |  34 ++++----
 doc/src/sgml/high-availability.sgml |  13 +--
 doc/src/sgml/monitoring.sgml        | 127 +++++++++++++---------------
 doc/src/sgml/ref/pg_dump.sgml       |   9 +-
 5 files changed, 90 insertions(+), 99 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 3a2266526c..4d8b92df72 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -9234,9 +9234,9 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l
   <para>
    <xref linkend="view-table"/> lists the system views described here.
    More detailed documentation of each view follows below.
-   There are some additional views that provide access to the results of
-   the statistics collector; they are described in <xref
-   linkend="monitoring-stats-views-table"/>.
+   There are some additional views that provide access to the activity
+   statistics; they are described in
+   <xref linkend="monitoring-stats-views-table"/>.
   </para>
 
   <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 82864bbb24..b0c25c9c5c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7327,11 +7327,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
     <title>Run-time Statistics</title>
 
     <sect2 id="runtime-config-statistics-collector">
-     <title>Query and Index Statistics Collector</title>
+     <title>Query and Index Activity Statistics</title>
 
      <para>
-      These parameters control server-wide statistics collection features.
-      When statistics collection is enabled, the data that is produced can be
+      These parameters control server-wide activity statistics features.
+      When activity statistics are enabled, the data that is produced can be
       accessed via the <structname>pg_stat</structname> and
       <structname>pg_statio</structname> family of system views.
       Refer to <xref linkend="monitoring"/> for more information.
@@ -7347,14 +7347,13 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables the collection of information on the currently
-        executing command of each session, along with the time when
-        that command began execution. This parameter is on by
-        default. Note that even when enabled, this information is not
-        visible to all users, only to superusers and the user owning
-        the session being reported on, so it should not represent a
-        security risk.
-        Only superusers can change this setting.
+        Enables activity tracking on the currently executing command of
+        each session, along with the time when that command began
+        execution. This parameter is on by default. Note that even when
+        enabled, this information is not visible to all users, only to
+        superusers and the user owning the session being reported on, so it
+        should not represent a security risk.  Only superusers can change this
+        setting.
        </para>
       </listitem>
      </varlistentry>
@@ -7385,9 +7384,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables collection of statistics on database activity.
+        Enables tracking of database activity.
         This parameter is on by default, because the autovacuum
-        daemon needs the collected information.
+        daemon needs the activity information.
         Only superusers can change this setting.
        </para>
       </listitem>
@@ -8485,7 +8484,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Specifies the fraction of the total number of heap tuples counted in
-        the previous statistics collection that can be inserted without
+        the previously collected statistics that can be inserted without
         incurring an index scan at the <command>VACUUM</command> cleanup stage.
         This setting currently applies to B-tree indexes only.
        </para>
@@ -8497,9 +8496,10 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         the index contains deleted pages that can be recycled during cleanup.
         Index statistics are considered to be stale if the number of newly
         inserted tuples exceeds the <varname>vacuum_cleanup_index_scale_factor</varname>
-        fraction of the total number of heap tuples detected by the previous
-        statistics collection. The total number of heap tuples is stored in
-        the index meta-page. Note that the meta-page does not include this data
+
+        fraction of the total number of heap tuples in the previously
+        collected statistics. The total number of heap tuples is stored in the
+        index meta-page. Note that the meta-page does not include this data
         until <command>VACUUM</command> finds no dead tuples, so B-tree index
         scan at the cleanup stage can only be skipped if the second and
         subsequent <command>VACUUM</command> cycles detect no dead tuples.
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index efc382cb8d..6c620469eb 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2338,12 +2338,13 @@ LOG:  database system is ready to accept read only connections
    </para>
 
    <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
-    index usage, etc., will be recorded normally on the standby. Replayed
-    actions will not duplicate their effects on primary, so replaying an
-    insert will not increment the Inserts column of pg_stat_user_tables.
-    The stats file is deleted at the start of recovery, so stats from primary
-    and standby will differ; this is considered a feature, not a bug.
+    Activity statistics are collected during recovery. All scans, reads,
+    blocks, index usage, etc., will be recorded normally on the
+    standby. Replayed actions will not duplicate their effects on primary, so
+    replaying an insert will not increment the Inserts column of
+    pg_stat_user_tables.  The activity statistics are reset at the start of
+    recovery, so stats from primary and standby will differ; this is
+    considered a feature, not a bug.
    </para>
 
    <para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3cdb1aff3c..afa8c35127 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -22,7 +22,7 @@
   <para>
    Several tools are available for monitoring database activity and
    analyzing performance.  Most of this chapter is devoted to describing
-   <productname>PostgreSQL</productname>'s statistics collector,
+   <productname>PostgreSQL</productname>'s activity statistics,
    but one should not neglect regular Unix monitoring programs such as
    <command>ps</command>, <command>top</command>, <command>iostat</command>, and <command>vmstat</command>.
    Also, once one has identified a
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    primary server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   primary process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   primary process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
@@ -130,20 +128,21 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
  </sect1>
 
  <sect1 id="monitoring-stats">
-  <title>The Statistics Collector</title>
+  <title>The Activity Statistics</title>
 
   <indexterm zone="monitoring-stats">
    <primary>statistics</primary>
   </indexterm>
 
   <para>
-   <productname>PostgreSQL</productname>'s <firstterm>statistics collector</firstterm>
-   is a subsystem that supports collection and reporting of information about
-   server activity.  Presently, the collector can count accesses to tables
-   and indexes in both disk-block and individual-row terms.  It also tracks
-   the total number of rows in each table, and information about vacuum and
-   analyze actions for each table.  It can also count calls to user-defined
-   functions and the total time spent in each one.
+   <productname>PostgreSQL</productname>'s <firstterm>activity
+   statistics</firstterm> subsystem supports tracking and reporting of
+   information about server activity.  Presently, it counts accesses to
+   tables and indexes in both disk-block and individual-row terms.  It
+   also tracks the total number of rows in each table, and information
+   about vacuum and analyze actions for each table.  It can also count
+   calls to user-defined functions and the total time spent in each
+   one.
   </para>
 
   <para>
@@ -151,15 +150,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    information about exactly what is going on in the system right now, such as
    the exact command currently being executed by other server processes, and
    which other connections exist in the system.  This facility is independent
-   of the collector process.
+   of the activity statistics.
   </para>
 
  <sect2 id="monitoring-stats-setup">
-  <title>Statistics Collection Configuration</title>
+  <title>Activity Statistics Configuration</title>
 
   <para>
-   Since collection of statistics adds some overhead to query execution,
-   the system can be configured to collect or not collect information.
+   Since tracking for the activity statistics adds some overhead to query
+   execution, the system can be configured to track or not track activity.
    This is controlled by configuration parameters that are normally set in
    <filename>postgresql.conf</filename>.  (See <xref linkend="runtime-config"/> for
    details about setting configuration parameters.)
@@ -172,7 +171,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The parameter <xref linkend="guc-track-counts"/> controls whether
-   statistics are collected about table and index accesses.
+   activity is tracked for table and index accesses.
   </para>
 
   <para>
@@ -196,18 +195,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
   </para>
 
   <para>
-   The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
-   When the server shuts down cleanly, a permanent copy of the statistics
-   data is stored in the <filename>pg_stat</filename> subdirectory, so that
-   statistics can be retained across server restarts.  When recovery is
-   performed at server start (e.g., after immediate shutdown, server crash,
-   and point-in-time recovery), all statistics counters are reset.
+   When the server shuts down cleanly, a permanent copy of the statistics
+   data is stored in the <filename>pg_stat</filename> subdirectory, so that
+   statistics can be retained across server restarts.  When recovery is
+   performed at server start (e.g., after immediate shutdown, server crash,
+   and point-in-time recovery), all statistics counters are reset.
   </para>
 
  </sect2>
@@ -220,48 +212,46 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    linkend="monitoring-stats-dynamic-views-table"/>, are available to show
    the current state of the system. There are also several other
    views, listed in <xref
-   linkend="monitoring-stats-views-table"/>, available to show the results
-   of statistics collection.  Alternatively, one can
-   build custom views using the underlying statistics functions, as discussed
-   in <xref linkend="monitoring-stats-functions"/>.
+   linkend="monitoring-stats-views-table"/>, available to show the activity
+   statistics.  Alternatively, one can build custom views using the underlying
+   statistics functions, as discussed in
+   <xref linkend="monitoring-stats-functions"/>.
   </para>
 
   <para>
-   When using the statistics to monitor collected data, it is important
-   to realize that the information does not update instantaneously.
-   Each individual server process transmits new statistical counts to
-   the collector just before going idle; so a query or transaction still in
-   progress does not affect the displayed totals.  Also, the collector itself
-   emits a new report at most once per <varname>PGSTAT_STAT_INTERVAL</varname>
-   milliseconds (500 ms unless altered while building the server).  So the
-   displayed information lags behind actual activity.  However, current-query
-   information collected by <varname>track_activities</varname> is
-   always up-to-date.
+   When using the activity statistics, it is important to realize that the
+   information does not update instantaneously.  Each individual server
+   process writes out new statistical counts just before going idle, but not
+   more frequently than once per <varname>PGSTAT_STAT_INTERVAL</varname>
+   milliseconds (1 second unless altered while building the server); so a
+   query or transaction still in progress does not affect the displayed
+   totals.  However, current-query information tracked by
+   <varname>track_activities</varname> is always up-to-date.
   </para>
 
   <para>
    Another important point is that when a server process is asked to display
-   any of these statistics, it first fetches the most recent report emitted by
-   the collector process and then continues to use this snapshot for all
-   statistical views and functions until the end of its current transaction.
-   So the statistics will show static information as long as you continue the
-   current transaction.  Similarly, information about the current queries of
-   all sessions is collected when any such information is first requested
-   within a transaction, and the same information will be displayed throughout
-   the transaction.
-   This is a feature, not a bug, because it allows you to perform several
-   queries on the statistics and correlate the results without worrying that
-   the numbers are changing underneath you.  But if you want to see new
-   results with each query, be sure to do the queries outside any transaction
-   block.  Alternatively, you can invoke
+   any of these statistics, it first reads the current statistics and then
+   continues to use this snapshot for all statistical views and functions
+   until the end of its current transaction.  So the statistics will show
+   static information as long as you continue the current transaction.
+   Similarly, information about the current queries of all sessions is tracked
+   when any such information is first requested within a transaction, and the
+   same information will be displayed throughout the transaction.  This is a
+   feature, not a bug, because it allows you to perform several queries on the
+   statistics and correlate the results without worrying that the numbers are
+   changing underneath you.  But if you want to see new results with each
+   query, be sure to do the queries outside any transaction block.
+   Alternatively, you can invoke
    <function>pg_stat_clear_snapshot</function>(), which will discard the
    current transaction's statistics snapshot (if any).  The next use of
    statistical information will cause a new snapshot to be fetched.
   </para>
-
+  
   <para>
-   A transaction can also see its own statistics (as yet untransmitted to the
-   collector) in the views <structname>pg_stat_xact_all_tables</structname>,
+   A transaction can also see its own statistics (as yet unwritten to the
+   server-wide activity statistics) in the
+   views <structname>pg_stat_xact_all_tables</structname>,
    <structname>pg_stat_xact_sys_tables</structname>,
    <structname>pg_stat_xact_user_tables</structname>, and
    <structname>pg_stat_xact_user_functions</structname>.  These numbers do not act as
@@ -643,7 +633,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    kernel's I/O cache, and might therefore still be fetched without
    requiring a physical read. Users interested in obtaining more
    detailed information on <productname>PostgreSQL</productname> I/O behavior are
-   advised to use the <productname>PostgreSQL</productname> statistics collector
+   advised to use the <productname>PostgreSQL</productname> activity statistics
    in combination with operating system utilities that allow insight
    into the kernel's handling of I/O.
   </para>
@@ -1080,10 +1070,6 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>LogicalLauncherMain</literal></entry>
       <entry>Waiting in main loop of logical replication launcher process.</entry>
      </row>
-     <row>
-      <entry><literal>PgStatMain</literal></entry>
-      <entry>Waiting in main loop of statistics collector process.</entry>
-     </row>
      <row>
       <entry><literal>RecoveryWalStream</literal></entry>
       <entry>Waiting in main loop of startup process for WAL to arrive, during
@@ -1838,6 +1824,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
     </thead>
 
     <tbody>
+     <row>
+      <entry><literal>ActivityStatistics</literal></entry>
+      <entry>Waiting to write out activity statistics to shared memory.</entry>
+     </row>
      <row>
       <entry><literal>AddinShmemInit</literal></entry>
       <entry>Waiting to manage an extension's space allocation in shared
@@ -5996,9 +5986,10 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
      <entry><literal>performing final cleanup</literal></entry>
      <entry>
        <command>VACUUM</command> is performing final cleanup.  During this phase,
-       <command>VACUUM</command> will vacuum the free space map, update statistics
-       in <literal>pg_class</literal>, and report statistics to the statistics
-       collector.  When this phase is completed, <command>VACUUM</command> will end.
+       <command>VACUUM</command> will vacuum the free space map, update
+       statistics in <literal>pg_class</literal>, and update the system-wide
+       activity statistics.  When this phase is completed,
+       <command>VACUUM</command> will end.
      </entry>
     </row>
    </tbody>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index bcbb7a25fb..1fa59a2fdf 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1280,11 +1280,10 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The database activity of <application>pg_dump</application> is
-   normally collected by the statistics collector.  If this is
-   undesirable, you can set parameter <varname>track_counts</varname>
-   to false via <envar>PGOPTIONS</envar> or the <literal>ALTER
-   USER</literal> command.
+   The database activity of <application>pg_dump</application> is normally
+   collected.  If this is undesirable, you can set
+   parameter <varname>track_counts</varname> to false
+   via <envar>PGOPTIONS</envar> or the <literal>ALTER USER</literal> command.
   </para>
 
  </refsect1>
-- 
2.27.0

From dcd64c09f489d605f2a8fe1d9e802b8488d8bd18 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 29 Sep 2020 22:59:33 +0900
Subject: [PATCH v46 6/7] Remove the GUC stats_temp_directory

The new stats collection system doesn't need a temporary directory, so
just remove it. pg_stat_statements is modified to use the pg_stat
directory to store its temporary files.  As a result, basebackup copies
pg_stat_statements' temporary file if it exists.
---
 .../pg_stat_statements/pg_stat_statements.c   | 13 +++---
 doc/src/sgml/backup.sgml                      |  2 -
 doc/src/sgml/config.sgml                      | 19 --------
 doc/src/sgml/storage.sgml                     |  6 ---
 src/backend/postmaster/pgstat.c               | 10 -----
 src/backend/replication/basebackup.c          | 36 ----------------
 src/backend/utils/misc/guc.c                  | 43 -------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/bin/initdb/initdb.c                       |  1 -
 src/bin/pg_rewind/filemap.c                   |  7 ---
 src/include/pgstat.h                          |  3 --
 src/test/perl/PostgresNode.pm                 |  4 --
 12 files changed, 6 insertions(+), 139 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 72a117fc19..0a98b2f2c0 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -89,14 +89,13 @@ PG_MODULE_MAGIC;
 #define PGSS_DUMP_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pg_stat_statements.stat"
 
 /*
- * Location of external query text file.  We don't keep it in the core
- * system's stats_temp_directory.  The core system can safely use that GUC
- * setting, because the statistics collector temp file paths are set only once
- * as part of changing the GUC, but pg_stat_statements has no way of avoiding
- * race conditions.  Besides, we only expect modest, infrequent I/O for query
- * strings, so placing the file on a faster filesystem is not compelling.
+ * Location of external query text file.  We keep it in the core system's
+ * pg_stat directory.  pg_stat_statements would have no way of avoiding race
+ * conditions if the directory were specified by a GUC.  Besides, we only
+ * expect modest, infrequent I/O for query strings, so placing the file on a
+ * faster filesystem is not compelling.
  */
-#define PGSS_TEXT_FILE    PG_STAT_TMP_DIR "/pgss_query_texts.stat"
+#define PGSS_TEXT_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pgss_query_texts.stat"
 
 /* Magic number identifying the stats file format */
 static const uint32 PGSS_FILE_HEADER = 0x20201218;
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 3c8aaed0b6..7557a375f0 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1146,8 +1146,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b0c25c9c5c..084bc57779 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7440,25 +7440,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 3234adb639..6bac5e075e 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -120,12 +120,6 @@ Item
   subsystem</entry>
 </row>
 
-<row>
- <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
-</row>
-
 <row>
  <entry><filename>pg_subtrans</filename></entry>
  <entry>Subdirectory containing subtransaction status data</entry>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ecf9d9adcc..73b44a2652 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -99,16 +99,6 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
- */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
-
 /*
  * WAL usage counters saved from pgWALUsage at the previous call to
  * pgstat_send_wal(). This is used to calculate how much WAL usage
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index d21801cf90..d2c3064678 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -87,9 +87,6 @@ static int    basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
 /* Was the backup currently in-progress initiated in recovery mode? */
 static bool backup_started_in_recovery = false;
 
-/* Relative path of temporary statistics directory */
-static char *statrelpath = NULL;
-
 /*
  * Size of each block sent into the tar stream for larger files.
  */
@@ -152,13 +149,6 @@ struct exclude_list_item
  */
 static const char *const excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    PG_STAT_TMP_DIR,
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
@@ -261,7 +251,6 @@ perform_base_backup(basebackup_options *opt)
     StringInfo    labelfile;
     StringInfo    tblspc_map_file;
     backup_manifest_info manifest;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
     backup_total = 0;
@@ -284,8 +273,6 @@ perform_base_backup(basebackup_options *opt)
     Assert(CurrentResourceOwner == NULL);
     CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -314,18 +301,6 @@ perform_base_backup(basebackup_options *opt)
         tablespaceinfo *ti;
         int            tblspc_streamed = 0;
 
-        /*
-         * Calculate the relative path of temporary statistics directory in
-         * order to skip the files which are located in that directory later.
-         */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
-
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
         ti->size = -1;
@@ -1377,17 +1352,6 @@ sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces,
         if (excludeFound)
             continue;
 
-        /*
-         * Exclude contents of directory specified by statrelpath if not set
-         * to the default (pg_stat_tmp) which is caught in the loop above.
-         */
-        if (statrelpath != NULL && strcmp(pathbuf, statrelpath) == 0)
-        {
-            elog(DEBUG1, "contents of directory \"%s\" excluded from backup", statrelpath);
-            size += _tarWriteDir(pathbuf, basepathlen, &statbuf, sizeonly);
-            continue;
-        }
-
         /*
          * We can skip pg_wal, the WAL segments need to be fetched from the
          * WAL archive anyway. But include it as an empty directory anyway, so
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 85299e2138..16e430fb28 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -203,7 +203,6 @@ static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource sourc
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_huge_page_size(int *newval, void **extra, GucSource source);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -560,8 +559,6 @@ char       *HbaFileName;
 char       *IdentFileName;
 char       *external_pid_file;
 
-char       *pgstat_temp_directory;
-
 char       *application_name;
 
 int            tcp_keepalives_idle;
@@ -4355,17 +4352,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_PRIMARY,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11773,35 +11759,6 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
     return true;
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 4f5b6bdb12..20c24a9d78 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -587,7 +587,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index c854221a30..0f42e78d19 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -217,7 +217,6 @@ static const char *const subdirs[] = {
     "pg_replslot",
     "pg_tblspc",
     "pg_stat",
-    "pg_stat_tmp",
     "pg_xact",
     "pg_logical",
     "pg_logical/snapshots",
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 2618b4c957..ab5cb51de7 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -87,13 +87,6 @@ struct exclude_list_item
  */
 static const char *excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    "pg_stat_tmp",                /* defined as PG_STAT_TMP_DIR */
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0472b728bf..d7c50eb4f9 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -33,9 +33,6 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/saved_stats"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/saved_stats.tmp"
 
-/* Default directory to store temporary statistics data in */
-#define PG_STAT_TMP_DIR        "pg_stat_tmp"
-
 /* Values for track_functions GUC variable --- order is significant! */
 typedef enum TrackFunctionsLevel
 {
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 9667f7667e..dd41a43b4e 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -455,10 +455,6 @@ sub init
     print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
       if defined $ENV{TEMP_CONFIG};
 
-    # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
-    # concurrently must not share a stats_temp_directory.
-    print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
     if ($params{allows_streaming})
     {
         if ($params{allows_streaming} eq "logical")
-- 
2.27.0

From e3f7a44de8cc916dbd5a4d67a15c9a3a01d21fea Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 29 Sep 2020 23:19:58 +0900
Subject: [PATCH v46 7/7] Exclude pg_stat directory from base backup

basebackup sends the contents of the pg_stat directory, which are doomed
to be removed when a server starts up from the backup. Now that
pg_stat_statements saves a temporary file in that directory, let's
exclude the pg_stat directory from base backups.
---
 src/backend/replication/basebackup.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index d2c3064678..25677c5c6e 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -149,6 +149,13 @@ struct exclude_list_item
  */
 static const char *const excludeDirContents[] =
 {
+    /*
+     * Skip statistics files. PGSTAT_STAT_PERMANENT_DIRECTORY must be skipped
+     * because the files in the directory will be removed when a server starts
+     * up from the backup.
+     */
+    PGSTAT_STAT_PERMANENT_DIRECTORY,
+
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
-- 
2.27.0
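
For readers unfamiliar with how such an exclude list is consulted, here is a
minimal standalone sketch. It is not the basebackup.c code: the names
toy_exclude_dir_contents and toy_dir_contents_excluded are hypothetical, and
the list holds only two illustrative entries. It assumes, as the existing
array suggests, that a matching directory is still sent as an empty directory
while its contents are skipped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Illustrative list only; the real array in basebackup.c has more entries. */
static const char *const toy_exclude_dir_contents[] = {
    "pg_stat",                  /* value of PGSTAT_STAT_PERMANENT_DIRECTORY */
    "pg_replslot",
    NULL                        /* end-of-list marker */
};

/*
 * Return true if the contents of the directory called "name" should be
 * left out of the backup (hypothetical helper, not a PostgreSQL function).
 */
static bool
toy_dir_contents_excluded(const char *name)
{
    assert(name != NULL);

    for (size_t i = 0; toy_exclude_dir_contents[i] != NULL; i++)
    {
        if (strcmp(name, toy_exclude_dir_contents[i]) == 0)
            return true;
    }
    return false;
}
```

With this list, "pg_stat" matches while "pg_stat_tmp" and "base" do not,
since the comparison is on the whole directory name rather than a prefix.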


Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
At Thu, 14 Jan 2021 15:14:25 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Fri, 08 Jan 2021 10:24:34 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > At Mon, 21 Dec 2020 17:16:20 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > > - Conflicted with b3817f5f77. Rebased.
> > 
> > Conflicted with 9877374bef. Rebased.
> 
> bea449c635 conflicted with this (on a comment change). Rebased.

Commit 960869da08 (database statistics) conflicted with this. Rebased.

I'm concerned that pgstat_update_connstats calls
GetCurrentTimestamp() every time a stats update happens (at intervals
of 10s-60s in this patch). But to keep this patch simple I didn't
change that design: the same call happens at about 0.5s intervals in
master, so this patch already reduces the rate considerably.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From fdfeaea6b993a7793b1ed8ac3a44ffaac00e840f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:03 +0900
Subject: [PATCH v47 1/7] sequential scan for dshash

Dshash did not allow scanning all entries sequentially. This adds that
functionality. The interface is similar to, but a bit different from,
both that of dynahash and the simple dshash search functions. One of the
most significant differences is that the sequential scan interface of
dshash always needs a call to dshash_seq_term when the scan ends. Another
is locking. Dshash holds the partition lock when returning an entry;
dshash_seq_next() also holds the lock when returning an entry, but callers
must not release it, since the lock is essential to continue the
scan. The seqscan interface allows entry deletion during a scan. In-scan
deletion should be performed with dshash_delete_current().
---
 src/backend/lib/dshash.c | 160 ++++++++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h |  22 ++++++
 2 files changed, 181 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index e0c763be32..520bfa0979 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -127,6 +127,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there at a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +157,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -324,7 +332,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -592,6 +600,156 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should always be called when a scan finishes.
+ * The caller may delete returned elements in the midst of a scan by using
+ * dshash_delete_current(). exclusive must be true to delete elements.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool exclusive)
+{
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->exclusive = exclusive;
+}
+
+/*
+ * Returns the next element.
+ *
+ * Returned elements are locked and the caller must not release the lock
+ * explicitly. A later dshash_seq_next() or dshash_seq_term() releases it.
+ */
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert(status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+        LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                      status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        status->curpartition = partition;
+
+        /* resize doesn't happen from now until seq scan ends */
+        status->nbuckets =
+            NUM_BUCKETS(status->hash_table->control->size_log2);
+        ensure_valid_bucket_pointers(status->hash_table);
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(status->hash_table,
+                                               status->curpartition),
+                                status->exclusive ? LW_EXCLUSIVE : LW_SHARED));
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        int next_partition;
+
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            return NULL;
+        }
+
+        /* Check whether we are moving to the next partition */
+        next_partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+
+        if (status->curpartition != next_partition)
+        {
+            /*
+             * Move to the next partition. Lock the next partition before
+             * releasing the current one, not in the reverse order, to avoid
+             * concurrent resizing.  Avoid deadlock by taking locks in the
+             * same order as resize().
+             */
+            LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                         next_partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                         status->curpartition));
+            status->curpartition = next_partition;
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * The caller may delete the item. Store the next item in case of deletion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+/*
+ * Terminates the seqscan and releases all locks.
+ *
+ * Should always be called when finishing or exiting a seqscan.
+ */
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+
+    if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+/* Remove the current entry during a seq scan. */
+void
+dshash_delete_current(dshash_seq_status *status)
+{
+    dshash_table       *hash_table    = status->hash_table;
+    dshash_table_item  *item        = status->curitem;
+    size_t                partition    = PARTITION_FOR_HASH(item->hash);
+
+    Assert(status->exclusive);
+    Assert(hash_table->control->magic == DSHASH_MAGIC);
+    Assert(hash_table->find_locked);
+    Assert(hash_table->find_exclusively_locked);
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition),
+                                LW_EXCLUSIVE));
+
+    delete_item(hash_table, item);
+}
+
+/* Get the current entry during a seq scan. */
+void *
+dshash_get_current(dshash_seq_status *status)
+{
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index c069ec9de7..a6ea377173 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,21 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state. The detail is exposed to let users know the storage
+ * size but it should be considered as an opaque type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;    /* dshash table working on */
+    int                    curbucket;    /* bucket number we are at */
+    int                    nbuckets;    /* total number of buckets in the dshash */
+    dshash_table_item  *curitem;    /* item we are currently at */
+    dsa_pointer            pnextitem;    /* dsa-pointer to the next item */
+    int                    curpartition;    /* partition number we are at */
+    bool                exclusive;    /* locking mode */
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
                                    const dshash_parameters *params,
@@ -80,6 +95,13 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
+extern void dshash_delete_current(dshash_seq_status *status);
+extern void *dshash_get_current(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.27.0
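
To illustrate why dshash_seq_next() remembers the next item before handing
out the current one, here is a small standalone model of a chained hash table
with a delete-safe sequential scan. It is only a sketch: every toy_* name is
hypothetical, the table lives in ordinary process memory, and partition
locking (and therefore any counterpart of dshash_seq_term) is left out
entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_NBUCKETS 4

typedef struct toy_item
{
    struct toy_item *next;      /* next item in the same bucket chain */
    unsigned    key;
} toy_item;

typedef struct toy_table
{
    toy_item   *buckets[TOY_NBUCKETS];
} toy_table;

typedef struct toy_seq_status
{
    toy_table  *table;
    int         curbucket;      /* bucket we are scanning */
    toy_item   *curitem;        /* item returned by the last toy_seq_next */
    toy_item   *nextitem;       /* saved in case curitem gets deleted */
} toy_seq_status;

static void
toy_insert(toy_table *t, unsigned key)
{
    toy_item   *item = malloc(sizeof(toy_item));

    assert(item != NULL);
    item->key = key;
    item->next = t->buckets[key % TOY_NBUCKETS];
    t->buckets[key % TOY_NBUCKETS] = item;
}

static void
toy_seq_init(toy_seq_status *st, toy_table *t)
{
    st->table = t;
    st->curbucket = 0;
    st->curitem = NULL;
    st->nextitem = t->buckets[0];
}

/* Return the next item, or NULL when all buckets have been scanned. */
static toy_item *
toy_seq_next(toy_seq_status *st)
{
    toy_item   *next = st->nextitem;

    /* Move to the following bucket once the current chain is exhausted. */
    while (next == NULL)
    {
        if (++st->curbucket >= TOY_NBUCKETS)
        {
            st->curitem = NULL;
            return NULL;
        }
        next = st->table->buckets[st->curbucket];
    }

    st->curitem = next;
    st->nextitem = next->next;  /* remember it before the caller can delete */
    return next;
}

/* Unlink and free the item most recently returned by toy_seq_next. */
static void
toy_delete_current(toy_seq_status *st)
{
    toy_item  **link = &st->table->buckets[st->curbucket];

    assert(st->curitem != NULL);
    while (*link != st->curitem)
        link = &(*link)->next;
    *link = st->curitem->next;
    free(st->curitem);
    st->curitem = NULL;
}

/* Count all items with a full scan. */
static int
toy_count(toy_table *t)
{
    toy_seq_status st;
    int         n = 0;

    toy_seq_init(&st, t);
    while (toy_seq_next(&st) != NULL)
        n++;
    return n;
}

/* Delete every odd key during a scan; return how many were removed. */
static int
toy_delete_odd_keys(toy_table *t)
{
    toy_seq_status st;
    toy_item   *item;
    int         ndeleted = 0;

    toy_seq_init(&st, t);
    while ((item = toy_seq_next(&st)) != NULL)
    {
        if (item->key % 2 == 1)
        {
            toy_delete_current(&st);
            ndeleted++;
        }
    }
    return ndeleted;
}
```

Because the scan state already holds the successor, freeing the current item
does not disturb the iteration. In the patch the same idea is carried by
pnextitem, with the partition lock keeping the chain stable in between.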

From 40e9ef6c498e6e6104a20932cfab13b2658228fe Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:57 +0900
Subject: [PATCH v47 2/7] Add conditional lock feature to dshash

Dshash currently waits for locks unconditionally. That is inconvenient
when we want to avoid being blocked by other processes. This commit
adds dshash_find_extended, an alternative to dshash_find and
dshash_find_or_insert that can return immediately on lock failure.
---
 src/backend/lib/dshash.c | 98 +++++++++++++++++++++-------------------
 src/include/lib/dshash.h |  3 ++
 2 files changed, 55 insertions(+), 46 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 520bfa0979..853d78b528 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -383,6 +383,10 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  * the caller must take care to ensure that the entry is not left corrupted.
  * The lock mode is either shared or exclusive depending on 'exclusive'.
  *
+ * If found is not NULL, *found is set to true if the key is found in the hash
+ * table. If the key is not found, *found is set to false and a pointer to a
+ * newly created entry is returned.
+ *
  * The caller must not lock a lock already.
  *
  * Note that the lock held is in fact an LWLock, so interrupts will be held on
@@ -392,36 +396,7 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
 {
-    dshash_hash hash;
-    size_t        partition;
-    dshash_table_item *item;
-
-    hash = hash_key(hash_table, key);
-    partition = PARTITION_FOR_HASH(hash);
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
-
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
-    ensure_valid_bucket_pointers(hash_table);
-
-    /* Search the active bucket. */
-    item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
-
-    if (!item)
-    {
-        /* Not found. */
-        LWLockRelease(PARTITION_LOCK(hash_table, partition));
-        return NULL;
-    }
-    else
-    {
-        /* The caller will free the lock by calling dshash_release_lock. */
-        hash_table->find_locked = true;
-        hash_table->find_exclusively_locked = exclusive;
-        return ENTRY_FROM_ITEM(item);
-    }
+    return dshash_find_extended(hash_table, key, exclusive, false, false, NULL);
 }
 
 /*
@@ -439,31 +414,60 @@ dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
 {
-    dshash_hash hash;
-    size_t        partition_index;
-    dshash_partition *partition;
+    return dshash_find_extended(hash_table, key, true, false, true, found);
+}
+
+
+/*
+ * Find the key in the hash table.
+ *
+ * "exclusive" is the lock mode in which the partition for the returned item
+ * is locked.  If "nowait" is true, the function returns NULL immediately when
+ * the required lock cannot be acquired.  In "insert" mode, a missing key gets
+ * a new entry and *found is set to false; *found is set to true if the key
+ * already exists.  "found" must be non-null in this mode.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool insert, bool *found)
+{
+    dshash_hash hash = hash_key(hash_table, key);
+    size_t        partidx = PARTITION_FOR_HASH(hash);
+    dshash_partition *partition = &hash_table->control->partitions[partidx];
+    LWLockMode  lockmode = exclusive ? LW_EXCLUSIVE : LW_SHARED;
     dshash_table_item *item;
 
-    hash = hash_key(hash_table, key);
-    partition_index = PARTITION_FOR_HASH(hash);
-    partition = &hash_table->control->partitions[partition_index];
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
+    /* must be exclusive when insert allowed */
+    Assert(!insert || (exclusive && found != NULL));
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (!nowait)
+        LWLockAcquire(PARTITION_LOCK(hash_table, partidx), lockmode);
+    else if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partidx),
+                                       lockmode))
+        return NULL;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
     item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
 
     if (item)
-        *found = true;
+    {
+        if (found)
+            *found = true;
+    }
     else
     {
-        *found = false;
+        if (found)
+            *found = false;
+
+        if (!insert)
+        {
+            /* The caller didn't ask us to add a new entry. */
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+            return NULL;
+        }
 
         /* Check if we are getting too full. */
         if (partition->count > MAX_COUNT_PER_PARTITION(hash_table))
@@ -479,7 +483,8 @@ restart:
              * Give up our existing lock first, because resizing needs to
              * reacquire all the locks in the right order to avoid deadlocks.
              */
-            LWLockRelease(PARTITION_LOCK(hash_table, partition_index));
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+
             resize(hash_table, hash_table->size_log2 + 1);
 
             goto restart;
@@ -493,12 +498,13 @@ restart:
         ++partition->count;
     }
 
-    /* The caller must release the lock with dshash_release_lock. */
+    /* The caller must release the lock by calling dshash_release_lock. */
     hash_table->find_locked = true;
-    hash_table->find_exclusively_locked = true;
+    hash_table->find_exclusively_locked = exclusive;
     return ENTRY_FROM_ITEM(item);
 }
 
+
 /*
  * Remove an entry by key.  Returns true if the key was found and the
  * corresponding entry was removed.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index a6ea377173..5b8114d041 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -91,6 +91,9 @@ extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                                    const void *key, bool *found);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+                                  bool exclusive, bool nowait, bool insert,
+                                  bool *found);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.27.0

From c2a76c75382d41a611edb794118cb14c6cd908f4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:59:38 +0900
Subject: [PATCH v47 3/7] Make archiver process an auxiliary process

This is a preliminary patch for the shared-memory based stats collector.

The archiver process must be an auxiliary process because it accesses
shared memory once stats data has been moved into shared memory. Make
it an auxiliary process so that it keeps working.
---
 src/backend/access/transam/xlogarchive.c |   6 +-
 src/backend/bootstrap/bootstrap.c        |  22 ++--
 src/backend/postmaster/pgarch.c          | 130 +++--------------------
 src/backend/postmaster/postmaster.c      |  50 +++++----
 src/backend/storage/lmgr/proc.c          |   1 +
 src/include/access/xlog.h                |   3 +
 src/include/access/xlogarchive.h         |   1 +
 src/include/miscadmin.h                  |   2 +
 src/include/postmaster/pgarch.h          |   4 +-
 src/include/storage/pmsignal.h           |   1 -
 src/include/storage/proc.h               |   3 +
 11 files changed, 69 insertions(+), 154 deletions(-)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 1c5a4f8b5a..d01859bde5 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -29,7 +29,9 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/latch.h"
 #include "storage/pmsignal.h"
+#include "storage/proc.h"
 
 /*
  * Attempt to retrieve the specified file from off-line archival storage.
@@ -490,8 +492,8 @@ XLogArchiveNotify(const char *xlog)
     }
 
     /* Notify archiver that it's got something to do */
-    if (IsUnderPostmaster)
-        SendPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER);
+    if (IsUnderPostmaster && ProcGlobal->archiverLatch)
+        SetLatch(ProcGlobal->archiverLatch);
 }
 
 /*
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 6f615e6622..41da0c5059 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -317,6 +317,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
         case StartupProcess:
             MyBackendType = B_STARTUP;
             break;
+        case ArchiverProcess:
+            MyBackendType = B_ARCHIVER;
+            break;
         case BgWriterProcess:
             MyBackendType = B_BG_WRITER;
             break;
@@ -437,30 +440,29 @@ AuxiliaryProcessMain(int argc, char *argv[])
             proc_exit(1);        /* should never return */
 
         case StartupProcess:
-            /* don't set signals, startup process has its own agenda */
             StartupProcessMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
+
+        case ArchiverProcess:
+            PgArchiverMain();
+            proc_exit(1);
 
         case BgWriterProcess:
-            /* don't set signals, bgwriter has its own agenda */
             BackgroundWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case CheckpointerProcess:
-            /* don't set signals, checkpointer has its own agenda */
             CheckpointerMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalWriterProcess:
-            /* don't set signals, walwriter has its own agenda */
             InitXLOGAccess();
             WalWriterMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         case WalReceiverProcess:
-            /* don't set signals, walreceiver has its own agenda */
             WalReceiverMain();
-            proc_exit(1);        /* should never return */
+            proc_exit(1);
 
         default:
             elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index edec311f12..9a2e21bf86 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -48,6 +48,7 @@
 #include "storage/latch.h"
 #include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
+#include "storage/procsignal.h"
 #include "utils/guc.h"
 #include "utils/ps_status.h"
 
@@ -78,13 +79,11 @@
  * Local data
  * ----------
  */
-static time_t last_pgarch_start_time;
 static time_t last_sigterm_time = 0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
-static volatile sig_atomic_t wakened = false;
 static volatile sig_atomic_t ready_to_stop = false;
 
 /* ----------
@@ -95,8 +94,6 @@ static volatile sig_atomic_t ready_to_stop = false;
 static pid_t pgarch_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgarch_waken(SIGNAL_ARGS);
 static void pgarch_waken_stop(SIGNAL_ARGS);
 static void pgarch_MainLoop(void);
 static void pgarch_ArchiverCopyLoop(void);
@@ -110,75 +107,6 @@ static void pgarch_archiveDone(char *xlog);
  * ------------------------------------------------------------
  */
 
-/*
- * pgarch_start
- *
- *    Called from postmaster at startup or after an existing archiver
- *    died.  Attempt to fire up a fresh archiver process.
- *
- *    Returns PID of child process, or 0 if fail.
- *
- *    Note: if fail, we will be called again from the postmaster main loop.
- */
-int
-pgarch_start(void)
-{
-    time_t        curtime;
-    pid_t        pgArchPid;
-
-    /*
-     * Do nothing if no archiver needed
-     */
-    if (!XLogArchivingActive())
-        return 0;
-
-    /*
-     * Do nothing if too soon since last archiver start.  This is a safety
-     * valve to protect against continuous respawn attempts if the archiver is
-     * dying immediately at launch. Note that since we will be re-called from
-     * the postmaster main loop, we will get another chance later.
-     */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgarch_start_time) <
-        (unsigned int) PGARCH_RESTART_INTERVAL)
-        return 0;
-    last_pgarch_start_time = curtime;
-
-#ifdef EXEC_BACKEND
-    switch ((pgArchPid = pgarch_forkexec()))
-#else
-    switch ((pgArchPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork archiver: %m")));
-            return 0;
-
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
-
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
-
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgArchiverMain(0, NULL);
-            break;
-#endif
-
-        default:
-            return (int) pgArchPid;
-    }
-
-    /* shouldn't get here */
-    return 0;
-}
-
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -212,14 +140,9 @@ pgarch_forkexec(void)
 #endif                            /* EXEC_BACKEND */
 
 
-/*
- * PgArchiverMain
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.  However,
- *    since we don't use 'em, it hardly matters...
- */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+/* Main entry point for archiver process */
+void
+PgArchiverMain(void)
 {
     /*
      * Ignore all signals usually bound to some action in the postmaster,
@@ -231,33 +154,26 @@ PgArchiverMain(int argc, char *argv[])
     /* SIGQUIT handler was already set up by InitPostmasterChild */
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, pgarch_waken);
+    pqsignal(SIGUSR1, procsignal_sigusr1_handler);
     pqsignal(SIGUSR2, pgarch_waken_stop);
+
     /* Reset some signals that are accepted by postmaster but not here */
     pqsignal(SIGCHLD, SIG_DFL);
+
+    /* Unblock signals (they were blocked when the postmaster forked us) */
     PG_SETMASK(&UnBlockSig);
 
-    MyBackendType = B_ARCHIVER;
-    init_ps_display(NULL);
+    /*
+     * Advertise our latch that backends can use to wake us up while we're
+     * sleeping.
+     */
+    ProcGlobal->archiverLatch = &MyProc->procLatch;
 
     pgarch_MainLoop();
 
     exit(0);
 }
 
-/* SIGUSR1 signal handler for archiver process */
-static void
-pgarch_waken(SIGNAL_ARGS)
-{
-    int            save_errno = errno;
-
-    /* set flag that there is work to be done */
-    wakened = true;
-    SetLatch(MyLatch);
-
-    errno = save_errno;
-}
-
 /* SIGUSR2 signal handler for archiver process */
 static void
 pgarch_waken_stop(SIGNAL_ARGS)
@@ -282,14 +198,6 @@ pgarch_MainLoop(void)
     pg_time_t    last_copy_time = 0;
     bool        time_to_stop;
 
-    /*
-     * We run the copy loop immediately upon entry, in case there are
-     * unarchived files left over from a previous database run (or maybe the
-     * archiver died unexpectedly).  After that we wait for a signal or
-     * timeout before doing more.
-     */
-    wakened = true;
-
     /*
      * There shouldn't be anything for the archiver to do except to wait for a
      * signal ... however, the archiver exists to protect our data, so she
@@ -328,12 +236,8 @@ pgarch_MainLoop(void)
         }
 
         /* Do what we're here for */
-        if (wakened || time_to_stop)
-        {
-            wakened = false;
-            pgarch_ArchiverCopyLoop();
-            last_copy_time = time(NULL);
-        }
+        pgarch_ArchiverCopyLoop();
+        last_copy_time = time(NULL);
 
         /*
          * Sleep until a signal is received, or until a poll is forced by
@@ -354,13 +258,9 @@ pgarch_MainLoop(void)
                                WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                                timeout * 1000L,
                                WAIT_EVENT_ARCHIVER_MAIN);
-                if (rc & WL_TIMEOUT)
-                    wakened = true;
                 if (rc & WL_POSTMASTER_DEATH)
                     time_to_stop = true;
             }
-            else
-                wakened = true;
         }
 
         /*
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 7de27ee4e0..af91c313e2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -548,6 +548,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #endif                            /* EXEC_BACKEND */
 
 #define StartupDataBase()        StartChildProcess(StartupProcess)
+#define StartArchiver()            StartChildProcess(ArchiverProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
@@ -1788,7 +1789,7 @@ ServerLoop(void)
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /* If we need to signal the autovacuum launcher, do so now */
         if (avlauncher_needs_signal)
@@ -3027,7 +3028,7 @@ reaper(SIGNAL_ARGS)
             if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
-                PgArchPID = pgarch_start();
+                PgArchPID = StartArchiver();
             if (PgStatPID == 0)
                 PgStatPID = pgstat_start();
 
@@ -3162,20 +3163,16 @@ reaper(SIGNAL_ARGS)
         }
 
         /*
-         * Was it the archiver?  If so, just try to start a new one; no need
-         * to force reset of the rest of the system.  (If fail, we'll try
-         * again in future cycles of the main loop.).  Unless we were waiting
-         * for it to shut down; don't restart it in that case, and
-         * PostmasterStateMachine() will advance to the next shutdown step.
+         * Was it the archiver?  Normal exit can be ignored; we'll start a new
+         * one at the next iteration of the postmaster's main loop, if
+         * necessary. Any other exit condition is treated as a crash.
          */
         if (pid == PgArchPID)
         {
             PgArchPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("archiver process"),
-                             pid, exitstatus);
-            if (PgArchStartupAllowed())
-                PgArchPID = pgarch_start();
+                HandleChildCrash(pid, exitstatus,
+                                 _("archiver process"));
             continue;
         }
 
@@ -3423,7 +3420,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -3629,6 +3626,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
+    /* Take care of the archiver too */
+    if (pid == PgArchPID)
+        PgArchPID = 0;
+    else if (PgArchPID != 0 && take_action)
+    {
+        ereport(DEBUG2,
+                (errmsg_internal("sending %s to process %d",
+                                 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+                                 (int) PgArchPID)));
+        signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+    }
+
     /*
      * Force a power-cycle of the pgarch process too.  (This isn't absolutely
      * necessary, but it seems like a good idea for robustness, and it
@@ -3932,6 +3941,7 @@ PostmasterStateMachine(void)
             Assert(CheckpointerPID == 0);
             Assert(WalWriterPID == 0);
             Assert(AutoVacPID == 0);
+            Assert(PgArchPID == 0);
             /* syslogger is not considered here */
             pmState = PM_NO_CHILDREN;
         }
@@ -5160,7 +5170,7 @@ sigusr1_handler(SIGNAL_ARGS)
          */
         Assert(PgArchPID == 0);
         if (XLogArchivingAlways())
-            PgArchPID = pgarch_start();
+            PgArchPID = StartArchiver();
 
         /*
          * If we aren't planning to enter hot standby mode later, treat
@@ -5214,16 +5224,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (StartWorkerNeeded || HaveCrashedWorker)
         maybe_start_bgworkers();
 
-    if (CheckPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER) &&
-        PgArchPID != 0)
-    {
-        /*
-         * Send SIGUSR1 to archiver process, to wake it up and begin archiving
-         * next WAL file.
-         */
-        signal_child(PgArchPID, SIGUSR1);
-    }
-
     /* Tell syslogger to rotate logfile if requested */
     if (SysLoggerPID != 0)
     {
@@ -5465,6 +5465,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork startup process: %m")));
                 break;
+            case ArchiverProcess:
+                ereport(LOG,
+                        (errmsg("could not fork archiver process: %m")));
+                break;
             case BgWriterProcess:
                 ereport(LOG,
                         (errmsg("could not fork background writer process: %m")));
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index c87ffc6549..a1e51c5b99 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -182,6 +182,7 @@ InitProcGlobal(void)
     ProcGlobal->startupBufferPinWaitBufId = -1;
     ProcGlobal->walwriterLatch = NULL;
     ProcGlobal->checkpointerLatch = NULL;
+    ProcGlobal->archiverLatch = NULL;
     pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
     pg_atomic_init_u32(&ProcGlobal->clogGroupFirst, INVALID_PGPROCNO);
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75ec1073bd..551f518cc2 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -339,6 +339,9 @@ extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
+extern void XLogArchiveWakeupStart(void);
+extern void XLogArchiveWakeupEnd(void);
+extern void XLogArchiveWakeup(void);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool PromoteIsTriggered(void);
diff --git a/src/include/access/xlogarchive.h b/src/include/access/xlogarchive.h
index 3edd1a976c..1a59181cf9 100644
--- a/src/include/access/xlogarchive.h
+++ b/src/include/access/xlogarchive.h
@@ -25,6 +25,7 @@ extern void ExecuteRecoveryCommand(const char *command, const char *commandName,
 extern void KeepFileRestoredFromArchive(const char *path, const char *xlogfname);
 extern void XLogArchiveNotify(const char *xlog);
 extern void XLogArchiveNotifySeg(XLogSegNo segno);
+extern void XLogArchiveWakeup(void);
 extern void XLogArchiveForceDone(const char *xlog);
 extern bool XLogArchiveCheckDone(const char *xlog);
 extern bool XLogArchiveIsBusy(const char *xlog);
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bdc97e308..adb9f819bb 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -419,6 +419,7 @@ typedef enum
     BootstrapProcess,
     StartupProcess,
     BgWriterProcess,
+    ArchiverProcess,
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
@@ -431,6 +432,7 @@ extern AuxProcType MyAuxProcType;
 #define AmBootstrapProcess()        (MyAuxProcType == BootstrapProcess)
 #define AmStartupProcess()            (MyAuxProcType == StartupProcess)
 #define AmBackgroundWriterProcess() (MyAuxProcType == BgWriterProcess)
+#define AmArchiverProcess()            (MyAuxProcType == ArchiverProcess)
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index d102a21ab7..385b002dfe 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -32,8 +32,6 @@
  */
 extern int    pgarch_start(void);
 
-#ifdef EXEC_BACKEND
-extern void PgArchiverMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
+extern void PgArchiverMain(void) pg_attribute_noreturn();
 
 #endif                            /* _PGARCH_H */
diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index dbbed18f61..8ed4d87ae6 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -34,7 +34,6 @@ typedef enum
 {
     PMSIGNAL_RECOVERY_STARTED,    /* recovery has started */
     PMSIGNAL_BEGIN_HOT_STANDBY, /* begin Hot Standby */
-    PMSIGNAL_WAKEN_ARCHIVER,    /* send a NOTIFY signal to xlog archiver */
     PMSIGNAL_ROTATE_LOGFILE,    /* send SIGUSR1 to syslogger to rotate logfile */
     PMSIGNAL_START_AUTOVAC_LAUNCHER,    /* start an autovacuum launcher */
     PMSIGNAL_START_AUTOVAC_WORKER,    /* start an autovacuum worker */
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 683ab64f76..6cdaf3753d 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -355,6 +355,9 @@ typedef struct PROC_HDR
     int            startupProcPid;
     /* Buffer id of the buffer that Startup process waits for pin on, or -1 */
     int            startupBufferPinWaitBufId;
+    /* Archiver process's latch */
+    Latch       *archiverLatch;
+    /* Current shared estimate of appropriate spins_per_delay value */
 } PROC_HDR;
 
 extern PGDLLIMPORT PROC_HDR *ProcGlobal;
-- 
2.27.0

From 6d248248fcec1a64cdb568a6434d7ece7a4f10fd Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:14 +0900
Subject: [PATCH v47 4/7] Shared-memory based stats collector

Previously, activity statistics were collected via sockets and shared
among backends through periodically written files. Such files can reach
tens of megabytes and are created as often as once per second; that
much data is serialized by the stats collector and then deserialized by
every backend, periodically. To avoid that cost, this patch places
activity statistics data in shared memory. Each backend accumulates
statistics locally and tries to move them into the shared statistics at
transaction end, but at intervals no shorter than 10 seconds. Until 60
seconds have elapsed since the last flush to shared stats, a lock
failure postpones the flush, to alleviate lock contention that would
slow down transactions. After that, the flush waits for locks so that
the shared statistics do not get stale.
---
 src/backend/access/heap/heapam_handler.c      |    4 +-
 src/backend/access/heap/vacuumlazy.c          |    2 +-
 src/backend/access/transam/xlog.c             |    6 +-
 src/backend/catalog/index.c                   |   24 +-
 src/backend/commands/analyze.c                |    8 +-
 src/backend/commands/dbcommands.c             |    2 +-
 src/backend/commands/matview.c                |    8 +-
 src/backend/commands/vacuum.c                 |    4 +-
 src/backend/postmaster/autovacuum.c           |   76 +-
 src/backend/postmaster/bgwriter.c             |    4 +-
 src/backend/postmaster/checkpointer.c         |   26 +-
 src/backend/postmaster/pgarch.c               |   12 +-
 src/backend/postmaster/pgstat.c               | 6294 +++++++----------
 src/backend/postmaster/postmaster.c           |   88 +-
 src/backend/replication/basebackup.c          |    4 +-
 src/backend/replication/slot.c                |   12 +-
 src/backend/storage/buffer/bufmgr.c           |    8 +-
 src/backend/storage/ipc/ipci.c                |    2 +
 src/backend/storage/lmgr/lwlock.c             |    4 +-
 src/backend/storage/lmgr/lwlocknames.txt      |    1 +
 src/backend/storage/smgr/smgr.c               |    4 +-
 src/backend/tcop/postgres.c                   |   41 +-
 src/backend/utils/adt/pgstatfuncs.c           |  109 +-
 src/backend/utils/cache/relcache.c            |    5 +
 src/backend/utils/init/globals.c              |    1 +
 src/backend/utils/init/miscinit.c             |    3 -
 src/backend/utils/init/postinit.c             |   11 +
 src/backend/utils/misc/guc.c                  |   14 +-
 src/backend/utils/misc/postgresql.conf.sample |    2 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |    4 +-
 src/include/miscadmin.h                       |    3 +-
 src/include/pgstat.h                          |  744 +-
 src/include/storage/lwlock.h                  |    1 +
 src/include/utils/guc_tables.h                |    2 +-
 src/include/utils/timeout.h                   |    1 +
 35 files changed, 2866 insertions(+), 4668 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 4a70e20a14..a8c882d315 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1086,8 +1086,8 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
                  * our own.  In this case we should count and sample the row,
                  * to accommodate users who load a table and analyze it in one
                  * transaction.  (pgstat_report_analyze has to adjust the
-                 * numbers we send to the stats collector to make this come
-                 * out right.)
+                 * numbers we report to the activity stats facility to make
+                 * this come out right.)
                  */
                 if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(targtuple->t_data)))
                 {
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f3d2265fad..ceccb7e19e 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -598,7 +598,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
                         new_min_multi,
                         false);
 
-    /* report results to the stats collector, too */
+    /* report results to the activity stats facility, too */
     pgstat_report_vacuum(RelationGetRelid(onerel),
                          onerel->rd_rel->relisshared,
                          Max(new_live_tuples, 0),
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 470e113b33..5801075a34 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2209,7 +2209,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
                     WriteRqst.Flush = 0;
                     XLogWrite(WriteRqst, false);
                     LWLockRelease(WALWriteLock);
-                    WalStats.m_wal_buffers_full++;
+                    WalStats.wal_buffers_full++;
                     TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
                 }
                 /* Re-acquire WALBufMappingLock and retry */
@@ -8658,8 +8658,8 @@ LogCheckpointEnd(bool restartpoint)
                                                  CheckpointStats.ckpt_sync_end_t);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time += write_msecs;
-    BgWriterStats.m_checkpoint_sync_time += sync_msecs;
+    CheckPointerStats.checkpoint_write_time += write_msecs;
+    CheckPointerStats.checkpoint_sync_time += sync_msecs;
 
     /*
      * All of the published timing statistics are accounted for.  Only
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index b8cd35e995..207c7ec529 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1854,28 +1854,10 @@ index_concurrently_swap(Oid newIndexId, Oid oldIndexId, const char *oldName)
 
     /*
      * Copy over statistics from old to new index
+     * The data will be flushed by the next pgstat_report_stat()
+     * call.
      */
-    {
-        PgStat_StatTabEntry *tabentry;
-
-        tabentry = pgstat_fetch_stat_tabentry(oldIndexId);
-        if (tabentry)
-        {
-            if (newClassRel->pgstat_info)
-            {
-                newClassRel->pgstat_info->t_counts.t_numscans = tabentry->numscans;
-                newClassRel->pgstat_info->t_counts.t_tuples_returned = tabentry->tuples_returned;
-                newClassRel->pgstat_info->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_hit = tabentry->blocks_hit;
-
-                /*
-                 * The data will be sent by the next pgstat_report_stat()
-                 * call.
-                 */
-            }
-        }
-    }
+    pgstat_copy_index_counters(oldIndexId, newClassRel->pgstat_info);
 
     /* Copy data of pg_statistic from the old index to the new one */
     CopyStatistics(oldIndexId, newIndexId);
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 7295cf0215..308b4ab034 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -644,10 +644,10 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
     }
 
     /*
-     * Report ANALYZE to the stats collector, too.  However, if doing
-     * inherited stats we shouldn't report, because the stats collector only
-     * tracks per-table stats.  Reset the changes_since_analyze counter only
-     * if we analyzed all columns; otherwise, there is still work for
+     * Report ANALYZE to the activity stats facility, too.  However, if doing
+     * inherited stats we shouldn't report, because the activity stats facility
+     * only tracks per-table stats.  Reset the changes_since_analyze counter
+     * only if we analyzed all columns; otherwise, there is still work for
      * auto-analyze to do.
      */
     if (!inh)
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 2b159b60eb..acea4de382 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -971,7 +971,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
     DropDatabaseBuffers(db_id);
 
     /*
-     * Tell the stats collector to forget it immediately, too.
+     * Tell the activity stats facility to forget it immediately, too.
      */
     pgstat_drop_database(db_id);
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index c5c25ce11d..1464b97c7f 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -336,10 +336,10 @@ ExecRefreshMatView(RefreshMatViewStmt *stmt, const char *queryString,
         refresh_by_heap_swap(matviewOid, OIDNewHeap, relpersistence);
 
         /*
-         * Inform stats collector about our activity: basically, we truncated
-         * the matview and inserted some new data.  (The concurrent code path
-         * above doesn't need to worry about this because the inserts and
-         * deletes it issues get counted by lower-level code.)
+         * Inform the activity stats facility about our activity: basically, we
+         * truncated the matview and inserted some new data.  (The concurrent
+         * code path above doesn't need to worry about this because the inserts
+         * and deletes it issues get counted by lower-level code.)
          */
         pgstat_count_truncate(matviewRel);
         if (!stmt->skipData)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 462f9a0f82..bbe6285e8a 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -319,8 +319,8 @@ vacuum(List *relations, VacuumParams *params,
                  errmsg("VACUUM option DISABLE_PAGE_SKIPPING cannot be used with FULL")));
 
     /*
-     * Send info about dead objects to the statistics collector, unless we are
-     * in autovacuum --- autovacuum.c does this for itself.
+     * Send info about dead objects to the activity statistics facility, unless
+     * we are in autovacuum --- autovacuum.c does this for itself.
      */
     if ((params->options & VACOPT_VACUUM) && !IsAutoVacuumWorkerProcess())
         pgstat_vacuum_stat();
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 47e60ca561..194a02024b 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -342,9 +342,6 @@ static void autovacuum_do_vac_analyze(autovac_table *tab,
                                       BufferAccessStrategy bstrategy);
 static AutoVacOpts *extract_autovac_opts(HeapTuple tup,
                                          TupleDesc pg_class_desc);
-static PgStat_StatTabEntry *get_pgstat_tabentry_relid(Oid relid, bool isshared,
-                                                      PgStat_StatDBEntry *shared,
-                                                      PgStat_StatDBEntry *dbentry);
 static void perform_work_item(AutoVacuumWorkItem *workitem);
 static void autovac_report_activity(autovac_table *tab);
 static void autovac_report_workitem(AutoVacuumWorkItem *workitem,
@@ -1684,12 +1681,12 @@ AutoVacWorkerMain(int argc, char *argv[])
         char        dbname[NAMEDATALEN];
 
         /*
-         * Report autovac startup to the stats collector.  We deliberately do
-         * this before InitPostgres, so that the last_autovac_time will get
-         * updated even if the connection attempt fails.  This is to prevent
-         * autovac from getting "stuck" repeatedly selecting an unopenable
-         * database, rather than making any progress on stuff it can connect
-         * to.
+         * Report autovac startup to the activity stats facility.  We
+         * deliberately do this before InitPostgres, so that the
+         * last_autovac_time will get updated even if the connection attempt
+         * fails.  This is to prevent autovac from getting "stuck" repeatedly
+         * selecting an unopenable database, rather than making any progress on
+         * stuff it can connect to.
          */
         pgstat_report_autovac(dbid);
 
@@ -1961,8 +1958,6 @@ do_autovacuum(void)
     HASHCTL        ctl;
     HTAB       *table_toast_map;
     ListCell   *volatile cell;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     BufferAccessStrategy bstrategy;
     ScanKeyData key;
     TupleDesc    pg_class_desc;
@@ -1981,17 +1976,11 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /*
-     * may be NULL if we couldn't find an entry (only happens if we are
-     * forcing a vacuum for anti-wrap purposes).
-     */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
 
     /*
-     * Clean up any dead statistics collector entries for this DB. We always
+     * Clean up any dead activity statistics entries for this DB. We always
      * want to do this exactly once per DB-processing cycle, even if we find
      * nothing worth vacuuming in the database.
      */
@@ -2034,9 +2023,6 @@ do_autovacuum(void)
     /* StartTransactionCommand changed elsewhere */
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-
     classRel = table_open(RelationRelationId, AccessShareLock);
 
     /* create a copy so we can use it after closing pg_class */
@@ -2114,8 +2100,8 @@ do_autovacuum(void)
 
         /* Fetch reloptions and the pgstat entry for this table */
         relopts = extract_autovac_opts(tuple, pg_class_desc);
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                       relid);
 
         /* Check if it needs vacuum or analyze */
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
@@ -2198,8 +2184,8 @@ do_autovacuum(void)
         }
 
         /* Fetch the pgstat entry for this table */
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                       relid);
 
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
@@ -2758,29 +2744,6 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
     return av;
 }
 
-/*
- * get_pgstat_tabentry_relid
- *
- * Fetch the pgstat entry of a table, either local to a database or shared.
- */
-static PgStat_StatTabEntry *
-get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
-                          PgStat_StatDBEntry *dbentry)
-{
-    PgStat_StatTabEntry *tabentry = NULL;
-
-    if (isshared)
-    {
-        if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
-    }
-    else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
-
-    return tabentry;
-}
 
 /*
  * table_recheck_autovac
@@ -2984,17 +2947,10 @@ recheck_relation_needs_vacanalyze(Oid relid,
                                   bool *wraparound)
 {
     PgStat_StatTabEntry *tabentry;
-    PgStat_StatDBEntry *shared = NULL;
-    PgStat_StatDBEntry *dbentry = NULL;
-
-    if (classForm->relisshared)
-        shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    else
-        dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
 
     /* fetch the pgstat table entry */
-    tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                         shared, dbentry);
+    tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                   relid);
 
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
@@ -3024,7 +2980,7 @@ recheck_relation_needs_vacanalyze(Oid relid,
  *
  * For analyze, the analysis done is that the number of tuples inserted,
  * deleted and updated since the last analyze exceeds a threshold calculated
- * in the same fashion as above.  Note that the collector actually stores
+ * in the same fashion as above.  Note that the activity stats facility stores
  * the number of tuples (both live and dead) that there were as of the last
  * analyze.  This is asymmetric to the VACUUM case.
  *
@@ -3034,8 +2990,8 @@ recheck_relation_needs_vacanalyze(Oid relid,
  *
  * A table whose autovacuum_enabled option is false is
  * automatically skipped (unless we have to vacuum it due to freeze_max_age).
- * Thus autovacuum can be disabled for specific tables. Also, when the stats
- * collector does not have data about a table, it will be skipped.
+ * Thus autovacuum can be disabled for specific tables. Also, when the activity
+ * stats facility does not have data about a table, it will be skipped.
  *
  * A table whose vac_base_thresh value is < 0 takes the base value from the
  * autovacuum_vacuum_threshold GUC variable.  Similarly, a vac_scale_factor
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 715d5195bb..679992dc89 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -244,9 +244,9 @@ BackgroundWriterMain(void)
         can_hibernate = BgBufferSync(&wb_context);
 
         /*
-         * Send off activity statistics to the stats collector
+         * Send off activity statistics to the activity stats facility
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 54a818bf61..309a9997e1 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -360,7 +360,7 @@ CheckpointerMain(void)
         if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
         {
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            CheckPointerStats.requested_checkpoints++;
         }
 
         /*
@@ -374,7 +374,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                CheckPointerStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -495,17 +495,11 @@ CheckpointerMain(void)
         /* Check for archive_timeout and switch xlog files if necessary. */
         CheckArchiveTimeout();
 
-        /*
-         * Send off activity statistics to the stats collector.  (The reason
-         * why we re-use bgwriter-related code for this is that the bgwriter
-         * and checkpointer used to be just one process.  It's probably not
-         * worth the trouble to split the stats support into two independent
-         * stats message types.)
-         */
-        pgstat_send_bgwriter();
+        /* Send off activity statistics to the activity stats facility. */
+        pgstat_report_checkpointer();
 
         /* Send WAL statistics to the stats collector. */
-        pgstat_send_wal();
+        pgstat_report_wal();
 
         /*
          * If any checkpoint flags have been set, redo the loop to handle the
@@ -711,9 +705,9 @@ CheckpointWriteDelay(int flags, double progress)
         CheckArchiveTimeout();
 
         /*
-         * Report interim activity statistics to the stats collector.
+         * Report interim activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_report_checkpointer();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1257,8 +1251,10 @@ AbsorbSyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    CheckPointerStats.buf_written_backend
+        += CheckpointerShmem->num_backend_writes;
+    CheckPointerStats.buf_fsync_backend
+        += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 9a2e21bf86..504443dcc0 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -372,20 +372,20 @@ pgarch_ArchiverCopyLoop(void)
                 pgarch_archiveDone(xlog);
 
                 /*
-                 * Tell the collector about the WAL file that we successfully
-                 * archived
+                 * Tell the activity statistics facility about the WAL file
+                 * that we successfully archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_report_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
             else
             {
                 /*
-                 * Tell the collector about the WAL file that we failed to
-                 * archive
+                 * Tell the activity statistics facility about the WAL file
+                 * that we failed to archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_report_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f75b52719d..8f431759c6 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,22 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Activity Statistics facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects activity statistics, e.g. per-table access statistics, of
+ *  all backends in shared memory. The activity numbers are first stored
+ *  locally in each process, then flushed to shared memory at commit
+ *  time or by idle-timeout.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ * To avoid congestion on the shared memory, shared stats are updated no more
+ * often than once per PGSTAT_MIN_INTERVAL (10000ms). If some local numbers
+ * remain unflushed due to lock failure, we retry at an interval that starts at
+ * PGSTAT_RETRY_MIN_INTERVAL (1000ms) and doubles at every retry. Finally, we
+ * force an update PGSTAT_MAX_INTERVAL (60000ms) after the first attempt.
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the activity statistics facility creates the
+ *  area and then loads the stored stats file, if any. The last process at
+ *  shutdown writes the shared stats to the file and then destroys the area.
  *
  *    Copyright (c) 2001-2021, PostgreSQL Global Development Group
  *
@@ -19,18 +26,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -40,13 +35,9 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
-#include "executor/instrument.h"
+#include "common/hashfn.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/fork_process.h"
@@ -54,20 +45,16 @@
 #include "postmaster/postmaster.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
+#include "storage/condition_variable.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
 #include "utils/timestamp.h"
 
@@ -75,35 +62,20 @@
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
+#define PGSTAT_MIN_INTERVAL            10000    /* Minimum interval of stats data
+                                             * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_RETRY_MIN_INTERVAL    1000 /* Initial retry interval after
+                                          * PGSTAT_MIN_INTERVAL */
 
+#define PGSTAT_MAX_INTERVAL            60000    /* Longest interval of stats data
+                                             * updates */
 
 /* ----------
- * The initial size hints for the hash tables used in the collector.
+ * The initial size hints for the hash tables used in the activity statistics.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
-#define PGSTAT_TAB_HASH_SIZE    512
+#define PGSTAT_TABLE_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
 
@@ -118,7 +90,6 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
-
 /* ----------
  * GUC parameters
  * ----------
@@ -133,17 +104,11 @@ int            pgstat_track_activity_query_size = 1024;
  * ----------
  */
 char       *pgstat_stat_directory = NULL;
+
+/* No longer used; to be removed together with the GUC */
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
-/*
- * BgWriter and WAL global statistics counters.
- * Stored directly in a stats message structure so they can be sent
- * without needing to copy things around.  We assume these init to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
-PgStat_MsgWal WalStats;
-
 /*
  * WAL usage counters saved from pgWALUsage at the previous call to
  * pgstat_send_wal(). This is used to calculate how much WAL usage
@@ -170,73 +135,246 @@ static const char *const slru_names[] = {
 
 #define SLRU_NUM_ELEMENTS    lengthof(slru_names)
 
+/* struct for shared SLRU stats */
+typedef struct PgStatSharedSLRUStats
+{
+    PgStat_SLRUStats        entry[SLRU_NUM_ELEMENTS];
+    LWLock                    lock;
+    pg_atomic_uint32        changecount;
+} PgStatSharedSLRUStats;
+
+StaticAssertDecl(sizeof(TimestampTz) == sizeof(pg_atomic_uint64),
+                 "size of pg_atomic_uint64 doesn't match TimestampTz");
+
+typedef struct StatsShmemStruct
+{
+    dsa_handle    stats_dsa_handle;    /* handle for stats data area */
+    dshash_table_handle hash_handle;    /* shared dbstat hash */
+    int            refcount;            /* # of processes attached to the
+                                     * shared stats memory */
+    /* Global stats structs */
+    PgStat_Archiver            archiver_stats;
+    pg_atomic_uint32        archiver_changecount;
+    PgStat_BgWriter            bgwriter_stats;
+    pg_atomic_uint32        bgwriter_changecount;
+    PgStat_CheckPointer        checkpointer_stats;
+    pg_atomic_uint32        checkpointer_changecount;
+    PgStat_Wal                wal_stats;
+    LWLock                    wal_stats_lock;
+    PgStatSharedSLRUStats    slru_stats;
+    pg_atomic_uint32        slru_changecount;
+    pg_atomic_uint64        stats_timestamp;
+
+    /* Reset offsets, protected by StatsLock */
+    PgStat_Archiver            archiver_reset_offset;
+    PgStat_BgWriter            bgwriter_reset_offset;
+    PgStat_CheckPointer        checkpointer_reset_offset;
+
+    /* file read/write protection */
+    bool                    attach_holdoff;
+    ConditionVariable        holdoff_cv;
+
+    pg_atomic_uint64        gc_count; /* # of entries deleted. not
+                                            * protected by StatsLock */
+} StatsShmemStruct;
+
+/* BgWriter global statistics counters */
+PgStat_BgWriter BgWriterStats = {0};
+
+/* CheckPointer global statistics counters */
+PgStat_CheckPointer CheckPointerStats = {0};
+
+/* WAL global statistics counters */
+PgStat_Wal WalStats = {0};
+
 /*
- * SLRU statistics counts waiting to be sent to the collector.  These are
- * stored directly in stats message format so they can be sent without needing
- * to copy things around.  We assume this variable inits to zeroes.  Entries
- * are one-to-one with slru_names[].
+ * XXXX: always try to flush WAL stats. We don't want to manipulate another
+ * counter during XLogInsert so we don't have an efficient shortcut to know
+ * whether any counter gets incremented.
  */
-static PgStat_MsgSLRU SLRUStats[SLRU_NUM_ELEMENTS];
+static inline bool
+walstats_pending(void)
+{
+    static const PgStat_Wal all_zeroes;
+
+    return memcmp(&WalStats, &all_zeroes,
+                  offsetof(PgStat_Wal, stat_reset_timestamp)) != 0;
+}
+
+/*
+ * SLRU statistics counts waiting to be written to the shared activity
+ * statistics.  We assume this variable inits to zeroes.  Entries are
+ * one-to-one with slru_names[].
+ * Changes of SLRU counters are reported within critical sections so we use
+ * static memory in order to avoid memory allocation.
+ */
+static PgStat_SLRUStats local_SLRUStats[SLRU_NUM_ELEMENTS];
+static bool     have_slrustats = false;
 
 /* ----------
  * Local data
  * ----------
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
-
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
+/* backend-lifetime storage */
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
 
 /*
- * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
- *
- * NOTE: once allocated, TabStatusArray structures are never moved or deleted
- * for the life of the backend.  Also, we zero out the t_id fields of the
- * contained PgStat_TableStatus structs whenever they are not actively in use.
- * This allows relcache pgstat_info pointers to be treated as long-lived data,
- * avoiding repeated searches in pgstat_initstats() when a relation is
- * repeatedly opened during a transaction.
+ * Types to define shared statistics structure.
+ *
+ * Per-object statistics are stored in "shared stats" entries: structs in
+ * DSA-allocated memory that begin with a header common to all object types.
+ * Every shared stats entry is pointed to from a dshash via a dsa_pointer. This
+ * structure keeps the shared stats immovable across dshash resizing, lets a
+ * backend point to shared stats entries via a native pointer, and allows
+ * locking at stats-entry level. The per-entry locking reduces lock contention
+ * compared to the dshash partition locks. A backend accumulates stats numbers
+ * in a stats entry in local memory, then flushes the numbers to shared stats
+ * entries, basically at transaction end.
+ *
+ * Each stat entry type has a fixed member PgStat_StatEntryHeader as the first
+ * element.
+ *
+ * Shared stats are stored as:
+ *
+ * dshash pgStatSharedHash
+ *    -> PgStatHashEntry                    (dshash entry)
+ *      (dsa_pointer)-> PgStat_Stat*Entry    (dsa memory block)
+ *
+ * Shared stats entries are pointed to directly from the pgstat_localhash hash:
+ *
+ * pgstat_localhash pgStatEntHash
+ *    -> PgStatLocalHashEntry                (equivalent of PgStatHashEntry)
+ *      (native pointer)-> PgStat_Stat*Entry (dsa memory block)
+ *
+ * Local stats that are waiting to be flushed to shared stats are stored as:
+ *
+ * pgstat_localhash pgStatLocalHash
+ *    -> PgStatLocalHashEntry                 (local hash entry)
+ *      (native pointer)-> PgStat_Stat*Entry/TableStatus (palloc'ed memory)
  */
-#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
 
-typedef struct TabStatusArray
+/* The types of statistics entries */
+typedef enum PgStatTypes
 {
-    struct TabStatusArray *tsa_next;    /* link to next array, if any */
-    int            tsa_used;        /* # entries currently used */
-    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
-} TabStatusArray;
-
-static TabStatusArray *pgStatTabList = NULL;
+    PGSTAT_TYPE_DB,            /* database-wide statistics */
+    PGSTAT_TYPE_TABLE,        /* per-table statistics */
+    PGSTAT_TYPE_FUNCTION,    /* per-function statistics */
+    PGSTAT_TYPE_REPLSLOT    /* per-replication-slot statistics */
+} PgStatTypes;
 
 /*
- * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
+ * entry body size lookup table of shared statistics entries corresponding to
+ * PgStatTypes
  */
-typedef struct TabStatHashEntry
+static const size_t pgstat_sharedentsize[] =
 {
-    Oid            t_id;
-    PgStat_TableStatus *tsa_entry;
-} TabStatHashEntry;
+    sizeof(PgStat_StatDBEntry),     /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_StatTabEntry),    /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_StatFuncEntry),    /* PGSTAT_TYPE_FUNCTION */
+    sizeof(PgStat_ReplSlot)            /* PGSTAT_TYPE_REPLSLOT */
+};
 
-/*
- * Hash table for O(1) t_id -> tsa_entry lookup
- */
-static HTAB *pgStatTabHash = NULL;
+/* Ditto for local statistics entries */
+static const size_t pgstat_localentsize[] =
+{
+    sizeof(PgStat_StatDBEntry),            /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_TableStatus),            /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_BackendFunctionEntry),/* PGSTAT_TYPE_FUNCTION */
+    sizeof(PgStat_ReplSlot)                /* PGSTAT_TYPE_REPLSLOT */
+};
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * We should avoid overwriting the header part of a shared entry. Use these
+ * macros to find the portion of the struct to be written or read.
+ * PGSTAT_SHENT_BODY returns a slightly smaller address than the actual address
+ * of the next member, but that doesn't matter.
  */
-static HTAB *pgStatFunctions = NULL;
+#define PGSTAT_SHENT_BODY(e) (((char *)(e)) + sizeof(PgStat_StatEntryHeader))
+#define PGSTAT_SHENT_BODY_LEN(t) \
+    (pgstat_sharedentsize[t] - sizeof(PgStat_StatEntryHeader))
+
+/* struct for shared statistics hash entry key. */
+typedef struct PgStatHashKey
+{
+    PgStatTypes type;        /* statistics entry type */
+    Oid            databaseid;    /* database ID. InvalidOid for shared objects. */
+    Oid            objectid;    /* object ID, either table or function. */
+} PgStatHashKey;
+
+/* struct for shared statistics hash entry */
+typedef struct PgStatHashEntry
+{
+    PgStatHashKey    key; /* hash key */
+    dsa_pointer        body;/* pointer to shared stats in PgStat_StatEntryHeader */
+} PgStatHashEntry;
+
+/* struct for shared statistics local hash entry. */
+typedef struct PgStatLocalHashEntry
+{
+    PgStatHashKey            key;    /* hash key */
+    char                    status;    /* for simplehash use */
+    PgStat_StatEntryHeader *body;    /* pointer to stats body in local heap */
+    dsa_pointer                dsapointer; /* dsa pointer of body */
+} PgStatLocalHashEntry;
+
+/* parameter for the shared hash */
+static const dshash_parameters dsh_params = {
+    sizeof(PgStatHashKey),
+    sizeof(PgStatHashEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+
+/* define hashtable for local hashes */
+#define SH_PREFIX pgstat_localhash
+#define SH_ELEMENT_TYPE PgStatLocalHashEntry
+#define SH_KEY_TYPE PgStatHashKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+    hash_bytes((unsigned char *)&key, sizeof(PgStatHashKey))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(PgStatHashKey)) == 0)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/* The shared hash to index activity stats entries. */
+static dshash_table *pgStatSharedHash = NULL;
 
 /*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
+ * The local cache to index shared stats entries.
+ *
+ * This is a local hash to store native pointers to shared hash
+ * entries. pgStatEntHashAge is copied from StatsShmem->gc_count at creation
+ * and garbage collection.
  */
-static bool have_function_stats = false;
+static pgstat_localhash_hash *pgStatEntHash = NULL;
+static int     pgStatEntHashAge = 0;        /* cache age of pgStatEntHash */
+
+/* Local stats numbers are stored here. */
+static pgstat_localhash_hash *pgStatLocalHash = NULL;
+
+/* entry type for oid hash */
+typedef struct pgstat_oident
+{
+    Oid oid;
+    char status;
+} pgstat_oident;
+
+/* Define hashtable for OID hashes. */
+StaticAssertDecl(sizeof(Oid) == 4, "oid is not compatible with uint32");
+#define SH_PREFIX pgstat_oid
+#define SH_ELEMENT_TYPE pgstat_oident
+#define SH_KEY_TYPE Oid
+#define SH_KEY oid
+#define SH_HASH_KEY(tb, key) hash_bytes_uint32(key)
+#define SH_EQUAL(tb, a, b) (a == b)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
 
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
@@ -276,11 +414,8 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -289,23 +424,9 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Make our own memory context to make it easy to track memory usage.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-static PgStat_WalStats walStats;
-static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
-static PgStat_ReplSlotStats *replSlotStats;
-static int    nReplSlotStats;
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
+MemoryContext    pgStatCacheContext = NULL;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -314,41 +435,58 @@ static List *pending_write_requests = NIL;
  */
 static instr_time total_func_time;
 
+/* Simple caching feature for pgstatfuncs */
+static PgStatHashKey    stathashkey_zero = {0};
+static PgStatHashKey        cached_dbent_key = {0};
+static PgStat_StatDBEntry    cached_dbent;
+static PgStatHashKey        cached_tabent_key = {0};
+static PgStat_StatTabEntry    cached_tabent;
+static PgStatHashKey        cached_funcent_key = {0};
+static PgStat_StatFuncEntry    cached_funcent;
+
+static PgStat_Archiver        cached_archiverstats;
+static bool                    cached_archiverstats_is_valid = false;
+static PgStat_BgWriter        cached_bgwriterstats;
+static bool                    cached_bgwriterstats_is_valid = false;
+static PgStat_CheckPointer    cached_checkpointerstats;
+static bool                    cached_checkpointerstats_is_valid = false;
+static PgStat_Wal            cached_walstats;
+static bool                    cached_walstats_is_valid = false;
+static PgStat_SLRUStats        cached_slrustats;
+static bool                    cached_slrustats_is_valid = false;
+static PgStat_ReplSlot       *cached_replslotstats = NULL;
+static int                    n_cached_replslotstats = -1;
 
 /* ----------
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgstat_beshutdown_hook(int code, Datum arg);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                                                 Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStat_StatDBEntry *get_local_dbstat_entry(Oid dbid);
+static PgStat_TableStatus *get_local_tabstat_entry(Oid rel_id, bool isshared);
+
+static void pgstat_write_statsfile(void);
+
+static void pgstat_read_statsfile(void);
 static void pgstat_read_current_status(void);
 
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
+static PgStat_StatEntryHeader * get_stat_entry(PgStatTypes type, Oid dbid,
+                                               Oid objid, bool nowait,
+                                               bool create, bool *found);
 
-static int    pgstat_replslot_index(const char *name, bool create_it);
-static void pgstat_reset_replslot(int i, TimestampTz ts);
+static bool flush_tabstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_funcstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_dbstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_walstat(bool nowait);
+static bool flush_slrustat(bool nowait);
+static void delete_current_stats_entry(dshash_seq_status *hstat);
+static PgStat_StatEntryHeader * get_local_stat_entry(PgStatTypes type, Oid dbid,
+                                                     Oid objid, bool create,
+                                                     bool *found);
 
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
-static void pgstat_send_slru(void);
-static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
-static void pgstat_send_connstats(bool disconnect, TimestampTz last_report);
-
-static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
+static pgstat_oid_hash *collect_oids(Oid catalogid, AttrNumber anum_oid);
+static void pgstat_update_connstats(bool disconnect);
 
 static void pgstat_setup_memcxt(void);
 
@@ -358,492 +496,645 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len);
-static void pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
-static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_connstat(PgStat_MsgConn *msg, int len);
-static void pgstat_recv_replslot(PgStat_MsgReplSlot *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute shared memory space needed for activity statistics
+ */
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+/*
+ * StatsShmemInit - initialize during shared-memory creation
  */
 void
-pgstat_init(void)
+StatsShmemInit(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
+    bool        found;
 
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),    &found);
 
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+        ConditionVariableInit(&StatsShmem->holdoff_cv);
+        pg_atomic_init_u32(&StatsShmem->archiver_changecount, 0);
+        pg_atomic_init_u32(&StatsShmem->bgwriter_changecount, 0);
+        pg_atomic_init_u32(&StatsShmem->checkpointer_changecount, 0);
+
+        pg_atomic_init_u64(&StatsShmem->gc_count, 0);
+
+        LWLockInitialize(&StatsShmem->wal_stats_lock, LWTRANCHE_STATS);
+    }
+}
+
+/* ----------
+ * allow_next_attacher() -
+ *
+ *  Allow other processes to proceed with attaching the shared stats area.
+ * ----------
+ */
+static void
+allow_next_attacher(void)
+{
+    bool triggered = false;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (StatsShmem->attach_holdoff)
     {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
+        StatsShmem->attach_holdoff = false;
+        triggered = true;
     }
+    LWLockRelease(StatsLock);
+
+    if (triggered)
+        ConditionVariableBroadcast(&StatsShmem->holdoff_cv);
+}
+
+/* ----------
+ * attach_shared_stats() -
+ *
+ *    Attach to or create the shared stats memory. If we are the first process
+ *    to use the activity stats system, read the saved statistics file, if any.
+ * ---------
+ */
+static void
+attach_shared_stats(void)
+{
+    MemoryContext oldcontext;
 
     /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
+     * Don't use DSM outside a postmaster child, or when not tracking counts.
      */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
 
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
+    pgstat_setup_memcxt();
 
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
+    if (area)
+        return;
 
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    /* stats shared memory persists for the backend lifetime */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
 
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    /*
+     * The first attacher backend may still be reading the stats file, or the
+     * last detacher may be writing it. Wait for the work to finish.
+     */
+    ConditionVariablePrepareToSleep(&StatsShmem->holdoff_cv);
+    for (;;)
+    {
+        bool hold_off;
 
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        LWLockAcquire(StatsLock, LW_SHARED);
+        hold_off = StatsShmem->attach_holdoff;
+        LWLockRelease(StatsLock);
 
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
+        if (!hold_off)
+            break;
+
+        ConditionVariableTimedSleep(&StatsShmem->holdoff_cv, 10,
+                                    WAIT_EVENT_READING_STATS_FILE);
     }
+    ConditionVariableCancelSleep();
 
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
     /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
+     * The last process is responsible for writing out the stats file at exit.
+     * Maintain a refcount so that an exiting process can tell whether it is
+     * the last one.
      */
-    if (!pg_set_noblock(pgStatSock))
+    if (StatsShmem->refcount > 0)
+        StatsShmem->refcount++;
+    else
     {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
+        /* We're the first process to attach to the shared stats memory */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
+
+        /* Initialize shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatSharedHash = dshash_create(area, &dsh_params, 0);
+
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->hash_handle = dshash_get_hash_table_handle(pgStatSharedHash);
+        LWLockInitialize(&StatsShmem->slru_stats.lock, LWTRANCHE_STATS);
+        pg_atomic_init_u32(&StatsShmem->slru_stats.changecount, 0);
+
+        /* Block the next attacher for a while, see the comment above. */
+        StatsShmem->attach_holdoff = true;
+
+        StatsShmem->refcount = 1;
     }
 
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
+    LWLockRelease(StatsLock);
+
+    if (area)
     {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            ereport(LOG,
-                    (errmsg("getsockopt(%s) failed: %m", "SO_RCVBUF")));
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                ereport(LOG,
-                        (errmsg("setsockopt(%s) failed: %m", "SO_RCVBUF")));
-        }
+        /*
+         * We're the first attacher process; read the stats file while blocking
+         * successors.
+         */
+        Assert(StatsShmem->attach_holdoff);
+        pgstat_read_statsfile();
+        allow_next_attacher();
+    }
+    else
+    {
+        /* We're not the first one, attach existing shared area. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatSharedHash = dshash_attach(area, &dsh_params,
+                                         StatsShmem->hash_handle, 0);
     }
 
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-    /* Now that we have a long-lived socket, tell fd.c about it. */
-    ReserveExternalFD();
+    MemoryContextSwitchTo(oldcontext);
 
-    return;
+    /* don't detach automatically */
+    dsa_pin_mapping(area);
+}
 
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
+/* ----------
+ * cleanup_dropped_stats_entries() -
+ *              Clean up shared stats entries no longer used.
+ *
+ *  Shared stats entries for dropped objects may be left referenced. Clean up
+ *  our reference and drop the shared entry if needed.
+ * ----------
+ */
+static void
+cleanup_dropped_stats_entries(void)
+{
+    pgstat_localhash_iterator i;
+    PgStatLocalHashEntry   *ent;
 
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
+    if (pgStatEntHash == NULL)
+        return;
 
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
+    pgstat_localhash_start_iterate(pgStatEntHash, &i);
+    while ((ent = pgstat_localhash_iterate(pgStatEntHash, &i))
+           != NULL)
+    {
+        /*
+         * Free the shared memory chunk for the entry if we were the last
+         * referrer to a dropped entry.
+         */
+        if (pg_atomic_sub_fetch_u32(&ent->body->refcount, 1) < 1 &&
+            ent->body->dropped)
+            dsa_free(area, ent->dsapointer);
+    }
 
     /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
+     * This function is expected to be called during backend exit. So we don't
+     * bother destroying pgStatEntHash.
      */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    pgStatEntHash = NULL;
 }
 
-/*
- * subroutine for pgstat_reset_all
+/* ----------
+ * detach_shared_stats() -
+ *
+ *    Detach shared stats. Write out to file if we're the last process and told
+ *    to do so.
+ * ----------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+detach_shared_stats(bool write_file)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    bool is_last_detacher = false;
+
+    /* return immediately if there is nothing to do */
+    if (!area || !IsUnderPostmaster)
+        return;
+
+    /* We shouldn't leave a reference to shared stats. */
+    cleanup_dropped_stats_entries();
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
+    /*
+     * If we are the last detacher, hold off the next attacher (if possible)
+     * until we finish writing the stats file.
+     */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (--StatsShmem->refcount == 0)
     {
-        int            nchars;
-        Oid            tmp_oid;
+        StatsShmem->attach_holdoff = true;
+        is_last_detacher = true;
+    }
+    LWLockRelease(StatsLock);
 
-        /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
-         */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
-
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
-
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+    if (is_last_detacher)
+    {
+        if (write_file)
+            pgstat_write_statsfile();
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+        /* allow the next attacher, if any */
+        allow_next_attacher();
     }
-    FreeDir(dir);
+
+    /*
+     * Detach the area. It is automatically destroyed when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);
+
+    area = NULL;
+    pgStatSharedHash = NULL;
+
+    /* We are going to exit. Don't bother destroying local hashes. */
+    pgStatLocalHash = NULL;
 }
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    /* standalone server doesn't use shared stats */
+    if (!IsUnderPostmaster)
+        return;
+
+    /* we must have shared stats attached */
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
+
+    /* Startup must be the only user of shared stats */
+    Assert(StatsShmem->refcount == 1);
+
+    /*
+     * We could remove the files directly and recreate the shared memory area,
+     * but for simplicity we just discard it and create it anew.
+     */
+    detach_shared_stats(false);
+    attach_shared_stats();
 }
 
-#ifdef EXEC_BACKEND
 
 /*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
+ * fetch_lock_statentry - common helper function to fetch and lock a stats
+ * entry for flush_tabstat, flush_funcstat and flush_dbstat.
  */
-static pid_t
-pgstat_forkexec(void)
+static PgStat_StatEntryHeader *
+fetch_lock_statentry(PgStatTypes type, Oid dboid, Oid objid, bool nowait)
 {
-    char       *av[10];
-    int            ac = 0;
+    PgStat_StatEntryHeader *header;
+
+    /* find shared table stats entry corresponding to the local entry */
+    header = (PgStat_StatEntryHeader *)
+        get_stat_entry(type, dboid, objid, nowait, true, NULL);
 
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
+    /* skip if dshash failed to acquire lock */
+    if (header == NULL)
+        return NULL;
 
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&header->lock, LW_EXCLUSIVE))
+        return NULL;
 
-    return postmaster_forkexec(ac, av);
+    return header;
 }
-#endif                            /* EXEC_BACKEND */
 
 
-/*
- * pgstat_start() -
+/* ----------
+ * get_stat_entry() -
  *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
+ *    Get the shared stats entry for the specified type, dbid and objid.
+ *  If nowait is true, returns NULL on lock failure.
+ *
+ *  If create is true, a new entry is created if one does not exist yet. If
+ *  found is not NULL, it is set to true if an existing entry is found, or
+ *  false if not.
+ *  ----------
+ */
+static PgStat_StatEntryHeader *
+get_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait, bool create,
+               bool *found)
+{
+    PgStatHashEntry           *shhashent;
+    PgStatLocalHashEntry   *lohashent;
+    PgStat_StatEntryHeader *shheader = NULL;
+    PgStatHashKey            key;
+    bool                    shfound;
+
+    key.type        = type;
+    key.databaseid     = dbid;
+    key.objectid    = objid;
+
+    if (pgStatEntHash)
+    {
+        uint64 currage;
+
+        /*
+         * pgStatEntHashAge advances much more slowly than the following loop
+         * runs, so this is expected to iterate no more than twice.
+         */
+        while (unlikely
+               (pgStatEntHashAge !=
+                (currage = pg_atomic_read_u64(&StatsShmem->gc_count))))
+        {
+            pgstat_localhash_iterator i;
+
+            /*
+             * Some entries have been dropped. Invalidate the cached pointers
+             * to them.
+             */
+            pgstat_localhash_start_iterate(pgStatEntHash, &i);
+            while ((lohashent = pgstat_localhash_iterate(pgStatEntHash, &i))
+                   != NULL)
+            {
+                PgStat_StatEntryHeader *header = lohashent->body;
+                if (header->dropped)
+                {
+                    pgstat_localhash_delete(pgStatEntHash, key);
+
+                    if (pg_atomic_sub_fetch_u32(&header->refcount, 1) < 1)
+                    {
+                        /*
+                         * We're the last referrer to this entry, drop the
+                         * shared entry.
+                         */
+                        dsa_free(area, lohashent->dsapointer);
+                    }
+                }
+            }
+
+            pgStatEntHashAge = currage;
+        }
+
+        lohashent = pgstat_localhash_lookup(pgStatEntHash, key);
+
+        if (lohashent)
+        {
+            if (found)
+                *found = true;
+            return lohashent->body;
+        }
+    }
+
+    shhashent = dshash_find_extended(pgStatSharedHash, &key,
+                                     create, nowait, create, &shfound);
+    if (shhashent)
+    {
+        if (create && !shfound)
+        {
+            /* Create new stats entry. */
+            dsa_pointer chunk = dsa_allocate0(area,
+                                              pgstat_sharedentsize[type]);
+
+            shheader = dsa_get_address(area, chunk);
+            LWLockInitialize(&shheader->lock, LWTRANCHE_STATS);
+            pg_atomic_init_u32(&shheader->refcount, 0);
+
+            /* Link the new entry from the hash entry. */
+            shhashent->body = chunk;
+        }
+        else
+            shheader = dsa_get_address(area, shhashent->body);
+
+        /*
+         * We expose this shared entry now.  You might think that the entry can
+         * be removed by a concurrent backend, but since we are creating a
+         * stats entry, the object actually exists and is in use in the upper
+         * layer. Such an object cannot be dropped until the first vacuum after
+         * the current transaction ends.
+         */
+        dshash_release_lock(pgStatSharedHash, shhashent);
+
+        /* register to local hash if possible */
+        if (pgStatEntHash || pgStatCacheContext)
+        {
+            bool                    lofound;
+
+            if (pgStatEntHash == NULL)
+            {
+                pgStatEntHash =
+                    pgstat_localhash_create(pgStatCacheContext,
+                                        PGSTAT_TABLE_HASH_SIZE, NULL);
+                pgStatEntHashAge =
+                    pg_atomic_read_u64(&StatsShmem->gc_count);
+            }
+
+            lohashent =
+                pgstat_localhash_insert(pgStatEntHash, key, &lofound);
+
+            Assert(!lofound);
+            lohashent->body = shheader;
+            lohashent->dsapointer = shhashent->body;
+
+            pg_atomic_add_fetch_u32(&shheader->refcount, 1);
+        }
+    }
+
+    if (found)
+        *found = shfound;
+
+    return shheader;
+}
+
+/*
+ * flush_walstat - flush out local WAL stats entries
  *
- *    Returns PID of child process, or 0 if fail.
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
  *
- *    Note: if fail, we will be called again from the postmaster main loop.
+ * Returns true if all local WAL stats are successfully flushed out.
  */
-int
-pgstat_start(void)
+static bool
+flush_walstat(bool nowait)
 {
-    time_t        curtime;
-    pid_t        pgStatPid;
+    PgStat_Wal *s = &StatsShmem->wal_stats;
+    PgStat_Wal *l = &WalStats;
+    WalUsage    all_zeroes = {0} PG_USED_FOR_ASSERTS_ONLY;
+
+    /*
+     * We don't update the WAL usage portion of the local WalStats
+     * elsewhere. Instead, fill in that portion with the difference of
+     * pgWalUsage since the previous call.
+     */
+    Assert(memcmp(&l->wal_usage, &all_zeroes, sizeof(WalUsage)) == 0);
+    WalUsageAccumDiff(&l->wal_usage, &pgWalUsage, &prevWalUsage);
 
     /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
+     * This function can be called even if nothing at all has happened. Avoid
+     * taking lock for nothing in that case.
      */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
+    if (!walstats_pending())
+        return true;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&StatsShmem->wal_stats_lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&StatsShmem->wal_stats_lock,
+                                       LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
+
+    s->wal_usage.wal_records += l->wal_usage.wal_records;
+    s->wal_usage.wal_fpi += l->wal_usage.wal_fpi;
+    s->wal_usage.wal_bytes += l->wal_usage.wal_bytes;
+    s->wal_buffers_full += l->wal_buffers_full;
+    LWLockRelease(&StatsShmem->wal_stats_lock);
 
     /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
+     * Save the current counters for the subsequent calculation of WAL usage.
      */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
+    prevWalUsage = pgWalUsage;
 
     /*
-     * Okay, fork off the collector.
+     * Clear out the statistics buffer, so it can be re-used.
      */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
+    MemSet(&WalStats, 0, sizeof(WalStats));
 
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
+    return true;
+}
 
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
+/*
+ * flush_slrustat - flush out local SLRU stats entries
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true. Writer processes are mutually excluded
+ * using LWLock, but readers are expected to use change-count protocol to avoid
+ * interference with writers.
+ *
+ * Returns true if all local SLRU stats are successfully flushed out.
+ */
+static bool
+flush_slrustat(bool nowait)
+{
+    uint32    assert_changecount PG_USED_FOR_ASSERTS_ONLY;
+    int i;
+
+    if (!have_slrustats)
+        return true;
 
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&StatsShmem->slru_stats.lock,
+                                       LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
 
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
+    assert_changecount =
+        pg_atomic_fetch_add_u32(&StatsShmem->slru_stats.changecount, 1);
+    Assert((assert_changecount & 1) == 0);
 
-        default:
-            return (int) pgStatPid;
+    for (i = 0 ; i < SLRU_NUM_ELEMENTS ; i++)
+    {
+        PgStat_SLRUStats *sharedent = &StatsShmem->slru_stats.entry[i];
+        PgStat_SLRUStats *localent = &local_SLRUStats[i];
+
+        sharedent->blocks_zeroed += localent->blocks_zeroed;
+        sharedent->blocks_hit += localent->blocks_hit;
+        sharedent->blocks_read += localent->blocks_read;
+        sharedent->blocks_written += localent->blocks_written;
+        sharedent->blocks_exists += localent->blocks_exists;
+        sharedent->flush += localent->flush;
+        sharedent->truncate += localent->truncate;
     }
 
-    /* shouldn't get here */
-    return 0;
+    /* done, clear the local entry */
+    MemSet(local_SLRUStats, 0,
+           sizeof(PgStat_SLRUStats) * SLRU_NUM_ELEMENTS);
+
+    pg_atomic_add_fetch_u32(&StatsShmem->slru_stats.changecount, 1);
+    LWLockRelease(&StatsShmem->slru_stats.lock);
+
+    have_slrustats = false;
+
+    return true;
+}
+
+/* ----------
+ * delete_current_stats_entry()
+ *
+ *  Deletes the given shared entry from shared stats hash. The entry must be
+ *  exclusively locked.
+ * ----------
+ */
+static void
+delete_current_stats_entry(dshash_seq_status *hstat)
+{
+    dsa_pointer                pdsa;
+    PgStat_StatEntryHeader *header;
+    PgStatHashEntry *ent;
+
+    ent = dshash_get_current(hstat);
+    pdsa = ent->body;
+    header = dsa_get_address(area, pdsa);
+
+    /* No one can find this entry from now on. */
+    dshash_delete_current(hstat);
+
+    /*
+     * Let the referrers drop the entry if any.  Refcount won't be decremented
+     * until "dropped" is set true and StatsShmem->gc_count is incremented
+     * later. So we can check refcount to set dropped without holding a
+     * lock. If no one is referring to this entry, free it immediately.
+     */
+
+    if (pg_atomic_read_u32(&header->refcount) > 0)
+        header->dropped = true;
+    else
+        dsa_free(area, pdsa);
+
+    return;
 }
 
-void
-allow_immediate_pgstat_restart(void)
+/* ----------
+ * get_local_stat_entry() -
+ *
+ *  Returns local stats entry for the type, dbid and objid.
+ *  If create is true, a new entry is created if it does not exist yet; found
+ *  must be non-null in that case.
+ *
+ *
+ *  The caller is responsible for initializing the statsbody part of the
+ *  returned memory.
+ * ----------
+ */
+static PgStat_StatEntryHeader *
+get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                     bool create, bool *found)
 {
-    last_pgstat_start_time = 0;
+    PgStatHashKey key;
+    PgStatLocalHashEntry *entry;
+
+    if (pgStatLocalHash == NULL)
+        pgStatLocalHash = pgstat_localhash_create(pgStatCacheContext,
+                                                  PGSTAT_TABLE_HASH_SIZE, NULL);
+
+    /* Find an entry or create a new one. */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    if (create)
+        entry = pgstat_localhash_insert(pgStatLocalHash, key, found);
+    else
+        entry = pgstat_localhash_lookup(pgStatLocalHash, key);
+
+    if (!create && !entry)
+        return NULL;
+
+    if (create && !*found)
+        entry->body = MemoryContextAllocZero(TopMemoryContext,
+                                             pgstat_localentsize[type]);
+
+    return entry->body;
 }
 
 /* ------------------------------------------------------------
@@ -851,159 +1142,414 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *    Updates are applied no more frequently than once every
+ *    PGSTAT_MIN_INTERVAL milliseconds. They are also postponed on lock
+ *    failure if force is false and no updates have been pending for longer
+ *    than PGSTAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
+ *    succeeding calls of this function.
+ *
+ *    Returns the time in milliseconds until updates should next be
+ *    applied, or 0 if no updates remain pending or stats collection is not
+ *    active.
+ *
+ *    Note that this is called only out of a transaction, so it is fine to use
  *    transaction stop time as an approximation of current time.
- *
- *    "disconnect" is "true" only for the last call before the backend
- *    exits.  This makes sure that no data is lost and that interrupted
- *    sessions are reported correctly.
- * ----------
+ * ----------
  */
-void
-pgstat_report_stat(bool disconnect)
+long
+pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz    next_flush = 0;
+    static TimestampTz    pending_since = 0;
+    static long            retry_interval = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
+    bool        nowait;
     int            i;
+    uint64        oldval;
+
+    /* Return if not active */
+    if (area == NULL)
+        return 0;
+
+    /*
+     * We need a database entry if any of the following stats exist.
+     */
+    if (pgStatXactCommit > 0 || pgStatXactRollback > 0 ||
+        pgStatBlockReadTime > 0 || pgStatBlockWriteTime > 0)
+        get_local_dbstat_entry(MyDatabaseId);
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats && !disconnect)
-        return;
+    if (pgStatLocalHash == NULL && !have_slrustats && !walstats_pending())
+        return 0;
 
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the backend is about to exit.
-     */
     now = GetCurrentTransactionStopTimestamp();
-    if (!disconnect &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
 
-    /* for backends, send connection statistics */
+    if (!force)
+    {
+        /*
+         * Don't flush stats too frequently.  Return the time to the next
+         * flush.
+         */
+        if (now < next_flush)
+        {
+            /* Record the epoch time if retrying. */
+            if (pending_since == 0)
+                pending_since = now;
+
+            return (next_flush - now) / 1000;
+        }
+
+        /* But, don't keep pending updates longer than PGSTAT_MAX_INTERVAL. */
+
+        if (pending_since > 0 &&
+            TimestampDifferenceExceeds(pending_since, now, PGSTAT_MAX_INTERVAL))
+            force = true;
+    }
+
+    /* for backends, update connection statistics */
     if (MyBackendType == B_BACKEND)
-        pgstat_send_connstats(disconnect, last_report);
+        pgstat_update_connstats(false);
 
-    last_report = now;
+    /* don't wait for lock acquisition when !force */
+    nowait = !force;
 
-    /*
-     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
-     */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
-    pgStatTabHash = NULL;
-
-    /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
-     */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
-    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+    if (pgStatLocalHash)
     {
-        for (i = 0; i < tsa->tsa_used; i++)
+        int            remains = 0;
+        pgstat_localhash_iterator i;
+        List                   *dbentlist = NIL;
+        ListCell               *lc;
+        PgStatLocalHashEntry   *lent;
+
+        /* Step 1: flush out other than database stats */
+        pgstat_localhash_start_iterate(pgStatLocalHash, &i);
+        while ((lent = pgstat_localhash_iterate(pgStatLocalHash, &i)) != NULL)
         {
-            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
+            bool        remove = false;
 
-            /* Shouldn't have any pending transaction-dependent counts */
-            Assert(entry->trans == NULL);
+            switch (lent->key.type)
+            {
+                case PGSTAT_TYPE_DB:
+                    /*
+                     * flush_tabstat adds some of the counters of flushed
+                     * entries to the local database stats. Just remember the
+                     * database entries for now and flush them out later.
+                     */
+                    dbentlist = lappend(dbentlist, lent);
+                    break;
+                case PGSTAT_TYPE_TABLE:
+                    if (flush_tabstat(lent, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_FUNCTION:
+                    if (flush_funcstat(lent, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_REPLSLOT:
+                    /* We don't have that kind of local entry */
+                    Assert(false);
+            }
 
-            /*
-             * Ignore entries that didn't accumulate any actual counts, such
-             * as indexes that were opened by the planner but not used.
-             */
-            if (memcmp(&entry->t_counts, &all_zeroes,
-                       sizeof(PgStat_TableCounts)) == 0)
+            if (!remove)
+            {
+                remains++;
                 continue;
+            }
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : &regular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* Remove the successfully flushed entry */
+            pfree(lent->body);
+            lent->body = NULL;
+            pgstat_localhash_delete(pgStatLocalHash, lent->key);
+        }
+
+        /* Step 2: flush out database stats */
+        foreach(lc, dbentlist)
+        {
+            PgStatLocalHashEntry *lent = (PgStatLocalHashEntry *) lfirst(lc);
+
+            if (flush_dbstat(lent, nowait))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                remains--;
+                /* Remove the successfully flushed entry */
+                pfree(lent->body);
+                lent->body = NULL;
+                pgstat_localhash_delete(pgStatLocalHash, lent->key);
             }
         }
-        /* zero out PgStat_TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
+        list_free(dbentlist);
+        dbentlist = NULL;
+
+        if (remains <= 0)
+        {
+            pgstat_localhash_destroy(pgStatLocalHash);
+            pgStatLocalHash = NULL;
+        }
     }
 
+    /* flush wal stats */
+    flush_walstat(nowait);
+
+    /* flush SLRU stats */
+    flush_slrustat(nowait);
+
     /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
+     * Publish the time of the last flush, but we don't notify the change of
+     * the timestamp itself. Readers will get a sufficiently recent timestamp.
+     * If we fail to update the value, concurrent processes should have
+     * updated it to a sufficiently recent time.
+     *
+     * XXX: The loop might be unnecessary for the reason above.
      */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(&regular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    oldval = pg_atomic_read_u64(&StatsShmem->stats_timestamp);
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+    for (i = 0 ; i < 10 ; i++)
+    {
+        if (oldval >= now ||
+            pg_atomic_compare_exchange_u64(&StatsShmem->stats_timestamp,
+                                           &oldval, (uint64) now))
+            break;
+    }
 
-    /* Send WAL statistics */
-    pgstat_send_wal();
+    /*
+     * Some of the local stats may not have been flushed due to lock
+     * contention.  If we have such pending local stats here, let the caller
+     * know the retry interval.
+     */
+    if (pgStatLocalHash != NULL || have_slrustats || walstats_pending())
+    {
+        /* Retain the epoch time */
+        if (pending_since == 0)
+            pending_since = now;
 
-    /* Finally send SLRU statistics */
-    pgstat_send_slru();
+        /* The interval is doubled at every retry. */
+        if (retry_interval == 0)
+            retry_interval = PGSTAT_RETRY_MIN_INTERVAL * 1000;
+        else
+            retry_interval = retry_interval * 2;
+
+        /*
+         * Determine the next retry interval so as not to get shorter than the
+         * previous interval.
+         */
+        if (!TimestampDifferenceExceeds(pending_since,
+                                        now + 2 * retry_interval,
+                                        PGSTAT_MAX_INTERVAL))
+            next_flush = now + retry_interval;
+        else
+        {
+            next_flush = pending_since + PGSTAT_MAX_INTERVAL * 1000;
+            retry_interval = next_flush - now;
+        }
+
+        return retry_interval / 1000;
+    }
+
+    /* Set the next time to update stats */
+    next_flush = now + PGSTAT_MIN_INTERVAL * 1000;
+    retry_interval = 0;
+    pending_since = 0;
+
+    return 0;
 }
 
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * flush_tabstat - flush out a local table stats entry
+ *
+ * Some of the counters are also added to the local database stats entry
+ * after a successful flush.
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
  */
-static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+static bool
+flush_tabstat(PgStatLocalHashEntry *ent, bool nowait)
 {
-    int            n;
-    int            len;
+    static const PgStat_TableCounts all_zeroes;
+    Oid                    dboid;            /* database OID of the table */
+    PgStat_TableStatus *lstats;            /* local stats entry  */
+    PgStat_StatTabEntry *shtabstats;    /* table entry of shared stats */
+    PgStat_StatDBEntry *ldbstats;        /* local database entry */
+
+    Assert(ent->key.type == PGSTAT_TYPE_TABLE);
+    lstats = (PgStat_TableStatus *) ent->body;
+    dboid = ent->key.databaseid;
+
+    /*
+     * Ignore entries that didn't accumulate any actual counts, such as
+     * indexes that were opened by the planner but not used.
+     */
+    if (memcmp(&lstats->t_counts, &all_zeroes,
+               sizeof(PgStat_TableCounts)) == 0)
+    {
+        /* This local entry is going to be dropped, delink from relcache. */
+        pgstat_delinkstats(lstats->relation);
+        return true;
+    }
+
+    /* find shared table stats entry corresponding to the local entry */
+    shtabstats = (PgStat_StatTabEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_TABLE, dboid, ent->key.objectid,
+                             nowait);
+
+    if (shtabstats == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    /* add the values to the shared entry. */
+    shtabstats->numscans += lstats->t_counts.t_numscans;
+    shtabstats->tuples_returned += lstats->t_counts.t_tuples_returned;
+    shtabstats->tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    shtabstats->tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    shtabstats->tuples_updated += lstats->t_counts.t_tuples_updated;
+    shtabstats->tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    shtabstats->tuples_hot_updated += lstats->t_counts.t_tuples_hot_updated;
+
+    /*
+     * If table was truncated or vacuum/analyze has run, first reset the
+     * live/dead counters.
+     */
+    if (lstats->t_counts.t_truncated)
+    {
+        shtabstats->n_live_tuples = 0;
+        shtabstats->n_dead_tuples = 0;
+    }
+
+    shtabstats->n_live_tuples += lstats->t_counts.t_delta_live_tuples;
+    shtabstats->n_dead_tuples += lstats->t_counts.t_delta_dead_tuples;
+    shtabstats->changes_since_analyze += lstats->t_counts.t_changed_tuples;
+    shtabstats->inserts_since_vacuum += lstats->t_counts.t_tuples_inserted;
+    shtabstats->blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    shtabstats->blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    shtabstats->n_live_tuples = Max(shtabstats->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    shtabstats->n_dead_tuples = Max(shtabstats->n_dead_tuples, 0);
+
+    LWLockRelease(&shtabstats->header.lock);
+
+    /* The entry was flushed; add the same counts to the database stats */
+    ldbstats = get_local_dbstat_entry(dboid);
+    ldbstats->counts.n_tuples_returned += lstats->t_counts.t_tuples_returned;
+    ldbstats->counts.n_tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    ldbstats->counts.n_tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    ldbstats->counts.n_tuples_updated += lstats->t_counts.t_tuples_updated;
+    ldbstats->counts.n_tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    ldbstats->counts.n_blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    ldbstats->counts.n_blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /* This local entry is going to be dropped, delink from relcache. */
+    pgstat_delinkstats(lstats->relation);
+
+    return true;
+}
+
+/*
+ * flush_funcstat - flush out a local function stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_funcstat(PgStatLocalHashEntry *ent, bool nowait)
+{
+    PgStat_BackendFunctionEntry *localent;    /* local stats entry */
+    PgStat_StatFuncEntry *shfuncent = NULL; /* shared stats entry */
+
+    Assert(ent->key.type == PGSTAT_TYPE_FUNCTION);
+    localent = (PgStat_BackendFunctionEntry *) ent->body;
+
+    /* localent always has non-zero content */
+
+    /* find shared function stats entry corresponding to the local entry */
+    shfuncent = (PgStat_StatFuncEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             ent->key.objectid, nowait);
+    if (shfuncent == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    shfuncent->f_numcalls += localent->f_counts.f_numcalls;
+    shfuncent->f_total_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_total_time);
+    shfuncent->f_self_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_self_time);
+
+    LWLockRelease(&shfuncent->header.lock);
+
+    return true;
+}
+
+
+/*
+ * flush_dbstat - flush out a local database stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+#define PGSTAT_ACCUM_DBCOUNT(sh, lo, item)        \
+    (sh)->counts.item += (lo)->counts.item
+
+static bool
+flush_dbstat(PgStatLocalHashEntry *ent, bool nowait)
+{
+    PgStat_StatDBEntry *localent;
+    PgStat_StatDBEntry *sharedent;
+
+    Assert(ent->key.type == PGSTAT_TYPE_DB);
+
+    localent = (PgStat_StatDBEntry *) ent->body;
+
+    /* find shared database stats entry corresponding to the local entry */
+    sharedent = (PgStat_StatDBEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_DB, ent->key.databaseid, InvalidOid,
+                             nowait);
+
+    if (!sharedent)
+        return false;            /* failed to acquire lock, skip */
+
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, n_tuples_returned);
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, n_tuples_fetched);
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, n_tuples_inserted);
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, n_tuples_updated);
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, n_tuples_deleted);
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, n_blocks_fetched);
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, n_blocks_hit);
+
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, n_deadlocks);
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, n_temp_bytes);
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, n_temp_files);
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, n_checksum_failures);
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, n_sessions);
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, total_session_time);
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, total_active_time);
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, total_idle_in_xact_time);
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, n_sessions_abandoned);
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, n_sessions_fatal);
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, n_sessions_killed);
 
     /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
+     * Accumulate xact commit/rollback and I/O timings to stats entry of the
+     * current database.
      */
-    if (OidIsValid(tsmsg->m_databaseid))
+    if (OidIsValid(ent->key.databaseid))
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
+        sharedent->counts.n_xact_commit += pgStatXactCommit;
+        sharedent->counts.n_xact_rollback += pgStatXactRollback;
+        sharedent->counts.n_block_read_time += pgStatBlockReadTime;
+        sharedent->counts.n_block_write_time += pgStatBlockWriteTime;
         pgStatXactCommit = 0;
         pgStatXactRollback = 0;
         pgStatBlockReadTime = 0;
@@ -1011,281 +1557,138 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        sharedent->counts.n_xact_commit = 0;
+        sharedent->counts.n_xact_rollback = 0;
+        sharedent->counts.n_block_read_time = 0;
+        sharedent->counts.n_block_write_time = 0;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
+    LWLockRelease(&sharedent->header.lock);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    return true;
 }
 
-/*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
- */
-static void
-pgstat_send_funcstats(void)
-{
-    /* we assume this inits to all zeroes: */
-    static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
-
-    if (pgStatFunctions == NULL)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
-
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
-
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
-
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
-
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
-
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
-}
-
-
 /* ----------
  * pgstat_vacuum_stat() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *  Delete shared stat entries that are not in system catalogs.
+ *
+ *  To avoid holding exclusive lock on dshash for a long time, the process is
+ *  performed in three steps.
+ *
+ *   1: Collect existent oids of every kind of object.
+ *   2: Collect victim entries by scanning with shared lock.
+ *   3: Try removing every nominated entry without waiting for lock.
+ *
+ *  As a consequence of the last step, some entries may be left behind due
+ *  to lock failure, but that is not a problem since they will be deleted by
+ *  later vacuums.
  * ----------
  */
 void
 pgstat_vacuum_stat(void)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Read pg_database and make a list of OIDs of all existing databases
-     */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
-
-    /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            dbid = dbentry->databaseid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        /* the DB entry for shared tables (with InvalidOid) is never dropped */
-        if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
-            pgstat_drop_database(dbid);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Lookup our own database entry; if not found, nothing more to do.
-     */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
+    pgstat_oid_hash       *dbids;    /* database ids */
+    pgstat_oid_hash       *relids;    /* relation ids in the current database */
+    pgstat_oid_hash       *funcids;/* function ids in the current database */
+    int            nvictims = 0;    /* # of entries of the above */
+    dshash_seq_status dshstat;
+    PgStatHashEntry *ent;
+
+    /* we don't collect stats in standalone mode */
+    if (!IsUnderPostmaster)
         return;
 
-    /*
-     * Similarly to above, make a list of all known relations in this DB.
-     */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
+    /* collect OIDs of existing objects */
+    dbids = collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    relids = collect_oids(RelationRelationId, Anum_pg_class_oid);
+    funcids = collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
 
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
+    nvictims = 0;
 
-    /*
-     * Check for all tables listed in stats hashtable if they still exist.
-     */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+    /* some of the dshash entries are to be removed, take exclusive lock. */
+    dshash_seq_init(&dshstat, pgStatSharedHash, true);
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
     {
-        Oid            tabid = tabentry->tableid;
+        pgstat_oid_hash *oidhash;
+        Oid           key;
 
         CHECK_FOR_INTERRUPTS();
 
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+        /*
+         * Skip entries that belong to other databases; only database
+         * entries are checked regardless of the current database.
+         */
+        if (ent->key.type != PGSTAT_TYPE_DB &&
+            ent->key.databaseid != MyDatabaseId)
             continue;
 
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
-    }
-
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
+        switch (ent->key.type)
         {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
+            case PGSTAT_TYPE_DB:
+                /* don't remove database entry for shared tables */
+                if (ent->key.databaseid == 0)
+                    continue;
+                oidhash = dbids;
+                key = ent->key.databaseid;
+                break;
+
+            case PGSTAT_TYPE_TABLE:
+                oidhash = relids;
+                key = ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                oidhash = funcids;
+                key = ent->key.objectid;
+                break;
+
+            case PGSTAT_TYPE_REPLSLOT:
+                /*
+                 * We don't bother vacuuming entries of this kind because
+                 * there are only a few of them and they are likely to be
+                 * reused soon.
+                 */
                 continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
         }
 
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
-        }
+        /* Skip objects that still exist. */
+        if (pgstat_oid_lookup(oidhash, key) != NULL)
+            continue;
 
-        hash_destroy(htab);
+        /* drop this entry */
+        delete_current_stats_entry(&dshstat);
+        nvictims++;
     }
+    dshash_seq_term(&dshstat);
+    pgstat_oid_destroy(dbids);
+    pgstat_oid_destroy(relids);
+    pgstat_oid_destroy(funcids);
+
+    if (nvictims > 0)
+        pg_atomic_add_fetch_u64(&StatsShmem->gc_count, 1);
 }
 
-
 /* ----------
- * pgstat_collect_oids() -
+ * collect_oids() -
  *
  *    Collect the OIDs of all objects listed in the specified system catalog
- *    into a temporary hash table.  Caller should hash_destroy the result
+ *    into a temporary hash table.  Caller should pgstat_oid_destroy the result
  *    when done with it.  (However, we make the table in CurrentMemoryContext
  *    so that it will be freed properly in event of an error.)
  * ----------
  */
-static HTAB *
-pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
+static pgstat_oid_hash *
+collect_oids(Oid catalogid, AttrNumber anum_oid)
 {
-    HTAB       *htab;
-    HASHCTL        hash_ctl;
+    pgstat_oid_hash *rethash;
     Relation    rel;
     TableScanDesc scan;
     HeapTuple    tup;
     Snapshot    snapshot;
 
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(Oid);
-    hash_ctl.hcxt = CurrentMemoryContext;
-    htab = hash_create("Temporary table of OIDs",
-                       PGSTAT_TAB_HASH_SIZE,
-                       &hash_ctl,
-                       HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    rethash = pgstat_oid_create(CurrentMemoryContext,
+                                PGSTAT_TABLE_HASH_SIZE, NULL);
 
     rel = table_open(catalogid, AccessShareLock);
     snapshot = RegisterSnapshot(GetLatestSnapshot());
@@ -1294,123 +1697,60 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     {
         Oid            thisoid;
         bool        isnull;
+        bool        found;
 
         thisoid = heap_getattr(tup, anum_oid, RelationGetDescr(rel), &isnull);
         Assert(!isnull);
 
         CHECK_FOR_INTERRUPTS();
 
-        (void) hash_search(htab, (void *) &thisoid, HASH_ENTER, NULL);
+        pgstat_oid_insert(rethash, thisoid, &found);
     }
     table_endscan(scan);
     UnregisterSnapshot(snapshot);
     table_close(rel, AccessShareLock);
 
-    return htab;
+    return rethash;
 }
 
-
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
- * ----------
+ *    Remove entry for the database that we just dropped.
+ *
+ *    Some entries might be left behind due to lock failure, or some stats
+ *    might be flushed after this, but we will still clean up the dead DB
+ *    eventually via future invocations of pgstat_vacuum_stat().
+ * ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
-
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
-}
-#endif                            /* NOT_USED */
-
-
-/* ----------
- * pgstat_send_connstats() -
- *
- *    Tell the collector about session statistics.
- *    The parameter "disconnect" will be true when the backend exits.
- *    "last_report" is the last time we were called (0 if never).
- * ----------
- */
-static void
-pgstat_send_connstats(bool disconnect, TimestampTz last_report)
-{
-    PgStat_MsgConn msg;
-    long        secs;
-    int            usecs;
+    Assert(OidIsValid(databaseid));
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CONNECTION);
-    msg.m_databaseid = MyDatabaseId;
-
-    /* session time since the last report */
-    TimestampDifference(((last_report == 0) ? MyStartTimestamp : last_report),
-                        GetCurrentTimestamp(),
-                        &secs, &usecs);
-    msg.m_session_time = secs * 1000000 + usecs;
-
-    msg.m_disconnect = disconnect ? pgStatSessionEndCause : DISCONNECT_NOT_YET;
-
-    msg.m_active_time = pgStatActiveTime;
-    pgStatActiveTime = 0;
-
-    msg.m_idle_in_xact_time = pgStatTransactionIdleTime;
-    pgStatTransactionIdleTime = 0;
-
-    /* report a new session only the first time */
-    msg.m_count = (last_report == 0) ? 1 : 0;
-
-    pgstat_send(&msg, sizeof(PgStat_MsgConn));
+    /* some of the dshash entries are to be removed, take exclusive lock. */
+    dshash_seq_init(&hstat, pgStatSharedHash, true);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+    {
+        if (p->key.databaseid == databaseid)
+            delete_current_stats_entry(&hstat);
+    }
+    dshash_seq_term(&hstat);
+
+    /* Let readers run a garbage collection of local hashes */
+    pg_atomic_add_fetch_u64(&StatsShmem->gc_count, 1);
 }
 
-
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1419,53 +1759,146 @@ pgstat_send_connstats(bool disconnect, TimestampTz last_report)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    /* dshash entry is not modified, take shared lock */
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+    {
+        PgStat_StatEntryHeader *header;
+
+        if (p->key.databaseid != MyDatabaseId)
+            continue;
+
+        header = dsa_get_address(area, p->body);
+
+        LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+        memset(PGSTAT_SHENT_BODY(header), 0,
+               PGSTAT_SHENT_BODY_LEN(p->key.type));
+
+        if (p->key.type == PGSTAT_TYPE_DB)
+        {
+            PgStat_StatDBEntry *dbstat = (PgStat_StatDBEntry *) header;
+            dbstat->stat_reset_timestamp = GetCurrentTimestamp();
+        }
+        LWLockRelease(&header->lock);
+    }
+    dshash_seq_term(&hstat);
+
+    /* Invalidate the simple cache keys */
+    cached_dbent_key = stathashkey_zero;
+    cached_tabent_key = stathashkey_zero;
+    cached_funcent_key = stathashkey_zero;
+}
+
+/*
+ * pgstat_copy_global_stats - helper function for functions
+ *           pgstat_fetch_stat_*() and pgstat_reset_shared_counters().
+ *
+ * Copies out the specified memory area following change-count protocol.
+ */
+static inline void
+pgstat_copy_global_stats(void *dst, void *src, size_t len,
+                         pg_atomic_uint32 *count)
+{
+    uint32 before_changecount;
+    uint32 after_changecount;
+
+    after_changecount = pg_atomic_read_u32(count);
+
+    do
+    {
+        before_changecount = after_changecount;
+        memcpy(dst, src, len);
+        after_changecount = pg_atomic_read_u32(count);
+    } while ((before_changecount & 1) == 1 ||
+             after_changecount != before_changecount);
 }
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
+ *
+ *  We don't scribble on shared stats while resetting to avoid locking on
+ *  shared stats struct. Instead, just record the current counters in another
+ *  shared struct, which is protected by StatsLock. See
+ *  pgstat_fetch_stat_(archiver|bgwriter|checkpointer) for the reader side.
  * ----------
  */
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    TimestampTz now = GetCurrentTimestamp();
+    PgStat_Shared_Reset_Target t;
 
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+        t = RESET_ARCHIVER;
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+        t = RESET_BGWRITER;
     else if (strcmp(target, "wal") == 0)
-        msg.m_resettarget = RESET_WAL;
+        t = RESET_WAL;
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\", \"bgwriter\" or \"wal\".")));
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
+    /* Reset statistics for the cluster. */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+    switch (t)
+    {
+        case RESET_ARCHIVER:
+            pgstat_copy_global_stats(&StatsShmem->archiver_reset_offset,
+                                     &StatsShmem->archiver_stats,
+                                     sizeof(PgStat_Archiver),
+                                     &StatsShmem->archiver_changecount);
+            StatsShmem->archiver_reset_offset.stat_reset_timestamp = now;
+            cached_archiverstats_is_valid = false;
+            break;
+
+        case RESET_BGWRITER:
+            pgstat_copy_global_stats(&StatsShmem->bgwriter_reset_offset,
+                                     &StatsShmem->bgwriter_stats,
+                                     sizeof(PgStat_BgWriter),
+                                     &StatsShmem->bgwriter_changecount);
+            pgstat_copy_global_stats(&StatsShmem->checkpointer_reset_offset,
+                                     &StatsShmem->checkpointer_stats,
+                                     sizeof(PgStat_CheckPointer),
+                                     &StatsShmem->checkpointer_changecount);
+            StatsShmem->bgwriter_reset_offset.stat_reset_timestamp = now;
+            cached_bgwriterstats_is_valid = false;
+            cached_checkpointerstats_is_valid = false;
+            break;
+
+        case RESET_WAL:
+            /*
+             * Unlike the two above, WAL statistics have many writer
+             * processes and are protected by wal_stats_lock.
+             */
+            LWLockAcquire(&StatsShmem->wal_stats_lock, LW_EXCLUSIVE);
+            MemSet(&StatsShmem->wal_stats, 0, sizeof(PgStat_Wal));
+            StatsShmem->wal_stats.stat_reset_timestamp = now;
+            LWLockRelease(&StatsShmem->wal_stats_lock);
+            cached_walstats_is_valid = false;
+            break;
+    }
+
+    LWLockRelease(StatsLock);
 }
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1474,17 +1907,37 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatEntryHeader *header;
+    PgStat_StatDBEntry *dbentry;
+    PgStatTypes stattype;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    dbentry = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, MyDatabaseId, InvalidOid, false, false,
+                       NULL);
+    Assert(dbentry);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* Set the reset timestamp for the whole database */
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->header.lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&dbentry->header.lock);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Zero the counters of the specified object */
+    switch (type)
+    {
+        case RESET_TABLE:
+            stattype = PGSTAT_TYPE_TABLE;
+            break;
+        case RESET_FUNCTION:
+            stattype = PGSTAT_TYPE_FUNCTION;
+    }
+
+    header = get_stat_entry(stattype, MyDatabaseId, objoid, false, false, NULL);
+
+    LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+    memset(PGSTAT_SHENT_BODY(header), 0, PGSTAT_SHENT_BODY_LEN(stattype));
+    LWLockRelease(&header->lock);
 }
 
 /* ----------
@@ -1500,15 +1953,40 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_reset_slru_counter(const char *name)
 {
-    PgStat_MsgResetslrucounter msg;
+    int i;
+    TimestampTz    ts = GetCurrentTimestamp();
+    uint32    assert_changecount PG_USED_FOR_ASSERTS_ONLY;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    if (name)
+    {
+        i = pgstat_slru_index(name);
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+        assert_changecount =
+            pg_atomic_fetch_add_u32(&StatsShmem->slru_changecount, 1);
+        Assert((assert_changecount & 1) == 0);
+        MemSet(&StatsShmem->slru_stats.entry[i], 0,
+               sizeof(PgStat_SLRUStats));
+        StatsShmem->slru_stats.entry[i].stat_reset_timestamp = ts;
+        pg_atomic_add_fetch_u32(&StatsShmem->slru_changecount, 1);
+        LWLockRelease(&StatsShmem->slru_stats.lock);
+    }
+    else
+    {
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+        assert_changecount =
+            pg_atomic_fetch_add_u32(&StatsShmem->slru_changecount, 1);
+        Assert((assert_changecount & 1) == 0);
+        for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
+        {
+            MemSet(&StatsShmem->slru_stats.entry[i], 0,
+                   sizeof(PgStat_SLRUStats));
+            StatsShmem->slru_stats.entry[i].stat_reset_timestamp = ts;
+        }
+        pg_atomic_add_fetch_u32(&StatsShmem->slru_changecount, 1);
+        LWLockRelease(&StatsShmem->slru_stats.lock);
+    }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSLRUCOUNTER);
-    msg.m_index = (name) ? pgstat_slru_index(name) : -1;
-
-    pgstat_send(&msg, sizeof(msg));
+    cached_slrustats_is_valid = false;
 }
 
 /* ----------
@@ -1524,20 +2002,19 @@ pgstat_reset_slru_counter(const char *name)
 void
 pgstat_reset_replslot_counter(const char *name)
 {
-    PgStat_MsgResetreplslotcounter msg;
+    int            startidx;
+    int            endidx;
+    int            i;
+    TimestampTz    ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
     if (name)
     {
         ReplicationSlot *slot;
-
-        /*
-         * Check if the slot exits with the given name. It is possible that by
-         * the time this message is executed the slot is dropped but at least
-         * this check will ensure that the given name is for a valid slot.
-         */
+
+        /* Check if a slot exists with the given name. */
         LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
         slot = SearchNamedReplicationSlot(name);
         LWLockRelease(ReplicationSlotControlLock);
@@ -1555,15 +2032,36 @@ pgstat_reset_replslot_counter(const char *name)
         if (SlotIsPhysical(slot))
             return;
 
-        strlcpy(msg.m_slotname, name, NAMEDATALEN);
-        msg.clearall = false;
+        /* reset this one entry */
+        startidx = endidx = slot - ReplicationSlotCtl->replication_slots;
     }
     else
-        msg.clearall = true;
+    {
+        /* reset all existing entries */
+        startidx = 0;
+        endidx = max_replication_slots - 1;
+    }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETREPLSLOTCOUNTER);
+    ts = GetCurrentTimestamp();
+    for (i = startidx; i <= endidx; i++)
+    {
+        PgStat_ReplSlot *shent;
 
-    pgstat_send(&msg, sizeof(msg));
+        shent = (PgStat_ReplSlot *)
+            get_stat_entry(PGSTAT_TYPE_REPLSLOT,
+                           MyDatabaseId, i, false, false, NULL);
+
+        /* Skip non-existent entries */
+        if (!shent)
+            continue;
+
+        LWLockAcquire(&shent->header.lock, LW_EXCLUSIVE);
+        memset(&shent->spill_txns, 0,
+               offsetof(PgStat_ReplSlot, stat_reset_timestamp) -
+               offsetof(PgStat_ReplSlot, spill_txns));
+        shent->stat_reset_timestamp = ts;
+        LWLockRelease(&shent->header.lock);
+    }
 }
 
 /* ----------
@@ -1577,48 +2075,93 @@ pgstat_reset_replslot_counter(const char *name)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if activity stats is not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    /*
+     * End-of-vacuum is reported instantly. Report the start the same way for
+     * consistency. Vacuum doesn't run frequently and is a long-lasting
+     * operation so it doesn't matter if we get blocked here a little.
+     */
+    dbentry = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, dboid, InvalidOid, false, true, NULL);
 
-    pgstat_send(&msg, sizeof(msg));
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->header.lock, LW_EXCLUSIVE);
+    dbentry->last_autovac_time = ts;
+    LWLockRelease(&dbentry->header.lock);
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report about the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    PgStat_StatTabEntry *tabentry;
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    ts = GetCurrentTimestamp();
+
+    /*
+     * Unlike ordinary operations, maintenance commands take a long time, so
+     * getting blocked at the end of the work doesn't matter. Furthermore,
+     * this prevents stats updates made by transactions that end after this
+     * vacuum from being canceled by a delayed vacuum report. Update the
+     * shared stats entry directly for these reasons.
+     */
+    tabentry = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, tableoid, false, true, NULL);
+
+    LWLockAcquire(&tabentry->header.lock, LW_EXCLUSIVE);
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * It is quite possible that a non-aggressive VACUUM ended up skipping
+     * various pages, however, we'll zero the insert counter here regardless.
+     * It's currently used only to track when we need to perform an "insert"
+     * autovacuum, which are mainly intended to freeze newly inserted tuples.
+     * Zeroing this may just mean we'll not try to vacuum the table again
+     * until enough tuples have been inserted to trigger another insert
+     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
+     * stragglers.
+     */
+    tabentry->inserts_since_vacuum = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = ts;
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = ts;
+        tabentry->vacuum_count++;
+    }
+
+    LWLockRelease(&tabentry->header.lock);
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report about the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1629,9 +2172,11 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    PgStat_StatTabEntry *tabentry;
+    Oid        dboid = (rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId);
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1639,10 +2184,10 @@ pgstat_report_analyze(Relation rel,
      * already inserted and/or deleted rows in the target table. ANALYZE will
      * have counted such rows as live or dead respectively. Because we will
      * report our counts of such rows at transaction end, we should subtract
-     * off these counts from what we send to the collector now, else they'll
-     * be double-counted after commit.  (This approach also ensures that the
-     * collector ends up with the right numbers if we abort instead of
-     * committing.)
+     * off these counts from what we write to shared stats now, else
+     * they'll be double-counted after commit.  (This approach also ensures
+     * that the shared stats end up with the right numbers if we abort
+     * instead of committing.)
      */
     if (rel->pgstat_info != NULL)
     {
@@ -1660,137 +2205,223 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Unlike ordinary operations, maintenance commands take a long time, so
+     * getting blocked at the end of the work doesn't matter. Furthermore,
+     * this prevents stats updates made by transactions that end after this
+     * analyze from being canceled by a delayed analyze report. Update the
+     * shared stats entry directly for these reasons.
+     */
+    tabentry = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, RelationGetRelid(rel),
+                       false, true, NULL);
+
+    LWLockAcquire(&tabentry->header.lock, LW_EXCLUSIVE);
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    LWLockRelease(&tabentry->header.lock);
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            dbent->counts.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            dbent->counts.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            dbent->counts.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            dbent->counts.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            dbent->counts.n_conflict_startup_deadlock++;
+            break;
+    }
 }
 
+
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a detected deadlock.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_deadlocks++;
 }
 
-
-
 /* --------
- * pgstat_report_checksum_failures_in_db() -
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
  *
- *    Tell the collector about one or more checksum failures.
+ *    Report one or more checksum failures.
  * --------
  */
 void
 pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
 {
-    PgStat_MsgChecksumFailure msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
+    dbentry = get_local_dbstat_entry(dboid);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* add the reported failure count to the accumulated counter */
+    dbentry->counts.n_checksum_failures += failurecount;
 }
 
 /* --------
  * pgstat_report_checksum_failure() -
  *
- *    Tell the collector about a checksum failure.
+ *    Report a checksum failure.
  * --------
  */
 void
 pgstat_report_checksum_failure(void)
 {
-    pgstat_report_checksum_failures_in_db(MyDatabaseId, 1);
+    PgStat_StatDBEntry *dbent;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_checksum_failures++;
 }
 
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
+    if (filesize == 0)            /* Is there a case where filesize is really 0? */
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_temp_bytes += filesize; /* XXX: needs an overflow check */
+    dbent->counts.n_temp_files++;
 }
 
 /* ----------
  * pgstat_report_replslot() -
  *
- *    Tell the collector about replication slot statistics.
+ *    Report replication slot activity.
  * ----------
  */
 void
-pgstat_report_replslot(const char *slotname, int spilltxns, int spillcount,
-                       int spillbytes, int streamtxns, int streamcount, int streambytes)
+pgstat_report_replslot(const char *slotname,
+                       int spilltxns, int spillcount, int spillbytes,
+                       int streamtxns, int streamcount, int streambytes)
 {
-    PgStat_MsgReplSlot msg;
+    PgStat_ReplSlot *shent;
+    int                 i;
+    bool             found;
+
+    if (!area)
+        return;
+
+    for (i = 0 ; i < max_replication_slots ; i++)
+    {
+        if (strcmp(NameStr(ReplicationSlotCtl->replication_slots[i].data.name),
+                   slotname) == 0)
+            break;
+
+    }
 
     /*
-     * Prepare and send the message
+     * The slot must have been removed; just ignore it.  We will create the
+     * entry for a slot with this name the next time it reports.
      */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_REPLSLOT);
-    strlcpy(msg.m_slotname, slotname, NAMEDATALEN);
-    msg.m_drop = false;
-    msg.m_spill_txns = spilltxns;
-    msg.m_spill_count = spillcount;
-    msg.m_spill_bytes = spillbytes;
-    msg.m_stream_txns = streamtxns;
-    msg.m_stream_count = streamcount;
-    msg.m_stream_bytes = streambytes;
-    pgstat_send(&msg, sizeof(PgStat_MsgReplSlot));
+    if (i == max_replication_slots)
+        return;
+
+    shent = (PgStat_ReplSlot *)
+        get_stat_entry(PGSTAT_TYPE_REPLSLOT,
+                       MyDatabaseId, i, false, true, &found);
+
+    /* Clear the counters and reset dropped when we reuse it */
+    LWLockAcquire(&shent->header.lock, LW_EXCLUSIVE);
+    if (shent->header.dropped || !found)
+    {
+        memset(&shent->spill_txns, 0,
+               sizeof(PgStat_ReplSlot) - offsetof(PgStat_ReplSlot, spill_txns));
+        strlcpy(shent->slotname, slotname, NAMEDATALEN);
+        shent->header.dropped = false;
+    }
+
+    shent->spill_txns += spilltxns;
+    shent->spill_count += spillcount;
+    shent->spill_bytes += spillbytes;
+    shent->stream_txns += streamtxns;
+    shent->stream_count += streamcount;
+    shent->stream_bytes += streambytes;
+    LWLockRelease(&shent->header.lock);
 }
 
 /* ----------
@@ -1802,55 +2433,44 @@ pgstat_report_replslot(const char *slotname, int spilltxns, int spillcount,
 void
 pgstat_report_replslot_drop(const char *slotname)
 {
-    PgStat_MsgReplSlot msg;
+    int i;
+    PgStat_ReplSlot *shent;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_REPLSLOT);
-    strlcpy(msg.m_slotname, slotname, NAMEDATALEN);
-    msg.m_drop = true;
-    pgstat_send(&msg, sizeof(PgStat_MsgReplSlot));
-}
+    Assert(area);
+    if (!area)
+        return;
 
-/* ----------
- * pgstat_ping() -
- *
- *    Send some junk data to the collector to increase traffic.
- * ----------
- */
-void
-pgstat_ping(void)
-{
-    PgStat_MsgDummy msg;
+    for (i = 0 ; i < max_replication_slots ; i++)
+    {
+        if (strcmp(NameStr(ReplicationSlotCtl->replication_slots[i].data.name),
+                   slotname) == 0)
+            break;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    /* XXX: maybe the slot has already been removed; just ignore it. */
+    if (i == max_replication_slots)
+        return;
+
+    shent = (PgStat_ReplSlot *)
+        get_stat_entry(PGSTAT_TYPE_REPLSLOT,
+                       MyDatabaseId, i, false, false, NULL);
+
+    if (shent && !shent->header.dropped)
+    {
+        LWLockAcquire(&shent->header.lock, LW_EXCLUSIVE);
+        shent->header.dropped = true;
+        LWLockRelease(&shent->header.lock);
+    }
 }
 
 /* ----------
- * pgstat_send_inquiry() -
+ * pgstat_init_function_usage() -
  *
- *    Notify collector that we need fresh data.
+ *  Initialize function call usage data.
+ *  Called by the executor before invoking a function.
  * ----------
  */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/*
- * Initialize function call usage data.
- * Called by the executor before invoking a function.
- */
 void
 pgstat_init_function_usage(FunctionCallInfo fcinfo,
                            PgStat_FunctionCallUsage *fcu)
@@ -1865,24 +2485,9 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
         return;
     }
 
-    if (!pgStatFunctions)
-    {
-        /* First time through - initialize function stat table */
-        HASHCTL        hash_ctl;
-
-        hash_ctl.keysize = sizeof(Oid);
-        hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
-        pgStatFunctions = hash_create("Function stat entries",
-                                      PGSTAT_FUNCTION_HASH_SIZE,
-                                      &hash_ctl,
-                                      HASH_ELEM | HASH_BLOBS);
-    }
-
-    /* Get the stats entry for this function, create if necessary */
-    htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
-                          HASH_ENTER, &found);
-    if (!found)
-        MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
+    htabent = (PgStat_BackendFunctionEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             fcinfo->flinfo->fn_oid, true, &found);
 
     fcu->fs = &htabent->f_counts;
 
@@ -1896,31 +2501,37 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
     INSTR_TIME_SET_CURRENT(fcu->f_start);
 }
 
-/*
- * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
- *        for specified function
+/* ----------
+ * find_funcstat_entry() -
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_BackendFunctionEntry for the specified function.
+ *
+ *  If there is no entry, return NULL without creating a new one.
+ * ----------
  */
 PgStat_BackendFunctionEntry *
 find_funcstat_entry(Oid func_id)
 {
-    if (pgStatFunctions == NULL)
-        return NULL;
+    PgStat_BackendFunctionEntry *ent;
 
-    return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
-                                                       (void *) &func_id,
-                                                       HASH_FIND, NULL);
+    ent = (PgStat_BackendFunctionEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             func_id, false, NULL);
+
+    return ent;
 }
 
-/*
- * Calculate function call usage and update stat counters.
- * Called by the executor after invoking a function.
+/* ----------
+ * pgstat_end_function_usage() -
  *
- * In the case of a set-returning function that runs in value-per-call mode,
- * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
- * calls for what the user considers a single call of the function.  The
- * finalize flag should be TRUE on the last call.
+ *  Calculate function call usage and update stat counters.
+ *  Called by the executor after invoking a function.
+ *
+ *  In the case of a set-returning function that runs in value-per-call mode,
+ *  we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
+ *  calls for what the user considers a single call of the function.  The
+ *  finalize flag should be TRUE on the last call.
+ * ----------
  */
 void
 pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
@@ -1961,9 +2572,6 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
         fs->f_numcalls++;
     fs->f_total_time = f_total;
     INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
 }
 
 
@@ -1975,8 +2583,7 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
  *
  *    We assume that a relcache entry's pgstat_info field is zeroed by
  *    relcache.c when the relcache entry is made; thereafter it is long-lived
- *    data.  We can avoid repeated searches of the TabStatus arrays when the
- *    same relation is touched repeatedly within a transaction.
+ *    data.
  * ----------
  */
 void
@@ -1992,7 +2599,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -2003,120 +2611,60 @@ pgstat_initstats(Relation rel)
      * If we already set up this relation in the current transaction, nothing
      * to do.
      */
-    if (rel->pgstat_info != NULL &&
-        rel->pgstat_info->t_id == rel_id)
+    if (rel->pgstat_info != NULL)
         return;
 
     /* Else find or make the PgStat_TableStatus entry, and update link */
-    rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+    rel->pgstat_info = get_local_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+
+    /* don't allow linking a stats entry to multiple relcache entries */
+    Assert(rel->pgstat_info->relation == NULL);
+    /* mark this relation as the owner */
+    rel->pgstat_info->relation = rel;
 }
 
 /*
- * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
+ * pgstat_delinkstats() -
+ *
+ *  Break the mutual link between a relcache entry and a local stats entry.
+ *  This must always be called when either end of the link is removed.
  */
-static PgStat_TableStatus *
-get_tabstat_entry(Oid rel_id, bool isshared)
+void
+pgstat_delinkstats(Relation rel)
 {
-    TabStatHashEntry *hash_entry;
-    PgStat_TableStatus *entry;
-    TabStatusArray *tsa;
-    bool        found;
-
-    /*
-     * Create hash table if we don't have it already.
-     */
-    if (pgStatTabHash == NULL)
+    /* remove the link to stats info if any */
+    if (rel && rel->pgstat_info)
     {
-        HASHCTL        ctl;
-
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
+        /* link sanity check */
+        Assert(rel->pgstat_info->relation == rel);
+        rel->pgstat_info->relation = NULL;
+        rel->pgstat_info = NULL;
     }
-
-    /*
-     * Find an entry or create a new one.
-     */
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
-    if (!found)
-    {
-        /* initialize new entry with null pointer */
-        hash_entry->tsa_entry = NULL;
-    }
-
-    /*
-     * If entry is already valid, we're done.
-     */
-    if (hash_entry->tsa_entry)
-        return hash_entry->tsa_entry;
-
-    /*
-     * Locate the first pgStatTabList entry with free space, making a new list
-     * entry if needed.  Note that we could get an OOM failure here, but if so
-     * we have left the hashtable and the list in a consistent state.
-     */
-    if (pgStatTabList == NULL)
-    {
-        /* Set up first pgStatTabList entry */
-        pgStatTabList = (TabStatusArray *)
-            MemoryContextAllocZero(TopMemoryContext,
-                                   sizeof(TabStatusArray));
-    }
-
-    tsa = pgStatTabList;
-    while (tsa->tsa_used >= TABSTAT_QUANTUM)
-    {
-        if (tsa->tsa_next == NULL)
-            tsa->tsa_next = (TabStatusArray *)
-                MemoryContextAllocZero(TopMemoryContext,
-                                       sizeof(TabStatusArray));
-        tsa = tsa->tsa_next;
-    }
-
-    /*
-     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
-     * the entry was already zeroed, either at creation or after last use.
-     */
-    entry = &tsa->tsa_entries[tsa->tsa_used++];
-    entry->t_id = rel_id;
-    entry->t_shared = isshared;
-
-    /*
-     * Now we can fill the entry in pgStatTabHash.
-     */
-    hash_entry->tsa_entry = entry;
-
-    return entry;
 }
 
 /*
  * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_TableStatus entry for rel_id in the current
+ *  database. If not found, look among the shared tables.
  *
- * Note: if we got an error in the most recent execution of pgstat_report_stat,
- * it's possible that an entry exists but there's no hashtable entry for it.
- * That's okay, we'll treat this case as "doesn't exist".
+ *  If no entry is found, return NULL; don't create a new one.
+ * ----------
  */
 PgStat_TableStatus *
 find_tabstat_entry(Oid rel_id)
 {
-    TabStatHashEntry *hash_entry;
+    PgStat_TableStatus *ent;
 
-    /* If hashtable doesn't exist, there are no entries at all */
-    if (!pgStatTabHash)
-        return NULL;
+    ent = (PgStat_TableStatus *)
+        get_local_stat_entry(PGSTAT_TYPE_TABLE, MyDatabaseId, rel_id,
+                             false, NULL);
+    if (!ent)
+        ent = (PgStat_TableStatus *)
+            get_local_stat_entry(PGSTAT_TYPE_TABLE, InvalidOid, rel_id,
+                                 false, NULL);
 
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
-    if (!hash_entry)
-        return NULL;
-
-    /* Note that this step could also return NULL, but that's correct */
-    return hash_entry->tsa_entry;
+    return ent;
 }
 
 /*
@@ -2531,8 +3079,6 @@ AtPrepare_PgStat(void)
             record.inserted_pre_trunc = trans->inserted_pre_trunc;
             record.updated_pre_trunc = trans->updated_pre_trunc;
             record.deleted_pre_trunc = trans->deleted_pre_trunc;
-            record.t_id = tabstat->t_id;
-            record.t_shared = tabstat->t_shared;
             record.t_truncated = trans->truncated;
 
             RegisterTwoPhaseRecord(TWOPHASE_RM_PGSTAT_ID, 0,
@@ -2547,8 +3093,8 @@ AtPrepare_PgStat(void)
  *
  * All we need do here is unlink the transaction stats state from the
  * nontransactional state.  The nontransactional action counts will be
- * reported to the stats collector immediately, while the effects on live
- * and dead tuple counts are preserved in the 2PC state file.
+ * reported to the activity stats facility immediately, while the effects on
+ * live and dead tuple counts are preserved in the 2PC state file.
  *
  * Note: AtEOXact_PgStat is not called during PREPARE.
  */
@@ -2593,7 +3139,7 @@ pgstat_twophase_postcommit(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, commit case */
     pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
@@ -2629,7 +3175,7 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, abort case */
     if (rec->t_truncated)
@@ -2649,85 +3195,138 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    Find the database stats entry on backends.
+ *
+ *    The returned entry is stored in static memory, so the content is valid
+ *    until the next call of the same function for a different database.
  * ----------
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    PgStat_StatDBEntry *shent;
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for InvalidOid */
+    Assert(dbid != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_dbent_key.databaseid == dbid)
+        return &cached_dbent;
+
+    shent = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid, true, false, NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_dbent, shent, sizeof(PgStat_StatDBEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_dbent_key.databaseid = dbid;
+
+    return &cached_dbent;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one table or NULL. NULL doesn't mean
+ *    the activity statistics for one table or NULL. NULL doesn't mean
  *    that the table doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    activity statistics facilities, so the caller is better off to
+ *    report ZERO instead.
  * ----------
  */
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
-    PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_extended(false, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
-
-    return NULL;
+    tabentry = pgstat_fetch_stat_tabentry_extended(true, relid);
+    return tabentry;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_tabentry_extended() -
+ *
+ *    Find the table stats entry on backends. The returned entry is stored in
+ *    static memory, so the content is valid until the next call of the same
+ *    function for a different table.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_extended(bool shared, Oid reloid)
+{
+    PgStat_StatTabEntry *shent;
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for the InvalidOid */
+    Assert(reloid != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_tabent_key.databaseid == dboid &&
+        cached_tabent_key.objectid == reloid)
+        return &cached_tabent;
+
+    shent = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, reloid, true, false, NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_tabent, shent, sizeof(PgStat_StatTabEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_tabent_key.databaseid = dboid;
+    cached_tabent_key.objectid = reloid;
+
+    return &cached_tabent;
+}
+
+
+/* ----------
+ * pgstat_copy_index_counters() -
+ *
+ *    Support function for index swapping. Copy a portion of the counters of
+ *    the relation to the specified place.
+ * ----------
+ */
+void
+pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst)
+{
+    PgStat_StatTabEntry *tabentry;
+
+    /* No point fetching tabentry when dst is NULL */
+    if (!dst)
+        return;
+
+    tabentry = pgstat_fetch_stat_tabentry(relid);
+
+    if (!tabentry)
+        return;
+
+    dst->t_counts.t_numscans = tabentry->numscans;
+    dst->t_counts.t_tuples_returned = tabentry->tuples_returned;
+    dst->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
+    dst->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
+    dst->t_counts.t_blocks_hit = tabentry->blocks_hit;
 }
 
 
@@ -2736,30 +3335,46 @@ pgstat_fetch_stat_tabentry(Oid relid)
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
  *    the collected statistics for one function or NULL.
+ *
+ *    The returned entry is stored in static memory, so the content is valid
+ *    until the next call of the same function for a different function id.
  * ----------
  */
 PgStat_StatFuncEntry *
 pgstat_fetch_stat_funcentry(Oid func_id)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry = NULL;
-
-    /* load the stats file if needed */
-    backend_read_statsfile();
-
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
-
-    return funcentry;
+    PgStat_StatFuncEntry *shent;
+    Oid    dboid = MyDatabaseId;
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for the InvalidOid */
+    Assert(func_id != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_funcent_key.databaseid == dboid &&
+        cached_funcent_key.objectid == func_id)
+        return &cached_funcent;
+
+    shent = (PgStat_StatFuncEntry *)
+        get_stat_entry(PGSTAT_TYPE_FUNCTION, dboid, func_id, true, false,
+                       NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_funcent, shent, sizeof(PgStat_StatFuncEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_funcent_key.databaseid = dboid;
+    cached_funcent_key.objectid = func_id;
+
+    return &cached_funcent;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_beentry() -
  *
@@ -2819,53 +3434,160 @@ pgstat_fetch_stat_numbackends(void)
     return localNumBackends;
 }
 
+/*
+ * ---------
+ * pgstat_get_stat_timestamp() -
+ *
+ *  Returns the last update timestamp of global statistics.
+ */
+TimestampTz
+pgstat_get_stat_timestamp(void)
+{
+    return (TimestampTz) pg_atomic_read_u64(&StatsShmem->stats_timestamp);
+}
+
 /*
  * ---------
  * pgstat_fetch_stat_archiver() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the archiver statistics struct.
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_ArchiverStats *
+PgStat_Archiver *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    PgStat_Archiver reset;
+    PgStat_Archiver *reset_shared = &StatsShmem->archiver_reset_offset;
+    PgStat_Archiver *shared = &StatsShmem->archiver_stats;
+    PgStat_Archiver *cached = &cached_archiverstats;
 
-    return &archiverStats;
+    pgstat_copy_global_stats(cached, shared, sizeof(PgStat_Archiver),
+                             &StatsShmem->archiver_changecount);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&reset, reset_shared, sizeof(PgStat_Archiver));
+    LWLockRelease(StatsLock);
+
+    /* compensate by reset offsets */
+    if (cached->archived_count == reset.archived_count)
+    {
+        cached->last_archived_wal[0] = 0;
+        cached->last_archived_timestamp = 0;
+    }
+    cached->archived_count -= reset.archived_count;
+
+    if (cached->failed_count == reset.failed_count)
+    {
+        cached->last_failed_wal[0] = 0;
+        cached->last_failed_timestamp = 0;
+    }
+    cached->failed_count -= reset.failed_count;
+
+    cached->stat_reset_timestamp = reset.stat_reset_timestamp;
+
+    cached_archiverstats_is_valid = true;
+
+    return &cached_archiverstats;
 }
 
 
 /*
  * ---------
- * pgstat_fetch_global() -
+ * pgstat_fetch_stat_bgwriter() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the global statistics struct.
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_GlobalStats *
-pgstat_fetch_global(void)
+PgStat_BgWriter *
+pgstat_fetch_stat_bgwriter(void)
 {
-    backend_read_statsfile();
+    PgStat_BgWriter reset;
+    PgStat_BgWriter *reset_shared = &StatsShmem->bgwriter_reset_offset;
+    PgStat_BgWriter *shared = &StatsShmem->bgwriter_stats;
+    PgStat_BgWriter *cached = &cached_bgwriterstats;
+
+    pgstat_copy_global_stats(cached, shared, sizeof(PgStat_BgWriter),
+                             &StatsShmem->bgwriter_changecount);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&reset, reset_shared, sizeof(PgStat_BgWriter));
+    LWLockRelease(StatsLock);
+
+    /* compensate by reset offsets */
+    cached->buf_written_clean -= reset.buf_written_clean;
+    cached->maxwritten_clean  -= reset.maxwritten_clean;
+    cached->buf_alloc          -= reset.buf_alloc;
+    cached->stat_reset_timestamp = reset.stat_reset_timestamp;
+
+    cached_bgwriterstats_is_valid = true;
+
+    return &cached_bgwriterstats;
+}
+
+/*
+ * ---------
+ * pgstat_fetch_stat_checkpointer() -
+ *
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
+ * ---------
+ */
+PgStat_CheckPointer *
+pgstat_fetch_stat_checkpointer(void)
+{
+    PgStat_CheckPointer reset;
+    PgStat_CheckPointer *reset_shared = &StatsShmem->checkpointer_reset_offset;
+    PgStat_CheckPointer *shared = &StatsShmem->checkpointer_stats;
+    PgStat_CheckPointer *cached = &cached_checkpointerstats;
+
+    pgstat_copy_global_stats(cached, shared, sizeof(PgStat_CheckPointer),
+                             &StatsShmem->checkpointer_changecount);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&reset, reset_shared, sizeof(PgStat_CheckPointer));
+    LWLockRelease(StatsLock);
+
+    /* compensate by reset offsets */
+    cached->timed_checkpoints       -= reset.timed_checkpoints;
+    cached->requested_checkpoints   -= reset.requested_checkpoints;
+    cached->buf_written_checkpoints -= reset.buf_written_checkpoints;
+    cached->buf_written_backend     -= reset.buf_written_backend;
+    cached->buf_fsync_backend       -= reset.buf_fsync_backend;
+    cached->checkpoint_write_time   -= reset.checkpoint_write_time;
+    cached->checkpoint_sync_time    -= reset.checkpoint_sync_time;
+
+    cached_checkpointerstats_is_valid = true;
 
-    return &globalStats;
+    return &cached_checkpointerstats;
 }
 
 /*
  * ---------
  * pgstat_fetch_stat_wal() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the WAL statistics struct.
+ *    Support function for the SQL-callable pgstat* functions. The returned entry
+ *  is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_WalStats *
+PgStat_Wal *
 pgstat_fetch_stat_wal(void)
 {
-    backend_read_statsfile();
+    if (!cached_walstats_is_valid)
+    {
+        LWLockAcquire(StatsLock, LW_SHARED);
+        memcpy(&cached_walstats, &StatsShmem->wal_stats, sizeof(PgStat_Wal));
+        LWLockRelease(StatsLock);
+    }
 
-    return &walStats;
+    cached_walstats_is_valid = true;
+
+    return &cached_walstats;
 }
 
 /*
@@ -2879,9 +3601,27 @@ pgstat_fetch_stat_wal(void)
 PgStat_SLRUStats *
 pgstat_fetch_slru(void)
 {
-    backend_read_statsfile();
+    size_t size = sizeof(PgStat_SLRUStats) * SLRU_NUM_ELEMENTS;
 
-    return slruStats;
+    for (;;)
+    {
+        uint32 before_count;
+        uint32 after_count;
+
+        before_count = pg_atomic_read_u32(&StatsShmem->slru_changecount);
+        pg_read_barrier();
+        memcpy(&cached_slrustats, &StatsShmem->slru_stats, size);
+        pg_read_barrier();
+        after_count = pg_atomic_read_u32(&StatsShmem->slru_changecount);
+        if (before_count == after_count && (before_count & 1) == 0)
+            break;
+
+        CHECK_FOR_INTERRUPTS();
+    }
+
+    cached_slrustats_is_valid = true;
+
+    return &cached_slrustats;
 }
 
 /*
@@ -2893,13 +3633,41 @@ pgstat_fetch_slru(void)
  *    number of entries in nslots_p.
  * ---------
  */
-PgStat_ReplSlotStats *
+PgStat_ReplSlot *
 pgstat_fetch_replslot(int *nslots_p)
 {
-    backend_read_statsfile();
 
-    *nslots_p = nReplSlotStats;
-    return replSlotStats;
+    if (cached_replslotstats == NULL)
+    {
+        cached_replslotstats = (PgStat_ReplSlot *)
+            MemoryContextAlloc(pgStatCacheContext,
+                               sizeof(PgStat_ReplSlot) * max_replication_slots);
+    }
+
+    if (n_cached_replslotstats < 0)
+    {
+        int n = 0;
+        int i;
+
+        for (i = 0 ; i < max_replication_slots ; i++)
+        {
+            PgStat_ReplSlot *shent = (PgStat_ReplSlot *)
+                get_stat_entry(PGSTAT_TYPE_REPLSLOT, MyDatabaseId, i,
+                               false, false, NULL);
+            if (shent && !shent->header.dropped)
+            {
+                memcpy(cached_replslotstats[n++].slotname,
+                       shent->slotname,
+                       sizeof(PgStat_ReplSlot) -
+                       offsetof(PgStat_ReplSlot, slotname));
+            }
+        }
+
+        n_cached_replslotstats = n;
+    }
+
+    *nslots_p = n_cached_replslotstats;
+    return cached_replslotstats;
 }
 
 /* ------------------------------------------------------------
@@ -3124,8 +3892,8 @@ pgstat_initialize(void)
      */
     prevWalUsage = pgWalUsage;
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* This hook needs to run before DSM shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -3302,12 +4070,15 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+    /* attach shared database stats area */
+    attach_shared_stats();
 }
 
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
- * Flush any remaining statistics counts out to the collector.
+ * Flush any remaining statistics counts out to shared stats.
  * Without this, operations triggered during backend exit (such as
  * temp table deletions) won't be counted.
  *
@@ -3320,12 +4091,25 @@ pgstat_beshutdown_hook(int code, Datum arg)
 
     /*
      * If we got as far as discovering our own database ID, we can report what
-     * we did to the collector.  Otherwise, we'd be sending an invalid
+     * we did to the shared stats.  Otherwise, we'd be sending an invalid
      * database ID, so forget it.  (This means that accesses to pg_database
      * during failed backend starts might never get counted.)
      */
     if (OidIsValid(MyDatabaseId))
+    {
+        if (MyBackendType == B_BACKEND)
+            pgstat_update_connstats(true);
         pgstat_report_stat(true);
+    }
+
+    /*
+     * We need to clean up temporary slots before detaching shared statistics
+     * so that the statistics for temporary slots are properly removed.
+     */
+    if (MyReplicationSlot != NULL)
+        ReplicationSlotRelease();
+
+    ReplicationSlotCleanup();
 
     /*
      * Clear my status entry, following the protocol of bumping st_changecount
@@ -3337,6 +4121,8 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    detach_shared_stats(true);
 }
 
 
@@ -3621,7 +4407,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3916,8 +4703,8 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
+        case WAIT_EVENT_READING_STATS_FILE:
+            event_name = "ReadingStatsFile";
             break;
         case WAIT_EVENT_RECOVERY_WAL_STREAM:
             event_name = "RecoveryWalStream";
@@ -4571,94 +5358,80 @@ pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
 
 
 /* ----------
- * pgstat_setheader() -
+ * pgstat_report_archiver() -
  *
- *        Set common header fields in a statistics message
+ *        Report archiver statistics
  * ----------
  */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
+void
+pgstat_report_archiver(const char *xlog, bool failed)
 {
-    int            rc;
+    TimestampTz now = GetCurrentTimestamp();
+    uint32    before_count PG_USED_FOR_ASSERTS_ONLY;
+    uint32    after_count PG_USED_FOR_ASSERTS_ONLY;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
 
-    ((PgStat_MsgHdr *) msg)->m_size = len;
+    START_CRIT_SECTION();
+    before_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->archiver_changecount, 1);
+    Assert((before_count & 1) == 0);
 
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
+    if (failed)
     {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
- *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
- * ----------
- */
-void
-pgstat_send_archiver(const char *xlog, bool failed)
-{
-    PgStat_MsgArchiver msg;
-
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    strlcpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+        ++StatsShmem->archiver_stats.failed_count;
+        memcpy(&StatsShmem->archiver_stats.last_failed_wal, xlog,
+               sizeof(StatsShmem->archiver_stats.last_failed_wal));
+        StatsShmem->archiver_stats.last_failed_timestamp = now;
+    }
+    else
+    {
+        ++StatsShmem->archiver_stats.archived_count;
+        memcpy(&StatsShmem->archiver_stats.last_archived_wal, xlog,
+               sizeof(StatsShmem->archiver_stats.last_archived_wal));
+        StatsShmem->archiver_stats.last_archived_timestamp = now;
+    }
+
+    after_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->archiver_changecount, 1);
+    Assert(after_count == before_count + 1);
+    END_CRIT_SECTION();
 }
 
 /* ----------
- * pgstat_send_bgwriter() -
+ * pgstat_report_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
-pgstat_send_bgwriter(void)
+pgstat_report_bgwriter(void)
 {
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+    PgStat_BgWriter *s = &StatsShmem->bgwriter_stats;
+    PgStat_BgWriter *l = &BgWriterStats;
+    uint32    before_count PG_USED_FOR_ASSERTS_ONLY;
+    uint32    after_count PG_USED_FOR_ASSERTS_ONLY;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid touching the shared stats when there is nothing to report.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    START_CRIT_SECTION();
+    before_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->bgwriter_changecount, 1);
+    Assert((before_count & 1) == 0);
+
+    s->buf_written_clean += l->buf_written_clean;
+    s->maxwritten_clean += l->maxwritten_clean;
+    s->buf_alloc += l->buf_alloc;
+
+    after_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->bgwriter_changecount, 1);
+    Assert(after_count == before_count + 1);
+    END_CRIT_SECTION();
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4667,548 +5440,193 @@ pgstat_send_bgwriter(void)
 }
 
 /* ----------
- * pgstat_send_wal() -
+ * pgstat_report_checkpointer() -
  *
- *        Send WAL statistics to the collector
+ *        Report checkpointer statistics
  * ----------
  */
 void
-pgstat_send_wal(void)
+pgstat_report_checkpointer(void)
 {
     /* We assume this initializes to zeroes */
-    static const PgStat_MsgWal all_zeroes;
-
-    WalUsage    walusage;
-
-    /*
-     * Calculate how much WAL usage counters are increased by substracting the
-     * previous counters from the current ones. Fill the results in WAL stats
-     * message.
-     */
-    MemSet(&walusage, 0, sizeof(WalUsage));
-    WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);
-
-    WalStats.m_wal_records = walusage.wal_records;
-    WalStats.m_wal_fpi = walusage.wal_fpi;
-    WalStats.m_wal_bytes = walusage.wal_bytes;
+    static const PgStat_CheckPointer all_zeroes;
+    PgStat_CheckPointer *s = &StatsShmem->checkpointer_stats;
+    PgStat_CheckPointer *l = &CheckPointerStats;
+    uint32    before_count PG_USED_FOR_ASSERTS_ONLY;
+    uint32    after_count PG_USED_FOR_ASSERTS_ONLY;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid touching the shared stats when there is nothing to report.
      */
-    if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
+    if (memcmp(&CheckPointerStats, &all_zeroes,
+               sizeof(PgStat_CheckPointer)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&WalStats.m_hdr, PGSTAT_MTYPE_WAL);
-    pgstat_send(&WalStats, sizeof(WalStats));
+    START_CRIT_SECTION();
+    before_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->checkpointer_changecount, 1);
+    Assert((before_count & 1) == 0);
 
-    /*
-     * Save the current counters for the subsequent calculation of WAL usage.
-     */
-    prevWalUsage = pgWalUsage;
+    s->timed_checkpoints += l->timed_checkpoints;
+    s->requested_checkpoints += l->requested_checkpoints;
+    s->checkpoint_write_time += l->checkpoint_write_time;
+    s->checkpoint_sync_time += l->checkpoint_sync_time;
+    s->buf_written_checkpoints += l->buf_written_checkpoints;
+    s->buf_written_backend += l->buf_written_backend;
+    s->buf_fsync_backend += l->buf_fsync_backend;
+
+    after_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->checkpointer_changecount, 1);
+    Assert(after_count == before_count + 1);
+    END_CRIT_SECTION();
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
      */
-    MemSet(&WalStats, 0, sizeof(WalStats));
+    MemSet(&CheckPointerStats, 0, sizeof(CheckPointerStats));
 }
 
 /* ----------
- * pgstat_send_slru() -
+ * pgstat_report_wal() -
  *
- *        Send SLRU statistics to the collector
+ *        Report WAL statistics
+ * ----------
+ */
+void
+pgstat_report_wal(void)
+{
+    flush_walstat(false);
+}
+
+/* ----------
+ * pgstat_update_connstats() -
+ *
+ *        Update local connection stats
  * ----------
  */
 static void
-pgstat_send_slru(void)
+pgstat_update_connstats(bool disconnect)
 {
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgSLRU all_zeroes;
+    static TimestampTz    last_report = 0;
+    static SessionEndType    session_end_type = DISCONNECT_NOT_YET;
+    TimestampTz    now;
+    long        secs;
+    int            usecs;
+    PgStat_StatDBEntry *ldbstats;        /* local database entry */
 
-    for (int i = 0; i < SLRU_NUM_ELEMENTS; i++)
+    Assert(MyBackendType == B_BACKEND);
+
+    if (session_end_type != DISCONNECT_NOT_YET)
+        return;
+
+    ldbstats = get_local_dbstat_entry(MyDatabaseId);
+
+    /* count each session only once, on its first report */
+    if (last_report == 0)
+        ldbstats->counts.n_sessions++;
+
+    now = GetCurrentTimestamp();
+    TimestampDifference(last_report == 0 ? MyStartTimestamp : last_report,
+                        now, &secs, &usecs);
+    last_report = now;
+    if (disconnect)
+        session_end_type = pgStatSessionEndCause;
+    ldbstats->counts.total_session_time += secs * 1000000 + usecs;
+    ldbstats->counts.total_active_time += pgStatActiveTime;
+    pgStatActiveTime = 0;
+    ldbstats->counts.total_idle_in_xact_time += pgStatTransactionIdleTime;
+    pgStatTransactionIdleTime = 0;
+
+    switch (session_end_type)
     {
-        /*
-         * This function can be called even if nothing at all has happened. In
-         * this case, avoid sending a completely empty message to the stats
-         * collector.
-         */
-        if (memcmp(&SLRUStats[i], &all_zeroes, sizeof(PgStat_MsgSLRU)) == 0)
-            continue;
-
-        /* set the SLRU type before each send */
-        SLRUStats[i].m_index = i;
-
-        /*
-         * Prepare and send the message
-         */
-        pgstat_setheader(&SLRUStats[i].m_hdr, PGSTAT_MTYPE_SLRU);
-        pgstat_send(&SLRUStats[i], sizeof(PgStat_MsgSLRU));
-
-        /*
-         * Clear out the statistics buffer, so it can be re-used.
-         */
-        MemSet(&SLRUStats[i], 0, sizeof(PgStat_MsgSLRU));
+        case DISCONNECT_NOT_YET:
+        case DISCONNECT_NORMAL:
+            /* we don't collect these */
+            break;
+        case DISCONNECT_CLIENT_EOF:
+            ldbstats->counts.n_sessions_abandoned++;
+            break;
+        case DISCONNECT_FATAL:
+            ldbstats->counts.n_sessions_fatal++;
+            break;
+        case DISCONNECT_KILLED:
+            ldbstats->counts.n_sessions_killed++;
+            break;
     }
 }
 
-
 /* ----------
- * PgstatCollectorMain() -
+ * get_local_dbstat_entry() -
  *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-    WaitEvent    event;
-    WaitEventSet *wes;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, SignalHandlerForConfigReload);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    MyBackendType = B_STATS_COLLECTOR;
-    init_ps_display(NULL);
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /* Prepare to wait for our latch or data in our socket. */
-    wes = CreateWaitEventSet(CurrentMemoryContext, 3);
-    AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
-    AddWaitEventToSet(wes, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL, NULL);
-    AddWaitEventToSet(wes, WL_SOCKET_READABLE, pgStatSock, NULL, NULL);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do check ConfigReloadPending
-     * inside the inner loop, which means that such interrupts will get
-     * serviced but the latch won't get cleared until next time there is a
-     * break in the action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (ShutdownRequestPending)
-            break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * ShutdownRequestPending becomes set.
-         */
-        while (!ShutdownRequestPending)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (ConfigReloadPending)
-            {
-                ConfigReloadPending = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSLRUCOUNTER:
-                    pgstat_recv_resetslrucounter(&msg.msg_resetslrucounter,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETREPLSLOTCOUNTER:
-                    pgstat_recv_resetreplslotcounter(&msg.msg_resetreplslotcounter,
-                                                     len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_WAL:
-                    pgstat_recv_wal(&msg.msg_wal, len);
-                    break;
-
-                case PGSTAT_MTYPE_SLRU:
-                    pgstat_recv_slru(&msg.msg_slru, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_REPLSLOT:
-                    pgstat_recv_replslot(&msg.msg_replslot, len);
-                    break;
-
-                case PGSTAT_MTYPE_CONNECTION:
-                    pgstat_recv_connstat(&msg.msg_conn, len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitEventSetWait(wes, -1L, &event, 1, WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitEventSetWait(wes, 2 * 1000L /* msec */ , &event, 1,
-                              WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr == 1 && event.events == WL_POSTMASTER_DEATH)
-            break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    FreeWaitEventSet(wes);
-
-    exit(0);
-}
-
-/*
- * Subroutine to clear stats in a database entry
- *
- * Tables and functions hashes are initialized to empty.
- */
-static void
-reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
-{
-    HASHCTL        hash_ctl;
-
-    dbentry->n_xact_commit = 0;
-    dbentry->n_xact_rollback = 0;
-    dbentry->n_blocks_fetched = 0;
-    dbentry->n_blocks_hit = 0;
-    dbentry->n_tuples_returned = 0;
-    dbentry->n_tuples_fetched = 0;
-    dbentry->n_tuples_inserted = 0;
-    dbentry->n_tuples_updated = 0;
-    dbentry->n_tuples_deleted = 0;
-    dbentry->last_autovac_time = 0;
-    dbentry->n_conflict_tablespace = 0;
-    dbentry->n_conflict_lock = 0;
-    dbentry->n_conflict_snapshot = 0;
-    dbentry->n_conflict_bufferpin = 0;
-    dbentry->n_conflict_startup_deadlock = 0;
-    dbentry->n_temp_files = 0;
-    dbentry->n_temp_bytes = 0;
-    dbentry->n_deadlocks = 0;
-    dbentry->n_checksum_failures = 0;
-    dbentry->last_checksum_failure = 0;
-    dbentry->n_block_read_time = 0;
-    dbentry->n_block_write_time = 0;
-    dbentry->n_sessions = 0;
-    dbentry->total_session_time = 0;
-    dbentry->total_active_time = 0;
-    dbentry->total_idle_in_xact_time = 0;
-    dbentry->n_sessions_abandoned = 0;
-    dbentry->n_sessions_fatal = 0;
-    dbentry->n_sessions_killed = 0;
-
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
-}
-
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+ *  Find or create a local PgStat_StatDBEntry entry for dbid.  A new entry
+ *  is created and initialized if none exists.
  */
 static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
+get_local_dbstat_entry(Oid dbid)
 {
-    PgStat_StatDBEntry *result;
+    PgStat_StatDBEntry *dbentry;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
-        return NULL;
 
     /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
+     * Find an entry or create a new one.
      */
-    if (!found)
-        reset_dbentry_counters(result);
+    dbentry = (PgStat_StatDBEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid,
+                             true, &found);
 
-    return result;
+    return dbentry;
 }
 
-
-/*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+/* ----------
+ * get_local_tabstat_entry() -
+ *  Find or create a PgStat_TableStatus entry for rel. A new entry is created
+ *  and initialized if none exists.
+ * ----------
  */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+static PgStat_TableStatus *
+get_local_tabstat_entry(Oid rel_id, bool isshared)
 {
-    PgStat_StatTabEntry *result;
+    PgStat_TableStatus *tabentry;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
 
-    /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
+    tabentry = (PgStat_TableStatus *)
+        get_local_stat_entry(PGSTAT_TYPE_TABLE,
+                             isshared ? InvalidOid : MyDatabaseId,
+                             rel_id, true, &found);
 
-    if (!create && !found)
-        return NULL;
-
-    /* If not found, initialize the new one. */
-    if (!found)
-    {
-        result->numscans = 0;
-        result->tuples_returned = 0;
-        result->tuples_fetched = 0;
-        result->tuples_inserted = 0;
-        result->tuples_updated = 0;
-        result->tuples_deleted = 0;
-        result->tuples_hot_updated = 0;
-        result->n_live_tuples = 0;
-        result->n_dead_tuples = 0;
-        result->changes_since_analyze = 0;
-        result->inserts_since_vacuum = 0;
-        result->blocks_fetched = 0;
-        result->blocks_hit = 0;
-        result->vacuum_timestamp = 0;
-        result->vacuum_count = 0;
-        result->autovac_vacuum_timestamp = 0;
-        result->autovac_vacuum_count = 0;
-        result->analyze_timestamp = 0;
-        result->analyze_count = 0;
-        result->autovac_analyze_timestamp = 0;
-        result->autovac_analyze_count = 0;
-    }
-
-    return result;
+    return tabentry;
 }
 
-
 /* ----------
- * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
+ * pgstat_write_statsfile() -
+ *        Write the global statistics file, as well as DB files.
  *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ * This function is called in the last process that is accessing the shared
+ * stats so locking is not required.
  * ----------
  */
 static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+pgstat_write_statsfile(void)
 {
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
-    int            i;
+    dshash_seq_status hstat;
+    PgStatHashEntry *ps;
+
+    /* Stats are not initialized yet; just return. */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
+
+    /* this is the last process that is accessing the shared stats */
+#ifdef USE_ASSERT_CHECKING
+    LWLockAcquire(StatsLock, LW_SHARED);
+    Assert(StatsShmem->refcount == 0);
+    LWLockRelease(StatsLock);
+#endif
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -5228,7 +5646,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    pg_atomic_write_u64(&StatsShmem->stats_timestamp, GetCurrentTimestamp());
 
     /*
      * Write the file header --- currently just a format ID.
@@ -5238,200 +5656,72 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Write global stats struct
+     * Write bgwriter global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(&StatsShmem->bgwriter_stats, sizeof(PgStat_BgWriter), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Write archiver stats struct
+     * Write checkpointer global stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+    rc = fwrite(&StatsShmem->checkpointer_stats, sizeof(PgStat_CheckPointer), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Write WAL stats struct
+     * Write archiver global stats struct
      */
-    rc = fwrite(&walStats, sizeof(walStats), 1, fpout);
+    rc = fwrite(&StatsShmem->archiver_stats, sizeof(PgStat_Archiver), 1,
+                fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write WAL global stats struct
+     */
+    rc = fwrite(&StatsShmem->wal_stats, sizeof(PgStat_Wal), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write SLRU stats struct
      */
-    rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
+    rc = fwrite(&StatsShmem->slru_stats, sizeof(PgStatSharedSLRUStats), 1,
+                fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Walk through the database table.
+     * Walk through the stats entries
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((ps = dshash_seq_next(&hstat)) != NULL)
     {
-        /*
-         * Write out the table and function stats for this DB into the
-         * appropriate per-DB stat file, if required.
-         */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
+        PgStat_StatEntryHeader *shent;
+        size_t                    len;
+
+        CHECK_FOR_INTERRUPTS();
+
+        shent = (PgStat_StatEntryHeader *) dsa_get_address(area, ps->body);
+
+        /* we may have some "dropped" entries not yet removed, skip them */
+        if (shent->dropped)
+            continue;
+
+        /* Make DB's timestamp consistent with the global stats */
+        if (ps->key.type == PGSTAT_TYPE_DB)
         {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+            PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) shent;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
+            dbentry->stats_timestamp =
+                (TimestampTz) pg_atomic_read_u64(&StatsShmem->stats_timestamp);
         }
 
-        /*
-         * Write out the DB entry. We don't write the tables or functions
-         * pointers, since they're of no use to any other process.
-         */
-        fputc('D', fpout);
-        rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * Write replication slot stats struct
-     */
-    for (i = 0; i < nReplSlotStats; i++)
-    {
-        fputc('R', fpout);
-        rc = fwrite(&replSlotStats[i], sizeof(PgStat_ReplSlotStats), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
-}
-
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
-
-/* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
- * ----------
- */
-static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
-{
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpout;
-    int32        format_id;
-    Oid            dbid = dbentry->databaseid;
-    int            rc;
-    char        tmpfile[MAXPGPATH];
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database's access stats per table.
-     */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
-    {
-        fputc('T', fpout);
-        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
+        fputc('S', fpout);
+        rc = fwrite(&ps->key, sizeof(PgStatHashKey), 1, fpout);
 
-    /*
-     * Walk through the database's function stats table.
-     */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+        /* Write the entry body, excluding the header part */
+        len = PGSTAT_SHENT_BODY_LEN(ps->key.type);
+        rc = fwrite(PGSTAT_SHENT_BODY(shent), len, 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_seq_term(&hstat);
 
     /*
      * No more output to be done. Close the temp file and replace the old
@@ -5465,113 +5755,63 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
 }
 
 /* ----------
- * pgstat_read_statsfiles() -
- *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ * pgstat_read_statsfile() -
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
+ *    Reads the existing activity statistics file into the shared stats hash.
  *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
+ * This function is called in the only process that is accessing the shared
+ * stats so locking is not required.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+static void
+pgstat_read_statsfile(void)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-    int            i;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    char        tag;
 
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
+    /* shouldn't be called from postmaster */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Create the DB hashtable
-     */
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    /* this is the only process that is accessing the shared stats */
+#ifdef USE_ASSERT_CHECKING
+    LWLockAcquire(StatsLock, LW_SHARED);
+    Assert(StatsShmem->refcount == 1);
+    LWLockRelease(StatsLock);
+#endif
 
-    /* Allocate the space for replication slot statistics */
-    replSlotStats = palloc0(max_replication_slots * sizeof(PgStat_ReplSlotStats));
-    nReplSlotStats = 0;
-
-    /*
-     * Clear out global, archiver, WAL and SLRU statistics so they start from
-     * zero in case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-    memset(&walStats, 0, sizeof(walStats));
-    memset(&slruStats, 0, sizeof(slruStats));
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-    walStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Set the same reset timestamp for all SLRU items too.
-     */
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-        slruStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Set the same reset timestamp for all replication slots too.
-     */
-    for (i = 0; i < max_replication_slots; i++)
-        replSlotStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    StatsShmem->bgwriter_stats.stat_reset_timestamp = GetCurrentTimestamp();
+    StatsShmem->archiver_stats.stat_reset_timestamp =
+        StatsShmem->bgwriter_stats.stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the activity statistics is not running or
+     * has not yet written the stats file the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -5580,681 +5820,150 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
-        goto done;
-    }
-
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        goto done;
-    }
-
-    /*
-     * Read WAL stats struct
-     */
-    if (fread(&walStats, 1, sizeof(walStats), fpin) != sizeof(walStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&walStats, 0, sizeof(walStats));
         goto done;
     }
 
     /*
-     * Read SLRU stats struct
+     * Read bgwriter stats struct
      */
-    if (fread(slruStats, 1, sizeof(slruStats), fpin) != sizeof(slruStats))
+    if (fread(&StatsShmem->bgwriter_stats, 1, sizeof(PgStat_BgWriter), fpin) !=
+        sizeof(PgStat_BgWriter))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&slruStats, 0, sizeof(slruStats));
+        MemSet(&StatsShmem->bgwriter_stats, 0, sizeof(PgStat_BgWriter));
         goto done;
     }
 
     /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Add to the DB hash
-                 */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
-                break;
-
-                /*
-                 * 'R'    A PgStat_ReplSlotStats struct describing a replication
-                 * slot follows.
-                 */
-            case 'R':
-                if (fread(&replSlotStats[nReplSlotStats], 1, sizeof(PgStat_ReplSlotStats), fpin)
-                    != sizeof(PgStat_ReplSlotStats))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    memset(&replSlotStats[nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
-                    goto done;
-                }
-                nReplSlotStats++;
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-
-    return dbhash;
-}
-
-
-/* ----------
- * pgstat_read_db_statsfile() -
- *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
- * ----------
- */
-static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
-{
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatTabEntry tabbuf;
-    PgStat_StatFuncEntry funcbuf;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return;
-    }
-
-    /*
-     * Verify it's of the expected format.
+     * Read checkpointer stats struct
      */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
+    if (fread(&StatsShmem->checkpointer_stats, 1, sizeof(PgStat_CheckPointer), fpin) !=
+        sizeof(PgStat_CheckPointer))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
+        MemSet(&StatsShmem->checkpointer_stats, 0, sizeof(PgStat_CheckPointer));
         goto done;
     }
 
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'T'    A PgStat_StatTabEntry follows.
-                 */
-            case 'T':
-                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
-                          fpin) != sizeof(PgStat_StatTabEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
-                break;
-
-                /*
-                 * 'F'    A PgStat_StatFuncEntry follows.
-                 */
-            case 'F':
-                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
-                          fpin) != sizeof(PgStat_StatFuncEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
-                break;
-
-                /*
-                 * 'E'    The EOF marker of a complete stats file.
-                 */
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts. The caller must
- *    rely on timestamp stored in *ts iff the function returns true.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    PgStat_WalStats myWalStats;
-    PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
-    PgStat_ReplSlotStats myReplSlotStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
     /*
      * Read archiver stats struct
      */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
+    if (fread(&StatsShmem->archiver_stats, 1, sizeof(PgStat_Archiver),
+              fpin) != sizeof(PgStat_Archiver))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
+        MemSet(&StatsShmem->archiver_stats, 0, sizeof(PgStat_Archiver));
+        goto done;
     }
 
     /*
      * Read WAL stats struct
      */
-    if (fread(&myWalStats, 1, sizeof(myWalStats), fpin) != sizeof(myWalStats))
+    if (fread(&StatsShmem->wal_stats, 1, sizeof(PgStat_Wal), fpin)
+        != sizeof(PgStat_Wal))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
+        MemSet(&StatsShmem->wal_stats, 0, sizeof(PgStat_Wal));
+        goto done;
     }
 
     /*
      * Read SLRU stats struct
      */
-    if (fread(mySLRUStats, 1, sizeof(mySLRUStats), fpin) != sizeof(mySLRUStats))
+    if (fread(&StatsShmem->slru_stats, 1, sizeof(PgStatSharedSLRUStats),
+              fpin) != sizeof(PgStatSharedSLRUStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-                /*
-                 * 'R'    A PgStat_ReplSlotStats struct describing a replication
-                 * slot follows.
-                 */
-            case 'R':
-                if (fread(&myReplSlotStats, 1, sizeof(PgStat_ReplSlotStats), fpin)
-                    != sizeof(PgStat_ReplSlotStats))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-        }
+        goto done;
     }
 
-done:
-    FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
     /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
+    while ((tag = fgetc(fpin)) == 'S')
     {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
+        PgStatHashKey        key;
+        PgStat_StatEntryHeader *p;
+        size_t                len;
 
         CHECK_FOR_INTERRUPTS();
 
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
+        if (fread(&key, 1, sizeof(key), fpin) != sizeof(key))
         {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"", statfile)));
+            goto done;
         }
 
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                ereport(LOG,
-                        (errmsg("statistics collector's time %s is later than backend local time %s",
-                                filetime, mytime)));
-                pfree(filetime);
-                pfree(mytime);
-            }
+        p = get_stat_entry(key.type, key.databaseid, key.objectid,
+                           false, true, &found);
 
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
+        /* don't allow duplicate entries */
+        if (found)
+        {
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"",
+                            statfile)));
+            goto done;
         }
 
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
+        /* Avoid overwriting header part */
+        len = PGSTAT_SHENT_BODY_LEN(key.type);
 
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
+        if (fread(PGSTAT_SHENT_BODY(p), 1, len, fpin) != len)
+        {
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"", statfile)));
+            goto done;
+        }
     }
 
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
+    if (tag != 'E')
+    {
         ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
+                (errmsg("corrupted statistics file \"%s\"",
+                        statfile)));
+        goto done;
+    }
 
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
+done:
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+
+    return;
 }
 
-
 /* ----------
  * pgstat_setup_memcxt() -
  *
- *    Create pgStatLocalContext, if not already done.
+ *    Create pgStatLocalContext and pgStatCacheContext if not already done.
  * ----------
  */
 static void
 pgstat_setup_memcxt(void)
 {
     if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
-}
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
 
+    if (!pgStatCacheContext)
+        pgStatCacheContext =
+            AllocSetContextCreate(CacheMemoryContext,
+                                  "Activity statistics",
+                                  ALLOCSET_SMALL_SIZES);
+}
 
 /* ----------
  * pgstat_clear_snapshot() -
@@ -6271,941 +5980,25 @@ pgstat_clear_snapshot(void)
 {
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
+    {
         MemoryContextDelete(pgStatLocalContext);
 
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            ereport(LOG,
-                    (errmsg("stats_timestamp %s is later than collector's time %s for database %u",
-                            writetime, mytime, dbentry->databaseid)));
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-                tabentry->inserts_since_vacuum = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_WAL)
-    {
-        /* Reset the WAL statistics for the cluster. */
-        memset(&walStats, 0, sizeof(walStats));
-        walStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_resetslrucounter() -
- *
- *    Reset some SLRU statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len)
-{
-    int            i;
-    TimestampTz ts = GetCurrentTimestamp();
-
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /* reset entry with the given index, or all entries (index is -1) */
-        if ((msg->m_index == -1) || (msg->m_index == i))
-        {
-            memset(&slruStats[i], 0, sizeof(slruStats[i]));
-            slruStats[i].stat_reset_timestamp = ts;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_resetreplslotcounter() -
- *
- *    Reset some replication slot statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg,
-                                 int len)
-{
-    int            i;
-    int            idx = -1;
-    TimestampTz ts;
-
-    ts = GetCurrentTimestamp();
-    if (msg->clearall)
-    {
-        for (i = 0; i < nReplSlotStats; i++)
-            pgstat_reset_replslot(i, ts);
-    }
-    else
-    {
-        /* Get the index of replication slot statistics to reset */
-        idx = pgstat_replslot_index(msg->m_slotname, false);
-
-        /*
-         * Nothing to do if the given slot entry is not found.  This could
-         * happen when the slot with the given name is removed and the
-         * corresponding statistics entry is also removed before receiving the
-         * reset message.
-         */
-        if (idx < 0)
-            return;
-
-        /* Reset the stats for the requested replication slot */
-        pgstat_reset_replslot(idx, ts);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signaling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * It is quite possible that a non-aggressive VACUUM ended up skipping
-     * various pages, however, we'll zero the insert counter here regardless.
-     * It's currently used only to track when we need to perform an "insert"
-     * autovacuum, which are mainly intended to freeze newly inserted tuples.
-     * Zeroing this may just mean we'll not try to vacuum the table again
-     * until enough tuples have been inserted to trigger another insert
-     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
-     * stragglers.
-     */
-    tabentry->inserts_since_vacuum = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_wal() -
- *
- *    Process a WAL message.
- * ----------
- */
-static void
-pgstat_recv_wal(PgStat_MsgWal *msg, int len)
-{
-    walStats.wal_records += msg->m_wal_records;
-    walStats.wal_fpi += msg->m_wal_fpi;
-    walStats.wal_bytes += msg->m_wal_bytes;
-    walStats.wal_buffers_full += msg->m_wal_buffers_full;
-}
-
-/* ----------
- * pgstat_recv_slru() -
- *
- *    Process a SLRU message.
- * ----------
- */
-static void
-pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
-{
-    slruStats[msg->m_index].blocks_zeroed += msg->m_blocks_zeroed;
-    slruStats[msg->m_index].blocks_hit += msg->m_blocks_hit;
-    slruStats[msg->m_index].blocks_read += msg->m_blocks_read;
-    slruStats[msg->m_index].blocks_written += msg->m_blocks_written;
-    slruStats[msg->m_index].blocks_exists += msg->m_blocks_exists;
-    slruStats[msg->m_index].flush += msg->m_flush;
-    slruStats[msg->m_index].truncate += msg->m_truncate;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_replslot() -
- *
- *    Process a REPLSLOT message.
- * ----------
- */
-static void
-pgstat_recv_replslot(PgStat_MsgReplSlot *msg, int len)
-{
-    int            idx;
-
-    /*
-     * Get the index of replication slot statistics.  On dropping, we don't
-     * create the new statistics.
-     */
-    idx = pgstat_replslot_index(msg->m_slotname, !msg->m_drop);
-
-    /*
-     * The slot entry is not found or there is no space to accommodate the new
-     * entry.  This could happen when the message for the creation of a slot
-     * reached before the drop message even though the actual operations
-     * happen in reverse order.  In such a case, the next update of the
-     * statistics for the same slot will create the required entry.
-     */
-    if (idx < 0)
-        return;
-
-    /* it must be a valid replication slot index */
-    Assert(idx < nReplSlotStats);
-
-    if (msg->m_drop)
-    {
-        /* Remove the replication slot statistics with the given name */
-        if (idx < nReplSlotStats - 1)
-            memcpy(&replSlotStats[idx], &replSlotStats[nReplSlotStats - 1],
-                   sizeof(PgStat_ReplSlotStats));
-        nReplSlotStats--;
-    }
-    else
-    {
-        /* Update the replication slot statistics */
-        replSlotStats[idx].spill_txns += msg->m_spill_txns;
-        replSlotStats[idx].spill_count += msg->m_spill_count;
-        replSlotStats[idx].spill_bytes += msg->m_spill_bytes;
-        replSlotStats[idx].stream_txns += msg->m_stream_txns;
-        replSlotStats[idx].stream_count += msg->m_stream_count;
-        replSlotStats[idx].stream_bytes += msg->m_stream_bytes;
-    }
-}
-
-/* ----------
- * pgstat_recv_connstat() -
- *
- *  Process connection information.
- * ----------
- */
-static void
-pgstat_recv_connstat(PgStat_MsgConn *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_sessions += msg->m_count;
-    dbentry->total_session_time += msg->m_session_time;
-    dbentry->total_active_time += msg->m_active_time;
-    dbentry->total_idle_in_xact_time += msg->m_idle_in_xact_time;
-    switch (msg->m_disconnect)
-    {
-        case DISCONNECT_NOT_YET:
-        case DISCONNECT_NORMAL:
-            /* we don't collect these */
-            break;
-        case DISCONNECT_CLIENT_EOF:
-            dbentry->n_sessions_abandoned++;
-            break;
-        case DISCONNECT_FATAL:
-            dbentry->n_sessions_fatal++;
-            break;
-        case DISCONNECT_KILLED:
-            dbentry->n_sessions_killed++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
+    }
+
+    /* Invalidate the simple cache keys */
+    cached_dbent_key = stathashkey_zero;
+    cached_tabent_key = stathashkey_zero;
+    cached_funcent_key = stathashkey_zero;
+    cached_archiverstats_is_valid = false;
+    cached_bgwriterstats_is_valid = false;
+    cached_checkpointerstats_is_valid = false;
+    cached_walstats_is_valid = false;
+    cached_slrustats_is_valid = false;
+    n_cached_replslotstats = -1;
 }
 
 /*
@@ -7252,60 +6045,6 @@ pgstat_clip_activity(const char *raw_activity)
     return activity;
 }
 
-/* ----------
- * pgstat_replslot_index
- *
- * Return the index of entry of a replication slot with the given name, or
- * -1 if the slot is not found.
- *
- * create_it tells whether to create the new slot entry if it is not found.
- * ----------
- */
-static int
-pgstat_replslot_index(const char *name, bool create_it)
-{
-    int            i;
-
-    Assert(nReplSlotStats <= max_replication_slots);
-    for (i = 0; i < nReplSlotStats; i++)
-    {
-        if (strcmp(replSlotStats[i].slotname, name) == 0)
-            return i;            /* found */
-    }
-
-    /*
-     * The slot is not found.  We don't want to register the new statistics if
-     * the list is already full or the caller didn't request.
-     */
-    if (i == max_replication_slots || !create_it)
-        return -1;
-
-    /* Register new slot */
-    memset(&replSlotStats[nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
-    strlcpy(replSlotStats[nReplSlotStats].slotname, name, NAMEDATALEN);
-
-    return nReplSlotStats++;
-}
-
-/* ----------
- * pgstat_reset_replslot
- *
- * Reset the replication slot stats at index 'i'.
- * ----------
- */
-static void
-pgstat_reset_replslot(int i, TimestampTz ts)
-{
-    /* reset only counters. Don't clear slot name */
-    replSlotStats[i].spill_txns = 0;
-    replSlotStats[i].spill_count = 0;
-    replSlotStats[i].spill_bytes = 0;
-    replSlotStats[i].stream_txns = 0;
-    replSlotStats[i].stream_count = 0;
-    replSlotStats[i].stream_bytes = 0;
-    replSlotStats[i].stat_reset_timestamp = ts;
-}
-
 /*
  * pgstat_slru_index
  *
@@ -7350,7 +6089,7 @@ pgstat_slru_name(int slru_idx)
  * Returns pointer to entry with counters for given SLRU (based on the name
  * stored in SlruCtl as lwlock tranche name).
  */
-static inline PgStat_MsgSLRU *
+static PgStat_SLRUStats *
 slru_entry(int slru_idx)
 {
     /*
@@ -7361,7 +6100,7 @@ slru_entry(int slru_idx)
 
     Assert((slru_idx >= 0) && (slru_idx < SLRU_NUM_ELEMENTS));
 
-    return &SLRUStats[slru_idx];
+    return &local_SLRUStats[slru_idx];
 }
 
 /*
@@ -7371,41 +6110,48 @@ slru_entry(int slru_idx)
 void
 pgstat_count_slru_page_zeroed(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_zeroed += 1;
+    slru_entry(slru_idx)->blocks_zeroed += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_hit(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_hit += 1;
+    slru_entry(slru_idx)->blocks_hit += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_exists(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_exists += 1;
+    slru_entry(slru_idx)->blocks_exists += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_read(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_read += 1;
+    slru_entry(slru_idx)->blocks_read += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_written(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_written += 1;
+    slru_entry(slru_idx)->blocks_written += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_flush(int slru_idx)
 {
-    slru_entry(slru_idx)->m_flush += 1;
+    slru_entry(slru_idx)->flush += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_truncate(int slru_idx)
 {
-    slru_entry(slru_idx)->m_truncate += 1;
+    slru_entry(slru_idx)->truncate += 1;
+    have_slrustats = true;
 }
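
The SLRU hunks above replace the message-struct counters with backend-local counters plus a have_slrustats flag. The flush side is not part of this excerpt, so the following is only a minimal sketch of how such a "count locally, mark dirty, fold in later" scheme could look. All demo_* names are hypothetical and not taken from the patch, and a plain static array stands in for the shared-memory area, which in a real implementation would need a lock around the update.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEMO_SLRU_NUM_ELEMENTS 8	/* hypothetical stand-in for SLRU_NUM_ELEMENTS */

/* Hypothetical, trimmed-down counterpart of PgStat_SLRUStats. */
typedef struct DemoSLRUStats
{
	int64_t		blocks_zeroed;
	int64_t		blocks_hit;
} DemoSLRUStats;

static DemoSLRUStats demo_local[DEMO_SLRU_NUM_ELEMENTS];	/* backend-local counters */
static DemoSLRUStats demo_shared[DEMO_SLRU_NUM_ELEMENTS];	/* stands in for shared memory */
static bool demo_have_slrustats = false;	/* true if demo_local has unflushed data */

/* Return the local entry for the given SLRU index. */
static DemoSLRUStats *
demo_slru_entry(int slru_idx)
{
	assert(slru_idx >= 0 && slru_idx < DEMO_SLRU_NUM_ELEMENTS);
	return &demo_local[slru_idx];
}

/* Count one page hit locally and remember that there is something to flush. */
void
demo_count_slru_page_hit(int slru_idx)
{
	demo_slru_entry(slru_idx)->blocks_hit += 1;
	demo_have_slrustats = true;
}

/*
 * Fold the local counters into the shared copy and clear them.
 * Returns true if anything was flushed.  Assumption: a real implementation
 * would hold a lock on the shared area while adding.
 */
bool
demo_flush_slrustats(void)
{
	if (!demo_have_slrustats)
		return false;			/* nothing pending, skip the shared access */

	for (int i = 0; i < DEMO_SLRU_NUM_ELEMENTS; i++)
	{
		demo_shared[i].blocks_zeroed += demo_local[i].blocks_zeroed;
		demo_shared[i].blocks_hit += demo_local[i].blocks_hit;
	}

	memset(demo_local, 0, sizeof(demo_local));
	demo_have_slrustats = false;
	return true;
}
```

The flag lets the flush path return immediately when a backend never touched an SLRU, so the shared area is only accessed when there is something to add.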
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index af91c313e2..9b9c9b1c11 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -251,7 +251,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -512,7 +511,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1327,12 +1325,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1782,11 +1774,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2699,8 +2686,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -3029,8 +3014,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3097,13 +3080,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3176,22 +3152,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3653,22 +3613,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, SIGQUIT);
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3895,8 +3839,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3920,8 +3862,7 @@ PostmasterStateMachine(void)
     {
         /*
          * PM_WAIT_DEAD_END state ends when the BackendList is entirely empty
-         * (ie, no dead_end children remain), and the archiver and stats
-         * collector are gone too.
+         * (ie, no dead_end children remain), and the archiver is gone too.
          *
          * The reason we wait for those two is to protect them against a new
          * postmaster starting conflicting subprocesses; this isn't an
@@ -3931,8 +3872,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4133,8 +4073,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5067,18 +5005,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkarch") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgArchiverMain(argc, argv); /* does not return */
-    }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5191,12 +5117,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6113,7 +6033,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6169,8 +6088,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6403,7 +6320,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 0f54635550..d21801cf90 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1568,8 +1568,8 @@ is_checksummed_file(const char *fullpath, const char *filename)
  *
  * If 'missing_ok' is true, will not throw an error if the file is not found.
  *
- * If dboid is anything other than InvalidOid then any checksum failures detected
- * will get reported to the stats collector.
+ * If dboid is anything other than InvalidOid then any checksum failures
+ * detected will get reported to the activity stats facility.
  *
  * Returns true if the file was successfully sent, false if 'missing_ok',
  * and the file did not exist.
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index e00c7ffc01..608818beea 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -692,14 +692,10 @@ ReplicationSlotDropPtr(ReplicationSlot *slot)
                 (errmsg("could not remove directory \"%s\"", tmppath)));
 
     /*
-     * Send a message to drop the replication slot to the stats collector.
-     * Since there is no guarantee of the order of message transfer on a UDP
-     * connection, it's possible that a message for creating a new slot
-     * reaches before a message for removing the old slot. We send the drop
-     * and create messages while holding ReplicationSlotAllocationLock to
-     * reduce that possibility. If the messages reached in reverse, we would
-     * lose one statistics update message. But the next update message will
-     * create the statistics for the replication slot.
+     * Drop the statistics entry for the replication slot.  Do this while
+     * holding ReplicationSlotAllocationLock so that we don't drop a statistics
+     * entry for another slot with the same name just created in another
+     * session.
      */
     if (SlotIsLogical(slot))
         pgstat_report_replslot_drop(NameStr(slot->data.name));
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 561c212092..517354bed2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2059,7 +2059,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                CheckPointerStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2169,7 +2169,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2359,7 +2359,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2367,7 +2367,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
 
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index f9bbe97b50..78cfe91eab 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -149,6 +149,7 @@ CreateSharedMemoryAndSemaphores(void)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -267,6 +268,7 @@ CreateSharedMemoryAndSemaphores(void)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index db7e59f8b7..e6e4c0fb04 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -177,7 +177,9 @@ static const char *const BuiltinTrancheNames[] = {
     /* LWTRANCHE_PARALLEL_APPEND: */
     "ParallelAppend",
     /* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
-    "PerXactPredicateList"
+    "PerXactPredicateList",
+    /* LWTRANCHE_STATS: */
+    "ActivityStatistics"
 };
 
 StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index 774292fd94..91bf8d5b5d 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -53,3 +53,4 @@ XactTruncationLock                    44
 # 45 was XactTruncationLock until removal of BackendRandomLock
 WrapLimitsVacuumLock                46
 NotifyQueueTailLock                    47
+StatsLock                            48
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 4dc24649df..75d1695576 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -414,8 +414,8 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
     }
 
     /*
-     * It'd be nice to tell the stats collector to forget them immediately,
-     * too. But we can't because we don't know the OIDs.
+     * It'd be nice to tell the activity stats facility to forget them
+     * immediately, too. But we can't because we don't know the OIDs.
      */
 
     /*
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8dab9fd578..7a6f7a9e34 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3274,6 +3274,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3845,6 +3851,7 @@ PostgresMain(int argc, char *argv[],
     volatile bool send_ready_for_query = true;
     bool        idle_in_transaction_timeout_enabled = false;
     bool        idle_session_timeout_enabled = false;
+    bool        idle_stats_update_timeout_enabled = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4241,11 +4248,12 @@ PostgresMain(int argc, char *argv[],
          * Note: this includes fflush()'ing the last of the prior output.
          *
          * This is also a good time to send collected statistics to the
-         * collector, and to update the PS stats display.  We avoid doing
-         * those every time through the message loop because it'd slow down
-         * processing of batched messages, and because we don't want to report
-         * uncommitted updates (that confuses autovacuum).  The notification
-         * processor wants a call too, if we are not in a transaction block.
+         * activity stats facility, and to update the PS stats display.  We
+         * avoid doing those every time through the message loop because it'd
+         * slow down processing of batched messages, and because we don't want
+         * to report uncommitted updates (that confuses autovacuum).  The
+         * notification processor wants a call too, if we are not in a
+         * transaction block.
          *
          * Also, if an idle timeout is enabled, start the timer for that.
          */
@@ -4279,6 +4287,8 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long stats_timeout;
+
                 /* Send out notify signals and transmit self-notifies */
                 ProcessCompletedNotifies();
 
@@ -4291,8 +4301,14 @@ PostgresMain(int argc, char *argv[],
                 if (notifyInterruptPending)
                     ProcessNotifyInterrupt();
 
-                pgstat_report_stat(false);
-
+                /* Report stats; start the idle-stats-update timer if needed */
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    idle_stats_update_timeout_enabled = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle");
                 pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -4326,9 +4342,9 @@ PostgresMain(int argc, char *argv[],
         firstchar = ReadCommand(&input_message);
 
         /*
-         * (4) turn off the idle-in-transaction and idle-session timeouts, if
-         * active.  We do this before step (5) so that any last-moment timeout
-         * is certain to be detected in step (5).
+         * (4) turn off the idle-in-transaction, idle-session and
+         * idle-stats-update timeouts, if active.  We do this before step (5) so
+         * that any last-moment timeout is certain to be detected in step (5).
          *
          * At most one of these timeouts will be active, so there's no need to
          * worry about combining the timeout.c calls into one.
@@ -4343,6 +4359,11 @@ PostgresMain(int argc, char *argv[],
             disable_timeout(IDLE_SESSION_TIMEOUT, false);
             idle_session_timeout_enabled = false;
         }
+        if (idle_stats_update_timeout_enabled)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            idle_stats_update_timeout_enabled = false;
+        }
 
         /*
          * (5) disable async signal conditions again.
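
The postgres.c hunks above wire up an idle-stats-update timeout: pgstat_report_stat(false) returns a value that is passed to enable_timeout_after(), and ProcessInterrupts() later calls pgstat_report_stat(true) once IdleStatsUpdateTimeoutPending is set. The sketch below models that control flow with a simulated clock so it can run standalone. It assumes the return value means "milliseconds to wait before trying again, or 0 if nothing is left", which is inferred from the call site rather than stated in this excerpt. All demo_* names and the 500 ms interval are hypothetical.

```c
#include <stdbool.h>

#define DEMO_MIN_INTERVAL_MS 500	/* hypothetical minimum gap between flushes */

static long demo_now_ms = 0;		/* simulated clock */
static long demo_last_flush_ms = -DEMO_MIN_INTERVAL_MS;
static long demo_timer_deadline_ms = -1;	/* -1 means the timer is disabled */
static bool demo_timeout_pending = false;	/* models IdleStatsUpdateTimeoutPending */
static int	demo_pending = 0;		/* local updates not yet flushed */
static int	demo_flushed = 0;		/* updates that reached the "shared" side */

/*
 * Hypothetical model of pgstat_report_stat(force): flush if forced or if
 * enough time has passed; otherwise return how long the caller should wait.
 */
long
demo_report_stat(bool force)
{
	long		elapsed = demo_now_ms - demo_last_flush_ms;

	if (demo_pending == 0)
		return 0;
	if (!force && elapsed < DEMO_MIN_INTERVAL_MS)
		return DEMO_MIN_INTERVAL_MS - elapsed;

	demo_flushed += demo_pending;
	demo_pending = 0;
	demo_last_flush_ms = demo_now_ms;
	return 0;
}

/* Going idle: try a non-forced report, arm the timer if stats remain. */
void
demo_go_idle(void)
{
	long		timeout = demo_report_stat(false);

	if (timeout > 0)
		demo_timer_deadline_ms = demo_now_ms + timeout;
}

/* Advance the simulated clock; an expired timer only sets the pending flag. */
void
demo_advance_clock(long ms)
{
	demo_now_ms += ms;
	if (demo_timer_deadline_ms >= 0 && demo_now_ms >= demo_timer_deadline_ms)
	{
		demo_timer_deadline_ms = -1;
		demo_timeout_pending = true;
	}
}

/* Interrupt processing: a pending timeout forces the report. */
void
demo_process_interrupts(void)
{
	if (demo_timeout_pending)
	{
		demo_timeout_pending = false;
		(void) demo_report_stat(true);
	}
}

/* A new command arrived: disarm the timer, as step (4) does. */
void
demo_command_arrived(void)
{
	demo_timer_deadline_ms = -1;
}
```

The point of the timer in this model is that a backend which goes idle right after a rate-limited report does not sit on unflushed counters indefinitely; the forced report happens even if no further command arrives.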
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 62bff52638..6eaafcc86a 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * pgstatfuncs.c
- *      Functions for accessing the statistics collector data
+ *      Functions for accessing the activity statistics data
  *
  * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -35,9 +35,6 @@
 
 #define HAS_PGSTAT_PERMISSIONS(role)     (is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS) || has_privs_of_role(GetUserId(), role))
 
 
-/* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
-
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
 {
@@ -1269,7 +1266,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_commit);
+        result = (int64) (dbentry->counts.n_xact_commit);
 
     PG_RETURN_INT64(result);
 }
@@ -1285,7 +1282,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_rollback);
+        result = (int64) (dbentry->counts.n_xact_rollback);
 
     PG_RETURN_INT64(result);
 }
@@ -1301,7 +1298,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_fetched);
+        result = (int64) (dbentry->counts.n_blocks_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1317,7 +1314,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_hit);
+        result = (int64) (dbentry->counts.n_blocks_hit);
 
     PG_RETURN_INT64(result);
 }
@@ -1333,7 +1330,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_returned);
+        result = (int64) (dbentry->counts.n_tuples_returned);
 
     PG_RETURN_INT64(result);
 }
@@ -1349,7 +1346,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_fetched);
+        result = (int64) (dbentry->counts.n_tuples_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1365,7 +1362,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_inserted);
+        result = (int64) (dbentry->counts.n_tuples_inserted);
 
     PG_RETURN_INT64(result);
 }
@@ -1381,7 +1378,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_updated);
+        result = (int64) (dbentry->counts.n_tuples_updated);
 
     PG_RETURN_INT64(result);
 }
@@ -1397,7 +1394,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_deleted);
+        result = (int64) (dbentry->counts.n_tuples_deleted);
 
     PG_RETURN_INT64(result);
 }
@@ -1430,7 +1427,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_files;
+        result = dbentry->counts.n_temp_files;
 
     PG_RETURN_INT64(result);
 }
@@ -1446,7 +1443,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_bytes;
+        result = dbentry->counts.n_temp_bytes;
 
     PG_RETURN_INT64(result);
 }
@@ -1461,7 +1458,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace);
+        result = (int64) (dbentry->counts.n_conflict_tablespace);
 
     PG_RETURN_INT64(result);
 }
@@ -1476,7 +1473,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_lock);
+        result = (int64) (dbentry->counts.n_conflict_lock);
 
     PG_RETURN_INT64(result);
 }
@@ -1491,7 +1488,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_snapshot);
+        result = (int64) (dbentry->counts.n_conflict_snapshot);
 
     PG_RETURN_INT64(result);
 }
@@ -1506,7 +1503,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_bufferpin);
+        result = (int64) (dbentry->counts.n_conflict_bufferpin);
 
     PG_RETURN_INT64(result);
 }
@@ -1521,7 +1518,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1536,11 +1533,11 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace +
-                          dbentry->n_conflict_lock +
-                          dbentry->n_conflict_snapshot +
-                          dbentry->n_conflict_bufferpin +
-                          dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_tablespace +
+                          dbentry->counts.n_conflict_lock +
+                          dbentry->counts.n_conflict_snapshot +
+                          dbentry->counts.n_conflict_bufferpin +
+                          dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1555,7 +1552,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_deadlocks);
+        result = (int64) (dbentry->counts.n_deadlocks);
 
     PG_RETURN_INT64(result);
 }
@@ -1573,7 +1570,7 @@ pg_stat_get_db_checksum_failures(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_checksum_failures);
+        result = (int64) (dbentry->counts.n_checksum_failures);
 
     PG_RETURN_INT64(result);
 }
@@ -1610,7 +1607,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_read_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_read_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1626,7 +1623,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_write_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_write_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1640,7 +1637,7 @@ pg_stat_get_db_session_time(PG_FUNCTION_ARGS)
 
     /* convert counter from microsec to millisec for display */
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) != NULL)
-        result = ((double) dbentry->total_session_time) / 1000.0;
+        result = ((double) dbentry->counts.total_session_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1654,7 +1651,7 @@ pg_stat_get_db_active_time(PG_FUNCTION_ARGS)
 
     /* convert counter from microsec to millisec for display */
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) != NULL)
-        result = ((double) dbentry->total_active_time) / 1000.0;
+        result = ((double) dbentry->counts.total_active_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1668,7 +1665,7 @@ pg_stat_get_db_idle_in_transaction_time(PG_FUNCTION_ARGS)
 
     /* convert counter from microsec to millisec for display */
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) != NULL)
-        result = ((double) dbentry->total_idle_in_xact_time) / 1000.0;
+        result = ((double) dbentry->counts.total_idle_in_xact_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1681,7 +1678,7 @@ pg_stat_get_db_sessions(PG_FUNCTION_ARGS)
     PgStat_StatDBEntry *dbentry;
 
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) != NULL)
-        result = (int64) (dbentry->n_sessions);
+        result = (int64) (dbentry->counts.n_sessions);
 
     PG_RETURN_INT64(result);
 }
@@ -1694,7 +1691,7 @@ pg_stat_get_db_sessions_abandoned(PG_FUNCTION_ARGS)
     PgStat_StatDBEntry *dbentry;
 
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) != NULL)
-        result = (int64) (dbentry->n_sessions_abandoned);
+        result = (int64) (dbentry->counts.n_sessions_abandoned);
 
     PG_RETURN_INT64(result);
 }
@@ -1707,7 +1704,7 @@ pg_stat_get_db_sessions_fatal(PG_FUNCTION_ARGS)
     PgStat_StatDBEntry *dbentry;
 
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) != NULL)
-        result = (int64) (dbentry->n_sessions_fatal);
+        result = (int64) (dbentry->counts.n_sessions_fatal);
 
     PG_RETURN_INT64(result);
 }
@@ -1720,7 +1717,7 @@ pg_stat_get_db_sessions_killed(PG_FUNCTION_ARGS)
     PgStat_StatDBEntry *dbentry;
 
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) != NULL)
-        result = (int64) (dbentry->n_sessions_killed);
+        result = (int64) (dbentry->counts.n_sessions_killed);
 
     PG_RETURN_INT64(result);
 }
@@ -1728,69 +1725,71 @@ pg_stat_get_db_sessions_killed(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_bgwriter_timed_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->timed_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->timed_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_requested_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->requested_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->requested_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_clean);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_written_clean);
 }
 
 Datum
 pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->maxwritten_clean);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->maxwritten_clean);
 }
 
 Datum
 pg_stat_get_checkpoint_write_time(PG_FUNCTION_ARGS)
 {
     /* time is already in msec, just convert to double for presentation */
-    PG_RETURN_FLOAT8((double) pgstat_fetch_global()->checkpoint_write_time);
+    PG_RETURN_FLOAT8((double)
+                     pgstat_fetch_stat_checkpointer()->checkpoint_write_time);
 }
 
 Datum
 pg_stat_get_checkpoint_sync_time(PG_FUNCTION_ARGS)
 {
     /* time is already in msec, just convert to double for presentation */
-    PG_RETURN_FLOAT8((double) pgstat_fetch_global()->checkpoint_sync_time);
+    PG_RETURN_FLOAT8((double)
+                     pgstat_fetch_stat_checkpointer()->checkpoint_sync_time);
 }
 
 Datum
 pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_global()->stat_reset_timestamp);
+    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_stat_bgwriter()->stat_reset_timestamp);
 }
 
 Datum
 pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_backend);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_backend);
 }
 
 Datum
 pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_fsync_backend);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_fsync_backend);
 }
 
 Datum
 pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_alloc);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
 }
 
 /*
@@ -1804,7 +1803,7 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
     Datum        values[PG_STAT_GET_WAL_COLS];
     bool        nulls[PG_STAT_GET_WAL_COLS];
     char        buf[256];
-    PgStat_WalStats *wal_stats;
+    PgStat_Wal *wal_stats;
 
     /* Initialise values and NULL flags arrays */
     MemSet(values, 0, sizeof(values));
@@ -1829,11 +1828,11 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
     wal_stats = pgstat_fetch_stat_wal();
 
     /* Fill values and NULLs */
-    values[0] = Int64GetDatum(wal_stats->wal_records);
-    values[1] = Int64GetDatum(wal_stats->wal_fpi);
+    values[0] = Int64GetDatum(wal_stats->wal_usage.wal_records);
+    values[1] = Int64GetDatum(wal_stats->wal_usage.wal_fpi);
 
     /* Convert to numeric. */
-    snprintf(buf, sizeof buf, UINT64_FORMAT, wal_stats->wal_bytes);
+    snprintf(buf, sizeof buf, UINT64_FORMAT, wal_stats->wal_usage.wal_bytes);
     values[2] = DirectFunctionCall3(numeric_in,
                                     CStringGetDatum(buf),
                                     ObjectIdGetDatum(0),
@@ -2114,7 +2113,7 @@ pg_stat_get_xact_function_self_time(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_snapshot_timestamp(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_global()->stats_timestamp);
+    PG_RETURN_TIMESTAMPTZ(pgstat_get_stat_timestamp());
 }
 
 /* Discard the active statistics snapshot */
@@ -2202,7 +2201,7 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     TupleDesc    tupdesc;
     Datum        values[7];
     bool        nulls[7];
-    PgStat_ArchiverStats *archiver_stats;
+    PgStat_Archiver *archiver_stats;
 
     /* Initialise values and NULL flags arrays */
     MemSet(values, 0, sizeof(values));
@@ -2272,7 +2271,7 @@ pg_stat_get_replication_slots(PG_FUNCTION_ARGS)
     Tuplestorestate *tupstore;
     MemoryContext per_query_ctx;
     MemoryContext oldcontext;
-    PgStat_ReplSlotStats *slotstats;
+    PgStat_ReplSlot *slotstats;
     int            nstats;
     int            i;
 
@@ -2305,7 +2304,7 @@ pg_stat_get_replication_slots(PG_FUNCTION_ARGS)
     {
         Datum        values[PG_STAT_GET_REPLICATION_SLOT_COLS];
         bool        nulls[PG_STAT_GET_REPLICATION_SLOT_COLS];
-        PgStat_ReplSlotStats *s = &(slotstats[i]);
+        PgStat_ReplSlot *s = &(slotstats[i]);
 
         MemSet(values, 0, sizeof(values));
         MemSet(nulls, 0, sizeof(nulls));
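Note on the accessor changes above: each per-database getter now reads its value through a nested `counts` member, and still returns 0 when `pgstat_fetch_stat_dbentry()` finds no entry. Below is a minimal standalone sketch of that shape. `DbEntry`, `DbCounts`, `db_xact_commit` and `db_reset_counts` are hypothetical names invented for illustration and do not appear in the patch. The reset helper shows one possible benefit of grouping the counters (clearing them in a single call); that is an assumption, not something the patch states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef int64_t Counter;

/* Hypothetical, cut-down counter block; the real struct has many more fields. */
typedef struct DbCounts
{
    Counter     n_xact_commit;
    Counter     n_xact_rollback;
} DbCounts;

/* Hypothetical entry: identity fields stay outside the counter block. */
typedef struct DbEntry
{
    uint32_t    databaseid;
    DbCounts    counts;
} DbEntry;

/* Mirrors the getter pattern: a missing entry reads as zero. */
static int64_t
db_xact_commit(const DbEntry *entry)
{
    if (entry == NULL)
        return 0;
    return (int64_t) entry->counts.n_xact_commit;
}

/* Assumed benefit of the grouping: all counters can be cleared at once. */
static void
db_reset_counts(DbEntry *entry)
{
    assert(entry != NULL);
    memset(&entry->counts, 0, sizeof(entry->counts));
}
```

With an entry initialised as `{5, {7, 2}}`, `db_xact_commit()` returns 7, and after `db_reset_counts()` it returns 0 while `databaseid` is left untouched.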
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 7ef510cd01..0762c2034c 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -72,6 +72,7 @@
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/optimizer.h"
+#include "pgstat.h"
 #include "rewrite/rewriteDefine.h"
 #include "rewrite/rowsecurity.h"
 #include "storage/lmgr.h"
@@ -2366,6 +2367,9 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
      */
     RelationCloseSmgr(relation);
 
+    /* break mutual link with stats entry */
+    pgstat_delinkstats(relation);
+
     /*
      * Free all the subsidiary data structures of the relcache entry, then the
      * entry itself.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index ea28769d6a..997afcab6d 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -34,6 +34,7 @@ volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t IdleSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 0f67b99cc5..2567668b6c 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -269,9 +269,6 @@ GetBackendTypeDesc(BackendType backendType)
         case B_ARCHIVER:
             backendDesc = "archiver";
             break;
-        case B_STATS_COLLECTOR:
-            backendDesc = "stats collector";
-            break;
         case B_LOGGER:
             backendDesc = "logger";
             break;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index e5965bc517..d4c17fd7ab 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
 static void IdleSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -621,6 +622,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
         RegisterTimeout(IDLE_SESSION_TIMEOUT, IdleSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1243,6 +1246,14 @@ IdleSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
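Note on `IdleStatsUpdateTimeoutHandler` above: like the neighbouring timeout handlers, it only sets flags and wakes the process; the stats work itself is expected to run later, from the normal interrupt-processing path. Below is a minimal standalone sketch of that deferral pattern. All names (`stats_update_pending`, `interrupt_pending`, `flush_count`, `process_interrupts`, `demo`) are hypothetical stand-ins, and incrementing `flush_count` merely represents whatever the backend does when it notices the flag.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the flags declared in globals.c. */
static volatile sig_atomic_t stats_update_pending = false;
static volatile sig_atomic_t interrupt_pending = false;
static int  flush_count = 0;

/* Handler side: only set flags, no real work in signal context. */
static void
stats_update_timeout_handler(void)
{
    stats_update_pending = true;
    interrupt_pending = true;
    /* PostgreSQL additionally calls SetLatch(MyLatch) to wake the process. */
}

/* Main-loop side: runs at a safe point and consumes the flags. */
static void
process_interrupts(void)
{
    if (!interrupt_pending)
        return;
    interrupt_pending = false;

    if (stats_update_pending)
    {
        stats_update_pending = false;
        flush_count++;          /* stands in for the deferred stats work */
    }
}

/* Usage: one timeout leads to exactly one deferred action. */
static void
demo(void)
{
    process_interrupts();       /* nothing pending yet */
    assert(flush_count == 0);

    stats_update_timeout_handler();
    process_interrupts();       /* flag consumed here */
    process_interrupts();       /* no second action for the same timeout */
    assert(flush_count == 1);
}
```

Keeping the handler this small is what makes it safe to run asynchronously: it touches only `sig_atomic_t` flags, and everything else happens in `process_interrupts()`.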
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 17579eeaca..85299e2138 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -746,8 +746,8 @@ const char *const config_group_names[] =
     gettext_noop("Statistics"),
     /* STATS_MONITORING */
     gettext_noop("Statistics / Monitoring"),
-    /* STATS_COLLECTOR */
-    gettext_noop("Statistics / Query and Index Statistics Collector"),
+    /* STATS_ACTIVITY */
+    gettext_noop("Statistics / Query and Index Activity Statistics"),
     /* AUTOVACUUM */
     gettext_noop("Autovacuum"),
     /* CLIENT_CONN */
@@ -1457,7 +1457,7 @@ static struct config_bool ConfigureNamesBool[] =
 #endif
 
     {
-        {"track_activities", PGC_SUSET, STATS_COLLECTOR,
+        {"track_activities", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects information about executing commands."),
             gettext_noop("Enables the collection of information on the currently "
                          "executing command of each session, along with "
@@ -1468,7 +1468,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_counts", PGC_SUSET, STATS_COLLECTOR,
+        {"track_counts", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects statistics on database activity."),
             NULL
         },
@@ -1477,7 +1477,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_io_timing", PGC_SUSET, STATS_COLLECTOR,
+        {"track_io_timing", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects timing statistics for database I/O activity."),
             NULL
         },
@@ -4356,7 +4356,7 @@ static struct config_string ConfigureNamesString[] =
     },
 
     {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
+        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
             gettext_noop("Writes temporary statistics files to the specified directory."),
             NULL,
             GUC_SUPERUSER_ONLY
@@ -4692,7 +4692,7 @@ static struct config_enum ConfigureNamesEnum[] =
     },
 
     {
-        {"track_functions", PGC_SUSET, STATS_COLLECTOR,
+        {"track_functions", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects function-level statistics on database activity."),
             NULL
         },
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 8930a94fff..4f5b6bdb12 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -580,7 +580,7 @@
 # STATISTICS
 #------------------------------------------------------------------------------
 
-# - Query and Index Statistics Collector -
+# - Query and Index Activity Statistics -
 
 #track_activities = on
 #track_counts = on
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index f674a7c94e..c61b828bf3 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -6,7 +6,7 @@ use File::Basename qw(basename dirname);
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 109;
+use Test::More tests => 108;
 
 program_help_ok('pg_basebackup');
 program_version_ok('pg_basebackup');
@@ -124,7 +124,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index adb9f819bb..12708b9470 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -84,6 +84,8 @@ extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
 
@@ -322,7 +324,6 @@ typedef enum BackendType
     B_WAL_SENDER,
     B_WAL_WRITER,
     B_ARCHIVER,
-    B_STATS_COLLECTOR,
     B_LOGGER,
 } BackendType;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 724068cf87..222104b88e 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL activity statistics facility.
  *
  *    Copyright (c) 2001-2021, PostgreSQL Global Development Group
  *
@@ -12,12 +12,15 @@
 #define PGSTAT_H
 
 #include "datatype/timestamp.h"
+#include "executor/instrument.h"
 #include "libpq/pqcomm.h"
 #include "miscadmin.h"
 #include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -27,8 +30,8 @@
  * ----------
  */
 #define PGSTAT_STAT_PERMANENT_DIRECTORY        "pg_stat"
-#define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
-#define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
+#define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/saved_stats"
+#define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/saved_stats.tmp"
 
 /* Default directory to store temporary statistics data in */
 #define PG_STAT_TMP_DIR        "pg_stat_tmp"
@@ -51,39 +54,6 @@ typedef enum SessionEndType
     DISCONNECT_KILLED
 } SessionEndType;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_RESETSLRUCOUNTER,
-    PGSTAT_MTYPE_RESETREPLSLOTCOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_WAL,
-    PGSTAT_MTYPE_SLRU,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE,
-    PGSTAT_MTYPE_REPLSLOT,
-    PGSTAT_MTYPE_CONNECTION,
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -94,9 +64,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -170,10 +139,13 @@ typedef enum PgStat_Single_Reset_Type
  */
 typedef struct PgStat_TableStatus
 {
-    Oid            t_id;            /* table's OID */
-    bool        t_shared;        /* is it a shared catalog? */
     struct PgStat_TableXactStatus *trans;    /* lowest subxact's counts */
+    TimestampTz vacuum_timestamp;    /* user-initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum-initiated vacuum */
+    TimestampTz analyze_timestamp;    /* user-initiated analyze */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum-initiated analyze */
     PgStat_TableCounts t_counts;    /* event counts to be sent */
+    Relation    relation;            /* rel that is using this entry */
 } PgStat_TableStatus;
 
 /* ----------
@@ -197,353 +169,57 @@ typedef struct PgStat_TableXactStatus
     struct PgStat_TableXactStatus *next;    /* next of same subxact */
 } PgStat_TableXactStatus;
 
-
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
-/* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
+/*
+ * Archiver statistics kept in the shared stats
  */
-typedef struct PgStat_MsgHdr
+typedef struct PgStat_Archiver
 {
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
+    PgStat_Counter archived_count;    /* archival successes */
+    char        last_archived_wal[MAX_XFN_CHARS + 1];    /* last WAL file
+                                                         * archived */
+    TimestampTz last_archived_timestamp;    /* last archival success time */
+    PgStat_Counter failed_count;    /* failed archival attempts */
+    char        last_failed_wal[MAX_XFN_CHARS + 1]; /* WAL file involved in
+                                                     * last failure */
+    TimestampTz last_failed_timestamp;    /* last archival failure time */
+    TimestampTz stat_reset_timestamp;
+} PgStat_Archiver;
 
 /* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgDummy
+typedef struct PgStat_BgWriter
 {
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_alloc;
+    TimestampTz stat_reset_timestamp;
+} PgStat_BgWriter;
 
 /* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
+ * PgStat_CheckPointer        checkpointer statistics
  * ----------
  */
-
-typedef struct PgStat_MsgInquiry
+typedef struct PgStat_CheckPointer
 {
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgResetslrucounter Sent by the backend to tell the collector
- *                                to reset a SLRU counter
- * ----------
- */
-typedef struct PgStat_MsgResetslrucounter
-{
-    PgStat_MsgHdr m_hdr;
-    int            m_index;
-} PgStat_MsgResetslrucounter;
-
-/* ----------
- * PgStat_MsgResetreplslotcounter Sent by the backend to tell the collector
- *                                to reset replication slot counter(s)
- * ----------
- */
-typedef struct PgStat_MsgResetreplslotcounter
-{
-    PgStat_MsgHdr m_hdr;
-    char        m_slotname[NAMEDATALEN];
-    bool        clearall;
-} PgStat_MsgResetreplslotcounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgWal            Sent by backends and background processes to update WAL statistics.
- * ----------
- */
-typedef struct PgStat_MsgWal
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Counter m_wal_records;
-    PgStat_Counter m_wal_fpi;
-    uint64        m_wal_bytes;
-    PgStat_Counter m_wal_buffers_full;
-} PgStat_MsgWal;
-
-/* ----------
- * PgStat_MsgSLRU            Sent by a backend to update SLRU statistics.
- * ----------
- */
-typedef struct PgStat_MsgSLRU
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Counter m_index;
-    PgStat_Counter m_blocks_zeroed;
-    PgStat_Counter m_blocks_hit;
-    PgStat_Counter m_blocks_read;
-    PgStat_Counter m_blocks_written;
-    PgStat_Counter m_blocks_exists;
-    PgStat_Counter m_flush;
-    PgStat_Counter m_truncate;
-} PgStat_MsgSLRU;
-
-/* ----------
- * PgStat_MsgReplSlot    Sent by a backend or a wal sender to update replication
- *                        slot statistics.
- * ----------
- */
-typedef struct PgStat_MsgReplSlot
-{
-    PgStat_MsgHdr m_hdr;
-    char        m_slotname[NAMEDATALEN];
-    bool        m_drop;
-    PgStat_Counter m_spill_txns;
-    PgStat_Counter m_spill_count;
-    PgStat_Counter m_spill_bytes;
-    PgStat_Counter m_stream_txns;
-    PgStat_Counter m_stream_count;
-    PgStat_Counter m_stream_bytes;
-} PgStat_MsgReplSlot;
-
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+} PgStat_CheckPointer;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -559,7 +235,6 @@ typedef struct PgStat_FunctionCounts
  */
 typedef struct PgStat_BackendFunctionEntry
 {
-    Oid            f_id;
     PgStat_FunctionCounts f_counts;
 } PgStat_BackendFunctionEntry;
 
@@ -575,117 +250,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-/* ----------
- * PgStat_MsgConn            Sent by the backend to update connection statistics.
- * ----------
- */
-typedef struct PgStat_MsgConn
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Counter m_count;
-    PgStat_Counter m_session_time;
-    PgStat_Counter m_active_time;
-    PgStat_Counter m_idle_in_xact_time;
-    SessionEndType m_disconnect;
-} PgStat_MsgConn;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgResetslrucounter msg_resetslrucounter;
-    PgStat_MsgResetreplslotcounter msg_resetreplslotcounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgWal msg_wal;
-    PgStat_MsgSLRU msg_slru;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-    PgStat_MsgReplSlot msg_replslot;
-    PgStat_MsgConn msg_conn;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Activity statistics data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -694,13 +260,9 @@ typedef union PgStat_Msg
 
 #define PGSTAT_FILE_FORMAT_ID    0x01A5BCA0
 
-/* ----------
- * PgStat_StatDBEntry            The collector's data per database
- * ----------
- */
-typedef struct PgStat_StatDBEntry
+
+typedef struct PgStat_StatDBCounts
 {
-    Oid            databaseid;
     PgStat_Counter n_xact_commit;
     PgStat_Counter n_xact_rollback;
     PgStat_Counter n_blocks_fetched;
@@ -710,7 +272,6 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_tuples_inserted;
     PgStat_Counter n_tuples_updated;
     PgStat_Counter n_tuples_deleted;
-    TimestampTz last_autovac_time;
     PgStat_Counter n_conflict_tablespace;
     PgStat_Counter n_conflict_lock;
     PgStat_Counter n_conflict_snapshot;
@@ -720,7 +281,6 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_temp_bytes;
     PgStat_Counter n_deadlocks;
     PgStat_Counter n_checksum_failures;
-    TimestampTz last_checksum_failure;
     PgStat_Counter n_block_read_time;    /* times in microseconds */
     PgStat_Counter n_block_write_time;
     PgStat_Counter n_sessions;
@@ -730,26 +290,84 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_sessions_abandoned;
     PgStat_Counter n_sessions_fatal;
     PgStat_Counter n_sessions_killed;
+} PgStat_StatDBCounts;
 
+/* ----------
+ * PgStat_StatEntryHeader          common header struct for PgStat_Stat*Entry
+ * ----------
+ */
+typedef struct PgStat_StatEntryHeader
+{
+    LWLock        lock;
+    bool        dropped;            /* This entry is being dropped and should
+                                     * be removed when refcount goes to
+                                     * zero. */
+    pg_atomic_uint32  refcount;        /* How many backends are referencing */
+} PgStat_StatEntryHeader;
+
+/* ----------
+ * PgStat_StatDBEntry            The statistics per database
+ * ----------
+ */
+typedef struct PgStat_StatDBEntry
+{
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    TimestampTz last_autovac_time;
+    TimestampTz last_checksum_failure;
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;    /* time of db stats update */
+
+    PgStat_StatDBCounts counts;
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following members must be last in the struct, because we don't
+     * write them out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables; /* current gen tables hash */
+    dshash_table_handle functions;    /* current gen functions hash */
 } PgStat_StatDBEntry;
 
+/* ----------
+ * PgStat_Wal   WAL statistics, updated by backends and background
+ *                processes.
+ * ----------
+ */
+typedef struct PgStat_Wal
+{
+    WalUsage       wal_usage;
+    PgStat_Counter wal_buffers_full;
+    TimestampTz stat_reset_timestamp;
+} PgStat_Wal;
+
+/*
+ * SLRU statistics kept in the stats collector
+ */
+typedef struct PgStat_SLRUStats
+{
+    PgStat_Counter blocks_zeroed;
+    PgStat_Counter blocks_hit;
+    PgStat_Counter blocks_read;
+    PgStat_Counter blocks_written;
+    PgStat_Counter blocks_exists;
+    PgStat_Counter flush;
+    PgStat_Counter truncate;
+    TimestampTz stat_reset_timestamp;
+} PgStat_SLRUStats;
 
 /* ----------
- * PgStat_StatTabEntry            The collector's data per table (or index)
+ * PgStat_StatTabEntry            The statistics per table (or index)
  * ----------
  */
 typedef struct PgStat_StatTabEntry
 {
-    Oid            tableid;
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    /* Persistent data follow */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
 
     PgStat_Counter numscans;
 
@@ -769,99 +387,35 @@ typedef struct PgStat_StatTabEntry
     PgStat_Counter blocks_fetched;
     PgStat_Counter blocks_hit;
 
-    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
     PgStat_Counter vacuum_count;
-    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_vacuum_count;
-    TimestampTz analyze_timestamp;    /* user initiated */
     PgStat_Counter analyze_count;
-    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_analyze_count;
 } PgStat_StatTabEntry;
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            per function stats data
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
 {
-    Oid            functionid;
-
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    /* Persistent data follow */
     PgStat_Counter f_numcalls;
 
     PgStat_Counter f_total_time;    /* times in microseconds */
     PgStat_Counter f_self_time;
 } PgStat_StatFuncEntry;
 
-
-/*
- * Archiver statistics kept in the stats collector
- */
-typedef struct PgStat_ArchiverStats
-{
-    PgStat_Counter archived_count;    /* archival successes */
-    char        last_archived_wal[MAX_XFN_CHARS + 1];    /* last WAL file
-                                                         * archived */
-    TimestampTz last_archived_timestamp;    /* last archival success time */
-    PgStat_Counter failed_count;    /* failed archival attempts */
-    char        last_failed_wal[MAX_XFN_CHARS + 1]; /* WAL file involved in
-                                                     * last failure */
-    TimestampTz last_failed_timestamp;    /* last archival failure time */
-    TimestampTz stat_reset_timestamp;
-} PgStat_ArchiverStats;
-
-/*
- * Global statistics kept in the stats collector
- */
-typedef struct PgStat_GlobalStats
-{
-    TimestampTz stats_timestamp;    /* time of stats file update */
-    PgStat_Counter timed_checkpoints;
-    PgStat_Counter requested_checkpoints;
-    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
-    PgStat_Counter checkpoint_sync_time;
-    PgStat_Counter buf_written_checkpoints;
-    PgStat_Counter buf_written_clean;
-    PgStat_Counter maxwritten_clean;
-    PgStat_Counter buf_written_backend;
-    PgStat_Counter buf_fsync_backend;
-    PgStat_Counter buf_alloc;
-    TimestampTz stat_reset_timestamp;
-} PgStat_GlobalStats;
-
-/*
- * WAL statistics kept in the stats collector
- */
-typedef struct PgStat_WalStats
-{
-    PgStat_Counter wal_records;
-    PgStat_Counter wal_fpi;
-    uint64        wal_bytes;
-    PgStat_Counter wal_buffers_full;
-    TimestampTz stat_reset_timestamp;
-} PgStat_WalStats;
-
-/*
- * SLRU statistics kept in the stats collector
- */
-typedef struct PgStat_SLRUStats
-{
-    PgStat_Counter blocks_zeroed;
-    PgStat_Counter blocks_hit;
-    PgStat_Counter blocks_read;
-    PgStat_Counter blocks_written;
-    PgStat_Counter blocks_exists;
-    PgStat_Counter flush;
-    PgStat_Counter truncate;
-    TimestampTz stat_reset_timestamp;
-} PgStat_SLRUStats;
-
 /*
  * Replication slot statistics kept in the stats collector
  */
-typedef struct PgStat_ReplSlotStats
+typedef struct PgStat_ReplSlot
 {
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
     char        slotname[NAMEDATALEN];
     PgStat_Counter spill_txns;
     PgStat_Counter spill_count;
@@ -870,7 +424,7 @@ typedef struct PgStat_ReplSlotStats
     PgStat_Counter stream_count;
     PgStat_Counter stream_bytes;
     TimestampTz stat_reset_timestamp;
-} PgStat_ReplSlotStats;
+} PgStat_ReplSlot;
 
 /* ----------
  * Backend states
@@ -919,7 +473,7 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
+    WAIT_EVENT_READING_STATS_FILE,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
     WAIT_EVENT_WAL_RECEIVER_MAIN,
@@ -1172,7 +726,7 @@ typedef struct PgBackendGSSStatus
  *
  * Each live backend maintains a PgBackendStatus struct in shared memory
  * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
+ * BackendId, but that is not critical.)  Note that the stats-writing code
  * has no involvement in, or even access to, these structs.
  *
  * Each auxiliary process also maintains a PgBackendStatus struct in shared
@@ -1369,18 +923,26 @@ extern PGDLLIMPORT bool pgstat_track_counts;
 extern PGDLLIMPORT int pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used, but will be removed with GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
+
+/*
+ * CheckPointer statistics counters are updated directly by checkpointer and
+ * bufmgr
+ */
+extern PgStat_CheckPointer CheckPointerStats;
 
 /*
  * WAL statistics counter is updated by backends and background processes
  */
-extern PgStat_MsgWal WalStats;
+extern PgStat_Wal WalStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1400,33 +962,27 @@ extern SessionEndType pgStatSessionEndCause;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
-
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 extern void pgstat_reset_slru_counter(const char *);
 extern void pgstat_reset_replslot_counter(const char *name);
 
+extern void pgstat_report_wal(void);
 extern void pgstat_report_autovac(Oid dboid);
 extern void pgstat_report_vacuum(Oid tableoid, bool shared,
                                  PgStat_Counter livetuples, PgStat_Counter deadtuples);
@@ -1466,6 +1022,7 @@ extern PgStat_TableStatus *find_tabstat_entry(Oid rel_id);
 extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 
 extern void pgstat_initstats(Relation rel);
+extern void pgstat_delinkstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
 
@@ -1588,9 +1145,10 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                                       void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
-extern void pgstat_send_wal(void);
+extern void pgstat_report_archiver(const char *xlog, bool failed);
+extern void pgstat_report_bgwriter(void);
+extern void pgstat_report_checkpointer(void);
+extern void pgstat_report_wal(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1599,15 +1157,20 @@ extern void pgstat_send_wal(void);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_extended(bool shared,
+                                                                Oid relid);
+extern void pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
-extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
-extern PgStat_GlobalStats *pgstat_fetch_global(void);
-extern PgStat_WalStats *pgstat_fetch_stat_wal(void);
+extern TimestampTz pgstat_get_stat_timestamp(void);
+extern PgStat_Archiver *pgstat_fetch_stat_archiver(void);
+extern PgStat_BgWriter *pgstat_fetch_stat_bgwriter(void);
+extern PgStat_CheckPointer *pgstat_fetch_stat_checkpointer(void);
+extern PgStat_Wal *pgstat_fetch_stat_wal(void);
 extern PgStat_SLRUStats *pgstat_fetch_slru(void);
-extern PgStat_ReplSlotStats *pgstat_fetch_replslot(int *nslots_p);
+extern PgStat_ReplSlot *pgstat_fetch_replslot(int *nslots_p);
 
 extern void pgstat_count_slru_page_zeroed(int slru_idx);
 extern void pgstat_count_slru_page_hit(int slru_idx);
@@ -1618,5 +1181,6 @@ extern void pgstat_count_slru_flush(int slru_idx);
 extern void pgstat_count_slru_truncate(int slru_idx);
 extern const char *pgstat_slru_name(int slru_idx);
 extern int    pgstat_slru_index(const char *name);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index cbf2510fbf..9ed6b54428 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -218,6 +218,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SHARED_TIDBITMAP,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_PER_XACT_PREDICATE_LIST,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index b9b5c1adda..add9c53ee3 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -88,7 +88,7 @@ enum config_group
     PROCESS_TITLE,
     STATS,
     STATS_MONITORING,
-    STATS_COLLECTOR,
+    STATS_ACTIVITY,
     AUTOVACUUM,
     CLIENT_CONN,
     CLIENT_CONN_STATEMENT,
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index ecb2a366a5..f090f7372a 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -32,6 +32,7 @@ typedef enum TimeoutId
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
     IDLE_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
-- 
2.27.0
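
The comment on PgStat_StatEntryHeader in the header patch above says that a
dropped entry should be removed once its refcount goes to zero. A minimal
sketch of that lifecycle follows. It is not code from the patch: it uses C11
atomics instead of PostgreSQL's pg_atomic_* API, the DemoEntryHeader type and
the demo_entry_* helpers are hypothetical names, and it assumes the caller
already holds whatever lock serializes acquire against drop (the real header
carries an LWLock member for the entry).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the patch's entry header (lock omitted). */
typedef struct DemoEntryHeader
{
    bool        dropped;    /* entry is being dropped */
    atomic_uint refcount;   /* number of holders referencing the entry */
} DemoEntryHeader;

static void
demo_entry_init(DemoEntryHeader *h)
{
    h->dropped = false;
    atomic_init(&h->refcount, 0);
}

/* Take a reference; refuse if the entry is already marked as dropped. */
static bool
demo_entry_acquire(DemoEntryHeader *h)
{
    if (h->dropped)
        return false;
    atomic_fetch_add(&h->refcount, 1);
    return true;
}

/*
 * Mark the entry as dropped.  Returns true if nobody references it, in
 * which case the caller may free it right away.
 */
static bool
demo_entry_mark_dropped(DemoEntryHeader *h)
{
    h->dropped = true;
    return atomic_load(&h->refcount) == 0;
}

/*
 * Release a reference.  Returns true if this was the last reference to a
 * dropped entry, in which case the caller is responsible for freeing it.
 */
static bool
demo_entry_release(DemoEntryHeader *h)
{
    unsigned int prev = atomic_fetch_sub(&h->refcount, 1);

    assert(prev > 0);       /* releasing an unreferenced entry is a bug */
    return prev == 1 && h->dropped;
}
```

With two holders, demo_entry_mark_dropped() reports that the entry cannot be
freed yet, later demo_entry_acquire() calls are refused, and only the second
demo_entry_release() tells its caller to free the entry.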

From 9a8a963e0e9c6fcc50a8dcf9a74300e761310929 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:09 +0900
Subject: [PATCH v47 5/7] Doc part of shared-memory based stats collector.

---
 doc/src/sgml/catalogs.sgml          |   6 +-
 doc/src/sgml/config.sgml            |  34 ++++----
 doc/src/sgml/high-availability.sgml |  13 +--
 doc/src/sgml/monitoring.sgml        | 127 +++++++++++++---------------
 doc/src/sgml/ref/pg_dump.sgml       |   9 +-
 5 files changed, 90 insertions(+), 99 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 43d7a1ad90..179607fd5a 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -9245,9 +9245,9 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l
   <para>
    <xref linkend="view-table"/> lists the system views described here.
    More detailed documentation of each view follows below.
-   There are some additional views that provide access to the results of
-   the statistics collector; they are described in <xref
-   linkend="monitoring-stats-views-table"/>.
+   There are some additional views that provide access to the activity
+   statistics; they are described in
+   <xref linkend="monitoring-stats-views-table"/>.
   </para>
 
   <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 82864bbb24..b0c25c9c5c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7327,11 +7327,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
     <title>Run-time Statistics</title>
 
     <sect2 id="runtime-config-statistics-collector">
-     <title>Query and Index Statistics Collector</title>
+     <title>Query and Index Activity Statistics</title>
 
      <para>
-      These parameters control server-wide statistics collection features.
-      When statistics collection is enabled, the data that is produced can be
+      These parameters control server-wide activity statistics features.
+      When activity statistics are enabled, the data that is produced can be
       accessed via the <structname>pg_stat</structname> and
       <structname>pg_statio</structname> family of system views.
       Refer to <xref linkend="monitoring"/> for more information.
@@ -7347,14 +7347,13 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables the collection of information on the currently
-        executing command of each session, along with the time when
-        that command began execution. This parameter is on by
-        default. Note that even when enabled, this information is not
-        visible to all users, only to superusers and the user owning
-        the session being reported on, so it should not represent a
-        security risk.
-        Only superusers can change this setting.
+        Enables tracking of the currently executing command of
+        each session, along with the time when that command began
+        execution. This parameter is on by default. Note that even when
+        enabled, this information is not visible to all users, only to
+        superusers and the user owning the session being reported on, so it
+        should not represent a security risk.  Only superusers can change this
+        setting.
        </para>
       </listitem>
      </varlistentry>
@@ -7385,9 +7384,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables collection of statistics on database activity.
+        Enables tracking of database activity.
         This parameter is on by default, because the autovacuum
-        daemon needs the collected information.
+        daemon needs the activity information.
         Only superusers can change this setting.
        </para>
       </listitem>
@@ -8485,7 +8484,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       <listitem>
        <para>
         Specifies the fraction of the total number of heap tuples counted in
-        the previous statistics collection that can be inserted without
+        the previously collected statistics that can be inserted without
         incurring an index scan at the <command>VACUUM</command> cleanup stage.
         This setting currently applies to B-tree indexes only.
        </para>
@@ -8497,9 +8496,10 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         the index contains deleted pages that can be recycled during cleanup.
         Index statistics are considered to be stale if the number of newly
         inserted tuples exceeds the <varname>vacuum_cleanup_index_scale_factor</varname>
-        fraction of the total number of heap tuples detected by the previous
-        statistics collection. The total number of heap tuples is stored in
-        the index meta-page. Note that the meta-page does not include this data
+
+        fraction of the total number of heap tuples in the previously
+        collected statistics. The total number of heap tuples is stored in the
+        index meta-page. Note that the meta-page does not include this data
         until <command>VACUUM</command> finds no dead tuples, so B-tree index
         scan at the cleanup stage can only be skipped if the second and
         subsequent <command>VACUUM</command> cycles detect no dead tuples.
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index dc263e4106..f3ff7b6b6a 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2365,12 +2365,13 @@ HINT:  You can then restart the server after making the necessary configuration
    </para>
 
    <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
-    index usage, etc., will be recorded normally on the standby. Replayed
-    actions will not duplicate their effects on primary, so replaying an
-    insert will not increment the Inserts column of pg_stat_user_tables.
-    The stats file is deleted at the start of recovery, so stats from primary
-    and standby will differ; this is considered a feature, not a bug.
+    Activity statistics are collected during recovery. All scans, reads,
+    blocks, index usage, etc., will be recorded normally on the
+    standby. Replayed actions will not duplicate their effects on primary, so
+    replaying an insert will not increment the Inserts column of
+    pg_stat_user_tables.  Activity statistics are reset at the start of
+    recovery, so stats from primary and standby will differ; this is
+    considered a feature, not a bug.
    </para>
 
    <para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index f05140dd42..9c9dedb6b1 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -22,7 +22,7 @@
   <para>
    Several tools are available for monitoring database activity and
    analyzing performance.  Most of this chapter is devoted to describing
-   <productname>PostgreSQL</productname>'s statistics collector,
+   <productname>PostgreSQL</productname>'s activity statistics,
    but one should not neglect regular Unix monitoring programs such as
    <command>ps</command>, <command>top</command>, <command>iostat</command>, and <command>vmstat</command>.
    Also, once one has identified a
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in
transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    primary server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   primary process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   primary process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
@@ -130,20 +128,21 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
  </sect1>
 
  <sect1 id="monitoring-stats">
-  <title>The Statistics Collector</title>
+  <title>The Activity Statistics</title>
 
   <indexterm zone="monitoring-stats">
    <primary>statistics</primary>
   </indexterm>
 
   <para>
-   <productname>PostgreSQL</productname>'s <firstterm>statistics collector</firstterm>
-   is a subsystem that supports collection and reporting of information about
-   server activity.  Presently, the collector can count accesses to tables
-   and indexes in both disk-block and individual-row terms.  It also tracks
-   the total number of rows in each table, and information about vacuum and
-   analyze actions for each table.  It can also count calls to user-defined
-   functions and the total time spent in each one.
+   <productname>PostgreSQL</productname>'s <firstterm>activity
+   statistics</firstterm> is a subsystem that supports tracking and reporting
+   of information about server activity.  Presently, the activity statistics
+   tracks the count of accesses to tables and indexes in both disk-block and
+   individual-row terms.  It also tracks the total number of rows in each
+   table, and information about vacuum and analyze actions for each table.  It
+   can also track calls to user-defined functions and the total time spent in
+   each one.
   </para>
 
   <para>
@@ -151,15 +150,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    information about exactly what is going on in the system right now, such as
    the exact command currently being executed by other server processes, and
    which other connections exist in the system.  This facility is independent
-   of the collector process.
+   of the activity statistics.
   </para>
 
  <sect2 id="monitoring-stats-setup">
-  <title>Statistics Collection Configuration</title>
+  <title>Activity Statistics Configuration</title>
 
   <para>
-   Since collection of statistics adds some overhead to query execution,
-   the system can be configured to collect or not collect information.
+   Since tracking for the activity statistics adds some overhead to query
+   execution, the system can be configured to track or not track activity.
    This is controlled by configuration parameters that are normally set in
    <filename>postgresql.conf</filename>.  (See <xref linkend="runtime-config"/> for
    details about setting configuration parameters.)
@@ -172,7 +171,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The parameter <xref linkend="guc-track-counts"/> controls whether
-   statistics are collected about table and index accesses.
+   to track activity about table and index accesses.
   </para>
 
   <para>
@@ -196,18 +195,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
   </para>
 
   <para>
-   The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
-   When the server shuts down cleanly, a permanent copy of the statistics
-   data is stored in the <filename>pg_stat</filename> subdirectory, so that
-   statistics can be retained across server restarts.  When recovery is
-   performed at server start (e.g., after immediate shutdown, server crash,
-   and point-in-time recovery), all statistics counters are reset.
+   When the server shuts down cleanly, a permanent copy of the statistics
+   data is stored in the <filename>pg_stat</filename> subdirectory, so that
+   statistics can be retained across server restarts.  When recovery is
+   performed at server start (e.g., after immediate shutdown, server crash,
+   and point-in-time recovery), all statistics counters are reset.
   </para>
 
  </sect2>
@@ -220,48 +212,46 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    linkend="monitoring-stats-dynamic-views-table"/>, are available to show
    the current state of the system. There are also several other
    views, listed in <xref
-   linkend="monitoring-stats-views-table"/>, available to show the results
-   of statistics collection.  Alternatively, one can
-   build custom views using the underlying statistics functions, as discussed
-   in <xref linkend="monitoring-stats-functions"/>.
+   linkend="monitoring-stats-views-table"/>, available to show the activity
+   statistics.  Alternatively, one can build custom views using the underlying
+   statistics functions, as discussed in
+   <xref linkend="monitoring-stats-functions"/>.
   </para>
 
   <para>
-   When using the statistics to monitor collected data, it is important
-   to realize that the information does not update instantaneously.
-   Each individual server process transmits new statistical counts to
-   the collector just before going idle; so a query or transaction still in
-   progress does not affect the displayed totals.  Also, the collector itself
-   emits a new report at most once per <varname>PGSTAT_STAT_INTERVAL</varname>
-   milliseconds (500 ms unless altered while building the server).  So the
-   displayed information lags behind actual activity.  However, current-query
-   information collected by <varname>track_activities</varname> is
-   always up-to-date.
+   When using the activity statistics, it is important to realize that the
+   information does not update instantaneously.  Each individual server
+   process writes out new statistical counts just before going idle, but not
+   more often than once per <varname>PGSTAT_STAT_INTERVAL</varname>
+   milliseconds (1 second unless altered while building the server); so a
+   query or transaction still in progress does not affect the displayed
+   totals.  However, current-query information tracked by
+   <varname>track_activities</varname> is always up-to-date.
   </para>
 
   <para>
    Another important point is that when a server process is asked to display
-   any of these statistics, it first fetches the most recent report emitted by
-   the collector process and then continues to use this snapshot for all
-   statistical views and functions until the end of its current transaction.
-   So the statistics will show static information as long as you continue the
-   current transaction.  Similarly, information about the current queries of
-   all sessions is collected when any such information is first requested
-   within a transaction, and the same information will be displayed throughout
-   the transaction.
-   This is a feature, not a bug, because it allows you to perform several
-   queries on the statistics and correlate the results without worrying that
-   the numbers are changing underneath you.  But if you want to see new
-   results with each query, be sure to do the queries outside any transaction
-   block.  Alternatively, you can invoke
+   any of these statistics, it first reads the current statistics and then
+   continues to use this snapshot for all statistical views and functions
+   until the end of its current transaction.  So the statistics will show
+   static information as long as you continue the current transaction.
+   Similarly, information about the current queries of all sessions is tracked
+   when any such information is first requested within a transaction, and the
+   same information will be displayed throughout the transaction.  This is a
+   feature, not a bug, because it allows you to perform several queries on the
+   statistics and correlate the results without worrying that the numbers are
+   changing underneath you.  But if you want to see new results with each
+   query, be sure to do the queries outside any transaction block.
+   Alternatively, you can invoke
    <function>pg_stat_clear_snapshot</function>(), which will discard the
    current transaction's statistics snapshot (if any).  The next use of
    statistical information will cause a new snapshot to be fetched.
   </para>
-
+  
   <para>
-   A transaction can also see its own statistics (as yet untransmitted to the
-   collector) in the views <structname>pg_stat_xact_all_tables</structname>,
+   A transaction can also see its own statistics (as yet unwritten to the
+   server-wide activity statistics) in the
+   views <structname>pg_stat_xact_all_tables</structname>,
    <structname>pg_stat_xact_sys_tables</structname>,
    <structname>pg_stat_xact_user_tables</structname>, and
    <structname>pg_stat_xact_user_functions</structname>.  These numbers do not act as
@@ -643,7 +633,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    kernel's I/O cache, and might therefore still be fetched without
    requiring a physical read. Users interested in obtaining more
    detailed information on <productname>PostgreSQL</productname> I/O behavior are
-   advised to use the <productname>PostgreSQL</productname> statistics collector
+   advised to use the <productname>PostgreSQL</productname> activity statistics
    in combination with operating system utilities that allow insight
    into the kernel's handling of I/O.
   </para>
@@ -1080,10 +1070,6 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>LogicalLauncherMain</literal></entry>
       <entry>Waiting in main loop of logical replication launcher process.</entry>
      </row>
-     <row>
-      <entry><literal>PgStatMain</literal></entry>
-      <entry>Waiting in main loop of statistics collector process.</entry>
-     </row>
      <row>
       <entry><literal>RecoveryWalStream</literal></entry>
       <entry>Waiting in main loop of startup process for WAL to arrive, during
@@ -1838,6 +1824,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
     </thead>
 
     <tbody>
+     <row>
+      <entry><literal>ActivityStatistics</literal></entry>
+      <entry>Waiting to write out activity statistics to shared memory.</entry>
+     </row>
      <row>
       <entry><literal>AddinShmemInit</literal></entry>
       <entry>Waiting to manage an extension's space allocation in shared
@@ -6073,9 +6063,10 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
      <entry><literal>performing final cleanup</literal></entry>
      <entry>
        <command>VACUUM</command> is performing final cleanup.  During this phase,
-       <command>VACUUM</command> will vacuum the free space map, update statistics
-       in <literal>pg_class</literal>, and report statistics to the statistics
-       collector.  When this phase is completed, <command>VACUUM</command> will end.
+       <command>VACUUM</command> will vacuum the free space map, update
+       statistics in <literal>pg_class</literal>, and update the system-wide
+       activity statistics.  When this phase is completed,
+       <command>VACUUM</command> will end.
      </entry>
     </row>
    </tbody>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index bcbb7a25fb..1fa59a2fdf 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1280,11 +1280,10 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The database activity of <application>pg_dump</application> is
-   normally collected by the statistics collector.  If this is
-   undesirable, you can set parameter <varname>track_counts</varname>
-   to false via <envar>PGOPTIONS</envar> or the <literal>ALTER
-   USER</literal> command.
+   The database activity of <application>pg_dump</application> is normally
+   collected.  If this is undesirable, you can set
+   parameter <varname>track_counts</varname> to false
+   via <envar>PGOPTIONS</envar> or the <literal>ALTER USER</literal> command.
   </para>
 
  </refsect1>
-- 
2.27.0

From 5f43e352e6133a84ef63eacd9401b0023bbe7164 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 29 Sep 2020 22:59:33 +0900
Subject: [PATCH v47 6/7] Remove the GUC stats_temp_directory

The new stats collection system doesn't need a temporary directory, so
just remove it. pg_stat_statements is modified to use the pg_stat
directory to store its temporary files.  As a result, basebackup copies
pg_stat_statements' temporary file if it exists.
---
 .../pg_stat_statements/pg_stat_statements.c   | 13 +++---
 doc/src/sgml/backup.sgml                      |  2 -
 doc/src/sgml/config.sgml                      | 19 --------
 doc/src/sgml/storage.sgml                     |  6 ---
 src/backend/postmaster/pgstat.c               | 10 -----
 src/backend/replication/basebackup.c          | 36 ----------------
 src/backend/utils/misc/guc.c                  | 43 -------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/bin/initdb/initdb.c                       |  1 -
 src/bin/pg_rewind/filemap.c                   |  7 ---
 src/include/pgstat.h                          |  3 --
 src/test/perl/PostgresNode.pm                 |  4 --
 12 files changed, 6 insertions(+), 139 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 72a117fc19..0a98b2f2c0 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -89,14 +89,13 @@ PG_MODULE_MAGIC;
 #define PGSS_DUMP_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pg_stat_statements.stat"
 
 /*
- * Location of external query text file.  We don't keep it in the core
- * system's stats_temp_directory.  The core system can safely use that GUC
- * setting, because the statistics collector temp file paths are set only once
- * as part of changing the GUC, but pg_stat_statements has no way of avoiding
- * race conditions.  Besides, we only expect modest, infrequent I/O for query
- * strings, so placing the file on a faster filesystem is not compelling.
+ * Location of external query text file.  We don't keep it in the core system's
+ * pg_stats.  pg_stat_statements has no way of avoiding race conditions even if
+ * the directory were specified by a GUC.  Besides, we only expect modest,
+ * infrequent I/O for query strings, so placing the file on a faster filesystem
+ * is not compelling.
  */
-#define PGSS_TEXT_FILE    PG_STAT_TMP_DIR "/pgss_query_texts.stat"
+#define PGSS_TEXT_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pgss_query_texts.stat"
 
 /* Magic number identifying the stats file format */
 static const uint32 PGSS_FILE_HEADER = 0x20201218;
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 3c8aaed0b6..7557a375f0 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1146,8 +1146,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b0c25c9c5c..084bc57779 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7440,25 +7440,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 3234adb639..6bac5e075e 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -120,12 +120,6 @@ Item
   subsystem</entry>
 </row>
 
-<row>
- <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
-</row>
-
 <row>
  <entry><filename>pg_subtrans</filename></entry>
  <entry>Subdirectory containing subtransaction status data</entry>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8f431759c6..2bbb0ef437 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -99,16 +99,6 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
- */
-char       *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
-
 /*
  * WAL usage counters saved from pgWALUsage at the previous call to
  * pgstat_send_wal(). This is used to calculate how much WAL usage
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index d21801cf90..d2c3064678 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -87,9 +87,6 @@ static int    basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
 /* Was the backup currently in-progress initiated in recovery mode? */
 static bool backup_started_in_recovery = false;
 
-/* Relative path of temporary statistics directory */
-static char *statrelpath = NULL;
-
 /*
  * Size of each block sent into the tar stream for larger files.
  */
@@ -152,13 +149,6 @@ struct exclude_list_item
  */
 static const char *const excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    PG_STAT_TMP_DIR,
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
@@ -261,7 +251,6 @@ perform_base_backup(basebackup_options *opt)
     StringInfo    labelfile;
     StringInfo    tblspc_map_file;
     backup_manifest_info manifest;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
     backup_total = 0;
@@ -284,8 +273,6 @@ perform_base_backup(basebackup_options *opt)
     Assert(CurrentResourceOwner == NULL);
     CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -314,18 +301,6 @@ perform_base_backup(basebackup_options *opt)
         tablespaceinfo *ti;
         int            tblspc_streamed = 0;
 
-        /*
-         * Calculate the relative path of temporary statistics directory in
-         * order to skip the files which are located in that directory later.
-         */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
-
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
         ti->size = -1;
@@ -1377,17 +1352,6 @@ sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces,
         if (excludeFound)
             continue;
 
-        /*
-         * Exclude contents of directory specified by statrelpath if not set
-         * to the default (pg_stat_tmp) which is caught in the loop above.
-         */
-        if (statrelpath != NULL && strcmp(pathbuf, statrelpath) == 0)
-        {
-            elog(DEBUG1, "contents of directory \"%s\" excluded from backup", statrelpath);
-            size += _tarWriteDir(pathbuf, basepathlen, &statbuf, sizeonly);
-            continue;
-        }
-
         /*
          * We can skip pg_wal, the WAL segments need to be fetched from the
          * WAL archive anyway. But include it as an empty directory anyway, so
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 85299e2138..16e430fb28 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -203,7 +203,6 @@ static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource sourc
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_huge_page_size(int *newval, void **extra, GucSource source);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -560,8 +559,6 @@ char       *HbaFileName;
 char       *IdentFileName;
 char       *external_pid_file;
 
-char       *pgstat_temp_directory;
-
 char       *application_name;
 
 int            tcp_keepalives_idle;
@@ -4355,17 +4352,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_PRIMARY,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11773,35 +11759,6 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
     return true;
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 4f5b6bdb12..20c24a9d78 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -587,7 +587,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index e242a4a5b5..6d59562eac 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -218,7 +218,6 @@ static const char *const subdirs[] = {
     "pg_replslot",
     "pg_tblspc",
     "pg_stat",
-    "pg_stat_tmp",
     "pg_xact",
     "pg_logical",
     "pg_logical/snapshots",
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 2618b4c957..ab5cb51de7 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -87,13 +87,6 @@ struct exclude_list_item
  */
 static const char *excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    "pg_stat_tmp",                /* defined as PG_STAT_TMP_DIR */
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 222104b88e..ce4feaea3b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -33,9 +33,6 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/saved_stats"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/saved_stats.tmp"
 
-/* Default directory to store temporary statistics data in */
-#define PG_STAT_TMP_DIR        "pg_stat_tmp"
-
 /* Values for track_functions GUC variable --- order is significant! */
 typedef enum TrackFunctionsLevel
 {
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 9667f7667e..dd41a43b4e 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -455,10 +455,6 @@ sub init
     print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
       if defined $ENV{TEMP_CONFIG};
 
-    # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
-    # concurrently must not share a stats_temp_directory.
-    print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
     if ($params{allows_streaming})
     {
         if ($params{allows_streaming} eq "logical")
-- 
2.27.0

From adfdf850065b8ad1ecc0b31f7ca4123c30ac34e8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 29 Sep 2020 23:19:58 +0900
Subject: [PATCH v47 7/7] Exclude pg_stat directory from base backup

basebackup sends the contents of the pg_stat directory, which are doomed
to be removed at startup from the backup. Now that pg_stat_statements
saves a temporary file into that directory, let's exclude the pg_stat
directory from base backups.
---
 src/backend/replication/basebackup.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index d2c3064678..25677c5c6e 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -149,6 +149,13 @@ struct exclude_list_item
  */
 static const char *const excludeDirContents[] =
 {
+    /*
+     * Skip statistics files. PGSTAT_STAT_PERMANENT_DIRECTORY must be skipped
+     * because the files in the directory will be removed at startup from the
+     * backup.
+     */
+    PGSTAT_STAT_PERMANENT_DIRECTORY,
+
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
-- 
2.27.0


Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
At Tue, 16 Mar 2021 10:27:55 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Mon, 15 Mar 2021 17:49:36 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > Thanks for committing this!  I'm very happy to see this reduces the
> > size of this patchset.
> 
> Now that 0003 is committed as d75288fb27, and 33394ee6f2 conflicts
> with old 0004, I'd like to post a rebased version for future work.
> 
> The commit 33394ee6f2 adds on-exit forced write of WAL stats on
> walwriter and in this patch that part would appear to have been
> removed.  However, this patchset already does that by calling to
> pgstat_report_stat from pgstat_beshutdown_hook.

Rebased and fixed two bugs.  The review comments received so far are
not addressed in this version.

5f79580ad6 conflicts with this. However, the modified code is removed
by the patch. (And 081876d75e caused a small conflict.)

I fixed two silly bugs along with the rebasing.

1) An assertion failure happens while accessing pg_stat_database,
   since pgstat_fetch_stat_dbentry() rejected 0 as database oid.

2) pgstat_report_stat() failed to flush out global database stats (oid
  = 0). I stopped collecting database entries in the existing loop on
  pgStatLocalhash, then modified the existing second loop to scan
  pgStatLocalhash directly for database stats entries. The second loop
  might get slow when the first loop leaves many table/function local
  stats entries behind due to heavy load.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 571027eda41ece044c9e64264d05205437c1ea55 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:03 +0900
Subject: [PATCH v57 1/6] sequential scan for dshash

Dshash did not allow scanning all entries sequentially. This commit
adds that functionality.  The interface is similar to, but a bit
different from, both that of dynahash and the simple dshash search
functions.  One of the most significant differences is that the
sequential scan interface of dshash always needs a call to
dshash_seq_term when a scan ends. Another is locking.  Dshash holds a
partition lock when returning an entry; dshash_seq_next() also holds
the lock when returning an entry, but callers shouldn't release it,
since the lock is essential to continue the scan.  The seqscan
interface allows entry deletion while a scan is in progress. In-scan
deletion should be performed by dshash_delete_current().
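
The calling protocol described above (init, repeated next, optional
delete of the current entry, and a mandatory term) can be illustrated
with a toy model.  This is only a sketch under stated assumptions: every
toy_* name below is hypothetical and is not part of dshash, the "locks"
are plain booleans rather than LWLocks, and the table is a fixed array
rather than a dynamic shared hash.  It only shows why the scan keeps a
partition lock across calls and why the terminating call is required.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NPARTITIONS            4
#define TOY_BUCKETS_PER_PARTITION  2
#define TOY_NBUCKETS  (TOY_NPARTITIONS * TOY_BUCKETS_PER_PARTITION)

/* Hypothetical stand-in for a partitioned hash table. */
typedef struct toy_table
{
    int  values[TOY_NBUCKETS];    /* one slot per bucket; 0 means empty */
    bool locked[TOY_NPARTITIONS]; /* stand-in for per-partition locks */
} toy_table;

/* Hypothetical scan state, loosely modeled on the commit message. */
typedef struct toy_seq_status
{
    toy_table *table;
    int  curbucket;     /* bucket of the last returned entry, -1 at start */
    int  curpartition;  /* partition whose lock the scan holds, -1 if none */
    bool exclusive;     /* must be true to delete during the scan */
} toy_seq_status;

void
toy_seq_init(toy_seq_status *st, toy_table *t, bool exclusive)
{
    st->table = t;
    st->curbucket = -1;
    st->curpartition = -1;
    st->exclusive = exclusive;
}

/* Return the next non-empty slot, or NULL.  The lock stays held on return. */
int *
toy_seq_next(toy_seq_status *st)
{
    toy_table *t = st->table;

    for (int b = st->curbucket + 1; b < TOY_NBUCKETS; b++)
    {
        int part = b / TOY_BUCKETS_PER_PARTITION;

        if (part != st->curpartition)
        {
            /* Crossing a partition boundary: swap the held lock. */
            if (st->curpartition >= 0)
                t->locked[st->curpartition] = false;
            t->locked[part] = true;
            st->curpartition = part;
        }
        if (t->values[b] != 0)
        {
            st->curbucket = b;
            return &t->values[b];
        }
    }
    st->curbucket = TOY_NBUCKETS;
    return NULL;                /* the last lock is released by toy_seq_term */
}

/* Delete the entry most recently returned by toy_seq_next(). */
void
toy_delete_current(toy_seq_status *st)
{
    assert(st->exclusive);
    assert(st->curbucket >= 0 && st->curbucket < TOY_NBUCKETS);
    st->table->values[st->curbucket] = 0;
}

/* Must always be called: releases whatever lock the scan still holds. */
void
toy_seq_term(toy_seq_status *st)
{
    if (st->curpartition >= 0)
        st->table->locked[st->curpartition] = false;
    st->curpartition = -1;
}

/* Example caller: delete every even value, return the number deleted. */
int
toy_delete_even(toy_table *t)
{
    toy_seq_status st;
    int *v;
    int ndeleted = 0;

    toy_seq_init(&st, t, true);
    while ((v = toy_seq_next(&st)) != NULL)
    {
        if (*v % 2 == 0)
        {
            toy_delete_current(&st);
            ndeleted++;
        }
    }
    toy_seq_term(&st);
    return ndeleted;
}

/* Example caller: count entries with a shared (non-exclusive) scan. */
int
toy_count(toy_table *t)
{
    toy_seq_status st;
    int n = 0;

    toy_seq_init(&st, t, false);
    while (toy_seq_next(&st) != NULL)
        n++;
    toy_seq_term(&st);
    return n;
}

/* Number of partition locks currently marked as held. */
int
toy_locks_held(const toy_table *t)
{
    int n = 0;

    for (int p = 0; p < TOY_NPARTITIONS; p++)
        n += t->locked[p] ? 1 : 0;
    return n;
}
```

In this model, skipping toy_seq_term() would leave the last partition
marked as locked, which mirrors why the real interface requires the
terminating call even after the scan has returned NULL.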
---
 src/backend/lib/dshash.c         | 161 ++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h         |  22 +++++
 src/tools/pgindent/typedefs.list |   1 +
 3 files changed, 183 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index e0c763be32..29ad767618 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -127,6 +127,10 @@ struct dshash_table
 #define NUM_SPLITS(size_log2)                    \
     (size_log2 - DSHASH_NUM_PARTITIONS_LOG2)
 
+/* How many buckets are there in a given size? */
+#define NUM_BUCKETS(size_log2)        \
+    (((size_t) 1) << (size_log2))
+
 /* How many buckets are there in each partition at a given size? */
 #define BUCKETS_PER_PARTITION(size_log2)        \
     (((size_t) 1) << NUM_SPLITS(size_log2))
@@ -153,6 +157,10 @@ struct dshash_table
 #define BUCKET_INDEX_FOR_PARTITION(partition, size_log2)    \
     ((partition) << NUM_SPLITS(size_log2))
 
+/* Choose partition based on bucket index. */
+#define PARTITION_FOR_BUCKET_INDEX(bucket_idx, size_log2)                \
+    ((bucket_idx) >> NUM_SPLITS(size_log2))
+
 /* The head of the active bucket for a given hash value (lvalue). */
 #define BUCKET_FOR_HASH(hash_table, hash)                                \
     (hash_table->buckets[                                                \
@@ -324,7 +332,7 @@ dshash_destroy(dshash_table *hash_table)
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
-    size = ((size_t) 1) << hash_table->size_log2;
+    size = NUM_BUCKETS(hash_table->size_log2);
     for (i = 0; i < size; ++i)
     {
         dsa_pointer item_pointer = hash_table->buckets[i];
@@ -592,6 +600,157 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan through dshash table and return all the
+ *           elements one by one, return NULL when no more.
+ *
+ * dshash_seq_term should always be called when a scan finishes.
+ * The caller may delete returned elements in the midst of a scan by using
+ * dshash_delete_current(). exclusive must be true to delete elements.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                bool exclusive)
+{
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = 0;
+    status->curitem = NULL;
+    status->pnextitem = InvalidDsaPointer;
+    status->curpartition = -1;
+    status->exclusive = exclusive;
+}
+
+/*
+ * Returns the next element.
+ *
+ * Returned elements are locked and the caller must not explicitly release
+ * the lock; dshash_seq_next() or dshash_seq_term() releases it later.
+ */
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    if (status->curitem == NULL)
+    {
+        int partition;
+
+        Assert(status->curbucket == 0);
+        Assert(!status->hash_table->find_locked);
+
+        /* first shot. grab the first item. */
+        partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+        LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+                      status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+        status->curpartition = partition;
+
+        /* resize doesn't happen from now until seq scan ends */
+        status->nbuckets =
+            NUM_BUCKETS(status->hash_table->control->size_log2);
+        ensure_valid_bucket_pointers(status->hash_table);
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->pnextitem;
+
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(status->hash_table,
+                                               status->curpartition),
+                                status->exclusive ? LW_EXCLUSIVE : LW_SHARED));
+
+    /* Move to the next bucket if we finished the current bucket */
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        int next_partition;
+
+        if (++status->curbucket >= status->nbuckets)
+        {
+            /* all buckets have been scanned. finish. */
+            return NULL;
+        }
+
+        /* Check whether we have moved to the next partition */
+        next_partition =
+            PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+                                       status->hash_table->size_log2);
+
+        if (status->curpartition != next_partition)
+        {
+            /*
+             * Move to the next partition. Lock the next partition, then release
+             * the current one (not in the reverse order) to avoid concurrent
+             * resizing.  Avoid deadlock by taking locks in the same order
+             * as resize().
+             */
+            LWLockAcquire(PARTITION_LOCK(status->hash_table,
+                                         next_partition),
+                          status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+            LWLockRelease(PARTITION_LOCK(status->hash_table,
+                                         status->curpartition));
+            status->curpartition = next_partition;
+        }
+
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    status->hash_table->find_locked = true;
+    status->hash_table->find_exclusively_locked = status->exclusive;
+
+    /*
+     * The caller may delete the item. Store the next item in case of deletion.
+     */
+    status->pnextitem = status->curitem->next;
+
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+/*
+ * Terminates the seqscan and releases all locks.
+ *
+ * Should always be called when finishing or exiting a seqscan.
+ */
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+    status->hash_table->find_locked = false;
+    status->hash_table->find_exclusively_locked = false;
+
+    if (status->curpartition >= 0)
+        LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
+
+/* Remove the current entry during a seq scan. */
+void
+dshash_delete_current(dshash_seq_status *status)
+{
+    dshash_table       *hash_table    = status->hash_table;
+    dshash_table_item  *item        = status->curitem;
+    size_t                partition PG_USED_FOR_ASSERTS_ONLY
+        = PARTITION_FOR_HASH(item->hash);
+
+    Assert(status->exclusive);
+    Assert(hash_table->control->magic == DSHASH_MAGIC);
+    Assert(hash_table->find_locked);
+    Assert(hash_table->find_exclusively_locked);
+    Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition),
+                                LW_EXCLUSIVE));
+
+    delete_item(hash_table, item);
+}
+
+/* Get the current entry during a seq scan. */
+void *
+dshash_get_current(dshash_seq_status *status)
+{
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index c069ec9de7..a6ea377173 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,21 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+/*
+ * Sequential scan state. The detail is exposed to let users know the storage
+ * size but it should be considered as an opaque type by callers.
+ */
+typedef struct dshash_seq_status
+{
+    dshash_table       *hash_table;    /* dshash table working on */
+    int                    curbucket;    /* bucket number we are at */
+    int                    nbuckets;    /* total number of buckets in the dshash */
+    dshash_table_item  *curitem;    /* item we are currently at */
+    dsa_pointer            pnextitem;    /* dsa-pointer to the next item */
+    int                    curpartition;    /* partition number we are at */
+    bool                exclusive;    /* locking mode */
+} dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
                                    const dshash_parameters *params,
@@ -80,6 +95,13 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+                            bool exclusive);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_term(dshash_seq_status *status);
+extern void dshash_delete_current(dshash_seq_status *status);
+extern void *dshash_get_current(dshash_seq_status *status);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 1d1d5d2f0e..58ad2991a7 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2961,6 +2961,7 @@ dshash_hash
 dshash_hash_function
 dshash_parameters
 dshash_partition
+dshash_seq_status
 dshash_table
 dshash_table_control
 dshash_table_handle
-- 
2.27.0

From 389261c931054e5a2a0044a414bb3a1e96885ddd Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:57 +0900
Subject: [PATCH v57 2/6] Add conditional lock feature to dshash

Dshash currently waits for locks unconditionally. This is inconvenient
when we want to avoid being blocked by other processes. This commit
adds dshash_find_extended, a variant of dshash_find and
dshash_find_or_insert that can return immediately on lock failure.
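
As a rough illustration of the "return immediately on lock failure" idea,
here is a self-contained toy that uses a C11 atomic_flag as a try-lock.  It
is a sketch under stated assumptions, not the dshash code: all toy_* names
are hypothetical, the table has a single slot, and the extra busy output
parameter is an invention of this example to tell "lock busy" apart from
"key absent".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* One-slot "partition"; every name here is hypothetical. */
typedef struct toy_partition
{
    atomic_flag lock;           /* set while the partition is locked */
    bool        used;
    int         key;
    int         value;
} toy_partition;

/* Try-lock: true if the lock was free and is now ours. */
static bool toy_lock_conditional(toy_partition *p)
{
    return !atomic_flag_test_and_set(&p->lock);
}

static void toy_lock_wait(toy_partition *p)
{
    while (atomic_flag_test_and_set(&p->lock))
        ;                       /* spin until the holder releases it */
}

static void toy_unlock(toy_partition *p)
{
    atomic_flag_clear(&p->lock);
}

/*
 * Returns the value with the lock held, or NULL if the key is absent or,
 * with nowait, if the lock was busy.  *busy tells the two NULL cases apart.
 */
static int *toy_find_nowait(toy_partition *p, int key, bool nowait, bool *busy)
{
    *busy = false;
    if (!nowait)
        toy_lock_wait(p);
    else if (!toy_lock_conditional(p))
    {
        *busy = true;
        return NULL;            /* give up at once instead of blocking */
    }

    if (p->used && p->key == key)
        return &p->value;       /* caller must call toy_unlock() */

    toy_unlock(p);              /* not found: do not leave the lock held */
    return NULL;
}

/* Adds one power of two to the score for each case that behaves as expected. */
static int nowait_demo(void)
{
    toy_partition p = {ATOMIC_FLAG_INIT, true, 42, 7};
    bool        busy;
    int         score = 0;

    if (toy_find_nowait(&p, 42, true, &busy) != NULL && !busy)
        score += 1;             /* lock was free: entry returned, locked */
    if (toy_find_nowait(&p, 42, true, &busy) == NULL && busy)
        score += 2;             /* lock still held: returned immediately */
    toy_unlock(&p);
    if (toy_find_nowait(&p, 1, false, &busy) == NULL && !busy)
        score += 4;             /* missing key: NULL, lock released */
    if (toy_lock_conditional(&p))
        score += 8;             /* shows the lock was indeed released */
    toy_unlock(&p);
    return score;
}
```

A caller that gets NULL with busy set can simply keep its data and retry
later, which is the behaviour a non-blocking stats flush would want.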
---
 src/backend/lib/dshash.c | 98 +++++++++++++++++++++-------------------
 src/include/lib/dshash.h |  3 ++
 2 files changed, 55 insertions(+), 46 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 29ad767618..f79b6de245 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -383,6 +383,10 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
  * the caller must take care to ensure that the entry is not left corrupted.
  * The lock mode is either shared or exclusive depending on 'exclusive'.
  *
+ * If found is not NULL, *found is set to true if the key is found in the hash
+ * table. If the key is not found, *found is set to false and a pointer to a
+ * newly created entry is returned.
+ *
  * The caller must not lock a lock already.
  *
  * Note that the lock held is in fact an LWLock, so interrupts will be held on
@@ -392,36 +396,7 @@ dshash_get_hash_table_handle(dshash_table *hash_table)
 void *
 dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
 {
-    dshash_hash hash;
-    size_t        partition;
-    dshash_table_item *item;
-
-    hash = hash_key(hash_table, key);
-    partition = PARTITION_FOR_HASH(hash);
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
-
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
-    ensure_valid_bucket_pointers(hash_table);
-
-    /* Search the active bucket. */
-    item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
-
-    if (!item)
-    {
-        /* Not found. */
-        LWLockRelease(PARTITION_LOCK(hash_table, partition));
-        return NULL;
-    }
-    else
-    {
-        /* The caller will free the lock by calling dshash_release_lock. */
-        hash_table->find_locked = true;
-        hash_table->find_exclusively_locked = exclusive;
-        return ENTRY_FROM_ITEM(item);
-    }
+    return dshash_find_extended(hash_table, key, exclusive, false, false, NULL);
 }
 
 /*
@@ -439,31 +414,60 @@ dshash_find_or_insert(dshash_table *hash_table,
                       const void *key,
                       bool *found)
 {
-    dshash_hash hash;
-    size_t        partition_index;
-    dshash_partition *partition;
+    return dshash_find_extended(hash_table, key, true, false, true, found);
+}
+
+
+/*
+ * Find the key in the hash table.
+ *
+ * "exclusive" is the lock mode in which the partition for the returned item
+ * is locked.  If "nowait" is true, the function returns immediately if the
+ * required lock cannot be acquired.  "insert" indicates insert mode: if the
+ * key is not found, a new entry is inserted and *found is set to false;
+ * otherwise *found is set to true. "found" must be non-null in this mode.
+ */
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+                     bool exclusive, bool nowait, bool insert, bool *found)
+{
+    dshash_hash hash = hash_key(hash_table, key);
+    size_t        partidx = PARTITION_FOR_HASH(hash);
+    dshash_partition *partition = &hash_table->control->partitions[partidx];
+    LWLockMode  lockmode = exclusive ? LW_EXCLUSIVE : LW_SHARED;
     dshash_table_item *item;
 
-    hash = hash_key(hash_table, key);
-    partition_index = PARTITION_FOR_HASH(hash);
-    partition = &hash_table->control->partitions[partition_index];
-
-    Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
+    /* must be exclusive when insert allowed */
+    Assert(!insert || (exclusive && found != NULL));
 
 restart:
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-                  LW_EXCLUSIVE);
+    if (!nowait)
+        LWLockAcquire(PARTITION_LOCK(hash_table, partidx), lockmode);
+    else if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partidx),
+                                       lockmode))
+        return NULL;
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
     item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
 
     if (item)
-        *found = true;
+    {
+        if (found)
+            *found = true;
+    }
     else
     {
-        *found = false;
+        if (found)
+            *found = false;
+
+        if (!insert)
+        {
+            /* The caller didn't ask us to add a new entry. */
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+            return NULL;
+        }
 
         /* Check if we are getting too full. */
         if (partition->count > MAX_COUNT_PER_PARTITION(hash_table))
@@ -479,7 +483,8 @@ restart:
              * Give up our existing lock first, because resizing needs to
              * reacquire all the locks in the right order to avoid deadlocks.
              */
-            LWLockRelease(PARTITION_LOCK(hash_table, partition_index));
+            LWLockRelease(PARTITION_LOCK(hash_table, partidx));
+
             resize(hash_table, hash_table->size_log2 + 1);
 
             goto restart;
@@ -493,12 +498,13 @@ restart:
         ++partition->count;
     }
 
-    /* The caller must release the lock with dshash_release_lock. */
+    /* The caller will free the lock by calling dshash_release_lock. */
     hash_table->find_locked = true;
-    hash_table->find_exclusively_locked = true;
+    hash_table->find_exclusively_locked = exclusive;
     return ENTRY_FROM_ITEM(item);
 }
 
+
 /*
  * Remove an entry by key.  Returns true if the key was found and the
  * corresponding entry was removed.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index a6ea377173..5b8114d041 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -91,6 +91,9 @@ extern void *dshash_find(dshash_table *hash_table,
                          const void *key, bool exclusive);
 extern void *dshash_find_or_insert(dshash_table *hash_table,
                                    const void *key, bool *found);
+extern void *dshash_find_extended(dshash_table *hash_table, const void *key,
+                                  bool exclusive, bool nowait, bool insert,
+                                  bool *found);
 extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
-- 
2.27.0

From 6c90d462ee32d83bc2523fe234277167f5579fdc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 9 Mar 2021 15:07:02 +0900
Subject: [PATCH v57 3/6] Shared-memory based stats collector

Previously, activity statistics were collected via sockets and shared
among backends periodically through files. Such files can reach tens
of megabytes and are written as often as once per second; the stats
collector serializes that much data and every backend periodically
de-serializes it. To avoid that cost, this patch places activity
statistics data in shared memory. Each backend accumulates statistics
locally, then tries to move them to the shared statistics at
transaction end, but at intervals no shorter than 10 seconds.  Until
60 seconds have elapsed since the last flush to shared stats, a lock
failure postpones the flush, to alleviate lock contention that would
slow down transactions.  After that, the flush waits for locks so that
the shared statistics do not get stale.
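
The timing policy in the paragraph above can be modelled as a small decision
function.  This is a hedged sketch only: the macro and function names are
hypothetical, the exact boundary handling (< versus <=) is an assumption,
and the real pgstat code is not shown here.  Only the 10-second and
60-second figures come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Intervals from the commit message; the macro names are hypothetical. */
#define TOY_MIN_INTERVAL_MS 10000L  /* never flush more often than this */
#define TOY_MAX_INTERVAL_MS 60000L  /* past this, wait for the lock */

typedef enum { TOY_SKIP, TOY_TRY_NOWAIT, TOY_WAIT } toy_flush_mode;

/* Pick a flush mode from the time elapsed since the last flush. */
static toy_flush_mode toy_flush_mode_for(long elapsed_ms)
{
    if (elapsed_ms < TOY_MIN_INTERVAL_MS)
        return TOY_SKIP;
    if (elapsed_ms < TOY_MAX_INTERVAL_MS)
        return TOY_TRY_NOWAIT;
    return TOY_WAIT;
}

/*
 * Model one transaction end.  Returns true if local stats were flushed.
 * lock_is_free stands in for the outcome of a conditional lock attempt.
 */
static bool toy_report_stat(long now_ms, long *last_flush_ms, bool lock_is_free)
{
    toy_flush_mode mode = toy_flush_mode_for(now_ms - *last_flush_ms);

    if (mode == TOY_SKIP)
        return false;
    if (mode == TOY_TRY_NOWAIT && !lock_is_free)
        return false;           /* postpone; keep accumulating locally */

    /* TOY_WAIT would block on the lock here; the toy just proceeds. */
    *last_flush_ms = now_ms;
    return true;
}
```

In this model a backend under lock contention keeps its counters local for
at most the longer interval, after which it flushes regardless of
contention.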
---
 src/backend/access/heap/heapam_handler.c      |    4 +-
 src/backend/access/heap/vacuumlazy.c          |    2 +-
 src/backend/access/transam/xlog.c             |   14 +-
 src/backend/catalog/index.c                   |   24 +-
 src/backend/commands/dbcommands.c             |    2 +-
 src/backend/commands/matview.c                |    8 +-
 src/backend/commands/vacuum.c                 |    4 +-
 src/backend/postmaster/autovacuum.c           |   76 +-
 src/backend/postmaster/bgwriter.c             |    4 +-
 src/backend/postmaster/checkpointer.c         |   30 +-
 src/backend/postmaster/pgarch.c               |   12 +-
 src/backend/postmaster/pgstat.c               | 6378 +++++++----------
 src/backend/postmaster/postmaster.c           |   82 +-
 src/backend/postmaster/walwriter.c            |   14 +-
 src/backend/replication/basebackup.c          |    4 +-
 src/backend/replication/slot.c                |   12 +-
 src/backend/storage/buffer/bufmgr.c           |    8 +-
 src/backend/storage/ipc/ipci.c                |    2 +
 src/backend/storage/lmgr/lwlock.c             |    4 +-
 src/backend/storage/lmgr/lwlocknames.txt      |    1 +
 src/backend/storage/smgr/smgr.c               |    4 +-
 src/backend/tcop/postgres.c                   |   41 +-
 src/backend/utils/adt/pgstatfuncs.c           |  109 +-
 src/backend/utils/cache/relcache.c            |    5 +
 src/backend/utils/init/globals.c              |    1 +
 src/backend/utils/init/miscinit.c             |    3 -
 src/backend/utils/init/postinit.c             |   11 +
 src/backend/utils/misc/guc.c                  |   16 +-
 src/backend/utils/misc/postgresql.conf.sample |    2 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |    4 +-
 src/include/miscadmin.h                       |    3 +-
 src/include/pgstat.h                          |  760 +-
 src/include/storage/lwlock.h                  |    1 +
 src/include/utils/guc_tables.h                |    2 +-
 src/include/utils/timeout.h                   |    1 +
 src/tools/pgindent/typedefs.list              |   52 +-
 36 files changed, 2913 insertions(+), 4787 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index bd5faf0c1f..c88339282f 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1086,8 +1086,8 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
                  * our own.  In this case we should count and sample the row,
                  * to accommodate users who load a table and analyze it in one
                  * transaction.  (pgstat_report_analyze has to adjust the
-                 * numbers we send to the stats collector to make this come
-                 * out right.)
+                 * numbers we report to the activity stats facility to make
+                 * this come out right.)
                  */
                 if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(targtuple->t_data)))
                 {
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8341879d89..3eae964841 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -606,7 +606,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
                         new_min_multi,
                         false);
 
-    /* report results to the stats collector, too */
+    /* report results to the activity stats facility, too */
     pgstat_report_vacuum(RelationGetRelid(onerel),
                          onerel->rd_rel->relisshared,
                          Max(new_live_tuples, 0),
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6f8810e149..479c9ee7dc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2206,7 +2206,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
                     WriteRqst.Flush = 0;
                     XLogWrite(WriteRqst, false);
                     LWLockRelease(WALWriteLock);
-                    WalStats.m_wal_buffers_full++;
+                    WalStats.wal_buffers_full++;
                     TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
                 }
                 /* Re-acquire WALBufMappingLock and retry */
@@ -2564,10 +2564,10 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 
                     INSTR_TIME_SET_CURRENT(duration);
                     INSTR_TIME_SUBTRACT(duration, start);
-                    WalStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration);
+                    WalStats.wal_write_time += INSTR_TIME_GET_MICROSEC(duration);
                 }
 
-                WalStats.m_wal_write++;
+                WalStats.wal_write++;
 
                 if (written <= 0)
                 {
@@ -8724,8 +8724,8 @@ LogCheckpointEnd(bool restartpoint)
                                                  CheckpointStats.ckpt_sync_end_t);
 
     /* Accumulate checkpoint timing summary data, in milliseconds. */
-    BgWriterStats.m_checkpoint_write_time += write_msecs;
-    BgWriterStats.m_checkpoint_sync_time += sync_msecs;
+    CheckPointerStats.checkpoint_write_time += write_msecs;
+    CheckPointerStats.checkpoint_sync_time += sync_msecs;
 
     /*
      * All of the published timing statistics are accounted for.  Only
@@ -10675,10 +10675,10 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 
         INSTR_TIME_SET_CURRENT(duration);
         INSTR_TIME_SUBTRACT(duration, start);
-        WalStats.m_wal_sync_time += INSTR_TIME_GET_MICROSEC(duration);
+        WalStats.wal_sync_time += INSTR_TIME_GET_MICROSEC(duration);
     }
 
-    WalStats.m_wal_sync++;
+    WalStats.wal_sync++;
 }
 
 /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 4ef61b5efd..85f8d32944 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1859,28 +1859,10 @@ index_concurrently_swap(Oid newIndexId, Oid oldIndexId, const char *oldName)
 
     /*
      * Copy over statistics from old to new index
+     * The data will be flushed by the next pgstat_report_stat()
+     * call.
      */
-    {
-        PgStat_StatTabEntry *tabentry;
-
-        tabentry = pgstat_fetch_stat_tabentry(oldIndexId);
-        if (tabentry)
-        {
-            if (newClassRel->pgstat_info)
-            {
-                newClassRel->pgstat_info->t_counts.t_numscans = tabentry->numscans;
-                newClassRel->pgstat_info->t_counts.t_tuples_returned = tabentry->tuples_returned;
-                newClassRel->pgstat_info->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
-                newClassRel->pgstat_info->t_counts.t_blocks_hit = tabentry->blocks_hit;
-
-                /*
-                 * The data will be sent by the next pgstat_report_stat()
-                 * call.
-                 */
-            }
-        }
-    }
+    pgstat_copy_index_counters(oldIndexId, newClassRel->pgstat_info);
 
     /* Copy data of pg_statistic from the old index to the new one */
     CopyStatistics(oldIndexId, newIndexId);
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 2b159b60eb..acea4de382 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -971,7 +971,7 @@ dropdb(const char *dbname, bool missing_ok, bool force)
     DropDatabaseBuffers(db_id);
 
     /*
-     * Tell the stats collector to forget it immediately, too.
+     * Tell the activity stats facility to forget it immediately, too.
      */
     pgstat_drop_database(db_id);
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 172ec6e982..8c9ae24b78 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -336,10 +336,10 @@ ExecRefreshMatView(RefreshMatViewStmt *stmt, const char *queryString,
         refresh_by_heap_swap(matviewOid, OIDNewHeap, relpersistence);
 
         /*
-         * Inform stats collector about our activity: basically, we truncated
-         * the matview and inserted some new data.  (The concurrent code path
-         * above doesn't need to worry about this because the inserts and
-         * deletes it issues get counted by lower-level code.)
+         * Inform activity stats facility about our activity: basically, we
+         * truncated the matview and inserted some new data.  (The concurrent
+         * code path above doesn't need to worry about this because the inserts
+         * and deletes it issues get counted by lower-level code.)
          */
         pgstat_count_truncate(matviewRel);
         if (!stmt->skipData)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index c064352e23..7397c3704d 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -329,8 +329,8 @@ vacuum(List *relations, VacuumParams *params,
                  errmsg("PROCESS_TOAST required with VACUUM FULL")));
 
     /*
-     * Send info about dead objects to the statistics collector, unless we are
-     * in autovacuum --- autovacuum.c does this for itself.
+     * Send info about dead objects to the activity statistics facility, unless
+     * we are in autovacuum --- autovacuum.c does this for itself.
      */
     if ((params->options & VACOPT_VACUUM) && !IsAutoVacuumWorkerProcess())
         pgstat_vacuum_stat();
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 23ef23c13e..46f969bef3 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -342,9 +342,6 @@ static void autovacuum_do_vac_analyze(autovac_table *tab,
                                       BufferAccessStrategy bstrategy);
 static AutoVacOpts *extract_autovac_opts(HeapTuple tup,
                                          TupleDesc pg_class_desc);
-static PgStat_StatTabEntry *get_pgstat_tabentry_relid(Oid relid, bool isshared,
-                                                      PgStat_StatDBEntry *shared,
-                                                      PgStat_StatDBEntry *dbentry);
 static void perform_work_item(AutoVacuumWorkItem *workitem);
 static void autovac_report_activity(autovac_table *tab);
 static void autovac_report_workitem(AutoVacuumWorkItem *workitem,
@@ -1684,12 +1681,12 @@ AutoVacWorkerMain(int argc, char *argv[])
         char        dbname[NAMEDATALEN];
 
         /*
-         * Report autovac startup to the stats collector.  We deliberately do
-         * this before InitPostgres, so that the last_autovac_time will get
-         * updated even if the connection attempt fails.  This is to prevent
-         * autovac from getting "stuck" repeatedly selecting an unopenable
-         * database, rather than making any progress on stuff it can connect
-         * to.
+         * Report autovac startup to the activity stats facility.  We
+         * deliberately do this before InitPostgres, so that the
+         * last_autovac_time will get updated even if the connection attempt
+         * fails.  This is to prevent autovac from getting "stuck" repeatedly
+         * selecting an unopenable database, rather than making any progress on
+         * stuff it can connect to.
          */
         pgstat_report_autovac(dbid);
 
@@ -1961,8 +1958,6 @@ do_autovacuum(void)
     HASHCTL        ctl;
     HTAB       *table_toast_map;
     ListCell   *volatile cell;
-    PgStat_StatDBEntry *shared;
-    PgStat_StatDBEntry *dbentry;
     BufferAccessStrategy bstrategy;
     ScanKeyData key;
     TupleDesc    pg_class_desc;
@@ -1981,17 +1976,11 @@ do_autovacuum(void)
                                           ALLOCSET_DEFAULT_SIZES);
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /*
-     * may be NULL if we couldn't find an entry (only happens if we are
-     * forcing a vacuum for anti-wrap purposes).
-     */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-
     /* Start a transaction so our commands have one to play into. */
     StartTransactionCommand();
 
     /*
-     * Clean up any dead statistics collector entries for this DB. We always
+     * Clean up any dead activity statistics entries for this DB. We always
      * want to do this exactly once per DB-processing cycle, even if we find
      * nothing worth vacuuming in the database.
      */
@@ -2034,9 +2023,6 @@ do_autovacuum(void)
     /* StartTransactionCommand changed elsewhere */
     MemoryContextSwitchTo(AutovacMemCxt);
 
-    /* The database hash where pgstat keeps shared relations */
-    shared = pgstat_fetch_stat_dbentry(InvalidOid);
-
     classRel = table_open(RelationRelationId, AccessShareLock);
 
     /* create a copy so we can use it after closing pg_class */
@@ -2114,8 +2100,8 @@ do_autovacuum(void)
 
         /* Fetch reloptions and the pgstat entry for this table */
         relopts = extract_autovac_opts(tuple, pg_class_desc);
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                       relid);
 
         /* Check if it needs vacuum or analyze */
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
@@ -2198,8 +2184,8 @@ do_autovacuum(void)
         }
 
         /* Fetch the pgstat entry for this table */
-        tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                             shared, dbentry);
+        tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                       relid);
 
         relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
                                   effective_multixact_freeze_max_age,
@@ -2758,29 +2744,6 @@ extract_autovac_opts(HeapTuple tup, TupleDesc pg_class_desc)
     return av;
 }
 
-/*
- * get_pgstat_tabentry_relid
- *
- * Fetch the pgstat entry of a table, either local to a database or shared.
- */
-static PgStat_StatTabEntry *
-get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
-                          PgStat_StatDBEntry *dbentry)
-{
-    PgStat_StatTabEntry *tabentry = NULL;
-
-    if (isshared)
-    {
-        if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
-    }
-    else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
-
-    return tabentry;
-}
 
 /*
  * table_recheck_autovac
@@ -2985,17 +2948,10 @@ recheck_relation_needs_vacanalyze(Oid relid,
                                   bool *wraparound)
 {
     PgStat_StatTabEntry *tabentry;
-    PgStat_StatDBEntry *shared = NULL;
-    PgStat_StatDBEntry *dbentry = NULL;
-
-    if (classForm->relisshared)
-        shared = pgstat_fetch_stat_dbentry(InvalidOid);
-    else
-        dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
 
     /* fetch the pgstat table entry */
-    tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-                                         shared, dbentry);
+    tabentry = pgstat_fetch_stat_tabentry_extended(classForm->relisshared,
+                                                   relid);
 
     relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
                               effective_multixact_freeze_max_age,
@@ -3025,7 +2981,7 @@ recheck_relation_needs_vacanalyze(Oid relid,
  *
  * For analyze, the analysis done is that the number of tuples inserted,
  * deleted and updated since the last analyze exceeds a threshold calculated
- * in the same fashion as above.  Note that the collector actually stores
+ * in the same fashion as above.  Note that the activity statistics stores
  * the number of tuples (both live and dead) that there were as of the last
  * analyze.  This is asymmetric to the VACUUM case.
  *
@@ -3035,8 +2991,8 @@ recheck_relation_needs_vacanalyze(Oid relid,
  *
  * A table whose autovacuum_enabled option is false is
  * automatically skipped (unless we have to vacuum it due to freeze_max_age).
- * Thus autovacuum can be disabled for specific tables. Also, when the stats
- * collector does not have data about a table, it will be skipped.
+ * Thus autovacuum can be disabled for specific tables. Also, when the activity
+ * statistics does not have data about a table, it will be skipped.
  *
  * A table whose vac_base_thresh value is < 0 takes the base value from the
  * autovacuum_vacuum_threshold GUC variable.  Similarly, a vac_scale_factor
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 715d5195bb..679992dc89 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -244,9 +244,9 @@ BackgroundWriterMain(void)
         can_hibernate = BgBufferSync(&wb_context);
 
         /*
-         * Send off activity statistics to the stats collector
+         * Send off activity statistics to the activity stats facility
          */
-        pgstat_send_bgwriter();
+        pgstat_report_bgwriter();
 
         if (FirstCallSinceLastCheckpoint())
         {
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5907a7befc..5973c1deeb 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -360,7 +360,7 @@ CheckpointerMain(void)
         if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
         {
             do_checkpoint = true;
-            BgWriterStats.m_requested_checkpoints++;
+            CheckPointerStats.requested_checkpoints++;
         }
 
         /*
@@ -374,7 +374,7 @@ CheckpointerMain(void)
         if (elapsed_secs >= CheckPointTimeout)
         {
             if (!do_checkpoint)
-                BgWriterStats.m_timed_checkpoints++;
+                CheckPointerStats.timed_checkpoints++;
             do_checkpoint = true;
             flags |= CHECKPOINT_CAUSE_TIME;
         }
@@ -495,16 +495,10 @@ CheckpointerMain(void)
         /* Check for archive_timeout and switch xlog files if necessary. */
         CheckArchiveTimeout();
 
-        /*
-         * Send off activity statistics to the stats collector.  (The reason
-         * why we re-use bgwriter-related code for this is that the bgwriter
-         * and checkpointer used to be just one process.  It's probably not
-         * worth the trouble to split the stats support into two independent
-         * stats message types.)
-         */
-        pgstat_send_bgwriter();
+        /* Send off activity statistics to the activity stats facility. */
+        pgstat_report_checkpointer();
 
-        /* Send WAL statistics to the stats collector. */
+        /* Send off WAL statistics to the activity stats facility. */
         pgstat_report_wal();
 
         /*
@@ -580,9 +574,9 @@ HandleCheckpointerInterrupts(void)
          * updates the statistics, increment the checkpoint request and send
          * the statistics to the stats collector.
          */
-        BgWriterStats.m_requested_checkpoints++;
+        CheckPointerStats.requested_checkpoints++;
         ShutdownXLOG(0, 0);
-        pgstat_send_bgwriter();
+        pgstat_report_checkpointer();
         pgstat_report_wal();
 
         /* Normal exit from the checkpointer is here */
@@ -722,9 +716,9 @@ CheckpointWriteDelay(int flags, double progress)
         CheckArchiveTimeout();
 
         /*
-         * Report interim activity statistics to the stats collector.
+         * Report interim activity statistics.
          */
-        pgstat_send_bgwriter();
+        pgstat_report_checkpointer();
 
         /*
          * This sleep used to be connected to bgwriter_delay, typically 200ms.
@@ -1268,8 +1262,10 @@ AbsorbSyncRequests(void)
     LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
 
     /* Transfer stats counts into pending pgstats message */
-    BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-    BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+    CheckPointerStats.buf_written_backend
+        += CheckpointerShmem->num_backend_writes;
+    CheckPointerStats.buf_fsync_backend
+        += CheckpointerShmem->num_backend_fsync;
 
     CheckpointerShmem->num_backend_writes = 0;
     CheckpointerShmem->num_backend_fsync = 0;
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 746c836d42..c200717aa4 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -417,20 +417,20 @@ pgarch_ArchiverCopyLoop(void)
                 pgarch_archiveDone(xlog);
 
                 /*
-                 * Tell the collector about the WAL file that we successfully
-                 * archived
+                 * Tell the activity statistics facility about the WAL file
+                 * that we successfully archived
                  */
-                pgstat_send_archiver(xlog, false);
+                pgstat_report_archiver(xlog, false);
 
                 break;            /* out of inner retry loop */
             }
             else
             {
                 /*
-                 * Tell the collector about the WAL file that we failed to
-                 * archive
+                 * Tell the activity statistics facility about the WAL file
+                 * that we failed to archive
                  */
-                pgstat_send_archiver(xlog, true);
+                pgstat_report_archiver(xlog, true);
 
                 if (++failures >= NUM_ARCHIVE_RETRIES)
                 {
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 208a33692f..bd09bb6d3b 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,22 @@
 /* ----------
  * pgstat.c
  *
- *    All the statistics collector stuff hacked up in one big, ugly file.
+ *    Activity Statistics facility.
  *
- *    TODO:    - Separate collector, postmaster and backend stuff
- *              into different files.
+ *  Collects activity statistics, e.g. per-table access statistics, of
+ *  all backends in shared memory. The activity numbers are first stored
+ *  locally in each process, then flushed to shared memory at commit
+ *  time or by idle-timeout.
  *
- *            - Add some automatic call for pgstat vacuuming.
+ * To avoid congestion on the shared memory, shared stats are updated no more
+ * often than once per PGSTAT_MIN_INTERVAL (10000ms). If some local numbers
+ * remain unflushed because of lock failure, retry at intervals that start at
+ * PGSTAT_RETRY_MIN_INTERVAL (1000ms) and double at every retry. Finally, we
+ * force an update PGSTAT_MAX_INTERVAL (60000ms) after the first attempt.
  *
- *            - Add a pgstat config column to pg_database, so this
- *              entire thing can be enabled/disabled on a per db basis.
+ *  The first process that uses the activity statistics facility creates the
+ *  area and loads the stored stats file, if any. At shutdown, the last process
+ *  writes the shared stats to the file and destroys the area before exiting.
  *
  *    Copyright (c) 2001-2021, PostgreSQL Global Development Group
  *
@@ -19,18 +26,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#include <fcntl.h>
-#include <sys/param.h>
-#include <sys/time.h>
-#include <sys/socket.h>
-#include <netdb.h>
-#include <netinet/in.h>
-#include <arpa/inet.h>
-#include <signal.h>
-#include <time.h>
-#ifdef HAVE_SYS_SELECT_H
-#include <sys/select.h>
-#endif
 
 #include "access/heapam.h"
 #include "access/htup_details.h"
@@ -40,13 +35,9 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
-#include "common/ip.h"
-#include "executor/instrument.h"
+#include "common/hashfn.h"
 #include "libpq/libpq.h"
-#include "libpq/pqsignal.h"
-#include "mb/pg_wchar.h"
 #include "miscadmin.h"
-#include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/fork_process.h"
@@ -54,20 +45,16 @@
 #include "postmaster/postmaster.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
-#include "storage/backendid.h"
-#include "storage/dsm.h"
-#include "storage/fd.h"
+#include "storage/condition_variable.h"
 #include "storage/ipc.h"
-#include "storage/latch.h"
 #include "storage/lmgr.h"
-#include "storage/pg_shmem.h"
+#include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "utils/ascii.h"
 #include "utils/guc.h"
 #include "utils/memutils.h"
-#include "utils/ps_status.h"
-#include "utils/rel.h"
+#include "utils/probes.h"
 #include "utils/snapmgr.h"
 #include "utils/timestamp.h"
 
@@ -75,35 +62,20 @@
  * Timer definitions.
  * ----------
  */
-#define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
-                                     * updates; in milliseconds. */
+#define PGSTAT_MIN_INTERVAL            10000    /* Minimum interval of stats data
+                                             * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
-                                     * failed statistics collector; in
-                                     * seconds. */
-
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
-/* Minimum receive buffer size for the collector's socket. */
-#define PGSTAT_MIN_RCVBUF        (100 * 1024)
+#define PGSTAT_RETRY_MIN_INTERVAL    1000    /* Initial retry interval after
+                                             * PGSTAT_MIN_INTERVAL */
 
+#define PGSTAT_MAX_INTERVAL            60000    /* Longest interval of stats data
+                                             * updates */
 
 /* ----------
- * The initial size hints for the hash tables used in the collector.
+ * The initial size hints for the hash tables used in the activity statistics.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
-#define PGSTAT_TAB_HASH_SIZE    512
+#define PGSTAT_TABLE_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
 
@@ -118,7 +90,6 @@
  */
 #define NumBackendStatSlots (MaxBackends + NUM_AUXPROCTYPES)
 
-
 /* ----------
  * GUC parameters
  * ----------
@@ -132,17 +103,11 @@ int            pgstat_track_activity_query_size = 1024;
  * Built from GUC parameter
  * ----------
  */
-char       *pgstat_stat_directory = NULL;
-char       *pgstat_stat_filename = NULL;
-char       *pgstat_stat_tmpname = NULL;
+char      *pgstat_stat_directory = NULL;
 
-/*
- * BgWriter and WAL global statistics counters.
- * Stored directly in a stats message structure so they can be sent
- * without needing to copy things around.  We assume these init to zeroes.
- */
-PgStat_MsgBgWriter BgWriterStats;
-PgStat_MsgWal WalStats;
+/* No longer used, but will be removed with GUC */
+char      *pgstat_stat_filename = NULL;
+char      *pgstat_stat_tmpname = NULL;
 
 /*
  * WAL usage counters saved from pgWALUsage at the previous call to
@@ -170,73 +135,247 @@ static const char *const slru_names[] = {
 
 #define SLRU_NUM_ELEMENTS    lengthof(slru_names)
 
+/* struct for shared SLRU stats */
+typedef struct PgStatSharedSLRUStats
+{
+    PgStat_SLRUStats entry[SLRU_NUM_ELEMENTS];
+    LWLock        lock;
+    pg_atomic_uint32 changecount;
+} PgStatSharedSLRUStats;
+
+StaticAssertDecl(sizeof(TimestampTz) == sizeof(pg_atomic_uint64),
+                 "size of pg_atomic_uint64 doesn't match TimestampTz");
+
+typedef struct StatsShmemStruct
+{
+    dsa_handle    stats_dsa_handle;    /* handle for stats data area */
+    dshash_table_handle hash_handle;    /* shared dbstat hash */
+    int            refcount;        /* # of processes attached to the shared
+                                 * stats memory */
+    /* Global stats structs */
+    PgStat_Archiver archiver_stats;
+    pg_atomic_uint32 archiver_changecount;
+    PgStat_BgWriter bgwriter_stats;
+    pg_atomic_uint32 bgwriter_changecount;
+    PgStat_CheckPointer checkpointer_stats;
+    pg_atomic_uint32 checkpointer_changecount;
+    PgStat_Wal    wal_stats;
+    LWLock        wal_stats_lock;
+    PgStatSharedSLRUStats slru_stats;
+    pg_atomic_uint32 slru_changecount;
+    pg_atomic_uint64 stats_timestamp;
+
+    /* Reset offsets, protected by StatsLock */
+    PgStat_Archiver archiver_reset_offset;
+    PgStat_BgWriter bgwriter_reset_offset;
+    PgStat_CheckPointer checkpointer_reset_offset;
+
+    /* file read/write protection */
+    bool        attach_holdoff;
+    ConditionVariable holdoff_cv;
+
+    pg_atomic_uint64 gc_count;    /* # of entries deleted. not protected by
+                                 * StatsLock */
+} StatsShmemStruct;
+
+/* BgWriter global statistics counters */
+PgStat_BgWriter BgWriterStats = {0};
+
+/* CheckPointer global statistics counters */
+PgStat_CheckPointer CheckPointerStats = {0};
+
+/* WAL global statistics counters */
+PgStat_Wal    WalStats = {0};
+
 /*
- * SLRU statistics counts waiting to be sent to the collector.  These are
- * stored directly in stats message format so they can be sent without needing
- * to copy things around.  We assume this variable inits to zeroes.  Entries
- * are one-to-one with slru_names[].
+ * XXXX: always try to flush WAL stats. We don't want to manipulate another
+ * counter during XLogInsert, so we don't have an efficient shortcut to know
+ * whether any counter has been incremented.
  */
-static PgStat_MsgSLRU SLRUStats[SLRU_NUM_ELEMENTS];
+static inline bool
+walstats_pending(void)
+{
+    static const PgStat_Wal all_zeroes;
+
+    return memcmp(&WalStats, &all_zeroes,
+                  offsetof(PgStat_Wal, stat_reset_timestamp)) != 0;
+}
+
+/*
+ * SLRU statistics counts waiting to be written to the shared activity
+ * statistics.  We assume this variable inits to zeroes.  Entries are
+ * one-to-one with slru_names[].
+ * Changes of SLRU counters are reported within critical sections so we use
+ * static memory in order to avoid memory allocation.
+ */
+static PgStat_SLRUStats local_SLRUStats[SLRU_NUM_ELEMENTS];
+static bool have_slrustats = false;
 
 /* ----------
  * Local data
  * ----------
  */
-NON_EXEC_STATIC pgsocket pgStatSock = PGINVALID_SOCKET;
-
-static struct sockaddr_storage pgStatAddr;
-
-static time_t last_pgstat_start_time;
-
-static bool pgStatRunningInCollector = false;
+/* backend-lifetime storage */
+static StatsShmemStruct *StatsShmem = NULL;
+static dsa_area *area = NULL;
 
 /*
- * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
- *
- * NOTE: once allocated, TabStatusArray structures are never moved or deleted
- * for the life of the backend.  Also, we zero out the t_id fields of the
- * contained PgStat_TableStatus structs whenever they are not actively in use.
- * This allows relcache pgstat_info pointers to be treated as long-lived data,
- * avoiding repeated searches in pgstat_initstats() when a relation is
- * repeatedly opened during a transaction.
+ * Types to define shared statistics structure.
+ *
+ * Per-object statistics are stored in "shared stats" entries: structs in
+ * DSA-allocated memory that begin with a header common to all object types.
+ * Every shared stats entry is pointed to from a dshash via a dsa_pointer. This
+ * structure keeps the shared stats immovable across dshash resizing, lets a
+ * backend point to shared stats entries via a native pointer, and allows
+ * locking at the stats-entry level. Per-entry locking reduces lock contention
+ * compared to the partition lock of dshash. A backend accumulates stats
+ * numbers in a stats entry in local memory, then flushes the numbers to the
+ * shared stats entries, basically at transaction end.
+ *
+ * Each stats entry type has a fixed member PgStat_StatEntryHeader as its
+ * first element.
+ *
+ * Shared stats are stored as:
+ *
+ * dshash pgStatSharedHash
+ *    -> PgStatHashEntry                    (dshash entry)
+ *      (dsa_pointer)-> PgStat_Stat*Entry    (dsa memory block)
+ *
+ * Shared stats entries are pointed to directly from a pgstat_localhash hash:
+ *
+ * pgstat_localhash pgStatEntHash
+ *    -> PgStatLocalHashEntry                (equivalent of PgStatHashEntry)
+ *      (native pointer)-> PgStat_Stat*Entry (dsa memory block)
+ *
+ * Local stats waiting to be flushed to shared stats are stored as:
+ *
+ * pgstat_localhash pgStatLocalHash
+ *    -> PgStatLocalHashEntry                 (local hash entry)
+ *      (native pointer)-> PgStat_Stat*Entry/TableStatus (palloc'ed memory)
  */
-#define TABSTAT_QUANTUM        100 /* we alloc this many at a time */
 
-typedef struct TabStatusArray
+/* The types of statistics entries */
+typedef enum PgStatTypes
 {
-    struct TabStatusArray *tsa_next;    /* link to next array, if any */
-    int            tsa_used;        /* # entries currently used */
-    PgStat_TableStatus tsa_entries[TABSTAT_QUANTUM];    /* per-table data */
-} TabStatusArray;
-
-static TabStatusArray *pgStatTabList = NULL;
+    PGSTAT_TYPE_DB,                /* database-wide statistics */
+    PGSTAT_TYPE_TABLE,            /* per-table statistics */
+    PGSTAT_TYPE_FUNCTION,        /* per-function statistics */
+    PGSTAT_TYPE_REPLSLOT        /* per-replication-slot statistics */
+} PgStatTypes;
 
 /*
- * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
+ * entry body size lookup table of shared statistics entries corresponding to
+ * PgStatTypes
  */
-typedef struct TabStatHashEntry
+static const size_t pgstat_sharedentsize[] =
 {
-    Oid            t_id;
-    PgStat_TableStatus *tsa_entry;
-} TabStatHashEntry;
+    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_StatTabEntry),    /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_StatFuncEntry),    /* PGSTAT_TYPE_FUNCTION */
+    sizeof(PgStat_ReplSlot)        /* PGSTAT_TYPE_REPLSLOT */
+};
 
-/*
- * Hash table for O(1) t_id -> tsa_entry lookup
- */
-static HTAB *pgStatTabHash = NULL;
+/* Ditto for local statistics entries */
+static const size_t pgstat_localentsize[] =
+{
+    sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */
+    sizeof(PgStat_TableStatus), /* PGSTAT_TYPE_TABLE */
+    sizeof(PgStat_BackendFunctionEntry),    /* PGSTAT_TYPE_FUNCTION */
+    sizeof(PgStat_ReplSlot)        /* PGSTAT_TYPE_REPLSLOT */
+};
 
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * We should avoid overwriting the header part of a shared entry. Use these
+ * macros to find the portion of the struct to be written or read.
+ * PGSTAT_SHENT_BODY may return an address slightly lower than the actual
+ * address of the next member, but that doesn't matter.
  */
-static HTAB *pgStatFunctions = NULL;
+#define PGSTAT_SHENT_BODY(e) (((char *)(e)) + sizeof(PgStat_StatEntryHeader))
+#define PGSTAT_SHENT_BODY_LEN(t) \
+    (pgstat_sharedentsize[t] - sizeof(PgStat_StatEntryHeader))
+
+/* struct for shared statistics hash entry key. */
+typedef struct PgStatHashKey
+{
+    PgStatTypes type;            /* statistics entry type */
+    Oid            databaseid;        /* database ID. InvalidOid for shared objects. */
+    Oid            objectid;        /* object ID, either table or function. */
+} PgStatHashKey;
+
+/* struct for shared statistics hash entry */
+typedef struct PgStatHashEntry
+{
+    PgStatHashKey key;            /* hash key */
+    dsa_pointer body;            /* pointer to shared stats in
+                                 * PgStat_StatEntryHeader */
+} PgStatHashEntry;
+
+/* struct for shared statistics local hash entry. */
+typedef struct PgStatLocalHashEntry
+{
+    PgStatHashKey key;            /* hash key */
+    char        status;            /* for simplehash use */
+    PgStat_StatEntryHeader *body;    /* address pointer to stats body */
+    dsa_pointer dsapointer;        /* dsa pointer of body */
+} PgStatLocalHashEntry;
+
+/* parameter for the shared hash */
+static const dshash_parameters dsh_params = {
+    sizeof(PgStatHashKey),
+    sizeof(PgStatHashEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS
+};
+
+/* define hashtable for local hashes */
+#define SH_PREFIX pgstat_localhash
+#define SH_ELEMENT_TYPE PgStatLocalHashEntry
+#define SH_KEY_TYPE PgStatHashKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) \
+    hash_bytes((unsigned char *)&key, sizeof(PgStatHashKey))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(PgStatHashKey)) == 0)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
+/* The shared hash to index activity stats entries. */
+static dshash_table *pgStatSharedHash = NULL;
 
 /*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
+ * The local cache to index shared stats entries.
+ *
+ * This is a local hash to store native pointers to shared hash
+ * entries. pgStatEntHashAge is copied from StatsShmem->gc_count at creation
+ * and garbage collection.
  */
-static bool have_function_stats = false;
+static pgstat_localhash_hash * pgStatEntHash = NULL;
+static int    pgStatEntHashAge = 0;    /* cache age of pgStatEntHash */
+
+/* Local stats numbers are stored here. */
+static pgstat_localhash_hash * pgStatLocalHash = NULL;
+
+/* entry type for oid hash */
+typedef struct pgstat_oident
+{
+    Oid            oid;
+    char        status;
+} pgstat_oident;
+
+/* Define hashtable for OID hashes. */
+StaticAssertDecl(sizeof(Oid) == 4, "oid is not compatible with uint32");
+#define SH_PREFIX pgstat_oid
+#define SH_ELEMENT_TYPE pgstat_oident
+#define SH_KEY_TYPE Oid
+#define SH_KEY oid
+#define SH_HASH_KEY(tb, key) hash_bytes_uint32(key)
+#define SH_EQUAL(tb, a, b) (a == b)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
 
 /*
  * Tuple insertion/deletion counts for an open transaction can't be propagated
@@ -276,11 +415,8 @@ typedef struct TwoPhasePgStatRecord
     bool        t_truncated;    /* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
 
 /* Status for backends including auxiliary */
 static LocalPgBackendStatus *localBackendStatusTable = NULL;
@@ -289,23 +425,9 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Make our own memory context to make it easy to track memory usage.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-static PgStat_WalStats walStats;
-static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
-static PgStat_ReplSlotStats *replSlotStats;
-static int    nReplSlotStats;
-
-/*
- * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
+MemoryContext pgStatCacheContext = NULL;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -314,41 +436,58 @@ static List *pending_write_requests = NIL;
  */
 static instr_time total_func_time;
 
+/* Simple caching feature for pgstatfuncs */
+static PgStatHashKey stathashkey_zero = {0};
+static PgStatHashKey cached_dbent_key = {0};
+static PgStat_StatDBEntry cached_dbent;
+static PgStatHashKey cached_tabent_key = {0};
+static PgStat_StatTabEntry cached_tabent;
+static PgStatHashKey cached_funcent_key = {0};
+static PgStat_StatFuncEntry cached_funcent;
+
+static PgStat_Archiver cached_archiverstats;
+static bool cached_archiverstats_is_valid = false;
+static PgStat_BgWriter cached_bgwriterstats;
+static bool cached_bgwriterstats_is_valid = false;
+static PgStat_CheckPointer cached_checkpointerstats;
+static bool cached_checkpointerstats_is_valid = false;
+static PgStat_Wal cached_walstats;
+static bool cached_walstats_is_valid = false;
+static PgStat_SLRUStats cached_slrustats;
+static bool cached_slrustats_is_valid = false;
+static PgStat_ReplSlot *cached_replslotstats = NULL;
+static int    n_cached_replslotstats = -1;
 
 /* ----------
  * Local function forward declarations
  * ----------
  */
-#ifdef EXEC_BACKEND
-static pid_t pgstat_forkexec(void);
-#endif
-
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
 static void pgstat_beshutdown_hook(int code, Datum arg);
 
-static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                                                 Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStat_StatDBEntry *get_local_dbstat_entry(Oid dbid);
+static PgStat_TableStatus *get_local_tabstat_entry(Oid rel_id, bool isshared);
+
+static void pgstat_write_statsfile(void);
+
+static void pgstat_read_statsfile(void);
 static void pgstat_read_current_status(void);
 
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
+static PgStat_StatEntryHeader *get_stat_entry(PgStatTypes type, Oid dbid,
+                                              Oid objid, bool nowait,
+                                              bool create, bool *found);
 
-static int    pgstat_replslot_index(const char *name, bool create_it);
-static void pgstat_reset_replslot(int i, TimestampTz ts);
+static bool flush_tabstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_funcstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_dbstat(PgStatLocalHashEntry *ent, bool nowait);
+static bool flush_walstat(bool nowait);
+static bool flush_slrustat(bool nowait);
+static void delete_current_stats_entry(dshash_seq_status *hstat);
+static PgStat_StatEntryHeader *get_local_stat_entry(PgStatTypes type, Oid dbid,
+                                                    Oid objid, bool create,
+                                                    bool *found);
 
-static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
-static void pgstat_send_funcstats(void);
-static void pgstat_send_slru(void);
-static HTAB *pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid);
-static void pgstat_send_connstats(bool disconnect, TimestampTz last_report);
-
-static PgStat_TableStatus *get_tabstat_entry(Oid rel_id, bool isshared);
+static pgstat_oid_hash * collect_oids(Oid catalogid, AttrNumber anum_oid);
+static void pgstat_update_connstats(bool disconnect);
 
 static void pgstat_setup_memcxt(void);
 
@@ -358,492 +497,651 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 
-static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
-static void pgstat_send(void *msg, int len);
-
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
-static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
-static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
-static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
-static void pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len);
-static void pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len);
-static void pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len);
-static void pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len);
-static void pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg, int len);
-static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
-static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
-static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
-static void pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len);
-static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
-static void pgstat_recv_wal(PgStat_MsgWal *msg, int len);
-static void pgstat_recv_slru(PgStat_MsgSLRU *msg, int len);
-static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
-static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
-static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
-static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
-static void pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len);
-static void pgstat_recv_connstat(PgStat_MsgConn *msg, int len);
-static void pgstat_recv_replslot(PgStat_MsgReplSlot *msg, int len);
-static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
 /* ------------------------------------------------------------
  * Public functions called from postmaster follow
  * ------------------------------------------------------------
  */
 
-/* ----------
- * pgstat_init() -
- *
- *    Called from postmaster at startup. Create the resources required
- *    by the statistics collector process.  If unable to do so, do not
- *    fail --- better to let the postmaster start with stats collection
- *    disabled.
- * ----------
+/*
+ * StatsShmemSize
+ *        Compute shared memory space needed for activity statistics
+ */
+Size
+StatsShmemSize(void)
+{
+    return sizeof(StatsShmemStruct);
+}
+
+/*
+ * StatsShmemInit - initialize during shared-memory creation
  */
 void
-pgstat_init(void)
+StatsShmemInit(void)
 {
-    ACCEPT_TYPE_ARG3 alen;
-    struct addrinfo *addrs = NULL,
-               *addr,
-                hints;
-    int            ret;
-    fd_set        rset;
-    struct timeval tv;
-    char        test_byte;
-    int            sel_res;
-    int            tries = 0;
-
-#define TESTBYTEVAL ((char) 199)
+    bool        found;
 
-    /*
-     * This static assertion verifies that we didn't mess up the calculations
-     * involved in selecting maximum payload sizes for our UDP messages.
-     * Because the only consequence of overrunning PGSTAT_MAX_MSG_SIZE would
-     * be silent performance loss from fragmentation, it seems worth having a
-     * compile-time cross-check that we didn't.
-     */
-    StaticAssertStmt(sizeof(PgStat_Msg) <= PGSTAT_MAX_MSG_SIZE,
-                     "maximum stats message size exceeds PGSTAT_MAX_MSG_SIZE");
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(), &found);
 
-    /*
-     * Create the UDP socket for sending and receiving statistic messages
-     */
-    hints.ai_flags = AI_PASSIVE;
-    hints.ai_family = AF_UNSPEC;
-    hints.ai_socktype = SOCK_DGRAM;
-    hints.ai_protocol = 0;
-    hints.ai_addrlen = 0;
-    hints.ai_addr = NULL;
-    hints.ai_canonname = NULL;
-    hints.ai_next = NULL;
-    ret = pg_getaddrinfo_all("localhost", NULL, &hints, &addrs);
-    if (ret || !addrs)
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+        ConditionVariableInit(&StatsShmem->holdoff_cv);
+        pg_atomic_init_u32(&StatsShmem->archiver_changecount, 0);
+        pg_atomic_init_u32(&StatsShmem->bgwriter_changecount, 0);
+        pg_atomic_init_u32(&StatsShmem->checkpointer_changecount, 0);
+
+        pg_atomic_init_u64(&StatsShmem->gc_count, 0);
+
+        LWLockInitialize(&StatsShmem->wal_stats_lock, LWTRANCHE_STATS);
+    }
+}
+
+/* ----------
+ * allow_next_attacher() -
+ *
+ *  Let other processes go ahead and attach the shared stats area.
+ * ----------
+ */
+static void
+allow_next_attacher(void)
+{
+    bool        triggered = false;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (StatsShmem->attach_holdoff)
     {
-        ereport(LOG,
-                (errmsg("could not resolve \"localhost\": %s",
-                        gai_strerror(ret))));
-        goto startup_failed;
+        StatsShmem->attach_holdoff = false;
+        triggered = true;
     }
+    LWLockRelease(StatsLock);
+
+    if (triggered)
+        ConditionVariableBroadcast(&StatsShmem->holdoff_cv);
+}
+
+/* ----------
+ * attach_shared_stats() -
+ *
+ *    Attach or create the shared stats memory. If we are the first process to
+ *    use the activity stats system, read the saved statistics file, if any.
+ * ---------
+ */
+static void
+attach_shared_stats(void)
+{
+    MemoryContext oldcontext;
 
     /*
-     * On some platforms, pg_getaddrinfo_all() may return multiple addresses
-     * only one of which will actually work (eg, both IPv6 and IPv4 addresses
-     * when kernel will reject IPv6).  Worse, the failure may occur at the
-     * bind() or perhaps even connect() stage.  So we must loop through the
-     * results till we find a working combination. We will generate LOG
-     * messages, but no error, for bogus combinations.
+     * Don't use DSM in the postmaster itself, or when not tracking counts.
      */
-    for (addr = addrs; addr; addr = addr->ai_next)
-    {
-#ifdef HAVE_UNIX_SOCKETS
-        /* Ignore AF_UNIX sockets, if any are returned. */
-        if (addr->ai_family == AF_UNIX)
-            continue;
-#endif
+    if (!pgstat_track_counts || !IsUnderPostmaster)
+        return;
 
-        if (++tries > 1)
-            ereport(LOG,
-                    (errmsg("trying another address for the statistics collector")));
+    pgstat_setup_memcxt();
 
-        /*
-         * Create the socket.
-         */
-        if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) == PGINVALID_SOCKET)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not create socket for statistics collector: %m")));
-            continue;
-        }
+    if (area)
+        return;
 
-        /*
-         * Bind it to a kernel assigned port on localhost and get the assigned
-         * port via getsockname().
-         */
-        if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not bind socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        alen = sizeof(pgStatAddr);
-        if (getsockname(pgStatSock, (struct sockaddr *) &pgStatAddr, &alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not get address of socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    /* stats shared memory persists for the backend lifetime */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
 
-        /*
-         * Connect the socket to its own address.  This saves a few cycles by
-         * not having to respecify the target address on every send. This also
-         * provides a kernel-level check that only packets from this same
-         * address will be received.
-         */
-        if (connect(pgStatSock, (struct sockaddr *) &pgStatAddr, alen) < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not connect socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+    /*
+     * The first attacher backend may still be reading the stats file, or the
+     * last detacher may be writing it. Wait for the work to finish.
+     */
+    ConditionVariablePrepareToSleep(&StatsShmem->holdoff_cv);
+    for (;;)
+    {
+        bool        hold_off;
 
-        /*
-         * Try to send and receive a one-byte test message on the socket. This
-         * is to catch situations where the socket can be created but will not
-         * actually pass data (for instance, because kernel packet filtering
-         * rules prevent it).
-         */
-        test_byte = TESTBYTEVAL;
-
-retry1:
-        if (send(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry1;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not send test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
+        LWLockAcquire(StatsLock, LW_SHARED);
+        hold_off = StatsShmem->attach_holdoff;
+        LWLockRelease(StatsLock);
 
-        /*
-         * There could possibly be a little delay before the message can be
-         * received.  We arbitrarily allow up to half a second before deciding
-         * it's broken.
-         */
-        for (;;)                /* need a loop to handle EINTR */
-        {
-            FD_ZERO(&rset);
-            FD_SET(pgStatSock, &rset);
-
-            tv.tv_sec = 0;
-            tv.tv_usec = 500000;
-            sel_res = select(pgStatSock + 1, &rset, NULL, NULL, &tv);
-            if (sel_res >= 0 || errno != EINTR)
-                break;
-        }
-        if (sel_res < 0)
-        {
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("select() failed in statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-        if (sel_res == 0 || !FD_ISSET(pgStatSock, &rset))
-        {
-            /*
-             * This is the case we actually think is likely, so take pains to
-             * give a specific message for it.
-             *
-             * errno will not be set meaningfully here, so don't use it.
-             */
-            ereport(LOG,
-                    (errcode(ERRCODE_CONNECTION_FAILURE),
-                     errmsg("test message did not get through on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        test_byte++;            /* just make sure variable is changed */
-
-retry2:
-        if (recv(pgStatSock, &test_byte, 1, 0) != 1)
-        {
-            if (errno == EINTR)
-                goto retry2;    /* if interrupted, just retry */
-            ereport(LOG,
-                    (errcode_for_socket_access(),
-                     errmsg("could not receive test message on socket for statistics collector: %m")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        if (test_byte != TESTBYTEVAL)    /* strictly paranoia ... */
-        {
-            ereport(LOG,
-                    (errcode(ERRCODE_INTERNAL_ERROR),
-                     errmsg("incorrect test message transmission on socket for statistics collector")));
-            closesocket(pgStatSock);
-            pgStatSock = PGINVALID_SOCKET;
-            continue;
-        }
-
-        /* If we get here, we have a working socket */
-        break;
+        if (!hold_off)
+            break;
+
+        ConditionVariableTimedSleep(&StatsShmem->holdoff_cv, 10,
+                                    WAIT_EVENT_READING_STATS_FILE);
     }
+    ConditionVariableCancelSleep();
 
-    /* Did we find a working address? */
-    if (!addr || pgStatSock == PGINVALID_SOCKET)
-        goto startup_failed;
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
 
     /*
-     * Set the socket to non-blocking IO.  This ensures that if the collector
-     * falls behind, statistics messages will be discarded; backends won't
-     * block waiting to send messages to the collector.
+     * The last process is responsible for writing out the stats file at
+     * exit. Maintain refcount so that a process going to exit can find
+     * whether it is the last one or not.
      */
-    if (!pg_set_noblock(pgStatSock))
+    if (StatsShmem->refcount > 0)
+        StatsShmem->refcount++;
+    else
     {
-        ereport(LOG,
-                (errcode_for_socket_access(),
-                 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-        goto startup_failed;
+        /* We're the first process to attach the shared stats memory */
+        Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
+
+        /* Initialize shared memory area */
+        area = dsa_create(LWTRANCHE_STATS);
+        pgStatSharedHash = dshash_create(area, &dsh_params, 0);
+
+        StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+        StatsShmem->hash_handle = dshash_get_hash_table_handle(pgStatSharedHash);
+        LWLockInitialize(&StatsShmem->slru_stats.lock, LWTRANCHE_STATS);
+        pg_atomic_init_u32(&StatsShmem->slru_stats.changecount, 0);
+
+        /* Block the next attacher for a while, see the comment above. */
+        StatsShmem->attach_holdoff = true;
+
+        StatsShmem->refcount = 1;
     }
 
-    /*
-     * Try to ensure that the socket's receive buffer is at least
-     * PGSTAT_MIN_RCVBUF bytes, so that it won't easily overflow and lose
-     * data.  Use of UDP protocol means that we are willing to lose data under
-     * heavy load, but we don't want it to happen just because of ridiculously
-     * small default buffer sizes (such as 8KB on older Windows versions).
-     */
+    LWLockRelease(StatsLock);
+
+    if (area)
     {
-        int            old_rcvbuf;
-        int            new_rcvbuf;
-        ACCEPT_TYPE_ARG3 rcvbufsize = sizeof(old_rcvbuf);
-
-        if (getsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                       (char *) &old_rcvbuf, &rcvbufsize) < 0)
-        {
-            ereport(LOG,
-                    (errmsg("getsockopt(%s) failed: %m", "SO_RCVBUF")));
-            /* if we can't get existing size, always try to set it */
-            old_rcvbuf = 0;
-        }
-
-        new_rcvbuf = PGSTAT_MIN_RCVBUF;
-        if (old_rcvbuf < new_rcvbuf)
-        {
-            if (setsockopt(pgStatSock, SOL_SOCKET, SO_RCVBUF,
-                           (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
-                ereport(LOG,
-                        (errmsg("setsockopt(%s) failed: %m", "SO_RCVBUF")));
-        }
+        /*
+         * We're the first attacher process, read stats file while blocking
+         * successors.
+         */
+        Assert(StatsShmem->attach_holdoff);
+        pgstat_read_statsfile();
+        allow_next_attacher();
+    }
+    else
+    {
+        /* We're not the first one, attach existing shared area. */
+        area = dsa_attach(StatsShmem->stats_dsa_handle);
+        pgStatSharedHash = dshash_attach(area, &dsh_params,
+                                         StatsShmem->hash_handle, 0);
     }
 
-    pg_freeaddrinfo_all(hints.ai_family, addrs);
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
 
-    /* Now that we have a long-lived socket, tell fd.c about it. */
-    ReserveExternalFD();
+    MemoryContextSwitchTo(oldcontext);
 
-    return;
+    /* don't detach automatically */
+    dsa_pin_mapping(area);
+}
 
-startup_failed:
-    ereport(LOG,
-            (errmsg("disabling statistics collector for lack of working socket")));
+/* ----------
+ * cleanup_dropped_stats_entries() -
+ *              Clean up shared stats entries no longer used.
+ *
+ *  Shared stats entries for dropped objects may be left referenced. Clean up
+ *  our reference and drop the shared entry if needed.
+ * ----------
+ */
+static void
+cleanup_dropped_stats_entries(void)
+{
+    pgstat_localhash_iterator i;
+    PgStatLocalHashEntry *ent;
 
-    if (addrs)
-        pg_freeaddrinfo_all(hints.ai_family, addrs);
+    if (pgStatEntHash == NULL)
+        return;
 
-    if (pgStatSock != PGINVALID_SOCKET)
-        closesocket(pgStatSock);
-    pgStatSock = PGINVALID_SOCKET;
+    pgstat_localhash_start_iterate(pgStatEntHash, &i);
+    while ((ent = pgstat_localhash_iterate(pgStatEntHash, &i))
+           != NULL)
+    {
+        /*
+         * Free the shared memory chunk for the entry if we were the last
+         * referrer to a dropped entry.
+         */
+        if (pg_atomic_sub_fetch_u32(&ent->body->refcount, 1) < 1 &&
+            ent->body->dropped)
+            dsa_free(area, ent->dsapointer);
+    }
 
     /*
-     * Adjust GUC variables to suppress useless activity, and for debugging
-     * purposes (seeing track_counts off is a clue that we failed here). We
-     * use PGC_S_OVERRIDE because there is no point in trying to turn it back
-     * on from postgresql.conf without a restart.
+     * This function is expected to be called during backend exit. So we don't
+     * bother destroying pgStatEntHash.
      */
-    SetConfigOption("track_counts", "off", PGC_INTERNAL, PGC_S_OVERRIDE);
+    pgStatEntHash = NULL;
 }
 
-/*
- * subroutine for pgstat_reset_all
+/* ----------
+ * detach_shared_stats() -
+ *
+ *    Detach shared stats. Write out to file if we're the last process and told
+ *    to do so.
+ * ----------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+detach_shared_stats(bool write_file)
 {
-    DIR           *dir;
-    struct dirent *entry;
-    char        fname[MAXPGPATH * 2];
+    bool        is_last_detacher = false;
+
+    /* return immediately if there is nothing to detach */
+    if (!area || !IsUnderPostmaster)
+        return;
+
+    /* We shouldn't leave a reference to shared stats. */
+    cleanup_dropped_stats_entries();
 
-    dir = AllocateDir(directory);
-    while ((entry = ReadDir(dir, directory)) != NULL)
+    /*
+     * If we are the last detacher, hold off the next attacher (if possible)
+     * until we finish writing the stats file.
+     */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (--StatsShmem->refcount == 0)
     {
-        int            nchars;
-        Oid            tmp_oid;
+        StatsShmem->attach_holdoff = true;
+        is_last_detacher = true;
+    }
+    LWLockRelease(StatsLock);
 
-        /*
-         * Skip directory entries that don't match the file names we write.
-         * See get_dbstat_filename for the database-specific pattern.
-         */
-        if (strncmp(entry->d_name, "global.", 7) == 0)
-            nchars = 7;
-        else
-        {
-            nchars = 0;
-            (void) sscanf(entry->d_name, "db_%u.%n",
-                          &tmp_oid, &nchars);
-            if (nchars <= 0)
-                continue;
-            /* %u allows leading whitespace, so reject that */
-            if (strchr("0123456789", entry->d_name[3]) == NULL)
-                continue;
-        }
-
-        if (strcmp(entry->d_name + nchars, "tmp") != 0 &&
-            strcmp(entry->d_name + nchars, "stat") != 0)
-            continue;
-
-        snprintf(fname, sizeof(fname), "%s/%s", directory,
-                 entry->d_name);
-        unlink(fname);
+    if (is_last_detacher)
+    {
+        if (write_file)
+            pgstat_write_statsfile();
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+        /* allow the next attacher, if any */
+        allow_next_attacher();
     }
-    FreeDir(dir);
+
+    /*
+     * Detach the area. It is automatically destroyed when the last process
+     * detaches from it.
+     */
+    dsa_detach(area);
+
+    area = NULL;
+    pgStatSharedHash = NULL;
+
+    /* We are going to exit. Don't bother destroying local hashes. */
+    pgStatLocalHash = NULL;
 }
 
 /*
  * pgstat_reset_all() -
  *
- * Remove the stats files.  This is currently used only if WAL
- * recovery is needed after a crash.
+ * Remove the stats file.  This is currently used only if WAL recovery is
+ * needed after a crash.
  */
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
-    pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
+    /* standalone server doesn't use shared stats */
+    if (!IsUnderPostmaster)
+        return;
+
+    /* we must have shared stats attached */
+    Assert(StatsShmem->stats_dsa_handle != DSM_HANDLE_INVALID);
+
+    /* Startup must be the only user of shared stats */
+    Assert(StatsShmem->refcount == 1);
+
+    /*
+     * We could directly remove files and recreate the shared memory area. But
+     * just discard it and then create it again, for simplicity.
+     */
+    detach_shared_stats(false);
+    attach_shared_stats();
 }
 
-#ifdef EXEC_BACKEND
 
 /*
- * pgstat_forkexec() -
- *
- * Format up the arglist for, then fork and exec, statistics collector process
+ * fetch_lock_statentry - common helper function to fetch and lock a stats
+ * entry for flush_tabstat, flush_funcstat and flush_dbstat.
  */
-static pid_t
-pgstat_forkexec(void)
+static PgStat_StatEntryHeader *
+fetch_lock_statentry(PgStatTypes type, Oid dboid, Oid objid, bool nowait)
 {
-    char       *av[10];
-    int            ac = 0;
+    PgStat_StatEntryHeader *header;
+
+    /* find shared table stats entry corresponding to the local entry */
+    header = (PgStat_StatEntryHeader *)
+        get_stat_entry(type, dboid, objid, nowait, true, NULL);
 
-    av[ac++] = "postgres";
-    av[ac++] = "--forkcol";
-    av[ac++] = NULL;            /* filled in by postmaster_forkexec */
+    /* skip if dshash failed to acquire lock */
+    if (header == NULL)
+        return NULL;
 
-    av[ac] = NULL;
-    Assert(ac < lengthof(av));
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&header->lock, LW_EXCLUSIVE))
+        return NULL;
 
-    return postmaster_forkexec(ac, av);
+    return header;
 }
-#endif                            /* EXEC_BACKEND */
 
 
-/*
- * pgstat_start() -
+/* ----------
+ * get_stat_entry() -
  *
- *    Called from postmaster at startup or after an existing collector
- *    died.  Attempt to fire up a fresh statistics collector.
+ *    Get the shared stats entry for the specified type, dbid and objid.
+ *  If nowait is true, returns NULL on lock failure.
  *
- *    Returns PID of child process, or 0 if fail.
+ *  If create is true, a new entry is created if one does not exist yet.
+ *  If found is not NULL, it is set to true if an existing entry is found,
+ *  or false if not.
+ *  ----------
+ */
+static PgStat_StatEntryHeader *
+get_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait, bool create,
+               bool *found)
+{
+    PgStatHashEntry *shhashent;
+    PgStatLocalHashEntry *lohashent;
+    PgStat_StatEntryHeader *shheader = NULL;
+    PgStatHashKey key;
+    bool        shfound;
+
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+
+    if (pgStatEntHash)
+    {
+        uint64        currage;
+
+        /*
+         * pgStatEntHashAge advances much more slowly than the time the
+         * following loop takes, so this is expected to iterate no more than
+         * twice.
+         */
+        while (unlikely
+               (pgStatEntHashAge !=
+                (currage = pg_atomic_read_u64(&StatsShmem->gc_count))))
+        {
+            pgstat_localhash_iterator i;
+
+            /*
+             * Some entries have been dropped. Invalidate cached pointers to
+             * them.
+             */
+            pgstat_localhash_start_iterate(pgStatEntHash, &i);
+            while ((lohashent = pgstat_localhash_iterate(pgStatEntHash, &i))
+                   != NULL)
+            {
+                PgStat_StatEntryHeader *header = lohashent->body;
+
+                if (header->dropped)
+                {
+                    pgstat_localhash_delete(pgStatEntHash, key);
+
+                    if (pg_atomic_sub_fetch_u32(&header->refcount, 1) < 1)
+                    {
+                        /*
+                         * We're the last referrer to this entry, drop the
+                         * shared entry.
+                         */
+                        dsa_free(area, lohashent->dsapointer);
+                    }
+                }
+            }
+
+            pgStatEntHashAge = currage;
+        }
+
+        lohashent = pgstat_localhash_lookup(pgStatEntHash, key);
+
+        if (lohashent)
+        {
+            if (found)
+                *found = true;
+            return lohashent->body;
+        }
+    }
+
+    shhashent = dshash_find_extended(pgStatSharedHash, &key,
+                                     create, nowait, create, &shfound);
+    if (shhashent)
+    {
+        if (create && !shfound)
+        {
+            /* Create new stats entry. */
+            dsa_pointer chunk = dsa_allocate0(area,
+                                              pgstat_sharedentsize[type]);
+
+            shheader = dsa_get_address(area, chunk);
+            LWLockInitialize(&shheader->lock, LWTRANCHE_STATS);
+            pg_atomic_init_u32(&shheader->refcount, 0);
+
+            /* Link the new entry from the hash entry. */
+            shhashent->body = chunk;
+        }
+        else
+            shheader = dsa_get_address(area, shhashent->body);
+
+        /*
+         * We expose this shared entry now.  You might think that the entry
+         * can be removed by a concurrent backend, but since we are creating
+         * a stats entry, the object actually exists and is used in the upper
+         * layer. Such an object cannot be dropped until the first vacuum
+         * after the current transaction ends.
+         */
+        dshash_release_lock(pgStatSharedHash, shhashent);
+
+        /* register to local hash if possible */
+        if (pgStatEntHash || pgStatCacheContext)
+        {
+            bool        lofound;
+
+            if (pgStatEntHash == NULL)
+            {
+                pgStatEntHash =
+                    pgstat_localhash_create(pgStatCacheContext,
+                                            PGSTAT_TABLE_HASH_SIZE, NULL);
+                pgStatEntHashAge =
+                    pg_atomic_read_u64(&StatsShmem->gc_count);
+            }
+
+            lohashent =
+                pgstat_localhash_insert(pgStatEntHash, key, &lofound);
+
+            Assert(!lofound);
+            lohashent->body = shheader;
+            lohashent->dsapointer = shhashent->body;
+
+            pg_atomic_add_fetch_u32(&shheader->refcount, 1);
+        }
+    }
+
+    if (found)
+        *found = shfound;
+
+    return shheader;
+}
+
+/*
+ * flush_walstat - flush out local WAL stats entries
  *
- *    Note: if fail, we will be called again from the postmaster main loop.
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if all local WAL stats are successfully flushed out.
  */
-int
-pgstat_start(void)
+static bool
+flush_walstat(bool nowait)
 {
-    time_t        curtime;
-    pid_t        pgStatPid;
+    PgStat_Wal *s = &StatsShmem->wal_stats;
+    PgStat_Wal *l = &WalStats;
+    WalUsage    all_zeroes PG_USED_FOR_ASSERTS_ONLY = {0};
 
     /*
-     * Check that the socket is there, else pgstat_init failed and we can do
-     * nothing useful.
+     * We don't update the WAL usage portion of the local WalStats elsewhere.
+     * Instead, fill in that portion with the difference of pgWalUsage since
+     * the previous call.
      */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return 0;
+    Assert(memcmp(&l->wal_usage, &all_zeroes, sizeof(WalUsage)) == 0);
+    WalUsageAccumDiff(&l->wal_usage, &pgWalUsage, &prevWalUsage);
 
     /*
-     * Do nothing if too soon since last collector start.  This is a safety
-     * valve to protect against continuous respawn attempts if the collector
-     * is dying immediately at launch.  Note that since we will be re-called
-     * from the postmaster main loop, we will get another chance later.
+     * This function can be called even if nothing at all has happened. Avoid
+     * taking lock for nothing in that case.
      */
-    curtime = time(NULL);
-    if ((unsigned int) (curtime - last_pgstat_start_time) <
-        (unsigned int) PGSTAT_RESTART_INTERVAL)
-        return 0;
-    last_pgstat_start_time = curtime;
+    if (!walstats_pending())
+        return true;
+
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&StatsShmem->wal_stats_lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&StatsShmem->wal_stats_lock,
+                                       LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
+
+    s->wal_usage.wal_records += l->wal_usage.wal_records;
+    s->wal_usage.wal_fpi += l->wal_usage.wal_fpi;
+    s->wal_usage.wal_bytes += l->wal_usage.wal_bytes;
+    s->wal_buffers_full += l->wal_buffers_full;
+    s->wal_write += l->wal_write;
+    s->wal_write_time += l->wal_write_time;
+    s->wal_sync += l->wal_sync;
+    s->wal_sync_time += l->wal_sync_time;
+    LWLockRelease(&StatsShmem->wal_stats_lock);
 
     /*
-     * Okay, fork off the collector.
+     * Save the current counters for the subsequent calculation of WAL usage.
      */
-#ifdef EXEC_BACKEND
-    switch ((pgStatPid = pgstat_forkexec()))
-#else
-    switch ((pgStatPid = fork_process()))
-#endif
-    {
-        case -1:
-            ereport(LOG,
-                    (errmsg("could not fork statistics collector: %m")));
-            return 0;
+    prevWalUsage = pgWalUsage;
 
-#ifndef EXEC_BACKEND
-        case 0:
-            /* in postmaster child ... */
-            InitPostmasterChild();
+    /*
+     * Clear out the statistics buffer, so it can be re-used.
+     */
+    MemSet(&WalStats, 0, sizeof(WalStats));
 
-            /* Close the postmaster's sockets */
-            ClosePostmasterPorts(false);
+    return true;
+}
 
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
+/*
+ * flush_slrustat - flush out local SLRU stats entries
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true. Writer processes are mutually excluded
+ * using an LWLock, but readers are expected to use the change-count protocol
+ * to avoid interference with writers.
+ *
+ * Returns true if all local SLRU stats are successfully flushed out.
+ */
+static bool
+flush_slrustat(bool nowait)
+{
+    uint32        assert_changecount PG_USED_FOR_ASSERTS_ONLY;
+    int            i;
 
-            PgstatCollectorMain(0, NULL);
-            break;
-#endif
+    if (!have_slrustats)
+        return true;
 
-        default:
-            return (int) pgStatPid;
+    /* lock the shared entry to protect the content, skip if failed */
+    if (!nowait)
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+    else if (!LWLockConditionalAcquire(&StatsShmem->slru_stats.lock,
+                                       LW_EXCLUSIVE))
+        return false;            /* failed to acquire lock, skip */
+
+    assert_changecount =
+        pg_atomic_fetch_add_u32(&StatsShmem->slru_stats.changecount, 1);
+    Assert((assert_changecount & 1) == 0);
+
+    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
+    {
+        PgStat_SLRUStats *sharedent = &StatsShmem->slru_stats.entry[i];
+        PgStat_SLRUStats *localent = &local_SLRUStats[i];
+
+        sharedent->blocks_zeroed += localent->blocks_zeroed;
+        sharedent->blocks_hit += localent->blocks_hit;
+        sharedent->blocks_read += localent->blocks_read;
+        sharedent->blocks_written += localent->blocks_written;
+        sharedent->blocks_exists += localent->blocks_exists;
+        sharedent->flush += localent->flush;
+        sharedent->truncate += localent->truncate;
     }
 
-    /* shouldn't get here */
-    return 0;
+    /* done, clear the local entry */
+    MemSet(local_SLRUStats, 0,
+           sizeof(PgStat_SLRUStats) * SLRU_NUM_ELEMENTS);
+
+    pg_atomic_add_fetch_u32(&StatsShmem->slru_stats.changecount, 1);
+    LWLockRelease(&StatsShmem->slru_stats.lock);
+
+    have_slrustats = false;
+
+    return true;
+}
+
+/* ----------
+ * delete_current_stats_entry()
+ *
+ *  Deletes the given shared entry from shared stats hash. The entry must be
+ *  exclusively locked.
+ * ----------
+ */
+static void
+delete_current_stats_entry(dshash_seq_status *hstat)
+{
+    dsa_pointer pdsa;
+    PgStat_StatEntryHeader *header;
+    PgStatHashEntry *ent;
+
+    ent = dshash_get_current(hstat);
+    pdsa = ent->body;
+    header = dsa_get_address(area, pdsa);
+
+    /* No one can find this entry from now on. */
+    dshash_delete_current(hstat);
+
+    /*
+     * Let the referrers drop the entry if any.  Refcount won't be decremented
+     * until "dropped" is set true and StatsShmem->gc_count is incremented
+     * later. So we can check refcount to set dropped without holding a lock.
+     * If no one is referring to this entry, free it immediately.
+     */
+
+    if (pg_atomic_read_u32(&header->refcount) > 0)
+        header->dropped = true;
+    else
+        dsa_free(area, pdsa);
+
+    return;
 }
 
-void
-allow_immediate_pgstat_restart(void)
+/* ----------
+ * get_local_stat_entry() -
+ *
+ *  Returns the local stats entry for the type, dbid and objid.
+ *  If create is true, a new entry is created if one does not exist yet; found
+ *  must be non-NULL in that case.
+ *
+ *
+ *  The caller is responsible for initializing the statsbody part of the
+ *  returned memory.
+ * ----------
+ */
+static PgStat_StatEntryHeader *
+get_local_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+                     bool create, bool *found)
 {
-    last_pgstat_start_time = 0;
+    PgStatHashKey key;
+    PgStatLocalHashEntry *entry;
+
+    if (pgStatLocalHash == NULL)
+        pgStatLocalHash = pgstat_localhash_create(pgStatCacheContext,
+                                                  PGSTAT_TABLE_HASH_SIZE, NULL);
+
+    /* Find an entry or create a new one. */
+    key.type = type;
+    key.databaseid = dbid;
+    key.objectid = objid;
+    if (create)
+        entry = pgstat_localhash_insert(pgStatLocalHash, key, found);
+    else
+        entry = pgstat_localhash_lookup(pgStatLocalHash, key);
+
+    if (!create && !entry)
+        return NULL;
+
+    if (create && !*found)
+        entry->body = MemoryContextAllocZero(TopMemoryContext,
+                                             pgstat_localentsize[type]);
+
+    return entry->body;
 }
 
 /* ------------------------------------------------------------
@@ -851,159 +1149,404 @@ allow_immediate_pgstat_restart(void)
  *------------------------------------------------------------
  */
 
-
 /* ----------
  * pgstat_report_stat() -
  *
  *    Must be called by processes that performs DML: tcop/postgres.c, logical
- *    receiver processes, SPI worker, etc. to send the so far collected
- *    per-table and function usage statistics to the collector.  Note that this
- *    is called only when not within a transaction, so it is fair to use
+ *    receiver processes, SPI worker, etc. to apply the so far collected
+ *    per-table and function usage statistics to the shared statistics hashes.
+ *
+ *    Updates are applied no more often than once every PGSTAT_MIN_INTERVAL
+ *    milliseconds. They are also postponed on lock failure if force is
+ *    false and no update has been pending for longer than
+ *    PGSTAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
+ *    succeeding calls of this function.
+ *
+ *    Returns the time in milliseconds until the next attempt to apply
+ *    updates if some updates are left pending after this call, or 0 if
+ *    nothing is left pending.
+ *
+ *    Note that this is called only outside a transaction, so it is fine to use
  *    transaction stop time as an approximation of current time.
- *
- *    "disconnect" is "true" only for the last call before the backend
- *    exits.  This makes sure that no data is lost and that interrupted
- *    sessions are reported correctly.
- * ----------
+ *    ----------
  */
-void
-pgstat_report_stat(bool disconnect)
+long
+pgstat_report_stat(bool force)
 {
-    /* we assume this inits to all zeroes: */
-    static const PgStat_TableCounts all_zeroes;
-    static TimestampTz last_report = 0;
-
+    static TimestampTz next_flush = 0;
+    static TimestampTz pending_since = 0;
+    static long retry_interval = 0;
     TimestampTz now;
-    PgStat_MsgTabstat regular_msg;
-    PgStat_MsgTabstat shared_msg;
-    TabStatusArray *tsa;
+    bool        nowait;
     int            i;
+    uint64        oldval;
+
+    /* Return if not active */
+    if (area == NULL)
+        return 0;
 
     /* Don't expend a clock check if nothing to do */
-    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-        !have_function_stats && !disconnect)
-        return;
+    if (pgStatLocalHash == NULL && !have_slrustats && !walstats_pending())
+        return 0;
 
-    /*
-     * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-     * msec since we last sent one, or the backend is about to exit.
-     */
     now = GetCurrentTransactionStopTimestamp();
-    if (!disconnect &&
-        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
-        return;
 
-    /* for backends, send connection statistics */
+    if (!force)
+    {
+        /*
+         * Don't flush stats too frequently.  Return the time to the next
+         * flush.
+         */
+        if (now < next_flush)
+        {
+            /* Record the epoch time if retrying. */
+            if (pending_since == 0)
+                pending_since = now;
+
+            return (next_flush - now) / 1000;
+        }
+
+        /* But, don't keep pending updates longer than PGSTAT_MAX_INTERVAL. */
+
+        if (pending_since > 0 &&
+            TimestampDifferenceExceeds(pending_since, now, PGSTAT_MAX_INTERVAL))
+            force = true;
+    }
+
+    /* for backends, update connection statistics */
     if (MyBackendType == B_BACKEND)
-        pgstat_send_connstats(disconnect, last_report);
+        pgstat_update_connstats(false);
 
-    last_report = now;
+    /* don't wait for lock acquisition when !force */
+    nowait = !force;
 
-    /*
-     * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
-     * entries it points to.  (Should we fail partway through the loop below,
-     * it's okay to have removed the hashtable already --- the only
-     * consequence is we'd get multiple entries for the same table in the
-     * pgStatTabList, and that's safe.)
-     */
-    if (pgStatTabHash)
-        hash_destroy(pgStatTabHash);
-    pgStatTabHash = NULL;
-
-    /*
-     * Scan through the TabStatusArray struct(s) to find tables that actually
-     * have counts, and build messages to send.  We have to separate shared
-     * relations from regular ones because the databaseid field in the message
-     * header has to depend on that.
-     */
-    regular_msg.m_databaseid = MyDatabaseId;
-    shared_msg.m_databaseid = InvalidOid;
-    regular_msg.m_nentries = 0;
-    shared_msg.m_nentries = 0;
-
-    for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+    if (pgStatLocalHash)
     {
-        for (i = 0; i < tsa->tsa_used; i++)
+        int            remains = 0;
+        pgstat_localhash_iterator i;
+        PgStatLocalHashEntry *lent;
+
+        /* Step 1: flush out stats other than database stats */
+        pgstat_localhash_start_iterate(pgStatLocalHash, &i);
+        while ((lent = pgstat_localhash_iterate(pgStatLocalHash, &i)) != NULL)
         {
-            PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-            PgStat_MsgTabstat *this_msg;
-            PgStat_TableEntry *this_ent;
+            bool        remove = false;
 
-            /* Shouldn't have any pending transaction-dependent counts */
-            Assert(entry->trans == NULL);
+            switch (lent->key.type)
+            {
+                case PGSTAT_TYPE_DB:
+                    /*
+                     * flush_tabstat applies some of the counters of flushed
+                     * entries to the local and shared database stats. Treat
+                     * database entries separately later.
+                     */
+                    break;
+                case PGSTAT_TYPE_TABLE:
+                    if (flush_tabstat(lent, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_FUNCTION:
+                    if (flush_funcstat(lent, nowait))
+                        remove = true;
+                    break;
+                case PGSTAT_TYPE_REPLSLOT:
+                    /* We don't have that kind of local entry */
+                    Assert(false);
+            }
 
-            /*
-             * Ignore entries that didn't accumulate any actual counts, such
-             * as indexes that were opened by the planner but not used.
-             */
-            if (memcmp(&entry->t_counts, &all_zeroes,
-                       sizeof(PgStat_TableCounts)) == 0)
+            if (!remove)
+            {
+                remains++;
                 continue;
+            }
 
-            /*
-             * OK, insert data into the appropriate message, and send if full.
-             */
-            this_msg = entry->t_shared ? &shared_msg : ®ular_msg;
-            this_ent = &this_msg->m_entry[this_msg->m_nentries];
-            this_ent->t_id = entry->t_id;
-            memcpy(&this_ent->t_counts, &entry->t_counts,
-                   sizeof(PgStat_TableCounts));
-            if (++this_msg->m_nentries >= PGSTAT_NUM_TABENTRIES)
+            /* Remove the successfully flushed entry */
+            pfree(lent->body);
+            lent->body = NULL;
+            pgstat_localhash_delete(pgStatLocalHash, lent->key);
+        }
+
+        /* Step 2: flush out database stats */
+        pgstat_localhash_start_iterate(pgStatLocalHash, &i);
+        while ((lent = pgstat_localhash_iterate(pgStatLocalHash, &i)) != NULL)
+        {
+            /* no entries of other types should be found here */
+            Assert(lent->key.type == PGSTAT_TYPE_DB);
+
+            if (flush_dbstat(lent, nowait))
             {
-                pgstat_send_tabstat(this_msg);
-                this_msg->m_nentries = 0;
+                remains--;
+                /* Remove the successfully flushed entry */
+                pfree(lent->body);
+                lent->body = NULL;
+                pgstat_localhash_delete(pgStatLocalHash, lent->key);
             }
         }
-        /* zero out PgStat_TableStatus structs after use */
-        MemSet(tsa->tsa_entries, 0,
-               tsa->tsa_used * sizeof(PgStat_TableStatus));
-        tsa->tsa_used = 0;
+
+        if (remains <= 0)
+        {
+            pgstat_localhash_destroy(pgStatLocalHash);
+            pgStatLocalHash = NULL;
+        }
     }
 
+    /* flush wal stats */
+    flush_walstat(nowait);
+
+    /* flush SLRU stats */
+    flush_slrustat(nowait);
+
     /*
-     * Send partial messages.  Make sure that any pending xact commit/abort
-     * gets counted, even if there are no table stats to send.
+     * Publish the time of the last flush, but don't notify anyone of the
+     * timestamp change itself. Readers will get a sufficiently recent
+     * timestamp. If we fail to update the value, a concurrent process must
+     * have updated it to a sufficiently recent time.
+     *
+     * XXX: The loop might be unnecessary for the reason above.
      */
-    if (regular_msg.m_nentries > 0 ||
-        pgStatXactCommit > 0 || pgStatXactRollback > 0)
-        pgstat_send_tabstat(®ular_msg);
-    if (shared_msg.m_nentries > 0)
-        pgstat_send_tabstat(&shared_msg);
+    oldval = pg_atomic_read_u64(&StatsShmem->stats_timestamp);
 
-    /* Now, send function statistics */
-    pgstat_send_funcstats();
+    for (i = 0; i < 10; i++)
+    {
+        if (oldval >= now ||
+            pg_atomic_compare_exchange_u64(&StatsShmem->stats_timestamp,
+                                           &oldval, (uint64) now))
+            break;
+    }
 
-    /* Send WAL statistics */
-    pgstat_report_wal();
+    /*
+     * Some of the local stats may not have been flushed due to lock
+     * contention.  If we have such pending local stats here, let the caller
+     * know the retry interval.
+     */
+    if (pgStatLocalHash != NULL || have_slrustats || walstats_pending())
+    {
+        /* Retain the epoch time */
+        if (pending_since == 0)
+            pending_since = now;
 
-    /* Finally send SLRU statistics */
-    pgstat_send_slru();
+        /* The interval is doubled at every retry. */
+        if (retry_interval == 0)
+            retry_interval = PGSTAT_RETRY_MIN_INTERVAL * 1000;
+        else
+            retry_interval = retry_interval * 2;
+
+        /*
+         * Determine the next retry interval so as not to get shorter than the
+         * previous interval.
+         */
+        if (!TimestampDifferenceExceeds(pending_since,
+                                        now + 2 * retry_interval,
+                                        PGSTAT_MAX_INTERVAL))
+            next_flush = now + retry_interval;
+        else
+        {
+            next_flush = pending_since + PGSTAT_MAX_INTERVAL * 1000;
+            retry_interval = next_flush - now;
+        }
+
+        return retry_interval / 1000;
+    }
+
+    /* Set the next time to update stats */
+    next_flush = now + PGSTAT_MIN_INTERVAL * 1000;
+    retry_interval = 0;
+    pending_since = 0;
+
+    return 0;
 }
 
 /*
- * Subroutine for pgstat_report_stat: finish and send a tabstat message
+ * flush_tabstat - flush out a local table stats entry
+ *
+ * Some of the counters are copied to the local database stats entry after
+ * a successful flush-out.
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
  */
-static void
-pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
+static bool
+flush_tabstat(PgStatLocalHashEntry *ent, bool nowait)
 {
-    int            n;
-    int            len;
+    static const PgStat_TableCounts all_zeroes;
+    Oid            dboid;            /* database OID of the table */
+    PgStat_TableStatus *lstats; /* local stats entry  */
+    PgStat_StatTabEntry *shtabstats;    /* table entry of shared stats */
+    PgStat_StatDBEntry *ldbstats;    /* local database entry */
+
+    Assert(ent->key.type == PGSTAT_TYPE_TABLE);
+    lstats = (PgStat_TableStatus *) ent->body;
+    dboid = ent->key.databaseid;
+
+    /*
+     * Ignore entries that didn't accumulate any actual counts, such as
+     * indexes that were opened by the planner but not used.
+     */
+    if (memcmp(&lstats->t_counts, &all_zeroes,
+               sizeof(PgStat_TableCounts)) == 0)
+    {
+        /* This local entry is going to be dropped, delink from relcache. */
+        pgstat_delinkstats(lstats->relation);
+        return true;
+    }
+
+    /* find shared table stats entry corresponding to the local entry */
+    shtabstats = (PgStat_StatTabEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_TABLE, dboid, ent->key.objectid,
+                             nowait);
+
+    if (shtabstats == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    /* add the values to the shared entry. */
+    shtabstats->numscans += lstats->t_counts.t_numscans;
+    shtabstats->tuples_returned += lstats->t_counts.t_tuples_returned;
+    shtabstats->tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    shtabstats->tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    shtabstats->tuples_updated += lstats->t_counts.t_tuples_updated;
+    shtabstats->tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    shtabstats->tuples_hot_updated += lstats->t_counts.t_tuples_hot_updated;
+
+    /*
+     * If the table was truncated or vacuum/analyze has run, first reset the
+     * live/dead counters.
+     */
+    if (lstats->t_counts.t_truncated)
+    {
+        shtabstats->n_live_tuples = 0;
+        shtabstats->n_dead_tuples = 0;
+    }
+
+    shtabstats->n_live_tuples += lstats->t_counts.t_delta_live_tuples;
+    shtabstats->n_dead_tuples += lstats->t_counts.t_delta_dead_tuples;
+    shtabstats->changes_since_analyze += lstats->t_counts.t_changed_tuples;
+    shtabstats->inserts_since_vacuum += lstats->t_counts.t_tuples_inserted;
+    shtabstats->blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    shtabstats->blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /* Clamp n_live_tuples in case of negative delta_live_tuples */
+    shtabstats->n_live_tuples = Max(shtabstats->n_live_tuples, 0);
+    /* Likewise for n_dead_tuples */
+    shtabstats->n_dead_tuples = Max(shtabstats->n_dead_tuples, 0);
+
+    LWLockRelease(&shtabstats->header.lock);
+
+    /* Flushed successfully, so add the same counts to the database stats */
+    ldbstats = get_local_dbstat_entry(dboid);
+    ldbstats->counts.n_tuples_returned += lstats->t_counts.t_tuples_returned;
+    ldbstats->counts.n_tuples_fetched += lstats->t_counts.t_tuples_fetched;
+    ldbstats->counts.n_tuples_inserted += lstats->t_counts.t_tuples_inserted;
+    ldbstats->counts.n_tuples_updated += lstats->t_counts.t_tuples_updated;
+    ldbstats->counts.n_tuples_deleted += lstats->t_counts.t_tuples_deleted;
+    ldbstats->counts.n_blocks_fetched += lstats->t_counts.t_blocks_fetched;
+    ldbstats->counts.n_blocks_hit += lstats->t_counts.t_blocks_hit;
+
+    /* This local entry is going to be dropped, delink from relcache. */
+    pgstat_delinkstats(lstats->relation);
+
+    return true;
+}
+
+/*
+ * flush_funcstat - flush out a local function stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_funcstat(PgStatLocalHashEntry *ent, bool nowait)
+{
+    PgStat_BackendFunctionEntry *localent;    /* local stats entry */
+    PgStat_StatFuncEntry *shfuncent = NULL; /* shared stats entry */
+
+    Assert(ent->key.type == PGSTAT_TYPE_FUNCTION);
+    localent = (PgStat_BackendFunctionEntry *) ent->body;
+
+    /* localent always has non-zero content */
+
+    /* find shared function stats entry corresponding to the local entry */
+    shfuncent = (PgStat_StatFuncEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             ent->key.objectid, nowait);
+    if (shfuncent == NULL)
+        return false;            /* failed to acquire lock, skip */
+
+    shfuncent->f_numcalls += localent->f_counts.f_numcalls;
+    shfuncent->f_total_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_total_time);
+    shfuncent->f_self_time +=
+        INSTR_TIME_GET_MICROSEC(localent->f_counts.f_self_time);
+
+    LWLockRelease(&shfuncent->header.lock);
+
+    return true;
+}
+
+
+/*
+ * flush_dbstat - flush out a local database stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+#define PGSTAT_ACCUM_DBCOUNT(sh, lo, item)        \
+    (sh)->counts.item += (lo)->counts.item
+
+static bool
+flush_dbstat(PgStatLocalHashEntry *ent, bool nowait)
+{
+    PgStat_StatDBEntry *localent;
+    PgStat_StatDBEntry *sharedent;
+
+    Assert(ent->key.type == PGSTAT_TYPE_DB);
+
+    localent = (PgStat_StatDBEntry *) ent->body;
+
+    /* find shared database stats entry corresponding to the local entry */
+    sharedent = (PgStat_StatDBEntry *)
+        fetch_lock_statentry(PGSTAT_TYPE_DB, ent->key.databaseid, InvalidOid,
+                             nowait);
+
+    if (!sharedent)
+        return false;            /* failed to acquire lock, skip */
+
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, n_tuples_returned);
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, n_tuples_fetched);
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, n_tuples_inserted);
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, n_tuples_updated);
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, n_tuples_deleted);
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, n_blocks_fetched);
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, n_blocks_hit);
+
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, n_deadlocks);
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, n_temp_bytes);
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, n_temp_files);
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, n_checksum_failures);
 
-    /* It's unlikely we'd get here with no socket, but maybe not impossible */
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, n_sessions);
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, total_session_time);
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, total_active_time);
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, total_idle_in_xact_time);
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, n_sessions_abandoned);
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, n_sessions_fatal);
+    PGSTAT_ACCUM_DBCOUNT(sharedent, localent, n_sessions_killed);
 
     /*
-     * Report and reset accumulated xact commit/rollback and I/O timings
-     * whenever we send a normal tabstat message
+     * Accumulate xact commit/rollback and I/O timings into the stats entry
+     * of the current database.
      */
-    if (OidIsValid(tsmsg->m_databaseid))
+    if (OidIsValid(ent->key.databaseid))
     {
-        tsmsg->m_xact_commit = pgStatXactCommit;
-        tsmsg->m_xact_rollback = pgStatXactRollback;
-        tsmsg->m_block_read_time = pgStatBlockReadTime;
-        tsmsg->m_block_write_time = pgStatBlockWriteTime;
+        sharedent->counts.n_xact_commit += pgStatXactCommit;
+        sharedent->counts.n_xact_rollback += pgStatXactRollback;
+        sharedent->counts.n_block_read_time += pgStatBlockReadTime;
+        sharedent->counts.n_block_write_time += pgStatBlockWriteTime;
         pgStatXactCommit = 0;
         pgStatXactRollback = 0;
         pgStatBlockReadTime = 0;
@@ -1011,281 +1554,138 @@ pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg)
     }
     else
     {
-        tsmsg->m_xact_commit = 0;
-        tsmsg->m_xact_rollback = 0;
-        tsmsg->m_block_read_time = 0;
-        tsmsg->m_block_write_time = 0;
+        sharedent->counts.n_xact_commit = 0;
+        sharedent->counts.n_xact_rollback = 0;
+        sharedent->counts.n_block_read_time = 0;
+        sharedent->counts.n_block_write_time = 0;
     }
 
-    n = tsmsg->m_nentries;
-    len = offsetof(PgStat_MsgTabstat, m_entry[0]) +
-        n * sizeof(PgStat_TableEntry);
+    LWLockRelease(&sharedent->header.lock);
 
-    pgstat_setheader(&tsmsg->m_hdr, PGSTAT_MTYPE_TABSTAT);
-    pgstat_send(tsmsg, len);
+    return true;
 }
 
-/*
- * Subroutine for pgstat_report_stat: populate and send a function stat message
- */
-static void
-pgstat_send_funcstats(void)
-{
-    /* we assume this inits to all zeroes: */
-    static const PgStat_FunctionCounts all_zeroes;
-
-    PgStat_MsgFuncstat msg;
-    PgStat_BackendFunctionEntry *entry;
-    HASH_SEQ_STATUS fstat;
-
-    if (pgStatFunctions == NULL)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_FUNCSTAT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_nentries = 0;
-
-    hash_seq_init(&fstat, pgStatFunctions);
-    while ((entry = (PgStat_BackendFunctionEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        PgStat_FunctionEntry *m_ent;
-
-        /* Skip it if no counts accumulated since last time */
-        if (memcmp(&entry->f_counts, &all_zeroes,
-                   sizeof(PgStat_FunctionCounts)) == 0)
-            continue;
-
-        /* need to convert format of time accumulators */
-        m_ent = &msg.m_entry[msg.m_nentries];
-        m_ent->f_id = entry->f_id;
-        m_ent->f_numcalls = entry->f_counts.f_numcalls;
-        m_ent->f_total_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_total_time);
-        m_ent->f_self_time = INSTR_TIME_GET_MICROSEC(entry->f_counts.f_self_time);
-
-        if (++msg.m_nentries >= PGSTAT_NUM_FUNCENTRIES)
-        {
-            pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                        msg.m_nentries * sizeof(PgStat_FunctionEntry));
-            msg.m_nentries = 0;
-        }
-
-        /* reset the entry's counts */
-        MemSet(&entry->f_counts, 0, sizeof(PgStat_FunctionCounts));
-    }
-
-    if (msg.m_nentries > 0)
-        pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
-                    msg.m_nentries * sizeof(PgStat_FunctionEntry));
-
-    have_function_stats = false;
-}
-
-
 /* ----------
  * pgstat_vacuum_stat() -
  *
- *    Will tell the collector about objects he can get rid of.
+ *  Delete shared stat entries that are not in system catalogs.
+ *
+ *  To avoid holding exclusive lock on dshash for a long time, the process is
+ *  performed in three steps.
+ *
+ *   1: Collect existent oids of every kind of object.
+ *   2: Collect victim entries by scanning with shared lock.
+ *   3: Try removing every nominated entry without waiting for lock.
+ *
+ *  As a consequence of the last step, some entries may be left behind
+ *  because a lock could not be acquired, but they will be deleted by later
+ *  vacuums.
  * ----------
  */
 void
 pgstat_vacuum_stat(void)
 {
-    HTAB       *htab;
-    PgStat_MsgTabpurge msg;
-    PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Read pg_database and make a list of OIDs of all existing databases
-     */
-    htab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);
-
-    /*
-     * Search the database hash table for dead databases and tell the
-     * collector to drop them.
-     */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
-    {
-        Oid            dbid = dbentry->databaseid;
-
-        CHECK_FOR_INTERRUPTS();
-
-        /* the DB entry for shared tables (with InvalidOid) is never dropped */
-        if (OidIsValid(dbid) &&
-            hash_search(htab, (void *) &dbid, HASH_FIND, NULL) == NULL)
-            pgstat_drop_database(dbid);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Lookup our own database entry; if not found, nothing more to do.
-     */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
+    pgstat_oid_hash *dbids;        /* database ids */
+    pgstat_oid_hash *relids;    /* relation ids in the current database */
+    pgstat_oid_hash *funcids;    /* function ids in the current database */
+    int            nvictims = 0;    /* # of entries deleted */
+    dshash_seq_status dshstat;
+    PgStatHashEntry *ent;
+
+    /* we don't collect stats under standalone mode */
+    if (!IsUnderPostmaster)
         return;
 
-    /*
-     * Similarly to above, make a list of all known relations in this DB.
-     */
-    htab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);
+    /* collect oids of existent objects */
+    dbids = collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+    relids = collect_oids(RelationRelationId, Anum_pg_class_oid);
+    funcids = collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
 
-    /*
-     * Initialize our messages table counter to zero
-     */
-    msg.m_nentries = 0;
+    nvictims = 0;
 
-    /*
-     * Check for all tables listed in stats hashtable if they still exist.
-     */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+    /* some of the dshash entries are to be removed, take exclusive lock. */
+    dshash_seq_init(&dshstat, pgStatSharedHash, true);
+    while ((ent = dshash_seq_next(&dshstat)) != NULL)
     {
-        Oid            tabid = tabentry->tableid;
-
         CHECK_FOR_INTERRUPTS();
 
-        if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+        /*
+         * Don't drop non-database entries that do not belong to the
+         * current database.
+         */
+        if (ent->key.type != PGSTAT_TYPE_DB &&
+            ent->key.databaseid != MyDatabaseId)
             continue;
 
-        /*
-         * Not there, so add this table's Oid to the message
-         */
-        msg.m_tableid[msg.m_nentries++] = tabid;
-
-        /*
-         * If the message is full, send it out and reinitialize to empty
-         */
-        if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
-        {
-            len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-                + msg.m_nentries * sizeof(Oid);
-
-            pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-            msg.m_databaseid = MyDatabaseId;
-            pgstat_send(&msg, len);
-
-            msg.m_nentries = 0;
-        }
-    }
-
-    /*
-     * Send the rest
-     */
-    if (msg.m_nentries > 0)
-    {
-        len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-            + msg.m_nentries * sizeof(Oid);
-
-        pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-        msg.m_databaseid = MyDatabaseId;
-        pgstat_send(&msg, len);
-    }
-
-    /* Clean up */
-    hash_destroy(htab);
-
-    /*
-     * Now repeat the above steps for functions.  However, we needn't bother
-     * in the common case where no function stats are being collected.
-     */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
-    {
-        htab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
-
-        pgstat_setheader(&f_msg.m_hdr, PGSTAT_MTYPE_FUNCPURGE);
-        f_msg.m_databaseid = MyDatabaseId;
-        f_msg.m_nentries = 0;
-
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
+        switch (ent->key.type)
         {
-            Oid            funcid = funcentry->functionid;
-
-            CHECK_FOR_INTERRUPTS();
-
-            if (hash_search(htab, (void *) &funcid, HASH_FIND, NULL) != NULL)
+            case PGSTAT_TYPE_DB:
+                /*
+                 * don't remove the database entry for shared tables or
+                 * entries for existent databases
+                 */
+                if (ent->key.databaseid == 0 ||
+                    pgstat_oid_lookup(dbids, ent->key.databaseid) != NULL)
+                    continue;
+
+                break;
+
+            case PGSTAT_TYPE_TABLE:
+                /* don't remove existent relations */
+                if (pgstat_oid_lookup(relids, ent->key.objectid) != NULL)
+                    continue;
+
+                break;
+
+            case PGSTAT_TYPE_FUNCTION:
+                /* don't remove existent functions  */
+                if (pgstat_oid_lookup(funcids, ent->key.objectid) != NULL)
+                    continue;
+
+                break;
+
+            case PGSTAT_TYPE_REPLSLOT:
+                /*
+                 * We don't bother vacuuming this kind of entry because the
+                 * number of entries is quite small and entries are likely to
+                 * be reused soon.
+                 */
                 continue;
-
-            /*
-             * Not there, so add this function's Oid to the message
-             */
-            f_msg.m_functionid[f_msg.m_nentries++] = funcid;
-
-            /*
-             * If the message is full, send it out and reinitialize to empty
-             */
-            if (f_msg.m_nentries >= PGSTAT_NUM_FUNCPURGE)
-            {
-                len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                    + f_msg.m_nentries * sizeof(Oid);
-
-                pgstat_send(&f_msg, len);
-
-                f_msg.m_nentries = 0;
-            }
-        }
-
-        /*
-         * Send the rest
-         */
-        if (f_msg.m_nentries > 0)
-        {
-            len = offsetof(PgStat_MsgFuncpurge, m_functionid[0])
-                + f_msg.m_nentries * sizeof(Oid);
-
-            pgstat_send(&f_msg, len);
         }
 
-        hash_destroy(htab);
+        /* drop this entry */
+        delete_current_stats_entry(&dshstat);
+        nvictims++;
     }
+    dshash_seq_term(&dshstat);
+    pgstat_oid_destroy(dbids);
+    pgstat_oid_destroy(relids);
+    pgstat_oid_destroy(funcids);
+
+    if (nvictims > 0)
+        pg_atomic_add_fetch_u64(&StatsShmem->gc_count, 1);
 }
 
-
 /* ----------
- * pgstat_collect_oids() -
+ * collect_oids() -
  *
  *    Collect the OIDs of all objects listed in the specified system catalog
- *    into a temporary hash table.  Caller should hash_destroy the result
+ *    into a temporary hash table.  Caller should pgstat_oid_destroy the result
  *    when done with it.  (However, we make the table in CurrentMemoryContext
  *    so that it will be freed properly in event of an error.)
  * ----------
  */
-static HTAB *
-pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
+static pgstat_oid_hash *
+collect_oids(Oid catalogid, AttrNumber anum_oid)
 {
-    HTAB       *htab;
-    HASHCTL        hash_ctl;
+    pgstat_oid_hash *rethash;
     Relation    rel;
     TableScanDesc scan;
     HeapTuple    tup;
     Snapshot    snapshot;
 
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(Oid);
-    hash_ctl.hcxt = CurrentMemoryContext;
-    htab = hash_create("Temporary table of OIDs",
-                       PGSTAT_TAB_HASH_SIZE,
-                       &hash_ctl,
-                       HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    rethash = pgstat_oid_create(CurrentMemoryContext,
+                                PGSTAT_TABLE_HASH_SIZE, NULL);
 
     rel = table_open(catalogid, AccessShareLock);
     snapshot = RegisterSnapshot(GetLatestSnapshot());
@@ -1294,123 +1694,60 @@ pgstat_collect_oids(Oid catalogid, AttrNumber anum_oid)
     {
         Oid            thisoid;
         bool        isnull;
+        bool        found;
 
         thisoid = heap_getattr(tup, anum_oid, RelationGetDescr(rel), &isnull);
         Assert(!isnull);
 
         CHECK_FOR_INTERRUPTS();
 
-        (void) hash_search(htab, (void *) &thisoid, HASH_ENTER, NULL);
+        pgstat_oid_insert(rethash, thisoid, &found);
     }
     table_endscan(scan);
     UnregisterSnapshot(snapshot);
     table_close(rel, AccessShareLock);
 
-    return htab;
+    return rethash;
 }
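Annotation (not part of the patch): the vacuum loop and collect_oids() above follow a collect-then-sweep pattern. The OIDs of live catalog objects are gathered into a set, and any stats entry whose OID is missing from that set is dropped and counted as a victim. The sketch below shows only that pattern. It swaps the patch's pgstat_oid_hash for a sorted array searched with bsearch(), and every name in it (DemoOid, demo_sweep and so on) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef unsigned int DemoOid;   /* hypothetical stand-in for Oid */

static int
demo_oid_cmp(const void *a, const void *b)
{
    DemoOid     x = *(const DemoOid *) a;
    DemoOid     y = *(const DemoOid *) b;

    return (x > y) - (x < y);
}

/* Membership test against a sorted array of live OIDs. */
static bool
demo_oid_lookup(const DemoOid *live, size_t nlive, DemoOid oid)
{
    return bsearch(&oid, live, nlive, sizeof(DemoOid), demo_oid_cmp) != NULL;
}

/*
 * Compact "entries" in place, keeping only OIDs that are still live.
 * Returns the number of dropped entries (the "victims").
 */
static size_t
demo_sweep(DemoOid *entries, size_t *nentries, DemoOid *live, size_t nlive)
{
    size_t      kept = 0;
    size_t      nvictims;
    size_t      i;

    /* Sort once so that each lookup is a binary search. */
    qsort(live, nlive, sizeof(DemoOid), demo_oid_cmp);

    for (i = 0; i < *nentries; i++)
    {
        if (demo_oid_lookup(live, nlive, entries[i]))
            entries[kept++] = entries[i];   /* still exists, keep it */
    }

    assert(kept <= *nentries);
    nvictims = *nentries - kept;
    *nentries = kept;
    return nvictims;
}
```

In the patch, a nonzero victim count is what triggers the increment of StatsShmem->gc_count, so that readers know to garbage-collect their local hashes.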
 
-
 /* ----------
  * pgstat_drop_database() -
  *
- *    Tell the collector that we just dropped a database.
- *    (If the message gets lost, we will still clean the dead DB eventually
- *    via future invocations of pgstat_vacuum_stat().)
- * ----------
+ *    Remove the entries for the database that we just dropped.
+ *
+ *    Some entries may be left behind because of a lock failure, or because
+ *    stats are flushed after this call, but we will still clean up the dead
+ *    DB eventually via future invocations of pgstat_vacuum_stat().
+ * ----------
  */
 void
 pgstat_drop_database(Oid databaseid)
 {
-    PgStat_MsgDropdb msg;
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DROPDB);
-    msg.m_databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/* ----------
- * pgstat_drop_relation() -
- *
- *    Tell the collector that we just dropped a relation.
- *    (If the message gets lost, we will still clean the dead entry eventually
- *    via future invocations of pgstat_vacuum_stat().)
- *
- *    Currently not used for lack of any good place to call it; we rely
- *    entirely on pgstat_vacuum_stat() to clean out stats for dead rels.
- * ----------
- */
-#ifdef NOT_USED
-void
-pgstat_drop_relation(Oid relid)
-{
-    PgStat_MsgTabpurge msg;
-    int            len;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
-
-    msg.m_tableid[0] = relid;
-    msg.m_nentries = 1;
-
-    len = offsetof(PgStat_MsgTabpurge, m_tableid[0]) + sizeof(Oid);
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, len);
-}
-#endif                            /* NOT_USED */
-
-
-/* ----------
- * pgstat_send_connstats() -
- *
- *    Tell the collector about session statistics.
- *    The parameter "disconnect" will be true when the backend exits.
- *    "last_report" is the last time we were called (0 if never).
- * ----------
- */
-static void
-pgstat_send_connstats(bool disconnect, TimestampTz last_report)
-{
-    PgStat_MsgConn msg;
-    long        secs;
-    int            usecs;
+    Assert(OidIsValid(databaseid));
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CONNECTION);
-    msg.m_databaseid = MyDatabaseId;
-
-    /* session time since the last report */
-    TimestampDifference(((last_report == 0) ? MyStartTimestamp : last_report),
-                        GetCurrentTimestamp(),
-                        &secs, &usecs);
-    msg.m_session_time = secs * 1000000 + usecs;
-
-    msg.m_disconnect = disconnect ? pgStatSessionEndCause : DISCONNECT_NOT_YET;
-
-    msg.m_active_time = pgStatActiveTime;
-    pgStatActiveTime = 0;
-
-    msg.m_idle_in_xact_time = pgStatTransactionIdleTime;
-    pgStatTransactionIdleTime = 0;
-
-    /* report a new session only the first time */
-    msg.m_count = (last_report == 0) ? 1 : 0;
-
-    pgstat_send(&msg, sizeof(PgStat_MsgConn));
+    /* some of the dshash entries are to be removed, take exclusive lock. */
+    dshash_seq_init(&hstat, pgStatSharedHash, true);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+    {
+        if (p->key.databaseid == databaseid)
+            delete_current_stats_entry(&hstat);
+    }
+    dshash_seq_term(&hstat);
+
+    /* Let readers run a garbage collection of local hashes */
+    pg_atomic_add_fetch_u64(&StatsShmem->gc_count, 1);
 }
 
-
 /* ----------
  * pgstat_reset_counters() -
  *
- *    Tell the statistics collector to reset counters for our database.
+ *    Reset counters for our database.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1419,53 +1756,148 @@ pgstat_send_connstats(bool disconnect, TimestampTz last_report)
 void
 pgstat_reset_counters(void)
 {
-    PgStat_MsgResetcounter msg;
+    dshash_seq_status hstat;
+    PgStatHashEntry *p;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    /* dshash entry is not modified, take shared lock */
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+    {
+        PgStat_StatEntryHeader *header;
+
+        if (p->key.databaseid != MyDatabaseId)
+            continue;
+
+        header = dsa_get_address(area, p->body);
+
+        LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+        memset(PGSTAT_SHENT_BODY(header), 0,
+               PGSTAT_SHENT_BODY_LEN(p->key.type));
+
+        if (p->key.type == PGSTAT_TYPE_DB)
+        {
+            PgStat_StatDBEntry *dbstat = (PgStat_StatDBEntry *) header;
+
+            dbstat->stat_reset_timestamp = GetCurrentTimestamp();
+        }
+        LWLockRelease(&header->lock);
+    }
+    dshash_seq_term(&hstat);
+
+    /* Invalidate the simple cache keys */
+    cached_dbent_key = stathashkey_zero;
+    cached_tabent_key = stathashkey_zero;
+    cached_funcent_key = stathashkey_zero;
+}
+
+/*
+ * pgstat_copy_global_stats - helper function for functions
+ *           pgstat_fetch_stat_*() and pgstat_reset_shared_counters().
+ *
+ * Copies out the specified memory area following change-count protocol.
+ */
+static inline void
+pgstat_copy_global_stats(void *dst, void *src, size_t len,
+                         pg_atomic_uint32 *count)
+{
+    uint32        before_changecount;
+    uint32        after_changecount;
+
+    after_changecount = pg_atomic_read_u32(count);
+
+    do
+    {
+        before_changecount = after_changecount;
+        memcpy(dst, src, len);
+        after_changecount = pg_atomic_read_u32(count);
+    } while ((before_changecount & 1) == 1 ||
+             after_changecount != before_changecount);
 }
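Annotation (not part of the patch): pgstat_copy_global_stats() above is the reader half of a change-count protocol. A single writer makes the counter odd before modifying the struct and even again afterwards, and the reader retries its copy until it sees the same even value before and after. The sketch below is a minimal version of both halves using C11 <stdatomic.h> in place of PostgreSQL's pg_atomic_* API. All names are hypothetical. The plain memcpy() racing with a writer is formally a data race in the C11 memory model, so this only illustrates the retry logic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a fixed-size stats struct. */
typedef struct DemoStats
{
    uint64_t    archived_count;
    uint64_t    failed_count;
} DemoStats;

typedef struct DemoShared
{
    atomic_uint changecount;    /* odd while the single writer is mid-update */
    DemoStats   stats;
} DemoShared;

/* Writer side: bump to odd, modify, bump back to even. */
static void
demo_write(DemoShared *sh, uint64_t archived, uint64_t failed)
{
    unsigned int before = atomic_fetch_add(&sh->changecount, 1);

    assert((before & 1) == 0);  /* exactly one writer is assumed */
    sh->stats.archived_count = archived;
    sh->stats.failed_count = failed;
    atomic_fetch_add(&sh->changecount, 1);
}

/* Reader side: retry until the count is even and unchanged across the copy. */
static void
demo_read(DemoShared *sh, DemoStats *dst)
{
    unsigned int before;
    unsigned int after = atomic_load(&sh->changecount);

    do
    {
        before = after;
        memcpy(dst, &sh->stats, sizeof(DemoStats));
        after = atomic_load(&sh->changecount);
    } while ((before & 1) == 1 || after != before);
}
```

The same odd/even convention appears in pgstat_reset_slru_counter() later in this patch, where the Assert checks that the count was even before the update began.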
 
 /* ----------
  * pgstat_reset_shared_counters() -
  *
- *    Tell the statistics collector to reset cluster-wide shared counters.
+ *    Reset cluster-wide shared counters.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
+ *
+ *  We don't scribble on shared stats while resetting to avoid locking on
+ *  shared stats struct. Instead, just record the current counters in another
+ *  shared struct, which is protected by StatsLock. See
+ *  pgstat_fetch_stat_(archiver|bgwriter|checkpointer) for the reader side.
  * ----------
  */
 void
 pgstat_reset_shared_counters(const char *target)
 {
-    PgStat_MsgResetsharedcounter msg;
-
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    TimestampTz now = GetCurrentTimestamp();
+    PgStat_Shared_Reset_Target t;
 
     if (strcmp(target, "archiver") == 0)
-        msg.m_resettarget = RESET_ARCHIVER;
+        t = RESET_ARCHIVER;
     else if (strcmp(target, "bgwriter") == 0)
-        msg.m_resettarget = RESET_BGWRITER;
+        t = RESET_BGWRITER;
     else if (strcmp(target, "wal") == 0)
-        msg.m_resettarget = RESET_WAL;
+        t = RESET_WAL;
     else
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                  errmsg("unrecognized reset target: \"%s\"", target),
                  errhint("Target must be \"archiver\", \"bgwriter\" or \"wal\".")));
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSHAREDCOUNTER);
-    pgstat_send(&msg, sizeof(msg));
+    /* Reset statistics for the cluster. */
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+
+    switch (t)
+    {
+        case RESET_ARCHIVER:
+            pgstat_copy_global_stats(&StatsShmem->archiver_reset_offset,
+                                     &StatsShmem->archiver_stats,
+                                     sizeof(PgStat_Archiver),
+                                     &StatsShmem->archiver_changecount);
+            StatsShmem->archiver_reset_offset.stat_reset_timestamp = now;
+            cached_archiverstats_is_valid = false;
+            break;
+
+        case RESET_BGWRITER:
+            pgstat_copy_global_stats(&StatsShmem->bgwriter_reset_offset,
+                                     &StatsShmem->bgwriter_stats,
+                                     sizeof(PgStat_BgWriter),
+                                     &StatsShmem->bgwriter_changecount);
+            pgstat_copy_global_stats(&StatsShmem->checkpointer_reset_offset,
+                                     &StatsShmem->checkpointer_stats,
+                                     sizeof(PgStat_CheckPointer),
+                                     &StatsShmem->checkpointer_changecount);
+            StatsShmem->bgwriter_reset_offset.stat_reset_timestamp = now;
+            cached_bgwriterstats_is_valid = false;
+            cached_checkpointerstats_is_valid = false;
+            break;
+
+        case RESET_WAL:
+
+            /*
+             * Unlike the two above, WAL statistics have many writer
+             * processes and are protected by wal_stats_lock.
+             */
+            LWLockAcquire(&StatsShmem->wal_stats_lock, LW_EXCLUSIVE);
+            MemSet(&StatsShmem->wal_stats, 0, sizeof(PgStat_Wal));
+            StatsShmem->wal_stats.stat_reset_timestamp = now;
+            LWLockRelease(&StatsShmem->wal_stats_lock);
+            cached_walstats_is_valid = false;
+            break;
+    }
+
+    LWLockRelease(StatsLock);
 }
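Annotation (not part of the patch): for the archiver, bgwriter and checkpointer, pgstat_reset_shared_counters() above does not zero the live counters. It copies their current values into a *_reset_offset struct instead. The reader side is not shown in this hunk. The sketch below assumes that readers report the live value minus the recorded offset, and all names in it are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct DemoCounters
{
    uint64_t    buf_written;
    uint64_t    buf_alloc;
} DemoCounters;

typedef struct DemoShmem
{
    DemoCounters stats;         /* only ever incremented by the writer */
    DemoCounters reset_offset;  /* value of stats at the last reset */
} DemoShmem;

/* "Reset" without touching the live counters: remember where they stood. */
static void
demo_reset(DemoShmem *sh)
{
    sh->reset_offset = sh->stats;
}

/* Reader (assumed behavior): live counters relative to the last reset. */
static DemoCounters
demo_fetch(const DemoShmem *sh)
{
    DemoCounters r;

    /* Counters are monotonic, so the offset can never exceed the value. */
    assert(sh->stats.buf_written >= sh->reset_offset.buf_written);
    assert(sh->stats.buf_alloc >= sh->reset_offset.buf_alloc);

    r.buf_written = sh->stats.buf_written - sh->reset_offset.buf_written;
    r.buf_alloc = sh->stats.buf_alloc - sh->reset_offset.buf_alloc;
    return r;
}
```

The point of this design, as the function's header comment says, is that the process resetting the stats never writes to the struct that the single writer process updates, so that struct needs no lock.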
 
 /* ----------
  * pgstat_reset_single_counter() -
  *
- *    Tell the statistics collector to reset a single counter.
+ *    Reset a single counter.
  *
  *    Permission checking for this function is managed through the normal
  *    GRANT system.
@@ -1474,17 +1906,38 @@ pgstat_reset_shared_counters(const char *target)
 void
 pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 {
-    PgStat_MsgResetsinglecounter msg;
+    PgStat_StatEntryHeader *header;
+    PgStat_StatDBEntry *dbentry;
+    PgStatTypes stattype = PGSTAT_TYPE_TABLE;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    dbentry = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, MyDatabaseId, InvalidOid, false, false,
+                       NULL);
+    Assert(dbentry);
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSINGLECOUNTER);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_resettype = type;
-    msg.m_objectid = objoid;
+    /* Set the reset timestamp for the whole database */
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->header.lock, LW_EXCLUSIVE);
+    dbentry->stat_reset_timestamp = ts;
+    LWLockRelease(&dbentry->header.lock);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* Reset the object's counters (XXX: a missing entry is not handled) */
+    switch (type)
+    {
+        case RESET_TABLE:
+            stattype = PGSTAT_TYPE_TABLE;
+            break;
+        case RESET_FUNCTION:
+            stattype = PGSTAT_TYPE_FUNCTION;
+            break;
+    }
+
+    header = get_stat_entry(stattype, MyDatabaseId, objoid, false, false, NULL);
+
+    LWLockAcquire(&header->lock, LW_EXCLUSIVE);
+    memset(PGSTAT_SHENT_BODY(header), 0, PGSTAT_SHENT_BODY_LEN(stattype));
+    LWLockRelease(&header->lock);
 }
 
 /* ----------
@@ -1500,15 +1953,40 @@ pgstat_reset_single_counter(Oid objoid, PgStat_Single_Reset_Type type)
 void
 pgstat_reset_slru_counter(const char *name)
 {
-    PgStat_MsgResetslrucounter msg;
+    int            i;
+    TimestampTz ts = GetCurrentTimestamp();
+    uint32        assert_changecount PG_USED_FOR_ASSERTS_ONLY;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
+    if (name)
+    {
+        i = pgstat_slru_index(name);
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+        assert_changecount =
+            pg_atomic_fetch_add_u32(&StatsShmem->slru_changecount, 1);
+        Assert((assert_changecount & 1) == 0);
+        MemSet(&StatsShmem->slru_stats.entry[i], 0,
+               sizeof(PgStat_SLRUStats));
+        StatsShmem->slru_stats.entry[i].stat_reset_timestamp = ts;
+        pg_atomic_add_fetch_u32(&StatsShmem->slru_changecount, 1);
+        LWLockRelease(&StatsShmem->slru_stats.lock);
+    }
+    else
+    {
+        LWLockAcquire(&StatsShmem->slru_stats.lock, LW_EXCLUSIVE);
+        assert_changecount =
+            pg_atomic_fetch_add_u32(&StatsShmem->slru_changecount, 1);
+        Assert((assert_changecount & 1) == 0);
+        for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
+        {
+            MemSet(&StatsShmem->slru_stats.entry[i], 0,
+                   sizeof(PgStat_SLRUStats));
+            StatsShmem->slru_stats.entry[i].stat_reset_timestamp = ts;
+        }
+        pg_atomic_add_fetch_u32(&StatsShmem->slru_changecount, 1);
+        LWLockRelease(&StatsShmem->slru_stats.lock);
+    }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETSLRUCOUNTER);
-    msg.m_index = (name) ? pgstat_slru_index(name) : -1;
-
-    pgstat_send(&msg, sizeof(msg));
+    cached_slrustats_is_valid = false;
 }
 
 /* ----------
@@ -1524,20 +2002,19 @@ pgstat_reset_slru_counter(const char *name)
 void
 pgstat_reset_replslot_counter(const char *name)
 {
-    PgStat_MsgResetreplslotcounter msg;
+    int            startidx;
+    int            endidx;
+    int            i;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    if (!IsUnderPostmaster || !pgStatSharedHash)
         return;
 
     if (name)
     {
         ReplicationSlot *slot;
 
-        /*
-         * Check if the slot exits with the given name. It is possible that by
-         * the time this message is executed the slot is dropped but at least
-         * this check will ensure that the given name is for a valid slot.
-         */
+        /* Check if a slot exists with the given name. */
         LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
         slot = SearchNamedReplicationSlot(name);
         LWLockRelease(ReplicationSlotControlLock);
@@ -1555,15 +2032,36 @@ pgstat_reset_replslot_counter(const char *name)
         if (SlotIsPhysical(slot))
             return;
 
-        strlcpy(msg.m_slotname, name, NAMEDATALEN);
-        msg.clearall = false;
+        /* reset this one entry */
+        startidx = endidx = slot - ReplicationSlotCtl->replication_slots;
     }
     else
-        msg.clearall = true;
+    {
+        /* reset all existing entries */
+        startidx = 0;
+        endidx = max_replication_slots - 1;
+    }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETREPLSLOTCOUNTER);
+    ts = GetCurrentTimestamp();
+    for (i = startidx; i <= endidx; i++)
+    {
+        PgStat_ReplSlot *shent;
 
-    pgstat_send(&msg, sizeof(msg));
+        shent = (PgStat_ReplSlot *)
+            get_stat_entry(PGSTAT_TYPE_REPLSLOT,
+                           MyDatabaseId, i, false, false, NULL);
+
+        /* Skip non-existent entries */
+        if (!shent)
+            continue;
+
+        LWLockAcquire(&shent->header.lock, LW_EXCLUSIVE);
+        memset(&shent->spill_txns, 0,
+               offsetof(PgStat_ReplSlot, stat_reset_timestamp) -
+               offsetof(PgStat_ReplSlot, spill_txns));
+        shent->stat_reset_timestamp = ts;
+        LWLockRelease(&shent->header.lock);
+    }
 }
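Annotation (not part of the patch): the memset() in pgstat_reset_replslot_counter() above clears a contiguous run of counter fields by computing its length from two offsetof() values. This only works while the counters stay adjacent in the struct and the reset timestamp comes right after them. The sketch below shows the idiom on a hypothetical struct whose field layout is an assumption, not the real PgStat_ReplSlot definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: a name, a run of counters, then the reset time. */
typedef struct DemoSlot
{
    char        slotname[64];
    int64_t     spill_txns;             /* first counter */
    int64_t     spill_count;
    int64_t     spill_bytes;
    int64_t     stat_reset_timestamp;   /* first field after the counters */
} DemoSlot;

/* Zero every counter but leave the slot name intact. */
static void
demo_reset_slot(DemoSlot *slot, int64_t now)
{
    size_t      start = offsetof(DemoSlot, spill_txns);
    size_t      end = offsetof(DemoSlot, stat_reset_timestamp);

    assert(end > start);        /* counters must precede the timestamp */
    memset((char *) slot + start, 0, end - start);
    slot->stat_reset_timestamp = now;
}
```

Note that pgstat_report_replslot() later in the patch computes the length as sizeof(PgStat_ReplSlot) minus the offset of spill_txns instead, which also clears everything after the counters, including the reset timestamp.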
 
 /* ----------
@@ -1577,48 +2075,94 @@ pgstat_reset_replslot_counter(const char *name)
 void
 pgstat_report_autovac(Oid dboid)
 {
-    PgStat_MsgAutovacStart msg;
+    PgStat_StatDBEntry *dbentry;
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    /* return if activity stats is not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-    msg.m_databaseid = dboid;
-    msg.m_start_time = GetCurrentTimestamp();
+    /*
+     * End-of-vacuum is reported instantly. Report the start the same way for
+     * consistency. Vacuum doesn't run frequently and is a long-lasting
+     * operation so it doesn't matter if we get blocked here a little.
+     */
+    dbentry = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, dboid, InvalidOid, false, true, NULL);
 
-    pgstat_send(&msg, sizeof(msg));
+    ts = GetCurrentTimestamp();
+    LWLockAcquire(&dbentry->header.lock, LW_EXCLUSIVE);
+    dbentry->last_autovac_time = ts;
+    LWLockRelease(&dbentry->header.lock);
 }
 
 
 /* ---------
  * pgstat_report_vacuum() -
  *
- *    Tell the collector about the table we just vacuumed.
+ *    Report about the table we just vacuumed.
  * ---------
  */
 void
 pgstat_report_vacuum(Oid tableoid, bool shared,
                      PgStat_Counter livetuples, PgStat_Counter deadtuples)
 {
-    PgStat_MsgVacuum msg;
+    PgStat_StatTabEntry *tabentry;
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+    TimestampTz ts;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_VACUUM);
-    msg.m_databaseid = shared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = tableoid;
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_vacuumtime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /* Store the data in the table's hash table entry. */
+    ts = GetCurrentTimestamp();
+
+    /*
+     * Unlike ordinary operations, maintenance commands take a long time,
+     * so getting blocked at the end of the work doesn't matter.
+     * Furthermore, updating directly keeps the stats updates made by
+     * transactions that end after this vacuum from being canceled by a
+     * delayed vacuum report. Update the shared stats entry directly for
+     * these reasons.
+     */
+    tabentry = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, tableoid, false, true, NULL);
+
+    LWLockAcquire(&tabentry->header.lock, LW_EXCLUSIVE);
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * It is quite possible that a non-aggressive VACUUM ended up skipping
+     * various pages, however, we'll zero the insert counter here regardless.
+     * It's currently used only to track when we need to perform an "insert"
+     * autovacuum, which are mainly intended to freeze newly inserted tuples.
+     * Zeroing this may just mean we'll not try to vacuum the table again
+     * until enough tuples have been inserted to trigger another insert
+     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
+     * stragglers.
+     */
+    tabentry->inserts_since_vacuum = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_vacuum_timestamp = ts;
+        tabentry->autovac_vacuum_count++;
+    }
+    else
+    {
+        tabentry->vacuum_timestamp = ts;
+        tabentry->vacuum_count++;
+    }
+
+    LWLockRelease(&tabentry->header.lock);
 }
 
 /* --------
  * pgstat_report_analyze() -
  *
- *    Tell the collector about the table we just analyzed.
+ *    Report about the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1629,9 +2173,11 @@ pgstat_report_analyze(Relation rel,
                       PgStat_Counter livetuples, PgStat_Counter deadtuples,
                       bool resetcounter)
 {
-    PgStat_MsgAnalyze msg;
+    PgStat_StatTabEntry *tabentry;
+    Oid            dboid = (rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId);
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
     /*
@@ -1639,10 +2185,10 @@ pgstat_report_analyze(Relation rel,
      * already inserted and/or deleted rows in the target table. ANALYZE will
      * have counted such rows as live or dead respectively. Because we will
      * report our counts of such rows at transaction end, we should subtract
-     * off these counts from what we send to the collector now, else they'll
-     * be double-counted after commit.  (This approach also ensures that the
-     * collector ends up with the right numbers if we abort instead of
-     * committing.)
+     * off these counts from what we write to shared stats now, else
+     * they'll be double-counted after commit.  (This approach also ensures
+     * that the shared stats end up with the right numbers if we abort
+     * instead of committing.)
      */
     if (rel->pgstat_info != NULL)
     {
@@ -1660,137 +2206,224 @@ pgstat_report_analyze(Relation rel,
         deadtuples = Max(deadtuples, 0);
     }
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ANALYZE);
-    msg.m_databaseid = rel->rd_rel->relisshared ? InvalidOid : MyDatabaseId;
-    msg.m_tableoid = RelationGetRelid(rel);
-    msg.m_autovacuum = IsAutoVacuumWorkerProcess();
-    msg.m_resetcounter = resetcounter;
-    msg.m_analyzetime = GetCurrentTimestamp();
-    msg.m_live_tuples = livetuples;
-    msg.m_dead_tuples = deadtuples;
-    pgstat_send(&msg, sizeof(msg));
+    /*
+     * Unlike ordinary operations, maintenance commands take a long time,
+     * so getting blocked at the end of the work doesn't matter.
+     * Furthermore, updating directly keeps the stats updates made by
+     * transactions that end after this analyze from being canceled by a
+     * delayed analyze report. Update the shared stats entry directly for
+     * these reasons.
+     */
+    tabentry = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, RelationGetRelid(rel),
+                       false, true, NULL);
+
+    LWLockAcquire(&tabentry->header.lock, LW_EXCLUSIVE);
+    tabentry->n_live_tuples = livetuples;
+    tabentry->n_dead_tuples = deadtuples;
+
+    /*
+     * If commanded, reset changes_since_analyze to zero.  This forgets any
+     * changes that were committed while the ANALYZE was in progress, but we
+     * have no good way to estimate how many of those there were.
+     */
+    if (resetcounter)
+        tabentry->changes_since_analyze = 0;
+
+    if (IsAutoVacuumWorkerProcess())
+    {
+        tabentry->autovac_analyze_timestamp = GetCurrentTimestamp();
+        tabentry->autovac_analyze_count++;
+    }
+    else
+    {
+        tabentry->analyze_timestamp = GetCurrentTimestamp();
+        tabentry->analyze_count++;
+    }
+    LWLockRelease(&tabentry->header.lock);
 }
 
 /* --------
  * pgstat_report_recovery_conflict() -
  *
- *    Tell the collector about a Hot Standby recovery conflict.
+ *    Report a Hot Standby recovery conflict.
  * --------
  */
 void
 pgstat_report_recovery_conflict(int reason)
 {
-    PgStat_MsgRecoveryConflict msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RECOVERYCONFLICT);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_reason = reason;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+
+    switch (reason)
+    {
+        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
+
+            /*
+             * Since we drop the information about the database as soon as it
+             * replicates, there is no point in counting these conflicts.
+             */
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
+            dbent->counts.n_conflict_tablespace++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_LOCK:
+            dbent->counts.n_conflict_lock++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
+            dbent->counts.n_conflict_snapshot++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
+            dbent->counts.n_conflict_bufferpin++;
+            break;
+        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
+            dbent->counts.n_conflict_startup_deadlock++;
+            break;
+    }
 }
 
+
 /* --------
  * pgstat_report_deadlock() -
  *
- *    Tell the collector about a deadlock detected.
+ *    Report a deadlock detected.
  * --------
  */
 void
 pgstat_report_deadlock(void)
 {
-    PgStat_MsgDeadlock msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DEADLOCK);
-    msg.m_databaseid = MyDatabaseId;
-    pgstat_send(&msg, sizeof(msg));
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_deadlocks++;
 }
 
-
-
 /* --------
- * pgstat_report_checksum_failures_in_db() -
+ * pgstat_report_checksum_failures_in_db(dboid, failure_count) -
  *
- *    Tell the collector about one or more checksum failures.
+ *    Report one or more checksum failures.
  * --------
  */
 void
 pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
 {
-    PgStat_MsgChecksumFailure msg;
+    PgStat_StatDBEntry *dbentry;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not active */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_CHECKSUMFAILURE);
-    msg.m_databaseid = dboid;
-    msg.m_failurecount = failurecount;
-    msg.m_failure_time = GetCurrentTimestamp();
+    dbentry = get_local_dbstat_entry(dboid);
 
-    pgstat_send(&msg, sizeof(msg));
+    /* add the reported count to the accumulated counter */
+    dbentry->counts.n_checksum_failures += failurecount;
 }
 
 /* --------
  * pgstat_report_checksum_failure() -
  *
- *    Tell the collector about a checksum failure.
+ *    Report a checksum failure.
  * --------
  */
 void
 pgstat_report_checksum_failure(void)
 {
-    pgstat_report_checksum_failures_in_db(MyDatabaseId, 1);
+    PgStat_StatDBEntry *dbent;
+
+    /* return if we are not collecting stats */
+    if (!area)
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_checksum_failures++;
 }
 
 /* --------
  * pgstat_report_tempfile() -
  *
- *    Tell the collector about a temporary file.
+ *    Report a temporary file.
  * --------
  */
 void
 pgstat_report_tempfile(size_t filesize)
 {
-    PgStat_MsgTempFile msg;
+    PgStat_StatDBEntry *dbent;
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TEMPFILE);
-    msg.m_databaseid = MyDatabaseId;
-    msg.m_filesize = filesize;
-    pgstat_send(&msg, sizeof(msg));
+    if (filesize == 0)            /* Is there a case where filesize is really 0? */
+        return;
+
+    dbent = get_local_dbstat_entry(MyDatabaseId);
+    dbent->counts.n_temp_bytes += filesize; /* XXX: needs overflow check */
+    dbent->counts.n_temp_files++;
 }
 
 /* ----------
  * pgstat_report_replslot() -
  *
- *    Tell the collector about replication slot statistics.
+ *    Report replication slot activity.
  * ----------
  */
 void
-pgstat_report_replslot(const char *slotname, int spilltxns, int spillcount,
-                       int spillbytes, int streamtxns, int streamcount, int streambytes)
+pgstat_report_replslot(const char *slotname,
+                       int spilltxns, int spillcount, int spillbytes,
+                       int streamtxns, int streamcount, int streambytes)
 {
-    PgStat_MsgReplSlot msg;
+    PgStat_ReplSlot *shent;
+    int            i;
+    bool        found;
+
+    if (!area)
+        return;
+
+    for (i = 0; i < max_replication_slots; i++)
+    {
+        if (strcmp(NameStr(ReplicationSlotCtl->replication_slots[i].data.name),
+                   slotname) == 0)
+            break;
+
+    }
 
     /*
-     * Prepare and send the message
+     * The slot may already have been removed; just ignore it.  An entry for
+     * a slot with this name will be created the next time it is reported.
      */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_REPLSLOT);
-    strlcpy(msg.m_slotname, slotname, NAMEDATALEN);
-    msg.m_drop = false;
-    msg.m_spill_txns = spilltxns;
-    msg.m_spill_count = spillcount;
-    msg.m_spill_bytes = spillbytes;
-    msg.m_stream_txns = streamtxns;
-    msg.m_stream_count = streamcount;
-    msg.m_stream_bytes = streambytes;
-    pgstat_send(&msg, sizeof(PgStat_MsgReplSlot));
+    if (i == max_replication_slots)
+        return;
+
+    shent = (PgStat_ReplSlot *)
+        get_stat_entry(PGSTAT_TYPE_REPLSLOT,
+                       MyDatabaseId, i, false, true, &found);
+
+    /* Initialize the counters and the name if the entry is new or unused */
+    LWLockAcquire(&shent->header.lock, LW_EXCLUSIVE);
+    if (shent->slotname[0] == '\0' || !found)
+    {
+        Assert(!shent->header.dropped);
+        memset(&shent->spill_txns, 0,
+               sizeof(PgStat_ReplSlot) - offsetof(PgStat_ReplSlot, spill_txns));
+        strlcpy(shent->slotname, slotname, NAMEDATALEN);
+    }
+
+    shent->spill_txns += spilltxns;
+    shent->spill_count += spillcount;
+    shent->spill_bytes += spillbytes;
+    shent->stream_txns += streamtxns;
+    shent->stream_count += streamcount;
+    shent->stream_bytes += streambytes;
+    LWLockRelease(&shent->header.lock);
 }
 
 /* ----------
@@ -1802,55 +2435,47 @@ pgstat_report_replslot(const char *slotname, int spilltxns, int spillcount,
 void
 pgstat_report_replslot_drop(const char *slotname)
 {
-    PgStat_MsgReplSlot msg;
+    int            i;
+    PgStat_ReplSlot *shent;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_REPLSLOT);
-    strlcpy(msg.m_slotname, slotname, NAMEDATALEN);
-    msg.m_drop = true;
-    pgstat_send(&msg, sizeof(PgStat_MsgReplSlot));
-}
+    Assert(area);
+    if (!area)
+        return;
 
-/* ----------
- * pgstat_ping() -
- *
- *    Send some junk data to the collector to increase traffic.
- * ----------
- */
-void
-pgstat_ping(void)
-{
-    PgStat_MsgDummy msg;
+    for (i = 0; i < max_replication_slots; i++)
+    {
+        if (strcmp(NameStr(ReplicationSlotCtl->replication_slots[i].data.name),
+                   slotname) == 0)
+            break;
 
-    if (pgStatSock == PGINVALID_SOCKET)
+    }
+
+    /* XXX: the slot may already have been removed; just ignore it. */
+    if (i == max_replication_slots)
         return;
 
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_DUMMY);
-    pgstat_send(&msg, sizeof(msg));
+    shent = (PgStat_ReplSlot *)
+        get_stat_entry(PGSTAT_TYPE_REPLSLOT,
+                       MyDatabaseId, i, false, false, NULL);
+
+    /*
+     * Mark this entry as "not used". We don't "drop" this entry because other processes may still be looking at it.
+     */
+    if (shent && shent->slotname[0] != '\0')
+    {
+        LWLockAcquire(&shent->header.lock, LW_EXCLUSIVE);
+        shent->slotname[0] = '\0';
+        LWLockRelease(&shent->header.lock);
+    }
 }
 
 /* ----------
- * pgstat_send_inquiry() -
+ * pgstat_init_function_usage() -
  *
- *    Notify collector that we need fresh data.
+ *  Initialize function call usage data.
+ *  Called by the executor before invoking a function.
  * ----------
  */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
-
-/*
- * Initialize function call usage data.
- * Called by the executor before invoking a function.
- */
 void
 pgstat_init_function_usage(FunctionCallInfo fcinfo,
                            PgStat_FunctionCallUsage *fcu)
@@ -1865,24 +2490,9 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
         return;
     }
 
-    if (!pgStatFunctions)
-    {
-        /* First time through - initialize function stat table */
-        HASHCTL        hash_ctl;
-
-        hash_ctl.keysize = sizeof(Oid);
-        hash_ctl.entrysize = sizeof(PgStat_BackendFunctionEntry);
-        pgStatFunctions = hash_create("Function stat entries",
-                                      PGSTAT_FUNCTION_HASH_SIZE,
-                                      &hash_ctl,
-                                      HASH_ELEM | HASH_BLOBS);
-    }
-
-    /* Get the stats entry for this function, create if necessary */
-    htabent = hash_search(pgStatFunctions, &fcinfo->flinfo->fn_oid,
-                          HASH_ENTER, &found);
-    if (!found)
-        MemSet(&htabent->f_counts, 0, sizeof(PgStat_FunctionCounts));
+    htabent = (PgStat_BackendFunctionEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             fcinfo->flinfo->fn_oid, true, &found);
 
     fcu->fs = &htabent->f_counts;
 
@@ -1896,31 +2506,37 @@ pgstat_init_function_usage(FunctionCallInfo fcinfo,
     INSTR_TIME_SET_CURRENT(fcu->f_start);
 }
 
-/*
- * find_funcstat_entry - find any existing PgStat_BackendFunctionEntry entry
- *        for specified function
+/* ----------
+ * find_funcstat_entry() -
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_BackendFunctionEntry entry for specified function.
+ *
+ *  If there is no entry, return NULL without creating a new one.
+ * ----------
  */
 PgStat_BackendFunctionEntry *
 find_funcstat_entry(Oid func_id)
 {
-    if (pgStatFunctions == NULL)
-        return NULL;
+    PgStat_BackendFunctionEntry *ent;
 
-    return (PgStat_BackendFunctionEntry *) hash_search(pgStatFunctions,
-                                                       (void *) &func_id,
-                                                       HASH_FIND, NULL);
+    ent = (PgStat_BackendFunctionEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId,
+                             func_id, false, NULL);
+
+    return ent;
 }
 
-/*
- * Calculate function call usage and update stat counters.
- * Called by the executor after invoking a function.
+/* ----------
+ * pgstat_end_function_usage() -
  *
- * In the case of a set-returning function that runs in value-per-call mode,
- * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
- * calls for what the user considers a single call of the function.  The
- * finalize flag should be TRUE on the last call.
+ *  Calculate function call usage and update stat counters.
+ *  Called by the executor after invoking a function.
+ *
+ *  In the case of a set-returning function that runs in value-per-call mode,
+ *  we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
+ *  calls for what the user considers a single call of the function.  The
+ *  finalize flag should be TRUE on the last call.
+ * ----------
  */
 void
 pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
@@ -1961,9 +2577,6 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
         fs->f_numcalls++;
     fs->f_total_time = f_total;
     INSTR_TIME_ADD(fs->f_self_time, f_self);
-
-    /* indicate that we have something to send */
-    have_function_stats = true;
 }
 
 
@@ -1975,8 +2588,7 @@ pgstat_end_function_usage(PgStat_FunctionCallUsage *fcu, bool finalize)
  *
  *    We assume that a relcache entry's pgstat_info field is zeroed by
  *    relcache.c when the relcache entry is made; thereafter it is long-lived
- *    data.  We can avoid repeated searches of the TabStatus arrays when the
- *    same relation is touched repeatedly within a transaction.
+ *    data.
  * ----------
  */
 void
@@ -1992,7 +2604,8 @@ pgstat_initstats(Relation rel)
         return;
     }
 
-    if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)
+    /* return if we are not collecting stats */
+    if (!area)
     {
         /* We're not counting at all */
         rel->pgstat_info = NULL;
@@ -2003,120 +2616,60 @@ pgstat_initstats(Relation rel)
      * If we already set up this relation in the current transaction, nothing
      * to do.
      */
-    if (rel->pgstat_info != NULL &&
-        rel->pgstat_info->t_id == rel_id)
+    if (rel->pgstat_info != NULL)
         return;
 
     /* Else find or make the PgStat_TableStatus entry, and update link */
-    rel->pgstat_info = get_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+    rel->pgstat_info = get_local_tabstat_entry(rel_id, rel->rd_rel->relisshared);
+    /* mark this relation as the owner */
+
+    /* don't allow linking a stats entry to multiple relcache entries */
+    Assert(rel->pgstat_info->relation == NULL);
+    rel->pgstat_info->relation = rel;
 }
 
 /*
- * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
+ * pgstat_delinkstats() -
+ *
+ *  Break the mutual link between a relcache entry and a local stats entry.
+ *  This must always be called when one end of the link is removed.
  */
-static PgStat_TableStatus *
-get_tabstat_entry(Oid rel_id, bool isshared)
+void
+pgstat_delinkstats(Relation rel)
 {
-    TabStatHashEntry *hash_entry;
-    PgStat_TableStatus *entry;
-    TabStatusArray *tsa;
-    bool        found;
-
-    /*
-     * Create hash table if we don't have it already.
-     */
-    if (pgStatTabHash == NULL)
+    /* remove the link to stats info if any */
+    if (rel && rel->pgstat_info)
     {
-        HASHCTL        ctl;
-
-        ctl.keysize = sizeof(Oid);
-        ctl.entrysize = sizeof(TabStatHashEntry);
-
-        pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
-                                    TABSTAT_QUANTUM,
-                                    &ctl,
-                                    HASH_ELEM | HASH_BLOBS);
+        /* link sanity check */
+        Assert(rel->pgstat_info->relation == rel);
+        rel->pgstat_info->relation = NULL;
+        rel->pgstat_info = NULL;
     }
-
-    /*
-     * Find an entry or create a new one.
-     */
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
-    if (!found)
-    {
-        /* initialize new entry with null pointer */
-        hash_entry->tsa_entry = NULL;
-    }
-
-    /*
-     * If entry is already valid, we're done.
-     */
-    if (hash_entry->tsa_entry)
-        return hash_entry->tsa_entry;
-
-    /*
-     * Locate the first pgStatTabList entry with free space, making a new list
-     * entry if needed.  Note that we could get an OOM failure here, but if so
-     * we have left the hashtable and the list in a consistent state.
-     */
-    if (pgStatTabList == NULL)
-    {
-        /* Set up first pgStatTabList entry */
-        pgStatTabList = (TabStatusArray *)
-            MemoryContextAllocZero(TopMemoryContext,
-                                   sizeof(TabStatusArray));
-    }
-
-    tsa = pgStatTabList;
-    while (tsa->tsa_used >= TABSTAT_QUANTUM)
-    {
-        if (tsa->tsa_next == NULL)
-            tsa->tsa_next = (TabStatusArray *)
-                MemoryContextAllocZero(TopMemoryContext,
-                                       sizeof(TabStatusArray));
-        tsa = tsa->tsa_next;
-    }
-
-    /*
-     * Allocate a PgStat_TableStatus entry within this list entry.  We assume
-     * the entry was already zeroed, either at creation or after last use.
-     */
-    entry = &tsa->tsa_entries[tsa->tsa_used++];
-    entry->t_id = rel_id;
-    entry->t_shared = isshared;
-
-    /*
-     * Now we can fill the entry in pgStatTabHash.
-     */
-    hash_entry->tsa_entry = entry;
-
-    return entry;
 }
 
 /*
  * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
  *
- * If no entry, return NULL, don't create a new one
+ *  Find any existing PgStat_TableStatus entry for rel_id in the current
+ *  database. If not found, try finding from shared tables.
  *
- * Note: if we got an error in the most recent execution of pgstat_report_stat,
- * it's possible that an entry exists but there's no hashtable entry for it.
- * That's okay, we'll treat this case as "doesn't exist".
+ *  If no entry found, return NULL, don't create a new one
+ * ----------
  */
 PgStat_TableStatus *
 find_tabstat_entry(Oid rel_id)
 {
-    TabStatHashEntry *hash_entry;
+    PgStat_TableStatus *ent;
 
-    /* If hashtable doesn't exist, there are no entries at all */
-    if (!pgStatTabHash)
-        return NULL;
+    ent = (PgStat_TableStatus *)
+        get_local_stat_entry(PGSTAT_TYPE_TABLE, MyDatabaseId, rel_id,
+                             false, NULL);
+    if (!ent)
+        ent = (PgStat_TableStatus *)
+            get_local_stat_entry(PGSTAT_TYPE_TABLE, InvalidOid, rel_id,
+                                 false, NULL);
 
-    hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_FIND, NULL);
-    if (!hash_entry)
-        return NULL;
-
-    /* Note that this step could also return NULL, but that's correct */
-    return hash_entry->tsa_entry;
+    return ent;
 }
 
 /*
@@ -2517,7 +3070,7 @@ AtPrepare_PgStat(void)
         Assert(xact_state->prev == NULL);
         for (trans = xact_state->first; trans != NULL; trans = trans->next)
         {
-            PgStat_TableStatus *tabstat;
+            PgStat_TableStatus *tabstat PG_USED_FOR_ASSERTS_ONLY;
             TwoPhasePgStatRecord record;
 
             Assert(trans->nest_level == 1);
@@ -2531,8 +3084,6 @@ AtPrepare_PgStat(void)
             record.inserted_pre_trunc = trans->inserted_pre_trunc;
             record.updated_pre_trunc = trans->updated_pre_trunc;
             record.deleted_pre_trunc = trans->deleted_pre_trunc;
-            record.t_id = tabstat->t_id;
-            record.t_shared = tabstat->t_shared;
             record.t_truncated = trans->truncated;
 
             RegisterTwoPhaseRecord(TWOPHASE_RM_PGSTAT_ID, 0,
@@ -2547,8 +3098,8 @@ AtPrepare_PgStat(void)
  *
  * All we need do here is unlink the transaction stats state from the
  * nontransactional state.  The nontransactional action counts will be
- * reported to the stats collector immediately, while the effects on live
- * and dead tuple counts are preserved in the 2PC state file.
+ * reported to the activity stats facility immediately, while the effects on
+ * live and dead tuple counts are preserved in the 2PC state file.
  *
  * Note: AtEOXact_PgStat is not called during PREPARE.
  */
@@ -2593,7 +3144,7 @@ pgstat_twophase_postcommit(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, commit case */
     pgstat_info->t_counts.t_tuples_inserted += rec->tuples_inserted;
@@ -2629,7 +3180,7 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
     PgStat_TableStatus *pgstat_info;
 
     /* Find or create a tabstat entry for the rel */
-    pgstat_info = get_tabstat_entry(rec->t_id, rec->t_shared);
+    pgstat_info = get_local_tabstat_entry(rec->t_id, rec->t_shared);
 
     /* Same math as in AtEOXact_PgStat, abort case */
     if (rec->t_truncated)
@@ -2649,85 +3200,135 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,
 /* ----------
  * pgstat_fetch_stat_dbentry() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one database or NULL. NULL doesn't mean
- *    that the database doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    Find the database stats entry on backends.
+ *
+ *  The returned entry is stored in static memory so the content is valid until
+ *    the next call of the same function for a different database.
  * ----------
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup the requested database; return NULL if not found
-     */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    PgStat_StatDBEntry *shent;
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* Return cached result if it is valid. */
+    if (dbid != 0 && cached_dbent_key.databaseid == dbid)
+        return &cached_dbent;
+
+    shent = (PgStat_StatDBEntry *)
+        get_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid, true, false, NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_dbent, shent, sizeof(PgStat_StatDBEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_dbent_key.databaseid = dbid;
+
+    return &cached_dbent;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_tabentry() -
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
- *    the collected statistics for one table or NULL. NULL doesn't mean
+ *    the activity statistics for one table or NULL. NULL doesn't mean
  *    that the table doesn't exist, it is just not yet known by the
- *    collector, so the caller is better off to report ZERO instead.
+ *    activity statistics facilities, so the caller is better off to
+ *    report ZERO instead.
  * ----------
  */
 PgStat_StatTabEntry *
 pgstat_fetch_stat_tabentry(Oid relid)
 {
-    Oid            dbid;
-    PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
-
-    /*
-     * Lookup our database, then look in its table hash table.
-     */
-    dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
+    tabentry = pgstat_fetch_stat_tabentry_extended(false, relid);
+    if (tabentry != NULL)
+        return tabentry;
 
     /*
      * If we didn't find it, maybe it's a shared table.
      */
-    dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
-    {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
-        if (tabentry)
-            return tabentry;
-    }
-
-    return NULL;
+    tabentry = pgstat_fetch_stat_tabentry_extended(true, relid);
+    return tabentry;
+}
+
+
+/* ----------
+ * pgstat_fetch_stat_tabentry_extended() -
+ *
+ *    Find table stats entry on backends in dbent. The returned entry is stored
+ *    in static memory so the content is valid until the next call of the same
+ *    function for a different table.
+ */
+PgStat_StatTabEntry *
+pgstat_fetch_stat_tabentry_extended(bool shared, Oid reloid)
+{
+    PgStat_StatTabEntry *shent;
+    Oid            dboid = (shared ? InvalidOid : MyDatabaseId);
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for InvalidOid */
+    Assert(reloid != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_tabent_key.databaseid == dboid &&
+        cached_tabent_key.objectid == reloid)
+        return &cached_tabent;
+
+    shent = (PgStat_StatTabEntry *)
+        get_stat_entry(PGSTAT_TYPE_TABLE, dboid, reloid, true, false, NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_tabent, shent, sizeof(PgStat_StatTabEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_tabent_key.databaseid = dboid;
+    cached_tabent_key.objectid = reloid;
+
+    return &cached_tabent;
+}
+
+
+/* ----------
+ * pgstat_copy_index_counters() -
+ *
+ *    Support function for index swapping. Copy a portion of the counters of the
+ *    relation to the specified place.
+ * ----------
+ */
+void
+pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst)
+{
+    PgStat_StatTabEntry *tabentry;
+
+    /* No point fetching tabentry when dst is NULL */
+    if (!dst)
+        return;
+
+    tabentry = pgstat_fetch_stat_tabentry(relid);
+
+    if (!tabentry)
+        return;
+
+    dst->t_counts.t_numscans = tabentry->numscans;
+    dst->t_counts.t_tuples_returned = tabentry->tuples_returned;
+    dst->t_counts.t_tuples_fetched = tabentry->tuples_fetched;
+    dst->t_counts.t_blocks_fetched = tabentry->blocks_fetched;
+    dst->t_counts.t_blocks_hit = tabentry->blocks_hit;
 }
 
 
@@ -2736,30 +3337,46 @@ pgstat_fetch_stat_tabentry(Oid relid)
  *
  *    Support function for the SQL-callable pgstat* functions. Returns
  *    the collected statistics for one function or NULL.
+ *
+ *  The returned entry is stored in static memory so the content is valid until
+ *    the next call of the same function for a different function id.
  * ----------
  */
 PgStat_StatFuncEntry *
 pgstat_fetch_stat_funcentry(Oid func_id)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry = NULL;
-
-    /* load the stats file if needed */
-    backend_read_statsfile();
-
-    /* Lookup our database, then find the requested function.  */
-    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
-
-    return funcentry;
+    PgStat_StatFuncEntry *shent;
+    Oid            dboid = MyDatabaseId;
+
+    /* should be called from backends */
+    Assert(IsUnderPostmaster);
+
+    /* the simple cache doesn't work properly for InvalidOid */
+    Assert(func_id != InvalidOid);
+
+    /* Return cached result if it is valid. */
+    if (cached_funcent_key.databaseid == dboid &&
+        cached_funcent_key.objectid == func_id)
+        return &cached_funcent;
+
+    shent = (PgStat_StatFuncEntry *)
+        get_stat_entry(PGSTAT_TYPE_FUNCTION, dboid, func_id, true, false,
+                       NULL);
+
+    if (!shent)
+        return NULL;
+
+    LWLockAcquire(&shent->header.lock, LW_SHARED);
+    memcpy(&cached_funcent, shent, sizeof(PgStat_StatFuncEntry));
+    LWLockRelease(&shent->header.lock);
+
+    /* remember the key for the cached entry */
+    cached_funcent_key.databaseid = dboid;
+    cached_funcent_key.objectid = func_id;
+
+    return &cached_funcent;
 }
 
-
 /* ----------
  * pgstat_fetch_stat_beentry() -
  *
@@ -2819,53 +3436,160 @@ pgstat_fetch_stat_numbackends(void)
     return localNumBackends;
 }
 
+/*
+ * ---------
+ * pgstat_get_stat_timestamp() -
+ *
+ *  Returns the last update timestamp of global statistics.
+ */
+TimestampTz
+pgstat_get_stat_timestamp(void)
+{
+    return (TimestampTz) pg_atomic_read_u64(&StatsShmem->stats_timestamp);
+}
+
 /*
  * ---------
  * pgstat_fetch_stat_archiver() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the archiver statistics struct.
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_ArchiverStats *
+PgStat_Archiver *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    PgStat_Archiver reset;
+    PgStat_Archiver *reset_shared = &StatsShmem->archiver_reset_offset;
+    PgStat_Archiver *shared = &StatsShmem->archiver_stats;
+    PgStat_Archiver *cached = &cached_archiverstats;
 
-    return &archiverStats;
+    pgstat_copy_global_stats(cached, shared, sizeof(PgStat_Archiver),
+                             &StatsShmem->archiver_changecount);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&reset, reset_shared, sizeof(PgStat_Archiver));
+    LWLockRelease(StatsLock);
+
+    /* compensate by reset offsets */
+    if (cached->archived_count == reset.archived_count)
+    {
+        cached->last_archived_wal[0] = 0;
+        cached->last_archived_timestamp = 0;
+    }
+    cached->archived_count -= reset.archived_count;
+
+    if (cached->failed_count == reset.failed_count)
+    {
+        cached->last_failed_wal[0] = 0;
+        cached->last_failed_timestamp = 0;
+    }
+    cached->failed_count -= reset.failed_count;
+
+    cached->stat_reset_timestamp = reset.stat_reset_timestamp;
+
+    cached_archiverstats_is_valid = true;
+
+    return &cached_archiverstats;
 }
 
 
 /*
  * ---------
- * pgstat_fetch_global() -
+ * pgstat_fetch_stat_bgwriter() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the global statistics struct.
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_GlobalStats *
-pgstat_fetch_global(void)
+PgStat_BgWriter *
+pgstat_fetch_stat_bgwriter(void)
 {
-    backend_read_statsfile();
+    PgStat_BgWriter reset;
+    PgStat_BgWriter *reset_shared = &StatsShmem->bgwriter_reset_offset;
+    PgStat_BgWriter *shared = &StatsShmem->bgwriter_stats;
+    PgStat_BgWriter *cached = &cached_bgwriterstats;
+
+    pgstat_copy_global_stats(cached, shared, sizeof(PgStat_BgWriter),
+                             &StatsShmem->bgwriter_changecount);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&reset, reset_shared, sizeof(PgStat_BgWriter));
+    LWLockRelease(StatsLock);
+
+    /* compensate by reset offsets */
+    cached->buf_written_clean -= reset.buf_written_clean;
+    cached->maxwritten_clean -= reset.maxwritten_clean;
+    cached->buf_alloc -= reset.buf_alloc;
+    cached->stat_reset_timestamp = reset.stat_reset_timestamp;
+
+    cached_bgwriterstats_is_valid = true;
+
+    return &cached_bgwriterstats;
+}
+
+/*
+ * ---------
+ * pgstat_fetch_stat_checkpointer() -
+ *
+ *    Support function for the SQL-callable pgstat* functions.  The returned
+ *  entry is stored in static memory so the content is valid until the next
+ *  call.
+ * ---------
+ */
+PgStat_CheckPointer *
+pgstat_fetch_stat_checkpointer(void)
+{
+    PgStat_CheckPointer reset;
+    PgStat_CheckPointer *reset_shared = &StatsShmem->checkpointer_reset_offset;
+    PgStat_CheckPointer *shared = &StatsShmem->checkpointer_stats;
+    PgStat_CheckPointer *cached = &cached_checkpointerstats;
+
+    pgstat_copy_global_stats(cached, shared, sizeof(PgStat_CheckPointer),
+                             &StatsShmem->checkpointer_changecount);
+
+    LWLockAcquire(StatsLock, LW_SHARED);
+    memcpy(&reset, reset_shared, sizeof(PgStat_CheckPointer));
+    LWLockRelease(StatsLock);
+
+    /* compensate by reset offsets */
+    cached->timed_checkpoints -= reset.timed_checkpoints;
+    cached->requested_checkpoints -= reset.requested_checkpoints;
+    cached->buf_written_checkpoints -= reset.buf_written_checkpoints;
+    cached->buf_written_backend -= reset.buf_written_backend;
+    cached->buf_fsync_backend -= reset.buf_fsync_backend;
+    cached->checkpoint_write_time -= reset.checkpoint_write_time;
+    cached->checkpoint_sync_time -= reset.checkpoint_sync_time;
+
+    cached_checkpointerstats_is_valid = true;
 
-    return &globalStats;
+    return &cached_checkpointerstats;
 }
 
 /*
  * ---------
  * pgstat_fetch_stat_wal() -
  *
- *    Support function for the SQL-callable pgstat* functions. Returns
- *    a pointer to the WAL statistics struct.
+ *    Support function for the SQL-callable pgstat* functions. The returned entry
+ *  is stored in static memory so the content is valid until the next
+ *  call.
  * ---------
  */
-PgStat_WalStats *
+PgStat_Wal *
 pgstat_fetch_stat_wal(void)
 {
-    backend_read_statsfile();
+    if (!cached_walstats_is_valid)
+    {
+        LWLockAcquire(StatsLock, LW_SHARED);
+        memcpy(&cached_walstats, &StatsShmem->wal_stats, sizeof(PgStat_Wal));
+        LWLockRelease(StatsLock);
+    }
 
-    return &walStats;
+    cached_walstats_is_valid = true;
+
+    return &cached_walstats;
 }
 
 /*
@@ -2879,9 +3603,27 @@ pgstat_fetch_stat_wal(void)
 PgStat_SLRUStats *
 pgstat_fetch_slru(void)
 {
-    backend_read_statsfile();
+    size_t        size = sizeof(PgStat_SLRUStats) * SLRU_NUM_ELEMENTS;
 
-    return slruStats;
+    for (;;)
+    {
+        uint32        before_count;
+        uint32        after_count;
+
+        pg_read_barrier();
+        before_count = pg_atomic_read_u32(&StatsShmem->slru_changecount);
+        memcpy(&cached_slrustats, &StatsShmem->slru_stats, size);
+        after_count = pg_atomic_read_u32(&StatsShmem->slru_changecount);
+
+        if (before_count == after_count && (before_count & 1) == 0)
+            break;
+
+        CHECK_FOR_INTERRUPTS();
+    }
+
+    cached_slrustats_is_valid = true;
+
+    return &cached_slrustats;
 }
 
 /*
@@ -2893,13 +3635,47 @@ pgstat_fetch_slru(void)
  *    number of entries in nslots_p.
  * ---------
  */
-PgStat_ReplSlotStats *
+PgStat_ReplSlot *
 pgstat_fetch_replslot(int *nslots_p)
 {
-    backend_read_statsfile();
 
-    *nslots_p = nReplSlotStats;
-    return replSlotStats;
+    if (cached_replslotstats == NULL)
+    {
+        cached_replslotstats = (PgStat_ReplSlot *)
+            MemoryContextAlloc(pgStatCacheContext,
+                               sizeof(PgStat_ReplSlot) * max_replication_slots);
+    }
+
+    if (n_cached_replslotstats < 0)
+    {
+        int            n = 0;
+        int            i;
+
+        for (i = 0; i < max_replication_slots; i++)
+        {
+            PgStat_ReplSlot *shent = (PgStat_ReplSlot *)
+                get_stat_entry(PGSTAT_TYPE_REPLSLOT, MyDatabaseId, i,
+                               false, false, NULL);
+            if (!shent)
+                continue;
+
+            /* Skip if this slot is not used */
+            LWLockAcquire(&shent->header.lock, LW_SHARED);
+            if (shent->slotname[0] != '\0')
+            {
+                memcpy(cached_replslotstats[n++].slotname,
+                       shent->slotname,
+                       sizeof(PgStat_ReplSlot) -
+                       offsetof(PgStat_ReplSlot, slotname));
+            }
+            LWLockRelease(&shent->header.lock);
+        }
+
+        n_cached_replslotstats = n;
+    }
+
+    *nslots_p = n_cached_replslotstats;
+    return cached_replslotstats;
 }
 
 /* ------------------------------------------------------------
@@ -3124,8 +3900,8 @@ pgstat_initialize(void)
      */
     prevWalUsage = pgWalUsage;
 
-    /* Set up a process-exit hook to clean up */
-    on_shmem_exit(pgstat_beshutdown_hook, 0);
+    /* needs to be called before dsm shutdown */
+    before_shmem_exit(pgstat_beshutdown_hook, 0);
 }
 
 /* ----------
@@ -3301,12 +4077,15 @@ pgstat_bestart(void)
     /* Update app name to current GUC setting */
     if (application_name)
         pgstat_report_appname(application_name);
+
+    /* attach shared database stats area */
+    attach_shared_stats();
 }
 
 /*
  * Shut down a single backend's statistics reporting at process exit.
  *
- * Flush any remaining statistics counts out to the collector.
+ * Flush any remaining statistics counts out to shared stats.
  * Without this, operations triggered during backend exit (such as
  * temp table deletions) won't be counted.
  *
@@ -3319,12 +4098,25 @@ pgstat_beshutdown_hook(int code, Datum arg)
 
     /*
      * If we got as far as discovering our own database ID, we can report what
-     * we did to the collector.  Otherwise, we'd be sending an invalid
+     * we did to the shared stats.  Otherwise, we'd be sending an invalid
      * database ID, so forget it.  (This means that accesses to pg_database
      * during failed backend starts might never get counted.)
      */
     if (OidIsValid(MyDatabaseId))
+    {
+        if (MyBackendType == B_BACKEND)
+            pgstat_update_connstats(true);
         pgstat_report_stat(true);
+    }
+
+    /*
+     * We need to clean up temporary slots before detaching shared statistics
+     * so that the statistics for temporary slots are properly removed.
+     */
+    if (MyReplicationSlot != NULL)
+        ReplicationSlotRelease();
+
+    ReplicationSlotCleanup();
 
     /*
      * Clear my status entry, following the protocol of bumping st_changecount
@@ -3336,6 +4128,8 @@ pgstat_beshutdown_hook(int code, Datum arg)
     beentry->st_procpid = 0;    /* mark invalid */
 
     PGSTAT_END_WRITE_ACTIVITY(beentry);
+
+    detach_shared_stats(true);
 }
 
 
@@ -3620,7 +4414,8 @@ pgstat_read_current_status(void)
 #endif
     int            i;
 
-    Assert(!pgStatRunningInCollector);
+    Assert(IsUnderPostmaster);
+
     if (localBackendStatusTable)
         return;                    /* already done */
 
@@ -3915,8 +4710,8 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_LOGICAL_LAUNCHER_MAIN:
             event_name = "LogicalLauncherMain";
             break;
-        case WAIT_EVENT_PGSTAT_MAIN:
-            event_name = "PgStatMain";
+        case WAIT_EVENT_READING_STATS_FILE:
+            event_name = "ReadingStatsFile";
             break;
         case WAIT_EVENT_RECOVERY_WAL_STREAM:
             event_name = "RecoveryWalStream";
@@ -4576,94 +5371,80 @@ pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
 
 
 /* ----------
- * pgstat_setheader() -
+ * pgstat_report_archiver() -
  *
- *        Set common header fields in a statistics message
+ *        Report archiver statistics
  * ----------
  */
-static void
-pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype)
-{
-    hdr->m_type = mtype;
-}
-
-
-/* ----------
- * pgstat_send() -
- *
- *        Send out one statistics message to the collector
- * ----------
- */
-static void
-pgstat_send(void *msg, int len)
+void
+pgstat_report_archiver(const char *xlog, bool failed)
 {
-    int            rc;
+    TimestampTz now = GetCurrentTimestamp();
+    uint32        before_count PG_USED_FOR_ASSERTS_ONLY;
+    uint32        after_count PG_USED_FOR_ASSERTS_ONLY;
 
-    if (pgStatSock == PGINVALID_SOCKET)
-        return;
 
-    ((PgStat_MsgHdr *) msg)->m_size = len;
+    START_CRIT_SECTION();
+    before_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->archiver_changecount, 1);
+    Assert((before_count & 1) == 0);
 
-    /* We'll retry after EINTR, but ignore all other failures */
-    do
+    if (failed)
     {
-        rc = send(pgStatSock, msg, len, 0);
-    } while (rc < 0 && errno == EINTR);
-
-#ifdef USE_ASSERT_CHECKING
-    /* In debug builds, log send failures ... */
-    if (rc < 0)
-        elog(LOG, "could not send to statistics collector: %m");
-#endif
-}
-
-/* ----------
- * pgstat_send_archiver() -
- *
- *    Tell the collector about the WAL file that we successfully
- *    archived or failed to archive.
- * ----------
- */
-void
-pgstat_send_archiver(const char *xlog, bool failed)
-{
-    PgStat_MsgArchiver msg;
-
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-    msg.m_failed = failed;
-    strlcpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-    msg.m_timestamp = GetCurrentTimestamp();
-    pgstat_send(&msg, sizeof(msg));
+        ++StatsShmem->archiver_stats.failed_count;
+        memcpy(&StatsShmem->archiver_stats.last_failed_wal, xlog,
+               sizeof(StatsShmem->archiver_stats.last_failed_wal));
+        StatsShmem->archiver_stats.last_failed_timestamp = now;
+    }
+    else
+    {
+        ++StatsShmem->archiver_stats.archived_count;
+        memcpy(&StatsShmem->archiver_stats.last_archived_wal, xlog,
+               sizeof(StatsShmem->archiver_stats.last_archived_wal));
+        StatsShmem->archiver_stats.last_archived_timestamp = now;
+    }
+
+    after_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->archiver_changecount, 1);
+    Assert(after_count == before_count + 1);
+    END_CRIT_SECTION();
 }
 
 /* ----------
- * pgstat_send_bgwriter() -
+ * pgstat_report_bgwriter() -
  *
- *        Send bgwriter statistics to the collector
+ *        Report bgwriter statistics
  * ----------
  */
 void
-pgstat_send_bgwriter(void)
+pgstat_report_bgwriter(void)
 {
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgBgWriter all_zeroes;
+    static const PgStat_BgWriter all_zeroes;
+    PgStat_BgWriter *s = &StatsShmem->bgwriter_stats;
+    PgStat_BgWriter *l = &BgWriterStats;
+    uint32        before_count PG_USED_FOR_ASSERTS_ONLY;
+    uint32        after_count PG_USED_FOR_ASSERTS_ONLY;
 
     /*
      * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
+     * this case, avoid touching the shared stats when there is nothing to add.
      */
-    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_MsgBgWriter)) == 0)
+    if (memcmp(&BgWriterStats, &all_zeroes, sizeof(PgStat_BgWriter)) == 0)
         return;
 
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&BgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
-    pgstat_send(&BgWriterStats, sizeof(BgWriterStats));
+    START_CRIT_SECTION();
+    before_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->bgwriter_changecount, 1);
+    Assert((before_count & 1) == 0);
+
+    s->buf_written_clean += l->buf_written_clean;
+    s->maxwritten_clean += l->maxwritten_clean;
+    s->buf_alloc += l->buf_alloc;
+
+    after_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->bgwriter_changecount, 1);
+    Assert(after_count == before_count + 1);
+    END_CRIT_SECTION();
 
     /*
      * Clear out the statistics buffer, so it can be re-used.
@@ -4671,589 +5452,194 @@ pgstat_send_bgwriter(void)
     MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
 }
 
+/* ----------
+ * pgstat_report_checkpointer() -
+ *
+ *        Report checkpointer statistics
+ * ----------
+ */
+void
+pgstat_report_checkpointer(void)
+{
+    /* We assume this initializes to zeroes */
+    static const PgStat_CheckPointer all_zeroes;
+    PgStat_CheckPointer *s = &StatsShmem->checkpointer_stats;
+    PgStat_CheckPointer *l = &CheckPointerStats;
+    uint32        before_count PG_USED_FOR_ASSERTS_ONLY;
+    uint32        after_count PG_USED_FOR_ASSERTS_ONLY;
+
+    /*
+     * This function can be called even if nothing at all has happened. In
+     * this case, avoid touching the shared stats when there is nothing to add.
+     */
+    if (memcmp(&CheckPointerStats, &all_zeroes,
+               sizeof(PgStat_CheckPointer)) == 0)
+        return;
+
+    START_CRIT_SECTION();
+    before_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->checkpointer_changecount, 1);
+    Assert((before_count & 1) == 0);
+
+    s->timed_checkpoints += l->timed_checkpoints;
+    s->requested_checkpoints += l->requested_checkpoints;
+    s->checkpoint_write_time += l->checkpoint_write_time;
+    s->checkpoint_sync_time += l->checkpoint_sync_time;
+    s->buf_written_checkpoints += l->buf_written_checkpoints;
+    s->buf_written_backend += l->buf_written_backend;
+    s->buf_fsync_backend += l->buf_fsync_backend;
+
+    after_count =
+        pg_atomic_fetch_add_u32(&StatsShmem->checkpointer_changecount, 1);
+    Assert(after_count == before_count + 1);
+    END_CRIT_SECTION();
+
+    /*
+     * Clear out the statistics buffer, so it can be re-used.
+     */
+    MemSet(&CheckPointerStats, 0, sizeof(CheckPointerStats));
+}
+
 /* ----------
  * pgstat_report_wal() -
  *
- * Calculate how much WAL usage counters are increased and send
- * WAL statistics to the collector.
- *
- * Must be called by processes that generate WAL.
+ *        Report WAL statistics
  * ----------
  */
 void
 pgstat_report_wal(void)
 {
-    WalUsage    walusage;
+    flush_walstat(true);
+}
 
-    /*
-     * Calculate how much WAL usage counters are increased by substracting the
-     * previous counters from the current ones. Fill the results in WAL stats
-     * message.
-     */
-    MemSet(&walusage, 0, sizeof(WalUsage));
-    WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);
+/* ----------
+ * pgstat_update_connstats() -
+ *
+ *        Update local connection stats
+ * ----------
+ */
+static void
+pgstat_update_connstats(bool disconnect)
+{
+    static TimestampTz last_report = 0;
+    static SessionEndType session_end_type = DISCONNECT_NOT_YET;
+    TimestampTz now;
+    long        secs;
+    int            usecs;
+    PgStat_StatDBEntry *ldbstats;    /* local database entry */
 
-    WalStats.m_wal_records = walusage.wal_records;
-    WalStats.m_wal_fpi = walusage.wal_fpi;
-    WalStats.m_wal_bytes = walusage.wal_bytes;
+    Assert(MyBackendType == B_BACKEND);
 
-    /*
-     * Send WAL stats message to the collector.
-     */
-    if (!pgstat_send_wal(true))
+    if (session_end_type != DISCONNECT_NOT_YET)
         return;
 
-    /*
-     * Save the current counters for the subsequent calculation of WAL usage.
-     */
-    prevWalUsage = pgWalUsage;
-}
+    now = GetCurrentTimestamp();
+    if (last_report == 0)
+        last_report = MyStartTimestamp;
+    TimestampDifference(last_report, now, &secs, &usecs);
+    last_report = now;
 
-/* ----------
- * pgstat_send_wal() -
- *
- *    Send WAL statistics to the collector.
- *
- * If 'force' is not set, WAL stats message is only sent if enough time has
- * passed since last one was sent to reach PGSTAT_STAT_INTERVAL.
- *
- * Return true if the message is sent, and false otherwise.
- * ----------
- */
-bool
-pgstat_send_wal(bool force)
-{
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgWal all_zeroes;
-    static TimestampTz sendTime = 0;
+    if (disconnect)
+        session_end_type = pgStatSessionEndCause;
 
-    /*
-     * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.
-     */
-    if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
-        return false;
+    ldbstats = get_local_dbstat_entry(MyDatabaseId);
 
-    if (!force)
-    {
-        TimestampTz now = GetCurrentTimestamp();
+    ldbstats->counts.n_sessions = (last_report == 0 ? 1 : 0);
+    ldbstats->counts.total_session_time += secs * 1000000 + usecs;
+    ldbstats->counts.total_active_time += pgStatActiveTime;
+    pgStatActiveTime = 0;
+    ldbstats->counts.total_idle_in_xact_time += pgStatTransactionIdleTime;
+    pgStatTransactionIdleTime = 0;
 
-        /*
-         * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-         * msec since we last sent one.
-         */
-        if (!TimestampDifferenceExceeds(sendTime, now, PGSTAT_STAT_INTERVAL))
-            return false;
-        sendTime = now;
-    }
-
-    /*
-     * Prepare and send the message
-     */
-    pgstat_setheader(&WalStats.m_hdr, PGSTAT_MTYPE_WAL);
-    pgstat_send(&WalStats, sizeof(WalStats));
-
-    /*
-     * Clear out the statistics buffer, so it can be re-used.
-     */
-    MemSet(&WalStats, 0, sizeof(WalStats));
-
-    return true;
-}
-
-/* ----------
- * pgstat_send_slru() -
- *
- *        Send SLRU statistics to the collector
- * ----------
- */
-static void
-pgstat_send_slru(void)
-{
-    /* We assume this initializes to zeroes */
-    static const PgStat_MsgSLRU all_zeroes;
-
-    for (int i = 0; i < SLRU_NUM_ELEMENTS; i++)
+    switch (session_end_type)
     {
-        /*
-         * This function can be called even if nothing at all has happened. In
-         * this case, avoid sending a completely empty message to the stats
-         * collector.
-         */
-        if (memcmp(&SLRUStats[i], &all_zeroes, sizeof(PgStat_MsgSLRU)) == 0)
-            continue;
-
-        /* set the SLRU type before each send */
-        SLRUStats[i].m_index = i;
-
-        /*
-         * Prepare and send the message
-         */
-        pgstat_setheader(&SLRUStats[i].m_hdr, PGSTAT_MTYPE_SLRU);
-        pgstat_send(&SLRUStats[i], sizeof(PgStat_MsgSLRU));
-
-        /*
-         * Clear out the statistics buffer, so it can be re-used.
-         */
-        MemSet(&SLRUStats[i], 0, sizeof(PgStat_MsgSLRU));
-    }
-}
-
-
-/* ----------
- * PgstatCollectorMain() -
- *
- *    Start up the statistics collector process.  This is the body of the
- *    postmaster child process.
- *
- *    The argc/argv parameters are valid only in EXEC_BACKEND case.
- * ----------
- */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
-{
-    int            len;
-    PgStat_Msg    msg;
-    int            wr;
-    WaitEvent    event;
-    WaitEventSet *wes;
-
-    /*
-     * Ignore all signals usually bound to some action in the postmaster,
-     * except SIGHUP and SIGQUIT.  Note we don't need a SIGUSR1 handler to
-     * support latch operations, because we only use a local latch.
-     */
-    pqsignal(SIGHUP, SignalHandlerForConfigReload);
-    pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, SignalHandlerForShutdownRequest);
-    pqsignal(SIGALRM, SIG_IGN);
-    pqsignal(SIGPIPE, SIG_IGN);
-    pqsignal(SIGUSR1, SIG_IGN);
-    pqsignal(SIGUSR2, SIG_IGN);
-    /* Reset some signals that are accepted by postmaster but not here */
-    pqsignal(SIGCHLD, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
-
-    MyBackendType = B_STATS_COLLECTOR;
-    init_ps_display(NULL);
-
-    /*
-     * Read in existing stats files or initialize the stats to zero.
-     */
-    pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
-
-    /* Prepare to wait for our latch or data in our socket. */
-    wes = CreateWaitEventSet(CurrentMemoryContext, 3);
-    AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
-    AddWaitEventToSet(wes, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL, NULL);
-    AddWaitEventToSet(wes, WL_SOCKET_READABLE, pgStatSock, NULL, NULL);
-
-    /*
-     * Loop to process messages until we get SIGQUIT or detect ungraceful
-     * death of our parent postmaster.
-     *
-     * For performance reasons, we don't want to do ResetLatch/WaitLatch after
-     * every message; instead, do that only after a recv() fails to obtain a
-     * message.  (This effectively means that if backends are sending us stuff
-     * like mad, we won't notice postmaster death until things slack off a
-     * bit; which seems fine.)    To do that, we have an inner loop that
-     * iterates as long as recv() succeeds.  We do check ConfigReloadPending
-     * inside the inner loop, which means that such interrupts will get
-     * serviced but the latch won't get cleared until next time there is a
-     * break in the action.
-     */
-    for (;;)
-    {
-        /* Clear any already-pending wakeups */
-        ResetLatch(MyLatch);
-
-        /*
-         * Quit if we get SIGQUIT from the postmaster.
-         */
-        if (ShutdownRequestPending)
+        case DISCONNECT_NOT_YET:
+        case DISCONNECT_NORMAL:
+            /* we don't collect these */
             break;
-
-        /*
-         * Inner loop iterates as long as we keep getting messages, or until
-         * ShutdownRequestPending becomes set.
-         */
-        while (!ShutdownRequestPending)
-        {
-            /*
-             * Reload configuration if we got SIGHUP from the postmaster.
-             */
-            if (ConfigReloadPending)
-            {
-                ConfigReloadPending = false;
-                ProcessConfigFile(PGC_SIGHUP);
-            }
-
-            /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
-             * Try to receive and process a message.  This will not block,
-             * since the socket is set to non-blocking mode.
-             *
-             * XXX On Windows, we have to force pgwin32_recv to cooperate,
-             * despite the previous use of pg_set_noblock() on the socket.
-             * This is extremely broken and should be fixed someday.
-             */
-#ifdef WIN32
-            pgwin32_noblock = 1;
-#endif
-
-            len = recv(pgStatSock, (char *) &msg,
-                       sizeof(PgStat_Msg), 0);
-
-#ifdef WIN32
-            pgwin32_noblock = 0;
-#endif
-
-            if (len < 0)
-            {
-                if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
-                    break;        /* out of inner loop */
-                ereport(ERROR,
-                        (errcode_for_socket_access(),
-                         errmsg("could not read statistics message: %m")));
-            }
-
-            /*
-             * We ignore messages that are smaller than our common header
-             */
-            if (len < sizeof(PgStat_MsgHdr))
-                continue;
-
-            /*
-             * The received length must match the length in the header
-             */
-            if (msg.msg_hdr.m_size != len)
-                continue;
-
-            /*
-             * O.K. - we accept this message.  Process it.
-             */
-            switch (msg.msg_hdr.m_type)
-            {
-                case PGSTAT_MTYPE_DUMMY:
-                    break;
-
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry(&msg.msg_inquiry, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABSTAT:
-                    pgstat_recv_tabstat(&msg.msg_tabstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_TABPURGE:
-                    pgstat_recv_tabpurge(&msg.msg_tabpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_DROPDB:
-                    pgstat_recv_dropdb(&msg.msg_dropdb, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETCOUNTER:
-                    pgstat_recv_resetcounter(&msg.msg_resetcounter, len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSHAREDCOUNTER:
-                    pgstat_recv_resetsharedcounter(&msg.msg_resetsharedcounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSINGLECOUNTER:
-                    pgstat_recv_resetsinglecounter(&msg.msg_resetsinglecounter,
-                                                   len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETSLRUCOUNTER:
-                    pgstat_recv_resetslrucounter(&msg.msg_resetslrucounter,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_RESETREPLSLOTCOUNTER:
-                    pgstat_recv_resetreplslotcounter(&msg.msg_resetreplslotcounter,
-                                                     len);
-                    break;
-
-                case PGSTAT_MTYPE_AUTOVAC_START:
-                    pgstat_recv_autovac(&msg.msg_autovacuum_start, len);
-                    break;
-
-                case PGSTAT_MTYPE_VACUUM:
-                    pgstat_recv_vacuum(&msg.msg_vacuum, len);
-                    break;
-
-                case PGSTAT_MTYPE_ANALYZE:
-                    pgstat_recv_analyze(&msg.msg_analyze, len);
-                    break;
-
-                case PGSTAT_MTYPE_ARCHIVER:
-                    pgstat_recv_archiver(&msg.msg_archiver, len);
-                    break;
-
-                case PGSTAT_MTYPE_BGWRITER:
-                    pgstat_recv_bgwriter(&msg.msg_bgwriter, len);
-                    break;
-
-                case PGSTAT_MTYPE_WAL:
-                    pgstat_recv_wal(&msg.msg_wal, len);
-                    break;
-
-                case PGSTAT_MTYPE_SLRU:
-                    pgstat_recv_slru(&msg.msg_slru, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCSTAT:
-                    pgstat_recv_funcstat(&msg.msg_funcstat, len);
-                    break;
-
-                case PGSTAT_MTYPE_FUNCPURGE:
-                    pgstat_recv_funcpurge(&msg.msg_funcpurge, len);
-                    break;
-
-                case PGSTAT_MTYPE_RECOVERYCONFLICT:
-                    pgstat_recv_recoveryconflict(&msg.msg_recoveryconflict,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_DEADLOCK:
-                    pgstat_recv_deadlock(&msg.msg_deadlock, len);
-                    break;
-
-                case PGSTAT_MTYPE_TEMPFILE:
-                    pgstat_recv_tempfile(&msg.msg_tempfile, len);
-                    break;
-
-                case PGSTAT_MTYPE_CHECKSUMFAILURE:
-                    pgstat_recv_checksum_failure(&msg.msg_checksumfailure,
-                                                 len);
-                    break;
-
-                case PGSTAT_MTYPE_REPLSLOT:
-                    pgstat_recv_replslot(&msg.msg_replslot, len);
-                    break;
-
-                case PGSTAT_MTYPE_CONNECTION:
-                    pgstat_recv_connstat(&msg.msg_conn, len);
-                    break;
-
-                default:
-                    break;
-            }
-        }                        /* end of inner message-processing loop */
-
-        /* Sleep until there's something to do */
-#ifndef WIN32
-        wr = WaitEventSetWait(wes, -1L, &event, 1, WAIT_EVENT_PGSTAT_MAIN);
-#else
-
-        /*
-         * Windows, at least in its Windows Server 2003 R2 incarnation,
-         * sometimes loses FD_READ events.  Waking up and retrying the recv()
-         * fixes that, so don't sleep indefinitely.  This is a crock of the
-         * first water, but until somebody wants to debug exactly what's
-         * happening there, this is the best we can do.  The two-second
-         * timeout matches our pre-9.2 behavior, and needs to be short enough
-         * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
-         */
-        wr = WaitEventSetWait(wes, 2 * 1000L /* msec */ , &event, 1,
-                              WAIT_EVENT_PGSTAT_MAIN);
-#endif
-
-        /*
-         * Emergency bailout if postmaster has died.  This is to avoid the
-         * necessity for manual cleanup of all postmaster children.
-         */
-        if (wr == 1 && event.events == WL_POSTMASTER_DEATH)
+        case DISCONNECT_CLIENT_EOF:
+            ldbstats->counts.n_sessions_abandoned++;
             break;
-    }                            /* end of outer loop */
-
-    /*
-     * Save the final stats to reuse at next startup.
-     */
-    pgstat_write_statsfiles(true, true);
-
-    FreeWaitEventSet(wes);
-
-    exit(0);
+        case DISCONNECT_FATAL:
+            ldbstats->counts.n_sessions_fatal++;
+            break;
+        case DISCONNECT_KILLED:
+            ldbstats->counts.n_sessions_killed++;
+            break;
+    }
 }
 
-/*
- * Subroutine to clear stats in a database entry
+/* ----------
+ * get_local_dbstat_entry() -
  *
- * Tables and functions hashes are initialized to empty.
- */
-static void
-reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
-{
-    HASHCTL        hash_ctl;
-
-    dbentry->n_xact_commit = 0;
-    dbentry->n_xact_rollback = 0;
-    dbentry->n_blocks_fetched = 0;
-    dbentry->n_blocks_hit = 0;
-    dbentry->n_tuples_returned = 0;
-    dbentry->n_tuples_fetched = 0;
-    dbentry->n_tuples_inserted = 0;
-    dbentry->n_tuples_updated = 0;
-    dbentry->n_tuples_deleted = 0;
-    dbentry->last_autovac_time = 0;
-    dbentry->n_conflict_tablespace = 0;
-    dbentry->n_conflict_lock = 0;
-    dbentry->n_conflict_snapshot = 0;
-    dbentry->n_conflict_bufferpin = 0;
-    dbentry->n_conflict_startup_deadlock = 0;
-    dbentry->n_temp_files = 0;
-    dbentry->n_temp_bytes = 0;
-    dbentry->n_deadlocks = 0;
-    dbentry->n_checksum_failures = 0;
-    dbentry->last_checksum_failure = 0;
-    dbentry->n_block_read_time = 0;
-    dbentry->n_block_write_time = 0;
-    dbentry->n_sessions = 0;
-    dbentry->total_session_time = 0;
-    dbentry->total_active_time = 0;
-    dbentry->total_idle_in_xact_time = 0;
-    dbentry->n_sessions_abandoned = 0;
-    dbentry->n_sessions_fatal = 0;
-    dbentry->n_sessions_killed = 0;
-
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-    dbentry->stats_timestamp = 0;
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
-
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
-}
-
-/*
- * Lookup the hash table entry for the specified database. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+ *  Find or create a local PgStat_StatDBEntry for dbid.  A new entry is
+ *  created and initialized if none exists.
  */
 static PgStat_StatDBEntry *
-pgstat_get_db_entry(Oid databaseid, bool create)
+get_local_dbstat_entry(Oid dbid)
 {
-    PgStat_StatDBEntry *result;
+    PgStat_StatDBEntry *dbentry;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
-
-    /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
-
-    if (!create && !found)
-        return NULL;
 
     /*
-     * If not found, initialize the new one.  This creates empty hash tables
-     * for tables and functions, too.
+     * Find an entry or create a new one.
      */
-    if (!found)
-        reset_dbentry_counters(result);
+    dbentry = (PgStat_StatDBEntry *)
+        get_local_stat_entry(PGSTAT_TYPE_DB, dbid, InvalidOid,
+                             true, &found);
 
-    return result;
+    return dbentry;
 }
 
-
-/*
- * Lookup the hash table entry for the specified table. If no hash
- * table entry exists, initialize it, if the create parameter is true.
- * Else, return NULL.
+/* ----------
+ * get_local_tabstat_entry() -
+ *  Find or create a PgStat_TableStatus entry for rel.  A new entry is created
+ *  and initialized if none exists.
+ * ----------
  */
-static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+static PgStat_TableStatus *
+get_local_tabstat_entry(Oid rel_id, bool isshared)
 {
-    PgStat_StatTabEntry *result;
+    PgStat_TableStatus *tabentry;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
 
-    /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
+    tabentry = (PgStat_TableStatus *)
+        get_local_stat_entry(PGSTAT_TYPE_TABLE,
+                             isshared ? InvalidOid : MyDatabaseId,
+                             rel_id, true, &found);
 
-    if (!create && !found)
-        return NULL;
-
-    /* If not found, initialize the new one. */
-    if (!found)
-    {
-        result->numscans = 0;
-        result->tuples_returned = 0;
-        result->tuples_fetched = 0;
-        result->tuples_inserted = 0;
-        result->tuples_updated = 0;
-        result->tuples_deleted = 0;
-        result->tuples_hot_updated = 0;
-        result->n_live_tuples = 0;
-        result->n_dead_tuples = 0;
-        result->changes_since_analyze = 0;
-        result->inserts_since_vacuum = 0;
-        result->blocks_fetched = 0;
-        result->blocks_hit = 0;
-        result->vacuum_timestamp = 0;
-        result->vacuum_count = 0;
-        result->autovac_vacuum_timestamp = 0;
-        result->autovac_vacuum_count = 0;
-        result->analyze_timestamp = 0;
-        result->analyze_count = 0;
-        result->autovac_analyze_timestamp = 0;
-        result->autovac_analyze_count = 0;
-    }
-
-    return result;
+    return tabentry;
 }
 
-
 /* ----------
- * pgstat_write_statsfiles() -
- *        Write the global statistics file, as well as requested DB files.
+ * pgstat_write_statsfile() -
+ *        Write the global statistics file, as well as DB files.
  *
- *    'permanent' specifies writing to the permanent files not temporary ones.
- *    When true (happens only when the collector is shutting down), also remove
- *    the temporary files so that backends starting up under a new postmaster
- *    can't read old data before the new collector is ready.
- *
- *    When 'allDbs' is false, only the requested databases (listed in
- *    pending_write_requests) will be written; otherwise, all databases
- *    will be written.
+ * This function is called by the last process accessing the shared stats,
+ * so locking is not required.
  * ----------
  */
 static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+pgstat_write_statsfile(void)
 {
-    HASH_SEQ_STATUS hstat;
-    PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
-    int            i;
+    dshash_seq_status hstat;
+    PgStatHashEntry *ps;
+
+    /* Stats are not initialized yet; just return. */
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID)
+        return;
+
+    /* this is the last process that is accessing the shared stats */
+#ifdef USE_ASSERT_CHECKING
+    LWLockAcquire(StatsLock, LW_SHARED);
+    Assert(StatsShmem->refcount == 0);
+    LWLockRelease(StatsLock);
+#endif
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -5273,7 +5659,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    pg_atomic_write_u64(&StatsShmem->stats_timestamp, GetCurrentTimestamp());
 
     /*
      * Write the file header --- currently just a format ID.
@@ -5283,200 +5669,72 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Write global stats struct
+     * Write bgwriter global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(&StatsShmem->bgwriter_stats, sizeof(PgStat_BgWriter), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Write archiver stats struct
+     * Write checkpointer global stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+    rc = fwrite(&StatsShmem->checkpointer_stats, sizeof(PgStat_CheckPointer), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Write WAL stats struct
+     * Write archiver global stats struct
      */
-    rc = fwrite(&walStats, sizeof(walStats), 1, fpout);
+    rc = fwrite(&StatsShmem->archiver_stats, sizeof(PgStat_Archiver), 1,
+                fpout);
+    (void) rc;                    /* we'll check for error with ferror */
+
+    /*
+     * Write WAL global stats struct
+     */
+    rc = fwrite(&StatsShmem->wal_stats, sizeof(PgStat_Wal), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write SLRU stats struct
      */
-    rc = fwrite(slruStats, sizeof(slruStats), 1, fpout);
+    rc = fwrite(&StatsShmem->slru_stats, sizeof(PgStatSharedSLRUStats), 1,
+                fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
-     * Walk through the database table.
+     * Walk through the stats entries
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_init(&hstat, pgStatSharedHash, false);
+    while ((ps = dshash_seq_next(&hstat)) != NULL)
     {
-        /*
-         * Write out the table and function stats for this DB into the
-         * appropriate per-DB stat file, if required.
-         */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
+        PgStat_StatEntryHeader *shent;
+        size_t        len;
+
+        CHECK_FOR_INTERRUPTS();
+
+        shent = (PgStat_StatEntryHeader *) dsa_get_address(area, ps->body);
+
+        /* we may have some "dropped" entries not yet removed, skip them */
+        if (shent->dropped)
+            continue;
+
+        /* Make DB's timestamp consistent with the global stats */
+        if (ps->key.type == PGSTAT_TYPE_DB)
         {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+            PgStat_StatDBEntry *dbentry = (PgStat_StatDBEntry *) shent;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
+            dbentry->stats_timestamp =
+                (TimestampTz) pg_atomic_read_u64(&StatsShmem->stats_timestamp);
         }
 
-        /*
-         * Write out the DB entry. We don't write the tables or functions
-         * pointers, since they're of no use to any other process.
-         */
-        fputc('D', fpout);
-        rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * Write replication slot stats struct
-     */
-    for (i = 0; i < nReplSlotStats; i++)
-    {
-        fputc('R', fpout);
-        rc = fwrite(&replSlotStats[i], sizeof(PgStat_ReplSlotStats), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
-
-    /*
-     * No more output to be done. Close the temp file and replace the old
-     * pgstat.stat with it.  The ferror() check replaces testing for error
-     * after each individual fputc or fwrite above.
-     */
-    fputc('E', fpout);
-
-    if (ferror(fpout))
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not write temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        FreeFile(fpout);
-        unlink(tmpfile);
-    }
-    else if (FreeFile(fpout) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not close temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        unlink(tmpfile);
-    }
-    else if (rename(tmpfile, statfile) < 0)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-                        tmpfile, statfile)));
-        unlink(tmpfile);
-    }
-
-    if (permanent)
-        unlink(pgstat_stat_filename);
-
-    /*
-     * Now throw away the list of requests.  Note that requests sent after we
-     * started the write are still waiting on the network socket.
-     */
-    list_free(pending_write_requests);
-    pending_write_requests = NIL;
-}
-
-/*
- * return the filename for a DB stat file; filename is the output buffer,
- * of length len.
- */
-static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
-                    char *filename, int len)
-{
-    int            printed;
-
-    /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
-    printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
-                       databaseid,
-                       tempname ? "tmp" : "stat");
-    if (printed >= len)
-        elog(ERROR, "overlength pgstat path");
-}
-
-/* ----------
- * pgstat_write_db_statsfile() -
- *        Write the stat file for a single database.
- *
- *    If writing to the permanent file (happens when the collector is
- *    shutting down only), remove the temporary file so that backends
- *    starting up under a new postmaster can't read the old data before
- *    the new collector is ready.
- * ----------
- */
-static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
-{
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpout;
-    int32        format_id;
-    Oid            dbid = dbentry->databaseid;
-    int            rc;
-    char        tmpfile[MAXPGPATH];
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
-
-    elog(DEBUG2, "writing stats file \"%s\"", statfile);
-
-    /*
-     * Open the statistics temp file to write out the current values.
-     */
-    fpout = AllocateFile(tmpfile, PG_BINARY_W);
-    if (fpout == NULL)
-    {
-        ereport(LOG,
-                (errcode_for_file_access(),
-                 errmsg("could not open temporary statistics file \"%s\": %m",
-                        tmpfile)));
-        return;
-    }
-
-    /*
-     * Write the file header --- currently just a format ID.
-     */
-    format_id = PGSTAT_FILE_FORMAT_ID;
-    rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
-    (void) rc;                    /* we'll check for error with ferror */
-
-    /*
-     * Walk through the database's access stats per table.
-     */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
-    {
-        fputc('T', fpout);
-        rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
-        (void) rc;                /* we'll check for error with ferror */
-    }
+        fputc('S', fpout);
+        rc = fwrite(&ps->key, sizeof(PgStatHashKey), 1, fpout);
 
-    /*
-     * Walk through the database's function stats table.
-     */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
-    {
-        fputc('F', fpout);
-        rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+        /* Write the entry body, excluding the header part */
+        len = PGSTAT_SHENT_BODY_LEN(ps->key.type);
+        rc = fwrite(PGSTAT_SHENT_BODY(shent), len, 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_seq_term(&hstat);
 
     /*
      * No more output to be done. Close the temp file and replace the old
@@ -5510,115 +5768,65 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
 }
 
 /* ----------
- * pgstat_read_statsfiles() -
- *
- *    Reads in some existing statistics collector files and returns the
- *    databases hash table that is the top level of the data.
+ * pgstat_read_statsfile() -
  *
- *    If 'onlydb' is not InvalidOid, it means we only want data for that DB
- *    plus the shared catalogs ("DB 0").  We'll still populate the DB hash
- *    table for all databases, but we don't bother even creating table/function
- *    hash tables for other databases.
+ *    Reads the existing activity statistics file into the shared stats hash.
  *
- *    'permanent' specifies reading from the permanent files not temporary ones.
- *    When true (happens only when the collector is starting up), remove the
- *    files after reading; the in-memory status is now authoritative, and the
- *    files would be out of date in case somebody else reads them.
- *
- *    If a 'deep' read is requested, table/function stats are read, otherwise
- *    the table/function hash tables remain empty.
+ * This function is called in the only process that is accessing the shared
+ * stats so locking is not required.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+static void
+pgstat_read_statsfile(void)
 {
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-    int            i;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    char        tag;
 
-    /*
-     * The tables will live in pgStatLocalContext.
-     */
-    pgstat_setup_memcxt();
+    /* shouldn't be called from postmaster */
+    Assert(IsUnderPostmaster);
 
-    /*
-     * Create the DB hashtable
-     */
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+    /* this is the only process that is accessing the shared stats */
+#ifdef USE_ASSERT_CHECKING
+    LWLockAcquire(StatsLock, LW_SHARED);
+    Assert(StatsShmem->refcount == 1);
+    LWLockRelease(StatsLock);
+#endif
 
-    /* Allocate the space for replication slot statistics */
-    replSlotStats = MemoryContextAllocZero(pgStatLocalContext,
-                                           max_replication_slots
-                                           * sizeof(PgStat_ReplSlotStats));
-    nReplSlotStats = 0;
-
-    /*
-     * Clear out global, archiver, WAL and SLRU statistics so they start from
-     * zero in case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
-    memset(&walStats, 0, sizeof(walStats));
-    memset(&slruStats, 0, sizeof(slruStats));
+    elog(DEBUG2, "reading stats file \"%s\"", statfile);
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-    walStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Set the same reset timestamp for all SLRU items too.
-     */
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-        slruStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
-
-    /*
-     * Set the same reset timestamp for all replication slots too.
-     */
-    for (i = 0; i < max_replication_slots; i++)
-        replSlotStats[i].stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    StatsShmem->bgwriter_stats.stat_reset_timestamp = GetCurrentTimestamp();
+    StatsShmem->archiver_stats.stat_reset_timestamp =
+        StatsShmem->bgwriter_stats.stat_reset_timestamp;
+    StatsShmem->wal_stats.stat_reset_timestamp =
+        StatsShmem->bgwriter_stats.stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
+     * return zero for anything and the activity statistics simply start
+     * from scratch with empty counters.
      *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
+     * ENOENT is a possibility if the activity statistics is not running or
+     * has not yet written the stats file the first time.  Any other failure
      * condition is suspicious.
      */
     if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
     {
         if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
+            ereport(LOG,
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -5627,681 +5835,150 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
         format_id != PGSTAT_FILE_FORMAT_ID)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        goto done;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
-        goto done;
-    }
-
-    /*
-     * In the collector, disregard the timestamp we read from the permanent
-     * stats file; we should be willing to write a temp stats file immediately
-     * upon the first request from any backend.  This only matters if the old
-     * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
-     * an unusual scenario.
-     */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
-
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        goto done;
-    }
-
-    /*
-     * Read WAL stats struct
-     */
-    if (fread(&walStats, 1, sizeof(walStats), fpin) != sizeof(walStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&walStats, 0, sizeof(walStats));
         goto done;
     }
 
     /*
-     * Read SLRU stats struct
+     * Read bgwriter stats struct
      */
-    if (fread(slruStats, 1, sizeof(slruStats), fpin) != sizeof(slruStats))
+    if (fread(&StatsShmem->bgwriter_stats, 1, sizeof(PgStat_BgWriter), fpin) !=
+        sizeof(PgStat_BgWriter))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&slruStats, 0, sizeof(slruStats));
+        MemSet(&StatsShmem->bgwriter_stats, 0, sizeof(PgStat_BgWriter));
         goto done;
     }
 
     /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Add to the DB hash
-                 */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
-
-                /*
-                 * In the collector, disregard the timestamp we read from the
-                 * permanent stats file; we should be willing to write a temp
-                 * stats file immediately upon the first request from any
-                 * backend.
-                 */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                /*
-                 * If requested, read the data from the database-specific
-                 * file.  Otherwise we just leave the hashtables empty.
-                 */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
-                break;
-
-                /*
-                 * 'R'    A PgStat_ReplSlotStats struct describing a replication
-                 * slot follows.
-                 */
-            case 'R':
-                if (fread(&replSlotStats[nReplSlotStats], 1, sizeof(PgStat_ReplSlotStats), fpin)
-                    != sizeof(PgStat_ReplSlotStats))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    memset(&replSlotStats[nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
-                    goto done;
-                }
-                nReplSlotStats++;
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-
-    return dbhash;
-}
-
-
-/* ----------
- * pgstat_read_db_statsfile() -
- *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
- * ----------
- */
-static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
-{
-    PgStat_StatTabEntry *tabentry;
-    PgStat_StatTabEntry tabbuf;
-    PgStat_StatFuncEntry funcbuf;
-    PgStat_StatFuncEntry *funcentry;
-    FILE       *fpin;
-    int32        format_id;
-    bool        found;
-    char        statfile[MAXPGPATH];
-
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
-
-    /*
-     * Try to open the stats file. If it doesn't exist, the backends simply
-     * return zero for anything and the collector simply starts from scratch
-     * with empty counters.
-     *
-     * ENOENT is a possibility if the stats collector is not running or has
-     * not yet written the stats file the first time.  Any other failure
-     * condition is suspicious.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return;
-    }
-
-    /*
-     * Verify it's of the expected format.
+     * Read checkpointer stats struct
      */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
+    if (fread(&StatsShmem->checkpointer_stats, 1, sizeof(PgStat_CheckPointer), fpin) !=
+        sizeof(PgStat_CheckPointer))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
+        MemSet(&StatsShmem->checkpointer_stats, 0, sizeof(PgStat_CheckPointer));
         goto done;
     }
 
-    /*
-     * We found an existing collector stats file. Read it and put all the
-     * hashtable entries into place.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'T'    A PgStat_StatTabEntry follows.
-                 */
-            case 'T':
-                if (fread(&tabbuf, 1, sizeof(PgStat_StatTabEntry),
-                          fpin) != sizeof(PgStat_StatTabEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if table data not wanted.
-                 */
-                if (tabhash == NULL)
-                    break;
-
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(tabentry, &tabbuf, sizeof(tabbuf));
-                break;
-
-                /*
-                 * 'F'    A PgStat_StatFuncEntry follows.
-                 */
-            case 'F':
-                if (fread(&funcbuf, 1, sizeof(PgStat_StatFuncEntry),
-                          fpin) != sizeof(PgStat_StatFuncEntry))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                /*
-                 * Skip if function data not wanted.
-                 */
-                if (funchash == NULL)
-                    break;
-
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
-
-                if (found)
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
-
-                memcpy(funcentry, &funcbuf, sizeof(funcbuf));
-                break;
-
-                /*
-                 * 'E'    The EOF marker of a complete stats file.
-                 */
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
-        }
-    }
-
-done:
-    FreeFile(fpin);
-
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
-}
-
-/* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts. The caller must
- *    rely on timestamp stored in *ts iff the function returns true.
- *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
- * ----------
- */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
-{
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    PgStat_WalStats myWalStats;
-    PgStat_SLRUStats mySLRUStats[SLRU_NUM_ELEMENTS];
-    PgStat_ReplSlotStats myReplSlotStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
     /*
      * Read archiver stats struct
      */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
+    if (fread(&StatsShmem->archiver_stats, 1, sizeof(PgStat_Archiver),
+              fpin) != sizeof(PgStat_Archiver))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
+        MemSet(&StatsShmem->archiver_stats, 0, sizeof(PgStat_Archiver));
+        goto done;
     }
 
     /*
      * Read WAL stats struct
      */
-    if (fread(&myWalStats, 1, sizeof(myWalStats), fpin) != sizeof(myWalStats))
+    if (fread(&StatsShmem->wal_stats, 1, sizeof(PgStat_Wal), fpin)
+        != sizeof(PgStat_Wal))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
+        MemSet(&StatsShmem->wal_stats, 0, sizeof(PgStat_Wal));
+        goto done;
     }
 
     /*
      * Read SLRU stats struct
      */
-    if (fread(mySLRUStats, 1, sizeof(mySLRUStats), fpin) != sizeof(mySLRUStats))
+    if (fread(&StatsShmem->slru_stats, 1, sizeof(PgStatSharedSLRUStats),
+              fpin) != sizeof(PgStatSharedSLRUStats))
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
+        ereport(LOG,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
-        {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-                /*
-                 * 'R'    A PgStat_ReplSlotStats struct describing a replication
-                 * slot follows.
-                 */
-            case 'R':
-                if (fread(&myReplSlotStats, 1, sizeof(PgStat_ReplSlotStats), fpin)
-                    != sizeof(PgStat_ReplSlotStats))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    FreeFile(fpin);
-                    return false;
-                }
-        }
+        goto done;
     }
 
-done:
-    FreeFile(fpin);
-    return true;
-}
-
-/*
- * If not already done, read the statistics collector stats file into
- * some hash tables.  The results will be kept until pgstat_clear_snapshot()
- * is called (typically, at end of transaction).
- */
-static void
-backend_read_statsfile(void)
-{
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
-
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
-    Assert(!pgStatRunningInCollector);
-
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
     /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
+     * We found an existing activity statistics file. Read it and put all the
+     * hash table entries into place.
      */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
+    while ((tag = fgetc(fpin)) == 'S')
     {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
+        PgStatHashKey key;
+        PgStat_StatEntryHeader *p;
+        size_t        len;
 
         CHECK_FOR_INTERRUPTS();
 
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
-
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
+        if (fread(&key, 1, sizeof(key), fpin) != sizeof(key))
         {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"", statfile)));
+            goto done;
         }
 
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                ereport(LOG,
-                        (errmsg("statistics collector's time %s is later than backend local time %s",
-                                filetime, mytime)));
-                pfree(filetime);
-                pfree(mytime);
-            }
+        p = get_stat_entry(key.type, key.databaseid, key.objectid,
+                           false, true, &found);
 
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
+        /* don't allow duplicate entries */
+        if (found)
+        {
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"",
+                            statfile)));
+            goto done;
         }
 
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
+        /* Read only the body so the entry header is not overwritten */
+        len = PGSTAT_SHENT_BODY_LEN(key.type);
 
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
+        if (fread(PGSTAT_SHENT_BODY(p), 1, len, fpin) != len)
+        {
+            ereport(LOG,
+                    (errmsg("corrupted statistics file \"%s\"", statfile)));
+            goto done;
+        }
     }
 
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
+    if (tag != 'E')
+    {
         ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
+                (errmsg("corrupted statistics file \"%s\"",
+                        statfile)));
+        goto done;
+    }
 
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
+done:
+    FreeFile(fpin);
+
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
+
+    return;
 }
 
-
 /* ----------
  * pgstat_setup_memcxt() -
  *
- *    Create pgStatLocalContext, if not already done.
+ *    Create pgStatLocalContext if not already done.
  * ----------
  */
 static void
 pgstat_setup_memcxt(void)
 {
     if (!pgStatLocalContext)
-        pgStatLocalContext = AllocSetContextCreate(TopMemoryContext,
-                                                   "Statistics snapshot",
-                                                   ALLOCSET_SMALL_SIZES);
-}
+        pgStatLocalContext =
+            AllocSetContextCreate(TopMemoryContext,
+                                  "Backend statistics snapshot",
+                                  ALLOCSET_SMALL_SIZES);
 
+    if (!pgStatCacheContext)
+        pgStatCacheContext =
+            AllocSetContextCreate(CacheMemoryContext,
+                                  "Activity statistics",
+                                  ALLOCSET_SMALL_SIZES);
+}
 
 /* ----------
  * pgstat_clear_snapshot() -
@@ -6318,947 +5995,25 @@ pgstat_clear_snapshot(void)
 {
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
+    {
         MemoryContextDelete(pgStatLocalContext);
 
-    /* Reset variables */
-    pgStatLocalContext = NULL;
-    pgStatDBHash = NULL;
-    localBackendStatusTable = NULL;
-    localNumBackends = 0;
-    replSlotStats = NULL;
-    nReplSlotStats = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            ereport(LOG,
-                    (errmsg("stats_timestamp %s is later than collector's time %s for database %u",
-                            writetime, mytime, dbentry->databaseid)));
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
-
-    /*
-     * We need to write this DB, so create a request.
-     */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
-}
-
-
-/* ----------
- * pgstat_recv_tabstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Update database-wide stats.
-     */
-    dbentry->n_xact_commit += (PgStat_Counter) (msg->m_xact_commit);
-    dbentry->n_xact_rollback += (PgStat_Counter) (msg->m_xact_rollback);
-    dbentry->n_block_read_time += msg->m_block_read_time;
-    dbentry->n_block_write_time += msg->m_block_write_time;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
-
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new table entry, initialize counters to the values we
-             * just got.
-             */
-            tabentry->numscans = tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned = tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched = tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated = tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted = tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated = tabmsg->t_counts.t_tuples_hot_updated;
-            tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum = tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit = tabmsg->t_counts.t_blocks_hit;
-
-            tabentry->vacuum_timestamp = 0;
-            tabentry->vacuum_count = 0;
-            tabentry->autovac_vacuum_timestamp = 0;
-            tabentry->autovac_vacuum_count = 0;
-            tabentry->analyze_timestamp = 0;
-            tabentry->analyze_count = 0;
-            tabentry->autovac_analyze_timestamp = 0;
-            tabentry->autovac_analyze_count = 0;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            tabentry->numscans += tabmsg->t_counts.t_numscans;
-            tabentry->tuples_returned += tabmsg->t_counts.t_tuples_returned;
-            tabentry->tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-            tabentry->tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->tuples_updated += tabmsg->t_counts.t_tuples_updated;
-            tabentry->tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-            tabentry->tuples_hot_updated += tabmsg->t_counts.t_tuples_hot_updated;
-            /* If table was truncated, first reset the live/dead counters */
-            if (tabmsg->t_counts.t_truncated)
-            {
-                tabentry->n_live_tuples = 0;
-                tabentry->n_dead_tuples = 0;
-                tabentry->inserts_since_vacuum = 0;
-            }
-            tabentry->n_live_tuples += tabmsg->t_counts.t_delta_live_tuples;
-            tabentry->n_dead_tuples += tabmsg->t_counts.t_delta_dead_tuples;
-            tabentry->changes_since_analyze += tabmsg->t_counts.t_changed_tuples;
-            tabentry->inserts_since_vacuum += tabmsg->t_counts.t_tuples_inserted;
-            tabentry->blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-            tabentry->blocks_hit += tabmsg->t_counts.t_blocks_hit;
-        }
-
-        /* Clamp n_live_tuples in case of negative delta_live_tuples */
-        tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
-        /* Likewise for n_dead_tuples */
-        tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
-
-        /*
-         * Add per-table stats to the per-database entry, too.
-         */
-        dbentry->n_tuples_returned += tabmsg->t_counts.t_tuples_returned;
-        dbentry->n_tuples_fetched += tabmsg->t_counts.t_tuples_fetched;
-        dbentry->n_tuples_inserted += tabmsg->t_counts.t_tuples_inserted;
-        dbentry->n_tuples_updated += tabmsg->t_counts.t_tuples_updated;
-        dbentry->n_tuples_deleted += tabmsg->t_counts.t_tuples_deleted;
-        dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
-        dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_tabpurge() -
- *
- *    Arrange for dead table removal.
- * ----------
- */
-static void
-pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->tables)
-        return;
-
-    /*
-     * Process all table entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_dropdb() -
- *
- *    Arrange for dead database removal
- * ----------
- */
-static void
-pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
-{
-    Oid            dbid = msg->m_databaseid;
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.
-     */
-    dbentry = pgstat_get_db_entry(dbid, false);
-
-    /*
-     * If found, remove it (along with the db statfile).
-     */
-    if (dbentry)
-    {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
-
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
-    }
-}
-
-
-/* ----------
- * pgstat_recv_resetcounter() -
- *
- *    Reset the statistics for the specified database.
- * ----------
- */
-static void
-pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Lookup the database in the hashtable.  Nothing to do if not there.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /*
-     * We simply throw away all the database's table entries by recreating a
-     * new hash table for them.
-     */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
-
-    /*
-     * Reset database-level stats, too.  This creates empty hash tables for
-     * tables and functions.
-     */
-    reset_dbentry_counters(dbentry);
-}
-
-/* ----------
- * pgstat_recv_resetsharedcounter() -
- *
- *    Reset some shared statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
-{
-    if (msg->m_resettarget == RESET_BGWRITER)
-    {
-        /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_ARCHIVER)
-    {
-        /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-    else if (msg->m_resettarget == RESET_WAL)
-    {
-        /* Reset the WAL statistics for the cluster. */
-        memset(&walStats, 0, sizeof(walStats));
-        walStats.stat_reset_timestamp = GetCurrentTimestamp();
-    }
-
-    /*
-     * Presumably the sender of this message validated the target, don't
-     * complain here if it's not valid
-     */
-}
-
-/* ----------
- * pgstat_recv_resetsinglecounter() -
- *
- *    Reset a statistics for a single object
- * ----------
- */
-static void
-pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    if (!dbentry)
-        return;
-
-    /* Set the reset timestamp for the whole database */
-    dbentry->stat_reset_timestamp = GetCurrentTimestamp();
-
-    /* Remove object if it exists, ignore it if not */
-    if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-    else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
-}
-
-/* ----------
- * pgstat_recv_resetslrucounter() -
- *
- *    Reset some SLRU statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetslrucounter(PgStat_MsgResetslrucounter *msg, int len)
-{
-    int            i;
-    TimestampTz ts = GetCurrentTimestamp();
-
-    for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
-    {
-        /* reset entry with the given index, or all entries (index is -1) */
-        if ((msg->m_index == -1) || (msg->m_index == i))
-        {
-            memset(&slruStats[i], 0, sizeof(slruStats[i]));
-            slruStats[i].stat_reset_timestamp = ts;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_resetreplslotcounter() -
- *
- *    Reset some replication slot statistics of the cluster.
- * ----------
- */
-static void
-pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg,
-                                 int len)
-{
-    int            i;
-    int            idx = -1;
-    TimestampTz ts;
-
-    ts = GetCurrentTimestamp();
-    if (msg->clearall)
-    {
-        for (i = 0; i < nReplSlotStats; i++)
-            pgstat_reset_replslot(i, ts);
-    }
-    else
-    {
-        /* Get the index of replication slot statistics to reset */
-        idx = pgstat_replslot_index(msg->m_slotname, false);
-
-        /*
-         * Nothing to do if the given slot entry is not found.  This could
-         * happen when the slot with the given name is removed and the
-         * corresponding statistics entry is also removed before receiving the
-         * reset message.
-         */
-        if (idx < 0)
-            return;
-
-        /* Reset the stats for the requested replication slot */
-        pgstat_reset_replslot(idx, ts);
-    }
-}
-
-
-/* ----------
- * pgstat_recv_autovac() -
- *
- *    Process an autovacuum signaling message.
- * ----------
- */
-static void
-pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    /*
-     * Store the last autovacuum time in the database's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->last_autovac_time = msg->m_start_time;
-}
-
-/* ----------
- * pgstat_recv_vacuum() -
- *
- *    Process a VACUUM message.
- * ----------
- */
-static void
-pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * It is quite possible that a non-aggressive VACUUM ended up skipping
-     * various pages, however, we'll zero the insert counter here regardless.
-     * It's currently used only to track when we need to perform an "insert"
-     * autovacuum, which are mainly intended to freeze newly inserted tuples.
-     * Zeroing this may just mean we'll not try to vacuum the table again
-     * until enough tuples have been inserted to trigger another insert
-     * autovacuum.  An anti-wraparound autovacuum will catch any persistent
-     * stragglers.
-     */
-    tabentry->inserts_since_vacuum = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
-    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
-    }
-}
-
-/* ----------
- * pgstat_recv_analyze() -
- *
- *    Process an ANALYZE message.
- * ----------
- */
-static void
-pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatTabEntry *tabentry;
-
-    /*
-     * Store the data in the table's hashtable entry.
-     */
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
-
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-
-    /*
-     * If commanded, reset changes_since_analyze to zero.  This forgets any
-     * changes that were committed while the ANALYZE was in progress, but we
-     * have no good way to estimate how many of those there were.
-     */
-    if (msg->m_resetcounter)
-        tabentry->changes_since_analyze = 0;
-
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_analyze_timestamp = msg->m_analyzetime;
-        tabentry->autovac_analyze_count++;
-    }
-    else
-    {
-        tabentry->analyze_timestamp = msg->m_analyzetime;
-        tabentry->analyze_count++;
-    }
-}
-
-
-/* ----------
- * pgstat_recv_archiver() -
- *
- *    Process a ARCHIVER message.
- * ----------
- */
-static void
-pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
-{
-    if (msg->m_failed)
-    {
-        /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
-    }
-    else
-    {
-        /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
-    }
-}
-
-/* ----------
- * pgstat_recv_bgwriter() -
- *
- *    Process a BGWRITER message.
- * ----------
- */
-static void
-pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
-{
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
-}
-
-/* ----------
- * pgstat_recv_wal() -
- *
- *    Process a WAL message.
- * ----------
- */
-static void
-pgstat_recv_wal(PgStat_MsgWal *msg, int len)
-{
-    walStats.wal_records += msg->m_wal_records;
-    walStats.wal_fpi += msg->m_wal_fpi;
-    walStats.wal_bytes += msg->m_wal_bytes;
-    walStats.wal_buffers_full += msg->m_wal_buffers_full;
-    walStats.wal_write += msg->m_wal_write;
-    walStats.wal_sync += msg->m_wal_sync;
-    walStats.wal_write_time += msg->m_wal_write_time;
-    walStats.wal_sync_time += msg->m_wal_sync_time;
-}
-
-/* ----------
- * pgstat_recv_slru() -
- *
- *    Process a SLRU message.
- * ----------
- */
-static void
-pgstat_recv_slru(PgStat_MsgSLRU *msg, int len)
-{
-    slruStats[msg->m_index].blocks_zeroed += msg->m_blocks_zeroed;
-    slruStats[msg->m_index].blocks_hit += msg->m_blocks_hit;
-    slruStats[msg->m_index].blocks_read += msg->m_blocks_read;
-    slruStats[msg->m_index].blocks_written += msg->m_blocks_written;
-    slruStats[msg->m_index].blocks_exists += msg->m_blocks_exists;
-    slruStats[msg->m_index].flush += msg->m_flush;
-    slruStats[msg->m_index].truncate += msg->m_truncate;
-}
-
-/* ----------
- * pgstat_recv_recoveryconflict() -
- *
- *    Process a RECOVERYCONFLICT message.
- * ----------
- */
-static void
-pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    switch (msg->m_reason)
-    {
-        case PROCSIG_RECOVERY_CONFLICT_DATABASE:
-
-            /*
-             * Since we drop the information about the database as soon as it
-             * replicates, there is no point in counting these conflicts.
-             */
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
-            dbentry->n_conflict_tablespace++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_LOCK:
-            dbentry->n_conflict_lock++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
-            dbentry->n_conflict_snapshot++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
-            dbentry->n_conflict_bufferpin++;
-            break;
-        case PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK:
-            dbentry->n_conflict_startup_deadlock++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_deadlock() -
- *
- *    Process a DEADLOCK message.
- * ----------
- */
-static void
-pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_deadlocks++;
-}
-
-/* ----------
- * pgstat_recv_checksum_failure() -
- *
- *    Process a CHECKSUMFAILURE message.
- * ----------
- */
-static void
-pgstat_recv_checksum_failure(PgStat_MsgChecksumFailure *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_checksum_failures += msg->m_failurecount;
-    dbentry->last_checksum_failure = msg->m_failure_time;
-}
-
-/* ----------
- * pgstat_recv_replslot() -
- *
- *    Process a REPLSLOT message.
- * ----------
- */
-static void
-pgstat_recv_replslot(PgStat_MsgReplSlot *msg, int len)
-{
-    int            idx;
-
-    /*
-     * Get the index of replication slot statistics.  On dropping, we don't
-     * create the new statistics.
-     */
-    idx = pgstat_replslot_index(msg->m_slotname, !msg->m_drop);
-
-    /*
-     * The slot entry is not found or there is no space to accommodate the new
-     * entry.  This could happen when the message for the creation of a slot
-     * reached before the drop message even though the actual operations
-     * happen in reverse order.  In such a case, the next update of the
-     * statistics for the same slot will create the required entry.
-     */
-    if (idx < 0)
-        return;
-
-    /* it must be a valid replication slot index */
-    Assert(idx < nReplSlotStats);
-
-    if (msg->m_drop)
-    {
-        /* Remove the replication slot statistics with the given name */
-        if (idx < nReplSlotStats - 1)
-            memcpy(&replSlotStats[idx], &replSlotStats[nReplSlotStats - 1],
-                   sizeof(PgStat_ReplSlotStats));
-        nReplSlotStats--;
-    }
-    else
-    {
-        /* Update the replication slot statistics */
-        replSlotStats[idx].spill_txns += msg->m_spill_txns;
-        replSlotStats[idx].spill_count += msg->m_spill_count;
-        replSlotStats[idx].spill_bytes += msg->m_spill_bytes;
-        replSlotStats[idx].stream_txns += msg->m_stream_txns;
-        replSlotStats[idx].stream_count += msg->m_stream_count;
-        replSlotStats[idx].stream_bytes += msg->m_stream_bytes;
-    }
-}
-
-/* ----------
- * pgstat_recv_connstat() -
- *
- *  Process connection information.
- * ----------
- */
-static void
-pgstat_recv_connstat(PgStat_MsgConn *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_sessions += msg->m_count;
-    dbentry->total_session_time += msg->m_session_time;
-    dbentry->total_active_time += msg->m_active_time;
-    dbentry->total_idle_in_xact_time += msg->m_idle_in_xact_time;
-    switch (msg->m_disconnect)
-    {
-        case DISCONNECT_NOT_YET:
-        case DISCONNECT_NORMAL:
-            /* we don't collect these */
-            break;
-        case DISCONNECT_CLIENT_EOF:
-            dbentry->n_sessions_abandoned++;
-            break;
-        case DISCONNECT_FATAL:
-            dbentry->n_sessions_fatal++;
-            break;
-        case DISCONNECT_KILLED:
-            dbentry->n_sessions_killed++;
-            break;
-    }
-}
-
-/* ----------
- * pgstat_recv_tempfile() -
- *
- *    Process a TEMPFILE message.
- * ----------
- */
-static void
-pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    dbentry->n_temp_bytes += msg->m_filesize;
-    dbentry->n_temp_files += 1;
-}
-
-/* ----------
- * pgstat_recv_funcstat() -
- *
- *    Count what the backend has done.
- * ----------
- */
-static void
-pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
-{
-    PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
-    PgStat_StatDBEntry *dbentry;
-    PgStat_StatFuncEntry *funcentry;
-    int            i;
-    bool        found;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++, funcmsg++)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
-
-        if (!found)
-        {
-            /*
-             * If it's a new function entry, initialize counters to the values
-             * we just got.
-             */
-            funcentry->f_numcalls = funcmsg->f_numcalls;
-            funcentry->f_total_time = funcmsg->f_total_time;
-            funcentry->f_self_time = funcmsg->f_self_time;
-        }
-        else
-        {
-            /*
-             * Otherwise add the values to the existing entry.
-             */
-            funcentry->f_numcalls += funcmsg->f_numcalls;
-            funcentry->f_total_time += funcmsg->f_total_time;
-            funcentry->f_self_time += funcmsg->f_self_time;
-        }
-    }
-}
-
-/* ----------
- * pgstat_recv_funcpurge() -
- *
- *    Arrange for dead function removal.
- * ----------
- */
-static void
-pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-    int            i;
-
-    dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
-    /*
-     * No need to purge if we don't even know the database.
-     */
-    if (!dbentry || !dbentry->functions)
-        return;
-
-    /*
-     * Process all function entries in the message.
-     */
-    for (i = 0; i < msg->m_nentries; i++)
-    {
-        /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
-    }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
+        /* Reset variables */
+        pgStatLocalContext = NULL;
+        localBackendStatusTable = NULL;
+        localNumBackends = 0;
+    }
+
+    /* Invalidate the simple cache keys */
+    cached_dbent_key = stathashkey_zero;
+    cached_tabent_key = stathashkey_zero;
+    cached_funcent_key = stathashkey_zero;
+    cached_archiverstats_is_valid = false;
+    cached_bgwriterstats_is_valid = false;
+    cached_checkpointerstats_is_valid = false;
+    cached_walstats_is_valid = false;
+    cached_slrustats_is_valid = false;
+    n_cached_replslotstats = -1;
 }
 
 /*
@@ -7305,60 +6060,6 @@ pgstat_clip_activity(const char *raw_activity)
     return activity;
 }
 
-/* ----------
- * pgstat_replslot_index
- *
- * Return the index of entry of a replication slot with the given name, or
- * -1 if the slot is not found.
- *
- * create_it tells whether to create the new slot entry if it is not found.
- * ----------
- */
-static int
-pgstat_replslot_index(const char *name, bool create_it)
-{
-    int            i;
-
-    Assert(nReplSlotStats <= max_replication_slots);
-    for (i = 0; i < nReplSlotStats; i++)
-    {
-        if (strcmp(replSlotStats[i].slotname, name) == 0)
-            return i;            /* found */
-    }
-
-    /*
-     * The slot is not found.  We don't want to register the new statistics if
-     * the list is already full or the caller didn't request.
-     */
-    if (i == max_replication_slots || !create_it)
-        return -1;
-
-    /* Register new slot */
-    memset(&replSlotStats[nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
-    strlcpy(replSlotStats[nReplSlotStats].slotname, name, NAMEDATALEN);
-
-    return nReplSlotStats++;
-}
-
-/* ----------
- * pgstat_reset_replslot
- *
- * Reset the replication slot stats at index 'i'.
- * ----------
- */
-static void
-pgstat_reset_replslot(int i, TimestampTz ts)
-{
-    /* reset only counters. Don't clear slot name */
-    replSlotStats[i].spill_txns = 0;
-    replSlotStats[i].spill_count = 0;
-    replSlotStats[i].spill_bytes = 0;
-    replSlotStats[i].stream_txns = 0;
-    replSlotStats[i].stream_count = 0;
-    replSlotStats[i].stream_bytes = 0;
-    replSlotStats[i].stat_reset_timestamp = ts;
-}
-
 /*
  * pgstat_slru_index
  *
@@ -7403,7 +6104,7 @@ pgstat_slru_name(int slru_idx)
  * Returns pointer to entry with counters for given SLRU (based on the name
  * stored in SlruCtl as lwlock tranche name).
  */
-static inline PgStat_MsgSLRU *
+static PgStat_SLRUStats *
 slru_entry(int slru_idx)
 {
     /*
@@ -7414,7 +6115,7 @@ slru_entry(int slru_idx)
 
     Assert((slru_idx >= 0) && (slru_idx < SLRU_NUM_ELEMENTS));
 
-    return &SLRUStats[slru_idx];
+    return &local_SLRUStats[slru_idx];
 }
 
 /*
@@ -7424,41 +6125,48 @@ slru_entry(int slru_idx)
 void
 pgstat_count_slru_page_zeroed(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_zeroed += 1;
+    slru_entry(slru_idx)->blocks_zeroed += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_hit(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_hit += 1;
+    slru_entry(slru_idx)->blocks_hit += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_exists(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_exists += 1;
+    slru_entry(slru_idx)->blocks_exists += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_read(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_read += 1;
+    slru_entry(slru_idx)->blocks_read += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_page_written(int slru_idx)
 {
-    slru_entry(slru_idx)->m_blocks_written += 1;
+    slru_entry(slru_idx)->blocks_written += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_flush(int slru_idx)
 {
-    slru_entry(slru_idx)->m_flush += 1;
+    slru_entry(slru_idx)->flush += 1;
+    have_slrustats = true;
 }
 
 void
 pgstat_count_slru_truncate(int slru_idx)
 {
-    slru_entry(slru_idx)->m_truncate += 1;
+    slru_entry(slru_idx)->truncate += 1;
+    have_slrustats = true;
 }
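
The SLRU hunks above replace message-buffer fields with plain local counters plus a `have_slrustats` dirty flag. A minimal sketch of that pattern follows, assuming a later flush step that folds the local counters into shared totals. The flush function, the `Sketch*`/`sketch_*` names, and the array size are hypothetical and not taken from the patch; the locking that real shared memory would need is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_SLRU_NUM_ELEMENTS 8	/* hypothetical size */

typedef struct SketchSLRUStats
{
	int64_t		blocks_zeroed;
	int64_t		blocks_hit;
} SketchSLRUStats;

/* Per-process counters, cheap to bump without any locking. */
static SketchSLRUStats sketch_local[SKETCH_SLRU_NUM_ELEMENTS];

/* Stand-in for the shared-memory totals (assumption: no lock shown). */
static SketchSLRUStats sketch_shared[SKETCH_SLRU_NUM_ELEMENTS];

/* Dirty flag: true when sketch_local holds anything not yet flushed. */
static bool sketch_have_stats = false;

static SketchSLRUStats *
sketch_slru_entry(int idx)
{
	assert(idx >= 0 && idx < SKETCH_SLRU_NUM_ELEMENTS);
	return &sketch_local[idx];
}

static void
sketch_count_page_hit(int idx)
{
	sketch_slru_entry(idx)->blocks_hit += 1;
	sketch_have_stats = true;
}

static void
sketch_count_page_zeroed(int idx)
{
	sketch_slru_entry(idx)->blocks_zeroed += 1;
	sketch_have_stats = true;
}

/*
 * Fold local counters into the shared totals and reset the local copy.
 * Returns false when there was nothing to flush.
 */
static bool
sketch_flush(void)
{
	int			i;

	if (!sketch_have_stats)
		return false;

	for (i = 0; i < SKETCH_SLRU_NUM_ELEMENTS; i++)
	{
		sketch_shared[i].blocks_zeroed += sketch_local[i].blocks_zeroed;
		sketch_shared[i].blocks_hit += sketch_local[i].blocks_hit;
	}

	memset(sketch_local, 0, sizeof(sketch_local));
	sketch_have_stats = false;
	return true;
}
```

The flag lets the flush step skip the whole array when nothing was counted, which is presumably why each counting function sets it.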
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index e8af05c04e..5d5802f61d 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -251,7 +251,6 @@ static pid_t StartupPID = 0,
             WalReceiverPID = 0,
             AutoVacPID = 0,
             PgArchPID = 0,
-            PgStatPID = 0,
             SysLoggerPID = 0;
 
 /* Startup process's status */
@@ -513,7 +512,6 @@ typedef struct
     PGPROC       *AuxiliaryProcs;
     PGPROC       *PreparedXactProcs;
     PMSignalData *PMSignalState;
-    InheritableSocket pgStatSock;
     pid_t        PostmasterPid;
     TimestampTz PgStartTime;
     TimestampTz PgReloadTime;
@@ -1332,12 +1330,6 @@ PostmasterMain(int argc, char *argv[])
      */
     RemovePgTempFiles();
 
-    /*
-     * Initialize stats collection subsystem (this does NOT start the
-     * collector process!)
-     */
-    pgstat_init();
-
     /*
      * Initialize the autovacuum subsystem (again, no process start yet)
      */
@@ -1787,11 +1779,6 @@ ServerLoop(void)
                 start_autovac_launcher = false; /* signal processed */
         }
 
-        /* If we have lost the stats collector, try to start a new one */
-        if (PgStatPID == 0 &&
-            (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
-
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
             PgArchPID = StartArchiver();
@@ -2680,8 +2667,6 @@ SIGHUP_handler(SIGNAL_ARGS)
             signal_child(PgArchPID, SIGHUP);
         if (SysLoggerPID != 0)
             signal_child(SysLoggerPID, SIGHUP);
-        if (PgStatPID != 0)
-            signal_child(PgStatPID, SIGHUP);
 
         /* Reload authentication config files too */
         if (!load_hba())
@@ -3010,8 +2995,6 @@ reaper(SIGNAL_ARGS)
                 AutoVacPID = StartAutoVacLauncher();
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = StartArchiver();
-            if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3078,13 +3061,6 @@ reaper(SIGNAL_ARGS)
                 SignalChildren(SIGUSR2);
 
                 pmState = PM_SHUTDOWN_2;
-
-                /*
-                 * We can also shut down the stats collector now; there's
-                 * nothing left for it to do.
-                 */
-                if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
             }
             else
             {
@@ -3163,22 +3139,6 @@ reaper(SIGNAL_ARGS)
             continue;
         }
 
-        /*
-         * Was it the statistics collector?  If so, just try to start a new
-         * one; no need to force reset of the rest of the system.  (If fail,
-         * we'll try again in future cycles of the main loop.)
-         */
-        if (pid == PgStatPID)
-        {
-            PgStatPID = 0;
-            if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
-            if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
-            continue;
-        }
-
         /* Was it the system logger?  If so, try to start a new one */
         if (pid == SysLoggerPID)
         {
@@ -3625,22 +3585,6 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
         signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
     }
 
-    /*
-     * Force a power-cycle of the pgstat process too.  (This isn't absolutely
-     * necessary, but it seems like a good idea for robustness, and it
-     * simplifies the state-machine logic in the case where a shutdown request
-     * arrives during crash processing.)
-     */
-    if (PgStatPID != 0 && take_action)
-    {
-        ereport(DEBUG2,
-                (errmsg_internal("sending %s to process %d",
-                                 "SIGQUIT",
-                                 (int) PgStatPID)));
-        signal_child(PgStatPID, SIGQUIT);
-        allow_immediate_pgstat_restart();
-    }
-
     /* We do NOT restart the syslogger */
 
     if (Shutdown != ImmediateShutdown)
@@ -3866,8 +3810,6 @@ PostmasterStateMachine(void)
                     SignalChildren(SIGQUIT);
                     if (PgArchPID != 0)
                         signal_child(PgArchPID, SIGQUIT);
-                    if (PgStatPID != 0)
-                        signal_child(PgStatPID, SIGQUIT);
                 }
             }
         }
@@ -3891,8 +3833,7 @@ PostmasterStateMachine(void)
     {
         /*
          * PM_WAIT_DEAD_END state ends when the BackendList is entirely empty
-         * (ie, no dead_end children remain), and the archiver and stats
-         * collector are gone too.
+         * (ie, no dead_end children remain), and the archiver is gone too.
          *
          * The reason we wait for those two is to protect them against a new
          * postmaster starting conflicting subprocesses; this isn't an
@@ -3902,8 +3843,7 @@ PostmasterStateMachine(void)
          * normal state transition leading up to PM_WAIT_DEAD_END, or during
          * FatalError processing.
          */
-        if (dlist_is_empty(&BackendList) &&
-            PgArchPID == 0 && PgStatPID == 0)
+        if (dlist_is_empty(&BackendList) && PgArchPID == 0)
         {
             /* These other guys should be dead already */
             Assert(StartupPID == 0);
@@ -4104,8 +4044,6 @@ TerminateChildren(int signal)
         signal_child(AutoVacPID, signal);
     if (PgArchPID != 0)
         signal_child(PgArchPID, signal);
-    if (PgStatPID != 0)
-        signal_child(PgStatPID, signal);
 }
 
 /*
@@ -5038,12 +4976,6 @@ SubPostmasterMain(int argc, char *argv[])
 
         StartBackgroundWorker();
     }
-    if (strcmp(argv[1], "--forkcol") == 0)
-    {
-        /* Do not want to attach to shared memory */
-
-        PgstatCollectorMain(argc, argv);    /* does not return */
-    }
     if (strcmp(argv[1], "--forklog") == 0)
     {
         /* Do not want to attach to shared memory */
@@ -5156,12 +5088,6 @@ sigusr1_handler(SIGNAL_ARGS)
     if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
         pmState == PM_RECOVERY && Shutdown == NoShutdown)
     {
-        /*
-         * Likewise, start other special children as needed.
-         */
-        Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
-
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
 
@@ -6078,7 +6004,6 @@ extern slock_t *ShmemLock;
 extern slock_t *ProcStructLock;
 extern PGPROC *AuxiliaryProcs;
 extern PMSignalData *PMSignalState;
-extern pgsocket pgStatSock;
 extern pg_time_t first_syslogger_file_time;
 
 #ifndef WIN32
@@ -6134,8 +6059,6 @@ save_backend_variables(BackendParameters *param, Port *port,
     param->AuxiliaryProcs = AuxiliaryProcs;
     param->PreparedXactProcs = PreparedXactProcs;
     param->PMSignalState = PMSignalState;
-    if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
-        return false;
 
     param->PostmasterPid = PostmasterPid;
     param->PgStartTime = PgStartTime;
@@ -6368,7 +6291,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     AuxiliaryProcs = param->AuxiliaryProcs;
     PreparedXactProcs = param->PreparedXactProcs;
     PMSignalState = param->PMSignalState;
-    read_inheritable_socket(&pgStatSock, &param->pgStatSock);
 
     PostmasterPid = param->PostmasterPid;
     PgStartTime = param->PgStartTime;
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 626fae8454..37d454003d 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -258,7 +258,7 @@ WalWriterMain(void)
             left_till_hibernate--;
 
         /* Send WAL statistics to the stats collector */
-        pgstat_send_wal(false);
+        pgstat_report_wal();
 
         /*
          * Sleep until we are signaled or WalWriterDelay has elapsed.  If we
@@ -293,17 +293,5 @@ HandleWalWriterInterrupts(void)
     }
 
     if (ShutdownRequestPending)
-    {
-        /*
-         * Force to send remaining WAL statistics to the stats collector at
-         * process exit.
-         *
-         * Since pgstat_send_wal is invoked with 'force' is false in main loop
-         * to avoid overloading to the stats collector, there may exist unsent
-         * stats counters for the WAL writer.
-         */
-        pgstat_send_wal(true);
-
         proc_exit(0);
-    }
 }
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 56cd473f9f..3d35b55cf9 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1568,8 +1568,8 @@ is_checksummed_file(const char *fullpath, const char *filename)
  *
  * If 'missing_ok' is true, will not throw an error if the file is not found.
  *
- * If dboid is anything other than InvalidOid then any checksum failures detected
- * will get reported to the stats collector.
+ * If dboid is anything other than InvalidOid then any checksum failures
+ * detected will get reported to the activity stats facility.
  *
  * Returns true if the file was successfully sent, false if 'missing_ok',
  * and the file did not exist.
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 75a087c2f9..1f4ba2cafc 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -700,14 +700,10 @@ ReplicationSlotDropPtr(ReplicationSlot *slot)
                 (errmsg("could not remove directory \"%s\"", tmppath)));
 
     /*
-     * Send a message to drop the replication slot to the stats collector.
-     * Since there is no guarantee of the order of message transfer on a UDP
-     * connection, it's possible that a message for creating a new slot
-     * reaches before a message for removing the old slot. We send the drop
-     * and create messages while holding ReplicationSlotAllocationLock to
-     * reduce that possibility. If the messages reached in reverse, we would
-     * lose one statistics update message. But the next update message will
-     * create the statistics for the replication slot.
+     * Drop the statistics entry for the replication slot.  Do this while
+     * holding ReplicationSlotAllocationLock so that we don't drop a statistics
+     * entry for another slot with the same name just created in another
+     * session.
      */
     if (SlotIsLogical(slot))
         pgstat_report_replslot_drop(NameStr(slot->data.name));
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 852138f9c9..197df48ae8 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2065,7 +2065,7 @@ BufferSync(int flags)
             if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
             {
                 TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-                BgWriterStats.m_buf_written_checkpoints++;
+                CheckPointerStats.buf_written_checkpoints++;
                 num_written++;
             }
         }
@@ -2175,7 +2175,7 @@ BgBufferSync(WritebackContext *wb_context)
     strategy_buf_id = StrategySyncStart(&strategy_passes, &recent_alloc);
 
     /* Report buffer alloc counts to pgstat */
-    BgWriterStats.m_buf_alloc += recent_alloc;
+    BgWriterStats.buf_alloc += recent_alloc;
 
     /*
      * If we're not running the LRU scan, just stop after doing the stats
@@ -2365,7 +2365,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
             if (++num_written >= bgwriter_lru_maxpages)
             {
-                BgWriterStats.m_maxwritten_clean++;
+                BgWriterStats.maxwritten_clean++;
                 break;
             }
         }
@@ -2373,7 +2373,7 @@ BgBufferSync(WritebackContext *wb_context)
             reusable_buffers++;
     }
 
-    BgWriterStats.m_buf_written_clean += num_written;
+    BgWriterStats.buf_written_clean += num_written;
 
 #ifdef BGW_DEBUG
     elog(DEBUG1, "bgwriter: recent_alloc=%u smoothed=%.2f delta=%ld ahead=%d density=%.2f reusable_est=%d upcoming_est=%d scanned=%d wrote=%d reusable=%d",
 
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 3e4ec53a97..9f2640211f 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -150,6 +150,7 @@ CreateSharedMemoryAndSemaphores(void)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -269,6 +270,7 @@ CreateSharedMemoryAndSemaphores(void)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index adf19aba75..918c884823 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -175,7 +175,9 @@ static const char *const BuiltinTrancheNames[] = {
     /* LWTRANCHE_PARALLEL_APPEND: */
     "ParallelAppend",
     /* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
-    "PerXactPredicateList"
+    "PerXactPredicateList",
+    /* LWTRANCHE_STATS: */
+    "ActivityStatistics"
 };
 
 StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index 6c7cf6c295..8edb41c1cf 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -53,3 +53,4 @@ XactTruncationLock                    44
 # 45 was XactTruncationLock until removal of BackendRandomLock
 WrapLimitsVacuumLock                46
 NotifyQueueTailLock                    47
+StatsLock                            48
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 4dc24649df..75d1695576 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -414,8 +414,8 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
     }
 
     /*
-     * It'd be nice to tell the stats collector to forget them immediately,
-     * too. But we can't because we don't know the OIDs.
+     * It'd be nice to tell the activity stats facility to forget them
+     * immediately, too. But we can't because we don't know the OIDs.
      */
 
     /*
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 2b1b68109f..b73c917b8f 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3295,6 +3295,12 @@ ProcessInterrupts(void)
 
     if (ParallelMessagePending)
         HandleParallelMessages();
+
+    if (IdleStatsUpdateTimeoutPending)
+    {
+        IdleStatsUpdateTimeoutPending = false;
+        pgstat_report_stat(true);
+    }
 }
 
 
@@ -3866,6 +3872,7 @@ PostgresMain(int argc, char *argv[],
     volatile bool send_ready_for_query = true;
     bool        idle_in_transaction_timeout_enabled = false;
     bool        idle_session_timeout_enabled = false;
+    bool        idle_stats_update_timeout_enabled = false;
 
     /* Initialize startup process environment if necessary. */
     if (!IsUnderPostmaster)
@@ -4262,11 +4269,12 @@ PostgresMain(int argc, char *argv[],
          * Note: this includes fflush()'ing the last of the prior output.
          *
          * This is also a good time to send collected statistics to the
-         * collector, and to update the PS stats display.  We avoid doing
-         * those every time through the message loop because it'd slow down
-         * processing of batched messages, and because we don't want to report
-         * uncommitted updates (that confuses autovacuum).  The notification
-         * processor wants a call too, if we are not in a transaction block.
+         * activity stats facility, and to update the PS stats display.  We avoid
+         * doing those every time through the message loop because it'd slow
+         * down processing of batched messages, and because we don't want to
+         * report uncommitted updates (that confuses autovacuum).  The
+         * notification processor wants a call too, if we are not in a
+         * transaction block.
          *
          * Also, if an idle timeout is enabled, start the timer for that.
          */
@@ -4300,6 +4308,8 @@ PostgresMain(int argc, char *argv[],
             }
             else
             {
+                long stats_timeout;
+
                 /* Send out notify signals and transmit self-notifies */
                 ProcessCompletedNotifies();
 
@@ -4312,8 +4322,14 @@ PostgresMain(int argc, char *argv[],
                 if (notifyInterruptPending)
                     ProcessNotifyInterrupt();
 
-                pgstat_report_stat(false);
-
+                /* Start the idle-stats-update timer */
+                stats_timeout = pgstat_report_stat(false);
+                if (stats_timeout > 0)
+                {
+                    idle_stats_update_timeout_enabled = true;
+                    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
+                                         stats_timeout);
+                }
                 set_ps_display("idle");
                 pgstat_report_activity(STATE_IDLE, NULL);
 
@@ -4347,9 +4363,9 @@ PostgresMain(int argc, char *argv[],
         firstchar = ReadCommand(&input_message);
 
         /*
-         * (4) turn off the idle-in-transaction and idle-session timeouts, if
-         * active.  We do this before step (5) so that any last-moment timeout
-         * is certain to be detected in step (5).
+         * (4) turn off the idle-in-transaction, idle-session and
+         * idle-stats-update timeouts if active.  We do this before step (5) so
+         * that any last-moment timeout is certain to be detected in step (5).
          *
          * At most one of these timeouts will be active, so there's no need to
          * worry about combining the timeout.c calls into one.
@@ -4364,6 +4380,11 @@ PostgresMain(int argc, char *argv[],
             disable_timeout(IDLE_SESSION_TIMEOUT, false);
             idle_session_timeout_enabled = false;
         }
+        if (idle_stats_update_timeout_enabled)
+        {
+            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
+            idle_stats_update_timeout_enabled = false;
+        }
 
         /*
          * (5) disable async signal conditions again.
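
The postgres.c hunks above arm a timer when `pgstat_report_stat(false)` returns a positive value, force a report when that timer fires, and disarm it before the next command is processed. A minimal sketch of that control flow follows, assuming the return value means "milliseconds until the next report is allowed". That meaning, the minimum interval, the explicit clock parameter, and every `sketch_*` name are assumptions made for illustration; they are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MIN_INTERVAL_MS 1000L	/* hypothetical minimum flush interval */

static bool sketch_pending = false;		/* unreported local stats exist */
static long sketch_last_flush_ms = 0;	/* time of the last flush */
static int	sketch_flush_count = 0;
static bool sketch_timer_armed = false;
static long sketch_timer_delay_ms = 0;

static void
sketch_note_activity(void)
{
	sketch_pending = true;
}

/*
 * Flush pending stats unless the last flush was too recent.  Returns 0 when
 * nothing remains pending, otherwise the time to wait before retrying.
 */
static long
sketch_report_stat(bool force, long now_ms)
{
	long		elapsed;

	if (!sketch_pending)
		return 0;

	elapsed = now_ms - sketch_last_flush_ms;
	if (!force && elapsed < SKETCH_MIN_INTERVAL_MS)
		return SKETCH_MIN_INTERVAL_MS - elapsed;

	sketch_flush_count++;
	sketch_pending = false;
	sketch_last_flush_ms = now_ms;
	return 0;
}

/* Backend becomes idle: try to report, arm the timer if stats remain. */
static void
sketch_on_idle(long now_ms)
{
	long		timeout = sketch_report_stat(false, now_ms);

	if (timeout > 0)
	{
		sketch_timer_armed = true;
		sketch_timer_delay_ms = timeout;
	}
}

/* Timer fired while still idle: force the report. */
static void
sketch_on_timeout(long now_ms)
{
	sketch_timer_armed = false;
	(void) sketch_report_stat(true, now_ms);
}

/* A new command arrived: disarm the timer, as step (4) does. */
static void
sketch_on_command(void)
{
	if (sketch_timer_armed)
		sketch_timer_armed = false;
}
```

The point of the timer in this sketch is that an idle backend still publishes its last counters after a bounded delay, rather than holding them until its next command.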
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 5102227a60..daa062b603 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * pgstatfuncs.c
- *      Functions for accessing the statistics collector data
+ *      Functions for accessing the activity statistics data
  *
  * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -35,9 +35,6 @@
 
 #define HAS_PGSTAT_PERMISSIONS(role)     (is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS) || has_privs_of_role(GetUserId(),role))
 
 
-/* Global bgwriter statistics, from bgwriter.c */
-extern PgStat_MsgBgWriter bgwriterStats;
-
 Datum
 pg_stat_get_numscans(PG_FUNCTION_ARGS)
 {
@@ -1267,7 +1264,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_commit);
+        result = (int64) (dbentry->counts.n_xact_commit);
 
     PG_RETURN_INT64(result);
 }
@@ -1283,7 +1280,7 @@ pg_stat_get_db_xact_rollback(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_xact_rollback);
+        result = (int64) (dbentry->counts.n_xact_rollback);
 
     PG_RETURN_INT64(result);
 }
@@ -1299,7 +1296,7 @@ pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_fetched);
+        result = (int64) (dbentry->counts.n_blocks_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1315,7 +1312,7 @@ pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_blocks_hit);
+        result = (int64) (dbentry->counts.n_blocks_hit);
 
     PG_RETURN_INT64(result);
 }
@@ -1331,7 +1328,7 @@ pg_stat_get_db_tuples_returned(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_returned);
+        result = (int64) (dbentry->counts.n_tuples_returned);
 
     PG_RETURN_INT64(result);
 }
@@ -1347,7 +1344,7 @@ pg_stat_get_db_tuples_fetched(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_fetched);
+        result = (int64) (dbentry->counts.n_tuples_fetched);
 
     PG_RETURN_INT64(result);
 }
@@ -1363,7 +1360,7 @@ pg_stat_get_db_tuples_inserted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_inserted);
+        result = (int64) (dbentry->counts.n_tuples_inserted);
 
     PG_RETURN_INT64(result);
 }
@@ -1379,7 +1376,7 @@ pg_stat_get_db_tuples_updated(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_updated);
+        result = (int64) (dbentry->counts.n_tuples_updated);
 
     PG_RETURN_INT64(result);
 }
@@ -1395,7 +1392,7 @@ pg_stat_get_db_tuples_deleted(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_tuples_deleted);
+        result = (int64) (dbentry->counts.n_tuples_deleted);
 
     PG_RETURN_INT64(result);
 }
@@ -1428,7 +1425,7 @@ pg_stat_get_db_temp_files(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_files;
+        result = dbentry->counts.n_temp_files;
 
     PG_RETURN_INT64(result);
 }
@@ -1444,7 +1441,7 @@ pg_stat_get_db_temp_bytes(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = dbentry->n_temp_bytes;
+        result = dbentry->counts.n_temp_bytes;
 
     PG_RETURN_INT64(result);
 }
@@ -1459,7 +1456,7 @@ pg_stat_get_db_conflict_tablespace(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace);
+        result = (int64) (dbentry->counts.n_conflict_tablespace);
 
     PG_RETURN_INT64(result);
 }
@@ -1474,7 +1471,7 @@ pg_stat_get_db_conflict_lock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_lock);
+        result = (int64) (dbentry->counts.n_conflict_lock);
 
     PG_RETURN_INT64(result);
 }
@@ -1489,7 +1486,7 @@ pg_stat_get_db_conflict_snapshot(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_snapshot);
+        result = (int64) (dbentry->counts.n_conflict_snapshot);
 
     PG_RETURN_INT64(result);
 }
@@ -1504,7 +1501,7 @@ pg_stat_get_db_conflict_bufferpin(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_bufferpin);
+        result = (int64) (dbentry->counts.n_conflict_bufferpin);
 
     PG_RETURN_INT64(result);
 }
@@ -1519,7 +1516,7 @@ pg_stat_get_db_conflict_startup_deadlock(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1534,11 +1531,11 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_conflict_tablespace +
-                          dbentry->n_conflict_lock +
-                          dbentry->n_conflict_snapshot +
-                          dbentry->n_conflict_bufferpin +
-                          dbentry->n_conflict_startup_deadlock);
+        result = (int64) (dbentry->counts.n_conflict_tablespace +
+                          dbentry->counts.n_conflict_lock +
+                          dbentry->counts.n_conflict_snapshot +
+                          dbentry->counts.n_conflict_bufferpin +
+                          dbentry->counts.n_conflict_startup_deadlock);
 
     PG_RETURN_INT64(result);
 }
@@ -1553,7 +1550,7 @@ pg_stat_get_db_deadlocks(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_deadlocks);
+        result = (int64) (dbentry->counts.n_deadlocks);
 
     PG_RETURN_INT64(result);
 }
@@ -1571,7 +1568,7 @@ pg_stat_get_db_checksum_failures(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = (int64) (dbentry->n_checksum_failures);
+        result = (int64) (dbentry->counts.n_checksum_failures);
 
     PG_RETURN_INT64(result);
 }
@@ -1608,7 +1605,7 @@ pg_stat_get_db_blk_read_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_read_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_read_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1624,7 +1621,7 @@ pg_stat_get_db_blk_write_time(PG_FUNCTION_ARGS)
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
         result = 0;
     else
-        result = ((double) dbentry->n_block_write_time) / 1000.0;
+        result = ((double) dbentry->counts.n_block_write_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1638,7 +1635,7 @@ pg_stat_get_db_session_time(PG_FUNCTION_ARGS)
 
     /* convert counter from microsec to millisec for display */
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) != NULL)
-        result = ((double) dbentry->total_session_time) / 1000.0;
+        result = ((double) dbentry->counts.total_session_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1652,7 +1649,7 @@ pg_stat_get_db_active_time(PG_FUNCTION_ARGS)
 
     /* convert counter from microsec to millisec for display */
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) != NULL)
-        result = ((double) dbentry->total_active_time) / 1000.0;
+        result = ((double) dbentry->counts.total_active_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1666,7 +1663,7 @@ pg_stat_get_db_idle_in_transaction_time(PG_FUNCTION_ARGS)
 
     /* convert counter from microsec to millisec for display */
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) != NULL)
-        result = ((double) dbentry->total_idle_in_xact_time) / 1000.0;
+        result = ((double) dbentry->counts.total_idle_in_xact_time) / 1000.0;
 
     PG_RETURN_FLOAT8(result);
 }
@@ -1679,7 +1676,7 @@ pg_stat_get_db_sessions(PG_FUNCTION_ARGS)
     PgStat_StatDBEntry *dbentry;
 
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) != NULL)
-        result = (int64) (dbentry->n_sessions);
+        result = (int64) (dbentry->counts.n_sessions);
 
     PG_RETURN_INT64(result);
 }
@@ -1692,7 +1689,7 @@ pg_stat_get_db_sessions_abandoned(PG_FUNCTION_ARGS)
     PgStat_StatDBEntry *dbentry;
 
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) != NULL)
-        result = (int64) (dbentry->n_sessions_abandoned);
+        result = (int64) (dbentry->counts.n_sessions_abandoned);
 
     PG_RETURN_INT64(result);
 }
@@ -1705,7 +1702,7 @@ pg_stat_get_db_sessions_fatal(PG_FUNCTION_ARGS)
     PgStat_StatDBEntry *dbentry;
 
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) != NULL)
-        result = (int64) (dbentry->n_sessions_fatal);
+        result = (int64) (dbentry->counts.n_sessions_fatal);
 
     PG_RETURN_INT64(result);
 }
@@ -1718,7 +1715,7 @@ pg_stat_get_db_sessions_killed(PG_FUNCTION_ARGS)
     PgStat_StatDBEntry *dbentry;
 
     if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) != NULL)
-        result = (int64) (dbentry->n_sessions_killed);
+        result = (int64) (dbentry->counts.n_sessions_killed);
 
     PG_RETURN_INT64(result);
 }
@@ -1726,69 +1723,71 @@ pg_stat_get_db_sessions_killed(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_bgwriter_timed_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->timed_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->timed_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_requested_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->requested_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->requested_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_buf_written_checkpoints(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_checkpoints);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_checkpoints);
 }
 
 Datum
 pg_stat_get_bgwriter_buf_written_clean(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_clean);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_written_clean);
 }
 
 Datum
 pg_stat_get_bgwriter_maxwritten_clean(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->maxwritten_clean);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->maxwritten_clean);
 }
 
 Datum
 pg_stat_get_checkpoint_write_time(PG_FUNCTION_ARGS)
 {
     /* time is already in msec, just convert to double for presentation */
-    PG_RETURN_FLOAT8((double) pgstat_fetch_global()->checkpoint_write_time);
+    PG_RETURN_FLOAT8((double)
+                     pgstat_fetch_stat_checkpointer()->checkpoint_write_time);
 }
 
 Datum
 pg_stat_get_checkpoint_sync_time(PG_FUNCTION_ARGS)
 {
     /* time is already in msec, just convert to double for presentation */
-    PG_RETURN_FLOAT8((double) pgstat_fetch_global()->checkpoint_sync_time);
+    PG_RETURN_FLOAT8((double)
+                     pgstat_fetch_stat_checkpointer()->checkpoint_sync_time);
 }
 
 Datum
 pg_stat_get_bgwriter_stat_reset_time(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_global()->stat_reset_timestamp);
+    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_stat_bgwriter()->stat_reset_timestamp);
 }
 
 Datum
 pg_stat_get_buf_written_backend(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_written_backend);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_written_backend);
 }
 
 Datum
 pg_stat_get_buf_fsync_backend(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_fsync_backend);
+    PG_RETURN_INT64(pgstat_fetch_stat_checkpointer()->buf_fsync_backend);
 }
 
 Datum
 pg_stat_get_buf_alloc(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_INT64(pgstat_fetch_global()->buf_alloc);
+    PG_RETURN_INT64(pgstat_fetch_stat_bgwriter()->buf_alloc);
 }
 
 /*
@@ -1802,7 +1801,7 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
     Datum        values[PG_STAT_GET_WAL_COLS];
     bool        nulls[PG_STAT_GET_WAL_COLS];
     char        buf[256];
-    PgStat_WalStats *wal_stats;
+    PgStat_Wal *wal_stats;
 
     /* Initialise values and NULL flags arrays */
     MemSet(values, 0, sizeof(values));
@@ -1835,11 +1834,11 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
     wal_stats = pgstat_fetch_stat_wal();
 
     /* Fill values and NULLs */
-    values[0] = Int64GetDatum(wal_stats->wal_records);
-    values[1] = Int64GetDatum(wal_stats->wal_fpi);
+    values[0] = Int64GetDatum(wal_stats->wal_usage.wal_records);
+    values[1] = Int64GetDatum(wal_stats->wal_usage.wal_fpi);
 
     /* Convert to numeric. */
-    snprintf(buf, sizeof buf, UINT64_FORMAT, wal_stats->wal_bytes);
+    snprintf(buf, sizeof buf, UINT64_FORMAT, wal_stats->wal_usage.wal_bytes);
     values[2] = DirectFunctionCall3(numeric_in,
                                     CStringGetDatum(buf),
                                     ObjectIdGetDatum(0),
@@ -2127,7 +2126,7 @@ pg_stat_get_xact_function_self_time(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_snapshot_timestamp(PG_FUNCTION_ARGS)
 {
-    PG_RETURN_TIMESTAMPTZ(pgstat_fetch_global()->stats_timestamp);
+    PG_RETURN_TIMESTAMPTZ(pgstat_get_stat_timestamp());
 }
 
 /* Discard the active statistics snapshot */
@@ -2215,7 +2214,7 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS)
     TupleDesc    tupdesc;
     Datum        values[7];
     bool        nulls[7];
-    PgStat_ArchiverStats *archiver_stats;
+    PgStat_Archiver *archiver_stats;
 
     /* Initialise values and NULL flags arrays */
     MemSet(values, 0, sizeof(values));
@@ -2285,7 +2284,7 @@ pg_stat_get_replication_slots(PG_FUNCTION_ARGS)
     Tuplestorestate *tupstore;
     MemoryContext per_query_ctx;
     MemoryContext oldcontext;
-    PgStat_ReplSlotStats *slotstats;
+    PgStat_ReplSlot *slotstats;
     int            nstats;
     int            i;
 
@@ -2318,7 +2317,7 @@ pg_stat_get_replication_slots(PG_FUNCTION_ARGS)
     {
         Datum        values[PG_STAT_GET_REPLICATION_SLOT_COLS];
         bool        nulls[PG_STAT_GET_REPLICATION_SLOT_COLS];
-        PgStat_ReplSlotStats *s = &(slotstats[i]);
+        PgStat_ReplSlot *s = &(slotstats[i]);
 
         MemSet(values, 0, sizeof(values));
         MemSet(nulls, 0, sizeof(nulls));
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 7ef510cd01..0762c2034c 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -72,6 +72,7 @@
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/optimizer.h"
+#include "pgstat.h"
 #include "rewrite/rewriteDefine.h"
 #include "rewrite/rowsecurity.h"
 #include "storage/lmgr.h"
@@ -2366,6 +2367,9 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
      */
     RelationCloseSmgr(relation);
 
+    /* break mutual link with stats entry */
+    pgstat_delinkstats(relation);
+
     /*
      * Free all the subsidiary data structures of the relcache entry, then the
      * entry itself.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 73e0a672ae..67e15be3c7 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -34,6 +34,7 @@ volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
 volatile sig_atomic_t IdleSessionTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
+volatile sig_atomic_t IdleStatsUpdateTimeoutPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
 volatile uint32 CritSectionCount = 0;
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 8b73850d0d..5c32d747cf 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -271,9 +271,6 @@ GetBackendTypeDesc(BackendType backendType)
         case B_ARCHIVER:
             backendDesc = "archiver";
             break;
-        case B_STATS_COLLECTOR:
-            backendDesc = "stats collector";
-            break;
         case B_LOGGER:
             backendDesc = "logger";
             break;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 7abeccb536..32a22aaa24 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
 static void IdleSessionTimeoutHandler(void);
+static void IdleStatsUpdateTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -620,6 +621,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
         RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
                         IdleInTransactionSessionTimeoutHandler);
         RegisterTimeout(IDLE_SESSION_TIMEOUT, IdleSessionTimeoutHandler);
+        RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+                        IdleStatsUpdateTimeoutHandler);
     }
 
     /*
@@ -1242,6 +1245,14 @@ IdleSessionTimeoutHandler(void)
     SetLatch(MyLatch);
 }
 
+static void
+IdleStatsUpdateTimeoutHandler(void)
+{
+    IdleStatsUpdateTimeoutPending = true;
+    InterruptPending = true;
+    SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b263e3493b..842a3d755d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -746,8 +746,8 @@ const char *const config_group_names[] =
     gettext_noop("Statistics"),
     /* STATS_MONITORING */
     gettext_noop("Statistics / Monitoring"),
-    /* STATS_COLLECTOR */
-    gettext_noop("Statistics / Query and Index Statistics Collector"),
+    /* STATS_ACTIVITY */
+    gettext_noop("Statistics / Query and Index Activity Statistics"),
     /* AUTOVACUUM */
     gettext_noop("Autovacuum"),
     /* CLIENT_CONN */
@@ -1467,7 +1467,7 @@ static struct config_bool ConfigureNamesBool[] =
 #endif
 
     {
-        {"track_activities", PGC_SUSET, STATS_COLLECTOR,
+        {"track_activities", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects information about executing commands."),
             gettext_noop("Enables the collection of information on the currently "
                          "executing command of each session, along with "
@@ -1478,7 +1478,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_counts", PGC_SUSET, STATS_COLLECTOR,
+        {"track_counts", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects statistics on database activity."),
             NULL
         },
@@ -1487,7 +1487,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_io_timing", PGC_SUSET, STATS_COLLECTOR,
+        {"track_io_timing", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects timing statistics for database I/O activity."),
             NULL
         },
@@ -1496,7 +1496,7 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
-        {"track_wal_io_timing", PGC_SUSET, STATS_COLLECTOR,
+        {"track_wal_io_timing", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects timing statistics for WAL I/O activity."),
             NULL
         },
@@ -4376,7 +4376,7 @@ static struct config_string ConfigureNamesString[] =
     },
 
     {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
+        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
             gettext_noop("Writes temporary statistics files to the specified directory."),
             NULL,
             GUC_SUPERUSER_ONLY
@@ -4712,7 +4712,7 @@ static struct config_enum ConfigureNamesEnum[] =
     },
 
     {
-        {"track_functions", PGC_SUSET, STATS_COLLECTOR,
+        {"track_functions", PGC_SUSET, STATS_ACTIVITY,
             gettext_noop("Collects function-level statistics on database activity."),
             NULL
         },
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 6647f8fd6e..ffa65d2ad3 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -582,7 +582,7 @@
 # STATISTICS
 #------------------------------------------------------------------------------
 
-# - Query and Index Statistics Collector -
+# - Query and Index Activity Statistics -
 
 #track_activities = on
 #track_counts = on
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 019c15c62f..56902a731b 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -6,7 +6,7 @@ use File::Basename qw(basename dirname);
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 110;
+use Test::More tests => 109;
 
 program_help_ok('pg_basebackup');
 program_version_ok('pg_basebackup');
@@ -124,7 +124,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 013850ac28..fe0e4403dc 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -84,6 +84,8 @@ extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleSessionTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
+extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
 
@@ -320,7 +322,6 @@ typedef enum BackendType
     B_WAL_SENDER,
     B_WAL_WRITER,
     B_ARCHIVER,
-    B_STATS_COLLECTOR,
     B_LOGGER,
 } BackendType;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index be43c04802..f4177eb284 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1,7 +1,7 @@
 /* ----------
  *    pgstat.h
  *
- *    Definitions for the PostgreSQL statistics collector daemon.
+ *    Definitions for the PostgreSQL activity statistics facility.
  *
  *    Copyright (c) 2001-2021, PostgreSQL Global Development Group
  *
@@ -12,12 +12,15 @@
 #define PGSTAT_H
 
 #include "datatype/timestamp.h"
+#include "executor/instrument.h"
 #include "libpq/pqcomm.h"
 #include "miscadmin.h"
 #include "port/atomics.h"
+#include "lib/dshash.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"
 #include "storage/proc.h"
+#include "storage/lwlock.h"
 #include "utils/hsearch.h"
 #include "utils/relcache.h"
 
@@ -27,11 +30,11 @@
  * ----------
  */
 #define PGSTAT_STAT_PERMANENT_DIRECTORY        "pg_stat"
-#define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
-#define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
+#define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/saved_stats"
+#define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/saved_stats.tmp"
 
 /* Default directory to store temporary statistics data in */
-#define PG_STAT_TMP_DIR        "pg_stat_tmp"
+#define PG_STAT_TMP_DIR                "pg_stat_tmp"
 
 /* Values for track_functions GUC variable --- order is significant! */
 typedef enum TrackFunctionsLevel
@@ -51,39 +54,6 @@ typedef enum SessionEndType
     DISCONNECT_KILLED
 } SessionEndType;
 
-/* ----------
- * The types of backend -> collector messages
- * ----------
- */
-typedef enum StatMsgType
-{
-    PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
-    PGSTAT_MTYPE_TABSTAT,
-    PGSTAT_MTYPE_TABPURGE,
-    PGSTAT_MTYPE_DROPDB,
-    PGSTAT_MTYPE_RESETCOUNTER,
-    PGSTAT_MTYPE_RESETSHAREDCOUNTER,
-    PGSTAT_MTYPE_RESETSINGLECOUNTER,
-    PGSTAT_MTYPE_RESETSLRUCOUNTER,
-    PGSTAT_MTYPE_RESETREPLSLOTCOUNTER,
-    PGSTAT_MTYPE_AUTOVAC_START,
-    PGSTAT_MTYPE_VACUUM,
-    PGSTAT_MTYPE_ANALYZE,
-    PGSTAT_MTYPE_ARCHIVER,
-    PGSTAT_MTYPE_BGWRITER,
-    PGSTAT_MTYPE_WAL,
-    PGSTAT_MTYPE_SLRU,
-    PGSTAT_MTYPE_FUNCSTAT,
-    PGSTAT_MTYPE_FUNCPURGE,
-    PGSTAT_MTYPE_RECOVERYCONFLICT,
-    PGSTAT_MTYPE_TEMPFILE,
-    PGSTAT_MTYPE_DEADLOCK,
-    PGSTAT_MTYPE_CHECKSUMFAILURE,
-    PGSTAT_MTYPE_REPLSLOT,
-    PGSTAT_MTYPE_CONNECTION,
-} StatMsgType;
-
 /* ----------
  * The data type used for counters.
  * ----------
@@ -94,9 +64,8 @@ typedef int64 PgStat_Counter;
  * PgStat_TableCounts            The actual per-table counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
- * It is a component of PgStat_TableStatus (within-backend state) and
- * PgStat_TableEntry (the transmitted message format).
+ * it against zeroes to detect whether there are any counts to write.
+ * It is a component of PgStat_TableStatus (within-backend state).
  *
  * Note: for a table, tuples_returned is the number of tuples successfully
  * fetched by heap_getnext, while tuples_fetched is the number of tuples
@@ -170,10 +139,13 @@ typedef enum PgStat_Single_Reset_Type
  */
 typedef struct PgStat_TableStatus
 {
-    Oid            t_id;            /* table's OID */
-    bool        t_shared;        /* is it a shared catalog? */
     struct PgStat_TableXactStatus *trans;    /* lowest subxact's counts */
+    TimestampTz vacuum_timestamp;    /* user-initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum-initiated vacuum */
+    TimestampTz analyze_timestamp;    /* user-initiated analyze */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum-initiated analyze */
     PgStat_TableCounts t_counts;    /* event counts to be sent */
+    Relation    relation;            /* rel that is using this entry */
 } PgStat_TableStatus;
 
 /* ----------
@@ -197,359 +169,57 @@ typedef struct PgStat_TableXactStatus
     struct PgStat_TableXactStatus *next;    /* next of same subxact */
 } PgStat_TableXactStatus;
 
-
-/* ------------------------------------------------------------
- * Message formats follow
- * ------------------------------------------------------------
- */
-
-
-/* ----------
- * PgStat_MsgHdr                The common message header
- * ----------
+/*
+ * Archiver statistics kept in the shared stats
  */
-typedef struct PgStat_MsgHdr
+typedef struct PgStat_Archiver
 {
-    StatMsgType m_type;
-    int            m_size;
-} PgStat_MsgHdr;
-
-/* ----------
- * Space available in a message.  This will keep the UDP packets below 1K,
- * which should fit unfragmented into the MTU of the loopback interface.
- * (Larger values of PGSTAT_MAX_MSG_SIZE would work for that on most
- * platforms, but we're being conservative here.)
- * ----------
- */
-#define PGSTAT_MAX_MSG_SIZE 1000
-#define PGSTAT_MSG_PAYLOAD    (PGSTAT_MAX_MSG_SIZE - sizeof(PgStat_MsgHdr))
-
+    PgStat_Counter archived_count;    /* archival successes */
+    char        last_archived_wal[MAX_XFN_CHARS + 1];    /* last WAL file
+                                                         * archived */
+    TimestampTz last_archived_timestamp;    /* last archival success time */
+    PgStat_Counter failed_count;    /* failed archival attempts */
+    char        last_failed_wal[MAX_XFN_CHARS + 1]; /* WAL file involved in
+                                                     * last failure */
+    TimestampTz last_failed_timestamp;    /* last archival failure time */
+    TimestampTz stat_reset_timestamp;
+} PgStat_Archiver;
 
 /* ----------
- * PgStat_MsgDummy                A dummy message, ignored by the collector
+ * PgStat_BgWriter            bgwriter statistics
  * ----------
  */
-typedef struct PgStat_MsgDummy
+typedef struct PgStat_BgWriter
 {
-    PgStat_MsgHdr m_hdr;
-} PgStat_MsgDummy;
-
+    PgStat_Counter buf_written_clean;
+    PgStat_Counter maxwritten_clean;
+    PgStat_Counter buf_alloc;
+    TimestampTz       stat_reset_timestamp;
+} PgStat_BgWriter;
 
 /* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
+ * PgStat_CheckPointer        checkpointer statistics
  * ----------
  */
-
-typedef struct PgStat_MsgInquiry
+typedef struct PgStat_CheckPointer
 {
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
-/* ----------
- * PgStat_TableEntry            Per-table info in a MsgTabstat
- * ----------
- */
-typedef struct PgStat_TableEntry
-{
-    Oid            t_id;
-    PgStat_TableCounts t_counts;
-} PgStat_TableEntry;
-
-/* ----------
- * PgStat_MsgTabstat            Sent by the backend to report table
- *                                and buffer access statistics.
- * ----------
- */
-#define PGSTAT_NUM_TABENTRIES  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - 3 * sizeof(int) - 2 * sizeof(PgStat_Counter))    \
-     / sizeof(PgStat_TableEntry))
-
-typedef struct PgStat_MsgTabstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    int            m_xact_commit;
-    int            m_xact_rollback;
-    PgStat_Counter m_block_read_time;    /* times in microseconds */
-    PgStat_Counter m_block_write_time;
-    PgStat_TableEntry m_entry[PGSTAT_NUM_TABENTRIES];
-} PgStat_MsgTabstat;
-
-
-/* ----------
- * PgStat_MsgTabpurge            Sent by the backend to tell the collector
- *                                about dead tables.
- * ----------
- */
-#define PGSTAT_NUM_TABPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgTabpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_tableid[PGSTAT_NUM_TABPURGE];
-} PgStat_MsgTabpurge;
-
-
-/* ----------
- * PgStat_MsgDropdb                Sent by the backend to tell the collector
- *                                about a dropped database
- * ----------
- */
-typedef struct PgStat_MsgDropdb
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDropdb;
-
-
-/* ----------
- * PgStat_MsgResetcounter        Sent by the backend to tell the collector
- *                                to reset counters
- * ----------
- */
-typedef struct PgStat_MsgResetcounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgResetcounter;
-
-/* ----------
- * PgStat_MsgResetsharedcounter Sent by the backend to tell the collector
- *                                to reset a shared counter
- * ----------
- */
-typedef struct PgStat_MsgResetsharedcounter
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Shared_Reset_Target m_resettarget;
-} PgStat_MsgResetsharedcounter;
-
-/* ----------
- * PgStat_MsgResetsinglecounter Sent by the backend to tell the collector
- *                                to reset a single counter
- * ----------
- */
-typedef struct PgStat_MsgResetsinglecounter
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Single_Reset_Type m_resettype;
-    Oid            m_objectid;
-} PgStat_MsgResetsinglecounter;
-
-/* ----------
- * PgStat_MsgResetslrucounter Sent by the backend to tell the collector
- *                                to reset a SLRU counter
- * ----------
- */
-typedef struct PgStat_MsgResetslrucounter
-{
-    PgStat_MsgHdr m_hdr;
-    int            m_index;
-} PgStat_MsgResetslrucounter;
-
-/* ----------
- * PgStat_MsgResetreplslotcounter Sent by the backend to tell the collector
- *                                to reset replication slot counter(s)
- * ----------
- */
-typedef struct PgStat_MsgResetreplslotcounter
-{
-    PgStat_MsgHdr m_hdr;
-    char        m_slotname[NAMEDATALEN];
-    bool        clearall;
-} PgStat_MsgResetreplslotcounter;
-
-/* ----------
- * PgStat_MsgAutovacStart        Sent by the autovacuum daemon to signal
- *                                that a database is going to be processed
- * ----------
- */
-typedef struct PgStat_MsgAutovacStart
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    TimestampTz m_start_time;
-} PgStat_MsgAutovacStart;
-
-
-/* ----------
- * PgStat_MsgVacuum                Sent by the backend or autovacuum daemon
- *                                after VACUUM
- * ----------
- */
-typedef struct PgStat_MsgVacuum
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    TimestampTz m_vacuumtime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgVacuum;
-
-
-/* ----------
- * PgStat_MsgAnalyze            Sent by the backend or autovacuum daemon
- *                                after ANALYZE
- * ----------
- */
-typedef struct PgStat_MsgAnalyze
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    Oid            m_tableoid;
-    bool        m_autovacuum;
-    bool        m_resetcounter;
-    TimestampTz m_analyzetime;
-    PgStat_Counter m_live_tuples;
-    PgStat_Counter m_dead_tuples;
-} PgStat_MsgAnalyze;
-
-
-/* ----------
- * PgStat_MsgArchiver            Sent by the archiver to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgArchiver
-{
-    PgStat_MsgHdr m_hdr;
-    bool        m_failed;        /* Failed attempt */
-    char        m_xlog[MAX_XFN_CHARS + 1];
-    TimestampTz m_timestamp;
-} PgStat_MsgArchiver;
-
-/* ----------
- * PgStat_MsgBgWriter            Sent by the bgwriter to update statistics.
- * ----------
- */
-typedef struct PgStat_MsgBgWriter
-{
-    PgStat_MsgHdr m_hdr;
-
-    PgStat_Counter m_timed_checkpoints;
-    PgStat_Counter m_requested_checkpoints;
-    PgStat_Counter m_buf_written_checkpoints;
-    PgStat_Counter m_buf_written_clean;
-    PgStat_Counter m_maxwritten_clean;
-    PgStat_Counter m_buf_written_backend;
-    PgStat_Counter m_buf_fsync_backend;
-    PgStat_Counter m_buf_alloc;
-    PgStat_Counter m_checkpoint_write_time; /* times in milliseconds */
-    PgStat_Counter m_checkpoint_sync_time;
-} PgStat_MsgBgWriter;
-
-/* ----------
- * PgStat_MsgWal            Sent by backends and background processes to update WAL statistics.
- * ----------
- */
-typedef struct PgStat_MsgWal
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Counter m_wal_records;
-    PgStat_Counter m_wal_fpi;
-    uint64        m_wal_bytes;
-    PgStat_Counter m_wal_buffers_full;
-    PgStat_Counter m_wal_write;
-    PgStat_Counter m_wal_sync;
-    PgStat_Counter m_wal_write_time;    /* time spent writing wal records in
-                                         * microseconds */
-    PgStat_Counter m_wal_sync_time; /* time spent syncing wal records in
-                                     * microseconds */
-} PgStat_MsgWal;
-
-/* ----------
- * PgStat_MsgSLRU            Sent by a backend to update SLRU statistics.
- * ----------
- */
-typedef struct PgStat_MsgSLRU
-{
-    PgStat_MsgHdr m_hdr;
-    PgStat_Counter m_index;
-    PgStat_Counter m_blocks_zeroed;
-    PgStat_Counter m_blocks_hit;
-    PgStat_Counter m_blocks_read;
-    PgStat_Counter m_blocks_written;
-    PgStat_Counter m_blocks_exists;
-    PgStat_Counter m_flush;
-    PgStat_Counter m_truncate;
-} PgStat_MsgSLRU;
-
-/* ----------
- * PgStat_MsgReplSlot    Sent by a backend or a wal sender to update replication
- *                        slot statistics.
- * ----------
- */
-typedef struct PgStat_MsgReplSlot
-{
-    PgStat_MsgHdr m_hdr;
-    char        m_slotname[NAMEDATALEN];
-    bool        m_drop;
-    PgStat_Counter m_spill_txns;
-    PgStat_Counter m_spill_count;
-    PgStat_Counter m_spill_bytes;
-    PgStat_Counter m_stream_txns;
-    PgStat_Counter m_stream_count;
-    PgStat_Counter m_stream_bytes;
-} PgStat_MsgReplSlot;
-
-
-/* ----------
- * PgStat_MsgRecoveryConflict    Sent by the backend upon recovery conflict
- * ----------
- */
-typedef struct PgStat_MsgRecoveryConflict
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    int            m_reason;
-} PgStat_MsgRecoveryConflict;
-
-/* ----------
- * PgStat_MsgTempFile    Sent by the backend upon creating a temp file
- * ----------
- */
-typedef struct PgStat_MsgTempFile
-{
-    PgStat_MsgHdr m_hdr;
-
-    Oid            m_databaseid;
-    size_t        m_filesize;
-} PgStat_MsgTempFile;
+    PgStat_Counter timed_checkpoints;
+    PgStat_Counter requested_checkpoints;
+    PgStat_Counter buf_written_checkpoints;
+    PgStat_Counter buf_written_backend;
+    PgStat_Counter buf_fsync_backend;
+    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
+    PgStat_Counter checkpoint_sync_time;
+} PgStat_CheckPointer;
 
 /* ----------
  * PgStat_FunctionCounts    The actual per-function counts kept by a backend
  *
  * This struct should contain only actual event counters, because we memcmp
- * it against zeroes to detect whether there are any counts to transmit.
+ * it against zeroes to detect whether there are any counts to write.
  *
  * Note that the time counters are in instr_time format here.  We convert to
- * microseconds in PgStat_Counter format when transmitting to the collector.
+ * microseconds in PgStat_Counter format when writing to shared statistics.
  * ----------
  */
 typedef struct PgStat_FunctionCounts
@@ -565,7 +235,6 @@ typedef struct PgStat_FunctionCounts
  */
 typedef struct PgStat_BackendFunctionEntry
 {
-    Oid            f_id;
     PgStat_FunctionCounts f_counts;
 } PgStat_BackendFunctionEntry;
 
@@ -581,117 +250,8 @@ typedef struct PgStat_FunctionEntry
     PgStat_Counter f_self_time;
 } PgStat_FunctionEntry;
 
-/* ----------
- * PgStat_MsgFuncstat            Sent by the backend to report function
- *                                usage statistics.
- * ----------
- */
-#define PGSTAT_NUM_FUNCENTRIES    \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(PgStat_FunctionEntry))
-
-typedef struct PgStat_MsgFuncstat
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    PgStat_FunctionEntry m_entry[PGSTAT_NUM_FUNCENTRIES];
-} PgStat_MsgFuncstat;
-
-/* ----------
- * PgStat_MsgFuncpurge            Sent by the backend to tell the collector
- *                                about dead functions.
- * ----------
- */
-#define PGSTAT_NUM_FUNCPURGE  \
-    ((PGSTAT_MSG_PAYLOAD - sizeof(Oid) - sizeof(int))  \
-     / sizeof(Oid))
-
-typedef struct PgStat_MsgFuncpurge
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_nentries;
-    Oid            m_functionid[PGSTAT_NUM_FUNCPURGE];
-} PgStat_MsgFuncpurge;
-
-/* ----------
- * PgStat_MsgDeadlock            Sent by the backend to tell the collector
- *                                about a deadlock that occurred.
- * ----------
- */
-typedef struct PgStat_MsgDeadlock
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-} PgStat_MsgDeadlock;
-
-/* ----------
- * PgStat_MsgChecksumFailure    Sent by the backend to tell the collector
- *                                about checksum failures noticed.
- * ----------
- */
-typedef struct PgStat_MsgChecksumFailure
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    int            m_failurecount;
-    TimestampTz m_failure_time;
-} PgStat_MsgChecksumFailure;
-
-/* ----------
- * PgStat_MsgConn            Sent by the backend to update connection statistics.
- * ----------
- */
-typedef struct PgStat_MsgConn
-{
-    PgStat_MsgHdr m_hdr;
-    Oid            m_databaseid;
-    PgStat_Counter m_count;
-    PgStat_Counter m_session_time;
-    PgStat_Counter m_active_time;
-    PgStat_Counter m_idle_in_xact_time;
-    SessionEndType m_disconnect;
-} PgStat_MsgConn;
-
-
-/* ----------
- * PgStat_Msg                    Union over all possible messages.
- * ----------
- */
-typedef union PgStat_Msg
-{
-    PgStat_MsgHdr msg_hdr;
-    PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
-    PgStat_MsgTabstat msg_tabstat;
-    PgStat_MsgTabpurge msg_tabpurge;
-    PgStat_MsgDropdb msg_dropdb;
-    PgStat_MsgResetcounter msg_resetcounter;
-    PgStat_MsgResetsharedcounter msg_resetsharedcounter;
-    PgStat_MsgResetsinglecounter msg_resetsinglecounter;
-    PgStat_MsgResetslrucounter msg_resetslrucounter;
-    PgStat_MsgResetreplslotcounter msg_resetreplslotcounter;
-    PgStat_MsgAutovacStart msg_autovacuum_start;
-    PgStat_MsgVacuum msg_vacuum;
-    PgStat_MsgAnalyze msg_analyze;
-    PgStat_MsgArchiver msg_archiver;
-    PgStat_MsgBgWriter msg_bgwriter;
-    PgStat_MsgWal msg_wal;
-    PgStat_MsgSLRU msg_slru;
-    PgStat_MsgFuncstat msg_funcstat;
-    PgStat_MsgFuncpurge msg_funcpurge;
-    PgStat_MsgRecoveryConflict msg_recoveryconflict;
-    PgStat_MsgDeadlock msg_deadlock;
-    PgStat_MsgTempFile msg_tempfile;
-    PgStat_MsgChecksumFailure msg_checksumfailure;
-    PgStat_MsgReplSlot msg_replslot;
-    PgStat_MsgConn msg_conn;
-} PgStat_Msg;
-
-
 /* ------------------------------------------------------------
- * Statistic collector data structures follow
+ * Activity statistics data structures on file and shared memory follow
  *
  * PGSTAT_FILE_FORMAT_ID should be changed whenever any of these
  * data structures change.
@@ -700,13 +260,9 @@ typedef union PgStat_Msg
 
 #define PGSTAT_FILE_FORMAT_ID    0x01A5BCA1
 
-/* ----------
- * PgStat_StatDBEntry            The collector's data per database
- * ----------
- */
-typedef struct PgStat_StatDBEntry
+
+typedef struct PgStat_StatDBCounts
 {
-    Oid            databaseid;
     PgStat_Counter n_xact_commit;
     PgStat_Counter n_xact_rollback;
     PgStat_Counter n_blocks_fetched;
@@ -716,7 +272,6 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_tuples_inserted;
     PgStat_Counter n_tuples_updated;
     PgStat_Counter n_tuples_deleted;
-    TimestampTz last_autovac_time;
     PgStat_Counter n_conflict_tablespace;
     PgStat_Counter n_conflict_lock;
     PgStat_Counter n_conflict_snapshot;
@@ -726,7 +281,6 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_temp_bytes;
     PgStat_Counter n_deadlocks;
     PgStat_Counter n_checksum_failures;
-    TimestampTz last_checksum_failure;
     PgStat_Counter n_block_read_time;    /* times in microseconds */
     PgStat_Counter n_block_write_time;
     PgStat_Counter n_sessions;
@@ -736,26 +290,88 @@ typedef struct PgStat_StatDBEntry
     PgStat_Counter n_sessions_abandoned;
     PgStat_Counter n_sessions_fatal;
     PgStat_Counter n_sessions_killed;
+} PgStat_StatDBCounts;
 
+/* ----------
+ * PgStat_StatEntryHeader        common header struct for PgStat_Stat*Entry
+ * ----------
+ */
+typedef struct PgStat_StatEntryHeader
+{
+    LWLock        lock;
+    bool        dropped;            /* This entry is being dropped and should
+                                     * be removed when refcount goes to
+                                     * zero. */
+    pg_atomic_uint32  refcount;        /* How many backends are referencing */
+} PgStat_StatEntryHeader;
+
+/* ----------
+ * PgStat_StatDBEntry            The statistics per database
+ * ----------
+ */
+typedef struct PgStat_StatDBEntry
+{
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    TimestampTz last_autovac_time;
+    TimestampTz last_checksum_failure;
     TimestampTz stat_reset_timestamp;
-    TimestampTz stats_timestamp;    /* time of db stats file update */
+    TimestampTz stats_timestamp;    /* time of db stats update */
+
+    PgStat_StatDBCounts counts;
 
     /*
-     * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * The following must be last in the struct, because we don't write them
+     * out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables; /* current gen tables hash */
+    dshash_table_handle functions;    /* current gen functions hash */
 } PgStat_StatDBEntry;
 
+/* ----------
+ * PgStat_Wal   Sent by backends and background processes to
+ *                update WAL statistics.
+ * ----------
+ */
+typedef struct PgStat_Wal
+{
+    WalUsage       wal_usage;
+    PgStat_Counter wal_buffers_full;
+    PgStat_Counter wal_write;
+    PgStat_Counter wal_sync;
+    PgStat_Counter wal_write_time;
+    PgStat_Counter wal_sync_time;
+    TimestampTz stat_reset_timestamp;
+} PgStat_Wal;
+
+/*
+ * SLRU statistics kept in the stats collector
+ */
+typedef struct PgStat_SLRUStats
+{
+    PgStat_Counter blocks_zeroed;
+    PgStat_Counter blocks_hit;
+    PgStat_Counter blocks_read;
+    PgStat_Counter blocks_written;
+    PgStat_Counter blocks_exists;
+    PgStat_Counter flush;
+    PgStat_Counter truncate;
+    TimestampTz stat_reset_timestamp;
+} PgStat_SLRUStats;
 
 /* ----------
- * PgStat_StatTabEntry            The collector's data per table (or index)
+ * PgStat_StatTabEntry            The statistics per table (or index)
  * ----------
  */
 typedef struct PgStat_StatTabEntry
 {
-    Oid            tableid;
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    /* Persistent data follow */
+    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
+    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
+    TimestampTz analyze_timestamp;    /* user initiated */
+    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
 
     PgStat_Counter numscans;
 
@@ -775,103 +391,35 @@ typedef struct PgStat_StatTabEntry
     PgStat_Counter blocks_fetched;
     PgStat_Counter blocks_hit;
 
-    TimestampTz vacuum_timestamp;    /* user initiated vacuum */
     PgStat_Counter vacuum_count;
-    TimestampTz autovac_vacuum_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_vacuum_count;
-    TimestampTz analyze_timestamp;    /* user initiated */
     PgStat_Counter analyze_count;
-    TimestampTz autovac_analyze_timestamp;    /* autovacuum initiated */
     PgStat_Counter autovac_analyze_count;
 } PgStat_StatTabEntry;
 
 
 /* ----------
- * PgStat_StatFuncEntry            The collector's data per function
+ * PgStat_StatFuncEntry            per function stats data
  * ----------
  */
 typedef struct PgStat_StatFuncEntry
 {
-    Oid            functionid;
-
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
+    /* Persistent data follow */
     PgStat_Counter f_numcalls;
 
     PgStat_Counter f_total_time;    /* times in microseconds */
     PgStat_Counter f_self_time;
 } PgStat_StatFuncEntry;
 
-
-/*
- * Archiver statistics kept in the stats collector
- */
-typedef struct PgStat_ArchiverStats
-{
-    PgStat_Counter archived_count;    /* archival successes */
-    char        last_archived_wal[MAX_XFN_CHARS + 1];    /* last WAL file
-                                                         * archived */
-    TimestampTz last_archived_timestamp;    /* last archival success time */
-    PgStat_Counter failed_count;    /* failed archival attempts */
-    char        last_failed_wal[MAX_XFN_CHARS + 1]; /* WAL file involved in
-                                                     * last failure */
-    TimestampTz last_failed_timestamp;    /* last archival failure time */
-    TimestampTz stat_reset_timestamp;
-} PgStat_ArchiverStats;
-
-/*
- * Global statistics kept in the stats collector
- */
-typedef struct PgStat_GlobalStats
-{
-    TimestampTz stats_timestamp;    /* time of stats file update */
-    PgStat_Counter timed_checkpoints;
-    PgStat_Counter requested_checkpoints;
-    PgStat_Counter checkpoint_write_time;    /* times in milliseconds */
-    PgStat_Counter checkpoint_sync_time;
-    PgStat_Counter buf_written_checkpoints;
-    PgStat_Counter buf_written_clean;
-    PgStat_Counter maxwritten_clean;
-    PgStat_Counter buf_written_backend;
-    PgStat_Counter buf_fsync_backend;
-    PgStat_Counter buf_alloc;
-    TimestampTz stat_reset_timestamp;
-} PgStat_GlobalStats;
-
-/*
- * WAL statistics kept in the stats collector
- */
-typedef struct PgStat_WalStats
-{
-    PgStat_Counter wal_records;
-    PgStat_Counter wal_fpi;
-    uint64        wal_bytes;
-    PgStat_Counter wal_buffers_full;
-    PgStat_Counter wal_write;
-    PgStat_Counter wal_sync;
-    PgStat_Counter wal_write_time;
-    PgStat_Counter wal_sync_time;
-    TimestampTz stat_reset_timestamp;
-} PgStat_WalStats;
-
-/*
- * SLRU statistics kept in the stats collector
- */
-typedef struct PgStat_SLRUStats
-{
-    PgStat_Counter blocks_zeroed;
-    PgStat_Counter blocks_hit;
-    PgStat_Counter blocks_read;
-    PgStat_Counter blocks_written;
-    PgStat_Counter blocks_exists;
-    PgStat_Counter flush;
-    PgStat_Counter truncate;
-    TimestampTz stat_reset_timestamp;
-} PgStat_SLRUStats;
-
 /*
  * Replication slot statistics kept in the stats collector
  */
-typedef struct PgStat_ReplSlotStats
+typedef struct PgStat_ReplSlot
 {
+    PgStat_StatEntryHeader header;    /* must be the first member,
+                                       used only on shared memory  */
     char        slotname[NAMEDATALEN];
     PgStat_Counter spill_txns;
     PgStat_Counter spill_count;
@@ -880,7 +428,7 @@ typedef struct PgStat_ReplSlotStats
     PgStat_Counter stream_count;
     PgStat_Counter stream_bytes;
     TimestampTz stat_reset_timestamp;
-} PgStat_ReplSlotStats;
+} PgStat_ReplSlot;
 
 /* ----------
  * Backend states
@@ -929,7 +477,7 @@ typedef enum
     WAIT_EVENT_CHECKPOINTER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
-    WAIT_EVENT_PGSTAT_MAIN,
+    WAIT_EVENT_READING_STATS_FILE,
     WAIT_EVENT_RECOVERY_WAL_STREAM,
     WAIT_EVENT_SYSLOGGER_MAIN,
     WAIT_EVENT_WAL_RECEIVER_MAIN,
@@ -1183,7 +731,7 @@ typedef struct PgBackendGSSStatus
  *
  * Each live backend maintains a PgBackendStatus struct in shared memory
  * showing its current activity.  (The structs are allocated according to
- * BackendId, but that is not critical.)  Note that the collector process
+ * BackendId, but that is not critical.)  Note that the stats-writing code
  * has no involvement in, or even access to, these structs.
  *
  * Each auxiliary process also maintains a PgBackendStatus struct in shared
@@ -1380,18 +928,26 @@ extern PGDLLIMPORT bool pgstat_track_counts;
 extern PGDLLIMPORT int pgstat_track_functions;
 extern PGDLLIMPORT int pgstat_track_activity_query_size;
 extern char *pgstat_stat_directory;
+
+/* No longer used, but will be removed with GUC */
 extern char *pgstat_stat_tmpname;
 extern char *pgstat_stat_filename;
 
 /*
  * BgWriter statistics counters are updated directly by bgwriter and bufmgr
  */
-extern PgStat_MsgBgWriter BgWriterStats;
+extern PgStat_BgWriter BgWriterStats;
+
+/*
+ * CheckPointer statistics counters are updated directly by checkpointer and
+ * bufmgr
+ */
+extern PgStat_CheckPointer CheckPointerStats;
 
 /*
  * WAL statistics counter is updated by backends and background processes
  */
-extern PgStat_MsgWal WalStats;
+extern PgStat_Wal WalStats;
 
 /*
  * Updated by pgstat_count_buffer_*_time macros
@@ -1411,33 +967,27 @@ extern SessionEndType pgStatSessionEndCause;
 extern Size BackendStatusShmemSize(void);
 extern void CreateSharedBackendStatus(void);
 
-extern void pgstat_init(void);
-extern int    pgstat_start(void);
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
+
 extern void pgstat_reset_all(void);
-extern void allow_immediate_pgstat_restart(void);
-
-#ifdef EXEC_BACKEND
-extern void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-#endif
-
 
 /* ----------
  * Functions called from backends
  * ----------
  */
-extern void pgstat_ping(void);
-
-extern void pgstat_report_stat(bool force);
+extern long pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
-extern void pgstat_reset_shared_counters(const char *);
+extern void pgstat_reset_shared_counters(const char *target);
 extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type type);
 extern void pgstat_reset_slru_counter(const char *);
 extern void pgstat_reset_replslot_counter(const char *name);
 
+extern void pgstat_report_wal(void);
 extern void pgstat_report_autovac(Oid dboid);
 extern void pgstat_report_vacuum(Oid tableoid, bool shared,
                                  PgStat_Counter livetuples, PgStat_Counter deadtuples);
@@ -1477,6 +1027,7 @@ extern PgStat_TableStatus *find_tabstat_entry(Oid rel_id);
 extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 
 extern void pgstat_initstats(Relation rel);
+extern void pgstat_delinkstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
 
@@ -1599,10 +1150,9 @@ extern void pgstat_twophase_postcommit(TransactionId xid, uint16 info,
 extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
                                       void *recdata, uint32 len);
 
-extern void pgstat_send_archiver(const char *xlog, bool failed);
-extern void pgstat_send_bgwriter(void);
-extern void pgstat_report_wal(void);
-extern bool pgstat_send_wal(bool force);
+extern void pgstat_report_archiver(const char *xlog, bool failed);
+extern void pgstat_report_bgwriter(void);
+extern void pgstat_report_checkpointer(void);
 
 /* ----------
  * Support functions for the SQL-callable functions to
@@ -1611,15 +1161,20 @@ extern bool pgstat_send_wal(bool force);
  */
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
+extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry_extended(bool shared,
+                                                                Oid relid);
+extern void pgstat_copy_index_counters(Oid relid, PgStat_TableStatus *dst);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
 extern LocalPgBackendStatus *pgstat_fetch_stat_local_beentry(int beid);
 extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid);
 extern int    pgstat_fetch_stat_numbackends(void);
-extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void);
-extern PgStat_GlobalStats *pgstat_fetch_global(void);
-extern PgStat_WalStats *pgstat_fetch_stat_wal(void);
+extern TimestampTz pgstat_get_stat_timestamp(void);
+extern PgStat_Archiver *pgstat_fetch_stat_archiver(void);
+extern PgStat_BgWriter *pgstat_fetch_stat_bgwriter(void);
+extern PgStat_CheckPointer *pgstat_fetch_stat_checkpointer(void);
+extern PgStat_Wal *pgstat_fetch_stat_wal(void);
 extern PgStat_SLRUStats *pgstat_fetch_slru(void);
-extern PgStat_ReplSlotStats *pgstat_fetch_replslot(int *nslots_p);
+extern PgStat_ReplSlot *pgstat_fetch_replslot(int *nslots_p);
 
 extern void pgstat_count_slru_page_zeroed(int slru_idx);
 extern void pgstat_count_slru_page_hit(int slru_idx);
@@ -1630,5 +1185,6 @@ extern void pgstat_count_slru_flush(int slru_idx);
 extern void pgstat_count_slru_truncate(int slru_idx);
 extern const char *pgstat_slru_name(int slru_idx);
 extern int    pgstat_slru_index(const char *name);
+extern void pgstat_clear_snapshot(void);
 
 #endif                            /* PGSTAT_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index a8f052e484..aabc6d8b4d 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -189,6 +189,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SHARED_TIDBITMAP,
     LWTRANCHE_PARALLEL_APPEND,
     LWTRANCHE_PER_XACT_PREDICATE_LIST,
+    LWTRANCHE_STATS,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index b9b5c1adda..add9c53ee3 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -88,7 +88,7 @@ enum config_group
     PROCESS_TITLE,
     STATS,
     STATS_MONITORING,
-    STATS_COLLECTOR,
+    STATS_ACTIVITY,
     AUTOVACUUM,
     CLIENT_CONN,
     CLIENT_CONN_STATEMENT,
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index ecb2a366a5..f090f7372a 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -32,6 +32,7 @@ typedef enum TimeoutId
     STANDBY_LOCK_TIMEOUT,
     IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
     IDLE_SESSION_TIMEOUT,
+    IDLE_STATS_UPDATE_TIMEOUT,
     /* First user-definable timeout reason */
     USER_TIMEOUT,
     /* Maximum number of timeout reasons */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 58ad2991a7..653ffde30e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1837,52 +1837,33 @@ PgFdwPathExtraData
 PgFdwRelationInfo
 PgFdwScanState
 PgIfAddrCallback
-PgStat_ArchiverStats
+PgStat_Archiver
+PgStat_BgWriter
+PgStat_CheckPointer
+PgStat_ReplSlot
+PgStat_SLRUStats
+PgStat_StatDBCounts
+PgStat_StatDBEntry
+PgStat_StatEntryHeader
+PgStat_Wal
 PgStat_BackendFunctionEntry
 PgStat_Counter
 PgStat_FunctionCallUsage
 PgStat_FunctionCounts
 PgStat_FunctionEntry
-PgStat_GlobalStats
-PgStat_Msg
-PgStat_MsgAnalyze
-PgStat_MsgArchiver
-PgStat_MsgAutovacStart
-PgStat_MsgBgWriter
-PgStat_MsgChecksumFailure
-PgStat_MsgDeadlock
-PgStat_MsgDropdb
-PgStat_MsgDummy
-PgStat_MsgFuncpurge
-PgStat_MsgFuncstat
-PgStat_MsgHdr
-PgStat_MsgInquiry
-PgStat_MsgRecoveryConflict
-PgStat_MsgReplSlot
-PgStat_MsgResetcounter
-PgStat_MsgResetreplslotcounter
-PgStat_MsgResetsharedcounter
-PgStat_MsgResetsinglecounter
-PgStat_MsgResetslrucounter
-PgStat_MsgSLRU
-PgStat_MsgTabpurge
-PgStat_MsgTabstat
-PgStat_MsgTempFile
-PgStat_MsgVacuum
-PgStat_MsgWal
-PgStat_ReplSlotStats
-PgStat_SLRUStats
 PgStat_Shared_Reset_Target
 PgStat_Single_Reset_Type
-PgStat_StatDBEntry
 PgStat_StatFuncEntry
 PgStat_StatTabEntry
 PgStat_SubXactStatus
 PgStat_TableCounts
-PgStat_TableEntry
 PgStat_TableStatus
 PgStat_TableXactStatus
-PgStat_WalStats
+PgStatHashEntry
+PgStatHashKey
+PgStatLocalHashEntry
+PgStatSharedSLRUStats
+PgStatTypes
 PgXmlErrorContext
 PgXmlStrictness
 Pg_finfo_record
@@ -2396,12 +2377,12 @@ StartupPacket
 StartupStatusEnum
 StatEntry
 StatExtEntry
-StatMsgType
 StateFileChunk
 StatisticExtInfo
 Stats
 StatsData
 StatsExtInfo
+StatsShmemStruct
 StdAnalyzeData
 StdRdOptions
 Step
@@ -2499,8 +2480,6 @@ TXNEntryFile
 TYPCATEGORY
 T_Action
 T_WorkerStatus
-TabStatHashEntry
-TabStatusArray
 TableAmRoutine
 TableDataInfo
 TableFunc
@@ -3257,6 +3236,7 @@ pgssLocationLen
 pgssSharedState
 pgssStoreKind
 pgssVersion
+pgstat_oident
 pgstat_page
 pgstattuple_type
 pgthreadlock_t
-- 
2.27.0
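
The PgStat_StatEntryHeader added above pairs a "dropped" flag with a reference count, and its comment says a dropped entry should be removed once the count reaches zero. A minimal, single-threaded sketch of that idea follows. It is not code from the patch: DemoEntryHeader, the demo_* functions, and the "freed" flag are hypothetical names used only for illustration, and C11 atomics stand in for pg_atomic_uint32. Real concurrent code would also have to check and set the flag under the entry's lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an entry header; not the patch's struct. */
typedef struct DemoEntryHeader
{
    bool        dropped;    /* entry is logically deleted */
    atomic_uint refcount;   /* number of holders still referencing it */
    bool        freed;      /* demo only: set where storage would be released */
} DemoEntryHeader;

static void
demo_init(DemoEntryHeader *h)
{
    h->dropped = false;
    h->freed = false;
    atomic_init(&h->refcount, 0);
}

/* Take a reference; refuses entries already marked dropped. */
static bool
demo_acquire(DemoEntryHeader *h)
{
    if (h->dropped)
        return false;
    atomic_fetch_add(&h->refcount, 1);
    return true;
}

/* Release a reference; the last holder of a dropped entry frees it. */
static void
demo_release(DemoEntryHeader *h)
{
    unsigned int old = atomic_fetch_sub(&h->refcount, 1);

    assert(old > 0);
    if (old == 1 && h->dropped)
        h->freed = true;
}

/* Mark the entry dropped; free at once if nobody references it. */
static void
demo_drop(DemoEntryHeader *h)
{
    h->dropped = true;
    if (atomic_load(&h->refcount) == 0)
        h->freed = true;
}
```

In this sketch, dropping an entry that is still referenced only sets the flag, and the final demo_release() performs the deferred removal. This lets a backend keep using an entry it already holds without blocking the drop.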

From f32e735c1104617086301882b945a4dc4d91e0f3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 19 Mar 2020 15:11:09 +0900
Subject: [PATCH v57 4/6] Doc part of shared-memory based stats collector.

---
 doc/src/sgml/catalogs.sgml          |   6 +-
 doc/src/sgml/config.sgml            |  25 +++---
 doc/src/sgml/high-availability.sgml |  13 +--
 doc/src/sgml/monitoring.sgml        | 127 +++++++++++++---------------
 doc/src/sgml/ref/pg_dump.sgml       |   9 +-
 src/backend/postmaster/postmaster.c |   2 -
 6 files changed, 85 insertions(+), 97 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 5c9f4af1d5..795c042a3d 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -9271,9 +9271,9 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l
   <para>
    <xref linkend="view-table"/> lists the system views described here.
    More detailed documentation of each view follows below.
-   There are some additional views that provide access to the results of
-   the statistics collector; they are described in <xref
-   linkend="monitoring-stats-views-table"/>.
+   There are some additional views that provide access to the activity
+   statistics; they are described in
+   <xref linkend="monitoring-stats-views-table"/>.
   </para>
 
   <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8603cf3f94..99a55d276d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7398,11 +7398,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
     <title>Run-time Statistics</title>
 
     <sect2 id="runtime-config-statistics-collector">
-     <title>Query and Index Statistics Collector</title>
+     <title>Query and Index Activity Statistics</title>
 
      <para>
-      These parameters control server-wide statistics collection features.
-      When statistics collection is enabled, the data that is produced can be
+      These parameters control server-wide activity statistics features.
+      When activity statistics is enabled, the data that is produced can be
       accessed via the <structname>pg_stat</structname> and
       <structname>pg_statio</structname> family of system views.
       Refer to <xref linkend="monitoring"/> for more information.
@@ -7418,14 +7418,13 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables the collection of information on the currently
-        executing command of each session, along with the time when
-        that command began execution. This parameter is on by
-        default. Note that even when enabled, this information is not
-        visible to all users, only to superusers and the user owning
-        the session being reported on, so it should not represent a
-        security risk.
-        Only superusers can change this setting.
+        Enables activity tracking on the currently executing command of
+        each session, along with the time when that command began
+        execution. This parameter is on by default. Note that even when
+        enabled, this information is not visible to all users, only to
+        superusers and the user owning the session being reported on, so it
+        should not represent a security risk.  Only superusers can change this
+        setting.
        </para>
       </listitem>
      </varlistentry>
@@ -7456,9 +7455,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables collection of statistics on database activity.
+        Enables tracking of database activity.
         This parameter is on by default, because the autovacuum
-        daemon needs the collected information.
+        daemon needs the activity information.
         Only superusers can change this setting.
        </para>
       </listitem>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index f49f5c0108..45095857eb 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2217,12 +2217,13 @@ HINT:  You can then restart the server after making the necessary configuration
    </para>
 
    <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
-    index usage, etc., will be recorded normally on the standby. Replayed
-    actions will not duplicate their effects on primary, so replaying an
-    insert will not increment the Inserts column of pg_stat_user_tables.
-    The stats file is deleted at the start of recovery, so stats from primary
-    and standby will differ; this is considered a feature, not a bug.
+    Activity statistics are collected during recovery. All scans, reads,
+    blocks, index usage, etc., will be recorded normally on the
+    standby. Replayed actions will not duplicate their effects on primary, so
+    replaying an insert will not increment the Inserts column of
+    pg_stat_user_tables.  Activity statistics are reset at the start of
+    recovery, so stats from primary and standby will differ; this is
+    considered a feature, not a bug.
    </para>
 
    <para>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index db4b4e460c..90cf3072ac 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -22,7 +22,7 @@
   <para>
    Several tools are available for monitoring database activity and
    analyzing performance.  Most of this chapter is devoted to describing
-   <productname>PostgreSQL</productname>'s statistics collector,
+   <productname>PostgreSQL</productname>'s activity statistics,
    but one should not neglect regular Unix monitoring programs such as
    <command>ps</command>, <command>top</command>, <command>iostat</command>, and <command>vmstat</command>.
    Also, once one has identified a
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    primary server process.  The command arguments
    shown for it are the same ones used when it was launched.  The next five
    processes are background worker processes automatically launched by the
-   primary process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   primary process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
@@ -130,20 +128,21 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
  </sect1>
 
  <sect1 id="monitoring-stats">
-  <title>The Statistics Collector</title>
+  <title>The Activity Statistics</title>
 
   <indexterm zone="monitoring-stats">
    <primary>statistics</primary>
   </indexterm>
 
   <para>
-   <productname>PostgreSQL</productname>'s <firstterm>statistics collector</firstterm>
-   is a subsystem that supports collection and reporting of information about
-   server activity.  Presently, the collector can count accesses to tables
-   and indexes in both disk-block and individual-row terms.  It also tracks
-   the total number of rows in each table, and information about vacuum and
-   analyze actions for each table.  It can also count calls to user-defined
-   functions and the total time spent in each one.
+   <productname>PostgreSQL</productname>'s <firstterm>activity
+   statistics</firstterm> is a subsystem that supports tracking and reporting
+   of information about server activity.  Presently, the activity statistics
+   tracks the count of accesses to tables and indexes in both disk-block and
+   individual-row terms.  It also tracks the total number of rows in each
+   table, and information about vacuum and analyze actions for each table.  It
+   can also track calls to user-defined functions and the total time spent in
+   each one.
   </para>
 
   <para>
@@ -151,15 +150,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    information about exactly what is going on in the system right now, such as
    the exact command currently being executed by other server processes, and
    which other connections exist in the system.  This facility is independent
-   of the collector process.
+   of the activity statistics.
   </para>
 
  <sect2 id="monitoring-stats-setup">
-  <title>Statistics Collection Configuration</title>
+  <title>Activity Statistics Configuration</title>
 
   <para>
-   Since collection of statistics adds some overhead to query execution,
-   the system can be configured to collect or not collect information.
+   Since tracking for the activity statistics adds some overhead to query
+   execution, the system can be configured to track or not track activity.
    This is controlled by configuration parameters that are normally set in
    <filename>postgresql.conf</filename>.  (See <xref linkend="runtime-config"/> for
    details about setting configuration parameters.)
@@ -172,7 +171,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The parameter <xref linkend="guc-track-counts"/> controls whether
-   statistics are collected about table and index accesses.
+   to track activity about table and index accesses.
   </para>
 
   <para>
@@ -201,18 +200,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
   </para>
 
   <para>
-   The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
-   When the server shuts down cleanly, a permanent copy of the statistics
-   data is stored in the <filename>pg_stat</filename> subdirectory, so that
-   statistics can be retained across server restarts.  When recovery is
-   performed at server start (e.g., after immediate shutdown, server crash,
-   and point-in-time recovery), all statistics counters are reset.
+   When the server shuts down cleanly, a permanent copy of the statistics
+   data is stored in the <filename>pg_stat</filename> subdirectory, so that
+   statistics can be retained across server restarts.  When recovery is
+   performed at server start (e.g., after immediate shutdown, server crash,
+   and point-in-time recovery), all statistics counters are reset.
   </para>
 
  </sect2>
@@ -225,48 +217,46 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    linkend="monitoring-stats-dynamic-views-table"/>, are available to show
    the current state of the system. There are also several other
    views, listed in <xref
-   linkend="monitoring-stats-views-table"/>, available to show the results
-   of statistics collection.  Alternatively, one can
-   build custom views using the underlying statistics functions, as discussed
-   in <xref linkend="monitoring-stats-functions"/>.
+   linkend="monitoring-stats-views-table"/>, available to show the activity
+   statistics.  Alternatively, one can build custom views using the underlying
+   statistics functions, as discussed in
+   <xref linkend="monitoring-stats-functions"/>.
   </para>
 
   <para>
-   When using the statistics to monitor collected data, it is important
-   to realize that the information does not update instantaneously.
-   Each individual server process transmits new statistical counts to
-   the collector just before going idle; so a query or transaction still in
-   progress does not affect the displayed totals.  Also, the collector itself
-   emits a new report at most once per <varname>PGSTAT_STAT_INTERVAL</varname>
-   milliseconds (500 ms unless altered while building the server).  So the
-   displayed information lags behind actual activity.  However, current-query
-   information collected by <varname>track_activities</varname> is
-   always up-to-date.
+   When using the activity statistics, it is important to realize that the
+   information does not update instantaneously.  Each server process writes
+   out new statistical counts just before going idle, no more often than once
+   per <varname>PGSTAT_STAT_INTERVAL</varname> milliseconds (1 second unless
+   altered while building the server); so a query or transaction still in
+   progress does not affect the displayed totals.  However, current-query
+   information tracked by <varname>track_activities</varname> is always
+   up-to-date.
   </para>
 
   <para>
    Another important point is that when a server process is asked to display
-   any of these statistics, it first fetches the most recent report emitted by
-   the collector process and then continues to use this snapshot for all
-   statistical views and functions until the end of its current transaction.
-   So the statistics will show static information as long as you continue the
-   current transaction.  Similarly, information about the current queries of
-   all sessions is collected when any such information is first requested
-   within a transaction, and the same information will be displayed throughout
-   the transaction.
-   This is a feature, not a bug, because it allows you to perform several
-   queries on the statistics and correlate the results without worrying that
-   the numbers are changing underneath you.  But if you want to see new
-   results with each query, be sure to do the queries outside any transaction
-   block.  Alternatively, you can invoke
+   any of these statistics, it first reads the current statistics and then
+   continues to use this snapshot for all statistical views and functions
+   until the end of its current transaction.  So the statistics will show
+   static information as long as you continue the current transaction.
+   Similarly, information about the current queries of all sessions is tracked
+   when any such information is first requested within a transaction, and the
+   same information will be displayed throughout the transaction.  This is a
+   feature, not a bug, because it allows you to perform several queries on the
+   statistics and correlate the results without worrying that the numbers are
+   changing underneath you.  But if you want to see new results with each
+   query, be sure to do the queries outside any transaction block.
+   Alternatively, you can invoke
    <function>pg_stat_clear_snapshot</function>(), which will discard the
    current transaction's statistics snapshot (if any).  The next use of
    statistical information will cause a new snapshot to be fetched.
   </para>
-
+  
   <para>
-   A transaction can also see its own statistics (as yet untransmitted to the
-   collector) in the views <structname>pg_stat_xact_all_tables</structname>,
+   A transaction can also see its own statistics (as yet unwritten to the
+   server-wide activity statistics) in the
+   views <structname>pg_stat_xact_all_tables</structname>,
    <structname>pg_stat_xact_sys_tables</structname>,
    <structname>pg_stat_xact_user_tables</structname>, and
    <structname>pg_stat_xact_user_functions</structname>.  These numbers do not act as
@@ -648,7 +638,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    kernel's I/O cache, and might therefore still be fetched without
    requiring a physical read. Users interested in obtaining more
    detailed information on <productname>PostgreSQL</productname> I/O behavior are
-   advised to use the <productname>PostgreSQL</productname> statistics collector
+   advised to use the <productname>PostgreSQL</productname> activity statistics
    in combination with operating system utilities that allow insight
    into the kernel's handling of I/O.
   </para>
@@ -1086,10 +1076,6 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>LogicalLauncherMain</literal></entry>
       <entry>Waiting in main loop of logical replication launcher process.</entry>
      </row>
-     <row>
-      <entry><literal>PgStatMain</literal></entry>
-      <entry>Waiting in main loop of statistics collector process.</entry>
-     </row>
      <row>
       <entry><literal>RecoveryWalStream</literal></entry>
       <entry>Waiting in main loop of startup process for WAL to arrive, during
@@ -1852,6 +1838,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
     </thead>
 
     <tbody>
+     <row>
+      <entry><literal>ActivityStatistics</literal></entry>
+      <entry>Waiting to write out activity statistics to shared memory.</entry>
+     </row>
      <row>
       <entry><literal>AddinShmemInit</literal></entry>
       <entry>Waiting to manage an extension's space allocation in shared
@@ -6128,9 +6118,10 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
      <entry><literal>performing final cleanup</literal></entry>
      <entry>
        <command>VACUUM</command> is performing final cleanup.  During this phase,
-       <command>VACUUM</command> will vacuum the free space map, update statistics
-       in <literal>pg_class</literal>, and report statistics to the statistics
-       collector.  When this phase is completed, <command>VACUUM</command> will end.
+       <command>VACUUM</command> will vacuum the free space map, update
+       statistics in <literal>pg_class</literal>, and update the system-wide
+       activity statistics.  When this phase is completed,
+       <command>VACUUM</command> will end.
      </entry>
     </row>
    </tbody>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index bcbb7a25fb..1fa59a2fdf 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1280,11 +1280,10 @@ PostgreSQL documentation
   </para>
 
   <para>
-   The database activity of <application>pg_dump</application> is
-   normally collected by the statistics collector.  If this is
-   undesirable, you can set parameter <varname>track_counts</varname>
-   to false via <envar>PGOPTIONS</envar> or the <literal>ALTER
-   USER</literal> command.
+   The database activity of <application>pg_dump</application> is normally
+   collected.  If this is undesirable, you can set
+   parameter <varname>track_counts</varname> to false
+   via <envar>PGOPTIONS</envar> or the <literal>ALTER USER</literal> command.
   </para>
 
  </refsect1>
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 5d5802f61d..b5a37b5e0c 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -6328,8 +6328,6 @@ restore_backend_variables(BackendParameters *param, Port *port)
     if (postmaster_alive_fds[1] >= 0)
         ReserveExternalFD();
 #endif
-    if (pgStatSock != PGINVALID_SOCKET)
-        ReserveExternalFD();
 }
 
 
-- 
2.27.0

From a2a92fdcd53c28cc7fdde54f8561e826ce631001 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 29 Sep 2020 22:59:33 +0900
Subject: [PATCH v57 5/6] Remove the GUC stats_temp_directory

The new stats collection system doesn't need a temporary directory, so
just remove it. pg_stat_statements is modified to use the pg_stat directory
to store its temporary files.  As a result, basebackup copies
pg_stat_statements' temporary file if it exists.
---
 .../pg_stat_statements/pg_stat_statements.c   | 13 +++---
 doc/src/sgml/backup.sgml                      |  2 -
 doc/src/sgml/config.sgml                      | 19 --------
 doc/src/sgml/storage.sgml                     |  6 ---
 src/backend/postmaster/pgstat.c               | 10 -----
 src/backend/replication/basebackup.c          | 36 ----------------
 src/backend/utils/misc/guc.c                  | 43 -------------------
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/bin/initdb/initdb.c                       |  1 -
 src/bin/pg_rewind/filemap.c                   |  7 ---
 src/include/pgstat.h                          |  3 --
 src/test/perl/PostgresNode.pm                 |  4 --
 12 files changed, 6 insertions(+), 139 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 62cccbfa44..28279f97d5 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -89,14 +89,13 @@ PG_MODULE_MAGIC;
 #define PGSS_DUMP_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pg_stat_statements.stat"
 
 /*
- * Location of external query text file.  We don't keep it in the core
- * system's stats_temp_directory.  The core system can safely use that GUC
- * setting, because the statistics collector temp file paths are set only once
- * as part of changing the GUC, but pg_stat_statements has no way of avoiding
- * race conditions.  Besides, we only expect modest, infrequent I/O for query
- * strings, so placing the file on a faster filesystem is not compelling.
+ * Location of external query text file.  We don't keep it in the core system's
+ * pg_stats.  pg_stat_statements has no way of avoiding race conditions even if
+ * the directory were specified by a GUC.  Besides, we only expect modest,
+ * infrequent I/O for query strings, so placing the file on a faster filesystem
+ * is not compelling.
  */
-#define PGSS_TEXT_FILE    PG_STAT_TMP_DIR "/pgss_query_texts.stat"
+#define PGSS_TEXT_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY "/pgss_query_texts.stat"
 
 /* Magic number identifying the stats file format */
 static const uint32 PGSS_FILE_HEADER = 0x20201218;
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index c5557d5444..875769a57e 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1155,8 +1155,6 @@ SELECT pg_stop_backup();
     <filename>pg_snapshots/</filename>, <filename>pg_stat_tmp/</filename>,
     and <filename>pg_subtrans/</filename> (but not the directories themselves) can be
     omitted from the backup as they will be initialized on postmaster startup.
-    If <xref linkend="guc-stats-temp-directory"/> is set and is under the data
-    directory then the contents of that directory can also be omitted.
    </para>
 
    <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 99a55d276d..e963d6a249 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7534,25 +7534,6 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-stats-temp-directory" xreflabel="stats_temp_directory">
-      <term><varname>stats_temp_directory</varname> (<type>string</type>)
-      <indexterm>
-       <primary><varname>stats_temp_directory</varname> configuration parameter</primary>
-      </indexterm>
-      </term>
-      <listitem>
-       <para>
-        Sets the directory to store temporary statistics data in. This can be
-        a path relative to the data directory or an absolute path. The default
-        is <filename>pg_stat_tmp</filename>. Pointing this at a RAM-based
-        file system will decrease physical I/O requirements and can lead to
-        improved performance.
-        This parameter can only be set in the <filename>postgresql.conf</filename>
-        file or on the server command line.
-       </para>
-      </listitem>
-     </varlistentry>
-
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 3234adb639..6bac5e075e 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -120,12 +120,6 @@ Item
   subsystem</entry>
 </row>
 
-<row>
- <entry><filename>pg_stat_tmp</filename></entry>
- <entry>Subdirectory containing temporary files for the statistics
-  subsystem</entry>
-</row>
-
 <row>
  <entry><filename>pg_subtrans</filename></entry>
  <entry>Subdirectory containing subtransaction status data</entry>
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index bd09bb6d3b..3f546afe6a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -99,16 +99,6 @@ bool        pgstat_track_counts = false;
 int            pgstat_track_functions = TRACK_FUNC_OFF;
 int            pgstat_track_activity_query_size = 1024;
 
-/* ----------
- * Built from GUC parameter
- * ----------
- */
-char      *pgstat_stat_directory = NULL;
-
-/* No longer used, but will be removed with GUC */
-char      *pgstat_stat_filename = NULL;
-char      *pgstat_stat_tmpname = NULL;
-
 /*
  * WAL usage counters saved from pgWALUsage at the previous call to
  * pgstat_report_wal(). This is used to calculate how much WAL usage
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 3d35b55cf9..84b22f4ac9 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -87,9 +87,6 @@ static int    basebackup_read_file(int fd, char *buf, size_t nbytes, off_t offset,
 /* Was the backup currently in-progress initiated in recovery mode? */
 static bool backup_started_in_recovery = false;
 
-/* Relative path of temporary statistics directory */
-static char *statrelpath = NULL;
-
 /*
  * Size of each block sent into the tar stream for larger files.
  */
@@ -152,13 +149,6 @@ struct exclude_list_item
  */
 static const char *const excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    PG_STAT_TMP_DIR,
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
@@ -261,7 +251,6 @@ perform_base_backup(basebackup_options *opt)
     StringInfo    labelfile;
     StringInfo    tblspc_map_file;
     backup_manifest_info manifest;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
     backup_total = 0;
@@ -284,8 +273,6 @@ perform_base_backup(basebackup_options *opt)
     Assert(CurrentResourceOwner == NULL);
     CurrentResourceOwner = ResourceOwnerCreate(NULL, "base backup");
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -314,18 +301,6 @@ perform_base_backup(basebackup_options *opt)
         tablespaceinfo *ti;
         int            tblspc_streamed = 0;
 
-        /*
-         * Calculate the relative path of temporary statistics directory in
-         * order to skip the files which are located in that directory later.
-         */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
-
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
         ti->size = -1;
@@ -1377,17 +1352,6 @@ sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces,
         if (excludeFound)
             continue;
 
-        /*
-         * Exclude contents of directory specified by statrelpath if not set
-         * to the default (pg_stat_tmp) which is caught in the loop above.
-         */
-        if (statrelpath != NULL && strcmp(pathbuf, statrelpath) == 0)
-        {
-            elog(DEBUG1, "contents of directory \"%s\" excluded from backup", statrelpath);
-            size += _tarWriteDir(pathbuf, basepathlen, &statbuf, sizeonly);
-            continue;
-        }
-
         /*
          * We can skip pg_wal, the WAL segments need to be fetched from the
          * WAL archive anyway. But include it as an empty directory anyway, so
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 842a3d755d..e1195f6589 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -203,7 +203,6 @@ static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource sourc
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
 static bool check_huge_page_size(int *newval, void **extra, GucSource source);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -560,8 +559,6 @@ char       *HbaFileName;
 char       *IdentFileName;
 char       *external_pid_file;
 
-char       *pgstat_temp_directory;
-
 char       *application_name;
 
 int            tcp_keepalives_idle;
@@ -4375,17 +4372,6 @@ static struct config_string ConfigureNamesString[] =
         NULL, NULL, NULL
     },
 
-    {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_ACTIVITY,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
     {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_PRIMARY,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
@@ -11791,35 +11777,6 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
     return true;
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ffa65d2ad3..0d13cf4cc8 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -590,7 +590,6 @@
 #track_wal_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 3c1cf78b4f..07a00b8d0d 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -219,7 +219,6 @@ static const char *const subdirs[] = {
     "pg_replslot",
     "pg_tblspc",
     "pg_stat",
-    "pg_stat_tmp",
     "pg_xact",
     "pg_logical",
     "pg_logical/snapshots",
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 2618b4c957..ab5cb51de7 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -87,13 +87,6 @@ struct exclude_list_item
  */
 static const char *excludeDirContents[] =
 {
-    /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    "pg_stat_tmp",                /* defined as PG_STAT_TMP_DIR */
-
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f4177eb284..e1c54e73f2 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -33,9 +33,6 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/saved_stats"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/saved_stats.tmp"
 
-/* Default directory to store temporary statistics data in */
-#define PG_STAT_TMP_DIR                "pg_stat_tmp"
-
 /* Values for track_functions GUC variable --- order is significant! */
 typedef enum TrackFunctionsLevel
 {
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 97e05993be..5f2de691b5 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -455,10 +455,6 @@ sub init
     print $conf TestLib::slurp_file($ENV{TEMP_CONFIG})
       if defined $ENV{TEMP_CONFIG};
 
-    # XXX Neutralize any stats_temp_directory in TEMP_CONFIG.  Nodes running
-    # concurrently must not share a stats_temp_directory.
-    print $conf "stats_temp_directory = 'pg_stat_tmp'\n";
-
     if ($params{allows_streaming})
     {
         if ($params{allows_streaming} eq "logical")
-- 
2.27.0

From 1e1505d6cacac92941dc83f17d98dc8dd68171ad Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 29 Sep 2020 23:19:58 +0900
Subject: [PATCH v57 6/6] Exclude pg_stat directory from base backup

basebackup sends the contents of the pg_stat directory, which are doomed
to be removed at startup from the backup. Now that pg_stat_statements
saves a temporary file into the directory, let's exclude the pg_stat
directory from base backups.
---
 src/backend/replication/basebackup.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 84b22f4ac9..a2c88bd3a4 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -149,6 +149,13 @@ struct exclude_list_item
  */
 static const char *const excludeDirContents[] =
 {
+    /*
+     * Skip statistics files. PGSTAT_STAT_PERMANENT_DIRECTORY must be skipped
+     * because the files in the directory will be removed at startup from the
+     * backup.
+     */
+    PGSTAT_STAT_PERMANENT_DIRECTORY,
+
     /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another primary. See backup.sgml
-- 
2.27.0
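
The exclusion added by patch 6/6 can be pictured with a small standalone sketch. This is not the code in basebackup.c: the names below (SKETCH_PGSTAT_PERMANENT_DIR, sketch_exclude_dir_contents, sketch_skip_dir_contents) are hypothetical, and the sketch assumes the list is matched by exact directory name, with the directory itself still sent as an empty entry.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for PGSTAT_STAT_PERMANENT_DIRECTORY. */
#define SKETCH_PGSTAT_PERMANENT_DIR "pg_stat"

/*
 * NULL-terminated list of directories whose contents are skipped.
 * Only an illustrative subset; the real list is longer.
 */
static const char *const sketch_exclude_dir_contents[] = {
    SKETCH_PGSTAT_PERMANENT_DIR,
    "pg_subtrans",
    NULL
};

/*
 * Return true if the files inside the directory called "name" should be
 * left out of the backup.  Assumption: names are compared exactly, so
 * "pg_stat_statements" does not match "pg_stat".
 */
static bool
sketch_skip_dir_contents(const char *name)
{
    for (size_t i = 0; sketch_exclude_dir_contents[i] != NULL; i++)
    {
        if (strcmp(name, sketch_exclude_dir_contents[i]) == 0)
            return true;
    }
    return false;
}
```

Under these assumptions, sketch_skip_dir_contents("pg_stat") is true, while a name that merely shares the prefix is not affected.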


Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
At Thu, 18 Mar 2021 01:47:20 -0700, Andres Freund <andres@anarazel.de> wrote in 
> Hi,
> 
> On 2021-03-18 16:56:02 +0900, Kyotaro Horiguchi wrote:
> > At Tue, 16 Mar 2021 10:27:55 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > > At Mon, 15 Mar 2021 17:49:36 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > > > Thanks for committing this!  I'm very happy to see this reduces the
> > > > size of this patchset.
> > > 
> > > Now that 0003 is committed as d75288fb27, and 33394ee6f2 conflicts
> > > with old 0004, I'd like to post a rebased version for future work.
> > > 
> > > The commit 33394ee6f2 adds on-exit forced write of WAL stats on
> > > walwriter and in this patch that part would appear to have been
> > > removed.  However, this patchset already does that by calling to
> > > pgstat_report_stat from pgstat_beshutdown_hook.
> > 
> > Rebased and fixed two bugs.  Not addressed received comments in this
> > version.
> 
> Since I am heavily editing the code, could you submit "functional"
> changes (as opposed to fixing rebase issues) as incremental patches?

Oh.. please wait for.. a moment, maybe.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
At Mon, 22 Mar 2021 09:55:59 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Thu, 18 Mar 2021 01:47:20 -0700, Andres Freund <andres@anarazel.de> wrote in 
> > Hi,
> > 
> > On 2021-03-18 16:56:02 +0900, Kyotaro Horiguchi wrote:
> > > At Tue, 16 Mar 2021 10:27:55 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > > Rebased and fixed two bugs.  Not addressed received comments in this
> > > version.
> > 
> > Since I am heavily editing the code, could you submit "functional"
> > changes (as opposed to fixing rebase issues) as incremental patches?
> 
> Oh.. please wait for.. a moment, maybe.

This is that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index c09fa026b9..3f546afe6a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1175,13 +1175,6 @@ pgstat_report_stat(bool force)
     if (area == NULL)
         return 0;
 
-    /*
-     * We need a database entry if the following stats exists.
-     */
-    if (pgStatXactCommit > 0 || pgStatXactRollback > 0 ||
-        pgStatBlockReadTime > 0 || pgStatBlockWriteTime > 0)
-        get_local_dbstat_entry(MyDatabaseId);
-
     /* Don't expend a clock check if nothing to do */
     if (pgStatLocalHash == NULL && have_slrustats && !walstats_pending())
         return 0;
@@ -1221,8 +1214,6 @@ pgstat_report_stat(bool force)
     {
         int            remains = 0;
         pgstat_localhash_iterator i;
-        List       *dbentlist = NIL;
-        ListCell   *lc;
         PgStatLocalHashEntry *lent;
 
         /* Step 1: flush out other than database stats */
@@ -1234,13 +1225,11 @@ pgstat_report_stat(bool force)
             switch (lent->key.type)
             {
                 case PGSTAT_TYPE_DB:
-
                     /*
                      * flush_tabstat applies some of stats numbers of flushed
-                     * entries into local database stats. Just remember the
-                     * database entries for now then flush-out them later.
+                     * entries into local and shared database stats. Treat them
+                     * separately later.
                      */
-                    dbentlist = lappend(dbentlist, lent);
                     break;
                 case PGSTAT_TYPE_TABLE:
                     if (flush_tabstat(lent, nowait))
@@ -1268,9 +1257,11 @@ pgstat_report_stat(bool force)
         }
 
         /* Step 2: flush out database stats */
-        foreach(lc, dbentlist)
+        pgstat_localhash_start_iterate(pgStatLocalHash, &i);
+        while ((lent = pgstat_localhash_iterate(pgStatLocalHash, &i)) != NULL)
         {
-            PgStatLocalHashEntry *lent = (PgStatLocalHashEntry *) lfirst(lc);
+            /* no other types of entry must be found here */
+            Assert(lent->key.type == PGSTAT_TYPE_DB);
 
             if (flush_dbstat(lent, nowait))
             {
@@ -1281,8 +1272,6 @@ pgstat_report_stat(bool force)
                 pgstat_localhash_delete(pgStatLocalHash, lent->key);
             }
         }
-        list_free(dbentlist);
-        dbentlist = NULL;
 
         if (remains <= 0)
         {
@@ -1507,7 +1496,7 @@ flush_dbstat(PgStatLocalHashEntry *ent, bool nowait)
 
     Assert(ent->key.type == PGSTAT_TYPE_DB);
 
-    localent = (PgStat_StatDBEntry *) &ent->body;
+    localent = (PgStat_StatDBEntry *) ent->body;
 
     /* find shared database stats entry corresponding to the local entry */
     sharedent = (PgStat_StatDBEntry *)
@@ -3215,11 +3204,8 @@ pgstat_fetch_stat_dbentry(Oid dbid)
     /* should be called from backends */
     Assert(IsUnderPostmaster);
 
-    /* the simple cache doesn't work properly for InvalidOid */
-    Assert(dbid != InvalidOid);
-
     /* Return cached result if it is valid. */
-    if (cached_dbent_key.databaseid == dbid)
+    if (dbid != 0 && cached_dbent_key.databaseid == dbid)
         return &cached_dbent;
 
     shent = (PgStat_StatDBEntry *)

Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
At Fri, 19 Mar 2021 14:27:38 -0700, Andres Freund <andres@anarazel.de> wrote in 
> Hi,
> 
> On 2021-03-10 20:26:56 -0800, Andres Freund wrote:
> > > +         * We expose this shared entry now.  You might think that the entry
> > > +         * can be removed by a concurrent backend, but since we are creating
> > > +         * an stats entry, the object actually exists and used in the upper
> > > +         * layer. Such an object cannot be dropped until the first vacuum
> > > +         * after the current transaction ends.
> > > +         */
> > > +        dshash_release_lock(pgStatSharedHash, shhashent);
> >
> > I don't think you can safely release the lock before you incremented the
> > refcount?  What if, once the lock is released, somebody looks up that
> > entry, increments the refcount, and decrements it again? It'll see a
> > refcount of 0 at the end and decide to free the memory. Then the code
> > below will access already freed / reused memory, no?
> 
> Yep, it's not even particularly hard to hit:
>
> S0: CREATE TABLE a_table();
> S0: INSERT INTO a_table();
> S0: disconnect
> S1: set a breakpoint to just after the dshash_release_lock(), with an
>  if objid == a_table_oid
> S1: SELECT pg_stat_get_live_tuples('a_table'::regclass);
>   (will break at above breakpoint, without having incremented the
>   refcount yet)
> S2: DROP TABLE a_table;
> S2: VACUUM pg_class;
> 
> At that point S2's call to pgstat_vacuum_stat() will find the shared
> stats entry for a_table, delete the entry from the shared hash table,
> see that the stats data has a zero refcount, and free it. Once S1 wakes
> up it'll use already freed (and potentially since reused) memory.

Sorry for the delay.  You're right. I actually see S1 block permanently
when it continues running after the vacuum.  That happens in
LWLockRelease on the freed block.

Moving the refcount increment before the dshash_release_lock() call
fixes that.  One issue with doing so *was* that get_stat_entry() had a
code path for the case where pgStatCacheContext is not available, but
that path is already dead. With the early lock release removed, the
comment is no longer needed either.
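
To illustrate the ordering, here is a minimal sketch in C. It is not the patch code: SharedEntry, entry_acquire(), entry_drop() and entry_release() are hypothetical names, and an atomic_flag spinlock stands in for the dshash partition lock. The only point is that the reference is taken while the lock is still held, so a concurrent drop sees a nonzero refcount and leaves the memory alone.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a shared stats entry; not the patch's struct. */
typedef struct SharedEntry
{
    atomic_flag lock;       /* stands in for the dshash partition lock */
    atomic_uint refcount;   /* number of backends holding a reference */
    bool        dropped;    /* entry was unlinked from the shared hash */
    bool        freed;      /* memory would have been released */
} SharedEntry;

static void
entry_init(SharedEntry *e)
{
    atomic_flag_clear(&e->lock);
    atomic_init(&e->refcount, 0);
    e->dropped = false;
    e->freed = false;
}

static void
entry_lock(SharedEntry *e)
{
    while (atomic_flag_test_and_set(&e->lock))
        ;                   /* spin until the lock is free */
}

static void
entry_unlock(SharedEntry *e)
{
    atomic_flag_clear(&e->lock);
}

/* Reader: take the reference BEFORE releasing the lock. */
static bool
entry_acquire(SharedEntry *e)
{
    bool        ok;

    entry_lock(e);
    ok = !e->dropped;
    if (ok)
        atomic_fetch_add(&e->refcount, 1);
    entry_unlock(e);
    return ok;
}

/* Dropper: unlink the entry, free it only if nobody references it. */
static void
entry_drop(SharedEntry *e)
{
    entry_lock(e);
    e->dropped = true;
    if (atomic_load(&e->refcount) == 0)
        e->freed = true;
    entry_unlock(e);
}

/* Reader: give the reference back; the last holder of a dropped entry frees it. */
static void
entry_release(SharedEntry *e)
{
    entry_lock(e);
    assert(atomic_load(&e->refcount) > 0);
    if (atomic_fetch_sub(&e->refcount, 1) == 1 && e->dropped)
        e->freed = true;
    entry_unlock(e);
}
```

With this ordering, the S1/S2 sequence quoted above ends with S2 only marking the entry as dropped, and S1's final release doing the actual free.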

While working on this, I noticed that the previous diff
v56-57-func-diff.txt was slightly stale (missing a later bug fix).  So
the attached patch also includes that fix.

Please find the attached.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3f546afe6a..3e2b90e92b 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -884,6 +884,8 @@ get_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait, bool create,
                                      create, nowait, create, &shfound);
     if (shhashent)
     {
+        bool        lofound;
+
         if (create && !shfound)
         {
             /* Create new stats entry. */
@@ -900,38 +902,27 @@ get_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait, bool create,
         else
             shheader = dsa_get_address(area, shhashent->body);
 
-        /*
-         * We expose this shared entry now.  You might think that the entry
-         * can be removed by a concurrent backend, but since we are creating
-         * an stats entry, the object actually exists and used in the upper
-         * layer. Such an object cannot be dropped until the first vacuum
-         * after the current transaction ends.
-         */
+        pg_atomic_add_fetch_u32(&shheader->refcount, 1);
         dshash_release_lock(pgStatSharedHash, shhashent);
 
-        /* register to local hash if possible */
-        if (pgStatEntHash || pgStatCacheContext)
+        /* Create local hash if not yet */
+        if (pgStatEntHash == NULL)
         {
-            bool        lofound;
+            Assert(pgStatCacheContext);
 
-            if (pgStatEntHash == NULL)
-            {
-                pgStatEntHash =
-                    pgstat_localhash_create(pgStatCacheContext,
-                                            PGSTAT_TABLE_HASH_SIZE, NULL);
-                pgStatEntHashAge =
-                    pg_atomic_read_u64(&StatsShmem->gc_count);
-            }
-
-            lohashent =
-                pgstat_localhash_insert(pgStatEntHash, key, &lofound);
-
-            Assert(!lofound);
-            lohashent->body = shheader;
-            lohashent->dsapointer = shhashent->body;
-
-            pg_atomic_add_fetch_u32(&shheader->refcount, 1);
+            pgStatEntHash =
+                pgstat_localhash_create(pgStatCacheContext,
+                                        PGSTAT_TABLE_HASH_SIZE, NULL);
+            pgStatEntHashAge =
+                pg_atomic_read_u64(&StatsShmem->gc_count);
         }
+
+        lohashent =
+            pgstat_localhash_insert(pgStatEntHash, key, &lofound);
+
+        Assert(!lofound);
+        lohashent->body = shheader;
+        lohashent->dsapointer = shhashent->body;
     }
 
     if (found)
@@ -1260,10 +1251,12 @@ pgstat_report_stat(bool force)
         pgstat_localhash_start_iterate(pgStatLocalHash, &i);
         while ((lent = pgstat_localhash_iterate(pgStatLocalHash, &i)) != NULL)
         {
-            /* no other types of entry must be found here */
-            Assert(lent->key.type == PGSTAT_TYPE_DB);
-
-            if (flush_dbstat(lent, nowait))
+            /*
+             * The loop above may have left some non-db entries while system is
+             * busy.  Process only database stats entries here.
+             */
+            if (lent->key.type == PGSTAT_TYPE_DB &&
+                flush_dbstat(lent, nowait))
             {
                 remains--;
                 /* Remove the successfully flushed entry */

Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
Thank you for all the help!


At Mon, 22 Mar 2021 16:17:37 -0700, Andres Freund <andres@anarazel.de> wrote in 
> Thanks! That change shouldn't be necessary on my branch - I did
> something to fix this kind of problem too. I decided that there's no
> point in doing hash table lookups for the database: It's not going to
> change in the life of a backend. So there's now two static "pending"

Right.

> entries: One for the current DB, one for the shared DB.  There's only
> one place that needed to change,
> pgstat_report_checksum_failures_in_db(), which now reports the changes
> directly instead of going via pending.
> I suspect we should actually do that with a number of other DB specific
> functions. Things like recovery conflicts, deadlocks, checksum failures
> imo should really not be delayed till later. And you should never have
> enough of them to make contention a concern.

Sounds reasonable.

> You can see a somewhat sensible list of changes from your v52 at
> https://github.com/anarazel/postgres/compare/master...shmstat-before-split-2021-03-22
> (I did fix some of damage from rebase in a non-incremental way, of course)
> 
> My branch: https://github.com/anarazel/postgres/tree/shmstat
> 
> It would be cool if you could check if there any relevant things between
> v52 and v56 that I should include.
> 
> 
> I think a lot of the concerns I had with the patch are addressed at the
> end of my series of changes.  Please let me know what you think.

I like the name "stats subsystem".

https://github.com/anarazel/postgres/commit/f28463601e93c68f4dd50fe930d29a54509cffc7

I'm impressed by the way you resolved "who should load stats". Using a
static shared memory area to hold the pointer to the existing DSA
memory resolves the "first attacher problem".  Although I'm somewhat
doubtful about "who should write the stats file", I think it is
reasonable in general.

But the current place of calling pgstat_write_stats() is a bit too
early.  The checkpointer reports some stats *after* calling
ShutdownXLOG().  Perhaps we need to move it after the
pg_stat_report_*() calls in HandleCheckpointerInterrupts().


Separating pgbestat_backend_initialize() from pgstat_initialize()
allows us to initialize the stats subsystem earlier in autovacuum
workers, which looks nice.


https://github.com/anarazel/postgres/commit/3304ee1344f348e079b5eb208d76a2f1553e721c

>     * Whenever the for a dropped stats entry could not be freed (because
>     * backends still have references), this is incremented, causing backends
>     * to run pgstat_lookup_cache_gc(), allowing that memory to be reclaimed.

"Whenever the <what?> for a "

gc_count is incremented whenever *some stats hash entries are
removed*. Some of the delinked shared stats areas might not be freed
because they are still referenced.

When a backend finds that gc_count has been incremented, it removes
the local hash entries that correspond to the delinked shared entries.
If the backend was the last referrer, it frees the shared area.
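
The behaviour described above can be sketched roughly as follows. This is a single-threaded model under stated assumptions, not the patch code: SharedArea, Backend, backend_ref(), shared_drop() and backend_gc() are hypothetical names, a plain global stands in for StatsShmem->gc_count, and a small array stands in for the backend-local hash.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LOCAL_REFS 4

/* Hypothetical shared stats area. */
typedef struct SharedArea
{
    bool        delinked;   /* removed from the shared hash */
    unsigned    refcount;   /* backends still pointing at it */
    bool        freed;      /* memory would have been released */
} SharedArea;

/* Hypothetical per-backend state. */
typedef struct Backend
{
    uint64_t    seen_gc_count;          /* gc_count at the last local GC */
    SharedArea *refs[MAX_LOCAL_REFS];   /* local hash stand-in; NULL = unused */
} Backend;

/* Stand-in for the shared gc_count counter. */
static uint64_t gc_count = 0;

/* Remember a shared area locally and count the reference. */
static void
backend_ref(Backend *b, int slot, SharedArea *a)
{
    assert(slot >= 0 && slot < MAX_LOCAL_REFS && b->refs[slot] == NULL);
    b->refs[slot] = a;
    a->refcount++;
}

/* Delink an area; free it now only if unreferenced, and signal backends. */
static void
shared_drop(SharedArea *a)
{
    a->delinked = true;
    if (a->refcount == 0)
        a->freed = true;
    gc_count++;
}

/*
 * If gc_count moved since the last check, forget local references to
 * delinked areas; the last referrer frees the area.  Returns the number
 * of local references released.
 */
static int
backend_gc(Backend *b)
{
    int         released = 0;

    if (b->seen_gc_count == gc_count)
        return 0;

    for (int i = 0; i < MAX_LOCAL_REFS; i++)
    {
        SharedArea *a = b->refs[i];

        if (a != NULL && a->delinked)
        {
            b->refs[i] = NULL;
            if (--a->refcount == 0)
                a->freed = true;
            released++;
        }
    }
    b->seen_gc_count = gc_count;
    return released;
}
```

The unchanged-counter fast path is what keeps the check cheap: a backend scans its local references only after some entry has actually been removed.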


https://github.com/anarazel/postgres/commit/88ffb289860c7011e729cd0a1a01cda1899e6209

Ah, it sounds nice that refcount == 1 means it is to be dropped and no
one is referring to it. Thanks!

https://github.com/anarazel/postgres/commit/03824a236597c87c99d07aa14b9af9d6fe04dd37

+         * XXX: Why is this a good place to do this?

Agreed. We don't need to be in such haste to clean up stats entries.
Could we run that in pgstat_report_stat()?


flush_walstat()

I found a mistake in an existing comment.

- * If nowait is true, this function returns false on lock failure. Otherwise
- * this function always returns true.
+ * If nowait is true, this function returns true on lock failure. Otherwise
+ * this function always returns false.
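
The corrected contract (true means "could not take the lock, stats are still pending") can be sketched like this. It is only an illustration: SharedCounter, counter_init() and flush_pending() are hypothetical names, and an atomic_flag stands in for the LWLock that flush_walstat() actually uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared counter protected by a simple lock. */
typedef struct SharedCounter
{
    atomic_flag lock;       /* stands in for the LWLock */
    uint64_t    total;      /* accumulated shared value */
} SharedCounter;

static void
counter_init(SharedCounter *c)
{
    atomic_flag_clear(&c->lock);
    c->total = 0;
}

/*
 * Flush *pending into the shared counter.
 *
 * If nowait is true, returns true on lock failure and leaves *pending
 * untouched.  Otherwise waits for the lock and always returns false.
 */
static bool
flush_pending(SharedCounter *c, uint64_t *pending, bool nowait)
{
    if (nowait)
    {
        if (atomic_flag_test_and_set(&c->lock))
            return true;    /* lock is busy; caller retries later */
    }
    else
    {
        while (atomic_flag_test_and_set(&c->lock))
            ;               /* spin until the lock is free */
    }

    c->total += *pending;
    *pending = 0;
    atomic_flag_clear(&c->lock);
    return false;
}
```

A caller such as pgstat_report_stat() can then count the true results to know how many entries remain to be flushed on the next attempt.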


https://github.com/anarazel/postgres/commit/7bde068d8a512d918f76cfc88c1c10f1db8fe553
(pgstat_reset_replslot_counter())
+     * AFIXME: pgstats has business no looking into slot.c structures at
+     * this level of detail.

Would just moving the name resolution part to pgstatfuncs.c resolve it?
pgstat_report_replslot_drop() was fixed in a similar way.


https://github.com/anarazel/postgres/commit/ded2198d93ce5944fc9d68031d86dd84944053f8

Yeah, I forcibly consolidated replslot stats into the stats hash, but
I agree that it would be more natural for replslot stats to be
fixed-size stats.

https://github.com/anarazel/postgres/commit/e2ef1931fb51da56a6ba483c960e034e52f90430

Agreed that it's better to move database stat entries to fixed pointers.

> My next step is going to be to squash all my changes into the base
> patch, and try to extract all the things that I think can be
> independently committed, and to reduce unnecessary diff noise.  Once
> that's done I plan to post that series to the list.
> 
> 
> TODO:
> 
> - explain the design at the top of pgstat.c
> 
> - figure out a way to deal with the different demands on stats
>   consistency / efficiency
> 
> - see how hard it'd be to not need collect_oids()
> 
> - split pgstat.c
> 
> - consider removing PgStatTypes and replacing it with the oid of the
>   table the type of stats reside in. So PGSTAT_TYPE_DB would be
>   DatabaseRelationId, PGSTAT_TYPE_TABLE would be RelationRelationId, ...
>
>   I think that'd make the system more cleanly extensible going forward?

I'm not sure that works as expected.  We already separated replication
stats from the unified stats hash, and pgstat_read/write_statsfile()
needs to have a corresponding specific code path.

> - I'm not yet happy with the naming schemes in use in pgstat.c. I feel
>   like I improved it a bunch, but it's not yet there.

I feel the same about the namings.

> - the replication slot stuff isn't quite right in my branch

Ah, yeah. As I mentioned above, I think it should be in the unified
stats and should have a special shortcut.  The global stats should be
handled the same way.

> - I still don't quite like the reset_offset stuff - I wonder if we can
>   find something better there. And if not, whether we can deduplicate
>   the code between functions like pgstat_fetch_stat_checkpointer() and
>   pgstat_report_checkpointer().

Yeah, I find it annoying.  If we stored the reset offset negated (as a
two's complement), the two computations would have the same shape.  It
might be somewhat tricky, but we could deduplicate the code.  (In
exchange, we would need additional code to convert the reset offset.)
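
A small sketch of that idea, under the assumption that the fetch side computes "shared value minus reset offset" while the report side computes "shared value plus pending delta". The function names are hypothetical and do not come from the patch; the sketch relies only on the fact that unsigned 64-bit arithmetic in C wraps modulo 2^64.

```c
#include <assert.h>
#include <stdint.h>

/* Fetch side as assumed here: visible value = shared - reset offset. */
static uint64_t
fetch_with_offset(uint64_t shared, uint64_t reset_offset)
{
    return shared - reset_offset;
}

/* Convert a reset offset into its two's complement (modulo 2^64). */
static uint64_t
negate_offset(uint64_t reset_offset)
{
    return (uint64_t) 0 - reset_offset;
}

/* Single shared shape: works for both "add pending" and "apply negated offset". */
static uint64_t
accumulate(uint64_t base, uint64_t delta)
{
    return base + delta;
}
```

With the offset stored negated, accumulate(shared, negated_offset) gives the same result as fetch_with_offset(shared, offset), so one helper could serve both paths; the cost is the extra conversion whenever the offset is set.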

>   At the very least it'll need a lot better comments.
> 
> - bunch of FIXMEs / XXXs

I'll look more closely at the patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
At Mon, 15 Mar 2021 17:51:31 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Fri, 12 Mar 2021 22:20:40 -0800, Andres Freund <andres@anarazel.de> wrote in 
> > Horiguchi-san, is there a chance you could add a few tests (on master)
> > that test/document the way stats are kept across "normal" restarts, and
> > thrown away after crash restarts/immediate restarts and also thrown away
> > graceful streaming rep shutdowns?
> 
> Roger. I'll give it a try.

Sorry, I forgot this. This is it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
# Verify stats data persistence

use strict;
use warnings;
use PostgresNode;
use TestLib;
use Test::More tests => 6;

my $backup_name = 'my_backup';

# Initialize a test cluster
my $node_primary = get_new_node('primary');
$node_primary->init(allows_streaming => 1);
$node_primary->start;

$node_primary->safe_psql('postgres', 'CREATE TABLE t(); SELECT * from t;');
my $count =
  $node_primary->safe_psql('postgres',
                           'SELECT seq_scan FROM pg_stat_user_tables');
is($count, '1', "counter is incremented");

# Stats numbers are restored after graceful restart
$node_primary->restart;

$count =
  $node_primary->safe_psql('postgres',
                           'SELECT seq_scan FROM pg_stat_user_tables');
is($count, '1', "counter is correctly restored");

# Stats numbers are blown away after a crash
$node_primary->stop('immediate');
$node_primary->start;

$count =
  $node_primary->safe_psql('postgres',
                           'SELECT seq_scan FROM pg_stat_user_tables');
is($count, '0', "counter is reset after crash");

# Create a standby
$node_primary->backup($backup_name);
my $node_standby = get_new_node('standby');
$node_standby->init_from_backup($node_primary, $backup_name,
                                has_streaming => 1);
$node_standby->start;

# Stats numbers are initialized to zero
$count =
  $node_standby->safe_psql('postgres',
                           'SELECT seq_scan FROM pg_stat_user_tables');
is($count, '0', "counter is initialized");

# Stats numbers are incremented also on standby
$node_standby->safe_psql('postgres', 'SELECT * from t;');
$count =
  $node_standby->safe_psql('postgres',
                           'SELECT seq_scan FROM pg_stat_user_tables');
is($count, '1', "counter is incremented on standby");

# Stats numbers are always blown away even after graceful restart on standby
$node_standby->restart;

$count =
  $node_standby->safe_psql('postgres',
                           'SELECT seq_scan FROM pg_stat_user_tables');
is($count, '0', "counter is reset after restart on standby");

Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
At Thu, 1 Apr 2021 19:44:25 -0700, Andres Freund <andres@anarazel.de> wrote in 
> Hi,
> 
> I spent quite a bit more time working on the patch. There's are large
> changes:
> 
> - postmaster/pgstats.c (which is an incorrect location now that it's not
>   a subprocess anymore) is split into utils/activity/pgstat.c and
>   utils/activity/pgstat_kind.c. I don't love the _kind name, but I
>   couldn't come up with anything better.

The location was left unchanged to keep the patch footprint smaller.
I agree that the old location is not appropriate.  As for pgstat_kind:
how about renaming pgstat.c to pgstat_core.c and pgstat_kind.c to
pgstat.c?

> - Implemented a new GUC, stats_fetch_consistency = {none, cache,
>   snapshot}. I think the code overhead of it is pretty ok - most of the
>   handling is entirely generalized.

Sounds good.

> - Most of the "per stats kind" handling is done in pgstat_kind.c. Nearly
>   all the rest is done through an array with per-stats-kind information
>   (extending what was already done with pgstat_sharedentsize etc).
>
> - There is no separate "pending stats" hash anymore. If there are
>   pending stats, they are referenced from 'pgStatSharedRefHash' (which
>   used to be the "lookup cache" hash). All the entries with pending
>   stats are in double linked list pgStatPending.

Sounds reasonable. It's a bit similar to TabStatusArray. Pending stats
and shared stats share the same key, so they can naturally be
consolidated.

> - A stat's entry's lwlock, refcount, .. are moved into the dshash
>   entry. There is no need for them to be separate anymore. Also allows
>   to avoid doing some dsa lookups while holding dshash locks.
>
> - The dshash entries are not deleted until the refcount has reached
>   0. That's an important building block to avoid constantly re-creating
>   stats when flushing pending stats for a dropped object.

Does that mean the entries for a dropped object are actually removed,
at exit, by the backend that flushed stats for that object?
Sounds nice.

> - The reference to the shared entry is established the first time stats
>   for an object are reported. Together with the previous entry that
>   avoids nearly all the avenues for re-creating already dropped stats
>   (see below for the hole).
> 
> - I addded a bunch of pg_regress style tests, and a larger amount of
>   isolationtester tests. The latter are possibly due to a new
>   pg_stat_force_next_flush() function, avoiding the need to wait until
>   stats are submitted.
> 
> - 2PC support for "precise" dropping of stats has been added, the
>   collect_oids() based approach removed.

Cool!

> - lots of bugfixes, comments, etc...

Thanks for all of them.

> I know of one nontrivial issue that can lead to dropped stats being
> revived:
> 
> Within a transaction, a functions can be called even when another
> transaction that dropped that function has already committed. I added a
> spec test reproducing the issue:
> 
> # FIXME: this shows the bug that stats will be revived, because the
> # shared stats in s2 is only referenced *after* the DROP FUNCTION
> # committed. That's only possible because there is no locking (and
> # thus no stats invalidation) around function calls.
> permutation
>   "s1_track_funcs_all" "s2_track_funcs_none"
>   "s1_func_call" "s2_begin" "s2_func_call" "s1_func_drop" "s2_track_funcs_all" "s2_func_call" "s2_commit" "s2_ff" "s1_func_stats" "s2_func_stats"
> 
> I think the best fix here would be to associate an xid with the dropped
> stats object, and only delete the dshash entry once there's no alive
> with a horizon from before that xid...

I'm not sure how we could do that without a full scan of the dshash.

> There's also a second issue (stats for newly created objects surviving
> the transaction), but that's pretty simple to resolve.
> 
> Here's all the gory details of my changes happening incrementally:
> 
> https://github.com/anarazel/postgres/compare/master...shmstat
> 
> I'll squash and split tomorrow. Too tired for today.

Thank you very much for all of your immense effort.

> I think this is starting to look a lot better than what we have now. But
> I'm getting less confident that it's realistic to get any of this into
> PG14, given the state of the release cycle.
> 
> 
> 
> > I'm impressed that the way you resolved "who should load stats". Using
> > static shared memory area to hold the point to existing DSA memory
> > resolves the "first attacher problem".  However somewhat doubtful
> > about the "who should write the stats file", I think it is reasonable
> > in general.
> >
> > But the current place of calling pgstat_write_stats() is a bit to
> > early.  Checkpointer reports some stats *after* calling
> > ShutdownXLOG().  Perhaps we need to move it after pg_stat_report_*()
> > calls in HandleCheckpointerInterrupts().
> 
> I now moved it into a before_shmem_exit(). I think that should avoid
> that problem?

I think so.

> > https://github.com/anarazel/postgres/commit/03824a236597c87c99d07aa14b9af9d6fe04dd37
> >
> > +         * XXX: Why is this a good place to do this?
> >
> > Agreed. We don't need to so haste to clean up stats entries.  We could
> > run that in pgstat_reporT_stat()?
> 
> I've not changed that yet, but I had the same thought.
> 
> 
> > Agreed that it's better to move database stat entries to fixed
> > pointers.
> 
> I actually ended up reverting that. My main motivation for it was that
> it was problematic that new pending database stats entries could be
> created at some random place in the hashtable. But with the linked list
> of pending entries that's not a problem anymore. And I found it
> nontrivial to manage the refcounts to the shared entry accurately this
> way.
> 
> We could still add a cache for the two stats entries though...

Yeah.

> > > - consider removing PgStatTypes and replacing it with the oid of the
> > >   table the type of stats reside in. So PGSTAT_TYPE_DB would be
> > >   DatabaseRelationId, PGSTAT_TYPE_TABLE would be RelationRelationId, ...
> > >
> > >   I think that'd make the system more cleanly extensible going forward?
> >
> > I'm not sure that works as expected.  We already separated repliation
> > stats from the unified stats hash and pgstat_read/write_statsfile()
> > needs have the corresponding specific code path.
> 
> I didn't quite go towards my proposal, but I think I got a lot closer
> towards not needing much extra code for additional types of stats. I
> even added an XXX to pgstat_read/write_statsfile() that show how they
> now could be made generic.

I'll check it.

> > > - the replication slot stuff isn't quite right in my branch
> >
> > Ah, yeah. As I mentioned above I think it should be in the unified
> > stats and should have a special means of shotcut.  And the global
> > stats also should be the same.
> 
> The problem is that I use indexes for addressing, but that they can
> change between restarts. I think we can fix that fairly easily, by
> mapping names to indices once, pgstat_restore_stats().  At the point we
> call pgstat_restore_stats() StartupReplicationSlots() already was
> executed, so we can just inquire at that point...

Does that mean the saved replslot stats are keyed by slot name?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: shared-memory based stats collector

From
Andres Freund
Date:
Hi,

On 2021-03-16 12:54:40 -0700, Andres Freund wrote:
> I did consider command_progress.c too - but that seems confusing because
> there's src/include/commands/progress.h, which is imo a different layer
> than what pgstat/backend_progress provide. So I thought splitting things
> up so that backend_progress.[ch] provide the place to store the progress
> values, and commands/progress.h defining the meaning of the values as
> used for in-core postgres commands would make sense.  I could see us
> using the general progress infrastructure for things that'd not fit
> super well into commands/* at some point...

Thinking about it some more, having the split between backend_status.h
and commands/progress.h actually makes a fair bit of sense from another
angle: Commands utilizing workers. backend_status.h provides
infrastructure to store progress counters for a single backend, but
multiple backends can participate in a command...

I added some comments to the header to that end.

Greetings,

Andres Freund



Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
At Tue, 6 Apr 2021 09:32:16 -0700, Andres Freund <andres@anarazel.de> wrote in 
> Hi,
> 
> On 2021-04-05 02:29:14 -0700, Andres Freund wrote:
..
> I'm inclined to push patches
> [PATCH v60 05/17] pgstat: split bgwriter and checkpointer messages.
> [PATCH v60 06/17] pgstat: Split out relation stats handling from AtEO[Sub]Xact_PgStat() etc.
> [PATCH v60 09/17] pgstat: xact level cleanups / consolidation.
> [PATCH v60 10/17] pgstat: Split different types of stats into separate files.
> [PATCH v60 12/17] pgstat: reorder file pgstat.c / pgstat.h contents.

FWIW..

05 is a straightforward code rearrangement and is reasonable to apply.

06 is the same as above, and it seems to make things cleaner.

09 mainly adds ensure_tabtat_xact_level() to remove repeated code
  blocks in a straightforward way. I wonder if
  pgstat_xact_stack_level_get() might better be
  pgstat_get_xact_stack_level(), but I'm fine with the name in the
  patch.

10 I found that the kind in "pgstat_kind" meant the placeholder for
  specific types.  It looks good to separate them into smaller pieces.
  It is also a simple rearrangement of code.

> pgstat.c is very long, and it's hard to find an order that makes sense
> and is likely to be maintained over time. Splitting the different

  I deeply agree with "hard to find an order that makes sense".

12 I'm not sure how it looks after this patch (I failed to apply 09
  locally), but it is also a simple rearrangement of code blocks.

> to v14. They're just moving things around, so are fairly low risk. But
> they're going to be a pain to maintain. And I think 10 and 12 make
> pgstat.c a lot easier to understand.

I think that pgstat.c doesn't get frequent back-patching.  It seems to
me that at least 10 looks good.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: shared-memory based stats collector

From
Ibrar Ahmed
Date:


On Wed, Apr 7, 2021 at 8:05 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
At Tue, 6 Apr 2021 09:32:16 -0700, Andres Freund <andres@anarazel.de> wrote in
> Hi,
>
> On 2021-04-05 02:29:14 -0700, Andres Freund wrote:
..
> I'm inclined to push patches
> [PATCH v60 05/17] pgstat: split bgwriter and checkpointer messages.
> [PATCH v60 06/17] pgstat: Split out relation stats handling from AtEO[Sub]Xact_PgStat() etc.
> [PATCH v60 09/17] pgstat: xact level cleanups / consolidation.
> [PATCH v60 10/17] pgstat: Split different types of stats into separate files.
> [PATCH v60 12/17] pgstat: reorder file pgstat.c / pgstat.h contents.

FWIW..

05 is a straight forward code-rearrange and reasonable to apply.

06 is same as above and it seems to make things cleaner.

09 mainly adds ensure_tabtat_xact_level() to remove repeated code
  blocks a straight-forward way. I wonder if
  pgstat_xact_stack_level_get() might better be
  pgstat_get_xact_stack_level(), but I'm fine with the name in the
  patch.

10 I found that the kind in "pgstat_kind" meant the placeholder for
  specific types.  It looks good to separate them into smaller pieces.
  It is also a simple rearrangement of code.

> pgstat.c is very long, and it's hard to find an order that makes sense
> and is likely to be maintained over time. Splitting the different

  I deeply agree to "hard to find an order that makes sense".

12 I'm not sure how it looks after this patch (I failed to apply 09 at
  my hand.), but it is also a simple rearrangement of code blocks.

> to v14. They're just moving things around, so are fairly low risk. But
> they're going to be a pain to maintain. And I think 10 and 12 make
> pgstat.c a lot easier to understand.

I think that pgstat.c doesn't get frequent back-patching.  It seems to
me that at least 10 looks good.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center



The patch does not apply and requires a rebase:
1 out of 8 hunks FAILED -- saving rejects to file src/include/pgstat.h.rej
patching file src/backend/access/transam/xlog.c
Hunk #1 succeeded at 8758 (offset 34 lines).
patching file src/backend/postmaster/checkpointer.c
Hunk #3 succeeded at 496 with fuzz 1.
Hunk #4 FAILED at 576.
1 out of 6 hunks FAILED -- saving rejects to file src/backend/postmaster/checkpointer.c.rej
patching file src/backend/postmaster/pgstat.c

I am changing the status to "Waiting on Author".

--
Ibrar Ahmed

Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
At Mon, 19 Jul 2021 15:34:56 +0500, Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote in 
> The patch does not apply, and require rebase,

Yeah, thank you very much for checking that. However, this patch is
now developed in Andres' GitHub repository.  So I'm at a loss as to
what to do about the failure.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: shared-memory based stats collector

From
Andres Freund
Date:
Hi,

On 2021-07-21 17:09:49 +0900, Kyotaro Horiguchi wrote:
> At Mon, 19 Jul 2021 15:34:56 +0500, Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote in 
> > The patch does not apply, and require rebase,
> 
> Yeah, thank you very much for checking that. However, this patch is
> now developed in Andres' GitHub repository.  So I'm at a loss what to
> do for the failure..

I'll post a rebased version soon.

Greetings,

Andres Freund



Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
> > Yeah, thank you very much for checking that. However, this patch is
> > now developed in Andres' GitHub repository.  So I'm at a loss what to
> > do for the failure..
> 
> I'll post a rebased version soon.

(Sorry if you felt hurried; I didn't mean to rush you.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: shared-memory based stats collector

From
Andres Freund
Date:
Hi,

On 2021-07-26 17:52:01 +0900, Kyotaro Horiguchi wrote:
> > > Yeah, thank you very much for checking that. However, this patch is
> > > now developed in Andres' GitHub repository.  So I'm at a loss what to
> > > do for the failure..
> > 
> > I'll post a rebased version soon.
> 
> (Sorry if you feel being hurried, which I didn't meant to.)

No worries!

I had intended to post a rebase by now. But while I did mostly finish
that (see [1]) I unfortunately encountered a new issue around
partitioned tables, see [2]. Currently I'm hoping for a few thoughts on
that thread about the best way to address the issues.

Greetings,

Andres Freund

[1] https://github.com/anarazel/postgres/tree/shmstat
[2] https://www.postgresql.org/message-id/20210722205458.f2bug3z6qzxzpx2s%40alap3.anarazel.de



Re: shared-memory based stats collector

From
Andres Freund
Date:
Hi,

On 2021-07-26 18:27:54 -0700, Andres Freund wrote:
> I had intended to post a rebase by now. But while I did mostly finish
> that (see [1]) I unfortunately encountered a new issue around
> partitioned tables, see [2]. Currently I'm hoping for a few thoughts on
> that thread about the best way to address the issues.

Now that https://postgr.es/m/20220125063131.4cmvsxbz2tdg6g65%40alap3.anarazel.de
is resolved, here's a rebased version. With a good bit of further cleanup.

One "big" thing that I'd like to figure out is a naming policy for the
different types prefixed with PgStat. We have different groups of types:

- "pending statistics", that are accumulated but not yet submitted to the
  shared stats system, like PgStat_TableStatus, PgStat_BackendFunctionEntry
  etc
- accumulated statistics like PgStat_StatDBEntry, PgStat_SLRUStats. About half are
  prefixed with PgStat_Stat, the other just with PgStat_
- random other types like PgStat_Single_Reset_Type, ...

To me it's very confusing to have these all in an essentially undifferentiated
namespace, particularly the top two items.

I think we should at least do s/PgStat_Stat/PgStat_/. Perhaps we should use a
distinct PgStatPending_* for the pending item? I can't quite come up with a
good name for the "accumulated" ones.


I'd like to get that resolved first because I think that'd allow committing the
preparatory split and reordering patches.

Greetings,

Andres Freund

Attachments

Re: shared-memory based stats collector

From
Kyotaro Horiguchi
Date:
At Wed, 2 Mar 2022 18:16:00 -0800, Andres Freund <andres@anarazel.de> wrote in 
> Hi,
> 
> On 2021-07-26 18:27:54 -0700, Andres Freund wrote:
> > I had intended to post a rebase by now. But while I did mostly finish
> > that (see [1]) I unfortunately encountered a new issue around
> > partitioned tables, see [2]. Currently I'm hoping for a few thoughts on
> > that thread about the best way to address the issues.
> 
> Now that https://postgr.es/m/20220125063131.4cmvsxbz2tdg6g65%40alap3.anarazel.de
> is resolved, here's a rebased version. With a good bit of further cleanup.
> 
> One "big" thing that I'd like to figure out is a naming policy for the
> different types prefixed with PgStat. We have different groups of types:
> 
> - "pending statistics", that are accumulated but not yet submitted to the
>   shared stats system, like PgStat_TableStatus, PgStat_BackendFunctionEntry
>   etc
> - accumulated statistics like PgStat_StatDBEntry, PgStat_SLRUStats. About half are
>   prefixed with PgStat_Stat, the other just with PgStat_
> - random other types like PgStat_Single_Reset_Type, ...
> 
> To me it's very confusing to have these all in an essentially undistinguishing
> namespace, particularly the top two items.

Profoundly agreed.  It was always a pain in the neck.

> I think we should at least do s/PgStat_Stat/PgStat_/. Perhaps we should use a
> distinct PgStatPending_* for the pending item? I can't quite come up with a
> good name for the "accumulated" ones.

How about naming "pending stats" as just "Stats" and the "accumulated
stats" as "counts" or "counters"?  "Counter" doesn't reflect the
characteristics exactly, but I think being able to tell the two apart
matters more.  Specifically:

- PgStat_TableStatus
+ PgStat_TableStats
- PgStat_BackendFunctionEntry
+ PgStat_FunctionStats

- PgStat_GlobalStats
+ PgStat_GlobalCounts
- PgStat_ArchiverStats
+ PgStat_ArchiverCounts
- PgStat_BgWriterStats
+ PgStat_BgWriterCounts

Moving to the shared stats collector would then qualify them as "Local"
and "Shared". (I'm not considering the details at this stage.)

PgStatLocal_TableStats
PgStatLocal_FunctionStats
PgStatLocal_GlobalCounts
PgStatLocal_ArchiverCounts
PgStatLocal_BgWriterCounts

PgStatShared_TableStats
PgStatShared_FunctionStats
PgStatShared_GlobalCounts
PgStatShared_ArchiverCounts
PgStatShared_BgWriterCounts

PgStatLocal_GlobalCounts looks somewhat odd, but maybe that doesn't matter much.
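
To make the Local/Shared split above a bit more concrete, here is a minimal
sketch in C. The two type names are taken from the list above; the fields and
the flush_table_stats() helper are assumptions made up for illustration, not
anything from the patch, and a real implementation would need to lock the
shared entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Backend-local counters that have not been flushed yet ("pending"). */
typedef struct PgStatLocal_TableStats
{
    int64_t     tuples_inserted;    /* hypothetical field */
    int64_t     tuples_deleted;     /* hypothetical field */
} PgStatLocal_TableStats;

/* Accumulated counters as they would live in shared memory. */
typedef struct PgStatShared_TableStats
{
    int64_t     tuples_inserted;
    int64_t     tuples_deleted;
} PgStatShared_TableStats;

/*
 * Fold the pending counters into the accumulated ones and reset the
 * pending side. Locking of the shared entry is omitted in this sketch.
 */
void
flush_table_stats(PgStatLocal_TableStats *pending,
                  PgStatShared_TableStats *shared)
{
    assert(pending != NULL && shared != NULL);

    shared->tuples_inserted += pending->tuples_inserted;
    shared->tuples_deleted += pending->tuples_deleted;

    pending->tuples_inserted = 0;
    pending->tuples_deleted = 0;
}
```

With names like these, a function signature alone shows which side of the
flush a given struct belongs to, which is the point of the proposal.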

> I'd like that get resolved first because I think that'd allow commiting the
> prepatory split and reordering patches.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: shared-memory based stats collector - v66

From
Andres Freund
Date:
Hi,

I've attached a substantially improved version of the shared memory stats
patch.

The biggest changes are:
- chopped the bigger patches into smaller chunks. Most importantly the
  "transactional stats creation / drop" part is now its own commit. Also the
  tests I added.

- put in a defense against the danger of losing function stats entries due to
  the lack of cache invalidation during function calls. See long comment in
  pgstat_init_function_usage()

- split up pgstat_global.c into pgstat_{archiver, bgwriter, checkpoint,
  replslot, slru, wal}. While each individually not large, there were enough
  of them to make the file confusing. Feels a lot better to work with.

- replication slot stats used the slot "index" in a dangerous way, fixed

- implemented a few omitted features like resetting all subscriptions and
  setting reset timestamps (Melanie)

- loads of code and comment polishing


I think the first few patches are basically ready to be applied and are
independently worthwhile:
- 0001-pgstat-run-pgindent-on-pgstat.c-h.patch
- 0002-pgstat-split-relation-database-stats-handling-ou.patch
- 0003-pgstat-split-out-WAL-handling-from-pgstat_-initi.patch
- 0004-pgstat-introduce-pgstat_relation_should_count.patch
- 0005-pgstat-xact-level-cleanups-consolidation.patch

Might not be worth having separately, should probably just be part of
0014:
- 0006-pgstat-wip-pgstat-relation-init-assoc.patch

A pain to maintain, needs mostly a bit of polishing of file headers. Perhaps I
should rename pgstat_checkpoint.c to pgstat_checkpointer.c, fits better with
function names:
- 0007-pgstat-split-different-types-of-stats-into-separ.patch

This is also painful to maintain. Mostly kept separate from 0007 for easier
reviewing:
- 0009-pgstat-reorder-file-pgstat.c-pgstat.h-contents.patch

Everything after isn't quite there yet / depends on patches that aren't yet
there:
- 0010-pgstat-add-pgstat_copy_relation_stats.patch
- 0011-pgstat-remove-superflous-comments-from-pgstat.h.patch
- 0012-pgstat-stats-collector-references-in-comments.patch
- 0013-pgstat-scaffolding-for-transactional-stats-creat.patch
- 0014-pgstat-store-statistics-in-shared-memory.patch
- 0015-pgstat-add-pg_stat_force_next_flush.patch

Notably this patch makes stat.sql a test that can safely be run concurrent
with other tests:
- 0016-pgstat-utilize-pg_stat_force_next_flush-to-simpl.patch

Needs a tap test as well, but already covers a lot of things that aren't
covered today. Unfortunately it can't really be applied before because it's
too hard to write / slow to run without 0015
- 0017-pgstat-extend-pgstat-test-coverage.patch

I don't yet know what we should do with other users of
PG_STAT_TMP_DIR. There's no need for it for pgstat.c et al anymore. Not sure
that pg_stat_statement is enough of a reason to keep the stats_temp_directory
GUC around?
- 0019-pgstat-wip-remove-stats_temp_directory.patch


Right now we reset stats for replicas, even if we start from a shutdown
checkpoint. That seems pretty unnecessary with this patch:
- 0021-pgstat-wip-only-reset-pgstat-data-after-crash-re.patch


Starting to feel more optimistic about this! There's loads more to do, but now
the TODOs just seem to require elbow grease, rather than deep thinking.

The biggest todos are:
- Address all the remaining AFIXMEs and XXXs

- add longer explanation of architecture to pgstat.c (or a README)

- Further improve our stats test coverage - there's a crapton not covered,
  despite 0017:
  - test WAL replay with stats (stats for dropped tables are removed etc)
  - test crash recovery and "invalid stats file" paths
  - a lot of the pg_stat_ views like bgwriter, pg_stat_database have zero coverage today

- make naming not "a pain in the neck": [1]

- lots of polishing

- revise docs

- run benchmarks - I've done so in the past, but not recently

- perhaps 0014 can be further broken down - it's still uncomfortably large


It's worth noting that the patchset, leaving new tests aside, has a
substantially negative diffstat, even if one includes all the new file
headers... Once a bunch more cleanup is done, I bet it'll improve further.

with new tests:
 80 files changed, 9759 insertions(+), 8051 deletions(-)

without tests (mildly inaccurate):
 71 files changed, 6991 insertions(+), 7814 deletions(-)

just shared memory stats patch, not all the code movement, new file headers:
 49 files changed, 4079 insertions(+), 5472 deletions(-)


Comments and reviews welcome!


I regularly push to https://github.com/anarazel/postgres/tree/shmstat fwiw -
the series is way too big to spam the list all the time.

Greetings,

Andres Freund

[1] https://postgr.es/m/20220303.170412.1542007127371857370.horikyota.ntt%40gmail.com

Attachments

Re: shared-memory based stats collector - v66

From
Melanie Plageman
Date:
On Thu, Mar 17, 2022 at 3:36 AM Andres Freund <andres@anarazel.de> wrote:
>
> Starting to feel more optimistic about this! There's loads more to do, but now
> the TODOs just seem to require elbow grease, rather than deep thinking.
>
> The biggest todos are:
> - Address all the remaining AFIXMEs and XXXs
>
> - add longer explanation of architecture to pgstat.c (or a README)
>
> - Further improve our stats test coverage - there's a crapton not covered,
>   despite 0017:
>   - test WAL replay with stats (stats for dropped tables are removed etc)

Attached is a TAP test to check that stats are cleaned up on a physical
replica after the objects they concern are dropped on the primary.

I'm not sure that the extra force next flush on standby is needed after
drop on primary since drop should report stats and I wait for catchup.
Also, I don't think the tests with DROP SCHEMA actually exercise another
code path, so it might be worth cutting those.

- Melanie

Attachments

Re: shared-memory based stats collector - v66

From
Andres Freund
Date:
Hi,

On 2022-03-20 12:32:39 -0400, Melanie Plageman wrote:
> Attached is a TAP test to check that stats are cleaned up on a physical
> replica after the objects they concern are dropped on the primary.

Thanks!


> I'm not sure that the extra force next flush on standby is needed after
> drop on primary since drop should report stats and I wait for catchup.

A drop doesn't force stats in other sessions to be flushed immediately, so
unless I misunderstand, yes, it's needed.


> Also, I don't think the tests with DROP SCHEMA actually exercise another
> code path, so it might be worth cutting those.

> +/*
> + * Checks for presence of stats for object with provided object oid of kind
> + * specified in the type string in database of provided database oid.
> + *
> + * For subscription stats, only the objoid will be used. For database stats,
> + * only the dboid will be used. The value passed in for the unused parameter is
> + * discarded.
> + * TODO: should it be 'pg_stat_stats_present' instead of 'pg_stat_stats_exist'?
> + */
> +Datum
> +pg_stat_stats_exist(PG_FUNCTION_ARGS)

Should we revoke stats for this one from PUBLIC (similar to the reset functions)?
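
For reference, the parameter handling described in the quoted comment might
look roughly like the following. This is only a sketch: StatsKind, StatsKey,
make_stats_key() and stats_exist() are hypothetical names, a linear scan
stands in for the shared hash table, and storing the unused OID as 0 is an
assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int Oid;       /* stand-in for PostgreSQL's Oid type */

typedef enum StatsKind
{
    STATS_KIND_DATABASE,
    STATS_KIND_RELATION,
    STATS_KIND_SUBSCRIPTION
} StatsKind;

typedef struct StatsKey
{
    StatsKind   kind;
    Oid         dboid;
    Oid         objoid;
} StatsKey;

/* Build a lookup key, discarding whichever OID the kind does not use. */
StatsKey
make_stats_key(StatsKind kind, Oid dboid, Oid objoid)
{
    StatsKey    key = {kind, dboid, objoid};

    if (kind == STATS_KIND_DATABASE)
        key.objoid = 0;         /* database stats: only dboid matters */
    else if (kind == STATS_KIND_SUBSCRIPTION)
        key.dboid = 0;          /* subscription stats: only objoid matters */

    return key;
}

/* Return true if an entry matching the normalized key is present. */
bool
stats_exist(const StatsKey *entries, size_t nentries,
            StatsKind kind, Oid dboid, Oid objoid)
{
    StatsKey    key = make_stats_key(kind, dboid, objoid);

    assert(entries != NULL || nentries == 0);

    for (size_t i = 0; i < nentries; i++)
    {
        if (entries[i].kind == key.kind &&
            entries[i].dboid == key.dboid &&
            entries[i].objoid == key.objoid)
            return true;
    }
    return false;
}
```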


> +# Set track_functions to all on standby
> +$node_standby->append_conf('postgresql.conf', "track_functions = 'all'");

That should already be set, cloning from the primary includes the
configuration from that point in time.

> +$node_standby->restart;

FWIW, it'd also only require a reload....

Greetings,

Andres Freund



Re: shared-memory based stats collector - v66

From
Melanie Plageman
Date:
On Sun, Mar 20, 2022 at 12:58 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-03-20 12:32:39 -0400, Melanie Plageman wrote:
> > Attached is a TAP test to check that stats are cleaned up on a physical
> > replica after the objects they concern are dropped on the primary.
>
> Thanks!
>
>
> > I'm not sure that the extra force next flush on standby is needed after
> > drop on primary since drop should report stats and I wait for catchup.
>
> A drop doesn't force stats in other sessions to be flushed immediately, so
> unless I misunderstand, yes, it's needed.
>
>
> > Also, I don't think the tests with DROP SCHEMA actually exercise another
> > code path, so it might be worth cutting those.
>
> > +/*
> > + * Checks for presence of stats for object with provided object oid of kind
> > + * specified in the type string in database of provided database oid.
> > + *
> > + * For subscription stats, only the objoid will be used. For database stats,
> > + * only the dboid will be used. The value passed in for the unused parameter is
> > + * discarded.
> > + * TODO: should it be 'pg_stat_stats_present' instead of 'pg_stat_stats_exist'?
> > + */
> > +Datum
> > +pg_stat_stats_exist(PG_FUNCTION_ARGS)
>
> Should we revoke stats for this one from PUBLIC (similar to the reset functions)?
>
>
> > +# Set track_functions to all on standby
> > +$node_standby->append_conf('postgresql.conf', "track_functions = 'all'");
>
> That should already be set, cloning from the primary includes the
> configuration from that point in time.
>
> > +$node_standby->restart;
>
> FWIW, it'd also only require a reload....
>

Addressed all of these points in
v2-0001-add-replica-cleanup-tests.patch

also added a new test file in
v2-0002-Add-TAP-test-for-discarding-stats-after-crash.patch
testing correct behavior after a crash and when stats file is invalid

- Melanie

Attachments

Re: shared-memory based stats collector - v67

From
Andres Freund
Date:
Hi,

Attached is v67 of the patch. Changes:

- I've committed a number of the earlier patches after polishing them some more
- lots of small cleanups, particularly around reducing unnecessary diff noise
- included Melanie's tests


On 2022-03-17 00:36:52 -0700, Andres Freund wrote:
> I think the first few patches are basically ready to be applied and are
> independently worthwhile:
> - 0001-pgstat-run-pgindent-on-pgstat.c-h.patch
> - 0002-pgstat-split-relation-database-stats-handling-ou.patch
> - 0003-pgstat-split-out-WAL-handling-from-pgstat_-initi.patch
> - 0004-pgstat-introduce-pgstat_relation_should_count.patch
> - 0005-pgstat-xact-level-cleanups-consolidation.patch

Committed.


> Might not be worth having separately, should probably just be part of
> 0014:
> - 0006-pgstat-wip-pgstat-relation-init-assoc.patch

Committed parts, the "assoc" stuff was moved into the main shared memory stats
patch.


> A pain to maintain, needs mostly a bit of polishing of file headers. Perhaps I
> should rename pgstat_checkpoint.c to pgstat_checkpointer.c, fits better with
> function names:
> - 0007-pgstat-split-different-types-of-stats-into-separ.patch

Committed.


> This is also painful to maintain. Mostly kept separate from 0007 for easier
> reviewing:
> - 0009-pgstat-reorder-file-pgstat.c-pgstat.h-contents.patch

Planning to commit this soon (it's now 0001). Doing a last few passes of
readthrough / polishing.


> I don't yet know what we should do with other users of
> PG_STAT_TMP_DIR. There's no need for it for pgstat.c et al anymore. Not sure
> that pg_stat_statement is enough of a reason to keep the stats_temp_directory
> GUC around?
> - 0019-pgstat-wip-remove-stats_temp_directory.patch

Still unclear. Might raise this separately for higher visibility.


> Right now we reset stats for replicas, even if we start from a shutdown
> checkpoint. That seems pretty unnecessary with this patch:
> - 0021-pgstat-wip-only-reset-pgstat-data-after-crash-re.patch

Might raise this in another thread for higher visibility.


> The biggest todos are:
> - Address all the remaining AFIXMEs and XXXs
> - add longer explanation of architecture to pgstat.c (or a README)
> - make naming not "a pain in the neck": [1]
> - lots of polishing
> - run benchmarks - I've done so in the past, but not recently

Still TBD


> - revise docs

Kyotaro-san, maybe you could do a first pass?


> - Further improve our stats test coverage - there's a crapton not covered,
>   despite 0017:
>   - test WAL replay with stats (stats for dropped tables are removed etc)
>   - test crash recovery and "invalid stats file" paths
>   - lot of the pg_stat_ views like bgwriter, pg_stat_database have zero coverage today

That's gotten a lot better with Melanie's tests, still a bit further to go. I
think she's found at least one more small bug that's not yet fixed here.


> - perhaps 0014 can be further broken down - it's still uncomfortably large

Things that I think can be split out:
- Encapsulate "if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)"
  style tests in a helper function. Then just the body needs to be changed,
  rather than a lot of places needing such checks.

Yep, that's it. I don't really see anything else that wouldn't be too
awkward. Would welcome suggestions!
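
A minimal sketch of the helper idea, with stand-ins for the PostgreSQL
globals. pgstat_should_report() and count_tuple_insert() are hypothetical
names used only for illustration; the condition is the one quoted above,
inverted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the PostgreSQL definitions, so the sketch is self-contained. */
#define PGINVALID_SOCKET (-1)

int         pgStatSock = PGINVALID_SOCKET;
bool        pgstat_track_counts = true;

/*
 * Hypothetical helper: the one place that decides whether counters should
 * be collected. When the socket goes away, only this body has to change.
 */
bool
pgstat_should_report(void)
{
    return pgStatSock != PGINVALID_SOCKET && pgstat_track_counts;
}

/* Example call site: the open-coded test is replaced by the helper. */
void
count_tuple_insert(long *counter)
{
    assert(counter != NULL);

    if (!pgstat_should_report())
        return;

    (*counter)++;
}
```

Once the call sites look like this, the shared memory stats patch would only
need to replace the body of the helper instead of touching every caller.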

Greetings,

Andres Freund

Attachments

Re: shared-memory based stats collector - v66

From
Melanie Plageman
Date:
On Sun, Mar 20, 2022 at 4:56 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> Addressed all of these points in
> v2-0001-add-replica-cleanup-tests.patch
>
> also added a new test file in
> v2-0002-Add-TAP-test-for-discarding-stats-after-crash.patch
> testing correct behavior after a crash and when stats file is invalid
>

Attached is the last of the tests confirming clean up for stats in the
shared stats hashtable (these are for the subscription stats).

I thought that maybe these tests could now use
pg_stat_force_next_flush() instead of poll_query_until() but I wasn't
sure how to ensure that the error has happened and the pending entry has
been added before setting force_next_flush.

I also added in tests that resetting subscription stats works as
expected.

- Melanie

Attachments

Re: shared-memory based stats collector - v67

From
Kyotaro Horiguchi
Date:
At Mon, 21 Mar 2022 14:30:17 -0700, Andres Freund <andres@anarazel.de> wrote in 
> Hi,
> 
> Attached is v67 of the patch. Changes:

Thanks for all the work on this.

> > This is also painful to maintain. Mostly kept separate from 0007 for easier
> > reviewing:
> > - 0009-pgstat-reorder-file-pgstat.c-pgstat.h-contents.patch
> 
> Planning to commit this soon (it's now 0001). Doing a last few passes of
> readthrough / polishing.

This looks to have been committed.

> > I don't yet know what we should do with other users of
> > PG_STAT_TMP_DIR. There's no need for it for pgstat.c et al anymore. Not sure
> > that pg_stat_statement is enough of a reason to keep the stats_temp_directory
> > GUC around?
> > - 0019-pgstat-wip-remove-stats_temp_directory.patch
> 
> Still unclear. Might raise this separately for higher visibility.
> 
> 
> > Right now we reset stats for replicas, even if we start from a shutdown
> > checkpoint. That seems pretty unnecessary with this patch:
> > - 0021-pgstat-wip-only-reset-pgstat-data-after-crash-re.patch
> 
> Might raise this in another thread for higher visibility.
> 
> 
> > The biggest todos are:
> > - Address all the remaining AFIXMEs and XXXs
> > - add longer explanation of architecture to pgstat.c (or a README)
> > - make naming not "a pain in the neck": [1]
> > - lots of polishing
> > - run benchmarks - I've done so in the past, but not recently
> 
> Still TBD


> > - revise docs
> 
> Kyotaro-san, maybe you could do a first pass?

Docs..  Yeah I'll try it.

> > - Further improve our stats test coverage - there's a crapton not covered,
> >   despite 0017:
> >   - test WAL replay with stats (stats for dropped tables are removed etc)
> >   - test crash recovery and "invalid stats file" paths
> >   - lot of the pg_stat_ views like bgwriter, pg_stat_database have zero coverage today
> 
> That's gotten a lot better with Melanie's tests, still a bit further to go. I
> think she's found at least one more small bug that's not yet fixed here.
> 
> 
> > - perhaps 0014 can be further broken down - it's still uncomfortably large
> 
> Things that I think can be split out:
> - Encapsulate "if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)"
>   style tests in a helper function. Then just the body needs to be changed,
>   rather than a lot of places needing such checks.
> 
> Yep, that's it. I don't really see anything else that wouldn't be too
> awkward. Would welcome suggestions!.

I'm overwhelmed by the amount, but I'm going to look into them.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: shared-memory based stats collector - v66

From
Melanie Plageman
Date:
On Thu, Mar 17, 2022 at 3:36 AM Andres Freund <andres@anarazel.de> wrote:
> I've attached a substantially improved version of the shared memory stats
> patch.
...
>   - lot of the pg_stat_ views like bgwriter, pg_stat_database have zero coverage today

Attached are some tests including tests that reset of stats works for
all views having a reset timestamp as well as a basic test for at least
one column in all of the following stats views:
pg_stat_archiver, pg_stat_bgwriter, pg_stat_wal, pg_stat_slru,
pg_stat_replication_slots, pg_stat_database

It might be nice to have a test for one of the columns fetched from the
PgStatBgwriter data structure since those and the Checkpointer stats are
stored separately despite being displayed in the same view currently.
But, alas...

- Melanie

Attachments

Re: shared-memory based stats collector - v67

From
Kyotaro Horiguchi
Date:
At Tue, 22 Mar 2022 11:56:40 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> Docs..  Yeah I'll try it.

This is the first cut, based on the earlier patchset.

monitoring.sgml:
>    When using the statistics to monitor collected data, it is important

I couldn't parse this clearly.  I modified that part on the assumption
that "the statistics" means "the statistics views and functions".


I didn't mention pgstat_force_next_flush() since I think it is a
developer-only feature.


In the attached diff, I refrained from reindenting to keep it easy to review.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 855456e68e1ddc5cec0a05f9604e257061de56ca Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 22 Mar 2022 18:48:43 +0900
Subject: [PATCH] doc_wip

---
 doc/src/sgml/catalogs.sgml          |   6 +-
 doc/src/sgml/config.sgml            |  10 +--
 doc/src/sgml/func.sgml              |   4 +-
 doc/src/sgml/glossary.sgml          |  39 +++++------
 doc/src/sgml/high-availability.sgml |   6 +-
 doc/src/sgml/maintenance.sgml       |   2 +-
 doc/src/sgml/monitoring.sgml        | 103 ++++++++++++++--------------
 doc/src/sgml/ref/pg_dump.sgml       |   2 +-
 8 files changed, 84 insertions(+), 88 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 2a8cd02664..21aebfddc0 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -9507,9 +9507,9 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l
   <para>
    <xref linkend="view-table"/> lists the system views described here.
    More detailed documentation of each view follows below.
-   There are some additional views that provide access to the results of
-   the statistics collector; they are described in <xref
-   linkend="monitoring-stats-views-table"/>.
+   There are some additional views that provide access to the activity
+   statistics; they are described in
+   <xref linkend="monitoring-stats-views-table"/>.
   </para>
 
   <para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 7a48973b3c..a2e468a727 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7752,12 +7752,12 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
    <sect1 id="runtime-config-statistics">
     <title>Run-time Statistics</title>
 
-    <sect2 id="runtime-config-statistics-collector">
-     <title>Query and Index Statistics Collector</title>
+    <sect2 id="runtime-config-activity-statitics">
+     <title>Query and Index Activity Statistics</title>
 
      <para>
-      These parameters control server-wide statistics collection features.
-      When statistics collection is enabled, the data that is produced can be
+      These parameters control server-wide activity statistics facility.
+      When statistics facility is enabled, the data that is produced can be
       accessed via the <structname>pg_stat</structname> and
       <structname>pg_statio</structname> family of system views.
       Refer to <xref linkend="monitoring"/> for more information.
@@ -7773,7 +7773,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </term>
       <listitem>
        <para>
-        Enables the collection of information on the currently
+        Enables the tracking of the currently
         executing command of each session, along with its identifier and the
         time when that command began execution. This parameter is on by
         default. Note that even when enabled, this information is not
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 8a802fb225..a681641162 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25451,8 +25451,8 @@ SELECT collation for ('foo' COLLATE "de_DE");
        <para>
         Requests to log the memory contexts of the backend with the
         specified process ID.  This function can send the request to
-        backends and auxiliary processes except logger and statistics
-        collector.  These memory contexts will be logged at
+        backends and auxiliary processes except logger.  These memory contexts
+        will be logged at
         <literal>LOG</literal> message level. They will appear in
         the server log based on the log configuration set
         (See <xref linkend="runtime-config-logging"/> for more information),
diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
index 1835d0e65a..2215f8ad7d 100644
--- a/doc/src/sgml/glossary.sgml
+++ b/doc/src/sgml/glossary.sgml
@@ -7,6 +7,22 @@
  </para>
 
  <glosslist>
+  <glossentry id="glossary-activity-statistics">
+   <glossterm>Activity statistics facility</glossterm>
+   <glossdef>
+    <para>
+     A facility which, if enabled, collects statistical information
+     about the <glossterm linkend="glossary-instance">instance</glossterm>'s
+     activities.
+    </para>
+    <para>
+      For more information, see
+      <xref linkend="monitoring-stats"/>.
+    </para>
+   </glossdef>
+  </glossentry>
+
+
   <glossentry id="glossary-acid">
    <glossterm>ACID</glossterm>
    <glossdef>
@@ -136,9 +152,9 @@
      The auxiliary processes consist of <!-- in alphabetical order -->
      <!-- NB: In the code, the autovac launcher doesn't use the auxiliary
           process scaffolding; however it does behave as one so we list it
-          here anyway. In addition, logger and stats collector aren't
-          connected to shared memory so most code outside postmaster.c
-          doesn't even consider them "procs" in the first place.
+          here anyway. In addition, logger isn't connected to shared memory so
+          most code outside postmaster.c doesn't even consider them "procs" in
+          the first place.
           -->
      the <glossterm linkend="glossary-autovacuum">autovacuum launcher</glossterm>
      (but not the autovacuum workers),
@@ -146,7 +162,6 @@
      the <glossterm linkend="glossary-checkpointer">checkpointer</glossterm>,
      the <glossterm linkend="glossary-logger">logger</glossterm>,
      the <glossterm linkend="glossary-startup-process">startup process</glossterm>,
-     the <glossterm linkend="glossary-stats-collector">statistics collector</glossterm>,
      the <glossterm linkend="glossary-wal-archiver">WAL archiver</glossterm>,
      the <glossterm linkend="glossary-wal-receiver">WAL receiver</glossterm>
      (but not the <glossterm linkend="glossary-wal-sender">WAL senders</glossterm>),
@@ -1563,22 +1578,6 @@
    </glossdef>
   </glossentry>
 
-  <glossentry id="glossary-stats-collector">
-   <glossterm>Stats collector (process)</glossterm>
-   <glossdef>
-    <para>
-     An <glossterm linkend="glossary-auxiliary-proc">auxiliary process</glossterm>
-     which, if enabled, receives statistical information
-     about the <glossterm linkend="glossary-instance">instance</glossterm>'s
-     activities.
-    </para>
-    <para>
-      For more information, see
-      <xref linkend="monitoring-stats"/>.
-    </para>
-   </glossdef>
-  </glossentry>
-
   <glossentry id="glossary-system-catalog">
    <glossterm>System catalog</glossterm>
    <glossdef>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 81fa26f985..5d0b37dfed 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2227,12 +2227,12 @@ HINT:  You can then restart the server after making the necessary configuration
    </para>
 
    <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
+    The activity statistics facility is active during recovery. All scans, reads, blocks,
     index usage, etc., will be recorded normally on the standby. Replayed
     actions will not duplicate their effects on primary, so replaying an
     insert will not increment the Inserts column of pg_stat_user_tables.
-    The stats file is deleted at the start of recovery, so stats from primary
-    and standby will differ; this is considered a feature, not a bug.
+    So the stats from primary and standby will differ; this is considered a
+    feature, not a bug.
    </para>
 
    <para>
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 36f975b1e5..507ca594a3 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -799,7 +799,7 @@ vacuum insert threshold = vacuum base insert threshold + vacuum insert scale fac
     it may be beneficial to lower the table's
     <xref linkend="reloption-autovacuum-freeze-min-age"/> as this may allow
     tuples to be frozen by earlier vacuums.  The number of obsolete tuples and
-    the number of inserted tuples are obtained from the statistics collector;
+    the number of inserted tuples are obtained from the activity statistics facility;
     it is a semi-accurate count updated by each <command>UPDATE</command>,
     <command>DELETE</command> and <command>INSERT</command> operation.  (It is
     only semi-accurate because some information might be lost under heavy
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 028f5d676b..f2457e8cb3 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -22,7 +22,7 @@
   <para>
    Several tools are available for monitoring database activity and
    analyzing performance.  Most of this chapter is devoted to describing
-   <productname>PostgreSQL</productname>'s statistics collector,
+   <productname>PostgreSQL</productname>'s activity statistics facility,
    but one should not neglect regular Unix monitoring programs such as
    <command>ps</command>, <command>top</command>, <command>iostat</command>, and <command>vmstat</command>.
    Also, once one has identified a
@@ -53,7 +53,6 @@ postgres  15554  0.0  0.0  57536  1184 ?        Ss   18:02   0:00 postgres: back
 postgres  15555  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: checkpointer
 postgres  15556  0.0  0.0  57536   916 ?        Ss   18:02   0:00 postgres: walwriter
 postgres  15557  0.0  0.0  58504  2244 ?        Ss   18:02   0:00 postgres: autovacuum launcher
-postgres  15558  0.0  0.0  17512  1068 ?        Ss   18:02   0:00 postgres: stats collector
 postgres  15582  0.0  0.0  58772  3080 ?        Ss   18:04   0:00 postgres: joe runbug 127.0.0.1 idle
 postgres  15606  0.0  0.0  58772  3052 ?        Ss   18:07   0:00 postgres: tgl regression [local] SELECT waiting
 postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl regression [local] idle in transaction
@@ -63,11 +62,10 @@ postgres  15610  0.0  0.0  58772  3056 ?        Ss   18:07   0:00 postgres: tgl
    platforms, as do the details of what is shown.  This example is from a
    recent Linux system.)  The first process listed here is the
    primary server process.  The command arguments
-   shown for it are the same ones used when it was launched.  The next five
+   shown for it are the same ones used when it was launched.  The next four
    processes are background worker processes automatically launched by the
-   primary process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   primary process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to run autovacuum.)
    Each of the remaining
    processes is a server process handling one client connection.  Each such
    process sets its command line display in the form
@@ -130,16 +128,16 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
  </sect1>
 
  <sect1 id="monitoring-stats">
-  <title>The Statistics Collector</title>
+  <title>The Activity Statistics Facility</title>
 
   <indexterm zone="monitoring-stats">
    <primary>statistics</primary>
   </indexterm>
 
   <para>
-   <productname>PostgreSQL</productname>'s <firstterm>statistics collector</firstterm>
+   <productname>PostgreSQL</productname>'s <firstterm>activity statistics facility</firstterm>
    is a subsystem that supports collection and reporting of information about
-   server activity.  Presently, the collector can count accesses to tables
+   server activity.  Presently, the facility can count accesses to tables
    and indexes in both disk-block and individual-row terms.  It also tracks
    the total number of rows in each table, and information about vacuum and
    analyze actions for each table.  It can also count calls to user-defined
@@ -151,7 +149,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    information about exactly what is going on in the system right now, such as
    the exact command currently being executed by other server processes, and
    which other connections exist in the system.  This facility is independent
-   of the collector process.
+   of the activity statistics facility.
   </para>
 
  <sect2 id="monitoring-stats-setup">
@@ -172,7 +170,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    The parameter <xref linkend="guc-track-counts"/> controls whether
-   statistics are collected about table and index accesses.
+   activity statistics are collected about table and index accesses.
   </para>
 
   <para>
@@ -201,18 +199,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
   </para>
 
   <para>
-   The statistics collector transmits the collected information to other
-   <productname>PostgreSQL</productname> processes through temporary files.
-   These files are stored in the directory named by the
-   <xref linkend="guc-stats-temp-directory"/> parameter,
-   <filename>pg_stat_tmp</filename> by default.
-   For better performance, <varname>stats_temp_directory</varname> can be
-   pointed at a RAM-based file system, decreasing physical I/O requirements.
+   The activity statistics facility collects information in a shared
+   memory area. Every <productname>PostgreSQL</productname> process collects
+   activity statistics locally, then writes out the summarized data to the
+   shared memory area at appropriate intervals.
    When the server shuts down cleanly, a permanent copy of the statistics
    data is stored in the <filename>pg_stat</filename> subdirectory, so that
    statistics can be retained across server restarts.  When recovery is
-   performed at server start (e.g., after immediate shutdown, server crash,
-   and point-in-time recovery), all statistics counters are reset.
+   performed at server start (e.g., after immediate shutdown, server crash),
+   all statistics counters are reset.
   </para>
 
  </sect2>
@@ -226,19 +221,19 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    the current state of the system. There are also several other
    views, listed in <xref
    linkend="monitoring-stats-views-table"/>, available to show the results
-   of statistics collection.  Alternatively, one can
-   build custom views using the underlying statistics functions, as discussed
-   in <xref linkend="monitoring-stats-functions"/>.
+   of activity statistics collection.  Alternatively, one can
+   build custom views using the underlying activity statistics functions, as
+   discussed in <xref linkend="monitoring-stats-functions"/>.
   </para>
 
   <para>
-   When using the statistics to monitor collected data, it is important
+   When using the activity statistics views and functions to monitor collected data, it is important
    to realize that the information does not update instantaneously.
-   Each individual server process transmits new statistical counts to
-   the collector just before going idle; so a query or transaction still in
-   progress does not affect the displayed totals.  Also, the collector itself
-   emits a new report at most once per <varname>PGSTAT_STAT_INTERVAL</varname>
-   milliseconds (500 ms unless altered while building the server).  So the
+   Each individual server process writes out new statistical counts to
+   the shared memory area just before going idle but not more frequently than once
+   per <varname>PGSTAT_MIN_INTERVAL</varname> milliseconds (10 seconds unless
+   altered while building the server); so a query or transaction still in
+   progress does not affect the displayed totals and the
    displayed information lags behind actual activity.  However, current-query
    information collected by <varname>track_activities</varname> is
    always up-to-date.
@@ -246,22 +241,23 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
 
   <para>
    Another important point is that when a server process is asked to display
-   any of these statistics, it first fetches the most recent report emitted by
-   the collector process and then continues to use this snapshot for all
-   statistical views and functions until the end of its current transaction.
-   So the statistics will show static information as long as you continue the
-   current transaction.  Similarly, information about the current queries of
-   all sessions is collected when any such information is first requested
-   within a transaction, and the same information will be displayed throughout
-   the transaction.
-   This is a feature, not a bug, because it allows you to perform several
-   queries on the statistics and correlate the results without worrying that
-   the numbers are changing underneath you.  But if you want to see new
-   results with each query, be sure to do the queries outside any transaction
-   block.  Alternatively, you can invoke
-   <function>pg_stat_clear_snapshot</function>(), which will discard the
-   current transaction's statistics snapshot (if any).  The next use of
-   statistical information will cause a new snapshot to be fetched.
+   any of these activity statistics, it fetches the status-quo values in the
+   shared memory area every time.  So the result can vary each time the same
+   item is fetched, even within a transaction. This is a feature, not a bug,
+   because these values generally don't need transactional consistency and
+   are preferred to be as fresh as possible.  If needed, transactional
+   consistency is available by
+   setting <varname>stats_fetch_consistency</varname>
+   to <literal>snapshot</literal> or <literal>cache</literal>.  When it is set
+   to <literal>snapshot</literal>, a consistent snapshot through all items is
+   created at the first fetch of any item and subsequent fetches use the
+   snapshot, which persists until the transaction ends.  When it is set
+   to <literal>cache</literal>, the value for an item is individually cached
+   at the first fetch and the cached value persists until the transaction
+   ends.  You can invoke <function>pg_stat_clear_snapshot</function>() to
+   discard the current transaction's statistics snapshot (if any) anytime.
+   The next use of statistical information will cause a new snapshot or a new
+   cache to be fetched.
   </para>
 
   <para>
@@ -1110,10 +1106,6 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       <entry><literal>LogicalLauncherMain</literal></entry>
       <entry>Waiting in main loop of logical replication launcher process.</entry>
      </row>
-     <row>
-      <entry><literal>PgStatMain</literal></entry>
-      <entry>Waiting in main loop of statistics collector process.</entry>
-     </row>
      <row>
       <entry><literal>RecoveryWalStream</literal></entry>
       <entry>Waiting in main loop of startup process for WAL to arrive, during
@@ -1889,6 +1881,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
     </thead>
 
     <tbody>
+     <row>
+      <entry><literal>ActivityStatistics</literal></entry>
+      <entry>Waiting to write out activity statistics to shared memory.</entry>
+     </row>
      <row>
       <entry><literal>AddinShmemInit</literal></entry>
       <entry>Waiting to manage an extension's space allocation in shared
@@ -5082,7 +5078,7 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
   </para>
 
   <para>
-   Additional functions related to statistics collection are listed in <xref
+   Additional functions related to the activity statistics facility are listed in <xref
    linkend="monitoring-stats-funcs-table"/>.
   </para>
 
@@ -5152,7 +5148,7 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
         <returnvalue>void</returnvalue>
        </para>
        <para>
-        Discards the current statistics snapshot.
+        Discards the current statistics snapshot or cached information.
        </para></entry>
       </row>
 
@@ -6313,8 +6309,9 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
      <entry>
        <command>VACUUM</command> is performing final cleanup.  During this phase,
        <command>VACUUM</command> will vacuum the free space map, update statistics
-       in <literal>pg_class</literal>, and report statistics to the statistics
-       collector.  When this phase is completed, <command>VACUUM</command> will end.
+       in <literal>pg_class</literal>, and report statistics to the activity
+       statistics facility.
+       When this phase is completed, <command>VACUUM</command> will end.
      </entry>
     </row>
    </tbody>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 2f0042fd96..64d4d997c7 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1328,7 +1328,7 @@ PostgreSQL documentation
 
   <para>
    The database activity of <application>pg_dump</application> is
-   normally collected by the statistics collector.  If this is
+   normally collected by the activity statistics facility.  If this is
    undesirable, you can set parameter <varname>track_counts</varname>
    to false via <envar>PGOPTIONS</envar> or the <literal>ALTER
    USER</literal> command.
-- 
2.27.0


Re: shared-memory based stats collector - v67

From
Kyotaro Horiguchi
Date:
At Mon, 21 Mar 2022 14:30:17 -0700, Andres Freund <andres@anarazel.de> wrote in 
> > Right now we reset stats for replicas, even if we start from a shutdown
> > checkpoint. That seems pretty unnecessary with this patch:
> > - 0021-pgstat-wip-only-reset-pgstat-data-after-crash-re.patch
> 
> Might raise this in another thread for higher visibility.

+    /*
+     * When starting with crash recovery, reset pgstat data - it might not be
+     * valid. Otherwise restore pgstat data. It's safe to do this here,
+     * because postmaster will not yet have started any other processes
+     *
+     * TODO: With a bit of extra work we could just start with a pgstat file
+     * associated with the checkpoint redo location we're starting from.
+     */
+    if (ControlFile->state == DB_SHUTDOWNED ||
+        ControlFile->state == DB_SHUTDOWNED_IN_RECOVERY)
+        pgstat_restore_stats();
+    else
+        pgstat_discard_stats();
+

Before we get there, InitWalRecovery changes the state to
DB_IN_ARCHIVE_RECOVERY if it was either DB_SHUTDOWNED or
DB_IN_PRODUCTION. So the stats seem to be always discarded on a standby.

In the first place, I'm not sure it is valid for a standby created from a
cold backup to take over the stats from the primary.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: shared-memory based stats collector - v67

From
Andres Freund
Date:
Hi,

On 2022-03-23 17:27:50 +0900, Kyotaro Horiguchi wrote:
> At Mon, 21 Mar 2022 14:30:17 -0700, Andres Freund <andres@anarazel.de> wrote in 
> > > Right now we reset stats for replicas, even if we start from a shutdown
> > > checkpoint. That seems pretty unnecessary with this patch:
> > > - 0021-pgstat-wip-only-reset-pgstat-data-after-crash-re.patch
> > 
> > Might raise this in another thread for higher visibility.
> 
> +    /*
> +     * When starting with crash recovery, reset pgstat data - it might not be
> +     * valid. Otherwise restore pgstat data. It's safe to do this here,
> +     * because postmaster will not yet have started any other processes
> +     *
> +     * TODO: With a bit of extra work we could just start with a pgstat file
> +     * associated with the checkpoint redo location we're starting from.
> +     */
> +    if (ControlFile->state == DB_SHUTDOWNED ||
> +        ControlFile->state == DB_SHUTDOWNED_IN_RECOVERY)
> +        pgstat_restore_stats();
> +    else
> +        pgstat_discard_stats();
> +
> 
> Before there, InitWalRecovery changes the state to
> DB_IN_ARCHIVE_RECOVERY if it was either DB_SHUTDOWNED or
> DB_IN_PRODUCTION. So the stat seems like always discarded on standby.

Hm. I thought it worked at some point. I guess there's a reason this commit is
a separate commit marked WIP ;)


> In the first place, I'm not sure it is valid that a standby from a
> cold backup takes over the stats from the primary.

I don't really see a reason not to use the stats in that case - we have a
correct stats file after all. But it doesn't seem too important. What I
actually find worth addressing is the case of standbys starting in
DB_SHUTDOWNED_IN_RECOVERY. Right now we always throw stats away after a
*graceful* restart of a standby, which doesn't seem great.

Greetings,

Andres Freund



Re: shared-memory based stats collector - v66

From
Melanie Plageman
Date:
On Thu, Mar 17, 2022 at 3:36 AM Andres Freund <andres@anarazel.de> wrote:
>
> The biggest todos are:
> - Address all the remaining AFIXMEs and XXXs

Attached is a patch that addresses three of the existing AFIXMEs.

Attachments

Re: shared-memory based stats collector - v66

From
Kyotaro Horiguchi
Date:
At Thu, 24 Mar 2022 13:21:33 -0400, Melanie Plageman <melanieplageman@gmail.com> wrote in 
> On Thu, Mar 17, 2022 at 3:36 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > The biggest todos are:
> > - Address all the remaining AFIXMEs and XXXs
> 
> Attached is a patch that addresses three of the existing AFIXMEs.

Thanks!

+        .reset_timestamp_cb = pgstat_shared_reset_timestamp_noop,

(I once misread the "shared" as meaning the shared memory area..)

The reset function is type-specific and it must be set.  So shouldn't we
provide all the reset functions that will be required?


+    if (pgstat_shared_ref_get(kind, dboid, objoid, false, NULL))
+    {
+        Oid msg_oid = (kind == PGSTAT_KIND_DB) ? dboid : objoid;

Explicitly using PGSTAT_KIND_DB here is somewhat annoying.  Since we
always pass InvalidOid correctly for the unused parameter, and objoid
alone is not specific enough, why don't we emit the warning with both
dboid and objoid, without any special treatment?

Concretely, I propose doing the following instead.

+    if (pgstat_shared_ref_get(kind, dboid, objoid, false, NULL))
+    {
+        ereport(WARNING,
+                errmsg("resetting existing stats for type %s, db=%u, oid=%u",
+                       pgstat_kind_info_for(kind)->name, dboid, objoid));



+pgstat_pending_delete(PgStatSharedRef *shared_ref)
+{
+    void       *pending_data = shared_ref->pending;
+    PgStatKind kind = shared_ref->shared_entry->key.kind;
+
+    Assert(pending_data != NULL);
+    Assert(!pgstat_kind_info_for(kind)->fixed_amount);
+
+    /* PGSTAT_KIND_TABLE has its own callback */
+    Assert(kind != PGSTAT_KIND_TABLE);
+

"kind" is used only in assertions, so it needs PG_USED_FOR_ASSERTS_ONLY.
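
For readers unfamiliar with the marker, here is a minimal standalone sketch of the idea. It does not use PostgreSQL's headers: USED_FOR_ASSERTS_ONLY is a stand-in keyed on NDEBUG (PostgreSQL keys its macro on USE_ASSERT_CHECKING), double_positive() is a made-up function, and __attribute__((unused)) assumes GCC or Clang.

```c
#include <assert.h>

/*
 * Stand-in for PG_USED_FOR_ASSERTS_ONLY: when assertions are compiled out,
 * mark the variable as intentionally unused so the compiler does not warn.
 */
#ifdef NDEBUG
#define USED_FOR_ASSERTS_ONLY __attribute__((unused))
#else
#define USED_FOR_ASSERTS_ONLY
#endif

/* Hypothetical function: "checked" is referenced only inside assert(). */
static int
double_positive(int value)
{
    int         checked USED_FOR_ASSERTS_ONLY = value;

    assert(checked > 0);
    return value * 2;
}
```

Without the marker, a build with assertions disabled would report "checked" as set but not used.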

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: shared-memory based stats collector - v66

From
Kyotaro Horiguchi
Date:
At Fri, 25 Mar 2022 14:22:56 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Thu, 24 Mar 2022 13:21:33 -0400, Melanie Plageman <melanieplageman@gmail.com> wrote in 
> > On Thu, Mar 17, 2022 at 3:36 AM Andres Freund <andres@anarazel.de> wrote:
> > >
> > > The biggest todos are:
> > > - Address all the remaining AFIXMEs and XXXs
> > 
> > Attached is a patch that addresses three of the existing AFIXMEs.

I'd like to dump out my humble thoughts about other AFIXMEs..

> AFIXME: Isn't PGSTAT_MIN_INTERVAL way too long? What is the justification
> for increasing it?

The comment just above says 1000ms, but the actual value is 10000ms. The
number came from a discussion: if we have 1000 clients and each
backend writes stats once per 0.5 seconds, in total we flush pending
data to the shared area 2000 times per second, which is too frequent. I
raised it to 5000ms, then 10000ms.  So the expected maximum flush
frequency is reduced to 100 times per second.  Of course this assumes
the worst case, and 10000ms is apparently too long for the average
case.
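
The arithmetic behind those numbers can be written down as a tiny sketch. The helper name is made up for illustration; only the client count and the intervals come from the discussion.

```c
#include <assert.h>

/*
 * Worst-case number of flushes per second to the shared area, assuming
 * every one of nclients backends flushes exactly once per interval_ms.
 * (Hypothetical helper, not part of the patch.)
 */
static double
max_flushes_per_sec(int nclients, int interval_ms)
{
    assert(interval_ms > 0);
    return nclients * 1000.0 / interval_ms;
}
```

With 1000 clients this gives 2000 flushes per second at 500ms, 1000 at 1000ms and 100 at 10000ms.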

The current implementation of pgstat postpones flushing if a lock
collision happens, by at most 60s.  This is a kind of auto-averaging
mechanism.  It might be enough, and then we can reduce
PGSTAT_MIN_INTERVAL to 500ms or so.
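
A simplified model of that behaviour, as a sketch only: the names below are invented, and the real code may track more state than the time of the last successful flush. Below the minimum interval nothing is flushed; between the minimum and the 60s cap a flush is attempted without waiting for locks (so a collision simply postpones it); past the cap the flush waits for the lock.

```c
#include <assert.h>

#define MIN_INTERVAL_MS 10000   /* value discussed above */
#define MAX_INTERVAL_MS 60000   /* upper bound on postponement */

typedef enum FlushAction
{
    FLUSH_SKIP,                 /* too soon since the last flush */
    FLUSH_TRY_NOWAIT,           /* try, but give up on lock collision */
    FLUSH_FORCE                 /* postponed long enough: wait for locks */
} FlushAction;

/* Hypothetical decision function; times are in milliseconds. */
static FlushAction
choose_flush_action(long now_ms, long last_flush_ms)
{
    long        elapsed = now_ms - last_flush_ms;

    assert(elapsed >= 0);
    if (elapsed < MIN_INTERVAL_MS)
        return FLUSH_SKIP;
    if (elapsed < MAX_INTERVAL_MS)
        return FLUSH_TRY_NOWAIT;
    return FLUSH_FORCE;
}
```

Lowering MIN_INTERVAL_MS in this model only changes how soon the no-wait attempts start; the cap under contention stays the same.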


> AFIXME: architecture explanation.

Mmm. next, please:p


(    [PGSTAT_KIND_REPLSLOT] = {)
> * AFIXME: With a bit of extra work this could now be a !fixed_amount
> * stats kind.

Yeah.  The most bothersome point is that the slot index is not
persistent at all, and the relationship between the index and the name
(or identity) is not stable even within the life of a process.  It can
be resolved by allocating an object id to every replication slot.  I
faintly remember a discussion like that, but I don't have a clear
memory of it.

> static Size
> pgstat_dsa_init_size(void)
> {
>     /*
>      * AFIXME: What should we choose as an initial size? Should we make this
>      * configurable? Maybe tune based on NBuffers?

> StatsShmemInit(void)
>          * AFIXME: we need to guarantee this can be allocated in plain shared
>          * memory, rather than allocating dsm segments.

I'm not sure that NBuffers is the ideal base for deciding the required
size since it doesn't seem to be generally in proportion with the
number of database objects.  If we made it manually tunable, we would
be able to emit a log message when DSM segment allocation happens for
this use, as a tuning aid..

   WARNING: dsa allocation happened for activity statistics
   HINT: You might want to increase stat_dsa_initial_size if you see slow
         down blah..


> * AFIXME: Should all the stats drop code be moved into pgstat_drop.c?

Or pgstat_xact.c?


>  * AFIXME: comment
>  * AFIXME: see notes about race conditions for functions in
>  *         pgstat_drop_function().
>  */
> void
> pgstat_schedule_stat_drop(PgStatKind kind, Oid dboid, Oid objoid)


pgstat_drop_function() doesn't seem to have such a note.

I suppose the "race condition" means the case where a stats entry for
an object is created just after the same object is dropped in another
backend.  It seems to me such a race condition is eliminated by the
transactional drop mechanism.  Are you intending to write an
explanation of that?


>    /*
>     * pgStatSharedRefAge increments quite slowly than the time the following
>     * loop takes so this is expected to iterate no more than twice.
>     *
>     * AFIXME: Why is this a good place to do this?
>     */
>    while (pgstat_shared_refs_need_gc())
>        pgstat_shared_refs_gc();

Is the reason for the AFIXME that you think the GC check happens too
frequently?


> pgstat_shared_ref_release(PgStatHashKey key, PgStatSharedRef *shared_ref)
> {
...
>         * AFIXME: this probably is racy. Another backend could look up the
>         * stat, bump the refcount, as we free it.
>        if (pg_atomic_fetch_sub_u32(&shared_ref->shared_entry->refcount, 1) == 1)
>        {
...
>            /* only dropped entries can reach a 0 refcount */
>            Assert(shared_ref->shared_entry->dropped);

I haven't examined this deeply, but is that race condition avoidable by
preventing pgstat_shared_ref_get from incrementing the refcount of
dropped entries?
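
To make the question concrete, here is a standalone sketch using C11 atomics rather than PostgreSQL's pg_atomic API. StatsEntry, entry_try_acquire() and entry_release() are hypothetical names, and the sketch assumes the entry's memory stays valid while it is being looked up (something the hash table's own locking would have to guarantee).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared entry; not the patch's struct. */
typedef struct StatsEntry
{
    atomic_uint refcount;       /* 0 means the entry is being freed */
    atomic_bool dropped;
} StatsEntry;

/*
 * Take a reference unless the entry is dropped or already at refcount 0.
 * The compare-and-swap loop never moves the count from 0 back to 1.
 */
static bool
entry_try_acquire(StatsEntry *e)
{
    unsigned    old = atomic_load(&e->refcount);

    if (atomic_load(&e->dropped))
        return false;
    while (old != 0)
    {
        /* on failure, "old" is refreshed with the current value */
        if (atomic_compare_exchange_weak(&e->refcount, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference; returns true if the caller must free the entry. */
static bool
entry_release(StatsEntry *e)
{
    unsigned    prev = atomic_fetch_sub(&e->refcount, 1);

    assert(prev > 0);
    return prev == 1;
}
```

In this model the releasing backend that sees the count fall to zero is the only one that frees, because no other backend can revive the entry afterwards.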



>  * AFIXME: This needs to be deduplicated with pgstat_shared_ref_release(). But
>  * it's not entirely trivial, because we can't use plain dshash_delete_entry()
>  * (but have to use dshash_delete_current()).
>  */
> static bool
> pgstat_drop_stats_entry(dshash_seq_status *hstat)
...
>      * AFIXME: don't do this while holding the dshash lock.

Do the AFIXMEs mean that we should move the call to
pgstat_shared_ref_release() out of the dshash loop (in
pgstat_drop_database_and_contents) that calls this function?  Would it
be sensible to store the (key, ref) pairs of the to-be-released
shared_refs and then clean them up after exiting the loop?
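
The two-phase idea can be sketched as follows. This is a toy model with invented names (Ref, release_refs_for_db, MAX_DEFERRED); the lock itself is only indicated by comments, and the fixed-size array stands in for whatever list the real code would use.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEFERRED 64

/* Hypothetical local reference to a shared stats entry. */
typedef struct Ref
{
    int         dboid;
    int         released;
} Ref;

/*
 * Phase 1 only remembers matching refs while the scan (and its lock) is
 * active; phase 2 releases them after the scan, with no lock held.
 * Returns the number of refs released.
 */
static size_t
release_refs_for_db(Ref *refs, size_t nrefs, int dboid)
{
    Ref        *deferred[MAX_DEFERRED];
    size_t      ndeferred = 0;

    /* phase 1: the hash lock would be held here */
    for (size_t i = 0; i < nrefs; i++)
    {
        if (refs[i].dboid == dboid)
        {
            assert(ndeferred < MAX_DEFERRED);
            deferred[ndeferred++] = &refs[i];
        }
    }

    /* phase 2: the hash lock would already be released here */
    for (size_t i = 0; i < ndeferred; i++)
        deferred[i]->released = 1;

    return ndeferred;
}
```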


>         * Database stats contain other stats. Drop those as well when
>         * dropping the database. AFIXME: Perhaps this should be done in a
>         * slightly more principled way?
>         */
>        if (key.kind == PGSTAT_KIND_DB)
>            pgstat_drop_database_and_contents(key.dboid);

I tend to agree with that, and we could have
PgStatKindInfo.drop_cascade_cb(PgStatShm_StatEntryHeader *header). But
it is really needed only by PGSTAT_KIND_DB..


>  * AFIXME: consistent naming
>  * AFIXME: deduplicate some of this code with pgstat_fetch_snapshot_build().
>  *
>  * AFIXME: it'd be nicer if we passed .snapshot_cb() the target memory
>  * location, instead of putting PgStatSnapshot into pgstat_internal.h
>  */
> void
> pgstat_snapshot_global(PgStatKind kind)


Would having PGSTAT_KIND_NONE in PgStatKind, or InvalidPgStatKind, work
for deduplication? But I'm afraid that would cause harm in some way.

As for the memory location, it seems like a matter of taste, but if we
don't need multiple copies of a global snapshot, I think
.snapshot_cb() doesn't need to take the target memory location.


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: shared-memory based stats collector - v66

From
Andres Freund
Date:
Hi,

On 2022-03-25 17:24:18 +0900, Kyotaro Horiguchi wrote:
> I'd like to dump out my humble thoughts about other AFIXMEs..

Thanks!  Please have another look at the code in
https://github.com/anarazel/postgres/tree/shmstat I just pushed a revised
version with a lot of [a]fixmes removed.


Most importantly I did move replication slot stats into the hash table, and
just generally revised the replication slot stats code substantially. I
think it does look better now.

But also there's a new commit allowing dsm use in single user mode. To be able
to rely on stats drops we need to perform them even in single user mode. The
only reason this didn't previously fail was that we allocated enough "static"
shared memory for single user mode to never need DSMs.

Thanks to Melanie's tests, and a few more additions by myself, the code is now
reasonably well covered. The big exception to that is recovery conflict stats,
and as Melanie noticed, that was broken (somehow pgstat_database_flush_cb()
didn't sum them up). I think she has some WIP tests...

Re the added tests: I did fix a few timing issues there. There are probably a
few more hiding somewhere.


I also found that unfortunately dshash_seq_next() as is isn't correct. I
included a workaround commit, but it's not correct. What we need to do is to
just always lock partition 0 in the initialization branch. Before we call
ensure_valid_bucket_pointers() status->hash_table->size_log2 isn't valid. And
ensure_valid_bucket_pointers can only be called with a lock...



Horiguchi-san, if you have time to look at the "XXX: The following could now be
generalized" in pgstat_read_statsfile(), pgstat_write_statsfile()... I think
that'd be nice to clean up.



> > AFIXME: Isn't PGSTAT_MIN_INTERVAL way too long? What is the justification
> > for increasing it?
> 
> It is 1000ms in the comment just above but actually 10000ms. The
> number came from a discussion that if we have 1000 clients and each
> backend writes stats once per 0.5 seconds, totally we flush pending
> data to shared area at 2000 times per second which is too frequent.

Have you measured this (recently)? I tried to cause contention with a workload
targeted towards that, but couldn't see a problem with 1000ms. Of course
there's a problem with 1ms...

I think it's confusing to not report stats for 10s without a need.


> The current implement of pgstat postpones flushing if lock collision
> happens then postpone by at most 60s.  This is a kind of
> auto-averaging mechanishm.  It might be enough and we can reduce the
> PGSTAT_MIN_INTERVAL to 500ms or so.

Yea, I think the 60s part under contention is fine. I'd expect that to be
rarely reached.


> 
> > AFIXME: architecture explanation.
> 
> Mmm. next, please:p

Working on it. There's one more AFIXME that I want to resolve before, so I
don't end up with old type names strewn around (the one in pgstat_internal.h).


> 
> (    [PGSTAT_KIND_REPLSLOT] = {)
> > * AFIXME: With a bit of extra work this could now be a !fixed_amount
> > * stats kind.
> 
> Yeah.  The most bothersome point is the slot index is not persistent
> at all and the relationship between the index and name (or identity)
> is not stable even within a process life.  It can be resolved by
> allocating an object id to every replication slot.  I faintly remember
> of a discussion like that but I don't have a clear memory of the
> discussion.

I think it's resolved now. pgstat_report_replslot* all get the ReplicationSlot
as a parameter. They use the new ReplicationSlotIndex() to get an index from
that. pgstat_report_replslot_(create|acquire) ensure that the relevant index
doesn't somehow contain old stats.

To deal with indexes changing / slots getting removed during restart, there's
now a new callback made during pgstat_read_statsfile() to build the key from
the serialized NameStr. That can return false if a slot of that name is not
known, or use ReplicationSlotIndex() to get the index to store in-memory stats.
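
Roughly, such a callback has the following shape. This is a standalone sketch with invented names (slot_names, key_from_slot_name) and a static array in place of the real slot data; it is not the callback from the branch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_SLOTS 4

/* Hypothetical slot array; NULL marks an unused slot. */
static const char *slot_names[MAX_SLOTS] = {"slot_a", NULL, "slot_b", NULL};

/*
 * Map a slot name read from the stats file to its current index.
 * Returns false if no slot of that name exists, in which case the
 * caller skips the serialized stats instead of loading them.
 */
static bool
key_from_slot_name(const char *name, int *index_out)
{
    assert(name != NULL && index_out != NULL);

    for (int i = 0; i < MAX_SLOTS; i++)
    {
        if (slot_names[i] != NULL && strcmp(slot_names[i], name) == 0)
        {
            *index_out = i;
            return true;
        }
    }
    return false;
}
```

Because the file stores the name rather than the index, stats survive a restart even when the slot ends up at a different index.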


> > static Size
> > pgstat_dsa_init_size(void)
> > {
> >     /*
> >      * AFIXME: What should we choose as an initial size? Should we make this
> >      * configurable? Maybe tune based on NBuffers?
> 
> > StatsShmemInit(void)
> >          * AFIXME: we need to guarantee this can be allocated in plain shared
> >          * memory, rather than allocating dsm segments.
> 
> I'm not sure that NBuffers is the ideal base for deciding the required
> size since it doesn't seem to be generally in proportion with the
> number of database objects.  If we made it manually-tunable, we will
> be able to emit a log when DSM segment allocation happens for this use
> as as the tuning aid..
> 
>    WARNING: dsa allocation happened for activity statistics
>    HINT: You might want to increase stat_dsa_initial_size if you see slow
>          down blah..

FWIW, I couldn't find any performance impact from using DSM. Because of the
"PgStatSharedRef" layer, there's not actually that much interaction with the
dsm code...

I reduced the initial allocation to 256kB. Unfortunately that's currently the
minimum that allows dshash_create() to succeed (due to dsa.c pre-allocating 16
of each allocation). I was a bit worried about that for a while, but memory
usage is still lower with the patch than before in the scenarios I tested. We
can probably improve upon that fairly easily in the future (move
dshash_table_control into static shared memory, call dsa_trim() when resizing
dshash table).


> > * AFIXME: Should all the stats drop code be moved into pgstat_drop.c?
> 
> Or pgstat_xact.c?

Maybe. Somehow it doesn't seem *great* either.


> >  * AFIXME: comment
> >  * AFIXME: see notes about race conditions for functions in
> >  *         pgstat_drop_function().
> >  */
> > void
> > pgstat_schedule_stat_drop(PgStatKind kind, Oid dboid, Oid objoid)
> 
> 
> pgstat_drop_function() doesn't seem to have such a note.

Yea, I fixed it in pgstat_init_function_usage(), forgetting about the note in
pgstat_schedule_stat_drop(). There's a decently long comment in
pgstat_init_function_usage() explaining the problem.


> I suppose the "race condition" means the case a stats entry for an
> object is created just after the same object is dropped on another
> backend.  It seems to me such a race condition is eliminated by the
> transactional drop mechanism.  Are you intending to write an
> explanation of that?

Yes, I definitely plan to write a bit more about that.


> 
> >    /*
> >     * pgStatSharedRefAge increments quite slowly than the time the following
> >     * loop takes so this is expected to iterate no more than twice.
> >     *
> >     * AFIXME: Why is this a good place to do this?
> >     */
> >    while (pgstat_shared_refs_need_gc())
> >        pgstat_shared_refs_gc();
> 
> Is the reason for the AFIXME is you think that GC-check happens too
> frequently?

Well, the while () loop makes me "suspicious" when looking at the code. I've
now made it an if (), I can't see a reason why we'd need a while()?

I just moved a bunch of that code around, there's probably a bit more polish
needed.


> > pgstat_shared_ref_release(PgStatHashKey key, PgStatSharedRef *shared_ref)
> > {
> ...
> >         * AFIXME: this probably is racy. Another backend could look up the
> >         * stat, bump the refcount, as we free it.
> >        if (pg_atomic_fetch_sub_u32(&shared_ref->shared_entry->refcount, 1) == 1)
> >        {
> ...
> >            /* only dropped entries can reach a 0 refcount */
> >            Assert(shared_ref->shared_entry->dropped);
> 
> I didn't deeply examined, but is that race condition avoidable by
> prevent pgstat_shared_ref_get from incrementing the refcount of
> dropped entries?

I don't think the race exists anymore. I've now revised the relevant code.


> >  * AFIXME: This needs to be deduplicated with pgstat_shared_ref_release(). But
> >  * it's not entirely trivial, because we can't use plain dshash_delete_entry()
> >  * (but have to use dshash_delete_current()).
> >  */
> > static bool
> > pgstat_drop_stats_entry(dshash_seq_status *hstat)
> ...
> >      * AFIXME: don't do this while holding the dshash lock.
> 
> Is the AFIXMEs mean that we should move the call to
> pgstat_shared_ref_release() out of the dshash-loop (in
> pgstat_drop_database_and_contents) that calls this function?  Is it
> sensible if we store the (key, ref) pairs for to-be released
> shared_refs then clean up them after exiting the loop?

I think this is now resolved. The release now happens separately, without
nested locks. See pgstat_shared_refs_release_db() call in
pgstat_drop_database_and_contents().


> 
> >         * Database stats contain other stats. Drop those as well when
> >         * dropping the database. AFIXME: Perhaps this should be done in a
> >         * slightly more principled way?
> >         */
> >        if (key.kind == PGSTAT_KIND_DB)
> >            pgstat_drop_database_and_contents(key.dboid);
> 
> I tend to agree with that, and we could have
> PgStatKindInfo.drop_cascade_cb(PgStatShm_StatEntryHeader *header). But
> it is really needed only by PGSTAT_KIND_DB.

Yea, I came to the same conclusion, namely that we don't need something better
for now.


> >  * AFIXME: consistent naming
> >  * AFIXME: deduplicate some of this code with pgstat_fetch_snapshot_build().
> >  *
> >  * AFIXME: it'd be nicer if we passed .snapshot_cb() the target memory
> >  * location, instead of putting PgStatSnapshot into pgstat_internal.h
> >  */
> > void
> > pgstat_snapshot_global(PgStatKind kind)
> 
> 
> Would having PGSTAT_KIND_NONE in PgStatKind or InvalidPgStatKind work
> for deduplication? I'm afraid that would be harmful in some way, though.

I think I made it a bit nicer now, without needing either of those. I'd like
to remove "global" from those functions, it's not actually that obvious what
it means.


> For the memory location, it seems like a matter of taste, but if we
> don't need multiple copies of a global snapshot, I think
> .snapshot_cb() doesn't need to take the target memory location.

I think it's ok for now. It'd be a bit nicer if we didn't need PgStatSnapshot
/ stats_snapshot in pgstat_internal.h, but it's ok that way I think.

Greetings,

Andres Freund



Re: shared-memory based stats collector - v67

From
Andres Freund
Date:
Hi,

On 2022-03-23 16:38:33 +0900, Kyotaro Horiguchi wrote:
> At Tue, 22 Mar 2022 11:56:40 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > Docs..  Yeah I'll try it.
> 
> This is the first cut, based on the earlier patchset.

Thanks!


> I didn't mention pgstat_force_next_flush() since I think it is a
> developer-only feature.

Yes, that makes sense.


Sorry for not yet getting back to looking at this.

One thing we definitely need to add documentation for is the
stats_fetch_consistency GUC. I think we should change its default to 'cache',
because that still gives the ability to "self-join", without the cost of the
current method.

Greetings,

Andres Freund



Re: shared-memory based stats collector - v66

From
Andres Freund
Date:
Hi,

On 2022-03-25 17:24:18 +0900, Kyotaro Horiguchi wrote:
> > AFIXME: Isn't PGSTAT_MIN_INTERVAL way too long? What is the justification
> > for increasing it?
> 
> It is 1000ms in the comment just above but actually 10000ms. The
> number came from a discussion that if we have 1000 clients and each
> backend writes stats once per 0.5 seconds, in total we flush pending
> data to the shared area 2000 times per second, which is too frequent. I
> raised it to 5000ms, then 10000ms.  So the expected maximum flush
> frequency is reduced to 100 times per second.  Of course this
> assumes the worst case, and 10000ms is apparently too long for the
> average case.
> 
> The current implementation of pgstat postpones flushing if a lock
> collision happens, by at most 60s.  This is a kind of
> auto-averaging mechanism.  It might be enough, and we could reduce
> PGSTAT_MIN_INTERVAL to 500ms or so.

I just noticed that the code doesn't appear to actually work like that right
now. Whenever the timeout is reached, pgstat_report_stat() is called with
force = true.

And even if the backend is busy running queries, once there's contention, the
next invocation of pgstat_report_stat() will return the timeout relative to
pending_since, which then will trigger a force report via a very short timeout
soon.

It might actually make sense to only ever return PGSTAT_RETRY_MIN_INTERVAL
(with a slightly different name) from pgstat_report_stat() when blocked
(limiting the max reporting delay for an idle connection) and to continue
calling pgstat_report_stat(force = true).  But to only trigger force
"internally" in pgstat_report_stat() when PGSTAT_MAX_INTERVAL is reached.

I think that'd mean we'd report after max PGSTAT_RETRY_MIN_INTERVAL in an idle
connection, and try reporting every PGSTAT_RETRY_MIN_INTERVAL (increasing up
to PGSTAT_MAX_INTERVAL when blocked) on busy connections.

Makes sense?
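
One possible reading of the scheme proposed above, as a rough sketch: the caller always asks for a report, contention only ever yields a retry delay (doubling from a minimum up to a maximum), and blocking on locks is decided internally once data has been pending for the maximum interval. The constants' values, the SketchFlushState struct and sketch_report_stat() are assumptions made for illustration; the 'contended' argument stands in for "a non-blocking flush would fail on a lock".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_RETRY_MIN_INTERVAL_MS 1000   /* assumed value */
#define SKETCH_MAX_INTERVAL_MS       60000  /* assumed value */

/* Hypothetical per-backend bookkeeping. */
typedef struct SketchFlushState
{
    bool    has_pending;        /* is unflushed data pending? */
    int64_t pending_since_ms;   /* when the oldest pending data appeared */
    int64_t retry_interval_ms;  /* current backoff, 0 if not backing off */
} SketchFlushState;

/*
 * Returns 0 if everything was flushed, otherwise the number of
 * milliseconds after which the caller should try again. *forced reports
 * whether this call decided to wait for locks instead of skipping them.
 */
static int64_t
sketch_report_stat(SketchFlushState *st, int64_t now_ms,
                   bool contended, bool *forced)
{
    if (!st->has_pending)
    {
        st->has_pending = true;
        st->pending_since_ms = now_ms;
    }
    assert(now_ms >= st->pending_since_ms);

    /* Only block on locks once the data has been pending for too long. */
    *forced = (now_ms - st->pending_since_ms >= SKETCH_MAX_INTERVAL_MS);

    if (*forced || !contended)
    {
        st->has_pending = false;
        st->retry_interval_ms = 0;
        return 0;
    }

    /* Blocked: back off, doubling up to the maximum interval. */
    if (st->retry_interval_ms == 0)
        st->retry_interval_ms = SKETCH_RETRY_MIN_INTERVAL_MS;
    else
    {
        st->retry_interval_ms *= 2;
        if (st->retry_interval_ms > SKETCH_MAX_INTERVAL_MS)
            st->retry_interval_ms = SKETCH_MAX_INTERVAL_MS;
    }
    return st->retry_interval_ms;
}
```

With this shape an idle connection retries after at most the minimum interval, while a busy, contended connection retries at growing intervals and is guaranteed a blocking flush once the maximum is reached.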


I think we need to do something with the pgstat_report_stat() calls outside of
postgres.c. Otherwise there's nothing limiting their reporting delay, because
they don't have the timeout logic postgres.c has.  None of them is ever hot
enough to be problematic, so I think we should just make them pass force=true?

Greetings,

Andres Freund



Re: shared-memory based stats collector - v66

From
Andres Freund
Date:
Hi,

On 2022-03-25 17:24:18 +0900, Kyotaro Horiguchi wrote:
> > * AFIXME: Should all the stats drop code be moved into pgstat_drop.c?
> 
> Or pgstat_xact.c?

I wasn't initially happy with that suggestion, but after running with it, it
looks pretty good.

I also moved a fair bit of code into pgstat_shm.c, which to me improved code
navigation a lot. I'm wondering about splitting it further even, into
pgstat_shm.c and pgstat_entry.c.

What do you think?

Greetings,

Andres Freund



Re: shared-memory based stats collector - v68

From
Andres Freund
Date:
Hi,

New version of the shared memory stats patchset. Most important changes:

- It's now "cumulative statistics system", as discussed at [1]. This basically
  is now the term that all the references to the "stats collector" are
  replaced with. Looks much better than "activity statistics" imo. The
  STATS_COLLECTOR is now named STATS_CUMULATIVE.  I tried to find all
  references to either collector or "activity statistics", but in all
  likelihood I didn't get them all.

- updated docs (significantly edited version of the version Kyotaro posted a
  few days ago)

- significantly improved test coverage - pgstat*.c are nearly completely
  covered. While pgstatfuncs.c coverage has increased, it is not great - but
  there's already so much more coverage, that I think it's good enough for
  now. Thanks to Melanie for help with this!

- largely cleaned up inconsistent function / type naming. Everything now /
  again is under the PgStats_ prefix, except for statistics in shared memory,
  which is prefixed with PgStatsShared_.  I think we should go further and
  add at least a PgStatsPending_ namespace, but that requires touching plenty
  of code that didn't need to be touched so far, so it'll have to be a task
  for another release.

- As discussed in [2] I added a patch at the start of the queue to clean up
  the inconsistent function header comments conventions.

- pgstat.c is further split. Two new files: pgstat_xact.c and pgstat_shmem.c
  (wrote an email about this a few days ago, without sending the patches)

- Split out as much as I could into separate commits.

- Cleaned up autovacuum.c changes - mostly removing more obsolete code

- code, comment polishing



Still todo:
- docs need review
- finish writing architectural comment atop pgstat.c
- fix the bug around pgstat_report_stat() I wrote about at [3]
- collect who reviewed earlier revisions
- choose what conditions for stats file reset we want
- I'm wondering if the solution for replication slot names on disk is too
  narrow, and instead we should have a more general "serialize" /
  "deserialize" callback. But we can easily do that later as well...


There's a bit more inconsistency around function naming. Right now all
callbacks are pgstat_$kind_$action_cb, but most of the rest of pgstat is
pgstat_$action_$kind.  But somehow it "feels" wrong for the callbacks -
there's also a bunch of functions already named similarly, but that's
partially my fault in commits in the past.


There are a lot of copies of "Permission checking for this function is managed
through the normal GRANT system." in the pre-existing code. Aren't they
completely bogus? None of the functions commented upon like that is actually
exposed to SQL!


Please take a look!


Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20220308205351.2xcn6k4x5yivcxyd%40alap3.anarazel.de
[2] https://www.postgresql.org/message-id/20220329191727.mzzwbl7udhpq7pmf%40alap3.anarazel.de
[3] https://www.postgresql.org/message-id/20220402081648.kbapqdxi2rr3ha3w@alap3.anarazel.de

Attachments

Re: shared-memory based stats collector - v68

From
Thomas Munro
Date:
On Mon, Apr 4, 2022 at 4:16 PM Andres Freund <andres@anarazel.de> wrote:
> Please take a look!

A few superficial comments:

> [PATCH v68 01/31] pgstat: consistent function header formatting.
> [PATCH v68 02/31] pgstat: remove some superflous comments from pgstat.h.

+1

> [PATCH v68 03/31] dshash: revise sequential scan support.

Logic looks good.  That is,
lock-0-and-ensure_valid_bucket_pointers()-only-once makes sense.  Just
some comment trivia:

+ * dshash_seq_term needs to be called when a scan finished.  The caller may
+ * delete returned elements midst of a scan by using dshash_delete_current()
+ * if exclusive = true.

s/scan finished/scan is finished/
s/midst of/during/ (or /in the middle of/, ...)

> [PATCH v68 04/31] dsm: allow use in single user mode.

LGTM.

+   Assert(IsUnderPostmaster || !IsPostmasterEnvironment);

(Not this patch's fault, but I wish we had a more explicit way to say "am
single user".)

> [PATCH v68 05/31] pgstat: stats collector references in comments

LGTM.  I could think of some alternative suggested names for this subsystem,
but don't think it would be helpful at this juncture so I will refrain :-)

> [PATCH v68 06/31] pgstat: add pgstat_copy_relation_stats().
> [PATCH v68 07/31] pgstat: move transactional code into pgstat_xact.c.

LGTM.

> [PATCH v68 08/31] pgstat: introduce PgStat_Kind enum.

+#define PGSTAT_KIND_FIRST PGSTAT_KIND_DATABASE
+#define PGSTAT_KIND_LAST PGSTAT_KIND_WAL
+#define PGSTAT_NUM_KINDS (PGSTAT_KIND_LAST + 1)

It's a little confusing that PGSTAT_NUM_KINDS isn't really the number of kinds,
because there is no kind 0.  For the two users of it... maybe just use
pgstat_kind_infos[] = {...}, and
global_valid[PGSTAT_KIND_LAST + 1]?
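
To illustrate the suggestion above: when the enum starts at 1, an array indexed directly by kind needs LAST + 1 slots, one more than the number of kinds, and letting the compiler infer the size from designated initializers avoids needing a misleading "number of kinds" macro. The Sketch* names and the three kinds below are illustrative only, not the patch's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative subset of kinds; there is deliberately no kind 0. */
typedef enum SketchKind
{
    SKETCH_KIND_DATABASE = 1,
    SKETCH_KIND_RELATION,
    SKETCH_KIND_WAL
} SketchKind;

#define SKETCH_KIND_FIRST SKETCH_KIND_DATABASE
#define SKETCH_KIND_LAST  SKETCH_KIND_WAL

typedef struct SketchKindInfo
{
    const char *name;
} SketchKindInfo;

/* Size is inferred from the initializers; slot 0 stays zero-filled. */
static const SketchKindInfo sketch_kind_infos[] = {
    [SKETCH_KIND_DATABASE] = {"database"},
    [SKETCH_KIND_RELATION] = {"relation"},
    [SKETCH_KIND_WAL] = {"wal"},
};

/* Indexed by kind, so it needs LAST + 1 elements, not "number of kinds". */
static bool sketch_global_valid[SKETCH_KIND_LAST + 1];

static void
sketch_mark_valid(SketchKind kind)
{
    assert(kind >= SKETCH_KIND_FIRST && kind <= SKETCH_KIND_LAST);
    sketch_global_valid[kind] = true;
}

static size_t
sketch_count_valid(void)
{
    size_t n = 0;

    for (int k = SKETCH_KIND_FIRST; k <= SKETCH_KIND_LAST; k++)
        if (sketch_global_valid[k])
            n++;
    return n;
}
```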

> [PATCH v68 10/31] pgstat: scaffolding for transactional stats creation / drop.

+   /*
+    * Dropping the statistics for objects that dropped transactionally itself
+    * needs to be transactional. ...

Hard to parse.  How about:  "Objects are dropped transactionally, so
related statistics need to be dropped transactionally too."

> [PATCH v68 13/31] pgstat: store statistics in shared memory.

+ * Single-writer stats use the changecount mechanism to achieve low-overhead
+ * writes - they're obviously performance critical than reads. Check the
+ * definition of struct PgBackendStatus for some explanation of the
+ * changecount mechanism.

Missing word "more" after obviously?
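
For readers unfamiliar with the changecount mechanism mentioned in that comment, here is a minimal single-writer sketch of the general protocol: the writer bumps a counter before and after modifying the data, and a reader retries if the counter was odd or changed while it copied. It uses C11 sequentially consistent atomics as a stand-in for the memory barriers PostgreSQL uses; the Sketch* names are hypothetical, and a strictly conforming C11 version would also need to avoid the plain concurrent accesses to the payload.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef struct SketchStats
{
    uint64_t calls;
    uint64_t bytes;
} SketchStats;

typedef struct SketchShared
{
    atomic_uint changecount;    /* odd while a write is in progress */
    SketchStats stats;
} SketchShared;

/* Must only ever be called by the single writer. */
static void
sketch_write(SketchShared *s, uint64_t calls, uint64_t bytes)
{
    /* with a single writer, the count is even between writes */
    assert((atomic_load(&s->changecount) & 1) == 0);

    atomic_fetch_add(&s->changecount, 1);   /* becomes odd */
    s->stats.calls = calls;
    s->stats.bytes = bytes;
    atomic_fetch_add(&s->changecount, 1);   /* becomes even again */
}

/* Readers take no lock; they retry until they saw a stable, even count. */
static SketchStats
sketch_read(SketchShared *s)
{
    SketchStats copy;
    unsigned int before;
    unsigned int after;

    do
    {
        before = atomic_load(&s->changecount);
        copy = s->stats;
        after = atomic_load(&s->changecount);
    } while (before != after || (before & 1) != 0);

    return copy;
}
```

The writer never waits, which is the point of the comment: writes are kept cheap at the cost of occasional reader retries.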

+    /*
+     * Whenever the for a dropped stats entry could not be freed (because
+     * backends still have references), this is incremented, causing backends
+     * to run pgstat_gc_entry_refs(), allowing that memory to be reclaimed.
+     */
+    pg_atomic_uint64 gc_count;

Whenever the ...?

Would it be better to call this variable gc_request_count?

+     * Initialize refcount to 1, marking it as valid / not tdroped. The entry

s/tdroped/dropped/

+     * further if a longer lived references is needed.

s/references/reference/

+            /*
+             * There are legitimate cases where the old stats entry might not
+             * yet have been dropped by the time it's reused. The easiest case
+             * are replication slot stats. But oid wraparound can lead to
+             * other cases as well. We just reset the stats to their plain
+             * state.
+             */
+            shheader = pgstat_reinit_entry(kind, shhashent);

This whole comment is repeated in pgstat_reinit_entry and its caller.

+    /*
+     * XXX: Might be worth adding some frobbing of the allocation before
+     * freeing, to make it easier to detect use-after-free style bugs.
+     */
+    dsa_free(pgStatLocal.dsa, pdsa);

FWIW dsa_free() clobbers memory in assert builds, just like pfree().

+static Size
+pgstat_dsa_init_size(void)
+{
+    Size        sz;
+
+    /*
+     * The dshash header / initial buckets array needs to fit into "plain"
+     * shared memory, but it's beneficial to not need dsm segments
+     * immediately. A size of 256kB seems works well and is not
+     * disproportional compared to other constant sized shared memory
+     * allocations. NB: To avoid DSMs further, the user can configure
+     * min_dynamic_shared_memory.
+     */
+    sz = 256 * 1024;

It kinda bothers me that the memory reserved by
min_dynamic_shared_memory might eventually fill up with stats, and not
be available for temporary use by parallel queries (which can benefit
more from fast acquire/release on DSMs, and probably also huge pages,
or maybe not...), and that's hard to diagnose.

+         * (4) turn off the idle-in-transaction, idle-session and
+         * idle-state-update timeouts if active.  We do this before step (5) so

s/idle-state-/idle-stats-/

+    /*
+     * Some of the pending stats may have not been flushed due to lock
+     * contention.  If we have such pending stats here, let the caller know
+     * the retry interval.
+     */
+    if (partial_flush)
+    {

I think it's better for a comment that is outside the block to say "If
some of the pending...".  Or the comment should be inside the blocks.

+static void
+pgstat_build_snapshot(void)
+{
...
+    dshash_seq_init(&hstat, pgStatLocal.shared_hash, false);
+    while ((p = dshash_seq_next(&hstat)) != NULL)
+    {
...
+        entry->data = MemoryContextAlloc(pgStatLocal.snapshot.context,
...
+    }
+    dshash_seq_term(&hstat);

Doesn't allocation failure leave the shared hash table locked?

> [PATCH v68 16/31] pgstat: add pg_stat_exists_stat() for easier testing.

pg_stat_exists_stat() is a weird name, ... would it be better as
pg_stat_object_exists()?

> [PATCH v68 28/31] pgstat: update docs.

+        Determines the behaviour when cumulative statistics are accessed

AFAIK our manual is written in en_US, so s/behaviour/behavior/.

+        memory. When set to <literal>cache</literal>, the first access to
+        statistics for an object caches those statistics until the end of the
+        transaction / until <function>pg_stat_clear_snapshot()</function> is

s|/|unless|

+         <literal>none</literal> is most suitable for monitoring solutions. If

I'd change "solutions" to "tools" or maybe "systems".

+   When using the accumulated statistics views and functions to monitor collected data, it is important

Did you intend to write "accumulated" instead of "cumulative" here?

+   You can invoke <function>pg_stat_clear_snapshot</function>() to discard the
+   current transaction's statistics snapshot / cache (if any).  The next use

I'd change s|/ cache|or cached values|.  I think "/" like this is an informal
thing.



Re: shared-memory based stats collector - v68

From
Andres Freund
Date:
Hi,

On 2022-04-05 01:16:04 +1200, Thomas Munro wrote:
> On Mon, Apr 4, 2022 at 4:16 PM Andres Freund <andres@anarazel.de> wrote:
> > Please take a look!
> 
> A few superficial comments:
> 
> > [PATCH v68 01/31] pgstat: consistent function header formatting.
> > [PATCH v68 02/31] pgstat: remove some superflous comments from pgstat.h.
> 
> +1

Planning to commit these after making another coffee and proof reading them
some more.


> > [PATCH v68 03/31] dshash: revise sequential scan support.
> 
> Logic looks good.  That is,
> lock-0-and-ensure_valid_bucket_pointers()-only-once makes sense.  Just
> some comment trivia:
> 
> + * dshash_seq_term needs to be called when a scan finished.  The caller may
> + * delete returned elements midst of a scan by using dshash_delete_current()
> + * if exclusive = true.
> 
> s/scan finished/scan is finished/
> s/midst of/during/ (or /in the middle of/, ...)
> 
> > [PATCH v68 04/31] dsm: allow use in single user mode.
> 
> LGTM.


> +   Assert(IsUnderPostmaster || !IsPostmasterEnvironment);

> (Not this patch's fault, but I wish we had a more explicit way to say "am
> single user".)

Agreed.


> > [PATCH v68 05/31] pgstat: stats collector references in comments
> 
> LGTM.  I could think of some alternative suggested names for this subsystem,
> but don't think it would be helpful at this juncture so I will refrain :-)

Heh. I did start a thread about it a while ago :)


> > [PATCH v68 08/31] pgstat: introduce PgStat_Kind enum.
> 
> +#define PGSTAT_KIND_FIRST PGSTAT_KIND_DATABASE
> +#define PGSTAT_KIND_LAST PGSTAT_KIND_WAL
> +#define PGSTAT_NUM_KINDS (PGSTAT_KIND_LAST + 1)
> 
> It's a little confusing that PGSTAT_NUM_KINDS isn't really the number of kinds,
> because there is no kind 0.  For the two users of it... maybe just use
> pgstat_kind_infos[] = {...}, and
> global_valid[PGSTAT_KIND_LAST + 1]?

Maybe the whole justification for not defining an invalid kind is moot
now. There's not a single switch covering all kinds of stats left, and I hope
that we don't introduce one again...


> > [PATCH v68 10/31] pgstat: scaffolding for transactional stats creation / drop.
> 
> +   /*
> +    * Dropping the statistics for objects that dropped transactionally itself
> +    * needs to be transactional. ...
> 
> Hard to parse.  How about:  "Objects are dropped transactionally, so
> related statistics need to be dropped transactionally too."

Not all objects are dropped transactionally. But I agree it reads awkwardly. I
now, incorporating feedback from Justin as well, rephrased it to:

    /*
     * Statistics for transactionally dropped objects need to be
     * transactionally dropped as well. Collect the stats dropped in the
     * current (sub-)transaction and only execute the stats drop when we know
     * if the transaction commits/aborts. To handle replicas and crashes,
     * stats drops are included in commit / abort records.
     */

A few too many "drop"s in there, but maybe that's unavoidable.



> +    /*
> +     * Whenever the for a dropped stats entry could not be freed (because
> +     * backends still have references), this is incremented, causing backends
> +     * to run pgstat_gc_entry_refs(), allowing that memory to be reclaimed.
> +     */
> +    pg_atomic_uint64 gc_count;
> 
> Whenever the ...?

     * Whenever statistics for dropped objects could not be freed - because
     * backends still have references - the dropping backend calls
     * pgstat_request_entry_refs_gc() incrementing this counter. Eventually
     * that causes backends to run pgstat_gc_entry_refs(), allowing memory to
     * be reclaimed.


> Would it be better to call this variable gc_request_count?

Agreed.
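
The counter described in the reworded comment can be sketched like this: the dropping backend bumps a shared request counter, and every backend compares it with the last value it acted on to decide whether to garbage-collect its local references. This is a simplified illustration with C11 atomics; the Sketch* types and function names are hypothetical, and the actual release of references is elided.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared control data. */
typedef struct SketchSharedCtl
{
    atomic_uint_fast64_t gc_request_count;
} SketchSharedCtl;

/* Hypothetical backend-local state. */
typedef struct SketchLocal
{
    uint_fast64_t seen_gc_request_count;
} SketchLocal;

/* Called by a backend that could not free a dropped entry. */
static void
sketch_request_gc(SketchSharedCtl *ctl)
{
    atomic_fetch_add(&ctl->gc_request_count, 1);
}

/* Cheap check each backend can run before using its local references. */
static bool
sketch_need_gc(SketchSharedCtl *ctl, const SketchLocal *local)
{
    return atomic_load(&ctl->gc_request_count) != local->seen_gc_request_count;
}

static void
sketch_gc(SketchSharedCtl *ctl, SketchLocal *local)
{
    /* read the counter first, so requests arriving during GC are not lost */
    uint_fast64_t cur = atomic_load(&ctl->gc_request_count);

    assert(cur >= local->seen_gc_request_count);

    /* ... release local references to dropped entries here ... */

    local->seen_gc_request_count = cur;
}
```

Because the counter is read before the references are processed, a request arriving mid-GC simply makes the next check report that another pass is needed, so a plain if () at the call site is enough.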


> +     * Initialize refcount to 1, marking it as valid / not tdroped. The entry
> 
> s/tdroped/dropped/
> 
> +     * further if a longer lived references is needed.
> 
> s/references/reference/

Fixed.


> +            /*
> +             * There are legitimate cases where the old stats entry might not
> +             * yet have been dropped by the time it's reused. The easiest case
> +             * are replication slot stats. But oid wraparound can lead to
> +             * other cases as well. We just reset the stats to their plain
> +             * state.
> +             */
> +            shheader = pgstat_reinit_entry(kind, shhashent);
> 
> This whole comment is repeated in pgstat_reinit_entry and its caller.

I guess I felt as indecisive about where to place it between the two locations
when I wrote it as I do now. Left it at the callsite for now.


> +    /*
> +     * XXX: Might be worth adding some frobbing of the allocation before
> +     * freeing, to make it easier to detect use-after-free style bugs.
> +     */
> +    dsa_free(pgStatLocal.dsa, pdsa);
> 
> FWIW dsa_free() clobbers memory in assert builds, just like pfree().

Oh. I could swear I saw that not being the case a while ago. But clearly it is
the case. Removed.


> +static Size
> +pgstat_dsa_init_size(void)
> +{
> +    Size        sz;
> +
> +    /*
> +     * The dshash header / initial buckets array needs to fit into "plain"
> +     * shared memory, but it's beneficial to not need dsm segments
> +     * immediately. A size of 256kB seems works well and is not
> +     * disproportional compared to other constant sized shared memory
> +     * allocations. NB: To avoid DSMs further, the user can configure
> +     * min_dynamic_shared_memory.
> +     */
> +    sz = 256 * 1024;
> 
> It kinda bothers me that the memory reserved by
> min_dynamic_shared_memory might eventually fill up with stats, and not
> be available for temporary use by parallel queries (which can benefit
> more from fast acquire/release on DSMs, and probably also huge pages,
> or maybe not...), and that's hard to diagnose.

It's not great, but I don't really see an alternative? The saving grace is
that it's hard to imagine "real" usages of min_dynamic_shared_memory being
used up by stats.


> +         * (4) turn off the idle-in-transaction, idle-session and
> +         * idle-state-update timeouts if active.  We do this before step (5) so
> 
> s/idle-state-/idle-stats-/
> 
> +    /*
> +     * Some of the pending stats may have not been flushed due to lock
> +     * contention.  If we have such pending stats here, let the caller know
> +     * the retry interval.
> +     */
> +    if (partial_flush)
> +    {
> 
> I think it's better for a comment that is outside the block to say "If
> some of the pending...".  Or the comment should be inside the blocks.

The comment says "if" in the second sentence? But it's a bit awkward anyway,
rephrased to:

     * If some of the pending stats could not be flushed due to lock
     * contention, let the caller know when to retry.



> +static void
> +pgstat_build_snapshot(void)
> +{
> ...
> +    dshash_seq_init(&hstat, pgStatLocal.shared_hash, false);
> +    while ((p = dshash_seq_next(&hstat)) != NULL)
> +    {
> ...
> +        entry->data = MemoryContextAlloc(pgStatLocal.snapshot.context,
> ...
> +    }
> +    dshash_seq_term(&hstat);
> 
> Doesn't allocation failure leave the shared hash table locked?

Not the shared table itself - the error path does LWLockReleaseAll(). The
problem is the backend local dshash_table, specifically
find_[exclusively_]locked will stay set, and then cause assertion failures
when used next.

I think we need to fix that in dshash.c. We have code in released branches
that's vulnerable to this problem. E.g.
ensure_record_cache_typmod_slot_exists() in lookup_rowtype_tupdesc_internal().

See also
https://postgr.es/m/20220311012712.botrpsikaufzteyt%40alap3.anarazel.de

Afaics the only real choice is to remove find_[exclusively_]locked and rely on
LWLockHeldByMeInMode() instead.
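
The failure mode can be illustrated with a small simulation. It is not dshash code: setjmp/longjmp stands in for elog(ERROR), a global boolean stands in for the LWLock, and SketchTable.find_locked plays the role of the backend-local flag. After the simulated error, the "lock" has been released by the error path but the local flag is stale.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>

static bool     sketch_lwlock_held;     /* stands in for the LWLock state */
static jmp_buf  sketch_error_jmp;       /* stands in for the error handler */

/* Backend-local table handle with a separately tracked "locked" flag. */
typedef struct SketchTable
{
    bool find_locked;
} SketchTable;

static void
sketch_scan(SketchTable *t, bool fail)
{
    /* this is the kind of assertion that trips on the next use */
    assert(!t->find_locked);

    sketch_lwlock_held = true;
    t->find_locked = true;

    if (fail)
        longjmp(sketch_error_jmp, 1);   /* e.g. an allocation failure */

    t->find_locked = false;
    sketch_lwlock_held = false;
}

/* Returns true on success, false if the simulated error was raised. */
static bool
sketch_scan_with_recovery(SketchTable *t, bool fail)
{
    if (setjmp(sketch_error_jmp) == 0)
    {
        sketch_scan(t, fail);
        return true;
    }

    /* error path: releases the lock, but knows nothing about the flag */
    sketch_lwlock_held = false;
    return false;
}
```

Deriving "is it locked?" from the lock itself, as suggested above, removes the second piece of state that can go stale.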


> > [PATCH v68 16/31] pgstat: add pg_stat_exists_stat() for easier testing.
> 
> pg_stat_exists_stat() is a weird name, ... would it be better as
> pg_stat_object_exists()?

I was fighting with this one a bunch :). Earlier it was called
pg_stat_stats_exist() I think. "object" makes it sound a bit too much like
it's the database object?

Maybe pg_stat_have_stat()?


> > [PATCH v68 28/31] pgstat: update docs.
> 
> +        Determines the behaviour when cumulative statistics are accessed
> 
> AFAIK our manual is written in en_US, so s/behaviour/behavior/.

Fixed like 10 instances of this in the patchset. Not sure why I just can't
make myself type behavior.


> +        memory. When set to <literal>cache</literal>, the first access to
> +        statistics for an object caches those statistics until the end of the
> +        transaction / until <function>pg_stat_clear_snapshot()</function> is
> 
> s|/|unless|
> 
> +         <literal>none</literal> is most suitable for monitoring solutions. If
> 
> I'd change "solutions" to "tools" or maybe "systems".

Done.


> +   When using the accumulated statistics views and functions to monitor collected data, it is important
> 
> Did you intend to write "accumulated" instead of "cumulative" here?

Not sure. I think I got bored of the word at some point :P


> +   You can invoke <function>pg_stat_clear_snapshot</function>() to discard the
> +   current transaction's statistics snapshot / cache (if any).  The next use
> 
> I'd change s|/ cache|or cached values|.  I think "/" like this is an informal
> thing.

I think we have a few other uses of it. But anyway, changed.

Thanks!

Andres



Re: shared-memory based stats collector - v68

From
Andres Freund
Date:
Hi,

On 2022-04-03 21:15:16 -0700, Andres Freund wrote:
> - collect who reviewed earlier revisions

I found reviews by
- Tomas Vondra <tomas.vondra@2ndquadrant.com>
- Arthur Zakirov <a.zakirov@postgrespro.ru>
- Antonin Houska <ah@cybertec.at>

There are also reviews by Fujii and Alvaro, but afaics just for parts that were
separately committed.

Greetings,

Andres Freund



Re: shared-memory based stats collector - v68

From
"David G. Johnston"
Date:

On Sun, Apr 3, 2022 at 9:16 PM Andres Freund <andres@anarazel.de> wrote:

> Please take a look!


I didn't take the time to fix up all the various odd typos in the general code comments; none of them reduced comprehension appreciably.  I may do so when/if I do another pass.

I did skim over the entire patch set and, FWIW, found it to be quite understandable.  I don't have the experience to comment on the lower-level details like locking and such but the medium picture stuff makes sense to me both as a user and a developer.  I did leave a couple of comments about parts that at least piqued my interest (reset single stats) or seemed like an undesirable restriction that was under addressed (before server shutdown called exactly once).

I agree with Thomas's observation regarding PGSTAT_KIND_LAST.  I also think that leaving it starting at 1 makes sense - maybe just fix the name and comment to better reflect its actual usage in core.

I concur also with changing usages of " / " to ", or"

My first encounter with pg_stat_exists_stat() didn't draw my attention as being problematic so I'd say we just stick with it.  As a SQL user reading: WHERE exists (...) is somewhat natural; using "have" or back-to-back stat_stat is less appealing.

I would suggest we do away with stats_fetch_consistency "snapshot" mode and instead add a function that can be called that would accomplish the same thing but in "cache" mode.  Future iterations of that function could accept patterns, allowing for something between "one" and "everything".

I'm also not an immediate fan of "fetch_consistency"; with the function suggestion it is basically "cache" and "no-cache" so maybe: stats_use_transaction_cache ? (haven't thought hard or long on this one...)


 diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 22d0a1e491..e889c11d9e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2123,7 +2123,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
      </row>
      <row>
       <entry><literal>PgStatsData</literal></entry>
-      <entry>Waiting fo shared memory stats data access</entry>
+      <entry>Waiting for shared memory stats data access</entry>
      </row>
      <row>
       <entry><literal>SerializableXactHash</literal></entry>
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 2689d0962c..bc7bdf8064 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -4469,7 +4469,7 @@ PostgresMain(const char *dbname, const char *username)
 
  /*
  * (4) turn off the idle-in-transaction, idle-session and
- * idle-state-update timeouts if active.  We do this before step (5) so
+ * idle-stats-update timeouts if active.  We do this before step (5) so
  * that any last-moment timeout is certain to be detected in step (5).
  *
  * At most one of these timeouts will be active, so there's no need to
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index dbd55a065d..370638b33b 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -5,7 +5,7 @@
  * Provides the infrastructure to collect and access cumulative statistics,
  * e.g. per-table access statistics, of all backends in shared memory.
  *
- * Most statistics updates are first first accumulated locally in each process
+ * Most statistics updates are first accumulated locally in each process
  * as pending entries, then later flushed to shared memory (just after commit,
  * or by idle-timeout).
  *
@@ -371,7 +371,9 @@ pgstat_discard_stats(void)
 /*
  * pgstat_before_server_shutdown() needs to be called by exactly one process
  * during regular server shutdowns. Otherwise all stats will be lost.
- *
+ * XXX: What bad things happen if this is invoked by more than one process?
+ *   I'd presume stats are not actually lost in that case.  Can we just 'no-op'
+ *   subsequent calls and say "at least once at shutdown, as late as possible"
  * We currently only write out stats for proc_exit(0). We might want to change
  * that at some point... But right now pgstat_discard_stats() would be called
  * during the start after a disorderly shutdown, anyway.
@@ -654,6 +656,14 @@ pgstat_reset_single_counter(PgStat_Kind kind, Oid objoid)
 
  Assert(!pgstat_kind_info_for(kind)->fixed_amount);
 
+ /*
+ * More of a conceptual observation here - the fact that something is fixed does not imply
+ * that it is not fixed at a value greater than zero and thus could have single subentries
+ * that could be addressed.
+ * I also am unsure, off the top of my head, whether both replication slots and subscriptions,
+ * which are fixed, can be reset singly (today, and/or whether this patch enables that capability)
+ */
+
  /* Set the reset timestamp for the whole database */
  pgstat_reset_database_timestamp(MyDatabaseId, ts);
 

David J.

Re: shared-memory based stats collector - v68

From
Andres Freund
Date:
Hi,

On 2022-04-04 13:45:40 -0700, David G. Johnston wrote:
> I didn't take the time to fixup all the various odd typos in the general
> code comments; none of them reduced comprehension appreciably.  I may do so
> when/if I do another pass.

Cool.


> My first encounter with pg_stat_exists_stat() didn't draw my attention as
> being problematic so I'd say we just stick with it.  As a SQL user reading:
> WHERE exists (...) is somewhat natural; using "have" or back-to-back
> stat_stat is less appealing.

There are a number of other *_exists functions, albeit not within
pg_stat_*. Like jsonb_exists. Perhaps just pg_stat_exists()?


> I would suggest we do away with stats_fetch_consistency "snapshot" mode and
> instead add a function that can be called that would accomplish the same
> thing but in "cache" mode.  Future iterations of that function could accept
> patterns, allowing for something between "one" and "everything".

I don't want to do that. We had a lot of discussion around what consistency
model we want, and Tom was adamant that there needs to be a mode that behaves
like the current consistency model (which is what snapshot behaves like, with
very minor differences).  A way to get back to the old behaviour seems good,
and the function idea doesn't provide that.


(merged the typos that I hadn't already fixed based on Justin / Thomas'
feedback)


> @@ -371,7 +371,9 @@ pgstat_discard_stats(void)
>  /*
>   * pgstat_before_server_shutdown() needs to be called by exactly one process
>   * during regular server shutdowns. Otherwise all stats will be lost.
> - *
> + * XXX: What bad things happen if this is invoked by more than one process?
> + *   I'd presume stats are not actually lost in that case.  Can we just 'no-op'
> + *   subsequent calls and say "at least once at shutdown, as late as possible"

What's the reason behind this question? There really shouldn't be a second
call (and there's only a single callsite). As is you'd get an assertion
failure about things already having been shutdown.

I don't think we want to relax that, because in all the realistic scenarios
that I can think of that'd open us up to losing stats that were generated
after the first writeout of the stats data.

You mentioned this as a restriction above - I'm not seeing it as such?  I'd
like to write out stats more often in the future (e.g. in the checkpointer),
but then it'd not be written out with this function...


> @@ -654,6 +656,14 @@ pgstat_reset_single_counter(PgStat_Kind kind, Oid
> objoid)
> 
>   Assert(!pgstat_kind_info_for(kind)->fixed_amount);
> 
> + /*
> + * More of a conceptual observation here - the fact that something is
> fixed does not imply
> + * that it is not fixed at a value greater than zero and thus could have
> single subentries
> + * that could be addressed.

pgstat_reset_single_counter() is a pre-existing function (with a pre-existing
name, but adapted signature in the patch), it's currently only used for
functions and relation stats.


> + * I also am unsure, off the top of my head, whether both replication
> slots and subscriptions,
> + * which are fixed, can be reset singly (today, and/or whether this patch
> enables that capability)
> + */

FWIW, neither are implemented as fixed amount stats. There's afaics no limit
at all for the number of existing subscriptions (although some would either
need to be disabled or you'd get errors). While there is a limit on the number
of slots, that's a configurable limit. So replication slot stats are also
implemented as variable amount stats (that used to be different, wasn't nice).

There's one example of fixed amount stats that can be reset more granularly,
namely slru. That can be done via pg_stat_reset_slru().

Thanks,

Andres



Re: shared-memory based stats collector - v68

From
"David G. Johnston"
Date:
On Mon, Apr 4, 2022 at 2:06 PM Andres Freund <andres@anarazel.de> wrote:

> My first encounter with pg_stat_exists_stat() didn't draw my attention as
> being problematic so I'd say we just stick with it.  As a SQL user reading:
> WHERE exists (...) is somewhat natural; using "have" or back-to-back
> stat_stat is less appealing.

> There are a number of other *_exists functions, albeit not within
> pg_stat_*. Like jsonb_exists. Perhaps just pg_stat_exists()?


Works for me.

> A way to get back to the old behaviour seems good,
> and the function idea doesn't provide that.

Makes sense.

> (merged the typos that I hadn't already fixed based on Justin / Thomas'
> feedback)


> @@ -371,7 +371,9 @@ pgstat_discard_stats(void)
>  /*
>   * pgstat_before_server_shutdown() needs to be called by exactly one
> process
>   * during regular server shutdowns. Otherwise all stats will be lost.
> - *
> + * XXX: What bad things happen if this is invoked by more than one process?
> + *   I'd presume stats are not actually lost in that case.  Can we just
> 'no-op'
> + *   subsequent calls and say "at least once at shutdown, as late as
> possible"

> What's the reason behind this question? There really shouldn't be a second
> call (and there's only a single callsite). As is you'd get an assertion
> failure about things already having been shut down.

Mostly OCD, I guess: "exactly one" has two failure modes - zero, and > 1; and the "Otherwise" only covers the zero mode.


> I don't think we want to relax that, because in all the realistic scenarios
> that I can think of that'd open us up to losing stats that were generated
> after the first writeout of the stats data.
>
> You mentioned this as a restriction above - I'm not seeing it as such?  I'd
> like to write out stats more often in the future (e.g. in the checkpointer),
> but then it'd not be written out with this function...


Yeah, the idea only really works if you can implement "last one out, shut off the lights".  I think I was subconsciously wanting this to work that way, but the existing process is good.
  

> @@ -654,6 +656,14 @@ pgstat_reset_single_counter(PgStat_Kind kind, Oid
> objoid)
>
>   Assert(!pgstat_kind_info_for(kind)->fixed_amount);
>
> + /*
> + * More of a conceptual observation here - the fact that something is
> fixed does not imply
> + * that it is not fixed at a value greater than zero and thus could have
> single subentries
> + * that could be addressed.

> pgstat_reset_single_counter() is a pre-existing function (with a pre-existing
> name, but adapted signature in the patch), it's currently only used for
> functions and relation stats.


> + * I also am unsure, off the top of my head, whether both replication
> slots and subscriptions,
> + * which are fixed, can be reset singly (today, and/or whether this patch
> enables that capability)
> + */

> FWIW, neither are implemented as fixed amount stats.

That was a typo, I meant to write variable.  My point was that of these 5 kinds that will pass the assertion test only 2 of them are actually handled by the function today.

+ PGSTAT_KIND_DATABASE = 1, /* database-wide statistics */
+ PGSTAT_KIND_RELATION, /* per-table statistics */
+ PGSTAT_KIND_FUNCTION, /* per-function statistics */
+ PGSTAT_KIND_REPLSLOT, /* per-slot statistics */
+ PGSTAT_KIND_SUBSCRIPTION, /* per-subscription statistics */


> There's one example of fixed amount stats that can be reset more granularly,
> namely slru. That can be done via pg_stat_reset_slru().


Right, hence the conceptual disconnect.  It doesn't affect the implementation, everything is working just fine, but is something to ponder for future maintainers getting up to speed here.

As the existing function only handles functions and relations why not just perform a specific Kind check for them?  Generalizing to assert on whether or not the function works on fixed or variable Kinds seems beyond its present state.  Or could it be used, as-is, for databases, replication slots, and subscriptions today, and we just haven't migrated those areas to use the now generalized function?  Even then, unless we do expand the definition of this publicly facing function, it seems better to precisely define what it requires as an input Kind by checking for RELATION or FUNCTION specifically.

David J.

Re: shared-memory based stats collector - v68

From
Andres Freund
Date:
Hi,

On 2022-04-04 14:25:57 -0700, David G. Johnston wrote:
> > You mentioned this as a restriction above - I'm not seeing it as such?  I'd
> > like to write out stats more often in the future (e.g. in the
> > checkpointer),
> > but then it'd not be written out with this function...
> >
> >
> Yeah, the idea only really works if you can implement "last one out, shut
> off the lights".  I think I was subconsciously wanting this to work that
> way, but the existing process is good.

Preserving stats more than we do today (the patch doesn't really affect that)
will require a good chunk more work. My idea for it is that we'd write the
file out as part of a checkpoint / restartpoint, with a name including the
redo-lsn. Then when recovery starts, it can use the stats file associated with
that to start from.  Then we'd lose at most 1 checkpoint's worth of stats
during a crash, not more.

There's a few non-trivial corner cases to solve, around stats objects getting
dropped concurrently with creating that serialized snapshot. Solvable, but not
trivial.
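
To make the naming part of that idea concrete, here is a rough sketch. The helper `pgstat_checkpoint_filename()`, the `RedoLsn` typedef and the `pg_stat/pgstat_%08X_%08X.stat` pattern are all hypothetical and not part of the patch; the sketch only shows how a 64-bit redo LSN could be embedded in a file name so that recovery can look up the file matching the checkpoint it starts from.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for a 64-bit WAL position (hypothetical name). */
typedef uint64_t RedoLsn;

/*
 * Hypothetical helper: build a stats file name that embeds the redo LSN
 * of the checkpoint it belongs to.  Directory and pattern are assumptions
 * made purely for illustration.
 */
static int
pgstat_checkpoint_filename(char *buf, size_t buflen, RedoLsn redo)
{
    /* Split into high/low 32-bit halves, like the usual %X/%X LSN display. */
    uint32_t    hi = (uint32_t) (redo >> 32);
    uint32_t    lo = (uint32_t) redo;

    assert(buf != NULL && buflen > 0);

    return snprintf(buf, buflen,
                    "pg_stat/pgstat_%08" PRIX32 "_%08" PRIX32 ".stat",
                    hi, lo);
}
```

Zero-padding both halves keeps the names fixed-width, so a plain lexical sort of the directory would also order the files by LSN.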


> > > + * I also am unsure, off the top of my head, whether both replication
> > > slots and subscriptions,
> > > + * which are fixed, can be reset singly (today, and/or whether this
> > patch
> > > enables that capability)
> > > + */
> >
> > FWIW, neither are implemented as fixed amount stats.
> 
> 
> That was a typo, I meant to write variable.  My point was that of these 5
> kinds that will pass the assertion test only 2 of them are actually handled
> by the function today.
> 
> + PGSTAT_KIND_DATABASE = 1, /* database-wide statistics */
> + PGSTAT_KIND_RELATION, /* per-table statistics */
> + PGSTAT_KIND_FUNCTION, /* per-function statistics */
> + PGSTAT_KIND_REPLSLOT, /* per-slot statistics */
> + PGSTAT_KIND_SUBSCRIPTION, /* per-subscription statistics */

> As the existing function only handles functions and relations why not just
> perform a specific Kind check for them?  Generalizing to assert on whether
> or not the function works on fixed or variable Kinds seems beyond its
> present state.  Or could it be used, as-is, for databases, replication
> slots, and subscriptions today, and we just haven't migrated those areas to
> use the now generalized function?

It couldn't quite be used for those, because it really only makes sense for
objects "within a database", because it wants to reset the timestamp of the
pg_stat_database row too (I don't like that behaviour as-is, but that's the
topic of another thread as you know...).

It will work for other per-database stats though, once we have them.


> Even then, unless we do expand the
> definition of this publicly facing function, it seems better to
> precisely define what it requires as an input Kind by checking for RELATION
> or FUNCTION specifically.

I don't see a benefit in adding a restriction on it that we'd just have to
lift again?

How about adding a
Assert(!pgstat_kind_info_for(kind)->accessed_across_databases)

and extending the function comment to say that it's used for per-database
stats and that it resets both the passed-in stats object as well as
pg_stat_database?
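
A self-contained sketch of what the pair of assertions would check. It uses a mocked-up kind table rather than the real pgstat.c structures: `Kind`, `KindInfo`, `kind_infos` and both functions are stand-ins, and the flag values are assumptions based on this thread, not copied from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Mock of the five variable-amount kinds discussed in this thread. */
typedef enum
{
    KIND_DATABASE,
    KIND_RELATION,
    KIND_FUNCTION,
    KIND_REPLSLOT,
    KIND_SUBSCRIPTION,
    KIND_COUNT
} Kind;

typedef struct
{
    bool        fixed_amount;               /* fixed number of entries? */
    bool        accessed_across_databases;  /* visible from every database? */
} KindInfo;

/* Assumed flag values, for illustration only. */
static const KindInfo kind_infos[KIND_COUNT] = {
    [KIND_DATABASE] = {false, true},
    [KIND_RELATION] = {false, false},
    [KIND_FUNCTION] = {false, false},
    [KIND_REPLSLOT] = {false, true},
    [KIND_SUBSCRIPTION] = {false, true},
};

/* True if both assertions would pass for this kind. */
static bool
reset_single_counter_allowed(Kind kind)
{
    const KindInfo *info = &kind_infos[kind];

    return !info->fixed_amount && !info->accessed_across_databases;
}

/* Mock of the guarded entry point: assert first, then do the reset. */
static void
reset_single_counter(Kind kind)
{
    assert(!kind_infos[kind].fixed_amount);
    assert(!kind_infos[kind].accessed_across_databases);
    /* ... reset the object's stats and the pg_stat_database timestamp ... */
}
```

With these assumed values only relations and functions pass both checks, which matches the two kinds the function is used for today.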

Greetings,

Andres Freund



Re: shared-memory based stats collector - v68

From
"David G. Johnston"
Date:
On Mon, Apr 4, 2022 at 2:54 PM Andres Freund <andres@anarazel.de> wrote:

> As the existing function only handles functions and relations why not just
> perform a specific Kind check for them?  Generalizing to assert on whether
> or not the function works on fixed or variable Kinds seems beyond its
> present state.  Or could it be used, as-is, for databases, replication
> slots, and subscriptions today, and we just haven't migrated those areas to
> use the now generalized function?

> It couldn't quite be used for those, because it really only makes sense for
> objects "within a database", because it wants to reset the timestamp of the
> pg_stat_database row too (I don't like that behaviour as-is, but that's the
> topic of another thread as you know...).
>
> It will work for other per-database stats though, once we have them.


> Even then, unless we do expand the
> definition of this publicly facing function, it seems better to
> precisely define what it requires as an input Kind by checking for RELATION
> or FUNCTION specifically.

> I don't see a benefit in adding a restriction on it that we'd just have to
> lift again?
>
> How about adding a
> Assert(!pgstat_kind_info_for(kind)->accessed_across_databases)
>
> and extending the function comment to say that it's used for per-database
> stats and that it resets both the passed-in stats object as well as
> pg_stat_database?


I could live with adding that, but...

Replacing the existing assert(!kind->fixed_amount) with assert(!kind->accessed_across_databases) produces the same result as the latter presently implies the former.

Now I start to dislike the behavioral aspect of the attribute and would rather just name it: kind->is_cluster_scoped (or something else that is descriptive of the stat category itself, not how it is used)

Then reorganize the Kind documentation to note and emphasize these two primary descriptors:
variable, which can be cluster or database scoped
fixed, which are cluster scoped by definition (if this is true...but given this is an optimization category I'm thinking maybe it doesn't actually matter...)

+ /* cluster-scoped object stats having a variable number of entries */
+ PGSTAT_KIND_REPLSLOT = 1, /* per-slot statistics */
+ PGSTAT_KIND_SUBSCRIPTION, /* per-subscription statistics */
+ PGSTAT_KIND_DATABASE, /* database-wide statistics */ (I moved this to 3rd spot to be closer to the database-scoped options)
+
+ /* database-scoped object stats having a variable number of entries */
+ PGSTAT_KIND_RELATION, /* per-table statistics */
+ PGSTAT_KIND_FUNCTION, /* per-function statistics */
+
+ /* cluster-scoped stats having a fixed number of entries */ (maybe these should go first, the variable following?)
+ PGSTAT_KIND_ARCHIVER,
+ PGSTAT_KIND_BGWRITER,
+ PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_SLRU,
+ PGSTAT_KIND_WAL,

David J.

Re: shared-memory based stats collector - v68

From
Andres Freund
Date:
Hi,

On 2022-04-04 15:24:24 -0700, David G. Johnston wrote:
> Replacing the existing assert(!kind->fixed_amount) with
> assert(!kind->accessed_across_databases) produces the same result as the
> latter presently implies the former.

I wasn't proposing to replace, but to add...


> Now I start to dislike the behavioral aspect of the attribute and would
> rather just name it: kind->is_cluster_scoped (or something else that is
> descriptive of the stat category itself, not how it is used)

I'm not in love with the name either. But cluster is just a badly overloaded
word :(.

system_wide? Or invert it and say: database_scoped? I think I like the latter.



> Then reorganize the Kind documentation to note and emphasize these two
> primary descriptors:
> variable, which can be cluster or database scoped
> fixed, which are cluster scoped by definition

Hm. There's not actually that much difference between cluster/non-cluster wide
scope for most of the system. I'm not strongly against, but I'm also not
really seeing the benefit.


> (if this is true...but given this is an optimization category I'm thinking
> maybe it doesn't actually matter...)

It is true. Not sure what you mean with "optimization category"?


Greetings,

Andres Freund



Re: shared-memory based stats collector - v67

From
Andres Freund
Date:
Hi,

On 2022-03-23 10:42:03 -0700, Andres Freund wrote:
> On 2022-03-23 17:27:50 +0900, Kyotaro Horiguchi wrote:
> > +    /*
> > +     * When starting with crash recovery, reset pgstat data - it might not be
> > +     * valid. Otherwise restore pgstat data. It's safe to do this here,
> > +     * because postmaster will not yet have started any other processes
> > +     *
> > +     * TODO: With a bit of extra work we could just start with a pgstat file
> > +     * associated with the checkpoint redo location we're starting from.
> > +     */
> > +    if (ControlFile->state == DB_SHUTDOWNED ||
> > +        ControlFile->state == DB_SHUTDOWNED_IN_RECOVERY)
> > +        pgstat_restore_stats();
> > +    else
> > +        pgstat_discard_stats();
> > +
> >
> > Before there, InitWalRecovery changes the state to
> > DB_IN_ARCHIVE_RECOVERY if it was either DB_SHUTDOWNED or
> > DB_IN_PRODUCTION. So the stat seems like always discarded on standby.
>
> Hm. I thought it worked at some point. I guess there's a reason this commit is
> a separate commit marked WIP ;)

FWIW, it had gotten broken by

commit be1c00ab13a7c2c9299d60cb5a9d285c40e2506c
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date:   2022-02-16 09:22:44 +0200

    Move code around in StartupXLOG().

because that moved the spot where
ControlFile->state = DB_IN_CRASH_RECOVERY
is set to an earlier location.

Greetings,

Andres Freund



Re: shared-memory based stats collector - v68

From
"David G. Johnston"
Date:
On Mon, Apr 4, 2022 at 3:44 PM Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2022-04-04 15:24:24 -0700, David G. Johnston wrote:
> Replacing the existing assert(!kind->fixed_amount) with
> assert(!kind->accessed_across_databases) produces the same result as the
> latter presently implies the former.
>
> I wasn't proposing to replace, but to add...

Right, but it seems redundant to have both when one implies the other.  But I'm not hard set against it either, though my idea below makes them both obsolete.

> Now I start to dislike the behavioral aspect of the attribute and would
> rather just name it: kind->is_cluster_scoped (or something else that is
> descriptive of the stat category itself, not how it is used)

> I'm not in love with the name either. But cluster is just a badly overloaded
> word :(.
>
> system_wide? Or invert it and say: database_scoped? I think I like the latter.


I like database_scoped as well...but see my idea below that makes this obsolete.

> Then reorganize the Kind documentation to note and emphasize these two
> primary descriptors:
> variable, which can be cluster or database scoped
> fixed, which are cluster scoped by definition

> Hm. There's not actually that much difference between cluster/non-cluster wide
> scope for most of the system. I'm not strongly against, but I'm also not
> really seeing the benefit.

Not married to it myself, something to come back to when the dust settles.


> (if this is true...but given this is an optimization category I'm thinking
> maybe it doesn't actually matter...)

> It is true. Not sure what you mean with "optimization category"?


I mean that distinguishing between stats that are fixed and those that are variable implies that fixed kinds have a better performance (speed, memory) characteristic than variable kinds (at least in part due to the presence of changecount).  If fixed kinds did not have a performance benefit then having the variable kind implementation simply handle fixed kinds as well (using the common struct header and storage in a hash table) would make the implementation simpler since all statistics would report through the same API.  In that world, variability is simply a possibility that not every actual reporter has to use.  That improved performance characteristic is what I meant by "optimization category".  I question whether we should be publishing "fixed" and "variable" as concrete properties.  I'm not presently against the current choice to do so, but as you say above, I'm also not really seeing the benefit.

(goes and looks at all the places that use the fixed_amount field...sparking an idea)

Coming back to this:
"""
+ /* cluster-scoped object stats having a variable number of entries */
+ PGSTAT_KIND_REPLSLOT = 1, /* per-slot statistics */
+ PGSTAT_KIND_SUBSCRIPTION, /* per-subscription statistics */
+ PGSTAT_KIND_DATABASE, /* database-wide statistics */ (I moved this to 3rd spot to be closer to the database-scoped options)
+
+ /* database-scoped object stats having a variable number of entries */
+ PGSTAT_KIND_RELATION, /* per-table statistics */
+ PGSTAT_KIND_FUNCTION, /* per-function statistics */
+
+ /* cluster-scoped stats having a fixed number of entries */ (maybe these should go first, the variable following?)
+ PGSTAT_KIND_ARCHIVER,
+ PGSTAT_KIND_BGWRITER,
+ PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_SLRU,
+ PGSTAT_KIND_WAL,
"""

I see three "KIND_GROUP" categories here:
PGSTAT_KIND_CLUSTER (open to a different word here though...)
PGSTAT_KIND_DATABASE (we seem to agree on this above)
PGSTAT_KIND_GLOBAL (already used in the code)

This single enum can replace the two booleans that, in combination, would define 4 unique groups (of which only three are interesting - database+fixed doesn't seem interesting and so is not given a name/value here).

While the succinctness of the booleans has appeal the need for half of the booleans to end up being negated quickly tarnishes it.  With the three groups, every assertion is positive in nature indicating which of the three groups are handled by the function.  While that is probably a few more characters it seems like an easier read and is less complicated as it has fewer independent parts.  At most you OR two kinds together which is succinct enough I would think.  There are no gaps relative to the existing implementation that defines fixed_amount and accessed_across_databases - every call site using either of them can be transformed mechanically.
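
To illustrate the mechanical transformation, here is a small sketch of how the two booleans would collapse into the three groups. The `KindGroup` enum, its `GROUP_*` names and `kind_group_from_flags()` are made up for this example, and it assumes (as stated earlier in this thread) that every fixed-amount kind is also accessed across databases.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical three-way grouping replacing the two booleans. */
typedef enum
{
    GROUP_GLOBAL,               /* fixed number of entries, cluster scoped */
    GROUP_CLUSTER,              /* variable number of entries, cluster scoped */
    GROUP_DATABASE              /* variable number of entries, database scoped */
} KindGroup;

/* Map the existing pair of flags onto a single group value. */
static KindGroup
kind_group_from_flags(bool fixed_amount, bool accessed_across_databases)
{
    if (fixed_amount)
    {
        /* The fourth combination (fixed + database scoped) is not modeled. */
        assert(accessed_across_databases);
        return GROUP_GLOBAL;
    }

    return accessed_across_databases ? GROUP_CLUSTER : GROUP_DATABASE;
}
```

A check such as `!fixed_amount` then becomes "group is CLUSTER or DATABASE", and `!accessed_across_databases` becomes "group is DATABASE".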

David J.

Re: shared-memory based stats collector - v68

From
Andres Freund
Date:
Hi,

On 2022-04-04 19:03:13 -0700, David G. Johnston wrote:
> > > (if this is true...but given this is an optimization category I'm
> > thinking
> > > maybe it doesn't actually matter...)
> >
> > It is true. Not sure what you mean with "optimization category"?
> >
> >
> I mean that distinguishing between stats that are fixed and those that are
> variable implies that fixed kinds have a better performance (speed, memory)
> characteristic than variable kinds (at least in part due to the presence of
> changecount).  If fixed kinds did not have a performance benefit then
> having the variable kind implementation simply handle fixed kinds as well
> (using the common struct header and storage in a hash table) would make the
> implementation simpler since all statistics would report through the same
> API.

Yes, fixed-numbered stats are faster.



> Coming back to this:
> """
> + /* cluster-scoped object stats having a variable number of entries */
> + PGSTAT_KIND_REPLSLOT = 1, /* per-slot statistics */
> + PGSTAT_KIND_SUBSCRIPTION, /* per-subscription statistics */
> + PGSTAT_KIND_DATABASE, /* database-wide statistics */ (I moved this to 3rd
> spot to be closer to the database-scoped options)
> +
> + /* database-scoped object stats having a variable number of entries */
> + PGSTAT_KIND_RELATION, /* per-table statistics */
> + PGSTAT_KIND_FUNCTION, /* per-function statistics */
> +
> + /* cluster-scoped stats having a fixed number of entries */ (maybe these
> should go first, the variable following?)
> + PGSTAT_KIND_ARCHIVER,
> + PGSTAT_KIND_BGWRITER,
> + PGSTAT_KIND_CHECKPOINTER,
> + PGSTAT_KIND_SLRU,
> + PGSTAT_KIND_WAL,
> """
> 
> I see three "KIND_GROUP" categories here:
> PGSTAT_KIND_CLUSTER (open to a different word here though...)
> PGSTAT_KIND_DATABASE (we seem to agree on this above)
> PGSTAT_KIND_GLOBAL (already used in the code)
> 
> This single enum can replace the two booleans that, in combination, would
> define 4 unique groups (of which only three are interesting -
> database+fixed doesn't seem interesting and so is not given a name/value
> here).

The more I think about it, the less I think a split like that makes sense. The
difference between PGSTAT_KIND_CLUSTER / PGSTAT_KIND_DATABASE is tiny. Nearly
all code just deals with both together.

I think all this is going to achieve is making the code more complicated. There
is a *single* non-assert use of accessed_across_databases and now a single
assertion involving it.

What would having PGSTAT_KIND_CLUSTER and PGSTAT_KIND_DATABASE achieve?

Greetings,

Andres Freund



Re: shared-memory based stats collector - v69

From
Andres Freund
Date:
Hi,

Thanks for the reviews Justin, Thomas, David. I tried to incorporate the
feedback, with the exception of the ongoing discussion around
accessed_across_databases. I've also not renamed pg_stat_exists_stat() yet,
not clear who likes what :)

Changes in v69:

- merged feedback
- committed the first few commits, mostly pretty boring stuff
- added an architecture overview comment to the top of pgstat.c - not sure if
  it makes sense to anybody but me (and perhaps Horiguchi-san)?
- merged "only reset pgstat data after crash recovery." into the main commit,
  added tests verifying the behaviour of not resetting stats on a standby when
  in SHUTDOWNED_IN_RECOVERY.
- drop variable-amount stats when loading on-disk file fails partway through,
  I'd raised this earlier in [1]
- made most pgstat_report_stat() calls pass force = true. In worker.c, the
  only possibly frequent caller, I instead added a pgstat_report_stat(true) to
  the idle path.
- added a handful more tests, but mostly out of "test coverage vanity" ;)
- made the test output of 030_stats_cleanup_replica a bit more informative,
  plus other minor cleanups


The one definite TODO I know of is
> - fix the bug around pgstat_report_stat() I wrote about at [3]
> [3] https://www.postgresql.org/message-id/20220402081648.kbapqdxi2rr3ha3w@alap3.anarazel.de

I'd hoped Horiguchi-san would chime in on that discussion...

Regards,

Andres


[1] https://www.postgresql.org/message-id/20220329191727.mzzwbl7udhpq7pmf%40alap3.anarazel.de

Attachments

Re: shared-memory based stats collector - v68

From
"David G. Johnston"
Date:
On Mon, Apr 4, 2022 at 7:36 PM Andres Freund <andres@anarazel.de> wrote:

> I think all this is going to achieve is making the code more complicated. There
> is a *single* non-assert use of accessed_across_databases and now a single
> assertion involving it.
>
> What would having PGSTAT_KIND_CLUSTER and PGSTAT_KIND_DATABASE achieve?

So, I decided to see what this would look like; the results are attached, portions of it also inlined below.

I'll admit this does introduce a terminology problem - but IMO these words are much more meaningful to the reader and code than the existing booleans.  I'm hopeful we can bikeshed something agreeable as I'm strongly in favor of making this change.

The ability to create defines for subsets nicely resolves the problem that CLUSTER and DATABASE (now OBJECT to avoid DATABASE conflict in PgStat_Kind) are generally related together - they are now grouped under the DYNAMIC label (variable, if you want) while all of the fixed entries get associated with GLOBAL.  Thus the majority of usages, since accessed_across_databases is rare, end up being either DYNAMIC or GLOBAL.  The presence of any other category should give one pause.  We could add an ALL define if we ever decide to consolidate the API - but for now it's largely used to ensure that stats of one type don't get processed by the other.  The boolean fixed does that well enough but this just seems much cleaner and more understandable to me.  Though having made up the terms and model myself, that isn't too surprising.

The only existing usage of accessed_across_databases is in the negative form, which translates to excluding objects, but only those from other actual databases.

@@ -909,7 +904,7 @@ pgstat_build_snapshot(void)
  */
  if (p->key.dboid != MyDatabaseId &&
  p->key.dboid != InvalidOid &&
- !kind_info->accessed_across_databases)
+ kind_info->kind_group == PGSTAT_OBJECT)
  continue;

The only other usage of something other than GLOBAL or DYNAMIC is the restriction on the behavior of reset_single_counter, which also has to be an object in the current database (the latter condition being enforced by the presence of a valid object oid, I presume).  The replacement for this below is not behavior-preserving, though I believe we agree the proposed behavior is correct.

@@ -652,7 +647,7 @@ pgstat_reset_single_counter(PgStat_Kind kind, Oid objoid)
 
- Assert(!pgstat_kind_info_for(kind)->fixed_amount);
+ Assert(pgstat_kind_info_for(kind)->kind_group == PGSTAT_OBJECT);
 
Everything else is a straight conversion of !fixed_amount to CLUSTER+OBJECT (i.e. DYNAMIC)

@@ -728,7 +723,7 @@ pgstat_fetch_entry(PgStat_Kind kind, Oid dboid, Oid objoid)
 
- AssertArg(!kind_info->fixed_amount);
+ AssertArg(kind_info->kind_group == PGSTAT_DYNAMIC);

and fixed_amount to GLOBAL

@@ -825,7 +820,7 @@ pgstat_get_stat_snapshot_timestamp(bool *have_snapshot)
 bool
 pgstat_exists_entry(PgStat_Kind kind, Oid dboid, Oid objoid)
 {
- if (pgstat_kind_info_for(kind)->fixed_amount)
+ if (pgstat_kind_info_for(kind)->kind_group == PGSTAT_GLOBAL)
  return true;
 
  return pgstat_get_entry_ref(kind, dboid, objoid, false, NULL) != NULL;

David J.

Attachments

Re: shared-memory based stats collector - v68

From
Andres Freund
Date:
Hi,

On 2022-04-05 08:49:36 -0700, David G. Johnston wrote:
> On Mon, Apr 4, 2022 at 7:36 PM Andres Freund <andres@anarazel.de> wrote:
> 
> >
> > I think all this is going to achieve is making the code more complicated.
> > There
> > is a *single* non-assert use of accessed_across_databases and now a single
> > assertion involving it.
> >
> > What would having PGSTAT_KIND_CLUSTER and PGSTAT_KIND_DATABASE achieve?
> >
> 
> So, I decided to see what this would look like; the results are attached,
> portions of it also inlined below.

> I'll admit this does introduce a terminology problem - but IMO these words
> are much more meaningful to the reader and code than the existing
> booleans.  I'm hopeful we can bikeshed something agreeable as I'm strongly
> in favor of making this change.

Sorry, I just don't agree. I'm happy to try to make it look better, but this
isn't it.

Do you think it should be your way strongly enough that you'd not want to get
it in the current way?



> The ability to create defines for subsets nicely resolves the problem that
> CLUSTER and DATABASE (now OBJECT to avoid DATABASE conflict in PgStat_Kind)
> are generally related together - they are now grouped under the DYNAMIC
> label (variable, if you want) while all of the fixed entries get associated
> with GLOBAL.  Thus the majority of usages, since accessed_across_databases
> is rare, end up being either DYNAMIC or GLOBAL.

FWIW, as-is DYNAMIC isn't correct:

> +typedef enum PgStat_KindGroup
> +{
> +    PGSTAT_GLOBAL = 1,
> +    PGSTAT_CLUSTER,
> +    PGSTAT_OBJECT
> +} PgStat_KindGroup;
> +
> +#define PGSTAT_DYNAMIC (PGSTAT_CLUSTER | PGSTAT_OBJECT)

ORing PGSTAT_CLUSTER = 2 with PGSTAT_OBJECT = 3 yields 3 again. To do this
kind of thing the different values need to have power-of-two values, and then
the tests need to be done with &.

Nicely demonstrated by the fact that with the patch applied initdb doesn't
pass...
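
A minimal standalone sketch of the difference. The `SEQ_*` names mirror the values in the posted enum; the `BIT_*` variant and both helper functions are hypothetical and only illustrate the power-of-two approach.

```c
#include <stdbool.h>

/* Values as posted: sequential, so they cannot be combined as flags. */
enum
{
    SEQ_GLOBAL = 1,
    SEQ_CLUSTER,                /* 2 */
    SEQ_OBJECT                  /* 3 */
};
#define SEQ_DYNAMIC (SEQ_CLUSTER | SEQ_OBJECT)  /* 2 | 3 == 3 == SEQ_OBJECT */

/* Hypothetical fix: one bit per group. */
enum
{
    BIT_GLOBAL = 1 << 0,
    BIT_CLUSTER = 1 << 1,
    BIT_OBJECT = 1 << 2
};
#define BIT_DYNAMIC (BIT_CLUSTER | BIT_OBJECT)  /* 0x6 */

/* The style of test used in the posted patch: equality with the mask. */
static bool
seq_is_dynamic(int group)
{
    return group == SEQ_DYNAMIC;
}

/* Membership test that works once each group owns its own bit. */
static bool
bit_is_dynamic(int group)
{
    return (group & BIT_DYNAMIC) != 0;
}
```

With the sequential values, seq_is_dynamic(SEQ_CLUSTER) is false even though CLUSTER is meant to be part of DYNAMIC. Switching only the test to & would not help either, since SEQ_GLOBAL & SEQ_DYNAMIC is 1; both the values and the tests have to change.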


> @@ -909,7 +904,7 @@ pgstat_build_snapshot(void)
>           */
>          if (p->key.dboid != MyDatabaseId &&
>              p->key.dboid != InvalidOid &&
> -            !kind_info->accessed_across_databases)
> +            kind_info->kind_group == PGSTAT_OBJECT)
>              continue;
>  
>          if (p->dropped)

Imo this is far harder to interpret - !kind_info->accessed_across_databases
tells you why we're skipping in clear code. Your alternative doesn't.


> @@ -938,7 +933,7 @@ pgstat_build_snapshot(void)
>      {
>          const PgStat_KindInfo *kind_info = pgstat_kind_info_for(kind);
>  
> -        if (!kind_info->fixed_amount)
> +        if (kind_info->kind_group == PGSTAT_DYNAMIC)

These all would have to be kind_info->kind_group & PGSTAT_DYNAMIC, or even
(kind_group & PGSTAT_DYNAMIC) != 0, depending on the case.


> @@ -1047,8 +1042,8 @@ pgstat_delete_pending_entry(PgStat_EntryRef *entry_ref)
>      void       *pending_data = entry_ref->pending;
>  
>      Assert(pending_data != NULL);
> -    /* !fixed_amount stats should be handled explicitly */
> -    Assert(!pgstat_kind_info_for(kind)->fixed_amount);
> +    /* global stats should be handled explicitly : why?*/
> +    Assert(pgstat_kind_info_for(kind)->kind_group == PGSTAT_DYNAMIC);

The pending data infrastructure doesn't provide a way of dealing with fixed
amount stats, and there's no PgStat_EntryRef for them (since they're not in
the hashtable).


Greetings,

Andres Freund



Re: shared-memory based stats collector - v68

From
"David G. Johnston"
Date:


On Tuesday, April 5, 2022, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2022-04-05 08:49:36 -0700, David G. Johnston wrote:
> On Mon, Apr 4, 2022 at 7:36 PM Andres Freund <andres@anarazel.de> wrote:
>
> >
> > I think all this is going to achieve is making the code more complicated.
> > There
> > is a *single* non-assert use of accessed_across_databases and now a single
> > assertion involving it.
> >
> > What would having PGSTAT_KIND_CLUSTER and PGSTAT_KIND_DATABASE achieve?
> >
>
> So, I decided to see what this would look like; the results are attached,
> portions of it also inlined below.

> I'll admit this does introduce a terminology problem - but IMO these words
> are much more meaningful to the reader and code than the existing
> booleans.  I'm hopeful we can bikeshed something agreeable as I'm strongly
> in favor of making this change.

> Sorry, I just don't agree. I'm happy to try to make it look better, but this
> isn't it.
>
> Do you think it should be your way strongly enough that you'd not want to get
> it in the current way?


Not that strongly; I’m good with the code as-is.  It’s not pervasive enough to be hard to understand (I may ponder some code comments though) and the system it is modeling has some legacy aspects that are more the root problem and I don’t want to touch those here for sure.
 

> ORing PGSTAT_CLUSTER = 2 with PGSTAT_OBJECT = 3 yields 3 again. To do this
> kind of thing the different values need to have power-of-two values, and then
> the tests need to be done with &.

Thanks.
 

Nicely demonstrated by the fact that with the patch applied initdb doesn't
pass...


Yeah, I compiled but tried to run the tests and learned I still need to figure out my setup for make check; then I forgot to make install…

It served its purpose at least.

 

> @@ -1047,8 +1042,8 @@ pgstat_delete_pending_entry(PgStat_EntryRef *entry_ref)
>       void       *pending_data = entry_ref->pending;

>       Assert(pending_data != NULL);
> -     /* !fixed_amount stats should be handled explicitly */
> -     Assert(!pgstat_kind_info_for(kind)->fixed_amount);
> +     /* global stats should be handled explicitly : why?*/
> +     Assert(pgstat_kind_info_for(kind)->kind_group == PGSTAT_DYNAMIC);

The pending data infrastructure doesn't provide a way of dealing with fixed
amount stats, and there's no PgStat_EntryRef for them (since they're not in
the hashtable).


Thanks.

David J.
 

Re: shared-memory based stats collector - v66

From
Andres Freund
Date:
Hi,

On 2022-04-02 01:16:48 -0700, Andres Freund wrote:
> I just noticed that the code doesn't appear to actually work like that right
> now. Whenever the timeout is reached, pgstat_report_stat() is called with
> force = true.
> 
> And even if the backend is busy running queries, once there's contention, the
> next invocation of pgstat_report_stat() will return the timeout relative to
> pending_since, which then will trigger a force report via a very short timeout
> soon.
> 
> It might actually make sense to only ever return PGSTAT_RETRY_MIN_INTERVAL
> (with a slightly different name) from pgstat_report_stat() when blocked
> (limiting the max reporting delay for an idle connection) and to continue
> calling pgstat_report_stat(force = true).  But to only trigger force
> "internally" in pgstat_report_stat() when PGSTAT_MAX_INTERVAL is reached.
> 
> I think that'd mean we'd report after max PGSTAT_RETRY_MIN_INTERVAL in an idle
> connection, and try reporting every PGSTAT_RETRY_MIN_INTERVAL (increasing up
> to PGSTAT_MAX_INTERVAL when blocked) on busy connections.
> 
> Makes sense?

I tried to come up with a workload producing a *lot* of stats (multiple
function calls within a transaction, multiple transactions pipelined) and ran
it with 1000 clients (on a machine with 2 x (10 cores / 20 threads)). To
reduce overhead I set
  default_transaction_isolation=repeatable read
  track_activities=false
MVCC Snapshot acquisition is the clear bottleneck otherwise, followed by
pgstat_report_activity() (which, as confusing as it may sound, is independent
of this patch).

I do see a *small* amount of contention if I lower PGSTAT_MIN_INTERVAL to
1ms. Too small to ever be captured in pg_stat_activity.wait_event, but just
about visible in a profiler.


Which leads me to conclude we can simplify the logic significantly. Here's my
current comment explaining the logic:

 * Unless called with 'force', pending stats updates are flushed once
 * per PGSTAT_MIN_INTERVAL (1000ms). When not forced, stats flushes do not
 * block on lock acquisition, except if stats updates have been pending for
 * longer than PGSTAT_MAX_INTERVAL (60000ms).
 *
 * Whenever pending stats updates remain at the end of pgstat_report_stat() a
 * suggested idle timeout is returned. Currently this is always
 * PGSTAT_IDLE_INTERVAL (10000ms). Callers can use the returned time to set up
 * a timeout after which to call pgstat_report_stat(true), but are not
 * required to do so.
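
As a rough sketch of the decision this comment describes (not the code from the patch): the helper names and the struct are hypothetical, and measuring the minimum interval from the last flush is an assumption. Only the three interval values come from the comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Interval values quoted in the comment, in milliseconds. */
#define PGSTAT_MIN_INTERVAL		1000
#define PGSTAT_MAX_INTERVAL		60000
#define PGSTAT_IDLE_INTERVAL	10000

/* Hypothetical result type: what a report call should do right now. */
typedef struct FlushDecision
{
	bool		flush;			/* attempt to flush pending stats? */
	bool		may_block;		/* allowed to wait on lock acquisition? */
} FlushDecision;

/*
 * Hypothetical helper. All timestamps are in milliseconds. Assumption: the
 * minimum interval is measured from the time of the last flush.
 */
static FlushDecision
decide_flush(bool force, int64_t now_ms, int64_t last_flush_ms,
			 int64_t pending_since_ms)
{
	FlushDecision d = {false, false};

	assert(now_ms >= pending_since_ms);

	if (force)
	{
		/* forced reports always flush and may wait for locks */
		d.flush = true;
		d.may_block = true;
		return d;
	}

	/* not forced: at most one flush per PGSTAT_MIN_INTERVAL */
	if (now_ms - last_flush_ms < PGSTAT_MIN_INTERVAL)
		return d;

	d.flush = true;
	/* only block once updates have been pending for too long */
	d.may_block = (now_ms - pending_since_ms) > PGSTAT_MAX_INTERVAL;
	return d;
}

/* Suggested idle timeout if stats remain pending afterwards, else 0. */
static int64_t
suggested_idle_timeout(bool still_pending)
{
	return still_pending ? PGSTAT_IDLE_INTERVAL : 0;
}
```

A caller would arm a timer with the returned timeout and issue a forced report when it fires, but per the comment it is free not to.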

Comments?

Greetings,

Andres Freund



Re: shared-memory based stats collector - v69

From
"David G. Johnston"
Date:
On Mon, Apr 4, 2022 at 8:05 PM Andres Freund <andres@anarazel.de> wrote:
- added an architecture overview comment to the top of pgstat.c - not sure if
  it makes sense to anybody but me (and perhaps Horiguchi-san)?


I took a look at this, diff attached.  Some typos and minor style stuff, plus trying to bring a bit more detail to the caching mechanism.  I may have gotten it wrong in adding more detail though.

+ * read-only, backend-local, transaction-scoped, hashtable (pgStatEntryRefHash)
+ * in front of the shared hashtable, containing references (PgStat_EntryRef)
+ * to shared hashtable entries. The shared hashtable thus only needs to be
+ * accessed when the PgStat_HashKey is not present in the backend-local hashtable,
+ * or if stats_fetch_consistency = 'none'.

I'm under the impression, but didn't try to confirm, that the pending updates don't use the caching mechanism, but rather add to the shared queue, and so the cache is effectively read-only.  It is also transaction-scoped based upon the GUC and the nature of stats vis-a-vis transactions.

Even before I added the read-only and transaction-scoped I got a bit hung up on reading:
"The shared hashtable only needs to be accessed when no prior reference to the shared hashtable exists."

Thinking in terms of key seems to make more sense than value in this sentence - even if there is a one-to-one correspondence.

The code comment about having per-kind definitions in pgstat.c being annoying is probably sufficient but it does seem like a valid comment to leave in the architecture as well.  Having them in both places seems OK.

I am wondering why there are no mentions to the header files in this architecture, only the .c files.

David J.

Attachments

Re: shared-memory based stats collector - v69

From
Andres Freund
Date:
Hi,

On 2022-04-05 13:51:12 -0700, David G. Johnston wrote:
> On Mon, Apr 4, 2022 at 8:05 PM Andres Freund <andres@anarazel.de> wrote:
> 
> > - added an architecture overview comment to the top of pgstat.c - not sure if
> >   it makes sense to anybody but me (and perhaps Horiguchi-san)?
> >
> >
> I took a look at this, diff attached.

Thanks!


> Some typos and minor style stuff,
> plus trying to bring a bit more detail to the caching mechanism.  I may
> have gotten it wrong in adding more detail though.
> 
> + * read-only, backend-local, transaction-scoped, hashtable (pgStatEntryRefHash)
> + * in front of the shared hashtable, containing references (PgStat_EntryRef)
> + * to shared hashtable entries. The shared hashtable thus only needs to be
> + * accessed when the PgStat_HashKey is not present in the backend-local hashtable,
> + * or if stats_fetch_consistency = 'none'.
> 
> I'm under the impression, but didn't try to confirm, that the pending
> updates don't use the caching mechanism

They do.


>, but rather add to the shared queue

Queue? Maybe you mean the hashtable?


>, and so the cache is effectively read-only.  It is also transaction-scoped
>based upon the GUC and the nature of stats vis-a-vis transactions.

No, that's not right. I think you might be thinking of
pgStatLocal.snapshot.stats?

I guess I should add a paragraph about snapshots / fetch consistency.


> Even before I added the read-only and transaction-scoped I got a bit hung
> up on reading:
> "The shared hashtable only needs to be accessed when no prior reference to
> the shared hashtable exists."

> Thinking in terms of key seems to make more sense than value in this
> sentence - even if there is a one-to-one correspondence.

Maybe "prior reference to the shared hashtable exists for the key"?
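
A toy sketch of the "reference cache in front of the shared hashtable" idea, keyed the way the rephrased sentence suggests. The types and function names are hypothetical stand-ins and do not mirror PgStat_EntryRef or the dshash API; arrays replace both hashtables to keep it short.

```c
#include <assert.h>
#include <stddef.h>

#define LOCAL_SLOTS 8

/* Stand-in for an entry living in the shared hashtable. */
typedef struct SharedEntry
{
	int			key;
	long		value;
} SharedEntry;

static SharedEntry shared_table[] = {{1, 10}, {2, 20}, {3, 30}};
static int	shared_lookups = 0;	/* counts accesses to the shared table */

static SharedEntry *
shared_lookup(int key)
{
	shared_lookups++;
	for (size_t i = 0; i < sizeof(shared_table) / sizeof(shared_table[0]); i++)
		if (shared_table[i].key == key)
			return &shared_table[i];
	return NULL;
}

/* Backend-local reference to a shared entry, found by the same key. */
typedef struct LocalRef
{
	int			key;
	SharedEntry *shared;
} LocalRef;

static LocalRef local_refs[LOCAL_SLOTS];
static int	n_local_refs = 0;

/* The shared table is only consulted when no local reference for key exists. */
static SharedEntry *
get_entry_ref(int key)
{
	SharedEntry *entry;

	for (int i = 0; i < n_local_refs; i++)
		if (local_refs[i].key == key)
			return local_refs[i].shared;

	entry = shared_lookup(key);
	if (entry != NULL)
	{
		assert(n_local_refs < LOCAL_SLOTS);
		local_refs[n_local_refs].key = key;
		local_refs[n_local_refs].shared = entry;
		n_local_refs++;
	}
	return entry;
}
```

The second lookup of the same key is served from the local array and leaves the shared lookup counter unchanged; a key that is missing everywhere goes to the shared table each time in this sketch.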


> I am wondering why there are no mentions to the header files in this
> architecture, only the .c files.

Hm, I guess, but I'm not sure it'd add a lot? It's really just intended to
give a starting point (and it can't be worse than explanation of the current
system).


> diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
> index bfbfe53deb..504f952c0e 100644
> --- a/src/backend/utils/activity/pgstat.c
> +++ b/src/backend/utils/activity/pgstat.c
> @@ -4,9 +4,9 @@
>   *
>   *
>   * PgStat_KindInfo describes the different types of statistics handled. Some
> - * kinds of statistics are collected for fixed number of objects
> - * (e.g. checkpointer statistics). Other kinds are statistics are collected
> - * for variable-numbered objects (e.g. relations).
> + * kinds of statistics are collected for a fixed number of objects
> + * (e.g., checkpointer statistics). Other kinds of statistics are collected

Was that comma after e.g. intentional?

Applied the rest.


> + * for a varying number of objects (e.g., relations).
>   * Fixed-numbered stats are stored in plain (non-dynamic) shared memory.
>   *
> @@ -19,19 +19,21 @@
>   *
>   * All variable-numbered stats are addressed by PgStat_HashKey while running.
>   * It is not possible to have statistics for an object that cannot be
> - * addressed that way at runtime. A wider identifier can be used when
> + * addressed that way at runtime. A alternate identifier can be used when
>   * serializing to disk (used for replication slot stats).

Not sure this improves things.



>   * The names for structs stored in shared memory are prefixed with
>   * PgStatShared instead of PgStat.
> @@ -53,15 +55,16 @@
>   * entry in pgstat_kind_infos, see PgStat_KindInfo for details.
>   *
>   *
> - * To keep things manageable stats handling is split across several
> + * To keep things manageable, stats handling is split across several

Done.


>   * files. Infrastructure pieces are in:
> - * - pgstat.c - this file, to tie it all together
> + * - pgstat.c - this file, which ties everything together

I liked that :)


> - * Each statistics kind is handled in a dedicated file:
> + * Each statistics kind is handled in a dedicated file, though their structs
> + * are defined here for lack of better ideas.

-0.5

Greetings,

Andres Freund



Re: shared-memory based stats collector - v69

From
"David G. Johnston"
Date:
On Tue, Apr 5, 2022 at 2:23 PM Andres Freund <andres@anarazel.de> wrote:

On 2022-04-05 13:51:12 -0700, David G. Johnston wrote:

>, but rather add to the shared queue

Queue? Maybe you mean the hashtable?

Queue implemented by a list...?  Anyway, I think I mean this:

/*
 * List of PgStat_EntryRefs with unflushed pending stats.
 *
 * Newly pending entries should only ever be added to the end of the list,
 * otherwise pgstat_flush_pending_entries() might not see them immediately.
 */
static dlist_head pgStatPending = DLIST_STATIC_INIT(pgStatPending);
 


>, and so the cache is effectively read-only.  It is also transaction-scoped
>based upon the GUC and the nature of stats vis-a-vis transactions.

No, that's not right. I think you might be thinking of
pgStatLocal.snapshot.stats?

 
Probably...
 
I guess I should add a paragraph about snapshots / fetch consistency.

I apparently confused/combined the two concepts just now so that would help.

> Even before I added the read-only and transaction-scoped I got a bit hung
> up on reading:
> "The shared hashtable only needs to be accessed when no prior reference to
> the shared hashtable exists."

> Thinking in terms of key seems to make more sense than value in this
> sentence - even if there is a one-to-one correspondence.

Maybe "prior reference to the shared hashtable exists for the key"?

I specifically dislike having two mentions of the "shared hashtable" in the same sentence, so I tried to phrase the second half in terms of the local hashtable.


> I am wondering why there are no mentions to the header files in this
> architecture, only the .c files.

Hm, I guess, but I'm not sure it'd add a lot? It's really just intended to
give a starting point (and it can't be worse than explanation of the current
system).

No need to try to come up with something.  More curious if there was a general reason to avoid it before I looked to see if I felt anything in them seemed worth including from my perspective.


> diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
> index bfbfe53deb..504f952c0e 100644
> --- a/src/backend/utils/activity/pgstat.c
> +++ b/src/backend/utils/activity/pgstat.c
> @@ -4,9 +4,9 @@
>   *
>   *
>   * PgStat_KindInfo describes the different types of statistics handled. Some
> - * kinds of statistics are collected for fixed number of objects
> - * (e.g. checkpointer statistics). Other kinds are statistics are collected
> - * for variable-numbered objects (e.g. relations).
> + * kinds of statistics are collected for a fixed number of objects
> + * (e.g., checkpointer statistics). Other kinds of statistics are collected

Was that comma after e.g. intentional?

It is.  That is the style I was taught, and that we seem to adhere to in user-facing documentation.  Source code is a mixed bag with no enforcement, but while we are here...


> + * for a varying number of objects (e.g., relations).
>   * Fixed-numbered stats are stored in plain (non-dynamic) shared memory.

status-quo works for me too, and matches up with the desired labelling we are using here.

 
>   *
> @@ -19,19 +19,21 @@
>   *
>   * All variable-numbered stats are addressed by PgStat_HashKey while running.
>   * It is not possible to have statistics for an object that cannot be
> - * addressed that way at runtime. A wider identifier can be used when
> + * addressed that way at runtime. A alternate identifier can be used when
>   * serializing to disk (used for replication slot stats).

Not sure this improves things.


It just seems odd that width is being mentioned when the actual struct is a combination of three subcomponents.  I do feel I'd need to understand exactly what replication slot stats are doing uniquely here, though, to make any point beyond that.



> - * Each statistics kind is handled in a dedicated file:
> + * Each statistics kind is handled in a dedicated file, though their structs
> + * are defined here for lack of better ideas.

-0.5


Status-quo works for me.  Food for thought for other reviewers though.

David J.

Re: shared-memory based stats collector - v69

From
Andres Freund
Date:
On 2022-04-05 14:43:49 -0700, David G. Johnston wrote:
> On Tue, Apr 5, 2022 at 2:23 PM Andres Freund <andres@anarazel.de> wrote:
> 
> >
> > On 2022-04-05 13:51:12 -0700, David G. Johnston wrote:
> >
> > >, but rather add to the shared queue
> >
> > Queue? Maybe you mean the hashtable?
> >
> 
> Queue implemented by a list...?  Anyway, I think I mean this:

> /*
>  * List of PgStat_EntryRefs with unflushed pending stats.
>  *
>  * Newly pending entries should only ever be added to the end of the list,
>  * otherwise pgstat_flush_pending_entries() might not see them immediately.
>  */
> static dlist_head pgStatPending = DLIST_STATIC_INIT(pgStatPending);

That's not in shared memory, but backend local...


> > >, and so the cache is effectively read-only.  It is also transaction-scoped
> > >based upon the GUC and the nature of stats vis-a-vis transactions.
> >
> > No, that's not right. I think you might be thinking of
> > pgStatLocal.snapshot.stats?
> >
> >
> Probably...
> 
> 
> > I guess I should add a paragraph about snapshots / fetch consistency.
> >
> 
> I apparently confused/combined the two concepts just now so that would help.

Will add.


> >
> > > Even before I added the read-only and transaction-scoped I got a bit hung
> > > up on reading:
> > > "The shared hashtable only needs to be accessed when no prior reference to
> > > the shared hashtable exists."
> >
> > > Thinking in terms of key seems to make more sense than value in this
> > > sentence - even if there is a one-to-one correspondence.
> >
> > Maybe "prior reference to the shared hashtable exists for the key"?
> >
> 
> I specifically dislike having two mentions of the "shared hashtable" in the
> same sentence, so I tried to phrase the second half in terms of the local
> hashtable.

You left two mentions of "shared hashtable" in the sentence prior though
:). I'll try to rephrase. But it's not the end if this isn't the most elegant
prose...


> > Was that comma after e.g. intentional?
> >
> 
> It is.  That is the style I was taught, and that we seem to adhere to in
> user-facing documentation.  Source code is a mixed bag with no enforcement,
> but while we are here...

Looks a bit odd to me. But I guess I'll add it then...


> > >   *
> > > @@ -19,19 +19,21 @@
> > >   *
> > >   * All variable-numbered stats are addressed by PgStat_HashKey while running.
> > >   * It is not possible to have statistics for an object that cannot be
> > > - * addressed that way at runtime. A wider identifier can be used when
> > > + * addressed that way at runtime. A alternate identifier can be used when
> > >   * serializing to disk (used for replication slot stats).
> >
> > Not sure this improves things.
> >
> >
> It just seems odd that width is being mentioned when the actual struct is a
> combination of three subcomponents.  I do feel I'd need to understand
> exactly what replication slot stats are doing uniquely here, though, to
> make any point beyond that.

There's no real numeric identifier for replication slot stats. So I'm using
the "index" used in slot.c while running. But that can change during
start/stop.

Greetings,

Andres Freund



Re: shared-memory based stats collector - v70

From
Andres Freund
Date:
Hi,

Here comes v70:
- extended / polished the architecture comment based on feedback from Melanie
  and David
- other polishing as suggested by David
- addressed the open issue around pgstat_report_stat(), as described in
  https://www.postgresql.org/message-id/20220405204019.6yj7ocmpw352c2u5%40alap3.anarazel.de
- while working on the above point, I noticed that hash_bytes() showed up
  noticeably in profiles, so I replaced it with a fixed-width function
- found a few potential regression test instabilities by either *always*
  flushing in pgstat_report_stat(), or only flushing when force = true.
- random minor improvements
- reordered commits some

I still haven't renamed pg_stat_exists_stat() yet - I'm leaning towards
pg_stat_have_stats() or pg_stat_exists() right now. But it's an SQL function
for testing, so it doesn't really matter.

I think this is basically ready, minus a few comment adjustments here and
there. Unless somebody protests I'm planning to start pushing things tomorrow
morning.

It'll be a few hours to get to the main commit - but except for 0001 it
doesn't make sense to push without intending to push later changes too. I
might squash a few commits together.

There's lots that can be done once all this is in place, both simplifying
pre-existing code and easy new features, but that's for a later release.

Greetings,

Andres Freund

Attachments

Re: shared-memory based stats collector - v69

From
"David G. Johnston"
Date:
On Tue, Apr 5, 2022 at 4:16 PM Andres Freund <andres@anarazel.de> wrote:
On 2022-04-05 14:43:49 -0700, David G. Johnston wrote:
> On Tue, Apr 5, 2022 at 2:23 PM Andres Freund <andres@anarazel.de> wrote:

>
> > I guess I should add a paragraph about snapshots / fetch consistency.
> >
>
> I apparently confused/combined the two concepts just now so that would help.

Will add.


Thank you.

On a slightly different track, I took the time to write-up a "Purpose" section for pgstat.c :

It may possibly be duplicating some things written elsewhere as I didn't go looking for similar prior art yet, I just wanted to get thoughts down.  This is the kind of preliminary framing I've been constructing in my own mind as I try to absorb this patch.  I haven't formed an opinion whether the actual user-facing documentation should cover some or all of this instead of the preamble to pgstat.c (which could just point to the docs for prerequisite reading).

David J.

 * Purpose:

 * The PgStat namespace defines an API that facilitates concurrent access
 * to a shared memory region where cumulative statistical data is saved.
 * At shutdown, one of the running system workers will initiate the writing
 * of the data to file. Then, during startup (following a clean shutdown) the
 * Postmaster process will early on ensure that the file is loaded into memory.
 *
 * Each cumulative statistic producing system must construct a PgStat_Kind
 * datum in this file. The details are described elsewhere, but of
 * particular importance is that each kind is classified as having either a
 * fixed number of objects that it tracks, or a variable number.
 *
 * During normal operations, the different consumers of the API will have their
 * access managed by the API; the protocol used is determined based upon whether
 * the statistical kind is fixed-numbered or variable-numbered.
 * Readers of variable-numbered statistics will have the option to locally
 * cache the data, while writers may have their updates locally queued
 * and applied in a batch. Thus favoring speed over freshness.
 * The fixed-numbered statistics are faster to process and thus forgo
 * these mechanisms in favor of a light-weight lock.
 *
 * Cumulative in this context means that processes must, for numeric data, send
 * a delta (or change) value via the API which will then be added to the
 * stored value in memory. The system does not track individual changes, only
 * their net effect. Additionally, both due to unclean shutdown or user request,
 * statistics can be reset - meaning that their stored numeric values are returned
 * to zero, and any non-numeric data that may be tracked (say a timestamp) is cleared.


Re: shared-memory based stats collector - v70

From
"David G. Johnston"
Date:
On Tue, Apr 5, 2022 at 8:00 PM Andres Freund <andres@anarazel.de> wrote:

Here comes v70:

I think this is basically ready, minus a few comment adjustments here and
there. Unless somebody protests I'm planning to start pushing things tomorrow
morning.


Nothing I've come across, given my area of experience, gives me pause.  I'm mostly going to focus on docs and comments at this point - to try and help the next person in my position (and end-users) have an easier go at on-boarding.  Toward that end, I did just add a "Purpose" section writeup to the v69 thread.

David J.

Re: shared-memory based stats collector - v70

From
Greg Stark
Date:
I've never tried to review a 24-patch series before. It's kind of
intimidating.... Is there a good place to start to get a good idea of
the most important changes?



Re: shared-memory based stats collector - v69

From
Andres Freund
Date:
Hi,

On 2022-04-05 20:00:50 -0700, David G. Johnston wrote:
> On Tue, Apr 5, 2022 at 4:16 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2022-04-05 14:43:49 -0700, David G. Johnston wrote:
> > > On Tue, Apr 5, 2022 at 2:23 PM Andres Freund <andres@anarazel.de> wrote:
> > > > I guess I should add a paragraph about snapshots / fetch consistency.
> > > >
> > >
> > > > I apparently confused/combined the two concepts just now so that would help.
> >
> > Will add.

I at least tried...


> On a slightly different track, I took the time to write-up a "Purpose"
> section for pgstat.c :
> 
> It may possibly be duplicating some things written elsewhere as I didn't go
> looking for similar prior art yet, I just wanted to get thoughts down.

There's very very little prior documentation in this area.


> This is the kind of preliminary framing I've been constructing in my own
> mind as I try to absorb this patch.  I haven't formed an opinion whether
> the actual user-facing documentation should cover some or all of this
> instead of the preamble to pgstat.c (which could just point to the docs for
> prerequisite reading).

>  * The PgStat namespace defines an API that facilitates concurrent access
>  * to a shared memory region where cumulative statistical data is saved.
>  * At shutdown, one of the running system workers will initiate the writing
>  * of the data to file. Then, during startup (following a clean shutdown) the
>  * Postmaster process will early on ensure that the file is loaded into memory.

I added something roughly along those lines in the version I just sent, based
on a suggestion by Melanie over IM:

 * Statistics are loaded from the filesystem during startup (by the startup
 * process), unless preceded by a crash, in which case all stats are
 * discarded. They are written out by the checkpointer process just before
 * shutting down, except when shutting down in immediate mode.



>  * Each cumulative statistic producing system must construct a PgStat_Kind
>  * datum in this file. The details are described elsewhere, but of
>  * particular importance is that each kind is classified as having either a
>  * fixed number of objects that it tracks, or a variable number.
>  *
>  * During normal operations, the different consumers of the API will have their
>  * access managed by the API; the protocol used is determined based upon whether
>  * the statistical kind is fixed-numbered or variable-numbered.
>  * Readers of variable-numbered statistics will have the option to locally
>  * cache the data, while writers may have their updates locally queued
>  * and applied in a batch. Thus favoring speed over freshness.
>  * The fixed-numbered statistics are faster to process and thus forgo
>  * these mechanisms in favor of a light-weight lock.

This feels a bit jumbled. Of course something using an API will be managed by
the API. I don't know what protocol really means?


> Additionally, both due to unclean shutdown or user request,
>  * statistics can be reset - meaning that their stored numeric values are returned
>  * to zero, and any non-numeric data that may be tracked (say a timestamp) is cleared.

I think this is basically covered in the above?

Greetings,

Andres Freund



Re: shared-memory based stats collector - v70

From
"David G. Johnston"
Date:
On Tue, Apr 5, 2022 at 8:11 PM Greg Stark <stark@mit.edu> wrote:
I've never tried to review a 24-patch series before. It's kind of
intimidating.... Is there a good place to start to get a good idea of
the most important changes?

It isn't as bad as the number makes it sound - I just used "git am" to apply the patches to a branch and skimmed each commit separately.  Most of them are tests or other minor pieces.  The remaining few cover different aspects of the major commit and you can choose them based upon your experience and time.

David J.

Re: shared-memory based stats collector - v70

From
Andres Freund
Date:
Hi,

On 2022-04-05 23:11:07 -0400, Greg Stark wrote:
> I've never tried to review a 24-patch series before. It's kind of
> intimidating.... Is there a good place to start to get a good idea of
> the most important changes?

It was more at some point :). And believe me, I find this whole project
intimidating and exhausting. The stats collector is entangled in a lot of
places, and there was a lot of preparatory work to get to this point.

Most of the commits aren't really interesting, I broke them out to make the
"main commit" a bit smaller, because it's exhausting to look at a *huge*
single commit. I wish I could have broken it down more, but I didn't find a
good way.

The interesting commit is
v70-0010-pgstat-store-statistics-in-shared-memory.patch
which actually replaces the stats collector by storing stats in shared
memory. It contains a, now hopefully decent, overview of how things work at
the top of pgstat.c.

Greetings,

Andres Freund



Re: shared-memory based stats collector - v69

From
"David G. Johnston"
Date:
On Tue, Apr 5, 2022 at 8:14 PM Andres Freund <andres@anarazel.de> wrote:

On 2022-04-05 20:00:50 -0700, David G. Johnston wrote:

 * Statistics are loaded from the filesystem during startup (by the startup
 * process), unless preceded by a crash, in which case all stats are
 * discarded. They are written out by the checkpointer process just before
 * shutting down, except when shutting down in immediate mode.


Cool.  I was on the fence about the level of detail here, but mostly excluded mentioning the checkpointer 'cause I didn't want to research the correct answer tonight.

>  * Each cumulative statistic producing system must construct a PgStat_Kind
>  * datum in this file. The details are described elsewhere, but of
>  * particular importance is that each kind is classified as having either a
>  * fixed number of objects that it tracks, or a variable number.
>  *
>  * During normal operations, the different consumers of the API will have their
>  * access managed by the API; the protocol used is determined based upon whether
>  * the statistical kind is fixed-numbered or variable-numbered.
>  * Readers of variable-numbered statistics will have the option to locally
>  * cache the data, while writers may have their updates locally queued
>  * and applied in a batch. Thus favoring speed over freshness.
>  * The fixed-numbered statistics are faster to process and thus forgo
>  * these mechanisms in favor of a light-weight lock.

This feels a bit jumbled.

I had that inkling as well.  First draft and I needed to stop at some point.  It didn't seem bad or wrong at least.

Of course something using an API will be managed by
the API. I don't know what protocol really means?


Procedure, process, algorithm are synonyms.  Procedure probably makes more sense here since it is a procedural language we are using.  I thought of algorithm while writing this but it carried too much technical baggage for me (compression, encryption, etc..) that this didn't seem to fit in with.

> Additionally, both due to unclean shutdown or user request,
>  * statistics can be reset - meaning that their stored numeric values are returned
>  * to zero, and any non-numeric data that may be tracked (say a timestamp) is cleared.

I think this is basically covered in the above?


Yes and no.  The first paragraph says they are forced to reset due to system error.  This paragraph basically says that resetting this kind of statistic is an acceptable, and even expected, thing to do.  And in fact can also be done intentionally and not only due to system error.  I am pondering whether to mention this dynamic first and/or better blend it in - but the minor repetition in the different contexts seems ok.

David J.

Re: shared-memory based stats collector - v70

From
John Naylor
Date:
On Wed, Apr 6, 2022 at 10:00 AM Andres Freund <andres@anarazel.de> wrote:
> - while working on the above point, I noticed that hash_bytes() showed up
>   noticeably in profiles, so I replaced it with a fixed-width function

I'm curious about this -- could you direct me to which patch introduces this?

-- 
John Naylor
EDB: http://www.enterprisedb.com



Re: shared-memory based stats collector - v70

From
Alvaro Herrera
Date:
Just skimming a bit here ...

On 2022-Apr-05, Andres Freund wrote:

> From 0532b869033595202d5797b148f22c61e4eb4969 Mon Sep 17 00:00:00 2001
> From: Andres Freund <andres@anarazel.de>
> Date: Mon, 4 Apr 2022 16:53:16 -0700
> Subject: [PATCH v70 10/27] pgstat: store statistics in shared memory.

> +      <entry><literal>PgStatsData</literal></entry>
> +      <entry>Waiting fo shared memory stats data access</entry>
> +     </row>

Typo "fo" -> "for"

> @@ -5302,7 +5317,9 @@ StartupXLOG(void)
>          performedWalRecovery = true;
>      }
>      else
> +    {
>          performedWalRecovery = false;
> +    }

Why? :-)

Why give pgstat_get_entry_ref the responsibility of initializing
created_entry to false?  The vast majority of callers don't care about
that flag; it seems easier/cleaner to set it to false in
pgstat_init_function_usage (the only caller that cares that I could
find) before calling pgstat_prep_pending_entry.

(I suggest pgstat_prep_pending_entry should have a comment line stating
"*created_entry, if not NULL, is set true if the entry required to be
created.", same as pgstat_get_entry_ref.)

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"How amazing is that? I call it a night and come back to find that a bug has
been identified and patched while I sleep."                (Robert Davidson)
               http://archives.postgresql.org/pgsql-sql/2006-03/msg00378.php



Re: shared-memory based stats collector - v70

From
Andres Freund
Date:
Hi,

On 2022-04-06 13:31:31 +0200, Alvaro Herrera wrote:
> Just skimming a bit here ...

Thanks!


> On 2022-Apr-05, Andres Freund wrote:
> 
> > From 0532b869033595202d5797b148f22c61e4eb4969 Mon Sep 17 00:00:00 2001
> > From: Andres Freund <andres@anarazel.de>
> > Date: Mon, 4 Apr 2022 16:53:16 -0700
> > Subject: [PATCH v70 10/27] pgstat: store statistics in shared memory.
> 
> > +      <entry><literal>PgStatsData</literal></entry>
> > +      <entry>Waiting fo shared memory stats data access</entry>
> > +     </row>
> 
> Typo "fo" -> "for"

Oh, oops. I had fixed that in the wrong patch.


> > @@ -5302,7 +5317,9 @@ StartupXLOG(void)
> >          performedWalRecovery = true;
> >      }
> >      else
> > +    {
> >          performedWalRecovery = false;
> > +    }
> 
> Why? :-)

Damage from merging two commits yesterday. I'd left open where exactly we'd
reset stats, with the "main commit" implementing the current behaviour more
closely, and then a followup commit implementing something a bit
better. Nobody seemed to argue for keeping the behaviour 1:1, so I merged
them. Without removing the braces again :)


> Why give pgstat_get_entry_ref the responsibility of initializing
> created_entry to false?  The vast majority of callers don't care about
> that flag; it seems easier/cleaner to set it to false in
> pgstat_init_function_usage (the only caller that cares that I could
> find) before calling pgstat_prep_pending_entry.

It's annoying to have to initialize it, I agree. But I think it's bugprone for
the caller to know that it has to be pre-initialized to false.
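
The trade-off above can be illustrated with a small standalone sketch. This is not the pgstat code: `ToyTable`, `ToyEntry` and `toy_get_entry()` are hypothetical stand-ins. It only shows that when the callee always writes `*created`, a caller that forgot to pre-initialize the flag still gets a correct answer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_ENTRIES 8

typedef struct ToyEntry
{
    int         key;
    long        count;
} ToyEntry;

typedef struct ToyTable
{
    ToyEntry    entries[TOY_MAX_ENTRIES];
    size_t      nused;
} ToyTable;

/*
 * Look up "key", creating the entry if it does not exist yet.  Returns NULL
 * if the table is full.
 *
 * *created, if not NULL, is always written: false if the entry already
 * existed (or could not be created), true if it had to be created.
 */
static ToyEntry *
toy_get_entry(ToyTable *tab, int key, bool *created)
{
    assert(tab != NULL);

    /* The callee owns the initialization, callers cannot forget it. */
    if (created != NULL)
        *created = false;

    for (size_t i = 0; i < tab->nused; i++)
    {
        if (tab->entries[i].key == key)
            return &tab->entries[i];
    }

    if (tab->nused == TOY_MAX_ENTRIES)
        return NULL;

    tab->entries[tab->nused].key = key;
    tab->entries[tab->nused].count = 0;
    if (created != NULL)
        *created = true;

    return &tab->entries[tab->nused++];
}
```

Callers that do not care about the flag simply pass NULL, which costs them nothing.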


> (I suggest pgstat_prep_pending_entry should have a comment line stating
> "*created_entry, if not NULL, is set true if the entry required to be
> created.", same as pgstat_get_entry_ref.)

Added something along those lines.

Greetings,

Andres Freund



Re: shared-memory based stats collector - v70

From
Andres Freund
Date:
Hi,

On 2022-04-06 16:24:28 +0700, John Naylor wrote:
> On Wed, Apr 6, 2022 at 10:00 AM Andres Freund <andres@anarazel.de> wrote:
> > - while working on the above point, I noticed that hash_bytes() showed up
> >   noticeably in profiles, so I replaced it with a fixed-width function
> 
> I'm curious about this -- could you direct me to which patch introduces this?

Commit 0010, search for pgstat_hash_key_hash. For simplicity I'm including it
here inline:

/* helpers for dshash / simplehash hashtables */
static inline int
pgstat_hash_key_cmp(const void *a, const void *b, size_t size, void *arg)
{
    AssertArg(size == sizeof(PgStat_HashKey) && arg == NULL);
    return memcmp(a, b, sizeof(PgStat_HashKey));
}

static inline uint32
pgstat_hash_key_hash(const void *d, size_t size, void *arg)
{
    const PgStat_HashKey *key = (PgStat_HashKey *)d;
    uint32 hash;

    AssertArg(size == sizeof(PgStat_HashKey) && arg == NULL);

    hash = murmurhash32(key->kind);
    hash = hash_combine(hash, murmurhash32(key->dboid));
    hash = hash_combine(hash, murmurhash32(key->objoid));

    return hash;
}
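
For readers without the PostgreSQL headers at hand, the same idea can be sketched as a standalone program. The names `ToyHashKey`, `toy_mix32()` and `toy_combine()` are hypothetical; the mixing step follows the well-known 32-bit murmur3 finalizer and the combine step the common boost-style formula, and no claim is made that they match PostgreSQL's `murmurhash32()` and `hash_combine()` bit for bit. The point is that a key made of three 32-bit fields can be hashed with a fixed number of operations instead of a generic loop over bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Three 32-bit fields, so the struct has no padding and memcmp() is safe. */
typedef struct ToyHashKey
{
    uint32_t    kind;
    uint32_t    dboid;
    uint32_t    objoid;
} ToyHashKey;

/* 32-bit murmur3 finalizer: a cheap, invertible bit mixer. */
static inline uint32_t
toy_mix32(uint32_t h)
{
    h ^= h >> 16;
    h *= 0x85ebca6bU;
    h ^= h >> 13;
    h *= 0xc2b2ae35U;
    h ^= h >> 16;
    return h;
}

/* Fold hash value "b" into the running hash "a". */
static inline uint32_t
toy_combine(uint32_t a, uint32_t b)
{
    a ^= b + 0x9e3779b9U + (a << 6) + (a >> 2);
    return a;
}

/* Fixed-width hash: three mixes and two combines, no per-byte loop. */
static inline uint32_t
toy_hash_key(const void *d, size_t size)
{
    const ToyHashKey *key = (const ToyHashKey *) d;
    uint32_t    hash;

    assert(size == sizeof(ToyHashKey));

    hash = toy_mix32(key->kind);
    hash = toy_combine(hash, toy_mix32(key->dboid));
    hash = toy_combine(hash, toy_mix32(key->objoid));

    return hash;
}

/* Matching comparator: keys are equal iff all bytes are equal. */
static inline int
toy_key_cmp(const void *a, const void *b, size_t size)
{
    assert(size == sizeof(ToyHashKey));
    return memcmp(a, b, sizeof(ToyHashKey));
}
```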

Greetings,

Andres Freund



Re: shared-memory based stats collector - v70

From
Lukas Fittl
Date:
On Tue, Apr 5, 2022 at 8:00 PM Andres Freund <andres@anarazel.de> wrote:
> Here comes v70:

Some small nitpicks on the docs:

> From 13090823fc4c7fb94512110fb4d1b3e86fb312db Mon Sep 17 00:00:00 2001
> From: Andres Freund <andres@anarazel.de>
> Date: Sat, 2 Apr 2022 19:38:01 -0700
> Subject: [PATCH v70 14/27] pgstat: update docs.
> ...
> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
> -      These parameters control server-wide statistics collection features.
> -      When statistics collection is enabled, the data that is produced can be
> +      These parameters control server-wide cumulative statistics system.
> +      When enabled, the data that is collected can be

Missing "the" ("These parameters control the server-wide cumulative statistics system").

> diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> +   any of the accumulated statistics, acessed values are cached until the end

"acessed" => "accessed"

> +   <varname>stats_fetch_consistency</varname> can be set
> +   <literal>snapshot</literal>, at the price of increased memory usage for

Missing "to" ("can be set to <literal>snapshot</literal>")

> +   caching not-needed statistics data.  Conversely, if it's known that statistics

Double space between "data." and "Conversely" (not sure if that matters)

> +   current transaction's statistics snapshot or cached values (if any).  The

Double space between "(if any)." and "The" (not sure if that matters)

> +   next use of statistical information will cause a new snapshot to be built
> +   or accessed statistics to be cached.

I believe this should be an "and", not an "or". (next access builds both a new snapshot and caches accessed statistics)

Thanks,
Lukas

--
Lukas Fittl

Re: shared-memory based stats collector - v70

From
Andres Freund
Date:
Hi,

On 2022-04-06 12:14:35 -0700, Lukas Fittl wrote:
> On Tue, Apr 5, 2022 at 8:00 PM Andres Freund <andres@anarazel.de> wrote:
> 
> > Here comes v70:
> >
> 
> Some small nitpicks on the docs:

Thanks!

> > From 13090823fc4c7fb94512110fb4d1b3e86fb312db Mon Sep 17 00:00:00 2001
> > From: Andres Freund <andres@anarazel.de>
> > Date: Sat, 2 Apr 2022 19:38:01 -0700
> > Subject: [PATCH v70 14/27] pgstat: update docs.
> > ...
> > diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
> > -      These parameters control server-wide statistics collection
> features.
> > -      When statistics collection is enabled, the data that is produced
> can be
> > +      These parameters control server-wide cumulative statistics system.
> > +      When enabled, the data that is collected can be
> 
> Missing "the" ("These parameters control the server-wide cumulative
> statistics system").

> > diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> > +   any of the accumulated statistics, acessed values are cached until
> the end
> 
> "acessed" => "accessed"

> > +   <varname>stats_fetch_consistency</varname> can be set
> > +   <literal>snapshot</literal>, at the price of increased memory usage
> for
> 
> Missing "to" ("can be set to <literal>snapshot</literal>")

Fixed.

> > +   caching not-needed statistics data.  Conversely, if it's known that
> statistics
> 
> Double space between "data." and "Conversely" (not sure if that matters)
> > +   current transaction's statistics snapshot or cached values (if any).
> The
> 
> Double space between "(if any)." and "The" (not sure if that matters)

That's done pretty widely in the docs and comments.


> > +   next use of statistical information will cause a new snapshot to be
> built
> > +   or accessed statistics to be cached.
> 
> I believe this should be an "and", not an "or". (next access builds both a
> new snapshot and caches accessed statistics)

I *think* or is correct? The new snapshot is when stats_fetch_consistency =
snapshot, the cached is when stats_fetch_consistency = cache. Not sure how to
make that clearer without making it a lot longer. Suggestions?

Greetings,

Andres Freund



Re: shared-memory based stats collector - v70

From
Justin Pryzby
Date:
On Wed, Apr 06, 2022 at 12:27:34PM -0700, Andres Freund wrote:
> > > +   next use of statistical information will cause a new snapshot to be built
> > > +   or accessed statistics to be cached.
> > 
> > I believe this should be an "and", not an "or". (next access builds both a
> > new snapshot and caches accessed statistics)
> 
> I *think* or is correct? The new snapshot is when stats_fetch_consistency =
> snapshot, the cached is when stats_fetch_consistency = cache. Not sure how to
> make that clearer without making it a lot longer. Suggestions?

I think it's correct.  Maybe it's clearer to say:

+   next use of statistical information will (when in snapshot mode) cause a new snapshot to be built
+   or (when in cache mode) accessed statistics to be cached.
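
The difference between the modes discussed here can be sketched with a toy model. Everything below is hypothetical (`toy_shared`, `toy_fetch()`, `toy_clear_snapshot()`) and only mirrors the behaviour described in this thread: in cache mode only the accessed entry is copied on first access, in snapshot mode the first access copies everything, and with no consistency each read goes to the shared data.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NSTATS 4

typedef enum ToyFetchConsistency
{
    TOY_FETCH_NONE,             /* always read the shared value */
    TOY_FETCH_CACHE,            /* cache each entry when first accessed */
    TOY_FETCH_SNAPSHOT          /* copy all entries on first access */
} ToyFetchConsistency;

static long toy_shared[TOY_NSTATS];     /* stands in for shared memory */
static long toy_local[TOY_NSTATS];      /* backend-local copies */
static bool toy_local_valid[TOY_NSTATS];

/* Discard the local snapshot or cached values (if any). */
static void
toy_clear_snapshot(void)
{
    for (int i = 0; i < TOY_NSTATS; i++)
        toy_local_valid[i] = false;
}

static long
toy_fetch(ToyFetchConsistency mode, int idx)
{
    assert(idx >= 0 && idx < TOY_NSTATS);

    if (mode == TOY_FETCH_NONE)
        return toy_shared[idx];

    if (!toy_local_valid[idx])
    {
        if (mode == TOY_FETCH_SNAPSHOT)
        {
            /* Build a new snapshot: every entry is copied at once. */
            for (int i = 0; i < TOY_NSTATS; i++)
            {
                toy_local[i] = toy_shared[i];
                toy_local_valid[i] = true;
            }
        }
        else
        {
            /* Cache mode: only the accessed entry is copied. */
            toy_local[idx] = toy_shared[idx];
            toy_local_valid[idx] = true;
        }
    }

    return toy_local[idx];
}
```

In this model the extra memory cost of snapshot mode is visible directly: entries that are never read are still copied.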




Re: shared-memory based stats collector - v70

From
Lukas Fittl
Date:
On Wed, Apr 6, 2022 at 12:34 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> On Wed, Apr 06, 2022 at 12:27:34PM -0700, Andres Freund wrote:
> > > > +   next use of statistical information will cause a new snapshot to be built
> > > > +   or accessed statistics to be cached.
> > >
> > > I believe this should be an "and", not an "or". (next access builds both a
> > > new snapshot and caches accessed statistics)
> >
> > I *think* or is correct? The new snapshot is when stats_fetch_consistency =
> > snapshot, the cached is when stats_fetch_consistency = cache. Not sure how to
> > make that clearer without making it a lot longer. Suggestions?
>
> I think it's correct.  Maybe it's clearer to say:
>
> +   next use of statistical information will (when in snapshot mode) cause a new snapshot to be built
> +   or (when in cache mode) accessed statistics to be cached.

Ah, yes, that does clarify what was meant.

+1 to Justin's edit, or something like it.

Thanks,
Lukas

--
Lukas Fittl

Re: shared-memory based stats collector - v70

From
Andres Freund
Date:
Hi,

On 2022-04-05 20:00:08 -0700, Andres Freund wrote:
> It'll be a few hours to get to the main commit - but except for 0001 it
> doesn't make sense to push without intending to push later changes too. I
> might squash a few commits togther.

I just noticed an existing incoherency that I'm wondering about fixing as part
of 0007 "pgstat: prepare APIs used by pgstatfuncs for shared memory stats."

The SQL functions to reset function and relation stats
pg_stat_reset_single_table_counters() and
pg_stat_reset_single_function_counters() respectively both make use of
pgstat_reset_single_counter().

Note that the SQL function uses plural "counters" (which makes sense, it
resets all counters for that object), whereas the C function they call to
perform the reset uses singular.

Similarly, the pg_stat_reset_slru(), pg_stat_reset_replication_slot() and
pg_stat_reset_subscription_stats() SQL functions use
pgstat_reset_slru_counter(), pgstat_reset_replslot_counter() and
pgstat_reset_subscription_counter() to reset the stats for either one or all
SLRUs/slots.

This is relevant for the commit mentioned above because it separates the
functions to reset the stats for one slru / slot / sub from the function to
reset all slrus / slots / subs. Going with the existing naming I'd just named
them pgstat_reset_*_counters(). But that doesn't really make sense.


If it were just existing code I'd just not touch this for now. But because the
patch introduces further functions, I'd rather not introduce more weird
function names.

I'd go for
pgstat_reset_slru_counter() -> pgstat_reset_slru()
pgstat_reset_subscription_counter() -> pgstat_reset_subscription()
pgstat_reset_subscription_counters() -> pgstat_reset_all_subscriptions()
pgstat_reset_replslot_counter() -> pgstat_reset_replslot()
pgstat_reset_replslot_counters() -> pgstat_reset_all_replslots()

We could leave out the _all_ and just use plural too, but I think it's a bit
nicer with _all_ in there.

Not quite sure what to do with pgstat_reset_single_counter(). I'd either go
for the minimal pgstat_reset_single_counters() or pgstat_reset_one()?

Greetings,

Andres Freund



Re: shared-memory based stats collector - v70

From
"David G. Johnston"
Date:
On Wednesday, April 6, 2022, Andres Freund <andres@anarazel.de> wrote:


> I'd go for
> pgstat_reset_slru_counter() -> pgstat_reset_slru()
> pgstat_reset_subscription_counter() -> pgstat_reset_subscription()
> pgstat_reset_subscription_counters() -> pgstat_reset_all_subscriptions()
> pgstat_reset_replslot_counter() -> pgstat_reset_replslot()
> pgstat_reset_replslot_counters() -> pgstat_reset_all_replslots()

I like having the SQL function paired with a matching implementation in this scheme.

> We could leave out the _all_ and just use plural too, but I think it's a bit
> nicer with _all_ in there.

+1 to _all_

> Not quite sure what to do with pgstat_reset_single_counter(). I'd either go
> for the minimal pgstat_reset_single_counters() or pgstat_reset_one()?

Why not add both pgstat_reset_function() and pgstat_reset_table() (to keep the pairing) and they can call the renamed pgstat_reset_function_or_table() internally (since the function indeed handles both paths and we’ve yet to come up with a label to use instead of “function and table stats”)?

These are private functions right?

David J.


Re: shared-memory based stats collector - v70

From
Andres Freund
Date:
Hi,

On 2022-04-06 15:32:39 -0700, David G. Johnston wrote:
> On Wednesday, April 6, 2022, Andres Freund <andres@anarazel.de> wrote:
> 
> >
> >
> > I'd go for
> > pgstat_reset_slru_counter() -> pgstat_reset_slru()
> > pgstat_reset_subscription_counter() -> pgstat_reset_subscription()
> > pgstat_reset_subscription_counters() -> pgstat_reset_all_subscriptions()
> > pgstat_reset_replslot_counter() -> pgstat_reset_replslot()
> > pgstat_reset_replslot_counters() -> pgstat_reset_all_replslots()
> 
> 
> I like having the SQL function paired with a matching implementation in
> this scheme.

It would have gotten things closer than it was before imo. We can't just
rename the SQL functions, they're obviously exposed API.

I'd like to remove the NULL -> all behaviour, but that should be discussed
separately.


I've hacked up the above, but after doing so I think I found a cleaner
approach:

I've introduced:

pgstat_reset_of_kind(PgStat_Kind kind) which works for both fixed and variable
numbered stats. That allows removing pgstat_reset_subscription_counters(),
pgstat_reset_replslot_counters(), pgstat_reset_shared_counters().

pgstat_reset(PgStat_Kind kind, Oid dboid, Oid objoid), which removes the need
for pgstat_reset_subscription_counter(),
pgstat_reset_single_counter(). pgstat_reset_replslot() is still needed, to do
the name -> index lookup.

That imo makes a lot more sense than requiring each variable-amount kind to
have wrapper functions.
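
The shape of that API can be sketched with a toy model. All names below (`ToyKind`, `toy_stats`, `toy_reset()`, `toy_reset_of_kind()`, `toy_reset_replslot()`) are hypothetical and the storage is a plain array rather than the shared hash table; the sketch only shows why one generic reset keyed by (kind, dboid, objoid) plus one per-kind reset covers most cases, with a thin wrapper left for the kind that needs a name -> index lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum ToyKind
{
    TOY_KIND_RELATION,
    TOY_KIND_SUBSCRIPTION,
    TOY_KIND_REPLSLOT
} ToyKind;

typedef struct ToyStat
{
    ToyKind     kind;
    uint32_t    dboid;
    uint32_t    objoid;
    long        counter;
} ToyStat;

#define TOY_NSTATS 4

static ToyStat toy_stats[TOY_NSTATS] = {
    {TOY_KIND_RELATION, 5, 16384, 10},
    {TOY_KIND_SUBSCRIPTION, 0, 100, 20},
    {TOY_KIND_SUBSCRIPTION, 0, 101, 30},
    {TOY_KIND_REPLSLOT, 0, 0, 40},
};

/* Slot names in index order; the index doubles as objoid in this model. */
static const char *const toy_slot_names[] = {"slot_a"};

/* Reset the one entry identified by (kind, dboid, objoid). */
static void
toy_reset(ToyKind kind, uint32_t dboid, uint32_t objoid)
{
    for (size_t i = 0; i < TOY_NSTATS; i++)
    {
        ToyStat    *s = &toy_stats[i];

        if (s->kind == kind && s->dboid == dboid && s->objoid == objoid)
            s->counter = 0;
    }
}

/* Reset every entry of the given kind. */
static void
toy_reset_of_kind(ToyKind kind)
{
    for (size_t i = 0; i < TOY_NSTATS; i++)
    {
        if (toy_stats[i].kind == kind)
            toy_stats[i].counter = 0;
    }
}

/*
 * Wrapper for the one kind addressed by name: translate the name to an
 * index, then use the generic reset.  Returns 0 on success, -1 if unknown.
 */
static int
toy_reset_replslot(const char *name)
{
    size_t      nslots = sizeof(toy_slot_names) / sizeof(toy_slot_names[0]);

    assert(name != NULL);

    for (size_t i = 0; i < nslots; i++)
    {
        if (strcmp(toy_slot_names[i], name) == 0)
        {
            toy_reset(TOY_KIND_REPLSLOT, 0, (uint32_t) i);
            return 0;
        }
    }
    return -1;
}
```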


> > Not quite sure what to do with pgstat_reset_single_counter(). I'd either go
> > for the minimal pgstat_reset_single_counters() or pgstat_reset_one()?
> >
> 
> Why not add both pgstat_resert_function() and pgstat_reset_table() (to keep
> the pairing) and they can call the renamed pgstat_reset_function_or_table()
> internally (since the function indeed handle both paths and we’ve yet to
> come up with a label to use instead of “function and table stats”)?
> 
> These are private functions right?

What does "private" mean for you? They're exposed via pgstat.h not
pgstat_internal.h. But not to SQL.

Greetings,

Andres Freund



Re: shared-memory based stats collector - v70

From
"David G. Johnston"
Date:
On Wed, Apr 6, 2022 at 4:12 PM Andres Freund <andres@anarazel.de> wrote:

> On 2022-04-06 15:32:39 -0700, David G. Johnston wrote:
> > On Wednesday, April 6, 2022, Andres Freund <andres@anarazel.de> wrote:
> >
> >
> > I like having the SQL function paired with a matching implementation in
> > this scheme.
>
> It would have gotten things closer than it was before imo. We can't just
> rename the SQL functions, they're obviously exposed API.

Right, I meant the naming scheme proposed was acceptable.  Not that I wanted to change the SQL functions too.

> I've hacked up the above, but after doing so I think I found a cleaner
> approach:
>
> I've introduced:
>
> pgstat_reset_of_kind(PgStat_Kind kind) which works for both fixed and variable
> numbered stats. That allows to remove pgstat_reset_subscription_counters(),
> pgstat_reset_replslot_counters(), pgstat_reset_shared_counters().
>
> pgstat_reset(PgStat_Kind kind, Oid dboid, Oid objoid), which removes the need
> for pgstat_reset_subscription_counter(),
> pgstat_reset_single_counter().
> pgstat_reset_replslot() is still needed, to do
> the name -> index lookup.
>
> That imo makes a lot more sense than requiring each variable-amount kind to
> have wrapper functions.

I can see benefits of both, or even possibly combining them.  Absent being able to point to some other part of the system and say "it is done this way there, let's do the same here" I think the details will inform the decision.  The fact there is just the one outlier here suggests that this is indeed the better option.

> What does "private" mean for you? They're exposed via pgstat.h not
> pgstat_internal.h. But not to SQL.


I was thinking specifically of the freedom to rename and not break extensions.  Namely, are these truly implementation details or something that, while unlikely to be used by extensions, still constitute an exposed API?  It was mainly a passing thought, I'm not looking for a crash-course in how all that works right now.

David J.

Re: shared-memory based stats collector - v70

From
Andres Freund
Date:
Hi,

On 2022-04-06 17:01:17 -0700, David G. Johnston wrote:
> On Wed, Apr 6, 2022 at 4:12 PM Andres Freund <andres@anarazel.de> wrote:
>
> The fact there is just the one outlier here suggests that this is indeed the
> better option.

FWIW, the outlier also uses pgstat_reset(), just with a small wrapper doing
the translation from slot name to slot index.


> > What does "private" mean for you? They're exposed via pgstat.h not
> > pgstat_internal.h. But not to SQL.
> I was thinking specifically of the freedom to rename and not break
> extensions.  Namely, are these truly implementation details or something
> that, while unlikely to be used by extensions, still constitute an exposed
> API?  It was mainly a passing thought, I'm not looking for a crash-course
> in how all that works right now.

I doubt there are extensions using these functions - and they'd have been
broken the way things were in v70, because the signature already had changed.

Generally, between major releases, we don't worry too much about changing C
APIs. Of course we try to avoid unnecessarily breaking things, particularly
when it's going to cause widespread breakage.

Greetings,

Andres Freund



Re: shared-memory based stats collector - v70

From
Andres Freund
Date:
Hi,

On 2022-04-05 20:00:08 -0700, Andres Freund wrote:
> It'll be a few hours to get to the main commit - but except for 0001 it
> doesn't make sense to push without intending to push later changes too. I
> might squash a few commits togther.

I've gotten through the main commits (and then a fix for the apparently
inevitable bug that's immediately highlighted by the buildfarm), and the first
test. I'll call it a night now, and work on the other tests & docs tomorrow.

Greetings,

Andres Freund



Re: shared-memory based stats collector - v70

From
Andres Freund
Date:
Hi,

On 2022-04-07 00:28:45 -0700, Andres Freund wrote:
> I've gotten through the main commits (and then a fix for the apparently
> inevitable bug that's immediately highlighted by the buildfarm), and the first
> test. I'll call it a night now, and work on the other tests & docs tomorrow.

I've gotten through the tests now. There's one known, not yet addressed, issue
with the stats isolation test, see [1].


Working on the docs. Found a few things worth raising:

1)
Existing text:
   When the server shuts down cleanly, a permanent copy of the statistics
   data is stored in the <filename>pg_stat</filename> subdirectory, so that
   statistics can be retained across server restarts.  When recovery is
   performed at server start (e.g., after immediate shutdown, server crash,
   and point-in-time recovery), all statistics counters are reset.

The existing docs patch hadn't updated this yet. My current edit is

   When the server shuts down cleanly, a permanent copy of the statistics
   data is stored in the <filename>pg_stat</filename> subdirectory, so that
   statistics can be retained across server restarts.  When crash recovery is
   performed at server start (e.g., after immediate shutdown, server crash,
   and point-in-time recovery, but not when starting a standby that was shut
   down normally), all statistics counters are reset.

but I'm not sure the parenthetical is easy enough to understand?


2)
The edit is not a problem, but it's hard to understand what the existing
paragraph actually means?

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 3247e056663..8bfb584b752 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2222,17 +2222,17 @@ HINT:  You can then restart the server after making the necessary configuration
...
    <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
+    The cumulative statistics system is active during recovery. All scans, reads, blocks,
     index usage, etc., will be recorded normally on the standby. Replayed
     actions will not duplicate their effects on primary, so replaying an
     insert will not increment the Inserts column of pg_stat_user_tables.
     The stats file is deleted at the start of recovery, so stats from primary
     and standby will differ; this is considered a feature, not a bug.
    </para>

    <para>

I'll just commit the necessary bit, but we really ought to rephrase this.




Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20220407165709.jgdkrzqlkcwue6ko%40alap3.anarazel.de



Re: shared-memory based stats collector - v70

From
Andres Freund
Date:
On 2022-04-07 16:37:51 -0700, Andres Freund wrote:
> On 2022-04-07 00:28:45 -0700, Andres Freund wrote:
> > I've gotten through the main commits (and then a fix for the apparently
> > inevitable bug that's immediately highlighted by the buildfarm), and the first
> > test. I'll call it a night now, and work on the other tests & docs tomorrow.
> 
> I've gotten through the tests now. There's one known, not yet addressed, issue
> with the stats isolation test, see [1].

That has since been fixed, in d6c0db14836cd843d589372d909c73aab68c7a24



Re: shared-memory based stats collector

From
Ranier Vilela
Date:
Hi,

Per Coverity.

pgstat_reset_entry does not check whether the lock was actually acquired.
If shared_stat_reset_contents is called without the lock held,
isn't that an issue?

regards,

Ranier Vilela


Re: shared-memory based stats collector

From
Andres Freund
Date:
Hi,

On April 8, 2022 4:49:48 AM PDT, Ranier Vilela <ranier.vf@gmail.com> wrote:
>Hi,
>
>Per Coverity.
>
>pgstat_reset_entry does not check if lock it was really blocked.
>I think if shared_stat_reset_contents is called without lock,
>is it an issue not?

I don't think so - the nowait parameter is set to false, so the lock acquisition is blocking.

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: shared-memory based stats collector - v70

From
"David G. Johnston"
Date:
On Tue, Apr 5, 2022 at 8:00 PM Andres Freund <andres@anarazel.de> wrote:
> Here comes v70:


One thing I just noticed while peeking at pg_stat_slru:

The stats_reset column for my newly initdb'd cluster is showing me "2000-01-01 00:00:00" (v15).  I was expecting null, though a non-null value restriction does make sense.  Neither choice is documented though.

Based upon my expectation I checked to see if v14 reported null, and thus this was a behavior change.  v14 reports the initdb timestamp (e.g., 2022-04-13 23:26:48.349115+00)

Can we document the non-null aspect of this value (pg_stat_database is happy being null, this seems to be a "fixed" type behavior) but have it continue to report initdb as its initial value?

David J.

Re: shared-memory based stats collector - v70

From
"David G. Johnston"
Date:
On Wed, Apr 13, 2022 at 4:34 PM David G. Johnston <david.g.johnston@gmail.com> wrote:
> On Tue, Apr 5, 2022 at 8:00 PM Andres Freund <andres@anarazel.de> wrote:
> > Here comes v70:
>
> One thing I just noticed while peeking at pg_stat_slru:
>
> The stats_reset column for my newly initdb'd cluster is showing me "2000-01-01 00:00:00" (v15).  I was expecting null, though a non-null value restriction does make sense.  Neither choice is documented though.
>
> Based upon my expectation I checked to see if v14 reported null, and thus this was a behavior change.  v14 reports the initdb timestamp (e.g., 2022-04-13 23:26:48.349115+00)
>
> Can we document the non-null aspect of this value (pg_stat_database is happy being null, this seems to be a "fixed" type behavior) but have it continue to report initdb as its initial value?


Sorry, apparently this "2000-01-01" behavior only manifests after crash recovery on v15 (didn't check v14); after a clean initdb on v15 I got the same initdb timestamp.

Feels like we should still report the "end of crash recovery timestamp" for these instead of 2000-01-01 (which I guess is derived from 0) if we are not willing to produce null (and it seems other parts of the system using these stats assume non-null).

David J.



Re: shared-memory based stats collector - v70

From
Andres Freund
Date:
Hi,

On 2022-04-13 16:56:45 -0700, David G. Johnston wrote:
> On Wed, Apr 13, 2022 at 4:34 PM David G. Johnston <
> david.g.johnston@gmail.com> wrote:
> Sorry, apparently this "2000-01-01" behavior only manifests after crash
> recovery on v15 (didn't check v14); after a clean initdb on v15 I got the
> same initdb timestamp.

> Feels like we should still report the "end of crash recovery timestamp" for
> these instead of 2000-01-01 (which I guess is derived from 0) if we are not
> willing to produce null (and it seems other parts of the system using these
> stats assumes non-null).

Yes, that's definitely not correct. I see the bug (need to call
pgstat_reset_after_failure(); in pgstat_discard_stats()). Stupid, but
easy to fix - too fried to write a test tonight, but will commit the fix
tomorrow.
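
As background for the "derived from 0" guess: PostgreSQL timestamps count microseconds from 2000-01-01 00:00:00 UTC, so a never-initialized (zero) reset timestamp displays as exactly that date. The sketch below is standalone and uses hypothetical helper names (`toy_pg_timestamp_to_unix()`, `toy_format_pg_timestamp()`); it assumes non-negative inputs, since integer division truncates toward zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define TOY_USECS_PER_SEC   1000000
/* Seconds from 1970-01-01 to 2000-01-01 (10957 days * 86400 s/day). */
#define TOY_PG_EPOCH_UNIX   946684800

/* Convert microseconds-since-2000 to seconds-since-1970. */
static int64_t
toy_pg_timestamp_to_unix(int64_t pg_usecs)
{
    return pg_usecs / TOY_USECS_PER_SEC + TOY_PG_EPOCH_UNIX;
}

/*
 * Format a timestamp as "YYYY-MM-DD HH:MM:SS" in UTC.  Returns the number
 * of characters written, or 0 on failure.
 */
static size_t
toy_format_pg_timestamp(int64_t pg_usecs, char *buf, size_t buflen)
{
    time_t      t = (time_t) toy_pg_timestamp_to_unix(pg_usecs);
    struct tm  *tm = gmtime(&t);

    assert(buf != NULL);

    if (tm == NULL)
        return 0;

    return strftime(buf, buflen, "%Y-%m-%d %H:%M:%S", tm);
}
```

Formatting a value of 0 with this helper yields the 2000-01-01 midnight timestamp reported above.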

Thanks for catching!

Greetings,

Andres Freund



Re: shared-memory based stats collector - v70

From
Michael Paquier
Date:
On Wed, Apr 13, 2022 at 04:56:45PM -0700, David G. Johnston wrote:
> Sorry, apparently this "2000-01-01" behavior only manifests after crash
> recovery on v15 (didn't check v14); after a clean initdb on v15 I got the
> same initdb timestamp.
>
> Feels like we should still report the "end of crash recovery timestamp" for
> these instead of 2000-01-01 (which I guess is derived from 0) if we are not
> willing to produce null (and it seems other parts of the system using these
> stats assumes non-null).

I can see this timestamp as well after crash recovery.  This seems
rather misleading to me.  I have added an open item.
--
Michael


Re: shared-memory based stats collector - v70

From
Andres Freund
Date:
Hi,

On 2022-04-13 17:55:18 -0700, Andres Freund wrote:
> On 2022-04-13 16:56:45 -0700, David G. Johnston wrote:
> > On Wed, Apr 13, 2022 at 4:34 PM David G. Johnston <
> > david.g.johnston@gmail.com> wrote:
> > Sorry, apparently this "2000-01-01" behavior only manifests after crash
> > recovery on v15 (didn't check v14); after a clean initdb on v15 I got the
> > same initdb timestamp.
> 
> > Feels like we should still report the "end of crash recovery timestamp" for
> > these instead of 2000-01-01 (which I guess is derived from 0) if we are not
> > willing to produce null (and it seems other parts of the system using these
> > stats assumes non-null).
> 
> Yes, that's definitely not correct. I see the bug (need to call
> pgstat_reset_after_failure(); in pgstat_discard_stats()). Stupid, but
> easy to fix - too fried to write a test tonight, but will commit the fix
> tomorrow.

Pushed the fix (including a test that previously failed). Thanks again!

Greetings,

Andres Freund



Re: shared-memory based stats collector - v70

From
Greg Stark
Date:
So I'm finally wrapping my head around this new code. There is
something I'm surprised by that perhaps I'm misreading or perhaps I
shouldn't be surprised by, not sure.

Is it true that the shared memory allocation contains the hash table
entry and body of every object in every database? I guess I was
assuming I would find some kind of LRU cache which loaded data from
disk on demand. But afaict it loads everything on startup and then
never loads from disk later. The disk is purely for recovering state
after a restart.

On the one hand the rest of Postgres seems to be designed on the
assumption that the number of tables and database objects is limited
only by disk space. The catalogs are stored in relational storage
which is read through the buffer cache. On the other hand it's true
that syscaches don't expire entries (though I think the assumption
is that no one backend touches very much).

It seems like if we really think the total number of database objects
is reasonably limited to scales that fit in RAM there would be a much
simpler database design that would just store the catalog tables in
simple in-memory data structures and map them all on startup without
doing all the work Postgres does to make relational storage scale.



Re: shared-memory based stats collector - v70

From
Melanie Plageman
Date:

On Wed, Jul 20, 2022 at 11:35 AM Greg Stark <stark@mit.edu> wrote:
> On the one hand the rest of Postgres seems to be designed on the
> assumption that the number of tables and database objects is limited
> only by disk space. The catalogs are stored in relational storage
> which is read through the buffer cache. On the other hand it's true
> that syscaches don't do expire entries (though I think the assumption
> is that no one backend touches very much).
>
> It seems like if we really think the total number of database objects
> is reasonably limited to scales that fit in RAM there would be a much
> simpler database design that would just store the catalog tables in
> simple in-memory data structures and map them all on startup without
> doing all the work Postgres does to make relational storage scale.

I think efforts to do such a thing have gotten caught up in solving
issues around visibility and managing the relationship between local and
global caches [1]. It doesn't seem like the primary technical concern
was memory usage.

[1] https://www.postgresql.org/message-id/flat/4E72940DA2BF16479384A86D54D0988A567B9245%40G01JPEXMBKW04

Re: shared-memory based stats collector - v70

From
Andres Freund
Date:
Hi,

On 2022-07-20 11:35:13 -0400, Greg Stark wrote:
> Is it true that the shared memory allocation contains the hash table
> entry and body of every object in every database?

Yes. However, note that that was already the case with the old stats
collector - it also kept everything in memory. In addition every read
access to stats loaded a copy of the stats (well of the global stats and
the relevant per-database stats).

It might be worth doing something fancier at some point - the shared
memory stats was already a huge effort, cramming yet another change in
there would pretty much have guaranteed that it'd fail.

Greetings,

Andres Freund



Re: shared-memory based stats collector - v70

From
Tom Lane
Date:
Melanie Plageman <melanieplageman@gmail.com> writes:
> On Wed, Jul 20, 2022 at 11:35 AM Greg Stark <stark@mit.edu> wrote:
>> It seems like if we really think the total number of database objects
>> is reasonably limited to scales that fit in RAM there would be a much
>> simpler database design that would just store the catalog tables in
>> simple in-memory data structures and map them all on startup without
>> doing all the work Postgres does to make relational storage scale.

> I think efforts to do such a thing have gotten caught up in solving
> issues around visibility and managing the relationship between local and
> global caches [1]. It doesn't seem like the primary technical concern
> was memory usage.

AFAIR, the previous stats collector implementation had no such provision
either: it'd just keep adding hashtable entries as it received info about
new objects.  The only thing that's changed is that now those entries are
in shared memory instead of process-local memory.  We'd be well advised to
be sure that memory can be swapped out under pressure, but otherwise I'm
not seeing that things have gotten worse.

            regards, tom lane



Re: shared-memory based stats collector - v70

From
Andres Freund
Date:
Hi,

On 2022-07-20 12:08:35 -0400, Tom Lane wrote:
> AFAIR, the previous stats collector implementation had no such provision
> either: it'd just keep adding hashtable entries as it received info about
> new objects.

Yep.


> The only thing that's changed is that now those entries are in shared
> memory instead of process-local memory.  We'd be well advised to be
> sure that memory can be swapped out under pressure, but otherwise I'm
> not seeing that things have gotten worse.

FWIW, I ran a few memory usage benchmarks. Without stats accesses the
memory usage with shared memory stats was sometimes below, sometimes
above the "old" memory usage, depending on the number of objects. As
soon as there's stats access, it's well below (that includes things like
autovac workers).

I think there's quite a bit of memory usage reduction potential around
dsa.c - we occasionally end up with [nearly] unused dsm segments.

Greetings,

Andres Freund



Re: shared-memory based stats collector - v70

From
Greg Stark
Date:
On Wed, 20 Jul 2022 at 12:08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> AFAIR, the previous stats collector implementation had no such provision
> either: it'd just keep adding hashtable entries as it received info about
> new objects.  The only thing that's changed is that now those entries are
> in shared memory instead of process-local memory.  We'd be well advised to
> be sure that memory can be swapped out under pressure, but otherwise I'm
> not seeing that things have gotten worse.

Just to be clear I'm not looking for ways things have gotten worse.
Just trying to understand what I'm reading and I guess I came in with
assumptions that led me astray.

But... adding entries as it received info about new objects isn't the
same as having info on everything. I didn't really understand how the
old system worked but if you had a very large schema but each session
only worked with a small subset did the local stats data ever absorb
info on the objects it never touched?

All that said -- having all objects loaded in shared memory makes my
work way easier. It actually seems feasible to dump out all the
objects from shared memory and including objects from other databases
and if I don't need a consistent snapshot it even seems like it would
be possible to do that without having a copy of more than one stats
entry at a time in local memory. I hope that doesn't cause huge
contention on the shared hash table to be doing that regularly.

-- 
greg



Re: shared-memory based stats collector - v70

From
Andres Freund
Date:
Hi,

On July 20, 2022 8:41:53 PM GMT+02:00, Greg Stark <stark@mit.edu> wrote:
>On Wed, 20 Jul 2022 at 12:08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> AFAIR, the previous stats collector implementation had no such provision
>> either: it'd just keep adding hashtable entries as it received info about
>> new objects.  The only thing that's changed is that now those entries are
>> in shared memory instead of process-local memory.  We'd be well advised to
>> be sure that memory can be swapped out under pressure, but otherwise I'm
>> not seeing that things have gotten worse.
>
>Just to be clear I'm not looking for ways things have gotten worse.
>Just trying to understand what I'm reading and I guess I came in with
>assumptions that led me astray.
>
>But... adding entries as it received info about new objects isn't the
>same as having info on everything. I didn't really understand how the
>old system worked but if you had a very large schema but each session
>only worked with a small subset did the local stats data ever absorb
>info on the objects it never touched?

Each backend only had stats for things it touched. But the stats collector read
all files at startup into hash tables and absorbed all generated stats into
those as well.


>All that said -- having all objects loaded in shared memory makes my
>work way easier.

What are you trying to do?

>It actually seems feasible to dump out all the
>objects from shared memory and including objects from other databases
>and if I don't need a consistent snapshot it even seems like it would
>be possible to do that without having a copy of more than one stats
>entry at a time in local memory. I hope that doesn't cause huge
>contention on the shared hash table to be doing that regularly.

The stats accessors now default to not creating a full snapshot of stats data
at first access (but that's configurable). So yes, that behavior is possible.
E.g. autovac now uses a single object access like you describe.

Andres


--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: shared-memory based stats collector - v70

From
Greg Stark
Date:
On Wed, 20 Jul 2022 at 15:09, Andres Freund <andres@anarazel.de> wrote:
>
> Each backend only had stats for things it touched. But the stats collector read
> all files at startup into hash tables and absorbed all generated stats into
> those as well.

Fascinating. I'm surprised this didn't raise issues previously for
people with millions of tables. I wonder if it wasn't causing issues
and we just didn't hear about them because there were other bigger
issues :)


> >All that said -- having all objects loaded in shared memory makes my
> >work way easier.
>
> What are you trying to do?

I'm trying to implement an exporter for prometheus/openmetrics/etc
that dumps directly from shared memory without going through the SQL
backend layer. I believe this will be much more reliable, lower
overhead, safer, and consistent than writing SQL queries.

Ideally I would want to dump out the stats without connecting to each
database. I suspect that would run into problems where the schema
really adds a lot of information (such as which table each index is on
or which table a toast relation is for). There are also some things
people think of as stats that are maintained in the catalog such as
reltuples and relpages. So I'm imagining this won't strictly stay true
in the end.

It seems like just having an interface to iterate over the shared hash
table and return entries one by one without filtering by database
would be fairly straightforward and I would be able to do most of what
I want just with that. There's actually enough meta information in the
stats entries to be able to handle them as they come instead of trying
to look up specific stats one by one.
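
As an illustration of the interface described above, here is a minimal sketch of visiting every entry through a callback, with no filtering by database. It is not PostgreSQL code: `StatKey`, `StatEntry`, `visit_all_stats` and the callback type are hypothetical stand-ins, and a plain array stands in for the shared hash table.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for a stats key and entry; not PostgreSQL types. */
typedef enum { KIND_DATABASE, KIND_RELATION, KIND_FUNCTION } StatKind;

typedef struct
{
    StatKind kind;
    uint32_t dboid;     /* 0 models a shared object */
    uint32_t objoid;
} StatKey;

typedef struct
{
    StatKey key;
    int64_t counter;    /* one illustrative metric */
} StatEntry;

/* Called once per entry; return 0 to stop the scan early. */
typedef int (*stat_visit_fn) (const StatEntry *entry, void *arg);

/* Visit every entry regardless of database; returns the number visited. */
static size_t
visit_all_stats(const StatEntry *entries, size_t n,
                stat_visit_fn visit, void *arg)
{
    size_t visited = 0;

    for (size_t i = 0; i < n; i++)
    {
        visited++;
        if (!visit(&entries[i], arg))
            break;
    }
    return visited;
}

/* Example callback: dispatch on the kind carried in the key itself. */
static int
sum_relation_counters(const StatEntry *entry, void *arg)
{
    if (entry->key.kind == KIND_RELATION)
        *(int64_t *) arg += entry->counter;
    return 1;
}
```

Because each entry carries its own kind, the callback can decide what to do with a record as it arrives, which is the property the paragraph above relies on.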


-- 
greg



Re: shared-memory based stats collector - v70

From
"Drouvot, Bertrand"
Date:
Hi,

On 7/21/22 5:07 PM, Greg Stark wrote:
> On Wed, 20 Jul 2022 at 15:09, Andres Freund <andres@anarazel.de> wrote:
>> What are you trying to do?
> Ideally I would want to dump out the stats without connecting to each
> database.

I can see the use case too (especially for monitoring tools) of being
able to collect the stats without connecting to each database.

> It seems like just having an interface to iterate over the shared hash
> table and return entries one by one without filtering by database
> would be fairly straightforward and I would be able to do most of what
> I want just with that.

What do you think about adding a function in core PG to provide such
functionality? (Meaning being able to retrieve all the stats, possibly
with some filtering, without the need to connect to each database.)

If there is some interest, I'd be happy to work on it and propose a patch.

Regards,

-- 
Bertrand Drouvot
Amazon Web Services: https://aws.amazon.com




Re: shared-memory based stats collector - v70

From
Greg Stark
Date:
On Tue, 9 Aug 2022 at 06:19, Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>
>
> What do you think about adding a function in core PG to provide such
> functionality? (means being able to retrieve all the stats (+ eventually
> add some filtering) without the need to connect to each database).

I'm working on it myself too. I'll post a patch for discussion in a bit.

I was more aiming at a C function that extensions could use directly
rather than an SQL function -- though I suppose having the former it
would be simple enough to implement the latter using it. (though it
would have to be one for each stat type I guess)

The reason I want a C function is I'm trying to get as far as I can
without a connection to a database, without a transaction, without
accessing the catalog, and as much as possible without taking locks. I
think this is important for making monitoring highly reliable and low
impact on production. It's also kind of fundamental to accessing stats
for objects from other databases since we won't have easy access to
the catalogs for the other databases.

The main problem with my current code is that I'm accessing the shared
memory hash table directly. This means I'm possibly introducing
locking contention on the shared memory hash table. I'm thinking of
separating the shared memory hash scan from the metric scan so the
list can be quickly built, minimizing the time the lock is held. We
could possibly also only rebuild that list at a lower frequency than
the metrics gathering so new objects might not show up instantly.
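
A rough sketch of the two-phase idea in the paragraph above: copy only the keys while the table lock is held, then look entries up one by one afterwards. Everything here is hypothetical (`StatTable`, `snapshot_keys`, `lookup_counter`); the `locked` flag merely models the table-wide lock, and the linear lookup stands in for a real hash probe.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint32_t dboid; uint32_t objoid; } StatKey;
typedef struct { StatKey key; int64_t counter; } StatEntry;

typedef struct
{
    StatEntry *entries;
    size_t     n;
    int        locked;  /* models the table-wide lock; 1 while held */
} StatTable;

/*
 * Phase 1: copy only the keys while the table lock is held.
 * The caller frees the result. Returns NULL on allocation failure.
 */
static StatKey *
snapshot_keys(StatTable *tab, size_t *nkeys)
{
    StatKey *keys = malloc((tab->n > 0 ? tab->n : 1) * sizeof *keys);

    *nkeys = 0;
    if (keys == NULL)
        return NULL;

    tab->locked = 1;
    for (size_t i = 0; i < tab->n; i++)
        keys[i] = tab->entries[i].key;
    *nkeys = tab->n;
    tab->locked = 0;

    return keys;
}

/*
 * Phase 2: fetch one entry by key. Returns 0 if the object has
 * disappeared since the key list was built.
 */
static int
lookup_counter(const StatTable *tab, StatKey key, int64_t *out)
{
    for (size_t i = 0; i < tab->n; i++)
    {
        const StatKey *k = &tab->entries[i].key;

        if (k->dboid == key.dboid && k->objoid == key.objoid)
        {
            *out = tab->entries[i].counter;
            return 1;
        }
    }
    return 0;
}
```

The cost of this split is that the key list can go stale: phase 2 has to tolerate keys that no longer resolve, and objects created after the snapshot are not seen until the list is rebuilt.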

I have a few things I would like to suggest for future improvements to
this infrastructure. I haven't polished the details of it yet but the
main thing I think I'm missing is the catalog name for the object. I
don't want to have to fetch it from the catalog and in any case I
think it would generally be useful and might regularize the
replication slot handling too.

I also think it would be nice to have a change counter for every stat
object, or perhaps a change time. Prometheus wouldn't be able to make
use of it but other monitoring software might be able to receive only
metrics that have changed since the last update which would really
help on databases with large numbers of mostly static objects. Even on
typical databases there are tons of builtin objects (especially
functions) that are probably never getting updates.
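
To make the change-counter idea concrete, here is a small sketch under the assumption that each entry carries a counter bumped on every update. The `change_count` field, `stat_add` and `collect_changed` are hypothetical; nothing like them exists in the stats entries discussed in this thread.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    uint32_t objoid;
    int64_t  counter;
    uint64_t change_count;  /* hypothetical: bumped on every update */
} StatEntry;

/* Apply an update and record that the entry changed. */
static void
stat_add(StatEntry *e, int64_t delta)
{
    e->counter += delta;
    e->change_count++;
}

/*
 * Collect the OIDs of entries changed since the previous call.
 * last_seen[i] remembers the change_count observed for entries[i];
 * changed_out must have room for n OIDs. Returns the number written.
 */
static size_t
collect_changed(const StatEntry *entries, uint64_t *last_seen, size_t n,
                uint32_t *changed_out)
{
    size_t nchanged = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (entries[i].change_count != last_seen[i])
        {
            changed_out[nchanged++] = entries[i].objoid;
            last_seen[i] = entries[i].change_count;
        }
    }
    return nchanged;
}
```

An exporter built this way would skip the mostly static objects on each poll, at the price of one extra word per entry and one extra increment per update.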

-- 
greg



Re: shared-memory based stats collector - v70

From
Andres Freund
Date:
Hi,

On 2022-08-09 12:18:47 +0200, Drouvot, Bertrand wrote:
> What do you think about adding a function in core PG to provide such
> functionality? (means being able to retrieve all the stats (+ eventually add
> some filtering) without the need to connect to each database).

I'm not that convinced by the use case, but I think it's also low cost to add
and maintain, so if somebody cares enough to write something...

The only thing I would "request" is that such a function requires more
permissions than the default accessors do. I think it's a minor problem that
we allow so much access within a database right now, regardless of object
permissions, but it'd not be a great idea to expand that to other databases,
in bulk?

Greetings,

Andres Freund



Re: shared-memory based stats collector - v70

From
Andres Freund
Date:
Hi,

On 2022-08-09 12:00:46 -0400, Greg Stark wrote:
> I was more aiming at a C function that extensions could use directly
> rather than an SQL function -- though I suppose having the former it
> would be simple enough to implement the latter using it. (though it
> would have to be one for each stat type I guess)

I think such a C extension could exist today, without patching core code? It'd
be a bit ugly to include pgstat_internal.h, I guess, but other than that...


> The reason I want a C function is I'm trying to get as far as I can
> without a connection to a database, without a transaction, without
> accessing the catalog, and as much as possible without taking locks.

I assume you don't include lwlocks under locks?


> I think this is important for making monitoring highly reliable and low
> impact on production.

I'm doubtful about that, but whatever.


> The main problem with my current code is that I'm accessing the shared
> memory hash table directly. This means I'm possibly introducing
> locking contention on the shared memory hash table.

I don't think that's a large enough issue to worry about unless you're
polling at a very high rate, which'd be a bad idea in itself. If a backend
can't get the lock for some stats change it'll defer flushing the stats a bit,
so it'll not cause a lot of other problems.
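
The deferral behavior described above can be sketched as a try-lock: if the lock is busy, the writer keeps its pending delta and retries later instead of waiting. This is only a model, assuming a C11 `atomic_flag` in place of the real per-entry lock; `SharedStat`, `LocalStat` and `try_flush` are hypothetical names.

```c
#include <stdatomic.h>
#include <stdint.h>

typedef struct
{
    atomic_flag lock;    /* models the lock protecting the shared entry */
    int64_t     shared;  /* value visible to readers */
} SharedStat;

typedef struct
{
    int64_t pending;     /* changes not yet flushed by this writer */
} LocalStat;

/*
 * Try to flush pending changes without blocking. Returns 1 if flushed,
 * 0 if the lock was busy, in which case the pending delta is kept for a
 * later attempt.
 */
static int
try_flush(SharedStat *s, LocalStat *l)
{
    if (atomic_flag_test_and_set(&s->lock))
        return 0;        /* somebody else holds the lock: defer */

    s->shared += l->pending;
    l->pending = 0;
    atomic_flag_clear(&s->lock);
    return 1;
}
```

In this model a slow reader only delays when the counters become visible; no update is lost, because the delta stays in `pending` until a flush succeeds.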


> I'm thinking of separating the shared memory hash scan from the metric scan
> so the list can be quickly built minimizing the time the lock is held.

I'd really really want to see some evidence that any sort of complexity here
is worth it.


> I have a few things I would like to suggest for future improvements to
> this infrastructure. I haven't polished the details of it yet but the
> main thing I think I'm missing is the catalog name for the object. I
> don't want to have to fetch it from the catalog and in any case I
> think it would generally be useful and might regularize the
> replication slot handling too.

I'm *dead* set against including catalog names in shared memory stats. That'll
add a good amount of memory usage and complexity, without any sort of
commensurate gain.


> I also think it would be nice to have a change counter for every stat
> object, or perhaps a change time. Prometheus wouldn't be able to make
> use of it but other monitoring software might be able to receive only
> metrics that have changed since the last update which would really
> help on databases with large numbers of mostly static objects.

I think you're proposing adding overhead that doesn't even have a real user.

Greetings,

Andres Freund



Re: shared-memory based stats collector - v70

From
"Drouvot, Bertrand"
Date:
Hi,

On 8/9/22 6:40 PM, Andres Freund wrote:
> Hi,
>
> On 2022-08-09 12:18:47 +0200, Drouvot, Bertrand wrote:
>> What do you think about adding a function in core PG to provide such
>> functionality? (means being able to retrieve all the stats (+ eventually add
>> some filtering) without the need to connect to each database).
> I'm not that convinced by the use case, but I think it's also low cost to add
> and maintain, so if somebody cares enough to write something...

Ack.

>
> The only thing I would "request" is that such a function requires more
> permissions than the default accessors do. I think it's a minor problem that
> we allow so much access within a database right now, regardless of object
> permissions, but it'd not be a great idea to expand that to other databases,
> in bulk?

Agree that special attention would need to be paid to permissions.

Something like allowing its usage only for members of pg_read_all_stats?

Regards,

-- 

Bertrand Drouvot
Amazon Web Services: https://aws.amazon.com




Re: shared-memory based stats collector - v70

From
"Drouvot, Bertrand"
Date:
Hi,

On 8/9/22 6:00 PM, Greg Stark wrote:
> On Tue, 9 Aug 2022 at 06:19, Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>>
>> What do you think about adding a function in core PG to provide such
>> functionality? (means being able to retrieve all the stats (+ eventually
>> add some filtering) without the need to connect to each database).
> I'm working on it myself too. I'll post a patch for discussion in a bit.

Great! Thank you!

Out of curiosity, would you also be interested in such a feature for
previous versions (that will not get the patch)?

Regards,

-- 

Bertrand Drouvot
Amazon Web Services: https://aws.amazon.com




Re: shared-memory based stats collector - v70

From
"Drouvot, Bertrand"
Date:
Hi,

On 8/9/22 6:47 PM, Andres Freund wrote:
> Hi,
>
> On 2022-08-09 12:00:46 -0400, Greg Stark wrote:
>> I was more aiming at a C function that extensions could use directly
>> rather than an SQL function -- though I suppose having the former it
>> would be simple enough to implement the latter using it. (though it
>> would have to be one for each stat type I guess)
> I think such a C extension could exist today, without patching core code? It'd
> be a bit ugly to include pgstat_internal.h, I guess, but other than that...

Yeah, agree that writing such an extension is doable today.

>> The main problem with my current code is that I'm accessing the shared
>> memory hash table directly. This means I'm possibly introducing
>> locking contention on the shared memory hash table.
> I don't think that's a large enough issue to worry about unless you're
> polling at a very high rate, which'd be a bad idea in itself. If a backend
> can't get the lock for some stats change it'll defer flushing the stats a bit,
> so it'll not cause a lot of other problems.

+1

Regards,

-- 
Bertrand Drouvot
Amazon Web Services: https://aws.amazon.com




Re: shared-memory based stats collector - v70

From
Greg Stark
Date:
On Tue, 9 Aug 2022 at 12:48, Andres Freund <andres@anarazel.de> wrote:

> > The reason I want a C function is I'm trying to get as far as I can
> > without a connection to a database, without a transaction, without
> > accessing the catalog, and as much as possible without taking locks.
>
> I assume you don't include lwlocks under locks?

I guess it depends on which lwlock :) I would be leery of a monitoring
system taking an lwlock that could interfere with regular transactions
doing work. Or taking a lock that is itself the cause of the problem
elsewhere that you really need stats to debug would be a deal breaker.

> I don't think that's a large enough issue to worry about unless you're
> polling at a very high rate, which'd be a bad idea in itself. If a backend
> can't get the lock for some stats change it'll defer flushing the stats a bit,
> so it'll not cause a lot of other problems.

Hm. I wonder if we're on the same page about what constitutes a "high rate".

I've seen people try to push prometheus or other similar systems to 5s
poll intervals. That would be challenging for Postgres due to the
volume of statistics. The default is 30s and people often struggle to
even have that function for large fleets. But if you had a small
fleet, perhaps an iot style system with a "one large table" type of
schema you might well want stats every 5s or even every 1s.

> I'm *dead* set against including catalog names in shared memory stats. That'll
> add a good amount of memory usage and complexity, without any sort of
> commensurate gain.

Well it's pushing the complexity there from elsewhere. If the labels
aren't in the stats structures then the exporter needs to connect to
each database, gather all the names into some local cache and then it
needs to worry about keeping it up to date. And if there are any
database problems such as disk errors or catalog objects being locked
then your monitoring breaks though perhaps it can be limited to just
missing some object names or having out of date names.



> > I also think it would be nice to have a change counter for every stat
> > object, or perhaps a change time. Prometheus wouldn't be able to make
> > use of it but other monitoring software might be able to receive only
> > metrics that have changed since the last update which would really
> > help on databases with large numbers of mostly static objects.
>
> I think you're proposing adding overhead that doesn't even have a real user.

I guess I'm just brainstorming here. I don't need it currently, no. It
doesn't seem like significant overhead compared to the locking
and copying though?

-- 
greg



Re: shared-memory based stats collector - v70

From
Greg Stark
Date:
One thing that's bugging me is that the names we use for these stats
are *all* over the place.

The names go through three different stages

pgstat structs  ->  pgstatfunc tupledescs  ->  pg_stat_* views

(Followed by a fourth stage where pg_exporter or whatever renames them for
the monitoring software)

And for some reason both transitions (plus the exporter) felt the need
to fiddle with the names or values. And not in any sort of even
vaguely consistent way. So there are three (or four) different sets of
names for the same metrics :(

e.g.

* Some of the struct elements have abbreviated words which are
expanded in the tupledesc names or the view columns -- some have long
names which get abbreviated.

* Some struct members have n_ prefixes (presumably to avoid C keywords
or other namespace issues?) and then lose them at one of the other
stages. But then the relation stats do not have n_ prefixes and then
the pg_stat view *adds* n_ prefixes in the SQL view!

* Some columns are added together in the SQL view which seems like
gratuitously hiding information from the user. The pg_stat_*_tables
view actually looks up information from the indexes stats and combines
them to get idx_scan and idx_tup_fetch.

* The pg_stat_bgwriter view returns data from two different fixed
entries, the checkpointer and the bgwriter, is there a reason those
are kept separately but then reported as if they're one thing?


Some of the simpler renaming could be transparently fixed by making
the internal stats match the public facing names. But for many of them
I think the internal names are better. And the cases where the views
aggregate data in a way that loses information are not something I
want to reproduce.

I had intended to use the internal names directly, reasoning that
transparency and consistency are the direction to be headed. But in
some cases I think the current public names are the better choice -- I
certainly don't want to remove n_* prefixes from some names but then
add them to different names! And some of the cases where the data is
combined or modified do seem like they would be missed.

-- 
greg



Re: shared-memory based stats collector - v70

From
Greg Stark
Date:
On Wed, 10 Aug 2022 at 04:05, Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>
> Hi,
>
> On 8/9/22 6:00 PM, Greg Stark wrote:
> > On Tue, 9 Aug 2022 at 06:19, Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
> >>
> >> What do you think about adding a function in core PG to provide such
> >> functionality? (means being able to retrieve all the stats (+ eventually
> >> add some filtering) without the need to connect to each database).
> > I'm working on it myself too. I'll post a patch for discussion in a bit.
>
> Great! Thank you!

So I was adding the code to pgstat.c because I had thought there were
some data types I needed and/or static functions I needed. However you
and Andres encouraged me to check again now. And indeed I was able,
after fixing a couple things, to make the code work entirely
externally.

This is definitely not polished and there's a couple obvious things
missing. But at the risk of embarrassment I've attached my WIP. Please
be gentle :) I'll post the github link in a bit when I've fixed up
some meta info.

I'm definitely not wedded to the idea of using callbacks, it was just
the most convenient way to get started, especially when I was putting
the main loop in pgstat.c.  Ideally I do want to keep open the
possibility of streaming the results out without buffering the whole
set in memory.

> Out of curiosity, would you also be interested in such a feature for
> previous versions (that will not get the patch)?

I always had trouble understanding the existing stats code so I was
hoping the new code would make it easier. It seems to have worked but
it's possible I'm wrong and it was always possible and the problem was
always just me :)


-- 
greg

Attachments

Re: shared-memory based stats collector - v70

From
Andres Freund
Date:
Hi,

On 2022-08-10 14:18:25 -0400, Greg Stark wrote:
> > I don't think that's a large enough issue to worry about unless you're
> > polling at a very high rate, which'd be a bad idea in itself. If a backend
> > can't get the lock for some stats change it'll defer flushing the stats a bit,
> > so it'll not cause a lot of other problems.
> 
> Hm. I wonder if we're on the same page about what constitutes a "high rate".
> 
> I've seen people try to push prometheus or other similar systems to 5s
> poll intervals. That would be challenging for Postgres due to the
> volume of statistics. The default is 30s and people often struggle to
> even have that function for large fleets. But if you had a small
> fleet, perhaps an iot style system with a "one large table" type of
> schema you might well want stats every 5s or even every 1s.

That's probably fine. Although I think you might run into trouble not from the
stats subsystem side, but from the "amount of data" side. On a system with a
lot of objects that can be a fair amount.  If you really want to do very low
latency stats reporting, I suspect you'd have to build an incremental system.


> > I'm *dead* set against including catalog names in shared memory stats. That'll
> > add a good amount of memory usage and complexity, without any sort of
> > commensurate gain.
> 
> Well it's pushing the complexity there from elsewhere. If the labels
> aren't in the stats structures then the exporter needs to connect to
> each database, gather all the names into some local cache and then it
> needs to worry about keeping it up to date. And if there are any
> database problems such as disk errors or catalog objects being locked
> then your monitoring breaks though perhaps it can be limited to just
> missing some object names or having out of date names.

Shrug. If the stats system state desynchronizes from an alter table rename
you'll also have a problem in monitoring.

And even if you can benefit from having all that information, it'd still be an
overhead borne by everybody for a very small share of users.


> > > I also think it would be nice to have a change counter for every stat
> > > object, or perhaps a change time. Prometheus wouldn't be able to make
> > > use of it but other monitoring software might be able to receive only
> > > metrics that have changed since the last update which would really
> > > help on databases with large numbers of mostly static objects.
> >
> > I think you're proposing adding overhead that doesn't even have a real user.
> 
> I guess I'm just brainstorming here. I don't need it currently, no. It
> doesn't seem like significant overhead compared to the locking
> and copying though?

Yes, timestamps aren't cheap to determine (nor free to store, but that's a
lesser issue).

Greetings,

Andres Freund



Re: shared-memory based stats collector - v70

From
Andres Freund
Date:
Hi,

On 2022-08-10 15:48:15 -0400, Greg Stark wrote:
> One thing that's bugging me is that the names we use for these stats
> are *all* over the place.

Yes. I had a huge issue with this when polishing the patch. And Horiguchi-san
did as well.  I had to limit the amount of cleanup done to make it feasible to
get anything committed. I think it's a bit less bad than before, but by no
means good.


> * The pg_stat_bgwriter view returns data from two different fixed
> entries, the checkpointer and the bgwriter, is there a reason those
> are kept separately but then reported as if they're one thing?

Historical raisins. Checkpointer and bgwriter used to be one thing, but aren't
anymore.

Greetings,

Andres Freund



Re: shared-memory based stats collector - v70

From
"Drouvot, Bertrand"
Date:
Hi,

On 8/10/22 11:25 PM, Greg Stark wrote:
> On Wed, 10 Aug 2022 at 04:05, Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>> Hi,
>>
>> On 8/9/22 6:00 PM, Greg Stark wrote:
>>> On Tue, 9 Aug 2022 at 06:19, Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>>>> What do you think about adding a function in core PG to provide such
>>>> functionality? (means being able to retrieve all the stats (+ eventually
>>>> add some filtering) without the need to connect to each database).
>>> I'm working on it myself too. I'll post a patch for discussion in a bit.
>> Great! Thank you!
> So I was adding the code to pgstat.c because I had thought there were
> some data types I needed and/or static functions I needed. However you
> and Andres encouraged me to check again now. And indeed I was able,
> after fixing a couple things, to make the code work entirely
> externally.

Nice!

Though I still think having an SQL API in core could be useful too.

As Andres was not -1 about that idea (as it should be low cost to add 
and maintain) as long as somebody cares enough to write something: then 
I'll give it a try and submit a patch for it.

>
> This is definitely not polished and there's a couple obvious things
> missing. But at the risk of embarrassment I've attached my WIP. Please
> be gentle :) I'll post the github link in a bit when I've fixed up
> some meta info.

Thanks! I will have a look at it on github (once you share the link).

Regards,

-- 
Bertrand Drouvot
Amazon Web Services: https://aws.amazon.com




Re: shared-memory based stats collector - v70

From
Greg Stark
Date:
On Thu, 11 Aug 2022 at 02:11, Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>
> As Andres was not -1 about that idea (as it should be low cost to add
> and maintain) as long as somebody cares enough to write something: then
> I'll give it a try and submit a patch for it.

I agree it would be a useful feature. I think there may be things to
talk about here though.

1) Are you planning to go through the local hash table and
LocalSnapshot and obey the consistency mode? I was thinking a flag
passed to build_snapshot to request global mode might be sufficient
instead of a completely separate function.

2) When I did the function attached above I tried to avoid returning
the whole set and make it possible to process them as they arrive. I
actually was hoping to get to the point where I could start shipping
out network data as they arrive and not even buffer up the response,
but I think I need to be careful about hash table locking then.

3) The key difference here is that we're returning whatever stats are
in the hash table rather than using the catalog to drive a list of id
numbers to look up. I guess the API should make it clear this is what
is being returned -- on that note I wonder if I've done something
wrong because I noted a few records with InvalidOid where I didn't
expect it.

4) I'm currently looping over the hash table returning the records all
intermixed. Some users will probably want to do things like "return
all Relation records for all databases" or "return all Index records
for database id xxx". So some form of filtering may be best or perhaps
a way to retrieve just the keys so they can then be looked up one by
one (through the local cache?).

5) On that note I'm not clear how the local cache will interact with
these cross-database lookups. That should probably be documented...

-- 
greg



Re: shared-memory based stats collector - v70

From
"Drouvot, Bertrand"
Date:
Hi,

On 8/15/22 4:46 PM, Greg Stark wrote:
> On Thu, 11 Aug 2022 at 02:11, Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>> As Andres was not -1 about that idea (as it should be low cost to add
>> and maintain) as long as somebody cares enough to write something: then
>> I'll give it a try and submit a patch for it.
> I agree it would be a useful feature. I think there may be things to
> talk about here though.
>
> 1) Are you planning to go through the local hash table and
> LocalSnapshot and obey the consistency mode? I was thinking a flag
> passed to build_snapshot to request global mode might be sufficient
> instead of a completely separate function.

I think the new API should behave as PGSTAT_FETCH_CONSISTENCY_NONE (as
querying from all the databases increases the risk of having to deal
with a "large" number of objects).

I have in mind doing something along these lines (still need to add some
filtering, an extra check on permissions, ...):

+       dshash_seq_init(&hstat, pgStatLocal.shared_hash, false);
+       while ((p = dshash_seq_next(&hstat)) != NULL)
+       {
+               Datum values[PG_STAT_GET_ALL_TABLES_STATS_COLS];
+               bool nulls[PG_STAT_GET_ALL_TABLES_STATS_COLS];
+               PgStat_StatTabEntry * tabentry = NULL;
+               MemSet(values, 0, sizeof(values));
+               MemSet(nulls, false, sizeof(nulls));
+

+               if (p->key.kind != PGSTAT_KIND_RELATION)
+                       continue;

+               if (p->dropped)
+                       continue;
+
+               stats_data = dsa_get_address(pgStatLocal.dsa, p->body);
+               LWLockAcquire(&stats_data->lock, LW_SHARED);
+               tabentry = pgstat_get_entry_data(PGSTAT_KIND_RELATION, stats_data);
+
+
+               values[0] = ObjectIdGetDatum(p->key.dboid);
+               values[1] = ObjectIdGetDatum(p->key.objoid);
+               values[2] = Int64GetDatum(tabentry->tuples_inserted);

.

.

+               tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+               LWLockRelease(&stats_data->lock);
+       }
+       dshash_seq_term(&hstat);

What do you think?

> 2) When I did the function attached above I tried to avoid returning
> the whole set and make it possible to process them as they arrive.

Is it the way it has been done? (did not look at your function yet)

> I
> actually was hoping to get to the point where I could start shipping
> out network data as they arrive and not even buffer up the response,
> but I think I need to be careful about hash table locking then.

If using dshash_seq_next() the already returned elements are locked.

But I guess you would like to unlock them (if you are able to process 
them as they arrive)?

> 3) The key difference here is that we're returning whatever stats are
> in the hash table rather than using the catalog to drive a list of id
> numbers to look up.

Right.

> I guess the API should make it clear this is what
> is being returned

Right. I think we'll end up with a set of relation ids (not their names)
and their associated stats.

> -- on that note I wonder if I've done something
> wrong because I noted a few records with InvalidOid where I didn't
> expect it.

It looks like InvalidOid for the dbid means that the entry is for a
shared relation.

Where did you see them (while not expecting them)?

> 4) I'm currently looping over the hash table returning the records all
> intermixed. Some users will probably want to do things like "return
> all Relation records for all databases" or "return all Index records
> for database id xxx". So some form of filtering may be best or perhaps
> a way to retrieve just the keys so they can then be looked up one by
> one (through the local cache?).

I have in mind to add some filtering on the dbid (I think it could be
useful for a monitoring tool that keeps a persistent connection to one
database but wants to pull the stats database by database).

I don't think a lookup through the local cache will work if the
entry/key belongs to a database other than the one the API is launched from.

> 5) On that note I'm not clear how the local cache will interact with
> these cross-database lookups. That should probably be documented...

Yeah, I don't think that would work (if by local cache you mean what is
in the relcache).

Regards,

-- 
Bertrand Drouvot
Amazon Web Services: https://aws.amazon.com




Re: shared-memory based stats collector - v70

From
Greg Stark
Date:
On Tue, 16 Aug 2022 at 08:49, Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>
>
> +               if (p->key.kind != PGSTAT_KIND_RELATION)
> +                       continue;

Hm. So presumably this needs to be extended. Either to let the caller
decide which types of stats to return or to somehow return all the
stats intermixed. In my monitoring code I did the latter because I
didn't think going through the hash table repeatedly would be very
efficient. But it's definitely a pretty awkward API since I need a
switch statement that explicitly lists each case and casts the result.

> > 2) When I did the function attached above I tried to avoid returning
> > the whole set and make it possible to process them as they arrive.
>
> Is it the way it has been done? (did not look at your function yet)

I did it with callbacks. It was quick and easy and convenient for my
use case. But in general I don't really like callbacks and would think
some kind of iterator style api would be nicer.

I am handling the stats entries as they turn up. I'm constructing the
text output for each in a callback and buffering up the whole http
response in a string buffer.

I think that's ok but if I wanted to avoid buffering it up and do
network i/o then I would think the thing to do would be to build the
list of entry keys and then loop over that list doing a hash lookup
for each one and generating the response for each out and writing it
to the network. That way there wouldn't be anything locked, not even
the hash table, while doing network i/o. It would mean a lot of
traffic on the hash table though.
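The two-phase approach described above (collect the keys first, then look each one up individually so that nothing is locked during network I/O) can be sketched roughly as follows. This is only an illustration under stated assumptions: `DemoTable`, `demo_collect_keys`, `demo_lookup_copy` and `demo_stream_all` are hypothetical names, the "lock" is a plain counter standing in for a real lock, and none of this is the dshash or pgstat API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these are PostgreSQL types. */
typedef struct { uint32_t kind, dboid, objoid; } DemoKey;
typedef struct { DemoKey key; bool used; int64_t counter; } DemoEntry;

#define DEMO_TABLE_SIZE 8

typedef struct {
    DemoEntry slots[DEMO_TABLE_SIZE];
    int lock_depth;   /* stand-in for a real lock: > 0 means "held" */
    int lookups;      /* number of individual lookups (traffic on the table) */
} DemoTable;

static void demo_lock(DemoTable *t)   { t->lock_depth++; }
static void demo_unlock(DemoTable *t) { assert(t->lock_depth > 0); t->lock_depth--; }

/* Phase 1: copy out only the keys while the table is locked. */
static size_t demo_collect_keys(DemoTable *t, DemoKey *out, size_t max)
{
    size_t n = 0;

    demo_lock(t);
    for (size_t i = 0; i < DEMO_TABLE_SIZE && n < max; i++)
        if (t->slots[i].used)
            out[n++] = t->slots[i].key;
    demo_unlock(t);
    return n;
}

/* Phase 2: look up one key, copy its value, and release the lock again. */
static bool demo_lookup_copy(DemoTable *t, DemoKey key, int64_t *value)
{
    bool found = false;

    demo_lock(t);
    t->lookups++;
    for (size_t i = 0; i < DEMO_TABLE_SIZE; i++)
    {
        DemoEntry *e = &t->slots[i];

        if (e->used && e->key.kind == key.kind &&
            e->key.dboid == key.dboid && e->key.objoid == key.objoid)
        {
            *value = e->counter;
            found = true;
            break;
        }
    }
    demo_unlock(t);
    return found;
}

/* Example sink: adds each value to a running total instead of doing I/O. */
static void demo_sum_sink(DemoKey key, int64_t value, void *arg)
{
    (void) key;
    *(int64_t *) arg += value;
}

/* Emit every entry; the sink always runs with no lock held. */
static size_t demo_stream_all(DemoTable *t,
                              void (*sink)(DemoKey, int64_t, void *),
                              void *arg)
{
    DemoKey keys[DEMO_TABLE_SIZE];
    size_t  nkeys = demo_collect_keys(t, keys, DEMO_TABLE_SIZE);
    size_t  sent = 0;

    for (size_t i = 0; i < nkeys; i++)
    {
        int64_t value;

        if (!demo_lookup_copy(t, keys[i], &value))
            continue;               /* entry vanished between the two phases */
        assert(t->lock_depth == 0); /* safe to block on I/O here */
        sink(keys[i], value, arg);
        sent++;
    }
    return sent;
}
```

The `lookups` counter makes the cost visible: one extra pass over the table per entry, which is the "lot of traffic on the hash table" trade-off mentioned above.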

> > -- on that note I wonder if I've done something
> > wrong because I noted a few records with InvalidOid where I didn't
> > expect it.
>
> It looks like that InvalidOid for the dbid means that the entry is for a
> shared relation.

Ah yes. I had actually found that but forgotten it.

There's also a database entry with dboid=InvalidOid which is
apparently where background workers with no database attached report
stats.

> I have in mind to add some filtering on the dbid (I think it could be
> useful for a monitoring tool that keeps a persistent connection to one
> database but wants to pull the stats database by database).
>
> I don't think a lookup through the local cache will work if the
> entry/key belongs to a database other than the one the API is launched from.

Isn't there also a local hash table used to find the entries to reduce
traffic on the shared hash table? Even if you don't take a snapshot
does it get entered there? There are definitely still parts of this
I'm working on a pretty vague understanding of :/

-- 
greg



Re: shared-memory based stats collector - v70

From
Andres Freund
Date:
Hi,

On 2022-08-17 15:46:42 -0400, Greg Stark wrote:
> Isn't there also a local hash table used to find the entries to reduce
> traffic on the shared hash table? Even if you don't take a snapshot
> does it get entered there? There are definitely still parts of this
> I'm working on a pretty vague understanding of :/

Yes, there is. But it's more about code that generates stats, rather than
reporting functions. While there's backend local pending stats we need to have
a refcount on the shared stats item so that the stats item can't be dropped
and then revived when those local stats are flushed.

Relevant comments from pgstat.c:

 * To avoid contention on the shared hashtable, each backend has a
 * backend-local hashtable (pgStatEntryRefHash) in front of the shared
 * hashtable, containing references (PgStat_EntryRef) to shared hashtable
 * entries. The shared hashtable only needs to be accessed when no prior
 * reference is found in the local hashtable. Besides pointing to the
 * shared hashtable entry (PgStatShared_HashEntry) PgStat_EntryRef also
 * contains a pointer to the shared statistics data, as a process-local
 * address, to reduce access costs.
 *
 * The names for structs stored in shared memory are prefixed with
 * PgStatShared instead of PgStat. Each stats entry in shared memory is
 * protected by a dedicated lwlock.
 *
 * Most stats updates are first accumulated locally in each process as pending
 * entries, then later flushed to shared memory (just after commit, or by
 * idle-timeout). This practically eliminates contention on individual stats
 * entries. For most kinds of variable-numbered pending stats data is stored
 * in PgStat_EntryRef->pending. All entries with pending data are in the
 * pgStatPending list. Pending statistics updates are flushed out by
 * pgstat_report_stat().
 *

pgstat_internal.h has more details about the refcount aspect:

 * Per-object statistics are stored in the "shared stats" hashtable. That
 * table's entries (PgStatShared_HashEntry) contain a pointer to the actual stats
 * data for the object (the size of the stats data varies depending on the
 * kind of stats). The table is keyed by PgStat_HashKey.
 *
 * Once a backend has a reference to a shared stats entry, it increments the
 * entry's refcount. Even after stats data is dropped (e.g., due to a DROP
 * TABLE), the entry itself can only be deleted once all references have been
 * released.
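The refcount rule quoted above (a dropped entry can only be deleted once every reference has been released) can be illustrated with a small sketch. The names `DemoSharedEntry`, `demo_acquire`, `demo_release` and `demo_drop` are hypothetical and the sketch ignores locking and atomics entirely; it only shows the ordering rule, not how pgstat implements it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical simplified entry; not PgStatShared_HashEntry. */
typedef struct {
    bool in_use;    /* slot still occupies space in the table */
    bool dropped;   /* the object was dropped (e.g. DROP TABLE) */
    int  refcount;  /* number of backends holding a reference */
} DemoSharedEntry;

/* Free the slot only if it is both dropped and unreferenced. */
static bool demo_maybe_free(DemoSharedEntry *e)
{
    if (e->dropped && e->refcount == 0)
    {
        e->in_use = false;
        return true;
    }
    return false;
}

/* A backend takes a reference, e.g. because it has pending stats. */
static void demo_acquire(DemoSharedEntry *e)
{
    assert(e->in_use);
    e->refcount++;
}

/* A backend releases its reference; returns true if the slot was freed. */
static bool demo_release(DemoSharedEntry *e)
{
    assert(e->refcount > 0);
    e->refcount--;
    return demo_maybe_free(e);
}

/* Mark the stats as dropped; returns true if the slot was freed at once. */
static bool demo_drop(DemoSharedEntry *e)
{
    e->dropped = true;
    return demo_maybe_free(e);
}
```

With one outstanding reference, `demo_drop()` only marks the entry, and the later `demo_release()` is what actually frees it, so a backend flushing pending stats never writes into a slot that has been reused.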

Greetings,

Andres Freund



Re: shared-memory based stats collector - v70

From
"Drouvot, Bertrand"
Date:
Hi,

On 8/17/22 9:46 PM, Greg Stark wrote:
> On Tue, 16 Aug 2022 at 08:49, Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>>
>> +               if (p->key.kind != PGSTAT_KIND_RELATION)
>> +                       continue;
> Hm. So presumably this needs to be extended. Either to let the caller
> decide which types of stats to return or to somehow return all the
> stats intermixed. In my monitoring code I did the latter because I
> didn't think going through the hash table repeatedly would be very
> efficient. But it's definitely a pretty awkward API since I need a
> switch statement that explicitly lists each case and casts the result.

What I had in mind is to provide an API to retrieve stats for those that 
would need to connect to each database individually otherwise.

That's why I focused on PGSTAT_KIND_RELATION that has 
PgStat_KindInfo.accessed_across_databases set to false.

I think that another candidate could also be PGSTAT_KIND_FUNCTION.

I think those are the 2 cases where a monitoring tool connected to a single
database is currently missing stats related to databases it is not
connected to.

So what about 2 functions? One to get the stats for the relations and
one to get the stats for the functions? (And maybe a view on top of each
of them?)
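As a rough illustration of how two such functions could share one scan loop, here is a self-contained sketch that filters entries by kind and by an optional database id. Everything in it (`DemoKind`, `DemoEntry`, `DemoDbFilter`, `demo_scan` and the two wrappers) is hypothetical and works on a plain array, not on the shared stats hash table.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { DEMO_KIND_RELATION, DEMO_KIND_FUNCTION, DEMO_KIND_OTHER } DemoKind;

/* Hypothetical flattened entry; dboid 0 plays the role of InvalidOid. */
typedef struct {
    DemoKind kind;
    uint32_t dboid;
    uint32_t objoid;
    bool     dropped;
} DemoEntry;

/* Optional database filter, kept explicit instead of a magic sentinel. */
typedef struct { bool enabled; uint32_t dboid; } DemoDbFilter;

/* Shared loop: copy the object ids of all matching, non-dropped entries. */
static size_t demo_scan(const DemoEntry *entries, size_t n, DemoKind kind,
                        DemoDbFilter filter, uint32_t *out, size_t max_out)
{
    size_t found = 0;

    for (size_t i = 0; i < n && found < max_out; i++)
    {
        const DemoEntry *e = &entries[i];

        if (e->kind != kind)
            continue;           /* not the requested kind */
        if (e->dropped)
            continue;           /* ignore dropped entries */
        if (filter.enabled && e->dboid != filter.dboid)
            continue;           /* other database */
        out[found++] = e->objoid;
    }
    return found;
}

/* One thin wrapper per SQL-callable function. */
static size_t demo_all_relations(const DemoEntry *entries, size_t n,
                                 DemoDbFilter filter, uint32_t *out, size_t max_out)
{
    return demo_scan(entries, n, DEMO_KIND_RELATION, filter, out, max_out);
}

static size_t demo_all_functions(const DemoEntry *entries, size_t n,
                                 DemoDbFilter filter, uint32_t *out, size_t max_out)
{
    return demo_scan(entries, n, DEMO_KIND_FUNCTION, filter, out, max_out);
}
```

Keeping the filter as a small struct avoids casting an Oid to a signed integer just to reserve -1 as "no filter".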

Regards,

-- 
Bertrand Drouvot
Amazon Web Services: https://aws.amazon.com




Re: shared-memory based stats collector - v70

From
"Drouvot, Bertrand"
Date:
Hi,

On 8/18/22 1:30 AM, Andres Freund wrote:
> Hi,
>
> On 2022-08-17 15:46:42 -0400, Greg Stark wrote:
>> Isn't there also a local hash table used to find the entries to reduce
>> traffic on the shared hash table? Even if you don't take a snapshot
>> does it get entered there? There are definitely still parts of this
>> I'm working on a pretty vague understanding of :/
> Yes, there is. But it's more about code that generates stats, rather than
> reporting functions. While there's backend local pending stats we need to have
> a refcount on the shared stats item so that the stats item can't be dropped
> and then revived when those local stats are flushed.

What do you think about something along those lines for the reporting 
part only?

Datum
pgstat_fetch_all_tables_stats(PG_FUNCTION_ARGS)
{
     int         dbid = PG_ARGISNULL(0) ? -1 : (int) PG_GETARG_OID(0);
     ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
     dshash_seq_status hstat;
     PgStatShared_HashEntry *p;
     PgStatShared_Common *stats_data;

     /* Only members of pg_read_all_stats can use this function */
     if (!has_privs_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS))
     {
         aclcheck_error(ACLCHECK_NO_PRIV, OBJECT_FUNCTION,
                        "pgstat_fetch_all_tables_stats");
     }

     pgstat_assert_is_up();

     SetSingleFuncCall(fcinfo, 0);

     dshash_seq_init(&hstat, pgStatLocal.shared_hash, false);
     while ((p = dshash_seq_next(&hstat)) != NULL)
     {
         Datum           values[PG_STAT_GET_ALL_TABLES_STATS_COLS];
         bool            nulls[PG_STAT_GET_ALL_TABLES_STATS_COLS];
         PgStat_StatTabEntry * tabentry = NULL;

         MemSet(values, 0, sizeof(values));
         MemSet(nulls, false, sizeof(nulls));

         /* If looking for specific dbid, ignore all the others */
         if (dbid != -1 && p->key.dboid != (Oid) dbid)
             continue;

         /* If the entry is not of kind relation then ignore it */
         if (p->key.kind != PGSTAT_KIND_RELATION)
             continue;

         /* If the entry has been dropped then ignore it */
         if (p->dropped)
             continue;

         stats_data = dsa_get_address(pgStatLocal.dsa, p->body);
         LWLockAcquire(&stats_data->lock, LW_SHARED);
         tabentry = pgstat_get_entry_data(p->key.kind, stats_data);

         values[0] = ObjectIdGetDatum(p->key.dboid);
         values[1] = ObjectIdGetDatum(p->key.objoid);
         values[2] = Int64GetDatum(tabentry->tuples_inserted);
         values[3] = Int64GetDatum(tabentry->tuples_updated);
         values[4] = Int64GetDatum(tabentry->tuples_deleted);
         .
         .
         .
         tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
                              values, nulls);

         LWLockRelease(&stats_data->lock);
     }
     dshash_seq_term(&hstat);

     return (Datum) 0;
}

I also tried to make use of pgstat_get_entry_ref() but ran into a
failed assertion: pgstat_get_entry_ref -> dshash_find ->
ASSERT_NO_PARTITION_LOCKS_HELD_BY_ME(hash_table), due to the lock acquired by
dshash_seq_next().

Regards,

-- 

Bertrand Drouvot
Amazon Web Services: https://aws.amazon.com




Re: shared-memory based stats collector - v70

From
Greg Stark
Date:
On Thu, 18 Aug 2022 at 02:27, Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
>
> What I had in mind is to provide an API to retrieve stats for those that
> would need to connect to each database individually otherwise.
>
> That's why I focused on PGSTAT_KIND_RELATION that has
> PgStat_KindInfo.accessed_across_databases set to false.
>
> I think that another candidate could also be PGSTAT_KIND_FUNCTION.

And indexes of course. It's a bit frustrating since without the
catalog you won't know what table the index actually is for... But
they're pretty important stats.


On that note though... What do you think about having the capability
to add other stats kinds to the stats infrastructure? It would make a
lot of sense for pg_stat_statements to add its entries here instead of
having to reimplement a lot of the same magic. And I have in mind an
extension that allows adding other stats and it would be nice to avoid
having to reimplement any of this.

To do that I guess more of the code needs to be moved to be table
driven from the kind structs either with callbacks or with other meta
data. So the kind record could contain tupledesc and the code to
construct the returned tuple so that these functions could return any
custom entry as well as the standard entries.

-- 
greg



Re: shared-memory based stats collector - v70

From
Andres Freund
Date:
Hi,

On 2022-08-18 15:26:31 -0400, Greg Stark wrote:
> And indexes of course. It's a bit frustrating since without the
> catalog you won't know what table the index actually is for... But
> they're pretty important stats.

FWIW, I think we should split relation stats into table and index
stats. Historically it'd have added a lot of complexity to separate the two,
but I don't think that's the case anymore. And we waste space for index stats
by having lots of table specific fields.


> On that note though... What do you think about having the capability
> to add other stats kinds to the stats infrastructure?

Getting closer to that was one of my goals working on the shared memory stats
stuff.


> It would make a lot of sense for pg_stat_statements to add its entries here
> instead of having to reimplement a lot of the same magic.

Yes, we should move pg_stat_statements over.

It's pretty easy to get massive contention on stats entries with
pg_stat_statements, because it doesn't have support for "batching" updates to
shared stats. And reimplementing the same logic in pg_stat_statements.c
doesn't make sense.

And the set of normalized queries could probably be stored in DSA as well - the
file based thing we have right now is problematic.


> To do that I guess more of the code needs to be moved to be table
> driven from the kind structs either with callbacks or with other meta
> data.

Pretty much all of it already is. The only substantial missing bit is
reading/writing of stats files, but that should be pretty easy. And of course
making the callback array extensible.


> So the kind record could contain tupledesc and the code to construct the
> returned tuple so that these functions could return any custom entry as well
> as the standard entries.

I don't see how this would work well - we don't have functions returning
variable kinds of tuples. And what would convert a struct to a tuple?

Nor do I think it's needed - if you have an extension providing a new stats
kind it can also provide accessors.
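A minimal sketch of that division of labour, under stated assumptions: the core keeps a table of per-kind metadata with only generic callbacks, while the extension that defines a kind ships its own typed accessor. `DemoKindInfo`, `demo_register_kind`, `demo_reset_entry` and the `DemoQueryStats` kind are all hypothetical and are not the pgstat interface.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_KINDS 8

/* Generic per-kind metadata the core needs; no tuple building here. */
typedef struct {
    const char *name;
    size_t      data_size;
    void      (*reset)(void *data);    /* generic callback, kind-agnostic caller */
} DemoKindInfo;

static DemoKindInfo demo_kinds[DEMO_MAX_KINDS];
static int          demo_nkinds = 0;

/* Core side: register a kind; returns its id, or -1 if the table is full. */
static int demo_register_kind(const DemoKindInfo *info)
{
    if (demo_nkinds >= DEMO_MAX_KINDS)
        return -1;
    demo_kinds[demo_nkinds] = *info;
    return demo_nkinds++;
}

/* Core side: reset any entry without knowing its layout. */
static int demo_reset_entry(int kind, void *data)
{
    if (kind < 0 || kind >= demo_nkinds)
        return -1;
    demo_kinds[kind].reset(data);
    return 0;
}

/* Extension side: the stats struct for a custom kind. */
typedef struct { long calls; long rows; } DemoQueryStats;

/* Extension side: generic callback handed to the core. */
static void demo_query_reset(void *data)
{
    memset(data, 0, sizeof(DemoQueryStats));
}

/* Extension side: typed accessor, owned by the extension, not the core. */
static long demo_query_get_calls(const DemoQueryStats *stats)
{
    return stats->calls;
}
```

The core never has to convert an arbitrary struct into a tuple; only the code that knows the struct layout reads its fields.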

Greetings,

Andres Freund



Re: shared-memory based stats collector - v70

From
"Drouvot, Bertrand"
Date:
Hi,

On 8/18/22 9:51 PM, Andres Freund wrote:
> Hi,
>
> On 2022-08-18 15:26:31 -0400, Greg Stark wrote:
>> And indexes of course. It's a bit frustrating since without the
>> catalog you won't know what table the index actually is for... But
>> they're pretty important stats.
> FWIW, I think we should split relation stats into table and index
> stats. Historically it'd have added a lot of complexity to separate the two,
> but I don't think that's the case anymore. And we waste space for index stats
> by having lots of table specific fields.

It seems to me that we should work on that first then, what do you 
think? (If so I can try to have a look at it).

And once done then resume the work to provide the APIs to get all 
tables/indexes from all the databases.

That way we'll be able to provide one API for the tables and one for the 
indexes (instead of one API for both like my current POC is doing).

>> On that note though... What do you think about having the capability
>> to add other stats kinds to the stats infrastructure?

I think that's a good idea and that would be great to have.

> Getting closer to that was one of my goals working on the shared memory stats
> stuff.
>
>
>> It would make a lot of sense for pg_stat_statements to add its entries here
>> instead of having to reimplement a lot of the same magic.
> Yes, we should move pg_stat_statements over.
>
> It's pretty easy to get massive contention on stats entries with
> pg_stat_statements, because it doesn't have support for "batching" updates to
> shared stats. And reimplementing the same logic in pg_stat_statements.c
> doesn't make sense.
>
> And the set of normalized queries could probably be stored in DSA as well - the
> file based thing we have right now is problematic.
>
>
>> To do that I guess more of the code needs to be moved to be table
>> driven from the kind structs either with callbacks or with other meta
>> data.
> Pretty much all of it already is. The only substantial missing bit is
> reading/writing of stats files, but that should be pretty easy. And of course
> making the callback array extensible.
>
>
>> So the kind record could contain tupledesc and the code to construct the
>> returned tuple so that these functions could return any custom entry as well
>> as the standard entries.
> I don't see how this would work well - we don't have functions returning
> variable kinds of tuples. And what would convert a struct to a tuple?
>
> Nor do I think it's needed - if you have an extension providing a new stats
> kind it can also provide accessors.

I think the same (the extension should be able to do that).

I really like the idea of being able to provide new stats kinds.

Regards,

-- 
Bertrand Drouvot
Amazon Web Services: https://aws.amazon.com




Re: shared-memory based stats collector - v70

From
"Anton A. Melnikov"
Date:
Hi!

Found a place in the code of this patch that is unclear to me:

https://github.com/postgres/postgres/blob/1acf10549e64c6a52ced570d712fcba1a2f5d1ec/src/backend/utils/activity/pgstat.c#L1658

Owing to the Assert(), the next if() should never be executed, but the comment above says the opposite.
Is this assert really needed here? And if so, for what?

Would be glad for clarification.


With the best regards,

-- 
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: shared-memory based stats collector - v70

From
Andres Freund
Date:
Hi,

On 2024-12-03 13:37:48 +0300, Anton A. Melnikov wrote:
> Found a place in the code of this patch that is unclear to me:
>
> https://github.com/postgres/postgres/blob/1acf10549e64c6a52ced570d712fcba1a2f5d1ec/src/backend/utils/activity/pgstat.c#L1658
> 
> Owing to the Assert(), the next if() should never be executed, but the comment above says the opposite.
> Is this assert really needed here? And if so, for what?

It's code that should be unreachable. But in case it is encountered in a
production scenario, it's not worth taking down the server for it.
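The pattern being described, an assertion followed by a defensive check of the same condition, can be sketched as below. `DemoAssert`, `DEMO_USE_ASSERTS`, `DemoEntry` and `demo_write_all` are hypothetical names chosen for this illustration; the macro mimics an assertion that is compiled out of production builds and is not PostgreSQL's `Assert()`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Assertion that only exists in development builds (assumed build flag). */
#ifdef DEMO_USE_ASSERTS
#define DemoAssert(cond) do { if (!(cond)) abort(); } while (0)
#else
#define DemoAssert(cond) ((void) 0)
#endif

typedef struct { bool dropped; long value; } DemoEntry;

/*
 * Sum the values of all entries. Dropped entries are not expected here:
 * a development build aborts on one, a production build just skips it.
 */
static long demo_write_all(const DemoEntry *entries, size_t n, size_t *skipped)
{
    long total = 0;

    *skipped = 0;
    for (size_t i = 0; i < n; i++)
    {
        DemoAssert(!entries[i].dropped);    /* should be unreachable */
        if (entries[i].dropped)
        {
            (*skipped)++;                   /* tolerate it in production */
            continue;
        }
        total += entries[i].value;
    }
    return total;
}
```

Built without `DEMO_USE_ASSERTS`, an unexpected dropped entry is counted and skipped; built with it, the same input stops the process right at the point where the invariant broke.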

Greetings,

Andres Freund



Re: shared-memory based stats collector - v70

From
"Anton A. Melnikov"
Date:
On 03.12.2024 18:07, Andres Freund wrote:
> Hi,
> 
> On 2024-12-03 13:37:48 +0300, Anton A. Melnikov wrote:
>> Found a place in the code of this patch that is unclear to me:
>>
>> https://github.com/postgres/postgres/blob/1acf10549e64c6a52ced570d712fcba1a2f5d1ec/src/backend/utils/activity/pgstat.c#L1658
>>
>> Owing to the Assert(), the next if() should never be executed, but the comment above says the opposite.
>> Is this assert really needed here? And if so, for what?
> 
> It's code that should be unreachable. But in case it is encountered in a
> production scenario, it's not worth taking down the server for it.

Thanks! It's clear.
Although there is a test case that leads to this assertion being triggered.
But I doubt that anything needs to be fixed.
I described this case in what seems to me a suitable thread:
https://www.postgresql.org/message-id/56bf8ff9-dd8c-47b2-872a-748ede82af99%40postgrespro.ru

I would appreciate it if you took a look at it.


With the best wishes,

-- 
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: shared-memory based stats collector - v70

From
Bertrand Drouvot
Date:
Hi,

On Wed, Dec 04, 2024 at 04:00:53AM +0300, Anton A. Melnikov wrote:
> 
> On 03.12.2024 18:07, Andres Freund wrote:
> > Hi,
> > 
> > On 2024-12-03 13:37:48 +0300, Anton A. Melnikov wrote:
> > > Found a place in the code of this patch that is unclear to me:
> > >
> > > https://github.com/postgres/postgres/blob/1acf10549e64c6a52ced570d712fcba1a2f5d1ec/src/backend/utils/activity/pgstat.c#L1658
> > > 
> > > Owing to the Assert(), the next if() should never be executed, but the comment above says the opposite.
> > > Is this assert really needed here? And if so, for what?
> > 
> > It's code that should be unreachable. But in case it is encountered in a
> > production scenario, it's not worth taking down the server for it.
> 
> Thanks! It's clear.
> Although there is a test case that leads to this assertion being triggered.
> But I doubt that anything needs to be fixed.
> I described this case in what seems to me a suitable thread:
> https://www.postgresql.org/message-id/56bf8ff9-dd8c-47b2-872a-748ede82af99%40postgrespro.ru

Thanks! I've the feeling that something has to be fixed, see my comments in
[1]. It might be that the failed assertion does not handle a "valid" scenario.

[1]: https://www.postgresql.org/message-id/Z1BzI/eMTCOKA%2Bj6%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: shared-memory based stats collector - v70

From
Bertrand Drouvot
Date:
Hi,

On Mon, Dec 09, 2024 at 02:39:58PM +0900, Michael Paquier wrote:
> On Sat, Dec 07, 2024 at 12:31:46PM +0300, Anton A. Melnikov wrote:
> > Completely agree that the original comment needs to be revised,
> > since it implies that it is normal for deleted entries to be here,
> > but it is not the case.
> 
> Yep, so applied v2-0001 to document that, and backpatched it as it is
> kind of important to know about.
> 
> > Maybe it's worth adding a warning as well,
> > similar to the one a few lines below in the code?
> 
>          Assert(!ps->dropped);
>          if (ps->dropped)
> +        {
> +            PgStat_HashKey key = ps->key;
> +            elog(WARNING, "found non-deleted stats entry %u/%u/%llu"
> +                          "at server shutdown",

There is a missing space. I think that should be " at server..." or "...%llu ".

> +                           key.kind, key.dboid,
> +                           (unsigned long long) key.objid);
>              continue;
> +        }
>  
>          /*
>           * This discards data related to custom stats kinds that are unknown
> 
> Not sure how to feel about this suggestion, though.  This would
> produce a warning when building without assertions, but the assertion
> would likely give us more information with a trace during development,
> so..

Right. OTOH I think that could help the tap test added in da99fedf8c to not
rely on an assert-enabled build (the tap test could "simply" check for the
WARNING in the logfile instead).
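On the missing space noted above: C joins adjacent string literals with nothing in between, so a format string split across two lines needs an explicit space on one side of the break. A small sketch, with hypothetical helper names and a shortened message:

```c
#include <string.h>

/* Split as in the quoted patch: the two literals are glued together. */
static const char *demo_msg_without_space(void)
{
    return "stats entry %u/%u/%llu"
           "at server shutdown";
}

/* Same split with a leading space on the second literal. */
static const char *demo_msg_with_space(void)
{
    return "stats entry %u/%u/%llu"
           " at server shutdown";
}
```

The first helper yields a format string containing `%lluat`, so the logged message would run the object id straight into the word "at"; the second yields `%llu at`.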

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: shared-memory based stats collector - v70

From
Bertrand Drouvot
Date:
Hi,

On Tue, Dec 10, 2024 at 09:54:36AM +0900, Michael Paquier wrote:
> On Mon, Dec 09, 2024 at 08:03:54AM +0000, Bertrand Drouvot wrote:
> > Right. OTOH I think that could help the tap test added in da99fedf8c to not
> > rely on assert enabled build (the tap test could "simply" check for the
> > WARNING in the logfile instead).
> 
> That's true.  Still, the coverage that we have is also enough for
> assert builds, which is what the test is going to run with most of the
> time anyway.

Yeah, that's fine by me, and I don't see the added value of the WARNING then.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com